csci5612 methods

Association Rule Mining

Overview

Association Rule Mining (ARM) is a technique that finds association rules between events in transactional data. A common example treats grocery baskets as transactions and looks for associations between food items, implying that if one item is in a basket, another item is likely to be in the basket too. The technique uses sets and probability to find associations of interest, scored with metrics such as support, confidence, and lift.

For itemsets A, B ⊂ Basket and the rule A → B, the support is the probability of finding all the events of the rule in a single transaction: support = ℙ (A ∪ B). Here A ∪ B denotes the combined itemset, so ℙ (A ∪ B) is the probability that a transaction contains every item of A and every item of B, not the probability that it contains either one. Confidence is the conditional probability of B given A: confidence = ℙ (B | A) = ℙ (A ∪ B) / ℙ (A). Finally, lift measures correlation: lift = ℙ (A ∪ B) / (ℙ (A) * ℙ (B)). A lift of 1.0 means the events are independent, a lift above 1 indicates positive correlation and thus a potentially good association rule, and a lift below 1 means the items occur together less often than independence would predict.

The Apriori algorithm first measures the support of candidate itemsets by counting their occurrences. It relies on the fact that a superset can never be more frequent than any of its subsets, so an itemset only needs to be counted if all of its subsets already meet the minimum support. Once a table of itemsets and supports has been built, the association rules are found by calculating the various metrics.
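As a minimal sketch of the three metrics, the following computes support, confidence, and lift by direct counting. The baskets and the helper names support and rule_metrics are made up for illustration and are not part of the project code.

```python
# Toy transactions (hypothetical data, not taken from the survey).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(lhs, rhs, transactions):
    """Support, confidence, and lift of the rule lhs -> rhs."""
    s_both = support(set(lhs) | set(rhs), transactions)  # P(A ∪ B) in itemset notation
    s_lhs = support(lhs, transactions)                   # P(A)
    s_rhs = support(rhs, transactions)                   # P(B)
    return {
        "support": s_both,
        "confidence": s_both / s_lhs,
        "lift": s_both / (s_lhs * s_rhs),
    }

metrics = rule_metrics({"bread"}, {"butter"}, baskets)
print(metrics)
```

Bread appears in 4 of the 5 baskets, butter in 3, and both together in 3, so the rule bread → butter has support 0.6, confidence of about 0.75, and lift of about 1.25.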

Data

The HaveWorkedWith column in the original dataset contains transactions where each event is a language or skill, with the skills separated by semicolons. This column can be extracted and processed separately to create association rules.
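For comparison, the same kind of boolean skill table can be sketched with the pandas string accessor. The rows below are hypothetical, and the column ordering only approximates the count-based ordering used in the project script.

```python
import pandas as pd

# Hypothetical rows in the same semicolon-separated format as HaveWorkedWith.
toy = pd.DataFrame(
    {"HaveWorkedWith": ["Python;SQL", "JavaScript;HTML/CSS;SQL", None, "Python"]}
)

# Drop missing answers, then split on ";" into one 0/1 column per skill.
onehot = toy["HaveWorkedWith"].dropna().str.get_dummies(sep=";").astype(bool)

# Order the columns from most to least common skill.
onehot = onehot[onehot.sum().sort_values(ascending=False).index]
print(onehot)
```

Rows with a missing answer are dropped here, which mirrors how the script skips any value that is not a string.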

code_transactions.py


import pandas as pd

def get_skills(df):
    """Turn the semicolon-separated HaveWorkedWith column into a boolean table."""
    skills_df = df["HaveWorkedWith"]

    # Count how many transactions mention each skill.
    # Missing answers are NaN (not str) and are skipped.
    skills_count = dict()
    for transaction in skills_df:
        if isinstance(transaction, str):
            for sk in transaction.split(";"):
                if sk in skills_count:
                    skills_count[sk] += 1
                else:
                    # The first occurrence counts as 1, not 0.
                    skills_count[sk] = 1

    # Sort the skills from most to least common.
    skills = list(skills_count.keys())
    counts = [skills_count[sk] for sk in skills]
    combined_sorted = list(sorted(zip(counts, skills)))[::-1]
    counts, skills = zip(*combined_sorted)
    print("skills (%d):" % len(skills))
    print(skills)

    # Build one boolean column per skill and one row per valid transaction.
    data = dict()
    for sk in skills:
        data[sk] = list()
    for transaction in skills_df:
        if isinstance(transaction, str):
            this_guys_skills = transaction.split(";")
            for sk in skills:
                data[sk].append(sk in this_guys_skills)
    return pd.DataFrame(data)

if __name__ == "__main__":
    datafile = "../../dataprep/stackoverflow.csv"
    df = pd.read_csv(datafile, index_col=0)
    skills_df = get_skills(df)
    print(skills_df)
    skills_df.to_csv("stackoverflow_skills.csv")

code_transactions.py output


skills (116):
('JavaScript', 'Docker', 'HTML/CSS', 'SQL', 'Git', 'AWS', 'Python', 'PostgreSQL', 'MySQL', 'TypeScript', 'Node.js', 'React.js', 'Java', 'Bash/Shell', 'C#', 'Microsoft SQL Server', 'SQLite', 'jQuery', 'Microsoft Azure', 'MongoDB', 'npm', 'Redis', 'PHP', 'Yarn', 'Kubernetes', 'Angular', 'Express', 'C++', 'ASP.NET Core ', 'Vue.js', 'MariaDB', 'Elasticsearch', 'C', 'ASP.NET', 'Heroku', 'Firebase', 'DigitalOcean', 'PowerShell', 'Google Cloud Platform', 'Oracle', 'Go', 'Flask', 'Homebrew', 'Django', 'Terraform', 'Angular.js', 'Ansible', 'Kotlin', 'Google Cloud', 'DynamoDB', 'Laravel', 'Ruby', 'Spring', 'Rust', 'Ruby on Rails', 'Unity 3D', 'Dart', 'Swift', 'Next.js', 'VBA', 'R', 'Groovy', 'FastAPI', 'Symfony', 'Gatsby', 'Assembly', 'Scala', 'Firebase Realtime Database', 'Objective-C', 'Delphi', 'Cloud Firestore', 'Svelte', 'Perl', 'VMware', 'Cassandra', 'Elixir', 'Xamarin', 'Drupal', 'Unreal Engine', 'IBM DB2', 'Clojure', 'Managed Hosting', 'Matlab', 'Blazor', 'Puppet', 'Haskell', 'Nuxt.js', 'Chef', 'Lua', 'Couchbase', 'OVH', 'MATLAB', 'Linode', 'IBM Cloud or Watson', 'Oracle Cloud Infrastructure', 'F#', 'Deno', 'Julia', 'LISP', 'Flow', 'Phoenix', 'Erlang', 'Neo4j', 'Fastify', 'Pulumi', 'OpenStack', 'COBOL', 'Solidity', 'CouchDB', 'Crystal', 'Colocation', 'Fortran', 'APL', 'Play Framework', 'SAS', 'OCaml')
       JavaScript  Docker  HTML/CSS    SQL  ...    APL  Play Framework    SAS  OCaml
0           False   False     False  False  ...  False           False  False  False
1            True   False      True   True  ...  False           False  False  False
2           False   False     False  False  ...  False           False  False  False
3            True   False      True   True  ...  False           False  False  False
4           False   False     False  False  ...  False           False  False  False
...           ...     ...       ...    ...  ...    ...             ...    ...    ...
73394        True    True      True  False  ...  False           False  False  False
73395        True   False      True  False  ...  False           False  False  False
73396        True    True      True  False  ...  False           False  False  False
73397        True   False      True   True  ...  False           False  False  False
73398       False    True     False  False  ...  False           False  False  False

[73399 rows x 116 columns]

The prepared transaction data for all listed Stack Overflow skills can be found here: stackoverflow_skills.csv.

Code

code_arm.py


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 128)

def plot_lift_heatmap(rules, output_png="output.png"):
    # Work on a copy so the caller's (possibly filtered) DataFrame is not modified.
    rules = rules.copy()
    rules["lhs_count"] = rules["antecedents"].apply(len)
    # Readable labels for the frozenset itemsets; sorting keeps them deterministic.
    rules["lhs"] = rules["antecedents"].apply(lambda a: ",".join(sorted(a)))
    rules["rhs"] = rules["consequents"].apply(lambda a: ",".join(sorted(a)))
    # Only rules with more than one item on the left-hand side are plotted.
    pivot = rules[rules["lhs_count"]>1].pivot(index="lhs", columns="rhs", values="lift")

    plt.figure(figsize=(12,8))
    sns.heatmap(pivot, annot=True)
    plt.title("lift heatmap")
    plt.yticks(rotation=0)
    plt.xticks(rotation=90)
    plt.savefig(output_png)

if __name__ == "__main__":
    df = pd.read_csv("stackoverflow_skills.csv", index_col=0)

    frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)
    print(frequent_itemsets)

    rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
    good_rules = rules[rules["confidence"]>0.2]
    #print(good_rules)

    print("--------------------")
    print("top 15 support")
    print("--------------------")
    print(good_rules.sort_values("support", ascending=False).head(15))
    print()

    print("--------------------")
    print("top 15 confidence")
    print("--------------------")
    print(good_rules.sort_values("confidence", ascending=False).head(15))
    print()

    print("--------------------")
    print("top 15 lift")
    print("--------------------")
    print(good_rules.sort_values("lift", ascending=False).head(15))
    print()

    plot_lift_heatmap(good_rules, "lift_heatmap.png")

code_arm.py output


      support                                      itemsets
0    0.672312                       frozenset({JavaScript})
1    0.548018                           frozenset({Docker})
2    0.547787                         frozenset({HTML/CSS})
3    0.522637                              frozenset({SQL})
4    0.489257                              frozenset({Git})
..        ...                                           ...
97   0.202619     frozenset({jQuery, HTML/CSS, JavaScript})
98   0.200044             frozenset({SQL, Git, JavaScript})
99   0.221829           frozenset({MySQL, SQL, JavaScript})
100  0.212769  frozenset({Node.js, TypeScript, JavaScript})
101  0.203954            frozenset({SQL, HTML/CSS, Docker})

[102 rows x 2 columns]
--------------------
top 15 support
--------------------
                 antecedents                 consequents  antecedent support  consequent support   support  confidence  \
2      frozenset({HTML/CSS})     frozenset({JavaScript})            0.547787            0.672312  0.498931    0.910812
3    frozenset({JavaScript})       frozenset({HTML/CSS})            0.672312            0.547787  0.498931    0.742112
4           frozenset({SQL})     frozenset({JavaScript})            0.522637            0.672312  0.400223    0.765778
5    frozenset({JavaScript})            frozenset({SQL})            0.672312            0.522637  0.400223    0.595295
0    frozenset({JavaScript})         frozenset({Docker})            0.672312            0.548018  0.390618    0.581008
1        frozenset({Docker})     frozenset({JavaScript})            0.548018            0.672312  0.390618    0.712783
55     frozenset({HTML/CSS})            frozenset({SQL})            0.547787            0.522637  0.349310    0.637675
54          frozenset({SQL})       frozenset({HTML/CSS})            0.522637            0.547787  0.349310    0.668361
6           frozenset({Git})     frozenset({JavaScript})            0.489257            0.672312  0.339596    0.694105
7    frozenset({JavaScript})            frozenset({Git})            0.672312            0.489257  0.339596    0.505117
14   frozenset({TypeScript})     frozenset({JavaScript})            0.375114            0.672312  0.337784    0.900483
15   frozenset({JavaScript})     frozenset({TypeScript})            0.672312            0.375114  0.337784    0.502422
16      frozenset({Node.js})     frozenset({JavaScript})            0.354814            0.672312  0.330250    0.930768
17   frozenset({JavaScript})        frozenset({Node.js})            0.672312            0.354814  0.330250    0.491215
155  frozenset({JavaScript})  frozenset({SQL, HTML/CSS})            0.672312            0.349310  0.320645    0.476929

         lift  representativity  leverage  conviction  zhangs_metric   jaccard  certainty  kulczynski
2    1.354746               1.0  0.130647    3.674112       0.579051  0.691837   0.727825    0.826462
3    1.354746               1.0  0.130647    1.753526       0.799096  0.691837   0.429720    0.826462
4    1.139022               1.0  0.048849    1.399049       0.255683  0.503600   0.285229    0.680536
5    1.139022               1.0  0.048849    1.179533       0.372469  0.503600   0.152207    0.680536
0    1.060198               1.0  0.022179    1.078736       0.173274  0.470788   0.072989    0.646896
1    1.060198               1.0  0.022179    1.140910       0.125624  0.470788   0.123507    0.646896
55   1.220112               1.0  0.063017    1.317501       0.398933  0.484404   0.240987    0.653018
54   1.220112               1.0  0.063017    1.363571       0.377915  0.484404   0.266632    0.653018
6    1.032415               1.0  0.010663    1.071244       0.061474  0.413147   0.066506    0.599611
7    1.032415               1.0  0.010663    1.032047       0.095816  0.413147   0.031052    0.599611
14   1.339383               1.0  0.085590    3.292790       0.405495  0.475992   0.696306    0.701452
15   1.339383               1.0  0.085590    1.255854       0.773258  0.475992   0.203729    0.701452
16   1.384430               1.0  0.091704    4.733216       0.430389  0.473900   0.788727    0.710992
17   1.384430               1.0  0.091704    1.268092       0.847394  0.473900   0.211414    0.710992
155  1.365345               1.0  0.085800    1.243980       0.816582  0.457426   0.196128    0.697433

--------------------
top 15 confidence
--------------------
                           antecedents              consequents  antecedent support  consequent support   support  confidence  \
186     frozenset({Node.js, HTML/CSS})  frozenset({JavaScript})            0.266679            0.672312  0.258982    0.971135
200      frozenset({HTML/CSS, jQuery})  frozenset({JavaScript})            0.209281            0.672312  0.202619    0.968166
194    frozenset({HTML/CSS, React.js})  frozenset({JavaScript})            0.242292            0.672312  0.233736    0.964687
216   frozenset({Node.js, TypeScript})  frozenset({JavaScript})            0.224635            0.672312  0.212769    0.947174
180  frozenset({HTML/CSS, TypeScript})  frozenset({JavaScript})            0.275562            0.672312  0.260726    0.946158
174       frozenset({MySQL, HTML/CSS})  frozenset({JavaScript})            0.266570            0.672312  0.248655    0.932792
16                frozenset({Node.js})  frozenset({JavaScript})            0.354814            0.672312  0.330250    0.930768
29                 frozenset({jQuery})  frozenset({JavaScript})            0.256298            0.672312  0.238505    0.930576
139       frozenset({Node.js, Docker})  frozenset({JavaScript})            0.233000            0.672312  0.216706    0.930067
162         frozenset({HTML/CSS, AWS})  frozenset({JavaScript})            0.246298            0.672312  0.228395    0.927315
168  frozenset({HTML/CSS, PostgreSQL})  frozenset({JavaScript})            0.238110            0.672312  0.219295    0.920982
110      frozenset({HTML/CSS, Docker})  frozenset({JavaScript})            0.307674            0.672312  0.283110    0.920161
150         frozenset({SQL, HTML/CSS})  frozenset({JavaScript})            0.349310            0.672312  0.320645    0.917938
156         frozenset({HTML/CSS, Git})  frozenset({JavaScript})            0.278628            0.672312  0.255630    0.917461
19               frozenset({React.js})  frozenset({JavaScript})            0.336449            0.672312  0.306734    0.911683

         lift  representativity  leverage  conviction  zhangs_metric   jaccard  certainty  kulczynski
186  1.444472               1.0  0.079690   11.352518       0.419606  0.380850   0.911914    0.678173
200  1.440056               1.0  0.061917   10.293704       0.386461  0.298419   0.902853    0.634771
194  1.434881               1.0  0.070840    9.279634       0.399994  0.343292   0.892237    0.656174
216  1.408831               1.0  0.061744    6.203130       0.374265  0.310984   0.838791    0.631823
180  1.407321               1.0  0.075462    6.086157       0.399524  0.379431   0.835693    0.666982
174  1.387439               1.0  0.069436    4.875704       0.380742  0.360250   0.794901    0.651321
16   1.384430               1.0  0.091704    4.733216       0.430389  0.473900   0.788727    0.710992
29   1.384144               1.0  0.066193    4.720118       0.373176  0.345606   0.788141    0.642665
139  1.383386               1.0  0.060057    4.685725       0.361325  0.314702   0.786586    0.626198
162  1.379293               1.0  0.062807    4.508334       0.364854  0.330905   0.778189    0.633516
168  1.369874               1.0  0.059211    4.147002       0.354389  0.317300   0.758862    0.623581
110  1.368653               1.0  0.076257    4.104374       0.389058  0.406256   0.756358    0.670630
150  1.365345               1.0  0.085800    3.993157       0.411232  0.457426   0.749572    0.697433
156  1.364637               1.0  0.068305    3.970116       0.370411  0.367650   0.748118    0.648843
19   1.356042               1.0  0.080536    3.710346       0.395688  0.436927   0.730483    0.683961

--------------------
top 15 lift
--------------------
                             antecedents                          consequents  antecedent support  consequent support  \
219                 frozenset({Node.js})  frozenset({TypeScript, JavaScript})            0.354814            0.337784
218  frozenset({TypeScript, JavaScript})                 frozenset({Node.js})            0.337784            0.354814
106                 frozenset({Node.js})                frozenset({React.js})            0.354814            0.336449
107                frozenset({React.js})                 frozenset({Node.js})            0.336449            0.354814
220              frozenset({TypeScript})     frozenset({Node.js, JavaScript})            0.375114            0.330250
217     frozenset({Node.js, JavaScript})              frozenset({TypeScript})            0.330250            0.375114
102                 frozenset({Node.js})              frozenset({TypeScript})            0.354814            0.375114
103              frozenset({TypeScript})                 frozenset({Node.js})            0.375114            0.354814
105                frozenset({React.js})              frozenset({TypeScript})            0.336449            0.375114
104              frozenset({TypeScript})                frozenset({React.js})            0.375114            0.336449
198    frozenset({HTML/CSS, JavaScript})                  frozenset({jQuery})            0.498931            0.256298
203                  frozenset({jQuery})    frozenset({HTML/CSS, JavaScript})            0.256298            0.498931
141                 frozenset({Node.js})      frozenset({Docker, JavaScript})            0.354814            0.390618
140      frozenset({Docker, JavaScript})                 frozenset({Node.js})            0.390618            0.354814
199      frozenset({jQuery, JavaScript})                frozenset({HTML/CSS})            0.238505            0.547787

      support  confidence      lift  representativity  leverage  conviction  zhangs_metric   jaccard  certainty  kulczynski
219  0.212769    0.599662  1.775283               1.0  0.092918    1.654143       0.676874  0.443425   0.395457    0.614779
218  0.212769    0.629896  1.775283               1.0  0.092918    1.743253       0.659467  0.443425   0.426360    0.614779
106  0.206706    0.582575  1.731542               1.0  0.087329    1.589630       0.654819  0.426587   0.370923    0.598475
107  0.206706    0.614375  1.731542               1.0  0.087329    1.673093       0.636695  0.426587   0.402305    0.598475
220  0.212769    0.567210  1.717519               1.0  0.088887    1.547519       0.668546  0.431934   0.353804    0.605738
217  0.212769    0.644266  1.717519               1.0  0.088887    1.756608       0.623762  0.431934   0.430721    0.605738
102  0.224635    0.633107  1.687771               1.0  0.091539    1.703182       0.631605  0.444564   0.412864    0.615976
103  0.224635    0.598845  1.687771               1.0  0.091539    1.608321       0.652123  0.444564   0.378234    0.615976
105  0.207905    0.617939  1.647336               1.0  0.081698    1.635565       0.592206  0.412789   0.388591    0.586091
104  0.207905    0.554244  1.647336               1.0  0.081698    1.488598       0.628850  0.412789   0.328227    0.586091
198  0.202619    0.406106  1.584508               1.0  0.074744    1.252247       0.736204  0.366658   0.201435    0.598333
203  0.202619    0.790559  1.584508               1.0  0.074744    2.392416       0.496017  0.366658   0.582013    0.598333
141  0.216706    0.610759  1.563570               1.0  0.078109    1.565564       0.558657  0.409864   0.361253    0.582768
140  0.216706    0.554777  1.563570               1.0  0.078109    1.449128       0.591481  0.409864   0.309930    0.582768
199  0.202619    0.849537  1.550854               1.0  0.071969    3.005484       0.466443  0.347144   0.667275    0.609712

Results

JavaScript was the most highly supported skill, with 67% of all survey responses containing it. It therefore makes sense that all of the top 15 confidence rules had JavaScript as the right-hand side, the skill predicted when the other skills were present. Intuitively, one would hope that knowing Node.js, React.js, or jQuery means having JavaScript as a skill, so this fits the narrative. Lift normalizes by the support of the right-hand side and is a better metric for whether an association rule has value. The highest lift was 1.775, for the rule that knowing Node.js implies knowing both TypeScript and JavaScript. Because the formula for lift is symmetric, the arrow direction can be swapped and the lift between the two itemsets stays the same. In the lift heatmap, it is interesting that a rule with HTML/CSS+JavaScript, the foundational web technologies, on one side appeared for almost all of the top skills. This pair had little to no correlation with skills such as AWS, Docker, and Git, but did have a high correlation with SQL and different JS technologies.
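The symmetry of lift can be checked by hand from the supports printed in the code_arm.py output. The sketch below uses those rounded values, so the results match the table only to about three decimal places. The helper name lift is introduced here for illustration.

```python
# Supports copied from the code_arm.py output (rounded to six decimals there).
p_node = 0.354814    # {Node.js}
p_ts_js = 0.337784   # {TypeScript, JavaScript}
p_all = 0.212769     # {Node.js, TypeScript, JavaScript}

def lift(p_joint, p_lhs, p_rhs):
    """lift = P(A ∪ B) / (P(A) * P(B)), symmetric in A and B."""
    return p_joint / (p_lhs * p_rhs)

forward = lift(p_all, p_node, p_ts_js)    # Node.js -> {TypeScript, JavaScript}
backward = lift(p_all, p_ts_js, p_node)   # {TypeScript, JavaScript} -> Node.js

# Confidence divides by the left-hand side only, so it is not symmetric.
conf_forward = p_all / p_node
conf_backward = p_all / p_ts_js

print("lift:", forward, backward)
print("confidence:", conf_forward, conf_backward)
```

Both directions give a lift of roughly 1.775, while the confidences differ (about 0.600 versus 0.630), which is why swapped rules appear in pairs in the top-lift table.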