Association Rule Mining (ARM) is a technique that finds association rules between events in transactional data. A common example treats baskets of food as transactions and looks for associations between food items, implying that if one item is in a basket, another item is likely to be in it as well. The technique uses sets and probability to find associations of interest, scored with metrics such as support, confidence, and lift. For a given rule, support is the probability of finding all the events of that rule in a single transaction. For A, B ⊂ Basket and the rule A → B, support = ℙ(A ∪ B), where A ∪ B is the combined itemset, so this is the probability that a transaction contains every item in both A and B. Confidence is the conditional probability of B given A: confidence = ℙ(B | A) = ℙ(A ∪ B) / ℙ(A). Finally, lift measures correlation: lift = ℙ(A ∪ B) / (ℙ(A) * ℙ(B)). A lift of 1.0 means the events are independent, and a lift higher than 1 indicates positive correlation and thus a potentially good association rule. The apriori algorithm first measures the support of candidate itemsets by counting occurrences, relying on the fact that a superset can never be more frequent than any of its subsets, so any extension of an infrequent itemset can be skipped. Once a table of itemsets and supports has been built, association rules are found by calculating the various metrics.
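As a small illustration of these definitions, the sketch below computes support, confidence, and lift by hand on a made-up set of baskets. The basket contents and the helper names `support` and `rule_metrics` are hypothetical and chosen only for this example; they are not part of the project code or of any library.

```python
from itertools import combinations

# Hypothetical toy baskets; not taken from the survey data.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
]


def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rule_metrics(antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    joint = support(set(antecedent) | set(consequent))
    confidence = joint / support(antecedent)
    lift = confidence / support(consequent)
    return joint, confidence, lift


sup, conf, lift = rule_metrics({"bread"}, {"butter"})
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")

# Property that apriori relies on: a pair is never more frequent
# than either of the single items it contains.
items = sorted(set().union(*baskets))
for pair in combinations(items, 2):
    assert support(pair) <= min(support({item}) for item in pair)
```

Here bread appears in 4 of 5 baskets, butter in 3 of 5, and both together in 3 of 5, so the rule bread → butter has a lift above 1, meaning the two items occur together more often than independence would predict.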
The HaveWorkedWith column in the original dataset contains transactions where each event is a language or skill, separated by semicolons. This column can be extracted and processed separately to create association rules.
code_transactions.py
import pandas as pd


def get_skills(df):
    skills_df = df["HaveWorkedWith"]

    # Count how many responses list each skill.
    skills_count = dict()
    for transaction in skills_df:
        if type(transaction) == str:
            for sk in transaction.split(";"):
                if sk in skills_count:
                    skills_count[sk] += 1
                else:
                    skills_count[sk] = 1  # the first occurrence counts as 1

    # Order the skills from most to least common.
    skills = list(skills_count.keys())
    counts = [skills_count[sk] for sk in skills]
    combined_sorted = list(sorted(zip(counts, skills)))[::-1]
    counts, skills = zip(*combined_sorted)
    print("skills (%d):" % len(skills))
    print(skills)

    # Build one boolean column per skill and one row per non-empty response.
    data = dict()
    for sk in skills:
        data[sk] = list()
    for transaction in skills_df:
        if type(transaction) == str:
            this_guys_skills = transaction.split(";")
            for sk in skills:
                if sk in this_guys_skills:
                    data[sk].append(True)
                else:
                    data[sk].append(False)
    return pd.DataFrame(data)


if __name__ == "__main__":
    datafile = "../../dataprep/stackoverflow.csv"
    df = pd.read_csv(datafile, index_col=0)
    skills_df = get_skills(df)
    print(skills_df)
    skills_df.to_csv("stackoverflow_skills.csv")
code_transactions.py output
skills (116):
('JavaScript', 'Docker', 'HTML/CSS', 'SQL', 'Git', 'AWS', 'Python', 'PostgreSQL', 'MySQL', 'TypeScript', 'Node.js', 'React.js', 'Java', 'Bash/Shell', 'C#', 'Microsoft SQL Server', 'SQLite', 'jQuery', 'Microsoft Azure', 'MongoDB', 'npm', 'Redis', 'PHP', 'Yarn', 'Kubernetes', 'Angular', 'Express', 'C++', 'ASP.NET Core ', 'Vue.js', 'MariaDB', 'Elasticsearch', 'C', 'ASP.NET', 'Heroku', 'Firebase', 'DigitalOcean', 'PowerShell', 'Google Cloud Platform', 'Oracle', 'Go', 'Flask', 'Homebrew', 'Django', 'Terraform', 'Angular.js', 'Ansible', 'Kotlin', 'Google Cloud', 'DynamoDB', 'Laravel', 'Ruby', 'Spring', 'Rust', 'Ruby on Rails', 'Unity 3D', 'Dart', 'Swift', 'Next.js', 'VBA', 'R', 'Groovy', 'FastAPI', 'Symfony', 'Gatsby', 'Assembly', 'Scala', 'Firebase Realtime Database', 'Objective-C', 'Delphi', 'Cloud Firestore', 'Svelte', 'Perl', 'VMware', 'Cassandra', 'Elixir', 'Xamarin', 'Drupal', 'Unreal Engine', 'IBM DB2', 'Clojure', 'Managed Hosting', 'Matlab', 'Blazor', 'Puppet', 'Haskell', 'Nuxt.js', 'Chef', 'Lua', 'Couchbase', 'OVH', 'MATLAB', 'Linode', 'IBM Cloud or Watson', 'Oracle Cloud Infrastructure', 'F#', 'Deno', 'Julia', 'LISP', 'Flow', 'Phoenix', 'Erlang', 'Neo4j', 'Fastify', 'Pulumi', 'OpenStack', 'COBOL', 'Solidity', 'CouchDB', 'Crystal', 'Colocation', 'Fortran', 'APL', 'Play Framework', 'SAS', 'OCaml')
       JavaScript  Docker  HTML/CSS  SQL    ...  APL    Play Framework  SAS    OCaml
0      False       False   False     False  ...  False  False           False  False
1      True        False   True      True   ...  False  False           False  False
2      False       False   False     False  ...  False  False           False  False
3      True        False   True      True   ...  False  False           False  False
4      False       False   False     False  ...  False  False           False  False
...    ...         ...     ...       ...    ...  ...    ...             ...    ...
73394  True        True    True      False  ...  False  False           False  False
73395  True        False   True      False  ...  False  False           False  False
73396  True        True    True      False  ...  False  False           False  False
73397  True        False   True      True   ...  False  False           False  False
73398  False       True    False     False  ...  False  False           False  False

[73399 rows x 116 columns]
The prepared transaction data for all listed Stack Overflow skills can be found here: stackoverflow_skills.csv.
code_arm.py
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Show full itemsets and all metric columns when printing.
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 128)


def plot_lift_heatmap(rules, output_png="output.png"):
    # Work on a copy so the caller's DataFrame is not modified.
    rules = rules.copy()
    rules["lhs_count"] = rules["antecedents"].apply(len)
    rules["lhs"] = rules["antecedents"].apply(lambda a: ",".join(a))
    rules["rhs"] = rules["consequents"].apply(lambda a: ",".join(a))
    # Only plot rules with more than one item on the left-hand side.
    pivot = rules[rules["lhs_count"] > 1].pivot(index="lhs", columns="rhs", values="lift")
    plt.figure(figsize=(12, 8))
    sns.heatmap(pivot, annot=True)
    plt.title("lift heatmap")
    plt.yticks(rotation=0)
    plt.xticks(rotation=90)
    plt.savefig(output_png)


if __name__ == "__main__":
    df = pd.read_csv("stackoverflow_skills.csv", index_col=0)

    # Itemsets that appear in at least 20% of the transactions.
    frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)
    print(frequent_itemsets)

    # Keep rules with lift >= 1, then filter on confidence.
    rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
    good_rules = rules[rules["confidence"] > 0.2]
    #print(good_rules)

    print("--------------------")
    print("top 15 support")
    print("--------------------")
    print(good_rules.sort_values("support", ascending=False).head(15))
    print()

    print("--------------------")
    print("top 15 confidence")
    print("--------------------")
    print(good_rules.sort_values("confidence", ascending=False).head(15))
    print()

    print("--------------------")
    print("top 15 lift")
    print("--------------------")
    print(good_rules.sort_values("lift", ascending=False).head(15))
    print()

    plot_lift_heatmap(good_rules, "lift_heatmap.png")
code_arm.py output
     support   itemsets
0    0.672312  frozenset({JavaScript})
1    0.548018  frozenset({Docker})
2    0.547787  frozenset({HTML/CSS})
3    0.522637  frozenset({SQL})
4    0.489257  frozenset({Git})
..   ...       ...
97   0.202619  frozenset({jQuery, HTML/CSS, JavaScript})
98   0.200044  frozenset({SQL, Git, JavaScript})
99   0.221829  frozenset({MySQL, SQL, JavaScript})
100  0.212769  frozenset({Node.js, TypeScript, JavaScript})
101  0.203954  frozenset({SQL, HTML/CSS, Docker})

[102 rows x 2 columns]
--------------------
top 15 support
--------------------
     antecedents              consequents                 antecedent support  consequent support  support   confidence  \
2    frozenset({HTML/CSS})    frozenset({JavaScript})     0.547787            0.672312            0.498931  0.910812
3    frozenset({JavaScript})  frozenset({HTML/CSS})       0.672312            0.547787            0.498931  0.742112
4    frozenset({SQL})         frozenset({JavaScript})     0.522637            0.672312            0.400223  0.765778
5    frozenset({JavaScript})  frozenset({SQL})            0.672312            0.522637            0.400223  0.595295
0    frozenset({JavaScript})  frozenset({Docker})         0.672312            0.548018            0.390618  0.581008
1    frozenset({Docker})      frozenset({JavaScript})     0.548018            0.672312            0.390618  0.712783
55   frozenset({HTML/CSS})    frozenset({SQL})            0.547787            0.522637            0.349310  0.637675
54   frozenset({SQL})         frozenset({HTML/CSS})       0.522637            0.547787            0.349310  0.668361
6    frozenset({Git})         frozenset({JavaScript})     0.489257            0.672312            0.339596  0.694105
7    frozenset({JavaScript})  frozenset({Git})            0.672312            0.489257            0.339596  0.505117
14   frozenset({TypeScript})  frozenset({JavaScript})     0.375114            0.672312            0.337784  0.900483
15   frozenset({JavaScript})  frozenset({TypeScript})     0.672312            0.375114            0.337784  0.502422
16   frozenset({Node.js})     frozenset({JavaScript})     0.354814            0.672312            0.330250  0.930768
17   frozenset({JavaScript})  frozenset({Node.js})        0.672312            0.354814            0.330250  0.491215
155  frozenset({JavaScript})  frozenset({SQL, HTML/CSS})  0.672312            0.349310            0.320645  0.476929

     lift      representativity  leverage  conviction  zhangs_metric  jaccard   certainty  kulczynski
2    1.354746  1.0               0.130647  3.674112    0.579051       0.691837  0.727825   0.826462
3    1.354746  1.0               0.130647  1.753526    0.799096       0.691837  0.429720   0.826462
4    1.139022  1.0               0.048849  1.399049    0.255683       0.503600  0.285229   0.680536
5    1.139022  1.0               0.048849  1.179533    0.372469       0.503600  0.152207   0.680536
0    1.060198  1.0               0.022179  1.078736    0.173274       0.470788  0.072989   0.646896
1    1.060198  1.0               0.022179  1.140910    0.125624       0.470788  0.123507   0.646896
55   1.220112  1.0               0.063017  1.317501    0.398933       0.484404  0.240987   0.653018
54   1.220112  1.0               0.063017  1.363571    0.377915       0.484404  0.266632   0.653018
6    1.032415  1.0               0.010663  1.071244    0.061474       0.413147  0.066506   0.599611
7    1.032415  1.0               0.010663  1.032047    0.095816       0.413147  0.031052   0.599611
14   1.339383  1.0               0.085590  3.292790    0.405495       0.475992  0.696306   0.701452
15   1.339383  1.0               0.085590  1.255854    0.773258       0.475992  0.203729   0.701452
16   1.384430  1.0               0.091704  4.733216    0.430389       0.473900  0.788727   0.710992
17   1.384430  1.0               0.091704  1.268092    0.847394       0.473900  0.211414   0.710992
155  1.365345  1.0               0.085800  1.243980    0.816582       0.457426  0.196128   0.697433

--------------------
top 15 confidence
--------------------
     antecedents                        consequents              antecedent support  consequent support  support   confidence  \
186  frozenset({Node.js, HTML/CSS})     frozenset({JavaScript})  0.266679            0.672312            0.258982  0.971135
200  frozenset({HTML/CSS, jQuery})      frozenset({JavaScript})  0.209281            0.672312            0.202619  0.968166
194  frozenset({HTML/CSS, React.js})    frozenset({JavaScript})  0.242292            0.672312            0.233736  0.964687
216  frozenset({Node.js, TypeScript})   frozenset({JavaScript})  0.224635            0.672312            0.212769  0.947174
180  frozenset({HTML/CSS, TypeScript})  frozenset({JavaScript})  0.275562            0.672312            0.260726  0.946158
174  frozenset({MySQL, HTML/CSS})       frozenset({JavaScript})  0.266570            0.672312            0.248655  0.932792
16   frozenset({Node.js})               frozenset({JavaScript})  0.354814            0.672312            0.330250  0.930768
29   frozenset({jQuery})                frozenset({JavaScript})  0.256298            0.672312            0.238505  0.930576
139  frozenset({Node.js, Docker})       frozenset({JavaScript})  0.233000            0.672312            0.216706  0.930067
162  frozenset({HTML/CSS, AWS})         frozenset({JavaScript})  0.246298            0.672312            0.228395  0.927315
168  frozenset({HTML/CSS, PostgreSQL})  frozenset({JavaScript})  0.238110            0.672312            0.219295  0.920982
110  frozenset({HTML/CSS, Docker})      frozenset({JavaScript})  0.307674            0.672312            0.283110  0.920161
150  frozenset({SQL, HTML/CSS})         frozenset({JavaScript})  0.349310            0.672312            0.320645  0.917938
156  frozenset({HTML/CSS, Git})         frozenset({JavaScript})  0.278628            0.672312            0.255630  0.917461
19   frozenset({React.js})              frozenset({JavaScript})  0.336449            0.672312            0.306734  0.911683

     lift      representativity  leverage  conviction  zhangs_metric  jaccard   certainty  kulczynski
186  1.444472  1.0               0.079690  11.352518   0.419606       0.380850  0.911914   0.678173
200  1.440056  1.0               0.061917  10.293704   0.386461       0.298419  0.902853   0.634771
194  1.434881  1.0               0.070840  9.279634    0.399994       0.343292  0.892237   0.656174
216  1.408831  1.0               0.061744  6.203130    0.374265       0.310984  0.838791   0.631823
180  1.407321  1.0               0.075462  6.086157    0.399524       0.379431  0.835693   0.666982
174  1.387439  1.0               0.069436  4.875704    0.380742       0.360250  0.794901   0.651321
16   1.384430  1.0               0.091704  4.733216    0.430389       0.473900  0.788727   0.710992
29   1.384144  1.0               0.066193  4.720118    0.373176       0.345606  0.788141   0.642665
139  1.383386  1.0               0.060057  4.685725    0.361325       0.314702  0.786586   0.626198
162  1.379293  1.0               0.062807  4.508334    0.364854       0.330905  0.778189   0.633516
168  1.369874  1.0               0.059211  4.147002    0.354389       0.317300  0.758862   0.623581
110  1.368653  1.0               0.076257  4.104374    0.389058       0.406256  0.756358   0.670630
150  1.365345  1.0               0.085800  3.993157    0.411232       0.457426  0.749572   0.697433
156  1.364637  1.0               0.068305  3.970116    0.370411       0.367650  0.748118   0.648843
19   1.356042  1.0               0.080536  3.710346    0.395688       0.436927  0.730483   0.683961

--------------------
top 15 lift
--------------------
     antecedents                          consequents                          antecedent support  consequent support  \
219  frozenset({Node.js})                 frozenset({TypeScript, JavaScript})  0.354814            0.337784
218  frozenset({TypeScript, JavaScript})  frozenset({Node.js})                 0.337784            0.354814
106  frozenset({Node.js})                 frozenset({React.js})                0.354814            0.336449
107  frozenset({React.js})                frozenset({Node.js})                 0.336449            0.354814
220  frozenset({TypeScript})              frozenset({Node.js, JavaScript})     0.375114            0.330250
217  frozenset({Node.js, JavaScript})     frozenset({TypeScript})              0.330250            0.375114
102  frozenset({Node.js})                 frozenset({TypeScript})              0.354814            0.375114
103  frozenset({TypeScript})              frozenset({Node.js})                 0.375114            0.354814
105  frozenset({React.js})                frozenset({TypeScript})              0.336449            0.375114
104  frozenset({TypeScript})              frozenset({React.js})                0.375114            0.336449
198  frozenset({HTML/CSS, JavaScript})    frozenset({jQuery})                  0.498931            0.256298
203  frozenset({jQuery})                  frozenset({HTML/CSS, JavaScript})    0.256298            0.498931
141  frozenset({Node.js})                 frozenset({Docker, JavaScript})      0.354814            0.390618
140  frozenset({Docker, JavaScript})      frozenset({Node.js})                 0.390618            0.354814
199  frozenset({jQuery, JavaScript})      frozenset({HTML/CSS})                0.238505            0.547787

     support   confidence  lift      representativity  leverage  conviction  zhangs_metric  jaccard   certainty  kulczynski
219  0.212769  0.599662    1.775283  1.0               0.092918  1.654143    0.676874       0.443425  0.395457   0.614779
218  0.212769  0.629896    1.775283  1.0               0.092918  1.743253    0.659467       0.443425  0.426360   0.614779
106  0.206706  0.582575    1.731542  1.0               0.087329  1.589630    0.654819       0.426587  0.370923   0.598475
107  0.206706  0.614375    1.731542  1.0               0.087329  1.673093    0.636695       0.426587  0.402305   0.598475
220  0.212769  0.567210    1.717519  1.0               0.088887  1.547519    0.668546       0.431934  0.353804   0.605738
217  0.212769  0.644266    1.717519  1.0               0.088887  1.756608    0.623762       0.431934  0.430721   0.605738
102  0.224635  0.633107    1.687771  1.0               0.091539  1.703182    0.631605       0.444564  0.412864   0.615976
103  0.224635  0.598845    1.687771  1.0               0.091539  1.608321    0.652123       0.444564  0.378234   0.615976
105  0.207905  0.617939    1.647336  1.0               0.081698  1.635565    0.592206       0.412789  0.388591   0.586091
104  0.207905  0.554244    1.647336  1.0               0.081698  1.488598    0.628850       0.412789  0.328227   0.586091
198  0.202619  0.406106    1.584508  1.0               0.074744  1.252247    0.736204       0.366658  0.201435   0.598333
203  0.202619  0.790559    1.584508  1.0               0.074744  2.392416    0.496017       0.366658  0.582013   0.598333
141  0.216706  0.610759    1.563570  1.0               0.078109  1.565564    0.558657       0.409864  0.361253   0.582768
140  0.216706  0.554777    1.563570  1.0               0.078109  1.449128    0.591481       0.409864  0.309930   0.582768
199  0.202619  0.849537    1.550854  1.0               0.071969  3.005484    0.466443       0.347144  0.667275   0.609712
JavaScript was the most strongly supported skill, appearing in 67% of all survey responses. It therefore makes sense that all of the top 15 confidence rules had JavaScript as the right-hand side, that is, the skill predicted when the other skills were present. Intuitively, one would hope that knowing Node.js, React.js, or jQuery implies having JavaScript as a skill, so this fits the narrative. Lift normalizes confidence by the support of the right-hand side and is a better metric for judging whether an association rule has value. The highest lift was 1.78, for the rule that knowing Node.js implies knowing both TypeScript and JavaScript. Because the formula for lift is symmetric, the arrow direction between the two sides can be swapped and the rule keeps the same lift. In the lift heatmap, it is interesting that, across almost all of the top skills, there was a rule in which one side was HTML/CSS + JavaScript, the foundational web technologies. This pair had little to no correlation with skills such as AWS, Docker, and Git, but a high correlation with SQL and with different JavaScript technologies.
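The symmetry of lift can be checked directly against the printed output. The sketch below recomputes lift and confidence for rules 219 and 218 from the supports shown in the "top 15 lift" table; the variable and function names are chosen here for illustration only and do not come from the project code.

```python
import math

# Supports copied from the "top 15 lift" output (rules 219 and 218).
support_node = 0.354814    # {Node.js}
support_ts_js = 0.337784   # {TypeScript, JavaScript}
support_all = 0.212769     # {Node.js, TypeScript, JavaScript}


def lift(joint, antecedent, consequent):
    """Joint support divided by the product of the two side supports."""
    return joint / (antecedent * consequent)


def confidence(joint, antecedent):
    """Joint support divided by the antecedent support."""
    return joint / antecedent


# Swapping the sides of the rule leaves lift unchanged...
forward_lift = lift(support_all, support_node, support_ts_js)
backward_lift = lift(support_all, support_ts_js, support_node)

# ...but confidence depends on which side is the antecedent.
forward_conf = confidence(support_all, support_node)
backward_conf = confidence(support_all, support_ts_js)

print(f"lift:       {forward_lift:.4f} vs {backward_lift:.4f}")
print(f"confidence: {forward_conf:.4f} vs {backward_conf:.4f}")
```

Up to rounding of the printed supports, the results agree with the table: both directions share a lift of about 1.775, while the confidences differ (about 0.600 for Node.js → TypeScript, JavaScript and about 0.630 for the reverse).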