PCA is a technique that takes quantitative record data, determines which directions in that data contain the most variation, and uses them to transform the records into new columns called principal components that are uncorrelated with one another. These components capture the variance of the data in decreasing order of importance, so they can be used to reduce the dimensionality of the data to fewer columns than it started with while preserving most of the important structure.
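The decorrelation property can be seen directly in a small sketch (synthetic data, not the Stack Overflow set): two strongly correlated input columns come out of PCA as components with essentially zero correlation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# two correlated columns: the second is the first plus a little noise
x = rng.normal(size=500)
data = np.column_stack([x, x + 0.3 * rng.normal(size=500)])

pca = PCA(n_components=2)
transformed = pca.fit_transform(data)

# the principal components are uncorrelated with each other
corr = np.corrcoef(transformed, rowvar=False)
print(round(abs(corr[0, 1]), 6))
```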
First, the Stack Overflow data is prepared for PCA by keeping only the quantitative columns. Then scikit-learn's StandardScaler is used to standardize each column to zero mean and unit variance.
code_quant.py
import pandas as pd
from sklearn.preprocessing import StandardScaler

def get_quant_cols(df):
    # drop the categorical/label columns, keeping only the quantitative ones
    df2 = df.drop(columns=["Age", "EdLevel", "Gender", "MentalHealth",
                           "MainBranch", "Country", "Employed"])
    return df2

def get_scaled_df(df, col_tail="S"):
    # standardize each column to zero mean and unit variance,
    # appending col_tail to the column names to mark them as scaled
    scaler = StandardScaler()
    scaled = scaler.fit_transform(df)
    df_scaled = pd.DataFrame({df.columns[j] + col_tail: scaled[:, j]
                              for j in range(scaled.shape[1])})
    return df_scaled

if __name__ == "__main__":
    datafile = "../../dataprep/stackoverflow_clean.csv"
    df = pd.read_csv(datafile, index_col=0)
    quant_df = get_quant_cols(df)
    print(quant_df.describe())
    scaled_df = get_scaled_df(quant_df)
    print(scaled_df)
    scaled_df.to_csv("stackoverflow_quant.csv")
code_quant.py output
          YearsCode  YearsCodePro  PreviousSalary  ComputerSkills
count  67265.000000  67265.000000    67265.000000    67265.000000
mean      14.014004      8.965093    67570.164484       13.644362
std        9.126883      7.698038    49438.578398        7.012238
min        0.000000      0.000000        1.000000        0.000000
25%        7.000000      3.000000    28584.000000        9.000000
50%       12.000000      7.000000    57336.000000       13.000000
75%       19.000000     12.000000    95541.000000       17.000000
max       50.000000     42.000000   224000.000000      107.000000

       YearsCodeS  YearsCodeProS  PreviousSalaryS  ComputerSkillsS
0       -0.768505      -0.644986        -0.324004        -1.375372
1       -0.220669      -0.515082        -0.426556        -0.234501
2        0.108033      -0.385178         0.196606        -0.947545
3       -0.549371      -0.385178        -0.433575        -0.091892
4       -0.549371      -0.904795        -0.579616        -1.232763
...           ...            ...              ...              ...
67260   -0.768505      -0.904795        -0.536269        -0.091892
67261    0.765436       0.913864         0.959376        -0.377109
67262   -1.097207      -0.774891        -0.199242        -0.234501
67263   -0.987640      -1.034699         0.049149         0.193326
67264   -0.439803      -0.774891         0.158053        -1.660589

[67265 rows x 4 columns]
The prepared quantitative data can be found here: stackoverflow_quant.csv.
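As a quick sanity check on the standardization step, every scaled column should have mean 0 and standard deviation 1. A minimal sketch of that check, using hypothetical stand-in values rather than the actual CSV:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# hypothetical stand-in for two of the quantitative columns
df = pd.DataFrame({"YearsCode": [7, 12, 19, 3, 25],
                   "PreviousSalary": [28584, 57336, 95541, 41000, 120000]})
scaled = StandardScaler().fit_transform(df)

print(np.round(scaled.mean(axis=0), 6))  # each column centered at 0
print(np.round(scaled.std(axis=0), 6))   # each column has unit variance
```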
code_pca.py
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def get_pca_data(df, n=3):
    pca = PCA(n_components=n)
    X_pca = pca.fit_transform(df.to_numpy())
    eigenvalues = pca.explained_variance_
    eigenvectors = pca.components_
    explained_ratio = pca.explained_variance_ratio_
    eigenveclist = list(eigenvectors[j, :] for j in range(len(eigenvalues)))
    for i, (eig, vec, ratio) in enumerate(zip(eigenvalues, eigenveclist, explained_ratio)):
        print(f"PC{i+1}:")
        print("  Eigenvalue: %.4f" % eig)
        print("  Eigenvector:", np.round(vec, 4))
        print("  Variance Explained: %.2f %%" % (ratio * 100))
    print("Total Variance Explained: %.2f %%" % sum(explained_ratio * 100))
    return pca, X_pca

def plot_feature_importance(feature_names, pca, output_png="output.png"):
    # normalize the absolute loadings of each PC so they sum to 1,
    # then draw them as stacked bars, one stack per component
    relative_importance = []
    n_vecs = len(pca.components_)
    for j in range(n_vecs):
        pc = pca.components_[j]
        rel_imp = np.abs(pc) / np.sum(np.abs(pc))
        relative_importance.append(rel_imp)
    relative_importance = np.array(relative_importance)
    plt.figure(figsize=(8, 6))
    bottom = np.zeros(n_vecs)
    for i, feature in enumerate(feature_names):
        plt.bar([f"PC{j+1}" for j in range(n_vecs)],
                relative_importance[:, i],
                bottom=bottom,
                label=feature)
        bottom += relative_importance[:, i]
    plt.ylabel("Relative Importance")
    plt.title("Stacked Relative Feature Importance by Principal Component")
    plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.tight_layout()
    plt.savefig(output_png)

def plot2d(X, output_png="output.png"):
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111)
    pc1 = X[:, 0]
    pc2 = X[:, 1]
    ax.scatter(pc1, pc2, marker=".")
    plt.title("Transformed Data (PCA n_components=2)")
    ax.set_xlabel("pc1")
    ax.set_ylabel("pc2")
    plt.xlim(min(pc1), max(pc1))
    plt.ylim(min(pc2), max(pc2))
    plt.savefig(output_png)

def plot3d(X, output_png="output.png"):
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, projection="3d")
    pc1 = X[:, 0]
    pc2 = X[:, 1]
    pc3 = X[:, 2]
    ax.scatter(pc1, pc2, pc3, marker=".")
    plt.title("Transformed Data (PCA n_components=3)")
    ax.set_xlabel("pc1")
    ax.set_ylabel("pc2")
    ax.set_zlabel("pc3")
    ax.set_xlim(min(pc1), max(pc1))
    ax.set_ylim(min(pc2), max(pc2))
    ax.set_zlim(min(pc3), max(pc3))
    plt.savefig(output_png)

if __name__ == "__main__":
    datafile = "stackoverflow_quant.csv"
    df = pd.read_csv(datafile, index_col=0)
    print("covar:")
    print(df.cov())
    print("performing pca with n=2")
    pca, X_pca = get_pca_data(df, n=2)
    print()
    print("performing pca with n=3")
    pca, X_pca = get_pca_data(df, n=3)
    print("plotting feature importance")
    plot_feature_importance(df.columns, pca, "pca_feature_importance.png")
    plot2d(X_pca, "pca_2.png")
    plot3d(X_pca, "pca_3.png")
    X_pca_df = pd.DataFrame({"PC%d" % (j + 1): X_pca[:, j] for j in range(X_pca.shape[1])})
    X_pca_df.to_csv("stackoverflow_pca.csv")
code_pca.py output
covar:
                 YearsCodeS  YearsCodeProS  PreviousSalaryS  ComputerSkillsS
YearsCodeS         1.000015       0.903143         0.392955        -0.019983
YearsCodeProS      0.903143       1.000015         0.397412        -0.014316
PreviousSalaryS    0.392955       0.397412         1.000015         0.025314
ComputerSkillsS   -0.019983      -0.014316         0.025314         1.000015
performing pca with n=2
PC1:
  Eigenvalue: 2.1702
  Eigenvector: [ 0.6376  0.6385  0.4308 -0.0094]
  Variance Explained: 54.25 %
PC2:
  Eigenvalue: 1.0040
  Eigenvector: [-0.0346 -0.0278  0.1141  0.9925]
  Variance Explained: 25.10 %
Total Variance Explained: 79.35 %

performing pca with n=3
PC1:
  Eigenvalue: 2.1702
  Eigenvector: [ 0.6376  0.6385  0.4308 -0.0094]
  Variance Explained: 54.25 %
PC2:
  Eigenvalue: 1.0040
  Eigenvector: [-0.0346 -0.0278  0.1141  0.9925]
  Variance Explained: 25.10 %
PC3:
  Eigenvalue: 0.7290
  Eigenvector: [-0.3057 -0.3005  0.8952 -0.122 ]
  Variance Explained: 18.22 %
Total Variance Explained: 97.58 %
plotting feature importance
The final transformed PCA data can be found here: stackoverflow_pca.csv.
Looking at PC1, which accounts for 54% of the variance in the feature importance plot, it is interesting that it weights the features YearsCode and YearsCodePro almost equally. In fact, the first two loadings are approximately equal in every eigenvector. This makes sense because the two variables should be highly correlated, and not much additional information is gained by knowing both: if someone has been coding professionally for a long time, they have been coding overall for even longer. The data bears this out, since the correlation between these two standardized variables is about 0.90. This raises the question of whether the gap between when someone started coding in general and when they started coding professionally would be a useful derived feature.
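That derived feature could be sketched as a simple column difference. A minimal example on hypothetical rows (the column name YearsCodeHobby is an invented placeholder, not part of the survey data):

```python
import pandas as pd

# hypothetical rows mirroring the two survey columns
df = pd.DataFrame({"YearsCode": [14, 20, 8],
                   "YearsCodePro": [9, 15, 1]})

# years of non-professional coding before (or alongside) the professional career
df["YearsCodeHobby"] = df["YearsCode"] - df["YearsCodePro"]
print(df["YearsCodeHobby"].tolist())  # [5, 5, 7]
```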
Once the 2nd principal component is added, which consists mainly of the ComputerSkills variable, 79.3% of the variance is explained. Adding the 3rd component brings the total explained variance up to 97.6%. Only 3 dimensions are necessary to capture over 95% of the variance, which makes sense given how highly correlated the first 2 features are.
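Rather than inspecting the cumulative ratios by hand, scikit-learn can pick the number of components for a target variance threshold: passing a float between 0 and 1 as n_components keeps just enough PCs to reach it. A sketch on synthetic data where, loosely mirroring YearsCode vs. YearsCodePro, two of four columns are near-duplicates:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# four columns, two of which are nearly identical
a = rng.normal(size=300)
data = np.column_stack([a, a + 0.1 * rng.normal(size=300),
                        rng.normal(size=300), rng.normal(size=300)])

# a float n_components keeps just enough PCs to explain >= 95% of the variance;
# the duplicated pair collapses into one component, so 3 PCs suffice here
pca = PCA(n_components=0.95).fit(data)
print(pca.n_components_)
print(round(pca.explained_variance_ratio_.sum(), 3))
```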