07 — Hands On ML — Ensemble
Ensemble Learning combines the predictions of multiple models and outputs the class that receives the most votes. When you train multiple Decision Trees, each on a random sample of the dataset, and aggregate their predictions so that the predicted class is the one with the most votes, the approach is called a Random Forest. A Voting Classifier trains several different classifiers, such as Logistic Regression, SVM, Random Forest and others, and predicts the class that gets the majority of votes; this is hard voting. Voting can also be soft: average the predicted class probabilities of all classifiers and take the argmax of the averaged probabilities. Somewhat surprisingly, a voting classifier can achieve higher accuracy than the best individual classifier in the ensemble, even when the classifiers used are all weak learners (they perform only a little better than random guessing), provided they are numerous and diverse enough. Let's verify with an example. Please find the notebook here.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y);

log_clf = LogisticRegression()
rf_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rf_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)
for clf in (log_clf, rf_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_preds = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_preds))
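Soft voting can be tried on the same setup; here is a minimal sketch. Note that SVC must be created with probability=True so it can expose predict_proba, which soft voting needs.

# Soft voting: average the predicted class probabilities and take the argmax.
svm_clf_proba = SVC(probability=True)
soft_voting_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier()),
                ('svc', svm_clf_proba)],
    voting='soft')
soft_voting_clf.fit(X_train, y_train)
print(accuracy_score(y_test, soft_voting_clf.predict(X_test)))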
Bagging and pasting both mean training the same algorithm on different random subsets of the training data. When the sampling is performed with replacement it is called bagging; when it is performed without replacement it is called pasting.
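As a quick illustration (a sketch reusing the moons split from above), scikit-learn's BaggingClassifier trains many copies of a base estimator on random samples; bootstrap=True gives bagging and bootstrap=False gives pasting:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: 500 trees, each trained on 100 instances sampled with replacement.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
print(accuracy_score(y_test, bag_clf.predict(X_test)))

# Pasting: same idea, but sampling without replacement.
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=False, n_jobs=-1, random_state=42)
paste_clf.fit(X_train, y_train)
print(accuracy_score(y_test, paste_clf.predict(X_test)))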
Random Forests are also very useful for measuring feature importance via the feature_importances_ attribute, which can be used for feature selection.
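For example, here is a small sketch on the iris dataset; after fitting, each feature's relative importance can be read directly off the attribute:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf_iris = RandomForestClassifier(n_estimators=500, random_state=42)
rf_iris.fit(iris["data"], iris["target"])

# feature_importances_ sums to 1; higher means the feature was more useful for splits.
for name, score in zip(iris["feature_names"], rf_iris.feature_importances_):
    print(name, score)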
Boosting is an ensemble method that combines several weak learners into a strong learner by training predictors sequentially, each one trying to correct its predecessor. AdaBoost does this by training each new predictor to pay more attention to the training instances that its predecessor underfitted, increasing their relative weights. The comparison with Gradient Descent is that, instead of tweaking a single predictor's parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better.
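A minimal sketch with scikit-learn's AdaBoostClassifier, reusing the moons split from above (the stump depth, number of estimators and learning rate here are just illustrative choices):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# 200 decision stumps; each new stump reweights the instances its predecessors got wrong.
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)
print(accuracy_score(y_test, ada_clf.predict(X_test)))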
## MNIST Classifier - stacking and ensembling

import random
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.datasets import fetch_openml

## Data
# as_frame=False returns NumPy arrays rather than a DataFrame, so the reshape below works.
mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist['data'], mnist['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

random_digit = X[random.randint(0, len(X) - 1)].reshape((28, 28))
plt.imshow(random_digit);

## Models
rf = RandomForestClassifier()
svm = SVC()
ext_clf = ExtraTreesClassifier()

for clf in (rf, svm, ext_clf):
    clf.fit(X_train, y_train)
    print(f"Model: {clf.__class__.__name__} Score: {clf.score(X_test, y_test)}")

voting_clf = VotingClassifier(estimators=[("rf", rf), ("svm", svm), ("ext_clf", ext_clf)])
voting_clf.voting = "hard"
voting_clf.fit(X_train, y_train)
print(f"Voting Classifier with Hard voting {voting_clf.score(X_test, y_test)}")
Stacking Ensembles
## Stacking ensemble
import numpy as np

estimators = [("rf", rf), ("svm", svm), ("ext_clf", ext_clf)]

# Predictions of each first-layer model become the features for the blender.
# (Ideally these would come from a held-out validation set rather than the test set.)
X_val_preds = np.empty((len(X_test), len(estimators)), dtype=np.float32)
for index, (name, estimator) in enumerate(estimators):
    # Labels come back as strings from fetch_openml, so cast them to floats.
    X_val_preds[:, index] = estimator.predict(X_test).astype(np.float32)

rnd_forest_blender = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rnd_forest_blender.fit(X_val_preds, y_test)
print(rnd_forest_blender.oob_score_)
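As an aside, scikit-learn also ships a built-in StackingClassifier that trains the blender on cross-validated predictions of the base models. Here is a minimal sketch with two of the same first-layer models; the LogisticRegression final estimator and cv=3 are just illustrative choices, and on full MNIST this can be slow:

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

# final_estimator is fit on out-of-fold predictions of the base estimators.
stack_clf = StackingClassifier(
    estimators=[("rf", RandomForestClassifier()), ("ext_clf", ExtraTreesClassifier())],
    final_estimator=LogisticRegression(), cv=3, n_jobs=-1)
stack_clf.fit(X_train, y_train)
print(stack_clf.score(X_test, y_test))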
All the images and code have been taken from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron.