This project aims to demonstrate the effectiveness of ensemble classifiers by comparing and contrasting several classifier families. The problem space is deliberately simple: an insurance dataset and a binary classification target. The goal is to push this simple setup to its limit and achieve very high test set accuracy.¶
import pandas as pd
import numpy as np
import seaborn as sns
import math
import category_encoders as ce
from collections import defaultdict
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
Read the data from CSV and coerce the data types.
pstr = pd.StringDtype()
dtype_dict = {'State': pstr, 'Customer Lifetime Value ': np.float64, 'Response': pstr, 'Coverage': pstr, 'Education': pstr,
'EmploymentStatus': pstr, 'Gender': pstr, 'Income': int, 'Location Code': pstr, 'Marital Status': pstr,
'Monthly Premium Auto': int, 'Months Since Last Claim': int, 'Number of Open Complaints': int,
'Number of Policies': int, 'Policy': pstr, 'Renew Offer Type': pstr, 'Sales Channel': pstr,
'Total Claim Amount': np.float64, 'Vehicle Class': pstr}
df_raw = pd.read_csv("Customer relationship marketing (CRM).csv", parse_dates=['Effective To Date'], dtype=dtype_dict)
df_raw.head()
The data set consists of 19 features and one binary label variable, Response. The objective is to predict a customer's response to an insurance policy offer from the accompanying feature variables.
df_raw.info()
The data does not contain any missing values, so no imputation is necessary. Feature engineering and data encoding follow. For now, we will keep all feature variables and continue with determining an encoding scheme. When doing predictive modeling with mixed categorical/metric features, especially when the feature count and/or the feature cardinalities are high, the choice of encoding scheme can be very important.
There is a moderately high number of features, most of which have low cardinality, so the curse of dimensionality (CoD) is a moderate concern that our encoding scheme must consider. Additionally, since the label variable (Response) is binary categorical we will be performing classification, and we are likely to rely on decision tree models, whose performance can degrade as the feature count grows (they are sensitive to CoD). The lowest feature class counts are given below.
def column_value_counts(df_in):
    # melt to (column, value) pairs, then cross-tabulate values against columns
    value_table = pd.crosstab(**df_in.melt(var_name='columns', value_name='index'))
    # back to long form: one row per (column, value, count)
    value_table = value_table.melt(var_name='column', value_name='count', ignore_index=False).reset_index()
    value_table = value_table[value_table['count'] != 0].rename(columns={'index': 'value'})
    # reorder to (column, value, count) and sort by count
    cols = value_table.columns.to_list()
    value_table = value_table[[cols[1], cols[0], cols[2]]].sort_values('count', ascending=False).reset_index(drop=True)
    return value_table
column_value_counts(df_raw[df_raw.columns[df_raw.dtypes == pstr]]).tail(10)
All class counts for the categorical features are high enough that there is no need to lump fringe class values into an 'other' category. Gender is binary, and Coverage and Education contain strong ordinal elements, so all three will be label encoded. Policy has the highest cardinality at nine, so we want to avoid one-hot encoding it if possible. Policy appears to be a combination of two features, each of cardinality three. Splitting Policy gives a nominal feature and an ordinal feature, the latter of which can be label encoded, saving five dimensions over one-hot encoding Policy. Renew Offer Type cannot be assumed to be ordinal despite its appearance. All remaining unencoded categorical variables will be one-hot encoded.
# feature engineering data frame
df_ml = df_raw.copy(deep=True)
# split Policy into its nominal type prefix and its trailing ordinal level digit
df_ml['Policy Level'] = df_ml['Policy'].map(lambda val: val[-1]).astype(int)
df_ml['Policy'] = df_ml['Policy'].map(lambda val: val[:-3]).astype(pstr)
# encoded data
categorical_cols = df_ml.columns[df_ml.dtypes == pstr].to_list()
df_enc = df_ml[categorical_cols].drop(columns=['Response']).copy(deep=True)
y_label = df_ml['Response'].map({'No': 0, 'Yes': 1})
# ordinal encoding
def get_ce_map(col_name, zip_obj):
    # build a category_encoders mapping spec for one column
    return {'col': col_name, 'mapping': {x: y for x, y in zip_obj}}
classes = [['Basic', 'Extended', 'Premium'],
['High School or Below', 'College', 'Bachelor', 'Master', 'Doctor']]
rules = [zip(l1, list(range(1, len(l1)+1))) for l1 in classes]
col_list = ['Coverage', 'Education']
col_rule = zip(col_list, rules)
ce_1 = ce.OrdinalEncoder(mapping=[get_ce_map(col, rule) for col, rule in col_rule], cols=col_list)
df_enc = ce_1.fit_transform(df_enc)
df_enc = df_enc.astype({x: 'category' for x in df_enc.columns.to_series()[df_enc.dtypes == pstr].values})
# one-hot (k degrees of freedom)
ce_2 = ce.OneHotEncoder()
df_enc_hot = ce_2.fit_transform(df_enc)
# dummy encoding (k-1 degrees of freedom)
df_enc_dummy = pd.get_dummies(df_enc, drop_first=True)
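To make the k versus k-1 distinction concrete, a quick illustrative check of the column counts of the two frames built above; the difference should equal the number of categorical features encoded.
print(f'one-hot columns: {df_enc_hot.shape[1]}')
print(f'dummy columns:   {df_enc_dummy.shape[1]}')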
We have a date variable which varies from January 1st 2011 to February 28th 2011. Date variables often exhibit linear and seasonal trends. To give the date variable the greatest opportunity to demonstrate potential relationships with the target variable it will be decomposed into several linear and cyclical components, the latter of which can be modeled with two sinusoidal components. If the new 'date derived' components fail to account for any significant percentage of the variance of the target variable, they will be removed in feature selection.
def get_date_features(date_vec):
    # default data dict
    data_dict = defaultdict(list)
    # pseudo-constant terms (denominators for the cyclical terms)
    data_dict['Days_in_Year'] = [366 if x.is_leap_year else 365 for x in date_vec]
    data_dict['Days_in_Month'] = [x.days_in_month for x in date_vec]
    # linear terms
    data_dict['Month'] = [x.month for x in date_vec]
    data_dict['Day_of_Month'] = [x.day for x in date_vec]
    data_dict['Day_of_Week'] = [x.day_of_week for x in date_vec]
    data_dict['Day_of_Year'] = [x.day_of_year for x in date_vec]
    # cyclical terms: each cycle is encoded by a cosine/sine pair
    data_dict['Day_of_Month_Cyclical_1'] = [math.cos(2*math.pi*x/y) for x, y in zip(data_dict['Day_of_Month'], data_dict['Days_in_Month'])]
    data_dict['Day_of_Month_Cyclical_2'] = [math.sin(2*math.pi*x/y) for x, y in zip(data_dict['Day_of_Month'], data_dict['Days_in_Month'])]
    data_dict['Day_of_Week_Cyclical_1'] = [math.cos(2*math.pi*x/7) for x in data_dict['Day_of_Week']]
    data_dict['Day_of_Week_Cyclical_2'] = [math.sin(2*math.pi*x/7) for x in data_dict['Day_of_Week']]
    data_dict['Day_of_Year_Cyclical_1'] = [math.cos(2*math.pi*x/y) for x, y in zip(data_dict['Day_of_Year'], data_dict['Days_in_Year'])]
    data_dict['Day_of_Year_Cyclical_2'] = [math.sin(2*math.pi*x/y) for x, y in zip(data_dict['Day_of_Year'], data_dict['Days_in_Year'])]
    return pd.DataFrame(data=data_dict)
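As a quick illustrative sanity check of the cyclical encoding, applying the function above to two hypothetical adjacent dates shows that the last day of January and the first day of February land close together on the cosine/sine circle, unlike the raw day number, which jumps from 31 to 1.
check = get_date_features(pd.Series(pd.to_datetime(['2011-01-31', '2011-02-01'])))
print(check[['Day_of_Month', 'Day_of_Month_Cyclical_1', 'Day_of_Month_Cyclical_2']])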
The date features dataframe and the metric data are merged with the feature engineering dataframe and the encoded dataframes, as the date features need no encoding. The ML dataframe keeps track of our features while the encoded dataframes are passed to the models.
df_date_features = get_date_features(df_ml['Effective To Date']).drop(columns=['Days_in_Year', 'Days_in_Month'])
df_ml = df_ml.drop(['Effective To Date'], axis=1)
df_ml = df_ml.join(df_date_features)
df_enc_hot = df_enc_hot.join(df_ml.drop(categorical_cols + ['Response'], axis=1))
df_enc_dummy = df_enc_dummy.join(df_ml.drop(categorical_cols + ['Response'], axis=1))
We have to consider the assumptions of the ML models we will use; some models require that metric data be standardized. These models also require dummy encoding (each feature to k-1 features) instead of one-hot encoding (each feature to k features), which removes redundant information. Many other models, such as decision tree models, can benefit from keeping all the columns.
# standardized data
df_std = pd.DataFrame(data=StandardScaler().fit_transform(X=df_enc_dummy), columns=df_enc_dummy.columns.to_list())
rf_c = RandomForestClassifier(random_state=42)
rf_c_param = { 'n_estimators': [200],
'max_depth': [10, 15],
'min_samples_leaf': [1, 2]}
gs_cv_rf_c = GridSearchCV(rf_c, rf_c_param, scoring=['accuracy', 'roc_auc_ovo', 'f1_macro'], refit='f1_macro')
gs_cv_rf_c.fit(df_enc_hot, y_label);
df_importances = pd.DataFrame(data={'importance': gs_cv_rf_c.best_estimator_.feature_importances_, 'feature': df_enc_hot.columns.to_list()})
df_importances['features_combine'] = np.array([col[:-2] if col[-2] == '_' else col for col in df_enc_hot.columns])
# sum the per-column importances back up to the original features
importance_collapsed = []
features_collapsed = []
for feature in df_importances['features_combine'].unique():
    importance = df_importances['importance'][df_importances['features_combine'] == feature].sum()
    importance_collapsed.append(importance)
    features_collapsed.append(feature)
df_importances_collapsed = pd.DataFrame(data={'importance': importance_collapsed, 'feature': features_collapsed}).sort_values('importance', ascending=False)
sns.set(rc={'figure.figsize':(20,8.5)})
ax = sns.barplot(data=df_importances_collapsed, x='feature', y='importance')
ax.set_xticklabels(ax.get_xticklabels(), rotation=80);
gs_cv_rf_c.best_estimator_.score(df_enc_hot, y_label)
The classifier fit very well, possibly overfit: it correctly classified all of the training data. Some of our cyclical date components performed well, and the most important feature was employment status. State, policy, month, policy level, and number of complaints were not very useful to the classifier and may be removed.
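The perfect resubstitution score above is not a generalization estimate; the cross-validated score tracked by the grid search (exposed as best_score_ since refit='f1_macro' was set) is a fairer gauge of overfitting.
print(f'Best CV f1_macro: {gs_cv_rf_c.best_score_}')
print(gs_cv_rf_c.best_params_)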
remove_features = df_importances_collapsed['feature'][-7:].values
xgb_c = XGBClassifier(use_label_encoder=False)
xgb_c_param = {'n_estimators': [200],
'eval_metric': ['logloss'],
'max_depth': [10, 14],
'subsample': [0.8, 0.9]}
gs_cv_xgb_c = GridSearchCV(xgb_c, xgb_c_param, scoring=['accuracy', 'roc_auc_ovo', 'f1_macro'], refit='f1_macro')
gs_cv_xgb_c.fit(df_enc_hot, y_label);
df_importances = pd.DataFrame(data={'importance': gs_cv_xgb_c.best_estimator_.feature_importances_, 'feature': df_enc_hot.columns.to_list()})
df_importances['features_combine'] = np.array([col[:-2] if col[-2] == '_' else col for col in df_enc_hot.columns])
# sum the per-column importances back up to the original features
importance_collapsed = []
features_collapsed = []
for feature in df_importances['features_combine'].unique():
    importance = df_importances['importance'][df_importances['features_combine'] == feature].sum()
    importance_collapsed.append(importance)
    features_collapsed.append(feature)
df_importances_collapsed = pd.DataFrame(data={'importance': importance_collapsed, 'feature': features_collapsed}).sort_values('importance', ascending=False)
sns.set(rc={'figure.figsize':(20,8.5)})
ax = sns.barplot(data=df_importances_collapsed, x='feature', y='importance')
ax.set_xticklabels(ax.get_xticklabels(), rotation=80);
gs_cv_xgb_c.best_estimator_.score(df_enc_hot, y_label)
XGBoost ranks the features differently. This is good: it suggests XGBoost is finding different relationships than Random Forest, which will be useful later when the classifiers are stacked. XGBoost found employment status to be the most important feature, and policy, day of the week, gender, policy level, and the day-of-the-year cyclical component to be the least important. XGBoost uses gradient boosting, which prioritizes correcting erroneous predictions, rather than the bagging that Random Forest uses (a minimal sketch of the contrast follows). We will prioritize optimizing a Random Forest classifier and therefore drop the features least useful to Random Forest.
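A minimal sketch of the contrast, using stock scikit-learn estimators rather than the tuned models above: bagging fits independent trees on bootstrap resamples and averages their votes, while gradient boosting fits trees sequentially, each one targeting the remaining errors of the ensemble so far.
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
# bagging: independent trees on bootstrap samples, predictions averaged
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42)
# boosting: trees added sequentially, each correcting the current errors
boosted = GradientBoostingClassifier(n_estimators=50, random_state=42)
# either can then be fit the usual way, e.g. bagged.fit(df_enc_hot, y_label)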
std_drop = ['State_California', 'State_Nevada', 'State_Oregon', 'State_Washington',
'Number of Open Complaints', 'Coverage', 'Policy Level', 'Policy_Personal', 'Policy_Special', 'Month']
df_std.drop(std_drop, axis=1, inplace=True)
hot_drop = ['State_1', 'State_2', 'State_3', 'State_4', 'State_5',
'Number of Open Complaints', 'Coverage', 'Policy Level', 'Policy_1',
'Policy_2', 'Policy_3', 'Month']
df_enc_hot.drop(hot_drop, axis=1, inplace=True)
X_train_hot, X_test_hot, y_train_hot, y_test_hot = train_test_split(df_enc_hot, y_label, test_size=0.25, random_state=42)
X_train_std, X_test_std, y_train_std, y_test_std = train_test_split(df_std, y_label, test_size=0.25, random_state=42)
svc_c = SVC(random_state=42)
svc_c_param = {'gamma': ['scale', 'auto'],
'C': [0.9, 1, 1.1]}
gs_cv_svc_c = GridSearchCV(svc_c, svc_c_param, scoring=['accuracy', 'f1_macro'], refit='f1_macro')
gs_cv_svc_c.fit(X_train_std, y_train_std)
print(f'Training Data Accuracy: {gs_cv_svc_c.best_estimator_.score(X_train_std, y_train_std)}')
print(f'Testing Data Accuracy: {gs_cv_svc_c.best_estimator_.score(X_test_std, y_test_std)}')
The SVM classifier did adequately, with not too much overfitting and about 90% accuracy on the test data. Although the default RBF kernel allows non-linear decision boundaries, the SVM is not the optimal classifier for our application; it does, however, make a good baseline for future comparison.
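As a hedged check on the kernel's role, one can compare a linear kernel against the tuned RBF model on the same split; a large gap would point to non-linear structure in the data.
svc_lin = SVC(kernel='linear', random_state=42).fit(X_train_std, y_train_std)
print(f'Linear kernel test accuracy: {svc_lin.score(X_test_std, y_test_std)}')
print(f'RBF kernel test accuracy:    {gs_cv_svc_c.best_estimator_.score(X_test_std, y_test_std)}')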
nb_c = GaussianNB()
nb_c_param = {'var_smoothing': [1e-10, 1e-09, 1e-08]}  # original grid repeated 1e-8; distinct values assumed
gs_cv_nb_c = GridSearchCV(nb_c, nb_c_param, scoring=['accuracy', 'roc_auc_ovo', 'f1_macro'], refit='f1_macro')
gs_cv_nb_c.fit(X_train_std, y_train_std)
print(f'Training Data Accuracy: {gs_cv_nb_c.best_estimator_.score(X_train_std, y_train_std)}')
print(f'Testing Data Accuracy: {gs_cv_nb_c.best_estimator_.score(X_test_std, y_test_std)}')
The naive Bayes classifier performed very poorly, obtaining only about 40% accuracy on either dataset. The classifier assumes a Gaussian distribution for each feature, which we know is untrue for many of the features; there are several count variables, for example.
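To illustrate the violated assumption, a quick skewness scan of the standardized training features; strongly skewed columns, such as counts, are far from Gaussian. The scipy import is assumed available, as scikit-learn already depends on it.
from scipy.stats import skew
feature_skew = X_train_std.apply(skew).sort_values(key=abs, ascending=False)
print(feature_skew.head(10))  # most skewed, i.e. least Gaussian, features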
mlp_c = MLPClassifier()
mlp_c_param = {'learning_rate': ['adaptive'],
'max_iter': [500],
'hidden_layer_sizes': [[64], [64, 32], [128, 64]]}
gs_cv_mlp_c = GridSearchCV(mlp_c, mlp_c_param, scoring=['accuracy', 'roc_auc_ovo', 'f1_macro'], refit='f1_macro')
gs_cv_mlp_c.fit(X_train_std, y_train_std)
print(f'Training Data Accuracy: {gs_cv_mlp_c.best_estimator_.score(X_train_std, y_train_std)}')
print(f'Testing Data Accuracy: {gs_cv_mlp_c.best_estimator_.score(X_test_std, y_test_std)}')
The multi-layer perceptron fit the training data perfectly and suffered only a small amount of overfitting, giving about 99% accuracy on the test set. The hyperparameter tuned was the network architecture parameter 'hidden_layer_sizes', which governs how many hidden layers there are and how many units are in each layer. The MLP is essentially a deep learning model and as such has a strong capacity to fit highly non-linear relationships. The model's success over the SVM classifier is likely due in part to the presence of highly non-linear relationships in our data.
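For reference, the architecture the grid search settled on, along with its cross-validated score, can be read off the fitted search object.
print(gs_cv_mlp_c.best_params_)
print(f'Best CV f1_macro: {gs_cv_mlp_c.best_score_}')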
xgb_c = XGBClassifier(use_label_encoder=False)
xgb_c_param = {'n_estimators': [200],
'eval_metric': ['logloss'],
'max_depth': [10, 14, 18],
'subsample': [0.7, 0.8, 0.9]}
gs_cv_xgb_c = GridSearchCV(xgb_c, xgb_c_param, scoring=['accuracy', 'roc_auc_ovo', 'f1_macro'], refit='f1_macro')
gs_cv_xgb_c.fit(X_train_hot, y_train_hot);
print(f'Training Data Accuracy: {gs_cv_xgb_c.best_estimator_.score(X_train_hot, y_train_hot)}')
print(f'Testing Data Accuracy: {gs_cv_xgb_c.best_estimator_.score(X_test_hot, y_test_hot)}')
XGBoost performed very well, fitting perfectly on the training data and achieving 99.65% accuracy on the test data. The hyperparameters tuned were the depth of the trees and the fraction of the data used for subsampling.
rf_c = RandomForestClassifier(random_state=42)
rf_c_param = { 'n_estimators': [250],
'max_depth': [10, 15, 20],
'min_samples_leaf': [1, 2, 3]}
gs_cv_rf_c = GridSearchCV(rf_c, rf_c_param, scoring=['accuracy', 'roc_auc_ovo', 'f1_macro'], refit='f1_macro')
gs_cv_rf_c.fit(X_train_hot, y_train_hot)
gs_cv_rf_c.best_params_
print(f'Training Data Accuracy: {gs_cv_rf_c.best_estimator_.score(X_train_hot, y_train_hot)}')
print(f'Testing Data Accuracy: {gs_cv_rf_c.best_estimator_.score(X_test_hot, y_test_hot)}')
The random forest classifier is wildly successful. With a perfect fit on the training data and 99.78% accuracy on the test data, the model barely overfit. The work put in during the encoding and feature engineering stages, which was partially targeted at optimizing decision tree classifiers, has proven very useful. The hyperparameters optimized were the max depth of the trees and the minimum number of samples required at a leaf node.
estimators_r = [('mlp', gs_cv_mlp_c.best_estimator_), ('xgb', gs_cv_xgb_c.best_estimator_), ('rf', gs_cv_rf_c.best_estimator_)]
stack_r = StackingClassifier(estimators=estimators_r,
final_estimator=RandomForestClassifier(random_state=42 , **gs_cv_rf_c.best_params_),
passthrough=True)
stack_r.fit(X_train_hot, y_train_hot);
print(f'Training Data Accuracy: {stack_r.score(X_train_hot, y_train_hot)}')
print(f'Testing Data Accuracy: {stack_r.score(X_test_hot, y_test_hot)}')
Employing a stacked classifier improved our results to 100% accuracy on the test data. The stacked classifier is an ensemble of our three best models (MLP, XGBoost, Random Forest) whose predictions, together with the original features (passthrough=True), are passed to a final Random Forest classifier. The final Random Forest has the same hyperparameters as our previously tuned Random Forest, and the stack is likewise trained on the training data. The classifier has high model complexity and appears to capture all the learnable relationships in our data, reaching the ceiling for this application.
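Since a 100% test score invites scrutiny, a sanity check worth running (slow, as it refits the stack several times) is to cross-validate the same stack configuration on the training portion.
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(stack_r, X_train_hot, y_train_hot, scoring='accuracy')
print(f'CV accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}')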
The use of ensemble models was critical to the success of our predictive modeling. The best approach was a stacked classifier composed of the best pre-trained models, which achieved a perfect 100% accuracy on the test data. It is possible to employ an ensemble model without a decision tree: in a stacked classifier, the results of the base models are fed to a final model for classification, and neither the base models nor the final model need be decision-tree based. In our implementation, two of the three base models were decision-tree based, as was the final model. An example of a fully non-decision-tree ensemble is given below, where the base models are the three non-ensemble models from earlier and the final model is a default logistic regression classifier.
estimators_r = [('mlp', gs_cv_mlp_c.best_estimator_), ('nb', gs_cv_nb_c.best_estimator_), ('svc', gs_cv_svc_c.best_estimator_)]
stack_r = StackingClassifier(estimators=estimators_r,
passthrough=False)
stack_r.fit(X_train_std, y_train_std);
print(f'Training Data Accuracy: {stack_r.score(X_train_std, y_train_std)}')
print(f'Testing Data Accuracy: {stack_r.score(X_test_std, y_test_std)}')
The non-decision-tree ensemble classifier did very well, with 98.90% accuracy on the test set, but it is still not as powerful as the decision-tree based ensemble, which performed perfectly.