Ensemble Classifiers

Introduction

This project aims to demonstrate the effectiveness of ensembles classifiers by comparing and contrasting several classifier families. The problem space, which consists of a simple dataset, insurance data, and a simple target, binary classification, is pushed to its limit to give ultra high test set accuracy.

Requirements

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import math
import category_encoders as ce
from collections import defaultdict
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

Load and preprocess data

Read data to csv and coerce data type.

In [3]:
pstr = pd.StringDtype()
dtype_dict = {'State': pstr, 'Customer Lifetime Value ': np.float64, 'Response': pstr, 'Coverage': pstr, 'Education': pstr,
              'EmploymentStatus': pstr, 'Gender': pstr, 'Income': int, 'Location Code': pstr, 'Marital Status': pstr,
              'Monthly Premium Auto': int, 'Months Since Last Claim': int, 'Number of Open Complaints': int, 
              'Number of Policies': int, 'Policy': pstr, 'Renew Offer Type': pstr, 'Sales Channel': pstr,
              'Total Claim Amount': np.float64, 'Vehicle Class': pstr}

df_raw = pd.read_csv("Customer relationship marketing (CRM).csv", parse_dates=['Effective To Date'], dtype=dtype_dict)
In [5]:
df_raw.head()
Out[5]:
State Customer Lifetime Value Response Coverage Education Effective To Date EmploymentStatus Gender Income Location Code Marital Status Monthly Premium Auto Months Since Last Claim Number of Open Complaints Number of Policies Policy Renew Offer Type Sales Channel Total Claim Amount Vehicle Class
0 Washington 2763.519279 No Basic Bachelor 2011-02-24 Employed F 56274 Suburban Married 69 32 0 1 Corporate L3 Offer1 Agent 384.811147 Two-Door Car
1 Arizona 6979.535903 No Extended Bachelor 2011-01-31 Unemployed F 0 Suburban Single 94 13 0 8 Personal L3 Offer3 Agent 1131.464935 Four-Door Car
2 Nevada 12887.431650 No Premium Bachelor 2011-02-19 Employed F 48767 Suburban Married 108 18 0 2 Personal L3 Offer1 Agent 566.472247 Two-Door Car
3 California 7645.861827 No Basic Bachelor 2011-01-20 Unemployed M 0 Suburban Married 106 18 0 7 Corporate L2 Offer1 Call Center 529.881344 SUV
4 Washington 2813.692575 No Basic Bachelor 2011-02-03 Employed M 43836 Rural Single 73 12 0 1 Personal L1 Offer1 Agent 138.130879 Four-Door Car

The data set consists of 19 features and one binary label variable, response. The objective is to predict a customers response to an insurance policy offer via the accompanying feature variables.

In [4]:
df_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9134 entries, 0 to 9133
Data columns (total 20 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   State                      9134 non-null   string        
 1   Customer Lifetime Value    9134 non-null   float64       
 2   Response                   9134 non-null   string        
 3   Coverage                   9134 non-null   string        
 4   Education                  9134 non-null   string        
 5   Effective To Date          9134 non-null   datetime64[ns]
 6   EmploymentStatus           9134 non-null   string        
 7   Gender                     9134 non-null   string        
 8   Income                     9134 non-null   int64         
 9   Location Code              9134 non-null   string        
 10  Marital Status             9134 non-null   string        
 11  Monthly Premium Auto       9134 non-null   int64         
 12  Months Since Last Claim    9134 non-null   int64         
 13  Number of Open Complaints  9134 non-null   int64         
 14  Number of Policies         9134 non-null   int64         
 15  Policy                     9134 non-null   string        
 16  Renew Offer Type           9134 non-null   string        
 17  Sales Channel              9134 non-null   string        
 18  Total Claim Amount         9134 non-null   float64       
 19  Vehicle Class              9134 non-null   string        
dtypes: datetime64[ns](1), float64(2), int64(5), string(12)
memory usage: 1.4 MB

The data does not contain any missingness thus no data imputation is necessary. Feature engineering and data encoding follows. For now, we will keep all feature variables, and continue with determining an encoding scheme. When doing predicitve modeling with mixed categorical/metric features, especialy when the feature count and/or the feature cardinalities are high, the choice of data encoding scheme can be very important.

There are a moderately high number of features present, with most features possessing low cardinality. Thus, the curse of dimensionality (CoD) is a moderate concern our encoding scheme must consider. Additionally, we will be preforming classification as the label variable (response) is categorical binary, and are likely to utilize decision trees which have difficulty dealing with large feature counts (sensitive to CoD). The lowest feature class counts are given below.

In [4]:
def column_value_counts(df_in):
    value_table = pd.crosstab(**df_in.melt(var_name='columns', value_name='index'))
    value_table = value_table.melt(var_name='column', value_name='count', ignore_index=False).reset_index()
    value_table = value_table[value_table['count'] != 0].rename(columns={'index': 'value'})
    cols = value_table.columns.to_list()
    value_table = value_table[[cols[1], cols[0], cols[2]]].sort_values('count', ascending=False).reset_index(drop=True)
    
    return value_table

column_value_counts(df_raw[df_raw.columns[df_raw.dtypes == pstr]]).tail(10)
Out[4]:
column value count
41 EmploymentStatus Medical Leave 432
42 EmploymentStatus Disabled 405
43 Policy Corporate L1 359
44 Education Doctor 342
45 EmploymentStatus Retired 282
46 Vehicle Class Luxury SUV 184
47 Policy Special L2 164
48 Vehicle Class Luxury Car 163
49 Policy Special L3 148
50 Policy Special L1 66

All class counts for categorical features are high enough, so there is no need to lump together fringe class values into an ‘other’ category. Gender is binary, and Coverage and Education contain strong ordinal elements, so they will all be label encoded. Policy has the highest cardinality of nine, so we want to avoid one-hot encoding if possible. Policy appears to be a combination of two features, each of cardinality three. Splitting Policy gives a nominal feature, and an ordinal categorical feature, which can be label encoded, thus we save five dimensions over one-hot encoding Policy. Renew Offer Type cannot be assumed to be ordinal despite its appearence. This leaves all unencoded categorical variables which will need to be one hot encoded.

In [5]:
# feature engineering data frame
df_ml = df_raw.copy(deep=True)
df_ml['Policy Level'] = df_ml['Policy'].map(lambda val: val[-1]).astype(int)
df_ml['Policy'] = df_ml['Policy'].map(lambda val: val[:-3]).astype(pstr)
In [6]:
# encoded data
categorical_cols = df_ml.columns[df_ml.dtypes == pstr].to_list()
df_enc = df_ml[categorical_cols].drop(columns=['Response']).copy(deep=True)
y_label = df_ml['Response'].map({'No': 0, 'Yes': 1})

# ordinal encoding
def get_ce_map(col_name, zip_obj):
    return {'col': col_name, 'mapping': {x: y for x, y in zip_obj}}

classes = [['Basic', 'Extended', 'Premium'],
           ['High School or Below', 'College', 'Bachelor', 'Master', 'Doctor']]
rules = [zip(l1, list(range(1, len(l1)+1))) for l1 in classes]
col_list = ['Coverage', 'Education']
col_rule = zip(col_list, rules)
ce_1 = ce.OrdinalEncoder(mapping=[get_ce_map(col, rule) for col, rule in col_rule], cols=col_list)
df_enc = ce_1.fit_transform(df_enc)
df_enc = df_enc.astype({x: 'category' for x in df_enc.columns.to_series()[df_enc.dtypes == pstr].values})

# one-hot (k degrees of freedom)
ce_2 = ce.OneHotEncoder()
df_enc_hot = ce_2.fit_transform(df_enc)

# dummy encoding (k-1 degrees of freedom)
df_enc_dummy = pd.get_dummies(df_enc, drop_first=True)
/home/noah/anaconda3/envs/main/lib/python3.7/site-packages/category_encoders/utils.py:21: FutureWarning: is_categorical is deprecated and will be removed in a future version.  Use is_categorical_dtype instead
  elif pd.api.types.is_categorical(cols):

We have a date variable which varies from January 1st 2011 to February 28th 2011. Date variables often exhibit linear and seasonal trends. To give the date variable the greatest opportunity to demonstrate potential relationships with the target variable it will be decomposed into several linear and cyclical components, the latter of which can be modeled with two sinusoidal components. If the new 'date derived' components fail to account for any significant percentage of the variance of the target variable, they will be removed in feature selection.

In [7]:
def get_date_features(date_vec):
    
    # default data dict
    data_dict = defaultdict()
    
    # sudo constant terms
    data_dict['Days_in_Year'] = [366 if x.is_leap_year else 356 for x in date_vec]
    data_dict['Days_in_Month'] = [getattr(x, 'days_in_month') for x in date_vec]
    
    # linear terms
    data_dict['Month'] = [getattr(x, 'month') for x in date_vec]
    data_dict['Day_of_Month'] = [getattr(x, 'day') for x in date_vec]
    data_dict['Day_of_Week'] = [getattr(x, 'day_of_week') for x in date_vec]
    data_dict['Day_of_Year'] = [getattr(x, 'day_of_year') for x in date_vec]
    
    # cyclical terms
    data_dict['Day_of_Month_Cyclical_1'] = [math.cos(2*math.pi*x/y) for x, y in zip(data_dict['Day_of_Month'], data_dict['Days_in_Month'])]
    data_dict['Day_of_Month_Cyclical_2'] = [math.sin(2*math.pi*x/y) for x, y in zip(data_dict['Day_of_Month'], data_dict['Days_in_Month'])]
    data_dict['Day_of_Week_Cyclical_1'] = [math.cos(2*math.pi*x/7) for x in data_dict['Day_of_Year']]
    data_dict['Day_of_Week_Cyclical_2'] = [math.cos(2*math.pi*x/7) for x in data_dict['Day_of_Year']]
    data_dict['Day_of_Year_Cyclical_1'] = [math.cos(2*math.pi*x/y) for x, y in zip(data_dict['Day_of_Year'], data_dict['Days_in_Year'])]
    data_dict['Day_of_Year_Cyclical_2'] = [math.sin(2*math.pi*x/y) for x, y in zip(data_dict['Day_of_Year'], data_dict['Days_in_Year'])]
    
    return pd.DataFrame(data=data_dict)

The date features dataframe and the mertic data are merged with the feature engineering dataframe, and the encoding dataframes as the date features need no encoding. The ml dataframe keeps track of our features while the encoded dataframes are passed to the models.

In [8]:
df_date_features = get_date_features(df_ml['Effective To Date']).drop(columns=['Days_in_Year', 'Days_in_Month'])
df_ml = df_ml.drop(['Effective To Date'], axis=1)
df_ml = df_ml.join(df_date_features)
df_enc_hot = df_enc_hot.join(df_ml.drop(categorical_cols + ['Response'], axis=1))
df_enc_dummy = df_enc_dummy.join(df_ml.drop(categorical_cols + ['Response'], axis=1))

We have to consider the assumptions of the ML models we will use and some ml models require metric data be standardized. These models also require dummy encoding (each feature to k-1 features) instead of one-hot encoding (each feature to k features). This is to remove redundant information. but many models such as decision tree models can benefit from keeping all the columns.

In [9]:
# standardized data
df_std = pd.DataFrame(data=StandardScaler().fit_transform(X=df_enc_dummy), columns=df_enc_dummy.columns.to_list())

Feature importance

Random Forest

In [10]:
rf_c = RandomForestClassifier(random_state=42)
rf_c_param = { 'n_estimators': [200], 
              'max_depth': [10, 15],
              'min_samples_leaf': [1, 2]}

gs_cv_rf_c = GridSearchCV(rf_c, rf_c_param, scoring=['accuracy', 'roc_auc_ovo', 'f1_macro'], refit='f1_macro')
In [11]:
gs_cv_rf_c.fit(df_enc_hot, y_label);
In [12]:
df_importances = pd.DataFrame(data={'importance': gs_cv_rf_c.best_estimator_.feature_importances_, 'feature': df_enc_hot.columns.to_list()})
In [13]:
df_importances = pd.DataFrame(data={'importance': gs_cv_rf_c.best_estimator_.feature_importances_, 'feature': df_enc_hot.columns.to_list()})
df_importances['features_combine'] = np.array([col[:-2] if col[-2] == '_' else col for col in df_enc_hot.columns])
importance_collapsed = []
features_collapsed = []
for feature in df_importances['features_combine'].unique():
    importance = df_importances['importance'][df_importances['features_combine'] == feature].sum()
    importance_collapsed.append(importance)
    features_collapsed.append(feature)
    
df_importances_collapsed = pd.DataFrame(data={'importance': importance_collapsed, 'feature': features_collapsed}).sort_values('importance', ascending=False)
In [14]:
sns.set(rc={'figure.figsize':(20,8.5)})
ax = sns.barplot(data=df_importances_collapsed, x='feature', y='importance')
ax.set_xticklabels(ax.get_xticklabels(), rotation=80);
2021-10-08T13:04:10.886711 image/svg+xml Matplotlib v3.3.4, https://matplotlib.org/
In [15]:
gs_cv_rf_c.best_estimator_.score(df_enc_hot, y_label)
Out[15]:
1.0

The classifier fit very well, possibly overfit, to the data as it was able to correctly classify all of the data. It appears some of our cyclical date components performed well, and the most important feature was one's employment status. State, policy, month, policy level, and number of complaints were not very useful for the classifier, and may be removed.

In [16]:
remove_features = df_importances_collapsed['feature'][-7:].values

XGBoost

In [17]:
xgb_c = XGBClassifier(use_label_encoder=False)
xgb_c_param = {'n_estimators': [200],
               'eval_metric': ['logloss'],
               'max_depth': [10, 14],
               'subsample': [0.8, 0.9]}

gs_cv_xgb_c = GridSearchCV(xgb_c, xgb_c_param, scoring=['accuracy', 'roc_auc_ovo', 'f1_macro'], refit='f1_macro')
In [18]:
gs_cv_xgb_c.fit(df_enc_hot, y_label);
In [19]:
df_importances = pd.DataFrame(data={'importance': gs_cv_xgb_c.best_estimator_.feature_importances_, 'feature': df_enc_hot.columns.to_list()})
df_importances['features_combine'] = np.array([col[:-2] if col[-2] == '_' else col for col in df_enc_hot.columns])
importance_collapsed = []
features_collapsed = []
for feature in df_importances['features_combine'].unique():
    importance = df_importances['importance'][df_importances['features_combine'] == feature].sum()
    importance_collapsed.append(importance)
    features_collapsed.append(feature)
    
df_importances_collapsed = pd.DataFrame(data={'importance': importance_collapsed, 'feature': features_collapsed}).sort_values('importance', ascending=False)
In [20]:
sns.set(rc={'figure.figsize':(20,8.5)})
ax = sns.barplot(data=df_importances_collapsed, x='feature', y='importance')
ax.set_xticklabels(ax.get_xticklabels(), rotation=80);
2021-10-08T13:05:30.657554 image/svg+xml Matplotlib v3.3.4, https://matplotlib.org/
In [21]:
gs_cv_xgb_c.best_estimator_.score(df_enc_hot, y_label)
Out[21]:
1.0

XGBoost has different suggestions for which features are the most important. This is good and suggests it is finding different relationships than Random Forest which will be useful later when classifiers are stacked together. XGBoost found employment status to be the most important, and policy, day of the week, gender, policy level, and the day of the year cyclical component to be the least important. XGBoost uses gradient boosting instead of bagging which Random Forest uses, which prioritizes correcting erroneous predictions. We will prioritize optimizing a Random Forest classifier and as such drop the features least useful for to Random Forest.

Drop useless features

In [22]:
std_drop = ['State_California', 'State_Nevada', 'State_Oregon', 'State_Washington',
            'Number of Open Complaints', 'Coverage', 'Policy Level', 'Policy_Personal', 'Policy_Special', 'Month']
df_std.drop(std_drop, axis=1, inplace=True)
In [23]:
hot_drop = ['State_1', 'State_2', 'State_3', 'State_4', 'State_5',
            'Number of Open Complaints', 'Coverage', 'Policy Level', 'Policy_1',
            'Policy_2', 'Policy_3', 'Month']
df_enc_hot.drop(hot_drop, axis=1, inplace=True) 

Part 5 - non-ensemble models

In [24]:
X_train_hot, X_test_hot, y_train_hot, y_test_hot = train_test_split(df_enc_hot, y_label, test_size=0.25, random_state=42)
X_train_std, X_test_std, y_train_std, y_test_std = train_test_split(df_std, y_label, test_size=0.25, random_state=42)

Support Vector Machine

In [25]:
svc_c = SVC(random_state=42)
svc_c_param = {'gamma': ['scale', 'auto'], 
              'C': [0.9, 1, 1.1]}

gs_cv_svc_c = GridSearchCV(svc_c, svc_c_param, scoring=['accuracy', 'f1_macro'], refit='f1_macro')
In [26]:
gs_cv_svc_c.fit(X_train_std, y_train_std)
Out[26]:
GridSearchCV(estimator=SVC(random_state=42),
             param_grid={'C': [0.9, 1, 1.1], 'gamma': ['scale', 'auto']},
             refit='f1_macro', scoring=['accuracy', 'f1_macro'])
In [27]:
print(f'Training Data Accuracy: {gs_cv_svc_c.best_estimator_.score(X_train_std, y_train_std)}')
print(f'Testing Data Accuracy: {gs_cv_svc_c.best_estimator_.score(X_test_std, y_test_std)}')
Training Data Accuracy: 0.9214598540145985
Testing Data Accuracy: 0.8966725043782837

The SVM classifier did adequately, with not too much overfitting, and 90 accuracy on the test data set. A SVM classifier is not the optimal classifier for our application as it is only capable of finding linear associations, however it will make a good baseline for future comparison.

Naive Bayes

In [28]:
nb_c = GaussianNB()
nb_c_param = {'var_smoothing': [1e-8, 1e-09, 1e-8]}

gs_cv_nb_c = GridSearchCV(nb_c, nb_c_param, scoring=['accuracy', 'roc_auc_ovo', 'f1_macro'], refit='f1_macro')
In [29]:
gs_cv_nb_c.fit(X_train_std, y_train_std)
Out[29]:
GridSearchCV(estimator=GaussianNB(),
             param_grid={'var_smoothing': [1e-08, 1e-09, 1e-08]},
             refit='f1_macro', scoring=['accuracy', 'roc_auc_ovo', 'f1_macro'])
In [30]:
print(f'Training Data Accuracy: {gs_cv_nb_c.best_estimator_.score(X_train_std, y_train_std)}')
print(f'Testing Data Accuracy: {gs_cv_nb_c.best_estimator_.score(X_test_std, y_test_std)}')
Training Data Accuracy: 0.4137226277372263
Testing Data Accuracy: 0.4172504378283713

The naive bayes classifier preformed very poorly, only able to obtain 40% accuracy in either dataset. The classifier assumes a gaussian distribution for each feature which we know is not true for many of the features, there are several count variables for example.

Multi-Layer Perceptron

In [31]:
mlp_c = MLPClassifier()
mlp_c_param = {'learning_rate': ['adaptive'],
               'max_iter': [500],
               'hidden_layer_sizes': [[64], [64, 32], [128, 64]]}

gs_cv_mlp_c = GridSearchCV(mlp_c, mlp_c_param, scoring=['accuracy', 'roc_auc_ovo', 'f1_macro'], refit='f1_macro')
In [32]:
gs_cv_mlp_c.fit(X_train_std, y_train_std)
Out[32]:
GridSearchCV(estimator=MLPClassifier(),
             param_grid={'hidden_layer_sizes': [[64], [64, 32], [128, 64]],
                         'learning_rate': ['adaptive'], 'max_iter': [500]},
             refit='f1_macro', scoring=['accuracy', 'roc_auc_ovo', 'f1_macro'])
In [33]:
print(f'Training Data Accuracy: {gs_cv_mlp_c.best_estimator_.score(X_train_std, y_train_std)}')
print(f'Testing Data Accuracy: {gs_cv_mlp_c.best_estimator_.score(X_test_std, y_test_std)}')
Training Data Accuracy: 1.0
Testing Data Accuracy: 0.9855516637478109

The multi-layer perceptron fit extremely well to the training data with a perfit fit, and only suffer a small amount of overfitting to give 99% on the test set. The hyperparameter tunned was the network architecture parameter 'hidden_layer_sizes' which governs how many hidden layers there are and how many units are in each layer. The mlp model is essential a deep learning model and as such has a strong capacity to fit to highly non-linear relationships. The model's success over the SVM classifier is likely due in part to the presence of highly non-linear relationships in our data.

Part 6 - ensemble models

XGBoost

In [34]:
xgb_c = XGBClassifier(use_label_encoder=False)
xgb_c_param = {'n_estimators': [200],
               'eval_metric': ['logloss'],
               'max_depth': [10, 14, 18],
               'subsample': [0.7, 0.8, 0.9]}

gs_cv_xgb_c = GridSearchCV(xgb_c, xgb_c_param, scoring=['accuracy', 'roc_auc_ovo', 'f1_macro'], refit='f1_macro')
In [35]:
gs_cv_xgb_c.fit(X_train_hot, y_train_hot);
In [36]:
print(f'Training Data Accuracy: {gs_cv_xgb_c.best_estimator_.score(X_train_hot, y_train_hot)}')
print(f'Testing Data Accuracy: {gs_cv_xgb_c.best_estimator_.score(X_test_hot, y_test_hot)}')
Training Data Accuracy: 1.0
Testing Data Accuracy: 0.9964973730297724

XGBoost preform very well, fitting perfectly on the training data and with a 99.65% on the test data. The hyperparamaters trained were the depth of the trees and the fraction of the data to be used for the subsampling.

Random Forest

In [37]:
rf_c = RandomForestClassifier(random_state=42)
rf_c_param = { 'n_estimators': [250], 
              'max_depth': [10, 15, 20],
              'min_samples_leaf': [1, 2, 3]}

gs_cv_rf_c = GridSearchCV(rf_c, rf_c_param, scoring=['accuracy', 'roc_auc_ovo', 'f1_macro'], refit='f1_macro')
In [38]:
gs_cv_rf_c.fit(X_train_hot, y_train_hot)
Out[38]:
GridSearchCV(estimator=RandomForestClassifier(random_state=42),
             param_grid={'max_depth': [10, 15, 20],
                         'min_samples_leaf': [1, 2, 3], 'n_estimators': [250]},
             refit='f1_macro', scoring=['accuracy', 'roc_auc_ovo', 'f1_macro'])
In [39]:
gs_cv_rf_c.best_params_
Out[39]:
{'max_depth': 20, 'min_samples_leaf': 1, 'n_estimators': 250}
In [40]:
print(f'Training Data Accuracy: {gs_cv_rf_c.best_estimator_.score(X_train_hot, y_train_hot)}')
print(f'Testing Data Accuracy: {gs_cv_rf_c.best_estimator_.score(X_test_hot, y_test_hot)}')
Training Data Accuracy: 1.0
Testing Data Accuracy: 0.9978108581436077

The random forest classifer is wildly successful. With a perfect fit on the training data and 99.78 % accuracy on the test data, the model barely overfit. The work put in during the encoding and feature engineering stages, which was partially targeted at optimising decision tree classifiers, has proven very useful. The hyerparameters optimised were the max depth of the trees and the minimum number of samples required to be at a leaf node.

Stacked Classifier

In [41]:
estimators_r = [('mlp', gs_cv_mlp_c.best_estimator_), ('xgb', gs_cv_xgb_c.best_estimator_), ('rf', gs_cv_rf_c.best_estimator_)]
stack_r = StackingClassifier(estimators=estimators_r,
                             final_estimator=RandomForestClassifier(random_state=42 , **gs_cv_rf_c.best_params_),
                             passthrough=True)
stack_r.fit(X_train_hot, y_train_hot);
In [42]:
print(f'Training Data Accuracy: {stack_r.score(X_train_hot, y_train_hot)}')
print(f'Testing Data Accuracy: {stack_r.score(X_test_hot, y_test_hot)}')
Training Data Accuracy: 1.0
Testing Data Accuracy: 1.0

Employing a stacked classifier improved our results significantly to where we are now achieving 100% accuracy on the test data. The stacked classifier is an ensemble of our three best models (mlp, xgb, rf) which is then passed to another Random Forest classifier. The final random forest classifier has the same hyperparameters as our previously trained Random Forest classifier, and the classifier is also trained on the training data as well. The classifier has a high model complexity and appears to find all valid relationships within our data, thus reaching the limits of perfection in this application.

Conclusion

The use of ensemble models was critical to the success of our predictive modeling. The best approach was a stacked classifier composed of the best pre-trained models, which achieved a perfect accuracy of 100% on the test data. It is possible to employ an ensemble model without a decision tree and the best example of that is a stacked classifier in which the result of base models are feed to a final model for classification. Both the base models and the final model need not be deecion tree based models. In our implementation two of the three base models were decison tree based models, and the final model was also a descion tree based model. An example of a fully non-decsion tree based ensemble model is given below, where the base models are the three non-ensemble models from earlier, and the final model is a default logistic classifier.

In [43]:
estimators_r = [('mlp', gs_cv_mlp_c.best_estimator_), ('nb', gs_cv_nb_c.best_estimator_), ('svc', gs_cv_svc_c.best_estimator_)]
stack_r = StackingClassifier(estimators=estimators_r,
                            passthrough=False)
stack_r.fit(X_train_std, y_train_std);
In [44]:
print(f'Training Data Accuracy: {stack_r.score(X_train_std, y_train_std)}')
print(f'Testing Data Accuracy: {stack_r.score(X_test_std, y_test_std)}')
Training Data Accuracy: 1.0
Testing Data Accuracy: 0.98861646234676

The non decision tree based ensemble classifier did very very with 98.90% accuracy on the test set, but it is still not as powerful as the descion tree based ensemble classifier which preformed perfectly.