This notebook was written by Nicholas Rosas
LinkedIn: https://www.linkedin.com/in/nickrosas96/
GitHub: https://github.com/nr96
The purpose of this notebook is to demonstrate the data mining process used to implement Machine Learning in my MLB Daily Hitter program. The program was originally created to streamline and automate my player selection process for MLB's Beat The Streak seasonal competition, in which participants compete for a grand prize of $5.6 million by attempting to build a streak of 57 correctly selected MLB players, each earning a hit in a single MLB game. Although the models demonstrated here are not the exact models used in my program, the process and final outcomes are very similar.
Q: What is the problem to be solved?
A: Given a collection of features, or stats, for a player prior to game start, can we predict whether the individual will get a hit in the game?
Q: How will this problem be approached?
A: As our dataset contains labeled data and our target variable is one of two classes, we will approach this problem as a Supervised Classification problem and will utilize several Machine Learning algorithms appropriate to the problem in order to create models that will make a prediction on a player.
Q: How will performance be judged?
A: We will judge individual model performance using a confusion matrix, and hyperparameters will be tuned using several performance metrics, such as accuracy, f1 score, ROC, and AUC (a small sketch of these metrics appears after this Q&A).
Q: Are methods utilized fully explained?
A: Although I try to briefly explain the various data science and machine learning methods and processes utilized throughout the notebook, I do not explain them in great detail. This notebook is only meant to be a demonstration of the data science process used to solve a real world baseball problem.
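To make these metrics concrete before we start, here is a minimal sketch using hypothetical labels and predictions (the arrays below are illustrative only, not our data):
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, roc_auc_score
y_true = [0, 1, 1, 0, 1] # hypothetical actual outcomes (1 = got a hit)
y_pred = [0, 1, 0, 0, 1] # hypothetical model predictions
print(confusion_matrix(y_true, y_pred)) # rows = actual class, columns = predicted class
print(accuracy_score(y_true, y_pred)) # fraction of predictions that were correct
print(f1_score(y_true, y_pred)) # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_pred)) # area under the ROC curve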
Before we begin our data analysis, we must first ensure all required libraries are installed and imported correctly.
!type python3 # check directory location of python environment
!/home/ec2-user/anaconda3/envs/python3/bin/python3 -m pip install xgboost # Download / install xgboost
The very first thing we do is load the various python modules used as a part of our workflow. These modules give us extra functionality to import the data, clean it up and format it, and then build, evaluate and draw the various machine learning models.
import xgboost as xgb # Warning, XGBoost must first be installed before importing
import pandas as pd # pandas is used to load and manipulate data
import numpy as np # numpy is used to calculate the mean and standard deviation
from sklearn.model_selection import train_test_split # splits data into training and testing sets
from sklearn import preprocessing # scales and centers data
from sklearn.decomposition import PCA # to perform PCA to plot the data
from sklearn.model_selection import GridSearchCV # this will do cross validation and help tune hyperparameters
from sklearn.model_selection import cross_val_score # for cross validation
from sklearn.metrics import confusion_matrix, plot_confusion_matrix # to make confusion matrices
from sklearn.metrics import balanced_accuracy_score, roc_auc_score, make_scorer # various scoring metrics
from sklearn.metrics import classification_report, accuracy_score # more scoring metrics
from sklearn.metrics import matthews_corrcoef, plot_roc_curve # even more scoring metrics
import matplotlib.pyplot as plt # matplotlib is for drawing graphs
import matplotlib.colors as colors
from sklearn.utils import resample # downsample the dataset
from sklearn.tree import plot_tree # to draw classification tree
from sklearn.neural_network import MLPClassifier # this will make a Neural network
from sklearn.tree import DecisionTreeClassifier # this will make a Classification tree
from sklearn.svm import SVC # this will make a Support Vector Machine for classification
from sklearn.neighbors import KNeighborsClassifier # this will make a K Nearest Neighbors model
from sklearn.ensemble import RandomForestClassifier # this will make a Random Forest model
import os
import boto3
import re
import sagemaker # AWS
from sagemaker import get_execution_role # AWS
Before we begin creating our ML Models, we want to make sure the data being used is high quality and informative. We can achieve this via data cleaning, data transformation and data reduction. Data preparation or preprocessing is the most important step in any data science workflow, as the quality of our output will directly match the quality of our input.
Before we can do anything with our data, we must first load it into a pandas data frame. When pandas (pd) reads in data, it returns a data frame, which is a lot like a spreadsheet. The data are organized in rows and columns and each row can contain a mixture of text and numbers. The standard variable name for a data frame is the initials df, and that is what we will use here:
# load data from AWS S3 Bucket
role = get_execution_role()
region = boto3.Session().region_name
bucket = 'sagemaker-studio-298378432906-ea5batqypv8'
data_key = 'mlb_hitters_2018.csv'
bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region,bucket)
data_location = 's3://{}/{}'.format(bucket, data_key)
df = pd.read_csv(data_location,header=0) # load data into pandas df
df.head() # preview first 5 rows of data
In our Data Frame we see a bunch of columns for the various stats collected for each player. The columns are...
We begin preparing our data by dropping uninformative columns that don't tell us anything about a player's performance, such as Date and BatterID. But before we do that, we'll save a copy of the original unedited data frame in case we ever need to reference the original data or start from a clean state.
cols = [0,1,2,18,19,20] # columns to drop
og_df = df.copy() # store a copy of the original df
df.drop(df.columns[cols], axis=1, inplace=True) # drop unnecessary columns from df
df.head() # preview edited df
We now see that the number of columns has dropped from 74 to 68, so we have successfully removed the 6 uninformative columns from our df.
og_df.head()
And a copy of the original dataframe has correctly been saved as og_df. Next we will also drop several less informative columns as part of our features selection process.
Utilizing our domain knowledge, we know DK (DraftKings) and FD (FanDuel) are two ways to score players' fantasy points. Although these scoring metrics are slightly different, the information gain from each should be extremely similar. By keeping both, we could be adding a lot of unnecessary noise for our algorithms to decipher. So let's check the correlation coefficients to validate our hypothesis.
print(df[['H/A_FD', 'H/A_DK','Hit']].corr()['Hit'].to_string() + '\n')
print(df[['last7_FD', 'last7_DK','Hit']].corr()['Hit'].to_string() + '\n')
print(df[['last15_FD', 'last15_DK','Hit']].corr()['Hit'].to_string() + '\n')
print(df[['last30_FD', 'last30_DK','Hit']].corr()['Hit'].to_string() + '\n')
After checking the correlation coefficients of all FD and DK features, we can see our hypothesis is correct and both provide similar amounts of information gain. However, DK seems to have a slight edge so we'll keep that and drop all FD features.
Feature selection is one of the most important steps of data preparation that can have a big impact on model performance. There are many ways to go about selecting which features to keep and which to discard. Discussing, testing, and showcasing these various methods would be a bit excessive for the purpose of this notebook, so for brevity's sake, I'll be providing a list of features to drop that I've concluded on using a combination of domain knowledge, standard feature selection methods, and model performance testing.
cols = ['H/A_AB', 'vsP_AB', 'vsP_BB',
'H/A_FD', 'last7_FD', 'last15_FD', 'last30_FD',
'H/A_RBI', 'last7_RBI', 'last15_RBI', 'last30_RBI', 'vsP_RBI',
'H/A_H', 'last7_H', 'last15_H', 'last30_H',
'last7_IBB', 'last15_IBB', 'last30_IBB', 'H/A_IBB', 'vsP_IBB',
'last7_HR','last15_HR','last30_HR', 'vsP_HR','H/A_HR']
df = df.drop(cols,axis=1)
df.head()
We can see our modified df no longer has the unnecessary data and now only has 42 columns.
One of the biggest parts of any data science project is making sure that the data is correctly formatted and fixing it if it is not. The first part of this process is identifying and dealing with missing data.
Missing Data is simply a blank space, or a surrogate value like NA, that indicates that we failed to collect data for one of the features. For example, if we forgot to ask someone's age, or forgot to write it down, then we would have a blank space in the dataset for that person's age.
There are two main ways to deal with missing data: remove the rows (or columns) that contain missing values, or impute the missing values with a sensible estimate such as the column mean or median. A minimal sketch of both appears below.
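As a minimal sketch of both remedies (the toy frame below is hypothetical, not part of our dataset):
import pandas as pd
import numpy as np
toy = pd.DataFrame({'age': [23, np.nan, 31], 'hits': [1, 0, 1]}) # toy frame with one missing age
dropped = toy.dropna() # option 1: remove rows containing missing values
imputed = toy.fillna({'age': toy['age'].mean()}) # option 2: impute with the column mean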
In this section, we'll focus on identifying missing values in the dataset.
First, let's see what sort of data is in each column.
pd.set_option('display.max_column', 42) # expand display to show all 42 columns
pd.set_option('display.max_rows', 50) # expand display to show up to 50 rows
pd.set_option('display.min_rows', None) # with min_rows unset, truncated output shows max_rows rows (25 head + 25 tail)
df # display first 25 and last 25 rows of data
df.describe() # generate descriptive statistics
By briefly grazing through our data frame, we can see our data has a wide range of values, some ints and some floats, and 0 is a common value across numerous features. It's important to know whether 0's found in your data are legitimate values or simply indicators of missing data. In our case, 0 is a legitimate value. However, there are a few features where 0 should never appear, such as last30_AB, so we'll check the appropriate columns to make sure they don't contain such outlier values. We'll also check our entire dataframe for standard missing values like NA or NaN.
# check the specified columns for values == 0
len(df.loc[ (df['H/A_PA'] == 0 ) | (df['vsP_PA'] == 0 ) | (df['last7_AB'] == 0) | (df['last7_PA'] == 0) |
(df['last15_AB'] == 0) | (df['last15_PA'] == 0) | (df['last30_AB'] == 0) | (df['last30_PA'] == 0 )])
df.isnull().sum() # check all features for missing values
We can see from our output that no missing NaN values were found in the data frame, and no unexpected 0 values were found either. Since we now know there are no missing values in our data, we'll double-check our data types to ensure nothing unexpected appears.
df.dtypes
We can see that all of our data is currently being read as either float64 or int64 variables, which is correct and expected. However, several of our float64 features are meant to be int64 types, so let's change those variable types.
cols = ['H/A_BB', 'H/A_PA', 'H/A_SO',
'last7_AB', 'last7_BB', 'last7_PA', 'last7_SO',
'last15_AB', 'last15_BB', 'last15_PA', 'last15_SO',
'last30_AB', 'last30_BB', 'last30_PA', 'last30_SO',
'vsP_PA', 'vsP_SO','Hit'] # columns to be changed
df[cols] = df[cols].astype(int) # update specified columns type to int
df[cols].dtypes
We see our data now has the correct typing.
It's important to remember data can come in a variety of types, such as numerical (ints and floats), categorical, ordinal, dates, and free-form text.
However, most Machine Learning algorithms are meant to only handle numerical data. Because of this, it is important to always convert non-numerical data types into numerical data types. This can be done in several ways depending on the original data type, such as one-hot encoding for unordered categories or label encoding for ordered ones.
Our dataframe currently only contains numerical data, so there is no need to apply these methods here, but data conversion is an important step to go through as a part of any data science process; a toy example is sketched below.
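Purely for illustration, here is what the two encodings might look like if our data had a hypothetical text column like team (this column does not exist in our df):
example = pd.DataFrame({'team': ['NYY', 'BOS', 'NYY']}) # hypothetical categorical column
one_hot = pd.get_dummies(example, columns=['team']) # one-hot encoding: one 0/1 column per team
label_encoded = example['team'].astype('category').cat.codes # label encoding: one integer per team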
Some Machine Learning models, like Support Vector Machines, are great with small datasets but not great with large ones, and this dataset, while not huge, is big enough to take a long time to optimize with Cross Validation. So we'll downsample both categories, players who did and did not get a hit, to 2,000 each. At the same time, this will keep our dataset balanced, which is important to prevent our models from overfitting in one direction by simply favoring the majority class.
First, let's remind ourselves how many players are in the dataset...
len(df) # get current number of rows/players in df
df_no_hit = df[df['Hit'] == 0] # separate players who did not get a hit
df_hit = df[df['Hit'] == 1] # separate players who did get a hit
len(df_no_hit) # get number of players who did not get a hit
len(df_hit) # get number of players who did get a hit
# downsample df to only include 2000 players who did not get a hit
df_no_hit_downsampled = resample(df_no_hit,
replace=False,
n_samples=2000,
random_state=42)
len(df_no_hit_downsampled)
# downsample df to only include 2000 players who did get a hit
df_hit_downsampled = resample(df_hit,
replace=False,
n_samples=2000,
random_state=42)
len(df_hit_downsampled)
# finally, combine the two separate downsampled data frames into a single df
df_downsample = pd.concat([df_no_hit_downsampled, df_hit_downsampled])
len(df_downsample)
Now that we have preprocessed our data, we are ready to start formatting the data for making models.
The first step is to split the data into two parts: the columns we'll use to make predictions, and the column we want to predict. We will use the conventional notation of X to represent the independent variables, the columns of data that we will use to make classifications, and y to represent the dependent variable, the thing we want to predict. In this case, we want to predict Hit, whether or not a player will get a hit in a game.
NOTE: In the code below we are using copy() to copy the data by value. By default, pandas uses copy by reference. Using copy() ensures that the original data df_downsample is not modified when we modify X or y. So if we make a mistake when formatting the columns for our classification models, we can just re-copy df_downsample rather than reload and re-process the original data.
X = df_downsample.drop('Hit', axis=1).copy() # separate our independent variables
X.head()
y = df_downsample['Hit'].copy() # separate our dependent variable
y.head()
Now that our variables are correctly defined, we'll once again split the data, this time into training and testing sets. Doing this ensures that we have not only a way to train our Machine Learning models, but a way to evaluate them as well. This is akin to giving a student a practice test or study guide to learn from, then evaluating them on a real test with similar problems.
## Split the data into training and testing data,
## a random state variable is also passed for replication purposes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 25)
Several of the algorithms we'll use work best when the data provided is centered and scaled prior to training. The Radial Basis Function (RBF) that we are using with our Support Vector Machine assumes that the data is centered and scaled. In other words, each column should have a mean value = 0 and a standard deviation = 1. Our Neural Network algorithm also performs best with centered and scaled data as it utilizes Gradient Descent. If features are not scaled properly, the various ranges of features can cause inconsistencies in the steps taken during Gradient Descent.
So, to get the best out of our models, we will scale both the training and testing datasets. Specifically, we'll split the data into training and testing datasets and then scale them separately to avoid Data Leakage. Data Leakage occurs when information about the training dataset corrupts or influences the testing dataset.
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
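As an optional sanity check, assuming the scaler fit above, each column of the scaled training data should now have a mean of roughly 0 and a standard deviation of roughly 1:
print(X_train_scaled.mean(axis=0).round(3)) # ~0 for every column
print(X_train_scaled.std(axis=0).round(3)) # ~1 for every column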
Now that we have cleaned, downsampled, split and scaled the data, we can begin constructing our Machine Learning models.
Support Vector Machines work by plotting the training data points of two classes in feature space and then finding the best hyperplane between the data points to separate the distinct classes. The best hyperplane is chosen by maximizing the distance, or margin, between the hyperplane and the points of both classes. The data points closest to the margin are referred to as Support Vectors, as moving them would also move the hyperplane, while moving other data points would not affect it.
Now, let's move on towards making a preliminary Support Vector Machine.
clf_svm = SVC(random_state=15) # construct a SVM model
clf_svm.fit(X_train_scaled, y_train) # fit data to model
# output results of model
plot_confusion_matrix(clf_svm,
X_test_scaled,
y_test,
values_format='d',
display_labels=["Did not get a hit", "Got a hit"])
In the confusion matrix, we see that of the 489 players that did not get a hit, 271 (55%) were correctly classified. And of the 511 players that did get a hit, 289 (56%) were correctly classified. So the Support Vector Machine was only okay. Let's try to improve our predictions using Cross Validation to optimize the parameters.
Optimizing a Support Vector Machine is all about finding the best hyperparameter values for gamma and C. So let's see if we can find better values than the preliminary Support Vector Machine's defaults using cross validation, in hopes of improving accuracy on the Testing Dataset.
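To build some intuition for gamma, here is a minimal sketch of the RBF kernel itself (the two points are hypothetical; the formula is the standard K(x, z) = exp(-gamma * ||x - z||^2) that SVC uses under the hood):
import numpy as np
def rbf_kernel(x, z, gamma):
    # similarity decays with squared distance; larger gamma means faster decay
    return np.exp(-gamma * np.sum((x - z) ** 2))
x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(rbf_kernel(x, z, gamma=0.12)) # 0.12 is the value our grid search settles on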
Since we have two parameters to optimize, we will use GridSearchCV(). We specify several potential values for gamma and C, and GridSearchCV() tests all possible combinations of the parameters for us.
num_features = np.size(X_train_scaled, axis=1) # get number of features
param_grid = {
'C': [0.5, 1, 10,],
'gamma': ['scale', 1/num_features, 1, 0.5, 0.25, 0.12, 0.05,],
}
# we are including C=1 and gamma='scale' as they are default values.
optimal_params = GridSearchCV(
SVC(random_state=15),
param_grid,
cv=3,
scoring='f1_micro',
## For more scoring metrics see:
## https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
verbose=0 # if you want to see what Grid Search is doing, set verbose=2
)
optimal_params.fit(X_train_scaled, y_train) # find best params
print(optimal_params.best_params_) # output best params
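In addition to best_params_, GridSearchCV() also exposes the cross-validated score achieved by that best combination, which is handy when comparing search rounds:
print(optimal_params.best_score_) # mean cross-validated f1_micro of the best parameter combination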
Now that we have optimized the C and gamma parameters for our SVM, let's construct our final model.
clf_svm = SVC(C=1, gamma=0.12,random_state=15)
clf_svm.fit(X_train_scaled, y_train)
plot_confusion_matrix(clf_svm,
X_test_scaled,
y_test,
values_format='d',
display_labels=["Did not get a hit", "Got a hit"])
In the confusion matrix, we see that of the 489 players that did not get a hit, 275 (56%) were correctly classified. And of the 511 players that did get a hit, 296 (58%) were correctly classified. So the Support Vector Machine was only slightly improved but we now have a baseline model to compare to.
Now that we have optimized our SVM Model, we can move on to building our Neural Network model. Neural Networks work by utilizing layers of nodes, or neurons, similar to how the human brain works. The initial input layer represents the features in the dataset; several hidden layers of nodes then sit between the input layer and the output layer. Connecting the layers are channels that hold numerical weights, and the hidden layers hold threshold values. If the output of an individual node is above the specified threshold value, that node is activated and data is sent to the next layer of the network; otherwise, no data is passed along to the next layer.
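To make the idea concrete, a single node's output is just an activation function applied to a weighted sum of its inputs plus a bias; a toy sketch (the numbers below are hypothetical, not taken from our model):
import numpy as np
x = np.array([0.5, -1.2, 3.0]) # hypothetical inputs from the previous layer
w = np.array([0.8, 0.1, -0.4]) # hypothetical channel weights
b = 0.2 # bias term, shifting the activation threshold
print(np.tanh(np.dot(w, x) + b)) # tanh activation, the same one our final model will use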
Now that we have a brief understanding of how Neural Networks work, let's begin by building a baseline model.
clf_nn = MLPClassifier(random_state=15).fit(X_train_scaled, y_train) # Create a Neural Network and fit it to the training data.
plot_confusion_matrix(clf_nn, X_test_scaled, y_test, display_labels=["Did not get a hit", "Got a hit"])
In the confusion matrix, we see that of the 489 players that did not get a hit, 255 (52%) were correctly classified. And of the 511 players that did get a hit, 295 (58%) were correctly classified. So the preliminary Neural Network model was not very good and seemingly overfit.
Now that we have built our preliminary Neural Network model, we can begin to optimize by looking for ideal values for the max_iter, hidden_layer_sizes, learning_rate_init, beta_1, and beta_2 hyperparameters. We will once again use GridSearchCV() with a list of several possible values for each parameter and test the various combinations.
## Warning: this code can take several minutes to run depending on the system it's run on;
## adding extra test parameters increases the run time exponentially.
param_grid = {
'max_iter':[100, 165, 220,],
'hidden_layer_sizes': [(19,),(21,)],
'learning_rate_init':[ 0.0005, 0.0009, 0.005],
'activation':['tanh'],
'beta_1' : [0.75, 0.82, 0.9,],
'beta_2' : [0.75, 0.82, 0.9,],
}
optimal_params = GridSearchCV(
MLPClassifier(random_state=15),
param_grid,
scoring='f1_micro',
verbose=0,
n_jobs=10,
cv=3
)
optimal_params.fit(X_train_scaled, y_train)
print(optimal_params.best_params_)
And we see that the ideal values are learning_rate_init = 0.0009, beta_1 = 0.82, beta_2 = 0.75, hidden_layer_sizes = (19,), and max_iter = 165.
Now that we have the ideal values for learning_rate_init, hidden_layer_sizes, beta_1, beta_2, and max_iter, we can build the final Neural Network model:
clf_nn = MLPClassifier(
activation='tanh',
learning_rate_init=0.0009,
hidden_layer_sizes = (19,),
beta_1=0.82,
beta_2=0.75,
max_iter=165,
random_state = 15
).fit(X_train_scaled, y_train)
plot_confusion_matrix(clf_nn, X_test_scaled, y_test, display_labels=["Did not get a hit", "Got a hit"])
clf_nn.score(X_test_scaled, y_test) # get accuracy of model
In the confusion matrix, we see that of the 489 players that did not get a hit, 277 (56%) were correctly classified. And of the 511 players that did get a hit, 277 (54%) were correctly classified. So the Neural Network model performed worse in terms of correct true positives, but the overall accuracy has slightly improved and the model is no longer overfit.
We can now move on to making a K Nearest Neighbors model. K Nearest Neighbors models work by plotting known training data in a defined space for specific features. Unknown testing data is then placed in the same space and compared to the nearest k known training points. The class of the unknown data point is then determined via a majority vote by those k nearest neighbors, with the distance of those neighbors taken into account. K Nearest Neighbors is one of the simplest yet most effective Machine Learning algorithms and is a good starting point for supervised classification problems like ours.
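A toy sketch of the idea (the points below are hypothetical, arranged in two obvious clusters):
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]) # two obvious clusters
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_knn = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_toy, y_toy)
print(toy_knn.predict([[4, 4]])) # its 3 nearest neighbors are all class 1, so it predicts [1]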
Now let's begin by making a preliminary K Nearest Neighbors model.
# Create a model and fit it to the training data
clf_knn = KNeighborsClassifier().fit(X_train, y_train)
plot_confusion_matrix(clf_knn, X_test, y_test, display_labels=["Did not get a hit", "Got a hit"])
In the confusion matrix, we see that of the 489 people that did not get a hit, 271 (55%) were correctly classified. And of the 511 people that did get a hit, 270 (53%) were correctly classified. So the preliminary K-NN model was not great.
Now that we have built our preliminary K Nearest Neighbors model, we can begin to optimize by looking for ideal values for the n_neighbors and metric hyperparameters. We will once again use GridSearchCV() with a list of several possible values for each parameter and test the various combinations.
# Round 1
param_grid = {
'n_neighbors':[5, 10, 15, 20, 25, 35, 40, 45, 50],
'metric':['euclidean', 'manhattan', 'minkowski', 'mahalanobis'], # note: mahalanobis also requires metric_params (a covariance matrix) to be usable
}
optimal_params = GridSearchCV(
KNeighborsClassifier(),
param_grid,
scoring='f1', # f1_micro
verbose=0,
n_jobs=10,
cv=3
)
optimal_params.fit(X_train, y_train)
print(optimal_params.best_params_)
Now that we have the ideal values for n_neighbors and metric, we can build the final K Nearest Neighbors model:
# Create a model and fit it to the training data;
# weights='distance' makes closer neighbors count more heavily in the vote
clf_knn = KNeighborsClassifier(metric='euclidean', n_neighbors=25, weights='distance').fit(X_train, y_train)
plot_confusion_matrix(clf_knn, X_test, y_test, display_labels=["Did not get a hit", "Got a hit"])
In the confusion matrix, we see that of the 489 people that did not get a hit, 291 (59%) were correctly classified. And of the 511 people that did get a hit, 309 (60%) were correctly classified. So the optimized K-NN model is noticeably improved over the preliminary one.
Now that we are done with our K Nearest Neighbors model, let's move on to Decision Trees. Decision Trees work by attempting to predict the value of a target variable by learning simple decision rules inferred from the data features. Decision Trees are a good starting algorithm to model because they are simple to interpret, easily visualized, and work well with various data types. However, some downsides to Decision Trees are their tendency to become overfit and complex with large datasets.
With a brief understanding of how Decision Trees work, let's begin by making a preliminary model.
clf_dt = DecisionTreeClassifier(random_state=35)
clf_dt = clf_dt.fit(X_train, y_train)
plot_confusion_matrix(clf_dt,
X_test,
y_test,
display_labels=["Did not get a hit", "Got a hit"])
plt.figure(figsize=(200, 100))
plot_tree(clf_dt,
fontsize=10,
filled = True,
rounded = True,
class_names = ["Not a hit","Hit"],
feature_names = X.columns);
In the confusion matrix, we see that of the 489 players that did not get a hit, 264 (54%) were correctly classified. And of the 511 players that did get a hit, 289 (56%) were correctly classified. So the Decision Tree was not great. And after plotting the entire tree, we can see it is very deep and complex, hinting at overfitting, so let's try to improve the tree with Pruning.
Decision Trees are known to overfit the Training Data, and there are many parameters, such as max_depth and min_samples, that are designed to reduce overfitting. However, pruning a tree with Cost Complexity Pruning is one of the simpler processes for finding a smaller tree that can improve prediction accuracy on testing data.
Pruning a Decision Tree is all about finding an ideal value for the pruning parameter alpha, which controls how little or how much pruning happens. One method of finding the optimal alpha is to plot the accuracy of the tree as a function of different alpha values. We'll utilize this method with both our training and testing datasets.
First, let's find the different alpha values available for this tree and build a pruned tree for each value.
path = clf_dt.cost_complexity_pruning_path(X_train,y_train) # determine values for alphas
ccp_alphas = path.ccp_alphas # extract different values for alphas
ccp_alphas = ccp_alphas[:-1] # exclude the maximum value for alpha; this value would prune ALL leaves, leaving only a stump
clf_dts = [] # array to hold decision trees
# create one decision tree per value for alpha and store it in the array
for ccp_alpha in ccp_alphas:
clf_dt = DecisionTreeClassifier(random_state = 35, ccp_alpha=ccp_alpha)
clf_dt.fit(X_train,y_train)
clf_dts.append(clf_dt)
Now we'll graph the accuracy of the trees using the training and testing datasets as functions of alpha.
train_scores = [clf_dt.score(X_train,y_train) for clf_dt in clf_dts]
test_scores = [clf_dt.score(X_test,y_test) for clf_dt in clf_dts]
fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label='train', drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label='test', drawstyle="steps-post")
ax.legend()
plt.show()
In the graph above, we can see the accuracy of the testing dataset is highest when alpha is about 0.0015. After this value, the accuracy of both datasets begins to drop off slightly. Let's verify our findings with cross validation.
alpha_loop_values = []
## for each candidate alpha value, we'll run 3-fold cross validation
## then we'll store the mean and standard deviation of the scores for each call
## to cross_val_score and alpha_loop_values
for ccp_alpha in ccp_alphas:
clf_dt = DecisionTreeClassifier(random_state = 35, ccp_alpha=ccp_alpha)
scores = cross_val_score(clf_dt, X_train, y_train, cv=3)
alpha_loop_values.append([ccp_alpha, np.mean(scores), np.std(scores)])
# now we'll draw a graph of the means and standard deviations of the scores for each alpha value
alpha_results = pd.DataFrame(alpha_loop_values,
columns=['alpha','mean_accuracy','std'])
alpha_results.plot(x='alpha',
y='mean_accuracy',
yerr='std',
marker='o',
linestyle='--')
By using cross validation and drawing the graph, we can see that accuracy is still at its highest when alpha is just below 0.0015. So let's see if we can extract an exact value to use.
alpha_results[(alpha_results['alpha'] > 0.0013)
&
(alpha_results['alpha'] < 0.0015)]
# store series of alpha values
ideal_ccp_alpha = alpha_results[(alpha_results['alpha'] > 0.0013)
&
(alpha_results['alpha'] < 0.0015)]['alpha']
ideal_ccp_alpha
And from here we'll select a final alpha value. In our case, we'll use the value at index 316, as it performs well and has a low standard deviation.
ideal_ccp_alpha = float(ideal_ccp_alpha[316]) # select desired alpha value and convert from series to float
ideal_ccp_alpha
And now we have an ideal_ccp_alpha to be used as alpha when we build our final tree. Now that we have an ideal value for ccp_alpha, we can build the final Decision Tree model:
clf_dt_pruned = DecisionTreeClassifier(random_state=35, ccp_alpha=ideal_ccp_alpha)
clf_dt_pruned = clf_dt_pruned.fit(X_train, y_train)
plot_confusion_matrix(clf_dt_pruned,
X_test,
y_test,
display_labels=["Did not get a hit", "Got a hit"])
plt.figure(figsize=(50, 25))
plot_tree(clf_dt_pruned,
fontsize=10,
filled = True,
rounded = True,
class_names = ["Not a hit","Hit"],
feature_names = X.columns);
In the confusion matrix, we see that of the 489 players that did not get a hit, 274 (56%) were correctly classified. And of the 511 players that did get a hit, 280 (55%) were correctly classified. So the final Decision Tree model performs similarly in terms of total correct predictions. However, the final model is much simpler and no longer overfit.
Since our Decision Tree model was not the best, we can move on to another tree-based model in hopes of better performance. Random Forests are also tree models like Decision Trees; however, Random Forest models utilize ensemble learning, using the outcomes of multiple decision trees to reach a conclusion rather than relying on a single tree. Because of this, Random Forests are generally less prone to overfitting.
So let's go ahead and build a preliminary model.
clf_rf = RandomForestClassifier(random_state=35)
clf_rf.fit(X_train, y_train)
plot_confusion_matrix(clf_rf,
X_test,
y_test,
values_format='d',
display_labels=["Did not get a hit", "Got a hit"])
clf_rf.score(X_test, y_test)
In the confusion matrix, we see that of the 489 players that did not get a hit, 296 (60%) were correctly classified. And of the 511 players that did get a hit, 262 (51%) were correctly classified. So the initial Random Forest model was not very good and seems to be overfit for predicting true negatives.
plt.figure(figsize=(200, 100))
plot_tree(clf_rf.estimators_[3],
fontsize=12,
filled = True,
rounded = True,
class_names = ["Not a hit","Hit"],
feature_names = X.columns);
If we take a look at one of our ensemble trees, we see that the trees being used can be rather deep and complex, which also hints at overfitting taking place.
Now that we have built our preliminary Random Forest model and noticed evidence of overfitting, we can begin to optimize by looking for ideal values for the max_depth, max_features, and n_estimators hyperparameters. We will once again use GridSearchCV() with a list of several possible values for each parameter and test the various combinations.
param_grid = [
{'max_depth': [3, 4, 5],
'max_features': [5, 10, 'sqrt', 'log2'],
'n_estimators': [50, 65, 100, 200]},
]
optimal_params = GridSearchCV(
RandomForestClassifier(random_state=35),
param_grid,
cv=3,
scoring='roc_auc',
## For more scoring metrics see:
## https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
verbose=0 # if you want to see what GridSearchCV is doing, set verbose=2
)
optimal_params.fit(X_train, y_train)
print(optimal_params.best_params_)
Now that GridSearchCV() has run, we see that the ideal values are max_features = 10, n_estimators = 65, and max_depth = 4.
Now that we have the ideal values for max_depth, n_estimators, and max_features, we can build the final Random Forest model:
clf_rf = RandomForestClassifier(max_depth=4, n_estimators=65, max_features=10, random_state=35)
clf_rf.fit(X_train, y_train)
plot_confusion_matrix(clf_rf,
X_test,
y_test,
values_format='d',
display_labels=["Did not get a hit", "Got a hit"])
clf_rf.score(X_test, y_test)
In the confusion matrix, we see that of the 489 players that did not get a hit, 280 (57%) were correctly classified. And of the 511 players that did get a hit, 280 (55%) were correctly classified. So the final Random Forest model was only slightly improved in terms of total correct predictions, but is no longer overfit.
plt.figure(figsize=(90, 30))
plot_tree(clf_rf.estimators_[3],
fontsize=25,
filled = True,
rounded = True,
class_names = ["Not a hit","Hit"],
feature_names = X.columns);
We can also see that the trees being used are now much shallower, simpler, and less prone to overfitting. So we can safely say the model was improved.
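As a side note, we can verify the ensemble idea directly: a Random Forest prediction is essentially an aggregate of its trees' votes. (Strictly, scikit-learn averages the trees' class probabilities, but a hard majority vote of the fitted clf_rf above gives nearly the same answer.)
import numpy as np
tree_votes = np.array([tree.predict(X_test.to_numpy()) for tree in clf_rf.estimators_]) # one row of votes per tree
majority_vote = (tree_votes.mean(axis=0) > 0.5).astype(int) # majority vote across the 65 trees
print((majority_vote == clf_rf.predict(X_test)).mean()) # fraction of agreement with the forest, ~1.0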
Now that we've gone ahead and produced one ensemble method, let's try a different type of ensemble tree-based model, XGBoost. XGBoost and Random Forest models differ in that, rather than producing a collection of decision trees that are independent of each other, XGBoost utilizes Boosting in order to create a collection of weak and strong learner trees that build upon each other and compensate for the weaknesses of their predecessors.
Now that we've briefly explained XGBoost, let's build a baseline model.
# Create a model and fit it to the training data
clf_xgb = xgb.XGBClassifier(objective='binary:logistic', missing=1, use_label_encoder=False, seed=42)
clf_xgb.fit(X_train,
y_train,
verbose= True,
early_stopping_rounds=10,
eval_metric='aucpr',
eval_set=[(X_test, y_test)])
plot_confusion_matrix(clf_xgb,X_test, y_test)
In the confusion matrix, we see that of the 489 players that did not get a hit, 269 (55%) were correctly classified. And of the 511 players that did get a hit, 277 (54%) were correctly classified. So the preliminary XGBoost model was okay but not great.
Now that we have a preliminary XGBoost model, let's once again try to optimize it by searching for optimal hyperparameters with GridSearchCV(). For XGBoost models, we'll attempt to find ideal values for the reg_alpha, reg_lambda, and max_depth hyperparameters.
param_grid = {
'max_depth':[5, 4, 2],
'reg_lambda':[5, 4, 3,],
'reg_alpha':[1, 2, 3,],
}
optimal_params = GridSearchCV(
estimator=xgb.XGBClassifier(objective= 'binary:logistic',
seed=42,
subsample=0.8,
use_label_encoder=False,
colsample_bytree=0.5),
param_grid=param_grid,
scoring='f1_micro',
verbose=0,
n_jobs=10,
cv=3
)
optimal_params.fit( X_train,
y_train,
early_stopping_rounds=10,
verbose= False,
eval_metric='aucpr',
eval_set=[(X_test, y_test)])
print(optimal_params.best_params_)
By optimizing with GridSearchCV(), we now see the ideal values are reg_alpha = 2, reg_lambda = 5, and max_depth = 4. We'll now use these values in the final XGBoost model.
Now that we have the ideal values for reg_alpha, reg_lambda, and max_depth, we can build the final XGBoost model:
clf_xgb = xgb.XGBClassifier(objective='binary:logistic',
missing= 1,
max_depth= 4,
reg_lambda= 5,
reg_alpha= 2,
use_label_encoder=False)
clf_xgb.fit(X_train,
y_train,)
plot_confusion_matrix(clf_xgb, X_test, y_test)
clf_xgb.score(X_test, y_test)
In the confusion matrix, we see that of the 489 players that did not get a hit, 280 (57%) were correctly classified. And of the 511 players that did get a hit, 287 (56%) were correctly classified. So the XGBoost model was improved slightly.
Now that we've made and optimized Support Vector Machine, Neural Network, K Nearest Neighbors, Decision Tree, Random Forest, and XGBoost models, let's review and compare the final models.
clf_svm = SVC(C=1, gamma=0.12,random_state=15)
clf_svm.fit(X_train_scaled, y_train)
plot_confusion_matrix(clf_svm,
X_test_scaled,
y_test,
values_format='d',
display_labels=["Did not get a hit", "Got a hit"])
y_pred = clf_svm.predict(X_test_scaled)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
As we review our final Support Vector Machine, looking solely at the confusion matrix suggests the model is better at predicting True Positives than True Negatives. However, that reading ignores the slight imbalance between True Positives and True Negatives in the test set. It is therefore useful to look at other metrics, such as precision ("the proportion of positive identifications that were actually correct"), recall ("the proportion of actual positives that were correctly identified"), and f1 score (the harmonic mean of precision and recall). Our classification report shows that although all three scoring metrics are higher for positive predictions, they are only about 2% higher than for negative predictions. So, given a completely balanced dataset, we should expect a more even distribution of predictions than the confusion matrix might lead us to believe. These metrics are good for measuring performance on positive predictions, but for problems where negative predictions are equally important, it is better to use the Matthews correlation coefficient (MCC), as it "takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes". While our other metrics are on a scale of 0 to 1, the MCC is on a scale of -1 to +1, and anything over 0 is better than random guessing. Our value of 0.142, although not great, isn't terrible either, and provides a solid baseline score to compare our other models against.
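For intuition, the MCC can also be computed by hand from the four cells of the confusion matrix; using the counts from our SVM matrix above:
import math
tp, tn = 296, 275 # correct "got a hit" / "did not get a hit" predictions from the matrix above
fp, fn = 489 - tn, 511 - tp # the misclassified players in each class
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(round(mcc, 3)) # 0.142, matching matthews_corrcoef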
clf_nn = MLPClassifier(activation='tanh',
learning_rate_init=0.0009,
hidden_layer_sizes = (19,),
beta_2=0.75,
beta_1=0.82,
max_iter=165,
random_state = 15
).fit(X_train_scaled, y_train)
plot_confusion_matrix(clf_nn, X_test_scaled, y_test, display_labels=["Did not get a hit", "Got a hit"])
y_pred = clf_nn.predict(X_test_scaled)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
From a quick glance at the confusion matrix for our Neural Network model, it's clear this model did not perform as well as our Support Vector Machine. However, by examining our other scoring metrics, we can begin to understand the finer differences between our models. Our Support Vector Machine had identical precision and recall scores for each class, indicating a very balanced model in terms of positive predictions. Our Neural Network model, however, has slightly differing precision and recall scores, indicating it predicts a larger proportion of True Negatives than True Positives, but that we can be more confident in its True Positive predictions than in its True Negative predictions. We can also see that its MCC of 0.109, although still above 0, is noticeably lower than our SVM's 0.142. So it's safe to say the SVM is the better performer of the two models.
# Create a model and fit it to the training data
clf_knn = KNeighborsClassifier(metric='euclidean', n_neighbors=25, weights='distance').fit(X_train, y_train)
plot_confusion_matrix(clf_knn,
X_test,
y_test,
display_labels=["Did not get a hit", "Got a hit"])
y_pred = clf_knn.predict(X_test)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
Taking another look at the confusion matrix for our K Nearest Neighbors model, we can see vast improvements in performance relative to the Neural Network model. This model is similar to our SVM model in that we have more correct positive predictions than negative predictions. However, once we take class balance into account and look at the precision and recall scores, the model actually performs similarly for both classifications. We also notice the MCC is the highest of our models so far at 0.2, almost double our Neural Network's 0.109 and well above our SVM's 0.142. With the performance seen in the confusion matrix and the high MCC, our K Nearest Neighbors model is easily the best of the bunch so far.
clf_dt_pruned = DecisionTreeClassifier(random_state=35, ccp_alpha=ideal_ccp_alpha)
clf_dt_pruned = clf_dt_pruned.fit(X_train, y_train)
plot_confusion_matrix(clf_dt_pruned,
X_test,
y_test,
display_labels=["Did not get a hit", "Got a hit"])
y_pred = clf_dt_pruned.predict(X_test)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
Looking at the confusion matrix for our Decision Tree model, it's clear the performance is not good enough. A glance at the classification report and MCC shows performance similar to our Neural Network model, the worst of our previously examined models. There's no need to dig deep into the poor performance of this model when we already have a significantly better performing one, so let's move on to the next model.
clf_rf = RandomForestClassifier(max_depth=4, n_estimators=65, max_features=10, random_state=35)
clf_rf.fit(X_train, y_train)
plot_confusion_matrix(clf_rf,
X_test,
y_test,
values_format='d',
display_labels=["Did not get a hit", "Got a hit"])
y_pred = clf_rf.predict(X_test)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
Another quick glance at the confusion matrix shows another poorly performing model relative to our best. Although an MCC of 0.121 indicates our Random Forest model performs better than our Neural Network and Decision Tree models, it's still third best behind our K Nearest Neighbors and SVM models. So we'll now move on to examining our final model, XGBoost.
clf_xgb = xgb.XGBClassifier(objective='binary:logistic',
missing= 1,
max_depth= 4,
reg_lambda= 5,
reg_alpha= 2,
use_label_encoder=False)
clf_xgb.fit(X_train,
y_train,)
plot_confusion_matrix(clf_xgb, X_test, y_test)
y_pred = clf_xgb.predict(X_test)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
A look at our final confusion matrix shows that although performance is better than most of the previous models we've examined, XGBoost still lags behind our K Nearest Neighbors model. While an MCC of 0.134 is better than our Random Forest model's, it's still slightly behind our SVM's 0.142.
svc = SVC(C=1, gamma=0.12,random_state=15)
svc.fit(X_train_scaled, y_train)
knn = KNeighborsClassifier(metric='euclidean', n_neighbors=25, weights='distance')
knn.fit(X_train, y_train)
svc_disp = plot_roc_curve(svc, X_test_scaled, y_test)
knn_disp = plot_roc_curve(knn, X_test, y_test, ax=svc_disp.ax_)
knn_disp.figure_.suptitle("SVM/K-NN ROC Curve Comparison")
plt.show()
As a final analysis of model performance, we'll take a look at the ROC curves of our top two performing models.
Receiver Operating Characteristic curves are a way to visualize the trade-off between Sensitivity (the True Positive Rate) and Specificity (1 – the False Positive Rate). ROC curves are commonly used to summarize a model's ability to predict classifications, in combination with the AUC (Area Under the Curve). A perfect curve would rise immediately as a vertical line that turns horizontal at the very top of the graph, producing an AUC of 1, whereas a 45-degree diagonal curve would represent average performance, akin to random guessing, and would produce an AUC of 0.5.
As we can see in our graph, our top performing K Nearest Neighbors model produces a better ROC curve than our SVM. After calculating the AUC of the two curves, we can conclude that the K Nearest Neighbors model is about 4% more accurate than the SVM model. To many, 4% may not seem like a significant difference. However, given the context of our problem and how much closer we are to the average AUC of 0.5 than to a perfect AUC of 1, 4% is a sizeable difference.
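For completeness, here is a sketch of how the two AUC values can be computed with the svc and knn models fitted above (the SVM scores points with its decision function, while K-NN provides class probabilities):
from sklearn.metrics import roc_auc_score
svc_auc = roc_auc_score(y_test, svc.decision_function(X_test_scaled)) # AUC from the SVM's continuous scores
knn_auc = roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1]) # AUC from K-NN's probability of class 1
print(round(svc_auc, 3), round(knn_auc, 3))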
Now that we've established that our K Nearest Neighbors is the best of the bunch, the only thing left to do is save the model for later use!
import pickle
pickle.dump(clf_knn, open("mlb_hitter_model.pickle", "wb"))
Finally, we save our model in order to load and prepare for deployment later!
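When the time comes, loading the model back is the mirror image of the dump above (a quick sketch, reusing the same filename):
loaded_model = pickle.load(open("mlb_hitter_model.pickle", "rb")) # load the pickled K-NN model
print(loaded_model.score(X_test, y_test)) # confirm it still scores the test set as before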
Pedregosa, F. et al., 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct), pp.2825–2830.
Starmer, J., 2018. StatQuest. [online] Available at: https://statquest.org/
Google Developers, 2022. Classification: Precision and Recall | Machine Learning Crash Course. [online] Available at: https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall [Accessed 11 February 2022].
Wikipedia contributors, 2022. Phi coefficient. In Wikipedia, The Free Encyclopedia. Available at: https://en.wikipedia.org/w/index.php?title=Phi_coefficient&oldid=1068832478 [Accessed 16 February 2022].
Wikipedia contributors, 2022. Receiver operating characteristic. In Wikipedia, The Free Encyclopedia. Available at: https://en.wikipedia.org/w/index.php?title=Receiver_operating_characteristic&oldid=1067917334 [Accessed 16 February 2022].
Chan, C., 2018. What is a ROC Curve and How to Interpret It. [online] Displayr. Available at: https://www.displayr.com/what-is-a-roc-curve-how-to-interpret-it/ [Accessed 12 February 2022].
XGBoost developers, 2021. XGBoost Documentation. [online] Available at: https://xgboost.readthedocs.io/en/stable/index.html [Accessed 8 February 2022].