This notebook was written by Nicholas Rosas
LinkedIn: https://www.linkedin.com/in/nickrosas96/
GitHub: https://github.com/nr96
The purpose of this notebook is to demonstrate the data mining process used to implement Machine Learning in my MLB Daily Hitter program. The program was originally created to streamline and automate my player selection process for MLB's Beat The Streak seasonal competition, in which participants compete for a grand prize of $5.6 million by attempting to build a streak of 57 correctly selected MLB players, each earning a hit in a single MLB game. Although the models demonstrated here are not the exact models used in my program, the process and final outcomes are very similar.
Q: What is the problem to be solved?
A: Given a collection of features, or stats, for a player prior to game start, can we predict whether the individual will get a hit in the game?
Q: How will this problem be approached?
A: As our dataset contains labeled data and our target variable is one of two classes, we will approach this problem as a Supervised Classification problem and will utilize several Machine Learning algorithms appropriate to the problem in order to create models that will make a prediction on a player.
Q: How will performance be judged?
A: We will judge individual model performance using a confusion matrix, and hyperparameters will be tuned using several performance metrics, such as accuracy, f1 score, ROC, and AUC (a small sketch of these metrics appears after this Q&A).
Q: Are methods utilized fully explained?
A: Although I try to briefly explain the various data science and machine learning methods and processes utilized throughout the notebook, I do not explain them in great detail. This notebook is only meant to be a demonstration of the data science process used to solve a real world baseball problem.
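To make these metrics concrete before we start, here is a minimal sketch using hypothetical labels and predictions (the arrays below are illustrative only, not our data):
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, roc_auc_score
y_true = [0, 1, 1, 0, 1] # hypothetical actual outcomes (1 = got a hit)
y_pred = [0, 1, 0, 0, 1] # hypothetical model predictions
print(confusion_matrix(y_true, y_pred)) # rows = actual class, columns = predicted class
print(accuracy_score(y_true, y_pred)) # fraction of predictions that were correct
print(f1_score(y_true, y_pred)) # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_pred)) # area under the ROC curve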
Before we begin our data analysis, we must first ensure all required libraries are installed and imported correctly.
!type python3 # check directory location of python environment
!/home/ec2-user/anaconda3/envs/python3/bin/python3 -m pip install xgboost # Download / install xgboost
The very first thing we do is load the various python modules used as a part of our workflow. These modules give us extra functionality to import the data, clean it up and format it, and then build, evaluate and draw the various machine learning models.
import xgboost as xgb # Warning, XGBoost must first be installed before importing
import pandas as pd # pandas is used to load and manipulate data
import numpy as np # numpy is used to calculate the mean and standard deviation
from sklearn.model_selection import train_test_split # splits data into training and testing sets
from sklearn import preprocessing # scales and centers data
from sklearn.decomposition import PCA # to perform PCA to plot the data
from sklearn.model_selection import GridSearchCV # this will do cross validation and help tune hyperparameters
from sklearn.model_selection import cross_val_score # for cross validation
from sklearn.metrics import confusion_matrix, plot_confusion_matrix # to make confusion matrices
from sklearn.metrics import balanced_accuracy_score, roc_auc_score, make_scorer # various scoring metrics
from sklearn.metrics import classification_report, accuracy_score # more scoring metrics
from sklearn.metrics import matthews_corrcoef, plot_roc_curve # even more scoring metrics
import matplotlib.pyplot as plt # matplotlib is for drawing graphs
import matplotlib.colors as colors
from sklearn.utils import resample # downsample the dataset
from sklearn.tree import plot_tree # to draw classification tree
from sklearn.neural_network import MLPClassifier # this will make a Neural network
from sklearn.tree import DecisionTreeClassifier # this will make a Classification tree
from sklearn.svm import SVC # this will make a Support Vector Machine for classification
from sklearn.neighbors import KNeighborsClassifier # this will make a K Nearest Neighbors model
from sklearn.ensemble import RandomForestClassifier # this will make a Random Forest model
import os
import boto3
import re
import sagemaker # AWS
from sagemaker import get_execution_role # AWS
Before we begin creating our ML Models, we want to make sure the data being used is high quality and informative. We can achieve this via data cleaning, data transformation and data reduction. Data preparation or preprocessing is the most important step in any data science workflow, as the quality of our output will directly match the quality of our input.
Before we can do anything with our data, we must first load it into a pandas data frame. When pandas (pd) reads in data, it returns a data frame, which is a lot like a spreadsheet. The data are organized in rows and columns and each row can contain a mixture of text and numbers. The standard variable name for a data frame is the initials df, and that is what we will use here:
# load data from AWS S3 Bucket
role = get_execution_role()
region = boto3.Session().region_name
bucket = 'sagemaker-studio-298378432906-ea5batqypv8'
data_key = 'mlb_hitters_2018.csv'
bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region,bucket)
data_location = 's3://{}/{}'.format(bucket, data_key)
df = pd.read_csv(data_location,header=0) # load data into pandas df
df.head() # preview first 5 rows of data
In our Data Frame we see a bunch of columns for the various stats collected for each player. The columns are...
We begin preparing our data by dropping uninformative columns that don't tell us anything about a player's performance, such as Date and BatterID. But before we do that, we'll save a copy of the original unedited data frame in case we ever need to reference the original data or start from a clean state.
cols = [0,1,2,18,19,20] # columns to drop
og_df = df.copy() # store a copy of the original df
df.drop(df.columns[cols], axis=1, inplace=True) # drop unnecessary columns from df
df.head() # preview edited df
We now see that the number of columns has dropped from 74 to 68, so we have successfully removed the 6 uninformative columns from our df.
og_df.head()
And a copy of the original dataframe has correctly been saved as og_df. Next we will also drop several less informative columns as part of our features selection process.
Utilizing our domain knowledge, we know DK (DraftKings) and FD (FanDuel) are two ways to score players' fantasy points. Although these scoring metrics are slightly different, the information gain from each should be extremely similar. By keeping both, we could be adding a lot of unnecessary noise for our algorithms to decipher. So let's check the correlation coefficients to validate our hypothesis.
print(df[['H/A_FD', 'H/A_DK','Hit']].corr()['Hit'].to_string() + '\n')
print(df[['last7_FD', 'last7_DK','Hit']].corr()['Hit'].to_string() + '\n')
print(df[['last15_FD', 'last15_DK','Hit']].corr()['Hit'].to_string() + '\n')
print(df[['last30_FD', 'last30_DK','Hit']].corr()['Hit'].to_string() + '\n')
After checking the correlation coefficients of all FD and DK features, we can see our hypothesis is correct and both provide similar amounts of information gain. However, DK seems to have a slight edge so we'll keep that and drop all FD features.
Feature selection is one of the most important steps of data preparation that can have a big impact on model performance. There are many ways to go about selecting which features to keep and which to discard. Discussing, testing, and showcasing these various methods would be a bit excessive for the purpose of this notebook, so for brevity's sake, I'll be providing a list of features to drop that I've concluded on using a combination of domain knowledge, standard feature selection methods, and model performance testing.
cols = ['H/A_AB', 'vsP_AB', 'vsP_BB',
'H/A_FD', 'last7_FD', 'last15_FD', 'last30_FD',
'H/A_RBI', 'last7_RBI', 'last15_RBI', 'last30_RBI', 'vsP_RBI',
'H/A_H', 'last7_H', 'last15_H', 'last30_H',
'last7_IBB', 'last15_IBB', 'last30_IBB', 'H/A_IBB', 'vsP_IBB',
'last7_HR','last15_HR','last30_HR', 'vsP_HR','H/A_HR']
df = df.drop(cols,axis=1)
df.head()
We can see our modified df no longer has the unnecessary data and now only has 42 columns.
One of the biggest parts of any data science project is making sure that the data is correctly formatted and fixing it if it is not. The first part of this process is identifying and dealing with missing data.
Missing Data is simply a blank space, or a surrogate value like NA, that indicates that we failed to collect data for one of the features. For example, if we forgot to ask someone's age, or forgot to write it down, then we would have a blank space in the dataset for that person's age.
There are two main ways to deal with missing data: remove the rows (or columns) that contain missing values, or impute the missing values with a sensible estimate such as the column mean or median. A minimal sketch of both appears below.
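As a minimal sketch of both remedies (the toy frame below is hypothetical, not part of our dataset):
import pandas as pd
import numpy as np
toy = pd.DataFrame({'age': [23, np.nan, 31], 'hits': [1, 0, 1]}) # toy frame with one missing age
dropped = toy.dropna() # option 1: remove rows containing missing values
imputed = toy.fillna({'age': toy['age'].mean()}) # option 2: impute with the column mean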
In this section, we'll focus on identifying missing values in the dataset.
First, let's see what sort of data is in each column.
pd.set_option('display.max_column', 42) # expand display to show all 42 columns
pd.set_option('display.max_rows', 50) # expand display to show up to 50 rows
pd.set_option('display.min_rows', None) # with min_rows unset, truncated output shows max_rows rows (25 head + 25 tail)
df # display first 25 and last 25 rows of data
df.describe() # generate descriptive statistics
By briefly grazing through our data frame, we can see our data has a wide range of values, some ints and some floats, and 0 is a common value across numerous features. It's important to know whether 0's found in your data are legitimate values or simply indicators of missing data. In our case, 0 is a legitimate value. However, there are a few features where 0 should never appear, such as last30_AB, so we'll check the appropriate columns to make sure they don't contain such outlier values. We'll also check our entire dataframe for standard missing values like NA or NaN.
# check the specified columns for values == 0
len(df.loc[ (df['H/A_PA'] == 0 ) | (df['vsP_PA'] == 0 ) | (df['last7_AB'] == 0) | (df['last7_PA'] == 0) |
(df['last15_AB'] == 0) | (df['last15_PA'] == 0) | (df['last30_AB'] == 0) | (df['last30_PA'] == 0 )])
df.isnull().sum() # check all features for missing values
We can see from our output that no missing NaN values were found in the data frame, and no unexpected 0 values were found either. Since we now know there are no missing values in our data, we'll double-check our data types to ensure nothing unexpected appears.
df.dtypes
We can see that all of our data is currently being read as either float64 or int64 variables, which is correct and expected. However, several of our float64 features are meant to be int64 types, so let's change those variable types.
cols = ['H/A_BB', 'H/A_PA', 'H/A_SO',
'last7_AB', 'last7_BB', 'last7_PA', 'last7_SO',
'last15_AB', 'last15_BB', 'last15_PA', 'last15_SO',
'last30_AB', 'last30_BB', 'last30_PA', 'last30_SO',
'vsP_PA', 'vsP_SO','Hit'] # columns to be changed
df[cols] = df[cols].astype(int) # update specified columns type to int
df[cols].dtypes
We see our data now has the correct typing.
It's important to remember data can come in a variety of types, such as numerical (ints and floats), categorical, ordinal, dates, and free-form text.
However, most Machine Learning algorithms are meant to only handle numerical data. Because of this, it is important to always convert non-numerical data types into numerical data types. This can be done in several ways depending on the original data type, such as one-hot encoding for unordered categories or label encoding for ordered ones.
Our dataframe currently only contains numerical data, so there is no need to apply these methods here, but data conversion is an important step to go through as a part of any data science process; a toy example is sketched below.
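Purely for illustration, here is what the two encodings might look like if our data had a hypothetical text column like team (this column does not exist in our df):
example = pd.DataFrame({'team': ['NYY', 'BOS', 'NYY']}) # hypothetical categorical column
one_hot = pd.get_dummies(example, columns=['team']) # one-hot encoding: one 0/1 column per team
label_encoded = example['team'].astype('category').cat.codes # label encoding: one integer per team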
Some Machine Learning models, like Support Vector Machines, are great with small datasets but not great with large ones, and this dataset, while not huge, is big enough to take a long time to optimize with Cross Validation. So we'll downsample both categories, players who did and did not get a hit, to 2,000 each. At the same time, this will keep our dataset balanced, which is important to prevent our models from overfitting in one direction by simply favoring the majority class.
First, let's remind ourselves how many players are in the dataset...
len(df) # get current number of rows/players in df
df_no_hit = df[df['Hit'] == 0] # separate players who did not get a hit
df_hit = df[df['Hit'] == 1] # separate players who did get a hit
len(df_no_hit) # get number of players who did not get a hit
len(df_hit) # get number of players who did get a hit
# downsample df to only include 2000 players who did not get a hit
df_no_hit_downsampled = resample(df_no_hit,
replace=False,
n_samples=2000,
random_state=42)
len(df_no_hit_downsampled)
# downsample df to only include 2000 players who did get a hit
df_hit_downsampled = resample(df_hit,
replace=False,
n_samples=2000,
random_state=42)
len(df_hit_downsampled)
# finally, combine the two separate downsampled data frames into a single df
df_downsample = pd.concat([df_no_hit_downsampled, df_hit_downsampled])
len(df_downsample)
Now that we have preprocessed our data, we are ready to start formatting the data for making models.
The first step is to split the data into two parts: the columns we'll use to make predictions, and the column we want to predict. We will use the conventional notation of X to represent the independent variables, the columns of data that we will use to make classifications, and y to represent the dependent variable, the thing we want to predict. In this case, we want to predict Hit, whether or not a player will get a hit in a game.
NOTE: In the code below we are using copy() to copy the data by value. By default, pandas uses copy by reference. Using copy() ensures that the original data df_downsample is not modified when we modify X or y. So if we make a mistake when formatting the columns for our classification models, we can just re-copy df_downsample rather than reload and re-process the original data.
X = df_downsample.drop('Hit', axis=1).copy() # separate our independent variables
X.head()
y = df_downsample['Hit'].copy() # separate our dependent variable
y.head()
Now that our variables are correctly defined, we'll once again split the data, this time into training and testing sets. Doing this ensures that we have not only a way to train our Machine Learning models, but a way to evaluate them as well. This is akin to giving a student a practice test or study guide to learn from, then evaluating them on a real test with similar problems.
## Split the data into training and testing data,
## a random state variable is also passed for replication purposes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 25)
Several of the algorithms we'll use work best when the data provided is centered and scaled prior to training. The Radial Basis Function (RBF) that we are using with our Support Vector Machine assumes that the data is centered and scaled. In other words, each column should have a mean value = 0 and a standard deviation = 1. Our Neural Network algorithm also performs best with centered and scaled data as it utilizes Gradient Descent. If features are not scaled properly, the various ranges of features can cause inconsistencies in the steps taken during Gradient Descent.
So, to get the best out of our models, we will scale both the training and testing datasets. Specifically, we'll split the data into training and testing datasets and then scale them separately to avoid Data Leakage. Data Leakage occurs when information about the training dataset corrupts or influences the testing dataset.
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
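As an optional sanity check, assuming the scaler fit above, each column of the scaled training data should now have a mean of roughly 0 and a standard deviation of roughly 1:
print(X_train_scaled.mean(axis=0).round(3)) # ~0 for every column
print(X_train_scaled.std(axis=0).round(3)) # ~1 for every column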
Now that we have cleaned, downsampled, split and scaled the data, we can begin constructing our Machine Learning models.
Support Vector Machines work by plotting the training data points of two classes in feature space and then finding the best hyperplane between the data points to separate the distinct classes. The best hyperplane is chosen by maximizing the distance, or margin, between the hyperplane and the points of both classes. The data points closest to the margin are referred to as Support Vectors, as moving them would also move the hyperplane, while moving other data points would not affect it.
Now, let's move on towards making a preliminary Support Vector Machine.
clf_svm = SVC(random_state=15) # construct a SVM model
clf_svm.fit(X_train_scaled, y_train) # fit data to model
# output results of model
plot_confusion_matrix(clf_svm,
X_test_scaled,
y_test,
values_format='d',
display_labels=["Did not get a hit", "Got a hit"])
In the confusion matrix, we see that of the 489 players that did not get a hit, 271 (55%) were correctly classified. And of the 511 players that did get a hit, 289 (56%) were correctly classified. So the Support Vector Machine was only okay. Let's try to improve our predictions using Cross Validation to optimize the parameters.
Optimizing a Support Vector Machine is all about finding the best hyperparameter values for gamma and C. So let's see if we can find better values than the preliminary Support Vector Machine's defaults using cross validation, in hopes of improving accuracy on the Testing Dataset.
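To build some intuition for gamma, here is a minimal sketch of the RBF kernel itself (the two points are hypothetical; the formula is the standard K(x, z) = exp(-gamma * ||x - z||^2) that SVC uses under the hood):
import numpy as np
def rbf_kernel(x, z, gamma):
    # similarity decays with squared distance; larger gamma means faster decay
    return np.exp(-gamma * np.sum((x - z) ** 2))
x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(rbf_kernel(x, z, gamma=0.12)) # 0.12 is the value our grid search settles on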
Since we have two parameters to optimize, we will use GridSearchCV(). We specify several potential values for gamma and C, and GridSearchCV() tests all possible combinations of the parameters for us.
num_features = np.size(X_train_scaled, axis=1) # get number of features
param_grid = {
'C': [0.5, 1, 10,],
'gamma': ['scale', 1/num_features, 1, 0.5, 0.25, 0.12, 0.05,],
}
# we are including C=1 and gamma='scale' as they are default values.
optimal_params = GridSearchCV(
SVC(random_state=15),
param_grid,
cv=3,
scoring='f1_micro',
## For more scoring metrics see:
## https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
verbose=0 # if you want to see what Grid Search is doing, set verbose=2
)
optimal_params.fit(X_train_scaled, y_train) # find best params
print(optimal_params.best_params_) # output best params
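In addition to best_params_, GridSearchCV() also exposes the cross-validated score achieved by that best combination, which is handy when comparing search rounds:
print(optimal_params.best_score_) # mean cross-validated f1_micro of the best parameter combination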
Now that we have optimized the C and gamma parameters for our SVM, let's construct our final model.
clf_svm = SVC(C=1, gamma=0.12,random_state=15)
clf_svm.fit(X_train_scaled, y_train)
plot_confusion_matrix(clf_svm,
X_test_scaled,
y_test,
values_format='d',
display_labels=["Did not get a hit", "Got a hit"])
In the confusion matrix, we see that of the 489 players that did not get a hit, 275 (56%) were correctly classified. And of the 511 players that did get a hit, 296 (58%) were correctly classified. So the Support Vector Machine was only slightly improved but we now have a baseline model to compare to.
Now that we have optimized our SVM Model, we can move on to building our Neural Network model. Neural Networks work by utilizing layers of nodes, or neurons, similar to how the human brain works. The initial input layer represents the features in the dataset; several hidden layers of nodes then sit between the input layer and the output layer. Connecting the layers are channels that hold numerical weights, and the hidden layers hold threshold values. If the output of an individual node is above the specified threshold value, that node is activated and data is sent to the next layer of the network; otherwise, no data is passed along to the next layer.
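To make the idea concrete, a single node's output is just an activation function applied to a weighted sum of its inputs plus a bias; a toy sketch (the numbers below are hypothetical, not taken from our model):
import numpy as np
x = np.array([0.5, -1.2, 3.0]) # hypothetical inputs from the previous layer
w = np.array([0.8, 0.1, -0.4]) # hypothetical channel weights
b = 0.2 # bias term, shifting the activation threshold
print(np.tanh(np.dot(w, x) + b)) # tanh activation, the same one our final model will use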
Now that we have a brief understanding of how Neural Networks work, let's begin by building a baseline model.
clf_nn = MLPClassifier(random_state=15).fit(X_train_scaled, y_train) # Create a Neural Network and fit it to the training data.
plot_confusion_matrix(clf_nn, X_test_scaled, y_test, display_labels=["Did not get a hit", "Got a hit"])
In the confusion matrix, we see that of the 489 players that did not get a hit, 255 (52%) were correctly classified. And of the 511 players that did get a hit, 295 (58%) were correctly classified. So the preliminary Neural Network model was not very good and seemingly overfit.
Now that we have built our preliminary Neural Network model, we can begin to optimize by looking for ideal values for the max_iter, hidden_layer_sizes, learning_rate_init, beta_1, and beta_2 hyperparameters. We will once again use GridSearchCV() with a list of several possible values for each parameter and test the various combinations.
## Warning: this code can take several minutes to run depending on the system it's run on;
## adding extra test parameters increases the run time exponentially.
param_grid = {
'max_iter':[100, 165, 220,],
'hidden_layer_sizes': [(19,),(21,)],
'learning_rate_init':[ 0.0005, 0.0009, 0.005],
'activation':['tanh'],
'beta_1' : [0.75, 0.82, 0.9,],
'beta_2' : [0.75, 0.82, 0.9,],
}
optimal_params = GridSearchCV(
MLPClassifier(random_state=15),
param_grid,
scoring='f1_micro',
verbose=0,
n_jobs=10,
cv=3
)
optimal_params.fit(X_train_scaled, y_train)
print(optimal_params.best_params_)
And we see that the ideal values are learning_rate_init = 0.0009, beta_1 = 0.82, beta_2 = 0.75, hidden_layer_sizes = (19,), and max_iter = 165.
Now that we have the ideal values for learning_rate_init, hidden_layer_sizes, beta_1, beta_2, and max_iter, we can build the final Neural Network model:
clf_nn = MLPClassifier(
activation='tanh',
learning_rate_init=0.0009,
hidden_layer_sizes = (19,),
beta_1=0.82,
beta_2=0.75,
max_iter=165,
random_state = 15
).fit(X_train_scaled, y_train)
plot_confusion_matrix(clf_nn, X_test_scaled, y_test, display_labels=["Did not get a hit", "Got a hit"])
clf_nn.score(X_test_scaled, y_test) # get accuracy of model
In the confusion matrix, we see that of the 489 players that did not get a hit, 277 (56%) were correctly classified. And of the 511 players that did get a hit, 277 (54%) were correctly classified. So the Neural Network model performed worse in terms of correct true positives, but the overall accuracy has slightly improved and the model is no longer overfit.
We can now move on to making a K Nearest Neighbors model. K Nearest Neighbors models work by plotting known training data in a defined space for specific features. Unknown testing data is then placed in the same space and compared to the nearest k known training points. The class of the unknown data point is then determined via a majority vote by those k nearest neighbors, with the distance of those neighbors taken into account. K Nearest Neighbors is one of the simplest yet most effective Machine Learning algorithms and is a good starting point for supervised classification problems like ours.
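A toy sketch of the idea (the points below are hypothetical, arranged in two obvious clusters):
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]) # two obvious clusters
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_knn = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_toy, y_toy)
print(toy_knn.predict([[4, 4]])) # its 3 nearest neighbors are all class 1, so it predicts [1]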
Now let's begin by making a preliminary K Nearest Neighbors model.
# Create a model and fit it to the training data
clf_knn = KNeighborsClassifier().fit(X_train, y_train)
plot_confusion_matrix(clf_knn, X_test, y_test, display_labels=["Did not get a hit", "Got a hit"])
In the confusion matrix, we see that of the 489 people that did not get a hit, 271 (55%) were correctly classified. And of the 511 people that did get a hit, 270 (53%) were correctly classified. So the preliminary K-NN model was not great.
Now that we have built our preliminary K Nearest Neighbors model, we can begin to optimize by looking for ideal values for the n_neighbors and metric hyperparameters. We will once again use GridSearchCV() with a list of several possible values for each parameter and test the various combinations.
# Round 1
param_grid = {
'n_neighbors':[5, 10, 15, 20, 25, 35, 40, 45, 50],
'metric':['euclidean', 'manhattan', 'minkowski', 'mahalanobis'], # note: mahalanobis also requires metric_params (a covariance matrix) to be usable
}
optimal_params = GridSearchCV(
KNeighborsClassifier(),
param_grid,
scoring='f1', # f1_micro
verbose=0,
n_jobs=10,
cv=3
)
optimal_params.fit(X_train, y_train)
print(optimal_params.best_params_)
Now that we have the ideal values for n_neighbors and metric, we can build the final K Nearest Neighbors model:
# Create a model and fit it to the training data;
# weights='distance' makes closer neighbors count more heavily in the vote
clf_knn = KNeighborsClassifier(metric='euclidean', n_neighbors=25, weights='distance').fit(X_train, y_train)
plot_confusion_matrix(clf_knn, X_test, y_test, display_labels=["Did not get a hit", "Got a hit"])
In the confusion matrix, we see that of the 489 people that did not get a hit, 291 (59%) were correctly classified. And of the 511 people that did get a hit, 309 (60%) were correctly classified. So the optimized K-NN model is noticeably improved over the preliminary one.
Now that we are done with our K Nearest Neighbors model, let's move on to Decision Trees. Decision Trees work by attempting to predict the value of a target variable by learning simple decision rules inferred from the data features. Decision Trees are a good starting algorithm to model because they are simple to interpret, easily visualized, and work well with various data types. However, some downsides to Decision Trees are their tendency to become overfit and complex with large datasets.
With a brief understanding of how Decision Trees work, let's begin by making a preliminary model.
clf_dt = DecisionTreeClassifier(random_state=35)
clf_dt = clf_dt.fit(X_train, y_train)
plot_confusion_matrix(clf_dt,
X_test,
y_test,
display_labels=["Did not get a hit", "Got a hit"])
plt.figure(figsize=(200, 100))
plot_tree(clf_dt,
fontsize=10,
filled = True,
rounded = True,
class_names = ["Not a hit","Hit"],
feature_names = X.columns);
In the confusion matrix, we see that of the 489 players that did not get a hit, 264 (54%) were correctly classified. And of the 511 players that did get a hit, 289 (56%) were correctly classified. So the Decision Tree was not great. And after plotting the entire tree, we can see it is very deep and complex, hinting at overfitting, so let's try to improve the tree with Pruning.
Decision Trees are known to overfit the Training Data, and there are many parameters, such as max_depth and min_samples, that are designed to reduce overfitting. However, pruning a tree with Cost Complexity Pruning is one of the simpler processes for finding a smaller tree that can improve prediction accuracy on testing data.
Pruning a Decision Tree is all about finding an ideal value for the pruning parameter alpha, which controls how little or how much pruning happens. One method of finding the optimal alpha is to plot the accuracy of the tree as a function of different alpha values. We'll utilize this method with both our training and testing datasets.
First, let's find the different alpha values available for this tree and build a pruned tree for each value.
path = clf_dt.cost_complexity_pruning_path(X_train,y_train) # determine values for alphas
ccp_alphas = path.ccp_alphas # extract different values for alphas
ccp_alphas = ccp_alphas[:-1] # exclude the maximum value for alpha; this value would prune ALL leaves, leaving only a stump
clf_dts = [] # array to hold decision trees
# create one decision tree per value for alpha and store it in the array
for ccp_alpha in ccp_alphas:
clf_dt = DecisionTreeClassifier(random_state = 35, ccp_alpha=ccp_alpha)
clf_dt.fit(X_train,y_train)
clf_dts.append(clf_dt)
Now we'll graph the accuracy of the trees using the training and testing datasets as functions of alpha.
train_scores = [clf_dt.score(X_train,y_train) for clf_dt in clf_dts]
test_scores = [clf_dt.score(X_test,y_test) for clf_dt in clf_dts]
fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label='train', drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label='test', drawstyle="steps-post")
ax.legend()
plt.show()
In the graph above, we can see the accuracy of the testing dataset is highest when alpha is about 0.0015. After this value, the accuracy of both datasets begins to drop off slightly. Let's verify our findings with cross validation.
alpha_loop_values = []
## for each candidate alpha value, we'll run 3-fold cross validation
## then we'll store the mean and standard deviation of the scores for each call
## to cross_val_score and alpha_loop_values
for ccp_alpha in ccp_alphas:
clf_dt = DecisionTreeClassifier(random_state = 35, ccp_alpha=ccp_alpha)
scores = cross_val_score(clf_dt, X_train, y_train, cv=3)
alpha_loop_values.append([ccp_alpha, np.mean(scores), np.std(scores)])
# now we'll draw a graph of the means and standard deviations of the scores for each alpha value
alpha_results = pd.DataFrame(alpha_loop_values,
columns=['alpha','mean_accuracy','std'])
alpha_results.plot(x='alpha',
y='mean_accuracy',
yerr='std',
marker='o',
linestyle='--')
By using cross validation and drawing the graph, we can see that accuracy is still at its highest when alpha is just below 0.0015. So let's see if we can extract an exact value to use.
alpha_results[(alpha_results['alpha'] > 0.0013)
&
(alpha_results['alpha'] < 0.0015)]
# store series of alpha values
ideal_ccp_alpha = alpha_results[(alpha_results['alpha'] > 0.0013)
&
(alpha_results['alpha'] < 0.0015)]['alpha']
ideal_ccp_alpha
And from here we'll select a final alpha value. In our case, we'll use the value at index 316, as it performs well and has a low standard deviation.
ideal_ccp_alpha = float(ideal_ccp_alpha[316]) # select desired alpha value and convert from series to float
ideal_ccp_alpha
And now we have an ideal_ccp_alpha to be used as alpha when we build our final tree. Now that we have an ideal value for ccp_alpha, we can build the final Decision Tree model:
clf_dt_pruned = DecisionTreeClassifier(random_state=35, ccp_alpha=ideal_ccp_alpha)
clf_dt_pruned = clf_dt_pruned.fit(X_train, y_train)
plot_confusion_matrix(clf_dt_pruned,
X_test,
y_test,
display_labels=["Did not get a hit", "Got a hit"])
plt.figure(figsize=(50, 25))
plot_tree(clf_dt_pruned,
fontsize=10,
filled = True,
rounded = True,
class_names = ["Not a hit","Hit"],
feature_names = X.columns);
In the confusion matrix, we see that of the 489 players that did not get a hit, 274 (56%) were correctly classified. And of the 511 players that did get a hit, 280 (55%) were correctly classified. So the final Decision Tree model performs similarly in terms of total correct predictions. However, the final model is much simpler and no longer overfit.
Since our Decision Tree model was not the best, we can move on to another tree-based model in hopes of better performance. Random Forests are also tree models like Decision Trees; however, Random Forest models utilize ensemble learning, using the outcomes of multiple decision trees to reach a conclusion rather than relying on a single tree. Because of this, Random Forests are generally less prone to overfitting.
So let's go ahead and build a preliminary model.
clf_rf = RandomForestClassifier(random_state=35)
clf_rf.fit(X_train, y_train)
plot_confusion_matrix(clf_rf,
X_test,
y_test,
values_format='d',
display_labels=["Did not get a hit", "Got a hit"])
clf_rf.score(X_test, y_test)
In the confusion matrix, we see that of the 489 players that did not get a hit, 296 (60%) were correctly classified. And of the 511 players that did get a hit, 262 (51%) were correctly classified. So the initial Random Forest model was not very good and seems to be overfit for predicting true negatives.
plt.figure(figsize=(200, 100))
plot_tree(clf_rf.estimators_[3],
fontsize=12,
filled = True,
rounded = True,
class_names = ["Not a hit","Hit"],
feature_names = X.columns);
If we take a look at one of our ensemble trees, we see that the trees being used can be rather deep and complex, which also hints at overfitting taking place.
Now that we have built our preliminary Random Forest model and noticed evidence of overfitting, we can begin to optimize by looking for ideal values for the max_depth, max_features, and n_estimators hyperparameters. We will once again use GridSearchCV() with a list of several possible values for each parameter and test the various combinations.
param_grid = [
{'max_depth': [3, 4, 5],
'max_features': [5, 10, 'sqrt', 'log2'],
'n_estimators': [50, 65, 100, 200]},
]
optimal_params = GridSearchCV(
RandomForestClassifier(random_state=35),
param_grid,
cv=3,
scoring='roc_auc',
## For more scoring metrics see:
## https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
verbose=0 # if you want to see what GridSearchCV is doing, set verbose=2
)
optimal_params.fit(X_train, y_train)
print(optimal_params.best_params_)
Now that GridSearchCV() has run, we see that the ideal values are max_features = 10, n_estimators = 65, and max_depth = 4.
Now that we have the ideal values for max_depth, n_estimators, and max_features, we can build the final Random Forest model:
clf_rf = RandomForestClassifier(max_depth=4, n_estimators=65, max_features=10, random_state=35)
clf_rf.fit(X_train, y_train)
plot_confusion_matrix(clf_rf,
X_test,
y_test,
values_format='d',
display_labels=["Did not get a hit", "Got a hit"])
clf_rf.score(X_test, y_test)
In the confusion matrix, we see that of the 489 players that did not get a hit, 280 (57%) were correctly classified. And of the 511 players that did get a hit, 280 (55%) were correctly classified. So the final Random Forest model was only slightly improved in terms of total correct predictions, but is no longer overfit.
plt.figure(figsize=(90, 30))
plot_tree(clf_rf.estimators_[3],
fontsize=25,
filled = True,
rounded = True,
class_names = ["Not a hit","Hit"],
feature_names = X.columns);
We can also see that the trees being used are now much shallower, simpler, and less prone to overfitting. So we can safely say the model was improved.
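As a side note, we can verify the ensemble idea directly: a Random Forest prediction is essentially an aggregate of its trees' votes. (Strictly, scikit-learn averages the trees' class probabilities, but a hard majority vote of the fitted clf_rf above gives nearly the same answer.)
import numpy as np
tree_votes = np.array([tree.predict(X_test.to_numpy()) for tree in clf_rf.estimators_]) # one row of votes per tree
majority_vote = (tree_votes.mean(axis=0) > 0.5).astype(int) # majority vote across the 65 trees
print((majority_vote == clf_rf.predict(X_test)).mean()) # fraction of agreement with the forest, ~1.0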
Now that we've gone ahead and produced one ensemble method, let's try a different type of ensemble tree-based model, XGBoost. XGBoost and Random Forest models differ in that, rather than producing a collection of decision trees that are independent of each other, XGBoost utilizes Boosting in order to create a collection of weak and strong learner trees that build upon each other and compensate for the weaknesses of their predecessors.
Now that we've briefly explained XGBoost, let's build a baseline model.
# Create a model and fit it to the training data
clf_xgb = xgb.XGBClassifier(objective='binary:logistic', missing=1, use_label_encoder=False, seed=42)
clf_xgb.fit(X_train,
y_train,
verbose= True,
early_stopping_rounds=10,
eval_metric='aucpr',
eval_set=[(X_test, y_test)])
plot_confusion_matrix(clf_xgb,X_test, y_test)
In the confusion matrix, we see that of the 489 players that did not get a hit, 269 (55%) were correctly classified. And of the 511 players that did get a hit, 277 (54%) were correctly classified. So the preliminary XGBoost model was okay but not great.
Now that we have a preliminary XGBoost model, let's once again try to optimize it by searching for optimal hyperparameters with GridSearchCV(). For XGBoost models, we'll attempt to find ideal values for the reg_alpha, reg_lambda, and max_depth hyperparameters.
param_grid = {
'max_depth':[5, 4, 2],
'reg_lambda':[5, 4, 3,],
'reg_alpha':[1, 2, 3,],
}
optimal_params = GridSearchCV(
estimator=xgb.XGBClassifier(objective= 'binary:logistic',
seed=42,
subsample=0.8,
use_label_encoder=False,
colsample_bytree=0.5),
param_grid=param_grid,
scoring='f1_micro',
verbose=0,
n_jobs=10,
cv=3
)
optimal_params.fit( X_train,
y_train,
early_stopping_rounds=10,
verbose= False,
eval_metric='aucpr',
eval_set=[(X_test, y_test)])
print(optimal_params.best_params_)
By optimizing with GridSearchCV(), we now see the ideal values are reg_alpha = 2, reg_lambda = 5, and max_depth = 4. We'll now use these values in the final XGBoost model.
Now that we have the ideal values for reg_alpha, reg_lambda, and max_depth, we can build the final XGBoost model:
clf_xgb = xgb.XGBClassifier(objective='binary:logistic',
missing= 1,
max_depth= 4,
reg_lambda= 5,
reg_alpha= 2,
use_label_encoder=False)
clf_xgb.fit(X_train,
y_train,)
plot_confusion_matrix(clf_xgb, X_test, y_test)
clf_xgb.score(X_test, y_test)
In the confusion matrix, we see that of the 489 players that did not get a hit, 280 (57%) were correctly classified. And of the 511 players that did get a hit, 287 (56%) were correctly classified. So the XGBoost model was improved slightly.
Now that we've made and optimized Support Vector Machine, Neural Network, K Nearest Neighbors, Decision Tree, Random Forest, and XGBoost models, let's review and compare the final models.
clf_svm = SVC(C=1, gamma=0.12,random_state=15)
clf_svm.fit(X_train_scaled, y_train)
plot_confusion_matrix(clf_svm,
X_test_scaled,
y_test,
values_format='d',
display_labels=["Did not get a hit", "Got a hit"])
y_pred = clf_svm.predict(X_test_scaled)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
As we review our final Support Vector Machine, looking solely at the confusion matrix suggests the model is better at predicting True Positives than True Negatives. However, that reading ignores the slight imbalance between True Positives and True Negatives in the test set. It is therefore useful to look at other metrics, such as precision ("the proportion of positive identifications that were actually correct"), recall ("the proportion of actual positives that were correctly identified"), and f1 score (the harmonic mean of precision and recall). Our classification report shows that although all three scoring metrics are higher for positive predictions, they are only about 2% higher than for negative predictions. So, given a completely balanced dataset, we should expect a more even distribution of predictions than the confusion matrix might lead us to believe. These metrics are good for measuring performance on positive predictions, but for problems where negative predictions are equally important, it is better to use the Matthews correlation coefficient (MCC), as it "takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes". While our other metrics are on a scale of 0 to 1, the MCC is on a scale of -1 to +1, and anything over 0 is better than random guessing. Our value of 0.142, although not great, isn't terrible either, and provides a solid baseline score to compare our other models against.
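For intuition, the MCC can also be computed by hand from the four cells of the confusion matrix; using the counts from our SVM matrix above:
import math
tp, tn = 296, 275 # correct "got a hit" / "did not get a hit" predictions from the matrix above
fp, fn = 489 - tn, 511 - tp # the misclassified players in each class
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(round(mcc, 3)) # 0.142, matching matthews_corrcoef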
clf_nn = MLPClassifier(activation='tanh',
learning_rate_init=0.0009,
hidden_layer_sizes = (19,),
beta_2=0.75,
beta_1=0.82,
max_iter=165,
random_state = 15
).fit(X_train_scaled, y_train)
plot_confusion_matrix(clf_nn, X_test_scaled, y_test, display_labels=["Did not get a hit", "Got a hit"])
y_pred = clf_nn.predict(X_test_scaled)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
From a quick glance at the confusion matrix for our Neural Network model, it's clear this model did not perform as well as our Support Vector Machine. However, by examining our other scoring metrics, we can begin to understand the finer differences between our models. Our Support Vector Machine had identical precision and recall scores for each class, indicating a very balanced model in terms of positive predictions. Our Neural Network model, however, has slightly differing precision and recall scores, indicating it predicts a larger proportion of True Negatives than True Positives, but that we can be more confident in its True Positive predictions than in its True Negative predictions. We can also see that its MCC of 0.109, although still above 0, is noticeably lower than our SVM's 0.142. So it's safe to say the SVM is the better performer of the two models.
# Create a model and fit it to the training data
clf_knn = KNeighborsClassifier(metric='euclidean', n_neighbors=25, weights='distance').fit(X_train, y_train)
plot_confusion_matrix(clf_knn,
X_test,
y_test,
display_labels=["Did not get a hit", "Got a hit"])
y_pred = clf_knn.predict(X_test)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
Taking another look at the confusion matrix for our K Nearest Neighbors model, we can see vast improvements in performance relative to the Neural Network model. This model is similar to our SVM model in that we have more correct positive predictions than negative predictions. However, once we take class balance into account and look at the precision and recall scores, the model actually performs similarly for both classifications. We also notice the MCC is the highest of our models so far at 0.2, almost double our Neural Network's 0.109 and well above our SVM's 0.142. With the performance seen in the confusion matrix and the high MCC, our K Nearest Neighbors model is easily the best of the bunch so far.
clf_dt_pruned = DecisionTreeClassifier(random_state=35, ccp_alpha=ideal_ccp_alpha)
clf_dt_pruned = clf_dt_pruned.fit(X_train, y_train)
plot_confusion_matrix(clf_dt_pruned,
X_test,
y_test,
display_labels=["Did not get a hit", "Got a hit"])
y_pred = clf_dt_pruned.predict(X_test)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
Looking at the confusion matrix for our Decision Tree model, it's clear the performance is not good enough. A glance at the classification report and MCC shows performance similar to our Neural Network model, the worst of our previously examined models. There's no need to dig deep into the poor performance of this model when we already have a significantly better performing one, so let's move on to the next model.
clf_rf = RandomForestClassifier(max_depth=4, n_estimators=65, max_features=10, random_state=35)
clf_rf.fit(X_train, y_train)
plot_confusion_matrix(clf_rf,
X_test,
y_test,
values_format='d',
display_labels=["Did not get a hit", "Got a hit"])
y_pred = clf_rf.predict(X_test)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
Another quick glance at the confusion matrix shows another poorly performing model relative to our best. Although an MCC of 0.121 indicates our Random Forest model performs better than our Neural Network and Decision Tree models, it's still third best behind our K Nearest Neighbors and SVM models. So we'll now move on to examining our final model, XGBoost.
clf_xgb = xgb.XGBClassifier(objective='binary:logistic',
missing= 1,
max_depth= 4,
reg_lambda= 5,
reg_alpha= 2,
use_label_encoder=False)
clf_xgb.fit(X_train,
y_train,)
plot_confusion_matrix(clf_xgb, X_test, y_test)
y_pred = clf_xgb.predict(X_test)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
A look at our final confusion matrix shows that although performance is better than most of the previous models we've examined, XGBoost still lags behind our K Nearest Neighbors model. While an MCC of 0.134 is better than our Random Forest model's, it's still slightly behind our SVM's 0.142.
svc = SVC(C=1, gamma=0.12,random_state=15)
svc.fit(X_train_scaled, y_train)
knn = KNeighborsClassifier(metric='euclidean', n_neighbors=25, weights='distance')
knn.fit(X_train, y_train)
svc_disp = plot_roc_curve(svc, X_test_scaled, y_test)
knn_disp = plot_roc_curve(knn, X_test, y_test, ax=svc_disp.ax_)
knn_disp.figure_.suptitle("SVM/K-NN ROC Curve Comparison")
plt.show()
As a final analysis of model performance, we'll take a look at the ROC curves of our top two performing models.
Receiver Operating Characteristic curves are a way to visualize the trade-off between Sensitivity (the True Positive Rate) and Specificity (1 – the False Positive Rate). ROC curves are commonly used to summarize a model's ability to predict classifications, in combination with the AUC (Area Under the Curve). A perfect curve would rise immediately as a vertical line that turns horizontal at the very top of the graph, producing an AUC of 1, whereas a 45-degree diagonal curve would represent average performance, akin to random guessing, and would produce an AUC of 0.5.
As we can see in our graph, our top performing K Nearest Neighbors model produces a better ROC curve than our SVM. After calculating the AUC of the two curves, we can conclude that the K Nearest Neighbors model is about 4% more accurate than the SVM model. To many, 4% may not seem like a significant difference. However, given the context of our problem and how much closer we are to the average AUC of 0.5 than to a perfect AUC of 1, 4% is a sizeable difference.
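For completeness, here is a sketch of how the two AUC values can be computed with the svc and knn models fitted above (the SVM scores points with its decision function, while K-NN provides class probabilities):
from sklearn.metrics import roc_auc_score
svc_auc = roc_auc_score(y_test, svc.decision_function(X_test_scaled)) # AUC from the SVM's continuous scores
knn_auc = roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1]) # AUC from K-NN's probability of class 1
print(round(svc_auc, 3), round(knn_auc, 3))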
Now that we've established that our K Nearest Neighbors is the best of the bunch, the only thing left to do is save the model for later use!
import pickle
pickle.dump(clf_knn, open("mlb_hitter_model.pickle", "wb"))
Finally, we save our model in order to load and prepare for deployment later!
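When the time comes, loading the model back is the mirror image of the dump above (a quick sketch, reusing the same filename):
loaded_model = pickle.load(open("mlb_hitter_model.pickle", "rb")) # load the pickled K-NN model
print(loaded_model.score(X_test, y_test)) # confirm it still scores the test set as before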
Pedregosa, F. et al., 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct), pp.2825–2830.
Starmer, J., 2018. StatQuest. [online] Available at: https://statquest.org/
Google Developers, 2022. Classification: Precision and Recall | Machine Learning Crash Course. [online] Available at: https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall [Accessed 11 February 2022].
Wikipedia contributors, 2022. Phi coefficient. In Wikipedia, The Free Encyclopedia. Available at: https://en.wikipedia.org/w/index.php?title=Phi_coefficient&oldid=1068832478 [Accessed 16 February 2022].
Wikipedia contributors, 2022. Receiver operating characteristic. In Wikipedia, The Free Encyclopedia. Available at: https://en.wikipedia.org/w/index.php?title=Receiver_operating_characteristic&oldid=1067917334 [Accessed 16 February 2022].
Chan, C., 2018. What is a ROC Curve and How to Interpret It. [online] Displayr. Available at: https://www.displayr.com/what-is-a-roc-curve-how-to-interpret-it/ [Accessed 12 February 2022].
XGBoost developers, 2021. XGBoost Documentation. [online] Available at: https://xgboost.readthedocs.io/en/stable/index.html [Accessed 8 February 2022].