Author

This notebook was written by Nicholas Rosas

LinkedIn: https://www.linkedin.com/in/nickrosas96/

GitHub: https://github.com/nr96

Introduction

The purpose of this notebook is to demonstrate the data mining process utilized to implement Machine Learning into my MLB Daily Hitter program. The program itself was originally created to streamline and automate my player selection process when competing in MLB's Beat The Streak seasonal competition, where participants compete for a grand prize of $5.6 million by attempting to build a streak of 57 consecutive correct picks of MLB players who record a hit in a single game. Although the models demonstrated here are not the exact models to be utilized in my program, the process and final outcomes are very similar.

Q & A

  • Q: What is the problem to be solved?

    A: Given a collection of features or stats for a player prior to game start, can we predict whether that player will get a hit in the game?

  • Q: How will this problem be approached?

    A: As our dataset contains labeled data and our target variable is one of two classes, we will approach this problem as a Supervised Classification problem and will utilize several Machine Learning algorithms appropriate to the problem in order to create models that will make a prediction on a player.

  • Q: How will performance be judged?

    A: We will judge the individual model performance using a confusion matrix, and hyperparameters will be tuned using several performance metrics, such as accuracy, F1 score, and ROC AUC.

  • Q: Are methods utilized fully explained?

    A: Although I try to briefly explain the various data science and machine learning methods and processes utilized throughout the notebook, I do not explain them in great detail. This notebook is only meant to be a demonstration of the data science process used to solve a real world baseball problem.

Setup

Before we begin our data analysis, we must first ensure all required libraries are installed and imported correctly.

In [1]:
!type python3 # check directory location of python environment
python3 is /home/ec2-user/anaconda3/envs/python3/bin/python3
In [2]:
!/home/ec2-user/anaconda3/envs/python3/bin/python3 -m pip install  xgboost # Download / install xgboost 
Requirement already satisfied: xgboost in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (1.5.2)
Requirement already satisfied: scipy in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from xgboost) (1.5.3)
Requirement already satisfied: numpy in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from xgboost) (1.19.5)

Library Imports

The very first thing we do is load the various python modules used as a part of our workflow. These modules give us extra functionality to import the data, clean it up and format it, and then build, evaluate and draw the various machine learning models.

In [3]:
import xgboost as xgb # Warning, XGBoost must first be installed before importing 
In [4]:
import pandas as pd # pandas is used to load and manipulate data 
import numpy as np # numpy is used to calculate the mean and standard deviation

from sklearn.model_selection import train_test_split # splits data into training and testing sets
from sklearn import preprocessing # scales and centers data
from sklearn.decomposition import PCA # to perform PCA to plot the data
from sklearn.model_selection import GridSearchCV # this will do cross validation and help tune hyperparameters
from sklearn.model_selection import cross_val_score # for cross validation 

from sklearn.metrics import confusion_matrix, plot_confusion_matrix # to make confusion matrices
from sklearn.metrics import balanced_accuracy_score, roc_auc_score, make_scorer # various scoring metrics
from sklearn.metrics import classification_report, accuracy_score # more scoring metrics
from sklearn.metrics import matthews_corrcoef, plot_roc_curve # even more scoring metrics

import matplotlib.pyplot as plt # matplotlib is for drawing graphs
import matplotlib.colors as colors
from sklearn.utils import resample # downsample the dataset
from sklearn.tree import plot_tree # to draw classification tree

from sklearn.neural_network import MLPClassifier # this will make a Neural network
from sklearn.tree import DecisionTreeClassifier # this will make a Classification tree
from sklearn.svm import SVC # this will make a Support Vector Machine for classification
from sklearn.neighbors import KNeighborsClassifier # this will make a K Nearest Neighbors model
from sklearn.ensemble import RandomForestClassifier # this will make a Random Forest model

import os
import boto3
import re
import sagemaker # AWS
from sagemaker import get_execution_role # AWS

Data Preparation

Before we begin creating our ML Models, we want to make sure the data being used is high quality and informative. We can achieve this via data cleaning, data transformation and data reduction. Data preparation or preprocessing is the most important step in any data science workflow, as the quality of our output will directly match the quality of our input.

Load the Data

Before we can do anything with our data, we must first load it into a pandas data frame. When pandas (pd) reads in data, it returns a data frame, which is a lot like a spreadsheet. The data are organized in rows and columns and each row can contain a mixture of text and numbers. The standard variable name for a data frame is the initials df, and that is what we will use here:

In [5]:
# load data from AWS S3 Bucket 

role = get_execution_role()
region = boto3.Session().region_name

bucket = 'sagemaker-studio-298378432906-ea5batqypv8'
data_key = 'mlb_hitters_2018.csv'

bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region,bucket)
data_location = 's3://{}/{}'.format(bucket, data_key)
In [6]:
df = pd.read_csv(data_location,header=0) # load data into pandas df
df.head() # preview first 5 rows of data
Out[6]:
Unnamed: 0 BatterID Date H/A_AB H/A_BA H/A_BB H/A_DK H/A_FD H/A_H H/A_HR ... vsP_BA vsP_BB vsP_HR vsP_IBB vsP_OBP vsP_OPS vsP_PA vsP_RBI vsP_SLG vsP_SO
0 37 mathije01 20180710 51.0 0.176 5.0 4.000 5.300 9.0 0.0 ... 0.200 0 1 0 0.200 1.000 5 1 0.800 0
1 45 mathije01 20180828 83.0 0.241 7.0 5.320 7.052 20.0 0.0 ... 0.375 1 1 1 0.444 1.444 9 1 1.000 1
2 9 mathije01 20180901 89.0 0.225 7.0 4.926 6.530 20.0 0.0 ... 0.091 0 1 0 0.091 0.455 11 1 0.364 4
3 13 mathije01 20180904 60.0 0.183 8.0 3.250 4.165 11.0 0.0 ... 0.200 0 0 0 0.200 0.400 5 1 0.200 2
4 54 mathije01 20180908 64.0 0.188 9.0 3.364 4.341 12.0 0.0 ... 0.222 1 0 1 0.300 0.522 10 0 0.222 3

5 rows × 74 columns

In our Data Frame we see a bunch of columns for the various stats collected for each player. The columns are...

  • BatterID: The ID number assigned to batter
  • Hit: The dependent variable; whether a batter got a hit in the game (1) or not (0)
  • H/A_AB: Number of Home OR Away At Bats by player prior to game start
  • H/A_BA: Home OR Away Batting Average (hits/at bats) of player prior to game start
  • H/A_BB: Number of Home OR Away Bases on Balls/Walks by player prior to game start
  • H/A_DK: Average of Home OR Away DraftKings points by player prior to game start
  • H/A_FD: Average of Home OR Away FanDuel points by player prior to game start
  • H/A_H: Number of Home OR Away Hits by player prior to game start
  • H/A_HR: Number of Home OR Away Home Runs by player prior to game start
  • H/A_IBB: Number of Home OR Away Intentional Walks by player prior to game start
  • H/A_OBP: Home OR Away On Base Percentage ((H + BB + HBP)/(At Bats + BB + HBP + SF)) of player
  • H/A_OPS: Home OR Away On-Base + Slugging Percentages
  • H/A_PA: Number of Home OR Away Plate Appearances by player prior to game start
  • H/A_RBI: Number of Home OR Away Runs Batted In by player prior to game start
  • H/A_SLG: Home OR Away Slugging Percentage ((1B + 2*2B + 3*3B + 4*HR)/AB) of player
  • H/A_SO: Number of Home OR Away Strikeouts by player prior to game start
  • last15_AB: Number of At Bats over the last 15 games by player prior to game start
  • last15_BA: Batting Average (hits/at bats) over the last 15 games of player prior to game start
  • last15_BB: Number of Bases on Balls/Walks over the last 15 games by player prior to game start
  • last15_DK: Average DraftKings points over the last 15 games by player prior to game start
  • last15_FD: Average of FanDuel points over the last 15 games by player prior to game start
  • last15_H: Number of Hits over the last 15 games by player prior to game start
  • last15_HR: Number of Home Runs over the last 15 games by player prior to game start
  • last15_IBB: Number of Intentional Walks over the last 15 games by player prior to game start
  • last15_OBP: On Base Percentage over the last 15 games of player
  • last15_OPS: On-Base + Slugging Percentages over the last 15 games
  • last15_PA: Number of Plate Appearances over the last 15 games by player prior to game start
  • last15_RBI: Number of Runs Batted In over the last 15 games by player prior to game start
  • last15_SLG: Slugging Percentage over the last 15 games of player
  • last15_SO: Number of Strikeouts over the last 15 games by player prior to game start
  • last30_AB: Number of At Bats over the last 30 games by player prior to game start
  • last30_BA: Batting Average (hits/at bats) over the last 30 games of player prior to game start
  • last30_BB: Number of Bases on Balls/Walks over the last 30 games by player prior to game start
  • last30_DK: Average DraftKings points over the last 30 games by player prior to game start
  • last30_FD: Average of FanDuel points over the last 30 games by player prior to game start
  • last30_H: Number of Hits over the last 30 games by player prior to game start
  • last30_HR: Number of Home Runs over the last 30 games by player prior to game start
  • last30_IBB: Number of Intentional Walks over the last 30 games by player prior to game start
  • last30_OBP: On Base Percentage over the last 30 games of player
  • last30_OPS: On-Base + Slugging Percentages over the last 30 games
  • last30_PA: Number of Plate Appearances over the last 30 games by player prior to game start
  • last30_RBI: Number of Runs Batted In over the last 30 games by player prior to game start
  • last30_SLG: Slugging Percentage over the last 30 games of player
  • last30_SO: Number of Strikeouts over the last 30 games by player prior to game start
  • last7_AB: Number of At Bats over the last 7 games by player prior to game start
  • last7_BA: Batting Average (hits/at bats) over the last 7 games of player prior to game start
  • last7_BB: Number of Bases on Balls/Walks over the last 7 games by player prior to game start
  • last7_DK: Average DraftKings points over the last 7 games by player prior to game start
  • last7_FD: Average of FanDuel points over the last 7 games by player prior to game start
  • last7_H: Number of Hits over the last 7 games by player prior to game start
  • last7_HR: Number of Home Runs over the last 7 games by player prior to game start
  • last7_IBB: Number of Intentional Walks over the last 7 games by player prior to game start
  • last7_OBP: On Base Percentage over the last 7 games of player
  • last7_OPS: On-Base + Slugging Percentages over the last 7 games
  • last7_PA: Number of Plate Appearances over the last 7 games by player prior to game start
  • last7_RBI: Number of Runs Batted In over the last 7 games by player prior to game start
  • last7_SLG: Slugging Percentage over the last 7 games of player
  • last7_SO: Number of Strikeouts over the last 7 games by player prior to game start
  • vsP_AB: Number of At Bats vs Starting Pitcher by player prior to game start
  • vsP_BA: Batting Average (hits/at bats) vs Starting Pitcher of player prior to game start
  • vsP_BB: Number of Bases on Balls/Walks vs Starting Pitcher by player prior to game start
  • vsP_HR: Number of Home Runs vs Starting Pitcher by player prior to game start
  • vsP_IBB: Number of Intentional Walks vs Starting Pitcher by player prior to game start
  • vsP_OBP: On Base Percentage vs Starting Pitcher of player
  • vsP_OPS: On-Base + Slugging Percentages vs Starting Pitcher
  • vsP_PA: Number of Plate Appearances vs Starting Pitcher by player prior to game start
  • vsP_RBI: Number of Runs Batted In vs Starting Pitcher by player prior to game start
  • vsP_SLG: Slugging Percentage vs Starting Pitcher of player
  • vsP_SO: Number of Strikeouts vs Starting Pitcher by player prior to game start

Data Reduction

We begin preparing our data by dropping uninformative columns that don't tell us anything about a player's performance, such as Date and BatterID. But before we do that, we'll save a copy of the original unedited data frame in case we ever need to reference the original data or start from a clean state.

In [7]:
cols = [0,1,2,18,19,20] # columns to drop
og_df = df.copy() # store a copy of the original df 
df.drop(df.columns[cols], axis=1, inplace=True) # drop unnecessary columns from df
df.head() # preview edited df
Out[7]:
H/A_AB H/A_BA H/A_BB H/A_DK H/A_FD H/A_H H/A_HR H/A_IBB H/A_OBP H/A_OPS ... vsP_BA vsP_BB vsP_HR vsP_IBB vsP_OBP vsP_OPS vsP_PA vsP_RBI vsP_SLG vsP_SO
0 51.0 0.176 5.0 4.000 5.300 9.0 0.0 1.0 0.246 0.481 ... 0.200 0 1 0 0.200 1.000 5 1 0.800 0
1 83.0 0.241 7.0 5.320 7.052 20.0 0.0 1.0 0.297 0.622 ... 0.375 1 1 1 0.444 1.444 9 1 1.000 1
2 89.0 0.225 7.0 4.926 6.530 20.0 0.0 1.0 0.278 0.581 ... 0.091 0 1 0 0.091 0.455 11 1 0.364 4
3 60.0 0.183 8.0 3.250 4.165 11.0 0.0 0.0 0.275 0.475 ... 0.200 0 0 0 0.200 0.400 5 1 0.200 2
4 64.0 0.188 9.0 3.364 4.341 12.0 0.0 0.0 0.284 0.503 ... 0.222 1 0 1 0.300 0.522 10 0 0.222 3

5 rows × 68 columns

We now see that the number of columns has dropped from 74 to 68, so we have successfully removed the 6 uninformative columns from our df.

In [8]:
og_df.head()
Out[8]:
Unnamed: 0 BatterID Date H/A_AB H/A_BA H/A_BB H/A_DK H/A_FD H/A_H H/A_HR ... vsP_BA vsP_BB vsP_HR vsP_IBB vsP_OBP vsP_OPS vsP_PA vsP_RBI vsP_SLG vsP_SO
0 37 mathije01 20180710 51.0 0.176 5.0 4.000 5.300 9.0 0.0 ... 0.200 0 1 0 0.200 1.000 5 1 0.800 0
1 45 mathije01 20180828 83.0 0.241 7.0 5.320 7.052 20.0 0.0 ... 0.375 1 1 1 0.444 1.444 9 1 1.000 1
2 9 mathije01 20180901 89.0 0.225 7.0 4.926 6.530 20.0 0.0 ... 0.091 0 1 0 0.091 0.455 11 1 0.364 4
3 13 mathije01 20180904 60.0 0.183 8.0 3.250 4.165 11.0 0.0 ... 0.200 0 0 0 0.200 0.400 5 1 0.200 2
4 54 mathije01 20180908 64.0 0.188 9.0 3.364 4.341 12.0 0.0 ... 0.222 1 0 1 0.300 0.522 10 0 0.222 3

5 rows × 74 columns

And a copy of the original dataframe has correctly been saved as og_df. Next we will also drop several less informative columns as part of our feature selection process.

Utilizing our domain knowledge, we know DK and FD are two ways to score players fantasy points. Although these scoring metrics are slightly different, the information gain should be extremely similar. By keeping both, we could be adding a lot of unnecessary noise for our algorithms to decipher. So let's check the correlation coefficient to validate our hypothesis.

In [9]:
print(df[['H/A_FD', 'H/A_DK','Hit']].corr()['Hit'].to_string() + '\n')
print(df[['last7_FD', 'last7_DK','Hit']].corr()['Hit'].to_string() + '\n')
print(df[['last15_FD', 'last15_DK','Hit']].corr()['Hit'].to_string() + '\n')
print(df[['last30_FD', 'last30_DK','Hit']].corr()['Hit'].to_string() + '\n')
H/A_FD    0.057659
H/A_DK    0.062603
Hit       1.000000

last7_FD    0.025077
last7_DK    0.027717
Hit         1.000000

last15_FD    0.046595
last15_DK    0.050827
Hit          1.000000

last30_FD    0.057070
last30_DK    0.062151
Hit          1.000000

After checking the correlation coefficients of all FD and DK features, we can see our hypothesis is correct and both provide similar amounts of information gain. However, DK seems to have a slight edge so we'll keep that and drop all FD features.

Feature selection is one of the most important steps of data preparation and can have a big impact on model performance. There are many ways to go about selecting which features to keep and which to discard. Discussing, testing, and showcasing these various methods would be a bit excessive for the purpose of this notebook, so for brevity's sake, I'll be providing a list of features to drop that I've arrived at using a combination of domain knowledge, standard feature selection methods, and model performance testing.

In [10]:
cols = ['H/A_AB', 'vsP_AB', 'vsP_BB',
        'H/A_FD', 'last7_FD', 'last15_FD', 'last30_FD', 
        'H/A_RBI', 'last7_RBI', 'last15_RBI', 'last30_RBI', 'vsP_RBI',
        'H/A_H', 'last7_H', 'last15_H', 'last30_H',
        'last7_IBB', 'last15_IBB', 'last30_IBB', 'H/A_IBB', 'vsP_IBB',
        'last7_HR','last15_HR','last30_HR', 'vsP_HR','H/A_HR']
df = df.drop(cols,axis=1)
df.head()
Out[10]:
H/A_BA H/A_BB H/A_DK H/A_OBP H/A_OPS H/A_PA H/A_SLG H/A_SO Hit last15_AB ... last7_OPS last7_PA last7_SLG last7_SO vsP_BA vsP_OBP vsP_OPS vsP_PA vsP_SLG vsP_SO
0 0.176 5.0 4.000 0.246 0.481 57.0 0.235 19.0 1.0 50.0 ... 0.291 28.0 0.148 11.0 0.200 0.200 1.000 5 0.800 0
1 0.241 7.0 5.320 0.297 0.622 92.0 0.325 24.0 0.0 52.0 ... 0.608 23.0 0.304 8.0 0.375 0.444 1.444 9 1.000 1
2 0.225 7.0 4.926 0.278 0.581 98.0 0.303 28.0 0.0 51.0 ... 0.522 23.0 0.261 8.0 0.091 0.091 0.455 11 0.364 4
3 0.183 8.0 3.250 0.275 0.475 69.0 0.200 21.0 0.0 50.0 ... 0.500 24.0 0.250 10.0 0.200 0.200 0.400 5 0.200 2
4 0.188 9.0 3.364 0.284 0.503 74.0 0.219 22.0 0.0 46.0 ... 0.411 20.0 0.211 9.0 0.222 0.300 0.522 10 0.222 3

5 rows × 42 columns

We can see our modified df no longer has the unnecessary data and now only has 42 columns.

Missing Data Check

One of the biggest parts of any data science project is making sure that the data is correctly formatted and fixing it if it is not. The first part of this process is identifying and dealing with missing data.

Missing Data is simply a blank space, or a surrogate value like NA, that indicates that we failed to collect data for one of the features. For example, if we forgot to ask someone's age, or forgot to write it down, then we would have a blank space in the dataset for that person's age.

There are two main ways to deal with missing data:

  1. We can remove the rows that contain missing data from the dataset. This is easy to do, but wastes all of the other values collected. How big of a waste this is depends on how important the missing value is for classification. For example, if we are missing a value for age, and age is not useful for our classification, it would be a shame to throw out all of someone's data just because we do not have their age.
  2. We can impute the values that are missing, which is to say we can make an educated guess about what the value should be. Continuing our example where we are missing a value for age, instead of throwing out the entire row of data, we can fill in the missing value with the average age or the median age, or use another more advanced approach to guess an appropriate value (see the sketch after this list).
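For illustration only, here is a minimal sketch of option 2 using scikit-learn's SimpleImputer. The 'age' column is purely hypothetical, and, as we'll see shortly, our dataset contains no missing values, so nothing like this is actually needed here:

from sklearn.impute import SimpleImputer # simple mean/median imputation

example = pd.DataFrame({'age': [23.0, None, 31.0, 27.0]}) # hypothetical column with one missing value
imputer = SimpleImputer(strategy='median') # fill missing values with the column median
example[['age']] = imputer.fit_transform(example[['age']]) # the NaN is replaced by the median (27.0)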

In this section, we'll focus on identifying missing values in the dataset.

First, let's see what sort of data is in each column.

In [11]:
pd.set_option('display.max_column', 42) # expand display of df to show all 42 columns
pd.set_option('display.max_rows', 50) # display at most 50 rows (head and tail) of df
pd.set_option('display.min_rows', None)
In [12]:
df # display first 25 and last 25 rows of data
Out[12]:
H/A_BA H/A_BB H/A_DK H/A_OBP H/A_OPS H/A_PA H/A_SLG H/A_SO Hit last15_AB last15_BA last15_BB last15_DK last15_OBP last15_OPS last15_PA last15_SLG last15_SO last30_AB last30_BA last30_BB last30_DK last30_OBP last30_OPS last30_PA last30_SLG last30_SO last7_AB last7_BA last7_BB last7_DK last7_OBP last7_OPS last7_PA last7_SLG last7_SO vsP_BA vsP_OBP vsP_OPS vsP_PA vsP_SLG vsP_SO
0 0.176 5.0 4.000 0.246 0.481 57.0 0.235 19.0 1.0 50.0 0.160 5.0 4.200 0.228 0.448 57.0 0.220 17.0 91.0 0.165 12.0 3.467 0.257 0.466 105.0 0.209 33.0 27.0 0.148 0.0 2.286 0.143 0.291 28.0 0.148 11.0 0.200 0.200 1.000 5 0.800 0
1 0.241 7.0 5.320 0.297 0.622 92.0 0.325 24.0 0.0 52.0 0.308 3.0 6.267 0.345 0.730 56.0 0.385 12.0 102.0 0.235 8.0 5.233 0.286 0.590 113.0 0.304 29.0 23.0 0.304 0.0 4.143 0.304 0.608 23.0 0.304 8.0 0.375 0.444 1.444 9 1.000 1
2 0.225 7.0 4.926 0.278 0.581 98.0 0.303 28.0 0.0 51.0 0.255 2.0 5.133 0.283 0.597 54.0 0.314 15.0 103.0 0.233 7.0 5.100 0.277 0.578 113.0 0.301 33.0 23.0 0.261 0.0 3.429 0.261 0.522 23.0 0.261 8.0 0.091 0.091 0.455 11 0.364 4
3 0.183 8.0 3.250 0.275 0.475 69.0 0.200 21.0 0.0 50.0 0.240 2.0 4.667 0.269 0.569 53.0 0.300 16.0 103.0 0.223 6.0 4.800 0.261 0.552 112.0 0.291 34.0 24.0 0.250 0.0 3.429 0.250 0.500 24.0 0.250 10.0 0.200 0.200 0.400 5 0.200 2
4 0.188 9.0 3.364 0.284 0.503 74.0 0.219 22.0 0.0 46.0 0.239 3.0 4.067 0.286 0.569 50.0 0.283 17.0 102.0 0.225 6.0 4.867 0.266 0.570 110.0 0.304 32.0 19.0 0.158 1.0 2.429 0.200 0.411 20.0 0.211 9.0 0.222 0.300 0.522 10 0.222 3
5 0.217 7.0 4.750 0.270 0.563 101.0 0.293 30.0 0.0 46.0 0.196 2.0 2.867 0.229 0.446 48.0 0.217 18.0 102.0 0.216 5.0 4.400 0.250 0.525 109.0 0.275 33.0 19.0 0.158 1.0 2.429 0.200 0.411 20.0 0.211 8.0 0.286 0.286 0.571 7 0.286 2
6 0.188 9.0 3.208 0.278 0.495 79.0 0.217 23.0 1.0 42.0 0.214 3.0 2.867 0.267 0.505 45.0 0.238 18.0 94.0 0.223 6.0 3.967 0.270 0.547 101.0 0.277 32.0 16.0 0.125 2.0 1.429 0.222 0.347 18.0 0.125 8.0 0.000 0.000 0.000 10 0.000 6
7 0.192 9.0 3.720 0.277 0.537 83.0 0.260 25.0 0.0 41.0 0.220 3.0 3.600 0.273 0.590 44.0 0.317 18.0 94.0 0.223 6.0 4.400 0.270 0.579 101.0 0.309 33.0 18.0 0.167 2.0 3.714 0.250 0.583 20.0 0.333 9.0 0.222 0.417 0.972 12 0.556 1
8 0.180 16.0 6.125 0.397 0.697 68.0 0.300 22.0 1.0 55.0 0.255 7.0 6.867 0.349 0.731 63.0 0.382 24.0 109.0 0.248 19.0 8.367 0.369 0.828 130.0 0.459 41.0 28.0 0.214 2.0 4.571 0.267 0.553 30.0 0.286 13.0 0.143 0.250 0.393 16 0.143 4
9 0.170 16.0 5.765 0.380 0.663 71.0 0.283 24.0 1.0 56.0 0.250 7.0 6.733 0.333 0.708 63.0 0.375 25.0 110.0 0.245 16.0 8.100 0.352 0.807 128.0 0.455 42.0 26.0 0.115 2.0 2.429 0.179 0.294 28.0 0.115 14.0 0.333 0.400 0.844 10 0.444 3
10 0.161 18.0 5.667 0.382 0.650 76.0 0.268 25.0 1.0 55.0 0.218 8.0 6.067 0.317 0.626 63.0 0.309 24.0 111.0 0.243 17.0 8.033 0.349 0.799 129.0 0.450 41.0 25.0 0.080 4.0 2.286 0.207 0.287 29.0 0.080 13.0 0.400 0.500 1.100 6 0.600 2
11 0.305 3.0 10.929 0.339 0.932 62.0 0.593 19.0 1.0 56.0 0.196 7.0 5.733 0.286 0.572 63.0 0.286 24.0 111.0 0.243 17.0 8.033 0.349 0.799 129.0 0.450 39.0 25.0 0.040 4.0 1.857 0.172 0.212 29.0 0.040 11.0 0.467 0.500 1.367 16 0.867 1
12 0.302 5.0 11.067 0.353 0.940 68.0 0.587 20.0 1.0 57.0 0.211 8.0 6.467 0.308 0.624 65.0 0.316 22.0 111.0 0.243 16.0 7.967 0.344 0.794 128.0 0.450 39.0 25.0 0.040 5.0 3.000 0.200 0.280 30.0 0.080 10.0 0.222 0.323 0.878 31 0.556 5
13 0.150 18.0 5.368 0.362 0.612 80.0 0.250 25.0 1.0 57.0 0.193 9.0 5.800 0.303 0.549 66.0 0.246 20.0 113.0 0.248 15.0 8.000 0.341 0.792 129.0 0.451 39.0 25.0 0.080 6.0 4.000 0.258 0.378 31.0 0.120 8.0 0.273 0.308 0.580 13 0.273 3
14 0.138 18.0 5.100 0.341 0.572 85.0 0.231 27.0 1.0 60.0 0.150 7.0 5.000 0.239 0.439 67.0 0.200 22.0 115.0 0.243 15.0 7.933 0.336 0.779 131.0 0.443 41.0 27.0 0.074 5.0 3.714 0.219 0.330 32.0 0.111 8.0 0.067 0.125 0.192 16 0.067 8
15 0.145 18.0 5.000 0.337 0.569 89.0 0.232 29.0 0.0 59.0 0.153 7.0 4.533 0.242 0.445 66.0 0.203 22.0 114.0 0.246 15.0 7.933 0.338 0.785 130.0 0.447 42.0 27.0 0.111 5.0 3.857 0.250 0.398 32.0 0.148 8.0 0.353 0.421 0.892 19 0.471 3
16 0.139 20.0 4.833 0.327 0.542 101.0 0.215 32.0 1.0 56.0 0.107 9.0 3.333 0.231 0.356 65.0 0.125 21.0 112.0 0.241 16.0 7.667 0.341 0.770 129.0 0.429 39.0 27.0 0.148 5.0 4.857 0.281 0.466 32.0 0.185 8.0 0.267 0.389 0.922 18 0.533 3
17 0.280 6.0 10.389 0.333 0.893 81.0 0.560 25.0 0.0 56.0 0.107 8.0 3.933 0.219 0.398 64.0 0.179 20.0 110.0 0.191 16.0 5.867 0.299 0.617 127.0 0.318 43.0 25.0 0.120 3.0 4.286 0.214 0.454 28.0 0.240 10.0 0.250 0.300 0.675 10 0.375 4
18 0.278 6.0 10.000 0.329 0.873 85.0 0.544 26.0 0.0 56.0 0.125 8.0 4.000 0.234 0.430 64.0 0.196 19.0 111.0 0.189 15.0 5.433 0.291 0.579 127.0 0.288 43.0 27.0 0.148 1.0 3.857 0.179 0.438 28.0 0.259 10.0 0.273 0.333 0.697 12 0.364 4
19 0.268 6.0 9.600 0.326 0.850 89.0 0.524 26.0 1.0 56.0 0.125 8.0 4.133 0.246 0.442 65.0 0.196 17.0 112.0 0.188 15.0 5.433 0.289 0.575 128.0 0.286 42.0 26.0 0.154 1.0 4.143 0.214 0.483 28.0 0.269 8.0 0.278 0.409 0.909 22 0.500 4
20 0.140 21.0 4.654 0.321 0.530 109.0 0.209 34.0 0.0 54.0 0.167 6.0 5.333 0.262 0.595 61.0 0.333 21.0 114.0 0.158 13.0 5.167 0.250 0.513 128.0 0.263 43.0 25.0 0.200 3.0 6.714 0.310 0.750 29.0 0.440 9.0 0.333 0.333 0.667 6 0.333 1
21 0.167 22.0 5.667 0.331 0.615 127.0 0.284 37.0 1.0 56.0 0.214 5.0 7.667 0.302 0.784 63.0 0.482 19.0 112.0 0.161 14.0 5.500 0.266 0.570 128.0 0.304 40.0 28.0 0.286 2.0 10.429 0.355 0.962 31.0 0.607 7.0 0.000 0.000 0.000 6 0.000 2
22 0.262 10.0 9.269 0.333 0.857 114.0 0.524 36.0 1.0 56.0 0.214 6.0 7.933 0.312 0.794 64.0 0.482 19.0 111.0 0.153 15.0 5.533 0.266 0.563 128.0 0.297 38.0 26.0 0.269 4.0 9.000 0.387 0.887 31.0 0.500 6.0 0.308 0.383 0.922 60 0.538 12
23 0.280 10.0 9.333 0.347 0.889 118.0 0.542 37.0 0.0 56.0 0.268 6.0 8.667 0.359 0.913 64.0 0.554 16.0 111.0 0.180 15.0 5.900 0.289 0.622 128.0 0.333 37.0 27.0 0.333 3.0 9.857 0.419 1.012 31.0 0.593 6.0 0.400 0.500 1.100 14 0.600 3
24 0.304 10.0 9.857 0.366 0.946 123.0 0.580 37.0 0.0 57.0 0.316 6.0 9.333 0.400 0.996 65.0 0.596 15.0 113.0 0.212 14.0 6.633 0.310 0.699 129.0 0.389 35.0 27.0 0.407 3.0 11.857 0.484 1.262 31.0 0.778 5.0 0.429 0.579 1.508 19 0.929 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10680 0.289 10.0 6.846 0.355 0.778 107.0 0.423 12.0 1.0 57.0 0.298 1.0 6.200 0.310 0.749 58.0 0.439 4.0 111.0 0.306 5.0 7.200 0.336 0.813 116.0 0.477 11.0 27.0 0.259 1.0 6.000 0.286 0.656 28.0 0.370 2.0 0.333 0.429 0.762 7 0.333 3
10681 0.264 3.0 5.960 0.281 0.666 96.0 0.385 5.0 1.0 59.0 0.322 1.0 6.867 0.333 0.791 60.0 0.458 4.0 114.0 0.316 5.0 7.533 0.345 0.827 119.0 0.482 11.0 28.0 0.286 1.0 6.714 0.310 0.703 29.0 0.393 2.0 0.500 0.462 1.462 13 1.000 0
10682 0.271 3.0 6.038 0.287 0.672 101.0 0.385 5.0 1.0 60.0 0.350 1.0 7.400 0.361 0.844 61.0 0.483 3.0 115.0 0.322 5.0 7.700 0.350 0.837 120.0 0.487 10.0 29.0 0.276 1.0 6.714 0.300 0.679 30.0 0.379 2.0 0.400 0.455 1.555 11 1.100 2
10683 0.280 3.0 6.444 0.295 0.715 105.0 0.420 5.0 1.0 61.0 0.377 1.0 8.533 0.387 0.944 62.0 0.557 3.0 115.0 0.322 5.0 8.067 0.350 0.863 120.0 0.513 10.0 30.0 0.300 0.0 6.857 0.300 0.700 30.0 0.400 1.0 0.250 0.333 0.708 9 0.375 1
10684 0.276 3.0 6.857 0.291 0.729 110.0 0.438 6.0 0.0 62.0 0.371 1.0 9.533 0.381 0.978 63.0 0.597 4.0 116.0 0.319 5.0 8.200 0.347 0.856 121.0 0.509 9.0 30.0 0.300 0.0 8.714 0.300 0.800 30.0 0.500 1.0 0.800 0.800 3.000 5 2.200 0
10685 0.284 4.0 7.034 0.304 0.754 115.0 0.450 7.0 0.0 62.0 0.371 2.0 9.667 0.391 0.988 64.0 0.597 5.0 116.0 0.336 6.0 8.600 0.369 0.903 122.0 0.534 10.0 30.0 0.333 1.0 10.000 0.355 0.922 31.0 0.567 2.0 0.211 0.318 0.634 22 0.316 2
10686 0.292 5.0 7.367 0.317 0.777 120.0 0.460 8.0 0.0 62.0 0.355 3.0 9.933 0.385 0.966 65.0 0.581 6.0 116.0 0.345 7.0 8.933 0.382 0.934 123.0 0.552 10.0 30.0 0.367 2.0 11.714 0.406 1.039 32.0 0.633 3.0 0.267 0.267 0.667 15 0.400 2
10687 0.294 10.0 6.963 0.357 0.779 112.0 0.422 12.0 0.0 62.0 0.306 3.0 8.467 0.338 0.822 65.0 0.484 7.0 116.0 0.319 6.0 8.433 0.352 0.869 122.0 0.517 12.0 31.0 0.355 2.0 11.714 0.394 1.007 33.0 0.613 5.0 0.417 0.417 0.958 24 0.542 2
10688 0.280 10.0 6.714 0.342 0.744 117.0 0.402 13.0 1.0 63.0 0.286 3.0 8.267 0.318 0.778 66.0 0.460 8.0 118.0 0.305 5.0 8.033 0.333 0.833 123.0 0.500 12.0 31.0 0.290 2.0 10.286 0.333 0.881 33.0 0.548 6.0 0.625 0.700 1.450 10 0.750 0
10689 0.284 10.0 6.700 0.341 0.746 126.0 0.405 15.0 1.0 64.0 0.281 3.0 8.267 0.313 0.782 67.0 0.469 10.0 120.0 0.300 4.0 7.033 0.323 0.773 124.0 0.450 13.0 31.0 0.258 2.0 8.571 0.303 0.755 33.0 0.452 8.0 0.321 0.345 0.952 29 0.607 2
10690 0.283 10.0 6.935 0.338 0.763 130.0 0.425 15.0 0.0 65.0 0.277 2.0 8.133 0.299 0.761 67.0 0.462 9.0 121.0 0.298 4.0 7.400 0.320 0.791 125.0 0.471 13.0 30.0 0.267 2.0 8.000 0.312 0.779 32.0 0.467 7.0 0.182 0.250 0.523 12 0.273 2
10691 0.274 10.0 6.719 0.328 0.739 134.0 0.411 16.0 0.0 64.0 0.266 2.0 7.800 0.288 0.741 66.0 0.453 9.0 122.0 0.287 3.0 7.100 0.304 0.763 125.0 0.459 14.0 30.0 0.200 1.0 6.286 0.226 0.593 31.0 0.367 7.0 0.385 0.467 1.159 15 0.692 4
10692 0.289 10.0 7.182 0.341 0.786 138.0 0.445 16.0 1.0 64.0 0.297 2.0 9.067 0.318 0.849 66.0 0.531 9.0 122.0 0.295 3.0 7.567 0.312 0.796 125.0 0.484 14.0 30.0 0.233 0.0 7.000 0.233 0.700 30.0 0.467 6.0 0.214 0.267 0.481 15 0.214 2
10693 0.286 10.0 7.118 0.336 0.772 143.0 0.436 17.0 0.0 65.0 0.292 2.0 9.067 0.313 0.836 67.0 0.523 10.0 123.0 0.293 3.0 7.633 0.310 0.790 126.0 0.480 15.0 31.0 0.258 0.0 7.714 0.258 0.742 31.0 0.484 5.0 0.154 0.154 0.308 13 0.154 3
10694 0.287 10.0 7.000 0.336 0.770 146.0 0.434 18.0 1.0 65.0 0.308 2.0 9.267 0.328 0.866 67.0 0.538 11.0 122.0 0.303 3.0 7.733 0.320 0.812 125.0 0.492 15.0 29.0 0.310 0.0 8.143 0.310 0.862 29.0 0.552 5.0 0.188 0.188 0.375 16 0.188 4
10695 0.282 5.0 7.129 0.306 0.750 124.0 0.444 10.0 1.0 61.0 0.295 2.0 8.600 0.317 0.858 63.0 0.541 12.0 120.0 0.308 3.0 7.733 0.325 0.825 123.0 0.500 16.0 25.0 0.280 0.0 7.286 0.280 0.840 25.0 0.560 5.0 0.250 0.250 0.500 12 0.250 1
10696 0.275 6.0 6.969 0.305 0.738 128.0 0.433 11.0 0.0 59.0 0.271 3.0 8.200 0.306 0.831 62.0 0.525 13.0 119.0 0.311 4.0 7.800 0.333 0.837 123.0 0.504 16.0 24.0 0.250 1.0 6.571 0.280 0.780 25.0 0.500 5.0 0.471 0.526 1.409 19 0.882 4
10697 0.285 10.0 6.806 0.333 0.764 147.0 0.431 19.0 1.0 58.0 0.224 3.0 5.867 0.262 0.641 61.0 0.379 14.0 120.0 0.300 4.0 7.700 0.323 0.815 124.0 0.492 18.0 24.0 0.208 1.0 4.571 0.240 0.573 25.0 0.333 6.0 0.000 0.000 0.000 6 0.000 2
10698 0.258 6.0 6.559 0.287 0.693 136.0 0.406 13.0 0.0 59.0 0.203 2.0 5.267 0.230 0.569 61.0 0.339 14.0 121.0 0.289 4.0 7.467 0.312 0.783 125.0 0.471 19.0 25.0 0.120 1.0 1.857 0.154 0.274 26.0 0.120 7.0 0.429 0.467 1.610 15 1.143 2
10699 0.252 7.0 6.429 0.286 0.683 140.0 0.397 14.0 1.0 58.0 0.172 2.0 4.267 0.200 0.493 60.0 0.293 14.0 120.0 0.267 5.0 7.100 0.296 0.738 125.0 0.442 20.0 23.0 0.087 2.0 1.429 0.160 0.247 25.0 0.087 7.0 0.231 0.286 0.824 14 0.538 2
10700 0.254 8.0 6.211 0.289 0.676 152.0 0.387 15.0 1.0 57.0 0.246 4.0 6.000 0.295 0.663 61.0 0.368 12.0 121.0 0.273 6.0 7.533 0.307 0.762 127.0 0.455 21.0 29.0 0.379 2.0 10.714 0.419 1.040 31.0 0.621 4.0 0.571 0.571 1.143 7 0.571 0
10701 0.258 8.0 6.400 0.292 0.676 161.0 0.384 15.0 1.0 58.0 0.259 4.0 6.800 0.306 0.685 62.0 0.379 10.0 123.0 0.285 6.0 8.033 0.318 0.781 129.0 0.463 21.0 32.0 0.344 0.0 9.286 0.344 0.813 32.0 0.469 3.0 0.357 0.400 0.900 15 0.500 0
10702 0.273 9.0 6.628 0.306 0.710 173.0 0.404 16.0 1.0 60.0 0.333 4.0 8.600 0.369 0.852 65.0 0.483 8.0 119.0 0.286 7.0 7.833 0.323 0.785 127.0 0.462 22.0 27.0 0.370 1.0 7.857 0.379 0.823 29.0 0.444 2.0 0.381 0.435 1.149 23 0.714 5
10703 0.297 2.0 6.800 0.313 0.766 67.0 0.453 16.0 1.0 60.0 0.317 3.0 7.733 0.344 0.861 64.0 0.517 12.0 123.0 0.285 5.0 7.200 0.310 0.782 129.0 0.472 25.0 28.0 0.321 1.0 5.857 0.345 0.738 29.0 0.393 5.0 0.333 0.333 0.750 12 0.417 1
10704 0.301 2.0 6.647 0.316 0.768 76.0 0.452 17.0 1.0 61.0 0.328 3.0 7.800 0.359 0.884 64.0 0.525 10.0 122.0 0.279 5.0 6.833 0.305 0.756 128.0 0.451 25.0 29.0 0.310 1.0 5.286 0.333 0.712 30.0 0.379 5.0 0.455 0.538 1.357 13 0.818 3

10705 rows × 42 columns

In [13]:
df.describe() # generate descriptive statistics 
Out[13]:
H/A_BA H/A_BB H/A_DK H/A_OBP H/A_OPS H/A_PA H/A_SLG H/A_SO Hit last15_AB last15_BA last15_BB last15_DK last15_OBP last15_OPS last15_PA last15_SLG last15_SO last30_AB last30_BA last30_BB last30_DK last30_OBP last30_OPS last30_PA last30_SLG last30_SO last7_AB last7_BA last7_BB last7_DK last7_OBP last7_OPS last7_PA last7_SLG last7_SO vsP_BA vsP_OBP vsP_OPS vsP_PA vsP_SLG vsP_SO
count 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000 10705.000000
mean 0.260314 14.937879 7.674552 0.332148 0.769395 165.403550 0.437248 32.809155 0.651658 56.552732 0.259216 5.554601 7.559870 0.329716 0.762251 63.332555 0.432534 12.681644 113.485661 0.260756 11.226810 7.641318 0.331470 0.767933 127.141896 0.436464 25.369267 26.299953 0.257485 2.581784 7.526441 0.328363 0.759645 29.455208 0.431281 5.920691 0.265447 0.322671 0.761926 14.818776 0.439293 2.981224
std 0.046545 9.534610 1.768946 0.050901 0.142986 71.930104 0.102192 16.728234 0.476467 4.886898 0.063300 3.165619 2.283758 0.068370 0.191839 4.714733 0.137798 4.561990 8.015167 0.047393 5.301515 1.810919 0.052692 0.147668 8.023482 0.105976 7.923207 2.925231 0.089143 1.891455 3.059821 0.093853 0.265967 2.719936 0.191775 2.634954 0.144938 0.147518 0.418236 10.038111 0.297466 2.655641
min 0.053000 0.000000 1.727000 0.138000 0.261000 36.000000 0.075000 1.000000 0.000000 31.000000 0.063000 0.000000 1.667000 0.107000 0.237000 41.000000 0.078000 1.000000 74.000000 0.097000 0.000000 2.267000 0.152000 0.260000 87.000000 0.107000 5.000000 14.000000 0.000000 0.000000 0.000000 0.000000 0.000000 17.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 0.000000 0.000000
25% 0.232000 8.000000 6.471000 0.298000 0.678000 106.000000 0.368000 20.000000 0.000000 53.000000 0.216000 3.000000 5.933000 0.283000 0.630000 60.000000 0.339000 9.000000 108.000000 0.229000 7.000000 6.400000 0.295000 0.666000 122.000000 0.363000 20.000000 24.000000 0.194000 1.000000 5.286000 0.267000 0.573000 28.000000 0.292000 4.000000 0.167000 0.222000 0.471000 8.000000 0.222000 1.000000
50% 0.259000 13.000000 7.600000 0.331000 0.761000 155.000000 0.430000 30.000000 1.000000 57.000000 0.258000 5.000000 7.333000 0.328000 0.749000 64.000000 0.419000 12.000000 114.000000 0.259000 10.000000 7.467000 0.330000 0.757000 128.000000 0.426000 25.000000 26.000000 0.258000 2.000000 7.143000 0.323000 0.740000 30.000000 0.409000 6.000000 0.250000 0.326000 0.718000 12.000000 0.400000 2.000000
75% 0.289000 20.000000 8.706000 0.365000 0.849000 217.000000 0.496000 43.000000 1.000000 60.000000 0.302000 7.000000 8.933000 0.375000 0.879000 67.000000 0.509000 16.000000 119.000000 0.292000 14.000000 8.733000 0.365000 0.858000 133.000000 0.500000 30.000000 28.000000 0.320000 4.000000 9.429000 0.393000 0.923000 31.000000 0.542000 8.000000 0.353000 0.417000 1.000000 19.000000 0.600000 4.000000
max 0.462000 72.000000 17.455000 0.542000 1.542000 397.000000 1.000000 119.000000 1.000000 73.000000 0.500000 21.000000 18.267000 0.582000 1.527000 77.000000 1.040000 31.000000 137.000000 0.421000 36.000000 15.667000 0.530000 1.404000 150.000000 0.915000 55.000000 40.000000 0.700000 13.000000 21.714000 0.781000 2.223000 40.000000 1.652000 19.000000 1.000000 1.000000 3.114000 83.000000 2.400000 22.000000

Briefly scanning our data frame, we can see the data has a wide range of values, some integers and some floats, and 0 is a common value across numerous features. It's important to know whether the 0's found in your data are legitimate values or simply indicators of missing data. In our case, 0 is a legitimate value. However, there are a few features where 0 should never appear, such as last30_AB, so we'll check the appropriate columns to make sure they don't contain such outlier values. We'll also check our entire dataframe for standard missing values like NA or NaN.

In [14]:
# check the specified columns for values == 0
len(df.loc[ (df['H/A_PA'] == 0 ) | (df['vsP_PA'] == 0 ) | (df['last7_AB'] == 0) | (df['last7_PA'] == 0) | 
            (df['last15_AB'] == 0) | (df['last15_PA'] == 0) | (df['last30_AB'] == 0) | (df['last30_PA'] == 0 )]) 
Out[14]:
0
In [15]:
df.isnull().sum() # check all features for missing values
Out[15]:
H/A_BA        0
H/A_BB        0
H/A_DK        0
H/A_OBP       0
H/A_OPS       0
H/A_PA        0
H/A_SLG       0
H/A_SO        0
Hit           0
last15_AB     0
last15_BA     0
last15_BB     0
last15_DK     0
last15_OBP    0
last15_OPS    0
last15_PA     0
last15_SLG    0
last15_SO     0
last30_AB     0
last30_BA     0
last30_BB     0
last30_DK     0
last30_OBP    0
last30_OPS    0
last30_PA     0
last30_SLG    0
last30_SO     0
last7_AB      0
last7_BA      0
last7_BB      0
last7_DK      0
last7_OBP     0
last7_OPS     0
last7_PA      0
last7_SLG     0
last7_SO      0
vsP_BA        0
vsP_OBP       0
vsP_OPS       0
vsP_PA        0
vsP_SLG       0
vsP_SO        0
dtype: int64

We can see from our output that no missing NaN values were found in the data frame, and no unexpected 0 values were found either. Since we now know there are no missing values in our data, we'll double-check our data types to ensure nothing unexpected appears.

In [16]:
df.dtypes
Out[16]:
H/A_BA        float64
H/A_BB        float64
H/A_DK        float64
H/A_OBP       float64
H/A_OPS       float64
H/A_PA        float64
H/A_SLG       float64
H/A_SO        float64
Hit           float64
last15_AB     float64
last15_BA     float64
last15_BB     float64
last15_DK     float64
last15_OBP    float64
last15_OPS    float64
last15_PA     float64
last15_SLG    float64
last15_SO     float64
last30_AB     float64
last30_BA     float64
last30_BB     float64
last30_DK     float64
last30_OBP    float64
last30_OPS    float64
last30_PA     float64
last30_SLG    float64
last30_SO     float64
last7_AB      float64
last7_BA      float64
last7_BB      float64
last7_DK      float64
last7_OBP     float64
last7_OPS     float64
last7_PA      float64
last7_SLG     float64
last7_SO      float64
vsP_BA        float64
vsP_OBP       float64
vsP_OPS       float64
vsP_PA          int64
vsP_SLG       float64
vsP_SO          int64
dtype: object

We can see that all of our data is currently being read as either float64 or int64 variables, which is correct and expected. However, several of our float64 features are meant to be int64 types, so let's change those variable types.

In [17]:
cols = ['H/A_BB', 'H/A_PA', 'H/A_SO', 
        'last7_AB', 'last7_BB', 'last7_PA', 'last7_SO', 
        'last15_AB', 'last15_BB', 'last15_PA', 'last15_SO',
        'last30_AB', 'last30_BB',  'last30_PA', 'last30_SO',
        'vsP_PA', 'vsP_SO','Hit'] # columns to be changed
df[cols] = df[cols].astype(int) # update specified columns type to int
In [18]:
df[cols].dtypes
Out[18]:
H/A_BB       int64
H/A_PA       int64
H/A_SO       int64
last7_AB     int64
last7_BB     int64
last7_PA     int64
last7_SO     int64
last15_AB    int64
last15_BB    int64
last15_PA    int64
last15_SO    int64
last30_AB    int64
last30_BB    int64
last30_PA    int64
last30_SO    int64
vsP_PA       int64
vsP_SO       int64
Hit          int64
dtype: object

We see our data now has the correct typing.

Data Transformation

It's important to remember data can come in a variety of types, such as:

  • Numerical i.e. income, batting average
  • Categorical i.e. gender, position
  • Ordinal i.e. low/medium/high

However, most Machine Learning algorithms are meant to only handle numerical data. Because of this, it is important to always convert non-numerical data types into numerical data types. This can be done in several ways depending on the original data type, such as:

  • One Hot Encoding
  • Integer Encoding
  • Dummy Encoding

Our dataframe currently only contains numerical data, so there is no need to apply these methods here, but data conversion is an important step to go through as part of any data science process.
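That said, for reference, here is a minimal sketch of what dummy/one-hot encoding could look like with pandas, using a purely hypothetical categorical 'position' column (not part of our dataset):

example = pd.DataFrame({'position': ['C', '1B', 'SS', 'C']}) # hypothetical categorical feature
pd.get_dummies(example, columns=['position']) # one binary indicator column per category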

Downsample the Data

Some Machine Learning models, like Support Vector Machines, are great with small datasets, but not great with large ones, and this dataset, while not huge, is big enough to take a long time to optimize with Cross Validation. So we'll downsample both categories, players who did and did not get a hit, to 2,000 each. At the same time, this will help us keep our dataset balanced, which is important to prevent our models from simply favoring the majority class.

First, let's remind ourselves how many players are in the dataset...

In [19]:
len(df) # get current number of rows/players in df
Out[19]:
10705
In [20]:
df_no_hit = df[df['Hit'] == 0] # separate players who did not get a hit
df_hit = df[df['Hit'] == 1] # separate players who did get a hit
In [21]:
len(df_no_hit) # get number of players who did not get a hit
Out[21]:
3729
In [22]:
len(df_hit) # get number of players who did get a hit
Out[22]:
6976
In [23]:
# downsample df to only include 2000 players who did not get a hit

df_no_hit_downsampled = resample(df_no_hit,
                                 replace=False,
                                 n_samples=2000,
                                 random_state=42)
len(df_no_hit_downsampled)
Out[23]:
2000
In [24]:
# downsample df to only include 2000 players who did get a hit

df_hit_downsampled = resample(df_hit,
                              replace=False,
                              n_samples=2000,
                              random_state=42)
len(df_hit_downsampled)
Out[24]:
2000
In [25]:
# finally combine the two separate downsampled data frames into a single df

df_downsample = pd.concat([df_no_hit_downsampled, df_hit_downsampled])
len(df_downsample)
Out[25]:
4000

Splitting the Data

Now that we have preprocessed our data, we are ready to start formatting the data for making models.

The first step is to split the data into two parts:

  1. The columns of data that we will use to make classifications
  2. The column of data that we want to predict.

We will use the conventional notation of X to represent the independent variables or columns of data that we will use to make classifications and y to represent the thing we want to predict, our dependent variable. In this case, we want to predict Hit, whether or not a player will get a hit in a game.

NOTE: In the code below we are using copy() to copy the data by value. By default, pandas uses copy by reference. Using copy() ensures that the original data df_downsample is not modified when we modify X or y. So if we make a mistake when we are formatting the columns for our classification models, we can just re-copy df_downsample, rather than reload the original data and re-process the data.

In [26]:
X = df_downsample.drop('Hit', axis=1).copy() # Separate our independent variables
X.head()
Out[26]:
H/A_BA H/A_BB H/A_DK H/A_OBP H/A_OPS H/A_PA H/A_SLG H/A_SO last15_AB last15_BA last15_BB last15_DK last15_OBP last15_OPS last15_PA last15_SLG last15_SO last30_AB last30_BA last30_BB last30_DK last30_OBP last30_OPS last30_PA last30_SLG last30_SO last7_AB last7_BA last7_BB last7_DK last7_OBP last7_OPS last7_PA last7_SLG last7_SO vsP_BA vsP_OBP vsP_OPS vsP_PA vsP_SLG vsP_SO
7826 0.310 7 8.684 0.337 0.824 166 0.487 20 62 0.323 2 11.333 0.344 1.038 64 0.694 8 120 0.283 3 7.933 0.317 0.809 126 0.492 20 31 0.290 2 13.429 0.333 1.107 33 0.774 3 0.000 0.222 0.222 9 0.000 1
10007 0.241 7 8.467 0.333 0.626 66 0.293 15 58 0.138 8 6.467 0.265 0.437 69 0.172 18 120 0.225 12 7.667 0.306 0.623 136 0.317 34 29 0.103 2 5.571 0.188 0.326 33 0.138 10 0.222 0.300 0.633 11 0.333 1
7885 0.287 14 9.882 0.338 0.838 219 0.500 39 60 0.300 0 11.667 0.311 1.011 62 0.700 7 124 0.298 3 11.233 0.326 0.891 130 0.565 24 24 0.292 0 8.143 0.320 0.862 26 0.542 2 0.400 0.438 1.104 16 0.667 3
1116 0.259 13 7.642 0.310 0.726 213 0.416 60 62 0.226 3 8.933 0.279 0.650 68 0.371 16 121 0.207 9 7.733 0.276 0.607 135 0.331 35 31 0.290 0 11.429 0.312 0.796 32 0.484 7 0.500 0.692 1.317 13 0.625 1
2367 0.305 0 7.600 0.318 0.745 88 0.427 10 67 0.254 2 6.267 0.271 0.570 70 0.299 9 128 0.281 4 7.600 0.309 0.700 138 0.391 13 32 0.188 1 4.429 0.212 0.431 33 0.219 6 0.000 0.000 0.000 5 0.000 0
In [27]:
y = df_downsample['Hit'].copy() # Separate our dependent variable
y.head()
Out[27]:
7826     0
10007    0
7885     0
1116     0
2367     0
Name: Hit, dtype: int64

Now that our variables are correctly defined, we'll once again split the data. This time we'll split into training and testing sets. Doing this ensures that we have not only a way to train our Machine Learning models, but a way to evaluate them as well. This is akin to giving a student a practice test or study guide to learn from, then evaluating them on a real test with similar problems.

In [28]:
## Split the data into training and testing data,
## a random state variable is also passed for replication purposes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 25) 

Centering and Scaling

Several of the algorithms we'll use work best when the data provided is centered and scaled prior to training. The Radial Basis Function (RBF) that we are using with our Support Vector Machine assumes that the data is centered and scaled. In other words, each column should have a mean value = 0 and a standard deviation = 1. Our Neural Network algorithm also performs best with centered and scaled data as it utilizes Gradient Descent. If features are not scaled properly, the various ranges of features can cause inconsistencies in the steps taken during Gradient Descent.

So, to get the best out of our models, we will scale both the training and testing datasets. Specifically, we'll split the data into training and testing datasets and then scale them separately to avoid Data Leakage. Data Leakage occurs when information about the training dataset corrupts or influences the testing dataset.

In [29]:
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
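As an optional sanity check, we can confirm that each scaled training column now has a mean of roughly 0 and a standard deviation of roughly 1:

print(X_train_scaled.mean(axis=0).round(6)) # per-column means, all approximately 0
print(X_train_scaled.std(axis=0).round(6)) # per-column standard deviations, all approximately 1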

Machine Learning Model Construction

Now that we have cleaned, downsampled, split and scaled the data, we can begin constructing our Machine Learning models.

Support Vector Machines

Preliminary Support Vector Machine

Support Vector Machines work by plotting a subset of data points of two classes on a plane and then finding the best hyperplane between the data points to separate the distinct classes. It decides the best hyperplane by attempting to maximize the distance or margin between points of both classes. The data points closest to the margin are referred to as Support Vectors as moving them would also cause the hyperplane to be moved, but moving other data points would not affect the hyperplane.

Now, let's move on towards making a preliminary Support Vector Machine.

In [30]:
clf_svm = SVC(random_state=15) # construct a SVM model
clf_svm.fit(X_train_scaled, y_train) # fit data to model

# output results of model
plot_confusion_matrix(clf_svm, 
                      X_test_scaled, 
                      y_test,
                      values_format='d',
                      display_labels=["Did not get a hit", "Got a hit"])
Out[30]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f748b17c550>

In the confusion matrix, we see that of the 489 players that did not get a hit, 271 (55%) were correctly classified. And of the 511 players that did get a hit, 289 (56%) were correctly classified. So the Support Vector Machine was only okay. Let's try to improve our predictions using Cross Validation to optimize the parameters.
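The per-class percentages quoted above can be computed directly from the confusion matrix; here is a minimal sketch of that calculation for the preliminary SVM (variable names are illustrative):

cm = confusion_matrix(y_test, clf_svm.predict(X_test_scaled)) # rows = true classes, columns = predicted classes
per_class_accuracy = cm.diagonal() / cm.sum(axis=1) # correct predictions divided by the total in each true class
print(per_class_accuracy) # one value per class, matching the percentages quoted above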

Support Vector Machine Optimization

Optimizing a Support Vector Machine is all about finding the best hyperparameter values for gamma and C. So let's see if we can find better parameters than the preliminary Support Vector Machine values using cross validation in hopes that we can improve the accuracy with the Testing Dataset.

Since we have two parameters to optimize, we will use GridSearchCV(). We specify several potential values for gamma and C, and GridSearchCV() tests all possible combinations of the parameters for us.

In [31]:
num_features = np.size(X_train_scaled, axis=1) # get number of features
param_grid = {
   'C': [0.5, 1, 10,],
   'gamma': ['scale', 1/num_features, 1, 0.5, 0.25, 0.12, 0.05,],
  }
# we are including C=1 and gamma='scale' as they are default values.

optimal_params = GridSearchCV(
        SVC(random_state=15), 
        param_grid,
        cv=3,
        scoring='f1_micro',
        ## For more scoring metics see: 
        ## https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
        verbose=0 # if you want to see what Grid Search is doing, set verbose=2
    )

optimal_params.fit(X_train_scaled, y_train) # find best params
print(optimal_params.best_params_) # output best params
{'C': 1, 'gamma': 0.12}

Final Support Vector Machine

Now that we have optimized C and gamma parameters for our SVM, let's construct our final model.

In [32]:
clf_svm = SVC(C=1, gamma=0.12,random_state=15)
clf_svm.fit(X_train_scaled, y_train)

plot_confusion_matrix(clf_svm, 
                      X_test_scaled, 
                      y_test,
                      values_format='d',
                      display_labels=["Did not get a hit", "Got a hit"])
Out[32]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f748996dac8>

In the confusion matrix, we see that of the 489 players that did not get a hit, 275 (56%) were correctly classified. And of the 511 players that did get a hit, 296 (58%) were correctly classified. So the Support Vector Machine was only slightly improved but we now have a baseline model to compare to.

Neural Network

Preliminary Neural Network Model

Now that we have optimized our SVM model, we can move on to building our Neural Network model. Neural Networks work by utilizing layers of nodes, or neurons, similar to how the human brain works. The initial input layer represents the features in the dataset; several hidden layers of nodes then sit between the input layer and the output layer. Connecting the various layers are channels that hold various numerical weights, and the nodes in the hidden layers hold threshold values. If the output of any individual node is above the specified threshold value, that node is activated, and data is sent to the next layer of the network. Otherwise, no data is passed along to the next layer of the network.

Now that we have a brief understanding of how Neural Networks work, let's begin by building a baseline model.

In [33]:
clf_nn = MLPClassifier(random_state=15).fit(X_train_scaled, y_train) # Create a Neural Network and fit it to the training data.

plot_confusion_matrix(clf_nn, X_test_scaled, y_test, display_labels=["Did not get a hit", "Got a hit"])
/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sklearn/neural_network/_multilayer_perceptron.py:617: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
Out[33]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f74b2bcb4a8>

In the confusion matrix, we see that of the 489 players that did not get a hit, 255 (52%) were correctly classified. And of the 511 players that did get a hit, 295 (58%) were correctly classified. So the preliminary Neural Network model was not very good and seemingly overfit.

Neural Network Optimization

Now that we have built our preliminary Neural Network model, we can begin to optimize it by looking for ideal values for the max_iter, hidden_layer_sizes, learning_rate_init, beta_1, and beta_2 hyperparameters. We will once again use GridSearchCV() with a list of several possible values for every parameter and test the various combinations.

In [34]:
## Warning, this code can take several minutes to run depending on the system it's run on
## adding extra test parameters increases the run time exponentially.
param_grid = {
    'max_iter':[100, 165, 220,],
    'hidden_layer_sizes': [(19,),(21,)],
    'learning_rate_init':[ 0.0005, 0.0009, 0.005],
    'activation':['tanh'],
    'beta_1' : [0.75, 0.82, 0.9,],
    'beta_2' : [0.75, 0.82, 0.9,],
}


optimal_params = GridSearchCV(
    MLPClassifier(random_state=15),
    param_grid,
    scoring='f1_micro',
    verbose=0,
    n_jobs=10,
    cv=3
)

optimal_params.fit(X_train_scaled, y_train)
/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sklearn/neural_network/_multilayer_perceptron.py:617: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (165) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
Out[34]:
GridSearchCV(cv=3, estimator=MLPClassifier(random_state=15), n_jobs=10,
             param_grid={'activation': ['tanh'], 'beta_1': [0.75, 0.82, 0.9],
                         'beta_2': [0.75, 0.82, 0.9],
                         'hidden_layer_sizes': [(19,), (21,)],
                         'learning_rate_init': [0.0005, 0.0009, 0.005],
                         'max_iter': [100, 165, 220]},
             scoring='f1_micro')
In [35]:
print(optimal_params.best_params_)
{'activation': 'tanh', 'beta_1': 0.82, 'beta_2': 0.75, 'hidden_layer_sizes': (19,), 'learning_rate_init': 0.0009, 'max_iter': 165}

And we see that the ideal value for learning_rate_init is 0.0009, beta_1 is 0.82, beta_2 is 0.75, hidden_layer_sizes is (19,), and finally max_iter is 165.

Final Neural Network

Now that we have the ideal values for learning_rate_init, hidden_layer_sizes, beta_1, beta_2, and max_iter, we can build the final Neural Network Model:

In [36]:
clf_nn = MLPClassifier(
                    activation='tanh', 
                    learning_rate_init=0.0009, 
                    hidden_layer_sizes = (19,),
                    beta_1=0.82,
                    beta_2=0.75,
                    max_iter=165,
                    random_state = 15
                   ).fit(X_train_scaled, y_train)

plot_confusion_matrix(clf_nn, X_test_scaled, y_test, display_labels=["Did not get a hit", "Got a hit"])
/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sklearn/neural_network/_multilayer_perceptron.py:617: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (165) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
Out[36]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f748f437780>
In [37]:
clf_nn.score(X_test_scaled, y_test) # get accuracy of model
Out[37]:
0.554

In the confusion matrix, we see that of the 489 players that did not get a hit, 277 (56%) were correctly classified. And of the 511 players that did get a hit, 277 (54%) were correctly classified. So the Neural Network model performed worse in terms of correct true positives, but the overall accuracy has slightly improved and the model is no longer overfit.

K-Nearest Neighbors

Preliminary K-Nearest Neighbors

We can now move on to making a K Nearest Neighbors model. K Nearest Neighbors models work by plotting known training data in a defined space for specific features. Unknown testing data is then placed in the same space and compared to the nearest k known training points. The class of the unknown data point is then determined via a majority vote by those k nearest neighbors, while taking the distance of those neighbors into account. K Nearest Neighbors is one of the simplest yet still effective Machine Learning algorithms and is a good starting point for supervised classification problems like ours.

Now let's begin by making a preliminary K Nearest Neighbors model.

In [38]:
# Create a model and fit it to the training data
clf_knn = KNeighborsClassifier().fit(X_train, y_train)

plot_confusion_matrix(clf_knn, X_test, y_test, display_labels=["Did not get a hit", "Got a hit"])
Out[38]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f748e5a56d8>

In the confusion matrix, we see that of the 489 players that did not get a hit, 271 (55%) were correctly classified. And of the 511 players that did get a hit, 270 (53%) were correctly classified. So the preliminary K-NN model was not great.

K-Nearest Neighbors Optimization

Now that we have built our preliminary K-Nearest Neighbors model, we can begin to optimize by looking for ideal values for n_neighbors and distance metric hyperparameters. We will once again use GridSearchCV() and a list of several possible values for every parameter and test the various combinations.

In [39]:
# Round 1
param_grid = {
    'n_neighbors':[5, 10, 15, 20, 25, 35, 40, 45, 50],
    'metric':['euclidean', 'manhattan', 'minkowski', 'mahalanobis'],
}


optimal_params = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    scoring='f1', # f1_micro
    verbose=0,
    n_jobs=10,
    cv=3
)

optimal_params.fit(X_train, y_train)
Out[39]:
GridSearchCV(cv=3, estimator=KNeighborsClassifier(), n_jobs=10,
             param_grid={'metric': ['euclidean', 'manhattan', 'minkowski',
                                    'mahalanobis'],
                         'n_neighbors': [5, 10, 15, 20, 25, 35, 40, 45, 50]},
             scoring='f1')
In [40]:
print(optimal_params.best_params_)
{'metric': 'euclidean', 'n_neighbors': 25}

Final K-Nearest Neighbors Model

Now that we have the ideal values for n_neighbors and metric, we can build the final K Nearest Neighbors Model (we also set weights='distance' so that closer neighbors carry more weight in the vote):

In [41]:
# Create a model and fit it to the training data
clf_knn = KNeighborsClassifier(metric='euclidean', n_neighbors=25, weights='distance').fit(X_train, y_train)

plot_confusion_matrix(clf_knn, X_test, y_test, display_labels=["Did not get a hit", "Got a hit"])
Out[41]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f7489859be0>

In the confusion matrix, we see that of the 489 players that did not get a hit, 291 (59%) were correctly classified. And of the 511 players that did get a hit, 309 (60%) were correctly classified. So the final K-NN model noticeably improved over the preliminary one.

Decision Tree

Preliminary Decision Tree

Now that we are done with our K Nearest Neighbors model, let's move on to Decision Trees. Decision Trees predict the value of a target variable by learning simple decision rules inferred from the data features. They are a good starting algorithm because they are simple to interpret, easily visualized, and work well with various data types. However, Decision Trees tend to become overfit and overly complex on large datasets.

With a brief understanding of how Decision Trees work, let's begin by making a preliminary model.

In [42]:
clf_dt = DecisionTreeClassifier(random_state=35)
clf_dt = clf_dt.fit(X_train, y_train)

plot_confusion_matrix(clf_dt, 
                      X_test, 
                      y_test,
                      display_labels=["Did not get a hit", "Got a hit"])
Out[42]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f748c504630>
In [43]:
plt.figure(figsize=(200, 100))
plot_tree(clf_dt,
          fontsize=10,
          filled = True,
          rounded = True,
          class_names = ["Not a hit","Hit"],
          feature_names = X.columns);

In the confusion matrix, we see that of the 489 players that did not get a hit, 264 (54%) were correctly classified. And of the 511 players that did get a hit, 289 (56%) were correctly classified. So the Decision Tree was not great. And after plotting the entire tree, we can see it is very deep and complex, hinting at overfitting, so let's try to improve the tree with pruning.
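One quick way to quantify that complexity (a small sketch using the fitted clf_dt from above) is to check the tree's depth and leaf count directly:

print(f"Tree depth: {clf_dt.get_depth()}, number of leaves: {clf_dt.get_n_leaves()}") # deep trees with many leaves suggest overfitting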

Decision Tree Optimization

Decision Trees are known to overfit the training data. There are parameters such as max_depth and min_samples_leaf that limit a tree's growth up front and reduce overfitting (a brief sketch of that approach is shown below). However, pruning a fully grown tree with Cost Complexity Pruning is one of the simpler ways to find a smaller tree that improves the accuracy of predictions on the testing data.
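As a point of comparison, here is a minimal pre-pruning sketch; the max_depth and min_samples_leaf values are illustrative only and not tuned here:

# pre-pruning: cap the tree's growth while it is being built (illustrative values)
clf_dt_capped = DecisionTreeClassifier(random_state=35, max_depth=5, min_samples_leaf=20)
clf_dt_capped.fit(X_train, y_train)
print(clf_dt_capped.score(X_test, y_test))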

Pruning a Decision Tree is all about finding an ideal value for the pruning parameter alpha, which controls how little or how much pruning happens. One method of finding the optimal alpha is to plot the accuracy of the tree as a function of different alpha values. We'll utilize this method with both our training and testing datasets.

First, let's find the different alpha values available for this tree and build a pruned tree for each value.

In [44]:
path = clf_dt.cost_complexity_pruning_path(X_train,y_train) # determine values for alphas
ccp_alphas = path.ccp_alphas # extract different values for alphas
ccp_alphas = ccp_alphas[:-1] # exclude the maximum value for alpha; that value would prune ALL leaves, leaving only a stump

clf_dts = [] # list to hold the decision trees

# create one decision tree per value for alpha and store it in the array
for ccp_alpha in ccp_alphas:
    clf_dt = DecisionTreeClassifier(random_state = 35, ccp_alpha=ccp_alpha)
    clf_dt.fit(X_train,y_train)
    clf_dts.append(clf_dt)

Now we'll graph the accuracy of the trees using the training and testing datasets as functions of alpha.

In [45]:
train_scores = [clf_dt.score(X_train,y_train) for clf_dt in clf_dts]
test_scores = [clf_dt.score(X_test,y_test) for clf_dt in clf_dts]

fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label='train', drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label='test', drawstyle="steps-post")
ax.legend()
plt.show()

In the graph above, we can see the accuracy on the testing dataset is highest when alpha is about 0.0015. Past this value, the accuracy of both datasets begins to drop off slightly. Let's verify our findings with cross validation.

In [46]:
alpha_loop_values = [] 

## for each candidate alpha value, we'll run 3-fold cross validation
## then we'll store the mean and standard deviation of the scores from each call
## to cross_val_score in alpha_loop_values

for ccp_alpha in ccp_alphas:
    clf_dt = DecisionTreeClassifier(random_state = 35, ccp_alpha=ccp_alpha)
    scores = cross_val_score(clf_dt, X_train, y_train, cv=3)
    alpha_loop_values.append([ccp_alpha, np.mean(scores), np.std(scores)])

# now we'll draw a graph of the means and standard deviations of the scores for each alpha value
alpha_results = pd.DataFrame(alpha_loop_values,
                            columns=['alpha','mean_accuracy','std'])

alpha_results.plot(x='alpha',
                  y='mean_accuracy',
                  yerr='std',
                  marker='o',
                  linestyle='--')
Out[46]:
<AxesSubplot:xlabel='alpha'>

By using cross validation and drawing the graph, we can see that accuracy is still at its highest when alpha is just below 0.0015. So let's see if we can extract an exact value to use.

In [47]:
alpha_results[(alpha_results['alpha'] > 0.0013)
             &
             (alpha_results['alpha'] < 0.0015)]
Out[47]:
alpha mean_accuracy std
297 0.001302 0.535667 0.017988
298 0.001304 0.535333 0.017594
299 0.001312 0.534667 0.018445
300 0.001315 0.535000 0.018019
301 0.001320 0.537000 0.015513
302 0.001328 0.537000 0.015513
303 0.001337 0.536333 0.014885
304 0.001342 0.535667 0.015063
305 0.001354 0.536333 0.016760
306 0.001362 0.535667 0.017461
307 0.001379 0.536667 0.016780
308 0.001402 0.535667 0.016938
309 0.001434 0.536000 0.015769
310 0.001435 0.536000 0.015769
311 0.001444 0.537000 0.016573
312 0.001465 0.536333 0.017308
313 0.001467 0.536333 0.017308
314 0.001468 0.536333 0.017308
315 0.001469 0.536333 0.017308
316 0.001480 0.535333 0.016007
In [48]:
# store series of alpha values
ideal_ccp_alpha = alpha_results[(alpha_results['alpha'] > 0.0013)
                                 &
                                (alpha_results['alpha'] < 0.0015)]['alpha']
ideal_ccp_alpha
Out[48]:
297    0.001302
298    0.001304
299    0.001312
300    0.001315
301    0.001320
302    0.001328
303    0.001337
304    0.001342
305    0.001354
306    0.001362
307    0.001379
308    0.001402
309    0.001434
310    0.001435
311    0.001444
312    0.001465
313    0.001467
314    0.001468
315    0.001469
316    0.001480
Name: alpha, dtype: float64

And from here we'll select a final alpha value. In our case, we'll use the value at index 316 as it performs well and has a low standard deviation.

In [49]:
ideal_ccp_alpha = float(ideal_ccp_alpha[316]) # select desired alpha value and convert from series to float
ideal_ccp_alpha
Out[49]:
0.0014800361336946713

And now we have an ideal_ccp_alpha to be used as alpha when we build our final tree.

Final Decision Tree Model

Now that we have an ideal value for ccp_alpha we can build the final Decision Tree Model:

In [50]:
clf_dt_pruned = DecisionTreeClassifier(random_state=35, ccp_alpha=ideal_ccp_alpha)
clf_dt_pruned = clf_dt_pruned.fit(X_train, y_train)

plot_confusion_matrix(clf_dt_pruned, 
                      X_test, 
                      y_test,
                      display_labels=["Did not get a hit", "Got a hit"])
Out[50]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f748cac8da0>
In [51]:
plt.figure(figsize=(50, 25))
plot_tree(clf_dt_pruned,
          fontsize=10,
          filled = True,
          rounded = True,
          class_names = ["Not a hit","Hit"],
          feature_names = X.columns);

In the confusion matrix, we see that of the 489 players that did not get a hit, 274 (56%) were correctly classified. And of the 511 players that did get a hit, 280 (55%) were correctly classified. So the final Decision Tree model performs similarly to the preliminary tree in terms of total correct predictions. However, the final model is much simpler and no longer overfit.

Random Forest

Preliminary Random Forest Model

Since our Decision Tree model was not the best, we can move on to another tree-based model in hopes of better performance. Random Forests are tree-based models like Decision Trees; however, Random Forest models utilize ensemble learning, using the outcomes of multiple decision trees to reach a conclusion rather than relying on a single tree. Because of this, Random Forests are generally less prone to overfitting.

So let's go ahead and build a preliminary model.

In [52]:
clf_rf = RandomForestClassifier(random_state=35)
clf_rf.fit(X_train, y_train)

plot_confusion_matrix(clf_rf, 
                      X_test, 
                      y_test,
                      values_format='d',
                      display_labels=["Did not get a hit", "Got a hit"])
Out[52]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f748c8f5a20>
In [53]:
clf_rf.score(X_test, y_test)
Out[53]:
0.558

In the confusion matrix, we see that of the 489 players that did not get a hit, 296 (60%) were correctly classified. And of the 511 players that did get a hit, 262 (51%) were correctly classified. So the initial Random Forest model was not very good, and it leans toward predicting that a player will not get a hit.

In [54]:
plt.figure(figsize=(200, 100))
plot_tree(clf_rf.estimators_[3],
          fontsize=12,
          filled = True,
          rounded = True,
          class_names = ["Not a hit","Hit"],
          feature_names = X.columns);

If we take a look at one of our ensemble trees, we see that the trees being used can be rather deep and complex, which also hints at overfitting taking place.
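We can also check for overfitting numerically; a quick sketch comparing the fitted clf_rf's training and testing accuracy:

# a large gap between training and testing accuracy is a classic sign of overfitting
print(f"Training accuracy: {clf_rf.score(X_train, y_train):.3f}")
print(f"Testing accuracy:  {clf_rf.score(X_test, y_test):.3f}")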

Random Forest Optimization

Now that we have built our preliminary Random Forest model and have noticed evidence of overfitting, we can begin to optimize by looking for ideal values for the max_depth, max_features, and n_estimators hyperparameters. We will once again use GridSearchCV() with a list of several candidate values for each parameter and test the various combinations.

In [55]:
param_grid = [
  {'max_depth': [3, 4, 5], 
   'max_features': [5, 10, 'sqrt', 'log2'],
   'n_estimators': [50, 65, 100, 200]},
]

optimal_params = GridSearchCV(
        RandomForestClassifier(random_state=35), 
        param_grid,
        cv=3,
        scoring='roc_auc',
        ## For more scoring metics see: 
        ## https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
        verbose=0 # if you want to see what GridSearchCV is doing, set verbose=2
    )
    
optimal_params.fit(X_train, y_train)
print(optimal_params.best_params_)
{'max_depth': 4, 'max_features': 10, 'n_estimators': 65}

Now that GridSearchCV() has run, we see that the ideal value for max_features is 10, n_estimators is 65, and max_depth is 4.

Final Random Forest Model

Now that we have the ideal values for max_depth, n_estimators, and max_features, we can build the final Random Forest Model:

In [56]:
clf_rf = RandomForestClassifier(max_depth=4, n_estimators=65, max_features=10, random_state=35)
clf_rf.fit(X_train, y_train)

plot_confusion_matrix(clf_rf,
                      X_test, 
                      y_test,
                      values_format='d',
                      display_labels=["Did not get a hit", "Got a hit"])
Out[56]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f748c79eef0>
In [57]:
clf_rf.score(X_test, y_test)
Out[57]:
0.56

In the confusion matrix, we see that of the 489 players that did not get a hit, 280 (57%) were correctly classified. And of the 511 players that did get a hit, 280 (55%) were correctly classified. So the final Random Forest model was only slightly improved in terms of total correct predictions, but is no longer overfit.

In [58]:
plt.figure(figsize=(90, 30))
plot_tree(clf_rf.estimators_[3],
          fontsize=25,
          filled = True,
          rounded = True,
          class_names = ["Not a hit","Hit"],
          feature_names = X.columns);

We can also see that the trees being used are now much shallower, simpler, and less prone to overfitting. So we can safely say the model was improved.

Extreme Gradient Boosting

Preliminary XGBoost Model

Now that we've produced one ensemble method, let's try a different type of ensemble tree-based model: XGBoost. XGBoost and Random Forest models differ in that, rather than producing a collection of decision trees that are independent of each other, XGBoost utilizes Boosting to build a sequence of learner trees in which each new tree compensates for the weaknesses of its predecessors.

Now that we've briefly explained XGBoost, let's build a baseline model.

In [59]:
# Create a model and fit it to the training data
clf_xgb = xgb.XGBClassifier(objective='binary:logistic', missing=1, use_label_encoder=False, seed=42)
clf_xgb.fit(X_train, 
            y_train,
            verbose= True,
            early_stopping_rounds=10,
            eval_metric='aucpr',
            eval_set=[(X_test, y_test)])
plot_confusion_matrix(clf_xgb,X_test, y_test)
[0]	validation_0-aucpr:0.53296
[1]	validation_0-aucpr:0.53930
[2]	validation_0-aucpr:0.54378
[3]	validation_0-aucpr:0.55654
[4]	validation_0-aucpr:0.55137
[5]	validation_0-aucpr:0.55593
[6]	validation_0-aucpr:0.55597
[7]	validation_0-aucpr:0.55225
[8]	validation_0-aucpr:0.55437
[9]	validation_0-aucpr:0.55278
[10]	validation_0-aucpr:0.54873
[11]	validation_0-aucpr:0.55084
[12]	validation_0-aucpr:0.55193
[13]	validation_0-aucpr:0.55554
Out[59]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f748cc1b8d0>

In the confusion matrix, we see that of the 489 players that did not get a hit, 269 (55%) were correctly classified. And of the 511 players that did get a hit, 277 (54%) were correctly classified. So the preliminary XGBoost model was okay but not great.

XGBoost Optimization

Now that we have a preliminary XGBoost model, let's once again try to optimize it by searching for optimal hyperparameters using GridSearchCV(). For XGBoost models, we'll attempt to find ideal values for the reg_alpha, reg_lambda, and max_depth hyperparameters.

In [60]:
param_grid = {
    'max_depth':[5, 4, 2], 
    'reg_lambda':[5, 4, 3,], 
    'reg_alpha':[1, 2, 3,], 
}

optimal_params = GridSearchCV(
    estimator=xgb.XGBClassifier(objective= 'binary:logistic',
                               seed=42,
                               subsample=0.8,
                               use_label_encoder=False,
                               colsample_bytree=0.5),
    param_grid=param_grid,
    scoring='f1_micro',
    verbose=0,
    n_jobs=10,
    cv=3
)

optimal_params.fit( X_train, 
                    y_train,
                    early_stopping_rounds=10,
                    verbose= False,
                    eval_metric='aucpr',
                    eval_set=[(X_test, y_test)])
Out[60]:
GridSearchCV(cv=3,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=0.5,
                                     enable_categorical=False, gamma=None,
                                     gpu_id=None, importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estimators=100, n_jobs=None,
                                     num_parallel_tree=None, predictor=None,
                                     random_state=None, reg_alpha=None,
                                     reg_lambda=None, scale_pos_weight=None,
                                     seed=42, subsample=0.8, tree_method=None,
                                     use_label_encoder=False,
                                     validate_parameters=None, verbosity=None),
             n_jobs=10,
             param_grid={'max_depth': [5, 4, 2], 'reg_alpha': [1, 2, 3],
                         'reg_lambda': [5, 4, 3]},
             scoring='f1_micro')
In [61]:
print(optimal_params.best_params_)
{'max_depth': 4, 'reg_alpha': 2, 'reg_lambda': 5}

By optimizing with GridSearchCV(), we now see the ideal value for reg_alpha is 2, reg_lambda is 5, and max_depth is 4. We'll now use these values in the final XGBoost model.

Final XGBoost Model

Now that we have the ideal values for reg_alpha, reg_lambda, and max_depth, we can build the final XGBoost Model:

In [62]:
clf_xgb = xgb.XGBClassifier(objective='binary:logistic', 
                            missing= 1, 
                            max_depth= 4, 
                            reg_lambda= 5,
                            reg_alpha= 2, 
                            use_label_encoder=False)
clf_xgb.fit(X_train, 
            y_train,)

plot_confusion_matrix(clf_xgb, X_test, y_test)
[03:05:44] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Out[62]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f74882f60b8>
In [63]:
clf_xgb.score(X_test, y_test)
Out[63]:
0.567

In the confusion matrix, we see that of the 489 players that did not get a hit, 280 (57%) were correctly classified. And of the 511 players that did get a hit, 287 (56%) were correctly classified. So the XGBoost model improved slightly.

Final Model Evaluation

Now that we've made and optimized Support Vector Machine, Neural Network, K Nearest Neighbors, Decision Tree, Random Forest, and XGBoost models, let's review and compare the final models.

In [64]:
clf_svm = SVC(C=1, gamma=0.12,random_state=15)
clf_svm.fit(X_train_scaled, y_train)

plot_confusion_matrix(clf_svm, 
                      X_test_scaled, 
                      y_test,
                      values_format='d',
                      display_labels=["Did not get a hit", "Got a hit"])
Out[64]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f74b2c23eb8>
In [65]:
y_pred = clf_svm.predict(X_test_scaled)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
              precision    recall  f1-score   support

           0       0.56      0.56      0.56       489
           1       0.58      0.58      0.58       511

    accuracy                           0.57      1000
   macro avg       0.57      0.57      0.57      1000
weighted avg       0.57      0.57      0.57      1000

Matthews correlation coefficient: 0.142

As we review our final Support Vector Machine, looking solely at the confusion matrix, the model appears better at predicting True Positives than True Negatives. However, that reading ignores the slight class imbalance in our test set (489 players without a hit versus 511 with one). It is therefore useful to also look at metrics such as precision ("the proportion of positive identifications that were actually correct"), recall ("the proportion of actual positives that were correctly identified"), and the f1 score (the harmonic mean of precision and recall). Examining our classification report, all three metrics are higher for positive predictions, but only by about 2% over the negative predictions. So, given a completely balanced dataset, we should expect a more even distribution of predictions than the confusion matrix might lead us to believe.

Although these metrics are good for measuring performance on positive predictions, for problems where negative predictions are equally important it is better to use the Matthews correlation coefficient (MCC), which "takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes". Where our other metrics range from 0 to 1, the MCC ranges from -1 to +1, and anything over 0 is better than random guessing. Our value of 0.142, although not great, isn't terrible either and provides a solid baseline to compare our other models against.
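For reference, the MCC can be computed directly from the confusion-matrix counts; below is a minimal sketch (assuming y_test and the y_pred from the cell above are still in scope) that should reproduce the value reported by matthews_corrcoef:

from sklearn.metrics import confusion_matrix

# confusion_matrix returns counts as [[TN, FP], [FN, TP]] for binary labels 0/1
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
mcc = (tp * tn - fp * fn) / np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
print(round(mcc, 3)) # should match the ~0.142 printed above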

In [66]:
clf_nn = MLPClassifier(activation='tanh', 
                    learning_rate_init=0.0009, 
                    hidden_layer_sizes = (19,),
                    beta_2=0.75,
                    beta_1=0.82,
                    max_iter=165,
                    random_state = 15
                   ).fit(X_train_scaled, y_train)

plot_confusion_matrix(clf_nn, X_test_scaled, y_test, display_labels=["Did not get a hit", "Got a hit"])
/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sklearn/neural_network/_multilayer_perceptron.py:617: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (165) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
Out[66]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f74b2ca05c0>
In [67]:
y_pred = clf_nn.predict(X_test_scaled)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
              precision    recall  f1-score   support

           0       0.54      0.57      0.55       489
           1       0.57      0.54      0.55       511

    accuracy                           0.55      1000
   macro avg       0.55      0.55      0.55      1000
weighted avg       0.55      0.55      0.55      1000

Matthews correlation coefficient: 0.109

A quick glance at the confusion matrix for our Neural Network model makes it clear that this model did not perform as well as our Support Vector Machine. However, examining the other scoring metrics lets us understand the finer differences between the models. Our Support Vector Machine had identical precision and recall scores for each class, indicating a very balanced model. Our Neural Network model, by contrast, has slightly differing precision and recall scores: it correctly identifies a larger proportion of the actual negatives, but we can be more confident in its positive predictions than in its negative predictions. We can also see that its MCC of 0.109, although still above 0, is noticeably lower than the SVM's 0.142. So it's safe to say the SVM is the better performer of the two models.

In [68]:
# Create a model and fit it to the training data
clf_knn = KNeighborsClassifier(metric='euclidean', n_neighbors=25, weights='distance').fit(X_train, y_train)

plot_confusion_matrix(clf_knn, 
                      X_test, 
                      y_test, 
                      display_labels=["Did not get a hit", "Got a hit"])
Out[68]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f7478238b70>
In [69]:
y_pred = clf_knn.predict(X_test)

print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
              precision    recall  f1-score   support

           0       0.59      0.60      0.59       489
           1       0.61      0.60      0.61       511

    accuracy                           0.60      1000
   macro avg       0.60      0.60      0.60      1000
weighted avg       0.60      0.60      0.60      1000

Matthews correlation coefficient: 0.2

Taking another look at the confusion matrix for our K Nearest Neighbors model, we can see a vast improvement in performance relative to the Neural Network model. Like our SVM, this model makes more correct positive predictions than negative predictions; however, once we take class balance into account and look at the precision and recall scores, the model actually performs similarly for both classes. We also notice the MCC of 0.2 is the highest of our models so far, nearly double our Neural Network score of 0.109 and well ahead of our SVM score of 0.142. With the performance seen in the confusion matrix and the high MCC, our K Nearest Neighbors model is easily the best of the bunch so far.

In [70]:
clf_dt_pruned = DecisionTreeClassifier(random_state=35, ccp_alpha=ideal_ccp_alpha)
clf_dt_pruned = clf_dt_pruned.fit(X_train, y_train)

plot_confusion_matrix(clf_dt_pruned, 
                      X_test, 
                      y_test,
                      display_labels=["Did not get a hit", "Got a hit"])
Out[70]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f748cac8908>
In [71]:
y_pred = clf_dt_pruned.predict(X_test)

print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
              precision    recall  f1-score   support

           0       0.54      0.56      0.55       489
           1       0.57      0.55      0.56       511

    accuracy                           0.55      1000
   macro avg       0.55      0.55      0.55      1000
weighted avg       0.55      0.55      0.55      1000

Matthews correlation coefficient: 0.108

Looking at the confusion matrix for our Decision Tree model, it's clear the performance is not good enough. A glance at the classification report and MCC shows performance similar to our Neural Network model, the worst of the models examined so far. There is no need to dig deeply into this model's poor performance when we already have a significantly better performing one, so let's move on to the next model.

In [72]:
clf_rf = RandomForestClassifier(max_depth=4, n_estimators=65, max_features=10, random_state=35)
clf_rf.fit(X_train, y_train)

plot_confusion_matrix(clf_rf,
                      X_test, 
                      y_test,
                      values_format='d',
                      display_labels=["Did not get a hit", "Got a hit"])
Out[72]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f74781c4b00>
In [73]:
y_pred = clf_rf.predict(X_test)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
              precision    recall  f1-score   support

           0       0.55      0.57      0.56       489
           1       0.57      0.55      0.56       511

    accuracy                           0.56      1000
   macro avg       0.56      0.56      0.56      1000
weighted avg       0.56      0.56      0.56      1000

Matthews correlation coefficient: 0.121

Another quick glance at the confusion matrix shows another poorly performing model relative to our best. Although an MCC of 0.121 indicates our Random Forest model performs better than our Neural Network and Decision Tree models, it is still third best behind our K Nearest Neighbors and SVM models. So we'll now move on to examining our final model, XGBoost.

In [74]:
clf_xgb = xgb.XGBClassifier(objective='binary:logistic', 
                            missing= 1, 
                            max_depth= 4, 
                            reg_lambda= 5,
                            reg_alpha= 2, 
                            use_label_encoder=False)
clf_xgb.fit(X_train, 
            y_train,)

plot_confusion_matrix(clf_xgb, X_test, y_test)
[03:06:29] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Out[74]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f74781642b0>
In [75]:
y_pred = clf_xgb.predict(X_test)
print(classification_report(y_test,y_pred,))
print(f"Matthews correlation coefficient: {round(matthews_corrcoef(y_test, y_pred),3)}")
              precision    recall  f1-score   support

           0       0.56      0.57      0.56       489
           1       0.58      0.56      0.57       511

    accuracy                           0.57      1000
   macro avg       0.57      0.57      0.57      1000
weighted avg       0.57      0.57      0.57      1000

Matthews correlation coefficient: 0.134

A look at our final confusion matrix shows that although performance is better than most of the previous models we've examined, XGBoost still falls short of our K Nearest Neighbors model. While an MCC of 0.134 is better than our Random Forest model, it's still slightly behind our SVM score of 0.142.

In [76]:
svc = SVC(C=1, gamma=0.12,random_state=15)
svc.fit(X_train_scaled, y_train)

knn = KNeighborsClassifier(metric='euclidean', n_neighbors=25, weights='distance')
knn.fit(X_train, y_train)

svc_disp = plot_roc_curve(svc, X_test_scaled, y_test)
knn_disp = plot_roc_curve(knn, X_test, y_test, ax=svc_disp.ax_)

knn_disp.figure_.suptitle("SVM/K-NN ROC Curve Comparison")

plt.show()

As a final analysis of model performance, we'll take a look at the ROC curves of our top two performing models.

Receiver Operating Characteristic (ROC) curves are a way to visualize the trade-off between Sensitivity (the True Positive Rate) and Specificity (1 − the False Positive Rate). ROC curves are commonly used, together with the AUC (Area Under the Curve), to summarize a model's ability to predict classifications. A perfect curve would rise immediately along the vertical axis and then run horizontally across the top of the graph, producing an AUC of 1, whereas a 45-degree diagonal line represents performance no better than random guessing and produces an AUC of 0.5.
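If we want the actual AUC values behind those curves, a minimal sketch (reusing the svc and knn models fitted in the cell above) could use roc_auc_score; SVC exposes decision_function for its scores, while K-NN exposes predict_proba. The display objects returned by plot_roc_curve should also hold these values as svc_disp.roc_auc and knn_disp.roc_auc.

from sklearn.metrics import roc_auc_score

svc_auc = roc_auc_score(y_test, svc.decision_function(X_test_scaled)) # SVM scores on the scaled test data
knn_auc = roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1])      # K-NN probability of the "hit" class
print(f"SVM AUC: {svc_auc:.3f}, K-NN AUC: {knn_auc:.3f}")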

As we can see in our graph, our top-performing K Nearest Neighbors model produces a better ROC curve than our SVM. After calculating the AUC of the two curves, we find the K Nearest Neighbors model's AUC is about 4% higher than the SVM's. To many, 4% may not seem like a significant difference. However, given the context of our problem and how much closer we are to the random-guessing AUC of 0.5 than to a perfect AUC of 1, 4% is a sizeable difference.

Saving the Final Model

Now that we've established that our K Nearest Neighbors model is the best of the bunch, the only thing left to do is save the model for later use!

In [77]:
import pickle 

pickle.dump(clf_knn, open("mlb_hitter_model.pickle", "wb"))

Finally, we save our model so it can be loaded and prepared for deployment later!
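As a quick sanity check, here's a minimal sketch of loading the pickled model back and predicting on a few test rows, mimicking what a deployment script would do:

# load the saved model back from disk and confirm it still predicts
loaded_knn = pickle.load(open("mlb_hitter_model.pickle", "rb"))
print(loaded_knn.predict(X_test[:5])) # predictions (0 = no hit, 1 = hit) for the first five test players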

References

  1. Pedregosa, F. et al., 2011. Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), pp.2825–2830.

  2. Starmer, J., 2018. StatQuest. [online] StatQuest. Available at: https://statquest.org/

  3. Google Developers. 2022. Classification: Precision and Recall | Machine Learning Crash Course | Google Developers. [online] Available at: https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall [Accessed 11 February 2022].

  4. Wikipedia contributors. (2022, January 30). Phi coefficient. In Wikipedia, The Free Encyclopedia. Retrieved 01:51, February 16, 2022, from https://en.wikipedia.org/w/index.php?title=Phi_coefficient&oldid=1068832478

  5. Wikipedia contributors. (2022, January 25). Receiver operating characteristic. In Wikipedia, The Free Encyclopedia. Retrieved 01:55, February 16, 2022, from https://en.wikipedia.org/w/index.php?title=Receiver_operating_characteristic&oldid=1067917334

  6. Chan, C., 2018. What is a ROC Curve and How to Interpret It. [online] Displayr. Available at: https://www.displayr.com/what-is-a-roc-curve-how-to-interpret-it/ [Accessed 12 February 2022].

  7. Xgboost.readthedocs.io. 2021. XGBoost Documentation. [online] Available at: https://xgboost.readthedocs.io/en/stable/index.html [Accessed 8 February 2022].