Nowadays, even the casual NBA viewer can see that teams are shooting more three-point shots than ever before. This is because guards, wings, and bigs in the NBA now have free rein to take these shots. In the past, coaches regarded players who shot three-pointers as specialists. Now, most players are expected to have this skill. But how does the three-point phenomenon relate to position? Today, teams carry more “Stretch 4s” on their rosters, meaning forwards who can stretch the floor (shoot the 3). In the past, these forwards shot the mid-range jumper, a shot that is going out of favor on many, but not all, teams. The goal of this project is to see how the three-point shot has changed our view of position. To accomplish this goal, I must first scrape the data from basketball-reference.com, use cluster analysis to see how many positions the NBA really has, label those positions using prior knowledge, and then see how those positions have changed over time. I will go through and explain these steps and the decisions I made.
I used many of the libraries below, but not all of them. I will remove the unused imports in a later update.
import re
import requests
import requests_cache
import pandas as pd
from bs4 import BeautifulSoup
requests_cache.install_cache('bball_ref_cache')
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import manifold
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial.distance import cdist, pdist, squareform
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
import warnings
warnings.filterwarnings('ignore')
Scraping this website was pretty straightforward with BeautifulSoup. basketball-reference.com is set up so that I could get a full season of stats for all players with a single request. So, for 17 seasons, I only had to request the website 17 times for each stat type. The only problem I ran into was that a player appears multiple times in the data frame if he switched teams during the season. Thankfully, basketball-reference.com also lists the player's total stats for the season (the row with team 'TOT'). I kept that row and deleted the rows for the individual teams he played for. At the end of each loop iteration, I added a 'Year' column to the data frame, since I would be using that column heavily for subsetting.
def bball_scraper(stat_type, start_year, end_year):
    '''
    bball_scraper scrapes certain pages on basketball-reference.com.
    inputs: stat_type: one of 5 accepted stat types, given as a string:
                       'per_game', 'totals', 'advanced', 'per_poss', and 'per_minute'
            start_year: an integer from 1947 to 2017
            end_year: an integer from 1947 to 2017
    outputs: bball_df: a dataframe that contains all of the NBA stats for the years and stat_type selected
    '''
    bball_df = pd.DataFrame()
    url = 'http://www.basketball-reference.com/leagues/NBA_'
    for i in range(end_year-start_year + 1):
        #make correct url to request the website
        url2 = url + r'%s_' %start_year + r'%s.html' %stat_type
        #request and utilize BeautifulSoup
        year = BeautifulSoup(requests.get(url2).content, "lxml")
        #get columns for final dataframe
        columns = year.find('thead').text.split('\n')
        #find where the stats start
        stats = year.find('tbody')
        #get rid of the header rows that are repeated inside the table body
        for figure in stats.find_all('tr', 'thead'):
            figure.decompose()
        #get every remaining table row
        data = stats.find_all('tr')
        #pull the text out of every cell, one list per player row
        player = [[td.getText() for td in row.findAll('td')] for row in data]
        #get rid of extraneous player rows, if they played for more than 1 team
        temp = pd.DataFrame(player)
        for index, row in temp.iterrows():
            if row[3] == 'TOT':
                pname = row[0]
                temp = temp[(temp[0] != pname) | (temp[3] == 'TOT')]
        temp['Year'] = start_year
        #append and start all over again (DataFrame.append was removed in pandas 2.0)
        bball_df = pd.concat([bball_df, temp])
        start_year = start_year + 1
    #trim the header list so that it lines up with the scraped cells
    columns = columns[3:-2]
    columns.append('Year')
    #set column values
    bball_df.columns = columns
    return bball_df
bball_per_game = bball_scraper('per_game', 2001, 2017)
bball_advanced = bball_scraper('advanced', 2001, 2017)
bball_100poss = bball_scraper('per_poss', 2001, 2017)
bball_per_game['MPG'] = bball_per_game['MP']
bball_per_game['MP'] = bball_advanced['MP']
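The multi-team cleanup described above can be sketched on a toy table without touching the website. This is only an illustration under stated assumptions: the player names, team codes, and the `Player`/`Tm`/`PTS` column names are made up for the example, and it uses a boolean mask instead of the row-by-row loop in `bball_scraper`.

```python
import pandas as pd

# Toy stand-in for one scraped season (hypothetical players and values).
season = pd.DataFrame({
    'Player': ['A. Guard', 'A. Guard', 'A. Guard', 'B. Wing', 'C. Big'],
    'Tm':     ['TOT', 'BOS', 'LAL', 'MIA', 'UTA'],
    'PTS':    [12.0, 10.0, 14.0, 8.5, 15.2],
})

# Players who have a season-total row, i.e. who changed teams.
traded = season.loc[season['Tm'] == 'TOT', 'Player']

# Keep a row if it is a total row, or if the player never changed teams.
keep = (season['Tm'] == 'TOT') | ~season['Player'].isin(traded)
deduped = season[keep].reset_index(drop=True)
print(deduped)
```

The mask does the same job as the loop, but in one pass over the table, which may matter when many seasons are stacked together.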
The function silhouette outputs two plots that help choose an optimal number of clusters for KMeans cluster analysis. The function kmeans runs the KMeans analysis and returns the cluster of each element. The function distance_map is purely for visualization, since I did not use PCA for my KMeans analysis. I found that using the raw data gave me slightly more accurate clusters, so I made a 2D distance map to visualize the clusters.
def silhouette(data):
    '''
    this function outputs two plots that show what the optimal K would be in a KMeans analysis
    inputs: data: any DataFrame
    outputs: a plot of silhouette score for a k range of 2 to 29
             a plot of Average Within-Cluster Sum of Squares for a k range of 2 to 29, more commonly known as an elbow plot
    '''
    #http://datascience.stackexchange.com/questions/6508/k-means-incoherent-behaviour-choosing-k-with-elbow-method-bic-variance-explain
    #range of the KMeans analysis, can be changed on a whim
    k = range(2,30)
    score = []
    elbow = []
    for n_clusters in k:
        #fit data to KMeans
        clusters = KMeans(n_clusters=n_clusters).fit(data)
        #get labels (clusters)
        cluster_labels = clusters.labels_
        #get cluster centers
        centroid = clusters.cluster_centers_
        #use euclidean distance of centroid to data and return minimum for each data index
        euclid = np.min(cdist(data, centroid, 'euclidean'), axis = 1)
        #sum euclid and divide by shape[0] to get average
        #note: this averages plain distances, not squared distances
        avgss = sum(euclid)/data.shape[0]
        elbow.append(avgss)
        #use function to get silhouette score
        silhouette_avg = silhouette_score(data, cluster_labels)
        score.append(silhouette_avg)
    #plot both plots
    plt.plot(k, score, 'b*-')
    plt.ylabel("Silhouette Score")
    plt.xlabel("Number of Clusters")
    plt.title("Silhouette Score for KMeans Cluster Analysis")
    plt.show()
    plt.plot(k, elbow, 'b*-')
    plt.ylabel("Average Within-Cluster Sum of Squares")
    plt.xlabel("Number of Clusters")
    plt.title("Elbow Curve for KMeans Cluster Analysis")
    plt.show()
def kmeans(data, n_clusters):
    '''
    this function does a KMeans analysis of the data given for the number of clusters given
    inputs: data: any dataframe
            n_clusters: the number of clusters you want
    outputs: cluster_labels: the cluster a certain index belongs to
    '''
    #do KMeans and return clusters
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(data)
    return cluster_labels
def distance_map(data, clusters):
    '''
    this function takes in a distance matrix and the clusters and outputs a plot
    inputs: data: a SQUARE distance matrix, as a DataFrame
            clusters: the cluster that each index belongs to, in a DataFrame or list
    outputs: a distance plot that color-codes clusters, for up to 9 clusters (more can be added if needed)
    '''
    #http://baoilleach.blogspot.com/2014/01/convert-distance-matrix-to-2d.html
    adist = np.array(data)
    amax = np.amax(adist)
    adist /= amax
    mds = manifold.MDS(n_components=2, dissimilarity="precomputed")
    results = mds.fit(adist)
    coords = pd.DataFrame(results.embedding_, index = data.index)
    coords['Cluster'] = clusters
    #http://stackoverflow.com/questions/26139423/plot-different-color-for-different-categorical-levels-using-matplotlib
    colors = {0:'grey', 1:'red', 2:'blue', 3:'green', 4:'black', 5:'yellow', 6:'purple', 7:'pink', 8:'white'}
    plt.scatter(coords[0], coords[1], marker = 'o', c=coords['Cluster'].apply(lambda x: colors[x]))
    plt.title('2D Representation of Distance Matrix and Color Representation of Clusters')
    plt.show()
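To make the idea behind the silhouette plot concrete, here is a minimal sketch on synthetic data rather than on the NBA stats. The three cluster centers, the noise level, and the range of k are assumptions chosen so that the "right" answer is obvious. With clusters this well separated, the silhouette score should peak at three, whereas real player data tends to give a much flatter curve.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three tight, well-separated blobs (synthetic data, not player stats).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

# Compute the average silhouette score for each candidate k.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# The k with the highest average silhouette score is the suggested choice.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```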
I had to do a couple of things to get the DataFrame that I wanted to analyze. I merged the per-100-possessions DataFrame with the advanced one in order to get more stats to cluster on. I made the entire DataFrame numeric, so that the numbers actually had meaning. Then, I removed many columns that were just another version of a statistic I already had, so that no single aspect of basketball would carry too much weight in the clustering.
#incorporate advanced stats and transform dataframe into what I want
bball_100poss2 = bball_100poss.merge(bball_advanced)
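#drop the columns at positions -5 and -10 (selected by position, which current pandas needs to be explicit)
bball_100poss2 = bball_100poss2.drop(bball_100poss2.columns[[-5,-10]], axis = 1)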
bball_100poss2 = bball_100poss2.apply(lambda x: pd.to_numeric(x, errors='ignore'))
bball_100poss2 = bball_100poss2.loc[bball_100poss2['MP'] > 250]
bball_100poss2.fillna(0, inplace=True)
bball_100poss2 = bball_100poss2.set_index(bball_100poss2['Player'])
bball_100poss2 = bball_100poss2.drop(['Player','Pos','Age','Tm','G','GS','MP', 'TS%', '3PAr', 'FTr', 'PER',
'ORtg','DRtg', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM', 'BPM', 'VORP', '',
'FG', '3P', '2P', 'FT', 'TRB', 'TRB%', 'FGA', 'FG%', 'PTS'],axis = 1)
bball_100poss2016 = bball_100poss2.loc[bball_100poss2['Year'] == 2016]
bball_100poss2016.head()
This was just a check to see if I was on the right track. The example below shows the players whose stats are closest to Festus Ezeli's, and the results suggest that I was.
#use euclidean distance to see if it really works
df1 = bball_100poss2016
dist = pdist(df1, 'euclidean')
df_dist = pd.DataFrame(squareform(dist), index = df1.index, columns = df1.index)
df_dist['Festus Ezeli'].sort_values().head()
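The same nearest-neighbor check can be sketched on a tiny made-up table, which makes it easier to see what `pdist` and `squareform` are doing. The four players and the two stat columns below are hypothetical. The point is only that players with similar stat lines end up with small pairwise distances.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical per-100-possession stats for four players.
stats = pd.DataFrame(
    {'3PA': [8.0, 7.5, 0.2, 0.1], 'ORB': [1.0, 1.2, 5.0, 4.8]},
    index=['Shooter A', 'Shooter B', 'Big A', 'Big B'],
)

# pdist returns the condensed pairwise distances; squareform expands them
# into a full symmetric matrix with zeros on the diagonal.
dist = pd.DataFrame(
    squareform(pdist(stats, 'euclidean')),
    index=stats.index, columns=stats.index,
)

# Closest other player to 'Big A' (drop the player himself first).
nearest = dist['Big A'].drop('Big A').idxmin()
print(nearest)
```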
In order to choose a K for the KMeans analysis, I used two indicators and my own knowledge of basketball. Most of the time, the silhouette score suggests that the optimal K is around 8, 9, or 10. The elbow in this case is very gradual rather than sudden, so it does not provide much information. I tested many different possibilities and ended up choosing K = 9 as the best way to cluster this dataset.
#use silhouette analysis and elbow curve to determine optimal KMeans
silhouette(bball_100poss2016)
Below are self-explanatory visualizations of the clusters. The DataFrame below gives the mean of each statistic for every cluster. The pie chart shows what share of the dataset each cluster accounts for. The 2D representation of the distance matrix shows how the clusters differ, but also how they overlap, since many players are skilled and can do many things on the court.
bball_100poss2016['Cluster'] = kmeans(bball_100poss2016, 9)
#get means of every cluster for each statistic
bball_100poss2016.groupby('Cluster').mean()
#visualize the clusters using pie charts and scatter plots
bball_100poss2016['Cluster'].value_counts().plot(kind = 'pie', autopct='%.2f')
plt.title('Pie Graph of Different Clusters')
plt.show()
distance_map(df_dist, bball_100poss2016['Cluster'].transpose())
In order to get count data by position for all of the years, I used logistic regression, trained on the 2016 clusters, to predict which cluster the rest of the players belong to. Then, I manually set the position for each of the clusters based on their statistics and my prior knowledge of basketball. To visualize this, I also constructed a pie chart.
#http://stackoverflow.com/questions/36760000/python-how-to-use-multinomial-logistic-regression-using-sklearn
#using logistic regression to predict for the rest of the 16 years
#.ix was removed from pandas, so select by position with .iloc
X = bball_100poss2016.iloc[:, 0:21]
y = bball_100poss2016.iloc[:, -1]
lr = LogisticRegression()
lr.fit(X,y)
#predict THEM ALL
predict = lr.predict(bball_100poss2.iloc[:,0:21])
bball_100posspredict = bball_100poss2
bball_100posspredict['Cluster'] = predict
#set positions manually; the cluster numbers can change when the data or the KMeans settings change,
#so re-check this mapping after re-running kmeans()
position_names = {0: 'Elite Scoring Ballhandlers',
                  1: 'Catch-and-Shoot Wings',
                  2: 'Scoring Bigs',
                  3: 'Low-Usage Forwards',
                  4: 'Stretch Forwards',
                  5: 'Low-Usage Wings',
                  6: 'Scoring Guards',
                  7: 'Primary Ballhandlers',
                  8: 'Defensive Bigs'}
bball_100posspredict['Position'] = bball_100posspredict['Cluster'].map(position_names)
bball_100posspredict['Position'].value_counts().plot(kind = 'pie', autopct='%.2f')
plt.title('Pie Chart of Entire Dataset')
plt.show()
bball_100poss2016.loc[bball_100poss2016['Cluster'] == 0]
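The "cluster one season, then classify the rest" step can be sketched end to end on synthetic data. Everything below is an assumption made for illustration: two made-up player types, two features, and a `sample` helper that does not exist in the notebook. It is not the model fitted above, only the same pattern at a smaller scale.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [8.0, 8.0]])  # two made-up player types

def sample(n):
    # Hypothetical helper: n players around each center, stacked in center order.
    return np.vstack([c + rng.normal(scale=0.5, size=(n, 2)) for c in centers])

reference = sample(40)  # stands in for the one clustered season
others = sample(25)     # stands in for every other season

# Step 1: cluster the reference season to get labels.
ref_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reference)

# Step 2: train a classifier on those labels, then label the remaining players.
clf = LogisticRegression(max_iter=1000).fit(reference, ref_labels)
predicted = clf.predict(others)
print(predicted)
```

One consequence of this design is that the classifier can only ever assign the cluster types found in the reference season, so a player type that existed only in earlier years would be forced into the closest current cluster.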
Now we can finally answer the question: how has position in the NBA changed over time? All we had to do was scrape the data, apply a clustering model to one year of data, predict clusters for the rest of the data using logistic regression, and manually go through each cluster to assign a label that made sense.
#get position counts for every year
positioncounts = []
for i in bball_100posspredict['Year'].unique():
    #count the players at each position for year i
    positioncounts.append(bball_100posspredict.groupby(['Year', 'Position']).size()[i])
positioncounts = pd.DataFrame(positioncounts, index = bball_100posspredict['Year'].unique()).transpose()
#get average to compare
positioncounts['Average'] = positioncounts.mean(axis = 1)
positioncounts
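For reference, the same kind of position-by-year table can be built in one call with `pd.crosstab`. The sketch below runs on a small hypothetical set of labelled rows rather than on `bball_100posspredict`, so the counts are made up. It is offered as an alternative way to get the same shape of table, not as what the loop above does internally.

```python
import pandas as pd

# Hypothetical labelled player-seasons.
labelled = pd.DataFrame({
    'Year': [2015, 2015, 2015, 2016, 2016, 2016, 2016],
    'Position': ['Stretch Forwards', 'Scoring Bigs', 'Scoring Bigs',
                 'Stretch Forwards', 'Stretch Forwards', 'Scoring Bigs',
                 'Stretch Forwards'],
})

# Rows are positions, columns are years, cells are player counts.
counts = pd.crosstab(labelled['Position'], labelled['Year'])

# Average count per position across the years, for comparison.
counts['Average'] = counts.mean(axis=1)
print(counts)
```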
The two glaring results in the table above are that the numbers of Catch-and-Shoot Wings and Stretch Forwards are at an all-time high in the NBA. In addition, we can see that the numbers of Scoring Bigs and Low-Usage Forwards are at an all-time low. So is the number of Scoring Guards, but I believe that is an aberration, since it is based on just one year of data. The counts for the other positions are within reason, although some are trending upward while others are trending downward. From these results, we can conclude that the three-point shot has indeed changed how many players we see at each position in the game today. More players are geared towards shooting beyond the arc now than ever before. Teams are picking up on what ancient civilizations learned many millennia ago: 3 > 2, especially when you can make shots beyond the arc at a similar efficiency to shots inside the arc. However, not everyone benefits from this increase in shots beyond the arc. The Scoring Bigs and Low-Usage Forwards, once two of the most prominent positions in basketball, are now almost extinct, having been replaced by the Catch-and-Shoot Wings and Stretch Forwards of today. As the era has changed, fans are left to wonder what will become the new staple of basketball. Will it be the three-point shot, will the low-post game make its return, or will it be something entirely new? The only way to be certain is to wait and see, and to do this analysis again in about 10 years.
I started this project with more ambition than anything else. I honestly did not know how to do more than half of what I did in this project when I started it. It is not fully complete, since I also started with the question “How have roles in the NBA changed over time?”, but the process to answer that question is very similar, and I can answer it in an update to this notebook. I don’t think I learned anything new about the NBA; I just reaffirmed my own theories and others that I have heard or read. But I definitely learned an amazing amount about Python, statistics, and data science by doing this project alone. I had a vision for this project, and although I haven’t been able to realize it in this iteration, it will definitely come to fruition at a later time.
Sources: a lot of Stack Overflow, basketball-reference.com, and the Python documentation for scikit-learn, BeautifulSoup, pandas, NumPy, etc. I cited the more important sources in code comments where I used them. Thanks to James for letting me change my project so that I could do this.