How Has Position in the NBA Changed Over Time?

Introduction

Nowadays, even a casual NBA viewer can see that teams are shooting more three-point shots than ever before. This is because guards, wings, and bigs now all have free rein to take them. In the past, coaches regarded players who shot three-pointers as specialists. Now, most players are expected to have this skill. But how does this three-point phenomenon relate to position? Today, teams carry more “Stretch 4s” on their rosters, meaning forwards who can stretch the floor (shoot the three). In the past, these players shot the mid-range jumper, a shot that is falling out of favor on many, but not all, teams. The goal of this project is to see how the three-point shot has changed our perspective of position. To accomplish this, I first scrape the data from basketball-reference.com, use cluster analysis to see how many positions the NBA really has, label those clusters using prior knowledge, and then see how the positions have changed over time. I will go through and explain these steps and the decisions I made.

Libraries Used

I used many of the libraries below, but not all of them. Several of the imports are left over from earlier experiments, and I will trim the list in a later update.

In [14]:
import re
import requests
import requests_cache
import pandas as pd
from bs4 import BeautifulSoup
requests_cache.install_cache('bball_ref_cache')
import time
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import manifold
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial.distance import cdist, pdist, squareform
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
import warnings
warnings.filterwarnings('ignore')

Scraping basketball-reference.com

Scraping this website was pretty straightforward with BeautifulSoup. basketball-reference.com is set up so that one request returns a full season of stats for all players, so each stat table needs only 17 requests to cover 17 seasons. The only problem I ran into was that a player appears multiple times in the data frame if he switched teams during the season. Thankfully, Basketball Reference also lists the player’s total stats for the season (team 'TOT'), so I kept that row and deleted the rows for the individual teams he played for. At the end of each pass through the for loop, I added a ‘Year’ column to the data frame, since I would be using that column heavily for subsetting.

In [9]:
def bball_scraper(stat_type, start_year, end_year):
    '''
    bball_scraper scrapes certain pages on basketball-reference.com. 
    inputs: stat_type: one of 5 accepted stat types, given as a string:
                        'per_game', 'totals', 'advanced', 'per_poss', or 'per_minute'
            start_year: an integer from 1947 to 2017
            end_year: an integer from 1947 to 2017, no earlier than start_year
            
    outputs: bball_df: a dataframe that contains all of the NBA stats for the years and stat_type selected
    
    '''
    bball_df = pd.DataFrame()
    url = 'http://www.basketball-reference.com/leagues/NBA_'
    for i in range(end_year-start_year + 1):
        #make correct url to request the website
        url2 = url + r'%s_' %start_year + r'%s.html' %stat_type
        #request and utilize BeautifulSoup
        year = BeautifulSoup(requests.get(url2).content, "lxml")
        #get columns for final dataframe
        columns = year.find('thead').text.split('\n')
        #find where the stats start
        stats = year.find('tbody')
        #get rid of extraneous stuff
        for figure in stats.find_all('tr', 'thead'):
            figure.decompose()
        #find where stats start
        data = stats.find_all('tr')
        #where the magic happens
        player = [[td.getText() for td in data[i].findAll('td')] for i in range(len(data))]
        #get rid of extraneous player rows, if they played for more than 1 team
        temp = pd.DataFrame(player)
        for index, row in temp.iterrows():
            if row[3] == 'TOT':
                pname = row[0] 
                temp = temp[(temp[0] != pname) | (temp[3] == 'TOT')]
        temp['Year'] = start_year
        #append and start all over again
        #DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        bball_df = pd.concat([bball_df, temp])
        start_year = start_year + 1 
    columns = columns[3:-2]
    columns.append('Year')
    #set column values
    bball_df.columns = columns
    return bball_df

bball_per_game = bball_scraper('per_game', 2001, 2017)
bball_advanced = bball_scraper('advanced', 2001, 2017)
bball_100poss = bball_scraper('per_poss', 2001, 2017)
bball_per_game['MPG'] = bball_per_game['MP']
bball_per_game['MP'] = bball_advanced['MP']
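
The de-duplication of traded players described above can also be written without a row-by-row loop. The following is only a minimal sketch on a made-up table, not the scraper's real output: the column names `Player` and `Tm`, the helper name `keep_season_totals`, and all values are assumptions for illustration.

```python
import pandas as pd

def keep_season_totals(season):
    """Keep one row per player: the 'TOT' row if he was traded, else his only row."""
    # Players with a combined 'TOT' row played for more than one team.
    traded = season.loc[season['Tm'] == 'TOT', 'Player']
    # Flag the per-team rows of those players; their 'TOT' row is not flagged.
    is_team_row_of_traded = season['Player'].isin(traded) & (season['Tm'] != 'TOT')
    return season.loc[~is_team_row_of_traded].reset_index(drop=True)

# Hypothetical data: Player B was traded mid-season, so he appears three times.
toy_season = pd.DataFrame({
    'Player': ['Player A', 'Player B', 'Player B', 'Player B', 'Player C'],
    'Tm':     ['BOS', 'TOT', 'LAL', 'MIA', 'CHI'],
    'G':      [80, 75, 40, 35, 60],
})
deduped = keep_season_totals(toy_season)
print(deduped)
```

Like the loop in `bball_scraper`, this matches players by name only, so two different players who share a name in the same season would be treated as one.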

Other Functions

The function silhouette outputs two plots that help choose an optimal number of clusters for KMeans cluster analysis. The function kmeans runs the KMeans analysis and returns the cluster of each element. The function distance_map is purely for visualization, since I did not use PCA before running KMeans. I found that clustering on the raw data gave me slightly more sensible clusters, so I made a 2D distance map to visualize them.

In [15]:
def silhouette(data): 
    '''
    this function outputs two plots that help show what the optimal K would be in a KMeans analysis
    inputs: data: any DataFrame
    
    outputs: a plot of silhouette score for a k range of 2 to 29
             a plot of Average Within-Cluster Sum of Squares for a k range of 2 to 29, more commonly known as an elbow plot
    
    '''
    #http://datascience.stackexchange.com/questions/6508/k-means-incoherent-behaviour-choosing-k-with-elbow-method-bic-variance-explain
    #range of K values to try, can be changed on a whim
    k = range(2,30)
    score = []
    elbow = []
    for n_clusters in k:
        #fit data to KMeans
        clusters = KMeans(n_clusters=n_clusters).fit(data)
        #get labels (clusters)
        cluster_labels = clusters.labels_
        #get cluster centers
        centroid = clusters.cluster_centers_
        #distance from each row to its nearest centroid
        euclid = np.min(cdist(data, centroid, 'euclidean'), axis = 1)
        #average those distances over all rows (note: the distances are not squared here)
        avgss = sum(euclid)/data.shape[0]
        elbow.append(avgss)
        #use function to get silhouette score
        silhouette_avg = silhouette_score(data, cluster_labels)
        score.append(silhouette_avg)
    #plot both plots
    plt.plot(k, score, 'b*-')
    plt.ylabel("Silhouette Score")
    plt.xlabel("Number of Clusters")
    plt.title("Silhouette Score for KMeans Cluster Analysis")
    plt.show()
    plt.plot(k, elbow, 'b*-')
    plt.ylabel("Average Within-Cluster Sum of Squares")
    plt.xlabel("Number of Clusters")
    plt.title("Elbow Curve for KMeans Cluster Analysis")
    plt.show()
    
def kmeans(data, n_clusters):
    '''
    this function does a KMeans analysis of the data given for the number of clusters given
    inputs: data: any dataframe
            n_clusters: the number of clusters you want
    
    outputs: cluster_labels: the cluster a certain index belongs to
    '''
    #do KMeans and return clusters
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(data)
    return cluster_labels

def distance_map(data, clusters):
    '''
    this function takes in a distance matrix and the clusters and outputs a plot
    inputs: data: a SQUARE distance matrix
            clusters: the cluster that each index belongs to, in a DataFrame or list
    
    outputs: a distance plot that color-codes clusters, for up to 9 clusters (more can be added if needed)
    '''
    #http://baoilleach.blogspot.com/2014/01/convert-distance-matrix-to-2d.html
    adist = np.array(data)
    amax = np.amax(adist)
    adist /= amax
    mds = manifold.MDS(n_components=2, dissimilarity="precomputed")
    results = mds.fit(adist)
    coords = pd.DataFrame(results.embedding_, index = data.index)
    coords['Cluster'] = clusters
    #http://stackoverflow.com/questions/26139423/plot-different-color-for-different-categorical-levels-using-matplotlib
    colors = {0:'grey', 1:'red', 2:'blue', 3:'green', 4:'black', 5:'yellow', 6:'purple', 7:'pink', 8:'white'}
    plt.scatter(coords[0], coords[1], marker = 'o', c=coords['Cluster'].apply(lambda x: colors[x]))
    plt.title('2D Representation of Distance Matrix and Color Representation of Clusters')
    plt.show()
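
To make the silhouette plot easier to read, it may help to see what a silhouette coefficient is for a single point: (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below computes this by hand with NumPy on four made-up points; `silhouette_values` is a hypothetical helper, not part of the notebook. scikit-learn's `silhouette_score` reports the mean of these per-point values.

```python
import numpy as np

def silhouette_values(points, labels):
    """Return the silhouette coefficient of every point (Euclidean distance)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix via broadcasting.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    values = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a point alone in its cluster scores 0
        # a: mean distance to the other members of the point's own cluster
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        values[i] = (b - a) / max(a, b)
    return values

# Two tight, well-separated groups on a line score close to 1.
toy_points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette_values(toy_points, [0, 0, 1, 1])
# A deliberately bad split of the same points scores below 0.
bad = silhouette_values(toy_points, [0, 1, 0, 1])
print(good.mean(), bad.mean())
```

Values near 1 mean a point sits comfortably inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may belong elsewhere.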

The DataFrame

I had to do a few things to get the DataFrame that I wanted to analyze. I merged the per-100-possessions DataFrame with the advanced one in order to get more stats to cluster on. I converted the columns to numeric types so that the values could be used in calculations, and I kept only players with more than 250 minutes played. Then, I removed many columns that were either not playing statistics or another version of a statistic that I already had, so that one aspect of the game would not be counted twice and carry too much weight in the clustering.

In [25]:
#incorporate advanced stats and transform dataframe into what I want
bball_100poss2 = bball_100poss.merge(bball_advanced)
bball_100poss2 = bball_100poss2.drop(bball_100poss2[[-5,-10]], axis = 1)
bball_100poss2 = bball_100poss2.apply(lambda x: pd.to_numeric(x, errors='ignore'))
bball_100poss2 = bball_100poss2.loc[bball_100poss2['MP'] > 250]
bball_100poss2.fillna(0, inplace=True)
bball_100poss2 = bball_100poss2.set_index(bball_100poss2['Player'])
bball_100poss2 = bball_100poss2.drop(['Player','Pos','Age','Tm','G','GS','MP', 'TS%', '3PAr', 'FTr', 'PER',
                                      'ORtg','DRtg', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM', 'BPM', 'VORP', '',
                                      'FG', '3P', '2P', 'FT', 'TRB', 'TRB%', 'FGA', 'FG%', 'PTS'],axis = 1)
bball_100poss2016 = bball_100poss2.loc[bball_100poss2['Year'] == 2016]
bball_100poss2016.head()
Out[25]:
3PA 3P% 2PA 2P% FTA FT% ORB DRB AST STL ... TOV PF Year ORB% DRB% AST% STL% BLK% TOV% USG%
Player
Quincy Acy 2.7 0.388 9.0 0.606 3.7 0.735 3.6 6.7 1.5 1.6 ... 1.5 5.6 2016 8.1 15.1 4.4 1.6 2.2 10.0 13.1
Steven Adams 0.0 0.000 10.5 0.613 4.8 0.582 5.4 7.7 1.5 1.0 ... 2.1 5.5 2016 12.5 16.1 4.3 1.0 3.3 14.1 12.6
Arron Afflalo 5.2 0.382 12.2 0.469 2.8 0.840 0.5 5.3 3.1 0.5 ... 1.8 3.1 2016 1.1 11.0 9.9 0.5 0.3 8.7 17.9
Alexis Ajinca 0.1 0.000 18.1 0.478 3.6 0.839 4.3 11.2 1.8 1.1 ... 3.1 7.7 2016 9.3 25.9 5.8 1.1 3.4 13.6 20.4
Cole Aldrich 0.0 0.000 14.1 0.596 5.3 0.714 5.4 12.7 3.1 2.9 ... 4.0 8.7 2016 11.9 27.1 10.0 2.9 6.7 19.6 18.4

5 rows × 21 columns
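
The preparation steps above (merge, convert to numeric, filter on minutes, fill gaps, index by player) can be sketched on two tiny made-up tables. Everything here is hypothetical: the player names, the values, and the choice of `errors='coerce'`, which differs from the `errors='ignore'` used in the notebook and turns unparseable cells into NaN instead of leaving the column as text.

```python
import pandas as pd

# Hypothetical miniature versions of the two scraped tables (values are made up).
per_poss = pd.DataFrame({
    'Player': ['Player A', 'Player B', 'Player C'],
    'Year':   [2016, 2016, 2016],
    'MP':     ['2100', '120', '900'],   # scraped cells arrive as strings
    '3PA':    ['7.5', '1.0', ''],       # an empty cell
})
advanced = pd.DataFrame({
    'Player': ['Player A', 'Player B', 'Player C'],
    'Year':   [2016, 2016, 2016],
    'MP':     ['2100', '120', '900'],
    'USG%':   ['25.1', '14.0', '18.3'],
})

# Merge on explicitly named keys so that only the intended columns are matched.
merged = per_poss.merge(advanced, on=['Player', 'Year', 'MP'])

# Convert the stat columns to numbers; unparseable cells become NaN.
stat_cols = ['MP', '3PA', 'USG%']
merged[stat_cols] = merged[stat_cols].apply(pd.to_numeric, errors='coerce')

# Keep players above the minutes threshold, fill the gaps, index by name.
features = (merged.loc[merged['MP'] > 250]
                  .fillna(0)
                  .set_index('Player')
                  .drop(columns=['MP']))
print(features)
```

The notebook calls `merge` with no arguments, which joins on every column name the two tables share; naming the keys is a slightly more defensive variant of the same step.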

Euclidean Distance

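This was a quick sanity check that distances between stat lines group similar players together. In the example below, the players closest to Festus Ezeli are all fellow big men, such as Clint Capela and Aron Baynes, which suggested that I was on the right track.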

In [26]:
#use euclidean distance to see if it really works
df1 = bball_100poss2016
dist = pdist(df1, 'euclidean')
df_dist = pd.DataFrame(squareform(dist), index = df1.index, columns = df1.index)
df_dist['Festus Ezeli'].sort_values().head()
Out[26]:
Player
Festus Ezeli    0.000000
Clint Capela    3.579668
Aron Baynes     3.585576
Willie Reed     5.187558
Henry Sims      6.042845
Name: Festus Ezeli, dtype: float64
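
The same kind of nearest-neighbor lookup can be tried on a small invented table. This is a sketch under stated assumptions: the four player labels and their numbers are made up, the distances are computed with plain NumPy broadcasting rather than `pdist`, and `most_similar` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Hypothetical per-100-possession profiles (made-up numbers, three stats only).
toy_stats = pd.DataFrame(
    {'3PA': [0.1, 0.0, 9.5, 8.8],
     'ORB': [5.2, 5.0, 0.9, 1.1],
     'AST': [1.5, 2.0, 9.0, 4.0]},
    index=['Center A', 'Center B', 'Guard A', 'Wing A'],
)

# Pairwise Euclidean distances through NumPy broadcasting.
values = toy_stats.to_numpy()
diffs = values[:, None, :] - values[None, :, :]
toy_dist = pd.DataFrame(np.sqrt((diffs ** 2).sum(axis=2)),
                        index=toy_stats.index, columns=toy_stats.index)

def most_similar(dist, player, n=2):
    """Return the n players closest to `player`, excluding the player himself."""
    return dist[player].drop(player).sort_values().head(n)

print(most_similar(toy_dist, 'Center A'))
```

Dropping the player's own entry avoids the 0.000000 self-distance that appears as the first row of the output above.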

Choosing a K

In order to choose a K for the KMeans analysis, I used two indicators along with my own knowledge of basketball. Most of the time, the silhouette score suggests that the optimal K is around 8, 9, or 10. The elbow in this case is gradual rather than sudden, so it does not provide much information. I tested many different possibilities and ended up choosing K = 9 as the best way to cluster this dataset.

In [27]:
#use silhouette analysis and elbow curve to determine optimal KMeans
silhouette(bball_100poss2016)
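
To see how the two indicators behave when the answer is known in advance, the sketch below runs the same kind of scan on synthetic data with three obvious groups. The data, the range of K, and the fixed `random_state` are all assumptions for illustration; the `silhouette` function above does not fix a seed inside its loop, which is one reason its suggested K can vary from run to run.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated 2-D blobs, so the "right" K is known.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
blobs = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = {}
inertias = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(blobs)
    # Mean silhouette coefficient over all points for this K.
    scores[k] = silhouette_score(blobs, model.labels_)
    # inertia_ is the within-cluster sum of squared distances (the usual elbow quantity).
    inertias[k] = model.inertia_

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On clean data like this the silhouette score peaks sharply at the true K. Real player data overlaps far more, which is consistent with the flatter curves and the judgment call described above.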

Visualization of the Clusters

Below are visualizations of the clusters. The DataFrame gives the mean of each statistic for every cluster. The pie chart shows what percentage of the dataset each cluster makes up. The 2D representation of the distance matrix shows how the clusters differ, but also how they overlap, since many players are skilled and can do many things on the court.

In [28]:
bball_100poss2016['Cluster'] = kmeans(bball_100poss2016, 9)
#get means of every cluster for each statistic
bball_100poss2016.groupby('Cluster').mean()
Out[28]:
3PA 3P% 2PA 2P% FTA FT% ORB DRB AST STL ... TOV PF Year ORB% DRB% AST% STL% BLK% TOV% USG%
Cluster
0 6.370370 0.346852 15.062963 0.466926 6.340741 0.777407 1.055556 5.581481 11.044444 2.240741 ... 4.648148 3.611111 2016 2.351852 12.440741 36.696296 2.240741 0.748148 16.555556 25.514815
1 7.620779 0.362312 8.366234 0.466169 3.027273 0.777013 1.027273 4.783117 2.703896 1.501299 ... 1.861039 3.874026 2016 2.279221 10.529870 8.212987 1.501299 0.868831 9.745455 16.933766
2 1.459459 0.222216 20.105405 0.500081 6.256757 0.744459 3.762162 9.867568 3.751351 1.502703 ... 3.189189 5.081081 2016 8.278378 21.808108 12.821622 1.502703 3.100000 11.654054 24.335135
3 1.460000 0.202822 10.611111 0.531311 4.031111 0.669178 3.904444 7.228889 2.164444 1.497778 ... 2.171111 5.475556 2016 8.575556 15.935556 6.502222 1.497778 2.637778 13.775556 14.117778
4 6.362264 0.332566 11.696226 0.481189 4.430189 0.758151 2.145283 8.135849 3.013208 1.375472 ... 2.447170 4.426415 2016 4.715094 18.064151 9.556604 1.375472 1.690566 11.009434 19.871698
5 5.920930 0.339535 7.555814 0.461767 2.679070 0.749884 1.237209 5.183721 4.923256 1.711628 ... 2.709302 4.486047 2016 2.732558 11.404651 14.337209 1.711628 0.860465 15.883721 15.360465
6 6.339394 0.337273 15.854545 0.467485 6.672727 0.821091 1.057576 5.169697 5.766667 1.593939 ... 3.527273 3.412121 2016 2.324242 11.393939 19.554545 1.593939 0.803030 12.354545 25.263636
7 5.302703 0.323757 10.667568 0.463405 3.500000 0.798351 0.867568 4.786486 8.489189 1.840541 ... 3.616216 3.943243 2016 1.937838 10.489189 25.583784 1.840541 0.621622 17.318919 18.659459
8 0.354167 0.105396 13.562500 0.538062 6.008333 0.646333 5.050000 11.439583 2.522917 1.375000 ... 2.714583 5.825000 2016 11.116667 25.195833 7.725000 1.375000 3.685417 14.287500 17.004167

9 rows × 21 columns

In [29]:
#visualize the clusters using pie charts and scatter plots
bball_100poss2016['Cluster'].value_counts().plot(kind = 'pie', autopct='%.2f')
plt.title('Pie Graph of Different Clusters')
plt.show()
distance_map(df_dist, bball_100poss2016['Cluster'].transpose())

Logistic Regression to Predict

In order to get counts by position for all of the years, I used logistic regression, trained on the 2016 clusters, to predict which cluster every other player-season belongs to. Then, I manually named each cluster based on its statistics and my prior knowledge of basketball. To visualize the result, I also constructed a pie chart.

In [20]:
#http://stackoverflow.com/questions/36760000/python-how-to-use-multinomial-logistic-regression-using-sklearn
#using logistic regression to predict clusters for the other 16 seasons
#(.ix was removed in pandas 1.0; .iloc does the same positional selection)
X = bball_100poss2016.iloc[:, 0:21]
y = bball_100poss2016.iloc[:, -1]
lr = LogisticRegression()
lr.fit(X,y)
#predict THEM ALL
predict = lr.predict(bball_100poss2.iloc[:, 0:21])
bball_100posspredict = bball_100poss2
bball_100posspredict['Cluster'] = predict
In [37]:
#set positions manually; the names must be rechecked whenever kmeans() is rerun with a
#different seed or different data, since the cluster numbers can change
position_names = {0: 'Elite Scoring Ballhandlers',
                  1: 'Catch-and-Shoot Wings',
                  2: 'Scoring Bigs',
                  3: 'Low-Usage Forwards',
                  4: 'Stretch Forwards',
                  5: 'Low-Usage Wings',
                  6: 'Scoring Guards',
                  7: 'Primary Ballhandlers',
                  8: 'Defensive Bigs'}
#map each cluster number to its name in one step (avoids chained assignment)
bball_100posspredict['Position'] = bball_100posspredict['Cluster'].map(position_names)
bball_100posspredict['Position'].value_counts().plot(kind = 'pie', autopct='%.2f')
plt.title('Pie Chart of Entire Dataset')
plt.show()
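
The overall pattern here (cluster one reference season, train a classifier on those cluster labels, then label every other season) can be sketched on synthetic data. Nothing below comes from the real dataset: the three "archetypes", the sample sizes, and the helper `sample_season` are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])

def sample_season(n_per_group):
    """Draw synthetic 'players' around three fixed archetypes (made-up data)."""
    return np.vstack([c + rng.normal(scale=0.7, size=(n_per_group, 2)) for c in centers])

reference_season = sample_season(60)   # plays the role of the 2016 data
other_seasons = sample_season(200)     # plays the role of every other year

# Step 1: cluster the reference season only.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(reference_season)

# Step 2: train a classifier to reproduce those cluster labels.
clf = LogisticRegression(max_iter=1000).fit(reference_season, km.labels_)

# Step 3: label the remaining seasons with the classifier.
predicted = clf.predict(other_seasons)

# Alternative: assign each player to the nearest fitted centroid directly.
nearest_centroid = km.predict(other_seasons)
agreement = (predicted == nearest_centroid).mean()
print(agreement)
```

`KMeans.predict` is a simpler way to carry clusters over to new rows, since it reuses the fitted centroids. The two approaches agree almost everywhere when clusters are well separated and can differ near cluster boundaries, so comparing them is one possible check on how sensitive the yearly counts are to the choice of classifier.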
In [39]:
bball_100poss2016.loc[bball_100poss2016['Cluster'] == 0]
Out[39]:
3PA 3P% 2PA 2P% FTA FT% ORB DRB AST STL ... PF Year ORB% DRB% AST% STL% BLK% TOV% USG% Cluster
Player
J.J. Barea 7.9 0.385 13.8 0.481 2.9 0.771 0.8 3.8 9.2 0.8 ... 3.4 2016 1.7 8.3 31.5 0.8 0.1 13.0 23.8 0
Eric Bledsoe 5.9 0.372 16.7 0.482 7.9 0.802 0.8 4.9 8.7 2.9 ... 3.5 2016 1.8 11.2 31.8 2.9 1.5 16.2 27.2 0
Mike Conley 6.3 0.363 14.1 0.449 6.7 0.834 0.8 4.0 10.0 2.0 ... 2.9 2016 1.6 9.3 32.7 2.0 0.8 9.5 22.4 0
Stephen Curry 15.9 0.454 12.7 0.566 7.2 0.908 1.2 6.5 9.4 3.0 ... 2.9 2016 2.9 13.6 33.7 3.0 0.4 12.9 32.6 0
Tyreke Evans 5.5 0.388 14.8 0.450 6.0 0.796 1.3 7.1 10.6 2.1 ... 4.3 2016 2.8 16.4 34.6 2.1 0.9 17.1 24.7 0
Tim Frazier 3.2 0.333 12.2 0.442 5.0 0.716 1.7 5.6 10.9 2.2 ... 4.9 2016 3.8 12.8 33.3 2.2 0.2 19.9 19.4 0
James Harden 10.3 0.359 15.1 0.494 13.2 0.860 1.0 6.9 9.6 2.2 ... 3.6 2016 2.2 15.6 35.4 2.2 1.4 15.9 32.5 0
Jrue Holiday 7.0 0.336 18.3 0.478 5.9 0.843 0.7 4.6 10.6 2.4 ... 4.0 2016 1.6 10.5 37.3 2.4 1.0 14.0 28.9 0
Jarrett Jack 5.0 0.304 12.6 0.426 5.5 0.893 0.4 6.2 11.6 1.7 ... 3.8 2016 1.0 14.1 35.0 1.7 0.6 18.8 21.7 0
Reggie Jackson 7.0 0.353 18.9 0.464 7.0 0.864 1.2 4.1 10.2 1.2 ... 3.9 2016 2.5 9.1 36.3 1.2 0.3 13.8 29.1 0
LeBron James 5.4 0.309 21.5 0.573 9.3 0.731 2.1 8.6 9.8 2.0 ... 2.7 2016 4.7 18.8 36.0 2.0 1.5 13.2 31.4 0
Damian Lillard 11.4 0.375 16.1 0.450 8.7 0.892 0.8 4.8 9.6 1.2 ... 3.1 2016 1.8 10.4 33.6 1.2 0.8 12.6 31.3 0
Kyle Lowry 9.9 0.388 11.8 0.461 8.9 0.811 1.0 5.6 9.0 2.9 ... 3.8 2016 2.2 12.3 29.9 2.9 1.0 13.7 26.1 0
Shelvin Mack 6.2 0.312 14.3 0.493 3.0 0.738 0.6 5.7 9.2 1.7 ... 3.5 2016 1.4 12.5 30.6 1.7 0.2 17.7 23.3 0
T.J. McConnell 2.7 0.348 11.4 0.499 1.3 0.634 1.3 6.3 11.2 2.9 ... 3.5 2016 2.8 14.5 37.2 2.9 0.5 22.5 17.0 0
Emmanuel Mudiay 5.6 0.319 16.3 0.379 5.1 0.670 0.8 4.8 9.0 1.6 ... 3.4 2016 1.7 10.9 29.0 1.6 1.4 17.9 25.7 0
Chris Paul 6.8 0.371 16.3 0.501 6.8 0.896 0.8 5.6 15.3 3.1 ... 3.8 2016 1.8 12.0 52.7 3.1 0.4 13.4 27.1 0
Elfrid Payton 2.1 0.326 14.9 0.451 4.4 0.589 1.8 4.3 10.9 2.1 ... 3.7 2016 3.9 9.6 32.8 2.1 0.8 17.9 20.4 0
Phil Pressey 3.1 0.222 10.4 0.433 4.3 0.520 0.5 4.8 13.0 3.1 ... 4.8 2016 1.1 11.1 40.7 3.1 1.1 27.0 18.7 0
Rajon Rondo 3.2 0.365 11.6 0.479 2.8 0.580 1.5 6.8 15.9 2.7 ... 3.3 2016 3.3 15.2 48.0 2.7 0.3 24.7 18.8 0
Ricky Rubio 4.1 0.326 8.6 0.396 6.8 0.847 0.9 6.2 14.3 3.5 ... 4.4 2016 2.0 14.1 41.4 3.5 0.4 21.0 17.7 0
Dennis Schroder 7.3 0.322 16.5 0.465 5.5 0.791 0.8 5.5 10.6 2.1 ... 4.2 2016 1.7 11.8 36.1 2.1 0.4 17.5 28.8 0
Ish Smith 3.7 0.329 17.5 0.428 3.9 0.693 0.9 5.8 11.0 1.9 ... 2.9 2016 2.0 13.3 38.3 1.9 0.8 14.4 23.9 0
Jeff Teague 6.0 0.400 15.6 0.454 6.8 0.837 0.7 4.0 10.3 2.1 ... 3.7 2016 1.6 8.6 34.4 2.1 0.8 16.2 26.6 0
Isaiah Thomas 8.6 0.359 16.9 0.462 10.0 0.871 0.8 3.6 9.4 1.7 ... 3.1 2016 1.8 8.0 32.7 1.7 0.3 11.9 29.6 0
John Wall 5.7 0.351 17.9 0.448 6.0 0.791 0.7 5.9 13.8 2.5 ... 2.8 2016 1.7 13.8 46.2 2.5 1.7 17.5 28.6 0
Russell Westbrook 6.2 0.296 19.9 0.503 10.3 0.812 2.6 8.7 15.1 2.9 ... 3.6 2016 6.1 18.1 49.6 2.9 0.6 16.8 31.6 0

27 rows × 22 columns

The Final Table

Now we can finally answer the question: how has position in the NBA changed over time? All we had to do was scrape the data, apply a clustering model to one year of data, predict the rest of the data using logistic regression, and manually go through each cluster to assign a label that made sense.

In [40]:
#get position counts for every year
positioncounts = []
#count players per (Year, Position) once, then pull out each year's counts
counts_by_year = bball_100posspredict.groupby(['Year', 'Position']).size()
for i in bball_100posspredict['Year'].unique():
    positioncounts.append(counts_by_year[i])

positioncounts = pd.DataFrame(positioncounts, index = bball_100posspredict['Year'].unique()).transpose()
#get average to compare
positioncounts['Average'] = positioncounts.mean(axis = 1)
positioncounts
Out[40]:
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 Average
Position
Catch-and-Shoot Wings 32 34 24 28 44 33 46 51 61 68 70 66 75 77 89 80 96 57.294118
Defensive Bigs 36 39 40 34 37 37 44 42 43 46 42 38 45 46 50 48 38 41.470588
Elite Scoring Ballhandlers 22 20 19 19 24 18 17 22 15 18 22 31 26 27 28 26 29 22.529412
Low-Usage Forwards 96 100 99 98 106 101 81 77 68 73 74 73 69 56 54 41 38 76.705882
Low-Usage Wings 27 18 27 29 31 27 27 23 19 15 23 23 29 26 31 37 28 25.882353
Primary Ballhandlers 44 50 36 48 42 37 43 43 48 42 38 36 44 36 44 40 34 41.470588
Scoring Bigs 48 42 38 44 37 39 27 37 31 29 40 40 40 39 38 38 21 36.941176
Scoring Guards 35 33 47 48 36 47 47 38 48 45 41 50 34 41 34 34 26 40.235294
Stretch Forwards 16 18 19 16 21 31 38 41 38 38 35 31 37 42 40 56 61 34.000000
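
One rough way to put numbers on the trends in this table is to fit a straight line to each position's yearly counts. The sketch below copies four rows from the table above; the least-squares slope and the 2017-to-2001 ratio are summaries added here as an illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

years = list(range(2001, 2018))
# Counts copied from four rows of the table above.
counts = pd.DataFrame({
    'Catch-and-Shoot Wings': [32, 34, 24, 28, 44, 33, 46, 51, 61, 68, 70, 66, 75, 77, 89, 80, 96],
    'Stretch Forwards':      [16, 18, 19, 16, 21, 31, 38, 41, 38, 38, 35, 31, 37, 42, 40, 56, 61],
    'Low-Usage Forwards':    [96, 100, 99, 98, 106, 101, 81, 77, 68, 73, 74, 73, 69, 56, 54, 41, 38],
    'Scoring Bigs':          [48, 42, 38, 44, 37, 39, 27, 37, 31, 29, 40, 40, 40, 39, 38, 38, 21],
}, index=years)

# Seasons elapsed since 2001, used as the x-axis of the fit.
elapsed = np.array(years) - years[0]

# Least-squares slope per position: players gained or lost per season on average.
slopes = counts.apply(lambda col: np.polyfit(elapsed, col.to_numpy(), 1)[0])

# Ratio of the last season's count to the first season's count.
ratio_2017_to_2001 = counts.loc[2017] / counts.loc[2001]

print(slopes.round(2))
print(ratio_2017_to_2001.round(2))
```

These are raw counts, so they also depend on how many players cleared the minutes cutoff in each season; dividing each column of the table by its yearly total would give shares instead.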

Conclusion and Results

The two glaring results in the table above are that the numbers of Catch-and-Shoot Wings and Stretch Forwards are at their highest in the 17 seasons covered. In addition, the numbers of Scoring Bigs and Low-Usage Forwards are at their lowest. So is the number of Scoring Guards, but I believe that is an aberration, since it rests on just one year of data. The counts for the other positions stay within a normal range, although some are trending upward while others are trending downward. From these results, we can conclude that the three-point shot has indeed changed how many players of each position we see in the game today. There are more players geared towards shooting beyond the arc now than ever. Teams are picking up on what ancient civilizations learned many millennia ago: 3 > 2, especially when you can make shots beyond the arc at a similar efficiency to shots inside the arc. However, not everyone benefits from this increase in shots beyond the arc. The Scoring Bigs and Low-Usage Forwards, once two of the most prominent positions in basketball, have shrunk to less than half of their 2001 numbers, having been replaced by the Catch-and-Shoot Wings and Stretch Forwards of today. As the era has changed, fans are left to wonder what will become the new staple of basketball. Will it be the three-point shot, will the low-post game make its return, or will it be something entirely new? The only way to be certain is to wait and see, and to do this analysis again in about 10 years.

Discussion and Other Comments

I started this project with more ambition than anything else. I honestly did not know how to do more than half of what I did in this project when I started it. It is not fully complete, since I also started with the question “How have roles in the NBA changed over time?”, but the process for answering that question is very similar and can easily be added in an update to this notebook. I don’t think I learned anything new about the NBA; I just reaffirmed my own theories and others that I have heard or read. But I definitely learned an amazing amount about Python, statistics, and data science by doing this project alone. I had a vision for this project, and although I haven’t been able to realize it in this iteration, I intend to bring it to fruition at a later time.

Sources

A lot of Stack Overflow, basketball-reference.com, and the Python documentation for scikit-learn, BeautifulSoup, pandas, NumPy, etc. I have cited the more important sources in code comments where I used them. Thanks to James for letting me change my project so that I could do this.