The three pointer is one of the most dangerous ways to score in basketball, offering the most potential points per shot attempt. However, it was not always this prominent. Basketball during the Michael Jordan era (and earlier eras) relied mainly on team-based play, with three pointers far less common than they are today. The shot has steadily grown in popularity across the NBA in recent years. Over the past five or so years, the Golden State Warriors revolutionized the game by shooting a high volume of three pointers, and other teams have copied the strategy because the Warriors were so successful with it. In a sense, the Warriors turned the three pointer into both a weapon and a style of play. Recently, however, there have been discussions about whether this is actually a successful strategy, or whether other teams and players are becoming worse by shooting too many three pointers.
There are two main approaches that we took to analyzing the effectiveness of the 3 point line for individual players and teams:
For players, we ask: does shooting a higher 3 point percentage translate to greater player success (measured by salary)? If it does, should future basketball players focus their training solely on 3 point shooting, since it would mean they get paid more? Or does a player need to be more well-rounded to earn a higher salary? In addition, we wanted to see whether we could predict a player's salary and market value from their 3 point shooting statistics. These questions have implications not only for current and future basketball players, but also for NBA owners and general managers. A model of this kind is useful for assessing the market value of an NBA player. If owners and general managers can predict the market value of players, they can approach contract negotiations better, and the same goes for the players themselves.
For teams, we analyze the following: do 3 point attempts and 3 pointers made affect whether a team wins or loses? This question matters a great deal to NBA front offices, since it can influence the decisions they make and the players they acquire. It is also relevant to sports betting and setting odds on games. If a win or loss can be accurately predicted from 3 point shooting, it may make sense to bet more on the higher 3 point shooting teams in the NBA.
import warnings
warnings.filterwarnings("ignore")
from bs4 import BeautifulSoup
import requests
import pandas as pd
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import ttest_ind
from sklearn import metrics
from scipy import stats
url1 = 'https://www.basketball-reference.com/leagues/NBA_2020_per_game.html'
url2 = "https://hoopshype.com/salaries/players/2019-2020/"
headers = {"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0"}
To proceed with the player question, we must first gather data about player statistics and salaries. Here we scrape two separate sources: Basketball Reference (our primary data source for basketball statistics) and HoopsHype (for the salaries).
# Import the data set of all players stats from 2020
r1 = requests.get(url1, headers = headers)
root1 = BeautifulSoup(r1.content)
lnks1 = root1.find('table')
pretty1 = lnks1.prettify()
table1 = pd.read_html(pretty1)
stats = table1[0]
stats.head()
stats.drop(stats[stats.Rk == 'Rk'].index, inplace=True)
# This data set repeats the headings in the data set, so those rows are dropped
counter = Counter(stats['Player'])
for player in counter:
    if counter[player] > 1:
        stats.drop(stats[(stats['Player'] == player) & (stats['Tm'] != 'TOT')].index, inplace=True)
# Players who appear for multiple teams have all stats except their total stats dropped
stats.head()
# Import the salary information for all players in 2020
r2 = requests.get(url2, headers = headers)
root2 = BeautifulSoup(r2.content)
table2 = root2.find("table")
pretty2 = table2.prettify()
pd_table2 = pd.read_html(pretty2)
pand2 = pd_table2[0]
pand2 = pand2.drop(columns=["Unnamed: 0"])
nba_salaries = pand2.drop(columns=["2019/20(*)"])
nba_salaries.columns = ['Player', 'Salary']
nba_salaries.head()
final_dataset = pd.merge(stats, nba_salaries, how = 'inner', on = 'Player')
final_dataset = final_dataset[final_dataset['3P%'].notna()]
# Salary and stats data sets are merged and players who did not play enough for their percentage to exist are dropped.
final_dataset['3P'] = pd.to_numeric(final_dataset['3P'])
final_dataset['3PA'] = pd.to_numeric(final_dataset['3PA'])
final_dataset['3P%'] = pd.to_numeric(final_dataset['3P%'])
# Salary column is cleaned (the '$' and ',' characters are removed) so it can be treated as an int
final_dataset['Salary'] = final_dataset['Salary'].str.replace(r'[$,]', '', regex=True).astype(int)
reg = linear_model.LinearRegression()
x = []
y = []
final_dataset.head()
After gathering the two datasets, we merged them into one larger dataset containing player statistics and salaries from the 2020 season. A couple of challenges were removing duplicate players (a player traded mid-season appears once per team) and converting the salary and other variables to numeric types for easier analysis. Above is a view of the final dataframe that we will use for the player analysis.
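The merge and cleaning steps described above can be sketched on a tiny table. This is only an illustration: the player names, statistics, and salaries below are made up, and the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for the scraped tables (not real data).
toy_stats = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "3PA": ["5.1", "0.0", "2.3"],
    "3P%": [".380", None, ".345"],
})
toy_salaries = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player D"],
    "Salary": ["$12,500,000", "$1,620,564", "$898,310"],
})

# An inner join keeps only players present in both tables.
merged = pd.merge(toy_stats, toy_salaries, how="inner", on="Player")

# Drop players whose three point percentage is missing (no attempts recorded).
merged = merged[merged["3P%"].notna()].copy()

# Convert the text columns to numbers.
merged["3PA"] = pd.to_numeric(merged["3PA"])
merged["3P%"] = pd.to_numeric(merged["3P%"])
merged["Salary"] = merged["Salary"].str.replace(r"[$,]", "", regex=True).astype(int)

print(merged)
```

Only "Player A" survives here: "Player C" and "Player D" each appear in just one table, and "Player B" has no percentage.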
plt.plot(final_dataset['3PA'], final_dataset['3P%'], 'o')
plt.xlabel('Three Point Attempts per Game')
plt.ylabel('Three Point Percentage')
plt.title('Three point Percentage vs Attempts per Game')
plt.ylim([0, 0.7])
# Allows the graph to be seen better, cuts off a couple outliers of people who show 100% on almost zero attempts
plt.show()
This graph shows that players who take more attempts tend to make a higher percentage of their shots. This makes sense, as players who don't shoot as well will likely try to score in other ways rather than shooting three pointers. Additionally, the graph shows that once attempts rise above roughly 6 per game, the percentage doesn't really increase, and may actually decrease, as attempts go up. We think this makes sense because the players taking the most three pointers per game are probably taking some of those shots while well defended, which lowers their chance of making them compared to shooting only when wide open. For example, James Harden, the player who took over 12 three point attempts per game, made only 35.5% of them. If he decided not to take a few of his lowest percentage shots each game, his attempts would decrease noticeably, while his makes would likely decrease at a lower rate, since he would be eliminating the shots least likely to go in. Thus, he could likely raise his percentage by cutting those attempts, which means that by shooting more, he is likely lowering his three point percentage.
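The shot selection argument can be checked with a little arithmetic. The sketch below uses an entirely hypothetical shot profile (the attempt counts and make probabilities are invented for illustration and are not any real player's numbers) to show how removing the lowest percentage attempts raises the overall percentage.

```python
# Hypothetical shot profile for one game: (attempts, probability of making each).
shot_groups = [
    (6, 0.42),  # open looks
    (4, 0.35),  # lightly contested
    (2, 0.20),  # heavily contested
]


def expected_percentage(groups):
    """Return expected makes divided by attempts for (attempts, make_prob) pairs."""
    attempts = sum(n for n, _ in groups)
    makes = sum(n * p for n, p in groups)
    return makes / attempts


all_shots = expected_percentage(shot_groups)
without_worst = expected_percentage(shot_groups[:-1])

print(f"All 12 attempts:          {all_shots:.3f}")
print(f"Without contested shots:  {without_worst:.3f}")
```

With these made-up numbers, dropping the two heavily contested attempts removes only 0.4 expected makes, so the expected percentage rises from 36.0% to 39.2%.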
plt.plot(final_dataset['3PA'], final_dataset['Salary'], 'o')
plt.xlabel('Three Point Attempts per Game')
plt.ylabel('Salary')
plt.title('Salary vs Three Point Attempts per Game')
plt.show()
plt.plot(final_dataset['3P%'], final_dataset['Salary'], 'o')
plt.xlabel('Three Point Percentage')
plt.ylabel('Salary')
plt.title('Salary vs Three Point Percentage')
plt.xlim([0, 0.7])
plt.show()
The first graph seems to show a positive correlation between three point attempts and salary. Intuitively, this makes sense, as players who make more money likely play and shoot more, leading to more attempts per game. However, the second graph shows less of a relationship between three point percentage and salary, probably because a player who plays a lot of minutes per game and shoots 35% on many attempts is much more valuable than a player who plays very little and shoots 35% on very few attempts.
i = 0
while i < len(final_dataset):
    salary = final_dataset.iat[i,30]
    three_attempt = final_dataset.iat[i,12]
    three_made = final_dataset.iat[i,11]
    y.append(salary)
    x.append([three_attempt, three_made])
    i += 1
reg.fit(x, y)
# Regression trying to predict salary based on three point attempts and makes
print("Coefficient are " + str(reg.coef_))
print("Y intercept is " + str(reg.intercept_))
print("R^2 of model is " + str(reg.score(x, y)))
i = 0
sum_res = 0
while i < len(final_dataset):
    act_salary = final_dataset.iat[i,30]
    three_attempt = final_dataset.iat[i,12]
    three_made = final_dataset.iat[i,11]
    prediction = reg.intercept_ + ((three_attempt * reg.coef_[0]) + (three_made * reg.coef_[1]))
    residual = prediction - act_salary
    square_res = residual ** 2
    sum_res = sum_res + square_res
    i += 1
mean_square_error = sum_res / len(final_dataset)
print("Mean Square Error " + str(mean_square_error))
We ran a multivariate linear regression using 3 pointers attempted and 3 pointers made to try to predict salary. The model yielded an R squared of 0.23 and a large Mean Square Error. Although the MSE is very large in absolute terms, we think this makes sense: NBA salaries are often in the millions of dollars and the MSE is measured in squared dollars, so any error in prediction is amplified.
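One way to make an error of this size easier to read is to take the square root of the MSE (the RMSE), which is back in dollars. The sketch below is only an illustration with made-up salaries and predictions; it does not use the notebook's data or reproduce its results.

```python
import numpy as np

# Hypothetical actual and predicted salaries in dollars (not real data).
actual = np.array([2_000_000, 8_000_000, 15_000_000, 30_000_000], dtype=float)
predicted = np.array([5_000_000, 9_000_000, 12_000_000, 20_000_000], dtype=float)

# Mean squared error is in squared dollars, so it looks enormous.
mse = np.mean((predicted - actual) ** 2)

# Root mean squared error is in dollars, so it can be compared to a salary.
rmse = np.sqrt(mse)

print(f"MSE:  {mse:.3e} (dollars squared)")
print(f"RMSE: {rmse:,.0f} (dollars)")
```

Here an MSE on the order of 10^13 corresponds to a typical miss of roughly 5.5 million dollars, which is easier to judge against salaries in the millions.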
reg = linear_model.LinearRegression()
x = []
y = []
i = 0
while i < len(final_dataset):
    salary = final_dataset.iat[i,30]
    three_attempt = final_dataset.iat[i,12]
    y.append(salary)
    x.append([three_attempt])
    i += 1
reg.fit(x, y)
# Regression trying to predict salary based on three point attempts
print("Coefficient are " + str(reg.coef_))
print("Y intercept is " + str(reg.intercept_))
print("R^2 of model is " + str(reg.score(x, y)))
i = 0
sum_res = 0
while i < len(final_dataset):
    act_salary = final_dataset.iat[i,30]
    three_attempt = final_dataset.iat[i,12]
    prediction = reg.predict(np.array([[three_attempt]]))
    residual = prediction - act_salary
    square_res = residual ** 2
    sum_res = sum_res + square_res
    i += 1
mean_square_error = sum_res / len(final_dataset)
print("Mean Square Error " + str(mean_square_error[0]))
After viewing the graphs of the separate variables against Salary, we decided to drop the 3 pointers made variable. Based on the graphs, 3 point attempts seemed to have the strongest linear correlation with Salary. This made intuitive sense: if you shoot more 3 pointers, you are more likely to be a successful scorer ("You miss 100% of the shots you don't take" - Wayne Gretzky - Michael Scott) and thus would probably be compensated with a higher salary.
As it turns out, the single-variable regression model didn't make a dramatic difference to the prediction: the Mean Square Error remained large and the R^2 of the model dropped.
One thing to note is that we did not include 3 point percentage in our model. Since 3 point percentage is just 3 pointers made divided by 3 pointers attempted, it is derived from the other two variables and would be redundant in the model. This is the approach we will use for the team win model as well.
Now let's move on to analyzing the effect of 3 point attempts and 3 pointers made on team wins:
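To make the redundancy concrete, the percentage column can be rebuilt from the other two. The rows below are hypothetical per-game averages, chosen only to illustrate the formula; with real per-game data the reported percentage may differ slightly because the published averages are rounded.

```python
import pandas as pd

# Hypothetical per-game makes and attempts (not real data).
toy = pd.DataFrame({
    "3P": [2.0, 3.0, 0.5],
    "3PA": [5.0, 8.0, 2.0],
})

# Three point percentage is fully determined by makes and attempts.
toy["3P%_derived"] = toy["3P"] / toy["3PA"]

print(toy)
```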
url3 = 'https://www.basketball-reference.com/teams/MIL/2020/gamelog/'
# Game stats for all games played by MIL
r3 = requests.get(url3, headers = headers)
root3 = BeautifulSoup(r3.content)
lnks3 = root3.find('table')
pretty3 = lnks3.prettify()
table3 = pd.read_html(pretty3)
log = table3[0]
log.columns = ['Rk', 'G', 'Date', '@', 'Opp', 'W/L', 'Tm', 'Op', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', 'FT', 'FTA', 'FT%', 'ORB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'null', 'FG1', 'FGA1', 'FG%1', '3P1', '3PA1', '3P%1', 'FT1', 'FTA1', 'FT%1', 'ORB1', 'TRB1', 'AST1', 'STL1', 'BLK1', 'TOV1', 'PF1']
log = log.drop(columns = ['@', 'Op', 'null', 'FG1', 'FGA1', 'FG%1', '3P1', '3PA1', '3P%1', 'FT1', 'FTA1', 'FT%1', 'ORB1', 'TRB1', 'AST1', 'STL1', 'BLK1', 'TOV1', 'PF1'])
log = log[log['Rk'].notna()]
log.drop(log[log.Rk == 'Rk'].index, inplace=True)
# Data set is cleaned up
teams = set(log['Opp'])
# All other teams were opponents in this data set, so this gives all teams
for team in teams:
    url3 = url3[:43] + team + url3[46:]
    # String replacement is used to scrape the data of the other teams
    r3 = requests.get(url3, headers = headers)
    root3 = BeautifulSoup(r3.content)
    lnks3 = root3.find('table')
    pretty3 = lnks3.prettify()
    table3 = pd.read_html(pretty3)
    temp = table3[0]
    temp.columns = ['Rk', 'G', 'Date', '@', 'Opp', 'W/L', 'Tm', 'Op', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', 'FT', 'FTA', 'FT%', 'ORB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'null', 'FG1', 'FGA1', 'FG%1', '3P1', '3PA1', '3P%1', 'FT1', 'FTA1', 'FT%1', 'ORB1', 'TRB1', 'AST1', 'STL1', 'BLK1', 'TOV1', 'PF1']
    temp = temp.drop(columns = ['@', 'Op', 'null', 'FG1', 'FGA1', 'FG%1', '3P1', '3PA1', '3P%1', 'FT1', 'FTA1', 'FT%1', 'ORB1', 'TRB1', 'AST1', 'STL1', 'BLK1', 'TOV1', 'PF1'])
    temp = temp[temp['Rk'].notna()]
    temp.drop(temp[temp.Rk == 'Rk'].index, inplace=True)
    log = pd.concat([log, temp])
    # The data for all other teams is concatenated to the end of the data frame.
log = log.reset_index()
log = log.drop(columns = ['index'])
log.head()
The Basketball Reference site only links to individual teams and their full season of games (including overall game statistics and whether the team won or lost each game). Instead of visiting every team's page by hand, we quickly realized that the url can be manipulated to navigate to a specific team. For example, the Milwaukee Bucks url is https://www.basketball-reference.com/teams/MIL/2020/gamelog/, with the MIL section of the url being the abbreviation for Milwaukee.
To pull the entire dataset, we first pulled the Bucks data from the page. It came with the Bucks' statistics as well as the opponent's statistics for each game. Since we were planning to pull the data for all teams, including the opponent's data for each game would be redundant, so that portion of the dataset was dropped.
Next, we noted that the Bucks played every other NBA team last season, and that the team abbreviation was included in the opponent column. We used a set to collect the unique team abbreviations and, with a little string manipulation, constructed the url for each team's data. We performed the same data cleaning as for the Bucks dataset and merged all of the resulting datasets into the final dataset.
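The url construction described above can also be written with a format string instead of slicing at fixed positions. This is a minimal sketch: `BASE_URL` and `gamelog_url` are names introduced here for illustration, and the three abbreviations are just a sample of what the 'Opp' column provides.

```python
# URL pattern taken from the Bucks example above, with the team code as a placeholder.
BASE_URL = "https://www.basketball-reference.com/teams/{team}/2020/gamelog/"


def gamelog_url(team_abbreviation):
    """Build the 2020 game log url for a three-letter team abbreviation."""
    return BASE_URL.format(team=team_abbreviation)


# Hypothetical subset of abbreviations; the notebook collects them from the 'Opp' column.
for team in sorted({"MIL", "BOS", "LAL"}):
    print(gamelog_url(team))
```

A format string avoids depending on the character positions 43 and 46, which would silently break if the base url ever changed length.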
The final dataset, comprising each team's game statistics and whether the game was a win or a loss, is shown above.
log = log.reset_index()
log = log.drop(columns = ['index', 'Rk', 'G', 'Date', 'Opp', 'Tm'])
# Unneeded columns are dropped
i = 0
while i < len(log):
    label = log.iat[i, 0]
    if label == 'W':
        log.iat[i,0] = int(1)
    else:
        log.iat[i,0] = int(0)
    i += 1
# A win is encoded to a 1 and a loss is encoded to a 0
log.head()
Before running the model, we first had to convert the 'W' and 'L' labels to numbers: 1 representing a win and 0 representing a loss.
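The same encoding can be expressed as a single vectorized comparison. The sketch below uses a small made-up game log (the column names match this notebook, the values do not come from it).

```python
import pandas as pd

# Hypothetical game log rows (not real data).
toy_log = pd.DataFrame({
    "W/L": ["W", "L", "W", "L"],
    "3P%": [0.41, 0.29, 0.37, 0.33],
})

# True where the label is 'W', then cast to int: win -> 1, anything else -> 0.
toy_log["W/L"] = (toy_log["W/L"] == "W").astype(int)

print(toy_log)
```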
log['W/L'] = pd.to_numeric(log['W/L'])
log['3P%'] = pd.to_numeric(log['3P%'])
# Violin plot showing three point percentage in wins vs losses
# The figure size is set before plotting so that it applies to this plot
sns.set(rc={'figure.figsize':(8,5)})
ax = sns.violinplot(x='W/L', y='3P%', data=log)
ax.set_xticklabels(['L', 'W'])
plt.show()
This plot shows that teams who win tend to make a higher percentage of their three pointers than teams that lose. Since there are over 2000 points in the plot and the mean and distribution of three point percentage look different, it stands to reason that three point percentage is correlated with whether a team won or lost the game. Furthermore, it appears that teams are more likely to win if they make a higher percentage of their three pointers, which makes sense as this helps them score more points.
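The visual impression from the violin plot can be backed up with a simple numeric summary per outcome. The sketch below uses a made-up six-game log, so its numbers are purely illustrative; on the real data the same `groupby` call would be applied to `log`.

```python
import pandas as pd

# Hypothetical game log: 1 = win, 0 = loss (not real data).
toy_log = pd.DataFrame({
    "W/L": [1, 1, 1, 0, 0, 0],
    "3P%": [0.40, 0.38, 0.36, 0.30, 0.34, 0.32],
})

# Mean, median, and number of games of three point percentage for losses and wins.
summary = toy_log.groupby("W/L")["3P%"].agg(["mean", "median", "count"])

print(summary)
```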
log['3P'] = pd.to_numeric(log['3P'])
log['3PA'] = pd.to_numeric(log['3PA'])
log['W/L'] = pd.to_numeric(log['W/L'])
x = log.drop(columns = ['W/L','FG','FGA','FG%','3P%','FT','FTA', 'FT%','ORB','TRB','AST','STL','BLK','TOV','PF']).values
y = log['W/L'].values
# scipy.stats is re-imported under a new name because `stats` was reassigned to the player DataFrame earlier
from scipy import stats as scipy_stats
kf = KFold(n_splits=10)
decision_error = 0
random_error = 0
for train_index, test_index in kf.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    ## Training Model
    DTC_CLF = tree.DecisionTreeClassifier().fit(x_train,y_train)
    RF_CLF = RandomForestClassifier(n_estimators=10).fit(x_train,y_train)
    ## Predicting
    DTC_Y = DTC_CLF.predict(x_test)
    RF_Y = RF_CLF.predict(x_test)
    ## Computing Error Estimate (standard error of the prediction differences)
    decision_error += scipy_stats.sem(np.round(DTC_Y - y_test))
    random_error += scipy_stats.sem(np.round(RF_Y - y_test))
    ## Paired t-test between predicted and actual outcomes for this fold
    print(scipy_stats.ttest_rel(np.round(DTC_Y), y_test))
    print(scipy_stats.ttest_rel(np.round(RF_Y), y_test))
# The per-fold error estimates are averaged over the 10 folds
decision_error /= 10
random_error /= 10
print("Decision Error " + str(decision_error) )
print("Random Error " + str(random_error) )
We tried to classify a game as a win or a loss based on three point attempts and makes, using both a decision tree classifier and a random forest classifier. We used 10 fold cross validation with a paired t-test to look at how well our classifiers did. Most of the p values in the t-tests are greater than the standard threshold of 0.05. Note that a paired t-test of this kind only checks whether the predicted and actual outcomes differ on average, so it does not by itself measure how often individual games are classified correctly; a metric such as accuracy would be a more direct measure. Based on these results, we could not establish a significant relationship between the predicted win/loss and the actual result of the game, so we conclude that three point attempts and makes alone are not enough information to confidently predict whether a team won or lost.
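To see why a paired comparison of means and classification accuracy can tell different stories, consider the small sketch below. The labels and predictions are invented for illustration and are not output from the classifiers above.

```python
import numpy as np

# Hypothetical actual outcomes and predictions for ten games (1 = win, 0 = loss).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 0])

# Fraction of games classified correctly.
accuracy = np.mean(y_true == y_pred)

# Average difference between prediction and outcome, which a paired t-test examines.
mean_difference = np.mean(y_pred - y_true)

# Accuracy of always guessing the more common outcome, as a baseline.
win_rate = np.mean(y_true)
baseline = max(win_rate, 1 - win_rate)

print(f"Accuracy:        {accuracy:.2f}")
print(f"Mean difference: {mean_difference:.2f}")
print(f"Baseline:        {baseline:.2f}")
```

In this made-up case the predictions contain exactly as many wins as the actual results, so the mean difference is zero, yet only 6 of the 10 games are classified correctly, barely better than the 50% baseline.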
Overall, it seems that on average, teams tend to make a higher percentage of three pointers in wins than losses. This makes sense as games are often close and making one or two more shots can affect the outcome of the game. However, when looking at the results of all NBA games last year, we found it difficult to predict whether a team won or lost based on three point makes and attempts. We believe that this may have to do with the different styles teams choose to play. For example, a team that doesn’t focus on shooting three pointers can have a good game when shooting and making a low volume of three pointers, but a team that focuses on shooting three pointers will likely struggle if they shoot a low number of three pointers.
Additionally, we found it difficult to predict the salary of an individual player from three point attempts and makes. While the correlation between three point attempts and salary seemed stronger than the correlation between three point percentage and salary, predicting the salary was still very difficult. Although it makes sense that, on average, better players will play and shoot more, differences in play style and position meant this information wasn't enough to accurately predict salary. For example, a highly paid center may choose to almost never shoot from three if he is a better scorer closer to the basket, while a shooting guard may take many attempts from three but struggle in other aspects of the game, resulting in a lower salary.
While the three point shot is undoubtedly an important aspect in basketball today, due to the several different strategies and play styles, we found it hard to predict either player success (in terms of salary) or team success (in terms of wins) based on how the player or team shot three pointers. As the game continues to evolve, it will be interesting to see whether the three point trend remains as popular as it is today and what strategies teams will come up with to reduce its effectiveness.