Kayla Fortson and Torri Green
The first dataset we have chosen is School Shootings Since Columbine, compiled by The Washington Post. The data was collected from multiple sources: Nexis, news articles, open-source databases, law enforcement reports, information from school websites, and calls to schools and police departments. For a shooting to be included in this database, it must have taken place on campus immediately before, during, or right after classes. The database excludes colleges and universities. There are a total of 338 shootings documented in this dataset. This dataset is critical because it involves the loss of innocent lives and can give us insightful knowledge to bring awareness to this unimaginable terror. We would like to examine the frequencies, demographics, and trends in this data. The dataset includes information such as the school name, district ID and name, date and time of the shooting, location, school type, enrollment, victim counts, and information about the shooter and their weapon. Further information on this dataset can be found here.
The second dataset comes from the National Center for Education Statistics (NCES). “The Common Core of Data (CCD) is the Department of Education’s primary database on public elementary and secondary education in the United States. CCD is a comprehensive, annual, national database of all public elementary and secondary schools and school districts.” For this project we used the Elementary/Secondary Information System (ElSi), an NCES web application that allows users to build custom data tables. We matched this dataset to the schools in our first dataset: for each school year, it contains every school in the districts that experienced a shooting that year. For example, for the Columbine shooting in 1999, all of Jefferson County’s schools are included (where available). The schools that have not experienced a shooting total 22,934; we removed any schools that did experience a shooting to avoid duplicating our other dataset, and we also removed all virtual schools. Columns include, but are not limited to: school name, district name, city, state, county, school type, race, and enrollment. Further information regarding this dataset can be found here.
Utilizing these datasets, we hope to bring further awareness to such horrendous events. By analyzing this data we can better understand how school shootings take place, how they unfold, the demographics and trends behind them, and more. We can also compare this data with schools that have not experienced shootings. These observations can prove critical to answering questions that could save lives in the future. For example, which areas are at higher risk of school shootings? We will explore questions about the frequencies, locations, and demographics of schools that have experienced shootings and those that have not.
The plan for our project is to meet in person once a week. In the meantime, progress can be made virtually using Google Colab to work together, along with a shared GitHub repository.
First we import all of the libraries we need.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)
Now we will read in and tidy our dataset of school shootings so that we can do some analysis later.
shootings_df = pd.read_csv('https://raw.githubusercontent.com/KaylaFortson/KaylaFortson.github.io/main/datasets/school-shootings-data.csv', encoding='ISO-8859-1')
shootings_df.head()
This table includes a lot of data, but some of the columns are unnecessary for our analysis. For example, we won't be working with the latitude and longitude of the schools, so we can drop those columns.
shootings_df.drop(columns = ['uid', 'nces_district_id', 'lat', 'long', 'state_fips', 'county_fips'], inplace=True)
We reformat the school_year column so it matches the format we will use for the NCES data later.
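# keep the first five and last two characters, e.g. "1998-1999" becomes "1998-99",
# matching the "YYYY-YY" school-year format used for the NCES data below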
shootings_df['school_year'] = shootings_df['school_year'].str[:5] + shootings_df['school_year'].str[-2:]
We also want to check the dtypes so we can make corrections.
shootings_df.dtypes
Some of our dtypes are off so we correct them here.
# Making sure categorical variables are classified correctly.
string_columns = ['nces_school_id', 'shooter_deceased1', 'shooter_deceased2', 'resource_officer', 'ulocale', 'district_name']
shootings_df[string_columns] = shootings_df[string_columns].astype(str)
# Making most quantitative variables floats for consistency.
float_columns = ['enrollment', 'killed', 'injured', 'casualties']
shootings_df[float_columns] = shootings_df[float_columns].astype('float')
shootings_df.dtypes
Finally, here is the cleaned version of our dataframe.
shootings_df.head(10)
Here we have a function that streamlines the tidying process for the NCES data. Since all of the tables come from the same source, we can apply the same changes to each one, with a few exceptions and checks for data that was only collected in certain years. This data includes all the schools from each district that experienced a school shooting in that year (covering the 1998-99 through 2020-21 school years).
def read_data(filepath):
'''
Take in a filename as a string, make and tidy a dataframe using that data
'''
# get just the year part of the filepath
year = filepath[-11:-4]
# read in dataset
schools = pd.read_csv(filepath)
# create columns for school year and type
schools['school_year'] = year
schools['school_type'] = 'public'
    # rename columns
schools = schools.rename(columns={
'School Name': 'school_name',
'State Name [Public School] Latest available year': 'state',
'School ID - NCES Assigned [Public School] Latest available year': 'nces_school_id',
'Agency Name [Public School] ' + year: 'district_name',
'Location City [Public School] ' + year: 'city',
'Total Students All Grades (Excludes AE) [Public School] ' + year: 'enrollment',
'White Students [Public School] ' + year: 'white',
'Black or African American Students [Public School] ' + year: 'black',
'Hispanic Students [Public School] ' + year: 'hispanic',
'Asian or Asian/Pacific Islander Students [Public School] ' + year: 'asian',
'American Indian/Alaska Native Students [Public School] ' + year: 'american_indian_alaska_native',
'Nat. Hawaiian or Other Pacific Isl. Students [Public School] ' + year: 'hawaiian_native_pacific_islander',
'Two or More Races Students [Public School] ' + year: 'two_or_more',
'Full-Time Equivalent (FTE) Teachers [Public School] ' + year: 'staffing',
'Lowest Grade Offered [Public School] ' + year: 'low_grade',
'Highest Grade Offered [Public School] ' + year: 'high_grade',
'Free and Reduced Lunch Students [Public School] ' + year: 'lunch',
'County Name [Public School] ' + year: 'county',
'Locale [Public School] ' + year: 'ulocale',
})
# fix data capitalization
string_cols = ['school_name', 'district_name', 'city', 'county', 'state']
for col in string_cols:
schools[col] = schools[col].astype(str).str.title()
# fix dtype of school id
schools['nces_school_id'] = schools['nces_school_id'].astype(str)
# we're interested in the schools that are in the same districts as schools that had shootings
districts = shootings_df[shootings_df['school_year'] == year]['district_name'].unique()
schools = schools.drop(schools.loc[~schools['district_name'].isin(districts)].index).reset_index(drop=True)
    # remove virtual schools, then drop the status column; its name varies by year,
    # so skip this step for years where the column doesn't exist
    try:
        schools = schools[schools['Virtual School Status (SY 2016-17 onward) [Public School] ' + year] != 'FULLVIRTUAL']
        schools = schools[schools['Virtual School Status (SY 2016-17 onward) [Public School] ' + year] != 'FACEVIRTUAL']
        schools = schools.drop(columns='Virtual School Status (SY 2016-17 onward) [Public School] ' + year)
    except KeyError:
        pass
    try:
        schools = schools[schools['Virtual School Status [Public School] ' + year] != 'A virtual school']
        schools = schools.drop(columns=['Virtual School Status [Public School] ' + year])
    except KeyError:
        pass
# drop rows with adult education
schools.drop(schools.loc[schools['low_grade'] == 'Adult Education'].index, inplace=True)
schools.drop(schools.loc[schools['low_grade'] == 'Ungraded'].index, inplace=True)
schools.drop(schools.loc[schools['school_name'].str.contains("Adult")].index, inplace=True)
    # treat 0-enrollment schools as missing and clear their demographic counts as well
    demos = ['white', 'black', 'hispanic', 'asian', 'american_indian_alaska_native',
             'hawaiian_native_pacific_islander', 'two_or_more']
    schools['enrollment'] = schools['enrollment'].replace(0.0, np.nan)
    schools.loc[schools['enrollment'].isna(), demos] = np.nan
# get just the numerical code for ulocale
schools.loc[:, 'ulocale'] = schools['ulocale'].str.partition('-')[0] + '.0'
# shorten grade levels
grade_levels = {'Prekindergarten': 'PK',
'Kindergarten': 'K',
'1st Grade': '1',
'2nd Grade': '2',
'3rd Grade': '3',
'4th Grade': '4',
'5th Grade': '5',
'6th Grade': '6',
'7th Grade': '7',
'8th Grade': '8',
'9th Grade': '9',
'10th Grade': '10',
'11th Grade': '11',
'12th Grade': '12',
'13th Grade': '13'
}
schools.loc[:, 'low_grade'] = schools['low_grade'].map(grade_levels)
schools.loc[:, 'high_grade'] = schools['high_grade'].map(grade_levels)
# reorder columns
order = ['nces_school_id', 'school_name', 'district_name', 'school_year', 'city', 'state', 'school_type',
'enrollment', 'white', 'black', 'hispanic', 'asian', 'american_indian_alaska_native',
'hawaiian_native_pacific_islander', 'two_or_more', 'staffing', 'low_grade',
'high_grade', 'lunch', 'county', 'ulocale']
for col in order:
if col not in schools.columns:
schools.loc[:, col] = np.nan
schools = schools[order]
return schools
Here we iterate over a list of years to load in our NCES data and build the dataframe of those schools.
years = ['1998-99', '1999-00', '2000-01', '2001-02', '2002-03', '2003-04',
'2004-05', '2005-06', '2006-07', '2007-08', '2008-09', '2009-10',
'2010-11', '2011-12', '2012-13', '2013-14', '2014-15', '2015-16',
'2016-17', '2017-18', '2018-19', '2019-20', '2020-21']
frames = []
for year in years:
    path = 'https://raw.githubusercontent.com/KaylaFortson/KaylaFortson.github.io/main/datasets/' + year + '.csv'
    frames.append(read_data(path))
# concatenate all of the yearly tables into one dataframe
schools = pd.concat(frames).reset_index(drop=True)
Our final dataframe of schools.
schools.head()
Our last step in ETL is going to be to combine the two dataframes above into one larger dataframe. This will be useful for comparing schools that had shootings to those that didn't.
The first thing we do is get a version of the shootings data with only the columns we can compare to the schools data.
columns = ['nces_school_id', 'school_name', 'district_name', 'school_year', 'city', 'state', 'school_type',
'enrollment', 'white', 'black', 'hispanic', 'asian', 'american_indian_alaska_native',
'hawaiian_native_pacific_islander', 'two_or_more', 'staffing', 'low_grade',
'high_grade', 'lunch', 'county', 'ulocale']
trim_shootings = shootings_df[columns].copy()
trim_shootings.head()
We also have to drop all of the shootings from the 2021-2022 school year since we don't have data on schools that didn't have shootings in that year.
trim_shootings = trim_shootings[trim_shootings['school_year'] != '2021-22']
We want a column indicating whether a shooting occurred for each dataset.
trim_shootings['shooting'] = 1
schools['shooting'] = 0
Now we concatenate the two dataframes and account for the schools that had a shooting but are in both sets. To do this we drop the duplicates based on if they have the same school id and are in the same year.
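# trim_shootings is listed first in the concat, so keep='first' keeps the shooting == 1 row
# whenever the same school and school year appear in both dataframes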
all_schools = pd.concat([trim_shootings, schools]).drop_duplicates(subset=['nces_school_id', 'school_year'], keep='first').reset_index(drop=True)
Now from here we are ready to explore our data and gain some insight from it!
all_schools.head()
With our data now cleaned and organized, we can start to analyze it.
First, we'll look at the yearly trend in school shootings since Columbine.
shootings_df['year'].value_counts().to_frame().sort_index(ascending=True).plot(xlabel="Year", ylabel="Shootings", figsize=(20,5), title='School Shootings Per Year 1998-2022', legend=False)
The graph above shows the number of shootings recorded per year. It shows an overall increase in shootings, with the largest spike in 2021 (42 shootings) and the fewest shootings in 2002 (5 shootings). This is a significant observation: the number of shootings per year is trending upward, which means this is a growing issue that needs to be addressed immediately and has yet to be properly resolved.
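As a quick check on those extremes, the busiest and quietest years can be pulled straight from the yearly counts; this is just a small sketch reusing shootings_df from above.
yearly = shootings_df['year'].value_counts().sort_index()
# year with the most shootings and year with the fewest, along with their counts
print('most shootings:', yearly.idxmax(), yearly.max())
print('fewest shootings:', yearly.idxmin(), yearly.min())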
Next, we take a look at the weapons used in the school shootings and compare the most used weapon types to the number of casualties they cause.
weapons = shootings_df[["weapon", "casualties"]].copy()
# fill missing weapon descriptions so the string checks below work
weapons["weapon"] = weapons["weapon"].fillna("")
# Simplify the weapon column
weapons.loc[weapons["weapon"].str.contains("shotgun"), "weapon"] = "Shotgun"
weapons.loc[weapons["weapon"].str.contains("rifle"), "weapon"] = "Rifle"
weapons.loc[weapons["weapon"].str.contains("handgun"), "weapon"] = "Handgun"
weapons.loc[weapons["weapon"].str.contains("pistol"), "weapon"] = "Handgun"
weapons.loc[weapons["weapon"].str.contains("revolver"), "weapon"] = "Handgun"
weapons.loc[weapons["weapon"].__eq__(""), "weapon"] = "Unknown"
weapons.loc[~weapons["weapon"].str.contains("Handgun") & ~weapons["weapon"].str.contains("Rifle") & ~weapons["weapon"].str.contains("Shotgun") & ~weapons["weapon"].str.contains("Unknown"), "weapon"] = "Other"
# Specify a 1 x 2 grid of plots; figsize is in inches
figure, axes = plt.subplots(1, 2, figsize=(20, 8))
weapons.groupby("weapon").count().plot.bar(ylabel="Number of Shootings", xlabel="Weapon Type", color="maroon", legend=False , ax=axes[0])
weapons.groupby("weapon").sum().plot.bar(ylabel="Number of Casualties", xlabel="Weapon Type", color="maroon", legend=False , ax=axes[1])
The left-hand graph shows that handguns were used in the greatest number of shootings, perhaps because a handgun is more attainable than other weapon types. Rifles, shotguns, and other weapon types were used a similar number of times, which may mean they are similarly attainable; of the named categories, rifles were used the least often.
The right-hand graph shows the number of casualties by weapon type. Handguns are again the highest, most likely because they were involved in the greatest number of shootings. Shotguns were responsible for more casualties than rifles. There were also more casualties from undocumented/unknown weapons than from weapons that did not fit into the handgun, rifle, or shotgun categories (other).
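To separate "used most often" from "most harmful per incident", a small summary table helps; this is only a sketch reusing the simplified weapons dataframe above, with casualties coerced to numeric so any missing values are simply ignored.
# casualties as numbers (non-numeric or missing entries become NaN and are skipped)
cas = pd.to_numeric(weapons["casualties"], errors="coerce")
weapon_summary = cas.groupby(weapons["weapon"]).agg(["count", "sum", "mean"])
weapon_summary.columns = ["shootings", "total_casualties", "casualties_per_shooting"]
weapon_summary.sort_values("casualties_per_shooting", ascending=False)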
temp = shootings_df[['state', 'date']].copy()
temp['month'] = pd.DatetimeIndex(temp['date']).month #adds month column
temp.drop(columns='date', inplace=True)
month_abv = {'1': 'Jan', '2': 'Feb', '3': 'Mar', '4': 'Apr', '5': 'May', '6': 'Jun',
'7': 'Jul', '8': 'Aug', '9': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}
temp['month'] = temp['month'].astype(str).map(month_abv) #abbreviate the months
fig, ax = plt.subplots(figsize=(32,32))
location_month_df = temp.groupby('state').month.value_counts().to_frame().unstack().fillna(0) #construct dataframe of month to state shootings
location_month_df = location_month_df['month'][['Aug', 'Sep','Oct', 'Nov', 'Dec','Jan', 'Feb', 'Mar', 'Apr','May', 'Jun', 'Jul']] #reorder columns by standard school year
#normalize the data
normalized_df=(location_month_df-location_month_df.mean())/location_month_df.std()
sns.heatmap(normalized_df, ax=ax, vmin=-1.0, vmax=5.0) #create heatmap
ax.set_xlabel("Month")
ax.set_ylabel("State")
ax.set_title('Number of Shootings by Month and State')
plt.show()
This heatmap shows the number of school shootings by month and by state, normalized so that the color scale runs from -1 to 5 standard deviations from the mean number of shootings for that month. California had the most shootings of any state, with 36; within California, October was the month with the most shootings. Several states had only a single shooting. Overall, January had the most shootings of any month and July had the fewest. Identifying the months with the most shootings can prove critical, because these can be times to focus even further on mental health and support for students.
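The raw, un-normalized counts behind those observations can be pulled directly from location_month_df; here is a small sketch.
state_totals = location_month_df.sum(axis=1)   # shootings per state across all months
month_totals = location_month_df.sum(axis=0)   # shootings per month across all states
print('state with the most shootings:', state_totals.idxmax(), int(state_totals.max()))
print('month with the most shootings:', month_totals.idxmax(), int(month_totals.max()))
print('month with the fewest shootings:', month_totals.idxmin(), int(month_totals.min()))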
Below, we plot the number of school shootings against the percentage of students at a school who are eligible for free and reduced-price lunch. We normalize by the number of schools in each bin, since there could simply be many schools with a high percentage of eligible students. We hypothesize that there will be more school shootings at schools with a greater share of free and reduced-price lunch students.
# get subset of our dataset with relevant information
lunch = all_schools[['lunch','enrollment', 'shooting']].copy()
lunch = lunch.loc[lunch['enrollment'] > lunch['lunch']]
lunch['pct'] = lunch['lunch'] / lunch['enrollment']
lunch['z-score'] = (lunch['pct'] - lunch['pct'].mean()) / lunch['pct'].std()
lunch.max()
Now we create the bins that we'll use to separate the schools.
range_0 = lunch[(lunch['z-score'] > -2.50) & (lunch['z-score'] <= -2.00)].copy()
range_1 = lunch[(lunch['z-score'] > -2.00) & (lunch['z-score'] <= -1.50)].copy()
range_2 = lunch[(lunch['z-score'] > -1.50) & (lunch['z-score'] <= -1.00)].copy()
range_3 = lunch[(lunch['z-score'] > -1.00) & (lunch['z-score'] <= -0.50)].copy()
range_4 = lunch[(lunch['z-score'] > -0.50) & (lunch['z-score'] <= 0.0)].copy()
range_5 = lunch[(lunch['z-score'] > 0.0) & (lunch['z-score'] <= 0.50)].copy()
range_6 = lunch[(lunch['z-score'] > 0.50) & (lunch['z-score'] <= 1.00)].copy()
range_7 = lunch[(lunch['z-score'] > 1.00) & (lunch['z-score'] <= 1.50)].copy()
Then we add a column to identify which bin the school is in.
range_0['deviation'] = '-2.50->-2.00'
range_1['deviation'] = '-2.00->-1.50'
range_2['deviation'] = '-1.50->-1.00'
range_3['deviation'] = '-1.00->-0.50'
range_4['deviation'] = '-0.50->0.0'
range_5['deviation'] = '0.0->0.50'
range_6['deviation'] = '0.50->1.00'
range_7['deviation'] = '1.00->1.50'
Lastly, we graph the data.
# concatenate the dataframes back together
grouped_ranges = pd.concat([range_0, range_1, range_2, range_3, range_4, range_5, range_6, range_7])
# count the schools in each bin that had a shooting, then divide by the total schools in the bin to get a rate
shootings_in_bin = grouped_ranges[grouped_ranges['shooting'] == 1].deviation.value_counts()
all_in_bin = grouped_ranges.deviation.value_counts()
ranges_counts = shootings_in_bin / all_in_bin
# correct the ordering and plot
order = ['-2.50->-2.00', '-2.00->-1.50', '-1.50->-1.00', '-1.00->-0.50', '-0.50->0.0', '0.0->0.50', '0.50->1.00', '1.00->1.50']
ranges_counts = ranges_counts.reindex(order)
ranges_counts.plot.bar(ylabel='Shootings', xlabel='Free Lunch Students', color="forestgreen", rot=0, figsize=(20,5))
The graph shows the relationship between the percentage of students who receive free or reduced-price lunch and the rate of shootings. From analyzing the graph, the percentage of free or reduced-price lunch students does not appear to be correlated with the number of shootings. We do see, however, that schools above the mean have lower shooting rates than schools below the mean. This goes against our original hypothesis that schools with more low-income students would have more shootings.
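The manual binning above can also be written more compactly with pd.cut; this sketch should reproduce the same per-bin shooting rates, assuming the same lunch dataframe and the same half-standard-deviation bin edges.
# same bin edges as above: (-2.5, -2.0], (-2.0, -1.5], ..., (1.0, 1.5]
edges = np.arange(-2.5, 2.0, 0.5)
lunch['bin'] = pd.cut(lunch['z-score'], bins=edges)
# fraction of schools in each bin that experienced a shooting
lunch.groupby('bin')['shooting'].mean().plot.bar(
    ylabel='Shootings', xlabel='Free Lunch Students (z-score bin)',
    color='forestgreen', rot=0, figsize=(20, 5))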
The next bar graph is similar to the last, but for the ratio of white students to total students. We focus on this group specifically since much of the dialogue around school shootings centers on the race of the people involved.
# get subset of our dataset with relevant information
student = all_schools[['white', 'enrollment', 'shooting']].copy()
student['white_pct'] = student['white'] / student['enrollment']
student['z-score'] = (student['white_pct'] - student['white_pct'].mean()) / student['white_pct'].std()
# bin the ratios of white students - min was -0.8, max was 2.9
range_0 = student[(student['z-score'] > -1.00) & (student['z-score'] <= -0.50)].copy()
range_1 = student[(student['z-score'] > -0.50) & (student['z-score'] <= 0.0)].copy()
range_2 = student[(student['z-score'] > 0.0) & (student['z-score'] <= 0.50)].copy()
range_3 = student[(student['z-score'] > 0.50) & (student['z-score'] <= 1.0)].copy()
range_4 = student[(student['z-score'] > 1.0) & (student['z-score'] <= 1.50)].copy()
range_5 = student[(student['z-score'] > 1.50) & (student['z-score'] <= 2.00)].copy()
range_6 = student[(student['z-score'] > 2.00) & (student['z-score'] <= 2.50)].copy()
range_7 = student[(student['z-score'] > 2.50) & (student['z-score'] <= 3.00)].copy()
range_0['deviation'] = '-1.0->-0.5'
range_1['deviation'] = '-0.5->0.0'
range_2['deviation'] = '0.0->0.5'
range_3['deviation'] = '0.5->1.0'
range_4['deviation'] = '1.0->1.5'
range_5['deviation'] = '1.5->2.0'
range_6['deviation'] = '2.0->2.5'
range_7['deviation'] = '2.5->3.0'
# concatenate the dataframes back together
grouped_ranges = pd.concat([range_0, range_1, range_2, range_3, range_4, range_5, range_6, range_7])
# count the schools in each bin that had a shooting, then divide by the total schools in the bin to get a rate
shootings_in_bin = grouped_ranges[grouped_ranges['shooting'] == 1].deviation.value_counts()
all_in_bin = grouped_ranges.deviation.value_counts()
white_ranges = shootings_in_bin / all_in_bin
# correct the ordering and plot
order = ['-1.0->-0.5', '-0.5->0.0', '0.0->0.5', '0.5->1.0', '1.0->1.5', '1.5->2.0', '2.0->2.5', '2.5->3.0']
white_ranges = white_ranges.reindex(order)
white_ranges.plot.bar(ylabel='Shootings', xlabel='White-Student Ratio Z-Score', color="orange", rot=0, figsize=(20,5))
According to the shootings dataset, the largest share of shooters were white, so we took a deeper look at the ratio of white students at each school. The graph shows that there isn't much difference in the rate of school shootings across most of the range of white-student percentages. However, this changes in the two bins that are more than 2 standard deviations above the mean percentage, where the rate of shootings is much higher. This suggests that schools with a predominantly white student population have experienced a greater rate of school shootings.
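To avoid reading too much into the sparsely populated tail bins, we can also directly compare the white-student ratio at schools that did and did not have a shooting; here is a sketch using a Welch t-test from scipy (stats is already imported above), which only tests for a difference in means.
# keep rows where the ratio is defined
valid = student[student['enrollment'] > 0].dropna(subset=['white_pct'])
shooting_pct = valid.loc[valid['shooting'] == 1, 'white_pct']
other_pct = valid.loc[valid['shooting'] == 0, 'white_pct']
t_stat, p_value = stats.ttest_ind(shooting_pct, other_pct, equal_var=False)
print('mean white-student ratio (shooting schools):', shooting_pct.mean())
print('mean white-student ratio (other schools):   ', other_pct.mean())
print('t =', t_stat, ', p =', p_value)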
From here, we will build a model to try to determine whether there is a correlation between student demographics and school location and the likelihood of a shooting occurring at a school. We want to see whether any attributes are predictive, which could possibly help prevent future shootings.
To do this we will use the district, city, state, enrollment, racial demographics, staffing, and lunch data to predict whether a shooting will occur.
First we will evaluate the model with different values of k using 10-fold cross-validation.
# list the variables we will use for the model
features = ['district_name','city', 'state', 'enrollment', 'white', 'black',
'hispanic', 'asian', 'staffing', 'lunch', 'shooting']
# load the data and drop rows that we can't use
model_df = all_schools[features].dropna()
# separate the sets of data
x_dict = model_df[features[:-1]].to_dict(orient='records')
y = model_df['shooting']
Here we create a function to get the cross-validated MSE of k-nearest neighbors models with different values of k.
def get_cv_error(k):
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = KNeighborsRegressor(n_neighbors=k)
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
mse = np.mean(-cross_val_score(pipeline, x_dict, y, cv=10, scoring="neg_mean_squared_error"))
return mse
Then we test different k values and plot the result.
# took about 15 minutes to run for us
ks = pd.Series(range(1, 31))
ks.index = range(1, 31)
test_errs = ks.apply(get_cv_error)
test_errs.plot.line()
We see from this graph that the MSE is minimized at a k value of around 10, so that is roughly what we will use for our model (k = 11 below).
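Rather than eyeballing the plot, the minimizing k can also be read off test_errs directly (a quick sketch).
# k with the lowest cross-validated MSE, and that MSE
best_k = test_errs.idxmin()
print('best k =', best_k, 'with MSE =', test_errs.min())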
Now we want to pick the best features for our model. To do this, we'll test the model while dropping each of our columns one at a time.
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = KNeighborsRegressor(n_neighbors=11)
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
features = ['district_name','city', 'state', 'enrollment', 'white', 'black',
'hispanic', 'asian', 'staffing', 'lunch', 'shooting']
# baseline for comparison
rmse = np.sqrt(np.mean(-cross_val_score(pipeline, x_dict, y, cv=10, scoring="neg_mean_squared_error")))
print ('baseline rmse = ', rmse)
for i in range(len(features[:-1])):
    # slicing produces a new list, so removing one feature here doesn't modify features
    copy = features[:-1]
    del copy[i]
    x_dict = model_df[copy].to_dict(orient="records")
    rmse = np.sqrt(np.mean(-cross_val_score(pipeline, x_dict, y, cv=10, scoring="neg_mean_squared_error")))
    print('no ' + features[i] + ' rmse = ', rmse)
We see here that removing any one variable from the features doesn't have much of an impact on the RMSE. This makes sense because we didn't find much correlation earlier between any of the variables and shootings happening. Our model doesn't have much predictive power, which points to how unpredictable school shootings are. If we had data on city mental health statistics or access to mental health help in schools, we might find a stronger correlation there. For now, we can take away that school demographics are unlikely to hold much, if any, weight in determining whether a shooting is going to happen.
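To put those RMSE values in context, we can compare them against a naive baseline that always predicts the overall shooting rate and uses no features at all; a short sketch, reusing y from above.
# RMSE of always predicting the mean of y (the overall shooting rate)
naive_rmse = np.sqrt(np.mean((y - y.mean()) ** 2))
print('shooting rate in the data =', y.mean())
print('naive baseline rmse =', naive_rmse)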