Analysis of School Shootings

Kayla Fortson and Torri Green

Project Datasets

The first dataset we have chosen is School Shootings since Columbine, compiled by The Washington Post. The data are gathered from multiple sources: Nexis, news articles, open-source databases, law enforcement reports, information from school websites, and calls to schools and police departments. For a shooting to be included in this database, it must have taken place on campus immediately before, during, or immediately after classes. The database excludes colleges and universities. A total of 338 shootings are documented in this dataset. This dataset is critical because it involves the loss of innocent lives and can give us insight that brings awareness to this unimaginable terror. We would like to examine the frequencies, demographics, and trends in this data. The dataset includes information such as the school name, district ID and name, date and time of the shooting, location, school type, enrollment, victim counts, and information about the shooter and their weapon. Further information on this dataset can be found here

The second dataset is collected from the National Center for Education Statistics (NCES). “The Common Core of Data (CCD) is the Department of Education’s primary database on public elementary and secondary education in the United States. CCD is a comprehensive, annual, national database of all public elementary and secondary schools and school districts.” For this project we used the Elementary/Secondary Information System (ElSi), an NCES web application that allows users to build custom data tables. We matched this dataset against our first dataset of schools that experienced shootings: it contains every school in each district that experienced a shooting that year. For example, for the Columbine shooting in 1999, all of Jefferson County's schools are included (where available). Among the schools that have not experienced school shootings we have a total of 22,934 schools; we removed any schools that experienced shootings to avoid duplicating data from the other dataset, and we removed all virtual schools. Columns include, but are not limited to: school name, district name, city, state, county, school type, race, and enrollment. Further information regarding this dataset can be found here

Project Goals

Utilizing these datasets, we hope to bring further awareness to such horrendous events. By analyzing this data we can better understand how school shootings take place, how they unfold, and the demographics and trends behind them, and we can compare the affected schools to schools that have not experienced shootings. These observations can prove critical to inferring answers to questions that could save lives in the future, for example: which areas are at higher risk of school shootings? We will infer answers to questions regarding the frequencies, locations, and demographics of schools that experienced shootings and those that have not.

Collaboration Plan

The plan for our project is to meet in person once a week. In the meantime, progress can be made on our project virtually by using Google Colab to work together, along with a shared GitHub repository.

ETL

First we import all of the libraries we need.

In [114]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
pd.set_option("display.max_rows", None, "display.max_columns", None)

Shootings Dataset

Now we will read in and tidy our dataset of school shootings so that we can do some analysis later.

In [115]:
shootings_df = pd.read_csv('https://raw.githubusercontent.com/KaylaFortson/KaylaFortson.github.io/main/datasets/school-shootings-data.csv', encoding='ISO-8859-1')
In [116]:
shootings_df.head()
Out[116]:
uid nces_school_id school_name nces_district_id district_name date school_year year time day_of_week city state school_type enrollment killed injured casualties shooting_type age_shooter1 gender_shooter1 race_ethnicity_shooter1 shooter_relationship1 shooter_deceased1 deceased_notes1 age_shooter2 gender_shooter2 race_ethnicity_shooter2 shooter_relationship2 shooter_deceased2 deceased_notes2 white black hispanic asian american_indian_alaska_native hawaiian_native_pacific_islander two_or_more resource_officer weapon weapon_source lat long staffing low_grade high_grade lunch county state_fips county_fips ulocale
0 1 080480000707 Columbine High School 804800.0 Jefferson County R-1 4/20/1999 1998-1999 1999 11:19 AM Tuesday Littleton Colorado public 1965 13 21 34 indiscriminate 18.0 m w student 1.0 suicide 17.0 m w student 1.0 suicide 1783.0 16.0 112.0 42.0 12.0 NaN NaN 1 12-gauge Savage-Springfield 67H pump-action sh... purchased from friends 39.603910 -105.075000 89.6 9 12 41.0 Jefferson County 8 8059 21.0
1 2 220054000422 Scotlandville Middle School 2200540.0 East Baton Rouge Parish School Board 4/22/1999 1998-1999 1999 12:30 PM Thursday Baton Rouge Louisiana public 588 0 1 1 targeted 14.0 m NaN former student (expelled) 0.0 NaN NaN NaN NaN NaN NaN NaN 5.0 583.0 0.0 0.0 0.0 NaN NaN 0 .22 caliber pistol NaN 30.529958 -91.169966 39.0 6 8 495.0 East Baton Rouge Parish 22 22033 12.0
2 3 130441001591 Heritage High School 1304410.0 Rockdale County 5/20/1999 1998-1999 1999 8:03 AM Thursday Conyers Georgia public 1369 0 6 6 indiscriminate 15.0 m w student 0.0 NaN NaN NaN NaN NaN NaN NaN 1189.0 136.0 28.0 15.0 1.0 NaN NaN 1 .22 caliber rifle, 357-caliber Magnum handgun NaN 33.626922 -84.047960 84.0 9 12 125.0 Rockdale County 13 13247 21.0
3 4 421899003847 John Bartram High School 4218990.0 Philadelphia City SD 10/4/1999 1999-2000 1999 10:00 AM Monday Philadelphia Pennsylvania public 3147 0 1 1 targeted 17.0 m NaN student 0.0 NaN NaN NaN NaN NaN NaN NaN 209.0 2736.0 27.0 170.0 5.0 NaN NaN 1 RG .25-caliber pistol purchased from friend 39.921509 -75.234108 41.0 9 12 2007.0 Philadelphia County 42 42101 11.0
4 5 250279000225 Dorchester High School 2502790.0 Boston 11/3/1999 1999-2000 1999 7:40 AM Wednesday Boston Massachusetts public 1116 0 1 1 targeted NaN m NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN 40.0 755.0 287.0 29.0 5.0 NaN NaN 0 NaN NaN 42.285268 -71.075901 NaN 9 12 543.0 Suffolk County 25 25025 11.0

This table includes a lot of data, but some of the columns are unnecessary for our analysis. For example, we won't be working with the latitude and longitude of the schools, so we can drop those columns.

In [117]:
shootings_df.drop(columns = ['uid', 'nces_district_id', 'lat', 'long', 'state_fips', 'county_fips'], inplace=True)

We reformat the school_year column (e.g. 1998-1999 becomes 1998-99) for use later.

In [118]:
shootings_df['school_year'] = shootings_df['school_year'].str[:5] + shootings_df['school_year'].str[-2:]
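As a quick sanity check (a hypothetical snippet, not part of the pipeline), the slicing keeps the first five characters and appends the last two, so '1998-1999' becomes '1998-99':

```python
import pandas as pd

# Hypothetical sample of the raw school_year format from the CSV
sample = pd.Series(['1998-1999', '2020-2021'])

# Same transformation as above: first five characters plus last two
shortened = sample.str[:5] + sample.str[-2:]
print(shortened.tolist())  # ['1998-99', '2020-21']
```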

We also want to check the dtypes so we can make corrections.

In [119]:
shootings_df.dtypes
Out[119]:
nces_school_id                       object
school_name                          object
district_name                        object
date                                 object
school_year                          object
year                                  int64
time                                 object
day_of_week                          object
city                                 object
state                                object
school_type                          object
enrollment                            int64
killed                                int64
injured                               int64
casualties                            int64
shooting_type                        object
age_shooter1                        float64
gender_shooter1                      object
race_ethnicity_shooter1              object
shooter_relationship1                object
shooter_deceased1                   float64
deceased_notes1                      object
age_shooter2                        float64
gender_shooter2                      object
race_ethnicity_shooter2              object
shooter_relationship2                object
shooter_deceased2                   float64
deceased_notes2                      object
white                               float64
black                               float64
hispanic                            float64
asian                               float64
american_indian_alaska_native       float64
hawaiian_native_pacific_islander    float64
two_or_more                         float64
resource_officer                      int64
weapon                               object
weapon_source                        object
staffing                            float64
low_grade                            object
high_grade                           object
lunch                               float64
county                               object
ulocale                             float64
dtype: object

Some of our dtypes are off so we correct them here.

In [120]:
# Making sure categorical variables are classified correctly.
string_columns = ['nces_school_id', 'shooter_deceased1', 'shooter_deceased2', 'resource_officer', 'ulocale', 'district_name']
shootings_df[string_columns] = shootings_df[string_columns].astype(str)

# Making most quantitative variables floats for consistency.
float_columns = ['enrollment', 'killed', 'injured', 'casualties']
shootings_df[float_columns] = shootings_df[float_columns].astype('float')

shootings_df.dtypes
Out[120]:
nces_school_id                       object
school_name                          object
district_name                        object
date                                 object
school_year                          object
year                                  int64
time                                 object
day_of_week                          object
city                                 object
state                                object
school_type                          object
enrollment                          float64
killed                              float64
injured                             float64
casualties                          float64
shooting_type                        object
age_shooter1                        float64
gender_shooter1                      object
race_ethnicity_shooter1              object
shooter_relationship1                object
shooter_deceased1                    object
deceased_notes1                      object
age_shooter2                        float64
gender_shooter2                      object
race_ethnicity_shooter2              object
shooter_relationship2                object
shooter_deceased2                    object
deceased_notes2                      object
white                               float64
black                               float64
hispanic                            float64
asian                               float64
american_indian_alaska_native       float64
hawaiian_native_pacific_islander    float64
two_or_more                         float64
resource_officer                     object
weapon                               object
weapon_source                        object
staffing                            float64
low_grade                            object
high_grade                           object
lunch                               float64
county                               object
ulocale                              object
dtype: object

Finally, here is a final version of our dataframe.

In [121]:
shootings_df.head(10)
Out[121]:
nces_school_id school_name district_name date school_year year time day_of_week city state school_type enrollment killed injured casualties shooting_type age_shooter1 gender_shooter1 race_ethnicity_shooter1 shooter_relationship1 shooter_deceased1 deceased_notes1 age_shooter2 gender_shooter2 race_ethnicity_shooter2 shooter_relationship2 shooter_deceased2 deceased_notes2 white black hispanic asian american_indian_alaska_native hawaiian_native_pacific_islander two_or_more resource_officer weapon weapon_source staffing low_grade high_grade lunch county ulocale
0 080480000707 Columbine High School Jefferson County R-1 4/20/1999 1998-99 1999 11:19 AM Tuesday Littleton Colorado public 1965.0 13.0 21.0 34.0 indiscriminate 18.0 m w student 1.0 suicide 17.0 m w student 1.0 suicide 1783.0 16.0 112.0 42.0 12.0 NaN NaN 1 12-gauge Savage-Springfield 67H pump-action sh... purchased from friends 89.6 9 12 41.0 Jefferson County 21.0
1 220054000422 Scotlandville Middle School East Baton Rouge Parish School Board 4/22/1999 1998-99 1999 12:30 PM Thursday Baton Rouge Louisiana public 588.0 0.0 1.0 1.0 targeted 14.0 m NaN former student (expelled) 0.0 NaN NaN NaN NaN NaN nan NaN 5.0 583.0 0.0 0.0 0.0 NaN NaN 0 .22 caliber pistol NaN 39.0 6 8 495.0 East Baton Rouge Parish 12.0
2 130441001591 Heritage High School Rockdale County 5/20/1999 1998-99 1999 8:03 AM Thursday Conyers Georgia public 1369.0 0.0 6.0 6.0 indiscriminate 15.0 m w student 0.0 NaN NaN NaN NaN NaN nan NaN 1189.0 136.0 28.0 15.0 1.0 NaN NaN 1 .22 caliber rifle, 357-caliber Magnum handgun NaN 84.0 9 12 125.0 Rockdale County 21.0
3 421899003847 John Bartram High School Philadelphia City SD 10/4/1999 1999-00 1999 10:00 AM Monday Philadelphia Pennsylvania public 3147.0 0.0 1.0 1.0 targeted 17.0 m NaN student 0.0 NaN NaN NaN NaN NaN nan NaN 209.0 2736.0 27.0 170.0 5.0 NaN NaN 1 RG .25-caliber pistol purchased from friend 41.0 9 12 2007.0 Philadelphia County 11.0
4 250279000225 Dorchester High School Boston 11/3/1999 1999-00 1999 7:40 AM Wednesday Boston Massachusetts public 1116.0 0.0 1.0 1.0 targeted NaN m NaN NaN 0.0 NaN NaN NaN NaN NaN nan NaN 40.0 755.0 287.0 29.0 5.0 NaN NaN 0 NaN NaN NaN 9 12 543.0 Suffolk County 11.0
5 350069000240 Deming Middle School Deming Public Schools 11/19/1999 1999-00 1999 12:45 PM Friday Deming New Mexico public 753.0 1.0 0.0 1.0 accidental or targeted 12.0 m h student 0.0 NaN NaN NaN NaN NaN nan NaN 160.0 6.0 583.0 2.0 2.0 NaN NaN 0 .22-caliber Colt revolver home 44.0 6 7 502.0 Luna County 33.0
6 401197000559 Fort Gibson Middle School Fort Gibson 12/6/1999 1999-00 1999 7:45 AM Monday Fort Gibson Oklahoma public 407.0 0.0 5.0 5.0 indiscriminate 13.0 m ai student 0.0 NaN NaN NaN NaN NaN nan NaN 239.0 3.0 12.0 0.0 153.0 NaN NaN 0 Taurus 9mm pistol father 25.4 6 8 146.0 Muskogee County 32.0
7 120153001590 Ridgewood High School Pasco 1/19/2000 1999-00 2000 3:00 PM Wednesday New Port Richey Florida public 1852.0 1.0 0.0 1.0 accidental 16.0 m w student 0.0 NaN NaN NaN NaN NaN nan NaN 1697.0 28.0 96.0 26.0 5.0 NaN NaN 1 .22-caliber Magnum handgun stolen 85.0 9 12 654.0 Pasco County 21.0
8 062460009581 Alicia Reyes Elementary School Merced City Elementary 1/26/2000 1999-00 2000 8:34 AM Wednesday Merced California public 734.0 0.0 0.0 0.0 targeted 13.0 m NaN no connection 0.0 NaN NaN NaN NaN NaN nan NaN 82.0 40.0 389.0 222.0 1.0 NaN NaN 0 .22 caliber revolver NaN 36.0 KG 5 643.0 Merced County 13.0
9 170993000748 Duke Ellington Elementary School City of Chicago School District 299 2/14/2000 1999-00 2000 NaN Monday Chicago Illinois public 736.0 0.0 1.0 1.0 accidental 11.0 m NaN student 0.0 NaN NaN NaN NaN NaN nan NaN 0.0 736.0 0.0 0.0 0.0 NaN NaN 0 NaN NaN 37.0 PK 6 NaN Cook County 11.0

All Schools Dataset

Here we define a function that streamlines the tidying process for all of our NCES data. Since every table comes from the same source, we can apply the same cleaning steps to each one, with a few exceptions and checks for data that was only collected in certain years. This dataset includes all the schools from each district that experienced a school shooting in that year (school years 1998-99 through 2020-21).

In [122]:
def read_data(filepath):
  '''
  Take in a filename as a string, make and tidy a dataframe using that data
  '''
  # get just the year part of the filepath
  year = filepath[-11:-4]

  # read in dataset
  schools = pd.read_csv(filepath)

  # create columns for school year and type
  schools['school_year'] = year
  schools['school_type'] = 'public'

  # rename columns
  schools = schools.rename(columns={
      'School Name': 'school_name',
      'State Name [Public School] Latest available year': 'state',
      'School ID - NCES Assigned [Public School] Latest available year': 'nces_school_id',
      'Agency Name [Public School] ' + year: 'district_name',
      'Location City [Public School] ' + year: 'city',
      'Total Students All Grades (Excludes AE) [Public School] ' + year: 'enrollment',
      'White Students [Public School] ' + year: 'white',
      'Black or African American Students [Public School] ' + year: 'black',
      'Hispanic Students [Public School] ' + year: 'hispanic',
      'Asian or Asian/Pacific Islander Students [Public School] ' + year: 'asian',
      'American Indian/Alaska Native Students [Public School] ' + year: 'american_indian_alaska_native',
      'Nat. Hawaiian or Other Pacific Isl. Students [Public School] ' + year: 'hawaiian_native_pacific_islander',
      'Two or More Races Students [Public School] ' + year: 'two_or_more',
      'Full-Time Equivalent (FTE) Teachers [Public School] ' + year: 'staffing',
      'Lowest Grade Offered [Public School] ' + year: 'low_grade',
      'Highest Grade Offered [Public School] ' + year: 'high_grade',
      'Free and Reduced Lunch Students [Public School] ' + year: 'lunch',
      'County Name [Public School] ' + year: 'county',
      'Locale [Public School] ' + year: 'ulocale',
  })

  # fix data capitalization
  string_cols = ['school_name', 'district_name', 'city', 'county', 'state']
  for col in string_cols:
    schools[col] = schools[col].astype(str).str.title()

  # fix dtype of school id
  schools['nces_school_id'] = schools['nces_school_id'].astype(str)

  # we're interested in the schools that are in the same districts as schools that had shootings
  districts = shootings_df[shootings_df['school_year'] == year]['district_name'].unique()
  schools = schools.drop(schools.loc[~schools['district_name'].isin(districts)].index).reset_index(drop=True)

  # remove virtual schools, then drop the status column since we no longer need it
  # (the column name varies by year, so we skip this step when the column doesn't exist)
  try:
    schools = schools[schools['Virtual School Status (SY 2016-17 onward) [Public School] ' + year] != 'FULLVIRTUAL']
    schools = schools[schools['Virtual School Status (SY 2016-17 onward) [Public School] ' + year] != 'FACEVIRTUAL']
    schools = schools.drop(columns='Virtual School Status (SY 2016-17 onward) [Public School] ' + year)
  except KeyError:
    pass

  try:
    schools = schools[schools['Virtual School Status [Public School] ' + year] != 'A virtual school']
    schools = schools.drop(columns=['Virtual School Status [Public School] ' + year])
  except KeyError:
    pass

  # drop adult education and ungraded rows
  schools.drop(schools.loc[schools['low_grade'] == 'Adult Education'].index, inplace=True)
  schools.drop(schools.loc[schools['low_grade'] == 'Ungraded'].index, inplace=True)
  schools.drop(schools.loc[schools['school_name'].str.contains("Adult", na=False)].index, inplace=True)


  # treat 0-enrollment schools as missing
  # (note: detect missing values with isna(), since == np.nan is always False)
  demos = ['white', 'black', 'hispanic', 'asian', 'american_indian_alaska_native', 
          'hawaiian_native_pacific_islander', 'two_or_more']
  schools['enrollment'] = schools['enrollment'].replace(0.0, np.nan)
  schools.loc[schools['enrollment'].isna(), demos] = np.nan

  # get just the numerical code for ulocale
  schools.loc[:, 'ulocale'] = schools['ulocale'].str.partition('-')[0] + '.0'

  # shorten grade levels
  grade_levels = {'Prekindergarten': 'PK',
                  'Kindergarten': 'K',
                  '1st Grade': '1',
                  '2nd Grade': '2',
                  '3rd Grade': '3',
                  '4th Grade': '4',
                  '5th Grade': '5',
                  '6th Grade': '6',
                  '7th Grade': '7',
                  '8th Grade': '8',
                  '9th Grade': '9',
                  '10th Grade': '10',
                  '11th Grade': '11',
                  '12th Grade': '12',
                  '13th Grade': '13'
                  }
  schools.loc[:, 'low_grade'] = schools['low_grade'].map(grade_levels)
  schools.loc[:, 'high_grade'] = schools['high_grade'].map(grade_levels)

  # reorder columns
  order = ['nces_school_id', 'school_name', 'district_name', 'school_year', 'city', 'state', 'school_type',
          'enrollment', 'white', 'black', 'hispanic', 'asian', 'american_indian_alaska_native', 
          'hawaiian_native_pacific_islander', 'two_or_more', 'staffing', 'low_grade',
          'high_grade', 'lunch', 'county', 'ulocale']
  for col in order:
    if col not in schools.columns:
      schools.loc[:, col] = np.nan
  schools = schools[order]

  return schools
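One pitfall worth noting from the cleaning above: np.nan never compares equal to anything, including itself, so missing values must be detected with isna() rather than with ==. A minimal illustration (hypothetical data):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])

# Equality against NaN never matches, even for the NaN row
print((s == np.nan).tolist())  # [False, False]

# isna() is the correct way to find missing values
print(s.isna().tolist())       # [False, True]
```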

Here we iterate over a list of years to load in our NCES data and build the dataframe of those schools.

In [96]:
years = ['1998-99', '1999-00', '2000-01', '2001-02', '2002-03', '2003-04',
         '2004-05', '2005-06', '2006-07', '2007-08', '2008-09', '2009-10',
         '2010-11', '2011-12', '2012-13', '2013-14', '2014-15', '2015-16',
         '2016-17', '2017-18', '2018-19', '2019-20', '2020-21']
frames = []

for year in years:
  path = 'https://raw.githubusercontent.com/KaylaFortson/KaylaFortson.github.io/main/datasets/' + year + '.csv'
  frames.append(read_data(path))

# DataFrame.append is deprecated, so we collect the yearly frames and concatenate once
schools = pd.concat(frames, ignore_index=True)

Our final dataframe of schools.

In [97]:
schools.head()
Out[97]:
nces_school_id school_name district_name school_year city state school_type enrollment white black hispanic asian american_indian_alaska_native hawaiian_native_pacific_islander two_or_more staffing low_grade high_grade lunch county ulocale
0 220054000340 Arlington Preparatory Academy East Baton Rouge Parish School Board 1998-99 Baton Rouge Louisiana public 89.0 32.0 57.0 0.0 0.0 0.0 NaN NaN 17.0 9 12 43.0 East Baton Rouge 1.0
1 220054000341 Audubon Elementary School East Baton Rouge Parish School Board 1998-99 Baton Rouge Louisiana public 437.0 181.0 246.0 4.0 6.0 0.0 NaN NaN 26.0 PK 5 231.0 East Baton Rouge 1.0
2 220054000429 B. R. Foreign Language Acad. Immersion Magnet East Baton Rouge Parish School Board 1998-99 Baton Rouge Louisiana public 262.0 60.0 192.0 2.0 8.0 0.0 NaN NaN 20.0 PK 5 132.0 East Baton Rouge 1.0
3 220054000342 Baker Heights Elementary School East Baton Rouge Parish School Board 1998-99 Baker Louisiana public 563.0 185.0 376.0 1.0 0.0 1.0 NaN NaN 35.0 PK 5 406.0 East Baton Rouge 3.0
4 220054000344 Baker High School East Baton Rouge Parish School Board 1998-99 Baker Louisiana public 1167.0 312.0 849.0 1.0 4.0 1.0 NaN NaN 72.0 9 12 510.0 East Baton Rouge 3.0

Combining the Data

Our last step in ETL is going to be to combine the two dataframes above into one larger dataframe. This will be useful for comparing schools that had shootings to those that didn't.

The first thing we do is get a version of the shootings data with only the columns we can compare to the schools data.

In [98]:
columns = ['nces_school_id', 'school_name', 'district_name', 'school_year', 'city', 'state', 'school_type',
          'enrollment', 'white', 'black', 'hispanic', 'asian', 'american_indian_alaska_native', 
          'hawaiian_native_pacific_islander', 'two_or_more', 'staffing', 'low_grade',
          'high_grade', 'lunch', 'county', 'ulocale']
trim_shootings = shootings_df[columns].copy()
trim_shootings.head()
Out[98]:
nces_school_id school_name district_name school_year city state school_type enrollment white black hispanic asian american_indian_alaska_native hawaiian_native_pacific_islander two_or_more staffing low_grade high_grade lunch county ulocale
0 080480000707 Columbine High School Jefferson County R-1 1998-99 Littleton Colorado public 1965.0 1783.0 16.0 112.0 42.0 12.0 NaN NaN 89.6 9 12 41.0 Jefferson County 21.0
1 220054000422 Scotlandville Middle School East Baton Rouge Parish School Board 1998-99 Baton Rouge Louisiana public 588.0 5.0 583.0 0.0 0.0 0.0 NaN NaN 39.0 6 8 495.0 East Baton Rouge Parish 12.0
2 130441001591 Heritage High School Rockdale County 1998-99 Conyers Georgia public 1369.0 1189.0 136.0 28.0 15.0 1.0 NaN NaN 84.0 9 12 125.0 Rockdale County 21.0
3 421899003847 John Bartram High School Philadelphia City SD 1999-00 Philadelphia Pennsylvania public 3147.0 209.0 2736.0 27.0 170.0 5.0 NaN NaN 41.0 9 12 2007.0 Philadelphia County 11.0
4 250279000225 Dorchester High School Boston 1999-00 Boston Massachusetts public 1116.0 40.0 755.0 287.0 29.0 5.0 NaN NaN NaN 9 12 543.0 Suffolk County 11.0

We also have to drop all of the shootings from the 2021-22 school year, since we don't have data on the other schools for that year.

In [99]:
trim_shootings = trim_shootings[trim_shootings['school_year'] != '2021-22']

We want a column indicating whether a shooting occurred for each dataset.

In [100]:
trim_shootings['shooting'] = 1
schools['shooting'] = 0

Now we concatenate the two dataframes, accounting for schools that appear in both sets because they had a shooting: we drop duplicates that share the same school id and school year, keeping the shooting record (which comes first in the concatenation).

In [101]:
all_schools = pd.concat([trim_shootings, schools]).drop_duplicates(subset=['nces_school_id', 'school_year'], keep='first').reset_index(drop=True)
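To illustrate the deduplication logic on a toy example (hypothetical rows, not our real data), the shooting record wins because trim_shootings comes first in the concat and keep='first' retains the earliest duplicate:

```python
import pandas as pd

# Hypothetical rows: school '001' appears in both frames for 1998-99
shootings_part = pd.DataFrame({'nces_school_id': ['001'],
                               'school_year': ['1998-99'],
                               'shooting': [1]})
schools_part = pd.DataFrame({'nces_school_id': ['001', '002'],
                             'school_year': ['1998-99', '1998-99'],
                             'shooting': [0, 0]})

combined = (pd.concat([shootings_part, schools_part])
              .drop_duplicates(subset=['nces_school_id', 'school_year'], keep='first')
              .reset_index(drop=True))
print(combined['shooting'].tolist())  # [1, 0] -- the shooting row is kept for school '001'
```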

Now from here we are ready to explore our data and gain some insight from it!

In [102]:
all_schools.head()
Out[102]:
nces_school_id school_name district_name school_year city state school_type enrollment white black hispanic asian american_indian_alaska_native hawaiian_native_pacific_islander two_or_more staffing low_grade high_grade lunch county ulocale shooting
0 080480000707 Columbine High School Jefferson County R-1 1998-99 Littleton Colorado public 1965.0 1783.0 16.0 112.0 42.0 12.0 NaN NaN 89.6 9 12 41.0 Jefferson County 21.0 1
1 220054000422 Scotlandville Middle School East Baton Rouge Parish School Board 1998-99 Baton Rouge Louisiana public 588.0 5.0 583.0 0.0 0.0 0.0 NaN NaN 39.0 6 8 495.0 East Baton Rouge Parish 12.0 1
2 130441001591 Heritage High School Rockdale County 1998-99 Conyers Georgia public 1369.0 1189.0 136.0 28.0 15.0 1.0 NaN NaN 84.0 9 12 125.0 Rockdale County 21.0 1
3 421899003847 John Bartram High School Philadelphia City SD 1999-00 Philadelphia Pennsylvania public 3147.0 209.0 2736.0 27.0 170.0 5.0 NaN NaN 41.0 9 12 2007.0 Philadelphia County 11.0 1
4 250279000225 Dorchester High School Boston 1999-00 Boston Massachusetts public 1116.0 40.0 755.0 287.0 29.0 5.0 NaN NaN NaN 9 12 543.0 Suffolk County 11.0 1

EDA

With our data now cleaned and organized, we can start to analyze it.

School Shootings Over the Years

First, we'll look at the yearly trend in school shootings since Columbine.

In [103]:
shootings_df['year'].value_counts().to_frame().sort_index(ascending=True).plot(xlabel="Year", ylabel="Shootings", figsize=(20,5), title='School Shootings Per Year 1998-2022', legend=False)
Out[103]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f971a21c340>

The above graph plots the number of shootings recorded per year. The trend shows an overall increase, with the largest spike in 2021 (42 shootings) and the fewest shootings in 2002 (5 shootings). This is a significant observation: the upward year-over-year trend means this is a growing issue that has yet to be properly addressed.

Shootings by Weapon Type

Next, we take a look at the weapons used in the school shootings and compare the most used weapon types to the number of casualties they cause.

In [104]:
weapons = shootings_df[["weapon", "casualties"]].copy()
weapons.fillna("", inplace=True)

# Simplify the weapon column
weapons.loc[weapons["weapon"].str.contains("shotgun"), "weapon"] = "Shotgun" 
weapons.loc[weapons["weapon"].str.contains("rifle"), "weapon"] = "Rifle"
weapons.loc[weapons["weapon"].str.contains("handgun"), "weapon"] = "Handgun"
weapons.loc[weapons["weapon"].str.contains("pistol"), "weapon"] = "Handgun"
weapons.loc[weapons["weapon"].str.contains("revolver"), "weapon"] = "Handgun"
weapons.loc[weapons["weapon"] == "", "weapon"] = "Unknown"
weapons.loc[~weapons["weapon"].isin(["Handgun", "Rifle", "Shotgun", "Unknown"]), "weapon"] = "Other"

# Specifies a 1 x 2 grid of plots, figsize in inches
figure, axes = plt.subplots(1, 2, figsize=(20, 8))
weapons.groupby("weapon").count().plot.bar(ylabel="Number of Shootings", xlabel="Weapon Type", color="maroon", legend=False , ax=axes[0])
weapons.groupby("weapon").sum().plot.bar(ylabel="Number of Casualties", xlabel="Weapon Type", color="maroon", legend=False , ax=axes[1])
Out[104]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9711161b50>

The left-hand graph shows that handguns were used in the greatest number of shootings, which may suggest that a handgun is more attainable than other weapon types. Rifles, shotguns, and other weapon types were used a similar number of times, so they may be similarly attainable; of these, the rifle was used least (aside from the "Other" category).

The right-hand graph shows the number of casualties by weapon type. Handguns again rank highest, most likely because they were involved in the greatest number of shootings. Shotguns were responsible for more casualties than rifles. Shootings with undocumented/unknown weapons also caused more casualties than weapons outside the handgun, rifle, and shotgun categories (other).

Shootings by State and Month

In [105]:
temp = shootings_df[['state', 'date']].copy()
temp['month'] = pd.DatetimeIndex(temp['date']).month #adds month column
temp.drop(columns='date', inplace=True)
month_abv = {'1': 'Jan', '2': 'Feb', '3': 'Mar', '4': 'Apr', '5': 'May', '6': 'Jun', 
             '7': 'Jul', '8': 'Aug', '9': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}
temp['month'] = temp['month'].astype(str).map(month_abv) #abbreviate the months 

fig, ax = plt.subplots(figsize=(32,32))  
location_month_df = temp.groupby('state').month.value_counts().to_frame().unstack().fillna(0) #construct dataframe of month to state shootings
location_month_df = location_month_df['month'][['Aug', 'Sep','Oct', 'Nov', 'Dec','Jan', 'Feb', 'Mar', 'Apr','May', 'Jun', 'Jul']] #reorder columns by standard school year 
#normalize the data 
normalized_df=(location_month_df-location_month_df.mean())/location_month_df.std()
sns.heatmap(normalized_df, ax=ax, vmin=-1.0, vmax=5.0) #create heatmap
ax.set_xlabel("Month")
ax.set_ylabel("State")
ax.set_title('Number of Shootings by Month and State')
plt.show()

This heatmap shows the number of school shootings by the month and state in which they occurred. The color scale ranges from -1 to 5 standard deviations from the mean number of shootings for that month. California had the most shootings of any state, with 36; within California, October had the greatest number. Multiple states share the minimum of one shooting. Overall, the month with the most shootings was January and the month with the fewest was July. Identifying the months with the most shootings can prove critical, since those can be times to focus even further on mental health and support for students.
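For reference, the normalization used for the heatmap is column-wise: subtracting df.mean() and dividing by df.std() standardizes each month's counts across states, so colors are comparable within a month. A small hypothetical example:

```python
import pandas as pd

# Hypothetical shooting counts: rows are states, columns are months
counts = pd.DataFrame({'Jan': [10, 2, 0], 'Jul': [1, 0, 2]},
                      index=['California', 'Texas', 'Vermont'])

# Same expression as in the heatmap cell: each column is standardized independently
normalized = (counts - counts.mean()) / counts.std()
print(normalized['Jan'].round(2).tolist())  # [1.13, -0.38, -0.76]
```

Each column of the result has mean 0, so a cell's color reflects how unusual a state is for that particular month.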

Free/Reduced Price Lunch vs. Shootings

Below, we will plot the number of school shootings in relation to the percentage of students at each school who are eligible for free or reduced-price lunch. We will normalize for the fact that there could be a large number of schools with a high percentage of eligible students. We hypothesize that there will be a larger number of school shootings at schools with a greater share of free and reduced-price lunch students.

In [106]:
# get subset of our dataset with relevant information
lunch = all_schools[['lunch','enrollment', 'shooting']].copy()
lunch = lunch.loc[lunch['enrollment'] > lunch['lunch']] #drop rows where the lunch count meets or exceeds enrollment (likely data errors)
lunch['pct'] = lunch['lunch'] / lunch['enrollment']
lunch['z-score'] = (lunch['pct'] - lunch['pct'].mean()) / lunch['pct'].std()
lunch.max()
Out[106]:
lunch         4748.000000
enrollment    5447.000000
shooting         1.000000
pct              0.999457
z-score          1.203218
dtype: float64

Now we create the bins that we'll use to separate the schools.

In [107]:
range_0 = lunch[(lunch['z-score'] > -2.50) & (lunch['z-score'] <= -2.00)].copy()
range_1 = lunch[(lunch['z-score'] > -2.00) & (lunch['z-score'] <= -1.50)].copy()
range_2 = lunch[(lunch['z-score'] > -1.50) & (lunch['z-score'] <= -1.00)].copy()
range_3 = lunch[(lunch['z-score'] > -1.00) & (lunch['z-score'] <= -0.50)].copy()
range_4 = lunch[(lunch['z-score'] > -0.50) & (lunch['z-score'] <= 0.0)].copy()
range_5 = lunch[(lunch['z-score'] > 0.0) & (lunch['z-score'] <= 0.50)].copy()
range_6 = lunch[(lunch['z-score'] > 0.50) & (lunch['z-score'] <= 1.00)].copy()
range_7 = lunch[(lunch['z-score'] > 1.00) & (lunch['z-score'] <= 1.50)].copy()

Then we add a column to identify which bin the school is in.

In [108]:
range_0['deviation'] = '-2.50->-2.00'
range_1['deviation'] = '-2.00->-1.50'
range_2['deviation'] = '-1.50->-1.00'
range_3['deviation'] = '-1.00->-0.50'
range_4['deviation'] = '-0.50->0.0'
range_5['deviation'] = '0.0->0.50'
range_6['deviation'] = '0.50->1.00'
range_7['deviation'] = '1.00->1.50'
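As a side note, the eight masks and labels above can be generated more compactly with pd.cut, which uses the same half-open (lo, hi] intervals as our comparisons. A minimal sketch on hypothetical z-score values standing in for lunch['z-score']:

```python
import numpy as np
import pandas as pd

# Hypothetical z-scores standing in for lunch['z-score'].
z = pd.Series([-2.3, -1.7, -0.2, 0.4, 1.2])
edges = np.arange(-2.5, 1.51, 0.5)  # -2.50, -2.00, ..., 1.50
labels = [f"{lo:.2f}->{hi:.2f}" for lo, hi in zip(edges[:-1], edges[1:])]
# pd.cut bins with right-closed intervals (lo, hi], matching the masks above
deviation = pd.cut(z, bins=edges, labels=labels)
print(deviation.tolist())
```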

Lastly, we graph the data.

In [109]:
# concatenate the dataframes back together
grouped_ranges = pd.concat([range_0, range_1, range_2, range_3, range_4, range_5, range_6, range_7])

# count the schools in each bin that had a shooting, then divide by the bin's total to express it as a fraction of the schools in the bin
shootings_in_bin = grouped_ranges[grouped_ranges['shooting'] == 1].deviation.value_counts()
all_in_bin = grouped_ranges.deviation.value_counts()
ranges_counts = shootings_in_bin / all_in_bin

# correct the ordering and plot
order = ['-2.50->-2.00', '-2.00->-1.50', '-1.50->-1.00', '-1.00->-0.50', '-0.50->0.0', '0.0->0.50', '0.50->1.00', '1.00->1.50']
ranges_counts = ranges_counts.reindex(order)
ranges_counts.plot.bar(ylabel='Shootings', xlabel='Free Lunch Students', color="forestgreen", rot=0, figsize=(20,5))
Out[109]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f97100a9f10>

The graph shows the relationship between the percentage of students who received free or reduced-price lunches for financial reasons and the rate of shootings. From analyzing the graph, it seems that the percentage of free or reduced-price lunch students may not be correlated with the number of shootings. We do see, however, that schools above the mean have lower rates of shootings than schools below the mean. This goes against our original hypothesis that schools with more low-income students would have more shootings.
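One way to quantify "may not be correlated" is the point-biserial correlation, which is just the Pearson correlation between a binary indicator and a continuous variable. A sketch on synthetic data standing in for lunch['pct'] and lunch['shooting'] (the values below are made up, not from our dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the real lunch['pct'] and lunch['shooting'] columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pct": rng.uniform(0, 1, 500),
    "shooting": (rng.uniform(size=500) < 0.05).astype(int),  # rare events
})
# Pearson r between a 0/1 indicator and a continuous column is the
# point-biserial correlation; values near 0 indicate little association.
r = df["pct"].corr(df["shooting"])
print(round(r, 3))
```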

Race vs. Shootings

The next bar graph is similar to the last, but uses the ratio of white students to total students. We focused on this group specifically since much of the dialogue around school shootings centers on the race of the people involved.

In [110]:
# get subset of our dataset with relevant information
student = all_schools[['white', 'enrollment', 'shooting']].copy()
student['white_pct'] = student['white'] / student['enrollment']
student['z-score'] = (student['white_pct'] - student['white_pct'].mean()) / student['white_pct'].std()

# bin the ratios of white students - min was -0.8, max was 2.9
range_0 = student[(student['z-score'] > -1.00) & (student['z-score'] <= -0.50)].copy()
range_1 = student[(student['z-score'] > -0.50) & (student['z-score'] <= 0.0)].copy()
range_2 = student[(student['z-score'] > 0.0) & (student['z-score'] <= 0.50)].copy()
range_3 = student[(student['z-score'] > 0.50) & (student['z-score'] <= 1.0)].copy()
range_4 = student[(student['z-score'] > 1.0) & (student['z-score'] <= 1.50)].copy()
range_5 = student[(student['z-score'] > 1.50) & (student['z-score'] <= 2.00)].copy()
range_6 = student[(student['z-score'] > 2.00) & (student['z-score'] <= 2.50)].copy()
range_7 = student[(student['z-score'] > 2.50) & (student['z-score'] <= 3.00)].copy()

range_0['deviation'] = '-1.0->-0.5'
range_1['deviation'] = '-0.5->0.0'
range_2['deviation'] = '0.0->0.5'
range_3['deviation'] = '0.5->1.0'
range_4['deviation'] = '1.0->1.5'
range_5['deviation'] = '1.5->2.0'
range_6['deviation'] = '2.0->2.5'
range_7['deviation'] = '2.5->3.0'

# concatenate the dataframes back together
grouped_ranges = pd.concat([range_0, range_1, range_2, range_3, range_4, range_5, range_6, range_7])

# count the schools in each bin that had a shooting, then divide by the bin's total to express it as a fraction of the schools in the bin
shootings_in_bin = grouped_ranges[grouped_ranges['shooting'] == 1].deviation.value_counts()
all_in_bin = grouped_ranges.deviation.value_counts()
white_ranges = shootings_in_bin / all_in_bin

# correct the ordering and plot
order = ['-1.0->-0.5', '-0.5->0.0', '0.0->0.5', '0.5->1.0', '1.0->1.5', '1.5->2.0', '2.0->2.5', '2.5->3.0']
white_ranges = white_ranges.reindex(order)

white_ranges.plot.bar(ylabel='Shootings', xlabel='White-Student Ratio Z-Score', color="orange", rot=0, figsize=(20,5))
Out[110]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f97103bd880>

According to the shootings dataset, the largest number of shootings was committed by white individuals, so we took a deeper look at the ratio of white students in each school's population. This graph shows that there isn't much difference in the rate of school shootings across most levels of white enrollment. However, this changes in the two bins of schools that are more than 2 standard deviations above the mean percentage, where the rate of shootings is much higher. This suggests that schools with a predominantly white population have experienced a greater rate of school shootings.

Predictive Modeling

From here, we will build a model to try to determine whether there is some correlation between student demographics and school location and the likelihood of a shooting occurring at a school. We want to identify any predictive attributes that could possibly be used to prevent more shootings.

To do this we will use the district, city, state, enrollment, staffing, and lunch data to predict if a shooting will occur.

Choosing Best K-Value

First, we will evaluate the model with different values of k using 10-fold cross-validation.

In [111]:
# list the variables we will use for the model
features = ['district_name','city', 'state', 'enrollment', 'white', 'black',
            'hispanic', 'asian', 'staffing', 'lunch', 'shooting']
# load the data and drop rows with missing values
model_df = all_schools[features].dropna()

# separate the sets of data
x_dict = model_df[features[:-1]].to_dict(orient='records')
y = model_df['shooting']

Here we create a function that returns the cross-validated mean squared error (MSE) of a k-nearest neighbors model for a given k.

In [125]:
def get_cv_error(k):
  vec = DictVectorizer(sparse=False)
  scaler = StandardScaler()
  model = KNeighborsRegressor(n_neighbors=k)
  pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
  mse = np.mean(-cross_val_score(pipeline, x_dict, y, cv=10, scoring="neg_mean_squared_error"))
  
  return mse

Then we test different k values and plot the result.

In [127]:
# took about 15 minutes to run for us
ks = pd.Series(range(1, 31))
ks.index = range(1, 31)
test_errs = ks.apply(get_cv_error)

test_errs.plot.line()
Out[127]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f970e052280>

We see from this graph that the MSE bottoms out around k = 10, so we will use k = 11, a value in that neighborhood, for our model.
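The manual loop over k values can also be expressed with scikit-learn's GridSearchCV, which runs the same cross-validation for every candidate k and records the best one. A sketch on synthetic data standing in for the vectorized school features:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic features/labels standing in for the vectorized school data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (rng.uniform(size=200) < 0.05).astype(float)  # rare positives, like shootings

# Try k = 1..30 with 10-fold CV, scoring by (negated) mean squared error.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": range(1, 31)},
    scoring="neg_mean_squared_error",
    cv=10,
)
search.fit(X, y)
print(search.best_params_["n_neighbors"])
```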

Choosing Best Parameters

Now we want to pick the best parameters for our model. To do this we'll test the model by dropping each of our columns one by one.

In [129]:
vec = DictVectorizer(sparse=False)
scaler = StandardScaler()
model = KNeighborsRegressor(n_neighbors=11)
pipeline = Pipeline([("vectorizer", vec), ("scaler", scaler), ("fit", model)])
In [131]:
features = ['district_name','city', 'state', 'enrollment', 'white', 'black',
            'hispanic', 'asian', 'staffing', 'lunch', 'shooting']

# baseline for comparison
rmse = np.sqrt(np.mean(-cross_val_score(pipeline, x_dict, y, cv=10, scoring="neg_mean_squared_error")))
print('baseline rmse = ', rmse)

for i in range(len(features[:-1])):
  copy = features[:-1]
  del copy[i]
  x_dict = model_df[copy].to_dict(orient="records")
  rmse = np.sqrt(np.mean(-cross_val_score(pipeline, x_dict, y, cv=10, scoring="neg_mean_squared_error")))
  print('no ' + features[i] + ' rmse = ', rmse)
baseline rmse =  0.12448563503411578
no district_name rmse =  0.12514971978958883
no city rmse =  0.12493600142397653
no state rmse =  0.12381710761674042
no enrollment rmse =  0.12447665352945511
no white rmse =  0.12445329274727257
no black rmse =  0.12454669694931136
no hispanic rmse =  0.12445508943723166
no asian rmse =  0.12449642462602131
no staffing rmse =  0.12442093329610496
no lunch rmse =  0.12448563503411578

We see here that removing any one variable from the features doesn't have much of an impact on the rmse. This makes sense because we didn't find much correlation before between any of the variables and shootings happening. Our model doesn't have much predictive power, suggesting the randomness of school shootings. Maybe if we had data on city mental health statistics or access to mental health help in schools we could find a stronger correlation there. For now, we can take away that school demographics are unlikely to hold much, if any, weight in determining if a shooting is going to happen.