Introduction

Every year, 1.3 million people around the world die as a result of road traffic crashes, and 20-50 million more suffer non-fatal injuries, with many incurring a disability as a result.1 In the USA alone, there were 33,244 fatal motor vehicle crashes in 2019, in which 36,096 deaths occurred.2

In this project, we study the 'US Accidents' dataset provided by Sobhan Moosavi on https://www.kaggle.com. Analysing this data can yield interesting insights about road accidents in the USA, such as the impact of environmental conditions on the occurrence of accidents, how the number of accidents changes from month to month, or which 5 states have the highest (or lowest) number of accidents. [Note: This dataset covers 49 mainland states of the USA (excluding Alaska) for the period February 2016 to December 2020.]

We will study this data using Exploratory Data Analysis (EDA), an approach for analysing datasets to summarize their main characteristics, often using statistical graphics and other data visualization methods. EDA can help in seeing what the data can tell us beyond the formal modeling or hypothesis testing task.3

Because this dataset is huge, with dozens of features, it can be used to answer a lot of questions. For this project, we are going to limit ourselves to 5-10 questions and then try to answer these questions.

Setup

Import the necessary libraries and get the file_path for the dataset.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization

import os
for dirname, _, filenames in os.walk('US_Accidents_Dec20_updated.csv/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
US_Accidents_Dec20_updated.csv/US_Accidents_Dec20_updated.csv
# Read the file
file_path = "US_Accidents_Dec20_updated.csv/US_Accidents_Dec20_updated.csv"
us_accidents = pd.read_csv(file_path)
pd.set_option("display.max_columns", None)
us_accidents.head()
ID Severity Start_Time End_Time Start_Lat Start_Lng End_Lat End_Lng Distance(mi) Description Number Street Side City County State Zipcode Country Timezone Airport_Code Weather_Timestamp Temperature(F) Wind_Chill(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Precipitation(in) Weather_Condition Amenity Bump Crossing Give_Way Junction No_Exit Railway Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
0 A-2716600 3 2016-02-08 00:37:08 2016-02-08 06:37:08 40.10891 -83.09286 40.11206 -83.03187 3.230 Between Sawmill Rd/Exit 20 and OH-315/Olentang... NaN Outerbelt E R Dublin Franklin OH 43017 US US/Eastern KOSU 2016-02-08 00:53:00 42.1 36.1 58.0 29.76 10.0 SW 10.4 0.00 Light Rain False False False False False False False False False False False False False Night Night Night Night
1 A-2716601 2 2016-02-08 05:56:20 2016-02-08 11:56:20 39.86542 -84.06280 39.86501 -84.04873 0.747 At OH-4/OH-235/Exit 41 - Accident. NaN I-70 E R Dayton Montgomery OH 45424 US US/Eastern KFFO 2016-02-08 05:58:00 36.9 NaN 91.0 29.68 10.0 Calm NaN 0.02 Light Rain False False False False False False False False False False False False False Night Night Night Night
2 A-2716602 2 2016-02-08 06:15:39 2016-02-08 12:15:39 39.10266 -84.52468 39.10209 -84.52396 0.055 At I-71/US-50/Exit 1 - Accident. NaN I-75 S R Cincinnati Hamilton OH 45203 US US/Eastern KLUK 2016-02-08 05:53:00 36.0 NaN 97.0 29.70 10.0 Calm NaN 0.02 Overcast False False False False True False False False False False False False False Night Night Night Day
3 A-2716603 2 2016-02-08 06:15:39 2016-02-08 12:15:39 39.10148 -84.52341 39.09841 -84.52241 0.219 At I-71/US-50/Exit 1 - Accident. NaN US-50 E R Cincinnati Hamilton OH 45202 US US/Eastern KLUK 2016-02-08 05:53:00 36.0 NaN 97.0 29.70 10.0 Calm NaN 0.02 Overcast False False False False True False False False False False False False False Night Night Night Day
4 A-2716604 2 2016-02-08 06:51:45 2016-02-08 12:51:45 41.06213 -81.53784 41.06217 -81.53547 0.123 At Dart Ave/Exit 21 - Accident. NaN I-77 N R Akron Summit OH 44311 US US/Eastern KAKR 2016-02-08 06:54:00 39.0 NaN 55.0 29.65 10.0 Calm NaN NaN Overcast False False False False False False False False False False False False False Night Night Day Day

Data Preparation and Cleaning

Preliminary analysis of the dataset

us_accidents.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1516064 entries, 0 to 1516063
Data columns (total 47 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   ID                     1516064 non-null  object 
 1   Severity               1516064 non-null  int64  
 2   Start_Time             1516064 non-null  object 
 3   End_Time               1516064 non-null  object 
 4   Start_Lat              1516064 non-null  float64
 5   Start_Lng              1516064 non-null  float64
 6   End_Lat                1516064 non-null  float64
 7   End_Lng                1516064 non-null  float64
 8   Distance(mi)           1516064 non-null  float64
 9   Description            1516064 non-null  object 
 10  Number                 469969 non-null   float64
 11  Street                 1516064 non-null  object 
 12  Side                   1516064 non-null  object 
 13  City                   1515981 non-null  object 
 14  County                 1516064 non-null  object 
 15  State                  1516064 non-null  object 
 16  Zipcode                1515129 non-null  object 
 17  Country                1516064 non-null  object 
 18  Timezone               1513762 non-null  object 
 19  Airport_Code           1511816 non-null  object 
 20  Weather_Timestamp      1485800 non-null  object 
 21  Temperature(F)         1473031 non-null  float64
 22  Wind_Chill(F)          1066748 non-null  float64
 23  Humidity(%)            1470555 non-null  float64
 24  Pressure(in)           1479790 non-null  float64
 25  Visibility(mi)         1471853 non-null  float64
 26  Wind_Direction         1474206 non-null  object 
 27  Wind_Speed(mph)        1387202 non-null  float64
 28  Precipitation(in)      1005515 non-null  float64
 29  Weather_Condition      1472057 non-null  object 
 30  Amenity                1516064 non-null  bool   
 31  Bump                   1516064 non-null  bool   
 32  Crossing               1516064 non-null  bool   
 33  Give_Way               1516064 non-null  bool   
 34  Junction               1516064 non-null  bool   
 35  No_Exit                1516064 non-null  bool   
 36  Railway                1516064 non-null  bool   
 37  Roundabout             1516064 non-null  bool   
 38  Station                1516064 non-null  bool   
 39  Stop                   1516064 non-null  bool   
 40  Traffic_Calming        1516064 non-null  bool   
 41  Traffic_Signal         1516064 non-null  bool   
 42  Turning_Loop           1516064 non-null  bool   
 43  Sunrise_Sunset         1515981 non-null  object 
 44  Civil_Twilight         1515981 non-null  object 
 45  Nautical_Twilight      1515981 non-null  object 
 46  Astronomical_Twilight  1515981 non-null  object 
dtypes: bool(13), float64(13), int64(1), object(20)
memory usage: 412.1+ MB
# basic statistics
us_accidents.describe()
Severity Start_Lat Start_Lng End_Lat End_Lng Distance(mi) Number Temperature(F) Wind_Chill(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Speed(mph) Precipitation(in)
count 1.516064e+06 1.516064e+06 1.516064e+06 1.516064e+06 1.516064e+06 1.516064e+06 4.699690e+05 1.473031e+06 1.066748e+06 1.470555e+06 1.479790e+06 1.471853e+06 1.387202e+06 1.005515e+06
mean 2.238630e+00 3.690056e+01 -9.859919e+01 3.690061e+01 -9.859901e+01 5.872617e-01 8.907533e+03 5.958460e+01 5.510976e+01 6.465960e+01 2.955495e+01 9.131755e+00 7.630812e+00 8.477855e-03
std 6.081481e-01 5.165653e+00 1.849602e+01 5.165629e+00 1.849590e+01 1.632659e+00 2.242190e+04 1.827316e+01 2.112735e+01 2.325986e+01 1.016756e+00 2.889112e+00 5.637364e+00 1.293168e-01
min 1.000000e+00 2.457022e+01 -1.244976e+02 2.457011e+01 -1.244978e+02 0.000000e+00 0.000000e+00 -8.900000e+01 -8.900000e+01 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 2.000000e+00 3.385422e+01 -1.182076e+02 3.385420e+01 -1.182077e+02 0.000000e+00 1.212000e+03 4.700000e+01 4.080000e+01 4.800000e+01 2.944000e+01 1.000000e+01 4.600000e+00 0.000000e+00
50% 2.000000e+00 3.735113e+01 -9.438100e+01 3.735134e+01 -9.437987e+01 1.780000e-01 4.000000e+03 6.100000e+01 5.700000e+01 6.800000e+01 2.988000e+01 1.000000e+01 7.000000e+00 0.000000e+00
75% 2.000000e+00 4.072593e+01 -8.087469e+01 4.072593e+01 -8.087449e+01 5.940000e-01 1.010000e+04 7.300000e+01 7.100000e+01 8.400000e+01 3.004000e+01 1.000000e+01 1.040000e+01 0.000000e+00
max 4.000000e+00 4.900058e+01 -6.711317e+01 4.907500e+01 -6.710924e+01 1.551860e+02 9.999997e+06 1.706000e+02 1.130000e+02 1.000000e+02 5.804000e+01 1.400000e+02 9.840000e+02 2.400000e+01

Number of null values in the dataset, column-wise

null_val_cols = us_accidents.isnull().sum()
null_val_cols = null_val_cols[null_val_cols > 0].sort_values(ascending = False)
null_val_cols
Number                   1046095
Precipitation(in)         510549
Wind_Chill(F)             449316
Wind_Speed(mph)           128862
Humidity(%)                45509
Visibility(mi)             44211
Weather_Condition          44007
Temperature(F)             43033
Wind_Direction             41858
Pressure(in)               36274
Weather_Timestamp          30264
Airport_Code                4248
Timezone                    2302
Zipcode                      935
City                          83
Sunrise_Sunset                83
Civil_Twilight                83
Nautical_Twilight             83
Astronomical_Twilight         83
dtype: int64
# plot percentage of missing values
missing_percentage = null_val_cols / len(us_accidents) * 100
missing_percentage.plot(kind = 'barh', figsize = (14, 8))
plt.xlabel("Percentage of missing values")
plt.ylabel("Columns")
plt.show()

Observations

  • The Number column represents the street number in address record.
  • It has over a million null values.

Because we already have the location data from the Start_Lat, Start_Lng, County and State features, we can drop this column from the dataset. With regard to the columns representing weather conditions, we can drop those with a significant number of null values, and impute the others with the mean values from the same State and month, which should represent similar weather conditions.

us_accidents.drop(labels = ["Number", "Precipitation(in)"], axis = 1, inplace = True)
us_accidents.shape
(1516064, 45)

We'll also convert the Start_Time column to the datetime dtype, so that we can apply datetime functions to it later on.

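# pd.to_datetime parses the strings; a unit-less astype("datetime64") is rejected by recent pandas versions
us_accidents["Start_Time"] = pd.to_datetime(us_accidents["Start_Time"])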
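As a quick illustration of the datetime functions this conversion unlocks, here is a small sketch on two toy timestamps (the first is taken from the first row of the dataset, the second is made up). The variable names raw and parsed are assumptions for this example only.

```python
import pandas as pd

# Two timestamps in the same string format as Start_Time.
raw = pd.Series(["2016-02-08 00:37:08", "2016-02-13 14:05:00"], name="Start_Time")
parsed = pd.to_datetime(raw)

# The .dt accessor exposes calendar components of each timestamp.
print(parsed.dt.hour.tolist())       # hour of the day, 0-23
print(parsed.dt.dayofweek.tolist())  # Monday=0 ... Sunday=6
print(parsed.dt.month.tolist())      # 1-12
print(parsed.dt.year.tolist())
```

These are the same accessors used in the questions that follow (hour, dayofweek, month and year).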

Exploratory Analysis and Visualization

With these columns in mind, we will try to answer the following questions from the dataset:

  1. Are there more accidents in warmer or colder areas?
  2. Which 5 states have the highest number of accidents?
  3. Among the top 100 cities in number of accidents, which states do they belong to most frequently?
  4. What time of the day are accidents most frequent in?
  5. Which days of the week have the most accidents?
  6. Which months have the most accidents?
  7. What is the trend of accidents year over year (decreasing/increasing)?
  8. How does accident severity change with change in precipitation?

We can start by plotting the longitude and latitude data to get a sense of where the data is concentrated.

lat, long = us_accidents.Start_Lat, us_accidents.Start_Lng
# plot locations of accidents
plt.figure(figsize = (10, 6))
sns.scatterplot(x = long, y = lat, s = 0.5)
<AxesSubplot:xlabel='Start_Lng', ylabel='Start_Lat'>

It appears that accidents are more concentrated near the coasts, which are the more populated areas of the US. There seems to be a sharp decline in the central and mid-western US. Apart from a genuine lack of accidents in those areas, this could also signal poor data collection in those states.

Q1. Are there more accidents in warmer or colder areas?

First, we'll plot a histogram for the temperature data.

us_accidents["Temperature(F)"].plot(kind = "hist", logy = True)
plt.xlabel("Temperature(F)")
plt.show()
print(f'''Mean: {us_accidents["Temperature(F)"].mean()} 
Skewness: {us_accidents["Temperature(F)"].skew()}
Kurtosis: {us_accidents["Temperature(F)"].kurtosis()}''')
Mean: 58.5816432896377 
Skewness: -0.30533266112446805
Kurtosis: -0.13261978426888454

Observations

  • The data is roughly symmetric and close to normally distributed, with very low skewness and kurtosis values.
  • Thus, taken on its own, the data doesn't support the idea that warmer areas have more accidents than colder ones or vice versa.
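For reference when reading the skewness and kurtosis values above, the sketch below computes the same two statistics on synthetic samples. The samples, the seed and the variable names are assumptions for illustration; pandas reports excess kurtosis, so a normal distribution scores about 0 on both measures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic reference samples, not taken from the accidents dataset.
normal_sample = pd.Series(rng.normal(loc=60, scale=18, size=100_000))
skewed_sample = pd.Series(rng.exponential(scale=18, size=100_000))

for name, sample in [("normal", normal_sample), ("exponential", skewed_sample)]:
    # Series.skew() and Series.kurtosis() are the same calls used on Temperature(F)
    print(f"{name}: skew={sample.skew():.2f}, excess kurtosis={sample.kurtosis():.2f}")
```

Values near zero, as for the normal sample, are what "close to normal" looks like; a strongly skewed sample such as the exponential one gives clearly larger values.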

Q2. Which 5 states have the highest number of accidents?

accidents_by_states = us_accidents.State.value_counts()
print("The top 5 states in terms of number of accidents are:")
accidents_by_states.head()
The top 5 states in terms of number of accidents are:
CA    448833
FL    153007
OR     87484
TX     75142
NY     60974
Name: State, dtype: int64
plt.figure(figsize = (16, 5))
sns.barplot(x = accidents_by_states.index, y = accidents_by_states.values)
<AxesSubplot:>
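To put the top-5 counts in perspective, they can be expressed as a share of all records. The sketch below reuses the counts printed above and the total row count reported by info(); the variable names are assumptions for this example.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
top5 = pd.Series({"CA": 448833, "FL": 153007, "OR": 87484, "TX": 75142, "NY": 60974})
total_rows = 1516064  # number of entries reported by us_accidents.info()

share_pct = (top5 / total_rows * 100).round(1)
combined_pct = round(top5.sum() / total_rows * 100, 1)

print(share_pct)
print(f"Top 5 states combined: {combined_pct}% of all records")
```

On the full DataFrame, us_accidents.State.value_counts(normalize = True) should give the same shares directly.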

Q3. Among the top 100 cities in number of accidents, which states do they belong to most frequently?

city_accidents = us_accidents.City.value_counts()
city_accidents
Los Angeles                     39984
Miami                           36233
Charlotte                       22203
Houston                         20843
Dallas                          19497
                                ...  
Manzanita                           1
West Brooklyn                       1
Garfield Heights                    1
Belding                             1
American Fork-Pleasant Grove        1
Name: City, Length: 10657, dtype: int64
city_states = us_accidents.groupby("City")["State"].aggregate(pd.Series.mode)
city_states
City
Aaronsburg      PA
Abbeville       SC
Abbotsford      WI
Abbottstown     PA
Aberdeen        MD
                ..
Zortman         MT
Zumbro Falls    MN
Zumbrota        MN
Zuni            VA
Zwingle         IA
Name: State, Length: 8769, dtype: object
top_100_cities = pd.concat([city_accidents, city_states], axis = 1)[:100]
top_100_cities
                City State
Los Angeles    27760    CA
Miami          26831    FL
Orlando        10772    FL
Dallas         10522    TX
Charlotte      10312    NC
...              ...   ...
Flint           1465    MI
Hollywood       1431    FL
Eugene          1427    OR
Silver Spring   1425    MD
Birmingham      1422    AL

100 rows × 2 columns

#Most frequent states in the top 100 cities in the number of accidents
most_freq_states = top_100_cities.State.value_counts()
most_freq_states
CA    35
FL    13
TX     5
NY     4
OR     4
MI     3
VA     3
LA     3
PA     3
SC     2
MO     2
UT     2
AZ     2
TN     2
MN     2
NC     2
OH     2
OK     1
NJ     1
MD     1
KY     1
IN     1
CO     1
DC     1
WA     1
IL     1
GA     1
AL     1
Name: State, dtype: int64
# plot
most_freq_states.plot(kind = "bar", figsize = (15,5))
plt.xlabel("States")
plt.ylabel("Frequency in top 100 cities by number of accidents")
plt.title("Most frequent states in the top 100 cities in the number of accidents")
plt.show()
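One caveat with counting by city name alone is that the same name can exist in several states, so their accidents get pooled under one label. A possible alternative, sketched here on made-up toy rows, is to count (City, State) pairs instead. The toy frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: "Springfield" appears in two different states.
toy = pd.DataFrame({
    "City": ["Springfield", "Springfield", "Springfield", "Miami", "Miami", "Dayton"],
    "State": ["IL", "MO", "MO", "FL", "FL", "OH"],
})

# Counting by name alone pools both Springfields together.
by_name = toy["City"].value_counts()

# Counting (City, State) pairs keeps them apart.
pair_counts = toy.groupby(["City", "State"]).size().sort_values(ascending=False)
top_pairs = pair_counts.head(2)  # would be head(100) on the real data

# States that the top pairs belong to.
states_in_top = top_pairs.reset_index()["State"].value_counts()

print(by_name)
print(top_pairs)
print(states_in_top)
```

In this toy example, Springfield ranks first by name only because two different cities are merged; counted as pairs, it no longer outranks Miami.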

Q4. What time of the day are accidents most frequent in?

Do more accidents tend to occur at a particular time of the day?

# plot
us_accidents.Start_Time.dt.hour.plot(kind = 'hist', bins = 24)
plt.xlabel("Hour of the day")
plt.show()

Observations

It appears that accidents occur more often in the morning between 7:00 and 9:00, and in the afternoon between 13:00 and 17:00. Office hours could be the reason behind this pattern. We can examine this by separating the data for weekdays and weekends.

# plot data for weekdays (dayofweek: Monday=0 ... Sunday=6)
weekdays = us_accidents.Start_Time.dt.dayofweek < 5
us_accidents.loc[weekdays].Start_Time.dt.hour.plot(kind = 'hist', bins = 24)
plt.xlabel("Hour of the day")
plt.title("Accident frequency on weekdays")
plt.show()
# plot data for weekends
us_accidents.loc[~weekdays].Start_Time.dt.hour.plot(kind = 'hist', bins = 24)
plt.xlabel("Hour of the day")
plt.title("Accident frequency on weekends")
plt.show()

Observations

  • On weekends, the pattern we saw earlier breaks down, which supports the idea that it is driven by traffic around office hours, since offices are usually closed on weekends.
  • On weekends, accidents tend to occur more during daylight hours, which is probably due to heavier traffic during those hours.
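Because weekdays contribute far more records than weekends, comparing raw histograms can hide differences in shape. One way to compare the two patterns on an equal footing is to look at the share of accidents per hour within each group. The following is a minimal sketch on made-up timestamps; the toy data and the column names hour and day_type are assumptions for this example.

```python
import pandas as pd

# Made-up timestamps: three on weekdays, three on a weekend.
ts = pd.to_datetime(pd.Series([
    "2016-02-08 08:00:00", "2016-02-08 08:30:00", "2016-02-09 17:00:00",
    "2016-02-13 14:00:00", "2016-02-14 14:30:00", "2016-02-14 02:00:00",
]))

toy = pd.DataFrame({
    "hour": ts.dt.hour,
    # dayofweek < 5 means Monday to Friday
    "day_type": ts.dt.dayofweek.lt(5).map({True: "weekday", False: "weekend"}),
})

# normalize="index" turns each row into fractions that sum to 1
hourly_share = pd.crosstab(toy["day_type"], toy["hour"], normalize="index")
print(hourly_share)
```

Each row can then be plotted as its own bar chart, and the weekday and weekend curves become directly comparable.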

Q5. Which days of the week have the most accidents?

sns.histplot(us_accidents.Start_Time.dt.dayofweek, bins = 7)
plt.xlabel("Day of Week")
plt.show()

Observations

  • Accidents occur more on weekdays and there is a sharp drop in their count on the weekends.
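The histogram labels the days 0 to 6 (Monday to Sunday). If named days are easier to read, the counts can be tabulated with day names instead. This is a small sketch on made-up timestamps; the variable names are assumptions for this example.

```python
import pandas as pd

# Made-up timestamps: two Mondays, one Tuesday, one Saturday.
ts = pd.to_datetime(pd.Series([
    "2016-02-08 08:00:00", "2016-02-09 09:00:00",
    "2016-02-13 14:00:00", "2016-02-15 17:30:00",
]))

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# reindex puts the days in calendar order and fills absent days with 0
counts = ts.dt.day_name().value_counts().reindex(day_order, fill_value=0)
print(counts)
```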

Q6. Which months have the most accidents?

sns.histplot(us_accidents.Start_Time.dt.month, bins = 12)
plt.xlabel("Month")
plt.show()

Observations

  • Later months of the year appear to witness more accidents.

This cannot be due to temperature (the winter season) alone, because the count drops suddenly in the first months of the year. With no other apparent cause, this calls for further analysis.

We can start by looking at the trend year over year.

fig, axes = plt.subplots(2, 3, figsize = (15,10), )
year = 2016
for r in range(2):
    for c in range(3):
        if year < 2021:
            year_data = us_accidents.loc[us_accidents.Start_Time.dt.year == year]
            sns.histplot(year_data.Start_Time.dt.month, bins = 12, ax = axes[r, c])
            axes[r, c].set_title(year)
            axes[r, c].set_xlabel("Month of the year")
            year += 1

Observations

  • Probably because data collection only began in 2016, the first months of that year have far fewer datapoints.
  • Also, in 2020, because of the coronavirus pandemic lockdown restrictions, there was a sudden drop in the middle of the year. As the restrictions eased, more accidents started to occur.

The month-wise trends in this data still require further analysis.
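One possible starting point for that month-wise analysis is a year-by-month table of counts, which makes gaps in coverage (such as a missing month) visible as zeros. The sketch below uses made-up timestamps; the toy data and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up timestamps spread over two years.
ts = pd.to_datetime(pd.Series([
    "2016-02-08 00:37:08", "2016-02-20 10:00:00", "2016-12-01 09:00:00",
    "2017-01-15 08:00:00", "2017-12-24 18:00:00", "2017-12-31 23:00:00",
]))

year = ts.dt.year.rename("year")
month = ts.dt.month.rename("month")

# Rows are years, columns are months, cells are accident counts.
table = ts.groupby([year, month]).size().unstack(fill_value=0)
print(table)
```

On the real data, the same table could be passed to sns.heatmap to compare months across years at a glance.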

Q7. What is the trend of accidents year over year (decreasing/increasing)?

us_accidents.Start_Time.dt.year.plot(kind = 'hist', bins = 5, title = "Year wise trend", 
                                     xticks = np.arange(2016, 2021), figsize = (7, 5))
plt.show()

Observations

  • The number of accidents year over year looks to be increasing.

This could be attributed to better data collection in the later years, so it needs further analysis.

Summary and Conclusion

Many questions could be asked and explored, but here we analysed this dataset to answer 7 questions about road accidents in the USA. To answer them, we first did some basic data preparation and then analysed the data. It is important to note that the findings are correlational in nature and do not establish a causal relationship in any way. These findings are:

  • The number of accidents doesn't differ much between warmer and colder temperatures.
  • The top 5 states in number of accidents are California, Florida, Oregon, Texas and New York.
  • These 5 states are also the most frequent states among the top 100 cities by number of road accidents.
  • The number of accidents looks to be strongly associated with office hours.
  • This idea is also supported by the fact that weekdays see more accidents than weekends.
  • Although the number of accidents looks to be increasing over the years, this could be attributed to better data collection in the later years.

Thanks for reading.