Introduction

Every year, about 1.3 million people die in road traffic crashes around the world, and 20 to 50 million more suffer non-fatal injuries, many of which result in a disability.¹ In the USA alone, there were 33,244 fatal motor vehicle crashes in 2019, in which 36,096 deaths occurred.²

In this project, we study the 'US Accidents' dataset provided by Sobhan Moosavi on https://www.kaggle.com. We can analyse this data to gain insights about road accidents in the USA, such as the impact of environmental conditions on the occurrence of accidents, how the number of accidents changes from month to month, or which 5 states have the highest (or lowest) number of accidents. [Note: This dataset covers 49 mainland states of the USA (excluding Alaska) for the period February 2016 to December 2020.]

We will use Exploratory Data Analysis (EDA) to study this data. EDA is an approach to analysing datasets to summarize their main characteristics, often using statistical graphics and other data visualization methods. It can help us see what the data can tell us beyond formal modeling or hypothesis testing.³

Because this dataset is huge, with dozens of features, it can be used to answer many questions. For this project, we limit ourselves to 5-10 questions and try to answer them.

Setup

Import the necessary libraries and get the file_path for the dataset.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization

import os
for dirname, _, filenames in os.walk('US_Accidents_Dec20_updated.csv/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

US_Accidents_Dec20_updated.csv/US_Accidents_Dec20_updated.csv

# Read the file
file_path = "US_Accidents_Dec20_updated.csv/US_Accidents_Dec20_updated.csv"
us_accidents = pd.read_csv(file_path)

pd.set_option("display.max_columns", None)
us_accidents.head()

Data Preparation and Cleaning

Preliminary analysis of the dataset

us_accidents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1516064 entries, 0 to 1516063
Data columns (total 47 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   ID                     1516064 non-null  object 
 1   Severity               1516064 non-null  int64  
 2   Start_Time             1516064 non-null  object 
 3   End_Time               1516064 non-null  object 
 4   Start_Lat              1516064 non-null  float64
 5   Start_Lng              1516064 non-null  float64
 6   End_Lat                1516064 non-null  float64
 7   End_Lng                1516064 non-null  float64
 8   Distance(mi)           1516064 non-null  float64
 9   Description            1516064 non-null  object 
 10  Number                 469969 non-null   float64
 11  Street                 1516064 non-null  object 
 12  Side                   1516064 non-null  object 
 13  City                   1515981 non-null  object 
 14  County                 1516064 non-null  object 
 15  State                  1516064 non-null  object 
 16  Zipcode                1515129 non-null  object 
 17  Country                1516064 non-null  object 
 18  Timezone               1513762 non-null  object 
 19  Airport_Code           1511816 non-null  object 
 20  Weather_Timestamp      1485800 non-null  object 
 21  Temperature(F)         1473031 non-null  float64
 22  Wind_Chill(F)          1066748 non-null  float64
 23  Humidity(%)            1470555 non-null  float64
 24  Pressure(in)           1479790 non-null  float64
 25  Visibility(mi)         1471853 non-null  float64
 26  Wind_Direction         1474206 non-null  object 
 27  Wind_Speed(mph)        1387202 non-null  float64
 28  Precipitation(in)      1005515 non-null  float64
 29  Weather_Condition      1472057 non-null  object 
 30  Amenity                1516064 non-null  bool   
 31  Bump                   1516064 non-null  bool   
 32  Crossing               1516064 non-null  bool   
 33  Give_Way               1516064 non-null  bool   
 34  Junction               1516064 non-null  bool   
 35  No_Exit                1516064 non-null  bool   
 36  Railway                1516064 non-null  bool   
 37  Roundabout             1516064 non-null  bool   
 38  Station                1516064 non-null  bool   
 39  Stop                   1516064 non-null  bool   
 40  Traffic_Calming        1516064 non-null  bool   
 41  Traffic_Signal         1516064 non-null  bool   
 42  Turning_Loop           1516064 non-null  bool   
 43  Sunrise_Sunset         1515981 non-null  object 
 44  Civil_Twilight         1515981 non-null  object 
 45  Nautical_Twilight      1515981 non-null  object 
 46  Astronomical_Twilight  1515981 non-null  object 
dtypes: bool(13), float64(13), int64(1), object(20)
memory usage: 412.1+ MB

# basic statistics
us_accidents.describe()

Number of null values in the dataset, column-wise

null_val_cols = us_accidents.isnull().sum()
null_val_cols = null_val_cols[null_val_cols > 0].sort_values(ascending = False)
null_val_cols

Number                   1046095
Precipitation(in)         510549
Wind_Chill(F)             449316
Wind_Speed(mph)           128862
Humidity(%)                45509
Visibility(mi)             44211
Weather_Condition          44007
Temperature(F)             43033
Wind_Direction             41858
Pressure(in)               36274
Weather_Timestamp          30264
Airport_Code                4248
Timezone                    2302
Zipcode                      935
City                          83
Sunrise_Sunset                83
Civil_Twilight                83
Nautical_Twilight             83
Astronomical_Twilight         83
dtype: int64

# plot percentage of missing values
missing_percentage = null_val_cols / len(us_accidents) * 100
missing_percentage.plot(kind = 'barh', figsize = (14, 8))
plt.xlabel("Percentage of missing values")
plt.ylabel("Columns")
plt.show()

Observations

The Number column represents the street number in the address record.
It has over a million null values.

Because we already have location data from the Start_Lat, Start_Lng, County and State features, we can drop this column from the dataset. For the columns describing weather conditions, we can drop those with a significant number of null values and impute the others with the mean values from the same State and month, which should represent similar weather conditions. Here we drop the Number and Precipitation(in) columns.

us_accidents.drop(labels = ["Number", "Precipitation(in)"], axis = 1, inplace = True)
us_accidents.shape

(1516064, 45)

We'll also convert the Start_Time column to a datetime dtype, so that we can use datetime functions on it later.

# pd.to_datetime parses the timestamp strings; astype("datetime64") without a unit fails on recent pandas versions
us_accidents.Start_Time = pd.to_datetime(us_accidents.Start_Time)
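The imputation by State and month described above is not carried out in this notebook. As a minimal sketch of how it could look, the following uses a small made-up frame in place of us_accidents; the helper name impute_by_state_month and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for us_accidents (values are made up for illustration)
toy = pd.DataFrame({
    "State": ["CA", "CA", "CA", "OH", "OH", "OH"],
    "Start_Time": pd.to_datetime([
        "2019-01-05", "2019-01-20", "2019-07-04",
        "2019-01-10", "2019-01-15", "2019-01-25",
    ]),
    "Temperature(F)": [50.0, np.nan, 80.0, 30.0, 34.0, np.nan],
})

def impute_by_state_month(df, column):
    """Fill NaNs in `column` with the mean of rows sharing State and calendar month."""
    month = df["Start_Time"].dt.month.rename("month")
    # transform("mean") returns one value per row, aligned with df's index
    group_mean = df.groupby(["State", month])[column].transform("mean")
    return df[column].fillna(group_mean)

toy["Temperature(F)"] = impute_by_state_month(toy, "Temperature(F)")
print(toy)
```

Groups in which every value is missing have no mean to borrow, so their rows would stay NaN and need a fallback (for example, the State-wide mean).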

Exploratory Analysis and Visualization

With these columns in mind, we will try to answer the following questions from the dataset:

Are there more accidents in warmer or colder areas?
Which 5 states have the highest number of accidents?
Among the top 100 cities by number of accidents, which states appear most frequently?
At what time of day are accidents most frequent?
Which days of the week have the most accidents?
Which months have the most accidents?
What is the trend of accidents year over year (decreasing or increasing)?
How does accident severity change with change in precipitation? (Not explored below, since the Precipitation(in) column was dropped during cleaning.)

We can start by plotting the longitude and latitude data to get a sense of where the data is concentrated.

lat, long = us_accidents.Start_Lat, us_accidents.Start_Lng

# plot locations of accidents
plt.figure(figsize = (10, 6))
sns.scatterplot(x = long, y = lat, s = 0.5)

<AxesSubplot:xlabel='Start_Lng', ylabel='Start_Lat'>

It appears that accidents are more concentrated near the coasts, which are the more populated areas of the US. There seems to be a sharp decline in the central and mid-western US. Apart from fewer accidents actually occurring in those areas, this could also be a sign of poor data collection in those states.

Q1. Are there more accidents in warmer or colder areas?

First, we'll plot a histogram for the temperature data.

us_accidents["Temperature(F)"].plot(kind = "hist", logy = True)
plt.xlabel("Temperature(F)")
plt.show()

print(f'''Mean: {us_accidents["Temperature(F)"].mean()} 
Skewness: {us_accidents["Temperature(F)"].skew()}
Kurtosis: {us_accidents["Temperature(F)"].kurtosis()}''')

Mean: 58.5816432896377 
Skewness: -0.30533266112446805
Kurtosis: -0.13261978426888454

Observations

The temperature data is approximately normally distributed, with low skewness and excess kurtosis values.
Thus, the data doesn't support the idea that warmer areas have more accidents than colder ones, or vice versa.
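To give the skewness and kurtosis values above some context, here is a small sketch that computes the same pandas statistics on synthetic samples. The sample sizes, means and spreads are arbitrary assumptions, not values from the dataset. pandas reports excess kurtosis, so a normal sample should land near 0 for both statistics, while a clearly skewed sample should not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Synthetic bell-shaped sample (location and scale are illustrative only)
normal_sample = pd.Series(rng.normal(loc=60.0, scale=18.0, size=100_000))
# A clearly right-skewed sample for contrast
skewed_sample = pd.Series(rng.exponential(scale=18.0, size=100_000))

for name, sample in [("normal", normal_sample), ("exponential", skewed_sample)]:
    print(f"{name}: skew={sample.skew():.3f}, "
          f"excess kurtosis={sample.kurtosis():.3f}")
```

Values close to 0 are consistent with a roughly normal shape, but they do not prove normality on their own.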

Q2. Which 5 states have the highest number of accidents?

accidents_by_states = us_accidents.State.value_counts()
print("The top 5 states in terms of number of accidents are:")
accidents_by_states.head()

The top 5 states in terms of number of accidents are:

CA    448833
FL    153007
OR     87484
TX     75142
NY     60974
Name: State, dtype: int64

plt.figure(figsize = (16, 5))
sns.barplot(x = accidents_by_states.index, y = accidents_by_states.values)

<AxesSubplot:>
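Raw counts are easier to compare as shares of the whole dataset. The sketch below simply redoes the arithmetic with the counts printed above and the row total reported by us_accidents.info(); the variable names are illustrative.

```python
import pandas as pd

total_rows = 1_516_064  # number of entries reported by us_accidents.info()

# Counts copied from the value_counts() output above
top5 = pd.Series(
    {"CA": 448_833, "FL": 153_007, "OR": 87_484, "TX": 75_142, "NY": 60_974}
)

share = top5 / total_rows * 100  # percentage of all recorded accidents
print(share.round(1))
print(f"Top 5 combined: {share.sum():.1f}%")
```

On the full frame, us_accidents.State.value_counts(normalize=True) should give the same proportions directly.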

Q3. Among the top 100 cities by number of accidents, which states appear most frequently?

city_accidents = us_accidents.City.value_counts()
city_accidents

Los Angeles                     39984
Miami                           36233
Charlotte                       22203
Houston                         20843
Dallas                          19497
                                ...  
Manzanita                           1
West Brooklyn                       1
Garfield Heights                    1
Belding                             1
American Fork-Pleasant Grove        1
Name: City, Length: 10657, dtype: int64

city_states = us_accidents.groupby("City")["State"].aggregate(pd.Series.mode)
city_states

City
Aaronsburg      PA
Abbeville       SC
Abbotsford      WI
Abbottstown     PA
Aberdeen        MD
                ..
Zortman         MT
Zumbro Falls    MN
Zumbrota        MN
Zuni            VA
Zwingle         IA
Name: State, Length: 8769, dtype: object

top_100_cities = pd.concat([city_accidents, city_states], axis = 1)[:100]
top_100_cities

# Most frequent states among the top 100 cities by number of accidents
most_freq_states = top_100_cities.State.value_counts()
most_freq_states

CA    35
FL    13
TX     5
NY     4
OR     4
MI     3
VA     3
LA     3
PA     3
SC     2
MO     2
UT     2
AZ     2
TN     2
MN     2
NC     2
OH     2
OK     1
NJ     1
MD     1
KY     1
IN     1
CO     1
DC     1
WA     1
IL     1
GA     1
AL     1
Name: State, dtype: int64

# plot
most_freq_states.plot(kind = "bar", figsize = (15,5))
plt.xlabel("States")
plt.ylabel("Frequency in top 100 cities by number of accidents")
plt.title("Most frequent states in the top 100 cities in the number of accidents")
plt.show()
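One caveat with grouping on City alone is that cities sharing a name in different states are merged into a single entry. A possible alternative, sketched here on made-up data, is to count (City, State) pairs instead; the toy frame and variable names are assumptions, and head(3) stands in for head(100) on the real data.

```python
import pandas as pd

# Toy stand-in for the City and State columns of us_accidents
toy = pd.DataFrame({
    "City":  ["Springfield", "Springfield", "Springfield",
              "Portland", "Portland", "Miami"],
    "State": ["IL", "IL", "MO", "OR", "ME", "FL"],
})

# Count accidents per (City, State) pair so same-named cities stay separate
city_state_counts = (
    toy.groupby(["City", "State"])
       .size()
       .sort_values(ascending=False)
       .rename("Accidents")
       .reset_index()
)

top_n = city_state_counts.head(3)  # head(100) on the real data
states_in_top_n = top_n["State"].value_counts()
print(city_state_counts)
print(states_in_top_n)
```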

Q4. At what time of day are accidents most frequent?

Do more accidents tend to occur at a particular time of the day?

# plot
us_accidents.Start_Time.dt.hour.plot(kind = 'hist', bins = 24)
plt.xlabel("Hour of the day")
plt.show()
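To read off the exact numbers behind a histogram like the one above, the hours can also be tabulated directly. This is a minimal sketch on a handful of made-up timestamps; start_times is a stand-in for us_accidents.Start_Time.

```python
import pandas as pd

# Made-up timestamps standing in for us_accidents.Start_Time
start_times = pd.Series(pd.to_datetime([
    "2020-03-02 07:15", "2020-03-02 07:45", "2020-03-02 08:05",
    "2020-03-02 17:30", "2020-03-07 13:00",
]))

# One row per hour 0..23; hours with no accidents are kept with a count of 0
hourly_counts = (
    start_times.dt.hour
    .value_counts()
    .reindex(range(24), fill_value=0)
)
print(hourly_counts[hourly_counts > 0])
```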

Observations

It appears that accidents occur more often in the morning between 7:00 and 9:00 and in the afternoon between 13:00 and 17:00. Commuting around office hours could be the reason behind this pattern. We can examine this by separating the data for weekdays and weekends.

# plot data for weekdays (dayofweek: Monday = 0, ..., Sunday = 6)
weekdays = us_accidents.Start_Time.dt.dayofweek < 5
us_accidents.loc[weekdays].Start_Time.dt.hour.plot(kind = 'hist', bins = 24)
plt.xlabel("Hour of the day")
plt.title("Accident frequency on weekdays")
plt.show()

# plot data for weekends
us_accidents.loc[~weekdays].Start_Time.dt.hour.plot(kind = 'hist', bins = 24)
plt.xlabel("Hour of the day")
plt.title("Accident frequency on weekends")
plt.show()

Observations

On weekends, the pattern we saw earlier breaks down, which supports the idea that it is driven by commuter traffic around office hours, since offices are usually closed on weekends.
On weekends, accidents occur more during daylight hours, which is probably due to heavier traffic during those hours.
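Because weekdays have far more records than weekends, the two histograms are easier to compare as shares rather than raw counts. The following sketch, on made-up timestamps, builds a table of the fraction of accidents per hour within each day type; the names hour, day_type and hourly_share are illustrative.

```python
import pandas as pd

# Made-up timestamps standing in for us_accidents.Start_Time
start_times = pd.Series(pd.to_datetime([
    "2020-03-02 07:15",  # Monday
    "2020-03-02 07:45",  # Monday
    "2020-03-03 08:05",  # Tuesday
    "2020-03-04 17:30",  # Wednesday
    "2020-03-07 13:00",  # Saturday
    "2020-03-08 13:30",  # Sunday
    "2020-03-08 15:10",  # Sunday
]))

hour = start_times.dt.hour.rename("hour")
day_type = (
    (start_times.dt.dayofweek >= 5)
    .map({False: "weekday", True: "weekend"})
    .rename("day_type")
)

# normalize="columns" makes each column sum to 1, i.e. a per-day-type hourly profile
hourly_share = pd.crosstab(hour, day_type, normalize="columns")
print(hourly_share)
```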

Q5. Which days of the week have the most accidents?

sns.histplot(us_accidents.Start_Time.dt.dayofweek, bins = 7)
plt.xlabel("Day of Week")
plt.show()

Observations

Accidents occur more on weekdays and there is a sharp drop in their count on the weekends.

Q6. Which months have the most accidents?

sns.histplot(us_accidents.Start_Time.dt.month, bins = 12)
plt.xlabel("Month")
plt.show()

Observations

Later months of the year appear to witness more accidents.

This cannot be due to temperature (the winter season) alone, because the count drops suddenly in the first months of the year. With no other apparent cause, this calls for further analysis.

We can start by looking at the trend year over year.

fig, axes = plt.subplots(2, 3, figsize = (15, 10))
year = 2016
for r in range(2):
    for c in range(3):
        if year < 2021:
            year_data = us_accidents.loc[us_accidents.Start_Time.dt.year == year]
            sns.histplot(year_data.Start_Time.dt.month, bins = 12, ax = axes[r, c])
            axes[r, c].set_title(year)
            axes[r, c].set_xlabel("Month of the year")
            year += 1

Observations

Probably because data collection began in 2016, the first months of that year have far fewer data points.
Also, in 2020, because of the coronavirus pandemic lockdown restrictions, there was a sudden drop in the middle of the year. As the restrictions eased, more accidents started to occur.

The month-wise trends still require further analysis.
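As a possible starting point for that month-wise analysis, a year-by-month count table makes gaps such as partially covered years visible at a glance. This is a sketch on made-up timestamps; start_times stands in for us_accidents.Start_Time.

```python
import pandas as pd

# Made-up timestamps standing in for us_accidents.Start_Time
start_times = pd.Series(pd.to_datetime([
    "2016-02-08", "2016-02-20", "2016-12-01",
    "2017-01-15", "2017-12-24",
]))

year = start_times.dt.year.rename("year")
month = start_times.dt.month.rename("month")

# Rows are years, columns are months; missing combinations become 0
monthly_counts = start_times.groupby([year, month]).size().unstack(fill_value=0)
print(monthly_counts)
```

The resulting table could be passed to sns.heatmap for a visual comparison across years.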

Q7. What is the trend of accidents year over year (decreasing or increasing)?

us_accidents.Start_Time.dt.year.plot(kind = 'hist', bins = 5, title = "Year wise trend", 
                                     xticks = np.arange(2016, 2021), figsize = (7, 5))
plt.show()

Observations

The number of accidents appears to increase year over year.

This could be attributed to better data collection in the later years, so it needs further analysis.
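One way to put numbers on the trend is to compute yearly counts and their percentage change. The sketch below uses made-up timestamps in place of us_accidents.Start_Time; on the real data, the first year covers fewer months, so its change relative to the next year should be read with care.

```python
import pandas as pd

# Made-up timestamps standing in for us_accidents.Start_Time
start_times = pd.Series(pd.to_datetime([
    "2016-03-01", "2016-09-12",
    "2017-01-05", "2017-06-30", "2017-11-11",
    "2018-02-02", "2018-04-04", "2018-06-06",
    "2018-08-08", "2018-10-10", "2018-12-12",
]))

# Accidents per year, in chronological order
yearly_counts = start_times.dt.year.value_counts().sort_index()

# Percentage change relative to the previous year (NaN for the first year)
yoy_change = yearly_counts.pct_change() * 100
print(pd.DataFrame({"accidents": yearly_counts, "yoy_change_%": yoy_change}))
```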

Summary and Conclusion

Many questions could be asked and explored, but here we analysed this dataset to answer 7 questions about road accidents in the USA. To answer them, we first did some basic data preparation and then analysed the data. It is important to note that the findings are correlational in nature and do not establish a causal relationship in any way. These findings are:

The number of accidents doesn't differ much between warmer and colder temperatures.
The top 5 states in number of accidents are California, Florida, Oregon, Texas and New York.
These 5 states are also the ones that appear most frequently among the top 100 cities by number of road accidents.
The number of accidents appears to be strongly associated with office hours.
This idea is also supported by the fact that weekdays see more accidents than weekends.
Although the number of accidents appears to increase over the years, this could be attributed to better data collection in the later years.

Thanks for reading.

Output of us_accidents.head():

	ID	Severity	Start_Time	End_Time	Start_Lat	Start_Lng	End_Lat	End_Lng	Distance(mi)	Description	Number	Street	Side	City	County	State	Zipcode	Country	Timezone	Airport_Code	Weather_Timestamp	Temperature(F)	Wind_Chill(F)	Humidity(%)	Pressure(in)	Visibility(mi)	Wind_Direction	Wind_Speed(mph)	Precipitation(in)	Weather_Condition	Amenity	Bump	Crossing	Give_Way	Junction	No_Exit	Railway	Roundabout	Station	Stop	Traffic_Calming	Traffic_Signal	Turning_Loop	Sunrise_Sunset	Civil_Twilight	Nautical_Twilight	Astronomical_Twilight
0	A-2716600	3	2016-02-08 00:37:08	2016-02-08 06:37:08	40.10891	-83.09286	40.11206	-83.03187	3.230	Between Sawmill Rd/Exit 20 and OH-315/Olentang...	NaN	Outerbelt E	R	Dublin	Franklin	OH	43017	US	US/Eastern	KOSU	2016-02-08 00:53:00	42.1	36.1	58.0	29.76	10.0	SW	10.4	0.00	Light Rain	False	False	False	False	False	False	False	False	False	False	False	False	False	Night	Night	Night	Night
1	A-2716601	2	2016-02-08 05:56:20	2016-02-08 11:56:20	39.86542	-84.06280	39.86501	-84.04873	0.747	At OH-4/OH-235/Exit 41 - Accident.	NaN	I-70 E	R	Dayton	Montgomery	OH	45424	US	US/Eastern	KFFO	2016-02-08 05:58:00	36.9	NaN	91.0	29.68	10.0	Calm	NaN	0.02	Light Rain	False	False	False	False	False	False	False	False	False	False	False	False	False	Night	Night	Night	Night
2	A-2716602	2	2016-02-08 06:15:39	2016-02-08 12:15:39	39.10266	-84.52468	39.10209	-84.52396	0.055	At I-71/US-50/Exit 1 - Accident.	NaN	I-75 S	R	Cincinnati	Hamilton	OH	45203	US	US/Eastern	KLUK	2016-02-08 05:53:00	36.0	NaN	97.0	29.70	10.0	Calm	NaN	0.02	Overcast	False	False	False	False	True	False	False	False	False	False	False	False	False	Night	Night	Night	Day
3	A-2716603	2	2016-02-08 06:15:39	2016-02-08 12:15:39	39.10148	-84.52341	39.09841	-84.52241	0.219	At I-71/US-50/Exit 1 - Accident.	NaN	US-50 E	R	Cincinnati	Hamilton	OH	45202	US	US/Eastern	KLUK	2016-02-08 05:53:00	36.0	NaN	97.0	29.70	10.0	Calm	NaN	0.02	Overcast	False	False	False	False	True	False	False	False	False	False	False	False	False	Night	Night	Night	Day
4	A-2716604	2	2016-02-08 06:51:45	2016-02-08 12:51:45	41.06213	-81.53784	41.06217	-81.53547	0.123	At Dart Ave/Exit 21 - Accident.	NaN	I-77 N	R	Akron	Summit	OH	44311	US	US/Eastern	KAKR	2016-02-08 06:54:00	39.0	NaN	55.0	29.65	10.0	Calm	NaN	NaN	Overcast	False	False	False	False	False	False	False	False	False	False	False	False	False	Night	Night	Day	Day

Output of us_accidents.describe():

	Severity	Start_Lat	Start_Lng	End_Lat	End_Lng	Distance(mi)	Number	Temperature(F)	Wind_Chill(F)	Humidity(%)	Pressure(in)	Visibility(mi)	Wind_Speed(mph)	Precipitation(in)
count	1.516064e+06	1.516064e+06	1.516064e+06	1.516064e+06	1.516064e+06	1.516064e+06	4.699690e+05	1.473031e+06	1.066748e+06	1.470555e+06	1.479790e+06	1.471853e+06	1.387202e+06	1.005515e+06
mean	2.238630e+00	3.690056e+01	-9.859919e+01	3.690061e+01	-9.859901e+01	5.872617e-01	8.907533e+03	5.958460e+01	5.510976e+01	6.465960e+01	2.955495e+01	9.131755e+00	7.630812e+00	8.477855e-03
std	6.081481e-01	5.165653e+00	1.849602e+01	5.165629e+00	1.849590e+01	1.632659e+00	2.242190e+04	1.827316e+01	2.112735e+01	2.325986e+01	1.016756e+00	2.889112e+00	5.637364e+00	1.293168e-01
min	1.000000e+00	2.457022e+01	-1.244976e+02	2.457011e+01	-1.244978e+02	0.000000e+00	0.000000e+00	-8.900000e+01	-8.900000e+01	1.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00
25%	2.000000e+00	3.385422e+01	-1.182076e+02	3.385420e+01	-1.182077e+02	0.000000e+00	1.212000e+03	4.700000e+01	4.080000e+01	4.800000e+01	2.944000e+01	1.000000e+01	4.600000e+00	0.000000e+00
50%	2.000000e+00	3.735113e+01	-9.438100e+01	3.735134e+01	-9.437987e+01	1.780000e-01	4.000000e+03	6.100000e+01	5.700000e+01	6.800000e+01	2.988000e+01	1.000000e+01	7.000000e+00	0.000000e+00
75%	2.000000e+00	4.072593e+01	-8.087469e+01	4.072593e+01	-8.087449e+01	5.940000e-01	1.010000e+04	7.300000e+01	7.100000e+01	8.400000e+01	3.004000e+01	1.000000e+01	1.040000e+01	0.000000e+00
max	4.000000e+00	4.900058e+01	-6.711317e+01	4.907500e+01	-6.710924e+01	1.551860e+02	9.999997e+06	1.706000e+02	1.130000e+02	1.000000e+02	5.804000e+01	1.400000e+02	9.840000e+02	2.400000e+01

Output of top_100_cities:

	City	State
Los Angeles	27760	CA
Miami	26831	FL
Orlando	10772	FL
Dallas	10522	TX
Charlotte	10312	NC
...	...	...
Flint	1465	MI
Hollywood	1431	FL
Eugene	1427	OR
Silver Spring	1425	MD
Birmingham	1422	AL