Exploratory Data Analysis of Road Accidents in USA
- Introduction
- Setup
- Data Preparation and Cleaning
- Exploratory Analysis and Visualization
- Q1. Are there more accidents in warmer or colder areas?
- Q2. Which 5 states have the highest number of accidents?
- Q3. Among the top 100 cities by number of accidents, which states do they belong to most frequently?
- Q4. What time of the day are accidents most frequent in?
- Q5. Which days of the week have the most accidents?
- Q6. Which months have the most accidents?
- Q7. What is the trend of accidents year over year (decreasing or increasing)?
- Summary and Conclusion
Introduction
Every year, 1.3 million people die in road traffic crashes around the world, and 20 to 50 million more suffer non-fatal injuries, with many incurring a disability as a result of their injury.[1] In the USA alone, there were 33,244 fatal motor vehicle crashes in 2019, in which 36,096 deaths occurred.[2]
In this project, we are going to study the 'US Accidents' dataset provided by Sobhan Moosavi on https://www.kaggle.com. We can analyse this data to gain some interesting insights about road accidents in the USA, such as the impact of environmental stimuli on the occurrence of accidents, how the number of accidents changes from month to month, or which 5 states have the highest (or lowest) number of accidents. [Note: This dataset covers 49 mainland states of the USA (excluding Alaska) for the period February 2016 to December 2020.]
We will use Exploratory Data Analysis (EDA) to study this data. EDA is an approach to analysing datasets that summarizes their main characteristics, often using statistical graphics and other data visualization methods. It can help in seeing what the data can tell us beyond the formal modeling or hypothesis testing task.[3]
Because this dataset is huge, with dozens of features, it can be used to answer a lot of questions. For this project, we limit ourselves to 5-10 questions and try to answer them.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
import os
for dirname, _, filenames in os.walk('US_Accidents_Dec20_updated.csv/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Read the file
file_path = "US_Accidents_Dec20_updated.csv/US_Accidents_Dec20_updated.csv"
us_accidents = pd.read_csv(file_path)
pd.set_option("display.max_columns", None)
us_accidents.head()
us_accidents.info()
# basic statistics
us_accidents.describe()
Number of null values in the dataset, column-wise
null_val_cols = us_accidents.isnull().sum()
null_val_cols = null_val_cols[null_val_cols > 0].sort_values(ascending = False)
null_val_cols
# plot percentage of missing values
missing_percentage = null_val_cols / len(us_accidents) * 100
missing_percentage.plot(kind = 'barh', figsize = (14, 8))
plt.xlabel("Percentage of missing values")
plt.ylabel("Columns")
plt.show()
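The same percentages can also be obtained in one step, because the mean of a boolean mask is the fraction of True values. The following is a minimal sketch on a made-up frame: `toy` and its values are illustrative assumptions, and only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for us_accidents; the values are not from the dataset.
toy = pd.DataFrame({
    "Number": [np.nan, np.nan, np.nan, 12.0],
    "Temperature(F)": [50.0, np.nan, 61.0, 70.0],
    "State": ["CA", "TX", "FL", "NY"],
})

# isnull() gives a boolean frame; its column-wise mean is the missing fraction.
missing_pct = toy.isnull().mean() * 100

# Keep only columns that have missing values, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Working with percentages rather than raw counts makes it easier to pick a cut-off for dropping a column.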
Observations
- The Number column represents the street number in the address record.
- It has over a million null values.
Because we already have the location data from the Start_Lat, Start_Lng, County and State features, we can drop this column from the dataset. For the columns representing weather conditions, we can drop those with a significant share of null values and impute the others with the mean values from the same State and month, which should represent similar weather conditions.
us_accidents.drop(labels = ["Number", "Precipitation(in)"], axis = 1, inplace = True)
us_accidents.shape
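The imputation idea described above (filling a weather column with the mean for the same State and month) is not carried out in this notebook. A minimal sketch of how it could look is given here, assuming a small made-up frame named `toy`; the values and the helper name `group_mean` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for us_accidents; the values are not from the dataset.
toy = pd.DataFrame({
    "State": ["CA", "CA", "CA", "TX", "TX", "TX"],
    "Start_Time": pd.to_datetime([
        "2019-01-03", "2019-01-15", "2019-07-04",
        "2019-01-09", "2019-01-20", "2019-01-28",
    ]),
    "Temperature(F)": [50.0, np.nan, 80.0, 40.0, 44.0, np.nan],
})

# Month of each record, used as the second grouping key.
month = toy["Start_Time"].dt.month.rename("Month")

# Mean temperature per (State, month), broadcast back to every row.
group_mean = toy.groupby(["State", month])["Temperature(F)"].transform("mean")

# Fill only the missing entries with their group's mean.
toy["Temperature(F)"] = toy["Temperature(F)"].fillna(group_mean)
print(toy)
```

A group in which every value is missing would stay missing under this approach, so a fallback (for example the state-wide mean) may still be needed.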
We'll also convert the Start_Time column to a datetime dtype, so that we can apply datetime functions to it later on.
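# pd.to_datetime parses the timestamp strings; astype("datetime64") without a unit raises a TypeError in pandas 2.x
us_accidents["Start_Time"] = pd.to_datetime(us_accidents["Start_Time"])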
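Once the column has a datetime dtype, the `.dt` accessor exposes the components used in the rest of the analysis. A small sketch with two made-up timestamps (the strings are illustrative, not rows from the dataset):

```python
import pandas as pd

# Two made-up timestamp strings in the same format as Start_Time.
raw = pd.Series(["2016-02-08 05:46:00", "2020-12-31 23:28:56"])
parsed = pd.to_datetime(raw)

# Components used later: hour of day, day of week, month and year.
parts = pd.DataFrame({
    "hour": parsed.dt.hour,
    "dayofweek": parsed.dt.dayofweek,  # Monday = 0 ... Sunday = 6
    "month": parsed.dt.month,
    "year": parsed.dt.year,
})
print(parts)
```

Note that `dayofweek` starts at 0 for Monday, which is why `dayofweek < 5` selects weekdays.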
Exploratory Analysis and Visualization
With these columns in mind, we will try to answer the following questions from the dataset:
- Are there more accidents in warmer or colder areas?
- Which 5 states have the highest number of accidents?
- Among the top 100 cities by number of accidents, which states do they belong to most frequently?
- What time of the day are accidents most frequent in?
- Which days of the week have the most accidents?
- Which months have the most accidents?
- What is the trend of accidents year over year (decreasing or increasing)?
- How does accident severity change with change in precipitation?
We can start by plotting the longitude and latitude data to get a sense of where the data is concentrated.
lat, long = us_accidents.Start_Lat, us_accidents.Start_Lng
# plot locations of accidents
plt.figure(figsize = (10, 6))
sns.scatterplot(x = long, y = lat, s = 0.5)
It appears that accidents are more concentrated near the coasts, which are the more populated areas of the US. There seems to be a sharp decline in the central and mid-western US. Apart from a genuine lack of accidents in those areas, this could also signal poor data collection in those states.
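With millions of points, a scatter plot overplots heavily, so dense regions all look equally saturated. One way to make the concentration measurable is to count points per grid cell. The sketch below uses synthetic coordinates and an approximate bounding box for the contiguous US; both are assumptions for illustration, not values taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic coordinates (not real accident locations), drawn inside an
# approximate bounding box for the contiguous US.
lng = rng.uniform(-125, -67, size=10_000)
lat = rng.uniform(25, 49, size=10_000)

# Count points in cells of roughly one degree by one degree.
counts, lng_edges, lat_edges = np.histogram2d(
    lng, lat, bins=(58, 24), range=[[-125, -67], [25, 49]]
)

# Locate the densest cell and report its lower-left corner.
i, j = np.unravel_index(counts.argmax(), counts.shape)
print(f"Densest cell starts at lng={lng_edges[i]:.0f}, lat={lat_edges[j]:.0f}")
```

On the real data, `us_accidents.Start_Lng` and `us_accidents.Start_Lat` would take the place of the synthetic arrays.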
First, we'll plot a histogram for the temperature data.
us_accidents["Temperature(F)"].plot(kind = "hist", logy = True)
plt.xlabel("Temperature(F)")
plt.show()
print(f'''Mean: {us_accidents["Temperature(F)"].mean()}
Skewness: {us_accidents["Temperature(F)"].skew()}
Kurtosis: {us_accidents["Temperature(F)"].kurtosis()}''')
Observations
- The data is approximately normally distributed, with very low skewness and kurtosis values.
- Thus, the data doesn't support the idea that warmer areas have more accidents than colder ones or vice versa.
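To give a sense of what "low" skewness and kurtosis mean, the sketch below compares a synthetic normal sample with a synthetic right-skewed one. The samples and their parameters are made up for illustration. pandas reports excess kurtosis (Fisher's definition), so a normal distribution scores close to 0 on both measures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic samples; the parameters are arbitrary and not fitted to the dataset.
symmetric = pd.Series(rng.normal(loc=60, scale=18, size=100_000))
skewed = pd.Series(rng.exponential(scale=18, size=100_000))

for name, sample in [("normal", symmetric), ("exponential", skewed)]:
    print(f"{name}: skew={sample.skew():.2f}, "
          f"excess kurtosis={sample.kurtosis():.2f}")
```

Values near 0 for both measures, as in the normal sample, are what the observation above refers to.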
accidents_by_states = us_accidents.State.value_counts()
print("The top 5 states in terms of number of accidents are:")
accidents_by_states.head()
plt.figure(figsize = (16, 5))
sns.barplot(x = accidents_by_states.index, y = accidents_by_states.values)
city_accidents = us_accidents.City.value_counts()
city_accidents
city_states = us_accidents.groupby("City")["State"].aggregate(pd.Series.mode)
city_states
top_100_cities = pd.concat([city_accidents, city_states], axis = 1)[:100]
top_100_cities
#Most frequent states in the top 100 cities in the number of accidents
most_freq_states = top_100_cities.State.value_counts()
most_freq_states
# plot
most_freq_states.plot(kind = "bar", figsize = (15,5))
plt.xlabel("States")
plt.ylabel("Frequency in top 100 cities by number of accidents")
plt.title("Most frequent states in the top 100 cities in the number of accidents")
plt.show()
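Because `City.value_counts()` counts by name only, cities that share a name across states (for example a Springfield in Illinois and one in Missouri) are merged, and `pd.Series.mode` can return several states when there is a tie. One possible alternative, which is not the approach used above, is to count (City, State) pairs directly. The records below are made up for illustration.

```python
import pandas as pd

# Made-up (City, State) records standing in for us_accidents[["City", "State"]].
records = (
    [("Orlando", "FL")] * 5
    + [("Miami", "FL")] * 4
    + [("Springfield", "IL")] * 3
    + [("Springfield", "MO")] * 2
    + [("Austin", "TX")] * 1
)
toy = pd.DataFrame(records, columns=["City", "State"])

# Count accidents per (City, State) pair so same-named cities stay separate.
pair_counts = toy.groupby(["City", "State"]).size()

# Top 3 pairs here; the analysis above uses the top 100.
top_pairs = pair_counts.nlargest(3).reset_index(name="accidents")

# How often each state appears among the top pairs.
state_freq = top_pairs["State"].value_counts()
print(state_freq)
```

On the full dataset, replacing `3` with `100` would give the counterpart of `most_freq_states`.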
Do more accidents tend to occur at a particular time of the day?
# plot
us_accidents.Start_Time.dt.hour.plot(kind = 'hist', bins = 24)
plt.xlabel("Hour of the day")
plt.show()
Observations
It appears that accidents occur more often in the morning between 7:00 and 9:00 and in the afternoon between 13:00 and 17:00. Office hours could be the reason behind this pattern. We can examine this by separating the data for weekdays and weekends.
# plot
weekdays = us_accidents.Start_Time.dt.dayofweek < 5
us_accidents.loc[weekdays].Start_Time.dt.hour.plot(kind = 'hist', bins = 24)
plt.xlabel("Hour of the day")
plt.title("Accident frequency on weekdays")
plt.show()
# plot data for weekends
us_accidents.loc[~weekdays].Start_Time.dt.hour.plot(kind = 'hist', bins = 24)
plt.xlabel("Hour of the day")
plt.title("Accident frequency on weekends")
plt.show()
Observations
- On weekends, the pattern we saw earlier breaks down, which supports the idea that it is driven by traffic to and from offices, which are usually closed on weekends.
- On weekends, accidents occur more often during daylight hours, which is likely due to heavier traffic during those hours.
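The weekday and weekend histograms have very different totals (five days against two), so their shapes are easier to compare as shares of each group's accidents. A minimal sketch with made-up timestamps, using `pd.crosstab` with column-wise normalization:

```python
import numpy as np
import pandas as pd

# Made-up timestamps: four on weekdays, two on a weekend.
times = pd.Series(pd.to_datetime([
    "2019-01-07 08:00", "2019-01-08 08:30",  # Monday, Tuesday
    "2019-01-09 17:00", "2019-01-10 08:15",  # Wednesday, Thursday
    "2019-01-12 13:00", "2019-01-13 14:00",  # Saturday, Sunday
]))

hour = times.dt.hour.rename("hour")
day_type = pd.Series(
    np.where(times.dt.dayofweek < 5, "weekday", "weekend"), name="day_type"
)

# Share of each group's accidents that falls in each hour (columns sum to 1).
share = pd.crosstab(hour, day_type, normalize="columns")
print(share)
```

With the real `Start_Time` column, plotting `share` would overlay the two hourly profiles on a common scale.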
sns.histplot(us_accidents.Start_Time.dt.dayofweek, bins = 7)
plt.xlabel("Day of Week")
plt.show()
Observations
- Accidents occur more on weekdays and there is a sharp drop in their count on the weekends.
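The histogram labels days as 0 to 6, which is easy to misread. As a sketch, the counts can also be tabulated by day name in calendar order; the timestamps below are made up, and `order` is a helper list introduced here for illustration.

```python
import pandas as pd

# Made-up timestamps for illustration.
times = pd.Series(pd.to_datetime([
    "2019-01-07 08:00", "2019-01-07 17:30",  # Monday
    "2019-01-09 09:00",                      # Wednesday
    "2019-01-12 13:00",                      # Saturday
]))

order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]

# Count per day name, then put the days in calendar order (0 for absent days).
by_day = times.dt.day_name().value_counts().reindex(order, fill_value=0)
print(by_day)
```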
sns.histplot(us_accidents.Start_Time.dt.month, bins = 12)
plt.xlabel("Month")
plt.show()
Observations
- Later months of the year appear to witness more accidents.
This cannot be due to temperature (the winter season), because the count drops suddenly in the first months of the year. With no other apparent cause, this needs further analysis.
We can start by looking at the trend year over year.
fig, axes = plt.subplots(2, 3, figsize = (15, 10))
year = 2016
for r in range(2):
    for c in range(3):
        if year < 2021:
            year_data = us_accidents.loc[us_accidents.Start_Time.dt.year == year]
            # discrete = True draws one bar per month; with bins = 12, a year whose
            # data starts in February (such as 2016) can show a spurious empty bin
            sns.histplot(year_data.Start_Time.dt.month, discrete = True, ax = axes[r, c])
            axes[r, c].set_title(year)
            axes[r, c].set_xlabel("Month of the year")
        year += 1
Observations
- Probably because data collection started in 2016, the first months of that year have far fewer data points.
- Also, in 2020 there was a sudden drop in the middle of the year because of the coronavirus pandemic lockdown restrictions. As the restrictions eased, more accidents started to occur.
But the month-wise trends require further analysis.
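One way to take the month-wise analysis further is a year-by-month count table, in which months without records show up explicitly as 0. This is a minimal sketch with made-up timestamps; the name `table` is introduced here for illustration.

```python
import pandas as pd

# Made-up timestamps for illustration.
times = pd.Series(pd.to_datetime([
    "2016-02-08", "2016-03-01", "2016-03-15",
    "2017-01-05", "2017-03-20",
]))

# Rows are years, columns are months, cells are record counts.
table = (
    times.groupby([times.dt.year.rename("year"),
                   times.dt.month.rename("month")])
    .size()
    .unstack(fill_value=0)
)
print(table)
```

Applied to `us_accidents.Start_Time`, such a table would show at a glance which months of 2016 have little or no data.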
us_accidents.Start_Time.dt.year.plot(kind = 'hist', bins = 5, title = "Year wise trend",
xticks = np.arange(2016, 2021), figsize = (7, 5))
plt.show()
Observations
- The number of accidents looks to be increasing year over year.
This could be attributed to better data collection in the later years. Thus, it needs further analysis.
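To put a number on the year-over-year change, the yearly counts can be turned into percentage growth. The sketch below uses made-up counts, so the printed figures say nothing about the real dataset.

```python
import pandas as pd

# Made-up years, one entry per accident record; not real counts.
years = pd.Series([2016] * 2 + [2017] * 4 + [2018] * 5 + [2019] * 5 + [2020] * 6)

# Accidents per year in chronological order.
per_year = years.value_counts().sort_index()

# Percentage change relative to the previous year (NaN for the first year).
growth = per_year.pct_change() * 100
print(growth)
```

Since the dataset starts in February 2016, the first year is incomplete, and growth from 2016 to 2017 computed this way would be overstated.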
Many questions could be asked and explored, but here we analysed this dataset to answer 7 questions about road accidents in the USA. To answer these questions, we first did some basic data preparation, and then went on to analyse the data. It is important to note that the findings from the data analysis are correlational in nature and do not establish a causal relationship in any way. These findings are:
- The number of accidents doesn't differ much between warmer and colder temperatures.
- The top 5 states in number of accidents are California, Florida, Oregon, Texas and New York.
- These 5 states are also the most frequent states in the top 100 cities by number of road accidents.
- The number of accidents looks to be highly correlated with office hours.
- This idea is also supported by the fact that weekdays see more accidents than weekends.
- Although the number of accidents looks to be increasing over the years, this could be attributed to better data collection in the later years.
Thanks for reading.