Predicting Taxi Fares in NYC with Select ML Algorithms
- Introduction
- Setup
- Exploratory Data Analysis
- Data Preparation
- Train Hardcoded & Baseline Models
- Feature engineering
- Model Selection
- Tune Hyperparameters
- Final Training and Submission
- Summary and Conclusion
In this project, we'll participate in a Kaggle playground competition - New York City Taxi Fare Prediction. In this competition, hosted in partnership with Google Cloud and Coursera, we are tasked with predicting the fare amount for a taxi ride in New York City given the pickup and dropoff locations. The competition provides three files in CSV format - train.csv, test.csv and sample_submission.csv. We have to train ML algorithms on the data provided in train.csv and then make predictions on test.csv, which will then be submitted on Kaggle in the format given in sample_submission.csv. The submission is evaluated using the root mean squared error (RMSE) between the predictions in the submission file and the corresponding ground truth. Therefore, our primary goal in this project will be to minimize this RMSE score.
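Since RMSE is the metric everything in this project is scored on, here is a minimal sketch of how it is computed (made-up numbers, not competition data):
# minimal illustration of the evaluation metric (hypothetical fares, for illustration only)
import numpy as np
from sklearn.metrics import mean_squared_error
y_true = np.array([7.5, 12.0, 5.0])   # hypothetical ground-truth fares
y_pred = np.array([8.0, 10.5, 5.5])   # hypothetical predictions
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = mean_squared_error(y_true, y_pred, squared = False)
print(rmse_manual, rmse_sklearn)  # both print the same RMSE (~0.957)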
We begin the project by first setting up the training environment and loading the files.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.metrics import mean_squared_error # Scoring metric
from sklearn.model_selection import train_test_split # validation
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder # data preparation
import psutil # get cpu info
from bayes_opt import BayesianOptimization # hyperparameter tuning
from tqdm import tqdm # progress meter
# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# ML models
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb
# model pipelines
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
# ignore warnings
import warnings
warnings.filterwarnings("ignore")
# get file path
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
data_dir = "/kaggle/input/new-york-city-taxi-fare-prediction"
!ls -l {data_dir} # get the files info
# get the no. of lines in the train dataset
!wc -l {data_dir}/train.csv
# get the no. of lines in the test dataset
!wc -l {data_dir}/test.csv
# get the no. of lines in the sample submission.
!wc -l {data_dir}/sample_submission.csv
Observations
- This is a supervised learning regression problem
- Training data is 5.5 GB in size
- Training data has 5.5 million rows
- Test set is much smaller (< 10,000 rows)
Loading the whole dataset directly will slow down our initial training. Thus, we'll initially import only about 1% of it. This will still give us about 550k rows of data to train our models, which should be enough for initial training. Once we have finalized our model and its hyperparameters, we can then use the full dataset to train the final model and make the final predictions.
We'll now inspect how the dataset is structured and import only the columns we need.
# Train set
!head -n 20 {data_dir}/train.csv
# Test set
!head -n 20 {data_dir}/test.csv
# Sample Submission
!head -n 20 {data_dir}/sample_submission.csv
Observations
- The training set has 8 columns:
  - key (a unique identifier)
  - fare_amount (the target column)
  - pickup_datetime
  - pickup_longitude
  - pickup_latitude
  - dropoff_longitude
  - dropoff_latitude
  - passenger_count
- The test set has all columns except the target column fare_amount.
- The submission file should contain the key and fare_amount for each test sample.
- key has the same entries as the pickup_datetime column. Because key doesn't offer any new information, we can drop it.
- Some entries have 0 in the longitude and latitude columns, which will need further inspection in the EDA section.
- We can optimize file reading by setting dtypes in advance rather than leaving it to pandas to infer them, which adds overhead. Parsing dates will be an exception: we'll do that with string methods, which is much faster.
- The data looks to be randomized by pickup_datetime. Because we want a randomized, representative sample from the training dataset for initial training, we can simply import the first 1% of the rows.
# columns to import
columns = "fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count".split(",")
columns
%%time
# set dtypes
dtypes = {'fare_amount': "float32",
'pickup_longitude': "float32",
'pickup_latitude': "float32",
'dropoff_longitude': "float32",
'dropoff_latitude': "float32",
'passenger_count': "uint8"}
# set nrows to import
nrows_to_import = 550_000
# import the training data
df = pd.read_csv(f"{data_dir}/train.csv", usecols = columns, dtype = dtypes, nrows = nrows_to_import)
# inspect df
df.head()
Observations
The data has loaded correctly. Now we'll convert the pickup_datetime column to the datetime dtype.
%%time
# parse dates
def datetime_parser(dataframe):
    datetime = dataframe["pickup_datetime"].str.slice(0, 16)
    datetime = pd.to_datetime(datetime, utc = True, format = "%Y-%m-%d %H:%M")
    return datetime
df["pickup_datetime"] = datetime_parser(df)
# inspect dtypes
df.dtypes
Now that all the columns are loaded correctly, we can similarly load the test dataset and the sample submission file.
# load test set
test_df = pd.read_csv(data_dir + "/test.csv", index_col = "key", dtype = dtypes)
test_df["pickup_datetime"] = datetime_parser(test_df)
test_df.dtypes
test_df
# load sample submission
submission_df = pd.read_csv(data_dir + "/sample_submission.csv", index_col = "key")
submission_df.head()
Now, we can set the random state seed and then move on to EDA and Data Preparation.
# set random state seed
seed = 42
# inspect top 5 rows
df.head()
df.info()
# basic statistics
df.describe()
# train data time range
df.pickup_datetime.min(), df.pickup_datetime.max()
Observations
- The first 5 rows look to be alright.
- There are some null values in the drop location columns.
- There are some negative values in the fare amount and the location features; these will need to be dealt with (see the quick check after this list).
- The max values also suggest that there are some outliers, which we'll also deal with.
- The training data looks to be from 1st Jan 2009 to 30th June 2015.
- The data takes about 15 MB in memory.
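As a quick sanity check (a minimal sketch; the exact counts depend on which rows end up in the sample), we can count the obviously invalid entries flagged above:
# count obviously invalid entries in the 1% training sample (illustrative check)
print("negative fares:       ", (df["fare_amount"] < 0).sum())
print("zero pickup coords:   ", ((df["pickup_longitude"] == 0) | (df["pickup_latitude"] == 0)).sum())
print("zero dropoff coords:  ", ((df["dropoff_longitude"] == 0) | (df["dropoff_latitude"] == 0)).sum())
print("zero passenger count: ", (df["passenger_count"] == 0).sum())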
test_df.info()
# basic statistics
test_df.describe()
# test data time range
test_df["pickup_datetime"].min(), test_df["pickup_datetime"].max()
Observations
- There are 9914 rows of data.
- There are no null values in the test data.
- There are no obvious data entry errors as seen from the basic statistics.
- Its time range is also from 1st Jan 2009 to 30th June 2015.
- Its min and max values can inform us on removing outliers from the training data.
EDA
In this subsection, we'll answer these questions using the data:
- In which locations are rides mostly located?
- What is the busiest day of the week?
- What is the busiest time of the day?
- In which month are fares the highest?
- Which pickup locations have the highest fares?
- Which drop locations have the highest fares?
- What is the average ride distance?
# Extract day of week and their counts
dayofweek = df["pickup_datetime"].dt.dayofweek.value_counts()
dayofweek.sort_values(ascending = False)
%matplotlib inline
# visualize its frequency
sns.barplot(x = dayofweek.index, y = dayofweek)
Observations
- In this plot, week starts with Monday denoted by 0 and ends on Sunday denoted by 6.
- The number of taxi trips looks to increase as the week goes by, peaking on Friday, and then falls off on Sunday.
- Thus, Friday is the busiest day of week followed closely by Saturday.
# Extract hour of day and its counts
hourofday = df["pickup_datetime"].dt.hour.value_counts()
hourofday
# visualize frequency
sns.barplot(x = hourofday.index, y = hourofday)
Observations
- The number of trips is at its lowest between 5-6 AM, after which it starts to rise.
- They peak in the evening between 7-8 PM, and then again start to fall.
- The busiest hour of the day is between 7-8 PM.
# Extract months from datetime and get their average fare_amount
month_avg_fare = df.groupby(by = df["pickup_datetime"].dt.month)["fare_amount"].agg("mean")
month_avg_fare = month_avg_fare.sort_values()
month_avg_fare
# visualize
sns.barplot(x = month_avg_fare, y = month_avg_fare.index, orient = "h")
plt.ylabel("Month")
Observations
- The fare_amount is the highest in September.
- Taxi fares look to generally rise as the year goes on.
# remove outliers and extract pickup data
pickup = df[['pickup_longitude', 'pickup_latitude', 'fare_amount']]
pickup = pickup.loc[(pickup['pickup_latitude'] >= 40.5) &
(pickup['pickup_latitude'] <= 41) &
(pickup['pickup_longitude'] >= -74.1) &
(pickup['pickup_longitude'] <= -73.7) &
(pickup['fare_amount'] > 0) &
(pickup['fare_amount'] <= 200)]
# plot taxi fares' distribution
sns.kdeplot(pickup['fare_amount'])
The taxi fares look to be highly skewed with quite a few outliers. We can confirm this by looking at the skewness value.
# Taxi fare skewness
pickup['fare_amount'].skew()
# Visualize its outliers
sns.boxplot(pickup['fare_amount'])
# get fare limits to remove outliers in visualization
Q1, Q3 = pickup['fare_amount'].quantile(q = [0.25, 0.75]).values
IQR = Q3 - Q1
fare_min, fare_max = Q1 - (1.5 * IQR), Q3 + (1.5 * IQR)
fare_min, fare_max
# visualize
plt.figure(figsize = (16, 12))
ax = sns.scatterplot(data = pickup, x='pickup_longitude', y='pickup_latitude', hue = pickup['fare_amount'],
                     palette = 'rocket_r', hue_norm = (fare_min, fare_max), s = 0.1)
norm = plt.Normalize(fare_min, fare_max)
sm = plt.cm.ScalarMappable(cmap="rocket_r", norm=norm)
sm.set_array([])
ax.get_legend().remove()
cbar = ax.figure.colorbar(sm)
cbar.set_label('Fare Amount', rotation=270)
plt.show()
We can compare this plot with a map of New York City to get a sense of where taxi fares are high; a sketch of how to render an interactive basemap follows the observations below.
Observations
- Pickups from John F. Kennedy Airport and from East Elmhurst, which also has an airport (visible with the satellite option on Google Maps), are generally more expensive.
- The southern side of the city also looks to be more expensive.
This suggests that maybe some landmarks like airports mean high taxi fares.
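For the interactive view mentioned above, a minimal sketch using folium could look like the following (folium is an assumption here - it isn't used elsewhere in this notebook, but it is commonly available in Kaggle environments):
# plot the 500 most expensive pickups on an interactive basemap (illustrative sketch)
import folium
nyc_map = folium.Map(location = [40.75, -73.97], zoom_start = 11)
for _, row in pickup.nlargest(500, "fare_amount").iterrows():
    folium.CircleMarker(location = [row["pickup_latitude"], row["pickup_longitude"]],
                        radius = 1, color = "crimson").add_to(nyc_map)
nyc_map  # renders inline in a notebook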
# remove outliers and extract dropoff data
dropoff = df[['dropoff_longitude', 'dropoff_latitude', 'fare_amount']]
dropoff = dropoff.loc[(dropoff['dropoff_latitude'] >= 40.5) &
(dropoff['dropoff_latitude'] <= 41) &
(dropoff['dropoff_longitude'] >= -74.1) &
(dropoff['dropoff_longitude'] <= -73.7) &
(dropoff['fare_amount'] > 0) &
(dropoff['fare_amount'] <= 200)]
# visualize
plt.figure(figsize = (16, 12))
ax = sns.scatterplot(data = dropoff, x='dropoff_longitude', y='dropoff_latitude', hue = dropoff['fare_amount'],
                     palette = 'rocket_r', hue_norm = (fare_min, fare_max), s = 0.1)
norm = plt.Normalize(fare_min, fare_max)
sm = plt.cm.ScalarMappable(cmap="rocket_r", norm=norm)
sm.set_array([])
ax.get_legend().remove()
c_bar = ax.figure.colorbar(sm)
c_bar.set_label("Fare Amount", rotation = 270)
plt.show()
Observations
- This map looks very similar to the map of pickup locations.
- Popular landmarks seem to have expensive taxi rides here also.
- One difference is that there are quite a few dropoffs outside the city; in other words, there are more dropoffs than pickups outside the city.
# number of null values
df.isnull().sum()
There are 6 missing values in the dropoff_longitude and dropoff_latitude columns. This is quite a small number considering the dataset size, so we'll drop the rows with missing values instead of trying to fill them.
# drop null values
df.dropna(inplace = True)
# Separate target from predictors
X = df.copy()
y = X.pop("fare_amount")
# predictors
X.head()
# target
y.head()
# make train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.25, random_state = seed)
Train Hardcoded & Baseline Models
We'll now train a hardcoded model and a baseline model to establish some scores that our future models should at least beat. If they don't, that would indicate some error in training.
- Hardcoded model: Will always predict average fare
- Baseline model: Linear regression
# train hardcoded model
dummy_model = DummyRegressor(strategy = "mean")
dummy_model.fit(X_train, y_train)
# make training predictions
train_dummy_preds = dummy_model.predict(X_train)
train_dummy_preds
# score
mean_squared_error(y_train, train_dummy_preds, squared = False)
# make validation predictions
valid_dummy_preds = dummy_model.predict(X_valid)
valid_dummy_preds
# score
mean_squared_error(y_valid, valid_dummy_preds, squared = False)
# remove "pickup_datetime" from input column for training
input_cols = X_train.columns[1:]
input_cols
# train baseline model
base_model = LinearRegression()
base_model.fit(X_train[input_cols], y_train)
# make training predictions
train_base_preds = base_model.predict(X_train[input_cols])
train_base_preds
# score
mean_squared_error(y_train, train_base_preds, squared = False)
# make validation predictions
valid_base_preds = base_model.predict(X_valid[input_cols])
valid_base_preds
# score
mean_squared_error(y_valid, valid_base_preds, squared = False)
Observations
The linear regression model is off by about \$9.871, which isn't much better than simply predicting the average, which was off by about \$9.873.
This is mainly because the training data (geocoordinates) is not in a format that's useful for the model, and we're not using one of the most important columns: pickup date & time.
However, now we have a baseline that our other models should ideally beat.
# make test predictions
test_base_preds = base_model.predict(test_df[input_cols])
test_base_preds
# save_submission_file
submission_df["fare_amount"] = test_base_preds
submission_df.to_csv("baseline_model.csv")
The Kaggle submission gives a score of 9.406, in the same range as our validation RMSE.
Feature engineering
Now that we have got a baseline score, we'll do feature engineering, wherein we'll:
- Extract parts of datetime
- Remove outliers & invalid data
- Add distance between pickup & drop
- Add distance from landmarks
All this will be done using functions, which we'll combine into a single preprocessing function in the end.
# Extract parts of datetime.
def add_datetime_cols(dataframe):
    dataframe["date"] = dataframe["pickup_datetime"].dt.day
    dataframe["month"] = dataframe["pickup_datetime"].dt.month
    dataframe["weekday"] = dataframe["pickup_datetime"].dt.dayofweek
    dataframe["year"] = dataframe["pickup_datetime"].dt.year
    return dataframe
Remove outliers & invalid data
There seems to be some invalid data in each of the following columns:
- Fare amount
- Passenger count
- Pickup latitude & longitude
- Drop latitude & longitude
Using the limits observed in the test data, we'll restrict the training data to these ranges:
- fare_amount: 1 to 200
- longitudes: -75 to -72
- latitudes: 40 to 42
- passenger_count: 1 to 10
# remove outliers
def remove_outliers(dataframe):
    dataframe = dataframe.loc[(dataframe["fare_amount"] >= 1) &
                              (dataframe["fare_amount"] <= 200) &
                              (dataframe["pickup_longitude"] >= -75) &
                              (dataframe["pickup_longitude"] <= -72) &
                              (dataframe["pickup_latitude"] >= 40) &
                              (dataframe["pickup_latitude"] <= 42) &
                              (dataframe["dropoff_longitude"] >= -75) &
                              (dataframe["dropoff_longitude"] <= -72) &
                              (dataframe["dropoff_latitude"] >= 40) &
                              (dataframe["dropoff_latitude"] <= 42) &
                              (dataframe["passenger_count"] >= 1) &
                              (dataframe["passenger_count"] <= 10)]
    return dataframe
Add distance between pickup & drop
This is usually the biggest factor affecting the taxi fare. While we cannot know the actual route or the total distance driven, we can calculate the great-circle distance between the pickup and dropoff points using the haversine formula (the implementation is taken from here). A quick sanity check follows the function below.
# haversine for calculating distance
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance in kilometers between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles. Determines return value units.
    return c * r
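As a quick sanity check (an illustrative example, not part of the original pipeline), the straight-line distance from JFK Airport to Times Square should come out to roughly 21-22 km:
# sanity check with approximate coordinates: JFK (-73.7781, 40.6413) to Times Square (-73.9855, 40.7580)
print(haversine(-73.7781, 40.6413, -73.9855, 40.7580))  # roughly 21-22 km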
# Add distance between pickup & drop
def add_distance(dataframe):
    dataframe["trip_dist"] = haversine(dataframe["pickup_longitude"], dataframe["pickup_latitude"], dataframe["dropoff_longitude"], dataframe["dropoff_latitude"])
    return dataframe
Add distance from landmarks
We'll also add the distance between the dropoff location and some landmarks like:
- JFK Airport
- LGA Airport
- EWR Airport
- Met Museum
- World Trade Center
Because popular locations mean more traffic, which affects the waiting time in taxis, this information can also help with prediction.
The names of these landmarks and their coordinates are taken from Google. To calculate the distances, we'll use the haversine function we just created.
# landmark locations
jfk_lonlat = -73.7781, 40.6413
lga_lonlat = -73.8740, 40.7769
ewr_lonlat = -74.1745, 40.6895
met_lonlat = -73.9632, 40.7794
wtc_lonlat = -74.0099, 40.7126
# landmarks
landmarks = [("jfk", jfk_lonlat), ("lga", lga_lonlat), ("ewr", ewr_lonlat), ("met", met_lonlat), ("wtc", wtc_lonlat)]
# Add distance from landmarks
def add_dist_from_landmarks(dataframe):
    for lmrk_name, longlat in landmarks:
        dataframe[lmrk_name + "_dist"] = haversine(longlat[0], longlat[1], dataframe["dropoff_longitude"], dataframe["dropoff_latitude"])
    return dataframe
# full data preparation
def preprocessor(dataframe):
    if "fare_amount" in dataframe:
        dataframe = dataframe.dropna()
        dataframe = remove_outliers(dataframe)
    dataframe = add_datetime_cols(dataframe)
    dataframe = add_distance(dataframe)
    dataframe = add_dist_from_landmarks(dataframe)
    dataframe = dataframe.drop(columns = ["pickup_datetime"])
    return dataframe
# Prepare the data
df = preprocessor(df) # training data
test_df = preprocessor(test_df) # test data
# save processed files for future use
df.to_csv("processed_train_1_perc.csv", index = None)
test_df.to_csv("preprocessed_test_df.csv")
# Separate training data from validation data
X = df.copy()
y = X.pop("fare_amount")
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.25, random_state = seed)
# inspect preprocessing result
X_train.head()
test_df.head()
The files have been successfully preprocessed. Now we'll select ML models for training.
Linear Models
Prepare Data
We'll train a LinearRegression model and a Ridge model in this subsection. But before proceeding, we'll scale the numerical columns and one-hot encode the categorical columns. This transformation will only be used with the linear models, which benefit from it; the tree-based models we'll train later work well without it. So, for the linear models, the additional steps will be to:
- Scale numerical features with MinMaxScaler
- Encode categorical features with OneHotEncoder
# encode categorical columns
cat_cols = pd.Index(["weekday", "month"])
ohe = OneHotEncoder(drop = "first", sparse = False)
# scale numeric columns
num_cols = X_train.columns.difference(cat_cols)
scaler = MinMaxScaler()
# combine transformation steps
lin_ct = make_column_transformer((ohe, cat_cols), (scaler, num_cols))
lin_ct
# transform
X_lin_train = lin_ct.fit_transform(X_train)
X_lin_valid = lin_ct.transform(X_valid)
X_lin_train
Train and Evaluate
# evaluation function
def lin_eval(model):
    train_preds = model.predict(X_lin_train)
    train_rmse = mean_squared_error(y_train, train_preds, squared = False)
    val_preds = model.predict(X_lin_valid)
    val_rmse = mean_squared_error(y_valid, val_preds, squared = False)
    return train_rmse, val_rmse, train_preds, val_preds
# models
l_models = {"Linear Regression": LinearRegression(),
"Ridge": Ridge(random_state = seed)}
l_scores = {}
# train and evaluate
for model_name, model in l_models.items():
    print(model_name)
    model.fit(X_lin_train, y_train)
    train_rmse, val_rmse, train_preds, val_preds = lin_eval(model)
    l_scores[model_name] = {"train_rmse": train_rmse,
                            "validation_rmse": val_rmse}
    print(l_scores[model_name])
    print("-" * 70)
The linear models show a clear improvement over the baseline model, scoring an RMSE of about 5. We'll now compare them with tree-based models.
# evaluation function
def tree_eval(model):
    train_preds = model.predict(X_train)
    train_rmse = mean_squared_error(y_train, train_preds, squared = False)
    val_preds = model.predict(X_valid)
    val_rmse = mean_squared_error(y_valid, val_preds, squared = False)
    return train_rmse, val_rmse, train_preds, val_preds
# models
t_models = {"DecisionTree": DecisionTreeRegressor(max_depth = 6, random_state = seed),
"RandomForest": RandomForestRegressor(max_depth = 6, n_jobs = -1, random_state = seed)}
t_scores = {}
# train and evaluate
for model_name, model in t_models.items():
    print(model_name)
    model.fit(X_train, y_train)
    train_rmse, val_rmse, train_preds, val_preds = tree_eval(model)
    t_scores[model_name] = {"train_rmse": train_rmse,
                            "validation_rmse": val_rmse}
    print(t_scores[model_name])
    print("-" * 70)
The tree-based models performed better than the linear models; both score about 4.25 RMSE on the validation data. Based on this, we'll next train a gradient boosting model - LightGBM - evaluate its performance and make a submission. Then we can tune its hyperparameters before training on the full training data and making a final submission.
# get cpu core count
core_count = psutil.cpu_count(logical = False)
core_count
# model parameters
params = {"num_leaves": 25,
"learning_rate": 0.03,
"seed": seed,
"metric": "rmse",
"num_threads": core_count}
# train and evaluate
train_lgb = lgb.Dataset(X_train, y_train)
valid_lgb = lgb.Dataset(X_valid, y_valid)
bst = lgb.train(params, train_lgb, num_boost_round = 1500, valid_sets = [train_lgb, valid_lgb], early_stopping_rounds = 10, verbose_eval = 20)
# make predictions
preds = bst.predict(test_df)
# save submission file
submission_df["fare_amount"] = preds
submission_df.to_csv("lightgbm.csv")
This submission gives us an RMSE of 3.29476, which places us at roughly the 35th percentile among about 1,400 current participants. We can now improve on this through hyperparameter tuning and by using the full training data.
# black box function for Bayesian Optimization
def LGB_bayesian(bagging_fraction,
                 bagging_freq,
                 lambda_l1,
                 lambda_l2,
                 learning_rate,
                 max_depth,
                 min_data_in_leaf,
                 min_gain_to_split,
                 min_sum_hessian_in_leaf,
                 num_leaves,
                 feature_fraction):
    # LightGBM expects these parameters to be integers, so we cast them
    bagging_freq = int(bagging_freq)
    num_leaves = int(num_leaves)
    min_data_in_leaf = int(min_data_in_leaf)
    max_depth = int(max_depth)
    # parameters
    param = {'bagging_fraction': bagging_fraction,
             'bagging_freq': bagging_freq,
             'lambda_l1': lambda_l1,
             'lambda_l2': lambda_l2,
             'learning_rate': learning_rate,
             'max_depth': max_depth,
             'min_data_in_leaf': min_data_in_leaf,
             'min_gain_to_split': min_gain_to_split,
             'min_sum_hessian_in_leaf': min_sum_hessian_in_leaf,
             'num_leaves': num_leaves,
             'feature_fraction': feature_fraction,
             'seed': seed,
             'feature_fraction_seed': seed,
             'bagging_seed': seed,
             'drop_seed': seed,
             'boosting_type': 'gbdt',
             'metric': 'rmse',
             'force_col_wise': True,
             'verbosity': -1,
             'num_threads': core_count}
    trn = lgb.Dataset(X, y)
    lgb_cv = lgb.cv(param, trn, num_boost_round = 1500, nfold = 3, stratified = False, early_stopping_rounds = 10, seed = seed)
    score = lgb_cv["rmse-mean"][-1]
    # BayesianOptimization maximizes, so return the negative RMSE
    return -score
# parameter bounds
bounds_LGB = {
'bagging_fraction': (0.6, 1),
'bagging_freq': (1, 4),
'lambda_l1': (0, 3.0),
'lambda_l2': (0, 3.0),
'learning_rate': (0.01, 0.1),
'max_depth':(3,8),
'min_data_in_leaf': (5, 20),
'min_gain_to_split': (0, 1),
'min_sum_hessian_in_leaf': (0.01, 20),
'num_leaves': (5, 25),
'feature_fraction': (0.05, 1)
}
# reload the training sample and re-create X and y used by the optimizer
seed = 42
df = pd.read_parquet("nyc_sample_train.parquet")
X = df.copy()
y = X.pop("fare_amount")
# optimizer
LG_BO = BayesianOptimization(LGB_bayesian, bounds_LGB, random_state = seed)
# find the best hyperparameters
LG_BO.maximize(init_points = 5, n_iter = 50)
# tuned hyperparameters
LG_BO.max["params"]
Now that hyperparameter tuning is done, we can load the full training data and train LightGBM on it with the tuned hyperparameters.
%%time
chunksize = 5_000_000
df_list = []
for df_chunk in tqdm(pd.read_csv(f"{data_dir}/train.csv", usecols = columns, dtype = dtypes, chunksize = chunksize)):
    df_chunk['pickup_datetime'] = datetime_parser(df_chunk)
    df_chunk = preprocessor(df_chunk)
    df_list.append(df_chunk)
full_df = pd.concat(df_list)
full_df.info()
We'll now delete the list of chunks to free memory, and then save the full processed training data in .parquet format for faster reading in the future.
# delete the importing list
del df_list
# save the full processed train data
full_df.to_parquet("full_training_df.parquet", index = None)
# separate target
y = full_df.pop("fare_amount")
# create lightgbm dataset from the training data.
full_lgb_df = lgb.Dataset(full_df, label = y, free_raw_data = True)
We'll save the LightGBM dataset in binary format and then load it again after restarting the kernel. This will free up memory and ensure that only the processed training data is kept in memory.
# save the lightgbm dataset to binary format, which is much lighter (about 600 MB)
full_lgb_df.save_binary("full_train.bin")
# exit the system, which will force the notebook to restart
os._exit(00)
# import the required libraries
import lightgbm as lgb
import pandas as pd
import warnings
import psutil
# set parameters
core_count = psutil.cpu_count(logical = False)
seed = 42
params = {'bagging_fraction': 1.0,
'bagging_freq': 4,
'feature_fraction': 1.0,
'lambda_l1': 1.7911760022903485,
'lambda_l2': 3.0,
'learning_rate': 0.1,
'max_depth': 8,
'min_data_in_leaf': 5,
'min_gain_to_split': 0.0,
'min_sum_hessian_in_leaf': 0.01,
'num_leaves': 16,
'seed': seed,
'feature_fraction_seed': seed,
'bagging_seed': seed,
'drop_seed': seed,
'boosting_type': 'gbdt',
'metric': 'rmse',
'force_col_wise': True,
'num_threads': core_count,
'device': 'gpu'}
# train on full data
train = lgb.Dataset("full_train.bin")
warnings.filterwarnings("ignore")
full_bst = lgb.train(params, train, num_boost_round = 400,
valid_sets = [train], verbose_eval = 25)
# Prediction and Submission
test_df = pd.read_csv("preprocessed_test_df.csv", index_col = "key")
submission_df = pd.read_csv(data_dir + "/sample_submission.csv", index_col = "key")
test_preds = full_bst.predict(test_df)
submission_df["fare_amount"] = test_preds
submission_df.to_csv("nyc_full_tuned1.csv")
submission_df.head()
This submission gives us our best score yet, an RMSE of 3.21. This is our final submission in this project, and it lands us in the top 30%. We can now save this model for future use, and we can also analyse what it tells us about the importance of the various features in the data.
# save model
full_bst.save_model("nyc_full_tuned_model.bin")
# plot feature importance
lgb.plot_importance(full_bst)
Observations
As we can see, the trip distance feature that we added through feature engineering contributed the most to predicting taxi fares. It is followed by other location-related features: the pickup coordinates, the distance from JFK airport and the dropoff coordinates. Next in importance is the year of the taxi trip, which could suggest that fares increased over the years.
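To go beyond the plot, we can also inspect the raw importance values; a minimal sketch (assuming the trained booster full_bst from above, and using gain-based importance rather than the plot's default split counts):
# sort features by gain-based importance (illustrative)
importance = pd.Series(full_bst.feature_importance(importance_type = "gain"),
                       index = full_bst.feature_name()).sort_values(ascending = False)
print(importance.head(10))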
In this project, we worked on the problem of predicting taxi fares in New York City using the dataset provided by Google Cloud and Coursera. The project involved dealing with multiple challenges, handling big data being the primary one. We imported only 1 percent of the training data to select the ML model and to tune its hyperparameters. Before doing so, we analysed and visualized the data, which also informed the feature engineering. After preparing the data for training, we moved on to selecting an ML model for full training. The initial results suggested that LightGBM worked best on this problem. We then tuned its hyperparameters using Bayesian optimization. After that, we were ready to train the model on the full data, so we imported the full training data and applied the preprocessing to it. Before the final model training, we optimized the process by saving the processed training data in LightGBM's binary format and freeing memory by restarting the kernel.
The final training and submission on Kaggle gave us an RMSE of 3.21. We also analysed the feature importances found by the model, which suggested that location-based features, such as the trip distance we calculated with the haversine formula in the feature engineering section, along with the pickup longitude and latitude, were the most important for making predictions.
Thanks for reading.