Introduction

In this project, we'll participate in a Kaggle playground competition - New York City Taxi Fare Prediction. In this competition, hosted in partnership with Google Cloud and Coursera, we are tasked with predicting the fare amount for a taxi ride in New York City given the pickup and dropoff locations. The competition provides three files in csv format - train.csv, test.csv and sample_submission.csv. We have to train ML algorithms on the data provided in train.csv, make predictions on test.csv, and submit those predictions to Kaggle in the format shown in sample_submission.csv. The submission is evaluated using the root mean squared error (RMSE) between the predictions in the submission file and the corresponding ground truth. Therefore, our primary goal in this project will be to minimize this RMSE score.
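For reference, the RMSE can be computed with scikit-learn's mean_squared_error by passing squared = False; a minimal sketch with made-up fare values:

# RMSE sketch with hypothetical values (not from the competition data)
from sklearn.metrics import mean_squared_error
import numpy as np

y_true = np.array([4.5, 16.9, 5.7])   # hypothetical ground-truth fares
y_pred = np.array([5.0, 15.0, 6.0])   # hypothetical predictions
print(mean_squared_error(y_true, y_pred, squared = False))  # square root of the mean squared error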

We begin the project by first setting up the training environment and loading the files.

Setup

First, we'll import the required libraries and get the data directory.

Import required libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.metrics import mean_squared_error # Scoring metric
from sklearn.model_selection import train_test_split # validation
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder # data preparation
import psutil # get cpu info
from bayes_opt import BayesianOptimization # hyperparameter tuning
from tqdm import tqdm # progress meter

# data visualization
import matplotlib.pyplot as plt 
import seaborn as sns 

# ML models
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb

# model pipelines
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

# get file path
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/new-york-city-taxi-fare-prediction/sample_submission.csv
/kaggle/input/new-york-city-taxi-fare-prediction/GCP-Coupons-Instructions.rtf
/kaggle/input/new-york-city-taxi-fare-prediction/train.csv
/kaggle/input/new-york-city-taxi-fare-prediction/test.csv

Load dataset files

We'll first inspect the data using shell commands.

data_dir = "/kaggle/input/new-york-city-taxi-fare-prediction"

!ls -l {data_dir} # get the files info
total 5564952
-rw-r--r-- 1 nobody nogroup        486 Apr  2 14:07 GCP-Coupons-Instructions.rtf
-rw-r--r-- 1 nobody nogroup     343271 Apr  2 14:07 sample_submission.csv
-rw-r--r-- 1 nobody nogroup     983020 Apr  2 14:07 test.csv
-rw-r--r-- 1 nobody nogroup 5697178298 Apr  2 14:07 train.csv
# get the no. of lines in the train dataset
!wc -l {data_dir}/train.csv 
55423856 /kaggle/input/new-york-city-taxi-fare-prediction/train.csv
# get the no. of lines in the test dataset
!wc -l {data_dir}/test.csv 
9914 /kaggle/input/new-york-city-taxi-fare-prediction/test.csv
# get the no. of lines in the sample submission.
!wc -l {data_dir}/sample_submission.csv 
9915 /kaggle/input/new-york-city-taxi-fare-prediction/sample_submission.csv

Observations

  • This is a supervised learning regression problem
  • Training data is about 5.7 GB in size
  • Training data has about 55 million rows
  • Test set is much smaller (< 10,000 rows)

Loading the whole dataset directly would slow down our initial training. Thus, we'll initially import only about 1% of it. This will still give us about 550k rows of data to train our models, which should be enough for initial training. Once we have finalized our model and its hyperparameters, we can then use the full dataset to train the final model and make the final predictions.

We'll now inspect how the dataset is structured and import only the columns we need.

# Train set
!head -n 20 {data_dir}/train.csv
key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.84161,40.712278,1
2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.76127,-73.991242,40.750562,2
2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.98713,40.733143,-73.991567,40.758092,1
2010-03-09 07:51:00.000000135,5.3,2010-03-09 07:51:00 UTC,-73.968095,40.768008,-73.956655,40.783762,1
2011-01-06 09:50:45.0000002,12.1,2011-01-06 09:50:45 UTC,-74.000964,40.73163,-73.972892,40.758233,1
2012-11-20 20:35:00.0000001,7.5,2012-11-20 20:35:00 UTC,-73.980002,40.751662,-73.973802,40.764842,1
2012-01-04 17:22:00.00000081,16.5,2012-01-04 17:22:00 UTC,-73.9513,40.774138,-73.990095,40.751048,1
2012-12-03 13:10:00.000000125,9,2012-12-03 13:10:00 UTC,-74.006462,40.726713,-73.993078,40.731628,1
2009-09-02 01:11:00.00000083,8.9,2009-09-02 01:11:00 UTC,-73.980658,40.733873,-73.99154,40.758138,2
2012-04-08 07:30:50.0000002,5.3,2012-04-08 07:30:50 UTC,-73.996335,40.737142,-73.980721,40.733559,1
2012-12-24 11:24:00.00000098,5.5,2012-12-24 11:24:00 UTC,0,0,0,0,3
2009-11-06 01:04:03.0000002,4.1,2009-11-06 01:04:03 UTC,-73.991601,40.744712,-73.983081,40.744682,2
2013-07-02 19:54:00.000000232,7,2013-07-02 19:54:00 UTC,-74.00536,40.728867,-74.008913,40.710907,1
2011-04-05 17:11:05.0000001,7.7,2011-04-05 17:11:05 UTC,-74.001821,40.737547,-73.99806,40.722788,2
2013-11-23 12:57:00.000000190,5,2013-11-23 12:57:00 UTC,0,0,0,0,1
2014-02-19 07:22:00.00000074,12.5,2014-02-19 07:22:00 UTC,-73.98643,40.760465,-73.98899,40.737075,1
2009-07-22 16:08:00.000000163,5.3,2009-07-22 16:08:00 UTC,-73.98106,40.73769,-73.994177,40.728412,1
2010-07-07 14:52:00.00000044,5.3,2010-07-07 14:52:00 UTC,-73.969505,40.784843,-73.958732,40.783357,1
# Test set
!head -n 20 {data_dir}/test.csv
key,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2015-01-27 13:08:24.0000002,2015-01-27 13:08:24 UTC,-73.973320007324219,40.7638053894043,-73.981430053710938,40.74383544921875,1
2015-01-27 13:08:24.0000003,2015-01-27 13:08:24 UTC,-73.986862182617188,40.719383239746094,-73.998886108398438,40.739200592041016,1
2011-10-08 11:53:44.0000002,2011-10-08 11:53:44 UTC,-73.982524,40.75126,-73.979654,40.746139,1
2012-12-01 21:12:12.0000002,2012-12-01 21:12:12 UTC,-73.98116,40.767807,-73.990448,40.751635,1
2012-12-01 21:12:12.0000003,2012-12-01 21:12:12 UTC,-73.966046,40.789775,-73.988565,40.744427,1
2012-12-01 21:12:12.0000005,2012-12-01 21:12:12 UTC,-73.960983,40.765547,-73.979177,40.740053,1
2011-10-06 12:10:20.0000001,2011-10-06 12:10:20 UTC,-73.949013,40.773204,-73.959622,40.770893,1
2011-10-06 12:10:20.0000003,2011-10-06 12:10:20 UTC,-73.777282,40.646636,-73.985083,40.759368,1
2011-10-06 12:10:20.0000002,2011-10-06 12:10:20 UTC,-74.014099,40.709638,-73.995106,40.741365,1
2014-02-18 15:22:20.0000002,2014-02-18 15:22:20 UTC,-73.969582,40.765519,-73.980686,40.770725,1
2014-02-18 15:22:20.0000003,2014-02-18 15:22:20 UTC,-73.989374,40.741973,-73.9993,40.722534,1
2014-02-18 15:22:20.0000001,2014-02-18 15:22:20 UTC,-74.001614,40.740893,-73.956387,40.767437,1
2010-03-29 20:20:32.0000002,2010-03-29 20:20:32 UTC,-73.991198,40.739937,-73.997166,40.735269,1
2010-03-29 20:20:32.0000001,2010-03-29 20:20:32 UTC,-73.982034,40.762723,-74.001867,40.761545,1
2011-10-06 03:59:12.0000002,2011-10-06 03:59:12 UTC,-73.992455,40.728701,-73.983397,40.750149,1
2011-10-06 03:59:12.0000001,2011-10-06 03:59:12 UTC,-73.983583,40.746993,-73.951178,40.785903,1
2012-07-15 16:45:04.0000006,2012-07-15 16:45:04 UTC,-74.006746,40.731721,-74.010204,40.732318,1
2012-07-15 16:45:04.0000002,2012-07-15 16:45:04 UTC,-73.976446,40.785598,-73.95222,40.772121,1
2012-07-15 16:45:04.0000003,2012-07-15 16:45:04 UTC,-73.973548,40.763349,-73.972096,40.756417,1
# Sample Submission
!head -n 20 {data_dir}/sample_submission.csv
key,fare_amount
2015-01-27 13:08:24.0000002,11.35
2015-01-27 13:08:24.0000003,11.35
2011-10-08 11:53:44.0000002,11.35
2012-12-01 21:12:12.0000002,11.35
2012-12-01 21:12:12.0000003,11.35
2012-12-01 21:12:12.0000005,11.35
2011-10-06 12:10:20.0000001,11.35
2011-10-06 12:10:20.0000003,11.35
2011-10-06 12:10:20.0000002,11.35
2014-02-18 15:22:20.0000002,11.35
2014-02-18 15:22:20.0000003,11.35
2014-02-18 15:22:20.0000001,11.35
2010-03-29 20:20:32.0000002,11.35
2010-03-29 20:20:32.0000001,11.35
2011-10-06 03:59:12.0000002,11.35
2011-10-06 03:59:12.0000001,11.35
2012-07-15 16:45:04.0000006,11.35
2012-07-15 16:45:04.0000002,11.35
2012-07-15 16:45:04.0000003,11.35

Observations

  • The training set has 8 columns:
    • key (a unique identifier)
    • fare_amount (target column)
    • pickup_datetime
    • pickup_longitude
    • pickup_latitude
    • dropoff_longitude
    • dropoff_latitude
    • passenger_count
  • The test set has all columns except the target column fare_amount.
  • The submission file should contain the key and fare_amount for each test sample.
  • The key column is essentially the pickup_datetime value with a counter suffix appended.
  • Because key doesn't offer any new information, we can drop it.
  • Some of the entries have 0 in the longitude and latitude columns, which will need further inspection in the EDA section.
  • We can optimize file reading by setting dtypes in advance rather than leaving it to pandas to infer them, which adds overhead. Date parsing will be the exception here: we'll do it separately using string methods, which is much faster.
  • The data looks to be randomized with respect to pickup_datetime. Because we want a randomized, representative sample of the training dataset for initial training, we can simply import the first 1% of the rows.
# columns to import
columns = "fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count".split(",")
columns
['fare_amount',
 'pickup_datetime',
 'pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'passenger_count']
%%time

# set dtypes
dtypes = {'fare_amount': "float32",
 'pickup_longitude': "float32",
 'pickup_latitude': "float32",
 'dropoff_longitude': "float32",
 'dropoff_latitude': "float32",
 'passenger_count': "uint8"}

# set nrows to import
nrows_to_import = 550_000

# import the training data
df = pd.read_csv(f"{data_dir}/train.csv", usecols = columns, dtype = dtypes, nrows = nrows_to_import)
CPU times: user 989 ms, sys: 111 ms, total: 1.1 s
Wall time: 1.1 s
# inspect df
df.head()
fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
0 4.5 2009-06-15 17:26:21 UTC -73.844315 40.721317 -73.841614 40.712276 1
1 16.9 2010-01-05 16:52:16 UTC -74.016045 40.711304 -73.979271 40.782005 1
2 5.7 2011-08-18 00:35:00 UTC -73.982735 40.761269 -73.991241 40.750561 2
3 7.7 2012-04-21 04:30:42 UTC -73.987129 40.733143 -73.991570 40.758091 1
4 5.3 2010-03-09 07:51:00 UTC -73.968094 40.768009 -73.956657 40.783764 1

Observations

The data has loaded correctly. Now we'll change the column pickup_datetime dtype to datetime.

%%time

# parse dates
def datetime_parser(dataframe):
    datetime = dataframe["pickup_datetime"].str.slice(0, 16)
    datetime = pd.to_datetime(datetime, utc = True, format = "%Y-%m-%d %H:%M")

    return datetime


df["pickup_datetime"] = datetime_parser(df)

# inspect dtypes
df.dtypes
CPU times: user 488 ms, sys: 37.9 ms, total: 526 ms
Wall time: 529 ms
fare_amount                      float32
pickup_datetime      datetime64[ns, UTC]
pickup_longitude                 float32
pickup_latitude                  float32
dropoff_longitude                float32
dropoff_latitude                 float32
passenger_count                    uint8
dtype: object

Now that all the columns are loaded correctly, we can similarly load the test dataset and the sample submission file.

# load test set
test_df = pd.read_csv(data_dir + "/test.csv", index_col = "key", dtype = dtypes)
test_df["pickup_datetime"] = datetime_parser(test_df)

test_df.dtypes
pickup_datetime      datetime64[ns, UTC]
pickup_longitude                 float32
pickup_latitude                  float32
dropoff_longitude                float32
dropoff_latitude                 float32
passenger_count                    uint8
dtype: object
test_df
pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
key
2015-01-27 13:08:24.0000002 2015-01-27 13:08:00+00:00 -73.973320 40.763805 -73.981430 40.743835 1
2015-01-27 13:08:24.0000003 2015-01-27 13:08:00+00:00 -73.986862 40.719383 -73.998886 40.739201 1
2011-10-08 11:53:44.0000002 2011-10-08 11:53:00+00:00 -73.982521 40.751259 -73.979652 40.746140 1
2012-12-01 21:12:12.0000002 2012-12-01 21:12:00+00:00 -73.981163 40.767807 -73.990448 40.751637 1
2012-12-01 21:12:12.0000003 2012-12-01 21:12:00+00:00 -73.966049 40.789776 -73.988564 40.744427 1
... ... ... ... ... ... ...
2015-05-10 12:37:51.0000002 2015-05-10 12:37:00+00:00 -73.968124 40.796997 -73.955643 40.780388 6
2015-01-12 17:05:51.0000001 2015-01-12 17:05:00+00:00 -73.945511 40.803600 -73.960213 40.776371 6
2015-04-19 20:44:15.0000001 2015-04-19 20:44:00+00:00 -73.991600 40.726608 -73.789742 40.647011 6
2015-01-31 01:05:19.0000005 2015-01-31 01:05:00+00:00 -73.985573 40.735432 -73.939178 40.801731 6
2015-01-18 14:06:23.0000006 2015-01-18 14:06:00+00:00 -73.988022 40.754070 -74.000282 40.759220 6

9914 rows × 6 columns

# load sample submission
submission_df = pd.read_csv(data_dir + "/sample_submission.csv", index_col = "key")
submission_df.head()
fare_amount
key
2015-01-27 13:08:24.0000002 11.35
2015-01-27 13:08:24.0000003 11.35
2011-10-08 11:53:44.0000002 11.35
2012-12-01 21:12:12.0000002 11.35
2012-12-01 21:12:12.0000003 11.35

Now, we can set the random state seed and then move on to EDA and Data Preparation.

# set random state seed
seed = 42

Exploratory Data Analysis

Preliminary Analysis

Inspect Training Data

# inspect top 5 rows
df.head()
fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
0 4.5 2009-06-15 17:26:00+00:00 -73.844315 40.721317 -73.841614 40.712276 1
1 16.9 2010-01-05 16:52:00+00:00 -74.016045 40.711304 -73.979271 40.782005 1
2 5.7 2011-08-18 00:35:00+00:00 -73.982735 40.761269 -73.991241 40.750561 2
3 7.7 2012-04-21 04:30:00+00:00 -73.987129 40.733143 -73.991570 40.758091 1
4 5.3 2010-03-09 07:51:00+00:00 -73.968094 40.768009 -73.956657 40.783764 1
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550000 entries, 0 to 549999
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype              
---  ------             --------------   -----              
 0   fare_amount        550000 non-null  float32            
 1   pickup_datetime    550000 non-null  datetime64[ns, UTC]
 2   pickup_longitude   550000 non-null  float32            
 3   pickup_latitude    550000 non-null  float32            
 4   dropoff_longitude  549994 non-null  float32            
 5   dropoff_latitude   549994 non-null  float32            
 6   passenger_count    550000 non-null  uint8              
dtypes: datetime64[ns, UTC](1), float32(5), uint8(1)
memory usage: 15.2 MB
# basic statistics
df.describe()
fare_amount pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
count 550000.000000 550000.000000 550000.000000 549994.000000 549994.000000 550000.000000
mean 11.349306 -72.322418 39.845139 -72.320076 39.841064 1.683707
std 9.882444 12.547520 7.917986 11.692136 7.288269 1.307607
min -44.900002 -3377.680908 -3116.285400 -3383.296631 -2559.749023 0.000000
25% 6.000000 -73.992043 40.734943 -73.991386 40.734058 1.000000
50% 8.500000 -73.981789 40.752682 -73.980141 40.753136 1.000000
75% 12.500000 -73.967110 40.767094 -73.963600 40.768124 2.000000
max 500.000000 2140.601074 1703.092773 40.851028 404.616669 6.000000
# train data time range
df.pickup_datetime.min(), df.pickup_datetime.max()
(Timestamp('2009-01-01 00:31:00+0000', tz='UTC'),
 Timestamp('2015-06-30 23:38:00+0000', tz='UTC'))

Observations

  • The first 5 rows look fine.
  • There are some null values in the drop location columns.
  • There are some negative values in the fare amount and the location features. These will need to be dealt with.
  • The max values also suggest that there are some outliers, which we'll also deal with.
  • The training data looks to be from 1st Jan 2009 to 30th June 2015.
  • The data takes about 15 MB in memory.

Inspect Test Data

test_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9914 entries, 2015-01-27 13:08:24.0000002 to 2015-01-18 14:06:23.0000006
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   pickup_datetime    9914 non-null   datetime64[ns, UTC]
 1   pickup_longitude   9914 non-null   float32            
 2   pickup_latitude    9914 non-null   float32            
 3   dropoff_longitude  9914 non-null   float32            
 4   dropoff_latitude   9914 non-null   float32            
 5   passenger_count    9914 non-null   uint8              
dtypes: datetime64[ns, UTC](1), float32(4), uint8(1)
memory usage: 319.5+ KB
# basic statistics
test_df.describe()
pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
count 9914.000000 9914.000000 9914.000000 9914.000000 9914.000000
mean -73.976181 40.750954 -73.974945 40.751553 1.671273
std 0.042799 0.033542 0.039093 0.035436 1.278747
min -74.252190 40.573143 -74.263245 40.568974 1.000000
25% -73.992500 40.736125 -73.991249 40.735253 1.000000
50% -73.982327 40.753052 -73.980015 40.754065 1.000000
75% -73.968012 40.767113 -73.964062 40.768757 2.000000
max -72.986534 41.709557 -72.990967 41.696682 6.000000
# test data time range
test_df["pickup_datetime"].min(), test_df["pickup_datetime"].max()
(Timestamp('2009-01-01 11:04:00+0000', tz='UTC'),
 Timestamp('2015-06-30 20:03:00+0000', tz='UTC'))

Observations

  • There are 9914 rows of data.
  • There are no null values in the test data.
  • There are no obvious data entry errors as seen from the basic statistics.
  • Its time range is also from 1st Jan 2009 to 30th June 2015.
  • Its min and max values can inform us on removing outliers from the training data.
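A minimal sketch of pulling those bounds out of the test set programmatically (the cutoffs used later in feature engineering are these values rounded outwards):

# derive rough coordinate and passenger-count bounds from the test set
bound_cols = ["pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude", "passenger_count"]
test_bounds = test_df[bound_cols].agg(["min", "max"])
print(test_bounds)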

EDA

In this subsection, we'll answer the following questions using the data:

  1. In which locations are rides mostly located?
  2. What is the busiest day of the week?
  3. What is the busiest time of the day?
  4. In which month are fares the highest?
  5. Which pickup locations have the highest fares?
  6. Which drop locations have the highest fares?
  7. What is the average ride distance?

Q1. What is the busiest day of the week?

# Extract day of week and their counts
dayofweek = df["pickup_datetime"].dt.dayofweek.value_counts()

dayofweek.sort_values(ascending = False)
4    84826
5    83691
3    82322
2    79525
1    76975
6    72016
0    70645
Name: pickup_datetime, dtype: int64
%matplotlib inline

# visualize its frequency
sns.barplot(x = dayofweek.index, y = dayofweek)
<AxesSubplot:ylabel='pickup_datetime'>

Observations

  • In this plot, week starts with Monday denoted by 0 and ends on Sunday denoted by 6.
  • The number of taxi trips increases as the week goes by, peaking on Friday and then falling off on Sunday.
  • Thus, Friday is the busiest day of the week, followed closely by Saturday.

Q2. What is the busiest time of the day?

# Extract hour of day and its counts
hourofday = df["pickup_datetime"].dt.hour.value_counts()

hourofday
19    34510
18    33091
20    32169
21    31448
22    30594
14    27994
23    27314
12    27141
17    27053
13    26901
15    26408
9     25860
11    25656
8     24847
10    24706
16    22643
0     21647
7     20010
1     16019
2     11983
6     11352
3      8757
4      6437
5      5460
Name: pickup_datetime, dtype: int64
# visualize frequency
sns.barplot(x = hourofday.index, y = hourofday)
<AxesSubplot:ylabel='pickup_datetime'>

Observations

  • The number of trips is at its lowest between 5-6 AM, after which it starts to rise.
  • It peaks in the evening between 7-8 PM, and then starts to fall again.
  • The busiest hour of the day is therefore between 7-8 PM.

Q3. In which month are fares the highest?

# Extract months from datetime and get their average fare_amount
month_avg_fare = df.groupby(by = df["pickup_datetime"].dt.month)["fare_amount"].agg("mean")

month_avg_fare = month_avg_fare.sort_values()

month_avg_fare
pickup_datetime
1     10.768173
2     10.864570
3     11.129817
7     11.140449
8     11.204127
4     11.267707
6     11.540161
11    11.572874
12    11.625045
5     11.654774
10    11.690510
9     11.816459
Name: fare_amount, dtype: float32
# visualize
sns.barplot(x = month_avg_fare, y = month_avg_fare.index, orient = "h")
plt.ylabel("Month")
Text(0, 0.5, 'Month')

Observations

  • The fare_amount is the highest in September.
  • Taxi fares look to generally rise as the year goes on.

Q4. Which pickup locations have the highest fares?

# remove outliers and extract pickup data
pickup = df[['pickup_longitude', 'pickup_latitude', 'fare_amount']]
pickup = pickup.loc[(pickup['pickup_latitude'] >= 40.5) &
                    (pickup['pickup_latitude'] <= 41) &
                    (pickup['pickup_longitude'] >= -74.1) &
                    (pickup['pickup_longitude'] <= -73.7) &
                    (pickup['fare_amount'] > 0) &
                    (pickup['fare_amount'] <= 200)]

# plot taxi fares' distribution
sns.kdeplot(pickup['fare_amount'])
<AxesSubplot:xlabel='fare_amount', ylabel='Density'>

The taxi fares look to be highly skewed with quite a few outliers. We can confirm this by looking at the skewness value.

# Taxi fare skewness
pickup['fare_amount'].skew()
3.3180816
# Visualize its outliers
sns.boxplot(pickup['fare_amount'])
<AxesSubplot:xlabel='fare_amount'>
# get fare limits to remove outliers in visualization
Q1, Q3 = pickup['fare_amount'].quantile(q = [0.25, 0.75]).values
IQR = Q3 - Q1
fare_min, fare_max = Q1 - (1.5 * IQR), Q3 + (1.5 * IQR)

fare_min, fare_max
(-3.75, 22.25)
# visualize
plt.figure(figsize = (16, 12))

ax = sns.scatterplot(data = pickup, x='pickup_longitude', y='pickup_latitude', hue = pickup['fare_amount'],
                     palette = 'rocket_r', hue_norm = (fare_min, fare_max), s = 0.1)
norm = plt.Normalize(fare_min, fare_max)
sm = plt.cm.ScalarMappable(cmap="rocket_r", norm=norm)
sm.set_array([])

ax.get_legend().remove()
cbar = ax.figure.colorbar(sm)
cbar.set_label('Fare Amount', rotation=270)

plt.show()

We can look at the map below to get a sense of where the taxi fares are high.

Observations

  • Pickups from John F. Kennedy Airport and from East Elmhurst, which also has an airport (visible using the satellite option on Google Maps), are generally more expensive.
  • The southern side of the city also looks to be more expensive.

This suggests that maybe some landmarks like airports mean high taxi fares.
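As a quick, rough check of that idea, we can compare the mean fare of pickups inside an approximate bounding box around JFK with the overall mean fare; the box coordinates below are eyeballed for illustration only:

# approximate bounding box around JFK Airport (illustrative coordinates)
jfk_box = pickup.loc[pickup["pickup_longitude"].between(-73.82, -73.76) &
                     pickup["pickup_latitude"].between(40.62, 40.67)]
print(jfk_box["fare_amount"].mean(), pickup["fare_amount"].mean())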

Q5. Which drop locations have the highest fares?

Similar to pickup locations, we can look at the drop locations.

# remove outliers and extract dropoff data
dropoff = df[['dropoff_longitude', 'dropoff_latitude', 'fare_amount']]
dropoff = dropoff.loc[(dropoff['dropoff_latitude'] >= 40.5) &
                    (dropoff['dropoff_latitude'] <= 41) &
                    (dropoff['dropoff_longitude'] >= -74.1) &
                    (dropoff['dropoff_longitude'] <= -73.7) &
                    (dropoff['fare_amount'] > 0) &
                    (dropoff['fare_amount'] <= 200)]
# visualize
plt.figure(figsize = (16, 12))

ax = sns.scatterplot(data = dropoff, x='dropoff_longitude', y='dropoff_latitude', hue = dropoff['fare_amount'],
                     palette = 'rocket_r', hue_norm = (fare_min, fare_max), s = 0.1)
norm = plt.Normalize(fare_min, fare_max)
sm = plt.cm.ScalarMappable(cmap="rocket_r", norm=norm)
sm.set_array([])

ax.get_legend().remove()
c_bar = ax.figure.colorbar(sm)
c_bar.set_label("Fare Amount", rotation = 270)

plt.show()

Observations

  • This map looks very similar to the map of pickup locations.
  • Popular landmarks seem to have expensive taxi rides here also.
  • One difference is that there are quite a few dropoffs outside the city - in other words, there are more dropoffs than pickups outside the city.

Data Preparation

Null Values

# number of null values
df.isnull().sum()
fare_amount          0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    6
dropoff_latitude     6
passenger_count      0
dtype: int64

There are 6 missing values in the dropoff_longitude and the dropoff_latitude columns. This is quite a small number considering the dataset size. So, we'll drop the rows with the missing values instead of trying to fill them.

# drop null values
df.dropna(inplace = True)

Separate target from predictors

# Separate target from predictors
X = df.copy()
y = X.pop("fare_amount")
# predictors
X.head()
pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
0 2009-06-15 17:26:00+00:00 -73.844315 40.721317 -73.841614 40.712276 1
1 2010-01-05 16:52:00+00:00 -74.016045 40.711304 -73.979271 40.782005 1
2 2011-08-18 00:35:00+00:00 -73.982735 40.761269 -73.991241 40.750561 2
3 2012-04-21 04:30:00+00:00 -73.987129 40.733143 -73.991570 40.758091 1
4 2010-03-09 07:51:00+00:00 -73.968094 40.768009 -73.956657 40.783764 1
# target
y.head()
0     4.5
1    16.9
2     5.7
3     7.7
4     5.3
Name: fare_amount, dtype: float32

Split the training data into training and validation sets.

# make train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.25, random_state = seed)

Train Hardcoded & Baseline Models

We'll now train a hardcoded model and a baseline model to establish some scores that our future models should at least beat. If they don't, that would indicate some error in training.

  • Hardcoded model: Will always predict average fare
  • Baseline model: Linear regression

Train and Evaluate Hardcoded Model.

# train hardcoded model
dummy_model = DummyRegressor(strategy = "mean")

dummy_model.fit(X_train, y_train)
DummyRegressor()
# make training predictions
train_dummy_preds = dummy_model.predict(X_train)

train_dummy_preds
array([11.347245, 11.347245, 11.347245, ..., 11.347245, 11.347245,
       11.347245], dtype=float32)
# score
mean_squared_error(y_train, train_dummy_preds, squared = False)
9.886141
# make validation predictions
valid_dummy_preds = dummy_model.predict(X_valid)

valid_dummy_preds
array([11.347245, 11.347245, 11.347245, ..., 11.347245, 11.347245,
       11.347245], dtype=float32)
# score
mean_squared_error(y_valid, valid_dummy_preds, squared = False)
9.8728695

Train and Evaluate Baseline Model

Because LinearRegression can't take the datetime dtype as input, we'll drop the pickup_datetime column.

# remove "pickup_datetime" from input column for training
input_cols = X_train.columns[1:]
input_cols
Index(['pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
       'dropoff_latitude', 'passenger_count'],
      dtype='object')
# train baseline model
base_model = LinearRegression()

base_model.fit(X_train[input_cols], y_train)
LinearRegression()
# make training predictions
train_base_preds = base_model.predict(X_train[input_cols])

train_base_preds
array([11.262208 , 11.2621765, 11.262113 , ..., 11.681306 , 11.262474 ,
       11.471427 ], dtype=float32)
# score
mean_squared_error(y_train, train_base_preds, squared = False)
9.884694
# make validation predictions
valid_base_preds = base_model.predict(X_valid[input_cols])

valid_base_preds
array([11.262225 , 11.785624 , 11.262499 , ..., 11.2622795, 11.262311 ,
       11.366884 ], dtype=float32)
# score
mean_squared_error(y_valid, valid_base_preds, squared = False)
9.871208

Observations

The linear regression model is off by $\$$9.871, which isn't much better than simply predicting the average, which was off by $\$$9.873.

This is mainly because the training data (geocoordinates) is not in a format that's useful for the model, and we're not using one of the most important columns: pickup date & time.

However, now we have a baseline that our other models should ideally beat.

Make submission

# make test predictions
test_base_preds = base_model.predict(test_df[input_cols])

test_base_preds
array([11.262217, 11.262282, 11.262238, ..., 11.786515, 11.785327,
       11.78521 ], dtype=float32)
# save_submission_file
submission_df["fare_amount"] = test_base_preds
submission_df.to_csv("baseline_model.csv")

Submission also gives a similar score of 9.406.

Feature engineering

Now that we have a baseline score, we'll do some feature engineering, in which we'll:

  • Extract parts of datetime
  • Remove outliers & invalid data
  • Add distance between pickup & drop
  • Add distance from landmarks

All this will be done using functions, which we'll combine into a single preprocessing function in the end.

Extract parts of datetime.

The datetime column alone can give us several useful features for training the model, such as the day of the month, day of the week, month, and year.

# Extract parts of datetime. 

def add_datetime_cols(dataframe):
    dataframe["date"] = dataframe["pickup_datetime"].dt.day
    dataframe["month"] = dataframe["pickup_datetime"].dt.month
    dataframe["weekday"] = dataframe["pickup_datetime"].dt.dayofweek
    dataframe["year"] = dataframe["pickup_datetime"].dt.year
    return dataframe
    

Remove outliers & invalid data

There seems to be some invalid data in each of the following columns:

  • Fare amount
  • Passenger count
  • Pickup latitude & longitude
  • Drop latitude & longitude

Using the limits from the test data as a guide, we'll restrict the data to these ranges:

  • fare_amount: 1 to 200
  • longitudes: -75 to -72
  • latitudes: 40 to 42
  • passenger_count: 1 to 10
# remove outliers
def remove_outliers(dataframe):
    dataframe = dataframe.loc[(dataframe["fare_amount"] >= 1) &
                             (dataframe["fare_amount"] <= 200) &
                             (dataframe["pickup_longitude"] >= -75) &
                             (dataframe["pickup_longitude"] <= -72) &
                             (dataframe["pickup_latitude"] >= 40) &
                             (dataframe["pickup_latitude"] <= 42) &
                             (dataframe["dropoff_longitude"] >= -75) &
                             (dataframe["dropoff_longitude"] <= -72) &
                             (dataframe["dropoff_latitude"] >= 40) &
                             (dataframe["dropoff_latitude"] <= 42) &
                             (dataframe["passenger_count"] >= 1) &
                             (dataframe["passenger_count"] <= 10)]
    return dataframe

Add distance between pickup & drop

This is usually the biggest factor affecting the taxi fare. While we cannot know the actual route or the total distance the taxi covered, from the given data we can calculate the straight-line distance between the pickup and dropoff points using the haversine formula. The formula for calculating it is taken from here.

# haversine for calculating distance
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance in kilometers between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles. Determines return value units.
    return c * r


# Add distance between pickup & drop
def add_distance(dataframe):
    dataframe["trip_dist"] = haversine(dataframe["pickup_longitude"], dataframe["pickup_latitude"], dataframe["dropoff_longitude"], dataframe["dropoff_latitude"])
    return dataframe
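As a quick sanity check of the helper, we can compute the straight-line distance between two well-known points, for example JFK Airport and Times Square (approximate coordinates); the result should come out to roughly 21-22 km:

# sanity check: JFK Airport to Times Square, expect roughly 21-22 km
print(haversine(-73.7781, 40.6413, -73.9855, 40.7580))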

Add distance from landmarks

We'll also add the distance between the dropoff location and some landmarks like:

  • JFK Airport
  • LGA Airport
  • EWR Airport
  • Times Square
  • Met Museum
  • World Trade Center

Because popular locations mean more traffic, which affects the waiting time in taxis, this information can also help with prediction.

The names of popular landmarks and their coordinates are taken from Google. To calculate the distances, we'll use the haversine function we just created.

# landmark locations
jfk_lonlat = -73.7781, 40.6413
lga_lonlat = -73.8740, 40.7769
ewr_lonlat = -74.1745, 40.6895
met_lonlat = -73.9632, 40.7794
wtc_lonlat = -74.0099, 40.7126

# landmarks
landmarks = [("jfk", jfk_lonlat), ("lga", lga_lonlat), ("ewr", ewr_lonlat), ("met", met_lonlat), ("wtc", wtc_lonlat)]
# Add distance from landmarks
def add_dist_from_landmarks(dataframe):
    for lmrk_name, longlat in landmarks:
        dataframe[lmrk_name + "_dist"] = haversine(longlat[0], longlat[1], dataframe["dropoff_longitude"], dataframe["dropoff_latitude"])
    
    return dataframe

Preprocessor

Now, we'll combine all data preparation steps into a single function. It will:

  • drop null values if it is the training data
  • remove outliers if it is the training data
  • add additional datetime columns
  • add trip distance column
  • add distance from landmark columns
  • drop pickup_datetime column
# full data preparation
def preprocessor(dataframe):
    if "fare_amount" in dataframe:
        dataframe = dataframe.dropna()
        dataframe = remove_outliers(dataframe)
    dataframe = add_datetime_cols(dataframe)
    dataframe = add_distance(dataframe)
    dataframe = add_dist_from_landmarks(dataframe)
    
    dataframe = dataframe.drop(columns = ["pickup_datetime"])
    
    return dataframe
# Prepare the data
df = preprocessor(df) # training data
test_df = preprocessor(test_df) # test data

# save processed files for future use
df.to_csv("processed_train_1_perc.csv", index = None)
test_df.to_csv("preprocessed_test_df.csv")

# Separate training data from validation data
X = df.copy()
y = X.pop("fare_amount")
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.25, random_state = seed)
# inspect preprocessing result
X_train.head()
pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count date month weekday year trip_dist jfk_dist lga_dist ewr_dist met_dist wtc_dist
106009 -73.975952 40.776409 -73.979698 40.760006 2 14 12 5 2013 1.851406 21.518574 9.097258 18.191933 2.565198 5.853306
455614 -73.990707 40.751011 -73.982460 40.745422 1 2 8 0 2010 0.932484 20.758091 9.782166 17.338640 4.111363 4.320646
114078 -73.991280 40.730213 -73.985870 40.745068 2 14 12 4 2012 1.713230 20.975971 10.065125 17.055971 4.268357 4.139222
72126 -73.973801 40.784374 -73.978233 40.764561 1 17 6 1 2014 2.234654 21.736300 8.883799 18.525803 2.079320 6.364183
549968 -73.982056 40.776432 -73.790337 40.646755 2 22 7 6 2012 21.657000 1.197777 16.097822 32.747494 20.732004 19.909977
test_df.head()
pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count date month weekday year trip_dist jfk_dist lga_dist ewr_dist met_dist wtc_dist
key
2015-01-27 13:08:24.0000002 -73.973320 40.763805 -73.981430 40.743835 1 27 1 1 2015 2.323358 20.587837 9.766299 17.357740 4.242006 4.221359
2015-01-27 13:08:24.0000003 -73.986862 40.719383 -73.998886 40.739201 1 27 1 1 2015 2.425299 21.564516 11.323100 15.799543 5.386261 3.100082
2011-10-08 11:53:44.0000002 -73.982521 40.751259 -73.979652 40.746140 1 8 10 5 2011 0.618403 20.607008 9.532814 17.588009 3.949200 4.517339
2012-12-01 21:12:12.0000002 -73.981163 40.767807 -73.990448 40.751637 1 1 12 5 2012 1.960912 21.702991 10.201496 16.980310 3.846307 4.639962
2012-12-01 21:12:12.0000003 -73.966049 40.789776 -73.988564 40.744427 1 1 12 5 2012 5.387211 21.127256 10.302326 16.818926 4.436549 3.969716

The files have been successfully preprocessed. Now we'll select ML models for training.

Model Selection

In this section, we'll train some linear and some tree based models and compare their performance.

Linear Models

Prepare Data

We'll train a LinearRegression model and a Ridge model in this subsection. Before proceeding, we'll also scale the numerical columns and one-hot encode the categorical columns. This transformation will only be used with the linear models, which can benefit from it; the tree based models we'll train later work well without it. So, for the linear models, the additional steps will be to:

  • Scale numerical features with MinMaxScaler
  • Encode categorical features with OneHotEncoder
# encode categorical columns
cat_cols = pd.Index(["weekday", "month"])
ohe = OneHotEncoder(drop = "first", sparse = False)

# scale numeric columns
num_cols = X_train.columns.difference(cat_cols)
scaler = MinMaxScaler()

# combine transformation steps
lin_ct = make_column_transformer((ohe, cat_cols), (scaler, num_cols))

lin_ct
ColumnTransformer(transformers=[('onehotencoder',
                                 OneHotEncoder(drop='first', sparse=False),
                                 Index(['weekday', 'month'], dtype='object')),
                                ('minmaxscaler', MinMaxScaler(),
                                 Index(['date', 'dropoff_latitude', 'dropoff_longitude', 'ewr_dist', 'jfk_dist',
       'lga_dist', 'met_dist', 'passenger_count', 'pickup_latitude',
       'pickup_longitude', 'trip_dist', 'wtc_dist', 'year'],
      dtype='object'))])
# transform
X_lin_train = lin_ct.fit_transform(X_train)
X_lin_valid = lin_ct.transform(X_valid)

X_lin_train
array([[0.        , 0.        , 0.        , ..., 0.01670442, 0.04239011,
        0.66666667],
       [0.        , 0.        , 0.        , ..., 0.00841339, 0.03126889,
        0.16666667],
       [0.        , 0.        , 0.        , ..., 0.01545772, 0.02995245,
        0.5       ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.0341279 , 0.03019252,
        0.83333333],
       [0.        , 0.        , 0.        , ..., 0.0882236 , 0.03355575,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.06261117, 0.05642887,
        0.        ]])

Train and Evaluate

# evaluation function
def lin_eval(model):
    
    train_preds = model.predict(X_lin_train)
    train_rmse = mean_squared_error(y_train, train_preds, squared = False)
    
    val_preds = model.predict(X_lin_valid)
    val_rmse = mean_squared_error(y_valid, val_preds, squared = False)
    
    return train_rmse, val_rmse, train_preds, val_preds
# models
l_models = {"Linear Regression": LinearRegression(),
           "Ridge": Ridge(random_state = seed)}
l_scores = {}

# train and evaluate
for model_name, model in l_models.items():
    
    print(model_name)
    
    model.fit(X_lin_train, y_train)
    
    train_rmse, val_rmse, train_preds, val_preds = lin_eval(model)
    
    l_scores[model_name] = {"train_rmse": train_rmse,
                           "validation_rmse": val_rmse}
    
    print(l_scores[model_name])
    print("-" * 70)
    
    
Linear Regression
{'train_rmse': 5.10889904076924, 'validation_rmse': 5.024879081297552}
----------------------------------------------------------------------
Ridge
{'train_rmse': 5.110620297096774, 'validation_rmse': 5.024162833522911}
----------------------------------------------------------------------

The linear models show an improvement over the baseline model, scoring an RMSE of about 5. We'll now compare them with tree based models.

Tree based models

We'll train a DecisionTreeRegressor and a RandomForestRegressor in this subsection. These models don't need any further transformation of the data, so we can move directly to training and evaluation.

# evaluation function
def tree_eval(model):
    train_preds = model.predict(X_train)
    train_rmse = mean_squared_error(y_train, train_preds, squared = False)
    
    val_preds = model.predict(X_valid)
    val_rmse = mean_squared_error(y_valid, val_preds, squared = False)
    
    return train_rmse, val_rmse, train_preds, val_preds
# models
t_models = {"DecisionTree": DecisionTreeRegressor(max_depth = 6, random_state = seed),
           "RandomForest": RandomForestRegressor(max_depth = 6, n_jobs = -1, random_state = seed)}
t_scores = {}

# train and evaluate
for model_name, model in t_models.items():
    
    print(model_name)
    
    model.fit(X_train, y_train)
    
    train_rmse, val_rmse, train_preds, val_preds = tree_eval(model)
    
    t_scores[model_name] = {"train_rmse": train_rmse,
                           "validation_rmse": val_rmse}
    
    print(t_scores[model_name])
    print("-" * 70)
DecisionTree
{'train_rmse': 4.153546496377676, 'validation_rmse': 4.3580147846085975}
----------------------------------------------------------------------
RandomForest
{'train_rmse': 4.047689087955997, 'validation_rmse': 4.246326819808587}
----------------------------------------------------------------------

The tree based models performed better than the linear models; both score about 4.25 RMSE on the validation data. Based on this, we'll next train a gradient boosting model - LightGBM - evaluate its performance and make a submission. Then we'll tune its hyperparameters before training on the full training data and making a final submission.

LightGBM

# get cpu core count
core_count = psutil.cpu_count(logical = False)

core_count
2
# model parameters
params = {"num_leaves": 25,
         "learning_rate": 0.03,
         "seed": seed,
          "metric": "rmse",
         "num_threads": core_count}

# train and evaluate
train_lgb = lgb.Dataset(X_train, y_train)
valid_lgb = lgb.Dataset(X_valid, y_valid)

bst = lgb.train(params, train_lgb, num_boost_round = 1500, valid_sets = [train_lgb, valid_lgb], early_stopping_rounds = 10, verbose_eval = 20)
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.011514 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2617
[LightGBM] [Info] Number of data points in the train set: 402413, number of used features: 15
[LightGBM] [Info] Start training from score 11.326931
Training until validation scores don't improve for 10 rounds
[20]	training's rmse: 6.35493	valid_1's rmse: 6.44781
[40]	training's rmse: 4.89231	valid_1's rmse: 5.02841
[60]	training's rmse: 4.2942	valid_1's rmse: 4.46046
[80]	training's rmse: 4.04413	valid_1's rmse: 4.23143
[100]	training's rmse: 3.92997	valid_1's rmse: 4.1298
[120]	training's rmse: 3.86415	valid_1's rmse: 4.07594
[140]	training's rmse: 3.81705	valid_1's rmse: 4.03604
[160]	training's rmse: 3.78547	valid_1's rmse: 4.01541
[180]	training's rmse: 3.7564	valid_1's rmse: 3.9956
[200]	training's rmse: 3.7309	valid_1's rmse: 3.98351
[220]	training's rmse: 3.70612	valid_1's rmse: 3.96904
[240]	training's rmse: 3.68621	valid_1's rmse: 3.95755
[260]	training's rmse: 3.66694	valid_1's rmse: 3.94732
[280]	training's rmse: 3.65078	valid_1's rmse: 3.94048
[300]	training's rmse: 3.63347	valid_1's rmse: 3.93003
[320]	training's rmse: 3.61758	valid_1's rmse: 3.92148
[340]	training's rmse: 3.6042	valid_1's rmse: 3.91516
[360]	training's rmse: 3.59117	valid_1's rmse: 3.91013
[380]	training's rmse: 3.57992	valid_1's rmse: 3.90559
[400]	training's rmse: 3.56915	valid_1's rmse: 3.90191
[420]	training's rmse: 3.55754	valid_1's rmse: 3.89855
[440]	training's rmse: 3.54752	valid_1's rmse: 3.89467
[460]	training's rmse: 3.53806	valid_1's rmse: 3.89107
[480]	training's rmse: 3.52856	valid_1's rmse: 3.88794
[500]	training's rmse: 3.52008	valid_1's rmse: 3.88553
[520]	training's rmse: 3.51151	valid_1's rmse: 3.88371
[540]	training's rmse: 3.50374	valid_1's rmse: 3.88035
[560]	training's rmse: 3.49658	valid_1's rmse: 3.87734
[580]	training's rmse: 3.48752	valid_1's rmse: 3.87406
[600]	training's rmse: 3.48019	valid_1's rmse: 3.87159
[620]	training's rmse: 3.4738	valid_1's rmse: 3.86963
[640]	training's rmse: 3.46723	valid_1's rmse: 3.86759
Early stopping, best iteration is:
[642]	training's rmse: 3.46642	valid_1's rmse: 3.86745

LightGBM improves on the performance of RandomForestRegressor, scoring an impressive RMSE of 3.86. Now, we'll make a submission with this model.
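We can also peek at which engineered features drive the model; a minimal sketch using LightGBM's gain-based importances:

# optional: gain-based feature importances of the trained booster
importances = pd.Series(bst.feature_importance(importance_type = "gain"),
                        index = bst.feature_name()).sort_values(ascending = False)
print(importances)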

Make Submission

# make predictions
preds = bst.predict(test_df)

# save submission file
submission_df["fare_amount"] = preds
submission_df.to_csv("lightgbm.csv")

This submission gives us a score of RMSE: 3.29476, which places us at roughly the 35th percentile out of a total of 1400 current participants. We can now improve on this through hyperparameter tuning and by using the full training data.

Tune Hyperparameters

For hyperparameter tuning, we'll use Bayesian optimization. This involves first creating a black box function, which the Bayesian optimizer will then try to maximize by evaluating different hyperparameter combinations. Since the optimizer maximizes its target, the function returns the negative of the cross-validated RMSE.

# black box function for Bayesian Optimization
def LGB_bayesian(bagging_fraction,
                bagging_freq,
                lambda_l1,
                lambda_l2,
                learning_rate,
                max_depth,
                min_data_in_leaf,
                min_gain_to_split,
                min_sum_hessian_in_leaf,  
                num_leaves,
                feature_fraction):
    
    # LightGBM expects these parameters to be integer. So we make them integer
    bagging_freq = int(bagging_freq)
    num_leaves = int(num_leaves)
    min_data_in_leaf = int(min_data_in_leaf)
    max_depth = int(max_depth)
    
    # parameters
    param = {'bagging_fraction': bagging_fraction,
            'bagging_freq': bagging_freq,
            'lambda_l1': lambda_l1,
            'lambda_l2': lambda_l2,
            'learning_rate': learning_rate,
            'max_depth': max_depth,
            'min_data_in_leaf': min_data_in_leaf,
            'min_gain_to_split': min_gain_to_split,
            'min_sum_hessian_in_leaf': min_sum_hessian_in_leaf,
            'num_leaves': num_leaves,
            'feature_fraction': feature_fraction,
            'seed': seed,
            'feature_fraction_seed': seed,
            'bagging_seed': seed,
            'drop_seed': seed,
            'boosting_type': 'gbdt',
            'metric': 'rmse',
            'force_col_wise': True,
            'verbosity': -1,
            'num_threads': core_count}
    
    trn = lgb.Dataset(X, y)
    
    lgb_cv = lgb.cv(param, trn, num_boost_round = 1500, nfold = 3, stratified = False, early_stopping_rounds = 10, seed = seed)
    score = lgb_cv["rmse-mean"][-1]
    
    return -score
# parameter bounds
bounds_LGB = {
    'bagging_fraction': (0.6, 1),
    'bagging_freq': (1, 4),
    'lambda_l1': (0, 3.0), 
    'lambda_l2': (0, 3.0), 
    'learning_rate': (0.01, 0.1),
    'max_depth':(3,8),
    'min_data_in_leaf': (5, 20),  
    'min_gain_to_split': (0, 1),
    'min_sum_hessian_in_leaf': (0.01, 20),    
    'num_leaves': (5, 25),
    'feature_fraction': (0.05, 1)
}
# reset the seed and reload the processed training sample (useful after a kernel restart)
seed = 42
df = pd.read_parquet("nyc_sample_train.parquet")
# separate target from predictors again
X = df.copy()
y = X.pop("fare_amount")
# optimizer
LG_BO = BayesianOptimization(LGB_bayesian, bounds_LGB, random_state = seed)
# find the best hyperparameters
LG_BO.maximize(init_points = 5, n_iter = 50)
|   iter    |  target   | baggin... | baggin... | featur... | lambda_l1 | lambda_l2 | learni... | max_depth | min_da... | min_ga... | min_su... | num_le... |
-------------------------------------------------------------------------------------------------------------------------------------------------------------
|  1        | -3.878    |  0.7498   |  3.852    |  0.7454   |  1.796    |  0.4681   |  0.02404  |  3.29     |  17.99    |  0.6011   |  14.16    |  5.412    |
|  2        | -3.855    |  0.988    |  3.497    |  0.2517   |  0.5455   |  0.5502   |  0.03738  |  5.624    |  11.48    |  0.2912   |  12.24    |  7.79     |
|  3        | -3.833    |  0.7169   |  2.099    |  0.4833   |  2.356    |  0.599    |  0.05628  |  5.962    |  5.697    |  0.6075   |  3.419    |  6.301    |
|  4        | -3.731    |  0.9796   |  3.897    |  0.818    |  0.9138   |  0.293    |  0.07158  |  5.201    |  6.831    |  0.4952   |  0.6974   |  23.19    |
|  5        | -3.795    |  0.7035   |  2.988    |  0.3461   |  1.56     |  1.64     |  0.02664  |  7.848    |  16.63    |  0.9395   |  17.9     |  16.96    |
|  6        | -4.28     |  0.6316   |  1.0      |  0.05     |  0.0      |  3.0      |  0.1      |  8.0      |  18.12    |  1.0      |  0.01     |  25.0     |
|  7        | -3.815    |  0.8511   |  3.327    |  0.2452   |  1.442    |  0.6327   |  0.08836  |  6.227    |  6.279    |  0.3659   |  0.7767   |  24.7     |
|  8        | -3.811    |  1.0      |  4.0      |  1.0      |  0.5456   |  0.0      |  0.03513  |  3.0      |  5.0      |  0.5616   |  1.382    |  18.08    |
|  9        | -3.933    |  1.0      |  4.0      |  1.0      |  0.0      |  0.0      |  0.01     |  3.0      |  7.293    |  1.0      |  6.517    |  22.86    |
|  10       | -3.715    |  1.0      |  4.0      |  1.0      |  2.476    |  0.0      |  0.1      |  8.0      |  7.369    |  0.0      |  0.01     |  19.15    |
|  11       | -3.832    |  1.0      |  4.0      |  1.0      |  3.0      |  3.0      |  0.01     |  8.0      |  8.835    |  0.0      |  0.01     |  13.3     |
|  12       | -3.735    |  1.0      |  1.0      |  1.0      |  0.0      |  3.0      |  0.1      |  8.0      |  5.0      |  1.0      |  0.01     |  19.17    |
|  13       | -3.827    |  0.6      |  1.0      |  1.0      |  3.0      |  3.0      |  0.1      |  3.0      |  20.0     |  0.0      |  20.0     |  23.85    |
|  14       | -4.273    |  0.6      |  1.0      |  0.05     |  3.0      |  3.0      |  0.1      |  8.0      |  7.674    |  0.0      |  20.0     |  16.49    |
|  15       | -3.933    |  1.0      |  4.0      |  1.0      |  0.0      |  0.0      |  0.01     |  3.0      |  20.0     |  1.0      |  20.0     |  14.9     |
|  16       | -4.989    |  1.0      |  4.0      |  0.05     |  3.0      |  0.0      |  0.01     |  8.0      |  20.0     |  1.0      |  15.16    |  22.47    |
|  17       | -3.797    |  0.6      |  1.031    |  1.0      |  3.0      |  3.0      |  0.1      |  4.379    |  8.206    |  0.0      |  0.01     |  19.6     |
|  18       | -3.792    |  0.6      |  4.0      |  1.0      |  3.0      |  3.0      |  0.1      |  8.0      |  5.0      |  0.0      |  4.282    |  18.72    |
|  19       | -4.267    |  1.0      |  1.0      |  0.05     |  0.0      |  0.0      |  0.1      |  8.0      |  8.037    |  1.0      |  3.758    |  15.92    |
|  20       | -3.813    |  0.6      |  4.0      |  1.0      |  3.0      |  3.0      |  0.01     |  6.159    |  5.0      |  1.0      |  0.01     |  20.66    |
|  21       | -3.856    |  0.6      |  2.865    |  0.5243   |  0.2189   |  2.828    |  0.01274  |  8.0      |  16.59    |  1.0      |  17.32    |  11.54    |
|  22       | -3.84     |  0.6      |  4.0      |  1.0      |  3.0      |  3.0      |  0.1      |  3.0      |  5.0      |  0.0      |  0.01     |  12.42    |
|  23       | -3.811    |  0.6      |  4.0      |  1.0      |  0.0      |  3.0      |  0.01     |  8.0      |  9.051    |  0.0      |  0.01     |  21.33    |
|  24       | -3.954    |  0.7067   |  4.0      |  1.0      |  3.0      |  3.0      |  0.01     |  3.203    |  11.12    |  0.0      |  0.01     |  6.119    |
|  25       | -3.932    |  1.0      |  4.0      |  1.0      |  3.0      |  0.0      |  0.01     |  3.0      |  11.03    |  0.0      |  0.01     |  21.11    |
|  26       | -3.983    |  0.6      |  4.0      |  1.0      |  3.0      |  2.971    |  0.01     |  3.0      |  5.0      |  0.0      |  9.472    |  5.0      |
|  27       | -3.826    |  0.6      |  1.0      |  1.0      |  3.0      |  0.0      |  0.1      |  3.0      |  15.18    |  0.0      |  15.69    |  12.3     |
|  28       | -3.786    |  1.0      |  1.0      |  1.0      |  0.0      |  0.0      |  0.1      |  3.0      |  13.18    |  0.0      |  19.45    |  6.494    |
|  29       | -3.841    |  0.6      |  1.0      |  1.0      |  0.0      |  3.0      |  0.1      |  8.0      |  12.35    |  1.0      |  18.0     |  5.0      |
|  30       | -3.865    |  0.6      |  4.0      |  1.0      |  0.0      |  0.0      |  0.06983  |  3.0      |  7.798    |  1.0      |  19.92    |  5.0      |
|  31       | -3.79     |  1.0      |  1.0      |  1.0      |  0.0      |  0.0      |  0.1      |  6.897    |  18.7     |  0.0      |  20.0     |  5.0      |
|  32       | -3.963    |  0.8491   |  3.073    |  0.1499   |  2.922    |  2.358    |  0.07925  |  4.056    |  18.72    |  0.8548   |  19.94    |  7.476    |
|  33       | -3.782    |  1.0      |  1.0      |  1.0      |  0.0      |  0.0      |  0.1      |  8.0      |  16.57    |  0.0      |  13.73    |  5.0      |
|  34       | -3.834    |  0.6      |  1.0      |  1.0      |  0.0      |  3.0      |  0.1      |  8.0      |  20.0     |  0.0      |  8.132    |  5.0      |
|  35       | -3.751    |  1.0      |  1.0      |  1.0      |  3.0      |  0.0      |  0.1      |  8.0      |  18.25    |  0.0      |  10.86    |  9.799    |
|  36       | -3.773    |  1.0      |  1.0      |  1.0      |  0.0      |  0.0      |  0.1      |  3.0      |  19.02    |  0.0      |  8.749    |  9.815    |
|  37       | -4.271    |  1.0      |  1.0      |  0.05     |  3.0      |  0.0      |  0.1      |  5.383    |  15.75    |  1.0      |  7.404    |  5.795    |
|  38       | -3.737    |  1.0      |  1.0      |  1.0      |  0.0      |  0.0      |  0.1      |  6.136    |  20.0     |  0.0      |  13.56    |  9.97     |
|  39       | -3.825    |  0.6      |  3.931    |  1.0      |  0.0      |  3.0      |  0.1      |  7.734    |  20.0     |  0.0      |  10.28    |  10.79    |
|  40       | -3.934    |  1.0      |  1.0      |  1.0      |  0.0      |  3.0      |  0.01     |  3.0      |  18.02    |  0.0      |  11.98    |  13.07    |
|  41       | -3.895    |  0.6      |  4.0      |  1.0      |  3.0      |  0.0      |  0.01     |  8.0      |  20.0     |  0.0      |  14.44    |  8.194    |
|  42       | -3.77     |  1.0      |  1.0      |  1.0      |  0.0      |  3.0      |  0.1      |  3.0      |  5.0      |  0.0      |  0.01     |  22.56    |
|  43       | -3.841    |  0.6      |  4.0      |  1.0      |  0.0      |  3.0      |  0.1      |  3.0      |  20.0     |  0.0      |  2.512    |  11.95    |
|  44       | -3.819    |  0.6      |  1.0      |  1.0      |  0.0      |  0.0      |  0.1      |  8.0      |  15.15    |  0.0      |  13.15    |  11.61    |
|  45       | -3.799    |  0.6      |  1.0      |  1.0      |  0.0      |  0.0      |  0.1      |  8.0      |  20.0     |  0.0      |  6.499    |  12.16    |
|  46       | -3.779    |  1.0      |  4.0      |  1.0      |  0.0      |  3.0      |  0.1      |  8.0      |  5.0      |  0.0      |  0.01     |  5.0      |
|  47       | -3.979    |  1.0      |  4.0      |  1.0      |  3.0      |  0.0      |  0.01     |  8.0      |  12.85    |  0.0      |  20.0     |  5.0      |
|  48       | -5.078    |  0.6      |  1.0      |  0.05     |  0.0      |  3.0      |  0.01     |  8.0      |  20.0     |  1.0      |  15.05    |  5.0      |
|  49       | -3.741    |  1.0      |  2.33     |  1.0      |  0.7072   |  1.186    |  0.1      |  6.142    |  6.29     |  0.0      |  0.0376   |  20.51    |
|  50       | -3.749    |  1.0      |  1.0      |  1.0      |  0.0      |  0.0      |  0.1      |  6.615    |  19.09    |  0.0      |  10.41    |  11.33    |
|  51       | -3.738    |  1.0      |  3.279    |  1.0      |  1.481    |  0.0      |  0.1      |  5.232    |  17.54    |  0.0      |  12.46    |  10.42    |
|  52       | -3.739    |  1.0      |  1.0      |  1.0      |  0.0      |  0.0      |  0.1      |  5.307    |  14.7     |  0.0      |  15.16    |  7.892    |
|  53       | -3.707    |  1.0      |  4.0      |  1.0      |  1.791    |  3.0      |  0.1      |  8.0      |  5.0      |  0.0      |  0.01     |  16.65    |
|  54       | -3.731    |  1.0      |  1.0      |  1.0      |  3.0      |  0.0      |  0.1      |  7.134    |  19.29    |  0.0      |  14.2     |  12.87    |
|  55       | -3.832    |  0.6      |  3.801    |  1.0      |  3.0      |  0.0      |  0.1      |  5.359    |  20.0     |  1.0      |  9.467    |  12.58    |
=============================================================================================================================================================
# tuned hyperparameters
LG_BO.max["params"]
{'bagging_fraction': 1.0,
 'bagging_freq': 4.0,
 'feature_fraction': 1.0,
 'lambda_l1': 1.7911760022903485,
 'lambda_l2': 3.0,
 'learning_rate': 0.1,
 'max_depth': 8.0,
 'min_data_in_leaf': 5.0,
 'min_gain_to_split': 0.0,
 'min_sum_hessian_in_leaf': 0.01,
 'num_leaves': 16.653557814597693}
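
Note that bayes_opt reports every parameter as a float, while lightgbm expects integers for parameters such as num_leaves, max_depth, bagging_freq and min_data_in_leaf. A minimal sketch (assuming the LG_BO optimizer object from above) of truncating those values, matching the hand-set values in the params dictionary further down:

# truncate the float-valued suggestions from bayes_opt for lightgbm's integer parameters
int_params = ["bagging_freq", "max_depth", "min_data_in_leaf", "num_leaves"]
tuned_params = {k: (int(v) if k in int_params else v)
                for k, v in LG_BO.max["params"].items()}
tuned_params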

Now that the hyperparameter tuning is done, we can load the full training data and train lightgbm on it using the tuned hyperparameters.

Load full dataset

To load the full dataset, we'll read it in chunks, apply all the preprocessing to each chunk, and then merge the chunks.

%%time

chunksize = 5_000_000

df_list = []

for df_chunk in tqdm(pd.read_csv(f"{data_dir}/train.csv", usecols = columns, dtype = dtypes, chunksize = chunksize)):

    df_chunk['pickup_datetime'] = datetime_parser(df_chunk)
    
    df_chunk = preprocessor(df_chunk)
    
    df_list.append(df_chunk)
12it [03:18, 16.51s/it]
CPU times: user 2min 25s, sys: 17.2 s, total: 2min 42s
Wall time: 3min 18s

full_df = pd.concat(df_list)

full_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 54061209 entries, 0 to 55423855
Data columns (total 16 columns):
 #   Column             Dtype  
---  ------             -----  
 0   fare_amount        float32
 1   pickup_longitude   float32
 2   pickup_latitude    float32
 3   dropoff_longitude  float32
 4   dropoff_latitude   float32
 5   passenger_count    uint8  
 6   date               uint8  
 7   month              uint8  
 8   weekday            uint8  
 9   year               uint16 
 10  trip_dist          float32
 11  jfk_dist           float32
 12  lga_dist           float32
 13  ewr_dist           float32
 14  met_dist           float32
 15  wtc_dist           float32
dtypes: float32(11), uint16(1), uint8(4)
memory usage: 2.9 GB

We'll now delete the list of chunks to free memory and then save the fully processed training data in .parquet format for faster reading in the future.

# delete the importing list
del df_list

# save the full processed train data
full_df.to_parquet("full_training_df.parquet", index = None)
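
If the notebook is re-run later, the processed data can be loaded straight from this parquet file instead of re-parsing and re-processing the large CSV; a minimal sketch, assuming the file saved above:

# reload the processed training data (much faster than re-reading train.csv)
full_df = pd.read_parquet("full_training_df.parquet")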

Final Training and Submission

We'll need to make some optimizations in order to train on the full data; otherwise the system crashes due to insufficient memory.
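
To see how close we are to that limit, we can check the available RAM with psutil (imported earlier); this check is only illustrative and not part of the original pipeline:

# report available system memory before building the lightgbm dataset
mem = psutil.virtual_memory()
print(f"Available RAM: {mem.available / 1024**3:.1f} GB of {mem.total / 1024**3:.1f} GB")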

# separate target
y = full_df.pop("fare_amount")
# create lightgbm dataset from the training data.
full_lgb_df = lgb.Dataset(full_df, label = y, free_raw_data = True)

We'll save the lightgbm dataset in binary format and then load it again after restarting the kernel. This frees up memory and ensures that only the processed training data remains in memory.

# save the lightgbm dataset to binary format, which is much lighter (about 600 MB)
full_lgb_df.save_binary("full_train.bin")
[LightGBM] [Info] Saving data to binary file full_train.bin
<lightgbm.basic.Dataset at 0x7fadd0803a50>
# exit the system, which will force the notebook to restart
os._exit(00)
# import the required libraries
import lightgbm as lgb
import pandas as pd
import warnings
import psutil

# set parameters
core_count = psutil.cpu_count(logical = False)
seed = 42

params = {'bagging_fraction': 1.0,
         'bagging_freq': 4,
         'feature_fraction': 1.0,
         'lambda_l1': 1.7911760022903485,
         'lambda_l2': 3.0,
         'learning_rate': 0.1,
         'max_depth': 8,
         'min_data_in_leaf': 5,
         'min_gain_to_split': 0.0,
         'min_sum_hessian_in_leaf': 0.01,
         'num_leaves': 16,
         'seed': seed,
         'feature_fraction_seed': seed,
         'bagging_seed': seed,
         'drop_seed': seed,
         'boosting_type': 'gbdt',
         'metric': 'rmse',
         'force_col_wise': True,
         'num_threads': core_count,
         'device': 'gpu'}
# train on full data
train = lgb.Dataset("full_train.bin")

warnings.filterwarnings("ignore")

full_bst = lgb.train(params, train, num_boost_round = 400, 
                     valid_sets = [train], verbose_eval = 25)
[LightGBM] [Info] Load from binary file full_train.bin
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 2617
[LightGBM] [Info] Number of data points in the train set: 54061209, number of used features: 15
[LightGBM] [Info] Using GPU Device: gfx90c, Vendor: Advanced Micro Devices, Inc.
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 15 dense feature groups (824.91 MB) transferred to GPU in 0.891994 secs. 0 sparse feature groups
[LightGBM] [Info] Start training from score 11.324434
[25]	training's rmse: 4.15862
[50]	training's rmse: 3.91879
[75]	training's rmse: 3.85214
[100]	training's rmse: 3.80957
[125]	training's rmse: 3.77661
[150]	training's rmse: 3.74922
[175]	training's rmse: 3.72905
[200]	training's rmse: 3.71091
[225]	training's rmse: 3.69806
[250]	training's rmse: 3.68685
[275]	training's rmse: 3.67519
[300]	training's rmse: 3.66482
[325]	training's rmse: 3.6549
[350]	training's rmse: 3.6476

Prediction and Submission

test_df = pd.read_csv("preprocessed_test_df.csv", index_col = "key")
submission_df = pd.read_csv(data_dir + "/sample_submission.csv", index_col = "key")

test_preds = full_bst.predict(test_df)
submission_df["fare_amount"] = test_preds
submission_df.to_csv("nyc_full_tuned1.csv")
submission_df.head()
                              fare_amount
key                                      
2015-01-27 13:08:24.0000002     10.046500
2015-01-27 13:08:24.0000003     10.233339
2011-10-08 11:53:44.0000002      4.996559
2012-12-01 21:12:12.0000002      8.939364
2012-12-01 21:12:12.0000003     16.210700

This submission gives us our best score yet, an RMSE of 3.21. This is our final submission in this project, and it lands us in the top 30%. We can now save this model for future use and also analyse what it can tell us about the importance of the various features in the data.

# save model
full_bst.save_model("nyc_full_tuned_model.bin")
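
Having saved the booster to disk, it can later be reloaded without retraining; a minimal sketch, assuming the file name used above:

# reload the saved booster and make predictions (illustrative, not part of the original run)
reloaded_bst = lgb.Booster(model_file = "nyc_full_tuned_model.bin")
reloaded_preds = reloaded_bst.predict(test_df)
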
# plot feature importance
lgb.plot_importance(full_bst)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='Feature importance', ylabel='Features'>

Observations

As we can see, the trip distance feature that we added through feature engineering contributed the most to the prediction of taxi fares. It is followed by other features describing the location of the trip, such as the pickup coordinates, the distance from JFK airport and the dropoff coordinates. Next in the importance list is the year of the taxi trip, which suggests that fares have probably increased over the years.

Summary and Conclusion

In this project, we worked on the problem of predicting taxi fares in New York City with the dataset provided by Google Cloud and Coursera. This involved dealing with multiple challenges, handling big data being the primary one. We imported only 1 percent of the training data to select the ML model and tune its hyperparameters. Before doing so, we analysed and visualized the data, which also helped with the feature engineering. After preparing the data for training, we moved on to selecting an ML model for full training. The initial results suggested that lightgbm worked best on this problem, so we tuned its hyperparameters using Bayesian optimization. We were then ready to train the model on the full data: we imported the full training data and applied the preprocessing to it. Before proceeding with the final training, we optimized the process by saving the processed training data in lightgbm's binary format and freeing memory by restarting the kernel.

The final training and submission on Kaggle gave us an RMSE of 3.21. We also analysed the feature importances found by the model, which suggested that the location-based features, such as the trip distance we calculated using the haversine distance in the feature engineering section, the pickup longitude and the pickup latitude, were the most important for making predictions.

Thanks for reading.