EDA and Prediction of Real Estate Sale Price using Select ML Algorithms on Kaggle
- Introduction
- Setup
- EDA and Data Preparation
- Train a Dummy model and Evaluate Performance
- Linear Models
- Gradient Boosting Models
- Summary and Conclusion
In this notebook, we'll work on the Ames Housing Dataset, available on Kaggle as an educational competition. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, the competition challenges us to predict the final price of each home. The dataset was compiled by Dean De Cock as a modernized and expanded alternative to the often-cited Boston Housing dataset.
Our task is to predict the `SalePrice` of the houses in the dataset. We will train ML algorithms on the training data provided by the competition and then submit predictions of `SalePrice` for the houses in the test data. Submissions are evaluated by RMSLE, and we'll try to improve on this metric with each submission.
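For reference, RMSLE is simply the root mean squared error computed on log-transformed values. A minimal sketch of the metric (illustrative only, not part of the competition code; it assumes scikit-learn's `mean_squared_error`):
import numpy as np
from sklearn.metrics import mean_squared_error

def rmsle(y_true, y_pred):
    # root mean squared logarithmic error; log1p keeps the metric defined at 0
    return np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))
Because we will train on `np.log1p(SalePrice)`, plain RMSE on the transformed target is equivalent to RMSLE on the original prices, which is why `neg_root_mean_squared_error` is used as the cross-validation scoring later on.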
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
from sklearn.preprocessing import OneHotEncoder, Normalizer, RobustScaler # data preparation
from sklearn.impute import SimpleImputer # missing value handling
from sklearn.model_selection import KFold, cross_val_score # model selection
from sklearn.metrics import mean_squared_error # metrics
from scipy.stats import norm, skew # statistics
import psutil # get cpu core count
from bayes_opt import BayesianOptimization # hyperparameter tuning
# pipelines
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
# ML models
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb
import xgboost as xgb
# ignore warnings
import warnings
warnings.filterwarnings("ignore")
# get file paths
import os
data_dir = os.getcwd()
for dirname, _, filenames in os.walk(data_dir):
    for filename in filenames:
        if filename[-4:] == '.csv':
            print(os.path.join(dirname, filename))
All data files are under 1 MB, so it is safe to load them into memory in full.
# import files
train = pd.read_csv(data_dir + "/train.csv", index_col = ["Id"])
test = pd.read_csv(data_dir + "/test.csv", index_col = ["Id"])
submission_df = pd.read_csv(data_dir + "/sample_submission.csv")
train.head()
Observations
- The data contains both numeric and categorical features.
- `SalePrice` is the target column.
# random state seed
seed = 42
# Separate target from features
X_train = train.copy()
y_train = X_train.pop("SalePrice")
pd.set_option("display.max_columns", None)
X_train.head(15)
X_train.info()
Observations
- There are 1460 entries with 79 features.
- Some features have null values, and thus will need further inspection.
- Some features can be combined to give new features, e.g. totalling different types of rooms to give a `total_rooms` feature (a quick sketch follows below).
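A quick sketch of what such a feature could look like (illustrative only; this exact feature isn't used later, and it assumes `TotRmsAbvGrd` excludes bathrooms, as stated in the data description):
# illustrative only: combine room and bathroom counts into a single total_rooms feature
rooms_example = X_train.copy()
rooms_example["total_rooms"] = (rooms_example["TotRmsAbvGrd"]
                                + rooms_example["FullBath"] + 0.5 * rooms_example["HalfBath"]
                                + rooms_example["BsmtFullBath"] + 0.5 * rooms_example["BsmtHalfBath"])
rooms_example[["TotRmsAbvGrd", "FullBath", "HalfBath", "total_rooms"]].head()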
# distribution of null values
plt.figure(figsize = (25, 10))
sns.heatmap(X_train.isnull(), yticklabels = "")
plt.title("Distribution of null values")
plt.show()
# count of null values
null_count = X_train.isnull().sum()
null_count = null_count.to_frame(name = "null_values")[null_count > 0]
null_count["null_percentage"] = null_count["null_values"]/len(X_train)*100
null_vals = null_count.sort_values(by = ["null_values"], ascending = False)
null_vals
Observations
- `PoolQC`, `MiscFeature`, `Alley` and `Fence` have a lot of null values.
- Fortunately, the data description tells us that in all but one of the categorical feature columns, a null value indicates the absence of that feature in the entry. Therefore, we'll impute null values in these feature columns with 'None'.
- `Electrical` is the only feature column where null values don't indicate the absence of a condition. Because there is only 1 null entry, we can simply drop that entry.
- As for the numerical columns, `MasVnrArea` and `LotFrontage` can be filled with 0 because here too, null values indicate the absence of the feature. `GarageYrBlt` can be filled with the median value, because although it is linked to the `GarageCond` column, a missing garage cannot have year 0 in `GarageYrBlt`.
- For columns that have null values only in the test dataset, we'll impute with the `most_frequent` value (categorical) or the `median` (numeric).
# column dtypes
X_train.dtypes.value_counts()
null_cols = null_vals.index
null_cols = null_cols.drop("Electrical")
# drop null from `Electrical`
def drop_electrical_null(X, y):
    drop_idx = X[X["Electrical"].isnull()].index
    X = X.drop(drop_idx)
    y = y.drop(drop_idx)
    return X, y
# categorical columns
X_train['MSSubClass'] = X_train['MSSubClass'].apply(str)
cat_cols = X_train.select_dtypes(include = ["object"]).columns
cat_null_cols = null_cols.intersection(cat_cols)
cat_null_cols_imp = SimpleImputer(strategy = "constant", fill_value = "None")
cat_not_null_cols = cat_cols.difference(cat_null_cols)
cat_not_null_coll_imp = SimpleImputer(strategy = "most_frequent")
cat_null_ct = make_column_transformer((cat_null_cols_imp, cat_null_cols), (cat_not_null_coll_imp, cat_not_null_cols))
# numeric columns
num_cols = X_train.select_dtypes(exclude = ["object"]).columns
num_null0_cols = pd.Index(["MasVnrArea", "LotFrontage"])
num_null0_cols_imp = SimpleImputer(strategy = "constant", fill_value = 0)
num_not_null_cols = num_cols.difference(num_null0_cols)
num_not_null_cols_imp = SimpleImputer(strategy = "median")
num_null_ct = make_column_transformer((num_null0_cols_imp, num_null0_cols), (num_not_null_cols_imp, num_not_null_cols))
# combine both into a common transformer
null_ct = make_column_transformer((cat_null_ct, cat_cols), (num_null_ct, num_cols))
null_ct_features_in = cat_null_cols.append(cat_not_null_cols).append(num_null0_cols).append(num_not_null_cols)
# SalePrice (target)
sns.distplot(y_train, kde = True)
plt.title("Distribution of Sale Price")
plt.show()
The target looks skewed. We'll normalize it with a log transformation.
# normalized SalePrice (target)
y_train = np.log1p(y_train)
sns.distplot(y_train, fit = norm, kde = True)
plt.title("Distribution of Sale Price")
plt.show()
The target is normalized. Now, we can look at the numerical features.
plt.figure(figsize = (12,10))
fig = sns.boxplot(data = X_train[num_cols], orient = 'h')
fig.set_xscale("log")
There are quite a few outliers in the numeric columns. We'll need to scale the numeric data using `RobustScaler` in the preprocessor-building section. Also, many of the numeric features are skewed and will need to be normalized so that we can train ML models that assume normality in the features.
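As a quick illustration of how skew can be reduced (illustrative only; the preprocessing pipeline built later uses `Normalizer` followed by `RobustScaler` instead), a log transform such as `np.log1p` pulls in the long right tail of a feature like `GrLivArea`:
# illustrative only: skew of GrLivArea before and after a log1p transform
print("skew before log1p:", skew(X_train["GrLivArea"]))
print("skew after  log1p:", skew(np.log1p(X_train["GrLivArea"])))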
# features which deviate a lot from the normal curve
imputed_features = pd.DataFrame(num_null_ct.fit_transform(X_train[num_cols]))
num_features_in = num_null0_cols.append(num_not_null_cols)
imputed_features.columns = num_features_in
skew_features = imputed_features.apply(lambda x: skew(x)).sort_values(ascending = False)
high_skew = skew_features.loc[skew_features > 0.5]
skewed_index = high_skew.index
high_skew
# normalize the skewed features
norma = Normalizer()
normalized_cols = pd.DataFrame(norma.fit_transform(imputed_features[skewed_index]))
normalized_cols.columns = skewed_index
normalized_cols.head()
plt.figure(figsize = (12, 10))
fig = sns.boxplot(data = normalized_cols, orient = "h")
fig.set_xscale("log")
Some features like `MasVnrArea`, `OpenPorchSF`, `BsmtFinSF1`, `WoodDeckSF` and `2ndFlrSF` couldn't be normalized. These columns will therefore be dropped after extracting the important information from them, which we'll do in the feature engineering section.
Feature Engineering
Area-related features are important in predicting the sale price of houses, so we can engineer some features related to area. We can also add binary features indicating the presence of a swimming pool, garage, basement or fireplace, which are also important determinants of real estate value.
def add_cols(df):
    # Add area related features
    df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
    df['Total_Bathrooms'] = (df['FullBath'] + (0.5 * df['HalfBath']) + df['BsmtFullBath'] + (0.5 * df['BsmtHalfBath']))
    df['Total_porch_sf'] = (df['OpenPorchSF'] + df['3SsnPorch'] +
                            df['EnclosedPorch'] + df['ScreenPorch'] +
                            df['WoodDeckSF'])
    # Add simplified categorical features
    df['haspool'] = df['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
    df['hasgarage'] = df['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
    df['hasbsmt'] = df['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
    df['hasfireplace'] = df['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)
    df[["haspool", "hasgarage", "hasbsmt", "hasfireplace"]] = df[["haspool", "hasgarage", "hasbsmt", "hasfireplace"]].astype("object")
    return df
Prepare the data
Now, we will combine everything we have learnt and done in the previous sections to process the data so that an ML algorithm can be trained on it. We'll:
- normalize target
- impute null values
- add features
- drop features
- onehotencode categorical features
- normalize skewed numeric features
- scale numeric features
# function to preprocess the data
def get_prepared_data(transform_numeric = True):
    X_trn = train.copy()
    y_trn = X_trn.pop("SalePrice")
    X_tst = test.copy()
    X_trn['MSSubClass'] = X_trn['MSSubClass'].astype("object")
    X_tst['MSSubClass'] = X_tst['MSSubClass'].astype("object")
    # normalize target
    y_trn = np.log1p(y_trn)
    # handle null values
    X_trn, y_trn = drop_electrical_null(X_trn, y_trn)
    X_trn = pd.DataFrame(null_ct.fit_transform(X_trn))
    X_tst = pd.DataFrame(null_ct.transform(X_tst))
    X_trn = X_trn.infer_objects()
    X_tst = X_tst.infer_objects()
    # re-add column names
    X_trn.columns = null_ct_features_in
    X_tst.columns = null_ct_features_in
    X_trn['MSSubClass'] = X_trn['MSSubClass'].astype("object")
    X_tst['MSSubClass'] = X_tst['MSSubClass'].astype("object")
    # add features
    X_trn = add_cols(X_trn)
    X_tst = add_cols(X_tst)
    # drop features
    X_trn.drop(columns = ["MasVnrArea", "OpenPorchSF", "BsmtFinSF1", "WoodDeckSF", "2ndFlrSF"], inplace = True)
    X_tst.drop(columns = ["MasVnrArea", "OpenPorchSF", "BsmtFinSF1", "WoodDeckSF", "2ndFlrSF"], inplace = True)
    # categorical features
    cat_cols = X_trn.select_dtypes(include = ["object"]).columns
    cat_ohe = OneHotEncoder(drop = 'first', handle_unknown = 'ignore', sparse = False, dtype = 'uint8')
    # normalize numeric features
    num_cols = X_trn.select_dtypes(exclude = ["object"]).columns
    num_pipe = make_pipeline(Normalizer(), RobustScaler())
    # column transformer
    if transform_numeric:
        ct = make_column_transformer((cat_ohe, cat_cols), (num_pipe, num_cols))
    else:
        ct = make_column_transformer((cat_ohe, cat_cols), remainder = "passthrough")
    X_trn = pd.DataFrame(ct.fit_transform(X_trn))
    X_tst = pd.DataFrame(ct.transform(X_tst))
    return X_trn, y_trn, X_tst
# get the processed data
X_trn, y_trn, X_tst = get_prepared_data(True)
Now, we can move on to ML model training.
# define model
dummy_model = DummyRegressor()
# model evaluation
def evaluate_model(model, X_trn = X_trn):
    cvs = cross_val_score(model, X_trn, y_trn, scoring = "neg_root_mean_squared_error")
    rmsle = -cvs.mean()
    print(f"Model RMSLE: {rmsle:.5f}")
evaluate_model(dummy_model)
# train model
dummy_model.fit(X_trn, y_trn)
# make predictions
dummy_preds = dummy_model.predict(X_tst)
-cross_val_score(dummy_model, X_trn, y_trn, scoring = "neg_root_mean_squared_error").mean()
# create submission file
submission_df["SalePrice"] = dummy_preds
submission_df.to_csv("dummy_model.csv", index = None)
The submission gives us a score of 0.42578. Future ML models should at least beat the CV RMSLE score of 0.39936 and the submission RMSLE score of 0.42578.
# cv splitter
k_folds = KFold(5, shuffle = True, random_state = seed)
# parameters for cv
e_alphas = np.arange(0.0001, 0.0007, 0.0001)
e_l1ratio = np.arange(0.8, 1, 0.05)
alphas_ridge = np.arange(10, 16, 0.5)
alphas_lasso = np.arange(0.0001, 0.0008, 0.0001)
# linear models
ridge = RidgeCV(alphas = alphas_ridge, scoring = "neg_mean_squared_error", cv = k_folds)
lasso = LassoCV(alphas = alphas_lasso, max_iter = 1e6, cv = k_folds, n_jobs = -1, random_state = seed)
elastic_net = ElasticNetCV(l1_ratio = e_l1ratio, alphas = e_alphas, max_iter = 1e6, cv = k_folds, n_jobs = -1, random_state = seed)
models = {"Ridge": ridge,
"Lasso": lasso,
"ElasticNet": elastic_net}
# compare linear models
scores = {}
for model_name, model in models.items():
    print(f"{model_name}:")
    score = np.sqrt(-cross_val_score(model, X_trn, y_trn, scoring = "neg_mean_squared_error", cv = k_folds))
    print(score)
    print(f"RMSLE mean: {score.mean():.5f} \nRMSLE std: {score.std():.5f}")
    print("-" * 50)
    scores[model_name] = (score.mean(), score.std())
All of these scores improve on the baseline, with RMSLE in the range of 0.12-0.15. We can get a better score by blending these linear models: blending lets the models complement each other and reduces their individual overfitting.
%%time
print("training started...")
# train all models
ridge.fit(X_trn, y_trn)
lasso.fit(X_trn, y_trn)
elastic_net.fit(X_trn, y_trn)
print("training complete")
# make predictions
blended_preds = (ridge.predict(X_tst) + lasso.predict(X_tst) + elastic_net.predict(X_tst))/3
blended_preds = np.expm1(blended_preds)
# create submission file
submission_df["SalePrice"] = blended_preds
submission_df.to_csv("blended_linear.csv", index = None)
This submission from the blended linear models gives us an RMSLE score of 0.13819, which is similar to the performance of a single lasso model. We can improve on this score further by using gradient boosting trees, which we'll do in the next section.
Gradient Boosting Models
In this section, we'll first train a LightGBM model and then an XGBoost model. The normalizing and scaling we applied to the numeric features earlier degrades the performance of gradient boosting trees, which can actually make use of the information lost through those transformations. Therefore, to train these models, we'll reload the processed data, this time without transforming the numeric features.
# load processed data
X_trn, y_trn, X_tst = get_prepared_data(False)
# get cpu core count
core_count = psutil.cpu_count(logical = False)
core_count
# lightgbm parameters
param = {"bagging_fraction": 0.8,
"bagging_freq": 2,
"learning_rate": 0.01,
"num_leaves": 10,
"max_depth": 5,
"min_data_in_leaf": 10,
"metric": "rmse",
"num_threads": core_count,
"verbosity": -1}
# train and evaluate lightgbm
val_scores = []
i = 1
for trn_idx, val_idx in k_folds.split(X_trn, y_trn):
    print(f"Split {i}:")
    trn = lgb.Dataset(X_trn.iloc[trn_idx], y_trn.iloc[trn_idx])
    val = lgb.Dataset(X_trn.iloc[val_idx], y_trn.iloc[val_idx])
    bst = lgb.train(param, trn, num_boost_round = 3000, valid_sets = [trn, val], early_stopping_rounds = 10, verbose_eval = 50)
    score = bst.best_score["valid_1"]["rmse"]
    val_scores.append(score)
    print(f"RMSLE: {score:.5f}")
    print("-" * 65)
    i += 1
# Avg RMSLE
np.mean(val_scores)
Even without hyperparameter tuning, the validation scores are better than those of the linear models. Now, we'll train on the whole dataset and make a submission.
trn = lgb.Dataset(X_trn, y_trn)
lgb_cv = lgb.cv(param, trn, num_boost_round = 3000, folds = k_folds, early_stopping_rounds = 10)
lgb_cv["rmse-mean"][-1]
# train on full data
bst = lgb.train(param, trn, num_boost_round = len(lgb_cv["rmse-mean"]))
# make predictions
lgb_preds = np.expm1(bst.predict(X_tst))
submission_df["SalePrice"] = lgb_preds
submission_df.to_csv("lgb.csv", index = None)
This submission gives us a score of 0.12901, which is an improvement over the last submission. Now, we can further optimize it with hyperparameter tuning.
# black box function for Bayesian Optimization
def LGB_bayesian(bagging_fraction,
                 bagging_freq,
                 lambda_l1,
                 lambda_l2,
                 learning_rate,
                 max_depth,
                 min_data_in_leaf,
                 min_gain_to_split,
                 min_sum_hessian_in_leaf,
                 num_leaves,
                 feature_fraction):
    # LightGBM expects these parameters to be integers, so we cast them
    bagging_freq = int(bagging_freq)
    num_leaves = int(num_leaves)
    min_data_in_leaf = int(min_data_in_leaf)
    max_depth = int(max_depth)
    # parameters
    param = {'bagging_fraction': bagging_fraction,
             'bagging_freq': bagging_freq,
             'lambda_l1': lambda_l1,
             'lambda_l2': lambda_l2,
             'learning_rate': learning_rate,
             'max_depth': max_depth,
             'min_data_in_leaf': min_data_in_leaf,
             'min_gain_to_split': min_gain_to_split,
             'min_sum_hessian_in_leaf': min_sum_hessian_in_leaf,
             'num_leaves': num_leaves,
             'feature_fraction': feature_fraction,
             'seed': seed,
             'feature_fraction_seed': seed,
             'bagging_seed': seed,
             'drop_seed': seed,
             'boosting_type': 'gbdt',
             'metric': 'rmse',
             'verbosity': -1,
             'num_threads': core_count}
    trn = lgb.Dataset(X_trn, y_trn)
    lgb_cv = lgb.cv(param, trn, num_boost_round = 1500, folds = k_folds, stratified = False, early_stopping_rounds = 10, seed = seed)
    score = lgb_cv["rmse-mean"][-1]
    # BayesianOptimization maximizes the objective, so return the inverse of the CV RMSLE
    return 1/score
# parameter bounds
bounds_LGB = {
'bagging_fraction': (0.5, 1),
'bagging_freq': (1, 4),
'lambda_l1': (0, 3.0),
'lambda_l2': (0, 3.0),
'learning_rate': (0.005, 0.3),
'max_depth':(3,8),
'min_data_in_leaf': (5, 20),
'min_gain_to_split': (0, 1),
'min_sum_hessian_in_leaf': (0.01, 20),
'num_leaves': (5, 20),
'feature_fraction': (0.05, 1)
}
# optimizer
LG_BO = BayesianOptimization(LGB_bayesian, bounds_LGB, random_state = seed)
# find the best hyperparameters
LG_BO.maximize(init_points = 10, n_iter = 200)
# get the performance of best hyperparameters
tuned_lgbm_score = 1/LG_BO.max['target']
print(f"RMSLE of tuned lightgbm: {tuned_lgbm_score:.5f}")
# best parameters
params = LG_BO.max["params"]
int_params = ["bagging_freq", "max_depth", "min_data_in_leaf", "num_leaves"]
for parameter in int_params:
    params[parameter] = int(params[parameter])
other_lgbm_params = {'seed': seed,
'feature_fraction_seed': seed,
'bagging_seed': seed,
'drop_seed': seed,
'boosting_type': 'gbdt',
'metric': 'rmse',
'verbosity': -1,
'num_threads': core_count}
params.update(other_lgbm_params)
params
# get the num_boost_rounds
trn = lgb.Dataset(X_trn, y_trn)
lgb_cv = lgb.cv(params, trn, num_boost_round = 3000, folds = k_folds, early_stopping_rounds = 10)
num_boost_round = len(lgb_cv["rmse-mean"]) - 10
num_boost_round
# train model
bst = lgb.train(params, trn, num_boost_round = num_boost_round)
# make predictions
lgb_preds = np.expm1(bst.predict(X_tst))
# create submission file
submission_df["SalePrice"] = lgb_preds
submission_df.to_csv("lgb_tuned.csv", index = None)
This submission gives us a score of 0.12715. Hyperparameter tuning helped to improve the performance of the LightGBM model a little. Now, we'll see how XGBoost performs.
# load the datasets into xgboost DMatrices
train_d = xgb.DMatrix(X_trn, y_trn)
test_d = xgb.DMatrix(X_tst)
# xgboost parameters
xgb_params = {"eta": 0.1,
"subsample": 0.7,
"tree_method": "hist",
"random_state": seed}
# train and evaluate xgboost
xgb_cv = xgb.cv(xgb_params, train_d, num_boost_round = 1500, nfold = 5, early_stopping_rounds = 10)
xgb_cv.tail()
Even without hyperparameter tuning, XGBoost gives us an RMSLE validation score of 0.127240. We can improve this score with hyperparameter tuning, but first we'll make predictions and a submission.
# train model
xgb_bst = xgb.train(xgb_params, train_d, num_boost_round = len(xgb_cv))
# make predictions
xgb_preds = np.expm1(xgb_bst.predict(test_d))
# create submission file
submission_df["SalePrice"] = xgb_preds
submission_df.to_csv("xgb_non_tuned.csv", index = None)
This submission gives a score of 0.13190, which suggests the model overfit a little. We can tune the hyperparameters to improve the performance.
# black box function for Bayesian Optimization
def xgb_bayesian(eta,
                 gamma,
                 subsample,
                 colsample_bytree,
                 colsample_bynode,
                 colsample_bylevel,
                 max_depth):
    # this parameter has to be an integer
    max_depth = int(max_depth)
    # xgboost parameters
    params = {"eta": eta,
              "gamma": gamma,
              "subsample": subsample,
              "colsample_bytree": colsample_bytree,
              "colsample_bynode": colsample_bynode,
              "colsample_bylevel": colsample_bylevel,
              "max_depth": max_depth,
              "tree_method": "hist"}
    # train and score
    xgb_cv = xgb.cv(params, train_d, num_boost_round = 1500, nfold = 5, early_stopping_rounds = 10, seed = seed)
    # use the mean validation RMSE from 10 rounds before the end as the score
    score = xgb_cv.iloc[-10]["test-rmse-mean"]
    # BayesianOptimization maximizes the objective, so return the inverse of the CV RMSLE
    return 1/score
# parameter bounds
xgb_bounds = {"eta": (0.01, 0.05),
"gamma": (0, 20),
"subsample": (0.4, 1),
"colsample_bytree": (0.5, 1),
"colsample_bynode": (0.5, 1),
"colsample_bylevel": (0.5, 1),
"max_depth": (2, 7)}
# optimizer
xgb_bo = BayesianOptimization(xgb_bayesian, xgb_bounds, random_state = seed)
# find the best hyperparameters
xgb_bo.maximize(init_points = 3, n_iter = 60)
# get the performance of best hyperparameters
tuned_xgb_score = 1/xgb_bo.max['target']
print(f"RMSLE of tuned xgboost: {tuned_xgb_score:.5f}")
xgb_bo.max
# parameters
xgb_tuned_params = {"eta": 0.01,
"gamma": 0,
"subsample": 1.0,
"colsample_bytree": 0.5,
"colsample_bynode": 0.5,
"colsample_bylevel": 0.5,
"max_depth": 4,
"tree_method": "hist"}
# get the num_boost_round
xgb_cv = xgb.cv(xgb_tuned_params, train_d, num_boost_round = 1500, nfold = 5, early_stopping_rounds = 10)
num_boost_round = len(xgb_cv) - 10
xgb_cv.tail()
# train model
bst = xgb.train(xgb_tuned_params, train_d, num_boost_round = num_boost_round)
# make predictions
xgb_preds = np.expm1(bst.predict(test_d))
# create submission file
submission_df["SalePrice"] = xgb_preds
submission_df.to_csv("xgb_tuned.csv", index = None)
This submission gives us our best score yet, an RMSLE of 0.12488. This is our final submission.
In this project, we worked on the Ames housing data provided as part of a competition on Kaggle and tried to predict the sale price of houses in Ames. Before training ML models, we explored the data and prepared it for ML algorithms. We also trained a dummy model to establish a baseline and to spot errors in training. In the model training part, we first trained some linear models and then combined them into blended predictions, which improved the overall performance. We then trained two gradient boosting models. The LightGBM model improved on the performance of the blended linear models, and hyperparameter tuning helped to further improve the submission score. Finally, we trained an XGBoost model, also with tuned hyperparameters, which gave us the best RMSLE score of 0.12488.