Predicting Customer Transactions for Santander Competition on Kaggle
- Introduction
- Setup
- EDA and Data Preparation
- Train Hardcoded and Baseline models and evaluate results
- Model Selection
- Feature Engineering
- Model Training
- Final Training and Submission
- Summary and Conclusion
The Santander Group is a Spanish multinational financial services company based in Madrid and is the 16th largest financial institution in the world. It held a Kaggle competition in 2019 in which the goal was to identify which customers would make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for the competition has the same structure as the real data the Santander Group has available to solve this problem. The competition is one of the most popular on Kaggle, with over 8,000 participating teams.
Today, we'll also take part in this competition and work towards improving our standing on its leaderboard. The dataset provides 200 anonymized numeric features, from which we have to predict the target class for the test dataset. Submissions are evaluated on the ROC AUC between the predicted probabilities and the observed targets, so our goal is to make predictions with the highest possible ROC AUC score on the test dataset. A tiny illustration of the metric follows; after that, we set up the environment and import the dataset files.
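As a small, hedged illustration with made-up numbers, roc_auc_score rewards a good ranking of the predicted probabilities rather than exact hard labels:
from sklearn.metrics import roc_auc_score

# toy example: positives should receive higher probabilities than negatives
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]
print(roc_auc_score(y_true, y_prob)) # ~0.89: most positives are ranked above most negatives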
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
from tqdm import tqdm # progress meter
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_validate # training validation
from sklearn.preprocessing import MinMaxScaler # numeric scaler
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score, roc_auc_score, RocCurveDisplay, ConfusionMatrixDisplay # metrics
from imblearn.over_sampling import SMOTE # oversampling imbalanced data
from imblearn.pipeline import make_pipeline as make_imb_pipeline # imbalanced pipeline
from bayes_opt import BayesianOptimization # hyperparameter tuning
import psutil # cpu information
# ML models
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb
# ignore warnings
import warnings
warnings.filterwarnings("ignore")
# set data_dir
import os
os.chdir(os.path.join(os.getcwd(), "Santader_Transactions_Predictions"))
# file paths
for dirname, _, filenames in os.walk(os.getcwd()):
    for filename in filenames:
        print(os.path.join(dirname, filename))
Here, the csv files are part of the official competition dataset, while all the other files are from a separate kernel and will be used for feature engineering in a later section. We have already added them for future use.
# file_paths
train_path = r"C:\Users\ncits\Downloads\Santader_Transactions_Predictions\train.csv"
test_path = r"C:\Users\ncits\Downloads\Santader_Transactions_Predictions\test.csv"
submission_path = r"C:\Users\ncits\Downloads\Santader_Transactions_Predictions\sample_submission.csv"
# training dataset file size
!dir {train_path} /a/s
print("-" * 50)
!dir {test_path} /a/s
The training dataset file size is 289 MB and the test dataset file size is 288 MB. It is safe to import both datasets fully at once.
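As an optional sanity check before loading, we can compare the available RAM against the on-disk sizes using psutil, which we imported above:
# hedged sanity check: available RAM should comfortably exceed the combined file sizes
print(round(psutil.virtual_memory().available / 1024 ** 3, 1), "GB RAM available")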
# import training dataset
X = pd.read_csv(train_path, index_col = ["ID_code"])
# look at all columns
pd.set_option("display.max_columns", None)
X.head()
# separate target class
y = X.pop("target")
# import test dataset
test_df = pd.read_csv(test_path, index_col = ["ID_code"])
test_df.head()
We'll also import the submission file which will be used for adding targets from the trained ML models. It will then be exported as a csv file, which will be submitted on Kaggle.
# import submission file
submission_df = pd.read_csv(submission_path)
submission_df.head()
# random state seed
seed = 42
# Basic overview
X.info()
Observations
- The data has 200,000 rows and 200 columns.
- All values are numeric and stored as dtype float64.
- The 200 columns are named var_0 to var_199.
X.describe()
test_df.describe()
Observations
- A quick look at the basic statistics and comparing the training and test dataset doesn't reveal too much.
- Both the train and test datasets look quite similar in the mean, std and median values.
- We can confirm this by looking at the frequency distribution of the data along all features in both the datasets.
%%time
fig, ax = plt.subplots(20, 10, figsize = (25, 50), constrained_layout = True)
for i, col in tqdm(enumerate(X.columns, start = 1)):
    plt.subplot(20, 10, i)
    sns.kdeplot(X[col])
    sns.kdeplot(test_df[col])
    plt.xlabel(col)
    plt.ylabel("")
    plt.legend(["train", "test"])
Observations
- All variables are distributed similarly across the train and test datasets.
- The train dataset is representative of the test dataset; a quick numerical check is sketched below.
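If we want a number to back up the visual impression, a sketch using a two-sample Kolmogorov-Smirnov test (via scipy, which is not otherwise used in this notebook) can be run on a few features; small statistics indicate near-identical distributions:
from scipy.stats import ks_2samp

# KS statistic per feature: values close to 0 mean the train and test distributions almost coincide
for col in X.columns[:5]:
    print(col, round(ks_2samp(X[col], test_df[col]).statistic, 4))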
%%time
# divide the dataset with respect to target class.
t0 = X.loc[y == 0]
t1 = X.loc[y == 1]
# plot
fig, ax = plt.subplots(20, 10, figsize = (25, 50), constrained_layout = True)
for i, col in tqdm(enumerate(X.columns, start = 1)):
    plt.subplot(20, 10, i)
    sns.kdeplot(t0[col])
    sns.kdeplot(t1[col])
    plt.xlabel(col)
    plt.ylabel("")
    plt.legend([0, 1])
Observations
There do appear to be some differences in the feature distributions between the two target classes. The ML algorithms we train will try to learn from these differences, and find further patterns, to classify and differentiate between the two target classes.
# distribution
target_vc = y.value_counts()/len(y)
target_vc
sns.barplot(x = target_vc.index, y = target_vc)
Observations
- The target class is imbalanced.
- It is distributed in roughly a 9:1 ratio.
- For training some ML models, it may be better to oversample the data; a small illustration of what oversampling does follows below.
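As a small, hedged illustration of what SMOTE oversampling does (the 20,000-row subsample size is arbitrary; the pipelines later apply SMOTE as part of their training step):
# illustrative only: oversample a random subsample with SMOTE and compare class ratios
sample_idx = X.sample(n = 20000, random_state = seed).index
X_s, y_s = X.loc[sample_idx], y.loc[sample_idx]
X_res, y_res = SMOTE(random_state = seed).fit_resample(X_s, y_s)
print(y_s.value_counts(normalize = True))          # original, imbalanced ratio
print(pd.Series(y_res).value_counts(normalize = True)) # after SMOTE, roughly balanced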
# null values count
X.isnull().sum().sort_values(ascending = False)
Observations
- There are no null values in the data, and thus it doesn't need any handling/preparation.
# Scaling and Oversampling
scaler = MinMaxScaler()
sm = SMOTE(random_state = seed)
# make train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify = y, random_state = seed)
Train Hardcoded and Baseline models and evaluate results
We'll train a dummy classifier as a hardcoded model, which will always predict the most frequent target class, in this case 0. We'll also train a baseline model with logistic regression. This gives us two reference scores that our future models should at least beat, and it helps to identify errors in the training setup.
Hardcoded model
# define hardcoded model and its pipeline
mf_model = DummyClassifier()
mf_pipe = make_imb_pipeline(scaler, sm, mf_model)
# fit the hardcoded pipeline
mf_pipe.fit(X_train, y_train)
# function to evaluate the model
def evaluate_model(model_pipe, plot_graph = False):
    preds = model_pipe.predict(X_valid)
    # use predicted probabilities where available, otherwise fall back to decision scores
    try:
        preds_prob = model_pipe.predict_proba(X_valid)[:, 1]
    except AttributeError:
        preds_prob = model_pipe.decision_function(X_valid)
    res = {
        "Accuracy Score": accuracy_score(y_valid, preds),
        "Precision Score": precision_score(y_valid, preds, zero_division = 0),
        "Recall Score": recall_score(y_valid, preds),
        "ROC_AUC Score": roc_auc_score(y_valid, preds_prob),
        "f1 Score": f1_score(y_valid, preds)
    }
    print(res)
    if plot_graph:
        plt.figure(1)
        ConfusionMatrixDisplay.from_predictions(y_valid, preds)
        plt.title("Confusion Matrix")
        plt.figure(2)
        RocCurveDisplay.from_predictions(y_valid, preds_prob)
        plt.title("Roc Curve")
    return res
mf_scores = evaluate_model(mf_pipe, True)
As expected, the dummy classifier achieved an accuracy of 0.89, because the class is imbalanced and the classifier predicts the majority class for every sample. Also as expected, it has no discriminative ability, which is reflected in its ROC_AUC score of 0.5. The other ML algorithms we train should at least beat this score.
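To see why a constant predictor scores exactly 0.5, note that its scores carry no ranking information; a one-line check on the validation split makes the point:
# a constant score cannot rank positives above negatives, so the ROC curve is the diagonal
print(roc_auc_score(y_valid, np.zeros(len(y_valid)))) # 0.5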
# predictions
dummy_preds = mf_pipe.predict(test_df)
# add predictions to submission file
submission_df["target"] = dummy_preds
submission_df.head()
# save submission file
submission_df.to_csv("hardcoded_model_preds.csv", index = None)
On Kaggle, this submission gives the score of 0.5, as expected.
# define baseline model and its pipeline
base_model = LogisticRegression(random_state = seed)
base_pipe = make_imb_pipeline(scaler, sm, base_model)
# fit the base pipeline
base_pipe.fit(X_train, y_train)
# evaluate model
base_scores = evaluate_model(base_pipe, True)
# predictions
base_preds = base_pipe.predict(test_df)
# add predictions to submission file
submission_df["target"] = base_preds
submission_df.head()
# save submission file
submission_df.to_csv("baseline_model_preds.csv", index = None)
This submission gives a score of 0.77256, which is better than the hardcoded model, but we can expect to improve on this score further.
Model Selection
We'll now train multiple classification models and compare their performance. We'll also compare the effects of scaling and oversampling on that performance. From these trained models, we can then choose one model and try to improve its scores through feature engineering and hyperparameter tuning.
# Define the models
models = {"LogisticRegression": LogisticRegression(n_jobs = -1),
"RidgeClassification": RidgeClassifier(random_state = seed),
"GaussianNB": GaussianNB(),
"RandomForestClassifier": RandomForestClassifier(n_estimators = 100, max_depth = 7, n_jobs = -1, random_state = seed),
"LGBMClassifier": lgb.LGBMClassifier(max_depth = 7, learning_rate = 0.05, n_estimators = 300, random_state = seed)}
# models with no scaling and oversampling
model_scores = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    print(model_name, "\n")
    model_scores[model_name] = evaluate_model(model)
    print("\n-------------------\n\n")
# models with only scaling
model_scores = {}
for model_name, model in models.items():
    model_pipe = make_imb_pipeline(scaler, model)
    model_pipe.fit(X_train, y_train)
    print(model_name, "\n")
    model_scores[model_name] = evaluate_model(model_pipe)
    print("\n-------------------\n\n")
# models with data scaled and oversampled
model_scores = {}
for model_name, model in models.items():
    model_pipe = make_imb_pipeline(scaler, sm, model)
    model_pipe.fit(X_train, y_train)
    print(model_name, "\n")
    model_scores[model_name] = evaluate_model(model_pipe)
    print("\n-------------------\n\n")
Observations
Oversampling doesn't seem to help and actually hurts performance for most models, so it is better to avoid it. Scaling helps Logistic Regression a little, but it has no effect on the other models. GaussianNB and LGBMClassifier perform the best without any scaling or oversampling. Although GaussianNB performs the best here, LGBMClassifier is only a little behind, and the opportunity to improve it with more boosting rounds suggests that we should choose this model for further improvement; a quick sketch to check that intuition follows below.
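As an optional check of that reasoning, a sketch along these lines (the round counts are arbitrary, and the loop is slow on the full training split) compares the validation ROC AUC of LGBMClassifier at increasing numbers of boosting rounds:
# optional sketch: does more boosting keep helping?
for n_rounds in (300, 600, 1200):
    lgbm = lgb.LGBMClassifier(max_depth = 7, learning_rate = 0.05, n_estimators = n_rounds,
                              random_state = seed, n_jobs = -1)
    lgbm.fit(X_train, y_train)
    auc = roc_auc_score(y_valid, lgbm.predict_proba(X_valid)[:, 1])
    print(n_rounds, "rounds -> validation ROC AUC:", round(auc, 5))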
Feature Engineering
This wonderful kernel showed that there is synthetic data in the test dataset and that only half of the test rows (the real ones) are actually used to evaluate the submission file. It also gives the indices of the test rows used for the public LB and the private LB. The key insight from that kernel is that the count of unique values for every feature carries useful signal, and this exact knowledge will be used for feature engineering.
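For context, here is a minimal, hedged sketch of the idea as I understand it from that kernel; we do not rely on this code and instead import the kernel's precomputed index files below:
# a test row with at least one value that is unique within its feature column is treated as real;
# rows with no unique values in any of the 200 features are the likely synthetic samples
unique_count = np.zeros(test_df.shape[0], dtype = int)
for feature in X.columns:
    value_counts = test_df[feature].value_counts()
    unique_count += (value_counts.loc[test_df[feature]].values == 1).astype(int)
print("likely real rows:", (unique_count > 0).sum(), "| likely synthetic rows:", (unique_count == 0).sum())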
The data from that kernel has already been added to the project data directory and will now be imported.
# file paths
for dirname, _, filenames in os.walk(os.getcwd()):
    for filename in filenames:
        print(os.path.join(dirname, filename))
synthetic_samples_indices = np.load(r"C:\Users\ncits\Downloads\Santader_Transactions_Predictions\synthetic_samples_indexes.npy")
public_lb = np.load(r"C:\Users\ncits\Downloads\Santader_Transactions_Predictions\public_LB.npy")
private_lb = np.load(r"C:\Users\ncits\Downloads\Santader_Transactions_Predictions\private_LB.npy")
# merge the train dataset with the real data from the public LB and the private LB into a new dataset
full = pd.concat([X, pd.concat([test_df.iloc[public_lb], test_df.iloc[private_lb]])])
full
We will add a new column for each feature in the dataset. For every row, the new column holds the value count of that row's value within the corresponding feature column, computed over the combined real data. This extra information should improve our score.
# add new columns
for feature in X.columns:
    count_vals = full[feature].value_counts()
    X["new_" + feature] = count_vals.loc[X[feature]].values
    test_df["new_" + feature] = count_vals.loc[test_df[feature]].values
# check the new shape of both train data and test data
X.shape, test_df.shape
# make new train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.25, stratify = y, random_state = seed)
Model Training
To train the LightGBM model, this time we won't use the scikit-learn wrapper (LGBMClassifier). Instead, we'll use the native lightgbm API, which gives us more control and flexibility over the hyperparameters. Also, the high score of GaussianNB suggests that the features are largely independent of one another (except the new columns, which depend on the columns they are based on). Therefore, we can train the LightGBM model on two features at a time: an original feature and the value-count feature we added through feature engineering. This stops the model from studying interrelationships between all 400 features, which saves a huge amount of training time.
# set training hyperparameters
core_count = psutil.cpu_count(logical=False)
param = {'bagging_fraction': 0.8,
'bagging_freq': 2,
'lambda_l1': 0.7,
'lambda_l2': 2,
'learning_rate': 0.01,
'max_depth': 5,
'min_data_in_leaf': 22,
'min_gain_to_split': 0.07,
'min_sum_hessian_in_leaf': 15,
'num_leaves': 20,
'feature_fraction': 1,
'save_binary': True,
'seed': seed,
'feature_fraction_seed': seed,
'bagging_seed': seed,
'drop_seed': seed,
'data_random_seed': seed,
'objective': 'binary',
'boosting_type': 'gbdt',
'verbosity': -1,
'metric': 'auc',
'is_unbalance': True,
'boost_from_average': 'false',
'num_threads': core_count}
# prediction matrices
valid_sub_preds = np.zeros([X_valid.shape[0], 200])
test_sub_preds = np.zeros([test_df.shape[0], 200])
# run training col by col
for col in tqdm(range(200)):
    feature = X.columns[col]
    feature_set = [feature, "new_" + feature]
    # make lgbm datasets
    train_l = lgb.Dataset(X_train[feature_set], y_train)
    valid_l = lgb.Dataset(X_valid[feature_set], y_valid)
    # train model
    lgb_clf = lgb.train(param, train_l, num_boost_round = 50, valid_sets = [train_l, valid_l], verbose_eval = -1)
    # make predictions
    valid_sub_preds[:, col] = lgb_clf.predict(X_valid[feature_set], num_iteration = lgb_clf.best_iteration)
    test_sub_preds[:, col] = lgb_clf.predict(test_df[feature_set], num_iteration = lgb_clf.best_iteration)
# validation set predictions
val_full_preds = valid_sub_preds.sum(axis = 1) / 200
def evaluate_perf(preds_prob):
    preds = (preds_prob > 0.5).astype(int)
    res = {
        "Accuracy Score": accuracy_score(y_valid, preds),
        "Precision Score": precision_score(y_valid, preds, zero_division = 0),
        "Recall Score": recall_score(y_valid, preds),
        "ROC_AUC Score": roc_auc_score(y_valid, preds_prob),
        "f1 Score": f1_score(y_valid, preds)
    }
    return res
evaluate_perf(val_full_preds)
The ROC_AUC score has improved a lot, considering it was already high. Accuracy also looks good. This improvement can be attributed to the feature engineering we did. Now we can make a submission.
# Make predictions on test dataset
test_full_preds = test_sub_preds.sum(axis = 1) / 200
test_full_preds
# save predictions and export it to csv file
submission_df["target"] = test_full_preds
submission_df.to_csv("lgbm_first_training.csv", index = None)
Submitting this file on Kaggle gets us a score of 0.91093, which puts us in the top 3% of the leaderboard. Our standing can improve further with hyperparameter tuning and some other tweaks.
Add feature weights
Because each independent feature contributes differently to the variation in the target class, we can incorporate feature weights when computing the final predictions. The feature weights can be calculated from the ROC AUC score that each feature's predictions achieve on the validation set, measured as the deviation from the mean of those scores.
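As a tiny worked example of the weighting formula used below (the numbers are made up):
# worked example: weight = 1 + (feature_auc - mean_auc) / mean_auc
feature_auc, mean_auc = 0.55, 0.52
weight = 1 + (feature_auc - mean_auc) / mean_auc
print(round(weight, 3)) # ~1.058, so an above-average feature gets slightly more say in the final sum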
# calculate feature weights
weights = []
for col in range(200):
    feature_roc_score = roc_auc_score(y_valid, valid_sub_preds[:, col])
    if feature_roc_score > 0.5:
        weights.append(feature_roc_score)
    else:
        weights.append(0)
weights[:30]
# transform weights to usable form
weights = np.array(weights)
weights = 1 + ((weights - weights.mean()) / weights.mean())
weights[:30]
weighted_valid_preds = (valid_sub_preds * weights).sum(axis = 1) / 200
evaluate_perf(weighted_valid_preds)
The scores have improved a little after adding weights, and we can make a new submission now.
weighted_preds = (test_sub_preds * weights).sum(axis = 1) / 200
weighted_preds
# save predictions and export it to csv file
submission_df["target"] = weighted_preds
submission_df.to_csv("lgbm_weighted_preds.csv", index = None)
Our new submission's private ROC_AUC score is 0.91353. This puts our private LB standing inside the top 150, which means we are in the top 2% of the leaderboard. Now it is time to tune the model's hyperparameters, after which we can make a final submission.
Hyperparameter Tuning
We can run Bayesian Optimization (from the bayes_opt package) for hyperparameter tuning; a toy sketch of the library's workflow is shown below. It involves the following steps:
- Create a separate train-test split with a 1:1 ratio.
- Create a black-box function for Bayesian Optimization to evaluate.
- Set the parameter boundaries.
- Find the best hyperparameters by calling maximize on the optimizer.
Because finding the best hyperparameters takes a long time, running the algorithm multiple times is neither feasible nor environmentally friendly. Thus, I'll use the hyperparameters from past runs.
The inspiration for the tuning was taken from here.
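For readers unfamiliar with the bayes_opt library, here is a minimal toy sketch of its workflow on a made-up objective; it is not part of the competition pipeline:
# toy sketch: maximize a simple function over bounded parameters
def toy_objective(a, b):
    return -(a - 1) ** 2 - (b + 2) ** 2 # known maximum at a = 1, b = -2

toy_bo = BayesianOptimization(toy_objective, {"a": (-5, 5), "b": (-5, 5)}, random_state = seed)
toy_bo.maximize(init_points = 3, n_iter = 10)
print(toy_bo.max) # best target value and the parameters that produced it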
# separate data for tuning
X_train_tuning, X_valid_tuning, y_train_tuning, y_valid_tuning = train_test_split(X, y, stratify = y, test_size = 0.5, random_state = seed)
# black box function for Bayesian Optimization
def LGB_bayesian(
    bagging_fraction,
    bagging_freq, # int
    lambda_l1,
    lambda_l2,
    learning_rate,
    max_depth, # int
    min_data_in_leaf, # int
    min_gain_to_split,
    min_sum_hessian_in_leaf,
    num_leaves, # int
    feature_fraction,
    num_boost_rounds):
    # LightGBM expects these parameters to be integer. So we make them integer
    bagging_freq = int(bagging_freq)
    num_leaves = int(num_leaves)
    min_data_in_leaf = int(min_data_in_leaf)
    max_depth = int(max_depth)
    num_boost_rounds = int(num_boost_rounds)
    # parameters
    param = {'bagging_fraction': bagging_fraction,
             'bagging_freq': bagging_freq,
             'lambda_l1': lambda_l1,
             'lambda_l2': lambda_l2,
             'learning_rate': learning_rate,
             'max_depth': max_depth,
             'min_data_in_leaf': min_data_in_leaf,
             'min_gain_to_split': min_gain_to_split,
             'min_sum_hessian_in_leaf': min_sum_hessian_in_leaf,
             'num_leaves': num_leaves,
             'feature_fraction': feature_fraction,
             'save_binary': True,
             'seed': seed,
             'feature_fraction_seed': seed,
             'bagging_seed': seed,
             'drop_seed': seed,
             'data_random_seed': seed,
             'objective': 'binary',
             'boosting_type': 'gbdt',
             'verbosity': -1,
             'metric': 'auc',
             'is_unbalance': True,
             'boost_from_average': 'false',
             'num_threads': core_count}
    # prediction matrix
    valid_sub_preds_tuning = np.zeros([X_valid_tuning.shape[0], 200])
    # run training col by col
    for col in range(200):
        feature = X.columns[col]
        feature_set = [feature, "new_" + feature]
        # make lgbm datasets
        train_l_tuning = lgb.Dataset(X_train_tuning[feature_set], y_train_tuning)
        valid_l_tuning = lgb.Dataset(X_valid_tuning[feature_set], y_valid_tuning)
        # train model
        lgb_clf = lgb.train(param, train_l_tuning, num_boost_round = num_boost_rounds, valid_sets = [train_l_tuning, valid_l_tuning], verbose_eval = -1)
        # make predictions
        valid_sub_preds_tuning[:, col] = lgb_clf.predict(X_valid_tuning[feature_set], num_iteration = lgb_clf.best_iteration)
    # calculate feature weights on the tuning validation split
    weights = []
    for col in range(200):
        feature_roc_score = roc_auc_score(y_valid_tuning, valid_sub_preds_tuning[:, col])
        if feature_roc_score > 0.5:
            weights.append(feature_roc_score)
        else:
            weights.append(0)
    # validation predictions
    weights = np.array(weights)
    weights_mean = weights.mean()
    weights = 1 + ((weights - weights_mean) / weights_mean)
    valid_full_preds_tuning = (valid_sub_preds_tuning * weights).sum(axis = 1)
    # score
    score = roc_auc_score(y_valid_tuning, valid_full_preds_tuning)
    return score
# parameter bounds
bounds_LGB = {
'bagging_fraction': (0.5, 1),
'bagging_freq': (1, 4),
'lambda_l1': (0, 3.0),
'lambda_l2': (0, 3.0),
'learning_rate': (0.005, 0.3),
'max_depth':(3,8),
'min_data_in_leaf': (5, 20),
'min_gain_to_split': (0, 1),
'min_sum_hessian_in_leaf': (0.01, 20),
'num_leaves': (5, 20),
'feature_fraction': (0.05, 1),
'num_boost_rounds': (30, 130)
}
LG_BO = BayesianOptimization(LGB_bayesian, bounds_LGB, random_state = seed)
LG_BO.space.keys
# LG_BO.maximize(init_points=5, n_iter=120, acq='ucb', xi=0.0, alpha=1e-6)
Running the above cell after uncommenting the code will find the best hyperparameters. We'll use the results from the past run.
# tuned hyperparameters
iterations = 126
param = {'bagging_fraction': 0.7693,
'bagging_freq': 2,
'lambda_l1': 0.7199,
'lambda_l2': 1.992,
'learning_rate': 0.009455,
'max_depth': 3,
'min_data_in_leaf': 22,
'min_gain_to_split': 0.06549,
'min_sum_hessian_in_leaf': 18.55,
'num_leaves': 20,
'feature_fraction': 1}
# All hyperparameters (the tuned values above plus the fixed settings)
iterations = 126
param = {'bagging_fraction': 0.7693,
'bagging_freq': 2,
'lambda_l1': 0.7199,
'lambda_l2': 1.992,
'learning_rate': 0.009455,
'max_depth': 3,
'min_data_in_leaf': 22,
'min_gain_to_split': 0.06549,
'min_sum_hessian_in_leaf': 18.55,
'num_leaves': 20,
'feature_fraction': 1,
'save_binary': True,
'seed': seed,
'feature_fraction_seed': seed,
'bagging_seed': seed,
'drop_seed': seed,
'data_random_seed': seed,
'objective': 'binary',
'boosting_type': 'gbdt',
'verbosity': -1,
'metric': 'auc',
'is_unbalance': True,
'boost_from_average': 'false',
'num_threads': core_count}
# Model Training
folds = StratifiedKFold(n_splits = 4)
columns = X.columns
col_count = 200
train_sub_preds = np.zeros([len(X), col_count])
test_sub_preds = np.zeros([len(test_df), col_count])
for col_idx in tqdm(range(col_count)):
    feature = columns[col_idx]
    feature_set = [feature, 'new_' + feature]
    temp_preds = np.zeros(len(test_df))
    for train_idx, valid_idx in folds.split(X, y):
        # use positional indexing for both X and y (the index is ID_code, not integers)
        train_data = lgb.Dataset(X.iloc[train_idx][feature_set], y.iloc[train_idx])
        valid_data = lgb.Dataset(X.iloc[valid_idx][feature_set], y.iloc[valid_idx])
        clf = lgb.train(param, train_data, num_boost_round = iterations, valid_sets = [train_data, valid_data], verbose_eval=-1)
        train_sub_preds[valid_idx, col_idx] = clf.predict(X.iloc[valid_idx][feature_set], num_iteration=clf.best_iteration)
        temp_preds += clf.predict(test_df[feature_set], num_iteration=clf.best_iteration) / folds.n_splits
    test_sub_preds[:, col_idx] = temp_preds
# calculate feature weights
weights = []
for col in range(200):
    feature_roc_score = roc_auc_score(y, train_sub_preds[:, col])
    if feature_roc_score > 0.5:
        weights.append(feature_roc_score)
    else:
        weights.append(0)
# final predictions
weights = np.array(weights)
weights_mean = weights.mean()
weights = 1 + ((weights - weights_mean) / weights_mean)
train_full_preds = (train_sub_preds * weights).sum(axis = 1) / 200
test_full_preds = (test_sub_preds * weights).sum(axis = 1) / 200
# roc auc score on the out-of-fold training predictions
roc_auc_score(y, train_full_preds)
submission_df["target"] = test_full_preds
submission_df.to_csv("final_submssion.csv", index = None)
This submission gives us a public score of 0.92231 and a private score of 0.92069. With this score, we stand at about rank 60 out of 8,712 submissions, putting us in the top 1% with our final submission.
In this project, we participated in the Santander Customer Transaction Prediction competition on Kaggle and tried to achieve as high a standing as possible on the public and private leaderboards. Our goal was to predict whether or not a customer would make a specific transaction. After basic data analysis, we compared multiple models and selected LightGBM, which had the most potential to improve the score. We also applied some feature engineering, which helped improve the ROC AUC score. Additional methods, such as using feature weights to compute the predictions and hyperparameter tuning, further improved the training and predictions.
Throughout these steps, we made multiple submission files and submitted them on Kaggle, each improving on the previous standing. With the final submission, we achieved a private LB score of 0.92069, which placed us at rank 60 out of 8,712 participants.
Thanks for reading.