Introduction

The Santander Group is a Spanish multinational financial services company based in Madrid and the 16th largest financial institution in the world. In 2019 it held a Kaggle competition whose goal was to identify which customers would make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for the competition has the same structure as the real data the Santander Group has available to solve this problem. It is one of the most popular competitions on Kaggle, with over 8,000 participating teams.

Today, we'll also participate in this competition and work towards improving our standing on the competition leaderboard. The dataset provides 200 anonymized numeric features, from which we have to predict the target class for the test dataset. Submissions are evaluated on the area under the ROC curve (ROC AUC) between the predicted probability and the observed target. Our goal is therefore to make predictions with the highest possible ROC AUC score on the test dataset. We begin by setting up the environment and importing the dataset files.
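
As a quick refresher, ROC AUC is computed from the ranking of the predicted probabilities rather than from hard class labels. A minimal sketch with scikit-learn's roc_auc_score, using made-up labels and probabilities purely for illustration:

from sklearn.metrics import roc_auc_score

# toy ground-truth labels and predicted probabilities (illustrative only)
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
# eight of the nine positive/negative pairs are ranked correctly -> AUC = 8/9
print(roc_auc_score(y_true, y_prob))  # 0.888...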

Setup

Import the required libraries and datasets, and set up the training environment.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
from tqdm import tqdm # progress meter
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_validate # training validation
from sklearn.preprocessing import MinMaxScaler # numeric scaler
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score, roc_auc_score, RocCurveDisplay, ConfusionMatrixDisplay # metrics
from imblearn.over_sampling import SMOTE # oversampling imbalanced data
from imblearn.pipeline import make_pipeline as make_imb_pipeline # imbalanced pipeline
from bayes_opt import BayesianOptimization # hyperparameter tuning
import psutil # cpu information

# ML models
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb


# ignore warnings
import warnings
warnings.filterwarnings("ignore")

# set working directory to the data folder
import os
os.chdir(os.path.join(os.getcwd(), "Santader_Transactions_Predictions"))

# file paths
for dirname, _, filenames in os.walk(os.getcwd()):
    for filename in filenames:
        print(os.path.join(dirname, filename))
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\private_LB.npy
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\public_LB.npy
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\sample_submission.csv
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\synthetic_samples_indexes.npy
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\test.csv
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\train.csv
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\.ipynb_checkpoints\Untitled-checkpoint.ipynb

Here, the CSV files are part of the official competition dataset, while the .npy files come from a separate kernel and will be used for feature engineering in a later section. We have already added them to the folder for future use.
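
Those helper files aren't needed yet; when the time comes they can be read with NumPy. A minimal sketch (not executed here), using the file names from the listing above:

# load the helper arrays produced by the separate kernel
synthetic_idx = np.load("synthetic_samples_indexes.npy")
public_lb = np.load("public_LB.npy")
private_lb = np.load("private_LB.npy")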

# file_paths
train_path = r"C:\Users\ncits\Downloads\Santader_Transactions_Predictions\train.csv"
test_path = r"C:\Users\ncits\Downloads\Santader_Transactions_Predictions\test.csv"
submission_path = r"C:\Users\ncits\Downloads\Santader_Transactions_Predictions\sample_submission.csv"
# train and test dataset file sizes
!dir {train_path} /a/s
print("-" * 50)
!dir {test_path} /a/s
 Volume in drive C has no label.
 Volume Serial Number is 7257-2892

 Directory of C:\Users\ncits\Downloads\Santader_Transactions_Predictions

11-12-2019  22:01       302,133,017 train.csv
               1 File(s)    302,133,017 bytes

     Total Files Listed:
               1 File(s)    302,133,017 bytes
               0 Dir(s)  299,401,584,640 bytes free
--------------------------------------------------
 Volume in drive C has no label.
 Volume Serial Number is 7257-2892

 Directory of C:\Users\ncits\Downloads\Santader_Transactions_Predictions

11-12-2019  22:00       301,526,706 test.csv
               1 File(s)    301,526,706 bytes

     Total Files Listed:
               1 File(s)    301,526,706 bytes
               0 Dir(s)  299,401,584,640 bytes free

Both the training and test files are roughly 300 MB on disk, so it is safe to import both datasets fully into memory at once.
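
If in doubt, the available RAM can be checked first. A small sketch using the already-imported psutil:

# check available RAM before loading two ~300 MB CSVs into DataFrames
print(f"Available memory: {psutil.virtual_memory().available / 1024**3:.1f} GB")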

# import training dataset
X = pd.read_csv(train_path, index_col = ["ID_code"])

# look at all columns
pd.set_option("display.max_columns", None)
X.head()
[output: first five rows of the training dataset (train_0 to train_4), indexed by ID_code, showing the binary target column followed by the 200 numeric features var_0 to var_199]
# separate target class
y = X.pop("target")
# import test dataset
test_df = pd.read_csv(test_path, index_col = ["ID_code"])
test_df.head()
[output: first five rows of the test dataset (test_0 to test_4), indexed by ID_code, showing the 200 numeric features var_0 to var_199]

We'll also import the sample submission file, which will hold the predicted targets from the trained ML models. It will then be exported as a CSV file and submitted on Kaggle.

# import submission file
submission_df = pd.read_csv(submission_path)
submission_df.head()
ID_code target
0 test_0 0
1 test_1 0
2 test_2 0
3 test_3 0
4 test_4 0

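As a preview of that final step, the export will look roughly like this (a sketch only; test_preds stands in for the predicted probabilities produced in the modelling sections):

# sketch (run after modelling): fill in the predictions and export for Kaggle
submission_df["target"] = test_preds   # test_preds: placeholder for predicted probabilities
submission_df.to_csv("submission.csv", index = False)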

# random state seed
seed = 42

EDA and Data Preparation

# Basic overview
X.info()
<class 'pandas.core.frame.DataFrame'>
Index: 200000 entries, train_0 to train_199999
Columns: 200 entries, var_0 to var_199
dtypes: float64(200)
memory usage: 306.7+ MB

Observations

  • The training data has 200,000 rows and 200 feature columns.
  • All values are numeric and stored as dtype float64.
  • The 200 columns are named var_0 to var_199.

Basic statistics

First, we'll look at the basic statistics of both the train and test datasets.

X.describe()
[output: summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for all 200 training features var_0 to var_199; each column has 200,000 non-null values]
test_df.describe()
[output: the corresponding summary statistics for the 200 test features var_0 to var_199; again 200,000 non-null values per column]

Observations

  • A quick look at the basic statistics of the training and test datasets doesn't reveal much.
  • Both the train and test datasets look quite similar in the mean, std and median values.
  • We can confirm this by looking at the frequency distribution of the data along all features in both the datasets.

Frequency Distribution Plots

1. Comparing train and test

We can visualize how each feature is distributed in both the train and test datasets and compare if there are any differences. The train dataset should ideally be representative of the test dataset.

%%time

fig, ax = plt.subplots(20, 10, figsize = (25, 50), constrained_layout = True)

for i, col in tqdm(enumerate(X.columns, start = 1)):
    plt.subplot(20, 10, i)
    sns.kdeplot(X[col])
    sns.kdeplot(test_df[col])
    plt.xlabel(col)
    plt.ylabel("")
    plt.legend(["train", "test"])
200it [07:08,  2.14s/it]
Wall time: 7min 10s

Observations

  • All variables are distributed similarly among the train and test datasets.
  • The train dataset is representative of the test dataset.
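
The visual similarity noted above could also be checked numerically. Below is a minimal sketch (not part of the original analysis) that runs a two-sample Kolmogorov-Smirnov test per feature; it assumes scipy is available, which is not among the imports in the setup.

# numeric cross-check of train/test similarity (illustrative sketch, requires scipy)
from scipy.stats import ks_2samp

ks_stats = {}
for col in X.columns:
    stat, p_value = ks_2samp(X[col], test_df[col])  # KS statistic per feature
    ks_stats[col] = stat

# features with the largest KS statistic differ the most between train and test
pd.Series(ks_stats).sort_values(ascending = False).head(10)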

2. Comparing target classes

We can also see if the data is distributed differently between the two target classes.

%%time

# divide the dataset with respect to target class.
t0 = X.loc[y == 0]
t1 = X.loc[y == 1]

# plot
fig, ax = plt.subplots(20, 10, figsize = (25, 50), constrained_layout = True)

for i, col in tqdm(enumerate(X.columns, start = 1)):
    plt.subplot(20, 10, i)
    sns.kdeplot(t0[col])
    sns.kdeplot(t1[col])
    plt.xlabel(col)
    plt.ylabel("")
    plt.legend([0, 1])
200it [04:28,  1.34s/it]
Wall time: 4min 31s

Observations

There do appear to be some differences in the distributions of features between the two target classes. The ML algorithms we train will try to learn from these differences, and find further patterns, to classify and differentiate between the two target classes better.

Target class

We'll look at the distribution of the target class now.

# distribution
target_vc = y.value_counts()/len(y)
target_vc
0    0.89951
1    0.10049
Name: target, dtype: float64
sns.barplot(x = target_vc.index, y = target_vc)
<AxesSubplot:ylabel='target'>

Observations

  • The target class is imbalanced.
  • It is distributed in roughly a 9:1 ratio.
  • For training some ML models, it may be better to oversample the data (a quick illustration of SMOTE follows below).
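
As a quick illustration of what oversampling does, the sketch below runs SMOTE on a small synthetic 9:1 dataset (toy data only, not the competition data). SMOTE synthesises new minority-class samples by interpolating between existing neighbours until the classes are balanced.

# illustrative sketch: SMOTE on a toy 9:1 dataset (not the competition data)
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X_demo, y_demo = make_classification(n_samples = 2000, n_features = 10,
                                     weights = [0.9, 0.1], random_state = 42)
print("before:", Counter(y_demo))   # roughly 9:1

X_res, y_res = SMOTE(random_state = 42).fit_resample(X_demo, y_demo)
print("after: ", Counter(y_res))    # balanced 1:1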

Null values

# null values count
X.isnull().sum().sort_values(ascending = False)
var_0      0
var_137    0
var_127    0
var_128    0
var_129    0
          ..
var_69     0
var_70     0
var_71     0
var_72     0
var_199    0
Length: 200, dtype: int64

Observations

  • There are no null values in the data, and thus it doesn't need any handling/preparation.

Feature transformation

# Scaling and Oversampling
scaler = MinMaxScaler()
sm = SMOTE(random_state = seed)

Separate validation data

This validation data will help us in evaluating the performance of trained ML models.

# make train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify = y, random_state = seed)

Train Hardcoded and Baseline models and evaluate results

We'll train a dummy classifier as a hardcoded model that always predicts the most frequent target class, in this case 0. We'll also train a baseline model with logistic regression. This gives us two reference scores that our future models should at least beat, and it helps catch errors in training.

Hardcoded model

# define hardcoded model and its pipeline
mf_model = DummyClassifier()
mf_pipe = make_imb_pipeline(scaler, sm, mf_model)
# fit the hardcoded pipeline
mf_pipe.fit(X_train, y_train)
Pipeline(steps=[('minmaxscaler', MinMaxScaler()),
                ('smote', SMOTE(random_state=42)),
                ('dummyclassifier', DummyClassifier())])
# function to evaluate the model
def evaluate_model(model_pipe, plot_graph = False):
    preds = model_pipe.predict(X_valid)
    try:
        # use predicted probabilities where the estimator supports them
        preds_prob = model_pipe.predict_proba(X_valid)[:, 1]
    except AttributeError:
        # fall back to decision_function (e.g. RidgeClassifier has no predict_proba)
        preds_prob = model_pipe.decision_function(X_valid)
    
    res = {
            "Accuracy Score": accuracy_score(y_valid, preds),
            "Precision Score": precision_score(y_valid, preds, zero_division = 0),
            "Recall Score": recall_score(y_valid, preds),
            "ROC_AUC Score": roc_auc_score(y_valid, preds_prob),
            "f1 Score": f1_score(y_valid, preds)
    }
    
    print(res)
    
    if plot_graph:
        plt.figure(1)
        ConfusionMatrixDisplay.from_predictions(y_valid, preds)
        plt.title("Confusion Matrix")

        plt.figure(2)
        RocCurveDisplay.from_predictions(y_valid, preds_prob)
        plt.title("Roc Curve")
    
    return res
mf_scores = evaluate_model(mf_pipe, True)
{'Accuracy Score': 0.89952, 'Precision Score': 0.0, 'Recall Score': 0.0, 'ROC_AUC Score': 0.5, 'f1 Score': 0.0}
<Figure size 432x288 with 0 Axes>

As expected, the dummy classifier achieved an accuracy of about 0.9 because the class is imbalanced and the classifier predicts the majority class for every sample. It has no discriminative ability, which is reflected in its ROC_AUC score of 0.5. The other ML algorithms we train should at least beat these scores.

Make submission

# predictions
dummy_preds = mf_pipe.predict(test_df)
# add predictions to submission file
submission_df["target"] = dummy_preds
submission_df.head()
ID_code target
0 test_0 0
1 test_1 0
2 test_2 0
3 test_3 0
4 test_4 0
# save submission file
submission_df.to_csv("hardcoded_model_preds.csv", index = None)

On Kaggle, this submission gives the score of 0.5, as expected.

Baseline Model

# define baseline model and its pipeline
base_model = LogisticRegression(random_state = seed)

base_pipe = make_imb_pipeline(scaler, sm, base_model)
# fit the base pipeline
base_pipe.fit(X_train, y_train)
Pipeline(steps=[('minmaxscaler', MinMaxScaler()),
                ('smote', SMOTE(random_state=42)),
                ('logisticregression', LogisticRegression(random_state=42))])
# evaluate model
mf_scores = evaluate_model(base_pipe, True)
{'Accuracy Score': 0.7887, 'Precision Score': 0.2902566431978197, 'Recall Score': 0.7631369426751592, 'ROC_AUC Score': 0.8575722869606891, 'f1 Score': 0.4205561344814348}
<Figure size 432x288 with 0 Axes>

Make Submission

# predictions
base_preds = base_pipe.predict(test_df)
# add predictions to submission file
submission_df["target"] = base_preds
submission_df.head()
ID_code target
0 test_0 1
1 test_1 1
2 test_2 0
3 test_3 1
4 test_4 0
# save submission file
submission_df.to_csv("baseline_model_preds.csv", index = None)

This submission gives a score of 0.77256, which is better than the hardcoded model, but we can expect to improve on this score further.

Model Selection

We'll now train multiple classification models and compare their performance. We'll also compare the effects of scaling and oversampling on performance. From these trained models, we can then choose one model and try to improve its scores through feature engineering and hyperparameter tuning.

# Define the models
models = {"LogisticRegression": LogisticRegression(n_jobs = -1),
        "RidgeClassification": RidgeClassifier(random_state = seed),
        "GaussianNB": GaussianNB(),
        "RandomForestClassifier": RandomForestClassifier(n_estimators = 100, max_depth = 7, n_jobs = -1, random_state = seed),
        "LGBMClassifier": lgb.LGBMClassifier(max_depth = 7, learning_rate = 0.05, n_estimators = 300, random_state = seed)}
# models with no scaling and oversampling
model_scores = {}

for model_name, model in models.items():
    
    model.fit(X_train, y_train)
    
    print(model_name, "\n")
    model_scores[model_name] = evaluate_model(model)
    print("\n-------------------\n\n")
LogisticRegression 

{'Accuracy Score': 0.911, 'Precision Score': 0.6688235294117647, 'Recall Score': 0.22631369426751594, 'ROC_AUC Score': 0.8465524013727349, 'f1 Score': 0.3381915526472338}

-------------------


RidgeClassification 

{'Accuracy Score': 0.90196, 'Precision Score': 0.9621212121212122, 'Recall Score': 0.025278662420382167, 'ROC_AUC Score': 0.8589495475081402, 'f1 Score': 0.04926299456943367}

-------------------


GaussianNB 

{'Accuracy Score': 0.91994, 'Precision Score': 0.7046092184368737, 'Recall Score': 0.3499203821656051, 'ROC_AUC Score': 0.8872843205689885, 'f1 Score': 0.4676153743848916}

-------------------


RandomForestClassifier 

{'Accuracy Score': 0.89952, 'Precision Score': 0.0, 'Recall Score': 0.0, 'ROC_AUC Score': 0.7883268347329475, 'f1 Score': 0.0}

-------------------


LGBMClassifier 

{'Accuracy Score': 0.9112, 'Precision Score': 0.8387470997679815, 'Recall Score': 0.1439092356687898, 'ROC_AUC Score': 0.8771059621748726, 'f1 Score': 0.2456676860346585}

-------------------


# models with only scaling
model_scores = {}

for model_name, model in models.items():
    model_pipe = make_imb_pipeline(scaler, model)
    
    model_pipe.fit(X_train, y_train)
    
    print(model_name, "\n")
    model_scores[model_name] = evaluate_model(model_pipe)
    print("\n-------------------\n\n")
LogisticRegression 

{'Accuracy Score': 0.91324, 'Precision Score': 0.6748216106014271, 'Recall Score': 0.2635350318471338, 'ROC_AUC Score': 0.8590269330833487, 'f1 Score': 0.37904380188949327}

-------------------


RidgeClassification 

{'Accuracy Score': 0.90194, 'Precision Score': 0.9618320610687023, 'Recall Score': 0.025079617834394906, 'ROC_AUC Score': 0.8589494811245403, 'f1 Score': 0.04888457807953443}

-------------------


GaussianNB 

{'Accuracy Score': 0.91994, 'Precision Score': 0.7046092184368737, 'Recall Score': 0.3499203821656051, 'ROC_AUC Score': 0.8872830327271504, 'f1 Score': 0.4676153743848916}

-------------------


RandomForestClassifier 

{'Accuracy Score': 0.89952, 'Precision Score': 0.0, 'Recall Score': 0.0, 'ROC_AUC Score': 0.7883291271799312, 'f1 Score': 0.0}

-------------------


LGBMClassifier 

{'Accuracy Score': 0.9114, 'Precision Score': 0.8453488372093023, 'Recall Score': 0.14470541401273884, 'ROC_AUC Score': 0.8771115560995588, 'f1 Score': 0.2471108089734874}

-------------------


# models with data scaled and oversampled
model_scores = {}

for model_name, model in models.items():
    model_pipe = make_imb_pipeline(scaler, sm, model)
    
    model_pipe.fit(X_train, y_train)
    
    print(model_name, "\n")
    model_scores[model_name] = evaluate_model(model_pipe)
    print("\n-------------------\n\n")
LogisticRegression 

{'Accuracy Score': 0.7887, 'Precision Score': 0.2902566431978197, 'Recall Score': 0.7631369426751592, 'ROC_AUC Score': 0.8575722869606891, 'f1 Score': 0.4205561344814348}

-------------------


RidgeClassification 

{'Accuracy Score': 0.78284, 'Precision Score': 0.28573527251358893, 'Recall Score': 0.7742834394904459, 'ROC_AUC Score': 0.8576126393382912, 'f1 Score': 0.41742676252816824}

-------------------


GaussianNB 

{'Accuracy Score': 0.8685, 'Precision Score': 0.1988349514563107, 'Recall Score': 0.10191082802547771, 'ROC_AUC Score': 0.6057410599524276, 'f1 Score': 0.1347545729701277}

-------------------


RandomForestClassifier 

{'Accuracy Score': 0.73772, 'Precision Score': 0.1695801339650384, 'Recall Score': 0.41321656050955413, 'ROC_AUC Score': 0.6527922154731639, 'f1 Score': 0.2404726051198888}

-------------------


LGBMClassifier 

{'Accuracy Score': 0.85454, 'Precision Score': 0.2912567291627993, 'Recall Score': 0.31230095541401276, 'ROC_AUC Score': 0.7348220492896991, 'f1 Score': 0.30141196811065224}

-------------------


Observations

Oversampling doesn't seem to help and actually hurts performance in most models, so it is better to avoid it. Scaling helps Logistic Regression a little but has no effect on the other models. GaussianNB and LGBMClassifier perform best without any scaling or oversampling. Although GaussianNB performs best here, LGBMClassifier is only a little behind, and the opportunity to improve it with more boosting rounds suggests we should opt for this model for further improvement.
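
As a rough way to gauge how many boosting rounds would actually help, one could let LightGBM stop once the validation AUC stops improving. The sketch below is illustrative only: the parameter values are not tuned and this run is not part of the results above.

# sketch: estimate a useful number of boosting rounds with early stopping (illustrative values)
train_set = lgb.Dataset(X_train, y_train)
valid_set = lgb.Dataset(X_valid, y_valid)

sketch_params = {"objective": "binary", "metric": "auc", "learning_rate": 0.05,
                 "max_depth": 7, "verbosity": -1, "seed": seed}

booster = lgb.train(sketch_params, train_set, num_boost_round = 2000,
                    valid_sets = [valid_set],
                    callbacks = [lgb.early_stopping(stopping_rounds = 100)])
booster.best_iteration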

Feature Engineering

This wonderful kernel showed that there is synthetic data in the test dataset and that only half of the test set is used to evaluate the submission file. It also gives the indices of the test rows used for the public LB and the private LB. The key insight realized in that kernel is that the count of unique values of every feature carries useful signal. This exact knowledge will be used for feature engineering.

The data from that kernel has already been added to the project data directory and will now be imported.
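
Before importing them, here is a rough sketch of the idea usually behind such synthetic-sample kernels: a test row can be treated as real if at least one of its feature values is unique within the test set, and as synthetic otherwise. This is an assumed reconstruction for illustration only; we will rely on the kernel's saved index files rather than recomputing anything.

# assumed logic for flagging synthetic test rows (illustration only; we use the kernel's .npy files below)
has_unique_value = np.zeros(len(test_df), dtype = bool)

for feature in test_df.columns:
    counts = test_df[feature].value_counts()
    has_unique_value |= (test_df[feature].map(counts).values == 1)   # value occurs only once in test

real_rows = np.where(has_unique_value)[0]        # candidates for public/private LB rows
synthetic_rows = np.where(~has_unique_value)[0]  # likely synthetic padding rows
len(real_rows), len(synthetic_rows)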

# file paths
for dirname, _, filenames in os.walk(os.getcwd()):
    for filename in filenames:
        print(os.path.join(dirname, filename))
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\baseline_model_preds.csv
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\final_submssion.csv
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\private_LB.npy
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\public_LB.npy
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\sample_submission.csv
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\synthetic_samples_indexes.npy
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\test.csv
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\train.csv
C:\Users\ncits\Downloads\Santader_Transactions_Predictions\.ipynb_checkpoints\Untitled-checkpoint.ipynb
synthetic_samples_indices = np.load(r"C:\Users\ncits\Downloads\Santader_Transactions_Predictions\synthetic_samples_indexes.npy")
public_lb = np.load(r"C:\Users\ncits\Downloads\Santader_Transactions_Predictions\public_LB.npy")
private_lb = np.load(r"C:\Users\ncits\Downloads\Santader_Transactions_Predictions\private_LB.npy")
# merge the train dataset with the real data from the public LB and the private LB into a new dataset
full = pd.concat([X, pd.concat([test_df.iloc[public_lb], test_df.iloc[private_lb]])])
full
[Preview of the combined DataFrame omitted for readability: head and tail rows of the 200 feature columns var_0 … var_199, indexed by ID_code.]

300000 rows × 200 columns

We will add a new column for each feature in the dataset, containing the value counts of that feature's values (computed over the combined data above). This extra information should improve our score.

# add new columns
for feature in X.columns:
    count_vals = full[feature].value_counts()
    X["new_" + feature] = count_vals.loc[X[feature]].values
    test_df["new_" + feature] = count_vals.loc[test_df[feature]].values
# check the new shape of both train data and test data
X.shape, test_df.shape
((200000, 400), (200000, 400))
# make new train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.25, stratify = y, random_state = seed)

Model Training

To train the LGBMClassifier, this time we won't use the scikit-learn API. Rather, we'll use the native lightgbm API, which gives us better control and flexibility over the hyperparameters. Also, the high score of GaussianNB suggests that the features are largely independent of each other (except the new columns, which depend on the columns they are based on). Therefore, we can train the LGBM model two features at a time: the original feature and the value-count feature added through feature engineering. This stops the model from modelling interrelationships between all 400 features, which saves a huge amount of training time.

# set training hyperparameters
core_count = psutil.cpu_count(logical=False)

param = {'bagging_fraction': 0.8,
   'bagging_freq': 2,
   'lambda_l1': 0.7,
   'lambda_l2': 2,
   'learning_rate': 0.01,
   'max_depth': 5,
   'min_data_in_leaf': 22,
   'min_gain_to_split': 0.07,
   'min_sum_hessian_in_leaf': 15,
   'num_leaves': 20,
   'feature_fraction': 1,
   'save_binary': True,
   'seed': seed,
   'feature_fraction_seed': seed,
   'bagging_seed': seed,
   'drop_seed': seed,
   'data_random_seed': seed,
   'objective': 'binary',
   'boosting_type': 'gbdt',
   'verbosity': -1,
   'metric': 'auc',
   'is_unbalance': True,
   'boost_from_average': 'false',
   'num_threads': core_count}
# prediction matrices
valid_sub_preds = np.zeros([X_valid.shape[0], 200])
test_sub_preds = np.zeros([test_df.shape[0], 200])

# run training col by col
for col in tqdm(range(200)):
    feature = X.columns[col]
    feature_set = [feature, "new_" + feature]
    
    # make lgbm datasets
    train_l = lgb.Dataset(X_train[feature_set], y_train)
    valid_l = lgb.Dataset(X_valid[feature_set], y_valid)
    
    # train model
    lgb_clf = lgb.train(param, train_l, num_boost_round = 50, valid_sets = [train_l, valid_l], verbose_eval = -1)
    
    # make predictions
    valid_sub_preds[:, col] = lgb_clf.predict(X_valid[feature_set], num_iteration = lgb_clf.best_iteration)
    test_sub_preds[:, col] = lgb_clf.predict(test_df[feature_set], num_iteration = lgb_clf.best_iteration)
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [01:35<00:00,  2.09it/s]

Evaluate Performance

# validation set predictions
val_full_preds = valid_sub_preds.sum(axis = 1) / 200
def evaluate_perf(preds_prob):
    preds = (preds_prob > 0.5).astype(int)
    
    res = {
            "Accuracy Score": accuracy_score(y_valid, preds),
            "Precision Score": precision_score(y_valid, preds, zero_division = 0),
            "Recall Score": recall_score(y_valid, preds),
            "ROC_AUC Score": roc_auc_score(y_valid, preds_prob),
            "f1 Score": f1_score(y_valid, preds)
    }
    
    return res
    
    
evaluate_perf(val_full_preds)
{'Accuracy Score': 0.89192,
 'Precision Score': 0.4756971092350985,
 'Recall Score': 0.7402468152866242,
 'ROC_AUC Score': 0.9152776960521903,
 'f1 Score': 0.5791932720760006}

The ROC_AUC score has improved a lot, considering that it was already high. Accuracy also looks good. This improvement can be attributed to the feature engineering we did. Now we can make a submission.

Make Submission

# Make predictions on test dataset
test_full_preds = test_sub_preds.sum(axis = 1) / 200
test_full_preds
array([0.49980011, 0.49991953, 0.50039014, ..., 0.49759414, 0.4996321 ,
       0.49941331])
# save predictions and export it to csv file
submission_df["target"] = test_full_preds
submission_df.to_csv("lgbm_first_training.csv", index = None)

Submitting this file on Kaggle gets us a score of 0.91093, which puts us in the top 3% of the leaderboard. Our standing can improve further with hyperparameter tuning and some other tweaks.

Add feature weights

Because each feature contributes differently to the variation in the target class, we can include feature weights when computing the final predictions. The weight for each feature can be calculated from the roc_auc score of its individual predictions and that score's deviation from the mean.

# calculate feature weights
weights = []

for col in range(200):
    feature_roc_score = roc_auc_score(y_valid, valid_sub_preds[:, col])
    if feature_roc_score > 0.5:
        weights.append(feature_roc_score)
    else:
        weights.append(0)

weights[:30]
[0.543336515143533,
 0.5473523268496205,
 0.5464708831971531,
 0.5074337837752676,
 0.5024420645540325,
 0.5266766700555937,
 0.5584090641866745,
 0.50110258512608,
 0.5156600970092755,
 0.5397526747988171,
 0.5029713874646804,
 0.5129343952478831,
 0.5598962449116529,
 0.5563167593310913,
 0.5052252058316452,
 0.5078600970411395,
 0.5009789965653303,
 0,
 0.5380105522839358,
 0.5083492822144917,
 0.5205073101974274,
 0.5587881233933398,
 0.5545546287991954,
 0.5214425223530398,
 0.5334865630565602,
 0.5012005053615289,
 0.5557101769740748,
 0.5005401301607142,
 0.5289215974457432,
 0]
# transform weights to usable form
weights = np.array(weights)
weights = 1 + ((weights - weights.mean()) / weights.mean())
weights[:30]
array([1.12208421, 1.13037756, 1.12855722, 1.04793884, 1.03763007,
       1.08767874, 1.15321164, 1.03486381, 1.0649276 , 1.11468296,
       1.03872321, 1.05929855, 1.15628292, 1.14889067, 1.04337774,
       1.04881925, 1.03460858, 0.        , 1.11108517, 1.0498295 ,
       1.07493794, 1.15399446, 1.14525156, 1.07686931, 1.10174235,
       1.03506603, 1.14763797, 1.03370224, 1.0923149 , 0.        ])
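
Note that the transformation 1 + (weights - mean) / mean simplifies to weights / mean, so each feature's final weight is simply its validation roc_auc relative to the mean roc_auc, while the features whose score was zeroed out contribute nothing to the weighted sum.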

Evaluate Performance

weighted_valid_preds = (valid_sub_preds * weights).sum(axis = 1) / 200
evaluate_perf(weighted_valid_preds)
{'Accuracy Score': 0.8914,
 'Precision Score': 0.47449748743718595,
 'Recall Score': 0.7517914012738853,
 'ROC_AUC Score': 0.9167512969054126,
 'f1 Score': 0.5817929759704252}

The scores have improved a little after adding weights, and we can make a new submission now.

Make submission

weighted_preds = (test_sub_preds * weights).sum(axis = 1) / 200
weighted_preds
array([0.49969991, 0.49995959, 0.50052471, ..., 0.49724971, 0.49961701,
       0.49939437])
# save predictions and export it to csv file
submission_df["target"] = weighted_preds
submission_df.to_csv("lgbm_weighted_preds.csv", index = None)

Our new submission's private ROC_AUC score is 0.91353, which moves our private LB standing into the top 150, i.e. the top 2% of the LB. Now it is time to tune the model's hyperparameters, after which we can make a final submission.

Hyperparameter Tuning

We can run Bayesian Optimization for hyperparameter tuning. It'll include the following steps:

  • Create a separate train-test split with a 1:1 ratio.
  • Create a black box function for Bayesian Optimization to execute.
  • Set the parameter boundaries.
  • Find the best hyperparameters by running maximize on the Bayesian Optimization object.

Because finding the best hyperparameters takes a long time, running the algorithm multiple times is neither feasible nor environmentally friendly. Thus, I'll use the hyperparameters from past runs.

The inspiration for the tuning was taken from here.

# separate data for tuning
X_train_tuning, X_valid_tuning, y_train_tuning, y_valid_tuning = train_test_split(X, y, stratify = y, test_size = 0.5, random_state = seed)
# black box function for Bayesian Optimization
def LGB_bayesian(
    bagging_fraction,
    bagging_freq, # int
    lambda_l1,
    lambda_l2,
    learning_rate,
    max_depth, # int
    min_data_in_leaf,  # int
    min_gain_to_split,
    min_sum_hessian_in_leaf,  
    num_leaves,  # int
    feature_fraction,
    num_boost_rounds):
    
    # LightGBM expects these parameters to be integer. So we make them integer
    bagging_freq = int(bagging_freq)
    num_leaves = int(num_leaves)
    min_data_in_leaf = int(min_data_in_leaf)
    max_depth = int(max_depth)
    num_boost_rounds = int(num_boost_rounds)
    
    # parameters
    param = {'bagging_fraction': bagging_fraction,
   'bagging_freq': bagging_freq,
   'lambda_l1': lambda_l1,
   'lambda_l2': lambda_l2,
   'learning_rate': learning_rate,
   'max_depth': max_depth,
   'min_data_in_leaf': min_data_in_leaf,
   'min_gain_to_split': min_gain_to_split,
   'min_sum_hessian_in_leaf': min_sum_hessian_in_leaf,
   'num_leaves': num_leaves,
   'feature_fraction': feature_fraction,
   'save_binary': True,
   'seed': seed,
   'feature_fraction_seed': seed,
   'bagging_seed': seed,
   'drop_seed': seed,
   'data_random_seed': seed,
   'objective': 'binary',
   'boosting_type': 'gbdt',
   'verbosity': -1,
   'metric': 'auc',
   'is_unbalance': True,
   'boost_from_average': 'false',
   'num_threads': core_count}
    
    # prediction matrix
    valid_sub_preds_tuning = np.zeros([X_valid_tuning.shape[0], 200])

    # run training col by col
    for col in range(200):
        feature = X.columns[col]
        feature_set = [feature, "new_" + feature]

        # make lgbm datasets
        train_l_tuning = lgb.Dataset(X_train_tuning[feature_set], y_train_tuning)
        valid_l_tuning = lgb.Dataset(X_valid_tuning[feature_set], y_valid_tuning)

        # train model
        lgb_clf = lgb.train(param, train_l_tuning, num_boost_round = num_boost_rounds, valid_sets = [train_l_tuning, valid_l_tuning], verbose_eval = -1)

        # make predictions
        valid_sub_preds_tuning[:, col] = lgb_clf.predict(X_valid_tuning[feature_set], num_iteration = lgb_clf.best_iteration)
    
    
    # calculate feature weights
    weights = []
    for col in range(200):
        feature_roc_score = roc_auc_score(y_valid_tuning, valid_sub_preds_tuning[:, col])  # use the tuning split's predictions
        if feature_roc_score > 0.5:
            weights.append(feature_roc_score)
        else:
            weights.append(0)
    
    # validation predictions
    weights = np.array(weights)
    weights_mean = weights.mean()
    weights = 1 + ((weights - weights_mean) / weights_mean)
    valid_full_preds_tuning = (valid_sub_preds_tuning * weights).sum(axis = 1)
    
    # score
    score = roc_auc_score(y_valid_tuning, valid_full_preds_tuning)
    return score
# parameter bounds
bounds_LGB = {
    'bagging_fraction': (0.5, 1),
    'bagging_freq': (1, 4),
    'lambda_l1': (0, 3.0), 
    'lambda_l2': (0, 3.0), 
    'learning_rate': (0.005, 0.3),
    'max_depth':(3,8),
    'min_data_in_leaf': (5, 20),  
    'min_gain_to_split': (0, 1),
    'min_sum_hessian_in_leaf': (0.01, 20),    
    'num_leaves': (5, 20),
    'feature_fraction': (0.05, 1),
    'num_boost_rounds': (30, 130)
}
LG_BO = BayesianOptimization(LGB_bayesian, bounds_LGB, random_state = seed)
LG_BO.space.keys
['bagging_fraction',
 'bagging_freq',
 'feature_fraction',
 'lambda_l1',
 'lambda_l2',
 'learning_rate',
 'max_depth',
 'min_data_in_leaf',
 'min_gain_to_split',
 'min_sum_hessian_in_leaf',
 'num_boost_rounds',
 'num_leaves']
# LG_BO.maximize(init_points=5, n_iter=120, acq='ucb', xi=0.0, alpha=1e-6)

Running the above cell after uncommenting the code will find the best hyperparameters. We'll use the results from the past run.
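
If the optimizer were actually run, the best parameters could be read from its max attribute; a commented sketch (the parameter keys mirror the bounds defined above):

# best_run = LG_BO.max                           # {'target': best_auc, 'params': {...}}
# iterations = int(best_run['params']['num_boost_rounds'])
# tuned = {k: v for k, v in best_run['params'].items() if k != 'num_boost_rounds'}
# for k in ['bagging_freq', 'max_depth', 'min_data_in_leaf', 'num_leaves']:
#     tuned[k] = int(tuned[k])                   # cast the integer-valued parameters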

# tuned hyperparameters
iterations = 126
param = {'bagging_fraction': 0.7693,
   'bagging_freq': 2,
   'lambda_l1': 0.7199,
   'lambda_l2': 1.992,
   'learning_rate': 0.009455,
   'max_depth': 3,
   'min_data_in_leaf': 22,
   'min_gain_to_split': 0.06549,
   'min_sum_hessian_in_leaf': 18.55,
   'num_leaves': 20,
   'feature_fraction': 1}

Final Training and Submission

Now that we have found the best hyperparameters for the model, we can train it on the full training data and then make a submission from its predictions.

# All hyperparameters
iterations = 126
param = {'bagging_fraction': 0.7693,
   'bagging_freq': 2,
   'lambda_l1': 0.7199,
   'lambda_l2': 1.992,
   'learning_rate': 0.009455,
   'max_depth': 3,
   'min_data_in_leaf': 22,
   'min_gain_to_split': 0.06549,
   'min_sum_hessian_in_leaf': 18.55,
   'num_leaves': 20,
   'feature_fraction': 1,
   'save_binary': True,
   'seed': seed,
   'feature_fraction_seed': seed,
   'bagging_seed': seed,
   'drop_seed': seed,
   'data_random_seed': seed,
   'objective': 'binary',
   'boosting_type': 'gbdt',
   'verbosity': -1,
   'metric': 'auc',
   'is_unbalance': True,
   'boost_from_average': 'false',
   'num_threads': core_count}
# Model Training

folds = StratifiedKFold(n_splits = 4)

columns = X.columns
col_count = 200

train_sub_preds = np.zeros([len(X), col_count])
test_sub_preds = np.zeros([len(test_df), col_count])


for col_idx in tqdm(range(col_count)):
    
    feature = columns[col_idx]
    feature_set = [feature, 'new_' + feature]
    
    temp_preds = np.zeros(len(test_df))
    
    for train_idx, valid_idx in folds.split(X, y):
        
        train_data = lgb.Dataset(X.iloc[train_idx][feature_set], y[train_idx])
        valid_data = lgb.Dataset(X.iloc[valid_idx][feature_set], y[valid_idx])
        
        clf = lgb.train(param, train_data, num_boost_round = iterations, valid_sets = [train_data, valid_data], verbose_eval=-1)
        
        train_sub_preds[valid_idx, col_idx] = clf.predict(X.iloc[valid_idx][feature_set], num_iteration=clf.best_iteration)
        temp_preds += clf.predict(test_df[feature_set], num_iteration=clf.best_iteration) / folds.n_splits
    
    test_sub_preds[:, col_idx] = temp_preds
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [15:51<00:00,  4.76s/it]
# calculate feature weights
weights = []
for col in range(200):
    feature_roc_score = roc_auc_score(y, train_sub_preds[:, col])
    if feature_roc_score > 0.5:
        weights.append(feature_roc_score)
    else:
        weights.append(0)

# final predictions
weights = np.array(weights)
weights_mean = weights.mean()
weights = 1 + ((weights - weights_mean) / weights_mean)
train_full_preds = (train_sub_preds * weights).sum(axis = 1) / 200
test_full_preds = (test_sub_preds * weights).sum(axis = 1) / 200
# roc_auc score on the out-of-fold training predictions
roc_auc_score(y, train_full_preds)
0.9210397907630518

Make Submission

submission_df["target"] = test_full_preds
submission_df.to_csv("final_submssion.csv", index = None)

This submission gives us a public score of 0.92231 and a private score of 0.92069. With this score we stand at about rank 60 out of 8712 current submissions, putting us in the top 1% with our final submission.

Summary and Conclusion

In this project, we participated in the Santander Customer Transaction Prediction competition on Kaggle and tried to achieve as high a standing as possible on the public and private leaderboards. Our goal was to predict whether a customer would make a specific transaction or not. After basic data analysis, we compared multiple models and selected LightGBM, which had the most potential to improve the score. We also applied some feature engineering, which helped improve the AUC ROC score. Additional methods, such as using feature weights to compute the predictions and hyperparameter tuning, further improved the training and predictions.

Throughout these steps, we made multiple submission files and submitted them on Kaggle, each improving on the previous standing. With the final submission, we achieved a private LB score of 0.92069, which placed us at rank 60 out of 8712 participants.

Thanks for reading.