산탄데르 고객 만족 예측

 

파이썬 머신러닝 완벽가이드 내용임. 데이터를 어떻게 다뤄서 모델의 성능을 높이는지에 주목

문제

  • 370개의 피처로 주어진 데이터 세트 기반으로 고객 만족 여부(0 or 1) 예측하는 것

    Kaggle-Data

성능평가

  • ROC-AUC로 평가
    • 대부분이 만족하며 불만족인 데이터는 일부일 것이기 때문에 정확도 수치보다는 ROC-AUC가 더 적합함

진행과정

  • 바로 lightGBM에 적용시켜서 accuracy, roc_auc 값 출력해봄

      import pandas as pd
      import numpy as np
      from sklearn.model_selection import train_test_split
      from lightgbm import LGBMClassifier
      from sklearn.metrics import accuracy_score, roc_auc_score
        
      raw_data = pd.read_csv('./train_santander.csv.csv')
      data = raw_data.iloc[:, :-1]
      target = raw_data.iloc[:, -1]
      # data.shape, target.shape
      X_train, X_test, y_train, y_test = train_test_split(data, target)
        
      ligbm = LGBMClassifier()
      ligbm.fit(X_train, y_train)
      preds = ligbm.predict(X_test)
      probs = ligbm.predict_proba(X_test)[:, 1]
      print(f'정확도 {accuracy_score(y_test, preds)}')
      print(f'AUC: {roc_auc_score(y_test, preds)}')
        
      '''output
      정확도 0.9582215206524599
      AUC: 0.5016565312870864
      '''
    

Null 값 여부 및 TARGET 분포 확인

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import matplotlib

cust_df = pd.read_csv("./train_santander.csv",encoding='latin-1')
print('dataset shape:', cust_df.shape)
cust_df.head(3)
dataset shape: (76020, 371)
ID var3 var15 imp_ent_var16_ult1 imp_op_var39_comer_ult1 imp_op_var39_comer_ult3 imp_op_var40_comer_ult1 imp_op_var40_comer_ult3 imp_op_var40_efect_ult1 imp_op_var40_efect_ult3 ... saldo_medio_var33_hace2 saldo_medio_var33_hace3 saldo_medio_var33_ult1 saldo_medio_var33_ult3 saldo_medio_var44_hace2 saldo_medio_var44_hace3 saldo_medio_var44_ult1 saldo_medio_var44_ult3 var38 TARGET
0 1 2 23 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 39205.17 0
1 3 2 34 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 49278.03 0
2 4 2 23 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 67333.77 0

3 rows × 371 columns

cust_df.info()      # 피처의 타입과 널 값을 알아보기 위해 DataFrame.info() 사용
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76020 entries, 0 to 76019
Columns: 371 entries, ID to TARGET
dtypes: float64(111), int64(260)
memory usage: 215.2 MB
# cust_df['TARGET'].value_counts()[0] / len(cust_df['TARGET'].value_counts()), cust_df['TARGET'].value_counts()[1] / len(cust_df['TARGET'].value_counts())
len(cust_df['TARGET'].value_counts())
2
print(cust_df['TARGET'].value_counts())
unsatisfied_cnt = cust_df[cust_df['TARGET'] == 1].TARGET.count()
total_cnt = cust_df.TARGET.count()
print('unsatisfied 비율은 {0:.2f}'.format((unsatisfied_cnt / total_cnt)))
0    73012
1     3008
Name: TARGET, dtype: int64
unsatisfied 비율은 0.04

각 Feature의 값 분포 확인

cust_df.describe( )
ID var3 var15 imp_ent_var16_ult1 imp_op_var39_comer_ult1 imp_op_var39_comer_ult3 imp_op_var40_comer_ult1 imp_op_var40_comer_ult3 imp_op_var40_efect_ult1 imp_op_var40_efect_ult3 ... saldo_medio_var33_hace2 saldo_medio_var33_hace3 saldo_medio_var33_ult1 saldo_medio_var33_ult3 saldo_medio_var44_hace2 saldo_medio_var44_hace3 saldo_medio_var44_ult1 saldo_medio_var44_ult3 var38 TARGET
count 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 ... 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 76020.000000 7.602000e+04 76020.000000
mean 75964.050723 -1523.199277 33.212865 86.208265 72.363067 119.529632 3.559130 6.472698 0.412946 0.567352 ... 7.935824 1.365146 12.215580 8.784074 31.505324 1.858575 76.026165 56.614351 1.172358e+05 0.039569
std 43781.947379 39033.462364 12.956486 1614.757313 339.315831 546.266294 93.155749 153.737066 30.604864 36.513513 ... 455.887218 113.959637 783.207399 538.439211 2013.125393 147.786584 4040.337842 2852.579397 1.826646e+05 0.194945
min 1.000000 -999999.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.163750e+03 0.000000
25% 38104.750000 2.000000 23.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.787061e+04 0.000000
50% 76043.000000 2.000000 28.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.064092e+05 0.000000
75% 113748.750000 2.000000 40.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.187563e+05 0.000000
max 151838.000000 238.000000 105.000000 210000.000000 12888.030000 21024.810000 8237.820000 11073.570000 6600.000000 6600.000000 ... 50003.880000 20385.720000 138831.630000 91778.730000 438329.220000 24650.010000 681462.900000 397884.300000 2.203474e+07 1.000000

8 rows × 371 columns

var3 Feature를 볼 때 -999999 값은 다른 값들에 비해 너무 큼. Nan, 특정 예외값을 -999999로 처리한것으로 보임. 가장 많이 보이는 2로 대체

  • ID Feature는 고객 만족도 분류 하는데 쓸모없을 것이므로 Drop
cust_df['var3'].value_counts()
 2         74165
 8           138
-999999      116
 9           110
 3           108
           ...  
 231           1
 188           1
 168           1
 135           1
 87            1
Name: var3, Length: 208, dtype: int64
# var3 피처 값 대체 및 ID 피처 드롭
cust_df['var3'].replace(-999999, 2, inplace=True)
cust_df.drop('ID', axis=1, inplace=True)

# 피처 세트와 레이블 세트분리. 레이블 컬럼은 DataFrame의 맨 마지막에 위치해 컬럼 위치 -1로 분리
X_features = cust_df.iloc[:, :-1]
y_labels = cust_df.iloc[:, -1]
print('피처 데이터 shape:{0}'.format(X_features.shape))
피처 데이터 shape:(76020, 369)

브로드캐스팅을 활용하여 split한 데이터들의 레이블별 %를 쉽게 나타냄

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_features, y_labels,
                                                    test_size=0.2, random_state=0)
train_cnt = y_train.count()
test_cnt = y_test.count()
print('학습 세트 Shape:{0}, 테스트 세트 Shape:{1}'.format(X_train.shape , X_test.shape))
print(' 학습 세트 레이블 값 분포 비율')
print(y_train.value_counts()/train_cnt)
print('\n 테스트 세트 레이블 값 분포 비율')
print(y_test.value_counts()/test_cnt)
학습 세트 Shape:(60816, 369), 테스트 세트 Shape:(15204, 369)
 학습 세트 레이블 값 분포 비율
0    0.960964
1    0.039036
Name: TARGET, dtype: float64

 테스트 세트 레이블 값 분포 비율
0    0.9583
1    0.0417
Name: TARGET, dtype: float64
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# n_estimators는 500으로, random state는 예제 수행 시마다 동일 예측 결과를 위해 설정. 
xgb_clf = XGBClassifier(n_estimators=500, random_state=156)

# 성능 평가 지표를 auc로, 조기 중단 파라미터는 100으로 설정하고 학습 수행. 
xgb_clf.fit(X_train, y_train, early_stopping_rounds=100,
            eval_metric="auc", eval_set=[(X_test, y_test)])

xgb_roc_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))


C:\Users\hojun_window\anaconda3\envs\ml\lib\site-packages\xgboost\sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)


[0]	validation_0-auc:0.81157
[1]	validation_0-auc:0.82452
[2]	validation_0-auc:0.82746
[3]	validation_0-auc:0.82922
[4]	validation_0-auc:0.83298
[5]	validation_0-auc:0.83500
[6]	validation_0-auc:0.83653
[7]	validation_0-auc:0.83782
[8]	validation_0-auc:0.83802
[9]	validation_0-auc:0.83914
[10]	validation_0-auc:0.83954
[11]	validation_0-auc:0.83983
[12]	validation_0-auc:0.84033
[13]	validation_0-auc:0.84054
[14]	validation_0-auc:0.84135
[15]	validation_0-auc:0.84117
[16]	validation_0-auc:0.84101
[17]	validation_0-auc:0.84071
[18]	validation_0-auc:0.84052
[19]	validation_0-auc:0.84023
[20]	validation_0-auc:0.84012
[21]	validation_0-auc:0.84022
[22]	validation_0-auc:0.84007
[23]	validation_0-auc:0.84009
[24]	validation_0-auc:0.83974
[25]	validation_0-auc:0.84015
[26]	validation_0-auc:0.84101
[27]	validation_0-auc:0.84088
[28]	validation_0-auc:0.84074
[29]	validation_0-auc:0.83999
[30]	validation_0-auc:0.83959
[31]	validation_0-auc:0.83952
[32]	validation_0-auc:0.83901
[33]	validation_0-auc:0.83885
[34]	validation_0-auc:0.83887
[35]	validation_0-auc:0.83864
[36]	validation_0-auc:0.83834
[37]	validation_0-auc:0.83810
[38]	validation_0-auc:0.83810
[39]	validation_0-auc:0.83813
[40]	validation_0-auc:0.83776
[41]	validation_0-auc:0.83720
[42]	validation_0-auc:0.83684
[43]	validation_0-auc:0.83672
[44]	validation_0-auc:0.83674
[45]	validation_0-auc:0.83693
[46]	validation_0-auc:0.83686
[47]	validation_0-auc:0.83678
[48]	validation_0-auc:0.83694
[49]	validation_0-auc:0.83676
[50]	validation_0-auc:0.83655
[51]	validation_0-auc:0.83669
[52]	validation_0-auc:0.83641
[53]	validation_0-auc:0.83690
[54]	validation_0-auc:0.83693
[55]	validation_0-auc:0.83681
[56]	validation_0-auc:0.83680
[57]	validation_0-auc:0.83667
[58]	validation_0-auc:0.83664
[59]	validation_0-auc:0.83591
[60]	validation_0-auc:0.83576
[61]	validation_0-auc:0.83534
[62]	validation_0-auc:0.83513
[63]	validation_0-auc:0.83510
[64]	validation_0-auc:0.83508
[65]	validation_0-auc:0.83518
[66]	validation_0-auc:0.83510
[67]	validation_0-auc:0.83523
[68]	validation_0-auc:0.83457
[69]	validation_0-auc:0.83460
[70]	validation_0-auc:0.83446
[71]	validation_0-auc:0.83462
[72]	validation_0-auc:0.83394
[73]	validation_0-auc:0.83410
[74]	validation_0-auc:0.83394
[75]	validation_0-auc:0.83368
[76]	validation_0-auc:0.83413
[77]	validation_0-auc:0.83359
[78]	validation_0-auc:0.83353
[79]	validation_0-auc:0.83293
[80]	validation_0-auc:0.83253
[81]	validation_0-auc:0.83187
[82]	validation_0-auc:0.83230
[83]	validation_0-auc:0.83216
[84]	validation_0-auc:0.83206
[85]	validation_0-auc:0.83196
[86]	validation_0-auc:0.83200
[87]	validation_0-auc:0.83208
[88]	validation_0-auc:0.83174
[89]	validation_0-auc:0.83160
[90]	validation_0-auc:0.83155
[91]	validation_0-auc:0.83165
[92]	validation_0-auc:0.83172
[93]	validation_0-auc:0.83160
[94]	validation_0-auc:0.83150
[95]	validation_0-auc:0.83132
[96]	validation_0-auc:0.83090
[97]	validation_0-auc:0.83091
[98]	validation_0-auc:0.83066
[99]	validation_0-auc:0.83058
[100]	validation_0-auc:0.83029
[101]	validation_0-auc:0.82955
[102]	validation_0-auc:0.82962
[103]	validation_0-auc:0.82893
[104]	validation_0-auc:0.82837
[105]	validation_0-auc:0.82815
[106]	validation_0-auc:0.82744
[107]	validation_0-auc:0.82728
[108]	validation_0-auc:0.82651
[109]	validation_0-auc:0.82650
[110]	validation_0-auc:0.82621
[111]	validation_0-auc:0.82620
[112]	validation_0-auc:0.82591
[113]	validation_0-auc:0.82498
[114]	validation_0-auc:0.82525
ROC AUC: 0.8413

XGBoost 모델의 하이퍼 파라미터 수정

from sklearn.model_selection import GridSearchCV

xgb_clf = XGBClassifier(n_estimators=100)

params = {'max_depth':[5, 7], 'min_child_weight':[1, 3], 'colsample_bytree':[0.5, 0.75]}

#cv는 3으로 지정
gridcv = GridSearchCV(xgb_clf, param_grid=params, cv=3)

gridcv.fit(X_train, y_train, early_stopping_rounds=30, eval_metric="auc", eval_set=[(X_test, y_test)])

print(f'GridSearchCV 최적 파라미터: {gridcv.best_params_}')

xgb_roc_score = roc_auc_score(y_test, gridcv.predict_proba(X_test)[:, 1], average='macro')

print(f'ROC AUC: {xgb_roc_score:.4f}')
C:\Users\hojun_window\anaconda3\envs\ml\lib\site-packages\xgboost\sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)


[0]	validation_0-auc:0.79321
[1]	validation_0-auc:0.81375
[2]	validation_0-auc:0.81846
[3]	validation_0-auc:0.82226
[4]	validation_0-auc:0.82677
[5]	validation_0-auc:0.83225
[6]	validation_0-auc:0.82654
[7]	validation_0-auc:0.83486
[8]	validation_0-auc:0.83682
[9]	validation_0-auc:0.83472
'''
[43]	validation_0-auc:0.84143
GridSearchCV 최적 파라미터: {'colsample_bytree': 0.5, 'max_depth': 5, 'min_child_weight': 3}
ROC AUC: 0.8445
# n_estimators는 1000으로 증가시키고, learning_rate=0.02로 감소, reg_alpha=0.03으로 추가함. 
xgb_clf = XGBClassifier(n_estimators=1000, random_state=156, learning_rate=0.02, max_depth=7,\
                        min_child_weight=1, colsample_bytree=0.75, reg_alpha=0.03)

# evaluation metric을 auc로, early stopping은 200 으로 설정하고 학습 수행. 
xgb_clf.fit(X_train, y_train, early_stopping_rounds=200, 
            eval_metric="auc",eval_set=[(X_train, y_train), (X_test, y_test)])

xgb_roc_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))
[0]	validation_0-auc:0.82311	validation_1-auc:0.815226
Multiple eval metrics have been passed: 'validation_1-auc' will be used for early stopping.

Will train until validation_1-auc hasn't improved in 200 rounds.
[1]	validation_0-auc:0.827094	validation_1-auc:0.816566
'''
Stopping. Best iteration:
[205]	validation_0-auc:0.891798	validation_1-auc:0.84558

ROC AUC: 0.8456
from xgboost import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots(1,1,figsize=(10,8))
plot_importance(xgb_clf, ax=ax , max_num_features=20,height=0.4)
<matplotlib.axes._subplots.AxesSubplot at 0x1938acb20b8>

LightGBM 모델 학습과 하이퍼 파라미터 튜닝

from lightgbm import LGBMClassifier

lgbm_clf = LGBMClassifier(n_estimators=500)

evals = [(X_test, y_test)]
lgbm_clf.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="auc", eval_set=evals,
                verbose=True)

lgbm_roc_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(lgbm_roc_score))
[1]	valid_0's auc: 0.817384	valid_0's binary_logloss: 0.165046
Training until validation scores don't improve for 100 rounds.
[2]	valid_0's auc: 0.81863	valid_0's binary_logloss: 0.16
[3]	valid_0's auc: 0.827411	valid_0's binary_logloss: 0.156287
[4]	valid_0's auc: 0.832175	valid_0's binary_logloss: 0.153416
[5]	valid_0's auc: 0.83481	valid_0's binary_logloss: 0.151206
[6]	valid_0's auc: 0.834721	valid_0's binary_logloss: 0.149303
[7]	valid_0's auc: 0.83659	valid_0's binary_logloss: 0.147804
[8]	valid_0's auc: 0.837602	valid_0's binary_logloss: 0.146466
[9]	valid_0's auc: 0.838114	valid_0's binary_logloss: 0.145476
[10]	valid_0's auc: 0.838472	valid_0's binary_logloss: 0.144681
[11]	valid_0's auc: 0.83808	valid_0's binary_logloss: 0.143978
'''
Early stopping, best iteration is:
[22]	valid_0's auc: 0.871544	valid_0's binary_logloss: 0.126309	valid_1's auc: 0.844171	valid_1's binary_logloss: 0.139253
GridSearchCV 최적 파라미터: {'max_depth': 128, 'min_child_samples': 100, 'num_leaves': 32, 'subsample': 0.8}
ROC AUC: 0.8442
lgbm_clf = LGBMClassifier(n_estimators=1000, num_leaves=32, sumbsample=0.8, min_child_samples=100,
                          max_depth=128)

evals = [(X_test, y_test)]
lgbm_clf.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="auc", eval_set=evals,
                verbose=True)

lgbm_roc_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(lgbm_roc_score))
[1]	valid_0's auc: 0.819488	valid_0's binary_logloss: 0.165016
Training until validation scores don't improve for 100 rounds.
[2]	valid_0's auc: 0.822387	valid_0's binary_logloss: 0.159711
[3]	valid_0's auc: 0.829542	valid_0's binary_logloss: 0.156068
'''
Early stopping, best iteration is:
[22]	valid_0's auc: 0.844171	valid_0's binary_logloss: 0.139253
ROC AUC: 0.8442