Notes from "Python Machine Learning Perfect Guide" (파이썬 머신러닝 완벽가이드). The focus is on how handling the data raises model performance.
Problem
- Predict whether a customer is satisfied (0 or 1) from a data set with 370 features.
Evaluation
- Evaluated with ROC-AUC.
- Most customers are satisfied and only a small fraction are unsatisfied, so ROC-AUC is a better fit than a raw accuracy number (see the sketch below).
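A minimal sketch (not from the book) of why accuracy is misleading here: a classifier that always predicts "satisfied" reaches about 96% accuracy on a 4%-positive data set, while its ROC-AUC is only 0.5.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score
# Hypothetical labels with a 4% positive (unsatisfied) rate, like this data set
y_true = np.array([0] * 96 + [1] * 4)
# A "model" that always predicts the majority class, with a constant score of 0
always_zero_pred = np.zeros_like(y_true)
always_zero_score = np.zeros_like(y_true, dtype=float)
print(accuracy_score(y_true, always_zero_pred))   # 0.96 -> looks great, says nothing
print(roc_auc_score(y_true, always_zero_score))   # 0.5  -> no discriminating power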
Workflow
- First, feed the raw data straight into LightGBM and print accuracy and roc_auc.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

raw_data = pd.read_csv('./train_santander.csv')
data = raw_data.iloc[:, :-1]
target = raw_data.iloc[:, -1]
# data.shape, target.shape

X_train, X_test, y_train, y_test = train_test_split(data, target)

ligbm = LGBMClassifier()
ligbm.fit(X_train, y_train)
preds = ligbm.predict(X_test)
probs = ligbm.predict_proba(X_test)[:, 1]

print(f'Accuracy: {accuracy_score(y_test, preds)}')
# Bug kept as it was run: the hard 0/1 predictions (preds) were passed to roc_auc_score
# instead of the probabilities (probs), which is why the AUC below sits near 0.5.
print(f'AUC: {roc_auc_score(y_test, preds)}')
'''output
Accuracy: 0.9582215206524599
AUC: 0.5016565312870864
'''
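Since the positive-class probabilities were already computed, ROC-AUC should be evaluated on probs rather than on the hard predictions. The corrected call (its value was not recorded in the original run):
# Correct usage: score the ranking quality of the predicted probabilities
print(f'AUC (from probabilities): {roc_auc_score(y_test, probs)}')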
Check for null values and the TARGET distribution
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
cust_df = pd.read_csv("./train_santander.csv",encoding='latin-1')
print('dataset shape:', cust_df.shape)
cust_df.head(3)
dataset shape: (76020, 371)
ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | ... | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 | TARGET | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 39205.17 | 0 |
1 | 3 | 2 | 34 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 49278.03 | 0 |
2 | 4 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 67333.77 | 0 |
3 rows × 371 columns
cust_df.info()  # use DataFrame.info() to check feature dtypes and null values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76020 entries, 0 to 76019
Columns: 371 entries, ID to TARGET
dtypes: float64(111), int64(260)
memory usage: 215.2 MB
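info() shows all 371 columns are numeric (int64/float64), so no label encoding is needed. To confirm explicitly that there are no nulls, a quick check (not in the original notebook):
# Total count of missing values over every column; 0 means no nulls anywhere
print(cust_df.isnull().sum().sum())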
# cust_df['TARGET'].value_counts()[0] / len(cust_df['TARGET'].value_counts()), cust_df['TARGET'].value_counts()[1] / len(cust_df['TARGET'].value_counts())
len(cust_df['TARGET'].value_counts())
2
print(cust_df['TARGET'].value_counts())
unsatisfied_cnt = cust_df[cust_df['TARGET'] == 1].TARGET.count()
total_cnt = cust_df.TARGET.count()
print('unsatisfied ratio: {0:.2f}'.format(unsatisfied_cnt / total_cnt))
0 73012
1 3008
Name: TARGET, dtype: int64
unsatisfied ratio: 0.04
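The same ratio can also be read off with pandas' normalize option; a one-line alternative to the manual count above:
# Fraction of each TARGET value; the row for 1 is the unsatisfied ratio (~0.04)
print(cust_df['TARGET'].value_counts(normalize=True))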
Check the value distribution of each feature
cust_df.describe( )
ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | ... | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 | TARGET | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | ... | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 7.602000e+04 | 76020.000000 |
mean | 75964.050723 | -1523.199277 | 33.212865 | 86.208265 | 72.363067 | 119.529632 | 3.559130 | 6.472698 | 0.412946 | 0.567352 | ... | 7.935824 | 1.365146 | 12.215580 | 8.784074 | 31.505324 | 1.858575 | 76.026165 | 56.614351 | 1.172358e+05 | 0.039569 |
std | 43781.947379 | 39033.462364 | 12.956486 | 1614.757313 | 339.315831 | 546.266294 | 93.155749 | 153.737066 | 30.604864 | 36.513513 | ... | 455.887218 | 113.959637 | 783.207399 | 538.439211 | 2013.125393 | 147.786584 | 4040.337842 | 2852.579397 | 1.826646e+05 | 0.194945 |
min | 1.000000 | -999999.000000 | 5.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.163750e+03 | 0.000000 |
25% | 38104.750000 | 2.000000 | 23.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.787061e+04 | 0.000000 |
50% | 76043.000000 | 2.000000 | 28.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.064092e+05 | 0.000000 |
75% | 113748.750000 | 2.000000 | 40.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.187563e+05 | 0.000000 |
max | 151838.000000 | 238.000000 | 105.000000 | 210000.000000 | 12888.030000 | 21024.810000 | 8237.820000 | 11073.570000 | 6600.000000 | 6600.000000 | ... | 50003.880000 | 20385.720000 | 138831.630000 | 91778.730000 | 438329.220000 | 24650.010000 | 681462.900000 | 397884.300000 | 2.203474e+07 | 1.000000 |
8 rows × 371 columns
Looking at the var3 feature, the minimum value of -999999 is an extreme outlier compared with every other value; NaN or some special exception value appears to have been encoded as -999999. Replace it with 2, the most frequently occurring value.
- The ID feature will be useless for classifying customer satisfaction, so drop it.
cust_df['var3'].value_counts()
2 74165
8 138
-999999 116
9 110
3 108
...
231 1
188 1
168 1
135 1
87 1
Name: var3, Length: 208, dtype: int64
# Replace the var3 outlier value and drop the ID feature
cust_df['var3'].replace(-999999, 2, inplace=True)
cust_df.drop('ID', axis=1, inplace=True)
# Split into the feature set and the label set. The label column is the last column of the DataFrame, so slice it off at position -1.
X_features = cust_df.iloc[:, :-1]
y_labels = cust_df.iloc[:, -1]
print('Feature data shape:{0}'.format(X_features.shape))
Feature data shape:(76020, 369)
Dividing value_counts() by the total count (Series-vs-scalar broadcasting) is an easy way to show the per-label percentage of each split.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_features, y_labels,
test_size=0.2, random_state=0)
train_cnt = y_train.count()
test_cnt = y_test.count()
print('Train set shape:{0}, test set shape:{1}'.format(X_train.shape, X_test.shape))
print('Train set label distribution')
print(y_train.value_counts()/train_cnt)
print('\nTest set label distribution')
print(y_test.value_counts()/test_cnt)
Train set shape:(60816, 369), test set shape:(15204, 369)
Train set label distribution
0    0.960964
1    0.039036
Name: TARGET, dtype: float64
Test set label distribution
0    0.9583
1    0.0417
Name: TARGET, dtype: float64
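The two splits happen to have similar label ratios here, but with only ~4% positives it is safer to force identical ratios with stratified sampling. A sketch of the alternative split (not what these notes actually ran; all results below come from the unstratified split above):
from sklearn.model_selection import train_test_split
# stratify keeps the 0/1 ratio identical in the train and test sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_features, y_labels,
                                                            test_size=0.2, random_state=0,
                                                            stratify=y_labels)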
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
# n_estimators is set to 500; random_state is fixed so every run of the example gives the same predictions.
xgb_clf = XGBClassifier(n_estimators=500, random_state=156)
# Train with AUC as the evaluation metric and the early-stopping rounds set to 100.
xgb_clf.fit(X_train, y_train, early_stopping_rounds=100,
eval_metric="auc", eval_set=[(X_test, y_test)])
xgb_roc_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))
C:\Users\hojun_window\anaconda3\envs\ml\lib\site-packages\xgboost\sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
warnings.warn(label_encoder_deprecation_msg, UserWarning)
[0] validation_0-auc:0.81157
[1] validation_0-auc:0.82452
[2] validation_0-auc:0.82746
[3] validation_0-auc:0.82922
[4] validation_0-auc:0.83298
[5] validation_0-auc:0.83500
[6] validation_0-auc:0.83653
[7] validation_0-auc:0.83782
[8] validation_0-auc:0.83802
[9] validation_0-auc:0.83914
[10] validation_0-auc:0.83954
[11] validation_0-auc:0.83983
[12] validation_0-auc:0.84033
[13] validation_0-auc:0.84054
[14] validation_0-auc:0.84135
[15] validation_0-auc:0.84117
[16] validation_0-auc:0.84101
[17] validation_0-auc:0.84071
[18] validation_0-auc:0.84052
[19] validation_0-auc:0.84023
[20] validation_0-auc:0.84012
[21] validation_0-auc:0.84022
[22] validation_0-auc:0.84007
[23] validation_0-auc:0.84009
[24] validation_0-auc:0.83974
[25] validation_0-auc:0.84015
[26] validation_0-auc:0.84101
[27] validation_0-auc:0.84088
[28] validation_0-auc:0.84074
[29] validation_0-auc:0.83999
[30] validation_0-auc:0.83959
[31] validation_0-auc:0.83952
[32] validation_0-auc:0.83901
[33] validation_0-auc:0.83885
[34] validation_0-auc:0.83887
[35] validation_0-auc:0.83864
[36] validation_0-auc:0.83834
[37] validation_0-auc:0.83810
[38] validation_0-auc:0.83810
[39] validation_0-auc:0.83813
[40] validation_0-auc:0.83776
[41] validation_0-auc:0.83720
[42] validation_0-auc:0.83684
[43] validation_0-auc:0.83672
[44] validation_0-auc:0.83674
[45] validation_0-auc:0.83693
[46] validation_0-auc:0.83686
[47] validation_0-auc:0.83678
[48] validation_0-auc:0.83694
[49] validation_0-auc:0.83676
[50] validation_0-auc:0.83655
[51] validation_0-auc:0.83669
[52] validation_0-auc:0.83641
[53] validation_0-auc:0.83690
[54] validation_0-auc:0.83693
[55] validation_0-auc:0.83681
[56] validation_0-auc:0.83680
[57] validation_0-auc:0.83667
[58] validation_0-auc:0.83664
[59] validation_0-auc:0.83591
[60] validation_0-auc:0.83576
[61] validation_0-auc:0.83534
[62] validation_0-auc:0.83513
[63] validation_0-auc:0.83510
[64] validation_0-auc:0.83508
[65] validation_0-auc:0.83518
[66] validation_0-auc:0.83510
[67] validation_0-auc:0.83523
[68] validation_0-auc:0.83457
[69] validation_0-auc:0.83460
[70] validation_0-auc:0.83446
[71] validation_0-auc:0.83462
[72] validation_0-auc:0.83394
[73] validation_0-auc:0.83410
[74] validation_0-auc:0.83394
[75] validation_0-auc:0.83368
[76] validation_0-auc:0.83413
[77] validation_0-auc:0.83359
[78] validation_0-auc:0.83353
[79] validation_0-auc:0.83293
[80] validation_0-auc:0.83253
[81] validation_0-auc:0.83187
[82] validation_0-auc:0.83230
[83] validation_0-auc:0.83216
[84] validation_0-auc:0.83206
[85] validation_0-auc:0.83196
[86] validation_0-auc:0.83200
[87] validation_0-auc:0.83208
[88] validation_0-auc:0.83174
[89] validation_0-auc:0.83160
[90] validation_0-auc:0.83155
[91] validation_0-auc:0.83165
[92] validation_0-auc:0.83172
[93] validation_0-auc:0.83160
[94] validation_0-auc:0.83150
[95] validation_0-auc:0.83132
[96] validation_0-auc:0.83090
[97] validation_0-auc:0.83091
[98] validation_0-auc:0.83066
[99] validation_0-auc:0.83058
[100] validation_0-auc:0.83029
[101] validation_0-auc:0.82955
[102] validation_0-auc:0.82962
[103] validation_0-auc:0.82893
[104] validation_0-auc:0.82837
[105] validation_0-auc:0.82815
[106] validation_0-auc:0.82744
[107] validation_0-auc:0.82728
[108] validation_0-auc:0.82651
[109] validation_0-auc:0.82650
[110] validation_0-auc:0.82621
[111] validation_0-auc:0.82620
[112] validation_0-auc:0.82591
[113] validation_0-auc:0.82498
[114] validation_0-auc:0.82525
ROC AUC: 0.8413
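The UserWarning printed above can be silenced by following its own instructions. A minimal sketch, assuming an XGBoost 1.x release where the flag still exists (the TARGET labels are already 0/1 integers, so nothing else changes):
# Disable the deprecated internal label encoder, as the warning message suggests
xgb_clf = XGBClassifier(n_estimators=500, random_state=156, use_label_encoder=False)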
Tuning the XGBoost hyperparameters
from sklearn.model_selection import GridSearchCV
xgb_clf = XGBClassifier(n_estimators=100)
params = {'max_depth':[5, 7], 'min_child_weight':[1, 3], 'colsample_bytree':[0.5, 0.75]}
# cv is set to 3
gridcv = GridSearchCV(xgb_clf, param_grid=params, cv=3)
gridcv.fit(X_train, y_train, early_stopping_rounds=30, eval_metric="auc", eval_set=[(X_test, y_test)])
print(f'GridSearchCV best parameters: {gridcv.best_params_}')
xgb_roc_score = roc_auc_score(y_test, gridcv.predict_proba(X_test)[:, 1], average='macro')
print(f'ROC AUC: {xgb_roc_score:.4f}')
C:\Users\hojun_window\anaconda3\envs\ml\lib\site-packages\xgboost\sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
warnings.warn(label_encoder_deprecation_msg, UserWarning)
[0] validation_0-auc:0.79321
[1] validation_0-auc:0.81375
[2] validation_0-auc:0.81846
[3] validation_0-auc:0.82226
[4] validation_0-auc:0.82677
[5] validation_0-auc:0.83225
[6] validation_0-auc:0.82654
[7] validation_0-auc:0.83486
[8] validation_0-auc:0.83682
[9] validation_0-auc:0.83472
... (output truncated)
[43] validation_0-auc:0.84143
GridSearchCV best parameters: {'colsample_bytree': 0.5, 'max_depth': 5, 'min_child_weight': 3}
ROC AUC: 0.8445
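To see how each of the eight parameter combinations scored under cv=3, the grid-search results can be inspected. A small sketch, not in the original notes (note that gridcv used the default scorer, i.e. accuracy, because no scoring argument was passed):
import pandas as pd
# Each row of cv_results_ is one parameter combination; rank_test_score == 1 is the winner
cv_results = pd.DataFrame(gridcv.cv_results_)
print(cv_results[['params', 'mean_test_score', 'rank_test_score']])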
# Increase n_estimators to 1000, lower learning_rate to 0.02, and add reg_alpha=0.03.
xgb_clf = XGBClassifier(n_estimators=1000, random_state=156, learning_rate=0.02, max_depth=7,
                        min_child_weight=1, colsample_bytree=0.75, reg_alpha=0.03)
# Train with AUC as the evaluation metric and early stopping set to 200 rounds.
xgb_clf.fit(X_train, y_train, early_stopping_rounds=200,
eval_metric="auc",eval_set=[(X_train, y_train), (X_test, y_test)])
xgb_roc_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(xgb_roc_score))
[0] validation_0-auc:0.82311 validation_1-auc:0.815226
Multiple eval metrics have been passed: 'validation_1-auc' will be used for early stopping.
Will train until validation_1-auc hasn't improved in 200 rounds.
[1] validation_0-auc:0.827094 validation_1-auc:0.816566
... (output truncated)
Stopping. Best iteration:
[205] validation_0-auc:0.891798 validation_1-auc:0.84558
ROC AUC: 0.8456
from xgboost import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(1,1,figsize=(10,8))
plot_importance(xgb_clf, ax=ax , max_num_features=20,height=0.4)
[Feature importance plot: the 20 most important features of the tuned XGBoost model]
LightGBM model training and hyperparameter tuning
from lightgbm import LGBMClassifier
lgbm_clf = LGBMClassifier(n_estimators=500)
evals = [(X_test, y_test)]
lgbm_clf.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="auc", eval_set=evals,
verbose=True)
lgbm_roc_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(lgbm_roc_score))
[1] valid_0's auc: 0.817384 valid_0's binary_logloss: 0.165046
Training until validation scores don't improve for 100 rounds.
[2] valid_0's auc: 0.81863 valid_0's binary_logloss: 0.16
[3] valid_0's auc: 0.827411 valid_0's binary_logloss: 0.156287
[4] valid_0's auc: 0.832175 valid_0's binary_logloss: 0.153416
[5] valid_0's auc: 0.83481 valid_0's binary_logloss: 0.151206
[6] valid_0's auc: 0.834721 valid_0's binary_logloss: 0.149303
[7] valid_0's auc: 0.83659 valid_0's binary_logloss: 0.147804
[8] valid_0's auc: 0.837602 valid_0's binary_logloss: 0.146466
[9] valid_0's auc: 0.838114 valid_0's binary_logloss: 0.145476
[10] valid_0's auc: 0.838472 valid_0's binary_logloss: 0.144681
[11] valid_0's auc: 0.83808 valid_0's binary_logloss: 0.143978
... (output truncated)
Early stopping, best iteration is:
[22] valid_0's auc: 0.871544 valid_0's binary_logloss: 0.126309 valid_1's auc: 0.844171 valid_1's binary_logloss: 0.139253
The best-parameter line below comes from a GridSearchCV tuning run over the LightGBM model whose code cell is not shown in these notes (a sketch of it follows this output):
GridSearchCV best parameters: {'max_depth': 128, 'min_child_samples': 100, 'num_leaves': 32, 'subsample': 0.8}
ROC AUC: 0.8442
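A minimal sketch of what that missing tuning cell likely looked like, reconstructed only from the parameter names and best values reported above; the non-winning grid values and n_estimators=200 are assumptions:
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMClassifier

lgbm_clf = LGBMClassifier(n_estimators=200)
# Grid covering the reported best values; the alternative values are guesses
params = {'num_leaves': [32, 64],
          'max_depth': [128, 160],
          'min_child_samples': [60, 100],
          'subsample': [0.8, 1]}
gridcv = GridSearchCV(lgbm_clf, param_grid=params, cv=3)
gridcv.fit(X_train, y_train, early_stopping_rounds=30, eval_metric="auc",
           eval_set=[(X_train, y_train), (X_test, y_test)])
print('GridSearchCV best parameters:', gridcv.best_params_)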
# Apply the best parameters found by the grid search, with n_estimators raised to 1000
lgbm_clf = LGBMClassifier(n_estimators=1000, num_leaves=32, subsample=0.8, min_child_samples=100,
                          max_depth=128)
evals = [(X_test, y_test)]
lgbm_clf.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="auc", eval_set=evals,
verbose=True)
lgbm_roc_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:,1],average='macro')
print('ROC AUC: {0:.4f}'.format(lgbm_roc_score))
[1] valid_0's auc: 0.819488 valid_0's binary_logloss: 0.165016
Training until validation scores don't improve for 100 rounds.
[2] valid_0's auc: 0.822387 valid_0's binary_logloss: 0.159711
[3] valid_0's auc: 0.829542 valid_0's binary_logloss: 0.156068
... (output truncated)
Early stopping, best iteration is:
[22] valid_0's auc: 0.844171 valid_0's binary_logloss: 0.139253
ROC AUC: 0.8442
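For symmetry with the XGBoost section, the feature importances of the tuned LightGBM model can be plotted the same way. A minimal sketch using lightgbm's own plot_importance (not part of the original notes):
from lightgbm import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
# Plot the 20 most important features of the tuned LightGBM model
plot_importance(lgbm_clf, ax=ax, max_num_features=20, height=0.4)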