Machine Learning (11) - Ensemble / Boosting
Classification
GBM(Gradient Boosting Machine)
A boosting algorithm trains multiple weak learners sequentially, predicting with each in turn and giving higher weights to the samples the previous learner misclassified, so errors are corrected step by step.
Because the learners cannot be trained in parallel, training takes a long time.
The weight updates are performed with gradient descent.
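To make the idea concrete, here is a minimal from-scratch sketch (illustrative only, not how any library implements it): with squared-error loss the negative gradient is simply the residual, so each new weak learner is fit to the residuals of the current ensemble, and the prediction moves a small step (the learning rate) in that direction.
# Minimal gradient-boosting sketch (assumes squared-error loss,
# where the negative gradient equals the residual)
import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)
learning_rate = 0.1
pred = np.zeros_like(y_demo)  # start from a zero prediction
for _ in range(100):
    residual = y_demo - pred  # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residual)
    pred += learning_rate * tree.predict(X_demo)  # small step toward the residuals
print(f'train MSE: {np.mean((y_demo - pred) ** 2):.4f}')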
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import time
import warnings
warnings.filterwarnings('ignore')
# Reuse the helper functions written earlier
def get_new_df(old_df):
    # cumcount numbers duplicate values within each group
    dup_df = pd.DataFrame(data=old_df.groupby('column_name').cumcount(), columns=['dup_cnt'])
    dup_df = dup_df.reset_index()  # turn the existing index into a column
    # merge on the column added by reset_index
    new_df = pd.merge(old_df.reset_index(), dup_df, how='outer')
    # append the duplicate count to repeated column names
    new_df['column_name'] = new_df[['column_name', 'dup_cnt']].apply(lambda x: x[0]+'_'+str(x[1]) if x[1] > 0 else x[0], axis=1)
    new_df.drop(columns=['index'], inplace=True)
    return new_df
def get_human_dataset():
    feature_name_df = pd.read_csv('human_activity/features.txt', sep='\s+',
                                  header=None, names=['column_index', 'column_name'])
    name_df = get_new_df(feature_name_df)
    feature_name = name_df.iloc[:, 1].values.tolist()
    X_train = pd.read_csv('human_activity/train/X_train.txt', sep='\s+', names=feature_name)
    X_test = pd.read_csv('human_activity/test/X_test.txt', sep='\s+', names=feature_name)
    y_train = pd.read_csv('human_activity/train/y_train.txt', sep='\s+', names=['action'])
    y_test = pd.read_csv('human_activity/test/y_test.txt', sep='\s+', names=['action'])
    return X_train, y_train, X_test, y_test
X_train, y_train, X_test, y_test = get_human_dataset()
# Record the start time to measure how long GBM takes
start_time = time.time()
gb_clf = GradientBoostingClassifier(random_state=0) # learning_rate: step size of each boosting update
gb_clf.fit(X_train, y_train)
pred = gb_clf.predict(X_test)
print(f'GBM accuracy: {accuracy_score(y_test, pred):.4f}')
print(f'GBM elapsed time: {time.time() - start_time:.1f} sec')
GBM accuracy: 0.9389
GBM elapsed time: 480.1 sec
GBM hyperparameters
- loss: the loss function that gradient descent minimizes.
- learning_rate: the learning rate, a value between 0 and 1; the default is 0.1.
- n_estimators: the number of weak learners. More learners can improve performance up to a point, at the cost of longer training; the default is 100.
- subsample: the fraction of the training data each weak learner samples; the default is 1. (A grid-search sketch over these parameters follows.)
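As a hedged illustration of tuning these parameters together, a tiny GridSearchCV sketch over the estimator above (the grid values are arbitrary, and since GBM is slow the actual fit is left commented out):
from sklearn.model_selection import GridSearchCV
gbm_params = {
    'n_estimators': [100, 300],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8, 1.0]
}
grid_cv = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid=gbm_params, cv=2, verbose=1)
# grid_cv.fit(X_train, y_train)  # very slow on this dataset (hours with the grid above)
# print('best params:', grid_cv.best_params_, 'best CV score:', grid_cv.best_score_)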
XGBoost (eXtreme Gradient Boosting)
XGBoost is based on GBM but addresses GBM's main weaknesses, such as slow training and the lack of overfitting regularization.
It supports parallel training on multi-core CPUs, so it finishes training faster than classic GBM.
- Key advantages of XGBoost
Excellent predictive performance
Faster execution than GBM
Overfitting regularization
Tree pruning
Built-in cross validation (see the xgb.cv sketch below)
Native handling of missing values
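For instance, the built-in cross validation is exposed through xgb.cv; a brief, hedged sketch (parameter values are illustrative) using the breast cancer data that also appears later in this post:
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
dtrain_cv = xgb.DMatrix(cancer.data, label=cancer.target)
# runs 5-fold CV internally and returns the per-round evaluation history,
# so early stopping works without a manual validation split
cv_hist = xgb.cv({'objective': 'binary:logistic', 'eval_metric': 'logloss', 'max_depth': 3},
                 dtrain_cv, num_boost_round=200, nfold=5, early_stopping_rounds=20)
print(cv_hist.tail(1))  # mean/std of train and test logloss at the best round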
import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
xgb.__version__
'1.5.0'
Python wrapper XGBoost hyperparameters
Key general parameters
- booster: choose gbtree (tree-based model) or gblinear (linear model); the default is gbtree
- silent: set to 1 to suppress output messages; the default is 0
- nthread: the number of CPU threads to use; defaults to all threads
Key booster parameters
- eta [default=0.3]: the learning rate
- num_boost_rounds: corresponds to n_estimators in GBM
- min_child_weight [default=1]: the minimum total instance weight a node needs before it can be split further
- gamma [default=0]: the minimum loss reduction required to split a leaf node further
- max_depth [default=6]
- subsample [default=1]: same role as subsample in GBM
- colsample_bytree [default=1]: same role as max_features in GBM
- lambda [default=1]: L2 regularization weight (squared terms)
- alpha [default=0]: L1 regularization weight (absolute values)
- scale_pos_weight [default=1]: balances datasets whose classes are skewed and imbalanced
Learning task parameters
- objective: defines the loss function to minimize
- eval_metric: defines the metric used for validation
When overfitting is severe, consider the following (combined in the sketch after this list):
- Lower eta (0.01 to 0.1), and raise num_round (or n_estimators) to compensate
- Lower max_depth
- Raise min_child_weight
- Raise gamma
- Adjust subsample and colsample_bytree
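A hedged sketch of how those adjustments might look together in a parameter dictionary (the values are illustrative only, not tuned):
params_regularized = {
    'objective': 'binary:logistic',
    'eta': 0.05,             # lower learning rate...
    'max_depth': 4,          # shallower trees
    'min_child_weight': 3,   # demand more evidence before splitting
    'gamma': 0.1,            # require a minimum loss reduction per split
    'subsample': 0.8,        # sample 80% of rows per tree
    'colsample_bytree': 0.8  # sample 80% of features per tree
}
# ...while raising the number of boosting rounds to compensate for the smaller eta,
# e.g. xgb.train(params_regularized, dtrain, num_boost_round=1000, ...)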
Applying the Python wrapper XGBoost - Wisconsin breast cancer prediction
dataset = load_breast_cancer(as_frame=True)
dataset.data.head(2)
(index) | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
2 rows × 30 columns
dataset.target_names # two classes: malignant and benign
array(['malignant', 'benign'], dtype='<U9')
dataset.target.value_counts()
1 357
0 212
Name: target, dtype: int64
# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2, random_state=156)
# split the training data again: 90% train, 10% validation
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=156)
X_train.shape, X_test.shape
((455, 30), (114, 30))
X_tr.shape, X_val.shape
((409, 30), (46, 30))
y_train.value_counts()
1 280
0 175
Name: target, dtype: int64
The Python wrapper uses XGBoost's own data container, DMatrix. Datasets held as numpy arrays or pandas DataFrames must therefore all be converted to DMatrix objects before being passed to the model.
# Create DMatrix objects for the train, validation, and test sets
dtr = xgb.DMatrix(data=X_tr, label=y_tr)
dval = xgb.DMatrix(data=X_val, label=y_val)
dtest = xgb.DMatrix(data=X_test, label=y_test)
Hyperparameters must be passed as a dictionary.
params = {
'max_depth':3,
'eta':0.05,
'objective':'binary:logistic', # binary classification with logistic loss
'eval_metric':'logloss' # evaluate with log loss
}
num_rounds = 500 # number of boosting rounds (the n_estimators equivalent)
# Label the training set 'train' and the validation set 'eval'
eval_list = [(dtr, 'train'), (dval, 'eval')]
# Pass the hyperparameters to the train function of the xgb module
model = xgb.train(params, dtr, num_rounds, eval_list, early_stopping_rounds=50) # positional arguments in order, so names can be omitted
[0] train-logloss:0.65016 eval-logloss:0.66183
[1] train-logloss:0.61131 eval-logloss:0.63609
[2] train-logloss:0.57563 eval-logloss:0.61144
[3] train-logloss:0.54310 eval-logloss:0.59204
[4] train-logloss:0.51323 eval-logloss:0.57329
[5] train-logloss:0.48447 eval-logloss:0.55037
[6] train-logloss:0.45796 eval-logloss:0.52929
[7] train-logloss:0.43436 eval-logloss:0.51534
[8] train-logloss:0.41150 eval-logloss:0.49718
[9] train-logloss:0.39027 eval-logloss:0.48154
[10] train-logloss:0.37128 eval-logloss:0.46990
[11] train-logloss:0.35254 eval-logloss:0.45474
[12] train-logloss:0.33528 eval-logloss:0.44229
[13] train-logloss:0.31893 eval-logloss:0.42961
[14] train-logloss:0.30439 eval-logloss:0.42065
[15] train-logloss:0.29000 eval-logloss:0.40958
[16] train-logloss:0.27651 eval-logloss:0.39887
[17] train-logloss:0.26389 eval-logloss:0.39050
[18] train-logloss:0.25210 eval-logloss:0.38254
[19] train-logloss:0.24123 eval-logloss:0.37393
[20] train-logloss:0.23076 eval-logloss:0.36789
[21] train-logloss:0.22091 eval-logloss:0.36017
[22] train-logloss:0.21155 eval-logloss:0.35421
[23] train-logloss:0.20263 eval-logloss:0.34683
[24] train-logloss:0.19434 eval-logloss:0.34111
[25] train-logloss:0.18637 eval-logloss:0.33634
[26] train-logloss:0.17875 eval-logloss:0.33082
[27] train-logloss:0.17167 eval-logloss:0.32675
[28] train-logloss:0.16481 eval-logloss:0.32099
[29] train-logloss:0.15835 eval-logloss:0.31671
[30] train-logloss:0.15225 eval-logloss:0.31277
[31] train-logloss:0.14650 eval-logloss:0.30882
[32] train-logloss:0.14102 eval-logloss:0.30437
[33] train-logloss:0.13590 eval-logloss:0.30103
[34] train-logloss:0.13109 eval-logloss:0.29794
[35] train-logloss:0.12647 eval-logloss:0.29499
[36] train-logloss:0.12197 eval-logloss:0.29295
[37] train-logloss:0.11784 eval-logloss:0.29043
[38] train-logloss:0.11379 eval-logloss:0.28927
[39] train-logloss:0.10994 eval-logloss:0.28578
[40] train-logloss:0.10638 eval-logloss:0.28364
[41] train-logloss:0.10302 eval-logloss:0.28183
[42] train-logloss:0.09963 eval-logloss:0.28005
[43] train-logloss:0.09649 eval-logloss:0.27972
[44] train-logloss:0.09359 eval-logloss:0.27744
[45] train-logloss:0.09080 eval-logloss:0.27542
[46] train-logloss:0.08807 eval-logloss:0.27504
[47] train-logloss:0.08541 eval-logloss:0.27458
[48] train-logloss:0.08299 eval-logloss:0.27348
[49] train-logloss:0.08035 eval-logloss:0.27247
[50] train-logloss:0.07786 eval-logloss:0.27163
[51] train-logloss:0.07550 eval-logloss:0.27094
[52] train-logloss:0.07344 eval-logloss:0.26967
[53] train-logloss:0.07147 eval-logloss:0.27008
[54] train-logloss:0.06964 eval-logloss:0.26890
[55] train-logloss:0.06766 eval-logloss:0.26854
[56] train-logloss:0.06592 eval-logloss:0.26900
[57] train-logloss:0.06433 eval-logloss:0.26790
[58] train-logloss:0.06259 eval-logloss:0.26663
[59] train-logloss:0.06107 eval-logloss:0.26743
[60] train-logloss:0.05957 eval-logloss:0.26610
[61] train-logloss:0.05817 eval-logloss:0.26644
[62] train-logloss:0.05691 eval-logloss:0.26673
[63] train-logloss:0.05550 eval-logloss:0.26550
[64] train-logloss:0.05422 eval-logloss:0.26443
[65] train-logloss:0.05311 eval-logloss:0.26500
[66] train-logloss:0.05207 eval-logloss:0.26591
[67] train-logloss:0.05093 eval-logloss:0.26501
[68] train-logloss:0.04976 eval-logloss:0.26435
[69] train-logloss:0.04872 eval-logloss:0.26360
[70] train-logloss:0.04776 eval-logloss:0.26319
[71] train-logloss:0.04680 eval-logloss:0.26255
[72] train-logloss:0.04580 eval-logloss:0.26204
[73] train-logloss:0.04484 eval-logloss:0.26254
[74] train-logloss:0.04388 eval-logloss:0.26289
[75] train-logloss:0.04309 eval-logloss:0.26249
[76] train-logloss:0.04224 eval-logloss:0.26217
[77] train-logloss:0.04133 eval-logloss:0.26166
[78] train-logloss:0.04050 eval-logloss:0.26179
[79] train-logloss:0.03967 eval-logloss:0.26103
[80] train-logloss:0.03877 eval-logloss:0.26094
[81] train-logloss:0.03806 eval-logloss:0.26148
[82] train-logloss:0.03740 eval-logloss:0.26054
[83] train-logloss:0.03676 eval-logloss:0.25967
[84] train-logloss:0.03605 eval-logloss:0.25905
[85] train-logloss:0.03545 eval-logloss:0.26007
[86] train-logloss:0.03488 eval-logloss:0.25984
[87] train-logloss:0.03425 eval-logloss:0.25933
[88] train-logloss:0.03361 eval-logloss:0.25932
[89] train-logloss:0.03311 eval-logloss:0.26002
[90] train-logloss:0.03260 eval-logloss:0.25936
[91] train-logloss:0.03202 eval-logloss:0.25886
[92] train-logloss:0.03152 eval-logloss:0.25918
[93] train-logloss:0.03107 eval-logloss:0.25865
[94] train-logloss:0.03049 eval-logloss:0.25951
[95] train-logloss:0.03007 eval-logloss:0.26091
[96] train-logloss:0.02963 eval-logloss:0.26014
[97] train-logloss:0.02913 eval-logloss:0.25974
[98] train-logloss:0.02866 eval-logloss:0.25937
[99] train-logloss:0.02829 eval-logloss:0.25893
[100] train-logloss:0.02789 eval-logloss:0.25928
[101] train-logloss:0.02751 eval-logloss:0.25955
[102] train-logloss:0.02714 eval-logloss:0.25901
[103] train-logloss:0.02668 eval-logloss:0.25991
[104] train-logloss:0.02634 eval-logloss:0.25950
[105] train-logloss:0.02594 eval-logloss:0.25924
[106] train-logloss:0.02556 eval-logloss:0.25901
[107] train-logloss:0.02522 eval-logloss:0.25738
[108] train-logloss:0.02492 eval-logloss:0.25702
[109] train-logloss:0.02453 eval-logloss:0.25789
[110] train-logloss:0.02418 eval-logloss:0.25770
[111] train-logloss:0.02384 eval-logloss:0.25842
[112] train-logloss:0.02356 eval-logloss:0.25810
[113] train-logloss:0.02322 eval-logloss:0.25848
[114] train-logloss:0.02290 eval-logloss:0.25833
[115] train-logloss:0.02260 eval-logloss:0.25820
[116] train-logloss:0.02229 eval-logloss:0.25905
[117] train-logloss:0.02204 eval-logloss:0.25878
[118] train-logloss:0.02176 eval-logloss:0.25728
[119] train-logloss:0.02149 eval-logloss:0.25722
[120] train-logloss:0.02119 eval-logloss:0.25764
[121] train-logloss:0.02095 eval-logloss:0.25761
[122] train-logloss:0.02067 eval-logloss:0.25832
[123] train-logloss:0.02045 eval-logloss:0.25808
[124] train-logloss:0.02023 eval-logloss:0.25855
[125] train-logloss:0.01998 eval-logloss:0.25714
[126] train-logloss:0.01973 eval-logloss:0.25587
[127] train-logloss:0.01946 eval-logloss:0.25640
[128] train-logloss:0.01927 eval-logloss:0.25685
[129] train-logloss:0.01908 eval-logloss:0.25665
[130] train-logloss:0.01886 eval-logloss:0.25712
[131] train-logloss:0.01863 eval-logloss:0.25609
[132] train-logloss:0.01839 eval-logloss:0.25649
[133] train-logloss:0.01816 eval-logloss:0.25789
[134] train-logloss:0.01802 eval-logloss:0.25811
[135] train-logloss:0.01785 eval-logloss:0.25794
[136] train-logloss:0.01763 eval-logloss:0.25876
[137] train-logloss:0.01748 eval-logloss:0.25884
[138] train-logloss:0.01732 eval-logloss:0.25867
[139] train-logloss:0.01719 eval-logloss:0.25876
[140] train-logloss:0.01696 eval-logloss:0.25987
[141] train-logloss:0.01681 eval-logloss:0.25960
[142] train-logloss:0.01669 eval-logloss:0.25982
[143] train-logloss:0.01656 eval-logloss:0.25992
[144] train-logloss:0.01638 eval-logloss:0.26035
[145] train-logloss:0.01623 eval-logloss:0.26055
[146] train-logloss:0.01606 eval-logloss:0.26092
[147] train-logloss:0.01589 eval-logloss:0.26137
[148] train-logloss:0.01572 eval-logloss:0.25999
[149] train-logloss:0.01557 eval-logloss:0.26028
[150] train-logloss:0.01546 eval-logloss:0.26048
[151] train-logloss:0.01531 eval-logloss:0.26142
[152] train-logloss:0.01515 eval-logloss:0.26188
[153] train-logloss:0.01501 eval-logloss:0.26227
[154] train-logloss:0.01486 eval-logloss:0.26287
[155] train-logloss:0.01476 eval-logloss:0.26299
[156] train-logloss:0.01461 eval-logloss:0.26346
[157] train-logloss:0.01448 eval-logloss:0.26379
[158] train-logloss:0.01434 eval-logloss:0.26306
[159] train-logloss:0.01424 eval-logloss:0.26237
[160] train-logloss:0.01410 eval-logloss:0.26251
[161] train-logloss:0.01401 eval-logloss:0.26265
[162] train-logloss:0.01392 eval-logloss:0.26264
[163] train-logloss:0.01380 eval-logloss:0.26250
[164] train-logloss:0.01372 eval-logloss:0.26264
[165] train-logloss:0.01359 eval-logloss:0.26255
[166] train-logloss:0.01350 eval-logloss:0.26188
[167] train-logloss:0.01342 eval-logloss:0.26203
[168] train-logloss:0.01331 eval-logloss:0.26190
[169] train-logloss:0.01319 eval-logloss:0.26184
[170] train-logloss:0.01312 eval-logloss:0.26133
[171] train-logloss:0.01304 eval-logloss:0.26148
[172] train-logloss:0.01297 eval-logloss:0.26157
[173] train-logloss:0.01285 eval-logloss:0.26253
[174] train-logloss:0.01278 eval-logloss:0.26229
[175] train-logloss:0.01267 eval-logloss:0.26086
Instead of completing the specified 500 rounds, training stopped early because the eval logloss did not improve for the 50 rounds set by early_stopping_rounds.
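After early stopping, the returned Booster keeps track of the best round; a small hedged sketch (attribute names follow the xgboost 1.x API, where best_iteration and best_score are set when early stopping is used):
print('best iteration:', model.best_iteration, 'best eval logloss:', model.best_score)
# prediction can also be restricted to the best rounds:
pred_probs_best = model.predict(dtest, iteration_range=(0, model.best_iteration + 1))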
pred_probs = model.predict(dtest)
# XGBoost's predict() returns probabilities, not class labels.
np.round(pred_probs[:10], 3)
array([0.845, 0.008, 0.68 , 0.081, 0.975, 0.999, 0.998, 0.998, 0.996,
0.001], dtype=float32)
# Convert probabilities to class labels
pred = [1 if x > 0.5 else 0 for x in pred_probs]
def get_clf_eval(y_test, pred, pred_proba_1):
    from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score, roc_auc_score
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    auc = roc_auc_score(y_test, pred_proba_1)
    print('== Confusion matrix ==')
    print(confusion)
    print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}, AUC: {auc:.4f}")
# Evaluate the predictions
get_clf_eval(y_test, pred, pred_probs)
== Confusion matrix ==
[[34 3]
[ 2 75]]
Accuracy: 0.9561, Precision: 0.9615, Recall: 0.9740, F1: 0.9677, AUC: 0.9937
- Visualizing feature importance
plot_importance(model)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>
xgb.to_graphviz(model)
Scikit-learn wrapper XGBoost
from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=3, eval_metric='logloss')
model.fit(X_train, y_train, verbose=True)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
eval_metric='logloss', gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.05, max_delta_step=0,
max_depth=3, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=500, n_jobs=8,
num_parallel_tree=1, predictor='auto', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
pred = model.predict(X_test) # unlike the Python wrapper, this returns class labels
pred_proba = model.predict_proba(X_test)[:, 1]
# Evaluate the predictions
get_clf_eval(y_test, pred, pred_proba)
== Confusion matrix ==
[[34 3]
[ 1 76]]
Accuracy: 0.9649, Precision: 0.9620, Recall: 0.9870, F1: 0.9744, AUC: 0.9951
The usage is identical to the other scikit-learn models used so far!
# Train under the same conditions as the Python wrapper
model = XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=3)
evals = [(X_tr, y_tr), (X_val, y_val)]
model.fit(X_tr, y_tr, early_stopping_rounds=50, eval_metric='logloss', eval_set=evals, verbose=False)
pred = model.predict(X_test)
pred_proba = model.predict_proba(X_test)[:, 1]
get_clf_eval(y_test, pred, pred_proba)
== Confusion matrix ==
[[34 3]
[ 2 75]]
Accuracy: 0.9561, Precision: 0.9615, Recall: 0.9740, F1: 0.9677, AUC: 0.9933
xgb.to_graphviz(model)
If the early-stopping window is set too small, training can terminate before the model has learned enough.
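For example, a hedged sketch with a much smaller window (the exact scores will vary, but they are typically a little worse than with early_stopping_rounds=50):
model_es10 = XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=3)
model_es10.fit(X_tr, y_tr, early_stopping_rounds=10, eval_metric='logloss',
               eval_set=evals, verbose=False)
# evaluate with the same helper as above
get_clf_eval(y_test, model_es10.predict(X_test), model_es10.predict_proba(X_test)[:, 1])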
LightGBM
LightGBM trains far more quickly than XGBoost.
It tends to overfit when applied to small datasets (vulnerable to overfitting).
Unlike the level-wise splitting of typical GBM-family trees, it grows trees leaf-wise.
Advantages of LightGBM over XGBoost
- Faster training and prediction
- Lower memory usage
- Automatic conversion and optimal splitting of categorical features
LightGBM hyperparameters
Because of the leaf-wise tree growth, max_depth should be set deeper than for other GBMs.
Key parameters
- num_iterations [default=100]: the number of trees to build
- learning_rate [default=0.1]: the learning rate
- max_depth [default=-1]: values below 0 mean no depth limit
- min_data_in_leaf [default=20]: same role as min_samples_leaf in DecisionTree
- num_leaves [default=31]: the maximum number of leaves a single tree may have
- boosting [default=gbdt]: the algorithm that builds the boosted trees; gbdt is the ordinary gradient boosting decision tree, rf is random forest
- bagging_fraction [default=1.0]: the row-sampling ratio; subsample in XGBClassifier
- feature_fraction [default=1.0]: the fraction of features randomly selected when training each tree; max_features in GBM, colsample_bytree in XGBClassifier
- lambda_l2 [default=0.0]: controls L2 regularization (squared terms); reg_lambda in XGBClassifier
- lambda_l1 [default=0.0]: controls L1 regularization (absolute values); reg_alpha in XGBClassifier
Learning task parameters
- objective: the loss function to minimize (the loss measures the gap between actual and predicted values); it is set according to whether the task is regression, multi-class classification, or binary classification
Hyperparameter tuning approach
The basic approach is to reduce model complexity around num_leaves, adjusting min_child_samples (min_data_in_leaf) and max_depth along with it, as sketched below.
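A hedged sketch of that tuning direction (the values are illustrative, not tuned; the estimator is shown but not fitted here):
from lightgbm import LGBMClassifier
lgbm_tuned = LGBMClassifier(n_estimators=400, learning_rate=0.05,
                            num_leaves=32,         # the main lever for model complexity
                            min_child_samples=25,  # scikit-learn name for min_data_in_leaf
                            max_depth=8)           # cap depth to restrain leaf-wise growth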
Hyperparameter comparison: Python wrapper LightGBM vs. scikit-learn wrapper LightGBM and XGBoost
Python wrapper LightGBM | Scikit-learn wrapper LightGBM | Scikit-learn wrapper XGBoost
---|---|---
num_iterations | n_estimators | n_estimators
learning_rate | learning_rate | learning_rate
max_depth | max_depth | max_depth
min_data_in_leaf | min_child_samples | N/A
bagging_fraction | subsample | subsample
feature_fraction | colsample_bytree | colsample_bytree
lambda_l2 | reg_lambda | reg_lambda
lambda_l1 | reg_alpha | reg_alpha
early_stopping_round | early_stopping_rounds | early_stopping_rounds
num_leaves | num_leaves | N/A
min_sum_hessian_in_leaf | min_child_weight | min_child_weight
Applying LightGBM - Wisconsin breast cancer prediction
import lightgbm
lightgbm.__version__
'3.2.1'
from lightgbm import LGBMClassifier
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
dataset = load_breast_cancer(as_frame=True)
X = dataset.data
y = dataset.target
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=156)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=156)
# Create the scikit-learn wrapper estimator
lgbm = LGBMClassifier(n_estimators=400, learning_rate=0.05)
evals = [(X_tr, y_tr), (X_val, y_val)]
lgbm.fit(X_tr, y_tr, early_stopping_rounds=50, eval_metric='logloss', eval_set=evals, verbose=True)
[1] training's binary_logloss: 0.625671 valid_1's binary_logloss: 0.628248
Training until validation scores don't improve for 50 rounds
[2] training's binary_logloss: 0.588173 valid_1's binary_logloss: 0.601106
[3] training's binary_logloss: 0.554518 valid_1's binary_logloss: 0.577587
[4] training's binary_logloss: 0.523972 valid_1's binary_logloss: 0.556324
[5] training's binary_logloss: 0.49615 valid_1's binary_logloss: 0.537407
[6] training's binary_logloss: 0.470108 valid_1's binary_logloss: 0.519401
[7] training's binary_logloss: 0.446647 valid_1's binary_logloss: 0.502637
[8] training's binary_logloss: 0.425055 valid_1's binary_logloss: 0.488311
[9] training's binary_logloss: 0.405125 valid_1's binary_logloss: 0.474664
[10] training's binary_logloss: 0.386526 valid_1's binary_logloss: 0.461267
[11] training's binary_logloss: 0.367027 valid_1's binary_logloss: 0.444274
[12] training's binary_logloss: 0.350713 valid_1's binary_logloss: 0.432755
[13] training's binary_logloss: 0.334601 valid_1's binary_logloss: 0.421371
[14] training's binary_logloss: 0.319854 valid_1's binary_logloss: 0.411418
[15] training's binary_logloss: 0.306374 valid_1's binary_logloss: 0.402989
[16] training's binary_logloss: 0.293116 valid_1's binary_logloss: 0.393973
[17] training's binary_logloss: 0.280812 valid_1's binary_logloss: 0.384801
[18] training's binary_logloss: 0.268352 valid_1's binary_logloss: 0.376191
[19] training's binary_logloss: 0.256942 valid_1's binary_logloss: 0.368378
[20] training's binary_logloss: 0.246443 valid_1's binary_logloss: 0.362062
[21] training's binary_logloss: 0.236874 valid_1's binary_logloss: 0.355162
[22] training's binary_logloss: 0.227501 valid_1's binary_logloss: 0.348933
[23] training's binary_logloss: 0.218988 valid_1's binary_logloss: 0.342819
[24] training's binary_logloss: 0.210621 valid_1's binary_logloss: 0.337386
[25] training's binary_logloss: 0.202076 valid_1's binary_logloss: 0.331523
[26] training's binary_logloss: 0.194199 valid_1's binary_logloss: 0.326349
[27] training's binary_logloss: 0.187107 valid_1's binary_logloss: 0.322785
[28] training's binary_logloss: 0.180535 valid_1's binary_logloss: 0.317877
[29] training's binary_logloss: 0.173834 valid_1's binary_logloss: 0.313928
[30] training's binary_logloss: 0.167198 valid_1's binary_logloss: 0.310105
[31] training's binary_logloss: 0.161229 valid_1's binary_logloss: 0.307107
[32] training's binary_logloss: 0.155494 valid_1's binary_logloss: 0.303837
[33] training's binary_logloss: 0.149125 valid_1's binary_logloss: 0.300315
[34] training's binary_logloss: 0.144045 valid_1's binary_logloss: 0.297816
[35] training's binary_logloss: 0.139341 valid_1's binary_logloss: 0.295387
[36] training's binary_logloss: 0.134625 valid_1's binary_logloss: 0.293063
[37] training's binary_logloss: 0.129167 valid_1's binary_logloss: 0.289127
[38] training's binary_logloss: 0.12472 valid_1's binary_logloss: 0.288697
[39] training's binary_logloss: 0.11974 valid_1's binary_logloss: 0.28576
[40] training's binary_logloss: 0.115054 valid_1's binary_logloss: 0.282853
[41] training's binary_logloss: 0.110662 valid_1's binary_logloss: 0.279441
[42] training's binary_logloss: 0.106358 valid_1's binary_logloss: 0.28113
[43] training's binary_logloss: 0.102324 valid_1's binary_logloss: 0.279139
[44] training's binary_logloss: 0.0985699 valid_1's binary_logloss: 0.276465
[45] training's binary_logloss: 0.094858 valid_1's binary_logloss: 0.275946
[46] training's binary_logloss: 0.0912486 valid_1's binary_logloss: 0.272819
[47] training's binary_logloss: 0.0883115 valid_1's binary_logloss: 0.272306
[48] training's binary_logloss: 0.0849963 valid_1's binary_logloss: 0.270452
[49] training's binary_logloss: 0.0821742 valid_1's binary_logloss: 0.268671
[50] training's binary_logloss: 0.0789991 valid_1's binary_logloss: 0.267587
[51] training's binary_logloss: 0.0761072 valid_1's binary_logloss: 0.26626
[52] training's binary_logloss: 0.0732567 valid_1's binary_logloss: 0.265542
[53] training's binary_logloss: 0.0706388 valid_1's binary_logloss: 0.264547
[54] training's binary_logloss: 0.0683911 valid_1's binary_logloss: 0.26502
[55] training's binary_logloss: 0.0659347 valid_1's binary_logloss: 0.264388
[56] training's binary_logloss: 0.0636873 valid_1's binary_logloss: 0.263128
[57] training's binary_logloss: 0.0613354 valid_1's binary_logloss: 0.26231
[58] training's binary_logloss: 0.0591944 valid_1's binary_logloss: 0.262011
[59] training's binary_logloss: 0.057033 valid_1's binary_logloss: 0.261454
[60] training's binary_logloss: 0.0550801 valid_1's binary_logloss: 0.260746
[61] training's binary_logloss: 0.0532381 valid_1's binary_logloss: 0.260236
[62] training's binary_logloss: 0.0514074 valid_1's binary_logloss: 0.261586
[63] training's binary_logloss: 0.0494837 valid_1's binary_logloss: 0.261797
[64] training's binary_logloss: 0.0477826 valid_1's binary_logloss: 0.262533
[65] training's binary_logloss: 0.0460364 valid_1's binary_logloss: 0.263305
[66] training's binary_logloss: 0.0444552 valid_1's binary_logloss: 0.264072
[67] training's binary_logloss: 0.0427638 valid_1's binary_logloss: 0.266223
[68] training's binary_logloss: 0.0412449 valid_1's binary_logloss: 0.266817
[69] training's binary_logloss: 0.0398589 valid_1's binary_logloss: 0.267819
[70] training's binary_logloss: 0.0383095 valid_1's binary_logloss: 0.267484
[71] training's binary_logloss: 0.0368803 valid_1's binary_logloss: 0.270233
[72] training's binary_logloss: 0.0355637 valid_1's binary_logloss: 0.268442
[73] training's binary_logloss: 0.0341747 valid_1's binary_logloss: 0.26895
[74] training's binary_logloss: 0.0328302 valid_1's binary_logloss: 0.266958
[75] training's binary_logloss: 0.0317853 valid_1's binary_logloss: 0.268091
[76] training's binary_logloss: 0.0305626 valid_1's binary_logloss: 0.266419
[77] training's binary_logloss: 0.0295001 valid_1's binary_logloss: 0.268588
[78] training's binary_logloss: 0.0284699 valid_1's binary_logloss: 0.270964
[79] training's binary_logloss: 0.0273953 valid_1's binary_logloss: 0.270293
[80] training's binary_logloss: 0.0264668 valid_1's binary_logloss: 0.270523
[81] training's binary_logloss: 0.0254636 valid_1's binary_logloss: 0.270683
[82] training's binary_logloss: 0.0245911 valid_1's binary_logloss: 0.273187
[83] training's binary_logloss: 0.0236486 valid_1's binary_logloss: 0.275994
[84] training's binary_logloss: 0.0228047 valid_1's binary_logloss: 0.274053
[85] training's binary_logloss: 0.0221693 valid_1's binary_logloss: 0.273211
[86] training's binary_logloss: 0.0213043 valid_1's binary_logloss: 0.272626
[87] training's binary_logloss: 0.0203934 valid_1's binary_logloss: 0.27534
[88] training's binary_logloss: 0.0195552 valid_1's binary_logloss: 0.276228
[89] training's binary_logloss: 0.0188623 valid_1's binary_logloss: 0.27525
[90] training's binary_logloss: 0.0183664 valid_1's binary_logloss: 0.276485
[91] training's binary_logloss: 0.0176788 valid_1's binary_logloss: 0.277052
[92] training's binary_logloss: 0.0170059 valid_1's binary_logloss: 0.277686
[93] training's binary_logloss: 0.0164317 valid_1's binary_logloss: 0.275332
[94] training's binary_logloss: 0.015878 valid_1's binary_logloss: 0.276236
[95] training's binary_logloss: 0.0152959 valid_1's binary_logloss: 0.274538
[96] training's binary_logloss: 0.0147216 valid_1's binary_logloss: 0.275244
[97] training's binary_logloss: 0.0141758 valid_1's binary_logloss: 0.275829
[98] training's binary_logloss: 0.0136551 valid_1's binary_logloss: 0.276654
[99] training's binary_logloss: 0.0131585 valid_1's binary_logloss: 0.277859
[100] training's binary_logloss: 0.0126961 valid_1's binary_logloss: 0.279265
[101] training's binary_logloss: 0.0122421 valid_1's binary_logloss: 0.276695
[102] training's binary_logloss: 0.0118067 valid_1's binary_logloss: 0.278488
[103] training's binary_logloss: 0.0113994 valid_1's binary_logloss: 0.278932
[104] training's binary_logloss: 0.0109799 valid_1's binary_logloss: 0.280997
[105] training's binary_logloss: 0.0105953 valid_1's binary_logloss: 0.281454
[106] training's binary_logloss: 0.0102381 valid_1's binary_logloss: 0.282058
[107] training's binary_logloss: 0.00986714 valid_1's binary_logloss: 0.279275
[108] training's binary_logloss: 0.00950998 valid_1's binary_logloss: 0.281427
[109] training's binary_logloss: 0.00915965 valid_1's binary_logloss: 0.280752
[110] training's binary_logloss: 0.00882581 valid_1's binary_logloss: 0.282152
[111] training's binary_logloss: 0.00850714 valid_1's binary_logloss: 0.280894
Early stopping, best iteration is:
[61] training's binary_logloss: 0.0532381 valid_1's binary_logloss: 0.260236
LGBMClassifier(learning_rate=0.05, n_estimators=400)
pred = lgbm.predict(X_test)
pred_proba = lgbm.predict_proba(X_test)[:, 1]
get_clf_eval(y_test, pred, pred_proba)
== Confusion matrix ==
[[34 3]
[ 2 75]]
Accuracy: 0.9561, Precision: 0.9615, Recall: 0.9740, F1: 0.9677, AUC: 0.9877
# Visualize feature importance
lightgbm.plot_importance(lgbm)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='Feature importance', ylabel='Features'>
lightgbm.plot_tree(lgbm, figsize=(20, 5))
<AxesSubplot:>
Hyperparameter tuning with HyperOpt, based on Bayesian optimization
Bayesian optimization
Bayesian optimization is grounded in Bayesian probability.
Just as a posterior probability is refined whenever new events or samples are observed, it builds a model of the optimal function by refining a posterior surrogate model each time new data points arrive.
Step 1: Evaluate the objective function at a few randomly sampled points to obtain initial observations.
Step 2: Fit a surrogate model to the observations to estimate the shape of the objective function.
Step 3: Use an acquisition function on the surrogate to choose the next most promising point, and evaluate the objective there.
Step 4: Update the surrogate with the new observation and repeat steps 2 and 3 until the evaluation budget runs out.
import hyperopt
hyperopt.__version__
'0.2.7'
XGBoost hyperparameter optimization with HyperOpt
Required definitions
- a search space for the input values
- an objective function
- fmin()
dataset = load_breast_cancer(as_frame=True)
X = dataset.data
y = dataset.target
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=156)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=156)
from hyperopt import hp
from sklearn.model_selection import cross_val_score # computes and returns cross-validation scores
from xgboost import XGBClassifier
from hyperopt import STATUS_OK # status flag returned by the objective function
Key functions for defining the input search space
- hp.quniform(label, low, high, q): a search space for the variable named label, from the minimum low to the maximum high in steps of q
- hp.uniform(label, low, high): a uniformly distributed search space between the minimum low and the maximum high
- hp.randint(label, upper): a search space of random integers from 0 up to upper
- hp.loguniform(label, low, high): returns exp(uniform(low, high)), so the log of the returned value is uniformly distributed
## Define the search space
# max_depth: 5 to 20 in steps of 1; min_child_weight: 1 to 2 in steps of 1
# learning_rate: uniform between 0.01 and 0.2; colsample_bytree: uniform between 0.5 and 1
search_space = {
'max_depth':hp.quniform('max_depth', 5, 20, 1),
'min_child_weight':hp.quniform('min_child_weight', 1, 2, 1),
'learning_rate':hp.uniform('learning_rate', 0.01, 0.2),
'colsample_bytree':hp.uniform('colsample_bytree', 0.5, 1)
}
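A quick way to see what this search space produces is hyperopt's stochastic sampler (a hedged check; the printed values will differ per run). Note that hp.quniform returns floats such as 17.0, which is why the objective function below casts them with int():
from hyperopt.pyll.stochastic import sample
print(sample(search_space))
# e.g. {'colsample_bytree': 0.73, 'learning_rate': 0.04, 'max_depth': 12.0, 'min_child_weight': 2.0}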
## Build the objective function
# hp.uniform and friends return floats, so XGBClassifier's integer hyperparameters must be cast to int
def objective_func(search_space):
    # shrink n_estimators to save time
    xgb_clf = XGBClassifier(n_estimators=100, max_depth=int(search_space['max_depth']),
                            min_child_weight=int(search_space['min_child_weight']),
                            learning_rate=search_space['learning_rate'],
                            colsample_bytree=search_space['colsample_bytree'],
                            eval_metric='logloss')
    accuracy = cross_val_score(xgb_clf, X_train, y_train, scoring='accuracy', cv=3)
    # the objective conventionally returns a dictionary;
    # accuracy holds one score per fold (cv=3), and since fmin minimizes the loss
    # while higher accuracy is better, return the mean accuracy multiplied by -1
    return {'loss':-1 * np.mean(accuracy), 'status':STATUS_OK}
# fmin()
from hyperopt import fmin, tpe, Trials
import numpy as np
import warnings
warnings.filterwarnings('ignore')
trial_val = Trials()
best = fmin(fn=objective_func,
space=search_space,
algo=tpe.suggest, max_evals=50, trials=trial_val, rstate=np.random.default_rng(seed=9)) # rstate: random_state
print('best:', best)
100%|███████████████████████████████████████████████| 50/50 [00:09<00:00, 5.44trial/s, best loss: -0.9670616939700244]
best: {'colsample_bytree': 0.5424149213362504, 'learning_rate': 0.12601372924444681, 'max_depth': 17.0, 'min_child_weight': 2.0}
model = XGBClassifier(n_estimators=400,
learning_rate=round(best['learning_rate'], 5),
max_depth=int(best['max_depth']),
min_child_weight=int(best['min_child_weight']),
colsample_bytree=round(best['colsample_bytree'], 5)
)
evals = [(X_tr, y_tr), (X_val, y_val)]
model.fit(X_tr, y_tr, early_stopping_rounds=50, eval_metric='logloss', eval_set=evals, verbose=False)
pred = model.predict(X_test)
pred_proba = model.predict_proba(X_test)[:, 1]
get_clf_eval(y_test, pred, pred_proba)
== Confusion matrix ==
[[35 2]
[ 2 75]]
Accuracy: 0.9649, Precision: 0.9740, Recall: 0.9740, F1: 0.9740, AUC: 0.9944
Reference
- This post summarizes instructor Sim Seon-jo's lectures from the SeSAC course "Applied SW Developer Training Using AI, NLP, and Computer Vision".