12 min read


Classification

GBM(Gradient Boosting Machine)

A boosting algorithm trains several weak learners sequentially, predicting with each one and then re-weighting the mispredicted samples so that later learners focus on correcting those errors.
Because the learners are built one after another, training cannot be parallelized and takes a long time.
GBM updates the weights using gradient descent: each new learner is fit in the direction that reduces the loss, as sketched below.
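As a rough illustration of that idea, here is a minimal sketch (not scikit-learn's actual implementation) of gradient boosting for regression with squared loss, where each weak learner is fit to the residuals, i.e. the negative gradient, of the current ensemble:

# Minimal conceptual sketch of gradient boosting (squared loss, regression).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_gbm_fit(X, y, n_estimators=100, learning_rate=0.1):
    f0 = y.mean()
    pred = np.full(len(y), f0)              # initial prediction F_0
    trees = []
    for _ in range(n_estimators):
        residual = y - pred                 # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        pred = pred + learning_rate * tree.predict(X)  # take a gradient step
        trees.append(tree)
    return f0, trees

def simple_gbm_predict(X, f0, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred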

In [1]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import time
import warnings
warnings.filterwarnings('ignore')
In [2]:
# Reuse the helper functions written in the earlier post
def get_new_df(old_df):
    # cumcount numbers duplicates within each group, so repeated feature names get a suffix
    dup_df = pd.DataFrame(data=old_df.groupby('column_name').cumcount(), columns=['dup_cnt'])
    dup_df = dup_df.reset_index() # turn the original index into a column
    new_df = pd.merge(old_df.reset_index(), dup_df, how='outer') # merge on the column added by reset_index
    new_df['column_name'] = new_df[['column_name', 'dup_cnt']].apply(lambda x: x[0]+'_'+str(x[1]) if x[1]>0 else x[0], axis=1)
    new_df.drop(columns=['index'], inplace=True)
    return new_df

def get_human_dataset():
    import pandas as pd
    feature_name_df = pd.read_csv('human_activity/features.txt', sep='\s+',
                              header=None, names=['column_index', 'column_name'])
    name_df = get_new_df(feature_name_df)
    feature_name = name_df.iloc[:, 1].values.tolist()
    
    X_train = pd.read_csv('human_activity/train/X_train.txt', sep='\s+', names=feature_name)
    X_test = pd.read_csv('human_activity/test/X_test.txt', sep='\s+', names=feature_name)
    
    y_train = pd.read_csv('human_activity/train/y_train.txt', sep='\s+', names=['action'])
    y_test = pd.read_csv('human_activity/test/y_test.txt', sep='\s+', names=['action'])
    
    return X_train, y_train, X_test, y_test
In [3]:
X_train, y_train, X_test, y_test = get_human_dataset()
In [4]:
# Record the start time to measure how long GBM takes
start_time = time.time()

# Training takes several minutes, so the lines below are commented out;
# the output underneath is from an earlier run.
gb_clf = GradientBoostingClassifier(random_state=0) # learning_rate: how far to move on each gradient step
# gb_clf.fit(X_train, y_train)
# pred = gb_clf.predict(X_test)

# print(f'GBM accuracy: {accuracy_score(y_test, pred):.4f}')
# print(f'GBM elapsed time: {time.time() - start_time:.1f} seconds')

GBM accuracy: 0.9389
GBM elapsed time: 480.1 seconds

GBM hyperparameters

  • loss: the cost function used for gradient descent.
  • learning_rate: the learning rate, a value between 0 and 1; the default is 0.1.
  • n_estimators: the number of weak learners. More learners can improve performance up to a point, but training takes longer. The default is 100.
  • subsample: the fraction of the data each weak learner is trained on; the default is 1. (See the sketch below.)
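As a quick illustration of where these hyperparameters go, the snippet below uses arbitrary example values (not tuned settings):

# Example only: arbitrary values showing where each hyperparameter goes.
from sklearn.ensemble import GradientBoostingClassifier

gb_example = GradientBoostingClassifier(
    # loss: the cost function for the gradient descent step
    #       (the default is the logistic/deviance loss; its name differs by sklearn version)
    learning_rate=0.05,   # step size in (0, 1], default 0.1
    n_estimators=200,     # number of weak learners, default 100
    subsample=0.8,        # fraction of samples each weak learner sees, default 1.0
    random_state=0
)
# gb_example.fit(X_train, y_train)  # same X_train/y_train as above; slow to run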

XGBoost(eXtreme Gradient Boosting)

XGBoost builds on GBM but addresses GBM's main weaknesses, namely slow training and the lack of overfitting regularization.
It can train in parallel on multi-core CPUs, so it finishes training much faster than plain GBM.

  • Key advantages of XGBoost (a sketch of the built-in cross validation follows this list)

    • Excellent predictive performance
    • Faster training than GBM
    • Overfitting regularization
    • Tree pruning
    • Built-in cross validation
    • Native handling of missing values
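The built-in cross validation, for example, is exposed through xgb.cv. The snippet below is a self-contained, illustrative sketch on the breast cancer data used later in this post; the parameter values are arbitrary:

# Illustrative sketch of XGBoost's built-in cross validation (xgb.cv).
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
dall = xgb.DMatrix(data=cancer.data, label=cancer.target)

cv_results = xgb.cv(
    params={'max_depth': 3, 'eta': 0.05, 'objective': 'binary:logistic'},
    dtrain=dall,
    num_boost_round=200,
    nfold=3,                    # 3-fold cross validation handled internally
    metrics='logloss',
    early_stopping_rounds=30,
    seed=0
)
print(cv_results.tail(3))       # DataFrame of train/test logloss per boosting round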

In [5]:
import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
In [6]:
xgb.__version__
Out [6]:
'1.5.0'

Python wrapper XGBoost hyperparameters

Main general parameters

  • booster: choose gbtree (tree-based model) or gblinear (linear model); the default is gbtree
  • silent: set to 1 to suppress output messages; the default is 0
  • nthread: number of CPU threads to use; by default all threads are used

Main booster parameters

  • eta [default=0.3]: learning rate
  • num_boost_rounds: equivalent to n_estimators in GBM
  • min_child_weight [default=1]: the minimum sum of instance weights required in a node for the tree to make a further split
  • gamma [default=0]: the minimum loss reduction required to split a leaf node further
  • max_depth [default=6]
  • subsample [default=1]: equivalent to subsample in GBM
  • colsample_bytree [default=1]: equivalent to max_features in GBM
  • lambda [default=1]: L2 regularization term (uses squared weights)
  • alpha [default=0]: L1 regularization term (uses absolute weights)
  • scale_pos_weight [default=1]: helps balance datasets with skewed, imbalanced classes

Learning task parameters

  • objective: defines the loss function to be minimized
  • eval_metric: defines the metric used for validation (an illustrative parameter dict combining these appears below)
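To make the mapping concrete, here is an illustrative parameter dict for the Python wrapper (xgb.train); all values are arbitrary examples:

# Illustrative parameter dict for the Python wrapper; values are arbitrary examples.
xgb_params = {
    # main booster parameters
    'eta': 0.1,
    'max_depth': 4,
    'min_child_weight': 1,
    'gamma': 0,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'lambda': 1,               # L2 regularization
    'alpha': 0,                # L1 regularization
    'scale_pos_weight': 1,
    # learning task parameters
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
}
# booster = xgb.train(xgb_params, dtrain, num_boost_round=200)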

When overfitting is severe, consider the following (a sketch of these adjustments follows the list)

  • Lower eta (to roughly 0.01~0.1); num_round (or n_estimators) should be raised to compensate
  • Lower max_depth
  • Raise min_child_weight
  • Raise gamma
  • Adjust subsample and colsample_bytree
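Continuing the illustrative dict above, those adjustments might look like this (example values, not tuned settings):

# Illustrative anti-overfitting adjustments to the example dict above.
xgb_params.update({
    'eta': 0.02,               # lower the learning rate ...
    'max_depth': 3,            # shallower trees
    'min_child_weight': 3,     # require more instance weight per child
    'gamma': 0.1,              # demand a larger loss reduction before splitting
    'subsample': 0.8,
    'colsample_bytree': 0.8,
})
num_boost_round = 1000         # ... and raise the number of boosting rounds to compensate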

Python wrapper XGBoost in practice - Wisconsin breast cancer prediction

In [7]:
dataset = load_breast_cancer(as_frame=True)
In [8]:
dataset.data.head(2)
Out [8]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
0 17.99 10.38 122.8 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 25.38 17.33 184.6 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.9 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 24.99 23.41 158.8 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902

2 rows × 30 columns

In [9]:
dataset.target_names # 0: malignant, 1: benign
Out [9]:
array(['malignant', 'benign'], dtype='<U9')
In [10]:
dataset.target.value_counts()
Out [10]:
1    357
0    212
Name: target, dtype: int64
In [11]:
# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2, random_state=156)

# split the training data again: 90% train, 10% validation
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=156)
In [12]:
X_train.shape, X_test.shape
Out [12]:
((455, 30), (114, 30))
In [13]:
X_tr.shape, X_val.shape
Out [13]:
((409, 30), (46, 30))
In [14]:
y_train.value_counts()
Out [14]:
1    280
0    175
Name: target, dtype: int64

The Python wrapper uses XGBoost's own data structure, DMatrix. Datasets held as numpy arrays or pandas DataFrames therefore have to be converted into DMatrix objects before being passed to the model, as illustrated below.
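As a quick aside, a DMatrix can be built from either a numpy array or a pandas DataFrame; the arrays below are made up purely for illustration:

# Illustration: DMatrix accepts both numpy arrays and pandas DataFrames.
import numpy as np
import pandas as pd
import xgboost as xgb

X_np = np.random.rand(5, 3)                      # hypothetical feature array
y_np = np.array([0, 1, 0, 1, 1])
dm_from_numpy = xgb.DMatrix(data=X_np, label=y_np,
                            feature_names=['f1', 'f2', 'f3'])

X_df = pd.DataFrame(X_np, columns=['f1', 'f2', 'f3'])
dm_from_pandas = xgb.DMatrix(data=X_df, label=y_np)  # column names are kept automatically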

In [15]:
# Create DMatrix objects for training, validation, and test
dtr = xgb.DMatrix(data=X_tr, label=y_tr)
dval = xgb.DMatrix(data=X_val, label=y_val)
dtest = xgb.DMatrix(data=X_test, label=y_test)

Hyperparameters must be passed as a dictionary.

In [16]:
params = {
    'max_depth':3,
    'eta':0.05,
    'objective':'binary:logistic', # binary classification with the logistic objective
    'eval_metric':'logloss' # evaluate with logloss
}
num_rounds = 500 # equivalent to n_estimators; train for up to 500 boosting rounds

# Name the training set 'train' and the validation set 'eval'
eval_list = [(dtr, 'train'), (dval, 'eval')]
In [17]:
# Pass the hyperparameters to xgb.train(); parameter names can be omitted when arguments are passed in order
model = xgb.train(params, dtr, num_rounds, eval_list, early_stopping_rounds=50)
Out [17]:
[0]	train-logloss:0.65016	eval-logloss:0.66183
[1]	train-logloss:0.61131	eval-logloss:0.63609
[2]	train-logloss:0.57563	eval-logloss:0.61144
[3]	train-logloss:0.54310	eval-logloss:0.59204
[4]	train-logloss:0.51323	eval-logloss:0.57329
[5]	train-logloss:0.48447	eval-logloss:0.55037
[6]	train-logloss:0.45796	eval-logloss:0.52929
[7]	train-logloss:0.43436	eval-logloss:0.51534
[8]	train-logloss:0.41150	eval-logloss:0.49718
[9]	train-logloss:0.39027	eval-logloss:0.48154
[10]	train-logloss:0.37128	eval-logloss:0.46990
[11]	train-logloss:0.35254	eval-logloss:0.45474
[12]	train-logloss:0.33528	eval-logloss:0.44229
[13]	train-logloss:0.31893	eval-logloss:0.42961
[14]	train-logloss:0.30439	eval-logloss:0.42065
[15]	train-logloss:0.29000	eval-logloss:0.40958
[16]	train-logloss:0.27651	eval-logloss:0.39887
[17]	train-logloss:0.26389	eval-logloss:0.39050
[18]	train-logloss:0.25210	eval-logloss:0.38254
[19]	train-logloss:0.24123	eval-logloss:0.37393
[20]	train-logloss:0.23076	eval-logloss:0.36789
[21]	train-logloss:0.22091	eval-logloss:0.36017
[22]	train-logloss:0.21155	eval-logloss:0.35421
[23]	train-logloss:0.20263	eval-logloss:0.34683
[24]	train-logloss:0.19434	eval-logloss:0.34111
[25]	train-logloss:0.18637	eval-logloss:0.33634
[26]	train-logloss:0.17875	eval-logloss:0.33082
[27]	train-logloss:0.17167	eval-logloss:0.32675
[28]	train-logloss:0.16481	eval-logloss:0.32099
[29]	train-logloss:0.15835	eval-logloss:0.31671
[30]	train-logloss:0.15225	eval-logloss:0.31277
[31]	train-logloss:0.14650	eval-logloss:0.30882
[32]	train-logloss:0.14102	eval-logloss:0.30437
[33]	train-logloss:0.13590	eval-logloss:0.30103
[34]	train-logloss:0.13109	eval-logloss:0.29794
[35]	train-logloss:0.12647	eval-logloss:0.29499
[36]	train-logloss:0.12197	eval-logloss:0.29295
[37]	train-logloss:0.11784	eval-logloss:0.29043
[38]	train-logloss:0.11379	eval-logloss:0.28927
[39]	train-logloss:0.10994	eval-logloss:0.28578
[40]	train-logloss:0.10638	eval-logloss:0.28364
[41]	train-logloss:0.10302	eval-logloss:0.28183
[42]	train-logloss:0.09963	eval-logloss:0.28005
[43]	train-logloss:0.09649	eval-logloss:0.27972
[44]	train-logloss:0.09359	eval-logloss:0.27744
[45]	train-logloss:0.09080	eval-logloss:0.27542
[46]	train-logloss:0.08807	eval-logloss:0.27504
[47]	train-logloss:0.08541	eval-logloss:0.27458
[48]	train-logloss:0.08299	eval-logloss:0.27348
[49]	train-logloss:0.08035	eval-logloss:0.27247
[50]	train-logloss:0.07786	eval-logloss:0.27163
[51]	train-logloss:0.07550	eval-logloss:0.27094
[52]	train-logloss:0.07344	eval-logloss:0.26967
[53]	train-logloss:0.07147	eval-logloss:0.27008
[54]	train-logloss:0.06964	eval-logloss:0.26890
[55]	train-logloss:0.06766	eval-logloss:0.26854
[56]	train-logloss:0.06592	eval-logloss:0.26900
[57]	train-logloss:0.06433	eval-logloss:0.26790
[58]	train-logloss:0.06259	eval-logloss:0.26663
[59]	train-logloss:0.06107	eval-logloss:0.26743
[60]	train-logloss:0.05957	eval-logloss:0.26610
[61]	train-logloss:0.05817	eval-logloss:0.26644
[62]	train-logloss:0.05691	eval-logloss:0.26673
[63]	train-logloss:0.05550	eval-logloss:0.26550
[64]	train-logloss:0.05422	eval-logloss:0.26443
[65]	train-logloss:0.05311	eval-logloss:0.26500
[66]	train-logloss:0.05207	eval-logloss:0.26591
[67]	train-logloss:0.05093	eval-logloss:0.26501
[68]	train-logloss:0.04976	eval-logloss:0.26435
[69]	train-logloss:0.04872	eval-logloss:0.26360
[70]	train-logloss:0.04776	eval-logloss:0.26319
[71]	train-logloss:0.04680	eval-logloss:0.26255
[72]	train-logloss:0.04580	eval-logloss:0.26204
[73]	train-logloss:0.04484	eval-logloss:0.26254
[74]	train-logloss:0.04388	eval-logloss:0.26289
[75]	train-logloss:0.04309	eval-logloss:0.26249
[76]	train-logloss:0.04224	eval-logloss:0.26217
[77]	train-logloss:0.04133	eval-logloss:0.26166
[78]	train-logloss:0.04050	eval-logloss:0.26179
[79]	train-logloss:0.03967	eval-logloss:0.26103
[80]	train-logloss:0.03877	eval-logloss:0.26094
[81]	train-logloss:0.03806	eval-logloss:0.26148
[82]	train-logloss:0.03740	eval-logloss:0.26054
[83]	train-logloss:0.03676	eval-logloss:0.25967
[84]	train-logloss:0.03605	eval-logloss:0.25905
[85]	train-logloss:0.03545	eval-logloss:0.26007
[86]	train-logloss:0.03488	eval-logloss:0.25984
[87]	train-logloss:0.03425	eval-logloss:0.25933
[88]	train-logloss:0.03361	eval-logloss:0.25932
[89]	train-logloss:0.03311	eval-logloss:0.26002
[90]	train-logloss:0.03260	eval-logloss:0.25936
[91]	train-logloss:0.03202	eval-logloss:0.25886
[92]	train-logloss:0.03152	eval-logloss:0.25918
[93]	train-logloss:0.03107	eval-logloss:0.25865
[94]	train-logloss:0.03049	eval-logloss:0.25951
[95]	train-logloss:0.03007	eval-logloss:0.26091
[96]	train-logloss:0.02963	eval-logloss:0.26014
[97]	train-logloss:0.02913	eval-logloss:0.25974
[98]	train-logloss:0.02866	eval-logloss:0.25937
[99]	train-logloss:0.02829	eval-logloss:0.25893
[100]	train-logloss:0.02789	eval-logloss:0.25928
[101]	train-logloss:0.02751	eval-logloss:0.25955
[102]	train-logloss:0.02714	eval-logloss:0.25901
[103]	train-logloss:0.02668	eval-logloss:0.25991
[104]	train-logloss:0.02634	eval-logloss:0.25950
[105]	train-logloss:0.02594	eval-logloss:0.25924
[106]	train-logloss:0.02556	eval-logloss:0.25901
[107]	train-logloss:0.02522	eval-logloss:0.25738
[108]	train-logloss:0.02492	eval-logloss:0.25702
[109]	train-logloss:0.02453	eval-logloss:0.25789
[110]	train-logloss:0.02418	eval-logloss:0.25770
[111]	train-logloss:0.02384	eval-logloss:0.25842
[112]	train-logloss:0.02356	eval-logloss:0.25810
[113]	train-logloss:0.02322	eval-logloss:0.25848
[114]	train-logloss:0.02290	eval-logloss:0.25833
[115]	train-logloss:0.02260	eval-logloss:0.25820
[116]	train-logloss:0.02229	eval-logloss:0.25905
[117]	train-logloss:0.02204	eval-logloss:0.25878
[118]	train-logloss:0.02176	eval-logloss:0.25728
[119]	train-logloss:0.02149	eval-logloss:0.25722
[120]	train-logloss:0.02119	eval-logloss:0.25764
[121]	train-logloss:0.02095	eval-logloss:0.25761
[122]	train-logloss:0.02067	eval-logloss:0.25832
[123]	train-logloss:0.02045	eval-logloss:0.25808
[124]	train-logloss:0.02023	eval-logloss:0.25855
[125]	train-logloss:0.01998	eval-logloss:0.25714
[126]	train-logloss:0.01973	eval-logloss:0.25587
[127]	train-logloss:0.01946	eval-logloss:0.25640
[128]	train-logloss:0.01927	eval-logloss:0.25685
[129]	train-logloss:0.01908	eval-logloss:0.25665
[130]	train-logloss:0.01886	eval-logloss:0.25712
[131]	train-logloss:0.01863	eval-logloss:0.25609
[132]	train-logloss:0.01839	eval-logloss:0.25649
[133]	train-logloss:0.01816	eval-logloss:0.25789
[134]	train-logloss:0.01802	eval-logloss:0.25811
[135]	train-logloss:0.01785	eval-logloss:0.25794
[136]	train-logloss:0.01763	eval-logloss:0.25876
[137]	train-logloss:0.01748	eval-logloss:0.25884
[138]	train-logloss:0.01732	eval-logloss:0.25867
[139]	train-logloss:0.01719	eval-logloss:0.25876
[140]	train-logloss:0.01696	eval-logloss:0.25987
[141]	train-logloss:0.01681	eval-logloss:0.25960
[142]	train-logloss:0.01669	eval-logloss:0.25982
[143]	train-logloss:0.01656	eval-logloss:0.25992
[144]	train-logloss:0.01638	eval-logloss:0.26035
[145]	train-logloss:0.01623	eval-logloss:0.26055
[146]	train-logloss:0.01606	eval-logloss:0.26092
[147]	train-logloss:0.01589	eval-logloss:0.26137
[148]	train-logloss:0.01572	eval-logloss:0.25999
[149]	train-logloss:0.01557	eval-logloss:0.26028
[150]	train-logloss:0.01546	eval-logloss:0.26048
[151]	train-logloss:0.01531	eval-logloss:0.26142
[152]	train-logloss:0.01515	eval-logloss:0.26188
[153]	train-logloss:0.01501	eval-logloss:0.26227
[154]	train-logloss:0.01486	eval-logloss:0.26287
[155]	train-logloss:0.01476	eval-logloss:0.26299
[156]	train-logloss:0.01461	eval-logloss:0.26346
[157]	train-logloss:0.01448	eval-logloss:0.26379
[158]	train-logloss:0.01434	eval-logloss:0.26306
[159]	train-logloss:0.01424	eval-logloss:0.26237
[160]	train-logloss:0.01410	eval-logloss:0.26251
[161]	train-logloss:0.01401	eval-logloss:0.26265
[162]	train-logloss:0.01392	eval-logloss:0.26264
[163]	train-logloss:0.01380	eval-logloss:0.26250
[164]	train-logloss:0.01372	eval-logloss:0.26264
[165]	train-logloss:0.01359	eval-logloss:0.26255
[166]	train-logloss:0.01350	eval-logloss:0.26188
[167]	train-logloss:0.01342	eval-logloss:0.26203
[168]	train-logloss:0.01331	eval-logloss:0.26190
[169]	train-logloss:0.01319	eval-logloss:0.26184
[170]	train-logloss:0.01312	eval-logloss:0.26133
[171]	train-logloss:0.01304	eval-logloss:0.26148
[172]	train-logloss:0.01297	eval-logloss:0.26157
[173]	train-logloss:0.01285	eval-logloss:0.26253
[174]	train-logloss:0.01278	eval-logloss:0.26229
[175]	train-logloss:0.01267	eval-logloss:0.26086

Training did not run for the full 500 rounds: it stopped early because the eval logloss failed to improve for the 50 rounds specified by early_stopping_rounds. The best round can be read back from the returned Booster, as sketched below.
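When early stopping triggers, the returned Booster remembers the best round. The sketch below reads it back and restricts prediction to those rounds; the attribute and argument names are from the xgboost 1.x Python API:

# With early stopping, the returned Booster keeps the best round (xgboost 1.x API).
print('best iteration :', model.best_iteration)   # round with the lowest eval-logloss
print('best eval score:', model.best_score)

# Predictions can be limited to the best rounds instead of all trained rounds.
best_preds = model.predict(dtest, iteration_range=(0, model.best_iteration + 1))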

In [18]:
pred_probs = model.predict(dtest)
In [19]:
# XGBoost's predict() returns predicted probabilities, not class labels.
np.round(pred_probs[:10], 3)
Out [19]:
array([0.845, 0.008, 0.68 , 0.081, 0.975, 0.999, 0.998, 0.998, 0.996,
       0.001], dtype=float32)
In [20]:
# Convert the probabilities to class predictions
pred = [1 if x > 0.5 else 0 for x in pred_probs]
In [21]:
def get_clf_eval(y_test, pred, pred_proba_1):
    from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score, roc_auc_score
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    auc = roc_auc_score(y_test, pred_proba_1)
    print('== Confusion Matrix ==')
    print(confusion)
    print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}, AUC: {auc:.4f}")
In [22]:
# Evaluate the predictions
get_clf_eval(y_test, pred, pred_probs)
Out [22]:
== Confusion Matrix ==
[[34  3]
 [ 2 75]]
Accuracy: 0.9561, Precision: 0.9615, Recall: 0.9740, F1: 0.9677, AUC: 0.9937

  • Feature importance visualization
In [23]:
plot_importance(model)
Out [23]:
<AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>

img

In [24]:
xgb.to_graphviz(model)
Out [24]:

svg

Scikit-learn wrapper XGBoost

In [25]:
from xgboost import XGBClassifier
In [26]:
model = XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=3, eval_metric='logloss')
In [27]:
model.fit(X_train, y_train, verbose=True)
Out [27]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eval_metric='logloss', gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.05, max_delta_step=0,
              max_depth=3, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=500, n_jobs=8,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
In [28]:
pred = model.predict(X_test) # unlike the Python wrapper, this returns class labels
In [29]:
pred_proba = model.predict_proba(X_test)[:, 1]
In [30]:
# Evaluate the predictions
get_clf_eval(y_test, pred, pred_proba)
Out [30]:
== Confusion Matrix ==
[[34  3]
 [ 1 76]]
Accuracy: 0.9649, Precision: 0.9620, Recall: 0.9870, F1: 0.9744, AUC: 0.9951

The usage is exactly the same as the other scikit-learn models used earlier, so it also works with the usual scikit-learn utilities (see the sketch below).
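Because XGBClassifier follows the scikit-learn estimator API, it drops straight into utilities such as cross_val_score; the settings below are illustrative:

# XGBClassifier plugs into scikit-learn utilities such as cross_val_score.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    XGBClassifier(n_estimators=100, learning_rate=0.05, max_depth=3,
                  eval_metric='logloss'),
    X_train, y_train, scoring='accuracy', cv=3
)
print(scores, scores.mean())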

In [31]:
# Train with the same settings as the Python wrapper
model = XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=3)

evals = [(X_tr, y_tr), (X_val, y_val)]
model.fit(X_tr, y_tr, early_stopping_rounds=50, eval_metric='logloss', eval_set=evals, verbose=False)

pred = model.predict(X_test)
pred_proba = model.predict_proba(X_test)[:, 1]
get_clf_eval(y_test, pred, pred_proba)
Out [31]:
== Confusion Matrix ==
[[34  3]
 [ 2 75]]
Accuracy: 0.9561, Precision: 0.9615, Recall: 0.9740, F1: 0.9677, AUC: 0.9933

In [32]:
xgb.to_graphviz(model)
Out [32]:

svg

If the early stopping value is set too small, training may terminate before the model has learned enough.

LightGBM

LightGBM trains much faster than XGBoost.
It is prone to overfitting when applied to small datasets.
Unlike the level-wise splitting used by most GBM-family implementations, it grows trees leaf-wise.

Advantages of LightGBM over XGBoost

  • Faster training and prediction
  • Lower memory usage
  • Automatic conversion and optimal splitting of categorical features (see the sketch after this list)
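For the categorical-feature handling, a pandas 'category' column can be passed directly without one-hot encoding. The DataFrame below is made up purely for illustration:

# Sketch of LightGBM's native categorical handling on a made-up toy DataFrame.
import pandas as pd
from lightgbm import LGBMClassifier

df = pd.DataFrame({
    'gender': ['M', 'F', 'F', 'M', 'F'],
    'age':    [23, 41, 35, 52, 29],
    'label':  [0, 1, 0, 1, 1],
})
df['gender'] = df['gender'].astype('category')   # mark the column as categorical

X_toy, y_toy = df[['gender', 'age']], df['label']
toy_clf = LGBMClassifier(n_estimators=10, min_child_samples=1)  # toy data, so relax the leaf-size constraint
toy_clf.fit(X_toy, y_toy)   # pandas 'category' dtypes are picked up as categorical features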

LightGBM hyperparameters

Because trees are grown leaf-wise, max_depth generally needs to be set deeper than in other GBMs.

Main parameters

  • num_iterations [default=100]: number of boosting iterations (trees)
  • learning_rate [default=0.1]: learning rate
  • max_depth [default=-1]: values below 0 mean no depth limit
  • min_data_in_leaf [default=20]: equivalent to min_samples_leaf in DecisionTree
  • num_leaves [default=31]: maximum number of leaves a single tree can have
  • boosting [default=gbdt]: algorithm used to build the boosted trees; gbdt is the usual gradient boosting decision tree, rf is random forest
  • bagging_fraction [default=1.0]: fraction of the data sampled per iteration, equivalent to subsample in XGBClassifier
  • feature_fraction [default=1.0]: fraction of features randomly selected for each tree, equivalent to max_features in GBM and colsample_bytree in XGBClassifier
  • lambda_l2 [default=0.0]: L2 regularization (squared weights), equivalent to reg_lambda in XGBClassifier
  • lambda_l1 [default=0.0]: L1 regularization (absolute weights), equivalent to reg_alpha in XGBClassifier

Learning task parameters

  • objective: the loss function to minimize (the error between actual and predicted values); the appropriate loss is chosen depending on whether the task is regression, multi-class classification, or binary classification

Hyperparameter tuning guidelines

The basic tuning approach is to reduce model complexity by adjusting num_leaves together with min_child_samples (min_data_in_leaf) and max_depth.

Hyperparameter names: Python wrapper LightGBM vs. scikit-learn wrapper LightGBM vs. scikit-learn wrapper XGBoost

Parameter names are listed as Python wrapper LightGBM / scikit-learn wrapper LightGBM / scikit-learn wrapper XGBoost (a snippet showing the two vocabularies side by side follows):

  • num_iterations / n_estimators / n_estimators
  • learning_rate / learning_rate / learning_rate
  • max_depth / max_depth / max_depth
  • min_data_in_leaf / min_child_samples / N/A
  • bagging_fraction / subsample / subsample
  • feature_fraction / colsample_bytree / colsample_bytree
  • lambda_l2 / reg_lambda / reg_lambda
  • lambda_l1 / reg_alpha / reg_alpha
  • early_stopping_round / early_stopping_rounds / early_stopping_rounds
  • num_leaves / num_leaves / N/A
  • min_sum_hessian_in_leaf / min_child_weight / min_child_weight
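As a small illustration of the mapping above, the same (arbitrary) configuration can be written with the native parameter names for the Python wrapper and with the aliased keyword arguments for the scikit-learn wrapper:

# The same arbitrary configuration expressed in both parameter vocabularies.
import lightgbm as lgb
from lightgbm import LGBMClassifier

# Python wrapper: native parameter names passed as a dict to lgb.train()
native_params = {
    'objective': 'binary',
    'learning_rate': 0.05,
    'num_leaves': 31,
    'max_depth': -1,
    'min_data_in_leaf': 20,
    'bagging_fraction': 0.8,
    'feature_fraction': 0.8,
    'lambda_l2': 1.0,
}
# booster = lgb.train(native_params, lgb.Dataset(X_train, y_train), num_boost_round=400)

# scikit-learn wrapper: the aliased keyword arguments
sk_model = LGBMClassifier(
    n_estimators=400, learning_rate=0.05, num_leaves=31, max_depth=-1,
    min_child_samples=20, subsample=0.8, colsample_bytree=0.8, reg_lambda=1.0,
)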

Applying LightGBM - Wisconsin breast cancer prediction

In [33]:
import lightgbm
In [34]:
lightgbm.__version__
Out [34]:
'3.2.1'
In [35]:
from lightgbm import LGBMClassifier
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
In [36]:
dataset = load_breast_cancer(as_frame=True)
In [37]:
X = dataset.data
y = dataset.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=156)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=156)
In [38]:
# Create the scikit-learn wrapper estimator
lgbm = LGBMClassifier(n_estimators=400, learning_rate=0.05)

evals = [(X_tr, y_tr), (X_val, y_val)]
lgbm.fit(X_tr, y_tr, early_stopping_rounds=50, eval_metric='logloss', eval_set=evals, verbose=True)
Out [38]:
[1]	training's binary_logloss: 0.625671	valid_1's binary_logloss: 0.628248
Training until validation scores don't improve for 50 rounds
[2]	training's binary_logloss: 0.588173	valid_1's binary_logloss: 0.601106
[3]	training's binary_logloss: 0.554518	valid_1's binary_logloss: 0.577587
[4]	training's binary_logloss: 0.523972	valid_1's binary_logloss: 0.556324
[5]	training's binary_logloss: 0.49615	valid_1's binary_logloss: 0.537407
[6]	training's binary_logloss: 0.470108	valid_1's binary_logloss: 0.519401
[7]	training's binary_logloss: 0.446647	valid_1's binary_logloss: 0.502637
[8]	training's binary_logloss: 0.425055	valid_1's binary_logloss: 0.488311
[9]	training's binary_logloss: 0.405125	valid_1's binary_logloss: 0.474664
[10]	training's binary_logloss: 0.386526	valid_1's binary_logloss: 0.461267
[11]	training's binary_logloss: 0.367027	valid_1's binary_logloss: 0.444274
[12]	training's binary_logloss: 0.350713	valid_1's binary_logloss: 0.432755
[13]	training's binary_logloss: 0.334601	valid_1's binary_logloss: 0.421371
[14]	training's binary_logloss: 0.319854	valid_1's binary_logloss: 0.411418
[15]	training's binary_logloss: 0.306374	valid_1's binary_logloss: 0.402989
[16]	training's binary_logloss: 0.293116	valid_1's binary_logloss: 0.393973
[17]	training's binary_logloss: 0.280812	valid_1's binary_logloss: 0.384801
[18]	training's binary_logloss: 0.268352	valid_1's binary_logloss: 0.376191
[19]	training's binary_logloss: 0.256942	valid_1's binary_logloss: 0.368378
[20]	training's binary_logloss: 0.246443	valid_1's binary_logloss: 0.362062
[21]	training's binary_logloss: 0.236874	valid_1's binary_logloss: 0.355162
[22]	training's binary_logloss: 0.227501	valid_1's binary_logloss: 0.348933
[23]	training's binary_logloss: 0.218988	valid_1's binary_logloss: 0.342819
[24]	training's binary_logloss: 0.210621	valid_1's binary_logloss: 0.337386
[25]	training's binary_logloss: 0.202076	valid_1's binary_logloss: 0.331523
[26]	training's binary_logloss: 0.194199	valid_1's binary_logloss: 0.326349
[27]	training's binary_logloss: 0.187107	valid_1's binary_logloss: 0.322785
[28]	training's binary_logloss: 0.180535	valid_1's binary_logloss: 0.317877
[29]	training's binary_logloss: 0.173834	valid_1's binary_logloss: 0.313928
[30]	training's binary_logloss: 0.167198	valid_1's binary_logloss: 0.310105
[31]	training's binary_logloss: 0.161229	valid_1's binary_logloss: 0.307107
[32]	training's binary_logloss: 0.155494	valid_1's binary_logloss: 0.303837
[33]	training's binary_logloss: 0.149125	valid_1's binary_logloss: 0.300315
[34]	training's binary_logloss: 0.144045	valid_1's binary_logloss: 0.297816
[35]	training's binary_logloss: 0.139341	valid_1's binary_logloss: 0.295387
[36]	training's binary_logloss: 0.134625	valid_1's binary_logloss: 0.293063
[37]	training's binary_logloss: 0.129167	valid_1's binary_logloss: 0.289127
[38]	training's binary_logloss: 0.12472	valid_1's binary_logloss: 0.288697
[39]	training's binary_logloss: 0.11974	valid_1's binary_logloss: 0.28576
[40]	training's binary_logloss: 0.115054	valid_1's binary_logloss: 0.282853
[41]	training's binary_logloss: 0.110662	valid_1's binary_logloss: 0.279441
[42]	training's binary_logloss: 0.106358	valid_1's binary_logloss: 0.28113
[43]	training's binary_logloss: 0.102324	valid_1's binary_logloss: 0.279139
[44]	training's binary_logloss: 0.0985699	valid_1's binary_logloss: 0.276465
[45]	training's binary_logloss: 0.094858	valid_1's binary_logloss: 0.275946
[46]	training's binary_logloss: 0.0912486	valid_1's binary_logloss: 0.272819
[47]	training's binary_logloss: 0.0883115	valid_1's binary_logloss: 0.272306
[48]	training's binary_logloss: 0.0849963	valid_1's binary_logloss: 0.270452
[49]	training's binary_logloss: 0.0821742	valid_1's binary_logloss: 0.268671
[50]	training's binary_logloss: 0.0789991	valid_1's binary_logloss: 0.267587
[51]	training's binary_logloss: 0.0761072	valid_1's binary_logloss: 0.26626
[52]	training's binary_logloss: 0.0732567	valid_1's binary_logloss: 0.265542
[53]	training's binary_logloss: 0.0706388	valid_1's binary_logloss: 0.264547
[54]	training's binary_logloss: 0.0683911	valid_1's binary_logloss: 0.26502
[55]	training's binary_logloss: 0.0659347	valid_1's binary_logloss: 0.264388
[56]	training's binary_logloss: 0.0636873	valid_1's binary_logloss: 0.263128
[57]	training's binary_logloss: 0.0613354	valid_1's binary_logloss: 0.26231
[58]	training's binary_logloss: 0.0591944	valid_1's binary_logloss: 0.262011
[59]	training's binary_logloss: 0.057033	valid_1's binary_logloss: 0.261454
[60]	training's binary_logloss: 0.0550801	valid_1's binary_logloss: 0.260746
[61]	training's binary_logloss: 0.0532381	valid_1's binary_logloss: 0.260236
[62]	training's binary_logloss: 0.0514074	valid_1's binary_logloss: 0.261586
[63]	training's binary_logloss: 0.0494837	valid_1's binary_logloss: 0.261797
[64]	training's binary_logloss: 0.0477826	valid_1's binary_logloss: 0.262533
[65]	training's binary_logloss: 0.0460364	valid_1's binary_logloss: 0.263305
[66]	training's binary_logloss: 0.0444552	valid_1's binary_logloss: 0.264072
[67]	training's binary_logloss: 0.0427638	valid_1's binary_logloss: 0.266223
[68]	training's binary_logloss: 0.0412449	valid_1's binary_logloss: 0.266817
[69]	training's binary_logloss: 0.0398589	valid_1's binary_logloss: 0.267819
[70]	training's binary_logloss: 0.0383095	valid_1's binary_logloss: 0.267484
[71]	training's binary_logloss: 0.0368803	valid_1's binary_logloss: 0.270233
[72]	training's binary_logloss: 0.0355637	valid_1's binary_logloss: 0.268442
[73]	training's binary_logloss: 0.0341747	valid_1's binary_logloss: 0.26895
[74]	training's binary_logloss: 0.0328302	valid_1's binary_logloss: 0.266958
[75]	training's binary_logloss: 0.0317853	valid_1's binary_logloss: 0.268091
[76]	training's binary_logloss: 0.0305626	valid_1's binary_logloss: 0.266419
[77]	training's binary_logloss: 0.0295001	valid_1's binary_logloss: 0.268588
[78]	training's binary_logloss: 0.0284699	valid_1's binary_logloss: 0.270964
[79]	training's binary_logloss: 0.0273953	valid_1's binary_logloss: 0.270293
[80]	training's binary_logloss: 0.0264668	valid_1's binary_logloss: 0.270523
[81]	training's binary_logloss: 0.0254636	valid_1's binary_logloss: 0.270683
[82]	training's binary_logloss: 0.0245911	valid_1's binary_logloss: 0.273187
[83]	training's binary_logloss: 0.0236486	valid_1's binary_logloss: 0.275994
[84]	training's binary_logloss: 0.0228047	valid_1's binary_logloss: 0.274053
[85]	training's binary_logloss: 0.0221693	valid_1's binary_logloss: 0.273211
[86]	training's binary_logloss: 0.0213043	valid_1's binary_logloss: 0.272626
[87]	training's binary_logloss: 0.0203934	valid_1's binary_logloss: 0.27534
[88]	training's binary_logloss: 0.0195552	valid_1's binary_logloss: 0.276228
[89]	training's binary_logloss: 0.0188623	valid_1's binary_logloss: 0.27525
[90]	training's binary_logloss: 0.0183664	valid_1's binary_logloss: 0.276485
[91]	training's binary_logloss: 0.0176788	valid_1's binary_logloss: 0.277052
[92]	training's binary_logloss: 0.0170059	valid_1's binary_logloss: 0.277686
[93]	training's binary_logloss: 0.0164317	valid_1's binary_logloss: 0.275332
[94]	training's binary_logloss: 0.015878	valid_1's binary_logloss: 0.276236
[95]	training's binary_logloss: 0.0152959	valid_1's binary_logloss: 0.274538
[96]	training's binary_logloss: 0.0147216	valid_1's binary_logloss: 0.275244
[97]	training's binary_logloss: 0.0141758	valid_1's binary_logloss: 0.275829
[98]	training's binary_logloss: 0.0136551	valid_1's binary_logloss: 0.276654
[99]	training's binary_logloss: 0.0131585	valid_1's binary_logloss: 0.277859
[100]	training's binary_logloss: 0.0126961	valid_1's binary_logloss: 0.279265
[101]	training's binary_logloss: 0.0122421	valid_1's binary_logloss: 0.276695
[102]	training's binary_logloss: 0.0118067	valid_1's binary_logloss: 0.278488
[103]	training's binary_logloss: 0.0113994	valid_1's binary_logloss: 0.278932
[104]	training's binary_logloss: 0.0109799	valid_1's binary_logloss: 0.280997
[105]	training's binary_logloss: 0.0105953	valid_1's binary_logloss: 0.281454
[106]	training's binary_logloss: 0.0102381	valid_1's binary_logloss: 0.282058
[107]	training's binary_logloss: 0.00986714	valid_1's binary_logloss: 0.279275
[108]	training's binary_logloss: 0.00950998	valid_1's binary_logloss: 0.281427
[109]	training's binary_logloss: 0.00915965	valid_1's binary_logloss: 0.280752
[110]	training's binary_logloss: 0.00882581	valid_1's binary_logloss: 0.282152
[111]	training's binary_logloss: 0.00850714	valid_1's binary_logloss: 0.280894
Early stopping, best iteration is:
[61]	training's binary_logloss: 0.0532381	valid_1's binary_logloss: 0.260236

LGBMClassifier(learning_rate=0.05, n_estimators=400)
In [39]:
pred = lgbm.predict(X_test)
pred_proba = lgbm.predict_proba(X_test)[:, 1]
In [40]:
get_clf_eval(y_test, pred, pred_proba)
Out [40]:
== Confusion Matrix ==
[[34  3]
 [ 2 75]]
Accuracy: 0.9561, Precision: 0.9615, Recall: 0.9740, F1: 0.9677, AUC: 0.9877

In [41]:
# Feature importance visualization
lightgbm.plot_importance(lgbm)
Out [41]:
<AxesSubplot:title={'center':'Feature importance'}, xlabel='Feature importance', ylabel='Features'>

img

In [42]:
lightgbm.plot_tree(lgbm, figsize=(20, 5))
Out [42]:
<AxesSubplot:>

img

Hyperparameter tuning with HyperOpt, based on Bayesian optimization

Bayesian optimization
It is grounded in Bayesian probability.
Just as a posterior probability is refined whenever new events or samples are observed, Bayesian optimization refines a surrogate (posterior) model of the objective function each time new evaluation data arrives, gradually converging on a model of the optimal function.

Step1
img

Step2
img

Step3
img

Step4
img

In [43]:
import hyperopt
In [44]:
hyperopt.__version__
Out [44]:
'0.2.7'

Optimizing XGBoost hyperparameters with HyperOpt

Required definitions

  • Search space for the input values
  • Objective function
  • fmin()
In [45]:
dataset = load_breast_cancer(as_frame=True)

X = dataset.data
y = dataset.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=156)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=156)
In [46]:
from hyperopt import hp
from sklearn.model_selection import cross_val_score # computes and returns cross-validated scores
from xgboost import XGBClassifier
from hyperopt import STATUS_OK # status flag returned by the objective function

Representative functions for defining the search space of the inputs (a sampling sketch follows the list)

  • hp.quniform(label, low, high, q): search space for the variable named label, from the minimum low to the maximum high in steps of q
  • hp.uniform(label, low, high): uniformly distributed search space between the minimum low and the maximum high
  • hp.randint(label, upper): search space of random integers from 0 up to (but not including) upper
  • hp.loguniform(label, low, high): returns exp(uniform(low, high)), so the log of the returned value is uniformly distributed
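To see what these expressions actually produce, a few random samples can be drawn from a space with hyperopt's stochastic sampler; the ranges below are illustrative:

# Draw random samples from a search space to see what each expression produces.
from hyperopt import hp
from hyperopt.pyll.stochastic import sample

demo_space = {
    'max_depth': hp.quniform('max_depth', 5, 20, 1),          # 5, 6, ..., 20 (returned as float)
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.2),  # uniform between 0.01 and 0.2
    'n_estimators': hp.randint('n_estimators', 500),          # integer in [0, 500)
    'reg_lambda': hp.loguniform('reg_lambda', -5, 2),         # exp(uniform(-5, 2))
}

for _ in range(3):
    print(sample(demo_space))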
In [47]:
## Define the search space

# max_depth: 5~20 in steps of 1, min_child_weight: 1~2 in steps of 1
# learning_rate: uniformly between 0.01 and 0.2, colsample_bytree: uniformly between 0.5 and 1
search_space = {
    'max_depth':hp.quniform('max_depth', 5, 20, 1),
    'min_child_weight':hp.quniform('min_child_weight', 1, 2, 1),
    'learning_rate':hp.uniform('learning_rate', 0.01, 0.2),
    'colsample_bytree':hp.uniform('colsample_bytree', 0.5, 1)
}
In [48]:
## Build the objective function

# hp.uniform etc. return floats, so integer hyperparameters of XGBClassifier must be cast to int
def objective_func(search_space):
    # use a small n_estimators to keep the search fast
    xgb_clf = XGBClassifier(n_estimators=100, max_depth=int(search_space['max_depth']),
                            min_child_weight=int(search_space['min_child_weight']),
                            learning_rate=search_space['learning_rate'],
                            colsample_bytree=search_space['colsample_bytree'],
                            eval_metric='logloss')
    accuracy = cross_val_score(xgb_clf, X_train, y_train, scoring='accuracy', cv=3)
    
    # the objective function usually returns a dictionary
    # logloss is better when smaller, but accuracy is better when larger, so multiply accuracy by -1 to turn it into a loss
    # accuracy holds one score per fold (cv=3); average them and negate so that a higher accuracy gives a lower loss
    return {'loss':-1 * np.mean(accuracy), 'status':STATUS_OK}
In [49]:
# fmin()
from hyperopt import fmin, tpe, Trials
import numpy as np
import warnings
warnings.filterwarnings('ignore')
In [50]:
trial_val = Trials()
best = fmin(fn=objective_func,
            space=search_space,
            algo=tpe.suggest, max_evals=50, trials=trial_val, rstate=np.random.default_rng(seed=9)) # rstate: random_state
print('best:', best)
Out [50]:
100%|███████████████████████████████████████████████| 50/50 [00:09<00:00,  5.44trial/s, best loss: -0.9670616939700244]
best: {'colsample_bytree': 0.5424149213362504, 'learning_rate': 0.12601372924444681, 'max_depth': 17.0, 'min_child_weight': 2.0}

In [51]:
model = XGBClassifier(n_estimators=400,
                      learning_rate=round(best['learning_rate'], 5),
                      max_depth=int(best['max_depth']),
                      min_child_weight=int(best['min_child_weight']),
                      colsample_bytree=round(best['colsample_bytree'], 5)
                     )

evals = [(X_tr, y_tr), (X_val, y_val)]
model.fit(X_tr, y_tr, early_stopping_rounds=50, eval_metric='logloss', eval_set=evals, verbose=False)

pred = model.predict(X_test)
pred_proba = model.predict_proba(X_test)[:, 1]
get_clf_eval(y_test, pred, pred_proba)
Out [51]:
== Confusion Matrix ==
[[35  2]
 [ 2 75]]
Accuracy: 0.9649, Precision: 0.9740, Recall: 0.9740, F1: 0.9740, AUC: 0.9944

Reference

  • This post summarizes the lecture by instructor 심선조 in the SeSAC "인공지능 자연어처리, 컴퓨터비전 기술을 활용한 응용 SW 개발자 양성 과정" (applied SW developer training course using AI, NLP, and computer vision technologies).
