5 분 소요


사이킷런으로 시작하는 머신러닝

사이킷런으로 수행하는 타이타닉 생존자 예측

데이터 처리

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df = pd.read_csv('titanic.csv')
In [3]:
df.head(2)
Out [3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
In [4]:
df.info()
Out [4]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

In [5]:
df.isnull().sum()
Out [5]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Null값 수정

  • Age: nan값 평균값으로
  • Cabin: nan값 ‘N’으로 변경
  • Embarked: nan값 ‘N’으로 변경
In [6]:
df['Age'].fillna(df['Age'].mean(), inplace=True)
df['Cabin'].fillna('N', inplace=True)
df['Embarked'].fillna('N', inplace=True)
In [7]:
print("null값 :", df.isnull().sum().sum())
Out [7]:
null값 : 0

In [8]:
df.info()
Out [8]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        891 non-null    object 
 11  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

문자열 컬럼 수정

  • Name, Ticket: 안씀
  • Sex, Cabin, Embarked: 수정
In [9]:
df['Sex'].value_counts() # Nan은 체크하지 않으므로 미리 처리해야함
# 바로 숫자로 변경가능
Out [9]:
male      577
female    314
Name: Sex, dtype: int64
In [10]:
df['Cabin'].value_counts()
# 첫문자만 남기고 제거해야함
Out [10]:
N              687
C23 C25 C27      4
G6               4
B96 B98          4
C22 C26          3
              ... 
E34              1
C7               1
C54              1
E36              1
C148             1
Name: Cabin, Length: 148, dtype: int64
In [11]:
df['Embarked'].value_counts()
# 바로 숫자로 변경가능
Out [11]:
S    644
C    168
Q     77
N      2
Name: Embarked, dtype: int64
In [12]:
# cabin의 첫글자만 남기기
df['Cabin'] = df['Cabin'].str[:1]

데이터 탐색

In [13]:
df.groupby(['Sex', 'Survived'])['Survived'].count()
Out [13]:
Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64
In [14]:
sns.barplot(data=df, x='Sex', y='Survived', palette='pastel')
Out [14]:
<AxesSubplot:xlabel='Sex', ylabel='Survived'>

img

In [15]:
sns.barplot(data=df, x='Pclass', y='Survived', palette='hls')
Out [15]:
<AxesSubplot:xlabel='Pclass', ylabel='Survived'>

img

In [16]:
ax = sns.barplot(data=df, x='Pclass', y='Survived', hue='Sex', palette='husl', errwidth=0)
plt.grid()
for p in ax.containers:
    ax.bar_label(p,)
Out [16]:

img

In [17]:
df['Age'].value_counts()
# 나잇대 그룹으로 묶기
Out [17]:
29.699118    177
24.000000     30
22.000000     27
18.000000     26
28.000000     25
            ... 
36.500000      1
55.500000      1
0.920000       1
23.500000      1
74.000000      1
Name: Age, Length: 89, dtype: int64
In [18]:
def get_category(age):
    cat = ''
    if age <= -1: cat = 'Unknown'
    elif age <= 5: cat = 'Baby'
    elif age <= 12: cat = 'Child'
    elif age <= 18: cat = 'Teenager'
    elif age <= 25: cat = 'Student'
    elif age <= 35: cat = 'Young Adult'
    elif age <= 60: cat = 'Adult'
    else: cat = 'Elderly'
    
    return cat
In [19]:
df['Age_cat'] = df['Age'].apply(lambda x : get_category(x))
In [20]:
plt.figure(figsize=(10, 5))
group_name = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Elderly']
sns.barplot(data=df, x='Age_cat', y='Survived', hue='Sex', order=group_name, palette='hls')
plt.legend(loc='upper left')
Out [20]:
<matplotlib.legend.Legend at 0x1a6bf7b7910>

img

In [21]:
df.drop('Age_cat', axis=1, inplace=True)

머신러닝

LabelEncoder 적용

In [22]:
from sklearn.preprocessing import LabelEncoder
In [23]:
def encode_features(dataDF):
    features = ['Sex', 'Cabin', 'Embarked']
    for feature in features:
        le = LabelEncoder()
        dataDF[feature] = le.fit_transform(dataDF[feature])
        print(le.classes_) # label 정보 저장
    return dataDF
In [24]:
df = encode_features(df)
Out [24]:
['female' 'male']
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'N' 'T']
['C' 'N' 'Q' 'S']

In [25]:
df.head(2)
Out [25]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 7 3
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 2 0

재사용 가능한 함수 작성

In [26]:
# Null 처리 함수
def fillna(df):
    df['Age'].fillna(df['Age'].mean(), inplace=True)
    df['Cabin'].fillna('N', inplace=True)
    df['Embarked'].fillna('N', inplace=True)
    df['Fare'].fillna('0', inplace=True)
    return df

# 불필요한 feature 제거
def drop_features(df):
    df.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)
    return df

# 레이블 인코딩
def format_features(df):
    from sklearn.preprocessing import LabelEncoder
    df['Cabin'] = df['Cabin'].str[:1]
    features = ['Sex', 'Cabin', 'Embarked']
    for feature in features:
        le = LabelEncoder()
        df[feature] = le.fit_transform(df[feature])
        print(le.classes_)
    return df

# 데이터 전처리 함수 전체 호출
def transform_features(df):
    df = fillna(df)
    df = drop_features(df)
    df = format_features(df)
    return df

원본 재로딩, 함수 작동 확인

In [27]:
df = pd.read_csv('titanic.csv')
y = df['Survived']
X = df.drop(columns=['Survived'])

X = transform_features(X)
Out [27]:
['female' 'male']
['A' 'B' 'C' 'D' 'E' 'F' 'G' 'N' 'T']
['C' 'N' 'Q' 'S']

In [28]:
X.head(2)
Out [28]:
Pclass Sex Age SibSp Parch Fare Cabin Embarked
0 3 1 22.0 1 0 7.2500 7 3
1 1 0 38.0 1 0 71.2833 2 0

머신러닝

In [29]:
from sklearn.model_selection import train_test_split
In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)
In [31]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier # tree가 여러개
from sklearn.linear_model import LogisticRegression # 이진분류에 사용(이름은 회귀지만 분류 알고리즘)
from sklearn.metrics import accuracy_score
In [32]:
# 의사결정트리, 랜덤포레스트, 로지스틱회귀를 위한 사이킷런 Classifier 클래스 생성
dt_clf = DecisionTreeClassifier(random_state=11)
rf_clf = RandomForestClassifier(random_state=11)
lr_clf = LogisticRegression(solver='liblinear') # fit_intercept=True: y절편을 고정하지 않고 찾음
In [33]:
# DesisionTreeClassifier 학습/예측/평가
dt_clf.fit(X_train, y_train)
dt_pred = dt_clf.predict(X_test)
print("의사결정트리 정확도 :", accuracy_score(y_test, dt_pred))
Out [33]:
의사결정트리 정확도 : 0.7877094972067039

In [34]:
# RandomForestClassifier 학습/예측/평가
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)
print("랜덤포레스트 정확도 :", accuracy_score(y_test, rf_pred))
Out [34]:
랜덤포레스트 정확도 : 0.8547486033519553

In [35]:
# LogisticRegression 학습/예측/평가
lr_clf.fit(X_train, y_train)
lr_pred = lr_clf.predict(X_test)
print("로지스틱회귀 정확도 :", accuracy_score(y_test, lr_pred))
Out [35]:
로지스틱회귀 정확도 : 0.8659217877094972

교차검증

KFold
In [36]:
from sklearn.model_selection import KFold
In [37]:
def exec_kfold(clf, folds=5):
    kfold = KFold(n_splits=folds, shuffle=False)
    scores = []
    
    for iter_count, (train_index, test_index) in enumerate(kfold.split(X)):
        # X_train, X_test = X.values[train_index], X.values[test_index]
        # y_train, y_test = y.values[train_index], y.values[test_index]
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # Classifier 학습, 예측, 정확도 계산
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)
        accuracy = accuracy_score(y_test, pred)
        scores.append(accuracy)
        print(f"{iter_count+1}차 교차 검증 정확도 : {accuracy:.4f}")
    
    mean_score = np.mean(scores)
    print(f"평균 정확도 : {mean_score:.4f}")
In [38]:
exec_kfold(dt_clf, folds=5)
Out [38]:
1차 교차 검증 정확도 : 0.7542
2차 교차 검증 정확도 : 0.7809
3차 교차 검증 정확도 : 0.7865
4차 교차 검증 정확도 : 0.7697
5차 교차 검증 정확도 : 0.8202
평균 정확도 : 0.7823

cross_val_score()
In [39]:
from sklearn.model_selection import cross_val_score
In [40]:
scores = cross_val_score(dt_clf, X, y, cv=5)
In [41]:
for iter_count, accuracy in enumerate(scores):
    print(f"{iter_count+1}차 교차 검증 정확도 : {accuracy:.4f}")
    
print(f"평균 정확도 : {np.mean(scores):.4f}") # cross_val_score()는 StratifiedKFold를 사용하기에 KFold와는 결과가 다름
Out [41]:
1차 교차 검증 정확도 : 0.7430
2차 교차 검증 정확도 : 0.7753
3차 교차 검증 정확도 : 0.7921
4차 교차 검증 정확도 : 0.7865
5차 교차 검증 정확도 : 0.8427
평균 정확도 : 0.7879

GridSearchCV
In [42]:
from sklearn.model_selection import GridSearchCV
In [43]:
param = {
    'max_depth':[2, 3, 5, 10],
    'min_samples_split':[2, 3, 5],
    'min_samples_leaf':[1, 5, 8] # root node (트리 최상위), leaf node(트리 최말단) # leaf node가 되기 위한 최소 샘플 수 # defaut 1
}
In [44]:
grid = GridSearchCV(dt_clf, param, cv=5, scoring='accuracy') # 경우의 수: 4*3*3*5
grid.fit(X_train, y_train)
Out [44]:
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=11),
             param_grid={'max_depth': [2, 3, 5, 10],
                         'min_samples_leaf': [1, 5, 8],
                         'min_samples_split': [2, 3, 5]},
             scoring='accuracy')
In [45]:
pred = grid.predict(X_test)
In [46]:
print("의사결정트리 GridSearchCV 정확도 :", accuracy_score(y_test, pred))
Out [46]:
의사결정트리 GridSearchCV 정확도 : 0.8715083798882681

In [47]:
# 최적 하이머 파라미터
grid.best_params_
Out [47]:
{'max_depth': 3, 'min_samples_leaf': 5, 'min_samples_split': 2}

Reference

  • 이 포스트는 SeSAC 인공지능 자연어처리, 컴퓨터비전 기술을 활용한 응용 SW 개발자 양성 과정 - 심선조 강사님의 강의를 정리한 내용입니다.

댓글남기기