Natural Language Processing - Spam Mail Classification with an RNN
Spam Mail Classification - RNN
Preprocessing
import pandas as pd
data = pd.read_csv('datasets/spam.csv', encoding='latin1')
data = data[['v1', 'v2']]
data['v1'] = data['v1'].replace(['ham', 'spam'], [0, 1])
# data['v1'] = data['v1'].replace({'ham':0, 'spam':1})
data
| | v1 | v2 |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | 1 | This is the 2nd time we have tried 2 contact u... |
| 5568 | 0 | Will Ì_ b going to esplanade fr home? |
| 5569 | 0 | Pity, * was in mood for that. So...any other s... |
| 5570 | 0 | The guy did some bitching but I acted like i'd... |
| 5571 | 0 | Rofl. Its true to its name |

5572 rows × 2 columns
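For reference, the Kaggle release of spam.csv also ships three mostly empty Unnamed columns, which is why only v1 and v2 are kept above. A quick check, assuming that same file layout:
raw = pd.read_csv('datasets/spam.csv', encoding='latin1')
print(raw.columns.tolist())  # typically ['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']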
- Check for duplicate data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 v1 5572 non-null int64
1 v2 5572 non-null object
dtypes: int64(1), object(1)
memory usage: 87.2+ KB
data['v2'].nunique()
5169
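5,169 unique messages out of 5,572 rows means roughly 400 duplicates. Before dropping them, the duplicated messages themselves can be inspected:
data[data.duplicated(subset=['v2'], keep=False)].sort_values('v2').head()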
data = data.drop_duplicates(subset=['v2'])
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5169 entries, 0 to 5571
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 v1 5169 non-null int64
1 v2 5169 non-null object
dtypes: int64(1), object(1)
memory usage: 121.1+ KB
data['v2'].nunique()
5169
data['v1'].value_counts()
0 4516
1 653
Name: v1, dtype: int64
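Only about 12.6% of the messages (653 of 5,169) are spam, so the classes are imbalanced; the ratio can be read off directly, which is why stratify is used in the split below:
data['v1'].value_counts(normalize=True)  # 0: ~0.874, 1: ~0.126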
X_data = data['v2']
y_data = data['v1']
- Split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=0, stratify=y_data)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((4135,), (1034,), (4135,), (1034,))
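Because stratify=y_data was passed, both splits keep the same spam ratio; a quick sanity check on the 0/1 labels:
print(y_train.mean(), y_test.mean())  # both ≈ 0.126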
- Tokenization
from keras.preprocessing.text import Tokenizer
tok = Tokenizer()
tok.fit_on_texts(X_train)
X_train_encoded = tok.texts_to_sequences(X_train)
X_train_encoded[:2]
[[102, 1, 210, 230, 3, 17, 39], [1, 59, 8, 427, 17, 5, 137, 2, 2326]]
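The integer sequences can be decoded back to words to verify the encoding; sequences_to_texts performs the inverse lookup:
print(tok.sequences_to_texts(X_train_encoded[:1]))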
print(tok.word_index)
{'i': 1, 'to': 2, 'you': 3, 'a': 4, 'the': 5, 'u': 6, 'and': 7, 'in': 8, 'is': 9, 'me': 10, 'my': 11, 'for': 12, 'your': 13, 'it': 14, 'of': 15, 'have': 16, 'on': 17, 'call': 18, 'that': 19, 'are': 20, '2': 21, 'now': 22, 'so': 23, 'but': 24, 'not': 25, 'can': 26, 'or': 27, "i'm": 28, 'get': 29, 'at': 30, 'do': 31, 'if': 32, 'be': 33, 'will': 34, 'just': 35, 'with': 36, 'we': 37, 'no': 38, 'this': 39, 'ur': 40,
...
'khelate': 7804, 'kintu': 7805, 'opponenter': 7806, 'dhorte': 7807, 'lage': 7808, "tm'ing": 7809, 'laughs': 7810, 'adding': 7811, 'zeros': 7812, 'savings': 7813, 'hos': 7814, 'heroes': 7815, 'tips': 7816, 'genes': 7817, 'begun': 7818, 'registration': 7819, 'permanent': 7820, 'residency': 7821}
total_cnt = len(tok.word_index)
vocab_size = len(tok.word_index) + 1  # +1 so index 0 stays reserved for zero padding
vocab_size
7822
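One caveat: words that never appeared in X_train are silently dropped by texts_to_sequences, including later on X_test. If that matters, an out-of-vocabulary token can be reserved instead; a sketch (the nonsense word is made up for illustration):
tok_oov = Tokenizer(oov_token='<OOV>')  # '<OOV>' gets index 1
tok_oov.fit_on_texts(X_train)
print(tok_oov.texts_to_sequences(['free entry blargleword']))  # unknown word maps to 1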
- Check the length distribution of the messages
import matplotlib.pyplot as plt
plt.hist([len(sample) for sample in X_train_encoded], bins=100)  # lengths in tokens, to guide max_len
plt.xlabel('message length (tokens)')
plt.ylabel('count')
plt.show()
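The exact statistics behind the choice of max_len can also be printed, for example:
lens = [len(s) for s in X_train_encoded]
print(max(lens), sum(lens) / len(lens))  # longest and mean message length in tokens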
from keras.utils import pad_sequences
max_len = 200  # truncate messages longer than 200 tokens; zero-pad shorter ones
X_train_padded = pad_sequences(X_train_encoded, maxlen=max_len)
X_train_padded.shape
(4135, 200)
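By default pad_sequences prepends zeros and truncates from the front. If post-padding suits the task better, both behaviors are configurable; a variant of the call above, not what was run here:
X_train_post = pad_sequences(X_train_encoded, maxlen=max_len, padding='post', truncating='post')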
Deep Learning
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense
model = Sequential([
Embedding(vocab_size, 100),
SimpleRNN(32),
Dense(1, activation='sigmoid')
])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 100) 782200
simple_rnn (SimpleRNN) (None, 32) 4256
dense (Dense) (None, 1) 33
=================================================================
Total params: 786,489
Trainable params: 786,489
Non-trainable params: 0
_________________________________________________________________
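The parameter counts can be verified by hand: the Embedding layer holds 7,822 × 100 = 782,200 weights, the SimpleRNN has 100 × 32 (input) + 32 × 32 (recurrent) + 32 (bias) = 4,256, and the Dense layer has 32 weights + 1 bias = 33.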
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(X_train_padded, y_train, epochs=5, batch_size=64, validation_split=0.2)
Epoch 1/5
C:\Users\user\AppData\Roaming\Python\Python38\site-packages\keras\engine\data_adapter.py:1508: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
return t[start:end]
52/52 [==============================] - 3s 36ms/step - loss: 0.2688 - acc: 0.9144 - val_loss: 0.1273 - val_acc: 0.9613
Epoch 2/5
52/52 [==============================] - 2s 31ms/step - loss: 0.0851 - acc: 0.9794 - val_loss: 0.0962 - val_acc: 0.9698
Epoch 3/5
52/52 [==============================] - 2s 32ms/step - loss: 0.0425 - acc: 0.9885 - val_loss: 0.0716 - val_acc: 0.9794
Epoch 4/5
52/52 [==============================] - 2s 31ms/step - loss: 0.0244 - acc: 0.9927 - val_loss: 0.0623 - val_acc: 0.9843
Epoch 5/5
52/52 [==============================] - 2s 33ms/step - loss: 0.0135 - acc: 0.9958 - val_loss: 0.0615 - val_acc: 0.9831
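The FutureWarning at the start of training comes from Keras slicing the pandas Series y_train by position; passing a NumPy array instead silences it. A minor variant of the same call:
history = model.fit(X_train_padded, y_train.values, epochs=5, batch_size=64, validation_split=0.2)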
X_test_encoded = tok.texts_to_sequences(X_test)
X_test_padded = pad_sequences(X_test_encoded, maxlen=max_len)
model.evaluate(X_test_padded, y_test)
33/33 [==============================] - 0s 7ms/step - loss: 0.0889 - acc: 0.9768
[0.08886431902647018, 0.9767891764640808]
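To classify a new message, the same preprocessing has to be applied: encode with the fitted tokenizer, pad to max_len, then predict. A minimal sketch (the sample text is made up):
sample = ['Free entry! Text WIN now to claim your prize']
sample_padded = pad_sequences(tok.texts_to_sequences(sample), maxlen=max_len)
print(model.predict(sample_padded))  # sigmoid output: probability of spam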
Unlike classical machine learning, the deep learning model classifies a message by the combination of words it contains, i.e., by context.
But that does not mean it actually understands the meaning; interpreting what the model has learned is the domain of explainable AI (XAI).