Spam Mail Classification - RNN


In [1]:
import pandas as pd
In [2]:
data = pd.read_csv('datasets/spam.csv', encoding='latin1')
data = data[['v1', 'v2']]
data['v1'] = data['v1'].replace(['ham', 'spam'], [0, 1])
# Equivalent: data['v1'] = data['v1'].replace({'ham':0, 'spam':1})
data
Out [2]:
v1 v2
0 0 Go until jurong point, crazy.. Available only ...
1 0 Ok lar... Joking wif u oni...
2 1 Free entry in 2 a wkly comp to win FA Cup fina...
3 0 U dun say so early hor... U c already then say...
4 0 Nah I don't think he goes to usf, he lives aro...
... ... ...
5567 1 This is the 2nd time we have tried 2 contact u...
5568 0 Will Ì_ b going to esplanade fr home?
5569 0 Pity, * was in mood for that. So...any other s...
5570 0 The guy did some bitching but I acted like i'd...
5571 0 Rofl. Its true to its name

5572 rows × 2 columns

  • Check for duplicate data
In [3]:
data.info()
Out [3]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v1      5572 non-null   int64 
 1   v2      5572 non-null   object
dtypes: int64(1), object(1)
memory usage: 87.2+ KB

In [4]:
Out [4]:
In [5]:
# Keep only the first occurrence of each message text
data = data.drop_duplicates(subset=['v2'])
data.info()
Out [5]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5169 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v1      5169 non-null   int64 
 1   v2      5169 non-null   object
dtypes: int64(1), object(1)
memory usage: 121.1+ KB

In [6]:
Out [6]:
In [7]:
data['v1'].value_counts()
Out [7]:
0    4516
1     653
Name: v1, dtype: int64
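
The class counts above show a clear imbalance, so it helps to know what accuracy a trivial classifier would reach before reading any model score. The following is a minimal sketch that only re-uses the two counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4516, 1: 653}  # 0 = ham, 1 = spam

total = sum(counts.values())
spam_ratio = counts[1] / total

# Accuracy of a trivial model that always predicts the majority class (ham).
baseline_acc = max(counts.values()) / total

print(f"total messages: {total}")
print(f"spam ratio: {spam_ratio:.4f}")
print(f"majority-class baseline accuracy: {baseline_acc:.4f}")
```

Any accuracy reported later is only meaningful to the extent that it exceeds this baseline.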
In [8]:
X_data = data['v2']
y_data = data['v1']
  • Split into train and test sets
In [9]:
from sklearn.model_selection import train_test_split
In [10]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=0, stratify=y_data)
In [11]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out [11]:
((4135,), (1034,), (4135,), (1034,))
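
The shapes above can be checked by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remainder goes to the training set, which is consistent with the shapes printed above; the variable names are illustrative.

```python
import math

n_total = 5169     # rows left after drop_duplicates
test_size = 0.2

# Assumption: the test set size is rounded up; everything else is training data.
n_test = math.ceil(n_total * test_size)
n_train = n_total - n_test

print(n_train, n_test)
```

Because the split uses `stratify=y_data`, both parts should keep roughly the same spam ratio as the full data set.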
  • Tokenization
In [12]:
from keras.preprocessing.text import Tokenizer
In [13]:
tok = Tokenizer()
tok.fit_on_texts(X_train)  # build the word index from the training texts only
X_train_encoded = tok.texts_to_sequences(X_train)
X_train_encoded[:2]
Out [13]:
[[102, 1, 210, 230, 3, 17, 39], [1, 59, 8, 427, 17, 5, 137, 2, 2326]]
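
To make the integer sequences above less opaque, here is a simplified, pure-Python stand-in for the idea behind a frequency-ranked word index. It is not the Keras implementation: the regular expression, the helper names `build_word_index` and `encode`, and the toy texts are all assumptions made for illustration.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z0-9']+")  # simplified tokenization rule (assumption)

def build_word_index(texts):
    """Map each word to an integer rank, where 1 is the most frequent word."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def encode(texts, word_index):
    """Replace words by their index; words missing from the index are dropped."""
    return [
        [word_index[w] for w in WORD_RE.findall(t.lower()) if w in word_index]
        for t in texts
    ]

train_texts = ["free entry free prize", "ok see you later", "free call now"]
word_index = build_word_index(train_texts)
encoded = encode(["free prize now", "unseen words only ok"], word_index)
print(word_index["free"], encoded)
```

Index 0 is never assigned to a word, which is why it remains available as a padding value. The second example also shows why the index should be built from the training texts: words that appear only at test time simply vanish from the encoded sequence.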
In [14]:
Out [14]:
In [15]:
total_cnt = len(tok.word_index)
vocab_size = total_cnt + 1  # +1 because index 0 is reserved for zero padding
Out [15]:
  • Check the distribution of mail subject lengths
In [16]:
import matplotlib.pyplot as plt
In [17]:
# Note: len(sample) counts characters per raw text, not tokens
plt.hist([len(sample) for sample in X_data], bins=100)
Out [17]:


In [18]:
from keras.utils import pad_sequences
In [19]:
max_len = 200  # subjects longer than 200 words are truncated; shorter ones are filled with 0
X_train_padded = pad_sequences(X_train_encoded, maxlen=max_len)
X_train_padded.shape
Out [19]:
(4135, 200)
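
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`). The following pure-Python sketch mimics that default behaviour on plain lists; the helper `pad_pre` is hypothetical and only meant to show the rule, not to replace the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences; keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)   # pad at the front
    return padded

demo = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(demo)
```

Padding at the front keeps the real tokens at the end of the sequence, so they are the last inputs the recurrent layer sees before it emits its final state.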


In [20]:
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense
In [21]:
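model = Sequential([
    Embedding(vocab_size, 100),      # 100-dimensional word vectors
    SimpleRNN(32),                   # 32 recurrent units, matching the summary output
    Dense(1, activation='sigmoid')   # probability that the message is spam
])
model.summary()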
Out [21]:
Model: "sequential"
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 100)         782200    
 simple_rnn (SimpleRNN)      (None, 32)                4256      
 dense (Dense)               (None, 1)                 33        
Total params: 786,489
Trainable params: 786,489
Non-trainable params: 0
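
The parameter counts in the summary can be reproduced with a few lines of arithmetic. In this sketch, `vocab_size = 7822` is not printed anywhere in the notebook; it is inferred from the 782,200 embedding parameters divided by the 100 embedding dimensions.

```python
vocab_size = 7822   # inferred: 782,200 embedding params / 100 dimensions
embed_dim = 100
rnn_units = 32

# Embedding: one vector of size embed_dim per vocabulary index.
embedding_params = vocab_size * embed_dim

# SimpleRNN: input weights + recurrent weights + one bias per unit.
rnn_params = rnn_units * (embed_dim + rnn_units + 1)

# Dense(1): one weight per RNN unit plus a single bias.
dense_params = rnn_units * 1 + 1

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
```

Almost all parameters sit in the embedding table, so the vocabulary size dominates the size of this model.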

In [22]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(X_train_padded, y_train, epochs=5, batch_size=64, validation_split=0.2)
Out [22]:
Epoch 1/5

52/52 [==============================] - 3s 36ms/step - loss: 0.2688 - acc: 0.9144 - val_loss: 0.1273 - val_acc: 0.9613
Epoch 2/5
52/52 [==============================] - 2s 31ms/step - loss: 0.0851 - acc: 0.9794 - val_loss: 0.0962 - val_acc: 0.9698
Epoch 3/5
52/52 [==============================] - 2s 32ms/step - loss: 0.0425 - acc: 0.9885 - val_loss: 0.0716 - val_acc: 0.9794
Epoch 4/5
52/52 [==============================] - 2s 31ms/step - loss: 0.0244 - acc: 0.9927 - val_loss: 0.0623 - val_acc: 0.9843
Epoch 5/5
52/52 [==============================] - 2s 33ms/step - loss: 0.0135 - acc: 0.9958 - val_loss: 0.0615 - val_acc: 0.9831
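
The `52/52` in the log is the number of batches per epoch. A small sketch of where it comes from, assuming that `validation_split=0.2` holds out 20% of the 4,135 training rows and that the last, partially filled batch still counts as a step:

```python
import math

n_train = 4135      # rows in X_train_padded
batch_size = 64

n_val = n_train * 20 // 100   # rows held out by validation_split=0.2
n_fit = n_train - n_val       # rows actually used for weight updates
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_fit, n_val, steps_per_epoch)
```

The `val_loss` and `val_acc` columns are computed on the held-out rows only, so they are a better guide to overfitting than `loss` and `acc`.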

In [23]:
X_test_encoded = tok.texts_to_sequences(X_test)
X_test_padded = pad_sequences(X_test_encoded, maxlen=max_len)
model.evaluate(X_test_padded, y_test)
Out [23]:
33/33 [==============================] - 0s 7ms/step - loss: 0.0889 - acc: 0.9768

[0.08886431902647018, 0.9767891764640808]
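
Accuracy alone can flatter a model on imbalanced data, so precision and recall for the spam class are a useful complement. The sketch below is a plain-Python illustration; the helper `binary_metrics` is hypothetical, and the labels and scores are made up for the example and are not results from this notebook.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Return precision, recall and F1 for the positive (spam) class."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels and scores, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.2, 0.7, 0.05, 0.9, 0.4, 0.8, 0.3]

precision, recall, f1 = binary_metrics(y_true, y_prob)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In the notebook, the scores could be taken from `model.predict(X_test_padded).ravel()` and the labels from `y_test`.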

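Unlike conventional machine learning, the deep learning model classifies a message by the combination of characters in its subject, that is, by context.
However, this does not mean that it understands the actual meaning (which is the concern of XAI).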

