Natural Language Processing - RNN Lyrics Generation
Lyrics Generation Model - RNN
Given a snippet of lyrics, the model is trained to output the next several words (a small taste of generative modeling). For example, from the line '저 별을 따다가' it learns the pairs '저' → '별을' and '저 별을' → '따다가', and so on for every prefix.
Preprocessing
In [1]:
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences, to_categorical
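Depending on your Keras version these import paths may differ; in Keras 3 the legacy `keras.preprocessing.text.Tokenizer` module was removed. A minimal fallback, assuming TensorFlow 2.x with its bundled Keras (the behavior is identical):

# Assumption: TensorFlow 2.x; same classes and functions as the imports above
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical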
In [2]:
text = '''저 별을 따다가 니 귀에 걸어주고파
저 달 따다가 니 목에 걸어주고파
세상 모든 좋은 것만 해주고 싶은
이런 내 맘을 그댄 아나요'''
In [3]:
tok = Tokenizer()
tok.fit_on_texts([text])
vocab_size = len(tok.word_index) + 1  # +1 because index 0 is reserved for zero padding
vocab_size
Out [3]:
20
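As a quick sanity check (not in the original notebook), you can inspect how the fitted tokenizer maps words to integer indices. Keras assigns 1-based indices ordered by word frequency, which is why index 0 stays free for padding:

# Assumed quick check: inspect the word-to-index mapping and an encoded sentence
print(tok.word_index)                             # e.g. {'저': 1, '따다가': 2, ...} (order depends on counts)
print(tok.texts_to_sequences(['저 별을 따다가']))  # the same words as a list of integer IDs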
In [4]:
seq_list = []
for sentence in text.split('\n'):
    res = tok.texts_to_sequences([sentence])[0]
    for i in range(1, len(res)):
        seq = res[:i+1]       # every prefix of length >= 2 becomes a training sample
        seq_list.append(seq)

# Find the maximum sequence length, then zero-pad (pre-padding by default)
max_len = max(len(sent) for sent in seq_list)
seq_padded = pad_sequences(seq_list, maxlen=max_len)

# Split into X and y: the last token of each sequence is the label
X = seq_padded[:, :-1]
y = seq_padded[:, -1]

# One-hot encode the targets
y_hot = to_categorical(y, num_classes=vocab_size)
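Before training, it is worth verifying the array shapes (a small check added here, not in the original). With this corpus there are 19 prefix sequences and the longest has 6 tokens, so the inputs are 5 tokens wide:

# Shape check: X should be (num_sequences, max_len - 1), y_hot (num_sequences, vocab_size)
print(X.shape, y_hot.shape)   # expected: (19, 5) (19, 20) for this lyric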
Deep Learning
In [5]:
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Dense, SimpleRNN
In [6]:
model = Sequential([
    Embedding(vocab_size, 10),               # 10-dimensional word embeddings
    SimpleRNN(32),                           # 32 recurrent units
    Dense(vocab_size, activation='softmax')  # probability distribution over the vocabulary
])
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
history = model.fit(X, y_hot, epochs=1000, verbose=1)
Out [6]:
Epoch 1/1000
1/1 [==============================] - 1s 679ms/step - loss: 3.0037 - accuracy: 0.1053
Epoch 2/1000
1/1 [==============================] - 0s 5ms/step - loss: 2.9955 - accuracy: 0.1053
Epoch 3/1000
1/1 [==============================] - 0s 4ms/step - loss: 2.9874 - accuracy: 0.1579
Epoch 4/1000
1/1 [==============================] - 0s 4ms/step - loss: 2.9791 - accuracy: 0.1579
Epoch 5/1000
1/1 [==============================] - 0s 18ms/step - loss: 2.9708 - accuracy: 0.1579
...
Epoch 996/1000
1/1 [==============================] - 0s 3ms/step - loss: 0.0767 - accuracy: 0.9474
Epoch 997/1000
1/1 [==============================] - 0s 3ms/step - loss: 0.0767 - accuracy: 0.9474
Epoch 998/1000
1/1 [==============================] - 0s 2ms/step - loss: 0.0767 - accuracy: 0.9474
Epoch 999/1000
1/1 [==============================] - 0s 3ms/step - loss: 0.0767 - accuracy: 0.9474
Epoch 1000/1000
1/1 [==============================] - 0s 3ms/step - loss: 0.0767 - accuracy: 0.9474
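The accuracy plateaus at 18/19 ≈ 0.9474, meaning the model has memorized all but one of the 19 training sequences. To visualize convergence you could plot the recorded history (a minimal sketch, assuming matplotlib is installed):

import matplotlib.pyplot as plt

# Plot the loss and accuracy curves stored by model.fit
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['accuracy'], label='accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()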
In [7]:
def generate_sentence(model, starting_word, tok, n):
    sentence = starting_word
    # Predict the next word n times, growing the context each step
    for _ in range(n):
        encoded = tok.texts_to_sequences([starting_word])[0]
        # Padding to max_len (one longer than the training inputs) still works,
        # since neither Embedding nor SimpleRNN fixes the sequence length
        padded = pad_sequences([encoded], maxlen=max_len)
        res = model.predict(padded, verbose=0)
        res_softmax = np.argmax(res, axis=1)  # greedy decoding: take the most likely word
        # Alternative lookup via the word index:
        # for word, index in tok.word_index.items():
        #     if res_softmax == index:
        #         break
        word = tok.sequences_to_texts([res_softmax])[0]
        starting_word = starting_word + ' ' + word
        sentence = sentence + ' ' + word
    return sentence
In [8]:
generate_sentence(model, '저', tok, 2)
Out [8]:
'저 별을 따다가'
In [9]:
generate_sentence(model, '저', tok, 8)
Out [9]:
'저 별을 따다가 니 귀에 걸어주고파 그댄 그댄 아나요'
In [10]:
generate_sentence(model, '저', tok, 20)
Out [10]:
'저 별을 따다가 니 귀에 걸어주고파 그댄 그댄 아나요 내 맘을 그댄 아나요 목에 모든 좋은 목에 걸어주고파 그댄 아나요 걸어주고파'
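The repeated words ('그댄 그댄') are a side effect of greedy decoding: np.argmax always picks the single most likely next word, so once the tiny model drifts off the memorized lines it starts to loop. A common remedy is to sample from the softmax distribution, optionally sharpened by a temperature. The sketch below is a hypothetical variation, not part of the original notebook:

# Hypothetical variant: sample the next word instead of taking argmax
def generate_sentence_sampled(model, starting_word, tok, n, temperature=0.8):
    sentence = starting_word
    for _ in range(n):
        encoded = tok.texts_to_sequences([sentence])[0]
        padded = pad_sequences([encoded], maxlen=max_len)
        probs = model.predict(padded, verbose=0)[0]
        # Rescale the distribution: lower temperature -> closer to greedy argmax
        logits = np.log(probs + 1e-9) / temperature
        probs = np.exp(logits) / np.sum(np.exp(logits))
        idx = np.random.choice(len(probs), p=probs)
        word = tok.sequences_to_texts([[idx]])[0]
        sentence = sentence + ' ' + word
    return sentence

With a small temperature this behaves almost like the greedy version; raising it trades fidelity to the training lyric for more varied output.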