Word Embeddings

Representing text as numbers

ML models take vectors as input, so the first step in any text-processing problem is to convert the input strings into numbers.

One-hot encoding

The first approach is to one-hot encode the input words: every word in the vocabulary is mapped to its own position in a vector, and only the position of the given word is set to 1.

Its advantage is that the vectors are trivially simple to construct, but it cannot capture any relationships between words.
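As a minimal sketch of this idea (the toy vocabulary and sentence below are made up for illustration), one-hot vectors can be built with tf.one_hot:

import tensorflow as tf

# Hypothetical toy vocabulary; in practice it would be built from the corpus.
vocab = ['the', 'cat', 'sat', 'on', 'mat']
word_to_index = {word: i for i, word in enumerate(vocab)}

sentence = ['the', 'cat', 'sat']
indices = [word_to_index[w] for w in sentence]
print(tf.one_hot(indices, depth=len(vocab)).numpy())
# [[1. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0.]
#  [0. 0. 1. 0. 0.]]

Each sentence of length L becomes an L × vocab_size matrix that is almost entirely zeros, which is why this representation is called sparse.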

Encoding each word with a unique number

To address the sparseness of one-hot encoding, this approach replaces the binary representation with an integer representation: each word is assigned a unique integer.

However, it still cannot model relationships between words, and as a result the parameters of a model trained on this encoding are not interpretable (or not meaningful).
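A minimal sketch of integer encoding, using the same made-up toy vocabulary as above:

# Hypothetical toy vocabulary; each word is assigned an arbitrary integer id.
vocab = ['the', 'cat', 'sat', 'on', 'mat']
word_to_index = {word: i + 1 for i, word in enumerate(vocab)}  # reserve 0 for padding

sentence = ['the', 'cat', 'sat', 'on', 'the', 'mat']
print([word_to_index[w] for w in sentence])  # [1, 2, 3, 4, 1, 5]

The vector is now dense (every position holds a value), but the ids are arbitrary: 'cat' being 2 and 'sat' being 3 says nothing about how related the two words are.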


Word embeddings

Word embeddings provide an efficient, dense representation in which similar words get similar encodings. Unlike the previous approaches, where the encoding values were specified by hand, these embeddings are learned from data. Through training, each word comes to be represented by a real-valued vector of a fixed length (the embedding length is a hyperparameter). A longer embedding can capture finer-grained relationships between words but requires more data to learn; a shorter one captures coarser relationships.
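A hand-made illustration of what a learned embedding space looks like (the 4-dimensional vectors below are invented for this sketch, not learned from data): similar words end up with vectors pointing in similar directions, which cosine similarity makes measurable.

import numpy as np

# Invented 4-dimensional embeddings for illustration only; a real model learns these values.
embeddings = {
    'cat': np.array([0.9, 0.1, 0.4, 0.0]),
    'kitten': np.array([0.85, 0.15, 0.5, 0.05]),
    'car': np.array([-0.2, 0.8, -0.3, 0.6]),
}

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings['cat'], embeddings['kitten']))  # close to 1 (similar)
print(cosine_similarity(embeddings['cat'], embeddings['car']))     # much lower (dissimilar)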

Code

This example tackles sentiment classification, i.e. classifying the sentiment expressed by a sentence. It is a binary classification problem: decide positive / negative from a sequence of words.

from datetime import datetime
import io
import os
import re
import shutil
import string

from absl import logging
import numpy as np
import tensorflow as tf

logging.set_verbosity(1)

Preparing the data

The IMDb dataset is used for this problem.

  1. Download the IMDb dataset
  2. Create a tf.data.Dataset
  3. Configure the Dataset for performance

Downloading the IMDb dataset

base_dir = '.'  # cache location for the download ('.' matches the ./aclImdb paths seen below)
IMDB_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

def download_imdb(url=IMDB_URL, base_dir=base_dir):
    dataset = tf.keras.utils.get_file(os.path.basename(url),
                                      url, 
                                      untar=True,
                                      cache_dir=base_dir,
                                      cache_subdir='')
    dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
    logging.info('%s', os.listdir(dataset_dir))
    return dataset_dir

if os.path.isdir(os.path.join(base_dir, 'aclImdb', 'train')):
    dataset_dir = os.path.join(base_dir, 'aclImdb')
else:
    dataset_dir = download_imdb(IMDB_URL)
Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84131840/84125825 [==============================] - 4s 0us/step
INFO:absl:['imdbEr.txt', 'test', 'README', 'imdb.vocab', 'train']
tf.keras.utils.get_file

API link: https://www.tensorflow.org/api_docs/python/tf/keras/utils/get_file

Downloads the file only if it is not already present in cache_dir.

The train/unsup directory (the unsupervised split) is not needed, so delete it.

unsup_dir = os.path.join(dataset_dir, 'train', 'unsup')
![ -d "{unsup_dir}" ] && rm -rf "{unsup_dir}"
!ls "{dataset_dir}/train"
labeledBow.feat  pos		urls_neg.txt  urls_unsup.txt
neg		 unsupBow.feat	urls_pos.txt

IMDb dataset layout

import glob

import pandas as pd

data_dirs = ['train/pos', 'train/neg', 'test/pos', 'test/neg']
data_list = [os.path.join(dataset_dir, path) for path in data_dirs]
data_list = [glob.glob(os.path.join(path, '*')) for path in data_list]

headers = [path.split('/') for path in data_dirs] # make multi-level index for Pandas
headers = pd.MultiIndex.from_tuples(headers)
print(headers)
data_list = pd.DataFrame(data_list, index=headers)
print(data_list)

assert not os.path.isdir(os.path.join(dataset_dir, 'train', 'unsup'))

print('-' * 50)
print(f'num of data for each split: {data_list.shape[1]}')
print(f'total data: {data_list.shape[0] * data_list.shape[1]}')

MultiIndex([('train', 'pos'),
            ('train', 'neg'),
            ( 'test', 'pos'),
            ( 'test', 'neg')],
           )
                                    0      ...                            12499
train pos  ./aclImdb/train/pos/6836_9.txt  ...  ./aclImdb/train/pos/2563_10.txt
      neg  ./aclImdb/train/neg/3006_3.txt  ...   ./aclImdb/train/neg/5741_1.txt
test  pos  ./aclImdb/test/pos/7998_10.txt  ...  ./aclImdb/test/pos/10682_10.txt
      neg   ./aclImdb/test/neg/4245_4.txt  ...    ./aclImdb/test/neg/1213_3.txt

[4 rows x 12500 columns]
--------------------------------------------------
num of data for each split: 12500
total data: 50000

Creating a tf.data.Dataset

Create a tf.data.Dataset with tf.keras.preprocessing.text_dataset_from_directory.

def make_dataset(path, seed=0, batch_size=1024, validation_split=0.2):
    logging.debug('path: %s, seed: %d, batch_size: %d, validation_split: %f',
                  path, seed, batch_size, validation_split)
    train_ds = tf.keras.preprocessing.text_dataset_from_directory(
        path,
        batch_size=batch_size,
        validation_split=validation_split, 
        subset='training',
        seed=seed)
    val_ds = tf.keras.preprocessing.text_dataset_from_directory(
        path,
        batch_size=batch_size,
        validation_split=validation_split, 
        subset='validation',
        seed=seed)
    return train_ds, val_ds
train_ds, val_ds = make_dataset(os.path.join(dataset_dir, 'train'), seed=123)
DEBUG:absl:path: ./aclImdb/train, seed: 123, batch_size: 1024, validation_split: 0.200000
Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.

Inspecting the dataset

for text_batch, label_batch in train_ds.take(1):
  for i in range(5):
    print(label_batch[i].numpy(), text_batch.numpy()[i])
0 b"Oh My God! Please, for the love of all that is holy, Do Not Watch This Movie! It it 82 minutes of my life I will never get back. Sure, I could have stopped watching half way through. But I thought it might get better. It Didn't. Anyone who actually enjoyed this movie is one seriously sick and twisted individual. No wonder us Australians/New Zealanders have a terrible reputation when it comes to making movies. Everything about this movie is horrible, from the acting to the editing. I don't even normally write reviews on here, but in this case I'll make an exception. I only wish someone had of warned me before I hired this catastrophe"
1 b'This movie is SOOOO funny!!! The acting is WONDERFUL, the Ramones are sexy, the jokes are subtle, and the plot is just what every high schooler dreams of doing to his/her school. I absolutely loved the soundtrack as well as the carefully placed cynicism. If you like monty python, You will love this film. This movie is a tad bit "grease"esk (without all the annoying songs). The songs that are sung are likable; you might even find yourself singing these songs once the movie is through. This musical ranks number two in musicals to me (second next to the blues brothers). But please, do not think of it as a musical per say; seeing as how the songs are so likable, it is hard to tell a carefully choreographed scene is taking place. I think of this movie as more of a comedy with undertones of romance. You will be reminded of what it was like to be a rebellious teenager; needless to say, you will be reminiscing of your old high school days after seeing this film. Highly recommended for both the family (since it is a very youthful but also for adults since there are many jokes that are funnier with age and experience.'
0 b"Alex D. Linz replaces Macaulay Culkin as the central figure in the third movie in the Home Alone empire. Four industrial spies acquire a missile guidance system computer chip and smuggle it through an airport inside a remote controlled toy car. Because of baggage confusion, grouchy Mrs. Hess (Marian Seldes) gets the car. She gives it to her neighbor, Alex (Linz), just before the spies turn up. The spies rent a house in order to burglarize each house in the neighborhood until they locate the car. Home alone with the chicken pox, Alex calls 911 each time he spots a theft in progress, but the spies always manage to elude the police while Alex is accused of making prank calls. The spies finally turn their attentions toward Alex, unaware that he has rigged devices to cleverly booby-trap his entire house. Home Alone 3 wasn't horrible, but probably shouldn't have been made, you can't just replace Macauley Culkin, Joe Pesci, or Daniel Stern. Home Alone 3 had some funny parts, but I don't like when characters are changed in a movie series, view at own risk."
0 b"There's a good movie lurking here, but this isn't it. The basic idea is good: to explore the moral issues that would face a group of young survivors of the apocalypse. But the logic is so muddled that it's impossible to get involved.<br /><br />For example, our four heroes are (understandably) paranoid about catching the mysterious airborne contagion that's wiped out virtually all of mankind. Yet they wear surgical masks some times, not others. Some times they're fanatical about wiping down with bleach any area touched by an infected person. Other times, they seem completely unconcerned.<br /><br />Worse, after apparently surviving some weeks or months in this new kill-or-be-killed world, these people constantly behave like total newbs. They don't bother accumulating proper equipment, or food. They're forever running out of fuel in the middle of nowhere. They don't take elementary precautions when meeting strangers. And after wading through the rotting corpses of the entire human race, they're as squeamish as sheltered debutantes. You have to constantly wonder how they could have survived this long... and even if they did, why anyone would want to make a movie about them.<br /><br />So when these dweebs stop to agonize over the moral dimensions of their actions, it's impossible to take their soul-searching seriously. Their actions would first have to make some kind of minimal sense.<br /><br />On top of all this, we must contend with the dubious acting abilities of Chris Pine. His portrayal of an arrogant young James T Kirk might have seemed shrewd, when viewed in isolation. But in Carriers he plays on exactly that same note: arrogant and boneheaded. It's impossible not to suspect that this constitutes his entire dramatic range.<br /><br />On the positive side, the film *looks* excellent. It's got an over-sharp, saturated look that really suits the southwestern US locale. But that can't save the truly feeble writing nor the paper-thin (and annoying) characters. Even if you're a fan of the end-of-the-world genre, you should save yourself the agony of watching Carriers."
0 b'I saw this movie at an actual movie theater (probably the $2.00 one) with my cousin and uncle. We were around 11 and 12, I guess, and really into scary movies. I remember being so excited to see it because my cool uncle let us pick the movie (and we probably never got to do that again!) and sooo disappointed afterwards!! Just boring and not scary. The only redeeming thing I can remember was Corky Pigeon from Silver Spoons, and that wasn\'t all that great, just someone I recognized. I\'ve seen bad movies before and this one has always stuck out in my mind as the worst. This was from what I can recall, one of the most boring, non-scary, waste of our collective $6, and a waste of film. I have read some of the reviews that say it is worth a watch and I say, "Too each his own", but I wouldn\'t even bother. Not even so bad it\'s good.'

Configuring the dataset

Add the following settings to the Dataset for better performance.

cache(): keeps the data in memory. If the dataset is too large to fit in memory, an on-disk cache can be created instead, which is still faster than reading many small files.

prefetch(): overlaps data preprocessing with model execution during training.

The tf.data performance guide (https://www.tensorflow.org/guide/data_performance) describes the on-disk cache and several other techniques.

AUTOTUNE = tf.data.experimental.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

Embedding layer

An embedding layer can be viewed as a lookup table that takes integer indices as input and returns the corresponding real-valued vectors.

The embedding size is a hyperparameter that needs to be tuned experimentally.

tf.keras.layers.Embedding

tf.keras.layers.Embedding(
    input_dim, output_dim, embeddings_initializer='uniform',
    embeddings_regularizer=None, activity_regularizer=None,
    embeddings_constraint=None, mask_zero=False, input_length=None, **kwargs
)
EMBEDDING_SIZE = 5
embedding_layer = tf.keras.layers.Embedding(1000, EMBEDDING_SIZE)

When an embedding layer is created, its weights are initialized randomly and are then gradually adjusted through backpropagation. Once training is complete, the embeddings encode the similarity between words.

result = embedding_layer(tf.constant([1,2,3]))
result.numpy()
array([[-0.00623799,  0.02114315,  0.03288199, -0.02495058,  0.04346171],
       [-0.00650135,  0.03021416, -0.01943512,  0.03576155, -0.02844778],
       [ 0.02554953,  0.03821062, -0.03710605,  0.035557  , -0.01577029]],
      dtype=float32)
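Because the layer is nothing more than a lookup table, the same result can be reproduced by indexing into its weight matrix directly (a quick sanity check added here for illustration):

# Row i of the (1000, EMBEDDING_SIZE) weight matrix is the embedding for index i.
embedding_matrix = embedding_layer.get_weights()[0]
manual_lookup = tf.gather(embedding_matrix, [1, 2, 3])
np.testing.assert_allclose(result.numpy(), manual_lookup.numpy())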

The embedding layer converts every input value into its corresponding embedding, so one extra dimension is added to the output.

If the embedding size is $N$, an input of shape $(I, J, K)$ produces an output of shape $(I, J, K, N)$.

input = np.random.randint(6, size=(2, 3))
input = tf.convert_to_tensor(input, dtype=tf.int32)
embeddings = embedding_layer(input)

assert embeddings.shape == list(input.shape) + [EMBEDDING_SIZE]
print(embeddings.shape)
(2, 3, 5)

There are several standard ways to handle variable-length input:

  • RNN
  • Attention
  • pooling layer
  • Padding + cropping

Here, the pooling-layer approach is used, as sketched below.
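As a quick sketch of why pooling handles variable length (reusing the embedding_layer defined above), GlobalAveragePooling1D averages over the sequence dimension, so batches with different sequence lengths collapse to the same fixed-size output:

pooling = tf.keras.layers.GlobalAveragePooling1D()

short_seq = embedding_layer(tf.constant([[1, 2, 3]]))           # shape (1, 3, 5)
long_seq = embedding_layer(tf.constant([[1, 2, 3, 4, 5, 6]]))   # shape (1, 6, 5)

print(pooling(short_seq).shape)  # (1, 5)
print(pooling(long_seq).shape)   # (1, 5)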

Model

This example uses a simple model.

The model architecture is as follows.

input: a batch of string sentences, shape (batch,)

  1. Text preprocessing layer: (batch, sequence_length)
    • standardizes the text
    • pads/crops each sequence to sequence_length
  2. Embedding layer: (batch, sequence_length, embedding_dim)
    • converts the positive integer inputs into their embedding vectors
  3. Global 1D pooling layer: (batch, embedding_dim)
    • averages over the sequence dimension to produce a fixed-length vector
  4. Fully connected layer: (batch, num_of_fc)
  5. Output layer: (batch, 1)

Text Preprocessing

Define the text preprocessing used by the model.

The TextVectorization class is used here: it first performs standardization and then pads/crops each sequence to a length of sequence_length.

Finally, calling adapt() on the dataset builds the vocabulary from the dataset's words.

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(stripped_html,
                                    '[%s]' % re.escape(string.punctuation),
                                    '')

# Vocabulary size and number of words in a sequence.
vocab_size = 10000
sequence_length = 100

# Use the text vectorization layer to normalize, split, and map strings to 
# integers. Note that the layer uses the custom standardization defined above. 
# Set maximum_sequence length as all samples are not of the same length.
vectorize_layer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)
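As a quick check that the adapted layer behaves as described (the sample sentence below is made up, not taken from the dataset), it maps a raw string to integer token ids padded to sequence_length:

sample = tf.constant(['This movie was great, I loved it!'])
vectorized = vectorize_layer(sample)
print(vectorized.shape)            # (1, 100): padded/cropped to sequence_length
print(vectorized.numpy()[0, :10])  # first 10 token ids; unused positions are 0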

Model

embedding_dim=16

model = tf.keras.Sequential([
  vectorize_layer,
  tf.keras.layers.Embedding(vocab_size, embedding_dim, name="embedding"),
  tf.keras.layers.GlobalAveragePooling1D(),
  tf.keras.layers.Dense(16, activation='relu'),
  tf.keras.layers.Dense(1)
])

Compile and Train

Add the utilities needed for training.

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

Compile the model and run training.

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(
    train_ds,
    validation_data=val_ds, 
    epochs=15,
    callbacks=[tensorboard_callback])
Epoch 1/15
20/20 [==============================] - 5s 202ms/step - loss: 0.6927 - accuracy: 0.5037 - val_loss: 0.6906 - val_accuracy: 0.4886
Epoch 2/15
20/20 [==============================] - 2s 92ms/step - loss: 0.6892 - accuracy: 0.5037 - val_loss: 0.6854 - val_accuracy: 0.4886
Epoch 3/15
20/20 [==============================] - 2s 91ms/step - loss: 0.6830 - accuracy: 0.5037 - val_loss: 0.6766 - val_accuracy: 0.4886
Epoch 4/15
20/20 [==============================] - 2s 92ms/step - loss: 0.6726 - accuracy: 0.5037 - val_loss: 0.6630 - val_accuracy: 0.4886
Epoch 5/15
20/20 [==============================] - 2s 94ms/step - loss: 0.6566 - accuracy: 0.5037 - val_loss: 0.6441 - val_accuracy: 0.4886
Epoch 6/15
20/20 [==============================] - 2s 92ms/step - loss: 0.6348 - accuracy: 0.5039 - val_loss: 0.6205 - val_accuracy: 0.4962
Epoch 7/15
20/20 [==============================] - 2s 91ms/step - loss: 0.6075 - accuracy: 0.5357 - val_loss: 0.5933 - val_accuracy: 0.5784
Epoch 8/15
20/20 [==============================] - 2s 92ms/step - loss: 0.5762 - accuracy: 0.6235 - val_loss: 0.5645 - val_accuracy: 0.6404
Epoch 9/15
20/20 [==============================] - 2s 92ms/step - loss: 0.5429 - accuracy: 0.6898 - val_loss: 0.5360 - val_accuracy: 0.6824
Epoch 10/15
20/20 [==============================] - 2s 93ms/step - loss: 0.5096 - accuracy: 0.7375 - val_loss: 0.5094 - val_accuracy: 0.7176
Epoch 11/15
20/20 [==============================] - 2s 92ms/step - loss: 0.4781 - accuracy: 0.7664 - val_loss: 0.4859 - val_accuracy: 0.7412
Epoch 12/15
20/20 [==============================] - 2s 92ms/step - loss: 0.4494 - accuracy: 0.7903 - val_loss: 0.4656 - val_accuracy: 0.7550
Epoch 13/15
20/20 [==============================] - 2s 92ms/step - loss: 0.4236 - accuracy: 0.8096 - val_loss: 0.4485 - val_accuracy: 0.7652
Epoch 14/15
20/20 [==============================] - 2s 91ms/step - loss: 0.4008 - accuracy: 0.8228 - val_loss: 0.4342 - val_accuracy: 0.7760
Epoch 15/15
20/20 [==============================] - 2s 90ms/step - loss: 0.3805 - accuracy: 0.8354 - val_loss: 0.4223 - val_accuracy: 0.7866
<tensorflow.python.keras.callbacks.History at 0x7f7761f350f0>

The model structure is shown below.

model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
text_vectorization (TextVect (None, 100)               0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 16)           160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
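The parameter counts in the summary can be verified by hand: the embedding holds vocab_size × embedding_dim weights, and each Dense layer has inputs × units weights plus units biases.

assert vocab_size * embedding_dim == 160000   # embedding: 10000 * 16
assert 16 * 16 + 16 == 272                    # dense: 16 inputs * 16 units + 16 biases
assert 16 * 1 + 1 == 17                       # dense_1: 16 inputs * 1 unit + 1 bias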

The training progress can be viewed below.

%load_ext tensorboard
%tensorboard --logdir logs

(Judging by the still-decreasing loss, training for a few more epochs would probably have been fine.)

Extracting and saving the trained embeddings

Extract the embeddings trained above and save them for later use.

The embedding matrix has shape (vocab_size, embedding_dimension).

The embedding layer's weights can be retrieved with get_layer() followed by get_weights().

The TextVectorization layer's get_vocabulary() method returns the metadata (the token string) for each entry in the vocabulary.

def get_embedding(model,
                  embedding_layer_name='embedding',
                  vectorize_layer_name='text_vectorization'):
    weights = model.get_layer(embedding_layer_name).get_weights()[0]
    meta = model.get_layer(vectorize_layer_name).get_vocabulary()
    return weights, meta

weights, vocab = get_embedding(model, vectorize_layer_name=vectorize_layer.name)

The extracted embeddings can be explored with the Embedding Projector.

Save the extracted weights and metadata in tab-separated format and upload them.

def save_embedding(weights, meta, log_dir='.'):
    out_v = io.open(os.path.join(log_dir, 'vectors.tsv'), 'w', encoding='utf-8')
    out_m = io.open(os.path.join(log_dir, 'metadata.tsv'), 'w', encoding='utf-8')

    for index, word in enumerate(meta):
        if  index == 0: continue # skip 0, it's padding.
        vec = weights[index] 
        out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        out_m.write(word + "\n")
    out_v.close()
    out_m.close()
save_embedding(weights, vocab)

Embedding Projector in Colab

As described at https://www.tensorflow.org/tensorboard/tensorboard_projector_plugin, the embeddings can be visualized directly in TensorBoard inside Colab.

log_dir='/logs/imdb-example/'
if not os.path.exists(log_dir):
    os.makedirs(log_dir)
save_embedding(weights, vocab, log_dir=log_dir)

Here, the weights are saved as a checkpoint.

weights = tf.Variable(weights)

# Create a checkpoint from embedding, the filename and key are
# name of the tensor.
checkpoint = tf.train.Checkpoint(embedding=weights)
checkpoint.save(os.path.join(log_dir, "embedding.ckpt"))

!ls /logs/imdb-example
checkpoint			      metadata.tsv
embedding.ckpt-1.data-00000-of-00001  projector_config.pbtxt
embedding.ckpt-1.index		      vectors.tsv

Set up the projector configuration

from tensorboard.plugins import projector

# Set up config
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
# The name of the tensor will be suffixed by `/.ATTRIBUTES/VARIABLE_VALUE`
embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = 'metadata.tsv'
projector.visualize_embeddings(log_dir, config)

Visualize with TensorBoard

%tensorboard --logdir /logs/imdb-example/
Reusing TensorBoard on port 6007 (pid 223), started 0:03:18 ago. (Use '!kill 223' to kill it.)
