Text Preprocessing in NLP: Tokenization & Word Embeddings Explained
In the last lesson, we explored sequence modeling and text generation using LSTMs, where we learned how to build models that can predict the next word in a sequence. Now, we'll dive into the foundational step that comes before any NLP task: text preprocessing. This step is crucial because raw text data is messy and unstructured. To make sense of it, we need to clean and transform it into a format that machine learning models can understand. In this lesson, we'll focus on two key concepts: tokenization and word embeddings.
I’ve faced situations where skipping proper text preprocessing led to poor model performance. For example, while working on a sentiment analysis project, I directly fed raw text into an LSTM model without tokenizing or embedding the words. The results were disappointing because the model couldn’t interpret the text properly. This taught me the importance of preprocessing, which I’ll now walk you through step by step.
Understanding Tokenization
Tokenization is the process of breaking down text into smaller units, such as words or subwords. These units are called tokens, and they serve as the building blocks for NLP tasks. For instance, the sentence “I love NLP!” can be tokenized into [“I”, “love”, “NLP”, “!”].
Tokenization is the first step in turning raw text into numerical data: once the text is split into tokens, each token can be mapped to an integer ID that a model can work with. Without this step, a model has no way to process words at all. TensorFlow and Keras provide tools like Tokenizer to make this process easy. Here’s an example of how to tokenize text using Keras:
from tensorflow.keras.preprocessing.text import Tokenizer
# Sample text
texts = ["I love NLP!", "Tokenization is essential."]
# Initialize tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
# Convert text to sequences
sequences = tokenizer.texts_to_sequences(texts)
print("Word Index:", tokenizer.word_index)
print("Sequences:", sequences)
In this example, the Tokenizer builds a word index that maps each word to a unique integer, and texts_to_sequences converts each sentence into a sequence of those integers. Note that by default the Keras Tokenizer lowercases text and strips punctuation, so “NLP!” is indexed as “nlp”. This is the first step in preparing text data for machine learning models.
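One detail worth planning for is out-of-vocabulary words: by default, any word the Tokenizer did not see during fit_on_texts is simply dropped from the resulting sequences. Passing an oov_token reserves an index for unseen words instead. The sketch below is a minimal illustration; the tokenizer_oov variable name, the "<OOV>" string, and the example sentence are placeholders, not part of the pipeline we build later.
from tensorflow.keras.preprocessing.text import Tokenizer

# Reserve an index (1 by default) for words not seen during fitting
tokenizer_oov = Tokenizer(oov_token="<OOV>")
tokenizer_oov.fit_on_texts(["I love NLP!", "Tokenization is essential."])

# "transformers" was never seen, so it maps to the <OOV> index instead of vanishing
print(tokenizer_oov.texts_to_sequences(["I love transformers"]))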
Overview of Word Embeddings
Once we’ve tokenized the text, the next step is to represent words in a way that captures their meaning. This is where word embeddings come in. Word embeddings are dense vector representations of words that capture semantic relationships learned from the contexts in which words appear. For example, in a good embedding space, the vectors for “king” and “queen” sit close together because the two words are used in similar contexts and are semantically related.
Popular word embedding methods include Word2Vec and GloVe. These methods map words to dense vectors (typically a few hundred dimensions), where the distance between vectors reflects how similar the words are. TensorFlow and Keras make it easy to use pre-trained embeddings or train your own. Here’s how you can load pre-trained GloVe embeddings for use in Keras:
import numpy as np
# Load pre-trained GloVe embeddings into a word -> vector dictionary
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        # Each line holds a word followed by its vector components
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
# Create an embedding matrix aligned with the tokenizer's word index
embedding_dim = 100
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, embedding_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words without a pre-trained vector keep the all-zeros row
        embedding_matrix[i] = embedding_vector
This code loads pre-trained GloVe embeddings and creates an embedding matrix for the words in your tokenizer. This matrix can then be used as weights in an embedding layer in your model.
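To sanity-check the idea that related words end up close together in the embedding space, you can compare two GloVe vectors directly. The snippet below is a small sketch that assumes embeddings_index has already been filled by the loading loop above and that all three words appear in the GloVe file; cosine similarity is a standard way to measure how aligned two vectors are.
def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; closer to 1 means more similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king = embeddings_index.get('king')
queen = embeddings_index.get('queen')
table = embeddings_index.get('table')

if king is not None and queen is not None and table is not None:
    # Related words should score noticeably higher than unrelated ones
    print("king vs queen:", cosine_similarity(king, queen))
    print("king vs table:", cosine_similarity(king, table))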
Implementing Text Preprocessing with TensorFlow/Keras
Now that we understand tokenization and word embeddings, let’s see how to implement text preprocessing in TensorFlow and Keras. The goal is to prepare text data for an NLP model. Here’s a step-by-step guide:
1. Tokenize the text: Use the Tokenizer class to convert text into sequences of integers.
2. Pad sequences: Ensure all sequences have the same length by padding them with zeros.
3. Load or train embeddings: Use pre-trained embeddings or train your own with an embedding layer.
4. Build the model: Add an embedding layer to your model, which will convert tokenized words into dense vectors.
Here’s an example of how to do this:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
# Pad sequences
max_length = 10
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')
# Build the model
model = Sequential()
# Embedding layer initialized with the GloVe matrix; trainable=False keeps it frozen
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1,
                    output_dim=embedding_dim,
                    weights=[embedding_matrix],
                    input_length=max_length,
                    trainable=False))
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
This code preprocesses the text and builds a simple LSTM model for binary classification. The embedding layer uses the pre-trained GloVe embeddings we loaded earlier, and setting trainable=False keeps those vectors frozen so they aren’t updated during training.
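To round out the pipeline, here is a hedged sketch of how you might train and use the compiled model. The labels array and the example sentence are made-up placeholders purely for illustration; in a real project you would use your own labeled dataset, and the epochs and batch_size values are arbitrary.
# Placeholder binary labels, one per training sentence (illustrative only)
labels = np.array([1, 0])

# Train on the padded sequences prepared above
model.fit(padded_sequences, labels, epochs=10, batch_size=2)

# New text must go through the same tokenizer and padding before prediction
new_sequences = tokenizer.texts_to_sequences(["I love NLP!"])
new_padded = pad_sequences(new_sequences, maxlen=max_length, padding='post')
print(model.predict(new_padded))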
Conclusion
In this lesson, we covered the basics of text preprocessing, including tokenization and word embeddings. These steps are essential for transforming raw text into a format that machine learning models can understand. By following the examples and code snippets provided, you can implement these techniques in your own NLP projects.
In the next lesson, we’ll build on this foundation by exploring sentiment analysis with LSTMs. You’ll learn how to use the preprocessed text data to train a model that can classify text as positive or negative.