Another Twitter sentiment analysis with Python - Part 11 (CNN + Word2Vec)
This is the 11th and final part of my Twitter sentiment analysis project. It has been a long journey, and through many trials and errors along the way, I have learned countless valuable lessons. I haven't decided on my next project yet, but I will definitely make time to start one. You can find the previous posts via the links below.
- Part 1: Data cleaning
- Part 2: EDA, Data visualisation
- Part 3: Zipf’s Law, Data visualisation
- Part 4: Feature extraction (count vectorizer), N-gram, confusion matrix
- Part 5: Feature extraction (Tfidf vectorizer), machine learning model comparison, lexical approach
- Part 6: Doc2Vec
- Part 7: Phrase modeling + Doc2Vec
- Part 8: Dimensionality reduction (Chi2, PCA)
- Part 9: Neural Networks with Tfidf vectors
- Part 10: Neural Networks with Doc2Vec/Word2Vec/GloVe
*In addition to the short code blocks I will attach, you can find the link to the whole Jupyter Notebook at the end of this post.
Preparation for Convolutional Neural Network
In the last post, I aggregated the word vectors of the words in each tweet, either by summing them or by taking their mean, to get one vector representation per tweet. In order to feed a CNN, however, we have to feed the model the individual word vectors, and in a sequence that matches the original tweet.
For example, let’s say we have a sentence as below.
“I love cats”
And let’s assume that we have a 2-dimensional vector representation of each word as follows:
I: [0.3, 0.5] love: [1.2, 0.8] cats: [0.4, 1.3]
With the above sentence, the whole sentence is represented by a 3 X 2 matrix (3: number of words, 2: vector dimension).
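As a quick toy sketch of this representation (the variable names here are purely illustrative, using the made-up vectors above):
import numpy as np

# Toy 2-dimensional word vectors for the sentence "I love cats"
word_vectors = {'i': [0.3, 0.5], 'love': [1.2, 0.8], 'cats': [0.4, 1.3]}
sentence = "I love cats".split()

# Stack one row per word to get the sentence matrix
sentence_matrix = np.array([word_vectors[w.lower()] for w in sentence])
print(sentence_matrix.shape)  # (3, 2): 3 words, 2 vector dimensions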
But there is one more thing we need to consider. A neural network expects all of its inputs to have the same dimensions, yet different sentences will have different lengths. This can be handled with padding.
Let’s say we have our second sentence as below.
“I love dogs too”
with the below vector representation of each word:
I: [0.3, 0.5], love: [1.2, 0.8], dogs: [0.8, 1.2], too: [0.1, 0.1]
The first sentence is represented by a 3 X 2 matrix, but the second one by a 4 X 2 matrix, and our neural network won't accept inputs of different shapes. With padding, we decide on a maximum sentence length (in words), and any input shorter than that length is zero-padded; an input that exceeds the maximum length is truncated, either from the beginning or from the end. For example, let's say we decide our maximum length to be 5.
Then, after padding, the first sentence gets 2 more 2-dimensional all-zero vectors at the start or the end (you can choose which by passing an argument), and the second sentence gets 1 more 2-dimensional zero vector at the beginning or the end. Now both sentences are represented by matrices of the same shape (5 X 2), and we can finally feed them to a model.
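In practice, as we will see below, Keras applies this padding to integer word sequences rather than to the word vectors themselves. Here is a minimal toy sketch (the integer IDs are made up):
from keras.preprocessing.sequence import pad_sequences

# Toy integer sequences standing in for "I love cats" and "I love dogs too"
toy_sequences = [[3, 7, 12], [3, 7, 25, 9]]

# Zero-pads at the beginning by default; padding='post' / truncating='post' change this
padded = pad_sequences(toy_sequences, maxlen=5)
print(padded)
# [[ 0  0  3  7 12]
#  [ 0  3  7 25  9]]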
Let's first load the Word2Vec models to extract word vectors from. I saved the Word2Vec models I trained in the previous post, and they can easily be loaded with Gensim's KeyedVectors. I have two different Word2Vec models: one trained with CBOW (Continuous Bag Of Words) and the other with skip-gram. I won't go into detail about how CBOW and skip-gram differ, but you can refer to my previous post if you want to know a bit more.
from gensim.models import KeyedVectors
model_ug_cbow = KeyedVectors.load('w2v_model_ug_cbow.word2vec')
model_ug_sg = KeyedVectors.load('w2v_model_ug_sg.word2vec')
By running the code block below, I construct a sort of dictionary that I can extract the word vectors from. Since I have two different Word2Vec models, "embeddings_index" below will hold the concatenated vectors of the two models. Each model gives a 100-dimensional vector representation per word, so by concatenating them, each word ends up with a 200-dimensional vector representation.
import numpy as np

embeddings_index = {}
for w in model_ug_cbow.wv.vocab.keys():
    embeddings_index[w] = np.append(model_ug_cbow.wv[w], model_ug_sg.wv[w])
Now we have our reference word vectors ready, but we still haven't prepared the data in the format I explained at the start of the post. Keras' Tokenizer will split each sentence into words, and we can then call the texts_to_sequences method to get a sequential (integer) representation of each sentence. We also pass num_words, the size of the vocabulary we want to use, and this is applied when texts_to_sequences is called. This might be a bit counter-intuitive: if you check the length of the word index, it will not be the number of words you defined, because the actual filtering only happens when you call texts_to_sequences.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer(num_words=100000)
tokenizer.fit_on_texts(x_train)
sequences = tokenizer.texts_to_sequences(x_train)
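As a quick illustration of that counter-intuitive point (a small check against the tokenizer we just fitted): the word index keeps every word seen during fitting, while the sequences only contain indices below num_words.
# The word index itself is not truncated by num_words...
print(len(tokenizer.word_index))

# ...but texts_to_sequences silently drops any word whose index is >= num_words,
# so the largest index appearing in the sequences stays below 100000.
print(max(i for seq in sequences for i in seq))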
Below are the first five entries of the original train data.
for x in x_train[:5]:
    print(x)
And the same data prepared as sequential data is as below.
sequences[:5]
Each word is represented by a number, and we can see that the number of words in each sentence matches the length of the corresponding entry in "sequences". We can map each number back to the word it represents, as in the small sketch below. We still haven't padded our data, though, so each sentence has a varying length; we will deal with that right after the sketch.
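To recover the words from a sequence, we can invert the tokenizer's word index (newer Keras versions also expose this directly as tokenizer.index_word):
# Build a reverse lookup from index -> word and decode the first training entry
index_to_word = {i: w for w, i in tokenizer.word_index.items()}
print([index_to_word[i] for i in sequences[0]])  # e.g. ['hate', 'you'] for the first entry here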
length = []
for x in x_train:
    length.append(len(x.split()))
max(length)
The maximum number of words in a sentence within the training data is 40. Let’s decide the maximum length to be a bit longer than this, let’s say 45.
x_train_seq = pad_sequences(sequences, maxlen=45)
x_train_seq[:5]
As you can see from the padded sequences, all the data have now been transformed to the same length of 45, and by default Keras zero-pads at the beginning if a sentence is shorter than the maximum length. If you want to know more detail, please check the Keras documentation on sequence preprocessing.
sequences_val = tokenizer.texts_to_sequences(x_validation)
x_val_seq = pad_sequences(sequences_val, maxlen=45)
There's still one more thing left to do before we can feed the sequential text data to a model. When we transformed a sentence into a sequence, each word was replaced by an integer, and these integers are simply the positions of the words in the tokenizer's word index. Keeping this in mind, let's build a matrix of word vectors indexed by those same integers, so that the model can look up the corresponding vector for each entry of an integer sequence.
Below, I am setting the number of words to 100,000, which means I will only care about the 100,000 most frequent words in the training set. Without this limit, the total vocabulary size would be more than 200,000.
num_words = 100000
embedding_matrix = np.zeros((num_words, 200))
for word, i in tokenizer.word_index.items():
    if i >= num_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
As a sanity check, let's verify that the embedding matrix has been generated properly. Above, when I looked at the first five entries of the training set, the first entry was "hate you", and its sequential representation was [137, 6]. Let's see if row 6 of the embedding matrix is the same as the vector for the word 'you'.
np.array_equal(embedding_matrix[6] ,embeddings_index.get('you'))
Now we are done with the data preparation. Before we jump into CNN, I would like to test one more thing (sorry for the delay). To feed this sequential representation of the data to a model, we will use the Embedding layer in Keras. With the Embedding layer, I can either pass the pre-defined embedding I prepared as 'embedding_matrix' above, or let the Embedding layer itself learn word embeddings as the whole model trains. A third possibility is to feed the pre-defined embedding but make it trainable, so that the vector values are updated as the model trains.
In order to check which method performs better, I defined a simple shallow neural network with one hidden layer. For this model structure, I will not try to refine the models by tweaking parameters, since the main purpose of this post is to implement a CNN.
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
# 1. Pre-trained Word2Vec vectors, kept frozen (trainable=False)
model_ptw2v = Sequential()
e = Embedding(100000, 200, weights=[embedding_matrix],
              input_length=45, trainable=False)
model_ptw2v.add(e)
model_ptw2v.add(Flatten())
model_ptw2v.add(Dense(256, activation='relu'))
model_ptw2v.add(Dense(1, activation='sigmoid'))
model_ptw2v.compile(loss='binary_crossentropy', optimizer='adam',
metrics=['accuracy'])
model_ptw2v.fit(x_train_seq, y_train,
validation_data=(x_val_seq, y_validation), epochs=5,
batch_size=32, verbose=2)
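# 2. Word embedding learned from scratch by the Embedding layer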
model_ptw2v = Sequential()
e = Embedding(100000, 200, input_length=45)
model_ptw2v.add(e)
model_ptw2v.add(Flatten())
model_ptw2v.add(Dense(256, activation='relu'))
model_ptw2v.add(Dense(1, activation='sigmoid'))
model_ptw2v.compile(loss='binary_crossentropy', optimizer='adam',
metrics=['accuracy'])
model_ptw2v.fit(x_train_seq, y_train,
validation_data=(x_val_seq, y_validation), epochs=5, batch_size=32,
verbose=2)
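# 3. Pre-trained Word2Vec vectors, fine-tuned during training (trainable=True)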
model_ptw2v = Sequential()
e = Embedding(100000, 200, weights=[embedding_matrix],
input_length=45, trainable=True)
model_ptw2v.add(e)
model_ptw2v.add(Flatten())
model_ptw2v.add(Dense(256, activation='relu'))
model_ptw2v.add(Dense(1, activation='sigmoid'))
model_ptw2v.compile(loss='binary_crossentropy', optimizer='adam',
metrics=['accuracy'])
model_ptw2v.fit(x_train_seq, y_train,
validation_data=(x_val_seq, y_validation), epochs=5,
batch_size=32, verbose=2)
As a result, the best validation accuracy, 82.22%, comes from the third method (fine-tuning pre-trained Word2Vec), while the best training accuracy, 90.52%, comes from the second method (learning the word embedding from scratch). Using pre-trained Word2Vec without updating its vector values shows the lowest accuracy on both training and validation. What's interesting, though, is that in terms of training accuracy, fine-tuning the pre-trained word vectors could not outperform the word embedding learned from scratch. Before trying the three methods, my first guess was that fine-tuning the pre-trained word vectors would give the best training accuracy.
Feeding pre-trained word vectors to a trainable embedding layer is like giving the layer an initialisation guideline, so that it can learn task-specific word vectors more efficiently. The result is somewhat counterintuitive: in this case, at least in terms of training accuracy, it turns out to be better to force the embedding layer to learn from scratch.
But premature generalisation could be dangerous, so I will compare the three methods again in the context of a CNN.
Convolutional Neural Network
You might have already seen how a Convolutional Neural Network (CNN) works on image data, and there are many good sources for learning the basics of CNNs. In my case, the blog post "A Beginner's Guide To Understanding Convolutional Neural Networks" by Adit Deshpande really helped me grasp the concept. If you are not familiar with CNNs, I highly recommend his article for a firm understanding.
Now I will assume you have an understanding of CNN in case of image data. How can this be applied to text data then? Let’s say we have a sentence as follows:
“I love cats and dogs”
With word vectors (let's assume a 200-dimensional vector for each word), the above sentence can be represented as a 5 X 200 matrix, one row per word. Remember how we added zeros to pad the sentences when we prepared the data for the embedding layer? If our chosen sentence length is 45, the above sentence becomes a 45 X 200 matrix, with all zeros in the first 40 rows. Keeping this in mind, let's take a look at how a CNN works on image data.
In the above GIF, we have one filter (kernel matrix) of dimension 3 X 3 convolving over the data (image matrix), computing the sum of the element-wise multiplications and recording the result in a feature map (output matrix). If we imagine each row of the data as a word in a sentence, this would not learn efficiently, since the filter only looks at a part of a word vector at a time. This is a so-called 2D Convolutional Neural Network, since the filter is moving over the data in two dimensions.
What we do with text data represented as word vectors is use a 1D Convolutional Neural Network. If a filter's width is the same as the data's width, it has no room to stride horizontally and can only stride vertically. For example, if our sentence is represented as a 45 X 200 matrix, the filter will also be 200 columns wide, and its height plays a role similar to an n-gram size: with a filter height of 2, the filter strides through the document computing the above calculation over all the bigrams; with a filter height of 3, it goes through all the trigrams; and so on.
If a 2 X 200 filter is applied with a stride of 1 to a 45 X 200 matrix, we get a 44 X 1 output. In the case of 1D convolution, the output width is just 1 here (number of filters = 1). The output height can easily be calculated with the formula below (assuming your data is already padded):

output height = (H - Fh) / S + 1

where
H: input data height
Fh: filter height
S: stride size

For the example above: (45 - 2) / 1 + 1 = 44.
Now let’s try to add more filters to our 1D Convolutional layer. If we apply 100 2X200 filters with stride size of 1 to 45X200 matrix, can you guess the output dimension?
As mentioned above, the output width reflects the number of filters we apply, so the answer is a 44 X 100 output. You can also check the dimensions of each layer's output by looking at the model summary after you define the structure.
from keras.layers import Conv1D, GlobalMaxPooling1D
structure_test = Sequential()
e = Embedding(100000, 200, input_length=45)
structure_test.add(e)
structure_test.add(Conv1D(filters=100, kernel_size=2, padding='valid', activation='relu', strides=1))
structure_test.summary()  # the Conv1D output shape should be (None, 44, 100)
Now if we add a global max pooling layer, the pooling layer will extract the maximum value from each filter's output, and the result will be a 1-dimensional vector whose length equals the number of filters we applied. This can be passed directly to a dense layer without flattening.
structure_test = Sequential()
e = Embedding(100000, 200, input_length=45)
structure_test.add(e)
structure_test.add(Conv1D(filters=100, kernel_size=2, padding='valid', activation='relu', strides=1))
structure_test.add(GlobalMaxPooling1D())
structure_test.summary()  # the GlobalMaxPooling1D output shape should be (None, 100)
Now, let's define a simple CNN that goes through bigrams in a tweet. The output of the global max pooling layer will be fed to a fully connected layer, and then to the output layer. Again I will try three different inputs: static word vectors extracted from Word2Vec, a word embedding learned from scratch by the embedding layer, and Word2Vec word vectors updated through training.
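# 1. Static pre-trained Word2Vec vectors (trainable=False)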
model_cnn_01 = Sequential()
e = Embedding(100000, 200, weights=[embedding_matrix], input_length=45, trainable=False)
model_cnn_01.add(e)
model_cnn_01.add(Conv1D(filters=100, kernel_size=2, padding='valid',
activation='relu', strides=1))
model_cnn_01.add(GlobalMaxPooling1D())
model_cnn_01.add(Dense(256, activation='relu'))
model_cnn_01.add(Dense(1, activation='sigmoid'))
model_cnn_01.compile(loss='binary_crossentropy', optimizer='adam',
metrics=['accuracy'])
model_cnn_01.fit(x_train_seq, y_train,
validation_data=(x_val_seq, y_validation), epochs=5, batch_size=32,
verbose=2)
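# 2. Word embedding learned from scratch by the Embedding layer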
model_cnn_02 = Sequential()
e = Embedding(100000, 200, input_length=45)
model_cnn_02.add(e)
model_cnn_02.add(Conv1D(filters=100, kernel_size=2, padding='valid',
activation='relu', strides=1))
model_cnn_02.add(GlobalMaxPooling1D())
model_cnn_02.add(Dense(256, activation='relu'))
model_cnn_02.add(Dense(1, activation='sigmoid'))
model_cnn_02.compile(loss='binary_crossentropy', optimizer='adam',
metrics=['accuracy'])
model_cnn_02.fit(x_train_seq, y_train,
validation_data=(x_val_seq, y_validation), epochs=5, batch_size=32,
verbose=2)
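# 3. Pre-trained Word2Vec vectors, fine-tuned during training (trainable=True)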
model_cnn_03 = Sequential()
e = Embedding(100000, 200, weights=[embedding_matrix], input_length=45,
trainable=True)
model_cnn_03.add(e)
model_cnn_03.add(Conv1D(filters=100, kernel_size=2, padding='valid',
activation='relu', strides=1))
model_cnn_03.add(GlobalMaxPooling1D())
model_cnn_03.add(Dense(256, activation='relu'))
model_cnn_03.add(Dense(1, activation='sigmoid'))
model_cnn_03.compile(loss='binary_crossentropy', optimizer='adam',
metrics=['accuracy'])
model_cnn_03.fit(x_train_seq, y_train,
validation_data=(x_val_seq, y_validation), epochs=5, batch_size=32,
verbose=2)
The best validation accuracy comes from the word vectors updated through training: 83.25% at epoch 3. Looking at the training loss and accuracy, the word embedding learned from scratch tends to overfit the training data, while feeding pre-trained word vectors as the weight initialisation generalises somewhat better and ends up with a higher validation accuracy.
But finally, I have a better result than the Tf-Idf + logistic regression model! I have tried various methods with Doc2Vec and Word2Vec in the hope of outperforming a simple logistic regression model with Tf-Idf input (you can take a look at the previous posts for details). The Tf-Idf + logistic regression model's validation accuracy was 82.91%, and now I'm finally beginning to see the possibility of Word2Vec + neural network outperforming that simple model.
Let’s see if we can do better by defining a bit more elaborate model structure. The CNN architecture I will implement below is inspired by Zhang, Y., & Wallace, B. (2015) “A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification”.
Basically, this structure implements what we did above with bigram filters, but applied not only to bigrams but also to trigrams and fourgrams. These are not linearly stacked layers but parallel layers: after the convolution and max pooling, the max-pooled results from the bigram, trigram, and fourgram branches are simply concatenated, and one output layer is built on top of them.
The model I define below is basically the same as that structure, with two differences: I added one fully connected hidden layer with dropout just before the output layer, and my output layer has a single node with sigmoid activation instead of two.
There is also another famous paper by Y. Kim (2014), "Convolutional Neural Networks for Sentence Classification": https://arxiv.org/pdf/1408.5882.pdf
In this paper, he implemented a more sophisticated approach using the concept of "channels". Not only does the model go through different n-grams, it also has multiple channels (e.g. one channel with static input word vectors and another channel whose word vectors are updated during training). In this post, however, I will not go through the multi-channel approach.
So far I have only used the Sequential model API of Keras, and this worked fine for all the models defined above, since their structures were only linearly stacked. But the model I am about to define has parallel layers which take the same input, do their own computation, and then have their results merged. For this kind of neural network structure, we can use the Keras functional API.
The Keras functional API can handle multiple inputs, multiple outputs, shared layers, shared inputs, and so on. It is not impossible to define this type of model with the Sequential API, but the functional API makes it simple to save and later load the trained model, which is difficult with the Sequential approach.
from keras.layers import Input, Dense, concatenate, Activation
from keras.models import Model
tweet_input = Input(shape=(45,), dtype='int32')
tweet_encoder = Embedding(100000, 200, weights=[embedding_matrix],
input_length=45, trainable=True)(tweet_input)
bigram_branch = Conv1D(filters=100, kernel_size=2, padding='valid',
activation='relu', strides=1)(tweet_encoder)
bigram_branch = GlobalMaxPooling1D()(bigram_branch)
trigram_branch = Conv1D(filters=100, kernel_size=3, padding='valid',
activation='relu', strides=1)(tweet_encoder)
trigram_branch = GlobalMaxPooling1D()(trigram_branch)
fourgram_branch = Conv1D(filters=100, kernel_size=4, padding='valid',
activation='relu', strides=1)(tweet_encoder)
fourgram_branch = GlobalMaxPooling1D()(fourgram_branch)
merged = concatenate([bigram_branch, trigram_branch, fourgram_branch],
axis=1)
merged = Dense(256, activation='relu')(merged)
merged = Dropout(0.2)(merged)
merged = Dense(1)(merged)
output = Activation('sigmoid')(merged)
model = Model(inputs=[tweet_input], outputs=[output])
model.compile(loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.summary()
from keras.callbacks import ModelCheckpoint
filepath="CNN_best_weights.{epoch:02d}-{val_acc:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc',
verbose=1, save_best_only=True, mode='max')
model.fit(x_train_seq, y_train, batch_size=32, epochs=5,
validation_data=(x_val_seq, y_validation),
callbacks = [checkpoint])
from keras.models import load_model
loaded_CNN_model = load_model('CNN_best_weights.02-0.8333.hdf5')
loaded_CNN_model.evaluate(x=x_val_seq, y=y_validation)
The best validation accuracy is 83.33%, slightly better than the simple CNN with bigram filters, which yielded 83.25%. I could define a deeper structure with more hidden layers, make use of the multi-channel approach that Yoon Kim (2014) implemented, or try different pool sizes to see how the performance differs, but I will stop here for now. If you happen to try a more complex CNN structure, I would love to hear about your results.
Final Model Evaluation with Test Set
So far I have used the validation set to decide on feature extraction tuning and to compare models. Now I will finally check the final result with the test set. I will compare two models: 1. Tf-Idf + logistic regression, 2. Word2Vec + CNN. As another measure for comparison, I will also plot the ROC curves of both models.
from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer(max_features=100000,ngram_range=(1, 3))
tvec.fit(x_train)
x_train_tfidf = tvec.transform(x_train)
x_test_tfidf = tvec.transform(x_test)
from sklearn.linear_model import LogisticRegression

lr_with_tfidf = LogisticRegression()
lr_with_tfidf.fit(x_train_tfidf,y_train)
yhat_lr = lr_with_tfidf.predict_proba(x_test_tfidf)
lr_with_tfidf.score(x_test_tfidf,y_test)
sequences_test = tokenizer.texts_to_sequences(x_test)
x_test_seq = pad_sequences(sequences_test, maxlen=45)
yhat_cnn = loaded_CNN_model.predict(x_test_seq)
loaded_CNN_model.evaluate(x=x_test_seq, y=y_test)
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
fpr, tpr, threshold = roc_curve(y_test, yhat_lr[:,1])
roc_auc = auc(fpr, tpr)
fpr_cnn, tpr_cnn, threshold = roc_curve(y_test, yhat_cnn)
roc_auc_nn = auc(fpr_cnn, tpr_cnn)
plt.figure(figsize=(8,7))
plt.plot(fpr, tpr, label='tfidf-logit (area = %0.3f)' % roc_auc, linewidth=2)
plt.plot(fpr_cnn, tpr_cnn, label='w2v-CNN (area = %0.3f)' % roc_auc_nn, linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', linewidth=2)
plt.xlim([-0.05, 1.0])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('Receiver operating characteristic: is positive', fontsize=18)
plt.legend(loc="lower right")
plt.show()
And the final result is as below.
Thank you for reading. You can find the Jupyter Notebook from the below link.
https://github.com/tthustla/twitter_sentiment_analysis_part11/blob/master/Capstone_part11.ipynb