With so much unstructured data online, in the form of text, video, and audio, how can we turn it into structured data? Nowadays we have access to thousands of reviews and comments on different social media platforms; some reviews come with numeric ratings (like those on Yelp) and some don't (like those on Twitter). Business owners can certainly trace reviews and comments from customers' posts across popular social media platforms, but the process becomes tedious when the number of reviews is large and the conclusions are mostly qualitative. For the reviews and comments that don't come with numeric ratings, if we can label them as positive or negative (or on a scale of 1 to 5) by training models on labeled data (Yelp data), we can help business owners track customer feedback in a faster and more interpretable way, so that they can adjust their services and offerings in response to customers' most recent feedback.
The purpose of the project is to gain an understanding of Yelp users' reviews and to predict sentiment based on the text of individual Yelp reviews.
# New package installation (uncomment as needed)
# import nltk
# nltk.download('wordnet')
# !pip install gensim
# import nltk
# nltk.download('stopwords')
# ! pip install regex
#! pip install langdetect
#! pip install langid
#! pip install pydot
#! pip install wordcloud
# load necessary packages
from __future__ import division, print_function
import pandas as pd
import os
import json
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import re
import regex
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import matplotlib
mpl.rcParams['figure.figsize'] = (8, 8)
#inline_rc = dict(mpl.rcParams)
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
import autocorrect
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
import gensim
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense,Input, Dropout, Reshape, Activation,Flatten, concatenate, Input
from keras.layers import Bidirectional,GlobalMaxPooling1D,Conv1D, MaxPooling1D, Conv2D,MaxPool2D, MaxPooling2D
from keras.layers import Activation, Embedding, GRU
from keras.layers.recurrent import LSTM
from keras.models import Sequential
from keras.models import Model
import collections
from keras.models import load_model
from nltk.classify import textcat
from langdetect import detect
import langid
from keras.layers.normalization import BatchNormalization
from keras.callbacks import EarlyStopping, ModelCheckpoint
from numpy import savetxt
from keras.callbacks import History
import pydotplus
import keras
import pydot as pyd
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
import graphviz
from keras import regularizers
from numpy import asarray
from numpy import loadtxt
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import warnings
warnings.filterwarnings("ignore")
from sklearn.metrics import classification_report
Specific Steps:
# Load Yelp review dataset
reviews = []
with open('yelp_academic_dataset_review.json') as fl:
    for i, line in enumerate(fl):
        reviews.append(json.loads(line))
        #if i+1 >= 100000:
        #    break
df = pd.DataFrame(reviews)
df.head()
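As a side note, the same line-by-line load can be done in a single call, since the Yelp dump is in JSON Lines format. A minimal alternative sketch (not part of the pipeline below; df_alt is just an illustrative name):
# Alternative (sketch only): pandas can parse JSON Lines files directly
df_alt = pd.read_json('yelp_academic_dataset_review.json', lines=True)
df_alt.head()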
# Load Yelp business dataset
business = []
with open('yelp_academic_dataset_business.json') as fl:
    for i, line in enumerate(fl):
        business.append(json.loads(line))
df_busi = pd.DataFrame(business)
df_busi.head()
df.rename(columns={'stars': 'stars_review'}, inplace=True)
df_busi.rename(columns={'stars': 'stars_business'}, inplace=True)
df.shape
df_busi.shape
# Number of Restaurants in each city
pd.DataFrame(df_busi.city.value_counts(dropna=False)).head(10)
# filter for businesses in Phoenix city only
phoenix_busi_df = df_busi.loc[df_busi.city == "Phoenix", :]
# filter for reviews for Phoenix city only
phoenix_busi_id = df_busi.loc[df_busi.city == "Phoenix", "business_id"].tolist()
# Create a Dataframe containing reviews for restaurants in Phoenix only
phoenix_rev = df[df["business_id"].isin(phoenix_busi_id)]
# Merge business and review dataset for Phoenix city only
df_phoenix = pd.merge(phoenix_rev, phoenix_busi_df, on="business_id")
df_phoenix.to_csv("df_phoenix_reviews.csv")
df = pd.read_csv("df_phoenix_reviews.csv")
df.head()
df.columns
df.shape
# make sure there are no duplicate reviews
df.review_id.value_counts()
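An equivalent, more explicit uniqueness check (sketch only, same information as the value_counts above):
# raises an AssertionError if any review_id appears more than once
assert df.review_id.is_unique, "duplicate review_id values found"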
df.describe(include="all")
# Filter for only restaurants
is_restaurants = [re.search("restaurants",str(df.categories[i]).lower()) is not None for i in range(len(df))]
df = df.loc[is_restaurants,:]
# check shape of dataframe
df.shape
# looks like there is data type inconsistency
df.stars_review.value_counts()
# Check the distribution of target variable - dataset imbalanced
df.stars_review.value_counts(normalize = True)
labels = df.stars_review.value_counts().index.tolist()
labels
x=df['stars_review'].value_counts()
x=x.sort_index()
#plot
plt.figure(figsize=(8,4))
ax= sns.barplot(x.index, x.values, alpha=0.8)
plt.title("Yelp Star Rating Distribution")
plt.ylabel('# of Reviews', fontsize=12)
plt.xlabel('Star Ratings ', fontsize=12)
Note that 5-star ratings account for 44.1% of the total, followed by 4-star ratings at 24.6%. The distribution across the five star levels is not balanced, but I don't think this mild imbalance will be a big issue for the accuracy of the model predictions.
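If the imbalance did turn out to hurt the minority classes, one common mitigation (not used in this project) would be to weight the loss by inverse class frequency. A minimal sketch with scikit-learn, assuming the stars_review column above; the resulting dictionary could be passed to Keras via model.fit(..., class_weight=...):
# Sketch only (not used below): inverse-frequency class weights
# index 0 corresponds to 1 star, ..., index 4 to 5 stars
from sklearn.utils.class_weight import compute_class_weight
classes = np.sort(df['stars_review'].unique())
weights = compute_class_weight('balanced', classes=classes, y=df['stars_review'])
class_weight_dict = dict(zip(range(len(classes)), weights))
class_weight_dict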
# one-hot-encoding the target variables, stars_review
df = pd.concat([df,pd.get_dummies(df['stars_review'], prefix='review_stars_')],axis=1)
# Create y
y = df["stars_review"]
# Split into train test
train, test = train_test_split(df, test_size=0.1, random_state=42, stratify = y)
# Create ytrain
ytrain = train["stars_review"]
# Further splitting train into train and validation set
train, valid = train_test_split(train, test_size=0.2, random_state=42, stratify = ytrain)
train.stars_review.value_counts(normalize = True)
valid.stars_review.value_counts(normalize = True)
test.stars_review.value_counts(normalize = True)
train.to_csv("yelp_train.csv")
test.to_csv("yelp_test.csv")
valid.to_csv("yelp_valid.csv")
# Load back datasets if needed
train= pd.read_csv("yelp_train.csv")
valid= pd.read_csv("yelp_valid.csv")
test= pd.read_csv("yelp_test.csv")
train.columns
train_sentence = [sen for sen in train["text"]]
training_all_sens = ' '.join(train_sentence)
# Create and generate a word cloud image:
wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(training_all_sens)
# Display the generated image:
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
wordcloud.to_file("yelp_phoenix_wordcloud.png")
# Try removing some sentiment-neutral words
# use a separate name so we don't shadow nltk's stopwords module imported above
wc_stopwords = set(STOPWORDS)
wc_stopwords.update(["think", "one", "restaurant", "know", "meal", "say",
                     "eat"])
# Create and generate a word cloud image:
wordcloud_1 = WordCloud(max_font_size=50, max_words=100, stopwords=wc_stopwords,
                        background_color="white").generate(training_all_sens)
# Display the generated image:
plt.figure()
plt.imshow(wordcloud_1, interpolation="bilinear")
plt.axis("off")
plt.show()
Create separate word clouds for low-rating and high-rating reviews
negative_ind = train.loc[(train.stars_review == 1) | (train.stars_review == 2) | (train.stars_review == 3), "text"]
positive_ind = train.loc[(train.stars_review == 4) | (train.stars_review == 5) , "text"]
training_all_neg = ' '.join(negative_ind)
training_all_pos = ' '.join(positive_ind)
# Create and generate a word cloud image:
wordcloud_neg = WordCloud(max_font_size=50, max_words=100, stopwords=wc_stopwords,
                          background_color="white").generate(training_all_neg)
# Display the generated image:
plt.figure()
plt.imshow(wordcloud_neg, interpolation="bilinear")
plt.axis("off")
plt.show()
# Create and generate a word cloud image:
wordcloud_pos = WordCloud(max_font_size=50, max_words=100, stopwords=wc_stopwords,
                          background_color="white").generate(training_all_pos)
# Display the generated image:
plt.figure()
plt.imshow(wordcloud_pos, interpolation="bilinear")
plt.axis("off")
plt.show()
Stop words are words that do not contribute to the deeper meaning of a sentence or phrase, such as "a", "the", and "is". Removing stop words reduces the vocabulary size and therefore speeds up processing. While many stop word lists are available online, it's important to consider whether they are appropriate for the specific task of your project. For example, removing stop words such as "don't", "not", and "aren't" can flip a sentence from negative sentiment to positive sentiment. Therefore I decided to customize a stop word list for this project.
See link for more details: https://medium.com/@limavallantin/why-is-removing-stop-words-not-always-a-good-idea-c8d35bd77214
A short list of additional considerations when cleaning text:
Source : https://machinelearningmastery.com/clean-text-machine-learning-python/
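To make this concrete, here is a small illustration (sketch only) of how the standard NLTK list strips the negation while a negation-preserving list, like the custom one built later in this notebook, keeps it:
# illustration only: NLTK's default list drops "not", erasing the negative cue
example = "the food was not good".split()
nltk_stops = set(stopwords.words('english'))
print([w for w in example if w not in nltk_stops])            # -> ['food', 'good']
# a negation-preserving variant (hypothetical) keeps "not"
keep_negation_stops = nltk_stops - {'not', "don't", "aren't"}
print([w for w in example if w not in keep_negation_stops])   # -> ['food', 'not', 'good']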
# Load back datasets if needed
train= pd.read_csv("yelp_train.csv")
valid= pd.read_csv("yelp_valid.csv")
test= pd.read_csv("yelp_test.csv")
striped_sen = [sen.strip() for sen in train['text']]
english_text_train = []
notlangs_train =[]
for i in range(len(train)):
    try:
        detected_lang = detect(striped_sen[i])
        if detected_lang == "en":
            english_text_train.append(i)
    except:
        notlangs_train.append(i)
notlangs_train
train["text"][notlangs_train]
# get index of all rows in train
full_ind = [i for i in range(len(train))]
# get index of non-english reviews
english_train_set = set(english_text_train) # this reduces the lookup time from O(n) to O(1)
noteng = [ind for ind in full_ind if ind not in english_train_set]
# Let's look at what is identified as not English - not very accurate but decent
train.loc[noteng,"text"].head(10)
# Remove reviews that could not be identified as English
train.drop(index = noteng , axis=0, inplace= True )
train.shape
train['text'][8]
#Lowercasing before negation
lower_case = [[sen.lower()] for sen in train['text']]
# let's see an example below
lower_case[8]
#split sentence into words
words = [sen[0].split() for sen in lower_case]
# Check an example here
words[8]
# %load appos.py
appos = {
"aren't" : "are not",
"can't" : "cannot",
"couldn't" : "could not",
"didn't" : "did not",
"doesn't" : "does not",
"don't" : "do not",
"hadn't" : "had not",
"hasn't" : "has not",
"haven't" : "have not",
"he'd" : "he would",
"he'll" : "he will",
"he's" : "he is",
"i'd" : "I would",
"i'd" : "I had",
"i'll" : "I will",
"i'm" : "I am",
"isn't" : "is not",
"it's" : "it is",
"it'll":"it will",
"i've" : "I have",
"let's" : "let us",
"mightn't" : "might not",
"mustn't" : "must not",
"shan't" : "shall not",
"she'd" : "she would",
"she'll" : "she will",
"she's" : "she is",
"shouldn't" : "should not",
"that's" : "that is",
"there's" : "there is",
"they'd" : "they would",
"they'll" : "they will",
"they're" : "they are",
"they've" : "they have",
"we'd" : "we would",
"we're" : "we are",
"weren't" : "were not",
"we've" : "we have",
"what'll" : "what will",
"what're" : "what are",
"what's" : "what is",
"what've" : "what have",
"where's" : "where is",
"who'd" : "who would",
"who'll" : "who will",
"who're" : "who are",
"who's" : "who is",
"who've" : "who have",
"won't" : "will not",
"wouldn't" : "would not",
"you'd" : "you would",
"you'll" : "you will",
"you're" : "you are",
"you've" : "you have",
"'re": " are",
"wasn't": "was not",
"we'll":" will",
"didn't": "did not"
}
def appos_remove(sen):
    return [appos[word] if word in appos else word for word in sen]
# Expand apostrophe contractions into their full forms
appos_removed = [appos_remove(sen) for sen in words]
# confirm the function works: "i've" became 'I have', "you'll" became "you will"
appos_removed[8]
#rejoin again
rejoined_sen = [' '.join(sen) for sen in appos_removed]
#Lowercasing again
lower_sen = [sen.lower() for sen in rejoined_sen]
lower_sen[8]
# define a function to replace punctuations with white space
def remove_punct(text):
    text_nopunct = re.sub('['+string.punctuation+']', ' ', text)
    return text_nopunct
# replace punctuations in the text with white space
text_rem_punct = [remove_punct(sen) for sen in lower_sen]
# Now all punctuations are removed, e.g., forward slash in 'drinks/dessert' was removed
text_rem_punct[8]
#Tokenize text on whitespace
#token_word = [sen.split() for sen in lower_sen]
token_word = [WhitespaceTokenizer().tokenize(sen) for sen in text_rem_punct]
token_word[8]
# Remove non-alphabetic tokens, such as numbers
def remove_non_alpha(sen):
    return [word for word in sen if word.isalpha()]
non_alpha_removed = [remove_non_alpha(sen) for sen in token_word]
train['text'][120]
# confirm the function works
" ".join(non_alpha_removed[120])
Customize a stopword list
see helpful link: https://programminghistorian.org/en/lessons/counting-frequencies
# Combine all words in the train set
all_train_words = [word for tokens in non_alpha_removed for word in tokens]
train_vocab = list(set(all_train_words))
print("The training set has a total of "+ str(len(all_train_words)) + " words with a vocab size of " + str(len(train_vocab))
+ " unique words" )
# def wordListToFreqDict(wordlist, vocab):
# wordfreq = [wordlist.count(w) for w in vocab]
# return dict(list(zip(vocab,wordfreq)))
#train_dict = wordListToFreqDict(wordlist = all_train_words, vocab = train_vocab)
# the function above is not recommended, not efficient
# create a vocab dictionary to record the frequency of each word
train_dict = {}
for word in all_train_words:
    try:
        train_dict[word] += 1
    except KeyError:
        train_dict[word] = 1
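An equivalent and more idiomatic way to build the same frequency dictionary is collections.Counter (collections is already imported above); a quick sketch with a sanity check:
# equivalent construction using collections.Counter
train_dict_counter = dict(collections.Counter(all_train_words))
assert train_dict_counter == train_dict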
# create function to sort the vocab dictionary from most frequent to least
def sortFreqDict(freqdict):
    freqword_list = [(freqdict[key], key) for key in freqdict]
    freqword_list.sort()
    freqword_list.reverse()
    return freqword_list
#sort the vocab dictionary from most frequent to least
sorted_train_dict = sortFreqDict(train_dict)
sorted_train_dict = dict(sorted_train_dict)
train_len = len(non_alpha_removed)
train_len
# find most frequent words
poplist=[]
for num, word in sorted_train_dict.items():
    if num/len(all_train_words) > 0.002:
        poplist.append(word)
poplist
# identify any words that show up in more than 80% of the reviews in training set
stop_word_cand=[]
for word in poplist:
    if sum([word in sen for sen in non_alpha_removed])/train_len >= 0.80:
        stop_word_cand.append(word)
stop_word_cand
# Get stop words from NLTK
stoplist = stopwords.words('english')
stoplist
# create my own stopword list
custom_stop_words = ['i','me','my','myself',
'we','us','our','ours','ourselves',
'you',"you're","you've","you'll","you'd",'your','yours','yourself',
'yourselves',
'he','him','his','himself',
'she',"she's",'her','hers','herself',
'it',"it's", 'its','itself',
'they', 'them', 'their', 'theirs','themselves',
'what','which', 'who','whom', 'this','that', "that'll",'these','those',
'am', 'is','are', 'was','were','be', 'been', 'being','will','would',
'here','there',
'have','has','had', 'having', 'do','does','did','doing',
'a','an',"and",'the', 'or',
't','d','s', 'll','m','o','re','ve', 'ma',
'to','of','for','in','out','with','on','up', 'at','as','from','about']
def removeStopWords(tokens):
    return [word for word in tokens if word not in custom_stop_words]
# remove stop words
filtered_words = [removeStopWords(sen) for sen in non_alpha_removed]
# check if stop words removed or not
filtered_words[8]
It would be nice if we could correct the spelling errors in the text. However, we skip this step for now because the function takes too long to run.
reference:https://www.quora.com/Are-there-any-NLP-auto-correct-auto-complete-libraries-for-Python
# # create a function for auto correct spelling errors
# def correctspelling(tokens):
# spell = autocorrect.Speller(lang='en')
# return [spell(word) for word in tokens]
# # Check if function works
# correctspelling(['caaaar','mussage','hte'])
# # auto-correct spelling
# corrected_words = [correctspelling(sen) for sen in filtered_words]
# # check if stop words removed or not
# corrected_words[0]
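If spelling correction were ever needed, one way to make it tractable (a sketch only, kept commented out like the cell above) is to correct each unique vocabulary word once and map tokens through that cache, instead of calling the speller on every token of every review:
# # Sketch (not run): cache one correction per unique word instead of per token
# spell = autocorrect.Speller(lang='en')
# unique_words = set(word for sen in filtered_words for word in sen)
# correction_cache = {w: spell(w) for w in unique_words}
# corrected_words = [[correction_cache[w] for w in sen] for sen in filtered_words]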
"The goal of both stemming and lemmatization is to reduce inflected words to their word stem, base or root form—generally a written word form. For example, raining, rains, rained could be all stemmized to "rain".
The difference between stemming and lemmatization is the way they change the words. Stemming usually directly chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma."
Comparing the lemmatized text with the snowball-stemmed text, I prefer to use only the lemmatized text, since stemming changes some words into less interpretable forms, e.g. 'happy' to 'happi'.
Source : https://stackoverflow.com/questions/1787110/what-is-the-difference-between-lemmatization-vs-stemming
def lemmatization(tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in tokens]
def porterstemming(tokens):
    ps = PorterStemmer()
    return [ps.stem(word) for word in tokens]
def snowballstemming(tokens):
    ss = SnowballStemmer("english")
    return [ss.stem(word) for word in tokens]
# Check if the function works
lemmatization(["rocking", "rains","rained","boys", "ran", "generously","happy"])
# Check if the function works
porterstemming(["rocking", "rains","rained","boys", "ran", "generously","happy"])
# Check if the function works
snowballstemming(["rocking", "rains","rained","boys", "ran", "generously","happy"])
# Apply lemmatization to text
lemmatized_words = [lemmatization(sen) for sen in filtered_words]
# # Apply stemming
# stemmed_words = [snowballstemming(sen) for sen in lemmatized_words]
# # Check reviews after lemmatized and stemmed
# stemmed_words[8]
# Check reviews after lemmatized
lemmatized_words[8]
# Make sure the cleaned text has the same length as the train dataframe
len(lemmatized_words) == len(train)
# Add cleaned(lemmatized) Text back to df
train['text_clean'] = [' '.join(sen) for sen in lemmatized_words]
# add tokenized cleaned Text back to df
train['tokens'] =lemmatized_words
train["text"][8]
train["text_clean"][8]
# Rename the dictionary to appos_dict so that the function below can run
appos_dict = {
"aren't" : "are not",
"can't" : "cannot",
"couldn't" : "could not",
"didn't" : "did not",
"doesn't" : "does not",
"don't" : "do not",
"hadn't" : "had not",
"hasn't" : "has not",
"haven't" : "have not",
"he'd" : "he would",
"he'll" : "he will",
"he's" : "he is",
"i'd" : "I would",
"i'd" : "I had",
"i'll" : "I will",
"i'm" : "I am",
"isn't" : "is not",
"it's" : "it is",
"it'll":"it will",
"i've" : "I have",
"let's" : "let us",
"mightn't" : "might not",
"mustn't" : "must not",
"shan't" : "shall not",
"she'd" : "she would",
"she'll" : "she will",
"she's" : "she is",
"shouldn't" : "should not",
"that's" : "that is",
"there's" : "there is",
"they'd" : "they would",
"they'll" : "they will",
"they're" : "they are",
"they've" : "they have",
"we'd" : "we would",
"we're" : "we are",
"weren't" : "were not",
"we've" : "we have",
"what'll" : "what will",
"what're" : "what are",
"what's" : "what is",
"what've" : "what have",
"where's" : "where is",
"who'd" : "who would",
"who'll" : "who will",
"who're" : "who are",
"who's" : "who is",
"who've" : "who have",
"won't" : "will not",
"wouldn't" : "would not",
"you'd" : "you would",
"you'll" : "you will",
"you're" : "you are",
"you've" : "you have",
"'re": " are",
"wasn't": "was not",
"we'll":" will",
"didn't": "did not"
}
# define a function to remove apostrophes
def appos_remove(sen):
    return [appos_dict[word] if word in appos_dict else word for word in sen]
# define a function to replace punctuations with white space
def remove_punct(text):
    text_nopunct = re.sub('['+string.punctuation+']', ' ', text)
    return text_nopunct
# def remove_punct_sen(sen):
#     return [remove_punct(word) for word in sen]
# Remove non-alphabetic tokens, such as numbers
def remove_non_alpha(sen):
    return [word for word in sen if word.isalpha()]
def removeStopWords(tokens):
    return [word for word in tokens if word not in custom_stop_words]
# Word normalization: lemmatization
def lemmatization(tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in tokens]
# create my own stopword list
custom_stop_words = ['i','me','my','myself',
'we','us','our','ours','ourselves',
'you',"you're","you've","you'll","you'd",'your','yours','yourself',
'yourselves',
'he','him','his','himself',
'she',"she's",'her','hers','herself',
'it',"it's", 'its','itself',
'they', 'them', 'their', 'theirs','themselves',
'what','which', 'who','whom', 'this','that', "that'll",'these','those',
'am', 'is','are', 'was','were','be', 'been', 'being','will','would',
'here','there',
'have','has','had', 'having', 'do','does','did','doing',
'a','an',"and",'the', 'or',
't','d','s', 'll','m','o','re','ve', 'ma',
'to','of','for','in','out','with','on','up', 'at','as','from','about']
# define a function to clean data
def text_cleaning(dataset):
    # 1. Remove non-English reviews
    # first remove any leading and trailing whitespace
    striped_sen = [sen.strip() for sen in dataset["text"]]
    english_text = []
    notlangs = []
    for i in range(len(dataset)):
        try:
            detected_lang = detect(striped_sen[i])
            if detected_lang == "en":
                english_text.append(i)
        except:
            notlangs.append(i)
    full_ind = [i for i in range(len(dataset))]
    english_set = set(english_text)  # this reduces the lookup time from O(n) to O(1)
    not_eng = [ind for ind in full_ind if ind not in english_set]
    dataset.drop(index=not_eng, axis=0, inplace=True)
    # 2. Convert to lowercase
    lower_case = [[sen.lower()] for sen in dataset['text']]
    # 3. Split sentences into words by whitespace
    words = [sen[0].split() for sen in lower_case]
    # 4. Expand apostrophe contractions into their full forms
    appos_removed = [appos_remove(sen) for sen in words]
    # rejoin again
    rejoined_sen = [' '.join(sen) for sen in appos_removed]
    # lowercase again
    lower_sen = [sen.lower() for sen in rejoined_sen]
    # 5. Punctuation: replace punctuation in the text with white space
    text_rem_punct = [remove_punct(sen) for sen in lower_sen]
    # 6. Tokenize on whitespace
    token_word = [WhitespaceTokenizer().tokenize(sen) for sen in text_rem_punct]
    # 7. Remove non-alphabetic tokens
    non_alpha_removed = [remove_non_alpha(sen) for sen in token_word]
    # 8. Remove stop words
    filtered_words = [removeStopWords(sen) for sen in non_alpha_removed]
    # 9. Apply lemmatization
    lemmatized_words = [lemmatization(sen) for sen in filtered_words]
    # Add cleaned (lemmatized) text back to the dataframe
    dataset.loc[:, 'text_clean'] = [' '.join(sen) for sen in lemmatized_words]
    # Add tokenized cleaned text back to the dataframe
    dataset.loc[:, 'tokens'] = lemmatized_words
    return dataset
#train= pd.read_csv("yelp_train.csv")
valid= pd.read_csv("yelp_valid.csv")
test= pd.read_csv("yelp_test.csv")
#train = text_cleaning(train)
train.loc[:,["stars_review","tokens","text_clean","text"]].head()
valid= text_cleaning(valid)
valid.loc[:,["stars_review","tokens","text_clean","text"]].head()
valid.shape
valid['text'][6]
valid['text_clean'][6]
# save cleaned training dataset
train.to_csv("yelp_train_cleaned_0325.csv")
# save cleaned validation dataset
valid.to_csv("yelp_valid_cleaned_0325.csv")
train = pd.read_csv("yelp_train_cleaned_0325.csv")
valid = pd.read_csv("yelp_valid_cleaned_0325.csv")
train['text_clean'][8]
To choose an appropriate vocabulary size and maximum sequence length for this dataset, I drew on Paul Nation and Robert Waring's paper "Vocabulary Size, Text Coverage and Word Lists". Their idea is that a vocabulary of about 3,000 words, providing coverage of at least 95% of a text, allows new language learners (here, the models I will train) to learn unknown words efficiently from context (Paul Nation and Robert Waring). http://www.fltr.ucl.ac.be/fltr/germ/etan/bibs/vocab/cup.html
Based on that result, I decided to use a vocabulary large enough to cover at least 98% (greater than 95%) of the entire training text, with a vocabulary size of no fewer than 3,000 words. The maximum sequence length is chosen to be at least as long as 95% of all reviews in the train set.
Let's take a look at the cleaned text to see how big the vocabulary is.
# helper function to summarize the vocabulary of a dataset
def check_vocab(dataset):
    dataset_info = {}
    all_words = [word for tokens in dataset.tokens for word in tokens]
    vocab = sorted(list(set(all_words)))
    sentence_lengths = [len(tokens) for tokens in dataset.tokens]
    dataset_info["number_all_words"] = len(all_words)
    dataset_info["number_vocab"] = len(vocab)
    dataset_info["max_sentence_lengths"] = max(sentence_lengths)
    print("%s words total, with a vocabulary size of %s" % (len(all_words), len(vocab)))
    print("Max sentence length is %s" % max(sentence_lengths))
    return dataset_info
# All words in the training set
train_info = check_vocab(train)
train_info
# average sequence length of all train set reviews is 59
np.mean([len(tokens) for tokens in train.tokens])
# 95% of the reviews have sequences shorter than 164 tokens
np.quantile([len(tokens) for tokens in train.tokens],0.95)
# 98% of the reviews have sequences no longer than 225 tokens
np.quantile([len(tokens) for tokens in train.tokens],0.98)
# Use 225 as max sequence length
max_sequence_len = 225
all_cleaned_train_words = [word for tokens in train["tokens"] for word in tokens]
cleaned_train_vocab = sorted(list(set(all_cleaned_train_words)))
print("The cleaned train set has a total of "+ str(len(all_cleaned_train_words)) +
" words with a vocab size of " + str(len(cleaned_train_vocab))+
" unique words." )
print("{:.2%}".format(len(cleaned_train_vocab)/len(all_cleaned_train_words)) + " of words in the train set are unique")
# create a vocab dictionary to record the frequency of each word
cleaned_train_dict = {}
for word in all_cleaned_train_words:
    try:
        cleaned_train_dict[word] += 1
    except KeyError:
        cleaned_train_dict[word] = 1
# create function to sort the vocab dictionary from most frequent to least
def sortFreqDict(freqdict):
    freqword_list = [(freqdict[key], key) for key in freqdict]
    freqword_list.sort()
    freqword_list.reverse()
    return freqword_list
#sort the vocab dictionary from most frequent to least
sorted_cleaned_train_list = sortFreqDict(cleaned_train_dict)
# convert list to dict
sorted_cleaned_train_dict = dict(sorted_cleaned_train_list)
# total number of words in the combined train set
all_train_words = len(all_cleaned_train_words)
# sorted_cleaned_train_dict={}
# for (num,w) in sorted_cleaned_train_list:
# sorted_cleaned_train_dict[w] = num
# # calculate number of most frequent words that cover at least 95% of the all texts in train
# count = 0
# words_list=[]
# words_num =0
# for num, word in sorted_cleaned_train_dict.items():
# if count/all_train_words <= 0.99:
# count += num
# words_list.append(word)
# words_num += 1
count_1 = 0
words_list_1 =[]
words_num_1 =0
for (num, word) in sorted_cleaned_train_list:
    if count_1/all_train_words <= 0.95:
        count_1 += num
        words_list_1.append(word)
        words_num_1 += 1
# it seems the 4,911 most frequent words give 95% coverage; rounding up suggests a vocab size of about 5,000
print("The first " + str(words_num_1) + " most frequent words cover at least 95% of entire train text")
count = 0
words_list=[]
words_num =0
for (num, word) in sorted_cleaned_train_list:
    if count/all_train_words <= 0.98:
        count += num
        words_list.append(word)
        words_num += 1
# the first ~10,000 most frequent words give 98% coverage of the entire train set
print("The first " + str(words_num) + " most frequent words cover at least 98% of entire train text")
Based on the information above, I will map each word onto a 300-dimensional real-valued vector (the vector length is determined by the pre-trained word2vec model). I will also limit the number of words modeled to the 10,000 most frequent words in the train set and zero out the rest. Finally, since the sequence length (number of words) of each review varies, we constrain each review to 225 words, truncating longer reviews and padding shorter ones with zero values.
# Based on the information above, set the vocabulary size, max sequence length, and OOV token
vocab_size=10000
max_sequence_len = 225
oov_tok = '<OOV>'
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok, lower=True, char_level=False)
tokenizer.fit_on_texts(train["text_clean"].tolist())
train_word_index= tokenizer.word_index
train_word_index
len_train_word_index = len(train_word_index)
len_train_word_index
print("Found %s unique tokens."% len(train_word_index))
train_sequences= tokenizer.texts_to_sequences(train["text_clean"].tolist())
#Need to pad our data as the sequence length (number of words) in each review varies.
train_padded = pad_sequences(train_sequences,
maxlen=max_sequence_len,
padding="post", truncating="post")
train_padded.shape
print(len(train_sequences[0]))
print(len(train_padded[0]))
print(len(train_sequences[10]))
print(len(train_padded[10]))
# Use the tokenizer and pad_sequences to transform valid dataset
valid_sequences = tokenizer.texts_to_sequences(valid["text_clean"].tolist())
valid_padded = pad_sequences(valid_sequences, maxlen=max_sequence_len,
padding="post", truncating="post")
valid_padded.shape
print(len(valid_sequences[0]))
print(len(valid_padded[0]))
print(len(valid_sequences[10]))
print(len(valid_padded[10]))
savetxt("train_padded.csv", train_padded, delimiter=',')
savetxt("valid_padded.csv", valid_padded, delimiter=',')
# load array
train_padded = loadtxt('train_padded.csv', delimiter=',')
# load array
valid_padded = loadtxt('valid_padded.csv', delimiter=',')
# check the shape of the loaded array
train_padded.shape
# check the shape of the loaded array
valid_padded.shape
reverse_word_index = dict([(value, key) for (key, value) in train_word_index.items()])
def decode_article(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
print(decode_article(train_padded[10]))
print('---')
print(train.text_clean[10])
https://github.com/kk7nc/Text_Classification/blob/master/README.rst#term-frequency
I am going to use pre-trained vectors trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases.
# First load the Google's pre-trained Word2Vec model.
word2vec_path = "GoogleNews-vectors-negative300.bin.gz"
word2vec =gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)
embedding_size = word2vec.vector_size
embedding_size
#Embedding weights for the entire vocabulary of the training set
all_train_embedding_weights = np.zeros((len(train_word_index)+1, embedding_size))
# create embedding weights for the entire train vocab
for word, index in train_word_index.items():
    all_train_embedding_weights[index, :] = word2vec[word] if word in word2vec else np.random.rand(embedding_size)
print(all_train_embedding_weights.shape)
# save to csv file
savetxt('all_train_embedding_weights.csv', all_train_embedding_weights, delimiter=',')
# embedding weights for the chosen vocabulary size 10000
train_embedding_weights = np.zeros((vocab_size , embedding_size))
for word, index in train_word_index.items():
    if index <= 10000:
        train_embedding_weights[index-1, :] = word2vec[word] if word in word2vec else np.random.rand(embedding_size)
train_embedding_weights.shape
# save to csv file
savetxt('train_embedding_weights.csv', train_embedding_weights, delimiter=',')
# load back the embedding weights
train_embedding_weights = loadtxt('train_embedding_weights.csv', delimiter=',')
train_embedding_weights.shape
reverse_word_index = dict([(value, key) for (key, value) in train_word_index.items()])
def decode_article(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
print(decode_article(train_padded[10]))
print('---')
print(train.text_clean[10])
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.twitter.27B.200d.txt'
word2vec_output_file = 'glove.twitter.27B.200d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)
glove_path='glove.twitter.27B.200d.txt.word2vec'
glove =gensim.models.KeyedVectors.load_word2vec_format(glove_path, binary=False)
embedding_size_glove = glove.vector_size
embedding_size_glove
#Embedding weights for the entire vocabulary of the training set
all_train_embedding_weights_glove = np.zeros((len(train_word_index)+1, embedding_size_glove))
# create embedding weights for the entire train vocab
for word, index in train_word_index.items():
    all_train_embedding_weights_glove[index, :] = glove[word] if word in glove else np.random.rand(embedding_size_glove)
print(all_train_embedding_weights_glove.shape)
# save to csv file
savetxt('all_train_embedding_weights_glove.csv', all_train_embedding_weights_glove, delimiter=',')
# embedding weights for the chosen vocabulary size 10000
train_embedding_weights_glove = np.zeros((vocab_size , embedding_size_glove))
for word, index in train_word_index.items():
    if index <= 10000:
        train_embedding_weights_glove[index-1, :] = glove[word] if word in glove else np.random.rand(embedding_size_glove)
train_embedding_weights_glove.shape
# save to csv file
savetxt('train_embedding_weights_glove.csv', train_embedding_weights_glove, delimiter=',')
# # load back the embedding weights
# train_embedding_weights_glove = loadtxt('train_embedding_weights_glove.csv', delimiter=',')
# train_embedding_weights_glove.shape
ytrain = train[['review_stars__1.0', 'review_stars__2.0',
'review_stars__3.0', 'review_stars__4.0', 'review_stars__5.0']]
yvalid = valid[['review_stars__1.0', 'review_stars__2.0',
'review_stars__3.0', 'review_stars__4.0', 'review_stars__5.0']]
ytrain.head()
yvalid.head()
labels = train.stars_review.value_counts().index.tolist()
labels
# let's review the parameters we set up
vocab_size=10000
max_sequence_len = 225
oov_tok = '<OOV>'
embedding_size=300
#len_train_word_index = 87529
First, train the following four models on the word2vec-embedded data.
With the word2vec-embedded training data, the best of the four models is the BiLSTM. I then trained the BiLSTM model on the GloVe-embedded data and found that the GloVe model performed slightly better than the word2vec model at predicting review ratings.
Two measures are used to prevent overfitting: early stopping and checkpoints. Early stopping monitors the loss on the validation data, and the model checkpoint saves the best model based on validation accuracy. The early-stopping patience is set to 10 epochs.
parameters for lstm_model :
max_sequence_len =225
vocab_size =10000
embedding_size =300
# Create an instance of Sequential called "lstm_model"
lstm_model = Sequential()
#add an Embedding layer
lstm_model.add(Embedding(input_dim =vocab_size,
output_dim = embedding_size,
weights=[train_embedding_weights],
input_length=max_sequence_len,
trainable=False))
# Add a LSTM layer
lstm_model.add(LSTM(units = 64, return_sequences=True, recurrent_dropout=0.2))
# Add 2nd LSTM layer
lstm_model.add(LSTM(units = 64, recurrent_dropout=0.2))
# Add a dropout layer
lstm_model.add(Dropout(rate=0.2))
lstm_model.add(Dense(32, activation="relu"))
lstm_model.add(Dropout(rate=0.2))
# Add a Dense Layer
lstm_model.add(Dense(units=5, activation = 'softmax'))
# Compile
lstm_model.compile(optimizer = "adam", loss = 'categorical_crossentropy',
metrics = ["accuracy"])
lstm_model.summary()
#patient early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)
mc = ModelCheckpoint('best_lstm_model.h5', monitor='val_accuracy', mode='max', verbose=1,
save_best_only=True)
lstm_model.fit(x = train_padded, y = ytrain, batch_size = 128 , epochs = 20,
validation_split=0.25, verbose=1, callbacks=[es, mc])
# load the best_model
saved_best_lstm_model = load_model('best_lstm_model.h5')
keras.utils.vis_utils.pydot = pyd
#Visualize Model
def visualize_model(model):
    return SVG(model_to_dot(model, dpi=45).create(prog='dot', format='svg'))
#call the function on your model
visualize_model(saved_best_lstm_model)
_, train_acc = saved_best_lstm_model.evaluate(train_padded, ytrain, verbose=1)
_, valid_acc = saved_best_lstm_model.evaluate(valid_padded, yvalid, verbose=1)
print('Train: %.3f, Test: %.3f' % (train_acc, valid_acc))
valid_pred_lstm = saved_best_lstm_model.predict(valid_padded)
valid_pred_lstm_result=[]
for i in range(len(valid_pred_lstm)):
    valid_pred_lstm_result.append(np.argmax(valid_pred_lstm[i])+1)
print(classification_report(valid['stars_review'], valid_pred_lstm_result))
parameters for bilstm_model:
max_sequence_len =225
embedding_size =300
vocab_size=10000
The keras.layers.Bidirectional wrapper can also be used with an RNN layer. This propagates the input forward and backwards through the RNN layer and then concatenates the output. This helps the RNN to learn long range dependencies.
# embedding layer output_dim
embedding_size =300
# Create an instance of Sequential called "bilstm_model"
bilstm_model = Sequential()
#add an Embedding layer
bilstm_model.add(Embedding(input_dim =vocab_size,
output_dim = embedding_size,
weights=[train_embedding_weights],
input_length=max_sequence_len,
trainable=False))
# Add a LSTM layer
bilstm_model.add(Bidirectional(LSTM(units = 256, return_sequences=True,
recurrent_dropout=0.5)))
# Add 2nd LSTM layer
bilstm_model.add(Bidirectional(LSTM(units = 64,
dropout=0.2, recurrent_dropout=0.5)))
# Add a Dense Layer
bilstm_model.add(Dense(32, activation="relu"))
# add a dropout layer
bilstm_model.add(Dropout(rate=0.5))
# Add a Dense Layer
bilstm_model.add(Dense(units=5, activation = 'softmax'))
# Compile
bilstm_model.compile(optimizer = "adam", loss = 'categorical_crossentropy',
metrics = ["accuracy"])
bilstm_model.summary()
keras.utils.vis_utils.pydot = pyd
#Visualize Model
def visualize_model(model):
    return SVG(model_to_dot(model, dpi=45).create(prog='dot', format='svg'))
#call the function on the model just built (it has not been trained yet)
visualize_model(bilstm_model)
#patient early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)
mc = ModelCheckpoint('best_bilstm_model.h5', monitor='val_accuracy', mode='max', verbose=1,
save_best_only=True)
bilstm_hist= bilstm_model.fit(x = train_padded, y = ytrain, batch_size = 128 , epochs = 20,
validation_split=0.25, verbose=1, callbacks=[es, mc])
# load the best_model
saved_best_bilstm_model = load_model('best_bilstm_model.h5')
_, train_acc = saved_best_bilstm_model.evaluate(train_padded, ytrain, verbose=1)
_, valid_acc = saved_best_bilstm_model.evaluate(valid_padded, yvalid, verbose=1)
print('Train: %.3f, Test: %.3f' % (train_acc, valid_acc))
valid_pred_bilstm = saved_best_bilstm_model.predict(valid_padded)
valid_pred_bilstm_result=[]
for i in range(len(valid_pred_bilstm)):
    valid_pred_bilstm_result.append(np.argmax(valid_pred_bilstm[i])+1)
print(classification_report(valid['stars_review'], valid_pred_bilstm_result))
keras.utils.vis_utils.pydot = pyd
#Visualize Model
def visualize_model(model):
    return SVG(model_to_dot(model, dpi=45).create(prog='dot', format='svg'))
visualize_model(saved_best_bilstm_model)
max_sequence_len
def build_model_cnn(word_index_len, embedding_dim, embedding_matrix, nclasses,
                    MAX_SEQUENCE_LENGTH, num_filters=64, dropout_rate=0.5):
    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
    embedded_sequences = Embedding(input_dim=word_index_len,
                                   output_dim=embedding_dim,
                                   weights=[embedding_matrix],
                                   input_length=MAX_SEQUENCE_LENGTH,
                                   trainable=False)(sequence_input)
    convs = []
    filter_sizes = [2, 3, 4, 5, 6]
    for filter_size in filter_sizes:
        l_conv = Conv1D(filters=num_filters,
                        kernel_size=filter_size,
                        activation='relu',
                        name='Conv_'+'_'+str(filter_size))(embedded_sequences)
        l_pool = GlobalMaxPooling1D()(l_conv)
        convs.append(l_pool)
    l_merge = concatenate(convs, axis=1)
    x = Dropout(dropout_rate)(l_merge)
    x = Dense(128, activation='relu')(x)
    x = Dropout(dropout_rate)(x)
    preds = Dense(nclasses, activation="softmax")(x)
    model = Model(sequence_input, preds)
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    model.summary()
    return model
cnn_model = build_model_cnn(word_index_len = vocab_size,
embedding_dim= embedding_size,
embedding_matrix = train_embedding_weights,
nclasses = len(labels),
MAX_SEQUENCE_LENGTH= max_sequence_len,
num_filters =200,
dropout_rate=0.5)
cnn_model.summary()
keras.utils.vis_utils.pydot = pyd
#Visualize Model
def visualize_model(model):
    return SVG(model_to_dot(model, dpi=45).create(prog='dot', format='svg'))
visualize_model(cnn_model)
#patient early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)
mc = ModelCheckpoint('cnn_model_best.h5', monitor='val_accuracy', mode='max', verbose=1,
save_best_only=True)
cnn_model_hist = cnn_model.fit(x = train_padded, y = ytrain, batch_size = 128 , epochs = 20,
validation_split=0.2, verbose=1, callbacks=[es, mc])
# load the best_model
saved_best_cnn_model = load_model('cnn_model_best.h5')
# get the loss value & the accuracy value on the test data.
_, train_acc = saved_best_cnn_model.evaluate(train_padded, ytrain, verbose=1)
_, valid_acc = saved_best_cnn_model.evaluate(valid_padded, yvalid, verbose=1)
print('Train: %.3f, Valid: %.3f' % (train_acc, valid_acc))
def plot_graphs(model, metric):
    plt.plot(model.history[metric])
    plt.plot(model.history['val_'+metric], '')
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, 'val_'+metric])
    plt.show()
plot_graphs(cnn_model_hist,"accuracy" )
plot_graphs(cnn_model_hist,"loss" )
valid_pred_cnn = saved_best_cnn_model.predict(valid_padded)
valid_pred_cnn_result=[]
for i in range(len(valid_pred_cnn)):
    valid_pred_cnn_result.append(np.argmax(valid_pred_cnn[i])+1)
print(classification_report(valid['stars_review'], valid_pred_cnn_result))
max_sequence_len =225
embedding_size =300
vocab_size =10000
def build_cnn_lstm_model(word_index_len, embedding_matrix, nclasses, MAX_SEQUENCE_LENGTH,
                         embedding_dim, num_filters, dropout_rate):
    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
    embedded_sequences = Embedding(input_dim=word_index_len,
                                   output_dim=embedding_dim,
                                   weights=[embedding_matrix],
                                   input_length=MAX_SEQUENCE_LENGTH,
                                   trainable=False)(sequence_input)
    convs = []
    filter_sizes = [2, 3, 4, 5]
    for filter_size in filter_sizes:
        l_conv = Conv1D(filters=num_filters,
                        kernel_size=filter_size,
                        padding='same',
                        activation='relu')(embedded_sequences)
        l_pool = MaxPooling1D(pool_size=MAX_SEQUENCE_LENGTH - filter_size + 1)(l_conv)
        convs.append(l_pool)
    l_merge = concatenate(convs, axis=1)
    l_lstm = Bidirectional(LSTM(units=256,
                                recurrent_dropout=0.5))(l_merge)
    x = Dropout(dropout_rate)(l_lstm)
    x = Dense(128, activation='relu', kernel_regularizer=regularizers.l2(1))(x)
    x = Dropout(dropout_rate)(x)
    preds = Dense(nclasses, activation="softmax")(x)
    model = Model(sequence_input, preds)
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    model.summary()
    return model
cnn_lstm_model = build_cnn_lstm_model(word_index_len = vocab_size,
embedding_matrix = train_embedding_weights,
nclasses = len(labels),
MAX_SEQUENCE_LENGTH = max_sequence_len,
embedding_dim= embedding_size,
num_filters =100 ,
dropout_rate=0.5)
cnn_lstm_model.summary()
keras.utils.vis_utils.pydot = pyd
#Visualize Model
def visualize_model(model):
    return SVG(model_to_dot(model, dpi=45).create(prog='dot', format='svg'))
#create your model
#then call the function on your model
visualize_model(cnn_lstm_model)
#patient early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=8)
mc = ModelCheckpoint('cnn_lstm_best_model.h5', monitor='val_accuracy', mode='max', verbose=1,
save_best_only=True)
cnn_lstm_model_hist = cnn_lstm_model.fit(x = train_padded, y = ytrain, batch_size = 128 ,
epochs = 15,
validation_split=0.2, verbose=1, callbacks=[es, mc])
# load the best_model
saved_best_cnn_lstm_model = load_model('cnn_lstm_best_model.h5')
_, train_acc = saved_best_cnn_lstm_model.evaluate(train_padded, ytrain, verbose=1)
_, valid_acc = saved_best_cnn_lstm_model.evaluate(valid_padded, yvalid, verbose=1)
print('Train: %.3f, Valid: %.3f' % (train_acc, valid_acc))
plot_graphs(cnn_lstm_model_hist, 'accuracy')
plot_graphs(cnn_lstm_model_hist, 'loss')
valid_pred_cnn_lstm = saved_best_cnn_lstm_model.predict(valid_padded)
valid_pred_cnn_lstm_result=[]
for i in range(len(valid_pred_cnn_lstm)):
    valid_pred_cnn_lstm_result.append(np.argmax(valid_pred_cnn_lstm[i])+1)
print(classification_report(valid['stars_review'], valid_pred_cnn_lstm_result))
parameters for glove_bilstm_model:
max_sequence_len =225
embedding_size =200
vocab_size=10000
# embedding layer output_dim
embedding_size_glove
# Create an instance of Sequential called "glove_bilstm_model"
glove_bilstm_model = Sequential()
#add an Embedding layer
glove_bilstm_model.add(Embedding(input_dim =vocab_size,
output_dim = embedding_size_glove ,
weights=[train_embedding_weights_glove],
input_length=max_sequence_len,
trainable=False))
# Add a LSTM layer
glove_bilstm_model.add(Bidirectional(LSTM(units = 256, return_sequences=True,
recurrent_dropout=0.5)))
# Add 2nd LSTM layer
glove_bilstm_model.add(Bidirectional(LSTM(units = 64,
dropout=0.2, recurrent_dropout=0.5)))
# add a dropout layer
glove_bilstm_model.add(Dropout(rate=0.5))
# Add a Dense Layer
glove_bilstm_model.add(Dense(32, activation="relu"))
# add a dropout layer
glove_bilstm_model.add(Dropout(rate=0.5))
# Add a Dense Layer
glove_bilstm_model.add(Dense(units=5, activation = 'softmax'))
# Compile
glove_bilstm_model.compile(optimizer = "adam", loss = 'categorical_crossentropy',
metrics = ["accuracy"])
glove_bilstm_model.summary()
keras.utils.vis_utils.pydot = pyd
#Visualize Model
def visualize_model(model):
    return SVG(model_to_dot(model, dpi=45).create(prog='dot', format='svg'))
#call the function on your model
visualize_model(glove_bilstm_model)
#patient early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)
mc = ModelCheckpoint('best_glove_bilstm_model.h5', monitor='val_accuracy', mode='max', verbose=1,
save_best_only=True)
glove_bilstm_hist= glove_bilstm_model.fit(x = train_padded, y = ytrain, batch_size = 128 , epochs = 20,
validation_split=0.25, verbose=1, callbacks=[es, mc])
The training process stopped unexpectedly during epoch 6 (the Jupyter kernel shut down). I will load the saved best model, evaluate it on the train and validation datasets, and then continue training it.
# load the best_model
saved_best_glove_bilstm_model = load_model('best_glove_bilstm_model.h5')
_, train_acc = saved_best_glove_bilstm_model.evaluate(train_padded, ytrain, verbose=1)
_, valid_acc = saved_best_glove_bilstm_model.evaluate(valid_padded, yvalid, verbose=1)
print('Train: %.3f, Test: %.3f' % (train_acc, valid_acc))
valid_pred_glove_bilstm = saved_best_glove_bilstm_model.predict(valid_padded)
labels=[1,2,3,4,5]
valid_pred_glove_bilstm_result=[]
for p in valid_pred_glove_bilstm:
    valid_pred_glove_bilstm_result.append(labels[np.argmax(p)])
print(classification_report(valid['stars_review'], valid_pred_glove_bilstm_result))
Continue training the glove_bilstm model
#patient early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)
mc = ModelCheckpoint('best_glove_bilstm_model.h5', monitor='val_accuracy', mode='max', verbose=1,
save_best_only=True)
glove_bilstm_hist= saved_best_glove_bilstm_model.fit(x = train_padded, y = ytrain, batch_size = 128 , epochs = 10,
validation_split=0.25, verbose=1, callbacks=[es, mc])
# load the best_model
saved_best_glove_bilstm_model_2 = load_model('best_glove_bilstm_model.h5')
_, train_acc = saved_best_glove_bilstm_model_2.evaluate(train_padded, ytrain, verbose=1)
_, valid_acc = saved_best_glove_bilstm_model_2.evaluate(valid_padded, yvalid, verbose=1)
print('Train: %.3f, Test: %.3f' % (train_acc, valid_acc))
valid_pred_glove_bilstm_2 = saved_best_glove_bilstm_model_2.predict(valid_padded)
labels=[1,2,3,4,5]
valid_pred_glove_bilstm_result_2=[]
for p in valid_pred_glove_bilstm_2:
    valid_pred_glove_bilstm_result_2.append(labels[np.argmax(p)])
print(classification_report(valid['stars_review'], valid_pred_glove_bilstm_result_2))
test = pd.read_csv("yelp_test.csv")
y_test = test[['review_stars__1.0', 'review_stars__2.0',
'review_stars__3.0', 'review_stars__4.0', 'review_stars__5.0']]
test = text_cleaning(test)
test.loc[:,["stars_review","tokens","text_clean","text"]].head()
test.to_csv("test_cleaned_032520.csv")
#test = pd.read_csv("test_cleaned_032520.csv")
# Use the tokenizer and pad_sequences to transform the test dataset
test_sequences = tokenizer.texts_to_sequences(test["text_clean"].tolist())
test_padded = pad_sequences(test_sequences, maxlen=max_sequence_len,
padding="post", truncating="post")
savetxt("test_padded.csv", test_padded, delimiter=',')
test_pred_bilstm = saved_best_bilstm_model.predict(test_padded, batch_size=2048,
verbose=1)
labels = [1,2,3,4,5]
test_pred_bilstm_result=[]
for p in test_pred_bilstm:
    test_pred_bilstm_result.append(labels[np.argmax(p)])
print(classification_report(test['stars_review'], test_pred_bilstm_result))
test.loc[:,["stars_review","tokens","text_clean","text"]].head()
# Use the tokenizer and pad_sequences to transform the test dataset
test_sequences = tokenizer.texts_to_sequences(test["text_clean"].tolist())
test_padded = pad_sequences(test_sequences, maxlen=max_sequence_len,
padding="post", truncating="post")
test_pred_glove_bilstm = saved_best_glove_bilstm_model_2.predict(test_padded, batch_size=2048,
verbose=1)
labels = [1,2,3,4,5]
test_pred_glove_bilstm_result=[]
for p in test_pred_glove_bilstm:
    test_pred_glove_bilstm_result.append(labels[np.argmax(p)])
print(classification_report(test['stars_review'], test_pred_glove_bilstm_result))
The BiLSTM model with GloVe embeddings is slightly better than the word2vec model: the overall accuracy is 68% for the GloVe model versus 67% for the word2vec model.
Looking at the recall score for each class (star rating), the model is very good at distinguishing 1-star and 5-star reviews, moderate at 4-star, and somewhat poor on 2- and 3-star reviews. This makes sense, as 1-star and 5-star reviews tend to express stronger sentiment (strongly negative or strongly positive), so the model can pick up the differences and predict them more accurately than the middle ratings (2, 3, and 4 stars).
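To see exactly which neighbouring star levels get confused with each other, a row-normalized confusion matrix over the GloVe BiLSTM test predictions makes the pattern explicit (a minimal sketch using sklearn's confusion_matrix and the prediction list defined above):
# sketch: row-normalized confusion matrix for the GloVe BiLSTM test predictions
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(test['stars_review'], test_pred_glove_bilstm_result, labels=[1, 2, 3, 4, 5])
cm_norm = cm / cm.sum(axis=1, keepdims=True)
sns.heatmap(cm_norm, annot=True, fmt=".2f", cmap="Blues",
            xticklabels=[1, 2, 3, 4, 5], yticklabels=[1, 2, 3, 4, 5])
plt.xlabel("Predicted star rating")
plt.ylabel("True star rating")
plt.title("GloVe BiLSTM confusion matrix (row-normalized)")
plt.show()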