Last active
          March 23, 2022 23:24 
        
    Sentiment Classification : Amazon Fine Food Reviews Dataset
  
        
  
    
    
  
  
    
  | Amazon Fine Food Reviews: A Sentiment Classification Problem | |
| Abhishek Grover | |
| 1503611 | |
| [email protected] | |
| Instructor: Jay Pujara | |
| [email protected] | |
| 1. Abstract | |
| The internet is full of websites that let users write reviews for products and services available online and offline. Websites such as Yelp, Zomato and IMDb owe much of their success to the authenticity and accuracy of the reviews they host. The success of retail websites such as Amazon and eBay is likewise affected by the quality of the reviews of their products. All these sites let a reviewer write comments about a service or product and give it a rating. Based on the rating, each review can be labeled as good or bad, and from this data a model can be trained to identify the sentiment expressed in a review. This has many possible applications: the learned model can be used to identify sentiment in reviews or other text that carries no explicit sentiment information such as a score or rating. For example, people post comments about restaurants on Facebook and Twitter, which do not provide any rating mechanism. This project tackles the problem by employing text classification techniques and learning several models based on different algorithms: Decision Tree, Perceptron, Naïve Bayes and Logistic Regression. This paper discusses the problems faced while performing sentiment classification on a large dataset and what can be done to solve them. | |
| 2. Introduction | |
| The main goal of the project is to analyze a large dataset and perform sentiment classification on it. Sentiment classification is a type of text classification in which a given text is classified according to the polarity of the opinion it contains. This project uses the Amazon Fine Food Reviews dataset, which is available on Kaggle. [1][4] | |
| The following sections describe the main phases of sentiment classification: exploratory data analysis of the dataset, the preprocessing steps applied to the data, the learning algorithms used and the results they gave, and finally an analysis of those results. | |
| 3. Exploratory Data Analysis | |
| The Amazon Fine Food Reviews dataset is about 300 MB in size and consists of around 568k reviews of Amazon food products written between 1999 and 2012. Each review has the following 10 features: | |
| • Id | |
| • ProductId - unique identifier for the product | |
| • UserId - unique identifier for the user | |
| • ProfileName | |
| • HelpfulnessNumerator - number of users who found the review helpful | |
| • HelpfulnessDenominator - number of users who indicated whether they found the review helpful | |
| • Score - rating between 1 and 5 | |
| • Time - timestamp for the review | |
| • Summary - brief summary of the review | |
| • Text - text of the review | |
| Of these 10 features, ‘Score’, ‘Summary’ and ‘Text’ are the ones with predictive value for this task. ‘Text’ is largely redundant, as the summary is sufficient to extract the sentiment of the review, so it is enough to load only ‘Score’ and ‘Summary’ from the SQLite data file. Score takes a value between 1 and 5. For the purpose of the project, all reviews with a score above 3 are encoded as positive and those with a score of 3 or below are encoded as negative. The mean score is 4.18, so one should expect more positive than negative reviews. The data looks something like this. | |
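The loading and encoding step can be sketched as follows. This is a minimal illustration, not the project's actual code: it builds a tiny in-memory SQLite table as a stand-in for the data file, and the table name `Reviews` and the sample rows are assumptions made for the example.

```python
import sqlite3

# In-memory stand-in for the dataset file. The table name "Reviews" and the
# sample rows are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Summary TEXT, Text TEXT)")
conn.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?, ?)",
    [
        (1, 5, "Great taste", "Loved every bite."),
        (2, 3, "Just okay", "Nothing special."),
        (3, 1, "Not as advertised", "Arrived stale."),
        (4, 4, "Good value", "Would buy again."),
    ],
)


def encode_label(score):
    """Scores above 3 are positive; scores of 3 or below are negative."""
    return "positive" if score > 3 else "negative"


# Load only the two columns that are needed for classification.
rows = conn.execute("SELECT Score, Summary FROM Reviews ORDER BY Id").fetchall()
labeled = [(summary, encode_label(score)) for score, summary in rows]
conn.close()

for summary, label in labeled:
    print(f"{label:8s} {summary}")
```

Note that a score of exactly 3 falls on the negative side under this encoding.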
| (Figure 1) | |
| After loading the data, there are exactly 568454 reviews in the dataset. As expected, encoding the score splits the dataset into 124677 negative reviews and 443777 positive reviews. | |
| (Figure 2) | |
| From the label distribution one can conclude that the dataset is skewed, as it has a large number of positive reviews and far fewer negative reviews. Positive reviews form 78.07 % of the dataset and negative reviews form 21.93 % of the dataset. This is an important piece of information, as it already indicates that a stratified strategy needs to be used for splitting data for evaluation. | |
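The class shares follow directly from the two counts reported above. A short check of the arithmetic:

```python
# Class counts as reported for the encoded dataset.
negative, positive = 124677, 443777
total = negative + positive

positive_share = 100 * positive / total
negative_share = 100 * negative / total

print(f"{total} reviews: {positive_share:.2f}% positive, {negative_share:.2f}% negative")
# 568454 reviews: 78.07% positive, 21.93% negative
```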
| 4. Preprocessing | |
| To make the data more useful a number of preprocessing techniques are applied, most of them very common in text classification. | |
| • Stop words removal: stop words refer to the most common words in any language. They usually don’t have any predictive value and just increase the size of the feature set. Removing such words from the dataset would be very beneficial. | |
| • Lemmatization: lemmatization is chosen over stemming. Although the goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form, better results were observed when using lemmatization instead of stemming. Explaining the difference between the two is beyond the scope of this paper. | |
| • Punctuation Removal: removing common punctuation marks such as !, ?, “ ” etc. Without this step, ‘Hi!’ and ‘Hi’ would be treated as two different words although they refer to the same thing. | |
| • Upper Case to Lower Case: convert all upper case letters to lower case letters. | |
| • Feature Reduction/Selection: This is the most important preprocessing step for sentiment classification. Classification algorithms are run on a subset of the features, so selecting the right features becomes important. This step will be discussed in detail later in the report. | |
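Three of the steps above (lower-casing, punctuation removal and stop-word removal) can be sketched with the standard library alone. The stop-word list here is a deliberately tiny, hypothetical one, and lemmatization is left out because it needs an external library such as NLTK.

```python
import string

# A deliberately tiny stop-word list for illustration; a real run would use a
# full list from an NLP library.
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "of", "to", "was"}

# string.punctuation covers ASCII punctuation only, so typographic quotes
# would need to be added separately.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lower-case, strip punctuation, and drop stop words from one review."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]


print(preprocess("Hi! This is the BEST coffee."))
# ['hi', 'best', 'coffee']
```

After this step ‘Hi!’ and ‘Hi’ map to the same token.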
| After applying all preprocessing steps except feature reduction/selection, 27048 unique words were obtained from the dataset, and these form the feature set. The next step is to reduce the size of the feature set by applying various feature reduction/selection techniques. After preprocessing, the dataset is split into train and test sets, with the test set consisting of 25% of the samples of the entire dataset. | |
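A stratified 75/25 split keeps the class proportions the same in both parts. The sketch below shows the idea in plain Python on toy labels; the function name and the seed are illustrative choices. In scikit-learn, `train_test_split` with its `stratify` argument serves the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) preserving per-class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels with an 80/20 skew, similar in spirit to the review data.
labels = ["positive"] * 80 + ["negative"] * 20
train_idx, test_idx = stratified_split(labels)

negatives_in_test = sum(1 for i in test_idx if labels[i] == "negative")
print(len(train_idx), len(test_idx), negatives_in_test)
```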
| 5. Feature Reduction/Selection | |
| The vectorized dataset is essentially a 568454*27048 matrix, which is quite large for running any algorithm. It is therefore important to reduce the size of the feature set. There are a number of ways this can be done. | |
| a. PCA | |
| Principal component analysis (PCA) can be used to reduce the feature set [3]. PCA is a procedure that uses an orthogonal transformation to map a set of variables in n-dimensional space to a lower-dimensional space while retaining as much variance as possible. Consider an example in which points are distributed in a 2-d plane with maximum variance along the x-axis. One can represent these points in 1-d by projecting them all onto the x-axis. The x-axis is then the first principal component, the direction along which the data has maximum variance. The same idea extends to higher dimensions. | |
| For the purpose of the project, the feature set is reduced to 200 components using Truncated SVD, which is closely related to PCA (it does not center the data first) and works on sparse matrices. | |
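The 2-d example can be worked through numerically. The sketch below finds the first principal component of a handful of made-up points using the closed-form largest eigenvalue of the 2x2 covariance matrix; the points and function name are illustrative, and real PCA on many dimensions would use a linear algebra library.

```python
import math


def first_principal_component(points):
    """Return (unit direction of maximum variance, mean) for (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix, in closed form.
    largest = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # A matching eigenvector; fall back to an axis when sxy is zero.
    if sxy != 0:
        vx, vy = largest - syy, sxy
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), (mean_x, mean_y)


# Points spread widely along x and only slightly along y.
points = [(-4, 0.2), (-2, -0.1), (0, 0.1), (2, -0.2), (4, 0.1)]
component, mean = first_principal_component(points)

# Project every point onto the first component to get 1-d coordinates.
projected = [
    (x - mean[0]) * component[0] + (y - mean[1]) * component[1]
    for x, y in points
]
print(component, projected)
```

For these points the component is almost exactly the x-axis, so the 1-d coordinates are close to the original x values.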
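With scikit-learn, the reduction could look roughly like the sketch below. The toy summaries and `n_components=2` are stand-ins chosen so the example runs on a tiny corpus; the project itself uses 200 components on the full matrix.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy summaries standing in for the review data.
summaries = [
    "great coffee",
    "great taste and value",
    "bad taste",
    "not good at all",
    "delicious and fresh",
    "stale and bad",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(summaries)  # sparse matrix, one row per review

# n_components=2 only because the toy vocabulary is tiny.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
```

The same fitted `svd` object would then be applied to the test matrix with `svd.transform`.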
| b. Most Frequent Features | |
| Another way to reduce the number of features is to use a subset of the most frequent words occurring in the dataset as the feature set. | |
| The frequency of every word in the training data is computed and the 5000 most common words are selected as features. The reasoning is that reviews contain certain critical words that define their sentiment, and since this is a reviews dataset these words must occur very frequently. 5000 words is still a lot of features, but it reduces the feature set to about 1/5th of the original, which is a workable problem. The frequency distribution for the dataset looks something like below. | |
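Selecting the most frequent words is a small counting exercise. A minimal sketch on toy tokenized reviews, with `k=2` standing in for the 5000 used in the project:

```python
from collections import Counter


def most_frequent_words(tokenized_reviews, k):
    """Return the k most common tokens across all reviews."""
    counts = Counter(tok for review in tokenized_reviews for tok in review)
    return [word for word, _ in counts.most_common(k)]


# Toy training data, already tokenized.
train_reviews = [
    ["great", "coffee"],
    ["great", "taste"],
    ["bad", "taste"],
    ["great", "value"],
]

top_words = most_frequent_words(train_reviews, k=2)
print(top_words)
# ['great', 'taste']
```

The counts are taken from the training data only, so the test set does not influence which words become features.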
| (Figure 3) | |
| From the figure it is visible that words such as great, good, best, love and delicious occur most frequently in the dataset, and these are the words that usually have the most predictive value for sentiment analysis. This also suggests that the dataset is not corrupt or irrelevant to the problem statement. | |
| 6. Learning the models | |
| Four models are trained on the training set and evaluated against the test set. Since the number of samples in the training set is huge, it is not practical to run computationally expensive classification algorithms such as K-Nearest Neighbors or Random Forests. The 4 classifiers used in the project are: | |
| • Decision Tree Classifier | |
| • Naïve Bayes Classifier | |
| • Logistic Regression | |
| • Perceptron | |
| The first problem to tackle is that most classification algorithms expect fixed-size feature vectors with numerical values rather than raw text documents (reviews in this case) of variable size. This can be handled with the Bag-of-Words strategy [2]. This strategy involves 3 steps: | |
| • Tokenization: breaking the document into tokens where each token represents a single word. | |
| • Counting: counting the frequency of each word in the document. | |
| • Normalization: down-weighting the words that occur most often across the corpus. | |
| The reviews can be represented as vectors of numerical values, where each value reflects the frequency of a word in that review. These vectors are then normalized based on the frequency of the tokens/words in the entire corpus. Thus the entire set of reviews can be represented as a single matrix in which each row represents a review and each column represents a word in the corpus. This process is called vectorization. | |
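The three steps can be sketched in plain Python as below. The weighting uses a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization, which is the default scheme documented for scikit-learn's Tf-idf vectorizer; treating that as the scheme used in the project is an assumption.

```python
import math
from collections import Counter


def tokenize(document):
    """Tokenization: split a document into lower-cased word tokens."""
    return document.lower().split()


def tfidf_matrix(documents):
    """Return (vocabulary, rows) where rows[i][j] weights word j in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(documents)

    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed idf: frequent words get a lower weight.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)  # Counting: term frequency per document
        row = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])  # Normalization: unit-length rows
    return vocabulary, rows


vocabulary, matrix = tfidf_matrix(["good coffee", "bad coffee", "good good tea"])
print(vocabulary)
for row in matrix:
    print([round(v, 3) for v in row])
```

In the second document, "bad" receives a higher weight than "coffee" because "coffee" appears in more documents.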
| 6.1 Using PCA | |
| After applying vectorization and before applying any kind of feature reduction/selection, the size of the input matrix is 426340*27048. After applying PCA to reduce features, the input matrix size reduces to 426340*200. The choice of 200 components is the result of running and testing the algorithms with different numbers of components. | |
| 6.1.1 Results | |
| The models are trained on the input matrix generated above. Test data is also transformed in a similar fashion to get a test matrix. Following are the results: | |
| (Figure 4) | |
| From the results it can be seen that the Decision Tree Classifier works best for the dataset. This implies that the dataset splits well on words, which is unsurprising as the meaning of words affects the sentiment of the review. Note that although the accuracy of Perceptron and BernoulliNB does not look bad, the dataset is skewed and contains 78% positive reviews, so always predicting the majority class would already give about 78% accuracy. Compared to that baseline, Perceptron and BernoulliNB do not work well in this case. | |
| Note that for skewed data, accuracy alone is misleading; recall, particularly on the minority class, is a better measure of a model's performance. The performance of all four models is compared below. | |
| (Figure 5) | |
| As claimed earlier, Perceptron and Naïve Bayes predict positive for almost all samples, hence their precision and recall values for negative samples are low. | |
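The effect is easy to reproduce with a classifier that always predicts the majority class. The sketch below uses toy labels with a 78/22 skew, not the project's predictions, and computes precision and recall for the negative class by hand.

```python
def precision_recall(y_true, y_pred, target):
    """Precision and recall for one class; 0.0 when a ratio is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy test set with the same skew as the dataset, and a "model" that
# always predicts the majority class.
y_true = ["pos"] * 78 + ["neg"] * 22
y_pred = ["pos"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
neg_precision, neg_recall = precision_recall(y_true, y_pred, "neg")

print(accuracy, neg_precision, neg_recall)
# 0.78 0.0 0.0
```

An accuracy of 78% here comes with zero recall on negative reviews, which is why accuracy has to be read against the class distribution.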
| (Figure 6) | |
| 6.2 Using Max Frequency Words | |
| The 5000 most frequent words are vectorized using a Tf-idf transformer. Using the same transformer, the train and the test data are also vectorized. This essentially means that only those words of the training and testing data that are among the 5000 most frequent words will have a numerical value in the generated matrices. These matrices are then used for training and evaluating the models. | |
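One way to get this behavior in scikit-learn is the `max_features` option of `TfidfVectorizer`, which keeps only the most frequent terms of the corpus it is fitted on. Whether the project used this option or built the vocabulary separately is not stated, so the sketch below is an assumption; `max_features=2` stands in for 5000 on toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["great coffee", "great taste", "bad taste", "great value"]
test_texts = ["great unknownword", "bad coffee"]

# Keep only the 2 most frequent training words (the project keeps 5000).
vectorizer = TfidfVectorizer(max_features=2)
X_train = vectorizer.fit_transform(train_texts)  # vocabulary learned here
X_test = vectorizer.transform(test_texts)        # same vocabulary reused

kept_words = sorted(vectorizer.vocabulary_)
print(kept_words, X_train.shape, X_test.shape)
```

Words outside the kept vocabulary contribute nothing: the second test review contains none of the kept words, so its row is all zeros.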
| 6.2.1 Results | |
| There is significant improvement in all the models. Following is a result summary. | |
| (Figure 7) | |
| One important thing to note about Perceptron is that it is only guaranteed to converge when the data is linearly separable. With so many features one cannot tell in advance whether Perceptron will converge on this dataset, so restricting its maximum number of iterations is important. Following is a comparison of recall for negative samples. | |
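The need for an iteration cap can be illustrated with a small hand-written perceptron. This is a teaching sketch on two toy problems, not the classifier used in the project: AND is linearly separable and converges, while XOR is not and would loop forever without the cap.

```python
def train_perceptron(samples, labels, max_epochs=10):
    """Train a perceptron; labels are +1/-1. Returns (weights, bias, converged)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified (or on the boundary)
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: converged
            return weights, bias, True
    return weights, bias, False  # stopped by the iteration cap


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]  # linearly separable
xor_labels = [-1, 1, 1, -1]   # not linearly separable

_, _, and_converged = train_perceptron(inputs, and_labels, max_epochs=50)
_, _, xor_converged = train_perceptron(inputs, xor_labels, max_epochs=50)
print(and_converged, xor_converged)
# True False
```

In scikit-learn's `Perceptron`, the corresponding cap is the `max_iter` parameter.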
| (Figure 7) | |
| (Figure 8) | |
| 6.3 Without Feature Reduction/Selection | |
| Lastly, the models are trained without any feature reduction/selection step. The Decision Tree Classifier runs very slowly on datasets with a large number of features, so it is not trained in this setting. | |
| Since the entire feature set is being used, the sequence of words (their relative order) can be utilized to make better predictions. For example, some words have a different meaning when used together than when considered alone, such as “not good” or “not bad”. | |
| The models are trained for 3 strategies called Unigram, Bigram and Trigram. | |
| 6.3.1 Unigram | |
| Unigram is the normal case, in which each word is considered a separate feature. The entire feature set is vectorized and the model is trained on the generated matrix. The size of the training matrix is 426340*27048 and that of the testing matrix is 142114*27048. | |
| 6.3.1.1 Results | |
| As expected, the accuracies obtained are better than after applying feature reduction or selection, but the amount of computation is also much higher. Following are the accuracies: | |
| (Figure 9) | |
| (Figure 10) | |
| All the classifiers perform well and have good precision and recall values even for negative samples. The following figure shows a visual comparison of recall for negative samples: | |
| (Figure 11) | |
| 6.3.2 Bigram | |
| In this approach, all pairs of adjacent words are also considered as features in addition to unigrams. So 2-word phrases like “not good”, “not bad”, “pretty bad” etc. now also have predictive value, which was not the case when using only unigrams. The entire feature set is vectorized and the model is trained on the generated matrix. The size of the training matrix is 426340*263567 and that of the testing matrix is 142114*263567. | |
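What counts as a feature under this scheme can be shown with a few lines of Python. The helper below is illustrative and returns every n-gram of length 1 up to `max_n` for one tokenized review.

```python
def ngram_features(tokens, max_n):
    """Return all n-grams of length 1..max_n as space-joined strings."""
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


tokens = ["not", "good", "at", "all"]
print(ngram_features(tokens, 2))
# ['not', 'good', 'at', 'all', 'not good', 'good at', 'at all']
```

With `max_n=2` the phrase "not good" becomes a feature of its own, separate from "not" and "good".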
| 6.3.2.1 Results | |
| The accuracies improved even further. The algorithms being used run well on sparse data which is the format of the input that is generated after vectorization. Following are the results: | |
| (Figure 12) | |
| There is a significant improvement in the recall of negative instances, which suggests that many reviewers used 2-word phrases like “not good” or “not great” to express a negative opinion. Following is the visual representation of the accuracy on negative samples: | |
| (Figure 13) | |
| (Figure 14) | |
| 6.3.3 Trigram | |
| In this approach, all sequences of 3 adjacent words are considered as separate features in addition to unigrams and bigrams. | |
| The entire feature set is again vectorized and the model is trained on the generated matrix. The size of the training matrix is 426340*653393 and that of the testing matrix is 142114*653393. | |
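In scikit-learn, the three strategies differ only in the `ngram_range` passed to the vectorizer. The sketch below trains a logistic regression on a toy corpus for each setting and records the number of features; the corpus, labels and `max_iter` value are illustrative, and the project's exact settings are not given in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus; 1 = positive, 0 = negative.
train_texts = [
    "not good at all",
    "really good taste",
    "not bad at all",
    "really bad taste",
    "good value",
    "bad value",
]
train_labels = [0, 1, 1, 0, 1, 0]

feature_counts = {}
for max_n in (1, 2, 3):  # unigram, bigram, trigram strategies
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, max_n)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(train_texts, train_labels)
    vocabulary = model.named_steps["tfidfvectorizer"].vocabulary_
    feature_counts[max_n] = len(vocabulary)

print(feature_counts)
```

The feature count grows with each step, mirroring the growth from 27048 to 263567 to 653393 columns reported for the full dataset.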
| 6.3.3.1 Results | |
| Trigrams give the best results. | |
| (Figure 15) | |
| Logistic Regression gives an accuracy as high as 93.2 % and even the Perceptron accuracy is very high. The recall/precision values for negative samples are the highest of all the experiments. | |
| (Figure 16) | |
| (Figure 17) | |
| 6.4 Analysis | |
| Since logistic regression performs best in all three cases, it is worth analyzing it a little further with the help of a confusion matrix. A confusion matrix tabulates the true labels against the predicted labels. It is a good way to visualize the classification report. | |
| (Figure 18) | |
| From the first matrix it is evident that a large number of samples were predicted to be positive and were also truly positive, whereas the count of samples that were predicted negative and were truly negative is much smaller. But this matrix is not indicative of the performance, because the testing data contains far fewer negative samples, so the negative-label cells of the matrix are expected to be lightly shaded. To visualize the performance better, one should look at the normalized confusion matrix, which shows the ratio of predicted labels to true labels, i.e. proportions rather than raw counts. Now one can see that logistic regression predicted negative samples accurately too. | |
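The difference between the two matrices can be seen on a toy example. The sketch assumes the normalization divides each row by the number of samples with that true label, which is the common convention; the predictions are made up for illustration.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total so every true class sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Toy skewed test set: 20 negative and 80 positive samples.
y_true = ["neg"] * 20 + ["pos"] * 80
y_pred = ["neg"] * 16 + ["pos"] * 4 + ["pos"] * 76 + ["neg"] * 4

cm = confusion_matrix(y_true, y_pred, ["neg", "pos"])
cm_norm = normalize_rows(cm)
print(cm)       # [[16, 4], [4, 76]]
print(cm_norm)
```

In raw counts the negative cell (16) looks small next to the positive cell (76), but after normalization it shows that 80% of negative samples were classified correctly.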
| 7. Future Work | |
| It is evident that for the purpose of sentiment classification, feature reduction and selection are very important. Apart from the methods discussed in this paper there are other ways which can be explored to select features more smartly. | |
| One can use POS tagging to tag words in the training data and extract the important words based on their tags. For sentiment classification, adjectives are the most critical, but other tags that may have some predictive value should also be considered. | |
| Other advanced strategies such as using Word2Vec can also be utilized. Using Word2Vec, one can find similar words in the dataset and essentially find their relation with labels. There are other ways too in which one can use Word2Vec to improve the models. | |
| 8. Conclusion | |
| In conclusion, bag-of-words is a pretty efficient method if one can compromise a little on accuracy. Also, for datasets of such a large size it is advisable to use algorithms that run in linear time, such as Naïve Bayes, although they might not give very high accuracy. | |
| Finally, utilizing sequences of words (n-grams) is a good approach when the main goal is to improve the accuracy of the model. | |
| 9. References | |
| [1] https://www.kaggle.com/snap/amazon-fine-food-reviews | |
| [2] http://scikit-learn.org/stable/modules/feature_extraction.html | |
| [3] https://en.wikipedia.org/wiki/Principal_component_analysis | |
| [4] J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW, 2013 | |
| 10. GitHub link | |
| You can find this paper and code for the project at the following GitHub link. | |
  