{ "cells": [ { "cell_type": "markdown", "id": "b30f0a7e-6d6e-4b9e-bece-e11f2a0f78fc", "metadata": {}, "source": [ "# Sentiment Analysis\n", "\n", "In this post we are going to learn more about the [technical requirements to become a Data Scientist](https://medium.com/@fmnobar/data-scientist-role-requirements-bbae1f85d4d5) by taking a closer look at Sentiment Analysis. In the field of Natural Language Processing (NLP), sentiment analysis is a tool to identify, quantify, extract and study subjective information from textual data. For example, \"I like watching TV shows.\" carries a positive sentiment. The sentiment could be even more positive if one says \"I really like watching TV shows!\". Sentiment analysis attempts to quantify the sentiment conveyed in textual data. One of the most common use cases of sentiment analysis is enabling brands and businesses to review their customers' feedback and monitor their level of satisfaction. As you can imagine, it would be quite expensive to have people read customer reviews to determine whether customers are happy with the business, service, or products. In such cases brands and businesses use machine learning techniques such as sentiment analysis to achieve similar results at scale.\n", "\n", "Similar to my other posts, learning is achieved through practice questions and answers. I will include hints and explanations in the questions as needed to make the journey easier. Lastly, the notebook that I used to create this exercise is linked at the bottom of the post, so you can download it, run it and follow along.\n", "\n", "Let’s get started!\n", "\n", "## Data Set\n", "\n", "In order to practice sentiment analysis, we are going to use a data set from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sentiment+labelled+sentences), which can be downloaded from [this link](https://gist.github.com/fmnobar/88703ec6a1f37b3eabf126ad38c392b8). 
\n", "\n", "Let's start with importing the libraries we will be using today, then read the data set into a dataframe and look at the top five rows of the dataframe to familiarize ourselves with the data." ] }, { "cell_type": "code", "execution_count": 46, "id": "b925656b-9706-487f-aa7c-1b8f4d771148", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
(index) | text | label
0 | A very, very, very slow-moving, aimless movie about a distressed, drifting young man. | 0
1 | Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out. | 0
2 | Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent. | 0
3 | Very little music or anything to speak of. | 0
4 | The best scene in the movie was when Gerardo is trying to find a song that keeps running through his head. | 1
\n", "
" ], "text/plain": [ " text \\\n", "0 A very, very, very slow-moving, aimless movie about a distressed, drifting young man. \n", "1 Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out. \n", "2 Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent. \n", "3 Very little music or anything to speak of. \n", "4 The best scene in the movie was when Gerardo is trying to find a song that keeps running through his head. \n", "\n", " label \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 1 " ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import required packages\n", "import numpy as np\n", "import pandas as pd\n", "import nltk\n", "\n", "# Show the full width of each column\n", "pd.set_option('display.max_colwidth', None)\n", "\n", "# Read the data into a dataframe\n", "df = pd.read_csv('imdb_labelled.csv')\n", "\n", "# Look at the top five rows of the dataframe\n", "df.head()" ] }, { "cell_type": "markdown", "id": "74bda39c-042d-428f-8eae-94fbb104d347", "metadata": {}, "source": [ "There are only two columns. \"text\" contains the review itself and \"label\" indicates the sentiment of the review. In this dataset a label of 1 indicates a positive sentiment, while a label of 0 indicates a negative sentiment. Since there are only two classes of labels, let's look at whether these two classes are balanced or imbalanced. Classes are considered balanced when each class accounts for roughly the same share of the total observations. Let's look at the data, which makes this easier to understand. 
" ] }, { "cell_type": "code", "execution_count": 47, "id": "a424b8a5-fea6-42be-aa03-ee5d8e2e03f4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 386\n", "0 362\n", "Name: label, dtype: int64" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['label'].value_counts()" ] }, { "cell_type": "markdown", "id": "df9fbc49-2169-42e7-822d-9c2f6208d8d3", "metadata": {}, "source": [ "The data is almost equally divided between positive and negative sentiments, therefore we consider the data to have balanced classes.\n", "\n", "Next, we are going to create a sample string, which includes the very first entry in the \"text\" column of the dataframe. In some of the questions, we will apply various techniques to this one sample to better understand the concepts. Let's go ahead and create our sample string." ] }, { "cell_type": "code", "execution_count": 48, "id": "7898707d-df56-424a-a23a-b0bb53e84b0d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'A very, very, very slow-moving, aimless movie about a distressed, drifting young man. '" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Take the very first text entry of the dataframe\n", "sample = df.text[0]\n", "sample" ] }, { "cell_type": "markdown", "id": "67ee7ecf-ac2d-4f6b-954c-7675782246ef", "metadata": {}, "source": [ "# Tutorial + Questions and Answers\n", "\n", "## Tokens and Bigrams\n", "\n", "In order for programs and computers to understand textual data, we start by breaking down larger segments of textual data into smaller pieces. Breaking down a sequence of characters (such as a string) into smaller pieces (or substrings) is called tokenization and the functions that perform tokenization are called tokenizers. A tokenizer can break down a given string into a list of substrings. Let's look at an example. 
\n", "\n", "Input: `What is a sentence?`\n", "\n", "If we apply a tokenizer to the above \"Input\", we will get the following \"Output\":\n", "\n", "Output: `['What', 'is', 'a', 'sentence', '?']`\n", "\n", "As expected, the output is a sequence of the tokenized substrings of the input sentence. \n", "\n", "We can implement this concept with the `nltk.word_tokenize` function. Let's see how this is implemented in an example.\n", "\n", "**Question 1:**\n", "\n", "Tokenize the generated sample and return the first 10 tokens.\n", "\n", "**Answer:**" ] }, { "cell_type": "code", "execution_count": 49, "id": "3f6e9d5c-43eb-4847-902a-e8421d149a30", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['A', 'very', ',', 'very', ',', 'very', 'slow-moving', ',', 'aimless', 'movie']" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import the function\n", "from nltk import word_tokenize\n", "\n", "# Tokenize the sample\n", "sample_tokens = word_tokenize(sample)\n", "\n", "# Return the first 10 tokens\n", "sample_tokens[:10]" ] }, { "cell_type": "markdown", "id": "7de19d0b-c606-40cb-ba5a-6a98873cf98d", "metadata": {}, "source": [ "A token is also called a unigram. If we combine two adjacent unigrams, we get a bigram (and the same idea extends to trigrams and beyond). Formally, a bigram is an n-gram where n equals two. An n-gram is a sequence of n adjacent items from a given sample of text. Therefore, a bigram is a sequence of two adjacent elements from a string of tokens. 
It will be easier to understand in an example:\n", "\n", "Original Sentence: `What is a sentence?`\n", "\n", "Tokens: `['What', 'is', 'a', 'sentence', '?']`\n", "\n", "Bigrams: `[('What', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '?')]`\n", "\n", "As defined, every pair of adjacent tokens is now represented by one bigram.\n", "\n", "We can implement this concept with the `nltk.bigrams` function.\n", "\n", "**Question 2:**\n", "\n", "Create a list of bigrams from the tokenized sample and return the first 10 bigrams. \n", "\n", "**Answer:**" ] }, { "cell_type": "code", "execution_count": 50, "id": "f7863a99-e1bb-4975-84f6-2092e3400eac", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('A', 'very'),\n", " ('very', ','),\n", " (',', 'very'),\n", " ('very', ','),\n", " (',', 'very'),\n", " ('very', 'slow-moving'),\n", " ('slow-moving', ','),\n", " (',', 'aimless'),\n", " ('aimless', 'movie'),\n", " ('movie', 'about')]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import the function\n", "from nltk import bigrams\n", "\n", "# Create the bigrams\n", "sample_bitokens = list(bigrams(sample_tokens))\n", "\n", "# Return the first 10 bigrams\n", "sample_bitokens[:10]" ] }, { "cell_type": "markdown", "id": "45b46bab-dc89-459a-b0ac-8d0e35f35902", "metadata": {}, "source": [ "## Frequency Distribution\n", "\n", "Let's go back to the tokens (unigrams) that we created from our sample. It is useful to see which tokens are present, but it is often more informative to know which tokens occur more often than others in a given text. In other words, a frequency distribution of the tokens would be more informative. More formally, a frequency distribution records the number of times each outcome of an experiment has occurred.\n", "\n", "Let's implement a frequency distribution using the `nltk.FreqDist` class. 
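To make the definition concrete, here is a minimal sketch that counts tokens with `collections.Counter` from the Python standard library instead of `nltk`. The `tokens` list is an assumption: it is copied by hand from the tokenized sample so that the snippet runs on its own.

```python
from collections import Counter

# Assumption: tokens copied by hand from the tokenized sample sentence
tokens = ['A', 'very', ',', 'very', ',', 'very', 'slow-moving', ',',
          'aimless', 'movie', 'about', 'a', 'distressed', ',',
          'drifting', 'young', 'man', '.']

# Count how many times each distinct token occurs
freq = Counter(tokens)

# Most frequent tokens first, as (token, count) pairs
print(freq.most_common(2))
```

The two most frequent entries are the comma (4 occurrences) and `very` (3 occurrences). Note that the count is case-sensitive, so `A` and `a` are tallied separately.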
\n", "\n", "**Question 3:**\n", "\n", "What are the top 10 most frequent tokens in our sample?\n", "\n", "**Answer:**" ] }, { "cell_type": "code", "execution_count": 51, "id": "ccf1dce2-e0c7-4226-8d82-77aa0c5f5b04", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(',', 4),\n", " ('very', 3),\n", " ('A', 1),\n", " ('slow-moving', 1),\n", " ('aimless', 1),\n", " ('movie', 1),\n", " ('about', 1),\n", " ('a', 1),\n", " ('distressed', 1),\n", " ('drifting', 1)]" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import the class\n", "from nltk import FreqDist\n", "\n", "# Create the frequency distribution for all tokens\n", "sample_freqdist = FreqDist(sample_tokens)\n", "\n", "# Return top ten most frequent tokens\n", "sample_freqdist.most_common(10)" ] }, { "cell_type": "markdown", "id": "1dcdff45-c55b-4dfe-a24c-07c2404722cd", "metadata": {}, "source": [ "Some of the results intuitively make sense. For example, a comma, \"the\", \"a\" or periods can be quite common in a given textual input. Now let's put all of these steps into one Python function to streamline the process. If you need a refresher on Python functions, I have a post with practice questions on Python functions [linked here](https://medium.com/@fmnobar/python-foundation-for-data-science-advanced-functions-practice-notebook-dbe4204b83d6).\n", "\n", "**Question 4:**\n", "\n", "Create a function named \"top_n\" that takes in a text as an input and returns the top n most common tokens in the given text. Use \"text\" and \"n\" as the function arguments. Then try it on our sample to reproduce the results from the previous question. 
\n", "\n", "**Answer:**" ] }, { "cell_type": "code", "execution_count": 52, "id": "26b44a2a-d91a-4282-8a0d-fbc3d9314084", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(',', 4),\n", " ('very', 3),\n", " ('A', 1),\n", " ('slow-moving', 1),\n", " ('aimless', 1),\n", " ('movie', 1),\n", " ('about', 1),\n", " ('a', 1),\n", " ('distressed', 1),\n", " ('drifting', 1)]" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a function that accepts a text and n and returns the top n most common tokens\n", "def top_n(text, n):\n", "    # Create tokens\n", "    tokens = word_tokenize(text)\n", "\n", "    # Create the frequency distribution\n", "    freqdist = FreqDist(tokens)\n", "\n", "    # Return the top n most common ones\n", "    return freqdist.most_common(n)\n", "\n", "# Try it on the sample\n", "top_n(sample, 10)" ] }, { "attachments": { "938d222b-5a6d-4e1d-960f-9bcf1a130b85.png": { "image/png": "" } }, "cell_type": "markdown", "id": "b763fdee-675a-4aa4-a9aa-4e910f1b7886", "metadata": {}, "source": [ "We were able to reproduce the same output using the function. \n", "\n", "## Document-Term Matrix (DTM)\n", "\n", "A Document-Term Matrix (DTM) is a matrix that represents the frequency of terms that occur in a collection of documents. Let's look at two sentences to understand what a DTM is. \n", "\n", "Let's say that we have the following two sentences:\n", "```\n", "sentence_1 = 'He is walking down the street.'\n", "\n", "sentence_2 = 'She walked up then walked down the street yesterday.'\n", "```\n", "The DTM of the above two sentences will be:\n", "\n", "![image.png](attachment:938d222b-5a6d-4e1d-960f-9bcf1a130b85.png)\n", "\n", "In this DTM, each number indicates how many times that particular term was observed in the given sentence. For example, \"down\" is present once in both sentences, while \"walked\" appears twice but only in the second sentence. \n", "\n", "Now let's look at how we can implement a DTM using `sklearn`'s `CountVectorizer`. 
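Before turning to `sklearn`, the counting behind a DTM can be sketched by hand with the standard library. This is only an illustration under a stated assumption: the hypothetical helper `simple_tokens` lowercases the text and keeps runs of letters, which is a simplification and not the tokenization rule that `CountVectorizer` applies.

```python
import re
from collections import Counter

# The two example sentences from the text
docs = ['He is walking down the street.',
        'She walked up then walked down the street yesterday.']

def simple_tokens(doc):
    # Assumption: lowercase the text and keep runs of letters only
    return re.findall('[a-z]+', doc.lower())

# One Counter of term frequencies per document
counts = [Counter(simple_tokens(doc)) for doc in docs]

# The sorted vocabulary across all documents gives the column order
vocab = sorted(set().union(*counts))

# Each row is a document, each column is a term
dtm = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in dtm:
    print(row)
```

In the printed matrix the column for `down` holds 1 in both rows, while the column for `walked` holds 0 in the first row and 2 in the second, matching the description above.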
Note that the DTM initially created by `sklearn` is stored as a sparse matrix, a format that keeps only the non-zero entries in memory because most entries of a DTM are zero. This is done for efficiency, but to place the DTM in a dataframe we will convert it to a dense array, which stores every entry explicitly, zeros included. Since the distinction between sparse and dense arrays is not the focus of this post, we won't go deeper into that topic.\n", "\n", "**Question 5:**\n", "\n", "Define a function named \"create_dtm\" that creates a Document-Term Matrix in the form of a dataframe for a given series of strings. Then test it on the top five rows of our data set.\n", "\n", "**Answer:**" ] }, { "cell_type": "code", "execution_count": 53, "id": "97a28516-7328-44ed-92ef-c473b0026254", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
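(index) | about | acting | aimless | almost | and | angles | anything | artiness | as | attempting | ... | trying | very | walked | was | when | white | who | whom | with | young
0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0
2 | 0 | 1 | 0 | 1 | 3 | 1 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0
3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0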
\n", "

5 rows × 68 columns

\n", "
" ], "text/plain": [ " about acting aimless almost and angles anything artiness as \\\n", "0 1 0 1 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 0 0 \n", "2 0 1 0 1 3 1 0 1 1 \n", "3 0 0 0 0 0 0 1 0 0 \n", "4 0 0 0 0 0 0 0 0 0 \n", "\n", " attempting ... trying very walked was when white who whom with \\\n", "0 0 ... 0 3 0 0 0 0 0 0 0 \n", "1 0 ... 0 0 1 1 0 0 1 1 0 \n", "2 1 ... 0 0 0 1 0 1 0 0 1 \n", "3 0 ... 0 1 0 0 0 0 0 0 0 \n", "4 0 ... 1 0 0 1 1 0 0 0 0 \n", "\n", " young \n", "0 1 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 \n", "\n", "[5 rows x 68 columns]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import the package\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "def create_dtm(series):\n", " # Create an instance of the class\n", " cv = CountVectorizer()\n", " \n", " # Create a document term matrix from the provided series\n", " dtm = cv.fit_transform(series)\n", " \n", " # Convert the sparse array to a dense array\n", " dtm = dtm.todense()\n", " \n", " # Get column names\n", " features = cv.get_feature_names_out()\n", " \n", " # Create a dataframe\n", " dtm_df = pd.DataFrame(dtm, columns = features)\n", " \n", " # Return the dataframe\n", " return dtm_df\n", "\n", "# Try the function on the top 5 rows of the df['text']\n", "create_dtm(df.text.head())" ] }, { "cell_type": "markdown", "id": "ce1f7e22-b726-4159-8487-be65fa688f2f", "metadata": {}, "source": [ "## Feature Importance\n", "\n", "Now we want to think about sentiment analysis as a machine learning model. In such a machine learning model, we would like the model to take in the textual input and make predictions about the sentiment of each textual entry. In other words, the textual input is the independent variable and the sentiment is the dependent variable. 
We also learned that we can break down the text into smaller pieces named tokens. Therefore, we can think of each of the tokens within the textual input as a \"feature\" that helps in predicting the sentiment as the output of the machine learning model. To summarize, we started with a machine learning model that took in large textual data and predicted sentiments, but now we have converted our task into a model that takes in multiple \"tokens\" (instead of a large body of text) and predicts the sentiment based on the given tokens. The next logical step is to quantify which of the tokens (i.e. features) are more important in predicting the sentiment. This task is called feature importance. \n", "\n", "Luckily for us, feature importance can be easily implemented in `sklearn`. Here we will use the coefficients of a logistic regression model as the measure of importance. Let's look at an example together. \n", "\n", "**Question 6:**\n", "\n", "Define a function named \"top_n_tokens\" that accepts three arguments: (1) \"text\", which is the textual input in the format of a dataframe column, (2) \"sentiment\", which is the label of the sentiment for the given text in the format of a dataframe column, and (3) \"n\", which is a positive integer. The function will return the top \"n\" most important tokens (i.e. features) to predict the \"sentiment\" of the \"text\". Please use `LogisticRegression` from `sklearn.linear_model` with the following parameters: `solver = 'lbfgs'`, `max_iter = 2500`, and `random_state = 1234`. Finally, use the function to return the top 10 most important tokens in the \"text\" column of the dataframe.\n", "\n", "***Note:** Since the goal of this post is to explore sentiment analysis, we assume the reader is familiar with Logistic Regression. 
If you would like to take a deeper look at Logistic Regression, check out [this post](https://medium.com/@fmnobar/logistic-regression-overview-through-11-practice-questions-practice-notebook-64e94cb8d09d).*\n", "\n", "**Answer:**" ] }, { "cell_type": "code", "execution_count": 54, "id": "d6448d0a-de63-4e45-ac1e-13b9d2cbd71c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
(index) | Tokens | Coefficients
1567 | liked | 1.286747
2997 | wonderful | 1.242158
1104 | funny | 1.112821
1182 | great | 1.068772
2949 | well | 1.043139
246 | beautiful | 1.042833
0 | 10 | 1.035405
344 | brilliant | 1.014080
908 | excellent | 1.009914
2203 | right | 0.985806
\n", "
" ], "text/plain": [ " Tokens Coefficients\n", "1567 liked 1.286747\n", "2997 wonderful 1.242158\n", "1104 funny 1.112821\n", "1182 great 1.068772\n", "2949 well 1.043139\n", "246 beautiful 1.042833\n", "0 10 1.035405\n", "344 brilliant 1.014080\n", "908 excellent 1.009914\n", "2203 right 0.985806" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import logistic regression\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "def top_n_tokens(text, sentiment, n):\n", "    # Create instances of the model and the vectorizer\n", "    lgr = LogisticRegression(solver = 'lbfgs', max_iter = 2500, random_state = 1234)\n", "    cv = CountVectorizer()\n", "\n", "    # Create the DTM\n", "    dtm = cv.fit_transform(text)\n", "\n", "    # Fit the logistic regression model\n", "    lgr.fit(dtm, sentiment)\n", "\n", "    # Get the coefficients\n", "    coefs = lgr.coef_[0]\n", "\n", "    # Create the features / column names\n", "    features = cv.get_feature_names_out()\n", "\n", "    # Create the dataframe (named coef_df to avoid shadowing the global df)\n", "    coef_df = pd.DataFrame({'Tokens' : features, 'Coefficients' : coefs})\n", "\n", "    # Return the largest n\n", "    return coef_df.nlargest(n, 'Coefficients')\n", "\n", "# Test it on the df['text']\n", "top_n_tokens(df.text, df.label, 10)" ] }, { "cell_type": "markdown", "id": "37184aba-f899-4fc6-9061-fb66f27dcfed", "metadata": {}, "source": [ "The results are quite interesting. Recall that label 1 indicates a positive sentiment in the dataset, so a large positive coefficient means that the token pushes the prediction towards the positive class. In other words, the tokens with the highest coefficients are the ones that most strongly indicate a positive sentiment. This comes across in the results, which all sound quite positive.\n", "\n", "In order to validate this interpretation, let's look at the 10 smallest (i.e. most negative) coefficients. These are not unimportant features; they are the tokens that push the prediction most strongly towards label 0, so we expect them to convey a strong negative sentiment." 
] }, { "cell_type": "code", "execution_count": 55, "id": "d7f97825-06ad-49d1-8c3d-0471d3b6fd78", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
(index) | Tokens | Coefficients
222 | bad | -1.872751
211 | awful | -1.334554
2530 | stupid | -1.175416
441 | cheap | -1.139512
1802 | no | -1.137234
893 | even | -1.091436
3017 | would | -1.047931
3012 | worst | -1.039231
2923 | waste | -1.038206
1819 | nothing | -0.973472
\n", "
" ], "text/plain": [ " Tokens Coefficients\n", "222 bad -1.872751\n", "211 awful -1.334554\n", "2530 stupid -1.175416\n", "441 cheap -1.139512\n", "1802 no -1.137234\n", "893 even -1.091436\n", "3017 would -1.047931\n", "3012 worst -1.039231\n", "2923 waste -1.038206\n", "1819 nothing -0.973472" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import logistic regression\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "def bottom_n_tokens(text, sentiment, n):\n", "    # Create instances of the model and the vectorizer\n", "    lgr = LogisticRegression(solver = 'lbfgs', max_iter = 2500, random_state = 1234)\n", "    cv = CountVectorizer()\n", "\n", "    # Create the DTM\n", "    dtm = cv.fit_transform(text)\n", "\n", "    # Fit the logistic regression model\n", "    lgr.fit(dtm, sentiment)\n", "\n", "    # Get the coefficients\n", "    coefs = lgr.coef_[0]\n", "\n", "    # Create the features / column names\n", "    features = cv.get_feature_names_out()\n", "\n", "    # Create the dataframe (named coef_df to avoid shadowing the global df)\n", "    coef_df = pd.DataFrame({'Tokens' : features, 'Coefficients' : coefs})\n", "\n", "    # Return the smallest n\n", "    return coef_df.nsmallest(n, 'Coefficients')\n", "\n", "# Test it on the df['text']\n", "bottom_n_tokens(df.text, df.label, 10)" ] }, { "cell_type": "markdown", "id": "c706feaa-404b-434d-af7a-05e422071d63", "metadata": {}, "source": [ "As expected, these words convey a strong negative sentiment. \n", "\n", "In the previous example, we trained a logistic regression model on the existing labeled data. But what if we do not have labeled data and would like to determine the sentiment of a given data set? In such cases, we can leverage pre-trained models, such as TextBlob, which we will discuss next. \n", "\n", "## Pre-Trained Models - TextBlob\n", "\n", "TextBlob is a library for processing textual data. Its `sentiment` property returns the sentiment of a given text as a named tuple of the form \"(polarity, subjectivity)\". 
The polarity score is a float within the range [-1.0, 1.0] that indicates whether the text is negative (towards -1.0) or positive (towards 1.0). The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective. For example, a fact is expected to be objective and one's opinion is expected to be subjective. Polarity and subjectivity detection are two of the most common tasks within sentiment analysis, which we will explore in the next question.\n", "\n", "**Question 7:**\n", "\n", "Define a function named \"polarity_subjectivity\" that accepts two arguments. The function applies \"TextBlob\" to the provided \"text\" (defaulting to \"sample\") and, if `print_results = True`, prints the polarity and subjectivity of the \"text\"; otherwise it returns a tuple of float values with the first value being polarity and the second value being subjectivity, such as \"(polarity, subjectivity)\". Returning the tuple should be the default for the function (i.e. set `print_results = False`). Lastly, use the function on our sample and print the results. 
\n", "\n", "***Hint:** If you need to install TextBlob you can do so using the following command: `!pip install textblob`*\n", "\n", "**Answer:**" ] }, { "cell_type": "code", "execution_count": 56, "id": "cb692961-ff9c-4133-bec1-5958fb4c8bf7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Polarity is 0.18 and subjectivity is 0.4.\n" ] } ], "source": [ "# Import TextBlob\n", "from textblob import TextBlob\n", "\n", "def polarity_subjectivity(text = sample, print_results = False):\n", "    # Create an instance of TextBlob\n", "    tb = TextBlob(text)\n", "\n", "    # If the condition is met, print the results, otherwise, return the tuple\n", "    if print_results:\n", "        print(f\"Polarity is {round(tb.sentiment[0], 2)} and subjectivity is {round(tb.sentiment[1], 2)}.\")\n", "    else:\n", "        return (tb.sentiment[0], tb.sentiment[1])\n", "\n", "# Test the function on our sample\n", "polarity_subjectivity(sample, print_results = True)" ] }, { "cell_type": "markdown", "id": "b66d5751-d1ae-4fa7-a40c-5cc8c7654e17", "metadata": {}, "source": [ "Let's look at the sample and try to interpret these values. " ] }, { "cell_type": "code", "execution_count": 57, "id": "0d0ac7c0-425b-40e8-86a8-11f2f33c5470", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'A very, very, very slow-moving, aimless movie about a distressed, drifting young man. '" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sample" ] }, { "cell_type": "markdown", "id": "7d2e3f06-47af-4ec6-a307-b7ee192ea135", "metadata": {}, "source": [ "Interpreting these results is more meaningful in comparison to other strings, but in the absence of such a comparison and purely based on the numbers, let's try to interpret the results. 
The results indicate that our sample has a neutral to positive polarity (remember polarity ranges from -1 to 1, therefore 0.18 would indicate neutral to positive) and is relatively subjective, which makes intuitive sense since this is someone's review describing their subjective experience about a movie. \n", "\n", "**Question 8:**\n", "\n", "First define a function named \"token_count\" that accepts a string and using `nltk`'s word tokenizer, returns an integer number of tokens in the given string. Then define a second function named \"series_tokens\" that accepts a Pandas Series object as an argument and applies the previously-defined \"token_count\" function to the given Series, returning the integer number of tokens for each row of the given Series. Lastly, use the second function on the top 10 rows of our dataframe and return the results. \n", "\n", "**Answer:**" ] }, { "cell_type": "code", "execution_count": 58, "id": "addc9348-a2d1-4872-8768-591faba0312c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 18\n", "1 21\n", "2 33\n", "3 9\n", "4 22\n", "5 27\n", "6 4\n", "7 17\n", "8 4\n", "9 11\n", "Name: text, dtype: int64" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import libraries\n", "from nltk import word_tokenize\n", "\n", "# Define the first function that counts the number of tokens in a given string\n", "def token_count(string):\n", " return len(word_tokenize(string))\n", "\n", "# Define the second function that applies the token_count function to a given Pandas Series\n", "def series_tokens(series):\n", " return series.apply(token_count)\n", "\n", "# Apply the function to the top 10 rows of the dataframe\n", "series_tokens(df.text.head(10))" ] }, { "cell_type": "markdown", "id": "60b05e22-ff26-461d-9096-f3d93a749a3b", "metadata": {}, "source": [ "**Question 9:**\n", "\n", "Define a function named `series_polarity_subjectivity` that applies the `polarity_subjectivity()` function defined in 
Question 7 to a Pandas Series (in the form of a dataframe column) and returns the results. Then use the function on the top 10 rows of our dataframe to see the results.\n", "\n", "**Answer:**" ] }, { "cell_type": "code", "execution_count": 59, "id": "4ec921d8-0706-4121-aa89-496fca33cf82", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 (0.18, 0.395)\n", "1 (0.014583333333333337, 0.4201388888888889)\n", "2 (-0.12291666666666666, 0.5145833333333333)\n", "3 (-0.24375000000000002, 0.65)\n", "4 (1.0, 0.3)\n", "5 (-0.1, 0.5)\n", "6 (-0.2, 0.0)\n", "7 (0.7, 0.6000000000000001)\n", "8 (-0.2, 0.5)\n", "9 (0.7, 0.8)\n", "Name: text, dtype: object" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Define the function\n", "def series_polarity_subjectivity(series):\n", " return series.apply(polarity_subjectivity)\n", "\n", "# Apply to the top 10 rows of the df['text']\n", "series_polarity_subjectivity(df['text'].head(10))" ] }, { "cell_type": "markdown", "id": "f49cc565-9646-433e-a34b-85ba622694b3", "metadata": {}, "source": [ "## Measure of Complexity - Lexical Diversity\n", "\n", "As the name suggests, Lexical Diversity is a measurement of how many different lexical words there are in a given text and is formulaically defined as the number of unique tokens over the total number of tokens. The idea is that the more diverse lexical tokens in a text are, the more complex that text is expected to be. Let's look at an example. \n", "\n", "**Question 10:**\n", "\n", "Define a \"complexity\" function that accepts a string as an argument and returns the lexical complexity score defined as the number of unique tokens over the total number of tokens. Then apply the function to the top 10 rows of our dataframe. 
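As a quick check of the formula before the answer, the ratio can be worked out by hand for the first review. This is a minimal sketch; the `tokens` list is an assumption, copied by hand from the tokenized sample so that the snippet does not depend on `nltk`.

```python
# Assumption: tokens copied by hand from the tokenized sample sentence
tokens = ['A', 'very', ',', 'very', ',', 'very', 'slow-moving', ',',
          'aimless', 'movie', 'about', 'a', 'distressed', ',',
          'drifting', 'young', 'man', '.']

# A set keeps each distinct token once (case-sensitive, so 'A' and 'a' differ)
unique_tokens = set(tokens)

# Lexical diversity = number of unique tokens / total number of tokens
diversity = len(unique_tokens) / len(tokens)
print(len(unique_tokens), len(tokens), round(diversity, 6))
```

The sample has 13 unique tokens out of 18 in total, which gives a ratio of roughly 0.722. The repeated `very` and the repeated commas are what pull the score below 1.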
\n", "\n", "**Answer:**" ] }, { "cell_type": "code", "execution_count": 60, "id": "202d893c-02e5-440f-9836-c252170f510a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.722222\n", "1 0.952381\n", "2 0.848485\n", "3 1.000000\n", "4 1.000000\n", "5 0.814815\n", "6 1.000000\n", "7 0.941176\n", "8 1.000000\n", "9 0.909091\n", "Name: text, dtype: float64" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def complexity(string):\n", " # Create a list of all tokens\n", " total_tokens = word_tokenize(string)\n", " \n", " # Create a set of all tokens (which only keeps unique values)\n", " unique_tokens = set(word_tokenize(string))\n", " \n", " # Return the complexity measure\n", " if len(total_tokens) == 0:\n", " return 0\n", " else:\n", " return len(unique_tokens) / len(total_tokens)\n", "\n", "# Apply to the top 10 rows of the dataframe\n", "df.text.head(10).apply(complexity)" ] }, { "cell_type": "markdown", "id": "804e9928-ae05-47e1-8440-d5c005157443", "metadata": {}, "source": [ "## Stopwords and Non-Alphabeticals\n", "\n", "If you recall in Question 3 we conducted a Frequency Distribution and the resulting 10 most common tokens were as follows: \n", "```\n", "[(',', 4), ('very', 3), ('A', 1), ('slow-moving', 1), ('aimless', 1), ('movie', 1), ('about', 1), ('a', 1), ('distressed', 1), ('drifting', 1)]\n", "```\n", "\n", "Some of these are not very helpful and are considered less significant compared to other tokens. For example, how much information can be gained from knowing that periods are quite common in a given text? An attempt at filtering out such less significant words so that the focus can be directed towards more significant words is called removal of the stopwords. Note that there is no universal definition of what these stopwords are and this designation is purely subjective. 
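The filtering idea can be sketched in a few lines of plain Python. The stopword set below is hypothetical and deliberately tiny, chosen only for illustration; real lists, such as the one shipped with `nltk`, are much longer. The `tokens` list is likewise copied by hand from the tokenized sample.

```python
# Hypothetical, deliberately tiny stopword set for illustration only
tiny_stopwords = {'a', 'about', 'very'}

# Assumption: tokens copied by hand from the tokenized sample sentence
tokens = ['A', 'very', ',', 'very', ',', 'very', 'slow-moving', ',',
          'aimless', 'movie', 'about', 'a', 'distressed', ',',
          'drifting', 'young', 'man', '.']

# Keep a token only if its lowercase form is not a stopword
kept = [t for t in tokens if t.lower() not in tiny_stopwords]
print(kept)
```

Comparing in lowercase means that `A` is removed along with `a`. Punctuation survives this step because it is not part of the stopword set.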
\n", "\n", "Let's look at some examples of English stopwords, as defined by `nltk`:" ] }, { "cell_type": "code", "execution_count": 61, "id": "0782d2e0-7d46-4021-99c8-b1f6abe23951", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']\n" ] } ], "source": [ "# Import library\n", "from nltk.corpus import stopwords\n", "\n", "# Select only English stopwords\n", "english_stop_words = stopwords.words('english')\n", "\n", "# Print the first 20\n", "print(english_stop_words[:20])" ] }, { "cell_type": "markdown", "id": "20a8604e-255f-4683-a482-53ba6ac2ea91", "metadata": {}, "source": [ "**Question 11:**\n", "\n", "Define a function named \"stopword_remover\" that accepts a string as argument, tokenizes the input string, removes the English stopwords (as defined by `nltk`), and returns the tokens without the stopwords. 
Then apply the function to the top 5 rows of our dataframe.\n", "\n", "**Answer:**" ] }, { "cell_type": "code", "execution_count": 62, "id": "dc55726c-d595-48e4-9943-20e1450c04ce", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [,, ,, slow-moving, ,, aimless, movie, distressed, ,, drifting, young, man, .]\n", "1 [sure, lost, -, flat, characters, audience, ,, nearly, half, walked, .]\n", "2 [Attempting, artiness, black, &, white, clever, camera, angles, ,, movie, disappointed, -, became, even, ridiculous, -, acting, poor, plot, lines, almost, non-existent, .]\n", "3 [little, music, anything, speak, .]\n", "4 [best, scene, movie, Gerardo, trying, find, song, keeps, running, head, .]\n", "Name: text, dtype: object" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def stopword_remover(string):\n", "    # Tokenize the string\n", "    tokens = word_tokenize(string)\n", "    \n", "    # Create a set of English stopwords (a set makes membership checks fast)\n", "    english_stopwords = set(stopwords.words('english'))\n", "    \n", "    # Return non-stopwords\n", "    return [w for w in tokens if w.lower() not in english_stopwords]\n", "\n", "# Apply to the top 5 rows of df['text']\n", "df.text.head(5).apply(stopword_remover)" ] }, { "cell_type": "markdown", "id": "9525bf14-d0d7-4fd8-b4ce-c0a67c501571", "metadata": {}, "source": [ "Another group of tokens that we can consider filtering out, similar to stopwords, is the non-alphabeticals. As the name suggests, examples of non-alphabeticals are: `! % & # * $` (note that space is also considered a non-alphabetical). To help identify what is considered alphabetical or not, we can use `isalpha()`, which is a built-in Python string method that returns `True` only if the string is non-empty and all of its characters are alphabetic. 
Let's look at a few examples to better understand this concept:" ] }, { "cell_type": "code", "execution_count": 63, "id": "8494228f-1ba8-4a9a-891d-50f023c151d1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "String_1: True\n", "\n", "String_2: False\n", "\n", "String_3: False\n" ] } ], "source": [ "string_1 = \"TomAndJerryAreFun\"\n", "string_2 = \"Tom&JerryAreFun\"\n", "string_3 = \"TomAndJerryAreFun!\"\n", "\n", "print(f\"String_1: {string_1.isalpha()}\\n\")\n", "print(f\"String_2: {string_2.isalpha()}\\n\")\n", "print(f\"String_3: {string_3.isalpha()}\")" ] }, { "cell_type": "markdown", "id": "e8cbcc83-3658-4ba8-a037-5c9094e2d4b5", "metadata": {}, "source": [ "Let's look at each one to better understand what happened. The first one returned \"True\", indicating that the string contains only alphabeticals. The second one returned \"False\" because of the \"&\", and the third one also returned \"False\", driven by the \"!\".\n", "\n", "Now that we are familiar with how `isalpha()` works, let's use it in our example to further clean up our data.\n", "\n", "**Question 12:**\n", "\n", "Define a function named \"stopword_nonalpha_remover\" that accepts a string as an argument, removes both stopwords (using the `stopword_remover()` function that we defined in the previous question) and non-alphabeticals and then returns the remainder. 
Apply this function to the top 5 rows of our dataframe and visually compare to the outcome of the previous question (which still included the non-alphabeticals).\n", "\n", "**Answer:**" ] }, { "cell_type": "code", "execution_count": 64, "id": "95b507b9-f9e3-40c7-8805-897c15d266ba", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [aimless, movie, distressed, drifting, young, man]\n", "1 [sure, lost, flat, characters, audience, nearly, half, walked]\n", "2 [Attempting, artiness, black, white, clever, camera, angles, movie, disappointed, became, even, ridiculous, acting, poor, plot, lines, almost]\n", "3 [little, music, anything, speak]\n", "4 [best, scene, movie, Gerardo, trying, find, song, keeps, running, head]\n", "Name: text, dtype: object" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def stopword_nonalpha_remover(string):\n", "    # Remove stopwords first, then keep only the purely alphabetic tokens\n", "    return [x for x in stopword_remover(string) if x.isalpha()]\n", "\n", "# Apply to the top 5 rows of df['text']\n", "df.text.head().apply(stopword_nonalpha_remover)" ] }, { "cell_type": "markdown", "id": "53710503-12bb-44b9-b4d0-e99397a08dfa", "metadata": {}, "source": [ "As expected, the non-alphabeticals were removed, in addition to the stopwords. Therefore, the remaining tokens are expected to have a higher significance than the removed ones.\n", "\n", "In the next step, we will put together everything that we have learned so far to find out which reviews had the highest complexity score.\n", "\n", "**Question 13:**\n", "\n", "Define a function named \"complexity_cleaned\" that accepts a Series, removes the stopwords and non-alphabeticals (using the function defined in Question 12), and returns the complexity score of what remains. Then create a column named \"complexity\" in our dataframe that uses the \"complexity_cleaned\" function to calculate the complexity. 
Finally, return the rows of the dataframe for the 10 largest complexity scores.\n", "\n", "**Answer:**" ] }, { "cell_type": "code", "execution_count": 65, "id": "0b8d1ec4-f0aa-4cb3-a2ee-21b3a627512a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>text</th>\n", "      <th>label</th>\n", "      <th>complexity</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>A very, very, very slow-moving, aimless movie about a distressed, drifting young man.</td>\n", "      <td>0</td>\n", "      <td>1.0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>484</th>\n", "      <td>Kris Kristoffersen is good in this movie and really makes a difference.</td>\n", "      <td>1</td>\n", "      <td>1.0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>476</th>\n", "      <td>Tom Wilkinson broke my heart at the end... and everyone else's judging by the amount of fumbling for hankies and hands going up to faces among males and females alike.</td>\n", "      <td>1</td>\n", "      <td>1.0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>477</th>\n", "      <td>Julian Fellowes has triumphed again.</td>\n", "      <td>1</td>\n", "      <td>1.0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>478</th>\n", "      <td>He's a national treasure.</td>\n", "      <td>1</td>\n", "      <td>1.0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>479</th>\n", "      <td>GO AND SEE IT!</td>\n", "      <td>1</td>\n", "      <td>1.0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>480</th>\n", "      <td>This is an excellent film.</td>\n", "      <td>1</td>\n", "      <td>1.0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>481</th>\n", "      <td>The aerial scenes were well-done.</td>\n", "      <td>1</td>\n", "      <td>1.0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>482</th>\n", "      <td>It was also the right balance of war and love.</td>\n", "      <td>1</td>\n", "      <td>1.0</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>483</th>\n", "      <td>The film gives meaning to the phrase, \"Never in the history of human conflict has so much been owed by so many to so few.</td>\n", "      <td>1</td>\n", "      <td>1.0</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " text \\\n", "0 A very, very, very slow-moving, aimless movie about a distressed, drifting young man. \n", "484 Kris Kristoffersen is good in this movie and really makes a difference. \n", "476 Tom Wilkinson broke my heart at the end... and everyone else's judging by the amount of fumbling for hankies and hands going up to faces among males and females alike. \n", "477 Julian Fellowes has triumphed again. \n", "478 He's a national treasure. \n", "479 GO AND SEE IT! \n", "480 This is an excellent film. \n", "481 The aerial scenes were well-done. \n", "482 It was also the right balance of war and love. \n", "483 The film gives meaning to the phrase, \"Never in the history of human conflict has so much been owed by so many to so few. \n", "\n", " label complexity \n", "0 0 1.0 \n", "484 1 1.0 \n", "476 1 1.0 \n", "477 1 1.0 \n", "478 1 1.0 \n", "479 1 1.0 \n", "480 1 1.0 \n", "481 1 1.0 \n", "482 1 1.0 \n", "483 1 1.0 " ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Define the complexity_cleaned function\n", "def complexity_cleaned(series):\n", " return series.apply(lambda x: complexity(' '.join(stopword_nonalpha_remover(x))))\n", "\n", "# Add 'complexity' column to the dataframe\n", "df['complexity'] = complexity_cleaned(df.text)\n", "\n", "# Return top 10 highest complexity scores\n", "df.sort_values('complexity', ascending = False).head(10)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }