Text Mining and Sentiment Analysis with NLTK and pandas in Python
Data import
import pandas as pd
# Import some Tweets from Barack Obama
df = pd.read_csv("https://raw.githubusercontent.com/kirenz/twitter-tweepy/main/tweets-obama.csv")
df.head(3)
| | Unnamed: 0 | created_at | id | author_id | text |
|---|---|---|---|---|---|
0 | 0 | 2022-05-16T21:24:35.000Z | 1526312680226799618 | 813286 | It’s despicable, it’s dangerous — and it needs… |
1 | 1 | 2022-05-16T21:24:34.000Z | 1526312678951641088 | 813286 | We need to repudiate in the strongest terms th… |
2 | 2 | 2022-05-16T21:24:34.000Z | 1526312677521428480 | 813286 | This weekend’s shootings in Buffalo offer a tr… |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 5 non-null int64
1 created_at 5 non-null object
2 id 5 non-null int64
3 author_id 5 non-null int64
4 text 5 non-null object
5 text_token 5 non-null object
6 text_string 5 non-null object
7 text_string_fdist 5 non-null object
8 text_string_lem 5 non-null object
9 is_equal 5 non-null bool
dtypes: bool(1), int64(3), object(6)
memory usage: 493.0+ bytes
Data transformation
df['text'] = df['text'].astype(str).str.lower()
df.head(3)
| | Unnamed: 0 | created_at | id | author_id | text |
|---|---|---|---|---|---|
0 | 0 | 2022-05-16T21:24:35.000Z | 1526312680226799618 | 813286 | it’s despicable, it’s dangerous — and it needs… |
1 | 1 | 2022-05-16T21:24:34.000Z | 1526312678951641088 | 813286 | we need to repudiate in the strongest terms th… |
2 | 2 | 2022-05-16T21:24:34.000Z | 1526312677521428480 | 813286 | this weekend’s shootings in buffalo offer a tr… |
Tokenization
- Install NLTK:
conda install -c anaconda nltk
We use NLTK’s RegexpTokenizer to perform tokenization in combination with regular expressions.
To learn more about regular expressions ("regexp"), see the documentation of Python's re module.
The pattern \w+ matches Unicode word characters one or more times; this includes most characters that can be part of a word in any language, as well as numbers and the underscore.
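As a quick illustration (using a fragment of the first tweet), the tokenizer splits on every character that is not a word character, so the apostrophe in "it's" yields two separate tokens:

from nltk.tokenize import RegexpTokenizer

# quick illustration: the apostrophe is not a word character, so "it's" becomes two tokens
RegexpTokenizer(r'\w+').tokenize("it's despicable, it's dangerous")
# ['it', 's', 'despicable', 'it', 's', 'dangerous']

We now apply the same tokenizer to the text column of our DataFrame: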
from nltk.tokenize import RegexpTokenizer
regexp = RegexpTokenizer(r'\w+')

df['text_token'] = df['text'].apply(regexp.tokenize)
df.head(3)
| | Unnamed: 0 | created_at | id | author_id | text | text_token |
|---|---|---|---|---|---|---|
0 | 0 | 2022-05-16T21:24:35.000Z | 1526312680226799618 | 813286 | it’s despicable, it’s dangerous — and it needs… | [it, s, despicable, it, s, dangerous, and, it,… |
1 | 1 | 2022-05-16T21:24:34.000Z | 1526312678951641088 | 813286 | we need to repudiate in the strongest terms th… | [we, need, to, repudiate, in, the, strongest, … |
2 | 2 | 2022-05-16T21:24:34.000Z | 1526312677521428480 | 813286 | this weekend’s shootings in buffalo offer a tr… | [this, weekend, s, shootings, in, buffalo, off… |
Stopwords
- Stop words are words in a stop list which are dropped before analysing natural language data since they don’t contain valuable information (like “will”, “and”, “or”, “has”, …).
import nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/jankirenz/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True
from nltk.corpus import stopwords
# Make a list of english stopwords
stopwords = nltk.corpus.stopwords.words("english")

# Extend the list with your own custom stopwords
my_stopwords = ['https']
stopwords.extend(my_stopwords)
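A quick optional check confirms that the custom entry was added to the list:

# quick check of the extended stop word list (optional)
print(len(stopwords))        # number of English stop words plus our custom entry
print('https' in stopwords)  # True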
- We use a lambda function to remove the stopwords:
# Remove stopwords
df['text_token'] = df['text_token'].apply(lambda x: [item for item in x if item not in stopwords])
df.head(3)
| | Unnamed: 0 | created_at | id | author_id | text | text_token |
|---|---|---|---|---|---|---|
0 | 0 | 2022-05-16T21:24:35.000Z | 1526312680226799618 | 813286 | it’s despicable, it’s dangerous — and it needs… | [despicable, dangerous, needs, stop, co, 0ch2z… |
1 | 1 | 2022-05-16T21:24:34.000Z | 1526312678951641088 | 813286 | we need to repudiate in the strongest terms th… | [need, repudiate, strongest, terms, politician… |
2 | 2 | 2022-05-16T21:24:34.000Z | 1526312677521428480 | 813286 | this weekend’s shootings in buffalo offer a tr… | [weekend, shootings, buffalo, offer, tragic, r… |
Remove infrequent words
- We first change the format of text_token to strings and keep only words which are longer than 2 letters:
df['text_string'] = df['text_token'].apply(lambda x: ' '.join([item for item in x if len(item) > 2]))

df[['text', 'text_token', 'text_string']].head()
| | text | text_token | text_string |
|---|---|---|---|
0 | it’s despicable, it’s dangerous — and it needs… | [despicable, dangerous, needs, stop, co, 0ch2z… | despicable dangerous needs stop 0ch2zosmhb |
1 | we need to repudiate in the strongest terms th… | [need, repudiate, strongest, terms, politician… | need repudiate strongest terms politicians med… |
2 | this weekend’s shootings in buffalo offer a tr… | [weekend, shootings, buffalo, offer, tragic, r… | weekend shootings buffalo offer tragic reminde… |
3 | i’m proud to announce the voyager scholarship … | [proud, announce, voyager, scholarship, friend… | proud announce voyager scholarship friend bche… |
4 | across the country, americans are standing up … | [across, country, americans, standing, abortio… | across country americans standing abortion rig… |
- Create a list of all words
all_words = ' '.join([word for word in df['text_string']])
- Tokenize all_words:

tokenized_words = nltk.tokenize.word_tokenize(all_words)
- Create a frequency distribution which records the number of times each word has occurred:
from nltk.probability import FreqDist
fdist = FreqDist(tokenized_words)
fdist
FreqDist({'need': 2, 'americans': 2, 'proud': 2, 'despicable': 1, 'dangerous': 1, 'needs': 1, 'stop': 1, '0ch2zosmhb': 1, 'repudiate': 1, 'strongest': 1, ...})
- Now we can use our fdist dictionary to drop words which occur less than a certain number of times (usually we use a value of 3 or 4).
- Since our dataset is really small, we don't filter out any words and set the threshold to greater than or equal to 1 (otherwise there would not be many words left in this particular dataset). A stricter variant is sketched after the output below.
df['text_string_fdist'] = df['text_token'].apply(lambda x: ' '.join([item for item in x if fdist[item] >= 1]))

df[['text', 'text_token', 'text_string', 'text_string_fdist']].head()
| | text | text_token | text_string | text_string_fdist |
|---|---|---|---|---|
0 | it’s despicable, it’s dangerous — and it needs… | [despicable, dangerous, needs, stop, co, 0ch2z… | despicable dangerous needs stop 0ch2zosmhb | despicable dangerous needs stop 0ch2zosmhb |
1 | we need to repudiate in the strongest terms th… | [need, repudiate, strongest, terms, politician… | need repudiate strongest terms politicians med… | need repudiate strongest terms politicians med… |
2 | this weekend’s shootings in buffalo offer a tr… | [weekend, shootings, buffalo, offer, tragic, r… | weekend shootings buffalo offer tragic reminde… | weekend shootings buffalo offer tragic reminde… |
3 | i’m proud to announce the voyager scholarship … | [proud, announce, voyager, scholarship, friend… | proud announce voyager scholarship friend bche… | proud announce voyager scholarship friend bche… |
4 | across the country, americans are standing up … | [across, country, americans, standing, abortio… | across country americans standing abortion rig… | across country americans standing abortion rig… |
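On a larger corpus you would apply the stricter cutoff mentioned above. A sketch with a threshold of 3 (the exact value is a judgment call and this variant is not used on our small dataset) could look like this:

# sketch for a larger corpus: keep only tokens that occur at least 3 times
df['text_string_fdist'] = df['text_token'].apply(
    lambda x: ' '.join([item for item in x if fdist[item] >= 3]))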
Lemmatization
- Next, we perform lemmatization.
nltk.download('wordnet')
nltk.download('omw-1.4')
[nltk_data] Downloading package wordnet to
[nltk_data] /Users/jankirenz/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data] /Users/jankirenz/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!
True
from nltk.stem import WordNetLemmatizer
wordnet_lem = WordNetLemmatizer()

df['text_string_lem'] = df['text_string_fdist'].apply(wordnet_lem.lemmatize)
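Note that WordNetLemmatizer.lemmatize() operates on a single word, so applying it to a whole string looks up the entire string as one token. If you want to lemmatize every word individually, a variant (shown here only as a sketch, not part of the workflow above) could be:

# variant: lemmatize token by token, then join back into a string
df['text_string_lem'] = df['text_string_fdist'].apply(
    lambda x: ' '.join([wordnet_lem.lemmatize(word) for word in x.split()]))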
- Note that in some datasets there may be no words that change after lemmatization. We can check this as follows:
# check if the columns are equal
df['is_equal'] = (df['text_string_fdist'] == df['text_string_lem'])
# show level count
df.is_equal.value_counts()
True 5
Name: is_equal, dtype: int64
df
| | Unnamed: 0 | created_at | id | author_id | text | text_token | text_string | text_string_fdist | text_string_lem | is_equal |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2022-05-16T21:24:35.000Z | 1526312680226799618 | 813286 | it’s despicable, it’s dangerous — and it needs… | [despicable, dangerous, needs, stop, co, 0ch2z… | despicable dangerous needs stop 0ch2zosmhb | despicable dangerous needs stop 0ch2zosmhb | despicable dangerous needs stop 0ch2zosmhb | True |
1 | 1 | 2022-05-16T21:24:34.000Z | 1526312678951641088 | 813286 | we need to repudiate in the strongest terms th… | [need, repudiate, strongest, terms, politician… | need repudiate strongest terms politicians med… | need repudiate strongest terms politicians med… | need repudiate strongest terms politicians med… | True |
2 | 2 | 2022-05-16T21:24:34.000Z | 1526312677521428480 | 813286 | this weekend’s shootings in buffalo offer a tr… | [weekend, shootings, buffalo, offer, tragic, r… | weekend shootings buffalo offer tragic reminde… | weekend shootings buffalo offer tragic reminde… | weekend shootings buffalo offer tragic reminde… | True |
3 | 3 | 2022-05-16T13:16:16.000Z | 1526189794665107457 | 813286 | i’m proud to announce the voyager scholarship … | [proud, announce, voyager, scholarship, friend… | proud announce voyager scholarship friend bche… | proud announce voyager scholarship friend bche… | proud announce voyager scholarship friend bche… | True |
4 | 4 | 2022-05-14T15:03:07.000Z | 1525491905139773442 | 813286 | across the country, americans are standing up … | [across, country, americans, standing, abortio… | across country americans standing abortion rig… | across country americans standing abortion rig… | across country americans standing abortion rig… | True |
Word cloud
- Install wordcloud:
conda install -c conda-forge wordcloud
all_words_lem = ' '.join([word for word in df['text_string_lem']])
%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud
wordcloud = WordCloud(width=600,
                      height=400,
                      random_state=2,
                      max_font_size=100).generate(all_words_lem)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off');
- Different style:
import numpy as np
x, y = np.ogrid[:300, :300]

# circular mask: values outside a circle of radius 130 centred at (150, 150)
mask = (x - 150) ** 2 + (y - 150) ** 2 > 130 ** 2
mask = 255 * mask.astype(int)

wc = WordCloud(background_color="white", repeat=True, mask=mask)
wc.generate(all_words_lem)

plt.axis("off")
plt.imshow(wc, interpolation="bilinear");
Frequency distributions
nltk.download('punkt')
[nltk_data] Downloading package punkt to /Users/jankirenz/nltk_data...
[nltk_data] Package punkt is already up-to-date!
True
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
words = nltk.word_tokenize(all_words_lem)
fd = FreqDist(words)
Most common words
fd.most_common(3)
[('need', 2), ('americans', 2), ('proud', 2)]
fd.tabulate(3)
need americans proud
2 2 2
Plot common words
# Obtain top 10 words
top_10 = fd.most_common(10)

# Create pandas series to make plotting easier
fdist = pd.Series(dict(top_10))
import seaborn as sns
="ticks")
sns.set_theme(style
=fdist.index, x=fdist.values, color='blue'); sns.barplot(y
import plotly.express as px
fig = px.bar(y=fdist.index, x=fdist.values)

# sort values
fig.update_layout(barmode='stack', yaxis={'categoryorder':'total ascending'})
# show plot
fig.show()
Search specific words
# Show frequency of a specific word
"americans"] fd[
2
Sentiment analysis
VADER lexicon
- NLTK provides a simple rule-based model for general sentiment analysis called VADER, which stands for “Valence Aware Dictionary and Sentiment Reasoner” (Hutto & Gilbert, 2014).
nltk.download('vader_lexicon')
[nltk_data] Downloading package vader_lexicon to
[nltk_data] /Users/jankirenz/nltk_data...
[nltk_data] Package vader_lexicon is already up-to-date!
True
Sentiment
Sentiment Intensity Analyzer
- Initialize an object of SentimentIntensityAnalyzer with the name "analyzer":
from nltk.sentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
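Before scoring the tweets, it helps to see what the analyzer returns for a single sentence: a dictionary with the keys neg, neu, pos and a normalized compound score between -1 and 1 (the sentence below is made up, not from the dataset; exact values depend on the lexicon version):

# quick check on an example sentence (not from the dataset)
analyzer.polarity_scores("NLTK makes sentiment analysis straightforward!")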
Polarity scores
- Use the polarity_scores method:
df['polarity'] = df['text_string_lem'].apply(lambda x: analyzer.polarity_scores(x))
df.tail(3)
| | Unnamed: 0 | created_at | id | author_id | text | text_token | text_string | text_string_fdist | text_string_lem | is_equal | polarity |
|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 2 | 2022-05-16T21:24:34.000Z | 1526312677521428480 | 813286 | this weekend’s shootings in buffalo offer a tr… | [weekend, shootings, buffalo, offer, tragic, r… | weekend shootings buffalo offer tragic reminde… | weekend shootings buffalo offer tragic reminde… | weekend shootings buffalo offer tragic reminde… | True | {‘neg’: 0.247, ‘neu’: 0.557, ‘pos’: 0.195, ’co… |
3 | 3 | 2022-05-16T13:16:16.000Z | 1526189794665107457 | 813286 | i’m proud to announce the voyager scholarship … | [proud, announce, voyager, scholarship, friend… | proud announce voyager scholarship friend bche… | proud announce voyager scholarship friend bche… | proud announce voyager scholarship friend bche… | True | {‘neg’: 0.0, ‘neu’: 0.573, ‘pos’: 0.427, ’comp… |
4 | 4 | 2022-05-14T15:03:07.000Z | 1525491905139773442 | 813286 | across the country, americans are standing up … | [across, country, americans, standing, abortio… | across country americans standing abortion rig… | across country americans standing abortion rig… | across country americans standing abortion rig… | True | {‘neg’: 0.0, ‘neu’: 0.71, ‘pos’: 0.29, ’compou… |
Transform data
# Change data structure
df = pd.concat(
    [df.drop(['Unnamed: 0', 'id', 'author_id', 'polarity'], axis=1),
     df['polarity'].apply(pd.Series)], axis=1)

df.head(3)
| | created_at | text | text_token | text_string | text_string_fdist | text_string_lem | is_equal | neg | neu | pos | compound |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-05-16T21:24:35.000Z | it’s despicable, it’s dangerous — and it needs… | [despicable, dangerous, needs, stop, co, 0ch2z… | despicable dangerous needs stop 0ch2zosmhb | despicable dangerous needs stop 0ch2zosmhb | despicable dangerous needs stop 0ch2zosmhb | True | 0.639 | 0.361 | 0.000 | -0.6486 |
1 | 2022-05-16T21:24:34.000Z | we need to repudiate in the strongest terms th… | [need, repudiate, strongest, terms, politician… | need repudiate strongest terms politicians med… | need repudiate strongest terms politicians med… | need repudiate strongest terms politicians med… | True | 0.247 | 0.458 | 0.295 | 0.2263 |
2 | 2022-05-16T21:24:34.000Z | this weekend’s shootings in buffalo offer a tr… | [weekend, shootings, buffalo, offer, tragic, r… | weekend shootings buffalo offer tragic reminde… | weekend shootings buffalo offer tragic reminde… | weekend shootings buffalo offer tragic reminde… | True | 0.247 | 0.557 | 0.195 | -0.1280 |
# Create new variable with sentiment "neutral," "positive" and "negative"
df['sentiment'] = df['compound'].apply(lambda x: 'positive' if x > 0 else 'neutral' if x == 0 else 'negative')
df.head(4)
| | created_at | text | text_token | text_string | text_string_fdist | text_string_lem | is_equal | neg | neu | pos | compound | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-05-16T21:24:35.000Z | it’s despicable, it’s dangerous — and it needs… | [despicable, dangerous, needs, stop, co, 0ch2z… | despicable dangerous needs stop 0ch2zosmhb | despicable dangerous needs stop 0ch2zosmhb | despicable dangerous needs stop 0ch2zosmhb | True | 0.639 | 0.361 | 0.000 | -0.6486 | negative |
1 | 2022-05-16T21:24:34.000Z | we need to repudiate in the strongest terms th… | [need, repudiate, strongest, terms, politician… | need repudiate strongest terms politicians med… | need repudiate strongest terms politicians med… | need repudiate strongest terms politicians med… | True | 0.247 | 0.458 | 0.295 | 0.2263 | positive |
2 | 2022-05-16T21:24:34.000Z | this weekend’s shootings in buffalo offer a tr… | [weekend, shootings, buffalo, offer, tragic, r… | weekend shootings buffalo offer tragic reminde… | weekend shootings buffalo offer tragic reminde… | weekend shootings buffalo offer tragic reminde… | True | 0.247 | 0.557 | 0.195 | -0.1280 | negative |
3 | 2022-05-16T13:16:16.000Z | i’m proud to announce the voyager scholarship … | [proud, announce, voyager, scholarship, friend… | proud announce voyager scholarship friend bche… | proud announce voyager scholarship friend bche… | proud announce voyager scholarship friend bche… | True | 0.000 | 0.573 | 0.427 | 0.9313 | positive |
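The rule above labels any compound score greater than zero as positive. The VADER authors suggest a small neutral band instead (compound >= 0.05 for positive, <= -0.05 for negative), so a commonly used variant, not applied in the remainder of this post, is:

# variant with the widely used +/- 0.05 neutral band (not applied here)
df['sentiment'] = df['compound'].apply(
    lambda x: 'positive' if x >= 0.05 else 'negative' if x <= -0.05 else 'neutral')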
Analyze data
# Tweet with highest positive sentiment
df.loc[df['compound'].idxmax()].values
array(['2022-05-16T13:16:16.000Z',
'i’m proud to announce the voyager scholarship with my friend @bchesky. we hope to provide young people with an interest in public service with some financial support to graduate college, exposure to travel, and the networks they need to make a difference. https://t.co/rbtrjalgpe https://t.co/rz7qknmmww',
list(['proud', 'announce', 'voyager', 'scholarship', 'friend', 'bchesky', 'hope', 'provide', 'young', 'people', 'interest', 'public', 'service', 'financial', 'support', 'graduate', 'college', 'exposure', 'travel', 'networks', 'need', 'make', 'difference', 'co', 'rbtrjalgpe', 'co', 'rz7qknmmww']),
'proud announce voyager scholarship friend bchesky hope provide young people interest public service financial support graduate college exposure travel networks need make difference rbtrjalgpe rz7qknmmww',
'proud announce voyager scholarship friend bchesky hope provide young people interest public service financial support graduate college exposure travel networks need make difference rbtrjalgpe rz7qknmmww',
'proud announce voyager scholarship friend bchesky hope provide young people interest public service financial support graduate college exposure travel networks need make difference rbtrjalgpe rz7qknmmww',
True, 0.0, 0.573, 0.427, 0.9313, 'positive'], dtype=object)
# Tweet with highest negative sentiment
# ...the negative score is mainly driven by the words "despicable" and "dangerous"
df.loc[df['compound'].idxmin()].values
array(['2022-05-16T21:24:35.000Z',
'it’s despicable, it’s dangerous — and it needs to stop.\nhttps://t.co/0ch2zosmhb',
list(['despicable', 'dangerous', 'needs', 'stop', 'co', '0ch2zosmhb']),
'despicable dangerous needs stop 0ch2zosmhb',
'despicable dangerous needs stop 0ch2zosmhb',
'despicable dangerous needs stop 0ch2zosmhb', True, 0.639, 0.361,
0.0, -0.6486, 'negative'], dtype=object)
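To see which words drive such a score, the analyzer can also be applied to individual words from the tweet (an illustrative check; the exact values depend on the lexicon version):

# check the contribution of individual words (illustrative)
for word in ['despicable', 'dangerous', 'stop']:
    print(word, analyzer.polarity_scores(word)['compound'])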
Visualize data
# Number of tweets
sns.countplot(y='sentiment',
              data=df,
              palette=['#b2d8d8', "#008080", '#db3d13']);
# Lineplot
g = sns.lineplot(x='created_at', y='compound', data=df)

g.set(xticklabels=[])
g.set(title='Sentiment of Tweets')
g.set(xlabel="Time")
g.set(ylabel="Sentiment")
g.tick_params(bottom=False)

g.axhline(0, ls='--', c='grey');
# Boxplot
sns.boxplot(y='compound',
            x='sentiment',
            palette=['#b2d8d8', "#008080", '#db3d13'],
            data=df);
Literature:

Hutto, C. J., & Gilbert, E. (2014). VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM-14).