Introduction to Text Mining in R
This post demonstrates how various R packages can be used for text mining in R. In particular, we start with common text transformations, explore the data with term frequency (tf) and inverse document frequency (idf), and build a supervised classification model that learns to distinguish between texts of different authors.
The content of this tutorial is based on the excellent book “Text Mining with R” (2019) by Julia Silge and David Robinson and the blog post “Text classification with tidy data principles” (2018) by Julia Silge.
Installation of R packages
If you would like to install all packages at once, use the code below.

install.packages(c("dplyr", "gutenbergr", "stringr", "tidytext", "tidyr", "stopwords", "wordcloud", "rsample", "glmnet", "forcats", "broom", "igraph", "ggraph", "kableExtra", "yardstick"))
Data import
We can access the full texts of various books from “Project Gutenberg” via the gutenbergr
package. We can look up certain authors or titles with a regular expression using the stringr
package. All functions in stringr
start with str_
and take a vector of strings as the first argument. To learn more about stringr, visit the stringr documentation.
library(gutenbergr)
library(stringr)
doyle <- gutenberg_works(str_detect(author, "Doyle"))
library(kableExtra)
knitr::kable(head(doyle, 4)) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
gutenberg_id | title | author | gutenberg_author_id | language | gutenberg_bookshelf | rights | has_text |
---|---|---|---|---|---|---|---|
108 | The Return of Sherlock Holmes | Doyle, Arthur Conan | 69 | en | Detective Fiction | Public domain in the USA. | TRUE |
126 | The Poison Belt | Doyle, Arthur Conan | 69 | en | Science Fiction | Public domain in the USA. | TRUE |
139 | The Lost World | Doyle, Arthur Conan | 69 | en | Science Fiction | Public domain in the USA. | TRUE |
244 | A Study in Scarlet | Doyle, Arthur Conan | 69 | en | Detective Fiction | Public domain in the USA. | TRUE |
We obtain “Relativity: The Special and General Theory” by Albert Einstein (gutenberg_id: 30155) and “Experiments with Alternate Currents of High Potential and High Frequency” by Nikola Tesla (gutenberg_id: 13476) from Project Gutenberg and add the column “author” to the result.
books <- gutenberg_download(c(30155, 13476), meta_fields = "author")
Furthermore, we transform the data to a tibble (tibbles are a modern take on data frames), add the row number as a column named document and drop the column gutenberg_id. We will use the information in the column document to train a model that can take an individual line (row) and give us a probability that the text in this particular line comes from a certain author.
library(dplyr)
books <- as_tibble(books) |>
  mutate(document = row_number()) |>
  select(-gutenberg_id)
kable(head(books, 8)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
text | author | document |
---|---|---|
EXPERIMENTS WITH ALTERNATE CURRENTS OF HIGH POTENTIAL AND HIGH FREQUENCY | Tesla, Nikola | 1 |
Tesla, Nikola | 2 | |
A Lecture Delivered before the Institution of Electrical Engineers, London | Tesla, Nikola | 3 |
Tesla, Nikola | 4 | |
by | Tesla, Nikola | 5 |
Tesla, Nikola | 6 | |
NIKOLA TESLA | Tesla, Nikola | 7 |
Tesla, Nikola | 8 |
Data transformation
Tokenization
First of all, we need to both break the text into individual tokens (a process called tokenization) and transform it to a tidy data structure (i.e. each variable must have its own column, each observation must have its own row and each value must have its own cell). To do this, we use tidytext’s unnest_tokens()
function. We also remove the rarest words in that step, keeping only words in our dataset that occur more than 10 times.
library(tidytext)
tidy_books <- books |>
  unnest_tokens(word, text) |>
  group_by(word) |>
  filter(n() > 10) |>
  ungroup()
kable(head(tidy_books, 8)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
author | document | word |
---|---|---|
Tesla, Nikola | 1 | experiments |
Tesla, Nikola | 1 | with |
Tesla, Nikola | 1 | alternate |
Tesla, Nikola | 1 | currents |
Tesla, Nikola | 1 | of |
Tesla, Nikola | 1 | high |
Tesla, Nikola | 1 | potential |
Tesla, Nikola | 1 | and |
Stop words
Now that the data is in a tidy “one-word-per-row” format, we can manipulate it with packages like dplyr
. Often in text analysis, we will want to remove stop words: Stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth. We can remove stop words in our data by using the stop words provided in the package stopwords
with an anti_join()
from the package dplyr
.
library(stopwords)
library(tibble)
stopword <- as_tibble(stopwords::stopwords("en"))
stopword <- rename(stopword, word = value)
tb <- anti_join(tidy_books, stopword, by = "word")
kable(head(tb, 8)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
author | document | word |
---|---|---|
Tesla, Nikola | 1 | experiments |
Tesla, Nikola | 1 | alternate |
Tesla, Nikola | 1 | currents |
Tesla, Nikola | 1 | high |
Tesla, Nikola | 1 | potential |
Tesla, Nikola | 1 | high |
Tesla, Nikola | 1 | frequency |
Tesla, Nikola | 3 | lecture |
The tidy data structure allows different types of exploratory data analysis (EDA), which we turn to next.
Exploratory data analysis
Term frequency (tf)
An important question in text mining is how to quantify what a document is about. One measure of how important a word may be is its term frequency (tf), i.e. how frequently a word occurs in a document.
We can start by using dplyr
to explore the most commonly used words.
word_count <- count(tb, word, sort = TRUE)
kable(head(word_count, 5)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
word | n |
---|---|
one | 239 |
body | 230 |
may | 224 |
can | 194 |
relativity | 193 |
Term frequency by author:
author_count <- tb |>
  count(author, word, sort = TRUE)
kable(head(author_count, 10)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
author | word | n |
---|---|---|
Einstein, Albert | relativity | 193 |
Tesla, Nikola | may | 184 |
Einstein, Albert | theory | 181 |
Tesla, Nikola | bulb | 171 |
Tesla, Nikola | coil | 166 |
Tesla, Nikola | high | 166 |
Einstein, Albert | body | 156 |
Tesla, Nikola | one | 156 |
Einstein, Albert | reference | 150 |
Tesla, Nikola | tube | 147 |
Plot terms with a frequency greater than 100:
library(ggplot2)
tb |>
  count(author, word, sort = TRUE) |>
  filter(n > 100) |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(word, n)) +
  geom_col(aes(fill = author)) +
  xlab(NULL) +
  scale_y_continuous(expand = c(0, 0)) +
  coord_flip() +
  theme_classic(base_size = 12) +
  labs(fill = "Author", title = "Word frequency", subtitle = "n > 100") +
  theme(plot.title = element_text(lineheight = .8, face = "bold")) +
  scale_fill_brewer()
Plot top 20 terms by author:
tb |>
  count(author, word, sort = TRUE) |>
  group_by(author) |>
  top_n(20) |>
  ungroup() |>
  ggplot(aes(reorder_within(word, n, author), n,
             fill = author)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~author, scales = "free") +
  scale_y_continuous(expand = c(0, 0)) +
  theme_classic(base_size = 12) +
  labs(fill = "Author",
       title = "Most frequent words",
       subtitle = "Top 20 words by book",
       x = NULL,
       y = "Word Count") +
  theme(plot.title = element_text(lineheight = .8, face = "bold")) +
  scale_fill_brewer()
You may notice tokens like “_k” and “co” in the Einstein text and “fig” in the Tesla text. Let’s collect these and other less meaningful tokens in a custom list of stop words and remove them with anti_join().
newstopwords <- tibble(word = c("eq", "co", "rc", "ac", "ak", "bn", "fig", "file", "cg", "cb", "cm", "ab", "_k", "_k_", "_x"))

tb <- anti_join(tb, newstopwords, by = "word")
Now we plot the data again without the new stop words:
tb |>
  count(author, word, sort = TRUE) |>
  group_by(author) |>
  top_n(20) |>
  ungroup() |>
  ggplot(aes(reorder_within(word, n, author), n,
             fill = author)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~author, scales = "free") +
  scale_y_continuous(expand = c(0, 0)) +
  theme_classic(base_size = 12) +
  labs(fill = "Author",
       title = "Most frequent words after removing stop words",
       subtitle = "Top 20 words by book",
       x = NULL,
       y = "Word Count") +
  theme(plot.title = element_text(lineheight = .8, face = "bold")) +
  scale_fill_brewer()
You also may want to visualize the most frequent terms as a simple word cloud:
library(wordcloud)
tb |>
  count(word) |>
  with(wordcloud(word, n, max.words = 15))
Term frequency and inverse document frequency (tf-idf)
Term frequency is a useful measure to determine how frequently a word occurs in a document. There are words in a document, however, that occur many times but may not be important.
Another approach is to look at a term’s inverse document frequency (idf), which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term’s tf-idf (the two quantities multiplied together), the frequency of a term adjusted for how rarely it is used.
The inverse document frequency for any given term is defined as:
\[idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}\]
Hence, term frequency and inverse document frequency together allow us to find words that are characteristic of one document within a collection of documents. The tidytext
package uses an implementation of tf-idf consistent with tidy data principles that enables us to see how different words are important in documents within a collection or corpus of documents.
library(forcats)
plot_tb <- tb |>
  count(author, word, sort = TRUE) |>
  bind_tf_idf(word, author, n) |>
  mutate(word = fct_reorder(word, tf_idf)) |>
  mutate(author = factor(author,
                         levels = c("Tesla, Nikola",
                                    "Einstein, Albert")))
plot_tb |>
  group_by(author) |>
  top_n(15, tf_idf) |>
  ungroup() |>
  mutate(word = reorder(word, tf_idf)) |>
  ggplot(aes(word, tf_idf, fill = author)) +
  scale_y_continuous(expand = c(0, 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~author, ncol = 2, scales = "free") +
  coord_flip() +
  theme_classic(base_size = 12) +
  labs(fill = "Author",
       title = "Term frequency and inverse document frequency (tf-idf)",
       subtitle = "Top 15 words by book",
       x = NULL,
       y = "tf-idf") +
  theme(plot.title = element_text(lineheight = .8, face = "bold")) +
  scale_fill_brewer()
In particular, the bind_tf_idf
function in the tidytext
package takes a tidy text dataset as input with one row per token (term), per document. One column (word here) contains the terms/tokens, one column contains the documents (authors in this case), and the last necessary column contains the counts, how many times each document contains each term (n in this example).
tf_idf <- tb |>
  count(author, word, sort = TRUE) |>
  bind_tf_idf(word, author, n)
kable(head(tf_idf, 10)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
author | word | n | tf | idf | tf_idf |
---|---|---|---|---|---|
Einstein, Albert | relativity | 193 | 0.0177618 | 0.6931472 | 0.0123116 |
Tesla, Nikola | may | 184 | 0.0139436 | 0.0000000 | 0.0000000 |
Einstein, Albert | theory | 181 | 0.0166575 | 0.6931472 | 0.0115461 |
Tesla, Nikola | bulb | 171 | 0.0129585 | 0.6931472 | 0.0089821 |
Tesla, Nikola | coil | 166 | 0.0125796 | 0.6931472 | 0.0087195 |
Tesla, Nikola | high | 166 | 0.0125796 | 0.0000000 | 0.0000000 |
Einstein, Albert | body | 156 | 0.0143567 | 0.0000000 | 0.0000000 |
Tesla, Nikola | one | 156 | 0.0118218 | 0.0000000 | 0.0000000 |
Einstein, Albert | reference | 150 | 0.0138045 | 0.0000000 | 0.0000000 |
Tesla, Nikola | tube | 147 | 0.0111397 | 0.0000000 | 0.0000000 |
Notice that idf and thus tf-idf are zero for extremely common words (like “may”). These are all words that appear in both documents, so the idf term (which will then be the natural log of 1) is zero. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how this approach decreases the weight for common words. The inverse document frequency will be a higher number for words that occur in fewer of the documents in the collection.
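As a quick sanity check of the values in the table above (a small addition, not part of the original post): with only two “documents” (the two authors) in the collection, idf can take only two values.

# idf for a word that appears in both texts (e.g. "may", "high", "body"): log(2/2) = 0
log(2 / 2)
# idf for a word that appears in only one text (e.g. "relativity", "bulb"): log(2/1) ≈ 0.693
log(2 / 1)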
Tokenizing by n-gram
We’ve been using the unnest_tokens
function to tokenize by word, or sometimes by sentence, which is useful for the kinds of frequency analyses we’ve been doing so far. But we can also use the function to tokenize into consecutive sequences of words, called n-grams. By seeing how often word X is followed by word Y, we can then build a model of the relationships between them.
einstein_bigrams <- books |>
  filter(author == "Einstein, Albert") |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
kable(head(einstein_bigrams, 10)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
author | document | bigram |
---|---|---|
Einstein, Albert | 3797 | NA |
Einstein, Albert | 3798 | NA |
Einstein, Albert | 3799 | NA |
Einstein, Albert | 3800 | NA |
Einstein, Albert | 3801 | NA |
Einstein, Albert | 3802 | relativity the |
Einstein, Albert | 3802 | the special |
Einstein, Albert | 3802 | special and |
Einstein, Albert | 3802 | and general |
Einstein, Albert | 3802 | general theory |
We can examine the most common bigrams using dplyr’s count()
:
einstein_bigrams_count <- einstein_bigrams |>
  count(bigram, sort = TRUE)
kable(head(einstein_bigrams_count, 10)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
bigram | n |
---|---|
NA | 921 |
of the | 613 |
to the | 247 |
in the | 197 |
of relativity | 164 |
theory of | 121 |
with the | 119 |
on the | 111 |
that the | 110 |
of a | 98 |
Now we use tidyr’s separate(), which splits a column into multiple columns based on a delimiter. This lets us separate the bigram into two columns, “word1” and “word2”, at which point we can remove cases where either is a stop word. This time, we use the stop_words dataset from the tidytext package:
library(tidyr)
# separate words
bigrams_separated <- einstein_bigrams |>
  separate(bigram, c("word1", "word2"), sep = " ")

# filter stop words and NA
bigrams_filtered <- bigrams_separated |>
  filter(!word1 %in% stop_words$word) |>
  filter(!word2 %in% stop_words$word) |>
  filter(!is.na(word1))

# new bigram counts:
bigram_counts <- bigrams_filtered |>
  count(word1, word2, sort = TRUE)
kable(head(bigram_counts, 10)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
word1 | word2 | n |
---|---|---|
reference | body | 56 |
gravitational | field | 53 |
special | theory | 35 |
ordinate | system | 34 |
space | time | 27 |
classical | mechanics | 26 |
lorentz | transformation | 23 |
measuring | rods | 22 |
straight | line | 17 |
rigid | body | 16 |
This one-bigram-per-row format is helpful for exploratory analyses of the text. As a simple example, we might be interested in which words most often precede “theory”:
bigram_theory <- bigrams_filtered |>
  filter(word2 == "theory") |>
  count(word1, sort = TRUE)
kable(head(bigram_theory, 7)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
word1 | n |
---|---|
special | 35 |
lorentz | 4 |
newton’s | 4 |
_special | 1 |
comprehensive | 1 |
electrodynamic | 1 |
electromagnetic | 1 |
In other analyses you may be interested in the most common trigrams, which are consecutive sequences of 3 words. We can find these by setting n = 3:
trigram <- books |>
  unnest_tokens(trigram, text, token = "ngrams", n = 3) |>
  separate(trigram, c("word1", "word2", "word3"), sep = " ") |>
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word,
         !is.na(word1)) |>
  count(word1, word2, word3, sort = TRUE)
kable(head(trigram, 7)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
word1 | word2 | word3 | n |
---|---|---|---|
_x_1 | _x_2 | _x_3 | 12 |
light | _in | vacuo_ | 10 |
reference | body | _k_ | 10 |
space | time | continuum | 9 |
_x_2 | _x_3 | _x_4 | 8 |
reference | body | _k | 8 |
disruptive | discharge | coil | 6 |
Network analysis
We may be interested in visualizing all of the relationships among words simultaneously, rather than just the top few at a time. As one common visualization, we can arrange the words into a network, or “graph.” Here we’ll be referring to a “graph” not in the sense of a visualization, but as a combination of connected nodes. A graph can be constructed from a tidy object since it has three variables:
- from: the node an edge is coming from
- to: the node an edge is going towards
- weight: A numeric value associated with each edge
The igraph
package has many functions for manipulating and analyzing networks. One way to create an igraph object from tidy data is the graph_from_data_frame()
function, which takes a data frame of edges with columns for “from”, “to”, and edge attributes (in this case n):
library(igraph)
# filter for only relatively common combinations
bigram_graph <- bigram_counts |>
  filter(n > 5) |>
  graph_from_data_frame()
We use the ggraph
package to convert the igraph object into a ggraph
with the ggraph function, after which we add layers to it, much like layers are added in ggplot2. For example, for a basic graph we need to add three layers: nodes, edges, and text:
library(ggraph)
set.seed(123)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
Finally, we will change some settings to obtain a better-looking graph:

- We add the edge_alpha aesthetic to the link layer to make links transparent based on how common or rare the bigram is.
- We add directionality with an arrow, constructed using grid::arrow(), including an end_cap option that tells the arrow to end before touching the node.
- We tinker with the options to the node layer to make the nodes more attractive (larger, blue points).
- We add a theme that’s useful for plotting networks, theme_void().
set.seed(123)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
Classification with logistic regression
In the first part we will build a statistical learning model; in the second part we will test it and assess its quality. Without splitting the dataset, we would test the model on data the algorithm has already seen, which is why we start by splitting the data into a training and a test set.
Train test split
Let’s go back to the original books
dataset (not the tidy_books
dataset) because the lines of text are our individual observations.
We could use functions from the rsample
package to generate resampled datasets, but the specific modeling approach we’re going to use will do that for us so we only need a simple train/test split.
library(rsample)
books_split <- books |>
  select(document) |>
  initial_split(prop = 3/4)

train_data <- training(books_split)
test_data <- testing(books_split)
Notice that we just select specific text rows (column document
) for training and others for our test data (we set the proportion of data to be retained for modeling/analysis to 3/4) without selecting the actual text lines at this point.
Training data (sparse matrix)
Now we want to transform our training data from a tidy data structure to a “sparse matrix” (these objects can be treated as though they were matrices, for example accessing particular rows and columns, but are stored in a more efficient format) to use for our classification algorithm.
sparse_words <- tidy_books |>
  count(document, word) |>
  inner_join(train_data, by = "document") |>
  cast_sparse(document, word, n)
dim(sparse_words)
[1] 4769 893
We have over 4,700 training observations and almost 900 features. Text feature space handled in this way is very high dimensional, so we need to take that into account when considering our modeling approach.
One reason this overall approach is flexible is that you could at this point cbind()
other columns, such as non-text numeric data, onto this sparse matrix. Then you can use this combination of text and non-text data as your predictors in the classification algorithm, and the regularized regression algorithm we are going to use will find out which of them are important for your problem space.
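As a minimal sketch of that idea (an illustration added here, not part of the original analysis), we could append a hypothetical numeric feature such as the character count of each training line; the feature name n_chars and the use of Matrix’s cbind() are assumptions.

library(Matrix)

# hypothetical extra predictor: number of characters per training line
doc_ids <- as.integer(rownames(sparse_words))
n_chars <- nchar(books$text[match(doc_ids, books$document)])

# bind the numeric column onto the sparse document-term matrix
sparse_with_meta <- cbind(sparse_words, n_chars)
colnames(sparse_with_meta)[ncol(sparse_with_meta)] <- "n_chars"
dim(sparse_with_meta)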
Response variable
We also need to build a tibble with a response variable to associate each of the rownames()
of the sparse matrix with an author, to use as the quantity we will predict in the model.
word_rownames <- as.integer(rownames(sparse_words))

books_joined <- tibble(document = word_rownames) |>
  left_join(books |>
              select(document, author))
kable(head(books_joined, 7)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
document | author |
---|---|
1 | Tesla, Nikola |
3 | Tesla, Nikola |
5 | Tesla, Nikola |
11 | Tesla, Nikola |
21 | Tesla, Nikola |
24 | Tesla, Nikola |
25 | Tesla, Nikola |
Logistic regression model
Now it’s time to train our classification model. Let’s use the glmnet
package to fit a logistic regression model with lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) regularization. This regression analysis method performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces.
Glmnet is a package that fits lasso models via penalized maximum likelihood. We do not cover the method and the glmnet package in detail at this point; if you want to learn more about glmnet and lasso regression, the glmnet documentation and vignettes are a good starting point.
The package is very useful for text classification because the variable selection that lasso regularization performs can tell you which words are important for your prediction problem. The glmnet package also supports parallel processing, so we can train on multiple cores with cross-validation on the training set using cv.glmnet()
.
library(glmnet)
is_einstein <- books_joined$author == "Einstein, Albert"

model <- cv.glmnet(sparse_words,
                   is_einstein,
                   family = "binomial",
                   parallel = TRUE,
                   keep = TRUE)
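Before inspecting coefficients, it can be helpful to look at the cross-validation results themselves (a quick addition; the exact values vary with the random folds):

# cross-validation curve: binomial deviance across the lambda path
plot(model)

# penalty at the minimum CV error and at the more conservative 1-SE rule
model$lambda.min
model$lambda.1se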
Let’s use the package broom
(the broom package takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames) to check out the coefficients of the model, for the largest value of lambda with error within 1 standard error of the minimum (lambda.1se
).
library(broom)
coefs <- model$glmnet.fit |>
  tidy() |>
  filter(lambda == model$lambda.1se)
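Because the lasso shrinks many coefficients exactly to zero, only a subset of the roughly 900 word features remains at lambda.1se. A quick way to see how many (added here as a check; the count varies between runs):

# number of terms with non-zero coefficients at lambda.1se (including the intercept)
nrow(coefs)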
Which coefficients are the largest in size, in each direction?
library(forcats)
coefs |>
  group_by(estimate > 0) |>
  top_n(10, abs(estimate)) |>
  ungroup() |>
  ggplot(aes(fct_reorder(term, estimate), estimate, fill = estimate > 0)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  coord_flip() +
  labs(
    x = NULL,
    title = "Coefficients that increase/decrease probability the most",
    subtitle = "A document mentioning lecture or probably is unlikely to be written by Albert Einstein"
  ) +
  theme_classic(base_size = 12) +
  theme(plot.title = element_text(lineheight = .8, face = "bold")) +
  scale_fill_brewer()
Model evaluation with test data
Now we want to evaluate how well this model is doing using the test data that we held out and did not use for training the model. Let’s create a dataframe that tells us, for each document in the test set, the probability of being written by Albert Einstein.
intercept <- coefs |>
  filter(term == "(Intercept)") |>
  pull(estimate)

classifications <- tidy_books |>
  inner_join(test_data) |>
  inner_join(coefs, by = c("word" = "term")) |>
  group_by(document) |>
  summarize(score = sum(estimate)) |>
  mutate(probability = plogis(intercept + score))
kable(head(classifications, 7)) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
document | score | probability |
---|---|---|
7 | -1.6569483 | 0.1715378 |
9 | -0.5226801 | 0.3916220 |
27 | -1.2047108 | 0.2455423 |
32 | -1.0465089 | 0.2760125 |
36 | -0.5824899 | 0.3774680 |
38 | -0.1416163 | 0.4851455 |
40 | -4.7306180 | 0.0094857 |
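As a consistency check (an addition, not an evaluation on the held-out data), the manually computed probabilities can be compared with what glmnet itself predicts for the rows of the training matrix; scoring the test lines this way would additionally require casting them into a matrix with the same vocabulary.

# predicted probabilities from the fitted model at lambda.1se for the training rows
head(predict(model, sparse_words, type = "response", s = "lambda.1se"))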
Now let’s use the yardstick
package (yardstick is a package to estimate how well models are working using tidy data principles) to calculate some model performance metrics. For example, what does the ROC curve (receiver operating characteristic curve - a graph showing the performance of a classification model at all classification thresholds) look like:
library(yardstick)
comment_classes <- classifications |>
  left_join(books |>
              select(author, document), by = "document") |>
  mutate(author = as.factor(author))
comment_classes
# A tibble: 1,553 × 4
document score probability author
<int> <dbl> <dbl> <fct>
1 7 -1.66 0.172 Tesla, Nikola
2 9 -0.523 0.392 Tesla, Nikola
3 27 -1.20 0.246 Tesla, Nikola
4 32 -1.05 0.276 Tesla, Nikola
5 36 -0.582 0.377 Tesla, Nikola
6 38 -0.142 0.485 Tesla, Nikola
7 40 -4.73 0.00949 Tesla, Nikola
8 41 -3.18 0.0434 Tesla, Nikola
9 42 -0.209 0.468 Tesla, Nikola
10 46 -0.433 0.413 Tesla, Nikola
# ℹ 1,543 more rows
roc <- comment_classes |>
  roc_curve(author, probability)
roc |>
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_line(
    color = "midnightblue",
    size = 1.5
  ) +
  geom_abline(
    lty = 2, alpha = 0.5,
    color = "gray50",
    size = 1.2
  ) +
  labs(
    title = "ROC curve for text classification using regularized regression",
    subtitle = "Predicting whether text was written by Albert Einstein or Nikola Tesla"
  ) +
  theme_classic(base_size = 12) +
  theme(plot.title = element_text(lineheight = .8, face = "bold"))
Let’s obtain the area under the ROC curve (AUC), which summarizes how well the model separates the two classes across all thresholds, on the test data:
auc <- comment_classes |>
  roc_auc(author, probability)
kable(auc) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
.metric | .estimator | .estimate |
---|---|---|
roc_auc | binary | 0.9741431 |
Next we turn to the confusion matrix. Let’s make the following definitions:
- “Einstein, Albert” is the positive class.
- “Tesla, Nikola” is the negative class.
True Positive (TP): | False Positive (FP): |
---|---|
Reality: Text is from Einstein | Reality: Text is from Tesla |
Model: Text is from Einstein | Model: Text is from Einstein |
False Negative (FN): | True Negative (TN): |
---|---|
Reality: Text is from Einstein | Reality: Text is from Tesla |
Model: Text is from Tesla | Model: Text is from Tesla |
We can summarize our “einstein-text-prediction” model using a 2x2 confusion matrix that depicts all four possible outcomes:
A true positive is an outcome where the model correctly predicts the positive class (Einstein). Similarly, a true negative is an outcome where the model correctly predicts the negative class (Tesla).
A false positive is an outcome where the model incorrectly predicts the positive class. And a false negative is an outcome where the model incorrectly predicts the negative class.
Let’s use a probability of 0.5 as our threshold. That means all model predictions with a probability greater than 50% get labeled as being text from Einstein:
comment_classes |>
  mutate(prediction = case_when(
    probability > 0.5 ~ "Einstein, Albert",
    TRUE ~ "Tesla, Nikola"),
    prediction = as.factor(prediction)) |>
  conf_mat(author, prediction)
Truth
Prediction Einstein, Albert Tesla, Nikola
Einstein, Albert 646 82
Tesla, Nikola 68 757
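From the counts printed above we can also derive the overall accuracy at the 0.5 threshold by hand (the numbers differ slightly between runs because of the random train/test split):

# correct predictions divided by all predictions
(646 + 757) / (646 + 82 + 68 + 757)  # ~0.903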
Let’s take a closer look at these misclassifications: false negatives (FN) and false positives (FP). Which documents were incorrectly predicted to be written by Albert Einstein with a predicted probability greater than 80% (false positives)?
FP <- comment_classes |>
  filter(probability > .8,
         author == "Tesla, Nikola") |>
  sample_n(10) |>
  inner_join(books |>
               select(document, text)) |>
  select(probability, text)
kable(FP) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
probability | text |
---|---|
0.8592571 | From experiences of this kind I am led to infer that, in order to be |
0.9115352 | already described, except with the view of completing, or more clearly |
0.8991634 | there is any motion which is measurable going on in the space, such a |
0.8161307 | [Transcriber's note: Corrected the following typesetting errors: |
0.8024522 | principle of the vacuum pump of the future. For the present, we must |
0.8213111 | shown by the following experiment: |
0.8963556 | brilliancy. Next, suppose we diminish to any degree we choose the |
0.8449774 | covered with a milky film, which is separated by a dark space from the |
0.8447254 | medium surely must exist, and I am convinced that, for instance, even |
0.8027681 | velocity--the energy associated with the moving body--is another, and |
These documents were incorrectly predicted to be written by Albert Einstein. However, they were written by Nikola Tesla.
Finally, let’s take a look at texts written by Albert Einstein that the model did not correctly identify (false negatives):
FN <- comment_classes |>
  filter(probability < .3,
         author == "Einstein, Albert") |>
  sample_n(10) |>
  inner_join(books |>
               select(document, text)) |>
  select(probability, text)
kable(FN) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center")
probability | text |
---|---|
0.2932465 | another, shall indicate position and time directly. Such was the |
0.2447074 | This conception is in itself not very satisfactory. It is still less |
0.1485998 | was necessary to surmount a serious difficulty, and as this lies deep |
0.2958725 | potential φ, hence the result we have obtained will hold quite |
0.2572766 | [20] Mathematicians have been confronted with our problem in the |
0.2032739 | any doubt to arise as to the prime importance of the Galileian |
0.0453919 | bring some one a few happy hours of suggestive thought! |
0.1163048 | ideas among themselves. |
0.2196137 | connection in which they actually originated. In the interest of |
0.2389984 | strings to the floor, otherwise the slightest impact against the floor |
We can conclude that the model did a very good job of predicting the authors of the texts. Furthermore, the misclassified texts are quite short, and we can imagine that even a human reader who is familiar with the work of Einstein and Tesla would have difficulty classifying them correctly.