Tidytext analysis of 100DaysOfWriting

This post is about a recent challenge I’ve finished on Twitter called #100DaysOfWriting. The challenge itself was created by Jenn Ashworth. It is also about doing a text analysis on the tweets I have produced as part of this challenge. It is both a personal example of what it is like to write a PhD thesis as well as a tutorial into text analysis. Let me know in case you find it helpful, any feedback is also welcomed.

Resources

The following R packages were used in R version 4.0.4 (2021-02-15) to produce the analysis below.

library(renv) # virtual env
library(rtweet) # connect to Twitter API
library(tidyverse) # general cleaning
library(tidytext) # text analysis
library(qdapDictionaries) # dictionaries
library(lubridate) # work with dates
library(showtext) # work with fonts in ggplot2
library(ggdark) # dark themes
library(grid, include.only = "rasterGrob") # to do background images
library(png, include.only = "readPNG") # to use the image
library(ggforce, include.only = "facet_zoom") # fancy ggplot extensions
library(ggrepel) # to repel labels / text
library(flextable) # to make tables
library(textdata) # for sentiment analysis

The blog is based on the following resources:

Code

The following repository hosts the data, R scripts, and renv.lock file that can be used to reproduce the analysis. Some tweaking may be needed. For example, setting connection to Twitter API.

Introduction to 100DaysOfTorture (Writing)

Some time ago I have started the Twitter #100DaysOfWriting challenge. The challenge is pretty self-explanatory, you write, then post a tweet about it, maybe get some reactions from others. It is akin #100daysofcode, and other similar challenges.

Writing for 100 days straight was tough. The challenge helped me to build a routine, but it also helped me to manage my expectations and maybe even become a little more confident writer.

By the end of the challenge, I worked out a plan on days when it made sense to write, and days when to take a break. The writing days were (and remain to be) Monday, Tuesday, Thursday and one day during the Weekend.

When I finished the challenge, I thought about how to share with others how the challenge went. This is how this post came to be. The idea is to introduce the challenge though basic text analysis approaches (formally known as corpus analysis or computerised text analysis) using the series of tweets I made throughout the challenge as “the corpus”.

Writing a PhD thesis is a sinusoid of misery and ecstatic happiness. So, here’s the glimpse of this “joyride”.

Brief Introduction to Text Analysis

Before going into the code, I want to discuss few principles of text analysis. After attempting some of this in my PhD, I can tell that it can be challenging. One reason is that it introduces whole field of scientific vocabulary that feels completely alien at first - let’s go through the basics.

When I say the word “corpus”, it refers to a fancy statement that there’s a bunch of text that is knitted together into what is the dataset to analyse. It is a collection of texts stored on a computer and one standard way is to store the text in a data frame object.

Besides corpus, there’s also “token”. Token refers loosely to a linguistic unit that can be, for example, word, sentence, or paragraph and is usually used similarly to what “observations” or “participants”, or “subjects” are in experiments (but is just a text).

Finally, another useful term is a "sentiment" which refers to what I define as emotionally loaded tokens (typically words, the most basic distinction is to simply differ which tokens are positive or negative).

This is a mere tip of the iceberg. There are also ngrams, lemma, or lexical diversity, and then vast linguistics vocabulary to define various “units” of text or types of analyses. They are incredibly helpful when it comes to thinking about text but are above and beyond a blog post like this.

The analysis presented here uses the following workflow:

  1. Prepare Twitter data into a corpus

  2. Prepare tokens

  3. Analyse

    1. Frequency analyses and visualisations

    2. Sentiment analyses and visualisations

Preparing Corpus

The data I work with come from Twitter. What is shown below is sanitised version of the full extract from Twitter API. To produce this data, I recommend using “{rtweet}” package comes with great documentation and vignettes to guide through that process.

I am going to show the data as their columns. The first two columns were looking at when and who has created the tweet.

## # A tibble: 6 x 2
##   created_at          screen_name
##   <dttm>              <chr>      
## 1 2021-02-02 11:50:44 m_cadek    
## 2 2021-01-31 23:14:55 m_cadek    
## 3 2021-01-30 23:14:48 m_cadek    
## 4 2021-01-29 19:31:47 m_cadek    
## 5 2021-01-28 23:41:41 m_cadek    
## 6 2021-01-28 23:41:40 m_cadek

The next column was the text in the tweet. This is what will be the key focus of this analysis.

## # A tibble: 6 x 1
##   text                                                                          
##   <chr>                                                                         
## 1 #100DaysOfWriting (Day 100). I simply enjoy that I managed to complete and se~
## 2 #100DaysOfWriting (Day 99) Draft submitted to the supervisors for the review.~
## 3 #100DaysOfWriting (Day 98) Voila. Eight hours later and I finalised the intro~
## 4 #100DaysOfWriting (Day 97) Taking a break after intense week of writing! Howe~
## 5 #100DaysOfWriting (Day 96) Nearly there. Tired but trying my best to make it ~
## 6 #100DaysOfWriting (Day 92 - 95) So busy that I didn't even though of putting ~

The last two columns contain information about width of text and hashtags used as part of the tweet.

## # A tibble: 6 x 2
##   display_text_width hashtags        
##                <dbl> <chr>           
## 1                280 100DaysOfWriting
## 2                278 100DaysOfWriting
## 3                278 100DaysOfWriting
## 4                155 100DaysOfWriting
## 5                279 100DaysOfWriting
## 6                261 100DaysOfWriting

What should be obvious from the columns above is that they are already stored in a data frame. In other words, this is the corpus. This may feel as cheating but it is often not that easy to get into this stage.

Preparing corpus can be tedious because at the start, there is no formal order to how data are organised. All text here is already presented in one row per tweet and this can be treated as a corpus as it organised and clear. The text I have worked with in past was stored in “.pdf” files, images, or one large Word document and these are much more challenging to work with. While Word document is a text stored in the computer, it is not a corpus because it is not organised into a data frame or other format that could be used in text analysis.

Preparing Tokens

Now, it would be lovely if I could simply take the text in the data and do the rest of the analysis, however, further preparations are required. The approach taken here is to further reorganise the data into tokens.

This means I have to define what is a token (word, sentence, or longer?), what words to remove (stop words? numbers? others?), whether to pair tokens into bigrams or ngrams, or compound tokens (for example, the word “Local” and “Government” has to be treated as “Local Government”).

It is possible to do some of these tasks by hand but not feasible nor practical to do it all by hand for large volumes of data. To facilitate the process here, I will use some custom-functions to clean text, dictionaries to capture meaningful words, and {tidytext} library to tokenise the data and remove stop words.

I will use {tidytext} package throughout this post as it is fairly straightforward and beginner-friendly. However, this is actually not my go-to library for corpus analysis. For large projects I recommend {quanteda} pacakge/project which I have been using for most of my text analysis in my dissertation. If there’s any interest in covering this package, I would be happy to write an introduction around this as well.

Let’s proceed to preparing the tokens for text analysis. There will be the following steps.

Step 1 Flattening Text into Tokens

Preparing tokens requires to “extract” a token from each unit of text in the corpus and store it as one row per token. This is somewhat similar to flattening a JSON file. This would be tedious to do manually but it is fairly straightforward using the unnest_tokens() function from {tidytext} package. See below a head of the output from this function (showing only relevant columns).

# tidytext approach to get tokens - as sentences

cleaned_tokens <- unnest_tokens(
     tbl = tidy_twdata, # data where to look for input column
     output = sentences, # output column
     input = text, # input column
     token = "sentences") # definition of token

head(cleaned_tokens[c("sentences")]) # column of tokens
## # A tibble: 6 x 1
##   sentences                                                                     
##   <chr>                                                                         
## 1 #100daysofwriting (day 100).                                                  
## 2 i simply enjoy that i managed to complete and send full draft.                
## 3 there is still a lot of editing to do in the future, howerver, now the projec~
## 4 thanks if you followed this thread!                                           
## 5 fin.                                                                          
## 6 #phd https://t.co/sqntgdooe3

In this case, tokens are defined as sentences (token) which are taken from text variable (input) stored in data frame or tibble (tbl) and flattened into a new data frame (tibble) that stores the tokens in sentences (output) column.

Aside from “sentences” the function can also create the following types of tokens:

“words” (default), “characters”, “character_shingles”, “ngrams”, “skip_ngrams”, “sentences”, “lines”, “paragraphs”, “regex”, “tweets” (tokenization by word that preserves usernames, hashtags, and URLS ), and “ptb” (Penn Treebank)

Step 2 Cleaning Tokens

After creating a new data frame that organised text into tokens, I should ensure that the tokens are sensible. In this case I am cleaning the text after it has been tokenised but it would be fine to clean it beforehand too and tokenise later or combine further cleaning arguments into the tokenisation function. This is possible by supplying further arguments into the function, for example unnest_tokens(..., strip_punct) in {tidytext} or tokens(..., remove_punct = TRUE) in {quanteda} package I mentioned earlier.

I think it is simpler to separate this process to illustrate it but in practice it should be combined to save typing and make the code more concise.

I will illustrate three approaches to clean the tokens. Using custom functions, packaged functions, and dictionaries. They are not exclusive steps, usually they complement each other.

Using custom functions

A custom function can be used to clean tokens. How such function could look? The following example is a low-dependency custom function build with {base} R that uses gsub function.

The function consists of several arguments but the key one is gsub(pattern, replacement, x, ignore.case = FALSE) . The pattern is what the function tries to look for, for example “apple”. the replacement is obvious, in this case I simply replace it with nothing, i.e., "". However, this means that empty character vectors with "" are created and they will have to be removed later on, for example by converting them to explicit NA. The x refers to place where to look for the previous arguments, usually a character vector. Finally, ignore.case tells function if it should differ between “Apple” and “apple” or treat them as identical.

Fully developed function to clean text data could look like this, look at corresponding comments to see what each line does.

# create token cleaner function
clean_tokens <- function(raw_token, empty_as_na = FALSE){
        # remove specific word - #100daysofwriting
        raw_token <- gsub("(\\b100daysofwriting\\b)", 
                          x = raw_token, ignore.case = TRUE, 
                          replacement = "")
        # remove URLs
        raw_token <- gsub("\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)", 
                          x = raw_token, 
                          replacement = "")
        # remove HTTPs
        raw_token <- gsub("http\\S+\\s*", 
                          x = raw_token, 
                          replacement = "")
        # remove non printable or special characters such as emoji
        raw_token <- gsub("[^[:alnum:][:blank:]?&/\\-]", 
                          x = raw_token, 
                          replacement = "") 
        # remove leftover of non printable or special characters such as emoji
        raw_token <- gsub("U00..", 
                          x = raw_token,
                          replacement = "") 
        # remove all digits
        raw_token <- gsub("\\d+",
                          x = raw_token, 
                          replacement = "") 
        # remove .com and .co after URLs
        raw_token <- gsub("\\S+\\.co+\\b", 
                          x = raw_token, 
                          replacement = "") 
        # remove all punctuation
        raw_token <- gsub("[[:punct:]]", 
                          x = raw_token, 
                          replacement = "") 
        # remove trailing whitespaces
        raw_token <- trimws(x = raw_token, which = "both") 
        # remove double space
        raw_token <- gsub("[[:space:]]+", 
                          x = raw_token, 
                          replacement = " ") 
        # remove specific word - "day"
        raw_token <- gsub("(\\bday\\b)|(\\bDay\\b)", 
                          x = raw_token, 
                          replacement = "")
        # replace empty strings as NA or keep them there
        # (useful if you need to then drop all NAs)
        if (empty_as_na == TRUE) {
                raw_token <- gsub("^$", 
                                  replacement = NA, 
                                  x = trimws(raw_token))
                tidy_token <- raw_token
                # explicit return in IF
                return(tidy_token) # this is optional but it makes it more clear
                } # End of if statement
        else {
        tidy_token <- raw_token
        # explicit return in if was not triggered
        return(raw_token) # this is optional but it makes it more clear
                } # End of else statement
        } # End of function

Note that it matters what is done first. For example, removing all whitespace first would mean words are compounded, e.g. “apple pear” becomes “applepear”, therefore, it has to be one of the last steps.

I am using the function to clean sentence tokens, but to illustrate how it works, the function could be then used in the following context.

clean_tokens("This is https://twitter.com sen###124444tence.", empty_as_na = TRUE)

The result would be turning the sentence from: “This is https://twitter.com sen###124444tence.” into the: “This is sentence”.

Using package functions

Another way to clean tokens is to use some function developed by authors of packages for text analysis. A common step is to remove stopwords from tokens, this means getting rid of all words such as “the”, “to”, or “is”. It is typically implemented into the packages in some fashion. For example, {tidytext} offers the lexicon solution to remove stopwords as part of stop_words object that is passed into anti_join() function (which is part of {dplyr}). The {quanteda} offers its own function stopwords(kind = quanteda_options("language_stopwords")) which is passed into tokens_remove(tokens, c(stopwords("english")). This offers reproducible way to clean the tokens and has a potential to even extend further as there are packages that collect stopwords for other languages (e.g., {stopwords}).

I was using the stop_words function, a simple example is below.

tidy_twdata %>%
        # tidytext approach to get tokens - as words
        unnest_tokens(word, 
                      text,
                      token = "words") %>%
        # remove stopwords such "this", "or", "and"
        anti_join(stop_words)

Using dictionaries

Finally, the last method I have utilised were dictionaries. These can be really helpful as dictionary is typically a vector, list, or data frame that contains words that fall under the same category.

However, the category can be specific or broad such as the one I used for cleaning. An example of more specific category is a sentiment dictionary (containing words related to positive / negative feelings). An example of broad category can be list of words commonly used in English language.

As I wanted to ensure that no strange words, acronyms or nonsensical parts left out from cleaning hashtags and URLs were present, I used the {qdapDictionaries’} Augmented List of Grady Ward’s English Words and Mark Kantrowitz’s Names List. This is a dataset augmented for spell checking purposes. Thus, it should filter out words that are not grammatically correct.

Below is an example how it was used. Notice the filter(word %in% english_words_dictionary) this is where I tell R to filter words that are occurring in the dictionary. I treat anything that does not occur in the dictionary as meaningless.

# load dictionary
english_words_dictionary <- qdapDictionaries::GradyAugmented

# use dictionary
cleaned_tokens <- tidy_twdata %>%
        # tidytext approach to get tokens - as words
        unnest_tokens(word, 
                      text,
                      token = "words") %>%
        # remove stopwords such "this", "or", "and"
        anti_join(stop_words) %>%
        # keep only words that are in the dictionary
        filter(word %in% english_words_dictionary)

While this can be an extremely efficient way to quickly clean text, it may also remove less standard phrases. To avoid this, a dictionary can be customised, it is usually as simple as removing certain words from a data frame.

Presenting Cleaned Tokens

This section shows the output after applying the approaches mentioned above. While these approaches are not foolproof, I feel it can get anyone fairly far depending on how much do they wish to fiddle with their cleaning functions or dictionaries.

Two outputs were produced, first was the data frame with tokens as words.

## # A tibble: 6 x 5
##   created_at          screen_name display_text_width hashtags         word    
##   <dttm>              <chr>                    <dbl> <chr>            <chr>   
## 1 2021-02-02 11:50:44 m_cadek                    280 100DaysOfWriting simply  
## 2 2021-02-02 11:50:44 m_cadek                    280 100DaysOfWriting enjoy   
## 3 2021-02-02 11:50:44 m_cadek                    280 100DaysOfWriting managed 
## 4 2021-02-02 11:50:44 m_cadek                    280 100DaysOfWriting complete
## 5 2021-02-02 11:50:44 m_cadek                    280 100DaysOfWriting send    
## 6 2021-02-02 11:50:44 m_cadek                    280 100DaysOfWriting draft

The second was the data frame with tokens as sentences.

## # A tibble: 6 x 5
##   created_at          screen_name display_text_wid~ hashtags  sentences         
##   <dttm>              <chr>                   <dbl> <chr>     <chr>             
## 1 2021-02-02 11:50:44 m_cadek                   280 100DaysO~ I simply enjoy th~
## 2 2021-02-02 11:50:44 m_cadek                   280 100DaysO~ There is still a ~
## 3 2021-02-02 11:50:44 m_cadek                   280 100DaysO~ Thanks if you fol~
## 4 2021-02-02 11:50:44 m_cadek                   280 100DaysO~ Fin.              
## 5 2021-02-02 11:50:44 m_cadek                   280 100DaysO~ Phd.              
## 6 2021-01-31 23:14:55 m_cadek                   278 100DaysO~ Draft submitted t~

These two outputs were saved as token_wordsand token_sentences and are used in the next section that computed several aggregations.

Aggregating Tokens

After cleaning tokens, the real text analysis can commence. This can be presenting a simple frequency of words to more complex analyses such as a topic model or applying classifiers to text, for example, a Naive Bayes classifier.

I am keeping things simple and present three common analyses. Below each bullet point is an example of code (full code available here) to produce this type of aggregation and short description of what was done. In the visualisation, I present further analyses that build or use combination of these three types. This should illustrate that even relatively simple analytical approaches can be quite powerful.

  • Frequencies

    • Everyone knows frequencies. They are simple to make once data frame of tokens is available. The code below takes data frame, then count of word column with sorting. This produces an aggregate with n column that counts number of occurrences. These are these are summed to totals and relative frequencies are produced as n / sum (n). The reorder ensures that words are ordered by their occurrence.
token_words %>%
     count(word, sort = TRUE) %>%
     mutate(
          total = sum(n),
          word = reorder(word, n),
          proportion = n / sum(n))
  • Term frequency–inverse document frequency (TF-IDF)

    • The problem with simple frequencies is that they do not account for features of text that are unique and important. Simple approach to use to do is either rank and weight frequencies or produce something known as TF-IDF. This measure (in theory) takes into account total proportion of word and inverse document term proportion. The document can be anything that further groups the tokens. In this case, my “documents” were months of the challenge.

    • For example, word “writing” occurred in August 2020 8 times across 1279 total occurrences, therefore the TF = 8/1271 = 0.0063. The IDF is log() of ratio of all documents / documents containing a given word. In this case “writing” occurred in 6 documents out of 8 (documents = months), hence IDF = log(8/6) = 0.29. Finally, the TF-IDF = 0.29*0.0063 = 0.0018. The formula works in the way that rare words have higher importance (i.e., larger tf-idf means higher importance) and words that are very frequent are weighted down. IDF is therefore low for high occurring words such as stop words.

-   
token_words %>%
     mutate(
          month_date = paste0(year(created_at), " ", months.Date(created_at)),
          month_date = factor(month_date, levels = date_ym_fct,
                              date_ym_fct, ordered = TRUE)) %>% 
     count(word, month_date, sort = TRUE) %>%
     mutate(
          total = sum(n),
          word = reorder(word, n),
          proportion = n / sum(n)) %>%
     bind_tf_idf(word, month_date, n) %>%
     arrange(desc(tf_idf))
  • Sentiment analysis

    • Sentiment analysis is really just an extension of a dictionary approach. The idea is to associate each words with certain category of sentiment, for example “good” will be categorised as positive. The way it works in tidytext is that each words is inner_join to a group in sentiment dictionary. It can then be counted how many positive words occurred or what was the ratio of positive / negative.
bing <- get_sentiments("bing")

sentiment_token_overall <-  token_words %>%
        inner_join(bing) %>%
        count(word, sentiment, sort = TRUE) %>%
        mutate(
             rownumber = row_number(),
             total = sum(n),
             word = reorder(word, n),
             proportion = n / sum(n))

Presenting Text Analysis using ggplot2

The last section will present the text analysing of the corpus. I aim to deliver these results by presenting less text and more visuals, therefore, I will comment scarcely. If you want to see how each visual was produced, please see the following code.

I will present distribution of the token, frequencies, TF-IDF, and conclude with sentiment analysis.

Distribution of Tokens

The first thing I look at is histogram showing a distribution of how many times words occurred in the corpus. The histogram shows that words typically occurred only once and only on rare occasions occurred more than that. This is fairly typical for analysis of text (many words occurs few times, and few occur many times).

This is often subject of further research as it is interesting to look at relationship between word frequency and its rank. Below, I am illustrating this using a classical example known as Zipf’s Law.

Zipf’s law states that the frequency that a word appears is inversely proportional to its rank.

Frequencies of Word and Sentence Tokens

An elegant way to present data from text analysis is to show how many times certain tokens (words or sentences) occurred, what percentage of corpus did a particular word cover, and illustrate examples of common and rare words.

I show below frequency and relative frequency of word tokens, and sample of sentences in relationship to their length.

Frequency of common words (top 15) should not be surprising to anyone who knows the corpus. The most common word (after removing some stop words and “day” word) was “writing”. This is helpful to get a quick sense of what is in the corpus.

Relative frequency then shows the information above as relative to the whole of corpus.

Visualising examples of sentences was little difficult as I had to find a creative way to present the sentence. I chose using only a sample of sentences and present them as association with their length. The lengths of sentences were binned and these bins were then used to sample.

The above can also be shown as a tabular output. Which might be a little more user friendly.

TF-IDF

I covered what TF-IDF is above. Here I am presenting the TF-IDF in context of months. First, I show the relative frequency (%) of words per each month. This is simply extending the frequencies by grouping the tokens by month.

This starts to show much more interesting findings. For example, months were not consistently just about writing. As I was nearing the end of the challenge, I was finishing discussion, and eventually I went onto editing. This is something worth stressing in text analysis. Any natural groupings are in my opinion far better way to interpret results than throwing tokens into a topic modelling or similar classification method. This is because without some prior sense of what data is about, the result of these complicated models is usually impossible to interpret.

TF-IDF then tries to account for words that are important relative to other months. In this context, words such as “writing” are essentially treated as obvious enough to not be worth analysis (the same way an analyst would remove stop words). The result is drastically different bar charts showing some other words that defined my writing experience.

Sentiment

One of very popular and perhaps overused text analytical methods is sentiment analysis. The idea of sentiment analysis is that it provides fast and standardised way of associating words with emotions. This is usually done by using a dictionary approach to join defined words to words in corpus. It is used to evaluate many words and save analyst’s time but it may not be suitable for all corpora and sometimes it fails in comparison to qualitative methods where analyst categorises the words on his or her own. The sentiment analysis below will show three charts.

First I present a relative frequency of the most common negative and positive sentiment. These are interesting words to me as it shows how I felt about the challenge of writing my thesis at times or what was pleasant.

I also felt it was important to provide some idea of how sentiment changed over the time. Surprisingly, at the beginning the sentiment was on average more negative. Maybe I was motivated but realised it is also challenging. This is very rough idea and you can see that these vary a lot in their “errors” (the bar shows standard deviation around the mean).

Finally, I tried to show which words were prominent each month. The x-axis dates, y-axis number of words, and around the points are sample of those words. This overlaps with the previous graphs in terms of the middle part of the challenge containing the most positive sentiment.

Conclusion

This concludes the text analysis and my writing challenge. It was interesting to me to see the challenge presented like this and I hope the potential readers may find this post useful. Please contact me if you have any feedback.


Martin Cadek

5088 Words

2021-03-21 00:00 +0000

comments powered by Disqus