Finding missing patterns in text data

This post is about approach I use to find missing data patterns within data that are exclusively text type. I’ve came up with the idea for this post when I’ve been cleaning a messy dataset where participants provided open ended responses to several questions, these responses were exclusively words. After cleaning the data with stringr package and labeling missing responses as “NA”, I though it would be interesting to see any missing patterns within data. I’ve searched briefly; however, could not find examples that would get me started regarding this type of data, hence I’ve created this example.

The data I am going to use here is simple example I’ve created but it can be generalized to more complex data. I am using several packages here. Some of the mentioned packages are part of the tidyverse which is a set of packages sharing an underlying philosophy of so called tidy data. You can use the whole tidyverse (it’s a lot of packages) but you’ll be fine with using only dplyr package from the tidyverse. For identification of missing data patterns, I am using VIM and mice packages. I encourage you to read a bit about these amazing packages.

Load required Libraries

We start by loading our packages. If you would fancy creating a reproducible script that is possible to keep as the data update, check out the package called checkpoint.

library(dplyr)
library(VIM)
library(mice)

Creating random dataset

Our simple data consists of 5 columns and 360 rows. The data are answers about profession provided by our participants. Normally, there would be some typos but that would not be an issue using our approach. The columns are labeled from A to E and represent different conditions.

set.seed(123)
medical_professions = c("dietitian", "psychologist", NA, "nurse", 
                        "pharmacists", "dentist", "surgeon")

df = as_tibble(replicate(5, sample(medical_professions, 360, rep=TRUE)))
names(df) = c("A", "B", "C", "D", "E")  

Overview of our data

Let overview our simple data first.

head(df)
## # A tibble: 6 x 5
##           A            B           C            D            E
##       <chr>        <chr>       <chr>        <chr>        <chr>
## 1      <NA> psychologist        <NA>         <NA> psychologist
## 2   dentist psychologist     dentist psychologist  pharmacists
## 3      <NA>      surgeon     dentist         <NA>    dietitian
## 4   surgeon         <NA>     surgeon         <NA>        nurse
## 5   surgeon         <NA> pharmacists      dentist      dentist
## 6 dietitian      dentist     dentist psychologist      dentist
tail(df)
## # A tibble: 6 x 5
##             A         B            C           D           E
##         <chr>     <chr>        <chr>       <chr>       <chr>
## 1 pharmacists     nurse  pharmacists        <NA>        <NA>
## 2     surgeon      <NA>  pharmacists        <NA>        <NA>
## 3 pharmacists dietitian psychologist pharmacists pharmacists
## 4        <NA> dietitian    dietitian     surgeon     surgeon
## 5        <NA>   surgeon         <NA> pharmacists pharmacists
## 6     dentist dietitian         <NA>       nurse       nurse
dim(df) # 360 x 5 columns
## [1] 360   5
sum(is.na(df)) # 265 missing values
## [1] 265
mean(is.na(df)) # 15% missing percent
## [1] 0.1472222
mean(!complete.cases(df)) # 53% rows containt at least one missing value
## [1] 0.5333333

VIM Package to show patterns

Assuming our missing data are labeled as NA, we can use a great visualization function “aggr” from VIM package. We will disable the proportions output by specifying “prop = FALSE”, and allow representation of missing values using numbers by indicating “numbers = TRUE”.

library(VIM)
aggr(df, prop = FALSE, numbers = TRUE)

The resulting graph shows number of missing values on left and the pattern of missing values on right, luckily the most common pattern is that all columns are non-missing; however, lot of responses in conditions “B” and “A” seemed to be missing. For example you may have changed the instructions in those conditions or the sample of participant from conditions “B” and “A” thought your experiment is particularly boring.

Either way, we can go even a step further and visualize missing pattern of individual rows with something called shadow matrix.

Shadow matrix

To use shadow matrix (e.g.: R in Action, page 360) means transforming our data into set of logical values. If there is missing value we will assign TRUE, if the value is non-missing we will assign FALSE. In numerical terms, this procedure replaces all values that are missing with 1 and values that are non-missing with 0.

Now we can use function “matrixplot”" from VIM package. All 1 that are missing are represented with black line.

library(VIM)
shadowmatrix = as_tibble(abs(is.na(df)))
matrixplot(shadowmatrix)

You could also visualize specific rows only. For example rows 5 to 10.

matrixplot(shadowmatrix[5:10,])

Displaying the pattern again

While visualizing individual rows as mentioned above is possible, it might be better to use another function, in particular “md.pattern” function from mice package.

This function would not work on the original data that contained strings but since we have transformed it into shadow matrix, it will work.

First of all, the function focuses on showing NA patterns; however, those are labeled as number 1, therefore we need to transfer them to NA.

shadowmatrix.NA = shadowmatrix

Replace all 1 with NA

The safest way how to replace any value as NA is to use “is.na” function from base R.

is.na(shadowmatrix.NA) <- shadowmatrix.NA == 1

Now we have the shadow matrix with NA labels, let’s use the “md.pattern” function.

missing.pattern = as_tibble(md.pattern(shadowmatrix.NA))
head(missing.pattern) # Top rows
## # A tibble: 6 x 6
##       E     C     D     B     A    ``
##   <int> <int> <int> <int> <int> <int>
## 1     1     1     1     1     1     0
## 2     1     1     1     1     0     1
## 3     1     1     1     0     1     1
## 4     1     0     1     1     1     1
## 5     1     1     0     1     1     1
## 6     0     1     1     1     1     1
tail(missing.pattern) # Bottom rows
## # A tibble: 6 x 6
##       E     C     D     B     A    ``
##   <int> <int> <int> <int> <int> <int>
## 1     0     1     1     0     0     3
## 2     0     0     1     1     0     3
## 3     0     1     0     0     1     3
## 4     0     0     0     1     1     3
## 5     1     0     0     0     0     4
## 6    47    50    52    54    62   265

The unlabeled column indicates the number of missing cases in each missing data pattern. Not surprisingly, the first row has no missing cases. Last row indicates summary of missing values in each condition (compare it to visualization created above with “aggr” function). Finally the total number of missing values is showed at the bottom of unlabeled column, it should be same number as the one from “sum(is.na(df))” output.

Identify missing values

Finally, we might decide that rows with more than 3 missing values across all conditions are too much. To see which rows are having 3 or more missing values across all condition, we can use the command below.

which(apply(shadowmatrix, 1, sum) > 3)
## [1]  19 141

That’s all folks. Thanks for reading. Let me know what you think in comments below. Share it if you like it.

P.S.: This is my first post about R on this blog written in R. So, if you have any suggestions on how to improve similar posts, comment below!