This is the second part of a two-part post; the first part is accessible via this link to part I. Please read the first part for the background information and the relevant Tidy Tuesday repository. If you just want to download the R scripts, please use this repository on GitHub.
What is the blog post about?
In the second part, I am going to prepare the data for analysis and show an example of exploratory clustering using the Tidymodels framework. The analysis is inspired by FiveThirtyEight and the tidymodels tutorial, which is great reading if you want to get into this method.
This part will be less wordy and feature more visualisations. While I will show some of the code here, I deliberately avoided including all of it; instead, I encourage you to download the full script - especially if you want to see some of the coercion and the ggplot2 code behind the graphs.
Feel free to reuse any of the parts with appropriate attribution.
These packages were used throughout both part I and part II of the project. Please see the previous link to part I for further comments on reproducibility.
tidyverse, tidymodels, here, Cairo, colorspace, janitor, showtext, patchwork, ggthemes, lubridate, flextable, tidytext, arrow, klaR, tidytuesdayR, ggimage, rsvg, conflicted, viridis, and ggrepel.
For a reminder, I am working with the following dataset. This is the version of the dataset before further cleaning and tidying.
Choosing the algorithm
The method applied here is called k-means (p.460). This method relies on numerical variables (as it requires the computation of a Euclidean distance matrix), and although it is possible to transform categorical or ordinal variables (via dummy coding / one-hot encoding), such an approach is discouraged. That said, there are some alternatives; for example, see this discussion, or this and this.
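To see why one-hot encoding sits awkwardly with Euclidean distance, consider a toy illustration (base R, my own example - not from the original analysis): any two distinct levels are always exactly the same distance apart, so the distance carries no information about how similar the levels actually are.

```r
# One-hot encodings of three hypothetical ingredient levels
gin     <- c(1, 0, 0)
vodka   <- c(0, 1, 0)
tequila <- c(0, 0, 1)

# Euclidean distance between any two distinct levels is always sqrt(2)
sqrt(sum((gin - vodka)^2))    # 1.4142...
sqrt(sum((gin - tequila)^2))  # identical distance
```

This is one reason dedicated approaches such as k-modes or Gower distance exist for categorical data.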
I used k-means because it is a simple ML algorithm and, at least with my setup, relatively resource-light. Alternatives such as k-modes or clustering on a Gower distance matrix are suitable when their assumptions fit, but I found them slow (I wonder whether other folks have had better experience optimising these techniques on high-dimensional categorical data?).
Tidying & data manipulation
The tidying I’ve done is discussed only briefly here; it is included in the full script in the GitHub repository - script.R. While the data were relatively clean to begin with, I still needed to engineer features and prepare the data for the k-means clustering analysis.
Here is the summary of what I did:
Trimmed and cleaned all character variables
Extracted numerical quantity from the measure volume
Converted quantity from the measure column to millilitres (where it was sensible)
Standardised character vectors and their letter case
Standardised the date variable
Removed any missing or corrupt rows
Removed the drink image column
Created variables with collapsed factor levels for a number of variables
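To give a flavour of these steps, here is a condensed sketch with hypothetical column names and base-R equivalents (the full script uses stringr/dplyr):

```r
# Trim whitespace, standardise letter case, and drop missing rows -
# a miniature version of the character cleaning described above.
clean_chr <- function(x) tolower(trimws(x))

drinks <- data.frame(ingredient = c("  Gin ", "Lemon Juice", NA))
drinks$ingredient <- clean_chr(drinks$ingredient)
drinks <- drinks[!is.na(drinks$ingredient), , drop = FALSE]

drinks$ingredient  # "gin" "lemon juice"
```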
Once tidied, the data looked like this (some columns were removed due to space):
The tidier data allowed me to show additional visualisations that were not possible previously.
I’ll start by re-introducing some of the variables.
First, the glass type. Not much has changed at first glance after tidying (I merged and cleaned some levels). The variable seems to be skewed towards certain glass types (such as collins). I won’t be working with this variable - although it would be possible.
The following graph confirms that alcoholic drinks were the most prevalent. I focused specifically on those (by filtering out non alcoholic and optionally alcoholic drinks). The ordinary drink was also a common type of drink.
Cleaned ingredients show that vodka or gin lovers would be lucky with this dataset. I focused on the gin ingredient in the analysis that follows.
I also tried to see if there was some pattern in the order in which the ingredients are added (there was). Notice that the signature alcohol was usually added first (by signature I mean it being important or particularly noticeable in the drink). That was followed either by another important alcohol ingredient or a variety of juices. Added later were the aesthetics, like colouring.
If you are a bartender and you start with sugar, you may be unorthodox.
The measure variable took the most effort to prepare because of the way it was stored initially. The format, e.g. 1 oz of lime juice, was convenient for human readability but not for analytical purposes. My approach was to extract all numeric values using regex, then extract the unit values such as oz, and finally convert the values to a sensible measurement scale, millilitres (I still fail to grasp where the beauty lies in the imperial system of units).
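A minimal sketch of that extraction, with a hypothetical helper name and deliberately simplified patterns (unlike the full script, this toy version does not handle fractions such as 1 1/2 oz):

```r
# Extract the leading numeric quantity and the unit from strings such as
# "1 oz of lime juice", then convert to millilitres (1 oz taken as ~29.6 ml).
parse_measure_ml <- function(x) {
  qty  <- as.numeric(sub("^\\s*([0-9.]+).*$", "\\1", x))
  unit <- ifelse(grepl("\\boz\\b", x), "oz",
          ifelse(grepl("\\bml\\b", x), "ml",
          ifelse(grepl("\\bcl\\b", x), "cl", NA)))
  to_ml <- c(oz = 29.6, ml = 1, cl = 10)  # assumed conversion factors
  unname(qty * to_ml[unit])
}

parse_measure_ml("1 oz of lime juice")  # 29.6
```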
The first graph shows the volumes converted to millilitres (via case_when()); the second shows the unconverted values as standardised z-scores. Millilitres were used going forward, because standardising alone does not make 8 litres comparable to 8 millilitres (notice the difference in density and outliers).
I also compared z-standardised millilitres against all other units. The difference was quite striking in the case of gin drinks - this confirmed that where direct conversion can be used, it is superior to scaling across different units.
Data nesting and unnesting
Now that the data were tidied, I needed to extract only drinks containing gin while keeping all other variables (other ingredients, glass, etc.). This was a perfect use case for nested rows/columns, in my opinion. After the columns were nested, I used the functional programming features of the purrr package to create a column confirming that a drink contains gin, and then filtered based on this column.
The following script does what I have just described.
# Nest alcoholic and gin drinks
system.time(
  data_gin_nested <- data_tidy %>%
    # Remove non-alcoholic and other drinks (the first filter)
    filter(alcoholic == "alcoholic") %>%
    # Nesting
    group_by(drink, id_drink, row_id) %>%
    nest() %>%
    # Use map() to flag drinks containing gin
    mutate(
      gin_true = map(data, select, ingredient),
      gin_true = map(gin_true, filter, ingredient == "gin"),
      gin_true = map(gin_true, pull, ingredient),
      gin_true = map(gin_true, str_detect, pattern = "gin"),
      gin_true = map_lgl(gin_true, any)
    ) %>%
    # Include only drinks containing gin (the second filter)
    filter(gin_true)
)  # Measure elapsed time
I also tried to extend the data with the recipes package’s ability to create dummy variables for all ingredients (via step_dummy()). The idea was then to convert the result to a distance matrix and use the Gower algorithm or k-modes, but that ended up being a big no. The final matrix was far too large to be meaningfully processed, and the data.frame itself took 2 GB of memory.
I wouldn’t recommend this route unless you can use cloud servers to do the computation or somehow optimise it. Even then, it is not guaranteed that clustering ingredients would uncover anything.
Zoom-in on gin drinks
Once I had filtered out non-gin drinks from the nested data, I unnested it again and started looking at the data through more targeted visualisations.
I liked the idea of using some form of complexity measure. The best I came up with was simply to measure complexity as the number of ingredients required. The following visualisations show this.
I also described the common ingredients in the drinks.
Both graphs show interesting features of gin drinks. That is, most gin drinks have 3-4 ingredients (the Complexity graph; note to self: try the Radioactive Long Island Iced Tea), and lemon juice, grenadine, and dry vermouth (the Common ingredients graph) are popular choices of ingredients - but not 7-up.
If you are concerned about the gin content of the drinks, then hold my drink. Here’s a graph using the measure (ml) to show which drinks you could go for. It doesn’t account for the total volume of the drink - that said, if you fancy a lot of gin, I suspect the Vesper, Jitterbug, Derby, Abbey Martini, and Ace might be your friends.
Data aggregation (feature engineering)
Now for the final part of the analysis. Before running the algorithm, I had to perform a few simple data aggregation steps: I counted the total number of ingredients, the approximate total volume (ml) of each drink, the approximate average volume per ingredient and, finally, the approximate proportion of gin in each drink.
When I say “approximate”, I mean that the measurement unit (ml) was converted from the raw data with custom-made functions, so some differences could occur. For example, to convert a shot I had to define a shot as 30 ml. This took a little trial and error, but the graph of gin volume in drinks (shown above) was good guidance for spotting any outliers produced systematically by the conversion.
These aggregated (feature-engineered) variables formed the basis of the distance matrix, computed using Euclidean distance (the kmeans function in R handles this). The sum of squared Euclidean distances is what defines the within-cluster variation. Other algorithms (such as hierarchical clustering) require the user to compute the distances themselves, as different distances fit different contexts.
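To make the within-cluster variation concrete, here is a quick base-R check (my own toy data, not the drinks dataset) that the withinss reported by kmeans() really is the sum of squared Euclidean distances from each member to its cluster centroid:

```r
set.seed(7)
x  <- matrix(rnorm(20), ncol = 2)
km <- kmeans(x, centers = 2, nstart = 5)

# Recompute the within-cluster variation by hand for each cluster
manual <- sapply(1:2, function(k) {
  members <- x[km$cluster == k, , drop = FALSE]
  sum(sweep(members, 2, km$centers[k, ])^2)
})

all.equal(unname(manual), unname(km$withinss))  # TRUE
```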
The following script prepared the final dataset:
aggregated_gin <- data_gin_unnested %>%
  group_by(drink) %>%
  # Create:
  # * sum of ingredients,
  # * approximate sum of volume,
  # * approximate average ingredient volume, and
  # * approximate proportion of gin.
  # It is approximate because the ml unit was computed only roughly.
  summarise(
    sum_ingredients = n(),
    aprx_sum_volume_ml = sum(measure_ml, na.rm = TRUE),
    aprx_avg_ingredient_volume_ml = mean(measure_ml, na.rm = TRUE),
    aprx_prop_of_gin_ml = case_when(
      ingredient == "gin" ~ measure_ml / aprx_sum_volume_ml
    )
  ) %>%
  arrange(-aprx_prop_of_gin_ml) %>%
  ungroup() %>%
  drop_na() %>%
  # Save drink names as rownames to keep the labels;
  # kmeans() accepts only numerical values.
  column_to_rownames(var = "drink")
Exploratory clustering with k-means
The clustering conducted here was exploratory in the sense that I did not know a suitable k (number of clusters) in advance.
A good indicator of whether clustering could be worthwhile is to visualise the data and see if some points form distinctive groups.
The visualisation below shows the aggregated data, and indeed some points were fairly close together while being far from others. There could be something there.
Running k-means in the tidymodels framework was very satisfying. The code shown below implements a pragmatic approach (ideally, the data would also be split into training and test sets). As I did not know the number of clusters in advance, I specified 1 to 8 (k = 1:8) and then used the map functions to run the algorithm eight times across the entire dataset. The map functions saved the result of each run as a list column.
In the next step, I use three key broom package functions to extract data about the models: glance() (extracts summary statistics for the whole model), tidy() (summarises per-cluster statistics), and augment() (adds the predicted classifications to the original dataset).
The way I think of glance() is that it shows summary statistics and diagnostics from the perspective of the whole model - not of clusters or individual observations. With tidy(), I go a level down: each run of the model produced several clusters, and for each cluster there are specific summary statistics. Finally, with augment(), I get the observation-level data: the predicted cluster for each separate drink. This is the finest level of detail.
Hopefully, this explains sufficiently what was going on under the hood. The beauty of tidymodels is that it all happened at once in a unified approach - imagine running eight different models.
# Run the k-means algorithm on the aggregated data
kmeans_summary <- tibble(k = 1:8) %>%
  mutate(
    # Run the cluster analysis for one to eight clusters
    kmeans = map(k, ~ kmeans(aggregated_gin, centers = .x)),
    # Summary per cluster
    kmeans_tidy = map(kmeans, tidy),
    # Summary of key whole-model statistics
    kmeans_glan = map(kmeans, glance),
    # Clusters joined with the original data
    kmeans_augm = map(kmeans, augment, aggregated_gin)
  )

# glance() extracts summary stats for the models
clusters_statistics <- kmeans_summary %>%
  unnest(cols = c(kmeans_glan))

# tidy() summarises per-cluster stats
clusters_summary <- kmeans_summary %>%
  unnest(cols = c(kmeans_tidy))

# augment() adds predicted classifications to the original data
data_predicted <- kmeans_summary %>%
  unnest(cols = c(kmeans_augm))
To understand these data, I use visualisation. That is the most intuitive approach (aside from also looking at the statistics outputs).
First, I wanted to show the so-called model-in-data view, where the predicted clusters are plotted over the data.
Next, each cluster has a centre, or centroid - the point with which the cluster’s observations are associated - as shown in the graph below.
Then, I manually filtered the drinks close to those centroids and labelled them. This gave me an idea of a good representative of each cluster - usually the drink closest to the centre.
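The “closest to the centroid” step can also be automated. Here is a self-contained base-R sketch on toy data (the data and variable names are mine, not the drinks dataset):

```r
set.seed(1)
m  <- matrix(rnorm(40), ncol = 2)
km <- kmeans(m, centers = 2, nstart = 5)

# For each centroid, find the row with the smallest squared Euclidean distance
nearest <- sapply(seq_len(nrow(km$centers)), function(i) {
  which.min(colSums((t(m) - km$centers[i, ])^2))
})

nearest  # row indices of the most representative observations
```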
Choosing the optimal k
Several sophisticated methods can be used to diagnose a k-means cluster analysis. Particularly intuitive is the so-called elbow method, which plots the total within-cluster sum of squares across the iterations (the top-level glance() output). Following this method, I could settle on the number of clusters where the total within-cluster sum of squares is minimised and pick that iteration. However, I am looking for a drop, or elbow-like change, rather than the lowest value per se, as I also want a reasonable number of clusters (more is not merrier) - parsimony.
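In the analysis this diagnostic comes from the unnested glance() output; a self-contained sketch of the same idea on simulated data (base R only) looks like this:

```r
# Elbow method: total within-cluster sum of squares for k = 1..8
set.seed(42)
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 5), ncol = 2))

wss <- sapply(1:8, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The sharp drop from k = 1 to k = 2 (two simulated groups) is the "elbow"
plot(1:8, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```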
Judging from the data visualisation shown before, I would go for either two or three clusters, as the change beyond that point was smaller.
I hope you enjoyed reading this analysis of a Tidy Tuesday dataset. Hopefully, it piqued your interest in the k-means algorithm and tidymodels - the framework is elegant and powerful.
I also illustrated how to conduct this type of analysis from initial data exploration to the final result (part I to part II). You may wonder what the interpretation is; I leave it up to you to determine what defines each cluster.
Thank you for reading my blog post. Please feel free to let me know any comments or feedback. Feel free to also reuse any of the parts with appropriate attribution.