Tidy analysis of cocktails - Part I - EDA
Background story about carrots and parsnips
Some time ago, I wanted to use machine learning, so I learned to use caret, the package developed by Max Kuhn. Caret was my choice (and for some problems I would use it again) because it offers pretty much every ML algorithm I can think of - it is very comprehensive. It also makes it possible to use these algorithms in a unified and consistent way. This was also the first time I realised how useful it is to have a common framework or package “universe”.
However, there are some downsides when you wish to use caret:
First, the package relies on a lot of dependencies, and installing and setting them up in a reproducible environment can be a pain. I think the author did not initially expect the package to become so comprehensive.
Second, if your machine or internet connection is not the fastest, installation is going to take a while.
Third, some packages simply remain quirky and difficult to use through the caret interface. Lastly, speed: caret is not the fastest package.
There are more niche issues which depend on your domain, but I think these are the key downsides. They make the package somewhat inconvenient at times, but it is nothing you cannot resolve if you still want to use it. That said, I was very excited when I learned that parsnip, a package similar to caret, was being developed. Like caret, parsnip promises consistency across various modelling packages in R. Parsnip does this faster and within a more unified interface (universe) of packages called Tidymodels.
What is this post about?
The post is about two things. First, it has been some time since I attempted Tidy Tuesday. Second, it has been on my to-do list to familiarise myself with Tidymodels and a little bit of parsnip.
The post has two parts. In the first, I will use a Tidy Tuesday dataset and show an example of basic exploratory data analysis (EDA). I will do some data manipulation and coercion using the Tidyverse and prepare the data for the second part. I will also develop what I will call a research question for this dataset.
In the second part, I will take the data and show a basic example of clustering using the Tidymodels framework.
I think this post may illustrate how a question or objective of an analysis is developed through EDA, along with some basic features of the aforementioned packages (plus hopefully some clustering).
If you don’t want to follow along and just want to download the R scripts, please use this repository on GitHub and open the script.R file. The second part is accessible via this link to part II.
Feel free to reuse any of the parts with appropriate attribution.
Cocktails data from Tidy Tuesday 22 (2020)
The dataset I am using here is available from Tidy Tuesday 22 (26/05/2020) and describes various cocktails and their contents. I was inspired by the idea of clustering cocktails published on FiveThirtyEight and thought this would be the perfect opportunity to play with Tidymodels. They used the k-means clustering algorithm to identify the four main types of margaritas.
Used packages
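Below are the packages I am going to use throughout this project (both part I and II). The columns hold the components of each version number, e.g. tidyverse 1.3.0. Where a version has fewer than four components, the remaining columns simply repeat the earlier ones, which is why the tidyverse row reads 1 3 0 1.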
## [,1] [,2] [,3] [,4]
## tidyverse 1 3 0 1
## tidymodels 0 1 1 0
## here 0 1 0 1
## Cairo 1 5 12 2
## colorspace 1 4 1 1
## janitor 2 0 1 2
## showtext 0 8 1 0
## patchwork 1 0 1 1
## ggthemes 4 2 0 4
## lubridate 1 7 9 1
## flextable 0 5 10 0
## tidytext 0 2 5 0
## arrow 0 17 1 0
## klaR 0 6 15 0
## tidytuesdayR 1 0 1 1
## ggimage 0 2 8 0
## rsvg 2 1 2 1
## conflicted 1 0 4 1
## viridis 0 5 1 0
## ggrepel 0 8 2 0
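A version table like the one above can also be built with one full version string per package. This is only a sketch and not the code used for the table; the `pkgs` vector is a placeholder that lists base packages so the snippet runs on any installation.

```r
# Placeholder list: base packages, so the sketch runs on any R installation.
# Replace with the packages from the table above, e.g. c("tidyverse", "tidymodels").
pkgs <- c("stats", "utils", "tools")

# packageVersion() returns a version object; as.character() gives e.g. "4.0.2"
versions <- vapply(pkgs, function(p) as.character(packageVersion(p)), character(1))

# One row per package, one column with the full version string
version_table <- data.frame(
  package = names(versions),
  version = unname(versions),
  stringsAsFactors = FALSE
)
print(version_table)
```

Keeping the version as a single string avoids any ambiguity between versions with two, three, or four components.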
Since R comes in many flavours (like it or not), below is a slice from the sessionInfo() output so you can see if you are running a similar setup. I am trying to use the latest packages available as of June 2020. The file provided as part of the repository uses a method well known to Python users: a virtual environment. This will help you recreate a snapshot of my setup if you wish to run the script, but you can also simply run the latest packages with an R version above 4.0.0 and you should be fine. To learn how to use the virtual environment, please install the renv package and read the following vignette.
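For orientation, restoring an renv snapshot usually looks something like the sketch below. I am assuming here that the project root contains an `renv.lock` file; check the repository for the actual file name and location.

```r
# Assumption: the working directory is the project root and contains renv.lock
has_lockfile <- file.exists("renv.lock")

if (has_lockfile && requireNamespace("renv", quietly = TRUE)) {
  # Install the package versions recorded in the lockfile into the project library
  renv::restore(prompt = FALSE)
} else {
  message("No renv.lock (or no renv) found; using the currently installed packages.")
}

# To create such a snapshot in your own project you would run, interactively:
# renv::init()      # set up a project-local library
# renv::snapshot()  # record the package versions in renv.lock
```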
Load data
I have already downloaded and saved the data (as .rds). Two files are available for Tidy Tuesday 22 - I will focus on the data file cocktails.csv as it should be more analysis-ready. The other file, boston_cocktails.csv, is not used here because it is messier and would require more cleaning.
tt_cocktails <- tidytuesdayR::tt_load(x = "2020-05-26", download_files = "cocktails")
##
## Downloading file 1 of 1: `cocktails.csv`
data <- tt_cocktails$cocktails      # working copy used for the analysis
cocktails <- tt_cocktails$cocktails # second copy that keeps all 13 columns
data <- data %>%
  select(-c(iba, video)) # drop the iba and video columns, which are mostly NA
Opening data
After opening the data, I want to quickly see what I am tackling here. At this stage, I have only a vague idea about the data. I simply want to familiarise myself with it - understand the variables, missing values, potential errors (whether systematic or unique).
## Rows: 2,104
## Columns: 13
## $ row_id <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3...
## $ drink <chr> "'57 Chevy with a White License Plate", "'57 Chev...
## $ date_modified <dttm> 2016-07-18 22:49:04, 2016-07-18 22:49:04, 2016-0...
## $ id_drink <dbl> 14029, 14029, 15395, 15395, 15395, 15395, 15395, ...
## $ alcoholic <chr> "Alcoholic", "Alcoholic", "Alcoholic", "Alcoholic...
## $ category <chr> "Cocktail", "Cocktail", "Shot", "Shot", "Shot", "...
## $ drink_thumb <chr> "http://www.thecocktaildb.com/images/media/drink/...
## $ glass <chr> "Highball glass", "Highball glass", "Old-fashione...
## $ iba <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ video <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ ingredient_number <dbl> 1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 1, 2, 3, 4, 5...
## $ ingredient <chr> "Creme de Cacao", "Vodka", "Absolut Kurant", "Gra...
## $ measure <chr> "1 oz white", "1 oz", "1/2 oz", "1/4 oz", "1/4 oz...
There are 2,104 rows and 13 columns (or 2,104 observations and 13 variables) before any further coercion. I am now going to look at the types of variables in the dataset. The output below displays the variable types in the cocktails dataset. There is one date (POSIXct + POSIXt), one logical variable, and three numeric variables; the rest are character. However, a closer look reveals these data are mostly categorical and some further cleaning is needed. For example, measure contains both a volume and its unit. EDA should give me better ideas about the questions I want to ask here.
##
## character logical numeric POSIXct POSIXt
## 8 1 3 1 1
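An overview and a class table like the ones above can be produced with base R alone. The sketch below uses a small made-up data frame (`toy`) in place of the cocktails data; the original output may well have come from `glimpse()` and different code.

```r
# Made-up data frame with the same kinds of columns as the cocktails data
toy <- data.frame(
  row_id = c(0, 0, 1),
  drink = c("Drink A", "Drink A", "Drink B"),
  date_modified = as.POSIXct(
    c("2016-07-18 22:49:04", "2016-07-18 22:49:04", "2017-01-02 20:18:16"),
    tz = "UTC"
  ),
  video = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Compact overview of columns and types (base R counterpart of dplyr::glimpse)
str(toy)

# Count how often each class occurs across the columns
class_counts <- table(unlist(lapply(toy, class)))
print(class_counts)
```

A date-time column carries two classes (POSIXct and POSIXt), which is why a single date column shows up twice in such a table.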
Exploratory data analysis (EDA)
With every dataset, I want to make sure I understand what the variables represent. First, what are the observations? In this case, each observation represents an ingredient which belongs to a cocktail. This can be further understood when looking at the “raw” source of data here.
In the previous overview, I saw 13 variables. I will quickly interpret each. The row_id, id_drink, and drink are in theory the same thing represented differently - they identify the drink. The date_modified is likely the date when the entry was made in the database. The alcoholic refers to a version of a drink, i.e. alcoholic or non-alcoholic. The category refers to the type of drink (e.g. shot). The drink_thumb is an image associated with a drink. The glass refers to the glass the drink is served in. I will not bother much with iba and video because, as shown later, they are mostly missing; the first stands for “International Bartenders association category” and the latter for “Video to how to make”. The ingredient_number refers to the order of ingredients in a drink. The ingredient is what a drink consists of, and finally, the measure is the quantity of an ingredient in a drink.
A word of caution. There is something I find a little tricky with datasets like this (it is a very common feature of datasets, not an error). Except for ingredient, ingredient_number, and measure, the variables describe the drink and not the ingredient, so their values repeat on every row of a drink. To be precise, there is no point counting glass, alcoholic, and category without first grouping by id_drink. Without grouping, the result would be a count over all 2,104 rows, for example every row labelled as alcoholic, when there are only 546 drinks! Strictly speaking, the variables I have mentioned should be called, for example, drink_glass, and variables like measure should be called, for example, ingredient_measure, to distinguish which relates to what. This will become clearer as I progress with the EDA and analysis.
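The difference between counting rows and counting drinks is easy to see on a toy example. The data frame below is made up; only the column names follow the dataset.

```r
library(dplyr)

# Made-up rows: one row per ingredient, drink-level values repeated
toy <- data.frame(
  id_drink = c(1, 1, 2, 2, 2, 3, 3),
  alcoholic = c("Alcoholic", "Alcoholic",
                "Alcoholic", "Alcoholic", "Alcoholic",
                "Non alcoholic", "Non alcoholic"),
  stringsAsFactors = FALSE
)

# Naive count: every ingredient row is counted
per_row <- toy %>% count(alcoholic)

# Count per drink: keep one row per drink first
per_drink <- toy %>%
  distinct(id_drink, alcoholic) %>%
  count(alcoholic)

print(per_row)
print(per_drink)
```

The first table counts ingredient rows, the second counts drinks, and only the second answers "how many drinks are alcoholic?".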
At this stage, I can see that most variables are on what is called a “nominal” scale of measurement. This can be changed with further aggregations across id_drink, row_id, and drink (and I will need to do this). That way I will be able to get, for example, how many drinks contain gin, which is a “new” variable on a “ratio” scale.
The ingredient_number and date_modified are the only variables not on the nominal scale of measurement. The ingredient_number tells me the maximum number of ingredients (12) and the minimum (1), and in terms of measurement it is on an “ordinal” scale. It also makes me think of another aspect of the data - complexity. The more ingredients are combined, the more complex the drink is. This starts to probe into questioning the dataset. For example, how complex are the drinks? I quite fancy gin, are there some complex gin drinks? How do they differ from the simple ones? Now, the date_modified is on what is called an interval scale of measurement, but I am not going to make use of this variable.
Lastly, the measure surely contains some information about quantity. It does, but it needs to be cleaned further. For example, “1 oz white” should be split into “1” and “oz” (the “white” describes the ingredient). That way I can get some sense of quantity to compare and move the variable from a “nominal” to an “interval” or possibly “ratio” scale if I standardise the values.
I will now look at the missing values.
row_id | drink | date_modified | id_drink | alcoholic | category | drink_thumb | glass | iba | video | ingredient_number | ingredient | measure |
0 | 0 | 3 | 0 | 8 | 0 | 0 | 0 | 1848 | 2104 | 0 | 0 | 0 |
The variables date_modified and alcoholic contain a small number of NAs. The variables iba and video are almost entirely made up of NAs. I believe it is safe to remove iba and video. I don’t think I can guess or do anything about the missing dates, but filling in alcoholic should be straightforward if I look at the ingredients of the affected drinks.
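A table of missing values per column, and a rule for dropping columns that are mostly missing, can be sketched as follows. The data frame is made up and the 50% threshold is an arbitrary choice for illustration.

```r
# Made-up data frame with a few missing values
toy <- data.frame(
  alcoholic = c("Alcoholic", NA, "Alcoholic", "Alcoholic"),
  iba = c(NA, NA, NA, "Some category"),
  video = c(NA, NA, NA, NA),
  ingredient = c("Vodka", "Gin", "Sugar", "Ice"),
  stringsAsFactors = FALSE
)

# Number of NAs in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Share of NAs per column; keep columns with at most half of the values missing
na_share <- na_counts / nrow(toy)
cleaned <- toy[, na_share <= 0.5, drop = FALSE]
print(names(cleaned))
```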
Now, I want to run EDA on the cocktails dataset. I will first summarise and visualise anything numerical and date-related, then I will move to the character and logical variables. I am also going to save each type in the following temporary data frames (prefixed “t_”) and count the number of ingredients per cocktail.
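# categorical: every non-numeric column except the date, converted to factors
t_categorical <- data %>%
  select_if(Negate(is.numeric)) %>%
  select(-date_modified) %>%
  mutate_all(as.factor)
# numerical: every non-character column (the numeric columns and the date)
t_numerical <- data %>%
  select_if(Negate(is.character))
# number of rows (i.e. ingredients) per drink
ingr_per_cocktail <- t_numerical %>% count(id_drink) %>% pull(n)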
Numerical
The summary() function from base R is sufficient to show the numerical data.
## row_id date_modified id_drink ingredient_number
## Min. : 0 Min. :2015-08-13 10:12:27 Min. :11000 Min. : 1.0
## 1st Qu.:135 1st Qu.:2016-07-18 22:28:43 1st Qu.:11984 1st Qu.: 1.0
## Median :264 Median :2017-01-02 20:18:16 Median :12944 Median : 2.0
## Mean :268 Mean :2016-12-02 04:27:17 Mean :13715 Mean : 2.7
## 3rd Qu.:399 3rd Qu.:2017-09-02 16:40:11 3rd Qu.:15615 3rd Qu.: 4.0
## Max. :545 Max. :2017-09-08 18:07:16 Max. :17230 Max. :12.0
## NA's :3
To reiterate, none of the variables classed as numeric are numeric in the sense of a ratio scale. The row_id and id_drink are simply more convenient identifiers for the cocktails.
I can also see that the actual number of drinks in the dataset is 546, because row_id runs from 0 to 545. Therefore, there are 546 unique cocktails. I will not visualise the id variables because the graphs would not provide any more information than what I have just described. However, I would like to see the other variables, so I will use a series of histograms and bar plots to visualise the data.
The graphs above show various features of ingredients, ingredients order, and date. I can make several observations.
The first two panels show the drinks grouped by the number of ingredients. The top panel shows the most ingredient-“heavy” drinks (i.e., those with 7 or more ingredients), and the second panel shows the most ingredient-“light” drinks (i.e., drinks with only a single ingredient). There are only 4 drinks with a single ingredient. These are followed by 90 drinks with two ingredients. At the other end, there is only 1 drink with 12 ingredients, followed by 2 drinks with 11 ingredients, down to the cut-off of 7 ingredients, which 11 drinks have.
This observation leads nicely to the graph at the bottom left. It is not surprising to see that the order of ingredients is heavily skewed - in other words, each cocktail has at least 1 ingredient, few have more than 8, and only one has 12 ingredients. The typical number of ingredients in a drink is 4 if I use the median, and 3.85 if the mean is used.
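The number of drinks and the typical number of ingredients come from counting rows per drink. Here is a minimal sketch on made-up data; on the real data, the mean of 3.85 is simply 2,104 rows divided by 546 drinks.

```r
library(dplyr)

# Made-up rows: three drinks with 2, 4 and 5 ingredients
toy <- data.frame(
  id_drink = c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  ingredient_number = c(1, 2, 1, 2, 3, 4, 1, 2, 3, 4, 5)
)

# One row per drink with its number of ingredients
ingr_per_drink <- toy %>% count(id_drink, name = "n_ingredients")

n_drinks <- n_distinct(toy$id_drink)
mean_ingr <- mean(ingr_per_drink$n_ingredients)
median_ingr <- median(ingr_per_drink$n_ingredients)

print(c(drinks = n_drinks, mean = mean_ingr, median = median_ingr))
```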
Finally, the bottom right graph shows dates - many of the observations are “newer”, modified in August and September 2017 (the summary above gives 8 September 2017 as the latest date). I could look at when these observations were stored (e.g., hours) but I do not think that is very useful.
The information provided thus far convinces me that my questions should make use of the grouped dataset, the order of ingredients, and the ingredients themselves. Time to move on to the categorical variables.
Categorical
Again, the summary function in R can show what categorical variables I am dealing with and which can be visualised (e.g., those without a large number of levels or “Other” values). For example, drink, drink_thumb (picture), ingredient, and measure have a large number of levels and would not make a good bar chart.
## drink alcoholic category
## Angelica Liqueur: 12 Alcoholic :1821 Ordinary Drink :1060
## Amaretto Liqueur: 11 Non alcoholic : 214 Cocktail : 244
## Egg Nog #4 : 11 Non Alcoholic : 5 Punch / Party Drink: 187
## Arizona Twister : 9 Optional alcohol: 56 Shot : 152
## (Other) :2061 NA's : 8 (Other) : 461
## drink_thumb
## http://www.thecocktaildb.com/images/media/drink/yuurps1472667672.jpg: 12
## http://www.thecocktaildb.com/images/media/drink/swqxuv1472719649.jpg: 11
## http://www.thecocktaildb.com/images/media/drink/wpspsy1468875747.jpg: 11
## http://www.thecocktaildb.com/images/media/drink/ido1j01493068134.jpg: 9
## (Other) :2061
## glass ingredient measure
## Cocktail glass :435 Vodka : 88 1 oz : 189
## Highball glass :302 Gin : 84 1/2 oz : 141
## Collins Glass :269 Sugar : 70 \n : 123
## Old-fashioned glass:213 Orange juice: 57 2 oz : 107
## (Other) :885 (Other) :1805 (Other):1544
From the summary, I can see that there are several categories which could be useful for further analyses and filtering. For example, I am thinking that the analysis could focus only on “alcoholic drinks” to ensure more consistency. I can also see the drink names which occur most often and therefore have the most ingredients, e.g. Angelica Liqueur is quite complex (that is the 12-ingredient one), so looking at some sort of drink complexity could be useful. I also see that vodka appears to be the most common ingredient, followed by gin and sugar. Additionally, the measure seems to include some funky levels, such as “\n”, which is not a measure but a newline character left over from scraping that should be removed.
At this point, I am thinking that there are simply too many ingredients to focus on them all. Therefore, narrowing the dataset could provide better insight. For example, I like gin, so I could focus on drinks containing gin; this also makes alcoholic drinks the only viable option.
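Narrowing the data to gin drinks could look roughly like this. The rows are made up, and the word-boundary pattern is my assumption about how to match “gin” without also matching ingredients such as ginger ale.

```r
library(dplyr)
library(stringr)

# Made-up rows: one row per ingredient
toy <- data.frame(
  id_drink = c(1, 1, 2, 2, 2, 3, 3, 3),
  drink = c("Drink A", "Drink A",
            "Drink B", "Drink B", "Drink B",
            "Drink C", "Drink C", "Drink C"),
  ingredient = c("Gin", "Lemon juice",
                 "Vodka", "Ginger ale", "Lime",
                 "Sloe gin", "Sugar", "Ice"),
  stringsAsFactors = FALSE
)

# "gin" as a whole word, in any letter case
gin_pattern <- regex("\\bgin\\b", ignore_case = TRUE)

# Keep all rows of every drink that has at least one gin ingredient
gin_drinks <- toy %>%
  group_by(id_drink) %>%
  filter(any(str_detect(ingredient, gin_pattern))) %>%
  ungroup()

# A simple complexity measure: number of ingredients per gin drink
complexity <- gin_drinks %>% count(id_drink, drink, name = "n_ingredients")
print(complexity)
```

Filtering within groups keeps the complementary ingredients of each gin drink, which a plain row filter on the ingredient would throw away.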
Some variables could be difficult to visualise because of the number of their levels. I will cover these with two simple descriptive tables for the variables that had too many levels to visualise, i.e. ingredient and measure. I will not show drink, as that would essentially show the ingredient-heavy and ingredient-light drinks again (from the previous output). I will do the tabular summaries first and then the visualisations of the categorical variables.
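A frequency table of this kind takes only a couple of lines with dplyr. The sketch below runs on made-up rows and keeps the top two values; the tables in this post keep the top ten.

```r
library(dplyr)

# Made-up ingredient rows
toy <- data.frame(
  ingredient = c("Vodka", "Gin", "Vodka", "Sugar", "Gin", "Vodka"),
  stringsAsFactors = FALSE
)

# Count each ingredient, most frequent first, and keep the top two
top_ingredients <- toy %>%
  count(ingredient, sort = TRUE, name = "Frequency") %>%
  head(2)

print(top_ingredients)
```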
The most frequent ingredients | Frequency |
Vodka | 88 |
Gin | 84 |
Sugar | 70 |
Orange juice | 57 |
Lemon juice | 51 |
Lemon | 49 |
Light rum | 43 |
Ice | 41 |
Amaretto | 39 |
Water | 38 |
The second table shows the most frequent measures of the various ingredients.
The frequent measures | Frequency |
1 oz | 189 |
1/2 oz | 141 |
\n | 123 |
2 oz | 107 |
1 1/2 oz | 97 |
1 | 91 |
1 tsp | 57 |
1 part | 50 |
1 shot | 48 |
3/4 oz | 47 |
The “\n” I have mentioned already - this needs to be removed. The most common measure seems to be “1 oz”, then “1/2 oz” and “2 oz”; “oz” also seems to be the typical unit of measure. It is clear that the measure is a mix of volume and unit and will need to be further cleaned using functions such as str_split or separate. I will try to do this using regular expressions.
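One possible way to split such strings with a regular expression is sketched below. The pattern and the `parse_quantity()` helper are my own illustration, not the cleaning code used later, and they only handle whole numbers and simple fractions.

```r
library(stringr)

# Example values taken from the measure column shown above
measure <- c("1 oz white", "1 oz", "1/2 oz", "1 1/2 oz", "1 tsp", "\n", "1")

# Hypothetical helper: turn "1", "1/2" or "1 1/2" into a number
parse_quantity <- function(x) {
  vapply(x, function(q) {
    if (is.na(q)) return(NA_real_)
    parts <- strsplit(q, "\\s+")[[1]]
    sum(vapply(parts, function(p) {
      frac <- as.numeric(strsplit(p, "/", fixed = TRUE)[[1]])
      if (length(frac) == 2) frac[1] / frac[2] else frac[1]
    }, numeric(1)))
  }, numeric(1), USE.NAMES = FALSE)
}

# Quantity (mixed number, fraction, or whole number), then an optional unit word
pattern <- "^\\s*(\\d+\\s+\\d+/\\d+|\\d+/\\d+|\\d+)\\s*([A-Za-z]+)?"
m <- str_match(measure, pattern)

parsed <- data.frame(
  measure = measure,
  quantity = parse_quantity(m[, 2]),
  unit = m[, 3],
  stringsAsFactors = FALSE
)
print(parsed)
```

Values that do not start with a number, such as the stray newline, come out as NA in both columns and can then be filtered or inspected separately.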
The variables which are “safe” to visualise are alcoholic, category, and glass, as they do not have a large number of levels. I will need to group these by drink name because otherwise I would run into the issue I warned about earlier.
I will start with the alcoholic variable.
I have produced two plots for the versions of drinks. The first plot shows that alcoholic drinks are the most common; the second is a histogram over rows, split by drink version, showing how the versions are spread across individual cocktails/observations. It seems safe to say that the way forward is to focus either on alcoholic drinks only or on all drinks. I think there are not enough observations on non-alcoholic beverages to analyse them on their own.
Now for the drink’s category and glass.
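A bar plot of a drink-level variable could be drawn along these lines. This is a sketch on made-up rows and not the plotting code behind the figures in this post; the important step is reducing the data to one row per drink before counting.

```r
library(dplyr)
library(ggplot2)

# Made-up rows: one row per ingredient, glass repeated for each drink
toy <- data.frame(
  id_drink = c(1, 1, 2, 2, 2, 3, 3, 4),
  glass = c("Cocktail glass", "Cocktail glass",
            "Highball glass", "Highball glass", "Highball glass",
            "Cocktail glass", "Cocktail glass",
            "Cocktail glass"),
  stringsAsFactors = FALSE
)

# Keep one row per drink, then count drinks per glass
glass_per_drink <- toy %>%
  distinct(id_drink, glass) %>%
  count(glass, name = "n_drinks")

# Bars ordered by frequency; coord_flip() keeps long labels readable
p <- ggplot(glass_per_drink, aes(x = reorder(glass, n_drinks), y = n_drinks)) +
  geom_col() +
  coord_flip() +
  labs(x = "Glass", y = "Number of drinks")
print(p)
```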
The final pair of bar plots should be straightforward to interpret: they show the most common levels for each variable. Notably, drinks served in a cocktail glass and drinks categorised as ordinary are the most common.
Summary
What did I learn about the data? The exploratory analysis revealed that there are 546 cocktails with an average of 3.85 ingredients each. The most common type of drink is alcoholic, and drinks are most likely served in a cocktail glass. There are cocktails such as Angelica Liqueur which seem to be awfully complex and four drinks which are essentially one ingredient. I also need to be careful to group by drink names or ids to ensure I am not creating nonsensical summaries or values. There are also some issues requiring coercion; namely, the measure variable needs to be cleaned further, and it also contains some missing values (likely due to the web scraping).
What questions can I ask? The analysis should narrow down the scope. I have decided to further explore the complexity of the drinks and their ingredients. To do that, I need to focus on one main ingredient, with the other ingredients as “complementary”. I am curious to see more about gin drinks, so the questions I am asking here are: What drinks contain gin? What are the simple or complex gin drinks? What other ingredients define gin drinks? What other features and qualities do such drinks have?
These and related questions will be explored in the following post. To read it, please follow this link to part II.