Animation in R
--
I was blown away by Hans Rosling's TED talk The Best Stats You’ve Ever Seen. I like R, but it had never struck me as COOL. Now this is geekily cool, i.e. my type of cool. For those of you who have not yet watched the talk, here it is.
Whether we love it or hate it, there has never been a better (or worse?) time to study data. With a great amount of data, though, comes a great amount of confusion. I am increasingly losing confidence in my instincts, fearing that unconscious bias or emotions may have left their mark on me. It is mentally exhausting to fight your own thoughts when you constantly pause to invoke your System 2 (per Kahneman). Nonetheless, I find animated data visually pleasing, and it makes navigating the data labyrinth easier.
So obviously I replicated the animation using the dataset from Gapminder (the organisation Hans Rosling co-founded to fight misconceptions). The brilliant code illustration is available on the gganimate site, so I will not repeat it here. However, I made two mistakes during the replication, which turned this seemingly straightforward exercise into a steep learning curve.
- I forgot to install the "gifski" package. This is the key to generating the gif file where your animation lives. Without it, the odds are you will be generating hundreds of png files.
- I showed the legend. In the demonstration code there is no legend (show.legend = FALSE). So I thought, why not? I removed that argument and the code generated a gif like this:
I searched online in a frenzy but still couldn't figure out what had happened. No one (not even those amazing strangers at Stack Overflow) seemed to have encountered the same issue. I tried a bunch of things (mostly restarting R) in vain. Then I gazed at the gif and tried to make sense of it. A couple of hours or days later, I realised that it was a giant legend which had broken the gif. So do hide all your legends: show.legend = FALSE is the CORRECT code.
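The first mistake is easy to guard against before rendering anything. As a minimal sketch (the variable names are mine, chosen for illustration), the check below uses base R's requireNamespace() to report which of the three packages are not installed yet:

```r
# Packages the animations in this post rely on
needed <- c("ggplot2", "gganimate", "gifski")

# requireNamespace() returns TRUE when a package is installed and loadable
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All set: the gif renderer is available.")
}
```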
After the replication, I gained confidence and started to play with other publicly available datasets. The first dataset I worked with was the Hong Kong inflation rate. Let's see how much more we have had to pay for goods in Hong Kong over the past 40 years. It turns out not to be as bad as I thought.
Animation 1: HK Inflation Rate 1982–2022
Dataset: Table 510–60001 : Consumer Price Indices (October 2019 — September 2020 = 100) (updated: 20 July, 2023)
Source: HKSAR Census and Statistics Department
One small thing. I am sure you are all careful and smart and all that, but just in case: set your working directory to your source file location. Otherwise, R will not find data files that the code refers to by file name alone, and the code won't run. Just WON'T.
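If you are not sure where R is looking, a quick check before reading any file can save some head-scratching. This is only a sketch: "CO2.csv" is the file name used later in this post, and the folder in the commented setwd() line is a placeholder you would replace with your own.

```r
# Show the folder R currently treats as the working directory
print(getwd())

# A bare file name is resolved relative to that folder
csv_file <- "CO2.csv"
found <- file.exists(csv_file)

if (!found) {
  message(csv_file, " is not in ", getwd())
  # setwd("/Users/YOURFOLDER")  # placeholder path: point it at your source file location
}
```

In RStudio, Session > Set Working Directory > To Source File Location does the same thing from the menu.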
The code is straightforward. Most of it just creates a conventional ggplot object. I add some cosmetic touches, such as changing the panel background to "lightblue" and asking the X and Y scales for about 10 breaks each. The animation code is astonishingly simple: I use transition_reveal(Year) to reveal the line year by year. Here is the R code:
library(ggplot2)
library(gganimate)
library(dplyr)   # provides the %>% pipe
library(gifski)  # gif renderer used by animate()

# inflationhk is the cleaned data with Year and Inflation columns, already loaded
# conventional ggplot object creation code
anim <- inflationhk %>%
  ggplot(aes(Year, Inflation)) +
  geom_line(color = "darkorange") +
  geom_point(color = "darkorange") +
  ggtitle("HK Inflation Rate 1982 - 2022") +
  ylab("Inflation Rate %") +
  theme(legend.position = "none",
        panel.background = element_rect(fill = "lightblue")) +
  scale_x_continuous(n.breaks = 10) +
  scale_y_continuous(n.breaks = 10) +
  # here is the animation code
  transition_reveal(Year) +
  ease_aes('linear')

animate(anim, renderer = gifski_renderer())
anim_save("inflation.gif")  # saves the last rendered animation in your source file location
The CSV data looks like this:
You need to clean the data a bit, getting rid of whatever you don't need.
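What "cleaning" means depends on the file you download, so the snippet below is only a rough sketch in base R. The column names Year and Inflation match the plotting code above; the extra Note column and the numbers are made-up placeholders, not the actual Hong Kong figures.

```r
# Placeholder data standing in for the downloaded CSV (values are made up)
raw <- read.csv(text = "Year,Inflation,Note
1982,1.0,placeholder
1983,NA,placeholder
1984,2.0,placeholder")

# Keep only the two columns the plot needs and drop rows without a value
inflationhk <- raw[!is.na(raw$Inflation), c("Year", "Inflation")]

# Check that Year and Inflation were read as numbers, not text
str(inflationhk)
```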
Animation 2: GDP per capita (current US$) vs. CO2 Emission(KT)
Here is a slightly more complicated animation, more closely resembling the Gapminder sample on the gganimate site. I use World Bank datasets.
Datasets:
Population, total (updated: 25 July, 2023)
GDP per capita (current US$)(updated: 25 July, 2023)
CO2 emissions (kt)(updated: 25 July, 2023)
Source: The World Bank
I want to show the relationship between GDP per capita (current US$) and CO2 emissions (kt) over the years 1990 to 2020. The size of each bubble is proportional to the population of the country. Since the World Bank datasets are "messy" for R, I did some data cleansing first.
I use the CO2 emissions dataset as my base and combine the population and GDP per capita data into it. I started by checking all three datasets and picked 1990 to 2020, the period for which most of the three datasets have data. I deleted unnecessary columns and rows in the CSV files, then renamed the columns for easy identification. Since the years in a World Bank dataset are laid out horizontally by default (one column per year), I use the melt() function from library(reshape2) to turn them into a vertical layout (one row per country and year) that R can process. I tried some Excel tricks, but they didn't work as I wished. melt() is very handy.
Here is the code:
library(reshape2)
# data cleansing: melt the year columns into Year/Emissions pairs
CO2data <- read.csv("CO2.csv")
CO2melt <- melt(CO2data, id.vars = c("Country", "Code", "Indicator"),
                variable.name = "Year", value.name = "Emissions", na.rm = TRUE)
# save as a new csv file
write.csv(CO2melt, "/Users/YOURFOLDER/meltedco2.csv")
melt() arguments are:
data = CO2data # the dataset you are going to operate on
id.vars = c("Country", "Code", "Indicator") # the identifier variables, i.e. the columns that I want to stay fixed
measure.vars # the variables that you want to melt. I didn't specify them, so melt() treats every column not listed in id.vars as a variable to melt.
variable.name = "Year" # I want to call the newly melted variable Year
value.name = "Emissions" # I want to call the values that correspond to each Year "Emissions"
na.rm = TRUE # I want to remove all NA values
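One detail worth checking after melt(): by default read.csv() prefixes column names that start with a digit with an "X" (so a 1990 column becomes X1990), and melt() stores the melted column names as a factor. A Year column like that has to be turned into a number before it behaves as a year in the animation. The sketch below builds a small stand-in for the melted data by hand (the country labels and emission values are made-up placeholders) and converts the column using base R only:

```r
# Stand-in for the output of melt(); emission values are placeholders
CO2melt <- data.frame(
  Country   = c("A", "A", "B", "B"),
  Year      = factor(c("X1990", "X1991", "X1990", "X1991")),
  Emissions = c(10, 11, 20, 21)
)

# Strip the leading "X" and convert the labels to integers
CO2melt$Year <- as.integer(sub("^X", "", as.character(CO2melt$Year)))

str(CO2melt$Year)
```

Going through as.character() first matters: calling as.integer() directly on a factor returns the internal level codes rather than the years.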
In a snap of the fingers, the melting is done. The GDP per capita and population data files keep the same format as downloaded from the World Bank; just remove whatever you don't need. I use the Excel functions INDEX() and MATCH() to match the GDP and population data to the CO2 emissions. I also use VLOOKUP on the World Bank metadata to create a Region column so that I can separate the figures into different regions of the world. Here is the Excel formula for the first cell:
INDEX('worldbankdata/[GDPpercap.csv]GDPpercap'!$1:$1048576,MATCH(A2,'/worldbankdata/[GDPpercap.csv]GDPpercap'!$B:$B,0),MATCH(E2,'/worldbankdata/[GDPpercap.csv]GDPpercap'!$A$1:$BO$1,0))
The idea is to use MATCH() to find the row and column indexes that match both the country (cell A2) and the year (cell E2), then use INDEX() to fetch the value at that position and fill it in (cell G2). The final CSV file looks like this:
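The same lookup can also be done without leaving R. This is not the route I took above, just an alternative sketch: once GDP per capita is in long format too, base R's merge() matches rows on Country and Year much like INDEX() and MATCH() do. The data frames and numbers below are made-up placeholders.

```r
# Placeholder long-format tables (one row per country and year)
co2 <- data.frame(Country   = c("A", "A", "B"),
                  Year      = c(1990, 1991, 1990),
                  Emissions = c(10, 11, 20))
gdp <- data.frame(Country   = c("A", "A", "B"),
                  Year      = c(1990, 1991, 1990),
                  GDPpercap = c(1000, 1100, 2000))

# Keep every CO2 row and attach the matching GDP value (NA when there is no match)
CO2GDP <- merge(co2, gdp, by = c("Country", "Year"), all.x = TRUE)
print(CO2GDP)
```

Population could be attached the same way with a second merge() on the same two keys.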
Here is the complete code:
library(ggplot2)
library(gganimate)
library(gifski)
CO2GDP <- read.csv("CO2GDP.csv")
CO2GDP$Country <- factor(CO2GDP$Country)
p <- ggplot(CO2GDP, aes(GDPpercap, Emissions, size = Population,
                        colour = Country, label = Country)) +
  # hide the legend: one colour per country would produce a giant legend
  geom_point(alpha = 0.7, show.legend = FALSE) +
  scale_x_log10() + scale_y_log10() +
  facet_wrap(~Region) + scale_size(range = c(2, 12)) +
  # Here comes the animation code
  labs(title = 'Year: {frame_time}', x = 'GDP per capita (current US$)',
       y = 'CO2 Emission(KT)') +
  transition_time(Year) +
  ease_aes('linear')
animate(p, renderer = gifski_renderer())
anim_save("CO2.gif")
I transform both the X and Y axes to log10 since the figures are relatively large, and use transition_time(Year) to animate the changes year by year. Done.
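To see why the log scale helps, here is a tiny illustration with made-up round numbers: values that span several orders of magnitude become evenly spaced after log10(), which is the transformation scale_x_log10() and scale_y_log10() apply to the axes.

```r
# Made-up values spanning four orders of magnitude
values <- c(100, 10000, 1000000)

# On a linear axis the first two would be squeezed together near zero;
# after log10() the three values sit at equal distances
log_values <- log10(values)
print(log_values)
```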
More on animation in R later.
*The data used here is for code illustration only and is not meant to imply anything.