12 minute readGgplot2 for revealing useful coffee relations

Can ggplot2 show if coffee producers drink more coffee?

After making this map of coffee producing countries with ggplot2 in R, and with the help of some comments, I wanted to figure out if coffee producing countries also consume more coffee. And what about the export of coffee? In the meantime, I have finished my batch of Brazilian coffee and am now drinking the Ethiopian flavour. So now I will show you how R and ggplot2 can tackle the question “Do coffee producing countries also drink more coffee?”

Different data this time

In the last post I used FAO data to find some numbers on coffee, but it was not enough to answer the questions above. After googling some more, I found new sources of information. I used data from the International Coffee Organisation (ICO) on production, consumption and export of producing countries, but the data only goes back until 1990.

I also used the list of countries per continent from the other map (find it here).

But similar packages: loading them

New to loading packages? Read this first.

library(readxl)   # import excelsheet
library(ggplot2)  # data visualization
library(dplyr)    # data wrangling
library(tidyr)    # reshaping data
library(rworldmap)# get worldmap background
library(randomcoloR) # random colour picker
library(scales) # needed for ggplot2 axis

And adding new data

The data from ICO is not in a perfect format for R yet, and I did the quick and easy way of just deleting some rows (first 3 and last ) and columns (2nd) in Excel to make the data square. I also change the 1st column name to Countries instead of Calendar Years. Here’s the code bit:

Total_production <- read_excel("Total_production.xlsx") 
Exports_calendar_year <- read_excel("Exports_calendar_year.xlsx")
Domestic_consumption <- read_excel("Domestic_consumption.xlsx")
map.world <- map_data(map="world")

It includes the 3 data sets from ICO (production, export and consumption) and the worldmap background for eventual visualisation.

Using the gather function for ggplot2

The strength of ggplot2 is that it is easy to see how some variables relate to others. But to do this the data needs to be in a certain format: each variable is in a column, each observation is in a row and each value is in a cell (this is what tidyr does). From this tidyr package we use the gather() function: gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer. In practice this looks like this:

Prod<-gather(data=Total_production, "1990/91","1991/92","1992/93","1993/94","1994/95","1995/96","1996/97","1997/98","1998/99","1999/00","2000/01","2001/02" ,"2002/03","2003/04", "2004/05", "2005/06", "2006/07", "2007/08", "2008/09", "2009/10"," 2010/11","2011/12","2012/13","2013/14","2014/15"  , "2015/16", "2016/17", "2017/18",  "2018/19",
key= Year_crop, value=Production) %>%
mutate(Year=substr(Year_crop, 1, 4))

The data is the production data, which is followed by all the columns of the original dataframe in this case. The key is year_crop, which will be the 2nd column in the new dataframe. Value will be the 3rd column (called Production), containing the values, in this case all the cells with production values. Then the mutate function adds another column with the year, just to make it easier to compare. This all results in a dataframe where the dimensions are changed from 56×30 to 1624×4. The usefulness will become clear later on. But in short, each row now shows the country name, year and production for that year, meaning there are 29 rows per country for the last 29 years.

The same is done for the export and the consumption:

Export<-gather(data=Exports_calendar_year, "1990", "1991","1992", "1993", "1994", "1995", "1996", "1997", "1998", "1999", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", key= Year, value=Export)  

Cons<-gather(data=Domestic_consumption, "1990/91", "1991/92", "1992/93", "1993/94", "1994/95", "1995/96", "1996/97", "1997/98", "1998/99", "1999/00", "2000/01", "2001/02", "2002/03", "2003/04", "2004/05", "2005/06", "2006/07", "2007/08", “2008/09", "2009/10", "2010/11", "2011/12", "2012/13", "2013/14", "2014/15", "2015/16", "2016/17", "2017/18", "2018/19", key= Year_crop, value=Consumption) %>% mutate(Year=substr(Year_crop, 1, 4))

Joining the data to one dataframe

This step is simply joins (using left_join) the different dataframes based on a columname (using by=…):

All_data <- left_join(Prod ,Cons, by=c("Country","Year","Year_crop"))
All_data <- left_join(All_data,Export, by=c("Country","Year"))

I also changed the data type of the Year column to numeric, now it is recognised as a character, and changed the values from 1000 60kg bags to ton:

All_data<-All_data %>% mutate(Year = as.numeric(Year))  %>%

Correcting country names

As we saw in the first coffee post, not all datasets use the same names for countries. Especially Ivory Coast and the two different Congo’s like to be spelled differently…

anti_join(All_data, countries, by = c("Country"="Area"))

To see which names are different, followed by

All_data <- All_data %>%  mutate(Country = recode(Country 
    ,'Congo, Rep. of' = 'Republic of Congo'
    ,"Côte d'Ivoire" = 'Ivory Coast'
    ,'Congo, Dem. Rep. of' = 'Democratic Republic of the Congo'
    ,"Lao, People's Dem. Rep. of" = 'Laos'
    ,'Trinidad & Tobago'='Trinidad'))

To change the names. Then the continent of that country is added to the dataframe:

All_data<- left_join(All_data ,countries, by=c("Country"="Area"))

Now we are ready to do some exploring!

Some basic plotting with ggplot2

First off all, I wanted to know if the annual production of coffee is increasing. So a basic plot would result in (plot1):

Production of coffee per continent, per year
ggplot(All_data,aes(x=Year,y=Prod_ton,col=Continent))+geom_point(aes(fill=Continent))+labs(title="Production per year, per continent")

I have chosen to colour each point according to its continent, as each colour would become overwhelming. You notice straight away that South America (blue) is by far producing most of the coffee. But in terms of total annual production this graph is not that useful. So I will stack the points using geom_col():

Other way of presenting, with total annual production visible now. And pretty rainbow colours 🙂
ggplot(All_data,aes(x=Year,y=Prod_ton))+geom_col(aes(fill=Continent))+labs(title="Production per year, per continent")

Now you see a pretty rainbow chart, which, although not optimal, at least shows how much coffee is produced each year, and how much each continent contributes to the annual production. Annual production seems to be increasing, but not all continents contribute with this growth. Actually, only South America and South-Eastern Asia seem to grow annually, whereas the rest of the world has a steady production.

What about seasons?

A funny thing is also the 2019 column, which shows the different seasons throughout the world. South America is at the end of the season and has almost produced all of its coffee, whereas South-Eastern Asia still needs to reach the middle of their harvesting season. Note that some errors in this way of presenting may be present, as previously I split the year from 1990/91 to 1990. This may cause some production to be grouped in the wrong year. But for now, I’m okay with this, as it is a systematic error done with each year. So maybe the values are wrong, but the trend can still be trusted. You can also plot Year_crop on the x-axis instead of Year, but the graph is similar.

The graph shows that two continents are interesting, which I will compare now.

South America vs South-East Asia: which country produces most?

Boxplot showing range of production per country, for highest producing continents. (I forgot to change the axis names…)
ggplot(filter(All_data, Continent == "South America" | Continent == "South-Eastern Asia"), aes(Country,Prod_ton,col=Continent)) + geom_boxplot()+facet_grid(~Continent,scales = "free_x") +   theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position="none") + scale_y_continuous(trans='log2')+  geom_jitter(alpha = 1/5) + labs(title="Coffee production in South America and South-Eastern Asia, on a y log scale")

By now you might have noticed that when plotting with ggplot2 you may need a lot of code. So it takes some time to learn what the possibilities are, but once you know them, ggplot2 is the way to go (for me at least), especially in combination with dplyr and tidyr packages, making data manipulation much easier.

An example is in the previous script:

filter(All_data, Continent == "South America" | Continent == "South-Eastern Asia")

This part actually does the filtering without changing the data itself. I filtered all the data to only show countries that are in the two continents, using | as or in R language. The next interesting part is faceting:

facet_grid(~Continent,scales = "free_x")

Which tells R to split the data into different boxes, based on the Continent. Free_x prevents the x-axis to show all countries, instead showing only the relevant countries of that continent. The rest of the script adjusts the angle of the text (axis.text.x) to 90 degrees and hides the legend. It would only show that the colour corresponds to the continent. Scale_y_continous is used to put the y axis on a logarithmic scale, otherwise countries such as Guyana hardly appear on the graph.

There are also many options in adjusting the aesthetics of the graph, for example with a white background (use theme_bw() ), but check the ggplot2 cheat sheet for all of these options: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf  . I will not go into them here.

What about consumption of coffee within these countries? Is it also rising?

Coffee consumption of top producing countries

I want to know how the consumption is compared to the production, but only for the top producing countries. Visually I can see that these are Vietnam and Indonesia in South-Eastern Asia, and Brazil and Columbia in South America, so these I filter out:

ggplot(filter(All_data,Country=="Brazil" | Country =="Indonesia" | Country == "Colombia"| Country =="Vietnam"), aes(Cons_ton,Prod_ton,col=Country))+
  geom_smooth(se=F,method = lm,formula = y ~ splines::bs(x, 4)) 
  labs(x="Consumption (kg)",y="Production (kg)", title="Coffee consumption vs production in top producing countries") +
  scale_x_continuous(labels = comma)+scale_y_continuous(labels = comma)

I also plot a smooth line using geom_smooth(), making it smooth with the splines part, and I want to show the values in full numbers, not scientific, using the labels=comma in the last part. The result:

Top producing countries and their production vs consumption. Seems like a rising trend in most, right?

Although there seems to be a rise in consumption when increasing production, I have not taken time into account yet. I can do this by faceting again. Perhaps not the most elegant way, but it works, is by adding facet_grid(Country~Year) and removing the geom_smooth():

  ggplot(filter(All_data,Country=="Brazil" | Country =="Indonesia" | Country == "Colombia"| Country =="Vietnam"), aes(Cons_ton,Prod_ton,col=Country))+
    theme(axis.text.x = element_text(angle = 90, hjust = 1),legend.position="none")+
    labs(x="Consumption (kg)",y="Production (kg)", title="Coffee consumption vs production in top producing countries")+
  scale_x_continuous(labels = comma)+scale_y_continuous(labels = comma)
Previous plot, but now split out per year. Now the trens are clearer: there is a rise in production in Brazil and Vietnam, but there is still an annual fluctuation. Consumption is increasing faster in Brazil by the way.

It serves its purpose. We can now see the points neatly ordered in their year. What becomes apparent now is that the production fluctuates quite a lot each year for all countries, but that Colombia and Indonesia do not necessarily consume much more when production increases (which it also hardly does). Brazil and Vietnam show a strong increase in production, over the years, but where Brazilians also drink more coffee, the Vietnamese prefer to sell the coffee. Perhaps if I can find a dataset on tea consumption?  Anyways, lets add the export data to verify this thought. I will only look at Brazil and Vietnam in the next step.

Coffee drinkers vs non-coffee drinkers: comparing Brazil to Vietnam

The export is a continuous variable, and plotting three continuous variables is possible. The export would be represented by a colour, but the colour would be continuous, making distinguishing difficult. So I made three categories: low, medium and high, based on quantiles (0.33,0.66). This is done in the cut part:

All_data2 <- All_data%>% filter(Country=="Brazil" | Country =="Vietnam") %>% mutate(Exp_cat = cut(Export, breaks = quantile(Export, probs = seq(0, 1, 0.33),na.rm=T), labels=c("Low","Medium","High")))

Next I plotted the production vs the consumption vs export, with export being coloured and having a different size per category:

ggplot(filter(All_data2,Country=="Brazil" | Country =="Vietnam"), aes(Cons_ton,Cons_ton,col=Exp_cat))+geom_point(aes(size=Exp_ton))+# geom_smooth(se=F) +
  labs(y="Production (kg)",x="Consumption (kg)",title="Production vs consumption vs export")+     theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  scale_x_continuous(labels = comma)+scale_y_continuous(labels = comma)
Export has been added as size of the circle. Vietnam is exporting quite a lot!

Now it is possible to see that Vietnam is producing a lot, not consuming that much and thus exporting a lot, which make sense… Brazil is both consuming and exporting a lot of coffee. Vietnam still has a lot to catch up on, now it is only producing half of what Brazil produced in 1990.

Lets take a closer look at 2017 in numbers. Brazil produces 306001 ton of coffee vs 177000 ton for Vietnam. Brazil consumes 131982 ton vs 15000 ton for Vietnam. Export: 183825 vs 139252 ton. In percentage, Vietnam exported 80% of its produced coffee, whereas Brazil exported 60% in 2017.

And now: (moving) art

I came across a package that lets you take a 5th dimension into account: time! This package makes an animation of many graphs. It’s called gganimate . Here’s the code and the result:

An animated version! Each colour is a different country.

There is quite some movement going on! Don’t forget to look at the scales of each graph, they are not the same 😛

I wondered if this is also possible with a map… YES, here is the result 😀 :

A gif of the world coffee production from 1990-2018

With the following code:

  geom_polygon(data=map.world, aes(x = long, y = lat, group = group)) +
  geom_polygon(data=map_coffee,aes(fill = log(Prod_ton),x = long, y = lat,frame=Year, group = group)) +  #+ #geom_point(aes(x = Centre_long, y = Centre_lat,size=Consumption), col="red")
  labs(title = 'Year: {current_frame}', x = 'Production (ton)', y = 'Consumption (ton)') +
  transition_manual(Year) +


The data shows that, as always, it depends on the country where more coffee is being consumed, but in general you can see that coffee consumption is following production. The past plots showed that ggplot has no problem in showing data in many different ways, but there is still much more to be seen. In the next post I will use Rshiny to make a shiny app, where you can play with the data yourselves!

Edit: this is the second of three posts. Click here for the first and here for the third post in this coffee exploring theme.

One Comment

Add a Comment

Your email address will not be published. Required fields are marked *