1. Introducing the Rank Abundance Diagram
dplyr
ggplot
and ggrepel
gridExtra
This tutorial is going to take you through the steps to make basic Rank Abundance Diagrams. It is aimed at beginners and whilst some knowledge of dplyr and ggplot2 is helpful, it’s not vital. We will run through all the steps to make Rank Abundance Diagrams, from data importation, wrangling and manipulation to visualisation and saving your plots. If you want to understand more about the dplyr functions we are using or make your plots more beautiful, you can check out the Coding Club Tutorials. First, let’s download the files you need.
You can download the dataset, trees_messy.csv, for this tutorial from the data folder in this GitHub repository. Just download the zip file and unzip it.
Global biodiversity is declining, which means ecological communities are changing. Rank Abundance Diagrams give us a way to visualise community composition based on the relative abundance of each species, a key component of biodiversity. Species within the community are ranked by abundance, from most abundant to least abundant. We can then easily visualise the community composition and compare communities over time by plotting the rank of each species against its relative abundance.
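As a minimal sketch of the idea, the base R snippet below ranks a small made-up community by abundance and converts the counts to relative abundance. The species names and counts are invented for illustration and are not taken from the tutorial dataset.

```r
# Hypothetical counts for four species (invented for illustration)
counts <- c(oak = 50, birch = 30, pine = 15, ash = 5)

# Rank 1 = most abundant species; negate the counts to rank in descending order
species_rank <- rank(-counts)

# Relative abundance as a percentage of all individuals
relative_abundance <- counts / sum(counts) * 100

# Combine into a table ordered by rank
rad <- data.frame(species = names(counts),
                  rank = as.integer(species_rank),
                  relative_abundance = as.numeric(relative_abundance))
rad <- rad[order(rad$rank), ]
print(rad)
```

Plotting rank on the x axis against relative_abundance on the y axis for a table like this gives a Rank Abundance Diagram.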
Let’s make a start!
Open RStudio and create a new script by clicking on File/ New File/ R Script. Now we need to set the working directory. This will be where your script and any outputs are saved, and also where RStudio will look for datasets. If you haven’t done so already, it’s useful to move the datasets you just downloaded for this tutorial into a new folder. Then set that folder as your working directory. To do this, select Session/ Set working directory and then select your folder.
# Set the working directory
setwd("your_filepath") # enter your own file path here
getwd() # run this to check that you have set your working directory correctly.
To run the code, highlight the line and press Ctrl and Enter (Windows PC) or Command and Enter (Mac). You can also highlight the code and click the Run icon in the top right-hand corner of the script window.
We are going to need some packages installed to complete this tutorial - don’t panic! It’s easy to install them if you haven’t done so previously. If you have already used them, you will just need to load the libraries.
We are going to use the tidyverse package, which includes dplyr (for data wrangling) and ggplot2 (for data visualisation). We are also going to need ggrepel to tidy our plots up and gridExtra to make a final panel. Install these packages if you haven’t done so already or go ahead and load the libraries.
# Load libraries
library(tidyverse)
library(ggplot2)
library(ggrepel)
library(gridExtra)
# To install packages uncomment #install.packages("package name") and then load the libraries.
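If you are not sure which packages are already installed, a small check like the one below can help. This is only a sketch: missing_packages() is a hypothetical helper written for this example, not part of any package, and the install line is left commented out so nothing is installed unless you choose to.

```r
# Packages used in this tutorial
needed <- c("tidyverse", "ggrepel", "gridExtra")

# Hypothetical helper: return the names of packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)
print(to_install)

# Uncomment the next line to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```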
Time to import your dataset. We are going to be working with a fake dataset showing populations of tree species. You can click on Import Dataset and import the dataset you downloaded earlier. Alternatively, use the read.csv() command to tell RStudio where your dataset is - hopefully in the working directory you specified earlier! If you use the Import Dataset button, copy the code from the console and insert it into your script to save any confusion in the future.
Read in the trees_messy.csv file and make it a data frame object called trees_messy using <-.
# Load the data
trees_messy <- read.csv("~/Desktop/data_science/tutorial-darahubert/trees_messy.csv")
# Your file path will be different.
Let’s explore the data and see what we are working with.
# Preview data
head(trees_messy) # view the first few observations
str(trees_messy) # view the types of variables
names(trees_messy) # view the column names
Right, we are ready to start tidying the data. If any of the above doesn’t make sense, have a look at this tutorial on getting started with R and RStudio.
Data wrangling involves organising the data in a way that we can easily manipulate and visualise it. We want it to be in a tidy format where each row shows an observation and each column shows a variable. If that doesn’t make sense, take a look at this Coding Club tutorial. We may also need to clean the data (get rid of n/a values, make sure variable names/classification make sense and are grouped appropriately) or transform the dataset to create appropriate values for our data visualisation.
The next steps will talk you through how to tidy your data and then transform it ready for data visualisation.
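Before cleaning, it can help to see how many missing values there are and what removing them does. The base R sketch below uses a small invented data frame, not the tutorial data, to count the NA values in each column and show that na.omit() drops every row containing one.

```r
# Toy data frame with missing values (invented, not the tutorial data)
toy <- data.frame(
  year       = c(1981, 1981, 2021, 2021),
  species    = c("oak", NA, "oak", "birch"),
  population = c(120, 80, NA, 95)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# na.omit() drops every row that contains at least one NA
toy_clean <- na.omit(toy)
print(nrow(toy_clean))
```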
Let’s view the entire dataset.
view(trees_messy)
The dataset itself is in a tidy format but it could do with some cleaning up.
We need to get rid of the n/a values, change the column names to lower case and get rid of any rows we don’t want. Before we start getting rid of data, it’s good to think about our final plot and what we want to show.
The dataset contains data for two years, 1981 and 2021. We want to make a plot for each year showing the relative abundance of each species and the rank of each species. Then we can visualise how the community has changed over time. Let’s get rid of the columns for date and phylum. We will keep the columns for year, population and species.
# Tidy the dataset
trees_tidy <- select(trees_messy, -date, -phylum) %>% # delete the unwanted date and phylum columns
  na.omit() # delete all rows containing n/a values
colnames(trees_tidy) <- c("year", "species", "population") # change column names
head(trees_tidy) # Check changes have been made
We have used a pipe, %>%, to pass the output of select() on the left into na.omit() on the right. It’s a handy operator that makes code easier to read by avoiding nested function calls. You can find more about pipes in this tutorial.
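To illustrate what the pipe does, the sketch below writes the same two steps as a nested call and as a piped call, using a small invented data frame. It assumes the dplyr package is installed.

```r
library(dplyr)

# Toy data frame (invented for illustration)
toy <- data.frame(species = c("oak", "birch", "pine"),
                  phylum = c("Tracheophyta", "Tracheophyta", "Tracheophyta"),
                  population = c(120, NA, 40))

# Nested call: read from the inside out
nested <- na.omit(select(toy, -phylum))

# Piped call: read from left to right
piped <- toy %>%
  select(-phylum) %>%
  na.omit()

# Both versions should produce the same data frame
print(identical(nested, piped))
```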
Great! Now we want to split our dataset by year. To do this we are going to use the filter() command and make a new data frame object, using <-, for each year.
trees_81 <- trees_tidy %>% filter(year == 1981) # data frame for 1981
trees_21 <- trees_tidy %>% filter(year == 2021) # data frame for 2021
Perfect. Let’s transform the data so we can plot it in a Rank Abundance Diagram. The diagrams will show the rank of each species (x axis) and the relative abundance of each species (y axis). Let’s deal with the ranking first.
Rank Abundance Diagrams rank the species with the highest relative abundance as 1. We want to rank species based on their population in descending order.
We will add a new column to each data frame using $. Using a - before trees_21$population ranks in descending order. ties.method = "random" allocates a ranking at random between species that have the same population size (we want all ranks to be unique).
# Ranking the data
# Create a new column in the trees_21 data frame.
# Use rank() with - to rank the species in descending order of population size.
# ties.method = "random" gives each species a unique rank and breaks any ties at random.
trees_21$rank <- rank(-trees_21$population, ties.method = "random")
# Run this code to repeat the steps for trees_81
trees_81$rank <- rank(-trees_81$population, ties.method = "random")
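To see what the ties.method argument changes, the sketch below ranks a short invented vector containing one tie. The call to set.seed() is an addition for this example only; it makes the random tie-break repeatable between runs.

```r
# Hypothetical population sizes; the second and third values are tied
population <- c(10, 30, 30, 5)

# The default, ties.method = "average", gives tied values the same non-integer rank
average_ranks <- rank(-population)
print(average_ranks)

# ties.method = "random" gives every value a unique rank;
# fixing the seed makes the random tie-break reproducible
set.seed(42)
unique_ranks <- rank(-population, ties.method = "random")
print(unique_ranks)
```

The two tied values share ranks 1 and 2 in a random order, while the other values keep ranks 3 and 4.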
We now want to know the total population in each year to calculate the relative abundance of each species. We can do this using the sum() and transform() commands. transform() makes a new column based on existing values. For each year, we will divide the population size of each species by the total population and multiply by 100. We will also save our final data frames as .csv files.
# Save the total population size of each year as an object
total81 <- sum(trees_81$population)
total21 <- sum(trees_21$population)
# Use the transform function to add a new column for relative abundance.
trees_21 <- transform(trees_21, relative_abundance = (population / total21) * 100) # make a new column for relative abundance
trees_81 <- transform(trees_81, relative_abundance = (population / total81) * 100) # make a new column for relative abundance
# Save the final datasets to your working directory
write.csv(trees_21, file = "trees21_final.csv")
write.csv(trees_81, file = "trees81_final.csv")
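A quick sanity check after this step is that the relative abundances for a year add up to 100. The sketch below shows the calculation and the check on a small invented data frame rather than the tutorial data.

```r
# Toy data for one year (invented for illustration)
toy <- data.frame(species = c("oak", "birch", "pine"),
                  population = c(60, 30, 10))

# Total population, then relative abundance as a percentage
total <- sum(toy$population)
toy <- transform(toy, relative_abundance = (population / total) * 100)

# The percentages should add up to 100 (allowing for floating-point error)
stopifnot(isTRUE(all.equal(sum(toy$relative_abundance), 100)))
print(toy)
```

The same check could be run on trees_21 and trees_81 before saving them.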
Great! Now we have our data to plot. Let’s move on to data visualisation.
We want to show the relative abundance of each species, plotted against the rank of each species for both 1981 and 2021 and connect the data points.
We will plot a connected scatterplot using ggplot2. This uses the commands geom_point to plot the data points and geom_line to connect them. There is so much more you can do with ggplot2, so if you want to find out more about data visualisation be sure to have a look at this data visualisation tutorial.
For now, let’s make our first Rank Abundance Diagram for 2021, using the trees_21 data frame.
p2021 <- ggplot(trees_21, aes(x = rank, y = relative_abundance)) + # set x and y axis
  geom_point(shape = 21, size = 2.25, aes(color = species, fill = species)) + # plot data points and colour them according to species
  geom_line(colour = "grey") + # plot line to connect data points
  theme_bw() + # set theme for plot
  labs(x = "Species Rank", y = "Relative Abundance (%)", # label the axes
       title = "2021") + # give your plot a title
  theme(panel.grid.major = element_blank(), # get rid of grid lines
        panel.grid.minor = element_blank(),
        axis.text = element_text(size = 8), # set axis font size
        legend.position = "top", # position the legend at the top
        legend.background = element_rect(colour = "black")) # outline the legend
p2021 # View the plot.
# Add data labels
p2021 + geom_text(aes(label = species), size = 3) # add species labels to data points
That’s not very clear. We can use the geom_text_repel function in the ggrepel package to make sure the data labels don’t overlap.
(p2021_labels <- p2021 + geom_text_repel(aes(label = species), size = 3)) # Avoid data labels overlapping
# Including the plot code in () means the plot is automatically shown in the plot window.
Now we can plot the Rank Abundance Diagram for 1981.
# Plot 1981 Rank Abundance Diagram
(p1981 <- ggplot(trees_81, aes(x = rank, y = relative_abundance)) +
  geom_point(shape = 21, size = 2.25, aes(color = species, fill = species)) +
  geom_text_repel(aes(label = species), size = 3) + # add data labels
  geom_line(colour = "grey") +
  theme_bw() +
  labs(x = "Species Rank", y = "Relative Abundance (%)", title = "1981") +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text = element_text(size = 8),
        legend.position = "top",
        legend.background = element_rect(colour = "black")))
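Because the 1981 and 2021 plots share almost all of their code, one option is to wrap the plotting steps in a function. The sketch below is one possible way to do this, not the tutorial's own method: plot_rad() is a hypothetical helper name, the toy data is invented, and the function assumes a data frame with columns named rank, relative_abundance and species.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical helper: builds a Rank Abundance Diagram from a data frame
# with columns rank, relative_abundance and species
plot_rad <- function(data, plot_title) {
  ggplot(data, aes(x = rank, y = relative_abundance)) +
    geom_point(shape = 21, size = 2.25, aes(colour = species, fill = species)) +
    geom_text_repel(aes(label = species), size = 3) + # non-overlapping labels
    geom_line(colour = "grey") +
    theme_bw() +
    labs(x = "Species Rank", y = "Relative Abundance (%)", title = plot_title) +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          axis.text = element_text(size = 8),
          legend.position = "top",
          legend.background = element_rect(colour = "black"))
}

# Toy data (invented for illustration)
toy <- data.frame(species = c("oak", "birch", "pine"),
                  rank = 1:3,
                  relative_abundance = c(60, 30, 10))
p_toy <- plot_rad(toy, "Toy community")
```

With the tutorial's data frames, this would be called as plot_rad(trees_81, "1981") and plot_rad(trees_21, "2021").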
Now that we have our two plots, p2021_labels and p1981, we can format them and make a panel. Displaying both plots in a panel allows them to be easily compared. Adding the data labels means we don’t need the legend.
Use the code below to make the final plots and arrange and save them as a panel.
(p2021_final <- p2021_labels + theme(legend.position = "none")) # Remove legends from plots
(p1981_final <- p1981 + theme(legend.position = "none"))
panel <- grid.arrange(p1981_final, p2021_final, ncol = 1, top = "Tree Rank Abundance Diagrams") # Make panel using gridExtra and add title.
ggsave(filename = "Tree Rank Abundance Diagrams.jpg", plot = panel) # Save the panel to your working directory
We have our final output!
Obviously this is a fake dataset, but we can see that the community of tree species changed between 1981 and 2021. The relative abundance of species is more even in 2021 and we can compare how the different species have declined or increased in relative abundance over time.
That’s the end of the tutorial! In this tutorial we learned:
How to import data in RStudio.
How to tidy and transform data using dplyr functions.
For more on ggplot2, read the official ggplot2 cheatsheet.