rank-abundance-tutorial

Link for tutorial - https://darahubert.github.io/rank-abundance-tutorial/

Tutorial Aims

1. Introducing the Rank Abundance Diagram
2. Setting up
- RStudio
- Importing data
3. Data wrangling - using dplyr
- a) Tidy data
- b) Transforming data - creating new variables, ranking data
4. Data visualisation: plotting a Rank Abundance Diagram - using ggplot2 and ggrepel
5. Making a panel - using gridExtra

This tutorial takes you through the steps to make basic Rank Abundance Diagrams. It is aimed at beginners, and whilst some knowledge of dplyr and ggplot2 is helpful, it’s not vital. We will run through all the steps, from importing, wrangling and manipulating the data to visualising it and saving your plots. If you want to understand more about the dplyr functions we are using or make your plots more beautiful, you can check out the Coding Club Tutorials. First, let’s download the files you need.

You can download the dataset, trees_messy.csv for this tutorial from the data folder in this GitHub repository. Just download the zip file and unzip it.

1. Introducing the Rank Abundance Diagram

Global biodiversity is declining, which means ecological communities are changing. Rank Abundance Diagrams give us a way to visualise community composition based on the relative abundance of each species, a key component of biodiversity. Species within the community are ranked by abundance, from most abundant to least abundant. By plotting the rank of each species against its relative abundance, we can easily visualise the community composition and compare communities over time.

Let’s make a start!

2. Setting up

RStudio

Open RStudio and create a new script by clicking on File/ New File/ R Script. Now we need to set the working directory. This will be where your script and any outputs are saved, and also where RStudio will look for datasets. If you haven’t done so already, it’s useful to move the dataset you just downloaded for this tutorial into a new folder. Then set that folder as your working directory. To do this, select Session/ Set working directory and then select your folder.

# Set the working directory
setwd("your_filepath")  # enter your own file path here 
getwd() # run this to check that you have set your working directory correctly.

To run the code, highlight the line and press Ctrl and R (Windows PC) or Command and Enter (Mac). You can also highlight the code and click the Run icon in the top right-hand corner of the script window.

We are going to need some packages installed to complete this tutorial - don’t panic! It’s easy to install them if you haven’t done so previously. If you have already used them, you will just need to load the libraries.

We are going to use the tidyverse package, which includes dplyr (for data wrangling) and ggplot2 (for data visualisation). We are also going to need ggrepel to tidy our plots up and gridExtra to make a final panel. Install these packages if you haven’t done so already, then load the libraries.

Load packages

# Load libraries
library(tidyverse)
library(ggplot2)
library(ggrepel)
library(gridExtra)

# If a package is not installed yet, install it first, e.g. install.packages("ggrepel"), and then load the libraries.
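
If you are not sure which of these packages are already installed, a small check like the one below can help. This is a minimal sketch in base R: the names `needed`, `installed` and `missing_pkgs` are just illustrative, and the code only reports what is missing rather than installing anything for you.

```r
# Packages used in this tutorial
needed <- c("tidyverse", "ggrepel", "gridExtra")

# requireNamespace() tries to load each package's namespace and returns
# TRUE if it succeeds, without attaching the package
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  # Build the install command so it can be copied into the console
  message("Install with: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  message("All packages are installed - go ahead and load the libraries.")
}
```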

Importing data

Time to import your dataset. We are going to be working with a fake dataset showing populations of tree species. You can click on Import Dataset and import the dataset you downloaded earlier. Alternatively, use the read.csv() command to tell RStudio where your dataset is - hopefully in the working directory you specified earlier! If you use the Import Dataset button, copy the code from the console and insert it into your script to save any confusion in the future.

Read in the trees_messy.csv and make it a data frame object trees_messy using <-

# Load the data
trees_messy <- read.csv("~/Desktop/data_science/tutorial-darahubert/trees_messy.csv")  
# Your file path will be different.

Let’s explore the data and see what we are working with.

# Preview data
head(trees_messy)  # view the first few observations
str(trees_messy)  # view the types of variables
names(trees_messy)  # view the column names
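
It can also help to count the missing values in each column before tidying. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the tutorial dataset); swap in `trees_messy` to check your own data.

```r
# Hypothetical toy data with a couple of missing values
toy <- data.frame(
  year       = c(1981, 1981, 2021, 2021),
  species    = c("Oak", "Birch", "Oak", NA),
  population = c(120, NA, 95, 40)
)

# is.na() flags missing values; colSums() counts them per column
na_counts <- colSums(is.na(toy))
na_counts

# Number of complete rows, i.e. rows that would be left after na.omit()
nrow(na.omit(toy))
```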

Right, we are ready to start tidying the data. If any of the above doesn’t make sense, have a look at this tutorial on getting started with R and RStudio.

3. Data wrangling

Data wrangling involves organising the data so that we can easily manipulate and visualise it. We want it to be in a tidy format, where each row shows an observation and each column shows a variable. If that doesn’t make sense, take a look at this Coding Club tutorial. We may also need to clean the data (get rid of NA values, and make sure variable names and classifications make sense and are grouped appropriately) or transform the dataset to create appropriate values for our data visualisation.

The next steps will talk you through how to tidy your data and then transform it ready for data visualisation.

3a) Tidy the data

Let’s view the entire dataset.

view(trees_messy)

The dataset itself is in a tidy format but it could do with some cleaning up.

We need to get rid of the NA values, change the column names to lower case and get rid of any columns we don’t need. Before we start getting rid of data, it’s good to think about our final plot and what we want to show.

The dataset contains data for two years, 1981 and 2021. We want to make a plot for each year showing the relative abundance and the rank of each species. Then we can visualise how the community has changed over time. Let’s get rid of the columns for date and phylum. We will keep the columns for year, species and population.

# Tidy the dataset
trees_tidy <- select(trees_messy, -date, -phylum) %>%  # delete the unwanted date and phylum columns
  na.omit()  # delete all rows that contain NA values

colnames(trees_tidy) <- c("year", "species", "population") # change column names
head(trees_tidy)  # Check changes have been made

We have used a pipe %>% to pass the output of select() on the left into the function na.omit() on the right. It’s a handy operator that makes code easier to read. You can find more about pipes in this tutorial.
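
To see what the pipe is doing, the sketch below writes the same two steps with and without it. It assumes the dplyr package is installed (it is loaded as part of tidyverse), and `toy` is a made-up data frame used purely for illustration.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- data.frame(
  date       = c("01/06", "02/06", "03/06"),
  species    = c("Oak", "Birch", "Rowan"),
  population = c(120, NA, 40)
)

# With a pipe: the result of select() is passed on to na.omit()
piped <- select(toy, -date) %>%
  na.omit()

# Without a pipe: the same steps, nested inside each other
nested <- na.omit(select(toy, -date))

identical(piped, nested)  # the two approaches give the same result
```

The nested version has to be read from the inside out, whereas the piped version reads in the order the steps happen.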

Great! Now we want to split our dataset by year. To do this we are going to use the filter() command and make a new data frame object <- for each year.

trees_81 <- trees_tidy %>% filter(year == 1981)  # data frame for 1981
trees_21 <- trees_tidy %>% filter(year == 2021)  # data frame for 2021

3b) Transform Data

Perfect. Let’s transform the data so we can plot it in a Rank Abundance Diagram. The diagrams will show the rank of each species (x axis) and the relative abundance of each species (y axis). Let’s deal with the ranking first.

Ranking the data

Rank Abundance Diagrams rank the species with the highest relative abundance as 1. We want to rank species based on their population in descending order.

We will add a new column to each data frame using $. Putting a - before trees_21$population ranks in descending order. ties.method = "random" allocates ranks at random between species that have the same population size (we want all ranks to be unique).

# Ranking the data

trees_21$rank <- rank(-trees_21$population, ties.method = "random")

# create a new column called rank in the trees_21 data frame.
# use rank() with - to rank the species in descending order of population size.
# use ties.method = "random" to give each species a unique rank, breaking any ties at random.

# Run this code to repeat the steps for trees_81
trees_81$rank <- rank(-trees_81$population, ties.method = "random")
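
To see how ties.method changes the result, compare two options on a small made-up vector. The values in `pop` are hypothetical, and `set.seed()` is only there so that the random tie-breaking is repeatable from run to run.

```r
# Hypothetical population sizes; the second and third species are tied
pop <- c(50, 20, 20, 5)

# "min" gives tied species the same rank, so ranks are not unique
rank_min <- rank(-pop, ties.method = "min")
rank_min  # 1 2 2 4

# "random" breaks the tie at random, so every species gets a unique rank
set.seed(1)  # makes the random tie-break reproducible
rank_random <- rank(-pop, ties.method = "random")
sort(rank_random)  # 1 2 3 4
```

With "min", the two tied species would sit on top of each other on the x axis, which is why a unique rank per species is convenient for plotting.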

Relative Abundance

We now want to know the total population in each year so we can calculate the relative abundance of each species. We can do this using the sum() and transform() commands. transform() makes a new column based on existing values. For each year, we will divide the population size of each species by the total population and multiply by 100. We will also save our final data frames as .csv files.

# Save the total population size of each year as an object (a single number)
total81 <- sum(trees_81$population)

total21 <- sum(trees_21$population)

# Use the transform function to add a new column for relative abundance.
trees_21 <- transform(trees_21,relative_abundance = (population/total21)*100) # make a new column for relative abundance
trees_81 <- transform(trees_81,relative_abundance = (population/total81)*100) # make a new column for relative abundance


# Save the final datasets to your working directory
write.csv(trees_21, file = "trees21_final.csv")
write.csv(trees_81, file = "trees81_final.csv")
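
Before plotting, it is worth a quick check that the relative abundances in each year add up to 100. The sketch below works through the calculation on a made-up data frame (`toy` and its values are hypothetical); the same check can be run on trees_21 and trees_81.

```r
# Hypothetical community of three species
toy <- data.frame(
  species    = c("Oak", "Birch", "Rowan"),
  population = c(50, 30, 20)
)

total <- sum(toy$population)  # total population = 100

# Relative abundance = (species population / total population) * 100
toy <- transform(toy, relative_abundance = (population / total) * 100)
toy$relative_abundance  # 50 30 20

# The percentages should sum to 100 (all.equal() allows for rounding error)
all.equal(sum(toy$relative_abundance), 100)
```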

Great! Now we have our data to plot. Let’s move on to data visualisation.

4. Data Visualisation

We want to show the relative abundance of each species plotted against the rank of each species for both 1981 and 2021, and connect the data points. We will plot a connected scatterplot using ggplot2. This uses geom_point() to plot the data points and geom_line() to connect them. There is so much more you can do with ggplot2, so if you want to find out more about data visualisation, be sure to have a look at this data visualisation tutorial.

For now, let’s make our first Rank Abundance Diagram for 2021, using the trees_21 data frame.

p2021 <- ggplot(trees_21, aes(x=rank, y=relative_abundance)) +  # set x and y axis
  geom_point(shape=21, size=2.25, aes(color=species, fill=species)) + # plot data points and colour them according to species.
  geom_line(colour = "grey") +  # plot line to connect data points
  theme_bw() +  # set theme for plot
  labs(x="Species Rank",y="Relative Abundance (%)", # label the axes
       title = "2021") +  # give your plot a title
  theme(panel.grid.major = element_blank(),  # get rid of grid lines
        panel.grid.minor = element_blank(),
        axis.text = element_text(size = 8),  # set axis font size
        legend.position = "top",  # position the legend at the top
        legend.background = element_rect(colour="black"))  # outline the legend

p2021  # View the plot.

# Add data labels 
p2021 + geom_text(aes(label=species), size = 3)  # add species labels to data points

That’s not very clear. We can use the geom_text_repel in the ggrepel package to make sure the data labels don’t overlap.

(p2021_labels <- p2021 + geom_text_repel(aes(label=species), size = 3))  # stop the data labels overlapping

# Including the plot code in () means the plot is automatically shown in the plot window. 
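
The same trick works for any assignment in R, not just plots. A minimal base R illustration, using a hypothetical variable `n_species`:

```r
# Plain assignment: the value is stored but nothing is printed
n_species <- 12

# Wrapped in (): the value is stored and printed straight away
(n_species <- 12)
```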

Now we can plot the Rank Abundance Diagram for 1981.

# Plot 1981 Rank Abundance Diagram
(p1981 <- ggplot(trees_81, aes(x=rank, y=relative_abundance)) +
    geom_point(shape=21, size=2.25, aes(color=species, fill=species)) +
    geom_text_repel(aes(label=species), size = 3) +  # add data labels
    geom_line(colour = "grey") +
    theme_bw() + 
    labs(x="Species Rank",y="Relative Abundance (%)", title = "1981") +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(), 
          axis.text = element_text(size = 8),
          legend.position = "top",
          legend.background = element_rect(colour="black"))) 

Now we have our two plots `p2021_labels` and `p1981`, we can format them and make a panel.

5. Making a panel

We can display both plots in a panel which allows them to be easily compared. Adding the data labels means we don’t need the legend.

Use the code below to make the final plots and arrange and save them as a panel.

(p2021_final <- p2021_labels + theme(legend.position = "none"))  # Remove legends from plots
(p1981_final <- p1981 + theme(legend.position = "none"))

panel <- grid.arrange(p1981_final,p2021_final,ncol=1, top = "Tree Rank Abundance Diagrams")  # Make panel using gridExtra and add title. 

ggsave(panel,filename = "Tree Rank Abundance Diagrams.jpg")

We have our final output!

Obviously this is a fake dataset, but we can see that the community of tree species has changed between 1981 and 2021. The relative abundance of species is more even in 2021, and we can compare how the different species have declined or increased in relative abundance over time.

A summary…

That’s the end of the tutorial! In this tutorial we learned:

- How to set a working directory, start new scripts and import data into RStudio.
- How to wrangle and transform data using some of the dplyr functions.
- How to make a Rank Abundance Diagram.
- How to make a panel and save the output.

For more on ggplot2, read the official ggplot2 cheatsheet.

Check out our Useful links page where you can find loads of guides and cheatsheets.

If you have any questions about completing this tutorial, please contact us on ourcodingclub@gmail.com

We would love to hear your feedback on the tutorial, whether you did it in the classroom or online!
