Creating cusom IDs and merging data easily
These two dataset can be easily merged using IDs, but it can be hard to merge them without it, and using just the names. This as we saw in class, creates other unwanted columns. Let’s imagine the common scenario where we DON’T have a numeric ID. How do we create them?
The process is as follow:
- We manually create an ID for the
- We match the reef names which generates a vector of numbers;
- We create a new ID column called
- We merge the two datasets by ID;
Loading data and libraries
## Loading libraries library(tidyverse) # Easily Install and Load the 'Tidyverse', CRAN v1.3.0 library(readxl) # Read Excel Files, CRAN v1.3.1 ## Loading data # All data data_all <- read.csv("https://raw.githubusercontent.com/CBMC-GCMP/CBMC-GCMP.github.io/master/_data/monitoring_reef.csv") # Loading reef data with protection_status and coordinates reef <- read.csv("https://raw.githubusercontent.com/CBMC-GCMP/CBMC-GCMP.github.io/master/_data/reef_list.csv")
Step 1: creating an ID in the
Now, we want to manually create an ID column, we will call this
we can do this using the following code:
reef$IDReef <- 1:length(reef$Reef)
lenght() function returns how many rows or observations are in the
This approach is useful if we don’t want to specify a particular number, which can change in time (for example if we add more sites in our reference table).
Step 2: matching the names of the reefs and obtaining a numeric ID
matched_ID_reefs <- match(data_all$Reef, reef$Reef)
Step 3: Creating and ID in our main data
Now we just need to create an ID column into our data object
For convenience, we will call the column
IDReef to match the same name
as the one we created in the
reef data. We can easily achieve this by
using the code below:
data_all$IDReef <- matched_ID_reefs
Step 4: Merging the two dataset by ID
Finally, we just go with the usual
merge function. Note that the
number of observations in the
final dataset is the same as the one
from the original dataset
test <- merge(data_all, reef, by = "IDReef")
From here, we can save the dataset for future reference or just rerun the script!
R is FUN!!
Step 6: CELEBRATE!
Bonus step: Not matching names
That’s all good Fabio, but what if I work with someone who has the
tendency to make a lot of typos!! So my reef names DO not match
perfectly! No worries! R has a solution for this too of course! There is
here that uses a packages called
which allow for partial matching of the names. Partial matching means
that the strings have to be similar, and not identical, to be matched!
Pretty cool right?!
With this I can see the matching names even if there are some small differences in letters (e.g. typos), create an ID and then easily merge the two dataset. Of course, the best thing would be to detect the typo and correct it, but in thousands of observation this can be tricky! Remeber, good programmers are lazy!
So how can we do it?
First we need to install the
stringdist package we can use the
install.packages() function and then we load it using the
function as follow:
library(stringdist) # Approximate String Matching, Fuzzy Text Search, and String Distance Functions, CRAN v0.9.6.3
Then, we can achieve the partial matching by using the
function, you can see the the help page by running
match takes different arguments, the first is the list of names we have
(i.e. the complete reef list in our data), the second is a list of names
we want to compare it to (i.e. the names of the reefs in the reef table
data), the third is the
maxDist argument which set how much
“different” the string can be. Translated, if
maxDist is set to 3, we
can have 3 letters of difference in the names and these will still
match! Cool right? To be conservative we can set this to 1. This is
very useful in the case of typos etc. that you can’t or don’t want
to correct. Let’s get to it!
matched_ID_reefs <- amatch(data_all$Reef, reef$Reef, maxDist = 1)
You can then follow step 3 to 6 as they are to merge the two datasets!