Let's start now.
Joining or Merging & Its Benefits
Merging or Joining is the process of combining two or more different results sets into one result set based on some of the unique columns/attributes and their values. It is required when we need related data into a single place for a particular entity or object. The result of a merge or join operation gives us a consolidated/overall idea about an object into one place and hence it makes us easier to understand and help us to make decisions faster.
For example, HR keeps the employee records, such as educational background and previous employment records, and MIS admin keeps records of the target and achievement of an employee. Similarly, Finance admin keeps the records of employee's expenses and investment on an employee like salary, allowances, etc. As we can see here the data is in different places. Once the company's owner, MD or CF, wants to view all these data into a single place at a time when we can get all these data by the joining or merging operation by employee id of a particular employee.
What will happen if we do not use join merge or join and here MD wants to know all the records? They need to appoint separately to each admin like Finance admin, HR admin, and so on to show the records of an employee. So, we can see here joining saves time and improves our productivity also.
The above image clarifies how different departments are joined in an organization.
R Data Frame Join
To join R Data Frame we use merge() function. We can learn about the merge() function in R studio itself by just typing either ?merge() or simply help(merge). It will show its definition, uses and different arguments it takes in detail as shown below. To know more about how to get help in R you can follow one of my previous article
Getting help in R. Additionally to know more about merge function in R you can follow the R documentation link
merge() function.
As we can see from the above images the basic syntax of a merge is as shown below.
Here x and y are the two different data frames or the object to be merged into one.
Argument Of R Merge() Function
The argument of the merge() function has been denoted by (, ...) as shown in the below image.
The important arguments of the merge() function have been listed in the table below.
Creating Data Frame For Example
We need to create two different data frames which we will use in R Data Frame merge operations below. We will create those data frames as shown below in the image.
To create the data frame of the name df.OrdersDetail, we will run the below R script in R studio.
-
- OrderID <- c("10308", "10309", "10310", "10853", "10950")
- CustomerID <- c(2, 37, 77, 55, 80)
- OrderDate <- c("18-06-2019", "19-03-2019", "20-01-2019", "20-05-2018", "20-03-2018")
-
- df.OrdersDetail <- data.frame(OrderID, CustomerID, OrderDate)
Similar to the above we will use the below R script to create the df.AddressDetails data frame.
-
- CustomerID <- c(1, 2, 3, 77, 85)
- CustomerName <- c("Alfreds Futterkiste", "Ana Trujillo Emparedados C-245",
- "Antonio Moreno Taquería", "Zachariah E. Somers",
- "Jonathan G. Steck")
- Address <- c("Obere Str. 57", "Avda. de la Constitución 2222",
- "Mataderos 2312", "1181 Lawman Avenue", "Mellingburgredder 82")
- City <- c("Berlin", "Alexandria", "México D.F.", "Alexandria","Herzogenaurach")
- PostalCode <- c(12209, 5021, 5023, 14850, 57860)
- Country <- c("Germany", "Mexico", "Mexico", "USA", "Germany")
-
-
- df.AddressDetails <- data.frame(CustomerID, CustomerName, Address, City,
- PostalCode, Country)
Now, as we have already created both data frames to be used for the merge operation. So, we will print both data frames to see their elements in R studio.
- df.OrdersDetail
- df.AddressDetails
When we run the above script it gives the output as shown in the below image which is similar to the above data frames we had decided to create earlier. Now we can use both the data frames easily whenever required.
Types Of R Data Frame Join
There are five important join types available in R to join or merge R Data Frame shown below.
- Natural join
- Full Outer join
- Left Outer join
- Right Outer join &
- Cross join
Now we shall learn about each in detail one by one.
R Data Frame Natural Join
The Natural join in R Data Frame merge works similar to SQL Server inner join. The output of natural join is based on the common values which exist on both sides of the data frame or object used in the merge.
The argument of natural join is all = FALSE. Below syntax is used for R data frame natural join.
-
- merge(<dfA>, <dfB>, by = ["<Column Name(s)>"], all = FALSE)
The below image demonstrates the output of the data frame's natural join in R.
The below example show the R data frame natural join in R studio. It will give the output as shown in the image below.
-
- merge(df.OrdersDetail, df.AddressDetails, by="CustomerID", all = FALSE)
Here the output is just only for CustomerID, 2 and 77 as it exists in both data frames.
R Data Frame Full Outer Join
It is just opposite to the Natural join and works as full outer join in SQL Server. The argument of natural join is all = TRUE. Below syntax is used for R Data Frame Full Outer Join.
-
- merge(<dfA>, <dfB>, by = ["<Column Name(s)>"], all = TRUE)
The output of the data frame full outer join is the combination of both data frames' merged column common values and union of both data frames' uncommon values. The below image shows how R Data Frame full outer join works.
Here is an example of R Data Frame Full Outer join in R studio.
-
- merge(df.OrdersDetail, df.AddressDetails, by="CustomerID", all = TRUE)
The output will be as shown in the image below. Here we need to remember that we have joined both data frames based on the CustomerID. Hence for the CustomerID which value will not be on another side of each data frame for that CustomerID value of other columns in join output will be shown as NA.
R Data Frame Left Outer Join
The result of the R data frame left outer join is all records of a left object/data frame and the matching records of the right object/data frame. Its output remains the same as the SQL Server left outer join.
The argument of R Data Frame's left outer is all = TRUE. The syntax of the data frame left outer join is as shown below.
-
- merge(<dfA>, <dfB>, by = ["<Column Name(s)>"], all.dfA = TRUE)
Below image demonstrate the R Data Frame left outer join.
Below R script shows the output of the R data frame left outer join in R studio with the data frames created earlier.
-
- merge(x = df.OrdersDetail, y = df.AddressDetails, by.x = "CustomerID", all.x = TRUE)
R Data Frame Right Outer Join
The R data frame Right Outer join works just opposite to the R data frame left outer join. Its output remains similar to the SQL Server right outer join as it returns all the rows of the right data frame and corresponding matching rows of the left data frame.
The argument of R data frame right outer join is all.y = TRUE. Below syntax is used for the R data frame right outer join.
-
- merge(x, y, by.y = ["<Column Name(s)>"], all.y = TRUE)
Here is an example of the R data frame right outer join.
-
- merge(x = df.OrdersDetail, y = df.AddressDetails, by.y = "CustomerID", all.y = TRUE)
The output of the R data frame right outer join will look as shown below.
R Data Frame Cross Join
To use the R data frame cross join we pass the argument by = NULL as shown below.
-
- merge(x = df.OrdersDetail, y = df.AddressDetails, by = NULL)
Important Points To Remember For R Data Frame Join
Case Sensitive
As we know R is case sensitive we should remember that FALSE and TRUE should be in upper case. Also, the name of the column in by should be exactly as in the data frame.
The below example shows the error when you pass “CustomerID” to “CustomerId” in merge() function. Although the name is the same the case is different so it is giving the error.
As shown in the image below it is saying to pass a valid column as “CustomerId” column is not there. It should be “CustomerID”.
Name & Data Type of Key
The name and number of the column should be the same in both the data frames to be merged into one otherwise it will show the below error.
Here as shown in the image below I have changed the name of the CustomerID column to the CustomerIDs which is not available in another data frame to which we are merging. So, it is giving the below error.
Summary
In this article, we have learned and explored what is the join or merge and what are the benefits of the merge. We learned also different types of merged available in R to merge two data frames.
I hope you learned and enjoyed it. I look forward to seeing your feedback.