Merge data frames in R (2024)

HOME DATA MANIPULATION IN R MERGE R

Data Manipulation in R Data transformation

The R merge function allows merging two data frames by common columns or by row names. This function allows you to perform different database (SQL) joins, like left join, inner join, right join or full join, among others. In this tutorial you will learn how to merge datasets in base R in the possible available ways with several examples.

Merge function in R

The syntax of the R merge function with a brief description of its arguments is shown in the following block of code:

merge(x, y, ...)# For data frames:merge(x, y, # Data frames or objects to be coerced by = intersect(names(x), names(y)), # Columns used for merging by.x = by, by.y = by, # Columns used for merging all = FALSE, # If TRUE, all.x = TRUE and all.y = TRUE all.x = all, all.y = all, # If TRUE, adds rows for each row in x (y) that not match a row in y (x). sort = TRUE, # Whether to sort the output by the 'by' columns suffixes = c(".x",".y"), # Suffixes for creating unique column names no.dups = TRUE, # Whether to avoid duplicated column names appending more suffixes or not incomparables = NULL, # How to deal with values that can not be matched ...) # Additional arguments

Note that the main method of the R merge function is for data frames. However, merge is a generic function that can be also used with other objects (like vectors or matrices), but they will be coerced to data.frame class.

R merge data frames

In order to create a reproducible example to show how to merge two data frames in R we are going to use the following sample datasets named df_1, that represents the id, name and monthly salary of some employees of a company and df_2, that shows the id, name, age and position of some employees.

set.seed(61)employee_id <- 1:10employee_name <- c("Andrew", "Susan", "John", "Joe", "Jack", "Jacob", "Mary", "Kate", "Jacqueline", "Ivy")employee_salary <- round(rnorm(10, mean = 1500, sd = 200))employee_age <- round(rnorm(10, mean = 50, sd = 8))employee_position <- c("CTO", "CFO", "Administrative", rep("Technician", 7))df_1 <- data.frame(id = employee_id[1:8], name = employee_name[1:8], month_salary = employee_salary[1:8])df_2 <- data.frame(id = employee_id[-5], name = employee_name[-5], age = employee_age[-5], position = employee_position[-5])df_1df_2

 # df_1 # df_2 id name month_salary id name age position 1 Andrew 1424 1 Andrew 40 CTO 2 Susan 1425 2 Susan 38 CFO 3 John 1156 3 John 54 Administrative 4 Joe 1570 4 Joe 66 Technician 5 Jack 1223 6 Jacob 38 Technician 6 Jacob 1462 7 Mary 53 Technician 7 Mary 1641 8 Kate 56 Technician 8 Kate 1603 9 Jacqueline 55 Technician 10 Ivy 43 Technician

Note that on a real life example, all ids will be unique but the names can be repeated. Also note that ‘Jack’ is missing in the second table (neither his age nor his position are available) and ‘Jacqueline’ and ‘Ivy’ are missing in the first (their monthly salaries are not available with the current data).

Inner join

An inner join (actually a natural join), is the most usual join of data sets that you can perform. It consists on merging two dataframes in one that contains the common elements of both, as described in the following illustration:

In order to merge the two sample data sets, you just have to pass them to the merge function without the need of changing other arguments, due to by default, the function merges the data sets by the common column names. In consequence, in this case, the function merges the data by two columns (id and name).

merge(x = df_1, y = df_2)merge(x = df_1, y = df_2, by = c("id", "name")) # Equivalent

 id name month_salary age position 1 Andrew 1424 40 CTO 2 Susan 1425 38 CFO 3 John 1156 54 Administrative 4 Joe 1570 66 Technician 6 Jacob 1462 38 Technician 7 Mary 1641 53 Technician 8 Kate 1603 56 Technician

As we pointed out before, ‘Jack’, ‘Ivy’ and ‘Jacqueline’ were not in both tables. In consequence, in the resulting output of this join they are missing.

Full (outer) join

The outer join, also known as full outer join or full join, merges all the columns of both data sets into one for all elements:

In order to create a full outer join of the two data frames in R you have to set the argument all to TRUE:

merge(x = df_1, y = df_2, all = TRUE)

 id name month_salary age position 1 Andrew 1424 40 CTO 2 Susan 1425 38 CFO 3 John 1156 54 Administrative 4 Joe 1570 66 Technician 5 Jack 1223 NA <NA> # <-- NA values 6 Jacob 1462 38 Technician 7 Mary 1641 53 Technician 8 Kate 1603 56 Technician 9 Jacqueline NA 55 Technician # <-- NA value 10 Ivy NA 43 Technician # <-- NA value

As not all rows in the first data frame match all the rows in the second, the output is filled with NA values in those cases.

Left (outer) join in R

The left join in R consist on matching all the rows in the first data frame with the corresponding values on the second. Recall that ‘Jack’ was on the first table but not on the second.

In order to create the join, you just have to set all.x = TRUE as follows:

merge(x = df_1, y = df_2, all.x = TRUE)

 id name month_salary age position 1 Andrew 1424 40 CTO 2 Susan 1425 38 CFO 3 John 1156 54 Administrative 4 Joe 1570 66 Technician 5 Jack 1223 NA <NA> # <-- NA values 6 Jacob 1462 38 Technician 7 Mary 1641 53 Technician 8 Kate 603 56 Technician

As the employee corresponding to id = 5 is on the first data set but not on the second, the corresponding missing values are displayed as NA.

Right (outer) join in R

The right join in R is the opposite of the left outer join. In this case, the merge consists on joining all the rows in the second data frame with the corresponding on the first.

In consequence, you will need to set the argument all.y to TRUE to join the datasets this way.

merge(x = df_1, y = df_2, all.y = TRUE)

 id name month_salary age position 1 Andrew 1424 40 CTO 2 Susan 1425 38 CFO 3 John 1156 54 Administrative 4 Joe 1570 66 Technician 6 Jacob 1462 38 Technician 7 Mary 1641 53 Technician 8 Kate 1603 56 Technician 9 Jacqueline NA 55 Technician # <-- Note the difference 10 Ivy NA 43 Technician # <-- with the left join

As ‘Jacqueline’ and ‘ Ivy’ are on the second data set but not on the first, the corresponding values of monthly salary are not available.

Cross join

The R cross join performs the Cartesian product of the datasets df_1 and df_2:

You can create a cross join in R setting as NULL the argument by of the R merge function. Note that we only show the first rows of the output, as it is very large.

Merged <- merge(x = df_1, y = df_2, by = NULL)head(Merged)

 id.x name.x month_salary id.y name.y age position 1 Andrew 1424 1 Andrew 40 CTO 2 Susan 1425 1 Andrew 40 CTO 3 John 1156 1 Andrew 40 CTO 4 Joe 1570 1 Andrew 40 CTO 5 Jack 1223 1 Andrew 40 CTO 6 Jacob 1462 1 Andrew 40 CTO

Merge rows in R

You can also merge data frames by row names. For illustration purposes, consider the following datasets:

df1 <- data.frame(var = c("one", "two", "three", "four", "five"), data = c(1, 5, 1, 6, 8))rownames(df1) <- c("A", "B", "C", "D", "E")df1df2 <- data.frame(var = c("three", "one", "eight", "two", "nine"), data = c(1, 5, 1, 6, 8))rownames(df2) <- c("E", "A", "B", "D", "C")df2

# df1 # df2 var data var dataA one 1 E three 1B two 5 A one 5C three 1 B eight 1D four 6 D two 6E five 8 C nine 8

In this case, in order to join the data frames by the row names you have to set the argument by to 0 or to "row.names".

merge(df1, df2, by = 0, all = TRUE) merge(df1, df2, by = "row.names", all = TRUE) # Equivalent

 Row.names var.x data.x var.y data.y1 A one 1 one 52 B two 5 eight 13 C three 1 nine 84 D four 6 two 65 E five 8 three 1

As you can observe, the output contains as many rows as different row names. Note that we applied a full outer join (as in this case it is equivalent to a left and right join), but you could join the data as you want.

Merge more than two dataframes in R

Finally, it is worth to mention that you can iteratively merge data frames in R, concatenating the merge function. Consider, for instance the following data frames:

x <- data.frame(id = 1:4, year = 1995:1998)xy <- data.frame(id = c(4, 1, 3, 2), year = c(1998, 1995, 1997, 1996), age = c(22, 25, 23, 24))yz <- data.frame(id = c(1, 2, 3), year = 1995:1997, wage = c(1000, 1200, 1599))z

# x # y # z id year id year age id year wage 1 1995 4 1998 22 1 1995 1000 2 1996 1 1995 25 2 1996 1200 3 1997 3 1997 23 3 1997 1599 4 1998 2 1996 24

You can merge the three data frames merging two and then merging the output with the remaining data set.

merge(x, merge(y, z))

id year age wage1 1995 25 10002 1996 24 12003 1997 23 1599

Note that you can specify the arguments you prefer for each join and that you can concatenate as many merges as you need.

merge(x, merge(y, z, all = TRUE), all = TRUE)

id year age wage1 1995 25 10002 1996 24 12003 1997 23 15994 1998 22 NA

A cleaner alternative is to use the Reduce function as follows, so instead of concatenating the merge functions, you can specify all the data frames inside a list. Nonetheless, you will have to specify the same arguments for all joins.

Reduce(function(x, y) merge(x, y), list(x, y, z))

id year age wage1 1995 25 10002 1996 24 12003 1997 23 1599