How to merge two data.table by different column names?

问题

I have two data.table X and Y.

columns in X: area, id, value
columns in Y: ID, price, sales

Create the two data.tables:

X = data.table(area=c('US', 'UK', 'EU'),
               id=c('c001', 'c002', 'c003'),
               value=c(100, 200, 300)
              )

Y = data.table(ID=c('c001', 'c002', 'c003'),
               price=c(500, 200, 400),
               sales=c(20, 30, 15)
              )

And I set keys for X and Y:

setkey(X, id)
setkey(Y, ID)

Now I try to join X and Y by id in X and ID in Y:

merge(X, Y)
merge(X, Y, by=c('id', 'ID'))
merge(X, Y, by.x='id', by.y='ID')

All raised error saying that column names in the by argument invalid.

I referred to the manual of data.table and found the merge function not supporting by.x and by.y arguments.

How could I join two data.tables by different column names without changing the column names?

Append:
I managed to join the two tables by X[Y], but why merge function fails in data.table?

回答1:

Use this operation:

X[Y]
#    area   id value price sales
# 1:   US c001   100   500    20
# 2:   UK c002   200   200    30
# 3:   EU c003   300   400    15

or this operation:

Y[X]
#      ID price sales area value
# 1: c001   500    20   US   100
# 2: c002   200    30   UK   200
# 3: c003   400    15   EU   300

Edit after you edited your question, I read Section 1.12 of the FAQ: "What is the didifference between X[Y] and merge(X,Y)?", which led me to checkout ?merge and I discovered there are two different merge functions depending upon which package you are using. The default is merge.data.frame but data.table uses merge.data.table. Compare

merge(X, Y, by.x = "id", by.y = "ID") # which is merge.data.table
# Error in merge.data.table(X, Y, by.x = "id", by.y = "ID") : 
# A non-empty vector of column names for `by` is required.

with

merge.data.frame(X, Y, by.x = "id", by.y = "ID")
#     id area value price sales
# 1 c001   US   100   500    20
# 2 c002   UK   200   200    30
# 3 c003   EU   300   400    15

Edit for completeness based upon a comment by @Michael Bernsteiner, it looks like the data.table team is planning on implementing by.x and by.y into the merge.data.table function, but hasn't done so yet.

回答2:

As of data.table version 1.9.6 (on CRAN on sep 2015) you can specify the by.x and by.y arguments in data.table::merge

merge(x=X, y=Y, by.x="id", by.y="ID")[]
#     id area value price sales
#1: c001   US   100   500    20
#2: c002   UK   200   200    30
#3: c003   EU   300   400    15

However, in data.table 1.9.6 you can also specfy the on argument in the X[Y] notation

X[Y] syntax can now join without having to set keys by using the new on argument. For example: DT1[DT2, on=c(x = "y")] would join column "y" of DT2 with "x" of DT1. DT1[DT2, on="y"] would join column "y" of both data.tables.

X[Y, on=c(id = "ID")]
#   area   id value price sales
#1:   US c001   100   500    20
#2:   UK c002   200   200    30
#3:   EU c003   300   400    15

this answer by the data.table author has more details

回答3:

Merge fails when you use by.x and by.y with data.table. Taking your data:

> merge(X,Y, by.x='id', by.y='ID')
Error in merge.data.table(X, Y, by.x = "id", by.y = "ID")

You can use data.table with merge , but you need to use by argument for joining (so rename the columns to have the same colnames)

Y = setNames(Y,c('id','price','sales'))

This will still not work:

merge(X,Y, by.x='id', by.y='id')
Error in merge.data.table(X, Y, by.x = "id", by.y = "id") :

But this will work:

> merge(X,Y, by='id')
#     id area value price sales
#1: c001   US   100   500    20
#2: c002   UK   200   200    30
#3: c003   EU   300   400    15

Alternatively, you would need to convert data.table to data.frame in order to use merge with by.x and by.y arguments:

merge(data.frame(X), data.frame(Y), by.x='id', by.y='ID')

回答4:

You can also merge using multiple columns having different names. see example below

# create data frame authors
authors <- data.frame(
FirstName=c("Lorne", "Loren", "Robin",
              "Robin", "Billy"),
LastName=c("Green", "Jaye", "Green",
             "Howe", "Jaye"),
Age=c(82, 40, 45, 2, 40),
Income=c(1200000, 40000, 25000, 0, 27500),
Home=c("California", "Washington", "Washington",
    "Alberta", "Washington"))

# create data frame books Note First name in authors is same as AuthorFirstname same thing with lastname.
books <- data.frame(
        AuthorFirstName=c("Lorne", "Loren", "Loren",
            "Loren", "Robin", "Rich"),
        AuthorLastName=c("Green", "Jaye", "Jaye", "Jaye",
            "Green", "Calaway"),
        Book=c("Bonanza", "Midwifery", "Gardening",
        "Perennials", "Who_dun_it?", "Support"))

merge(authors, books, by.x=c("FirstName", "LastName"),
      by.y=c("AuthorFirstName", "AuthorLastName"),
      all.x=TRUE)

来源：https://stackoverflow.com/questions/29865512/how-to-merge-two-data-table-by-different-column-names

标签

merge

data.table