问题
I have two tables I wish to left_join through the dplyr package. The issue is that is produces NA values for all new columns (the ones I'm after).
As you can see below, the left_join procudes NA values for the new column of Incep.Price and DayCounter. Why does this happen, and how can this be resolved?
Update: Thanks to @akrun, using left_join(Avanza.XML, checkpoint, by = c('Firm' = 'Firm')) solves the issue and the columns are joined correctly.
However the warning message is sitll the same, could someone explain this behaviour? Why one must in this case explicitly specify the join columns, or otherwise produce NA values?
> head(Avanza.XML)
Firm Gain.Month.1 Last.Price Vol.Month.1
1 Stockwik Förvaltning 131.25 0.074 131264420
2 Novestra 37.14 7.200 605330
3 Bactiguard Holding 29.55 14.250 2815572
4 MSC Group B 20.87 3.070 671855
5 NeuroVive Pharmaceutical 18.07 9.800 3280944
6 Shelton Petroleum B 16.21 3.800 2135798
> head(checkpoint)
Firm Gain.Month.1 Last.Price Vol.Month.1 Incep.Price DayCounter
1 Stockwik Förvaltning 87.50 0.06 91270090 0.032000 2016-01-25
2 Novestra 38.10 7.25 604683 5.249819 2016-01-25
3 Bactiguard Holding 29.09 14.20 2784161 11.000077 2016-01-25
4 MSC Group B 27.56 3.24 657699 2.539981 2016-01-25
5 Shelton Petroleum B 19.27 3.90 1985305 3.269892 2016-01-25
6 NeuroVive Pharmaceutical 16.87 9.70 3220303 8.299820 2016-01-25
> head(left_join(Avanza.XML, checkpoint))
Joining by: c("Firm", "Gain.Month.1", "Last.Price", "Vol.Month.1")
Firm Gain.Month.1 Last.Price Vol.Month.1 Incep.Price DayCounter
1 Stockwik Förvaltning 131.25 0.074 131264420 NA <NA>
2 Novestra 37.14 7.200 605330 NA <NA>
3 Bactiguard Holding 29.55 14.250 2815572 NA <NA>
4 MSC Group B 20.87 3.070 671855 NA <NA>
5 NeuroVive Pharmaceutical 18.07 9.800 3280944 NA <NA>
6 Shelton Petroleum B 16.21 3.800 2135798 NA <NA>
Warning message:
In left_join_impl(x, y, by$x, by$y) :
joining factors with different levels, coercing to character vector
回答1:
There are two problems.
Not specifying the
byargument inleft_join: In this case, by default all the columns are used as the variables to join by. If we look at the columns - "Gain.Month.1", "Last.Price", "Vol.Month.1" - allnumericclass and do not have a matching value in each of the datasets. So, it is better to join by "Firm"left_join(Avanza.XML, checkpoint, by = "Firm")The "Firm" column class -
factor: We get warning when there is difference in thelevelsof thefactorcolumn (if it is the variable that we join by). In order to remove the warning, we can either convert the "Firm" column in both datasets tocharacterclassAvanza.XML$Firm <- as.character(Avanza.XML$Firm) checkpoint$Firm <- as.character(checkpoint$Firm)
Or if we still want to keep the columns as factor, then change the levels in the "Firm" to include all the levels in both the datasets
lvls <- sort(unique(c(levels(Avanza.XML$Firm),
levels(checkpoint$Firm))))
Avanza.XML$Firm <- factor(Avanza.XML$Firm, levels=lvls)
checkpoint$Firm <- factor(checkpoint$Firm, levels=lvls)
and then do the left_join.
来源:https://stackoverflow.com/questions/35016377/dplyrleft-join-produce-na-values-for-new-joined-columns