Why does merge result in more rows than original data?

渐次进展 2020-11-28 13:35

When I merge two data frames, the result has more rows than the original data.

In this instance, the all dataframe has 104956

1 answer
  • 2020-11-28 14:25

    First, from ?merge:

    The rows in the two data frames that match on the specified columns are extracted, and joined together. If there is more than one match, all possible matches contribute one row each.
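
    A minimal sketch of that behaviour, using made-up toy data (the names left and lookup are hypothetical and not part of the question's data): every repeated key on one side produces an extra row in the result.

    ```r
    # Toy data (hypothetical): `left` has one row per id,
    # while `lookup` repeats id 2 three times and id 3 twice.
    left   <- data.frame(id = c(1, 2, 3), value = c("a", "b", "c"))
    lookup <- data.frame(id  = c(1, 2, 2, 2, 3, 3),
                         cls = c("u", "v", "w", "x", "y", "z"))

    merged <- merge(left, lookup, by = "id")

    nrow(left)    # rows before the merge
    nrow(merged)  # rows after: one per matching (left, lookup) pair
    ```

    Here left has 3 rows but the merged result has 1 + 3 + 2 = 6, because each row of left is paired with every matching row of lookup.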

    Using your link in the comments:

    url    <- "http://koeppen-geiger.vu-wien.ac.at/data/KoeppenGeiger.UScounty.txt"
    koppen <- read.table(url, header=TRUE, sep="\t")
    nrow(koppen)
    # [1] 3594
    length(unique(koppen$FIPS))
    # [1] 2789
    

    So koppen clearly has duplicated FIPS codes. Examining the dataset and the website, it appears that many counties fall into more than one climate class. For example, Anchorage, Alaska has three climate classes:

    koppen[koppen$FIPS==2020,]
    #     STATE    COUNTY FIPS CLS  PROP
    # 73 Alaska Anchorage 2020 Dsc 0.010
    # 74 Alaska Anchorage 2020 Dfc 0.961
    # 75 Alaska Anchorage 2020  ET 0.029
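
    As a rough sketch, a key column can be checked for duplicates before merging. The data frame koppen_demo below is a hypothetical stand-in: it reuses the three Anchorage rows shown above and adds one invented single-class county (FIPS 9999) purely for illustration.

    ```r
    # Hypothetical stand-in for koppen: the Anchorage rows from above
    # plus one invented county (FIPS 9999) with a single climate class.
    koppen_demo <- data.frame(
      FIPS = c(2020, 2020, 2020, 9999),
      CLS  = c("Dsc", "Dfc", "ET", "Cfa"),
      PROP = c(0.010, 0.961, 0.029, 1.000)
    )

    # anyDuplicated() returns 0 when every key is unique
    has_dups <- anyDuplicated(koppen_demo$FIPS) > 0

    # Count rows per key and keep only the keys that occur more than once;
    # these are the keys that will multiply rows in a merge
    counts   <- table(koppen_demo$FIPS)
    dup_keys <- counts[counts > 1]
    dup_keys
    ```

    Any key listed in dup_keys contributes that many rows for each matching row on the other side of the merge.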
    

    The solution depends on what you are trying to accomplish. If you want to extract all rows in all whose FIPS appears in koppen, either of these should work:

    # pass a one-column data frame so merge has a FIPS column to join on
    merge(all, unique(koppen["FIPS"]), by="FIPS")
    
    all[all$FIPS %in% unique(koppen$FIPS),]
    

    If you need to append the county and state name to all, use this:

    merge(all,unique(koppen[c("STATE","COUNTY","FIPS")]),by="FIPS")
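
    One way to confirm that a merge like this does not add rows is to check that the lookup table is unique on the key and then compare row counts. This is only a sketch: all_demo and koppen_demo are hypothetical stand-ins for all and koppen, and all.x = TRUE is an addition not used in the call above (it keeps rows of the left table that have no match, filling the new columns with NA).

    ```r
    # Hypothetical stand-ins for `all` and `koppen`
    all_demo <- data.frame(FIPS = c(2020, 2020, 9999, 1234), x = 1:4)
    koppen_demo <- data.frame(
      STATE  = c("Alaska", "Alaska", "Alaska", "Somewhere"),
      COUNTY = c("Anchorage", "Anchorage", "Anchorage", "Example"),
      FIPS   = c(2020, 2020, 2020, 9999),
      CLS    = c("Dsc", "Dfc", "ET", "Cfa")
    )

    # Drop CLS, then remove the now-identical rows: one row per FIPS remains
    lookup <- unique(koppen_demo[c("STATE", "COUNTY", "FIPS")])
    stopifnot(anyDuplicated(lookup$FIPS) == 0)

    # all.x = TRUE keeps FIPS 1234 even though the lookup has no entry for it
    merged <- merge(all_demo, lookup, by = "FIPS", all.x = TRUE)
    nrow(merged) == nrow(all_demo)
    ```

    Without all.x = TRUE the result can have fewer rows than the left table, since unmatched rows are dropped, but with a unique lookup key it can never have more.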
    

    EDIT: Based on the exchange in the comments below.

    So, since there are sometimes multiple rows in koppen with the same FIPS, but different CLS, we need a way to decide which of the rows (e.g., which CLS) to pick. Here are two ways:

    # this extracts the row with the largest value of PROP, for that FIPS
    url        <- "http://koeppen-geiger.vu-wien.ac.at/data/KoeppenGeiger.UScounty.txt"
    koppen     <- read.csv(url, header=TRUE, sep="\t")
    koppen     <- with(koppen,koppen[order(FIPS,-PROP),])
    sub.koppen <- aggregate(koppen,by=list(koppen$FIPS),head,n=1)
    result     <- merge(all, sub.koppen, by="FIPS")
    
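    # this extracts one whole row at random for each FIPS
    # (aggregate applies its function column by column, so sampling inside it
    #  could mix values from different rows; sample row indices instead)
    idx        <- vapply(split(seq_len(nrow(koppen)), koppen$FIPS),
                         function(i) i[sample.int(length(i), 1)], integer(1))
    sub.koppen <- koppen[idx, ]
    result     <- merge(all, sub.koppen, by="FIPS")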
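
    As an alternative sketch for the first approach (largest PROP per FIPS), the ordered data can be filtered with duplicated() instead of aggregate(). This is a swapped-in base R idiom, not the method used above, and koppen_demo is again a hypothetical stand-in built from the Anchorage rows plus one invented county.

    ```r
    # Hypothetical stand-in for koppen
    koppen_demo <- data.frame(
      FIPS = c(2020, 2020, 2020, 9999),
      CLS  = c("Dsc", "Dfc", "ET", "Cfa"),
      PROP = c(0.010, 0.961, 0.029, 1.000)
    )

    # Sort so that, within each FIPS, the largest PROP comes first
    ordered <- koppen_demo[order(koppen_demo$FIPS, -koppen_demo$PROP), ]

    # duplicated() flags every occurrence of a key after the first,
    # so negating it keeps exactly the first (largest PROP) row per FIPS
    top_koppen <- ordered[!duplicated(ordered$FIPS), ]
    top_koppen
    ```

    Because whole rows are kept, the result has the same columns as the input, without the extra Group.1 column that aggregate() adds when by is an unnamed list.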
    