Reshape a dataframe to long format with multiple sets of measure columns [duplicate]

Submitted by 帅比萌擦擦* on 2019-11-26 17:51:58

Question


This question already has an answer here:

  • Reshaping multiple sets of measurement columns (wide format) into single columns (long format) 7 answers

I have an R dataframe that I scraped from the internet using readHTMLTable() in the XML package. The table looks like the following excerpt with multiple variables/columns for population and year. (Note that the years are not duplicated across columns and represent a unique identifier for population.)

        year1   pop1      year2   pop2     year3   pop3     
1                                                        
2       16XX    4675,0    1900    6453,0    1930   9981,2       
3       17XX    4739,3    1901    6553,5    1931   ...      
4       17XX    4834,0    1902    6684,0    1932   
5       180X    4930,0    1903    6818,0    1933        
6       180X    5029,0    1904    6955,0    1934        
7       181X    5129,0    1905    7094,0    1935
8       181X    5231,9    1906    7234,7    1936
9       182X    5297,0    1907    7329,0    1937
10      182X    5362,0    1908    7422,0    1938

I would like to reorganize the data into just two columns, one for year and one for population that looks like the following:

        year    pop     
1                                                        
2       16XX    4675,0
3       17XX    4739,3  
4       17XX    4834,0  
5       180X    4930,0
6       180X    5029,0  
7       181X    5129,0
8       181X    5231,9  
9       182X    5297,0
10      182X    5362,0  
11      1900    6453,0
12      1901    6553,5
13      1902    6684,0
...     ...     ...
21      1930    9981,2
22      ... 

The values from the variables/columns year2 and year3 are appended below year1, as are the corresponding population values.

I have considered the following:

(1) Looping over the population and year columns (n>2) and adding those values as new observations to year1 and population1 will work, but this seems unnecessarily cumbersome.

(2) I have tried melt as below, but either it cannot handle the id variable split across multiple columns, or I am not implementing it correctly.

df.melt <- melt(df, id=c("year1", "year2", ...))

(3) Lastly, I considered pulling out each year column as its own vector, and appending each of those vectors together as here:

year.all <- c(df$year1, df$year2,...)

However, the above returns the following for year.all

[1]  1  2  3  3  4  4  5  5  6  6  7  8  8  9  9  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24  1  1  2 ...

rather than this

[1] 16XX 17XX 17XX 180X 180X 181X 181X 182X 182X 1900 1901 1902...
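A plausible explanation for the integers, assuming the scraped columns were stored as factors (which the output above suggests): in the R versions of that era, c() on factors returned the underlying integer codes rather than the labels. A minimal sketch on a hypothetical two-row miniature of the table, converting to character before concatenating:

```r
# Hypothetical miniature of the scraped table; the year columns are factors
df <- data.frame(year1 = factor(c("16XX", "17XX")),
                 year2 = factor(c("1900", "1901")))

# Converting each column to character first keeps the labels
year.all <- c(as.character(df$year1), as.character(df$year2))
year.all

# These are the integer codes stored behind the factor labels
codes <- c(as.integer(df$year1), as.integer(df$year2))
codes
```

Passing stringsAsFactors = FALSE to readHTMLTable() when scraping may avoid the issue at the source, if the installed XML version supports that argument.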

If there is a straightforward way of accomplishing this reorganizing I would love to learn it. Many thanks for the help.


Answer 1:


If the 'year' and 'pop' columns alternate, we can subset with c(TRUE, FALSE) to get columns 1, 3, 5, etc. and with c(FALSE, TRUE) to get columns 2, 4, 6, etc., because the logical index is recycled across the columns. Then we unlist the selected columns and create a new data.frame.

 df2 <- data.frame(year=unlist(df1[c(TRUE, FALSE)]), 
                  pop=unlist(df1[c(FALSE, TRUE)]))
 row.names(df2) <- NULL
 head(df2)
 #   year    pop
 #1            
 #2 16XX 4675,0
 #3 17XX 4739,3
 #4 17XX 4834,0
 #5 180X 4930,0
 #6 180X 5029,0
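To make the recycling step easier to follow, here is a minimal self-contained sketch on a cut-down copy of the data (two rows, two column pairs). The names toy, year_cols, pop_cols and long are only illustrative and not part of the answer above.

```r
# Cut-down copy of the question's data with alternating year/pop columns
toy <- data.frame(year1 = c("16XX", "17XX"), pop1 = c("4675,0", "4739,3"),
                  year2 = c("1900", "1901"), pop2 = c("6453,0", "6553,5"),
                  stringsAsFactors = FALSE)

# c(TRUE, FALSE) is recycled across the four columns: TRUE FALSE TRUE FALSE
year_cols <- names(toy[c(TRUE, FALSE)])   # "year1" "year2"
pop_cols  <- names(toy[c(FALSE, TRUE)])   # "pop1"  "pop2"

# unlist() concatenates the selected columns one after another, in column order
long <- data.frame(year = unlist(toy[year_cols], use.names = FALSE),
                   pop  = unlist(toy[pop_cols],  use.names = FALSE),
                   stringsAsFactors = FALSE)
long
```

This relies on the columns strictly alternating; if the scraped table has an extra column somewhere, the recycled index will silently select the wrong ones.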

Or another option is

library(splitstackshape)
merged.stack(transform(df1, id=1:nrow(df1)), var.stubs=c('year', 'pop'), 
        sep='var.stubs')[order(.time_1), 3:4, with=FALSE]

Data:

df1 <- structure(list(year1 = c("", "16XX", "17XX", "17XX", "180X", 
"180X", "181X", "181X", "182X", "182X"), pop1 = c("", "4675,0", 
"4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0", 
"5362,0"), year2 = c(NA, 1900L, 1901L, 1902L, 1903L, 1904L, 1905L, 
1906L, 1907L, 1908L), pop2 = c("", "6453,0", "6553,5", "6684,0", 
"6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"), 
year3 = c(NA, 1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L, 
1937L, 1938L), pop3 = c("", "9981,2", "", "", "", "", "", 
"", "", "")), .Names = c("year1", "pop1", "year2", "pop2", 
"year3", "pop3"), class = "data.frame", row.names = c(NA, -10L))



Answer 2:


Using the new feature in melt from data.table v1.9.5+:

require(data.table) # v1.9.5+
melt(setDT(df), measure = patterns("^year", "^pop"), value.name = c("year", "pop"))

The data.table vignette "Efficient reshaping using data.tables" covers this feature in more detail.
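For a self-contained illustration, here is a small sketch of the same melt() call on a cut-down copy of the data. It assumes data.table 1.9.6 or later is installed; the names dt and long are only illustrative, and every column is kept as character here to sidestep type coercion between the year columns.

```r
library(data.table)

# Cut-down copy of the question's data, built directly as a data.table
dt <- data.table(year1 = c("16XX", "17XX"), pop1 = c("4675,0", "4739,3"),
                 year2 = c("1900", "1901"), pop2 = c("6453,0", "6553,5"))

# Each pattern defines one group of columns that is stacked into one output column
long <- melt(dt, measure.vars = patterns("^year", "^pop"),
             value.name = c("year", "pop"))

# melt() adds a 'variable' column recording which column pair a row came from;
# drop it if only year and pop are wanted
long[, variable := NULL]
long
```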




Answer 3:


Another option is to use split.default to split the dataframe into a list of dataframes and then bind them together:

lst <- lapply(split.default(df1, sub('.*(\\d)', '\\1', names(df1))),
              setNames, c('year','pop'))

do.call(rbind, lst)

which gives the desired result:

    year     pop
1.1 16XX  4675,0
1.2 17XX  4739,3
1.3 17XX  4834,0
1.4 180X  4930,0
1.5 180X  5029,0
1.6 181X  5129,0
1.7 181X  5231,9
1.8 182X  5297,0
1.9 182X  5362,0
2.1 1900  6453,0
2.2 1901  6553,5
2.3 1902  6684,0
2.4 1903  6818,0
2.5 1904  6955,0
2.6 1905  7094,0
2.7 1906  7234,7
2.8 1907  7329,0
2.9 1908  7422,0
3.1 1930  9981,2
3.2 1931 10583,5
3.3 1932  8671,0
3.4 1933  9118,0
3.5 1934  9625,0
3.6 1935  8097,0
3.7 1936  7984,7
3.8 1937  8729,0
3.9 1938 10462,0
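To see why this works, it can help to look at the grouping key that sub() produces before the split. A minimal sketch on a cut-down copy of the data (the names toy, key and pieces are only illustrative):

```r
# Cut-down copy of the data used above
toy <- data.frame(year1 = c("16XX", "17XX"), pop1 = c("4675,0", "4739,3"),
                  year2 = c("1900", "1901"), pop2 = c("6453,0", "6553,5"),
                  stringsAsFactors = FALSE)

# Reduce each column name to its trailing digit: "1" "1" "2" "2"
key <- sub('.*(\\d)', '\\1', names(toy))

# split.default() treats the data frame as a list of columns and groups
# them by key, giving one two-column data frame per year/pop pair
pieces <- split.default(toy, key)

# Give every piece the same column names so that rbind() can stack them
pieces <- lapply(pieces, setNames, c("year", "pop"))
long <- do.call(rbind, pieces)
long
```

The row names such as 1.1 and 2.1 come from rbind() combining the list names with the original row numbers. Note that the pattern keeps only the last digit, so it would need adjusting for ten or more column pairs.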

You could also use rbindlist from the data.table package for the last step:

library(data.table)
rbindlist(lst)

Used data:

df1 <- structure(list(year1 = c("16XX", "17XX", "17XX", "180X", "180X", "181X", "181X", "182X", "182X"),
                      pop1 = c("4675,0", "4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0", "5362,0"),
                      year2 = c(1900L, 1901L, 1902L, 1903L, 1904L, 1905L, 1906L, 1907L, 1908L),
                      pop2 = c("6453,0", "6553,5", "6684,0", "6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"), 
                      year3 = c(1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L, 1937L, 1938L),
                      pop3 = c("9981,2", "10583,5", "8671,0", "9118,0", "9625,0", "8097,0", "7984,7", "8729,0", "10462,0")),
                 .Names = c("year1", "pop1", "year2", "pop2", "year3", "pop3"), class = "data.frame", row.names = c(NA, -9L))


Source: https://stackoverflow.com/questions/31504735/reshape-a-dataframe-to-long-format-with-multiple-sets-of-measure-columns
