Is there syntactic sugar to define a data frame in R

偶尔善良 submitted on 2021-02-10 20:36:35

Question


I want to regroup US states by regions and thus I need to define a "US state" -> "US Region" mapping function, which is done by setting up an appropriate data frame.

The basis is this exercise (apparently this is a map of the "Commonwealth of the Fallout"):

One starts off with an original list in raw form:

Alabama = "Gulf"
Arizona = "Four States"
Arkansas = "Texas"
California = "South West"
Colorado = "Four States"
Connecticut = "New England"
Delaware = "Columbia"

which eventually leads to this R code:

us_state <- c("Alabama","Arizona","Arkansas","California","Colorado","Connecticut",
"Delaware","District of Columbia","Florida","Georgia","Idaho","Illinois","Indiana",
"Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland","Massachusetts","Michigan",
"Minnesota","Mississippi","Missouri","Montana","Nebraska","Nevada","New Hampshire",
"New Jersey","New Mexico","New York","North Carolina","North Dakota","Ohio","Oklahoma",
"Oregon","Pennsylvania","Rhode Island","South Carolina","South Dakota","Tennessee",
"Texas","Utah","Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming")

us_region <- c("Gulf","Four States","Texas","South West","Four States","New England",
"Columbia","Columbia","Gulf","Southeast","North West","Midwest","Midwest","Plains",
"Plains","East Central","Gulf","New England","Columbia","New England","Midwest",
"Midwest","Gulf","Plains","North","Plains","South West","New England","Eastern",
"Four States","Eastern","Southeast","North","East Central","Plains","North West",
"Eastern","New England","Southeast","North","East Central","Texas","Four States",
"New England","Columbia","North West","Eastern","Midwest","North")

us_state_to_region_map <- data.frame(us_state, us_region, stringsAsFactors=FALSE)

which is supremely ugly and unmaintainable as the State -> Region mapping is effectively obfuscated.

I actually wrote a Perl program to generate the above from the original list.

In Perl, one would write things like:

#!/usr/bin/perl

$mapping = {
"Alabama"=> "Gulf",
"Arizona"=> "Four States",
"Arkansas"=> "Texas",
"California"=> "South West",
"Colorado"=> "Four States",
"Connecticut"=> "New England",
...etc...etc...
"West Virginia"=> "Eastern",
"Wisconsin"=> "Midwest",
"Wyoming"=> "North" };

which is maintainable because one can verify the mapping on a line-by-line basis.

There must be something similar to this Perl goodness in R?


Answer 1:


As @tim-biegeleisen says, it could be more appropriate to maintain this dataset in a database, a CSV file or a spreadsheet and read it into R (with readxl::read_excel(), readr::read_csv(), ...).

However, if you want to write it directly in your code, you can use tibble::tribble(), which lets you write a data frame row by row:
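As a rough base-R sketch of the CSV route: the file contents and the column names us_state / us_region below are assumptions chosen to match the question, and a temporary file stands in for a CSV you would normally keep next to your script.

```r
# Hypothetical CSV contents; in practice these lines would live in their own file.
csv_lines <- c(
  "us_state,us_region",
  "Alabama,Gulf",
  "Arizona,Four States",
  "Arkansas,Texas",
  "California,South West"
)

# Write to a temporary file purely so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_path)

# One state -> region pair per line in the file, one row per pair in R.
us_state_to_region_map <- read.csv(csv_path, stringsAsFactors = FALSE)
print(us_state_to_region_map)
```

Each line of the file can then be checked on its own, much like the Perl hash in the question.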

library(tibble)
tribble(~ state, ~ region,
        "Alabama", "Gulf",
        "Arizona", "Four States",
(...)
        "Wisconsin", "Midwest", 
        "Wyoming", "North")

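Once such a two-column table exists, regrouping is a plain lookup. The sketch below uses a base data.frame with only four rows so it runs without extra packages; the same match() call should work on the tibble above. The vector observed and the made-up state "Atlantis" are purely illustrative.

```r
# Shortened stand-in for the full state/region table.
state_region <- data.frame(
  state  = c("Alabama", "Arizona", "Wisconsin", "Wyoming"),
  region = c("Gulf", "Four States", "Midwest", "North"),
  stringsAsFactors = FALSE
)

# Hypothetical input: states as they appear in some other dataset.
observed <- c("Wyoming", "Alabama", "Atlantis", "Arizona")

# match() returns the row index of each state, or NA when it is not in the table.
observed_region <- state_region$region[match(observed, state_region$state)]
print(observed_region)
# "North" "Gulf" NA "Four States"
```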


Answer 2:


It seems a bit open for interpretation as to what you're looking for.

Is the mapping meant to be a function-like thing, such that a call would return the region or vice versa (e.g. a function call like mapping("alabama") => "Gulf")?

I am reading the question as looking for dictionary-style storage, which in R could be obtained with a named list:

ncountry <- 49
mapping <- as.list(c("Gulf","Four States",
...
,"Midwest","North"))
names(mapping) <- c("Alabama","Arizona",
...
,"Wisconsin","Wyoming")
mapping[["Pennsylvania"]]
[1] "Eastern"

This could be performed in a single call:

mapping <- list("Alabama" = "Gulf",  
                "Arizona" = "Four States", 
                 ..., 
                "Wisconsin" = "Midwest", 
                "Wyoming" = "North")

This makes it very simple to check that the mapping is working as expected. It doesn't convert nicely to a 2-column data.frame, however, which we would instead obtain using:

mapping_df <- data.frame(region = unlist(mapping), state = names(mapping))

Note that "not nicely" simply means that as.data.frame() doesn't translate the input into a 2-column output; it produces one column per state instead.

Alternatively, just using a named character vector would likely be fine too:

mapping_c <- c("Alabama" = "Gulf",  
                "Arizona" = "Four States", 
                 ..., 
                "Wisconsin" = "Midwest", 
                "Wyoming" = "North")

which would be converted to a data.frame in almost the same fashion:

mapping_df_c <- data.frame(region = mapping_c, state = names(mapping_c))
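One thing the named character vector gives you for free is vectorised lookup, which is handy for the regrouping the question is after. A small sketch with a shortened mapping (the vector states is just an illustrative input):

```r
# Shortened version of the named character vector from above.
mapping_c <- c("Alabama"   = "Gulf",
               "Arizona"   = "Four States",
               "Wisconsin" = "Midwest",
               "Wyoming"   = "North")

# Hypothetical input vector of states to translate.
states <- c("Wyoming", "Alabama", "Alabama")

# Indexing by name looks up every element at once; unname() drops the state labels.
regions <- unname(mapping_c[states])
print(regions)
# "North" "Gulf" "Gulf"

# The reverse direction: list the states belonging to each region.
by_region <- split(names(mapping_c), mapping_c)
print(by_region)
```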

Note, however, a slight difference between the two choices of storage. Referencing an entry that exists works just fine using either single brackets [ or double brackets [[:

#Works:
mapping_c["Pennsylvania"] == mapping["Pennsylvania"]
#output
Pennsylvania 
        TRUE
mapping_c[["Pennsylvania"]] == mapping[["Pennsylvania"]]
[1] TRUE

But when referencing unknown entries, the two differ slightly in behaviour:

#works sorta:
mapping_c["hello"] == mapping["hello"]
#output
$<NA>
NULL
#Does not work:
mapping_c[["hello"]] == mapping[["hello"]]

Error in mapping_c[["hello"]] : subscript out of bounds

If you are converting your input into a data.frame this is not an issue, but it is worth being aware of this, so you obtain the behaviour expected.
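To make the difference easier to see, here is a small sketch that looks at each container separately instead of comparing them with ==. The two-entry mappings are shortened stand-ins for the full ones.

```r
mapping   <- list(Alabama = "Gulf", Wyoming = "North")   # named list
mapping_c <- c(Alabama = "Gulf", Wyoming = "North")      # named character vector

# A list returns NULL for an unknown name under [[ ]], without an error.
missing_from_list <- mapping[["hello"]]
print(is.null(missing_from_list))

# A character vector returns NA under [ ] ...
missing_from_vector <- mapping_c["hello"]
print(is.na(missing_from_vector))

# ... but throws "subscript out of bounds" under [[ ]].
vector_errored <- tryCatch({
  mapping_c[["hello"]]
  FALSE
}, error = function(e) TRUE)
print(vector_errored)
```

So a check like is.null() suits the list, while is.na() suits the character vector.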

Of course you could use a function call to create a proper dictionary with a simple switch statement. I don't think that would be any prettier though.
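For completeness, a minimal sketch of what the switch() variant might look like. The function name state_to_region is made up for illustration, and only four states are listed.

```r
# Hypothetical helper: maps one state name to its region.
state_to_region <- function(state) {
  switch(state,
         "Alabama"   = "Gulf",
         "Arizona"   = "Four States",
         "Wisconsin" = "Midwest",
         "Wyoming"   = "North",
         NA_character_)  # the unnamed last argument is the fallback
}

print(state_to_region("Arizona"))
# "Four States"

# switch() handles one value at a time, so vectors need vapply() or similar.
several <- vapply(c("Wyoming", "Atlantis"), state_to_region, character(1),
                  USE.NAMES = FALSE)
print(several)
# "North" NA
```

The pairs are still readable line by line, but the lookup is no longer vectorised, which is part of why it is not obviously prettier.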




Answer 3:


If us_region is a named list...

us_region <- list(Alabama = "Gulf",
                  Arizona = "Four States",
                  Arkansas = "Texas",
                  California = "South West",
                  Colorado = "Four States",
                  Connecticut = "New England",
                  Delaware = "Columbia")

Then,

us_state_to_region_map <- data.frame(us_state = names(us_region), 
                                     us_region = sapply(us_region, c),
                                     stringsAsFactors = FALSE)

and, as a bonus, you also get the states as row names...

us_state_to_region_map
               us_state   us_region
Alabama         Alabama        Gulf
Arizona         Arizona Four States
Arkansas       Arkansas       Texas
California   California  South West
Colorado       Colorado Four States
Connecticut Connecticut New England
Delaware       Delaware    Columbia
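If the row names are not wanted, one possible variant is to build the region column with unlist(..., use.names = FALSE) so that no names are carried over. This is a sketch on a three-entry list, not a change to the approach itself.

```r
us_region <- list(Alabama = "Gulf",
                  Arizona = "Four States",
                  Arkansas = "Texas")

# unlist() flattens the list; use.names = FALSE drops the state names,
# so the data frame falls back to the default row names 1, 2, 3.
us_state_to_region_map <- data.frame(
  us_state  = names(us_region),
  us_region = unlist(us_region, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(us_state_to_region_map)
```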



Answer 4:


One option could be to create a data frame in wide format, then transform it to long format. Your initial list makes this very straightforward, it keeps the mapping obvious, and it is actually quite similar to your Perl code:

library(tidyr)

data.frame(
  Alabama = "Gulf",
  Arizona = "Four States",
  Arkansas = "Texas",
  California = "South West",
  Colorado = "Four States",
  Connecticut = "New England",
  Delaware = "Columbia",
  stringsAsFactors = FALSE
) %>%
  gather("us_state", "us_region") # transform to long format
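One caveat with the wide approach: state names containing spaces are not valid bare argument names, and data.frame() rewrites them by default ("New York" would become New.York). A base-R sketch of one way around this, assuming backtick-quoted names plus check.names = FALSE, and reshaping by hand instead of with tidyr:

```r
# Backticks allow names with spaces; check.names = FALSE keeps them unchanged.
wide <- data.frame(
  Alabama = "Gulf",
  `New York` = "Eastern",
  `West Virginia` = "Eastern",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Wide to long by hand: column names become states, the single row becomes regions.
long <- data.frame(
  us_state  = names(wide),
  us_region = unlist(wide[1, ], use.names = FALSE),
  stringsAsFactors = FALSE
)
print(long)
```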


Source: https://stackoverflow.com/questions/58427326/is-there-syntactic-sugar-to-define-a-data-frame-in-r
