dplyr summarise with dynamic columns

Question

I'm trying to use dplyr against my Postgres database to run a simple summary function. Everything works if I pass the column name directly, but I want to do this dynamically (i.e. loop through each column name taken from another data frame).

The problem is that only the first two calculations give the right results.

Assume the first dynamic column is called "id" (i.e. var holds the string "id"):

pull_table %>%
    summarise(
        row_count = n(), 
        distinct_count = n_distinct(var) , 
        distinct_count_minus_blank = n_distinct(ifelse(var=="",NA,var)), 
        maxvalue = max(var), 
        minvalue = min(var), 
        maxlength = max(length(var)), 
        minlen = min(length(var))
    )  %>% 
    show_query()

The problem is obvious when you look at the SQL: in some places id is wrapped in single quotes, so the calculation runs on a string literal instead of the column:

<SQL>
SELECT 
    COUNT(*) AS "row_count", 
    COUNT(DISTINCT id) AS "distinct_count", 
    COUNT(
        DISTINCT CASE 
            WHEN ('id' = '') THEN (NULL) 
            WHEN NOT('id' = '') THEN ('id') 
        END) AS "distinct_count_minus_blank", 
    MAX('id') AS "maxvalue", 
    MIN('id') AS "minvalue", 
    MAX(LENGTH('id')) AS "maxlength", 
    MIN(LENGTH('id')) AS "minlen"
FROM "table"

You can see from this output that sometimes the calculation is happening on the column, but sometimes it's just happening on the string "id". Why is this and how can I fix it so it calculates on the actual column rather than the string?

Answer 1:

I think you should look at rlang::sym (which is re-exported by dplyr).

Assuming pull_table is a dataframe including id, some_numeric_variable and some_character_variable columns, you could write something like this:

xx = sym("id")
yy = sym("some_numeric_variable")
ww = sym("some_character_variable")
pull_table %>%
    summarise(
        row_count = n(), 
        distinct_count = n_distinct(!!xx), 
        # compare the unquoted column, not the string variable, against ""
        distinct_count_minus_blank = n_distinct(ifelse(!!xx == "", NA, !!xx)), 
        maxvalue = max(!!yy), 
        minvalue = min(!!yy), 
        maxlength = max(length(!!ww)), 
        minlen = min(length(!!ww))
    )

The sym() function turns a string into a name (symbol), which can then be unquoted inside dplyr functions with the !! operator. For more information, take a look at the quasiquotation documentation or this tutorial.
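To make the sym() / !! pattern concrete for the "loop over column names" case from the question, here is a minimal sketch on a local data frame. The data, the column_names vector and the summary_table name are made up for illustration, and nchar() is used because on a local data frame length() would return the number of rows rather than the string length (against the database, the question's SQL shows length() being passed through as LENGTH). It has not been run against a Postgres connection.

```r
library(dplyr)

# Hypothetical local data standing in for pull_table
pull_table <- data.frame(
  id   = c("a", "b", "", "a"),
  name = c("x", "", "", "y"),
  stringsAsFactors = FALSE
)

# Column names to loop over (in the question these come from another data frame)
column_names <- c("id", "name")

results <- lapply(column_names, function(column_i) {
  col <- rlang::sym(column_i)  # turn the string into a symbol
  pull_table %>%
    summarise(
      column = column_i,
      row_count = n(),
      distinct_count = n_distinct(!!col),
      # blanks become NA and are then excluded from the distinct count
      distinct_count_minus_blank =
        n_distinct(ifelse(!!col == "", NA, !!col), na.rm = TRUE),
      maxlength = max(nchar(!!col)),
      minlen = min(nchar(!!col))
    )
})

# One row of summary statistics per column name
summary_table <- bind_rows(results)
print(summary_table)
```

Each iteration builds its own symbol, so every expression inside summarise() refers to the actual column rather than to the string holding its name.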

Unfortunately, since I didn't have any tbl_sql at hand, I couldn't test it with show_query.

Side advice: don't ever name your variables "var", as var is also the variance function. I have pulled my hair out many times because this clashed with some packages or custom functions.

Answer 2:

I ended up solving it with dots, i.e.:

pull_table %>%
select(var=(dots=column_i)) %>%
    summarise(
        row_count = n(), 
        distinct_count = n_distinct(var) , 
        distinct_count_minus_blank = n_distinct(ifelse(var=="",NA,var)), 
        maxvalue = max(var), 
        minvalue = min(var), 
        maxlength = max(length(var)), 
        minlen = min(length(var))
    )  %>% 
    show_query()
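
The same rename-then-summarise idea can also be sketched with all_of(), assuming dplyr 1.0 or later. This is an alternative spelling, not the code from the answer above: the local data frame and the value of column_i are made up for illustration, and it has not been run against a database connection.

```r
library(dplyr)

# Hypothetical local data standing in for pull_table
pull_table <- data.frame(
  id = c("a", "b", "", "a"),
  stringsAsFactors = FALSE
)

column_i <- "id"  # column name held in a string

result <- pull_table %>%
  # select the column named by column_i and rename it to a fixed name
  select(var = all_of(column_i)) %>%
  summarise(
    row_count = n(),
    distinct_count = n_distinct(var),
    maxvalue = max(var),
    minvalue = min(var)
  )

print(result)
```

Because the column is renamed to a fixed name first, the summarise() call no longer needs any unquoting.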

Source: https://stackoverflow.com/questions/53148816/dplyr-summarise-with-dynamic-columns

Tags

dynamic

dplyr