问题
I'm trying to use dplyr against my postgres database and am conducting a simple function. Everything works if I parse the column name directly, however I want to do this dynamically (i.e. sort through each column name from another dataframe
The problem I'm geeting is for the first two calculations, i'm getting the right results
Assume the first dynamic column is called "id"
pull_table %>%
summarise(
row_count = n(),
distinct_count = n_distinct(var) ,
distinct_count_minus_blank = n_distinct(ifelse(var=="",NA,var)),
maxvalue = max(var),
minvalue = min(var),
maxlength = max(length(var)),
minlen = min(length(var))
) %>%
show_query()
The wrong result I get is obvious when you see the sql - sometimes id has '' around it so it's calculating as a string:
<SQL>
SELECT
COUNT(*) AS "row_count",
COUNT(DISTINCT id) AS "distinct_count",
COUNT(
DISTINCT CASE
WHEN ('id' = '') THEN (NULL)
WHEN NOT('id' = '') THEN ('id')
END) AS "distinct_count_minus_blank",
MAX('id') AS "maxvalue",
MIN('id') AS "minvalue",
MAX(LENGTH('id')) AS "maxlength",
MIN(LENGTH('id')) AS "minlen"
FROM "table"
You can see from this output that sometimes the calculation is happening on the column, but sometimes it's just happening on the string "id". Why is this and how can I fix it so it calculates on the actual column rather than the string?
回答1:
I think you should look at rlang::sym
(which is imported by dplyr
).
Assuming pull_table
is a dataframe including id
, some_numeric_variable
and some_character_variable
columns, you could write something like this:
xx = sym("id")
yy = sym("some_numeric_variable")
ww = sym("some_character_variable")
pull_table %>%
summarise(
row_count = n(),
distinct_count = n_distinct(!!xx) ,
distinct_count_minus_blank = n_distinct(ifelse(var=="", NA, !!xx)),
maxvalue = max(!!yy ),
minvalue = min(!!yy ),
maxlength = max(length(!!ww)),
minlen = min(length(!!ww))
)
The sym()
function turn a string
variable into a name
, which can be unquoted inside dplyr functions with the !!
operator. If you want more information, please take a look at the quasiquotation doc or this tuto.
Unfortunately, since I didn't have any tbl_sql
at hand, I couldn't test it with show_query
.
Side advice: don't ever name your variables "var" as var
is also the variance function. I pulled my hair off many times just because this had messed up with some packages or custom functions.
回答2:
I ended up solving it with dots
i.e.
pull_table %>%
select(var=(dots=column_i)) %>%
summarise(
row_count = n(),
distinct_count = n_distinct(var) ,
distinct_count_minus_blank = n_distinct(ifelse(var=="",NA,var)),
maxvalue = max(var),
minvalue = min(var),
maxlength = max(length(var)),
minlen = min(length(var))
) %>%
show_query()
来源:https://stackoverflow.com/questions/53148816/dplyr-summarise-with-dynamic-columns