dplyr

Factor Analysis using sparklyr in Databricks

丶灬走出姿态 submitted on 2021-01-29 06:13:50
Question: I would like to perform a factor analysis in Databricks. I currently pull the data to the driver with dplyr::collect(), but because of the size of the data I get this error:

Error : org.apache.spark.sql.execution.OutOfMemorySparkException: Total memory usage during row decode exceeds spark.driver.maxResultSize (4.0 GB). The average row size was 82.0 B

Is there a function in sparklyr that lets me run this analysis without collecting the data?

Source: https://stackoverflow.com/questions/64113459/factor-analysis-using-sparklyr-in
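One way to avoid collecting the rows is to note that stats::factanal() can fit from a correlation matrix alone (its covmat argument), so only a small p x p matrix has to reach the driver. The sketch below shows that idea on simulated local data; the variable names and the data are hypothetical. On a Spark table, sparklyr::ml_corr() is one candidate for producing the matrix, but whether it suits the asker's data and cluster is an assumption.

```r
# Hypothetical data: six observed variables driven by two latent factors.
set.seed(1)
n  <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- data.frame(
  a1 = 0.8 * f1 + rnorm(n, sd = 0.5),
  a2 = 0.7 * f1 + rnorm(n, sd = 0.5),
  a3 = 0.9 * f1 + rnorm(n, sd = 0.5),
  b1 = 0.8 * f2 + rnorm(n, sd = 0.5),
  b2 = 0.7 * f2 + rnorm(n, sd = 0.5),
  b3 = 0.9 * f2 + rnorm(n, sd = 0.5)
)

# Only this 6 x 6 matrix is needed for the fit. With Spark it would be
# computed on the cluster (assumption: e.g. via sparklyr::ml_corr()).
R <- cor(X)

# Fit from the correlation matrix; n.obs is the number of rows.
fit_cov <- factanal(covmat = R, factors = 2, n.obs = n)

# Reference fit from the raw rows, for comparison only.
fit_raw <- factanal(X, factors = 2)

print(fit_cov$loadings)
```

The two fits should agree closely, because factanal() reduces raw data to a correlation matrix internally before estimating the loadings.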

How to separate values in a column and convert to numeric values?

╄→гoц情女王★ submitted on 2021-01-29 05:02:37
Question: I have a dataset where the values are collapsed, so each row holds multiple inputs in a single column. For example:

Gene   Score1
Gene1  NA, NA, NA, 0.03, -0.3
Gene2  NA, 0.2, 0.1

I am trying to unpack this and then select the maximum absolute value per row for the Score1 column, while also keeping track of whether that maximum absolute value was originally negative by creating a new column. So the output for the example is:

Gene   Score1  Negatives1
Gene1  0.3     1
Gene2  0.2     0

#Score1 is now the maximum absolute value and if
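The excerpt is cut off, so the following is only a sketch of the part that is stated: split each string on commas, convert to numeric, and keep the value with the largest absolute size together with a flag for its original sign. The helper name max_abs_signed is hypothetical, and treating a row with no numeric values as NA is an assumption.

```r
library(dplyr)

dat <- data.frame(
  Gene   = c("Gene1", "Gene2"),
  Score1 = c("NA, NA, NA, 0.03, -0.3", "NA, 0.2, 0.1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the signed value with the largest absolute size.
max_abs_signed <- function(s) {
  v <- suppressWarnings(as.numeric(strsplit(s, ",\\s*")[[1]]))
  v <- v[!is.na(v)]
  if (length(v) == 0) return(NA_real_)  # assumption: all-NA rows stay NA
  v[which.max(abs(v))]
}

res <- dat %>%
  mutate(
    signed     = vapply(Score1, max_abs_signed, numeric(1), USE.NAMES = FALSE),
    Negatives1 = as.integer(signed < 0),  # 1 if the winning value was negative
    Score1     = abs(signed)
  ) %>%
  select(Gene, Score1, Negatives1)

print(res)
```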

Select a maximum value across rows and columns with grouped data

*爱你&永不变心* submitted on 2021-01-29 04:55:30
Question: The data below have an IndID field as well as three columns containing numbers, including NA in some instances, with a varying number of rows for each IndID.

library(dplyr)
n = 10
set.seed(123)
dat <- data.frame(IndID = sample(c("AAA", "BBB", "CCC", "DDD"), n, replace = TRUE),
                  Num1 = c(2,4,2,4,4,1,3,4,3,2),
                  Num2 = sample(c(1,2,5,8,7,8,NA), n, replace = TRUE),
                  Num3 = sample(c(NA, NA,NA,8,7,9,NA), n, replace = TRUE)) %>%
  arrange(IndID)
head(dat)
  IndID Num1 Num2 Num3
1   AAA    1   NA    7
2   BBB    2   NA   NA
3   BBB    2    7
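The question text is truncated, so the exact target is not known. Going by the title, one plausible reading is a single maximum per IndID taken over all rows and all three numeric columns. A minimal sketch under that assumption, using a small hypothetical data frame rather than the asker's sampled data:

```r
library(dplyr)

# Hypothetical data in the same shape as the question's `dat`.
dat <- data.frame(
  IndID = c("AAA", "BBB", "BBB", "CCC"),
  Num1  = c(1, 2, 2, 4),
  Num2  = c(NA, NA, 7, 5),
  Num3  = c(7, NA, NA, 9)
)

# Pool the three columns within each group and take one maximum,
# ignoring NA values.
res <- dat %>%
  group_by(IndID) %>%
  summarise(Max = max(c(Num1, Num2, Num3), na.rm = TRUE), .groups = "drop")

print(res)
```

If a group held only NA values, max(..., na.rm = TRUE) would return -Inf with a warning, so such groups would need separate handling.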

How can I create a new column based on conditional statements and dplyr?

女生的网名这么多〃 submitted on 2021-01-29 02:54:38
Question:

x  y
2  4
5  8
1  4
9  12

I have four conditions:

maxx = 3, minx = 1, maxy = 6, miny = 3. (If minx < x < maxx and miny < y < maxy, then z = apple)
maxx = 6, minx = 4, maxy = 9, miny = 7. (If minx < x < maxx and miny < y < maxy, then z = ball)
maxx = 2, minx = 0, maxy = 5, miny = 3. (If minx < x < maxx and miny < y < maxy, then z = pine)
maxx = 12, minx = 7, maxy = 15, miny = 11. (If minx < x < maxx and miny < y < maxy, then z = orange)

Expected outcome:

x  y   z
2  4   apple
5  8   ball
1  4   pine
9  12  orange
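The four conditions map directly onto dplyr::case_when(), which evaluates its branches in order and assigns the first one that matches. A minimal sketch with the question's data; leaving rows that match no condition as NA is an assumption, since the question does not say what should happen to them.

```r
library(dplyr)

df <- data.frame(x = c(2, 5, 1, 9), y = c(4, 8, 4, 12))

res <- df %>%
  mutate(z = case_when(
    x > 1 & x < 3  & y > 3  & y < 6  ~ "apple",
    x > 4 & x < 6  & y > 7  & y < 9  ~ "ball",
    x > 0 & x < 2  & y > 3  & y < 5  ~ "pine",
    x > 7 & x < 12 & y > 11 & y < 15 ~ "orange",
    TRUE ~ NA_character_  # assumption: unmatched rows become NA
  ))

print(res)
```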

Use dplyr's filter and mutate to generate a new variable

◇◆丶佛笑我妖孽 submitted on 2021-01-29 01:51:58
Question: I chose the hflights dataset as an example. I am trying to create a variable/column that contains the "TailNum" of the planes, but only for the planes that are among the 10% with the longest airtime.

install.packages("hflights")
library("hflights")
library(dplyr)
flights <- as_tibble(hflights)  # tbl_df() is deprecated
flights %>%
  filter(cume_dist(desc(AirTime)) < 0.1) %>%
  mutate(new_var = TailNum)

EDIT: The resulting data frame has only 22208 obs instead of 227496. Is there a way to keep the original data frame, but add a new variable with
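The EDIT is cut off, but it appears to ask for all rows to be kept. One way to do that is to move the condition out of filter() and into if_else() inside mutate(), so rows outside the top 10% get NA instead of being dropped. The sketch below uses a small hypothetical table in place of hflights so that it runs without that package.

```r
library(dplyr)

# Hypothetical stand-in for hflights: 20 flights with increasing air time.
flights <- tibble(
  TailNum = paste0("N", 1:20),
  AirTime = (1:20) * 10
)

# Same condition as in the question, but applied inside mutate():
# no rows are removed, and non-matching rows get NA.
res <- flights %>%
  mutate(new_var = if_else(cume_dist(desc(AirTime)) < 0.1,
                           TailNum, NA_character_))

print(tail(res, 3))
```

Note that the strict `<` means a flight whose cumulative share is exactly 0.1 is not included, which is the behaviour of the original filter as well.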

How to recode dataframe values to keep only those that satisfy a certain set, replace others with “other”

做~自己de王妃 submitted on 2021-01-29 01:35:02
Question: I'm looking for a concise solution, preferably using dplyr, to clean up the values in a data frame column: values that match a certain set should be kept as they are, and all values that don't match should be recoded as "other".

Example

I have a data frame with names of animals. There are 4 legit animal names, but other rows contain gibberish rather than names. I want to clean the column up, to keep only the legit animal names: zebra, lion, cow, or cat.

Data

library(tidyverse)
library(stringi)
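The asker's data is not shown in the excerpt, so the sketch below uses a hypothetical column of animal names mixed with gibberish. It keeps values found in the allowed set via %in% and recodes everything else to "other" with if_else().

```r
library(dplyr)

legit <- c("zebra", "lion", "cow", "cat")

# Hypothetical data: legit names mixed with gibberish strings.
df <- data.frame(
  animal = c("zebra", "xk2lq", "lion", "cow", "p0zzt", "cat"),
  stringsAsFactors = FALSE
)

res <- df %>%
  mutate(animal = if_else(animal %in% legit, animal, "other"))

print(res)
```

Because %in% returns FALSE for NA, missing values would also become "other" here; whether that is wanted is not stated in the question.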

R dplyr window function, get the first value in the next x window that fulfils some condition

烂漫一生 submitted on 2021-01-28 22:01:02
Question: I have a data frame that I process with dplyr, and I have a condition. For each cell, I want to know the index of the first cell that matches the condition within the next x rows. In my case, I want an additional column that holds the index of the first value that is larger than the current value by at least z.

Example: here we are looking for the index of the first value in the next 3 rows that is larger than the current value by at least 3. In the case of the first row, the value is 0 and the
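The example in the excerpt is truncated, so the sketch below uses a hypothetical vector. It scans the next x rows for each position and returns the row index of the first value that exceeds the current one by at least z. Returning the absolute row index (rather than an offset) and NA when nothing matches are both assumptions, and the helper name first_larger_idx is hypothetical.

```r
library(dplyr)

# Hypothetical helper: for each element, index of the first value within
# the next `x` elements that is at least `z` larger, or NA if there is none.
first_larger_idx <- function(v, x, z) {
  n <- length(v)
  vapply(seq_len(n), function(i) {
    if (i == n) return(NA_integer_)        # last row has no look-ahead window
    window <- (i + 1):min(i + x, n)        # indices of the next x rows
    hit <- which(v[window] - v[i] >= z)
    if (length(hit) == 0) NA_integer_ else window[hit[1]]
  }, integer(1))
}

df <- data.frame(value = c(0, 1, 4, 2, 6, 3))

res <- df %>%
  mutate(idx = first_larger_idx(value, x = 3, z = 3))

print(res)
```

This loops over rows in plain R, which is simple to read but scales with n times x; for very long columns a vectorised or rolling approach may be preferable.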
