r

Error in nchar(Terms(x), type = "chars") : invalid multibyte string, element 204, when inspecting document term matrix

安稳与你 posted on 2021-02-11 13:45:16
Question: Here is the source code that I have used:

    MyData <- Corpus(DirSource("F:/Data/CSV/Data"),
                     readerControl = list(reader = readPlain, language = "cn"))
    SegmentedData <- lapply(MyData, function(x) unlist(segmentCN(x)))
    temp <- Corpus(DataframeSource(SegmentedData),
                   readerControl = list(reader = readPlain, language = "cn"))

    # Preprocessing Data
    temp <- tm_map(temp, removePunctuation)
    temp <- tm_map(temp, removeNumbers)
    removeURL <- function(x) gsub("http[[:alnum:]]*", " ", x)
    temp <- tm_map(temp, removeURL)
    temp
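The error in the title points at a term that is not a valid multibyte string. As a rough sketch only, and assuming the terms are meant to be UTF-8, base R can flag and scrub such strings before they reach nchar(). The `terms` vector below is made-up illustration data, not taken from the question's corpus.

```r
# Hypothetical terms; the second one contains a byte that is invalid in UTF-8.
terms <- c("data", "bad\xff", "mining")

# validUTF8() reports which elements are well-formed UTF-8.
is_valid <- validUTF8(terms)

# iconv() with sub = "" drops bytes that cannot be converted.
cleaned <- iconv(terms, from = "UTF-8", to = "UTF-8", sub = "")

which(!is_valid)   # positions of the offending terms
```

Whether dropping bytes is acceptable depends on the data; re-reading the files with the correct encoding is usually preferable when the source encoding is known.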

R: Overlay Poisson distribution over histogram of data

与世无争的帅哥 posted on 2021-02-11 13:44:15
Question: I have some discrete data, which I have plotted in a histogram. I'd like to overlay a Poisson distribution to show that the data is roughly Poisson distributed. Imagine the two plots from the code below merged into one plot; that is what I'd like to achieve.

    # Read data
    data <- read.csv("data.csv")
    # Plot data
    hist(data, prob=TRUE)
    # Plot Poisson
    c <- c(0:7)
    plot(c, dpois(c, mean(data)), type="l")

I have tried the curve function:

    curve(c, dpois(x=c, lambda=mean(data)), add=T)

But all I get is
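One possible way to merge the two plots, as a minimal sketch: draw the histogram on the density scale with one bin per integer, then add the Poisson probabilities with lines(). The simulated `counts` vector is an assumption standing in for the numeric column of data.csv, which is not shown in the question.

```r
set.seed(42)
# Hypothetical stand-in for the discrete data in the question.
counts <- rpois(200, lambda = 2.5)

# Integer support and breaks centred on each integer, so every bar
# covers exactly one count value and has width 1.
k <- 0:max(counts)
breaks <- seq(-0.5, max(counts) + 0.5, by = 1)

# freq = FALSE puts the histogram on the density scale.
h <- hist(counts, breaks = breaks, freq = FALSE,
          main = "Counts with Poisson overlay", xlab = "count")

# Poisson probabilities at the sample mean, added to the existing plot.
pois <- dpois(k, lambda = mean(counts))
lines(k, pois, type = "b", col = "red")
```

Because each bin has width 1, the bar heights are directly comparable to the Poisson probabilities. lines() is used instead of curve() here because the Poisson mass function is only defined on integers.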

Tidy Evaluation not working with mutate and stringr

守給你的承諾、 posted on 2021-02-11 13:43:16
Question: I've been trying to use tidy evaluation and stringr together inside a mutate pipe, but every time I run it I get an undesirable result. Instead of replacing the letter 'a' with the letter 'X', it overwrites the entire vector with the column name, as you can see in the example below, which uses the iris dataset.

    text_col = "Species"
    iris %>% mutate({{text_col}} := str_replace_all({{text_col}}, pattern = "a", replacement = "X"))

result:

    structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 5, 4
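A minimal sketch of one way to get the intended result, assuming dplyr and stringr are installed and that the column name is held as a string as in the question. On the right-hand side, `.data[[text_col]]` looks the column up by name, whereas `{{text_col}}` evaluates to the string itself.

```r
library(dplyr)
library(stringr)

text_col <- "Species"  # column name stored as a string, as in the question

result <- iris %>%
  mutate(
    # !! unquotes the string so it names the output column;
    # .data[[text_col]] fetches the column contents by name.
    # Species is a factor in iris, so it is converted explicitly.
    !!text_col := str_replace_all(as.character(.data[[text_col]]),
                                  pattern = "a", replacement = "X")
  )

unique(result$Species)
```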

Documentation for `*tmp*` in R?

北慕城南 posted on 2021-02-11 13:42:13
Question: While skimming through the functions section of Hadley Wickham's Advanced R last night I came across this example:

It’s often useful to combine replacement and subsetting

    x <- c(a = 1, b = 2, c = 3)
    names(x)
    #> [1] "a" "b" "c"
    names(x)[2] <- "two"
    names(x)
    #> [1] "a"   "two" "c"

This works because the expression names(x)[2] <- "two" is evaluated as if you had written:

    `*tmp*` <- names(x)
    `*tmp*`[2] <- "two"
    names(x) <- `*tmp*`

(Yes, it really does create a local variable named `*tmp*`, which is
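The expansion quoted above can be checked by calling the replacement functions directly. This is only an illustrative sketch of the equivalence, not a description of R's internals; the variable `y` is introduced here for comparison.

```r
x <- c(a = 1, b = 2, c = 3)
y <- x

# The nested replacement form from the question.
names(x)[2] <- "two"

# The same update written as explicit calls to the replacement
# functions `[<-` and `names<-`.
y <- `names<-`(y, value = `[<-`(names(y), 2, value = "two"))

identical(x, y)
```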

How to lag a specific column of a data frame in R

∥☆過路亽.° posted on 2021-02-11 13:38:05
Question: Input (say d is the data frame below):

    a b c
    1 5 7
    2 6 8
    3 7 9

I want to shift the contents of column b one position down and put an arbitrary number in the first position in b. How do I do this? I would appreciate any help in this regard. Thank you. I tried c(6,tail(d["b"],-1)) but it does not produce (6,5,6).

Output:

    a b c
    1 6 7
    2 5 8
    3 6 9

Answer 1: Use head instead:

    df$b <- c(6, head(df$b, -1))
    #  a b c
    #1 1 6 7
    #2 2 5 8
    #3 3 6 9

You could also use lag in dplyr:

    library(dplyr)
    df %>% mutate(b =
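The head-based answer can be wrapped in a small helper, sketched here in base R. The name `shift_down` and its `fill` argument are hypothetical, introduced only for illustration. The question's attempt differs in two ways: d["b"] is a one-column data frame rather than a vector, and tail(x, -1) drops the first element instead of the last.

```r
# Hypothetical helper: shift a vector down by one position and put
# `fill` in the first slot. The last element is dropped.
shift_down <- function(x, fill) {
  c(fill, head(x, -1))
}

d <- data.frame(a = 1:3, b = c(5, 6, 7), c = 7:9)
d$b <- shift_down(d$b, fill = 6)
d
```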

R - replace values in dataframe based on two matching conditions

喜夏-厌秋 posted on 2021-02-11 13:37:21
Question: I'm working with lists of spatial data for 20+ different sites (difficult to reproduce here; sorry in advance). I have three data frames associated with each site; each has a 'sample_ID' column and some other shared column names. What I'm trying to do seems very simple: if the 'sample_ID' values match for two data frames and the column names match, replace the value in DF 1 with that of DF 2 and DF 3. Example:

    # DF 1:
    SAMPLE_ID CLASS_ID CLASS VALUE
    1         0        0     5
    2         0        0     5
    3         0        0     3
    4         0        0     6
    5         0        0
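A rough base R sketch of the update described above, under stated assumptions: the key column is SAMPLE_ID, each key appears at most once per data frame, and every shared non-key column should be overwritten. The helper `update_from` and the contents of `df1` and `df2` are hypothetical (the last VALUE in `df1` and all of `df2` are made up, since the question's example is cut off).

```r
# Hypothetical helper: for rows whose key matches, copy the values of
# every shared non-key column from `source` into `target`.
update_from <- function(target, source, key = "SAMPLE_ID") {
  shared <- setdiff(intersect(names(target), names(source)), key)
  idx <- match(source[[key]], target[[key]])  # row in target for each source row
  found <- !is.na(idx)                        # ignore keys absent from target
  for (col in shared) {
    target[idx[found], col] <- source[found, col]
  }
  target
}

df1 <- data.frame(SAMPLE_ID = 1:5, CLASS_ID = 0, CLASS = 0,
                  VALUE = c(5, 5, 3, 6, 4))
df2 <- data.frame(SAMPLE_ID = c(2, 4), CLASS_ID = c(7, 9))

updated <- update_from(df1, df2)
updated
```

Applying the helper again with a third data frame, for example `update_from(updated, df3)`, would let later sources overwrite earlier ones.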

Clustering multivariate time series - question regarding distance matrix

放肆的年华 posted on 2021-02-11 13:36:49
Question: I am trying to cluster meteorological stations using R. Stations provide data such as temperature, wind speed, humidity and some more at hourly intervals. I can easily cluster univariate time series using the tsclust library, but when I cluster multivariate series I get errors. I have the data as a list, so each list element is a matrix with the time series data of one station (variables are columns and rows are different timestamps). If I run:

    tsclust(data, k = 2, distance = 'Euclidean', seed = 3247,
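To make the distance-matrix idea concrete, here is a small base R sketch that does not use tsclust at all: it swaps in dist() and hclust(). It assumes every station matrix has the same timestamps and the same variables, so each matrix can be flattened into one vector. The station data and the helper `make_station` are simulated and hypothetical.

```r
set.seed(1)

# Hypothetical station data: 24 hourly rows, 3 variables per station.
make_station <- function(shift) {
  matrix(rnorm(24 * 3, mean = shift), nrow = 24, ncol = 3,
         dimnames = list(NULL, c("temp", "wind", "humidity")))
}
stations <- list(A = make_station(0), B = make_station(0.1),
                 C = make_station(5), D = make_station(5.2))

# Flatten each matrix into one row, giving a stations-by-values matrix.
flat <- t(sapply(stations, as.vector))

# Euclidean distance between flattened matrices, then hierarchical clustering.
d <- dist(flat)
clusters <- cutree(hclust(d, method = "average"), k = 2)
clusters
```

With real measurements the variables would normally be scaled first, since temperature, wind speed and humidity are on different scales.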

Getting P-Values of Zero in Cox Regression: R

﹥>﹥吖頭↗ posted on 2021-02-11 13:35:45
Question: I am a student conducting a gene expression survival analysis in R. I have the expression data for 249 patients, and I am using 6,000 genes as well as their event-free survival times and vital state as response variables. When I tried to run the Cox regression on my dataset, I got extremely strange results (p-values of 0.00 and strange hazard ratios). I have checked over my code multiple times, but I am not able to catch my mistake (when I tried earlier with just one gene, it worked fine, but

How to write efficient nested functions for parallelization?

爱⌒轻易说出口 posted on 2021-02-11 13:32:48
Question: I have a dataframe with two grouping variables, class and group. For each class, I have a plotting task per group. Mostly, I have 2 levels per class and 500 levels per group. I'm using the parallel package for parallelization and the mclapply function to iterate through the class and group levels. I'm wondering which is the best way to write my iterations. I think I have two options:

    1. Run parallelization for the class variable.
    2. Run parallelization for the group variable.

My computer has 3 cores working
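A third layout worth considering, sketched here under assumptions: flatten every class/group combination into a single task list and parallelize over that, so the workers always have many tasks to share. The data frame, the worker count, and the summing stand-in for the plotting task are all hypothetical.

```r
library(parallel)

# Hypothetical data: 2 classes x 6 groups, one row per combination.
df <- expand.grid(class = c("c1", "c2"), group = paste0("g", 1:6),
                  stringsAsFactors = FALSE)
df$value <- seq_len(nrow(df))

# One list element per class/group combination.
pieces <- split(df, list(df$class, df$group), drop = TRUE)

# mclapply() relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Stand-in for the per-group plotting task.
results <- mclapply(pieces, function(piece) sum(piece$value),
                    mc.cores = cores)

length(results)
```

Parallelizing only over class would give each run just two tasks, which cannot keep more than two cores busy; the flattened list avoids that without nesting mclapply calls.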

Writing a custom case_when function to use in dplyr mutate using tidyeval

萝らか妹 posted on 2021-02-11 13:32:12
Question: I'm trying to write a custom case_when function to use inside dplyr. I've been reading through the tidyeval examples posted in other questions, but still can't figure out how to make it work. Here's a reprex:

    df1 <- data.frame(animal_1 = c("Horse", "Pig", "Chicken", "Cow", "Sheep"),
                      animal_2 = c(NA, NA, "Horse", "Sheep", "Chicken"))

    translate_title <- function(data, input_col, output_col) {
      mutate(data,
             !!output_col := case_when(
               input_col == "Horse" ~ "Cheval",
               input_col == "Pig" ~ "Porc",
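One way such a function might be written, as a sketch only: embrace both column arguments with `{{ }}` so callers can pass bare column names. This assumes dplyr with rlang 0.4.0 or later. Only the two translations visible in the excerpt are included; the fallback to NA and the output column name `animal_1_fr` are assumptions made for illustration.

```r
library(dplyr)

df1 <- data.frame(animal_1 = c("Horse", "Pig", "Chicken", "Cow", "Sheep"),
                  animal_2 = c(NA, NA, "Horse", "Sheep", "Chicken"))

# Sketch: {{ }} forwards the bare column names given by the caller.
translate_title <- function(data, input_col, output_col) {
  mutate(data,
         {{ output_col }} := case_when(
           {{ input_col }} == "Horse" ~ "Cheval",
           {{ input_col }} == "Pig"   ~ "Porc",
           TRUE                       ~ NA_character_  # assumed fallback
         ))
}

out <- translate_title(df1, animal_1, animal_1_fr)
out
```

If the column names arrive as strings instead, `.data[[input_col]]` on the right and `!!output_col :=` on the left would be the corresponding pattern.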