Removing Whitespace From a Whole Data Frame in R

不打扰是莪最后的温柔 posted on 2019-11-28 07:44:00

If I understood you correctly, you want to remove all whitespace from the entire data frame. I guess the code you are using is good for removing spaces in the column names. I think you should try this:

 apply(myData,2,function(x)gsub('\\s+', '',x))

Hope this works.

Note, however, that this returns a matrix. If you want a data frame instead, do:

as.data.frame(apply(myData,2,function(x)gsub('\\s+', '',x)))
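Because apply() goes through a matrix, the numeric columns of a mixed data frame come back as character. As a minimal sketch of the same gsub() idea that keeps the data frame and its column types, one could loop over the columns with lapply() instead. The sample myData below is made up for illustration, and `cleaned` is just a name chosen here:

```r
# Hypothetical sample data: one character column, one numeric column
myData <- data.frame(name = c(" a b ", "c d"),
                     n = c(1.5, 2),
                     stringsAsFactors = FALSE)

# Strip all whitespace, but only in character columns;
# assigning into cleaned[] keeps the data.frame structure
cleaned <- myData
cleaned[] <- lapply(myData, function(x) {
  if (is.character(x)) gsub("\\s+", "", x) else x
})

str(cleaned)
```

The numeric column n is passed through untouched, so no conversion back is needed afterwards.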

EDIT In 2017:

Using sapply and the trimws function with which = "both" removes leading and trailing spaces, but not spaces inside the strings. Since the OP provided no input data, I am adding a dummy example to produce the results.

df <- data.frame(val = c(" abc"," klm","dfsd "),val1 = c("klm ","gdfs","123"),num=1:3,num1=2:4,stringsAsFactors = F)
truth <- sapply(df,is.character)
df1 <- data.frame(cbind(sapply(df[,truth],trimws,which="both"),df[,!truth]))

Output:

> df1
   val val1 num num1
1  abc  klm   1    2
2  klm gdfs   2    3
3 dfsd  123   3    4
> str(df1)
'data.frame':   3 obs. of  4 variables:
 $ val : chr  "abc" "klm" "dfsd"
 $ val1: chr  "klm" "gdfs" "123"
 $ num : int  1 2 3
 $ num1: int  2 3 4
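To make the difference between trimming and removing all whitespace concrete, here is a small sketch on a made-up character vector (the names x, trimmed and stripped are chosen for illustration only):

```r
x <- c(" a b ", "c  d ")

# trimws() only touches the ends of each string
trimmed <- trimws(x, which = "both")   # "a b"  "c  d"

# gsub() with \\s+ removes whitespace everywhere, including inside
stripped <- gsub("\\s+", "", x)        # "ab"   "cd"
```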

A lot of the answers are older, so here in 2019 is a simple dplyr answer that will operate only on the character columns to remove trailing and leading whitespace.

library(dplyr)
library(stringr)

data %>%
  mutate_if(is.character, str_trim)

You can switch out the str_trim() function for other ones if you want a different flavor of whitespace removal.
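As one possible example of such a swap, stringr's str_squish() trims both ends and also collapses runs of internal whitespace to a single space. The sample data frame and the name `squished` below are made up for illustration:

```r
library(dplyr)
library(stringr)

# Hypothetical sample data
data <- data.frame(txt = c("  a   b ", "c  d"),
                   n = 1:2,
                   stringsAsFactors = FALSE)

# Same pattern as above, with str_squish() in place of str_trim()
squished <- data %>%
  mutate_if(is.character, str_squish)

squished$txt   # "a b" "c d"
```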

Picking up on Fremzy and the comment from Stamper, this is now my handy routine for cleaning up whitespace in data:

df <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE)

As others have noted this changes all types to character. In my work, I first determine the types available in the original and conversions required. After trimming, I re-apply the types needed.
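A rough sketch of that record, trim, re-apply workflow might look like the following. It assumes every column has a simple base class (integer, numeric, character, logical) that methods::as() can convert back to; the sample data and the names orig_classes and trimmed are made up for illustration:

```r
# Hypothetical sample data with one integer and one character column
df <- data.frame(num = 1:3,
                 txt = c(" a", "b ", " c "),
                 stringsAsFactors = FALSE)

# 1. Record the original class of each column
orig_classes <- vapply(df, function(x) class(x)[1], character(1))

# 2. Trim everything (this turns every column into character)
trimmed <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE)

# 3. Re-apply the recorded classes column by column
for (nm in names(trimmed)) {
  trimmed[[nm]] <- methods::as(trimmed[[nm]], orig_classes[[nm]])
}

str(trimmed)
```

Factors, dates and other richer classes would need their own conversion step, which this sketch does not cover.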

If your original types are OK, apply the solution from MarkusN below https://stackoverflow.com/a/37815274/2200542

Those working with Excel files may wish to explore the readxl package which defaults to trim_ws = TRUE when reading.

Picking up on Fremzy and Mielniczuk, I came to the following solution:

data.frame(lapply(df, function(x) if (is.character(x)) trimws(x) else x), stringsAsFactors = FALSE)

It works for mixed numeric/character data frames and manipulates only the character columns.

R is simply not the right tool for a file of this size. However, you have two options:

Use ffdfdply with the ff and ffbase packages:

library(ff)
library(ffbase)
# your_file (path to the CSV) and splits (number of chunks) must be defined first
x <- read.csv.ffdf(file=your_file,header=TRUE, VERBOSE=TRUE,
                 first.rows=1e4, next.rows=5e4)
x$split = as.ff(rep(seq(splits),each=nrow(x)/splits))
ffdfdply( x, x$split , BATCHBYTES=0,function(myData)
             apply(myData,2,function(x)gsub('\\s+', '',x)))

Use sed (my preference)

sed -i -r 's/(\S)\s+(\S)/\1\2/g;s/^\s+//;s/\s+$//' your_file

If you're dealing with large data sets like this, you could really benefit from the speed of data.table.

library(data.table)

setDT(df)

for (j in names(df)) set(df, j = j, value = trimws(df[[j]]))

I would expect this to be the fastest solution. This line of code uses data.table's set() function, which loops over columns really fast. There is a nice explanation here: Fast looping with set.

You could use the trimws function (available since R 3.2.0) on all the columns.

myData[,c(1)]=trimws(myData[,c(1)])

You can loop this for all the columns in your dataset. It has good performance with large datasets as well.

If you want to maintain the variable classes in your data.frame, you should know that using apply will clobber them, because it outputs a matrix in which all variables are converted to either character or numeric. Building upon the code of Fremzy and Anthony Simon Mielniczuk, you can loop through the columns of your data.frame and trim the whitespace off only the columns of class factor or character (and leave the other columns' classes alone):

for (i in names(mydata)) {
  # only touch factor and character columns
  if (is.factor(mydata[[i]]) || is.character(mydata[[i]])) {
    mydata[[i]] <- trimws(mydata[[i]])
  }
}
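One caveat: trimws() returns a character vector, so a factor column that goes through the loop above comes back as character. If the factor class itself should survive, a possible variant is to trim the factor's levels rather than its values. This is only a sketch on made-up sample data:

```r
# Hypothetical sample data: a factor, a character and an integer column
mydata <- data.frame(f = factor(c(" a", "b ", " a")),
                     s = c(" x ", "y", "z "),
                     n = 1:3,
                     stringsAsFactors = FALSE)

for (i in names(mydata)) {
  col <- mydata[[i]]
  if (is.factor(col)) {
    # trim the level labels; the column stays a factor
    levels(mydata[[i]]) <- trimws(levels(col))
  } else if (is.character(col)) {
    mydata[[i]] <- trimws(col)
  }
}

str(mydata)
```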

I think that a simple approach with sapply also works, given a df like:

dat<-data.frame(S=LETTERS[1:10],
            M=LETTERS[11:20],
            X=c(rep("A:A",3),"?","A:A ",rep("G:G",5)),
            Y=c(rep("T:T",4),"T:T ",rep("C:C",5)),
            Z=c(rep("T:T",4),"T:T ",rep("C:C",5)),
            N=c(1:3,'4 ','5 ',6:10),
            stringsAsFactors = FALSE)

You will notice that dat$N is going to become class character due to '4 ' & '5 ' (you can check with class(dat$N))

To get rid of the spaces in the numeric column, simply convert it to numeric with as.numeric or as.integer.

dat$N<-as.numeric(dat$N)

If you want to trim the leading and trailing spaces in every column, do:

dat.b<-as.data.frame(sapply(dat,trimws),stringsAsFactors = FALSE)

And again use as.numeric on column N (because sapply will convert it to character):

dat.b$N<-as.numeric(dat.b$N)