Removing Whitespace From a Whole Data Frame in R

I've been trying to remove the white space that I have in a data frame (using R). The data frame is large (>1 GB) and has multiple columns that contain whi

10 Answers

北荒

2020-12-05 03:32
If I understood you correctly, you want to remove all the white space from the entire data frame. I guess the code you are using is good for removing spaces in the column names. I think you should try this:
```
 apply(myData,2,function(x)gsub('\\s+', '',x))
```
Hope this works.

This returns a matrix, however. If you want a data frame back, do:
```
as.data.frame(apply(myData,2,function(x)gsub('\\s+', '',x)))
```
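One caveat of the apply approach is that apply first converts the data frame to a matrix, so every column comes back as character. A minimal sketch on a made-up two-column data frame (the name myData matches the answer above; its contents are assumed for illustration):

```r
# Toy data (hypothetical): one character column and one numeric column
myData <- data.frame(txt = c(" a b ", "c  d"), num = c(1.5, 20),
                     stringsAsFactors = FALSE)

# apply() coerces the data frame to a character matrix before calling gsub()
m <- apply(myData, 2, function(x) gsub("\\s+", "", x))
is.matrix(m)   # the result is a matrix, not a data frame
typeof(m)      # the numeric column has been coerced to character

# Converting back gives a data frame, but the columns stay character
res <- as.data.frame(m, stringsAsFactors = FALSE)
sapply(res, class)
```

If the numeric columns matter, they have to be converted back afterwards, or the cleaning has to be restricted to character columns.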
EDIT in 2020:

Using lapply with the trimws function (whose default, which = "both", trims both ends) removes leading and trailing spaces but not spaces inside the string. Since the OP provided no input data, I am adding a dummy example to produce the results.

DATA:
```
df <- data.frame(val = c(" abc"," kl m","dfsd "),val1 = c("klm ","gdfs","123"),num=1:3,num1=2:4,stringsAsFactors = FALSE)
```
## Situation 1 (base R): when we want to remove spaces only at the leading and trailing ends, NOT inside the string values, we can use trimws
```
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[cols_to_be_rectified] <- lapply(df[cols_to_be_rectified], trimws)
```
## Situation 2 (base R): when we want to remove spaces everywhere in the character columns of the data frame (inside a string as well as at the leading and trailing ends)

(This was the initial solution, proposed using apply. Note that a solution using apply seems to work but would be very slow. Also, the question does not make it clear whether the OP wanted to remove only leading/trailing blanks or every blank in the data.)
```
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[cols_to_be_rectified] <- lapply(df[cols_to_be_rectified], function(x) gsub('\\s+', '', x))
```
## Situation 1 (data.table): removing only leading and trailing blanks
```
library(data.table)
setDT(df)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[,c(cols_to_be_rectified) := lapply(.SD, trimws), .SDcols = cols_to_be_rectified]
```
Output from situation 1:
```
    val val1 num num1
1:  abc  klm   1    2
2: kl m gdfs   2    3
3: dfsd  123   3    4
```
## Situation 2 (data.table): removing every blank, inside the string as well as leading/trailing
```
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[,c(cols_to_be_rectified) := lapply(.SD, function(x)gsub('\\s+', '', x)), .SDcols = cols_to_be_rectified]
```
Output from situation 2:
```
    val val1 num num1
1:  abc  klm   1    2
2:  klm gdfs   2    3
3: dfsd  123   3    4
```
Note the difference between the two outputs in row 2: trimws removes only the leading and trailing blanks ("kl m"), while the regex solution removes every blank ("klm").

I hope this helps. Thanks.
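The two base R situations can also be folded into one small helper. This is only a sketch: the function name strip_ws and its inside argument are made up for illustration and are not part of any package.

```r
# Hypothetical helper: clean whitespace in the character columns of a data frame.
# inside = FALSE trims leading/trailing whitespace only (trimws);
# inside = TRUE removes every run of whitespace (gsub).
strip_ws <- function(d, inside = FALSE) {
  is_chr <- vapply(d, is.character, logical(1))
  if (any(is_chr)) {
    d[is_chr] <- lapply(d[is_chr], function(x) {
      if (inside) gsub("\\s+", "", x) else trimws(x)
    })
  }
  d
}

# Assumed toy data, similar to the dummy example above
demo <- data.frame(val = c(" abc", " kl m", "dfsd "), num = 1:3,
                   stringsAsFactors = FALSE)

strip_ws(demo)$val                 # trims the ends, keeps the inner blank in "kl m"
strip_ws(demo, inside = TRUE)$val  # removes the inner blank as well
```

Numeric columns are left untouched because only columns that pass is.character are rewritten.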
太阳男子

2020-12-05 03:32

You could use the trimws function (available since R 3.2.0) on all the columns.

```
myData[, 1] <- trimws(myData[, 1])
```

You can loop this over all the columns in your dataset. It performs well with large datasets too.
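The loop mentioned above might look like the following sketch. The toy myData is assumed, and the is.character check is an addition: trimws returns a character vector, so calling it on a numeric column would turn that column into text.

```r
# Toy data (hypothetical) with two character columns and one numeric column
myData <- data.frame(a = c(" x ", "y "), b = c(1, 2), c = c(" p q", "r"),
                     stringsAsFactors = FALSE)

# Trim each character column in place; leave other column types alone
for (col in names(myData)) {
  if (is.character(myData[[col]])) {
    myData[[col]] <- trimws(myData[[col]])
  }
}

str(myData)
```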


执笔经年

2020-12-05 03:39

R is simply not the right tool for a file of this size. However, you have 2 options:

Option 1: use ffdfdply from the ff and ffbase packages

```
library(ff)
library(ffbase)

# Read the file in chunks into an on-disk ffdf
x <- read.csv.ffdf(file = your_file, header = TRUE, VERBOSE = TRUE,
                   first.rows = 1e4, next.rows = 5e4)

# `splits` (the number of chunks) must be defined beforehand
x$split <- as.ff(rep(seq(splits), each = nrow(x) / splits))

# Strip whitespace chunk by chunk
ffdfdply(x, x$split, BATCHBYTES = 0, function(myData)
  apply(myData, 2, function(x) gsub('\\s+', '', x)))
```

Option 2: use sed (my preference; GNU sed syntax)

```
sed -i -r 's/(\S)\s+(\S)/\1\2/g; s/^\s+//; s/\s+$//' your_file
```
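If only leading and trailing whitespace needs to go, another possibility is to strip it while reading the file rather than afterwards. A minimal sketch using the strip.white argument of base R's read.csv, with an inline CSV string standing in for the real file (the column names and values are made up):

```r
# Inline CSV standing in for a real file (hypothetical contents)
csv_text <- "name,score\n  alice  ,1\nbob ,2"

# strip.white = TRUE drops leading/trailing whitespace from unquoted fields
d <- read.csv(text = csv_text, strip.white = TRUE, stringsAsFactors = FALSE)

d$name        # the padding around the names is gone
class(d$score)
```

data.table::fread has a strip.white argument as well, which may be worth trying for files in the gigabyte range.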


陌清茗

2020-12-05 03:40
Picking up on Fremzy and Mielniczuk, I came to the following solution:
```
data.frame(lapply(df, function(x) if (is.character(x)) trimws(x) else x), stringsAsFactors = FALSE)
```
It works for mixed numeric/character data frames and manipulates only the character columns.
粉色の甜心

2020-12-05 03:41
I think a simple approach with sapply also works, given a df like:
```
dat<-data.frame(S=LETTERS[1:10],
            M=LETTERS[11:20],
            X=c(rep("A:A",3),"?","A:A ",rep("G:G",5)),
            Y=c(rep("T:T",4),"T:T ",rep("C:C",5)),
            Z=c(rep("T:T",4),"T:T ",rep("C:C",5)),
            N=c(1:3,'4 ','5 ',6:10),
            stringsAsFactors = FALSE)
```
You will notice that dat$N becomes class character because of '4 ' and '5 ' (you can check with class(dat$N)).

To get rid of the spaces in the numeric column, simply convert it with as.numeric or as.integer:

```
dat$N <- as.numeric(dat$N)
```

If you want to trim the leading and trailing spaces in every column, do:
```
dat.b<-as.data.frame(sapply(dat,trimws),stringsAsFactors = FALSE)
```
And again use as.numeric on column N (because sapply will convert it to character):
```
dat.b$N<-as.numeric(dat.b$N)
```
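When several columns should become numeric again, converting them one by one gets tedious. As a sketch, base R's type.convert can re-detect the type of each column after trimming; the data frame toy below is a made-up stand-in for dat.b:

```r
# Hypothetical data: every column is character, some values carry stray spaces
toy <- data.frame(S = c("A ", " B"), N = c("4 ", "5 "),
                  stringsAsFactors = FALSE)

# Trim each column, then let type.convert pick a type for it;
# as.is = TRUE keeps text columns as character instead of making factors
toy[] <- lapply(toy, function(x) type.convert(trimws(x), as.is = TRUE))

sapply(toy, class)
```

Here N comes back as an integer column while S stays character, so no per-column as.numeric call is needed.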
不思量自难忘°

2020-12-05 03:44
One possibility involving just dplyr could be:
```
data %>%
 mutate_if(is.character, trimws)
```
Or considering that all variables are of class character:
```
data %>%
 mutate_all(trimws)
```
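In dplyr 1.0.0 and later, the scoped verbs mutate_if and mutate_all are superseded by across. A sketch of the equivalent call on a made-up data frame (the contents of data are assumed for illustration):

```r
library(dplyr)

# Toy data (hypothetical): one character column and one integer column
data <- data.frame(x = c(" a ", "b c "), n = 1:2, stringsAsFactors = FALSE)

# Apply trimws to every character column, leaving other columns untouched
out <- data %>%
  mutate(across(where(is.character), trimws))

out
```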