The best way to mark (split?) dataset in each string

Posted by 两盒软妹~` on 2019-12-12 22:13:44

Question


I have a dataset of 485k strings (1.1 GB). Each string is about 700 characters long and holds about 250 variables (1-16 characters per variable), but there are no delimiters between them. The length of each variable is known. What is the best way to mark up the data by inserting a , between variables?


For example: I have strings like:

0123456789012...
1234567890123...    

and an array of lengths: 5,3,1,4,... then I should get output like this:

01234,567,8,9012,...
12345,678,9,0123,...

Could anyone help me with this? Python or R solutions are preferred.


Answer 1:


Pandas could load this using read_fwf:

In [321]:

import io
import pandas as pd

t="""0123456789012..."""
pd.read_fwf(io.StringIO(t), widths=[5,3,1,4], header=None)
Out[321]:
      0    1  2     3
0  1234  567  8  9012

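This gives you a DataFrame, so you can access each column individually for whatever purpose you require. Note that the first column is parsed as an integer here, so the leading zero of 01234 is dropped; passing dtype=str to read_fwf keeps every field as text.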
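To get the comma-marked text the question asks for, one possible sketch reads every field as text and writes the result back out as CSV in chunks. The sample lines and the widths come from the question; the helper name fwf_to_csv and the chunksize value are assumptions made for illustration, not part of pandas or of the original answer.

```python
import io
import pandas as pd

def fwf_to_csv(src, dst, widths, chunksize=50_000):
    """Read fixed-width text from src and write comma-separated text to dst.

    fwf_to_csv is a hypothetical helper written for this example.
    """
    # dtype=str keeps every field as text, so "01234" is not parsed as 1234.
    # chunksize makes read_fwf return an iterator of DataFrames, so a large
    # file does not have to be loaded in one piece.
    reader = pd.read_fwf(src, widths=widths, header=None, dtype=str,
                         chunksize=chunksize)
    for chunk in reader:
        chunk.to_csv(dst, header=False, index=False)

# Two sample lines from the question (without the elided tail).
sample = io.StringIO("0123456789012\n1234567890123\n")
out = io.StringIO()
fwf_to_csv(sample, out, widths=[5, 3, 1, 4])
result = out.getvalue()
print(result)
```

With real files, src and dst would be a path and an open file handle instead of StringIO objects. Keep in mind that read_fwf strips surrounding whitespace from each field and that to_csv quotes any field that itself contains a comma.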




Answer 2:


Try this in R:

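x <- "0123456789012"

y <- c(5,3,1,4)

# start and end position of each field
ends <- cumsum(y)
starts <- c(1, head(ends, -1) + 1)

# extract the fields and join them with commas (collapse, not sep)
output <- paste(substring(x, starts, ends), collapse=",")
# "01234,567,8,9012"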



Answer 3:


In R read.fwf would work:

# inputs
x <- c("0123456789012...", "1234567890123... ")
widths <- c(5,3,1,4)

read.fwf(textConnection(x), widths, colClasses = "character")

giving:

     V1  V2 V3   V4
1 01234 567  8 9012
2 12345 678  9 0123

If numeric rather than character columns were desired then drop the colClasses argument.




Answer 4:


One option in R is

indx1 <- c(1, cumsum(len)[-length(len)]+1)
indx2 <- cumsum(len)
toString(vapply(seq_along(len), function(i)
         substr(str1, indx1[i], indx2[i]), character(1)))
#[1] "01234, 567, 8, 9012"

Data (define these first):

str1 <- '0123456789012'
len <- c(5,3,1,4)
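
Since the question also mentions Python, the same start/end index idea can be sketched with the standard library alone. This is a minimal sketch, not a tuned implementation: the function names field_slices, mark_line and mark_stream are made up for illustration, and Python slices are 0-based and end-exclusive, unlike the 1-based positions used by substr in R.

```python
import io
from itertools import accumulate

def field_slices(widths):
    """Turn field widths into slice objects, e.g. 5,3,1,4 -> [0:5], [5:8], [8:9], [9:13]."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return [slice(s, e) for s, e in zip(starts, ends)]

def mark_line(line, slices, sep=","):
    """Cut one fixed-width line at the given slices and join the pieces with sep."""
    return sep.join(line[s] for s in slices)

def mark_stream(src, dst, widths, sep=","):
    """Process src line by line so the whole file never has to fit in memory."""
    slices = field_slices(widths)  # computed once, reused for every line
    for line in src:
        dst.write(mark_line(line.rstrip("\r\n"), slices, sep) + "\n")

# Sample lines and widths from the question (without the elided tail).
src = io.StringIO("0123456789012\n1234567890123\n")
dst = io.StringIO()
mark_stream(src, dst, [5, 3, 1, 4])
print(dst.getvalue())
```

For the real dataset, src and dst would be files opened for reading and writing. Anything on a line beyond the last width is dropped, so the widths list should cover the full record.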


Source: https://stackoverflow.com/questions/29800023/the-best-way-to-mark-split-dataset-in-each-string
