The best way to mark (split?) dataset in each string

Question

I have a dataset containing 485k strings (1.1 GB). Each string is about 700 characters long and holds about 250 variables (1-16 characters per variable), but there are no delimiters between them. The length of each variable is known. What is the best way to split the data and mark the boundaries with the symbol , ?

For example: I have strings like:

0123456789012...
1234567890123...

and an array of lengths: 5,3,1,4,... The result should look like this:

01234,567,8,9012,...
12345,678,9,0123,...

Could anyone help me with this? Python or R tools are preferred.

Answer 1:

Pandas could load this using read_fwf:

import io
import pandas as pd

t = """0123456789012..."""
pd.read_fwf(io.StringIO(t), widths=[5, 3, 1, 4], header=None)

Out:
      0    1  2     3
0  1234  567  8  9012

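This gives you a dataframe in which each column can be accessed individually for whatever purpose you require. Note that column 0 shows 1234 rather than 01234 because the values were inferred as integers; passing dtype=str to read_fwf keeps the fields as text and preserves leading zeros.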
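
If a dataframe is not needed and the goal is only to insert the commas, the same split can be sketched with plain string slicing. This is a minimal sketch, not part of the original answer: the names split_fixed_width and convert_file and the file paths passed to them are hypothetical, and it assumes one record per line.

```python
from itertools import accumulate


def split_fixed_width(line, widths):
    """Cut `line` at the given field widths and join the pieces with commas."""
    ends = list(accumulate(widths))   # widths 5, 3, 1, 4 -> ends 5, 8, 9, 13
    starts = [0] + ends[:-1]          # matching starts   -> 0, 5, 8, 9
    return ",".join(line[s:e] for s, e in zip(starts, ends))


def convert_file(src_path, dst_path, widths):
    """Process the input line by line so the whole file is never held in memory."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(split_fixed_width(line.rstrip("\n"), widths) + "\n")


print(split_fixed_width("0123456789012", [5, 3, 1, 4]))  # 01234,567,8,9012
```

Because the fields stay strings throughout, leading zeros survive, and streaming line by line keeps memory use flat for a file of this size.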

Answer 2:

Try this in R:

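x <- "0123456789012"
y <- c(5,3,1,4)

# start and end position of each field
ends <- cumsum(y)
starts <- c(1, head(ends, -1) + 1)
# collapse (not sep) joins the elements of a single vector
output <- paste(substring(x, starts, ends), collapse = ",")
# [1] "01234,567,8,9012"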

Answer 3:

In R, read.fwf would work:

# inputs
x <- c("0123456789012...", "1234567890123... ")
widths <- c(5,3,1,4)

read.fwf(textConnection(x), widths, colClasses = "character")

giving:

     V1  V2 V3   V4
1 01234 567  8 9012
2 12345 678  9 0123

If numeric rather than character columns are desired, drop the colClasses argument; leading zeros, such as the one in 01234, are then lost.

Answer 4:

One option in R is

indx1 <- c(1, cumsum(len)[-length(len)]+1)
indx2 <- cumsum(len)
toString(vapply(seq_along(len), function(i)
         substr(str1, indx1[i], indx2[i]), character(1)))
#[1] "01234, 567, 8, 9012"

data

str1 <- '0123456789012'
len <- c(5,3,1,4)
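
Joining the pieces with a bare comma assumes that no field itself contains a comma. A hedged Python sketch of the same split that lets the standard csv module quote such fields is shown below; the function name fixed_width_to_csv and the second sample line (which contains a comma) are made up for illustration.

```python
import csv
import io
from itertools import accumulate


def fixed_width_to_csv(lines, widths, out):
    """Write each fixed-width line to `out` as one CSV row."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        # csv.writer quotes any field that contains the delimiter
        writer.writerow([line[s:e] for s, e in zip(starts, ends)])


buf = io.StringIO()
# The second line is hypothetical data with a comma inside the first field.
fixed_width_to_csv(["0123456789012", "12,4567890123"], [5, 3, 1, 4], buf)
print(buf.getvalue())
# 01234,567,8,9012
# "12,45",678,9,0123
```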

Source: https://stackoverflow.com/questions/29800023/the-best-way-to-mark-split-dataset-in-each-string

Tags

python

string

split

dataset