How to preprocess and load a “big data” TSV file into a Python dataframe?


谎友^ 2020-12-06 23:00

I am currently trying to import the following large tab-delimited file into a dataframe-like structure within Python---naturally I am using pandas dataframe, th

3 answers

醉梦人生 (original poster)

2020-12-06 23:18

You can do this more cleanly, entirely in pandas.

Suppose you have two independent data frames with only one overlapping column:

>>> df1
   A  B
0  1  2
>>> df2
   B  C
1  3  4

You can use pd.concat to concatenate them:

>>> pd.concat([df1, df2])
    A  B   C
0   1  2 NaN
1 NaN  3   4

You can see NaN is created for row values that do not exist.

This can easily be applied to your example data without preprocessing at all:

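import pandas as pd

# fn is the path to the tab-delimited file
frames = []
with open(fn) as f_in:
    for i, line in enumerate(f_in):
        # Turn each "name:value" field of the line into a one-row DataFrame
        line_data = pd.DataFrame({k.strip(): v.strip()
                  for k, _, v in (e.partition(':')
                        for e in line.split('\t'))}, index=[i])
        frames.append(line_data)
# Concatenate once at the end; concatenating inside the loop
# copies the growing frame on every iteration
df = pd.concat(frames)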

>>> df
  Col_01 Col_20 Col_21  Col_22  Col_23 Col_24  Col_25
0     14     25  23432  639142     NaN    NaN     NaN
1      8     25    NaN   25134  243344    NaN     NaN
2     17    NaN     75       5   79876  73453  634534
3     19     25  32425     NaN  989423    NaN     NaN
4     12     25  23424  342421       7  13424      67
5      3     95  32121     NaN     NaN    NaN  111231
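
A variant of the same idea, as a minimal sketch rather than a tested solution for the real file: collect one dict per line and build the DataFrame once, which avoids creating a separate frame for every line. The two sample lines below are hypothetical, written in the same "name:value" layout as the table above, and io.StringIO stands in for open(fn).

```python
import io
import pandas as pd

# Hypothetical two-line sample in the "name:value<TAB>name:value" layout.
sample = (
    "Col_01:14\tCol_20:25\tCol_21:23432\tCol_22:639142\n"
    "Col_01:8\tCol_20:25\tCol_22:25134\tCol_23:243344\n"
)

rows = []
with io.StringIO(sample) as f_in:  # stand-in for open(fn)
    for line in f_in:
        # One dict per line: {column name: value}, both stripped.
        rows.append({
            k.strip(): v.strip()
            for k, _, v in (e.partition(":") for e in line.split("\t"))
        })

# Build the frame once; keys missing from a row become NaN.
df = pd.DataFrame(rows)
print(df)
```

Note that every value is still a string at this point, exactly as in the loop above.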

Alternatively, if your main issue is establishing the desired column order when the columns are added over multiple chunks, read all the column names first (not tested):

# Based on an alphanumeric sort of the example headers,
# which have the form [ALPHA]_[NUM]
headers=set()
with open(fn) as f:
    for line in f:
        for record in line.split('\t'):
            head,_,datum=record.partition(":")
            # strip whitespace so the same header is not added twice
            headers.add(head.strip())
# Sort as you wish; here, by the numeric part after the underscore:
cols=sorted(headers, key=lambda e: int(e.partition('_')[2]))

Pandas will use the order of the list for the column order if given in the initial creation of the DataFrame.
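
To illustrate that last point, here is a small sketch, assuming the lines have already been parsed into dicts. The rows list is hypothetical sample data, not taken from the question's file; the columns= argument of the DataFrame constructor is what fixes the order.

```python
import pandas as pd

# Hypothetical parsed rows; in practice these come from the file.
rows = [
    {"Col_20": "25", "Col_01": "14"},
    {"Col_22": "25134", "Col_01": "8"},
]

# Collect every header seen, then sort by the numeric suffix.
headers = set()
for row in rows:
    headers.update(row)
cols = sorted(headers, key=lambda e: int(e.partition("_")[2]))

# columns= sets the column order; keys missing from a row become NaN.
df = pd.DataFrame(rows, columns=cols)
print(df.columns.tolist())
```

The sort key assumes every header has the [ALPHA]_[NUM] form; a header without a numeric suffix would raise a ValueError in int().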
