I have a large collection of DAT files that need to be converted (eventually to a unique file type). The DATs have a mixed amount of whitespace between fields, and the column
It looks like you can combine the header rows dynamically based on a word's position in the line. You can skip the first two lines, and combine the next two. If you do it right, you will be left with an iterator over a file stream that you can use to process the remainder of the data as you wish. You can convert it to a different format, or even import it into a pandas DataFrame directly.
To get the headers:
import re

def get_words_and_positions(line):
    # Pair each word with the column it starts at
    return [(match.start(), match.group()) for match in re.finditer(r'[\w.]+', line)]
with open('file.dat') as file:
    iterator = iter(file)
    # Skip two lines
    next(iterator)
    next(iterator)
    # Get two header lines
    header = get_words_and_positions(next(iterator)) + \
             get_words_and_positions(next(iterator))
    # Sort by position
    header.sort()
    # Extract words
    header = [word for pos, word in header]
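To see how the position-based merge works, here is a small sketch with two made-up header rows (the column names Time, Temp, Pressure, and Err are invented for illustration):

```python
import re

def get_words_and_positions(line):
    # Pair each word with the column it starts at
    return [(match.start(), match.group()) for match in re.finditer(r'[\w.]+', line)]

# Two hypothetical header rows whose words interleave by column position
row1 = 'Time        Pressure'
row2 = '      Temp          Err'

header = get_words_and_positions(row1) + get_words_and_positions(row2)
header.sort()                       # tuples sort by position first
columns = [word for pos, word in header]
print(columns)                      # → ['Time', 'Temp', 'Pressure', 'Err']
```

Because each entry is a (position, word) tuple, a plain sort interleaves the two rows in left-to-right column order.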
You can now convert the file to a true CSV, or do something else with it. The important thing here is that you have an iterator pointing to the actual data in the file stream, along with a set of dynamically extracted column headers.
To write the remainder to a CSV file, without having to load the entire thing into memory at once, use csv.writer and the iterator from above:
import csv
...
with ...:
    ...
    with open('outfile.csv', 'w', newline='') as output:  # newline='' avoids blank rows on Windows
        writer = csv.writer(output)
        writer.writerow(header)
        for line in iterator:
            # str.split() with no arguments splits on runs of whitespace
            # and drops the leading/trailing whitespace and the newline
            writer.writerow(line.split())
You can combine the nested output with statement and the outer input with statement into a single block to reduce the nesting level:
with open('file.dat') as file, open('outfile.csv', 'w', newline='') as output:
    ...
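Put together, the whole pipeline looks something like this (the two preamble lines, the column layout, and the sample values are invented so the sketch is self-contained; adapt the skip count and filenames to your data):

```python
import csv
import re

def get_words_and_positions(line):
    # Pair each word with the column it starts at
    return [(match.start(), match.group()) for match in re.finditer(r'[\w.]+', line)]

# Create a small sample DAT file so the sketch is runnable on its own
with open('file.dat', 'w') as sample:
    sample.write('Title line\n'
                 'Units line\n'
                 'Time        Pressure\n'
                 '      Temp          Err\n'
                 '1   2.0    3.0   4\n'
                 '5   6.0    7.0   8\n')

with open('file.dat') as file, open('outfile.csv', 'w', newline='') as output:
    iterator = iter(file)
    next(iterator)                      # skip the two preamble lines
    next(iterator)
    header = get_words_and_positions(next(iterator)) + \
             get_words_and_positions(next(iterator))
    header.sort()                       # interleave the two rows by column position
    writer = csv.writer(output)
    writer.writerow([word for pos, word in header])
    for line in iterator:
        writer.writerow(line.split())   # split on any run of whitespace
```

Only one line of data is ever held in memory at a time, so this scales to arbitrarily large DAT files.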
To read the data into a pandas DataFrame, you can pass the file object directly to pandas.read_csv. Since the file stream is already past the headers at this point, it will not give you any issues:
import pandas as pd
...
with ...:
    ...
    df = pd.read_csv(file, sep=r'\s+', header=None, names=header)
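A self-contained sketch of the pandas route, using the same invented sample data as above:

```python
import re
import pandas as pd

def get_words_and_positions(line):
    # Pair each word with the column it starts at
    return [(match.start(), match.group()) for match in re.finditer(r'[\w.]+', line)]

# Small sample DAT file so the sketch is runnable on its own
with open('file.dat', 'w') as sample:
    sample.write('Title line\n'
                 'Units line\n'
                 'Time        Pressure\n'
                 '      Temp          Err\n'
                 '1   2.0    3.0   4\n'
                 '5   6.0    7.0   8\n')

with open('file.dat') as file:
    iterator = iter(file)
    next(iterator)                      # skip the two preamble lines
    next(iterator)
    header = get_words_and_positions(next(iterator)) + \
             get_words_and_positions(next(iterator))
    header.sort()
    header = [word for pos, word in header]
    # The stream is already positioned at the first data row,
    # so read_csv sees only the data
    df = pd.read_csv(file, sep=r'\s+', header=None, names=header)

print(df)
```

read_csv accepts any file-like object, and with sep=r'\s+' it splits each row on runs of whitespace, matching the DAT layout.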