Question
I have a TSV file in which some quoted fields contain newline characters.
111 222 333 "aaa"
444 555 666 "bb
b"
Here the b on the third line belongs to the same field as the bb on the second line, separated from it only by a newline character, so the two lines form a single record:
The fourth value of the first record:
aaa
The fourth value of the second record:
bb
b
If I copy the file and paste it into Excel with Ctrl+C and Ctrl+V, it works well. But how do I parse the file if I want to import it using Python?
I have tried:
import re

lines = [line.rstrip() for line in open("file.tsv")]
for i in range(len(lines)):
    value = re.split(r'\t', lines[i])
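But the result was not good: the record containing the newline was split into two rows.
I want it parsed as a single row with four values.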
Answer 1:
Just use the csv module. It handles the corner cases of CSV files, such as newlines inside quoted fields, and it can split on tabs.
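import csv

# newline="" lets the csv module handle line endings inside quoted fields itself
with open("file.tsv", newline="") as fd:
    rd = csv.reader(fd, delimiter="\t", quotechar='"')
    for row in rd:
        print(row)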
will correctly output:
['111', '222', '333', 'aaa']
['444', '555', '666', 'bb\nb']
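To check this behaviour without a file on disk, the same reader can be pointed at an in-memory string. This is only a sketch: the `sample` text and the `parse_tsv` helper are made up for illustration, and it assumes tab-separated fields with double-quoted values.

```python
import csv
import io

# Hypothetical sample mirroring the question: the last field of the
# second record contains a newline inside double quotes.
sample = '111\t222\t333\t"aaa"\n444\t555\t666\t"bb\nb"\n'


def parse_tsv(text):
    """Parse tab-separated text into a list of rows (illustrative helper)."""
    # newline="" keeps line endings untouched so the csv module sees them as-is.
    with io.StringIO(text, newline="") as fd:
        return list(csv.reader(fd, delimiter="\t", quotechar='"'))


rows = parse_tsv(sample)
for row in rows:
    print(row)
```

The quoted field comes back as one value, `'bb\nb'`, and the surrounding quotes are stripped.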
Answer 2:
Newline characters inside the content (cell) of a .tsv/.csv file are usually enclosed in quotes. If they are not, standard parsers may mistake them for the start of the next row. In your case, the line
for line in open("file.tsv")
automatically uses the newline character as a separator, regardless of quotes.
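The effect can be reproduced on a small in-memory string. This is a minimal sketch, assuming the hypothetical `sample` text below stands in for the file contents:

```python
# Hypothetical sample mirroring the question's file contents.
sample = '111\t222\t333\t"aaa"\n444\t555\t666\t"bb\nb"\n'

# Splitting on line breaks first, as iterating over a file does,
# cuts the quoted field "bb\nb" into two separate rows.
naive_rows = [line.split("\t") for line in sample.splitlines()]

for row in naive_rows:
    print(row)
```

Two records turn into three rows, and the last row holds only the tail of the quoted field.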
If you are sure that the file only has 4 columns, you could simply read the entire text, split it based on tab, and then pull out 4 items at a time.
# Read the entire text and split it on tabs
with open("file.tsv") as fd:
    old_data = fd.read().split('\t')
# Group the items 4 at a time: step through the list by the number of columns
# and slice out sublists of size 4
# Caveat: the newline that ends each row is not a tab, so the last field of a row
# stays joined to the first field of the next row unless rows also end with a tab
new_data = [old_data[i:i+4] for i in range(0, len(old_data), 4)]
Ideally, you should enclose any content that could contain newlines in quotes.
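If the file is produced from Python, the csv module's writer can apply that quoting automatically. The following is a sketch under assumptions: the `rows` data is invented, and an in-memory buffer stands in for a real output file.

```python
import csv
import io

# Hypothetical rows; the last value of the second row contains a newline.
rows = [["111", "222", "333", "aaa"], ["444", "555", "666", "bb\nb"]]

buffer = io.StringIO(newline="")
writer = csv.writer(
    buffer,
    delimiter="\t",
    quotechar='"',
    quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
    lineterminator="\n",
)
writer.writerows(rows)
tsv_text = buffer.getvalue()

# Reading the text back recovers the original rows, newline included.
round_trip = list(csv.reader(io.StringIO(tsv_text, newline=""), delimiter="\t"))
print(round_trip == rows)
```

With `QUOTE_MINIMAL`, only the field containing the newline is wrapped in quotes; the other fields are written bare.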
Answer 3:
# genfromtxt is a NumPy function; its alias in the top-level scipy namespace is deprecated
import numpy as np
data = np.genfromtxt("filename.tsv", delimiter="\t")
Source: https://stackoverflow.com/questions/42358259/how-to-parse-tsv-file-with-python