So I want to convert a simple tab-delimited text file into a CSV file. If I convert the txt file into a string using string.split('\n') I get a list with each list item a
Why you should always use 'rb' mode when reading files with the csv module in Python 2:
Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
What's in the sample file: any old rubbish, including control characters obtained by extracting blobs or whatever from a database, or injudicious use of the CHAR function in Excel formulas, or ...
>>> open('demo.txt', 'rb').read()
'h1\t"h2a\nh2b"\th3\r\nx1\t"x2a\r\nx2b"\tx3\r\ny1\ty2a\x1ay2b\ty3\r\n'
Python follows CP/M, MS-DOS, and Windows when it reads files in text mode: \r\n is recognised as the line separator and is served up as \n, and \x1a (aka Ctrl-Z) is recognised as an END-OF-FILE marker.
>>> open('demo.txt', 'r').read()
'h1\t"h2a\nh2b"\th3\nx1\t"x2a\nx2b"\tx3\ny1\ty2a' # WHOOPS
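The newline half of this behaviour can be reproduced on Python 3 on any platform. The following is a minimal sketch, not part of the original session: `SAMPLE` and `read_both_ways` are names made up for illustration, and the sample omits the \x1a row because the sketch only demonstrates newline translation.

```python
import os
import tempfile

# Hypothetical sample: first two records of the demo file, without the \x1a row.
SAMPLE = b'h1\t"h2a\nh2b"\th3\r\nx1\t"x2a\r\nx2b"\tx3\r\n'


def read_both_ways(data):
    """Write data to a temp file; return (translated, untranslated) text."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Default newline=None: \r\n and \r are translated to \n on read.
        with open(path, "r", encoding="ascii") as f:
            translated = f.read()
        # newline="": line endings are passed through untouched.
        with open(path, "r", encoding="ascii", newline="") as f:
            untranslated = f.read()
    finally:
        os.remove(path)
    return translated, untranslated


translated, untranslated = read_both_ways(SAMPLE)
print(repr(translated))
print(repr(untranslated))
```

The first string has every \r\n collapsed to \n, including the one inside the quoted field; the second matches the bytes on disk.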
The csv module with a file opened in 'rb' mode works as expected:
>>> import csv
>>> list(csv.reader(open('demo.txt', 'rb'), delimiter='\t'))
[['h1', 'h2a\nh2b', 'h3'], ['x1', 'x2a\r\nx2b', 'x3'], ['y1', 'y2a\x1ay2b', 'y3']]
but text mode doesn't; the embedded \r\n becomes \n and the last row is cut off at \x1a:
>>> list(csv.reader(open('demo.txt', 'r'), delimiter='\t'))
[['h1', 'h2a\nh2b', 'h3'], ['x1', 'x2a\nx2b', 'x3'], ['y1', 'y2a']]
>>>
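For the conversion asked about in the question, here is a rough sketch for Python 3, where the csv module expects text-mode files opened with newline='' rather than 'rb'. The function name `tsv_to_csv`, its parameters, and the UTF-8 default are assumptions for illustration, not something from the session above.

```python
import csv


def tsv_to_csv(src_path, dst_path, encoding="utf-8"):
    """Copy rows from a tab-delimited file to a comma-delimited one.

    Both files are opened with newline="" so the csv module sees the raw
    line endings and can keep newlines embedded in quoted fields intact.
    """
    with open(src_path, "r", newline="", encoding=encoding) as src, \
         open(dst_path, "w", newline="", encoding=encoding) as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)  # default dialect: comma-separated
        writer.writerows(reader)
```

Calling `tsv_to_csv('demo.txt', 'demo.csv')` avoids splitting on '\n' by hand, which would break any quoted field that contains a line break.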