Remove new line from CSV file

Question

I want to remove the newline characters inside the field data of a CSV file. The same question has been asked by several people on SO and elsewhere, but the solutions offered are shell scripts. I'm looking for a solution in a programming language such as Python, or in Spark (not only these two), as I have fairly big files.

Previously asked questions on the same topic:

Remove New Line Character from CSV file's string column
Replace new line character between double quotes with space
Remove New Line from CSV file's string column
https://unix.stackexchange.com/questions/222049/how-to-detect-and-remove-newline-character-within-a-column-in-a-csv-file

I have a CSV file of about 1 GB and want to remove the newline characters inside the field data. The schema of the CSV file varies dynamically, so I can't hard-code it. The line break doesn't always appear before a comma; it can appear anywhere, even within a field.

Sample Data:

playerID,yearID,gameNum,gameName,teamName,lgID,GP,startingPos
gomezle01,1933,1,Cricket,Team1,NYA,AL,1
ferreri01,1933,2,Hockey,"This is 
Team2",BOS,AL,1
gehrilo01,1933,3,"Game name is 
Cricket" 
,Team3,NYA,AL,1
gehrich01,1933,4,Hockey,"Here it is 
Team4",DET,AL,1
dykesji01,1933,5,"Game name is 
Hockey"
,"Team name 
Team5",CHA,AL,1

Expected Output:

playerID,yearID,gameNum,gameName,teamName,lgID,GP,startingPos
gomezle01,1933,1,Cricket,Team1,NYA,AL,1
ferreri01,1933,2,Hockey,"This is Team2",BOS,AL,1
gehrilo01,1933,3,"Game name is Cricket" ,Team3,NYA,AL,1
gehrich01,1933,4,Hockey,"Here it is Team4",DET,AL,1
dykesji01,1933,5,"Game name is Hockey","Team name Team5",CHA,AL,1

A newline character can appear in any field's data.

Answer 1:

If you are using PySpark, I would suggest reading the file with sparkContext's wholeTextFiles function, since the file needs to be read as a whole for it to be parsed correctly.

After reading it with wholeTextFiles, parse the text by replacing the end-of-line characters with commas, then apply some additional formatting so that the whole text can be broken into groups of eight strings. Note that the code assumes \r\n line endings and, because it collapses ",," into ",", no empty fields.

import re
rdd = sc.wholeTextFiles("path to your csv file")\
    .map(lambda x: re.sub(r'(?!(([^"]*"){2})*[^"]*$),', ' ', x[1].replace("\r\n", ",").replace(",,", ",")).split(","))\
    .flatMap(lambda x: [x[k:k+8] for k in range(0, len(x), 8)])

You should get the following output:

[u'playerID', u'yearID', u'gameNum', u'gameName', u'teamName', u'lgID', u'GP', u'startingPos']
[u'gomezle01', u'1933', u'1', u'Cricket', u'Team1', u'NYA', u'AL', u'1']
[u'ferreri01', u'1933', u'2', u'Hockey', u'"This is Team2"', u'BOS', u'AL', u'1']
[u'gehrilo01', u'1933', u'3', u'"Game name is Cricket"', u'Team3', u'NYA', u'AL', u'1']
[u'gehrich01', u'1933', u'4', u'Hockey', u'"Here it is Team4"', u'DET', u'AL', u'1']
[u'dykesji01', u'1933', u'5', u'"Game name is Hockey"', u'"Team name Team5"', u'CHA', u'AL', u'1']

If you would like to convert each array row of the RDD into a single string, you can add

.map(lambda x: ", ".join(x))

and you should get

playerID, yearID, gameNum, gameName, teamName, lgID, GP, startingPos
gomezle01, 1933, 1, Cricket, Team1, NYA, AL, 1
ferreri01, 1933, 2, Hockey, "This is Team2", BOS, AL, 1
gehrilo01, 1933, 3, "Game name is Cricket", Team3, NYA, AL, 1
gehrich01, 1933, 4, Hockey, "Here it is Team4", DET, AL, 1
dykesji01, 1933, 5, "Game name is Hockey", "Team name Team5", CHA, AL, 1
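
To check the transformation locally without a Spark cluster, the same three steps can be tried on a small string in plain Python. This is only a sketch: the function name normalize_rows and the sample text are made up for illustration, the column count is passed as a parameter instead of the hard-coded 8, and it inherits the assumptions of the code above (\r\n line endings, no empty fields, no trailing line break).

```python
import re

# Matches a comma that sits inside a double-quoted field,
# i.e. one followed by an odd number of quotes up to the end of the text.
COMMA_IN_QUOTES = re.compile(r'(?!(([^"]*"){2})*[^"]*$),')

def normalize_rows(text, width):
    """Hypothetical helper: flatten the text, then regroup it into rows of `width` fields."""
    # Turn every line break into a comma and collapse doubled commas.
    flat = text.replace("\r\n", ",").replace(",,", ",")
    # Commas that ended up inside quotes were line breaks in a field: make them spaces.
    fields = COMMA_IN_QUOTES.sub(" ", flat).split(",")
    return [fields[k:k + width] for k in range(0, len(fields), width)]

sample = 'id,name\r\na1,"This is\r\nTeam1"\r\nb2,Team2'
rows = normalize_rows(sample, width=2)
for row in rows:
    print(row)
```

Because the lookahead rescans the rest of the text for every comma, this pattern may become slow on very large inputs, so it is worth timing on a realistic sample first.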

Answer 2:

You can use the re, pandas and io modules as follows:

import re
import io
import numpy as np
import pandas as pd

with open('data.csv', 'r') as f:
    data = f.read()
# Drop whitespace and the line break that directly follow a double quote
df = pd.read_csv(io.StringIO(re.sub(r'"\s*\n', '"', data)))

for col in df.columns:  # Replace all line breaks in all textual columns
    if df[col].dtype == np.object_:
        df[col] = df[col].str.replace('\n', '')

In [78]: df
Out[78]:
    playerID   yearID  gameNum  gameName              teamName          lgID  GP  startingPos
0   gomezle01  1933    1        Cricket               Team1             NYA   AL  1
1   ferreri01  1933    2        Hockey                This is Team2     BOS   AL  1
2   gehrilo01  1933    3        Game name is Cricket  Team3             NYA   AL  1
3   gehrich01  1933    4        Hockey                Here it is Team4  DET   AL  1
4   dykesji01  1933    5        Game name is Hockey   Team name Team5   CHA   AL  1

If you want this DataFrame as an output CSV file use:

df.to_csv('./output.csv')
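
The effect of the re.sub pre-cleaning step can be seen on a small string using only the standard library. The sample strings and the preclean wrapper below are invented for illustration. The second case is a caveat worth checking against your own data: if a quoted field is the last one on a line, the same substitution also removes the record separator.

```python
import csv
import io
import re

def preclean(text):
    # Same substitution as in the answer: drop whitespace and the
    # line break that directly follow a double quote.
    return re.sub(r'"\s*\n', '"', text)

# A line break right after a closing quote, before the next comma.
raw = 'a,b,c\n1,"Game name is\nCricket" \n,Team3\n'
cleaned = preclean(raw)
rows = list(csv.reader(io.StringIO(cleaned)))
print(rows)

# Caveat: a quoted field at the end of a record loses its record separator too.
merged = preclean('x,"y"\nz,w\n')
print(repr(merged))
```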

Answer 3:

It could use a bit of cleaning up, but here is some code that does what you want. It works for line breaks within a field and before a comma. If there are more requirements, it can be tweaked:

import csv

with open('data.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    actual_rows = [next(reader)]
    length = len(actual_rows[0])
    real_row = []
    for row in reader:
        if len(row) < length:
            if real_row:
                real_row[-1] += row[0]
                real_row += row[1:]
            else:
                real_row = row
        else:
            real_row = row
        if len(real_row) == length:
            real_row = map(lambda s: s.replace('\n', ' '), real_row)
            # store real_row or use it as needed
            actual_rows.append(list(real_row))
            real_row = []

    print(actual_rows)

I'm storing the corrected rows in actual_rows, but if you don't want to load everything into memory, just use the real_row variable in each loop iteration, at the point indicated by the comment.
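
One way to avoid keeping every row in memory is to turn the loop into a generator that yields each repaired row as soon as it is complete. The following is a minimal sketch of that idea, not the code above verbatim: the name merged_rows and the sample data are made up, and the merging rule is simplified (a pending partial row always absorbs the next row, and a row is emitted once it reaches the header width).

```python
import csv
import io

def merged_rows(lines):
    """Hypothetical helper: yield the header, then one repaired row at a time."""
    reader = csv.reader(lines)
    header = next(reader)
    yield header
    width = len(header)
    pending = []
    for row in reader:
        if not row:  # skip completely blank lines
            continue
        if pending:
            # The record was split: glue the first cell onto the last pending
            # cell and append the remaining cells.
            pending[-1] += row[0]
            pending += row[1:]
        else:
            pending = row
        if len(pending) >= width:
            yield [cell.replace('\n', ' ') for cell in pending]
            pending = []

sample = (
    'playerID,gameName,teamName\n'
    'a,Hockey,"This is\nTeam2"\n'
    'b,"Game name is\nCricket"\n'
    ',Team3\n'
)
rows = list(merged_rows(io.StringIO(sample)))
for row in rows:
    print(row)
```

With a real file, the generator could be fed an open file object and each yielded row passed straight to a csv.writer, so only one record is held at a time.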

Answer 4:

This is a basic approach that applies some simple preprocessing before reading the file through csv.

import csv

def simple_sanitize(data):
    result = []
    for i, a in enumerate(data):
        if i + 1 != len(data) and data[i + 1][0] == ',':
            a = a.replace('\n', '')
            result.append(a + data[i + 1])
        elif a[0] != ',':
            result.append(a)
    return result

with open('test.csv', 'r') as f:
    data = f.readlines()
sdata = simple_sanitize(data)

with open('out.csv','w') as f:
    for row in sdata:
        f.write(row)

result = [list(val.replace('\n', '') for val in line) for line in csv.reader(open('out.csv', 'r'))]

print(result)

Result :

[['playerID', 'yearID', 'gameNum', 'gameName', 'teamName', 'lgID', 'GP', 'startingPos'], 
['gomezle01', '1933', '1', 'Cricket', 'Team1', 'NYA', 'AL', '1'], 
['ferreri01', '1933', '2', 'Hockey', 'This is Team2', 'BOS', 'AL', '1'], 
['gehrilo01', '1933', '3', 'Game name is Cricket ', 'Team3', 'NYA', 'AL', '1'], 
['gehrich01', '1933', '4', 'Hockey', 'Here it is Team4', 'DET', 'AL', '1'], 
['dykesji01', '1933', '5', 'Game name is Hockey', 'Team name Team5', 'CHA', 'AL', '1']]

Answer 5:

The basic idea in this solution is to get fixed length chunks (of length equal to the number of columns in the first row) using the grouper recipe. Since it doesn't read the entire file at once, it wouldn't blow up your memory usage with large files.

$ cat a.py
import csv, itertools as it, operator as op

def grouper(iterable, n):
    return it.zip_longest(*[iter(iterable)] * n)

with open('in.csv') as inf, open('out.csv', 'w', newline='') as outf:
    r, w = csv.reader(inf), csv.writer(outf)
    hdr = next(r)
    w.writerow(hdr)
    # Flatten all cells, strip embedded newlines, drop empty cells, regroup by header width
    cells = map(op.methodcaller('replace', '\n', ''), it.chain.from_iterable(r))
    for row in grouper(filter(bool, cells), len(hdr)):
        w.writerow(row)

$ python3 a.py
$ cat out.csv
playerID,yearID,gameNum,gameName,teamName,lgID,GP,startingPos
gomezle01,1933,1,Cricket,Team1,NYA,AL,1
ferreri01,1933,2,Hockey,This is Team2,BOS,AL,1
gehrilo01,1933,3,Game name is Cricket ,Team3,NYA,AL,1
gehrich01,1933,4,Hockey,Here it is Team4,DET,AL,1
dykesji01,1933,5,Game name is Hockey,Team name Team5,CHA,AL,1

One assumption being made here is the absence of empty cells in the input csv.
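
The reason for that assumption can be seen in a small sketch with made-up cell values: filter(bool, ...) discards empty strings, so a genuinely empty cell is dropped and every later cell shifts one position to the left.

```python
import itertools as it

def grouper(iterable, n):
    # Collect items into fixed-length tuples; the last one is padded with None.
    return it.zip_longest(*[iter(iterable)] * n)

# No empty cells: the flat stream regroups cleanly into rows of three.
cells = ['a', '1', 'x', 'b', '2', 'y']
clean = list(grouper(filter(bool, cells), 3))
print(clean)

# One empty cell: it is filtered out, so the remaining cells shift left.
cells_with_gap = ['a', '', 'x', 'b', '2', 'y']
shifted = list(grouper(filter(bool, cells_with_gap), 3))
print(shifted)
```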

Source: https://stackoverflow.com/questions/48970822/remove-new-line-from-csv-file

Tags

python

csv

apache-spark

newline