Question
I have a directory of roughly 600 CSV files that contain Twitter data with multiple fields of various types (ints, floats, and strings). I have a script that can merge the files together, but the string fields can contain commas and are not quoted, which causes the string fields to split and forces text onto new lines. Is it possible to quote the strings in each file and then merge them into a single file? Below is the script I use to merge the files and some sample data.
Merger script:
%%time
import csv
import glob
from tqdm import tqdm

with open('C:\Python\Scripts\Test_tweets\Test_output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output, quoting=csv.QUOTE_NONNUMERIC)
    write_header = True

    for filename in tqdm(glob.glob(r'C:\Python\Scripts\Test_tweets\*.csv')):
        with open(filename, 'rb') as f_input:
            csv_input = csv.reader(f_input)
            header = next(csv_input)

            if write_header:
                csv_output.writerow(header)
                write_header = False

            for row in tqdm(csv_input):
                row = row[:7] + [','.join(row[7:])]

                # Skip rows with insufficient values
                if len(row) > 7:
                    row[1] = float(row[1])
                    row[5] = float(row[5])
                    row[6] = float(row[6])
                    csv_output.writerow(row)
Sample data:
2014-02-07T00:25:40Z,431584511542198272,FalseAlarm_xox,en,-,-81.4994315,35.3268904,is still get hair done,Is Still Getting Hair Done
2014-02-07T00:25:40Z,431584511525003265,enabrkovic,en,-,-85.40364208,40.19369368,i had no class todai why did i wait 630 to start do everyth,I had no classes today why did I wait 630 to start doing EVERYTHING
2014-02-07T00:25:41Z,431584515757457408,_beacl,pt,-,-48.05338676,-16.02483911,passei o dia com o meu amor comemo demai <3 @guugaraujo,passei o dia com o meu amor, comemos demais ❤️ @guugaraujo
2014-02-07T00:25:42Z,431584519930396672,aprihasanah,in,-,106.9224971,-6.2441371,4 hari ngga ada kepsek rasanya nyaman bgt kerjaan juga lebih teratur tp skalinya doi masuk administrasi kacau balau lg yanasib,4 hari ngga ada kepsek rasanya nyaman bgt. kerjaan juga lebih teratur. tp skalinya doi masuk, administrasi kacau balau lg. yanasib >_<"
2014-02-07T00:25:42Z,431584519951749120,MLEFFin_awesome,en,-,-77.20315866,39.08811105,never a dull moment with emma <3 /MLEFFin_awesome/status/431584519951749120/photo/1,Never a dull moment with Emma 💗 /0Wfs5VqfVz
2014-02-07T00:25:43Z,431584524120510464,mimiey_natasya,en,-,103.3596089,3.9210196,good morn,Good morning...
2014-02-07T00:25:43Z,431584524124684288,louykins,en,-,-86.06823257,41.74938946,that Oikos commerci with @johnstamos @bobsaget and @davecoulier is better than my whole life #takesmeback #youcankissmeanytimejohn,That Oikos commercial with @JohnStamos, @bobsaget, and @DaveCoulier is better than my whole life. #takesmeback #youcankissmeanytimejohn
2014-02-07T00:25:44Z,431584528306421760,savannachristy4,en,-,-79.99920285,39.65367864,rememb when we would go to club zoo :D,Remember when we would go to club zoo??😂😂😂
2014-02-07T00:25:44Z,431584528302231553,janiya_monet,en,-,-83.62028684,39.20591822,@itscourtney_365 thei call,@ItsCourtney_365 they. Called.
2014-02-07T00:25:44Z,431584528302223360,norastanky,en,-,-118.09849064,33.79394737,when you see your hometown in your english book /norastanky/status/431584528302223360/photo/1,When you see your hometown in your english book>> /XHRFymLFp4
2014-02-07T00:25:46Z,431584536703799296,Ericb1980,en,-,-82.32639648,27.92373599,i'm at longhorn steakhouse brandon fl .com/1bzZsrp,I'm at LongHorn Steakhouse (Brandon, FL) /YdCJKXmSmN
2014-02-07T00:25:46Z,431584536695410688,repokempt,en,-,37.40298473,55.96248794,@tonichopchop moron drive me nut,@tonichopchop Morons. Drives me nuts!
2014-02-07T00:25:47Z,431584540889317377,BeeNiabee6,en,-,-82.494139,27.4908062,my god sister got drink,My God sister got drinking
2014-02-08T00:00:01Z,4.3194E+17,NewarkWeather,in,-,-75.68444444,39.695,02 07 @19 00 temp 31.0 f wc 31.0 f wind 0.0 mph gust 0.0 mph bar 30.358 in rise rain 0.00 in hum 68 uv 0.0 solarrad 0,02/07@19:00 - Temp 31.0F, WC 31.0F. Wind 0.0mph ---, Gust 0.0mph. Bar 30.358in, Rising. Rain 0.00in. Hum 68%. UV 0.0. SolarRad 0.,,,,,,,,,,,,,,
2014-02-08T00:00:02Z,4.3194E+17,bastianwr,in,-,106.11073,-2.1198,happi weekend at sman 1 pangkalpinang https://path.com/p/1zjYtB,Happy Weekend! (at SMAN 1 Pangkalpinang) — /9U86N1BmD6,,,,,,,,,,,,,,,,,
2014-02-08T00:00:03Z,4.3194E+17,izaklast,en,-,-109.9176369,31.40244847,dihydrogen monoxid is good for you Watermill express .com/1bxHT81,Dihydrogen monoxide is good for you (@ Watermill Express) /IvfiuNHigM,,,,,,,,,,,,,,,,,
2014-02-08T00:00:03Z,4.3194E+17,blackbestpeople,tr,-,29.21950004,40.91441821,okulda özlediyim sadec kantindeki kakayolu süd,Okulda özlediyim sadece kantindeki kakayolu süd,,,,,,,,,,,,,,,,,
2014-02-08T00:00:03Z,4.3194E+17,Hakooo03,tr,-,3.72651687,51.06650946,gta v oynar katliam cikartirim bend,Gta v oynar katliam cikartirim bende !,,,,,,,,,,,,,,,,,
2014-02-08T00:00:03Z,4.3194E+17,piaras_14,en,-,-6.21720811,54.11456545,@blainmcg17 wee hornbal #taughtyouwell /piaras_14/status/431940452770934784/photo/1,@blainmcg17 wee hornball #taughtyouwell /C6yGymDoyl,,,,,,,,,,,,,,,,,
2014-02-08T00:00:04Z,4.3194E+17,PPompita,es,-,9.3215546,40.315019,@enrique305 esto es perfecto uauh yo y mi hermano v a ny al concierto lo enamorado 15feb desd italia solo para ti /PPompita/status/431940456973619200/photo/1,@enrique305 Esto es Perfecto uauh yo y mi hermano V a NY al concierto Los Enamorados 15Feb desde Italia solo para ti. /OrYYE2zN80,,,,,,,,,,,,,,,,,
2014-02-08T00:00:05Z,4.3194E+17,NickMontesdeoca,und,-,-71.34854858,42.63122899,<3,😍,,,,,,,,,,,,,,,,,
2014-02-08T00:00:05Z,4.3194E+17,Askin28Furkan,tr,-,28.6281946,41.0166627,birakma beni insanlar kötü bırakma beni korkuyorumm,Birakma beni insanlar kötü, bırakma beni korkuyorumm,,,,,,,,,,,,,,,,
2014-02-08T00:00:05Z,4.3194E+17,mumfy98,en,-,-75.59400911,43.08187836,i just want a horse,I just want a horse!!,,,,,,,,,,,,,,,,,
2014-02-08T00:00:05Z,4.3194E+17,Pitmedden_Weath,en,-,-2.18416667,57.33888889,wind 7.2 mph s Barometer 979.9 hpa fall temperature 2.6 c rain todai 0.0 mm forecast stormi much precipitation,Wind 7.2mph S. Barometer 979.9hPa, Falling. Temperature 2.6°C. Rain today 0.0mm. Forecast Stormy, much precipitation,,,,,,,,,,,,,,,
2014-02-08T00:00:06Z,4.3194E+17,BoeBaFett,en,-,-79.0129325,33.794075,2 whole hour still no repli,2 whole hours... still no reply,,,,,,,,,,,,,,,,,
Answer 1:
If you are ok with merging the last two fields into a single string, then the following approach should work:
- Use a variable to determine if the header needs to be written. The header is always read first (using next()). If the variable is True, write the header, else discard it.
- First strip the row and split it on , at most seven times. This preserves the last two string fields as a single value.
- Next use a function to try to convert each field into either an integer or a float.
- Use the csv quoting=csv.QUOTE_NONNUMERIC option to force quoting on all the remaining string values.
This can be done as follows:
import csv

def get_number(value):
    "Convert numeric strings into ints and floats"
    try:
        value = int(value)
    except ValueError:
        try:
            value = float(value)
        except ValueError:
            pass
    return value

# 'wb' is the Python 2 idiom; on Python 3 use open('output.csv', 'w', newline='')
with open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output, quoting=csv.QUOTE_NONNUMERIC)
    write_header = True

    with open('sample.csv') as f_input:
        header = next(f_input).strip().split(',')

        if write_header:
            csv_output.writerow(header)
            write_header = False

        for row in f_input:
            # Split at most 7 times so the trailing text stays in one field
            row = [get_number(value) for value in row.strip().split(',', 7)]
            csv_output.writerow(row)
This would give you output starting:
"1/1/1",1,"username1","en","-",-39.0,162,"Dreamlike. Semi-sensical. Sort of terrifying. The site is less a Twitter toy than a disturbing peer into my subconscious.,Dreamlike. Semi-sensical. Sort of terrifying. The site is less a Twitter toy than a disturbing peer into my subconscious."
"1/1/2",2,"username2","en","-",84.0,147,"The results are, predictably, hilarious. I couldn't have said it better myself,The results are, predictably, hilarious. I couldn't have said it better myself"
"1/1/3",3,"username3","en","-",-22.0,-180,"This site is providing some good laughs this morning here at the Twitter office.,This site is providing some good laughs this morning here at the Twitter office."
"1/1/4",4,"username4","en","-",-28.0,-49,"You can image what something like this might look like five, ten or twenty years from now, as our technical capabilities improve,You can image what something like this might look like five, ten or twenty years from now, as our technical capabilities improve"
This approach could then be extended to work on your multiple input files.
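As a rough sketch of that extension, the split-based approach could be wrapped in a function and run over every file matched by a glob pattern. This version assumes Python 3 (text mode with newline='' instead of 'wb') and UTF-8 input; the function name merge_csv_files and its parameters are made up for illustration and are not part of the question or of any library.

```python
import csv
import glob
import os


def get_number(value):
    """Convert numeric strings into ints or floats; return other text unchanged."""
    for convert in (int, float):
        try:
            return convert(value)
        except ValueError:
            pass
    return value


def merge_csv_files(pattern, output_path, max_splits=7):
    """Merge all files matching pattern into output_path, quoting every string."""
    # Build the file list first and leave out the output file itself,
    # in case it lives in the same folder as the inputs.
    target = os.path.abspath(output_path)
    filenames = [name for name in sorted(glob.glob(pattern))
                 if os.path.abspath(name) != target]

    # Python 3: csv.writer expects a text-mode file opened with newline=''.
    # UTF-8 is an assumption about how the tweet files are encoded.
    with open(output_path, 'w', newline='', encoding='utf-8') as f_output:
        csv_output = csv.writer(f_output, quoting=csv.QUOTE_NONNUMERIC)
        write_header = True

        for filename in filenames:
            with open(filename, encoding='utf-8') as f_input:
                header = next(f_input, None)
                if header is None:
                    continue  # Empty file: nothing to merge
                if write_header:
                    csv_output.writerow(header.strip().split(','))
                    write_header = False

                for line in f_input:
                    if not line.strip():
                        continue  # Skip blank lines
                    # Split at most max_splits times so that commas inside
                    # the trailing text stay in a single field.
                    fields = line.strip().split(',', max_splits)
                    csv_output.writerow([get_number(field) for field in fields])
```

It could be called as merge_csv_files(r'C:\Python\Scripts\Test_tweets\*.csv', 'output.csv'). One caveat: Python 3's int() and float() also accept underscores between digits, so a username such as 12_34 would be written as a number; converting only the known numeric columns avoids that.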
If some of your data is already quoted, and the ints and floats are in known columns, then a different approach is needed. The sample data only shows non-quoted data.
import csv

with open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output, quoting=csv.QUOTE_NONNUMERIC)
    write_header = True

    with open('sample.csv', 'rb') as f_input:
        csv_input = csv.reader(f_input)
        header = next(csv_input)

        if write_header:
            csv_output.writerow(header)
            write_header = False

        for row in csv_input:
            row = row[:7] + [','.join(row[7:])]

            # Skip rows with insufficient values
            if len(row) > 7:
                row[1] = int(row[1])
                row[5] = float(row[5])
                row[6] = float(row[6])
                csv_output.writerow(row)
To work with multiple files, you need to add a loop to read each CSV filename, for example:
import csv
import glob

with open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output, quoting=csv.QUOTE_NONNUMERIC)
    write_header = True

    for filename in glob.glob(r'C:\Python\Scripts\Test_tweets\*.csv'):
        with open(filename, 'rb') as f_input:
            csv_input = csv.reader(f_input)
            header = next(csv_input)

            if write_header:
                csv_output.writerow(header)
                write_header = False

            for row in csv_input:
                row = row[:7] + [','.join(row[7:])]

                # Skip rows with insufficient values
                if len(row) > 7:
                    row[1] = int(row[1])
                    row[5] = float(row[5])
                    row[6] = float(row[6])
                    csv_output.writerow(row)
Note: don't forget to prefix your folder string with r to stop Python from treating the \ characters as the start of escape sequences.
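A small illustration of the difference, using a made-up path chosen so that it happens to contain the escape sequences \n and \t (the folder and file names are hypothetical):

```python
# Hypothetical Windows path whose folder names begin with "n" and "t".
plain = 'C:\new_tweets\table.csv'   # \n and \t are turned into newline and tab
raw = r'C:\new_tweets\table.csv'    # raw string: backslashes are kept literally

print('\n' in plain)                        # True
print('\n' in raw)                          # False
print(raw == 'C:\\new_tweets\\table.csv')   # True
```

Without the r prefix the path silently changes, so the file would not be found under the intended name.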
Answer 2:
The sample data is corrupt: fields that contain commas are not quoted. Correct data:
1,2,3,"Value with separator (,) must be in quotes",Value without comma
See https://tools.ietf.org/html/rfc4180
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
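For comparison, Python's csv module applies this quoting rule by default when writing. The following is a minimal sketch, assuming Python 3 and using an in-memory buffer instead of a real file:

```python
import csv
import io

buffer = io.StringIO()
# The default quoting mode (csv.QUOTE_MINIMAL) quotes only fields that need it.
writer = csv.writer(buffer)
writer.writerow([1, 2, 3, 'Value with separator (,) must be in quotes', 'Value without comma'])

line = buffer.getvalue()
print(line.strip())
# 1,2,3,"Value with separator (,) must be in quotes",Value without comma

# Reading the line back yields five fields, with the comma kept inside the fourth.
fields = next(csv.reader(io.StringIO(line)))
print(len(fields))  # 5
```

Data written this way can be read back with csv.reader without the text field being split.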
Source: https://stackoverflow.com/questions/55369361/python-quote-strings-in-multiple-csvs-and-merge-files-together