问题
So I've got two CSV files that I'm trying to compare and get the results of the similar items. The first file, hosts.csv is shown below:
Path Filename Size Signature
C:\ a.txt 14kb 012345
D:\ b.txt 99kb 678910
C:\ c.txt 44kb 111213
The second file, masterlist.csv is shown below:
Filename Signature
b.txt 678910
x.txt 111213
b.txt 777777
c.txt 999999
As you can see the rows do not match up and the masterlist.csv is always larger than the hosts.csv file. The only portion that I'd like to search for is the Signature portion. I know this would look something like:
hosts[3] == masterlist[1]
I am looking for a solution that will give me something like the following (basically the hosts.csv file with a new RESULTS column):
Path Filename Size Signature RESULTS
C:\ a.txt 14kb 012345 NOT FOUND in masterlist
D:\ b.txt 99kb 678910 FOUND in masterlist (row 1)
C:\ c.txt 44kb 111213 FOUND in masterlist (row 2)
I've searched the posts and found something similar to this here but I don't quite understand it as I'm still learning python.
Edit Using Python 2.6
回答1:
Edit: While my solution works correctly, check out Martijn's answer below for a more efficient solution.
You can find the documentation for the python CSV module here.
What you're looking for is something like this:
import csv
f1 = file('hosts.csv', 'r')
f2 = file('masterlist.csv', 'r')
f3 = file('results.csv', 'w')
c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)
masterlist = list(c2)
for hosts_row in c1:
row = 1
found = False
for master_row in masterlist:
results_row = hosts_row
if hosts_row[3] == master_row[1]:
results_row.append('FOUND in master list (row ' + str(row) + ')')
found = True
break
row = row + 1
if not found:
results_row.append('NOT FOUND in master list')
c3.writerow(results_row)
f1.close()
f2.close()
f3.close()
回答2:
The answer by srgerg is terribly inefficient, as it operates in quadratic time; here is a linear time solution instead, using Python 2.6-compatible syntax:
import csv
with open('masterlist.csv', 'rb') as master:
master_indices = dict((r[1], i) for i, r in enumerate(csv.reader(master)))
with open('hosts.csv', 'rb') as hosts:
with open('results.csv', 'wb') as results:
reader = csv.reader(hosts)
writer = csv.writer(results)
writer.writerow(next(reader, []) + ['RESULTS'])
for row in reader:
index = master_indices.get(row[3])
if index is not None:
message = 'FOUND in master list (row {})'.format(index)
else:
message = 'NOT FOUND in master list'
writer.writerow(row + [message])
This produces a dictionary, mapping signatures from masterlist.csv
to a line number first. Lookups in a dictionary take constant time, making the second loop over hosts.csv
rows independant from the number of rows in masterlist.csv
. Not to mention code that's a lot simpler.
For those using Python 3, the above only needs to have the open()
calls adjusted to open in text mode (remove the b
from the file mode), and you want to add new line=''
so the CSV reader can take control of line separators. You may want to state the encoding to use explicitly rather than rely on your system default (use encoding=...
). The master_indices
mapping can be built with a dictionary comprehension ({r[1]: i for i, r in enumerate(csv.reader(master))}
).
回答3:
Python's CSV and collections module, specifically OrderedDict, are really helpful here. You want to use OrderedDict to preserve the order of the keys, etc. You don't have to, but it's useful!
import csv
from collections import OrderedDict
signature_row_map = OrderedDict()
with open('hosts.csv') as file_object:
for line in csv.DictReader(file_object, delimiter='\t'):
signature_row_map[line['Signature']] = {'line': line, 'found_at': None}
with open('masterlist.csv') as file_object:
for i, line in enumerate(csv.DictReader(file_object, delimiter='\t'), 1):
if line['Signature'] in signature_row_map:
signature_row_map[line['Signature']]['found_at'] = i
with open('newhosts.csv', 'w') as file_object:
fieldnames = ['Path', 'Filename', 'Size', 'Signature', 'RESULTS']
writer = csv.DictWriter(file_object, fieldnames, delimiter='\t')
writer.writer.writerow(fieldnames)
for signature_info in signature_row_map.itervalues():
result = '{0} FOUND in masterlist {1}'
# explicit check for sentinel
if signature_info['found_at'] is not None:
result = result.format('', '(row %s)' % signature_info['found_at'])
else:
result = result.format('NOT', '')
payload = signature_info['line']
payload['RESULTS'] = result
writer.writerow(payload)
Here's the output using your test CSV files:
Path Filename Size Signature RESULTS
C:\ a.txt 14kb 012345 NOT FOUND in masterlist
D:\ b.txt 99kb 678910 FOUND in masterlist (row 1)
C:\ c.txt 44kb 111213 FOUND in masterlist (row 2)
Please excuse the misalignment, they are tab separated :)
回答4:
The csv
module comes in handy in parsing csv files. But just for fun, I am simply splitting the input on whitespace to get at the data.
Just parse in the data, build a dict
for the data in masterlist.csv with the signature as key and the line number as value. Now, for each row of hosts.csv, we can just query the dict
and find out whether or not a corresponding entry exists in masterlist.csv and if so at which line.
#! /usr/bin/env python
def read_data(filename):
input_source=open(filename,'r')
input_source.readline()
return [line.split() for line in input_source]
if __name__=='__main__':
hosts=read_data('hosts.csv')
masterlist=read_data('masterlist.csv')
master=dict()
for index,data in enumerate(masterlist):
master[data[-1]]=index+1
for row in hosts:
try:
found="FOUND in masterlist (row %s)"%master[row[-1]]
except KeyError:
found="NOT FOUND in masterlist"
line=row+[found]
print "%s %s %s %s %s"%tuple(line)
回答5:
I just fixed a small thing in Martijn Pieters code in order to make it work in Python 3, and in this code, I am trying to match the first column elements in the file1 row[0]
with the first column elements in file2 row[0]
.
import csv
with open('file1.csv', 'rt', encoding='utf-8') as master:
master_indices = dict((r[0], i) for i, r in enumerate(csv.reader(master)))
with open('file2.csv', 'rt', encoding='utf-8') as hosts:
with open('result.csv', 'w') as results:
reader = csv.reader(hosts)
writer = csv.writer(results)
writer.writerow(next(reader, []) + ['RESULTS'])
for row in reader:
index = master_indices.get(row[0])
if index is not None:
message = 'FOUND in master list (row {})'.format(index)
writer.writerow(row + [message])
else:
message = 'NOT FOUND in master list'
writer.writerow(row + [message])
results.close()
来源:https://stackoverflow.com/questions/5268929/compare-two-csv-files-and-search-for-similar-items