Compare two files for differences in python

ぐ巨炮叔叔 提交于 2019-12-04 06:06:36

问题


I want to compare two files (take line from first file and look up in whole second file) to see differences between them and write missing line from fileA.txt to end of fileB.txt. I am new to python so at first time I thought abou simple program like this:

import difflib

file1 = "fileA.txt"
file2 = "fileB.txt"

diff = difflib.ndiff(open(file1).readlines(),open(file2).readlines())
print ''.join(diff),

but in result I have got a combination of two files with suitable tags for each line. I know that I can look for line start with tag "-" and then write it to end of file fileB.txt, but with huge file (~100 MB) this method will be inefficient. Can somebody help me to improve program?

File structure will be like this:

input:

fileA.txt

Oct  9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct  9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct  9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root
Oct  9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0)

fileB.txt

    Oct  9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2
Oct  9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root
Oct  9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct  9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct  9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)

Output:

fileB_after.txt

Oct  9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2
Oct  9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root
Oct  9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct  9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct  9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root
Oct  9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0)

回答1:


read in two files and convert to set

find union of two sets
sort union set based on time
join set to string with new line

import datetime
import 
file1 = "fileA.txt"
file2 = "fileB.txt"

with open(file1 ,'rb') as f:
  sa = set( line for line in f )
with open(file2 ,'rb') as f:
  sb = set( line for line in f )
print '\n'.join( sorted( sa.union(sb), key = lambda x: datetime.datetime.strptime( ' '.join( x.split()[:3]), '%b %d %H:%M:%S' )) )



Oct  9 12:19:16 user sshd[12744]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 12:19:16 user sshd[12744]: Accepted password for root from 213.XXX.XXX.XX7 port 60554 ssh2
Oct  9 13:24:42 user sshd[12744]: pam_unix(sshd:session): session closed for user root
Oct  9 13:24:42 user sshd[12744]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 13:25:31 user sshd[12844]: Accepted password for root from 213.XXX.XXX.XX7 port 33254 ssh2
Oct  9 13:25:31 user sshd[12844]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:35:48 user sshd[12868]: Accepted password for root from 213.XXX.XXX.XX7 port 33574 ssh2
Oct  9 13:35:48 user sshd[12868]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct  9 13:46:58 user sshd[12844]: pam_unix(sshd:session): session closed for user root
Oct  9 13:46:58 user sshd[12844]: Received disconnect from 213.XXX.XXX.XX7: 11: disconnected by user
Oct  9 15:47:58 user sshd[12868]: pam_unix(sshd:session): session closed for user root
Oct 11 22:17:31 user sshd[2655]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 11 22:17:31 user sshd[2655]: Accepted password for root from 17X.XXX.XXX.X19 port 5567 ssh2



回答2:


Try with this in the bash:

cat fileA.txt fileB.txt | sort -M | uniq > new_file.txt

sort -M: sorts based on initial string, consisting of any amount of whitespace, followed by a month name abbreviation, is folded to UPPER case and compared in the order 'JAN' < 'FEB' < ... < 'DEC'. Invalid names compare low to valid names. The `LC_TIME' locale determines the month spellings.

uniq: filters out repeated lines in a file.

|: passes the output of one command to another for further processing.

What this will do is take the two files, sort them in the way described above, keep the unique items and store them in new_file.txt

Note: This is not a python solution but you have tagged the question with linux so I thought it might interest you. Also you can find more detailed info about the commands used, here.



来源:https://stackoverflow.com/questions/40029985/compare-two-files-for-differences-in-python

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!