Remove Duplicates from Text File

Posted by 寵の児 on 2019-11-27 22:50:47

Question


I want to remove duplicate lines from a text file.

I have a text file that contains entries like the following:

None_None

ConfigHandler_56663624
ConfigHandler_56663624
ConfigHandler_56663624
ConfigHandler_56663624

None_None

ColumnConverter_56963312
ColumnConverter_56963312

PredicatesFactory_56963424
PredicatesFactory_56963424

PredicateConverter_56963648
PredicateConverter_56963648

ConfigHandler_80134888
ConfigHandler_80134888
ConfigHandler_80134888
ConfigHandler_80134888

The resulting output needs to be:

None_None

ConfigHandler_56663624

ColumnConverter_56963312

PredicatesFactory_56963424

PredicateConverter_56963648

ConfigHandler_80134888

I have tried just this command: en = set(open('file.txt')), but it does not work.

Could anyone help me with how to extract only the unique entries from the file?

Thank you


Answer 1:


Here's an option that preserves order (unlike a set) but otherwise has the same behaviour (note that the EOL character is deliberately stripped and blank lines are ignored)...

from collections import OrderedDict

with open('/home/jon/testdata.txt') as fin:
    lines = (line.rstrip() for line in fin)      # strip the EOL characters
    unique_lines = OrderedDict.fromkeys(line for line in lines if line)  # skip blanks, keep first-seen order

print(list(unique_lines.keys()))
# ['None_None', 'ConfigHandler_56663624', 'ColumnConverter_56963312', 'PredicatesFactory_56963424', 'PredicateConverter_56963648', 'ConfigHandler_80134888']

Then you just need to write the above to your output file.
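For completeness, a minimal sketch of that last step (the output path 'output.txt' is a hypothetical name):

# write the ordered unique lines back out; 'output.txt' is a hypothetical path
with open('output.txt', 'w') as fout:
    fout.write('\n'.join(unique_lines) + '\n')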




Answer 2:


Here is a simple solution using sets to remove the duplicates from the text file.

# read all lines and dedupe them with a set
with open('workfile.txt') as f:
    lines_set = set(f.readlines())

# rewrite the file with only the unique lines
with open('workfile.txt', 'w') as out:
    for line in lines_set:
        out.write(line)

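Note that a set does not preserve the original line order; if order matters, use the OrderedDict approach from Answer 1 instead.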


Answer 3:


Here's how you can do it with sets (unordered results):

from pprint import pprint

with open('input.txt', 'r') as f:
    pprint(set(f.readlines()))

Additionally, you may want to get rid of the newline characters.
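For instance, a minimal sketch (reusing the pprint import above) that strips each line's trailing newline before deduplicating:

# same idea, but strip the trailing newline from each line first
with open('input.txt') as f:
    pprint({line.rstrip('\n') for line in f})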




Answer 4:


If you just want de-duplicated output, you can use sort and uniq:

hvn@lappy: /tmp () $ sort -nr dup | uniq
PredicatesFactory_56963424
PredicateConverter_56963648
None_None
ConfigHandler_80134888
ConfigHandler_56663624
ColumnConverter_56963312
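
Note that uniq only removes adjacent duplicate lines, which is why the input is sorted first; sort -u dup would produce the same unique set in a single step.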

For Python:

In [2]: with open("dup", 'rt') as f:
   ...:     lines = f.readlines()
   ...:

In [3]: lines
Out[3]: 
['None_None\n',
 '\n',
 'ConfigHandler_56663624\n',
 'ConfigHandler_56663624\n',
 'ConfigHandler_56663624\n',
 'ConfigHandler_56663624\n',
 '\n',
 'None_None\n',
 '\n',
 'ColumnConverter_56963312\n',
 'ColumnConverter_56963312\n',
 '\n',
 'PredicatesFactory_56963424\n',
 'PredicatesFactory_56963424\n',
 '\n',
 'PredicateConverter_56963648\n',
 'PredicateConverter_56963648\n',
 '\n',
 'ConfigHandler_80134888\n',
 'ConfigHandler_80134888\n',
 'ConfigHandler_80134888\n',
 'ConfigHandler_80134888\n']

In [4]: set(lines)
Out[4]: 
set(['ColumnConverter_56963312\n',
     '\n',
     'PredicatesFactory_56963424\n',
     'ConfigHandler_56663624\n',
     'PredicateConverter_56963648\n',
     'ConfigHandler_80134888\n',
     'None_None\n'])
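
A minimal sketch of writing that unique set back out (the output name 'dup_unique' is hypothetical):

# write the unique lines (order not preserved) to a new file;
# 'dup_unique' is a hypothetical output name
with open("dup_unique", "w") as out:
    out.writelines(set(lines))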



Answer 5:


uniq = set()
with open('yourfile') as myfile:
    for p in myfile:
        if p in uniq:
            print("duplicate : " + p.rstrip())  # report repeated lines
        else:
            uniq.add(p)
print(uniq)



Answer 6:


This approach rewrites the file in place, so the de-duplicated output replaces the original file.

import os
import uuid

def _remove_duplicates(filePath):
    # read all lines and dedupe them with a set
    with open(filePath, 'r') as f:
        lines_set = set(f.readlines())
    # write the unique lines to a temporary file, then swap it into place
    tmp_file = str(uuid.uuid4())
    with open(tmp_file, 'w') as out:
        for line in lines_set:
            out.write(line)
    os.rename(tmp_file, filePath)
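
Usage, with a hypothetical file path:

_remove_duplicates('workfile.txt')  # replaces workfile.txt with its de-duplicated contents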



Answer 7:


def remove_duplicates(infile):
    storehouse = set()                        # lines seen so far
    with open('outfile.txt', 'w+') as out:
        with open(infile) as fin:
            for line in fin:
                if line not in storehouse:
                    out.write(line)           # keep only the first occurrence
                    storehouse.add(line)

remove_duplicates('infile.txt')


Source: https://stackoverflow.com/questions/15830290/remove-duplicates-from-text-file
