Unix Bash Remove Duplicate Lines From Directory Files?

Submitted on 2019-12-12 04:16:12

Question


I have a directory with a few hundred txt files. I need to remove all duplicate lines from each of the existing files. Every line in the entire directory should be unique regardless of the file it's in, so I need to compare and check each file against the other. Is this possible to do without altering the existing file structure? The file names need to stay the same.

Let's say all the files are in directory "foo" and the total size of the directory is 30mb.

I think I can do this through comm or awk, but I haven't found a working command line to do this and I'm unfamiliar with the syntax.

UPDATE: I have tried this line, which I believe prints all the duplicates in the shell, but it does not remove the duplicates from the files.

awk 'NR==FNR{a[$0]="";next}; !($0 in a)' tmp/*
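To see what that one-liner actually does, here is a minimal sketch (the sample files are made up for illustration): with NR==FNR it loads the first file into array a, then prints, to stdout only, those lines of the remaining files that are absent from the first file. No input file is modified.

```shell
# Sample data (hypothetical, for illustration only)
mkdir -p tmp
printf 'a\nb\n' > tmp/1.txt
printf 'b\nc\n' > tmp/2.txt

# NR==FNR is true only while the first file is read: its lines fill array a.
# For every later file, lines not found in a are printed -- to stdout only.
awk 'NR==FNR{a[$0]="";next}; !($0 in a)' tmp/*
# prints: c
# tmp/1.txt and tmp/2.txt are left unchanged
```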

Answer 1:


awk '{
    if (FNR == 1) {                    # starting a new input file
        if (fs != lfn && NR != 1) {    # previous file produced no unique line
            b[lfn]                     # mark it to be emptied later
        }
        lfn = FILENAME                 # remember the current file name
    }
    if (!($0 in a)) {                  # first occurrence of this line anywhere
        a[$0]
        print $0 > FILENAME            # rewrite the current file with unique lines
        fs = FILENAME                  # flag: this file has at least one unique line
    }
}
END {
    if (fs != lfn) {                   # the last file may also need emptying
        b[FILENAME]
    }
    for (i in b) {                     # truncate every file with no unique lines
        close(i)
        printf "" > i
    }
}' tmp/*

1st Condition:

if (!($0 in a)) {
    a[$0]
    print $0 > FILENAME
    fs = FILENAME
}

If the current line $0 is not already in array a, it is added to a and printed to the current file; otherwise it is ignored. The awk built-in variable FILENAME gives the name of the file being read. If at least one unique line is found in the current file, the flag fs is set to FILENAME.
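The effect of this condition on its own can be seen with a quick stand-alone test:

```shell
# Stand-alone illustration of the duplicate filter: only the first
# occurrence of each line survives.
printf 'x\ny\nx\ny\nz\n' | awk '!($0 in a) { a[$0]; print }'
# prints:
# x
# y
# z
```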

2nd Condition:

if (FNR == 1) {
    if (fs != lfn && NR != 1) {
        b[lfn]
    }
    lfn = FILENAME
}

So when the next file starts being read (FNR==1), fs (the last file that contained a unique line) and lfn (the last file name) are compared; if they differ, the previous file had no unique lines, and an entry with index lfn is created in array b. (These files will later be truncated to empty.)
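Note that referencing b[lfn] with no assignment is enough to create the entry; the array is used purely as a set of file names, as this stand-alone test shows:

```shell
# Referencing an array element in awk creates it, even without an
# assignment, so b works as a set of file names to truncate later.
awk 'BEGIN { b["foo.txt"]; for (i in b) print i " is marked" }'
# prints: foo.txt is marked
```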

END block:

END {
    if (fs != lfn) {
        b[FILENAME]
    }
    for (i in b) {
        close(i)
        printf "" > i
    }
}

In the END block, the check from condition 2 is applied once more, to find out whether the last file contained a unique line. The loop over array b then truncates to empty every file in which no unique lines were found. I have assumed nothing about the order in which the files are read.

This script is not optimal, but it will do the job.
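A simpler variant of the same idea (a sketch, not part of the original answer; the .dedup suffix and the sample foo/ directory are my assumptions) writes each file's unique lines to a temporary sibling and then moves it back, which avoids rewriting a file while awk is still reading it:

```shell
# Sample data (hypothetical, for illustration only)
mkdir -p foo
printf 'one\ntwo\n'   > foo/a.txt
printf 'two\nthree\n' > foo/b.txt

# Pass 1: write unique lines to <file>.dedup; the printf "" guarantees a
# .dedup file exists (possibly empty) for every input file.
awk '
    FNR == 1 {
        if (out) close(out)        # avoid running out of file descriptors
        out = FILENAME ".dedup"
        printf "" > out            # create/truncate, even if all lines are dupes
    }
    !seen[$0]++ { print > out }    # keep the first occurrence across all files
' foo/*

# Pass 2: move the temporaries back over the originals.
for f in foo/*.dedup; do
    mv "$f" "${f%.dedup}"
done
```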



来源:https://stackoverflow.com/questions/34020528/unix-bash-remove-duplicate-lines-from-directory-files
