uniq does not return unique strings from a file

◇◆丶佛笑我妖孽 submitted on 2021-02-05 07:29:04

Question


Please take a look at the following. All I want is to get the node_name string from the file and run uniq so that identical strings are reduced to one.

In the following example, however, I get two identical lines (node_name moon1).

Why does the uniq command not reduce them to one line? I really do not understand why.

 sed 's/=/ /g' file | awk -v WORD=node_name '$2 == WORD {print $0}' | awk '{print $2" "$3}' | sort | uniq

 node_name moon1
 node_name moon1

The file:

  .
  .

 /home/home1/home2/config1/COMPANY/my.config      node_name=moon1
 /home/home1/home2/config1/MEAS2XML/my.config.tmp node_name=moon1

  .
  .

Answer 1:


There are two main causes for uniq failing to merge identical lines:

  • The lines are not really identical - extra trailing whitespace is a usual suspect here. The awk commands should filter those in this example, but you can check that anyway. Mixed-up line terminators are also a possible cause.

  • The lines are not adjacent - sort can and will fail to sort lines "properly" if the locale settings are incorrect. I have seen weird cases with non-ASCII text where identical lines will not end up one after the other. Check the output of sort, before the uniq command. What are your locale settings?
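A quick way to check the first cause is to make the invisible bytes visible. The sketch below is only an illustration: the sample file is made up (two lines that look identical, the second ending in CR/LF), and `od -c` is used because it is specified by POSIX; on GNU systems `cat -A` gives a similar view.

```shell
# Made-up sample: two lines that look the same, but the second ends in CR/LF.
sample=$(mktemp)
printf 'node_name moon1\nnode_name moon1\r\n' > "$sample"

# od -c prints every byte, so a stray \r or a trailing space becomes visible.
od -c "$sample"

# Count the lines that contain a carriage return.
cr=$(printf '\r')
cr_lines=$(grep -c "$cr" "$sample")
echo "lines containing CR: $cr_lines"

# Print the locale variables that influence how sort orders lines.
printf 'LANG=%s LC_ALL=%s LC_COLLATE=%s\n' "${LANG:-}" "${LC_ALL:-}" "${LC_COLLATE:-}"

rm -f "$sample"
```

To inspect real data, pipe the output of sort (before uniq) into `od -c` instead of the sample file.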

BTW, 'sort -u' is a better alternative to 'sort | uniq'.

EDIT:

Seems like one or both of these issues:

  • You have mixed line terminators. If some of your lines end in LF (\n, Unix-style terminators) and some in CR/LF (\r\n, DOS-style terminators), uniq will treat them as different lines, even if they are otherwise identical.

  • Trailing whitespace in some of your lines along with CR/LF DOS-style line terminators. The CR (carriage return, '\r') character is not considered whitespace by most (all?) Unix utilities, including awk. If one of your lines does not have any other trailing whitespace, the CR will be considered part of its last field and be printed out. On the other hand, in a line with whitespace between the last field and the CR, the last field as printed by awk would not contain the CR.
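Both cases can be reproduced with made-up input. This is a sketch, not the asker's actual data: the first command feeds uniq the same text with an LF and then a CR/LF terminator; the second prints two CR/LF lines through awk so that `od -c` shows where the CR ends up. `LC_ALL=C` is set only to keep the comparison byte-based. The awk result assumes an implementation that, as described above, does not treat CR as whitespace.

```shell
# Made-up input: the same text twice, first with LF, then with CR/LF.
mixed_count=$(printf 'node_name moon1\nnode_name moon1\r\n' \
  | LC_ALL=C sort | LC_ALL=C uniq | grep -c 'node_name')
echo "lines left after uniq: $mixed_count"

# Two CR/LF lines; only the second has a space before the CR.
# od -c shows whether the CR stayed inside the last printed field.
printf 'node_name moon1\r\nnode_name moon1 \r\n' \
  | awk '{print $1" "$2}' | od -c
```

The first command reports 2 lines, because the trailing CR makes the second line differ from the first by one byte.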

Changing the CR/LF line terminator to LF will solve both issues in this case, although it's generally best to filter trailing whitespace as well:

  • dos2unix is the preferred way
  • As an alternative, filter your file through sed 's|\r$||'
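If dos2unix is not available, the cleanup can also be done inside the pipeline. The following is a minimal sketch under stated assumptions: the input file is recreated from the two lines in the question, with CR/LF terminators and one trailing space added by me to mimic the suspected problem. The POSIX class `[[:space:]]` matches the CR as well as spaces and tabs, so a single substitution removes both.

```shell
# Recreate the two lines from the question with CR/LF terminators (assumed);
# the second line also gets an assumed trailing space.
input=$(mktemp)
printf '/home/home1/home2/config1/COMPANY/my.config      node_name=moon1\r\n' > "$input"
printf '/home/home1/home2/config1/MEAS2XML/my.config.tmp node_name=moon1 \r\n' >> "$input"

# Strip trailing whitespace (including CR), split on '=', keep node_name pairs.
cleaned=$(sed -e 's/[[:space:]]*$//' -e 's/=/ /g' "$input" \
  | awk '$2 == "node_name" {print $2, $3}' \
  | LC_ALL=C sort -u)
echo "$cleaned"

rm -f "$input"
```

With the cleanup in front, both lines become byte-identical and sort -u leaves a single `node_name moon1`.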



Answer 2:


Sounds like you have stray characters in your file. Clean it first using:

dos2unix your_file

Also, unrelated to your problem, but you can replace sort | uniq with simply sort -u.




Answer 3:


I haven't tried the command you specified in your question, but ran the following instead:

cat foo|cut -d \= -f 2|sort |uniq

where "foo" is a file containing the 2 lines in your example. The output of the above is "moon1".

This is simpler than your example because I assume that there is only one 'name=value' pair per line; I don't know anything about your file format.
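Note that the cut approach is just as sensitive to stray carriage returns as the original pipeline. A hedged variation, assuming the only problem is CR/LF terminators (the CR on the second line is my assumption, not something known about the real file), deletes the CR with tr first:

```shell
# "foo" recreated from the question; the CR on the second line is assumed.
foo=$(mktemp)
printf '/home/home1/home2/config1/COMPANY/my.config      node_name=moon1\n' > "$foo"
printf '/home/home1/home2/config1/MEAS2XML/my.config.tmp node_name=moon1\r\n' >> "$foo"

# Delete every CR, take the text after '=', then sort and deduplicate.
values=$(tr -d '\r' < "$foo" | cut -d= -f2 | LC_ALL=C sort -u)
echo "$values"

rm -f "$foo"
```

This does not remove trailing spaces, so it only helps when the line terminators are the sole difference.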

Hope this helps anyway...




Answer 4:


I had a similar problem, but in addition to removing duplicate lines I wanted to keep the original order of the lines. Combining sort and uniq defeats this purpose.

Luckily, awk provides a solution:

$ awk '!x[$0]++' filename.txt
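How the one-liner works: the array element for the current line is 0 (empty) the first time the line is seen, so the negated post-increment is true and awk prints the line; on every later occurrence the counter is already non-zero and the line is skipped. A small sketch with made-up input, using the array name `seen` instead of `x` for readability:

```shell
# Made-up input with duplicates in an order that sort would change.
deduped=$(printf 'b\na\nb\nc\na\n' | awk '!seen[$0]++')
echo "$deduped"
```

The output keeps first occurrences in their original order: b, a, c. As with uniq, lines that differ only by a trailing CR or space still count as different.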




Source: https://stackoverflow.com/questions/4247791/uniq-not-get-uniq-strings-from-file
