How to generate a list of (unique) words from a text file in Ubuntu?

左心房为你撑大大i · submitted 2020-01-01 04:54:15

Question


I have an ASCII text file. I want to generate a list of all "words" from that file using one or more Ubuntu commands. A word is defined as an alphanumeric sequence between delimiters. Delimiters are whitespace by default, but I also want to experiment with other characters, such as punctuation. In other words, I want to be able to specify a delimiter character set. How do I produce only the unique set of words? What if I also want to list only those words that are at least N characters long?


Answer 1:


You could use grep:

-E '\w+' searches for words
-o only prints the portion of the line that matches

% cat temp
Some examples use "The quick brown fox jumped over the lazy dog," rather than "Lorem ipsum dolor sit amet, consectetur adipiscing elit" for example text.

If you don't care whether words repeat:

% grep -o -E '\w+' temp
Some
examples
use
The
quick
brown
fox
jumped
over
the
lazy
dog
rather
than
Lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
for
example
text

If you want to print each word only once, disregarding case, you can use sort:

-u only prints each word once
-f tells sort to ignore case when comparing words

If you only want each word once:

% grep -o -E '\w+' temp | sort -u -f
adipiscing
amet
brown
consectetur
dog
dolor
elit
example
examples
for
fox
ipsum
jumped
lazy
Lorem
over
quick
rather
sit
Some
text
than
The
use
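To also address the minimum-length requirement from the question, the same grep pattern can take a bounded repetition. A minimal sketch, assuming a minimum length of 4 and the sample file name `temp` used above:

```shell
# Create a small sample file (the name "temp" follows the answer above).
printf 'a tiny word plus longer examples\n' > temp

# List unique words (ignoring case) that are at least 4 characters long;
# \w{4,} matches runs of 4 or more word characters.
grep -o -E '\w{4,}' temp | sort -u -f
```

Change `{4,}` to `{N,}` for any other minimum length N.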

You can also use the tr command:

echo the quick brown fox jumped over the lazydog | tr -cs 'a-zA-Z0-9' '\n'
the
quick
brown
fox
jumped
over
the
lazydog

-c takes the complement of the specified character set
-s squeezes repeated replacement characters into one
'a-zA-Z0-9' is the set of alphanumerics; if you add a character here, the input won't get delimited on that character (see another example below)
'\n' is the replacement character (newline)

echo the quick brown fox jumped over the lazy-dog | tr -cs 'a-zA-Z0-9-' '\n'
the
quick
brown
fox
jumped
over
the
lazy-dog

Because we added '-' to the list of non-delimiters, lazy-dog was printed as a single word. Without it, the output is:

echo the quick brown fox jumped over the lazy-dog | tr -cs 'a-zA-Z0-9' '\n'
the
quick
brown
fox
jumped
over
the
lazy
dog

Summary for tr: any character not in the argument to -c acts as a delimiter. I hope this solves your delimiter problem too.
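Following that summary, you can experiment with other delimiter sets just by editing the -c argument. A sketch where everything except letters, digits, and the apostrophe delimits (keeping the apostrophe in the set is an assumption here, so that contractions survive as single words):

```shell
# Treat every character EXCEPT letters, digits, and the apostrophe as a
# delimiter; -s squeezes each run of delimiters into a single newline.
echo "don't stop; keep going!" | tr -cs "a-zA-Z0-9'" '\n'
```

Here the semicolon and the exclamation mark act as delimiters, while "don't" stays intact.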




Answer 2:


This ought to work for you:

tr ' \t\v\f\r' '\n' < infile | tr -s '\n' | tr -dc 'a-zA-Z0-9\n' | LC_ALL=C sort | uniq

(infile is a placeholder; substitute your file name.)

If you want only the words that are at least five characters long, pipe the output of tr through grep '.....' (one dot per required character). If you want case-insensitivity, stick tr A-Z a-z someplace in the pipeline before sort.

Note that LC_ALL=C is necessary for sort to work correctly.
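Putting those pieces together, a sketch of the full pipeline with the five-character filter spliced in (the input file name input.txt is an assumption for illustration):

```shell
# Create a sample input file (the name input.txt is just for illustration).
printf 'tiny bigger words ok\n' > input.txt

# Split on whitespace, squeeze empty lines, strip non-alphanumerics,
# keep only lines of at least five characters, then sort and de-duplicate.
tr ' \t\v\f\r' '\n' < input.txt | tr -s '\n' | tr -dc 'a-zA-Z0-9\n' \
  | grep '.....' | LC_ALL=C sort | uniq
```

Only "bigger" and "words" survive the five-character filter here.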

I'd recommend reading the man pages for any commands you don't understand here.




Answer 3:


Here's my word-cloud-like pipeline:

cat myfile | grep -o -E '\w+' | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -nr

If you have a TeX file, replace cat with detex:

detex myfile | grep -o -E '\w+' | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -nr
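For an actual word cloud you usually want only the most frequent terms, so trimming the count-sorted output with head is one option. A sketch, where the file name and the cutoff of 10 are assumptions:

```shell
# Sample file (the name myfile matches the answer above).
printf 'Apple apple banana\n' > myfile

# Count each lowercased word, most frequent first, top 10 only.
grep -o -E '\w+' myfile | tr '[A-Z]' '[a-z]' | sort | uniq -c | sort -nr | head -10
```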



Source: https://stackoverflow.com/questions/16489317/how-to-generate-list-of-unique-words-from-text-file-in-ubuntu
