Fast shell command to remove stop words in a text file

Question

I have a 2GB text file. I am trying to remove frequently occurring English stop words from this file.

I have a stopwords.txt file with one word per line, like this:

a
an
the
for
and
I

What is the fastest way to do this with a shell command such as tr, sed, or awk?

Answer 1:

Here's a method using the command line and Perl:

Save the text below as replacesw.sh:

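#!/bin/bash
# Usage: ./replacesw.sh stopwords.txt textfile.txt
# Join the stop words (one per line in $1) into a pattern like \b(a|an|the)\b
MYREGEX="\\b($(paste -sd'|' "$1"))\\b"
# Delete every whole-word match from the text file; matching is case-sensitive
perl -pe "s/$MYREGEX//g" "$2"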
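To make the pattern concrete, here is a minimal sketch that builds the same kind of alternation from the sample stop-word list and prints it. It assumes `paste` is used for the join (a stand-in chosen for illustration), and the temporary file and the variable name `pattern` are hypothetical.

```shell
#!/bin/bash
# Write the sample stop-word list to a temporary file (hypothetical path)
tmp=$(mktemp)
printf '%s\n' a an the for and I > "$tmp"

# Join the lines with '|' and wrap the result in \b( ... )\b
pattern="\\b($(paste -sd'|' "$tmp"))\\b"
printf '%s\n' "$pattern"

rm -f "$tmp"
```

The `\b` anchors are what keep `a` from matching inside words such as `than`.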

Suppose the stop-word list above is saved as stopwords.txt, and a second file, testtext.txt, contains:

This is a file with the stopwords from the stopwords.txt for testing.
More than one line in the file, for a better test.

Then the following at the command line will remove the stopwords:

$ ./replacesw.sh stopwords.txt testtext.txt
This is  file with  stopwords from  stopwords.txt  testing.
More than one line in  file,   better test.
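The output above still carries the spaces that surrounded each removed word. If that matters, one possible follow-up is to squeeze runs of spaces afterwards. This is only a sketch: `squeeze_spaces` is a hypothetical helper name, and it assumes plain space characters rather than tabs.

```shell
#!/bin/bash
# Hypothetical helper: collapse runs of spaces, then trim one leading/trailing space
squeeze_spaces() {
    tr -s ' ' | sed -e 's/^ //' -e 's/ $//'
}

# Feed it the first output line shown above
cleaned=$(printf '%s\n' 'This is  file with  stopwords from  stopwords.txt  testing.' | squeeze_spaces)
printf '%s\n' "$cleaned"
```

In practice this would be appended to the script's pipeline, for example `./replacesw.sh stopwords.txt testtext.txt | tr -s ' '`.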

You might need to chmod u+x replacesw.sh first.
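Since the question also mentions awk, here is an alternative sketch that looks each word up in an awk array instead of using one large regex. It is not the method above, and it behaves differently: it assumes words are separated by whitespace, so a stop word with punctuation attached (for example `the,`) is kept, and runs of whitespace are collapsed to single spaces. The function name `remove_stopwords` and the temporary files are hypothetical. Whether it is faster on a 2GB file is worth timing on a sample first.

```shell
#!/bin/bash
# Hypothetical helper: $1 = stop-word list (one per line), $2 = text file
remove_stopwords() {
    awk 'NR == FNR { stop[$1]; next }   # first file: remember each stop word
         {
             out = ""
             for (i = 1; i <= NF; i++)  # second file: keep fields not in the list
                 if (!($i in stop))
                     out = out (out == "" ? "" : " ") $i
             print out
         }' "$1" "$2"
}

# Small demonstration with the sample data from the question
tmp=$(mktemp -d)
printf '%s\n' a an the for and I > "$tmp/stopwords.txt"
printf '%s\n' 'This is a file with the stopwords from the stopwords.txt for testing.' > "$tmp/testtext.txt"
result=$(remove_stopwords "$tmp/stopwords.txt" "$tmp/testtext.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```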

Source: https://stackoverflow.com/questions/30574124/fast-shell-command-to-remove-stop-words-in-a-text-file

Tags

shell

nlp

text-processing
