Question
I have a 2GB text file and am trying to remove frequently occurring English stop words from it.
I have a file, stopwords.txt, containing one stop word per line, like this:
a
an
the
for
and
I
What is the fastest way to do this using a shell command such as tr, sed, or awk?
Answer 1:
Here's a method using the command line and Perl.
Save the text below as replacesw.sh:
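#!/bin/bash
# Usage: ./replacesw.sh stopwords.txt testtext.txt
# Join the stop words (one per line) into \b(a|an|the|...)\b
MYREGEX="\\b($(perl -pe 's/\n/|/g' "$1"))\\b"
# Delete whole-word matches; the spaces around each removed word remain
perl -pe "s/$MYREGEX//g" "$2"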
Then, if you have saved your stop word list as stopwords.txt and have a second file, say testtext.txt, that contains:
This is a file with the stopwords from the stopwords.txt for testing.
More than one line in the file, for a better test.
Then the following at the command line will remove the stop words:
KBs-MBP13:temp kbenoit$ ./replacesw.sh stopwords.txt testtext.txt
This is  file with  stopwords from  stopwords.txt  testing.
More than one line in  file,   better test.
You might need to run chmod u+x replacesw.sh first.
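Since the question also mentions awk, here is a sketch of a different technique, not the Perl method above: load the stop words into an awk array and rebuild each line from the fields that are not in it. It assumes words are separated by whitespace, so a stop word with punctuation attached (such as "the,") is left in place, and runs of whitespace are collapsed to single spaces. The temp files and variable names are illustrative, and its speed on a 2GB file has not been measured here.

```shell
#!/bin/bash
# Illustrative temp files; in practice these would be stopwords.txt and the large input.
sw=$(mktemp); txt=$(mktemp)
printf '%s\n' a an the for and I > "$sw"
printf '%s\n' 'This is a file with the stopwords from the stopwords.txt for testing.' > "$txt"

# First file (NR == FNR): remember each stop word as an array key.
# Second file: rebuild each line from the fields that are not stop words.
result=$(awk '
    NR == FNR { stop[$1] = 1; next }
    {
        out = ""
        for (i = 1; i <= NF; i++)
            if (!($i in stop))
                out = out (out == "" ? "" : " ") $i
        print out
    }
' "$sw" "$txt")

printf '%s\n' "$result"
# prints: This is file with stopwords from stopwords.txt testing.

rm -f "$sw" "$txt"
```

Each token costs one array lookup regardless of how many stop words there are, which may help with long stop word lists, at the price of the looser whitespace-based tokenization.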
Source: https://stackoverflow.com/questions/30574124/fast-shell-command-to-remove-stop-words-in-a-text-file