Question
How do I add the tag NEG_ to all words that follow not, no and never until the next punctuation mark in a string (used for sentiment analysis)? I assume that regular expressions could be used, but I'm not sure how.
Input: It was never going to work, he thought. He did not play so well, so he had to practice some more.
Desired output: It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.
Any idea how to solve this?
Answer 1:
To make up for the lack of some Perl abilities in Python's re regex engine, you can pass a lambda expression to re.sub to create a dynamic replacement:
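import re

string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"

# Outer pattern: a negation keyword, then words and spaces, up to one punctuation character.
# The lambda rewrites each matched span, prefixing NEG_ to every word that follows whitespace.
transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
                     lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
                     string,
                     flags=re.IGNORECASE)
print(transformed)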
This prints:
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
Explanation
The first step is to select the parts of your string you're interested in. This is done with

\b(?:not|never|no)\b[\w\s]+[^\w\s]

That is your negative keyword (\b is a word boundary, (?:...) a non-capturing group), followed by alphanumerics and spaces (\w is [0-9a-zA-Z_], plus Unicode letters and digits for str patterns in Python 3; \s is any kind of whitespace), up until something that's neither an alphanumeric nor a space (acting as punctuation).

Note that the punctuation is mandatory here, but you could safely remove [^\w\s] to match end of string as well.

Now you're dealing with strings of the kind "never going to work,". Just select the words preceded by spaces with (\s+)(\w+) and replace them with what you want: \1NEG_\2
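The note about dropping the mandatory punctuation can be sketched as follows. This is a minimal variant of the code above, not a different method; the names NEGATION_SCOPE and tag_negated are hypothetical and chosen only for illustration.

```python
import re

# Same outer pattern as above, minus the trailing [^\w\s], so a negated
# span may also run to the end of the string (hypothetical name).
NEGATION_SCOPE = re.compile(r'\b(?:not|never|no)\b[\w\s]+', re.IGNORECASE)


def tag_negated(text):
    """Prefix NEG_ to each word after a negation keyword, up to punctuation or end of string."""
    return NEGATION_SCOPE.sub(
        # Inner substitution: keep the whitespace, tag the word that follows it.
        lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
        text)


print(tag_negated("He did not play so well"))
```

With the punctuation removed from the pattern, the sentence above is tagged even though it has no closing punctuation mark.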
Answer 2:
I would not do this with a regexp. Rather, I would:
- Split the input on punctuation characters.
- For each fragment:
  - Set the negation counter to 0.
  - Split the fragment into words.
  - For each word:
    - Add "negation counter" copies of NEG_ to the word (or the counter mod 2, or 1 if it is greater than 0).
    - If the original word is in {No, Never, Not}, increase the negation counter by one.
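The steps above can be sketched without any regular expressions. This is one possible reading, assuming the "1 if greater than 0" variant (which is what the desired output in the question corresponds to), a fixed punctuation set, and words separated by single spaces; all names are hypothetical.

```python
PUNCTUATION = set(".,:;!?")          # assumed punctuation set; extend as needed
NEGATIONS = {"no", "not", "never"}


def split_on_punctuation(text):
    """Yield (fragment, delimiter) pairs; the delimiter is '' for a trailing fragment."""
    start = 0
    for index, char in enumerate(text):
        if char in PUNCTUATION:
            yield text[start:index], char
            start = index + 1
    if start < len(text):
        yield text[start:], ""


def tag_fragment(fragment):
    """Tag every word that follows a negation word within one fragment."""
    negation_count = 0
    tagged = []
    for word in fragment.split(" "):
        # "1 if greater than 0" variant: at most one NEG_ prefix per word.
        prefix = "NEG_" if word and negation_count > 0 else ""
        tagged.append(prefix + word)
        if word.lower() in NEGATIONS:
            negation_count += 1
    return " ".join(tagged)


def tag_negated(text):
    """Reassemble the tagged fragments together with their punctuation."""
    return "".join(tag_fragment(fragment) + delimiter
                   for fragment, delimiter in split_on_punctuation(text))


print(tag_negated("He did not play so well, so he had to practice some more."))
```

Because the counter is reset for every fragment, a negation never leaks past the punctuation mark that ends it.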
Answer 3:
You will need to do this in several steps (at least in Python; .NET languages can use a regex engine that has more capabilities):

First, match a part of a string starting with not, no or never. The regex \b(?:not?|never)\b([^.,:;!?]+) would be a good starting point. You might need to add more punctuation characters to that list if they occur in your texts.

Then, use the match result's group 1 as the target of your second step: find all words (for example by splitting on whitespace and/or punctuation) and prepend NEG_ to them.

Finally, join the string together again and insert the result in your original string in the place of the first regex's match.
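A minimal sketch of these steps in Python might look like this. It assumes that words are runs of \w characters and that matching should ignore case (neither is stated in the answer), and it lets re.sub with a callable handle the final "insert in place of the match" step. The names NEGATION, tag_scope and tag_negated are hypothetical.

```python
import re

# Step 1: negation keyword, then everything up to the next punctuation mark (group 1).
# re.IGNORECASE is an assumption, added so that "Not" at the start of a sentence matches.
NEGATION = re.compile(r'\b(?:not?|never)\b([^.,:;!?]+)', re.IGNORECASE)


def tag_scope(match):
    """Step 2: prepend NEG_ to every word in group 1, leaving the keyword untouched."""
    keyword = match.group(0)[:match.start(1) - match.start(0)]
    tagged = re.sub(r'\w+', r'NEG_\g<0>', match.group(1))
    return keyword + tagged


def tag_negated(text):
    """Step 3: substitute each tagged span back in place of the original match."""
    return NEGATION.sub(tag_scope, text)


print(tag_negated("It was never going to work, he thought."))
```

Treating words as runs of \w means an apostrophe splits a word in two, so texts with contractions inside the negated span may need a different word pattern.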
Source: https://stackoverflow.com/questions/23384351/how-to-add-tags-to-negated-words-in-strings-that-follow-not-no-and-never