Remove a verb as a stopword

Submitted by 时光怂恿深爱的人放手 on 2019-12-19 04:54:25

Question


Some words are used sometimes as a verb and sometimes as another part of speech.

Example

A sentence in which the word is used as a verb:

I blame myself for what happened

And a sentence in which the word is used as a noun:

For what happened the blame is yours

I already know which word I want to detect; in the example above it is "blame". I would like to detect it and remove it as a stopword only when it is used as a verb.

Is there an easy way to do this?


Answer 1:


You can install TreeTagger and then call it from R through the koRpus package. Install TreeTagger in a location such as C:\Treetagger.

I will first show how TreeTagger works so that you understand what is going on in the actual solution further down in this answer:

Intro to TreeTagger

library(koRpus)

your_sentences <- c("I blame myself for what happened", 
                    "For what happened the blame is yours")

text.tagged <- treetag(file="I blame myself for what happened", 
                  format="obj", treetagger="manual", lang="en",
                  TT.options = list(path="C:\\Treetagger", preset="en") )
text.tagged@TT.res[, 1:2]
#       token tag    
#1         I  PP
#2     blame VVP 
#3    myself  PP 
#4       for  IN
#5      what  WP
#6  happened VVD 

The sentence has now been tagged, and the "only thing left" is to remove those occurrences of "blame" that are tagged as a verb.

Solution

I'll do this sentence by sentence, with a function that first tags the sentence, then checks for "bad words" like "blame" that are also tagged as a verb, and finally removes them from the sentence:

remove_words <- function(sentence, badword="blame"){
  tagged.text <- treetag(file=sentence, format="obj", treetagger="manual", lang="en", 
                         TT.options=list(path="C:\\Treetagger", preset="en"))
  # Check for bad words AND verb:
  cond1 <- (tagged.text@TT.res$token == badword)
  cond2 <- (substring(tagged.text@TT.res$tag, 1, 1) == "V")
  redflag <- which(cond1 & cond2)

  # If no such case, return sentence as is. If so, then remove that word:
  if(length(redflag) == 0) return(sentence)
  else{
    splitsent <- strsplit(sentence, " ")[[1]]
    splitsent <- splitsent[-redflag]
    return(paste0(splitsent, collapse=" "))
  }
}

lapply(your_sentences, remove_words)
# [[1]]
# [1] "I myself for what happened"
# [[2]]
# [1] "For what happened the blame is yours"



Answer 2:


In Python, you can tag each word with NLTK's pos_tag:

from nltk import pos_tag
s1 = "I blame myself for what happened"
pos_tag(s1.split())

It will give you the words together with their part-of-speech tags.
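The tags still have to be used to drop the word. The following is a minimal sketch of one way to do that, not a definitive implementation. The function name `remove_verb_uses` and its `tagger` parameter are hypothetical and not part of NLTK. It assumes the tagger returns Penn Treebank tags, where every verb tag starts with "VB", and it splits on whitespace as in the snippet above, so punctuation attached to a word is not handled.

```python
def remove_verb_uses(sentence, target, tagger):
    """Remove `target` from `sentence` wherever `tagger` labels it as a verb.

    `tagger` is any callable that maps a list of tokens to (token, tag)
    pairs using Penn Treebank tags; nltk.pos_tag has that shape.
    """
    tokens = sentence.split()
    tagged = tagger(tokens)
    kept = [
        token
        for token, tag in tagged
        # Penn Treebank verb tags all start with "VB"
        # (VB, VBD, VBG, VBN, VBP, VBZ).
        if not (token.lower() == target.lower() and tag.startswith("VB"))
    ]
    return " ".join(kept)


# Possible usage, assuming NLTK and its tagger model are installed:
#   from nltk import pos_tag
#   remove_verb_uses("I blame myself for what happened", "blame", pos_tag)
```

Passing the tagger in as an argument keeps the filtering logic independent of any particular tagging library, so the same function could be tried with a different tagger that emits the same tag set.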




Answer 3:


You can do something like this in Python.

>>> import nltk
>>> from nltk import word_tokenize
>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]

Then add your own filter to eliminate verbs, for instance.

Hope this is helpful!
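Such a filter could look like the sketch below. It is only an illustration under stated assumptions: `drop_verbs` and its `only` parameter are hypothetical names, the tags are assumed to be Penn Treebank tags (verb tags start with "VB"), and the tags in `hand_tagged` were assigned by hand for the example rather than produced by a tagger.

```python
def drop_verbs(tagged, only=None):
    """Return the tokens from `tagged`, skipping verb-tagged ones.

    tagged: iterable of (token, tag) pairs with Penn Treebank tags.
    only:   optional collection of lowercase words; if given, a verb is
            dropped only when it is one of these words.
    """
    kept = []
    for token, tag in tagged:
        is_verb = tag.startswith("VB")
        is_target = only is None or token.lower() in only
        if is_verb and is_target:
            continue
        kept.append(token)
    return kept


# Pairs copied from the pos_tag output above; none of them is a verb,
# so every token is kept.
tagged = [("And", "CC"), ("now", "RB"), ("for", "IN"),
          ("something", "NN"), ("completely", "RB"), ("different", "JJ")]
print(drop_verbs(tagged))

# Hand-assigned tags (an assumption for illustration, not tagger output).
hand_tagged = [("I", "PRP"), ("blame", "VBP"), ("myself", "PRP"),
               ("for", "IN"), ("what", "WP"), ("happened", "VBD")]
print(drop_verbs(hand_tagged, only={"blame"}))
```

With `only={"blame"}` the filter leaves other verbs such as "happened" in place, which matches the question's goal of treating one known word as a stopword only in its verb sense.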



Source: https://stackoverflow.com/questions/47274691/remove-a-verb-as-a-stopword
