Extracting noun+noun or (adj|noun)+noun from Text

不羁岁月 提交于 2019-11-28 07:03:39

It is possible.

EDIT:

You got it. Use the POS tagger and split on spaces: ll <- strsplit(acqTag,' '). From there iterate on the length of the input list (length of ll) like: for (i in 1:37){qq <-strsplit(ll[[1]][i],'/')} and get the part of speech sequence you're looking for.

After splitting on spaces it is just list processing in R.

I don't have an open console on which to test this, but have your tried to tokenize with tagPOS and then grep for "noun", "noun" or perhaps paste(tagPOS(acq), collapse=".") and search for "noun.noun". Then gregexpr could be used to extract positions.

EDIT: The format of the tagged output was a bit different than I remembered. I think this method of read.table()-ing after substituting "\n"s for spaces is much more efficient than what I see above:

 acqdf <- read.table(textConnection(gsub(" ", "\n", acqTag)), sep="/", stringsAsFactors=FALSE)
 acqdf$nnadj <- grepl("NN|JJ", acqdf$V2)
 acqdf$nnadj 
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE
#[16] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
#[31]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
 acqdf$nnadj[1:(nrow(acqdf)-1)] & acqdf$nnadj[2:nrow(acqdf)]
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#[16] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
#[31] FALSE FALSE FALSE FALSE FALSE FALSE
 acqdf$pair <- c(NA, acqdf$nnadj[1:(nrow(acqdf)-1)] & acqdf$nnadj[2:nrow(acqdf)])
 acqdf[1:7, ]

            V1  V2 nnadj  pair
1         Gulf NNP  TRUE    NA
2      Applied NNP  TRUE  TRUE
3 Technologies NNP  TRUE  TRUE
4          Inc NNP  TRUE  TRUE
5         said VBD FALSE FALSE
6           it PRP FALSE FALSE
7         sold VBD FALSE FALSE
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!