quanteda kwic regex operation

六月ゝ 毕业季﹏ submitted on 2020-01-05 06:47:59

Question


Further edit to the original question.
The question arose from my expectation that regexes would work the same, or nearly the same, as in "grep" or in a programming language. Below is what I expected; the fact that it did not happen prompted my question (using Cygwin):

echo "regex unusual operation will deport into a different" > out.txt
grep "will * dep" out.txt
"regex unusual operation will deport into a different"


Original question
I am trying to follow https://github.com/kbenoit/ITAUR/blob/master/README.md
to learn quanteda, having seen that everyone who uses this package rates it highly.
In demo.R, line 22, I find:
kwic(immigCorpus, "deport", window = 3)  

Its output is -

[BNP, 157]        The BNP will | deport | all foreigners convicted  
[BNP, 1946]                . 2. | Deport | all illegal immigrants    
[BNP, 1952] immigrants We shall | deport | all illegal immigrants  
[BNP, 2585]  Criminals We shall | deport | all criminal entrants  

To learn the basics, I ran

kwic(immigCorpus, "will *depo", window = 3, valuetype = "regex")

expecting to get

[BNP, 157]        The BNP will | deport | all foreigners convicted

but I get:

kwic object with 0 rows

Similar attempts such as

kwic(immigCorpus, ".*will *depo.*", window = 3, valuetype = "regex")

give the same result:

kwic object with 0 rows

Why is that? Is it tokenization? If so, how should I write the regex?

PS Thanks for this great package


Answer 1:


You are trying to match a phrase with your pattern. By default, each element of the pattern argument is matched against individual tokens, and a token never contains whitespace, so a pattern with a space in it cannot match anything. To match a multi-word sequence, wrap the pattern in phrase(). You get your expected result with

> kwic(immigCorpus, phrase("will deport"), window = 3)
[BNP, 156:157] - The BNP | will deport | all foreigners convicted
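
To see why the original attempt returned nothing, here is a rough base-R illustration. It is only a sketch and not how quanteda is implemented: splitting on spaces with strsplit() is an assumed stand-in for real tokenization, and the sample sentence is taken from the kwic output in the question.

```r
# Base-R sketch only: strsplit() on spaces stands in for quanteda's tokenizer.
txt <- "The BNP will deport all foreigners convicted"
toks <- strsplit(txt, " ", fixed = TRUE)[[1]]

# Against the whole string the regex matches, just as it does with grep.
whole_match <- grepl("will *depo", txt)

# Against the tokens one at a time, no single token contains
# both "will" and "depo", so nothing matches.
token_match <- grepl("will *depo", toks)

print(whole_match)
print(any(token_match))
```

The same regex succeeds on the raw line and fails on every token, which is the difference between grep and a token-based search.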

valuetype = "regex" only makes sense if the pattern is actually a regex. For example, to match both "shall deport" and "will deport", use

> kwic(immigCorpus, phrase("(will|shall) deport"), window = 3, valuetype = "regex")

   [BNP, 156:157]             - The BNP | will deport  | all foreigners convicted
 [BNP, 1951:1952] illegal immigrants We | shall deport | all illegal immigrants  
 [BNP, 2584:2585]  Foreign Criminals We | shall deport | all criminal entrants 
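
Conceptually, a phrase pattern can be thought of as one regex per token, applied to consecutive tokens. The following base-R sketch shows that idea under stated assumptions: match_phrase() is a hypothetical helper written for this illustration, not a quanteda function, and the token vector is made up from the output above.

```r
# Hypothetical helper (not part of quanteda): return the start positions
# where each regex in `pats` matches the corresponding consecutive token.
match_phrase <- function(toks, pats) {
  n <- length(toks)
  k <- length(pats)
  if (k == 0 || n < k) return(integer(0))
  starts <- seq_len(n - k + 1)
  ok <- rep(TRUE, length(starts))
  for (j in seq_len(k)) {
    # Test the j-th pattern against the j-th token of every candidate window.
    ok <- ok & grepl(pats[j], toks[starts + j - 1])
  }
  starts[ok]
}

toks <- c("The", "BNP", "will", "deport", "all",
          "We", "shall", "deport", "all")

# Split the phrase on spaces: one regex per token position.
pats <- strsplit("(will|shall) deport", " ", fixed = TRUE)[[1]]

hits <- match_phrase(toks, pats)
print(hits)
```

Here the sketch reports positions 3 and 7, the tokens where "will deport" and "shall deport" begin.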

See the kwic documentation (?kwic).




Answer 2:


The examples in the ITAUR repository are based on an older syntax. What you need is the phrase() wrapper; see ?phrase. You should also probably review the regular expression syntax you are using with the *, since it may not do what you want, and since a regular expression cannot start with a "*". The default "glob" valuetype will probably achieve what you want.

library("quanteda")
## Package version: 1.1.4
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.

kwic(data_char_ukimmig2010, phrase("will deport"))

## [BNP, 156:157] nation.- The BNP | will deport | all foreigners convicted of crimes

kwic(data_char_ukimmig2010, phrase("will .*deport.*"), valuetype = "regex")

## [BNP, 156:157] nation.- The BNP | will deport | all foreigners convicted of crimes
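
For what the * does in each syntax, a small base-R comparison may help. This is a sketch with made-up sample strings. It uses utils::glob2rx() to translate a glob into a regex, and it says nothing about how quanteda handles globs internally.

```r
# In a regex, "*" repeats the preceding item zero or more times.
# "will *depo" therefore means: "will", any number of spaces, then "depo".
lines <- c("will deport", "willdeport", "will   deport", "will not deport")
regex_hits <- grepl("will *depo", lines)

# In a glob, "*" stands for any run of characters.
# glob2rx() converts the glob to an equivalent anchored regex.
toks <- c("will", "deport", "Deport", "deported", "report")
rx <- utils::glob2rx("deport*")
glob_hits <- grepl(rx, toks, ignore.case = TRUE)

print(regex_hits)
print(glob_hits)
```

The regex accepts "willdeport" and rejects "will not deport", which is rarely what someone coming from glob-style wildcards intends.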


Source: https://stackoverflow.com/questions/49478723/quanteda-kwic-regex-operation
