nlp | 易学教程

Can NLTK/pyNLTK work “per language” (i.e. non-English), and how?

Question: How can I tell NLTK to treat the text as being in a particular language? Once in a while I write a specialized NLP routine to do POS tagging, tokenizing, etc. on a non-English (but still Indo-European) text domain. This question seems to address only different corpora, not changes in code/settings: POS tagging in German. Alternatively, are there any specialized Hebrew/Spanish/Polish NLP modules for Python? Answer 1: I'm not sure what you're referring to as the changes in code/settings. NLTK mostly

Is there a natural language parser for dates/times in ColdFusion?

Question: Is there a natural language parser for dates/times in ColdFusion? Answer 1: There's a (reportedly -- I've not used it) good one for Java called JChronic, a port of the Ruby Chronic date parser. You could try using it. It hasn't been updated since 2006, but should still be useful. Answer 2: I believe parseDateTime() and lsParseDateTime() are the closest to what you're looking for among the built-in ColdFusion functions. Check out Adobe's LiveDocs for other date/time functions. Remember that

How best to parse a simple grammar?

Question: OK, so I've asked a bunch of smaller questions about this project, but I still don't have much confidence in the designs I'm coming up with, so I'm going to ask a broader question. I am parsing prerequisite descriptions for a course catalog. The descriptions almost always follow a certain form, which makes me think I can parse most of them. From the text, I would like to generate a graph of course prerequisite relationships. (That part will be easy once I have parsed the data.)

How do I replace the string exactly using gsub()

Question: I have a corpus: txt = "a patterned layer within a microelectronic pattern." I would like to replace exactly the term "pattern" with "form", so I wrote this code: txt_replaced = gsub("pattern","form",txt) However, the resulting string in txt_replaced is: "a formed layer within a microelectronic form." As you can see, the term "patterned" is wrongly turned into "formed" because "pattern" also matches the first part of "patterned". I would like to ask whether I can replace the string
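
The excerpt is cut off before any answer, so the following is only a sketch of one common approach: anchoring the search term with word boundaries (`\b`) so that "pattern" no longer matches inside "patterned". It is written in Python with the standard `re` module rather than R; the helper name `replace_whole_word` is hypothetical. In R, the analogous call would presumably be `gsub("\\bpattern\\b", "form", txt)`.

```python
import re

txt = "a patterned layer within a microelectronic pattern."


def replace_whole_word(text, word, replacement):
    """Replace `word` only where it stands alone as a whole word."""
    # \b matches the boundary between a word character and a non-word one,
    # so "pattern" inside "patterned" is skipped.
    pattern = r"\b" + re.escape(word) + r"\b"
    # A function replacement avoids backslash interpretation in `replacement`.
    return re.sub(pattern, lambda match: replacement, text)


txt_replaced = replace_whole_word(txt, "pattern", "form")
print(txt_replaced)  # a patterned layer within a microelectronic form.
```

Escaping the search term with `re.escape` keeps the helper safe for words that contain regex metacharacters.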

NLP Basics: Language Models

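What is a language model? A language model aims to model the joint probability function of a sentence: it computes the probability of a sentence, assigning high probability to meaningful sentences and low probability to meaningless ones. In other words, it judges whether a sentence reads like natural human language. Such a model is useful for many NLP tasks, such as machine translation, speech recognition, information retrieval, part-of-speech tagging, and handwriting recognition. A language model takes two aspects into account (using "How long is a football game?" as the example). Word order within the sentence: "How long game is a football?" Word meaning within the sentence: "How long is a football bame?" Speech recognition example: "厨房里的食油用完了" ("the cooking oil in the kitchen has run out") versus "厨房里的石油用完了" ("the petroleum in the kitchen has run out"). Translation example: "you go first" rendered as "你走先" versus "你先走". Given the word sequence of a sentence, $S = W_1, W_2, \ldots, W_k$: if every word in the sentence is assumed to be independent of the others, the probability of the whole sentence is $P(S) = P(W_1) P(W_2) \cdots P(W_k)$. However, the meaning of each word is closely tied to the words before it, so the actual language model probability is computed from conditional probabilities: $P(S) = P(W_1) P(W_2 \mid W_1) \cdots P(W_k \mid W_1, W_2, \ldots, W_{k-1})$. Solving for the conditional probabilities in this expression raises two problems. The parameter space is too large: the conditional probability $P(W_k \mid W_1, W_2, \ldots, W_{k-1})$ has too many possible cases, so the computational cost is enormous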
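
A minimal sketch of the contrast described above, under stated assumptions: the three-sentence toy corpus is made up for illustration, probabilities are plain relative frequencies, and the full history $W_1, \ldots, W_{k-1}$ is truncated to the single previous word (a bigram simplification that the excerpt itself does not state) to keep the counts manageable. The function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
corpus = [
    "how long is a football game",
    "how long is a movie",
    "a football game is long",
]
tokenized = [s.split() for s in corpus]

# Counts for the independence model: P(W) = count(W) / total words.
word_counts = Counter(w for sent in tokenized for w in sent)
total_words = sum(word_counts.values())

# Counts for the conditional model, with "<s>" marking the sentence start.
pairs = [
    (prev, cur)
    for sent in tokenized
    for prev, cur in zip(["<s>"] + sent, sent)
]
pair_counts = Counter(pairs)
context_counts = Counter(prev for prev, _ in pairs)


def independent_prob(words):
    """P(S) = P(W_1) * ... * P(W_k): every word treated as independent."""
    p = 1.0
    for w in words:
        p *= word_counts[w] / total_words
    return p


def bigram_prob(words):
    """Chain rule, with each history cut down to the previous word only."""
    p = 1.0
    for prev, cur in zip(["<s>"] + words, words):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: no estimate available
        p *= pair_counts[(prev, cur)] / context_counts[prev]
    return p


good = "how long is a football game".split()
scrambled = "how long game is a football".split()

print(independent_prob(good), independent_prob(scrambled))
print(bigram_prob(good), bigram_prob(scrambled))
```

Under the independence assumption the two word orders multiply the same factors, so the model cannot tell them apart; the conditional version gives the scrambled order a probability of zero on this corpus because "long game" never occurs.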

10 ML & NLP Research Highlights of 2019

2020-01-07 08:56:32 Source: https://ruder.io/research-highlights-2019/ This post gathers ten ML and NLP research directions that I found exciting and impactful in 2019. For each highlight, I summarise the main advances that took place this year, briefly state why I think it is important, and provide a short outlook to the future. The full list of highlights is here: Universal unsupervised pretraining; Lottery tickets; The Neural Tangent Kernel; Unsupervised multilingual learning; More robust benchmarks; ML and NLP for science; Fixing decoding errors in NLG

Problem with indexes in enumerate() - Python [closed]

Question: I have a dataset (a_list_of_sentences) in the form of a list of lists of lists, where each innermost list consists of a word and its syntactic dependency, and these lists are joined into sentences, like this: [[['mary', 'nsubj'], ['loves', 'ROOT'], ['every', 'det'], ['man', 'dobj']], [['mary', 'nsubj'],
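
The excerpt ends before the actual indexing problem is stated, so this is only a sketch of how `enumerate()` can track positions at both nesting levels of such a structure. The second sentence and the helper `find_by_dependency` are hypothetical, added because the original example is truncated.

```python
a_list_of_sentences = [
    [["mary", "nsubj"], ["loves", "ROOT"], ["every", "det"], ["man", "dobj"]],
    # Hypothetical second sentence; the original example is cut off here.
    [["john", "nsubj"], ["sleeps", "ROOT"]],
]


def find_by_dependency(sentences, dep):
    """Return (sentence_index, word_index, word) for each token tagged `dep`."""
    hits = []
    # The outer enumerate() numbers sentences; the inner one numbers tokens.
    for sent_idx, sentence in enumerate(sentences):
        for word_idx, (word, label) in enumerate(sentence):
            if label == dep:
                hits.append((sent_idx, word_idx, word))
    return hits


roots = find_by_dependency(a_list_of_sentences, "ROOT")
print(roots)  # [(0, 1, 'loves'), (1, 1, 'sleeps')]
```

Each index restarts at zero for its own level, so the word index is a position within its sentence, not within the whole dataset.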

StanfordNLP - ArrayIndexOutOfBoundsException at TokensRegexNERAnnotator.readEntries(TokensRegexNERAnnotator.java:696)

Question: I want to identify the following phrases as SKILL using StanfordNLP's TokensRegexNERAnnotator: AREAS OF EXPERTISE; Areas of Knowledge; Computer Skills; Technical Experience; Technical Skills. There are many more text sequences like the ones above. Code - Properties props = new Properties(); props.put("annotators", "tokenize, ssplit, pos, lemma, ner"); StanfordCoreNLP pipeline = new StanfordCoreNLP(props); pipeline.addAnnotator(new TokensRegexNERAnnotator("./mapping/test_degree.rule", true)); String[] tests = {