nlp

Pointwise mutual information on text

痴心易碎 submitted on 2019-12-20 08:40:48
Question: I was wondering how one would calculate pointwise mutual information for text classification. To be more exact, I want to classify tweets into categories. I have a dataset of annotated tweets, and for each category I have a dictionary of words that belong to it. Given this information, how can I calculate the PMI of each category per tweet, in order to classify a tweet into one of these categories? Answer 1: PMI is a measure of association between a feature (in your case a
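A minimal sketch of tweet-level PMI under one common set of assumptions (counting at the level of whole tweets, and a tweet's annotated category standing in for the class variable); the toy tweets and category names below are made up for illustration:

```python
import math

def pmi(word, category, tweets):
    """PMI(word, category) = log2( P(word, category) / (P(word) * P(category)) ).

    `tweets` is a list of (tokens, category) pairs; all probabilities are
    estimated by counting over tweets.
    """
    n = len(tweets)
    n_word = sum(1 for toks, _ in tweets if word in toks)
    n_cat = sum(1 for _, c in tweets if c == category)
    n_joint = sum(1 for toks, c in tweets if word in toks and c == category)
    if n_joint == 0:
        return float("-inf")  # the word never co-occurs with the category
    return math.log2((n_joint / n) / ((n_word / n) * (n_cat / n)))

# Toy annotated dataset (invented for illustration)
tweets = [
    (["great", "goal"], "sports"),
    (["great", "match"], "sports"),
    (["election", "vote"], "politics"),
    (["vote", "now"], "politics"),
]
```

To score a tweet against a category, one could sum the PMI of each of its words with that category and pick the arg-max.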

Python module with access to english dictionaries including definitions of words [closed]

随声附和 submitted on 2019-12-20 08:33:26
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 5 years ago. I am looking for a Python module that helps me get the definition(s) of a word from an English dictionary. There is of course enchant, which lets me check whether a word exists in the English language, but it does not provide definitions (at least I don't see anything like that in the docs). There is also
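A common answer to this question is NLTK's WordNet interface, whose lookup looks like `wordnet.synsets("swim")[0].definition()` (it requires downloading the WordNet corpus first). The self-contained stand-in below uses a tiny hand-made glossary purely to show the lookup shape; the words and definitions are invented for illustration:

```python
# Real-world option (requires nltk and the wordnet corpus):
#     from nltk.corpus import wordnet
#     wordnet.synsets("swim")[0].definition()
# Stand-in with a local glossary so the example is self-contained:
glossary = {
    "parrot": ["a bird of the order Psittaciformes, often able to mimic speech"],
    "swim": ["to move through water by moving the body or limbs"],
}

def define(word):
    """Return the list of known definitions for a word (empty if unknown)."""
    return glossary.get(word.lower(), [])
```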

Is it possible to train Stanford NER system to recognize more named entities types?

流过昼夜 submitted on 2019-12-20 08:18:44
Question: I'm using some NLP libraries now (Stanford and NLTK). I saw the Stanford demo, but I want to ask whether it is possible to use it to identify more entity types. Currently the Stanford NER system (as the demo shows) can recognize entities such as person (name), organization, or location, but the organizations it recognizes are limited to universities and a few big organizations. I'm wondering whether I can use its API to write a program for more entity types, e.g. so that if my input is "Apple" or "Square" it can
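Stanford's CRFClassifier can be retrained on a tab-separated file (one token per line, `token<TAB>label`, with `O` for non-entities), and it picks up whatever label set appears in the training data. A small sketch of producing that format; the `COMPANY` label and the sentence are our own invented example, not something Stanford ships:

```python
def to_stanford_tsv(tagged_tokens):
    """Serialize (token, label) pairs into the tab-separated, one-token-per-line
    format that Stanford's CRFClassifier trains from."""
    return "\n".join(f"{tok}\t{label}" for tok, label in tagged_tokens)

# Hypothetical training sentence with a custom COMPANY label
sentence = [("Apple", "COMPANY"), ("announced", "O"), ("earnings", "O"), (".", "O")]
tsv = to_stanford_tsv(sentence)
```

The resulting file is then referenced from a properties file (`trainFile=...`) when invoking the trainer.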

Training n-gram NER with Stanford NLP

﹥>﹥吖頭↗ submitted on 2019-12-20 08:01:24
Question: Recently I have been trying to train n-gram entities with Stanford CoreNLP. I have followed this tutorial: http://nlp.stanford.edu/software/crf-faq.shtml#b With this, I am able to specify only unigram tokens and the class each belongs to. Can anyone guide me through extending it to n-grams? I am trying to extract known entities, like movie names, from a chat data set. Please guide me in case I have misinterpreted the Stanford tutorials and the same can be used for
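A common workaround (the CRF itself still classifies token by token) is to label every token of a multi-token entity with the same class in the training file, so that contiguous runs of the same label can be merged back into one entity at extraction time. A sketch, with the `MOVIE` label and the sample sentence invented for illustration:

```python
def label_ngrams(tokens, spans, label="MOVIE"):
    """Assign `label` to every token covered by a span (start, end), end
    exclusive; all other tokens get 'O'. Returns (token, tag) pairs."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        for i in range(start, end):
            tags[i] = label
    return list(zip(tokens, tags))

tokens = "I watched The Dark Knight yesterday".split()
labeled = label_ngrams(tokens, [(2, 5)])  # "The Dark Knight" spans tokens 2..4
```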

Am I passing the string correctly to the python library?

眉间皱痕 submitted on 2019-12-20 07:58:28
Question: I'm using a Python library called Guess Language: http://pypi.python.org/pypi/guess-language/0.1 "justwords" is a string with Unicode text. I pass it to the package, but it always returns English, even though the web page is in Japanese. Does anyone know why? Am I not encoding correctly? Here is a sample of the garbled text: §ç©ºéå ¶ä»æ¡å°±æ²æéç¨®å¾ é¤ï¼æä»¥ä¾é裡ç¶ç éäºï¼åæ­¤ç°å¢æ°£æ°¹³åèµ·ä¾åªè½ç®âå¾å¥½âé常好âåå ¶æ¯è¦é»é¤ï¼é¨ä¾¿é»çé»ã飲æãä¸ææ²»ç­åä¸å 便å®ï¼æ¯æ´è¥ç äºï¼æ³æ³é裡以å°é»ãæ¯è§ä¾èªªä¹è©²æpremiumï¼åªæ±é¤é»å¥½åå°
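The sample text above is classic mojibake: UTF-8 bytes that were decoded as Latin-1, which is why a language guesser sees Latin-script noise instead of CJK characters. If that is indeed what happened, the repair is to reverse the wrong decode before guessing; a sketch demonstrating the round trip on a made-up CJK string:

```python
original = "空間環境"                                # sample CJK text (invented)

# Simulate the bug: UTF-8 bytes wrongly decoded as Latin-1 produce mojibake.
mojibake = original.encode("utf-8").decode("latin-1")

# The fix: re-encode as Latin-1 (recovering the raw bytes), then decode as UTF-8.
repaired = mojibake.encode("latin-1").decode("utf-8")
```

With the text repaired this way, a language detector has real Japanese/Chinese characters to work with.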

How to build a JSON array dynamically in JavaScript

风格不统一 submitted on 2019-12-20 06:28:32
Question: I receive a JSON object with some number of quick-reply elements from wit.ai, like this: "msg": "So glad to have you back. What do you want me to do?", "action_id": "6fd7f2bd-db67-46d2-8742-ec160d9261c1", "confidence": 0.08098269709064443, "quickreplies": ["News?", "Subscribe?", "Contribute?", "Organize?"], "type": "msg" I then need to convert them to a slightly different format as they are passed to Facebook Messenger, as described in the code below. Wit only exposes 'msg' and 'quickreplies.'
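The reshaping itself is language-agnostic; a Python sketch of the same transformation (the JavaScript version is analogous). The Messenger-side field names follow its Send API quick-reply objects (`content_type`, `title`, `payload`); reusing the reply text as the `payload` is our own simplification:

```python
def to_messenger_quick_replies(wit_msg):
    """Reshape a wit.ai response into a Messenger Send API payload: each
    quickreply string becomes a quick-reply object."""
    return {
        "text": wit_msg["msg"],
        "quick_replies": [
            {"content_type": "text", "title": qr, "payload": qr}
            for qr in wit_msg.get("quickreplies", [])
        ],
    }

wit_msg = {
    "msg": "So glad to have you back. What do you want me to do?",
    "quickreplies": ["News?", "Subscribe?", "Contribute?", "Organize?"],
}
payload = to_messenger_quick_replies(wit_msg)
```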

Check perplexity of a Language Model

家住魔仙堡 submitted on 2019-12-20 06:17:02
Question: I created a language model with a Keras LSTM, and now I want to assess whether it's good, so I want to calculate perplexity. What is the best way to calculate the perplexity of a model in Python? Answer 1: I've come up with two versions and attached their corresponding source; please feel free to check the links out. def perplexity_raw(y_true, y_pred): """ The perplexity metric. Why isn't this part of Keras yet?! https://stackoverflow.com/questions/41881308/how-to-calculate-perplexity-of-rnn-in-tensorflow https
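Independent of any framework, perplexity is the exponential of the average negative log-likelihood the model assigns to the true tokens. A plain-Python sketch of that definition (not the Keras-metric version the answer excerpt shows, which operates on tensors):

```python
import math

def perplexity(probs):
    """Perplexity of a sequence, given the probability the model assigned to
    each true token: exp of the mean negative log-likelihood."""
    n = len(probs)
    nll = -sum(math.log(p) for p in probs) / n
    return math.exp(nll)

# Sanity check: a model that assigns uniform probability 1/4 to every token
# has perplexity exactly 4 (it is "as confused as" a 4-way coin flip).
```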

define CRF++ template file

淺唱寂寞╮ submitted on 2019-12-20 06:04:10
Question: This is my issue, but it doesn't say HOW to define the template file correctly. My training file looks like this:
上 B-NR
海 L-NR
浦 B-NR
东 L-NR
开 B-NN
发 L-NN
与 U-CC
法 B-NN
制 L-NN
建 B-NN
...
Answer 1: CRF++ is extremely easy to use. The instructions on the website explain it clearly: http://crfpp.googlecode.com/svn/trunk/doc/index.html Suppose we extract features for the line 东 L-NR:
Unigram
U02:%x[0,0] # means column 0 of the current line
U03:%x[1,0] # means column 0 of the next line
So the underlying
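The `%x[row,col]` macros in a CRF++ template address the training file relative to the current token: `row` is a line offset, `col` a whitespace-separated column index. A sketch of that expansion, using the first lines of the training file above (the current line is chosen arbitrarily for the demonstration):

```python
def expand_macro(lines, current, row, col):
    """Expand a CRF++ %x[row,col] macro: column `col` of the line `row`
    positions away from the current line."""
    return lines[current + row].split()[col]

lines = ["上 B-NR", "海 L-NR", "浦 B-NR", "东 L-NR"]

# With the current line being "海 L-NR" (index 1):
u02 = expand_macro(lines, 1, 0, 0)  # %x[0,0] -> column 0 of the current line
u03 = expand_macro(lines, 1, 1, 0)  # %x[1,0] -> column 0 of the next line
```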

gensim doc2vec “intersect_word2vec_format” command

大城市里の小女人 submitted on 2019-12-20 03:55:16
Question: I was just reading through the doc2vec commands on the gensim page, and I am curious about the command "intersect_word2vec_format". My understanding of this command is that it lets me inject vector values from a pretrained word2vec model into my doc2vec model, and then train my doc2vec model using the pretrained word2vec values rather than generating the word vector values from my document corpus. The result is that I get a more accurate doc2vec model, because I am using pretrained w2v values, which was
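The core idea of the method, as the questioner describes it, is an intersection: only words already in the model's vocabulary get their vectors overwritten by the pretrained ones; no new words are added. A toy dict-of-lists stand-in for that behaviour (this is not the gensim API, which operates on a KeyedVectors file on disk):

```python
def intersect_pretrained(model_vectors, pretrained):
    """Mimic the idea behind gensim's intersect_word2vec_format: for every
    word present in both vocabularies, replace the model's current vector
    with the pretrained one; other words are left untouched and no new
    vocabulary entries are created."""
    updated = dict(model_vectors)
    for word, vec in pretrained.items():
        if word in updated:
            updated[word] = list(vec)
    return updated

# Invented toy vocabularies for illustration
model = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
pretrained = {"cat": [0.9, 0.9], "bird": [0.5, 0.5]}
merged = intersect_pretrained(model, pretrained)
```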

How can we extract the main verb from a sentence?

回眸只為那壹抹淺笑 submitted on 2019-12-20 03:09:57
Question: For example, "parrots do not swim." Here the main verb is "swim". How can we extract that by language processing? Are there any known algorithms for this purpose? Answer 1: You can run a dependency parsing algorithm on the sentence and then find the dependent of the root relation. For example, running the sentence "Parrots do not swim" through the Stanford Parser online demo, I get the following dependencies:
nsubj(swim-4, Parrots-1)
aux(swim-4, do-2)
neg(swim-4, not-3)
root(ROOT-0, swim-4)
Each of
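Given dependency triples in the textual form shown above, picking out the main verb is just a matter of finding the dependent of the `root` relation. A sketch that parses those strings directly (a real pipeline would read the parser's structured output instead):

```python
import re

def main_verb(dependencies):
    """Return the dependent of the root relation from Stanford-style
    dependency strings like 'root(ROOT-0, swim-4)', or None if absent."""
    for dep in dependencies:
        m = re.match(r"root\(ROOT-0, (\w+)-\d+\)", dep)
        if m:
            return m.group(1)
    return None

deps = [
    "nsubj(swim-4, Parrots-1)",
    "aux(swim-4, do-2)",
    "neg(swim-4, not-3)",
    "root(ROOT-0, swim-4)",
]
```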