nlp

How to use serialized CRFClassifier with StanfordCoreNLP prop 'ner'

Submitted by 烂漫一生 on 2019-12-23 12:27:26
Question: I'm using the StanfordCoreNLP API interface to do some basic NLP programmatically. I need to train a model on my own corpus, but I'd like to use the StanfordCoreNLP interface to do it, because it handles a lot of the dry mechanics behind the scenes and I don't need much specialization there. I've trained a CRFClassifier that I'd like to use for NER, serialized to a file. Based on the documentation, I'd have thought the following would work, but it doesn't seem to find my model and instead barfs on
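The usual fix (the question is truncated, so this is a sketch rather than the thread's answer) is to point the pipeline's ner annotator at the serialized classifier through the ner.model property. A minimal properties-file sketch, assuming the model was serialized to a hypothetical path my-ner-model.ser.gz:

```properties
# Run the custom CRF model in the ner step of the pipeline
annotators = tokenize, ssplit, pos, lemma, ner
ner.model = /path/to/my-ner-model.ser.gz
# Optionally disable SUTime if only the custom model's labels are wanted
ner.useSUTime = false
```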

Semantic parsing with NLTK

Submitted by 纵然是瞬间 on 2019-12-23 10:22:45
Question: I am trying to use NLTK for semantic parsing of spoken navigation commands such as "go to San Francisco" or "give me directions to 123 Main Street". This could be done with a fairly simple CFG grammar such as

  S -> COMMAND LOCATION
  COMMAND -> "go to" | "give me directions to" | ...
  LOCATION -> CITY | STREET | ...

The problem is that this involves non-atomic (more than one word long) literals such as "go to", which NLTK doesn't seem to be set up for (correct me if I am wrong). The parsing
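One common workaround (a sketch, not taken from the original thread) is to keep every terminal a single word, so a multi-word command like "go to" becomes a sequence of terminals on the right-hand side and a plain whitespace tokenizer suffices:

```python
import nltk

# Multi-word commands expressed as sequences of single-word terminals.
grammar = nltk.CFG.fromstring("""
S -> COMMAND LOCATION
COMMAND -> 'go' 'to' | 'give' 'me' 'directions' 'to'
LOCATION -> CITY | STREET
CITY -> 'San' 'Francisco'
STREET -> '123' 'Main' 'Street'
""")

parser = nltk.ChartParser(grammar)
tokens = "go to San Francisco".split()
trees = list(parser.parse(tokens))
print(trees[0])
```

The grammar only covers the two example sentences; a real command grammar would need a gazetteer or a much looser LOCATION rule.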

How to identify person names in text (Java)

Submitted by 半世苍凉 on 2019-12-23 10:10:04
Question: I have some input text which contains one or more human person names, and I do not have any dictionary of these names. Which Java library can help me extract the names from my input text? I looked through OpenNLP, but did not find any example, guide, or even a description of how it can be applied in my code. (I saw the javadoc, but it is pretty poor documentation for such a project.) I want to find names in arbitrary text. If the input text is "My friend Joe Smith went to the store.", then I

Edit distance between two pandas columns

Submitted by 拜拜、爱过 on 2019-12-23 08:43:30
Question: I have a pandas DataFrame consisting of two columns of strings. I would like to create a third column containing the edit distance of the two columns.

  from nltk.metrics import edit_distance
  df['edit'] = edit_distance(df['column1'], df['column2'])

For some reason this seems to enter some sort of infinite loop, in the sense that it remains unresponsive for quite some time and I then have to terminate it manually. Any suggestions are welcome. Answer 1: The nltk's edit_distance function is for
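The truncated answer is pointing at the core issue: edit_distance expects a pair of strings, not two whole Series, so it has to be applied row by row. A sketch (the column names column1/column2 come from the question; the pure-Python fallback exists only so the example runs without nltk installed):

```python
import pandas as pd

try:
    from nltk.metrics import edit_distance
except ImportError:
    # Minimal Levenshtein distance so the sketch works without nltk.
    def edit_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

df = pd.DataFrame({'column1': ['kitten', 'flaw'],
                   'column2': ['sitting', 'lawn']})

# Apply pairwise, one row at a time, instead of passing whole Series.
df['edit'] = df.apply(lambda r: edit_distance(r['column1'], r['column2']), axis=1)
print(df)
```

Passing the two Series directly makes edit_distance treat them as two sequences of strings, which is why the original call never returns anything sensible.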

Max over time pooling in Keras

Submitted by 浪尽此生 on 2019-12-23 07:47:13
Question: I'm using CNNs in Keras for an NLP task, and instead of max pooling I'm trying to achieve max-over-time pooling. Any ideas/hacks on how to achieve this? What I mean by max-over-time pooling is to pool the highest value, no matter where it is in the vector. Answer 1: Assuming that your data shape is (batch_size, seq_len, features), you may apply:

  seq_model = Reshape((seq_len * features, 1))(seq_model)
  seq_model = GlobalMaxPooling1D()(seq_model)

Source: https://stackoverflow.com/questions/41958115/max
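To make precise what the answer's Reshape + GlobalMaxPooling1D pair computes, here is the same arithmetic in plain NumPy (an illustration, not Keras code). Standard max-over-time pooling takes the maximum over the sequence axis, one value per feature; the answer's reshape trick first collapses time and features together, yielding a single scalar per example:

```python
import numpy as np

batch_size, seq_len, features = 2, 4, 3
x = np.arange(batch_size * seq_len * features).reshape(batch_size, seq_len, features)

# Standard max-over-time pooling: max over the time axis, one value per feature.
per_feature = x.max(axis=1)                        # shape (batch_size, features)

# The answer's reshape trick pools over time AND features at once,
# producing one scalar per example.
single_max = x.reshape(batch_size, -1).max(axis=1)  # shape (batch_size,)

print(per_feature.shape, single_max.shape)
```

Which of the two is wanted depends on whether the downstream layer expects one value per filter (the usual CNN-for-NLP setup) or a single pooled score.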

Tabulating characters with diacritics in R

Submitted by 十年热恋 on 2019-12-23 07:12:39
Question: I'm trying to tabulate phone (character) occurrences in a string, but diacritics are tabulated as characters of their own. I have a wordlist in the International Phonetic Alphabet, with a fair number of diacritics and several combinations of them with base characters. I give here a MWE with just one word, but the same goes for lists of words and more types of combinations.

  > word <- "n̥ana" # a word consisting of 4 phones: [n̥], [a], [n], [a]
  > table(strsplit(word, ""))
  ̥ a n
  1 2 2

But the
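The underlying issue is that strsplit(word, "") splits on code points, while a phone like [n̥] is a base character plus a combining mark (U+0325). For illustration only, and in Python rather than R, grouping each combining mark with its preceding base character looks like this:

```python
import unicodedata

def split_phones(word):
    """Split a string into phones, attaching combining diacritics
    to the preceding base character."""
    phones = []
    for ch in word:
        if unicodedata.combining(ch) and phones:
            phones[-1] += ch          # glue the diacritic onto its base
        else:
            phones.append(ch)
    return phones

word = "n\u0325ana"   # [n̥], [a], [n], [a]
print(split_phones(word))
```

The same idea in R means splitting on grapheme-cluster boundaries rather than on individual code points.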

Calculating distance between word/document vectors from a nested dictionary

Submitted by 谁都会走 on 2019-12-23 06:45:15
Question: I have a nested dictionary such as:

  myDict = {'a': {1:2, 2:163, 3:12, 4:67, 5:84},
            'about': {1:27, 2:45, 3:21, 4:10, 5:15},
            'apple': {1:0, 2:5, 3:0, 4:10, 5:0},
            'anticipate': {1:1, 2:5, 3:0, 4:8, 5:7},
            'an': {1:3, 2:15, 3:1, 4:312, 5:100}}

The outer keys are words, the inner keys are file/document IDs, and the values are the number of times the word (the outer key) occurs. How do I calculate the sum of the squared values for each inner key? For example, for inner key 1 I should get: 2^2 + 27
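Using the dictionary from the question, the per-document sum of squares can be computed directly (a sketch; for inner key 1 the terms are 2² + 27² + 0² + 1² + 3² = 743):

```python
myDict = {'a': {1:2, 2:163, 3:12, 4:67, 5:84},
          'about': {1:27, 2:45, 3:21, 4:10, 5:15},
          'apple': {1:0, 2:5, 3:0, 4:10, 5:0},
          'anticipate': {1:1, 2:5, 3:0, 4:8, 5:7},
          'an': {1:3, 2:15, 3:1, 4:312, 5:100}}

def sum_of_squares(d, doc_id):
    # Sum the squared counts of every word in one document.
    return sum(counts[doc_id] ** 2 for counts in d.values())

print(sum_of_squares(myDict, 1))   # 2^2 + 27^2 + 0^2 + 1^2 + 3^2 = 743
```

The square root of this sum is the Euclidean norm of the document vector, which is what a cosine-distance computation between documents would need next.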

Converting POS tags from TextBlob into Wordnet compatible inputs

Submitted by 守給你的承諾、 on 2019-12-23 06:07:13
Question: I'm using Python and nltk + TextBlob for some text analysis. It's interesting that you can add a POS for WordNet to make your search for synonyms more specific, but unfortunately the tagging in both nltk and TextBlob isn't "compatible" with the kind of input that WordNet expects for its synset class. Example: Wordnet.synsets() requires that the POS you give it is one of n, v, a, r, like so: wn.synsets("dog", pos="v"). But a standard POS tag from the upenn_treebank tagset looks like JJ, VBD, VBZ,
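A common way to bridge the two tagsets (a sketch, not from the truncated question) maps the first letter of a Penn Treebank tag onto WordNet's four POS letters:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to one of WordNet's POS letters
    ('a', 'v', 'n', 'r'), or None if there is no counterpart."""
    mapping = {'J': 'a',   # adjectives (JJ, JJR, JJS)
               'V': 'v',   # verbs (VB, VBD, VBZ, ...)
               'N': 'n',   # nouns (NN, NNS, NNP, ...)
               'R': 'r'}   # adverbs (RB, RBR, RBS)
    return mapping.get(tag[0]) if tag else None

print(penn_to_wordnet('VBD'))   # 'v'
```

These letters match nltk's wordnet constants (wn.ADJ == 'a', wn.VERB == 'v', and so on), so the result can be passed straight to wn.synsets(word, pos=...); tags with no WordNet counterpart, such as DT, map to None and should be skipped.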
