nlp

Java NLP: Extracting Indices When Tokenizing Text

Submitted by 六眼飞鱼酱① on 2021-02-20 04:54:46
Question: When tokenizing a string of text, I need to extract the indices of the tokenized words. For example, given: "Mary didn't kiss John" I would need something like: [(Mary, 0), (did, 5), (n't, 8), (kiss, 12), (John, 17)] where 0, 5, 8, 12 and 17 correspond to the index (in the original string) where each token began. I cannot rely on whitespace alone, since some words become two tokens. Nor can I simply search for each token in the string, since a word will likely appear multiple times. One
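
The question targets Java, but the underlying technique is language-agnostic: walk the original string with a cursor and locate each token starting from where the previous one ended, so repeated words and split tokens like "n't" resolve to the right offsets. A minimal Python sketch (align_tokens is a name chosen here for illustration, not from the question; exact matching assumes the tokenizer does not rewrite characters, e.g. PTB-style quote conversion would break it):

def align_tokens(text, tokens):
    """Return (token, start_index) pairs by scanning text left to right."""
    spans, cursor = [], 0
    for tok in tokens:
        start = text.index(tok, cursor)  # search only from the cursor onward
        spans.append((tok, start))
        cursor = start + len(tok)        # never rematch earlier occurrences
    return spans

print(align_tokens("Mary didn't kiss John", ["Mary", "did", "n't", "kiss", "John"]))
# [('Mary', 0), ('did', 5), ("n't", 8), ('kiss', 12), ('John', 17)]

NLTK ships a comparable helper (nltk.tokenize.util.align_tokens) built on the same cursor-scan idea.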

Extracting City, State and Country from Raw address string [closed]

Submitted by 流过昼夜 on 2021-02-20 04:12:47
Question (closed on Stack Overflow as off-topic): Given a raw string input

1600 Divisadero St San Francisco, CA 94115 b/t Post St & Sutter St Lower Pacific Heights

I want to extract:

City: San Francisco
State: California or CA
Country: USA

I'll be parsing millions of addresses, and using a paid API is not feasible
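
A sketch of one no-API route, assuming the open-source usaddress package (it handles US addresses only, so the country can simply default to USA). The component labels 'PlaceName' and 'StateName' are usaddress's own, and the sample string is taken from the question:

import usaddress

raw = "1600 Divisadero St San Francisco, CA 94115"
components, kind = usaddress.tag(raw)  # (OrderedDict of labeled parts, address type)

print(components.get("PlaceName"))   # 'San Francisco'
print(components.get("StateName"))   # 'CA'

For worldwide coverage, libpostal is a commonly used offline alternative.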

Querying part-of-speech tags with Lucene 7 OpenNLP

Submitted by 故事扮演 on 2021-02-20 03:50:40
Question: For fun and learning I am trying to build a part-of-speech (POS) tagger with OpenNLP and Lucene 7.4. The goal is that, once indexed, I can search for a sequence of POS tags and find all sentences that match the sequence. I have the indexing part working, but I am stuck on the query part. I am aware that Solr might have some functionality for this, and I already checked the code (which was not so self-explanatory after all). But my goal is to understand and implement this in Lucene 7, not
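
Lucene specifics aside, the core idea can be illustrated without the library: store each sentence's POS-tag sequence and treat a query as a contiguous n-gram match over that sequence, which is conceptually what a phrase query over POS tokens does. A toy Python sketch (sentence ids, tags, and examples invented for illustration):

# Toy index: sentence id -> POS-tag sequence produced by any tagger.
tagged = {
    1: ["NNP", "VBZ", "DT", "NN"],   # e.g. "Mary kisses the boy"
    2: ["DT", "NN", "VBZ", "JJ"],    # e.g. "The sky is blue"
}

def find_sentences(tag_query, index):
    """Return ids of sentences whose tag sequence contains tag_query contiguously."""
    n = len(tag_query)
    return [sid for sid, tags in index.items()
            if any(tags[i:i + n] == tag_query for i in range(len(tags) - n + 1))]

print(find_sentences(["DT", "NN"], tagged))  # [1, 2]

In Lucene terms, one common approach is to index the POS tags as tokens at the same positions as the words and run a phrase or span query over them.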

Removing punctuation using spaCy; AttributeError

Submitted by 。_饼干妹妹 on 2021-02-19 03:00:24
Question: Currently I'm using the following code to lemmatize and calculate TF-IDF values for some text data using spaCy:

lemma = []
for doc in nlp.pipe(df['col'].astype('unicode').values, batch_size=9844, n_threads=3):
    if doc.is_parsed:
        lemma.append([n.lemma_ for n in doc if not n.lemma_.is_punct | n.lemma_ != "-PRON-"])
    else:
        lemma.append(None)
df['lemma_col'] = lemma
vect = sklearn.feature_extraction.text.TfidfVectorizer()
lemmas = df['lemma_col'].apply(lambda x: ' '.join(x))
vect = sklearn.feature
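
The AttributeError almost certainly comes from n.lemma_.is_punct: n.lemma_ is a plain string, and the is_punct flag lives on the Token itself (there is also an operator-precedence trap in mixing | with != there; a plain "and" is safer). A self-contained sketch of the fixed filter (the pipeline name is an assumption; "-PRON-" is spaCy 2's pronoun-lemma placeholder):

import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: any English pipeline works here
doc = nlp("Mary didn't kiss John.")

# Keep lemmas, dropping punctuation tokens and the "-PRON-" placeholder.
lemmas = [tok.lemma_ for tok in doc
          if not tok.is_punct and tok.lemma_ != "-PRON-"]
print(lemmas)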

Gensim: how to load precomputed word vectors from text file

Submitted by 依然范特西╮ on 2021-02-18 22:17:46
Question: I have a text file with my precomputed word vectors in the following format (example): word -0.0762464299711 0.0128308048976 ... 0.0712385589283\n on each line for every word (with 297 extra floats in place of the ...). I am trying to load these with Gensim as KeyedVectors, because I ultimately want to compute cosine similarities, find the most similar words, etc. Unfortunately I have not worked with Gensim before, and from the documentation it's not quite clear to me how to do this. I
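
A minimal sketch, assuming a file named "vectors.txt" (a placeholder) holding one word plus 300 floats per line as described. Gensim's word2vec text format normally expects a leading "count dimensions" header line; for header-less (GloVe-style) files, gensim 4.x can infer it with no_header=True:

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False, no_header=True)

print(kv.most_similar("word", topn=5))  # nearest neighbours by cosine similarity

Cosine similarity between two specific entries is then kv.similarity(word_a, word_b). On gensim versions before 4.0, prepending the "count dimensions" header line to the file achieves the same result.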

Extracting the person names in the named entity recognition in NLP using Python

Submitted by 限于喜欢 on 2021-02-18 12:20:27
Question: I have a sentence for which I need to identify the person names alone. For example:

sentence = "Larry Page is an American business magnate and computer scientist who is the co-founder of Google, alongside Sergey Brin"

I have used the code below to identify the named entities:

from nltk import word_tokenize, pos_tag, ne_chunk
print(ne_chunk(pos_tag(word_tokenize(sentence))))

The output I received was:

(S (PERSON Larry/NNP) (ORGANIZATION Page/NNP) is/VBZ an/DT (GPE American/JJ) business/NN magnate/NN and
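
One way to keep only the people, as a sketch built on the question's own code: walk the tree that ne_chunk returns and join the leaves of every PERSON subtree. Note the caveat visible in the output above: NLTK tags "Page" as ORGANIZATION here, so this yields "Larry" rather than "Larry Page"; merging adjacent name chunks or switching to a stronger NER model may be needed.

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.tree import Tree

def person_names(sentence):
    """Collect the contiguous PERSON chunks from nltk's NE tree."""
    names = []
    for subtree in ne_chunk(pos_tag(word_tokenize(sentence))):
        if isinstance(subtree, Tree) and subtree.label() == "PERSON":
            names.append(" ".join(word for word, tag in subtree.leaves()))
    return names

sentence = ("Larry Page is an American business magnate and computer scientist "
            "who is the co-founder of Google, alongside Sergey Brin")
print(person_names(sentence))  # e.g. ['Larry', 'Sergey Brin'] given the tagging above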