nlp

Java NLP: Extracting Indices When Tokenizing Text

Submitted by 六眼飞鱼酱① on 2021-02-20 04:54:46
Question: When tokenizing a string of text, I need to extract the indices of the tokenized words. For example, given: "Mary didn't kiss John" I would need something like: [(Mary, 0), (did, 5), (n't, 8), (kiss, 12), (John, 17)] where 0, 5, 8, 12 and 17 correspond to the index (in the original string) where each token began. I cannot rely on whitespace alone, since some words become two tokens. Nor can I simply search for each token in the string, since a word will likely appear multiple times. One
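
The question targets Java, but the underlying technique is language-agnostic: walk the original string with a cursor and locate each token starting from where the previous one ended, so repeated words and split tokens like "n't" resolve to the right offsets. A minimal Python sketch (align_tokens is a name chosen here for illustration, not from the question; exact matching assumes the tokenizer does not rewrite characters, e.g. PTB-style quote conversion would break it):

def align_tokens(text, tokens):
    """Return (token, start_index) pairs by scanning text left to right."""
    spans, cursor = [], 0
    for tok in tokens:
        start = text.index(tok, cursor)  # search only from the cursor onward
        spans.append((tok, start))
        cursor = start + len(tok)        # never rematch earlier occurrences
    return spans

print(align_tokens("Mary didn't kiss John", ["Mary", "did", "n't", "kiss", "John"]))
# [('Mary', 0), ('did', 5), ("n't", 8), ('kiss', 12), ('John', 17)]

NLTK ships a comparable helper (nltk.tokenize.util.align_tokens) built on the same cursor-scan idea.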

Extracting City, State and Country from Raw address string [closed]

Submitted by 流过昼夜 on 2021-02-20 04:12:47
Question (closed on Stack Overflow as off-topic): Given a raw string input

1600 Divisadero St San Francisco, CA 94115 b/t Post St & Sutter St Lower Pacific Heights

I want to extract:

City: San Francisco
State: California or CA
Country: USA

I'll be parsing millions of addresses, and using a paid API is not feasible
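
A sketch of one no-API route, assuming the open-source usaddress package (it handles US addresses only, so the country can simply default to USA). The component labels 'PlaceName' and 'StateName' are usaddress's own, and the sample string is taken from the question:

import usaddress

raw = "1600 Divisadero St San Francisco, CA 94115"
components, kind = usaddress.tag(raw)  # (OrderedDict of labeled parts, address type)

print(components.get("PlaceName"))   # 'San Francisco'
print(components.get("StateName"))   # 'CA'

For worldwide coverage, libpostal is a commonly used offline alternative.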

Querying part-of-speech tags with Lucene 7 OpenNLP

Submitted by 故事扮演 on 2021-02-20 03:50:40
Question: For fun and learning I am trying to build a part-of-speech (POS) tagger with OpenNLP and Lucene 7.4. The goal is that, once indexed, I can search for a sequence of POS tags and find all sentences that match the sequence. I have the indexing part working, but I am stuck on the query part. I am aware that Solr might have some functionality for this, and I already checked the code (which was not so self-explanatory after all). But my goal is to understand and implement this in Lucene 7, not
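
Lucene specifics aside, the core idea can be illustrated without the library: store each sentence's POS-tag sequence and treat a query as a contiguous n-gram match over that sequence, which is conceptually what a phrase query over POS tokens does. A toy Python sketch (sentence ids, tags, and examples invented for illustration):

# Toy index: sentence id -> POS-tag sequence produced by any tagger.
tagged = {
    1: ["NNP", "VBZ", "DT", "NN"],   # e.g. "Mary kisses the boy"
    2: ["DT", "NN", "VBZ", "JJ"],    # e.g. "The sky is blue"
}

def find_sentences(tag_query, index):
    """Return ids of sentences whose tag sequence contains tag_query contiguously."""
    n = len(tag_query)
    return [sid for sid, tags in index.items()
            if any(tags[i:i + n] == tag_query for i in range(len(tags) - n + 1))]

print(find_sentences(["DT", "NN"], tagged))  # [1, 2]

In Lucene terms, one common approach is to index the POS tags as tokens at the same positions as the words and run a phrase or span query over them.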

Removing punctuation using spaCy; AttributeError

Submitted by 。_饼干妹妹 on 2021-02-19 03:00:24
Question: Currently I'm using the following code to lemmatize and calculate TF-IDF values for some text data using spaCy:

lemma = []
for doc in nlp.pipe(df['col'].astype('unicode').values, batch_size=9844, n_threads=3):
    if doc.is_parsed:
        lemma.append([n.lemma_ for n in doc if not n.lemma_.is_punct | n.lemma_ != "-PRON-"])
    else:
        lemma.append(None)
df['lemma_col'] = lemma
vect = sklearn.feature_extraction.text.TfidfVectorizer()
lemmas = df['lemma_col'].apply(lambda x: ' '.join(x))
vect = sklearn.feature
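
The AttributeError almost certainly comes from n.lemma_.is_punct: n.lemma_ is a plain string, and the is_punct flag lives on the Token itself (there is also an operator-precedence trap in mixing | with != there; a plain "and" is safer). A self-contained sketch of the fixed filter (the pipeline name is an assumption; "-PRON-" is spaCy 2's pronoun-lemma placeholder):

import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: any English pipeline works here
doc = nlp("Mary didn't kiss John.")

# Keep lemmas, dropping punctuation tokens and the "-PRON-" placeholder.
lemmas = [tok.lemma_ for tok in doc
          if not tok.is_punct and tok.lemma_ != "-PRON-"]
print(lemmas)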

Gensim: how to load precomputed word vectors from text file

Submitted by 依然范特西╮ on 2021-02-18 22:17:46
Question: I have a text file with my precomputed word vectors in the following format (example): word -0.0762464299711 0.0128308048976 ... 0.0712385589283\n on each line for every word (with 297 extra floats in place of the ...). I am trying to load these with Gensim as KeyedVectors, because I ultimately want to compute cosine similarities, find the most similar words, etc. Unfortunately I have not worked with Gensim before, and from the documentation it's not quite clear to me how to do this. I
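
A minimal sketch, assuming a file named "vectors.txt" (a placeholder) holding one word plus 300 floats per line as described. Gensim's word2vec text format normally expects a leading "count dimensions" header line; for header-less (GloVe-style) files, gensim 4.x can infer it with no_header=True:

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False, no_header=True)

print(kv.most_similar("word", topn=5))  # nearest neighbours by cosine similarity

Cosine similarity between two specific entries is then kv.similarity(word_a, word_b). On gensim versions before 4.0, prepending the "count dimensions" header line to the file achieves the same result.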

Extracting the person names in the named entity recognition in NLP using Python

Submitted by 限于喜欢 on 2021-02-18 12:20:27
Question: I have a sentence for which I need to identify the person names alone. For example:

sentence = "Larry Page is an American business magnate and computer scientist who is the co-founder of Google, alongside Sergey Brin"

I have used the code below to identify the named entities:

from nltk import word_tokenize, pos_tag, ne_chunk
print(ne_chunk(pos_tag(word_tokenize(sentence))))

The output I received was:

(S (PERSON Larry/NNP) (ORGANIZATION Page/NNP) is/VBZ an/DT (GPE American/JJ) business/NN magnate/NN and
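
One way to keep only the people, as a sketch built on the question's own code: walk the tree that ne_chunk returns and join the leaves of every PERSON subtree. Note the caveat visible in the output above: NLTK tags "Page" as ORGANIZATION here, so this yields "Larry" rather than "Larry Page"; merging adjacent name chunks or switching to a stronger NER model may be needed.

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.tree import Tree

def person_names(sentence):
    """Collect the contiguous PERSON chunks from nltk's NE tree."""
    names = []
    for subtree in ne_chunk(pos_tag(word_tokenize(sentence))):
        if isinstance(subtree, Tree) and subtree.label() == "PERSON":
            names.append(" ".join(word for word, tag in subtree.leaves()))
    return names

sentence = ("Larry Page is an American business magnate and computer scientist "
            "who is the co-founder of Google, alongside Sergey Brin")
print(person_names(sentence))  # e.g. ['Larry', 'Sergey Brin'] given the tagging above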