nlp

turn lists of lists into strings pandas dataframe

落爺英雄遲暮 submitted on 2019-12-12 18:52:18
Question: Background: I have the following toy df that contains lists in the columns Before and After, as seen below:

import pandas as pd

before = [list(['in', 'the', 'bright', 'blue', 'box']),
          list(['because', 'they', 'go', 'really', 'fast']),
          list(['to', 'ride', 'and', 'have', 'fun'])]
after = [list(['there', 'are', 'many', 'different']),
         list(['i', 'like', 'a', 'lot', 'of', 'sports']),
         list(['the', 'middle', 'east', 'has', 'many'])]
df = pd.DataFrame({'Before': before, 'After': after, 'P_ID': [1, 2, 3], 'Word': [
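The rest of the question is cut off above, so here is only a minimal sketch of one common approach to the task in the title, assuming the goal is space-joined strings: pandas' Series.str.join concatenates the elements of each list-valued cell.

import pandas as pd

before = [['in', 'the', 'bright', 'blue', 'box'],
          ['because', 'they', 'go', 'really', 'fast'],
          ['to', 'ride', 'and', 'have', 'fun']]
df = pd.DataFrame({'Before': before, 'P_ID': [1, 2, 3]})

df['Before'] = df['Before'].str.join(' ')  # each list becomes a single string
print(df['Before'][0])                     # 'in the bright blue box'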

Painfully slow Postgres query using WHERE on many adjacent rows

亡梦爱人 submitted on 2019-12-12 18:37:48
Question: I have the following psql table. It has roughly 2 billion rows in total.

id  word      lemma  pos    textid  source
1   Stuffing  stuff  vvg    190568  AN
2   her       her    appge  190568  AN
3   key       key    nn1    190568  AN
4   into      into   ii     190568  AN
5   the       the    at     190568  AN
6   lock      lock   nn1    190568  AN
7   she       she    appge  190568  AN
8   pushed    push   vvd    190568  AN
9   her       her    appge  190568  AN
10  way       way    nn1    190568  AN
11  into      into   ii     190568  AN
12  the       the    appge  190568  AN
13  house     house  nn1    190568  AN
14  .         .             190568  AN
15  She       she    appge  190568  AN
16  had
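The asker's actual query is cut off above, so this is only a general sketch (the table name 'words' and the connection string are assumptions; column names come from the sample): fetching a window of adjacent rows with one range predicate over an indexed id lets Postgres do a single index range scan instead of evaluating many OR'd conditions.

import psycopg2  # assumes the psycopg2 driver is installed

conn = psycopg2.connect("dbname=corpus")  # hypothetical connection string
cur = conn.cursor()

# If id is not already the (indexed) primary key, a btree index enables range scans.
cur.execute("CREATE INDEX IF NOT EXISTS words_id_idx ON words (id)")

# One BETWEEN range instead of many OR'd equality tests on adjacent ids.
cur.execute(
    "SELECT id, word, lemma, pos FROM words WHERE id BETWEEN %s AND %s ORDER BY id",
    (1, 16),
)
rows = cur.fetchall()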

SGDClassifier giving different accuracy each time for text classification

我的未来我决定 submitted on 2019-12-12 18:17:02
Question: I'm using the SVM classifier for classifying text as good text and gibberish. I'm using Python's scikit-learn and doing it as follows:

'''
Created on May 5, 2017
'''
import re
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics

# Prepare data
def prepare_data(data):
    """
    data is expected to be a list of tuples of category and texts.
    Returns a tuple of a list of labels and a list
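The rest of the question is truncated, but a likely cause of the run-to-run variance is the unseeded RNG: SGDClassifier shuffles the training data each epoch, so pinning random_state (and seeding any manual shuffling done with the random module, which the code above imports) makes accuracy reproducible. A minimal sketch:

import random
from sklearn.linear_model import SGDClassifier

random.seed(0)  # fixes any shuffling done with the random module
clf = SGDClassifier(loss='hinge', random_state=42)  # fixed seed -> repeatable runs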

Spacy-nightly (spacy 2.0) issue with “thinc.extra.MaxViolation has wrong size”

…衆ロ難τιáo~ submitted on 2019-12-12 18:02:12
Question: After an apparently successful installation of spacy-nightly (spacy-nightly-2.0.0a14) and the English model (en_core_web_sm), I was still receiving an error message when attempting to run it:

import spacy
nlp = spacy.load('en_core_web_sm')

ValueError: thinc.extra.search.MaxViolation has the wrong size, try recompiling. Expected 104, got 128

I tried to reinstall spacy and the model as well, and it has not helped. Tried it again within a new venv (Python 3.6).

Answer 1: The issue is probably with the thinc package; spacy-nightly
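As a quick diagnostic sketch (not from the truncated answer): a "wrong size, try recompiling" error usually means the installed thinc binary was built against a different spacy release, so checking that the two installed versions match is a sensible first step before reinstalling both into a clean venv.

import spacy
import thinc

print(spacy.__version__)  # e.g. 2.0.0a14 for this nightly
print(thinc.__version__)  # must be a version this nightly was compiled against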

Where to find a state-of-the-art relation extraction dataset

寵の児 submitted on 2019-12-12 17:13:39
Question: I am looking for a dataset that contains large quantities of relation tuples. For example, searching for "people" and "location" yields "lives in", "worked in", etc. University of Washington's OpenIE (http://OpenIE.cs.washington.edu) is a good tool, but its dataset is only accessible through the web. Where can I download a database or library like this?

Answer 1: OpenIE itself provides a large dataset of 11 GB for this purpose. Check this: http://knowitall.cs.washington.edu/paralex/ Although it is an

NLTK context-free grammars

倖福魔咒の submitted on 2019-12-12 17:04:50
Question: I am just wondering how you would add an optional element to a rule in the grammar used by:

>>> import nltk
>>> nltk.app.rdparser()

For example, the normal way to mark an element as optional is to put it in parentheses: NP -> NP (PP). But how would you do that in the program? Parentheses don't work.

S -> NP VP
NP -> NP PP | Det N
VP -> V NP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'man' | 'park' | 'dog' | 'boy' | 'girl'
V -> 'was' | 'saw'
P -> 'in' | 'under' | 'with'

Thanks, Ray

Answer 1: NP -> NP | NP PP

But note that, with this
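Picking up where the truncated answer leaves off: NLTK's CFG format has no (...)-style optionality, so an optional PP is written as two explicit alternatives. A minimal sketch using nltk.CFG.fromstring (the toy lexicon below is trimmed from the question's grammar):

import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N | Det N PP
VP -> V NP | V NP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'man' | 'park' | 'dog'
V -> 'saw'
P -> 'in'
""")

parser = nltk.ChartParser(grammar)
# The optional PP yields two parses: attachment to the NP or to the VP.
for tree in parser.parse("the man saw the dog in the park".split()):
    print(tree)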

Arabic text not showing in R

旧时模样 submitted on 2019-12-12 16:25:44
Question: I just started working with R in Arabic, as I plan to do text analysis and text mining with a Hadith corpus. I have been reading threads related to my question, but nevertheless I still can't manage to get the REAL basics here (sorry, absolute beginner). So, I entered:

textarabic.v <- scan("data/arabic-text.txt", encoding="UTF-8", what="character", sep="\n")

And what comes out in textarabic.v is, of course, symbols (pic). Prior to this, I saved my text in UTF-8 as I read in a thread, but still nothing

Grouping Similar Strings

随声附和 submitted on 2019-12-12 15:31:31
Question: I'm trying to analyze a bunch of search terms, so many that individually they don't tell much. That said, I'd like to group the terms because I think similar terms should have similar effectiveness. For example:

Term            Group
NBA Basketball  1
Basketball NBA  1
Basketball      1
Baseball        2

It's a contrived example, but hopefully it explains what I'm trying to do. So then, what is the best way to do what I've described? I thought NLTK may have something along those lines, but I'm only barely
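A minimal sketch of one possible approach (not taken from the truncated question): greedy grouping by Jaccard similarity over word sets, which puts "NBA Basketball", "Basketball NBA", and "Basketball" in one group and "Baseball" in another. The threshold value is an arbitrary assumption:

def jaccard(a, b):
    # Word-level Jaccard similarity of two strings.
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def group_terms(terms, threshold=0.3):
    # Assign each term the group of the first sufficiently similar
    # term already seen; otherwise start a new group.
    labels, next_id = {}, 1
    for term in terms:
        for seen, gid in list(labels.items()):
            if jaccard(term, seen) >= threshold:
                labels[term] = gid
                break
        else:
            labels[term], next_id = next_id, next_id + 1
    return labels

print(group_terms(['NBA Basketball', 'Basketball NBA', 'Basketball', 'Baseball']))
# {'NBA Basketball': 1, 'Basketball NBA': 1, 'Basketball': 1, 'Baseball': 2}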

Where to find an exhaustive list of stop words?

China☆狼群 submitted on 2019-12-12 15:03:05
Question: Where could I find an exhaustive list of stop words? The one I have is quite short and seems inapplicable to scientific texts. I am creating lexical chains to extract key topics from scientific papers. The problem is that words like "based", "regarding", etc. should also be considered stop words, as they do not deliver much meaning.

Answer 1: You can also easily add to existing stop word lists. E.g. use the one in the NLTK toolkit:

from nltk.corpus import stopwords

and then add whatever you
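Completing the truncated answer's thought with a short sketch: extend NLTK's bundled English list with the domain words the question mentions (the stopwords corpus must first be fetched once with nltk.download('stopwords')):

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words.update(['based', 'regarding'])  # the question's examples of weak scientific-prose words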

How to build POS-tagged corpus with NLTK?

穿精又带淫゛_ submitted on 2019-12-12 14:23:52
Question: I am trying to build a POS-tagged corpus from external .txt files for chunking and for entity and relation extraction. So far I have found a cumbersome multistep solution:

Read the files into a plain-text corpus:

from nltk.corpus.reader import PlaintextCorpusReader
my_corp = PlaintextCorpusReader(".", r".*\.txt")

Tag the corpus with the built-in Penn POS tagger:

my_tagged_corp = nltk.batch_pos_tag(my_corp.sents())

(By the way, at this point Python threw an error: NameError: name 'batch' is not defined)

Write
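The rest of the question is cut off above; as a hedged sketch of a likely fix for the tagging step, NLTK 3 replaced batch_pos_tag with pos_tag_sents, which tags a list of tokenized sentences in one call (it needs the tagger model, fetched once via nltk.download('averaged_perceptron_tagger')):

import nltk
from nltk.corpus.reader import PlaintextCorpusReader

my_corp = PlaintextCorpusReader(".", r".*\.txt")
tagged_sents = nltk.pos_tag_sents(my_corp.sents())  # a [(word, tag), ...] list per sentence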