nlp

turn lists of lists into strings pandas dataframe

落爺英雄遲暮 submitted on 2019-12-12 18:52:18
Question: Background: I have the following toy df that contains lists in the columns Before and After, as seen below:

import pandas as pd

before = [list(['in', 'the', 'bright', 'blue', 'box']),
          list(['because', 'they', 'go', 'really', 'fast']),
          list(['to', 'ride', 'and', 'have', 'fun'])]
after = [list(['there', 'are', 'many', 'different']),
         list(['i', 'like', 'a', 'lot', 'of', 'sports']),
         list(['the', 'middle', 'east', 'has', 'many'])]
df = pd.DataFrame({'Before': before, 'After': after, 'P_ID': [1, 2, 3], 'Word': [
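The rest of the question is cut off above, so here is only a minimal sketch of one common approach to the task in the title, assuming the goal is space-joined strings: pandas' Series.str.join concatenates the elements of each list-valued cell.

import pandas as pd

before = [['in', 'the', 'bright', 'blue', 'box'],
          ['because', 'they', 'go', 'really', 'fast'],
          ['to', 'ride', 'and', 'have', 'fun']]
df = pd.DataFrame({'Before': before, 'P_ID': [1, 2, 3]})

df['Before'] = df['Before'].str.join(' ')  # each list becomes a single string
print(df['Before'][0])                     # 'in the bright blue box'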

Painfully slow Postgres query using WHERE on many adjacent rows

亡梦爱人 submitted on 2019-12-12 18:37:48
Question: I have the following psql table. It has roughly 2 billion rows in total.

id  word      lemma  pos    textid  source
1   Stuffing  stuff  vvg    190568  AN
2   her       her    appge  190568  AN
3   key       key    nn1    190568  AN
4   into      into   ii     190568  AN
5   the       the    at     190568  AN
6   lock      lock   nn1    190568  AN
7   she       she    appge  190568  AN
8   pushed    push   vvd    190568  AN
9   her       her    appge  190568  AN
10  way       way    nn1    190568  AN
11  into      into   ii     190568  AN
12  the       the    appge  190568  AN
13  house     house  nn1    190568  AN
14  .         .             190568  AN
15  She       she    appge  190568  AN
16  had
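The asker's actual query is cut off above, so this is only a general sketch (the table name 'words' and the connection string are assumptions; column names come from the sample): fetching a window of adjacent rows with one range predicate over an indexed id lets Postgres do a single index range scan instead of evaluating many OR'd conditions.

import psycopg2  # assumes the psycopg2 driver is installed

conn = psycopg2.connect("dbname=corpus")  # hypothetical connection string
cur = conn.cursor()

# If id is not already the (indexed) primary key, a btree index enables range scans.
cur.execute("CREATE INDEX IF NOT EXISTS words_id_idx ON words (id)")

# One BETWEEN range instead of many OR'd equality tests on adjacent ids.
cur.execute(
    "SELECT id, word, lemma, pos FROM words WHERE id BETWEEN %s AND %s ORDER BY id",
    (1, 16),
)
rows = cur.fetchall()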

SGDClassifier giving different accuracy each time for text classification

我的未来我决定 submitted on 2019-12-12 18:17:02
Question: I'm using the SVM classifier for classifying text as good text and gibberish. I'm using Python's scikit-learn and doing it as follows:

'''
Created on May 5, 2017
'''
import re
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics

# Prepare data
def prepare_data(data):
    """
    data is expected to be a list of tuples of category and texts.
    Returns a tuple of a list of labels and a list
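The rest of the question is truncated, but a likely cause of the run-to-run variance is the unseeded RNG: SGDClassifier shuffles the training data each epoch, so pinning random_state (and seeding any manual shuffling done with the random module, which the code above imports) makes accuracy reproducible. A minimal sketch:

import random
from sklearn.linear_model import SGDClassifier

random.seed(0)  # fixes any shuffling done with the random module
clf = SGDClassifier(loss='hinge', random_state=42)  # fixed seed -> repeatable runs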

Spacy-nightly (spacy 2.0) issue with “thinc.extra.MaxViolation has wrong size”

…衆ロ難τιáo~ submitted on 2019-12-12 18:02:12
Question: After an apparently successful installation of spacy-nightly (spacy-nightly-2.0.0a14) and the English model (en_core_web_sm), I was still receiving an error message when attempting to run it:

import spacy
nlp = spacy.load('en_core_web_sm')

ValueError: thinc.extra.search.MaxViolation has the wrong size, try recompiling. Expected 104, got 128

I tried to reinstall spacy and the model as well, and it has not helped. Tried it again within a new venv (Python 3.6).

Answer 1: The issue is probably with the thinc package; spacy-nightly
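As a quick diagnostic sketch (not from the truncated answer): a "wrong size, try recompiling" error usually means the installed thinc binary was built against a different spacy release, so checking that the two installed versions match is a sensible first step before reinstalling both into a clean venv.

import spacy
import thinc

print(spacy.__version__)  # e.g. 2.0.0a14 for this nightly
print(thinc.__version__)  # must be a version this nightly was compiled against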

Where to find a state-of-the-art relation extraction dataset

寵の児 submitted on 2019-12-12 17:13:39
Question: I am looking for a dataset that contains large quantities of relation tuples. For example, searching for "people" and "location" yields "lives in", "worked in", etc. University of Washington's OpenIE (http://OpenIE.cs.washington.edu) is a good tool, but its dataset is only accessible through the web. Where can I download a database or library like this?

Answer 1: OpenIE itself provides a large dataset of 11 GB for this purpose. Check this: http://knowitall.cs.washington.edu/paralex/ Although it is an

NLTK context-free grammars

倖福魔咒の submitted on 2019-12-12 17:04:50
Question: I am just wondering how you would add an optional element to a rule in the grammar used by:

>>> import nltk
>>> nltk.app.rdparser()

For example, the normal way to mark an element as optional is to put it in parentheses: NP -> NP (PP). But how would you do that in the program? Parentheses don't work.

S -> NP VP
NP -> NP PP | Det N
VP -> V NP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'man' | 'park' | 'dog' | 'boy' | 'girl'
V -> 'was' | 'saw'
P -> 'in' | 'under' | 'with'

Thanks, Ray

Answer 1: NP -> NP | NP PP

But note that, with this
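Picking up where the truncated answer leaves off: NLTK's CFG format has no (...)-style optionality, so an optional PP is written as two explicit alternatives. A minimal sketch using nltk.CFG.fromstring (the toy lexicon below is trimmed from the question's grammar):

import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N | Det N PP
VP -> V NP | V NP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'man' | 'park' | 'dog'
V -> 'saw'
P -> 'in'
""")

parser = nltk.ChartParser(grammar)
# The optional PP yields two parses: attachment to the NP or to the VP.
for tree in parser.parse("the man saw the dog in the park".split()):
    print(tree)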

Arabic text not showing in R

旧时模样 submitted on 2019-12-12 16:25:44
Question: I just started working with R in Arabic, as I plan to do text analysis and text mining with a Hadith corpus. I have been reading threads related to my question, but nevertheless I still can't manage to get the REAL basics here (sorry, absolute beginner). So, I entered:

textarabic.v <- scan("data/arabic-text.txt", encoding="UTF-8", what="character", sep="\n")

And what comes out in textarabic.v is, of course, symbols (pic). Prior to this, I saved my text in UTF-8 as I read in a thread, but still nothing

Grouping Similar Strings

随声附和 submitted on 2019-12-12 15:31:31
Question: I'm trying to analyze a bunch of search terms, so many that individually they don't tell much. That said, I'd like to group the terms because I think similar terms should have similar effectiveness. For example:

Term            Group
NBA Basketball  1
Basketball NBA  1
Basketball      1
Baseball        2

It's a contrived example, but hopefully it explains what I'm trying to do. So then, what is the best way to do what I've described? I thought NLTK may have something along those lines, but I'm only barely
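A minimal sketch of one possible approach (not taken from the truncated question): greedy grouping by Jaccard similarity over word sets, which puts "NBA Basketball", "Basketball NBA", and "Basketball" in one group and "Baseball" in another. The threshold value is an arbitrary assumption:

def jaccard(a, b):
    # Word-level Jaccard similarity of two strings.
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def group_terms(terms, threshold=0.3):
    # Assign each term the group of the first sufficiently similar
    # term already seen; otherwise start a new group.
    labels, next_id = {}, 1
    for term in terms:
        for seen, gid in list(labels.items()):
            if jaccard(term, seen) >= threshold:
                labels[term] = gid
                break
        else:
            labels[term], next_id = next_id, next_id + 1
    return labels

print(group_terms(['NBA Basketball', 'Basketball NBA', 'Basketball', 'Baseball']))
# {'NBA Basketball': 1, 'Basketball NBA': 1, 'Basketball': 1, 'Baseball': 2}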

Where to find an exhaustive list of stop words?

China☆狼群 submitted on 2019-12-12 15:03:05
Question: Where could I find an exhaustive list of stop words? The one I have is quite short and seems inapplicable to scientific texts. I am creating lexical chains to extract key topics from scientific papers. The problem is that words like "based", "regarding", etc. should also be considered stop words, as they do not deliver much meaning.

Answer 1: You can also easily add to existing stop word lists. E.g. use the one in the NLTK toolkit:

from nltk.corpus import stopwords

and then add whatever you
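Completing the truncated answer's thought with a short sketch: extend NLTK's bundled English list with the domain words the question mentions (the stopwords corpus must first be fetched once with nltk.download('stopwords')):

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words.update(['based', 'regarding'])  # the question's examples of weak scientific-prose words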

How to build POS-tagged corpus with NLTK?

穿精又带淫゛_ submitted on 2019-12-12 14:23:52
Question: I am trying to build a POS-tagged corpus from external .txt files for chunking and for entity and relation extraction. So far I have found a cumbersome multistep solution:

Read the files into a plain-text corpus:

from nltk.corpus.reader import PlaintextCorpusReader
my_corp = PlaintextCorpusReader(".", r".*\.txt")

Tag the corpus with the built-in Penn POS tagger:

my_tagged_corp = nltk.batch_pos_tag(my_corp.sents())

(By the way, at this point Python threw an error: NameError: name 'batch' is not defined)

Write
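The rest of the question is cut off above; as a hedged sketch of a likely fix for the tagging step, NLTK 3 replaced batch_pos_tag with pos_tag_sents, which tags a list of tokenized sentences in one call (it needs the tagger model, fetched once via nltk.download('averaged_perceptron_tagger')):

import nltk
from nltk.corpus.reader import PlaintextCorpusReader

my_corp = PlaintextCorpusReader(".", r".*\.txt")
tagged_sents = nltk.pos_tag_sents(my_corp.sents())  # a [(word, tag), ...] list per sentence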