nlp

How can I enter data using a non-English (Bangla) language into this database table?

Question: How can I enter data using a non-English (Bangla) language into this database table? Answer 1: As pointed out by @Tim, you need to change the character set/collation of your table/database/column to UTF-8. First, check the collation of your database/table/column. How to check the collation of a database: SELECT default_character_set_name FROM information_schema.SCHEMATA WHERE schema_name = "YOUR_DATABASE_NAME"; How to check the collation of a table: SELECT CCSA.character_set_name FROM information
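As a rough illustration of the check-then-convert workflow the answer describes, here is a minimal Python sketch, assuming MySQL, the mysql-connector-python package, and placeholder credentials and table names (all hypothetical):

```python
import mysql.connector  # pip install mysql-connector-python

# placeholder connection details; replace with your own
conn = mysql.connector.connect(
    host="localhost", user="root", password="secret",
    database="YOUR_DATABASE_NAME", charset="utf8mb4",
)
cur = conn.cursor()

# check the database's default character set, as in the answer above
cur.execute(
    "SELECT default_character_set_name FROM information_schema.SCHEMATA "
    "WHERE schema_name = %s", ("YOUR_DATABASE_NAME",)
)
print(cur.fetchone())

# convert a (hypothetical) table so it can store Bangla or any other Unicode text
cur.execute(
    "ALTER TABLE your_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci"
)
conn.commit()
conn.close()
```

utf8mb4 is used rather than MySQL's 3-byte utf8, since it covers the full Unicode range.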

Extract only body text from arXiv articles formatted as .tex

Question: My dataset is composed of arXiv astrophysics articles as .tex files, and I need to extract only the text of the article body, not any other part of the article (e.g. tables, figures, abstract, title, footnotes, acknowledgements, citations, etc.). I've been trying with Python 3 and tex2py, but I'm struggling to get a clean corpus because the files differ in labeling and the text is broken up between labels. I have attached an SSCCE, a couple of sample LaTeX files and their PDFs, and the
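Not a tex2py answer, but one workable fallback is a plain regex pass over the source. The sketch below (function name and environment list are illustrative only) keeps what lies between \begin{document} and \end{document} and drops floats, display math, and the bibliography:

```python
import re

# environments to drop wholesale; the list is illustrative, not exhaustive
DROP_ENVS = ('abstract', 'table', 'table*', 'figure', 'figure*',
             'equation', 'equation*', 'align', 'align*', 'tabular', 'thebibliography')

def extract_body(tex_source):
    """Very rough extraction of running text from one LaTeX source string."""
    # keep only what lies between \begin{document} and \end{document}
    m = re.search(r'\\begin\{document\}(.*)\\end\{document\}', tex_source, re.S)
    body = m.group(1) if m else tex_source
    # strip comments: an unescaped % up to the end of the line
    body = re.sub(r'(?<!\\)%.*', '', body)
    # remove the listed environments together with their contents
    for env in DROP_ENVS:
        env_re = re.escape(env)
        body = re.sub(r'\\begin\{' + env_re + r'\}.*?\\end\{' + env_re + r'\}',
                      ' ', body, flags=re.S)
    # crude removal of remaining commands such as \cite{...}, \ref{...}, \section{...}
    # (this also eats simple arguments, so it is only a starting point)
    body = re.sub(r'\\[a-zA-Z]+\*?(?:\[[^\]]*\])?(?:\{[^{}]*\})?', ' ', body)
    return re.sub(r'\s+', ' ', body).strip()

if __name__ == "__main__":
    with open("sample_article.tex", encoding="utf8", errors="ignore") as f:  # hypothetical file
        print(extract_body(f.read())[:500])
```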

Multi-threaded NLP with spaCy pipe

Question: I'm trying to apply the spaCy NLP (Natural Language Processing) pipeline to a big text file such as a Wikipedia dump. Here is my code, based on spaCy's documentation example: from spacy.en import English input = open("big_file.txt") big_text= input.read() input.close() nlp= English() out = nlp.pipe([unicode(big_text, errors='ignore')], n_threads=-1) doc = out.next() spaCy applies all NLP operations like POS tagging, lemmatization, etc. all at once. It is like a pipeline for NLP that takes care of
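For reference, a minimal sketch of the same idea against a newer spaCy release (assumes spaCy ≥ 2.2 and the en_core_web_sm model are installed); it streams the dump line by line instead of passing one huge string, and uses n_process for parallelism:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def lines(path):
    """Yield non-empty lines so the whole dump never sits in memory at once."""
    with open(path, encoding="utf8", errors="ignore") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

# batch_size and n_process are tuning knobs; n_process requires spaCy >= 2.2
for doc in nlp.pipe(lines("big_file.txt"), batch_size=1000, n_process=4):
    for token in doc:
        _ = (token.pos_, token.lemma_)  # POS tags and lemmas are available per token
```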

Automatic semantic role labeling (ASRL) in Java (using FrameNet in Java)

Question: I have been looking for a long time for a way to do ASRL analysis in Java, and unfortunately the web offers very little support; it seems like all of the other SO questions relate to "which tools to use", but not to "how to use them". I want to create (preferably in Java) something exactly like this: http://demo.ark.cs.cmu.edu/parse, an algorithm that takes sentences as input and produces frames as output. I downloaded the related JAR files of mate-tools https://code.google.com/p/mate-tools/downloads/list and

How to write a spaCy Matcher for a POS regex

Question: spaCy has two features I'd like to combine: part-of-speech (POS) tagging and rule-based matching. How can I combine them in a neat way? For example, let's say the input is a single sentence and I'd like to verify that it meets some POS ordering condition, for example that the verb comes after the noun (something like a noun**verb regex). The result should be true or false. Is that doable, or is the matcher limited to patterns like the ones in the Rule-based matching example? Can it have POS rules? If not, here is my current plan: gather
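One way this can be done is to write the Matcher pattern over the POS attribute, with a wildcard token between the noun and the verb. A sketch, assuming spaCy v3's Matcher API and the en_core_web_sm model (the pattern and rule name are made up for the example):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")        # assumes the small English model is installed
matcher = Matcher(nlp.vocab)

# "a noun, then possibly other tokens, then a verb" -- roughly noun.*verb
pattern = [{"POS": "NOUN"}, {"OP": "*"}, {"POS": "VERB"}]
matcher.add("NOUN_THEN_VERB", [pattern])  # spaCy v3 signature; v2 uses matcher.add(name, None, pattern)

def noun_before_verb(sentence):
    doc = nlp(sentence)
    return len(matcher(doc)) > 0

print(noun_before_verb("The dog chased the cat."))  # expected True if 'dog' is tagged NOUN
```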

NLTK - nltk.tokenize.RegexpTokenizer - regex not working as expected

Question: I am trying to tokenize text using RegexpTokenizer. Code: from nltk.tokenize import RegexpTokenizer #from nltk.tokenize import word_tokenize line = "U.S.A Count U.S.A. Sec.of U.S. Name:Dr.John Doe J.Doe 1.11 1,000 10--20 10-20" pattern = '[\d|\.|\,]+|[A-Z][\.|A-Z]+\b[\.]*|[\w]+|\S' tokenizer = RegexpTokenizer(pattern) print tokenizer.tokenize(line) #print word_tokenize(line) Output: ['U', '.', 'S', '.', 'A', 'Count', 'U', '.', 'S', '.', 'A', '.', 'Sec', '.', 'of', 'U', '.', 'S', '.', 'Name',
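The truncated output above shows the abbreviations being split apart. One likely culprit: the pattern is a plain (non-raw) string, so \b is read as a backspace character, and the | characters inside the character classes only match literal pipes. A possible fix, sketched with an adjusted pattern (the exact alternatives may still need tuning for cases like "J.Doe"):

```python
from nltk.tokenize import RegexpTokenizer

line = "U.S.A Count U.S.A. Sec.of U.S. Name:Dr.John Doe J.Doe 1.11 1,000 10--20 10-20"

# raw string (so escapes like \b keep their regex meaning) and no '|' inside
# character classes, where it would only match a literal pipe character
pattern = r'[A-Z](?:\.[A-Z])+\.?|\d[\d.,]*|\w+|\S'
tokenizer = RegexpTokenizer(pattern)
print(tokenizer.tokenize(line))  # abbreviations such as 'U.S.A.' now come out as single tokens
```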

Fake reviews datasets

Question: There are datasets of ordinary email spam on the Internet, but I need datasets of fake reviews to conduct some research, and I can't find any. Can anybody give me advice on where fake-review datasets can be obtained? Answer 1: Our dataset is available on my Cornell homepage: http://www.cs.cornell.edu/~myleott/ Answer 2: A recent ACL paper, where the authors compiled such a dataset: Finding Deceptive Opinion Spam by Any Stretch of the Imagination Myle Ott, Yejin Choi, Claire Cardie, Jeffrey T.

Mallet topic modelling

Question: I have been using MALLET to infer topics for a text file containing 100,000 lines (around 34 MB in MALLET format). But now I need to run it on a file containing a million lines (around 180 MB), and I am getting a java.lang.OutOfMemoryError. Is there a way of splitting the file into smaller ones and building a model for the data present in all the files combined? Thanks in advance. Answer 1: In bin/mallet.bat, increase the value on this line: set MALLET_MEMORY=1G Answer 2: I'm not sure about

Multilingual spell checking with language detection

Question: I'm working on spell checking of mixed-language webpages and haven't been able to find any existing research on the subject. The aim is to automatically detect the language at sentence level within mixed-language webpages and spell-check each sentence against its appropriate language automatically. Assume that we can ignore sentences which mix multiple languages together (e.g. "He has a certain je ne sais quoi"), and assume webpages can't contain more than 2 or 3 languages. Trivial example (Welsh +
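A rough sentence-level sketch of that idea in Python, assuming the langdetect and pyenchant packages plus NLTK's punkt tokenizer data; the language-to-dictionary mapping and which dictionaries are actually installed (e.g. a Welsh hunspell dictionary) are assumptions:

```python
import enchant                        # pyenchant, backed by installed hunspell/aspell dictionaries
from langdetect import detect
from nltk.tokenize import sent_tokenize, word_tokenize

# hypothetical mapping from langdetect codes to locally installed dictionaries
DICTS = {"en": "en_GB", "cy": "cy_GB", "fr": "fr_FR"}

def spellcheck_page(text):
    """Detect each sentence's language, then spell-check it against that language's dictionary."""
    errors = []
    for sentence in sent_tokenize(text):
        lang = detect(sentence)
        if lang not in DICTS or not enchant.dict_exists(DICTS[lang]):
            continue                  # no dictionary for this language: skip the sentence
        d = enchant.Dict(DICTS[lang])
        for word in word_tokenize(sentence):
            if word.isalpha() and not d.check(word):
                errors.append((lang, word))
    return errors

print(spellcheck_page("This sentnce has a speling mistake. Ceci est une phrase française."))
```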