corpus

Fake reviews datasets

Submitted by 淺唱寂寞╮ on 2019-12-22 06:49:11
Question: There are datasets of ordinary email spam on the Internet, but I need datasets of fake reviews to conduct some research, and I can't find any. Can anybody give me advice on where fake reviews datasets can be obtained?

Answer 1: Our dataset is available on my Cornell homepage: http://www.cs.cornell.edu/~myleott/

Answer 2: A recent ACL paper, where the authors compiled such a data set: Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Myle Ott, Yejin Choi, Claire Cardie, Jeffrey T. Hancock. ACL 2011.

Using the first field in AWK as file name

Submitted by 冷暖自知 on 2019-12-21 17:01:17
Question: The dataset is one big file with three columns: an ID of a section, something irrelevant, and a line of text. An example could look like the following:

A01 001 This is a simple test.
A01 002 Just for exemplary purpose.
A01 003
A02 001 This is another text

I want to use the first column (in this example A01 and A02, which represent different texts) as the file name, whose content is everything in that line after the second column. The example above should result in two files, one with name A01 and one with name A02.
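
A minimal sketch in AWK, assuming whitespace-separated fields and an input file named dataset.txt (a hypothetical name): the first field becomes the output file name, and the first two fields are stripped from the line before it is written out.

awk '{ fname = $1; sub(/^[^ \t]+[ \t]+[^ \t]+[ \t]*/, ""); print > fname }' dataset.txt

With many distinct IDs this can hit the limit on open files; sorting the input by the first field and closing each output file when the ID changes avoids that.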

Using my own corpus for category classification in Python NLTK

Submitted by 主宰稳场 on 2019-12-20 14:09:36
Question: I'm an NLTK/Python beginner and managed to load my own corpus using CategorizedPlaintextCorpusReader, but how do I actually train on the data and use it to classify text?

>>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
>>> reader = CategorizedPlaintextCorpusReader('/ebs/category', r'.*\.txt', cat_pattern=r'(.*)\.txt')
>>> len(reader.categories())
234

Answer 1: Assuming you want a naive Bayes classifier with bag-of-words features:

from nltk import FreqDist
from nltk.classify …
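
The excerpt cuts off, but a minimal self-contained sketch along those lines could look like the following (the reader path and cat_pattern are taken from the question; the bag_of_words helper and the 80/20 split are assumptions):

import random

from nltk.classify import NaiveBayesClassifier, accuracy
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

reader = CategorizedPlaintextCorpusReader('/ebs/category', r'.*\.txt', cat_pattern=r'(.*)\.txt')

def bag_of_words(words):
    # presence features: each lowercased token maps to True
    return {w.lower(): True for w in words}

# one (featureset, label) pair per document
documents = [(bag_of_words(reader.words(fileid)), category)
             for category in reader.categories()
             for fileid in reader.fileids(category)]
random.shuffle(documents)

cut = int(0.8 * len(documents))
train_set, test_set = documents[:cut], documents[cut:]

classifier = NaiveBayesClassifier.train(train_set)
print(accuracy(classifier, test_set))
print(classifier.classify(bag_of_words("some new text to label".split())))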

NLP: Building (small) corpora, or “Where to get lots of not-too-specialized English-language text files?”

Submitted by 一笑奈何 on 2019-12-19 07:49:48
Question: Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a working prototype, and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of usenet movie reviews, which hadn't occurred to me, and is very good. For this particular program, technical usenet archives or programming mailing lists would tilt the results and be hard …

Classification using movie review corpus in NLTK/Python

Submitted by 大憨熊 on 2019-12-17 02:37:25
Question: I'm looking to do some classification in the vein of NLTK Chapter 6. The book seems to skip a step in creating the categories, and I'm not sure what I'm doing wrong. I have my script here with the response following. My issues primarily stem from the first part: category creation based upon directory names. Some other questions on here have used file names (e.g. pos_1.txt and neg_1.txt), but I would prefer to create directories I could dump files into.

from nltk.corpus import movie_reviews …
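
A hedged sketch of the directory-based layout the asker prefers, using NLTK's CategorizedPlaintextCorpusReader with the category taken from the directory name ('/path/to/corpus' and the pos/ and neg/ subdirectories are hypothetical):

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

reader = CategorizedPlaintextCorpusReader(
    '/path/to/corpus',            # hypothetical root containing pos/ and neg/ subdirectories
    r'(?:pos|neg)/.*\.txt',       # fileids are paths relative to the root
    cat_pattern=r'(pos|neg)/')    # the capture group (the directory name) becomes the category

print(reader.categories())        # ['neg', 'pos']
print(reader.fileids('pos')[:3])  # first few files filed under pos/

This mirrors how the bundled movie_reviews corpus is organized, so the classification code from Chapter 6 carries over unchanged.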

Using R for Text Mining Reuters-21578

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-13 06:24:36
Question: I am trying to do some work with the well-known Reuters-21578 dataset and am having some trouble loading the SGM files into my corpus. Right now I am using

require(tm)
reut21578 <- system.file("reuters21578", package = "tm")
reuters <- Corpus(DirSource(reut21578), readerControl = list(reader = readReut21578XML))

in an attempt to include all the files in my corpus, but this gives me the following error:

Error in DirSource(reut21578) : empty directory

Any idea where I may be …
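
A likely cause: tm does not ship the Reuters-21578 files, so system.file("reuters21578", package = "tm") returns an empty string and DirSource() sees an empty directory. A minimal sketch, assuming the .sgm files have been downloaded and unpacked locally (the path below is hypothetical) and a tm version that still provides readReut21578XMLasPlain (in newer versions these readers live in the tm.corpus.Reuters21578 package):

library(tm)
reut_dir <- "C:/data/reuters21578"   # hypothetical folder holding the unpacked *.sgm files
reuters <- VCorpus(DirSource(reut_dir, pattern = "\\.sgm$"),
                   readerControl = list(reader = readReut21578XMLasPlain))
inspect(reuters[1:2])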

Set encoding for reading text files into tm Corpora

Submitted by 家住魔仙堡 on 2019-12-12 03:45:42
Question: Loading a bunch of documents using a tm Corpus, I need to specify the encoding. All documents are UTF-8 encoded. If opened via a text editor the content is fine, but the corpus content is full of strange symbols (indicioâ., ‘sœs....). The source text is in Spanish (es_ES).

library(tm)
cname <- file.path("C:", "Users", "john", "Documents", "texts")
docs <- Corpus(DirSource(cname), encoding = "UTF-8")

> Error in Corpus(DirSource(cname), encoding = "UTF-8") : unused argument (encoding = "UTF-8")

EDITED: Getting str …
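
A minimal sketch of the usual fix: in current tm versions the encoding is an argument of the source, not of Corpus(), so it is passed to DirSource() (paths as in the question):

library(tm)
cname <- file.path("C:", "Users", "john", "Documents", "texts")
docs <- VCorpus(DirSource(cname, encoding = "UTF-8"),   # declare the encoding on the source
                readerControl = list(language = "es"))  # Spanish texts
inspect(docs[1])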

How to search a corpus to find frequency of a string?

Submitted by 风格不统一 on 2019-12-12 01:45:21
Question: I'm working on an NLP project and I'd like to search through a corpus of text to find the frequency of a given verb-object pair. The aim is to find which verb-object pair is most likely when given a few different possibilities. For example, given the strings "Swing the stick" and "Eat the stick", I would hope the corpus shows it is much more likely for someone to swing a stick than to eat one. I've been reading about n-grams and corpus linguistics, but I'm struggling to …
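
A hedged sketch of the n-gram idea in Python with NLTK, using the Brown corpus as a stand-in (an assumption; any tokenized corpus works) and counting the surface pattern "<verb> the <object>":

from collections import Counter

import nltk
from nltk.corpus import brown
from nltk.util import ngrams

nltk.download("brown", quiet=True)          # fetch the corpus on first use

words = [w.lower() for w in brown.words()]  # one flat, lowercased token stream
trigram_counts = Counter(ngrams(words, 3))

for verb, obj in [("swing", "stick"), ("eat", "stick")]:
    print(verb, obj, trigram_counts[(verb, "the", obj)])

On a corpus as small as Brown the raw counts may well be zero; larger n-gram collections, or counts over dependency-parsed verb-object relations, give more reliable comparisons.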

Sentiment analysis R syuzhet NRC Word-Emotion Association Lexicon

Submitted by 半城伤御伤魂 on 2019-12-11 16:39:13
Question: How do you find the words associated with the eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) of the NRC Word-Emotion Association Lexicon when using get_nrc_sentiment from the syuzhet package?

a <- c("I hate going to work it is dull", "I love going to work it is fun")
a_corpus = Corpus(VectorSource(a))
a_tm <- TermDocumentMatrix(a_corpus)
a_tmx <- as.matrix(a_tm)
a_df <- data.frame(text = unlist(sapply(a, `[`)), stringsAsFactors = F)
a_sent <- get_nrc…
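
The excerpt cuts off, but to see which words triggered which emotions one can match the text's tokens against the NRC table itself. A minimal sketch, assuming syuzhet exposes its bundled lexicon via get_sentiment_dictionary():

library(syuzhet)

a <- c("I hate going to work it is dull", "I love going to work it is fun")
nrc <- get_sentiment_dictionary("nrc", language = "english")  # word / sentiment / value table
tokens <- unique(get_tokens(paste(a, collapse = " ")))        # lowercased word tokens
matched <- nrc[nrc$word %in% tokens, c("word", "sentiment")]  # lexicon rows present in the text
print(matched)                                                # e.g. "hate" -> anger, disgust, ...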