corpus

Fake reviews datasets

Submitted by 淺唱寂寞╮ on 2019-12-22 06:49:11
Question: There are datasets of ordinary email spam on the Internet, but I need datasets of fake reviews to conduct some research, and I can't find any. Can anybody give me advice on where fake reviews datasets can be obtained?

Answer 1: Our dataset is available on my Cornell homepage: http://www.cs.cornell.edu/~myleott/

Answer 2: A recent ACL paper, where the authors compiled such a data set: Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Myle Ott, Yejin Choi, Claire Cardie, Jeffrey T. Hancock. ACL 2011.

Using the first field in AWK as file name

Submitted by 冷暖自知 on 2019-12-21 17:01:17
Question: The dataset is one big file with three columns: an ID of a section, something irrelevant, and a line of text. An example could look like the following:

A01 001 This is a simple test.
A01 002 Just for exemplary purpose.
A01 003
A02 001 This is another text

I want to use the first column (in this example A01 and A02, which represent different texts) as the file name, whose content is everything in that line after the second column. The example above should result in two files, one with name A01 and one with name A02.
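
A minimal sketch in AWK, assuming whitespace-separated fields and an input file named dataset.txt (a hypothetical name): the first field becomes the output file name, and the first two fields are stripped from the line before it is written out.

awk '{ fname = $1; sub(/^[^ \t]+[ \t]+[^ \t]+[ \t]*/, ""); print > fname }' dataset.txt

With many distinct IDs this can hit the limit on open files; sorting the input by the first field and closing each output file when the ID changes avoids that.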

Using my own corpus for category classification in Python NLTK

Submitted by 主宰稳场 on 2019-12-20 14:09:36
Question: I'm an NLTK/Python beginner and managed to load my own corpus using CategorizedPlaintextCorpusReader, but how do I actually train on the data and use it to classify text?

>>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
>>> reader = CategorizedPlaintextCorpusReader('/ebs/category', r'.*\.txt', cat_pattern=r'(.*)\.txt')
>>> len(reader.categories())
234

Answer 1: Assuming you want a naive Bayes classifier with bag-of-words features:

from nltk import FreqDist
from nltk.classify …
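
The excerpt cuts off, but a minimal self-contained sketch along those lines could look like the following (the reader path and cat_pattern are taken from the question; the bag_of_words helper and the 80/20 split are assumptions):

import random

from nltk.classify import NaiveBayesClassifier, accuracy
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

reader = CategorizedPlaintextCorpusReader('/ebs/category', r'.*\.txt', cat_pattern=r'(.*)\.txt')

def bag_of_words(words):
    # presence features: each lowercased token maps to True
    return {w.lower(): True for w in words}

# one (featureset, label) pair per document
documents = [(bag_of_words(reader.words(fileid)), category)
             for category in reader.categories()
             for fileid in reader.fileids(category)]
random.shuffle(documents)

cut = int(0.8 * len(documents))
train_set, test_set = documents[:cut], documents[cut:]

classifier = NaiveBayesClassifier.train(train_set)
print(accuracy(classifier, test_set))
print(classifier.classify(bag_of_words("some new text to label".split())))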

NLP: Building (small) corpora, or “Where to get lots of not-too-specialized English-language text files?”

Submitted by 一笑奈何 on 2019-12-19 07:49:48
Question: Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a working prototype, and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of usenet movie reviews, which hadn't occurred to me, and is very good. For this particular program, technical usenet archives or programming mailing lists would tilt the results and be hard …

Classification using movie review corpus in NLTK/Python

Submitted by 大憨熊 on 2019-12-17 02:37:25
Question: I'm looking to do some classification in the vein of NLTK Chapter 6. The book seems to skip a step in creating the categories, and I'm not sure what I'm doing wrong. I have my script here with the response following. My issues primarily stem from the first part: category creation based upon directory names. Some other questions on here have used file names (e.g. pos_1.txt and neg_1.txt), but I would prefer to create directories I could dump files into.

from nltk.corpus import movie_reviews …
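
A hedged sketch of the directory-based layout the asker prefers, using NLTK's CategorizedPlaintextCorpusReader with the category taken from the directory name ('/path/to/corpus' and the pos/ and neg/ subdirectories are hypothetical):

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

reader = CategorizedPlaintextCorpusReader(
    '/path/to/corpus',            # hypothetical root containing pos/ and neg/ subdirectories
    r'(?:pos|neg)/.*\.txt',       # fileids are paths relative to the root
    cat_pattern=r'(pos|neg)/')    # the capture group (the directory name) becomes the category

print(reader.categories())        # ['neg', 'pos']
print(reader.fileids('pos')[:3])  # first few files filed under pos/

This mirrors how the bundled movie_reviews corpus is organized, so the classification code from Chapter 6 carries over unchanged.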

Using R for Text Mining Reuters-21578

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-13 06:24:36
Question: I am trying to do some work with the well-known Reuters-21578 dataset and am having some trouble loading the SGM files into my corpus. Right now I am using

require(tm)
reut21578 <- system.file("reuters21578", package = "tm")
reuters <- Corpus(DirSource(reut21578), readerControl = list(reader = readReut21578XML))

in an attempt to include all the files in my corpus, but this gives me the following error:

Error in DirSource(reut21578) : empty directory

Any idea where I may be …
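
A likely cause: tm does not ship the Reuters-21578 files, so system.file("reuters21578", package = "tm") returns an empty string and DirSource() sees an empty directory. A minimal sketch, assuming the .sgm files have been downloaded and unpacked locally (the path below is hypothetical) and a tm version that still provides readReut21578XMLasPlain (in newer versions these readers live in the tm.corpus.Reuters21578 package):

library(tm)
reut_dir <- "C:/data/reuters21578"   # hypothetical folder holding the unpacked *.sgm files
reuters <- VCorpus(DirSource(reut_dir, pattern = "\\.sgm$"),
                   readerControl = list(reader = readReut21578XMLasPlain))
inspect(reuters[1:2])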

Set encoding for reading text files into tm Corpora

Submitted by 家住魔仙堡 on 2019-12-12 03:45:42
Question: Loading a bunch of documents using a tm Corpus, I need to specify the encoding. All documents are UTF-8 encoded. If opened via a text editor the content is fine, but the corpus content is full of strange symbols (indicioâ., ‘sœs....). The source text is in Spanish (es_ES).

library(tm)
cname <- file.path("C:", "Users", "john", "Documents", "texts")
docs <- Corpus(DirSource(cname), encoding = "UTF-8")

> Error in Corpus(DirSource(cname), encoding = "UTF-8") : unused argument (encoding = "UTF-8")

EDITED: Getting str …
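
A minimal sketch of the usual fix: in current tm versions the encoding is an argument of the source, not of Corpus(), so it is passed to DirSource() (paths as in the question):

library(tm)
cname <- file.path("C:", "Users", "john", "Documents", "texts")
docs <- VCorpus(DirSource(cname, encoding = "UTF-8"),   # declare the encoding on the source
                readerControl = list(language = "es"))  # Spanish texts
inspect(docs[1])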

How to search a corpus to find frequency of a string?

Submitted by 风格不统一 on 2019-12-12 01:45:21
Question: I'm working on an NLP project and I'd like to search through a corpus of text to find the frequency of a given verb-object pair. The aim is to find which verb-object pair is most likely when given a few different possibilities. For example, given the strings "Swing the stick" and "Eat the stick", I would hope the corpus shows it is much more likely for someone to swing a stick than to eat one. I've been reading about n-grams and corpus linguistics, but I'm struggling to …
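
A hedged sketch of the n-gram idea in Python with NLTK, using the Brown corpus as a stand-in (an assumption; any tokenized corpus works) and counting the surface pattern "<verb> the <object>":

from collections import Counter

import nltk
from nltk.corpus import brown
from nltk.util import ngrams

nltk.download("brown", quiet=True)          # fetch the corpus on first use

words = [w.lower() for w in brown.words()]  # one flat, lowercased token stream
trigram_counts = Counter(ngrams(words, 3))

for verb, obj in [("swing", "stick"), ("eat", "stick")]:
    print(verb, obj, trigram_counts[(verb, "the", obj)])

On a corpus as small as Brown the raw counts may well be zero; larger n-gram collections, or counts over dependency-parsed verb-object relations, give more reliable comparisons.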

Sentiment analysis R syuzhet NRC Word-Emotion Association Lexicon

Submitted by 半城伤御伤魂 on 2019-12-11 16:39:13
Question: How do you find the words associated with the eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) of the NRC Word-Emotion Association Lexicon when using get_nrc_sentiment from the syuzhet package?

a <- c("I hate going to work it is dull", "I love going to work it is fun")
a_corpus = Corpus(VectorSource(a))
a_tm <- TermDocumentMatrix(a_corpus)
a_tmx <- as.matrix(a_tm)
a_df <- data.frame(text = unlist(sapply(a, `[`)), stringsAsFactors = F)
a_sent <- get_nrc…
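
The excerpt cuts off, but to see which words triggered which emotions one can match the text's tokens against the NRC table itself. A minimal sketch, assuming syuzhet exposes its bundled lexicon via get_sentiment_dictionary():

library(syuzhet)

a <- c("I hate going to work it is dull", "I love going to work it is fun")
nrc <- get_sentiment_dictionary("nrc", language = "english")  # word / sentiment / value table
tokens <- unique(get_tokens(paste(a, collapse = " ")))        # lowercased word tokens
matched <- nrc[nrc$word %in% tokens, c("word", "sentiment")]  # lexicon rows present in the text
print(matched)                                                # e.g. "hate" -> anger, disgust, ...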