corpus

Corpus/data set of English words with syllabic stress information?

筅森魡賤 submitted on 2020-01-03 07:27:10
Question: I know this is a long shot, but does anyone know of a dataset of English words that has stress information by syllable? Something as simple as the following would be fantastic:

    AARD vark
    A ble
    a BOUT
    ac COUNT
    AC id
    ad DIC tion
    ad VERT ise ment
    ...

Answer 1: The closest thing I'm aware of is the CMU Pronouncing Dictionary. I don't think it explicitly marks the stressed syllable, but it should be a start.

Source: https://stackoverflow.com/questions/2839548/corpus-data-set-of-english-words-with-syllabic
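In fact, CMUdict does encode lexical stress: every vowel phoneme carries a digit (0 = unstressed, 1 = primary stress, 2 = secondary stress). A minimal sketch of pulling those digits out through NLTK's copy of the dictionary:

    import nltk
    from nltk.corpus import cmudict

    # nltk.download('cmudict')  # first run only
    pron = cmudict.dict()

    def stress_pattern(word):
        """Return the stress digits for the first listed pronunciation."""
        phones = pron[word.lower()][0]          # e.g. ['AH0', 'B', 'AW1', 'T']
        return [p[-1] for p in phones if p[-1].isdigit()]

    print(stress_pattern("about"))     # ['0', '1'] -> second syllable stressed
    print(stress_pattern("aardvark"))  # ['1', '2'] -> first syllable stressed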

nltk function to count occurrences of certain words

牧云@^-^@ submitted on 2020-01-02 03:45:08
Question: In the NLTK book there is the exercise "Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?" I thought I could use a function like state_union('1945-Truman.txt').count('men'). However, there are over 60 texts in this State of the Union corpus, and I feel like there has to be an easier way to see the count of these words for each one instead of …
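A minimal sketch of one way to do this, looping over every file the reader knows about (assuming the state_union data has been downloaded):

    import nltk
    from nltk.corpus import state_union

    # nltk.download('state_union')  # first run only
    targets = ("men", "women", "people")

    for fileid in state_union.fileids():
        words = [w.lower() for w in state_union.words(fileid)]
        counts = {t: words.count(t) for t in targets}
        print(fileid, counts)

An nltk.ConditionalFreqDist built over (target word, year) pairs is another idiomatic route if you want the trend over time plotted directly.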

Looking for dataset to test FULLTEXT style searches on [closed]

纵然是瞬间 submitted on 2019-12-30 08:50:50
Question: [Closed as off-topic 4 years ago.] I am looking for a corpus of text to run some trial fulltext-style data searches across: either something I can download, or a system that generates it. Something a bit more random would be better, e.g. 1,000,000 Wikipedia articles in a format that is easy to insert into a two-column database (id, text). Any ideas or …
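One way to stage arbitrary downloaded text into the (id, text) shape described above is SQLite's built-in FTS5 module; a hedged sketch, where the database, table, and sample strings are illustrative rather than from the question:

    import sqlite3

    con = sqlite3.connect("corpus.db")
    # FTS5 virtual table: the implicit rowid serves as the id column
    con.execute("CREATE VIRTUAL TABLE docs USING fts5(text)")

    articles = ["first article body ...", "second article body ..."]  # your corpus here
    con.executemany("INSERT INTO docs(text) VALUES (?)", [(a,) for a in articles])
    con.commit()

    # Trial fulltext query
    for rowid, text in con.execute(
            "SELECT rowid, text FROM docs WHERE docs MATCH ?", ("wikipedia",)):
        print(rowid, text[:80])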

R text mining documents from CSV file (one row per doc)

给你一囗甜甜゛ submitted on 2019-12-29 03:33:14
Question: I am trying to work with the tm package in R and have a CSV file of customer feedback, with each line being a different instance of feedback. I want to import all of this feedback into a corpus, but I want each line to be a separate document within the corpus so that I can compare the feedback in a DocumentTermMatrix. There are over 10,000 rows in my data set. Originally I did the following:

    fdbk_corpus <- Corpus(VectorSource(fdbk), readerControl = list(language="eng"), sep="\t")
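A minimal sketch of the usual pattern, assuming the CSV has a column named text (the file and column names are placeholders): VectorSource already treats each element of a character vector as its own document, so feeding it the column directly yields one document per row.

    library(tm)

    fdbk <- read.csv("feedback.csv", stringsAsFactors = FALSE)

    # One document per row of the 'text' column
    fdbk_corpus <- VCorpus(VectorSource(fdbk$text))

    dtm <- DocumentTermMatrix(fdbk_corpus)
    inspect(dtm[1:5, 1:10])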

Does anyone have a Categorized XML Corpus Reader for NLTK?

 ̄綄美尐妖づ submitted on 2019-12-24 04:33:08
Question: Has anyone written a categorized XML corpus reader for NLTK? I'm working with the Annotated NYTimes corpus, which is an XML corpus. I can read the files with XMLCorpusReader, but I'd like to use some of NLTK's category functionality. There's a nice tutorial for subclassing NLTK readers. I can go ahead and write this, but I was hoping to save some time if someone has already done it. If not, I'll post what I've written.

Answer 1: Here's a Categorized XML Corpus Reader for NLTK. It's based on this …
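For reference, the mixin pattern NLTK itself uses for CategorizedPlaintextCorpusReader carries over to the XML reader; a hedged sketch, with the paths and category pattern purely illustrative:

    from nltk.corpus.reader import CategorizedCorpusReader, XMLCorpusReader

    class CategorizedXMLCorpusReader(CategorizedCorpusReader, XMLCorpusReader):
        def __init__(self, *args, **kwargs):
            # CategorizedCorpusReader pops its own keyword arguments
            # (cat_pattern / cat_map / cat_file) before the XML reader runs
            CategorizedCorpusReader.__init__(self, kwargs)
            XMLCorpusReader.__init__(self, *args, **kwargs)

    # e.g. categories taken from the directory each file sits in
    reader = CategorizedXMLCorpusReader(
        '/corpus/nyt', r'.*\.xml', cat_pattern=r'(\w+)/.*')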

R - slowly working lapply with sort on ordered factor

痴心易碎 submitted on 2019-12-23 15:56:48
Question: Based on the question "More efficient means of creating a corpus and DTM", I've prepared my own method for building a Term Document Matrix from a large corpus which (I hope) does not require Terms x Documents memory.

    sparseTDM <- function(vc){
      id = unlist(lapply(vc, function(x){x$meta$id}))
      content = unlist(lapply(vc, function(x){x$content}))
      out = strsplit(content, "\\s", perl = T)
      names(out) = id
      lev.terms = sort(unique(unlist(out)))
      lev.docs = id
      v1 = lapply(
        out,
        function(x, lev) {
          sort(as…
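The usual culprit in this shape of code is coercing each document's tokens to an ordered factor and sorting that; a hedged alternative is to index into the precomputed term levels with match(), which returns plain integers and sorts much faster:

    # Sketch only: same 'out' and 'lev.terms' as in sparseTDM above
    v1 <- lapply(out, function(x, lev) sort(match(x, lev)), lev = lev.terms)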

Print first line of one element of Corpus in R using tm package

我的梦境 submitted on 2019-12-23 02:52:44
Question: How do you print a small sample, or the first line, of a corpus in R using the tm package? I have a very large corpus (> 1 GB) and am doing some text cleaning. I would like to test as I apply each cleaning procedure, so printing just the first line, or the first few lines, of the corpus would be ideal.

    # Load libraries
    library(tm)

    # Read in corpus
    corp <- SimpleCorpus(DirSource("C:/TextDocument"))

    # Remove punctuation
    corp <- removePunctuation(corp,
                              preserve_intra_word_contractions = TRUE,
                              preserve_intra…
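A hedged sketch of one way to peek at the first document between cleaning steps (assuming the corp object above):

    # First 200 characters of document 1
    substr(as.character(corp[[1]]), 1, 200)

    # Or only its first line
    strsplit(as.character(corp[[1]]), "\n")[[1]][1]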

How to read corpus of parsed sentences using NLTK in python?

旧街凉风 submitted on 2019-12-23 01:45:32
Question: I am working with the BLLIP 1987-89 WSJ Corpus Release 1 (https://catalog.ldc.upenn.edu/LDC2000T43). I am trying to use NLTK's SyntaxCorpusReader class to read in the parsed sentences, starting with a simple example of just one file. Here is my code:

    from nltk.corpus.reader import SyntaxCorpusReader

    path = '/corpus/wsj'
    filename = 'wsj1'
    reader = SyntaxCorpusReader('/corpus/wsj', 'wsj1')

I am able to see the raw text from the file. It returns a string of the parsed …
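Worth noting: SyntaxCorpusReader is an abstract base class whose block-reading hooks are left to subclasses, which is why only the raw text comes through. For Penn-Treebank-style bracketed parses such as the BLLIP release, the concrete BracketParseCorpusReader is usually the right tool; a hedged sketch with the same illustrative paths:

    from nltk.corpus.reader import BracketParseCorpusReader

    reader = BracketParseCorpusReader('/corpus/wsj', r'wsj1')
    for tree in reader.parsed_sents()[:3]:
        print(tree)   # one nltk.Tree per sentence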

Creating a subset of words from a corpus in R

偶尔善良 submitted on 2019-12-22 08:44:07
Question: I have a 1,500-row vector created from a Twitter search using the XML package. I have converted it to a corpus for use with the tm package. I ultimately want to create a wordcloud from some (the most frequent) of those words, so I converted it to a TermDocumentMatrix in order to find terms with a minimum frequency. I create the object a, which is a list of those terms:

    a <- findFreqTerms(mydata.dtm, 10)

The wordcloud package does not work on document matrices. So now, I want to …
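wordcloud() takes a vector of words and a parallel vector of frequencies rather than a matrix, so one hedged route is to collapse the matrix first (assuming mydata.dtm is the TermDocumentMatrix from the question, i.e. terms are rows):

    library(wordcloud)

    m <- as.matrix(mydata.dtm)
    freq <- sort(rowSums(m), decreasing = TRUE)
    freq <- freq[freq >= 10]   # same cutoff as findFreqTerms(mydata.dtm, 10)
    wordcloud(names(freq), freq, min.freq = 10)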