text-mining

Text mining - extract name of band from unstructured text [closed]

两盒软妹~` submitted on 2019-12-11 03:30:09

Question: (Closed 4 years ago: this question needs to be more focused and is not currently accepting answers.) I'm aware that this is kind of a general, open-ended question. I'm essentially looking for help in deciding a way forward, and perhaps for some reading material. I'm working on an algorithm that does unstructured text mining and tries to extract something specific: the names of …

python luigi died unexpectedly with exit code -11

你离开我真会死。 submitted on 2019-12-10 20:45:38

Question: I have a data pipeline with luigi that works perfectly fine if I assign 1 worker to the task. However, with > 1 workers it dies (unexpectedly, with exit code -11) at a stage with 2 dependencies. The code is rather complex, so a minimal example would be difficult to give. The gist of the matter is that I am doing the following things with gensim: (1) building a dictionary from some texts; (2) building a corpus from said texts and the dictionary (requires (1)); (3) training an LDA model from the …

algorithm to extract simple sentences from complex (mixed) sentences?

你。 submitted on 2019-12-10 16:49:34

Question: Is there an algorithm that can be used to extract simple sentences from paragraphs? My ultimate goal is to later run another algorithm on the resulting simple sentences to determine the author's sentiment. I've researched this in sources such as Chae-Deug Park, but none discuss preparing simple sentences as training data. Thanks in advance. Answer 1: I have just used openNLP for the same. public static List<String> breakIntoSentencesOpenNlp(String paragraph) throws FileNotFoundException, IOException …
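The openNLP snippet in the answer is cut off above. For the R-centric questions elsewhere in this digest, a comparable sentence split can be sketched with the tokenizers package; this is an alternative approach, not the answer's Java code, and it assumes the package's default English sentence rules suit your text:

```r
# Minimal sketch with the 'tokenizers' package (an assumption; the
# original answer used Java openNLP instead).
library(tokenizers)

paragraph <- "I liked the film. However, the ending felt rushed!"

# Returns a list with one character vector of sentences per input string.
tokenize_sentences(paragraph)
#> [[1]]
#> [1] "I liked the film."                "However, the ending felt rushed!"
```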

Why am I getting this error? ValueError: chunk structures must contain tagged tokens or trees

给你一囗甜甜゛ submitted on 2019-12-10 16:47:03

Question: I've been tinkering with NLTK with the aim of extracting entities from some news articles, but I keep getting the error ValueError: chunk structures must contain tagged tokens or trees. Can anyone help me? import lxml.html import nltk, re, pprint def ie_preprocess(document): """Take raw text, break it down into sentences, then words, then run part-of-speech tagging.""" sentences = nltk.sent_tokenize(document) sentences = [nltk …

R tm: reloading a 'PCorpus' backend filehash database as corpus (e.g. in restarted session/script)

心不动则不痛 submitted on 2019-12-10 15:58:14

Question: Having learned loads from answers on this site (thanks!), it's finally time to ask my own question. I'm using R (the tm and lsa packages) to create, clean, simplify, and then run LSA (latent semantic analysis) on a corpus of about 15,000 text documents. I'm doing this in R 3.0.0 under Mac OS X 10.6. For efficiency (and to cope with having too little RAM), I've been trying to use either the 'PCorpus' option in tm (a database backend supported by the 'filehash' package) or the newer 'tm …
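A minimal sketch of the PCorpus setup in question, assuming a hypothetical 'corpus.db' file name. Re-attaching the database in a fresh session is exactly what the question asks, so the dbInit() step below only shows that filehash itself can reopen and inspect the file; treating that as a full corpus reload is an untested assumption, not a confirmed tm API:

```r
library(tm)
library(filehash)

# Create a PCorpus whose documents live in an on-disk filehash database.
docs <- VectorSource(c("first document", "second document"))
pc <- PCorpus(docs, dbControl = list(dbName = "corpus.db", dbType = "DB1"))

# In a restarted session, the underlying database can at least be reopened
# and inspected with filehash directly (assumption: keys map to documents).
db <- dbInit("corpus.db", type = "DB1")
dbList(db)
```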

Trouble with findAssocs from package tm

风格不统一 submitted on 2019-12-10 14:22:33

Question: (Migrated from Cross Validated 7 years ago.) I am attempting to find words associated with a particular word in a term-document matrix using the tm package. I am using findAssocs to do this. The arguments for findAssocs are: x, a term-document matrix; term, a character holding a term; corlimit, a numeric for the lower correlation bound limit. I am consistently getting numeric(0) as my result. Example: findAssocs(test.dtm, …
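For reference, a minimal runnable example against the 'crude' corpus that ships with tm. An empty numeric(0) result typically means no term clears corlimit: the limit is too high, the term's counts have no variance across documents, or the term (after preprocessing such as lowercasing) is not actually in the matrix.

```r
library(tm)

data("crude")  # 20 Reuters articles bundled with tm
dtm <- DocumentTermMatrix(crude)

# Terms correlated with "oil" at 0.7 or above; lowering corlimit is the
# first thing to try when the result comes back as numeric(0).
findAssocs(dtm, "oil", 0.7)
```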

How do you find the list of Wikidata (or Freebase or DBpedia) topics that a text is about?

你说的曾经没有我的故事 submitted on 2019-12-10 10:44:37

Question: I am looking for a solution to extract the list of concepts that a text (or HTML) document is about. I'd like the concepts to be Wikidata topics (or Freebase or DBpedia). For example, "Bad is a song by Mikael Jackson" should return Michael Jackson (the artist, Wikidata Q2831) and Bad (the song, Wikidata Q275422). As this example shows, the system should be robust to spelling mistakes (Mikael) and ambiguity (Bad). Ideally the system should work across multiple languages, and it should work both …
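One lightweight starting point, sketched in R (keeping with the rest of this digest), is Wikidata's public wbsearchentities endpoint. Note it does plain label/prefix matching, so the spelling tolerance ("Mikael" → "Michael") and sense disambiguation the question asks for would still need work on top:

```r
library(httr)
library(jsonlite)

# Look up candidate Wikidata entities for a surface string.
res <- GET("https://www.wikidata.org/w/api.php",
           query = list(action   = "wbsearchentities",
                        search   = "Michael Jackson",
                        language = "en",
                        format   = "json"))

hits <- fromJSON(content(res, as = "text", encoding = "UTF-8"))$search
hits[, c("id", "label", "description")]  # Q2831 should be among the ids
```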

how to read text in a table from a csv file

心已入冬 submitted on 2019-12-10 08:20:53

Question: I am new to the tm package. I want to read a CSV file, which contains one column with 2,000 texts and a second column with a yes/no factor variable, into a corpus. My intention is to convert the text into a matrix and use the factor variable as the target for prediction. I would also need to divide the corpus into train and test sets. I read several documents, such as tm.pdf, and found the documentation relatively limited. This is my attempt, following another thread on the same subject: TexTest< …
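A minimal sketch of that workflow, assuming a hypothetical file 'texts.csv' with columns text and label:

```r
library(tm)

# Hypothetical file and column names, for illustration only.
dat <- read.csv("texts.csv", stringsAsFactors = FALSE)

corp <- VCorpus(VectorSource(dat$text))
dtm  <- DocumentTermMatrix(corp)
m    <- as.matrix(dtm)          # one row per text, one column per term

# Simple 70/30 train/test split, keeping the yes/no target aligned.
set.seed(42)
idx <- sample(nrow(m), size = floor(0.7 * nrow(m)))

train_x <- m[idx, ];  train_y <- factor(dat$label[idx])
test_x  <- m[-idx, ]; test_y  <- factor(dat$label[-idx])
```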

R text mining - dealing with plurals

折月煮酒 submitted on 2019-12-09 23:53:36

Question: I'm learning text mining in R and have had pretty good success. But I am stuck on how to deal with plurals, i.e. I want "nation" and "nations" to be counted as the same word, and ideally "dictionary" and "dictionaries" to be counted as the same word. x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries" to be counted as the same word.' Answer 1: One possible solution. Here I use the pacman package to make the solution self-contained: if (!require( …
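The standard fix is stemming, which maps singular and plural to a common token; note the stem need not be a dictionary word ("dictionaries" becomes "dictionari", but both forms still count together). A minimal sketch with SnowballC, the stemmer tm's stemDocument() uses under the hood:

```r
library(SnowballC)

words <- c("nation", "nations", "dictionary", "dictionaries")

# Plural and singular collapse to the same stem token.
wordStem(words, language = "english")
#> [1] "nation"     "nation"     "dictionari" "dictionari"
```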

A lemmatizing function using a hash dictionary does not work with the tm package in R

佐手、 submitted on 2019-12-09 23:45:13

Question: I would like to lemmatize Polish text using a large external dictionary (in a format like the txt variable below). Unfortunately, the popular text mining packages do not offer a Polish option. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts. (I have also removed Polish diacritics from both the dictionary and the corpus.) Unfortunately, the function does not work with the corpus format generated by tm.