text-mining

Text mining - extract name of band from unstructured text [closed]

两盒软妹~` submitted on 2019-12-11 03:30:09

Question: (Closed 4 years ago: this question needs to be more focused and is not currently accepting answers.) I'm aware that this is kind of a general, open-ended question. I'm essentially looking for help in deciding a way forward, and perhaps for some reading material. I'm working on an algorithm that does unstructured text mining and tries to extract something specific: the names of …

python luigi died unexpectedly with exit code -11

你离开我真会死。 submitted on 2019-12-10 20:45:38

Question: I have a data pipeline with luigi that works perfectly fine if I assign 1 worker to the task. However, with > 1 workers it dies (unexpectedly, with exit code -11) at a stage with 2 dependencies. The code is rather complex, so a minimal example would be difficult to give. The gist of the matter is that I am doing the following things with gensim: (1) building a dictionary from some texts; (2) building a corpus from said texts and the dictionary (requires (1)); (3) training an LDA model from the …

algorithm to extract simple sentences from complex (mixed) sentences?

你。 submitted on 2019-12-10 16:49:34

Question: Is there an algorithm that can be used to extract simple sentences from paragraphs? My ultimate goal is to later run another algorithm on the resulting simple sentences to determine the author's sentiment. I've researched this in sources such as Chae-Deug Park, but none discuss preparing simple sentences as training data. Thanks in advance. Answer 1: I have just used openNLP for the same. public static List<String> breakIntoSentencesOpenNlp(String paragraph) throws FileNotFoundException, IOException …
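The openNLP snippet in the answer is cut off above. For the R-centric questions elsewhere in this digest, a comparable sentence split can be sketched with the tokenizers package; this is an alternative approach, not the answer's Java code, and it assumes the package's default English sentence rules suit your text:

```r
# Minimal sketch with the 'tokenizers' package (an assumption; the
# original answer used Java openNLP instead).
library(tokenizers)

paragraph <- "I liked the film. However, the ending felt rushed!"

# Returns a list with one character vector of sentences per input string.
tokenize_sentences(paragraph)
#> [[1]]
#> [1] "I liked the film."                "However, the ending felt rushed!"
```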

Why am I getting this error? ValueError: chunk structures must contain tagged tokens or trees

给你一囗甜甜゛ submitted on 2019-12-10 16:47:03

Question: I've been tinkering with NLTK with the aim of extracting entities from some news articles, but I keep getting the error ValueError: chunk structures must contain tagged tokens or trees. Can anyone help me? import lxml.html import nltk, re, pprint def ie_preprocess(document): """Take raw text, break it down into sentences, then words, then run part-of-speech tagging.""" sentences = nltk.sent_tokenize(document) sentences = [nltk …

R tm: reloading a 'PCorpus' backend filehash database as corpus (e.g. in restarted session/script)

心不动则不痛 submitted on 2019-12-10 15:58:14

Question: Having learned loads from answers on this site (thanks!), it's finally time to ask my own question. I'm using R (the tm and lsa packages) to create, clean, simplify, and then run LSA (latent semantic analysis) on a corpus of about 15,000 text documents. I'm doing this in R 3.0.0 under Mac OS X 10.6. For efficiency (and to cope with having too little RAM), I've been trying to use either the 'PCorpus' option in tm (a database backend supported by the 'filehash' package) or the newer 'tm …
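A minimal sketch of the PCorpus setup in question, assuming a hypothetical 'corpus.db' file name. Re-attaching the database in a fresh session is exactly what the question asks, so the dbInit() step below only shows that filehash itself can reopen and inspect the file; treating that as a full corpus reload is an untested assumption, not a confirmed tm API:

```r
library(tm)
library(filehash)

# Create a PCorpus whose documents live in an on-disk filehash database.
docs <- VectorSource(c("first document", "second document"))
pc <- PCorpus(docs, dbControl = list(dbName = "corpus.db", dbType = "DB1"))

# In a restarted session, the underlying database can at least be reopened
# and inspected with filehash directly (assumption: keys map to documents).
db <- dbInit("corpus.db", type = "DB1")
dbList(db)
```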

Trouble with findAssocs from package tm

风格不统一 submitted on 2019-12-10 14:22:33

Question: (Migrated from Cross Validated 7 years ago.) I am attempting to find words associated with a particular word in a term-document matrix using the tm package. I am using findAssocs to do this. The arguments for findAssocs are: x, a term-document matrix; term, a character holding a term; corlimit, a numeric for the lower correlation bound limit. I am consistently getting numeric(0) as my result. Example: findAssocs(test.dtm, …
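For reference, a minimal runnable example against the 'crude' corpus that ships with tm. An empty numeric(0) result typically means no term clears corlimit: the limit is too high, the term's counts have no variance across documents, or the term (after preprocessing such as lowercasing) is not actually in the matrix.

```r
library(tm)

data("crude")  # 20 Reuters articles bundled with tm
dtm <- DocumentTermMatrix(crude)

# Terms correlated with "oil" at 0.7 or above; lowering corlimit is the
# first thing to try when the result comes back as numeric(0).
findAssocs(dtm, "oil", 0.7)
```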

How do you find the list of Wikidata (or Freebase or DBpedia) topics that a text is about?

你说的曾经没有我的故事 submitted on 2019-12-10 10:44:37

Question: I am looking for a solution to extract the list of concepts that a text (or HTML) document is about. I'd like the concepts to be Wikidata topics (or Freebase or DBpedia). For example, "Bad is a song by Mikael Jackson" should return Michael Jackson (the artist, Wikidata Q2831) and Bad (the song, Wikidata Q275422). As this example shows, the system should be robust to spelling mistakes (Mikael) and ambiguity (Bad). Ideally the system should work across multiple languages, and it should work both …
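One lightweight starting point, sketched in R (keeping with the rest of this digest), is Wikidata's public wbsearchentities endpoint. Note it does plain label/prefix matching, so the spelling tolerance ("Mikael" → "Michael") and sense disambiguation the question asks for would still need work on top:

```r
library(httr)
library(jsonlite)

# Look up candidate Wikidata entities for a surface string.
res <- GET("https://www.wikidata.org/w/api.php",
           query = list(action   = "wbsearchentities",
                        search   = "Michael Jackson",
                        language = "en",
                        format   = "json"))

hits <- fromJSON(content(res, as = "text", encoding = "UTF-8"))$search
hits[, c("id", "label", "description")]  # Q2831 should be among the ids
```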

how to read text in a table from a csv file

心已入冬 submitted on 2019-12-10 08:20:53

Question: I am new to the tm package. I want to read a CSV file, which contains one column with 2,000 texts and a second column with a yes/no factor variable, into a corpus. My intention is to convert the text into a matrix and use the factor variable as the target for prediction. I would also need to divide the corpus into train and test sets. I read several documents, such as tm.pdf, and found the documentation relatively limited. This is my attempt, following another thread on the same subject: TexTest< …
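A minimal sketch of that workflow, assuming a hypothetical file 'texts.csv' with columns text and label:

```r
library(tm)

# Hypothetical file and column names, for illustration only.
dat <- read.csv("texts.csv", stringsAsFactors = FALSE)

corp <- VCorpus(VectorSource(dat$text))
dtm  <- DocumentTermMatrix(corp)
m    <- as.matrix(dtm)          # one row per text, one column per term

# Simple 70/30 train/test split, keeping the yes/no target aligned.
set.seed(42)
idx <- sample(nrow(m), size = floor(0.7 * nrow(m)))

train_x <- m[idx, ];  train_y <- factor(dat$label[idx])
test_x  <- m[-idx, ]; test_y  <- factor(dat$label[-idx])
```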

R text mining - dealing with plurals

折月煮酒 submitted on 2019-12-09 23:53:36

Question: I'm learning text mining in R and have had pretty good success. But I am stuck on how to deal with plurals, i.e. I want "nation" and "nations" to be counted as the same word, and ideally "dictionary" and "dictionaries" to be counted as the same word. x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries" to be counted as the same word.' Answer 1: One possible solution. Here I use the pacman package to make the solution self-contained: if (!require( …
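The standard fix is stemming, which maps singular and plural to a common token; note the stem need not be a dictionary word ("dictionaries" becomes "dictionari", but both forms still count together). A minimal sketch with SnowballC, the stemmer tm's stemDocument() uses under the hood:

```r
library(SnowballC)

words <- c("nation", "nations", "dictionary", "dictionaries")

# Plural and singular collapse to the same stem token.
wordStem(words, language = "english")
#> [1] "nation"     "nation"     "dictionari" "dictionari"
```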

A lemmatizing function using a hash dictionary does not work with the tm package in R

佐手、 submitted on 2019-12-09 23:45:13

Question: I would like to lemmatize Polish text using a large external dictionary (in a format like the txt variable below). Unfortunately, the popular text mining packages do not offer a Polish option. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts. (I have also removed Polish diacritics from both the dictionary and the corpus.) Unfortunately, the function does not work with the corpus format generated by tm.