text-mining

Text summarization in R language

为君一笑 submitted on 2021-02-07 04:17:01
Question: I have a long text file. Using the R language, I want to summarize the text in 10 to 20 lines, or in a few short sentences. How can I summarize a text in about 10 lines with R?

Answer 1: You may try this (from the LSAfun package):

genericSummary(D, k = 1)

where 'D' is your text document and 'k' is the number of sentences to be used in the summary. (Further options are described in the package documentation.) For more information: http://search.r-project.org/library/LSAfun/html
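LSAfun's genericSummary selects sentences in an LSA space; as a rough, hypothetical illustration of the general idea of extractive summarization (score sentences, keep the top k — this is *not* LSAfun's actual algorithm), here is a frequency-based Python sketch:

```python
import re
from collections import Counter

def summarize(text, k=2):
    """Pick the k sentences whose words are most frequent overall (naive extractive summary)."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    # Score each sentence by the total corpus frequency of its words.
    ranked = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in re.findall(r'\w+', s.lower())),
                    reverse=True)
    top = set(ranked[:k])
    # Return the chosen sentences in their original document order.
    return [s for s in sentences if s in top]

print(summarize("Cats sleep. Cats eat fish. Dogs bark.", k=1))  # → ['Cats eat fish.']
```

Replacing the frequency score with sentence similarity in a latent semantic space is what distinguishes the LSA-based approach from this toy version.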

Feature selection in document-feature matrix by using chi-squared test

送分小仙女□ submitted on 2021-02-06 12:50:43
Question: I am doing text mining using natural language processing. I used the quanteda package to generate a document-feature matrix (dfm). Now I want to do feature selection using a chi-squared test. I know many people have already asked this question; however, I couldn't find the relevant code for it. (The answers only give a brief concept, like this one: https://stats.stackexchange.com/questions/93101/how-can-i-perform-a-chi-square-test-to-do-feature-selection-in-r) I learned that I could use …
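In quanteda itself, textstat_keyness (now in the quanteda.textstats companion package) computes per-feature chi-squared scores directly on a dfm. The statistic underneath is just a 2×2 contingency-table chi-squared per term; a Python sketch of that computation, with made-up counts:

```python
def chi2_feature(a, b, c, d):
    """Chi-squared statistic for a 2x2 contingency table:
    a = class-1 docs containing the term, b = class-1 docs without it,
    c = class-2 docs containing the term, d = class-2 docs without it."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# A term present in 40/50 class-1 docs but only 5/50 class-2 docs scores high,
# i.e. it discriminates between the classes and is worth keeping.
print(round(chi2_feature(40, 10, 5, 45), 2))  # → 49.49
```

Ranking features by this score and keeping the top n is the feature-selection step the question asks about.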

How to abstract bigram topics instead of unigrams using Latent Dirichlet Allocation (LDA) in python- gensim?

荒凉一梦 submitted on 2021-02-06 09:26:09
Question: LDA original output (unigrams):

topic1: scuba, water, vapor, diving
topic2: dioxide, plants, green, carbon

Required output (bigram topics):

topic1: scuba diving, water vapor
topic2: green plants, carbon dioxide

Any idea?

Answer 1: Given a dict called docs containing lists of words from documents, you can turn each one into an array of words plus bigrams (or trigrams, etc.) using nltk.util.ngrams, or your own function, like this:

from nltk.util import ngrams
for doc in docs:
    docs[doc] = docs[doc] + ["_".join(w …
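The answer's snippet cuts off mid-expression; it appears to append underscore-joined bigrams to each document's token list so that LDA treats them as single terms. A stdlib-only completion of that idea (zip stands in for nltk.util.ngrams; the docs dict is a hypothetical example):

```python
docs = {"d1": ["scuba", "diving", "is", "fun"]}

for doc in docs:
    tokens = docs[doc]
    # Append underscore-joined bigrams; LDA then sees e.g. "scuba_diving" as one term.
    docs[doc] = tokens + ["_".join(pair) for pair in zip(tokens, tokens[1:])]

print(docs["d1"])
# → ['scuba', 'diving', 'is', 'fun', 'scuba_diving', 'diving_is', 'is_fun']
```

Feeding these augmented token lists to gensim's dictionary/corpus construction is then unchanged from the unigram workflow.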

Dataframe aggregation of n-gram, their frequency and associate the entries of other columns with it using R

人盡茶涼 submitted on 2021-01-29 15:48:18
Question: I am trying to aggregate a dataframe based on 1-gram frequency (this can be extended to n-grams by changing n in the code below) and associate the entries of the other columns with it. The way I did it is shown below. Are there any shortcuts/alternatives that produce the table shown at the very end of this question for the dataframe given below? The code and the results are shown below. The chunk below sets up the environment, loads the libraries and reads the dataframe:

# Clear variables in the working environment …
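The R code in the question is truncated, but the aggregation itself is language-agnostic: count each 1-gram's frequency and collect the row ids (and hence the other columns) associated with it. A Python sketch over hypothetical data:

```python
from collections import Counter, defaultdict

rows = [(1, "good service"), (2, "good food"), (3, "bad service")]  # (id, text)

freq = Counter()           # 1-gram -> total frequency
assoc = defaultdict(set)   # 1-gram -> ids of the rows that contain it

for row_id, text in rows:
    for token in text.split():   # swap in an n-gram tokenizer to extend to n-grams
        freq[token] += 1
        assoc[token].add(row_id)

print(freq["good"], sorted(assoc["service"]))  # → 2 [1, 3]
```

With the id sets in hand, joining back any other column of the original table is a plain lookup.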

Text mining with Python and pandas

六眼飞鱼酱① submitted on 2021-01-29 08:47:50
Question: This may be a duplicate, but I had no luck finding it... I am working on some text mining in Python with pandas. I have words in a DataFrame, with the Porter stem next to each word along with some other statistics. This means similar words sharing the exact same Porter stem can be found in this DataFrame. I would like to aggregate these similar words in a new column and then drop the duplicates with respect to the Porter stem.

import pandas as pd
pda = pd.DataFrame.from_dict({'Word': ['bank', 'hold', 'banking', 'holding …
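The snippet cuts off mid-dict; a runnable sketch of the aggregate-then-deduplicate step, assuming the Stem column has already been computed (e.g. with a Porter stemmer) and using hypothetical sample data:

```python
import pandas as pd

pda = pd.DataFrame({
    "Word": ["bank", "hold", "banking", "holding"],
    "Stem": ["bank", "hold", "bank", "hold"],
})

# Collect all words sharing a stem into one string; one row per stem,
# which implicitly drops the duplicates with respect to the Porter stem.
agg = (pda.groupby("Stem", as_index=False)
          .agg(Words=("Word", lambda s: ", ".join(s))))

print(agg.to_dict("records"))
# → [{'Stem': 'bank', 'Words': 'bank, banking'}, {'Stem': 'hold', 'Words': 'hold, holding'}]
```

Any other per-word statistics can be aggregated in the same .agg call (e.g. summed or averaged per stem).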

Breaking a paragraph into a vector of sentences in R

十年热恋 submitted on 2021-01-28 23:19:21
Question: I have the following paragraph:

Well, um...such a personal topic. No wonder I am the first to write a review. Suffice to say this stuff does just what they claim and tastes pleasant. And I had, well, major problems in this area and now I don't. 'Nuff said. :-)

For the purpose of applying the calculate_total_presence_sentiment command from the RSentiment package, I would like to break this paragraph into a vector of sentences, as follows:

[1] "Well, um...such a personal topic." [2] "No wonder I …
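In R this is typically done with a sentence tokenizer or a strsplit on a sentence-boundary regex. The regex idea translates directly; a Python sketch (note the "um...such" ellipsis survives only because no whitespace follows it, so a real tokenizer is safer on messy text):

```python
import re

paragraph = ("Well, um...such a personal topic. No wonder I am the first to write a review. "
             "Suffice to say this stuff does just what they claim and tastes pleasant.")

# Split after ., ! or ? when followed by whitespace and an uppercase letter.
sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)

print(sentences[0])  # → Well, um...such a personal topic.
```

Each element of the resulting list corresponds to one entry of the sentence vector the question asks for.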

Is it possible to use `kwic` function to find words near to each other?

你说的曾经没有我的故事 submitted on 2021-01-28 08:24:46
Question: I found this reference: https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9781449327453/ch05s07.html Is it possible to use it with the kwic function in the quanteda package to find documents in a corpus containing words that are not adjacent but close to each other, with perhaps a few other words in between? For example, if I give two words to the function, I would like to find the documents in the corpus where these two words occur, but possibly with some words between them …
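quanteda's kwic has a window argument for context around a match, but the "two words near each other" part is exactly the cookbook recipe the question links: a pattern allowing a bounded number of words between the two targets. A Python sketch of that regex (the gap limit of 3 is an arbitrary choice):

```python
import re

# "barriers" followed by "entry" with at most 3 words in between.
near = re.compile(r"\bbarriers\W+(?:\w+\W+){0,3}?entry\b", re.IGNORECASE)

print(bool(near.search("High barriers to entry protect incumbents.")))  # → True
print(bool(near.search("entry was easy; the barriers were gone.")))     # → False
```

The lazy quantifier {0,3}? tries the smallest gap first; widening the bound loosens the notion of "close to each other".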

Keyword in context (kwic) for skipgrams?

蓝咒 submitted on 2020-12-12 02:07:06
Question: I do keyword-in-context analysis with quanteda for ngrams and tokens, and it works well. I now want to do it for skipgrams: capture the context of "barriers to entry" but also "barriers to [...] [and] entry". The following code produces a kwic object which is empty, but I don't know what I did wrong. dcc.corpus refers to the text document. I also used the tokenized version, but nothing changes. The result is: "kwic object with 0 rows"

x <- tokens("barriers entry")
ntoken_test <- tokens_ngrams(x, n = 2, …
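In quanteda, tokens_ngrams has a skip argument for exactly this, and the generated skip-grams only match if the corpus side has been transformed the same way. Whatever the quanteda-specific fix, the skip-gram generation itself can be sketched in Python (stdlib only; nltk.util.skipgrams offers the same thing ready-made):

```python
def skip_bigrams(tokens, k):
    """All ordered pairs (w_i, w_j) with at most k tokens between them."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((w, tokens[j]))
    return pairs

print(skip_bigrams(["barriers", "to", "entry"], k=1))
# → [('barriers', 'to'), ('barriers', 'entry'), ('to', 'entry')]
```

The pair ('barriers', 'entry') is the skip-gram that lets a kwic-style search catch "barriers to [...] entry" as well as the contiguous phrase.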