text-mining

R Text Mining with quanteda

I have a data set of Facebook posts (exported via Netvizz) and I am using the quanteda package in R. Here is my R code:

```r
# Load the relevant dictionary (relevant for analysis)
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")

# Read file
# Facebook posts can be generated by FB Netvizz:
# https://apps.facebook.com/netvizz
# Load FB posts as a .csv file from the .zip file
FB_com <- read.csv("D:/FB-com.csv", sep = ";")

# Define the relevant column(s)
fb_test <- as.character(FB_com$comment_message)  # one column with 2700 entries

# Define as corpus
fb_corp <- corpus(fb_test)
class(fb_corp)
# LIWC
```
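The post is cut off at this point, but a plausible continuation, sketched here under the assumption of a recent quanteda release (where dictionary lookup is applied to a document-feature matrix), would score the corpus against the LIWC categories:

```r
# Sketch (not the original poster's code): tokenize, build a dfm,
# then map features onto the LIWC dictionary categories
fb_toks <- tokens(fb_corp)
fb_dfm  <- dfm(fb_toks)
fb_liwc <- dfm_lookup(fb_dfm, dictionary = liwcdict)
topfeatures(fb_liwc)
```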

Calculate similarity between lists of words

I want to calculate the similarity between two lists of words. For example:

['email','user','this','email','address','customer']

is similar to this list:

['email','mail','address','netmail']

and I want it to get a higher similarity percentage than another list, for example ['address','ip','network'], even though 'address' exists in that list.

Since you haven't really been able to demonstrate a crystal-clear expected output, here is my best shot:

```python
list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']
```

In the above two lists, we will find the cosine similarity between
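The answer breaks off here. As a hedged illustration of the underlying idea (my own sketch in R, not the original answer's Python), cosine similarity between the term-frequency vectors of two word lists can be computed like this:

```r
# Cosine similarity between two word lists via term-frequency vectors;
# exact-match tokens only, so synonyms ("mail"/"netmail") would still
# need embeddings or a thesaurus such as WordNet to score highly
cosine_sim <- function(a, b) {
  vocab <- union(a, b)
  va <- as.numeric(table(factor(a, levels = vocab)))
  vb <- as.numeric(table(factor(b, levels = vocab)))
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

list_A <- c("email", "user", "this", "email", "address", "customer")
list_B <- c("email", "mail", "address", "netmail")
cosine_sim(list_A, list_B)
```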

A lemmatizing function using a hash dictionary does not work with tm package in R

I would like to lemmatize Polish text using a large external dictionary (format as in the txt variable below). Unfortunately, none of the popular text mining packages offer Polish as an option. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts. (I have also removed Polish diacritics from both the dictionary and the corpus.) The function works well with a vector of texts; unfortunately, it does not work with the corpus format generated by tm. Let me paste Dmitriy's code:

```r
library(hashmap)
library(data.table)
txt = "Abadan Abadanem Abadan
```
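The snippet is truncated here. For the tm side of the problem, the usual way to apply any plain character-vector-to-character-vector function across a tm corpus is to wrap it in content_transformer(); a minimal sketch, where lemmatize_text() is a hypothetical stand-in for the hashmap-based function from the linked answer:

```r
library(tm)

# Hypothetical stand-in for the hashmap lookup from the linked answer:
# takes a character vector, returns the lemmatized character vector
lemmatize_text <- function(x) {
  x  # ...dictionary lookup would go here...
}

corp <- VCorpus(VectorSource(c("Abadanem", "Abadanowi")))
corp <- tm_map(corp, content_transformer(lemmatize_text))
```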

Lemmatization using a txt file with lemmas in R

I would like to use an external txt file with Polish lemmas, structured as follows (a source of lemmas for many other languages: http://www.lexiconista.com/datasets/lemmatization/):

```
Abadan      Abadanem
Abadan      Abadanie
Abadan      Abadanowi
Abadan      Abadanu
abadańczyk  abadańczycy
abadańczyk  abadańczyka
abadańczyk  abadańczykach
abadańczyk  abadańczykami
abadańczyk  abadańczyki
abadańczyk  abadańczykiem
abadańczyk  abadańczykom
abadańczyk  abadańczyków
abadańczyk  abadańczykowi
abadańczyk  abadańczyku
abadanka    abadance
abadanka    abadanek
abadanka    abadanką
abadanka    abadankach
abadanka    abadankami
```

What packages and with
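The question is cut off, but a minimal sketch of one approach (assuming the file is tab-separated with the lemma in the first column and the inflected form in the second, as the Lexiconista datasets are laid out) would build a form-to-lemma lookup table:

```r
# Read the lemma dictionary and build a form -> lemma lookup vector;
# the filename "lemmatization-pl.txt" is assumed, not from the post
lemmas <- read.delim("lemmatization-pl.txt", header = FALSE,
                     col.names = c("lemma", "form"),
                     stringsAsFactors = FALSE, encoding = "UTF-8")
lookup <- setNames(lemmas$lemma, lemmas$form)

lemmatize <- function(tokens) {
  out <- lookup[tokens]
  out[is.na(out)] <- tokens[is.na(out)]  # keep unknown tokens as-is
  unname(out)
}

lemmatize(c("Abadanem", "abadance", "kot"))
# "Abadan" "abadanka" "kot"
```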

R text mining - dealing with plurals

I'm learning text mining in R and have had pretty good success, but I am stuck on how to deal with plurals. That is, I want "nation" and "nations" to be counted as the same word, and ideally "dictionary" and "dictionaries" to be counted as the same word.

x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries" to be counted as the same word.'

One possible solution. Here I use the pacman package to make the solution self-contained:

```r
if (!require("pacman")) install.packages("pacman"); library(pacman)
p_load_gh('hrbrmstr/pluralize')
p_load(quanteda)
x <- '
```
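The answer is truncated above. The gist, as I understand it (assuming the hrbrmstr/pluralize package, which exposes a singularize() function), is to singularize each token before counting:

```r
# Sketch: singularize tokens so "nations"/"nation" and
# "dictionaries"/"dictionary" collapse to one feature each
library(pluralize)
library(quanteda)

x <- '"nation" and "nations" ... "dictionary" and "dictionaries" ...'
toks <- tokens(x, remove_punct = TRUE)
toks <- as.tokens(lapply(toks, singularize))
dfm(toks)
```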

Does tm package itself provide a built-in way to combine document-term matrices?

Does the tm package itself provide a built-in way to combine document-term matrices? I generated four document-term matrices on the same corpus, one each for 1-, 2-, 3- and 4-grams. They are all really big, about 200k x 10k, so converting them into data frames and then cbinding those is out of the question. I know I could write a program that records the non-zero elements in each of the matrices and builds a sparse matrix, but that is a lot of trouble. It just seems natural for the tm package to provide this functionality, so if it does, I don't want to rebuild something that has already been built. If it doesn't, is there any handier
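The question breaks off here, but it is worth noting that a tm DocumentTermMatrix is stored as a slam simple_triplet_matrix, so slam's sparse cbind can join matrices over the same documents without ever densifying. A sketch (my suggestion, not from the post; note the result is a plain sparse matrix rather than a classed DTM, and row order must match):

```r
library(tm)
library(slam)

docs <- VCorpus(VectorSource(c("a b c", "b c d")))
dtm1 <- DocumentTermMatrix(docs)  # stand-in for the unigram matrix
dtm2 <- DocumentTermMatrix(docs)  # stand-in for the bigram matrix

# Sparse column-bind; documents (rows) must align across both matrices
combined <- cbind(dtm1, dtm2)
dim(combined)
```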

Counting syllables

I'm looking to assign some different readability scores to text in R, such as Flesch-Kincaid. Does anyone know of a way to segment words into syllables using R? I don't necessarily need the syllable segments themselves, but a count. So, for instance:

x <- c('dog', 'cat', 'pony', 'cracker', 'shoe', 'Popsicle')

would yield: 1, 1, 2, 2, 1, 3. Each number corresponding to the
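The post is truncated here. As a hedged starting point, here is my own base-R heuristic that counts vowel groups with a crude silent-"e" rule; dedicated packages backed by pronunciation dictionaries will be more accurate:

```r
# Naive syllable counter: count runs of vowels, subtract a trailing
# silent "e" unless the word ends in consonant + "le" (as in "Popsicle")
count_syllables <- function(words) {
  w <- tolower(words)
  n <- vapply(gregexpr("[aeiouy]+", w), function(m) sum(m > 0), integer(1))
  silent_e <- grepl("[^aeiouy]e$", w) & !grepl("[^aeiouy]le$", w)
  pmax(n - silent_e, 1)
}

count_syllables(c("dog", "cat", "pony", "cracker", "shoe", "Popsicle"))
# 1 1 2 2 1 3  -- matches the example, but English has many exceptions
```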

String matching to estimate similarity

I want to analyse a field of 100-character length and estimate a similarity percentage. For example, for the same question, "What's your opinion on smartphones?":

Person A: "Best way to waste money"
Person B: "Amazing stuff. Lets you stay connected all the time"
Person C: "Instrument to waste money and time"

Of these, just by matching individual words, A and C sound similar. I am trying to do something like this in R to start with, and later extend it to matching combinations of words like "Best", "Best way", "Best way waste", etc. I am a newbie to text analysis and R and could not get the proper naming of these
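The question is truncated. As one hedged starting point (my sketch, not from the post), word-level Jaccard similarity in base R already captures why A and C look alike; n-gram extensions could build on the same idea or on packages such as stringdist or quanteda:

```r
# Jaccard similarity over lowercased word sets:
# |intersection| / |union| of the distinct words in each answer
jaccard <- function(a, b) {
  ta <- unique(tolower(strsplit(a, "\\W+")[[1]]))
  tb <- unique(tolower(strsplit(b, "\\W+")[[1]]))
  length(intersect(ta, tb)) / length(union(ta, tb))
}

A <- "Best way to waste money"
B <- "Amazing stuff. lets you stay connected all the time"
C <- "Instrument to waste money and time"

jaccard(A, C)  # highest pair: shares "to", "waste", "money"
jaccard(A, B)  # no shared words -> 0
```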

Working with text classification and big sparse matrices in R

I'm working on a text multi-class classification project, and I need to build the document/term matrices and train and test in the R language. I already have datasets that don't fit within the limited dimensionality of the base matrix class in R, and I would need to build big sparse matrices to be able to classify, for example, 100k tweets. I am using the quanteda package, as it has so far been more useful and reliable than the tm package, where creating a DocumentTermMatrix with a dictionary makes the process incredibly memory-hungry even with small datasets. Currently, as I said, I use quanteda to build
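The post is truncated here. A common pattern for this workflow (a sketch under my own assumptions, pairing quanteda with glmnet, which accepts sparse Matrix input directly) looks like this:

```r
library(quanteda)
library(glmnet)  # penalized models that take sparse matrices natively

# Toy data; a real model needs far more documents than this
texts  <- c("great phone love it", "terrible waste of money",
            "love the battery life", "money wasted, terrible phone")
labels <- factor(c("pos", "neg", "pos", "neg"))

x <- dfm(tokens(texts))    # sparse document-feature matrix
x <- as(x, "dgCMatrix")    # coerce to a plain Matrix::dgCMatrix
fit <- glmnet(x, labels, family = "binomial")
```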

Parse GATE Document to get Co-Reference Text

I'm creating a GATE app which is used to find co-reference text. It works fine, and I have created a zipped file of the app with the export option provided in GATE. Now I'm trying to use it in my Java code:

```java
Gate.runInSandbox(true);
Gate.setGateHome(new File(gateHome));
Gate.setPluginsHome(new File(gateHome, "plugins"));
Gate.init();

URL applicationURL = new URL("file:" + new Path(gateHome, "application.xgapp").toString());
application = (CorpusController) PersistenceManager.loadObjectFromUrl(applicationURL);
corpus = Factory.newCorpus("Megaki Corpus");
application.setCorpus(corpus);
Document
```