text-mining

R Text Mining with quanteda

I have a data set of Facebook posts (exported via Netvizz) and I am using the quanteda package in R. Here is my R code:

```r
# Load the relevant dictionary (relevant for analysis)
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")

# Read file
# Facebook posts can be generated by FB Netvizz:
# https://apps.facebook.com/netvizz
# Load FB posts as a .csv file from the .zip file
FB_com <- read.csv("D:/FB-com.csv", sep = ";")

# Define the relevant column(s)
fb_test <- as.character(FB_com$comment_message)  # one column with 2700 entries

# Define as corpus
fb_corp <- corpus(fb_test)
class(fb_corp)
# LIWC
```
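The post is cut off at this point, but a plausible continuation, sketched here under the assumption of a recent quanteda release (where dictionary lookup is applied to a document-feature matrix), would score the corpus against the LIWC categories:

```r
# Sketch (not the original poster's code): tokenize, build a dfm,
# then map features onto the LIWC dictionary categories
fb_toks <- tokens(fb_corp)
fb_dfm  <- dfm(fb_toks)
fb_liwc <- dfm_lookup(fb_dfm, dictionary = liwcdict)
topfeatures(fb_liwc)
```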

Calculate similarity between lists of words

I want to calculate the similarity between two lists of words. For example:

['email','user','this','email','address','customer']

is similar to this list:

['email','mail','address','netmail']

and I want it to get a higher similarity percentage than another list, for example ['address','ip','network'], even though 'address' exists in that list.

Since you haven't really been able to demonstrate a crystal-clear expected output, here is my best shot:

```python
list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']
```

In the above two lists, we will find the cosine similarity between
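The answer breaks off here. As a hedged illustration of the underlying idea (my own sketch in R, not the original answer's Python), cosine similarity between the term-frequency vectors of two word lists can be computed like this:

```r
# Cosine similarity between two word lists via term-frequency vectors;
# exact-match tokens only, so synonyms ("mail"/"netmail") would still
# need embeddings or a thesaurus such as WordNet to score highly
cosine_sim <- function(a, b) {
  vocab <- union(a, b)
  va <- as.numeric(table(factor(a, levels = vocab)))
  vb <- as.numeric(table(factor(b, levels = vocab)))
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

list_A <- c("email", "user", "this", "email", "address", "customer")
list_B <- c("email", "mail", "address", "netmail")
cosine_sim(list_A, list_B)
```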

A lemmatizing function using a hash dictionary does not work with tm package in R

I would like to lemmatize Polish text using a large external dictionary (format as in the txt variable below). Unfortunately, none of the popular text mining packages offer Polish as an option. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts. (I have also removed Polish diacritics from both the dictionary and the corpus.) The function works well with a vector of texts; unfortunately, it does not work with the corpus format generated by tm. Let me paste Dmitriy's code:

```r
library(hashmap)
library(data.table)
txt = "Abadan Abadanem Abadan
```
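The snippet is truncated here. For the tm side of the problem, the usual way to apply any plain character-vector-to-character-vector function across a tm corpus is to wrap it in content_transformer(); a minimal sketch, where lemmatize_text() is a hypothetical stand-in for the hashmap-based function from the linked answer:

```r
library(tm)

# Hypothetical stand-in for the hashmap lookup from the linked answer:
# takes a character vector, returns the lemmatized character vector
lemmatize_text <- function(x) {
  x  # ...dictionary lookup would go here...
}

corp <- VCorpus(VectorSource(c("Abadanem", "Abadanowi")))
corp <- tm_map(corp, content_transformer(lemmatize_text))
```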

Lemmatization using a txt file with lemmas in R

I would like to use an external txt file with Polish lemmas, structured as follows (a source of lemmas for many other languages: http://www.lexiconista.com/datasets/lemmatization/):

```
Abadan      Abadanem
Abadan      Abadanie
Abadan      Abadanowi
Abadan      Abadanu
abadańczyk  abadańczycy
abadańczyk  abadańczyka
abadańczyk  abadańczykach
abadańczyk  abadańczykami
abadańczyk  abadańczyki
abadańczyk  abadańczykiem
abadańczyk  abadańczykom
abadańczyk  abadańczyków
abadańczyk  abadańczykowi
abadańczyk  abadańczyku
abadanka    abadance
abadanka    abadanek
abadanka    abadanką
abadanka    abadankach
abadanka    abadankami
```

What packages and with
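The question is cut off, but a minimal sketch of one approach (assuming the file is tab-separated with the lemma in the first column and the inflected form in the second, as the Lexiconista datasets are laid out) would build a form-to-lemma lookup table:

```r
# Read the lemma dictionary and build a form -> lemma lookup vector;
# the filename "lemmatization-pl.txt" is assumed, not from the post
lemmas <- read.delim("lemmatization-pl.txt", header = FALSE,
                     col.names = c("lemma", "form"),
                     stringsAsFactors = FALSE, encoding = "UTF-8")
lookup <- setNames(lemmas$lemma, lemmas$form)

lemmatize <- function(tokens) {
  out <- lookup[tokens]
  out[is.na(out)] <- tokens[is.na(out)]  # keep unknown tokens as-is
  unname(out)
}

lemmatize(c("Abadanem", "abadance", "kot"))
# "Abadan" "abadanka" "kot"
```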

R text mining - dealing with plurals

I'm learning text mining in R and have had pretty good success, but I am stuck on how to deal with plurals. That is, I want "nation" and "nations" to be counted as the same word, and ideally "dictionary" and "dictionaries" to be counted as the same word.

x <- '"nation" and "nations" to be counted as the same word and ideally "dictionary" and "dictionaries" to be counted as the same word.'

One possible solution. Here I use the pacman package to make the solution self-contained:

```r
if (!require("pacman")) install.packages("pacman"); library(pacman)
p_load_gh('hrbrmstr/pluralize')
p_load(quanteda)
x <- '
```
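The answer is truncated above. The gist, as I understand it (assuming the hrbrmstr/pluralize package, which exposes a singularize() function), is to singularize each token before counting:

```r
# Sketch: singularize tokens so "nations"/"nation" and
# "dictionaries"/"dictionary" collapse to one feature each
library(pluralize)
library(quanteda)

x <- '"nation" and "nations" ... "dictionary" and "dictionaries" ...'
toks <- tokens(x, remove_punct = TRUE)
toks <- as.tokens(lapply(toks, singularize))
dfm(toks)
```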

Does tm package itself provide a built-in way to combine document-term matrices?

Does the tm package itself provide a built-in way to combine document-term matrices? I generated four document-term matrices on the same corpus, one each for 1-, 2-, 3- and 4-grams. They are all really big, about 200k x 10k, so converting them into data frames and then cbinding those is out of the question. I know I could write a program that records the non-zero elements in each of the matrices and builds a sparse matrix, but that is a lot of trouble. It just seems natural for the tm package to provide this functionality, so if it does, I don't want to rebuild something that has already been built. If it doesn't, is there any handier
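The question breaks off here, but it is worth noting that a tm DocumentTermMatrix is stored as a slam simple_triplet_matrix, so slam's sparse cbind can join matrices over the same documents without ever densifying. A sketch (my suggestion, not from the post; note the result is a plain sparse matrix rather than a classed DTM, and row order must match):

```r
library(tm)
library(slam)

docs <- VCorpus(VectorSource(c("a b c", "b c d")))
dtm1 <- DocumentTermMatrix(docs)  # stand-in for the unigram matrix
dtm2 <- DocumentTermMatrix(docs)  # stand-in for the bigram matrix

# Sparse column-bind; documents (rows) must align across both matrices
combined <- cbind(dtm1, dtm2)
dim(combined)
```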

Counting syllables

I'm looking to assign some different readability scores to text in R, such as Flesch-Kincaid. Does anyone know of a way to segment words into syllables using R? I don't necessarily need the syllable segments themselves, but a count. So, for instance:

x <- c('dog', 'cat', 'pony', 'cracker', 'shoe', 'Popsicle')

would yield: 1, 1, 2, 2, 1, 3. Each number corresponding to the
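The post is truncated here. As a hedged starting point, here is my own base-R heuristic that counts vowel groups with a crude silent-"e" rule; dedicated packages backed by pronunciation dictionaries will be more accurate:

```r
# Naive syllable counter: count runs of vowels, subtract a trailing
# silent "e" unless the word ends in consonant + "le" (as in "Popsicle")
count_syllables <- function(words) {
  w <- tolower(words)
  n <- vapply(gregexpr("[aeiouy]+", w), function(m) sum(m > 0), integer(1))
  silent_e <- grepl("[^aeiouy]e$", w) & !grepl("[^aeiouy]le$", w)
  pmax(n - silent_e, 1)
}

count_syllables(c("dog", "cat", "pony", "cracker", "shoe", "Popsicle"))
# 1 1 2 2 1 3  -- matches the example, but English has many exceptions
```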

String matching to estimate similarity

I want to analyse a field of 100-character length and estimate a similarity percentage. For example, for the same question, "What's your opinion on smartphones?":

Person A: "Best way to waste money"
Person B: "Amazing stuff. Lets you stay connected all the time"
Person C: "Instrument to waste money and time"

Of these, just by matching individual words, A and C sound similar. I am trying to do something like this in R to start with, and later extend it to matching combinations of words like "Best", "Best way", "Best way waste", etc. I am a newbie to text analysis and R and could not get the proper naming of these
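The question is truncated. As one hedged starting point (my sketch, not from the post), word-level Jaccard similarity in base R already captures why A and C look alike; n-gram extensions could build on the same idea or on packages such as stringdist or quanteda:

```r
# Jaccard similarity over lowercased word sets:
# |intersection| / |union| of the distinct words in each answer
jaccard <- function(a, b) {
  ta <- unique(tolower(strsplit(a, "\\W+")[[1]]))
  tb <- unique(tolower(strsplit(b, "\\W+")[[1]]))
  length(intersect(ta, tb)) / length(union(ta, tb))
}

A <- "Best way to waste money"
B <- "Amazing stuff. lets you stay connected all the time"
C <- "Instrument to waste money and time"

jaccard(A, C)  # highest pair: shares "to", "waste", "money"
jaccard(A, B)  # no shared words -> 0
```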

Working with text classification and big sparse matrices in R

I'm working on a text multi-class classification project, and I need to build the document/term matrices and train and test in the R language. I already have datasets that don't fit within the limited dimensionality of the base matrix class in R, and I would need to build big sparse matrices to be able to classify, for example, 100k tweets. I am using the quanteda package, as it has so far been more useful and reliable than the tm package, where creating a DocumentTermMatrix with a dictionary makes the process incredibly memory-hungry even with small datasets. Currently, as I said, I use quanteda to build
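The post is truncated here. A common pattern for this workflow (a sketch under my own assumptions, pairing quanteda with glmnet, which accepts sparse Matrix input directly) looks like this:

```r
library(quanteda)
library(glmnet)  # penalized models that take sparse matrices natively

# Toy data; a real model needs far more documents than this
texts  <- c("great phone love it", "terrible waste of money",
            "love the battery life", "money wasted, terrible phone")
labels <- factor(c("pos", "neg", "pos", "neg"))

x <- dfm(tokens(texts))    # sparse document-feature matrix
x <- as(x, "dgCMatrix")    # coerce to a plain Matrix::dgCMatrix
fit <- glmnet(x, labels, family = "binomial")
```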

Parse GATE Document to get Co-Reference Text

I'm creating a GATE app which is used to find co-reference text. It works fine, and I have created a zipped file of the app with the export option provided in GATE. Now I'm trying to use it in my Java code:

```java
Gate.runInSandbox(true);
Gate.setGateHome(new File(gateHome));
Gate.setPluginsHome(new File(gateHome, "plugins"));
Gate.init();

URL applicationURL = new URL("file:" + new Path(gateHome, "application.xgapp").toString());
application = (CorpusController) PersistenceManager.loadObjectFromUrl(applicationURL);
corpus = Factory.newCorpus("Megaki Corpus");
application.setCorpus(corpus);
Document
```