word-frequency

Word frequency algorithm for natural language processing

我只是一个虾纸丫 submitted on 2019-11-28 15:07:04
Without getting a degree in information retrieval, I'd like to know whether there exist any algorithms for counting the frequency with which words occur in a given body of text. The goal is to get a "general feel" for what people are saying over a set of textual comments, along the lines of Wordle. What I'd like: ignore articles, pronouns, etc. ('a', 'an', 'the', 'him', 'them', etc.); preserve proper nouns; ignore hyphenation, except for the soft kind. Reaching for the stars, these would be peachy: handling stemming and plurals (e.g. like, likes, liked, liking match to the same result); grouping of adjectives (adverbs, …
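
The core of what this question asks for can be sketched without any IR background: tokenize, drop stop words, crudely normalize inflections, and count. This is a minimal sketch, assuming a hypothetical tiny stop-word list and naive suffix stripping; a real pipeline would use a proper stemmer (e.g. NLTK's PorterStemmer) and a fuller stop-word list.

```python
from collections import Counter
import re

# Hypothetical minimal stop-word list for illustration only.
STOPWORDS = {"a", "an", "the", "him", "them", "he", "she", "it", "and", "or"}

def word_frequencies(text):
    # Lowercase and pull out word-like tokens (letters and apostrophes).
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for w in words:
        if w in STOPWORDS:
            continue
        # Very crude stemming: strip a common suffix once, if the stem
        # stays reasonably long. Imperfect (e.g. "likes" -> "lik"), but
        # it collapses plurals and simple inflections as asked.
        for suffix in ("ing", "ed", "es", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[: -len(suffix)]
                break
        counts[w] += 1
    return counts

freqs = word_frequencies("The cat likes cats; he liked a cat.")
```

Note the trade-off: suffix stripping over-merges some words, which is why serious use calls for a real stemmer or lemmatizer rather than this sketch.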

Efficiently calculate word frequency in a string

◇◆丶佛笑我妖孽 submitted on 2019-11-28 03:13:33
Question: I am parsing a long string of text and calculating the number of times each word occurs in Python. I have a function that works, but I am looking for advice on whether there are ways I can make it more efficient (in terms of speed), and whether there are Python library functions that could do this for me, so I'm not reinventing the wheel. Can you suggest a more efficient way to calculate the most common words that occur in a long string (usually over 1,000 words)? Also, what's the …
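
The standard-library answer to this question is collections.Counter, which counts items in a single O(n) pass and exposes most_common() for retrieving the top entries already sorted by frequency:

```python
from collections import Counter

# Counter does the hash-based counting in one pass; most_common(k)
# returns the k highest-frequency (word, count) pairs.
text = "the quick brown fox jumps over the lazy dog the fox"
counts = Counter(text.split())
top_two = counts.most_common(2)   # [('the', 3), ('fox', 2)]
```

For real text you would normally lowercase and strip punctuation before splitting, but Counter itself is the piece that avoids reinventing the wheel here.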

Count word frequency in a text? [duplicate]

风流意气都作罢 submitted on 2019-11-27 16:35:07
Question: Possible duplicate: "php: sort and count instances of words in a given string". I am looking to write a PHP function which takes a string as input, splits it into words, and then returns an array of words sorted by the frequency of occurrence of each word. What's the most algorithmically efficient way of accomplishing this? Answer 1: Your best bets are these: str_word_count — return information about words used in a string; array_count_values — count all the values of an array. Example: $words = 'A …
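
The PHP example above is truncated, but the shape of the answer (extract words, count values, sort by count) is the same in any language. For consistency with the other questions on this page, here is the equivalent sketched in Python rather than PHP; the sample string is only an illustration:

```python
from collections import Counter
import re

# Mirrors the str_word_count + array_count_values + sort recipe:
# split into word tokens, count them, return pairs sorted by
# descending frequency.
def sorted_word_counts(s):
    words = re.findall(r"[A-Za-z']+", s)
    return Counter(words).most_common()   # most frequent first

pairs = sorted_word_counts("A rose is a rose is a rose")
```

As in the PHP version, counting and sorting are each a single library call; the overall cost is dominated by the sort at O(m log m) over m distinct words.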

list of word frequencies using R

匆匆过客 submitted on 2019-11-27 12:09:50
I have been using the tm package to run some text analysis. My problem is with creating a list of words and their associated frequencies from it:

library(tm)
library(RWeka)
txt <- read.csv("HW.csv", header = T)
df <- do.call("rbind", lapply(txt, as.data.frame))
names(df) <- "text"
myCorpus <- Corpus(VectorSource(df$text))
myStopwords <- c(stopwords('english'), "originally", "posted")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
# building the TDM
btm <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
myTdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = btm) …
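
Language aside, the operation this R pipeline is building toward (collapsing a term-document matrix into a single word-to-frequency list, which tm users typically get via rowSums on the TDM) can be sketched in Python, the language used elsewhere on this page. The documents and stop words here are hypothetical placeholders:

```python
from collections import Counter

# Count terms per document, then sum across documents to get one
# word -> total frequency table, sorted by frequency. This is the
# same collapse a TDM + rowSums performs.
docs = ["the cat sat", "the cat ran", "a dog ran"]
stopwords = {"the", "a"}

total = Counter()
for doc in docs:
    total.update(w for w in doc.split() if w not in stopwords)

freq_list = total.most_common()
```

The R-specific part of the question (n-gram tokenization via RWeka) is orthogonal to this step: whatever the tokens are, unigrams or trigrams, the frequency list is just a sum over the per-document counts.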

The Most Efficient Way To Find Top K Frequent Words In A Big Word Sequence

纵饮孤独 submitted on 2019-11-26 10:06:24
Input: a positive integer K and a big text. The text can be viewed as a word sequence, so we don't have to worry about how to break it down into words. Output: the K most frequent words in the text. My thinking is like this: (1) use a hash table to record every word's frequency while traversing the whole word sequence; in this phase the key is "word" and the value is "word frequency", and this takes O(n) time; (2) sort the (word, word-frequency) pairs with "word-frequency" as the key, which takes O(n*lg(n)) time with a normal sorting algorithm. After sorting, we just take the first K words. This …
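
The two phases described above can be sketched directly, with one improvement: instead of fully sorting all m distinct words in O(m log m), a size-K heap selection gives O(m log K), which matters when K is much smaller than the vocabulary. heapq.nlargest implements exactly that selection:

```python
from collections import Counter
import heapq

def top_k_frequent(words, k):
    # Phase 1: hash-table counting, O(n) over the word sequence.
    counts = Counter(words)
    # Phase 2: select the k largest by count with a bounded heap,
    # O(m log k) over m distinct words, instead of a full sort.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

words = "a b a c a b d".split()
print(top_k_frequent(words, 2))   # [('a', 3), ('b', 2)]
```

For truly huge inputs the usual follow-up techniques are hashing words across machines (so each word lands on one node) or approximate counting, but the count-then-select skeleton stays the same.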

Sorted Word frequency count using python

£可爱£侵袭症+ submitted on 2019-11-26 08:08:46
Question: I have to count the word frequency in a text using Python. I thought of keeping the words in a dictionary and having a count for each of them. Now, if I have to sort the words according to number of occurrences, can I do it with the same dictionary, instead of using a new dictionary which has the count as the key and arrays of words as the values? Answer 1: You can use the same dictionary: >>> d = { "foo": 4, "bar": 2, "quux": 3 } >>> sorted(d.items(), key=lambda item: item[1]) The second line returns: [('bar', 2), ('quux', 3), ('foo', 4)]
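
Since "sorted by number of occurrences" usually means most frequent first, it is worth noting the same one-liner takes reverse=True; no second count-to-words dictionary is needed in either direction:

```python
# Sort the same dictionary's items by count, descending.
d = {"foo": 4, "bar": 2, "quux": 3}
by_freq = sorted(d.items(), key=lambda item: item[1], reverse=True)
print(by_freq)   # [('foo', 4), ('quux', 3), ('bar', 2)]
```

sorted() returns a list of (word, count) tuples and leaves the original dictionary untouched, so the counting structure and the sorted view can coexist.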