word-frequency | 易学教程

Count number of times a word-wildcard appears in text (in R)

Question: I have a vector of either regular words ("activated") or wildcard words ("activat*"). I want to: 1) Count the number of times each word appears in a given text (i.e., if "activated" appears in text, "activated" frequency would be 1). 2) Count the number of times each word wildcard appears in a text (i.e., if "activated" and "activation" appear in text, "activat*" frequency would be 2). I'm able to achieve (1), but not (2). Can anyone please help? Thanks. library(tm) library(qdap) text <-
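The question is about R. As a rough illustration of the same idea in Python, the sketch below treats a trailing `*` as "any word starting with this stem" and counts whole-word matches otherwise. The function name `count_patterns` and the sample sentence are made up for illustration, and tokenizing on letters and apostrophes is an assumption, not something the question specifies.

```python
import re

def count_patterns(text, patterns):
    """Count matches per pattern; a trailing '*' matches any word with that prefix."""
    # Assumption: words are runs of ASCII letters/apostrophes, compared case-insensitively.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = {}
    for pattern in patterns:
        if pattern.endswith("*"):
            stem = pattern[:-1].lower()
            counts[pattern] = sum(1 for tok in tokens if tok.startswith(stem))
        else:
            counts[pattern] = tokens.count(pattern.lower())
    return counts

# Hypothetical sample text, not taken from the question.
sample_text = "The enzyme was activated, and activation continued after the activated state."
result = count_patterns(sample_text, ["activated", "activat*"])
print(result)
```

Note that this counts every occurrence, so a text containing "activated" twice and "activation" once gives 3 for "activat*". If distinct matching words are wanted instead, counting over `set(tokens)` would be one option.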

How to sort the words by their frequency

Question: I take an input text file, convert it to an array, sort the array, and then get the frequencies of each word. I can't figure out how to sort them according to their frequencies, from highest to lowest, without importing lots of things (which is what I am trying to do): //find frequencies int count = 0; List<String> list = new ArrayList<>(); for(String s:words){ if(!list.contains(s)){ list.add(s); } } for(int i=0;i<list.size();i++){ for(int j=0;j<words.length;j++){ if(list.get(i).equals(words
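The question uses Java. For comparison, here is a minimal Python sketch of the same task: count each word once with a hash-based counter instead of the nested loops, then sort the (word, count) pairs by descending count. The helper name `words_by_frequency`, the sample list, and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter

def words_by_frequency(words):
    """Return (word, count) pairs, highest count first; ties are broken alphabetically."""
    counts = Counter(words)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical input standing in for the words read from the file.
words = ["b", "a", "c", "a", "b", "a"]
ranked = words_by_frequency(words)
for word, count in ranked:
    print(word, count)
```

Counting with a dictionary takes one pass over the words, whereas the `list.contains` plus nested-loop approach in the excerpt grows quadratically with the number of words.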

Counting Word Frequency (most significant words) in a String, excluding keywords

Question: I would like to count the frequency of words (excluding some keywords) in a string and sort them DESC. So, how can I do it? In the following string... This is stackoverflow. I repeat stackoverflow. Where the excluding keywords are ExKeywords() ={"i","is"} the output should be like stackoverflow repeat this P.S. NO! I am not re-designing google! :) Answer 1: string input = "This is stackoverflow. I repeat stackoverflow."; string[] keywords = new[] {"i", "is"}; Regex regex = new Regex("\\w+");
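The answer excerpt above is C# and is cut off. The Python sketch below illustrates the same approach under stated assumptions: words are matched with `\w+`, comparison is case-insensitive, and ties are broken alphabetically (the question does not say how ties should be ordered). The function name `significant_words` is hypothetical.

```python
import re
from collections import Counter

def significant_words(text, excluded):
    """Return words ordered by descending frequency, skipping excluded keywords."""
    excluded_lower = {word.lower() for word in excluded}
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in excluded_lower]
    counts = Counter(words)
    # Assumption: equal counts are ordered alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ordered]

ordered = significant_words("This is stackoverflow. I repeat stackoverflow.", ["i", "is"])
print(ordered)
```

With the string and keywords from the question, the alphabetical tie-break happens to give the same order as the expected output listed there.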

How to get rid of MemoryError while dealing with a large dictionary?

Question: I'm trying to build an index of trigrams of words using a dictionary type of structure. Keys are strings and values are numbers of occurrences. for t in arrayOfTrigrams: if t in trigrams: trigrams[t] += 1 else: trigrams[t] = 1 But the data is very big - more than 500 MB of raw texts and I don't know how to cope with the MemoryError. And as distinct from Python memoryerror creating large dictionary I don't create any irrelevant stuff, each trigram is important. Answer 1: On Further Edit -- code
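The answer excerpt is cut off, so its actual recommendation is not visible here. One general way to lower peak memory, sketched below, is to avoid building `arrayOfTrigrams` as a full list and instead generate trigrams lazily while counting. The names `iter_trigrams` and `count_trigrams` and the sample lines are hypothetical, and treating each line separately (no trigrams across line breaks) is an assumption.

```python
from collections import Counter, deque

def iter_trigrams(words):
    """Yield consecutive word triples without storing them all at once."""
    window = deque(maxlen=3)
    for word in words:
        window.append(word)
        if len(window) == 3:
            yield tuple(window)

def count_trigrams(lines):
    """Count trigrams line by line; `lines` can be any iterable, e.g. an open file."""
    counts = Counter()
    for line in lines:
        counts.update(iter_trigrams(line.split()))
    return counts

# Hypothetical sample input standing in for the large corpus.
sample_lines = ["the cat sat on the mat", "the cat sat down"]
trigram_counts = count_trigrams(sample_lines)
print(trigram_counts.most_common(1))
```

This only removes the intermediate list. The counter still holds every distinct trigram, so if that alone exceeds available memory, an on-disk store or counting in sorted chunks would be needed instead.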

cannot perform reduce with flexible type plt.hist

Question: I have a dataset with 1000s of elements and their respective frequencies. I need to plot a histogram of the top 10 occurring elements. I did: top_words = Counter(my_data).most_common() top_words_10 = top_words[:10] plt.hist(top_words_10,label='True') and got this error : TypeError Traceback (most recent call last) <ipython-input-29-ff974b3a2354> in <module>() 5 print top_words[:10] 6 ----> 7 plt.hist(top_words_10) C:\Anaconda\lib\site-packages\numpy\core\_methods.pyc in _amin(a, axis, out,
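`Counter.most_common()` returns (element, count) tuples, so `plt.hist` receives a mix of strings and integers, which is a likely cause of the TypeError shown. Since the counts are already computed, a bar chart of labels against counts is probably what is wanted rather than a histogram. The sketch below only prepares the two sequences; `my_data` here is a hypothetical stand-in for the real dataset.

```python
from collections import Counter

# Hypothetical stand-in for the real dataset.
my_data = ["a", "b", "a", "c", "a", "b"]

top_words_10 = Counter(my_data).most_common(10)

# Split the (element, count) pairs into two parallel sequences.
labels, counts = zip(*top_words_10)
print(labels, counts)

# With matplotlib available, these could then be drawn as bars, e.g.:
#   import matplotlib.pyplot as plt
#   plt.bar(labels, counts)
#   plt.show()
```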

Awk: Characters-frequency from one text file?

Question: Given a multilingual .txt file such as: But where is Esope the holly Bastard But where is 생 지 옥 이 군 지 옥 이 지 옥 지 我是你的爸爸！爸爸！！！你不會的！ I counted space-separated words' word-frequency using this Awk function : $ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort Getting the elegant : 1 생 1 군 1 Bastard 1 Esope 1 holly 1 the 1 不 1 我 1 是 1 會 2 이 2 But 2 is 2 where 2 你 2 的 3 옥 4 지 4 爸 5 ！ How to change it to count characters-frequency ? EDIT: For Characters
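The question asks for an Awk solution. As a point of comparison only, here is a small Python sketch of per-character counting. Python 3 strings are sequences of Unicode code points, so Korean and Chinese characters are counted individually. Skipping whitespace is an assumption, and the sample string is a shortened, hypothetical stand-in for the file contents.

```python
from collections import Counter

def char_frequencies(text):
    """Count each non-whitespace character (Unicode code point) in the text."""
    return Counter(ch for ch in text if not ch.isspace())

# Shortened hypothetical sample, not the full file from the question.
sample = "지 옥 이 지 옥 지\n爸爸！"
freq = char_frequencies(sample)
for ch, count in sorted(freq.items(), key=lambda kv: (kv[1], kv[0])):
    print(count, ch)
```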

Create advanced frequency table with Python

Question: I am trying to make a frequency table based on a dataframe with pandas and Python. In fact it's exactly the same as a previous question of mine which used R. Let's say that I have a dataframe in pandas that looks like this (in fact the dataframe is much larger, but for illustrative purposes I limited the rows): node | precedingWord ------------------------- A-bom de A-bom die A-bom de A-bom een A-bom n A-bom de acroniem het acroniem t acroniem het acroniem n acroniem een act de act het act

word count frequency in document

Question: I have a directory in which I have 1000 .txt files. I want to know for every word how many times it occurs in the 1000 documents. So say even if the word "cow" occurred 100 times in X it will still be counted as one. If it occurred in a different document it is incremented by one. So the maximum is 1000 if "cow" appears in every single document. How do I do this the easy way without the use of any other external library. Here's what I have so far private Hashtable<String, Integer>
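What the question describes is usually called document frequency: each word counts at most once per document. The question uses Java; the Python sketch below shows the core idea with the standard library only, by reducing each document to a set of words before counting. The function name `document_frequency` and the in-memory sample documents are assumptions; in practice each string would be the contents of one file.

```python
import re
from collections import Counter

def document_frequency(documents):
    """For each word, count the number of documents that contain it at least once."""
    df = Counter()
    for text in documents:
        # A set removes repeats, so each word adds at most 1 per document.
        df.update(set(re.findall(r"\w+", text.lower())))
    return df

# Hypothetical documents standing in for the contents of the .txt files.
docs = ["cow cow cow moo", "the cow jumped", "no animals here"]
df = document_frequency(docs)
print(df["cow"])
```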

Find most frequent words on a webpage (using Jsoup)?

In my project I have to count the most frequent words in a Wikipedia article. I found Jsoup for parsing HTML format, but that still leaves the problem of word frequency. Is there a function in Jsoup that counts the frequency of words, or any way to find which words are the most frequent on a webpage, using Jsoup? Thanks. Yes, you could use Jsoup to get the text from the webpage, like this: Document doc = Jsoup.connect("http://en.wikipedia.org/").get(); String text = doc.body().text(); Then, you need to count the words and find out which ones are the most frequent ones. This code looks

Combining Lists of Word Frequency Data

Question: This seems like it should be an obvious question, but the tutorials and documentation on lists are not forthcoming. Many of these issues stem from the sheer size of my text files (hundreds of MB) and my attempts to boil them down to something manageable by my system. As a result, I'm doing my work in segments and am now trying to combine the results. I have multiple word frequency lists (~40 of them). The lists can either be taken through Import[ ] or as variables generated in Mathematica.
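The question is about Mathematica. Purely to illustrate the merging step, the Python sketch below sums counts across several per-segment frequency lists. It assumes each list is a sequence of (word, count) pairs; the function name `merge_frequency_lists` and the sample segments are hypothetical.

```python
from collections import Counter

def merge_frequency_lists(freq_lists):
    """Sum the counts for each word across several lists of (word, count) pairs."""
    total = Counter()
    for pairs in freq_lists:
        for word, count in pairs:
            total[word] += count
    return total

# Hypothetical per-segment results.
segment_a = [("the", 10), ("cat", 2)]
segment_b = [("the", 7), ("dog", 3)]
combined = merge_frequency_lists([segment_a, segment_b])
print(combined.most_common())
```

Only the combined counter needs to stay in memory; each segment list can be loaded, merged, and discarded in turn.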