Question
I have a huge paragraph and want to know which word appears most often in it. Could anyone please point me in the right direction? Any examples and explanations would be helpful. Thanks!
Answer 1:
Here is a simple solution that should be quite fast.
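% Sample text; replace with your own paragraph.
example_paragraph = 'This is an example corpus. Is is a verb?';
% Split on single spaces (punctuation stays attached, e.g. 'corpus.').
words = regexp(example_paragraph, ' ', 'split');
% Distinct words; unique is case-sensitive, so 'Is' and 'is' are both kept.
vocabulary = unique(words);
n = length(vocabulary);
counts = zeros(n, 1);
% Count the occurrences of each vocabulary entry, ignoring case.
for i = 1:n
    counts(i) = sum(strcmpi(words, vocabulary{i}));
end
% max returns the first maximum when several words tie.
[frequency_of_the_most_frequent_word, idx] = max(counts);
most_frequent_word = vocabulary{idx};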
You can also check out answers here for getting the most frequent word out of the array of words.
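One possible variation, offered as a sketch rather than as part of the answer: strip punctuation and fold case before counting, so that 'corpus.' and 'corpus', or 'Is' and 'is', count as the same word. The loop is swapped here for unique plus accumarray; the names normalized and max_count are made up for illustration.

```matlab
% Sketch under assumptions: ASCII text, words separated by whitespace.
example_paragraph = 'This is an example corpus. Is is a verb?';

% Remove everything except letters, digits, and whitespace, then lowercase.
normalized = lower(regexprep(example_paragraph, '[^a-zA-Z0-9\s]', ''));

% Split on runs of whitespace after trimming the ends.
words = regexp(strtrim(normalized), '\s+', 'split');

% idx maps each word to its position in vocabulary.
[vocabulary, ~, idx] = unique(words);

% Add 1 to the counter of each word's vocabulary index.
counts = accumarray(idx(:), 1);

[max_count, k] = max(counts);
most_frequent_word = vocabulary{k};   % 'is', with max_count equal to 3
```

With this sample text the three occurrences of "is" are merged into a single vocabulary entry, so the result no longer depends on which capitalization unique happens to list first.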
Answer 2:
Here's a very MATLAB-y way to do it. I tried to name the variables clearly. Play with each line and examine the results to understand how it works. The workhorse functions are unique and hist.
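% First produce a cell array of words to be analyzed.
% paragraph is assumed to be a character array holding the text.
% Turn every whitespace character (tabs, newlines) into a plain space.
paragraph_cleaned_up_whitespace = regexprep(paragraph, '\s', ' ');
% Drop everything except letters, digits, and spaces.
paragraph_cleaned_up = regexprep(paragraph_cleaned_up_whitespace, '[^a-zA-Z0-9 ]', '');
words = regexpi(paragraph_cleaned_up, '\s+', 'split');
% j maps each word to its index in unique_words (case-sensitive).
[unique_words, ~, j] = unique(words);
% Count how many times each index occurs.
frequency_count = hist(j, 1:max(j));
% Order the words from most to least frequent.
[~, sorted_locations] = sort(frequency_count, 'descend');
words_sorted_by_frequency = unique_words(sorted_locations).';
frequency_of_those_words = frequency_count(sorted_locations).';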
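To see these steps on concrete input, here is a small self-contained sketch. The sample paragraph, the top_n cutoff, and the names cleaned and sorted_counts are made up for illustration. The sample is all lowercase on purpose, because unique treats 'The' and 'the' as different words.

```matlab
% Hypothetical sample text; the newline shows why '\s' is mapped to ' '.
paragraph = sprintf('the cat saw the dog.\nthe dog ran away.');

% Same cleanup as above: normalize whitespace, then drop punctuation.
cleaned = regexprep(regexprep(paragraph, '\s', ' '), '[^a-zA-Z0-9 ]', '');
words = regexp(cleaned, '\s+', 'split');

% Count each distinct word via its index into unique_words.
[unique_words, ~, j] = unique(words);
frequency_count = hist(j, 1:max(j));

% Sort counts from highest to lowest and print the leading entries.
[sorted_counts, order] = sort(frequency_count, 'descend');
top_n = min(2, numel(order));
for k = 1:top_n
    fprintf('%s: %d\n', unique_words{order(k)}, sorted_counts(k));
end
```

For this sample the loop prints "the: 3" followed by "dog: 2". If the text can start or end with whitespace, the split may produce empty strings, which would need to be filtered out before calling unique.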
Source: https://stackoverflow.com/questions/13592390/how-to-know-what-word-appears-most-in-a-paragraph-matlab