How to get rid of punctuation and check for spelling errors?

守給你的承諾、 submitted on 2019-12-11 23:53:51

Question


  • Remove the punctuation.
  • Split the text into words at newlines and spaces, then store the words in an array.
  • Check each word for spelling errors with the function in the checkSpelling.m file.
  • Sum the total number of errors in the article.
  • If there is no suggestion, assume there is no error and return -1.
  • If the sum of errors is > 20, return 1.
  • If the sum of errors is <= 20, return -1.

I would like to check a paragraph for spelling errors, but I am having trouble removing the punctuation. The problem may also have another cause; it returns the error below:

My data2 file is:

checkSpelling.m

function suggestion = checkSpelling(word)

h = actxserver('word.application');
h.Document.Add;
correct = h.CheckSpelling(word);
if correct
  suggestion = []; %return empty if spelled correctly
else
  %If incorrect and there are suggestions, return them in a cell array
  if h.GetSpellingSuggestions(word).count > 0
      count = h.GetSpellingSuggestions(word).count;
      for i = 1:count
          suggestion{i} = h.GetSpellingSuggestions(word).Item(i).get('name');
      end
  else
      %If incorrect but there are no suggestions, return this:
      suggestion = 'no suggestion';
  end

end
%Quit Word to release the server
h.Quit    

f19.m

for i = 1:1

data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F19\',int2str(i),'.txt'),'r')
CharData = fread(data2, '*char')';  %read text file and store data in CharData
fclose(data2);

word_punctuation=regexprep(CharData,'[`~!@#$%^&*()-_=+[{]}\|;:\''<,>.?/','')

word_newLine = regexp(word_punctuation, '\n', 'split')

word = regexp(word_newLine, ' ', 'split')

[sizeData b] = size(word)

suggestion = cellfun(@checkSpelling, word, 'UniformOutput', 0)

A19(i)=sum(~cellfun(@isempty,suggestion))

feature19(A19(i)>=20)=1
feature19(A19(i)<20)=-1
end

Answer 1:


Change your regexprep call to:

word_punctuation=regexprep(CharData,'\W','\n');

Here \W matches every character that is not a letter, a digit or an underscore (so spaces and punctuation are included), and each match is replaced with a newline.

Then split on the newlines:

word = regexp(word_punctuation, '\n', 'split');

As you can see, you don't need a separate split on spaces, because the spaces have already been turned into newlines. Consecutive separators do leave empty cells behind, which you can remove:

word(cellfun(@isempty,word)) = [];
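
As a quick check of the three steps above, here is a minimal sketch on a made-up two-line string (the sample text is an assumption, not the asker's data file):

```matlab
% Sample text standing in for the contents of the data file (hypothetical).
CharData = sprintf('Hello, world!\nThis is a test.');

% Replace every non-word character (punctuation, spaces, newlines) with a newline.
word_punctuation = regexprep(CharData, '\W', '\n');

% Split on newlines; runs of separators produce empty cells.
word = regexp(word_punctuation, '\n', 'split');

% Drop the empty cells so only real words remain.
word(cellfun(@isempty, word)) = [];

disp(word)
```

The result is a 1-by-6 cell array holding Hello, world, This, is, a and test, which can be passed to cellfun as in the question.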

Everything worked for me. However, I have to say that your checkSpelling function is very slow. On every call it has to create an ActiveX server object, add a new document, and delete the object after the check is done. Consider rewriting the function to accept a cell array of strings.
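
One way such a rewrite could look is sketched below. The function name checkSpellingList is hypothetical; the COM calls are the same ones used in the question's checkSpelling.m, so it carries the same assumptions (Windows with Microsoft Word installed). It only flags misspelled words and does not collect suggestions.

```matlab
function isWrong = checkSpellingList(words)
% checkSpellingList  Flag misspelled entries in a cell array of strings.
% Hypothetical helper: starts Word once for the whole list instead of
% once per word. Returns a logical array the same size as WORDS.

h = actxserver('word.application');   % same server as in checkSpelling.m
h.Document.Add;                       % same call as in checkSpelling.m

isWrong = false(size(words));
for k = 1:numel(words)
    % CheckSpelling returns true when the word is spelled correctly.
    isWrong(k) = ~h.CheckSpelling(words{k});
end

% Quit Word to release the server.
h.Quit;
end
```

With this helper, the count in f19.m could be written as A19(i) = sum(checkSpellingList(word)), which counts the same words as sum(~cellfun(@isempty,suggestion)) because checkSpelling returns an empty result only for correctly spelled words.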

UPDATE

The only problem I see is that this also removes the apostrophe ' character (I'm, don't, etc.). You can temporarily substitute apostrophes with an underscore (yes, \W treats it as a word character) or with any sequence of unused characters. Alternatively, instead of \W you can list all the non-alphanumeric characters to be removed inside square brackets.
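
A minimal sketch of the underscore placeholder idea, assuming the text contains no underscores of its own (the sample sentence is made up):

```matlab
% Sample text with apostrophes inside words (hypothetical).
CharData = 'I''m sure you don''t mind.';

% Temporarily hide apostrophes behind underscores, which \W does not match.
tmp = strrep(CharData, '''', '_');

% Split into words exactly as before.
tmp = regexprep(tmp, '\W', '\n');
word = regexp(tmp, '\n', 'split');
word(cellfun(@isempty, word)) = [];

% Restore the apostrophes in every cell.
word = strrep(word, '_', '''');

disp(word)
```

This keeps I'm and don't as single words. If the text can contain real underscores, a different placeholder sequence would be needed.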

UPDATE 2

Another solution to the apostrophe problem from the first update:

word_punctuation=regexprep(CharData,'[^A-Za-z0-9''_]','\n');


Source: https://stackoverflow.com/questions/23501856/how-to-get-rid-of-the-punctuation-and-check-the-spelling-error