Clean Microsoft Word Pasted Text using JavaScript

自作多情 提交于 2019-11-28 03:42:50
OneNerd

Here is the function I wound up writing that does the job fairly well (as far as I can tell anyway).

I am certainly open for improvement suggestions if anyone has any. Thanks.

function cleanWordPaste( in_word_text ) {
 var tmp = document.createElement("DIV");
 tmp.innerHTML = in_word_text;
 var newString = tmp.textContent||tmp.innerText;
 // this next piece converts line breaks into break tags
 // and removes the seemingly endless crap code
 newString  = newString.replace(/\n\n/g, "<br />").replace(/.*<!--.*-->/g,"");
 // this next piece removes any break tags (up to 10) at beginning
 for ( i=0; i<10; i++ ) {
  if ( newString.substr(0,6)=="<br />" ) { 
   newString = newString.replace("<br />", ""); 
  }
 }
 return newString;
}

Hope this is helpful to some of you.

You can either use the full CKEditor which cleans on paste, or look at the source.

I am using this:

$(body_doc).find('body').bind('paste',function(e){
                var rte = $(this);
                _activeRTEData = $(rte).html();
                beginLen = $.trim($(rte).html()).length; 

                setTimeout(function(){
                    var text = $(rte).html();
                    var newLen = $.trim(text).length;

                    //identify the first char that changed to determine caret location
                    caret = 0;

                    for(i=0;i < newLen; i++){
                        if(_activeRTEData[i] != text[i]){
                            caret = i-1;
                            break;  
                        }
                    }

                    var origText = text.slice(0,caret);
                    var newText = text.slice(caret, newLen - beginLen + caret + 4);
                    var tailText = text.slice(newLen - beginLen + caret + 4, newLen);

                    var newText = newText.replace(/(.*(?:endif-->))|([ ]?<[^>]*>[ ]?)|(&nbsp;)|([^}]*})/g,'');

                    newText = newText.replace(/[·]/g,'');

                    $(rte).html(origText + newText + tailText);
                    $(rte).contents().last().focus();
                },100);
            });

body_doc is the editable iframe, if you are using an editable div you could drop out the .find('body') part. Basically it detects a paste event, checks the location cleans the new text and then places the cleaned text back where it was pasted. (Sounds confusing... but it's not really as bad as it sounds.

The setTimeout is needed because you can't grab the text until it is actually pasted into the element, paste events fire as soon as the paste begins.

How about having a "paste as plain text" button which displays a <textarea>, allowing the user to paste the text in there? that way, all tags will be stripped for you. That's what I do with my CMS; I gave up trying to clean up Word's mess.

I did something like that long ago, where i totally cleaned up the stuff in a rich text editor and converted font tags to styles, brs to p's, etc, to keep it consistant between browsers and prevent certain ugly things from getting in via paste. I took my recursive function and ripped out most of it except for the core logic, this might be a good starting point ("result" is an object that accumulates the result, which probably takes a second pass to convert to a string), if that is what you need:

var cleanDom = function(result, n) {
var nn = n.nodeName;
if(nn=="#text") {
    var text = n.nodeValue;

    }
else {
    if(nn=="A" && n.href)
        ...;
    else if(nn=="IMG" & n.src) {
        ....
        }
    else if(nn=="DIV") {
        if(n.className=="indent")
            ...
        }
    else if(nn=="FONT") {
        }       
    else if(nn=="BR") {
        }

    if(!UNSUPPORTED_ELEMENTS[nn]) {
        if(n.childNodes.length > 0)
            for(var i=0; i<n.childNodes.length; i++) 
                cleanDom(result, n.childNodes[i]);
        }
    }
}

This works great to remove any comments from HTML text, including those from Word:

function CleanWordPastedHTML(sTextHTML) {
  var sStartComment = "<!--", sEndComment = "-->";
  while (true) {
    var iStart = sTextHTML.indexOf(sStartComment);
    if (iStart == -1) break;
    var iEnd = sTextHTML.indexOf(sEndComment, iStart);
    if (iEnd == -1) break;
    sTextHTML = sTextHTML.substring(0, iStart) + sTextHTML.substring(iEnd + sEndComment.length);
  }
  return sTextHTML;
}

Had a similar issue with line-breaks being counted as characters and I had to remove them.

$(document).ready(function(){

  $(".section-overview textarea").bind({
    paste : function(){
    setTimeout(function(){
      //textarea
      var text = $(".section-overview textarea").val();
      // look for any "\n" occurences and replace them
      var newString = text.replace(/\n/g, '');
      // print new string
      $(".section-overview textarea").val(newString);
    },100);
    }
  });
  
});

Could you paste to a hidden textarea, copy from same textarea, and paste to your target?

Hate to say it, but I eventually gave up making TinyMCE handle Word crap the way I want. Now I just have an email sent to me every time a user's input contains certain HTML (look for <span lang="en-US"> for example) and I correct it manually.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!