Clean Microsoft Word Pasted Text using JavaScript

孤人 提交于 2019-12-17 17:30:32

问题


I am using a 'contenteditable' <div/> and enabling PASTE.

It is amazing the amount of markup code that gets pasted in from a clipboard copy from Microsoft Word. I am battling this, and have gotten about 1/2 way there using Prototypes' stripTags() function (which unfortunately does not seem to enable me to keep some tags).

However, even after that, I wind up with a mind-blowing amount of unneeded markup code.

So my question is, is there some function (using JavaScript), or approach I can use that will clean up the majority of this unneeded markup?


回答1:


Here is the function I wound up writing that does the job fairly well (as far as I can tell anyway).

I am certainly open for improvement suggestions if anyone has any. Thanks.

function cleanWordPaste( in_word_text ) {
 var tmp = document.createElement("DIV");
 tmp.innerHTML = in_word_text;
 var newString = tmp.textContent||tmp.innerText;
 // this next piece converts line breaks into break tags
 // and removes the seemingly endless crap code
 newString  = newString.replace(/\n\n/g, "<br />").replace(/.*<!--.*-->/g,"");
 // this next piece removes any break tags (up to 10) at beginning
 for ( i=0; i<10; i++ ) {
  if ( newString.substr(0,6)=="<br />" ) { 
   newString = newString.replace("<br />", ""); 
  }
 }
 return newString;
}

Hope this is helpful to some of you.




回答2:


You can either use the full CKEditor which cleans on paste, or look at the source.




回答3:


I am using this:

$(body_doc).find('body').bind('paste',function(e){
                var rte = $(this);
                _activeRTEData = $(rte).html();
                beginLen = $.trim($(rte).html()).length; 

                setTimeout(function(){
                    var text = $(rte).html();
                    var newLen = $.trim(text).length;

                    //identify the first char that changed to determine caret location
                    caret = 0;

                    for(i=0;i < newLen; i++){
                        if(_activeRTEData[i] != text[i]){
                            caret = i-1;
                            break;  
                        }
                    }

                    var origText = text.slice(0,caret);
                    var newText = text.slice(caret, newLen - beginLen + caret + 4);
                    var tailText = text.slice(newLen - beginLen + caret + 4, newLen);

                    var newText = newText.replace(/(.*(?:endif-->))|([ ]?<[^>]*>[ ]?)|(&nbsp;)|([^}]*})/g,'');

                    newText = newText.replace(/[·]/g,'');

                    $(rte).html(origText + newText + tailText);
                    $(rte).contents().last().focus();
                },100);
            });

body_doc is the editable iframe, if you are using an editable div you could drop out the .find('body') part. Basically it detects a paste event, checks the location cleans the new text and then places the cleaned text back where it was pasted. (Sounds confusing... but it's not really as bad as it sounds.

The setTimeout is needed because you can't grab the text until it is actually pasted into the element, paste events fire as soon as the paste begins.




回答4:


How about having a "paste as plain text" button which displays a <textarea>, allowing the user to paste the text in there? that way, all tags will be stripped for you. That's what I do with my CMS; I gave up trying to clean up Word's mess.




回答5:


I did something like that long ago, where i totally cleaned up the stuff in a rich text editor and converted font tags to styles, brs to p's, etc, to keep it consistant between browsers and prevent certain ugly things from getting in via paste. I took my recursive function and ripped out most of it except for the core logic, this might be a good starting point ("result" is an object that accumulates the result, which probably takes a second pass to convert to a string), if that is what you need:

var cleanDom = function(result, n) {
var nn = n.nodeName;
if(nn=="#text") {
    var text = n.nodeValue;

    }
else {
    if(nn=="A" && n.href)
        ...;
    else if(nn=="IMG" & n.src) {
        ....
        }
    else if(nn=="DIV") {
        if(n.className=="indent")
            ...
        }
    else if(nn=="FONT") {
        }       
    else if(nn=="BR") {
        }

    if(!UNSUPPORTED_ELEMENTS[nn]) {
        if(n.childNodes.length > 0)
            for(var i=0; i<n.childNodes.length; i++) 
                cleanDom(result, n.childNodes[i]);
        }
    }
}



回答6:


This works great to remove any comments from HTML text, including those from Word:

function CleanWordPastedHTML(sTextHTML) {
  var sStartComment = "<!--", sEndComment = "-->";
  while (true) {
    var iStart = sTextHTML.indexOf(sStartComment);
    if (iStart == -1) break;
    var iEnd = sTextHTML.indexOf(sEndComment, iStart);
    if (iEnd == -1) break;
    sTextHTML = sTextHTML.substring(0, iStart) + sTextHTML.substring(iEnd + sEndComment.length);
  }
  return sTextHTML;
}



回答7:


Had a similar issue with line-breaks being counted as characters and I had to remove them.

$(document).ready(function(){

  $(".section-overview textarea").bind({
    paste : function(){
    setTimeout(function(){
      //textarea
      var text = $(".section-overview textarea").val();
      // look for any "\n" occurences and replace them
      var newString = text.replace(/\n/g, '');
      // print new string
      $(".section-overview textarea").val(newString);
    },100);
    }
  });
  
});



回答8:


Could you paste to a hidden textarea, copy from same textarea, and paste to your target?




回答9:


Hate to say it, but I eventually gave up making TinyMCE handle Word crap the way I want. Now I just have an email sent to me every time a user's input contains certain HTML (look for <span lang="en-US"> for example) and I correct it manually.



来源:https://stackoverflow.com/questions/2875027/clean-microsoft-word-pasted-text-using-javascript

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!