Is there a way to get all text from the rendered page with JS?

Submitted by ぃ、小莉子 on 2019-12-12 09:41:43

Question


Is there an (unobtrusive, to the user) way to get all the text in a page with JavaScript? I could get the HTML, parse it, remove all tags, etc., but I'm wondering if there's a way to get the text from the already rendered page.

To clarify, I don't want to grab text from a selection, I want the entire page.

Thank you!


Answer 1:


All credit to Greg W's answer, as I based this answer on his code, but I found that for a website without inline style or script tags it was generally simpler to use:

var theText = $('body').text();

as this grabs all text in all tags without one having to manually set every tag that might contain text.

Also, if you're not careful, listing the tags manually tends to duplicate text in the output: when a selected tag is nested inside another selected tag, the each function visits both and grabs the same text twice. Using one selector which contains all the tags we want to grab text from circumvents this issue.

The caveat is that if there are inline style or script tags within the body tag it will grab those too.
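
If inline script or style tags are a concern, one option is to walk the DOM yourself and skip those elements. The following is only a minimal sketch: `collectText` and `SKIPPED_TAGS` are hypothetical names that do not come from any of the answers, and the function relies only on the standard DOM node properties `nodeType`, `nodeName`, `childNodes` and `nodeValue`.

```javascript
// Hypothetical helper (not from the answers above): gathers text from a DOM
// subtree while skipping <script> and <style> elements.
var SKIPPED_TAGS = { SCRIPT: true, STYLE: true };

function collectText(node) {
  // nodeType 3 is a text node; its nodeValue holds the text itself.
  if (node.nodeType === 3) {
    return node.nodeValue;
  }
  // Only descend into element nodes (nodeType 1) other than script/style.
  // Comment nodes and other node types contribute nothing.
  if (node.nodeType !== 1 || SKIPPED_TAGS[node.nodeName.toUpperCase()]) {
    return '';
  }
  var text = '';
  for (var i = 0; i < node.childNodes.length; i++) {
    text += collectText(node.childNodes[i]);
  }
  return text;
}

// In a browser: var theText = collectText(document.body);
```

Like `$('body').text()`, this sketch still includes text from elements that are hidden with CSS, since it never looks at styles.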

Update:

After reading this article about innerText I now think the absolute best way to get the text is plain ol vanilla js:

document.body.innerText

As is, this is not reliable cross-browser, but in controlled environments it returns the best results. Read the article for more details.

This method formats the text in a usually more readable manner and does not include style or script tag contents in the output.
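
For environments where `innerText` may be missing, a small fallback to `textContent` is one way to hedge. This is a sketch under that assumption, not part of the original answer; `getPageText` is a hypothetical name, and the fallback result will include script and style contents because `textContent` returns the raw text of every node.

```javascript
// Hypothetical helper: prefers innerText and falls back to textContent.
function getPageText(doc) {
  var body = doc.body;
  if (typeof body.innerText === 'string') {
    // innerText reflects the rendered page: no script/style contents,
    // and line breaks roughly where the layout has them.
    return body.innerText;
  }
  // textContent is the raw text of every node, including script/style.
  return body.textContent || '';
}

// In a browser: var theText = getPageText(document);
```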




Answer 2:


I suppose you could do something like this, if you don't mind loading jQuery.

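// Start from an empty string; an uninitialized variable would make the result begin with "undefined"
var theText = '';
$('p,h1,h2,h3,h4,h5').each(function(){
  theText += $(this).text();
});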

When it's all done, "theText" should contain most of the text on the page. Add any relevant selectors I may have left out.




Answer 3:


As an improvement to Greg W's answer, you could also make sure the result does not start with 'undefined', and remove any numbers, considering they're not words.

function countWords() {

    // Start from an empty string so the result is not prefixed with 'undefined'
    var collectedText = '';

    $('p,h1,h2,h3,h4,h5').each(function(index, element){
        collectedText += element.innerText + " ";
    });

    // Remove numbers, they're not words
    collectedText = collectedText.replace(/[0-9]/g, '');

    // Split on whitespace and drop empty entries to get the word count
    var wordCount = collectedText.split(/\s+/).filter(Boolean).length;
    console.log("You have " + wordCount + " words in your document.");
    return collectedText;

}

This can be split into an array of words, a count of words; whatever, really.
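
As a rough illustration of that last step, the returned string could be turned into a word array and a count as below. `toWords` is a hypothetical helper, not part of the answer, and it assumes that words are simply runs of non-whitespace characters.

```javascript
// Hypothetical helper: turns collected text into a word list and a count.
function toWords(text) {
  // Split on any run of whitespace and drop the empty strings left by
  // leading or trailing spaces.
  var words = text.split(/\s+/).filter(function (word) {
    return word.length > 0;
  });
  return { words: words, count: words.length };
}

var result = toWords('  Hello   rendered page \n text ');
console.log(result.words); // [ 'Hello', 'rendered', 'page', 'text' ]
console.log(result.count); // 4
```

Splitting on `/\s+/` rather than a single space keeps line breaks and repeated spaces from inflating the count.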



Source: https://stackoverflow.com/questions/2986990/is-there-a-way-to-get-all-text-from-the-rendered-page-with-js
