Use PhantomJS to extract HTML and text

眼角桃花 2020-12-21 23:11

I am trying to extract all the text content of a page (because it doesn't work with Simpledomparser).

I am trying to modify this simple example from the manual.

4 answers
  •  刺人心 (original poster)
    2020-12-21 23:42

    There are multiple ways to retrieve the page content as a string:

    • page.content gives the complete source including the markup (<html>) and doctype (<!DOCTYPE html>),

    • document.documentElement.outerHTML (via page.evaluate) gives the complete source including the markup (<html>), but without the doctype,

    • document.documentElement.textContent (via page.evaluate) gives the cumulative text content of the complete document including inline CSS & JavaScript, but without markup,

    • document.documentElement.innerText (via page.evaluate) gives the cumulative text content of the complete document excluding inline CSS & JavaScript and without markup.

    document.documentElement can be replaced with an element or query of your choice.
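
The four options above could be put together roughly as follows. This is only a sketch, not the example from the manual: the helper names extractAll and extractText, the URL, and the 'h1' selector are assumptions made for illustration. It relies on the PhantomJS webpage module (page.open, page.content, page.evaluate) and sticks to ES5 syntax.

```javascript
// Sketch for PhantomJS (ES5 syntax). extractAll and extractText are
// hypothetical helper names; the URL below is a placeholder.

// Returns the four string views of an already loaded page.
function extractAll(page) {
  return {
    // complete source, with markup and doctype
    source: page.content,
    // complete markup, without the doctype
    html: page.evaluate(function () {
      return document.documentElement.outerHTML;
    }),
    // all text, including inline CSS and JavaScript
    textContent: page.evaluate(function () {
      return document.documentElement.textContent;
    }),
    // text without inline CSS and JavaScript
    innerText: page.evaluate(function () {
      return document.documentElement.innerText;
    })
  };
}

// Same idea for a single element chosen by a CSS selector.
// Returns null when nothing matches.
function extractText(page, selector) {
  return page.evaluate(function (sel) {
    var el = document.querySelector(sel);
    return el ? el.innerText : null;
  }, selector);
}

// Entry point; only runs inside PhantomJS, where `phantom` is defined.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  page.open('http://example.com/', function (status) {
    if (status === 'success') {
      console.log(extractAll(page).innerText);
      console.log(extractText(page, 'h1'));
    } else {
      console.log('Failed to load the page');
    }
    phantom.exit();
  });
}
```

Saved under a name such as extract.js (also an assumption), it would be started with `phantomjs extract.js`. Functions passed to page.evaluate run inside the page and cannot see variables from the outer script, which is why the selector is handed over as an extra argument.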
