How to parse a webpage that includes Javascript? [duplicate]

拥有回忆 submitted on 2019-12-29 01:27:08

Question


I have a webpage that builds a table with JavaScript. I'm currently using JSoup in my Java project to parse the page, but JSoup can't execute JavaScript, so the table is never generated and the parsed source is incomplete. How can I get the HTML produced by that script so that I can parse it with JSoup? Could you provide a simple example? Thank you!

Webpage example:

<!doctype html>
<html>
  <head>
    <title>A blank HTML5 page</title>
    <meta charset="utf-8" />
  </head>
  <body>
    <script>
        var table = document.createElement("table");
        var tr = document.createElement("tr");
        table.appendChild(tr);
        document.body.appendChild(table);
    </script>
    <p>First paragraph</p>
  </body>
</html>

The output should be:

<!DOCTYPE html>
<html>
    <head>
        <title>
            A blank HTML5 page
        </title>
        <meta charset="utf-8"></meta>
    </head>
    <body>
        <script>
            var table = document.createElement("table");
            var tr = document.createElement("tr");
            table.appendChild(tr);
            document.body.appendChild(table);   
        </script>
        <table>
            <tr></tr>
        </table>
        <p>
            First paragraph
        </p>
    </body>
</html>

JSoup, however, leaves out the table element because it can't execute JavaScript. How can I achieve this?


Answer 1:


First possibility

You have some options outside JSoup, namely driving a "real" browser and interacting with it. An excellent choice for this is Selenium WebDriver. Selenium can use different browsers as its back end, and in your case the very lightweight HtmlUnit may already be enough. If the page calls more complicated JavaScript, there is often no other choice than running a full browser. Luckily, PhantomJS exists and its footprint is not too bad, since it runs headless.

Second possibility

Another approach is to grab the JavaScript source with JSoup and run it in a JavaScript interpreter within Java, for example Rhino. However, if you go down that path you might as well use HtmlUnit directly, which is probably a bit less bulky.
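Since the question asks for a simple example, here is a minimal sketch of the "use HtmlUnit directly" route: HtmlUnit loads the page and runs its scripts, and the resulting DOM is serialized and handed to JSoup. Several things here are assumptions rather than part of the original answer: the class name RenderedPageParser and the method fetchRendered are made up for illustration, the imports assume HtmlUnit 3.x or later (the 2.x releases used the com.gargoylesoftware.htmlunit package instead), the two-second wait is an arbitrary value, and the sample page is written to a temporary file only so that the sketch needs no web server.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class; the name is not from the original answer.
public class RenderedPageParser {

    // The sample page from the question.
    static final String SAMPLE_PAGE = String.join("\n",
        "<!doctype html>",
        "<html>",
        "  <head>",
        "    <title>A blank HTML5 page</title>",
        "    <meta charset=\"utf-8\" />",
        "  </head>",
        "  <body>",
        "    <script>",
        "        var table = document.createElement(\"table\");",
        "        var tr = document.createElement(\"tr\");",
        "        table.appendChild(tr);",
        "        document.body.appendChild(table);",
        "    </script>",
        "    <p>First paragraph</p>",
        "  </body>",
        "</html>");

    // Loads the URL in HtmlUnit so its scripts run, then parses the
    // serialized DOM with JSoup.
    static Document fetchRendered(String url) throws IOException {
        try (WebClient webClient = new WebClient()) {
            // CSS is not needed for scraping; script errors on real-world
            // pages should not abort the whole load.
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage(url);
            // Give asynchronous scripts a moment to finish (assumed timeout).
            webClient.waitForBackgroundJavaScript(2_000);

            // asXml() serializes the DOM as it stands after script execution.
            return Jsoup.parse(page.asXml(), url);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write the sample page to a temp file and load it via a file: URL.
        Path file = Files.createTempFile("sample", ".html");
        try {
            Files.write(file, SAMPLE_PAGE.getBytes(StandardCharsets.UTF_8));
            Document doc = fetchRendered(file.toUri().toString());
            System.out.println("tables found: " + doc.select("table").size());
            System.out.println(doc.select("table").outerHtml());
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

For a real site, pass its http(s) URL to fetchRendered instead of the temporary file. Note that JSoup's HTML parser normalizes tables when it re-parses the markup (it inserts a tbody, for instance), so a selector such as "table tr" is more robust than "table > tr".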



Source: https://stackoverflow.com/questions/19465510/how-to-parse-a-webpage-that-includes-javascript