How to parse a webpage that includes Javascript? [duplicate]

Question

I have a webpage that builds a table with JavaScript. I'm currently using JSoup in my Java project to parse the page. However, JSoup can't run JavaScript, so the table is never generated and the HTML I get is incomplete. How can I include the HTML created by that script so that I can parse it with JSoup? Can you provide a simple example? Thank you!

Webpage example:

<!doctype html>
<html>
  <head>
    <title>A blank HTML5 page</title>
    <meta charset="utf-8" />
  </head>
  <body>
    <script>
        var table = document.createElement("table");
        var tr = document.createElement("tr");
        table.appendChild(tr);
        document.body.appendChild(table);
    </script>
    <p>First paragraph</p>
  </body>
</html>

The output should be:

<!DOCTYPE html>
<html>
    <head>
        <title>
            A blank HTML5 page
        </title>
        <meta charset="utf-8"></meta>
    </head>
    <body>
        <script>
            var table = document.createElement("table");
            var tr = document.createElement("tr");
            table.appendChild(tr);
            document.body.appendChild(table);   
        </script>
        <table>
            <tr></tr>
        </table>
        <p>
            First paragraph
        </p>
    </body>
</html>

JSoup's output lacks the table element because JSoup can't execute JavaScript. How can I achieve this?

Answer 1:

First possibility

You have some options outside JSoup, namely driving a "real" browser and interacting with it. An excellent choice for this is Selenium WebDriver. Selenium can use different browsers as its back end, and in your case the very lightweight HtmlUnit may already be enough. If more complicated JavaScript is involved, there is often no other choice than running a full browser. Luckily, PhantomJS exists, and since it is headless its footprint is not too bad.
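Here is a minimal sketch of the Selenium route, assuming the `htmlunit-driver` and `jsoup` artifacts are on the classpath. The class name `SeleniumRenderedPage` and the helper `fetchRendered` are made up for illustration. The idea is to let the driver load the page and run its scripts, then hand the resulting markup to JSoup as usual.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class SeleniumRenderedPage {

    // Hypothetical helper: loads the URL in a JavaScript-enabled HtmlUnit
    // driver and returns the page source parsed by JSoup.
    public static Document fetchRendered(String url) {
        // "true" turns on JavaScript execution in the HtmlUnit back end.
        WebDriver driver = new HtmlUnitDriver(true);
        try {
            driver.get(url);
            // getPageSource() is expected to reflect the DOM after the inline
            // scripts have run; the exact serialization depends on the driver.
            return Jsoup.parse(driver.getPageSource(), url);
        } finally {
            driver.quit();
        }
    }

    public static void main(String[] args) {
        Document doc = fetchRendered(args[0]);
        // Print any table elements found in the rendered page.
        System.out.println(doc.select("table").outerHtml());
    }
}
```

If the page needs a full browser, swapping `HtmlUnitDriver` for another `WebDriver` implementation should leave the rest of the code unchanged.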

Second possibility

Another approach is to grab the JavaScript source with JSoup and run it in a JavaScript interpreter within Java. For that you could use Rhino. However, if you go down that path you might as well use HtmlUnit directly, which is probably a bit less bulky.
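Using HtmlUnit directly could look roughly like the sketch below. It assumes HtmlUnit 3.x (package `org.htmlunit`; in 2.x the same classes live under `com.gargoylesoftware.htmlunit`) plus JSoup on the classpath. The class name `HtmlUnitRenderedPage`, the helper `fetchRendered`, and the 2-second wait are illustrative choices, not requirements.

```java
import java.io.IOException;

import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HtmlUnitRenderedPage {

    // Hypothetical helper: lets HtmlUnit load the page and execute its
    // scripts, then re-parses the serialized DOM with JSoup.
    public static Document fetchRendered(String url) throws IOException {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            // CSS is not needed for scraping, so skip it.
            webClient.getOptions().setCssEnabled(false);
            // Do not abort on script errors in the page.
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage(url);
            // Give asynchronous scripts (timers, AJAX) up to 2 seconds to finish.
            webClient.waitForBackgroundJavaScript(2_000);

            // asXml() serializes the current DOM, including script-created nodes.
            return Jsoup.parse(page.asXml(), url);
        }
    }

    public static void main(String[] args) throws IOException {
        Document doc = fetchRendered(args[0]);
        for (Element table : doc.select("table")) {
            System.out.println(table.outerHtml());
        }
    }
}
```

From here on the `Document` behaves like any other JSoup document, so the existing selectors and traversal code can stay as they are.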

Source: https://stackoverflow.com/questions/19465510/how-to-parse-a-webpage-that-includes-javascript

Tags

java

javascript

html-parsing

jsoup