Question
I'm trying to create a web scraper for my upcoming Android app. For that I need to take a simple search form on a website, fill it out, and send my input back to the server.
As described in the Jsoup Cookbook, I fetched the page I needed from the server and changed the values in the parsed document.
Now I just need to post my modified document back to the server and scrape the resulting page.
As far as I can see in the Jsoup API, there is no way to post something back except with the .data() method of the Connection returned by Jsoup.connect(), which unfortunately cannot fill out text fields by their id.
Any ideas or workarounds for posting the modified document, or parts of it, back to the website?
Answer 1:
You seem to misunderstand how HTTP works in general. The entire HTML document with the modified input values is not sent from the client to the server. Instead, the name=value pairs of the form's input elements are sent as request parameters, and the server then returns the desired HTML response. Note that the pairs are keyed by each input's name attribute, not by its id.
For example, suppose you want to simulate submitting the following form with Jsoup (you can find the exact HTML of the form by opening the page in your browser, right-clicking, and choosing View Source):
<form method="post" action="http://example.com/somescript">
<input type="text" name="text1" />
<input type="text" name="text2" />
<input type="hidden" name="hidden1" value="hidden1value" />
<input type="submit" name="button1" value="Submit" />
<input type="submit" name="button2" value="Other button" />
</form>
then you need to construct the request as follows:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// POST the form's name=value pairs to the URL in the form's action attribute.
Document document = Jsoup.connect("http://example.com/somescript")
    .data("text1", "yourText1Value") // Fill the first input field.
    .data("text2", "yourText2Value") // Fill the second input field.
    .data("hidden1", "hidden1value") // You need to keep it unmodified!
    .data("button1", "Submit") // This way the server knows which button was pressed.
    .post();
// ...
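To make the name=value point concrete, here is a minimal sketch in plain Java of the request body that a form POST like the one above carries. `FormBodySketch` and `encode` are made-up names for illustration, not part of Jsoup, and the sketch assumes the usual application/x-www-form-urlencoded encoding and Java 10 or later:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not a Jsoup class): builds a URL-encoded form body.
class FormBodySketch {

    // Joins the fields as name=value pairs separated by '&',
    // URL-encoding every name and every value.
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            String name = URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8);
            String value = URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8);
            body.add(name + "=" + value);
        }
        return body.toString();
    }
}
```

Passing the four fields from the request above in a LinkedHashMap (to keep their order) gives text1=yourText1Value&text2=yourText2Value&hidden1=hidden1value&button1=Submit. Nothing else from the HTML document travels to the server.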
In some cases you'd also need to send the session cookies back, but that's a separate subject (and one that has been asked about several times here before). In general, it's easier to use a real HTTP client for this and pass its response through Jsoup#parse().
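As a rough illustration of the "real HTTP client" route, the sketch below uses the JDK's own java.net.http.HttpClient (Java 11+) with an in-memory CookieManager, so that a session cookie set when the form page is loaded is sent back with the POST. This client is a stand-in chosen for illustration; `SessionFormPost`, `buildPost`, and `fetchThenSubmit` are hypothetical names, and java.net.http is, to my knowledge, not available on Android, where another client would be needed.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class; the names are illustrative only.
class SessionFormPost {

    // Builds a form POST for an already URL-encoded body such as "text1=a&text2=b".
    static HttpRequest buildPost(String actionUrl, String encodedBody) {
        return HttpRequest.newBuilder(URI.create(actionUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodedBody))
                .build();
    }

    // Loads the form page first so the server can set its session cookie,
    // then submits the form with the same client, which replays that cookie.
    static String fetchThenSubmit(String formPageUrl, String actionUrl, String encodedBody)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())          // in-memory cookie jar
                .followRedirects(HttpClient.Redirect.NORMAL) // many forms redirect after a POST
                .build();
        HttpRequest loadForm = HttpRequest.newBuilder(URI.create(formPageUrl)).GET().build();
        client.send(loadForm, HttpResponse.BodyHandlers.ofString());
        HttpResponse<String> response =
                client.send(buildPost(actionUrl, encodedBody), HttpResponse.BodyHandlers.ofString());
        return response.body(); // pass this HTML to Jsoup.parse(html, actionUrl)
    }
}
```

The returned string is the server's HTML response, which can then be handed to Jsoup.parse(html, baseUri) for scraping as usual.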
See also:
- HTTP tutorial
- HTTP specification
Answer 2:
That's not the way. You should create a POST request (for example with Apache HttpComponents), get the response, and then scrape it with Jsoup.
Source: https://stackoverflow.com/questions/8986617/jsoup-posting-modified-document