Pulling HTML from a Webpage in Java


Question


I want to pull the entire HTML source of a webpage in Java (or Python or PHP if it is easier to do in those languages). I only want to view the HTML and scan through it with a few methods, not edit or manipulate it in any way, and I would rather not write it to a new file unless there is no other way. Are there any library classes or methods that do this? If not, how could I go about it?


Answer 1:


In Java:

URL url = new URL("http://stackoverflow.com");
URLConnection connection = url.openConnection();
InputStream stream = connection.getInputStream();
// ... read stream like any file stream

This code is good for scripting purposes and internal use, but I would argue against using it in production: it doesn't handle timeouts or failed connections.
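
To complete the snippet, here is a minimal, self-contained sketch that reads the stream into a String in memory (nothing is written to disk). The UTF-8 charset is an assumption; a robust version would take the charset from the response's Content-Type header.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class FetchHtml {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://stackoverflow.com");
        URLConnection connection = url.openConnection();
        // Read the response body line by line into a StringBuilder.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder html = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
            // ... scan html.toString() with your own methods
            System.out.println(html.length() + " characters fetched");
        }
    }
}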

For production use I would recommend the HttpClient library (Apache HttpClient). It supports authentication, redirect handling, threading, connection pooling, etc.
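
A rough sketch of the same fetch with it, assuming Apache HttpClient 4.x is on the classpath (class and package names differ in the 5.x line):

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class FetchWithHttpClient {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet request = new HttpGet("http://stackoverflow.com");
            try (CloseableHttpResponse response = client.execute(request)) {
                // EntityUtils buffers the response body into a String.
                String html = EntityUtils.toString(response.getEntity());
                // ... scan 'html' with your own methods
                System.out.println(html.length() + " characters fetched");
            }
        }
    }
}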




Answer 2:


In Python:

# Python 2 syntax; in Python 3 the equivalent call is urllib.request.urlopen.
import urllib
# Get a file-like object for the Python Web site's home page.
f = urllib.urlopen("http://www.python.org")
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()

Please see Python and HTML Processing for more details.




Answer 3:


Maybe you should also consider an alternative: run a standard utility such as wget or curl from the command line to fetch the site tree into a local directory tree, then do your scanning (in Java, Python, whatever) on the local copy. That should be simpler than implementing all of the boring stuff like error handling, argument parsing, etc. yourself.

If you want to fetch all the pages in a site, curl will not harvest links from HTML pages for you; wget can crawl recursively (wget -r), or you can use an open source web crawler.
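
For the scanning step itself, here is a minimal Java sketch that walks a local mirror and reads each page. The "mirror" directory name and the .html filter are assumptions about where your wget/curl run saved the files, and Files.readString needs Java 11+.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class ScanLocalCopy {
    public static void main(String[] args) throws IOException {
        // Hypothetical location of the mirrored site, e.g. the output of: wget -r http://example.com
        Path root = Paths.get("mirror");
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(p -> p.toString().endsWith(".html"))
                 .forEach(p -> {
                     try {
                         String html = Files.readString(p);
                         // ... run your scanning methods on 'html'
                         System.out.println(p + ": " + html.length() + " chars");
                     } catch (IOException e) {
                         System.err.println("Could not read " + p + ": " + e.getMessage());
                     }
                 });
        }
    }
}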



Source: https://stackoverflow.com/questions/1837471/pulling-html-from-a-webpage-in-java
