Get source of website in java

Submitted 2019-12-18 12:39:35

Question


I would like to use Java to get the source of a (secure) website and then parse it for the links it contains. I have found how to connect to the URL, but how can I easily get just the source, preferably as a DOM Document, so that I can easily extract the information I want?

Or is there a better way to connect to an https site, get the source (which I need to do to get a table of data... it's pretty simple)? The links in that table point to files I am going to download.

I wish it were FTP, but these are files stored on my TiVo (I want to programmatically download them to my computer).


Answer 1:


You can get low level and just request it with a socket. In Java it looks like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.security.cert.X509Certificate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.net.ssl.SSLPeerUnverifiedException;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

// Arg[0] = Hostname
// Arg[1] = Path to request, e.g. /index.html
public static void main(String[] args) throws Exception {
    SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();

    SSLSocket sslsock = (SSLSocket) factory.createSocket(args[0], 443);

    SSLSession session = sslsock.getSession();
    X509Certificate cert;
    try {
        // Forces certificate verification; the cert itself is not used further.
        cert = (X509Certificate) session.getPeerCertificates()[0];
    } catch (SSLPeerUnverifiedException e) {
        System.err.println(session.getPeerHost() + " did not present a valid cert.");
        return;
    }

    // Now use the secure socket just like a regular socket to read pages.
    PrintWriter out = new PrintWriter(sslsock.getOutputStream());
    // A Host header helps with virtual hosting, even under HTTP/1.0.
    out.write("GET " + args[1] + " HTTP/1.0\r\nHost: " + args[0] + "\r\n\r\n");
    out.flush();

    BufferedReader in = new BufferedReader(new InputStreamReader(sslsock.getInputStream()));
    String line;
    String regExp = ".*<a href=\"(.*)\">.*";
    Pattern p = Pattern.compile( regExp, Pattern.CASE_INSENSITIVE );

    while ((line = in.readLine()) != null) {
        // Using Oscar's RegEx.
        Matcher m = p.matcher( line );
        if( m.matches() ) {
            System.out.println( m.group(1) );
        }
    }

    sslsock.close();
}
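Note that reading straight from the socket gives you the status line and headers before the HTML: HTTP separates the two with the first blank line (`\r\n\r\n`). A minimal sketch of splitting them apart (the class and method names here are just for illustration, not part of the answer above):

```java
// Splits a raw HTTP response into headers and body.
// HTTP separates the two with the first blank CRLF line.
public class RawHttpResponse {

    // Returns a two-element array: [0] = status line + headers, [1] = body.
    static String[] splitResponse(String raw) {
        // Limit of 2 keeps any blank lines inside the body intact.
        String[] parts = raw.split("\r\n\r\n", 2);
        if (parts.length == 1) {
            return new String[] { parts[0], "" };
        }
        return parts;
    }

    public static void main(String[] args) {
        String raw = "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n"
                   + "<html><body>hi</body></html>";
        String[] parts = splitResponse(raw);
        System.out.println("Headers:\n" + parts[0]);
        System.out.println("Body:\n" + parts[1]);
    }
}
```

With that in place, you would run the regex only over the body, so header lines can never produce false matches.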



Answer 2:


Extremely similar questions:

  • How do I retrieve a URL from a website using Java?
  • How do you Programmatically Download a Webpage in Java
  • A good library to do URL manipulation in Java



Answer 3:


Probably you could get better results from Pete's or sktrdie's options. Here's an additional way, if you would like to know how to do it "by hand".

I'm not very good at regexes, so in this case it returns the last link in a line. Well, it's a start.

import java.io.*;
import java.net.*;
import java.util.regex.*;

public class Links { 
    public static void main( String [] args ) throws IOException  { 

        URL url = new URL( args[0] );
        InputStream is = url.openConnection().getInputStream();

        BufferedReader reader = new BufferedReader( new InputStreamReader( is )  );

        String line = null;
        String regExp = ".*<a href=\"(.*)\">.*";
        Pattern p = Pattern.compile( regExp, Pattern.CASE_INSENSITIVE );

        while( ( line = reader.readLine() ) != null )  {
            Matcher m = p.matcher( line );  
            if( m.matches() ) {
                System.out.println( m.group(1) );
            }
        }
        reader.close();
    }
}
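The greedy `.*` inside the capture group is why this returns only the last link on a line, and `matches()` anchors to the whole line, so it finds at most one match. A reluctant quantifier plus `Matcher.find()` picks up every link instead; a sketch on a hard-coded line (still a regex hack, not a real HTML parser):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllLinks {

    // Reluctant (.*?) stops at the first closing quote; find() scans the
    // whole input for repeated matches instead of anchoring like matches().
    static final Pattern HREF =
            Pattern.compile("<a href=\"(.*?)\">", Pattern.CASE_INSENSITIVE);

    static List<String> extract(String line) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(line);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String line = "<a href=\"a.html\">one</a> and <a href=\"b.html\">two</a>";
        System.out.println(extract(line));  // prints [a.html, b.html]
    }
}
```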

EDIT

Oops, I totally missed the "secure" part. Anyway, I couldn't help it, I had to write this sample :P




Answer 4:


Try HttpUnit or HttpClient. Although the former is ostensibly for writing integration tests, it has a convenient API for programmatically iterating through a web page's links, with something like the following use of WebResponse.getLinks():

import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebLink;
import com.meterware.httpunit.WebResponse;

WebConversation wc = new WebConversation();
WebResponse resp = wc.getResponse("http://stackoverflow.com/questions/422970/");
WebLink[] links = resp.getLinks();
// Loop over the array of links...



Answer 5:


You can use javacurl to get the site's HTML, and the Java DOM API to analyze it.




Answer 6:


Try using the jsoup library.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;


public class ParseHTML {

    public static void main(String args[]) throws IOException{
        Document doc = Jsoup.connect("https://www.wikipedia.org/").get();
        String text = doc.body().text();

        System.out.print(text);
    }
}

You can download the jsoup library from the jsoup website.




Answer 7:


There are two meanings of "source" in a web context:

The HTML source: If you request a webpage by URL, you always get the HTML source code; in fact, there is nothing else you could get from the URL. Webpages are always transmitted in source form; there is no such thing as a compiled webpage. For what you are attempting, this should be enough to fulfill your task.

Script source: If the webpage is dynamically generated, then it is coded in some server-side scripting language (like PHP, Ruby, JSP...). There is also source code at this level, but over an HTTP connection you are not able to get this kind of source code. This is not a missing feature but entirely intentional.

Parsing: That said, you will need to somehow parse the HTML code. If you just need the links, using a RegEx (as Oscar Reyes showed) will be the most practical approach, but you could also write a simple parser "manually". It would be slow and more code... but it works.

If you want to access the code on a more logical level, parsing it to a DOM would be the way to go. If the code is valid XHTML, you can just parse it to an org.w3c.dom.Document and do anything with it. If it is at least valid HTML, you might apply some tricks to convert it to XHTML (in some rare cases, replacing <br> with <br/> and changing the doctype is enough) and use it as XML.

If it's not valid XML, you would need an HTML DOM parser. I've no idea if such a thing exists for Java, or whether it performs well.
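For the valid-XHTML case described above, the standard `javax.xml.parsers` API is enough. A minimal sketch on a hard-coded snippet (in practice the string would be the HTTP response body, and real-world HTML will often fail this strict parse):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XhtmlLinks {

    // Parses an XHTML string into a DOM and collects every <a href="..."> value.
    static List<String> hrefs(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            out.add(((Element) anchors.item(i)).getAttribute("href"));
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                    + "<a href=\"a.html\">one</a><a href=\"b.html\">two</a>"
                    + "</body></html>";
        System.out.println(hrefs(page));  // prints [a.html, b.html]
    }
}
```

Unlike the regex approaches, this survives links that span multiple lines, but it rejects anything that is not well-formed XML.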




Answer 8:


There exists an FTP server that can be installed on your TiVo to allow show downloads; see http://dvrpedia.com/MFS_FTP

The question is formulated differently (how to handle HTTP/HTML in Java), but in the end you mention that what you actually want is to download shows. TiVo uses its own unique file system (MFS, Media File System), so it is not easy to mount the drive on another machine; instead, it is easier to run an HTTP or FTP server on the TiVo and download from that.



Source: https://stackoverflow.com/questions/422970/get-source-of-website-in-java
