Get source of website in Java


You can get low-level and just request it with a socket. In Java it looks like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.security.cert.X509Certificate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.net.ssl.SSLPeerUnverifiedException;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class GetSource {
    // args[0] = hostname
    // args[1] = path, like /index.html
    public static void main(String[] args) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();

        SSLSocket sslsock = (SSLSocket) factory.createSocket(args[0], 443);

        SSLSession session = sslsock.getSession();
        X509Certificate cert;
        try {
            // The certificate is fetched only to confirm the peer presented one.
            cert = (X509Certificate) session.getPeerCertificates()[0];
        } catch (SSLPeerUnverifiedException e) {
            System.err.println(session.getPeerHost() + " did not present a valid cert.");
            return;
        }

        // Now use the secure socket just like a regular socket to read pages.
        // The Host header keeps virtual-hosted servers happy.
        PrintWriter out = new PrintWriter(sslsock.getOutputStream());
        out.write("GET " + args[1] + " HTTP/1.0\r\nHost: " + args[0] + "\r\n\r\n");
        out.flush();

        BufferedReader in = new BufferedReader(new InputStreamReader(sslsock.getInputStream()));
        String line;
        String regExp = ".*<a href=\"(.*)\">.*";
        Pattern p = Pattern.compile(regExp, Pattern.CASE_INSENSITIVE);

        while ((line = in.readLine()) != null) {
            // Using Oscar's regex.
            Matcher m = p.matcher(line);
            if (m.matches()) {
                System.out.println(m.group(1));
            }
        }

        sslsock.close();
    }
}

You could probably get better results from Pete's or sktrdie's options. Here's an additional way, if you would like to know how to do it "by hand".

I'm not very good at regex, so in this case it returns only the last link in a line. Well, it's a start.

import java.io.*;
import java.net.*;
import java.util.regex.*;

public class Links {
    public static void main(String[] args) throws IOException {

        URL url = new URL(args[0]);
        InputStream is = url.openConnection().getInputStream();

        BufferedReader reader = new BufferedReader(new InputStreamReader(is));

        String line;
        // Greedy regex: if a line contains several links, only the last one is captured.
        String regExp = ".*<a href=\"(.*)\">.*";
        Pattern p = Pattern.compile(regExp, Pattern.CASE_INSENSITIVE);

        while ((line = reader.readLine()) != null) {
            Matcher m = p.matcher(line);
            if (m.matches()) {
                System.out.println(m.group(1));
            }
        }
        reader.close();
    }
}

EDIT

Oops, I totally missed the "secure" part. Anyway, I couldn't help it, I had to write this sample :P

Try HttpUnit or HttpClient. Although the former is ostensibly for writing integration tests, it has a convenient API for programmatically iterating through a web page's links, with something like the following use of WebResponse.getLinks():

// Requires HttpUnit (com.meterware.httpunit) on the classpath.
WebConversation wc = new WebConversation();
WebResponse resp = wc.getResponse("http://stackoverflow.com/questions/422970/");
WebLink[] links = resp.getLinks();
for (WebLink link : links) {
    System.out.println(link.getURLString());
}

You can use javacurl to get the site's HTML, and the Java DOM APIs to analyze it.
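I can't vouch for javacurl's exact API here, so as a rough sketch of the same fetch step using only the JDK, you could read the page with HttpURLConnection (the URL below is just a placeholder):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchHtml {
    public static void main(String[] args) throws IOException {
        // Placeholder URL; substitute the page you actually want.
        URL url = new URL("https://www.example.com/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // Read the whole response body into a string.
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        System.out.println(html);
    }
}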

Try using the jsoup library.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseHTML {

    public static void main(String[] args) throws IOException {
        // Fetch the page and parse it into a Document in one step.
        Document doc = Jsoup.connect("https://www.wikipedia.org/").get();

        // text() returns only the visible text; use doc.html() for the raw markup.
        String text = doc.body().text();

        System.out.println(text);
    }
}

You can download the jsoup library from its project site, https://jsoup.org/.
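Since the question is ultimately about pulling things out of a page, it's worth noting that jsoup can also extract links directly with its CSS-style selectors; a minimal sketch:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ListLinks {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://www.wikipedia.org/").get();

        // Select every <a> element that has an href attribute.
        for (Element link : doc.select("a[href]")) {
            // "abs:href" resolves relative URLs against the page's base URL.
            System.out.println(link.attr("abs:href"));
        }
    }
}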

There are two meanings of source in a web context:

The HTML source: If you request a webpage by URL, you always get the HTML source code. In fact, there is nothing else you could get from the URL. Webpages are always transmitted in source form; there is no such thing as a compiled webpage. And for what you are trying to do, this should be enough.

Script source: If the webpage is dynamically generated, then it is coded in some server-side scripting language (like PHP, Ruby, JSP...). There is also source code at this level, but you cannot get it over an HTTP connection. This is not a missing feature; it is entirely by design.

Parsing: That said, you will need to somehow parse the HTML code. If you just need the links, using a regex (as Oscar Reyes showed) is the most practical approach, but you could also write a simple parser "manually", as sketched below. It would be slower and more code... but it works.
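For illustration only, a hand-rolled extractor could just scan for href="..." with indexOf; this sketch is deliberately naive (it ignores single quotes, comments, and whitespace around the equals sign):

import java.util.ArrayList;
import java.util.List;

public class NaiveLinkParser {
    // Scans the HTML for href="..." occurrences and collects the URLs.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        int pos = 0;
        while ((pos = html.indexOf("href=\"", pos)) != -1) {
            int start = pos + "href=\"".length();
            int end = html.indexOf('"', start);
            if (end == -1) break;
            links.add(html.substring(start, end));
            pos = end + 1;
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/one\">one</a> <a href=\"/two\">two</a>";
        extractLinks(html).forEach(System.out::println);
    }
}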

If you want to access the code on a more logical level, parsing it into a DOM is the way to go. If the code is valid XHTML, you can just parse it into an org.w3c.dom.Document and do anything with it. If it is at least valid HTML, you might apply some tricks to convert it to XHTML (in some rare cases, replacing <br> with <br/> and changing the doctype is enough) and use it as XML.
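A minimal sketch of that route, assuming the markup really is well-formed XHTML (the inline sample here is hypothetical; a real page would come from one of the fetch snippets above):

import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XhtmlToDom {
    public static void main(String[] args) throws Exception {
        // Assumes valid XHTML; parsing fails with an exception otherwise.
        String xhtml = "<html><body><a href=\"/one\">one</a></body></html>";

        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // Walk all <a> elements and print their href attributes.
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            System.out.println(a.getAttribute("href"));
        }
    }
}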

If it's not valid XML, you would need an HTML DOM parser. I have no idea whether such a thing exists for Java, or how well it performs.

There is an FTP server that can be installed on your TiVo to allow for show downloads; see here: http://dvrpedia.com/MFS_FTP

The question is formulated differently (how to handle HTTP/HTML in Java), but at the end you mention that what you want is to download shows. TiVo uses a unique file system of its own (MFS, the Media File System), so it is not easy to mount the drive on another machine; instead, it is easier to run an HTTP or FTP server on the TiVo and download from that.
