Get source of website in Java

轮回少年 2020-12-30 17:37

I would like to use Java to get the source of a (secure) website and then parse that page for the links it contains. I have found how to connect to that URL, but then how

8 Answers
  • 2020-12-30 18:01

    You can use javacurl to fetch the site's HTML, and the Java DOM API to analyze it.
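
    As a rough illustration of the DOM half of this suggestion, the sketch below parses a page with the JDK's built-in XML DOM parser and collects the href of every <a> element. It assumes the page is well-formed XHTML (real-world HTML often is not, and would need a lenient HTML parser instead). The class name DomLinkExtractor and the sample markup are made up for the example, and fetching the page is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
public class DomLinkExtractor {

    // Returns the href value of every <a> element, in document order.
    // Assumption: the input is well-formed XHTML without a DOCTYPE.
    public static List<String> extractLinks(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        NodeList anchors = doc.getElementsByTagName("a");
        List<String> links = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {   // skip named anchors without a target
                links.add(a.getAttribute("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Sample markup standing in for a downloaded page.
        String page = "<html><body><a href=\"https://example.com/a\">A</a>"
                + "<p><a href=\"/b\">B</a></p></body></html>";
        extractLinks(page).forEach(System.out::println);
    }
}
```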

  • 2020-12-30 18:07

    You can go low-level and request the page directly over a socket. In Java it looks like this:

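    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.security.cert.X509Certificate;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import javax.net.ssl.SSLParameters;
    import javax.net.ssl.SSLPeerUnverifiedException;
    import javax.net.ssl.SSLSession;
    import javax.net.ssl.SSLSocket;
    import javax.net.ssl.SSLSocketFactory;

    public class SslLinkReader {
        // args[0] = hostname, e.g. www.example.com
        // args[1] = path starting with a slash, e.g. /index.html
        public static void main(String[] args) throws Exception {
            SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();

            // try-with-resources closes the socket even if reading fails.
            try (SSLSocket sslsock = (SSLSocket) factory.createSocket(args[0], 443)) {
                // A raw SSLSocket does not check the hostname against the certificate
                // unless asked to; enable that before the handshake starts.
                SSLParameters params = sslsock.getSSLParameters();
                params.setEndpointIdentificationAlgorithm("HTTPS");
                sslsock.setSSLParameters(params);

                SSLSession session = sslsock.getSession();
                X509Certificate cert;
                try {
                    cert = (X509Certificate) session.getPeerCertificates()[0];
                } catch (SSLPeerUnverifiedException e) {
                    System.err.println(session.getPeerHost() + " did not present a valid cert.");
                    return;
                }

                // Now use the secure socket just like a regular socket to read pages.
                // The Host header lets servers that host several sites pick the right one.
                PrintWriter out = new PrintWriter(sslsock.getOutputStream());
                out.write("GET " + args[1] + " HTTP/1.0\r\nHost: " + args[0] + "\r\n\r\n");
                out.flush();

                BufferedReader in = new BufferedReader(new InputStreamReader(sslsock.getInputStream()));
                String line;
                // Based on Oscar's regex, but captures only the quoted href value and is
                // applied with find(), so several links on one line are all reported.
                Pattern p = Pattern.compile("<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

                while ((line = in.readLine()) != null) {
                    Matcher m = p.matcher(line);
                    while (m.find()) {
                        System.out.println(m.group(1));
                    }
                }
            }
        }
    }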
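
    If raw sockets are more than you need, a higher-level variant is sketched below. It swaps the socket for the JDK's URL and URLConnection classes, which take care of the TLS handshake and the HTTP framing, and keeps link extraction in a separate method. The class name UrlLinkFetcher, the 10-second timeouts, and the UTF-8 assumption are illustrative choices, not part of the answer above, and the regex is still only an approximation of real HTML parsing.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of any library.
public class UrlLinkFetcher {

    // Matches href="..." or href='...' inside an <a ...> tag; group 2 is the target.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every link target found in the given HTML text, in order of appearance.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    // Downloads the page body as text. Assumption: the page is UTF-8 encoded.
    public static String fetch(String address) throws Exception {
        URLConnection conn = URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            return in.lines().collect(Collectors.joining("\n"));
        }
    }

    public static void main(String[] args) throws Exception {
        // With a URL argument the page is fetched; otherwise a sample string is used.
        String html = args.length > 0
                ? fetch(args[0])
                : "<a href=\"https://example.com/\">sample</a>";
        extractLinks(html).forEach(System.out::println);
    }
}
```

    Run it with a URL as the only argument to list that page's links; without arguments it just prints the link from the built-in sample string.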
    