How to find hyperlinks in a web page using Java?

泪湿孤枕 submitted on 2019-12-23 05:38:11

Question


How can we find out the number of hyperlinks in a page,
and what they all are? I need to build this in plain Java, not with any framework, which means using only the
java.net.* classes. Is that possible, and how can I do it?
Can you give me a proper example?

I need to get all the links in the page and save them in a database, each link with its domain name.


Answer 1:


Try using the jsoup library.

Download the project jar and compile this code snippet:

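    // Needs imports: java.net.URL, org.jsoup.Jsoup, org.jsoup.nodes.Document,
    // org.jsoup.nodes.Element, org.jsoup.select.Elements
    // Fetch and parse the page; the second argument is the timeout in milliseconds.
    Document doc = Jsoup.parse(new URL("http://www.bits4beats.it/"), 2000);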

    Elements resultLinks = doc.select("a");
    System.out.println("number of links: " + resultLinks.size());
    for (Element link : resultLinks) {
        System.out.println();
        String href = link.attr("href");
        System.out.println("Title: " + link.text());
        System.out.println("Url: " + href);
    }

The code prints the number of hyperlink elements in an HTML page and information about each one.




Answer 2:


You can use the javax.swing.text.html and javax.swing.text.html.parser packages to achieve this:

import java.io.*;
import java.net.URL;
import java.util.Enumeration;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Test {
   public static void main(String[] args) throws Exception  {
      Reader r = null;

      try   {
         URL u = new URL(args[0]);
         InputStream in = u.openStream();
         r = new InputStreamReader(in);

         ParserDelegator hp = new ParserDelegator();
         hp.parse(r, new HTMLEditorKit.ParserCallback() {
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
               // System.out.println(t);
               if(t == HTML.Tag.A)  {
                  Enumeration<?> attrNames = a.getAttributeNames();
                  while(attrNames.hasMoreElements())    {
                      Object key = attrNames.nextElement();
                      if("href".equals(key.toString())) {
                          System.out.println(a.getAttribute(key));
                      }
                  }
               }
            }
         }, true);
      }finally {
         if(r != null)  {
            r.close();
         }
      }
   }
}

Compile and call it this way:

java Test http://www.oracle.com/technetwork/java/index.html
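
Since the question asks for the number of links and for each link together with its domain name, the same callback approach can collect the values instead of printing them. The following is only a sketch under assumptions: the class name `LinkCollector`, the method `collectLinks`, and the sample HTML and base address are made up for illustration, and relative links are resolved with `java.net.URI.resolve` so that every stored link carries a host name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the original answer.
class LinkCollector {

    // Returns the href of every <a> tag in the HTML text, resolved against
    // baseUri so that relative links become absolute ones.
    static List<String> collectLinks(String html, URI baseUri) {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.A) {
                    return;
                }
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href == null) {
                    return; // anchor without an href, e.g. <a name="...">
                }
                String value = href.toString().trim();
                try {
                    links.add(baseUri.resolve(value).toString());
                } catch (IllegalArgumentException e) {
                    links.add(value); // not a valid URI: keep the raw value
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input and base address are assumptions for the demo.
        String html = "<html><body><a href=\"/docs/index.html\">Docs</a>"
                + "<a href=\"http://example.org/a\">A</a>"
                + "<a name=\"top\">no href</a></body></html>";
        List<String> links = collectLinks(html, URI.create("http://example.com/start/"));
        System.out.println("number of links: " + links.size());
        links.forEach(System.out::println);
    }
}
```

For a live page, the HTML text would come from the stream opened in the answer above, and the page's own address would serve as the base URI.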



Answer 3:


The best option is to use an HTML parser library, but if you don't want any third-party library you can try matching with a regular expression, using Java's Pattern and Matcher classes from the java.util.regex package.

Example:

// Matches the text between href=" and the closing double quote.
String regex = "(?<=href=\")[^\"]*(?=\")";
Pattern pattern = Pattern.compile(regex);

Matcher m = pattern.matcher(str_YourHtmlHere);
while(m.find()) {
  System.out.println("FOUND: " + m.group());
}

The example above uses a simple regex that finds every link given by a double-quoted href attribute. You may have to extend the regex to handle other cases correctly, such as an href whose URL is in single quotes.
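
One possible extension for the quoting cases mentioned above is sketched below. It is an illustration rather than a complete HTML matcher: the class name `HrefRegex` and method `findHrefs` are hypothetical, the pattern only looks inside `<a ...>` tags, and it accepts double-quoted, single-quoted, and unquoted values, ignoring case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class HrefRegex {

    // Group 1: double-quoted value, group 2: single-quoted value,
    // group 3: unquoted value. The \s before "href" avoids matching
    // attributes such as data-href.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>\"']+))",
            Pattern.CASE_INSENSITIVE);

    static List<String> findHrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Exactly one of the three alternatives took part in the match.
            String value = m.group(1) != null ? m.group(1)
                    : m.group(2) != null ? m.group(2)
                    : m.group(3);
            result.add(value);
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">A</a> <A HREF='b.html'>B</A> "
                + "<a class=x href=c.html>C</a>";
        for (String href : findHrefs(html)) {
            System.out.println("FOUND: " + href);
        }
    }
}
```

A regex like this still cannot cope with commented-out markup or links built by scripts, which is why a parser remains the safer choice.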




Answer 4:


Getting Links in an HTML Document




Answer 5:


    // br and resp come from the answerer's surrounding code, which is not shown.
    Pattern p = Pattern.compile("(https?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?)");
    Matcher m = p.matcher(br.toString());

    // Write each matched URL back out as a clickable link.
    while (m.find()) {
        String url = m.group(0);
        resp.getWriter().print("<a href=\"" + url + "\">" + url + "</a><br/>");
    }
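
Because the question wants each link stored with its domain name, the host can be read from every absolute URL once the links are extracted. The sketch below is one way to do that with `java.net.URI`; the class name `LinkHosts` and method `groupByHost` are hypothetical, and links without a host (relative or malformed ones) are simply skipped here.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the original answer.
class LinkHosts {

    // Groups links by lower-cased host name, keeping first-seen order.
    static Map<String, List<String>> groupByHost(List<String> links) {
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (String link : links) {
            try {
                String host = new URI(link).getHost();
                if (host != null) {
                    byHost.computeIfAbsent(host.toLowerCase(Locale.ROOT),
                            k -> new ArrayList<>()).add(link);
                }
            } catch (URISyntaxException e) {
                // Not a parsable URI: skipped in this sketch.
            }
        }
        return byHost;
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        links.add("http://www.oracle.com/technetwork/java/index.html");
        links.add("http://www.bits4beats.it/");
        links.add("/relative/path");
        groupByHost(links).forEach((host, urls) ->
                System.out.println(host + " -> " + urls));
    }
}
```

Each map entry then gives a (domain, link) pair that could be written to a database table.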


Source: https://stackoverflow.com/questions/3383152/how-to-find-hyperlink-in-a-webpage-using-java
