Parsing HTML in Java for an Android app

滥情空心 2020-12-10 23:42

I'm writing an Android app that takes relevant data from a website and presents it to the user (HTML scraping). The application downloads the source code and parses it, loo

3 answers
  • 2020-12-11 00:00

    Here is how I would do it:

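        // Hypothetical method header; the original snippet omitted it.
        private String fetchAndInterpretHtml() {
            StringBuilder text = new StringBuilder();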
            HttpURLConnection conn = null;
            InputStreamReader in = null;
            BufferedReader buff = null;
            try {
                URL page = new URL(
                        "http://example.com/");
                // If the URL carries parameters that may contain spaces or symbols, encode each value
                // with URLEncoder.encode(value, "UTF-8") first (spaces become '+', symbols become %XX);
                // otherwise the server may reject the request, for example with a 404.
                conn = (HttpURLConnection) page.openConnection();
                conn.connect();
                /* Uncomment this block if you need to handle HTTP error codes:
                int responseCode = conn.getResponseCode();
    
                if (responseCode == 401 || responseCode == 403) {
                    // Authorization Error
                    Log.e(tag, "Authorization Error");
                    throw new Exception("Authorization Error");
                }
    
                if (responseCode >= 500 && responseCode <= 504) {
                    // Server Error
                    Log.e(tag, "Internal Server Error");
                    throw new Exception("Internal Server Error");
                }*/
                in = new InputStreamReader(conn.getInputStream(), "UTF-8"); // UTF-8 is an assumption; use the page's actual charset
                buff = new BufferedReader(in);
                String line;
                // readLine() returns null at end of stream, so test it before using the line.
                while ((line = buff.readLine()) != null) {
                    // Remove the next three lines if you need to load the whole HTML document.
                    String found = interpretHtml(line);
                    if (null != found)
                        return found;
                    text.append(line).append("\n");
                }
            } catch (Exception e) {
                Log.e(Standards.tag,
                        "Exception while getting html from website, exception: "
                                + e.toString() + ", cause: " + e.getCause()
                                + ", message: " + e.getMessage());
            } finally {
                if (null != buff) {
                    try {
                        buff.close();
                    } catch (IOException e1) {
                    }
                    buff = null;
                }
                if (null != in) {
                    try {
                        in.close();
                    } catch (IOException e1) {
                    }
                    in = null;
                }
                if (null != conn) {
                    conn.disconnect();
                    conn = null;
                }
            }
            if (text.length() > 0) {
                // Reached only when no single line matched, or when the per-line check above was removed.
                return interpretHtml(text.toString());
            } else return null;
        }
    
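    private String interpretHtml(String s){
        // Matches a line such as <textTag class="text">...</textTag> and returns the inner text.
        // The length check keeps substring() from throwing on lines that are too short.
        if(s != null && s.length() >= 32 && s.startsWith("<textTag class=\"text\"")){
            return s.substring(22, s.length() - 10);
        }
        return null;
    }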
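
The comment about URLEncoder in the snippet above can be sketched as follows. This is only an illustration: the class name `QueryUrlBuilder`, the method `withParam`, and the example URL and parameter are hypothetical and not part of the original answer.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryUrlBuilder {

    // Builds "<baseUrl>?<name>=<encoded value>". All names here are illustrative.
    static String withParam(String baseUrl, String name, String value) {
        try {
            // URLEncoder applies form encoding: spaces become '+',
            // reserved symbols such as '&' become %XX escapes.
            return baseUrl + "?" + name + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    public static void main(String[] args) {
        // prints http://example.com/search?q=java+%26+android
        System.out.println(withParam("http://example.com/search", "q", "java & android"));
    }
}
```

The resulting string can be passed to `new URL(...)` in place of the hard-coded address.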
    
  • 2020-12-11 00:02

    I would say it's probably a bad idea to parse HTML on the device if you're experiencing performance issues. Have you considered creating a web app that your device app fetches data from?

    If the data is from one source (i.e., one webpage and not many), I would build a web app to prefetch the site, parse it for the relevant data, and cache the result for later use on the device(s).
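
The caching part of that idea could look roughly like the sketch below. It assumes a simple time-based expiry; the class `CachedValue`, its time-to-live parameter, and the stand-in loader are all hypothetical, and a real server would plug its own fetch-and-parse routine into the loader.

```java
import java.util.function.Supplier;

class CachedValue<T> {
    private final Supplier<T> loader;   // does the expensive fetch and parse
    private final long ttlMillis;       // how long a loaded value stays fresh
    private T value;
    private long loadedAt;
    private boolean loaded;

    CachedValue(Supplier<T> loader, long ttlMillis) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
    }

    // Returns the cached value, reloading it when missing or older than the TTL.
    // The current time is passed in so the behaviour is easy to test.
    synchronized T get(long nowMillis) {
        if (!loaded || nowMillis - loadedAt >= ttlMillis) {
            value = loader.get();
            loadedAt = nowMillis;
            loaded = true;
        }
        return value;
    }

    public static void main(String[] args) {
        // Stand-in for "fetch the site and parse out the relevant data".
        CachedValue<String> cache = new CachedValue<>(() -> "parsed data", 60_000L);
        long now = System.currentTimeMillis();
        System.out.println(cache.get(now));           // loads the value
        System.out.println(cache.get(now + 1_000L));  // served from the cache
    }
}
```

With this in place, the devices only ever request the small cached result instead of downloading and parsing the full page themselves.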

  • 2020-12-11 00:04

    To fetch a webpage in Java, see the code at the bottom of this answer.

    You can then parse the source with regular expressions. Here's a nice reference:

    android regex
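
As a minimal sketch of the regular-expression approach, the following pulls the inner text out of an element shaped like the `<textTag class="text">` example used in the first answer. The tag name, the class `TextTagExtractor`, and the sample HTML are assumptions for illustration, and a regex like this only suits simple, predictable markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextTagExtractor {

    // Lazily captures everything between the opening and closing tag.
    // DOTALL lets the captured text span several lines.
    private static final Pattern TEXT_TAG = Pattern.compile(
            "<textTag\\s+class=\"text\"[^>]*>(.*?)</textTag>", Pattern.DOTALL);

    // Returns the inner text of the first matching element, or null if none is found.
    static String firstMatch(String html) {
        if (html == null) {
            return null;
        }
        Matcher m = TEXT_TAG.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<body>\n<textTag class=\"text\">Hello, world</textTag>\n</body>";
        System.out.println(firstMatch(html)); // prints "Hello, world"
    }
}
```

Unlike a fixed `substring` offset, the pattern does not depend on the tag starting at the beginning of a line.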

    If the HTML is well formed, you can also try Yahoo's YQL. It outputs JSON or XML, so the data is easy to extract afterwards.

    yahoo yql console

    Personally, I parse pages in Python or PHP because I feel more comfortable in those languages.

    Getting the webpage. How to use it:

        Get_Webpage obj = new Get_Webpage("http://your_url_here");
        String source = obj.get_webpage_source();


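    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;

    import org.apache.http.HttpEntity;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.DefaultHttpClient;

    public class Get_Webpage {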
    
        public String parsing_url = "";
    
        public Get_Webpage(String url_2_get){       
            parsing_url = url_2_get;
        }
    
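        public String get_webpage_source(){

            HttpClient client = new DefaultHttpClient();
            HttpGet request = new HttpGet(parsing_url);
            StringBuilder str = new StringBuilder();
            BufferedReader reader = null;
            try {
                // Keep the request and the read in one try block so a failed
                // request cannot cause a NullPointerException further down.
                HttpResponse response = client.execute(request);
                HttpEntity entity = response.getEntity();
                if (entity == null) {
                    return ""; // the response carried no body
                }
                InputStream in = entity.getContent();
                reader = new BufferedReader(new InputStreamReader(in));
                String line;
                while((line = reader.readLine()) != null)
                {
                    // readLine() strips the line terminator, so add it back.
                    str.append(line).append('\n');
                }
            } catch (ClientProtocolException e) {
                // HTTP protocol error: return whatever was read so far.
            } catch (IOException e) {
                // Network or stream error: return whatever was read so far.
            } catch (IllegalStateException e) {
                // The entity content was not available.
            } finally {
                if (reader != null) {
                    try {
                        reader.close(); // also closes the underlying stream
                    } catch (IOException e) {
                    }
                }
            }

            return str.toString();
        }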
    
    }
    