Finding a word in a web page using Java

Submitted by 我的未来我决定 on 2019-12-12 12:31:16

Question


I am trying to search for a specific word in a specific web page, using Java and Eclipse. The problem is that with a web page that has almost no content it works fine, but with a "big" web page it does not find the word.

For example: I am trying to find the word ["InitialChatFriendsList" in the web page https://www.facebook.com; if the word is found, print WIN!!!

Here is the full Java code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class BR4Qustion {
    public static void main(String[] args) {
        BufferedReader br = null;
        try {
            URL url = new URL("https://www.facebook.com");
            br = new BufferedReader(new InputStreamReader(url.openStream()));

            String foundWord = "[\"InitialChatFriendsList\"";
            String sCurrentLine;

            // Read the page line by line and compare each comma-separated token
            while ((sCurrentLine = br.readLine()) != null) {
                String[] words = sCurrentLine.split(",");
                for (String word : words) {
                    if (word.equals(foundWord)) {
                        System.out.println("WIN!!!");
                        break; // note: this only exits the inner loop
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null)
                    br.close();
            } catch (IOException ex) {
                System.out.println("*** IOException while closing the reader");
            }
        }
    }
}
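As a side note on the matching itself: splitting each line on "," and comparing tokens for exact equality is fragile, because the searched word can be surrounded by other characters in the same token. A substring check with String.contains is more robust. A small sketch, using a made-up sample line instead of a live page:

```java
public class ContainsCheck {
    public static void main(String[] args) {
        // Sample line standing in for one line of the downloaded page
        String line = "{\"list\":[\"InitialChatFriendsList\",\"100001234567890-2\"]}";
        String needle = "[\"InitialChatFriendsList\"";

        // split(",") yields the token {"list":["InitialChatFriendsList"
        // which is not equal to the needle, so equals() misses it
        boolean foundBySplit = false;
        for (String token : line.split(",")) {
            if (token.equals(needle)) {
                foundBySplit = true;
            }
        }

        // contains() finds the needle anywhere in the line
        boolean foundByContains = line.contains(needle);

        System.out.println(foundBySplit);
        System.out.println(foundByContains);
    }
}
```

This does not change the outcome for Facebook (see the answer below for why), but it makes the search itself reliable.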

Answer 1:


Problem

Besides some small flaws in your code (you should use try-with-resources, and could consider the newer NIO library), it looks fine and does not seem to contain a logical error.


You are facing a different problem here. When trying to read Facebook you first need to log in to your account; otherwise you will only see the landing page.

I guess you think it is enough to log in from your browser (for example Google Chrome), but that is not the case. Login information is saved in the local storage of the specific browser you used, for example in its cookies. This is called a session.


Showcase

As a small experiment, visit Facebook with Google Chrome and log in. After that, visit it with Internet Explorer: you will not be logged in, and you will see the landing page again.

The same happens with your Java code: you are simply reading the landing page, because as far as "Java's browser" is concerned you are not logged in. You can verify this by dumping the content your BufferedReader reads:

final URL url = new URL("https://www.facebook.com");
try (final BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()))) {
    // Read the whole page
    while (true) {
        final String line = br.readLine();
        if (line == null) {
            break;
        }

        System.out.println(line);
    }
}

Take a look at the output; it will probably be the source of the landing page.


Insights

After logging in to Facebook via my browser, the website sends me a number of cookies.

The c_user cookie is definitely relevant for the session: if I delete it and refresh the page, I am no longer logged in.
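For illustration, a Set-Cookie header value like the c_user one can be inspected with the standard java.net.HttpCookie class. The user id below is invented for the example:

```java
import java.net.HttpCookie;
import java.util.List;

public class CookieDemo {
    public static void main(String[] args) {
        // Hypothetical Set-Cookie header value; the id is made up
        String header = "c_user=100001234567890; Path=/; Domain=.facebook.com";

        // HttpCookie.parse understands standard Set-Cookie syntax
        List<HttpCookie> cookies = HttpCookie.parse(header);
        for (HttpCookie cookie : cookies) {
            System.out.println(cookie.getName() + "=" + cookie.getValue());
        }
    }
}
```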


Solution

In order to work, your Java code would need to log in itself: fill out the login form and submit it (or just send the corresponding POST request), then read Facebook's response and save the cookie information it returns. However, doing this yourself would be a huge task, and I would not recommend it. Instead you could use an API that emulates a browser from inside Java, for example HtmlUnit. Alternatively you could use a library like Selenium, which lets you control your favorite browser directly via its driver interface.
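For the "corresponding POST request" part, the form fields would have to be URL-encoded into the request body. The field names email and pass are assumptions about Facebook's login form, not confirmed from the source; the encoding itself uses only the standard library:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormBody {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Hypothetical form field names and values for a login POST
        String body = "email=" + URLEncoder.encode("user@example.com", "UTF-8")
                + "&pass=" + URLEncoder.encode("secret pass", "UTF-8");

        // In a real login flow this body would be written to an
        // HttpURLConnection opened with setRequestMethod("POST")
        System.out.println(body);
    }
}
```

Note that special characters get escaped (@ becomes %40, a space becomes +), which is exactly what the browser does when submitting the form.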

The other approach would be to hijack the session: extract the relevant cookie data from your browser's local files and recreate it inside your Java application, with the same content. This is also a huge task for a non-expert without supporting APIs.


Remarks

Now, very important: note that Facebook (and other websites such as Twitter) have a publicly available API (Facebook for Developers) which is designed to ease interaction with automated software. There are of course also Java API wrappers available, such as Facebook4J. You should use those APIs when interacting with sites like Facebook programmatically.

Also note that many sites, including Facebook, state in their Terms of Service (TOS) that interaction via automated software that does not use their API is treated as a violation of those terms. It could result in legal consequences.

An excerpt from the TOS:

  1. Safety
    1. You will not collect users' content or information, or otherwise access Facebook, using automated means (such as harvesting bots, robots, spiders, or scrapers) without our prior permission.



Answer 2:


You could try Jsoup.

This library allows you to connect to a page, load it, and parse it.

Here is an example
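A minimal sketch of what such an example might look like, using Jsoup's documented connect/parse API (the jsoup library must be on the classpath). For a live page you would use Jsoup.connect(...).get(); here an inline document keeps the sketch self-contained. As Answer 1 explains, whether Facebook actually serves the word to an unauthenticated client is a separate issue:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupExample {
    public static void main(String[] args) {
        // Live page variant (network access, returns the landing page
        // when not logged in):
        //   Document doc = Jsoup.connect("https://www.facebook.com").get();
        // Self-contained variant with a tiny inline document:
        Document doc = Jsoup.parse(
                "<html><body><script>[\"InitialChatFriendsList\"</script></body></html>");

        // Search the parsed document's HTML for the word
        if (doc.html().contains("[\"InitialChatFriendsList\"")) {
            System.out.println("WIN!!!");
        }
    }
}
```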



Source: https://stackoverflow.com/questions/45905285/finding-a-word-in-a-web-page-using-java
