How To Fix: HtmlUnit GetElementById Returns Null

Posted by 别等时光非礼了梦想 on 2020-04-30 07:07:24

Question


I am writing a web scraper and am trying to type a search term into a search box. However, I get null when I try to access the search box by its ID. I am just learning HtmlUnit, so I could be missing something very obvious, but I have not been able to identify it myself yet.

Here is the website's code:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" class="no-touch">
    <head>-</head>
    <body lang="en" class="garageBrand" emailcookiename="grgemailca" loyaltycookiename="grgloyaltyca">
        <div id="fb-root" class="fb_reset">-</div>
        <noscript>...</noscript>
        <script>...</script>
        <div id="container">
            <div id="avsDialog" style="display: none; position: absolute; top: 0; right: 0;"></div>
            <input type="hidden" value="en" id="displayLanguage">
            <input type="hidden" value="garageSiteCA" id="currSiteId">
            <input type="hidden" value="en_CA" id="currLocale">
            <div id="contentarea">
                <div id="header" class="nonHeaderScroll">
                <div id="topnav">...</div>
                <div class="socialSearch">
                <div id="searchMenu">
                    <form action="//www.garageclothing.com/ca/search/search.jsp" method="GET">
                        <input type="hidden" name="N" value="0">
                        <input type="hidden" name="Dy" value="1">
                        <input type="hidden" name="Nty" value="1">
                        <input type="hidden" name="Ntk" value="All">
                        <input type="hidden" name="Ntx" value="mode matchall">
                        <input id="searchText" maxlength="40" type="text" name="Ntt" class="textInput" placeholder="Search..." autocomplete="off">
                        <input class="mainSearchButton" type="image" src="//images.gdicdn.com/img/magnifying-glass.png?version=375" name="search">
                    </form>
                </div>

Here is my code:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlInput;

import java.io.IOException;


public class Main {

public static void main(String[] args) {

    WebClient client = new WebClient();
    client.getOptions().setJavaScriptEnabled(true);
    client.getOptions().setCssEnabled(false);
    client.getOptions().setUseInsecureSSL(true);

    try {
        HtmlPage page = client.getPage("https://www.garageclothing.com/ca");

        // Check for popup.
        if(page.getElementById("cboxClose") != null) {
            page = page.getElementById("cboxClose").click();
        }

        // Debugging line that returns null:
        System.out.println(page.getElementById("searchText"));
        // What I would like to do:
      /*HtmlInput searchInput = (HtmlInput) page.getElementById("searchText");
        searchInput.setValueAttribute("red scarf");
        HtmlSubmitInput submitBtn = page.getElementByName("search");
        page = submitBtn.click();

        System.out.println(page.asXml());*/

    } catch (IOException e) {
        e.printStackTrace();
    }
}

}

Answer 1:


Even though the page looks simple, it is (like many shopping portals) really complicated and relies on tons of JavaScript, not only for the page itself but also for all the trackers that observe the users. If you would like to learn more about this page, I suggest using a web proxy like Charles to capture the whole traffic.

Now back to your problem. Because HtmlUnit's JavaScript support (based on Rhino) is not perfect, you will run into some JavaScript errors. To keep the client from stopping at these errors, configure it like this:

webClient.getOptions().setThrowExceptionOnScriptError(false);

The next step is to get the page. This is not straightforward either, because the JavaScript appears to replace the page that is initially returned for the URL. You therefore need three steps:

  • get the page
  • wait some time to let the js do some work
  • get the current page from the current window
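
The "wait some time" step can also be written as a polling loop that stops as soon as the element shows up, rather than always waiting a fixed time. The following is only a minimal sketch in plain Java; `PollingWait` and `pollUntilPresent` are hypothetical names and not part of HtmlUnit, and the lambda in `main` is a stand-in for a real DOM lookup.

```java
import java.util.function.Supplier;

class PollingWait {

    /**
     * Calls lookup repeatedly until it returns a non-null value or the
     * timeout elapses. Returns the value, or null if nothing showed up
     * in time (or the thread was interrupted while waiting).
     */
    static <T> T pollUntilPresent(Supplier<T> lookup, long timeoutMillis, long intervalMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            T result = lookup.get();
            if (result != null) {
                return result;
            }
            if (System.currentTimeMillis() >= deadline) {
                return null;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // restore the interrupt flag and give up
                Thread.currentThread().interrupt();
                return null;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a DOM lookup: returns null on the first two calls.
        int[] calls = {0};
        String found = pollUntilPresent(
                () -> ++calls[0] < 3 ? null : "searchText", 2000, 50);
        System.out.println(found + " after " + calls[0] + " lookups");
    }
}
```

With HtmlUnit, the lookup would presumably re-read the current page on every attempt, for example `() -> ((HtmlPage) webClient.getCurrentWindow().getEnclosedPage()).getElementById("searchText")`, so that a page replaced by JavaScript is picked up.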

Now you can find the search field, type a search term into it, and finally press the search button. After that, go through the same steps again to get the current content.

Hope that helps....

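import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.DomNodeList;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlInput;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;

public class Main {

    public static void main(String[] args) throws IOException {
        String url = "https://www.garageclothing.com/ca";

        try (final WebClient webClient = new WebClient()) {
            // do not stop at js errors
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            // load the page, then give the background js up to 10 seconds to finish
            webClient.getPage(url);
            webClient.waitForBackgroundJavaScript(10000);

            // the js may have replaced the initial page, so ask the window for the current one
            HtmlPage page = (HtmlPage) webClient.getCurrentWindow().getEnclosedPage();
            HtmlInput searchInput = (HtmlInput) page.getElementById("searchText");
            searchInput.type("red scarf");

            // the search button is an image input, so treat it as a generic HtmlElement
            HtmlElement submitBtn = (HtmlElement) page.getElementByName("search");
            submitBtn.click();
            webClient.waitForBackgroundJavaScript(10000);

            // same as before: fetch the page that is current after the click
            page = (HtmlPage) webClient.getCurrentWindow().getEnclosedPage();
            // System.out.println("------------------------------------------------");
            // System.out.println(page.asXml());

            System.out.println("------------------------------------------------");
            final DomNodeList<DomNode> divs = page.querySelectorAll(".divProdPriceSale");
            for (DomNode div : divs) {
                System.out.println(div.asText());
            }
        }
    }
}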



Answer 2:


You should check that the URL you are passing to the WebClient is the same one you are viewing in your web browser.

I went to the link you use in your code (https://www.garageclothing.com) and the page I got is not the one you are expecting. It asked me to pick a country (USA or Canada), and only after I clicked one of the options did it take me to the page you are expecting.

Try changing the URL to "https://www.garageclothing.com/us/" or "https://www.garageclothing.com/ca/"
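
One way to make this check explicit is to compare the address copied from the browser with the address the scraper actually ended up on (in HtmlUnit, `page.getUrl()` reports the latter). The sketch below is plain Java and only an illustration; `LandingPageCheck` and `sameLocation` are hypothetical names, and the comparison deliberately ignores nothing but a trailing slash.

```java
import java.net.URI;

class LandingPageCheck {

    /** Returns true when both URLs have the same scheme, host and path, ignoring a trailing slash. */
    static boolean sameLocation(String expectedUrl, String actualUrl) {
        return normalize(expectedUrl).equals(normalize(actualUrl));
    }

    // Reduces a URL to scheme://host/path without a trailing slash.
    private static String normalize(String url) {
        URI uri = URI.create(url).normalize();
        String path = uri.getPath() == null ? "" : uri.getPath();
        if (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return uri.getScheme() + "://" + uri.getHost() + path;
    }

    public static void main(String[] args) {
        String inBrowser = "https://www.garageclothing.com/ca/";
        // Same page, only the trailing slash differs.
        System.out.println(sameLocation(inBrowser, "https://www.garageclothing.com/ca"));
        // The bare domain is a different location (the country picker).
        System.out.println(sameLocation(inBrowser, "https://www.garageclothing.com"));
    }
}
```

In the scraper, the second argument would be something like `page.getUrl().toString()`; a mismatch is a hint that the element is missing because a different page was loaded.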



Source: https://stackoverflow.com/questions/54029644/how-to-fix-htmlunit-getelementbyid-returns-null
