Question
First off, I apologize if my question doesn't provide enough context; I'm typing this up on my phone right now.
I'm working on a project that requires me to automate tasks within a webpage. Step one is to access the page in the first place, but I've hit an obstacle that I've tried searching for and figuring out, to no avail.
The webpage I'm trying to reach has DDoS protection by Cloudflare, meaning that before you enter the page, your browser is checked for a couple of seconds and then let through.
I'm using the external library HtmlUnit, which provides everything I need. When accessing the page, I get a 503 error saying I cannot access it; I'm fairly sure this is the protection blocking it.
Now my question is how I should bypass it. There is a .jar that I decompiled and looked at, which goes to the same site as mine, but it's far too illegible for me to make out.
I would appreciate help with this task very much, thanks.
For reference, here is an example of a webpage that uses Cloudflare, for testing: www.osbot.org (this isn't the site, by the way).
If you need anything else, please let me know. Again, sorry for the text-only post; it's hard typing this up on my phone and I currently have no PC access.
Edit: I cannot whitelist my IP or get in contact with the site owner.
Answer 1:
By default, HtmlUnit throws an exception on a failing status code (which is not what real browsers do), and that is on purpose.
Anyhow, you can turn this off with webClient.getOptions().setThrowExceptionOnFailingStatusCode(false).
Also, you need to wait long enough for the background JavaScript to finish. Below is an example:
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    // Return the 503 challenge page instead of throwing an exception.
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    String url = "http://www.osbot.org/";
    HtmlPage htmlPage = webClient.getPage(url);
    // Give background JavaScript up to 10 seconds to finish.
    webClient.waitForBackgroundJavaScript(10_000);
    System.out.println(htmlPage.asText());
}
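As a variation on the fixed wait, the "wait long enough" idea can also be sketched as a polling loop that stops as soon as the status is no longer 503. This is only a minimal sketch using the Java standard library: the class name ChallengePoller, the method pollUntilNot503, and the canned status codes in main are hypothetical and are not part of HtmlUnit. In real use, the supplier would be whatever code re-requests the page and reads the response status.

```java
import java.util.function.IntSupplier;

/** Hypothetical helper for illustration; not part of HtmlUnit. */
public class ChallengePoller {

    /**
     * Calls fetchStatus until it returns something other than 503 or until
     * maxAttempts calls have been made. Returns the last status seen.
     */
    public static int pollUntilNot503(IntSupplier fetchStatus, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        int status = fetchStatus.getAsInt();
        for (int attempt = 1; attempt < maxAttempts && status == 503; attempt++) {
            try {
                // Pause between attempts so the check has time to complete.
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop polling early.
                Thread.currentThread().interrupt();
                break;
            }
            status = fetchStatus.getAsInt();
        }
        return status;
    }

    public static void main(String[] args) {
        // Stand-in for a real fetch (assumption): 503 twice, then 200.
        int[] canned = {503, 503, 200};
        int[] next = {0};
        IntSupplier fakeFetch = () -> canned[Math.min(next[0]++, canned.length - 1)];

        int status = pollUntilNot503(fakeFetch, 5, 10L);
        System.out.println("final status: " + status); // prints: final status: 200
    }
}
```

Capping the number of attempts keeps the loop from running forever if the page never stops answering with 503.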
Answer 2:
I know this question is quite old, but there is no correct answer yet. Here is what works for me:
WebClient client = new WebClient(BrowserVersion.CHROME);
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(true);
// Do not throw on the initial 503 challenge response.
client.getOptions().setThrowExceptionOnFailingStatusCode(false);
client.getOptions().setRedirectEnabled(true);
// Disable caching so the second request fetches the page again.
client.getCache().setMaxSize(0);
// Note: the two wait calls below run before any page is loaded,
// so they likely have no background JavaScript to wait for yet.
client.waitForBackgroundJavaScript(10000);
client.setJavaScriptTimeout(10000);
client.waitForBackgroundJavaScriptStartingBefore(10000);
try {
    String url = "https://www.badlion.net/";
    HtmlPage page = client.getPage(url);
    // Pause for up to 7 seconds while the Cloudflare check runs.
    synchronized (page) {
        page.wait(7000);
    }
    // Print cookies for test purposes. Comment out in production.
    URL _url = new URL(url);
    for (Cookie c : client.getCookies(_url)) {
        System.out.println(c.getName() + "=" + c.getValue());
    }
    // This prints the content after bypassing Cloudflare.
    System.out.println(client.getPage(url).getWebResponse().getContentAsString());
} catch (FailingHttpStatusCodeException e) {
    e.printStackTrace();
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
} catch (InterruptedException e) {
    e.printStackTrace();
}
Just replace the value of url ("https://www.badlion.net/") with the URL you are attempting to access.
Answer 3:
You should ask the site owner if they can whitelist your IPs. If you're doing anything like trying to scrape the site, then they may not want you to.
Source: https://stackoverflow.com/questions/32232259/accessing-webpage-with-cloudflare-protection