org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503 (google scholar ban?)

别等时光非礼了梦想. Submitted on 2020-01-02 13:56:13

Question


I am working on a crawler and I have to extract data from 200-300 links on Google Scholar. I have a working parser that gets data from the result pages (each page lists 1-10 people profiles as the result of my query; I extract the proper links, go to the next page, and do it again). While running my program I hit the error above:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=https://ipv4.google.com/sorry/IndexRedirect?continue=https://scholar.google.pl/citations%3Fmauthors%3DAGH%2BUniversity%2Bof%2BScience%2Band%2BTechnology%26hl%3Dpl%26view_op%3Dsearch_authors&q=CGMSBFMKrI0YiJHfqgUiGQDxp4NLfGBv6zgPSjfyQ9LBi5F-K1EbGwQ
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:537)

I know this comes from Google's basic protection against robots. How can I improve my connection

    Connection connection = 
             Jsoup.connect(url)
              .userAgent("Mozilla/5.0")
              .timeout(10000)
              .followRedirects(true);

so that I don't get temporarily banned? I know there is a way to check the response, like this:

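    Connection.Response response =
             Jsoup.connect(url)
              .userAgent("Mozilla/5.0")
              .timeout(10000)
              // without this, execute() throws HttpStatusException on a 503
              // and the status check below is never reached
              .ignoreHttpErrors(true)
              .execute();

    int statusCode = response.statusCode();
    if (statusCode == 200) {
        // ...
    } else if (statusCode == 503) {
        // do reconnect magic
    }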

But what should I do when I get a 503 error? Do I have to use a proxy? A random wait time between connections? I hope there is a better idea than saving my results to a file, doing a manual hard restart of the router, and trying again with a new IP :P


Answer 1:


You have already provided your own answers...

Do I have to use a proxy?

Of course. You should already have set up a bunch of proxies for your crawling activity.
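
As a rough sketch of what "a bunch of proxies" could look like in code, here is a small round-robin rotator built only on the JDK. The class name `ProxyRotator`, the helper `httpProxy`, and the idea of cycling through the list in order are my own assumptions, not something the question prescribes; any selection strategy would do.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: hands out proxies from a fixed list in round-robin order.
class ProxyRotator {
    private final List<Proxy> proxies;
    private final AtomicInteger counter = new AtomicInteger();

    ProxyRotator(List<Proxy> proxies) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        // Defensive copy so later changes to the caller's list do not affect us.
        this.proxies = new ArrayList<>(proxies);
    }

    // Builds an HTTP proxy without resolving the host name up front.
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    // Returns the next proxy; floorMod keeps the index valid even if the counter overflows.
    Proxy next() {
        int index = Math.floorMod(counter.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

With a reasonably recent jsoup, the result of `rotator.next()` should be usable as `Jsoup.connect(url).proxy(rotator.next())`, so each request goes out through a different address. The proxy hosts themselves are something you have to supply.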

A random wait time between connections?

Yes. Use some random wait between 3000 and 5000 ms.
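
A minimal sketch of that random wait, combined with a retry when a 503 comes back. The names `PoliteFetcher`, `randomDelayMillis`, and `fetchWithRetry`, the `maxAttempts` parameter, and the choice to wait longer on each retry are assumptions on my part; the only figure taken from above is the 3000-5000 ms range. The request is passed in as an `IntSupplier` returning the HTTP status code, so the helper does not depend on any particular HTTP library.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.IntSupplier;

// Hypothetical helper: random pauses plus a simple retry loop for 503 responses.
class PoliteFetcher {

    // Returns a random delay in [minMs, maxMs], both ends inclusive.
    static long randomDelayMillis(long minMs, long maxMs) {
        return ThreadLocalRandom.current().nextLong(minMs, maxMs + 1);
    }

    // Runs the request until it returns something other than 503 or the
    // attempts run out. Returns the last status code seen.
    static int fetchWithRetry(IntSupplier request, int maxAttempts, long minMs, long maxMs) {
        int status = -1;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            status = request.getAsInt();
            if (status != 503 || attempt == maxAttempts) {
                return status;
            }
            try {
                // Assumption: wait a bit longer after every failed attempt.
                Thread.sleep(attempt * randomDelayMillis(minMs, maxMs));
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up early.
                Thread.currentThread().interrupt();
                return status;
            }
        }
        return status;
    }
}
```

With jsoup, the supplier could wrap a call such as `Jsoup.connect(url).userAgent("Mozilla/5.0").timeout(10000).ignoreHttpErrors(true).execute().statusCode()`, converting the checked `IOException` into an unchecked one. Calling `fetchWithRetry(request, 5, 3000, 5000)` would then pause between 3 and 5 seconds after the first 503 and progressively longer after each further one.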

Alternatively, you could use an online captcha-solving service when you hit the URL https://ipv4.google.com/sorry/IndexRedirect.... Don't hit it too often or you'll get banned.
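
To know when you have landed on that page, one option is to look at the path of the final URL. The sketch below assumes that a path starting with `/sorry/` (as in the URL from the error message in the question) is a good enough signal; `SorryPageDetector` and `isSorryPage` are names I made up, and Google may of course change this URL.

```java
import java.net.URI;

// Hypothetical helper: recognizes the "sorry" page by its URL path.
class SorryPageDetector {

    // True when the URL path starts with /sorry/; false for anything else,
    // including strings that are not valid URIs.
    static boolean isSorryPage(String url) {
        try {
            String path = URI.create(url).getPath();
            return path != null && path.startsWith("/sorry/");
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

With jsoup, the URL to test would typically be `response.url().toString()` after redirects have been followed.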

Happy coding :)



Source: https://stackoverflow.com/questions/30281650/org-jsoup-httpstatusexception-http-error-fetching-url-status-503-google-schol
