web-crawler

How to fix "HTTP error fetching URL. Status=500" in Java while crawling?

左心房为你撑大大i submitted on 2019-12-10 04:04:49
Question: I am trying to crawl users' ratings of cinema movies on IMDb from the review pages (my database contains about 600,000 movies). I used jsoup to parse the pages as below (sorry, I didn't include the whole code here since it is too long): try { //connecting to mysql db ResultSet res = st.executeQuery("SELECT id, title, production_year " + "FROM title " + "WHERE kind_id =1 " + "LIMIT 0 , 100000"); while (res.next()){ ....... ....... String baseUrl = "http://www.imdb.com/search/title
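A common cause of Status=500 (or 503) when crawling IMDb at this volume is a missing browser-like User-Agent combined with requesting too fast, so transient server errors are worth retrying with a delay. The question's code is Java/jsoup, where the equivalent knobs are .userAgent() and .timeout() on Jsoup.connect(); here is the retry-with-backoff idea as a minimal Python sketch (the URL, delays, and retry count are illustrative assumptions):

```python
import time
import requests

def fetch_with_retry(url, retries=3, backoff=2.0):
    """Fetch a page, retrying on 5xx responses with exponential backoff."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; my-crawler/1.0)"}
    for attempt in range(retries):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code < 500:            # success, or a client error not worth retrying
            resp.raise_for_status()
            return resp.text
        time.sleep(backoff * (2 ** attempt))  # back off before the next try
    raise RuntimeError("still failing after %d attempts: %s" % (retries, url))

html = fetch_with_retry("http://www.imdb.com/search/title")
```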

NodeJS Web Scraping - Form Submission [closed]

╄→гoц情女王★ submitted on 2019-12-10 00:40:31
Question: Closed. This question needs to be more focused. It is not currently accepting answers. Want to improve this question? Update the question so it focuses on one problem only by editing this post. Closed 3 years ago. I'm trying to use X-Ray to do the following; I'm not familiar with web scraping, and I'm looking for a technology that fits my use: browse to a page, locate a specific form in it, set some variables, and submit it. Then get the resulting page, and so on... What's the best NodeJS based
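For a flow like this, no browser engine is strictly required: fetch the page, locate the form, merge its hidden inputs with your own values, and POST to the form's action. X-Ray is aimed at extraction rather than form driving, so the pattern is sketched here in Python with requests and BeautifulSoup (the URL and field names are hypothetical); the same steps map onto any Node tool that can submit forms.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

session = requests.Session()
page = session.get("https://example.com/search")           # hypothetical target page
soup = BeautifulSoup(page.text, "html.parser")

form = soup.find("form")                                   # locate the specific form
fields = {inp["name"]: inp.get("value", "")                # keep hidden/default inputs
          for inp in form.find_all("input") if inp.get("name")}
fields["query"] = "some value"                             # set our own variables

action = urljoin(page.url, form.get("action", ""))
result = session.post(action, data=fields)                 # submit, get the next page
print(result.status_code, len(result.text))
```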

Symfony2 Crawler - Use UTF-8 with XPATH

筅森魡賤 submitted on 2019-12-09 19:30:53
Question: I am using the Symfony2 Crawler bundle for XPath. Everything works fine except the encoding: I would like to use UTF-8, and the Crawler is somehow not using it. I noticed it because the "&nbsp;" characters are converted to "Â ", which is a known issue: UTF-8 Encoding Issue. My question is: how can I force the Symfony Crawler to use UTF-8 encoding? Here is the code I am using: $dom_input = new \DOMDocument("1.0","UTF-8"); $dom_input->encoding = "UTF-8"; $dom_input->formatOutput = true; $dom_input-
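The stray "Â " is the classic symptom of UTF-8 bytes being decoded as Latin-1: a non-breaking space is the byte pair 0xC2 0xA0 in UTF-8, and a Latin-1 decoder reads that as two separate characters. A small Python demonstration of the mechanism:

```python
nbsp_utf8 = "\u00a0".encode("utf-8")   # a non-breaking space is b'\xc2\xa0' in UTF-8
print(nbsp_utf8.decode("latin-1"))     # 'Â\xa0' -> the stray "Â " seen in the output
print(nbsp_utf8.decode("utf-8"))       # '\xa0' -> one character, decoded correctly
```

On the PHP side, the commonly cited fixes are prepending an <?xml encoding="utf-8"?> declaration to the HTML before loadHTML(), or converting the input with mb_convert_encoding() first, so that DOMDocument stops assuming ISO-8859-1.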

Cannot navigate with casperjs evaluate and __doPostBack function

流过昼夜 submitted on 2019-12-09 19:14:24
Question: When I try to navigate pagination on sites where the link's href is a __doPostBack function call, I never get the page to change. I am not sure what I am missing, so after a few hours of messing around I decided to see if someone here can give me a clue. This is my code (uber-simplified to show the use case): var casper = require('casper').create({ verbose: true, logLevel: "debug" }); casper.start('http://www.gallito.com.uy/inmuebles/venta'); // here I simulate the click on a link in the
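A __doPostBack href is an ASP.NET WebForms postback: clicking it submits the page's single form with the hidden fields __EVENTTARGET and __EVENTARGUMENT filled in, alongside __VIEWSTATE and friends, and the response is the next page. Within CasperJS the usual pattern is to invoke __doPostBack inside evaluate() and then waitForSelector() for content that only exists after the postback, since it triggers a full navigation. The same postback can also be replayed without a browser; a hedged Python sketch (the control name is a placeholder):

```python
import requests
from bs4 import BeautifulSoup

session = requests.Session()
resp = session.get("http://www.gallito.com.uy/inmuebles/venta")
soup = BeautifulSoup(resp.text, "html.parser")

# Collect the WebForms hidden fields the server expects echoed back.
data = {inp["name"]: inp.get("value", "")
        for inp in soup.select("input[type=hidden]") if inp.get("name")}

# Simulate __doPostBack('<target>', '<argument>') for the "next page" link.
data["__EVENTTARGET"] = "ctl00$pager$next"   # hypothetical control name
data["__EVENTARGUMENT"] = ""

next_page = session.post(resp.url, data=data)
print(len(next_page.text))
```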

How to build a Python crawler for websites using oauth2

前提是你 submitted on 2019-12-09 18:27:52
Question: I'm new to web programming. I want to build a crawler in Python for crawling the social graph on Foursquare. I've got a "manually" controlled crawler using the apiv2 library. The main method looks like: def main(): CODE = "******" url = "https://foursquare.com/oauth2/authenticate?client_id=****&response_type=code&redirect_uri=****" key = "***" secret = "****" re_uri = "***" auth = apiv2.FSAuthenticator(key, secret, re_uri) auth.set_token(CODE) finder = apiv2.UserFinder(auth) #DO SOME REQUIRES
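For context, the OAuth2 authorization-code flow that the apiv2 helper wraps is just two HTTP steps: the user authorizes at the authenticate URL, and the returned code is exchanged for an access token that accompanies later API calls. A sketch of those steps with requests, based on Foursquare's documented v2 endpoints at the time (treat the exact parameter names as assumptions to verify against the docs):

```python
import requests

CLIENT_ID, CLIENT_SECRET, REDIRECT_URI = "***", "***", "***"
CODE = "***"   # the ?code=... value appended to the redirect_uri

# Step 2: exchange the one-time code for a reusable access token.
token_resp = requests.get("https://foursquare.com/oauth2/access_token", params={
    "client_id": CLIENT_ID,
    "client_secret": CLIENT_SECRET,
    "grant_type": "authorization_code",
    "redirect_uri": REDIRECT_URI,
    "code": CODE,
})
access_token = token_resp.json()["access_token"]

# Step 3: use the token on API calls; walking friends-of-friends from here
# is what turns this into a social-graph crawler.
me = requests.get("https://api.foursquare.com/v2/users/self",
                  params={"oauth_token": access_token, "v": "20120101"})
print(me.json())
```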

Splinter or Selenium: Can we get current html page after clicking a button?

萝らか妹 submitted on 2019-12-09 17:34:51
Question: I'm trying to crawl the website "http://everydayhealth.com". However, I found that the page is dynamically rendered: when I click the "More" button, new news items are shown. But clicking the button with splinter doesn't make browser.html automatically change to the current html content. Is there a way to get the newest html source, using either splinter or selenium? My code in splinter is as follows: import requests from bs4 import BeautifulSoup from splinter import
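Yes: in Selenium the live DOM is always available as driver.page_source, and splinter's browser.html is likewise read from the running browser, so the usual culprit is reading it before the AJAX content has arrived. Waiting explicitly for the new elements fixes that; a Selenium sketch in Python (the locators for everydayhealth.com are guesses):

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("http://everydayhealth.com")

# Click "More", then wait until extra items have actually been rendered.
more = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.LINK_TEXT, "More")))      # guessed locator
count_before = len(driver.find_elements(By.CSS_SELECTOR, ".news-item"))
more.click()
WebDriverWait(driver, 10).until(
    lambda d: len(d.find_elements(By.CSS_SELECTOR, ".news-item")) > count_before)

soup = BeautifulSoup(driver.page_source, "html.parser")      # now includes new items
driver.quit()
```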

What are the best prebuilt libraries for doing Web Crawling in Python [duplicate]

不打扰是莪最后的温柔 submitted on 2019-12-09 13:58:17
Question: This question already has answers here: Closed 10 years ago. I need to crawl, and store locally for future analysis, the contents of a finite list of websites. I basically want to slurp in all the pages and follow all internal links to get the entire publicly available site. Are there existing free libraries to get me there? I've seen Chilkat, but it's paid. I'm just looking for baseline functionality here. Thoughts? Suggestions? Exact duplicate: Anyone know of a good python based web crawler
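Scrapy is the usual prebuilt answer for Python, but the baseline behavior described here - fetch every page, follow internal links, store the HTML - is small enough to sketch directly with requests and BeautifulSoup:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_site(start_url, max_pages=500):
    """Breadth-first crawl of one site, keeping each page's HTML for later analysis."""
    domain = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                                  # skip unreachable pages
        pages[url] = resp.text
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)                        # follow internal links only
                queue.append(link)
    return pages
```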

Malicious crawler blocker for ASP.NET

喜夏-厌秋 submitted on 2019-12-09 12:41:49
Question: I have just stumbled upon Bad Behavior, a PHP plugin that promises to detect spam and malicious crawlers and prevent them from accessing the site at all. Does something similar exist for ASP.NET and ASP.NET MVC? I am interested in blocking access to the site altogether, not in detecting spam after it was posted. EDIT: I am interested specifically in solutions that detect access patterns to the site - these would prevent screen scraping the site as a whole, or at least make it a
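The access-pattern approach behind tools like Bad Behavior largely boils down to rate analysis: a client requesting far more pages per minute than a person could is flagged and blocked. In ASP.NET this logic would typically live in an IHttpModule or an MVC action filter; the sliding-window check itself is language-neutral and is sketched below in Python for brevity (the thresholds are arbitrary):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120             # ~2 pages/second sustained is already suspicious

_history = defaultdict(deque)  # client IP -> timestamps of recent requests

def is_suspicious(ip):
    """Return True if this IP exceeded the request budget for the window."""
    now = time.time()
    hits = _history[ip]
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()         # drop requests that fell out of the window
    return len(hits) > MAX_REQUESTS
```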

Exclude bots and spiders from a View counter in PHP

元气小坏坏 submitted on 2019-12-09 10:49:14
Question: I have built a pretty basic advertisement manager for a website in PHP. I say basic because it's not complex like Google or Facebook ads or even most high-end ad servers: it doesn't handle payments or target users. It serves the purpose for my low-traffic site, though, of simply showing a random banner ad and counting impression views and clicks. Features: ad slot/position on page, banner image, name, view/impression counter, click counter, start and end date (or never ending), disable/enable
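The cheapest filter is to skip counting whenever the User-Agent matches a known-bot pattern; it won't catch bots that lie about themselves, but it removes Google, Bing, Facebook and friends from the numbers. The site is PHP, but the check is a single regex in any language; a Python sketch (the pattern list is a starting point, not exhaustive):

```python
import re

# Substrings that appear in the User-Agent strings of common crawlers.
BOT_PATTERN = re.compile(
    r"bot|crawl|spider|slurp|facebookexternalhit|mediapartners", re.IGNORECASE)

def should_count_view(user_agent):
    """Count the impression only for clients that do not look like bots."""
    if not user_agent:            # many bots send no User-Agent at all
        return False
    return not BOT_PATTERN.search(user_agent)
```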

Why is Facebook flooding my site?

时光毁灭记忆、已成空白 submitted on 2019-12-09 09:42:30
Question: Every hour and a half I'm getting a flood of requests from http://www.facebook.com/externalhit_uatext.php. I know what these requests are supposed to mean, but this behavior is very odd. On a regular basis (approximately every 1.5 hours), I'm getting dozens of these requests per minute to very old posts on my site - and this is giving me a headache, since they are not cached... Does anyone know what this could be? In what cases does Facebook do this? Leo. Log sample: 66.220.158.251, 200.147.35.64 (5715) - -
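facebookexternalhit is Facebook's link-preview scraper, and bursts like this typically happen when posts are shared again or Facebook refreshes its preview cache. Since the bot identifies itself in the User-Agent, one mitigation is to serve it a cached copy of old posts instead of re-rendering them; a hedged Python sketch of that idea (the cache lifetime is arbitrary):

```python
import time

CACHE = {}        # url -> (timestamp, html)
FRESH_FOR = 300   # seconds a cached copy stays good enough for link previews

def serve(url, user_agent, render):
    """Give facebookexternalhit a cached copy instead of re-rendering old posts."""
    is_facebook = user_agent and "facebookexternalhit" in user_agent
    if is_facebook and url in CACHE:
        ts, html = CACHE[url]
        if time.time() - ts < FRESH_FOR:
            return html                # cheap answer during a burst
    html = render(url)                 # the expensive path the flood is hitting
    CACHE[url] = (time.time(), html)
    return html
```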