web-crawler

How to fix "HTTP error fetching URL. Status=500" in Java while crawling?

左心房为你撑大大i submitted on 2019-12-10 04:04:49
Question: I am trying to crawl users' ratings of cinema movies on IMDb from the review pages (my database contains about 600,000 movies). I used jsoup to parse the pages as below (sorry, I didn't include the whole code here since it is too long): try { //connecting to mysql db ResultSet res = st.executeQuery("SELECT id, title, production_year " + "FROM title " + "WHERE kind_id =1 " + "LIMIT 0 , 100000"); while (res.next()){ ....... ....... String baseUrl = "http://www.imdb.com/search/title
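A common cause of Status=500 (or 503) when crawling IMDb at this volume is a missing browser-like User-Agent combined with requesting too fast, so transient server errors are worth retrying with a delay. The question's code is Java/jsoup, where the equivalent knobs are .userAgent() and .timeout() on Jsoup.connect(); here is the retry-with-backoff idea as a minimal Python sketch (the URL, delays, and retry count are illustrative assumptions):

```python
import time
import requests

def fetch_with_retry(url, retries=3, backoff=2.0):
    """Fetch a page, retrying on 5xx responses with exponential backoff."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; my-crawler/1.0)"}
    for attempt in range(retries):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code < 500:            # success, or a client error not worth retrying
            resp.raise_for_status()
            return resp.text
        time.sleep(backoff * (2 ** attempt))  # back off before the next try
    raise RuntimeError("still failing after %d attempts: %s" % (retries, url))

html = fetch_with_retry("http://www.imdb.com/search/title")
```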

NodeJS Web Scraping - Form Submission [closed]

╄→гoц情女王★ submitted on 2019-12-10 00:40:31
Question: Closed. This question needs to be more focused. It is not currently accepting answers. Want to improve this question? Update the question so it focuses on one problem only by editing this post. Closed 3 years ago. I'm trying to use X-Ray to do the following; I'm not familiar with web scraping, and I'm looking for a technology that fits my use: browse to a page, locate a specific form in it, set some variables, and submit it. Then get the resulting page, and so on... What's the best NodeJS based
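For a flow like this, no browser engine is strictly required: fetch the page, locate the form, merge its hidden inputs with your own values, and POST to the form's action. X-Ray is aimed at extraction rather than form driving, so the pattern is sketched here in Python with requests and BeautifulSoup (the URL and field names are hypothetical); the same steps map onto any Node tool that can submit forms.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

session = requests.Session()
page = session.get("https://example.com/search")           # hypothetical target page
soup = BeautifulSoup(page.text, "html.parser")

form = soup.find("form")                                   # locate the specific form
fields = {inp["name"]: inp.get("value", "")                # keep hidden/default inputs
          for inp in form.find_all("input") if inp.get("name")}
fields["query"] = "some value"                             # set our own variables

action = urljoin(page.url, form.get("action", ""))
result = session.post(action, data=fields)                 # submit, get the next page
print(result.status_code, len(result.text))
```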

Symfony2 Crawler - Use UTF-8 with XPATH

筅森魡賤 submitted on 2019-12-09 19:30:53
Question: I am using the Symfony2 Crawler bundle for XPath. Everything works fine except the encoding: I would like to use UTF-8, and the Crawler is somehow not using it. I noticed it because the "&nbsp;" characters are converted to "Â ", which is a known issue: UTF-8 Encoding Issue. My question is: how can I force the Symfony Crawler to use UTF-8 encoding? Here is the code I am using: $dom_input = new \DOMDocument("1.0","UTF-8"); $dom_input->encoding = "UTF-8"; $dom_input->formatOutput = true; $dom_input-
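The stray "Â " is the classic symptom of UTF-8 bytes being decoded as Latin-1: a non-breaking space is the byte pair 0xC2 0xA0 in UTF-8, and a Latin-1 decoder reads that as two separate characters. A small Python demonstration of the mechanism:

```python
nbsp_utf8 = "\u00a0".encode("utf-8")   # a non-breaking space is b'\xc2\xa0' in UTF-8
print(nbsp_utf8.decode("latin-1"))     # 'Â\xa0' -> the stray "Â " seen in the output
print(nbsp_utf8.decode("utf-8"))       # '\xa0' -> one character, decoded correctly
```

On the PHP side, the commonly cited fixes are prepending an <?xml encoding="utf-8"?> declaration to the HTML before loadHTML(), or converting the input with mb_convert_encoding() first, so that DOMDocument stops assuming ISO-8859-1.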

Cannot navigate with casperjs evaluate and __doPostBack function

流过昼夜 submitted on 2019-12-09 19:14:24
Question: When I try to navigate pagination on sites where the link's href is a __doPostBack function call, I never get the page to change. I am not sure what I am missing, so after a few hours of messing around I decided to see if someone here can give me a clue. This is my code (uber-simplified to show the use case): var casper = require('casper').create({ verbose: true, logLevel: "debug" }); casper.start('http://www.gallito.com.uy/inmuebles/venta'); // here I simulate the click on a link in the
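A __doPostBack href is an ASP.NET WebForms postback: clicking it submits the page's single form with the hidden fields __EVENTTARGET and __EVENTARGUMENT filled in, alongside __VIEWSTATE and friends, and the response is the next page. Within CasperJS the usual pattern is to invoke __doPostBack inside evaluate() and then waitForSelector() for content that only exists after the postback, since it triggers a full navigation. The same postback can also be replayed without a browser; a hedged Python sketch (the control name is a placeholder):

```python
import requests
from bs4 import BeautifulSoup

session = requests.Session()
resp = session.get("http://www.gallito.com.uy/inmuebles/venta")
soup = BeautifulSoup(resp.text, "html.parser")

# Collect the WebForms hidden fields the server expects echoed back.
data = {inp["name"]: inp.get("value", "")
        for inp in soup.select("input[type=hidden]") if inp.get("name")}

# Simulate __doPostBack('<target>', '<argument>') for the "next page" link.
data["__EVENTTARGET"] = "ctl00$pager$next"   # hypothetical control name
data["__EVENTARGUMENT"] = ""

next_page = session.post(resp.url, data=data)
print(len(next_page.text))
```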

How to build a Python crawler for websites using oauth2

前提是你 submitted on 2019-12-09 18:27:52
Question: I'm new to web programming. I want to build a crawler in Python for crawling the social graph on Foursquare. I've got a "manually" controlled crawler using the apiv2 library. The main method looks like: def main(): CODE = "******" url = "https://foursquare.com/oauth2/authenticate?client_id=****&response_type=code&redirect_uri=****" key = "***" secret = "****" re_uri = "***" auth = apiv2.FSAuthenticator(key, secret, re_uri) auth.set_token(CODE) finder = apiv2.UserFinder(auth) #DO SOME REQUIRES
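For context, the OAuth2 authorization-code flow that the apiv2 helper wraps is just two HTTP steps: the user authorizes at the authenticate URL, and the returned code is exchanged for an access token that accompanies later API calls. A sketch of those steps with requests, based on Foursquare's documented v2 endpoints at the time (treat the exact parameter names as assumptions to verify against the docs):

```python
import requests

CLIENT_ID, CLIENT_SECRET, REDIRECT_URI = "***", "***", "***"
CODE = "***"   # the ?code=... value appended to the redirect_uri

# Step 2: exchange the one-time code for a reusable access token.
token_resp = requests.get("https://foursquare.com/oauth2/access_token", params={
    "client_id": CLIENT_ID,
    "client_secret": CLIENT_SECRET,
    "grant_type": "authorization_code",
    "redirect_uri": REDIRECT_URI,
    "code": CODE,
})
access_token = token_resp.json()["access_token"]

# Step 3: use the token on API calls; walking friends-of-friends from here
# is what turns this into a social-graph crawler.
me = requests.get("https://api.foursquare.com/v2/users/self",
                  params={"oauth_token": access_token, "v": "20120101"})
print(me.json())
```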

Splinter or Selenium: Can we get current html page after clicking a button?

萝らか妹 submitted on 2019-12-09 17:34:51
Question: I'm trying to crawl the website "http://everydayhealth.com". However, I found that the page is dynamically rendered: when I click the "More" button, new news items are shown. But clicking the button with splinter doesn't make browser.html automatically change to the current html content. Is there a way to get the newest html source, using either splinter or selenium? My code in splinter is as follows: import requests from bs4 import BeautifulSoup from splinter import
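Yes: in Selenium the live DOM is always available as driver.page_source, and splinter's browser.html is likewise read from the running browser, so the usual culprit is reading it before the AJAX content has arrived. Waiting explicitly for the new elements fixes that; a Selenium sketch in Python (the locators for everydayhealth.com are guesses):

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("http://everydayhealth.com")

# Click "More", then wait until extra items have actually been rendered.
more = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.LINK_TEXT, "More")))      # guessed locator
count_before = len(driver.find_elements(By.CSS_SELECTOR, ".news-item"))
more.click()
WebDriverWait(driver, 10).until(
    lambda d: len(d.find_elements(By.CSS_SELECTOR, ".news-item")) > count_before)

soup = BeautifulSoup(driver.page_source, "html.parser")      # now includes new items
driver.quit()
```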

What are the best prebuilt libraries for doing Web Crawling in Python [duplicate]

不打扰是莪最后的温柔 submitted on 2019-12-09 13:58:17
Question: This question already has answers here: Closed 10 years ago. I need to crawl, and store locally for future analysis, the contents of a finite list of websites. I basically want to slurp in all the pages and follow all internal links to get the entire publicly available site. Are there existing free libraries to get me there? I've seen Chilkat, but it's paid. I'm just looking for baseline functionality here. Thoughts? Suggestions? Exact duplicate: Anyone know of a good python based web crawler
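Scrapy is the usual prebuilt answer for Python, but the baseline behavior described here - fetch every page, follow internal links, store the HTML - is small enough to sketch directly with requests and BeautifulSoup:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_site(start_url, max_pages=500):
    """Breadth-first crawl of one site, keeping each page's HTML for later analysis."""
    domain = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                                  # skip unreachable pages
        pages[url] = resp.text
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)                        # follow internal links only
                queue.append(link)
    return pages
```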

Malicious crawler blocker for ASP.NET

喜夏-厌秋 submitted on 2019-12-09 12:41:49
Question: I have just stumbled upon Bad Behavior, a PHP plugin that promises to detect spam and malicious crawlers and prevent them from accessing the site at all. Does something similar exist for ASP.NET and ASP.NET MVC? I am interested in blocking access to the site altogether, not in detecting spam after it was posted. EDIT: I am interested specifically in solutions that detect access patterns to the site - these would prevent screen scraping the site as a whole, or at least make it a
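The access-pattern approach behind tools like Bad Behavior largely boils down to rate analysis: a client requesting far more pages per minute than a person could is flagged and blocked. In ASP.NET this logic would typically live in an IHttpModule or an MVC action filter; the sliding-window check itself is language-neutral and is sketched below in Python for brevity (the thresholds are arbitrary):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120             # ~2 pages/second sustained is already suspicious

_history = defaultdict(deque)  # client IP -> timestamps of recent requests

def is_suspicious(ip):
    """Return True if this IP exceeded the request budget for the window."""
    now = time.time()
    hits = _history[ip]
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()         # drop requests that fell out of the window
    return len(hits) > MAX_REQUESTS
```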

Exclude bots and spiders from a View counter in PHP

元气小坏坏 submitted on 2019-12-09 10:49:14
Question: I have built a pretty basic advertisement manager for a website in PHP. I say basic because it's not complex like Google or Facebook ads or even most high-end ad servers: it doesn't handle payments or target users. It serves the purpose for my low-traffic site, though, of simply showing a random banner ad and counting impression views and clicks. Features: ad slot/position on page, banner image, name, view/impression counter, click counter, start and end date (or never ending), disable/enable
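The cheapest filter is to skip counting whenever the User-Agent matches a known-bot pattern; it won't catch bots that lie about themselves, but it removes Google, Bing, Facebook and friends from the numbers. The site is PHP, but the check is a single regex in any language; a Python sketch (the pattern list is a starting point, not exhaustive):

```python
import re

# Substrings that appear in the User-Agent strings of common crawlers.
BOT_PATTERN = re.compile(
    r"bot|crawl|spider|slurp|facebookexternalhit|mediapartners", re.IGNORECASE)

def should_count_view(user_agent):
    """Count the impression only for clients that do not look like bots."""
    if not user_agent:            # many bots send no User-Agent at all
        return False
    return not BOT_PATTERN.search(user_agent)
```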

Why is Facebook flooding my site?

时光毁灭记忆、已成空白 submitted on 2019-12-09 09:42:30
Question: Every hour and a half I'm getting a flood of requests from http://www.facebook.com/externalhit_uatext.php. I know what these requests are supposed to mean, but this behavior is very odd. On a regular basis (approximately every 1.5 hours), I'm getting dozens of these requests per minute to very old posts on my site - and this is giving me a headache, since they are not cached... Does anyone know what this could be? In what cases does Facebook do this? Leo. Log sample: 66.220.158.251, 200.147.35.64 (5715) - -
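facebookexternalhit is Facebook's link-preview scraper, and bursts like this typically happen when posts are shared again or Facebook refreshes its preview cache. Since the bot identifies itself in the User-Agent, one mitigation is to serve it a cached copy of old posts instead of re-rendering them; a hedged Python sketch of that idea (the cache lifetime is arbitrary):

```python
import time

CACHE = {}        # url -> (timestamp, html)
FRESH_FOR = 300   # seconds a cached copy stays good enough for link previews

def serve(url, user_agent, render):
    """Give facebookexternalhit a cached copy instead of re-rendering old posts."""
    is_facebook = user_agent and "facebookexternalhit" in user_agent
    if is_facebook and url in CACHE:
        ts, html = CACHE[url]
        if time.time() - ts < FRESH_FOR:
            return html                # cheap answer during a burst
    html = render(url)                 # the expensive path the flood is hitting
    CACHE[url] = (time.time(), html)
    return html
```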