web-crawler

Facebook fanpage crawler

元气小坏坏 Submitted on 2019-12-06 20:53:35
I would like to write a Facebook fan page crawler which crawls the following information: 1) fan page name, 2) fan count, 3) feeds. I know I can use the Open Graph API to get this, but I want to write a script that will run once a day, fetch all this data, and dump it into my SQL DB. Is there a better way to do this? Any help is appreciated.

I think it's against the Facebook TOS. Not long ago I read a blog post where the writer created some kind of spider to collect data about Facebook pages, users, etc., and he got a call from Facebook's lawyers.

Source: https://stackoverflow.com/questions/3519600/facebook
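Since the question already points at the Open Graph API, a minimal sketch of that route is below; it is an illustration only, not the asker's script. The Graph API version in the URL, the page ID, the token, and the use of SQLite in place of the asker's SQL database are all assumptions, and which fields (name, fan_count, feed) a token can read depends on its permissions.

    # A rough daily-dump sketch (not the asker's code): fetch page name, fan
    # count and a slice of the feed from the Graph API and store a row in
    # SQLite. PAGE_ID, ACCESS_TOKEN and the API version are placeholders.
    import json
    import sqlite3
    import requests

    PAGE_ID = "your_page_id"          # placeholder
    ACCESS_TOKEN = "your_page_token"  # placeholder

    def fetch_page():
        url = f"https://graph.facebook.com/v19.0/{PAGE_ID}"  # version is an assumption
        params = {
            "fields": "name,fan_count,feed{message,created_time}",
            "access_token": ACCESS_TOKEN,
        }
        resp = requests.get(url, params=params, timeout=30)
        resp.raise_for_status()
        return resp.json()

    def save(data):
        conn = sqlite3.connect("fanpages.db")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pages "
            "(name TEXT, fan_count INTEGER, feed_json TEXT, "
            "fetched_at TEXT DEFAULT CURRENT_TIMESTAMP)"
        )
        conn.execute(
            "INSERT INTO pages (name, fan_count, feed_json) VALUES (?, ?, ?)",
            (data.get("name"), data.get("fan_count"), json.dumps(data.get("feed"))),
        )
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        save(fetch_page())

Running this once a day is then just a cron entry (for example 0 3 * * * python fetch_fanpage.py, the script name being hypothetical), which also stays closer to the TOS concern raised in the answer than scraping the HTML would.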

Robots.txt - What is the proper format for a Crawl Delay for multiple user agents?

烂漫一生 Submitted on 2019-12-06 18:44:45
Question: Below is a sample robots.txt file to allow multiple user agents with multiple crawl delays for each user agent. The Crawl-delay values are for illustration purposes and will be different in a real robots.txt file. I have searched all over the web for proper answers but could not find one. There are too many mixed suggestions and I do not know which is the correct / proper method. Questions: (1) Can each user agent have its own crawl-delay? (I assume yes) (2) Where do you put the crawl-delay
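For what it's worth, the layout usually suggested (bearing in mind that Crawl-delay is a non-standard directive which some crawlers, Googlebot included, simply ignore) puts one Crawl-delay inside each user agent's own record; the bot names and values below are placeholders:

    User-agent: bingbot
    Crawl-delay: 5

    User-agent: Yandex
    Crawl-delay: 10

    User-agent: *
    Disallow:

A crawler that honors the directive reads only the record matching its own user agent, so each agent effectively gets its own delay.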

Is there a better approach to use BeautifulSoup in my python web crawler codes?

蓝咒 Submitted on 2019-12-06 16:43:53
I'm trying to crawl information from URLs in a page and save it in a text file. I received great help in the question "How to get the right source code with Python from the URLs using my web crawler?" and I tried to use what I learned about BeautifulSoup to finish my code based on that question. But when I look at my code, although it satisfies my needs, it looks pretty messy. Can anyone help me optimize it a little, especially the BeautifulSoup part, such as the infoLists part and the saveInfo part? Thanks! Here is my code:

    import requests
    from bs4 import
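Since the original code is cut off above, the sketch below only shows the general shape usually recommended for this kind of task: one small function that parses with BeautifulSoup and one that writes the text file. The URL and the CSS selector are placeholders, not the asker's actual site.

    # A generic sketch, not the asker's code: pull links from a listing page
    # and write one tab-separated line per item to a text file. The URL and
    # the "a.item-link" selector are placeholders.
    import requests
    from bs4 import BeautifulSoup

    def parse_page(url):
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.select("a.item-link"):          # placeholder selector
            yield {"title": a.get_text(strip=True), "href": a.get("href")}

    def save_info(items, path="info.txt"):
        with open(path, "w", encoding="utf-8") as f:
            for item in items:
                f.write(f"{item['title']}\t{item['href']}\n")

    if __name__ == "__main__":
        save_info(parse_page("https://example.org/list"))  # placeholder URL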

Importing URLs for JSOUP to Scrape via Spreadsheet

早过忘川 Submitted on 2019-12-06 16:17:48
Question: I finally got IntelliJ to work. I'm using the code below. It works perfectly. I need it to loop over and over, pulling links from a spreadsheet to find the price of different items. I have a spreadsheet with a few sample URLs located in column C, starting at row 2. How can I have JSOUP use the URLs in this spreadsheet and then output to column D?

    public class Scraper {
        public static void main(String[] args) throws Exception {
            final Document document = Jsoup.connect("examplesite

Can't run Scrapy program

萝らか妹 Submitted on 2019-12-06 15:45:21
I have been learning how to work with Scrapy from the following link: http://doc.scrapy.org/en/master/intro/tutorial.html When I try to run the code written in the Crawling (scrapy crawl dmoz) section, I get the following error: AttributeError: 'module' object has no attribute 'Spider'. However, when I changed "Spider" to "spider" I got nothing but a new error: TypeError: Error when calling the metaclass bases module.__init__() takes at most 2 arguments (3 given). I'm so confused, what is the problem? Any help would be highly appreciated. Thanks. By the way, I am using Windows. EDIT(source
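That AttributeError usually means the installed Scrapy is older than the documentation being followed: scrapy.Spider only exists in newer releases, while older ones expose BaseSpider under scrapy.spider instead. Upgrading Scrapy (pip install --upgrade scrapy) is the usual fix rather than renaming attributes. For reference, a minimal spider in the style of the linked tutorial, assuming a reasonably recent Scrapy, looks like this:

    # Minimal tutorial-style spider; the dmoz name and URL mirror the tutorial
    # the question links to. Requires a Scrapy version recent enough to
    # provide scrapy.Spider.
    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        ]

        def parse(self, response):
            # Save each downloaded page under the last URL segment.
            filename = response.url.split("/")[-2] + ".html"
            with open(filename, "wb") as f:
                f.write(response.body)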

Scrapy celery and multiple spiders

我怕爱的太早我们不能终老 Submitted on 2019-12-06 15:41:10
I'm using Scrapy and I'm trying to use Celery to manage multiple spiders on one machine. The problem I have (a bit difficult to explain) is that the spiders get multiplied: if my first spider starts and I start a second spider, the first spider executes twice. See my code here (ProcessJob.py):

    class ProcessJob():
        def processJob(self, job):
            # update job
            mysql = MysqlConnector.Mysql()
            db = mysql.getConnection()
            cur = db.cursor()
            job.status = 1
            update = "UPDATE job SET status=1 WHERE id=" + str(job.id)
            cur.execute(update)
            db.commit()
            db.close()
            # Start new crawler
            configure_logging()
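Not necessarily the asker's eventual fix, but a common way to stop crawlers from leaking state between Celery jobs is to give every crawl its own operating-system process, for example by shelling out to scrapy crawl from the task instead of starting a crawler inside the worker. The broker URL, task name, spider argument, and project path below are placeholders:

    # A sketch of the per-process workaround: each Celery task launches
    # "scrapy crawl" in a fresh interpreter, so Twisted's reactor and the
    # crawler settings are never shared between jobs. All names and paths
    # are placeholders.
    import subprocess
    from celery import Celery

    app = Celery("crawler_tasks", broker="redis://localhost:6379/0")  # placeholder broker

    @app.task
    def run_spider(spider_name, job_id):
        result = subprocess.run(
            ["scrapy", "crawl", spider_name, "-a", f"job_id={job_id}"],
            cwd="/path/to/scrapy/project",   # placeholder project directory
            capture_output=True,
            text=True,
        )
        return result.returncode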

Scrapy CLOSESPIDER_PAGECOUNT setting doesn't work as it should

拈花ヽ惹草 Submitted on 2019-12-06 14:50:24
I use Scrapy 1.0.3 and can't figure out how the CLOSESPIDER extension works. For the command: scrapy crawl domain_links --set=CLOSESPIDER_PAGECOUNT=1 there is correctly one request, but for a page count of two: scrapy crawl domain_links --set CLOSESPIDER_PAGECOUNT=2 there is a seemingly endless stream of requests. So please explain to me how this works with a simple example. This is my spider code:

    class DomainLinksSpider(CrawlSpider):
        name = "domain_links"
        # allowed_domains = ["www.example.org"]
        start_urls = ["www.example.org/"]
        rules = (
            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(LinkExtractor
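As far as the behavior goes, CLOSESPIDER_PAGECOUNT is a soft limit: the closespider extension asks the engine to close the spider once that many responses have been downloaded, but requests that are already scheduled or in flight still complete, so a CrawlSpider that extracts many links per page can keep issuing requests well past the limit. A small self-contained illustration (placeholder URL, not the asker's spider) that keeps the overshoot visible but bounded:

    # Illustration of CLOSESPIDER_PAGECOUNT as a soft limit; lowering
    # CONCURRENT_REQUESTS reduces how many extra requests slip through
    # after the close signal. The start URL is a placeholder.
    import scrapy
    from scrapy.crawler import CrawlerProcess

    class DemoSpider(scrapy.Spider):
        name = "demo"
        start_urls = ["https://example.org/"]   # placeholder

        custom_settings = {
            "CLOSESPIDER_PAGECOUNT": 2,   # ask to close after 2 downloaded responses
            "CONCURRENT_REQUESTS": 1,     # keep few requests in flight at once
        }

        def parse(self, response):
            for href in response.css("a::attr(href)").extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)

    if __name__ == "__main__":
        process = CrawlerProcess()
        process.crawl(DemoSpider)
        process.start()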

Any idea on how to scrape pages which are behind __doPostBack('…');?

一世执手 Submitted on 2019-12-06 11:15:33
Question: I am working on a PHP-based scraper/crawler, which works fine until it gets a .NET-generated href link, __doPostBack(...). Any idea how to deal with this and crawl the pages behind those links?
Answer 1: Instead of trying to automate clicking the JavaScript button, which requires additional libraries in PHP, try replicating the request your browser sends after clicking the button. There are various Firefox extensions that will help you examine the request, such as TamperData, Firebug, and LiveHttp
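The answer's idea of replaying the request can be sketched without any browser automation; the sketch below happens to use Python and BeautifulSoup for brevity, but the same POST can be sent from PHP with cURL. The hidden fields __EVENTTARGET, __EVENTARGUMENT, __VIEWSTATE and __EVENTVALIDATION are the standard WebForms postback fields; the URL and the control name are placeholders.

    # Sketch of replaying a __doPostBack request: read the hidden WebForms
    # fields from the page, then POST them back with the control name that
    # __doPostBack would have used. URL and control name are placeholders.
    import requests
    from bs4 import BeautifulSoup

    URL = "https://example.org/listing.aspx"   # placeholder page

    def hidden(soup, name):
        tag = soup.find("input", {"name": name})
        return tag["value"] if tag else ""

    with requests.Session() as session:
        soup = BeautifulSoup(session.get(URL, timeout=30).text, "html.parser")
        payload = {
            "__EVENTTARGET": "ctl00$MainContent$lnkNext",  # placeholder control name
            "__EVENTARGUMENT": "",
            "__VIEWSTATE": hidden(soup, "__VIEWSTATE"),
            "__EVENTVALIDATION": hidden(soup, "__EVENTVALIDATION"),
        }
        next_page = session.post(URL, data=payload, timeout=30)
        print(next_page.status_code, len(next_page.text))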

Is this Anti-Scraping technique viable with Robots.txt Crawl-Delay?

百般思念 Submitted on 2019-12-06 11:14:15
Question: I want to prevent web scrapers from aggressively scraping 1,000,000 pages on my website. I'd like to do this by returning a "503 Service Unavailable" HTTP error code to bots that access an abnormal number of pages per minute. I'm not having trouble with form spammers, just with scrapers. I don't want search engine spiders to ever receive the error. My inclination is to set a robots.txt crawl-delay which will ensure spiders access a number of pages per minute under my 503 threshold. Is this an
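Independent of the robots.txt half of the question, the 503 idea itself can be sketched as a per-IP sliding-window counter; the framework (Flask), the threshold, and the window size below are all assumptions, and a real deployment would also need to whitelist verified search-engine crawlers (for example via reverse DNS) so they never see the 503.

    # Rough sketch of the 503 throttle described in the question: count
    # requests per client IP over a 60-second window and return
    # 503 Service Unavailable with Retry-After once a placeholder threshold
    # is exceeded. Search-engine whitelisting is not included here.
    import time
    from collections import defaultdict, deque

    from flask import Flask, request

    app = Flask(__name__)
    WINDOW = 60        # seconds
    THRESHOLD = 120    # requests per window before 503 (placeholder)
    hits = defaultdict(deque)

    @app.before_request
    def throttle():
        now = time.time()
        q = hits[request.remote_addr]
        while q and now - q[0] > WINDOW:
            q.popleft()
        q.append(now)
        if len(q) > THRESHOLD:
            return "Service Unavailable", 503, {"Retry-After": "60"}

    @app.route("/")
    def page():
        return "content"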

Preventing my PHP Web Crawler from Stalling

隐身守侯 Submitted on 2019-12-06 10:48:19
Question: I'm using the PHPCrawl class and added some DOMDocument and DOMXPath code to take specific data off web pages, however the script stalls out before it gets even close to crawling the whole website. I have set_time_limit set to 100000000, so that shouldn't be an issue. Any ideas? Thank you, Nick

    <?php
    // It may take a while to crawl a site ...
    set_time_limit(100000000);
    // Include the phpcrawl main class
    include("classes/phpcrawler.class.php");
    // connect to the database
    mysql_connect('localhost',