web-crawler

Crawling local files with Scrapy without an active project?

Submitted by 大城市里の小女人 on 2019-12-11 09:27:59
Question: Is it possible to crawl local files with Scrapy 0.18.4 without having an active project? I've seen this answer and it looks promising, but to use the crawl command you need a project. Alternatively, is there an easy, minimalist way to set up a project for an existing spider? I have my spider, pipelines, middleware, and items defined in one Python file. I've created a scrapy.cfg file with only the project name. This lets me use crawl, but since I don't have a spiders folder, Scrapy can't find …
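For context, recent Scrapy versions offer two project-free routes: scrapy runspider myspider.py runs a spider file directly, and CrawlerProcess runs one from plain Python. Below is a minimal sketch assuming a modern Scrapy install (the 0.18-era API differs) and a hypothetical local file name.

```python
# Minimal sketch, assuming modern Scrapy; "page.html" is a hypothetical file.
# Running "scrapy runspider thisfile.py" also works with no project at all.
import pathlib

import scrapy
from scrapy.crawler import CrawlerProcess

class LocalFileSpider(scrapy.Spider):
    name = "localfiles"
    # file:// URLs make Scrapy read straight from disk.
    start_urls = [pathlib.Path("page.html").absolute().as_uri()]

    def parse(self, response):
        yield {"title": response.xpath("//title/text()").get()}

if __name__ == "__main__":
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(LocalFileSpider)
    process.start()  # blocks until the crawl finishes
```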

MVC site is not crawlable by mainstream search engines?

Submitted by 大憨熊 on 2019-12-11 08:51:35
Question: The site is based on MVC 3 + Razor, and there is currently no DNS name for it, just a public IP. Because we don't understand whether and how Google crawls IP-only sites, we have a headache: we cannot get any Google search results for our public IP. Someone insists this is because of MVC 3, which supposedly cannot be indexed by mainstream search engines. Frankly, that sounds like a big joke to me: how could Google handle AJAX sites yet be unable to crawl MVC web sites? I cannot …

Crawling data successfully but cannot scrape it or write it to CSV

Submitted by 为君一笑 on 2019-12-11 08:50:37
Question: I have added DOWNLOAD_DELAY = 2 and COOKIES_ENABLED = False to my settings; my spider crawls the website but does not write the items to my CSV file. I don't think this is normal, because when I don't add these two settings everything is fine. Could somebody help me, please? I call my spider with this line at my command prompt: scrapy crawl CDiscount -o items.csv Here is my spider: # -*- coding: utf-8 -*- # Every import is done for a specific use import scrapy # Once you downloaded scrapy, you have to import …
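A hedged sketch of how those settings can live on the spider itself follows. Note that the feed exporter only writes what parse() yields, so DOWNLOAD_DELAY and COOKIES_ENABLED should not empty the CSV by themselves; a common culprit is the site serving different markup once cookies are off, so the old selectors stop matching. The URL and selectors below are hypothetical.

```python
# Hedged sketch, not the asker's full spider: the feed exporter writes only
# what parse() yields, so these two settings alone should not empty the CSV.
import scrapy

class CDiscountSpider(scrapy.Spider):
    name = "CDiscount"
    start_urls = ["https://www.cdiscount.com/"]  # placeholder URL
    custom_settings = {
        "DOWNLOAD_DELAY": 2,
        "COOKIES_ENABLED": False,
    }

    def parse(self, response):
        # Hypothetical selector; with cookies off, sites often serve different
        # markup, so a selector that matched before may now yield nothing.
        for product in response.css("div.product"):
            yield {"name": product.css("a::text").get()}
```

Run it as before with scrapy crawl CDiscount -o items.csv; if the file is still empty, logging len(response.text) or saving the body usually shows whether the cookieless page looks different from the one in the browser.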

Facebook isn't crawling my site [closed]

Submitted by 纵然是瞬间 on 2019-12-11 08:40:42
Question: Closed as off-topic 7 years ago; it is not accepting answers. When I publish a link to my site on Facebook, it's not showing thumbnails, and it's showing my old site's titles. I just added Open Graph code to my site, but that didn't help. When I check my site in the Facebook debugger, it shows Response code: 403. I guess this means that my site is blocking Facebook's bots, but I don't …
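A quick, hedged way to confirm the 403 theory: request the page with Facebook's documented crawler User-Agent and compare the status code against a browser-like one. The URL below is a placeholder for the asker's site.

```python
# Hedged check, not Facebook's actual pipeline: if a browser-like UA gets 200
# while facebookexternalhit gets 403, the server or a firewall blocks the bot.
import requests

URL = "https://example.com/"  # placeholder for the asker's site

for ua in ("Mozilla/5.0", "facebookexternalhit/1.1"):
    resp = requests.get(URL, headers={"User-Agent": ua}, timeout=10)
    print(f"{ua}: HTTP {resp.status_code}")
```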

How to crawl data from the pages linked from a page we are crawling

Submitted by 橙三吉。 on 2019-12-11 08:26:11
Question: I am crawling the names of the colleges on this webpage, but I also want to crawl the number of faculty members in these colleges, which is available if you open a college's own page by clicking its name. What should I add to this code to get that result? The result should be in the form [(name1, faculty1), (name2, faculty2), ...] import scrapy class QuotesSpider(scrapy.Spider): name = "student" start_urls = [ 'http://www.engineering.careers360.com/colleges/list-of …
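The usual Scrapy pattern is to yield one Request per college link with a second callback, carrying the name along in request.meta. A hedged sketch follows; the selectors and the full start URL are hypothetical, since the excerpt truncates them.

```python
# Hedged sketch of the follow-the-link pattern; selectors are hypothetical.
import scrapy

class CollegeSpider(scrapy.Spider):
    name = "student"
    # Placeholder: the real list URL is truncated in the excerpt above.
    start_urls = ["http://www.engineering.careers360.com/colleges/list-of-engineering-colleges"]

    def parse(self, response):
        for link in response.css("div.title a"):  # hypothetical selector
            yield response.follow(
                link,
                callback=self.parse_college,
                meta={"name": link.css("::text").get()},
            )

    def parse_college(self, response):
        # Hypothetical selector for the faculty count on the college page.
        faculty = response.css("span.faculty::text").get()
        yield {"name": response.meta["name"], "faculty": faculty}
```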

CasperJS: how to call __doPostBack

Submitted by 混江龙づ霸主 on 2019-12-11 08:26:03
Question: I am trying to scrape this page: http://fd1-www.leclercdrive.fr/057701/courses/pgeWMEL009_Courses.aspx#RS284323 But as you can see, that link redirects to fd1-www.leclercdrive.fr/057701/courses/pgeWMEL009_Courses.aspx when you first access it; after you click on "Fruits et légumes" you can reach the page through its URL directly. So I need to simulate a click on the "Fruits et légumes" button to get to the page I want, and in the page's code that click runs __doPostBack. Here is the code that I use with CasperJS: var …
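For context, __doPostBack(target, argument) is WebForms JavaScript that fills the hidden __EVENTTARGET and __EVENTARGUMENT fields and submits the page's form, so the click can also be replicated as a plain POST without a headless browser. Below is a hedged sketch of that alternative in Python with Scrapy (not the asker's CasperJS route); the control name is hypothetical and would be read from the button's onclick handler on the live page.

```python
# Hedged alternative to clicking in CasperJS: replay the WebForms postback.
import scrapy
from scrapy import FormRequest

class LeclercSpider(scrapy.Spider):
    name = "leclerc"
    start_urls = ["http://fd1-www.leclercdrive.fr/057701/courses/pgeWMEL009_Courses.aspx"]

    def parse(self, response):
        # from_response copies __VIEWSTATE and the other hidden fields for us.
        yield FormRequest.from_response(
            response,
            formdata={
                "__EVENTTARGET": "ctl00$lnkFruitsLegumes",  # hypothetical control name
                "__EVENTARGUMENT": "",
            },
            callback=self.parse_category,
        )

    def parse_category(self, response):
        self.logger.info("Category page size: %d bytes", len(response.body))
```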

Making AngularJS and Parse Web App Crawlable with Prerender

Submitted by 独自空忆成欢 on 2019-12-11 08:25:00
Question: I have been trying to make my AngularJS and Parse web app crawlable by Google and for Facebook sharing, and even with prerender-parse I have not been able to get it working. I have tried the tips from this Parse Developers thread for enabling HTML5 mode. Nothing works with the Facebook URL debugger or Google's fetch bot. Can anyone share a full, step-by-step setup that they have used and that is currently working? Answer 1: After some help from the Prerender.io team, here are the steps that resulted …
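For orientation, the Prerender.io contract is that crawler requests get proxied to the rendering service, which returns fully rendered HTML instead of the empty Angular shell. A hedged sketch of that round trip, with placeholder token and URL:

```python
# Hedged sketch of what the prerender middleware does on a bot request,
# assuming the standard Prerender.io service contract. Placeholders throughout.
import requests

PRERENDER_TOKEN = "YOUR_TOKEN"            # placeholder
page_url = "https://example.com/#!/home"  # placeholder app URL

resp = requests.get(
    "https://service.prerender.io/" + page_url,
    headers={"X-Prerender-Token": PRERENDER_TOKEN},
    timeout=30,
)
print(resp.status_code, resp.text[:200])  # rendered HTML, not the JS shell
```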

How to send another request and get the result in Scrapy's parse function?

Submitted by 纵饮孤独 on 2019-12-11 07:57:03
Question: I'm analyzing an HTML page that has a two-level menu. When the top-level menu selection changes, an AJAX request is sent to fetch the second-level menu items; when both levels are selected, the content refreshes. What I need is to send another request and get the submenu response inside Scrapy's parse function, so that I can iterate over the submenu and build a scrapy.Request per submenu item. The pseudo code looks like this: def parse(self, response): top_level_menu = response.xpath('//TOP_LEVEL_MENU …
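Scrapy's parse cannot block waiting for a second response, so the idiomatic answer is to make the AJAX call its own Request whose callback continues the work, passing state along in meta. A hedged sketch with hypothetical URLs and selectors:

```python
# Hedged sketch of chaining requests; endpoint and selectors are hypothetical.
import scrapy

class MenuSpider(scrapy.Spider):
    name = "menus"
    start_urls = ["https://example.com/"]  # placeholder

    def parse(self, response):
        for top in response.xpath("//ul[@id='top-menu']/li/@data-id").getall():
            # The endpoint the page hits when the top-level menu changes.
            yield scrapy.Request(
                f"https://example.com/submenu?top={top}",
                callback=self.parse_submenu,
                meta={"top": top},
            )

    def parse_submenu(self, response):
        for sub in response.xpath("//li/@data-id").getall():
            yield scrapy.Request(
                f"https://example.com/content?top={response.meta['top']}&sub={sub}",
                callback=self.parse_content,
            )

    def parse_content(self, response):
        yield {"url": response.url, "title": response.xpath("//title/text()").get()}
```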

SharePoint 2010 search crawling but not displaying results [closed]

Submitted by 百般思念 on 2019-12-11 07:00:27
Question: Closed as off-topic 7 years ago; it is not accepting answers. I have searched for possible solutions for days, but have had no luck getting my SharePoint 2010 search to return results. The search was working, but was only returning results from a subsite. I have gone through many blog posts and sites on setting up the search, and still nothing. My last resort was to …

web crawler performance

Submitted by 萝らか妹 on 2019-12-11 06:58:55
Question: I am interested to know, in a very general situation (a home-brew amateur web crawler), what the performance of such a crawler will be; more specifically, how many pages a crawler can process. When I say home-brew, take that in every sense: a 2.4 GHz Core 2 processor, written in Java, a 50 Mbit internet connection, and so on. Any resources you can share in this regard will be greatly appreciated. Thanks a lot, Carlos Answer 1: First of all, the speed of your computer won't be the limiting factor; as for the connection, you …
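As a rough upper bound from the numbers the asker gives, the 50 Mbit/s link, not the CPU, caps raw throughput. The average page size below is an assumption for illustration, and real crawlers usually sit far below this ceiling because of politeness delays, DNS lookups, and parsing.

```python
# Back-of-envelope only; the 100 KB average page size is an assumption.
link_mbit = 50
avg_page_kb = 100

bytes_per_sec = link_mbit * 1_000_000 / 8  # ~6.25 MB/s of raw bandwidth
pages_per_sec = bytes_per_sec / (avg_page_kb * 1_000)
print(f"~{pages_per_sec:.0f} pages/s, ~{pages_per_sec * 86_400:,.0f} pages/day (upper bound)")
```

That works out to roughly 60 pages per second, or about five million pages a day, before any real-world overhead.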