web-crawler

Is it possible for Scrapy to get plain text from raw HTML data?

房东的猫 submitted on 2020-01-01 07:52:49
Question: For example:

    scrapy shell http://scrapy.org/
    content = hxs.select('//*[@id="content"]').extract()[0]
    print content

Then, I get the following raw HTML code:

    <div id="content">
    <h2>Welcome to Scrapy</h2>
    <h3>What is Scrapy?</h3>
    <p>Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.</p>
    <h3>Features</h3>
    <dl>
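One way to get plain text instead of the raw markup is to select the descendant text nodes rather than the element itself. A minimal sketch, reusing the hxs selector from the scrapy shell session above (old HtmlXPathSelector API, so treat the exact method names as version-dependent assumptions):

    # Run inside the same "scrapy shell http://scrapy.org/" session as the question.
    # Grab every text node under the #content element instead of the element itself,
    # then join the pieces and drop whitespace-only fragments.
    text_nodes = hxs.select('//*[@id="content"]//text()').extract()
    plain_text = ' '.join(t.strip() for t in text_nodes if t.strip())
    print plain_text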

How do I create rules for a CrawlSpider using Scrapy?

前提是你 submitted on 2020-01-01 06:42:09
Question:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from manga.items import MangaItem

    class MangaHere(BaseSpider):
        name = "mangah"
        allowed_domains = ["mangahere.com"]
        start_urls = ["http://www.mangahere.com/seinen/"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//ul/li/div')
            items = []
            for site in sites:
                rating = site.select("p/span/text()").extract()
                if rating > 4.5:
                    item = MangaItem()
                    item["title"] = site.select("div/a/text()
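The snippet above never actually declares CrawlSpider rules, so here is a hedged sketch of how they are typically written, using the modern CrawlSpider/LinkExtractor imports. The allow pattern, spider name, and callback name are illustrative assumptions, and MangaItem is assumed to come from the question's own project. Note also that rating is extracted as text, so it has to be converted before comparing it against 4.5.

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from manga.items import MangaItem  # assumes the project layout from the question

    class MangaHereCrawl(CrawlSpider):  # hypothetical name for this sketch
        name = "mangah_crawl"
        allowed_domains = ["mangahere.com"]
        start_urls = ["http://www.mangahere.com/seinen/"]

        rules = (
            # Follow listing/pagination links and parse each page they lead to.
            # The allow regex is an assumption about the site's URL scheme.
            Rule(LinkExtractor(allow=r"/seinen/"), callback="parse_page", follow=True),
        )

        def parse_page(self, response):
            for site in response.xpath('//ul/li/div'):
                rating = site.xpath("p/span/text()").extract_first()
                # rating is a string (or None), so convert it before comparing
                if rating and float(rating) > 4.5:
                    item = MangaItem()
                    item["title"] = site.xpath("div/a/text()").extract_first()
                    yield item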

scrapyd-client command not found

风流意气都作罢 submitted on 2020-01-01 05:31:25
Question: I had just installed scrapyd-client (1.1.0) in a virtualenv and ran the 'scrapyd-deploy' command successfully, but when I run 'scrapyd-client', the terminal says: command not found: scrapyd-client. According to the readme file (https://github.com/scrapy/scrapyd-client), there should be a 'scrapyd-client' command. I checked the path '/lib/python2.7/site-packages/scrapyd-client'; only 'scrapyd-deploy' is in that folder. Has the 'scrapyd-client' command been removed for now? Answer 1: Create a fresh

Linking together >100K pages without getting SEO penalized

你说的曾经没有我的故事 submitted on 2020-01-01 05:06:57
Question: I'm making a site which will have reviews of the privacy policies of hundreds of thousands of other sites on the internet. Its initial content is based on my running through the CommonCrawl 5-billion-page web dump and analyzing all the privacy policies with a script to identify certain characteristics (e.g. "Sells your personal info"). According to the SEOmoz Beginner's Guide to SEO: Search engines tend to only crawl about 100 links on any given page. This loose restriction is necessary to

.NET Custom Threadpool with separate instances

不问归期 submitted on 2020-01-01 04:34:09
Question: What is the most recommended .NET custom threadpool that can have separate instances, i.e. more than one threadpool per application? I need an unlimited queue size (I'm building a crawler), and I need to run a separate threadpool in parallel for each site I am crawling. Edit: I need to mine these sites for information as fast as possible; using a separate threadpool for each site would give me the ability to control the number of threads working on each site at any given time (no more than 2-3).
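The edit above is really describing a pattern: one small, bounded worker pool per site, each feeding from an effectively unbounded queue. A rough sketch of that idea, written in Python only because the other examples on this page are Python; it is not a .NET library recommendation, and the class name and limits are illustrative assumptions.

    from concurrent.futures import ThreadPoolExecutor

    class SiteCrawler(object):
        """One bounded worker pool per site; pending work queues up without limit."""

        def __init__(self, site, max_workers=3):  # "no more than 2-3" threads per site
            self.site = site
            self.pool = ThreadPoolExecutor(max_workers=max_workers)

        def submit(self, url):
            # ThreadPoolExecutor keeps pending tasks in an unbounded internal queue.
            return self.pool.submit(self.fetch, url)

        def fetch(self, url):
            pass  # placeholder for the real download/parse logic

    # An independent pool per crawled site keeps the sites from starving each other.
    crawlers = {site: SiteCrawler(site) for site in ("example.com", "example.org")}
    crawlers["example.com"].submit("http://example.com/page1")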

PhantomJS using too many threads

大憨熊 submitted on 2020-01-01 00:44:11
Question: I wrote a PhantomJS app to crawl over a site I built and check that a JavaScript file is included. The JavaScript is similar to Google's, where some inline code loads another JS file. The app looks for that other JS file, which is why I used Phantom. What's the expected result? The console output should read through a ton of URLs and then report whether the script is loaded or not. What's really happening? The console output will read as expected for about 50 requests and then just start spitting

Can I block search crawlers for every site on an Apache web server?

我们两清 submitted on 2019-12-31 08:30:09
Question: I have somewhat of a staging server on the public internet running copies of the production code for a few websites. I'd really rather the staging sites not get indexed. Is there a way I can modify my httpd.conf on the staging server to block search engine crawlers? Changing the robots.txt wouldn't really work since I use scripts to copy the same code base to both servers. Also, I would rather not change the virtual host conf files either, as there are a bunch of sites and I don't want to
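Since the goal is a single server-wide change rather than per-site robots.txt files or per-vhost edits, one commonly suggested approach is to send an X-Robots-Tag response header from httpd.conf so every virtual host on the staging box tells crawlers not to index. A sketch, assuming mod_headers is available (module availability and where this lands in your config are assumptions):

    # httpd.conf on the staging server only (requires mod_headers)
    <IfModule mod_headers.c>
        # Applies to every response from every virtual host on this server.
        Header set X-Robots-Tag "noindex, nofollow"
    </IfModule>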

Web crawler to extract from list elements

我们两清 submitted on 2019-12-31 05:38:12
Question: I am trying to extract the dates from <li> tags and store them in an Excel file.

    <li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>

Code:

    import urllib2
    import os
    from datetime import datetime
    import re
    os.environ["LANG"] = "en_US.UTF-8"
    from bs4 import BeautifulSoup

    page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
    soup = BeautifulSoup(page1)
    li = soup.find_all("li")
    count = 0
    while count < len(li):
        soup = BeautifulSoup(li[count])
        date_string,
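A hedged sketch of one way to finish the job: pull the leading "Month DD, YYYY" date out of each <li>'s text and write the results to a CSV file that Excel opens directly. The regex, output filename, and date format are assumptions for illustration; li.get_text() avoids re-parsing each tag with BeautifulSoup.

    import csv
    import re
    import urllib2
    from datetime import datetime
    from bs4 import BeautifulSoup

    page = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
    soup = BeautifulSoup(page)

    # Matches a leading "Month DD, YYYY:" at the start of the <li> text (an assumption
    # based on the example item shown in the question).
    date_pattern = re.compile(r"^([A-Z][a-z]+ \d{1,2}, \d{4}):")

    rows = []
    for li in soup.find_all("li"):
        # get_text() gives the tag's plain text, so there is no need to re-parse it.
        match = date_pattern.match(li.get_text().strip())
        if match:
            date = datetime.strptime(match.group(1), "%B %d, %Y")
            rows.append([date.strftime("%Y-%m-%d")])

    # Write to a CSV file, which Excel opens directly (filename is an assumption).
    with open("stampede_dates.csv", "wb") as f:
        csv.writer(f).writerows(rows)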