web-crawler

Is it possible for Scrapy to get plain text from raw HTML data?

房东的猫 submitted on 2020-01-01 07:52:49
Question: For example:

    scrapy shell http://scrapy.org/
    content = hxs.select('//*[@id="content"]').extract()[0]
    print content

Then, I get the following raw HTML code:

    <div id="content">
    <h2>Welcome to Scrapy</h2>
    <h3>What is Scrapy?</h3>
    <p>Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.</p>
    <h3>Features</h3>
    <dl>
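One way to get plain text instead of the raw markup is to select the descendant text nodes rather than the element itself. A minimal sketch, reusing the hxs selector from the scrapy shell session above (old HtmlXPathSelector API, so treat the exact method names as version-dependent assumptions):

    # Run inside the same "scrapy shell http://scrapy.org/" session as the question.
    # Grab every text node under the #content element instead of the element itself,
    # then join the pieces and drop whitespace-only fragments.
    text_nodes = hxs.select('//*[@id="content"]//text()').extract()
    plain_text = ' '.join(t.strip() for t in text_nodes if t.strip())
    print plain_text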

How do I create rules for a CrawlSpider using Scrapy?

前提是你 submitted on 2020-01-01 06:42:09
Question:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from manga.items import MangaItem

    class MangaHere(BaseSpider):
        name = "mangah"
        allowed_domains = ["mangahere.com"]
        start_urls = ["http://www.mangahere.com/seinen/"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//ul/li/div')
            items = []
            for site in sites:
                rating = site.select("p/span/text()").extract()
                if rating > 4.5:
                    item = MangaItem()
                    item["title"] = site.select("div/a/text()
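The snippet above never actually declares CrawlSpider rules, so here is a hedged sketch of how they are typically written, using the modern CrawlSpider/LinkExtractor imports. The allow pattern, spider name, and callback name are illustrative assumptions, and MangaItem is assumed to come from the question's own project. Note also that rating is extracted as text, so it has to be converted before comparing it against 4.5.

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from manga.items import MangaItem  # assumes the project layout from the question

    class MangaHereCrawl(CrawlSpider):  # hypothetical name for this sketch
        name = "mangah_crawl"
        allowed_domains = ["mangahere.com"]
        start_urls = ["http://www.mangahere.com/seinen/"]

        rules = (
            # Follow listing/pagination links and parse each page they lead to.
            # The allow regex is an assumption about the site's URL scheme.
            Rule(LinkExtractor(allow=r"/seinen/"), callback="parse_page", follow=True),
        )

        def parse_page(self, response):
            for site in response.xpath('//ul/li/div'):
                rating = site.xpath("p/span/text()").extract_first()
                # rating is a string (or None), so convert it before comparing
                if rating and float(rating) > 4.5:
                    item = MangaItem()
                    item["title"] = site.xpath("div/a/text()").extract_first()
                    yield item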

scrapyd-client command not found

风流意气都作罢 submitted on 2020-01-01 05:31:25
Question: I had just installed scrapyd-client (1.1.0) in a virtualenv and ran the 'scrapyd-deploy' command successfully, but when I run 'scrapyd-client', the terminal says: command not found: scrapyd-client. According to the readme file (https://github.com/scrapy/scrapyd-client), there should be a 'scrapyd-client' command. I checked the path '/lib/python2.7/site-packages/scrapyd-client'; only 'scrapyd-deploy' is in that folder. Has the 'scrapyd-client' command been removed for now? Answer 1: Create a fresh

Linking together >100K pages without getting SEO penalized

你说的曾经没有我的故事 submitted on 2020-01-01 05:06:57
Question: I'm making a site which will have reviews of the privacy policies of hundreds of thousands of other sites on the internet. Its initial content is based on my running through the CommonCrawl 5-billion-page web dump and analyzing all the privacy policies with a script to identify certain characteristics (e.g. "Sells your personal info"). According to the SEOmoz Beginner's Guide to SEO: Search engines tend to only crawl about 100 links on any given page. This loose restriction is necessary to

.NET Custom Threadpool with separate instances

不问归期 submitted on 2020-01-01 04:34:09
Question: What is the most recommended .NET custom threadpool that can have separate instances, i.e. more than one threadpool per application? I need an unlimited queue size (I'm building a crawler), and I need to run a separate threadpool in parallel for each site I am crawling. Edit: I need to mine these sites for information as fast as possible; using a separate threadpool for each site would give me the ability to control the number of threads working on each site at any given time (no more than 2-3).
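The edit above is really describing a pattern: one small, bounded worker pool per site, each feeding from an effectively unbounded queue. A rough sketch of that idea, written in Python only because the other examples on this page are Python; it is not a .NET library recommendation, and the class name and limits are illustrative assumptions.

    from concurrent.futures import ThreadPoolExecutor

    class SiteCrawler(object):
        """One bounded worker pool per site; pending work queues up without limit."""

        def __init__(self, site, max_workers=3):  # "no more than 2-3" threads per site
            self.site = site
            self.pool = ThreadPoolExecutor(max_workers=max_workers)

        def submit(self, url):
            # ThreadPoolExecutor keeps pending tasks in an unbounded internal queue.
            return self.pool.submit(self.fetch, url)

        def fetch(self, url):
            pass  # placeholder for the real download/parse logic

    # An independent pool per crawled site keeps the sites from starving each other.
    crawlers = {site: SiteCrawler(site) for site in ("example.com", "example.org")}
    crawlers["example.com"].submit("http://example.com/page1")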

PhantomJS using too many threads

大憨熊 submitted on 2020-01-01 00:44:11
Question: I wrote a PhantomJS app to crawl over a site I built and check that a JavaScript file is included. The JavaScript is similar to Google's, where some inline code loads another JS file. The app looks for that other JS file, which is why I used Phantom. What's the expected result? The console output should read through a ton of URLs and then report whether the script is loaded or not. What's really happening? The console output will read as expected for about 50 requests and then just start spitting

Can I block search crawlers for every site on an Apache web server?

我们两清 submitted on 2019-12-31 08:30:09
Question: I have somewhat of a staging server on the public internet running copies of the production code for a few websites. I'd really rather the staging sites not get indexed. Is there a way I can modify my httpd.conf on the staging server to block search engine crawlers? Changing the robots.txt wouldn't really work since I use scripts to copy the same code base to both servers. Also, I would rather not change the virtual host conf files either, as there are a bunch of sites and I don't want to
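Since the goal is a single server-wide change rather than per-site robots.txt files or per-vhost edits, one commonly suggested approach is to send an X-Robots-Tag response header from httpd.conf so every virtual host on the staging box tells crawlers not to index. A sketch, assuming mod_headers is available (module availability and where this lands in your config are assumptions):

    # httpd.conf on the staging server only (requires mod_headers)
    <IfModule mod_headers.c>
        # Applies to every response from every virtual host on this server.
        Header set X-Robots-Tag "noindex, nofollow"
    </IfModule>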

Web crawler to extract from list elements

我们两清 submitted on 2019-12-31 05:38:12
Question: I am trying to extract the dates from <li> tags and store them in an Excel file.

    <li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>

Code:

    import urllib2
    import os
    from datetime import datetime
    import re
    os.environ["LANG"] = "en_US.UTF-8"
    from bs4 import BeautifulSoup

    page1 = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
    soup = BeautifulSoup(page1)
    li = soup.find_all("li")
    count = 0
    while count < len(li):
        soup = BeautifulSoup(li[count])
        date_string,
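A hedged sketch of one way to finish the job: pull the leading "Month DD, YYYY" date out of each <li>'s text and write the results to a CSV file that Excel opens directly. The regex, output filename, and date format are assumptions for illustration; li.get_text() avoids re-parsing each tag with BeautifulSoup.

    import csv
    import re
    import urllib2
    from datetime import datetime
    from bs4 import BeautifulSoup

    page = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
    soup = BeautifulSoup(page)

    # Matches a leading "Month DD, YYYY:" at the start of the <li> text (an assumption
    # based on the example item shown in the question).
    date_pattern = re.compile(r"^([A-Z][a-z]+ \d{1,2}, \d{4}):")

    rows = []
    for li in soup.find_all("li"):
        # get_text() gives the tag's plain text, so there is no need to re-parse it.
        match = date_pattern.match(li.get_text().strip())
        if match:
            date = datetime.strptime(match.group(1), "%B %d, %Y")
            rows.append([date.strftime("%Y-%m-%d")])

    # Write to a CSV file, which Excel opens directly (filename is an assumption).
    with open("stampede_dates.csv", "wb") as f:
        csv.writer(f).writerows(rows)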