web-crawler

How to scrape all content from an infinite scroll website with Scrapy?

时间秒杀一切 submitted on 2019-12-12 08:32:41
Question: I'm using Scrapy. The website I'm scraping has infinite scroll: it has loads of posts, but I only scraped 13. How do I scrape the rest of the posts? Here's my code:

    class exampleSpider(scrapy.Spider):
        name = "example"
        #from_date = datetime.date.today() - datetime.timedelta(6*365/12)
        allowed_domains = ["example.com"]
        start_urls = [
            "http://www.example.com/somethinghere/"
        ]

        def parse(self, response):
            for href in response.xpath("//*[@id='page-wrap']/div/div/div/section[2]/div/div/div/div[3]
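Infinite-scroll pages typically load additional posts through a background XHR request rather than in the initial HTML, so the usual fix is to find that request in the browser's Network tab and crawl it directly, paging until it runs dry. Below is a minimal sketch of that approach; the endpoint URL and JSON field names are hypothetical stand-ins for whatever the Network tab reveals.

    import json
    import scrapy

    class InfiniteScrollSpider(scrapy.Spider):
        name = "infinite_scroll_sketch"
        # Hypothetical JSON endpoint behind the infinite scroll.
        start_urls = ["http://www.example.com/api/posts?page=1"]

        def parse(self, response):
            data = json.loads(response.text)
            for post in data.get("posts", []):
                yield {"title": post.get("title")}
            # Keep requesting the next page until the endpoint returns nothing.
            if data.get("posts"):
                page = int(response.url.rsplit("=", 1)[1]) + 1
                yield scrapy.Request(
                    "http://www.example.com/api/posts?page=%d" % page,
                    callback=self.parse,
                )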

Robots.txt: allow only major search engines

守給你的承諾、 submitted on 2019-12-12 07:31:42
Question: Is there a way to configure robots.txt so that the site accepts visits ONLY from the Google, Yahoo!, and MSN spiders?

Answer 1:

    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Allow: /

    User-agent: Slurp
    Allow: /

    User-Agent: msnbot
    Disallow:

Slurp is Yahoo's robot (and an empty Disallow means "allow everything").

Answer 2: Why? Anyone doing evil (e.g., gathering email addresses to spam) will just ignore robots.txt, so you would only be blocking legitimate search engines, since robots.txt compliance is voluntary. But if you insist on doing it anyway
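Rules like these can be sanity-checked with Python's standard-library urllib.robotparser (assuming Python 3); a quick sketch:

    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /",
        "",
        "User-agent: Googlebot",
        "Allow: /",
    ]

    rp = RobotFileParser()
    rp.parse(rules)
    print(rp.can_fetch("Googlebot", "http://example.com/page"))     # True
    print(rp.can_fetch("SomeOtherBot", "http://example.com/page"))  # False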

Crawlable AJAX with _escaped_fragment_ in htaccess

大憨熊 submitted on 2019-12-12 07:15:47
Question: Hello fellow developers! We are almost finished developing the first phase of our AJAX web app. In our app we use hash fragments like:

http://ourdomain.com/#!list=last_ads&order=date

I understand Google will fetch this URL and make a request to the server in this form:

http://ourdomain.com/?_escaped_fragment_=list=last_ads?order=date&direction=desc

Everything is perfect, except... I would like to route this kind of request to another script, like so:

    RewriteCond %{QUERY_STRING} ^
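For background, under Google's AJAX crawling scheme (since deprecated), the part of the URL after #! is moved into an _escaped_fragment_ query parameter with special characters percent-encoded, which is what the rewrite rule has to match. A small Python sketch of that mapping:

    from urllib.parse import quote

    def escaped_fragment_url(hashbang_url):
        base, _, fragment = hashbang_url.partition("#!")
        # Special characters in the fragment (e.g. &) are percent-encoded.
        return base + "?_escaped_fragment_=" + quote(fragment, safe="=")

    print(escaped_fragment_url("http://ourdomain.com/#!list=last_ads&order=date"))
    # -> http://ourdomain.com/?_escaped_fragment_=list=last_ads%26order=date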

Getting IMDB movie titles in a specific language

穿精又带淫゛_ submitted on 2019-12-12 06:43:28
Question: I am writing a crawler in Java that examines an IMDB movie page and extracts some info such as name, year, etc. The user writes (or copy/pastes) the link of the title and my program should do the rest. After examining the HTML source of several IMDB pages and reading up on how crawlers work, I managed to write the code. The info I get (for example the title) is in my mother tongue, and if there is no info in my mother tongue I get the original title. What I want is to get the title in a specific language of my
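IMDB localizes titles based on the Accept-Language request header, so the usual fix is to pin that header on every request. The question uses Java, but here is the idea sketched in Python (setting the same header on a Java URLConnection works identically); the title ID is just an example.

    import urllib.request

    req = urllib.request.Request(
        "http://www.imdb.com/title/tt0111161/",
        headers={"Accept-Language": "en-US,en;q=0.8"},  # ask for English titles
    )
    html = urllib.request.urlopen(req).read().decode("utf-8")
    print(html[:500])  # the <title> tag near the top should now be in English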

Tricking a browser into firing JavaScript events?

橙三吉。 submitted on 2019-12-12 06:05:36
Question: So I'm trying to create a web spider. I've run into a website that has some JavaScript, and I want to trick the browser into thinking that an event has been fired so that it calls the corresponding JavaScript code to handle the event. How would I be able to do this in Perl, using WWW::Mechanize or WWW::Scripter::Plugin::Javascript? Also, it would be very much appreciated if someone could put up an example of how to use WWW::Scripter::Plugin::Javascript. Thanks in advance. Also if someone
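The Perl modules asked about are one route; a commonly used alternative, shown here in Python (the language of most other questions on this page) rather than Perl, is to drive a real browser with Selenium and dispatch the event from injected JavaScript. The URL and selector below are placeholders.

    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("http://example.com")  # placeholder URL
    # Dispatch a synthetic click so the page's own JS handler runs.
    driver.execute_script(
        "var el = document.querySelector('a');"
        "el.dispatchEvent(new MouseEvent('click', {bubbles: true}));"
    )
    driver.quit()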

How to write a Selenium WebDriver path with Python on Windows 10?

[亡魂溺海] submitted on 2019-12-12 04:19:14
Question: I'm making a simple web crawler using Python with Selenium (running in PyCharm on Windows 10).

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Firefox()
    driver.get("http://www.python.org")  # the URL must be a quoted string

I tried various formats for the file path, but all of them seem to return an error. What would be the correct format for the file path? P.S. The file address I copied from File Explorer doesn't work either.

Answer 1: Here is the Answer to your
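The usual cause is how Windows backslashes interact with Python string escapes when pointing Selenium at the driver executable. A minimal sketch, assuming a hypothetical geckodriver location (the executable_path argument matches the Selenium 3 API current when this was asked):

    from selenium import webdriver

    # Raw string: backslashes are not treated as escape characters.
    driver = webdriver.Firefox(
        executable_path=r"C:\Users\me\drivers\geckodriver.exe"
    )
    # Equivalent alternatives:
    #   "C:\\Users\\me\\drivers\\geckodriver.exe"   (escaped backslashes)
    #   "C:/Users/me/drivers/geckodriver.exe"       (forward slashes)
    driver.get("http://www.python.org")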

Following certain links with Beautiful Soup

佐手、 submitted on 2019-12-12 04:17:26
Question: I've been having a lot of trouble with this problem, and I think I understand the work, but my head now has a dent in it from banging it on the desk. What I need to do is make a program that scrapes a webpage with Beautiful Soup, gets a certain link (anywhere from the 3rd to the 20th link down the page), goes to that 3rd (or 20th, or whatever number) link, and then tries to find the 3rd link from that page, over and over, for an unspecified number of times (I'm keeping it
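A minimal sketch of that repeated link-following, assuming Python 3 and Beautiful Soup; the start URL, link position, and repeat count are placeholder values:

    import urllib.request
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    url = "http://example.com/start"  # placeholder start page
    position, count = 3, 4            # take the 3rd link, repeat 4 times

    for _ in range(count):
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, "html.parser")
        links = soup.find_all("a", href=True)
        url = urljoin(url, links[position - 1]["href"])  # 1-based position
        print("Following:", url)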

Scrapy shell can't extract information although the XPath works in the Chrome console

懵懂的女人 submitted on 2019-12-12 04:07:43
Question: I'm working on a project to collect the university's professors' contact information (so it is not malicious). The professor page is dynamic; I found the underlying request via the Chrome Network tab. However, the XPath that works in the Chrome console returns nothing in the Scrapy shell, even after I tried adding headers.

[scrapy shell result screenshot]
[Chrome console result screenshot]

    import scrapy
    from universities.items import UniversitiesItem

    class UniversityOfHouston(scrapy.Spider):
        name = 'University_of_Houston'
        allowed_domains = ['uh
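A frequent cause of this mismatch is that Chrome evaluates JavaScript and normalizes the DOM (inserting <tbody> into tables, for example), so an XPath copied from the console may not match the raw HTML Scrapy downloads. A sketch of how to check from inside `scrapy shell <url>` (a Python REPL; the endpoint below is hypothetical):

    view(response)                   # open what Scrapy actually downloaded
    response.xpath('//table//tr')    # drop /tbody/ steps copied from Chrome
    # If the rows arrive via an XHR, fetch that request directly instead:
    fetch('http://example.edu/api/professors')  # hypothetical endpoint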

Finding specific URLs from a list of URLs using Python

送分小仙女□ submitted on 2019-12-12 03:59:31
Question: I want to find out whether specific links exist in a list of URLs by crawling through them. I have written the following program and it works perfectly, but I am stuck in two places:

1. Instead of using an array, how can I read the links from a text file?
2. The crawler takes close to 4 minutes to crawl through 100 webpages. Is there a way to make it faster?

    from bs4 import BeautifulSoup, SoupStrainer
    import urllib2
    import re
    import threading
    import time  # needed for the time.time() call below

    start = time.time()

    # Links I want to find
    url = "example.com
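A sketch addressing both questions at once, assuming Python 3 and a file named urls.txt with one URL per line: read the list from disk, then fetch pages concurrently with a thread pool instead of one at a time.

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    def fetch(url):
        # Network-bound work, so threads give a near-linear speedup.
        return url, urllib.request.urlopen(url, timeout=10).read()

    with ThreadPoolExecutor(max_workers=10) as pool:
        for url, html in pool.map(fetch, urls):
            print(url, len(html))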

How to change a select value using PhantomJS

非 Y 不嫁゛ submitted on 2019-12-12 03:45:34
Question: I made a scraper using PhantomJS inside Node (the phantom node module). I am trying to get data from a table on the page (url). When the page loads, it only displays 25 records of the table; there is a 'select' at the bottom that you can change to 'All' to see all records. How can I change the value of the select to 'All' before getting the HTML back?

    var phantom = require('phantom');

    phantom.create().then(function(ph){
        ph.createPage().then(function(page){
            page.open(url).then(function(status){
                console
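The general recipe is: set the select to 'All', fire its change event so the page reloads the rows, then read the HTML. Inside PhantomJS that is done in page.evaluate(); the same steps are sketched below with Selenium for Python (the language of most other questions on this page), where the element id is hypothetical and the Selenium 3 PhantomJS driver is assumed.

    from selenium import webdriver
    from selenium.webdriver.support.ui import Select

    driver = webdriver.PhantomJS()
    driver.get("http://example.com/table-page")  # placeholder URL
    # Selecting by visible text fires the change event for us.
    select = Select(driver.find_element_by_id("records-per-page"))
    select.select_by_visible_text("All")
    html = driver.page_source  # now contains all rows
    driver.quit()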