web-crawler

Python Web Crawlers and “getting” HTML source code

Submitted by 萝らか妹 on 2019-11-30 01:52:31
My brother wanted me to write a web crawler in Python (self-taught), and I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library documentation, but I have a few problems. 1. The httplib.HTTPConnection and request concepts are new to me, and I don't understand whether what gets downloaded is an HTML script, a cookie, or an instance. If you do both of those, do you get the source of a website page? And what are some terms I would need to know to modify the page and return the modified page? For background, I need to download a page and replace any img with ones I have, and it would be …
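
For context, here is a minimal sketch (not the asker's code) of what fetching a page's source with httplib looks like in Python 2.7, followed by a naive image swap. The host and image path are placeholders, and a real rewrite of the page would use an HTML parser rather than a regular expression:

```python
# Minimal Python 2.7 sketch: HTTPConnection.request() sends the HTTP request,
# getresponse() returns an HTTPResponse instance, and read() gives you the
# raw HTML source of the page as a string.
import httplib
import re

conn = httplib.HTTPConnection("www.example.com")   # placeholder host
conn.request("GET", "/")
response = conn.getresponse()
html = response.read()          # this string is the page's HTML source
conn.close()

# Naive illustration of replacing every <img> tag with one of your own;
# a real implementation would parse the HTML instead of using a regex.
modified = re.sub(r'<img\b[^>]*>', '<img src="my_image.png">', html)
print modified[:200]
```

For a single fetch, urllib2.urlopen(url).read() does the same download in one line on Python 2.7.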

Java Web Crawler Libraries

Submitted by 醉酒当歌 on 2019-11-29 23:13:26
I want to make a Java-based web crawler for an experiment. I heard that making a web crawler in Java is the way to go if this is your first time. However, I have two important questions. How will my program 'visit' or 'connect' to web pages? Please give a brief explanation. (I understand the basics of the layers of abstraction from the hardware up to the software; here I am interested in the Java abstractions.) What libraries should I use? I would assume I need a library for connecting to web pages, a library for the HTTP/HTTPS protocol, and a library for HTML parsing. Mohammad Adil: This is how …

Can Scrapy be replaced by pyspider?

Submitted by 你说的曾经没有我的故事 on 2019-11-29 23:05:59
I've been using the Scrapy web-scraping framework pretty extensively, but recently I discovered that there is another framework/system called pyspider, which, according to its GitHub page, is fresh, actively developed, and popular. pyspider's home page lists several things supported out of the box: a powerful WebUI with script editor, task monitor, project manager, and result viewer; JavaScript pages supported; task priority, retry, periodic recrawl, and recrawl by age or by marks in the index page (like update time); distributed architecture. These are things that Scrapy itself doesn't provide, but …
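
For a feel of the API difference, this is roughly what a minimal pyspider handler looks like, based on the quickstart template pyspider's WebUI generates; the URL is a placeholder. The @every and @config(age=...) decorators are where the "periodic recrawl" and "recrawl by age" features from the list above surface:

```python
# A sketch based on pyspider's quickstart template; example.com is a placeholder.
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)                 # re-run on_start once a day
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)          # treat a fetched page as fresh for 10 days
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```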

Get the proxy IP address Scrapy is using to crawl

Submitted by 浪尽此生 on 2019-11-29 22:55:45
Question: I use Tor to crawl web pages. I started the tor and polipo services and added:

    class ProxyMiddleware(object):
        # overwrite process_request
        def process_request(self, request, spider):
            # set the location of the proxy
            request.meta['proxy'] = "127.0.0.1:8123"

Now, how can I make sure that Scrapy uses a different IP address for its requests?

Answer 1: You can yield a first request to check your public IP, and compare it to the IP you see when you go to http://checkip.dyndns.org/ without using Tor/VPN. If they …
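
As a hedged sketch of that verification step (not taken from the original answer): a throwaway spider can send one request through the proxy to an IP-echo service and log what the remote server sees, which should differ from your real public IP if Tor/polipo is in use. httpbin.org/ip is an assumed echo service, and note that, depending on the Scrapy version, the proxy value may need the scheme included (e.g. "http://127.0.0.1:8123"). Setting meta['proxy'] per request, as below, is handled by Scrapy's built-in HttpProxyMiddleware, so the custom middleware from the question would behave the same way:

```python
# Hedged sketch: verify which exit IP the target server sees when crawling
# through the local Tor/polipo proxy.
import scrapy


class CheckIPSpider(scrapy.Spider):
    name = "check_ip"

    def start_requests(self):
        yield scrapy.Request(
            "https://httpbin.org/ip",                 # assumed IP-echo service
            meta={"proxy": "http://127.0.0.1:8123"},  # route through the proxy
            callback=self.parse,
        )

    def parse(self, response):
        # If the proxy is working, this should NOT be your real public IP.
        self.logger.info("Exit IP as seen by the server: %s", response.text)
```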

Run Multiple Spiders sequentially

Submitted by 萝らか妹 on 2019-11-29 22:26:10
Question:

    class Myspider1:
        # do something...

    class Myspider2:
        # do something...

The above is the structure of my spider.py file, and I am trying to run Myspider1 first and then run Myspider2 multiple times, depending on some conditions. How could I do that? Any tips?

    configure_logging()
    runner = CrawlerRunner()

    def crawl():
        yield runner.crawl(Myspider1, arg...)
        yield runner.crawl(Myspider2, arg...)

    crawl()
    reactor.run()

I am trying to use this approach but have no idea how to run it. Should I run …
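
A sketch of the pattern the Scrapy docs describe for running crawls sequentially: decorating the generator with @defer.inlineCallbacks makes each yield wait for the previous crawl to finish, and the reactor is stopped once everything is done. The two spiders below are bare stand-ins for the asker's Myspider1/Myspider2, and the three-iteration loop stands in for "some conditions":

```python
from twisted.internet import defer, reactor

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class Myspider1(scrapy.Spider):
    # stand-in for the asker's first spider
    name = "spider1"
    start_urls = ["http://example.com"]

    def parse(self, response):
        self.logger.info("spider1 visited %s", response.url)


class Myspider2(scrapy.Spider):
    # stand-in for the asker's second spider
    name = "spider2"
    start_urls = ["http://example.com"]

    def parse(self, response):
        self.logger.info("spider2 visited %s", response.url)


configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Myspider1)       # runs to completion first
    for _ in range(3):                  # hypothetical stand-in for "some conditions"
        yield runner.crawl(Myspider2)   # then Myspider2, several times
    reactor.stop()


crawl()
reactor.run()                           # blocks until reactor.stop() is called
```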

Can't get through a form with scrapy

Submitted by 北慕城南 on 2019-11-29 22:13:33
Question: I'm new to Scrapy and I'm trying to get some info from a real estate website. The site has a home page with a search form (method GET). I'm trying to go to the results page (recherche.php) in my start_requests, setting all the GET parameters I see in the address bar in the formdata parameter. I also set the cookies I had, but that didn't work either. Here's my spider:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import …
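
Not the asker's spider, but a hedged sketch of one common way to submit a GET search form in Scrapy: build a FormRequest with method='GET' in start_requests and let Scrapy encode formdata into the query string (FormRequest.from_response on the home page is an alternative that also picks up hidden form fields). The URL, form fields, and selectors below are invented placeholders:

```python
import scrapy
from scrapy.http import FormRequest


class RealEstateSpider(scrapy.Spider):
    name = "real_estate"

    def start_requests(self):
        # method='GET' makes Scrapy urlencode formdata into the query string,
        # mirroring what the browser puts in the address bar.
        yield FormRequest(
            url="http://www.example-realestate.test/recherche.php",  # placeholder
            method="GET",
            formdata={
                "type": "appartement",   # hypothetical form fields
                "ville": "paris",
            },
            callback=self.parse_results,
        )

    def parse_results(self, response):
        for ad in response.css("div.annonce"):   # hypothetical selector
            yield {"title": ad.css("h2::text").get()}
```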

Web crawler in ruby [closed]

Submitted by 白昼怎懂夜的黑 on 2019-11-29 21:56:59
What is your recommendation for writing a web crawler in Ruby? Any library better than Mechanize? Nakilon: If you just want to get pages' content, the simplest way is to use the open-uri functions. They don't require additional gems; you just have to require 'open-uri' and... http://ruby-doc.org/stdlib-2.2.2/libdoc/open-uri/rdoc/OpenURI.html To parse the content you can use Nokogiri or other gems, which also offer useful XPath support. You can find other parsing libraries here on SO. I'd also give anemone a try. It's simple to use, especially if you have to write a simple crawler.

Is there CURRENTLY anyway to fetch Instagram user media without authentication?

Submitted by 前提是你 on 2019-11-29 21:08:44
Until recently there were several ways to retrieve Instagram user media without the need for API authentication, but apparently the website has shut all of them down. Some of the old methods:

    https://api.instagram.com/v1/users/user-id/media/recent/
    https://www.instagram.com/user-id/media
    https://www.instagram.com/user-id/?__a=1

Some old related questions are: "How can I get a user's media from Instagram without authenticating as a user?" and "Is there still a way to fetch instagram feed without using access token now (06/2016)?". I was able to retrieve the first twenty items by crawling the webpage of …
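
Purely as an illustration of the approach the question describes, and with the caveat that Instagram has repeatedly shut these endpoints down so this may simply return a login page or an error today, hitting the old ?__a=1 URL with requests looked roughly like this; "some_username" is a placeholder:

```python
# Hedged sketch of the old unauthenticated ?__a=1 scrape; the exact JSON layout
# has changed several times, so inspect the response before relying on any keys.
import json
import requests

url = "https://www.instagram.com/some_username/?__a=1"
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)

if resp.status_code == 200:
    try:
        data = resp.json()
        # historically the first batch of posts lived somewhere in this structure
        print(json.dumps(data, indent=2)[:500])
    except ValueError:
        print("Response was not JSON - the endpoint probably requires authentication now")
else:
    print("Request failed with status", resp.status_code)
```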

How do I use the Python Scrapy module to list all the URLs from my website?

Submitted by 你说的曾经没有我的故事 on 2019-11-29 21:01:49
I want to use the Python Scrapy module to scrape all the URLs from my website and write the list to a file. I looked at the examples but didn't see a simple example that does this. Here's the Python program that worked for me:

    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider
    from scrapy.http import Request

    DOMAIN = 'example.com'
    URL = 'http://%s' % DOMAIN

    class MySpider(BaseSpider):
        name = DOMAIN
        allowed_domains = [DOMAIN]
        start_urls = [URL]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            for url in hxs.select('//a/@href').extract():
                if not (url …
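
The snippet above is cut off mid-condition. As a hedged reconstruction of the same idea, written against the current Scrapy API (scrapy.Spider, response.xpath, response.urljoin) rather than the original author's exact code, a complete version might look like this; the domain is a placeholder:

```python
import scrapy

DOMAIN = 'example.com'          # placeholder - use your own domain
URL = 'http://%s' % DOMAIN


class UrlListSpider(scrapy.Spider):
    name = 'url_list'
    allowed_domains = [DOMAIN]    # keeps the crawl on your own site
    start_urls = [URL]

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            url = response.urljoin(href)                   # make relative links absolute
            yield {'url': url}                             # emit the URL as an item
            yield scrapy.Request(url, callback=self.parse)  # and keep crawling
```

Running it with `scrapy runspider spider.py -o urls.csv` then writes the collected URLs to a file.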

Automated link-checker for system testing [closed]

Submitted by 社会主义新天地 on 2019-11-29 19:57:49
I often have to work with fragile legacy websites that break in unexpected ways when logic or configuration is updated. I don't have the time or knowledge of the system needed to create a Selenium script; besides, I don't want to check a specific use case, I want to verify every link and page on the site. I would like to create an automated system test that will spider through the site and check for broken links and crashes. Ideally, there would be a tool I could use to achieve this. It should have as many as possible of the following features, in descending order of priority: Triggered …
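
The excerpt ends before any tool is named, but as one illustration of the "spider the site and report broken links" idea, a small Scrapy spider can follow every internal link and log non-2xx responses; the domain and start URL are placeholders, and HTTPERROR_ALLOW_ALL lets error responses reach the callback instead of being dropped:

```python
import scrapy


class LinkCheckSpider(scrapy.Spider):
    name = "linkcheck"
    allowed_domains = ["staging.example.com"]            # placeholder
    start_urls = ["http://staging.example.com/"]         # placeholder

    custom_settings = {
        "HTTPERROR_ALLOW_ALL": True,   # let 4xx/5xx responses reach parse()
    }

    def parse(self, response):
        if response.status >= 400:
            # broken link or crashing page - log it together with the referring page
            referer = response.request.headers.get("Referer", b"").decode()
            self.logger.error("BROKEN %s -> %s (found on %s)",
                              response.status, response.url, referer)
            return
        # only HTML pages are worth parsing for further links
        if b"text/html" in response.headers.get("Content-Type", b""):
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)
```

A CI job could run the spider and fail the build whenever the log contains BROKEN lines.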