web-crawler

Downloading pdf files using mechanize and urllib

吃可爱长大的小学妹 submitted on 2019-12-08 02:41:32
Question: I am new to Python, and my current task is to write a web crawler that looks for PDF files on certain web pages and downloads them. Here is my current approach (just for one sample URL):

import mechanize
import urllib
import sys

mech = mechanize.Browser()
mech.set_handle_robots(False)
url = "http://www.xyz.com"
try:
    mech.open(url, timeout = 30.0)
except HTTPError, e:
    sys.exit("%d: %s" % (e.code, e.msg))
links = mech.links()
for l in links:
    #Some are relative links
    path = str(l.base_url[:-1])+str
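A minimal sketch of one common way to finish this, assuming the goal is simply to resolve each link against its base URL and save anything ending in .pdf; the site URL is a placeholder and the code targets Python 2, which mechanize and this urllib usage imply:

# Hedged sketch (Python 2, mechanize): resolve each link against its base URL
# and download anything that looks like a PDF. URL and filenames are illustrative.
import mechanize
import urllib
from urlparse import urljoin

mech = mechanize.Browser()
mech.set_handle_robots(False)
mech.open("http://www.xyz.com", timeout=30.0)

for link in mech.links():
    absolute = urljoin(link.base_url, link.url)   # handles relative links
    if absolute.lower().endswith(".pdf"):
        filename = absolute.split("/")[-1]
        urllib.urlretrieve(absolute, filename)    # save the PDF locally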

Parsing ajax responses to retrieve final url content in Scrapy?

徘徊边缘 submitted on 2019-12-08 01:39:07
Question: I have the following problem: my scraper starts at a "base" URL. This page contains a dropdown that creates another dropdown via ajax calls, and this cascades 2-3 times until it has all the information needed to get to the "final" page where the actual content I want to scrape is. Rather than clicking things (and having to use Selenium or similar) I use the page's exposed JSON API to mimic this behavior, so instead of clicking dropdowns I simply send a request and read JSON responses that
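For reference, a rough sketch of how such a cascade is usually chained in Scrapy, with each JSON response feeding the next request via a callback; the endpoint URLs and field names below are illustrative assumptions, not the asker's actual API:

# Hedged sketch: chain requests through a JSON API instead of clicking dropdowns.
import json
import scrapy

class CascadeSpider(scrapy.Spider):
    name = "cascade"
    # Hypothetical API endpoints; the real ones come from the site's ajax calls.
    start_urls = ["http://example.com/api/level1"]

    def parse(self, response):
        data = json.loads(response.text)   # response.text is available in newer Scrapy
        for option in data["options"]:
            url = "http://example.com/api/level2?id=%s" % option["id"]
            yield scrapy.Request(url, callback=self.parse_level2)

    def parse_level2(self, response):
        data = json.loads(response.text)
        # ...repeat for as many cascaded dropdowns as the site has...
        yield scrapy.Request(data["final_url"], callback=self.parse_final)

    def parse_final(self, response):
        # The "final" page with the real content; scrape it as usual.
        yield {"url": response.url, "length": len(response.body)}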

File Crawler PHP

隐身守侯 submitted on 2019-12-08 00:58:49
Question: Just wondering how it would be possible to recursively search through a website folder directory (the same one the script is uploaded to), open/read every file, and search for a specific string. For example I might have this:

search.php?string=hello%20world

This would run a process and then output something like "hello world found inside":

httpdocs
/index.php
/contact.php
httpdocs/private/
../private.php
../morestuff.php
../tastey.php
httpdocs/private/love
../../goodness.php

I don't want it to
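The question is about PHP, but the recursive idea itself is easy to show; here is a hedged sketch in Python 3 (the language used elsewhere in this collection) that walks a document root and reports every file containing the search string; the "httpdocs" root and the needle are placeholders:

# Hedged sketch (Python, not PHP): walk a directory tree, read each file,
# and collect the paths of files that contain the search string.
import os

def find_string(root, needle):
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="ignore") as handle:
                    if needle in handle.read():
                        matches.append(path)
            except OSError:
                pass  # unreadable file; skip it
    return matches

print(find_string("httpdocs", "hello world"))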

Scrapy InIt self.initialized() — not initializing

a 夏天 submitted on 2019-12-08 00:21:44
Question: I am trying to use Scrapy to log in to a website in the init, then after confirming login I want to initialize and start the standard crawl through start_urls. I'm not sure what is going wrong, but I get clear through the login and everything confirms, yet parse_item never starts. Any help would be well appreciated. I can get up to "================Successfully logged in=================" but I can not get to "==========================PARSE ITEM==========================" from scrapy.contrib
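For context, a hedged sketch of the old scrapy.contrib InitSpider login pattern this question builds on; the URLs, form field names, and the "Logout" success marker are assumptions for illustration only:

# Hedged sketch of the InitSpider login flow (scrapy.contrib era, Python 2).
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest

class LoginSpider(InitSpider):
    name = "login"
    login_page = "http://example.com/login"
    start_urls = ["http://example.com/protected/"]

    def init_request(self):
        # Runs before start_urls are touched; the chain must end in self.initialized().
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "pass"},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "Logout" in response.body:
            self.log("================Successfully logged in=================")
            # Returning self.initialized() is what lets the normal crawl begin.
            return self.initialized()

    def parse(self, response):
        self.log("==========================PARSE ITEM==========================")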

Nutch does not crawl URLs with query string parameters

…衆ロ難τιáo~ submitted on 2019-12-07 20:22:25
Question: I am using Nutch 1.9 and trying to crawl using individual commands. As can be seen in the output, when going into the 2nd level the generator returned 0 records. Has anyone faced this issue? I have been stuck here for the past 2 days and have searched all possible options. Any leads/help would be much appreciated.

####### INJECT ######
Injector: starting at 2015-04-08 17:36:20
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db
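One likely culprit worth checking, offered as an assumption rather than a certain diagnosis: Nutch's default conf/regex-urlfilter.txt rejects any URL containing query-string characters, so such URLs never survive injection or generation. The default rule looks like this:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

Commenting out the -[?*!@=] line (and re-running inject/generate) allows URLs with ? and = through the filter.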

java html parser for reading javascript generated contents

好久不见. submitted on 2019-12-07 18:40:18
Question: I am using jsoup for reading a web page with the following function:

public Document getDocuement(String url){
    Document doc = null;
    try {
        doc = Jsoup.connect(url).timeout(20*1000).userAgent("Mozilla").get();
    } catch (Exception e) {
        return null;
    }
    return doc;
}

But whenever I try to read a web page that contains javascript-generated content, jsoup does not read that content; i.e., the actual content of the page is loaded by some javascript calls, so it is not present in the page source of
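A static parser like jsoup never executes JavaScript, so the usual workaround is to let a real browser engine render the page first and then parse the rendered HTML. A hedged sketch of that idea, in Python with Selenium rather than Java (a deliberate swap of language and tool, with a placeholder URL):

# Hedged sketch: render the page in a browser engine, then parse the result.
from selenium import webdriver

driver = webdriver.Firefox()          # requires a local Firefox/geckodriver setup
driver.get("http://example.com/js-heavy-page")
rendered_html = driver.page_source    # includes JavaScript-generated content
driver.quit()
# rendered_html can now be handed to any HTML parser.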

htmlunit : An invalid or illegal selector was specified

无人久伴 submitted on 2019-12-07 18:16:40
Question: I am trying to simulate a login with htmlunit. Although I wrote my code according to the examples, I have run into a puzzling problem. Below are some messages I have picked up from the console:

runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: *:x).]
sourceName=[http://user.mofangge.com/Scripts/inc/jquery-1.10.2.js]
line=[1640]
lineSource=[null]
lineOffset=[0]

WARNING: Obsolete content type encountered: 'application/x-javascript'.

screen scraping using Ghost.py

南笙酒味 submitted on 2019-12-07 16:00:27
Here is the simple program, which does not work:

from ghost import Ghost
ghost = Ghost(wait_timeout=40)
page, extra_resources = ghost.open("http://samsung.com/in/consumer/mobile-phone/mobile-phone/smartphone/")
ghost.wait_page_loaded()
n=2;
links = ghost.evaluate("alist=document.getElementsByTagName('a');alist")
print links

The error is:

raise Exception(timeout_message)
Exception: Unable to load requested page

Is there some problem with the program? It seems people are reporting similar issues to yours without really getting any explanation (for example: https://github.com/jeanphix/Ghost.py/issues

scrapy not printing out stacktrace on exception

丶灬走出姿态 submitted on 2019-12-07 15:55:44
Question: Is there a special mechanism to force Scrapy to print out all Python exceptions/stacktraces? I made a simple mistake of getting a list attribute wrong, resulting in an AttributeError which did not show up in full in the logs. What showed up was:

2015-11-15 22:13:50 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 264,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 40342,
 'downloader/response_count': 1,
 'downloader/response
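As a hedged starting point only, it is worth confirming that Scrapy's own logging is not hiding anything; these are standard settings that control how much ends up in the log (settings.py is the usual file name):

# Hedged sketch: standard Scrapy logging settings to keep full output visible.
LOG_ENABLED = True
LOG_LEVEL = 'DEBUG'       # 'DEBUG' is the default; anything stricter can hide tracebacks
LOG_FILE = 'scrapy.log'   # optional: keep the complete run output for inspection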

Python requests error 10060

给你一囗甜甜゛ submitted on 2019-12-07 15:32:32
I have a script that crawls a website. Until today it ran perfectly; however, it does not do so now. It gives me the following error:

Connection Aborted Error(10060 'A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond')

I have been looking into answers and settings but I cannot figure out how to fix this... In IE I am not using any proxy (Connection -> Lan Settings -> Proxy = Disabled). It breaks in this piece of code, sometimes the first run, sometimes the 2nd..
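A hedged sketch of one defensive pattern for this kind of intermittent failure: give requests an explicit timeout and a retry policy so a single unresponsive host does not abort the whole crawl. The URL and retry count here are placeholders:

# Hedged sketch: explicit timeout plus automatic retries with requests.
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=3))
session.mount("https://", HTTPAdapter(max_retries=3))

try:
    response = session.get("http://example.com/page", timeout=30)
    response.raise_for_status()
except requests.exceptions.ConnectionError as err:
    print("Connection failed, try again later: %s" % err)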