web-crawler

Downloading pdf files using mechanize and urllib

吃可爱长大的小学妹 submitted on 2019-12-08 02:41:32
Question: I am new to Python, and my current task is to write a web crawler that looks for PDF files on certain web pages and downloads them. Here is my current approach (just for one sample URL):

import mechanize
import urllib
import sys

mech = mechanize.Browser()
mech.set_handle_robots(False)
url = "http://www.xyz.com"
try:
    mech.open(url, timeout = 30.0)
except HTTPError, e:
    sys.exit("%d: %s" % (e.code, e.msg))
links = mech.links()
for l in links:
    #Some are relative links
    path = str(l.base_url[:-1])+str
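A minimal sketch of one common way to finish this, assuming the goal is simply to resolve each link against its base URL and save anything ending in .pdf; the site URL is a placeholder and the code targets Python 2, which mechanize and this urllib usage imply:

# Hedged sketch (Python 2, mechanize): resolve each link against its base URL
# and download anything that looks like a PDF. URL and filenames are illustrative.
import mechanize
import urllib
from urlparse import urljoin

mech = mechanize.Browser()
mech.set_handle_robots(False)
mech.open("http://www.xyz.com", timeout=30.0)

for link in mech.links():
    absolute = urljoin(link.base_url, link.url)   # handles relative links
    if absolute.lower().endswith(".pdf"):
        filename = absolute.split("/")[-1]
        urllib.urlretrieve(absolute, filename)    # save the PDF locally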

Parsing ajax responses to retrieve final url content in Scrapy?

徘徊边缘 submitted on 2019-12-08 01:39:07
Question: I have the following problem: my scraper starts at a "base" URL. This page contains a dropdown that creates another dropdown via ajax calls, and this cascades 2-3 times until it has all the information needed to get to the "final" page where the actual content I want to scrape is. Rather than clicking things (and having to use Selenium or similar) I use the page's exposed JSON API to mimic this behavior, so instead of clicking dropdowns I simply send a request and read JSON responses that
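For reference, a rough sketch of how such a cascade is usually chained in Scrapy, with each JSON response feeding the next request via a callback; the endpoint URLs and field names below are illustrative assumptions, not the asker's actual API:

# Hedged sketch: chain requests through a JSON API instead of clicking dropdowns.
import json
import scrapy

class CascadeSpider(scrapy.Spider):
    name = "cascade"
    # Hypothetical API endpoints; the real ones come from the site's ajax calls.
    start_urls = ["http://example.com/api/level1"]

    def parse(self, response):
        data = json.loads(response.text)   # response.text is available in newer Scrapy
        for option in data["options"]:
            url = "http://example.com/api/level2?id=%s" % option["id"]
            yield scrapy.Request(url, callback=self.parse_level2)

    def parse_level2(self, response):
        data = json.loads(response.text)
        # ...repeat for as many cascaded dropdowns as the site has...
        yield scrapy.Request(data["final_url"], callback=self.parse_final)

    def parse_final(self, response):
        # The "final" page with the real content; scrape it as usual.
        yield {"url": response.url, "length": len(response.body)}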

File Crawler PHP

隐身守侯 submitted on 2019-12-08 00:58:49
Question: Just wondering how it would be possible to recursively search through a website folder directory (the same one the script is uploaded to), open/read every file, and search for a specific string. For example I might have this:

search.php?string=hello%20world

This would run a process and then output something like "hello world found inside":

httpdocs
/index.php
/contact.php
httpdocs/private/
../private.php
../morestuff.php
../tastey.php
httpdocs/private/love
../../goodness.php

I don't want it to
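The question is about PHP, but the recursive idea itself is easy to show; here is a hedged sketch in Python 3 (the language used elsewhere in this collection) that walks a document root and reports every file containing the search string; the "httpdocs" root and the needle are placeholders:

# Hedged sketch (Python, not PHP): walk a directory tree, read each file,
# and collect the paths of files that contain the search string.
import os

def find_string(root, needle):
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="ignore") as handle:
                    if needle in handle.read():
                        matches.append(path)
            except OSError:
                pass  # unreadable file; skip it
    return matches

print(find_string("httpdocs", "hello world"))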

Scrapy InIt self.initialized() — not initializing

a 夏天 submitted on 2019-12-08 00:21:44
Question: I am trying to use Scrapy to log in to a website in the init, then after confirming login I want to initialize and start the standard crawl through start_urls. I'm not sure what is going wrong, but I get clear through the login and everything confirms, yet parse_item never starts. Any help would be well appreciated. I can get up to "================Successfully logged in=================" but I can not get to "==========================PARSE ITEM==========================" from scrapy.contrib
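For context, a hedged sketch of the old scrapy.contrib InitSpider login pattern this question builds on; the URLs, form field names, and the "Logout" success marker are assumptions for illustration only:

# Hedged sketch of the InitSpider login flow (scrapy.contrib era, Python 2).
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest

class LoginSpider(InitSpider):
    name = "login"
    login_page = "http://example.com/login"
    start_urls = ["http://example.com/protected/"]

    def init_request(self):
        # Runs before start_urls are touched; the chain must end in self.initialized().
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "pass"},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "Logout" in response.body:
            self.log("================Successfully logged in=================")
            # Returning self.initialized() is what lets the normal crawl begin.
            return self.initialized()

    def parse(self, response):
        self.log("==========================PARSE ITEM==========================")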

Nutch does not crawl URLs with query string parameters

…衆ロ難τιáo~ submitted on 2019-12-07 20:22:25
Question: I am using Nutch 1.9 and trying to crawl using individual commands. As can be seen in the output, when going into the 2nd level the generator returned 0 records. Has anyone faced this issue? I have been stuck here for the past 2 days and have searched all possible options. Any leads/help would be much appreciated.

####### INJECT ######
Injector: starting at 2015-04-08 17:36:20
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db
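One likely culprit worth checking, offered as an assumption rather than a certain diagnosis: Nutch's default conf/regex-urlfilter.txt rejects any URL containing query-string characters, so such URLs never survive injection or generation. The default rule looks like this:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

Commenting out the -[?*!@=] line (and re-running inject/generate) allows URLs with ? and = through the filter.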

java html parser for reading javascript generated contents

好久不见. submitted on 2019-12-07 18:40:18
Question: I am using jsoup for reading a web page with the following function:

public Document getDocuement(String url){
    Document doc = null;
    try {
        doc = Jsoup.connect(url).timeout(20*1000).userAgent("Mozilla").get();
    } catch (Exception e) {
        return null;
    }
    return doc;
}

But whenever I try to read a web page that contains javascript-generated content, jsoup does not read that content; i.e., the actual content of the page is loaded by some javascript calls, so it is not present in the page source of
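A static parser like jsoup never executes JavaScript, so the usual workaround is to let a real browser engine render the page first and then parse the rendered HTML. A hedged sketch of that idea, in Python with Selenium rather than Java (a deliberate swap of language and tool, with a placeholder URL):

# Hedged sketch: render the page in a browser engine, then parse the result.
from selenium import webdriver

driver = webdriver.Firefox()          # requires a local Firefox/geckodriver setup
driver.get("http://example.com/js-heavy-page")
rendered_html = driver.page_source    # includes JavaScript-generated content
driver.quit()
# rendered_html can now be handed to any HTML parser.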

htmlunit : An invalid or illegal selector was specified

无人久伴 submitted on 2019-12-07 18:16:40
Question: I am trying to simulate a login with htmlunit. Although I wrote my code according to the examples, I have run into a puzzling problem. Below are some messages I have picked up from the console:

runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: *:x).]
sourceName=[http://user.mofangge.com/Scripts/inc/jquery-1.10.2.js]
line=[1640]
lineSource=[null]
lineOffset=[0]

WARNING: Obsolete content type encountered: 'application/x-javascript'.

screen scraping using Ghost.py

南笙酒味 submitted on 2019-12-07 16:00:27
Here is the simple program, which does not work:

from ghost import Ghost
ghost = Ghost(wait_timeout=40)
page, extra_resources = ghost.open("http://samsung.com/in/consumer/mobile-phone/mobile-phone/smartphone/")
ghost.wait_page_loaded()
n=2;
links = ghost.evaluate("alist=document.getElementsByTagName('a');alist")
print links

The error is:

raise Exception(timeout_message)
Exception: Unable to load requested page

Is there some problem with the program? It seems people are reporting similar issues to yours without really getting any explanation (for example: https://github.com/jeanphix/Ghost.py/issues

scrapy not printing out stacktrace on exception

丶灬走出姿态 submitted on 2019-12-07 15:55:44
Question: Is there a special mechanism to force Scrapy to print out all Python exceptions/stacktraces? I made a simple mistake of getting a list attribute wrong, resulting in an AttributeError which did not show up in full in the logs. What showed up was:

2015-11-15 22:13:50 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 264,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 40342,
 'downloader/response_count': 1,
 'downloader/response
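As a hedged starting point only, it is worth confirming that Scrapy's own logging is not hiding anything; these are standard settings that control how much ends up in the log (settings.py is the usual file name):

# Hedged sketch: standard Scrapy logging settings to keep full output visible.
LOG_ENABLED = True
LOG_LEVEL = 'DEBUG'       # 'DEBUG' is the default; anything stricter can hide tracebacks
LOG_FILE = 'scrapy.log'   # optional: keep the complete run output for inspection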

Python requests error 10060

给你一囗甜甜゛ submitted on 2019-12-07 15:32:32
I have a script that crawls a website. Until today it ran perfectly; however, it does not do so now. It gives me the following error:

Connection Aborted Error(10060 'A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond')

I have been looking into answers and settings but I cannot figure out how to fix this... In IE I am not using any proxy (Connection -> Lan Settings -> Proxy = Disabled). It breaks in this piece of code, sometimes the first run, sometimes the 2nd..
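A hedged sketch of one defensive pattern for this kind of intermittent failure: give requests an explicit timeout and a retry policy so a single unresponsive host does not abort the whole crawl. The URL and retry count here are placeholders:

# Hedged sketch: explicit timeout plus automatic retries with requests.
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=3))
session.mount("https://", HTTPAdapter(max_retries=3))

try:
    response = session.get("http://example.com/page", timeout=30)
    response.raise_for_status()
except requests.exceptions.ConnectionError as err:
    print("Connection failed, try again later: %s" % err)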