web-crawler

BingPreview invalidates one-time links in email

Submitted by China☆狼群 on 2019-12-05 07:13:52
It seems that Outlook.com uses the BingPreview crawler to crawl links in emails, so the one-time links are marked as used/expired after the email is opened and before the user gets a chance to use them. I tried adding rel="nofollow" to the <a> tags, but without success. How can I block the crawler for these links in the email? Thanks. I did the same. $user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : ''; // Deny access for the BingPreview bot, used by outlook.com on links in e-mails, and Slackbot if (strpos($user_agent, 'BingPreview') !== false || strpos($user_agent,

Re-crawling websites fast

Submitted by 北战南征 on 2019-12-05 07:04:01
Question: I am developing a system that has to track the content of a few portals and check for changes every night (for example, download and index new sites that have been added during the day). The content of these portals will be indexed for searching. The problem is re-crawling these portals: the first crawl of a portal takes very long (examples of portals: www.onet.pl, www.bankier.pl, www.gazeta.pl) and I want to re-crawl them faster (as fast as possible), for example by checking the date of modification, but I
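
A minimal sketch (not from the original thread) of one common way to speed up re-crawling: send conditional requests with the validators saved from the previous crawl, so unchanged pages come back as 304 and can be skipped. This assumes the portals return usable ETag/Last-Modified headers; the URLs and the in-memory store are placeholders.

```python
import requests

# Validators remembered from the previous crawl, keyed by URL (placeholder store;
# a real crawler would persist these alongside the search index).
validators = {}

def refetch(url):
    """Re-fetch a page only if the server says it changed since the last crawl."""
    headers = {}
    cached = validators.get(url, {})
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]

    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None  # unchanged: no need to re-parse or re-index

    validators[url] = {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
    }
    return resp.text  # new or changed content: hand it to the indexer

if __name__ == "__main__":
    print(refetch("https://www.onet.pl/") is not None)
```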

Mechanize form submission causes 'Assertion Error' in response when .read() is attempted

Submitted by 一笑奈何 on 2019-12-05 07:01:09
Question: I am writing a web-crawling program with Python and am unable to log in using mechanize. The form on the site looks like: <form method="post" action="PATLogon"> <h2 align="center"><img src="/myaladin/images/aladin_logo_rd.gif"></h2> <!-- ALADIN Request parameters --> <input type=hidden name=req value="db"> <input type=hidden name=key value="PROXYAUTH"> <input type=hidden name=url value="http://eebo.chadwyck.com/search"> <input type=hidden name=lib value="8"> <table> <tr><td><b>Last Name:</b></td>
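
A minimal sketch (not from the original post) of logging in to such a form with mechanize. The form has no name attribute, so it is selected by its action; the login-page URL and the visible field names are assumptions, since the snippet above is cut off after the "Last Name" label, while the hidden inputs (req, key, url, lib) are left for mechanize to submit untouched.

```python
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # the login page may be disallowed by robots.txt
br.open("https://example.edu/myaladin/login")  # placeholder login-page URL

# No form name in the HTML, so pick the form whose action ends with "PATLogon".
br.select_form(predicate=lambda form: form.action and form.action.endswith("PATLogon"))

# Fill only the user-visible fields; these control names are assumptions.
br["last_name"] = "SMITH"
br["barcode"] = "12345678901234"

response = br.submit()
html = response.read()  # should now be the post-login page rather than an error
print(html[:200])
```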

Database for a web crawler in Python?

Submitted by 雨燕双飞 on 2019-12-05 06:41:32
Question: Hi, I'm writing a web crawler in Python to extract news articles from news websites like nytimes.com. I want to know what would be a good DB to use as a backend for this project? Thanks in advance! Answer 1: This could be a great project to use a document database like CouchDB, MongoDB, or SimpleDB. MongoDB has a hosted solution: http://mongohq.com. There is also a binding for Python (PyMongo). SimpleDB is a great choice if you are hosting this on Amazon Web Services. CouchDB is an open source
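
A minimal sketch (assuming the MongoDB option from the answer) of storing crawled articles with PyMongo; the connection string, database/collection names, and document fields are placeholders.

```python
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
articles = client["news_crawler"]["articles"]      # placeholder db/collection names

# Unique index on the URL so a re-crawled article updates the existing document.
articles.create_index([("url", ASCENDING)], unique=True)

def save_article(url, title, body):
    """Insert a new article or refresh an existing one."""
    articles.update_one(
        {"url": url},
        {"$set": {
            "title": title,
            "body": body,
            "fetched_at": datetime.now(timezone.utc),
        }},
        upsert=True,
    )

save_article("https://www.nytimes.com/example-article", "Example headline", "Article text...")
```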

Crawler4j with authentication

Submitted by 送分小仙女□ on 2019-12-05 06:33:56
Question: I'm trying to run crawler4j against a personal Redmine instance for testing purposes. I want to authenticate and crawl several levels of depth in the application. I followed this tutorial from the crawler4j FAQ and created the following snippet: import edu.uci.ics.crawler4j.crawler.Page; import edu.uci.ics.crawler4j.crawler.WebCrawler; import edu.uci.ics.crawler4j.parser.HtmlParseData; import edu.uci.ics.crawler4j.url.WebURL; public class CustomWebCrawler extends WebCrawler{ @Override public void visit

How to increase the number of documents fetched by the Apache Nutch crawler

Submitted by 巧了我就是萌 on 2019-12-05 06:22:04
Question: I am using Apache Nutch 2.3 for crawling. There were about 200 URLs in the seed list at the start. Now, as time has elapsed, the number of documents crawled is decreasing, or at most staying the same as at the start. How can I configure Nutch so that the number of documents crawled increases? Is there any parameter that can be used to control the number of documents? Second, how can I count the number of documents crawled per day by Nutch? Answer 1: One crawl cycle consists of four steps: Generate, Fetch, Parse and Update DB. for

How to crawl a website after logging in with a username and password

Submitted by 北慕城南 on 2019-12-05 06:01:29
Question: I have written a web crawler that crawls a website by keyword, but I want to log in to a specified website and filter information by keyword. How can I achieve that? I am posting the code I have done so far. public class DB { public Connection conn = null; public DB() { try { Class.forName("com.mysql.jdbc.Driver"); String url = "jdbc:mysql://localhost:3306/test"; conn = DriverManager.getConnection(url, "root","root"); System.out.println("conn built"); } catch (SQLException e) { e.printStackTrace(); }

Multi-request pycurl running forever (infinite loop)

Submitted by 别来无恙 on 2019-12-05 04:39:28
Question: I want to perform multiple requests using pycurl. The code is: m.add_handle(handle) requests.append((handle, response)) # Perform multi-request. SELECT_TIMEOUT = 1.0 num_handles = len(requests) while num_handles: ret = m.select(SELECT_TIMEOUT) if ret == -1: continue while 1: ret, num_handles = m.perform() print "In while loop of multicurl" if ret != pycurl.E_CALL_MULTI_PERFORM: break The thing is, this loop takes forever to run; it is not terminating. Can anyone tell me what it does and what are the
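
A minimal sketch (not from the original post; URLs are placeholders) of the usual CurlMulti pattern: call perform() until it stops asking to be called again, drain finished transfers with info_read() and remove their handles, and only then select() for more activity. A per-handle TIMEOUT keeps a stalled server from leaving the handle count stuck above zero forever.

```python
from io import BytesIO
import pycurl

urls = ["http://example.com/", "http://example.org/"]  # placeholder URLs

m = pycurl.CurlMulti()
buffers = {}
for url in urls:
    c = pycurl.Curl()
    buf = BytesIO()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buf)
    c.setopt(pycurl.CONNECTTIMEOUT, 10)
    c.setopt(pycurl.TIMEOUT, 30)  # hard cap so a single transfer cannot hang forever
    m.add_handle(c)
    buffers[c] = buf

remaining = len(urls)
while remaining:
    # Let libcurl make progress; repeat while it asks to be called again.
    while True:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # Drain transfers that have finished, successfully or with an error.
    _, completed, failed = m.info_read()
    for c in completed:
        m.remove_handle(c)
        remaining -= 1
    for c, errno, errmsg in failed:
        print("failed:", errno, errmsg)
        m.remove_handle(c)
        remaining -= 1
    if remaining:
        m.select(1.0)  # wait for socket activity instead of busy-looping
```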

Callback for redirected requests in Scrapy

Submitted by 久未见 on 2019-12-05 04:36:08
Question: I am trying to scrape using the Scrapy framework. Some requests are redirected, but the callback function set in start_requests is not called for these redirected URL requests, although it works fine for the non-redirected ones. I have the following code in the start_requests function: for user in users: yield scrapy.Request(url=userBaseUrl+str(user['userId']),cookies=cookies,headers=headers,dont_filter=True,callback=self.parse_p) But this self.parse_p is called only for the non-302 requests. Answer 1: I
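
A minimal sketch (not the accepted answer) of one common workaround: ask Scrapy's RedirectMiddleware to hand 301/302 responses straight to the callback via the dont_redirect and handle_httpstatus_list meta keys, then follow the Location header manually. The users, userBaseUrl, cookies, and headers values below are placeholders standing in for the ones in the question.

```python
import scrapy

users = [{"userId": 1}, {"userId": 2}]        # placeholder
userBaseUrl = "https://example.com/profile/"  # placeholder
cookies, headers = {}, {}                     # placeholder

class UserSpider(scrapy.Spider):
    name = "users"

    def start_requests(self):
        for user in users:
            yield scrapy.Request(
                url=userBaseUrl + str(user["userId"]),
                cookies=cookies,
                headers=headers,
                dont_filter=True,
                # Deliver 301/302 responses to the callback instead of auto-following them.
                meta={"dont_redirect": True, "handle_httpstatus_list": [301, 302]},
                callback=self.parse_p,
            )

    def parse_p(self, response):
        if response.status in (301, 302):
            # Follow the redirect ourselves so parse_p also sees the final page.
            location = response.headers.get("Location", b"").decode()
            yield response.follow(location, callback=self.parse_p, dont_filter=True)
            return
        self.logger.info("Parsing %s", response.url)
        # ... normal parsing of the non-redirected response ...
```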

Is there any way of making JSON data readable by a Google spider?

Submitted by China☆狼群 on 2019-12-05 03:37:59
Is it possible to make JSON data readable by a Google spider? Say, for instance, that I have a JSON feed that contains the data for an e-commerce site. This JSON data is used to populate a human-readable page in the user's browser. (I.e., the translation from JSON data to the displayed page is done inside the user's browser; not my choice, just what I've been given to work with: it's an old legacy CGI application, not an actual server-side scripting language.) My concern here is that the Google spiders will not be able to pick up / directly link to the item in question when a user clicks on it in