web-crawler

Adding URL parameter to Nutch/Solr index and search results

Submitted by 爷,独闯天下 on 2019-12-03 03:55:25
I can't find any hint on how to set up Nutch so that it does NOT filter/remove my URL parameters. I want to crawl and index pages where a lot of content is hidden behind the same base URL (like /news.jsp?id=1, /news.jsp?id=2, /news.jsp?id=3 and so on). The regex-normalize.xml only removes redundant stuff from the URL (like the session id and a trailing ?), and the regex-urlfilter.txt seems to have a wildcard for my host (+^http://$myHost/). The crawling works fine so far. Any ideas? Cheers, mana EDIT: Part of the solution is hidden here: configuring nutch regex-normalize.xml # skip URLs containing certain
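The usual culprit, assuming the stock Nutch configuration, is a default rule in conf/regex-urlfilter.txt that drops any URL containing query characters. A sketch of a fix (rules are applied in order, so an explicit accept for the parameterized pages has to come before, or replace, the skip rule; the news.jsp pattern is taken from the question):

    # conf/regex-urlfilter.txt
    # explicitly accept the parameterized news pages
    +^http://$myHost/news\.jsp\?id=.*

    # default rule that discards ?id=... URLs -- comment it out or narrow it
    # skip URLs containing certain characters as probable queries, etc.
    # -[?*!@=]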

How to crawl foursquare check-in data?

Submitted by 前提是你 on 2019-12-03 03:48:10
Is it possible to crawl check-in data from foursquare in a greedy way (even if I don't have a friendship with all the users), just like crawling publicly available twitter messages? If you have any experience or suggestions, please share. Thanks. If you have publicly available tweets containing links to foursquare, you can resolve the foursquare short links (4sq.com/XXXXXX) by making a HEAD request. The HEAD request will return a URL with a check-in ID and a signature. You can use those two values to retrieve a check-in object via the foursquare API /checkins/ endpoint. You're only allowed to
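A minimal sketch of that resolution step in Python, assuming the requests library; the exact shape of the resolved URL (check-in ID as the last path segment, signature in an 's' query parameter) is an assumption to verify against what the redirect actually returns:

    import urllib.parse
    import requests

    short_link = 'https://4sq.com/XXXXXX'   # placeholder short link taken from a tweet

    # HEAD request, following redirects; the final URL carries the check-in info.
    resp = requests.head(short_link, allow_redirects=True)
    resolved = urllib.parse.urlparse(resp.url)

    checkin_id = resolved.path.rstrip('/').split('/')[-1]                  # assumed: ID is the last path segment
    signature = urllib.parse.parse_qs(resolved.query).get('s', [None])[0]  # assumed: signature in the 's' parameter

    # These two values then go to the foursquare API /checkins/ endpoint.
    print(checkin_id, signature)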

Python 3 - Add custom headers to urllib.request Request

Submitted by ↘锁芯ラ on 2019-12-03 03:45:10
In Python 3, the following code obtains the HTML source for a webpage. import urllib.request url = "https://docs.python.org/3.4/howto/urllib2.html" response = urllib.request.urlopen(url) response.read() How can I add the following custom header to the request when using urllib.request? headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)' } The request headers can be customized by first creating a request object and then supplying it to urlopen. import urllib.request url = "https://docs.python.org/3.4/howto/urllib2.html" hdr = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64;
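The answer's snippet is cut off above; a complete version of the same pattern with the standard-library API looks like this:

    import urllib.request

    url = "https://docs.python.org/3.4/howto/urllib2.html"
    hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

    # Build a Request object carrying the custom headers, then open it.
    req = urllib.request.Request(url, headers=hdr)
    response = urllib.request.urlopen(req)
    html = response.read()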

How to properly use Rules, restrict_xpaths to crawl and parse URLs with scrapy?

Submitted by 感情迁移 on 2019-12-03 03:39:14
I am trying to program a crawl spider to crawl RSS feeds of a website and then parse the meta tags of the articles. The first RSS page is a page that displays the RSS categories. I managed to extract the link because the <a> tag is inside a <td class="xmlLink"> tag. It looks like this: <tr> <td class="xmlLink"> <a href="http://feeds.example.com/subject1">subject1</a> </td> </tr> <tr> <td class="xmlLink"> <a href="http://feeds.example.com/subject2">subject2</a> </td> </tr> Once you click that link it brings you to the articles for that RSS category, which look like this: <li class="regularitem"> <h4 class="itemtitle"> <a
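A minimal CrawlSpider sketch for that page structure, assuming the class names shown above; the start URL and the meta tag names are placeholders:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class RssMetaSpider(CrawlSpider):
        name = 'rss_meta'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/rss-categories']  # hypothetical category page

        rules = (
            # Follow the category links found inside <td class="xmlLink">.
            Rule(LinkExtractor(restrict_xpaths='//td[@class="xmlLink"]'), follow=True),
            # Parse the article links found inside <h4 class="itemtitle">.
            Rule(LinkExtractor(restrict_xpaths='//h4[@class="itemtitle"]'), callback='parse_article'),
        )

        def parse_article(self, response):
            # Pull whatever meta tags the articles actually carry.
            yield {
                'url': response.url,
                'description': response.xpath('//meta[@name="description"]/@content').get(),
            }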

Identifying large bodies of text via BeautifulSoup or other python based extractors

Submitted by 孤街浪徒 on 2019-12-03 03:21:58
Question: Given some random news article, I want to write a web crawler to find the largest body of text present and extract it. The intention is to extract the physical news article on the page. The original plan was to use a BeautifulSoup findAll(True) command (which means extract all html tags) and to sort each tag by its .getText() value. EDIT: don't use BeautifulSoup for this html work; use the lxml library, it's python based and much faster than BeautifulSoup. But this won't work for most pages, like the one
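One rough heuristic in that spirit, sketched with lxml: credit each element with the length of the paragraph text it directly contains and return the best-scoring block. The scoring rule is an assumption, not a general-purpose extractor:

    from collections import defaultdict
    import lxml.html

    def largest_text_block(html):
        doc = lxml.html.fromstring(html)
        scores = defaultdict(int)
        # Credit each parent element with the length of the <p> text it holds.
        for p in doc.iter('p'):
            parent = p.getparent()
            if parent is not None:
                scores[parent] += len(p.text_content().strip())
        if not scores:
            return None
        return max(scores, key=scores.get).text_content()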

Best solution to host a crawler? [closed]

Submitted by 烂漫一生 on 2019-12-03 02:58:14
Closed. This question is off-topic and is not currently accepting answers. I have a crawler that crawls a few different domains for new posts/content. The total amount of content is hundreds of thousands of pages, and a lot of new content is added each day. So to be able to crawl through all this content, I need my crawler to be crawling 24/7. Currently I host the crawler script on the same server as the site the crawler is adding the content to, and I'm only able to run a cronjob to run

Rotating Proxies for web scraping

Submitted by 爱⌒轻易说出口 on 2019-12-03 02:52:18
Question: I've got a python web crawler and I want to distribute the download requests among many different proxy servers, probably running squid (though I'm open to alternatives). For example, it could work in a round-robin fashion, where request1 goes to proxy1, request2 to proxy2, and eventually loops back around. Any idea how to set this up? To make it harder, I'd also like to be able to dynamically change the list of available proxies, bring some down, and add others. If it matters, IP addresses
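A minimal round-robin sketch in Python, assuming the requests library and placeholder proxy addresses; the lock-plus-cycle wrapper is just one way to let the proxy list be replaced while the crawler is running:

    import itertools
    import threading
    import requests

    class ProxyRotator:
        # Round-robin proxy pool whose proxy list can be swapped at runtime.
        def __init__(self, proxies):
            self._lock = threading.Lock()
            self.set_proxies(proxies)

        def set_proxies(self, proxies):
            # Replace the whole pool atomically (dynamic add/remove).
            with self._lock:
                self._cycle = itertools.cycle(list(proxies))

        def next_proxy(self):
            with self._lock:
                return next(self._cycle)

        def get(self, url, **kwargs):
            proxy = self.next_proxy()
            return requests.get(url, proxies={'http': proxy, 'https': proxy}, **kwargs)

    rotator = ProxyRotator(['http://proxy1.example:3128', 'http://proxy2.example:3128'])
    # response = rotator.get('http://example.com/some/page')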

How to crawl a website/extract data into database with python?

Submitted by 人走茶凉 on 2019-12-03 02:51:59
Question: I'd like to build a webapp to help other students at my university create their schedules. To do that I need to crawl the master schedules (one huge html page) as well as a link to a detailed description for each course into a database, preferably in python. Also, I need to log in to access the data. How would that work? What tools/libraries can/should I use? Are there good tutorials on that? How do I best deal with binary data (e.g. pretty pdf)? Are there already good solutions for that? Answer 1
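A rough sketch of one common stack for this (requests for the login session, BeautifulSoup for parsing, sqlite3 for storage); the URLs, form field names, and CSS selector are hypothetical and have to be adapted to the real site:

    import sqlite3
    import requests
    from bs4 import BeautifulSoup

    LOGIN_URL = 'https://example.edu/login'               # hypothetical
    SCHEDULE_URL = 'https://example.edu/master-schedule'  # hypothetical

    session = requests.Session()
    session.post(LOGIN_URL, data={'username': 'me', 'password': 'secret'})  # field names assumed

    soup = BeautifulSoup(session.get(SCHEDULE_URL).text, 'html.parser')

    conn = sqlite3.connect('schedule.db')
    conn.execute('CREATE TABLE IF NOT EXISTS courses (title TEXT, detail_url TEXT)')
    for link in soup.select('td.course a'):               # selector is a guess at the markup
        conn.execute('INSERT INTO courses VALUES (?, ?)',
                     (link.get_text(strip=True), link['href']))
    conn.commit()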

Scrapy - logging to file and stdout simultaneously, with spider names

Submitted by 我的未来我决定 on 2019-12-03 02:28:54
Question: I've decided to use the Python logging module because the messages generated by Twisted on std error are too long, and I want INFO-level meaningful messages, such as those generated by the StatsCollector, to be written to a separate log file while maintaining the on-screen messages. from twisted.python import log import logging logging.basicConfig(level=logging.INFO, filemode='w', filename='buyerlog.txt') observer = log.PythonLoggingObserver() observer.start() Well, this is fine, I've got my
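One way to get both destinations at once is to skip basicConfig and attach a FileHandler plus a StreamHandler to the root logger, while still routing Twisted's messages through PythonLoggingObserver; the %(name)s field then shows whichever named logger emitted the message. A minimal sketch:

    import logging
    import sys
    from twisted.python import log

    formatter = logging.Formatter('%(asctime)s %(name)s %(levelname)s: %(message)s')

    file_handler = logging.FileHandler('buyerlog.txt', mode='w')
    file_handler.setFormatter(formatter)

    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setFormatter(formatter)

    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.addHandler(file_handler)
    root.addHandler(console_handler)

    # Keep forwarding Twisted's log messages into the logging module.
    observer = log.PythonLoggingObserver()
    observer.start()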

How can I safely check is node empty or not? (Symfony 2 Crawler)

Submitted by 独自空忆成欢 on 2019-12-03 01:21:21
When I try to take some nonexistent content from a page I get this error: The current node list is empty. 500 Internal Server Error - InvalidArgumentException How can I safely check whether this content exists or not? Here are some examples that do not work: if($crawler->filter('.PropertyBody')->eq(2)->text()){ // bla bla } if(!empty($crawler->filter('.PropertyBody')->eq(2)->text())){ // bla bla } if(($crawler->filter('.PropertyBody')->eq(2)->text()) != null){ // bla bla } THANKS, I helped myself with: $count = $crawler->filter('.PropertyBody')->count(); if($count > 2){ $marks = $crawler->filter('