web-crawler

Python-Requests (>= 1.*): How to disable keep-alive?

Submitted by 霸气de小男生 on 2019-12-08 14:56:23
Question: I'm trying to program a simple web crawler using the Requests module, and I would like to know how to disable its default keep-alive feature. I tried using: s = requests.session() s.config['keep_alive'] = False However, I get an error stating that the session object has no attribute 'config'. I think this was changed in a newer version, but I cannot seem to find how to do it in the official documentation. The truth is that when I run the crawler on a specific website, it only gets five pages at …
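For reference, the config dict was removed from Session in Requests 1.x; a commonly suggested workaround is to ask the server to close the connection after each response. A minimal sketch (the URL is only a placeholder):

import requests

# The Session.config dict no longer exists in Requests >= 1.x.
# Sending a "Connection: close" header makes the server close the
# connection after each response, effectively disabling keep-alive.
session = requests.Session()
session.headers["Connection"] = "close"

response = session.get("http://example.com")  # placeholder URL
print(response.status_code)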

Why is Google not using a headless browser to crawl client-side content? [closed]

Submitted by 谁说胖子不能爱 on 2019-12-08 14:49:09
Question: [Closed as off-topic for Stack Overflow; it is no longer accepting answers. Closed 6 years ago.] I'm aware of the steps it takes to make a client-side website crawlable: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started?hl=nl I just wonder why Google isn't simply integrating a headless browser into their crawlers to save us the pain of providing HTML snapshots via e.g. NodeJS and a …

Scraping the table data from accuweather website

Submitted by ⅰ亾dé卋堺 on 2019-12-08 14:20:00
Question: Hi, I want to scrape the data from the table. I need all the weather information for all days (screenshot omitted). Please check this link: https://www.accuweather.com/en/in/bengaluru/204108/month/204108?view=table Source code:

<tbody>
  <tr class="pre">
    <th scope="row">Tue <time>5/1</time></th>
    <td>91°/71°</td>
    <td>0 <span class="small">in</span></td>
    <td>0 <span class="small">in</span></td>
    <td> </td>
    <td>93°/71°</td>
  </tr>
  <tr class="pre">
    <th scope="row">Wed <time>5/2</time></th>
    <td>91°/75°</td> …
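A minimal scraping sketch for a table with that structure, assuming the rows are present in the static HTML (AccuWeather may in practice render them with JavaScript or reject scripted clients):

import requests
from bs4 import BeautifulSoup

url = "https://www.accuweather.com/en/in/bengaluru/204108/month/204108?view=table"
headers = {"User-Agent": "Mozilla/5.0"}  # some sites reject the default client

html = requests.get(url, headers=headers).text
soup = BeautifulSoup(html, "html.parser")

# Each <tr> holds one day: the <th> is the date, the <td>s are the readings.
for row in soup.select("tbody tr"):
    day = row.find("th").get_text(" ", strip=True)
    values = [td.get_text(" ", strip=True) for td in row.find_all("td")]
    print(day, values)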

Puppeteer Crawler - Error: net::ERR_TUNNEL_CONNECTION_FAILED

Submitted by 痴心易碎 on 2019-12-08 13:50:30
Question: I currently have Puppeteer running with a proxy on Heroku. Locally the proxy relay works totally fine; on Heroku, however, I get the error Error: net::ERR_TUNNEL_CONNECTION_FAILED. I've set all the .env info in the Heroku config vars, so it is all available. Any idea how I can fix this error and resolve the issue? I currently have: const browser = await puppeteer.launch({ args: [ "--proxy-server=https=myproxy:myproxyport", "--no-sandbox", '--disable-gpu', "--disable-setuid-sandbox", ], timeout: 0, …

Stormcrawler not indexing content with Elasticsearch

Submitted by 泄露秘密 on 2019-12-08 12:35:30
Question: When using StormCrawler it is indexing to Elasticsearch, but not the content. StormCrawler is up to date with 'origin/master' of https://github.com/DigitalPebble/storm-crawler.git and I am using elasticsearch-5.6.4. crawler-conf.yaml has:

indexer.url.fieldname: "url"
indexer.text.fieldname: "content"
indexer.canonical.name: "canonical"

The url and title fields are indexed, but not content. I have been trying to get this working by following Julien's tutorial at https://www.youtube.com/watch?v=xMCuWpPh-4A …

Why does Google index this? [closed]

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-08 10:38:24
Question: [Closed as off-topic for Stack Overflow; it is no longer accepting answers. Closed 9 years ago.] On this webpage: http://www.alvolante.it/news/pompe_benzina_%E2%80%9Ctruccate%E2%80%9D_autostrada-308391044 there is this image: http://immagini.alvolante.it/sites/default/files/imagecache/anteprima_100/images/rifornimento_benzina.jpg Why is this image indexed if robots.txt contains "Disallow: /sites/"?
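Two points worth checking, illustrated with a small sketch: robots.txt rules are per-host, so a Disallow in www.alvolante.it/robots.txt does not cover files served from immagini.alvolante.it; and even a matching Disallow only forbids crawling, not indexing of a URL that Google has discovered through links.

import urllib.robotparser

image_url = ("http://immagini.alvolante.it/sites/default/files/imagecache/"
             "anteprima_100/images/rifornimento_benzina.jpg")

# Check the robots.txt of the host that actually serves the image.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://immagini.alvolante.it/robots.txt")
rp.read()

# True means a polite crawler may fetch the image; either way, blocking the
# fetch does not remove an already-discovered URL from the search index.
print(rp.can_fetch("Googlebot-Image", image_url))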

How to keep Google from indexing the Session ID in the URL?

Submitted by 十年热恋 on 2019-12-08 09:31:18
Question: One of my sites is for old mobile phones that don't accept cookies, so it uses a URL-based Session ID. However, Google is indexing the Session ID, so when my site is searched on Google, all the results come up with a specific Session ID. On most occasions that Session ID is no longer valid by the time a guest clicks on it, but I've had at least one case where a guest clicked on a link from Google and it actually logged them into someone else's account, which is obviously a huge security flaw.
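One common mitigation is to make sure crawlers only ever see session-free URLs (a rel="canonical" link pointing at the clean URL helps as well). A minimal WSGI sketch along those lines; the "sid" parameter name and the bot list are assumptions, not from the question:

from urllib.parse import urlencode, parse_qsl

BOT_MARKERS = ("googlebot", "bingbot", "slurp")

def strip_session_id(app):
    # Redirect known crawlers to the same URL without the session parameter,
    # so only session-free URLs end up in the search index.
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        query = parse_qsl(environ.get("QUERY_STRING", ""))
        cleaned = [(k, v) for k, v in query if k != "sid"]  # "sid" is assumed
        if cleaned != query and any(bot in ua for bot in BOT_MARKERS):
            location = environ.get("PATH_INFO", "/")
            if cleaned:
                location += "?" + urlencode(cleaned)
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return app(environ, start_response)
    return middleware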

Run StormCrawler in local mode or install Apache Storm?

Submitted by 最后都变了- on 2019-12-08 09:13:06
Question: So I'm trying to figure out how to install and set up Storm/StormCrawler with ES and Kibana as described here. I never installed Storm on my local machine, because I've worked with Nutch before and never had to install Hadoop locally... I thought it might be the same with Storm (maybe not?). I'd like to start crawling with StormCrawler instead of Nutch now. It seems that if I just download a release and add its /bin to my PATH, I can only talk to a remote cluster. It seems like I need to set up a …
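For context, a StormCrawler topology can be tried in Storm's local mode without a running Nimbus/Supervisor cluster; the usual route is Flux's --local flag, invoked roughly like this (jar and flux file names are assumptions):

storm jar target/your-topology.jar org.apache.storm.flux.Flux --local crawler.flux

Only the storm client script is needed for this; a full cluster installation only matters once you submit to remote machines.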

How to retrieve the ajax data whose loading requires a mouse click with PhantomJS or other tools

Submitted by 这一生的挚爱 on 2019-12-08 09:08:38
I'm using PhantomJS to retrieve this page: Target Page Link (http://sa.ttu.edu.tw/bin/home.php). The contents I need are under the "行政公告" and "就業徵才公告" tabs. Because this page is written in Chinese, in case you cannot find the tabs, you can use the browser's "find" function to locate "行政公告" and "就業徵才公告". Because the contents under the "行政公告" tab are loaded as the default option, I can easily use the script below to retrieve the page:

var page = require('webpage').create();
var url = 'http://sa.ttu.edu.tw/bin/home.php';
page.open(url, function (status) {
    var js = page.evaluate(function () {
        return document;
    }); …
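Since the question allows "other tools", here is a minimal sketch with Selenium in Python; the link-text selector for the tab is an assumption about the page's markup:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # any WebDriver-backed browser works
driver.get("http://sa.ttu.edu.tw/bin/home.php")

# Click the second tab so its AJAX content is loaded into the DOM.
driver.find_element(By.LINK_TEXT, "就業徵才公告").click()
time.sleep(3)  # crude wait for the AJAX response; WebDriverWait is nicer

html = driver.page_source  # now includes the content behind the clicked tab
driver.quit()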

Scrapy weird output

Submitted by ⅰ亾dé卋堺 on 2019-12-08 08:05:54
Question: I have a Scrapy spider that parses this link. My spider looks as follows:

from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import request
from scrapy.selector import HtmlXPathSelector
from medsynergies.items import MedsynergiesItem

class methodistspider(BaseSpider):
    name = "samplemedsynergies"
    allowed_domains = ['msi-openhire.silkroad.com/epostings/']
    start_urls = ['https://msi-openhire.silkroad.com/epostings/index.cfm?fuseaction …