web-crawler

Python-Requests (>= 1.*): How to disable keep-alive?

Submitted by 霸气de小男生 on 2019-12-08 14:56:23
Question: I'm trying to program a simple web crawler using the Requests module, and I would like to know how to disable its default keep-alive feature. I tried using: s = requests.session() s.config['keep_alive'] = False However, I get an error stating that the session object has no attribute 'config'. I think this was changed in a newer version, but I cannot seem to find how to do it in the official documentation. The truth is that when I run the crawler on a specific website, it only gets five pages at …
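For reference, the config dict was removed from Session in Requests 1.x; a commonly suggested workaround is to ask the server to close the connection after each response. A minimal sketch (the URL is only a placeholder):

import requests

# The Session.config dict no longer exists in Requests >= 1.x.
# Sending a "Connection: close" header makes the server close the
# connection after each response, effectively disabling keep-alive.
session = requests.Session()
session.headers["Connection"] = "close"

response = session.get("http://example.com")  # placeholder URL
print(response.status_code)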

Why is Google not using a headless browser to crawl client-side content? [closed]

Submitted by 谁说胖子不能爱 on 2019-12-08 14:49:09
Question: [Closed as off-topic for Stack Overflow; it is no longer accepting answers. Closed 6 years ago.] I'm aware of the steps it takes to make a client-side website crawlable: https://developers.google.com/webmasters/ajax-crawling/docs/getting-started?hl=nl I just wonder why Google isn't simply integrating a headless browser into their crawlers to save us the pain of providing HTML snapshots via e.g. NodeJS and a …

Scraping the table data from accuweather website

Submitted by ⅰ亾dé卋堺 on 2019-12-08 14:20:00
Question: Hi, I want to scrape the data from the table. I need all the weather information for all days (screenshot omitted). Please check this link: https://www.accuweather.com/en/in/bengaluru/204108/month/204108?view=table Source code:

<tbody>
  <tr class="pre">
    <th scope="row">Tue <time>5/1</time></th>
    <td>91°/71°</td>
    <td>0 <span class="small">in</span></td>
    <td>0 <span class="small">in</span></td>
    <td> </td>
    <td>93°/71°</td>
  </tr>
  <tr class="pre">
    <th scope="row">Wed <time>5/2</time></th>
    <td>91°/75°</td> …
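A minimal scraping sketch for a table with that structure, assuming the rows are present in the static HTML (AccuWeather may in practice render them with JavaScript or reject scripted clients):

import requests
from bs4 import BeautifulSoup

url = "https://www.accuweather.com/en/in/bengaluru/204108/month/204108?view=table"
headers = {"User-Agent": "Mozilla/5.0"}  # some sites reject the default client

html = requests.get(url, headers=headers).text
soup = BeautifulSoup(html, "html.parser")

# Each <tr> holds one day: the <th> is the date, the <td>s are the readings.
for row in soup.select("tbody tr"):
    day = row.find("th").get_text(" ", strip=True)
    values = [td.get_text(" ", strip=True) for td in row.find_all("td")]
    print(day, values)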

Puppeteer Crawler - Error: net::ERR_TUNNEL_CONNECTION_FAILED

Submitted by 痴心易碎 on 2019-12-08 13:50:30
Question: I currently have Puppeteer running with a proxy on Heroku. Locally the proxy relay works totally fine; on Heroku, however, I get the error Error: net::ERR_TUNNEL_CONNECTION_FAILED. I've set all the .env info in the Heroku config vars, so it is all available. Any idea how I can fix this error and resolve the issue? I currently have: const browser = await puppeteer.launch({ args: [ "--proxy-server=https=myproxy:myproxyport", "--no-sandbox", '--disable-gpu', "--disable-setuid-sandbox", ], timeout: 0, …

Stormcrawler not indexing content with Elasticsearch

Submitted by 泄露秘密 on 2019-12-08 12:35:30
Question: When using StormCrawler it is indexing to Elasticsearch, but not the content. StormCrawler is up to date with 'origin/master' of https://github.com/DigitalPebble/storm-crawler.git and I am using elasticsearch-5.6.4. crawler-conf.yaml has:

indexer.url.fieldname: "url"
indexer.text.fieldname: "content"
indexer.canonical.name: "canonical"

The url and title fields are indexed, but not content. I have been trying to get this working by following Julien's tutorial at https://www.youtube.com/watch?v=xMCuWpPh-4A …

Why does Google index this? [closed]

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-08 10:38:24
Question: [Closed as off-topic for Stack Overflow; it is no longer accepting answers. Closed 9 years ago.] On this webpage: http://www.alvolante.it/news/pompe_benzina_%E2%80%9Ctruccate%E2%80%9D_autostrada-308391044 there is this image: http://immagini.alvolante.it/sites/default/files/imagecache/anteprima_100/images/rifornimento_benzina.jpg Why is this image indexed if robots.txt contains "Disallow: /sites/"?
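Two points worth checking, illustrated with a small sketch: robots.txt rules are per-host, so a Disallow in www.alvolante.it/robots.txt does not cover files served from immagini.alvolante.it; and even a matching Disallow only forbids crawling, not indexing of a URL that Google has discovered through links.

import urllib.robotparser

image_url = ("http://immagini.alvolante.it/sites/default/files/imagecache/"
             "anteprima_100/images/rifornimento_benzina.jpg")

# Check the robots.txt of the host that actually serves the image.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://immagini.alvolante.it/robots.txt")
rp.read()

# True means a polite crawler may fetch the image; either way, blocking the
# fetch does not remove an already-discovered URL from the search index.
print(rp.can_fetch("Googlebot-Image", image_url))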

How to keep Google from indexing the Session ID in the URL?

Submitted by 十年热恋 on 2019-12-08 09:31:18
Question: One of my sites is for old mobile phones that don't accept cookies, so it uses a URL-based Session ID. However, Google is indexing the Session ID, so when my site is searched on Google, all the results come up with a specific Session ID. On most occasions that Session ID is no longer valid by the time a guest clicks on it, but I've had at least one case where a guest clicked on a link from Google and it actually logged them into someone else's account, which is obviously a huge security flaw.
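One common mitigation is to make sure crawlers only ever see session-free URLs (a rel="canonical" link pointing at the clean URL helps as well). A minimal WSGI sketch along those lines; the "sid" parameter name and the bot list are assumptions, not from the question:

from urllib.parse import urlencode, parse_qsl

BOT_MARKERS = ("googlebot", "bingbot", "slurp")

def strip_session_id(app):
    # Redirect known crawlers to the same URL without the session parameter,
    # so only session-free URLs end up in the search index.
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        query = parse_qsl(environ.get("QUERY_STRING", ""))
        cleaned = [(k, v) for k, v in query if k != "sid"]  # "sid" is assumed
        if cleaned != query and any(bot in ua for bot in BOT_MARKERS):
            location = environ.get("PATH_INFO", "/")
            if cleaned:
                location += "?" + urlencode(cleaned)
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return app(environ, start_response)
    return middleware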

Run StormCrawler in local mode or install Apache Storm?

Submitted by 最后都变了- on 2019-12-08 09:13:06
Question: So I'm trying to figure out how to install and set up Storm/StormCrawler with ES and Kibana as described here. I never installed Storm on my local machine, because I've worked with Nutch before and never had to install Hadoop locally... I thought it might be the same with Storm (maybe not?). I'd like to start crawling with StormCrawler instead of Nutch now. It seems that if I just download a release and add its /bin to my PATH, I can only talk to a remote cluster. It seems like I need to set up a …
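For context, a StormCrawler topology can be tried in Storm's local mode without a running Nimbus/Supervisor cluster; the usual route is Flux's --local flag, invoked roughly like this (jar and flux file names are assumptions):

storm jar target/your-topology.jar org.apache.storm.flux.Flux --local crawler.flux

Only the storm client script is needed for this; a full cluster installation only matters once you submit to remote machines.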

How to retrieve the ajax data whose loading requires a mouse click with PhantomJS or other tools

Submitted by 这一生的挚爱 on 2019-12-08 09:08:38
I'm using PhantomJS to retrieve this page: Target Page Link (http://sa.ttu.edu.tw/bin/home.php). The contents I need are under the "行政公告" and "就業徵才公告" tabs. Because this page is written in Chinese, in case you cannot find the tabs, you can use the browser's "find" function to locate "行政公告" and "就業徵才公告". Because the contents under the "行政公告" tab are loaded as the default option, I can easily use the script below to retrieve the page:

var page = require('webpage').create();
var url = 'http://sa.ttu.edu.tw/bin/home.php';
page.open(url, function (status) {
    var js = page.evaluate(function () {
        return document;
    }); …
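Since the question allows "other tools", here is a minimal sketch with Selenium in Python; the link-text selector for the tab is an assumption about the page's markup:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # any WebDriver-backed browser works
driver.get("http://sa.ttu.edu.tw/bin/home.php")

# Click the second tab so its AJAX content is loaded into the DOM.
driver.find_element(By.LINK_TEXT, "就業徵才公告").click()
time.sleep(3)  # crude wait for the AJAX response; WebDriverWait is nicer

html = driver.page_source  # now includes the content behind the clicked tab
driver.quit()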

Scrapy weird output

Submitted by ⅰ亾dé卋堺 on 2019-12-08 08:05:54
Question: I have a Scrapy spider that parses this link. My spider looks as follows:

from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import request
from scrapy.selector import HtmlXPathSelector
from medsynergies.items import MedsynergiesItem

class methodistspider(BaseSpider):
    name = "samplemedsynergies"
    allowed_domains = ['msi-openhire.silkroad.com/epostings/']
    start_urls = ['https://msi-openhire.silkroad.com/epostings/index.cfm?fuseaction …