What is the most elegant way to do screen scraping in node.js?

我怕爱的太早我们不能终老 submitted on 2019-12-03 18:42:01

Question


I'm in the process of hacking together a web app which uses extensive screen scraping in node.js. I feel like I'm fighting against the current at every corner. There must be an easier way to do this. Most notably, two things are irritating:

  1. Cookie propagation. I can pull the 'set-cookie' array out of the response headers, but performing string operations to parse the cookies out of the array feels extremely hackish.

  2. Redirect following. I want each request to follow through redirects when a 302 status code is returned.

I came across two things that looked useful, but which I couldn't use in the end:

  • http://zombie.labnotes.org/, but it doesn't have HTTPS support, so I can't use it.

  • http://www.phantomjs.org/, but couldn't use it because it doesn't (appear to) integrate with node.js. It's also pretty heavyweight for what I'm doing.

Are there any JavaScript screenscraping-esque libraries which propagate cookies, follow redirects, and support HTTPS? Any pointers on how to make this easier?


Answer 1:


I actually have a scraper library now: https://github.com/mikeal/spider. It's quite nice; you can use jQuery and routes.

Feedback is welcome :)




Answer 2:


You may want to check out https://github.com/mikeal/request from mikeal. I just spoke to him in the chatroom, and he says that it does not handle cookies at the moment, but you can write a submodule to handle them in the meantime.
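As a rough idea of what such a cookie submodule might look like, here is a minimal sketch that uses no library at all. The names `storeCookies` and `cookieHeader` are hypothetical and are not part of request; the sketch keeps only name/value pairs and ignores attributes such as Path, Domain, and Expires, so treat it as a starting point rather than a full cookie jar.

```javascript
// Hypothetical helpers for illustration; not part of the request library.

// Merge the 'set-cookie' header array from a response into a plain
// { name: value } jar. Attributes (Path, Expires, ...) are ignored here.
function storeCookies(jar, setCookieHeaders) {
  for (const header of setCookieHeaders || []) {
    const pair = header.split(';')[0];   // "name=value" is the first segment
    const eq = pair.indexOf('=');
    if (eq < 1) continue;                // skip entries without a name
    const name = pair.slice(0, eq).trim();
    const value = pair.slice(eq + 1).trim();
    jar[name] = value;
  }
  return jar;
}

// Build the value for the outgoing 'Cookie' request header.
function cookieHeader(jar) {
  return Object.keys(jar)
    .map((name) => `${name}=${jar[name]}`)
    .join('; ');
}
```

In use, you would call `storeCookies(jar, res.headers['set-cookie'])` after each response and send `cookieHeader(jar)` as the `Cookie` header on the next request.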

As for redirects, it handles them beautifully :)
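For comparison, here is a sketch of what redirect following involves when done by hand with Node's core `http` and `https` modules. This is not how request implements it; `redirectTarget` and `get` are hypothetical names, and the redirect limit is an assumption added to avoid endless loops.

```javascript
const http = require('http');
const https = require('https');

// Return the absolute URL to follow, or null if the response is not a
// redirect. Relative Location values are resolved against the current URL.
function redirectTarget(currentUrl, statusCode, headers) {
  const isRedirect = [301, 302, 303, 307, 308].includes(statusCode);
  if (!isRedirect || !headers.location) return null;
  return new URL(headers.location, currentUrl).toString();
}

// Issue GET requests until a non-redirect response arrives, then hand
// that response to callback(err, res).
function get(url, maxRedirects, callback) {
  const client = url.startsWith('https:') ? https : http;
  client.get(url, (res) => {
    const next = redirectTarget(url, res.statusCode, res.headers);
    if (next === null) return callback(null, res);
    res.resume(); // discard the redirect body so the socket is released
    if (maxRedirects <= 0) return callback(new Error('Too many redirects'));
    get(next, maxRedirects - 1, callback);
  }).on('error', callback);
}
```

A call such as `get(startUrl, 5, (err, res) => { ... })` would then follow up to five redirects before giving up.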




Answer 3:


It turns out someone made a phantomjs module for node.js:

https://github.com/sgentle/phantomjs-node

While phantom is fairly heavy, it also supports SSL, cookies, and everything else a typical browser supports (since it is a webkit browser, after all).

Give it a shot, it may be exactly what you are looking for.



Source: https://stackoverflow.com/questions/5441265/what-is-the-most-elegant-way-to-do-screen-scraping-in-node-js
