Browser-based client-side scraping

Asked by 失恋的感觉 on 2020-12-31 02:05

I wonder if it's possible to scrape an external (cross-domain) page through the user's IP?

For a shopping comparison site, I need to scrape pages of an e-com site b

4 Answers
  •  遥遥无期
    2020-12-31 02:32

    Basically, browsers are designed to prevent exactly this (the same-origin policy)…

    The solution everyone thinks about first:

    jQuery/JavaScript: accessing contents of an iframe

    But this will not work in most cases with any "recent" browser (less than 10 years old), because cross-origin frame access is blocked.
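
    Why the iframe approach fails can be sketched with the same-origin rule itself: two URLs share an origin only when scheme, host, and port all match, and the browser refuses frame access otherwise. A minimal check (using the standard `URL` API; the function name is illustrative):

    ```javascript
    // Sketch of the same-origin rule that blocks iframe scraping.
    // Two URLs share an origin only if scheme, host, and port all match.
    function sameOrigin(a, b) {
      const ua = new URL(a), ub = new URL(b);
      return ua.protocol === ub.protocol &&
             ua.hostname === ub.hostname &&
             ua.port === ub.port;
    }

    // In a browser, reading a cross-origin frame is blocked:
    //   iframe.contentDocument   // null, or a SecurityError on access
    // so a try/catch around the access is the usual feature test.
    ```

    For a comparison site, `sameOrigin('https://shop.example/p/1', 'https://comparison.example/')` is false, so the framed shop page's DOM is simply unreachable from your script.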

    Alternatives are:

    • Using the official APIs of the server (if any)
    • Trying to find out whether the server provides a JSONP service (good luck)
    • Being on the same domain, trying cross-site scripting (if possible; not very ethical)
    • Using a trusted relay or proxy (but this will still use your own IP)
    • Pretending you are the Google web crawler (why not, but not very reliable, and no guarantees)
    • Using a hack to set up the relay/proxy on the client itself; Java applets or possibly Flash come to mind (will not work on most mobile devices, slow, and Flash has its own cross-site limitations too)
    • Asking Google or another search engine for the content (you might then have a problem with the search engine if you abuse it…)
    • Doing this job yourself server-side and caching the answers, in order to unload their server and decrease the risk of being banned
    • Indexing the site yourself (your own web crawler), then using your own index (depends on how frequently the source changes): http://www.quora.com/How-can-I-build-a-web-crawler-from-scratch
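
    The relay-and-cache idea from the last two bullets can be sketched as a small TTL memoizer wrapped around whatever fetch function the relay uses, so the target site is hit at most once per URL per interval (all names here are illustrative, not from the original answer):

    ```javascript
    // Minimal TTL cache around a fetch function: the relay contacts the
    // target site at most once per URL per `ttlMs`.
    function makeCachedFetch(fetchFn, ttlMs) {
      const cache = new Map(); // url -> { body, expires }
      return function cachedFetch(url) {
        const hit = cache.get(url);
        if (hit && hit.expires > Date.now()) return hit.body; // serve cached copy
        const body = fetchFn(url); // delegate to the real fetcher
        cache.set(url, { body, expires: Date.now() + ttlMs });
        return body;
      };
    }
    ```

    Wrapping the relay's real fetcher, e.g. `makeCachedFetch(realFetch, 10 * 60 * 1000)`, means repeated comparison-site requests within ten minutes never touch the e-commerce server at all.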

    [EDIT]

    One more solution I can think of is going through a YQL service; in this manner it is a bit like using a search engine / public proxy as a bridge to retrieve the information for you (note: Yahoo has since retired YQL, but any public CORS proxy plays the same role). Here is a simple example of how to do so; in short, you get cross-domain GET requests.
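
    The bridge pattern looks the same regardless of which service plays the middleman: the browser asks a third-party endpoint to fetch the target page and return it with permissive CORS headers. A hedged sketch, where `PROXY_BASE` is a hypothetical endpoint and not a real service:

    ```javascript
    // The "bridge" pattern: the browser asks a third-party service to fetch
    // the target page and hand it back with permissive CORS headers.
    // `PROXY_BASE` is a hypothetical endpoint, not a real service.
    const PROXY_BASE = 'https://proxy.example/fetch?url=';

    function bridgeUrl(targetUrl) {
      // The target must be percent-encoded so its own query string survives.
      return PROXY_BASE + encodeURIComponent(targetUrl);
    }

    // In the browser this becomes a plain cross-domain-safe GET:
    //   fetch(bridgeUrl('https://shop.example/item?id=42'))
    //     .then(r => r.text())
    //     .then(html => { /* parse the page */ });
    ```

    Note that the middleman sees every request and its IP (not the user's) does the fetching, which is the point of the technique but also its main trust and rate-limit caveat.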
