Browser-based client-side scraping

Asked by 失恋的感觉 on 2020-12-31 02:05

I wonder if it's possible to scrape an external (cross-domain) page through the user's IP?

For a shopping comparison site, I need to scrape pages of an e-com site b

4 Answers
  •  遥遥无期
    2020-12-31 02:32

    Basically, browsers are designed to prevent exactly this (the same-origin policy)…

    The solution everyone thinks about first:

    jQuery/JavaScript: accessing contents of an iframe

    But this will not work in most cases with any "recent" browser (less than 10 years old), because cross-origin frame access is blocked.
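
    Why the iframe approach fails can be sketched with the same-origin rule itself: two URLs share an origin only when scheme, host, and port all match, and the browser refuses frame access otherwise. A minimal check (using the standard `URL` API; the function name is illustrative):

    ```javascript
    // Sketch of the same-origin rule that blocks iframe scraping.
    // Two URLs share an origin only if scheme, host, and port all match.
    function sameOrigin(a, b) {
      const ua = new URL(a), ub = new URL(b);
      return ua.protocol === ub.protocol &&
             ua.hostname === ub.hostname &&
             ua.port === ub.port;
    }

    // In a browser, reading a cross-origin frame is blocked:
    //   iframe.contentDocument   // null, or a SecurityError on access
    // so a try/catch around the access is the usual feature test.
    ```

    For a comparison site, `sameOrigin('https://shop.example/p/1', 'https://comparison.example/')` is false, so the framed shop page's DOM is simply unreachable from your script.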

    Alternatives are:

    • Using the official APIs of the server (if any)
    • Trying to find out whether the server provides a JSONP service (good luck)
    • Being on the same domain, trying cross-site scripting (if possible; not very ethical)
    • Using a trusted relay or proxy (but this will still use your own IP)
    • Pretending you are the Google web crawler (why not, but not very reliable, and no guarantees)
    • Using a hack to set up the relay/proxy on the client itself; Java applets or possibly Flash come to mind (will not work on most mobile devices, slow, and Flash has its own cross-site limitations too)
    • Asking Google or another search engine for the content (you might then have a problem with the search engine if you abuse it…)
    • Doing this job yourself server-side and caching the answers, in order to unload their server and decrease the risk of being banned
    • Indexing the site yourself (your own web crawler), then using your own index (depends on how frequently the source changes): http://www.quora.com/How-can-I-build-a-web-crawler-from-scratch
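
    The relay-and-cache idea from the last two bullets can be sketched as a small TTL memoizer wrapped around whatever fetch function the relay uses, so the target site is hit at most once per URL per interval (all names here are illustrative, not from the original answer):

    ```javascript
    // Minimal TTL cache around a fetch function: the relay contacts the
    // target site at most once per URL per `ttlMs`.
    function makeCachedFetch(fetchFn, ttlMs) {
      const cache = new Map(); // url -> { body, expires }
      return function cachedFetch(url) {
        const hit = cache.get(url);
        if (hit && hit.expires > Date.now()) return hit.body; // serve cached copy
        const body = fetchFn(url); // delegate to the real fetcher
        cache.set(url, { body, expires: Date.now() + ttlMs });
        return body;
      };
    }
    ```

    Wrapping the relay's real fetcher, e.g. `makeCachedFetch(realFetch, 10 * 60 * 1000)`, means repeated comparison-site requests within ten minutes never touch the e-commerce server at all.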

    [EDIT]

    One more solution I can think of is going through a YQL service; in this manner it is a bit like using a search engine / public proxy as a bridge to retrieve the information for you (note: Yahoo has since retired YQL, but any public CORS proxy plays the same role). Here is a simple example of how to do so; in short, you get cross-domain GET requests.
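
    The bridge pattern looks the same regardless of which service plays the middleman: the browser asks a third-party endpoint to fetch the target page and return it with permissive CORS headers. A hedged sketch, where `PROXY_BASE` is a hypothetical endpoint and not a real service:

    ```javascript
    // The "bridge" pattern: the browser asks a third-party service to fetch
    // the target page and hand it back with permissive CORS headers.
    // `PROXY_BASE` is a hypothetical endpoint, not a real service.
    const PROXY_BASE = 'https://proxy.example/fetch?url=';

    function bridgeUrl(targetUrl) {
      // The target must be percent-encoded so its own query string survives.
      return PROXY_BASE + encodeURIComponent(targetUrl);
    }

    // In the browser this becomes a plain cross-domain-safe GET:
    //   fetch(bridgeUrl('https://shop.example/item?id=42'))
    //     .then(r => r.text())
    //     .then(html => { /* parse the page */ });
    ```

    Note that the middleman sees every request and its IP (not the user's) does the fetching, which is the point of the technique but also its main trust and rate-limit caveat.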
