How to display images when using cURL?

旧巷少年郎 2020-12-20 07:06

When scraping a page, I would like the images included with the text.

Currently I'm only able to scrape the text. For example, as a test script, I scraped Google's home page.
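Something along the lines of the following minimal fetch-and-echo sketch reproduces the situation (the URL is just an example, not the exact script):

$ch = curl_init('http://www.google.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // hand the HTML back as a string
$result = curl_exec($ch);
curl_close($ch);

echo $result; // the text shows up, but the images don't load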

3 Answers
  • 2020-12-20 07:29

    If the site you're loading uses relative paths for its resource URLs (e.g. /images/whatever.gif instead of http://www.site.com/images/whatever.gif), you're going to need to rewrite those URLs in the source you get back. cURL won't do that for you; Wget (whose official site seems to be down) will, and it can even download and mirror the resources for you, but it does not provide PHP bindings.

    So, you need a way to walk through the returned source and change relative paths into absolute paths. A naive way would be something like this:

    if (!preg_match('/src="https?:\/\//', $result)) {
        // Prefix every src attribute with the base URL.
        $result = preg_replace('/src="(.*?)"/', 'src="' . $MY_BASE_URL . '$1"', $result);
    }
    

    where $MY_BASE_URL is the base URL you want to prepend, e.g. http://www.mydomain.com. That won't work for everything, but it should get you started. It's not an easy thing to do, and you might be better off just spawning off a wget command in the background and letting it mirror or rewrite the HTML for you.
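    A rough sketch of that wget approach, assuming wget is installed and that $url and $outputDir are defined elsewhere (the flags shown are one reasonable combination, not the only one):

    // Let wget fetch the page plus its images/CSS/JS and rewrite links
    // in the saved HTML to point at the local copies.
    $cmd = sprintf(
        'wget --page-requisites --convert-links --directory-prefix=%s %s 2>&1',
        escapeshellarg($outputDir),
        escapeshellarg($url)
    );
    exec($cmd, $output, $exitCode);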

  • 2020-12-20 07:35

    Try obtaining the images as raw output by setting the CURLOPT_BINARYTRANSFER option to true, as below:

    curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);

    I've used this successfully to obtain images and audio from a webpage.
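    A more complete minimal sketch of fetching one image this way; the image URL is just a placeholder, and CURLOPT_RETURNTRANSFER is added so curl_exec() hands the bytes back to PHP instead of printing them:

    // Fetch a single image and pass it through to the browser.
    $ch = curl_init('http://www.example.com/images/logo.png');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body as a string
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);  // raw output when combined with CURLOPT_RETURNTRANSFER
    $imageData = curl_exec($ch);
    curl_close($ch);

    header('Content-Type: image/png');
    echo $imageData;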

  • 2020-12-20 07:37

    When you scrape a URL, you're retrieving a single file, be it HTML, an image, CSS, JavaScript, etc. The document you see displayed in a browser is almost always the result of MULTIPLE files: the original HTML, each separate image, each CSS file, each JavaScript file. You enter only a single address, but fully building/displaying the page will require many HTTP requests.

    When you scrape the Google home page via cURL and output that HTML to the user, there's no way for the user to know that they're actually viewing Google-sourced HTML - it appears as if the HTML came from your server, and your server only. The user's browser will happily suck in this HTML, find the images, and request the images from YOUR server, not Google's. Since you're not hosting any of Google's images, your server responds with a proper 404 "not found" error.

    To make the page work properly, you've got a few choices. The easiest is to parse the HTML of the page and insert a <base href="..." /> tag into the document's header block. This tells any viewing browser that relative links within the document should be fetched from that 'base' source (e.g. Google).
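    A rough sketch of injecting such a tag, assuming $html holds the scraped markup and $baseUrl is the originating site (e.g. 'http://www.google.com/'):

    // Insert a <base> element right after the opening <head> tag.
    $html = preg_replace(
        '/<head([^>]*)>/i',
        '<head$1><base href="' . htmlspecialchars($baseUrl, ENT_QUOTES) . '" />',
        $html,
        1
    );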

    A harder option is to parse the document and rewrite any references to external files (images, CSS, JS, etc.) so they carry the URL of the originating server, so the user's browser goes to the original site and fetches from there.
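    One way to do that rewriting, sketched with DOMDocument (assuming $html and $baseUrl are defined; only img/script/link tags are handled here):

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from imperfect real-world markup

    // Prefix relative src/href values with the originating server's base URL.
    foreach (['img' => 'src', 'script' => 'src', 'link' => 'href'] as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $value = $node->getAttribute($attr);
            if ($value !== '' && !preg_match('#^(https?:)?//#', $value)) {
                $node->setAttribute($attr, rtrim($baseUrl, '/') . '/' . ltrim($value, '/'));
            }
        }
    }

    $html = $doc->saveHTML();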

    The hardest option is to essentially set up a proxy server: if a request comes in for a file that doesn't exist on your server, try to fetch the corresponding file from Google via cURL and output it to the user.
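    A very rough sketch of that proxy idea (e.g. a proxy.php that a rewrite rule routes missing-file requests to, with the original path in a ?path= parameter; the Google base URL is just an example):

    $path = isset($_GET['path']) ? $_GET['path'] : '';
    $remote = 'http://www.google.com/' . ltrim($path, '/');

    // Fetch the file from the originating server and replay it to the client.
    $ch = curl_init($remote);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    http_response_code($status ? $status : 502);
    if ($type) {
        header('Content-Type: ' . $type);
    }
    echo $body;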
