Best way to download all images from a site using Java? Currently getting a 403 Status Error

Submitted by 空扰寡人 on 2021-02-08 03:42:25

Question


I am trying to download all the images off of a site, but I'm not sure if this is the best way, as I have tried setting a user agent and referrer to no avail. The 403 Status Error only occurs when trying to download the images from their src URLs; the page that has all the images in one place doesn't show any errors and provides the src for each image. I am not sure if there is a way to download the images without visiting the src page, or whether there is a better way to do this entirely. Here is my code so far.

private static void getPages() throws IOException {
    Document doc = Jsoup.connect("https://manganelo.com/chapter/read_bleach_manga_online_for_free2/chapter_686")
            .get();
    Elements media = doc.getElementsByTag("img");
    System.out.println(media);
    Iterator<Element> ie = media.iterator();
    int i = 1;

    while (ie.hasNext()) {
        Response resultImageResponse = Jsoup.connect(ie.next().attr("src")).ignoreContentType(true)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0")
                .referrer("www.google.com").timeout(120000).execute();
        FileOutputStream out = new FileOutputStream(new java.io.File("image #" + i++ + ".jpg"));
        out.write(resultImageResponse.bodyAsBytes());
        out.close();
    }
}

Answer 1:


You have a few problems with your suggested approach:

  1. You're trying to use JSoup to download the file content itself. JSoup is meant for text/HTML data and won't return the image content/values; to download the image content you will need a plain HTTP request.

  2. To download the images you also need to replicate the request a browser would make. Open Chrome, open the developer tools, and switch to the Network tab. Load the page you want to scrape images from and you'll see a bunch of requests being made, including an individual request for each image. If you click the one labelled 1.jpg you'll see the request made to download the first image; copy the headers used to make that request (both request and response headers are shown in this view). Once you've replicated the request successfully, you can start testing which headers/cookies are actually required. I found the only real requirement was the "referer" header (see the check right after this list).
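As a quick standalone check that the referer header is the deciding factor, something like the following would do; the image URL here is a hypothetical example of a src value taken from the chapter page, following the pattern used in the full listing further down:

import java.net.HttpURLConnection;
import java.net.URL;

public class RefererCheck {

    public static void main(String[] args) throws Exception {
        String chapterUrl = "https://manganelo.com/chapter/read_bleach_manga_online_for_free2/chapter_686";
        // Hypothetical src value scraped from the chapter page
        String imageUrl = "https://s5.mkklcdnv5.com/mangakakalot/r1/read_bleach_manga_online_for_free2/chapter_686_death_and_strawberry/1.jpg";

        System.out.println("No referer:   " + statusFor(imageUrl, null));       // expect 403 without the header
        System.out.println("With referer: " + statusFor(imageUrl, chapterUrl)); // expect 200 once it's set
    }

    private static int statusFor(String imageUrl, String referer) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(imageUrl).openConnection();
        conn.setRequestMethod("GET");
        if (referer != null) {
            conn.setRequestProperty("referer", referer);
        }
        int code = conn.getResponseCode(); // triggers the request
        conn.disconnect();
        return code;
    }
}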

I've stripped out most of what you might need/want, but something like the code below is what you're after. It pulls the comic book images in their entirety, at full quality. I introduced a small sleep between requests so as not to overload the server, since you can sometimes get rate-limited. Even without it you should be fine, but you don't want to get blocked for a lengthy period, so the slower you let the requests go out the better. You could even make the requests in parallel, as sketched just below.
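A rough sketch of that parallel idea; the class name and the downloadImage stub are hypothetical, with downloadImage standing in for the makeImageRequest/writeToFile logic in the full listing:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelDownloadSketch {

    static void downloadAll(List<String> imageUrls) throws InterruptedException {
        // A small pool keeps you polite to the server; raise it at your own risk.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        int i = 1;
        for (String imageUrl : imageUrls) {
            final int index = i++;
            pool.submit(() -> downloadImage(imageUrl, index));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
    }

    // Stand-in for the per-image request (with referer header) and file write below.
    static void downloadImage(String imageUrl, int index) {
        // ... request imageUrl with the referer set, then write image_<index> to disk ...
    }
}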

I'm almost certain some of the code below could be cut back further for a cleaner result, but it works, and I'm assuming that's more than enough.

Interesting question.


import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Iterator;

public class JSoupExample {

    private static final int TIMEOUT = 30000;
    private static final int BUFFER_SIZE = 4096;

    public static void main(String... args) throws InterruptedException, IOException {
        String url = "https://manganelo.com/chapter/read_bleach_manga_online_for_free2/chapter_686";
        Document doc = Jsoup.connect(url).get();
        // Select only the images whose src starts with the chapter's CDN URL (not every <img> on the page)
        Elements media = doc.select("img[src^=\"https://s5.mkklcdnv5.com/mangakakalot/r1/read_bleach_manga_online_for_free2/chapter_686_death_and_strawberry/\"]");
        Iterator<Element> ie = media.iterator();
        int i = 1;

        while (ie.hasNext()) {
            String imageUrlString = ie.next().attr("src");
            System.out.println(imageUrlString);

            try {
                HttpURLConnection response = makeImageRequest(url, imageUrlString);

                if (response.getResponseCode() == 200) {
                    writeToFile(i, response);
                }
            } catch (IOException e) {
                // skip file and move to next if unavailable
                e.printStackTrace();
                System.out.println("Unable to download file: " + imageUrlString);
            }
            i++; // increment image ID whatever the result of the request.
            Thread.sleep(200L); // prevent yourself from being blocked due to rate limiting
        }
    }

    private static void writeToFile(int i, HttpURLConnection response) throws IOException {
        // opens input stream from the HTTP connection
        InputStream inputStream = response.getInputStream();

        // opens an output stream to save into file
        FileOutputStream outputStream = new FileOutputStream("image_" + i + ".jpg");

        int bytesRead = -1;
        byte[] buffer = new byte[BUFFER_SIZE];
        while ((bytesRead = inputStream.read(buffer)) != -1) {
            outputStream.write(buffer, 0, bytesRead);
        }
        outputStream.close();
        inputStream.close();

        System.out.println("File downloaded");
    }

    private static HttpURLConnection makeImageRequest(String referer, String imageUrlString) throws IOException {
        URL imageUrl = new URL(imageUrlString);
        HttpURLConnection response = (HttpURLConnection) imageUrl.openConnection();

        response.setRequestMethod("GET");
        response.setRequestProperty("referer",  referer);

        response.setConnectTimeout(TIMEOUT);
        response.setReadTimeout(TIMEOUT);
        response.connect();
        return response;
    }
}

I'd also want to make sure I set the right file extension based on the content type, as I believe some images were coming back as .png rather than .jpeg. I'm also fairly sure the write-to-file step can be cleaned up to be simpler and clearer than reading in a byte stream manually; see the sketch below.
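A minimal sketch of that cleanup, assuming only the two formats mentioned above need handling; it keeps the writeToFile signature from the listing and swaps the manual buffer loop for java.nio's Files.copy:

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class WriteToFileSketch {

    // Choose the extension from the Content-Type header instead of assuming .jpg,
    // and let Files.copy do the buffering instead of a manual read/write loop.
    static void writeToFile(int i, HttpURLConnection response) throws IOException {
        String contentType = response.getContentType(); // e.g. "image/jpeg" or "image/png"
        String extension = "image/png".equals(contentType) ? ".png" : ".jpg";

        Path target = Paths.get("image_" + i + extension);
        try (InputStream in = response.getInputStream()) { // closes the stream even on failure
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        System.out.println("File downloaded: " + target);
    }
}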



Source: https://stackoverflow.com/questions/63108958/best-way-to-download-all-images-from-a-site-using-java-currently-getting-an-403
