How to parse a zipped file completely from RAM?

柔情痞子 提交于 2020-05-29 06:01:06

问题


Background

I need to parse some zip files of various types (getting some inner files content for one purpose or another, including getting their names).

Some of the files are not reachable via file-path, as Android has Uri to reach them, and as sometimes the zip file is inside another zip file. With the push to use SAF, it's even less possible to use file-path in some cases.

For this, we have 2 main ways to handle: ZipFile class and ZipInputStream class.

The problem

When we have a file-path, ZipFile is a perfect solution. It's also very efficient in terms of speed.

However, for the rest of the cases, ZipInputStream could reach issues, such as this one, which has a problematic zip file, and cause this exception:

  java.util.zip.ZipException: only DEFLATED entries can have EXT descriptor
        at java.util.zip.ZipInputStream.readLOC(ZipInputStream.java:321)
        at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:124)

What I've tried

The only always-working solution would be to copy the file to somewhere else, where you could parse it using ZipFile, but this is inefficient and requires you to have free storage, as well as remove the file when you are done with it.

So, what I've found is that Apache has a nice, pure Java library (here) to parse Zip files, and for some reason its InputStream solution (called "ZipArchiveInputStream") seem even more efficient than the native ZipInputStream class.

As opposed to what we have in the native framework, the library offers a bit more flexibility. I could, for example, load the entire zip file into bytes array, and let the library handle it as usual, and this works even for the problematic Zip files I've mentioned:

org.apache.commons.compress.archivers.zip.ZipFile(SeekableInMemoryByteChannel(byteArray)).use { zipFile ->
    for (entry in zipFile.entries) {
      val name = entry.name
      ... // use the zipFile like you do with native framework

gradle dependency:

// http://commons.apache.org/proper/commons-compress/ https://mvnrepository.com/artifact/org.apache.commons/commons-compress
implementation 'org.apache.commons:commons-compress:1.20'

Sadly, this isn't always possible, because it depends on having the heap memory hold the entire zip file, and on Android it gets even more limited, because the heap size could be relatively small (heap could be 100MB while the file is 200MB). As opposed to a PC which can have a huge heap memory being set, for Android it's not flexible at all.

So, I searched for a solution that has JNI instead, to have the entire ZIP file loaded into byte array there, not going to the heap (at least not entirely). This could be a nicer workaround because if the ZIP could be fit in the device's RAM instead of the heap, it could prevent me from reaching OOM while also not needing to have an extra file.

I've found this library called "larray" which seems promising , but sadly when I tried using it, it crashed, because its requirements include having a full JVM, meaning not suitable for Android.

EDIT: seeing that I can't find any library and any built-in class, I tried to use JNI myself. Sadly I'm very rusty with it, and I looked at an old repository I've made a long time ago to perform some operations on Bitmaps (here). This is what I came up with :

native-lib.cpp

#include <jni.h>
#include <android/log.h>
#include <cstdio>
#include <android/bitmap.h>
#include <cstring>
#include <unistd.h>

class JniBytesArray {
public:
    uint32_t *_storedData;

    JniBytesArray() {
        _storedData = NULL;
    }
};

extern "C" {
JNIEXPORT jobject JNICALL Java_com_lb_myapplication_JniByteArrayHolder_allocate(
        JNIEnv *env, jobject obj, jlong size) {
    auto *jniBytesArray = new JniBytesArray();
    auto *array = new uint32_t[size];
    for (int i = 0; i < size; ++i)
        array[i] = 0;
    jniBytesArray->_storedData = array;
    return env->NewDirectByteBuffer(jniBytesArray, 0);
}
}

JniByteArrayHolder.kt

class JniByteArrayHolder {
    external fun allocate(size: Long): ByteBuffer

    companion object {
        init {
            System.loadLibrary("native-lib")
        }
    }
}
class MainActivity : AppCompatActivity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)
        thread {
            printMemStats()
            val jniByteArrayHolder = JniByteArrayHolder()
            val byteBuffer = jniByteArrayHolder.allocate(1L * 1024L)
            printMemStats()
        }
    }

    fun printMemStats() {
        val memoryInfo = ActivityManager.MemoryInfo()
        (getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager).getMemoryInfo(memoryInfo)
        val nativeHeapSize = memoryInfo.totalMem
        val nativeHeapFreeSize = memoryInfo.availMem
        val usedMemInBytes = nativeHeapSize - nativeHeapFreeSize
        val usedMemInPercentage = usedMemInBytes * 100 / nativeHeapSize
        Log.d("AppLog", "total:${Formatter.formatFileSize(this, nativeHeapSize)} " +
                "free:${Formatter.formatFileSize(this, nativeHeapFreeSize)} " +
                "used:${Formatter.formatFileSize(this, usedMemInBytes)} ($usedMemInPercentage%)")
    }

This doesn't seem right, because if I try to create a 1GB byte array using jniByteArrayHolder.allocate(1L * 1024L * 1024L * 1024L) , it crashes without any exception or error logs.

The questions

  1. Is it possible to use JNI for Apache's library, so that it will handle the ZIP file content which is contained within JNI's "world" ?

  2. If so, how can I do it? Is there any sample of how to do it? Is there a class for it? Or do I have to implement it myself? If so, can you please show how it's done in JNI?

  3. If it's not possible, what other way is there to do it? Maybe alternative to what Apache has?

  4. For the solution of JNI, how come it doesn't work well ? How could I efficiently copy the bytes from the stream into the JNI byte array (my guess is that it will be via a buffer)?


回答1:


I took a look at the JNI code you posted and made a couple of changes. Mostly it is defining the size argument for NewDirectByteBuffer and using malloc().

Here is the output of the log after allocating 800mb:

D/AppLog: total:1.57 GB free:1.03 GB used:541 MB (34%)
D/AppLog: total:1.57 GB free:247 MB used:1.32 GB (84%)

And the following is what the buffer looks like after the allocation. As you can see, the debugger is reporting a limit of 800mb which is what we expect.

My C is very rusty, so I am sure that there is some work to be done. I have updated the code to be a little more robust and to allow for the freeing of memory.

native-lib.cpp

extern "C" {
static jbyteArray *_holdBuffer = NULL;
static jobject _directBuffer = NULL;
/*
    This routine is not re-entrant and can handle only one buffer at a time. If a buffer is
    allocated then it must be released before the next one is allocated.
 */
JNIEXPORT
jobject JNICALL Java_com_example_zipfileinmemoryjni_JniByteArrayHolder_allocate(
        JNIEnv *env, jobject obj, jlong size) {
    if (_holdBuffer != NULL || _directBuffer != NULL) {
        __android_log_print(ANDROID_LOG_ERROR, "JNI Routine",
                            "Call to JNI allocate() before freeBuffer()");
        return NULL;
    }

    // Max size for a direct buffer is the max of a jint even though NewDirectByteBuffer takes a
    // long. Clamp max size as follows:
    if (size > SIZE_T_MAX || size > INT_MAX || size <= 0) {
        jlong maxSize = SIZE_T_MAX < INT_MAX ? SIZE_T_MAX : INT_MAX;
        __android_log_print(ANDROID_LOG_ERROR, "JNI Routine",
                            "Native memory allocation request must be >0 and <= %lld but was %lld.\n",
                            maxSize, size);
        return NULL;
    }

    jbyteArray *array = (jbyteArray *) malloc(static_cast<size_t>(size));
    if (array == NULL) {
        __android_log_print(ANDROID_LOG_ERROR, "JNI Routine",
                            "Failed to allocate %lld bytes of native memory.\n",
                            size);
        return NULL;
    }

    jobject directBuffer = env->NewDirectByteBuffer(array, size);
    if (directBuffer == NULL) {
        free(array);
        __android_log_print(ANDROID_LOG_ERROR, "JNI Routine",
                            "Failed to create direct buffer of size %lld.\n",
                            size);
        return NULL;
    }
    // memset() is not really needed but we call it here to force Android to count
    // the consumed memory in the stats since it only seems to "count" dirty pages. (?)
    memset(array, 0xFF, static_cast<size_t>(size));
    _holdBuffer = array;

    // Get a global reference to the direct buffer so Java isn't tempted to GC it.
    _directBuffer = env->NewGlobalRef(directBuffer);
    return directBuffer;
}

JNIEXPORT void JNICALL Java_com_example_zipfileinmemoryjni_JniByteArrayHolder_freeBuffer(
        JNIEnv *env, jobject obj, jobject directBuffer) {

    if (_directBuffer == NULL || _holdBuffer == NULL) {
        __android_log_print(ANDROID_LOG_ERROR, "JNI Routine",
                            "Attempt to free unallocated buffer.");
        return;
    }

    jbyteArray *bufferLoc = (jbyteArray *) env->GetDirectBufferAddress(directBuffer);
    if (bufferLoc == NULL) {
        __android_log_print(ANDROID_LOG_ERROR, "JNI Routine",
                            "Failed to retrieve direct buffer location associated with ByteBuffer.");
        return;
    }

    if (bufferLoc != _holdBuffer) {
        __android_log_print(ANDROID_LOG_ERROR, "JNI Routine",
                            "DirectBuffer does not match that allocated.");
        return;
    }

    // Free the malloc'ed buffer and the global reference. Java can not GC the direct buffer.
    free(bufferLoc);
    env->DeleteGlobalRef(_directBuffer);
    _holdBuffer = NULL;
    _directBuffer = NULL;
}
}

I also updated the array holder:

class JniByteArrayHolder {
    external fun allocate(size: Long): ByteBuffer
    external fun freeBuffer(byteBuffer: ByteBuffer)

    companion object {
        init {
            System.loadLibrary("native-lib")
        }
    }
}

I can confirm that this code along with the ByteBufferChannel class provided by Botje here works for Android versions before API 24. The SeekableByteChannel interface was introduced in API 24 and is needed by the ZipFile utility.

The maximum buffer size that can be allocated is the size of a jint and is due to the limitation of JNI. Larger data can be accommodated (if available) but would require multiple buffers and a way to handle them.

Here is the main activity for the sample app. An earlier version always assumed the the InputStream read buffer was was always filled and errored out when trying to put it to the ByteBuffer. This was fixed.

MainActivity.kt

class MainActivity : AppCompatActivity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)
    }

    fun onClick(view: View) {
        button.isEnabled = false
        status.text = getString(R.string.running)

        thread {
            printMemStats("Before buffer allocation:")
            var bufferSize = 0L
            // testzipfile.zip is not part of the project but any zip can be uploaded through the
            // device file manager or adb to test.
            val fileToRead = "$filesDir/testzipfile.zip"
            val inStream =
                if (File(fileToRead).exists()) {
                    FileInputStream(fileToRead).apply {
                        bufferSize = getFileSize(this)
                        close()
                    }
                    FileInputStream(fileToRead)
                } else {
                    // If testzipfile.zip doesn't exist, we will just look at this one which
                    // is part of the APK.
                    resources.openRawResource(R.raw.appapk).apply {
                        bufferSize = getFileSize(this)
                        close()
                    }
                    resources.openRawResource(R.raw.appapk)
                }
            // Allocate the buffer in native memory (off-heap).
            val jniByteArrayHolder = JniByteArrayHolder()
            val byteBuffer =
                if (bufferSize != 0L) {
                    jniByteArrayHolder.allocate(bufferSize)?.apply {
                        printMemStats("After buffer allocation")
                    }
                } else {
                    null
                }

            if (byteBuffer == null) {
                Log.d("Applog", "Failed to allocate $bufferSize bytes of native memory.")
            } else {
                Log.d("Applog", "Allocated ${Formatter.formatFileSize(this, bufferSize)} buffer.")
                val inBytes = ByteArray(4096)
                Log.d("Applog", "Starting buffered read...")
                while (inStream.available() > 0) {
                    byteBuffer.put(inBytes, 0, inStream.read(inBytes))
                }
                inStream.close()
                byteBuffer.flip()
                ZipFile(ByteBufferChannel(byteBuffer)).use {
                    Log.d("Applog", "Starting Zip file name dump...")
                    for (entry in it.entries) {
                        Log.d("Applog", "Zip name: ${entry.name}")
                        val zis = it.getInputStream(entry)
                        while (zis.available() > 0) {
                            zis.read(inBytes)
                        }
                    }
                }
                printMemStats("Before buffer release:")
                jniByteArrayHolder.freeBuffer(byteBuffer)
                printMemStats("After buffer release:")
            }
            runOnUiThread {
                status.text = getString(R.string.idle)
                button.isEnabled = true
                Log.d("Applog", "Done!")
            }
        }
    }

    /*
        This function is a little misleading since it does not reflect the true status of memory.
        After native buffer allocation, it waits until the memory is used before counting is as
        used. After release, it doesn't seem to count the memory as released until garbage
        collection. (My observations only.) Also, see the comment for memset() in native-lib.cpp
        which is a member of this project.
    */
    private fun printMemStats(desc: String? = null) {
        val memoryInfo = ActivityManager.MemoryInfo()
        (getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager).getMemoryInfo(memoryInfo)
        val nativeHeapSize = memoryInfo.totalMem
        val nativeHeapFreeSize = memoryInfo.availMem
        val usedMemInBytes = nativeHeapSize - nativeHeapFreeSize
        val usedMemInPercentage = usedMemInBytes * 100 / nativeHeapSize
        val sDesc = desc?.run { "$this:\n" }
        Log.d(
            "AppLog", "$sDesc total:${Formatter.formatFileSize(this, nativeHeapSize)} " +
                    "free:${Formatter.formatFileSize(this, nativeHeapFreeSize)} " +
                    "used:${Formatter.formatFileSize(this, usedMemInBytes)} ($usedMemInPercentage%)"
        )
    }

    // Not a great way to do this but not the object of the demo.
    private fun getFileSize(inStream: InputStream): Long {
        var bufferSize = 0L
        while (inStream.available() > 0) {
            val toSkip = inStream.available().toLong()
            inStream.skip(toSkip)
            bufferSize += toSkip
        }
        return bufferSize
    }
}

A sample GitHub repository is here.




回答2:


You can steal LWJGL's native memory management functions. It is BSD3 licensed, so you only have to mention somewhere that you are using code from it.

Step 1: given an InputStream is and a file size ZIP_SIZE, slurp the stream into a direct byte buffer created by LWJGL's org.lwjgl.system.MemoryUtil helper class:

ByteBuffer bb = MemoryUtil.memAlloc(ZIP_SIZE);
byte[] buf = new byte[4096]; // Play with the buffer size to see what works best
int read = 0;
while ((read = is.read(buf)) != -1) {
  bb.put(buf, 0, read);
}

Step 2: wrap the ByteBuffer in a ByteChannel. Taken from this gist. You possibly want to strip the writing parts out.

package io.github.ncruces.utils;

import java.nio.ByteBuffer;
import java.nio.channels.NonWritableChannelException;
import java.nio.channels.SeekableByteChannel;

import static java.lang.Math.min;

public final class ByteBufferChannel implements SeekableByteChannel {
    private final ByteBuffer buf;

    public ByteBufferChannel(ByteBuffer buffer) {
        if (buffer == null) throw new NullPointerException();
        buf = buffer;
    }

    @Override
    public synchronized int read(ByteBuffer dst) {
        if (buf.remaining() == 0) return -1;

        int count = min(dst.remaining(), buf.remaining());
        if (count > 0) {
            ByteBuffer tmp = buf.slice();
            tmp.limit(count);
            dst.put(tmp);
            buf.position(buf.position() + count);
        }
        return count;
    }

    @Override
    public synchronized int write(ByteBuffer src) {
        if (buf.isReadOnly()) throw new NonWritableChannelException();

        int count = min(src.remaining(), buf.remaining());
        if (count > 0) {
            ByteBuffer tmp = src.slice();
            tmp.limit(count);
            buf.put(tmp);
            src.position(src.position() + count);
        }
        return count;
    }

    @Override
    public synchronized long position() {
        return buf.position();
    }

    @Override
    public synchronized ByteBufferChannel position(long newPosition) {
        if ((newPosition | Integer.MAX_VALUE - newPosition) < 0) throw new IllegalArgumentException();
        buf.position((int)newPosition);
        return this;
    }

    @Override
    public synchronized long size() { return buf.limit(); }

    @Override
    public synchronized ByteBufferChannel truncate(long size) {
        if ((size | Integer.MAX_VALUE - size) < 0) throw new IllegalArgumentException();
        int limit = buf.limit();
        if (limit > size) buf.limit((int)size);
        return this;
    }

    @Override
    public boolean isOpen() { return true; }

    @Override
    public void close() {}
}

Step 3: Use ZipFile as before:

ZipFile zf = new ZipFile(ByteBufferChannel(bb);
for (ZipEntry ze : zf) {
    ...
}

Step 4: Manually release the native buffer (preferably in a finally block):

MemoryUtil.memFree(bb);


来源:https://stackoverflow.com/questions/61652063/how-to-parse-a-zipped-file-completely-from-ram

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!