Extract images from .idx3-ubyte file or GZIP via Python

Question

I have created a simple face-recognition function using the FaceRecognizer from OpenCV. It works fine with images of people.

Now I would like to test it with handwritten characters instead of people. I came across the MNIST dataset, but it stores the images in a file format I have never seen before.

I simply need to extract a few images from:

train-images.idx3-ubyte

and save them in a folder as .gif

Or am I misunderstanding this MNIST thing? If so, where could I get such a dataset?

EDIT

I also have the gzip file:

train-images-idx3-ubyte.gz

I am trying to read the content, but show() does not work, and if I call read() I see random symbols.

import gzip
images = gzip.open("train-images-idx3-ubyte.gz", 'rb')
print(images.read())

EDIT

I managed to get some useful output by using:

with gzip.open('train-images-idx3-ubyte.gz','r') as fin:
    for line in fin:
        print('got line', line)

Somehow I now have to convert this output to an image.

Answer 1:

Download the training/test images and labels:

train-images-idx3-ubyte.gz: training set images
train-labels-idx1-ubyte.gz: training set labels
t10k-images-idx3-ubyte.gz: test set images
t10k-labels-idx1-ubyte.gz: test set labels

Uncompress them into a working directory, say samples/.

Get the python-mnist package from PyPI:

pip install python-mnist

Import the mnist package and read the training/test images:

from mnist import MNIST

mndata = MNIST('samples')

images, labels = mndata.load_training()
# or
images, labels = mndata.load_testing()

To display an image in the console:

import random
index = random.randrange(0, len(images))  # choose an index ;-)
print(mndata.display(images[index]))

You'll get something like this:

............................
............................
............................
............................
............................
.................@@.........
..............@@@@@.........
............@@@@............
..........@@................
..........@.................
...........@................
...........@................
...........@...@............
...........@@@@@.@..........
...........@@@...@@.........
...........@@.....@.........
..................@.........
..................@@........
..................@@........
..................@.........
.................@@.........
...........@.....@..........
...........@....@@..........
............@@@@............
.............@..............
............................
............................
............................
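A rendering like the one above can be approximated by thresholding each pixel and printing one character per pixel. The sketch below is only an illustration of that idea, not python-mnist's actual code; the function name render_ascii and the threshold of 200 are assumptions made for the example, and the 4x4 input is synthetic so that it runs without the dataset.

```python
def render_ascii(pixels, width=28, threshold=200):
    """Render a flat sequence of 0-255 grayscale values as rows of text."""
    if len(pixels) % width != 0:
        raise ValueError("pixel count is not a multiple of the row width")
    rows = []
    for start in range(0, len(pixels), width):
        row = pixels[start:start + width]
        # Bright pixels become '@', everything else becomes '.'.
        rows.append("".join("@" if value > threshold else "." for value in row))
    return "\n".join(rows)


# Tiny synthetic 4x4 "image" (hypothetical data, not taken from MNIST).
demo = [0, 0, 0, 0,
        0, 255, 255, 0,
        0, 255, 255, 0,
        0, 0, 0, 0]
print(render_ascii(demo, width=4))
```

With a real MNIST image, the default width of 28 applies, so render_ascii(images[index]) would be enough.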

Explanation:

Each image in the images list is a Python list of unsigned bytes.
labels is a Python array of unsigned bytes.
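Because each image is a flat list of 784 byte values, saving a few of them as .gif files (as the question asks) takes only a small amount of glue code. The following is a minimal sketch that assumes the third-party Pillow package is installed; the helper names to_pixel_bytes and save_gif and the output file names are made up for illustration.

```python
def to_pixel_bytes(image, rows=28, cols=28):
    """Convert a flat list of 0-255 ints into bytes, checking its size."""
    if len(image) != rows * cols:
        raise ValueError(f"expected {rows * cols} pixels, got {len(image)}")
    return bytes(image)


def save_gif(image, path, rows=28, cols=28):
    """Write one grayscale image to `path`; a .gif suffix selects GIF output."""
    from PIL import Image  # third-party dependency: pip install Pillow

    # Mode "L" is 8-bit grayscale; Pillow expects the size as (width, height).
    img = Image.frombytes("L", (cols, rows), to_pixel_bytes(image, rows, cols))
    img.save(path)


# Usage with the lists returned by mndata.load_training():
# for i in range(5):
#     save_gif(images[i], f"digit_{i}_label_{labels[i]}.gif")
```

Putting the label in the file name is just one way to keep images and labels together; a separate CSV file would work as well.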

Answer 2:

(Using only matplotlib, gzip and numpy)
Extract image data:

import gzip
import numpy as np

image_size = 28
num_images = 5

f = gzip.open('train-images-idx3-ubyte.gz', 'rb')
f.read(16)  # skip the 16-byte header (magic number, image count, rows, columns)
buf = f.read(image_size * image_size * num_images)
data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
data = data.reshape(num_images, image_size, image_size, 1)
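The 16 bytes skipped above are the IDX header: two zero bytes, a data-type code, the number of dimensions, and then one big-endian 32-bit size per dimension. Rather than hard-coding 28 and 16, the header can be parsed. This is a small sketch using only the standard library; the function name read_idx_header is hypothetical, and the demo input is built in memory so that it runs without the real file.

```python
import gzip
import io
import struct

# Data-type codes defined by the IDX format (third byte of the magic number).
IDX_TYPES = {0x08: "unsigned byte", 0x09: "signed byte", 0x0B: "short",
             0x0C: "int", 0x0D: "float", 0x0E: "double"}


def read_idx_header(stream):
    """Return (type_code, dims) read from the start of an IDX stream."""
    zero1, zero2, type_code, ndim = struct.unpack(">BBBB", stream.read(4))
    if zero1 != 0 or zero2 != 0 or type_code not in IDX_TYPES:
        raise ValueError("not an IDX file")
    # Each dimension size is a 4-byte big-endian unsigned integer.
    dims = struct.unpack(">" + "I" * ndim, stream.read(4 * ndim))
    return type_code, dims


# Synthetic stand-in for train-images-idx3-ubyte.gz holding two blank images.
header = bytes([0, 0, 0x08, 3]) + struct.pack(">III", 2, 28, 28)
payload = gzip.compress(header + bytes(2 * 28 * 28))
with gzip.open(io.BytesIO(payload), "rb") as f:
    type_code, dims = read_idx_header(f)
print(IDX_TYPES[type_code], dims)  # unsigned byte (2, 28, 28)
```

For the image file the header is 4 + 3 * 4 = 16 bytes, and for the label file it is 4 + 1 * 4 = 8 bytes, which matches the read(16) and read(8) calls in this answer.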

Display an image:

import matplotlib.pyplot as plt
image = np.asarray(data[2]).squeeze()
plt.imshow(image)
plt.show()

Print the first 50 labels:

f = gzip.open('train-labels-idx1-ubyte.gz', 'rb')
f.read(8)  # skip the 8-byte header (magic number, label count)
for i in range(50):
    buf = f.read(1)
    labels = np.frombuffer(buf, dtype=np.uint8).astype(np.int64)
    print(labels)
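Reading one byte per loop iteration works, but the labels can also be read in a single call, using the count stored in the header. The sketch below uses only the standard library; read_labels is a hypothetical helper name and the input is a synthetic five-label file built in memory.

```python
import gzip
import io
import struct


def read_labels(stream):
    """Read every label from an open IDX1 stream as a list of ints."""
    magic, count = struct.unpack(">II", stream.read(8))
    if magic != 2049:  # 0x00000801: unsigned bytes, one dimension
        raise ValueError(f"unexpected magic number {magic}")
    labels = list(stream.read(count))  # iterating over bytes yields ints
    if len(labels) != count:
        raise ValueError("file is shorter than its header claims")
    return labels


# Synthetic stand-in for train-labels-idx1-ubyte.gz (made-up label values).
fake = gzip.compress(struct.pack(">II", 2049, 5) + bytes([3, 1, 4, 1, 5]))
with gzip.open(io.BytesIO(fake), "rb") as f:
    all_labels = read_labels(f)
print(all_labels[:50])  # [3, 1, 4, 1, 5]
```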

Answer 3:

You could actually use the idx2numpy package available at PyPI. It's extremely simple to use and directly converts the data to numpy arrays. Here's what you have to do:

Downloading the data

Download the MNIST dataset from the official website.
If you're using Linux, you can use wget to fetch it straight from the command line. Just run:

wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

Decompressing the data

Unzip or decompress the data. On Linux, you could use gzip (for example, gzip -d on each .gz file).

Ultimately, you should have the following files:

data/train-images-idx3-ubyte
data/train-labels-idx1-ubyte
data/t10k-images-idx3-ubyte
data/t10k-labels-idx1-ubyte
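If no gzip command is at hand (on Windows, for instance), the same decompression can be done from Python with the standard library. This is a minimal sketch; the function name gunzip_all is made up, and the folder name data simply follows the layout above.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_all(folder):
    """Decompress every *.gz file in `folder`, keeping the originals."""
    written = []
    for gz_path in sorted(Path(folder).glob("*.gz")):
        target = gz_path.with_suffix("")  # drops the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # streams the data in chunks
        written.append(target)
    return written


# Example: gunzip_all("data")
```

The compressed files are left in place, so the call is safe to repeat.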

The data/ prefix is there only because I extracted them into a folder named data. From your question it looks like you have already got this far, so keep reading.

Using idx2numpy

Here's some simple Python code to read everything from the decompressed files as numpy arrays.

import idx2numpy
import numpy as np
file = 'data/train-images-idx3-ubyte'
arr = idx2numpy.convert_from_file(file)
# arr is now an np.ndarray of shape (60000, 28, 28)

You can now use it with OpenCV just as you would display any other image, using something like the following (call cv.waitKey(0) afterwards so the window stays open):

import cv2 as cv
cv.imshow("Image", arr[4])

To install idx2numpy, you can use pip, which fetches the package from PyPI. Simply run the command:

pip install idx2numpy

Answer 4:

Use this to extract the MNIST database to images and CSV labels in Python:

https://github.com/sorki/python-mnist

Source: https://stackoverflow.com/questions/40427435/extract-images-from-idx3-ubyte-file-or-gzip-via-python

Tags

python

mnist