Pickled scipy sparse matrix as input data?

Posted by 谁都会走 on 2019-12-07 23:32:57

Question


I am working on a multiclass classification problem that consists of classifying resumes.

I used sklearn and its TfidfVectorizer to get a big scipy sparse matrix, which I pickle and then feed into a TensorFlow model. On my local machine, I load it, convert a small batch to dense numpy arrays and fill a feed dictionary. Everything works great.
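For readers who want to picture the local workflow described above, here is a minimal sketch of the batching step. It assumes X is a scipy.sparse matrix in CSR format and y is a sequence with one label per row; the function name and the placeholder names in the closing comment are hypothetical and do not come from the original post.

```python
def iter_dense_batches(X, y, batch_size):
    """Yield (dense_features, labels) pairs, one mini-batch at a time.

    X is assumed to be a scipy.sparse matrix in CSR format (row slicing is
    cheap in that format) and y a sequence with one label per row of X.
    """
    n_rows = X.shape[0]
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        # Only this slice is densified, so memory use is bounded by batch_size.
        yield X[start:stop].toarray(), y[start:stop]


# Hypothetical use inside a TensorFlow 1.x training loop:
#   for xb, yb in iter_dense_batches(X, y, 128):
#       sess.run(train_op, feed_dict={x_placeholder: xb, y_placeholder: yb})
```

Only one batch is ever held in dense form, which is why this approach can work even when the full dense matrix would not fit in memory.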

Now I would like to do the same thing on Cloud ML. My pickle is stored at gs://my-bucket/path/to/pickle, but when I run my trainer, the pickle file can't be found at this URI (IOError: [Errno 2] No such file or directory). I am using pickle.load(open('gs://my-bucket/path/to/pickle', 'rb')) to extract my data. I suspect that this is not the right way to open a file on GCS, but I'm totally new to Google Cloud and I can't find the proper way to do so.

Also, I read that one must use TFRecords or a CSV format for input data, but I don't understand why my method could not work. CSV is excluded since the dense representation of the matrix would be too big to fit in memory. Can TFRecords efficiently encode sparse data like that? And is it possible to read data from a pickle file?


Answer 1:


You are correct that Python's "open" won't work with GCS out of the box. Given that you're using TensorFlow, you can use the file_io library instead, which works with both local files and files on GCS.

import pickle
from tensorflow.python.lib.io import file_io

# Read the whole file into memory, then unpickle from the in-memory string
pickle.loads(file_io.read_file_to_string('gs://my-bucket/path/to/pickle'))

NB: pickle.load(file_io.FileIO('gs://..', 'r')) does not appear to work.
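The same idea can be wrapped in a small helper. This is a sketch under assumptions rather than a confirmed recipe: the helper names are hypothetical, and opening the file in binary mode ('rb') is an assumption made so that pickle receives bytes under Python 3. The reader is injectable so that the unpickling logic can be exercised on a local file without TensorFlow installed.

```python
import pickle


def read_bytes_with_file_io(path):
    """Read a whole file (local path or gs:// URI) into memory as bytes."""
    # Imported lazily so that load_pickle can be used with another reader
    # on machines where TensorFlow is not installed.
    from tensorflow.python.lib.io import file_io
    with file_io.FileIO(path, mode='rb') as f:  # 'rb' is an assumption here
        return f.read()


def load_pickle(path, read_bytes=read_bytes_with_file_io):
    """Unpickle an object whose bytes are fetched by read_bytes(path)."""
    # pickle.loads works on an in-memory byte string, which sidesteps
    # handing a FileIO object directly to pickle.load.
    return pickle.loads(read_bytes(path))


# Hypothetical usage on Cloud ML:
#   matrix = load_pickle('gs://my-bucket/path/to/pickle')
```

Note that this reads the whole pickle into memory at once, so it only suits data that fits in memory.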

You are welcome to use whatever data format works for you and are not limited to CSV or TFRecord (do you mind pointing to the place in the documentation that makes that claim?). If the data fits in memory, then your approach is sensible.

If the data doesn't fit in memory, you will likely want to use TensorFlow's reader framework, whose most convenient input formats tend to be CSV and TFRecords. A TFRecord file is simply a container of byte strings. Most commonly, it holds serialized tf.Example data, which does support sparse data (a tf.Example is essentially a map from feature names to lists of values). See tf.parse_example (tf.io.parse_example in TensorFlow 2.x) for more information on parsing tf.Example data.
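To illustrate how a sparse row might be stored in a tf.Example, here is a minimal sketch that keeps only the non-zero entries of one row as two parallel lists: column indices and values. The function names and the feature names 'indices', 'values' and 'label' are hypothetical, and the inputs indptr, indices and data are assumed to be the three arrays that a scipy.sparse CSR matrix exposes under those names.

```python
def csr_row_to_lists(indptr, indices, data, row):
    """Return (column_indices, values) for one row of a CSR matrix.

    indptr, indices and data are assumed to be the arrays that
    scipy.sparse.csr_matrix exposes under the same attribute names.
    """
    start, stop = indptr[row], indptr[row + 1]
    cols = [int(j) for j in indices[start:stop]]
    vals = [float(v) for v in data[start:stop]]
    return cols, vals


def make_example(cols, vals, label):
    """Build a tf.train.Example holding only the non-zero entries of a row.

    The feature names 'indices', 'values' and 'label' are hypothetical.
    """
    import tensorflow as tf  # imported lazily; only needed to build Examples

    feature = {
        'indices': tf.train.Feature(int64_list=tf.train.Int64List(value=cols)),
        'values': tf.train.Feature(float_list=tf.train.FloatList(value=vals)),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


# Hypothetical usage, with X a scipy.sparse.csr_matrix and y the labels:
#   cols, vals = csr_row_to_lists(X.indptr, X.indices, X.data, 0)
#   serialized = make_example(cols, vals, int(y[0])).SerializeToString()
```

Each serialized string could then be written as one record of a TFRecord file. On the reading side, the two lists would typically be declared as variable-length features (tf.VarLenFeature in TensorFlow 1.x) when calling tf.parse_example, so the record size scales with the number of non-zero entries rather than with the vocabulary size.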



Source: https://stackoverflow.com/questions/40133223/pickled-scipy-sparse-matrix-as-input-data
