Concurrent SAX processing of large, simple XML files?

Submitted on 2019-12-12 01:33:26

Question


I have a couple of gigantic XML files (10GB-40GB) with a very simple structure: a single root node containing multiple row nodes. I'm trying to parse them using SAX in Python, but the extra processing I have to do for each row means that the 40GB file takes an entire day to complete. To speed things up, I'd like to use all my cores simultaneously. Unfortunately, the SAX parser can't handle the "malformed" chunks of XML you get when you seek to an arbitrary point in the file and start parsing from there. Since the SAX parser can accept a stream, I'm thinking I need to divide my XML file into eight different streams, each containing [number of rows]/8 rows and padded with fake opening and closing tags. How would I go about doing this? Or is there a better solution I might be missing? Thank you!
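To make the idea concrete, here's a rough sketch of what I have in mind, under heavy assumptions: the file is called big.xml and is UTF-8, the root element is <root>, every record begins with the bytes <row, and a row starts within 1 MB of any file position. The file name, tag names, and the trivial RowHandler are stand-ins for my real ones. Each worker streams its byte range through its own parser via feed(), so no chunk has to fit in memory:

    import multiprocessing
    import os
    import xml.sax

    FILENAME = "big.xml"    # placeholder input file
    ROW_START = b"<row"     # byte pattern that begins every record
    NUM_WORKERS = 8
    BLOCK = 1 << 20         # 1 MB read size

    class RowHandler(xml.sax.ContentHandler):
        """Stand-in for the real per-row processing; here it only counts."""
        def __init__(self):
            self.rows = 0

        def startElement(self, name, attrs):
            if name == "row":
                self.rows += 1

    def next_row_offset(f, pos, limit):
        """Scan forward from pos for the start of the next <row element."""
        f.seek(pos)
        buf = f.read(BLOCK)  # assumes a row begins within the next 1 MB
        idx = buf.find(ROW_START)
        return min(pos + idx, limit) if idx != -1 else limit

    def parse_range(args):
        """Stream one byte range through SAX, padded with fake root tags."""
        start, end = args
        handler = RowHandler()
        parser = xml.sax.make_parser()
        parser.setContentHandler(handler)
        parser.feed(b"<root>")           # fake opening tag
        with open(FILENAME, "rb") as f:
            f.seek(start)
            remaining = end - start
            while remaining > 0:
                block = f.read(min(BLOCK, remaining))
                if not block:
                    break
                parser.feed(block)
                remaining -= len(block)
        parser.feed(b"</root>")          # fake closing tag
        parser.close()
        return handler.rows

    if __name__ == "__main__":
        size = os.path.getsize(FILENAME)
        with open(FILENAME, "rb") as f:
            first = next_row_offset(f, 0, size)   # skip past the real <root>
            f.seek(max(0, size - BLOCK))          # locate the real </root>
            last = max(0, size - BLOCK) + f.read().rfind(b"</")
            span = last - first
            bounds = [next_row_offset(f, first + span * i // NUM_WORKERS, last)
                      for i in range(NUM_WORKERS)] + [last]
        ranges = list(zip(bounds[:-1], bounds[1:]))
        with multiprocessing.Pool(NUM_WORKERS) as pool:
            print("rows:", sum(pool.map(parse_range, ranges)))

Because every split point is moved forward to a <row boundary, each chunk is a clean run of complete row elements, and the fake tags make it a well-formed document again.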


Answer 1:


You can't easily split the SAX parsing across multiple threads, and you don't need to: the parse alone, without any other processing, should run in 20 minutes or so. Focus on the processing you do on the data in your ContentHandler.
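To see how much of the day-long runtime is raw parsing versus your own per-row work, time a handler that does nothing but count rows. A minimal sketch, where big.xml and the "row" tag name are placeholders for yours:

    import time
    import xml.sax

    class CountingHandler(xml.sax.ContentHandler):
        """Does no real work; measures raw SAX throughput."""
        def __init__(self):
            self.rows = 0

        def startElement(self, name, attrs):
            if name == "row":
                self.rows += 1

    t0 = time.time()
    handler = CountingHandler()
    xml.sax.parse("big.xml", handler)
    print(f"{handler.rows} rows in {time.time() - t0:.1f}s")

Whatever time is left over when you subtract this from your full run is time spent in your own processing, and that is where optimization (or parallelism) will pay off.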




Answer 2:


My suggestion is to read the whole XML file into an intermediate format first and do the extra processing afterwards. SAX should be fast enough to read 40GB of XML in no more than an hour.

Depending on the data you could use a SQLite database or HDF5 file for intermediate storage.
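As an illustration of the SQLite route, a handler along these lines could load rows into a database during the single SAX pass. The file names, table name, attribute names ("a", "b"), and batch size are all invented for the sketch:

    import sqlite3
    import xml.sax

    class SQLiteLoader(xml.sax.ContentHandler):
        """Writes each row's attributes into SQLite, batching inserts."""
        def __init__(self, conn):
            self.conn = conn
            self.pending = []

        def startElement(self, name, attrs):
            if name == "row":
                self.pending.append((attrs.get("a"), attrs.get("b")))
                if len(self.pending) >= 10000:   # batch for insert speed
                    self.flush()

        def flush(self):
            self.conn.executemany("INSERT INTO rows VALUES (?, ?)",
                                  self.pending)
            self.conn.commit()
            self.pending.clear()

    conn = sqlite3.connect("intermediate.db")
    conn.execute("CREATE TABLE IF NOT EXISTS rows (a TEXT, b TEXT)")
    handler = SQLiteLoader(conn)
    xml.sax.parse("big.xml", handler)
    handler.flush()   # write out the final partial batch
    conn.close()

Once the rows are in SQLite you can query, index, and reprocess them repeatedly without ever touching the 40GB XML again.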

By the way, Python threads can't run Python code in parallel (see the GIL). You need the multiprocessing module to split the work across separate processes.
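One sketch of that division of labor: keep a single fast parse in the main process and fan the expensive per-row work out to a process pool. Because SAX is push-based and awkward to turn into the iterator a Pool wants, this example swaps in the pull-style xml.etree.ElementTree.iterparse from the standard library; process_row is a placeholder for your real per-row computation, and the tag names are again assumed:

    import multiprocessing
    import xml.etree.ElementTree as ET

    def process_row(attrib):
        # placeholder for the CPU-heavy per-row work
        return sum(len(v) for v in attrib.values())

    def rows(filename):
        """Yield each row's attributes, discarding finished elements."""
        context = ET.iterparse(filename, events=("start", "end"))
        _, root = next(context)              # grab the root element
        for event, elem in context:
            if event == "end" and elem.tag == "row":
                yield dict(elem.attrib)
                root.clear()                 # keep memory flat on a 40GB file

    if __name__ == "__main__":
        with multiprocessing.Pool() as pool:
            results = pool.imap_unordered(process_row, rows("big.xml"),
                                          chunksize=1000)
            print(sum(results))

The chunksize matters here: handing rows to workers one at a time would drown any speedup in inter-process communication overhead.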



Source: https://stackoverflow.com/questions/23214773/concurrent-sax-processing-of-large-simple-xml-files
