OneNote parsing - how to get to the Text Blobs in the document?

Posted by 故事扮演 on 2019-12-03 09:14:24

Question


I am writing a parser for the .one file format, which I plan to contribute to the Apache Tika project once it is finished.

Here is the open source project I'm creating, licensed under the Apache License 2.0: https://github.com/nddipiazza/onenote-parser-java

I used the specification document here: https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-one/73d22548-a613-4350-8c23-07d15576be50

As a starting point, I ported over the code from this open source C++ project: https://github.com/dropbox/onenote-parser

I have made a lot of progress parsing the documents, but I've hit a roadblock.

Here is the OneNote file I'm using to parse: https://drive.google.com/file/d/1uROTEnKeBKU08CG_K5zdDTGHa178LgBK/view?usp=sharing

The strings Section1TextArea1 and Section1TextArea2 do not appear anywhere in my parsed results, so I must be skipping some key structure while parsing.

The text is definitely in the OneNote file itself; I can see it in a hex viewer.
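One way to make the hex-viewer observation actionable is to record the byte offsets at which the missing strings occur, then compare them with the offsets the parser actually visits. The following is a minimal Java sketch and is not part of onenote-parser-java: the class `BlobLocator` and its methods are hypothetical names, and it assumes the text is stored uncompressed as either UTF-16LE or single-byte ASCII, which should be verified against the file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds where a known string sits in the raw file bytes.
final class BlobLocator {

    // Returns every offset at which needle occurs in haystack (naive scan).
    public static List<Integer> findAll(byte[] haystack, byte[] needle) {
        List<Integer> hits = new ArrayList<>();
        if (needle.length == 0) {
            return hits;
        }
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Encodes text with the given charset and searches for the resulting bytes.
    public static List<Integer> findText(byte[] file, String text, Charset cs) {
        return findAll(file, text.getBytes(cs));
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: BlobLocator <file.one> <text>");
            return;
        }
        byte[] file = Files.readAllBytes(Paths.get(args[0]));
        // Assumption: the text is stored either as UTF-16LE or as plain ASCII.
        System.out.println("UTF-16LE offsets: "
                + findText(file, args[1], StandardCharsets.UTF_16LE));
        System.out.println("ASCII offsets:    "
                + findText(file, args[1], StandardCharsets.US_ASCII));
    }
}
```

Run it with the path to the .one file and the string to look for, for example Section1TextArea1. If an offset is reported but no parsed structure in the JSON output covers that offset, the text lives in a region the parser never follows a reference into, which narrows down which reference type is being skipped.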

Here is the JSON parse output: https://gist.github.com/nddipiazza/02d2252d357b3b02a6b9ab1050474267

I feel like the specification is missing some important information needed to parse this proprietary format.

Which major element(s) am I failing to parse, such that the actual text content never shows up in the output?

Source: https://stackoverflow.com/questions/59008205/onenote-parsing-how-to-get-to-the-text-blobs-in-the-document