PDF.js getTextContent returning text in wrong order

Submitted by 主宰稳场 on 2019-12-23 06:23:23

Question


I'm having a problem extracting text from a PDF using PDF.js. The PDF was exported from Excel, so it contains multiple rows and columns. Here's the function I'm using to get the text:

 gettext: function(url, name) {
     var self = this;
     console.log('attempting to get text');
     // Note: newer PDF.js releases return a loading task here, so
     // pdfjs.getDocument(url).promise.then(...) may be needed instead.
     return pdfjs.getDocument(url).then(function(pdf) {
         // Only the first two pages are read; PDF.js page numbers are 1-based.
         var pageNumbers = [1, 2];
         return Promise.all(pageNumbers.map(function(pageNumber) {
             return pdf.getPage(pageNumber).then(function(page) {
                 return page.getTextContent();
             }).then(function(textContent) {
                 // Join the strings of one page in the order PDF.js reports them.
                 return textContent.items.map(function(item) {
                     return item.str;
                 }).join('###');
             });
         }));
     }).then(function(pages) {
         // One line per page, then hand the combined text to the parser.
         return self.parsetext(pages.join("\r\n"), url, name);
     });
 },

This works very well for most of the text content. The annoying issue is that the contents of one section come out mixed up.

One odd thing: the beschreibung section is extracted at the end of the text, even though it is just another column like kks or seite. I could work with that if it kept the correct order, but as the log below shows, it doesn't. This is what the gettext function logs when it reaches the beschreibung section:

Data.vue?1e15:250 item string: 12
Data.vue?1e15:250 item string: MA-KF12
Data.vue?1e15:250 item string: 26
Data.vue?1e15:250 item string: MA-KF12
Data.vue?1e15:250 item string: 33
Data.vue?1e15:250 item string: MA-KF12
Data.vue?1e15:250 item string: 44
Data.vue?1e15:250 item string: MA-KF12
Data.vue?1e15:250 item string: 82
Data.vue?1e15:250 item string: Inbetriebnahme
Data.vue?1e15:250 item string: Anhang
Data.vue?1e15:250 item string: Anhang
Data.vue?1e15:250 item string: Vorwort
Data.vue?1e15:250 item string: Produktübersicht
Data.vue?1e15:250 item string: Grundlagen der Kommunikation
Data.vue?1e15:250 item string: Montage und Verdrahtung
Data.vue?1e15:250 item string: Vorwort
Data.vue?1e15:250 item string: Produktübersicht
Data.vue?1e15:250 item string: Grundlagen der Kommunikation
Data.vue?1e15:250 item string: Montage und Verdrahtung
Data.vue?1e15:250 item string: Inbetriebnahme
Data.vue?1e15:250 item string: Produktübersicht
Data.vue?1e15:250 item string: Grundlagen der Kommunikation
Data.vue?1e15:250 item string: Montage und Verdrahtung
Data.vue?1e15:250 item string: Inbetriebnahme
Data.vue?1e15:250 item string: Anhang
Data.vue?1e15:250 item string: Produktdaten
Data.vue?1e15:250 item string: Dokumentation
Data.vue?1e15:250 item string: Beschreibung
Data.vue?1e15:250 item string: Vorwort

As you can see, the order is off (but only slightly...)

Any ideas what could be going wrong?
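
One thing that might be worth checking, offered as an assumption rather than a confirmed diagnosis: getTextContent reports items in the order they occur in the PDF's content stream, which does not have to match the visual layout of a table. Each item also carries a transform array whose last two entries are its x and y position, so the rows could be rebuilt from those coordinates. Below is a minimal sketch of that idea. The helper name groupItemsIntoRows, the yTolerance parameter, and the sample coordinates are all made up for illustration and are not taken from the real PDF.

```javascript
// Sketch only: rebuild table rows from item positions instead of item order.
// Assumes every item has `str` and a 6-element `transform` array, where
// transform[4] is x and transform[5] is y (y grows upward in PDF space).
// `yTolerance` is a hypothetical tuning value for "same row".
function groupItemsIntoRows(items, yTolerance) {
  var tolerance = yTolerance === undefined ? 2 : yTolerance;

  // Sort a copy top-to-bottom (larger y first), then left-to-right.
  var sorted = items.slice().sort(function (a, b) {
    return b.transform[5] - a.transform[5] || a.transform[4] - b.transform[4];
  });

  // Walk the sorted items and start a new row whenever y jumps by more
  // than the tolerance relative to the first item of the current row.
  var rows = [];
  sorted.forEach(function (item) {
    var last = rows[rows.length - 1];
    if (last && Math.abs(last.y - item.transform[5]) <= tolerance) {
      last.items.push(item);
    } else {
      rows.push({ y: item.transform[5], items: [item] });
    }
  });

  // Within each row, order the cells by x and keep only the strings.
  return rows.map(function (row) {
    return row.items
      .sort(function (a, b) { return a.transform[4] - b.transform[4]; })
      .map(function (item) { return item.str; });
  });
}

// Hypothetical sample: two table rows delivered in a scrambled order.
var sample = [
  { str: 'Vorwort',        transform: [1, 0, 0, 1, 200, 700] },
  { str: 'MA-KF12',        transform: [1, 0, 0, 1, 50, 700.5] },
  { str: 'Inbetriebnahme', transform: [1, 0, 0, 1, 200, 680] },
  { str: '12',             transform: [1, 0, 0, 1, 120, 700] },
  { str: 'MA-KF12',        transform: [1, 0, 0, 1, 50, 680] },
  { str: '26',             transform: [1, 0, 0, 1, 120, 680] }
];

console.log(groupItemsIntoRows(sample));
// [ [ 'MA-KF12', '12', 'Vorwort' ], [ 'MA-KF12', '26', 'Inbetriebnahme' ] ]
```

In the function above this would replace the textContent.items.map(...) step, for example by joining each returned row with '###'. Cells that wrap onto several lines, or rotated text, would likely need extra handling that this sketch does not attempt.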

Source: https://stackoverflow.com/questions/44520376/pdf-js-gettextcontent-returning-text-in-wrong-order
