How to extract plain text from PDF in golang

被刻印的时光 ゝ submitted on 2020-01-04 05:50:09

Question


I want to extract text from a PDF file using Go. I tried the ledongthuc/pdf Go package, which implements the method GetPlainText() to get plain text content without formatting. But I don't get the plain text. Instead, I get this result:

 W
 S
 D
 V
 Y R
 O
 R
 Q
 W
 D
 L
 U
 H
 P
 H
 Q
 W
......

Go code

package main

import (
    "bytes"
    "fmt"

    "github.com/ledongthuc/pdf"
)

func main() {
    content, err := readPdf("test.pdf")
    if err != nil {
        panic(err)
    }
    fmt.Println(content)
}

func readPdf(path string) (string, error) {
    r, err := pdf.Open(path)
    if err != nil {
        return "", err
    }
    totalPage := r.NumPage()

    var textBuilder bytes.Buffer
    for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
        p := r.Page(pageIndex)
        if p.V.IsNull() {
            continue
        }
        textBuilder.WriteString(p.GetPlainText("\n"))
    }
    return textBuilder.String(), nil
}

Answer 1:


You can get output such as "Example of a pdf document." instead of

Ex
a
m
pl
e

of

a

pd
f

doc
u
m
e
nt
.

Passing "\n" as the separator appears to put each extracted text fragment on its own line. To avoid that, change textBuilder.WriteString(p.GetPlainText("\n")) to

textBuilder.WriteString(p.GetPlainText(""))
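
To see why the separator matters, here is a minimal sketch that does not call ledongthuc/pdf at all. It assumes the page text arrives as a series of short fragments and that the separator is written after each one. The joinFragments helper and the fragments slice are hypothetical stand-ins used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// joinFragments is a hypothetical helper (not part of ledongthuc/pdf).
// It writes each fragment followed by sep, which is the assumed behaviour
// of a plain-text extractor that takes a separator argument.
func joinFragments(fragments []string, sep string) string {
	var b strings.Builder
	for _, f := range fragments {
		b.WriteString(f)
		b.WriteString(sep)
	}
	return b.String()
}

func main() {
	// Assumed input: a PDF often stores one sentence as several short text runs.
	fragments := []string{
		"Ex", "a", "m", "pl", "e", " of", " a",
		" pd", "f", " doc", "u", "m", "e", "nt", ".",
	}

	// With "\n", every fragment ends up on its own line.
	fmt.Printf("%q\n", joinFragments(fragments, "\n"))

	// With "", the fragments run together into readable text.
	fmt.Printf("%q\n", joinFragments(fragments, ""))
}
```

With an empty separator the fragments are concatenated as-is, so any spacing has to come from the fragments themselves.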

I hope this helps.



Source: https://stackoverflow.com/questions/44560265/how-to-extract-plain-text-from-pdf-in-golang
