text-extraction

PDFminer: extract text with its font information

人走茶凉 submitted on 2019-12-09 15:55:53
Question: I found this question, but it uses the command line, and I do not want to call a Python script from the command line using subprocess and parse HTML files to get the font information. I want to use PDFminer as a library, and I found this other question, but its answers are all about extracting plain text, without other information such as font name, font size, and so on. Answer 1: #!/usr/bin/env python from pdfminer.pdfparser import PDFParser from pdfminer.pdfdocument import PDFDocument from pdfminer.pdfpage import

Extract All Unique Lines

浪尽此生 submitted on 2019-12-09 04:44:24
Question: I have text files with exact repeated lines of text, but I only want one of each. Imagine this text file:
AAAAA
AAAAA
AAAAA
BB
BBBBB
BBBBB
CCC
CCC
CCC
I would only need the following four lines from it:
AAAAA
BB
BBBBB
CCC
I'm using a text editor (EmEditor or Notepad++) that supports RegEx, not a programming language, so I must use a pure regular expression. Any help? EDIT: I checked the other thread that hsz mentioned and I'd like to make it clear that this one is not the same. Although
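A minimal sketch of one possible pattern, assuming the duplicates are always adjacent as in the sample above (the truncated question does not say whether they can be scattered). Python is used here only to demonstrate the expression; the pattern `^(.*)(?:\n\1)+$` with replacement `\1` is the part intended to carry over to an editor, and whether EmEditor or Notepad++ accept it unchanged is not verified here. The function name is illustrative.

```python
import re

# A line, followed by one or more identical copies of itself.
# The trailing $ stops "BB" from being treated as a prefix of "BBBBB".
ADJACENT_DUPLICATES = re.compile(r"^(.*)(?:\n\1)+$", re.MULTILINE)


def collapse_adjacent_duplicates(text: str) -> str:
    """Replace each run of identical adjacent lines with a single copy."""
    return ADJACENT_DUPLICATES.sub(r"\1", text)


sample = "\n".join(
    ["AAAAA", "AAAAA", "AAAAA", "BB", "BBBBB", "BBBBB", "CCC", "CCC", "CCC"]
)
result = collapse_adjacent_duplicates(sample)
print(result)
```

Duplicates separated by other lines are left alone by this pattern, so it only fits files shaped like the sample.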

iTextSharp inserting spaces within words from a pdf file

做~自己de王妃 submitted on 2019-12-08 06:08:45
Question: Using iTextSharp, I am trying to extract the text from the following PDF file: https://www.treasury.gov/ofac/downloads/sdnlist.pdf This is the code: var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, 2, new SimpleTextExtractionStrategy()); if (currentText.Length > 0) { var capture = new Capture(); capture.Text = currentText; // write the results to the DB, if any data was found _dataService.AddCapture(capture); } Using the SimpleTextExtractionStrategy, the results are written to

How to extract literal words from a consecutive string efficiently? [duplicate]

試著忘記壹切 submitted on 2019-12-08 01:46:50
Question: This question already has answers here (closed 7 years ago). Possible duplicate: How to split text without spaces into list of words? There are masses of text in people's comments, parsed from HTML, with no delimiting characters in them. For example: thumbgreenappleactiveassignmentweeklymetaphor. Clearly the string contains 'thumb', 'green', 'apple', etc. I also have a large dictionary for checking whether a candidate word is valid. So, what's the fastest way

select HTML text element with regex?

╄→尐↘猪︶ㄣ submitted on 2019-12-07 05:28:34
Question: I want to look for © in an HTML document, and basically get the entity the copyright is attributed to. The copyright line shows up a couple of different ways:
<p class="bg-copy">© 2011 The New York Times Company</p>
or
<a href="http://www.nytimes.com/ref/membercenter/help/copyright.html"> © 2011</a> <a href="http://www.nytco.com/">The New York Times Company</a>
or
<br>Published since 1996<br>Copyright © CounterPunch<br> All rights reserved.<br>
I want to ignore the dates and intervening tags
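A rough sketch of one way to handle the three sample lines above, not a general HTML solution: line-breaking tags become newlines, remaining tags become spaces, and a regex then skips an optional four-digit year after ©. The function name and the choice of `<br>` and `</p>` as line breaks are assumptions made for this illustration; a real page would be better served by a proper HTML parser.

```python
import html
import re


def copyright_holder(markup: str):
    """Return the text following the first © sign, or None if there is none."""
    # Treat <br> and </p> as line ends so the holder's name stops there.
    text = re.sub(r"<br\s*/?>|</p>", "\n", markup, flags=re.IGNORECASE)
    # Drop every other tag, leaving a space so words do not run together.
    text = re.sub(r"<[^>]+>", " ", text)
    # Turn entities such as &copy; into the literal character.
    text = html.unescape(text)
    # © then an optional 4-digit year, then the rest of the line.
    match = re.search(r"©\s*(?:\d{4}\s*)?([^\n]*\S)", text)
    if match is None:
        return None
    return re.sub(r"\s+", " ", match.group(1)).strip()


samples = [
    '<p class="bg-copy">© 2011 The New York Times Company</p>',
    '<a href="http://www.nytimes.com/ref/membercenter/help/copyright.html">'
    ' © 2011</a> <a href="http://www.nytco.com/">The New York Times Company</a>',
    "<br>Published since 1996<br>Copyright © CounterPunch<br>"
    " All rights reserved.<br>",
]
holders = [copyright_holder(s) for s in samples]
print(holders)
```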

iTextSharp inserting spaces within words from a pdf file

。_饼干妹妹 submitted on 2019-12-06 15:20:16
Using iTextSharp, I am trying to extract the text from the following PDF file: https://www.treasury.gov/ofac/downloads/sdnlist.pdf This is the code: var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, 2, new SimpleTextExtractionStrategy()); if (currentText.Length > 0) { var capture = new Capture(); capture.Text = currentText; // write the results to the DB, if any data was found _dataService.AddCapture(capture); } Using the SimpleTextExtractionStrategy, the results are written to the database with myriad unwanted spaces within words. The first several lines of page 2 are written as:

How to extract data from a file in C

删除回忆录丶 submitted on 2019-12-06 15:14:43
Question: I have a .dat file containing 6 columns of N numbers like so:
-4.997740e-01 -1.164187e+00 3.838383e-01 6.395961e+01 -1.938013e+02 -4.310365e-02
-1.822405e+00 4.470735e-01 -2.691410e-01 -8.528020e+01 -1.358874e+02 -7.072167e-01
9.932887e-01 -2.157249e+00 -2.303825e+00 -5.508925e+01 -3.548236e+02 1.250405e+00
-1.871123e+00 1.505421e-01 -6.550555e-01 -3.254452e+02 -5.501001e+01 8.776851e-01
1.370605e+00 -1.028076e+00 -1.137059e+00 6.096598e+01 -4.472264e+02 -1.268752e+00
............ ...........

How to extract literal words from a consecutive string efficiently? [duplicate]

♀尐吖头ヾ submitted on 2019-12-06 13:25:11
This question already has answers here (closed 7 years ago). Possible duplicate: How to split text without spaces into list of words? There are masses of text in people's comments, parsed from HTML, with no delimiting characters in them. For example: thumbgreenappleactiveassignmentweeklymetaphor. Clearly the string contains 'thumb', 'green', 'apple', etc. I also have a large dictionary for checking whether a candidate word is valid. So, what's the fastest way to extract these words? I'm not really sure a naive algorithm would serve your purpose well, as
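For illustration, a small dynamic-programming sketch of dictionary-based splitting. It is not taken from the truncated answer and is not claimed to be the fastest method; it only shows one way to avoid the exponential backtracking of a naive search. The function name and the tiny word set are made up for the example, and when several splits are possible it simply prefers the longest word ending at each position.

```python
def segment(text, words):
    """Split text into dictionary words, or return None if no split exists."""
    max_len = max(map(len, words), default=0)
    # best[i] is a list of words that exactly covers text[:i], or None.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        # Only look back as far as the longest dictionary word.
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in words:
                best[end] = best[start] + [piece]
                break  # smallest start first, so the longest word wins
    return best[-1]


dictionary = {
    "thumb", "green", "apple", "active", "assignment", "weekly", "metaphor",
}
parts = segment("thumbgreenappleactiveassignmentweeklymetaphor", dictionary)
print(parts)
```

With a set for the dictionary, each position does at most `max_len` lookups, so the work grows roughly linearly with the length of the string.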

Extracting Number and Name from String [r]

陌路散爱 submitted on 2019-12-06 07:14:05
Question: POSIX regular expressions are giving me a headache. Let's say we have a string: a = "[question(37), question_pipe(\"Person10\")]" and ultimately I would like to be able to have: b = c("37", "Person10") I've had a look at the stringr package but can't figure out how to extract the information using regular expressions and str_split. Any help would be greatly appreciated. Cameron Answer 1: So if I understand correctly, you want to extract the elements within parentheses. You can first extract those elements
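The idea in the answer (pull out whatever sits inside each pair of parentheses, then clean it up) can be sketched as follows. This is written in Python purely to illustrate the pattern; it is not the R code from the truncated answer, and the variable names are chosen for the example.

```python
import re

a = '[question(37), question_pipe("Person10")]'

# Capture the text between each "(" and the next ")".
inside_parens = re.findall(r"\(([^()]*)\)", a)

# Strip the surrounding double quotes that some items carry.
b = [item.strip('"') for item in inside_parens]
print(b)
```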

How to read PDF files which are in Asian languages (Chinese, Japanese, Thai, etc.) and store in a string in Python

巧了我就是萌 submitted on 2019-12-06 06:35:47
Question: I am using PyPDF2 to read PDF files in Python. While it works well for English and other European languages (those with English-like alphabets), the library fails to read Asian languages like Japanese and Chinese. I tried encode('utf-8') and decode('utf-8'), but nothing seems to work. It just prints a blank string when extracting the text. I have tried other libraries like textract and PDFMiner, but with no success yet. When I copy the text from the PDF and paste it on a notebook, the characters turn into