text-extraction

Regular expression to extract chunks of text from a text file?

白昼怎懂夜的黑 submitted on 2019-12-06 05:53:09
I need to extract headings and the chunk of text beneath each of them from a text file in Python using regular expressions, but I'm finding it difficult. I converted this PDF to text. So far I have been able to get all the numerical headers (12.4.5.4, 12.4.5.6, 13, 13.1, 13.1.1, 13.1.12) using the following regex:
    import re

    with open('data/single.txt', encoding='UTF-8') as file:
        for line in file:
            headings = re.findall(r'^\d+(?:\.\d+)*\.?', line)
            print(headings)
I just don't know how to get the worded part of those headings or the paragraph of text beneath them. EDIT -
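One possible way to capture the worded part of each heading and the text beneath it is sketched below. It assumes every heading sits on its own line as a number followed by whitespace and a title; the names `HEADING_RE`, `split_sections` and the sample text are made up for illustration and are not from the question. A body line that happens to start with a number would be mistaken for a heading, so treat this as a starting point rather than a complete solution.

```python
import re

# Matches a numbered heading such as "13.1.1 Scope" at the start of a line:
# group 1 is the number, group 2 is the worded part of the heading.
HEADING_RE = re.compile(r'^(\d+(?:\.\d+)*)\.?[ \t]+(\S.*)$')


def split_sections(text):
    """Return a list of (number, title, body) tuples found in text."""
    sections = []
    current = None  # [number, title, body_lines] of the section being built
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # A new heading closes the section collected so far.
            if current is not None:
                sections.append((current[0], current[1], '\n'.join(current[2]).strip()))
            current = [match.group(1), match.group(2).strip(), []]
        elif current is not None:
            current[2].append(line)
    if current is not None:
        sections.append((current[0], current[1], '\n'.join(current[2]).strip()))
    return sections


# Hypothetical sample standing in for the converted PDF text.
sample = """13 General
Intro paragraph.
13.1 Scope
First line of scope.
Second line of scope.
13.1.1 Limits
"""

for number, title, body in split_sections(sample):
    print(number, '|', title, '|', repr(body))
```

Reading the whole file into one string and then splitting it into lines keeps the heading and its body together, which the per-line `re.findall` loop cannot do.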

HTML downloading and text extraction

雨燕双飞 submitted on 2019-12-06 04:10:24
Question: What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the download file names and threading would be a bonus. The platform is Linux.
Answer 1: wget | html2ascii. Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it). See also: lynx.
Answer 2: Python Beautiful Soup allows you to build a nice extractor.
Answer 3: I know that w3m can be used to render
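For the text-extraction half of the question, here is a minimal sketch that uses only Python's standard-library `html.parser` instead of the tools named in the answers. The class name `TextExtractor` and the helper `html_to_text` are invented for this example, and skipping only `<script>` and `<style>` is an assumption about what counts as non-text content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than zero while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(markup):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return '\n'.join(parser.parts)


page = '<html><body><h1>Title</h1><p>Hello &amp; bye</p></body></html>'
print(html_to_text(page))
```

The page source itself could come from `urllib.request.urlopen` or from files already fetched with wget.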

Using boilerpipe to extract non-english articles

对着背影说爱祢 submitted on 2019-12-06 01:37:12
Question: I am trying to use the boilerpipe Java library to extract news articles from a set of websites. It works great for texts in English, but for text with special characters, for example words with accent marks (história), these special characters are not extracted correctly. I think it is an encoding problem. In the boilerpipe FAQ, it says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in this paper. My question is, are
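Whether boilerpipe is really hitting an encoding mismatch cannot be confirmed from the excerpt, but the symptom would look like the sketch below. It is written in Python rather than Java purely as an illustration: the same UTF-8 bytes are decoded once with the wrong charset and once with the right one.

```python
word = 'história'
raw = word.encode('utf-8')      # bytes as a UTF-8 page would deliver them

wrong = raw.decode('latin-1')   # decoded with a charset the page did not use
right = raw.decode('utf-8')     # decoded with the charset the page did use

print(wrong)  # histÃ³ria
print(right)  # história
```

If the extracted articles show two-character sequences like these in place of accented letters, the bytes are probably being decoded with the wrong charset somewhere before the text reaches the extractor.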

Is there a way to get all text from the rendered page with JS?

时光总嘲笑我的痴心妄想 submitted on 2019-12-05 13:05:58
Is there an (unobtrusive, to the user) way to get all the text in a page with JavaScript? I could get the HTML, parse it, remove all tags, etc., but I'm wondering if there's a way to get the text from the already rendered page. To clarify, I don't want to grab text from a selection, I want the entire page. Thank you!
All credit to Greg W's answer, as I based this answer on his code, but I found that for a website without inline style or script tags it was generally simpler to use:
    var theText = $('body').text();
as this grabs all text in all tags without one having to manually set every tag

select HTML text element with regex?

浪子不回头ぞ submitted on 2019-12-05 08:44:15
I want to look for © in an HTML document, and basically get the entity the copyright is attributed to. The copyright line shows up a couple of different ways:
    <p class="bg-copy">© 2011 The New York Times Company</p>
or
    <a href="http://www.nytimes.com/ref/membercenter/help/copyright.html"> © 2011</a> <a href="http://www.nytco.com/">The New York Times Company</a>
or
    <br>Published since 1996<br>Copyright © CounterPunch<br> All rights reserved.<br>
I want to ignore the dates and intervening tags and just get "The New York Times Company" or "CounterPunch". I haven't been able to find much on using
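A rough sketch of one approach, checked only against the three samples quoted in the question: turn `<br>` into a newline, drop the remaining tags, then take whatever follows the © sign and an optional year. The names `copyright_holder`, `TAG_RE` and `HOLDER_RE` are invented here. Regex-based tag stripping is fragile on arbitrary HTML, so a real parser may be the better choice for anything beyond lines like these.

```python
import re
from html import unescape

TAG_RE = re.compile(r'<[^>]+>')
# After the © sign, skip an optional year or year range, then capture
# the rest of the line as the holder's name.
HOLDER_RE = re.compile(r'©\s*(?:\d{4}(?:\s*[-,]\s*\d{4})*)?\s*([^\n©]+)')


def copyright_holder(markup):
    """Return the name following the © sign, or None if there is none."""
    text = unescape(markup)  # turns &copy; into a literal ©
    # <br> is a visual line break, so keep it as a newline.
    text = re.sub(r'<br\s*/?>', '\n', text, flags=re.IGNORECASE)
    text = TAG_RE.sub(' ', text)         # drop every other tag
    text = re.sub(r'[ \t]+', ' ', text)  # collapse runs of spaces
    match = HOLDER_RE.search(text)
    return match.group(1).strip() if match else None


print(copyright_holder('<p class="bg-copy">© 2011 The New York Times Company</p>'))
```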

How to extract text under specific headings from a pdf?

风格不统一 submitted on 2019-12-04 23:22:45
Question: I want to extract text under specific headings from a PDF using Python. For example, I have a PDF with the headings Introduction, Summary, Contents. I need to extract only the text under the heading 'Summary'. How can I do this?
Answer 1: This scenario is exactly what I am working on at my current company. We need to extract the text lying under a heading. I'm personally using a rule-based system, i.e., using regex to identify all the numbered headings after reading the entire document line by line. Once I
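The excerpt cuts off before the answer's details, so the following is only a sketch of the general line-by-line idea, not the answerer's code. It assumes the PDF has already been converted to plain text, that each heading sits alone on its own line, and that the list of headings is known in advance; `text_under_heading` is a hypothetical helper name.

```python
def text_under_heading(text, heading, all_headings):
    """Return the lines between `heading` and the next known heading."""
    wanted = heading.strip().lower()
    known = {h.strip().lower() for h in all_headings}
    collected = []
    inside = False
    for line in text.splitlines():
        label = line.strip().lower()
        if label in known:
            # A heading line either opens the wanted section or closes it.
            inside = (label == wanted)
            continue
        if inside:
            collected.append(line)
    return '\n'.join(collected).strip()


# Hypothetical document text using the headings named in the question.
document = """Introduction
Some opening words.
Summary
The key points.
More key points.
Contents
Chapter list.
"""

print(text_under_heading(document, 'Summary', ['Introduction', 'Summary', 'Contents']))
```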

How to extract data from a file in C

倖福魔咒の submitted on 2019-12-04 21:26:41
I have a .dat file containing 6 columns of N numbers like so:
    -4.997740e-01 -1.164187e+00  3.838383e-01  6.395961e+01 -1.938013e+02 -4.310365e-02
    -1.822405e+00  4.470735e-01 -2.691410e-01 -8.528020e+01 -1.358874e+02 -7.072167e-01
     9.932887e-01 -2.157249e+00 -2.303825e+00 -5.508925e+01 -3.548236e+02  1.250405e+00
    -1.871123e+00  1.505421e-01 -6.550555e-01 -3.254452e+02 -5.501001e+01  8.776851e-01
     1.370605e+00 -1.028076e+00 -1.137059e+00  6.096598e+01 -4.472264e+02 -1.268752e+00
     ............  ............  ............  ............  ...........   ...........
I want to write code in C where I

How to read PDF files which are in asian languages (Chinese, Japanese, Thai, etc.) and store in a string in python

…衆ロ難τιáo~ submitted on 2019-12-04 14:56:20
I am using PyPDF2 to read PDF files in Python. While it works well for English and other European languages (those written in the Latin alphabet), the library fails to read Asian languages like Japanese and Chinese. I tried encode('utf-8') and decode('utf-8'), but nothing seems to work. It just prints a blank string on extraction of the text. I have tried other libraries like textract and PDFMiner, but no success yet. When I copy the text from the PDF and paste it into a notebook, the characters turn into some random format text (probably in a different encoding). def convert_pdf_to_text(filename): text =

HTML downloading and text extraction

不羁的心 submitted on 2019-12-04 10:29:46
What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the download file names and threading would be a bonus. The platform is Linux.
dsm
wget | html2ascii. Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it). See also: lynx.
Python Beautiful Soup allows you to build a nice extractor.
I know that w3m can be used to render an html document and put the text content in a textfile w3m www.google.com > file.txt for example. For the
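For the "control over the download file names, and threading" part of the question, a hedged sketch using only the Python standard library could look like this. The naming scheme in `filename_for`, the helper names, and the defaults (4 workers, 30-second timeout) are all choices made for this example, not something the answers prescribe. `urls` is expected to be a list.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_for(url):
    """Derive a flat, predictable file name from a URL."""
    parsed = urlparse(url)
    stem = (parsed.netloc + parsed.path).strip('/').replace('/', '_')
    return (stem or 'index') + '.html'


def fetch(url):
    """Download one URL and return the raw bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_all(urls, out_dir, fetcher=fetch, workers=4):
    """Fetch the URLs on a thread pool and save each under its derived name."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input list.
        bodies = list(pool.map(fetcher, urls))
    saved = []
    for url, body in zip(urls, bodies):
        target = out_path / filename_for(url)
        target.write_bytes(body)
        saved.append(target)
    return saved
```

Passing a different `fetcher` makes it possible to try the file-naming logic without touching the network; calling `download_all(url_list, 'pages')` would use the real `fetch`. Two URLs that map to the same name would overwrite each other under this scheme.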

ColdFusion extract values from text file

故事扮演 submitted on 2019-12-04 06:28:01
Question: The technical details: I want to extract values from a text file containing parameter names and values. For each line that starts with "request.config." (there are empty lines, lines with comments, etc., which I don't want to extract anything from) I want to extract the parameter name and its value (my_param_1 and some random string in this example): request.config.my_param_1 = "some random string"; I thought the best way to do this might be using a regex, but how can I do this? I thought there would be something like a regular expression that would
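The pattern itself is not specific to ColdFusion, so here is a sketch of it in Python; treat the surrounding code as an illustration only. It assumes each setting occupies one line of the form request.config.NAME = "VALUE"; and that values never contain a double quote. `CONFIG_RE` and `parse_config` are names invented for this example.

```python
import re

# One setting per line: request.config.<name> = "<value>";
# Group 1 is the parameter name, group 2 is the quoted value.
CONFIG_RE = re.compile(r'^\s*request\.config\.(\w+)\s*=\s*"([^"]*)"\s*;')


def parse_config(text):
    """Map each request.config parameter name to its quoted string value."""
    values = {}
    for line in text.splitlines():
        match = CONFIG_RE.match(line)
        if match:  # blank lines, comments and other statements are ignored
            values[match.group(1)] = match.group(2)
    return values


# Hypothetical file contents modelled on the line quoted in the question.
sample = '''// a comment

request.config.my_param_1 = "some random string";
'''

print(parse_config(sample))
```

The same regular expression should carry over to ColdFusion's own regex functions with little or no change, since it uses only basic syntax.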