extract

Extracting data between two tags in HTML file

自作多情 submitted on 2019-11-27 15:52:21
I have a huge HTML file saved on my system that contains data from a product catalogue. The data is structured so that, for each product record, the name sits between two tags, <name> and </name>. Each product has up to three attributes (name, productID, and color), but not all products have all of them. How would I go about extracting this data for each product without mixing up the product attributes? The file is also 50 megabytes. Code example: .... <name>'hat'</name> blah blah blah <prodId>'1829493'</prodId> blah blah blah <color>'cyan'</color> blah blah blah blah blah blah
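A minimal Python sketch of one way to group the tags per product. It assumes, which the question does not confirm, that every product record begins with a <name> tag, so a new <name> starts a new record. The function name extract_products and the second sample product are made up for illustration. A 50 MB file should normally fit in memory, so the sketch works on a single string.

```python
import re

# Matches <name>...</name>, <prodId>...</prodId>, or <color>...</color>.
# The tag names come from the question's sample; adjust them to the real file.
TAG_RE = re.compile(r"<(name|prodId|color)>(.*?)</\1>", re.DOTALL)


def extract_products(text):
    """Return one dict per product.

    Assumption: a <name> tag marks the start of a new product record.
    """
    products = []
    current = None
    for match in TAG_RE.finditer(text):
        tag = match.group(1)
        # Drop surrounding whitespace and the single quotes seen in the sample.
        value = match.group(2).strip().strip("'")
        if tag == "name" or current is None:
            current = {}
            products.append(current)
        current[tag] = value
    return products


# Hypothetical sample: the first record is from the question, the second is invented.
sample = (
    "<name>'hat'</name> blah <prodId>'1829493'</prodId> blah "
    "<color>'cyan'</color> blah <name>'scarf'</name> blah <color>'red'</color>"
)
products = extract_products(sample)
print(products)
```

If some products have no name at all, this grouping rule breaks down and a different record delimiter would be needed.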

How can I extract text from a PDF file in Perl?

感情迁移 submitted on 2019-11-27 14:01:16
I am trying to extract text from PDF files using Perl. I have been running pdftotext.exe from the command line (i.e. via Perl's system function) to extract the text, and this method works fine. The problem is that the PDF files contain symbols such as α and β and other special characters that do not appear in the generated txt file. A few extra spaces are also added at random places in the text. Is there a better and more reliable way to extract text from PDF files, such that the output includes all symbols like α and β and exactly matches the text in the PDF (i.e

Extracting image from PDF with /CCITTFaxDecode filter

笑着哭i submitted on 2019-11-27 13:15:06
I have a PDF that was generated by scanning software. The PDF has one TIFF image per page, and I want to extract the TIFF image from each page. I am using iTextSharp; I have successfully found the images and can get the raw bytes back from the PdfReader.GetStreamBytesRaw method. The problem, as many before me have discovered, is that iTextSharp does not contain a PdfReader.CCITTFaxDecode method. What else do I know? Even without iTextSharp, I can open the PDF in Notepad and find the streams with /Filter /CCITTFaxDecode, and I know from the /DecodeParms that it is using CCITTFaxDecode Group 4. Does

Extract part of data from JSON file with python [duplicate]

女生的网名这么多〃 submitted on 2019-11-27 12:54:33
Question: This question already has answers here: Access a particular field in arbitrarily nested JSON data [duplicate] (3 answers). Closed last year. I have been trying to extract only certain data from a JSON file. I managed to decode the JSON and get the wanted data into a Python dict. When I print the dict, it shows all the wanted data, but when I try to write the dict to a new file, only the last object gets written. One thing I also can't understand is why, when I print the dict, I get
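The asker's code is not shown, so the cause can only be guessed. One common reason for "only the last object gets written" is reusing a single dict, or reopening the output file in write mode, inside the loop. A hedged Python sketch of the usual fix is to collect every selected record in a list and serialize it once. The field names, sample data, and file name here are invented for illustration.

```python
import json
import os
import tempfile


def extract_fields(records, wanted):
    """Build a new dict per record, keeping only the wanted keys."""
    return [{key: rec[key] for key in wanted if key in rec} for rec in records]


# Hypothetical input standing in for the decoded JSON file.
raw = '[{"id": 1, "name": "a", "extra": true}, {"id": 2, "name": "b", "extra": false}]'
selected = extract_fields(json.loads(raw), ["id", "name"])

# Open the file once and dump the whole list, rather than writing per iteration.
out_path = os.path.join(tempfile.mkdtemp(), "selected.json")
with open(out_path, "w", encoding="utf-8") as fh:
    json.dump(selected, fh, indent=2)

# Read it back to confirm every record was written.
with open(out_path, encoding="utf-8") as fh:
    written = json.load(fh)
```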

Extract links from a web page

穿精又带淫゛_ submitted on 2019-11-27 12:27:25
Using Java, how can I extract all the links from a given web page? Download the page as plain text/HTML and pass it through Jsoup or HtmlCleaner; both are similar and can parse even malformed HTML 4.0 syntax. You can then use the popular HTML DOM parsing methods like getElementsByName("a"), or, more conveniently in Jsoup, simply use: File input = new File("/tmp/input.html"); Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/"); Elements links = doc.select("a[href]"); // a with href Elements pngs = doc.select("img[src$=.png]"); // img with src ending .png Element

How can I extract the folder path from file path in Python?

守給你的承諾、 submitted on 2019-11-27 10:04:09
Question: I would like to get just the folder path from the full path to a file. For example, given T:\Data\DBDesign\DBDesign_93_v141b.mdb, I would like to get just T:\Data\DBDesign (excluding \DBDesign_93_v141b.mdb). I have tried something like this: existGDBPath = r'T:\Data\DBDesign\DBDesign_93_v141b.mdb' wkspFldr = str(existGDBPath.split('\\')[0:-1]) print wkspFldr but it gave me a result like this: ['T:', 'Data', 'DBDesign'] which is not the result I need (namely T:\Data\DBDesign). Any
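The attempt above prints a list because split returns a list and str() only formats it. A short Python 3 sketch of two standard-library alternatives follows. It uses ntpath and PureWindowsPath rather than os.path so the Windows path is handled the same way on any operating system. The variable names are illustrative.

```python
import ntpath
from pathlib import PureWindowsPath

exist_gdb_path = r"T:\Data\DBDesign\DBDesign_93_v141b.mdb"

# ntpath is the Windows flavour of os.path; dirname drops the last component.
folder = ntpath.dirname(exist_gdb_path)

# pathlib equivalent: .parent is the containing folder.
folder_via_pathlib = str(PureWindowsPath(exist_gdb_path).parent)

print(folder)
print(folder_via_pathlib)
```

On a Windows machine, os.path.dirname behaves like ntpath.dirname, so plain os.path would also work there.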

How to parse the Manifest.mbdb file in an iOS 4.0 iTunes Backup

一世执手 submitted on 2019-11-27 09:59:10
In iOS 4.0 Apple redesigned the backup process. iTunes used to store a list of filenames associated with backup files in the Manifest.plist file, but in iOS 4.0 it moved this information to a Manifest.mbdb file. You can see an example of this file by making a backup of an iOS 4.0 device and looking in your ~/Library/Application Support/MobileSync/Backup folder (look inside the subfolder with the most recent date). Here's a screenshot of what the file looks like in a text editor: (source: supercrazyawesome.com) How do I parse this in a Cocoa application so that I can update my (free)

Javascript: extract URLs from string (inc. querystring) and return array

与世无争的帅哥 submitted on 2019-11-27 09:31:01
I know this has been asked a thousand times before (apologies), but after searching SO, Google, etc., I have yet to find a conclusive answer. Basically, I need a JS function that, when passed a string, identifies and extracts all URLs based on a regex and returns an array of everything found, e.g.: function findUrls(searchText){ var regex=???; result= searchText.match(regex); if(result){return result;}else{return false;} } The function should be able to detect and return any potential URLs. I am aware of the inherent difficulties/issues with this (closing parentheses etc.), so I have a feeling the process needs to be:

R: get element by name from a nested list

怎甘沉沦 submitted on 2019-11-27 08:07:01
Question: I have a nested list like so: smth <- list() smth$a <- list(a1=1, a2=2, a3=3) smth$b <- list(b1=4, b2=5, b3=6) smth$c <- "C" The names of all elements in the list are unique. I would like to get an element from such a list purely by name, without knowing where it is located. Example: getByName(smth, "c") = "C" getByName(smth, "b2") = 5 Also, I don't really want to use unlist, since the real list has a lot of heavy elements in it. Answer 1: The best solution so far is the following: rmatch <-

Extract files from zip without keeping the structure using python ZipFile?

允我心安 submitted on 2019-11-27 07:45:59
I am trying to extract all files from a .zip that contains subfolders into a single folder. I want all the files from the subfolders extracted into one folder, without keeping the original structure. At the moment, I extract everything, move the files to one folder, then remove the leftover subfolders. Files with the same name are overwritten. Is it possible to do this before writing the files? Here is an example structure: my_zip/file1.txt my_zip/dir1/file2.txt my_zip/dir1/dir2/file3.txt my_zip/dir3/file4.txt In the end I want this: my_dir/file1.txt my_dir/file2.txt my_dir/file3.txt my_dir/file4.txt What can I add to this code?
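The asker's code is cut off here, so the following is only a sketch of one possible approach with the standard zipfile module, not a completion of their code. It reads each member and writes it under its base name, so the subfolders are never created. The helper name extract_flat and the demo archive are illustrative. As in the question, files with the same base name overwrite each other; renaming duplicates would need extra logic.

```python
import os
import shutil
import tempfile
import zipfile


def extract_flat(zip_path, dest_dir):
    """Extract every file in the archive directly into dest_dir, ignoring subfolders."""
    os.makedirs(dest_dir, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            name = os.path.basename(info.filename)
            if not name:
                continue  # nothing left after stripping the path
            target = os.path.join(dest_dir, name)
            # Stream the member to disk; a later duplicate name overwrites an earlier one.
            with archive.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(name)
    return written


# Demo: build a small archive that mirrors the structure in the question.
work = tempfile.mkdtemp()
zip_path = os.path.join(work, "my_zip.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    for member in ("file1.txt", "dir1/file2.txt", "dir1/dir2/file3.txt", "dir3/file4.txt"):
        archive.writestr(member, "demo")

extracted = sorted(extract_flat(zip_path, os.path.join(work, "my_dir")))
print(extracted)
```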