extraction

How should I extract compressed folders in Java?

孤街浪徒 submitted on 2019-12-22 04:12:47
Question: I am using the following code to extract a zip file in Java.

```java
import java.io.*;
import java.util.zip.*;

class testZipFiles {
    public static void main(String[] args) {
        try {
            String filename = "C:\\zip\\includes.zip";
            testZipFiles list = new testZipFiles();
            list.getZipFiles(filename);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void getZipFiles(String filename) {
        try {
            String destinationname = "c:\\zip\\";
            byte[] buf = new byte[1024];
            ZipInputStream zipinputstream = null;
            ZipEntry
```

What Is The Best Python Zip Module To Handle Large Files?

笑着哭i submitted on 2019-12-21 10:03:13
Question: EDIT: Specifically compression and extraction speeds. Any suggestions? Thanks

Answer 1: So I made a random-ish large zipfile:

```shell
$ ls -l *zip
-rw-r--r-- 1 aleax 5000 115749854 Nov 18 19:16 large.zip
$ unzip -l large.zip | wc
  23396   93633 2254735
```

i.e., 116 MB with 23.4K files in it, and timed things:

```shell
$ time unzip -d /tmp large.zip >/dev/null
real    0m14.702s
user    0m2.586s
sys     0m5.408s
```

this is the system-supplied command-line unzip binary -- no doubt as finely-tuned and optimized as a pure C executable can
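The same kind of measurement can be done against Python's standard-library `zipfile` module. A minimal, self-contained sketch (the archive here is a small throwaway one built on the fly, not the 116 MB file from the answer):

```python
import os
import tempfile
import time
import zipfile

# Build a small throwaway archive to time against (file count and
# contents are arbitrary test values, not the answer's 23.4K files).
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "large.zip")
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    for i in range(100):
        zf.writestr(f"file_{i}.txt", "some compressible text " * 50)

# Time extraction of every member, analogous to `unzip -d`.
dest = os.path.join(workdir, "out")
start = time.perf_counter()
with zipfile.ZipFile(archive) as zf:
    zf.extractall(dest)
elapsed = time.perf_counter() - start

print(f"extracted {len(os.listdir(dest))} files in {elapsed:.3f}s")
```

Scaling the file count up gives a rough comparison against the command-line `unzip` timings above on the same machine.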

How to extract data from a PDF?

半腔热情 submitted on 2019-12-21 02:04:06
Question: My company receives data from an external company via Excel. We export this into SQL Server to run reports on the data. They are now changing to PDF format; is there a way to reliably port the data from the PDF and insert it into our SQL Server 2008 database? Would this require writing an app, or is there an automated way of doing this?

Answer 1: It all depends on how they've included the data within the PDF. Generally speaking, there are two possible scenarios here: The data is just a text object

Programmatically extract keywords from domain names

梦想的初衷 submitted on 2019-12-20 09:24:33
Question: Let's say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don't see a particularly easy way to "extract" the keywords used in the domain. Yet I see it done on sites such as DomainTools.com, Estibot.com, etc. For example:

ilikecheese.com becomes "i like cheese"
sanfranciscohotels.com becomes "san francisco hotels"
...

Any suggestions for accomplishing this efficiently and effectively? Edit: I'd like to write this in PHP.

Answer 1: Ok, I ran the
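Splitting an unhyphenated domain into words is the classic word-break problem: given a dictionary of known words, dynamic programming recovers a valid split if one exists. A minimal sketch in Python (the asker wanted PHP, but the algorithm translates directly; the tiny word set here is a stand-in for a real dictionary file):

```python
def segment(name, words):
    """Split `name` into dictionary words via dynamic programming.

    best[i] holds a list of words covering name[:i], or None if no
    split of that prefix exists.
    """
    best = [None] * (len(name) + 1)
    best[0] = []
    for i in range(1, len(name) + 1):
        for j in range(i):
            if best[j] is not None and name[j:i] in words:
                best[i] = best[j] + [name[j:i]]
                break
    return best[len(name)]

# Toy dictionary -- a real implementation would load a full word list.
words = {"i", "like", "cheese", "san", "francisco", "hotels"}
print(segment("ilikecheese", words))         # ['i', 'like', 'cheese']
print(segment("sanfranciscohotels", words))  # ['san', 'francisco', 'hotels']
```

With a full dictionary many domains have several valid splits, so production tools typically rank candidates by word frequency rather than returning the first split found.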

R, tm-error of transformation drops documents

陌路散爱 submitted on 2019-12-18 23:02:29
Question: I want to create a network based on the weight of keywords from text. Then I got an error when running the code related to tm_map:

```r
library(tm)
library(NLP)
library(openNLP)
text = c('.......')
corp <- Corpus(VectorSource(text))
corp <- tm_map(corp, stripWhitespace)
# Warning message:
# In tm_map.SimpleCorpus(corp, stripWhitespace) : transformation drops documents
corp <- tm_map(corp, tolower)
# Warning message:
# In tm_map.SimpleCorpus(corp, tolower) : transformation drops documents
```

The codes were

PDF table extraction

坚强是说给别人听的谎言 submitted on 2019-12-18 15:28:16
Question: I have the same data saved as a GIF image file and as a PDF file, and I want to parse it to HTML or XML. The data is actually the menu for my university's cafeteria, which means a new version of the file has to be parsed each week! In general, the files contain some header and footer text, as well as a table full of other data in between. I have read some posts on Stack Overflow and I also started some attempts to parse out the table data as HTML/XML: PDF PDFBox || iText

Java: Apache POI: Can I get clean text from MS Word (.doc) files?

对着背影说爱祢 submitted on 2019-12-18 13:05:22
Question: The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can see when I open the files in MS Word. When using the following code:

```java
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());
```

the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like

Extract string from between quotations

▼魔方 西西 submitted on 2019-12-18 11:03:30
Question: I want to extract information from user-inputted text. Imagine I input the following:

SetVariables "a" "b" "c"

How would I extract the information between the first set of quotation marks? Then the second? Then the third?

Answer 1:

```python
>>> import re
>>> re.findall('"([^"]*)"', 'SetVariables "a" "b" "c" ')
['a', 'b', 'c']
```

Answer 2: You could do a string.split() on it. If the string is formatted properly with the quotation marks (i.e. an even number of quotation marks), every odd value in the list will contain an
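Answer 2's split-based idea can be completed in a couple of lines: splitting on the quote character leaves the quoted values at the odd indices of the resulting list. A small sketch using the question's sample input:

```python
# Split on the quote character: pieces at odd indices are the quoted
# values, pieces at even indices are the text between/around them.
text = 'SetVariables "a" "b" "c"'
parts = text.split('"')
values = parts[1::2]
print(values)  # ['a', 'b', 'c']
```

This assumes well-formed input (an even number of quotation marks, and no escaped quotes inside the values); the regex approach in Answer 1 fails just as gracefully on unbalanced quotes.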