extraction

How should I extract compressed folders in Java?

孤街浪徒 submitted on 2019-12-22 04:12:47
Question: I am using the following code to extract a zip file in Java.

```java
import java.io.*;
import java.util.zip.*;

class testZipFiles {
    public static void main(String[] args) {
        try {
            String filename = "C:\\zip\\includes.zip";
            testZipFiles list = new testZipFiles();
            list.getZipFiles(filename);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void getZipFiles(String filename) {
        try {
            String destinationname = "c:\\zip\\";
            byte[] buf = new byte[1024];
            ZipInputStream zipinputstream = null;
            ZipEntry
```

What Is The Best Python Zip Module To Handle Large Files?

笑着哭i submitted on 2019-12-21 10:03:13
Question: EDIT: Specifically compression and extraction speeds. Any suggestions? Thanks

Answer 1: So I made a random-ish large zipfile:

```shell
$ ls -l *zip
-rw-r--r-- 1 aleax 5000 115749854 Nov 18 19:16 large.zip
$ unzip -l large.zip | wc
  23396   93633 2254735
```

i.e., 116 MB with 23.4K files in it, and timed things:

```shell
$ time unzip -d /tmp large.zip >/dev/null
real    0m14.702s
user    0m2.586s
sys     0m5.408s
```

this is the system-supplied command-line unzip binary -- no doubt as finely-tuned and optimized as a pure C executable can
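The same kind of measurement can be done against Python's standard-library `zipfile` module. A minimal, self-contained sketch (the archive here is a small throwaway one built on the fly, not the 116 MB file from the answer):

```python
import os
import tempfile
import time
import zipfile

# Build a small throwaway archive to time against (file count and
# contents are arbitrary test values, not the answer's 23.4K files).
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "large.zip")
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    for i in range(100):
        zf.writestr(f"file_{i}.txt", "some compressible text " * 50)

# Time extraction of every member, analogous to `unzip -d`.
dest = os.path.join(workdir, "out")
start = time.perf_counter()
with zipfile.ZipFile(archive) as zf:
    zf.extractall(dest)
elapsed = time.perf_counter() - start

print(f"extracted {len(os.listdir(dest))} files in {elapsed:.3f}s")
```

Scaling the file count up gives a rough comparison against the command-line `unzip` timings above on the same machine.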

How to extract data from a PDF?

半腔热情 submitted on 2019-12-21 02:04:06
Question: My company receives data from an external company via Excel. We export this into SQL Server to run reports on the data. They are now changing to PDF format; is there a way to reliably port the data from the PDF and insert it into our SQL Server 2008 database? Would this require writing an app, or is there an automated way of doing this?

Answer 1: It all depends on how they've included the data within the PDF. Generally speaking, there are two possible scenarios here: The data is just a text object

Programmatically extract keywords from domain names

梦想的初衷 submitted on 2019-12-20 09:24:33
Question: Let's say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don't see a particularly easy way to "extract" the keywords used in the domain. Yet I see it done on sites such as DomainTools.com, Estibot.com, etc. For example:

ilikecheese.com becomes "i like cheese"
sanfranciscohotels.com becomes "san francisco hotels"
...

Any suggestions for accomplishing this efficiently and effectively? Edit: I'd like to write this in PHP.

Answer 1: Ok, I ran the
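Splitting an unhyphenated domain into words is the classic word-break problem: given a dictionary of known words, dynamic programming recovers a valid split if one exists. A minimal sketch in Python (the asker wanted PHP, but the algorithm translates directly; the tiny word set here is a stand-in for a real dictionary file):

```python
def segment(name, words):
    """Split `name` into dictionary words via dynamic programming.

    best[i] holds a list of words covering name[:i], or None if no
    split of that prefix exists.
    """
    best = [None] * (len(name) + 1)
    best[0] = []
    for i in range(1, len(name) + 1):
        for j in range(i):
            if best[j] is not None and name[j:i] in words:
                best[i] = best[j] + [name[j:i]]
                break
    return best[len(name)]

# Toy dictionary -- a real implementation would load a full word list.
words = {"i", "like", "cheese", "san", "francisco", "hotels"}
print(segment("ilikecheese", words))         # ['i', 'like', 'cheese']
print(segment("sanfranciscohotels", words))  # ['san', 'francisco', 'hotels']
```

With a full dictionary many domains have several valid splits, so production tools typically rank candidates by word frequency rather than returning the first split found.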

R, tm-error of transformation drops documents

陌路散爱 submitted on 2019-12-18 23:02:29
Question: I want to create a network based on the weight of keywords from text. Then I got an error when running the code related to tm_map:

```r
library(tm)
library(NLP)
library(openNLP)
text = c('.......')
corp <- Corpus(VectorSource(text))
corp <- tm_map(corp, stripWhitespace)
# Warning message:
# In tm_map.SimpleCorpus(corp, stripWhitespace) : transformation drops documents
corp <- tm_map(corp, tolower)
# Warning message:
# In tm_map.SimpleCorpus(corp, tolower) : transformation drops documents
```

The codes were

PDF table extraction

坚强是说给别人听的谎言 submitted on 2019-12-18 15:28:16
Question: I have the same data saved as a GIF image file and as a PDF file, and I want to parse it to HTML or XML. The data is actually the menu for my university's cafeteria, which means a new version of the file has to be parsed each week! In general, the files contain some header and footer text, as well as a table full of other data in between. I have read some posts on Stack Overflow and I also started some attempts to parse out the table data as HTML/XML: PDF PDFBox || iText

Java: Apache POI: Can I get clean text from MS Word (.doc) files?

对着背影说爱祢 submitted on 2019-12-18 13:05:22
Question: The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can see when I open the files in MS Word. When using the following code:

```java
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());
```

the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like

Extract string from between quotations

▼魔方 西西 submitted on 2019-12-18 11:03:30
Question: I want to extract information from user-inputted text. Imagine I input the following:

SetVariables "a" "b" "c"

How would I extract the information between the first set of quotation marks? Then the second? Then the third?

Answer 1:

```python
>>> import re
>>> re.findall('"([^"]*)"', 'SetVariables "a" "b" "c" ')
['a', 'b', 'c']
```

Answer 2: You could do a string.split() on it. If the string is formatted properly with the quotation marks (i.e. an even number of quotation marks), every odd value in the list will contain an
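Answer 2's split-based idea can be completed in a couple of lines: splitting on the quote character leaves the quoted values at the odd indices of the resulting list. A small sketch using the question's sample input:

```python
# Split on the quote character: pieces at odd indices are the quoted
# values, pieces at even indices are the text between/around them.
text = 'SetVariables "a" "b" "c"'
parts = text.split('"')
values = parts[1::2]
print(values)  # ['a', 'b', 'c']
```

This assumes well-formed input (an even number of quotation marks, and no escaped quotes inside the values); the regex approach in Answer 1 fails just as gracefully on unbalanced quotes.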