text-processing

Python: removing extra special Unicode characters

 ̄綄美尐妖づ submitted on 2019-12-06 07:19:37
I'm working with some text in Python. It's already Unicode internally, but I would like to get rid of some special characters and replace them with more standard versions. I currently have a line that looks like this, but it's getting ever more complex and I can see it will eventually cause trouble:

```python
tmp = infile.lower().replace(u"\u2018", "'").replace(u"\u2019", "'").replace(u"\u2013", "").replace(u"\u2026", "")
```

For example, \u2018 and \u2019 are the left and right single quotation marks. Those are somewhat acceptable, but for this type of text processing I don't think they are needed. Things…
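One common way to keep a growing list of substitutions manageable is a single translation table. A minimal Python sketch of that idea, using only the characters and replacements already named in the question:

```python
# Build one translation table instead of chaining .replace() calls.
SUBSTITUTIONS = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2013": "",   # en dash, dropped as in the question
    "\u2026": "",   # horizontal ellipsis, dropped as in the question
}
TABLE = str.maketrans(SUBSTITUTIONS)

def clean(text):
    return text.lower().translate(TABLE)

print(clean("\u2018Hello\u2019"))  # 'hello'
```

New mappings can then be added in one place as further offenders turn up.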

Replace a long list of words in a big text file

怎甘沉沦 submitted on 2019-12-06 04:54:53
Question: I need a fast method for working with a big text file. I have two files: a big text file (~20 GB) and another text file containing a list of ~12 million combo words. I want to find every combo word in the first file and replace it with another combo word (the same words joined with an underscore), for example "Computer Information" > replace with > "Computer_Information". I use this code, but performance is very poor (I tested on an HP G7 server with 16 GB RAM and 16 cores): `public partial class Form1 : Form { HashSet<string> …`
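Not the asker's C#, but a rough Python sketch of the streaming approach: keep only the phrase list in memory and make a single pass over the big file. The file names and the two-words-per-phrase assumption are illustrative.

```python
def load_phrases(path):
    """One combo phrase per line, e.g. 'Computer Information'."""
    with open(path, encoding="utf-8") as f:
        return {tuple(line.split()) for line in f if line.strip()}

def rewrite_line(line, phrases):
    words = line.split()
    out, i = [], 0
    while i < len(words):
        # Set membership is O(1), so the pass stays linear overall.
        if i + 1 < len(words) and (words[i], words[i + 1]) in phrases:
            out.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

phrases = load_phrases("combos.txt")
with open("big.txt", encoding="utf-8") as src, \
     open("big.out.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(rewrite_line(line, phrases) + "\n")
```

Twelve million two-word tuples fit in a few gigabytes of RAM, and the 20 GB file is never held in memory at once.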

bash routine to return the page number of a given line number from a text file

梦想的初衷 submitted on 2019-12-05 21:41:34
Consider a plain text file containing the page-breaking ASCII control character "Form Feed" ($'\f'):

```
alpha\n beta\n gamma\n\f
one\n two\n three\n four\n five\n\f
earth\n wind\n fire\n water\n\f
```

Note that each page has a random number of lines. I need a bash routine that returns the page number of a given line number in such a file. After a long time researching a solution, I finally came across this piece of code:

```bash
function get_page_from_line {
    local nline="$1"
    local input_file="$2"
    local npag=0
    local ln=0
    local total=0
    while IFS= read -d $'\f' -r …
```
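For comparison, the same idea as a short Python sketch (file path and target line are hypothetical arguments): walk the file character by character and count the form feeds seen before the target line.

```python
def page_of_line(path, target):
    """Return the 1-based page containing the 1-based line `target`."""
    page, line = 1, 1
    with open(path, encoding="utf-8") as f:
        for ch in f.read():
            if ch == "\f":
                page += 1       # a form feed starts the next page
            elif ch == "\n":
                line += 1
            elif line == target:
                return page     # first printable character of the line
    return None                 # file has fewer than `target` lines

print(page_of_line("pages.txt", 7))  # with the sample above: 2
```

(A target line containing no printable characters would need an extra check.)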

Optimizing MySQL Import (Converting a Verbose SQL Dump to a Speedy One / use extended-inserts)

你。 submitted on 2019-12-05 21:21:45
We are using mysqldump with the options --complete-insert --skip-extended-insert to create database dumps that are kept in VCS. We use these options (and the VCS) so that we can easily compare different database versions. Importing such a dump takes quite a while, because there is, of course, a single INSERT per database row. Is there an easy way to convert such a verbose dump into one with a single INSERT per table? Does anyone maybe already have a script at hand? I wrote a little Python script that converts this:

```sql
LOCK TABLES `actor` WRITE;
/*!40000 ALTER TABLE `actor` …
```
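Not the asker's script (which is cut off above), but a rough Python sketch of such a conversion: collapse consecutive single-row INSERTs that share the same table-and-columns prefix into one extended INSERT. It assumes one statement per line, which is what --skip-extended-insert produces.

```python
import re
import sys

# "INSERT INTO `t` (`a`, `b`) VALUES (...);" -> prefix and value list
INSERT_RE = re.compile(r"^(INSERT INTO `[^`]+` (?:\([^)]*\) )?VALUES )\((.*)\);$")

def merge(lines):
    prefix, rows = None, []
    for line in lines:
        m = INSERT_RE.match(line.rstrip("\n"))
        if m and m.group(1) == prefix:
            rows.append(m.group(2))     # same table: accumulate the row
            continue
        if rows:                        # flush the finished group
            yield prefix + ",\n".join(f"({r})" for r in rows) + ";\n"
            prefix, rows = None, []
        if m:
            prefix, rows = m.group(1), [m.group(2)]
        else:
            yield line                  # pass non-INSERT lines through
    if rows:
        yield prefix + ",\n".join(f"({r})" for r in rows) + ";\n"

if __name__ == "__main__":
    sys.stdout.writelines(merge(sys.stdin))
```

Run as `python merge_inserts.py < verbose.sql > fast.sql` (script name illustrative). Values that span multiple lines would need a smarter parser.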

How to extract a single function from a source file

筅森魡賤 submitted on 2019-12-05 18:44:15
I'm working on a small academic research project about extremely long and complicated functions in the Linux kernel. I'm trying to figure out whether there is a good reason to write 600- or 800-line functions. For that purpose, I would like to find a tool that can extract a function from a .c file, so I can run some automated tests on the function. For example, if I have the function cifs_parse_mount_options() within the file connect.c, I'm seeking a solution that would roughly work like:

```
extract /fs/cifs/connect.c cifs_parse_mount_options
```

and return the 523 lines of code(!) of the function, from…
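As a starting point, a naive brace-counting sketch in Python. It is not a real C parser (braces inside strings or comments will fool it), and the column-0 heuristic for spotting the definition is an assumption about kernel coding style:

```python
import sys

def extract_function(path, name):
    """Return `name`'s definition from a C file by counting braces."""
    lines, depth, found, opened = [], 0, False, False
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            # Heuristic: a definition starts in column 0; calls are indented.
            if not found and name + "(" in line and line[:1].strip():
                found = True
            if found:
                lines.append(line)
                depth += line.count("{") - line.count("}")
                opened = opened or "{" in line
                if opened and depth == 0:
                    break               # matching closing brace reached
    return "".join(lines)

if __name__ == "__main__":
    # e.g.: python extract.py fs/cifs/connect.c cifs_parse_mount_options
    sys.stdout.write(extract_function(sys.argv[1], sys.argv[2]))
```

For serious use, locating the definition with a real tool such as ctags first would avoid most false matches.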

Extracting the body text of an HTML document using PHP

梦想与她 submitted on 2019-12-05 15:41:21
I know it's better to use DOM for this purpose, but let's try to extract the text this way:

```php
<?php
$html = <<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;

preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);
if (empty($matches)) exit;

$matched_body_start_tag = $matches[0][0];
$index_of_body_start_tag = $matches[0][1];
$index_of_body_end_tag = strpos($html, '</body>');

$body = substr(
    $html,
    $index_of_body_start_tag + strlen($matched_body_start_tag),
    $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
);

echo $body;
```

The result…
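For comparison, a minimal Python sketch of the same extract-between-tags idea; note that the extracted span runs from the end of the opening tag, so the length corresponds to the end index minus the whole (start index + tag length) term taken as a unit:

```python
import re

html = "<html><head></head><body><p>Some text</p></body></html>"

m = re.search(r"<body.*?>", html)
if m:
    start = m.end()            # index just past the matched <body...> tag
    end = html.index("</body>")
    print(html[start:end])     # <p>Some text</p>
```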

How to remove OCR artifacts from text?

本秂侑毒 submitted on 2019-12-05 13:32:24
OCR-generated texts sometimes come with artifacts, such as this one:

```
Diese grundsätzliche V e r b o r g e n h e i t Gottes, die sich n u r dem
N a c h f o l g e r ö f f n e t , ist m i t d e m Messiasgeheimnis gemeint
```

While it is not unusual for spacing between letters to be used as emphasis (probably due to early printing-press limitations), it is unfavorable for retrieval tasks. How can one turn the above text into a more, say, canonical form, like:

```
Diese grundsätzliche Verborgenheit Gottes, die sich nur dem Nachfolger
öffnet, ist mit dem Messiasgeheimnis gemeint
```

Can this be done…
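One possible heuristic, sketched in Python: collapse any run of single-letter tokens back into a single word. As "N a c h f o l g e r ö f f n e t" shows (it is two words, "Nachfolger öffnet"), a dictionary check would still be needed to decide where a joined run must be split again.

```python
import re

def unspace(text):
    # A run of three or more word characters separated by single spaces.
    return re.sub(
        r"\b(?:\w ){2,}\w\b",
        lambda m: m.group(0).replace(" ", ""),
        text,
    )

print(unspace("Diese grundsätzliche V e r b o r g e n h e i t Gottes"))
# Diese grundsätzliche Verborgenheit Gottes
```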

Apache Tika and character limit when parsing documents

元气小坏坏 submitted on 2019-12-05 12:35:06
Question: Could anybody please help me sort this out? When using Tika directly, it can be done like this:

```java
Tika tika = new Tika();
tika.setMaxStringLength(10 * 1024 * 1024);
```

But what if you don't use Tika directly, like this:

```java
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();
ParseContext ps = new ParseContext();
for (InputStream is : getInputStreams()) {
    parser.parse(is, textHandler, metadata, ps);
    is.close();
    System.out.println("Title: " + metadata.get(…
```

Exploding UpperCasedCamelCase to Upper Cased Camel Case in PHP

一笑奈何 submitted on 2019-12-05 11:33:49
Right now, I am implementing this with a split, slice, and implosion:

```php
$exploded = implode(' ', array_slice(preg_split('/(?=[A-Z])/', 'ThisIsATest'), 1));
// $exploded = "This Is A Test"
```

Prettier version:

```php
$capital_split = preg_split('/(?=[A-Z])/', 'ThisIsATest');
$blank_first_ignored = array_slice($capital_split, 1);
$exploded = implode(' ', $blank_first_ignored);
```

However, the problem is when you have input like 'SometimesPDFFilesHappen', which my implementation would (incorrectly) interpret as 'Sometimes P D F Files Happen'. How can I (simply) get my script to condense 'P D F' to 'PDF'? My…
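The usual fix is to split before an uppercase letter only when it actually begins a new word: either it follows a lowercase letter, or it starts an Upper+lower pair at the end of a run of capitals. A sketch of that lookaround approach in Python (the same pattern should port to preg_replace):

```python
import re

def explode(name):
    # Boundary 1: lowercase followed by uppercase    -> This|Is
    # Boundary 2: uppercase followed by Upper+lower  -> PDF|Files
    pattern = r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])"
    return re.sub(pattern, " ", name)

print(explode("ThisIsATest"))              # This Is A Test
print(explode("SometimesPDFFilesHappen"))  # Sometimes PDF Files Happen
```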

How to remove YAML frontmatter from markdown files?

为君一笑 submitted on 2019-12-05 09:23:34
I have markdown files that contain YAML frontmatter metadata, like this:

```
---
title: Something Somethingelse
author: Somebody Sometheson
---
```

But the YAML is of varying widths. Can I use a POSIX command like sed to remove that frontmatter when it's at the beginning of a file? Something that just removes everything between --- and ---, inclusive, but also ignores the rest of the file, in case there are ---s elsewhere.

Wintermute: I understand your question to mean that you want to remove the first ----enclosed block if it starts at the first line. In that case:

```
sed '1 { /^---/ { :a N; /\n---/! ba; d; }; }'
```
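For comparison, the same logic as a small Python sketch: drop the first ----delimited block only when it starts on line one, and leave any later --- lines alone.

```python
def strip_frontmatter(text):
    lines = text.splitlines(keepends=True)
    if lines and lines[0].rstrip() == "---":
        for i, line in enumerate(lines[1:], start=1):
            if line.rstrip() == "---":
                return "".join(lines[i + 1:])  # drop the block, inclusive
    return text                                # no frontmatter: unchanged

doc = "---\ntitle: Something\n---\nBody text\n---\nmore\n"
print(strip_frontmatter(doc), end="")  # Body text\n---\nmore\n
```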