text-processing | 易学教程

Find regex, move the next line at the end of this line and copy the first 5 columns to the next lines that start with a letter

Question: I have text like this: 37 7 -------------- No aaa 40 0 -------------- No bbb xxx zzy aa bb cc 42 2 -------------- No ccc xxx zyz a b c d 43 3 -------------- No ddd xy zz a a a a c 52 5 -------------- No eee yyyx zzz When I process it with awk I get: awk '{if($1+0==$1) p=$1 FS $2 FS $3 FS $4 FS $5; else $0=p FS $0}1' /tmp/test3 | column -t 37 7 -------------- No aaa 37 7 -------------- No aaa xxx zzz 40 0 -------------- No bbb 40 0 -------------- No bbb xxx zzy 40 0 -------------- No bbb aa bb cc

awk pipeline to extract and validate xml files

Question: How do I extract and validate XML files using awk and xmllint in a pipeline? Awk program that only extracts files: extractxml #!/usr/bin/awk -f /<?xml version/{ getline doctype; getline datadoc; if (match(datadoc,/file="([^-]+)-[^"]+.XML"/,a)) { fn=a[1]".xml"; print $0 ORS doctype ORS datadoc > fn; print a[1]".xml" ; next; }}{ print > fn } The concatenated input XML file: refcase.xml <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]> <data-document lang=

How to migrate from Office documents to documents based on modern web technologies - advice welcome

Question: Currently, all documentation is based on MS Office. This makes it quite challenging to integrate any functionality: the options are either VBA or VSTO. The first is not that comfortable, and the second can be like taking a sledgehammer to crack a nut. Simple things like basic controls, hiding text, or basic maths can easily be done in HTML. So I need an HTML text processor that focuses on content (text) and allows me to add interactivity when I need it. That

Using sed's append/change/insert without a newline

Question: I want to replace my pattern space in sed. I can do this with s/^.*$/hello world/; - but can I do it using the c command somehow - without using line breaks in my sed script? It's not entirely clear to me whether that's possible in any way. (Same question for the a and i commands.) Answer 1: If your shell is bash, here is a convenient way to use c in a one-liner: $ seq 3 | sed $'/2/c\\\nNew Text' 1 New Text 3 This looks for any line containing 2 and changes it to New Text . This uses bash's $'...'

Removing multiple recurring text from pandas rows

Question: I have a pandas DataFrame whose rows are articles scraped from websites. I have 100 thousand articles of a similar nature. Here is a glimpse of my dataset. text 0 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So 1 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So 2 which brings not only warmer weather but also the unsettling realization that

A simple way to remove headers from XML files

Question: I need to remove the non-XML text from a file generated by another program. The file looks something like this: Executing Command - Blah.exe ... -----Command Output----- HTTP/1.1 200 OK Connection: close Content-Type: text/xml <?xml version="1.0"?> <testResults> <finalCounts> <right>7</right> <wrong>4</wrong> <ignores>0</ignores> <exceptions>0</exceptions> </finalCounts> </testResults> Exit-Code: 15 How can I remove the non-XML text easily in Java? Answer 1: // getContent() returns the complete text to strip. //
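The question asks for Java, but the idea does not depend on the language, so the following is a minimal Python sketch and not the answer's own method. It assumes the XML payload begins at the `<?xml` declaration and that the last `>` in the text closes the root element, which holds for the sample above because the trailer is `Exit-Code: 15`. The helper name `extract_xml` is hypothetical.

```python
def extract_xml(text: str) -> str:
    """Return the span from the XML declaration to the last '>' in text.

    Assumption: nothing after the XML payload contains a '>' character.
    """
    start = text.find("<?xml")
    end = text.rfind(">")
    if start == -1 or end < start:
        return ""  # no XML declaration found
    return text[start:end + 1]


# Sample taken from the question above.
sample = (
    "Executing Command - Blah.exe ...\n"
    "-----Command Output-----\n"
    "HTTP/1.1 200 OK\n"
    "Connection: close\n"
    "Content-Type: text/xml\n"
    '<?xml version="1.0"?>\n'
    "<testResults> <finalCounts> <right>7</right> <wrong>4</wrong> "
    "<ignores>0</ignores> <exceptions>0</exceptions> </finalCounts> </testResults>\n"
    "Exit-Code: 15\n"
)
xml = extract_xml(sample)
```

If the trailing text could contain `>`, searching for the closing tag of the known root element would be a safer end marker.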

How to pass a regular expression as a parameter to a perl one-liner in a bash script?

Question: I have this input.txt file: Dog walks in the park Man runs in the park Man walks in the park Dog runs in the park Dog stays still They run in the park Woman runs in the park I want to search for matches of the runs? regular expression and output them to a file, while highlighting each match with two asterisks on both sides. So my desired output is this: Man **runs** in the park Dog **runs** in the park They **run** in the park Woman **runs** in the park What I want to do is to write
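The desired transformation can be sketched in Python instead of a Perl one-liner; this is an illustration of the matching and highlighting logic only, not an answer to the bash quoting issue. The function name `highlight_matches` is hypothetical, and in a real script the pattern would come from a command-line argument.

```python
import re


def highlight_matches(lines, pattern):
    """Yield only the lines that match pattern, with each match wrapped in **."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            # A function replacement avoids any escaping issues in the match text.
            yield regex.sub(lambda m: f"**{m.group(0)}**", line)


lines = [
    "Dog walks in the park",
    "Man runs in the park",
    "Man walks in the park",
    "Dog runs in the park",
    "Dog stays still",
    "They run in the park",
    "Woman runs in the park",
]
result = list(highlight_matches(lines, r"runs?"))
```

Because the pattern is passed as an ordinary string argument, no shell interpolation happens inside the regular expression itself.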

Loading text data in Octave with specific format

Question: I have a data set that I would like to store and be able to load in Octave: 18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu" 15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320" 18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite" 16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst" 17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino" 15.0 8 429.0 198.0 4341. 10.0 70 1 "ford galaxie 500" 14.0 8 454.0 220.0 4354. 9.0 70 1 "chevrolet impala" 14.0 8 440.0 215.0 4312. 8.5 70 1

Given upper case names transform to Proper Case, handling “O'Hara”, “McDonald” “van der Sloot” etc

Question: I am provided a list of names in upper case. For the purpose of a salutation in an email I would like them to be Proper Cased. Easy enough to do using PHP's ucwords. But I feel I need some regex function to handle common exceptions, such as: "O'Hara", "McDonald", "van der Sloot", etc. It's not so much that I need help constructing a regex statement to handle the three examples above (though that would be nice), as it is that I don't know what all the common exceptions might be. Surely
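As a rough illustration of handling the three examples named above, here is a minimal Python sketch (the question itself is about PHP). The particle list and the `Mc` rule are assumptions chosen only to cover those examples, and `proper_case` is a hypothetical name; the sketch does not attempt to enumerate every real-world exception.

```python
import re

# Assumption: a small, illustrative set of particles that stay lower case.
PARTICLES = {"van", "der", "de", "von"}


def proper_case(name: str) -> str:
    """Convert an upper-case name to proper case with a few common exceptions."""
    words = []
    for word in name.lower().split():
        if word in PARTICLES:
            words.append(word)
        elif word.startswith("mc") and len(word) > 2:
            # "mcdonald" -> "McDonald"
            words.append("Mc" + word[2:].capitalize())
        else:
            # Capitalize the first letter and any letter after ' or -.
            words.append(
                re.sub(r"(^|['-])([a-z])",
                       lambda m: m.group(1) + m.group(2).upper(),
                       word)
            )
    return " ".join(words)


examples = [proper_case(n) for n in ("O'HARA", "MCDONALD", "VAN DER SLOOT")]
```

A rule-based approach like this will mis-handle some names (for instance, a person who writes "Van" with a capital), so a per-name override table may still be needed.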

extracting n grams from huge text

Question: For example, we have the following text: "Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. ..." I need every contiguous sequence of words from this text: one word at a time, then two at a time, three at a time, and so on up to five at a time, like this: ones : ['Spark', 'is', 'a', 'framework', 'for', 'writing', 'fast', 'distributed', 'programs', ...] twos : ['Spark is', 'is a', 'a
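The requested word n-grams can be sketched in plain Python as follows. This is a single-machine illustration, not a Spark solution; the tokenizer regex is an assumption (it keeps apostrophes and hyphens inside words and drops other punctuation), and the names `tokenize` and `ngrams` are hypothetical.

```python
import re


def tokenize(text):
    """Split text into words, dropping punctuation such as commas and periods."""
    return re.findall(r"[\w'-]+", text)


def ngrams(words, n):
    """Lazily yield every run of n consecutive words joined by a space."""
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


text = "Spark is a framework for writing fast, distributed programs."
words = tokenize(text)
# One list per n-gram size, from 1 to 5.
grams = {n: list(ngrams(words, n)) for n in range(1, 6)}
```

`ngrams` is a generator, so for large inputs the results can be consumed one at a time instead of being collected into lists; the word list itself still has to fit in memory in this sketch.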