text-processing | 易学教程

Delete Chars in Python

Question: Does anybody know how to delete all characters after a specific character? For example, turning http://google.com/translate_t into http://google.com.

Answer 1: If you are asking about an arbitrary string rather than a URL, you could use rpartition:

>>> astring = "http://google.com/translate_t"
>>> astring.rpartition('/')[0]
'http://google.com'

Answer 2: For URLs, use urlparse (a Python 2 module; in Python 3 the same functions live in urllib.parse):

>>> import urlparse
>>> parts = urlparse.urlsplit('http://google.com/path/to/resource?query=spam#anchor')
>>> parts
('http', 'google.com', '/path/to
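
The rpartition approach from the first answer can be sketched in Python 3 as follows. The helper name `strip_after_last` is made up for illustration, and the fallback for a missing separator is an assumption about the desired behavior rather than something the answer specifies.

```python
def strip_after_last(text, sep):
    """Return the part of text before the last occurrence of sep.

    If sep does not occur, text is returned unchanged. A bare
    text.rpartition(sep)[0] would return '' in that case.
    """
    head, found, _tail = text.rpartition(sep)
    return head if found else text


print(strip_after_last("http://google.com/translate_t", "/"))
```

Note that this works on plain strings only: applied to "http://google.com", which has no path, it would cut at the second slash of "//" and return "http:/", which is one reason to prefer a URL parser for URLs.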

Perl add <a></a> around words within an HTML/XML tag

I have a file formatted like this one:

Eye color Eye color, color blue, cornflower blue, steely blue velvet brown <link rel="stylesheet" href="a.css"> </> weasel weasel musteline <link rel="stylesheet" href="a.css"> </>

Each word in the comma-separated list should be wrapped in an <a> tag, like this:

Eye color Eye color, color <a href="entry://blue">blue</a>, <a href="entry://cornflower blue">cornflower blue</a>, <a href="entry://steely blue">steely blue<

Delete Chars in Python

Does anybody know how to delete all characters after a specific character? For example, turning http://google.com/translate_t into http://google.com.

SilentGhost

If you are asking about an arbitrary string rather than a URL, you could use rpartition:

>>> astring = "http://google.com/translate_t"
>>> astring.rpartition('/')[0]
'http://google.com'

For URLs, use urlparse:

>>> import urlparse
>>> parts = urlparse.urlsplit('http://google.com/path/to/resource?query=spam#anchor')
>>> parts
('http', 'google.com', '/path/to/resource', 'query=spam', 'anchor')
>>> urlparse.urlunsplit((parts[0], parts[1], '', '', ''))
'http:/
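
The urlparse module shown above is Python 2 only; in Python 3 the same functions are available from urllib.parse. A minimal Python 3 sketch of the same idea, with the helper name `scheme_and_host` invented for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def scheme_and_host(url):
    """Rebuild a URL from its scheme and network location only."""
    parts = urlsplit(url)
    # Blank out the path, query and fragment components.
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))


print(scheme_and_host("http://google.com/path/to/resource?query=spam#anchor"))
```

Unlike plain string splitting, this keeps working when the URL has no path at all, because the parser separates the components before anything is dropped.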

Randomizing text between delimiters

Question: I have this simple input:

I have {red;green;orange} fruit and cup of {tea;coffee;juice}

I use Perl to identify the patterns between the two external brace delimiters { and }, and to pick one of the fields inside at random, where the fields are separated by the internal delimiter ;. I get output like this:

I have green fruit and cup of coffee

This is my working Perl script:

perl -plE 's!\{(.*?)\}!@x=split/;/,$1;$x[rand@x]!ge' <<< 'I have {red;green;orange} fruit and cup of {tea;coffee;juice}'

My task is to process this input format:

I have { {red
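
For comparison, the flat (non-nested) case handled by the Perl one-liner can be sketched in Python. This is an illustration only: the function name `pick_random_fields` is hypothetical, and the nested format the question goes on to describe is truncated in the source, so it is not handled here.

```python
import random
import re

# Non-greedy match of everything between one pair of braces.
GROUP = re.compile(r"\{(.*?)\}")


def pick_random_fields(text):
    """Replace each {a;b;c} group with one randomly chosen field."""
    return GROUP.sub(lambda m: random.choice(m.group(1).split(";")), text)


line = "I have {red;green;orange} fruit and cup of {tea;coffee;juice}"
print(pick_random_fields(line))
```

Each run prints one random combination, for example a colour from the first group and a drink from the second.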

Impossible to see results of `RTextTools::toLower()` text in Document-Term-Matrix

Question: I am trying to create a document-term matrix, and I would like the text to be converted to lower case first. For this I use the following R instruction:

matrix = create_matrix(tweets[,1], toLower = TRUE, language="english", removeStopwords=FALSE, removeNumbers=TRUE, stemWords=TRUE)

Here is the R code:

library(RTextTools)
library(e1071)
pos_tweets = rbind(
  c('j AIME la voiture', 'positive'),
  c('cette machine est performante', 'positive'),
  c('je me sens en bonne forme ce matin', 'positive'),
  c('je suis super excitée d aller voir le spectacle de

How to delete parts of a file in python?

Question: I have a file named a.txt which looks like this:

I'm the first line
I'm the second line.
There may be more lines here.

I'm below an empty line.
I'm a line.
More lines here.

Now I want to remove the contents above the empty line (including the empty line itself). How could I do this in a Pythonic way?

Answer 1: Basically, you can't delete content from the beginning of a file, so you will have to write to a new file. I think the Pythonic way looks like this:

# get an iterator over the lines in the file:
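
The answer's code is cut off in this excerpt, so the following is a separate minimal sketch of one way to do it, not a reconstruction of the original. It uses itertools.dropwhile to skip everything up to the first empty line and writes the rest to a new file; the function names are invented for illustration.

```python
import itertools


def lines_after_first_blank(lines):
    """Return an iterator over the lines that follow the first blank line."""
    rest = itertools.dropwhile(lambda line: line.strip() != "", lines)
    next(rest, None)  # skip the blank line itself, if one was found
    return rest


def strip_header(src_path, dst_path):
    """Copy src_path to dst_path without the part above the first blank line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(lines_after_first_blank(src))


sample = ["I'm the first line\n", "I'm the second line.\n", "\n",
          "I'm below an empty line.\n", "I'm a line.\n"]
print("".join(lines_after_first_blank(sample)), end="")
```

If the input contains no empty line at all, this sketch yields nothing, so the output file would be empty; whether that is the right behavior depends on the use case.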

Extracting info from large structured text files

Question: I need to read some large files (from 50k to 100k lines), structured in groups separated by empty lines. Each group starts with the same pattern, "No.999999999 dd/mm/yyyy ZZZ". Here is some sample data:

No.813829461 16/09/1987 270
Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA)
C.N.P.J./C.I.C./N INPI : 16404287000155
Procurador: MARCELLO DO NASCIMENTO

No.815326777 28/12/1989 351
Tit.SIGLA SISTEMA GLOBO DE GRAVACOES AUDIO VISUAIS LTDA (BR/RJ)
C.N.P.J./C.I.C./NºINPI : 34162651000108
Apres.: Nominativa ;
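
Reading such a file group by group can be sketched in Python as below. The generator works lazily, so it can be fed an open file object without loading 100k lines at once. The names `iter_groups` and `HEADER`, the regular expression, and the line breaks in the sample are assumptions based on the pattern described in the question.

```python
import re

# Assumed header shape: "No." + 9 digits, a dd/mm/yyyy date, then a code.
HEADER = re.compile(r"^No\.(\d{9})\s+(\d{2}/\d{2}/\d{4})\s+(\S+)")


def iter_groups(lines):
    """Yield lists of non-empty lines, one list per blank-line-separated group."""
    group = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            group.append(line)
        elif group:
            yield group
            group = []
    if group:  # last group when the input does not end with a blank line
        yield group


sample = [
    "No.813829461 16/09/1987 270\n",
    "Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA)\n",
    "Procurador: MARCELLO DO NASCIMENTO\n",
    "\n",
    "No.815326777 28/12/1989 351\n",
    "Tit.SIGLA SISTEMA GLOBO DE GRAVACOES AUDIO VISUAIS LTDA (BR/RJ)\n",
]

for group in iter_groups(sample):
    match = HEADER.match(group[0])
    if match:
        number, date, code = match.groups()
        print(number, date, code, len(group) - 1, "detail lines")
```

With a real file, `iter_groups(open_file)` can replace `iter_groups(sample)`; the remaining lines of each group can then be parsed by their prefixes.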

Split a text file in PHP

Question: How can I split a large text file into separate files by character count using PHP? For example, a 10,000-character file split every 1,000 characters would produce 10 files. Further, can I split only after a full stop is found? Thanks.

UPDATE 1: I like zombat's code; I removed some errors and came up with the following, but does anyone know how to split only after a full stop?

$i = 1;
$fp = fopen("test.txt", "r");
while(! feof($fp)) {
$contents = fread($fp,1000);
file_put_contents('new_file
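
The "split only after a full stop" part can be illustrated with a short sketch. It is written in Python rather than PHP, purely to show the logic: each chunk runs to the first full stop at or beyond the size limit. The function name `split_after_full_stop` is hypothetical, and the sketch assumes the whole text fits in memory.

```python
def split_after_full_stop(text, size):
    """Split text into chunks of at least `size` characters, each ending
    right after the first full stop at or beyond the size limit.
    The final chunk holds whatever remains."""
    if size < 1:
        raise ValueError("size must be at least 1")
    chunks = []
    start = 0
    while start < len(text):
        # Look for a full stop from the last character of a size-long chunk.
        stop = text.find(".", start + size - 1)
        end = len(text) if stop == -1 else stop + 1
        chunks.append(text[start:end])
        start = end
    return chunks


for number, chunk in enumerate(split_after_full_stop("aaaa. bbbbbbbb. cc.", 10), start=1):
    print(number, repr(chunk))
```

Chunks can therefore be longer than the limit when no full stop falls exactly on it; writing each chunk to its own numbered file is then a simple loop over the result.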

Does an empty string contain an empty string in C++?

I just had an interesting argument in the comments on one of my questions. My opponent claims that the statement "" does not contain "" is wrong. My reasoning is that if "" contained another "", that one would also contain "", and so on. Who is wrong?

P.S. I am talking about a std::string.

P.P.S. I was not talking about substrings, but even if I add "as a substring" to my question, it still makes no sense. An empty substring is nonsense. If you allow empty substrings to be contained in strings, that means you have an infinity of empty substrings. What is the point of that?

Edit: Am I the only

Extract text between two strings repeatedly using sed or awk? [duplicate]

This question already has an answer here: How to use sed/grep to extract text between two words? (11 answers)

I have a file called 'plainlinks' that looks like this:

13080. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94092-2012.gz
13081. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94094-2012.gz
13082. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94096-2012.gz
13083. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94097-2012.gz
13084. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94098-2012.gz
13085. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94644-2012.gz
13086. ftp://ftp3.ncdc.noaa.gov