text-processing

Delete Chars in Python

ぃ、小莉子 submitted on 2019-12-02 06:26:06
Question: Does anybody know how to delete all characters after a specific character? For example, turning http://google.com/translate_t into http://google.com.
Answer 1: If you're asking about an arbitrary string and not a URL, you could go with:
>>> astring = "http://google.com/translate_t"
>>> astring.rpartition('/')[0]
'http://google.com'
Answer 2: For URLs, use urlparse:
>>> import urlparse
>>> parts = urlparse.urlsplit('http://google.com/path/to/resource?query=spam#anchor')
>>> parts
('http', 'google.com', '/path/to
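
For the plain-string case in Answer 1, here is a minimal Python 3 sketch of the rpartition idea. The helper name `strip_after_last` is made up for illustration and is not part of the original answer. It also shows why rpartition alone is a poor fit for URLs that have no path.

```python
def strip_after_last(text: str, sep: str) -> str:
    """Return everything before the last occurrence of sep (hypothetical helper)."""
    head, found, _tail = text.rpartition(sep)
    # rpartition returns ('', '', text) when sep is absent, so fall back to the input.
    return head if found else text


print(strip_after_last("http://google.com/translate_t", "/"))  # http://google.com
print(strip_after_last("no-separator-here", "/"))              # no-separator-here
# With no path, the last '/' belongs to '://', so the host is lost:
print(strip_after_last("http://google.com", "/"))              # http:/
```

The last call is the reason a URL parser is usually the safer choice when the input is known to be a URL.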

Perl add <a></a> around words within an HTML/XML tag

杀马特。学长 韩版系。学妹 submitted on 2019-12-02 06:05:34
I have a file formatted like this one:
Eye color
<p class="ul">Eye color, color</p>
<p class="ul1">blue, cornflower blue, steely blue</p>
<p class="ul1">velvet brown</p>
<link rel="stylesheet" href="a.css">
</>
weasel
<p class="ul">weasel</p>
<p class="ul1">musteline</p>
<link rel="stylesheet" href="a.css">
</>
Each comma-separated term within <p class="ul1"> should be wrapped in an <a> tag, like this:
Eye color
<p class="ul">Eye color, color</p>
<p class="ul1"><a href="entry://blue">blue</a>, <a href="entry://cornflower blue">cornflower blue</a>, <a href="entry://steely blue">steely blue<
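
The question asks for Perl. Purely as an illustration of the transformation it describes, here is a rough Python sketch. It assumes each <p class="ul1"> element sits on a single line and contains only plain comma-separated text. The names `UL1` and `link_terms` are invented for this example, and a regex is only adequate for input this regular.

```python
import re

# Matches one <p class="ul1">...</p> element; it will not match class="ul".
UL1 = re.compile(r'(<p class="ul1">)(.*?)(</p>)')


def link_terms(line: str) -> str:
    """Wrap every comma-separated term inside <p class="ul1"> in an <a> tag."""

    def wrap(match: re.Match) -> str:
        terms = [term.strip() for term in match.group(2).split(",")]
        linked = ", ".join(
            f'<a href="entry://{term}">{term}</a>' for term in terms if term
        )
        return match.group(1) + linked + match.group(3)

    return UL1.sub(wrap, line)


print(link_terms('<p class="ul1">blue, cornflower blue</p>'))
print(link_terms('<p class="ul">weasel</p>'))  # left unchanged
```

For anything less regular than the sample, an HTML parser would likely be more robust than a regex.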

Delete Chars in Python

偶尔善良 submitted on 2019-12-02 01:45:28
Does anybody know how to delete all characters after a specific character? For example, turning http://google.com/translate_t into http://google.com.
SilentGhost: If you're asking about an arbitrary string and not a URL, you could go with:
>>> astring = "http://google.com/translate_t"
>>> astring.rpartition('/')[0]
'http://google.com'
For URLs, use urlparse:
>>> import urlparse
>>> parts = urlparse.urlsplit('http://google.com/path/to/resource?query=spam#anchor')
>>> parts
('http', 'google.com', '/path/to/resource', 'query=spam', 'anchor')
>>> urlparse.urlunsplit((parts[0], parts[1], '', '', ''))
'http:/
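
The urlparse module shown above is the Python 2 name. In Python 3 the same functions live in urllib.parse. A minimal sketch of the same approach follows; the wrapper name `origin_only` is an assumption made for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def origin_only(url: str) -> str:
    """Keep the scheme and network location; drop path, query and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))


print(origin_only("http://google.com/path/to/resource?query=spam#anchor"))  # http://google.com
print(origin_only("http://google.com/translate_t"))                         # http://google.com
```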

Randomizing text between delimiters

白昼怎懂夜的黑 submitted on 2019-12-01 17:14:44
Question: I have this simple input:
I have {red;green;orange} fruit and cup of {tea;coffee;juice}
I use Perl to identify patterns between the two outer brace delimiters { and }, and to pick one of the fields inside at random, where the fields are separated by the inner delimiter ;. I'm getting this output:
I have green fruit and cup of coffee
This is my working Perl script:
perl -plE 's!\{(.*?)\}!@x=split/;/,$1;$x[rand@x]!ge' <<< 'I have {red;green;orange} fruit and cup of {tea;coffee;juice}'
My task is to process this input format:
I have { {red
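
For comparison, the working one-liner can be approximated in Python as below. This sketch covers only the flat, non-nested case that the Perl script already handles. It makes no attempt at the nested-brace input, which is cut off above. The function name `pick_alternatives` is hypothetical.

```python
import random
import re


def pick_alternatives(text: str, rng=random) -> str:
    """Replace each flat {a;b;c} group with one randomly chosen field."""
    # Non-greedy match, so each {...} group is handled separately.
    return re.sub(
        r"\{(.*?)\}",
        lambda match: rng.choice(match.group(1).split(";")),
        text,
    )


print(pick_alternatives("I have {red;green;orange} fruit and cup of {tea;coffee;juice}"))
```

Passing a seeded `random.Random` instance as `rng` makes the output repeatable, which helps when testing.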

Impossible to see results of `RTextTools::toLower()` text in Document-Term-Matrix

白昼怎懂夜的黑 submitted on 2019-12-01 13:12:19
Question: I am trying to create a matrix, and I would like the text to be lowercased first. I use this R instruction:
matrix = create_matrix(tweets[,1], toLower = TRUE, language="english", removeStopwords=FALSE, removeNumbers=TRUE, stemWords=TRUE)
Here is the R code:
library(RTextTools)
library(e1071)
pos_tweets = rbind(
  c('j AIME la voiture', 'positive'),
  c('cette machine est performante', 'positive'),
  c('je me sens en bonne forme ce matin', 'positive'),
  c('je suis super excitée d aller voir le spectacle de

How to delete parts of a file in python?

家住魔仙堡 submitted on 2019-12-01 13:06:41
Question: I have a file named a.txt which looks like this:
I'm the first line
I'm the second line.
There may be more lines here.

I'm below an empty line.
I'm a line.
More lines here.
Now I want to remove the contents above the empty line (including the empty line itself). How could I do this in a Pythonic way?
Answer 1: Basically you can't delete stuff from the beginning of a file, so you will have to write to a new file. I think the pythonic way looks like this:
# get an iterator over the lines in the file:
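
The answer's code is cut off above, so the following is not a reconstruction of it. It is one possible Python 3 sketch of the "write to a new file" idea. It assumes UTF-8 text and treats a whitespace-only line as empty. The function name `drop_until_blank_line` is invented for this example.

```python
import os
import tempfile


def drop_until_blank_line(path: str) -> None:
    """Rewrite the file at path, keeping only the lines after the first empty line."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, "r", encoding="utf-8") as src, tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=directory, delete=False
    ) as dst:
        # Skip lines up to and including the first blank one.
        for line in src:
            if not line.strip():
                break
        # The file iterator resumes where the loop stopped, so this copies the rest.
        dst.writelines(src)
        tmp_name = dst.name
    # Swap the temporary file in only after both files are closed.
    os.replace(tmp_name, path)
```

If the file contains no empty line, this sketch leaves an empty file behind. A caller that wants the original kept in that case would need an extra check.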

Extracting info from large structured text files

老子叫甜甜 submitted on 2019-12-01 11:49:27
Question: I need to read some large files (from 50k to 100k lines), structured in groups separated by empty lines. Each group starts with the same pattern, "No.999999999 dd/mm/yyyy ZZZ". Here's some sample data:
No.813829461 16/09/1987 270 Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA) C.N.P.J./C.I.C./N INPI : 16404287000155 Procurador: MARCELLO DO NASCIMENTO
No.815326777 28/12/1989 351 Tit.SIGLA SISTEMA GLOBO DE GRAVACOES AUDIO VISUAIS LTDA (BR/RJ) C.N.P.J./C.I.C./NºINPI : 34162651000108 Apres.: Nominativa ;
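
The question does not name a language. As a rough sketch in Python, the groups can be read one at a time with a generator, so the whole file never has to be in memory. The names `HEADER` and `iter_groups` are hypothetical. The header regex is only a guess based on the stated pattern "No.999999999 dd/mm/yyyy ZZZ", and the sketch assumes each group header sits at the start of a line.

```python
import re
from typing import Iterable, Iterator, List

# Assumed header shape: "No.<digits> dd/mm/yyyy <code>".
HEADER = re.compile(r"^No\.(\d+)\s+(\d{2}/\d{2}/\d{4})\s+(\S+)")


def iter_groups(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield lists of non-empty lines, one list per blank-line-separated group."""
    group: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip():
            group.append(line)
        elif group:
            # A blank line closes the current group; repeated blanks are ignored.
            yield group
            group = []
    if group:
        yield group


sample = [
    "No.813829461 16/09/1987 270\n",
    "Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA)\n",
    "\n",
    "No.815326777 28/12/1989 351\n",
]
for record in iter_groups(sample):
    number, date, code = HEADER.match(record[0]).groups()
    print(number, date, code, len(record))
```

Because `iter_groups` accepts any iterable of lines, an open file object can be passed to it directly.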

Split a text file in PHP

六眼飞鱼酱① submitted on 2019-12-01 09:38:54
Question: How can I split a large text file into separate files by character count using PHP? For example, a 10,000-character file split every 1,000 characters would produce 10 files. Further, can I split only after a full stop is found? Thanks.
UPDATE 1: I like zombat's code; I removed some errors and came up with the following, but does anyone know how to split only after a full stop?
$i = 1;
$fp = fopen("test.txt", "r");
while (!feof($fp)) {
    $contents = fread($fp, 1000);
    file_put_contents('new_file
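
The question is about PHP, and its code is cut off above. The "split only after a full stop" part is language-neutral, though, so here is the idea sketched in Python. It is an illustration, not a port of zombat's code. Each chunk is at least `size` characters long and is extended to the next full stop. The function name `split_after_full_stop` is made up, and the sketch works on text already in memory rather than reading in blocks.

```python
from typing import List


def split_after_full_stop(text: str, size: int) -> List[str]:
    """Split text into chunks of at least size characters, each ending at a '.'."""
    if size < 1:
        raise ValueError("size must be at least 1")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        # Look for the first full stop at or after the size-th character of this chunk.
        stop = text.find(".", start + size - 1)
        end = len(text) if stop == -1 else stop + 1
        chunks.append(text[start:end])
        start = end
    return chunks


print(split_after_full_stop("aaa. bbb. ccc. ddd.", 5))
```

The caller would then write each chunk to its own file. The last chunk may be shorter than `size`, or may lack a trailing full stop if the text does not end with one.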

Does an empty string contain an empty string in C++?

℡╲_俬逩灬. submitted on 2019-11-30 23:47:45
Just had an interesting argument in the comment to one of my questions. My opponent claims that the statement "" does not contain "" is wrong. My reasoning is that if "" contained another "" , that one would also contain "" and so on. Who is wrong? P.S. I am talking about a std::string P.S. P.S I was not talking about substrings, but even if I add to my question " as a substring", it still makes no sense. An empty substring is nonsense . If you allow empty substrings to be contained in strings, that means you have an infinity of empty substrings. What is the point of that? Edit: Am I the only
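
The question concerns std::string, but the underlying convention is easy to check interactively in Python, whose substring search follows the same rule: the empty string is found at every position, including position 0 of an empty string. The snippet is only an analogy. It also shows that the count of empty matches is finite (length plus one), not infinite.

```python
haystack = ""

# The empty string is considered a substring of every string, itself included.
print("" in haystack)       # True
print(haystack.find(""))    # 0

# A string of length n has n + 1 positions where the empty string matches.
print("abc".count(""))      # 4
```

In C++, calling `find` with an empty argument on an empty `std::string` likewise returns 0 rather than `npos`.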

Extract text between two strings repeatedly using sed or awk? [duplicate]

牧云@^-^@ submitted on 2019-11-30 20:04:27
This question already has an answer here: How to use sed/grep to extract text between two words? (11 answers)
I have a file called 'plainlinks' that looks like this:
13080. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94092-2012.gz
13081. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94094-2012.gz
13082. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94096-2012.gz
13083. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94097-2012.gz
13084. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94098-2012.gz
13085. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94644-2012.gz
13086. ftp://ftp3.ncdc.noaa.gov
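
The question asks for sed or awk and is cut off before it says which two strings delimit the wanted text. As a generic illustration only, here is a small Python sketch that returns every piece of text found between two literal markers. The function name `extract_between` and the markers "noaa/" and ".gz" in the demo are assumptions chosen to fit the sample lines, not something the question states.

```python
import re
from typing import List


def extract_between(text: str, start: str, end: str) -> List[str]:
    """Return every shortest run of text found between start and end."""
    # re.escape makes the markers literal; '.' does not cross line breaks by default.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, text)


sample = (
    "13080. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94092-2012.gz\n"
    "13081. ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/999999-94094-2012.gz\n"
)
print(extract_between(sample, "noaa/", ".gz"))
```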