regex

Regex negative lookbehind in R

Submitted by 佐手、 on 2021-02-08 07:42:36
Question: I'm trying to write a negative-lookbehind regex with stringr in R. Basically, I have text data that looks something like this:

    See item 7 Management's Discussion and Analysis. BlahBlahBlah. Item 7 Management's Discussion and Analysis. BlahBlahBlah. Item 8 Financial Statements and Supplementary Data.

I want to select everything from the "Item 7" that comes right after the "BlahBlahBlah." sentence up to "Item 8-Financial Statements and Supplementary Data". So I want "Item 7 Management's Discussion and …
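
The thread itself is about R and stringr, and the excerpt stops before any answer. Purely as a language-neutral sketch of one way to grab the span the asker describes (anchoring on the "BlahBlahBlah." that precedes the wanted "Item 7" instead of relying on a lookbehind), here is a hypothetical Python illustration; the pattern and variable names are assumptions, not the accepted answer.

    import re

    text = ("See item 7 Management's Discussion and Analysis. BlahBlahBlah. "
            "Item 7 Management's Discussion and Analysis. BlahBlahBlah. "
            "Item 8 Financial Statements and Supplementary Data.")

    # Anchor on the "BlahBlahBlah." sentence, then capture lazily from the
    # following "Item 7" heading up to (but not including) "Item 8".
    m = re.search(r"BlahBlahBlah\.\s*(Item 7 .*?)(?=Item 8)", text, re.DOTALL)
    if m:
        print(m.group(1))
        # Item 7 Management's Discussion and Analysis. BlahBlahBlah.

A comparable stringr call (str_extract() with the same pattern) should behave similarly, since stringr's ICU regex engine supports lookahead and lazy quantifiers.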

Positive lookbehind regex obvious maximum length

Submitted by ♀尐吖头ヾ on 2021-02-08 07:39:40
Question: I have been experimenting with regex in order to parse the following strings:

    INFO: Device 6: Time 20.11.2015 06:28:00 - [Script] FunFehlerButton: Execute [0031 text]
    INFO: Device 0: Time 09.12.2015 03:51:44 - [Replication] FunFehlerButton: Execute
    INFO: Device 6: Time 20.11.2015 06:28:00 - FunFehlerButton: Execute

The regexes I tried are (?<=\\d{1,2}:\\d{2}:\\d{2} - ).* and (?<=\\[\\w*\\]).*, of which the first one runs correctly and the second one throws an exception. My …
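
The excerpt cuts off before the regex engine is named, but the title points at one where a lookbehind needs an obvious maximum length, which (?<=\[\w*\]) does not have. As a hedged sketch of a common workaround, the bracketed tag can be matched outside the lookbehind and the interesting part captured instead; the grouping pattern below is an assumption, not the thread's accepted answer.

    import re

    lines = [
        "INFO: Device 6: Time 20.11.2015 06:28:00 - [Script] FunFehlerButton: Execute [0031 text]",
        "INFO: Device 0: Time 09.12.2015 03:51:44 - [Replication] FunFehlerButton: Execute",
        "INFO: Device 6: Time 20.11.2015 06:28:00 - FunFehlerButton: Execute",
    ]

    # A variable-length lookbehind like (?<=\[\w*\]) is rejected by Python's re
    # as well; capturing what follows the optional [...] tag avoids the issue.
    pattern = re.compile(r"\d{1,2}:\d{2}:\d{2} - (?:\[\w*\] )?(.*)")

    for line in lines:
        m = pattern.search(line)
        if m:
            print(m.group(1))
    # FunFehlerButton: Execute [0031 text]
    # FunFehlerButton: Execute
    # FunFehlerButton: Execute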

Modify NLTK word_tokenize to prevent tokenization of parenthesis

Submitted by 巧了我就是萌 on 2021-02-08 07:32:48
Question: I have the following main.py:

    #!/usr/bin/env python
    # vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
    import nltk
    import string
    import sys
    for token in nltk.word_tokenize(''.join(sys.stdin.readlines())):
        #print token
        if len(token) == 1 and not token in string.punctuation or len(token) > 1:
            print token

The output is the following:

    ./main.py <<< 'EGR1(-/-) mouse embryonic fibroblasts'
    EGR1
    -/-
    mouse
    embryonic
    fibroblasts

I want to slightly change the tokenizer so …
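
The excerpt ends before the desired output is stated, but the script suggests the asker wants EGR1(-/-) kept as a single token. One possible sketch, assuming NLTK is installed and that a plain regexp-based tokenizer is acceptable in place of word_tokenize (this is not necessarily what the answers proposed):

    from nltk.tokenize import RegexpTokenizer

    # A token is a run of word characters optionally followed by a parenthesised
    # group, or any single non-space character; this keeps "(-/-)" attached to
    # the preceding word instead of splitting it off.
    tokenizer = RegexpTokenizer(r"\w+(?:\([^)]*\))?|\S")

    print(tokenizer.tokenize("EGR1(-/-) mouse embryonic fibroblasts"))
    # ['EGR1(-/-)', 'mouse', 'embryonic', 'fibroblasts']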

sre_constants.error: unexpected end of regular expression - Should Work Fine

Submitted by 情到浓时终转凉″ on 2021-02-08 07:29:56
Question: I'm doing a little bit of testing, and I need a way to split a string into groups of two (e.g. 'abcdef' => ['ab','cd','ef']). I'm trying to use the regex pattern [^]{2} to do this. Whenever I try to compile this pattern, I get the error message:

    sre_constants.error: unexpected end of regular expression

The exact line of code is:

    pat = re.compile(r'[^]{2}')

Could someone please tell me what I'm doing wrong here? I've done a lot of searching, but a lot of the …
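
[^] is accepted by JavaScript regexes as "any character", but Python's re reads a ] placed directly after [^ as a literal character, so the class is never closed and the compile fails with "unexpected end of regular expression". A minimal sketch of the usual substitutes, assuming the goal is exactly the pairs shown in the question:

    import re

    s = "abcdef"

    # '.' with re.DOTALL (or the class [\s\S]) matches any character,
    # including newlines, which is what [^] means in JavaScript.
    pat = re.compile(r".{2}", re.DOTALL)      # or r"[\s\S]{2}"
    print(pat.findall(s))                     # ['ab', 'cd', 'ef']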

Extract specific words from a text file?

Submitted by 霸气de小男生 on 2021-02-08 06:51:33
Question: I have a text file with over 10,000 lines. Each line has a word that starts with CDID_ followed by 10 more characters with no spaces, as below:

    a <- c("Test CDID_1254WE_1023 Sky","CDID_1254XE01478 Blue","This File named as CDID_ZXASWE_1111")

I would like to extract only the words that start with CDID_, so that the lines above look like this:

    CDID_1254WE_1023
    CDID_1254XE01478
    CDID_ZXASWE_1111

Answer 1: Here are three base R options. Option 1: Use sub(), removing everything except the CDID_* …
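
The answer excerpt stops partway through the first of the three base R options. Purely as a language-neutral illustration of the same extraction (not one of those R options), here is a hedged Python sketch that keeps the whitespace-delimited token beginning with CDID_ on each line:

    import re

    lines = [
        "Test CDID_1254WE_1023 Sky",
        "CDID_1254XE01478 Blue",
        "This File named as CDID_ZXASWE_1111",
    ]

    # Keep only the token that starts with CDID_, i.e. everything from CDID_
    # up to the next whitespace character.
    for line in lines:
        m = re.search(r"CDID_\S+", line)
        if m:
            print(m.group(0))
    # CDID_1254WE_1023
    # CDID_1254XE01478
    # CDID_ZXASWE_1111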

AWK set multiple delimiters for comma and quotes with commas

Submitted by こ雲淡風輕ζ on 2021-02-08 06:43:33
Question: I have a CSV file where columns are comma separated, and columns with textual data that contain commas are quoted. Sometimes quotes also appear inside the quoted text, to mean things like inches, resulting in doubled quotes. Textual data without embedded commas is not quoted. For example:

    A,B,C
    1,"hello, how are you",hello
    2,car,bike
    3,13.3 inch tv,"tv 13.3"""

How do I use awk to print the number of columns for each row, for which I should get 3 3 3? I thought of using awk -F'[,"]' but I'm getting …
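
The question asks specifically for an awk solution, and the excerpt stops before any answer. Purely as a cross-check of the expected field counts, here is a hedged Python sketch using the standard csv module, which already understands quoted fields and doubled quotes; it is an illustration of the desired parsing, not the awk answer.

    import csv
    import io

    data = '''A,B,C
    1,"hello, how are you",hello
    2,car,bike
    3,13.3 inch tv,"tv 13.3"""
    '''

    # csv.reader treats quoted fields and "" escapes correctly, so embedded
    # commas and inch marks do not split columns.
    for row in csv.reader(io.StringIO(data)):
        print(len(row))   # prints 3 for every row, header included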

Regular expressions in POS tagged NLTK corpus

Submitted by 荒凉一梦 on 2021-02-08 06:29:14
Question: I'm loading a POS-tagged corpus in NLTK, and I would like to find certain patterns involving POS tags. These patterns can be quite complex, including many different combinations of POS tags. Example input string:

    We/PRP spent/VBD some/DT time/NN reading/NN about/IN the/DT historical/JJ importance/NN of/IN tea/NN in/IN Korea/NNP and/CC China/NNP and/CC then/RB tasted/VBD the/DT most/JJS expensive/JJ green/JJ tea/NN I/PRP have/VBP ever/RB seen/VBN ./.

In this case the POS pattern is …
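
The excerpt is cut off before the actual POS pattern is stated, so the pattern below is only a hypothetical example of the general technique: because the corpus line is plain word/TAG text, an ordinary regex over those tokens can express sequences such as "one or more adjectives followed by a noun".

    import re

    tagged = ("We/PRP spent/VBD some/DT time/NN reading/NN about/IN the/DT "
              "historical/JJ importance/NN of/IN tea/NN in/IN Korea/NNP and/CC "
              "China/NNP and/CC then/RB tasted/VBD the/DT most/JJS expensive/JJ "
              "green/JJ tea/NN I/PRP have/VBP ever/RB seen/VBN ./.")

    # One or more adjectives (JJ/JJR/JJS) directly followed by a noun (NN*),
    # expressed over the word/TAG tokens themselves.
    pattern = r"(?:\S+/JJ[RS]?\s+)+\S+/NN\S*"
    for match in re.findall(pattern, tagged):
        print(match)
    # historical/JJ importance/NN
    # most/JJS expensive/JJ green/JJ tea/NN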

javascript search string contains ' + '

Submitted by ≯℡__Kan透↙ on 2021-02-08 06:17:04
Question: I would like to search for one string inside another string, but I'm facing an issue. Here is my code:

    reference = "project/+bug/1234";
    str = "+bug/1234";
    alert(reference.search(str)); // it should alert 8 (the index of the match)

but it alerts -1, so str wasn't found in reference. I've found what the problem is: it seems to be the "+" character in str, because .search() appears to evaluate the searched string as a regex, in which "+" is special.

Answer 1: Just use string.indexOf(). It takes a literal …
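
The answer excerpt recommends string.indexOf() and is cut off after "It takes a literal". The underlying point is that a literal substring search never interprets "+" as a regex operator, while a regex-based search needs the needle escaped first. A hedged Python analogue of both options (str.find() and re.escape()), shown only to illustrate the idea rather than the JavaScript answer:

    import re

    reference = "project/+bug/1234"
    needle = "+bug/1234"

    # A literal substring search never treats '+' as a quantifier.
    print(reference.find(needle))                  # 8

    # If a regex search is required, escape the needle first; an unescaped
    # leading '+' is a regex syntax error ("nothing to repeat").
    m = re.search(re.escape(needle), reference)
    print(m.start() if m else -1)                  # 8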