string-matching | 易学教程

Using grep to subset rows from a data.table, comparing row content

    DT <- data.table(num=c("20031111","1112003","23423","2222004"), y=c("2003","2003","2003","2004"))
    > DT
            num    y
    1: 20031111 2003
    2:  1112003 2003
    3:    23423 2003
    4:  2222004 2004

I want to compare the contents of the two cells in each row and perform an action based on the boolean result. For instance, if "num" matches the year, create a column x holding that value. I thought about subsetting based on grep, and that works, but it naturally checks the whole column every time, which seems wasteful:

    DT[grep(y,num)] # works with a pattern>1 warning

I could apply() my way, but perhaps there's a data.table way? Thanks

Nick Kennedy If

Using Rabin-Karp to search for multiple patterns in a string

Question: According to the Wikipedia entry on the Rabin-Karp string matching algorithm, it can be used to look for several different patterns in a string at the same time while still maintaining linear complexity. It is clear that this is easily done when all the patterns are of the same length, but I still don't get how we can preserve O(n) complexity when searching for patterns of differing lengths simultaneously. Can someone please shed some light on this? Edit (December 2011): The Wikipedia article
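For illustration only, here is a minimal Python sketch of one common way to handle patterns of different lengths: group the patterns by length and run one rolling-hash pass per distinct length. The function name `rabin_karp_multi` and the constants `BASE` and `MOD` are assumptions made for this example. Note that this costs roughly one pass over the text per distinct pattern length, so it does not by itself answer how a single O(n) pass could be preserved.

```python
from collections import defaultdict

BASE = 256             # assumed radix for the polynomial hash
MOD = 1_000_000_007    # assumed large prime modulus


def _hash(s):
    """Polynomial hash of a whole string."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rabin_karp_multi(text, patterns):
    """Return a sorted list of (index, pattern) for every occurrence."""
    # length -> hash -> set of patterns with that length and hash
    by_length = defaultdict(lambda: defaultdict(set))
    for p in patterns:
        if p:
            by_length[len(p)][_hash(p)].add(p)

    matches = []
    n = len(text)
    for m, table in by_length.items():
        if m > n:
            continue
        high = pow(BASE, m - 1, MOD)  # weight of the outgoing character
        h = _hash(text[:m])
        for i in range(n - m + 1):
            if h in table:
                window = text[i:i + m]
                # Confirm with a real comparison to rule out hash collisions.
                if window in table[h]:
                    matches.append((i, window))
            if i + m < n:
                # Slide the window: drop text[i], append text[i + m].
                h = ((h - ord(text[i]) * high) * BASE + ord(text[i + m])) % MOD
    return sorted(matches)
```

When all patterns share one length, the outer loop runs once and the search is a single expected-linear pass, which is the easy case the question mentions.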

How to search for a string in one column in other columns of a data frame

Question: I have a table, call it df, with 3 columns: the 1st is the title of a product, the 2nd is the description of a product, and the 3rd is a one-word string. What I need to do is run an operation on the entire table, creating 2 new columns (call them 'exists_in_title' and 'exists_in_description') that hold either a 1 or a 0 indicating whether the 3rd column's value exists in the 1st or 2nd column. I need it to simply be a 1:1 operation, so for example, calling row 1 'A', I need to check if the cell A3,
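The row-by-row check described above can be sketched in plain Python, using a list of dicts in place of the data frame. The key names `title`, `description`, and `word`, the sample rows, and the choice of a case-insensitive substring test are all assumptions made for this illustration; the original question does not specify them.

```python
def add_exists_flags(rows):
    """Return new rows with 1/0 flags for whether 'word' occurs in each text column."""
    flagged = []
    for row in rows:
        word = row["word"].lower()  # assumption: comparison ignores case
        new_row = dict(row)
        new_row["exists_in_title"] = int(word in row["title"].lower())
        new_row["exists_in_description"] = int(word in row["description"].lower())
        flagged.append(new_row)
    return flagged


# Hypothetical sample data, not taken from the original question.
rows = [
    {"title": "Red widget", "description": "A small crimson gadget", "word": "red"},
    {"title": "Blue lamp", "description": "Lamp with a blue shade", "word": "blue"},
    {"title": "Green chair", "description": "Wooden chair", "word": "metal"},
]

for r in add_exists_flags(rows):
    print(r["word"], r["exists_in_title"], r["exists_in_description"])
```

Each row is compared only against its own cells, which is the 1:1 behaviour the question asks for.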

How to select R data.table rows based on substring match (a la SQL like)

I have a data.table with a character column, and want to select only those rows that contain a given substring. This is equivalent to SQL WHERE x LIKE '%substring%'. E.g.

    > Months = data.table(Name = month.name, Number = 1:12)
    > Months["mb" %in% Name]
    Empty data.table (0 rows) of 2 cols: Name,Number

How would I select only the rows where Name contains "mb"?

data.table has a like function.

    Months[like(Name,"mb")]
            Name Number
    1: September      9
    2:  November     11
    3:  December     12

Or, %like% looks nicer:

    > Months[Name %like% "mb"]
            Name Number
    1: September      9
    2:  November     11
    3:  December     12

Note that %like% and like()

python - regex search and findall

I need to find all matches in a string for a given regex. I've been using findall() to do that until I came across a case where it wasn't doing what I expected. For example:

    import re

    regex = re.compile(r'(\d+,?)+')
    s = 'There are 9,000,000 bicycles in Beijing.'

    print(re.search(regex, s).group(0))
    > 9,000,000

    print(re.findall(regex, s))
    > ['000']

In this case search() returns what I need (the longest match) but findall() behaves differently, although the docs imply it should be the same:

    findall() matches all occurrences of a pattern, not just the first one as search() does.

Why is the behaviour different?
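A short sketch of why the two calls differ, assuming Python 3: when a pattern contains exactly one capturing group, `re.findall` returns the text of that group (here, the last repetition) rather than the whole match. Making the group non-capturing, or collecting `group(0)` from `finditer`, returns the full match instead. The helper name `full_matches` is made up for this example.

```python
import re

s = "There are 9,000,000 bicycles in Beijing."

capturing = re.compile(r"(\d+,?)+")        # one capturing group
non_capturing = re.compile(r"(?:\d+,?)+")  # same pattern, group does not capture


def full_matches(pattern, text):
    """Collect the whole match (group 0) for every occurrence."""
    return [m.group(0) for m in pattern.finditer(text)]


print(capturing.findall(s))        # ['000']: last repetition of the group
print(non_capturing.findall(s))    # ['9,000,000']
print(full_matches(capturing, s))  # ['9,000,000']
```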

Search for string allowing for one mismatch in any location of the string

I am working with DNA sequences of length 25 (see examples below). I have a list of 230,000 of them and need to look for each sequence in the entire genome (the Toxoplasma gondii parasite). I am not sure how large the genome is, but it is much longer than 230,000 sequences. I need to look for each of my sequences of 25 characters, for example (AGCCTCCCATGATTGAACAGATCAT). The genome is formatted as a continuous string, i.e. (CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAGTGCGGAGCCTGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTT....) I don't care where or how many times it is found, only whether it is or not. This is
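One way to approach a search that tolerates a single mismatch, sketched here in Python under stated assumptions: split the probe into two halves. If the probe occurs with at most one mismatch, at least one half must occur exactly, so exact hits for either half give candidate positions that are then verified by counting mismatches. The function names are hypothetical, and this is an illustration rather than a tuned solution for 230,000 probes against a whole genome.

```python
def hamming_at_most(a, b, limit):
    """True if equal-length strings a and b differ in at most `limit` positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > limit:
                return False
    return True


def occurs_with_one_mismatch(genome, probe):
    """True if probe occurs somewhere in genome with at most one mismatch."""
    m = len(probe)
    if m == 0:
        return True
    if m > len(genome):
        return False
    if m == 1:
        return True  # any single character is within one mismatch
    half = m // 2
    pieces = [(0, probe[:half]), (half, probe[half:])]
    for offset, piece in pieces:
        pos = genome.find(piece)
        while pos != -1:
            start = pos - offset  # where the full probe would begin
            if 0 <= start <= len(genome) - m:
                if hamming_at_most(genome[start:start + m], probe, 1):
                    return True
            pos = genome.find(piece, pos + 1)
    return False


print(occurs_with_one_mismatch("AAAACCCCGGGG", "ACCTCG"))
```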

Regular Expression Match to test for a valid year

Given a value, I want to check whether it is a valid year. My criterion is simple: the value should be an integer with 4 characters. I know this is not the best solution, as it will not allow years before 1000 and will allow years such as 5000. This criterion is adequate for my current scenario. What I came up with is

    \d{4}$

While this works, it also allows negative values. How do I ensure that only positive integers are allowed?

You need to add a start anchor ^:

    ^\d{4}$

Your regex \d{4}$ will match strings that end with 4 digits, so input like -1234 will be accepted. By adding
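The effect of the start anchor can be checked with a small Python sketch. The names `YEAR_END_ONLY`, `YEAR_ANCHORED`, and `is_four_digit_year` are made up for this example, and the original answer does not say which regex engine it targets, so Python's `re` module is an assumption here.

```python
import re

YEAR_END_ONLY = re.compile(r"\d{4}$")   # the original attempt
YEAR_ANCHORED = re.compile(r"^\d{4}$")  # with the start anchor added


def is_four_digit_year(value):
    """True if the whole string is exactly four digits."""
    return YEAR_ANCHORED.search(value) is not None


for candidate in ["2003", "-1234", "12345", "999"]:
    print(candidate,
          YEAR_END_ONLY.search(candidate) is not None,
          is_four_digit_year(candidate))
```

In Python specifically, `$` also matches just before a trailing newline, so `re.fullmatch` is a stricter alternative when the input may end with one.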

Check if string matches pattern

How do I check if a string matches this pattern? Uppercase letter, number(s), uppercase letter, number(s)... For example, these would match:

    A1B2
    B10L1
    C1N200J1

These wouldn't ('^' points to problem):

    a1B2
    ^
    A10B
       ^
    AB400
    ^

CrazyCasta

    import re
    pattern = re.compile("^([A-Z][0-9]+)+$")
    pattern.match(string)

Edit: As noted in the comments, match checks only for matches at the beginning of the string, while re.search() will match a pattern anywhere in the string. (See also: https://docs.python.org/library/re.html#search-vs-match )

One-liner:

    re.match(r"pattern", string) # No need to compile

    import re
    >>> if
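As a hedged illustration of the same check in Python 3, `re.fullmatch` requires the entire string to match, so the explicit `^` and `$` anchors are not needed. The helper name `matches_pattern` is an assumption for this sketch.

```python
import re

# One or more repetitions of: an uppercase letter followed by one or more digits.
PATTERN = re.compile(r"([A-Z][0-9]+)+")


def matches_pattern(s):
    """True if the whole string alternates uppercase letter, digit(s), ..."""
    return PATTERN.fullmatch(s) is not None


for sample in ["A1B2", "B10L1", "C1N200J1", "a1B2", "A10B", "AB400"]:
    print(sample, matches_pattern(sample))
```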

Fuzzy matching of product names

Question: I need to automatically match product names (cameras, laptops, TVs, etc.) that come from different sources to a canonical name in the database. For example, "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS" should all match "Canon PowerShot A20 IS". I've worked with Levenshtein distance with some added heuristics (removing obvious common words, assigning a higher cost to number changes, etc.), which works to some extent, but not well enough
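As one possible alternative to edit distance (not the approach described in the question), the sketch below normalizes each name into a set of lowercase letter runs and digit runs and scores candidates by Jaccard overlap of those token sets. The function names and the extra catalogue entries "Nikon Coolpix S3000" and "Canon EOS 5D" are hypothetical, added only to make the example runnable.

```python
import re


def tokens(name):
    """Lowercase the name and split it into runs of letters and runs of digits."""
    return set(re.findall(r"[a-z]+|\d+", name.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match(raw_name, canonical_names):
    """Return (canonical_name, score) with the highest token-set overlap."""
    raw = tokens(raw_name)
    scored = [(jaccard(raw, tokens(c)), c) for c in canonical_names]
    score, name = max(scored, key=lambda pair: pair[0])
    return name, score


# Hypothetical catalogue; only the first entry comes from the question.
catalogue = ["Canon PowerShot A20 IS", "Nikon Coolpix S3000", "Canon EOS 5D"]

for raw in ["Canon PowerShot a20IS",
            "NEW powershot A20 IS from Canon",
            "Digital Camera Canon PS A20IS"]:
    print(raw, "->", best_match(raw, catalogue))
```

Splitting at letter/digit boundaries makes "a20IS" and "A20 IS" produce the same tokens, which is the kind of variation the question's examples show. Abbreviations such as "PS" for "PowerShot" are not handled by this sketch.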

How to use Regex in Java to pattern match?

Question: I have read the documentation and various tutorials online, but I'm still confused about how regex works in Java. What I am trying to do is create a function which takes an argument of type String. I then want to check if the passed string contains any characters other than MDCLXVIivxlcdm. So, for example, the string "XMLVID" should return false and "ABXMLVA" should return true.

    public boolean checkString(String arg) {
        Pattern p = Pattern.compile("[a-zA-z]&&[^MDCLXVIivxlcdm]");
        Matcher m = p.matcher