string-matching | 易学教程

Using Rabin-Karp to search for multiple patterns in a string

According to the Wikipedia entry on the Rabin-Karp string matching algorithm, it can be used to look for several different patterns in a string at the same time while still maintaining linear complexity. This is clearly easy when all the patterns have the same length, but I still don't see how we can preserve O(n) complexity when searching simultaneously for patterns of differing lengths. Can someone please shed some light on this?

Edit (December 2011): The Wikipedia article has since been updated and no longer claims to match multiple patterns of differing lengths in O(n). I'm ...
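For the equal-length case that the question calls easy, a minimal Python sketch might look like the following. It is an illustration under stated assumptions, not the Wikipedia pseudocode: the function name `rabin_karp_multi` and the `base` and `mod` values are arbitrary choices made here, and every hash hit is verified against the actual window so that collisions cannot produce false matches.

```python
def rabin_karp_multi(text, patterns, base=256, mod=1_000_000_007):
    """Find all occurrences of several equal-length patterns in text.

    Returns (index, pattern) pairs ordered by index. The base and mod
    values are illustrative assumptions, not canonical constants.
    """
    patterns = list(patterns)
    if not patterns:
        return []
    m = len(patterns[0])
    if m == 0 or any(len(p) != m for p in patterns):
        raise ValueError("all patterns must share one non-zero length")
    if len(text) < m:
        return []

    def poly_hash(s):
        # Polynomial hash of s, reduced modulo mod.
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    # Group patterns by hash so one lookup per window covers all of them.
    by_hash = {}
    for p in patterns:
        by_hash.setdefault(poly_hash(p), set()).add(p)

    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    h = poly_hash(text[:m])
    matches = []
    for i in range(len(text) - m + 1):
        if h in by_hash:
            window = text[i:i + m]
            if window in by_hash[h]:  # verify to rule out hash collisions
                matches.append((i, window))
        if i + m < len(text):
            # Roll the hash: drop text[i], append text[i + m].
            h = ((h - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches
```

With k patterns of length m, hashing the patterns costs O(k·m) and the scan is expected to be linear in the text length as long as hash collisions are rare. Patterns of differing lengths would need one rolling hash per distinct length, which is where the single-pass O(n) claim stops holding in general.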

Fuzzy matching of product names

I need to automatically match product names (cameras, laptops, TVs etc.) that come from different sources to a canonical name in the database. For example, "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS" should all match "Canon PowerShot A20 IS". I've worked with Levenshtein distance plus some added heuristics (removing obvious common words, assigning a higher cost to number changes etc.), which works to some extent, but unfortunately not well enough. The main problem is that even single-letter changes in relevant keywords can make a huge difference ...

Match and replace emoticons in string - what is the most efficient way?

Wikipedia defines a lot of possible emoticons people can use. I want to match this list to words in a string. I now have this:

    $string = "Lorem ipsum :-) dolor :-| samet";
    $emoticons = array(
        '[HAPPY]' => array(' :-) ', ' :) ', ' :o) '), //etc...
        '[SAD]' => array(' :-( ', ' :( ', ' :-| ')
    );
    foreach ($emoticons as $emotion => $icons) {
        $string = str_replace($icons, " $emotion ", $string);
    }
    echo $string;

Output:

    Lorem ipsum [HAPPY] dolor [SAD] samet

So in principle this works. However, I have two questions: As you can see, I'm putting spaces around each emoticon in the array, such as ' :-) ' ...
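One possible alternative to padding every emoticon with spaces is a single compiled alternation with whitespace lookarounds. The Python sketch below is only an illustration of that idea, not an answer from the original thread; the names `EMOTICONS`, `LABEL_FOR` and `replace_emoticons` are made up here, and the emoticon list is the short one from the question.

```python
import re

# Labels and emoticons taken from the question's example array.
EMOTICONS = {
    "[HAPPY]": [":-)", ":)", ":o)"],
    "[SAD]": [":-(", ":(", ":-|"],
}

# Reverse lookup: emoticon -> label.
LABEL_FOR = {icon: label for label, icons in EMOTICONS.items() for icon in icons}

# One alternation for all emoticons, longest first. The lookarounds require
# whitespace (or the string boundary) on both sides, so the emoticons do not
# need to carry their own padding spaces.
PATTERN = re.compile(
    r"(?<!\S)(?:"
    + "|".join(re.escape(icon) for icon in sorted(LABEL_FOR, key=len, reverse=True))
    + r")(?!\S)"
)

def replace_emoticons(text):
    """Replace each whitespace-delimited emoticon with its label in one pass."""
    return PATTERN.sub(lambda m: LABEL_FOR[m.group(0)], text)
```

Because the lookarounds consume nothing, an emoticon at the very start or end of the string is matched as well, which the space-padded `str_replace` version would miss.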

How to replace all matching characters except the first occurrence

I am trying to use a regex to process a string in JavaScript. I want to replace all '.'s and '%'s with the empty string '', but the catch is that I don't want to replace the first occurrence of '.'.

    value.replace(/\%\./g, '');

Expected results:

    .4.5.6.7. ==> .4567
    4.5667.444... ==> 4.5667444
    ..3445.4 ==> .34454

You can pass a function to replace and skip the first match like this:

    var i = 0;
    value.replace(/[\.\%]/g, function(match) {
        return match === "." ? (i++ === 0 ? '.' : '') : '';
    });

Here is a self-contained version with no external variables:

    value.replace(/[\.\%]/g, function ...
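The same callback idea can be sketched in Python, where `re.sub` also accepts a function as the replacement. This is a translation for illustration only; the helper name `strip_dots_and_percents` is hypothetical, and the "first dot seen" flag lives inside the function so that no outer variable is needed.

```python
import re

def strip_dots_and_percents(value):
    """Remove every '%' and every '.' except the first '.'."""
    seen_dot = False

    def repl(match):
        nonlocal seen_dot
        if match.group(0) == "." and not seen_dot:
            seen_dot = True
            return "."  # keep only the first dot
        return ""       # drop later dots and all percent signs

    return re.sub(r"[.%]", repl, value)
```

Inside a character class neither `.` nor `%` needs escaping, so `[.%]` matches the same characters as the `[\.\%]` class used in the JavaScript answer.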

How to use Regex in Java to pattern match?

I have read the documentation and various tutorials online, but I'm still confused about how regex works in Java. What I am trying to do is create a function which takes an argument of type String. I then want to check whether the passed string contains any characters other than MDCLXVIivxlcdm. So, for example, the string "XMLVID" should return false and "ABXMLVA" should return true.

    public boolean checkString(String arg) {
        Pattern p = Pattern.compile("[a-zA-z]&&[^MDCLXVIivxlcdm]");
        Matcher m = p.matcher(arg);
        if(m.matches())
            return true;
        else
            return false;
    }

When I pass, "XMLIVD", "ABXMLVA", and "XMLABCIX ...
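The intended check, "does the string contain any character outside MDCLXVIivxlcdm", can be sketched in Python with a negated character class and a search anywhere in the string. This is an illustration of the stated requirement rather than a fix taken from the thread, and the function name `has_non_roman_chars` is hypothetical.

```python
import re

# Matches any single character that is NOT one of the allowed letters.
NON_ROMAN = re.compile(r"[^MDCLXVIivxlcdm]")

def has_non_roman_chars(s):
    """Return True if s contains at least one character outside the allowed set."""
    return NON_ROMAN.search(s) is not None
```

The key design point is searching for one offending character rather than requiring the whole input to match. In Java the analogous distinction is between `Matcher.find()`, which looks for a match anywhere, and `Matcher.matches()`, which only succeeds if the entire input matches the pattern.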

Check if all characters of one string exist in another string in R

Question: I am trying to compare strings like PRABHAKAR SHARMA and SHARMA KUMAR PRABHAKAR. The intention is to check whether all the characters of the shorter string exist in the other string. If so, I should get a 100% match; otherwise, I should get the percentage of characters that matched. I tried using levenshteinSim from the RecordLinkage package, but it gives a number corresponding to the number of changes required to change one string into the other.

    install.packages("RecordLinkage") ...
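One way to read the requirement is as a multiset comparison of characters. The Python sketch below follows that reading and rests on assumptions the question does not settle: characters are counted with multiplicity, spaces are ignored, and case is folded. The function name `char_coverage` is hypothetical.

```python
from collections import Counter

def char_coverage(a, b):
    """Percentage of the shorter string's characters that also occur in the other.

    Assumptions (not stated in the question): spaces are ignored, case is
    folded, and repeated characters must be matched by repeated characters.
    """
    a_chars = Counter(a.replace(" ", "").upper())
    b_chars = Counter(b.replace(" ", "").upper())
    # Order the two multisets by total character count.
    short, long_ = sorted((a_chars, b_chars), key=lambda c: sum(c.values()))
    total = sum(short.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each character.
    matched = sum((short & long_).values())
    return 100.0 * matched / total
```

For the two names in the question every character of the shorter name is available in the longer one, so this measure reports a full match regardless of word order.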

Find matches of a vector of strings in another vector of strings

I'm trying to create a subset of a data frame of news articles that mention at least one element of a set of keywords or phrases.

    # Sample data frame of articles
    articles <- data.frame(id=c(1, 2, 3, 4),
                           text=c("Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod",
                                  "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,",
                                  "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo",
                                  "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse"))
    articles$text <- as.character(articles$text)
    # Sample vector of keywords or ...

String similarity in PHP: levenshtein like function for long strings

The levenshtein function in PHP works on strings with a maximum length of 255. What are good alternatives for computing a similarity score between sentences in PHP? Basically, I have a database of sentences and I want to find approximate duplicates. The similar_text function is not giving me the expected results. What is the easiest way for me to detect similar sentences like the ones below?

    $ss="Jack is a very nice boy, isn't he?";
    $pp="jack is a very nice boy is he";
    $ss=strtolower($ss); // convert to lower case as we dont care about case
    $pp=strtolower($pp);
    $score=similar_text($ss, $pp);
    echo "$score %\n"; // Outputs ...
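For reference, edit distance itself has no inherent length limit; the cap is a property of one implementation. The Python sketch below shows the standard two-row dynamic programme plus one possible normalisation to a 0..1 score. It is an illustration, not a PHP drop-in, and the `similarity` normalisation (dividing by the longer length) is an assumption chosen here.

```python
def levenshtein(a, b):
    """Edit distance between a and b using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Score in [0, 1]; 1.0 means identical. Normalisation is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Memory use is proportional to the shorter string, while running time is proportional to the product of the two lengths, so for a large database of sentences some cheaper pre-filter would likely be needed before comparing pairs.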

Regex allow a string to only contain numbers 0 - 9 and limit length to 45

I am trying to create a regex that lets a string contain only the characters 0-9; the string must be at least 1 character long and no more than 45. For example, 00303039 would be a match and 039330a29 would not. So far this is what I have, but I am not sure that it is correct:

    [0-9]{1,45}

I have also tried

    ^[0-9]{45}*$

but that does not seem to work either. I am not very familiar with regex, so any help would be great. Thanks!

You are almost there; all you need is a start anchor ( ^ ) and an end anchor ( $ ):

    ^[0-9]{1,45}$

\d is short for the character class [0-9]. You can use that as:

    ^\d{1,45}
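As a quick check of the anchored pattern, here is a small Python sketch. The function name `is_valid_digits` is hypothetical, and `re.fullmatch` is used in place of explicit `^` and `$` anchors because in Python `$` also matches just before a trailing newline.

```python
import re

# One to 45 ASCII digits and nothing else.
DIGITS_1_TO_45 = re.compile(r"[0-9]{1,45}")

def is_valid_digits(s):
    """Return True if s consists solely of 1 to 45 characters in 0-9."""
    return DIGITS_1_TO_45.fullmatch(s) is not None
```

The class `[0-9]` is spelled out rather than `\d` because, for Python `str` patterns, `\d` also matches decimal digits from other scripts unless the `re.ASCII` flag is set.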

Check whether a string contains a substring

How can I check whether a given string contains a certain substring, using Perl? More specifically, I want to see whether s1.domain.com is present in the given string variable.

To find out whether a string contains a substring, you can use the index function:

    if (index($str, $substr) != -1) {
        print "$str contains $substr\n";
    }

It will return the position of the first occurrence of $substr in $str, or -1 if the substring is not found.

David W.

Another possibility is to use regular expressions, which is what Perl is famous for:

    if ($mystring =~ /s1\.domain\.com/) {
        print qq("$mystring" contains "s1 ...
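For comparison, the two approaches described for Perl have close Python counterparts. The sketch below is illustrative only and its function names are hypothetical: `str.find` plays the role of `index` (it also returns -1 when nothing is found), and `re.escape` handles the dots that the Perl regex escapes by hand.

```python
import re

def contains_literal(haystack, needle):
    """Plain substring test; str.find returns -1 when needle is absent."""
    return haystack.find(needle) != -1

def contains_via_regex(haystack, needle):
    """Regex test with the needle escaped, so '.' matches only a literal dot."""
    return re.search(re.escape(needle), haystack) is not None
```

Without the escaping step an unescaped dot would match any character, so a host name such as s1xdomain.com would be accepted by the regex version.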
