string-matching | 易学教程

Speeding up a “closest” string match algorithm

Question: I am currently processing a very large database of locations and trying to match them with their real-world coordinates. To achieve this, I have downloaded the geoname dataset, which contains a lot of entries. It gives possible names and lat/long coordinates. To speed up the process, I have reduced the huge CSV file from 1.6 GB to 0.450 GB by removing entries that do not make sense for my dataset. However, it still contains 4 million entries. Now I have many entries such as: ...

How to search a string of key/value pairs in Java

Question: I have a String that's formatted like this: "key1=value1;key2=value2;key3=value3" for any number of key/value pairs. I need to check that a certain key exists (let's say it's called "specialkey"). If it does, I want the value associated with it. If there are multiple "specialkey"s set, I only want the first one. Right now, I'm looking for the index of "specialkey". I take a substring starting at that index, then look for the index of the first = character. Then I look for the index of the ...
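
The index-based approach described above can match "specialkey" inside a longer key (for example "notspecialkey") or inside a value. A minimal sketch of an alternative, written in Python rather than the question's Java and assuming keys never contain `;` or `=`, splits the string into pairs first. The function name `first_value` is hypothetical.

```python
from typing import Optional


def first_value(pairs: str, key: str) -> Optional[str]:
    """Return the value of the first `key` entry in 'k1=v1;k2=v2', or None."""
    for pair in pairs.split(";"):
        # partition() splits on the first '=' only, so a value may contain '='.
        name, sep, value = pair.partition("=")
        if sep and name == key:
            return value  # first occurrence wins
    return None


print(first_value("key1=value1;specialkey=a;specialkey=b", "specialkey"))
print(first_value("notspecialkey=x;key2=specialkey", "specialkey"))
```

Because whole keys are compared, the second call finds nothing even though the text "specialkey" appears twice in the input.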

Boyer Moore Algorithm Implementation?

Is there a working example of the Boyer-Moore string search algorithm in C? I've looked at a few sites, but they seem pretty buggy, including Wikipedia. Thanks.

The best site for substring search algorithms: http://igm.univ-mlv.fr/~lecroq/string/

There are a couple of implementations of Boyer-Moore-Horspool (including Sunday's variant) on Bob Stout's Snippets site. Ray Gardner's implementation in BMHSRCH.C is bug-free as far as I know, and definitely the fastest I've ever seen or heard of. It's not, however, the easiest to understand -- he uses some fairly tricky code to keep the inner loop ...
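
For readers who want to see the idea before reading optimized C, here is a minimal sketch of the Boyer-Moore-Horspool variant mentioned above. It is written in Python for brevity, is not Ray Gardner's code, and makes no attempt at his inner-loop optimizations. The function name `horspool_search` is hypothetical.

```python
def horspool_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Bad-character table: for every character except the pattern's last one,
    # the distance from its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        # Compare the window against the pattern from right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos
        # Slide by the shift of the text character under the window's last
        # position; characters absent from the table allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1


print(horspool_search("here is a simple example", "example"))
```

The result should agree with Python's built-in `str.find` for any input, which makes the built-in a convenient oracle when testing a port to C.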

Fastest way to find Strings in String collection that begin with certain chars

I have a large collection of Strings. I want to be able to find the Strings that begin with "Foo" or the Strings that end with "Bar". What would be the best Collection type to get the fastest results? (I am using Java.) I know that a HashSet is very fast for complete matches, but not, I would think, for partial matches. So what could I use instead of just looping through a List? Should I look into LinkedLists or similar types? Are there any Collection types that are optimized for this kind of query?

Mario Rossi: The best collection type for this problem is SortedSet. You would need two of ...
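
The sorted-collection idea can be sketched as follows, in Python rather than Java. In sorted order, all strings sharing a prefix sit next to each other, so a binary search finds where the run starts. Keeping a second sorted list of reversed strings for the "ends with" query is an assumption of this sketch, since the answer above is cut off. The helper names are hypothetical.

```python
from bisect import bisect_left


def with_prefix(sorted_items, prefix):
    """Return all entries of a sorted list that start with prefix."""
    lo = bisect_left(sorted_items, prefix)  # first entry >= prefix
    hi = lo
    # Matching entries are contiguous, so stop at the first non-match.
    while hi < len(sorted_items) and sorted_items[hi].startswith(prefix):
        hi += 1
    return sorted_items[lo:hi]


words = ["Foo", "FooBar", "Food", "Bar", "BazBar", "Barn"]
forward = sorted(words)
# Assumption: suffix queries become prefix queries on the reversed strings.
backward = sorted(w[::-1] for w in words)

starts_with_foo = with_prefix(forward, "Foo")
ends_with_bar = [w[::-1] for w in with_prefix(backward, "Bar"[::-1])]
print(starts_with_foo)
print(ends_with_bar)
```

The binary search costs O(log n) comparisons, and the scan afterwards touches only the matching entries plus one.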

UTL_MATCH-like function to work with CLOB

My question is: Is there a UTL_MATCH-like function which works with a CLOB rather than a VARCHAR2? My specific problem is: I'm on an Oracle database. I have a bunch of pre-written queries which interface with Domo CenterView. The queries have variables in them defined by ${variableName}. I need to rewrite these queries. I didn't write the original, so instead of figuring out what a good value for the variables should be, I want to run the queries with the application and get what the query was from V$SQL. So my solution is: Do a UTL_MATCH on the queries with the variable stuff in it and V ...

How to search for a part of a dictionary key?

Could someone please tell me how I can search for only a part of a key in a dictionary (in VB.NET)? I use the following sample code:

    Dim PriceList As New Dictionary(Of String, Double)(System.StringComparer.OrdinalIgnoreCase)
    PriceList.Add("Spaghetti alla carbonara", 21.65)
    PriceList.Add("Spaghetti aglio e olio", 22.65)
    PriceList.Add("Spaghetti alla napoletana", 23.65)
    PriceList.Add("Spaghetti alla puttanesca ", 24.65)
    PriceList.Add("Spaghetti alla gricia ", 25.65)
    PriceList.Add("Spaghetti alle vongole", 26.65)
    PriceList.Add("Spaghetti Bolognese", 27.65)
    If PriceList.ContainsKey("spaghetti ...
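
A hash-based dictionary can only look up complete keys, so a partial match has to examine the keys one by one. The sketch below shows that scan in Python rather than VB.NET, using a subset of the sample data above. The function name `keys_containing` is hypothetical, and `casefold()` stands in for the case-insensitive comparer.

```python
price_list = {
    "Spaghetti alla carbonara": 21.65,
    "Spaghetti aglio e olio": 22.65,
    "Spaghetti alla napoletana": 23.65,
    "Spaghetti alle vongole": 26.65,
    "Spaghetti Bolognese": 27.65,
}


def keys_containing(d, fragment):
    """Return the entries whose key contains fragment, ignoring case."""
    needle = fragment.casefold()
    return {k: v for k, v in d.items() if needle in k.casefold()}


for name, price in keys_containing(price_list, "spaghetti alla").items():
    print(name, price)
```

The scan is linear in the number of keys, which is usually fine for a price list of this size.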

How can I generate a list of words from a group of letters using Perl?

I was looking for a module, regex, or anything else that might apply to this problem. How can I programmatically parse the string and create known English and/or Spanish words, given that I have a dictionary table against which I can check each permutation of the algorithm's randomization for a match?

Given a group of characters:

    EBLAIDL KDIOIDSI ADHFWB

The program should return:

    BLADE AID KID KIDS FIDDLE HOLA etc.

I also want to be able to define the minimum and maximum word length as well as the number of syllables. The input length doesn't matter, it must be only letters, and punctuation doesn't ...
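
Instead of generating every permutation and looking each one up, it is usually cheaper to go the other way: walk the dictionary and keep each word whose letters are all available. The sketch below does this with letter counts, in Python rather than Perl. It assumes the letter groups are pooled together and leaves the syllable constraint out. The function name `formable_words` and the small word list are hypothetical.

```python
from collections import Counter


def formable_words(letters, dictionary, min_len=1, max_len=None):
    """Return the dictionary words that can be spelled from the given letters."""
    # Pool all letters, ignoring spaces and anything that is not a letter.
    available = Counter(ch for ch in letters.upper() if ch.isalpha())
    result = []
    for word in dictionary:
        w = word.upper()
        if len(w) < min_len or (max_len is not None and len(w) > max_len):
            continue
        # Counter subtraction drops non-positive counts, so an empty result
        # means every letter of the word is available often enough.
        if not (Counter(w) - available):
            result.append(w)
    return result


sample_dictionary = ["blade", "aid", "kid", "kids", "fiddle", "hola", "zebra", "queen"]
print(formable_words("EBLAIDL KDIOIDSI ADHFWB", sample_dictionary, min_len=3))
```

Each dictionary word is checked once, so the cost grows with the size of the dictionary and not with the number of permutations of the input.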

Longest Common Substring with wrong character tolerance

I have a script I found on here that works well when looking for the longest common substring. However, I need it to tolerate some incorrect/missing characters. I would like to be able to either input a percentage of similarity required, or perhaps specify the number of missing/wrong characters allowable. For example, I want to find this string:

    big yellow school bus

inside of this string:

    they rode the bigyellow schook bus that afternoon

This is the code I'm currently using:

    function longest_common_substring($words) {
        $words = array_map('strtolower', array_map('trim', $words));
        $sort_by_strlen = ...
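
One way to tolerate wrong or missing characters is to swap the longest-common-substring approach for approximate substring matching. The sketch below uses the edit-distance table with a free starting position (often attributed to Sellers), written in Python rather than PHP. It is a different technique from the script in the question, not a fix of it. The function name `best_fuzzy_match` is hypothetical.

```python
def best_fuzzy_match(pattern, text):
    """Return (distance, end): the smallest edit distance between pattern and
    any substring of text, and the index in text just past that substring."""
    # Row 0 is all zeros: a match may start anywhere in text at no cost.
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, start=1):
        # Column 0: matching pattern[:i] against empty text costs i deletions.
        curr = [i] + [0] * len(text)
        for j, tc in enumerate(text, start=1):
            cost = 0 if pc == tc else 1
            curr[j] = min(
                prev[j - 1] + cost,  # characters match, or one is substituted
                prev[j] + 1,         # pattern character missing from text
                curr[j - 1] + 1,     # extra character in text
            )
        prev = curr
    # The last row holds the best distance for a match ending at each position.
    end = min(range(len(text) + 1), key=prev.__getitem__)
    return prev[end], end


needle = "big yellow school bus"
haystack = "they rode the bigyellow schook bus that afternoon"
distance, end = best_fuzzy_match(needle, haystack)
print(distance, repr(haystack[:end]))
```

A caller can accept the match when `distance` is at most an allowed number of errors, or convert it to a similarity such as `1 - distance / len(needle)` for a percentage threshold. In the example, the two differences are the missing space and the "k" in "schook".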

Extract last word in a string after comma if there are multiple words else the first word

I have data like the following:

    location <- c("xyz, sss, New Zealand", "USA", "Pris,France")
    id <- c(1, 2, 3)
    df <- data.frame(location, id)

I would like to extract the country name from the data. The tricky part is that if I extract just the last word, only one record (France) comes out right.

    library(stringr)
    df$country <- word(df$location, -1)

Any ideas on how to extract the country from this data?

    id  location               country
    1   xyz, sss, New Zealand  New Zealand
    2   USA                    USA
    3   Pris,France            France

You can try sub:

    df$country <- sub('.*,\\s*', '', df$location)
    df$country
    #[1] "New Zealand" "USA"         "France"

Or ...
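
The `sub` pattern works because `.*` is greedy: it consumes everything up to the last comma, and `\s*` then eats any spaces after it. When the string has no comma, nothing matches and the string is returned unchanged. The same pattern can be sketched with Python's `re` module, for comparison with the R call above. The function name `country` is hypothetical.

```python
import re

locations = ["xyz, sss, New Zealand", "USA", "Pris,France"]


def country(location):
    """Strip everything up to and including the last comma and following spaces."""
    return re.sub(r".*,\s*", "", location)


countries = [country(loc) for loc in locations]
print(countries)
```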

Efficient string suffix detection

I am working with PySpark on a huge dataset, where I want to filter the data frame based on strings in another data frame. For example,

    dd = spark.createDataFrame(
        ["something.google.com",
         "something.google.com.somethingelse.ac.uk",
         "something.good.com.cy",
         "something.good.com.cy.mal.org"],
        StringType()).toDF('domains')

    +----------------------------------------+
    |domains                                 |
    +----------------------------------------+
    |something.google.com                    |
    |something.google.com.somethingelse.ac.uk|
    |something.good.com.cy                   |
    |something.good.com.cy.mal.org           |
    +----------------------------------------+

    dd1 = spark ...