string-matching

R: Replacing foreign characters in a string

Submitted by 本秂侑毒 on 2019-11-30 12:41:33
I'm dealing with a large amount of data, mostly names with non-English characters. My goal is to match these names against information on them collected in the USA, i.e., I might want to match the name 'Sølvsten' (from some list of names) to 'Soelvsten' (the name as stored in some American database). Here is a function I wrote to do this. It's clearly clunky and somewhat arbitrary, but I wonder if there is a simple R function that translates these foreign characters to their nearest English neighbours. I understand that there might not be any standard way to do this conversion, but I'm just …
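In R the closest built-ins are iconv(x, to = "ASCII//TRANSLIT") (results vary by platform and iconv implementation) and stringi::stri_trans_general(x, "Latin-ASCII"). The same idea can be sketched language-neutrally in Python; note that the EXTRA table below is an assumption — real name data will need more entries:

```python
import unicodedata

# Fallback table (assumed, extend as needed) for letters that NFKD cannot
# decompose: 'ø' has no canonical decomposition, unlike 'é' -> 'e' + accent.
EXTRA = {'ø': 'oe', 'Ø': 'Oe', 'æ': 'ae', 'Æ': 'Ae', 'ß': 'ss'}

def to_ascii(name: str) -> str:
    mapped = ''.join(EXTRA.get(ch, ch) for ch in name)
    decomposed = unicodedata.normalize('NFKD', mapped)
    # Combining marks are non-ASCII, so 'ignore' strips them off the base letters
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('Sølvsten'))  # -> Soelvsten
```

Whether 'ø' should become 'o' or 'oe' is exactly the "no standard way" problem the question raises; a lookup table at least makes the choice explicit.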

strstr faster than algorithms?

Submitted by 落爺英雄遲暮 on 2019-11-30 10:48:42
Question: I have a file that's 21056 bytes. I've written a program in C that reads the entire file into a buffer and then uses multiple search algorithms to look for an 82-character token. I used the implementations from the "Exact String Matching Algorithms" page: KMP, BM, TBM, and Horspool. Then I used strstr and benchmarked each one. What I'm wondering is why strstr outperforms all the other algorithms every time. The only one that is faster …
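A likely explanation: libc's strstr is heavily tuned (glibc, for instance, uses the Two-Way algorithm with word-at-a-time comparisons), so on a 21 KB haystack with an 82-char needle the constant factor dominates and textbook C implementations lose; the asymptotic advantages of BM/KMP tend to show only on much larger inputs or adversarial data. For reference, a minimal Horspool sketch (illustrative Python, not the asker's C code):

```python
def horspool(text: str, pat: str) -> int:
    # Boyer-Moore-Horspool: skip ahead using a bad-character shift table
    m = len(pat)
    if m == 0:
        return 0
    # For each char of pat except the last: its distance to the pattern's end
    shift = {c: m - 1 - i for i, c in enumerate(pat[:-1])}
    i = m - 1
    while i < len(text):
        j = 0
        while j < m and text[i - j] == pat[m - 1 - j]:
            j += 1
        if j == m:
            return i - m + 1           # match starts here
        i += shift.get(text[i], m)     # chars absent from pat allow a full jump
    return -1

print(horspool('hello world', 'world'))  # -> 6
```

The longer the pattern and the larger the alphabet, the bigger Horspool's average jumps — which is why it often competes well with strstr on long needles, though rarely on small inputs like this one.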

Pandas - check if a string column in one dataframe contains a pair of strings from another dataframe

Submitted by 試著忘記壹切 on 2019-11-30 10:36:57
This question is based on another question I asked, where I didn't cover the problem entirely: Pandas - check if a string column contains a pair of strings. This is a modified version of that question. I have two dataframes: df1 = pd.DataFrame({'consumption':['squirrel ate apple', 'monkey likes apple', 'monkey banana gets', 'badger gets banana', 'giraffe eats grass', 'badger apple loves', 'elephant is huge', 'elephant eats banana tree', 'squirrel digs in grass']}) df2 = pd.DataFrame({'food':['apple', 'apple', 'banana', 'banana'], 'creature':['squirrel', 'badger', 'monkey', 'elephant']}) The …
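One way to ask "does any (food, creature) pair from df2 occur together in a df1 row?" is to test substring containment of both pair members per row. A sketch with the question's data (note this is plain substring containment; whole-word matching would need str.split or regexes with word boundaries):

```python
import pandas as pd

df1 = pd.DataFrame({'consumption': ['squirrel ate apple', 'monkey likes apple',
                                    'monkey banana gets', 'badger gets banana',
                                    'giraffe eats grass', 'badger apple loves',
                                    'elephant is huge', 'elephant eats banana tree',
                                    'squirrel digs in grass']})
df2 = pd.DataFrame({'food': ['apple', 'apple', 'banana', 'banana'],
                    'creature': ['squirrel', 'badger', 'monkey', 'elephant']})

# A row matches only if BOTH members of some (food, creature) pair occur in it
pairs = list(df2[['food', 'creature']].itertuples(index=False))
df1['match'] = df1['consumption'].apply(
    lambda text: any(p.food in text and p.creature in text for p in pairs))
print(df1)
```

So 'badger gets banana' is False (df2 pairs badger only with apple), while 'monkey banana gets' is True — the pairing matters, not the individual words.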

String searching algorithms in Java

Submitted by 假如想象 on 2019-11-30 10:35:03
I am doing string matching on a large amount of data. EDIT: I am matching words contained in a big list against some ontology text files. I take each file from the ontology and search for a match between the third string of each file line and any word from the list. I made a mistake in overlooking the fact that what I need is not pure matching (results are poor): I need some looser matching function that will also return results when one string is contained inside another. I did this with a radix trie; it was very fast and works nicely, but now I guess my work is useless because a trie …
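For "is any list word contained inside this token", a compiled regex alternation does the containment scan in one pass over each token; for very large word lists, the Aho–Corasick automaton is the standard tool (several Java implementations exist). A Python sketch with hypothetical data (the word list and line format below are illustrative, not from the question):

```python
import re

# Hypothetical word list and ontology lines; the third tab-separated
# field of each line is the token to test for containment.
words = ['protein', 'kinase', 'gene']
lines = ['GO:1\tname\tproteinase', 'GO:2\tname\tcellwall']

# Longest alternatives first, so the longest containing word wins
pattern = re.compile('|'.join(map(re.escape, sorted(words, key=len, reverse=True))))

for line in lines:
    token = line.split('\t')[2]
    m = pattern.search(token)
    print(token, '->', m.group(0) if m else 'no match')
```

A trie keyed on whole words indeed cannot answer "contained inside" queries directly; a suffix-based structure or an Aho–Corasick automaton built from the word list can.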

Create a unique ID by fuzzy matching of names (via agrep using R)

Submitted by 左心房为你撑大大i on 2019-11-30 09:23:39
Using R, I am trying to match people's names in a dataset structured by year and city. Due to some spelling mistakes, exact matching is not possible, so I am trying to use agrep() to fuzzy-match names. A sample chunk of the dataset is structured as follows: df <- data.frame(matrix( c("1200013","1200013","1200013","1200013","1200013","1200013","1200013","1200013", "1996","1996","1996","1996","2000","2000","2004","2004","AGUSTINHO FORTUNATO FILHO","ANTONIO PEREIRA NETO","FERNANDO JOSE DA COSTA","PAULO CEZAR FERREIRA DE ARAUJO","PAULO CESAR FERREIRA DE ARAUJO","SEBASTIAO BOCALOM RODRIGUES","JOAO …
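In R this hinges on agrep()'s max.distance argument. The same grouping can be sketched (in Python, for illustration) as greedy clustering on a similarity ratio; the 0.9 threshold and the "first name seen becomes the cluster representative" rule are assumptions, not part of the question:

```python
from difflib import SequenceMatcher

# Sample: two spellings of the same candidate plus a distinct name
names = ['PAULO CEZAR FERREIRA DE ARAUJO',
         'PAULO CESAR FERREIRA DE ARAUJO',
         'ANTONIO PEREIRA NETO']

def assign_ids(names, threshold=0.9):
    # Greedy clustering: reuse an ID when a name is close enough
    # to an existing cluster's representative
    reps, ids = [], []
    for n in names:
        for i, rep in enumerate(reps):
            if SequenceMatcher(None, n, rep).ratio() >= threshold:
                ids.append(i)
                break
        else:
            ids.append(len(reps))
            reps.append(n)
    return ids

print(assign_ids(names))  # -> [0, 0, 1]
```

Greedy clustering is order-dependent; for a unique-ID pipeline it is worth restricting candidate comparisons to the same city (as the question's year/city structure suggests) to limit false merges.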

Normalizing the edit distance

Submitted by 大兔子大兔子 on 2019-11-30 05:00:47
Question: Can we normalize the Levenshtein edit distance by dividing the e.d. value by the length of the two strings? I am asking because, if we compare two strings of unequal length, the difference between the lengths will be counted as well. E.g.: ed('has a', 'has a ball') = 4 and ed('has a', 'has a ball the is round') = 15. If we increase the length of the string, the edit distance will increase even though the strings are similar. Therefore, I cannot set a value …
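Dividing by max(len(a), len(b)) bounds the value in [0, 1] (0 = identical, 1 = nothing in common), which gives a length-independent threshold. As a side note, the first example above actually needs 5 insertions (' ball' is five characters), not 4. One caveat: the normalized value is no longer guaranteed to satisfy the triangle inequality, so it is a similarity score rather than a true metric. A sketch:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    # Dividing by the longer length bounds the result in [0, 1]
    return levenshtein(a, b) / max(len(a), len(b)) if (a or b) else 0.0

print(normalized_distance('has a', 'has a ball'))  # -> 0.5
```

With this normalization the two examples from the question score 0.5 and about 0.78 instead of raw distances that grow without bound, so a single cutoff works across string lengths.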

XPath partial of attribute known

Submitted by 我是研究僧i on 2019-11-30 04:52:13
I know the partial value of an attribute in a document, but not the whole thing. Is there a character I can use to represent any value? For example, the value of a label for an input is "A. Choice 1". I know it says "Choice 1", but not whether it will say "A. " or "B. " before the "Choice 1". Below is the relevant HTML. There are other attributes on the input and the label, but they are not the same every time the page is rendered, so I can't use them as references:

<tr>
  <td><input type="checkbox" /><label>A. Choice 1</label></td>
</tr><tr>
  <td><input type="checkbox" /><label>B. Choice 2</label …
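XPath 1.0 has no wildcard for partial text values, but contains() (and starts-with()) can anchor on the part that is known. A sketch against the HTML above (assuming the goal is the label, or the checkbox beside it):

```
//label[contains(., "Choice 1")]
//td[label[contains(., "Choice 1")]]/input
```

XPath 1.0 lacks ends-with(), but it can be emulated with substring(., string-length(.) - string-length("Choice 1") + 1) = "Choice 1" when the suffix must be exact.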

Algorithm to find out whether the matches for two Glob patterns (or Regular Expressions) intersect

Submitted by 前提是你 on 2019-11-30 02:54:48
I'm looking at matching glob-style patterns similar to what the Redis KEYS command accepts. Quoting:

h?llo matches hello, hallo and hxllo
h*llo matches hllo and heeeello
h[ae]llo matches hello and hallo, but not hillo

But I am not matching a pattern against a text string; I am matching the pattern against another pattern, with all operators being meaningful on both ends. For example, the patterns in each of the following pairs should match each other:

prefix*       prefix:extended*
*suffix       *:extended:suffix
left*right    left*middle*right
a*b*c         a*b*d*b*c
hello*        *ok
pre[ab]fix*   pre[bc]fix*

And these should not match: …
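Two patterns "match each other" exactly when the languages they describe intersect, which can be decided by a memoized recursion over both patterns at once (equivalent to testing emptiness of the product of the two patterns' automata). A sketch supporting only '*' and '?' — character classes like [ae] would additionally need the class intersections tested, which is omitted here; the function name is hypothetical:

```python
from functools import lru_cache

def glob_intersect(p: str, q: str) -> bool:
    """Do two glob patterns (only '*' and '?' supported) share any string?"""
    @lru_cache(maxsize=None)
    def go(i: int, j: int) -> bool:
        if i == len(p) and j == len(q):
            return True                      # both patterns consumed
        if i < len(p) and p[i] == '*':
            # '*' matches empty, or absorbs one unit of q and stays active
            return go(i + 1, j) or (j < len(q) and go(i, j + 1))
        if j < len(q) and q[j] == '*':
            return go(i, j + 1) or (i < len(p) and go(i + 1, j))
        if i == len(p) or j == len(q):
            return False                     # one side ran out of pattern
        if p[i] == '?' or q[j] == '?' or p[i] == q[j]:
            return go(i + 1, j + 1)          # compatible single characters
        return False
    return go(0, 0)

print(glob_intersect('hello*', '*ok'))  # -> True
```

Memoization keeps this at O(len(p) * len(q)) states, mirroring the product-automaton construction without building it explicitly.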

Search with various combinations of space, hyphen, casing and punctuations

Submitted by Deadly on 2019-11-30 02:11:41
Question: My schema:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory …
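The schema is cut off before any query-time analyzer, so this is an assumption about intent: to make variants like "wi-fi", "wi fi", "WiFi" and "wifi" all match, a common recipe is preserveOriginal="1" on the index-time WordDelimiterFilterFactory (keeping the unsplit token alongside the generated parts), paired with a query-time analyzer that generates parts but does not catenate, since catenation at query time can produce multi-token alternatives that break phrase queries. A fragment sketch:

```xml
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="1" splitOnNumerics="0"
        preserveOriginal="1"/>
```

Solr's analysis screen (Admin UI) shows exactly which tokens each variant produces at index and query time, which is the quickest way to verify a combination of these flags.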

Removing an item from list matching a substring

Submitted by 怎甘沉沦 on 2019-11-30 01:49:23
How do I remove an element from a list if it matches a substring? I have tried removing elements with pop() inside an enumerate() loop, but it seems I'm missing a few contiguous items that need to be removed: sents = ['@$\tthis sentences needs to be removed', 'this doesnt', '@$\tthis sentences also needs to be removed', '@$\tthis sentences must be removed', 'this shouldnt', '# this needs to be removed', 'this isnt', '# this must', 'this musnt']

for i, j in enumerate(sents):
    if j[0:3] == "@$\t":
        sents.pop(i)
        continue
    if j[0] == "#":
        sents.pop(i)
for i in sents:
    print i

Output: …
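The bug: sents.pop(i) shifts every later element one position left while enumerate() keeps advancing, so the element immediately after each removed item is never examined — which is exactly why contiguous marked lines survive. Building a new list avoids mutating during iteration:

```python
sents = ['@$\tthis sentences needs to be removed', 'this doesnt',
         '@$\tthis sentences also needs to be removed',
         '@$\tthis sentences must be removed', 'this shouldnt',
         '# this needs to be removed', 'this isnt',
         '# this must', 'this musnt']

# Keep only lines starting with neither marker; startswith accepts a tuple
cleaned = [s for s in sents if not s.startswith(('@$\t', '#'))]
print(cleaned)  # -> ['this doesnt', 'this shouldnt', 'this isnt', 'this musnt']
```

If in-place removal is required, iterating over a copy (for j in sents[:]) or walking the list backwards also sidesteps the index-shift problem.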