agrep | 易学教程

R: Fuzzy merge using agrep and data.table

Question: I am trying to merge two data.tables, but because the stock names are spelled differently I lose a substantial number of data points. So instead of an exact match I was looking into a fuzzy merge.

    library("data.table")
    dt1 = data.table(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2))
    dt2 = data.table(Name = c("ASML HOLDING NV", "ABN AMRO GROUP"), B = c("p", "q"))

When merging dt1 and dt2 on "Name", ASML HOLDING will be excluded due to the addition of "NV", while the actual data would be accurate
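One possible way to approach a merge like this, sketched under assumptions: compute edit distances between the two Name columns with base R's adist(), pick the closest dt2 name for each dt1 name, and merge on that. The cutoff max_dist and the helper column Name2 are illustrative choices, not part of the question, and the cutoff would need tuning on real data.

```r
library(data.table)

dt1 <- data.table(Name = c("ASML HOLDING", "ABN AMRO GROUP"), A = c(1, 2))
dt2 <- data.table(Name = c("ASML HOLDING NV", "ABN AMRO GROUP"), B = c("p", "q"))

# Edit distance between every dt1 name (rows) and every dt2 name (columns).
d <- adist(dt1$Name, dt2$Name)

# For each dt1 name take the closest dt2 name.
best <- apply(d, 1, which.min)

# Assumed cutoff: accept a match only if it needs at most max_dist edits.
max_dist <- 3
ok <- d[cbind(seq_len(nrow(d)), best)] <= max_dist

# Name2 (hypothetical helper column) holds the matched dt2 spelling, or NA.
dt1[, Name2 := ifelse(ok, dt2$Name[best], NA_character_)]

# Exact merge on the matched spelling.
merged <- merge(dt1, dt2, by.x = "Name2", by.y = "Name", all.x = TRUE)
print(merged)
```

Taking the single nearest name is a simplification; several dt1 names can end up mapped to the same dt2 name, so the result is worth inspecting before it is relied on.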

agrep: only return best match(es)

Question: I'm using the agrep function in R, which returns a vector of matches. I would like a function similar to agrep that returns only the best match, or the best matches if there are ties. Currently I do this by calling the sdist() function from the cba package on each element of the resulting vector, but this seems very redundant. Edit: here is the function I'm currently using. I'd like to speed it up, as it seems redundant to calculate the distance twice.

    library(cba)
    word <- 'test'
    words <- c(
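A minimal sketch of one alternative, not the asker's function: base R's adist() computes the edit distances in a single call, so the minimum can be taken directly and ties fall out naturally. The helper name best_match and the candidate vector are hypothetical. Note that adist() compares whole strings by default, whereas agrep() looks for approximate substring matches, so the two can rank candidates differently.

```r
# Return the element(s) of `candidates` with the smallest edit distance
# to `pattern`; further arguments are passed on to adist().
best_match <- function(pattern, candidates, ...) {
  d <- drop(adist(pattern, candidates, ...))
  candidates[d == min(d)]
}

# Hypothetical candidates for illustration.
words <- c("Teest", "teeeest", "New York", "yeast", "text", "Test")

case_sensitive <- best_match("test", words)
case_insensitive <- best_match("test", words, ignore.case = TRUE)

print(case_sensitive)    # "text" and "Test" tie at distance 1
print(case_insensitive)  # "Test" alone at distance 0
```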

approximate string matching within single list - r

Question: I have a data frame column containing thousands of names. Many of the names differ only slightly from one another, and I would like to find a way to match them. For example:

    names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.')

I've looked at amatch in the stringdist package, as well as agrep, but these all require a master list of names that are used to match another list of names
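One way to group names without a master list, sketched here as an assumption rather than an established answer: build the pairwise edit-distance matrix with adist() and cut a single-linkage hierarchical clustering at a chosen height. The threshold of 2 edits is arbitrary and the variable names are illustrative.

```r
names_vec <- c('jon smith', 'jon, smith', 'Jon Smith', 'jon smith et al',
               'bob seger', 'bob, seger', 'bobby seger', 'bob seger jr.')

# Pairwise edit distances among all names, ignoring case.
d <- adist(names_vec, ignore.case = TRUE)

# Single linkage: a name joins a group if it is close to any member of it.
hc <- hclust(as.dist(d), method = "single")

# Assumed threshold: names linked by chains of at most 2 edits share a group.
group <- cutree(hc, h = 2)

print(split(names_vec, group))
```

With this threshold the longer variants ("jon smith et al", "bob seger jr.") stay in groups of their own; raising the threshold would pull them in, at the risk of merging unrelated names.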

R multiple fuzzy match agrep create variable

Question: New to R. I would like to create a yes/no variable that checks whether the first name OR the last name fuzzy-matches the email address. If so, append a 'yes' to that row. Data example:

    id  firstname  lastname  email address       match
    1   patrick    boyles    patrickb@gmail.com  yes
    2   zeke       cosmos    zeke@gmail.com      yes
    3   foo        foo       abcd@gmail.com      no

I understand that I need to use agrep. What confuses me is how to tell R to check 2 columns (first name and last name) and only check within that
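A rough sketch of a row-wise check, under stated assumptions: agrepl() is applied pairwise with mapply() so each name is compared only with the email in its own row, and only the part before "@" is searched. The helper fuzzy_in, the column name email, and the tolerance of one edit are choices made for illustration.

```r
df <- data.frame(
  id = 1:3,
  firstname = c("patrick", "zeke", "foo"),
  lastname = c("boyles", "cosmos", "foo"),
  email = c("patrickb@gmail.com", "zeke@gmail.com", "abcd@gmail.com"),
  stringsAsFactors = FALSE
)

# Search only the part of the address before "@".
local_part <- sub("@.*$", "", df$email)

# Hypothetical helper: element-wise approximate match of pattern[i] in target[i],
# allowing at most one edit (an assumed tolerance).
fuzzy_in <- function(pattern, target) {
  mapply(function(p, t) agrepl(p, t, max.distance = 1L, ignore.case = TRUE),
         pattern, target, USE.NAMES = FALSE)
}

# "yes" if either the first name or the last name matches within the row.
df$match <- ifelse(fuzzy_in(df$firstname, local_part) |
                     fuzzy_in(df$lastname, local_part), "yes", "no")
print(df)
```

Very short names are a weak point of this approach: with one edit allowed, a three-letter name matches any address containing two of its letters in sequence.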

How to match a string with a tolerance of one character?

Question: I have a vector of locations that I am trying to disambiguate against a vector of correct location names. For this example I am using just two disambiguated locations, though:

    agrepl('Au', c("Austin, TX", "Houston, TX"), max.distance = .000000001, ignore.case = T, fixed = T)
    [1] TRUE TRUE

The help page says that max.distance is

    Maximum distance allowed for a match. Expressed either as integer, or as a fraction of the pattern length times the maximal transformation cost

I am not sure about the
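The behavior in the example is consistent with how the help page describes fractional values: a fraction is replaced by the smallest integer not less than it, so any positive fraction, however tiny, still allows one edit, and "Au" then matches the "ou" in "Houston" by a single substitution. The sketch below contrasts that with max.distance = 0 and with an explicit integer tolerance of one; the misspelled pattern "Austn, TX" is made up for illustration.

```r
locations <- c("Austin, TX", "Houston, TX")

# A tiny positive fraction is rounded up to one allowed edit.
loose <- agrepl("Au", locations, max.distance = 1e-9, ignore.case = TRUE)
print(loose)  # TRUE TRUE

# Zero allowed edits behaves like an exact substring search.
exact <- agrepl("Au", locations, max.distance = 0, ignore.case = TRUE)
print(exact)  # TRUE FALSE

# An integer states the tolerance directly: at most one edit.
one_edit <- agrepl("Austn, TX", locations, max.distance = 1L, ignore.case = TRUE)
print(one_edit)  # TRUE FALSE
```

A longer pattern makes a one-character tolerance meaningful; with a two-letter pattern, one edit is already half of the pattern.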

R : Record Linkage problem with all fields combined in 1 column

Question: I have to match column a from dataset A to column b in dataset B. But the different variables aren't in separate fields (columns a, b, c); they are all in the same one. I have been looking at the packages RecordLinkage and fastLink, which work great when the fields are separated. Separate fields:

    # make dataframe 1
    fname <- c("ash", "aalok", "aaron", "adam", "adrian", "ajay")
    lname <- c("perry", "phillips", "picardo", "pinck", "pinnick-flood", "pledger")
    dob <- c(1957, 1971, 1948, 1961, 1972, 2000)
    city <- c

How to fix error agrep: pattern too long (has > 32 chars) it doesn't show error if there is no full stop in the string?

agrep gives the error `agrep: pattern too long (has > 32 chars)` when there is a full stop (.) in the pattern string, but not otherwise. I want to compare two strings approximately, so I'm using agrep for that, but it fails with this error. I found that it doesn't give the error if there is no full stop in the pattern string (why?).

    echo "The quick brown fox jumped over the lazy dog." | agrep -c -4 "The quick brown fox jumped over the lazy dog."

The expected output is 1; instead it gives an error:

    agrep: pattern too long (has > 32 chars)

It works if I remove

Create a unique ID by fuzzy matching of names (via agrep using R)

Using R, I am trying to match people's names in a dataset structured by year and city. Due to some spelling mistakes, exact matching is not possible, so I am trying to use agrep() to fuzzy-match names. A sample chunk of the dataset is structured as follows:

    df <- data.frame(matrix(
      c("1200013","1200013","1200013","1200013","1200013","1200013","1200013","1200013",
        "1996","1996","1996","1996","2000","2000","2004","2004",
        "AGUSTINHO FORTUNATO FILHO","ANTONIO PEREIRA NETO","FERNANDO JOSE DA COSTA","PAULO CEZAR FERREIRA DE ARAUJO","PAULO CESAR FERREIRA DE ARAUJO","SEBASTIAO BOCALOM RODRIGUES","JOAO