string-matching

How to select R data.table rows based on substring match (a la SQL like)

牧云@^-^@ submitted on 2019-11-26 18:55:13
Question: I have a data.table with a character column and want to select only the rows whose value contains a given substring, equivalent to SQL WHERE x LIKE '%substring%'. E.g.

    > Months = data.table(Name = month.name, Number = 1:12)
    > Months["mb" %in% Name]
    Empty data.table (0 rows) of 2 cols: Name,Number

How would I select only the rows where Name contains "mb"?

Answer 1: data.table has a like function.

    Months[like(Name,"mb")]
            Name Number
    1: September      9
    2:  November     11
    3:  December     12

Or, %like% looks nicer: >

javascript regular expression to check for IP addresses

你说的曾经没有我的故事 submitted on 2019-11-26 17:55:09
I have several IP addresses like:

    115.42.150.37
    115.42.150.38
    115.42.150.50

What type of regular expression should I write if I want to search for all 3 IP addresses? E.g., if I search for 115.42.150.* I should be able to find all 3 IP addresses. What I can do now is something like:

    /[0-9]{1-3}\.[0-9]{1-3}\.[0-9]{1-3}\.[0-9]{1-3}/

but it doesn't seem to work well. Thanks.

Spudley: The regex you've got already has several problems: Firstly, it contains dots. In regex, a dot means "match any character", where you need to match just an actual dot. For this, you need to escape it, so put a back
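
As a rough illustration of the escaping point above, here is a minimal Python sketch (the thread itself is about JavaScript; the name `PREFIX_PATTERN` and the two non-matching sample addresses are assumptions added for the example). It escapes each dot and writes the repetition count as `{1,3}` with a comma. It does not check that the last octet is in the 0-255 range.

```python
import re

# Each "\." matches a literal dot; "[0-9]{1,3}" matches one to three digits.
# Note the comma in {1,3}: "{1-3}" is not a valid quantifier.
PREFIX_PATTERN = re.compile(r"^115\.42\.150\.[0-9]{1,3}$")

addresses = [
    "115.42.150.37",
    "115.42.150.38",
    "115.42.150.50",
    "115.42.151.37",   # different third octet (hypothetical sample)
    "115x42x150x37",   # no literal dots (hypothetical sample)
]

# Keep only the addresses under the 115.42.150. prefix.
matches = [a for a in addresses if PREFIX_PATTERN.match(a)]
print(matches)
```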

How can I match fuzzy match strings from two datasets?

时光怂恿深爱的人放手 submitted on 2019-11-26 17:32:25
I've been working on a way to join two datasets based on an imperfect string, such as the name of a company. In the past I had to match two very dirty lists: one list had names and financial information, the other had names and addresses. Neither had unique IDs to match on! ASSUME THAT CLEANING HAS ALREADY BEEN APPLIED AND THERE MAY BE TYPOS AND INSERTIONS. So far AGREP is the closest tool I've found that might work. I can use Levenshtein distances in the AGREP package, which measure the number of deletions, insertions and substitutions between two strings. AGREP will return the string with the

python - regex search and findall

旧城冷巷雨未停 submitted on 2019-11-26 16:59:31
Question: I need to find all matches in a string for a given regex. I've been using findall() to do that until I came across a case where it wasn't doing what I expected. For example:

    regex = re.compile(r'(\d+,?)+')
    s = 'There are 9,000,000 bicycles in Beijing.'
    print(re.search(regex, s).group(0))
    > 9,000,000
    print(re.findall(regex, s))
    > ['000']

In this case search() returns what I need (the longest match) but findall() behaves differently, although the docs imply it should be the same: findall() matches
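
The difference shown above comes from the capturing group: when a pattern contains one group, `re.findall()` returns the text of that group (here, its last repetition) rather than the whole match. A minimal sketch of two common ways around it, a non-capturing group or `re.finditer()`, follows; the variable names are chosen for illustration only.

```python
import re

s = "There are 9,000,000 bicycles in Beijing."

capturing = re.compile(r"(\d+,?)+")        # one capturing group
non_capturing = re.compile(r"(?:\d+,?)+")  # same pattern, group does not capture

# search().group(0) is always the whole match.
whole = capturing.search(s).group(0)

# With a capturing group, findall() returns the group's last repetition.
groups = capturing.findall(s)

# With a non-capturing group, findall() returns the whole matches.
full_matches = non_capturing.findall(s)

# finditer() yields match objects, so group(0) is available for every match.
from_iter = [m.group(0) for m in capturing.finditer(s)]

print(whole, groups, full_matches, from_iter)
```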

Joining two datasets using fuzzy logic

怎甘沉沦 submitted on 2019-11-26 16:56:51
Question: I'm trying to do a fuzzy logic join in R between two datasets: the first data set has the name of a location and a column called config; the second data set has the name of a location and two additional attributes that need to be summarized before they are joined to the first data set. I would like to use the name column to join the two data sets. However, the name column may have additional or leading characters in either data set, or have one word contained inside a larger word. So for

Regex allow a string to only contain numbers 0 - 9 and limit length to 45

自闭症网瘾萝莉.ら submitted on 2019-11-26 15:34:07
Question: I am trying to create a regex that allows a string to contain only the characters 0-9, with a length of at least 1 character and no more than 45. For example, 00303039 would be a match, and 039330a29 would not. So far this is what I have, but I am not sure that it is correct: [0-9]{1,45} I have also tried ^[0-9]{45}*$ but that does not seem to work either. I am not very familiar with regex, so any help would be great. Thanks!

Answer 1: You are almost there, all you need is start anchor ( ^ )
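
To make the anchoring point concrete, here is a small Python sketch (the question does not name a language, so Python and the helper name `is_valid` are assumptions). It uses `fullmatch()`, which requires the whole string to match and so behaves like wrapping the pattern in `^` and `$`.

```python
import re

# One to 45 ASCII digits; fullmatch() requires the entire string to match.
DIGITS_1_TO_45 = re.compile(r"[0-9]{1,45}")

def is_valid(text: str) -> bool:
    """Return True if text consists only of 1 to 45 digits 0-9."""
    return DIGITS_1_TO_45.fullmatch(text) is not None

print(is_valid("00303039"))   # digits only
print(is_valid("039330a29"))  # contains a letter
```

Without anchoring, `[0-9]{1,45}` would also succeed on `039330a29`, because a search only needs to find some run of digits inside the string.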

Return positions of a regex match() in Javascript?

独自空忆成欢 submitted on 2019-11-26 12:53:06
Is there a way to retrieve the (starting) character positions inside a string of the results of a regex match() in Javascript?

Gumbo: exec returns an object with an index property:

    var match = /bar/.exec("foobar");
    if (match) {
        console.log("match found at " + match.index);
    }

And for multiple matches:

    var re = /bar/g,
        str = "foobarfoobar";
    while ((match = re.exec(str)) != null) {
        console.log("match found at " + match.index);
    }

stagas: Here's what I came up with:

    // Finds starting and ending positions of quoted text
    // in double or single quotes with escape char support like \" \'
    var str = "this

Check if string matches pattern

﹥>﹥吖頭↗ submitted on 2019-11-26 11:48:32
Question: How do I check if a string matches this pattern? Uppercase letter, number(s), uppercase letter, number(s)... For example, these would match: A1B2 B10L1 C1N200J1 These wouldn't ('^' points to problem): a1B2 ^ A10B ^ AB400 ^

Answer 1:

    import re
    pattern = re.compile("^([A-Z][0-9]+)+$")
    pattern.match(string)

Edit: As noted in the comments, match checks only for matches at the beginning of the string, while re.search() will match a pattern anywhere in the string. (See also: https://docs.python.org/library/re
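
A short sketch of the check described in the answer, run against the six examples from the question. The helper name `matches_pattern` is made up for illustration, and `fullmatch()` is used here in place of the explicit `^...$` anchors; the last two lines show what `match()` and `search()` do when the pattern is left unanchored.

```python
import re

# Uppercase letter followed by one or more digits, repeated one or more times.
PATTERN = re.compile(r"([A-Z][0-9]+)+")

def matches_pattern(text: str) -> bool:
    """Return True only if the entire string fits the pattern."""
    return PATTERN.fullmatch(text) is not None

for candidate in ["A1B2", "B10L1", "C1N200J1", "a1B2", "A10B", "AB400"]:
    print(candidate, matches_pattern(candidate))

# Unanchored, match() accepts a matching prefix and search() a match anywhere.
prefix_hit = PATTERN.match("A10B").group(0)
inner_hit = PATTERN.search("a1B2").group(0)
```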

Filter multiple values on a string column in dplyr

混江龙づ霸主 submitted on 2019-11-26 11:40:38
I have a data.frame with character data in one of the columns. I would like to filter multiple options in the data.frame from the same column. Is there an easy way to do this that I'm missing? Example: data.frame name = dat

    days  name
    88    Lynn
    11    Tom
    2     Chris
    5     Lisa
    22    Kyla
    1     Tom
    222   Lynn
    2     Lynn

I'd like to filter out Tom and Lynn, for example. When I do:

    target <- c("Tom", "Lynn")
    filt <- filter(dat, name == target)

I get this error:

    longer object length is not a multiple of shorter object length

You need %in% instead of ==:

    library(dplyr)
    target <- c("Tom", "Lynn")
    filter(dat, name %in% target)

Python: optimal search for substring in list of strings

霸气de小男生 submitted on 2019-11-26 11:35:31
Question: I have a particular problem where I want to search for many substrings in a list of many strings. The following is the gist of what I am trying to do:

    listStrings = [ACDE, CDDE, BPLL, ... ]
    listSubstrings = [ACD, BPI, KLJ, ...]

The above entries are just examples. len(listStrings) is ~60,000, len(listSubstrings) is ~50,000-300,000, and len(listStrings[i]) is anywhere from 10 to 30,000. My current Python attempt is:

    for i in listSubstrings:
        for j in listStrings:
            if i in j:
                w.write(i+j)

Or
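
One possible alternative to the nested loop, offered only as a sketch and not as the thread's answer: group the substrings by length, then slide a window of each length over every string and test it with a set lookup. This is worthwhile only under the assumption that the substrings come in few distinct lengths; the function name `find_substring_hits` and the sample data (including the extra substring "DE") are invented for the example.

```python
from collections import defaultdict

def find_substring_hits(strings, substrings):
    """Return sorted (substring, string) pairs where substring occurs in string."""
    # Group substrings by length so each window needs a single set lookup.
    by_length = defaultdict(set)
    for sub in substrings:
        if sub:  # skip empty substrings
            by_length[len(sub)].add(sub)

    hits = set()
    for text in strings:
        for length, candidates in by_length.items():
            # Slide a window of this length across the string.
            for start in range(len(text) - length + 1):
                window = text[start:start + length]
                if window in candidates:
                    hits.add((window, text))
    return sorted(hits)

list_strings = ["ACDE", "CDDE", "BPLL"]
list_substrings = ["ACD", "BPI", "KLJ", "DE"]
print(find_substring_hits(list_strings, list_substrings))
```

The work per string grows with its length times the number of distinct substring lengths, instead of with the total number of substrings.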