string-matching | 易学教程

How to select R data.table rows based on substring match (a la SQL like)

Question: I have a data.table with a character column, and want to select only those rows that contain a given substring. This is the equivalent of SQL WHERE x LIKE '%substring%'. E.g.

    > Months = data.table(Name = month.name, Number = 1:12)
    > Months["mb" %in% Name]
    Empty data.table (0 rows) of 2 cols: Name,Number

How would I select only the rows where Name contains "mb"?

Answer 1: data.table has a like function.

    Months[like(Name,"mb")]
            Name Number
    1: September      9
    2:  November     11
    3:  December     12

Or, %like% looks nicer: >

javascript regular expression to check for IP addresses

I have several IP addresses like:

    115.42.150.37
    115.42.150.38
    115.42.150.50

What type of regular expression should I write if I want to search for all 3 IP addresses? E.g., if I search for 115.42.150.*, I should be able to find all 3 IP addresses. What I can do now is something like:

    /[0-9]{1-3}\.[0-9]{1-3}\.[0-9]{1-3}\.[0-9]{1-3}/

but it doesn't seem to work well. Thanks.

Spudley: The regex you've got already has several problems. Firstly, it contains dots. In regex, a dot means "match any character", where you need to match just an actual dot. For this, you need to escape it, so put a back
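The escaping point can be illustrated with a minimal sketch. It is written in Python rather than JavaScript, and the names PREFIX_PATTERN and matches_prefix are hypothetical, introduced only for this example. It assumes the goal is to accept any address that starts with 115.42.150. and ends with a final group of 1 to 3 digits; it does not check that the last octet is in the 0-255 range.

```python
import re

# Hypothetical pattern for illustration: fixed 115.42.150. prefix, then 1-3 digits.
# Each dot is escaped so it matches a literal "." and not "any character".
# Note the quantifier is written {1,3} (comma), not {1-3}.
PREFIX_PATTERN = re.compile(r"115\.42\.150\.[0-9]{1,3}")


def matches_prefix(address: str) -> bool:
    """Return True if the whole string is 115.42.150.<1-3 digits>."""
    return PREFIX_PATTERN.fullmatch(address) is not None


for candidate in ["115.42.150.37", "115.42.150.38", "115.42.150.50", "115x42.150.37"]:
    print(candidate, matches_prefix(candidate))
```

The pattern text itself uses only syntax that JavaScript regular expressions also accept, so the same expression could be tried there with anchors added.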

How can I match fuzzy match strings from two datasets?

I've been working on a way to join two datasets based on an imperfect string, such as the name of a company. In the past I had to match two very dirty lists: one list had names and financial information, the other had names and addresses. Neither had unique IDs to match on! ASSUME THAT CLEANING HAS ALREADY BEEN APPLIED AND THERE MAY BE TYPOS AND INSERTIONS. So far AGREP is the closest tool I've found that might work. I can use Levenshtein distances in the AGREP package, which measure the number of deletions, insertions and substitutions between two strings. AGREP will return the string with the
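To make the distance concrete, here is a minimal sketch of Levenshtein distance in Python. It is not AGREP's implementation, and the function name levenshtein and the sample company names are assumptions made for this illustration only. It uses the standard dynamic-programming recurrence, keeping just one previous row at a time.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            substitution_cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,                      # delete ch_a
                current[j - 1] + 1,                   # insert ch_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


# Hypothetical company names that differ by one inserted character.
print(levenshtein("Acme Corp", "Acme Corp."))
```

A join built on this would compare each name in one list against the names in the other and keep the pair with the smallest distance, subject to some maximum threshold.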

python - regex search and findall

Question: I need to find all matches in a string for a given regex. I've been using findall() to do that until I came across a case where it wasn't doing what I expected. For example:

    regex = re.compile(r'(\d+,?)+')
    s = 'There are 9,000,000 bicycles in Beijing.'
    print(re.search(regex, s).group(0))
    > 9,000,000
    print(re.findall(regex, s))
    > ['000']

In this case search() returns what I need (the longest match) but findall() behaves differently, although the docs imply it should be the same: findall() matches
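One documented reason for this difference, shown here as a sketch rather than as the truncated answer's own text: when a pattern contains a capturing group, re.findall() returns the text of the group instead of the whole match, and a repeated group only keeps its last repetition. The variable names below are chosen for this example.

```python
import re

s = 'There are 9,000,000 bicycles in Beijing.'

# With a capturing group, findall() returns the group's text, and a repeated
# group only keeps its last repetition.
capturing = re.compile(r'(\d+,?)+')
print(capturing.findall(s))       # ['000']

# With a non-capturing group (?:...), findall() returns the whole match.
non_capturing = re.compile(r'(?:\d+,?)+')
print(non_capturing.findall(s))   # ['9,000,000']

# finditer() yields match objects, so group(0) is available even with the
# capturing version of the pattern.
whole_matches = [m.group(0) for m in capturing.finditer(s)]
print(whole_matches)              # ['9,000,000']
```

Switching to a non-capturing group is the smaller change; finditer() is useful when the groups are still needed elsewhere.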

Joining two datasets using fuzzy logic

Question: I'm trying to do a fuzzy logic join in R between two datasets. The first data set has the name of a location and a column called config. The second data set has the name of a location and two additional attributes that need to be summarized before they are joined to the first data set. I would like to use the name column to join the two data sets. However, the name column may have additional or leading characters in either data set, or have one word contained inside a larger word. So for

Regex allow a string to only contain numbers 0 - 9 and limit length to 45

Question: I am trying to create a regex so that a string contains only the characters 0-9, is at least 1 character long, and is no more than 45 characters long. For example, 00303039 would be a match, and 039330a29 would not. So far this is what I have, but I am not sure that it is correct:

    [0-9]{1,45}

I have also tried

    ^[0-9]{45}*$

but that does not seem to work either. I am not very familiar with regex, so any help would be great. Thanks!

Answer 1: You are almost there, all you need is start anchor ( ^ )
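As a minimal sketch of the anchoring idea, assuming the intended rule is "1 to 45 ASCII digits and nothing else", the check can be written in Python with re.fullmatch(), which requires the pattern to cover the whole string. The names DIGITS_1_TO_45 and is_valid are hypothetical and exist only for this example.

```python
import re

# Assumed rule for illustration: the whole string is 1 to 45 ASCII digits.
DIGITS_1_TO_45 = re.compile(r'[0-9]{1,45}')


def is_valid(value: str) -> bool:
    # fullmatch() only succeeds if the pattern spans the entire string,
    # so a valid run of digits inside a longer string is rejected.
    return DIGITS_1_TO_45.fullmatch(value) is not None


print(is_valid('00303039'))   # True
print(is_valid('039330a29'))  # False
```

Without anchoring, [0-9]{1,45} finds a run of digits anywhere in the input, which is why the unanchored pattern accepts strings such as 039330a29 when used with a search.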

Return positions of a regex match() in Javascript?

Is there a way to retrieve the (starting) character positions inside a string of the results of a regex match() in JavaScript?

Gumbo: exec returns an object with an index property:

    var match = /bar/.exec("foobar");
    if (match) {
        console.log("match found at " + match.index);
    }

And for multiple matches:

    var re = /bar/g, str = "foobarfoobar", match;
    while ((match = re.exec(str)) != null) {
        console.log("match found at " + match.index);
    }

stagas: Here's what I came up with:

    // Finds starting and ending positions of quoted text
    // in double or single quotes with escape char support like \" \'
    var str = "this

Check if string matches pattern

Question: How do I check if a string matches this pattern? Uppercase letter, number(s), uppercase letter, number(s)... For example, these would match: A1B2 B10L1 C1N200J1. These wouldn't ('^' points to problem): a1B2 ^ A10B ^ AB400 ^

Answer 1:

    import re
    pattern = re.compile("^([A-Z][0-9]+)+$")
    pattern.match(string)

Edit: As noted in the comments, match checks only for matches at the beginning of the string, while re.search() will match a pattern anywhere in the string. (See also: https://docs.python.org/library/re
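The question's examples can be checked with a short sketch. It uses re.fullmatch() in place of the explicit ^ and $ anchors; that substitution and the names PATTERN and matches_pattern are choices made for this example, not part of the quoted answer.

```python
import re

# One or more repetitions of: an uppercase letter followed by one or more digits.
PATTERN = re.compile(r'([A-Z][0-9]+)+')


def matches_pattern(text: str) -> bool:
    # fullmatch() requires the entire string to fit the pattern.
    return PATTERN.fullmatch(text) is not None


for sample in ['A1B2', 'B10L1', 'C1N200J1', 'a1B2', 'A10B', 'AB400']:
    print(sample, matches_pattern(sample))
```

One difference worth knowing: in Python, a pattern ending in $ used with match() also accepts a string with a single trailing newline, whereas fullmatch() does not.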

Filter multiple values on a string column in dplyr

I have a data.frame with character data in one of the columns. I would like to filter multiple options in the data.frame from the same column. Is there an easy way to do this that I'm missing? Example: data.frame name = dat

    days  name
      88  Lynn
      11  Tom
       2  Chris
       5  Lisa
      22  Kyla
       1  Tom
     222  Lynn
       2  Lynn

I'd like to filter out Tom and Lynn, for example. When I do:

    target <- c("Tom", "Lynn")
    filt <- filter(dat, name == target)

I get this error:

    longer object length is not a multiple of shorter object length

Answer: You need %in% instead of ==:

    library(dplyr)
    target <- c("Tom", "Lynn")
    filter(dat, name %in% target)

Python: optimal search for substring in list of strings

Question: I have a particular problem where I want to search for many substrings in a list of many strings. The following is the gist of what I am trying to do:

    listStrings = [ACDE, CDDE, BPLL, ... ]
    listSubstrings = [ACD, BPI, KLJ, ...]

The above entries are just examples. len(listStrings) is ~60,000, len(listSubstrings) is ~50,000-300,000, and len(listStrings[i]) is anywhere from 10 to 30,000. My current Python attempt is:

    for i in listSubstrings:
        for j in listStrings:
            if i in j:
                w.write(i+j)

Or