difflib

Pandas replace strings with fuzzy match in the same column

吃可爱长大的小学妹 submitted on 2021-01-29 13:27:28
Question: I have a column in a dataframe that looks like this:

    OWNER
    --------------
    OTTO J MAYER
    OTTO MAYER
    DANIEL J ROSEN
    DANIEL ROSSY
    LISA CULLI
    LISA CULLY
    LISA CULLY
    CITY OF BELMONT
    CITY OF BELMONT CITY

Some of the names in my data frame are misspelled or have extra/missing characters. I need a column where each name is replaced by any close match found in the same column, so that all similar names are grouped under one common name. For example, this is what I expect from the data frame above:
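One way to approach this, sketched below, is to walk the column and reuse the first close match already seen via difflib.get_close_matches. The cutoff of 0.8 and the OWNER_GROUPED column name are assumptions, not from the question, and the cutoff would need tuning for pairs like DANIEL J ROSEN / DANIEL ROSSY:

    import difflib
    import pandas as pd

    df = pd.DataFrame({'OWNER': [
        'OTTO J MAYER', 'OTTO MAYER', 'DANIEL J ROSEN', 'DANIEL ROSSY',
        'LISA CULLI', 'LISA CULLY', 'LISA CULLY',
        'CITY OF BELMONT', 'CITY OF BELMONT CITY']})

    canonical = []  # first-seen representative of each name group

    def canonicalize(name, cutoff=0.8):
        # Reuse an earlier name if it is a close enough match; otherwise keep this one.
        match = difflib.get_close_matches(name, canonical, n=1, cutoff=cutoff)
        if match:
            return match[0]
        canonical.append(name)
        return name

    df['OWNER_GROUPED'] = df['OWNER'].apply(canonicalize)
    print(df)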

match changes by words, not by characters

亡梦爱人 submitted on 2020-08-25 08:14:32
Question: I'm using difflib's SequenceMatcher and get_opcodes(), then highlighting the changes with CSS to create a kind of web diff. First, I set a min_delta so that I only consider two strings different if 3 or more consecutive characters differ (delta means the real, encountered delta, which sums up all one-character changes):

    matcher = SequenceMatcher(None, source_str, diff_str)
    min_delta = 3
    delta = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
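A minimal sketch of how that opcode loop could accumulate the delta; the sample strings and the way the change size is counted are my assumptions, not the asker's code:

    from difflib import SequenceMatcher

    source_str = "The quick brown fox"
    diff_str = "The quick browny fix"

    matcher = SequenceMatcher(None, source_str, diff_str)
    min_delta = 3
    delta = 0

    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical runs contribute nothing to the delta
        # Count the larger side of the changed region as the size of this change.
        delta += max(i2 - i1, j2 - j1)

    if delta >= min_delta:
        print("treat as different, delta =", delta)
    else:
        print("treat as equal enough, delta =", delta)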

Better fuzzy matching performance?

风流意气都作罢 submitted on 2020-07-05 04:39:06
Question: I'm currently using the get_close_matches method from difflib to iterate through a list of 15,000 strings and get the closest match against another list of approximately 15,000 strings:

    a = ['blah', 'pie', 'apple'...]
    b = ['jimbo', 'zomg', 'pie'...]

    for value in a:
        difflib.get_close_matches(value, b, n=1, cutoff=.85)

It takes .58 seconds per value, which means it will take 8,714 seconds, or 145 minutes, to finish the loop. Is there another library/method that might be faster, or a way to improve the speed for
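One commonly suggested way to speed this up is the third-party rapidfuzz package. It is not mentioned in the question, so treat this as a sketch of an alternative rather than a drop-in fix; its scores run 0 to 100 and its ratio is not byte-for-byte identical to difflib's:

    # pip install rapidfuzz
    from rapidfuzz import process, fuzz

    a = ['blah', 'pie', 'apple']
    b = ['jimbo', 'zomg', 'pie']

    for value in a:
        # score_cutoff=85 roughly mirrors difflib's cutoff=.85 on a 0-100 scale
        match = process.extractOne(value, b, scorer=fuzz.ratio, score_cutoff=85)
        print(value, '->', match)  # (choice, score, index) or None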

How to highlight more than two characters per line in difflib's HTML output

左心房为你撑大大i submitted on 2020-04-08 17:48:22
Question: I am using difflib.HtmlDiff to compare two files. I want the differences to be highlighted in the output HTML. This already works when at most two characters differ in one line:

    a = "2.000"
    b = "2.120"

But when more characters differ in one line, the whole line in the output is marked red (on the left side) or green (on the right side of the table):

    a = "2.000"
    b = "2.123"

Is this behaviour configurable? That is, can I set the number of different characters at which
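For reference, a minimal sketch that reproduces the behaviour described; the output file name and the descriptions are assumptions. HtmlDiff's constructor only exposes tabsize, wrapcolumn, linejunk and charjunk, so the point at which intraline highlighting gives way to whole-line colouring does not appear to be configurable through the public API:

    import difflib

    a = ["2.000"]
    b = ["2.123"]

    # make_file() takes lists of lines and returns a complete HTML page
    html = difflib.HtmlDiff().make_file(a, b, fromdesc='left', todesc='right')
    with open('diff.html', 'w') as f:
        f.write(html)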

Python: Passing SequenceMatcher in difflib an “autojunk=False” flag yields error

早过忘川 submitted on 2020-01-23 10:51:10
Question: I am trying to use the SequenceMatcher class in Python's difflib package to identify string similarity. I have experienced strange behavior with it, though, and I believe my problem may be related to the package's "junk" filter, a problem described in detail here. Suffice it to say that I thought I could fix my problem by passing an autojunk flag to my SequenceMatcher in the way described by the difflib documentation:

    import difflib

    def matches(s1, s2):
        s = difflib.SequenceMatcher
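For what it's worth, a minimal sketch of passing the flag per the documented constructor signature (isjunk first, then the two strings). Whether this resolves the error depends on the interpreter, since the autojunk parameter was only added in Python 2.7.1 and the excerpt does not show which version is in use:

    import difflib

    def matches(s1, s2):
        # isjunk comes first; interpreters older than 2.7.1 reject the
        # autojunk keyword with a TypeError.
        s = difflib.SequenceMatcher(None, s1, s2, autojunk=False)
        return s.ratio()

    print(matches("some long string " * 20, "some long string " * 19))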

Can difflib be used to make a plagiarism detection program?

喜欢而已 submitted on 2020-01-14 05:25:08
Question: I am trying to figure this out... Can the difflib library in Python be used to make some kind of plagiarism detection program? If so, how? Could anyone help me figure out this question?

Answer 1: It could be used, but you're going to face all the same general issues you find in automated plagiarism detection. It might give you a little bit of a head start on implementing some of the algorithms you need, but I don't think it is likely to take you very far.

Answer 2: The short answer is yes.
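To illustrate what "the short answer is yes" might look like in practice, here is a minimal sketch; the word-level tokenisation and the sample texts are my own choices, and a real detector would need far more than a single ratio:

    import difflib

    def similarity(doc_a, doc_b):
        # Compare word sequences rather than raw characters to reduce noise.
        return difflib.SequenceMatcher(None, doc_a.split(), doc_b.split()).ratio()

    print(similarity("the quick brown fox jumps over the lazy dog",
                     "a quick brown fox jumped over a lazy dog"))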

How to distinguish between added sentences and altered sentences with difflib and nltk?

有些话、适合烂在心里 submitted on 2019-12-24 15:32:46
Question: After downloading this page and making a very minor edit to it (changing the first 65 in this paragraph to 68), I then run it through the following code to pull out the diffs.

    import bs4
    from bs4 import BeautifulSoup
    import urllib2
    import lxml.html as lh

    url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
    response = urllib2.urlopen(url)
    content = response.read()  # get response as list of lines
    root = lh.fromstring(content)
    section1 = root.xpath("//div[@class = 'column-12']")
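The question is cut off before the diffing step, but one hedged sketch of the "added vs altered" distinction it asks about is to sentence-tokenise both versions with nltk and pair each new sentence with its closest old sentence; the cutoff and the sample text here are assumptions:

    import difflib
    import nltk  # assumes the 'punkt' tokenizer data has been downloaded

    old_text = "The limit is 65 dollars. Other rules apply."
    new_text = "The limit is 68 dollars. Other rules apply. A new rule was added."

    old_sents = nltk.sent_tokenize(old_text)
    new_sents = nltk.sent_tokenize(new_text)

    for sent in new_sents:
        if sent in old_sents:
            continue  # unchanged sentence
        close = difflib.get_close_matches(sent, old_sents, n=1, cutoff=0.6)
        if close:
            print("ALTERED:", close[0], "->", sent)
        else:
            print("ADDED:", sent)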

Can difflib's charjunk be used to ignore whitespace?

北城余情 submitted on 2019-12-24 14:06:12
Question: I'd like to compare differences between two lists of strings. For my purposes, whitespace is noise and these differences do not need to be shown. Reading difflib's documentation: "the default [for charjunk] is module-level function IS_CHARACTER_JUNK(), which filters out whitespace characters". Perfect, except I don't see it working, or making much difference (<- pun!).

    import difflib

    A = ['3 4\n']
    B = ['3 4\n']

    print ''.join(difflib.ndiff(A, B))  # default: charjunk=difflib.IS_CHARACTER_JUNK
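The two lists in the excerpt come out looking identical because the scraper collapsed the whitespace, so the sketch below invents a hypothetical extra space in B. It also side-steps charjunk entirely and just normalises whitespace before diffing, which is one pragmatic workaround rather than the charjunk mechanism the question asks about:

    import difflib
    import re

    A = ['3 4\n']
    B = ['3  4\n']  # hypothetical: an extra space, the kind of difference to ignore

    def squash_ws(lines):
        # Collapse runs of spaces/tabs so whitespace-only changes disappear.
        return [re.sub(r'[ \t]+', ' ', line) for line in lines]

    print(''.join(difflib.ndiff(squash_ws(A), squash_ws(B))))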

Python's difflib SequenceMatcher speed up

久未见 submitted on 2019-12-23 19:26:19
Question: I'm using difflib's SequenceMatcher (the ratio() method) to measure similarity between text files. difflib is relatively fast at comparing a small set of text files: e.g. 10 files of 70 kb on average, compared against each other (46 comparisons), take about 80 seconds. The issue is that I have a collection of 3000 txt files (75 kb on average), and a rough estimate of how much time SequenceMatcher needs to complete the comparison job is 80 days! I tried the "real_quick_ratio()" and "quick_ratio()" methods
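One documented way to shave time off this kind of all-pairs loop is to reuse a single SequenceMatcher (set_seq2 caches information about the second sequence) and to use quick_ratio(), an upper bound on ratio(), as a cheap pre-filter. The 0.6 threshold and the toy texts below are assumptions:

    import difflib

    texts = {'a.txt': 'some text ...', 'b.txt': 'some other text ...', 'c.txt': 'more text ...'}
    names = list(texts)

    matcher = difflib.SequenceMatcher(autojunk=False)
    for i, name2 in enumerate(names):
        matcher.set_seq2(texts[name2])  # seq2 is cached, so set it in the outer loop
        for name1 in names[i + 1:]:
            matcher.set_seq1(texts[name1])  # only seq1 changes in the inner loop
            if matcher.quick_ratio() < 0.6:
                continue  # skip the expensive ratio() call when the upper bound is low
            print(name1, name2, matcher.ratio())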

Is there an alternative to `difflib.get_close_matches()` that returns indexes (list positions) instead of a str list?

爱⌒轻易说出口 submitted on 2019-12-21 19:48:43
Question: I want to use something like difflib.get_close_matches, but instead of the most similar strings I would like to obtain their indexes (i.e. positions in the list). Indexes are more flexible because they can be related to other data structures (related to the matched string). For example, instead of:

    >>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
    >>> difflib.get_close_matches('Hello', words)
    ['hello', 'hallo', 'Hallo']

I would
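The question is cut off, but a small helper along these lines returns positions instead of strings. It is a sketch that mimics get_close_matches with a plain ratio() scan; the function name is made up, and it skips the quick_ratio pre-filters the real implementation uses:

    import difflib

    def get_close_matches_indexes(word, possibilities, n=3, cutoff=0.6):
        """Like difflib.get_close_matches, but return list positions instead of strings."""
        sm = difflib.SequenceMatcher()
        sm.set_seq2(word)
        scored = []
        for i, candidate in enumerate(possibilities):
            sm.set_seq1(candidate)
            ratio = sm.ratio()
            if ratio >= cutoff:
                scored.append((ratio, i))
        scored.sort(reverse=True)  # best matches first
        return [i for _, i in scored[:n]]

    words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
    print(get_close_matches_indexes('Hello', words))  # indexes of 'hello', 'Hallo' and 'hallo'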