duplicates

Duplicate elimination of similar company names

自古美人都是妖i submitted on 2020-01-15 03:28:14
Question: I have a table with company names. There are many duplicates because of human input errors: differing views on whether a subdivision should be included, typos, and so on. I want all of these duplicates to be marked as one company, "1c":

+------------------+
| company          |
+------------------+
| 1c               |
| 1c company       |
| 1c game studios  |
| 1c wireless      |
| 1c-avalon        |
| 1c-softclub      |
| 1c: maddox games |
| 1c:inoco         |
| 1cc games        |
+------------------+

I identified Levenshtein distance as a good way…
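The excerpt breaks off at Levenshtein distance. A minimal pure-Python sketch of that edit-distance metric, applied to a few of the names above (the cutoff of 2 is an illustrative assumption, not from the question):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            # cost of deletion, insertion, or substitution (0 if the chars match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Mark names whose distance to a canonical name falls under a cutoff.
names = ["1c", "1c company", "1c-avalon", "1cc games"]
canonical = "1c"
close = [n for n in names if levenshtein(n, canonical) <= 2]
```

In practice a fixed cutoff is too crude for names of very different lengths; a distance normalised by the longer string's length is a common refinement.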

How do I remove element from a list of tuple if the 2nd item in each tuple is a duplicate?

萝らか妹 submitted on 2020-01-14 19:18:07
Question: How do I remove an element from a list of tuples if the 2nd item in the tuple is a duplicate? For example, I have a list sorted by the 1st element that looks like this:

alist = [(0.7897897, 'this is a foo bar sentence'),
         (0.653234, 'this is a foo bar sentence'),
         (0.353234, 'this is a foo bar sentence'),
         (0.325345, 'this is not really a foo bar'),
         (0.323234, 'this is a foo bar sentence'),]

The desired output, keeping the tuple with the highest 1st item, should be:

alist = [(0.7897897, 'this is a foo bar…
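The excerpt is cut off, but the requested dedup-by-second-item logic can be sketched with a dict keyed on the sentence (a sketch, not the thread's accepted answer):

```python
alist = [(0.7897897, 'this is a foo bar sentence'),
         (0.653234, 'this is a foo bar sentence'),
         (0.353234, 'this is a foo bar sentence'),
         (0.325345, 'this is not really a foo bar'),
         (0.323234, 'this is a foo bar sentence')]

best = {}
for score, sentence in alist:
    # keep only the highest score seen for each sentence
    if sentence not in best or score > best[sentence]:
        best[sentence] = score

# rebuild (score, sentence) tuples, highest score first
result = sorted(((s, t) for t, s in best.items()), reverse=True)
```

Because the input is already sorted descending, an equivalent trick is to insert into the dict in reverse order and let later (higher-scoring) entries overwrite earlier ones.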

Excel 2013 VBA Range.RemoveDuplicates issue specifying array

拟墨画扇 submitted on 2020-01-14 14:13:18
Question: The sheets I am scanning for duplicates have different numbers of columns, so I'm trying to build the array of columns for Range.RemoveDuplicates from a string. Say there are 5 columns in this sheet:

Dim Rng As Range
Dim i As Integer
Dim lColumn As Integer
Dim strColumnArray As String

With ActiveSheet
    lColumn = Cells(1, Columns.Count).End(xlToLeft).Column
    strColumnArray = "1"
    For i = 2 To lColumn
        strColumnArray = strColumnArray & ", " & i
    Next i
    'String ends up as "1, 2,…

Removing duplicate dates based on another column in R

北城以北 submitted on 2020-01-14 10:20:13
Question: I have a time series with multiple entries for some hours.

    date                  wd   ws  temp  sol  octa  pg   mh  daterep
1   2007-01-01 00:00:00  100  1.5   9.0    0     8   D  100  FALSE
2   2007-01-01 01:00:00   90  2.6   9.0    0     7   E   50  TRUE
3   2007-01-01 01:00:00   90  2.6   9.0    0     8   D  100  TRUE
4   2007-01-01 02:00:00   40  1.0   8.8    0     7   F   50  FALSE
5   2007-01-01 03:00:00   20  2.1   8.0    0     8   D  100  FALSE
6   2007-01-01 04:00:00   30  1.0   8.0    0     8   D  100  FALSE

I need to get to a time series with one entry per hour, taking the entry with the minimum mh value where…
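The question (and its R context) is truncated, but the stated goal — one row per hour, keeping the minimum mh — amounts to a single group-and-min pass. Sketched here in plain Python over a cut-down version of the data, as an illustration of the logic rather than an R answer:

```python
# (timestamp, pg, mh) triples; two entries share the 01:00 hour
rows = [
    ("2007-01-01 00:00:00", "D", 100),
    ("2007-01-01 01:00:00", "E", 50),
    ("2007-01-01 01:00:00", "D", 100),
    ("2007-01-01 02:00:00", "F", 50),
]

best = {}
for ts, pg, mh in rows:
    # keep the row with the smallest mh for each timestamp
    if ts not in best or mh < best[ts][2]:
        best[ts] = (ts, pg, mh)

hourly = [best[ts] for ts in sorted(best)]
```

In R the same effect is typically achieved by ordering on mh and dropping later duplicates of the timestamp, or with a grouped summarise.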

Counting consecutive duplicates of strings from a list

耗尽温柔 submitted on 2020-01-14 06:50:09
Question: I have a Python list of strings such that:

Input:

li = ['aaa','bbb','aaa','abb','abb','bbb','bbb','bbb','aaa','aaa']

What can I do to generate another list counting the number of consecutive repetitions of any string in the list? For the list above, the returned list would resemble:

Expected output:

li_count = [['aaa',1],['bbb',1],['aaa',1],['abb',2],['bbb',3],['aaa',2]]

Answer 1: Use itertools.groupby:

from itertools import groupby

li = ['aaa','bbb','aaa','abb','abb','bbb','bbb','bbb','aaa','aaa']
a = [[i, sum(1…
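The answer's code is cut off mid-expression; a complete, runnable version of the itertools.groupby approach it starts is:

```python
from itertools import groupby

li = ['aaa','bbb','aaa','abb','abb','bbb','bbb','bbb','aaa','aaa']
# groupby yields one (key, run-iterator) pair per run of consecutive equal elements
li_count = [[k, sum(1 for _ in g)] for k, g in groupby(li)]
```

Note the second run of 'aaa' (position 3) produces its own ['aaa', 1] entry: groupby only merges consecutive equal elements, which is exactly what the question asks for.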

Javascript - Quickly remove duplicates in object array

跟風遠走 submitted on 2020-01-14 02:19:08
Question: I have 2 arrays with objects in them, such as:

[{"Start": 1, "End": 2}, {"Start": 4, "End": 9}, {"Start": 12, "End": 16}, ... ]

I want to merge the 2 arrays while removing duplicates. Currently I am doing:

array1.concat(array2);

and then a nested $.each loop, but as my arrays get larger and larger this takes O(n^2) time to execute and is not scalable. I presume there is a quicker way to do this; however, all of the examples I have found are working with strings or…
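The usual fix for the O(n^2) nested loop is a single pass over the concatenated arrays with a hash set of record keys. The idea, sketched in Python (in JavaScript a Set keyed on a string like `Start + ',' + End` works the same way):

```python
def merge_unique(a1, a2):
    """Merge two lists of {'Start','End'} records, dropping duplicates in O(n)."""
    seen = set()
    merged = []
    for rec in a1 + a2:
        key = (rec["Start"], rec["End"])  # hashable identity for the record
        if key not in seen:
            seen.add(key)
            merged.append(rec)
    return merged
```

Set membership tests are O(1) on average, so the whole merge is linear in the combined length.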

Python Fuzzy matching strings in list performance

二次信任 submitted on 2020-01-13 20:28:47
Question: I'm checking whether there are similar results (fuzzy matches) across 4 columns of the same dataframe, with code like the following example. When I apply it to the real 40,000-row x 4-column dataset, it seems to run forever. The issue is that the code is too slow: if I limit the dataset to 10 users it takes 8 minutes to compute, and for 20 users, 19 minutes. Is there anything I am missing? I do not know why this takes so long; I expect to have all results in 2 hours or less. Any…
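The question's actual code is not shown in the excerpt, but a common cause of this kind of blow-up is re-comparing repeated strings. Deduplicating the candidates first and comparing each distinct pair once, here with the stdlib's difflib (an assumption — the question may be using a different fuzzy-matching library), cuts the quadratic cost:

```python
from difflib import SequenceMatcher

def is_fuzzy_match(a, b, threshold=0.8):
    # ratio() is 2*M / (len(a) + len(b)), where M counts matching characters
    return SequenceMatcher(None, a, b).ratio() >= threshold

values = ["apple", "appel", "apple", "banana"]
unique = sorted(set(values))  # compare each distinct pair exactly once
pairs = [(a, b)
         for i, a in enumerate(unique)
         for b in unique[i + 1:]
         if is_fuzzy_match(a, b)]
```

Further standard speed-ups: skip pairs whose lengths differ too much for the threshold to be reachable, and prefer `ratio()` over repeated object construction inside the inner loop.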

Why is Safari duplicating GET request but Chrome is not?

时间秒杀一切 submitted on 2020-01-13 18:33:27
Question:

Update TL;DR: This is potentially a bug in Safari and/or WebKit.

Longer TL;DR: In Safari, after the Fetch API is used to make a GET request, Safari will automatically (and unintentionally) re-run the request when the page is reloaded, even if the code that makes the request has been removed. Newly discovered minimal reproducible code (courtesy of Kaiido below):

Front end:

<script>fetch('/url')</script>

Original post: I have a JavaScript web application which uses the fetch API to make a GET…

Pandas - Conditional drop duplicates

删除回忆录丶 submitted on 2020-01-13 17:05:49
Question: I have a Pandas 0.19.2 dataframe on Python 3.6x, as below. I want to drop_duplicates() rows with the same Id based on conditional logic.

import pandas as pd
import numpy as np

np.random.seed(1)
df = pd.DataFrame({'Id': [1,2,3,4,3,2,6,7,1,8],
                   'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K'],
                   'Size': np.random.rand(10),
                   'Age': [19, 25, 22, 31, 43, 23, 44, 20, 51, 31]})

What would be the most efficient (if possible, vectorised) way to achieve this, based on the logic I describe below?

1) Before…
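The conditional logic itself is cut off in the excerpt. Purely as an illustration (the kept-row rule below — largest Size per Id — is an assumption, not the question's actual condition), the standard sort-then-drop pattern looks like:

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({'Id': [1, 2, 3, 4, 3, 2, 6, 7, 1, 8],
                   'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K'],
                   'Size': np.random.rand(10),
                   'Age': [19, 25, 22, 31, 43, 23, 44, 20, 51, 31]})

# Sort so the preferred row of each Id comes first, then drop later duplicates;
# sort_index() restores the original row order afterwards.
deduped = (df.sort_values('Size', ascending=False)
             .drop_duplicates('Id')
             .sort_index())
```

Any "keep the best row per group" rule that can be expressed as a sort key fits this pattern; rules that compare rows pairwise usually need groupby instead.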
