duplicate-removal

Removing duplicate rows (based on values from multiple columns) from SQL table

倖福魔咒の submitted on 2019-11-26 16:41:58
Question: I have the following SQL table:

AR_Customer_ShipTo
+--------------+------------+-------------------+------------+
| ARDivisionNo | CustomerNo | CustomerName      | ShipToCode |
+--------------+------------+-------------------+------------+
| 00           | 1234567    | Test Customer     | 1          |
| 00           | 1234567    | Test Customer     | 2          |
| 00           | 1234567    | Test Customer     | 3          |
| 00           | ARACODE    | ARACODE Customer  | 1          |
| 00           | ARACODE    | ARACODE Customer  | 2          |
| 01           | CBE1EX     | Normal Customer   | 1          |
| 02           | ZOCDOC     | Normal Customer-2 | 1
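A common way to tackle multi-column deduplication like this is to rank rows within each duplicate group and delete everything but the first. The question is cut off above, so this is only a sketch: the partition columns and the SQL Server-style CTE DELETE are assumptions.

-- Keep one row per (ARDivisionNo, CustomerNo, CustomerName) group, delete the rest.
-- Assumes a dialect (e.g. SQL Server) that allows DELETE through a CTE.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY ARDivisionNo, CustomerNo, CustomerName
               ORDER BY ShipToCode   -- arbitrary tie-breaker: keep the lowest ShipToCode
           ) AS rn
    FROM AR_Customer_ShipTo
)
DELETE FROM ranked WHERE rn > 1;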

Filtering out duplicated/non-unique rows in data.table

╄→尐↘猪︶ㄣ submitted on 2019-11-26 16:05:00
I have a data.table with about 2.5 million rows. There are two columns. I want to remove any rows that are duplicated in both columns. Previously, for a data.frame, I would have done this: df <- unique(df[, c('V1', 'V2')]), but this doesn't work with data.table. I have tried unique(df[, c(V1, V2), with = FALSE]) but it seems to still only operate on the key of the data.table and not the whole row. Any suggestions? Cheers, Davy. Example:

> dt
      V1 V2
 [1,] A  B
 [2,] A  C
 [3,] A  D
 [4,] A  B
 [5,] B  A
 [6,] C  D
 [7,] C  D
 [8,] E  F
 [9,] G  G
[10,] A  B

In the above data.table, where V2 is the table key, only rows 4
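In current versions of data.table, unique() takes a by argument, so whole-row deduplication no longer depends on the key. A minimal sketch; the sample data below is reconstructed from the excerpt:

library(data.table)

dt <- data.table(V1 = c("A","A","A","A","B","C","C","E","G","A"),
                 V2 = c("B","C","D","B","A","D","D","F","G","B"),
                 key = "V2")

# Deduplicate on both columns, not just the key
unique(dt, by = c("V1", "V2"))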

How to remove duplicate entries from a mysql db?

亡梦爱人 submitted on 2019-11-26 15:26:11
Question: I have a table with some ids + titles. I want to make the title column unique, but it has over 600k records already, some of which are duplicates (sometimes several dozen times over). How do I remove all duplicates except one, so I can add a UNIQUE key to the title column afterwards?

Answer 1: This command adds a unique key, and drops all rows that generate errors (due to the unique key). This removes duplicates.

ALTER IGNORE TABLE table ADD UNIQUE KEY idx1(title);

Edit: Note that this command may not
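Since ALTER IGNORE TABLE was removed in MySQL 5.7, a self-join delete is a common alternative. A hedged sketch: my_table is a placeholder for the real table name, and id/title follow the question's description.

-- Keep the row with the smallest id for each title, delete the later copies.
DELETE later
FROM my_table AS later
JOIN my_table AS original
  ON original.title = later.title
 AND original.id < later.id;

-- The unique key can then be added without IGNORE:
ALTER TABLE my_table ADD UNIQUE KEY idx1 (title);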

Delete duplicate rows (don't delete all duplicates)

非 Y 不嫁゛ submitted on 2019-11-26 13:45:43
I am using Postgres. I want to delete duplicate rows, with the condition that one copy from each set of duplicate rows is kept. That is, if there are 5 duplicate records, 4 of them will be deleted. Try the steps described in this article: Removing duplicates from a PostgreSQL database. It describes a situation where you have to deal with a huge amount of data that isn't possible to GROUP BY. A simple solution would be this:

DELETE FROM foo
WHERE id NOT IN (SELECT min(id) -- or max(id)
                 FROM foo
                 GROUP BY hash)

Where hash is something that gets duplicated. delete from table where not id
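A different formulation that often performs better on large tables than NOT IN is a self-join delete. A minimal sketch, reusing the foo/id/hash names from the answer above:

-- For each group of rows sharing the same hash, keep the row with the
-- smallest id and delete the others.
DELETE FROM foo a
USING foo b
WHERE a.hash = b.hash
  AND a.id > b.id;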

Remove duplicate rows leaving oldest row only?

淺唱寂寞╮ submitted on 2019-11-26 12:29:32
Question: I have a table of data and there are many duplicate entries from user submissions. I want to delete all duplicate rows based on the field subscriberEmail, leaving only the original submission. In other words, I want to search for all duplicate emails and delete those rows, leaving only the original. How can I do this without swapping tables? My table contains unique IDs for each row.

Answer 1: Since you're using the id column as an indicator of which record is 'original': delete x from myTable
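The answer is truncated above, so the following is not its exact continuation, only a hedged sketch of one way to keep the lowest id per subscriberEmail in MySQL; myTable and id are taken from the question and answer, keep_id is illustrative.

-- Delete every row whose id is not the minimum id of its subscriberEmail group.
-- The derived table is materialized, which sidesteps MySQL's restriction on
-- selecting from the table being deleted from.
DELETE t
FROM myTable AS t
JOIN (
    SELECT subscriberEmail, MIN(id) AS keep_id
    FROM myTable
    GROUP BY subscriberEmail
) AS k ON k.subscriberEmail = t.subscriberEmail
WHERE t.id <> k.keep_id;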

Techniques for finding near duplicate records

喜你入骨 submitted on 2019-11-26 10:17:00
Question: I'm attempting to clean up a database that, over the years, has acquired many duplicate records with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!". My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar. My first problem is that agrep only
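A minimal sketch of the normalisation pipeline described above, in base R; the two example names come from the question, everything else is illustrative:

companies <- c("Some Company Limited", "SOME COMPANY LTD!")

normalise <- function(x) {
  x <- tolower(x)                        # case-fold
  x <- gsub("\\blimited\\b", "ltd", x)   # common synonym replacement
  x <- gsub("[^a-z ]", "", x)            # strip non-alphabetic characters
  trimws(x)
}

norm <- normalise(companies)

# agrep() matches one pattern approximately against a vector of strings;
# after normalisation both names collapse to "some company ltd".
agrep(norm[1], norm, max.distance = 0.1)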

How to delete duplicate entries?

时间秒杀一切 submitted on 2019-11-26 09:51:05
I have to add a unique constraint to an existing table. This is fine, except that the table already has millions of rows, and many of them violate the unique constraint I need to add. What is the fastest approach to removing the offending rows? I have an SQL statement which finds the duplicates and deletes them, but it is taking forever to run. Is there another way to solve this problem? Maybe backing up the table, then restoring after the constraint is added?

Answer (just somebody): For example you could:

CREATE TABLE tmp ...
INSERT INTO tmp SELECT DISTINCT * FROM t;
DROP TABLE t;
ALTER TABLE tmp
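The copy-into-a-new-table idea above can be completed along these lines in PostgreSQL-flavoured SQL. A sketch only: col1 and col2 stand in for the columns the unique constraint will cover, and since the original answer is truncated, the final steps are assumptions.

-- Copy one row per (col1, col2) group into a fresh table, then swap the tables.
CREATE TABLE tmp (LIKE t INCLUDING DEFAULTS);

INSERT INTO tmp
SELECT DISTINCT ON (col1, col2) *
FROM t
ORDER BY col1, col2;

DROP TABLE t;
ALTER TABLE tmp RENAME TO t;

-- The unique constraint can now be added:
ALTER TABLE t ADD CONSTRAINT t_col1_col2_key UNIQUE (col1, col2);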

Get the distinct sum of a joined table column

做~自己de王妃 submitted on 2019-11-26 08:35:38
Question: I have a problem here, and I'm hoping there is an easy solution. I'll try to make this as simple as possible. A ticket belongs to an attendee. Example:

select * from tickets JOIN attendees ON attendees.id = tickets.attendee_id

An attendee has a decimal column called "revenue". That said, I need to run a query that will return a variety of information about the tickets, including the total revenue. The problem is that if 2 tickets belong to the same attendee, it counts their revenue twice.
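One common fix, sketched under the assumption that the goal is to sum each attendee's revenue exactly once (table and column names follow the question; distinct_attendees is an illustrative alias), is to collapse to distinct attendees before aggregating:

-- Each attendee contributes their revenue once, no matter how many tickets they hold.
SELECT SUM(revenue) AS total_revenue
FROM (
    SELECT DISTINCT attendees.id, attendees.revenue
    FROM tickets
    JOIN attendees ON attendees.id = tickets.attendee_id
) AS distinct_attendees;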

Eliminating duplicate values based on only one column of the table

北战南征 submitted on 2019-11-26 08:16:06
Question: My query:

SELECT sites.siteName, sites.siteIP, history.date
FROM sites
INNER JOIN history ON sites.siteName = history.siteName
ORDER BY siteName, date

First part of the output:

How can I remove the duplicates in the siteName column? I want to leave only the most recently updated one, based on the date column. In the example output above, I need rows 1, 3, 6, 10.

Answer 1: This is where the window function row_number() comes in handy:

SELECT s.siteName, s.siteIP, h.date
FROM sites s
INNER JOIN (select h.*, row_number()
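The answer is truncated above; below is a sketch in the same spirit, where the ORDER BY date DESC direction and the rn = 1 filter are assumptions about what "most recent" means here.

-- Rank each site's history rows by date, newest first, and keep only the newest.
SELECT s.siteName, s.siteIP, h.date
FROM sites s
INNER JOIN (
    SELECT h.*,
           ROW_NUMBER() OVER (PARTITION BY siteName ORDER BY date DESC) AS rn
    FROM history h
) h ON h.siteName = s.siteName
WHERE h.rn = 1
ORDER BY s.siteName;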

duplicates in multiple columns

耗尽温柔 submitted on 2019-11-26 07:31:34
Question: I have a data frame like so:

> df
  a  b c    d
1 1  2 A 1001
2 2  4 B 1002
3 3  6 B 1002
4 4  8 C 1003
5 5 10 D 1004
6 6 12 D 1004
7 7 13 E 1005
8 8 14 E 1006

I want to remove the rows where there are repeated values in column c AND column d. So in this example, rows 2, 3, 5 and 6 would be removed. I have used this, which works:

df[!(df$c %in% df$c[duplicated(df$c)] & df$d %in% df$d[duplicated(df$d)]), ]

> df
  a  b c    d
1 1  2 A 1001
4 4  8 C 1003
7 7 13 E 1005
8 8 14 E 1006

but it seems clunky and I can't help
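A shorter alternative, as a sketch under the assumption that a duplicate means the (c, d) pair occurs more than once, which matches the desired output shown above:

df <- data.frame(a = 1:8,
                 b = c(2, 4, 6, 8, 10, 12, 13, 14),
                 c = c("A", "B", "B", "C", "D", "D", "E", "E"),
                 d = c(1001, 1002, 1002, 1003, 1004, 1004, 1005, 1006))

key <- paste(df$c, df$d)  # treat the two columns as one combined key

# duplicated() flags later copies; fromLast = TRUE flags earlier ones,
# so the union marks every row whose key appears more than once.
df[!(duplicated(key) | duplicated(key, fromLast = TRUE)), ]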