Eliminate duplicate cities from database

三世轮回 提交于 2020-01-05 04:17:15

问题


Background

Over 5300 duplicate rows:

"id","latitude","longitude","country","region","city"
"2143220","41.3513889","68.9444444","KZ","10","Abay"
"2143218","40.8991667","68.5433333","KZ","10","Abay"
"1919381","33.8166667","49.6333333","IR","34","Ab Barik"
"1919377","35.6833333","50.1833333","IR","19","Ab Barik"
"1919432","29.55","55.5122222","IR","29","`Abbasabad"
"1919430","27.4263889","57.5725","IR","29","`Abbasabad"
"1919413","28.0011111","58.9005556","IR","12","`Abbasabad"
"1919435","36.5641667","61.14","IR","30","`Abbasabad"
"1919433","31.8988889","58.9211111","IR","30","`Abbasabad"
"1919422","33.8666667","48.3","IR","23","`Abbasabad"
"1919420","33.4658333","49.6219444","IR","23","`Abbasabad"
"1919438","33.5333333","49.9833333","IR","34","`Abbasabad"
"1919423","33.7619444","49.0747222","IR","24","`Abbasabad"
"1919419","34.2833333","49.2333333","IR","19","`Abbasabad"
"1919439","35.8833333","52.15","IR","35","`Abbasabad"
"1919417","35.9333333","52.95","IR","17","`Abbasabad"
"1919427","35.7341667","51.4377778","IR","26","`Abbasabad"
"1919425","35.1386111","51.6283333","IR","26","`Abbasabad"
"1919713","30.3705556","56.07","IR","29","`Abdolabad"
"1919711","27.9833333","57.7244444","IR","29","`Abdolabad"
"1919716","35.6025","59.2322222","IR","30","`Abdolabad"
"1919714","34.2197222","56.5447222","IR","30","`Abdolabad"

Additional details:

  • PostgreSQL 8.4 Database
  • Linux

Problem

Some values are obvious duplicates ("Abay" because the regions match and "Ab Barik" because the two locations are within such close proximity), others are not so obvious (and might not even be actual duplicates):

"1919430","27.4263889","57.5725","IR","29","`Abbasabad"
"1919435","36.5641667","61.14","IR","30","`Abbasabad"

The goal is to eliminate all duplicates.

Questions

Given a table of values such as the above CSV data:

  • How would you eliminate duplicates?
  • What geo-centric PostgreSQL functions would you use?
  • What other criteria would you use to wheedle down the duplicates?

Update

Semi-working example code to select duplicate city names within the same country that are in close proximity (within 10 km):

select
  c1.country, c1.name, c1.region_id, c2.region_id, c1.latitude_decimal, c1.longitude_decimal, c2.latitude_decimal, c2.longitude_decimal
from
  climate.maxmind_city c1,
  climate.maxmind_city c2
where
  c1.country = 'BE' and
  c1.id <> c2.id and
  c1.country = c2.country and
  c1.name = c2.name and
  (c1.latitude_decimal <> c2.latitude_decimal or c1.longitude_decimal <> c2.longitude_decimal) and
  earth_distance(
    ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
    ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 10
order by
  country, name

Ideas

Two phase approach:

  1. Eliminate the obvious duplicates (same country, region, and city name) by removing the min(id).
  2. Eliminate those within close proximity of each other, having the same name and country. This could remove some legitimate cities, but hardly any of consequence.

Thank you!


回答1:


Finding duplicates is simple:

select
  max(id) as this_should_stay,
  latitude,
  longitude,
  country,
  region,
  city
FROM
  your_table
group by
  latitude,
  longitude,
  country,
  region,
  city
having count(*) > 1;

Adding code to remove duplicates based on this is simple:

delete from your_table where id not in (
    select
      max(id) as this_should_stay
    FROM
      your_table
    group by
      latitude,
      longitude,
      country,
      region,
      city
)

note lack of having in the delete query.




回答2:


This deletes the second city within close proximity to a city of the same name in the same country:

delete from climate.maxmind_city mc where id in (
select
  max(c1.id)
from
  climate.maxmind_city c1,
  climate.maxmind_city c2
where
  c1.id <> c2.id and
  c1.country = c2.country and
  c1.name = c2.name and
  earth_distance(
    ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
    ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 35
group by
  c1.country, c1.name
order by
  c1.country, c1.name
)



回答3:


if your data have been imported thru CSV files and with the code (PHP) then you can prevent duplicates entry with the putting condition in PHP code. if the city you inserted is already exist then make loop continue to next record and skip current record.

try this if you are follow this way to import data in database..

Thanks.



来源:https://stackoverflow.com/questions/5816945/eliminate-duplicate-cities-from-database

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!