K-means algorithm variation with equal cluster size

挽巷 2020-11-27 14:26

I'm looking for the fastest algorithm for grouping points on a map into equally sized groups, by distance. The k-means clustering algorithm looks straightforward and promis

16 answers
  •  盖世英雄少女心
    2020-11-27 15:10

    I've been struggling with this question too, and I eventually realized that I had been using the wrong keyword all along. If you want every result group to contain the same number of points, you are doing grouping, not clustering. I finally solved the problem with a simple Python script and a PostGIS query.

    For example, suppose you have a table called tb_points that holds 4000 coordinate points, and you want to divide it into 10 equally sized groups of 400 points each. Here is an example of the table structure:

    CREATE TABLE tb_points (
      id SERIAL PRIMARY KEY,
      outlet_id INTEGER,
      longitude FLOAT,
      latitude FLOAT,
      group_id INTEGER
    );
    

    Then what you need to do is:

    1. Find the first coordinate that will be your starting point (the script below uses one end of the farthest-apart pair of unassigned points)
    2. Find the coordinates nearest to your starting point, ordered by distance ascending, and limit the result to your preferred group size (in this case 400)
    3. Update the group_id column of the rows found
    4. Repeat the 3 steps above 10 times, each time on the remaining rows whose group_id is still NULL
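
    To see the four steps without a database, here is a minimal in-memory sketch. It is not the PostGIS implementation: the function names haversine_km and greedy_equal_groups are made up for this illustration, distances use the haversine formula on a spherical Earth (radius 6371 km) rather than PostGIS geography, and the loop is quadratic per group, so treat it as a way to check the logic on small inputs only.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def greedy_equal_groups(points, group_size):
    """Return a 1-based group id for every point in `points` (list of (lon, lat))."""
    group_of = [None] * len(points)
    unassigned = set(range(len(points)))
    group_id = 1
    while unassigned:
        # Step 1: start from one end of the farthest-apart unassigned pair.
        start = max(
            unassigned,
            key=lambda i: max(
                (haversine_km(points[i], points[j]) for j in unassigned),
                default=0.0,
            ),
        )
        # Step 2: the group_size nearest unassigned points (start itself is at distance 0).
        nearest = sorted(
            unassigned,
            key=lambda j: haversine_km(points[start], points[j]),
        )[:group_size]
        # Step 3: record the group id for those points.
        for j in nearest:
            group_of[j] = group_id
        # Step 4: repeat on whatever is still unassigned.
        unassigned.difference_update(nearest)
        group_id += 1
    return group_of

# Two tight triples far apart, split into groups of 3.
sample = [(0, 0), (0, 0.1), (0.1, 0), (10, 10), (10, 10.1), (10.1, 10)]
groups = greedy_equal_groups(sample, 3)
```

    Because each group is filled greedily, the last groups get whatever points are left over, so they can be more spread out than the first ones.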

    This is the implementation in Python:

    import psycopg2
    
    dbhost = ''
    dbuser = ''
    dbpass = ''
    dbname = ''
    dbport = 5432
    
    conn = psycopg2.connect(host = dbhost,
           user = dbuser,
           password = dbpass,
           database = dbname,
           port = dbport)
    
    def fetch(sql):
        cursor = conn.cursor()
        rs = None
        try:
            cursor.execute(sql)
            rs = cursor.fetchall()
        except psycopg2.Error as e:
            print(e.pgerror)
            rs = 'error'
        cursor.close()
        return rs
    
    def execScalar(sql):
        cursor = conn.cursor()
        try:
            cursor.execute(sql)
            conn.commit()
            rowsaffected = cursor.rowcount
        except psycopg2.Error as e:
            print(e.pgerror)
            rowsaffected = -1
            conn.rollback()
        cursor.close()
        return rowsaffected
    
    
    def select_first_cluster_id():
        sql = """ SELECT a.outlet_id as ori_id, a.longitude as ori_lon,
        a.latitude as ori_lat, b.outlet_id as dest_id, b.longitude as
        dest_lon, b.latitude as dest_lat,
        ST_Distance(CAST(ST_SetSRID(ST_Point(a.longitude,a.latitude),4326)
        AS geography), 
        CAST(ST_SetSRID(ST_Point(b.longitude,b.latitude),4326) AS geography))
        AS air_distance FROM  tb_points a CROSS JOIN tb_points b WHERE
        a.outlet_id != b.outlet_id and a.group_id is NULL and b.group_id is
        null order by air_distance desc limit 1 """
        return sql
    
    def update_group_id(group_id, ori_id, limit_constraint):
        sql = """ UPDATE tb_points
        set group_id = %s
        where outlet_id in
        (select b.outlet_id
        from tb_points a,
        tb_points b
        where a.outlet_id = '%s'
        and a.group_id is null
        and b.group_id is null
        order by ST_Distance(CAST(ST_SetSRID(ST_Point(a.longitude,a.latitude),4326) AS geography),
        CAST(ST_SetSRID(ST_Point(b.longitude,b.latitude),4326) AS geography)) asc
        limit %s)
        """ % (group_id, ori_id, limit_constraint)
        return sql
    
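    def clustering():
        group_size = 400  # 4000 points / 10 groups
        n_groups = 10
        for n in range(1, n_groups + 1):
            # Pick the starting point for this group
            res = fetch(select_first_cluster_id())
            if not res or res == 'error':
                break  # query failed, or fewer than two unassigned points left
            ori_id = res[0][0]
    
            # Assign the nearest unassigned points to group n
            sql = update_group_id(n, ori_id, group_size)
            print(sql)
            execScalar(sql)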
    
    clustering()
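
    The update query above is built with Python string formatting. A possible variation, sketched here as an assumption rather than as part of the original script, is to keep the %s placeholders in the SQL and let psycopg2 bind the values through the second argument of cursor.execute. The helper name build_update_group_query is hypothetical.

```python
def build_update_group_query(group_id, ori_id, limit_constraint):
    """Return (sql, params) suitable for cursor.execute(sql, params)."""
    sql = """
        UPDATE tb_points
        SET group_id = %s
        WHERE outlet_id IN (
            SELECT b.outlet_id
            FROM tb_points a, tb_points b
            WHERE a.outlet_id = %s
              AND a.group_id IS NULL
              AND b.group_id IS NULL
            ORDER BY ST_Distance(
                CAST(ST_SetSRID(ST_Point(a.longitude, a.latitude), 4326) AS geography),
                CAST(ST_SetSRID(ST_Point(b.longitude, b.latitude), 4326) AS geography)) ASC
            LIMIT %s)
    """
    # The values travel separately from the SQL text; psycopg2 handles the quoting.
    return sql, (group_id, ori_id, limit_constraint)

# Example values only: group 1, starting outlet 42, 400 members.
sql, params = build_update_group_query(1, 42, 400)
# With an open psycopg2 cursor this would be run as: cursor.execute(sql, params)
```

    This keeps the query text constant and avoids quoting the integer outlet_id by hand.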
    

    Hope it helps
