Removing duplicate chess games and then storing unique games in Postgresql


Question


I have a very large number of chess games (around 5 million) stored in several PGN (Portable Game Notation) files. If you aren't familiar with PGN, the parsed result is essentially a CSV file: several fields with information about the players, location, etc., and then one larger text field with the moves separated by some delimiter, possibly a space. There is one such row per game.
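
For context, here is a minimal sketch of how one PGN game might be flattened into such a row. It assumes the python-chess library and an illustrative flatten_game helper; the actual parser and column layout are up to you.

```python
# Sketch: flatten one PGN game into a CSV-style row.
# Assumes the python-chess library (pip install chess); the column
# layout and the flatten_game() helper are illustrative, not prescriptive.
import csv
import chess.pgn

def flatten_game(game):
    """Return (white, black, date, result, movetext) for a parsed game."""
    board = game.board()
    sans = []
    for move in game.mainline_moves():
        sans.append(board.san(move))
        board.push(move)
    h = game.headers
    return (h.get("White", "?"), h.get("Black", "?"),
            h.get("Date", "?"), h.get("Result", "*"),
            " ".join(sans))          # all moves in one space-delimited field

with open("games.pgn") as pgn, open("games.csv", "w", newline="") as out:
    writer = csv.writer(out)
    while True:
        game = chess.pgn.read_game(pgn)
        if game is None:             # end of file
            break
        writer.writerow(flatten_game(game))
```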

The catch is that there may be duplicate games. Ultimately, I would like to store the unique set in Postgres, but what is the best way to get there? I had two approaches in mind:

1. Insert one game at a time, and with each subsequent insert run a uniqueness check that only inserts the game if it is unique. Of course, I would index fields as necessary to optimize this process (should I index all fields, or just the 'cheap' ones like ratings, which are plain integers?). A sketch of this approach follows the list below.

2. Do a batch insert from the generated CSV and only then check for duplicates. The algorithm I had in mind was to loop through ids 1..(# of games), find the game with that id in Postgres (if not already deleted), scan forward for all games that are identical, delete all but one, and then move on to the next id/game. A sketch of this batch-then-dedupe approach follows the comparison below.
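
For the first approach, a minimal sketch using Python and psycopg2 is below. Rather than a separate uniqueness-test script, it leans on a unique expression index so Postgres itself rejects duplicates; ON CONFLICT DO NOTHING requires Postgres 9.5+, and using md5(moves) as the duplicate key is an assumption — substitute whichever columns define "duplicate" for you.

```python
# Sketch of approach 1: insert one game at a time and let a unique index
# reject duplicates. Assumes psycopg2 and Postgres 9.5+ (for ON CONFLICT);
# hashing only the moves column as the duplicate key is an assumption.
import psycopg2

conn = psycopg2.connect("dbname=chess")   # hypothetical connection string
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS games (
        id     serial PRIMARY KEY,
        white  text,
        black  text,
        date   text,
        result text,
        moves  text
    )
""")
# The unique expression index is what turns each insert into a uniqueness test.
cur.execute("""
    CREATE UNIQUE INDEX IF NOT EXISTS games_moves_uniq
        ON games (md5(moves))
""")

def insert_game(row):
    """row = (white, black, date, result, moves); silently skips duplicates."""
    cur.execute(
        """INSERT INTO games (white, black, date, result, moves)
           VALUES (%s, %s, %s, %s, %s)
           ON CONFLICT DO NOTHING""",
        row,
    )

# ... call insert_game() for each parsed game, then:
conn.commit()
```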

The second method would insert much faster but would require searching through n games on each check. The first would insert more slowly, but would only search through n/2 games on average. What are people's expectations about the efficiency of each approach?
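
For the second approach, the per-id loop can be replaced by a single bulk load plus one dedupe statement; the sketch below (again psycopg2, with the same assumed schema and md5(moves) duplicate key as above) uses COPY for the batch insert and a self-join DELETE that keeps the lowest id of each duplicate group.

```python
# Sketch of approach 2: bulk-load the CSV, then dedupe in one statement.
# Assumes the same games table and CSV column order as above; treating
# identical movetext as "duplicate" is still an assumption.
import psycopg2

conn = psycopg2.connect("dbname=chess")   # hypothetical connection string
cur = conn.cursor()

# Batch insert: COPY is much faster than row-at-a-time INSERTs.
with open("games.csv") as f:
    cur.copy_expert(
        "COPY games (white, black, date, result, moves) FROM STDIN WITH CSV",
        f,
    )

# Dedupe: for each group of identical games, keep the row with the lowest id.
cur.execute("""
    DELETE FROM games a
    USING games b
    WHERE a.id > b.id
      AND md5(a.moves) = md5(b.moves)
""")
conn.commit()
```

Deferring any unique index until after the load keeps the COPY fast; adding one afterwards (as in the first sketch) would also speed up the self-join and protect future inserts.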

Source: https://stackoverflow.com/questions/23692850/removing-duplicate-chess-games-and-then-storing-unique-games-in-postgresql
