Which algorithm for hashing name, firstName and birth-date of a person

丶灬走出姿态 posted on 2019-12-06 04:57:51

If you want to look a person up knowing only those three fields, you could store a SHA-1 hash of them in the database (or MD5 for speed, unless you have something like a quadrillion people to sample).

The hash is worthless on its own, since it carries no recoverable information about the person, but it works as a search key. You only need to check that the three pieces of information match, so it is enough to concatenate them:

user.hash = SHA1(user.firstName + user.DOB + user.lastName)

And when you query, you could check if the two match:

hash = SHA1(query.firstName + query.DOB + query.lastName)

for user in database:
  if user.hash == hash:
    return user

I put query.DOB in the middle because the first and last names could otherwise run together and collide, for example if JohnDoe Bob was born on the same day as John DoeBob. I'm not aware of names that contain digits, so I think this will stop collisions like those ;)
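A minimal Python sketch of the point above, using the standard hashlib module. The function names and the date format are assumptions made for illustration; they are not part of the original answer.

```python
import hashlib


def person_hash(first_name: str, dob: str, last_name: str) -> str:
    """SHA-1 hex digest of firstName + DOB + lastName (DOB in the middle)."""
    key = first_name + dob + last_name
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def adjacent_names_hash(first_name: str, last_name: str, dob: str) -> str:
    """Same fields, but with the two names next to each other (ambiguous)."""
    key = first_name + last_name + dob
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


dob = "1990-01-01"  # hypothetical date format

# Names adjacent: both people yield the string "JohnDoeBob1990-01-01".
ambiguous = adjacent_names_hash("JohnDoe", "Bob", dob) == adjacent_names_hash("John", "DoeBob", dob)

# DOB in the middle: the concatenated strings differ, so the digests differ.
distinct = person_hash("JohnDoe", dob, "Bob") != person_hash("John", dob, "DoeBob")

print(ambiguous, distinct)
```

This only holds as long as names never contain the characters used in the date format, which is the same assumption the answer makes.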

But if this is a big database, I'd try MD5. It's faster. In theory there is still a chance of a collision, but for a database of people it is so small that you can treat it as never happening.

To put that into perspective, the chance that two given records collide is 1 / 2^128, which is:

                          1
---------------------------------------------------
340,282,366,920,938,463,463,374,607,431,768,211,456

As a percentage, that is roughly:

2.94 × 10^-37 %

I'm pretty sure you're not going to get a collision ;)
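The 1 / 2^128 figure above is the chance for one pair of records. For a whole table, the usual birthday approximation gives the chance that any two rows collide. The sketch below assumes the hash behaves like a uniformly random function on non-adversarial input, and the record counts are made-up examples.

```python
import math


def collision_probability(n_records: int, hash_bits: int) -> float:
    """Approximate P(at least one collision) among n_records random hashes.

    Birthday approximation: 1 - exp(-n(n-1) / 2^(bits+1)).
    """
    pairs = n_records * (n_records - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


# Hypothetical table sizes: 8 billion rows with a 128-bit hash (MD5),
# and 100,000 rows with a 32-bit hash (CRC32).
p_md5 = collision_probability(8_000_000_000, 128)
p_crc32 = collision_probability(100_000, 32)
print(p_md5, p_crc32)
```

Under these assumptions a 128-bit hash stays negligible even at world-population scale, while a 32-bit hash is more likely than not to collide somewhere in a table of 100,000 rows, so it should only ever narrow the search.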

Hash collisions are inevitable. However small the chance of a collision may be, you shouldn't rely on the hash alone if you really want 100% identification.

If you use hashing only to speed up the database search, there is no need for SHA-256. Use the smallest hash function your system provides (MD5() in MySQL, or even CRC32 if your database is not that big). Just make sure that when you query the table you supply every condition you are searching by:

SELECT * FROM user WHERE hash = 'AABBCCDD' AND firstname = 'Pavel' AND surname = 'Sokolov'

Databases maintain a value called index cardinality, which measures how unique the data in a given index is. So you can index the fields you want alongside the hash field, and the database will choose the most selective index for the query by itself. Adding extra conditions doesn't hurt performance, because most databases typically use only one index per table when selecting data, and they pick the one with the highest cardinality.

The database first selects all rows that match the index and then scans through them to discard rows that don't match the other conditions.

If you cannot use the method I described, I think even an MD5 collision is very unlikely to occur in a database of people's names.

P.S. I hope you know that "the combination of lastname, firstname and birth-date of a person" is not enough to identify a human with 100% certainty. Two different people will share this combination long before two hashes collide.
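The lookup described above can be sketched in Python with the standard sqlite3 and zlib modules. SQLite stands in for MySQL here, and the table layout, the dob column, the index name and the sample rows are all assumptions for illustration.

```python
import sqlite3
import zlib


def short_hash(first_name: str, dob: str, last_name: str) -> int:
    """CRC32 of the concatenated fields: small, fast, and not unique."""
    return zlib.crc32((first_name + dob + last_name).encode("utf-8"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user (id INTEGER PRIMARY KEY, firstname TEXT, "
    "surname TEXT, dob TEXT, hash INTEGER)"
)
# The index on the hash column is what makes the lookup fast.
conn.execute("CREATE INDEX idx_user_hash ON user (hash)")

# Hypothetical sample rows.
people = [("Pavel", "Sokolov", "1980-05-17"), ("Anna", "Ivanova", "1992-11-03")]
for first, last, dob in people:
    conn.execute(
        "INSERT INTO user (firstname, surname, dob, hash) VALUES (?, ?, ?, ?)",
        (first, last, dob, short_hash(first, dob, last)),
    )


def find_user(first: str, last: str, dob: str) -> list:
    """Narrow by hash, then let the other conditions discard collisions."""
    return conn.execute(
        "SELECT id, firstname, surname, dob FROM user "
        "WHERE hash = ? AND firstname = ? AND surname = ? AND dob = ?",
        (short_hash(first, dob, last), first, last, dob),
    ).fetchall()


matches = find_user("Pavel", "Sokolov", "1980-05-17")
print(matches)
```

Because the real fields are compared as well, a hash collision can only cost an extra row scan; it can never return the wrong person.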

fredw

If you are concerned with collisions there is a good discussion here:

Understanding sha-1 collision weakness

If you have security concerns, I would consider SHA-256 instead.
