Checksum verification in Hadoop


Question


Do we need to verify the checksum after we move files to Hadoop (HDFS) from a Linux server through WebHDFS?

I would like to make sure the files on HDFS are not corrupted after they are copied. But is checking the checksum necessary?

I read that the client computes a checksum before data is written to HDFS.

Can somebody help me understand how I can make sure that the source file on the Linux system is the same as the ingested file on HDFS when using WebHDFS?


Answer 1:


If your goal is to compare two files residing on HDFS, I would not use "hdfs dfs -checksum URI", as in my case it generates different checksums for files with identical content.

In the example below I am comparing two files with the same content in different locations:

The old-school md5sum method returns the same checksum:

$ hdfs dfs -cat /project1/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a  -

$ hdfs dfs -cat /project2/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a  -

However, the checksum generated by HDFS differs for files with the same content:

$ hdfs dfs -checksum /project1/file.txt
0000020000000000000000003e50be59553b2ddaf401c575f8df6914

$ hdfs dfs -checksum /project2/file.txt
0000020000000000000000001952d653ccba138f0c4cd4209fbf8e2e

A bit puzzling, as I would expect an identical checksum to be generated for identical content.
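
A likely explanation: "hdfs dfs -checksum" returns an MD5-of-MD5-of-CRC checksum whose value depends not only on the file content but also on how the file was written (CRC32 vs CRC32C checksum type, bytes per checksum, block size), so two byte-identical files can report different checksums if they were ingested with different settings. A minimal diagnostic sketch, using the paths from the example above and assuming a Configuration that points at the cluster:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Print the checksum algorithm name for both files; differing names
// (e.g. MD5-of-0MD5-of-512CRC32 vs MD5-of-0MD5-of-512CRC32C) would
// explain why identical content yields different checksums.
val fs = FileSystem.get(new Configuration())
for (p <- Seq("/project1/file.txt", "/project2/file.txt")) {
  println(p + " -> " + fs.getFileChecksum(new Path(p)).getAlgorithmName)
}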




Answer 2:


The checksum for a file can be calculated using the hadoop fs command.

Usage: hadoop fs -checksum URI

Returns the checksum information of a file.

Example:

hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///path/in/linux/file1

Refer to the Hadoop documentation for more details.

So if you want to compare file1 on both Linux and HDFS, you can use the above utility.
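
If you would rather compare raw content than checksum metadata, the md5sum approach from Answer 1 can also be done through the API. A small sketch, with illustrative paths and MD5Hash from org.apache.hadoop.io:

import java.io.FileInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.MD5Hash

// MD5 of the local file vs MD5 of the HDFS file: a content-level comparison,
// independent of HDFS block size or checksum settings.
val fs = FileSystem.get(new Configuration())
val localMd5 = MD5Hash.digest(new FileInputStream("/path/in/linux/file1")).toString
val hdfsMd5  = MD5Hash.digest(fs.open(new Path("/file1"))).toString
println(if (localMd5 == hdfsMd5) "match" else "MISMATCH")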




Answer 3:


I wrote a library with which you can calculate the checksum of a local file, just the way Hadoop does it for HDFS files.

So you can compare the checksums to cross-check: https://github.com/srch07/HDFSChecksumForLocalfile
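
For reference, the core of what such a library has to do is roughly: split the file into bytes-per-CRC chunks (512 by default), CRC each chunk, MD5 the concatenated CRCs per block, then MD5 the concatenated block MD5s. A minimal sketch, assuming CRC32C (the default checksum type in recent Hadoop), 512 bytes per CRC, and a file that fits in a single HDFS block; the library linked above handles the general case:

import java.io.FileInputStream
import java.nio.ByteBuffer
import java.security.MessageDigest
import java.util.zip.CRC32C // Java 9+

def hdfsStyleChecksum(path: String, bytesPerCrc: Int = 512): String = {
  val in = new FileInputStream(path)
  val crcMd5 = MessageDigest.getInstance("MD5") // MD5 over the per-chunk CRCs
  val buf = new Array[Byte](bytesPerCrc)
  try {
    var n = in.readNBytes(buf, 0, bytesPerCrc)
    while (n > 0) {
      val crc = new CRC32C
      crc.update(buf, 0, n)
      // each chunk contributes its CRC as 4 big-endian bytes
      crcMd5.update(ByteBuffer.allocate(4).putInt(crc.getValue.toInt).array())
      n = in.readNBytes(buf, 0, bytesPerCrc)
    }
  } finally in.close()
  val blockMd5 = crcMd5.digest() // one block -> one MD5
  MessageDigest.getInstance("MD5")
    .digest(blockMd5) // MD5 of the block MD5s
    .map("%02x".format(_)).mkString
}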




Answer 4:


If you are doing this check via the API:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.hadoop.io._

val hadoopConfiguration = new Configuration() // in spark-shell, use sc.hadoopConfiguration

Option 1: for the value b9fdea463b1ce46fabc2958fc5f7644a (the plain MD5 of the file contents, matching md5sum):

val md5: String = MD5Hash.digest(FileSystem.get(hadoopConfiguration).open(new Path("/project1/file.txt"))).toString

Option 2: for the value 3e50be59553b2ddaf401c575f8df6914 (the MD5 part of the HDFS file checksum; its toString is "<algorithm>:<hex>", so take the part after the colon):

val md5: String = FileSystem.get(hadoopConfiguration).getFileChecksum(new Path("/project1/file.txt")).toString.split(":")(1)



Answer 5:


It does a CRC check. For each and every file it creates a .crc file to make sure there is no corruption.
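
You can see this client-side mechanism with Hadoop's local filesystem, which is also checksummed: writing a file through it creates a hidden sibling .crc file that is verified on every read. A small illustration, with an arbitrary path:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// LocalFileSystem is a ChecksumFileSystem: writing /tmp/file1 also creates
// /tmp/.file1.crc, which is checked when the file is read back.
val lfs = FileSystem.getLocal(new Configuration())
val out = lfs.create(new Path("/tmp/file1"))
out.write("hello".getBytes("UTF-8"))
out.close()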



Source: https://stackoverflow.com/questions/31920033/checksum-verification-in-hadoop
