Question
I read data from HDFS, and I would like to get the metadata of the files from which it was read. This will allow me to build reports that reflect the data available at a given moment.
I found one solution: use org.apache.hadoop.fs.FileSystem to get a listing of all files.
Since I know the partitioning rule, I can build a row -> metadata mapping based on the received listing.
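For illustration, a minimal sketch of this listing-based mapping (the base path below is hypothetical):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List every file under a (hypothetical) base path and keep its modification time,
// so that a row can later be resolved to file metadata via the partitioning rule.
val fs = FileSystem.get(new Configuration())
val pathToMtime: Map[String, Long] =
  fs.listStatus(new Path("/data/events"))           // hypothetical base path
    .map(status => status.getPath.toString -> status.getModificationTime)
    .toMap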
But this approach seems difficult to implement and maintain. Maybe there are simpler ways to achieve the same result?
Answer 1:
The easiest way to do this is with the built-in Spark SQL function input_file_name.
import scala.collection.mutable.Map
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.types.LongType
import spark.implicits._

// Tag each row with the file it was read from, and repartition so that rows
// from the same file land in the same partition.
val df = spark.read.text("<path>")
  .withColumn("input_file_name", input_file_name())
  .repartition($"input_file_name")

// For each partition, look up the modification time of every distinct file once,
// cache it in a local map, and append it to each row.
def getMetadata(rows: Iterator[Row]): Iterator[Row] = {
  val cache = Map[String, Long]()
  val fs = FileSystem.get(new Configuration())
  rows.map { row =>
    val path = row.getString(row.size - 1)
    if (!cache.contains(path)) {
      cache.put(path, fs.listStatus(new Path(path))(0).getModificationTime())
    }
    Row.fromSeq(row.toSeq ++ Array[Any](cache(path)))
  }
}

spark.createDataFrame(df.rdd.mapPartitions(getMetadata), df.schema.add("modified_ts", LongType))
  .show(10000, false)
Here modified_ts is the mtime for the file.
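Since getModificationTime returns milliseconds since the epoch, you may want a readable datetime column for reporting. A minimal sketch on top of the snippet above (withMeta and readable are just illustrative names):
import org.apache.spark.sql.functions.col

// modified_ts is epoch milliseconds; casting seconds to timestamp gives a readable datetime.
val withMeta = spark.createDataFrame(df.rdd.mapPartitions(getMetadata), df.schema.add("modified_ts", LongType))
val readable = withMeta.withColumn("modified", (col("modified_ts") / 1000).cast("timestamp"))
readable.show(false)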
Depending on the size of the data, you can also do it with a join. The logic will look something like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions._
import spark.implicits._

// UDF that returns the modification time of the file at the given path.
val mtime = (path: String) => FileSystem.get(new Configuration()).listStatus(new Path(path)).head.getModificationTime
val mtimeUDF = udf(mtime)

// Tag each row with its source file, compute the mtime once per distinct file,
// then join the metadata back onto the rows.
val df = spark.read.text("<path>").withColumn("input_file_name", input_file_name())
val metadata_df = df.select($"input_file_name").distinct().withColumn("mtime", mtimeUDF($"input_file_name"))
val rows_with_metadata = df.join(metadata_df, "input_file_name")
rows_with_metadata.show(false)
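If you prefer not to open a FileSystem inside a UDF, one variant (just a sketch, assuming the paths returned by df.inputFiles match the values produced by input_file_name()) is to look up the modification times on the driver and join the resulting small DataFrame:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import spark.implicits._

// inputFiles is usually small, so the mtimes can be looked up on the driver.
// Note: this assumes inputFiles and input_file_name() return the same URI form.
val fs = FileSystem.get(new Configuration())
val metadataDriverDF = df.inputFiles
  .map(p => (p, fs.getFileStatus(new Path(p)).getModificationTime))
  .toSeq
  .toDF("input_file_name", "mtime")
val rowsWithMetadataDriver = df.join(metadataDriverDF, "input_file_name")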
Answer 2:
I have created a small helper method metadata that you can invoke directly on a DataFrame object, e.g. df.metadata. It builds a DataFrame from the available file metadata and returns it.
Meta Columns in final DataFrame
- path
- isDirectory
- length -- displayed in a human-readable format, e.g. 47 bytes
- replication
- blockSize -- displayed in a human-readable format, e.g. 32 MB
- modificationTime -- converted from Unix time to a normal datetime
- accessTime
- owner
- group
- permission
- isSymlink
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.joda.time.DateTime
import org.apache.commons.io.FileUtils
import org.apache.spark.sql.DataFrame
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
// Storing Metadata
case class FileMetaData(path: String,
                        isDirectory: Boolean,
                        length: String,
                        replication: Int,
                        blockSize: String,
                        modificationTime: String,
                        accessTime: String,
                        owner: String,
                        group: String,
                        permission: String,
                        isSymlink: Boolean)

object FileMetaData {
  def apply(lfs: LocatedFileStatus): FileMetaData = {
    new FileMetaData(
      path = lfs.getPath.toString,
      isDirectory = lfs.isDirectory,
      length = FileUtils.byteCountToDisplaySize(lfs.getLen),
      replication = lfs.getReplication,
      blockSize = FileUtils.byteCountToDisplaySize(lfs.getBlockSize),
      modificationTime = new DateTime(lfs.getModificationTime).toString,
      accessTime = new DateTime(lfs.getAccessTime).toString,
      owner = lfs.getOwner,
      group = lfs.getGroup,
      permission = lfs.getPermission.toString,
      isSymlink = lfs.isSymlink
    )
  }
}

// Convert RemoteIterator to Scala Iterator.
implicit def convertToScalaIterator[T](remoteIterator: RemoteIterator[T]): Iterator[T] = {
  case class wrapper(remoteIterator: RemoteIterator[T]) extends Iterator[T] {
    override def hasNext: Boolean = remoteIterator.hasNext
    override def next(): T = remoteIterator.next()
  }
  wrapper(remoteIterator)
}

// Using this we can call the metadata method on df - like df.metadata.
implicit class MetaData(df: DataFrame) {
  def metadata = {
    import df.sparkSession.implicits._
    df.inputFiles.map(new Path(_))
      .flatMap {
        FileSystem
          .get(df.sparkSession.sparkContext.hadoopConfiguration)
          .listLocatedStatus(_)
          .toList
      }
      .map(FileMetaData(_)).toList.toDF
  }
}
// Exiting paste mode, now interpreting.
warning: there was one feature warning; re-run with -feature for details
import org.joda.time.DateTime
import org.apache.commons.io.FileUtils
import org.apache.spark.sql.DataFrame
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
defined class FileMetaData
defined object FileMetaData
convertToScalaIterator: [T](remoteIterator: org.apache.hadoop.fs.RemoteIterator[T])Iterator[T]
defined class MetaData
scala> val df = spark.read.format("json").load("/tmp/data")
df: org.apache.spark.sql.DataFrame = [json_data: struct<value: string>]
scala> df.show(false)
+------------------+
|json_data |
+------------------+
|[{"a":1} ,{"b":2}]|
|[{"a":1} ,{"b":2}]|
|[{"a":1} ,{"b":2}]|
+------------------+
scala>
DataFrame Metadata Output
scala> df.metadata.show(false)
+-------------------------+-----------+--------+-----------+---------+-----------------------------+-----------------------------+--------+-----+----------+---------+
|path |isDirectory|length |replication|blockSize|modificationTime |accessTime |owner |group|permission|isSymlink|
+-------------------------+-----------+--------+-----------+---------+-----------------------------+-----------------------------+--------+-----+----------+---------+
|file:/tmp/data/fileB.json|false |47 bytes|1 |32 MB |2020-04-25T13:47:00.000+05:30|1970-01-01T05:30:00.000+05:30|srinivas|wheel|rw-r--r-- |false |
|file:/tmp/data/fileC.json|false |47 bytes|1 |32 MB |2020-04-25T13:47:10.000+05:30|1970-01-01T05:30:00.000+05:30|srinivas|wheel|rw-r--r-- |false |
|file:/tmp/data/fileA.json|false |47 bytes|1 |32 MB |2020-04-25T11:35:12.000+05:30|1970-01-01T05:30:00.000+05:30|srinivas|wheel|rw-r--r-- |false |
+-------------------------+-----------+--------+-----------+---------+-----------------------------+-----------------------------+--------+-----+----------+---------+
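If you also need the metadata attached to each data row, as asked in the question, one option (a sketch, not part of the helper above, and assuming input_file_name() returns the same URI form as the path column) is:
import org.apache.spark.sql.functions.input_file_name

// Tag each row with its source file and join it against the metadata DataFrame.
// Note: this assumes input_file_name() and the path column use the same URI form.
val rowsWithMeta = df
  .withColumn("path", input_file_name())
  .join(df.metadata, Seq("path"), "left")
rowsWithMeta.show(false)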
Answer 3:
As a guess, you could try inspecting the block metadata directly:
hdfs debug computeMeta -block <block-file> -out <output-metadata-file>
You can find the command reference at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#verifyMeta
Source: https://stackoverflow.com/questions/61317600/how-to-get-files-metadata-when-retrieving-data-from-hdfs