How to get files metadata, when retrieving data from HDFS?

Submitted by 不羁的心 on 2020-08-18 08:00:19

Question


I am reading data from HDFS, and I would like to get the metadata of the files the rows were read from. This will allow me to build reports that reflect the data that was available at a given moment.

I found a solution, which is to use org.apache.hadoop.fs.FileSystem to get a listing of all files. I know the partitioning rule, so I can build a row -> metadata mapping from that listing.
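
Roughly, the idea is something like the sketch below (a minimal, untested sketch: /data/events stands in for the real input root, and spark is the usual shell SparkSession):

import org.apache.hadoop.fs.{FileSystem, Path}

// List every file under the input root and remember its modification time.
// "/data/events" is a placeholder path, not the actual location.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val fileMeta: Map[String, Long] =
  fs.listStatus(new Path("/data/events"))
    .filter(_.isFile)
    .map(status => status.getPath.toString -> status.getModificationTime)
    .toMap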

But this approach seems difficult to implement and maintain. Is there a simpler way to achieve the same result?


Answer 1:


The easiest way to do this is with Spark's built-in input_file_name function.

import scala.collection.mutable.Map
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.types.LongType
import spark.implicits._

// Tag every row with the file it came from, and repartition so that rows
// from the same file land in the same partition.
val df = spark.read.text("<path>")
  .withColumn("input_file_name", input_file_name())
  .repartition($"input_file_name")

// Per partition: look up each distinct file's modification time once via the
// FileSystem API and append it to every row.
def getMetadata(rdd: Iterator[Row]): Iterator[Row] = {
  val map = Map[String, Long]()
  val fs = FileSystem.get(new Configuration())
  rdd.map { row =>
    val path = row.getString(row.size - 1)
    if (!map.contains(path)) {
      map.put(path, fs.listStatus(new Path(path))(0).getModificationTime())
    }
    Row.fromSeq(row.toSeq ++ Array[Any](map(path)))
  }
}

spark.createDataFrame(df.rdd.mapPartitions(getMetadata), df.schema.add("modified_ts", LongType))
  .show(10000, false)

Here modified_ts is the modification time (mtime), in epoch milliseconds, of the file each row was read from.
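
If you prefer a readable timestamp over epoch milliseconds, one option (a sketch; dfWithMeta is just a hypothetical name for the DataFrame returned by createDataFrame above) is:

import org.apache.spark.sql.functions.{col, from_unixtime}

// modified_ts is in epoch milliseconds, while from_unixtime expects seconds.
val withReadableTs = dfWithMeta.withColumn(
  "modified_time",
  from_unixtime((col("modified_ts") / 1000).cast("long"))
)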

Depending on the size of the data, you can also do it with a join. The logic looks something like this:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions._
import spark.implicits._

// UDF that returns the modification time of the file at a given path.
val mtime = (path: String) => FileSystem.get(new Configuration()).listStatus(new Path(path)).head.getModificationTime
val mtimeUDF = udf(mtime)

val df = spark.read.text("<path>").withColumn("input_file_name", input_file_name())

// Compute the mtime once per distinct file, then join it back onto the rows.
val metadata_df = df.select($"input_file_name").distinct().withColumn("mtime", mtimeUDF($"input_file_name"))

val rows_with_metadata = df.join(metadata_df, "input_file_name")
rows_with_metadata.show(false)



Answer 2:


I have created a small helper method, metadata, which you can invoke directly on a DataFrame object, like df.metadata. It builds a DataFrame from the available file metadata and returns it.

Meta columns in the final DataFrame:

  • path
  • isDirectory
  • length -- displayed in a human-readable format, e.g. 47 bytes
  • replication
  • blockSize -- displayed in a human-readable format, e.g. 32 MB
  • modificationTime -- converted from Unix time to a normal datetime
  • accessTime
  • owner
  • group
  • permission
  • isSymlink

scala> :paste
// Entering paste mode (ctrl-D to finish)

  import org.joda.time.DateTime
  import org.apache.commons.io.FileUtils
  import org.apache.spark.sql.DataFrame
  import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}

  // Storing Metadata
  case class FileMetaData(path: String,
                          isDirectory:Boolean,
                          length:String,
                          replication:Int,
                          blockSize:String,
                          modificationTime: String,
                          accessTime:String ,
                          owner:String ,
                          group:String ,
                          permission:String,
                          isSymlink:Boolean)

  object FileMetaData {

    def apply(lfs: LocatedFileStatus):FileMetaData = {        
      new FileMetaData(
        path= lfs.getPath.toString,
        isDirectory=lfs.isDirectory,
        length=FileUtils.byteCountToDisplaySize(lfs.getLen),
        replication=lfs.getReplication,
        blockSize=FileUtils.byteCountToDisplaySize(lfs.getBlockSize),
        modificationTime=new DateTime(lfs.getModificationTime).toString,
        accessTime=new DateTime(lfs.getAccessTime).toString ,
        owner=lfs.getOwner ,
        group=lfs.getGroup ,
        permission=lfs.getPermission.toString,
        isSymlink=lfs.isSymlink
      )
    }
  }

  // Convert RemoteIterator to Scala Iterator.
  implicit def convertToScalaIterator[T](remoteIterator: RemoteIterator[T]): Iterator[T] = {
    case class wrapper(remoteIterator: RemoteIterator[T]) extends Iterator[T] {
      override def hasNext: Boolean = remoteIterator.hasNext
      override def next(): T = remoteIterator.next()
    }
    wrapper(remoteIterator)
  }

  // Using this we can call metadata method on df - like df.metadata.
  implicit class MetaData(df: DataFrame) {
    def metadata = {
      import df.sparkSession.implicits._
      df.inputFiles
        .map(new Path(_))
        .flatMap {
          FileSystem
            .get(df.sparkSession.sparkContext.hadoopConfiguration)
            .listLocatedStatus(_)
            .toList
        }
        .map(FileMetaData(_))
        .toList
        .toDF
    }
  }

// Exiting paste mode, now interpreting.

warning: there was one feature warning; re-run with -feature for details
import org.joda.time.DateTime
import org.apache.commons.io.FileUtils
import org.apache.spark.sql.DataFrame
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
defined class FileMetaData
defined object FileMetaData
convertToScalaIterator: [T](remoteIterator: org.apache.hadoop.fs.RemoteIterator[T])Iterator[T]
defined class MetaData

scala> val df = spark.read.format("json").load("/tmp/data")
df: org.apache.spark.sql.DataFrame = [json_data: struct<value: string>]



scala> df.show(false)
+------------------+
|json_data         |
+------------------+
|[{"a":1} ,{"b":2}]|
|[{"a":1} ,{"b":2}]|
|[{"a":1} ,{"b":2}]|
+------------------+

scala>

DataFrame Metadata Output

scala> df.metadata.show(false)

+-------------------------+-----------+--------+-----------+---------+-----------------------------+-----------------------------+--------+-----+----------+---------+
|path                     |isDirectory|length  |replication|blockSize|modificationTime             |accessTime                   |owner   |group|permission|isSymlink|
+-------------------------+-----------+--------+-----------+---------+-----------------------------+-----------------------------+--------+-----+----------+---------+
|file:/tmp/data/fileB.json|false      |47 bytes|1          |32 MB    |2020-04-25T13:47:00.000+05:30|1970-01-01T05:30:00.000+05:30|srinivas|wheel|rw-r--r-- |false    |
|file:/tmp/data/fileC.json|false      |47 bytes|1          |32 MB    |2020-04-25T13:47:10.000+05:30|1970-01-01T05:30:00.000+05:30|srinivas|wheel|rw-r--r-- |false    |
|file:/tmp/data/fileA.json|false      |47 bytes|1          |32 MB    |2020-04-25T11:35:12.000+05:30|1970-01-01T05:30:00.000+05:30|srinivas|wheel|rw-r--r-- |false    |
+-------------------------+-----------+--------+-----------+---------+-----------------------------+-----------------------------+--------+-----+----------+---------+



Answer 3:


I can only make a guess: try inspecting the block metadata with:

hdfs debug computeMeta -block <block-file> -out <output-metadata-file>

You can find this command documented at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#verifyMeta
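
For reference, the related verifyMeta command described on that same page checks a block file against its on-disk metadata file:

hdfs debug verifyMeta -meta <metadata-file> [-block <block-file>]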



Source: https://stackoverflow.com/questions/61317600/how-to-get-files-metadata-when-retrieving-data-from-hdfs
