Question
I read data from HDFS, and I would like to get the metadata of the files from which it was read. This will allow me to build reports that reflect the data available at a given moment.
I found one solution: use org.apache.hadoop.fs.FileSystem to get a listing of all files.
Since I know the partitioning rule, I can build a row -> metadata mapping based on the received listing.
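For illustration, a minimal sketch of this listing-based mapping (the base path below is hypothetical):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// List every file under a (hypothetical) base path and keep its modification time,
// so that a row can later be resolved to file metadata via the partitioning rule.
val fs = FileSystem.get(new Configuration())
val pathToMtime: Map[String, Long] =
  fs.listStatus(new Path("/data/events"))           // hypothetical base path
    .map(status => status.getPath.toString -> status.getModificationTime)
    .toMap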
But this approach seems difficult to implement and maintain. Maybe there are simpler ways to achieve the same result?
Answer 1:
The easiest way to do this is with the built-in Spark SQL function input_file_name.
import scala.collection.mutable.Map
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.types.LongType
import spark.implicits._

// Tag each row with the file it was read from, and repartition so that rows
// from the same file land in the same partition.
val df = spark.read.text("<path>")
  .withColumn("input_file_name", input_file_name())
  .repartition($"input_file_name")

// For each partition, look up the modification time of every distinct file once,
// cache it in a local map, and append it to each row.
def getMetadata(rows: Iterator[Row]): Iterator[Row] = {
  val cache = Map[String, Long]()
  val fs = FileSystem.get(new Configuration())
  rows.map { row =>
    val path = row.getString(row.size - 1)
    if (!cache.contains(path)) {
      cache.put(path, fs.listStatus(new Path(path))(0).getModificationTime())
    }
    Row.fromSeq(row.toSeq ++ Array[Any](cache(path)))
  }
}

spark.createDataFrame(df.rdd.mapPartitions(getMetadata), df.schema.add("modified_ts", LongType))
  .show(10000, false)
Here modified_ts is the mtime for the file.
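Since getModificationTime returns milliseconds since the epoch, you may want a readable datetime column for reporting. A minimal sketch on top of the snippet above (withMeta and readable are just illustrative names):
import org.apache.spark.sql.functions.col

// modified_ts is epoch milliseconds; casting seconds to timestamp gives a readable datetime.
val withMeta = spark.createDataFrame(df.rdd.mapPartitions(getMetadata), df.schema.add("modified_ts", LongType))
val readable = withMeta.withColumn("modified", (col("modified_ts") / 1000).cast("timestamp"))
readable.show(false)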
Depending on the size of the data, you can also do it with a join. The logic will look something like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions._
import spark.implicits._

// UDF that returns the modification time of the file at the given path.
val mtime = (path: String) => FileSystem.get(new Configuration()).listStatus(new Path(path)).head.getModificationTime
val mtimeUDF = udf(mtime)

// Tag each row with its source file, compute the mtime once per distinct file,
// then join the metadata back onto the rows.
val df = spark.read.text("<path>").withColumn("input_file_name", input_file_name())
val metadata_df = df.select($"input_file_name").distinct().withColumn("mtime", mtimeUDF($"input_file_name"))
val rows_with_metadata = df.join(metadata_df, "input_file_name")
rows_with_metadata.show(false)
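If you prefer not to open a FileSystem inside a UDF, one variant (just a sketch, assuming the paths returned by df.inputFiles match the values produced by input_file_name()) is to look up the modification times on the driver and join the resulting small DataFrame:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import spark.implicits._

// inputFiles is usually small, so the mtimes can be looked up on the driver.
// Note: this assumes inputFiles and input_file_name() return the same URI form.
val fs = FileSystem.get(new Configuration())
val metadataDriverDF = df.inputFiles
  .map(p => (p, fs.getFileStatus(new Path(p)).getModificationTime))
  .toSeq
  .toDF("input_file_name", "mtime")
val rowsWithMetadataDriver = df.join(metadataDriverDF, "input_file_name")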
Answer 2:
I have created a small helper method metadata that you can invoke directly on a DataFrame object, e.g. df.metadata. It builds a DataFrame from the available file metadata and returns it.
Meta Columns in final DataFrame
- path
- isDirectory
- length -- displayed in a human-readable format, e.g. 47 bytes
- replication
- blockSize -- displayed in a human-readable format, e.g. 32 MB
- modificationTime -- converted from Unix time to a normal datetime
- accessTime
- owner
- group
- permission
- isSymlink
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.joda.time.DateTime
import org.apache.commons.io.FileUtils
import org.apache.spark.sql.DataFrame
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
// Storing Metadata
case class FileMetaData(path: String,
                        isDirectory: Boolean,
                        length: String,
                        replication: Int,
                        blockSize: String,
                        modificationTime: String,
                        accessTime: String,
                        owner: String,
                        group: String,
                        permission: String,
                        isSymlink: Boolean)

object FileMetaData {
  def apply(lfs: LocatedFileStatus): FileMetaData = {
    new FileMetaData(
      path = lfs.getPath.toString,
      isDirectory = lfs.isDirectory,
      length = FileUtils.byteCountToDisplaySize(lfs.getLen),
      replication = lfs.getReplication,
      blockSize = FileUtils.byteCountToDisplaySize(lfs.getBlockSize),
      modificationTime = new DateTime(lfs.getModificationTime).toString,
      accessTime = new DateTime(lfs.getAccessTime).toString,
      owner = lfs.getOwner,
      group = lfs.getGroup,
      permission = lfs.getPermission.toString,
      isSymlink = lfs.isSymlink
    )
  }
}

// Convert RemoteIterator to Scala Iterator.
implicit def convertToScalaIterator[T](remoteIterator: RemoteIterator[T]): Iterator[T] = {
  case class wrapper(remoteIterator: RemoteIterator[T]) extends Iterator[T] {
    override def hasNext: Boolean = remoteIterator.hasNext
    override def next(): T = remoteIterator.next()
  }
  wrapper(remoteIterator)
}

// Using this we can call the metadata method on df - like df.metadata.
implicit class MetaData(df: DataFrame) {
  def metadata = {
    import df.sparkSession.implicits._
    df.inputFiles.map(new Path(_))
      .flatMap {
        FileSystem
          .get(df.sparkSession.sparkContext.hadoopConfiguration)
          .listLocatedStatus(_)
          .toList
      }
      .map(FileMetaData(_)).toList.toDF
  }
}
// Exiting paste mode, now interpreting.
warning: there was one feature warning; re-run with -feature for details
import org.joda.time.DateTime
import org.apache.commons.io.FileUtils
import org.apache.spark.sql.DataFrame
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
defined class FileMetaData
defined object FileMetaData
convertToScalaIterator: [T](remoteIterator: org.apache.hadoop.fs.RemoteIterator[T])Iterator[T]
defined class MetaData
scala> val df = spark.read.format("json").load("/tmp/data")
df: org.apache.spark.sql.DataFrame = [json_data: struct<value: string>]
scala> df.show(false)
+------------------+
|json_data |
+------------------+
|[{"a":1} ,{"b":2}]|
|[{"a":1} ,{"b":2}]|
|[{"a":1} ,{"b":2}]|
+------------------+
scala>
DataFrame Metadata Output
scala> df.metadata.show(false)
+-------------------------+-----------+--------+-----------+---------+-----------------------------+-----------------------------+--------+-----+----------+---------+
|path |isDirectory|length |replication|blockSize|modificationTime |accessTime |owner |group|permission|isSymlink|
+-------------------------+-----------+--------+-----------+---------+-----------------------------+-----------------------------+--------+-----+----------+---------+
|file:/tmp/data/fileB.json|false |47 bytes|1 |32 MB |2020-04-25T13:47:00.000+05:30|1970-01-01T05:30:00.000+05:30|srinivas|wheel|rw-r--r-- |false |
|file:/tmp/data/fileC.json|false |47 bytes|1 |32 MB |2020-04-25T13:47:10.000+05:30|1970-01-01T05:30:00.000+05:30|srinivas|wheel|rw-r--r-- |false |
|file:/tmp/data/fileA.json|false |47 bytes|1 |32 MB |2020-04-25T11:35:12.000+05:30|1970-01-01T05:30:00.000+05:30|srinivas|wheel|rw-r--r-- |false |
+-------------------------+-----------+--------+-----------+---------+-----------------------------+-----------------------------+--------+-----+----------+---------+
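If you also need the metadata attached to each data row, as asked in the question, one option (a sketch, not part of the helper above, and assuming input_file_name() returns the same URI form as the path column) is:
import org.apache.spark.sql.functions.input_file_name

// Tag each row with its source file and join it against the metadata DataFrame.
// Note: this assumes input_file_name() and the path column use the same URI form.
val rowsWithMeta = df
  .withColumn("path", input_file_name())
  .join(df.metadata, Seq("path"), "left")
rowsWithMeta.show(false)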
Answer 3:
As a guess, you could try inspecting the block metadata directly:
hdfs debug computeMeta -block <block-file> -out <output-metadata-file>
You can find the command reference at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#verifyMeta
Source: https://stackoverflow.com/questions/61317600/how-to-get-files-metadata-when-retrieving-data-from-hdfs