Databricks read Azure blob last modified date

Posted by 假装没事ソ on 2019-12-11 09:55:14

Question


I have an Azure Blob Storage container mounted to my Databricks file system. Is there a way to get the last modified date of a blob in Databricks?

This is how I'm reading the blob content:

val df = spark.read
  .option("header", "false")
  .option("inferSchema", "false")
  .option("delimiter", ",")
  .csv("/mnt/test/*")

Answer 1:


Generally, there are two ways to read an Azure blob's last modified date, as follows.

  1. Read it directly via the Azure Storage REST API or the Azure Storage SDK for Java. Two Azure Blob Storage REST APIs, Get Blob and Get Blob Properties, return the Last-Modified property in the response headers. You can call either API from Scala and parse the response header, or use the Azure Storage SDK for Java from Scala to do the same.

Here is my sample code in Java for getting Last-Modified property of a blob.

import java.util.Date;

import com.microsoft.azure.storage.CloudStorageAccount;
import com.microsoft.azure.storage.StorageException;
import com.microsoft.azure.storage.blob.CloudBlob;
import com.microsoft.azure.storage.blob.CloudBlobClient;
import com.microsoft.azure.storage.blob.CloudBlobContainer;

// Connection string template; account name and key are filled in below.
String StorageConnectionStringTemplate = "DefaultEndpointsProtocol=https;" +
        "AccountName=%s;" +
        "AccountKey=%s";
String accountName = "<your storage account name for HDInsight>";
String accountKey = "<your storage account key for HDInsight>";
String containerName = "<container name for HDFS>";
String blobName = "<blob name>";
String storageConnectionString = String.format(StorageConnectionStringTemplate, accountName, accountKey);
CloudStorageAccount storageAccount = CloudStorageAccount.parse(storageConnectionString);
CloudBlobClient client = storageAccount.createCloudBlobClient();
CloudBlobContainer container = client.getContainerReference(containerName);
CloudBlob blob = container.getBlobReferenceFromServer(blobName);
Date lastModifiedDate = blob.getProperties().getLastModified();

Because Hadoop Azure is based on Azure Storage SDK for Java 8.0.0 rather than the newer version 10.0, my sample code above differs from the official Azure Blob Storage tutorial for Java.

If you want to get the Last-Modified property of a container, you can use the REST API Get Container Properties or, in Java, call container.getProperties().getLastModified().
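If you take the REST route, the Last-Modified value arrives as a text header that still has to be parsed. The following is a minimal sketch, assuming the header uses the standard HTTP date format (for example "Wed, 11 Dec 2019 09:55:14 GMT"). The class name LastModifiedHeader and the sample value are hypothetical and only for illustration. The HTTP call itself is left out.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of any Azure SDK.
public class LastModifiedHeader {

    // Parses an HTTP date such as "Wed, 11 Dec 2019 09:55:14 GMT" into an Instant.
    // Assumption: the header value follows the RFC 1123 date format.
    public static Instant parse(String headerValue) {
        return ZonedDateTime
                .parse(headerValue, DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant();
    }

    public static void main(String[] args) {
        // Sample value made up for illustration.
        Instant lastModified = parse("Wed, 11 Dec 2019 09:55:14 GMT");
        System.out.println(lastModified); // prints 2019-12-11T09:55:14Z
    }
}
```

If you need a java.util.Date to match the SDK sample, Date.from(instant) converts the result.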

  2. Use the Hadoop Azure Java API for the wasb:// protocol.

    import java.util.Date;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.FileStatus;
    
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);
    Path f = new Path("<blob path on HDFS>");
    FileStatus fileStatus = hdfs.getFileStatus(f);
    // The modification time (milliseconds since the epoch) comes from FileStatus, not Path.
    long lastModifiedTime = fileStatus.getModificationTime();
    Date lastModifiedDate = new Date(lastModifiedTime);
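FileStatus.getModificationTime() returns milliseconds since the epoch, so it is often handy to turn that number into a readable timestamp. Below is a small sketch of one way to do it with java.time. The class name ModificationTimeFormat, the output pattern, and the choice of UTC are assumptions for illustration, not something the Hadoop API requires.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for formatting the value returned by getModificationTime().
public class ModificationTimeFormat {

    // Assumption: UTC and this pattern are acceptable; adjust both as needed.
    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    // Converts epoch milliseconds into a UTC timestamp string.
    public static String toUtcString(long epochMillis) {
        return FORMATTER.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // Sample value made up for illustration.
        System.out.println(toUtcString(1576058114000L)); // prints 2019-12-11 09:55:14
    }
}
```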
    


Source: https://stackoverflow.com/questions/53584561/databricks-read-azure-blob-last-modified-date
