azure-data-lake

Copying and Extracting Zipped XML Files from an HTTP Source to Azure Blob Storage using Azure Data Factory

徘徊边缘 submitted on 2021-02-19 08:48:05
Question: I am trying to set up an Azure Data Factory copy data pipeline. The source is an open HTTP source (URL: https://clinicaltrials.gov/AllPublicXML.zip), so the source is essentially a zipped archive containing many XML files. I want to unzip the archive and save the extracted XML files to Azure Blob Storage using Azure Data Factory. I tried to follow the configuration described here: How to decompress a zip file in Azure Data Factory v2, but I am getting the following error:
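
As a point of comparison, and not the ADF configuration the linked answer describes, the same download-unzip-upload flow can be sketched in a few lines of Python; this is sometimes useful for sanity-checking the source before debugging the pipeline. The connection string and the clinical-xml container name are assumptions, and the large archive is held in memory only for brevity.

```python
# Illustrative workaround outside ADF: fetch the zip, extract the XML files,
# and upload each one as its own blob. All names here are placeholders.
import io
import zipfile

import requests
from azure.storage.blob import BlobServiceClient

SOURCE_URL = "https://clinicaltrials.gov/AllPublicXML.zip"
CONN_STR = "<storage-account-connection-string>"   # assumption: supplied via config

service = BlobServiceClient.from_connection_string(CONN_STR)
container = service.get_container_client("clinical-xml")

# Stream the archive into memory; for very large archives, spill to a temp file instead.
archive = zipfile.ZipFile(io.BytesIO(requests.get(SOURCE_URL).content))

for name in archive.namelist():
    if name.endswith(".xml"):
        # Upload each extracted XML file as an individual blob.
        container.upload_blob(name=name, data=archive.read(name), overwrite=True)
```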

Azure Data Factory: Set a limit on the number of files copied using the Copy activity

落花浮王杯 submitted on 2021-02-11 14:01:10
Question: I have a Copy activity in my pipeline that copies files from Azure Data Lake Gen2. The source location may hold thousands of files, and they all need to be copied eventually, but we need to cap the number of files copied per run. Is there any option in ADF to achieve this, short of a custom activity? For example: I have 2000 files in the data lake, but when running the pipeline I should be able to pass a parameter to copy only 500 of them. Regards, Sandeep. Answer 1: I think you can
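
For comparison, here is a minimal Python sketch of the "enumerate first, then cap the list" idea using the ADLS Gen2 SDK rather than a native ADF setting. The account name, file system, input folder, and the 500-file limit are assumptions.

```python
# List files in an ADLS Gen2 folder and keep only the first N names;
# the resulting list can then be fed to whatever performs the actual copy.
from itertools import islice

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("myfilesystem")

# Skip directories and take only the first 500 file paths.
files = (p.name for p in fs.get_paths(path="input") if not p.is_directory)
selected = list(islice(files, 500))
print(f"{len(selected)} files selected for this run")
```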

How to add a validation to an Azure Data Factory pipeline to check file size?

感情迁移 submitted on 2021-02-08 11:49:17
Question: I have multiple data sources, and I want to add a validation step in Azure Data Factory before loading into tables: it should check the file size to make sure the file is not empty. If the file size is more than 10 KB, i.e. the file is not empty, loading should start; if it is empty, loading should not start. I looked at the Validation activity in Azure Data Factory, but it does not report sizes for multiple files in a folder. Any suggestions appreciated, in particular whether I can add a Python notebook for this validation
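
A hedged sketch of the Python-notebook style validation the question hints at: list every file in the source folder and refuse to start the load if any of them is under 10 KB. The account, file system, and folder names are assumptions; in a Databricks notebook the same outcome could be returned to ADF via dbutils.notebook.exit instead of an exception.

```python
# Fail fast if any file in the incoming folder is empty or smaller than 10 KB.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

MIN_SIZE_BYTES = 10 * 1024  # 10 KB threshold from the question

service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("raw")

too_small = [
    p.name
    for p in fs.get_paths(path="incoming")
    if not p.is_directory and p.content_length < MIN_SIZE_BYTES
]

if too_small:
    # Abort the load; ADF can react to this failed activity and skip the copy.
    raise ValueError(f"Empty or undersized files found: {too_small}")
```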

Azure Function binding for Azure Data Lake (Python)

你说的曾经没有我的故事 submitted on 2021-01-29 09:19:26
Question: I have a requirement to connect to my Azure Data Lake Gen2 (ADLS) from Azure Functions, read a file, process it using Python (PySpark), and write it back to the data lake, so both my input and output bindings would point to ADLS. Is there any ADLS binding available for Azure Functions in Python? Could somebody give any suggestions on this? Thanks, Anten D. Answer 1: Update: 1. When we read the data, we can use a blob input binding. 2. But when we write the data, we cannot use a blob output binding.
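
A minimal sketch of the pattern in the answer, assuming the Azure Functions Python v1 programming model: the incoming file arrives through a blob binding (myblob), and the result is written back to ADLS Gen2 with the storage SDK because there is no dedicated ADLS output binding. The account URL, file system, output path, and the placeholder processing step are assumptions; the accompanying function.json is omitted.

```python
# Read via the blob binding, then write the result to ADLS Gen2 with the SDK.
import azure.functions as func
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient


def main(myblob: func.InputStream) -> None:
    processed = myblob.read().upper()  # placeholder for the real processing logic

    service = DataLakeServiceClient(
        account_url="https://myaccount.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    fs = service.get_file_system_client("output")
    file_client = fs.get_file_client(f"processed/{myblob.name.split('/')[-1]}")
    file_client.upload_data(processed, overwrite=True)
```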

Can't list the file system of an Azure Data Lake with JavaScript

断了今生、忘了曾经 submitted on 2021-01-29 05:02:13
Question: I'm trying to list the paths within a file system in Azure Data Lake using the code below. I'm able to retrieve ${fileSystem.name}, but .listPaths() fails with a permission error: (node:15660) UnhandledPromiseRejectionWarning: RestError: This request is not authorized to perform this operation using this permission. I'm not sure what permissions I need to provide; the service principal has Owner access over the data lake storage account and also has the API permissions. The code: const {
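
This error is typically a data-plane authorization problem: an Owner role assignment only covers the management plane, so the service principal usually also needs a data role such as Storage Blob Data Reader (or Contributor) on the storage account before listPaths succeeds. For reference, a sketch of the equivalent listing in the Python SDK; the JavaScript @azure/storage-file-datalake client follows the same shape. The tenant/client IDs and account name are placeholders.

```python
# List every file system and its paths with a service principal credential.
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = ClientSecretCredential("<tenant-id>", "<client-id>", "<client-secret>")
service = DataLakeServiceClient(
    "https://myaccount.dfs.core.windows.net", credential=credential
)

for fs in service.list_file_systems():
    print(fs.name)
    # Fails with the same RestError if the principal lacks a data-plane role.
    for path in service.get_file_system_client(fs.name).get_paths():
        print(" ", path.name)
```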

Efficient way of reading Parquet files within a date range in Azure Databricks

柔情痞子 submitted on 2021-01-29 04:31:55
Question: I would like to know whether the pseudocode below is an efficient way to read, from PySpark (Azure Databricks), multiple Parquet files stored in Azure Data Lake that fall within a date range. Note: the Parquet files are not partitioned by date. I'm using the uat/EntityName/2019/01/01/EntityName_2019_01_01_HHMMSS.parquet convention for storing data in ADL, as suggested in the book Big Data by Nathan Marz, with a slight modification (using 2019 instead of year=2019). Read all data using the * wildcard: df = spark.read.parquet
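
Here is a minimal PySpark sketch of the non-wildcard alternative: generate the explicit day folders for the requested range and hand that list to spark.read.parquet, so Spark only lists directories that can contain relevant data. The account name and URI scheme (adl:// for Gen1, abfss:// for Gen2) are assumptions.

```python
# Build one glob per day in the range instead of scanning everything with *.
from datetime import date, timedelta

base = "adl://mylake.azuredatalakestore.net/uat/EntityName"  # placeholder account/scheme


def day_paths(start: date, end: date):
    d = start
    while d <= end:
        yield f"{base}/{d:%Y/%m/%d}/*.parquet"
        d += timedelta(days=1)


paths = list(day_paths(date(2019, 1, 1), date(2019, 1, 31)))
df = spark.read.parquet(*paths)  # `spark` is the SparkSession Databricks provides
```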

Copy different types of files from an Azure Data Lake Gen1 to Gen2 with attributes (like last updated)

余生颓废 submitted on 2021-01-28 06:24:51
Question: I need to migrate all my data from Azure Data Lake Gen1 to Gen2. The lake contains a mix of different file types (.txt, .zip, .json and many others). We want to move them as-is to the Gen2 lake, and we also want to keep each file's last-updated time the same as in the Gen1 lake. I was looking at ADF for this use case, but that requires defining a dataset, and to define a dataset we have to choose a data format (Avro, JSON, XML, Binary, etc.). Since we have different data types mixed together, I tried
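
If ADF's format-bound datasets get in the way, one hedged alternative is a small Python copy job: read each Gen1 file as raw bytes with the azure-datalake-store SDK and re-upload it to Gen2 with azure-storage-file-datalake, carrying the Gen1 modification time along as user metadata (the Gen2 Last-Modified property itself is service-managed and cannot be set directly). Store names, paths, and credentials are assumptions, and this sketch ignores retries and large-file chunking.

```python
# Copy files of any type from a Gen1 store to a Gen2 file system,
# preserving the Gen1 modification timestamp as user metadata.
from azure.datalake.store import core, lib                    # Gen1 SDK
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient  # Gen2 SDK

TENANT, CLIENT, SECRET = "<tenant-id>", "<client-id>", "<client-secret>"

token = lib.auth(tenant_id=TENANT, client_id=CLIENT, client_secret=SECRET)
gen1 = core.AzureDLFileSystem(token, store_name="mygen1lake")

gen2 = DataLakeServiceClient(
    "https://mygen2account.dfs.core.windows.net",
    credential=ClientSecretCredential(TENANT, CLIENT, SECRET),
).get_file_system_client("migrated")

for path in gen1.walk("/data"):          # every file under /data, regardless of extension
    info = gen1.info(path)               # WebHDFS-style metadata incl. modificationTime
    with gen1.open(path, "rb") as src:
        dest = gen2.get_file_client(path.lstrip("/"))
        dest.upload_data(
            src.read(),
            overwrite=True,
            metadata={"gen1_last_updated": str(info["modificationTime"])},
        )
```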

Process an Azure Data Lake Store file using an Azure Function

前提是你 submitted on 2021-01-28 01:40:22
Question: Files arrive in a particular folder of my Azure Data Lake Store at regular intervals. As soon as a file arrives, I want to process it further using an Azure Function. Is that possible? Answer 1: UPDATE: With Multi-Protocol Access for Azure Data Lake Storage, the storage extension should indeed work, and some basic tests confirm that. There are open issues [1, 2] awaiting official confirmation of support. Although Azure Data Lake Storage (ADLS) Gen2 is built upon Azure Blob Storage, there are a couple of
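
A sketch of the blob-trigger approach from the answer, written against the Azure Functions Python v2 programming model: because ADLS Gen2 exposes its data through the Blob endpoint (multi-protocol access), a blob trigger on the corresponding container fires when a new file lands in the watched folder. The container/folder path and the connection app setting are assumptions.

```python
# Fire on every new file under landing/incoming and hand it to downstream processing.
import logging

import azure.functions as func

app = func.FunctionApp()


@app.blob_trigger(arg_name="newfile",
                  path="landing/incoming/{name}",
                  connection="DataLakeStorageConnection")
def process_new_file(newfile: func.InputStream) -> None:
    logging.info("Processing %s (%s bytes)", newfile.name, newfile.length)
    # ... downstream processing goes here ...
```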

How to throw an error or raise an exception in U-SQL?

最后都变了- submitted on 2021-01-28 00:31:44
Question: What is the mechanism for raising an error or exception in a U-SQL script? I have a scenario where I am processing a CSV file, and if duplicates are found in it, I need to abandon processing. In SQL I could use RAISERROR; what is the equivalent way of doing it in U-SQL? Answer 1: Create a C# function to raise custom errors (or output them to a file): DECLARE @RaiseError Func<string, int> = (error) => { throw new Exception(error); return 0; }; @Query = SELECT @RaiseError(value) AS ErrorCode FROM

How to write Azure Machine Learning batch scoring results to a data lake?

非 Y 不嫁゛ submitted on 2021-01-27 22:02:02
Question: I'm trying to write the output of batch scoring to the data lake: parallel_step_name = "batchscoring-" + datetime.now().strftime("%Y%m%d%H%M") output_dir = PipelineData(name="scores", datastore=def_ADL_store, output_mode="upload", output_path_on_compute="path in data lake") parallel_run_config = ParallelRunConfig( environment=curated_environment, entry_script="use_model.py", source_directory="./", output_action="append_row", mini_batch_size="20", error_threshold=1, compute_target=compute_target,
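
For context, a hedged sketch of how the def_ADL_store datastore referenced above might be registered as an ADLS Gen2 datastore with a service principal before being handed to PipelineData. The workspace configuration, account/filesystem names, and credentials are assumptions, and whether PipelineData's upload mode accepts this datastore type depends on the SDK version.

```python
# Register an ADLS Gen2 account as an Azure ML datastore via a service principal.
from azureml.core import Datastore, Workspace

ws = Workspace.from_config()  # assumes a local config.json for the workspace

def_ADL_store = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name="batch_scores_adls",
    filesystem="scores",              # Gen2 container / file system
    account_name="mygen2account",
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)

print(def_ADL_store.name, def_ADL_store.datastore_type)
```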