azure-data-lake

Data Lake Analytics U-SQL EXTRACT speed (Local vs Azure)

Submitted by 人盡茶涼 on 2019-12-06 02:08:06
I've been looking into using the Azure Data Lake Analytics functionality to try and manipulate some Gzip'd XML data I have stored in Azure Blob Storage, but I'm running into an interesting issue. Essentially, when using U-SQL locally to process 500 of these XML files, the processing time is extremely quick: roughly 40 seconds using 1 AU locally (which appears to be the limit). However, when we run the same job in Azure using 5 AUs, the processing takes 17+ minutes. We eventually want to scale this up to ~20,000 files and more, but have reduced the set to try and measure…
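For reference, a minimal sketch of the file-set style EXTRACT such a job typically uses; the storage account, container, path pattern, and schema below are hypothetical. U-SQL's built-in extractors decompress .gz files automatically based on the file extension, and the {FileName} pattern lets one script fan out over many blobs, which is where additional AUs are supposed to pay off:

    // Hedged sketch: hypothetical storage account, container, and path pattern.
    // {FileName} is a virtual column populated from the file-set pattern.
    @lines =
        EXTRACT Line string,
                FileName string
        FROM "wasb://xmlcontainer@mystorageaccount.blob.core.windows.net/data/{FileName}.xml.gz"
        USING Extractors.Text(delimiter:'\b', quoting:false);

    // Trivial per-file aggregate, just to force a full read of every file.
    @counts =
        SELECT FileName, COUNT(*) AS LineCount
        FROM @lines
        GROUP BY FileName;

    OUTPUT @counts
    TO "/output/line_counts.csv"
    USING Outputters.Csv();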

Value too long failure when attempting to convert column data

Submitted by 爱⌒轻易说出口 on 2019-12-05 23:31:52
Scenario: I have a source file that contains a block of JSON on each new line. I then have a simple U-SQL EXTRACT as follows, where [RawString] represents each new line in the file and [FileName] is defined as a variable from the @SourceFile path: @BaseExtract = EXTRACT [RawString] string, [FileName] string FROM @SourceFile USING Extractors.Text(delimiter:'\b', quoting:false); This executes without failure for the majority of my data, and I'm able to parse [RawString] as JSON further down in my script without any problems. However, I seem to have an extra-long row of data in a recent…
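A hedged workaround sketch, assuming the failure comes from U-SQL's limit on the size of a single string column value (roughly 128 KB when UTF-8 encoded): the silent parameter asks the built-in extractor to skip rows it cannot convert instead of failing the whole job. If the oversized rows actually need to be kept, another option is to extract the column as byte[] and decode/split it in a later step.

    // Hedged sketch reusing the question's schema; silent:true drops rows that
    // fail extraction instead of aborting the job (assumed to cover the
    // over-long value described above).
    @BaseExtract =
        EXTRACT [RawString] string,
                [FileName] string
        FROM @SourceFile
        USING Extractors.Text(delimiter:'\b', quoting:false, silent:true);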

Connect Azure Event Hubs with Data Lake Store

Submitted by 感情迁移 on 2019-12-05 14:11:46
What is the best way to send data from Event Hubs to Data Lake Store? I am assuming you want to ingest data from Event Hubs into Data Lake Store on a regular basis. As Nava said, you can use Azure Stream Analytics to get data from Event Hubs into Azure Storage Blobs. Thereafter you can use Azure Data Factory (ADF) to copy the data on a scheduled basis from Blobs to Azure Data Lake Store. More details on using ADF are available here: https://azure.microsoft.com/en-us/documentation/articles/data-factory-azure-datalake-connector/ . Hope this helps. == March 17, 2016 update: support for Azure Data Lake…

Reasons to use Azure Data Lake Analytics vs Traditional ETL approach

Submitted by 一曲冷凌霜 on 2019-12-05 05:33:13
I'm considering using Data Lake technologies, which I have been studying for the last few weeks, compared with the traditional ETL SSIS scenarios I have been working with for so many years. I think of Data Lake as something very tied to big data, but where is the line between using Data Lake technologies and SSIS? Is there any advantage to using Data Lake technologies with 25 MB ~ 100 MB ~ 300 MB files? Parallelism? Flexibility? Extensibility in the future? Is there any performance gain when the files to be loaded are not as big as U-SQL's best-case scenario? What are your thoughts? Would it be like…

Parse json file in U-SQL

Submitted by 一世执手 on 2019-12-05 04:55:51
I'm trying to parse the JSON file below using U-SQL but keep getting an error. JSON file: {"dimBetType_SKey":1,"BetType_BKey":1,"BetTypeName":"Test1"} {"dimBetType_SKey":2,"BetType_BKey":2,"BetTypeName":"Test2"} {"dimBetType_SKey":3,"BetType_BKey":3,"BetTypeName":"Test3"} Below is the U-SQL script with which I'm trying to extract the data from the file above: REFERENCE ASSEMBLY [Newtonsoft.Json]; REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats]; DECLARE @Full_Path string = "adl://xxxx.azuredatalakestore.net/2017/03/28/00_0_66ffdd26541742fab57139e95080e704.json"; DECLARE @Output_Path = "adl://xxxx…
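A hedged sketch of one common way to handle newline-delimited JSON like this, reusing @Full_Path and @Output_Path from the question and assuming the Microsoft.Analytics.Samples.Formats sample assembly has been built and registered in the database: read each line as a raw string, then turn it into a key/value map with JsonTuple.

    REFERENCE ASSEMBLY [Newtonsoft.Json];
    REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

    // Read each line of the file as one string column.
    @lines =
        EXTRACT RawString string
        FROM @Full_Path
        USING Extractors.Text(delimiter:'\b', quoting:false);

    // Turn each JSON line into a key/value map.
    @parsed =
        SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(RawString) AS Fields
        FROM @lines;

    // Pull out the three properties from the question.
    @result =
        SELECT Int32.Parse(Fields["dimBetType_SKey"]) AS dimBetType_SKey,
               Int32.Parse(Fields["BetType_BKey"]) AS BetType_BKey,
               Fields["BetTypeName"] AS BetTypeName
        FROM @parsed;

    OUTPUT @result
    TO @Output_Path
    USING Outputters.Csv();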

Azure Databricks vs ADLA for processing

Submitted by 夙愿已清 on 2019-12-04 22:27:00
Question: Presently, I have all my data files in Azure Data Lake Store. I need to process these files, which are mostly in CSV format. The processing consists of running jobs on these files to extract various information, e.g. data for certain date ranges, events related to a scenario, or adding data from multiple tables/files. These jobs run every day as U-SQL jobs in Data Factory (v1 or v2) and the results are then sent to Power BI for visualization. Using ADLA for all this processing, I feel it takes…

How can I log something in a U-SQL UDO?

Submitted by 我是研究僧i on 2019-12-04 11:44:53
I have a custom extractor, and I'm trying to log some messages from it. I've tried obvious things like Console.WriteLine, but I cannot find where the output goes. However, I found some system logs in adl://<my_DLS>.azuredatalakestore.net/system/jobservice/jobs/Usql/.../<my_job_id>/ . How can I log something? Is it possible to specify a log file somewhere on Data Lake Store or a Blob Storage account? A recent release of U-SQL has added diagnostic logging for UDOs; see the release notes here. // Enable the diagnostics preview feature SET @@FeaturePreviews = "DIAGNOSTICS:ON"; // Extract as one column @input…
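To show how the answer's fragments fit together, a hedged sketch of a full script; MyAssembly and MyNamespace.MyExtractor are hypothetical stand-ins for the asker's registered assembly and custom extractor. With the preview enabled, the messages the UDO writes are expected to land in the job's diagnostics output in the store rather than on a console, per the release notes referenced above.

    // Hypothetical assembly containing the custom extractor.
    REFERENCE ASSEMBLY MyAssembly;

    // Enable the diagnostics preview feature (from the answer above).
    SET @@FeaturePreviews = "DIAGNOSTICS:ON";

    // Extract as one column using the custom (logging) extractor.
    @input =
        EXTRACT RawLine string
        FROM "/input/data.txt"
        USING new MyNamespace.MyExtractor();

    OUTPUT @input
    TO "/output/data_copy.csv"
    USING Outputters.Csv();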

Upload to ADLS from file stream

Submitted by 二次信任 on 2019-12-04 11:43:36
I am making a custom activity in ADF which involves reading multiple files from Azure Blob Storage, doing some work on them, and finally writing a resulting file to Azure Data Lake Store. The last step is where I'm stuck, because as far as I can see, the .NET SDK only allows uploading from a local file. Is there any way to (programmatically) upload a file to ADL Store that does not come from a local file? It could be a blob or a stream. If not, are there any workarounds? Yes, it's possible to upload from a Stream; the trick is to create the file first and then append your stream to it: string…

How to access Azure datalake using the webhdfs API

Submitted by ﹥>﹥吖頭↗ on 2019-12-03 21:57:07
Question: We're just getting started evaluating the Data Lake service at Azure. We created our lake, and via the portal we can see the two public URLs for the service. (One is an https:// scheme, the other an adl:// scheme.) The Data Lake documentation states that there are indeed two interfaces: the WebHDFS REST API, and ADL. So, I am assuming the https:// scheme gets me the WebHDFS interface. However, I can find no more information at Azure about using this interface. I tried poking at the given https://…

Azure Databricks vs ADLA for processing

Submitted by 假装没事ソ on 2019-12-03 14:55:38
Presently, I have all my data files in Azure Data Lake Store. I need to process these files, which are mostly in CSV format. The processing consists of running jobs on these files to extract various information, e.g. data for certain date ranges, events related to a scenario, or adding data from multiple tables/files. These jobs run every day as U-SQL jobs in Data Factory (v1 or v2) and the results are then sent to Power BI for visualization. Using ADLA for all this processing, I feel it takes a lot of time to process and seems very expensive. I got a suggestion that I should use Azure…