Problem
I have built a pipeline with one Copy Data activity that copies data from an Azure Data Lake and outputs it to Azure Blob Storage.
In the output, I can see that some of my rows do not have data and I would like to exclude them from the copy. In the following example, the 2nd row does not have useful data:
{"TenantId":"qa","Timestamp":"2019-03-06T10:53:51.634Z","PrincipalId":2,"ControlId":"729c3b6e-0442-4884-936c-c36c9b466e9d","ZoneInternalId":0,"IsAuthorized":true,"PrincipalName":"John","StreetName":"Rue 1","ExemptionId":8}
{"TenantId":"qa","Timestamp":"2019-03-06T10:59:09.74Z","PrincipalId":null,"ControlId":null,"ZoneInternalId":null,"IsAuthorized":null,"PrincipalName":null,"StreetName":null,"ExemptionId":null}
Question
In the Copy Data activity, how can I add a rule to exclude rows that are missing certain values?
Here is the code of my pipeline:
{
    "name": "pipeline1",
    "properties": {
        "activities": [
            {
                "name": "Copy from Data Lake to Blob",
                "type": "Copy",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [
                    {
                        "name": "Source",
                        "value": "tenantdata/events/"
                    },
                    {
                        "name": "Destination",
                        "value": "controls/"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "AzureDataLakeStoreSource",
                        "recursive": true
                    },
                    "sink": {
                        "type": "BlobSink",
                        "copyBehavior": "MergeFiles"
                    },
                    "enableStaging": false,
                    "translator": {
                        "type": "TabularTranslator",
                        "columnMappings": {
                            "Body.TenantId": "TenantId",
                            "Timestamp": "Timestamp",
                            "Body.PrincipalId": "PrincipalId",
                            "Body.ControlId": "ControlId",
                            "Body.ZoneId": "ZoneInternalId",
                            "Body.IsAuthorized": "IsAuthorized",
                            "Body.PrincipalName": "PrincipalName",
                            "Body.StreetName": "StreetName",
                            "Body.Exemption.Kind": "ExemptionId"
                        }
                    }
                },
                "inputs": [
                    {
                        "referenceName": "qadl",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "datalakestaging",
                        "type": "DatasetReference"
                    }
                ]
            }
        ]
    }
}
Answer 1:
This is a very good question (+1 for that). I had the same question months ago and was surprised that I could not find anything within the Copy Activity to handle this (I even tried the fault tolerance feature, with no luck).
Since I already had other transformations running in my pipelines with U-SQL, I ended up using it for this as well. So instead of a Copy Activity, I have a U-SQL Activity in ADF that filters rows with the IS NOT NULL operator. The exact condition depends on your data, so you may need to adjust it: your strings might contain the literal "NULL" or be empty (""). This is how it looks:
DECLARE @file_set_path string = "adl://myadl.azuredatalake.net/Samples/Data/{date_utc:yyyy}{date_utc:MM}{date_utc:dd}T{date_utc:HH}{date_utc:mm}{date_utc:ss}Z.txt";

// Read the input file set; date_utc is a virtual column parsed from the file name.
@data =
    EXTRACT [id] string,
            date_utc DateTime
    FROM @file_set_path
    USING Extractors.Text(delimiter: '\u0001', skipFirstNRows: 1, quoting: false);

// Keep only the rows that have an id.
@result =
    SELECT [id],
           date_utc.ToString("yyyy-MM-ddTHH:mm:ss") AS SourceExtractDateUTC
    FROM @data
    WHERE id IS NOT NULL; // depending on your data, you can also test id != "" or id != "NULL"

OUTPUT @result
TO "wasb://samples@mywasb/Samples/Data/searchlog.tsv"
USING Outputters.Text(delimiter: '\u0001', outputHeader: true);
Note: both ADLS and Blob Storage are supported as input and output locations for the files.
Let me know if that helps or if the example above does not work for your data. Hopefully somebody will post an answer that uses the Copy Activity, which would be awesome, but this is one possibility so far.
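To sanity-check a filter rule before putting it into a U-SQL script, the same condition can be tried locally on a few sample rows. The sketch below is plain Python, not part of ADF or U-SQL. The helper names (`is_missing`, `keep_row`, `filter_json_lines`) are hypothetical, the sample rows are trimmed versions of the two rows in the question, and the choice of `PrincipalId` and `ControlId` as required fields is only an assumption about which values matter.

```python
import json


def is_missing(value):
    """Treat null, empty strings, and the literal string "NULL" as missing."""
    return value is None or (isinstance(value, str) and value.strip() in ("", "NULL"))


def keep_row(row, required_fields):
    """Return True when every required field carries a usable value."""
    return not any(is_missing(row.get(field)) for field in required_fields)


def filter_json_lines(lines, required_fields):
    """Yield only the JSON lines whose required fields are all present."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if keep_row(json.loads(line), required_fields):
            yield line


# Trimmed versions of the two rows from the question (assumed field subset).
sample = [
    '{"TenantId":"qa","Timestamp":"2019-03-06T10:53:51.634Z","PrincipalId":2,'
    '"ControlId":"729c3b6e-0442-4884-936c-c36c9b466e9d"}',
    '{"TenantId":"qa","Timestamp":"2019-03-06T10:59:09.74Z","PrincipalId":null,'
    '"ControlId":null}',
]

kept = list(filter_json_lines(sample, ["PrincipalId", "ControlId"]))
for line in kept:
    print(line)
```

Note that `is_missing` deliberately does not treat `0` or `false` as missing, so a row with `"ZoneInternalId":0` or `"IsAuthorized":false` would still be kept if those fields were required.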
Source: https://stackoverflow.com/questions/55050376/how-can-i-exclude-rows-in-a-copy-data-activity-in-azure-data-factory