How can I exclude rows in a Copy Data Activity in Azure Data Factory?

Submitted by 风流意气都作罢 on 2019-12-24 02:15:30

Question


I have built a pipeline with a single Copy Data activity that copies data from an Azure Data Lake and outputs it to Azure Blob Storage.

In the output, I can see that some of my rows do not have data, and I would like to exclude them from the copy. In the following example, the second row has no useful data:

{"TenantId":"qa","Timestamp":"2019-03-06T10:53:51.634Z","PrincipalId":2,"ControlId":"729c3b6e-0442-4884-936c-c36c9b466e9d","ZoneInternalId":0,"IsAuthorized":true,"PrincipalName":"John","StreetName":"Rue 1","ExemptionId":8}
{"TenantId":"qa","Timestamp":"2019-03-06T10:59:09.74Z","PrincipalId":null,"ControlId":null,"ZoneInternalId":null,"IsAuthorized":null,"PrincipalName":null,"StreetName":null,"ExemptionId":null}

Question

In the Copy Data activity, how can I define a rule to exclude rows that are missing certain values?

Here is the code of my pipeline:

{
    "name": "pipeline1",
    "properties": {
        "activities": [
            {
                "name": "Copy from Data Lake to Blob",
                "type": "Copy",
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [
                    {
                        "name": "Source",
                        "value": "tenantdata/events/"
                    },
                    {
                        "name": "Destination",
                        "value": "controls/"
                    }
                ],
                "typeProperties": {
                    "source": {
                        "type": "AzureDataLakeStoreSource",
                        "recursive": true
                    },
                    "sink": {
                        "type": "BlobSink",
                        "copyBehavior": "MergeFiles"
                    },
                    "enableStaging": false,
                    "translator": {
                        "type": "TabularTranslator",
                        "columnMappings": {
                            "Body.TenantId": "TenantId",
                            "Timestamp": "Timestamp",
                            "Body.PrincipalId": "PrincipalId",
                            "Body.ControlId": "ControlId",
                            "Body.ZoneId": "ZoneInternalId",
                            "Body.IsAuthorized": "IsAuthorized",
                            "Body.PrincipalName": "PrincipalName",
                            "Body.StreetName": "StreetName",
                            "Body.Exemption.Kind": "ExemptionId"
                        }
                    }
                },
                "inputs": [
                    {
                        "referenceName": "qadl",
                        "type": "DatasetReference"
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "datalakestaging",
                        "type": "DatasetReference"
                    }
                ]
            }
        ]
    }
}

Answer 1:


This is a very good question (+1 for that). I had the same question months back and was surprised that I could not find anything within the Copy activity to handle this (I even tried the fault tolerance feature, but no luck).

Given that I already had other transformations running in my pipelines with U-SQL, I ended up using it to accomplish this. So, instead of a Copy activity I have a U-SQL activity in ADF that filters with the IS NOT NULL operator. It depends on your data, but you can play with that; maybe your strings contain "NULL" or empty strings "". This is how it looks:

DECLARE @file_set_path string = "adl://myadl.azuredatalake.net/Samples/Data/{date_utc:yyyy}{date_utc:MM}{date_utc:dd}T{date_utc:HH}{date_utc:mm}{date_utc:ss}Z.txt";

@data =
    EXTRACT [id] string,
            date_utc DateTime
    FROM @file_set_path
    USING Extractors.Text(delimiter: '\u0001', skipFirstNRows: 1, quoting: false);

@result =
    SELECT [id],
           date_utc.ToString("yyyy-MM-ddTHH:mm:ss") AS SourceExtractDateUTC
    FROM @data
    WHERE [id] IS NOT NULL; -- you can also use WHERE [id] <> "" or [id] <> "NULL"

OUTPUT @result
TO "wasb://samples@mywasb/Samples/Data/searchlog.tsv"
USING Outputters.Text(delimiter: '\u0001', outputHeader: true);

Note: both ADLS and Blob Storage are supported as input/output file locations.
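In case it helps, here is a rough sketch of how such a script could be plugged into your pipeline as a Data Lake Analytics U-SQL activity in place of the Copy activity. The activity name, linked service names, and script path below are placeholders of mine, not taken from your setup, so adapt them to your environment:

{
    "name": "Filter empty rows with U-SQL",
    "type": "DataLakeAnalyticsU-SQL",
    "linkedServiceName": {
        "referenceName": "MyAzureDataLakeAnalyticsLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "scriptPath": "scripts/FilterEmptyRows.usql",
        "scriptLinkedService": {
            "referenceName": "MyAzureDataLakeStoreLinkedService",
            "type": "LinkedServiceReference"
        },
        "degreeOfParallelism": 3,
        "priority": 100
    }
}

The script itself sits in the Data Lake Store referenced by scriptLinkedService, and the activity runs it on the Data Lake Analytics account referenced by linkedServiceName.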

Let me know if that helps or if the example above does not work for your data. Hopefully somebody will post an answer using the Copy activity, which would be great, but this is one possibility so far.



Source: https://stackoverflow.com/questions/55050376/how-can-i-exclude-rows-in-a-copy-data-activity-in-azure-data-factory
