U-SQL Split a CSV file to multiple files based on Distinct values in file

问题

I have the Data in Azure Data Lake Store and I am processing the data present there with Azure Data Analytic Job with U-SQL. I have several CSV files which contain spatial data, similar to this:

File_20170301.csv

    longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    45.121    | 21.123    | 2017-03-01   | 01   | 20
    45.121    | 21.123    | 2017-03-01   | 02   | 10
    45.121    | 21.123    | 2017-03-01   | 03   | 50
    48.121    | 35.123    | 2017-03-01   | 01   | 60
    48.121    | 35.123    | 2017-03-01   | 02   | 15
    48.121    | 35.123    | 2017-03-01   | 03   | 80

File_20170302.csv

    longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    45.121    | 21.123    | 2017-03-02   | 01   | 20
    45.121    | 21.123    | 2017-03-02   | 02   | 10
    45.121    | 21.123    | 2017-03-02   | 03   | 50
    48.121    | 35.123    | 2017-03-02   | 01   | 60
    48.121    | 35.123    | 2017-03-02   | 02   | 15
    48.121    | 35.123    | 2017-03-02   | 03   | 80

Each file contains data for a different day and for all longtitude-lattitude combinations.

I would like to merge all files I have and split the data so I can end up with one file for each longtitude-lattitude combination.

So, after looping through all files in my folder and appending all data for all days, i would end up with the following:

File_45_21.csv

    longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    45.121    | 21.123    | 2017-03-01   | 01   | 20
    45.121    | 21.123    | 2017-03-01   | 02   | 10
    45.121    | 21.123    | 2017-03-01   | 03   | 50
    45.121    | 21.123    | 2017-03-02   | 01   | 20
    45.121    | 21.123    | 2017-03-02   | 02   | 10
    45.121    | 21.123    | 2017-03-02   | 03   | 50

File_48_35.csv

    longtitude| lattitude | date         | hour | value1
    ----------+-----------+--------------+------+-------
    48.121    | 35.123    | 2017-03-01   | 01   | 60
    48.121    | 35.123    | 2017-03-01   | 02   | 15
    48.121    | 35.123    | 2017-03-01   | 03   | 80
    48.121    | 35.123    | 2017-03-02   | 01   | 60
    48.121    | 35.123    | 2017-03-02   | 02   | 15
    48.121    | 35.123    | 2017-03-02   | 03   | 80

In theory the following should happen:

Find distinct values for a combination of longtitude and lattitude in the data
take the above Array of distinct values and create a file for each combination and extract the data to it from the source files, based on the two parameters (longtitude and lattitude)

What I am struggling with is how to get started on the looping and the extraction of data based on two parameters in the source and how to "partition" the data source by the distinct values of the combination of parameters.

回答1:

U-SQL does not support dynamic U-SQL directly, but it is possible to use "script the script" techniques to create your output. You could then manually run this output, or use something like Powershell or Azure Data Factory to run it.

I have created a simple example based on your test data partially based on the example from here.

// Get the initial fileset
@input =
    EXTRACT longtitude float,
            lattitude float,
            date string,
            hour int,
            value1 int,
            filename string
    FROM "/input/File_201703{filename}"
    USING Extractors.Csv();


// Add int version of the long and lat columns for grouping on
@working =
    SELECT *,
           (int) longtitude AS int_long,
           (int) lattitude AS int_lat
    FROM @input;


// Work out the filenames
@filenames =
    SELECT String.Format("File {0}_{1}.csv", int_long, int_lat) AS outputFilename,
           int_long,
           int_lat
    FROM
    (
        SELECT int_long,
               int_lat
        FROM @working
        GROUP BY int_long,
                 int_lat
    ) AS x;


// Construct the dynamic usql and output it
@output =
    SELECT x.xsort, "@input = EXTRACT longtitude float, lattitude float, date string, hour int, value1 int, filename string FROM \"input/File_201703{filename}\" USING Extractors.Csv();" AS usql
   FROM ( VALUES ( 10 ) ) AS x(xsort)

    UNION ALL

    SELECT x.xsort, "@working = SELECT *, (int) longtitude AS int_long, (int) lattitude AS int_lat FROM @input;" AS usql
    FROM ( VALUES ( 20 ) ) AS x(xsort)

    UNION ALL

    SELECT 30 AS xsort, String.Format("OUTPUT (SELECT * FROM @working WHERE int_long == {0} AND int_lat == {1}) TO \"/output/{2}\" USING Outputters.Csv();", int_long, int_lat, outputFilename) AS usql
    FROM @filenames;


// Select only the usql column and sort the output
@output =
    SELECT usql
    FROM @output
ORDER BY xsort
FETCH 100;


OUTPUT @output
TO "/output/dynamic.usql"
USING Outputters.Text(delimiter : ' ', quoting : false);

回答2:

I would recommend to look at the notion of File Sets to operate on many files (see https://msdn.microsoft.com/en-us/library/azure/mt771650.aspx) and some of the suggestions to do dynamic outputs based on values until that feature becomes available (see How do I partition a large file into files/directories using only U-SQL and certain fields in the file? as an example).

来源：https://stackoverflow.com/questions/43227421/u-sql-split-a-csv-file-to-multiple-files-based-on-distinct-values-in-file

标签

azure

output

azure-data-lake

u-sql

data-partitioning