Find lowest and highest values split into rows from a single string of concatenated values

Question

This is a follow-up to my previous question, for which uzi provided an excellent answer. However, I noticed that a new company, Company3, also uses single data points, such as account 6000. These do not follow the pattern of the previous companies, which makes uzi's recursive CTE inapplicable.

The question therefore needs to change, but I believe this complication warrants a new question rather than an edit to my previous one, because it has a large impact on the solution.

I need to read data from an Excel workbook, where data is stored in this manner:

Company       Accounts
Company1      (#3000...#3999)
Company2      (#4000..#4019)+(#4021..#4024)
Company3      (#5000..#5001)+#6000+(#6005..#6010)

Because some companies, such as Company3, have single account values such as #6000, I believe that in this step I need to create a result set that looks like this:

Company       FirstAcc LastAcc
Company1      3000     3999
Company2      4000     4019
Company2      4021     4024
Company3      5000     5001
Company3      6000     NULL
Company3      6005     6010

I will then JOIN this table with a table containing only integers to produce a final table like the one in my linked question.

Does anyone have any ideas?

Answer 1:

A good T-SQL splitter function makes this quite simple; I suggest delimitedSplit8K. It will also perform significantly better than a recursive CTE. First, the sample data:

-- your sample data
if object_id('tempdb..#yourtable') is not null drop table #yourtable;
create table #yourtable (company varchar(100), accounts varchar(8000));
insert #yourtable values ('Company1','(#3000...#3999)'),
('Company2','(#4000..#4019)+(#4021..#4024)'),('Company3','(#5000..#5001)+#6000+(#6005..#6010)');

and the solution:

select 
  company, 
  firstAcc = max(case when split2.item not like '%)' then clean.Item end),
  lastAcc  = max(case when split2.item     like '%)' then clean.Item end)
from #yourtable t
cross apply dbo.delimitedSplit8K(accounts, '+') split1
cross apply dbo.delimitedSplit8K(split1.Item, '.') split2
cross apply (values (replace(replace(split2.Item,')',''),'(',''))) clean(item)
where split2.item > ''
group by split1.Item, company;

Results:

company   firstAcc   lastAcc
--------- ---------- --------------
Company1  #3000      #3999
Company2  #4000      #4019
Company2  #4021      #4024
Company3  #6000      NULL
Company3  #5000      #5001
Company3  #6005      #6010
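
The results above keep the # prefix and come back as strings, while the question asks for integer FirstAcc and LastAcc columns. Here is a minimal sketch of one way to get there, assuming SQL Server 2016 or later (database compatibility level 130+) so that the built-in STRING_SPLIT can stand in for delimitedSplit8K. That substitution, the inline src sample set and the digits alias are assumptions for illustration and are not part of the answer above:

```sql
-- Sketch only: STRING_SPLIT replaces delimitedSplit8K here (assumption).
with src as (
    -- inline copy of the sample data from the question
    select company, accounts
    from (values
        ('Company1', '(#3000...#3999)'),
        ('Company2', '(#4000..#4019)+(#4021..#4024)'),
        ('Company3', '(#5000..#5001)+#6000+(#6005..#6010)')
    ) v(company, accounts)
)
select
    s.company,
    -- a piece that does not end in ')' is the start of a group (or a single account)
    FirstAcc = max(case when p.value not like '%)' then try_cast(c.digits as int) end),
    -- a piece that ends in ')' is the end of a range; single accounts leave this NULL
    LastAcc  = max(case when p.value     like '%)' then try_cast(c.digits as int) end)
from src s
cross apply string_split(s.accounts, '+') g   -- one row per group
cross apply string_split(g.value, '.') p      -- split the group on dots
cross apply (values (replace(replace(replace(p.value, '(', ''), ')', ''), '#', ''))) c(digits)
where p.value > ''                            -- drop the empty pieces between dots
group by s.company, g.value
order by s.company, FirstAcc;
```

When joining this result to a table of integers, coalesce(LastAcc, FirstAcc) can serve as the upper bound so that single accounts such as 6000 still match exactly one number.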

Answer 2:

I believe that the list (#6005..#6010) is represented as #6005#6006#6007#6008#6009#6010 in your Excel file. Try this query if that is true and there are no gaps:

with cte as (
select 
    company, replace(replace(replace(accounts,'(',''),')',''),'+','')+'#' accounts
from 
    (values ('company 1','#3000#3001#3002#3003'),('company 2','(#4000#4001)+(#4021#4022)'),('company 3','(#5000#5001)+#6000+(#6005#6006)')) data(company, accounts)
)

, rcte as (
    select 
        company, stuff(accounts, ind1, ind2 - ind1, '') acc, substring(accounts, ind1 + 1, ind2 - ind1 - 1) accounts
    from 
        cte
        cross apply (select charindex('#', accounts) ind1) ca
        cross apply (select charindex('#', accounts, ind1 + 1) ind2) cb
    union all
    select
        company, stuff(acc, ind1, ind2 - ind1, ''), substring(acc, ind1 + 1, ind2 - ind1 - 1)
    from
        rcte
        cross apply (select charindex('#', acc) ind1) ca
        cross apply (select charindex('#', acc, ind1 + 1) ind2) cb
    where
        len(acc)>1
)

select
    company, min(accounts) FirstAcc, case when max(accounts)  =min(accounts) then null else max(accounts) end LastAcc
from (
    select
        company, accounts, accounts - row_number() over (partition by company order by accounts) group_
    from 
        rcte
    ) t
group by company, group_

option (maxrecursion 0)
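
The final SELECT in the query above relies on a gaps-and-islands grouping: within a run of consecutive account numbers, the value minus its row number stays constant, so that difference can be used as a group key. Here is a minimal, self-contained sketch of just that step. The inline sample values and the names n and grp are made up for illustration, and the partition by company is left out for brevity:

```sql
-- Sketch of the grouping step only, on hypothetical inline values.
select
    FirstAcc = min(n),
    LastAcc  = case when max(n) = min(n) then null else max(n) end
from (
    select
        n,
        -- consecutive values share the same difference:
        -- 5000-1 = 5001-2 = 4999, 6000-3 = 5997, 6005-4 = 6006-5 = 6001
        grp = n - row_number() over (order by n)
    from (values (5000), (5001), (6000), (6005), (6006)) v(n)
) t
group by grp
order by FirstAcc;
-- returns (5000, 5001), (6000, NULL), (6005, 6006)
```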

Answer 3:

I made a small edit to @uzi's solution from the other question: I added three more CTEs and used window functions such as LEAD() and ROW_NUMBER() to solve the problem. I don't know whether there is a simpler solution, but I think this works well.

with cte as (
select 
    company, replace(replace(replace(accounts,'(',''),')',''),'+','')+'#' accounts 
from 
    (values ('company 1','#3000..#3999'),('company 2','(#4000..#4019)+(#4021..#4024)'),('company 3','(#5000..#5001)+#6000+(#6005..#6010)')) data(company, accounts)
)
, rcte as (
    select 
        company, stuff(accounts, ind1, ind2 - ind1, '') acc, substring(accounts, ind1 + 1, ind2 - ind1 - 1) accounts
    from 
        cte
        cross apply (select charindex('#', accounts) ind1) ca
        cross apply (select charindex('#', accounts, ind1 + 1) ind2) cb
    union all
    select
        company, stuff(acc, ind1, ind2 - ind1, ''), substring(acc, ind1 + 1, ind2 - ind1 - 1)
    from
        rcte
        cross apply (select charindex('#', acc) ind1) ca
        cross apply (select charindex('#', acc, ind1 + 1) ind2) cb
    where
        len(acc)>1
) ,cte2 as (

    select company, accounts as  accounts_raw, Replace( accounts,'..','') as accounts,
        LEAD(accounts) OVER(Partition by company ORDER BY accounts) ld,
        ROW_NUMBER() OVER(ORDER BY accounts) rn 
    from rcte
) , cte3 as (

    Select company,accounts,ld ,rn 
    from cte2 
    WHERE ld not like '%..' 
) , cte4 as (
    select * from cte3 where accounts not in (select ld from cte3 t1 where t1.rn < cte3.rn)
)

SELECT company,accounts,ld from cte4
UNION
SELECT DISTINCT company,ld,NULL from cte3 where accounts not in (select accounts from cte4 t1)

option (maxrecursion 0)
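
The pairing idea behind LEAD() in the query above can be seen in isolation on the tokens that the recursive CTE produces for company 3. What follows is a simplified variant, not the answer's exact logic: it adds LAG() to detect range ends, uses hypothetical inline tokens, and assumes all account numbers have the same number of digits so that string ordering matches numeric ordering:

```sql
-- Simplified sketch: a token ending in '..' starts a range, and LEAD() fetches its end.
select
    FirstAcc = cast(replace(token, '..', '') as int),
    LastAcc  = case when token like '%..' then cast(next_token as int) end
from (
    select
        token,
        next_token = lead(token) over (order by token),
        prev_token = lag(token)  over (order by token)
    from (values ('5000..'), ('5001'), ('6000'), ('6005..'), ('6010')) v(token)
) t
where token like '%..'                                   -- start of a range
   or (prev_token is null or prev_token not like '%..')  -- single account, not a range end
order by FirstAcc;
-- returns (5000, 5001), (6000, NULL), (6005, 6010)
```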

Answer 4:

It looks like you tagged SSIS, so I will provide a solution for that using a Script Component. All the other examples require loading into a staging table.

1. Use your normal source (probably an Excel source) to load the data.
2. Add a Script Component as a transformation.
3. Edit the component:
   - Input Columns: check both Company and Accounts.
   - Inputs and Outputs: add a new output and call it CompFirstLast.
   - Add three columns to it: Company (string), First (int), and Last (int).

Open the script and paste the following code:

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Split on '+' so that each group becomes its own output row
    string[] SplitForRows = Row.Accounts.Split('+'); // single quotes denote a char

    // Process each group and write it to the new output
    for (int i = 0; i < SplitForRows.Length; i++)
    {
        CompFirstLastBuffer.AddRow();
        CompFirstLastBuffer.Company = Row.Company; // same for every row produced from this input row

        // Strip '(', ')' and '.', then drop the leading '#', leaving a '#'-delimited list
        string accts = SplitForRows[i].Replace("(", String.Empty).Replace(")", String.Empty).Replace(".", String.Empty).Substring(1);

        // Split into an array of account numbers
        string[] accounts = accts.Split('#');

        // Write out first and last; Last is NULL for a single account
        CompFirstLastBuffer.First = int.Parse(accounts[0]);

        if (accounts.Length == 1)
            CompFirstLastBuffer.Last_IsNull = true;
        else
            CompFirstLastBuffer.Last = int.Parse(accounts[1]);
    }
}

Make sure downstream components are connected to the CompFirstLast output rather than the component's default output.

Source: https://stackoverflow.com/questions/47972115/find-lowest-and-highest-values-split-into-rows-from-a-single-string-of-concatena

Tags

sql-server

excel

tsql

ssis

etl