Generating the Shortest Regex Dynamically from a Source List of Strings

一整个雨季 2021-01-04 10:13

I have a bunch of SKUs (stock keeping units), a series of strings for which I'd like to create a single regex that matches all of them.

So, for example, if I have SKUs

3 Answers
  •  没有蜡笔的小新
    2021-01-04 11:12

    This is what I finally worked out:

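    using System;
    using System.Collections.Generic;
    using System.Linq;

    var skus = new[] { "BATPAG003", "BATTWLP03", "BATTWLP04", "BATTWSP04", "SPIFATB01" };
    
    // Declared before generate so the two delegates can call each other recursively.
    Func<IEnumerable<IGrouping<string, string>>, IEnumerable<string>> regexify = null;
    
    // For each prefix length n, group the strings by their first n characters and
    // build candidate patterns from the groups. Lengths that share no prefix are skipped.
    Func<IEnumerable<string>, IEnumerable<string>> generate =
        xs =>
            from n in Enumerable.Range(2, 20)
            let g = xs.GroupBy(x => new String(x.Take(n).ToArray()), x => new String(x.Skip(n).ToArray()))
            where g.Count() != xs.Count()
            from r in regexify(g)
            select r;
    
    // Combine every pattern for the first group (suffixes joined with "|", or factored
    // further via generate) with every pattern for the remaining groups.
    regexify = gxs =>
    {
        if (!gxs.Any())
        {
            return new [] { "" };
        }
        else
        {
            var rs = regexify(gxs.Skip(1)).ToArray();
            return
                from f in gxs.Take(1)
                from z in new [] { String.Join("|", f) }.Concat(f.Count() > 1 ? generate(f) : Enumerable.Empty<string>())
                from r in rs
                select f.Key + (f.Count() == 1 ? z : $"({z})") + (r != "" ? "|" + r : "");
        }
    };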
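
    The grouping step inside generate is easier to follow in isolation. The sketch below runs it for a single prefix length; the fixed value n = 3 is chosen only for illustration, whereas generate tries every length from 2 to 21.

    ```csharp
    using System;
    using System.Linq;

    var skus = new[] { "BATPAG003", "BATTWLP03", "BATTWLP04", "BATTWSP04", "SPIFATB01" };
    int n = 3; // illustrative prefix length (an assumption for this demo)

    // Key: the first n characters. Elements: whatever follows the prefix.
    var groups = skus.GroupBy(
        x => new string(x.Take(n).ToArray()),
        x => new string(x.Skip(n).ToArray()));

    foreach (var g in groups)
        Console.WriteLine($"{g.Key} -> {string.Join("|", g)}");
    // BAT -> PAG003|TWLP03|TWLP04|TWSP04
    // SPI -> FATB01
    ```

    Each group then becomes "prefix(suffix|suffix|...)", and the suffixes of any group with more than one member are factored again in the same way.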
    

    Then using this query:

    generate(skus).OrderBy(x => x.Length).ThenBy(x => x);
    

    ...I got this result:

    BAT(PAG003|TW(LP0(3|4)|SP04))|SPIFATB01 
    BAT(PAG003|TWLP0(3|4)|TWSP04)|SPIFATB01 
    BA(TPAG003|TTW(LP0(3|4)|SP04))|SPIFATB01 
    BAT(PAG003|TW(LP(03|04)|SP04))|SPIFATB01 
    BAT(PAG003|TW(LP03|LP04|SP04))|SPIFATB01 
    BAT(PAG003|TWLP(03|04)|TWSP04)|SPIFATB01 
    BATPAG003|BATTW(LP0(3|4)|SP04)|SPIFATB01 
    BA(TPAG003|TT(WLP0(3|4)|WSP04))|SPIFATB01 
    BA(TPAG003|TTW(LP(03|04)|SP04))|SPIFATB01 
    BA(TPAG003|TTW(LP03|LP04|SP04))|SPIFATB01 
    BA(TPAG003|TTWLP0(3|4)|TTWSP04)|SPIFATB01 
    BAT(PAG003|TWL(P0(3|4))|TWSP04)|SPIFATB01 
    BAT(PAG003|TWL(P03|P04)|TWSP04)|SPIFATB01 
    BATPAG003|BATT(WLP0(3|4)|WSP04)|SPIFATB01 
    BATPAG003|BATTW(LP(03|04)|SP04)|SPIFATB01 
    BATPAG003|BATTW(LP03|LP04|SP04)|SPIFATB01 
    BA(TPAG003|TT(WLP(03|04)|WSP04))|SPIFATB01 
    BA(TPAG003|TTWLP(03|04)|TTWSP04)|SPIFATB01 
    BAT(PAG003|TWLP03|TWLP04|TWSP04)|SPIFATB01 
    BATPAG003|BATT(WLP(03|04)|WSP04)|SPIFATB01 
    BA(TPAG003|TT(WL(P0(3|4))|WSP04))|SPIFATB01 
    BA(TPAG003|TT(WL(P03|P04)|WSP04))|SPIFATB01 
    BA(TPAG003|TT(WLP03|WLP04|WSP04))|SPIFATB01 
    BA(TPAG003|TTWL(P0(3|4))|TTWSP04)|SPIFATB01 
    BA(TPAG003|TTWL(P03|P04)|TTWSP04)|SPIFATB01 
    BATPAG003|BATT(WL(P0(3|4))|WSP04)|SPIFATB01 
    BATPAG003|BATT(WL(P03|P04)|WSP04)|SPIFATB01 
    BATPAG003|BATT(WLP03|WLP04|WSP04)|SPIFATB01 
    BATPAG003|BATTWLP0(3|4)|BATTWSP04|SPIFATB01 
    BATPAG003|BATTWLP(03|04)|BATTWSP04|SPIFATB01 
    BA(TPAG003|TTWLP03|TTWLP04|TTWSP04)|SPIFATB01 
    BATPAG003|BATTWL(P0(3|4))|BATTWSP04|SPIFATB01 
    BATPAG003|BATTWL(P03|P04)|BATTWSP04|SPIFATB01 
    

    The only problem with my approach was computation time. Some of my source lists have nearly 100 SKUs, and some runs took longer than I was willing to wait, so I had to break those lists into smaller chunks and then concatenate the resulting patterns manually.
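
    If a compact pattern is good enough and it does not have to be provably the shortest, one possible way around the run time is to factor common prefixes once, trie-style, instead of enumerating every candidate. This is a different technique from the query above, not a faster version of it, and the Compress helper is a name invented for this sketch.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    var skus = new[] { "BATPAG003", "BATTWLP03", "BATTWLP04", "BATTWSP04", "SPIFATB01" };

    Console.WriteLine(Compress(skus));
    // (BAT(PAG003|TW(LP0(3|4)|SP04))|SPIFATB01)

    // Hypothetical helper: groups the words by first character and recurses on the
    // remainders, which amounts to walking an implicit trie of the input.
    static string Compress(IEnumerable<string> words)
    {
        var list = words.Distinct().ToList();
        bool hasEmpty = list.Contains(""); // some word ends exactly at this point

        var branches = list
            .Where(w => w.Length > 0)
            .GroupBy(w => w[0])
            .OrderBy(g => g.Key)
            .Select(g => Regex.Escape(g.Key.ToString()) + Compress(g.Select(w => w.Substring(1))))
            .ToList();

        if (branches.Count == 0) return "";
        if (branches.Count == 1 && !hasEmpty) return branches[0]; // no choice here, no group needed

        var alternation = "(" + string.Join("|", branches) + ")";
        return hasEmpty ? alternation + "?" : alternation; // "?" lets the shorter word match too
    }
    ```

    For the five example SKUs this yields the first pattern in the list above, wrapped in one extra pair of parentheses. It only factors prefixes, so shared suffixes are not merged, and on other inputs the result may be longer than the true minimum.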
