Extracting substring from string based on delimiter

Question

I am trying to extract the data out of an encoded 2D barcode. The extraction part is working fine, and I can get the value in a text input.

E.g., the decoded string is

]d20105000456013482172012001000001/:210000000001

Based on the following rules (I couldn't get the table markdown right, so I attached a picture), I am trying to extract the substrings from the string mentioned above.

Substrings I want to extract:

05000456013482 (which is after the delimiter 01)

201200 (which is after delimiter 17)

00001 (which is after delimiter 10)

0000000001 (which is after delimiter 21)

P.S. The first 3 characters of the original string (]d2) are always the same, since they simply signify the decoding method.

Now some quirks:   

1) The number of characters after delimiter 10 is not fixed. In the example above it is 00001, but it could just as well be 001. Similarly, the number of characters after delimiter 21 is not fixed and can vary in length.

For these variable-length fields, I have added a constant /: that marks where the field ends when the barcode is scanned with a handheld device.

Now I have to look for /: after delimiter 10 and extract the string until it hits /: or EOL, then find delimiter 21 and extract the string until it hits /: or EOL.

2) The number of characters after delimiters 01 and 17 is always fixed (14 characters and six characters respectively), as shown in the table.

Note: The position of the delimiters could change. In other words, the encoded barcode could be written in a different sequence.

]d20105000456013482172012001000001/:210000000001 - Note: No /: sign after the 21 group since it is EOL

]d2172012001000001/:210000000001/:0105000456013482 - Note: Both the 10 and 21 groups have a /: sign to signify we have to extract until that sign

]d21000001/:210000000001/:010500045601348217201200 - First two are of varying length, and the next two are of fixed length.

I am not an expert in regex, and so far I have only tried some simple patterns like (01)(\d*)(21)(\d*)(10)(\d*)(17)(\d*)$, which doesn't work for the given example since it looks for 10 like the first 2 chars. Also, the substring(x, x) method only works for a fixed-length string, where I know the indexes at which to pluck the string.

P.S. Help in either JS or jQuery is appreciated.

Answer 1:

While you could try to make a very complicated regex to do this, it would be more readable and maintainable to parse through the string in steps.

Basic steps would be to:

1) Remove the decode method characters (]d2).
2) Split off the first two characters from the result of step 1.
3) Use them to choose which method extracts the data.
4) Remove that data from the string and save it, then go back to step 2 and repeat until the string is exhausted.

Since you have a table describing the structure of each AI (Application Identifier) and its data, you can write several methods to extract the different forms of data.

For instance, since AIs 01, 11, 15 and 17 are all fixed length, you can just use the string's slice method with that length:

str.slice(0,14); //for 01
str.slice(0,6);  //for 11 15 17

The variable-length ones, like AI 21, would be something like:

var fnc1 = "/:";
// indexOf returns -1 when there is no /: (the field runs to the end of the string)
var fnc1Index = str.indexOf(fnc1);
str.slice(0,fnc1Index);

Demo

var dataNames = {
  '01': 'GTIN',
  '10': 'batchNumber',
  '11': 'prodDate',
  '15': 'bestDate',
  '17': 'expireDate',
  '21': 'serialNumber'
};

var input = document.querySelector("input");
document.querySelector("button").addEventListener("click",function(){
  var str = input.value;
  console.log( parseGS1(str) );
});

function parseGS1(str) {
  var fnc1 = "/:";
  var data = {};
  
  //remove ]d2
  str = str.slice(3);
  while (str.length) {
    //get the AI identifier: 01,10,11 etc
    let aiIdent = str.slice(0, 2);
    //get the name we want to use for the data object
    let dataName = dataNames[aiIdent];
    //update the string
    str = str.slice(2);

    switch (aiIdent) {
      case "01":
        data[dataName] = str.slice(0, 14);
        str = str.slice(14);
        break;
      case "10":
      case "21":
        let fnc1Index = str.indexOf(fnc1);
        //eol or fnc1 cases
        if(fnc1Index==-1){
          data[dataName] = str.slice(0);
          str = "";
        } else {
          data[dataName] = str.slice(0, fnc1Index);
          str = str.slice(fnc1Index + 2);
        }
        break;
      case "11":
      case "15":
      case "17":
        data[dataName] = str.slice(0, 6);
        str = str.slice(6);
        break;
      default:
        // report the unknown identifier and stop parsing
        console.log("unexpected ident encountered:", aiIdent);
        return false;
    }
  }
  return data;
}

<input><button>Parse</button>
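
The same step-by-step approach can also be sketched in a table-driven form, which keeps the per-AI rules in one place. This is only a sketch under the question's assumptions: a ]d2 prefix, two-digit AIs, and /: as the end marker for variable-length fields. The names AI_SPEC, TERMINATOR and parseBarcode are made up for this illustration.

```javascript
// Hypothetical lookup table: one entry per AI from the question.
// `length` is the fixed data length; null marks a variable-length field
// that runs until the "/:" marker or the end of the string.
const AI_SPEC = {
  "01": { name: "GTIN", length: 14 },
  "10": { name: "batchNumber", length: null },
  "11": { name: "prodDate", length: 6 },
  "15": { name: "bestDate", length: 6 },
  "17": { name: "expireDate", length: 6 },
  "21": { name: "serialNumber", length: null }
};

const TERMINATOR = "/:"; // assumed end marker for variable-length fields

function parseBarcode(input) {
  const data = {};
  let rest = input.slice(3); // drop the "]d2" prefix
  while (rest.length > 0) {
    const spec = AI_SPEC[rest.slice(0, 2)];
    if (!spec) return null; // unknown AI: give up rather than guess
    rest = rest.slice(2);
    if (spec.length !== null) {
      // fixed-length field
      data[spec.name] = rest.slice(0, spec.length);
      rest = rest.slice(spec.length);
    } else {
      // variable-length field: up to the marker, or to the end of the string
      const end = rest.indexOf(TERMINATOR);
      if (end === -1) {
        data[spec.name] = rest;
        rest = "";
      } else {
        data[spec.name] = rest.slice(0, end);
        rest = rest.slice(end + TERMINATOR.length);
      }
    }
  }
  return data;
}

// The three orderings given in the question
const samples = [
  "]d20105000456013482172012001000001/:210000000001",
  "]d2172012001000001/:210000000001/:0105000456013482",
  "]d21000001/:210000000001/:010500045601348217201200"
];
for (const s of samples) {
  console.log(parseBarcode(s));
}
```

All three sample strings should yield the same four values, only in a different key order. Supporting another AI would then mean adding one row to the table rather than another case to the switch.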

Answer 2:

Ok, here's my take on this. I created a regex that will match all possible patterns. That way all parts are split correctly, and all that remains is to use the first two digits of each part to know what it means.

^\]d2(?:((?:10|21)[a-zA-Z0-9]{1,20}(?:\/:|$))|(01[0-9]{14})|((?:11|15|17)[0-9]{6}))*

I suggest you copy it into regex101.com to read the full descriptors and test it out against different possible results.

There are 3 main parts:

((?:10|21)[a-zA-Z0-9]{1,20}(?:\/:|$))

This matches the sections starting with 10 and 21. It looks for 1 to 20 alphanumeric characters, and the section must end with either /: or EOL.

(01[0-9]{14})

This matches the GTIN, pretty straightforward.

((?:11|15|17)[0-9]{6})

This matches the 3 date fields.

As we expect those 3 segments to come in any order, I've joined them with | to express an OR, and I expect this big sequence to repeat (the * at the end means 0 or more; we could define the exact minimum and maximum for more reliability).

I am unsure if this will work for everything as the test strings you gave do not include identifiers inside actual values... It could very well happen that a product's best before date is in January so there will be a 01 in its value. But forcing the regex to execute in this manner should circumvent some of those problems.

EDIT: Capturing groups only capture the last occurrence, so we need to split their definitions:

^\]d2(?:(21[a-zA-Z0-9]{1,20}(?:\/:|$))|(10[a-zA-Z0-9]{1,20}(?:\/:|$))|(01[0-9]{14})|(11[0-9]{6})|(15[0-9]{6})|(17[0-9]{6}))*
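
A quick way to check what the capture groups actually hold in JavaScript is sketched below. The sample string is made up for illustration and contains two date fields (AI 11, then AI 17); the two patterns are trimmed to the date alternatives only.

```javascript
// Hypothetical input with two date fields: AI 11 followed by AI 17.
const sample = "]d21117010117201200";

// Shared group for all date AIs: it holds only the last repetition.
const shared = /^\]d2(?:((?:11|15|17)[0-9]{6}))*/;
const sharedMatch = shared.exec(sample);
console.log(sharedMatch[1]); // "17201200"

// One group per AI. JavaScript resets the groups inside a repeated
// group on every iteration, so the AI 11 capture is lost here as well.
const split = /^\]d2(?:(11[0-9]{6})|(15[0-9]{6})|(17[0-9]{6}))*/;
const splitMatch = split.exec(sample);
console.log(splitMatch[1], splitMatch[3]); // undefined "17201200"
```

So in JavaScript a single match over the whole string only reports the field matched in the final repetition, even with one group per AI. Other regex engines may keep the earlier captures, which is worth checking when testing the pattern in an online tool set to a different flavor.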

EDIT AGAIN: JavaScript seems to cause us some headaches... I am not sure of the correct way to handle it, but here's some example code that could work.

var str = "]d20105000456013482172012001000001/:210000000001";
var r = new RegExp("(21[a-zA-Z0-9]{1,20}(?:\/:|$))|(10[a-zA-Z0-9]{1,20}(?:\/:|$))|(01[0-9]{14})|(11[0-9]{6})|(15[0-9]{6})|(17[0-9]{6})", "g");
var match; // declared here so the loop below does not create an implicit global
while ((match = r.exec(str)) != null) {
  console.log(match[0]);
}

I am not very happy with how it turns out though. There might be better solutions.
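
One possible refinement of the loop above, offered as a sketch rather than a definitive fix: with the sticky flag (y), every match has to start exactly where the previous one ended, so the scan cannot silently skip over characters it does not recognize. The names FIELD_NAMES and parseWithRegex are invented here; the field labels are borrowed from the first answer, and the extra groups that separate the AI from its value are an addition to the original pattern.

```javascript
// Labels borrowed from the first answer's dataNames table
const FIELD_NAMES = {
  "01": "GTIN", "10": "batchNumber", "11": "prodDate",
  "15": "bestDate", "17": "expireDate", "21": "serialNumber"
};

function parseWithRegex(input) {
  // Groups 1/2: variable-length AI and value (ended by "/:" or end of string)
  // Groups 3/4: GTIN AI and value; groups 5/6: date AI and value
  const re = /(10|21)([a-zA-Z0-9]{1,20})(?:\/:|$)|(01)([0-9]{14})|(11|15|17)([0-9]{6})/y;
  const data = {};
  let pos = 3; // skip the "]d2" prefix
  while (pos < input.length) {
    re.lastIndex = pos; // sticky: the match must begin exactly here
    const match = re.exec(input);
    if (match === null) return null; // nothing valid starts at pos
    const ai = match[1] || match[3] || match[5];
    const value = match[2] || match[4] || match[6];
    data[FIELD_NAMES[ai]] = value;
    pos = re.lastIndex; // continue right after this field
  }
  return data;
}

console.log(parseWithRegex("]d20105000456013482172012001000001/:210000000001"));
```

Returning null on the first unparseable position is a design choice for this sketch; a caller that prefers partial results could return the data collected so far instead.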

Source: https://stackoverflow.com/questions/45918849/extracting-substring-from-string-based-on-delimiter

Tags

javascript

jquery

regex

string

substring