How to parse Kinesis data stream in AWS Lambda Java

末鹿安然 submitted on 2021-01-05 09:03:22

Question


I am creating an AWS Lambda function in Java to process a Kinesis Data Stream.

My current parsing setup involves:

  1. Stringify using UTF-8 as suggested in AWS Documentation
            for(KinesisEvent.KinesisEventRecord rec : event.getRecords())
            {
                String stringRecords = new String(rec.getKinesis().getData().array(), "UTF-8");

                    pageEventList.add(pageEvent);
            }
            
  2. Clean up characters using regex patterns
   a. non-ascii: "[^\\x00-\\x7F]";
   b. ascii-control-characters: "[\\p{Cntrl}&&[^\r\n\t]]";
   c. non-printable-characters: "\\p{C}";
  3. Wrap the JSON objects, which arrive without square brackets or separating commas, into a JSON array string
        int firstBeginningCurlyBracketIndex = cleanString.indexOf("{");
        if (firstBeginningCurlyBracketIndex != -1 ){
            cleanString = cleanString.substring(firstBeginningCurlyBracketIndex + 1);
            cleanString = "[{" + cleanString;
        }

        int lastIndexOfCurlyBracketIndex = cleanString.lastIndexOf("}");
        if (lastIndexOfCurlyBracketIndex != -1) {
            cleanString = cleanString.substring(0, lastIndexOfCurlyBracketIndex);
            cleanString = cleanString + "}]";
        }

        cleanString = cleanString.replaceAll("}\\{", "\\},\\{");
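
A minimal sketch of the decoding and control-character cleanup steps above, using only the JDK. It assumes the record's `ByteBuffer` may expose only part of its backing array, in which case `array()` would also return bytes that lie outside the record; decoding through a duplicate of the buffer reads only the region between position and limit. `RecordDecoder`, `decodeUtf8`, and `stripControlChars` are hypothetical names for illustration and are not part of any AWS library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class RecordDecoder {

    // Decodes only the readable region of the buffer (position..limit) as UTF-8.
    // duplicate() shares the bytes but has its own position, so the caller's
    // buffer is left untouched.
    static String decodeUtf8(ByteBuffer data) {
        return StandardCharsets.UTF_8.decode(data.duplicate()).toString();
    }

    // Removes ASCII control characters except CR, LF and TAB
    // (the "ascii-control-characters" pattern from the list above).
    static String stripControlChars(String s) {
        return s.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
    }

    public static void main(String[] args) {
        byte[] backing = "xx{\"name\":\"banana\"}\u0001yy".getBytes(StandardCharsets.UTF_8);
        // A buffer that exposes only the middle of its backing array:
        // calling array() on it would still return the "xx" and "yy" bytes.
        ByteBuffer slice = ByteBuffer.wrap(backing, 2, backing.length - 4);
        String text = stripControlChars(decodeUtf8(slice));
        System.out.println(text); // prints {"name":"banana"}
    }
}
```

Whether the buffers delivered to a Lambda handler are ever partial views is not something this sketch establishes; decoding by position and limit is simply the safer default.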

At this point, I use a regex to separate the objects and parse each one into a JSON object. Reference: How to match string within parentheses (nested) in Java?

        String REGEX_BRACKET_PATTERN_TWO_LAYERS = "(\\{(?:[^}{]+|\\{(?:[^}{]+|\\{[^}{]*\\})*\\})*\\})";

        Pattern splitDelRegex = Pattern.compile(REGEX_BRACKET_PATTERN_TWO_LAYERS);
        Matcher regexMatcher = splitDelRegex.matcher(nonAsciiRemovedString);
        List<String> matcherList = new ArrayList<String>();
        while (regexMatcher.find()) {
            String perm = regexMatcher.group(1);
            matcherList.add(perm);
        }
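
As an alternative to a fixed-depth regex, the same splitting can be sketched with a brace-depth counter, which handles any nesting depth and ignores braces inside string values. This is a hedged illustration, not a JSON parser: `JsonObjectSplitter` and `splitTopLevelObjects` are hypothetical names, a brace-balanced span is not guaranteed to be valid JSON, and an unbalanced `{` or `"` in the stray text would swallow whatever follows it.

```java
import java.util.ArrayList;
import java.util.List;

public class JsonObjectSplitter {

    // Returns every brace-balanced top-level {...} span found in the input.
    // Characters outside such a span (stray bytes, lone closing braces) are skipped.
    static List<String> splitTopLevelObjects(String input) {
        List<String> objects = new ArrayList<>();
        int depth = 0;          // current brace nesting level
        int start = -1;         // index of the '{' that opened the current object
        boolean inString = false;
        boolean escaped = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (depth == 0) {
                if (c == '{') {
                    depth = 1;
                    start = i;
                }
                continue; // ignore everything outside an object
            }
            if (inString) {
                if (escaped) {
                    escaped = false;      // character after a backslash is literal
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    objects.add(input.substring(start, i + 1));
                }
            }
        }
        return objects;
    }

    public static void main(String[] args) {
        String raw = "{\"a\":{\"b\":\"x}y\"}}GD~{\"c\":1}FDSE-}";
        List<String> parts = splitTopLevelObjects(raw);
        System.out.println("[" + String.join(",", parts) + "]");
        // prints [{"a":{"b":"x}y"}},{"c":1}]
    }
}
```

Each returned span would still need to go through Gson or Jackson individually, with a failed parse dropping only that one span rather than the whole batch.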

I have also tried using Gson and Jackson to parse the JSON array string produced by step 3 (ref: How to parse JSON in Java). Parsing works fine until a random invalid JSON fragment or string appears in the data stream and throws an exception: java.lang.Exception: com.google.gson.JsonSyntaxException: java.lang.IllegalStateException: Expected BEGIN_ARRAY but was STRING at line 2 column 1 path $

The invalid JSON that causes this exception looks something like this:

[

 ...

  {
    "name": "banana"
    "description": "description"
  },
  {
    "name": "orange"
    "description": "description"
  }
GD~
{}
FDSE-}
]

My questions are:

  1. Since the trailing string is unpredictable, I am having difficulty formatting the whole string into a valid JSON array string. Does anybody have a good idea for making sure this JSON array string is always valid?

  2. Aside from the steps I have described for parsing the Kinesis Data Stream into JSON data (which, by the way, work using regex, although I still see that random string at the end), if anybody has experience with this parsing process, please share it with the community. I feel that the AWS documentation on using Lambda with Kinesis is not detailed enough to cover the whole parsing process.

In addition, I am aware that this could all simply be due to the quality of the data in the data stream. It would also be nice to hear about other people's experience handling their data in this area.
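
One way to check the data-quality hypothesis is to look at the first bytes of each record before decoding it as text. The sketch below assumes two possible causes that the question does not confirm: payloads aggregated by the Kinesis Producer Library, which to my understanding begin with the magic bytes `0xF3 0x89 0x9A 0xC2` and end with a checksum, and gzip-compressed payloads, which begin with `0x1F 0x8B`. `PayloadSniffer`, `startsWith`, and `classify` are hypothetical names for illustration.

```java
import java.nio.ByteBuffer;

public class PayloadSniffer {

    // Assumption: magic number of the KPL aggregated record format.
    private static final byte[] KPL_MAGIC =
            {(byte) 0xF3, (byte) 0x89, (byte) 0x9A, (byte) 0xC2};
    // Standard gzip header magic number.
    private static final byte[] GZIP_MAGIC = {(byte) 0x1F, (byte) 0x8B};

    // Compares the first bytes of the readable region with the given prefix.
    // Uses absolute get(), so the buffer's position is not moved.
    static boolean startsWith(ByteBuffer data, byte[] prefix) {
        if (data.remaining() < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data.get(data.position() + i) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns a rough label for the payload framing.
    static String classify(ByteBuffer data) {
        if (startsWith(data, KPL_MAGIC)) {
            return "kpl-aggregated";
        }
        if (startsWith(data, GZIP_MAGIC)) {
            return "gzip";
        }
        return "plain";
    }

    public static void main(String[] args) {
        ByteBuffer json = ByteBuffer.wrap("{\"name\":\"banana\"}".getBytes());
        ByteBuffer gz = ByteBuffer.wrap(new byte[] {(byte) 0x1F, (byte) 0x8B, 8, 0});
        System.out.println(classify(json)); // prints plain
        System.out.println(classify(gz));   // prints gzip
    }
}
```

If records turn out to be aggregated or compressed, treating the raw bytes as UTF-8 text would explain stray characters around otherwise valid JSON, and the payload would need to be unpacked before any JSON parsing.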

Source: https://stackoverflow.com/questions/63071209/how-to-parse-kinesis-data-stream-in-aws-lambda-java
