Caching file content inside ExecuteScript processor of Apache NiFi

Submitted by 旧城冷巷雨未停 on 2020-01-25 07:06:55

Question


I have an ExecuteScript processor that validates XML flow files against a Schematron schema. I'd like the content of the Schematron file to be cached somewhere rather than read from disk again for every flow file.

What is the best option for doing this? Do I need yet another script that puts the content of the Schematron file into context.stateManager, or PutDistributedMapCache, or something else?


Answer 1:


I was about to answer NO, but it turns out to be possible: you can cache variables inside the ExecuteScript processor.

general idea

A simple script run in the ExecuteScript processor with the ECMAScript engine shows that you can in fact store state inside the processor.

var flowFile = session.get();

if (flowFile !== null) {
    var x = (x || 0) + 1;
    log.error('this is round: ' + x);

    session.transfer(flowFile, REL_SUCCESS);
}

Running this script inside the processor logs something along these lines:

...
ExecuteScript[id=...] this is round: 3
ExecuteScript[id=...] this is round: 2
ExecuteScript[id=...] this is round: 1
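
The counter survives because `var x = (x || 0) + 1;` re-declares a top-level variable without resetting it, so it keeps its old value as long as the script is evaluated again in the same environment. The sketch below reproduces that effect outside NiFi using Node's `vm` module. This is only an analogy: NiFi runs the script through a Java script engine, and the assumption that it reuses the same engine bindings between flow files is inferred from the log output above, not confirmed here. The names `scriptBody`, `sharedContext` and `runRounds` are made up for the illustration.

```javascript
// Node.js analogy only: NiFi itself uses a Java script engine, not Node's vm module.
const vm = require("vm");

// The same counter line as in the script above, followed by an expression
// so that runInContext returns the current value of x.
const scriptBody = "var x = (x || 0) + 1; x;";

// One shared context plays the role of a script engine that is reused.
const sharedContext = vm.createContext({});

// Evaluate the same script text several times against the shared context.
function runRounds(count) {
    const seen = [];
    for (let i = 0; i < count; i++) {
        seen.push(vm.runInContext(scriptBody, sharedContext));
    }
    return seen;
}

const rounds = runRounds(3);
console.log(rounds.join(", ")); // prints "1, 2, 3"
```

Evaluating the same text in a fresh context would start again at 1, which is why the state is lost whenever the environment itself is replaced.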

updating the file at most every x time units

I borrowed the base code from the existing NiFi ValidateXml processor.

The basic idea is to update the file when

  1. it is not set yet or
  2. at least x units of time have passed since last update

The following code achieves this. SCHEMA_FILE_PATH is the path to the schema file, and x is thirty seconds in this case:

// type definitions
var File = Java.type("java.io.File");
var FileNotFoundException = Java.type("java.io.FileNotFoundException");
var System = Java.type("java.lang.System");

// constants
var SCHEMA_FILE_PATH = "/foo/bar"; // exchange with real path
var timeoutInMillis = 30 * 1000; // 30 seconds

// initialize
var schemaFile = schemaFile || null;
var lastUpdateMillis = lastUpdateMillis || 0;



var flowFile = session.get();

function updateSchemaFile() {
    schemaFile = new File(SCHEMA_FILE_PATH);

    if (!schemaFile.exists()) {
        throw new FileNotFoundException("Schema file not found at specified location: " + schemaFile.getAbsolutePath());
    }

    lastUpdateMillis = System.currentTimeMillis();
}

if (flowFile !== null) {
    var now = System.currentTimeMillis();
    var schemaFileShouldBeUpdated = (schemaFile == null) || ((lastUpdateMillis || 0) + timeoutInMillis) < now;

    if (schemaFileShouldBeUpdated) {
        updateSchemaFile();
    }

    // TODO Do with the file whatever you want
    log.error('was file updated this round? ' + schemaFileShouldBeUpdated + '; last update millis: ' + lastUpdateMillis);

    session.transfer(flowFile, REL_SUCCESS);
}
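
Note that the script above caches a java.io.File handle rather than the file's content, so actually reading the content is left to the TODO. A minimal sketch of the same "reload at most every x time units" rule applied to the content itself could look like the following. It is plain JavaScript with an injected loader and clock, so it runs outside NiFi; `createTimedCache`, `loadContent`, `maxAgeMillis` and `now` are hypothetical names introduced for the illustration, not NiFi APIs.

```javascript
// Hypothetical helper, not part of NiFi: caches whatever loadContent() returns
// and reloads it once maxAgeMillis has elapsed since the last load.
function createTimedCache(loadContent, maxAgeMillis, now) {
    var cachedContent = null;
    var lastLoadMillis = 0;
    var loaded = false;

    return {
        get: function () {
            var current = now();
            var expired = !loaded || (current - lastLoadMillis) >= maxAgeMillis;
            if (expired) {
                cachedContent = loadContent();
                lastLoadMillis = current;
                loaded = true;
            }
            return cachedContent;
        }
    };
}

// Usage with a fake loader and a fake clock so the behaviour is easy to follow.
var loadCount = 0;
var fakeNow = 0;
var cache = createTimedCache(
    function () { loadCount += 1; return "schema v" + loadCount; },
    30 * 1000,                      // same 30 second window as above
    function () { return fakeNow; }
);

var first = cache.get();            // first call: loads the content
fakeNow = 10 * 1000;
var second = cache.get();           // 10 s later: served from the cache
fakeNow = 45 * 1000;
var third = cache.get();            // 45 s later: window elapsed, reloads
console.log(first, second, third, loadCount);
```

Inside the processor, the loader would read the schema file and the clock would be System.currentTimeMillis(). The cache object would also have to be kept across invocations with the same `var cache = cache || ...` idiom, which is subject to the caveats in the disclaimer.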

DISCLAIMER

I cannot tell if, let alone when, the variables may be purged. Inspecting the source code of the ExecuteScript processor indicates that the script file is reloaded periodically. I am not sure about the consequences of that.

I also haven't tried any of the other supported scripting languages, as I'm most familiar with JavaScript.




Answer 2:


In a Groovy script you can declare a class with static variables, so they will definitely keep their state after the processor has started.

Additionally, to manage the initialization of those static variables, you can use the ExecuteGroovyScript processor's ability to intercept processor start and stop.

In the following example I compare the flow file content to a file on disk, because I'm not familiar with Schematron.

import org.apache.nifi.processor.ProcessContext

class Cache {
    static String validatorText = null
}
//this function is called on processor start, so you can't use a flow file in it
static void onStart(ProcessContext context){
    //init cached(static) variable from file
    Cache.validatorText = new File('/path/to/validator.txt').getText('UTF-8')
    println "onStart ${context}"
}

//process flow file and compare it to `Cache.validatorText`
def ff=session.get()
if(!ff)return

def ffText = ff.read().getText("UTF-8")
assert ffText == Cache.validatorText

REL_SUCCESS << ff

Note: you could set Failure strategy = transfer to failure. In that case, on any error (including an assertion failure) the flow file is routed to REL_FAILURE without additional code.



Source: https://stackoverflow.com/questions/58959352/caching-file-content-inside-executescript-processor-of-apache-nifi
