Is it possible to use Riak CS with Apache Flink?

Asked by 迷失自我, 2021-01-26 02:33

I want to configure filesystem state backend and zookeeper recovery mode:

state.backend: filesystem
state.backend.fs.checkpointdir: ???         


        
1 Answer
  •  刺人心 (OP)
     2021-01-26 03:14

    Answer: yes — here is how to connect Apache Flink to Riak CS.

    Riak CS exposes an S3-compatible (version 2) interface, so it is possible to use the S3 file system adapter from Hadoop to work with Riak CS.

    I don't know why, but Apache Flink ships only some of the Hadoop filesystem adapters inside its fat jar (lib/flink-dist_2.11-1.0.1.jar): it includes the FTP file system (org.apache.hadoop.fs.ftp.FTPFileSystem) but not the S3 file system (e.g. org.apache.hadoop.fs.s3a.S3AFileSystem). So you have two ways to solve this problem:

    • use the adapters from a Hadoop installation. I didn't try this, but it seems you just need to set the HADOOP_CLASSPATH or HADOOP_HOME environment variable.
    • monkey-patch Apache Flink by downloading the required JARs into its /lib directory
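
    For the first option, a minimal sketch might look like this (the Hadoop install path below is hypothetical — adjust it to your environment):

    ```shell
    # hypothetical Hadoop install location -- adjust to your environment
    export HADOOP_HOME=/opt/hadoop-2.7.2
    # let Flink pick up Hadoop's filesystem adapters (incl. s3a) via Hadoop's classpath
    export HADOOP_CLASSPATH=$("$HADOOP_HOME/bin/hadoop" classpath)
    ```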

    I chose the second way because I didn't want to provision Hadoop in my environment. You can copy the JARs from a Hadoop distribution or download them:

    curl http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.2/hadoop-aws-2.7.2.jar -o /flink/lib/hadoop-aws-2.7.2.jar
    curl http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar -o /flink/lib/aws-java-sdk-1.7.4.jar
    curl http://central.maven.org/maven2/org/apache/httpcomponents/httpcore/4.2.5/httpcore-4.2.5.jar -o /flink/lib/httpcore-4.2.5.jar
    curl http://central.maven.org/maven2/org/apache/httpcomponents/httpclient/4.2.5/httpclient-4.2.5.jar -o /flink/lib/httpclient-4.2.5.jar
    

    As you can see, I am using old versions, because those are the versions shipped with Hadoop 2.7.2, and I use a Flink build compatible with that version of Hadoop.

    FYI: this hack can cause problems if your own job uses newer versions of these JARs. To avoid conflicts between the versions, you can relocate the packages when building your job's fat jar, using something like this (I am using Gradle):

    // Relocate org.apache.http packages because Apache Flink includes an old version
    // of this library (we placed it in /lib for the S3-compatible FS)
    shadowJar {
        dependencies {
            include(dependency('.*:.*:.*'))
        }
    
        relocate 'org.apache.http', 'relocated.org.apache.http'
        relocate 'org.apache.commons', 'relocated.org.apache.commons'
    }
    

    Then you should specify the path to core-site.xml in flink-conf.yaml, because the Hadoop-compatible file systems load their settings from this config:

    ...
    fs.hdfs.hadoopconf: /flink/conf
    ...
    

    As you can see, I just placed it in the /conf directory. It has the following settings:

    <?xml version="1.0"?>
    <configuration>
        <property>
            <name>fs.s3a.impl</name>
            <!-- S3A is better than the other S3 adapters: https://wiki.apache.org/hadoop/AmazonS3 -->
            <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
        </property>
        <property>
            <name>fs.s3a.endpoint</name>
            <!-- this is my Riak CS host -->
            <value>my-riak-cs.stage.local</value>
        </property>
        <property>
            <!-- my Riak CS in staging doesn't support SSL -->
            <name>fs.s3a.connection.ssl.enabled</name>
            <value>false</value>
        </property>
        <property>
            <name>fs.s3a.access.key</name>
            <!-- this is my access key for Riak CS -->
            <value>????</value>
        </property>
        <property>
            <name>fs.s3a.secret.key</name>
            <!-- this is my secret key for Riak CS -->
            <value>????</value>
        </property>
    </configuration>
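
    Since Hadoop's core-site.xml is just name/value pairs, one quick way to sanity-check the file before pointing Flink at it is to parse it. The snippet below is an illustrative sketch using placeholder values, not the real credentials:

    ```python
    import xml.etree.ElementTree as ET

    # A minimal Hadoop-style core-site.xml (values here are placeholders).
    CORE_SITE = """
    <configuration>
        <property><name>fs.s3a.impl</name><value>org.apache.hadoop.fs.s3a.S3AFileSystem</value></property>
        <property><name>fs.s3a.endpoint</name><value>my-riak-cs.stage.local</value></property>
        <property><name>fs.s3a.connection.ssl.enabled</name><value>false</value></property>
    </configuration>
    """

    def load_props(xml_text):
        """Parse Hadoop <property><name>/<value> pairs into a dict."""
        root = ET.fromstring(xml_text)
        return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}

    props = load_props(CORE_SITE)
    assert props["fs.s3a.impl"] == "org.apache.hadoop.fs.s3a.S3AFileSystem"
    assert props["fs.s3a.connection.ssl.enabled"] == "false"
    ```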

    Then you should configure the Riak CS buckets in flink-conf.yaml, as recommended here:

    ...
    state.backend.fs.checkpointdir: s3a://example-staging-flink/checkpoints
    ...
    recovery.zookeeper.storageDir: s3a://example-staging-flink/recovery
    ...
    

    and create the buckets in Riak CS. I am using s3cmd (installed via brew on my OS X dev machine):

    s3cmd mb s3://example-staging-flink
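
    To check that the bucket was actually created (and, later, that checkpoints land in it), you can list it with the same tool:

    ```shell
    # list all buckets visible with the configured credentials
    s3cmd ls
    # list the contents of the new bucket (empty until the first checkpoint completes)
    s3cmd ls s3://example-staging-flink
    ```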
    

    FYI: before using s3cmd you should configure it with s3cmd --configure and then fix some settings in the ~/.s3cmd file:

    # Riak CS uses the S3 V2 interface
    signature_v2 = True
    # disable if you don't use SSL
    use_https = False
    access_key = ???
    secret_key = ???
    # your Riak CS host
    host_base = my-riak-cs.stage.local
    # bucket naming format used by Riak CS
    host_bucket = %(bucket)s.my-riak-cs.stage.local
    

    That's all you need to configure to save/restore the state of a standalone HA Apache Flink cluster in Riak CS.
