DataSet programs in Flink are regular programs that implement transformations on data sets (e.g., filtering, mapping, joining, grouping). The data sets are initially created from certain sources (e.g., by reading files, or from local collections). Results are returned via sinks, which may for example write the data to (distributed) files, or to standard output (for example the command line terminal). Flink programs run in a variety of contexts, standalone, or embedded in other programs. The execution can happen in a local JVM, or on clusters of many machines.
Source ===> Flink (transformation) ===> Sink
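The Source ===> Transformation ===> Sink shape above can be sketched with plain Java streams (no Flink on the classpath). This is only an analogy for the program shape, not the Flink API itself: a local collection stands in for the source, filter/map stand in for DataSet transformations, and printing stands in for the sink.

```java
import java.util.List;
import java.util.stream.Collectors;

public class PipelineShape {
    public static void main(String[] args) {
        // Source: a local collection standing in for readTextFile/fromCollection
        List<Integer> source = List.of(1, 2, 3, 4, 5, 6);
        // Transformation: filter + map, the same shape as DataSet.filter/map
        List<Integer> transformed = source.stream()
                .filter(n -> n % 2 == 0)   // keep even numbers
                .map(n -> n * 10)          // scale each element
                .collect(Collectors.toList());
        // Sink: print to standard output, standing in for DataSet.print()
        System.out.println(transformed);  // [20, 40, 60]
    }
}
```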
File-based
- readTextFile(path) / TextInputFormat - Reads files line wise and returns them as Strings.
- readTextFileWithValue(path) / TextValueInputFormat - Reads files line wise and returns them as StringValues. StringValues are mutable strings.
- readCsvFile(path) / CsvInputFormat - Parses files of comma (or another char) delimited fields. Returns a DataSet of tuples or POJOs. Supports the basic java types and their Value counterparts as field types.
- readFileOfPrimitives(path, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer.
- readFileOfPrimitives(path, delimiter, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer using the given delimiter.
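To make the readCsvFile behavior concrete, the sketch below reproduces its semantics in plain Java: comma-delimited lines are parsed into typed 2-field records. This is not the Flink API (no Flink on the classpath); the CsvSketch and Row names are hypothetical, with Row standing in for a Tuple2 or POJO result type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CsvSketch {
    // A 2-field record standing in for a Flink Tuple2<String, Integer> or POJO
    record Row(String name, int age) {}

    // Split each comma-delimited line and convert the fields to typed values,
    // mimicking what readCsvFile does for each input line
    static List<Row> parse(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(","))
                .map(f -> new Row(f[0], Integer.parseInt(f[1])))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("alice,30", "bob,25");
        System.out.println(parse(lines));
    }
}
```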
Collection-based
- fromCollection(Collection)
- fromCollection(Iterator, Class)
- fromElements(T ...)
- fromParallelCollection(SplittableIterator, Class)
- generateSequence(from, to)
Creating a DataSet from a simple collection
Collection-based sources are typically used in development or while learning the API, because they make it easy to fabricate whatever data is needed. Below, the same collection-based source is implemented in both Java and Scala; the data is simply the integers 1 to 10.
java
import org.apache.flink.api.java.ExecutionEnvironment;

import java.util.ArrayList;
import java.util.List;

public class JavaDataSetSourceApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment executionEnvironment = ExecutionEnvironment.getExecutionEnvironment();
        fromCollection(executionEnvironment);
    }

    public static void fromCollection(ExecutionEnvironment env) throws Exception {
        List<Integer> list = new ArrayList<Integer>();
        for (int i = 1; i <= 10; i++) {
            list.add(i);
        }
        env.fromCollection(list).print();
    }
}
scala
import org.apache.flink.api.scala.ExecutionEnvironment

object DataSetSourceApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    fromCollection(env)
  }

  def fromCollection(env: ExecutionEnvironment): Unit = {
    import org.apache.flink.api.scala._
    val data = 1 to 10
    env.fromCollection(data).print()
  }
}