DataSet programs in Flink are regular programs that implement transformations on data sets (e.g., filtering, mapping, joining, grouping). The data sets are initially created from certain sources (e.g., by reading files, or from local collections). Results are returned via sinks, which may for example write the data to (distributed) files, or to standard output (for example the command line terminal). Flink programs run in a variety of contexts, standalone, or embedded in other programs. The execution can happen in a local JVM, or on clusters of many machines.
Source ===> Flink (transformation) ===> Sink
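The Source ===> Transformation ===> Sink shape above can be sketched with plain Java streams (no Flink on the classpath). This is only an analogy for the program shape, not the Flink API itself: a local collection stands in for the source, filter/map stand in for DataSet transformations, and printing stands in for the sink.

```java
import java.util.List;
import java.util.stream.Collectors;

public class PipelineShape {
    public static void main(String[] args) {
        // Source: a local collection standing in for readTextFile/fromCollection
        List<Integer> source = List.of(1, 2, 3, 4, 5, 6);
        // Transformation: filter + map, the same shape as DataSet.filter/map
        List<Integer> transformed = source.stream()
                .filter(n -> n % 2 == 0)   // keep even numbers
                .map(n -> n * 10)          // scale each element
                .collect(Collectors.toList());
        // Sink: print to standard output, standing in for DataSet.print()
        System.out.println(transformed);  // [20, 40, 60]
    }
}
```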
File-based
- readTextFile(path) / TextInputFormat - Reads files line wise and returns them as Strings.
- readTextFileWithValue(path) / TextValueInputFormat - Reads files line wise and returns them as StringValues. StringValues are mutable strings.
- readCsvFile(path) / CsvInputFormat - Parses files of comma (or another char) delimited fields. Returns a DataSet of tuples or POJOs. Supports the basic java types and their Value counterparts as field types.
- readFileOfPrimitives(path, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer.
- readFileOfPrimitives(path, delimiter, Class) / PrimitiveInputFormat - Parses files of new-line (or another char sequence) delimited primitive data types such as String or Integer using the given delimiter.
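To make the readCsvFile behavior concrete, the sketch below reproduces its semantics in plain Java: comma-delimited lines are parsed into typed 2-field records. This is not the Flink API (no Flink on the classpath); the CsvSketch and Row names are hypothetical, with Row standing in for a Tuple2 or POJO result type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CsvSketch {
    // A 2-field record standing in for a Flink Tuple2<String, Integer> or POJO
    record Row(String name, int age) {}

    // Split each comma-delimited line and convert the fields to typed values,
    // mimicking what readCsvFile does for each input line
    static List<Row> parse(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(","))
                .map(f -> new Row(f[0], Integer.parseInt(f[1])))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("alice,30", "bob,25");
        System.out.println(parse(lines));
    }
}
```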
Collection-based
- fromCollection(Collection)
- fromCollection(Iterator, Class)
- fromElements(T ...)
- fromParallelCollection(SplittableIterator, Class)
- generateSequence(from, to)
Creating a DataSet from a simple collection
Collection-based sources are typically used in development or while learning the API, because they make it easy to fabricate whatever data is needed. Below, the same collection-based source is implemented in both Java and Scala; the data is simply the integers 1 to 10.
java
import org.apache.flink.api.java.ExecutionEnvironment;

import java.util.ArrayList;
import java.util.List;

public class JavaDataSetSourceApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment executionEnvironment = ExecutionEnvironment.getExecutionEnvironment();
        fromCollection(executionEnvironment);
    }

    public static void fromCollection(ExecutionEnvironment env) throws Exception {
        List<Integer> list = new ArrayList<Integer>();
        for (int i = 1; i <= 10; i++) {
            list.add(i);
        }
        env.fromCollection(list).print();
    }
}
scala
import org.apache.flink.api.scala.ExecutionEnvironment

object DataSetSourceApp {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    fromCollection(env)
  }

  def fromCollection(env: ExecutionEnvironment): Unit = {
    import org.apache.flink.api.scala._
    val data = 1 to 10
    env.fromCollection(data).print()
  }
}