I tried the example code for the SortValues transform using the DirectRunner on my local machine (Windows):
PCollection<KV<String, KV<String, Integer>>> input = ...
PCollection<KV<String, Iterable<KV<String, Integer>>>> grouped =
    input.apply(GroupByKey.<String, KV<String, Integer>>create());
PCollection<KV<String, Iterable<KV<String, Integer>>>> groupedAndSorted =
    grouped.apply(SortValues.<String, String, Integer>create(BufferedExternalSorter.options()));
but I got the error PipelineExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable. Does this mean the transform only works in a Hadoop environment?
As of today, if you use a Beam release below 2.0.0, you have to add two Hadoop dependencies to your Maven pom file for the SortValues module to work:
- add hadoop-common, version 2.7.3 or later
- add hadoop-mapreduce-client-core, version 2.7.3 or later
Otherwise, just use a Beam release of 2.0.0 or later.
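For Beam releases below 2.0.0, the two Hadoop dependencies could be declared roughly as follows. This is a minimal sketch of a pom fragment, not a complete file: the version 2.7.3 is simply the minimum named above, and the assumption is that the block sits inside your project's existing <dependencies> section.

```xml
<!-- Sketch only: needed when using a Beam release below 2.0.0 -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <!-- 2.7.3 is the minimum mentioned above; a later version should also work -->
  <version>2.7.3</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-core</artifactId>
  <version>2.7.3</version>
</dependency>
```

With these on the classpath, org.apache.hadoop.io.Writable (the class named in the NoClassDefFoundError) is provided by hadoop-common, so the DirectRunner can load it locally without a running Hadoop cluster.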
Source: https://stackoverflow.com/questions/45069550/does-sortvalues-transform-java-sdk-extension-in-beam-only-run-in-hadoop-environm