Question
PCollection<String> p1 = ...; // {"a", "b", "c"}
PCollection<KV<Integer, String>> p2 = p1.apply("some operation");
// {(1, "a"), (2, "b"), (3, "c")}
I need this to scale to large files, the way Apache Spark does with:
sc.textFile("./filename").zipWithIndex
My goal is to preserve the order between rows within a large file by assigning row numbers in a scalable way.
How can I get this result with Apache Beam?
Some related posts: zipWithIndex on Apache Flink
Ranking PCollection elements
Answer 1:
There is no built-in way to do this. PCollections in Beam are unordered, potentially unbounded, and processed in parallel on multiple workers, so the fact that a PCollection comes out of a source with a known order cannot be observed directly in the Beam model. I think the easier way would be to preprocess the file before it is consumed by the Beam pipeline.
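One way to do that preprocessing (a minimal sketch, not from the answer itself; the class name and tab-separated format are my own choices) is to prepend a line number to every line before the file ever enters the pipeline, so the order survives inside an unordered PCollection:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class NumberLines {
    // Prepend a 0-based line number to each line; once the file is
    // read into a PCollection, the number travels with the data.
    static List<String> withIndex(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            out.add(i + "\t" + lines.get(i));
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        Files.write(out, withIndex(Files.readAllLines(in)));
    }
}
```

The pipeline can then split each line on the tab to recover a KV<Integer, String>. Note this preprocessing step is single-machine, so it only helps when the file can be numbered once up front.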
Answer 2:
(copying my response from user@beam.apache.org)
This is interesting. So if I understand your algorithm, it would be something like (pseudocode):
A = ReadWithShardedLineNumbers(myFile) : output K<ShardOffset+LocalLineNumber>, V<Data>
B = A.ExtractShardOffsetKeys() : output K<ShardOffset>, V<LocalLineNumber>
C = B.PerKeySum() : output K<ShardOffset>, V<ShardTotalLines>
D = C.GlobalSortAndPrefixSum() : output K<ShardOffset> V<ShardLineNumberOffset>
E = [A,D].JoinAndCalculateGlobalLineNumbers() : output V<GlobalLineNumber+Data>
This makes a couple of assumptions:

1. ReadWithShardedLineNumbers: sources can output their shard offset, and the offsets are globally ordered
2. GlobalSortAndPrefixSum: the totals for all read shards can fit in memory to perform a total sort
Assumption #2 will not hold true for all data sizes, and varies by runner depending on how granular the read shards are. But it seems feasible for some practical subset of file-sizes.
Also, I believe the pseudo-code above is representable in Beam, and would not require SplittableDoFn.
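To check the arithmetic behind steps C through E, here is an in-memory sketch in plain Java (not Beam code; the class and method names are illustrative). Given shards already in global offset order, it computes per-shard line counts, a prefix sum over those counts, and then joins each shard's starting offset back to its locally numbered lines:

```java
import java.util.ArrayList;
import java.util.List;

public class ZipWithIndexSketch {
    // Steps C-E of the pseudocode, run in memory:
    // count lines per shard, prefix-sum the counts in shard order,
    // then global line number = shard offset + local line number.
    static List<String> zipWithIndex(List<List<String>> shards) {
        // C/D: per-shard totals and their prefix sum
        long[] shardOffset = new long[shards.size()];
        long running = 0;
        for (int s = 0; s < shards.size(); s++) {
            shardOffset[s] = running;
            running += shards.get(s).size();
        }
        // E: join the offsets back to the data
        List<String> out = new ArrayList<>();
        for (int s = 0; s < shards.size(); s++) {
            List<String> shard = shards.get(s);
            for (int local = 0; local < shard.size(); local++) {
                out.add((shardOffset[s] + local) + "\t" + shard.get(local));
            }
        }
        return out;
    }
}
```

In Beam, the prefix-sum step is where assumption #2 bites: it requires collecting all per-shard totals in one place, which is fine for per-shard counts (one long per shard) but not for the data itself.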
Source: https://stackoverflow.com/questions/53746046/how-can-i-implement-zipwithindex-like-spark-in-apache-beam