How can I implement zipWithIndex like Spark in Apache Beam?

Posted by ◇◆丶佛笑我妖孽 on 2019-12-14 03:11:50

Question


PCollection<String> p1 = {"a", "b", "c"}

PCollection<KV<Integer, String>> p2 = p1.apply("some operation")
// {(1, "a"), (2, "b"), (3, "c")}

I need this to scale to large files, the way the following does in Apache Spark:

sc.textFile("./filename").zipWithIndex

My goal is to preserve the order between rows within a large file by assigning row numbers in a scalable way.

How can I get this result with Apache Beam?

Some related posts: "zipWithIndex on Apache Flink" and "Ranking pcollection elements".


Answer 1:


There is no built-in way to do this. PCollections in Beam are unordered, potentially unbounded, and processed in parallel on multiple workers. The Beam model gives no direct way to observe that a PCollection came out of a source with a known order. I think an easier way would be to preprocess the file before the Beam pipeline consumes it.
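
One way to read the preprocessing suggestion is a single sequential pass that writes each line's number into the line itself, so the order survives once the file becomes an unordered PCollection. The following is a minimal sketch under that assumption. The class name `LineNumberPreprocessor`, the method `numberLines`, and the tab-separated output format are all hypothetical choices made for illustration, not part of Beam.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Beam: numbers lines in one sequential pass.
class LineNumberPreprocessor {

    // Copies every line from in to out as "<1-based line number>\t<line>\n"
    // and returns the number of lines written.
    static long numberLines(Reader in, Writer out) {
        long lineNumber = 0;
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                out.write(lineNumber + "\t" + line + "\n");
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lineNumber;
    }

    // Usage: java LineNumberPreprocessor.java <input file> <output file>
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: LineNumberPreprocessor <input> <output>");
            return;
        }
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
             Writer out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
            numberLines(in, out);
        }
    }
}
```

A pipeline reading the numbered file could then split each line on the first tab to recover the (line number, text) pair. The trade-off is that this pass is single-threaded, so it only moves the sequential work out of the pipeline rather than removing it.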




Answer 2:


(copying my response from user@beam.apache.org)

This is interesting. So if I understand your algorithm, it would be something like (pseudocode):

A = ReadWithShardedLineNumbers(myFile) : output K<ShardOffset+LocalLineNumber>, V<Data>
B = A.ExtractShardOffsetKeys() : output K<ShardOffset>, V<LocalLineNumber>
C = B.PerKeySum() : output K<ShardOffset>, V<ShardTotalLines>
D = C.GlobalSortAndPrefixSum() : output K<ShardOffset> V<ShardLineNumberOffset>
E = [A,D].JoinAndCalculateGlobalLineNumbers() : output V<GlobalLineNumber+Data>

This makes a couple of assumptions:

  1. ReadWithShardedLineNumbers: Sources can output their shard offset, and the offsets are globally ordered
  2. GlobalSortAndPrefixSum: The totals for all read shards can fit in memory to perform a total sort

Assumption #2 will not hold for all data sizes, and whether it holds varies by runner, depending on how granular the read shards are. But it seems feasible for some practical subset of file sizes.

Also, I believe the pseudo-code above is representable in Beam, and would not require SplittableDoFn.
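
To make the arithmetic in the pseudocode concrete, here is a small sketch in plain Java with no Beam dependency, so it shows only the numbering logic and not a pipeline. It assumes each record already carries its shard offset and a 1-based line number local to that shard (assumption #1), and that the per-shard totals fit in memory (assumption #2). It obtains the per-shard totals by counting records. The names `ShardedLineNumbering`, `ShardedLine`, `shardStartLines`, and `assignGlobalLineNumbers` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of steps A to E; not Beam code.
class ShardedLineNumbering {

    // Step A output: one record tagged with its shard offset and a
    // 1-based line number local to that shard.
    static final class ShardedLine {
        final long shardOffset;
        final long localLine;
        final String data;

        ShardedLine(long shardOffset, long localLine, String data) {
            this.shardOffset = shardOffset;
            this.localLine = localLine;
            this.data = data;
        }
    }

    // Steps B to D: count the lines in each shard, then prefix-sum the counts
    // in shard-offset order. The result maps each shard offset to the number
    // of lines that precede that shard in the file.
    static TreeMap<Long, Long> shardStartLines(List<ShardedLine> lines) {
        TreeMap<Long, Long> counts = new TreeMap<>();
        for (ShardedLine line : lines) {
            counts.merge(line.shardOffset, 1L, Long::sum);
        }
        TreeMap<Long, Long> starts = new TreeMap<>();
        long linesSoFar = 0;
        for (Map.Entry<Long, Long> entry : counts.entrySet()) {
            starts.put(entry.getKey(), linesSoFar);
            linesSoFar += entry.getValue();
        }
        return starts;
    }

    // Step E: global line number = lines before the shard + local line number.
    static TreeMap<Long, String> assignGlobalLineNumbers(List<ShardedLine> lines) {
        TreeMap<Long, Long> starts = shardStartLines(lines);
        TreeMap<Long, String> numbered = new TreeMap<>();
        for (ShardedLine line : lines) {
            numbered.put(starts.get(line.shardOffset) + line.localLine, line.data);
        }
        return numbered;
    }

    public static void main(String[] args) {
        // Records arrive in arbitrary order, as they would in a PCollection.
        List<ShardedLine> lines = new ArrayList<>();
        lines.add(new ShardedLine(100L, 1L, "c"));
        lines.add(new ShardedLine(0L, 2L, "b"));
        lines.add(new ShardedLine(0L, 1L, "a"));
        System.out.println(assignGlobalLineNumbers(lines));
    }
}
```

In an actual pipeline the counting step would presumably be a per-key combine, and the small map of shard start lines could be handed to the final step as a side input. That mapping onto Beam is a guess about how the pseudocode would be expressed, not something the original answer spells out.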



Source: https://stackoverflow.com/questions/53746046/how-can-i-implement-zipwithindex-like-spark-in-apache-beam
