apache-beam

How can I maximize throughput for an embarrassingly-parallel task in Python on Google Cloud Platform?

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-11 16:26:33
Question: I am trying to use Apache Beam / Google Cloud Dataflow to speed up an existing Python application. The bottleneck occurs after the application randomly permutes an input matrix N times (N defaults to 125 but could be more) and then runs a clustering algorithm on each permuted matrix. The runs are fully independent of one another. I've captured the top of the pipeline below: This processes the default 125 permutations. As you can see, only the RunClustering step takes an appreciable amount of time
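
A common way to express this kind of embarrassingly-parallel fan-out in the Beam Python SDK is to create one element per run and insert a Reshuffle so the runner can spread the expensive step across workers. This is only a sketch of that idea, not the asker's pipeline; run_clustering and INPUT_MATRIX are hypothetical stand-ins:

    import random

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    INPUT_MATRIX = [[1, 2], [3, 4]]  # placeholder for the real input matrix

    def run_clustering(seed, matrix):
        # placeholder for the expensive, fully independent clustering run
        rng = random.Random(seed)
        permuted = list(matrix)
        rng.shuffle(permuted)
        return seed, len(permuted)

    with beam.Pipeline(options=PipelineOptions()) as p:
        _ = (
            p
            | 'Seeds' >> beam.Create(range(125))    # one element per permutation
            | 'BreakFusion' >> beam.Reshuffle()     # stops the runner fusing Create with the heavy step
            | 'RunClustering' >> beam.Map(run_clustering, matrix=INPUT_MATRIX)
        )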

Apache Beam: Programmatically create partitioned tables

Submitted by 那年仲夏 on 2019-12-11 15:46:35
Question: I am writing a Cloud Dataflow pipeline that reads messages from Pub/Sub and stores them in BigQuery. I want to use a date-partitioned table, and I am using the timestamp associated with each message to determine which partition the message should go into. Below is my code:

    BigQueryIO.writeTableRows()
        .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
            private static final long serialVersionUID = 1L;

            @Override
            public TableDestination apply(ValueInSingleWindow<TableRow> value) {
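
For reference, the same idea in the Beam Python SDK: WriteToBigQuery accepts a callable table destination, and a $YYYYMMDD decorator on the table name targets a daily partition. A minimal sketch; the project, dataset, table, and timestamp field are assumptions, not taken from the question:

    import apache_beam as beam

    def daily_partition(row):
        # assumes row['timestamp'] is an ISO-8601 string such as '2019-12-11T15:46:35Z'
        day = row['timestamp'][:10].replace('-', '')
        return 'my-project:my_dataset.events$' + day  # $YYYYMMDD selects the partition

    write = beam.io.WriteToBigQuery(
        table=daily_partition,  # evaluated once per element
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    )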

Apache Beam Java SDK SparkRunner write to Parquet error

Submitted by 你说的曾经没有我的故事 on 2019-12-11 15:45:57
Question: I'm using Apache Beam with Java. I'm trying to read a CSV file and write it out in Parquet format using the SparkRunner on a pre-deployed Spark environment, in local mode. Everything worked fine with the DirectRunner, but the SparkRunner simply won't work. I'm using the Maven Shade plugin to build a fat jar. The code is as below:

Java:

    public class ImportCSVToParquet {
        // ... omitted ...
        File csv = new File(filePath);
        PCollection<String> vals = pipeline.apply(TextIO.read().from(filePath));
        String parquetFilename = csv
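
The packaging problem itself is Java-specific, but for comparison the same read-CSV-write-Parquet pipeline can be sketched in the Beam Python SDK; the column names and paths below are made up:

    import apache_beam as beam
    import pyarrow as pa

    SCHEMA = pa.schema([('name', pa.string()), ('value', pa.int64())])  # made-up columns

    def parse_line(line):
        name, value = line.split(',')
        return {'name': name, 'value': int(value)}

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.io.ReadFromText('input.csv', skip_header_lines=1)
            | beam.Map(parse_line)
            | beam.io.WriteToParquet('out/part', SCHEMA, file_name_suffix='.parquet')
        )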

Dataflow Apache Beam Python job stuck at GroupBy step

Submitted by 我是研究僧i on 2019-12-11 14:13:37
Question: I am running a Dataflow job which reads from BigQuery, scans around 8 GB of data, and produces more than 50,000,000 records. At the group-by step I want to group on a key, and one column needs to be concatenated. After concatenation the size of that column grows to more than 100 MB, which is why I have to do the group-by in the Dataflow job: it cannot be done at the BigQuery level because of BigQuery's 100 MB row-size limit. The Dataflow job scales well when reading from
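
The group-and-concatenate step itself can be written with CombinePerKey, which lets the runner pre-combine partial results on each worker instead of shipping every value to one place first. A minimal sketch in the Python SDK; the 'key' and 'text' fields and the Create input stand in for the real BigQuery rows:

    import apache_beam as beam

    with beam.Pipeline() as p:
        rows = p | beam.Create([
            {'key': 'a', 'text': 'x'},  # stand-ins for the BigQuery rows
            {'key': 'a', 'text': 'y'},
            {'key': 'b', 'text': 'z'},
        ])
        _ = (
            rows
            | beam.Map(lambda row: (row['key'], row['text']))
            | beam.CombinePerKey(lambda parts: ','.join(parts))  # joining is associative, so partial combines are safe
            | beam.Map(print)
        )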

Assigning to GenericRecord the timestamp from an inner object

Submitted by 删除回忆录丶 on 2019-12-11 13:36:32
Question: Processing streaming events and writing files into hourly buckets is a challenge because of windowing: some events from the incoming hour can belong to previous ones, and so on. I've been digging around Apache Beam and its triggers, but I'm struggling to control triggering by timestamp. This is what I have so far:

    Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(1)))
        .withAllowedLateness(Duration.ZERO)
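
For comparison, an event-time variant in the Beam Python SDK, where firing is driven by the watermark rather than processing time; the window size, lateness, and sample data below are illustrative, not taken from the question:

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.transforms import trigger

    with beam.Pipeline() as p:
        events = (
            p
            | beam.Create([('turbine-1', 10.0), ('turbine-2', 3720.0)])
            | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))  # attach event timestamps
        )
        _ = events | beam.WindowInto(
            window.FixedWindows(60 * 60),                      # hourly buckets
            trigger=trigger.AfterWatermark(
                late=trigger.AfterProcessingTime(60)),         # re-fire for late data
            allowed_lateness=2 * 60 * 60,
            accumulation_mode=trigger.AccumulationMode.DISCARDING,
        )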

nltk dependencies in Dataflow

Submitted by 六月ゝ 毕业季﹏ on 2019-12-11 12:48:24
Question: I know that external Python dependencies can be fed into Dataflow via the requirements.txt file. I can successfully load nltk in my Dataflow script. However, nltk often needs further files to be downloaded (e.g. stopwords or punkt). Usually, on a local run of the script, I can just run

    nltk.download('stopwords')
    nltk.download('punkt')

and these files will be available to the script. How do I do this so that the files are also available to the worker scripts? It seems like it would be extremely
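
The excerpt cuts off before the answer, but one common pattern is to download the corpora once per worker inside the DoFn lifecycle, e.g. in setup(), so every worker has them on local disk. A sketch under that assumption:

    import apache_beam as beam

    class RemoveStopwords(beam.DoFn):
        def setup(self):
            # setup() runs once per worker process, so the downloads land on each worker's disk
            import nltk
            nltk.download('stopwords', quiet=True)
            nltk.download('punkt', quiet=True)
            from nltk.corpus import stopwords
            self._stop = set(stopwords.words('english'))

        def process(self, text):
            import nltk
            yield [tok for tok in nltk.word_tokenize(text) if tok.lower() not in self._stop]

    # usage: lines | beam.ParDo(RemoveStopwords())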

How to read blob (pickle) files from GCS in a Google Cloud Dataflow job?

Submitted by 心已入冬 on 2019-12-11 12:09:05
Question: I am trying to run a Dataflow pipeline remotely which will use a pickle file. Locally, I can use the code below to load the file:

    with open(known_args.file_path, 'rb') as fp:
        file = pickle.load(fp)

However, it fails when the path points to Cloud Storage (gs://...):

    IOError: [Errno 2] No such file or directory: 'gs://.../.pkl'

I roughly understand why it is not working, but I cannot find the right way to do it.

Answer 1: If you have pickle files in your GCS bucket, then you can load them as
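
The answer is cut off here; one approach that handles both local paths and gs:// URIs is Beam's FileSystems abstraction (a sketch, not necessarily the answerer's exact code; the bucket and object name are made up):

    import pickle

    from apache_beam.io.filesystems import FileSystems

    def load_pickle(path):
        # FileSystems resolves the scheme, so the same call works for local files and GCS objects
        with FileSystems.open(path) as fp:
            return pickle.load(fp)

    model = load_pickle('gs://my-bucket/model.pkl')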

How to count elements per window

Submitted by 南楼画角 on 2019-12-11 11:35:39
Question: I'm trying to solve what seems to be an easy problem: count how many elements there are in a PCollection per window. I need to pass that count to the .withSharding() function on write, to create as many shards as there will be files to write. I tried:

    FileIO.writeDynamic<Long, E>()
        .withDestinationCoder(AvroCoder.of(Long::class.java))
        .by { e -> e.key }
        .via(Contextful.fn(MySerFunction()))
        .withNaming({ key -> MyFileNaming() })
        .withSharding(ShardingFn())
        .to("gs://some-output")

    class
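
Setting the withSharding plumbing aside, the per-window count on its own looks like this in the Beam Python SDK; the window size and sample input are illustrative:

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.transforms.combiners import CountCombineFn

    with beam.Pipeline() as p:
        _ = (
            p
            | beam.Create([('a', 1), ('b', 2), ('c', 130)])
            | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))    # toy event timestamps
            | beam.WindowInto(window.FixedWindows(60))
            | beam.CombineGlobally(CountCombineFn()).without_defaults()  # emits one count per window
            | beam.Map(print)
        )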

Accessing information (metadata) in the file name and type in a Beam pipeline

Submitted by 情到浓时终转凉″ on 2019-12-11 11:25:15
Question: My filename contains information that I need in my pipeline; for example, the identifier for my data points is part of the filename rather than a field in the data. E.g. every wind turbine generates a file turbine-loc-001-007.csv, and I need the loc data within the pipeline.

Answer 1: Java (SDK 2.9.0): Beam's TextIO readers do not give access to the filename itself; for these use cases we need to make use of FileIO to match the files and gain access to the information stored in the file name. Unlike
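
The Python SDK's fileio module follows the same match-then-read pattern; a sketch that pulls the loc out of each matched path (the bucket and the parsing of the name are illustrative):

    import os

    import apache_beam as beam
    from apache_beam.io import fileio

    def with_loc(readable_file):
        # readable_file.metadata.path is the full name, e.g. gs://bucket/turbine-loc-001-007.csv
        name = os.path.basename(readable_file.metadata.path)
        loc = name[len('turbine-loc-'):-len('.csv')]
        for line in readable_file.read_utf8().splitlines():
            yield loc, line

    with beam.Pipeline() as p:
        rows = (
            p
            | fileio.MatchFiles('gs://bucket/turbine-loc-*.csv')
            | fileio.ReadMatches()
            | beam.FlatMap(with_loc)
        )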

Returning a large data structure from a Dataflow worker node, getting stuck serializing the graph

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-11 10:14:52
Question: I have a large graph (~100k vertices and ~1 million edges) being constructed in a DoFn. When I try to output that graph, execution of the DoFn gets stuck at c.output(graph);

    public static class Prep extends DoFn<TableRow, TableRows> {
        @Override
        public void processElement(ProcessContext c) {
            // Graph creation logic runs very fast, no problem here
            LOG.info("Starting Graph Output"); // can see this in logs
            c.output(graph); // outputs data from DoFn function
            LOG.info("Ending Graph Output
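
The excerpt does not include an answer, but a common workaround in this situation is to emit the graph as many small elements (for example, one edge per output) rather than one huge object, so the coder never has to serialize the whole structure at once. A sketch of that idea in the Python SDK; make_edges and the row format are hypothetical:

    import apache_beam as beam

    def make_edges(row):
        # hypothetical stand-in for the real graph-construction logic
        nodes = row['nodes']
        return zip(nodes, nodes[1:])

    class BuildGraph(beam.DoFn):
        def process(self, row):
            # emitting edges one at a time keeps each output element small,
            # so no single element passes through the coder as a giant blob
            for src, dst in make_edges(row):
                yield (src, dst)

    # usage: rows | beam.ParDo(BuildGraph())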