apache-beam

AttributeError: 'module' object has no attribute 'ensure_str'

北慕城南 submitted on 2020-06-27 11:05:27
Question: I am trying to transfer data from one BigQuery table to another through Beam; however, the following error comes up: WARNING:root:Retry with exponential backoff: waiting for 4.12307941111 seconds before retrying get_query_location because we caught exception: AttributeError: 'module' object has no attribute 'ensure_str' Traceback for above exception (most recent call last): File "/usr/local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 197, in wrapper return fun(*args, **kwargs) File "
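The missing attribute usually points at an outdated six package in the Python 2.7 environment: six.ensure_str() only exists from six 1.12 onward, and something in the get_query_location call path apparently expects it. A minimal diagnostic sketch, assuming the six version is indeed the culprit (the pip command and requirements-file hint are illustrative):

# A minimal diagnostic sketch, assuming the error is caused by an old `six`
# release: six.ensure_str() was only added in six 1.12, so older versions
# raise exactly this AttributeError.
import six

print(six.__version__)
print(hasattr(six, "ensure_str"))  # False on six < 1.12

# If this prints False, upgrading six in the environment that runs the
# pipeline usually resolves it, for example:
#   pip install --upgrade "six>=1.12"
# When running on Dataflow, ship the same pinned version to the workers too
# (for example via a requirements.txt passed with --requirements_file).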

Apache Beam: Refreshing a side input which I am reading from MongoDB using MongoDbIO.read() Part 2

回眸只為那壹抹淺笑 submitted on 2020-06-17 15:57:29
Question: I am not sure how GenerateSequence would work for me, as I have to read values from MongoDB periodically, on an hourly or daily basis. I created a ParDo that reads MongoDB and also added a window into GlobalWindows with a trigger (the trigger I will update as per the requirement). But the code snippet below gives a return-type error, so could you please help me correct these lines of code? Please also find a snapshot of the error attached. Also, how does GenerateSequence help in my case? PCollectionView<List<String>> list_of
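For reference, the usual shape of this pattern is a periodic impulse from GenerateSequence that re-reads the source on every tick, windowed into the global window with a processing-time trigger, and then turned into a view; a plain MongoDbIO.read() cannot be applied mid-pipeline, which is why the ParDo does the query. A minimal sketch under those assumptions (the hourly rate and the readMongoValues() helper are hypothetical):

// A minimal sketch of the slowly-changing side input pattern; the hourly rate
// and the readMongoValues() helper (which wraps the actual MongoDB query)
// are assumptions, not part of the original code.
PCollectionView<List<String>> listView =
    pipeline
        .apply("HourlyTick",
            GenerateSequence.from(0).withRate(1, Duration.standardHours(1)))
        .apply("ReadFromMongo",
            ParDo.of(new DoFn<Long, String>() {
              @ProcessElement
              public void process(ProcessContext c) {
                for (String value : readMongoValues()) {  // hypothetical helper
                  c.output(value);
                }
              }
            }))
        .apply(Window.<String>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(View.asList());

Each trigger firing refreshes what the side input sees. Note that it is the final View.asList() that produces the PCollectionView; if the chain ends in a plain PCollection instead, assigning it to PCollectionView<List<String>> will not compile, which may be the source of the return-type error.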

Reading an XML file in Apache Beam using XmlIO

不羁岁月 submitted on 2020-06-17 09:45:14
Question: Problem statement: I am trying to read and print the contents of an XML file in Beam using the direct runner. Here is the code snippet: public class BookStore { public static void main(String args[]) { BookOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(BookOptions.class); Pipeline pipeline = Pipeline.create(options); PCollection<Book> output = pipeline.apply(XmlIO.<Book>read().from("sample.xml") .withRootElement("book") .withRecordElement("name") .withRecordClass(Book
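One thing worth checking: withRootElement should name the outermost element of the file and withRecordElement the repeated element that maps to one record, and the record class must be JAXB-annotated. A minimal sketch under those assumptions (the <bookstore><book><name>...</name></book></bookstore> layout and element names are illustrative, not from the question):

// A minimal sketch, assuming sample.xml looks like
// <bookstore><book><name>...</name></book>...</bookstore>.
// Requires javax.xml.bind.annotation.XmlRootElement plus the usual Beam imports.
@XmlRootElement(name = "book")
class Book implements Serializable {
  private String name;
  public String getName() { return name; }
  public void setName(String name) { this.name = name; }
}

PCollection<Book> books =
    pipeline.apply(
        XmlIO.<Book>read()
            .from("sample.xml")
            .withRootElement("bookstore")   // outermost element (assumed name)
            .withRecordElement("book")      // one record per <book> element
            .withRecordClass(Book.class));

books.apply("Print", ParDo.of(new DoFn<Book, Void>() {
  @ProcessElement
  public void process(@Element Book book) {
    System.out.println(book.getName());
  }
}));
pipeline.run().waitUntilFinish();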

Usage problem with add_value_provider_argument on a streaming pipeline (Apache Beam / Python)

一曲冷凌霜 submitted on 2020-06-17 02:28:47
Question: We want to create a custom Dataflow template using add_value_provider_argument for the pipeline parameters, but we are unable to launch the following command without supplying the variables defined in add_value_provider_argument(). class UserOptions(PipelineOptions): @classmethod def _add_argparse_args(cls, parser): parser.add_value_provider_argument( '--input_topic', help='The Cloud Pub/Sub topic to read from.\n' '"projects/<PROJECT_NAME>/topics/<TOPIC_NAME>".' ) parser.add_value_provider_argument( '-
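For context, options declared with add_value_provider_argument are only resolved at run time via .get(); the pipeline construction code must not read their values while the template is being built. A minimal sketch of that pattern (the --output_prefix option and the TagWithPrefix DoFn are hypothetical additions for illustration, not taken from the question):

# A minimal sketch of the ValueProvider pattern; --output_prefix and
# TagWithPrefix are hypothetical, added only to show runtime access.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--input_topic',
            help='The Cloud Pub/Sub topic to read from: '
                 '"projects/<PROJECT_NAME>/topics/<TOPIC_NAME>".')
        parser.add_value_provider_argument(
            '--output_prefix', default='out')


class TagWithPrefix(beam.DoFn):
    def __init__(self, prefix):
        # Store the ValueProvider itself; do NOT call .get() here, because at
        # template-construction time the value does not exist yet.
        self._prefix = prefix

    def process(self, element):
        yield '%s-%s' % (self._prefix.get(), element)  # .get() only at run time

With this shape, the template can be created without passing the value-provider variables; their concrete values are supplied when the template is launched. Whether a particular IO accepts a ValueProvider for its arguments depends on that IO.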

How do I run Beam Python pipelines using Flink deployed on Kubernetes?

家住魔仙堡 submitted on 2020-05-26 06:44:26
Question: Does anybody know how to run Beam Python pipelines with Flink when Flink is running as pods in Kubernetes? I have successfully managed to run a Beam Python pipeline using the Portable runner and the job service pointing to a local Flink server running in Docker containers. I was able to achieve that by mounting the Docker socket in my Flink containers and running Flink as the root process, so that the DockerEnvironmentFactory class can create the Python harness container. Unfortunately, I can't use the
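One commonly used alternative to DockerEnvironmentFactory in Kubernetes is the EXTERNAL environment: a Beam Python SDK worker pool (the SDK container started with --worker_pool) runs as a sidecar container in each Flink task manager pod, and the pipeline points at it instead of spawning Docker containers. A minimal sketch of the pipeline options under that assumption (the job-server address and the worker-pool port are placeholders):

# A minimal sketch, assuming a Beam Python SDK worker pool runs as a sidecar
# in each Flink task manager pod and listens on port 50000; the hostnames and
# ports below are placeholders, not values from the question.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=PortableRunner',
    '--job_endpoint=flink-jobserver:8099',    # assumed Beam job service address
    '--environment_type=EXTERNAL',
    '--environment_config=localhost:50000',   # worker pool sidecar in the pod
])

with beam.Pipeline(options=options) as p:
    p | beam.Create(['hello', 'beam']) | beam.Map(print)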

Array type in ClickHouseIO for Apache Beam (Dataflow)

江枫思渺然 submitted on 2020-05-17 07:55:26
Question: I am using Apache Beam to consume JSON and insert it into ClickHouse. I am currently having a problem with the Array data type. Everything works fine until I add an array-typed field: Schema.Field.of("inputs.value", Schema.FieldType.array(Schema.FieldType.INT64).withNullable(true)) Code for the transformations: p.apply(transformNameSuffix + "ReadFromPubSub", PubsubIO.readStrings().fromSubscription(chainConfig.getPubSubSubscriptionPrefix() + "transactions").withIdAttribute(PUBSUB_ID_ATTRIBUTE))
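One detail that often matters here is where the nullability sits: FieldType.array(FieldType.INT64).withNullable(true) makes the array itself nullable, whereas ClickHouse columns are usually declared as Array(Int64) or Array(Nullable(Int64)) and generally do not accept a Nullable(Array(...)) type, so the nullability may belong on the element type instead. A minimal sketch of an element-nullable array schema and a matching row (the field names, table, and JDBC URL are illustrative):

// A minimal sketch of an array-typed Beam schema and a matching Row; the
// element-level nullability, field names, and JDBC URL are assumptions.
Schema schema =
    Schema.builder()
        .addStringField("hash")
        .addField("inputs.value",
            Schema.FieldType.array(Schema.FieldType.INT64.withNullable(true)))
        .build();

Row row =
    Row.withSchema(schema)
        .addValue("0xabc...")                     // hash
        .addValue(Arrays.asList(1L, null, 3L))    // inputs.value
        .build();

// rows.apply(ClickHouseIO.<Row>write(
//     "jdbc:clickhouse://localhost:8123/db", "transactions"));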

Why are increments not supported in the Dataflow-BigTable connector?

狂风中的少年 submitted on 2020-05-13 08:14:32
Question: We have a streaming-mode use case where we want to keep track of a counter in BigTable from the pipeline (something like the number of items that finished processing), for which we need the increment operation. From looking at https://cloud.google.com/bigtable/docs/dataflow-hbase, I see that the append/increment operations of the HBase API are not supported by this client. The stated reason is the retry logic in batch mode, but if Dataflow guarantees exactly-once processing, why would supporting it be a bad idea, since I know

Write TFRecords from a Beam pipeline?

随声附和 submitted on 2020-04-30 08:21:47
Question: I have some data in Map format and I want to convert it to TFRecords using a Beam pipeline. Here is my attempt at writing the code. I have attempted this in Python, which works, but I need to implement it in Java because some business logic there cannot be ported to Python. The corresponding working Python implementation can be found in this question. import com.google.protobuf.ByteString; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
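For orientation, TFRecordIO on the Java side writes a PCollection<byte[]>, so the usual approach is to build a tensorflow Example proto per element and serialize it. A minimal sketch, assuming the org.tensorflow.example protos are on the classpath and each input element is a Map<String, String> of feature name to value (the input collection and output path are placeholders):

// A minimal sketch; `input` stands in for a PCollection<Map<String, String>>
// and the output path is illustrative.
PCollection<byte[]> serialized =
    input.apply("ToExampleBytes",
        MapElements.into(TypeDescriptor.of(byte[].class))
            .via((Map<String, String> fields) -> {
              Features.Builder features = Features.newBuilder();
              fields.forEach((name, value) ->
                  features.putFeature(name,
                      Feature.newBuilder()
                          .setBytesList(BytesList.newBuilder()
                              .addValue(ByteString.copyFromUtf8(value)))
                          .build()));
              return Example.newBuilder()
                  .setFeatures(features)
                  .build()
                  .toByteArray();
            }));

serialized.apply("WriteTFRecords",
    TFRecordIO.write().to("/tmp/output/part").withSuffix(".tfrecord"));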