apache-beam

AttributeError: 'module' object has no attribute 'ensure_str'

北慕城南 submitted on 2020-06-27 11:05:27
Question: I am trying to transfer data from one BigQuery table to another through Beam; however, the following error comes up: WARNING:root:Retry with exponential backoff: waiting for 4.12307941111 seconds before retrying get_query_location because we caught exception: AttributeError: 'module' object has no attribute 'ensure_str' Traceback for above exception (most recent call last): File "/usr/local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 197, in wrapper return fun(*args, **kwargs) File "
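The missing attribute usually points at an outdated six package in the Python 2.7 environment: six.ensure_str() only exists from six 1.12 onward, and something in the get_query_location call path apparently expects it. A minimal diagnostic sketch, assuming the six version is indeed the culprit (the pip command and requirements-file hint are illustrative):

# A minimal diagnostic sketch, assuming the error is caused by an old `six`
# release: six.ensure_str() was only added in six 1.12, so older versions
# raise exactly this AttributeError.
import six

print(six.__version__)
print(hasattr(six, "ensure_str"))  # False on six < 1.12

# If this prints False, upgrading six in the environment that runs the
# pipeline usually resolves it, for example:
#   pip install --upgrade "six>=1.12"
# When running on Dataflow, ship the same pinned version to the workers too
# (for example via a requirements.txt passed with --requirements_file).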

Apache Beam: Refreshing a side input which I am reading from MongoDB using MongoDbIO.read() Part 2

回眸只為那壹抹淺笑 submitted on 2020-06-17 15:57:29
Question: I am not sure how GenerateSequence would work for me, as I have to read values from MongoDB periodically, on an hourly or daily basis. I created a ParDo that reads MongoDB and also added a window into GlobalWindows with a trigger (the trigger I will update as per the requirement). But the code snippet below gives a return-type error, so could you please help me correct these lines of code? Please also find a snapshot of the error attached. Also, how does GenerateSequence help in my case? PCollectionView<List<String>> list_of
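For reference, the usual shape of this pattern is a periodic impulse from GenerateSequence that re-reads the source on every tick, windowed into the global window with a processing-time trigger, and then turned into a view; a plain MongoDbIO.read() cannot be applied mid-pipeline, which is why the ParDo does the query. A minimal sketch under those assumptions (the hourly rate and the readMongoValues() helper are hypothetical):

// A minimal sketch of the slowly-changing side input pattern; the hourly rate
// and the readMongoValues() helper (which wraps the actual MongoDB query)
// are assumptions, not part of the original code.
PCollectionView<List<String>> listView =
    pipeline
        .apply("HourlyTick",
            GenerateSequence.from(0).withRate(1, Duration.standardHours(1)))
        .apply("ReadFromMongo",
            ParDo.of(new DoFn<Long, String>() {
              @ProcessElement
              public void process(ProcessContext c) {
                for (String value : readMongoValues()) {  // hypothetical helper
                  c.output(value);
                }
              }
            }))
        .apply(Window.<String>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
        .apply(View.asList());

Each trigger firing refreshes what the side input sees. Note that it is the final View.asList() that produces the PCollectionView; if the chain ends in a plain PCollection instead, assigning it to PCollectionView<List<String>> will not compile, which may be the source of the return-type error.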

Reading an XML file in Apache Beam using XmlIO

不羁岁月 submitted on 2020-06-17 09:45:14
Question: Problem statement: I am trying to read and print the contents of an XML file in Beam using the direct runner. Here is the code snippet: public class BookStore { public static void main(String args[]) { BookOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(BookOptions.class); Pipeline pipeline = Pipeline.create(options); PCollection<Book> output = pipeline.apply(XmlIO.<Book>read().from("sample.xml") .withRootElement("book") .withRecordElement("name") .withRecordClass(Book
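One thing worth checking: withRootElement should name the outermost element of the file and withRecordElement the repeated element that maps to one record, and the record class must be JAXB-annotated. A minimal sketch under those assumptions (the <bookstore><book><name>...</name></book></bookstore> layout and element names are illustrative, not from the question):

// A minimal sketch, assuming sample.xml looks like
// <bookstore><book><name>...</name></book>...</bookstore>.
// Requires javax.xml.bind.annotation.XmlRootElement plus the usual Beam imports.
@XmlRootElement(name = "book")
class Book implements Serializable {
  private String name;
  public String getName() { return name; }
  public void setName(String name) { this.name = name; }
}

PCollection<Book> books =
    pipeline.apply(
        XmlIO.<Book>read()
            .from("sample.xml")
            .withRootElement("bookstore")   // outermost element (assumed name)
            .withRecordElement("book")      // one record per <book> element
            .withRecordClass(Book.class));

books.apply("Print", ParDo.of(new DoFn<Book, Void>() {
  @ProcessElement
  public void process(@Element Book book) {
    System.out.println(book.getName());
  }
}));
pipeline.run().waitUntilFinish();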

Usage problem with add_value_provider_argument on a streaming pipeline (Apache Beam / Python)

一曲冷凌霜 submitted on 2020-06-17 02:28:47
Question: We want to create a custom Dataflow template using add_value_provider_argument for the pipeline parameters, but we are unable to launch the following command without supplying the variables defined in add_value_provider_argument(). class UserOptions(PipelineOptions): @classmethod def _add_argparse_args(cls, parser): parser.add_value_provider_argument( '--input_topic', help='The Cloud Pub/Sub topic to read from.\n' '"projects/<PROJECT_NAME>/topics/<TOPIC_NAME>".' ) parser.add_value_provider_argument( '-
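For context, options declared with add_value_provider_argument are only resolved at run time via .get(); the pipeline construction code must not read their values while the template is being built. A minimal sketch of that pattern (the --output_prefix option and the TagWithPrefix DoFn are hypothetical additions for illustration, not taken from the question):

# A minimal sketch of the ValueProvider pattern; --output_prefix and
# TagWithPrefix are hypothetical, added only to show runtime access.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--input_topic',
            help='The Cloud Pub/Sub topic to read from: '
                 '"projects/<PROJECT_NAME>/topics/<TOPIC_NAME>".')
        parser.add_value_provider_argument(
            '--output_prefix', default='out')


class TagWithPrefix(beam.DoFn):
    def __init__(self, prefix):
        # Store the ValueProvider itself; do NOT call .get() here, because at
        # template-construction time the value does not exist yet.
        self._prefix = prefix

    def process(self, element):
        yield '%s-%s' % (self._prefix.get(), element)  # .get() only at run time

With this shape, the template can be created without passing the value-provider variables; their concrete values are supplied when the template is launched. Whether a particular IO accepts a ValueProvider for its arguments depends on that IO.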

How do I run Beam Python pipelines using Flink deployed on Kubernetes?

家住魔仙堡 submitted on 2020-05-26 06:44:26
Question: Does anybody know how to run Beam Python pipelines with Flink when Flink is running as pods in Kubernetes? I have successfully managed to run a Beam Python pipeline using the Portable runner and the job service pointing to a local Flink server running in Docker containers. I was able to achieve that by mounting the Docker socket in my Flink containers and running Flink as the root process, so that the DockerEnvironmentFactory class can create the Python harness container. Unfortunately, I can't use the
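One commonly used alternative to DockerEnvironmentFactory in Kubernetes is the EXTERNAL environment: a Beam Python SDK worker pool (the SDK container started with --worker_pool) runs as a sidecar container in each Flink task manager pod, and the pipeline points at it instead of spawning Docker containers. A minimal sketch of the pipeline options under that assumption (the job-server address and the worker-pool port are placeholders):

# A minimal sketch, assuming a Beam Python SDK worker pool runs as a sidecar
# in each Flink task manager pod and listens on port 50000; the hostnames and
# ports below are placeholders, not values from the question.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=PortableRunner',
    '--job_endpoint=flink-jobserver:8099',    # assumed Beam job service address
    '--environment_type=EXTERNAL',
    '--environment_config=localhost:50000',   # worker pool sidecar in the pod
])

with beam.Pipeline(options=options) as p:
    p | beam.Create(['hello', 'beam']) | beam.Map(print)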

Array type in ClickHouseIO for Apache Beam (Dataflow)

江枫思渺然 submitted on 2020-05-17 07:55:26
Question: I am using Apache Beam to consume JSON and insert it into ClickHouse. I am currently having a problem with the Array data type. Everything works fine until I add an array-typed field: Schema.Field.of("inputs.value", Schema.FieldType.array(Schema.FieldType.INT64).withNullable(true)) Code for the transformations: p.apply(transformNameSuffix + "ReadFromPubSub", PubsubIO.readStrings().fromSubscription(chainConfig.getPubSubSubscriptionPrefix() + "transactions").withIdAttribute(PUBSUB_ID_ATTRIBUTE))
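One detail that often matters here is where the nullability sits: FieldType.array(FieldType.INT64).withNullable(true) makes the array itself nullable, whereas ClickHouse columns are usually declared as Array(Int64) or Array(Nullable(Int64)) and generally do not accept a Nullable(Array(...)) type, so the nullability may belong on the element type instead. A minimal sketch of an element-nullable array schema and a matching row (the field names, table, and JDBC URL are illustrative):

// A minimal sketch of an array-typed Beam schema and a matching Row; the
// element-level nullability, field names, and JDBC URL are assumptions.
Schema schema =
    Schema.builder()
        .addStringField("hash")
        .addField("inputs.value",
            Schema.FieldType.array(Schema.FieldType.INT64.withNullable(true)))
        .build();

Row row =
    Row.withSchema(schema)
        .addValue("0xabc...")                     // hash
        .addValue(Arrays.asList(1L, null, 3L))    // inputs.value
        .build();

// rows.apply(ClickHouseIO.<Row>write(
//     "jdbc:clickhouse://localhost:8123/db", "transactions"));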

Why are increments not supported in the Dataflow-BigTable connector?

狂风中的少年 submitted on 2020-05-13 08:14:32
Question: We have a streaming-mode use case where we want to keep track of a counter in BigTable from the pipeline (something like the number of items that finished processing), for which we need the increment operation. From looking at https://cloud.google.com/bigtable/docs/dataflow-hbase, I see that the append/increment operations of the HBase API are not supported by this client. The stated reason is the retry logic in batch mode, but if Dataflow guarantees exactly-once processing, why would supporting it be a bad idea, since I know

Write TFRecords from a Beam pipeline?

随声附和 submitted on 2020-04-30 08:21:47
Question: I have some data in Map format and I want to convert it to TFRecords using a Beam pipeline. Here is my attempt at writing the code. I have attempted this in Python, which works, but I need to implement it in Java because some business logic there cannot be ported to Python. The corresponding working Python implementation can be found in this question. import com.google.protobuf.ByteString; import org.apache.beam.sdk.Pipeline; import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
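For orientation, TFRecordIO on the Java side writes a PCollection<byte[]>, so the usual approach is to build a tensorflow Example proto per element and serialize it. A minimal sketch, assuming the org.tensorflow.example protos are on the classpath and each input element is a Map<String, String> of feature name to value (the input collection and output path are placeholders):

// A minimal sketch; `input` stands in for a PCollection<Map<String, String>>
// and the output path is illustrative.
PCollection<byte[]> serialized =
    input.apply("ToExampleBytes",
        MapElements.into(TypeDescriptor.of(byte[].class))
            .via((Map<String, String> fields) -> {
              Features.Builder features = Features.newBuilder();
              fields.forEach((name, value) ->
                  features.putFeature(name,
                      Feature.newBuilder()
                          .setBytesList(BytesList.newBuilder()
                              .addValue(ByteString.copyFromUtf8(value)))
                          .build()));
              return Example.newBuilder()
                  .setFeatures(features)
                  .build()
                  .toByteArray();
            }));

serialized.apply("WriteTFRecords",
    TFRecordIO.write().to("/tmp/output/part").withSuffix(".tfrecord"));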