Question
I am using Event Hubs to ingest a large volume of events. I have multiple consumers running behind a scaling group, reading these events from an event hub that has multiple partitions. I was going through the Azure SDK for Python and was confused about what to use: there is EventHubConsumerClient, EventProcessorHost, and so on.
I would like to use a library where my multiple consumers connect through a consumer group, partitions are assigned dynamically to each consumer, and checkpoints are stored in a storage account, just as I did with Kafka.
Answer 1:
Update:
For production use, I suggest the stable version of the Event Hubs SDK. You can use EPH (Event Processor Host); sample code is here.
With the pre-release azure-eventhub 5.0.0b6, I can use a consumer group and also set checkpoints.
One strange thing is that in blob storage I can see two folders created for the event hub: a checkpoint folder and an ownership folder. Inside each folder a blob is created per partition, but the blobs are empty. Stranger still, even though the blobs are empty, every read from the event hub returns only the latest data (that is, it never re-reads data that has already been read in the same consumer group).
You need to install azure-eventhub 5.0.0b6, then install azure-eventhub-checkpointstoreblob with:
pip install --pre azure-eventhub-checkpointstoreblob
For blob storage, install azure-storage-blob 12.1.0, the latest version at the time of writing.
I followed this sample. It uses an event hub-level connection string (NOT an Event Hubs namespace-level connection string). To create one, go to the Azure portal -> your Event Hubs namespace -> your event hub instance -> Shared access policies -> click "Add", then specify a policy name and select the permissions. If you only want to receive data, selecting the Listen permission is enough.
After the policy is created, you can copy its connection string from the portal.
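A quick way to tell the two kinds of connection string apart is to look for the EntityPath key, which only the event hub-level string carries. The following is a minimal sketch using only the standard library; the helper names parse_connection_string and is_event_hub_level are hypothetical (not part of the SDK), and the values are placeholders in the same shape as the string copied from the portal:

```python
def parse_connection_string(conn_str):
    """Split 'Key=Value;Key=Value' into a dict. Values may themselves contain '='."""
    parts = {}
    for segment in conn_str.split(";"):
        if not segment.strip():
            continue  # tolerate a trailing ';'
        key, _, value = segment.partition("=")  # split on the first '=' only
        parts[key.strip()] = value.strip()
    return parts


def is_event_hub_level(conn_str):
    """True when the string names a specific event hub via EntityPath."""
    return "EntityPath" in parse_connection_string(conn_str)


# Placeholder strings, not real credentials.
hub_level = (
    "Endpoint=sb://example.servicebus.windows.net/;"
    "SharedAccessKeyName=saspolicy;SharedAccessKey=abc=;"
    "EntityPath=myeventhub"
)
namespace_level = (
    "Endpoint=sb://example.servicebus.windows.net/;"
    "SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=abc="
)

print(is_event_hub_level(hub_level))
print(is_event_hub_level(namespace_level))
```

If you only have a namespace-level string, you can still use it by passing the event hub name separately through the eventhub_name argument of EventHubConsumerClient.from_connection_string.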
Then you can use the code below:
import os
from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

CONNECTION_STR = 'Endpoint=sb://ivanehubns.servicebus.windows.net/;SharedAccessKeyName=saspolicy;SharedAccessKey=xxx;EntityPath=myeventhub'
STORAGE_CONNECTION_STR = 'DefaultEndpointsProtocol=https;AccountName=xx;AccountKey=xxx;EndpointSuffix=core.windows.net'

def on_event(partition_context, event):
    # do something with event
    print(event)
    print('on event')
    partition_context.update_checkpoint(event)

if __name__ == '__main__':
    # "a22" is the blob container name
    checkpoint_store = BlobCheckpointStore.from_connection_string(STORAGE_CONNECTION_STR, "a22")
    # "$default" is the consumer group
    client = EventHubConsumerClient.from_connection_string(
        CONNECTION_STR, "$default", checkpoint_store=checkpoint_store)
    try:
        print('ok')
        client.receive(on_event)
    except KeyboardInterrupt:
        client.close()
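The handler above calls update_checkpoint for every event, so each event results in a call to the checkpoint store. If that turns out to be too chatty, one option is to checkpoint only every N events per partition. This is a minimal sketch, assuming the handler receives a partition context with partition_id and update_checkpoint(event) as in the code above; CheckpointEveryN and its n parameter are hypothetical names and not part of the SDK:

```python
from collections import defaultdict


class CheckpointEveryN:
    """Wrap an event handler so a checkpoint is written once every n events per partition."""

    def __init__(self, handler, n=100):
        self.handler = handler
        self.n = n
        # partition_id -> number of events handled since the last checkpoint
        self.counts = defaultdict(int)

    def __call__(self, partition_context, event):
        # Process the event first, so a checkpoint never covers unprocessed events.
        self.handler(partition_context, event)
        pid = partition_context.partition_id
        self.counts[pid] += 1
        if self.counts[pid] >= self.n:
            partition_context.update_checkpoint(event)
            self.counts[pid] = 0
```

You would pass an instance in place of the plain handler, wrapping a handler that does not checkpoint by itself. The trade-off is that after a restart up to n - 1 events per partition may be delivered again, so the processing should tolerate duplicates.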
Answer 2:
azure-eventhub v5 reached GA in January 2020, and the latest version is v5.2.0.
It's available on PyPI: https://pypi.org/project/azure-eventhub/
Please follow the migration guide from v1 to v5 to migrate your program.
To receive with checkpointing, please follow this sample code:
import os
import logging
from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

CONNECTION_STR = os.environ["EVENT_HUB_CONN_STR"]
EVENTHUB_NAME = os.environ['EVENT_HUB_NAME']
STORAGE_CONNECTION_STR = os.environ["AZURE_STORAGE_CONN_STR"]
BLOB_CONTAINER_NAME = "your-blob-container-name"  # Please make sure the blob container resource exists.

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

def on_event_batch(partition_context, event_batch):
    log.info("Partition {}, Received count: {}".format(partition_context.partition_id, len(event_batch)))
    # put your code here
    partition_context.update_checkpoint()

def receive_batch():
    checkpoint_store = BlobCheckpointStore.from_connection_string(STORAGE_CONNECTION_STR, BLOB_CONTAINER_NAME)
    client = EventHubConsumerClient.from_connection_string(
        CONNECTION_STR,
        consumer_group="$Default",
        eventhub_name=EVENTHUB_NAME,
        checkpoint_store=checkpoint_store,
    )
    with client:
        client.receive_batch(
            on_event_batch=on_event_batch,
            max_batch_size=100,
            starting_position="-1",  # "-1" is from the beginning of the partition.
        )

if __name__ == '__main__':
    receive_batch()
One more thing worth noting: in v5, checkpoint and ownership information is stored in the metadata of the blob, instead of in the content of the blob as in v1. So it is expected that the blob content is empty when using the v5 SDK.
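If you want to see this for yourself, you can list the blobs in the container together with their metadata. The following is a rough sketch, assuming azure-storage-blob v12 is installed and that the connection string and container name are your own; describe_blobs and list_checkpoint_blobs are hypothetical helper names. It prints whatever metadata is present and makes no assumption about the key names:

```python
def describe_blobs(blobs):
    """Return (name, size, metadata) for each blob-like object in the listing."""
    return [(b.name, b.size, dict(b.metadata or {})) for b in blobs]


def list_checkpoint_blobs(storage_conn_str, container_name):
    """Print every blob in the container with its size and metadata."""
    # Imported here so describe_blobs can be used without the SDK installed.
    from azure.storage.blob import ContainerClient

    container = ContainerClient.from_connection_string(storage_conn_str, container_name)
    # include=["metadata"] asks the service to return metadata with the listing.
    listing = container.list_blobs(include=["metadata"])
    for name, size, metadata in describe_blobs(listing):
        print(name, size, metadata)
```

Calling list_checkpoint_blobs with your storage connection string and container name after the consumer has run should show blobs of size 0 whose metadata is not empty, which matches the behaviour described above.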
Source: https://stackoverflow.com/questions/59188944/azure-eventhub-library-for-python