Batch Processing System Architecture Using Azure Event Hub, Capture, and Databricks

This architecture pattern describes a batch processing system built on Azure Event Hub, Event Hubs Capture, and Azure Databricks. In this pattern, Azure Event Hub acts as the event ingestor for a drag-and-drop event service, working alongside Azure API Management (APIM) to orchestrate the publication and consumption of events. The pattern covers the components and their roles, along with compute, availability, consistency, scalability, security, and network considerations.

Because Azure Function App instances have unique hostnames and can be deployed across multiple regions, exposing their URLs directly to API consumers would add complexity. To avoid this, the pattern takes an API-first approach, centralizing API endpoints behind APIM.

Azure Event Hubs Capture automatically writes streaming data to Azure Blob Storage or Azure Data Lake Storage Gen2. Capture windows can be specified by time or size, enabling efficient batch processing with Azure Databricks for subsequent analysis and reporting.
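Capture lands each window as an Avro blob whose path follows Event Hubs' default naming convention ({Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}). A minimal sketch of how a downstream batch job can compute the blob path for a given window, assuming that default convention (the template is configurable per capture setup, and this helper is illustrative, not part of any SDK):

```python
from datetime import datetime, timezone

# Default Event Hubs Capture blob naming convention (assumed unchanged):
# {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}
CAPTURE_PATH_TEMPLATE = (
    "{ns}/{hub}/{partition}/{y:04d}/{m:02d}/{d:02d}/{h:02d}/{mi:02d}/{s:02d}"
)

def capture_blob_path(namespace: str, hub: str, partition: int,
                      window_start: datetime) -> str:
    """Build the blob path where Capture writes the Avro file for one window."""
    return CAPTURE_PATH_TEMPLATE.format(
        ns=namespace, hub=hub, partition=partition,
        y=window_start.year, m=window_start.month, d=window_start.day,
        h=window_start.hour, mi=window_start.minute, s=window_start.second,
    )

# Example: the window starting 2024-06-01 12:05:00 UTC on partition 0.
path = capture_blob_path("my-namespace", "my-hub", 0,
                         datetime(2024, 6, 1, 12, 5, 0, tzinfo=timezone.utc))
```

A Databricks job scheduled on the same interval can then list and read exactly the blobs for the windows it has not yet processed.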

This pattern can be extended with Azure Schema Registry, with stream processing through Azure Stream Analytics, or with data materialization in Azure Cosmos DB.

When to Use This Pattern?

Choose this Event Hub pattern if:

  • You prefer to capture streaming data into Azure Blob Storage or Azure Data Lake Storage Gen 2 for batch processing.
  • You need to ensure data security by using Private Endpoints or Service Endpoints.
  • You need to process large volumes of data in scheduled batches.
  • Your data processing tasks are not time-critical and can wait for batch intervals.
  • You require complex data transformations and aggregations.
  • Long-term data storage and historical analysis are important for your use case.
  • You aim to leverage the cost-efficiency of batch processing.
  • You need to integrate with big data processing tools like Azure Databricks.

Components and Roles

  • Azure API Management (APIM): Gateway for API consumers, providing security, rate limiting, and monitoring.
  • Azure Function App: Executes business logic for API requests and interacts with Azure Event Hub.
  • Azure Event Hub Namespace: Scalable platform for event ingestion and processing.
  • Storage Account/Data Lake: Stores ingested events for long-term retention and serves as an integration point for processing tools.
  • Azure Databricks (Consumer): Transforms and analyzes data captured by Azure Event Hub, enabling advanced analytics and machine learning.

Compute Sizing and Tiers 

  • API Management: Standard Tier for development, testing, and basic data processing; Premium Tier for production workloads, enhanced security, and large datasets.
  • Azure Function App: Premium plan.
  • Storage Account: General-purpose v2 (GPv2) account, Hot tier for frequently accessed data.
  • Event Hub Namespace: Standard Tier.
  • Azure Databricks: Standard Tier for development and testing; Premium Tier for production workloads, enhanced security, and availability.

Security

Identity Access and Management

  • Authentication: Prefer managed identities over Shared Access Signatures (SAS) for stronger security and simpler credential management.
  • API Management: Uses OAuth 2.0 or subscription keys for authenticating API requests.
  • RBAC Implementation: Ensures minimal necessary permissions for users and applications. Managed identities are configured for various components to allow specific access levels.
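For calls arriving through APIM, the gateway checks either a subscription key or an OAuth 2.0 bearer token. A minimal sketch of the headers a client would send, assuming APIM's default subscription-key header name (Ocp-Apim-Subscription-Key; it can be renamed per API, and `apim_headers` is an illustrative helper, not part of any SDK):

```python
from typing import Dict, Optional

def apim_headers(subscription_key: Optional[str] = None,
                 bearer_token: Optional[str] = None) -> Dict[str, str]:
    """Build request headers for an API published through APIM.

    By default APIM reads the subscription key from the
    Ocp-Apim-Subscription-Key header; OAuth 2.0 access tokens
    go in the standard Authorization header.
    """
    headers = {"Content-Type": "application/json"}
    if subscription_key:
        headers["Ocp-Apim-Subscription-Key"] = subscription_key
    if bearer_token:
        headers["Authorization"] = f"Bearer {bearer_token}"
    return headers

# Example: a consumer authenticating with a subscription key.
headers = apim_headers(subscription_key="<your-subscription-key>")
```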

Encryption

  • In-Transit Encryption: All data transmitted to and from Event Hub and the Function App is encrypted with TLS 1.2 or later.
  • At-Rest Encryption: Data stored in Event Hub and the Storage Account is encrypted with Azure Storage encryption, using platform-managed keys (PMK) by default or customer-managed keys (CMK) for enhanced control.

Availability and SLA

  • Function App: Premium plan with regional redundancy and auto-scaling for high availability.
  • Event Hub: Standard Tier with SLA-backed uptime guarantee and partition-based scaling.
  • Storage Account: Standard Tier with Zone-Redundant Storage (ZRS) for availability.
  • Databricks: Premium Tier with high availability through multiple mechanisms, including automatic node replacement and deployment across multiple availability zones.

Scalability

  • Azure API Management: Out of scope.
  • Azure Function App: Auto-scaling is enabled only for business-critical use cases.
  • Azure Event Hub: Scales by increasing the number of partitions, with auto-inflate to scale throughput units.
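Partition-based scaling works because Event Hub routes all events that share a partition key to the same partition, preserving per-key ordering. The service's actual hash function is internal; the sketch below only illustrates the property consumers rely on, using an arbitrary SHA-256-based mapping (an assumption, not Event Hub's algorithm):

```python
import hashlib

def assign_partition(partition_key: str, partition_count: int) -> int:
    """Stable key -> partition mapping (illustration only; the real
    Event Hubs service uses its own internal hash)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count

# All events for one device land in the same partition, so their
# relative order is preserved for the consumer reading that partition.
p1 = assign_partition("device-42", 8)
p2 = assign_partition("device-42", 8)
```

Note that increasing the partition count changes the key-to-partition mapping, which is one reason partition counts should be sized up front rather than grown reactively.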

Network

  • API Management: Configured on Virtual Network (VNET) for secure communication, exposed via Application Gateway (WAF)/NVA.
  • Function App: Uses Private Endpoint and VNET integration for secure access.
  • Event Hub Namespace: Uses Service Endpoint for secure communication with Function App.
  • Storage Account: Uses Service Endpoint/Private Endpoint for secure communication.
  • Databricks: Deployed in an isolated network, using a Service Endpoint for secure data access.

Event Hub Capture is a powerful tool for efficiently managing a wide range of batch processing scenarios. It is ideal for handling large volumes of data such as IoT data, application logs, user interactions, or transaction data. This tool allows for the collection, storage, and analysis of data in batches, rather than in real-time. This method supports complex data transformations, long-term storage, and detailed historical analysis.
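Capture closes a window when either the configured size or time threshold is reached, whichever comes first. A minimal client-side sketch of those same semantics, for intuition only (Capture performs this server-side; the class and thresholds here are illustrative):

```python
import time

class BatchWindow:
    """Accumulate events and flush when either the size limit or the
    time limit is reached, mirroring Capture's time/size window."""

    def __init__(self, max_events: int, max_seconds: float, now=time.monotonic):
        self.max_events = max_events
        self.max_seconds = max_seconds
        self._now = now          # injectable clock, eases testing
        self._events = []
        self._opened = now()

    def add(self, event):
        self._events.append(event)

    def should_flush(self) -> bool:
        # Flush on whichever threshold is hit first: count or elapsed time.
        return (len(self._events) >= self.max_events
                or self._now() - self._opened >= self.max_seconds)

    def flush(self) -> list:
        batch, self._events = self._events, []
        self._opened = self._now()   # start the next window
        return batch

# Example: a window that closes after 500 events or 300 seconds.
window = BatchWindow(max_events=500, max_seconds=300)
```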

With these key components and guidelines, you can establish a solid foundation for creating your own batch processing design. For further details or specific inquiries, feel free to contact us or leave a comment.

