
Learning Notes #63 – Change Data Capture. What does it do?

19 January 2025 at 16:22

A few days back, I came across the concept of CDC, which acts like a notifier of database events. Instead of polling, changes are published as events to a queue, which many consumers can then read. In this blog, I try to explain the concepts and types in a theoretical manner.

You run a library. Every day, books are borrowed, returned, or new books are added. What if you wanted to keep a live record of all these activities so you always know the exact state of your library?

This is essentially what Change Data Capture (CDC) does for your databases. It’s a way to track changes (like inserts, updates, or deletions) in your database tables and send them to another system, like a live dashboard or a backup system. (Might be a bad example. Don’t lose hope. Continue …)

CDC is widely used in modern technology to power:

  • Real-Time Analytics: Live dashboards that show sales, user activity, or system performance.
  • Data Synchronization: Keeping multiple databases or microservices in sync.
  • Event-Driven Architectures: Triggering notifications, workflows, or downstream processes based on database changes.
  • Data Pipelines: Streaming changes to data lakes or warehouses for further processing.
  • Backup and Recovery: Incremental backups by capturing changes instead of full data dumps.

It’s a critical part of tools like Debezium, Kafka, and cloud services such as AWS Database Migration Service (DMS) and Azure Data Factory. CDC enables companies to move towards real-time data-driven decision-making.

What is CDC?

CDC stands for Change Data Capture. It’s a technique that listens to a database and captures every change that happens in it. These changes can then be sent to other systems to:

  • Keep data in sync across multiple databases.
  • Power real-time analytics dashboards.
  • Trigger notifications for certain database events.
  • Process data streams in real time.

In short, CDC ensures your data is always up-to-date wherever it’s needed.

Why is CDC Useful?

Imagine you have an online store. Whenever someone:

  • Places an order,
  • Updates their shipping address, or
  • Cancels an order,

you need these changes to be reflected immediately across:

  • The shipping system.
  • The inventory system.
  • The email notification service.

Instead of having all these systems constantly query the database (which is slow and inefficient, and one of the main problems CDC addresses), CDC automatically streams these changes to the relevant systems.

This means:

  1. Real-Time Updates: Systems receive changes instantly.
  2. Improved Performance: Your database isn’t overloaded with repeated queries.
  3. Consistency: All systems stay in sync without manual intervention.

How Does CDC Work?

Note: I haven’t tried all of these myself yet, but I have a conceptual picture of how they work.

CDC relies on tracking changes in your database. There are a few ways to do this:

1. Query-Based CDC

This method repeatedly checks the database for changes. For example:

  • Every 5 minutes, it queries the database: “What changed since my last check?”
  • Any new or modified data is identified and processed.

Drawbacks: This can miss changes if the timing isn’t right, and it’s not truly real-time, since it is still just polling on an interval.
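
To make this concrete, here is a minimal polling sketch in Python. The `orders` table, its `updated_at` column, and the database file are assumptions for illustration only.

```python
import sqlite3
import time

# Hypothetical "orders" table with an "updated_at" ISO-timestamp column.
conn = sqlite3.connect("shop.db")
last_check = "1970-01-01T00:00:00"

while True:
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ?",
        (last_check,),
    ).fetchall()
    for row in rows:
        print("change detected:", row)  # hand off to downstream systems here
    if rows:
        # Advance the watermark. A row committed later with an older
        # timestamp would be missed, which is exactly the drawback above.
        last_check = max(row[2] for row in rows)
    time.sleep(300)  # check every 5 minutes
```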

2. Log-Based CDC

Most modern databases (like PostgreSQL or MySQL) keep logs of every operation. Log-based CDC listens to these logs and captures changes as they happen.

Advantages

  • It’s real-time.
  • It’s lightweight since it doesn’t query the database directly.
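
A rough sketch of what this looks like against PostgreSQL, using psycopg2’s logical replication support. The slot name, the wal2json plugin, and the connection details are all assumptions, and the replication slot must be created before this runs.

```python
import psycopg2
import psycopg2.extras

# The slot must exist beforehand, e.g.:
#   SELECT pg_create_logical_replication_slot('cdc_slot', 'wal2json');
conn = psycopg2.connect(
    "dbname=shop user=cdc_user",  # hypothetical connection string
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(slot_name="cdc_slot", decode=True)

def consume(msg):
    print("WAL change:", msg.payload)  # JSON describing the row change
    msg.cursor.send_feedback(flush_lsn=msg.data_start)  # acknowledge progress

cur.consume_stream(consume)  # blocks, streaming changes as they happen
```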

3. Trigger-Based CDC

In this method, the database uses triggers to log changes into a separate table. Whenever a change occurs, a trigger writes a record of it.

Advantages: Simple to set up.

Drawbacks: Can slow down the database if not carefully managed.
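
A small self-contained sketch using SQLite, where a trigger copies every update on a hypothetical `books` table into a separate change table (continuing the library example from earlier):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, status TEXT);

-- Separate table that the trigger logs changes into.
CREATE TABLE books_changes (
    change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    book_id    INTEGER,
    operation  TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Fires on every update and writes a record of it.
CREATE TRIGGER books_update_audit AFTER UPDATE ON books
BEGIN
    INSERT INTO books_changes (book_id, operation) VALUES (NEW.id, 'UPDATE');
END;
""")

conn.execute("INSERT INTO books (title, status) VALUES ('Dune', 'available')")
conn.execute("UPDATE books SET status = 'borrowed' WHERE id = 1")
print(conn.execute("SELECT * FROM books_changes").fetchall())
```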

Tools That Make CDC Easy

Several tools simplify CDC implementation. Some popular ones are:

  1. Debezium: Open-source and widely used for log-based CDC with databases like PostgreSQL, MySQL, and MongoDB.
  2. Striim: A commercial tool for real-time data integration.
  3. AWS Database Migration Service (DMS): A cloud-based CDC service.
  4. StreamSets: Another tool for real-time data movement.

These tools integrate with databases, capture changes, and deliver them to systems like RabbitMQ, Kafka, or cloud storage.

To help visualize CDC, think of:

  • Social Media Feeds: When someone likes or comments on a post, you see the update instantly. This is CDC in action.
  • Bank Notifications: Whenever you make a transaction, your bank app updates instantly. Another example of CDC.

In upcoming blogs, I will walk through a Debezium-based CDC implementation.

Learning Notes #48 – Common Pitfalls in Event Driven Architecture

8 January 2025 at 15:04

Today, I came across Raul Junco’s post on mistakes in Event-Driven Architecture – https://www.linkedin.com/posts/raul-junco_after-years-building-event-driven-systems-activity-7278770394046631936-zu3-?utm_source=share&utm_medium=member_desktop. In this blog, I am highlighting the same points for future reference.

Event-driven architectures are awesome, but they come with their own set of challenges. Missteps can lead to unreliable systems, inconsistent data, and frustrated users. Let’s explore some of the most common pitfalls and how to address them effectively.

1. Duplication

Idempotent APIs – https://parottasalna.com/2025/01/08/learning-notes-47-idempotent-post-requests/

Events often get re-delivered due to retries or system failures. Without proper handling, duplicate events can:

  • Charge a customer twice for the same transaction: Imagine a scenario where a payment service retries a payment event after a temporary network glitch, resulting in a duplicate charge.
  • Cause duplicate inventory updates: For example, an e-commerce platform might update stock levels twice for a single order, leading to overestimating available stock.
  • Create inconsistent or broken system states: Duplicates can cascade through downstream systems, introducing mismatched or erroneous data.

Solution:

  • Assign unique IDs: Ensure every event has a globally unique identifier. Consumers can use these IDs to detect and discard duplicates.
  • Design idempotent processing: Structure your operations so they produce the same outcome even when executed multiple times. For instance, an API updating inventory could always set stock levels to a specific value rather than incrementing or decrementing.
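
A minimal sketch of consumer-side deduplication in Python, assuming every event carries a globally unique `event_id` (all names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")

def handle(event: dict) -> None:
    try:
        # The primary key makes this insert fail for a duplicate event.
        conn.execute(
            "INSERT INTO processed_events VALUES (?)", (event["event_id"],)
        )
    except sqlite3.IntegrityError:
        return  # re-delivered event: skip it
    print(f"charging customer once for {event['event_id']}")

handle({"event_id": "evt-123", "amount": 50})
handle({"event_id": "evt-123", "amount": 50})  # duplicate, ignored
```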

2. Not Guaranteeing Order

Events can arrive out of order when distributed across partitions or queues. This can lead to:

  • Processing a refund before the payment: If a refund event is processed before the corresponding payment event, the system might show a negative balance or fail to reconcile properly.
  • Breaking logic that relies on correct sequence: Certain workflows, such as assembling logs or transactional data, depend on a strict event order to function correctly.

Solution:

  • Use brokers with ordering guarantees: Message brokers like Apache Kafka support partition-level ordering. Design your topics and partitions to align with entities requiring ordered processing (e.g., user or account ID).
  • Add sequence numbers or timestamps: Include metadata in events to indicate their position in a sequence. Consumers can use this data to reorder events if necessary, ensuring logical consistency.
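
A small sketch of the sequence-number idea: a consumer that buffers out-of-order events and releases them only in order. It assumes each event carries a unique, gapless per-entity `seq` starting at 1.

```python
import heapq

class OrderedConsumer:
    def __init__(self) -> None:
        self.next_seq = 1
        self.buffer: list = []  # min-heap of (seq, event)

    def receive(self, event: dict) -> None:
        heapq.heappush(self.buffer, (event["seq"], event))
        # Release events only once the next expected sequence number arrives.
        while self.buffer and self.buffer[0][0] == self.next_seq:
            _, ready = heapq.heappop(self.buffer)
            print("processing:", ready)
            self.next_seq += 1

consumer = OrderedConsumer()
consumer.receive({"seq": 2, "type": "refund"})   # held back
consumer.receive({"seq": 1, "type": "payment"})  # releases both, in order
```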

3. The Dual Write Problem

Outbox Pattern: https://parottasalna.com/2025/01/03/learning-notes-31-outbox-pattern-cloud-pattern/

When writing to a database and publishing an event, one might succeed while the other fails. This can:

  • Lose events: If the event is not published after the database write, downstream systems might remain unaware of critical changes, such as a new order or a status update.
  • Cause mismatched states: For instance, a transaction might be logged in a database but not propagated to analytical or monitoring systems, creating inconsistencies.

Solution:

  • Use the Transactional Outbox Pattern: In this pattern, events are written to an “outbox” table within the same database transaction as the main data write. A separate process then reads from the outbox and publishes events reliably.
  • Adopt Change Data Capture (CDC) tools: CDC tools like Debezium can monitor database changes and publish them as events automatically, ensuring no changes are missed.
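
A minimal sketch of the Transactional Outbox Pattern using SQLite. The business row and the outbox row commit in one transaction, and a separate relay loop publishes whatever is pending (table names and payload shape are illustrative).

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                     payload TEXT, published INTEGER DEFAULT 0);
""")

# Both writes commit together: either the order AND its event exist, or neither.
with conn:
    conn.execute("INSERT INTO orders (item) VALUES (?)", ("book",))
    conn.execute(
        "INSERT INTO outbox (payload) VALUES (?)",
        (json.dumps({"type": "order_created", "item": "book"}),),
    )

# A separate relay process would poll the outbox and publish reliably.
pending = conn.execute(
    "SELECT id, payload FROM outbox WHERE published = 0"
).fetchall()
for row_id, payload in pending:
    print("publishing:", payload)  # e.g. produce to Kafka/RabbitMQ here
    conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
conn.commit()
```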

4. Non-Backward-Compatible Changes

Changing event schemas without considering existing consumers can break systems. For example:

  • Removing a field: A consumer relying on this field might encounter null values or fail altogether.
  • Renaming or changing field types: This can lead to deserialization errors or misinterpretation of data.

Solution:

  • Maintain versioned schemas: Introduce new schema versions incrementally and ensure consumers can continue using older versions during the transition.
  • Use schema evolution-friendly formats: Formats like Avro or Protobuf natively support schema evolution, allowing you to add fields or make other non-breaking changes easily.
  • Add adapters for compatibility: Build adapters or translators that transform events from new schemas to older formats, ensuring backward compatibility for legacy systems.
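
A tiny sketch of such an adapter, translating a hypothetical v2 event back to its v1 shape so legacy consumers keep working (all field names are invented for illustration):

```python
def adapt_v2_to_v1(event: dict) -> dict:
    """Translate a v2 event into the older v1 shape."""
    return {
        "schema_version": 1,
        "customer": event["customer_name"],  # field was renamed in v2
        "amount": event["amount"],
        # "currency" was added in v2; v1 consumers never saw it, so drop it.
    }

v2_event = {
    "schema_version": 2,
    "customer_name": "Asha",
    "amount": 250,
    "currency": "INR",
}
print(adapt_v2_to_v1(v2_event))
```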
