
Learning Notes #63 – Change Data Capture. What does it do?

19 January 2025 at 16:22

A few days back I came across the concept of CDC. It works like a notifier for database events. Instead of consumers polling the database, change events are made available in a queue, which can be read by many consumers. In this blog, I try to explain the concepts and types in a theoretical manner.

You run a library. Every day, books are borrowed, returned, or new books are added. What if you wanted to keep a live record of all these activities so you always know the exact state of your library?

This is essentially what Change Data Capture (CDC) does for your databases. It’s a way to track changes (like inserts, updates, or deletions) in your database tables and send them to another system, like a live dashboard or a backup system. (Might be a bad example. Don’t lose hope. Continue …)

CDC is widely used in modern technology to power,

  • Real-Time Analytics: Live dashboards that show sales, user activity, or system performance.
  • Data Synchronization: Keeping multiple databases or microservices in sync.
  • Event-Driven Architectures: Triggering notifications, workflows, or downstream processes based on database changes.
  • Data Pipelines: Streaming changes to data lakes or warehouses for further processing.
  • Backup and Recovery: Incremental backups by capturing changes instead of full data dumps.

It’s central to tools like Debezium (which is built on Kafka Connect and usually paired with Kafka) and to cloud services such as AWS Database Migration Service (DMS) and Azure Data Factory. CDC helps companies move towards real-time, data-driven decision-making.

What is CDC?

CDC stands for Change Data Capture. It’s a technique that listens to a database and captures every change that happens in it. These changes can then be sent to other systems to,

  • Keep data in sync across multiple databases.
  • Power real-time analytics dashboards.
  • Trigger notifications for certain database events.
  • Process data streams in real time.

In short, CDC ensures your data is always up-to-date wherever it’s needed.

Why is CDC Useful?

Imagine you have an online store. Whenever someone,

  • Places an order,
  • Updates their shipping address, or
  • Cancels an order,

you need these changes to be reflected immediately across,

  • The shipping system.
  • The inventory system.
  • The email notification service.

Instead of having all these systems constantly query the database, which is slow and inefficient (and one of the main reasons CDC exists), CDC automatically streams these changes to the relevant systems.

This means,

  1. Real-Time Updates: Systems receive changes shortly after they happen, without waiting for the next scheduled query.
  2. Improved Performance: Your database isn’t overloaded with repeated queries.
  3. Consistency: All systems stay in sync without manual intervention.
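
As a rough illustration of one change reaching several systems, here is a minimal in-process sketch. The `subscribe` and `publish` helpers and the `order_placed` event name are hypothetical, made up for this example; a real setup would use a CDC tool and a message queue rather than a Python dictionary.

```python
from collections import defaultdict

# Hypothetical registry: event type -> list of handler functions
subscribers = defaultdict(list)

def subscribe(event_type, handler):
    """Register a downstream system's handler for one event type."""
    subscribers[event_type].append(handler)

def publish(event_type, payload):
    """Deliver one change event to every handler subscribed to it."""
    for handler in subscribers[event_type]:
        handler(payload)

received = []  # records which system saw which order

subscribe("order_placed", lambda e: received.append(("shipping", e["order_id"])))
subscribe("order_placed", lambda e: received.append(("inventory", e["order_id"])))
subscribe("order_placed", lambda e: received.append(("email", e["order_id"])))

# One change in the "database" reaches all three systems, with no polling
publish("order_placed", {"order_id": 42})
print(received)
```

The point is only the shape: the source publishes a change once, and each consumer reacts to it independently.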

How Does CDC Work?

Note: I haven’t tried all of these yet, but I have a conceptual feel for them.

CDC relies on tracking changes in your database. There are a few ways to do this,

1. Query-Based CDC

This method repeatedly checks the database for changes. For example:

  • Every 5 minutes, it queries the database: “What changed since my last check?”
  • Any new or modified data is identified and processed.

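Drawbacks: This can miss changes that happen between checks (a row updated twice shows only its latest state, and a deleted row simply disappears), and it’s not truly real-time, since a change is only noticed at the next poll.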
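
A minimal sketch of query-based CDC, assuming a hypothetical `orders` table with a `version` column that the application bumps on every write. It uses Python's built-in `sqlite3` only so it runs anywhere; the table, the column, and `fetch_changes` are illustrative names, not part of any CDC tool.

```python
import sqlite3

def fetch_changes(conn, last_seen_version):
    """Return rows changed since last_seen_version, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, status, version FROM orders WHERE version > ? ORDER BY version",
        (last_seen_version,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else last_seen_version
    return rows, new_mark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)")
conn.execute("INSERT INTO orders VALUES (1, 'placed', 1)")
conn.execute("INSERT INTO orders VALUES (2, 'placed', 2)")

# First poll sees both new orders
first_poll, mark = fetch_changes(conn, 0)

# Between polls: order 1 changes twice and order 2 is deleted
conn.execute("UPDATE orders SET status = 'paid', version = 3 WHERE id = 1")
conn.execute("UPDATE orders SET status = 'shipped', version = 4 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")

# Second poll sees only the latest state of order 1
second_poll, mark = fetch_changes(conn, mark)
print(first_poll)
print(second_poll)
```

The second poll never sees the intermediate 'paid' state or the deletion of order 2, which is the kind of missed change this approach is prone to.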

2. Log-Based CDC

Most modern databases keep a log of every change (for example, the write-ahead log in PostgreSQL or the binary log in MySQL). Log-based CDC reads these logs and captures changes as they happen.

Advantages

  • It’s real-time.
  • It’s lightweight since it reads the log instead of running queries against the tables.
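
To get a feel for the idea, here is a toy model in plain Python. It is not how a real database log is read: `change_log`, `write`, `read_from`, and the `seq` field are hypothetical stand-ins for the database's log and for a CDC reader that remembers how far it has read.

```python
# Toy stand-in for a database's transaction log: an append-only list
change_log = []

def write(op, table, row):
    """Simulate the database appending every operation to its log."""
    change_log.append({"seq": len(change_log) + 1, "op": op, "table": table, "row": row})

def read_from(offset):
    """Simulate a CDC reader: return entries after `offset` and the new offset."""
    entries = change_log[offset:]
    return entries, offset + len(entries)

write("insert", "orders", {"id": 1, "status": "placed"})
write("update", "orders", {"id": 1, "status": "paid"})
write("update", "orders", {"id": 1, "status": "shipped"})
write("delete", "orders", {"id": 1})

# The reader sees every operation, in order, including the delete
events, offset = read_from(0)
ops = [e["op"] for e in events]
print(ops)

# Later writes are picked up from the saved offset, without re-reading old entries
write("insert", "orders", {"id": 2, "status": "placed"})
later, offset = read_from(offset)
print(later)
```

Unlike the polling approach, nothing is lost here: every intermediate update and the delete each appear as their own event.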

3. Trigger-Based CDC

In this method, the database uses triggers to log changes into a separate table. Whenever a change occurs, a trigger writes a record of it.

Advantages: Simple to set up.

Drawbacks: Can slow down the database if not carefully managed.
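
A small sketch of the trigger approach using SQLite through Python's built-in `sqlite3` module. The `orders` and `orders_changes` tables and the trigger names are assumptions made for this example; other databases have their own trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);

-- Separate table that holds one row per captured change
CREATE TABLE orders_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT,
    order_id INTEGER,
    status TEXT
);

CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status) VALUES ('insert', NEW.id, NEW.status);
END;

CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status) VALUES ('update', NEW.id, NEW.status);
END;

CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status) VALUES ('delete', OLD.id, OLD.status);
END;
""")

conn.execute("INSERT INTO orders VALUES (1, 'placed')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")

# Every write above also produced a row in the change table
changes = conn.execute(
    "SELECT op, order_id, status FROM orders_changes ORDER BY change_id"
).fetchall()
print(changes)
```

Each write to `orders` now costs an extra write to `orders_changes`, which is where the slowdown mentioned above comes from.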

Tools That Make CDC Easy

Several tools simplify CDC implementation. Some popular ones are,

  1. Debezium: Open-source and widely used for log-based CDC with databases like PostgreSQL, MySQL, and MongoDB.
  2. Striim: A commercial tool for real-time data integration.
  3. AWS Database Migration Service (DMS): A cloud-based CDC service.
  4. StreamSets: Another tool for real-time data movement.

These tools integrate with databases, capture changes, and deliver them to systems like RabbitMQ, Kafka, or cloud storage.

To help visualize CDC, think of,

  • Social Media Feeds: When someone likes or comments on a post, you see the update instantly. This is the kind of experience CDC can power.
  • Bank Notifications: Whenever you make a transaction, your bank app updates instantly. Another place where CDC-style change streaming fits.

In upcoming blogs, I will cover a CDC implementation with Debezium.

TASK – The Botanical Garden and Rose Garden – Python SETS

3 August 2024 at 10:01
  1. Create a set named rose_garden containing different types of roses: "red rose", "white rose", "yellow rose". Print the same.
  2. Add "pink rose" to the rose_garden set. Print the set to confirm the addition.
  3. Remove "yellow rose" from the rose_garden set using the remove() method. Print the set to verify the removal.
  4. Create another set botanical_garden with elements "sunflower", "tulip", and "red rose". Find the union of rose_garden and botanical_garden and print the result.
  5. Find the intersection of rose_garden and botanical_garden and print the common elements.
  6. Find the difference between rose_garden and botanical_garden and print the elements that are only in rose_garden.
  7. Find the symmetric difference between rose_garden and botanical_garden and print the elements unique to each set.
  8. Create a set small_garden containing "red rose", "white rose". Check if small_garden is a subset of rose_garden and print the result.
  9. Check if rose_garden is a superset of small_garden and print the result.
  10. Use the len() function to find the number of elements in the rose_garden set. Print the result.
  11. Use the discard() method to remove "pink rose" from the rose_garden set. Try to discard a non-existent element "blue rose" and observe what happens.
  12. Use the clear() method to remove all elements from the rose_garden set. Print the set to confirm it’s empty.
  13. Make a copy of the botanical_garden set using the copy() method. Add "lily" to the copy and print both sets to see the differences.
  14. Create a frozen set immutable_garden with elements "orchid", "daisy", "red rose". Try to add or remove an element and observe what happens.
  15. Iterate over the botanical_garden set and print each element.
  16. Use set comprehension to create a set even_numbers containing even numbers from 1 to 10.
  17. Given a list of flowers ["rose", "tulip", "rose", "daisy", "tulip"], use a set to remove duplicates and print the unique flowers.
  18. Check if "sunflower" is in the botanical_garden set and print the result.
  19. Use the intersection_update() method to update the botanical_garden set with only the elements found in rose_garden. Print the updated set.
  20. Use the difference_update() method to remove all elements in small_garden from botanical_garden. Print the updated set.
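
A short sketch of a few of the tasks above, to show the shape of the set operations involved. It is a starting point, not a full answer key; the variable names for the results (`union`, `common`, and so on) are my own.

```python
# Tasks 1-3: create, add, remove
rose_garden = {"red rose", "white rose", "yellow rose"}
rose_garden.add("pink rose")
rose_garden.remove("yellow rose")   # remove() raises KeyError if the element is missing

# Tasks 4-7: union, intersection, difference, symmetric difference
botanical_garden = {"sunflower", "tulip", "red rose"}
union = rose_garden | botanical_garden
common = rose_garden & botanical_garden
only_roses = rose_garden - botanical_garden
unique_to_each = rose_garden ^ botanical_garden

# Tasks 8-9: subset and superset checks
small_garden = {"red rose", "white rose"}
is_subset = small_garden.issubset(rose_garden)
is_superset = rose_garden.issuperset(small_garden)

# Task 11: discard() does nothing when the element is absent
rose_garden.discard("blue rose")

# Task 14: a frozenset has no add() or remove() methods
immutable_garden = frozenset({"orchid", "daisy", "red rose"})
try:
    immutable_garden.add("lily")
    frozen_add_failed = False
except AttributeError:
    frozen_add_failed = True

# Tasks 16-17: set comprehension and de-duplicating a list
even_numbers = {n for n in range(1, 11) if n % 2 == 0}
unique_flowers = set(["rose", "tulip", "rose", "daisy", "tulip"])

# Sets are unordered, so sort before printing for a stable display
print(sorted(union))
print(sorted(unique_flowers))
```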
