Building Scalable Data Ingestion Pipelines at Quizizz
Introduction
At Quizizz, we have been using MongoDB as our primary database and Google BigQuery as our analytics data warehouse. The ability to effectively sync data from MongoDB to BigQuery has played a vital role in driving product and business decisions. Our data ingestion pipeline has been syncing MongoDB data to BigQuery for the past few years. In this article, we will explore the challenges we faced with our previous ingestion pipeline, discuss the decisions we made while building our new data ingestion pipeline, and delve into the details of how this pipeline has empowered us to make data-driven decisions at Quizizz.
Old Data Ingestion Pipeline
Architecture
Our old data ingestion pipeline had the following architecture:
- Events were triggered from code after DB changes (Create, Update, and Delete).
- These events were added to SQS (Amazon’s managed queuing service) so that they could be asynchronously processed without affecting our live flow.
- Consumers would read events from SQS and populate temporary BigQuery (BQ) tables. There was a set of temporary tables for each entity, with one table per 6-hour block.
- Every 24 hours, a cron job would merge the data from the temporary BQ tables for each entity into the main BQ table for that entity (a rough sketch of such a merge is shown below). This batch-processing approach aimed to optimize costs by reducing the frequency of table updates.
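For illustration, the nightly merge could be expressed as a single BigQuery MERGE job. The sketch below is a minimal, hypothetical version using the google-cloud-bigquery client; the dataset, table, and column names are placeholders, not our actual schema.

```python
# Hedged sketch of the old nightly merge job. The dataset, table, and column names
# (analytics.users, analytics.users_tmp_*, id, name, updated_at) are illustrative only.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # assumed GCP project id

MERGE_SQL = """
MERGE `analytics.users` AS main
USING (
  -- collapse the day's 6-hour temporary tables to the latest row per entity
  SELECT id, name, updated_at
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM `analytics.users_tmp_*`
  )
  WHERE rn = 1
) AS tmp
ON main.id = tmp.id
WHEN MATCHED THEN
  UPDATE SET name = tmp.name, updated_at = tmp.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, name, updated_at) VALUES (tmp.id, tmp.name, tmp.updated_at)
"""

def run_nightly_merge() -> None:
    # The old cron invoked something like this once every 24 hours.
    client.query(MERGE_SQL).result()
```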
Challenges with Old Data Ingestion Pipeline
tl;dr: Data Loss and Unreliable Data
Code-triggered Events
In our old pipeline, events were manually triggered from code during create, update, and delete operations. This approach had a major limitation: the risk of missed events due to code bugs, for example a new write path that updated the database but never emitted the corresponding event.
Lack of fault tolerance
Our old pipeline was built on SQS. The setup was fairly simple: there was no dead-letter queue (DLQ) and therefore no robust retry mechanism in case of failures. This lack of fault tolerance left us vulnerable to data loss during transient failures and intermittent network issues. To improve the reliability of the pipeline, we prioritized implementing a comprehensive retry mechanism (covered later).
Manual Schema Updates
Another challenge we encountered was the manual synchronization of schema changes between our primary database and BigQuery. This process required human intervention and was prone to errors, leading to data loss. We aimed to streamline this process and ensure data consistency by automating schema updates or removing the need for schema updates.
Backfill and Evolving Data Requirements
Our MongoDB collections are not indexed on document creation date, so querying by date for backfills would be very expensive. This made it challenging to accommodate evolving data requirements and perform backfill operations in the previous pipeline. Changing business needs have also forced us to change the data type of certain fields in the past, and strict data types for fields in BQ would cause errors when backfilling older events. We needed a more flexible solution that could adapt to changing data needs without disrupting workflows.
Loss of Historical Data
In the previous architecture, we only stored the current state of each entity, disregarding the change events. This lack of historical data limited our ability to perform comprehensive analysis and gain insights into the evolution of our data over time.
Introducing the New Data Ingestion Pipeline
tl;dr: Change Data Capture-based Kafka Connect pipeline
Change Data Capture
To address the limitations of our old pipeline, we decided to use Change Data Capture (CDC) for data ingestion. CDC is a widely adopted standard for syncing data from a primary data source to a data lake, data warehouse, or any other secondary data source.
Debezium (an open-source CDC framework) captures change events (creates, updates, and deletes) from various databases and publishes them to Kafka, where consumers can read and react to those change events. In our case, Debezium captures change events from MongoDB in real time, allowing us to keep our data warehouse, BigQuery, up to date. We used the Debezium Kafka Connector for this.
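As a rough illustration, registering a Debezium MongoDB source connector against the Kafka Connect REST API could look like the sketch below. The connector properties are standard Debezium MongoDB options, but the connection string, topic prefix, and collection list are placeholders rather than our production configuration.

```python
# Minimal sketch: registering a Debezium MongoDB source connector via the
# Kafka Connect REST API. All hosts, names, and collection lists are placeholders.
import json
import requests

CONNECT_URL = "http://kafka-connect:8083/connectors"  # assumed Connect worker URL

source_connector = {
    "name": "mongo-cdc-source",
    "config": {
        "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
        "mongodb.connection.string": "mongodb://mongo-primary:27017/?replicaSet=rs0",
        "topic.prefix": "quizizz",                            # topics become quizizz.<db>.<collection>
        "collection.include.list": "app.users,app.quizzes",   # collections to capture
        "capture.mode": "change_streams_update_full",         # emit the full document on updates
    },
}

resp = requests.post(CONNECT_URL, json=source_connector, timeout=30)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```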
Using CDC helped us solve the problems related to triggering events from code.
Fault tolerance using Kafka
Adopting Debezium meant moving from SQS to Kafka, which gave us much more control over error handling. Consumers could now implement retries with exponential backoff, and if a message could not be processed due to a non-retriable exception, the Kafka Connect task would fail and we would be alerted. This helped us make the pipeline fault-tolerant.
Every component in the pipeline is decoupled and can be scaled independently. Offsets are maintained for every component (including the change stream offset) in Kafka, making it easy to restart any component in case of issues.
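To give a flavour of what this looks like in practice, Kafka Connect exposes connector-level error-handling properties along these lines; the values below are illustrative, not our production tuning.

```python
# Illustrative error-handling properties for a Kafka Connect sink connector.
# "errors.tolerance": "none" makes the task fail (and alert us) on unrecoverable
# records instead of silently dropping them; the timing values are placeholders.
error_handling = {
    "errors.retry.timeout": "300000",        # keep retrying a failed operation for up to 5 minutes
    "errors.retry.delay.max.ms": "60000",    # back off up to 60s between retries
    "errors.tolerance": "none",              # any unrecoverable record fails the task
    "errors.log.enable": "true",             # log the failing record's metadata for debugging
}

# These properties would be merged into a sink connector's "config" section,
# e.g. sink_connector["config"].update(error_handling), before registering it.
```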
Agile Data Storage with JSON State
As part of our new pipeline, we made the strategic decision to store the state data as a JSON string within a field called “state” for each event row. This approach allows us to store the entire state of each entity instead of specific fields at a given time. By doing so, we eliminate the need for backfilling operations as the data is always complete. Furthermore, this approach helps us avoid schema updates, enhancing our agility and reducing disruptions.
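Because the full document lives in a single JSON column, analysts can extract fields at query time instead of depending on a fixed schema. A minimal sketch, with hypothetical dataset, table, and field names, might look like this:

```python
# Sketch: extracting fields from the JSON "state" column at query time.
# Dataset/table/field names (analytics.users_events, ingestion_time, _id, email)
# are illustrative only.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
SELECT
  JSON_VALUE(state, '$._id')   AS user_id,
  JSON_VALUE(state, '$.email') AS email
FROM `analytics.users_events`
ORDER BY ingestion_time DESC
LIMIT 10
"""

for row in client.query(QUERY).result():
    print(row.user_id, row.email)
```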
Streaming Change Events to BigQuery
In our new data ingestion pipeline, we have introduced a more efficient approach. Each collection in MongoDB now has a dedicated events table in BigQuery. The change events are streamed directly to BigQuery and stored as rows in the corresponding table. The new tables follow an append-only model, ensuring an efficient and reliable data flow. This also lets us retain historical data and makes the analytics data available in real time. We used the BigQuery Kafka Connector for this.
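As a sketch of how the sink side could be wired up, the configuration below registers a BigQuery sink connector against the CDC topics. It assumes the open-source BigQuerySinkConnector class, and the project, dataset, topic names, and key file path are placeholders.

```python
# Minimal sketch: registering a BigQuery sink connector that streams change events
# into per-collection, append-only tables. Names and credential paths are placeholders.
import requests

CONNECT_URL = "http://kafka-connect:8083/connectors"  # assumed Connect worker URL

sink_connector = {
    "name": "bigquery-sink",
    "config": {
        "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
        "topics": "quizizz.app.users,quizizz.app.quizzes",  # CDC topics to sink
        "project": "my-gcp-project",
        "defaultDataset": "analytics",
        "keyfile": "/secrets/bq-service-account.json",
        "autoCreateTables": "true",          # one events table per source collection
        "sanitizeTopics": "true",            # topic names become valid BQ table names
    },
}

requests.post(CONNECT_URL, json=sink_connector, timeout=30).raise_for_status()
```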
Easier onboarding of new tables
In the new pipeline, onboarding a new table only requires adding the collection name to the connector configuration, and data starts flowing almost immediately. Backfilling data for any table only requires adding an entry to a signal table, and the rest is handled automatically.
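As an example of the backfill flow, a signal can be issued by inserting a document into the signalling collection. The sketch below assumes Debezium-style incremental snapshot signalling is enabled on the source connector; the database and collection names are hypothetical.

```python
# Sketch: requesting a backfill (incremental snapshot) by inserting a signal document.
# The signalling collection and target collection names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-primary:27017/?replicaSet=rs0")
signals = client["app"]["debezium_signals"]    # hypothetical signal collection

signals.insert_one({
    "type": "execute-snapshot",                # ask Debezium to re-read existing data
    "data": {
        "data-collections": ["app.quizzes"],   # collection(s) to backfill
        "type": "incremental",                 # stream the snapshot alongside live changes
    },
})
```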
Other Use Cases
Given the reliable and fault-tolerant nature of Kafka and Kafka Connect as well as the usefulness of CDC events, this change opens up possibilities for expanding use cases beyond syncing data to BigQuery. Some potential use cases that can benefit from a similar pipeline include:
- Syncing Data to Elasticsearch: With our pipeline’s scalability and reliability, we can seamlessly sync data from MongoDB to Elasticsearch by adding a new sink connector for ES (sketched after this list). This eliminates the sync issues often encountered with the code-based triggers in our current Elasticsearch ingestion pipeline.
- Triggering Real-time Campaigns: The flexibility of our data ingestion pipeline allows us to sync data to platforms like Braze or other similar systems. By capturing and streaming change events, we can trigger real-time emails or campaigns based on user interactions and events, delivering personalized experiences to our users.
- Building a Data Lake with S3: Our new pipeline sets the foundation for building a data lake by syncing data to Amazon S3. Storing the complete state as a JSON string enables seamless integration of MongoDB change events into our data lake. This empowers our machine learning pipelines, leveraging the scalability and processing capabilities of S3.
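For instance, syncing the same CDC topics to Elasticsearch could be as small as registering one more sink connector. The sketch below assumes the Confluent Elasticsearch sink connector and uses placeholder hosts and topics.

```python
# Hypothetical sketch: reusing the existing CDC topics for an Elasticsearch sink.
# Hosts, topics, and connector name are placeholders.
import requests

CONNECT_URL = "http://kafka-connect:8083/connectors"  # assumed Connect worker URL

es_sink = {
    "name": "elasticsearch-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "quizizz.app.quizzes",
        "connection.url": "http://elasticsearch:9200",
        "key.ignore": "false",                # use the record key as the document id
        "write.method": "upsert",             # keep the index in step with the source document
    },
}

requests.post(CONNECT_URL, json=es_sink, timeout=30).raise_for_status()
```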
Conclusion
Building a scalable and agile data ingestion pipeline is crucial for organizations aiming to make data-driven decisions effectively. At Quizizz, we embarked on a journey to transform our pipeline, overcome challenges, and leverage industry-standard solutions.
Our new data ingestion pipeline lets us capture and sync change events from MongoDB in real time, ensuring data accuracy and eliminating missed events. It is playing a pivotal role in enabling the data-driven decisions that drive our product and business growth.
To know more about the technical challenges that we are solving at Quizizz, check out our engineering blog.