How to solve data engineering issues caused by rapid scaling

solving data issues

As your business grows you are bound to encounter some growing pains. One of the hurdles caused by growth is, how to handle the ever-growing data that is generated with the increase of records and transactions. At MUNCH:ON, we grew our architecture from a monolith to a scalable microservice architecture that could generate millions of records and transactions every day, far more than was previously generated on a day-to-day basis. Our current Analytics team could no longer keep up with the growing size of incoming data or effectively manage its increasing complexity with traditional analytics tools. Being a start-up, we knew we needed to find a solution that had the capability to continue to grow and evolve as we did. We knew we needed to complete the following in order to achieve this goal:

  1. Establish a big data ecosystem giving us a horizontally scalable architecture.
  2. Add customer journey analytics to observe the user experience within our apps.
  3. Empower our Business and Data Science Team with an effective infrastructure and engineering products.
  4. Use a data engineering infrastructure that has: scalability, flexibility, durability, interoperability, no vendor-lock ins, is highly cost-efficient as well as highly mobile.
  5. Engage and empower all the teams within MUNCH:ON with Data Engineering stack (examples: Kafka Pipelines, and Presto + Drill abstraction layers)
  6. Use a centralized layer of ETL that will automate, run, generate alerts for almost everything running within data engineering clusters.
  7. Enhance visualizations of data within MUNCH:ON covering all areas of descriptive and predictive analytics.

Keeping all these points in mind, our team came up with a robust architecture that is both highly scalable and mobile, meaning you can run it on a laptop without compromising its performance or available features.

The architecture is composed of multiple clusters that are integrated into a big data ecosystem. 

The clusters we have in our architecture are: 

ZooKeeper Cluster 

ZooKeeper, while being a coordination service for distributed systems, is a distributed application on its own. ZooKeeper follows a simple client-server model where clients are nodes (i.e., machines) that make use of the service, and servers are nodes that provide the service. A collection of ZooKeeper servers forms a ZooKeeper ensemble. In our architecture, ZooKeeper helps us keep our scalable architectures of Druid and Kafka in sync. Zookeeper, in this case, is the entity that primarily helps in the coordination between Druid middle managers and Kafka brokers.

Although both Druid and Kafka allow ZooKeeper instances to be initiated with their provided setups, generally it’s the best practice to run a ZooKeeper cluster as an independent entity in production. We have established a quorum of zookeeper nodes, thereby the master or leader in this quorum is dynamically picked upon election. The clients, both Druid middle managers, and Kafka brokers are integrated with ZooKeeper cluster to enable distributed and scalable processing of incoming data. Since ZooKeeper inherently runs in a quorum it is highly fault-tolerant for our given number of nodes (which should be an odd number).

Hadoop Cluster 

Forming the basis of our scalable and distributed file system, providing us with HDFS, the Hadoop Cluster is the central player of the entire big data ecosystem. The data lake at MUNCH:ON is primarily powered by Hadoop, as its size keeps growing with the incoming data, the architecture is scaled horizontally each month depending on the volumes of incoming data streams. The reason for bringing in Hadoop (primarily into HDFS), was its active community, it’s growing ecosystem, its outstanding economical features and massively distributed nature. Hadoop framework handles all the parallel processing of the data at the back-end. This means we don’t have to worry about the complexities of distributed processing while coding. Hadoop framework takes care of how the data gets stored and processed in a distributed manner and provides the base for great technologies like Hive, Druid and Spark. 

Spark Cluster

The Spark cluster empowers memory processing and more importantly complements our big data stack. Although many consider Druid and Spark as competitors when it comes to the processing of data, we have always considered them as entities that complement each other. Spark is highly regarded in the industry due to its parallel in-memory processing across data nodes, however, for OLAP transactions it can be expensive. Spark cluster provides data science teams at MUNCH:ON with fast and scalable machine learning capabilities, where the objective is to train the machine with terabytes of data over many runs. Spark fits in this role perfectly, however, its response time is not the fastest when it comes to OLAP queries. Therefore, the data engineering team primarily uses it to process the incoming data and the store in Hadoop Lake. This data can then be queried or explored using the Druid cluster and the performance is comparable with other OLAP systems. For data engineering at MUNCH:ON both these technologies work together perfectly.

Druid Cluster

Driving scalable big data analytics for MUNCH:ON, the Druid provides an excellent interface for user-end applications. A Druid cluster consists of different types of nodes and each node type is designed to perform a specific set of things. The different types of nodes are loosely coupled so that clusters could support distributed, shared-nothing architecture so that intra-cluster communication failures have minimal impact on availability. As per the best practices we have a few Druid overlords, or coordinators, accompanied by various middle managers and brokers. Query engines sit on top of end nodes in the cluster to allow third-party applications to query the data stored in Druid on HDFS.

At MUNCH:ON, Druid is often used to explore data and generate reports, mostly focusing on time intervals. Concurrent queries could be problematic but are solved by query prioritization. The response time of Druid is excellent for the volumes we are currently storing on HDFS.

Airflow Cluster

Airflow clusters enable scalable ETL flow and robust pipelines at MUNCH:ON. The cluster is made highly scalable using celery workers and therefore allows us to automate very complex pipelines with ease. Generally, we use Airflow to integrate the input and outputs of various technologies. A pipeline for ETL processing might look somewhat like the following (at a very high level):

Designed to incorporate all the various scenarios and cases including the cases of batch and incremental load, the Airflow pipelines play a central role in managing data flows at MUNCH:ON. The failover checks and email/Sack alerts are maintained at all the data sensitive and operation critical steps. Failover checks are also routed back to retries in certain cases.

Presto Cluster

Presto clusters enable SQL on everything “together”, which allows us to generate cross technology queries and operations on data that reside in multiple data stores. Our Data Engineering team uses the Presto Cluster for all the cases where business or data science teams need to consult different sources or databases, and ultimately see results by running a single query. The Presto cluster is a distributed environment, it relies heavily on the coordination between coordinator (the term Presto uses for its master) and workers. The workers are generally deployed in two different ways, one co-located with data as remote engines, and locally within the cluster (suitable for the scenarios where data transfer volumes and cost is low). In either case Presto cluster provides the abstraction layer for over the top user applications for analytics.

ELK Data Logging Cluster

ELK (Elasticsearch-Logstash-Kibana) empowers internal auditing of the clusters within the site. The logs generated by each member of each cluster are parsed by Logstash workers co-located on the nodes and push the data to Elasticsearch. Elasticsearch is a text analytics data store, built upon Lucene, and hence enables us to visualize errors, mark important events in the logs as well as retrieve key information from the logs using Kibana Visualization dashboards. The scope of ELK currently is to provide auditing and data monitoring within growing clusters. The logs generated by different components discussed above are available for operations and technical teams for analysis and optimizations within the whole ecosystem.

Disaster Recovery and System Backup/Migration Strategy

One of the most important parts of any data engineering project is the ability to recover and restore the system in case of an emergency. At MUNCH:ON we protect our data with our synchronization and data backup strategy. Our disaster recovery strategy for the data engineering clusters is designed and maintained keeping in mind two major outcomes:

  • Each member of the data engineering team should be able to sync with the current production status clusters.
  • The restoration or recovery time should be minimal with a standby site (DR site).

 In the future, we plan to establish an entirely independent DR site that will serve the purpose of a standby infrastructure. Currently, this is accomplished by using clever engineering. Each team member is able to run the entire cluster in its most basic form from their desktops or notebooks. Although this might seem surprising we achieved it using continuous syncing of configurations and installations of the production cluster onto our local machines. Therefore each member of the data engineering team is equally equipped to fully restore the production-grade architecture quickly, and only using their personal workstation.

We’re continuously trying to improve our architecture and it’s constantly changing in multiple dimensions such as scale, technologies, and performance, which has been crucial in our ability to keep up with our growth. Innovation has been key to our success at finding a solution that works for our business and is able to grow with us. 

Let us know if this blog has been useful to you. We’d love to hear your thoughts in the comments below! We’re always trying to improve and innovate at MUNCH:ON so feel free to reach out to us with any questions or insights you may have. 

Written by: Furquan Shah & Ashan Aftab

Leave a Reply

Your email address will not be published. Required fields are marked *