Analytics has come a long way in recent years, with transformational developments widening its impact beyond internal stakeholders and putting it into the hands of external users. This brings a range of challenges. When building analytics applications for customers, for example, one of the first considerations is the choice of database backend.
While the main options will probably include PostgreSQL, MySQL, or even extending a data warehouse beyond its core BI dashboards and reports, it's important to keep in mind that analytics for external users can be revenue-impacting. As a result, choosing the right tool for the job is essential if organizations are to deliver a high-quality user experience.
For many users, the most frustrating element of their analytics experience is performance, and in particular, the wait-state of queries in a processing queue. It's one thing to have an internal business analyst wait a few seconds or even several minutes for a report to load; it's entirely different when analytics functionality is being offered to external users whose tolerance of processing delays will be much lower.
This problem generally comes down to three factors: the amount of data to analyze, the processing power of the database, and the number of users and API calls. Together, these determine how well the database can keep up with the application.
There are a variety of approaches to building an interactive data experience on a generic OLAP database when there's a lot of data. The issue? They come at a price. Precomputing all the queries makes the architecture expensive and rigid, aggregating the data first limits the insights available, and restricting the analysis to recent events doesn't give users the complete picture. Each of these approaches involves a compromise.
There is, however, a no-compromise approach that can deliver an optimized architecture and data format built for interactivity at scale. This comes in the form of Apache Druid — a high-performance, real-time analytics database to power analytics applications for any number of users.
Druid employs a distributed, elastic architecture that prefetches data from a shared data layer into a cluster of data servers that can scale out almost without limit. This architecture delivers faster performance than a decoupled query engine like a cloud data warehouse, because there's no data to move at query time, and more scalability than a scale-up database like PostgreSQL or MySQL.
Furthermore, Druid provides automatic, multi-level indexing that is built into the data format to drive more queries per core. This goes beyond the typical OLAP columnar format with the addition of a global index, data dictionary, and bitmap index. As a result, each query spends fewer CPU cycles scanning data it doesn't need, so results come back faster.
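To make the data dictionary and bitmap index idea concrete, here is a minimal Python sketch. It is a simplified illustration of the general technique, not Druid's actual storage format or API: the function names and the sample `country` column are hypothetical, and plain Python integers stand in for the compressed bitmaps a real engine would use.

```python
from collections import defaultdict


def build_column_index(values):
    """Dictionary-encode a string column and build one bitmap per distinct value.

    Illustrative only: a Python int is used as an uncompressed bitmap,
    where bit N set means "row N holds this value".
    """
    # Data dictionary: each distinct string maps to a small integer code.
    dictionary = {value: code for code, value in enumerate(sorted(set(values)))}
    # The column itself is stored as compact integer codes.
    encoded = [dictionary[value] for value in values]
    # Bitmap index: for each code, mark the rows that contain it.
    bitmaps = defaultdict(int)
    for row, code in enumerate(encoded):
        bitmaps[code] |= 1 << row
    return dictionary, encoded, dict(bitmaps)


def rows_matching(dictionary, bitmaps, *wanted):
    """Return the row numbers whose value is any of `wanted`, without scanning the column."""
    mask = 0
    for value in wanted:
        code = dictionary.get(value)
        if code is not None:
            mask |= bitmaps[code]  # OR the bitmaps together
    return [row for row in range(mask.bit_length()) if (mask >> row) & 1]


# Hypothetical sample column.
country = ["US", "DE", "US", "FR", "DE", "US"]
dictionary, encoded, bitmaps = build_column_index(country)
us_rows = rows_matching(dictionary, bitmaps, "US")
eu_rows = rows_matching(dictionary, bitmaps, "DE", "FR")
```

A filter such as "country is US" is answered by looking up one bitmap rather than comparing every row, which is the sense in which indexing built into the data format saves CPU work per query.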
High Availability Should Be a High Priority
To illustrate the value of these capabilities, consider this scenario: if a dev team is building a backend for internal reporting, does it really matter if it goes down for a few minutes or even longer?
For most, the answer is probably not, which explains why there has always been tolerance for unplanned downtime and maintenance windows in classical OLAP databases and data warehouses.
But what if the dev team then needs to build an external analytics application that customers will use?
Any outage here can impact revenue, with serious knock-on effects on everything from team resources to customer satisfaction. As a result, resilience, meaning both high availability and data durability, must be a priority when choosing a database for external analytics applications.
Delivering resilience means asking some important design questions. Can you protect against a node failure or a cluster-wide failure?
How bad would it be to lose data?
What work is involved to protect your app and your data?
The legacy approach to achieving greater resiliency is to replicate nodes and to remember to take backups. But when dev teams are building apps for customers, the sensitivity to data loss is much higher, and as a result, occasional backups aren't fit for purpose.
In contrast, Druid's core architecture is designed to withstand failures without losing data (even recent events) by implementing high availability (HA) and durability based on automatic, multi-level replication with shared data in S3/object storage. This provides not only the HA properties dev teams expect but also a form of continuous backup that automatically protects and restores the latest state of the database even if an entire cluster is lost.
Building a database that delivers high concurrency means striking the right balance between CPU usage, scalability, and cost. Historically, addressing concurrency was a matter of allocating more hardware to the challenge, and while adding more CPUs certainly allows organizations to run more queries, it can easily become very expensive.
In contrast, databases like Apache Druid are built with an optimized storage format and query engine that drive down CPU usage. By reading only the data it needs, the same infrastructure can serve more queries in the same timespan.
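The relationship between CPU usage per query and concurrency can be sketched with some back-of-the-envelope arithmetic. The Python below is a deliberately simplified model, assuming queries are CPU-bound and spread evenly across cores, with no queuing or I/O effects. The function names and all figures are illustrative assumptions, not measurements of Druid or any other database.

```python
def max_queries_per_second(cores: int, cpu_seconds_per_query: float) -> float:
    """Upper bound on sustained throughput if every query is purely CPU-bound."""
    return cores / cpu_seconds_per_query


def cores_needed(target_qps: float, cpu_seconds_per_query: float) -> float:
    """Cores required to sustain a target query rate under the same assumption."""
    return target_qps * cpu_seconds_per_query


# Illustrative numbers only: the same 64-core cluster, before and after
# cutting the CPU time each query consumes.
baseline_qps = max_queries_per_second(cores=64, cpu_seconds_per_query=0.5)
optimized_qps = max_queries_per_second(cores=64, cpu_seconds_per_query=0.125)

# Hardware needed to serve 1,000 queries per second in each case.
baseline_cores = cores_needed(target_qps=1000, cpu_seconds_per_query=0.5)
optimized_cores = cores_needed(target_qps=1000, cpu_seconds_per_query=0.125)
```

Under this model, cutting the CPU time per query to a quarter either quadruples the query rate on fixed hardware or cuts the cores needed for a fixed rate to a quarter, which is why reducing work per query tends to be cheaper than adding CPUs.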
This is also an important consideration when building external applications that will deliver the performance and resilience required both today and in the future. For those organizations focused on customer retention, being able to scale their infrastructure is key to remaining competitive.