ChronDB was born from the idea of combining the reliability and native version control of Git with the flexibility and power of a modern database. While our journey began with basic key-value operations, we quickly evolved to support complete SQL queries, including JOINs and full-text search. However, a critical question remained: how do we ensure adequate performance when using Git as the underlying storage mechanism?
In this post, I will explore our recent benchmarking and optimization efforts for ChronDB’s SQL protocol, demonstrating how we are building a robust and performant database on the Git architecture.
“A well-designed benchmark is one that simulates as closely as possible the behavior of the system in a production environment.” — Bert Scalzo & Kevin Kline, Database Benchmarking and Stress Testing [1]
Git was not originally designed to be a database. It was created for source code version control, with optimizations for that specific purpose. This creates some intrinsic challenges when building a database on top of it:
- Each document is stored as a JSON file within the Git repository, so every read and write goes through Git’s object model rather than a purpose-built storage engine.

To face these challenges, we implemented a Lucene indexing layer and developed a comprehensive benchmarking infrastructure to measure, understand, and optimize performance.
Scalzo and Kline emphasize that “unconventional database systems often require customized benchmarking methodologies, as standard industry benchmarks may not adequately capture their unique strengths and weaknesses” [1]. This insight was crucial for developing a specific testing methodology for ChronDB’s Git-based architecture.
Our benchmark infrastructure was designed with several objectives: measuring each type of operation in isolation, identifying bottlenecks, and tracking performance over time. The benchmark code includes a dedicated timing test for each supported SQL operation.
We followed the principle advocated by Scalzo and Kline that “a benchmark should focus on isolating and measuring specific operations to identify bottlenecks, not just the overall system performance” [1]. Therefore, we created distinct tests for each type of SQL operation supported by ChronDB.
To simulate real-world conditions, we generated substantial test datasets of synthetic JSON documents.
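As an illustration of how such datasets can be produced, here is a minimal sketch; the field names and document shape are assumptions for the example, not ChronDB’s actual test schema.

```python
import json
import random
import string

def random_doc(doc_id: int) -> dict:
    # Field names here are illustrative, not ChronDB's actual test schema.
    return {
        "id": doc_id,
        "name": "".join(random.choices(string.ascii_lowercase, k=8)),
        "age": random.randint(18, 90),
        "active": random.random() < 0.5,
    }

def generate_dataset(n: int) -> list[str]:
    # One JSON string per document, mirroring one file per record in Git.
    return [json.dumps(random_doc(i)) for i in range(n)]

docs = generate_dataset(1_000)
print(len(docs))  # 1000
```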
Our benchmark tests include:

- Paginated reads: `SELECT * FROM table LIMIT n`
- Filtered reads: `SELECT * FROM table WHERE condition`
- JOIN queries (INNER and LEFT)
- INSERT statements
Instead of focusing only on raw time, we adopted TPS (Transactions Per Second) as our main metric, allowing a more objective comparison between different types of operations.
As Scalzo and Kline recommend, “it is crucial to measure both latency and throughput, as systems may optimize one at the expense of the other” [1]. Our TPS measurement approach allows us to evaluate the processing capacity of the system under different workloads.
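To make the TPS approach concrete, the measurement loop can be as small as the sketch below; the workload passed in is a stand-in lambda, not a real ChronDB query.

```python
import time

def measure_tps(operation, runs: int = 50):
    # Time `runs` executions and derive both latency and throughput,
    # since optimizing one can come at the expense of the other.
    start = time.perf_counter()
    for _ in range(runs):
        operation()
    elapsed = time.perf_counter() - start
    avg_ms = (elapsed / runs) * 1000
    tps = runs / elapsed
    return avg_ms, tps

# Stand-in workload; in the real harness this would issue a SQL query.
avg_ms, tps = measure_tps(lambda: sum(range(10_000)))
print(f"avg={avg_ms:.3f} ms, throughput={tps:.0f} TPS")
```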
Typical benchmark results in our development environment:
| Operation | Average time (ms) | TPS |
|---|---|---|
| SELECT 1000 records | 115 | 8,695 |
| SELECT with WHERE | 203 | 492 |
| INNER JOIN | 587 | 170 |
| LEFT JOIN | 432 | 231 |
| INSERT | 73 | 137 |
These numbers provide a basis for evaluating performance improvements and identifying bottlenecks.
Based on the benchmark results, we implemented several optimizations:
The most impactful optimization was in the Lucene integration.
Scalzo and Kline highlight that “efficient indexing is often the most significant factor in determining query performance on large datasets” [1]. This observation guided our focus on optimizing the indexing layer.
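As a drastically simplified stand-in for what the Lucene layer provides, the following sketch shows why an index changes the cost profile: a WHERE lookup becomes a hash lookup instead of a scan over every JSON file. The class and field names are illustrative only.

```python
from collections import defaultdict

class FieldIndex:
    # Toy inverted index: field -> value -> document ids. A stand-in for
    # Lucene, shown only to illustrate why lookups beat full scans.
    def __init__(self):
        self.index = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id: str, doc: dict):
        for field, value in doc.items():
            self.index[field][value].add(doc_id)

    def lookup(self, field: str, value) -> set:
        # O(1) hash lookups instead of reading every file in the repo.
        return set(self.index[field].get(value, set()))

idx = FieldIndex()
idx.add("users/1", {"name": "ada", "active": True})
idx.add("users/2", {"name": "alan", "active": True})
idx.add("users/3", {"name": "ada", "active": False})
print(sorted(idx.lookup("name", "ada")))  # ['users/1', 'users/3']
```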
We enhanced the full-text search to support advanced operators like `to_tsquery`, inspired by PostgreSQL.
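To illustrate the semantics of such operators, here is a sketch of a `to_tsquery`-style evaluator supporting `&`, `|`, `!`, and parentheses; it is a toy for exposition (no stemming, prefix, or phrase operators) and not ChronDB’s implementation.

```python
import re

def ts_match(query: str, text: str) -> bool:
    # Recursive-descent evaluation of a tsquery-like expression against
    # the set of words in `text`. Precedence: ! binds tighter than &,
    # which binds tighter than |, matching PostgreSQL's tsquery rules.
    words = set(re.findall(r"\w+", text.lower()))
    tokens = re.findall(r"\w+|[&|!()]", query.lower())
    pos = 0

    def parse_or() -> bool:
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            right = parse_and()
            result = result or right
        return result

    def parse_and() -> bool:
        nonlocal pos
        result = parse_not()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            right = parse_not()
            result = result and right
        return result

    def parse_not() -> bool:
        nonlocal pos
        if tokens[pos] == "!":
            pos += 1
            return not parse_not()
        if tokens[pos] == "(":
            pos += 1
            inner = parse_or()
            pos += 1  # consume ')'
            return inner
        word = tokens[pos]
        pos += 1
        return word in words

    return parse_or()

print(ts_match("fat & (rat | cat)", "the fat cat sat"))  # True
print(ts_match("fat & !cat", "the fat cat sat"))         # False
```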
We significantly improved JOIN operations, most notably with a hash-join implementation.
In their book, Scalzo and Kline emphasize that “JOINs are often the most expensive operations in a database, and their optimization should be prioritized in the benchmarking process” [1]. Our hash-join implementation was directly inspired by the techniques described in the book for optimizing joins in memory-constrained environments.
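The general technique can be sketched as follows; this is an illustration of hash joins as such, not ChronDB’s actual code.

```python
from collections import defaultdict

def hash_join(left_rows, right_rows, left_key, right_key):
    # Build phase: hash the right side once -- O(m).
    table = defaultdict(list)
    for right in right_rows:
        table[right[right_key]].append(right)
    # Probe phase: one pass over the left side -- O(n), versus the
    # nested-loop join's O(n * m) repeated scans.
    joined = []
    for left in left_rows:
        for right in table.get(left[left_key], []):
            joined.append({**left, **right})
    return joined

users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "alan"}]
orders = [{"user_id": 1, "total": 10},
          {"user_id": 1, "total": 5},
          {"user_id": 3, "total": 7}]
print(hash_join(users, orders, "id", "user_id"))
```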
We improved data access with a series of I/O optimizations.
As Scalzo and Kline point out, “I/O optimizations often produce the greatest performance gains in disk-based systems” [1]. This is particularly relevant for ChronDB, given its file-based storage in the Git repository.
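One common I/O optimization of this kind is caching parsed documents so repeated queries skip disk reads entirely. The sketch below illustrates the idea; it is a hypothetical class for exposition, not a description of ChronDB’s actual caching layer.

```python
import json
import os
import tempfile
from pathlib import Path

class DocumentCache:
    # Illustrative only: keep parsed documents in memory so repeated
    # queries avoid re-reading and re-parsing the same file from disk.
    def __init__(self):
        self.cache: dict = {}
        self.hits = 0
        self.misses = 0

    def load(self, path: str) -> dict:
        if path in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[path] = json.loads(Path(path).read_text())
        return self.cache[path]

# Demo with a temporary file standing in for a document in the repository:
fd, tmp_path = tempfile.mkstemp(suffix=".json")
os.write(fd, b'{"id": 1, "name": "ada"}')
os.close(fd)
cache = DocumentCache()
first = cache.load(tmp_path)   # disk read + parse
second = cache.load(tmp_path)  # served from memory
print(cache.hits, cache.misses)  # 1 1
os.unlink(tmp_path)
```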
It’s important to contextualize our performance in relation to traditional databases. While ChronDB doesn’t achieve the same raw query speed as systems like PostgreSQL or MongoDB for simple operations, we offer unique benefits: native versioning, complete auditability, and a full history of every record.
Our goal is not to outperform traditional databases in raw performance, but to offer acceptable performance while maintaining Git benefits.
“Often, the most useful benchmark is not one that compares your system with others, but one that measures your own system against the specific requirements of your users.” — Scalzo & Kline [1]
In addition to ad-hoc benchmarks during development, we implemented continuous benchmarks via GitHub Actions.
This allows us to track performance over time, ensuring that new features don’t degrade performance.
Scalzo and Kline argue that “continuous benchmarking is an essential practice for evolving systems, as it allows identifying performance regressions shortly after their introduction” [1]. Our integration of automated benchmarks with GitHub Actions implements exactly this principle.
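For readers who want to replicate the setup, a workflow of this kind can look like the sketch below; the job name, script path, and schedule are assumptions for illustration, not ChronDB’s actual configuration.

```yaml
# Illustrative sketch -- paths and the benchmark entry point are assumed.
name: benchmarks
on:
  push:
    branches: [main]
  schedule:
    - cron: "0 3 * * *"  # nightly run to catch gradual regressions
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run SQL benchmarks
        run: ./scripts/run-benchmarks.sh  # hypothetical entry point
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: benchmark-results/
```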
Our ChronDB optimization journey taught us several valuable lessons.
These lessons align with what Scalzo and Kline call the “benchmark lifecycle” — an iterative process of measurement, analysis, optimization, and new measurement [1].
ChronDB was designed with expandability in mind from the beginning, and its modular architecture lets components such as the Lucene indexing layer and the SQL protocol evolve independently.
In the near future, we plan to extend this benchmarking and optimization work. An important recommendation from Scalzo and Kline that we intend to follow is creating “specific benchmarks for future workloads, not just current ones” [1], ensuring that our optimizations also anticipate new use cases.
Building a performant database on Git is a significant challenge, but our benchmarking and optimization efforts demonstrate that it is feasible. ChronDB offers a unique combination of version control, auditability, and powerful SQL queries, all with acceptable performance for many use cases.
Our benchmark infrastructure allows us to identify bottlenecks, implement optimizations, and measure progress objectively. As ChronDB evolves, we will continue focusing on building a solid foundation that balances Git benefits with the expected performance of a modern database.
ChronDB proves that it is possible to have the best of both worlds: the reliability and complete history of Git, with the flexibility and power of SQL queries, all with sufficient performance for real-world applications.
As Scalzo and Kline conclude in their book, “the successful benchmark is not just one that produces impressive numbers, but one that leads to tangible improvements in the end-user experience” [1]. It is with this spirit that we continue to optimize ChronDB.
This post is part of our series on ChronDB development. For more information on JOIN support implementation, check our previous post.