Written by Ming Ying on February 1, 2024

[Chart: Relative query time, measured by ClickBench. Lower is better.]

We’re excited to introduce pg_analytics, an extension that accelerates the native analytical performance of any Postgres database¹ by 94x. With pg_analytics installed, Postgres is 8x faster than Elasticsearch and nearly ties ClickHouse on analytical benchmarks².

Today, developers who store billions of data points in Postgres struggle with slow query times and poor data compression. Even with database tuning, complex analytical queries (e.g. counts, window functions, string aggregations) can take anywhere from minutes to hours. Many organizations turn to an external analytical data store like Elasticsearch as a result. This increases operational complexity as data becomes siloed and engineers must learn to use a new database.

By speeding up analytical queries directly inside Postgres, pg_analytics is a drop-in solution that removes the need to extract, transform, and load (ETL) data into another system. The goal of this blog post is to share how pg_analytics was built and why now is an unprecedented time for building a Postgres-based analytical database.

How It Works

Regular Postgres tables, known as heap tables, organize data by row. While this makes sense for operational data, it is inefficient for analytical queries, which often scan a large amount of data from a subset of the columns in a table.
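To make this concrete, consider the kind of query at issue (the orders table below is a hypothetical example, not part of the extension):

-- 'orders' and 'amount' are hypothetical names for illustration
-- This aggregate needs only one column, but a row-oriented heap
-- table must read every row in full to compute it
SELECT avg(amount) FROM orders;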

ParadeDB introduces a new kind of table called the parquet table. parquet tables behave like regular Postgres tables but use a column-oriented layout via Apache Arrow and leverage Apache DataFusion, a query engine optimized for column-oriented data. This means that users can choose between row and column-oriented storage at table creation time.
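In practice, that choice comes down to a single clause at table creation time. A minimal sketch (the table and column names here are hypothetical):

-- The Postgres default: a row-oriented heap table
CREATE TABLE events_by_row (id bigint, payload text);
-- The same schema stored column-oriented via pg_analytics
CREATE TABLE events_by_column (id bigint, payload text) USING parquet;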

Arrow and DataFusion are integrated with Postgres via two features of the Postgres API: the table access method and executor hooks. The table access method registers parquet tables with the Postgres catalog and handles data manipulation language (DML) statements like inserts. Executor hooks intercept and reroute queries to DataFusion, which parses the query, constructs an optimal query plan, executes it, and returns the results to Postgres.
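Because parquet is a true table access method, it is registered in the Postgres catalog like any other. As a quick sanity check (assuming the extension is installed):

-- Table access methods have amtype = 't' in pg_am; after
-- CREATE EXTENSION pg_analytics, 'parquet' should be listed
SELECT amname, amtype FROM pg_am WHERE amtype = 't';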

Data is persisted to disk with Parquet, a highly-compressed file format for column-oriented data. Thanks to Parquet, ParadeDB stores the same data in one-fifth the space required by regular Postgres or Elasticsearch.

[Chart: Compressed data size in GB after loading a 75.56GB CSV file into each database.]

The final dependency is delta-rs, a Rust-based implementation of Delta Lake. This library adds ACID transactions, updates and deletes, and file compaction to Parquet storage. It also supports querying over data lakes like S3, which opens up the future possibility of connecting Postgres tables to cloud data lakes.
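As a sketch of what those transactional guarantees enable (the events table and its columns are hypothetical, and operation support in the current MVP is tracked in the roadmap), a multi-statement transaction either applies all of its changes or none:

BEGIN;
-- delta-rs gives these statements ACID semantics over the
-- underlying Parquet files
UPDATE events SET status = 'processed' WHERE status = 'pending';
DELETE FROM events WHERE created_at < now() - interval '90 days';
COMMIT;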

Why DataFusion

Building a state-of-the-art analytical database in Postgres is a task that’s expensive and difficult to get right.

One of the first Postgres-based analytical databases, Greenplum, was released in 2005. Since then, several companies like Citus and Timescale have built similar products. However, the performance gap between these databases and their non-Postgres OLAP counterparts remains wide. This is one reason that systems like Elasticsearch are popular even among companies that prefer Postgres.

Recently, embeddable query engines like DataFusion have changed the game by surpassing the query speed of many OLAP databases. DataFusion raises the prospect of excellent analytical performance from any database — including Postgres.

Andy Pavlo, professor of databases at Carnegie Mellon, was right. Today, we’ve reached a point where it does not make sense to build a query engine from scratch within a database. Instead, the next generation of analytical databases should integrate existing, embeddable query engines³ like DataFusion, so that the database improves continuously as the engine itself improves.

Getting Started

At the time of writing, pg_analytics is open source and in an MVP state. Almost all Postgres queries and basic operations like inserts and vacuums are supported. Our roadmap can be found in the project README.

The easiest way to try pg_analytics is by running the ParadeDB Docker image. Once connected, you can follow this toy example:

CREATE EXTENSION pg_analytics;
-- Create a parquet table
CREATE TABLE t (a int) USING parquet;
-- Insert some rows
INSERT INTO t VALUES (1), (2), (3);
-- pg_analytics supercharges the performance of any
-- Postgres query run on a parquet table
SELECT COUNT(*) FROM t;
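From there, any query shape works. As a minimal sketch, a window function (one of the query types that is painfully slow on large heap tables) is routed through DataFusion the same way:

-- No special syntax is needed; queries on parquet tables are
-- rerouted to DataFusion automatically
SELECT a, SUM(a) OVER (ORDER BY a) AS running_total FROM t;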

The core ParadeDB team is focused on making pg_analytics production-ready. Combined with pg_search, our Postgres extension for full-text search, pg_analytics aims to make ParadeDB the best Elasticsearch alternative.

We welcome community contributions and are active on Slack. Finally, please don’t hesitate to show your support by giving us a star!

Footnotes

  1. pg_analytics is compatible with Postgres 12+ and can be installed on any self-hosted Postgres instance.

  2. According to ClickBench, a benchmarking tool for analytical databases.

  3. We also evaluated DuckDB, Polars, and Velox as candidates for an embedded query engine. DuckDB is a popular in-process OLAP database, Polars is a dataframe processing library built on Arrow, and Velox is a query execution library built by Meta. DataFusion was chosen for three reasons. First, it interoperates with a storage framework like Delta Lake, which provides essential properties like ACID transactions. Second, its API was designed for embedding as a query engine inside another database, unlike standalone tools like Polars and DuckDB. Finally, it’s written in Rust and comes with a query parser and optimizer, unlike Velox.