Today, a vast amount of data is stored in

  1. File formats like Parquet or CSV
  2. Data lakes like S3 or GCS
  3. Table formats like Delta Lake or Iceberg
  4. Databases like Postgres

Ingesting this data into ParadeDB with data processing engines or ETL tools is complex and error-prone. ParadeDB makes it easy to consume this data using nothing but Postgres utilities.

Basic Usage

In the following example, we will query and copy a Parquet file stored in S3 to Postgres. The Parquet file contains 3 million NYC taxi trips from January 2024, hosted in a public S3 bucket provided by ParadeDB.

To begin, let’s create a Postgres foreign data wrapper, which is how ParadeDB connects to S3.

CREATE FOREIGN DATA WRAPPER parquet_wrapper
HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator;

CREATE SERVER parquet_server FOREIGN DATA WRAPPER parquet_wrapper;

CREATE FOREIGN TABLE trips ()
SERVER parquet_server
OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet');

Next, let’s query the foreign table trips. You’ll notice that the column names and types of this table are automatically inferred from the Parquet file.

SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;

Queries over this table are powered by DuckDB, an in-process analytical query engine. This means that you can run fast analytical queries over data lakes from ParadeDB.

SELECT COUNT(*) FROM trips;

Finally, let’s copy this table into a Postgres heap table. For demonstration, we will copy over the first 100 rows.

CREATE TABLE trips_copy AS SELECT * FROM trips LIMIT 100;

That’s it! Please refer to the other sections for instructions on how to ingest from other file formats, table formats, and object stores.

For Further Assistance

The paradedb.help function opens a GitHub Discussion that the ParadeDB team will respond to.

SELECT paradedb.help(
  subject => $$Something isn't working$$,
  body => $$Issue description$$
);

Was this page helpful?