Today, a vast amount of data is stored in

  1. File formats like Parquet or CSV
  2. Object stores like S3 or GCS
  3. Table formats like Delta Lake or Iceberg

ParadeDB’s integrations make it easy to ingest this data without data processing engines or ETL tools, which can be complex and error-prone.

Basic Usage

In this example, we will query a Parquet file stored in S3 and copy it into Postgres. The Parquet file contains 3 million NYC taxi trips from January 2024, hosted in a public S3 bucket provided by ParadeDB.

To begin, let’s create a Postgres foreign data wrapper, which is how ParadeDB connects to S3.

CREATE FOREIGN DATA WRAPPER parquet_wrapper
HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator;

CREATE SERVER parquet_server FOREIGN DATA WRAPPER parquet_wrapper;

CREATE FOREIGN TABLE trips ()
SERVER parquet_server
OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet');
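
Because this bucket is public, no credentials are needed. For a private bucket, credentials would typically be attached with a standard Postgres user mapping. The snippet below is only a sketch: the option names (region, access_key_id, secret_access_key) are assumptions and may not match the options your ParadeDB version accepts, so check the documentation for the exact keys.

CREATE USER MAPPING FOR public SERVER parquet_server
OPTIONS (
  -- Hypothetical option names; consult the ParadeDB docs for the exact keys.
  region 'us-east-1',
  access_key_id '<YOUR_ACCESS_KEY_ID>',
  secret_access_key '<YOUR_SECRET_ACCESS_KEY>'
);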

Next, let’s query the foreign table trips. You’ll notice that the column names and types of this table are automatically inferred from the Parquet file.

SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
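
To inspect the inferred schema, you can describe the foreign table as you would any Postgres table, for example with \d trips in psql or by querying the standard information_schema catalog:

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'trips';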

Queries over this table are powered by DuckDB, an in-process analytical query engine. This means that you can run fast analytical queries over data lakes from ParadeDB.

SELECT COUNT(*) FROM trips;
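
As a slightly larger example, an aggregation over the same table is also executed by DuckDB. The columns used below are the ones returned by the earlier query.

-- Trip count and average distance per vendor
SELECT vendorid,
       COUNT(*) AS trips,
       AVG(trip_distance) AS avg_distance
FROM trips
GROUP BY vendorid
ORDER BY trips DESC;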

Finally, let’s copy this table into a Postgres heap table. For demonstration, we will copy over the first 100 rows.

CREATE TABLE trips_copy AS SELECT * FROM trips LIMIT 100;
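
To materialize the entire file instead, drop the LIMIT, or append to an existing heap table with INSERT ... SELECT. The table name trips_full below is just an illustrative choice.

-- Copy all rows into a new heap table
CREATE TABLE trips_full AS SELECT * FROM trips;

-- Or append into a heap table that already exists
INSERT INTO trips_copy SELECT * FROM trips;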

That’s it! Please refer to the other sections for instructions on how to ingest from other file and table formats and object stores.
