Overview
Today, a vast amount of data is stored in:
- File formats like Parquet or CSV
- Data lakes like S3 or GCS
- Table formats like Delta Lake or Iceberg
- Databases like Postgres
Ingesting this data into Postgres typically requires external data processing engines or ETL tools, which are complex and error-prone to operate. ParadeDB makes it easy to consume this data using nothing but Postgres utilities.
Basic Usage
In the following example, we will query a Parquet file stored in S3 and copy it into Postgres. The Parquet file contains 3 million NYC taxi trips from January 2024, hosted in a public S3 bucket provided by ParadeDB.
To begin, let’s create a Postgres foreign data wrapper, which is how ParadeDB connects to S3.
CREATE FOREIGN DATA WRAPPER parquet_wrapper
HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator;
CREATE SERVER parquet_server FOREIGN DATA WRAPPER parquet_wrapper;
CREATE FOREIGN TABLE trips ()
SERVER parquet_server
OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet');
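At this point the wrapper, server, and foreign table all exist. As a quick sanity check, the standard Postgres catalogs can confirm that the server was registered; this is plain Postgres, not a ParadeDB-specific API.
-- List registered foreign servers; parquet_server should appear
SELECT srvname FROM pg_foreign_server;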
Next, let’s query the foreign table trips. You’ll notice that the column names and types of this table are automatically inferred from the Parquet file.
SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
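Because the schema is inferred rather than declared, it can be useful to see exactly which columns and types were created. The generic information_schema catalog covers foreign tables as well; this is standard Postgres, not a ParadeDB API.
-- Inspect the columns inferred from the Parquet file
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'trips'
ORDER BY ordinal_position;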
Queries over this table are powered by DuckDB, an in-process analytical query engine. This means that you can run fast analytical queries over data lakes from ParadeDB.
SELECT COUNT(*) FROM trips;
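As a sketch of the kind of analytics this enables, here is an aggregation over the columns shown earlier; the grouping column is chosen purely for illustration, and any column in the file would work.
-- Average trip distance per passenger count, executed by DuckDB
SELECT passenger_count, AVG(trip_distance) AS avg_distance
FROM trips
GROUP BY passenger_count
ORDER BY passenger_count;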
Finally, let’s copy this table into a Postgres heap table. For demonstration, we will copy over the first 100 rows.
CREATE TABLE trips_copy AS SELECT * FROM trips LIMIT 100;
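CREATE TABLE ... AS both creates the heap table and populates it in one statement. If the destination heap table already exists, a plain INSERT ... SELECT works as well; the sketch below appends to the trips_copy table created above, assumes its column list matches the foreign table, and uses an illustrative filter.
-- Append rows matching a filter to the existing heap table
INSERT INTO trips_copy
SELECT * FROM trips
WHERE trip_distance > 10;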
That’s it! Please refer to the other sections for instructions on how to ingest from other file formats, table formats, and object stores.
For Further Assistance
The paradedb.help function opens a GitHub Discussion that the ParadeDB team will respond to.
SELECT paradedb.help(
subject => $$Something isn't working$$,
body => $$Issue description$$
);