Multi-level partitioned tables are a powerful feature in database systems that allow us to divide large tables into smaller, more manageable pieces called partitions. These partitions are organized in a hierarchical structure, with each level representing a different category of data.
In our auto sales dataset scenario, we use a two-level partitioning strategy:
- First level: Partitioned by year
- Second level: Partitioned by manufacturer
This structure creates a tree-like organization of our data:
```mermaid
%%{init: {"theme": "base", "themeVariables": { "fontSize": "16px"}}}%%
graph TB
    subgraph S3["S3 Bucket: demo-mlp-auto-sales"]
        direction TB
        A[2020] -->|Manufacturer| B1[Toyota]
        A -->|Manufacturer| B2[Honda]
        A -->|Manufacturer| B3[...]
        B1 -->|Files| C1[data_0.parquet]
        B1 -->|Files| C2[data_1.parquet]
        B1 -->|Files| C3[...]
        D[2021] --> E1[...]
        F[...] --> E2[...]
    end

    classDef s3 fill:#f0f0f0,stroke:#333,stroke-width:2px;
    class A,B1,B2,B3,C1,C2,C3,D,E1,E2,F s3;
```
```mermaid
%%{init: {"theme": "base", "themeVariables": { "fontSize": "16px"}}}%%
graph TB
    subgraph PostgreSQL["PostgreSQL Tables"]
        direction TB
        subgraph PG["Native PostgreSQL Tables"]
            G[auto_sales_partitioned] -->|Year| H1[auto_sales_y2020]
            G -->|Year| H2[auto_sales_y2021]
            G -->|Year| H3[...]
        end
        subgraph PGFDW["pg_analytics Foreign Data Tables"]
            H1 -->|Manufacturer| I1[auto_sales_y2020_Toyota]
            H1 -->|Manufacturer| I2[auto_sales_y2020_Honda]
            H1 -->|Manufacturer| I3[...]
            I1 -.->|FDW| J1[S3 Files]
        end
    end

    classDef pg fill:#d9d9d9,stroke:#333,stroke-width:2px;
    classDef pgfdw fill:#bfbfbf,stroke:#333,stroke-width:2px;
    class G,H1,H2,H3 pg;
    class I1,I2,I3,J1 pgfdw;
    class PostgreSQL,PG,PGFDW root;
```
This partitioning structure offers several benefits:

- **Improved Query Performance**: When you run a query for a specific year and manufacturer, the database only needs to scan the relevant partition instead of the entire dataset. For example, a query for Toyota sales in 2020 would only look at the `auto_sales_y2020_Toyota` partition (see the sketch after this list).
- **Efficient Data Management**: You can easily add or remove data for a specific year or manufacturer by managing individual partitions. This is particularly useful for archiving historical data or adding new car brands.
- **Parallel Query Execution**: Different partitions can be queried simultaneously, speeding up complex analyses across multiple years or manufacturers.
- **Simplified Maintenance**: You can perform maintenance operations (such as updates or backups) on specific partitions without affecting the entire dataset.
- **Improved Data Organization**: The partition structure naturally organizes data in a way that matches business logic, making the dataset easier to understand and manage.
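To make the pruning benefit concrete, here is a minimal sketch of a query confined to a single leaf partition. The `sqlx` usage and the `year`/`manufacturer` column names are assumptions for illustration; the demo's actual schema lives in its source:

```rust
use anyhow::Result;
use sqlx::PgConnection;

// Hypothetical sketch: both predicates match the partition keys, so
// PostgreSQL can prune every leaf except auto_sales_y2020_Toyota.
async fn toyota_sales_2020(conn: &mut PgConnection) -> Result<i64> {
    let (count,): (i64,) = sqlx::query_as(
        "SELECT COUNT(*) FROM auto_sales_partitioned \
         WHERE year = 2020 AND manufacturer = 'Toyota'",
    )
    .fetch_one(conn)
    .await?;
    Ok(count)
}
```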
Our multi-level partitioned table system comprises several key components that work together to provide efficient data storage and querying capabilities. Here's an overview of each component and how they're utilized in our test cases:
- **PostgreSQL**:
  - Acts as the primary database system
  - Stores the structure of the partitioned tables
  - Handles user queries and interactions
- **S3 Storage**:
  - Stores the actual data in Parquet file format
  - Organizes data into a hierarchical structure (year/manufacturer)
  - Provides scalable and cost-effective storage for large datasets
- **`pg_analytics` Foreign Data Wrapper (FDW)**:
  - Bridges PostgreSQL and S3 storage
  - Manages the connection between PostgreSQL tables and S3 data
  - Intercepts and optimizes queries to partitioned tables
- **DuckDB**:
  - In-memory analytics engine used by `pg_analytics`
  - Processes data from S3 efficiently
  - Helps in query optimization and execution
```mermaid
%%{init: {"theme": "base", "themeVariables": { "fontSize": "16px"}}}%%
graph TB
    subgraph CL[Client Layer]
        TA[Test Application]
    end
    subgraph DL[Database Layer]
        subgraph PG[PostgreSQL]
            PT[Partitioned Tables]
        end
        subgraph PGA[pg_analytics]
            FDW[Foreign Data Wrapper]
            QI[Query Interceptor]
            QO[Query Transcoder]
        end
        DD[DuckDB Engine]
    end
    subgraph SL[Storage Layer]
        S3[S3 Storage]
        subgraph S3D[Data Organization]
            PF[Parquet Files]
        end
    end

    TA -->|1. Sends Query| PT
    PT -->|2. Forwards Query| QI
    QI -->|3. Intercepts Query| QO
    QO -->|4. Optimizes Query| FDW
    FDW -->|5. Requests Data| DD
    DD <-->|6. Reads/Writes Data| S3
    S3 -.->|Stores Data| S3D

    classDef client fill:#f0f0f0,stroke:#333,stroke-width:2px;
    classDef database fill:#d9d9d9,stroke:#333,stroke-width:2px;
    classDef pganalytics fill:#bfbfbf,stroke:#333,stroke-width:2px;
    classDef storage fill:#e6e6e6,stroke:#333,stroke-width:2px;
    class TA client;
    class PG,PT,DD database;
    class FDW,QI,QO pganalytics;
    class S3,PF storage;
    class CL,DL,SL,PGA,S3D root;
```
Within this architecture, `pg_analytics` is responsible for:

- **Foreign Data Wrapper (FDW) Management**: Extends PostgreSQL to interact with S3-stored Parquet files.
- **Multi-level Partition Management**: Creates and manages partitioned tables across different levels (year, manufacturer).
- **Query Interception and Optimization**:
  - Intercepts queries for partitioned tables
  - Transcodes PostgreSQL queries to DuckDB syntax
  - Optimizes queries for efficient execution on S3 data
- **Transparent Integration**: Enables querying of partitioned S3 data using standard PostgreSQL syntax (see the sketch after this list).
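One way to observe this interception is to inspect a query plan: leaf partitions served by the FDW appear as foreign scans in standard `EXPLAIN` output. Below is a minimal sketch, assuming `sqlx` and the table and column names used earlier; the exact plan text depends on the `pg_analytics` version:

```rust
use anyhow::Result;
use sqlx::{PgConnection, Row};

// Hypothetical sketch: print the plan for a pruned query. Partitions
// backed by the FDW show up as Foreign Scan nodes in the output.
async fn explain_pruned_query(conn: &mut PgConnection) -> Result<()> {
    let rows = sqlx::query(
        "EXPLAIN SELECT * FROM auto_sales_partitioned \
         WHERE year = 2020 AND manufacturer = 'Toyota'",
    )
    .fetch_all(conn)
    .await?;
    for row in rows {
        let line: String = row.get(0);
        println!("{line}");
    }
    Ok(())
}
```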
Our demo application, implemented using the core structs `AutoSale`, `AutoSalesSimulator`, and `AutoSalesTestRunner`, demonstrates the functionality and performance of the multi-level partitioned table system. The setup process involves several key steps:
```rust
AutoSalesSimulator::save_to_parquet_in_batches(10000, 1000, &parquet_path)
```

This function generates sample auto sales data and saves it to local Parquet files.
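For context, each generated record might look roughly like the following. This is a hypothetical sketch: the field names are assumptions inferred from the test cases below, and the real `AutoSale` definition lives in the demo source.

```rust
use chrono::NaiveDate;

// Hypothetical sketch of a generated sales record (the actual AutoSale
// struct is defined in the demo; these fields are assumptions).
#[derive(Debug, Clone)]
pub struct AutoSaleSketch {
    pub sale_id: i64,
    pub sale_date: NaiveDate,
    pub manufacturer: String,
    pub price: f64,
}
```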
```rust
pub async fn create_partition_and_upload_to_s3(
    s3: &S3,
    s3_bucket: &str,
    df_sales_data: &DataFrame,
) -> Result<()> {
    // Organizes and uploads data to S3 in a partitioned structure
}
```

This function uploads the data to S3, organizing it to match the PostgreSQL partition structure for efficient querying.
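The S3 key layout mirrors the partition hierarchy from the first diagram. A minimal sketch of how each object key might be derived (the demo's actual layout logic lives inside the function above):

```rust
// Hypothetical sketch: build an object key matching the
// year/manufacturer/data_N.parquet layout shown earlier.
fn partition_key(year: i32, manufacturer: &str, batch: usize) -> String {
    format!("{year}/{manufacturer}/data_{batch}.parquet")
}

// partition_key(2020, "Toyota", 0) => "2020/Toyota/data_0.parquet"
```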
```rust
AutoSalesTestRunner::setup_tables(&mut conn, &s3, s3_bucket).await?;
```

This function creates the partitioned table structure in PostgreSQL and sets up Foreign Data Wrapper (FDW) connections.

`pg_analytics` extends PostgreSQL's FDW capabilities to work with S3 and Parquet files:
```rust
fn setup_s3_fdw(s3_endpoint: &str) -> String {
    format!(
        r#"
        CREATE FOREIGN DATA WRAPPER parquet_wrapper
        HANDLER parquet_fdw_handler
        VALIDATOR parquet_fdw_validator;

        CREATE SERVER auto_sales_server
        FOREIGN DATA WRAPPER parquet_wrapper;

        CREATE USER MAPPING FOR public
        SERVER auto_sales_server
        OPTIONS (
            type 'S3',
            region 'us-east-1',
            endpoint '{s3_endpoint}',
            use_ssl 'false',
            url_style 'path'
        );
        "#
    )
}
```

This function generates SQL commands to set up the necessary FDW components, connecting PostgreSQL with S3.
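The generated statements then need to be executed against the database. A minimal sketch, assuming `sqlx` (executing a plain string without bind parameters uses the simple query protocol, which allows several statements in one call):

```rust
use anyhow::Result;
use sqlx::{Executor, PgConnection};

// Hypothetical sketch: run the multi-statement FDW setup SQL
// produced by setup_s3_fdw against an open connection.
async fn run_s3_fdw_setup(conn: &mut PgConnection, s3_endpoint: &str) -> Result<()> {
    let sql = setup_s3_fdw(s3_endpoint);
    conn.execute(sql.as_str()).await?;
    Ok(())
}
```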
The following functions generate SQL commands to create the multi-level partition structure in PostgreSQL:
```rust
fn create_partitioned_table() -> String {
    // Creates the root partitioned table
}

fn create_year_partition(year: i32) -> String {
    // Creates a partition for a specific year
}

fn create_manufacturer_partition(s3_bucket: &str, year: i32, manufacturer: &str) -> String {
    // Creates a foreign table partition for a specific manufacturer within a year
}
```

These queries are intercepted by `pg_analytics` to manage partitions at different levels.
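As a rough sketch of what the year and manufacturer functions might emit, assuming the root table is list-partitioned on a `year` column, the naming conventions from the diagrams above, and pg_analytics' `files` option for foreign tables (the actual SQL lives in the demo source):

```rust
// Hypothetical sketches of the generated SQL; column names, table
// names, and the files option are assumptions, not the demo's source.
fn create_year_partition_sketch(year: i32) -> String {
    format!(
        r#"
        CREATE TABLE auto_sales_y{year}
        PARTITION OF auto_sales_partitioned
        FOR VALUES IN ({year})
        PARTITION BY LIST (manufacturer);
        "#
    )
}

fn create_manufacturer_partition_sketch(s3_bucket: &str, year: i32, manufacturer: &str) -> String {
    format!(
        r#"
        CREATE FOREIGN TABLE auto_sales_y{year}_{manufacturer}
        PARTITION OF auto_sales_y{year}
        FOR VALUES IN ('{manufacturer}')
        SERVER auto_sales_server
        OPTIONS (files 's3://{s3_bucket}/{year}/{manufacturer}/*.parquet');
        "#
    )
}
```

PostgreSQL allows a foreign table to serve as a leaf partition, which is what lets the year level stay native while the manufacturer level points at S3 data.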
The demo includes several test cases to validate the system's functionality:
```rust
AutoSalesTestRunner::assert_total_sales(&mut conn, &df_sales_data).await?;
```

Compares total sales calculations between PostgreSQL and the DataFrame.

```rust
AutoSalesTestRunner::assert_avg_price(&mut conn, &df_sales_data).await?;
```

Verifies average price calculations for a specific year.

```rust
AutoSalesTestRunner::assert_monthly_sales(&mut conn, &df_sales_data).await?;
```

Ensures accurate monthly sales counts and sale ID collection.
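Each assertion follows the same pattern: run an aggregate through the partitioned table and compare it against the equivalent computation on the in-memory data. A minimal sketch, assuming `sqlx` and a `price` column (the expected value would be computed from `df_sales_data`):

```rust
use anyhow::Result;
use sqlx::PgConnection;

// Hypothetical sketch: compare PostgreSQL's aggregate over all
// partitions against a value derived from the in-memory data.
async fn assert_total_sales_sketch(conn: &mut PgConnection, expected: f64) -> Result<()> {
    let (pg_total,): (f64,) = sqlx::query_as(
        "SELECT SUM(price) FROM auto_sales_partitioned",
    )
    .fetch_one(conn)
    .await?;
    assert!((pg_total - expected).abs() < 1e-6, "totals diverge");
    Ok(())
}
```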
These test cases validate:
- Correct setup of the multi-level partitioned table structure
- Proper data storage and accessibility
- Accurate query results
- Performance comparability between `pg_analytics` queries and direct DataFrame operations
To run the demo, first install `pg_analytics` following the instructions at ParadeDB Documentation. Then execute the following command in your terminal:
```bash
RUST_LOG=info cargo test --test test_mlp_auto_sales -- --nocapture
```

This command runs the multi-level partitioned auto sales test with detailed logging, displaying all output.
The multi-level partitioned table feature in `pg_analytics` demonstrates a powerful approach to managing large datasets efficiently. By leveraging PostgreSQL's familiar interface, DuckDB's analytical capabilities, and S3 storage, it offers:
- Improved query performance on large, partitioned datasets
- Seamless integration with existing PostgreSQL workflows
- Efficient data management and organization in cloud storage
This demo showcases the potential for enhancing data analytics workflows in scenarios requiring fast access to specific subsets of large datasets, such as time-series or categorized data.
For specific questions about this feature or running the test application, please contact the ParadeDB team. For general inquiries and community discussions, join the ParadeDB Slack.