# Aggregations, Analytics & The Observer

PardoX provides two layers of data insight:

- **Aggregation kernels** — reduce a column to a scalar (sum, mean, std, etc.)
- **The Observer** — export and inspect the full DataFrame (value_counts, unique, to_dict, to_json)

All operations run entirely in native Rust machine code.
## 1. Column Aggregations (Series)

Access aggregation methods directly on any numeric Series object (obtained via `df['col']`).

### sum()

```python
total_revenue = df['amount'].sum()
print(f"Total: ${total_revenue:,.2f}")
```

Returns: `float`

### mean()

```python
avg_ticket = df['amount'].mean()
```

Returns: `float`

### min() / max()

```python
highest = df['amount'].max()
lowest = df['amount'].min()
```

Returns: `float`

### std()

Sample standard deviation.

```python
volatility = df['amount'].std()
```

Returns: `float`

!!! info "Interpretation"
    Low std → values cluster near the mean. High std → values are widely spread.
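The "sample" qualifier means the variance is divided by n − 1 (Bessel's correction) rather than n. A minimal pure-Python sketch of the same formula (`sample_std` is an illustrative helper, not part of the PardoX API):

```python
import math

def sample_std(values):
    # Sample standard deviation: the variance denominator is n - 1
    # (Bessel's correction), not n.
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)

print(sample_std([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # ≈ 2.1381
```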
### count()

Number of non-null values in the column.

```python
valid_transactions = df['transaction_id'].count()
```

Returns: `int`
## 2. DataFrame-Level Standard Deviation

`df.std(col)` is a DataFrame method (distinct from the Series `.std()`) that computes the sample standard deviation for a named column and returns a Python float. Useful when working with derived DataFrames from `df.mul()`, `df.add()`, etc.

```python
revenue_df = df.mul("price", "quantity")
std_val = revenue_df.std("result_mul")
print(f"Revenue std dev: {std_val:,.4f}")
```
## 3. Null Value Handling

All aggregation functions are null-aware: they skip NaN / null values and compute metrics only over valid data points. This matches SQL and Pandas behavior.

To fill nulls before aggregation:

```python
df.fillna(0.0)
total = df['amount'].sum()
```
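To illustrate the skip-nulls semantics, here is a pure-Python sketch (not the PardoX implementation) of what a null-aware sum and count do when NaN entries are present:

```python
import math

def null_aware_sum(values):
    # Skip NaN entries entirely, as the aggregation kernels do.
    return sum(v for v in values if not math.isnan(v))

def null_aware_count(values):
    # Count only valid (non-NaN) data points.
    return sum(1 for v in values if not math.isnan(v))

data = [10.0, float("nan"), 5.0, float("nan"), 2.5]
print(null_aware_sum(data))    # 17.5
print(null_aware_count(data))  # 3
```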
## 4. Performance

| Method | Mechanism | Speed (10M rows) |
|---|---|---|
| Python `sum(list)` | Iterates PyObjects one by one | ~1.5s |
| PardoX `.sum()` | SIMD vectorized accumulator | ~0.02s |

!!! tip "Under the hood"
    `.sum()` passes the memory pointer directly to a Rust function. AVX2 instructions add 4–8 values per CPU cycle without ever materializing Python objects.
## 5. The Observer — Full DataFrame Export

The Observer module provides functions to export or inspect the entire DataFrame. All string results are heap-allocated (proper ownership) and freed after the Python string is created.

### to_dict()

Returns all rows as a list of dictionaries (records format). Equivalent to Pandas' `df.to_dict('records')`.

```python
records = df.to_dict()
# [{'price': 19.99, 'quantity': 3, ...}, ...]
print(f"Total records: {len(records)}")

first_row = records[0]
print(first_row['price'])
```

Returns: `list[dict]`
### to_json()

Returns all rows as a JSON string `"[{...}, ...]"`. Useful for API responses or file writing.

```python
json_str = df.to_json()

with open("output.json", "w") as f:
    f.write(json_str)
```

Returns: `str`
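Because the output is plain JSON in records format, it round-trips with the standard library. A small sketch with illustrative rows (not produced by PardoX):

```python
import json

# Rows in the same shape the records format describes.
records = [{"price": 19.99, "quantity": 3}, {"price": 5.0, "quantity": 1}]

# Serialize and parse back: the structure is preserved exactly.
json_str = json.dumps(records)
assert json.loads(json_str) == records
```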
### value_counts(col)

Returns the frequency of each unique value in a column, sorted by count descending.

```python
state_counts = df.value_counts("state")
# {'TX': 6345, 'CO': 6304, 'CA': 6301, ...}

print(f"Unique states: {len(state_counts)}")

# Top 5
for state, count in list(state_counts.items())[:5]:
    print(f"  {state}: {count:,}")
```

Returns: `dict[str, int]`
### unique(col)

Returns the unique values of a column in insertion order.

```python
categories = df.unique("category")
# ['Electronics', 'Books', 'Clothing', ...]

print(f"Distinct categories: {len(categories)}")
```

Returns: `list`
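"Insertion order" means each value keeps its first-occurrence position. In pure Python the same guarantee falls out of dict key ordering (a sketch, not the PardoX code path):

```python
def unique_in_order(column):
    # dict keys preserve insertion order (Python 3.7+), so this keeps
    # each value's first occurrence, like df.unique(col).
    return list(dict.fromkeys(column))

cats = ["Books", "Electronics", "Books", "Clothing", "Electronics"]
print(unique_in_order(cats))  # ['Books', 'Electronics', 'Clothing']
```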
## 6. Observer — PHP and Node.js

### Node.js

```javascript
// value_counts
const stateCounts = df.valueCounts('state');
console.log(`States: ${Object.keys(stateCounts).length}`);

// unique
const cats = df.unique('category');

// Full export
const records = df.toDict();  // array of objects
const jsonStr = df.toJson();  // JSON string

// tolist — array of arrays (values only)
const matrix = df.tolist();
```

### PHP

```php
// value_counts
$stateCounts = $df->value_counts('state');
echo count($stateCounts) . " unique states\n";

// unique
$cats = $df->unique('category');

// Full export
$records = $df->to_dict();  // array of assoc arrays
$json = $df->to_json();     // JSON string

// tolist — array of arrays
$matrix = $df->tolist();
```
## 7. Complete Analysis Pipeline

```python
import pardox as px

df = px.read_csv("sales_50k.csv")
df.cast("quantity", "Float64")

# Feature engineering
revenue_df = df.mul("price", "quantity")

# Aggregations
print(f"Total revenue : ${revenue_df['result_mul'].sum():,.2f}")
print(f"Avg price     : ${df['price'].mean():,.4f}")
print(f"Max price     : ${df['price'].max():,.2f}")
print(f"Revenue std   : {revenue_df.std('result_mul'):,.2f}")
print(f"Valid rows    : {df['transaction_id'].count():,}")

# EDA inspection
state_counts = df.value_counts("state")
categories = df.unique("category")
print(f"\nTop 3 states: {list(state_counts.items())[:3]}")
print(f"Categories: {categories}")

# Full export
records = df.to_dict()
print(f"\nExported {len(records):,} records to Python list")
```
## 8. GroupBy

Vectorized groupby with a Rust hash-aggregation engine (Gap 1, added in v0.3.2).

```python
summary = df.groupby("region", {
    "revenue": "sum",
    "quantity": "mean",
    "transaction_id": "count"
})
```
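Conceptually, hash aggregation makes a single pass over the rows, accumulating per-group state in a hash map keyed by the group value. A pure-Python sketch of that idea (the actual engine operates on columnar Rust buffers, and `hash_groupby` is an illustrative helper, not the PardoX API):

```python
def hash_groupby(rows, key, aggs):
    # One pass: group key -> {column: [running_total, count]}
    state = {}
    for row in rows:
        group = state.setdefault(row[key], {c: [0.0, 0] for c in aggs})
        for col in aggs:
            group[col][0] += row[col]
            group[col][1] += 1

    # Finalize each accumulator according to the requested aggregation.
    out = {}
    for group_key, cols in state.items():
        out[group_key] = {}
        for col, how in aggs.items():
            total, n = cols[col]
            if how == "sum":
                out[group_key][col] = total
            elif how == "mean":
                out[group_key][col] = total / n
            elif how == "count":
                out[group_key][col] = n
    return out

rows = [
    {"region": "West", "revenue": 100.0, "quantity": 2},
    {"region": "East", "revenue": 50.0, "quantity": 1},
    {"region": "West", "revenue": 200.0, "quantity": 4},
]
result = hash_groupby(rows, "region", {"revenue": "sum", "quantity": "mean"})
print(result)  # West: revenue 300.0, quantity mean 3.0; East: 50.0, 1.0
```

The single-pass design is what makes the hash approach fast: each row is touched once, regardless of how many groups exist.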
You can also use `value_counts` for frequency analysis, and combine `filter` with aggregation on subsets:

```python
# Filter then aggregate
tx_mask = df['state'].eq("TX")
tx_df = df.filter(tx_mask)
print(f"TX revenue: ${tx_df['amount'].sum():,.2f}")
```