Polars vs pandas for Tick-Level Data
What actually changes when you rewrite a futures tick pipeline from pandas to Polars — memory, query cost, and a few footguns.
I recently rewrote a futures L1/L2 ingestion pipeline from pandas to Polars. The dataset was roughly 200 GB of Parquet covering four years of limit order book snapshots. Below are the parts that actually mattered, and the parts that did not.
Where Polars wins decisively
- The lazy API lets the optimizer fuse filter + group_by + join into a single pass over the Parquet scan, with predicates and projections pushed down. Peak memory dropped from 32 GB to under 6 GB.
- Vectorized operations on Arrow columns ran 5–15x faster than pandas for the same group_by aggregations in this pipeline.
- Streaming engine lets you process files larger than RAM without manual chunking.
Where the gains were marginal
For row-wise feature engineering that needs lookahead-safe rolling windows (e.g., a causal rolling skew over an asymmetric window), Polars forces you to think in expressions, and the code ends up about as long as the pandas version. The win there is correctness, not speed.
Footguns worth knowing
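- Row order after a join is not guaranteed by default, and the behavior has not been stable across releases. Always sort by a deterministic key after a join if order matters.
- On irregularly spaced ticks, size rolling windows in time, not rows: pl.Expr.rolling takes an index column and a period such as "5s", and expects the data to be sorted by that column.
- Casting between dtypes can silently materialize a full column; call explain() on the LazyFrame liberally to see what the plan actually does.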
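    import polars as pl

    # Placeholder universe; substitute your own list of liquid symbols.
    liquid_symbols = ["ES", "NQ"]

    lf = pl.scan_parquet("ticks/*.parquet")
    bars = (
        lf.filter(pl.col("symbol").is_in(liquid_symbols))
        .group_by(["symbol", pl.col("ts").dt.truncate("1m")])
        .agg(
            # Order by timestamp so "last" really is the final tick of the bar.
            pl.col("price").sort_by("ts").last().alias("close"),
            pl.col("price").std().alias("vol"),
        )
        # Older Polars releases spell this collect(streaming=True).
        .collect(engine="streaming")
    )
When I still reach for pandas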
Small interactive exploration, Jupyter scratch work, and any time I need a library that only accepts pandas DataFrames (some older statsmodels paths). The right answer in practice is to use both: Polars for the heavy ingestion, pandas for the messy, interactive modeling work.