Mastering Automated Report Generation with Pipelines | Ugur Kaval

The Hidden Cost of Manual Reporting

Last year, our finance team was losing 40 hours a month to what they called "The Reconciliation Dance." It involved exporting three different CSVs from Stripe, two from our Postgres production replica, and one from a legacy Zendesk instance. They would then manually stitch these together in an Excel workbook with VLOOKUPs that broke if a column name so much as twitched. By the time the 'Final_v3_Updated_v2.pdf' hit the board's inbox, the data was already 48 hours stale. This wasn't just a waste of time; it was a risk. One copy-paste error in a cell formula nearly led to a $200k overstatement in our quarterly projections.

In 2026, there is zero excuse for this. We have the tools to treat reports as first-class software citizens. Automated report generation isn't just about scheduling a script; it's about building a resilient, observable data pipeline that ensures the numbers are correct, the formatting is consistent, and the delivery is invisible.

The Architecture of a Modern Pipeline

A robust reporting system requires three distinct layers: Extraction/Transformation, Document Orchestration, and Distribution. In my recent builds, I've moved away from monolithic 'reporting scripts' toward a modular stack using Dagster for orchestration, DuckDB for local-speed transformations, and Quarto for the final rendering.

1. The Compute Layer: DuckDB and Dagster

Previously, we would have piped everything into a massive Snowflake or BigQuery instance just to generate a 10-row summary table. That's overkill and expensive. Today, I use DuckDB. It allows us to run complex SQL directly against Parquet files stored in S3 or local storage with zero overhead.

By wrapping this in Dagster, we get 'Software Defined Assets.' Instead of just a task that runs at 9 AM, we have an asset called monthly_revenue_report that knows it depends on stripe_transactions and user_metadata. If the upstream transactions fail to sync, the report doesn't run, and I get an alert before the CEO asks why the report is empty.

2. The Rendering Layer: Quarto

If you're still using Jinja2 templates to hack together HTML and then trying to convert that to PDF using a brittle headless Chrome driver, stop. Quarto is the evolution of RMarkdown. It allows you to write reports in Markdown while embedding executable Python blocks. It handles the cross-referencing, the styling, and the multi-format output (PDF, HTML, MS Word) natively using Pandoc.

Building it: A Concrete Example

Let's build a reporting asset. In this scenario, we're using Dagster 1.9+ and DuckDB 1.1.0. We want to fetch transaction data, calculate a growth metric, and render a PDF.

First, we define our data transformation asset:

import duckdb
import pandas as pd
from dagster import asset, AssetExecutionContext

@asset(group_name="finance_reports", compute_kind="duckdb")
def monthly_transaction_summary(context: AssetExecutionContext) -> pd.DataFrame:

# We use DuckDB to query our S3 data lake directly
# In 2026, DuckDB's httpfs extension is the gold standard for this
conn = duckdb.connect(database=':memory:')
conn.execute("INSTALL httpfs; LOAD httpfs;")

query = """
    SELECT 
        date_trunc('month', transaction_date) as report_month,
        category,
        SUM(amount_usd) as total_revenue,
        COUNT(DISTINCT user_id) as active_customers
    FROM read_parquet('s3://company-datalake/transactions/*.parquet')
    WHERE transaction_date >= current_date - interval '1 year'
    GROUP BY 1, 2
    ORDER BY 1 DESC
"""
df = conn.execute(query).df()
context.log.info(f"Processed {len(df)} summary rows")
return df

Next, we create the report rendering asset. We'll use the quarto Python package to trigger the build. This asset depends on the summary data we just created.

import subprocess
from dagster import asset, AssetIn

@asset(
    ins={"summary_data": AssetIn("monthly_transaction_summary")},
    group_name="finance_reports",
    compute_kind="quarto"
)
def executive_revenue_report(context, summary_data):

# We save the data to a temporary location for Quarto to pick up
summary_data.to_csv("input_data.csv", index=False)

# Run Quarto render. This uses the 'report_template.qmd' file
# which contains the Markdown and layout logic.
try:
    result = subprocess.run(
        ["quarto", "render", "report_template.qmd", "--to", "pdf", "--output", "exec_report.pdf"],
        check=True,
        capture_output=True
    )
    context.log.info("Report rendered successfully via Quarto")
except subprocess.CalledProcessError as e:
    context.log.error(f"Quarto rendering failed: {e.stderr.decode()}")
    raise e

# Here you would typically upload the PDF to S3 and return the URL
return "s3://company-reports/2026/exec_report.pdf"

What Went Wrong: Lessons from Production

The OOM Nightmare

Early on, we tried to render a report containing a high-cardinality heatmap with 50,000 data points. The Quarto process, which uses Playwright under the hood for certain PDF exports, ballooned to 4GB of RAM and crashed our small Fargate task. The Fix: Always aggregate data in the DuckDB layer. Never send raw grains to the rendering engine. If you need a visualization of 1 million points, use a datashaded raster image, not a vector-heavy SVG or PDF element.

Timezone Drift

There is nothing more frustrating than a finance lead telling you the numbers are off by $5,000 because your cron job ran at 11:59 PM UTC, but the data for the 'last day' hadn't fully arrived from a provider in PST. The Fix: We implemented 'Data Wait' sensors in Dagster. The reporting pipeline no longer triggers strictly at a time; it triggers when the stripe_transactions asset metadata indicates that the max_timestamp has crossed the midnight threshold for the target day.

The Font Fiasco

Rendering PDFs in Docker containers is a special kind of hell. Your local machine has 'Helvetica' or 'Inter', but your Alpine Linux container doesn't. Your report will look like a 1995 Notepad document if you don't explicitly package your fonts. The Fix: Use a Debian-based slim image and explicitly COPY your .ttf files into /usr/local/share/fonts and run fc-cache -fv.

Gotchas the Docs Don't Tell You

Idempotency is a Lie: Reports are rarely truly idempotent because they often involve 'current date' logic. Always pass an explicit reference_date parameter through your pipeline rather than calling datetime.now() inside your functions. This allows you to re-run a report for 'last Tuesday' and get the exact same results.
PDF Metadata Matters: If you're sending reports to external clients, use a tool like pdfrw to set the metadata (Title, Author, Subject). It adds a layer of professionalism that automated scripts often lack.
The 'Silent Failure' of CSS: Quarto uses CSS for styling PDFs via PagedJS. If a style fails to load (like a remote Google Font), the renderer might hang indefinitely. Always use local font files and local CSS stylesheets.

The Takeaway

Manual reporting is technical debt that compounds with interest. Every hour your team spends 'checking the numbers' is an hour they aren't spent analyzing them.

Your action item for today: Identify one recurring manual export. Don't try to automate the whole thing yet. Just write a single DuckDB query that replicates the manual 'VLOOKUP' logic and output it to a CSV. Once the logic is codified in SQL, the orchestration and rendering are just plumbing. Stop being the 'data janitor' and start being the 'data architect.'", "tags": ["Automation", "Data Engineering", "Python", "Dagster", "DuckDB", "Reporting"], "seoTitle": "Mastering Automated Report Generation with Pipelines | Ugur Kaval", "seoDescription": "Senior Engineer Ugur Kaval shares a guide on building automated, resilient reporting pipelines using Dagster, DuckDB, and Quarto for 2026 production environments." }

Stop Manually Exporting CSVs: Building a Resilient Reporting Pipeline in 2026

The Hidden Cost of Manual Reporting

The Architecture of a Modern Pipeline

1. The Compute Layer: DuckDB and Dagster

2. The Rendering Layer: Quarto

Building it: A Concrete Example

What Went Wrong: Lessons from Production

The OOM Nightmare

Timezone Drift

The Font Fiasco

Gotchas the Docs Don't Tell You

The Takeaway

Enjoyed this article?

Related Articles

Beyond Cron: Industrial-Grade Automated Reporting in 2026

Beyond the Cron Job: Building Resilient Automated Reporting Pipelines in 2026

Uğur Kaval

Scalable Automated Report Generation: A 2026 Engineering Guide