Blog Post

PostgreSQL Statistics Explained: Why Row Estimates Are Wrong

If PostgreSQL query planning feels unpredictable, it usually comes down to one thing: statistics. When row estimates are wrong, everything built on top of them collapses — join order, scan type, memory usage, and overall performance.

PostgreSQL isn’t bad at estimating rows. It’s just working with incomplete information.

Where PostgreSQL Gets Its Estimates

PostgreSQL does not inspect your data at query time. That would be far too expensive.

Instead, it relies on statistics collected ahead of time by ANALYZE. Those statistics include:

  • Total row count
  • Null fraction per column
  • Most Common Values (MCVs)
  • Histograms showing value distribution

Every row estimate you see in EXPLAIN comes from these numbers. If they’re inaccurate, the plan never had a chance.

ANALYZE vs Autovacuum

ANALYZE is what actually gathers statistics. Autovacuum is what decides when to run it automatically. This distinction matters more than most people realize.

Autovacuum triggers ANALYZE based on thresholds — usually a percentage of rows changed. That means:

  • A table can change significantly before stats update
  • Freshly loaded data may have no useful statistics
  • Rapidly changing distributions can fool the planner

Running ANALYZE manually forces PostgreSQL to refresh its understanding immediately:


ANALYZE rental;

This alone often “fixes” a bad query plan — not because the query changed, but because PostgreSQL finally knows what the data looks like!

Most Common Values (MCVs)

PostgreSQL tracks a small list of the most frequent values in each column. These are incredibly important for skewed data.

Consider a column where:

  • 90% of rows have the same value
  • The remaining 10% are spread thinly

If that dominant value isn’t captured accurately, PostgreSQL will badly underestimate how many rows match. MCVs are stored per column and are limited in number — which leads to the next problem.

default_statistics_target

The amount of detail PostgreSQL stores is controlled by default_statistics_target. By default, this value is 100. That’s a middle-of-the-road setting designed to work reasonably well for most workloads.

In practical terms:

  • Lowest possible value: 0 (almost no statistics)
  • Default value: 100 (general-purpose compromise)
  • Highest allowed value: 10000 (extremely detailed, rarely justified, borderline insane)

Higher values mean more Most Common Values (MCVs), more histogram buckets, and better row estimates — especially for skewed data.

Lower values make ANALYZE faster and reduce memory usage, but at the cost of precision.

The default is a compromise — and it’s not always good enough for performance-critical queries.

You can increase statistics for a specific column without affecting the rest of the table:


ALTER TABLE rental
ALTER COLUMN return_date
SET STATISTICS 1000;

Then re-run ANALYZE:


ANALYZE rental;

This tells PostgreSQL: “This column matters. Pay attention.”

You can also inspect the statistics target PostgreSQL is currently using for each column:


SELECT
  attname AS column_name,
  attstattarget AS statistics_target
FROM pg_attribute
WHERE attrelid = 'rental'::regclass
  AND attnum > 0
  AND NOT attisdropped;

A value of NULL or -1 means the column is inheriting the table or default_statistics_target setting. Any positive number overrides it for that column.

To see the global default:


SHOW default_statistics_target;

When Increasing Statistics Helps

Increasing statistics is useful when:

  • Data is highly skewed
  • Filters are selective but incorrectly estimated
  • Joins depend heavily on specific columns
  • Plans flip unpredictably

This is especially common with status columns, boolean-like fields, and dates with heavy recent activity.

When Increasing Statistics Is a Bad Idea

More statistics are not always better. Don’t blindly raise statistics on every column.

Higher statistics targets — meaning larger STATISTICS values or a higher default_statistics_target (i.e., setting a column to 500, 1000, or even 2000 instead of the default 100) — increase the amount of data PostgreSQL collects during ANALYZE.

That comes with real costs:

  • Slower ANALYZE runs
  • More memory usage
  • Longer autovacuum cycles

Most columns don’t need extra precision. Focus on the ones that actually affect query plans!

Seeing the Damage in EXPLAIN

The clearest sign of bad statistics is a large gap between estimated and actual rows.


EXPLAIN ANALYZE
SELECT *
FROM rental
WHERE return_date IS NULL;

If PostgreSQL estimates 50 rows but processes 50,000, the planner made a bad decision — based on bad information.

PostgreSQL doesn’t self-correct mid-query. It commits to the plan and suffers through it.

Conclusion

PostgreSQL query planning is only as good as the statistics behind it. When row estimates are wrong, performance problems follow — no matter how good your indexes are.

Before rewriting queries or adding hints that don’t exist, ask simpler questions:

  • Are statistics current?
  • Are important columns detailed enough?
  • Has the data distribution changed?

Once PostgreSQL understands your data, it usually makes the right decision!