Blog Post

Common SQL Mistakes

Most Common SQL Mistakes (and How to Avoid Them)

SQL is powerful — but even experienced developers run into subtle mistakes that affect correctness, performance, or readability. In this article, we walk through the most common SQL pitfalls using PostgreSQL and the Pagila sample database, along with practical ways to avoid them.

1. Selecting Too Many Columns

It’s tempting to grab everything with (select *), especially when you’re exploring a database or trying to move quickly. But in production environments, (select *) can quietly become one of the most expensive patterns in SQL. Fetching every column increases the amount of data PostgreSQL must read from disk, send across the network, and store in memory on both the server and client side. This overhead grows quickly as tables accumulate more columns over time.

There are several problems with relying on (select *):

  • Slower queries: PostgreSQL must load more data pages into memory, even if your application only needs a few values.
  • Higher network I/O: More bytes are transmitted to the application layer, slowing things down in distributed systems.
  • Reduced clarity: It’s harder to see what data the query truly depends on, making debugging and code reviews more difficult.
  • Schema fragility: If new columns are added later, all existing (select *) queries suddenly return a different shape of data.
  • Worse caching behavior: Query plans and application-side caching become less efficient when the result shape is unpredictable.
-- Bad
SELECT *
FROM film;

-- Good
SELECT film_id, title, release_year
FROM film;

2. Forgetting to Filter with Proper Indexed Conditions

PostgreSQL can use indexes to find matching rows very quickly, but only if your query gives it something useful to work with. If you skip the WHERE clause or filter on non-indexed expressions, PostgreSQL may have to perform a full table scan, reading every row. That might be fine in development with a few hundred rows, but it does not scale when your tables reach hundreds of thousands or millions of rows.

By filtering on indexed columns (and writing conditions that are “index-friendly”), you:

  • Reduce I/O by reading only relevant pages from disk
  • Lower CPU usage because fewer rows need to be processed
  • Speed up response times and improve overall application performance
-- Bad
SELECT *
FROM inventory;

-- Good
SELECT inventory_id, film_id, store_id
FROM inventory
WHERE store_id = 1;

3. Using Subqueries Instead of Joins

Correlated subqueries often feel intuitive - “for each row, look up the matching value.” But in SQL, that pattern can be both slower and harder to maintain. A correlated subquery may need to be executed once per row, which becomes expensive as your tables grow. Even when PostgreSQL can optimize it, the intent of the query is harder to understand compared to a clear JOIN.

In the Pagila database, it’s common to want related information such as customer names alongside payments. The subquery version works, but PostgreSQL cannot optimize it as aggressively as a standard join.

-- Bad
SELECT amount,
  (SELECT first_name || ' ' || last_name
    FROM customer
    WHERE customer.customer_id = payment.customer_id)
    FROM payment;

-- Good
SELECT p.amount,
       c.first_name || ' ' || c.last_name AS customer_name
    FROM payment p
    JOIN customer c
    ON c.customer_id = p.customer_id;
  

4. Missing JOIN Conditions (Accidental Cartesian Products)

One of the most common and dangerous SQL mistakes is forgetting a join condition. When two tables are listed together without a proper JOIN clause, PostgreSQL creates a Cartesian product — a combination of every row from the first table paired with every row from the second. This almost always produces an enormous result set, consumes excessive CPU and memory, and slows down the entire query.

In the Pagila database, the film table has 1,000 rows and the actor table has 200. A Cartesian product will produce 200,000 rows — long before applying any filters!!

-- Bad
SELECT film.title, actor.first_name
FROM film, actor;

-- Good
SELECT f.title, a.first_name, a.last_name
FROM film f
JOIN film_actor fa ON f.film_id = fa.film_id
JOIN actor a ON a.actor_id = fa.actor_id;

5. Using DISTINCT To Hide Duplicate Problems

The DISTINCT keyword is one of the most commonly misused tools in SQL. Many developers reach for it when they see duplicate rows, treating it like a quick fix. In reality, DISTINCT doesn’t solve the underlying problem — it simply hides it. This often leads to slower queries, unnecessary sorting, and confusion about where the duplicates came from in the first place.

Duplicates usually appear because of:

  • Incorrect or incomplete JOIN conditions
  • Joining to a table that has multiple matching rows
  • Filtering logic that is too broad
  • Many-to-many relationships that require a bridge table

In the Pagila database, the category table has unique category names. If you see duplicates, it’s almost always caused by a join—not an issue with the data itself. Using DISTINCT here would mask a deeper problem.

-- Bad: DISTINCT hides the fact that the join is wrong
SELECT DISTINCT name
FROM category;

A better approach is to identify why the duplicates exist and fix the join or filtering logic. For example, if you joined category to film_category without using the proper key, you would create duplicates for every film that shares the same category. The right solution is to correct the join, not slap DISTINCT on the result set.

Reserve DISTINCT for cases where it truly represents the business logic: when you intentionally want unique values. When it’s being used to silence a problem, it’s a sign that something else in the query needs attention.

6. Misunderstanding NULL Behavior

In SQL, a NULL value represents “unknown” or “missing” data. Because of this, a NULL is never equal to anything — not even another NULL. This is one of the most common sources of confusion for new PostgreSQL users and can lead to queries that quietly return zero rows.

A typical mistake is trying to compare a column directly to NULL using the equals operator. In SQL’s three-valued logic (true, false, and unknown), the expression column = NULL always evaluates to “unknown,” which behaves like false in a WHERE clause.

-- Bad: this will never match any rows
SELECT *
FROM customer
WHERE email = NULL;

The correct way to test for a missing value is to use the IS NULL or IS NOT NULL operators. These are specifically designed for working with unknown values.

-- Good: find customers with no email on file
SELECT *
FROM customer
WHERE email IS NULL;

Understanding how NULL behaves also matters when working with:

  • JOIN conditions — NULLs prevent matches unless handled explicitly
  • Aggregations — COUNT(*) versus COUNT(column) treat NULLs differently
  • Comparisons and filters — expressions involving NULL often return “unknown” instead of true/false
  • Sorting and indexing — PostgreSQL treats NULLs as higher-than or lower-than depending on ASC/DESC rules

When in doubt, remember this simple rule: compare with IS NULL, never with = NULL. Understanding how NULL works prevents subtle bugs and ensures your queries behave as intended.

7. Case Sensitivity in PostgreSQL

PostgreSQL’s text comparisons are case-sensitive by default. This often surprises users coming from databases like MySQL, where string comparisons are usually case-insensitive. In PostgreSQL, searching for alien will not match Alien unless you explicitly tell it to ignore case.

This can lead to queries that appear to work sometimes, fail at other times, or silently return zero rows. When working with user-generated text, titles, names, or anything not guaranteed to match case exactly, it’s safer to use ILIKE, which performs a case-insensitive comparison.

-- Case-sensitive: returns zero rows in Pagila
SELECT *
FROM film
WHERE title = 'alien';

-- Case-insensitive: matches titles regardless of case
SELECT *
FROM film
WHERE title ILIKE 'alien';

The ILIKE operator is especially useful when users are typing search input—where enforcing perfect case matching isn’t realistic. For more control, you can also normalize both sides of the comparison using lower(), which is index-friendly when paired with functional indexes.

-- Alternative approach using lower(): deterministic and indexable
SELECT *
FROM film
WHERE lower(title) = lower('Alien');

Keep in mind:

  • LIKE and = are case-sensitive
  • ILIKE is case-insensitive
  • Lowercasing both sides can be more efficient if you create an index such as CREATE INDEX ON film (lower(title));

Understanding case sensitivity helps avoid unexpected empty result sets and makes your text-search queries more reliable and user-friendly.

8. Incorrect GROUP BY Usage

One of the most common SQL mistakes is forgetting that PostgreSQL requires every non-aggregated column in a query to appear in the GROUP BY clause. This rule ensures that PostgreSQL always knows exactly how to group rows before applying aggregate functions like COUNT, SUM, AVG, or MAX.

If you select columns that aren’t inside an aggregate function and also don’t appear in the GROUP BY, PostgreSQL cannot determine which specific values should be returned for each group — and it will raise an error. While some databases allow this and return arbitrary values, PostgreSQL enforces correctness and consistency.

-- Bad: manager_staff_id is neither grouped nor aggregated
SELECT store_id, COUNT(*), manager_staff_id
FROM store;

To fix this, every column in the SELECT list must either:

  • Be part of the GROUP BY, or
  • Be wrapped in an aggregate function
-- Correct: all non-aggregated columns appear in GROUP BY
SELECT store_id, manager_staff_id, COUNT(*)
FROM store
GROUP BY store_id, manager_staff_id;

This behavior becomes even more important when working with tables that have many columns or many-to-one relationships. For example, in Pagila, grouping customer data incorrectly can easily generate errors if you forget to group on all relevant fields.

-- Another correct example with multiple dimensions
SELECT customer_id, staff_id, COUNT(*)
FROM payment
GROUP BY customer_id, staff_id;

A helpful way to think about it: if you return a column in the result set, PostgreSQL needs to know exactly which value belongs to the aggregated group. If that value is not uniquely determined by the grouping, you must either add it to the GROUP BY or apply an aggregate.

When in doubt, ask yourself: “Does each group have one and only one value for this column?” If the answer is no, it either belongs in the GROUP BY or needs to be aggregated.

9. Over-Normalizing or Under-Normalizing

Database design is a balancing act. Beginners often fall into one of two traps: over-normalizing their schema into dozens of small, fragmented tables, or under-normalizing by keeping too much repeated or multi-purpose data in a single place. Both extremes can cause significant performance and usability issues.

Over-normalizing typically happens when every tiny piece of information is split into its own table. While this reduces duplication, it can create overly complex joins, slow performance, and queries that are difficult to write and maintain. A schema with too many micro-tables forces the database to constantly join data back together just to answer basic questions.

Under-normalizing, on the other hand, occurs when too much information is packed into a single table or when values are repeated unnecessarily. This leads to larger row sizes, wasted storage, inconsistent updates, and slower queries as the table grows. When every row contains repeated text or redundant attributes, performance suffers and data integrity becomes harder to enforce.

The Pagila database offers a practical example of a balanced approach. The schema is normalized enough to eliminate duplication — for instance, using separate tables for actor, film, and their many-to-many bridge film_actor — but not so fragmented that common queries become painful. Lookup values like category, language, and store are stored separately, while large entities such as films or customers keep their core attributes together.

A healthy schema usually follows third normal form (3NF), but also knows when to remain pragmatic. If a value isn’t likely to change and splitting it out would add unnecessary complexity, keeping it in the main table may be the better choice. Likewise, if you find yourself repeating the same data across many rows, it’s often a sign that normalization is needed.

The key is to design for clarity, consistency, and maintainability. Normalize to avoid duplication and anomalies, but don’t go so far that every query requires a dozen joins to answer a simple question. The goal is a schema that supports both performance and ease of use.

10. Running DML Inside Loops

A very common performance mistake is executing INSERT, UPDATE, or DELETE statements one row at a time inside application loops. Although this approach feels natural in most programming languages, it is extremely inefficient for databases. Each statement sent to PostgreSQL carries network overhead, transaction overhead, planning overhead, and disk activity. When repeated thousands of times, performance degrades quickly.

Instead of sending row-by-row operations, you should batch your writes so the database can process multiple records in a single command. PostgreSQL is highly optimized for set-based operations, and grouping statements together reduces round trips between the application and the database dramatically.

-- Bad: executed one row at a time inside an application loop
INSERT INTO payment(customer_id, staff_id, rental_id, amount, payment_date)
VALUES (1, 1, 76, 4.99, NOW());

INSERT INTO payment(customer_id, staff_id, rental_id, amount, payment_date)
VALUES (2, 1, 77, 2.99, NOW());

-- … repeated thousands of times

Each call above is a separate transaction, even if you implicitly rely on autocommit. PostgreSQL must parse, plan, log, and commit every individual statement, which leads to unnecessary load and can slow down your entire application.

A better approach is to combine multiple rows into a single VALUES clause. This reduces logging, minimizes network overhead, and allows PostgreSQL to treat the operation as a single optimized batch.

-- Better approach: a single batched write
INSERT INTO payment(customer_id, staff_id, rental_id, amount, payment_date)
VALUES
(1, 1, 76, 4.99, NOW()),
(2, 1, 77, 2.99, NOW());

For even larger imports, you can go beyond multi-row inserts and use COPY (or \copy from the psql client). The COPY command is the fastest way to bulk-load data into PostgreSQL and is designed for high-throughput ingestion.

-- Fastest bulk-load option
COPY payment(customer_id, staff_id, rental_id, amount, payment_date)
FROM '/path/to/data.csv'
CSV HEADER;

Whenever you find yourself writing a loop that calls the database repeatedly, stop and ask whether the operation can be expressed as a single set-based statement. In PostgreSQL, batching isn’t just cleaner — it delivers massive performance improvements as your data volume grows.

Conclusion

SQL mistakes are inevitable when you’re learning or working quickly, but most of them are easy to avoid once you understand how PostgreSQL thinks about data, joins, indexing, and query structure. By being deliberate with your column selections, writing index-friendly filters, avoiding accidental Cartesian products, and understanding concepts like NULL handling and normalization, you can dramatically improve both the clarity and performance of your queries.

The Pagila sample database is an excellent environment for practicing these skills in a realistic schema. As you experiment and refine your approach, you’ll develop habits that lead to cleaner, faster, and more maintainable SQL — whether you’re building analytics queries, application code, or database-backed services.

The more intentional you are with your SQL, the more PostgreSQL rewards you with predictable, scalable performance. Master the fundamentals, avoid the common pitfalls, and your database work will become both easier and far more effective!