Sida Liu

A Learner in the Complex World.


Why Indexing Makes Search Faster: A First-Principle Explanation

Imagine you have an array of 1,000,000 elements and need to find a specific value. Without an index, the only way to locate the value is to scan each element one by one. In the worst case, this requires checking all 1,000,000 elements, which is very slow.

The Core Bottleneck: Searching Without Structure

The fundamental problem is that without structure, there is no shortcut—you must check each element individually to be sure you’ve found the right one. This is why a simple search without an index is painfully slow when dealing with large datasets.

How an Index Speeds Up Search

An index is a separate data structure that organizes the array so searches can be much faster. Two common types:

  1. Sorted Index (Binary Search – ~20 steps instead of 1,000,000)
    • Suppose we maintain a sorted version of the array (or at least a sorted list of positions).
    • Instead of scanning every element, we can use binary search, where each step halves the search space.
    • This means instead of checking 1,000,000 elements, we only need about 20 steps to find the value.
  2. Hash Index (Near-Instant Lookup in One Step)
    • A hash table stores values as keys and their positions as values.
    • This allows direct lookup in one step instead of searching at all.
    • However, hash tables require extra memory and might not work well in every scenario.

Why This Matters Even More in Joins

If searching one array is slow, imagine what happens when we join two large arrays (or tables in a database).

  • If each has 1,000,000 elements, a naive join may require 1,000,000 × 1,000,000 = 1 trillion comparisons!
  • With indexes, each lookup is much faster, making joins feasible instead of impossible.

Tradeoffs and Conclusion

Indexes make searches faster but require extra storage and need updates when data changes. Despite this, for large datasets, they are essential—without an index, searching 1,000,000 items could take seconds instead of milliseconds. When dealing with joins, indexing can be the difference between a query running in seconds or never finishing at all.



Leave a comment