Notifications

No notifications

Hash Tables (Hash Maps)

A hash table maps keys to values using a hash function, providing O(1) average lookup, insertion, and deletion.

How It Works

1. Hash Function: Converts key → integer index 2. Index: hash(key) % arraySize 3. Store: Place key-value pair at that index

Collision Handling

MethodHowProsCons
ChainingLinked list at each bucketSimple, no limitExtra memory
Open AddressingProbe for next empty slotCache-friendlyClustering
Linear ProbingCheck next slotSimplePrimary clustering
Quadratic ProbingCheck i² awayLess clusteringSecondary clustering

Time Complexity

OperationAverageWorst
InsertO(1)O(n)
SearchO(1)O(n)
DeleteO(1)O(n)

Load Factor

α = n/m (items/buckets). When α > 0.75, resize the table (double and rehash).

On this page

Detailed Theory

Hash tables are the single most useful data structure in everyday programming. Every \Map\, \dict\, \HashMap\, \unordered_map\, JS object property, Redis key, environment variable lookup — they're all hash tables under the hood. The whole idea: turn a key into an array index in O(1), so lookup is *also* O(1). When you internalise this, half of "how do I make this faster?" answers become "use a hash map".

What a Hash Table Actually Is

A hash table is an array of buckets + a hash function that turns any key into an index into that array.

` key "name" ──hash──▶ 417 ──% 16──▶ bucket 1 ──▶ ["name", "Ana"] `

Three steps every time:

1. Hash the key to a (large) integer. 2. Compress that integer to a valid index: \hash % tableSize\. 3. Store (or read) at \array[index]\.

Because step 3 is a constant-time array access, the *whole* operation is O(1) — *as long as* most keys land in different buckets.

Daily Mental Model

Think of a hash table as a coat-check stand at a theatre:

  • The hash function is the algorithm that converts your ticket number into a hook number.
  • The buckets are the hooks.
  • A collision is two coats hashed to the same hook — you stack them and have to look through them when claimed.
Good hash function = even distribution = each hook holds 1 coat = O(1) retrieval. Bad hash function = everyone on hook #4 = O(n) search of one giant pile.

Beginner Mistakes to Skip

1. Using a non-hashable / mutable key. In Python, a list can't be a dict key. In JS, objects-as-keys with a plain \Object\ use \toString()\ (so \{a:1}\ and \{a:2}\ both become \"[object Object]"\). Use \Map\ for object keys. 2. Mutating an object after using it as a key. The hash now no longer matches where it was stored — the entry becomes "lost". 3. Iteration order assumptions. Order varies by language: Python 3.7+ dicts and JS \Map\ preserve insertion order; older Java HashMap doesn't; Go *deliberately randomises* it. 4. Counting on O(1). Hash tables are O(1) on average. Worst case is O(n) when everything collides. For adversarial inputs, this matters (HashDoS attacks). 5. \if (map[key])\ to check existence. Falsy values (0, "", false, null) make this break. Use \map.has(key)\ or \key in map\. 6. Resizing during a hot loop. First insert into a tiny map → many rehashes. If you know the size, pre-size: \new Map(entries)\, \dict.fromkeys()\, \new HashMap<>(expectedSize)\.

Intermediate: Hash Functions

A "good" hash function is:

  • Deterministic — same key, same hash, every time.
  • Uniform — keys spread evenly across buckets (no patterns).
  • Fast — usually O(1) for primitives, O(m) for strings of length m.
  • Avalanche — flipping one input bit changes ~half the output bits.
Common families:

FunctionFormula / IdeaUsed for
Division\h(k) = k % m\ (m prime)quick integer hash
Multiplication\floor(m * (k * A % 1))\, A ≈ 0.618integer hash, less sensitive to m
Polynomial rolling\Σ s[i] * pⁱ % m\string hash (Rabin-Karp)
FNV-1aXOR each byte then × primegeneral purpose, simple
MurmurHash / xxHashbit mixinghigh-performance lookups (Java, Redis)
SipHashkeyed, cryptographically randomiseddefends against HashDoS (Python, Rust, Ruby use this)

Why prime table sizes? \% prime\ washes out periodic patterns in keys. Java uses powers of two and re-hashes the key with bit-mixing instead.

Intermediate: Collision Resolution

Two keys hash to the same bucket. You have to handle it.

Chaining (separate chaining): each bucket is a linked list (or, in modern Java, a red-black tree once the chain exceeds 8). Insert appends; lookup walks the chain.

` bucket 4 ─▶ ["dog",1] ─▶ ["god",2] ─▶ ["odg",3] `

  • Pros: simple, handles α > 1, removal is trivial.
  • Cons: each node = pointer chase = poor cache behaviour.
Open addressing: keep everything in the array. On collision, probe for the next slot.

ProbeSequenceProsCons
Lineari, i+1, i+2, …cache-friendlyprimary clustering
Quadratici, i+1², i+2², …breaks runssecondary clustering
Double hashingi + j * h₂(k)nearly uniformtwo hashes per probe

  • Pros: no pointers, excellent cache locality.
  • Cons: deletion needs tombstones (special "deleted" marker) so probing doesn't stop early. Performance collapses near α = 1.
Python's \dict\ and Rust's \HashMap\ use open addressing. Java's \HashMap\ uses chaining (with treeification).

Intermediate: Load Factor and Resizing

\α = n / m\ (entries / buckets). It controls collision frequency.

αBehaviour
0.25very fast, lots of empty space
0.5–0.75sweet spot — typical default
> 0.75collisions climb fast
≥ 1 (open addressing)impossible — must resize first

When α crosses the threshold, resize:

1. Allocate a new array (typically the size — Java HashMap, JS Map). 2. Rehash every entry into the new array (indices change because \m\ changed). 3. Replace the old array.

A single resize is O(n), but it happens rarely enough that amortised cost per insert stays O(1).

Caveat: during resize, the entire structure can be unavailable — that's why high-throughput systems pre-size or use *incremental* rehashing (Redis splits the rehash across many ops).

Advanced: Hash Map vs Hash Set, and When NOT to Use Either

A hash set is just a hash map with no value (or a dummy). Use it for membership checks and deduplication.

When not to use a hash table:

  • You need ordered traversal → use a balanced BST (\TreeMap\, \std::map\, \SortedDict\).
  • You need range queries ("all keys between 100 and 200") → BST or sorted array.
  • You need worst-case O(log n) guarantees (real-time systems) → BST.
  • The data is tiny (n < ~16) → a flat array scan is faster *and* uses less memory.
  • You're sorting numbers in a known small range → counting sort beats hashing.

Advanced: The Interview Patterns You'll See Over and Over

  • Frequency counting — \map[x] = (map[x] || 0) + 1\. Most-frequent element, anagrams, top-K, majority element.
  • Two-Sum pattern — for each \x\, ask if \target - x\ is in the map. O(n) instead of O(n²).
  • Prefix-sum + hash map — "subarrays that sum to K" = count how many times each prefix sum has been seen.
  • Group-by hash — group anagrams by sorted-string key, group rectangles by (width × height) key.
  • Two maps for bijection — isomorphic strings, word pattern matching.
  • Hash map + doubly linked list = LRU cache. O(1) get and put.
If you can recognise these five patterns, hash maps will solve a *huge* fraction of medium-LeetCode problems.

Advanced: Real Implementations Compared

LanguageMapSetNotes
Java\HashMap\\HashSet\chaining + treeify ≥ 8, default α 0.75
Python\dict\\set\open addressing with perturbation, insertion-ordered since 3.7
JavaScript\Map\ / object\Set\\Map\ allows any key type, insertion-ordered, has \.size\
C++\unordered_map\\unordered_set\chaining; \.reserve(n)\ to pre-size
Rust\HashMap\\HashSet\SipHash by default (DoS-resistant)
Go\map[K]V\use \map[K]struct{}\iteration order is deliberately randomised

JS-specific gotcha: \Map\ ≠ plain object. \{}\ keys are coerced to strings, has prototype keys (\__proto__\, \hasOwnProperty\) you can accidentally collide with. Use \new Map()\ for general data.

Advanced: Consistent Hashing (the Distributed Cache Trick)

Naive sharding: \server = hash(key) % numServers\. Add or remove one server → \numServers\ changes → *every* key remaps. Cache invalidates entirely.

Consistent hashing: arrange servers on a ring at positions \hash(serverId)\. A key lives on the next server clockwise from \hash(key)\. Adding/removing a server only remaps ~1/n of keys.

Virtual nodes: place each server at *many* random ring positions to even out load. This is what Memcached, Cassandra, DynamoDB, and Redis Cluster all do.

Advanced: Worst-Case Bounds and HashDoS

If an attacker can predict your hash function, they can craft inputs that all collide → all of your "O(1)" operations become O(n) → server falls over.

Defences:

  • Randomised hash function seeded per process (SipHash in Python/Rust/Ruby).
  • Treeify long chains (Java HashMap turns chains ≥ 8 into red-black trees → O(log n) worst case).
  • Cap input size before hashing.

Practice Path

1. Implement a tiny hash map with chaining: array of 16 buckets, polynomial hash for strings, \set\ / \get\ / \delete\. Then add resizing when α > 0.75. 2. Solve Two Sum in O(n) using a hash map. Then solve Subarray Sum Equals K with prefix-sum + hash map. 3. Implement an LRU Cache with a hash map + doubly linked list. Verify both \get\ and \put\ are O(1). 4. Build the Group Anagrams solution with a hash map keyed on the sorted version of each string. Then re-do it with a 26-element character-count tuple as the key — confirm the same answer in O(n · m) instead of O(n · m log m).