UNIT 14
Prerequisite: Unit 05 — Linked Lists · 75 min read
Every data structure we have built so far searches by comparison — check this element; is it the one? If not, try the next. Binary search cuts this to O(log n). But hashing does something completely different: it calculates exactly where the data should be using arithmetic, then goes there directly. No comparison loop at all. The result is O(1) average lookup — constant time no matter how many items you have stored.
In this unit we build a hash table from scratch — hash functions, collision handling with chaining, open addressing, and load factor. By the end you will understand exactly how Python dictionaries, JavaScript objects, and database indexes work under the hood.
// Section 1
The Core Idea — A Locker Room
Imagine a gym locker room with 100 numbered lockers (0 to 99). You want to store your belongings. Instead of searching every locker, you use a rule: take your membership number, divide by 100, take the remainder — that is your locker. Membership 347 → locker 47. Membership 892 → locker 92. You always know exactly which locker is yours. No searching. No comparing. Just arithmetic.
🔑 From locker room to hash table
| Locker room (100 lockers) | Hash table (array of buckets) |
|---|---|
| Membership number | Key (what you want to store/find) |
| Your locker number | Index in the array (hash value) |
| The rule: memberNo % 100 | Hash function: maps key → index |
| Your belongings in the locker | Value stored at that index |
💡 Note
The key insight: A hash function converts a key (any value — a number, a string, an email address) into an array index. Given the same key, the same function always produces the same index. So lookup is just: compute the index, go there directly. No scanning. O(1).
// Section 2
Hash Functions — The Formula That Maps Keys to Indices
A hash function takes a key and returns an index between 0 and SIZE-1. A good hash function distributes keys evenly across all buckets — if keys pile up in a few buckets, performance degrades. Here are the most common approaches.
#include <stdio.h>

#define SIZE 10 /* number of buckets in our hash table */
/* Method 1: Division method — key mod table size */
int hashInt(int key) {
return key % SIZE;
/* key=23 → 23%10 = 3
key=47 → 47%10 = 7
key=100 → 100%10 = 0 */
}
/* Method 2: Hash a string — sum of ASCII values mod SIZE */
int hashString(char *key) {
int hash = 0;
int i;
for (i = 0; key[i] != '\0'; i++) {
hash += key[i]; /* add ASCII value of each character */
}
return hash % SIZE;
/* "cat" → 99+97+116 = 312 → 312%10 = 2 */
}
/* Method 3: Better string hash (djb2 — widely used in real systems) */
int hashStringDJB2(char *key) {
unsigned long hash = 5381; /* magic seed value */
int c;
while ((c = *key++)) {
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
}
return (int)(hash % SIZE);
}
int main() {
printf("hash(23) = %d\n", hashInt(23)); /* 3 */
printf("hash(47) = %d\n", hashInt(47)); /* 7 */
printf("hash(cat) = %d\n", hashString("cat")); /* 2 */
printf("hash(dog) = %d\n", hashString("dog")); /* 100+111+103 = 314 → 4 */
return 0;
}
Properties of a good hash function:
⚡Fast to compute — O(1) or O(length of key)
📊Uniform distribution — keys spread evenly, no bucket gets overloaded
🔄Deterministic — same key always produces the same index
📦Output always in range [0, SIZE-1] — use modulo to guarantee this
// Section 3
Collisions — Two Keys, Same Bucket
A collision happens when two different keys hash to the same index. With a table of size 10 and keys like 13 and 23, both give index 3. Collisions are inevitable — by the pigeonhole principle, if you store more items than buckets there must be at least one collision. The question is not how to avoid them — it is how to handle them gracefully.
// hash(13) = 13%10 = 3 · hash(23) = 23%10 = 3 → COLLISION at index 3
Both map to index 3 — where does 23 go?
There are two main strategies to handle collisions. We will build both from scratch.
Chaining
Each bucket holds a linked list. Multiple keys at the same index form a chain. Fast and simple — never runs out of space.
Open Addressing
All data stays in the array itself. On collision, probe the next available slot. More cache-friendly, trickier to implement.
// Section 4
Collision Handling — Chaining
With chaining, each bucket in the array is the head of a linked list. When two keys collide, both live in the same bucket — one after the other in the chain. To find a key, compute its hash (go to the bucket), then walk the chain until you find it.
// Hash table with chaining after inserting 5, 15, 25, 7, 17, 3
5%10=5, 15%10=5, 25%10=5 → all chain at bucket 5
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define SIZE 10
/* each node in the chain stores a key-value pair */
struct Node {
int key;
int value;
struct Node *next;
};
typedef struct Node Node;
/* hash table: array of linked list heads */
Node *table[SIZE];
void initTable() {
int i;
for (i = 0; i < SIZE; i++) table[i] = NULL;
}
int hash(int key) { return key % SIZE; }
/* insert key-value pair */
void insert(int key, int value) {
int idx = hash(key);
/* check if key already exists — update it */
Node *current = table[idx];
while (current != NULL) {
if (current->key == key) {
current->value = value; /* update existing */
return;
}
current = current->next;
}
/* key not found — insert new node at front of chain */
Node *newNode = (Node*)malloc(sizeof(Node));
newNode->key = key;
newNode->value = value;
newNode->next = table[idx]; /* point to existing chain */
table[idx] = newNode; /* new node becomes head */
}
/* search: returns value or -1 if not found */
int search(int key) {
int idx = hash(key);
Node *current = table[idx];
while (current != NULL) {
if (current->key == key) return current->value;
current = current->next;
}
return -1; /* not found */
}
/* delete a key */
void delete(int key) {
int idx = hash(key);
Node *current = table[idx];
Node *prev = NULL;
while (current != NULL) {
if (current->key == key) {
if (prev == NULL) {
table[idx] = current->next; /* removing head */
} else {
prev->next = current->next; /* bypass this node */
}
free(current);
return;
}
prev = current;
current = current->next;
}
}
void printTable() {
int i;
for (i = 0; i < SIZE; i++) {
printf("[%d]: ", i);
Node *current = table[i];
while (current != NULL) {
printf("(%d→%d) ", current->key, current->value);
current = current->next;
}
printf("\n");
}
}
int main() {
initTable();
insert(5, 50);
insert(15, 150);
insert(25, 250); /* all three collide at bucket 5 */
insert(7, 70);
insert(3, 30);
printTable();
printf("Search 15: %d\n", search(15)); /* 150 */
printf("Search 99: %d\n", search(99)); /* -1 */
delete(15);
printf("After delete 15, search: %d\n", search(15)); /* -1 */
return 0;
}
Average Case: O(1)
Worst Case: O(n) (all keys in one bucket)
// Section 5
Collision Handling — Open Addressing
Open addressing keeps everything in the array itself — no linked lists, no extra memory. When a collision occurs, it probes for the next empty slot. There are three probing strategies.
Linear Probing
index = (hash(key) + i) % SIZE
Try the next slot, then the one after, then the one after that. Simple, but causes "clustering" — occupied slots clump together and slow down future searches.
Quadratic Probing
index = (hash(key) + i²) % SIZE
Jump by increasing squares: +1, +4, +9, +16... Reduces primary clustering but can cause secondary clustering.
Double Hashing
index = (hash1(key) + i × hash2(key)) % SIZE
Use a second hash function to determine the step size. Best distribution, virtually eliminates clustering. Most complex to implement.
// Linear probing — inserting 5, 15, 25 into table of size 10
Insert 5: hash=5, slot 5 empty → place here
Insert 15: hash=5, slot 5 taken → probe slot 6, empty → place here
Insert 25: hash=5, slot 5 taken → slot 6 taken → slot 7 empty → place here
#include <stdio.h>
#define SIZE 10
#define EMPTY -1
#define DELETED -2 /* tombstone — marks a deleted slot */
int keys[SIZE];
int values[SIZE];
void initTableOA() {
int i;
for (i = 0; i < SIZE; i++) {
keys[i] = EMPTY;
values[i] = EMPTY;
}
}
int hash(int key) { return key % SIZE; }
void insertOA(int key, int value) {
int idx = hash(key);
int i;
for (i = 0; i < SIZE; i++) {
int probe = (idx + i) % SIZE; /* linear probing */
/* empty or deleted slot — insert here */
if (keys[probe] == EMPTY || keys[probe] == DELETED) {
keys[probe] = key;
values[probe] = value;
return;
}
/* key already exists — update */
if (keys[probe] == key) {
values[probe] = value;
return;
}
}
printf("Table is full\n");
}
int searchOA(int key) {
int idx = hash(key);
int i;
for (i = 0; i < SIZE; i++) {
int probe = (idx + i) % SIZE;
if (keys[probe] == EMPTY) return -1; /* empty slot = not in table */
if (keys[probe] == key) return values[probe]; /* found */
/* DELETED slot — keep probing past it */
}
return -1;
}
void deleteOA(int key) {
int idx = hash(key);
int i;
for (i = 0; i < SIZE; i++) {
int probe = (idx + i) % SIZE;
if (keys[probe] == EMPTY) return; /* not found */
if (keys[probe] == key) {
keys[probe] = DELETED; /* tombstone — do NOT set to EMPTY */
return; /* EMPTY would break chains for other keys */
}
}
}
int main() {
initTableOA();
insertOA(5, 50);
insertOA(15, 150); /* collides at 5 → goes to 6 */
insertOA(25, 250); /* collides at 5,6 → goes to 7 */
insertOA(7, 70); /* collides at 7 → goes to 8 */
printf("Search 15: %d\n", searchOA(15)); /* 150 */
printf("Search 25: %d\n", searchOA(25)); /* 250 */
deleteOA(15);
printf("After delete 15: %d\n", searchOA(15)); /* -1 */
printf("Search 25 still works: %d\n", searchOA(25)); /* 250 — tombstone preserved chain */
return 0;
}
⚠️ Important
The tombstone trick for deletion: In open addressing, you must never set a deleted slot back to EMPTY. If you do, future searches for keys that probed past this slot will stop early and incorrectly report "not found." Instead mark it as DELETED (a tombstone) — searches skip over it, new inserts can reuse it.
// Section 6
Load Factor — When to Resize
The load factor (λ) is the ratio of stored elements to the number of buckets:
λ = number of elements / number of buckets
10 elements in a table of 20 buckets → λ = 0.5
| Load factor | Status | Behaviour |
|---|---|---|
| λ < 0.5 | Ideal | Very few collisions. O(1) average lookup. Table has plenty of room. |
| 0.5 ≤ λ ≤ 0.75 | Acceptable | Some collisions but still fast. Most production systems target this range. |
| λ > 0.75 | Degrading | Collision chains grow. Performance approaches O(n). Time to resize. |
| λ ≥ 1.0 | Critical | Open addressing breaks completely — no empty slots to probe. Chaining still works, but slowly. |
When the load factor exceeds a threshold (typically 0.75), the table is rehashed — a new larger array is created (usually double the size), and every existing key is re-inserted using the new hash function (new SIZE).
/* conceptual rehash — double the table size when load factor > 0.75.
   Assumes the bucket array was heap-allocated; relinks the existing
   nodes instead of copying them, so nothing leaks. */
Node **rehash(Node **oldTable, int oldSize) {
    int newSize = oldSize * 2; /* double the size */
    Node **newTable = calloc(newSize, sizeof(Node*));
    int i;
    for (i = 0; i < oldSize; i++) {
        Node *current = oldTable[i];
        while (current != NULL) {
            Node *next = current->next; /* save before relinking */
            int newIdx = current->key % newSize; /* NEW hash with new size */
            current->next = newTable[newIdx]; /* move node into new chain */
            newTable[newIdx] = current;
            current = next;
        }
    }
    free(oldTable); /* nodes were moved, so only the old bucket array is freed */
    return newTable;
}
💡 Note
Real-world load factors: Java's HashMap rehashes at λ = 0.75 (the default). Python's dict rehashes at λ = 0.67. Redis uses incremental rehashing — it spreads the cost over many operations instead of one big pause.
// Section 7
Complete Hash Table — String Keys
Let us build a complete, working hash table that stores string key → integer value pairs. This is the same idea behind Python's dictionary and JavaScript's object — their internals differ (CPython actually uses open addressing rather than chaining), but the key → bucket mapping works the same way.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define TABLE_SIZE 16
struct Entry {
char *key;
int value;
struct Entry *next;
};
typedef struct Entry Entry;
Entry *hashTable[TABLE_SIZE];
/* djb2 hash for strings */
int hashStr(char *key) {
unsigned long h = 5381;
int c;
while ((c = *key++)) h = ((h << 5) + h) + c;
return (int)(h % TABLE_SIZE);
}
void htInsert(char *key, int value) {
int idx = hashStr(key);
Entry *current = hashTable[idx];
/* update if key already exists */
while (current) {
if (strcmp(current->key, key) == 0) {
current->value = value;
return;
}
current = current->next;
}
/* create new entry */
Entry *e = malloc(sizeof(Entry));
e->key = strdup(key); /* copy the key string */
e->value = value;
e->next = hashTable[idx];
hashTable[idx] = e;
}
int htSearch(char *key, int *found) {
int idx = hashStr(key);
Entry *current = hashTable[idx];
while (current) {
if (strcmp(current->key, key) == 0) {
*found = 1;
return current->value;
}
current = current->next;
}
*found = 0;
return -1;
}
void htDelete(char *key) {
int idx = hashStr(key);
Entry *current = hashTable[idx];
Entry *prev = NULL;
while (current) {
if (strcmp(current->key, key) == 0) {
if (prev) prev->next = current->next;
else hashTable[idx] = current->next;
free(current->key);
free(current);
return;
}
prev = current;
current = current->next;
}
}
int main() {
/* initialise all buckets to NULL */
int i;
for (i = 0; i < TABLE_SIZE; i++) hashTable[i] = NULL;
/* store student marks */
htInsert("Aisha", 92);
htInsert("Rohan", 85);
htInsert("Priya", 78);
htInsert("Kiran", 95);
htInsert("Dev", 88);
int found;
printf("Aisha: %d\n", htSearch("Aisha", &found)); /* 92 */
printf("Priya: %d\n", htSearch("Priya", &found)); /* 78 */
printf("Kiran: %d\n", htSearch("Kiran", &found)); /* 95 */
/* update */
htInsert("Rohan", 90);
printf("Rohan updated: %d\n", htSearch("Rohan", &found)); /* 90 */
/* delete */
htDelete("Dev");
htSearch("Dev", &found);
printf("Dev found: %s\n", found ? "yes" : "no"); /* no */
return 0;
}
// Section 8
Hashing in the Real World
🐍Python dict and set
Every Python dictionary is a hash table. d["name"] = "Asil" uses a hash function on "name" to find the bucket. Average O(1) for get, set, and delete. Python uses open addressing with pseudo-random probing.
🔐Password storage
Websites never store your actual password. They store hash("yourpassword" + salt). When you log in, they hash your input and compare. Even if the database leaks, passwords cannot be reversed — hash functions are one-way.
🗄️Database indexes
Hash indexes in PostgreSQL and MySQL give O(1) exact lookups on equality queries (WHERE id = 5). They are faster than B-tree indexes for equality but cannot do range queries (BETWEEN).
⚡Caching (Redis, Memcached)
Redis is essentially a distributed hash table. SET key value and GET key are hash table insert and lookup. Used by Twitter, GitHub, and Snapchat to serve millions of requests per second.
🧬Compiler symbol tables
When a compiler sees a variable name, it looks it up in a symbol table — a hash table mapping names to types and memory locations. Every function call, every variable access goes through this table.
// Section 9
Chaining vs Open Addressing — When to Use Which
| Aspect | Chaining | Open Addressing |
|---|---|---|
| Memory | Extra memory for pointers | No extra memory — all in array |
| Load factor | Works fine above 1.0 | Must stay below 1.0 (no empty slots) |
| Cache performance | Poor — linked list nodes scattered | Excellent — everything in one array |
| Deletion | Simple — remove from chain | Needs tombstones — tricky |
| Implementation | Easier to implement correctly | Trickier — probing + tombstones |
| Best for | Unknown or high load, large values | Known load under 0.7, small values |
// Errors You Will Hit
Common Hashing Mistakes
⚠ Using % SIZE with a non-prime table size
Symptom: Keys cluster into a small number of buckets — poor distribution
Fix: Use a prime number for table size (7, 13, 31, 61...). Prime sizes reduce patterns in hash(key) % SIZE. Many production tables use prime or power-of-2 sizes with special hash functions.
⚠ Setting deleted slot to EMPTY in open addressing
Symptom: Searches incorrectly return "not found" for keys that probed past the deleted slot
Fix: Always use a DELETED tombstone constant, never EMPTY, when deleting in open addressing. Searches skip over DELETED but stop at EMPTY.
⚠ Not handling negative hash values
Symptom: Negative array index — program crashes or reads garbage
Fix: key % SIZE can be negative in C if key is negative. Use abs(key) % SIZE or cast: (unsigned)key % SIZE to guarantee a non-negative index.
// What's Next
You Are Ready for Unit 15
You now understand hashing completely — hash functions, collision handling with both chaining and open addressing, load factor, rehashing, and why Python dicts, Redis, and database indexes all use this technique. O(1) average lookup is one of the most powerful tools in all of computing.
In Unit 15 we cover Graphs — the most general and most powerful data structure in DSA. Nodes connected by edges in any direction. Maps, social networks, flight routes, dependency graphs — every complex relationship is a graph. We build BFS, DFS, Dijkstra's shortest path, and topological sort from scratch.
UP NEXT → UNIT 15
Graphs — The Most Powerful Data Structure
BFS, DFS, Dijkstra's, topological sort — adjacency list in C.
Coming Soon →

🎯 Key Takeaways
- ✓Hashing converts a key into an array index using a hash function — enables O(1) average lookup
- ✓Hash function: key % SIZE for integers, sum of ASCII values % SIZE for strings — must be fast and uniform
- ✓Collisions are inevitable — two keys producing the same index must be handled
- ✓Chaining: each bucket is a linked list — simple, works above load factor 1.0, poor cache performance
- ✓Open addressing: probe for next empty slot — linear probing, quadratic, double hashing
- ✓Tombstone deletion: in open addressing, mark deleted slots as DELETED not EMPTY — or searches break
- ✓Load factor λ = elements / buckets — keep below 0.75 for good performance, rehash when exceeded
- ✓Rehashing: create new larger array (double size), re-insert all keys with new hash
- ✓Python dict, JavaScript object, Java HashMap, Redis — all hash tables under the hood
- ✓Hash tables do NOT support range queries or sorted output — use BST/B-tree for those needs