Welcome back! In the previous chapter, Cryptography & Hashing, we learned how to turn data into a unique fingerprint (a Hash) to keep it secure or organize it.
Now, we are going to use that same "Fingerprint" concept for a totally different purpose: Finding things.
String Algorithms are the magic behind the "Search" bar. Whether you are using Ctrl+F in a Word document, searching for a DNA sequence in a genome, or Googling a phrase, you are using string algorithms.
Imagine you have a text file containing the entire series of Harry Potter (about 1,000,000 words). You want to find the exact location of the phrase "Golden Snitch".
The Naive Approach (The Slow Way):
This is very slow. We need a way to scan the text without constantly backtracking and re-reading letters we just looked at.
Remember in the last chapter how we turned "Hello" into a number? Rabin-Karp uses that to speed up searching.
The Logic: Instead of comparing letter-by-letter (which is slow), let's compare Hashes (which is fast).
500.500?Recalculating the hash for every set of letters would be slow. Rabin-Karp uses a Rolling Hash.
Here is how we update the hash as we slide across the text.
// Simplified from strings/rabin_karp.cpp
// str_hash is the current value. We want to slide the window.
// 'old_char' is leaving, 'new_char' is entering.
int64_t recalculate_hash(char old_char, char new_char, int64_t old_hash) {
// 1. Remove the leading character
int64_t temp = old_hash - old_char;
// 2. Shift values (like dividing by 10 to remove a digit)
temp /= PRIME;
// 3. Add the new character with its weight
temp += new_char * pow(PRIME, pattern_length - 1);
return temp;
}
Rabin-Karp is great, but there is an even smarter way that doesn't use math, just pure logic. It is called KMP.
The Logic: If you partially match a word and then fail, don't start over completely. Use what you already matched to skip ahead.
Example:
To do this, KMP pre-calculates a "Cheat Sheet" (called a Failure Array or LPS). It tells the computer: "If you fail at index 5, you can safely jump back to index 3 without checking."
This is how KMP uses the cheat sheet to skip re-checking characters.
// Simplified from strings/knuth_morris_pratt.cpp
// k is our position in the Pattern
// j is our position in the Text
while (k != npos && pattern[k] != text[j]) {
// MISMATCH!
// Don't go back to 0. Look at the cheat sheet (failure array).
// It tells us the best previous spot to resume from.
k = failure[k];
}
// If characters match, move both forward
if (++k == pattern_length) {
return "Found it!";
}
Let's visualize how the Rabin-Karp algorithm "slides" over the text. It behaves like a conveyor belt.
How do we assign a "score" to a word? We treat letters like numbers.
Just like 123 = 1*100 + 2*10 + 3*1, we do:
Hash("ABC") = A * (Prime^0) + B * (Prime^1) + C * (Prime^2)
In the file strings/rabin_karp.cpp, this is implemented here:
// From strings/rabin_karp.cpp
int64_t create_hash(const std::string& s, int n) {
int64_t result = 0;
for (int i = 0; i < n; ++i) {
// Multiply char code by Prime Number power
result += (int64_t)(s[i] * (int64_t)pow(PRIME, i));
}
return result;
}
The hardest part of KMP is building the "Cheat Sheet" (Failure Array). Let's look at strings/knuth_morris_pratt.cpp.
This function looks at the pattern before scanning the text. It looks for prefixes that are also suffixes.
// From strings/knuth_morris_pratt.cpp
std::vector<size_t> getFailureArray(const std::string &pattern) {
// ... setup ...
for (int i = 0; i < pattern_length; i++) {
// If the current char doesn't continue the prefix chain, fall back
while (j != std::string::npos && pattern[j] != pattern[i]) {
j = failure[j];
}
// If it matches, our prefix chain is longer!
failure[i + 1] = ++j;
}
return failure;
}
String algorithms turn slow, repetitive reading into fast, efficient scanning.
These algorithms allow computers to read, search, and process massive amounts of text in milliseconds. But what if we want the computer to understand what it's reading, or make predictions based on data it has never seen before?
Next Chapter: Machine Learning
Generated by Code IQ