What I learned from Algorithms
Have you ever wondered how exactly a text editor performs searches for those words or phrases you type into the “Find” dialog box?
Neither did I, until the wonderfully strange math/CS crossover class Design and Analysis of Algorithms came into my life. This class taught me entirely too much about text searching algorithms, along with a good chunk of other stuff about binary trees and time-memory tradeoffs. (I really wanted to link to the course page, but it’s regrettably without a nice Web frontend due to being a summer course, so here’s a link to the fall 2009 course page instead.)
Anyway, this text-searching thing. Turns out there are three primary algorithms used when looking for a piece of text in a larger body:
- The brute-force method, which lines up the search word or phrase (the “needle”) against the beginning of the body of text (the “haystack”) and starts checking letters left to right, stopping and shifting over one letter when there’s a mismatch. As with most brute-force algorithms, this tends to be very slow.
- Horspool’s algorithm, which lines up the needle at the beginning of the haystack like the brute-force method, but checks letters right to left in the needle and jumps several characters at once when a mismatch is found.
- The Boyer-Moore algorithm, which lines up the needle at the left of the haystack, checks letters right-to-left, but uses additional information about patterns existing in the needle to jump characters more efficiently on a mismatch.
One homework assignment for Algorithms was to implement Boyer-Moore in a Web-friendly platform or language, so that the final demo could be easily run in a Web browser. Not being content with just doing Boyer-Moore (and not being comfortable enough with Javascript to jump straight into the most complicated algorithm), I instead chose to implement all three.
Readers should note that I’m very close to actively ashamed of that code (available through a simple View Source, since it’s pure HTML/JS/jQuery). I chose to model the algorithm actions through a state machine, and didn’t bother to abstract out the details, so there’s a ton of duplicate code scattered throughout each algorithm, as well as a complete and utter lack of separation between controller code (the states themselves) and view code (the various jQuery animations, etc.).
I’ve managed to convince myself I’m content with it, though, since I’m done with the class and will likely forget about that project entirely for years on end. If anyone feels the desperate urge to clean it up, however, anonymous checkout is available through Mercurial at http://pneumaticsystem.com/hg/textsearch.