Definition and Glossary
Strings are everywhere. Text information always requires string process. A String is a sequence of characters.
String Sort
- LSD
key-indexed counting, from right to left - MSD
recursive method, left to right - Three-way string quicksort
adapt quicksort to MSD string sorting by using 3-way partitioning on the leading character of the keys, moving to the next character on only the middle subarray
Tries
A search tree, composed of nodes that contain links that are either null or references to other nodes.
class Node:
def __init__(self, value):
self.value = value
self.children = []
self.count = 0
class Trie:
def __init__(self):
self.root = Node(None)
def add(self, value):
current_node = self.root
for char in value:
in_children = False
for n in current_node.children:
if n.value == char:
in_children = True
current_node = n
break
if not in_children:
new_node = Node(char)
current_node.children.append(new_node)
current_node = new_node
current_node.count += 1
def find(self, value):
current_node = self.root
for char in value:
in_children = False
for n in current_node.children:
if char == n.value:
current_node = n
in_children = True
break
if not in_children:
return False
return current_node.count > 0
Properties of Tries
- The linked structure (shape) of a trie is independent of the key in- sertion/deletion order: there is a unique trie for any given set of keys.
- The number of array accesses when searching in a trie or inserting a key into a trie is at most 1 plus the length of the key.
Substring Search
Knuth-Morris-Pratt substring search
Whenever we detect a mismatch, we already know some of the characters in the text. Construct a prefix array of the pattern to avoid unnecessary compares.Boyer-Moore substring search
Rabin-Karp fingerprint search
We compute a hash function for the pattern and then look for a match by using the same hash function for each possible M-character substring of the text. If we find a text substring with the same hash value as the pattern, we can check for a match.
Data Compression
Huffman compression
Instead of using the usual 7 or 8 bits for each character, we use fewer bits for characters that appear often than for those that appear rarely. One convenient way to represent a prefix-free code is with a trie.LZW compression
Maintain a symbol table that associates string keys with (fixed-length) codeword values.