String

Definition and Glossary

Strings are everywhere. Text information always requires string process. A String is a sequence of characters.

String Sort

  • LSD
    key-indexed counting, from right to left
  • MSD
    recursive method, left to right
  • Three-way string quicksort
    adapt quicksort to MSD string sorting by using 3-way partitioning on the leading character of the keys, moving to the next character on only the middle subarray

Tries

A search tree, composed of nodes that contain links that are either null or references to other nodes.

class Node:
    def __init__(self, value):
        self.value = value
        self.children = []
        self.count = 0

class Trie:
    def __init__(self):
        self.root = Node(None)
    
    def add(self, value):
        current_node = self.root
        for char in value:
            in_children = False
            for n in current_node.children:
                if n.value == char:
                    in_children = True
                    current_node = n
                    break
            if not in_children:
                new_node = Node(char)
                current_node.children.append(new_node)
                current_node = new_node
        current_node.count += 1

    def find(self, value):
        current_node = self.root
        for char in value:
            in_children = False
            for n in current_node.children:
                if char == n.value:
                    current_node = n
                    in_children = True
                    break
            if not in_children:
                return False
        return current_node.count > 0    

Properties of Tries

  • The linked structure (shape) of a trie is independent of the key in- sertion/deletion order: there is a unique trie for any given set of keys.
  • The number of array accesses when searching in a trie or inserting a key into a trie is at most 1 plus the length of the key.
  • Knuth-Morris-Pratt substring search
    Whenever we detect a mismatch, we already know some of the characters in the text. Construct a prefix array of the pattern to avoid unnecessary compares.

  • Boyer-Moore substring search

  • Rabin-Karp fingerprint search
    We compute a hash function for the pattern and then look for a match by using the same hash function for each possible M-character substring of the text. If we find a text substring with the same hash value as the pattern, we can check for a match.

Data Compression

  • Huffman compression
    Instead of using the usual 7 or 8 bits for each character, we use fewer bits for characters that appear often than for those that appear rarely. One convenient way to represent a prefix-free code is with a trie.

  • LZW compression
    Maintain a symbol table that associates string keys with (fixed-length) codeword values.

Chuanrong Li

Read more posts by this author.