Frequent Pattern

Suppose we got the following shopping record:

transaction item
T1 baguette, croissant
T2 baguette, croissant, jam
T3 madeleine, croissant, baguette, jam

The first term is support, a measure of absolute frequency. | item | count | support | | :—-: | :-: | :-: | | baguette | 3 | 0.333 | | croissant | 3 | 0.333 | | jam | 2 | 0.222 | | madeleine | 1 | 0.111 |

Then, confidence is a measure of correlative frequency. $$confidence(A \Rightarrow B) = P(B|A) = \frac{support(A \cup B)}{support(A)} = \frac{support \_count(A \cup B)}{support \_count(A)}$$ So in this example, we can see that the confidence of baguette and croissant to jam is 0.666 Actually we got our first glance of association rule, $$baguette, croissant \Rightarrow jam [ support = 66.6\%, confidence = 66.6\% ]$$ the support = 66.6% means in all transactions, the probability of baguette, croissant and jam occurs(they may come seperately), and the confidence = 66.6% just as the equation before.

Apriori

an algorithm for frequent item set mining and association rule learning over transactional databases.

Apriori takes as input (1) a minsup threshold set by the user and (2) a transactional database containing a set of transactions. Apriori outputs all frequent itemsets and the association rules

so for the example and we set the minsup to 2, here we got: | item | support_count | | :—-: | :-: | | baguette | 3 | | croissant | 3 | | jam | 2 | | madeleine | 1 | and we remove the last one due to the threshold and generate a list of all pairs | item | support_count | | :—-: | :-: | | baguette, croissant | 3 | | croissant, jam | 2 | | baguette, jam | 2 | they all exceed the minsup, and then | item | support_count | | :—-: | :-: | | baguette, croissant, jam | 2 | so, we got all the frequent sets.

Then, we can follow a rule to generate all the association rules according to the formula $$confidence(A \Rightarrow B) = \frac{support \_count(A \cup B)}{support \_count(A)}$$ and filter the results using the threshold min_conf set by user. We can get all the assiciation rules: - {baguette, croissant} => jam, confidence = 23 = 0.666 - {baguette, jam} => croissant, confidence = 22 = 1 - {croissant, jam} => baguette, confidence = 22 = 1 - croissant => {baguette, jam}, confidence = 23 = 0.666 - jam => {baguette, croissant}, confidence = 22 = 1 - baguette => {croissant, jam}, confidence = 23 = 0.666

FP Growth

In the previous part, we need to do a lot of join operation and calculation to find the frequent patterns. But we can do it more efficiently and avoid unnecessary calculation. And this algorithm can work well with distrubuted system for it support MapReduce.

A tree structure can help keep track of items and find potential sequence.

Counting

We need to take a look at all the transactions first and count all the items, then sort the list, we can apply some threshold at the same time. | item | count | | :—-: | :-: | | baguette | 3 | | croissant | 3 | | jam | 2 | | madeleine | 1 |

Constructing

  1. create the root node and label it null

                  +------+
                  | null |
                  +------+
    
  2. for each transaction, generate one branch according to the order in the previous counting step. The benefit of tree structure is that we can share some nodes in common. For the later traverse, we create a table to store the links of the nodes for the same item.

                  +------+
                  | null |
                  +------+
                     |       
               +------------+
               | baguette:3 |
               +------------+
                      |       
               +-------------+
               | croissant:3 |
               +-------------+
                      |       
                  +-------+
                  | jam:2 |
                  +-------+
                      |       
               +-------------+
               | madeleine:1 |
               +-------------+
    
    item count node_list
    baguette 3 node1
    croissant 3 node2
    jam 2 node3
    madeleine 1 node4
  3. this example is somewhat not so representative.

Mining

We go through every branch of the tree and keep all the associations meet the requirements(passing threshold)

Chuanrong Li

Read more posts by this author.