I\'ve heard about the Apriori algorithm several times before but never got the time or the opportunity to dig into it, can anyone explain to me in a simple way the workings
The best introduction to Apriori can be downloaded from this book:
http://www-users.cs.umn.edu/~kumar/dmbook/index.php
you can download the chapter 6 for free which explain Apriori very clearly.
Moreover, if you want to download a Java version of Apriori and other algorithms for frequent itemset mining, you can check my website:
http://www.philippe-fournier-viger.com/spmf/
It is a candidate-generation-and-test approach for frequent pattern mining in datasets. There are two things you have to remember.
Apriori Pruning Principle - If any itemset is infrequent, then its superset should not be generated/tested.
Apriori Property - A given (k+1)-itemset
is a candidate (k+1)-itemset
only if everyone of its k-itemset
subsets are frequent.
Now, here is the apriori algorithm in 4 steps.
1-itemset
.k+1
candidate itemsets from length k
frequent itemsets.Suppose there is a transaction database as follows with 4 transactions including their transaction IDs and items bought with them. Assume the minimum support - min_sup
is 2
. The term support is the number of transactions in which a certain itemset is present/included.
Transaction DB
tid | items
-------------
10 | A,C,D
20 | B,C,E
30 | A,B,C,E
40 | B,E
Now, let's create the candidate 1-itemsets
by the 1st scan of the DB. It is simply called as the set of C_1
as follows.
itemset | sup
-------------
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3
If we test this with min_sup
, we can see {D}
does not satisfy the min_sup
of 2
. So, it will not be included in the frequent 1-itemset
, which we simply call as the set of L_1
as follows.
itemset | sup
-------------
{A} | 2
{B} | 3
{C} | 3
{E} | 3
Now, let's scan the DB for the 2nd time, and generate candidate 2-itemsets
, which we simply call as the set of C_2
as follows.
itemset | sup
-------------
{A,B} | 1
{A,C} | 2
{A,E} | 1
{B,C} | 2
{B,E} | 3
{C,E} | 2
As you can see, {A,B}
and {A,E}
itemsets do not satisfy the min_sup
of 2
and hence they will not be included in the frequent 2-itemset
, L_2
itemset | sup
-------------
{A,C} | 2
{B,C} | 2
{B,E} | 3
{C,E} | 2
Now let's do a 3rd scan of the DB and get candidate 3-itemsets
, C_3
as follows.
itemset | sup
-------------
{A,B,C} | 1
{A,B,E} | 1
{A,C,E} | 1
{B,C,E} | 2
You can see that, {A,B,C}
, {A,B,E}
and {A,C,E}
does not satisfy min_sup
of 2
. So they will not be included in frequent 3-itemset
, L_3
as follows.
itemset | sup
-------------
{B,C,E} | 2
Now, finally, we can calculate the support (supp)
, confidence (conf)
and lift (interestingness value)
values of the Association/Correlation Rules that can be generated by the itemset {B,C,E}
as follows.
Rule | supp | conf | lift
-------------------------------------------
B -> C & E | 50% | 66.67% | 1.33
E -> B & C | 50% | 66.67% | 1.33
C -> E & B | 50% | 66.67% | 1.77
B & C -> E | 50% | 100% | 1.33
E & B -> C | 50% | 66.67% | 1.77
C & E -> B | 50% | 100% | 1.33
Well, I would assume you've read the wikipedia entry but you said "a basic example would make it a lot easier for me to understand". Wikipedia has just that so I'll assume you haven't read it and suggest that you do.
Read the wikipedia article.
See Top 10 algorithms in data mining (free access) or The Top Ten Algorithms in Data Mining. The latter gives a detailed description of the algorithm, together with details on how to get optimized implementations.