Apriori Algorithm

前端 未结 4 2091
野性不改
野性不改 2020-12-29 16:37

I\'ve heard about the Apriori algorithm several times before but never got the time or the opportunity to dig into it, can anyone explain to me in a simple way the workings

相关标签:
4条回答
  • 2020-12-29 17:27

    The best introduction to Apriori can be downloaded from this book:

    http://www-users.cs.umn.edu/~kumar/dmbook/index.php

    you can download the chapter 6 for free which explain Apriori very clearly.

    Moreover, if you want to download a Java version of Apriori and other algorithms for frequent itemset mining, you can check my website:

    http://www.philippe-fournier-viger.com/spmf/

    0 讨论(0)
  • 2020-12-29 17:31

    Apriori Algorithm

    It is a candidate-generation-and-test approach for frequent pattern mining in datasets. There are two things you have to remember.

    Apriori Pruning Principle - If any itemset is infrequent, then its superset should not be generated/tested.

    Apriori Property - A given (k+1)-itemset is a candidate (k+1)-itemset only if everyone of its k-itemset subsets are frequent.

    Now, here is the apriori algorithm in 4 steps.

    1. Initially, scan the database/dataset once to get the frequent 1-itemset.
    2. Generate length k+1 candidate itemsets from length k frequent itemsets.
    3. Test the candidates against the database/dataset.
    4. Terminate when no frequent or candidate set can be generated.

    Solved Example

    Suppose there is a transaction database as follows with 4 transactions including their transaction IDs and items bought with them. Assume the minimum support - min_sup is 2. The term support is the number of transactions in which a certain itemset is present/included.

    Transaction DB

    tid  | items
    -------------
    10   | A,C,D
    20   | B,C,E
    30   | A,B,C,E
    40   | B,E
    

    Now, let's create the candidate 1-itemsets by the 1st scan of the DB. It is simply called as the set of C_1 as follows.

    itemset | sup
    -------------
      {A}   |  2
      {B}   |  3
      {C}   |  3
      {D}   |  1
      {E}   |  3
    

    If we test this with min_sup, we can see {D} does not satisfy the min_sup of 2. So, it will not be included in the frequent 1-itemset, which we simply call as the set of L_1 as follows.

    itemset | sup
    -------------
      {A}   |  2
      {B}   |  3
      {C}   |  3
      {E}   |  3
    

    Now, let's scan the DB for the 2nd time, and generate candidate 2-itemsets, which we simply call as the set of C_2 as follows.

    itemset | sup
    -------------
      {A,B} |  1
      {A,C} |  2
      {A,E} |  1
      {B,C} |  2
      {B,E} |  3
      {C,E} |  2
    

    As you can see, {A,B} and {A,E} itemsets do not satisfy the min_sup of 2 and hence they will not be included in the frequent 2-itemset, L_2

    itemset | sup
    -------------
      {A,C} |  2
      {B,C} |  2
      {B,E} |  3
      {C,E} |  2
    

    Now let's do a 3rd scan of the DB and get candidate 3-itemsets, C_3 as follows.

    itemset  | sup
    -------------
     {A,B,C} |  1
     {A,B,E} |  1
     {A,C,E} |  1
     {B,C,E} |  2
    

    You can see that, {A,B,C}, {A,B,E} and {A,C,E} does not satisfy min_sup of 2. So they will not be included in frequent 3-itemset, L_3 as follows.

    itemset  | sup
    -------------
     {B,C,E} |  2
    

    Now, finally, we can calculate the support (supp), confidence (conf) and lift (interestingness value) values of the Association/Correlation Rules that can be generated by the itemset {B,C,E} as follows.

             Rule       | supp  |  conf   |  lift
        -------------------------------------------
         B ->  C & E    |  50%  |  66.67% |  1.33
         E ->  B & C    |  50%  |  66.67% |  1.33
         C ->  E & B    |  50%  |  66.67% |  1.77
         B & C ->  E    |  50%  |  100%   |  1.33
         E & B ->  C    |  50%  |  66.67% |  1.77
         C & E ->  B    |  50%  |  100%   |  1.33
    
    0 讨论(0)
  • 2020-12-29 17:37

    Well, I would assume you've read the wikipedia entry but you said "a basic example would make it a lot easier for me to understand". Wikipedia has just that so I'll assume you haven't read it and suggest that you do.

    Read the wikipedia article.

    0 讨论(0)
  • 2020-12-29 17:37

    See Top 10 algorithms in data mining (free access) or The Top Ten Algorithms in Data Mining. The latter gives a detailed description of the algorithm, together with details on how to get optimized implementations.

    0 讨论(0)
提交回复
热议问题