
FP Growth — Frequent Pattern Generation in Data Mining with Python Implementation

Powerful algorithm to mine association rules in large itemsets!

Introduction

We have introduced the Apriori Algorithm and pointed out its major disadvantages in the previous post. In this article, an advanced method called the FP Growth algorithm will be revealed. We will walk through the whole process of the FP Growth algorithm and explain why it’s better than Apriori.

Apriori: Association Rule Mining In-depth Explanation and Python Implementation

Association rule mining is a technique to identify underlying relations between different items. There are many methods…

Why is it better?

Recall from the previous post that the two major shortcomings of the Apriori algorithm are:

  • The size of the candidate itemsets can be extremely large
  • Counting support is costly, since the whole itemset database has to be scanned over and over again

To overcome these challenges, the biggest breakthrough of FP Growth is that

No candidate generation is required!

All the problems of Apriori can be solved by leveraging the FP tree. More specifically, itemset size is no longer a problem, since all the data is stored in a far more compact form. Moreover, there is no need to scan the database over and over again; traversing the FP tree does the same job more efficiently.

FP Tree

The FP tree is the core concept of the whole FP Growth algorithm. Briefly speaking, the FP tree is a compressed representation of the itemset database. The tree structure not only preserves the itemsets in the database but also keeps track of the associations between them.

The tree is constructed by taking each itemset and mapping it to a path in the tree one at a time. The whole idea behind this construction is that

More frequently occurring items will have better chances of sharing branches

We then mine the tree recursively to get the frequent pattern. Pattern growth, the name of the algorithm, is achieved by concatenating the frequent pattern generated from the conditional FP trees.

FP Growth Algorithm

Feel free to check out the well-commented source code. It can really help you understand the whole algorithm.

chonyy/fpgrowth_py

Install it from PyPI with pip install fpgrowth_py , or get a copy of this repo using git clone…

The reason FP Growth is so efficient is that it takes a divide-and-conquer approach. As we know, an efficient algorithm usually leverages some kind of data structure and advanced programming technique. FP Growth uses a tree, linked lists, and the concept of depth-first search. The process can be split into two main stages, and each stage can be further divided into two steps.


Stage 1: FP tree construction

Step 1: Clean and sort the itemsets

For each transaction, we first remove the items that fall below the minimum support. Then, we sort the remaining items in descending order of support.
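As a sketch of this step (using made-up toy transactions, not the article's dataset), the cleaning and sorting might look like:

```python
from collections import Counter

def clean_transactions(transactions, min_support):
    """Drop items below the minimum support, then sort each
    transaction by descending global frequency (ties broken
    alphabetically for determinism)."""
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support}
    cleaned = []
    for t in transactions:
        kept = [i for i in t if i in frequent]
        kept.sort(key=lambda i: (-counts[i], i))
        cleaned.append(kept)
    return cleaned

transactions = [
    ["a", "b", "d"],
    ["b", "c", "d"],
    ["a", "b", "e"],
    ["b", "c"],
]
print(clean_transactions(transactions, min_support=2))
# [['b', 'a', 'd'], ['b', 'c', 'd'], ['b', 'a'], ['b', 'c']]
```

Note that 'e' is dropped (it occurs only once), and 'b' moves to the front of every transaction because it is the most frequent item.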

Step 2: Construct the FP tree and header table from the cleaned itemsets

Loop through the cleaned itemsets, mapping each one onto the tree one at a time. If an item already exists along the branch, the transactions share that node and its count is incremented. Otherwise, a new branch is created for the item.

The header table is also constructed during this procedure. It keeps a linked list for each unique item in the transactions, so we can find every occurrence of an item in the tree quickly, without traversing the whole tree.
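A minimal sketch of this step, using a simple node class of my own design (not the repo's actual code):

```python
class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 1
        self.parent = parent
        self.children = {}      # item -> child FPNode
        self.link = None        # next node holding the same item

def build_fp_tree(cleaned_transactions):
    root = FPNode(None, None)
    root.count = 0
    header = {}                 # item -> head of its linked list
    for transaction in cleaned_transactions:
        node = root
        for item in transaction:
            child = node.children.get(item)
            if child:           # shared prefix: just bump the count
                child.count += 1
            else:               # new branch for this item
                child = FPNode(item, node)
                node.children[item] = child
                # thread the new node onto the header table's list
                if item not in header:
                    header[item] = child
                else:
                    cur = header[item]
                    while cur.link:
                        cur = cur.link
                    cur.link = child
            node = child
    return root, header

# Toy transactions already cleaned and frequency-sorted:
cleaned = [["b", "a", "d"], ["b", "c", "d"], ["b", "a"], ["b", "c"]]
root, header = build_fp_tree(cleaned)
print(root.children["b"].count)  # 4 — the 'b' node is shared by all four transactions
```

Because every transaction starts with the most frequent item, all four paths share a single 'b' node, which is exactly the compression the FP tree is after.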

Stage 2: Mine the main tree and conditional FP trees

Step 1: Divide the main FP tree into conditional FP trees

Starting from each frequent 1-pattern, we create a conditional pattern base from the set of prefix paths in the FP tree. Then, we use those pattern bases to construct conditional FP trees with exactly the same method as in Stage 1.
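A sketch of gathering prefix paths through the header table's linked list (over a small hand-built tree, not the repo's implementation):

```python
class Node:
    """Minimal FP tree node: a parent pointer for walking prefix
    paths, and `link` for the header table's linked list."""
    def __init__(self, item, count, parent):
        self.item, self.count, self.parent = item, count, parent
        self.link = None

def prefix_paths(item, header):
    """Collect the conditional pattern base for `item`: every
    prefix path leading to one of its occurrences, paired with
    that occurrence's count."""
    paths = []
    node = header[item]
    while node:                      # follow the header's linked list
        path, parent = [], node.parent
        while parent and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        path.reverse()
        if path:
            paths.append((path, node.count))
        node = node.link
    return paths

# Hand-built FP tree for the toy transactions
# {b,a,d}, {b,c,d}, {b,a}, {b,c}:
root = Node(None, 0, None)
b = Node("b", 4, root)
a = Node("a", 2, b)
d1 = Node("d", 1, a)
c = Node("c", 2, b)
d2 = Node("d", 1, c)
d1.link = d2                         # header table list for 'd'
header = {"d": d1}

print(prefix_paths("d", header))     # [(['b', 'a'], 1), (['b', 'c'], 1)]
```

These two weighted prefixes are exactly the conditional pattern base from which the conditional FP tree for 'd' would be built.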

Step 2: Mine each conditional tree recursively

The frequent patterns are generated from the conditional FP trees, with one conditional FP tree created per frequent pattern. The recursive function we use to mine the conditional trees is close to depth-first search: it makes sure that no more trees can be constructed from the remaining items before moving on.
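The recursion can be sketched without a physical tree by working on the pattern bases directly. This is my own simplified illustration of the divide-and-conquer idea, not the repo's code:

```python
from collections import defaultdict

def mine(pattern_base, min_support, suffix=frozenset()):
    """Recursively mine frequent itemsets from a conditional
    pattern base: a list of (prefix_path, count) pairs with items
    in descending frequency order."""
    # Support of each item within this conditional base.
    counts = defaultdict(int)
    for path, count in pattern_base:
        for item in path:
            counts[item] += count

    results = {}
    for item, support in counts.items():
        if support < min_support:
            continue
        # Pattern growth: concatenate the item onto the suffix.
        pattern = suffix | {item}
        results[pattern] = support
        # Build the conditional pattern base for this pattern:
        # everything that precedes `item` on each path.
        conditional = []
        for path, count in pattern_base:
            if item in path:
                prefix = path[:path.index(item)]
                if prefix:
                    conditional.append((prefix, count))
        # Depth-first: exhaust this branch before moving on.
        results.update(mine(conditional, min_support, pattern))
    return results

# Cleaned toy transactions, each with count 1:
base = [(["b", "a", "d"], 1), (["b", "c", "d"], 1),
        (["b", "a"], 1), (["b", "c"], 1)]
patterns = mine(base, min_support=2)
print(sorted((sorted(p), s) for p, s in patterns.items()))
```

Each recursive call works on a strictly smaller conditional base, which is why the whole search terminates and why each frequent itemset is generated exactly once.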

Let's take a careful look at the process. (In the accompanying figure, trees at the same recursion level share a color.) The algorithm workflow is listed below:

  1. Check the first frequent 1-pattern, 'a'
  2. Get the conditional FP tree for 'a' (in pink)
  3. Mine the pink tree and go one level deeper, to level 2
  4. Check 'f' on the pink tree and find that no more trees can be constructed
  5. Check 'c' on the pink tree
  6. Mine the yellow tree and go one level deeper, to level 3

Python Implementation

FP Growth Function

Organizing the main features of both algorithms side by side, we can tell that FP Growth is generally better than Apriori under most circumstances: no candidate generation, far fewer database scans, and a compact tree structure. That's why Apriori is just a fundamental method, and FP Growth is an improvement on it.

Runtime with different minimum support

We take the dataset from the source code repo and run some experiments. The first thing we check is how the minimum support affects the runtime.

From the plot, we can see that FP Growth is always faster than Apriori; the reason was already explained above. An interesting observation is that, on both datasets, the runtime of Apriori starts to increase rapidly once the minimum support drops below a certain point, while the runtime of FP Growth is barely affected by the value.

Runtime with different numbers of transactions

The runtime of both algorithms increases as the number of itemsets goes up. However, we can clearly see that the slope of Apriori's increase is much steeper. This means the Apriori algorithm is more sensitive to the itemset size than FP Growth.


Ascending order vs. descending order

You might be wondering why we have to sort the items in descending order of frequency before constructing the tree. What would happen if we tried ascending order instead? As you can see from the graph, sorting in descending order is always faster, and the speed difference becomes more obvious at lower supports.

But why? I would like to offer some of my own assumptions. Let's recall the core concept of FP Growth:

More frequently occurring items will have better chances of sharing branches

Starting from items with higher support gives a higher probability of sharing branches. The more branches are shared, the smaller the tree; and the smaller the tree, the lower the cost of running the whole algorithm. Therefore, sorting the items in descending order makes the method run faster.
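We can sanity-check this assumption on toy data (my own, not the article's dataset) by counting how many nodes the prefix tree needs under each ordering:

```python
def tree_size(transactions, descending):
    """Order each transaction by item frequency (descending or
    ascending) and count how many nodes the prefix tree needs."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    root, nodes = {}, 0
    for t in transactions:
        key = lambda i: (-counts[i] if descending else counts[i], i)
        level = root
        for item in sorted(t, key=key):
            if item not in level:    # new branch needed
                level[item] = {}
                nodes += 1
            level = level[item]      # branch shared with an earlier transaction
    return nodes

transactions = [["a", "b", "d"], ["b", "c", "d"],
                ["a", "b", "e"], ["b", "c"]]
print(tree_size(transactions, descending=True))   # 6 nodes
print(tree_size(transactions, descending=False))  # 10 nodes
```

Even on four tiny transactions, descending order yields a noticeably smaller tree, because the frequent items cluster near the root where their branches can be shared.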

Conclusion

I would like to share an interesting story. While I was writing this article after work, a full-time engineer walked by and saw that I was writing about Apriori and FP Growth. He said, "Interesting, but not realistic." He explained that the algorithm doesn't take weighting into consideration. For example, what about a transaction that contains multiple copies of the same item? There are also more subtle conditions that the algorithm doesn't cover, and that's why companies won't implement it in their business as-is.



An implementation of the FP-growth algorithm in pure Python.

enaeseth/python-fp-growth


This module provides a pure Python implementation of the FP-growth algorithm for finding frequent itemsets. FP-growth exploits an (often-valid) assumption that many transactions will have items in common to build a prefix tree. If the assumption holds true, this tree produces a compact representation of the actual transactions and is used to generate itemsets much faster than Apriori can.

After downloading and extracting the package, install the module by running python setup.py install from within the extracted package directory. (If you encounter errors, you may need to run the setup with elevated permissions: sudo python setup.py install.)

Usage of the module is very simple. Assuming you have some iterable of transactions (which are themselves iterables of items) called transactions and an integer minimum support value minsup , you can find the frequent itemsets in your transactions with the following code:

from fp_growth import find_frequent_itemsets

for itemset in find_frequent_itemsets(transactions, minsup):
    print itemset

Note that find_frequent_itemsets returns a generator of itemsets, not a greedily-populated list. Each item must be hashable (i.e., it must be valid as a member of a dictionary or a set).

Once installed, the module can also be used as a stand-alone script. It will read a list of transactions formatted as a CSV file. (An example of such a file is included in the examples directory.)

For example, to find the itemsets with support ≥ 4 in the included example file:

python -m fp_growth -s 4 examples/tsk.csv 

The following references were used as source descriptions of the algorithm:

  • Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. 1st ed. Boston: Pearson / Addison Wesley, 2006. (pp. 363-370)
  • Han, Jiawei, Jian Pei, and Yiwen Yin. "Mining Frequent Patterns without Candidate Generation." Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000.

The example data included in tsk.csv comes from the section in Introduction to Data Mining.

The python-fp-growth package is made available under the terms of the MIT License.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


