Hmine - mlxtend 机器学习库

`hmine`: 通过 H-mine 算法获取频繁项集

hmine 函数，用于提取频繁项集以进行关联规则挖掘

from mlxtend.frequent_patterns import hmine

概述

H-mine [1]（基于内存的超结构频繁模式挖掘）是一种数据挖掘算法，用于频繁项集挖掘——在大规模事务数据集中寻找频繁出现的模式的过程。

H-mine 是对 Apriori 和 FP-Growth 算法的改进，在时间和空间复杂度方面提供了更好的性能。它通过使用 H-struct 数据结构和更高效的搜索空间遍历方法来实现这一点。

H-mine 通过引入一种名为 H-struct 的新颖数据结构改进了 FP-Growth 算法。H-struct 是一种混合数据结构，结合了水平和垂直数据布局的优点，使其在频繁项集挖掘方面更高效。

这种方法的一个显著特点是，它的空间开销非常有限且精确可预测，并且在基于内存的环境中运行速度非常快。此外，通过数据库分区，它可以扩展到非常大的数据库；当数据集变得密集时，（条件）FP 树可以作为挖掘过程的一部分动态构建。

参考文献

[1] Pei J, Han J, Lu H, Nishio S, Tang S and Yang D, "H-Mine: 在大型数据库中快速且节省空间的频繁模式挖掘。" IIE Transactions, Vol. 39, pp. 593–605, 2007.

示例 1 -- 生成频繁项集

hmine 函数期望数据采用独热编码的 pandas DataFrame 格式。假设我们有以下事务数据

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

我们可以通过 TransactionEncoder 将其转换为正确的格式，如下所示：

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

	苹果	玉米	莳萝	鸡蛋	冰淇淋	芸豆	牛奶	肉豆蔻	洋葱	独角兽	酸奶
0	False	False	False	True	False	True	True	True	True	False	True
1	False	False	True	True	False	True	False	True	True	False	True
2	True	False	False	True	False	True	True	False	False	False	False
3	False	True	False	False	False	True	True	False	False	True	True
4	False	True	False	True	True	True	False	False	True	False	False

现在，让我们返回支持度至少为 60% 的项和项集

from mlxtend.frequent_patterns import hmine

hmine(df, min_support=0.6)

	支持度	项集
0	0.8	(3)
1	0.8	(3, 5)
2	0.6	(8, 3, 5)
3	0.6	(8, 3)
4	1.0	(5)
5	0.6	(5, 6)
6	0.6	(8, 5)
7	0.6	(10, 5)
8	0.6	(6)
9	0.6	(8)
10	0.6	(10)

默认情况下，hmine 返回项的列索引，这在关联规则挖掘等下游操作中可能很有用。为了提高可读性，我们可以设置 use_colnames=True 将这些整数值转换为相应的项名称

hmine(df, min_support=0.6, use_colnames=True)

	支持度	项集
0	0.8	(鸡蛋)
1	0.8	(鸡蛋, 芸豆)
2	0.6	(鸡蛋, 芸豆, 洋葱)
3	0.6	(鸡蛋, 洋葱)
4	1.0	(芸豆)
5	0.6	(牛奶, 芸豆)
6	0.6	(芸豆, 洋葱)
7	0.6	(酸奶, 芸豆)
8	0.6	(牛奶)
9	0.6	(洋葱)
10	0.6	(酸奶)

示例 2 -- H-Mine 对比 Apriori 和 FP-Growth

由于 hmine 算法是基于内存的算法，对于大型数据集，它的速度可能比替代的 Apriori 算法快几个数量级。然而，它可能比 FP-Growth 算法慢很多。在下面的示例中，我们将 hmine 与 apriori 和 fpgrowth 算法在一个小型数据集上进行性能比较。

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

from mlxtend.frequent_patterns import apriori

%timeit -n 100 -r 10 apriori(df, min_support=0.6, use_colnames=True)

3.41 ms ± 584 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit -n 100 -r 10 apriori(df, min_support=0.6, use_colnames=True, low_memory=True)

3.36 ms ± 404 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

from mlxtend.frequent_patterns import fpgrowth

%timeit -n 100 -r 10 fpgrowth(df, min_support=0.6, use_colnames=True)

1.18 ms ± 76.7 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

from mlxtend.frequent_patterns import hmine

%timeit -n 100 -r 10 hmine(df, min_support=0.6, use_colnames=True)

1.44 ms ± 94.8 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

示例 3 -- 使用稀疏表示

为了节省内存，您可能希望以稀疏格式表示您的事务数据。如果您有大量产品和小型事务，这尤其有用。

oht_ary = te.fit(dataset).transform(dataset, sparse=True)
sparse_df = pd.DataFrame.sparse.from_spmatrix(oht_ary, columns=te.columns_)
sparse_df

	苹果	玉米	莳萝	鸡蛋	冰淇淋	芸豆	牛奶	肉豆蔻	洋葱	独角兽	酸奶
0	0	0	0	1	0	True	1	1	1	0	1
1	0	0	1	1	0	True	0	1	1	0	1
2	1	0	0	1	0	True	1	0	0	0	0
3	0	1	0	0	0	True	1	0	0	1	1
4	0	1	0	1	1	True	0	0	1	0	0

hmine(sparse_df, min_support=0.6, use_colnames=True, verbose=1)

2 itemset(s) from the suffixes on item(s) (Eggs)
1 itemset(s) from the suffixes on item(s) (Eggs, Kidney Beans)
0 itemset(s) from the suffixes on item(s) (Eggs, Kidney Beans, Onion)
0 itemset(s) from the suffixes on item(s) (Eggs, Onion)
3 itemset(s) from the suffixes on item(s) (Kidney Beans)
0 itemset(s) from the suffixes on item(s) (Kidney Beans, Milk)
0 itemset(s) from the suffixes on item(s) (Kidney Beans, Onion)
0 itemset(s) from the suffixes on item(s) (Kidney Beans, Yogurt)
0 itemset(s) from the suffixes on item(s) (Milk)
0 itemset(s) from the suffixes on item(s) (Onion)
0 itemset(s) from the suffixes on item(s) (Yogurt)

	支持度	项集
0	0.8	(鸡蛋)
1	0.8	(鸡蛋, 芸豆)
2	0.6	(鸡蛋, 芸豆, 洋葱)
3	0.6	(鸡蛋, 洋葱)
4	1.0	(芸豆)
5	0.6	(牛奶, 芸豆)
6	0.6	(芸豆, 洋葱)
7	0.6	(酸奶, 芸豆)
8	0.6	(牛奶)
9	0.6	(洋葱)
10	0.6	(酸奶)

API

hmine(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0) -> pandas.core.frame.DataFrame

从独热编码的 DataFrame 获取频繁项集

参数

df : pandas DataFrame

独热编码格式的 pandas DataFrame。也支持包含稀疏数据的 DataFrame；更多信息请参阅 https://pandas.ac.cn/pandas-docs/stable/user_guide/sparse.html#sparse-data-structures。

请注意，mlxtend >= 0.17.2 已不再支持旧的 pandas SparseDataFrame 格式。

允许的值为 0/1 或 True/False。例如，

Apple Bananas Beer Chicken Milk Rice 0 True False True True False True 1 True False True False False True 2 True False True False False False 3 True True False False False False 4 False False True True True True 5 False False True False True True 6 False False True False True False 7 True True False False False False
min_support : float (默认值: 0.5)

一个介于 0 和 1 之间的浮点数，表示返回项集的最小支持度。支持度计算为项目(s)出现次数所在的事务数 / 总事务数。
use_colnames : bool (默认值: False)

如果为 True，则在返回的 DataFrame 中使用 DataFrame 的列名而不是列索引。
max_len : int (默认值: None)

生成的项集的最大长度。如果为 None（默认值），则评估所有可能的项集长度。
verbose : int (默认值: 0)

显示条件树生成的阶段。

返回值

一个 pandas DataFrame，包含所有 >= min_support 且 < max_len（如果 max_len 不为 None）的项集，列名为 ['support', 'itemsets']。'itemsets' 列中的每个项集都是 frozenset 类型，这是一种 Python 内置类型，行为类似于集合，但它是不可变的（更多信息请参阅 https://docs.pythonlang.cn/3.6/library/stdtypes.html#frozenset）。

示例

有关使用示例，请参阅 https://mlxtend.cn/mlxtend/user_guide/frequent_patterns/hmine/

ython

键	操作
`?`	打开此帮助
`n`	下一页
`p`	上一页
`s`	搜索

hmine: 通过 H-mine 算法获取频繁项集

概述

参考文献

相关

示例 1 -- 生成频繁项集

示例 2 -- H-Mine 对比 Apriori 和 FP-Growth

示例 3 -- 使用稀疏表示

更多示例

API

`hmine`: 通过 H-mine 算法获取频繁项集