PCY Algorithm in Big Data Analytics

The PCY algorithm was developed by Park, Chen, and Yu, after whom it is named. It is used in big data analytics for frequent itemset mining when the dataset is very large.

Suppose we have a huge collection of data containing a large number of transactions. For example, whenever we buy a product online, the transaction is recorded. Say a person is buying a shirt from a site, and along with the shirt the site suggests that the person also buy jeans, at some discount. Here two different items are grouped into a single set and associated with each other. The main purpose of this algorithm is to find such frequent itemsets, for example that people who buy a shirt frequently buy jeans as well.

For example:

    Transaction      Items bought
    Transaction 1    Shirt + Jeans
    Transaction 2    Shirt + Jeans + Trousers
    Transaction 3    Shirt + Tie
    Transaction 4    Shirt + Jeans + Shoes

In the example above, the shirt is bought together with jeans in three of the four transactions, more often than with any other item, so {shirt, jeans} is considered a frequent itemset.
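
The co-occurrence count behind this observation can be sketched in a few lines of Python. This is only an illustration: the lowercase item names and the variable names are assumptions made for the sketch, not part of the original example.

```python
from collections import Counter
from itertools import combinations

# The four example transactions from the table above (item names lowercased).
transactions = [
    {"shirt", "jeans"},
    {"shirt", "jeans", "trousers"},
    {"shirt", "tie"},
    {"shirt", "jeans", "shoes"},
]

pair_counts = Counter()
for basket in transactions:
    # sorted() gives every pair a canonical order, so (jeans, shirt)
    # and (shirt, jeans) are counted as the same pair.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

most_common_pair, count = pair_counts.most_common(1)[0]
print(most_common_pair, count)  # ('jeans', 'shirt') 3
```

Every other pair occurs only once, which is why shirt and jeans stand out.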

An example problem solved using the PCY algorithm

Question: Apply the PCY algorithm to the following transactions to find the candidate sets (frequent sets).

Given data

    Threshold value (minimum support) = 3
    Hash function = (i * j) mod 10
    T1  = {1, 2, 3}
    T2  = {2, 3, 4}
    T3  = {3, 4, 5}
    T4  = {4, 5, 6}
    T5  = {1, 3, 5}
    T6  = {2, 4, 6}
    T7  = {1, 3, 4}
    T8  = {2, 4, 5}
    T9  = {3, 4, 6}
    T10 = {1, 2, 4}
    T11 = {2, 3, 5}
    T12 = {3, 4, 6}

Use buckets and the concepts of MapReduce to solve the above problem.

Solution:

The solution proceeds in five steps. Here the "length" of a candidate means the number of transactions in which it occurs, that is, its support count.

1. Count how often each candidate item occurs in the given dataset.
2. Reduce the candidate set to the frequent single items (itemsets of size 1).
3. Map the candidates into pairs and count how often each pair occurs.
4. Apply the hash function to each pair to find its bucket number.
5. Draw up the candidate set table.

Step 1: Map all the items in order to count the number of transactions that contain each one.

    Items →  {1, 2, 3, 4, 5, 6}
    Key      1  2  3  4  5  6
    Value    4  6  8  9  5  4
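
A direct count is a useful check on a hand tally like this one. The following minimal sketch counts, for each item, the number of the twelve transactions that contain it; the variable names are illustrative assumptions.

```python
from collections import Counter

# The twelve transactions T1..T12 from the given data.
transactions = [
    [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6],
    [1, 3, 5], [2, 4, 6], [1, 3, 4], [2, 4, 5],
    [3, 4, 6], [1, 2, 4], [2, 3, 5], [3, 4, 6],
]

# Each item appears at most once per transaction, so counting
# occurrences is the same as counting supporting transactions.
item_counts = Counter(item for t in transactions for item in t)

for item in sorted(item_counts):
    print(item, item_counts[item])
# 1 4
# 2 6
# 3 8
# 4 9
# 5 5
# 6 4
```

The counts sum to 36, which matches twelve transactions of three items each.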

Step 2: Remove all items whose count is less than the threshold value (3).

In this example no item has a count below the threshold. Hence, candidate set = {1, 2, 3, 4, 5, 6}
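
Since the question asks for MapReduce concepts, Steps 1 and 2 can also be pictured as a map phase that emits (item, 1) pairs and a reduce phase that sums them per key. The sketch below runs in a single process and only imitates that structure; `map_phase` and `reduce_phase` are hypothetical helper names, not the API of any MapReduce framework.

```python
from collections import defaultdict

THRESHOLD = 3

transactions = [
    [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6],
    [1, 3, 5], [2, 4, 6], [1, 3, 4], [2, 4, 5],
    [3, 4, 6], [1, 2, 4], [2, 3, 5], [3, 4, 6],
]

def map_phase(transaction):
    """Emit an (item, 1) pair for every item in one transaction."""
    return [(item, 1) for item in transaction]

def reduce_phase(emitted):
    """Sum the emitted values per key, as a reducer would."""
    totals = defaultdict(int)
    for key, value in emitted:
        totals[key] += value
    return dict(totals)

emitted = [kv for t in transactions for kv in map_phase(t)]
counts = reduce_phase(emitted)

# Keep only the items whose count reaches the threshold.
frequent_items = sorted(i for i, c in counts.items() if c >= THRESHOLD)
print(frequent_items)  # [1, 2, 3, 4, 5, 6]
```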

Step 3: Map the candidate items into pairs and count the number of transactions that contain each pair. Each pair is listed only once, under the first transaction in which it appears.

    T1: {(1, 2) (1, 3) (2, 3)} = (2, 3, 3)
    T2: {(2, 4) (3, 4)} = (4, 5)
    T3: {(3, 5) (4, 5)} = (3, 3)
    T4: {(4, 6) (5, 6)} = (4, 1)
    T5: {(1, 5)} = 1
    T6: {(2, 6)} = 1
    T7: {(1, 4)} = 2
    T8: {(2, 5)} = 2
    T9: {(3, 6)} = 2
    T10:______
    T11:______
    T12:______

Note: Pairs should not be repeated; skip any pair that has already been written for an earlier transaction. This is why the rows for T10 to T12 are blank.

Listing all the pairs whose count is greater than or equal to the threshold value: {(1,3) (2,3) (2,4) (3,4) (3,5) (4,5) (4,6)}
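
The pair counting and thresholding of Step 3 can be reproduced with a short script. This is a sketch under the same data and threshold as the problem statement; the variable names are assumptions.

```python
from collections import Counter
from itertools import combinations

THRESHOLD = 3

transactions = [
    [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6],
    [1, 3, 5], [2, 4, 6], [1, 3, 4], [2, 4, 5],
    [3, 4, 6], [1, 2, 4], [2, 3, 5], [3, 4, 6],
]

# Count every pair once per transaction that contains it.
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)

# Keep the pairs whose count reaches the threshold.
frequent_pairs = sorted(p for p, c in pair_counts.items() if c >= THRESHOLD)
print(frequent_pairs)
# [(1, 3), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5), (4, 6)]
```

Looking up `pair_counts[(3, 4)]` gives 5, the largest pair count in this dataset.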


Step 4: Apply the hash function. (It gives us the bucket number of each pair.)

    Hash function = (i * j) mod 10
    (1,3) = (1*3) mod 10 = 3
    (2,3) = (2*3) mod 10 = 6
    (2,4) = (2*4) mod 10 = 8
    (3,4) = (3*4) mod 10 = 2
    (3,5) = (3*5) mod 10 = 5
    (4,5) = (4*5) mod 10 = 0
    (4,6) = (4*6) mod 10 = 4

Now, arrange the pairs in ascending order of their bucket numbers.

    Bucket no.    Pair
    0             (4,5)
    2             (3,4)
    3             (1,3)
    4             (4,6)
    5             (3,5)
    6             (2,3)
    8             (2,4)
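
The bucket assignment above is easy to check mechanically. In this sketch, `bucket` is a hypothetical helper name wrapping the hash function (i * j) mod 10 from the problem statement.

```python
def bucket(i, j, num_buckets=10):
    """Hash a pair to a bucket number using (i * j) mod num_buckets."""
    return (i * j) % num_buckets

# The seven pairs carried over from Step 3.
pairs = [(1, 3), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5), (4, 6)]

# Sort by bucket number to reproduce the ordering of the table.
by_bucket = sorted((bucket(i, j), (i, j)) for i, j in pairs)
for b, pair in by_bucket:
    print(b, pair)
# 0 (4, 5)
# 2 (3, 4)
# 3 (1, 3)
# 4 (4, 6)
# 5 (3, 5)
# 6 (2, 3)
# 8 (2, 4)
```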

Step 5: In this final step, we prepare the candidate set.

Bit vector	Bucket no.	Highest Support Count	Pairs	Candidate Set
1	0	3	(4,5)	(4,5)
1	2	5	(3,4)	(3,4)
1	3	3	(1,3)	(1,3)
1	4	4	(4,6)	(4,6)
1	5	3	(3,5)	(3,5)
1	6	3	(2,3)	(2,3)
1	8	4	(2,4)	(2,4)

Note: The highest support count is the number of times the pair occurs in the transactions.

Check the highest support count of each pair: if it is greater than or equal to 3, write the pair in the candidate set; if it is less than 3, reject the pair.

(NOTE: In some cases the highest support count of a pair is less than 3, i.e. less than the threshold value. For every candidate pair, set the bit vector to 1 if the highest support count is greater than or equal to the threshold, and to 0 otherwise.)

Hence, the frequent itemsets are (1,3), (2,3), (2,4), (3,4), (3,5), (4,5), and (4,6).
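
For comparison, here is a minimal two-pass sketch of PCY in its usual textbook form, which differs slightly from the hand method above: in the first pass every pair of every transaction is hashed, and each bucket accumulates the total count of all pairs that land in it (not only the highest single pair count). The bit vector is then derived from those bucket totals, and the second pass counts only pairs whose items are both frequent and whose bucket bit is 1. The names used here are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

THRESHOLD = 3
NUM_BUCKETS = 10

transactions = [
    [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6],
    [1, 3, 5], [2, 4, 6], [1, 3, 4], [2, 4, 5],
    [3, 4, 6], [1, 2, 4], [2, 3, 5], [3, 4, 6],
]

def bucket(i, j):
    """Hash function from the problem statement."""
    return (i * j) % NUM_BUCKETS

# Pass 1: count single items and hash every pair into a bucket.
item_counts = Counter()
bucket_counts = [0] * NUM_BUCKETS
for t in transactions:
    item_counts.update(t)
    for i, j in combinations(sorted(t), 2):
        bucket_counts[bucket(i, j)] += 1

frequent_items = {i for i, c in item_counts.items() if c >= THRESHOLD}
# Between the passes, the bucket counts shrink to one bit per bucket.
bit_vector = [1 if c >= THRESHOLD else 0 for c in bucket_counts]

# Pass 2: count only candidate pairs (frequent items, frequent bucket).
pair_counts = Counter()
for t in transactions:
    for i, j in combinations(sorted(t), 2):
        if i in frequent_items and j in frequent_items and bit_vector[bucket(i, j)]:
            pair_counts[(i, j)] += 1

frequent_pairs = sorted(p for p, c in pair_counts.items() if c >= THRESHOLD)
print(bit_vector)      # [1, 0, 1, 1, 1, 1, 1, 0, 1, 0]
print(frequent_pairs)
# [(1, 3), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5), (4, 6)]
```

On this small dataset every bucket that receives any pair reaches the threshold, so the bit vector prunes nothing. The filter pays off on large datasets, where many buckets stay below the threshold and their pairs never need to be counted in the second pass.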

Conclusion:

In this article, we discussed the PCY algorithm, an important algorithm in Big Data Analytics, and solved a simple problem to show how it is applied. It is also important from an examination point of view, so it is a must-know algorithm for anyone who has BDA as an academic subject. If you have any further queries, post them in the comment section and I will try to answer them as soon as possible. See you in my next article; until then, stay healthy and keep learning!

Translated from: https://www.includehelp.com/big-data/pcy-algorithm-in-big-data-analytics.aspx
