The Apriori algorithm also belongs to unsupervised learning; it emphasizes the question "what can be discovered from the data X?". Finding hidden relationships among items in a large-scale dataset is called association analysis, or association rule learning. The main difficulty is that searching over all possible combinations of items is extremely time-consuming and computationally expensive, so brute-force search is infeasible. This section therefore introduces the Apriori algorithm to solve this problem.
1: Basic concepts
(1) Frequent itemsets: collections of items that frequently appear together. An association rule suggests a strong relationship between two items. (We define a threshold in advance; exceeding the threshold indicates a strong relationship.)
(2) The support of an itemset is defined as the proportion of records in the dataset that contain that itemset. We specify a minimum support (minSupport) in advance and keep only the itemsets that satisfy it.
(3) Confidence is defined for an association rule such as {diapers} -> {wine}: it is the support of the combined itemset divided by the support of the antecedent, i.e. support({diapers, wine}) / support({diapers}).
(4) The Apriori principle: if an itemset is frequent, then all of its subsets are also frequent. Conversely, if an itemset is infrequent, then all of its supersets are also infrequent. For example, if {1,2} already occurs less often than the minimum support (i.e., it is infrequent), then any superset such as {0,1,2} must also be infrequent. The algorithm consists of two main steps: finding frequent itemsets and mining association rules.
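To make the support and confidence definitions concrete, here is a minimal sketch (using the same four-transaction toy dataset that appears later in this section) that computes both quantities by direct counting:

```python
dataSet = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

def support(itemset, dataSet):
    # fraction of transactions that contain every item in the itemset
    hits = sum(1 for trans in dataSet if set(itemset).issubset(trans))
    return hits / len(dataSet)

# {2, 5} appears in 3 of the 4 transactions
print(support({2, 5}, dataSet))                             # 0.75

# confidence({2} -> {5}) = support({2, 5}) / support({2})
print(support({2, 5}, dataSet) / support({2}, dataSet))     # 1.0
```

A confidence of 1.0 means every transaction containing item 2 also contains item 5.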
2: Finding frequent itemsets
The procedure: start from C1 = {{0}, {1}, {2}, {3}}, then generate L1, the itemsets in C1 whose support is at least the minimum support, e.g. L1 = {{0}, {1}, {3}}. L1 is then combined to produce C2 = {{0,1}, {0,3}, {1,3}}, and so on until Ck is empty.
    def loadDataSet():
        return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]


    def createC1(dataSet):
        """Build C1, the list of all candidate itemsets of size 1."""
        C1 = []
        for transaction in dataSet:
            for item in transaction:
                if [item] not in C1:
                    C1.append([item])
        C1.sort()
        # frozenset so the itemsets can later serve as dictionary keys
        return list(map(frozenset, C1))


    def scanD(D, Ck, minSupport):
        """Scan dataset D and keep the candidates in Ck with support >= minSupport."""
        ssCnt = {}
        for tid in D:
            for can in Ck:
                if can.issubset(tid):
                    ssCnt[can] = ssCnt.get(can, 0) + 1
        numItems = float(len(D))
        retList = []
        supportData = {}
        for key in ssCnt:
            support = ssCnt[key] / numItems
            if support >= minSupport:
                retList.insert(0, key)
            supportData[key] = support
        return retList, supportData


    def aprioriGen(Lk, k):
        """Generate the candidate k-itemsets Ck from the frequent (k-1)-itemsets Lk."""
        retList = []
        lenLk = len(Lk)
        for i in range(lenLk):
            for j in range(i + 1, lenLk):
                # join two itemsets only when their first k-2 items match,
                # which avoids producing duplicate candidates
                L1 = list(Lk[i])[:k - 2]
                L2 = list(Lk[j])[:k - 2]
                L1.sort(); L2.sort()
                if L1 == L2:
                    retList.append(Lk[i] | Lk[j])
        return retList


    def apriori(dataSet, minSupport=0.5):
        C1 = createC1(dataSet)
        D = list(map(set, dataSet))
        L1, supportData = scanD(D, C1, minSupport)
        L = [L1]
        k = 2
        while len(L[k - 2]) > 0:
            Ck = aprioriGen(L[k - 2], k)
            Lk, supK = scanD(D, Ck, minSupport)
            supportData.update(supK)
            L.append(Lk)
            k += 1
        return L, supportData
Notes: (1) C1 is the collection of all candidate itemsets of size 1.
(2) Python's frozenset type is used here. A frozenset is a "frozen" set, meaning it is immutable: the user cannot modify it. frozenset must be used rather than set because these itemsets will later serve as dictionary keys, which frozenset allows and set does not.
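As a sanity check on what apriori(dataSet, 0.5) should return for the toy dataset, a brute-force enumeration (a sketch independent of the code above, enumerating every item combination with itertools) yields the same frequent itemsets:

```python
from itertools import combinations

dataSet = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
minSupport = 0.5

items = sorted({item for trans in dataSet for item in trans})
frequent = []
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        # support = fraction of transactions containing the combination
        sup = sum(1 for t in dataSet if set(combo).issubset(t)) / len(dataSet)
        if sup >= minSupport:
            frequent.append(frozenset(combo))

print(sorted(map(sorted, frequent)))
# [[1], [1, 3], [2], [2, 3], [2, 3, 5], [2, 5], [3], [3, 5], [5]]
```

This is exponential in the number of distinct items, which is exactly the cost the Apriori pruning avoids.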
3: Mining association rules from frequent itemsets
    def generateRules(L, supportData, minConf=0.7):
        """Generate association rules from the frequent itemsets in L."""
        bigRuleList = []
        for i in range(1, len(L)):   # only itemsets with two or more items can yield rules
            for freqSet in L[i]:
                H1 = [frozenset([item]) for item in freqSet]
                if i > 1:
                    rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
                else:
                    calcConf(freqSet, H1, supportData, bigRuleList, minConf)
        return bigRuleList

    def calcConf(freqSet, H, supportData, brl, minConf=0.7):
        """Keep the consequents in H whose rule confidence is at least minConf."""
        prunedH = []
        for conseq in H:
            conf = supportData[freqSet] / supportData[freqSet - conseq]
            if conf >= minConf:
                print(freqSet - conseq, '-->', conseq, 'conf:', conf)
                brl.append((freqSet - conseq, conseq, conf))
                prunedH.append(conseq)
        return prunedH

    def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
        """Recursively merge surviving consequents into larger ones and test them."""
        m = len(H[0])
        if len(freqSet) > (m + 1):
            Hmp1 = aprioriGen(H, m + 1)   # build candidate consequents of size m+1
            Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
            if len(Hmp1) > 1:
                rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)
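The confidence filter can be illustrated on its own with the frequent pair {2, 5} from the toy dataset (a minimal sketch that tests only the two single-item consequents of that pair):

```python
dataSet = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

def support(itemset):
    return sum(1 for t in dataSet if itemset.issubset(t)) / len(dataSet)

freqSet = frozenset([2, 5])
minConf = 0.7
for conseq in (frozenset([2]), frozenset([5])):
    antecedent = freqSet - conseq
    conf = support(freqSet) / support(antecedent)
    if conf >= minConf:
        print(antecedent, '-->', conseq, 'conf:', conf)
```

Both rules {5} --> {2} and {2} --> {5} survive with confidence 1.0, because items 2 and 5 always appear together in this dataset.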
4: Using the FP-growth algorithm to find frequent itemsets efficiently
Each time it grows the size of the frequent itemsets, the Apriori algorithm rescans the entire dataset. When the dataset is large, this significantly slows down frequent-itemset discovery. FP-growth, in contrast, needs only two passes over the database, which greatly speeds up finding frequent itemsets. However, this algorithm cannot be used to mine association rules.
The first pass counts the occurrences of every item and is used only to record frequencies. The second pass considers only the frequent items and is used to build the FP-tree.
    class treeNode:
        def __init__(self, nameValue, numOccur, parentNode):
            self.name = nameValue
            self.count = numOccur
            self.nodeLink = None          # link to the next node holding the same item
            self.parent = parentNode
            self.children = {}

        def inc(self, numOccur):
            self.count += numOccur

        def disp(self, ind=1):
            print('  ' * ind, self.name, ' ', self.count)
            for child in self.children.values():
                child.disp(ind + 1)


    def loadSimpDat():
        simpDat = [['r', 'z', 'h', 'j', 'p'],
                   ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
                   ['z'],
                   ['r', 'x', 'n', 'o', 's'],
                   ['y', 'r', 'x', 'z', 'q', 't', 'p'],
                   ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
        return simpDat

    def createInitSet(dataSet):
        retDict = {}
        for trans in dataSet:
            retDict[frozenset(trans)] = 1
        return retDict


    def createTree(dataSet, minSup=1):
        headerTable = {}
        # first pass: count the occurrences of every item
        for trans in dataSet:
            for item in trans:
                headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
        # remove infrequent items (list() lets us delete while iterating)
        for k in list(headerTable.keys()):
            if headerTable[k] < minSup:
                del headerTable[k]
        freqItemSet = set(headerTable.keys())
        if len(freqItemSet) == 0:
            return None, None
        for k in headerTable:
            headerTable[k] = [headerTable[k], None]   # [count, head of node-link list]
        retTree = treeNode('Null Set', 1, None)
        # second pass: insert each transaction, keeping only frequent items
        # sorted by descending global count
        for tranSet, count in dataSet.items():
            localD = {}
            for item in tranSet:
                if item in freqItemSet:
                    localD[item] = headerTable[item][0]
            if len(localD) > 0:
                orderedItems = [v[0] for v in sorted(localD.items(),
                                                     key=lambda p: p[1], reverse=True)]
                updateTree(orderedItems, retTree, headerTable, count)
        return retTree, headerTable

    def updateTree(items, inTree, headerTable, count):
        if items[0] in inTree.children:   # existing child: just bump its count
            inTree.children[items[0]].inc(count)
        else:                             # new child: create it and link it into the header table
            inTree.children[items[0]] = treeNode(items[0], count, inTree)
            if headerTable[items[0]][1] is None:
                headerTable[items[0]][1] = inTree.children[items[0]]
            else:
                updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
        if len(items) > 1:
            updateTree(items[1:], inTree.children[items[0]], headerTable, count)

    def updateHeader(nodeToTest, targetNode):
        # walk to the end of the node-link list and append targetNode
        while nodeToTest.nodeLink is not None:
            nodeToTest = nodeToTest.nodeLink
        nodeToTest.nodeLink = targetNode
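The two passes can be illustrated without building the tree itself: the first pass counts every item, and the second pass filters and re-orders each transaction by those global counts before it would be inserted (a minimal sketch using the same sample data as loadSimpDat, with minSup = 3):

```python
simpDat = [['r', 'z', 'h', 'j', 'p'],
           ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
           ['z'],
           ['r', 'x', 'n', 'o', 's'],
           ['y', 'r', 'x', 'z', 'q', 't', 'p'],
           ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
minSup = 3

# pass 1: count every item across all transactions
counts = {}
for trans in simpDat:
    for item in trans:
        counts[item] = counts.get(item, 0) + 1
freq = {item for item, c in counts.items() if c >= minSup}

# pass 2: keep only frequent items, ordered by descending global count
ordered = []
for trans in simpDat:
    kept = [item for item in trans if item in freq]
    kept.sort(key=lambda item: counts[item], reverse=True)
    ordered.append(kept)

print(ordered[0])   # ['z', 'r'] -- h, j, p dropped; z (count 5) before r (count 3)
```

Note that items with equal counts can be ordered either way here; the resulting FP-tree is still valid, but its exact shape depends on how such ties are broken.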