R and Data Mining, Part 8: Association Rule Analysis
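Preface
In short, association analysis uses numeric measures to describe whether items influence one another and how strong that influence is.
Common algorithms are as follows:
Apriori association rules
Basic terminology:
- Transaction: put simply, the set of all items on one shopper's receipt.
- Item: a single product on the receipt, e.g. product A.
- Itemset: a set of one or more items. Unlike a transaction, it does not have to be a whole receipt. Itemsets are named by their size: 1-itemsets, 2-itemsets, ..., k-itemsets.
- Notation: in X => Y, X is called the antecedent and Y the consequent.
- Support: put simply, a probability or relative frequency.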
$$\mathrm{Support}(X \Rightarrow Y) = \frac{P(X, Y)}{P(I)} = \frac{P(X \cup Y)}{P(I)} = \frac{\mathrm{num}(X \cup Y)}{\mathrm{num}(I)}$$
Here I is the full set of transactions, and num() counts the transactions that contain a given itemset. For example, num(I) is the total number of transactions, and num(X∪Y) is the number of transactions that contain both X and Y.
- Frequent itemset (Large Itemset): an itemset whose support meets the minimum support threshold.
- Confidence
Confidence is the probability that Y occurs given that X has occurred, i.e. how reliably the rule X => Y lets us infer Y.
$$\mathrm{Confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{P(X, Y)}{P(X)} = \frac{P(X \cup Y)}{P(X)}$$
- Lift
Lift is the ratio of the probability that a transaction contains Y given that it contains X to the overall probability of Y. A lift above 1 means X and Y occur together more often than they would if they were independent.
$$\mathrm{Lift}(X \Rightarrow Y) = \frac{P(Y \mid X)}{P(Y)}$$
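To make the three definitions concrete, here is a minimal base R sketch that computes them by hand for the rule {d} => {e}. The five toy transactions and the helper name `support_of()` are assumptions made for illustration; they are not part of arules.

```r
# Toy transactions (hypothetical data, for illustration only)
toy_transactions <- list(
  c("a", "b", "c"),
  c("a", "b"),
  c("a", "b", "d"),
  c("c", "e"),
  c("a", "b", "d", "e")
)

# Fraction of transactions that contain every item in `items`
support_of <- function(items, trans) {
  mean(vapply(trans, function(tr) all(items %in% tr), logical(1)))
}

x <- "d"  # antecedent
y <- "e"  # consequent

supp_xy <- support_of(c(x, y), toy_transactions)     # Support(X => Y)
conf_xy <- supp_xy / support_of(x, toy_transactions) # Confidence(X => Y)
lift_xy <- conf_xy / support_of(y, toy_transactions) # Lift(X => Y)

print(c(support = supp_xy, confidence = conf_xy, lift = lift_xy))
```

Only one of the five transactions contains both d and e, so the support is 0.2. Two transactions contain d, which gives a confidence of 0.5, and dividing by the support of e (0.4) gives a lift of 1.25.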
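In R, the Apriori algorithm is implemented by functions in the arules package, and the companion package arulesViz visualizes the resulting rules.
arules provides three mining algorithms: apriori, eclat, and weclat. Each is exposed as a function of the same name: apriori() mines rules directly, while eclat() and weclat() mine frequent itemsets from which rules can then be derived.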
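As a rough sketch of the Eclat route, the snippet below mines frequent itemsets with eclat() and then derives rules from them with ruleInduction(). The thresholds (support 0.05, confidence 0.3) are arbitrary values chosen for illustration, not recommendations from the original text.

```r
library(arules)
data("Groceries")

# Mine frequent itemsets with Eclat (support threshold picked arbitrarily)
itemsets <- eclat(Groceries,
                  parameter = list(support = 0.05),
                  control = list(verbose = FALSE))

# Show the three most frequent itemsets
inspect(head(sort(itemsets, by = "support"), 3))

# Derive rules from the frequent itemsets (confidence threshold is also arbitrary)
rules_eclat <- ruleInduction(itemsets, Groceries, confidence = 0.3)
inspect(rules_eclat)
```

Unlike apriori(), this takes two steps, which can be convenient when the frequent itemsets themselves are of interest.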
Code implementation
The data usually has to be converted first, typically into the transactions format.
# Converting to the transactions format
# List to transactions
a_list <- list(c("a", "b", "c"), c("a", "b"), c("a", "b", "d"), c("c", "e"),
               c("a", "b", "d", "e"))
names(a_list) <- paste("Tr", c(1:5), sep = "") # name the list elements Tr1..Tr5
library(arules)
trans <- as(a_list, "transactions") # convert the list to transactions
inspect(trans) # check that the conversion worked
# Data frame to transactions
a_df <- data.frame(age = as.factor(c(6, 8, 7, 6, 9, 5)),
                   grade = as.factor(c(1, 3, 1, 1, 4, 1)))
trans2 <- as(a_df, "transactions") # convert the data frame to transactions
inspect(trans2) # check that the conversion worked
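Purchase data often arrives in long format, with one row per (order, item) pair rather than as a list or a factor data frame. A minimal sketch of one way to handle that case follows. The `orders` data frame and its column names are hypothetical.

```r
library(arules)

# Hypothetical long-format purchase log: one row per (order, item) pair
orders <- data.frame(
  order_id = c(1, 1, 2, 2, 2, 3),
  item     = c("milk", "bread", "milk", "eggs", "bread", "eggs"),
  stringsAsFactors = FALSE
)

# Group the items by order, then coerce the resulting list to transactions
basket_list <- split(orders$item, orders$order_id)
trans3 <- as(basket_list, "transactions")
inspect(trans3)
```

split() produces a named list with one character vector per order, which is the same shape as a_list above, so the same as() coercion applies.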
A worked example
# Association rule analysis
library(arules) # load the arules package
data("Groceries") # load the Groceries dataset
# Summary statistics for the dataset, covering transactions and items
summary(Groceries)
out:
transactions as itemMatrix in sparse format with
9835 rows (elements/itemsets/transactions) and
169 columns (items) and a density of 0.02609146
most frequent items:
whole milk other vegetables rolls/buns soda yogurt
2513 1903 1809 1715 1372
(Other)
34055
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46 29 14
19 20 21 22 23 24 26 27 28 29 32
14 9 11 4 6 1 1 1 1 3 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 3.000 4.409 6.000 32.000
includes extended item information - examples:
labels level2 level1
1 frankfurter sausage meat and sausage
2 sausage sausage meat and sausage
3 liver loaf sausage meat and sausage
# Mine the association rules with a minimum support of 0.001 and a minimum confidence of 0.5
rules <- apriori(Groceries, parameter = list(support = 0.001, confidence = 0.5))
out:
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
0.5 0.1 1 none FALSE TRUE 5 0.001 1 10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 9
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [157 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 done [0.02s].
writing ... [5668 rule(s)] done [0.01s].
creating S4 object ... done [0.00s].
# Summarize the rules
summary(rules)
out:
set of 5668 rules
rule length distribution (lhs + rhs):sizes
2 3 4 5 6
11 1461 3211 939 46
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 3.00 4.00 3.92 4.00 6.00
summary of quality measures:
support confidence coverage lift count
Min. :0.001017 Min. :0.5000 Min. :0.001017 Min. : 1.957 Min. : 10.0
1st Qu.:0.001118 1st Qu.:0.5455 1st Qu.:0.001729 1st Qu.: 2.464 1st Qu.: 11.0
Median :0.001322 Median :0.6000 Median :0.002135 Median : 2.899 Median : 13.0
Mean :0.001668 Mean :0.6250 Mean :0.002788 Mean : 3.262 Mean : 16.4
3rd Qu.:0.001729 3rd Qu.:0.6842 3rd Qu.:0.002949 3rd Qu.: 3.691 3rd Qu.: 17.0
Max. :0.022267 Max. :1.0000 Max. :0.043416 Max. :18.996 Max. :219.0
mining info:
data ntransactions support confidence
Groceries 9835 0.001 0.5
call
apriori(data = Groceries, parameter = list(support = 0.001, confidence = 0.5))
# Support of individual items in Groceries
# Support of the first 3 items in Groceries
itemFrequency(Groceries[, 1:3])
out:
frankfurter sausage liver loaf
0.058973055 0.093950178 0.005083884
# Support of the items whole milk and other vegetables
itemFrequency(Groceries[, c("whole milk", "other vegetables")])
out:
whole milk other vegetables
0.2555160 0.1934926
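The relative frequencies above are counts divided by the number of transactions. As a quick cross-check, the sketch below asks itemFrequency() for absolute counts via `type = "absolute"` and recovers the support of whole milk by hand. The variable names are illustrative.

```r
library(arules)
data("Groceries")

# Absolute count of transactions containing whole milk
milk_count <- itemFrequency(Groceries[, "whole milk"], type = "absolute")

# Dividing by the number of transactions recovers the relative support
milk_support <- milk_count / length(Groceries)

print(milk_count)
print(milk_support)
```

The count agrees with the 2513 reported for whole milk in the summary(Groceries) output, and 2513 / 9835 gives the 0.2555160 shown above.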
# Item frequency plots
# Plot the items whose support is at least 0.1
itemFrequencyPlot(Groceries, support = 0.1)
# Plot the 20 items with the highest support
itemFrequencyPlot(Groceries, topN = 20)
# Inspect the data and the rules
# Show the first five transactions in Groceries
inspect(Groceries[1:5])
out:
items
[1] {citrus fruit,
semi-finished bread,
margarine,
ready soups}
[2] {tropical fruit,
yogurt,
coffee}
[3] {whole milk}
[4] {pip fruit,
yogurt,
cream cheese ,
meat spreads}
[5] {other vegetables,
whole milk,
condensed milk,
long life bakery product}
# Show the first five association rules
inspect(rules[1:5])
out:
lhs rhs support confidence coverage lift count
[1] {honey} => {whole milk} 0.001118454 0.7333333 0.001525165 2.870009 11
[2] {tidbits} => {rolls/buns} 0.001220132 0.5217391 0.002338587 2.836542 12
[3] {cocoa drinks} => {whole milk} 0.001321810 0.5909091 0.002236909 2.312611 13
[4] {pudding powder} => {whole milk} 0.001321810 0.5652174 0.002338587 2.212062 13
[5] {cooking chocolate} => {whole milk} 0.001321810 0.5200000 0.002541942 2.035097 13
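The columns of this output can be tied back to the definitions with a little arithmetic. The sketch below recomputes support, confidence, and lift for the first rule, {honey} => {whole milk}, using only numbers already printed above: the rule's count and coverage, the whole milk count from summary(Groceries), and the number of transactions.

```r
n          <- 9835         # number of transactions in Groceries
count_xy   <- 11           # transactions containing honey and whole milk (count column)
coverage_x <- 0.001525165  # support of the antecedent {honey} (coverage column)
count_milk <- 2513         # transactions containing whole milk (summary output)

support_xy <- count_xy / n                  # Support(X => Y)
confidence <- support_xy / coverage_x       # Confidence = Support(X => Y) / Support(X)
lift       <- confidence / (count_milk / n) # Lift = Confidence / Support(Y)

print(round(c(support = support_xy, confidence = confidence, lift = lift), 6))
```

Up to rounding, the results match the support, confidence, and lift columns of the first row.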
# Compute additional interest measures for the rules
# Compute "coverage", "fishersExactTest", "conviction", and "chiSquared"
# (coverage is already reported by apriori(), so it appears twice after the merge)
qualityMeasures <- interestMeasure(rules, measure = c("coverage", "fishersExactTest",
                                                      "conviction", "chiSquared"),
                                   transactions = Groceries)
summary(qualityMeasures)
quality(rules) <- cbind(quality(rules), qualityMeasures) # merge the new measures into quality(rules)
quality(rules) <- round(quality(rules), digits = 3) # round to 3 decimal places
inspect(head(rules)) # show the rules with the merged measures
out:
lhs rhs support confidence coverage lift count coverage
[1] {honey} => {whole milk} 0.001 0.733 0.002 2.870 11 0.002
[2] {tidbits} => {rolls/buns} 0.001 0.522 0.002 2.837 12 0.002
[3] {cocoa drinks} => {whole milk} 0.001 0.591 0.002 2.313 13 0.002
[4] {pudding powder} => {whole milk} 0.001 0.565 0.002 2.212 13 0.002
[5] {cooking chocolate} => {whole milk} 0.001 0.520 0.003 2.035 13 0.003
[6] {cereals} => {whole milk} 0.004 0.643 0.006 2.516 36 0.006
fishersExactTest conviction chiSquared
[1] 0.000 2.792 18.030
[2] 0.000 1.706 17.526
[3] 0.001 1.820 13.039
[4] 0.002 1.712 11.624
[5] 0.004 1.551 9.217
[6] 0.000 2.085 44.420
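Of the added measures, conviction is easy to verify by hand. Using the commonly cited definition conviction(X => Y) = (1 - Support(Y)) / (1 - Confidence(X => Y)), the sketch below recomputes it for the first two rules from values printed earlier. The definition is stated here as an assumption about what interestMeasure() computes, since the original text does not spell it out.

```r
# Conviction from the consequent's support and the rule's confidence
conviction <- function(supp_y, conf_xy) {
  (1 - supp_y) / (1 - conf_xy)
}

n <- 9835  # number of transactions in Groceries

# {honey} => {whole milk}: whole milk appears in 2513 transactions
conv_honey <- conviction(2513 / n, 0.7333333)

# {tidbits} => {rolls/buns}: rolls/buns appears in 1809 transactions
conv_tidbits <- conviction(1809 / n, 0.5217391)

print(round(c(honey = conv_honey, tidbits = conv_tidbits), 3))
```

Both values round to the figures in the conviction column (2.792 and 1.706), which supports the assumed definition for this dataset.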
# Sorting rules
# Sort rules by decreasing support
sort(rules, by = "support")
# Show the five rules with the highest support
inspect(sort(rules, by = "support")[1:5])
out:
lhs rhs support confidence coverage lift
[1] {other vegetables, yogurt} => {whole milk} 0.022 0.513 0.043 2.007
[2] {other vegetables, whipped/sour cream} => {whole milk} 0.015 0.507 0.029 1.984
[3] {tropical fruit, yogurt} => {whole milk} 0.015 0.517 0.029 2.025
[4] {root vegetables, yogurt} => {whole milk} 0.015 0.563 0.026 2.203
[5] {pip fruit, other vegetables} => {whole milk} 0.014 0.518 0.026 2.025
count coverage fishersExactTest conviction chiSquared
[1] 219 0.043 0 1.528 155.428
[2] 144 0.029 0 1.510 97.261
[3] 149 0.029 0 1.543 106.934
[4] 143 0.026 0 1.704 129.583
[5] 133 0.026 0 1.543 95.223
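When only the top few rules are needed, arules also lets head() take a `by` argument, which avoids sorting and indexing in separate steps. The sketch below re-mines the rules so that it runs on its own; the name `top_lift` is illustrative.

```r
library(arules)
data("Groceries")

# Re-mine the rules so this snippet is self-contained
rules <- apriori(Groceries,
                 parameter = list(support = 0.001, confidence = 0.5),
                 control = list(verbose = FALSE))

# Five rules with the highest lift, in decreasing order
top_lift <- head(rules, n = 5, by = "lift")
inspect(top_lift)
```

This should select the same rules as sort(rules, by = "lift")[1:5].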
# Extracting rules
# Extract the rules whose consequent is "whole milk" and whose lift is at least 1.2
subset(rules, subset = rhs %in% "whole milk" & lift >= 1.2)
# Show the first five rules that satisfy both conditions
inspect(subset(rules, subset = rhs %in% "whole milk" & lift >= 1.2)[1:5])
out:
lhs rhs support confidence coverage lift count coverage
[1] {honey} => {whole milk} 0.001 0.733 0.002 2.870 11 0.002
[2] {cocoa drinks} => {whole milk} 0.001 0.591 0.002 2.313 13 0.002
[3] {pudding powder} => {whole milk} 0.001 0.565 0.002 2.212 13 0.002
[4] {cooking chocolate} => {whole milk} 0.001 0.520 0.003 2.035 13 0.003
[5] {cereals} => {whole milk} 0.004 0.643 0.006 2.516 36 0.006
fishersExactTest conviction chiSquared
[1] 0.000 2.792 18.030
[2] 0.001 1.820 13.039
[3] 0.002 1.712 11.624
[4] 0.004 1.551 9.217
[5] 0.000 2.085 44.420
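Besides %in%, arules provides two related matching operators that are handy inside subset(): %ain% requires all of the listed items to be present, and %pin% matches item labels partially. A small sketch follows, with the rules re-mined so that it runs on its own; the item choices are arbitrary examples.

```r
library(arules)
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(support = 0.001, confidence = 0.5),
                 control = list(verbose = FALSE))

# Rules whose antecedent contains BOTH other vegetables and yogurt
rules_veg_yogurt <- subset(rules, subset = lhs %ain% c("other vegetables", "yogurt"))

# Rules whose consequent label contains the string "milk"
rules_any_milk <- subset(rules, subset = rhs %pin% "milk")

inspect(head(rules_veg_yogurt, n = 3, by = "lift"))
inspect(head(rules_any_milk, n = 3, by = "lift"))
```

By contrast, %in% with several items matches a rule if any one of them is present.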
# Association rule analysis
library(arules) # load the arules package
library(arulesViz) # load the arulesViz package
data("Groceries") # load the Groceries dataset
summary(Groceries) # summary statistics for the dataset, covering transactions and items
inspect(Groceries[1:10]) # show the first 10 transactions
Size <- size(Groceries) # number of items in each transaction
# Support of every item in Groceries
ItemFrequency <- itemFrequency(Groceries)
# Support of the items whole milk and other vegetables
itemFrequency(Groceries[, c("whole milk", "other vegetables")])
# Bar chart of the 20 items with the highest support
itemFrequencyPlot(Groceries, topN = 20)
# Mine the rules with a minimum support of 0.001 and a minimum confidence of 0.5
rules <- apriori(Groceries, parameter = list(support = 0.001, confidence = 0.5))
inspect(rules[1:10]) # show the first ten rules
# Compute additional quality measures:
# "coverage", "fishersExactTest", "conviction", and "chiSquared"
qualityMeasures <- interestMeasure(rules, measure = c("coverage", "fishersExactTest",
                                                      "conviction", "chiSquared"),
                                   transactions = Groceries)
summary(qualityMeasures)
quality(rules) <- cbind(quality(rules), qualityMeasures) # merge the new measures into quality(rules)
quality(rules) <- round(quality(rules), digits = 3) # round to 3 decimal places
inspect(head(rules)) # show the rules with the merged measures
# Sorting rules
# Sort by lift
rules.sorted <- sort(rules, by = "lift")
# Show the top five rules after sorting
inspect(rules.sorted[1:5])
# Extract the rules whose consequent is "whole milk" and whose lift is at least 1.2
rules.subset <- subset(rules, subset = rhs %in% "whole milk" & lift >= 1.2)
# Show the first five rules that satisfy both conditions
inspect(rules.subset[1:5])
# Scatter plot of the rules
# (recent arulesViz versions may warn that `interactive` is deprecated in favor of the `engine` argument)
plot(rules, method = "scatter", interactive = TRUE)