A Data Mining Example



wen008215 · posted 2010-04-07 23:01:00

Source: http://old.blog.edu.cn/user1/9065/archives/2005/145093.shtml

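Goal:

Given a set of attributes, decide whether a loan customer is creditworthy (that is, whether the customer's status is "good" or "bad").

Approach in brief:

The data set contains 666 historical records of loan customers, each described by 21 attributes. Not all 21 attributes help in judging a customer's creditworthiness, so we first drop the less relevant ones. Next, we use clustering to discretize the attributes that hold continuous values. After this preprocessing, we mine the data for association rules that are useful for reference.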

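Basic steps:

1. Removing redundant attributes

Chance association in rules
The data has a Boolean attribute, foreign workers, with values yes or no. Tuples with the value yes make up 96% of all tuples. Confidence is a conditional probability, so on its own it cannot tell whether two attributes are really associated or merely co-occur by chance: a rule whose consequent is foreign workers = yes would reach a confidence near 0.96 even if its antecedent were completely unrelated. Association rules involving foreign workers therefore give us no additional useful information.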
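One common way to make this point concrete is to compare a rule's confidence with the base rate of its consequent; the ratio is usually called lift. The sketch below is an illustration only, not part of the original analysis. The total of 666 and the share of roughly 96% come from the text, while the counts 200 and 192 for a hypothetical antecedent X are made up.

```python
def confidence_and_lift(n_total, n_antecedent, n_consequent, n_both):
    """Return (confidence, lift) of the rule antecedent ==> consequent."""
    confidence = n_both / n_antecedent      # P(consequent | antecedent)
    base_rate = n_consequent / n_total      # P(consequent)
    return confidence, confidence / base_rate

# 639 of 666 is about 96% (from the text). The counts 200 and 192 are
# hypothetical: records matching some antecedent X, and records matching
# both X and foreign workers = yes.
conf, lift = confidence_and_lift(n_total=666, n_antecedent=200,
                                 n_consequent=639, n_both=192)
print(f"confidence = {conf:.2f}, lift = {lift:.3f}")
```

A lift close to 1 means the antecedent barely changes the probability of the consequent, however high the confidence looks.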

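χ² test of dependence
First, we use the χ² test to check whether each attribute (except "Duration in months", "Credit Amount" and "Age in years") depends on the "good/bad" attribute.

The procedure is illustrated below using "Credit History":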

Credit History	Bad	Good	Total
All Paid Duly	17	10	27
Bank Paid Duly	18	17	35
Critical	38	157	195
Delay	17	40	57
Duly Till Now	119	233	352
Total	209	457	666

Degrees of freedom: 4
Chi-square value = 32.8752245686945
p-value is less than or equal to 0.001.
The distribution is significant.
The χ² test shows that the attribute "Credit History" and the attribute "Good/Bad" are dependent.
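The statistic reported for "Credit History" can be reproduced from the contingency table with plain Python. This is a sketch of the standard Pearson χ² computation; the function name is ours, and the critical value 9.488 (α = 0.05, 4 degrees of freedom) is read from a standard χ² table rather than computed.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column attributes were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Rows: All Paid Duly, Bank Paid Duly, Critical, Delay, Duly Till Now.
# Columns: Bad, Good.
credit_history = [[17, 10], [18, 17], [38, 157], [17, 40], [119, 233]]
stat, dof = chi_square(credit_history)

CRITICAL_005_DF4 = 9.488  # chi-square table value for alpha = 0.05, dof = 4
print(f"chi-square = {stat:.4f}, dof = {dof}, "
      f"dependent at 0.05: {stat > CRITICAL_005_DF4}")
```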

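After repeated tests, only "Status of Checking Account", "Credit History", "Purpose", "Savings Account / Bonds", "Present Employment Since", "Property", "Housing" and "Foreign Worker" show a significant (α = 0.05) dependence on the "good/bad" attribute. We therefore focus on these attributes and, where possible, group or regroup their values, hoping to end up with better association rules between them and the "good/bad" attribute.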

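2. Discretizing continuous variables (discretization / classification / grouping)

After the χ² tests, we use Clustering, Classification and Equal-width methods to discretize the attributes "Duration in months", "Credit Amount" and "Age in years", and to group or regroup the values of the significantly dependent attributes listed above.

Equal-width

We use Weka's Discretize filter to discretize continuous variables. The "Duration in months" attribute serves as the example:

Using weka.filters.unsupervised.attribute.Discretize, we split the values of "Duration in months" into 3 classes: "Short-term", "Mid-term" and "Long-term". The resulting counts are: "Short-term" (0-12 months), 245 records; "Mid-term" (13-24 months), 270 records; and "Long-term" (25 months and above), 151 records.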

Figure 1
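For readers without Weka at hand, the idea of equal-width binning can be sketched in a few lines of Python. This is not Weka's implementation: equal_width_bins is a generic helper of ours that assumes the values are not all identical, duration_label simply applies the cut points reported above, and the sample durations are hypothetical.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would fall into bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def duration_label(months):
    """Label a duration using the cut points reported in the text."""
    if months <= 12:
        return "Short-term"
    if months <= 24:
        return "Mid-term"
    return "Long-term"

durations = [0, 6, 12, 18, 24, 30, 36]  # hypothetical values, in months
print(equal_width_bins(durations, 3))
print([duration_label(m) for m in durations])
```

The two functions treat boundary values differently: the generic helper uses half-open intervals, while the labels follow the inclusive ranges given in the text.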

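Simple K-Means Clustering

The K-Means algorithm partitions the data into a preset number of clusters. It first picks several data points at random as the initial cluster centroids. It then assigns each point to its nearest centroid, which fixes the cluster boundaries, and recomputes the position of each centroid. Repeating these two steps yields the desired number of clusters, which can be used to discretize a continuous variable or to further group the values of an attribute.

The Simple K-Means discretization is illustrated using "Credit Amount":
Using the SimpleKMeans algorithm from Weka's Cluster function, we discretize the values of "Credit Amount" into 4 classes: "low" (0-2500), "mid" (2501-4400), "high" (4401-8500) and "veryhigh" (above 8500). See the figures below.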

Figure 2

Figure 3
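The assign-and-recompute loop of K-Means is short enough to sketch for a single numeric attribute. The code below is a minimal illustration under stated assumptions, not Weka's SimpleKMeans: it starts from evenly spaced points of the sorted data instead of random ones so that the result is repeatable, and the sample amounts are hypothetical.

```python
def kmeans_1d(values, k, max_iter=100):
    """Cluster 1-D values into k groups and return the sorted centroids."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start (an assumption of this sketch): evenly spaced
    # points of the sorted data instead of randomly chosen ones.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Recompute each centroid as its cluster mean (keep it if empty).
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)

def cut_points(centroids):
    """Cluster boundaries: midpoints between neighbouring centroids."""
    return [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]

# Hypothetical credit amounts, not taken from the data set.
amounts = [500, 900, 1200, 3000, 3500, 4000, 6000, 7000, 9500, 12000]
centroids = kmeans_1d(amounts, k=4)
print(centroids, cut_points(centroids))
```

The midpoints between neighbouring centroids play the role of the interval limits, in the same way that 2500, 4400 and 8500 separate the four classes of "Credit Amount".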

Classification

We also re-discretize the property attribute by regrouping its values, hoping to obtain more useful association rules.
Figure 4

Attributes and their value groups

Attribute / Value groups

Status of Existing Checking Account
• 0DM
• <200DM
• >200DM
• no checking account

Duration in month
• <13 (short-term)
• 13-24 (mid-term)
• >24 (long-term)

Credit History
• all paid duly
• bank paid duly
• critical
• duly till now
• delay

Purpose
• tangible
  o car
    ▪ used
    ▪ new
  o household
    ▪ furniture
    ▪ radio-tv
• intangible
  o business
  o repair
  o education
  o retraining

Credit Amount
• 0-2500 (low)
• 2501-4400 (mid)
• 4401-8500 (high)
• >8500 (veryhigh)

Savings Account / Bonds
• <100DM
• 100-500DM
• 500-1000DM
• >1000DM
• unknown / no savings account

Present Employment Since
• unemployed
• 1-4
• 4 and above

Number of People being Liable to Provide Maintenance for
• one
• two

Personal Status and Sex
• single male
• married male
• divorced male
• divorced female

Other Debtors / Guarantors
• none
• co-applicant
• guarantor

Property
• real estate
• building society
• car
• unknown

Age in years
• <22 (young)
• 23-35 (mid)
• 36-51 (old)
• >51 (retired)

Other Installment Plans
• banks
• stores
• none

Housing
• rent
• own
• for free

Number of Existing Credits at This Bank
• one
• two

Status
• good
• bad
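A grouping such as the Purpose hierarchy can be held in a small lookup table, so that each record can be rewritten at a coarser level before rule mining. The sketch below is one possible representation, not the authors' tooling. The leaf spellings (for example "used-car") are assumed, and the intangible purposes, which have no middle level in the list, are mapped to themselves.

```python
# Leaf value -> (middle level, top level). The grouping follows the Purpose
# list above; the exact leaf spellings are assumptions of this sketch.
PURPOSE_HIERARCHY = {
    "used-car":   ("car", "tangible"),
    "new-car":    ("car", "tangible"),
    "furniture":  ("household", "tangible"),
    "radio-tv":   ("household", "tangible"),
    "business":   ("business", "intangible"),
    "repair":     ("repair", "intangible"),
    "education":  ("education", "intangible"),
    "retraining": ("retraining", "intangible"),
}

def generalize(purpose, level):
    """Return the purpose at level 0 (leaf), 1 (middle) or 2 (top)."""
    if level == 0:
        return purpose
    return PURPOSE_HIERARCHY[purpose][level - 1]

print(generalize("radio-tv", 1), generalize("radio-tv", 2))
```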

3. Association rules

Using Weka's association function, we obtained a large number of association rules. Among them, the following 15 are the more useful ones:

1. Statusofexistingcheckingaccount=noaccount Purpose-3=Tangible Personalstatusandsex=single-male Other-debtors/guarantors=none Otherinstallmentplans=none Housing=own ==> Status=good. conf:(0.95)

2. Statusofexistingcheckingaccount=noaccount Credithistory=dulytillnow Housing=own Numberofexistingcreditsatthisbank=one Liabletoprovidemaintenancefor=one ==> status=good. conf:(0.92)

3. Statusofexistingcheckingaccount=noaccount Presentemploymentsince=over-seven ==> status=good. conf:(0.91)

4. Statusofexistingcheckingaccount=noaccount Credithistory=dulytillnow Numberofexistingcreditsatthisbank=one Liabletoprovidemaintenancefor=one ==> status=good. conf:(0.90)

5. Purpose=radio-tv Housing=own Job=skilled ==> status=good. conf:(0.89)

6. Presentemploymentsince=>4-years Ageinyears=middleage Job=skilled ==> status=good. conf:(0.88)

7. Statusofexistingcheckingaccount=noaccount Durationinmonth=mid-term Housing=own ==> status=good. conf:(0.88)

8. Statusofexistingcheckingaccount=noaccount Credithistory=dulytillnow Housing=own ==> status=good. conf:(0.87)

9. Purpose-3=Tangible Personalstatusandsex=single-male Other-debtors/guarantors=none Otherinstallmentplans=none Housing=own Job=skilled ==> Status=good. conf:(0.86)

10. Statusofexistingcheckingaccount=noaccount Property=car Housing=own ==> Status=good. conf:(0.86)

11. Purpose-2=Household Presentemploymentsince=>4-years ==> Status=good. conf:(0.85)

12. Credit-amount-simplekmeans=low Property=real-estate ==> Status=good. conf:(0.77)

13. Purpose-2=Household Credit-amount-simplekmeans=low ==> Status=good. conf:(0.76)

14. Purpose-2=Household Job=skilled ==> Status=good. conf:(0.73)

15. Presentemploymentsince=>4-years Job=skilled ==> Status=good. conf:(0.72)
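To apply rules like these to new records programmatically, each line can be split into an antecedent, a consequent and a confidence. The parser below is a sketch written for the rule lines exactly as they are printed in this post; it is not a parser for Weka's full output, and the function names and the sample record are ours. Note that a value such as ">4-years" contains "=>", so each item is split on its first "=" only.

```python
import re

# Optional rule number, antecedent items, "==>", consequent, confidence.
RULE_RE = re.compile(
    r"^\s*(?:\d+\.\s+)?(?P<lhs>.+?)\s*==>\s*(?P<rhs>\S+?)\.?"
    r"\s+conf:\((?P<conf>[\d.]+)\)\s*$"
)

def parse_rule(line):
    """Return (antecedent dict, (attribute, value) consequent, confidence)."""
    m = RULE_RE.match(line)
    if m is None:
        raise ValueError(f"not a rule line: {line!r}")
    antecedent = dict(item.split("=", 1) for item in m.group("lhs").split())
    attribute, value = m.group("rhs").split("=", 1)
    return antecedent, (attribute, value), float(m.group("conf"))

def rule_applies(antecedent, record):
    """True if the record has every attribute value the rule requires."""
    return all(record.get(k) == v for k, v in antecedent.items())

line = "15. Presentemploymentsince=>4-years Job=skilled ==> Status=good. conf:(0.72)"
antecedent, consequent, rule_conf = parse_rule(line)

# Hypothetical record using the same attribute spellings as the rules.
record = {"Presentemploymentsince": ">4-years", "Job": "skilled", "Housing": "rent"}
print(antecedent, consequent, rule_conf, rule_applies(antecedent, record))
```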

4. Weightage:

From the association rules obtained, we find that certain values of the following 13 attributes tend to go with "Status" = good:
• Status of existing checking account: No checking account
• Duration in month: Mid-term (13-24 months)
• Credit history: All paid duly; No existing credit
• Purpose: Household
• Credit amount: Low (0-2500)
• Present employment since: >4
• Personal status and sex: Single male
• Other debtors / guarantors: None
• Property: Real estate; Car
• Age in years: Mid (23-35)
• Other installment plans: None
• Housing: Own
• Job: Skilled

Based on the historical data, if a customer has any 7 of the 13 attribute values above, we consider that customer's Status to be good.

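5. Prediction:

We can use the weightage method above to predict the "Status" of customers in the Germantest database. We take one customer's record from the Germantest database and predict his "Status":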

no-account,24, duly-till-now, new-car,1393, less100DM, four-years,2, single-male, guarantor,2, real-estate,31, none, own,1, skilled,1, no, yes

This customer scores 10 points, so the predicted Status is good.
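The scoring rule is simple enough to write down as code. The sketch below rests on several assumptions: the field names, the mapping of the comma-separated values to those fields (inferred from their order), the spellings of the favourable values, and the reading of "four-years" as satisfying the ">4" employment criterion. The text also leaves age 22 unassigned; here it is placed in "young".

```python
def duration_group(months):
    return "short-term" if months <= 12 else "mid-term" if months <= 24 else "long-term"

def amount_group(amount):
    if amount <= 2500:
        return "low"
    if amount <= 4400:
        return "mid"
    return "high" if amount <= 8500 else "veryhigh"

def age_group(age):
    if age <= 22:  # 22 is not covered by the text; assigned to "young" here
        return "young"
    if age <= 35:
        return "mid"
    return "old" if age <= 51 else "retired"

# Favourable values per attribute, following the 13 bullets of step 4.
FAVOURABLE = {
    "checking": {"no-account"},
    "duration": {"mid-term"},
    "credit_history": {"all-paid-duly", "no-existing-credit"},
    "purpose": {"furniture", "radio-tv"},        # the "household" purposes
    "credit_amount": {"low"},
    "employment": {"four-years", "over-seven"},  # assumed encodings of ">4"
    "personal_status": {"single-male"},
    "other_debtors": {"none"},
    "property": {"real-estate", "car"},
    "age": {"mid"},
    "installment_plans": {"none"},
    "housing": {"own"},
    "job": {"skilled"},
}

def score(record):
    """Count how many of the 13 attributes take a favourable value."""
    return sum(record[name] in good for name, good in FAVOURABLE.items())

def predicted_good(record, threshold=7):
    return score(record) >= threshold

# The sample customer, with fields named and discretized as assumed above.
customer = {
    "checking": "no-account", "duration": duration_group(24),
    "credit_history": "duly-till-now", "purpose": "new-car",
    "credit_amount": amount_group(1393), "employment": "four-years",
    "personal_status": "single-male", "other_debtors": "guarantor",
    "property": "real-estate", "age": age_group(31),
    "installment_plans": "none", "housing": "own", "job": "skilled",
}
print(score(customer), predicted_good(customer))
```

Under these assumptions the sample customer misses only the credit history, purpose and other debtors criteria, which agrees with the score of 10 reported in the text.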

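Conclusion:

After extensive preprocessing, including hypothesis testing, classification, clustering and discretization, we removed some attributes on objective grounds and discretized the continuous attributes. Finally, from a "massive" set of association rules we selected some of the more useful ones to help us judge a customer's creditworthiness.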