Tong quan ve phan cum data mining

  • View
    754

  • Download
    2

Embed Size (px)

Transcript

  • 1. Bin son: Trn Nguyn Hng Mt s thut ton phn cm c bn trong data mining 31/8/2009 - 1 - Mc lc Chng 1. TNG QUAN V PHN CM TRONG KHAI PH D LIU V CC KHI NIM C BN..................................................................................3 1.1. Gii thiu chung .................................................................................................... 3 1.2. Khai ph d liu l g? .......................................................................................... 3 1.3. Qa trnh khai ph tri thc trong c s d liu................................................... 3 1.4. Cc k thut p dng trong khai ph d liu....................................................... 3 1.4.1. Cc k thut tip cn trong khai ph d liu ..................................................... 3 1.4.2. Cc dng d liu c th khai ph ...................................................................... 3 1.5. ng dng ca khai ph d liu............................................................................. 3 1.6. Phn cm d liu v ng dng ............................................................................. 3 1.6.1. Mc ch ca phn cm d liu ........................................................................ 3 1.6.2. Cc bc c bn phn cm .......................................................................... 3 1.6.3. Cc loi c trng ............................................................................................ 3 1.6.4. Cc ng dng ca phn cm............................................................................. 3 1.6.5. Phn loi cc thut ton phn cm.................................................................... 3 1.7. Cc khi nim v nh ngha................................................................................. 3 1.7.1. Cc nh ngha phn cm ................................................................................. 3 1.7.2. Cc o gn gi............................................................................................. 3 1.7.2.1 CC NH NGHA ................................................................................. 3 1.7.2.2. CC O GN GI GIA 2 IM ................................................ 3 1.7.2.3. HM GN GI GIA MT IM V MT TP ............................ 3 1.7.2.4. CC HM GN GI GIA HAI TP................................................. 3 Chng 2. CC THUT TON PHN CM TUN T...................................3 2.1. S cc cch phn cm c th................................................................................. 3 2.2. Thut ton phn cm tun t - BSAS .................................................................. 3 2.3. c lng s cm.................................................................................................. 3 2.4. Sa i thut ton BSAS - Thut ton MBSAS................................................... 3 2.5. Thut ton phn cm tun t hai ngng - TTSAS ............................................ 3 2.6. Giai on tinh ch ................................................................................................. 3 Bi tp chng 2 .......................................................................................................... 3 Chng 3. CC THUT TON PHN CM PHN CP ................................3 3.1. Gii thiu............................................................................................................... 3 3.2. Cc thut ton tch t - GAS................................................................................. 3 3.2.1. Mt s nh ngha............................................................................................. 3 3.2.2. Mt s thut ton tch t da trn l thuyt ma trn .......................................... 3 3.2.3. Monotonicity v Crossover............................................................................... 3 3.2.4. Mt s thut ton tch t da trn l thuyt th............................................. 3 3.2.5. nh hng ca ma trn gn gi ti s phn cm ......................................... 3 3.3. Cc thut ton phn r - GDS .............................................................................. 3 3.3.1. Ci tin s GDS........................................................................................... 3 3.4. La chn phn cm tt nht................................................................................. 3 Bi tp chng 3 .......................................................................................................... 3

2. Bin son: Trn Nguyn Hng Mt s thut ton phn cm c bn trong data mining 31/8/2009 - 2 - BNG T VIT TT T hoc cm t T ting Anh T ting Vit BSAS Basic Sequential Algorithmic Scheme S thut ton phn cm tun t c s CSDL Data Base C s d liu GAS Generalized Agglomerative Scheme S tch t tng qut GDS Generalized Divisive Scheme S phn r tng qut GTAS Graph Theory based Algorithmic Scheme S thut ton da trn l thuyt th KDD Knowledge Discovery in Databases Khai ph tri thc trong c s d liu MBSAS Modified Basic Sequential Algorithmic Scheme S thut ton phn cm tun t c s sa i MST Minimum Spanning Tree Cy khung nh nht MUAS Matrix Updating Algorithmic Scheme S thut ton bin i ma trn SM Similarity Measure o tng t TTSAS Two Threshold Sequential Algorithmic Scheme S thut ton tun t 2 ngng UPGMA Unweighted Pair Group Method Average Phng php trung bnh theo cp khng trng s UPGMC Unweight Pair Group Method Centroid Phng php trng tm theo cp khng chn s WPGMA Weighted Pair Group Method Average Phng php trung bnh theo cp trng s WPGMC Weighted Pair Group Method Centroid Phng php trng tm theo cp trng s 3. Bin son: Trn Nguyn Hng Mt s thut ton phn cm c bn trong data mining 31/8/2009 - 3 - Chng 1. TNG QUAN V PHN CM TRONG KHAI PH D LIU V CC KHI NIM C BN 1.1. Gii thiu chung Nhng nm 60 ca th k trc, ngi ta bt u s dng cc cng c tin hc t chc v khai thc cc CSDL. Cng vi s pht trin vt bc ca cc cng ngh in t v truyn thng, kh nng thu thp v lu tr v x l d liu cho cc h thng tin hc khng ngng c nng cao, theo , lng thng tin c lu tr trn cc thit b nh khng ngng tng ln. Thng k s b cho thy, lng thng tin trn cc h thng tin hc c sau 20 thng li tng gp i [3]. Cui thp k 80 ca th k 20, s pht trin rng khp ca cc CSDL mi quy m to ra s bng n thng tin trn ton cu. Vo thi gian ny, ngi ta bt u cp n khi nim khng hong phn tch d liu tc nghip cung cp thng tin vi yu cu cht lng ngy cng cao cho ngi lm quyt nh trong cc t chc ti chnh, thng mi, khoa hc, ng nh John Naisbett cnh bo Chng ta ang chm ngp trong d liu m vn i tri thc. Lng d liu khng l ny thc s l mt ngun ti nguyn c nhiu gi tr bi thng tin l yu t then cht trong mi hot ng qun l, kinh doanh, pht trin sn xut v dch v, n gip nhng ngi iu hnh v qun l c hiu bit v mi trng v tin trnh hot ng ca t chc mnh trc khi ra quyt nh tc ng n qu trnh hot ng nhm t c cc mc tiu mt cch hiu qu v bn vng. Khai ph d liu (Data Mining) l mt lnh vc mi xut hin, nhm t ng khai thc nhng thng tin, nhng tri thc c tnh tim n, hu ch t nhng CSDL ln cho cc n v, t chc, doanh nghip,. t lm thc y kh nng sn xut, kinh doanh, cnh tranh cho cc n v, t chc ny. Cc kt qu khoa hc cng nhng ng dng thnh cng trong khm ph tri thc, cho thy, khai ph d liu l mt lnh vc pht trin bn vng, mang li nhiu li ch v c nhiu trin vng, ng thi c u th hn hn so vi cc cng c phn tch d liu truyn thng. Hin nay, khai ph d liu ng dng ngy cng rng ri trong cc lnh vc nh: Thng mi, ti chnh, iu tr y hc, vin thng, tin sinh,. 4. Bin son: Trn Nguyn Hng Mt s thut ton phn cm c bn trong data mining 31/8/2009 - 4 - 1.2. Khai ph d liu l g? Khai ph d liu l mt hng nghin cu mi ra i hn mt thp nin tr li y, cc k thut chnh c p dng trong lnh vc ny phn ln c tha k t lnh vc CSDL, hc my, tr tu nhn to, l thuyt thng tin, xc sut thng k, v tnh ton hiu nng cao. Do s pht trin nhanh ca khai ph d liu v phm vi p dng v cc phng php tm kim tri thc, nn c nhiu quan im khc nhau v khai ph d liu. Tuy nhin, mt mc tru tng nht nh, chng ta nh ngha khai ph d liu nh sau [10]: nh ngha : Khai ph d liu l mt qu trnh tm kim, pht hin cc tri thc mi, tim n, hu dng trong CSDL ln. Khai ph tri thc trong CSDL (Knowledge Discovery in Databases - KDD) l mc tiu chnh ca khai ph d liu, do vy hai khi nim khai ph d liu v KDD c cc nh khoa hc trn hai lnh vc xem l tng ng vi nhau. Th nhng, nu phn chia mt cch chi tit th khai ph d liu l mt bc chnh trong qu trnh KDD. 1.3. Qa trnh khai ph tri thc trong c s d liu Khai ph tri thc trong CSDL, KDD, l lnh vc lin quan n cc ngnh nh : thng k, hc my, CSDL, thut ton, trc quan ha d liu, tnh ton song song v hiu nng cao, Qu trnh KDD c th phn thnh cc giai on sau [3][10]: Trch chn d liu : l bc trch chn nhng tp d liu cn c khai ph t cc tp d liu ln (databases, data warehouses, data repositories) ban u theo mt s tiu ch nht nh. Tin x l d liu : l bc lm sch d liu (x l vi d liu khng y , d liu nhiu, d liu khng nht qun, .v.v.), rt gn d liu (s dng hm nhm v tnh tng, cc phng php nn d liu, s dng histograms, ly mu, .v.v.), ri rc ha d liu (ri rc ha da vo histograms, da vo entropy, da vo phn khong, .v.v.). Sau bc ny, d liu s nht qun, y , c rt gn, v c ri rc ha. Bin i d liu : y l bc chun ha v lm mn d liu a d liu v dng thun li nht nhm phc v cho cc k thut khai ph bc sau. 5. Bin son: Trn Nguyn Hng Mt s thut ton phn cm c bn trong data mining 31/8/2009 - 5 - Khai ph d liu: y l bc p dng nhng k thut phn tch (phn nhiu l cc k thut ca hc my) nhm khai thc d liu, trch chn c nhng mu thng tin, nhng mi lin h c bit trong d liu. y c xem l bc quan trng v tn nhiu thi gian nht ca ton qu trnh KDD. nh gi v biu din tri thc: nhng mu thng tin v mi lin h trong d liu c khai ph bc trn c chuyn dng v biu din mt dng gn gi vi ngi s dng nh th, cy, bng biu, lut, .v.v. ng thi bc ny cng nh gi nhng tri thc khai ph c theo nhng tiu ch nht nh. Cc giai on trong KDD c th hin trc quan nh hnh 1.1 di y: Hnh 1-1. Cc bc thc hin trong qu trnh khai ph tri thc 1.4. Cc k thut p dng trong khai ph d liu 1.4.1. Cc k thut tip cn trong khai ph d liu Khai ph tri thc trong CSDL l mt lnh vc lin ngnh, bao gm: T chc d liu, hc my, tr tu nhn to v cc khoa hc khc. Nu theo quan im ca hc my (Machine Learning), th cc k thut trong khai ph d liu, bao gm : Hc c gim st (Supervised learning