Tong quan ve phan cum data mining

  • Published on
    17-Jun-2015

  • View
    754

  • Download
    2

Embed Size (px)

Transcript

  • 1. Bin son: Trn Nguyn Hng Mt s thut ton phn cm c bn trong data mining 31/8/2009 - 1 - Mc lc Chng 1. TNG QUAN V PHN CM TRONG KHAI PH D LIU V CC KHI NIM C BN..................................................................................3 1.1. Gii thiu chung .................................................................................................... 3 1.2. Khai ph d liu l g? .......................................................................................... 3 1.3. Qa trnh khai ph tri thc trong c s d liu................................................... 3 1.4. Cc k thut p dng trong khai ph d liu....................................................... 3 1.4.1. Cc k thut tip cn trong khai ph d liu ..................................................... 3 1.4.2. Cc dng d liu c th khai ph ...................................................................... 3 1.5. ng dng ca khai ph d liu............................................................................. 3 1.6. Phn cm d liu v ng dng ............................................................................. 3 1.6.1. Mc ch ca phn cm d liu ........................................................................ 3 1.6.2. Cc bc c bn phn cm .......................................................................... 3 1.6.3. Cc loi c trng ............................................................................................ 3 1.6.4. Cc ng dng ca phn cm............................................................................. 3 1.6.5. Phn loi cc thut ton phn cm.................................................................... 3 1.7. Cc khi nim v nh ngha................................................................................. 3 1.7.1. Cc nh ngha phn cm ................................................................................. 3 1.7.2. Cc o gn gi............................................................................................. 3 1.7.2.1 CC NH NGHA ................................................................................. 3 1.7.2.2. CC O GN GI GIA 2 IM ................................................ 3 1.7.2.3. HM GN GI GIA MT IM V MT TP ............................ 3 1.7.2.4. CC HM GN GI GIA HAI TP................................................. 3 Chng 2. CC THUT TON PHN CM TUN T...................................3 2.1. S cc cch phn cm c th................................................................................. 3 2.2. Thut ton phn cm tun t - BSAS .................................................................. 3 2.3. c lng s cm.................................................................................................. 3 2.4. Sa i thut ton BSAS - Thut ton MBSAS................................................... 3 2.5. Thut ton phn cm tun t hai ngng - TTSAS ............................................ 3 2.6. Giai on tinh ch ................................................................................................. 3 Bi tp chng 2 .......................................................................................................... 3 Chng 3. CC THUT TON PHN CM PHN CP ................................3 3.1. Gii thiu............................................................................................................... 3 3.2. Cc thut ton tch t - GAS................................................................................. 3 3.2.1. Mt s nh ngha............................................................................................. 3 3.2.2. Mt s thut ton tch t da trn l thuyt ma trn .......................................... 3 3.2.3. Monotonicity v Crossover............................................................................... 3 3.2.4. Mt s thut ton tch t da trn l thuyt th............................................. 3 3.2.5. nh hng ca ma trn gn gi ti s phn cm ......................................... 3 3.3. Cc thut ton phn r - GDS .............................................................................. 3 3.3.1. Ci tin s GDS........................................................................................... 3 3.4. La chn phn cm tt nht................................................................................. 3 Bi tp chng 3 .......................................................................................................... 3

2. Bin son: Trn Nguyn Hng Mt s thut ton phn cm c bn trong data mining 31/8/2009 - 2 - BNG T VIT TT T hoc cm t T ting Anh T ting Vit BSAS Basic Sequential Algorithmic Scheme S thut ton phn cm tun t c s CSDL Data Base C s d liu GAS Generalized Agglomerative Scheme S tch t tng qut GDS Generalized Divisive Scheme S phn r tng qut GTAS Graph Theory based Algorithmic Scheme S thut ton da trn l thuyt th KDD Knowledge Discovery in Databases Khai ph tri thc trong c s d liu MBSAS Modified Basic Sequential Algorithmic Scheme S thut ton phn cm tun t c s sa i MST Minimum Spanning Tree Cy khung nh nht MUAS Matrix Updating Algorithmic Scheme S thut ton bin i ma trn SM Similarity Measure o tng t TTSAS Two Threshold Sequential Algorithmic Scheme S thut ton tun t 2 ngng UPGMA Unweighted Pair Group Method Average Phng php trung bnh theo cp khng trng s UPGMC Unweight Pair Group Method Centroid Phng php trng tm theo cp khng chn s WPGMA Weighted Pair Group Method Average Phng php trung bnh theo cp trng s WPGMC Weighted Pair Group Method Centroid Phng php trng tm theo cp trng s 3. Bin son: Trn Nguyn Hng Mt s thut ton phn cm c bn trong data mining 31/8/2009 - 3 - Chng 1. TNG QUAN V PHN CM TRONG KHAI PH D LIU V CC KHI NIM C BN 1.1. Gii thiu chung Nhng nm 60 ca th k trc, ngi ta bt u s dng cc cng c tin hc t chc v khai thc cc CSDL. Cng vi s pht trin vt bc ca cc cng ngh in t v truyn thng, kh nng thu thp v lu tr v x l d liu cho cc h thng tin hc khng ngng c nng cao, theo , lng thng tin c lu tr trn cc thit b nh khng ngng tng ln. Thng k s b cho thy, lng thng tin trn cc h thng tin hc c sau 20 thng li tng gp i [3]. Cui thp k 80 ca th k 20, s pht trin rng khp ca cc CSDL mi quy m to ra s bng n thng tin trn ton cu. Vo thi gian ny, ngi ta bt u cp n khi nim khng hong phn tch d liu tc nghip cung cp thng tin vi yu cu cht lng ngy cng cao cho ngi lm quyt nh trong cc t chc ti chnh, thng mi, khoa hc, ng nh John Naisbett cnh bo Chng ta ang chm ngp trong d liu m vn i tri thc. Lng d liu khng l ny thc s l mt ngun ti nguyn c nhiu gi tr bi thng tin l yu t then cht trong mi hot ng qun l, kinh doanh, pht trin sn xut v dch v, n gip nhng ngi iu hnh v qun l c hiu bit v mi trng v tin trnh hot ng ca t chc mnh trc khi ra quyt nh tc ng n qu trnh hot ng nhm t c cc mc tiu mt cch hiu qu v bn vng. Khai ph d liu (Data Mining) l mt lnh vc mi xut hin, nhm t ng khai thc nhng thng tin, nhng tri thc c tnh tim n, hu ch t nhng CSDL ln cho cc n v, t chc, doanh nghip,. t lm thc y kh nng sn xut, kinh doanh, cnh tranh cho cc n v, t chc ny. Cc kt qu khoa hc cng nhng ng dng thnh cng trong khm ph tri thc, cho thy, khai ph d liu l mt lnh vc pht trin bn vng, mang li nhiu li ch v c nhiu trin vng, ng thi c u th hn hn so vi cc cng c phn tch d liu truyn thng. Hin nay, khai ph d liu ng dng ngy cng rng ri trong cc lnh vc nh: Thng mi, ti chnh, iu tr y hc, vin thng, tin sinh,. 4. Bin son: Trn Nguyn Hng Mt s thut ton phn cm c bn trong data mining 31/8/2009 - 4 - 1.2. Khai ph d liu l g? Khai ph d liu l mt hng nghin cu mi ra i hn mt thp nin tr li y, cc k thut chnh c p dng trong lnh vc ny phn ln c tha k t lnh vc CSDL, hc my, tr tu nhn to, l thuyt thng tin, xc sut thng k, v tnh ton hiu nng cao. Do s pht trin nhanh ca khai ph d liu v phm vi p dng v cc phng php tm kim tri thc, nn c nhiu quan im khc nhau v khai ph d liu. Tuy nhin, mt mc tru tng nht nh, chng ta nh ngha khai ph d liu nh sau [10]: nh ngha : Khai ph d liu l mt qu trnh tm kim, pht hin cc tri thc mi, tim n, hu dng trong CSDL ln. Khai ph tri thc trong CSDL (Knowledge Discovery in Databases - KDD) l mc tiu chnh ca khai ph d liu, do vy hai khi nim khai ph d liu v KDD c cc nh khoa hc trn hai lnh vc xem l tng ng vi nhau. Th nhng, nu phn chia mt cch chi tit th khai ph d liu l mt bc chnh trong qu trnh KDD. 1.3. Qa trnh khai ph tri thc trong c s d liu Khai ph tri thc trong CSDL, KDD, l lnh vc lin quan n cc ngnh nh : thng k, hc my, CSDL, thut ton, trc quan ha d liu, tnh ton song song v hiu nng cao, Qu trnh KDD c th phn thnh cc giai on sau [3][10]: Trch chn d liu : l bc trch chn nhng tp d liu cn c khai ph t cc tp d liu ln (databases, data warehouses, data repositories) ban u theo mt s tiu ch nht nh. Tin x l d liu : l bc lm sch d liu (x l vi d liu khng y , d liu nhiu, d liu khng nht qun, .v.v.), rt gn d liu (s dng hm nhm v tnh tng, cc phng php nn d liu, s dng histograms, ly mu, .v.v.), ri rc ha d liu (ri rc ha da vo histograms, da vo entropy, da vo phn khong, .v.v.). Sau bc ny, d liu s nht qun, y , c rt gn, v c ri rc ha. Bin i d liu : y l bc chun ha v lm mn d liu a d liu v dng thun li nht nhm phc v cho cc k thut khai ph bc sau. 5. Bin son: Trn Nguyn Hng Mt s thut ton phn cm c bn trong data mining 31/8/2009 - 5 - Khai ph d liu: y l bc p dng nhng k thut phn tch (phn nhiu l cc k thut ca hc my) nhm khai thc d liu, trch chn c nhng mu thng tin, nhng mi lin h c bit trong d liu. y c xem l bc quan trng v tn nhiu thi gian nht ca ton qu trnh KDD. nh gi v biu din tri thc: nhng mu thng tin v mi lin h trong d liu c khai ph bc trn c chuyn dng v biu din mt dng gn gi vi ngi s dng nh th, cy, bng biu, lut, .v.v. ng thi bc ny cng nh gi nhng tri thc khai ph c theo nhng tiu ch nht nh. Cc giai on trong KDD c th hin trc quan nh hnh 1.1 di y: Hnh 1-1. Cc bc thc hin trong qu trnh khai ph tri thc 1.4. Cc k thut p dng trong khai ph d liu 1.4.1. Cc k thut tip cn trong khai ph d liu Khai ph tri thc trong CSDL l mt lnh vc lin ngnh, bao gm: T chc d liu, hc my, tr tu nhn to v cc khoa hc khc. Nu theo quan im ca hc my (Machine Learning), th cc k thut trong khai ph d liu, bao gm : Hc c gim st (Supervised learning) : L qu trnh gn nhn lp cho cc phn t trong CSDL da trn mt tp cc v d hun luyn v cc thng tin v nhn lp bit. Hc khng c gim st (Unsupervised learning) : L qu trnh phn chia mt tp d liu thnh cc lp hay l cm (clustering) d liu tng t nhau m cha bit trc cc thng tin v lp hay tp cc v d hun luyn. D liu th Trch chn d liu D liu Tin x l d liu D liu Tin x l Bin i d liu Khai ph d liu Cc mu nh gi v gii thch Biu din tri thc Tri thc 6. Bin son: Trn Nguyn Hng Mt s thut ton phn cm c bn trong data mining 31/8/2009 - 6 - Hc na gim st (Semi - Supervised learning) : L qu trnh phn chia mt tp d liu thnh cc lp da trn mt tp nh cc v d hun luyn v mt s cc thng tin v mt s nhn lp bit trc. Nu cn c vo lp cc bi ton cn gii quyt, th khai ph d liu bao gm cc k thut p dng sau [10]: Phn lp v d on (classification and prediction): xp mt i tng vo mt trong nhng lp bit trc. V d: phn lp cc d liu bnh nhn trong h s bnh n. Hng tip cn ny thng s dng mt s k thut ca hc my nh cy quyt nh (decision tree), mng n ron nhn to (neural network), .v.v. Phn lp v d on cn c gi l hc c gim st. Lut kt hp (association rules): l dng lut biu din tri thc dng kh n gin. V d: 60 % n gii vo siu th mua phn th c ti 80% trong s h s mua thm son. Lut kt hp c ng dng nhiu trong lnh vc kinh doanh, y hc, tin-sinh, ti chnh v th trng chng khon, .v.v. Phn tch chui theo thi gian (sequential/ temporal patterns): tng t nh khai ph lut kt hp nhng c thm tnh th t v tnh thi gian. Hng tip cn ny c ng dng nhiu trong lnh vc ti chnh v th trng chng khon v n c tnh d bo cao. Phn cm (clustering/ segmentation): xp cc i tng theo tng cm d liu t nhin. Phn cm cn c gi l hc khng gim st (unsupervised learning). M t khi nim (concept description and summarization): thin v m t, tng hp v tm tt khi nim. V d: tm tt vn bn. 1.4.2. Cc dng d liu c th khai ph Do khai ph d liu c ng dng rng ri nn n c th lm vic vi rt nhiu kiu d liu khc nhau. Sau y l mt s dng d liu in hnh [10] : CSDL quan h, CSDL a chiu (multidimensional structures, data warehouses), CSDL dng giao dch, CSDL quan h - hng i tng, d liu khng gian v thi gian, d liu chui thi gian, CSDL a phng tin, d liu Text v Web, 7. Bin son: Trn Nguyn Hng Mt s thut ton phn cm c bn trong data mining 31/8/2009 - 7 - 1.5. ng dng ca khai ph d liu Khai ph d liu l mt lnh vc c quan tm v ng dng rng ri. Mt s ng dng in hnh trong khai ph d liu c th lit k nh sau : Phn tch d liu v h tr ra quyt nh, iu tr y hc, Text mining & Web mining, tin-sinh (bio- informatics), ti chnh v th trng chng khon, bo him (insurance), .v.v. 1.6. Phn cm d liu v ng dng 1.6.1. Mc ch ca phn cm d liu Phn loi l mt trong nhng hnh vi nguyn thu nht ca con ngi nhm nm gi lng thng tin khng l h nhn c hng ngy v s x l mi thng tin nh mt thc th n l l khng th. Phn cm d liu nhm mc ch chnh l khai ph cu trc ca mu d liu thnh lp cc nhm d liu t tp d liu ln, theo , cho php ngi ta i su vo phn tch v nghin cu cho tng cm d liu ny nhm khai ph v tm kim cc thng tin tim n, hu ch phc v cho ra quyt nh. Mt vi v d v ngha thc tin ca phn cm d liu nh sau : - Khm ph ra cc v tr a l thun li cho vic xy dng cc kho hng phc v mua bn hng ca mt cng ty thng mi - Xc nh cc cm nh nh nh ca cc loi ng vt nh loi th, chim, trong tp CSDL nh v ng vt nhm phc v cho vic tm kim nh - Xc nh cc nhm ngi bnh nhm cung cp thng tin cho vic phn phi cc thuc iu tr trong y t - Xc nh nhm cc khch hng trong CSDL ngn hng c vn cc u t vo bt ng sn cao Nh vy, phn cm d liu l mt phng php x l thng tin quan trng v ph bin, n nhm khm ph mi lin h gia cc mu d liu bng cch t chc chng thnh cc cm tng t. Tip theo, gi s rng tt c cc dng d liu c biu din bi khi nim c trng, cc c trng hnh thnh nn vector c trng - chiu. Thut ng phn cm c hiu l phn cm d liu. 8. Bin son: Trn Nguyn Hng Mt s thut ton phn cm c bn trong data mining 31/8/2009 - 8 - 1.6.2. Cc bc c bn phn cm Chn la c trng : Cc c trng phi c chn la mt cch hp l c th m ho nhiu nht thng tin lin quan n cng vic quan tm. Mc tiu chnh l phi gim thiu s d tha thng tin gia cc c trng. Cc c trng cn c tin x l trc khi dng trong cc bc sau. Chn o gn gi: y l mt o ch ra mc tng t hay khng tng t gia hai vector c trng. Phi m bo rng tt c cc vector c trng gp phn nh nhau trong vic tnh ton o gn gi v khng c c trng no t hn c trng no. iu ny c m nhn bi qu trnh tin x l. Tiu chun phn cm: iu ny ph thuc vo s gii thch ca chuyn gia cho thut ng d nhn thy da vo loi ca cc cm c chuyn gia cho rng ang n du di tp d liu. Chng hn, mt cm loi cht (compact) ca cc vector c trng trong khng gian -chiu c th d nhn thy theo mt tiu chun, trong khi mt cm loi di v mng li c th c d nhn thy bi mt tiu chun khc. Tiu chun phn loi c th c din t bi hm chi ph hay mt vi loi quy tc khc. Thut ton phn loi: Cn la chn mt s thut ton ring bit nhm lm sng t cu trc cm ca tp d liu. Cng nhn kt qu: Khi c kt qu phn loi th ta phi kim tra tnh ng n ca n. iu ny thng c thc hin bi vic dng cc kim nh ph hp. Gii thch kt qu: Trong nhiu trng hp, chuyn gia trong lnh vc ng dng phi kt hp kt qu phn loi vi bng chng thc nghim v phn tch a ra cc kt lun ng n. Trong mt s trng hp, nn c c bc khuynh hng phn cm; trong bc ny c cc kim nh khc nhau ch ra mt d liu...