Ch1 - Data Warehouses and Data Mining


DATA MINING
Chapter 1: DATA WAREHOUSING & DATA MINING
Introduction to Data Warehousing & Data Mining

Outline:
1. Overview
2. Data warehousing
3. Decision support & online analytical processing (OLAP)
4. Data mining


Data, Information, Knowledge

Data is a collection of raw facts organized in logical forms. The smallest unit of data a computer recognizes is a single character, for example the letter A, the digit 1, or the symbol *. A character is represented by 8 bits (in a single-byte encoding). Bits are commonly used to measure information.

Knowledge is regarded as integrated information, comprising facts and the relationships among them. Knowledge can be seen as data at a high level of abstraction and generalization.

Knowledge discovery is the process of identifying patterns or models in data that are analytic, synthetic, valid, useful, and understandable.
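The claim that a character occupies 8 bits can be checked with a minimal sketch. It assumes a single-byte encoding such as ASCII; the helper name `char_to_bits` is hypothetical and introduced only for illustration.

```python
def char_to_bits(ch: str) -> str:
    """Return the 8-bit pattern of a single-byte (ASCII/Latin-1) character."""
    code = ord(ch)
    if code > 255:
        # Assumption: only single-byte characters are handled in this sketch.
        raise ValueError("character does not fit in a single byte")
    return format(code, "08b")


# The three example characters from the slide: a letter, a digit, a symbol.
for ch in ("A", "1", "*"):
    print(ch, char_to_bits(ch))
```

Characters outside the single-byte range need more than 8 bits in encodings such as UTF-8, which is why the sketch rejects them.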


Data Warehousing
A process of transforming data into information and making it available to users in a timely enough manner to make a difference [Forrester Research, 4/1996]


What is a Data Warehouse?
W.H. Inmon: A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of decision making.

A data warehouse comprises:
- One or more tools for extracting data
- An integrated, subject-oriented, stable database, consolidated by setting up data tables


Purpose of the data warehouse
Main goal: the data warehouse must be able to meet every information need of its users.

- Help the organization's staff do their work well and efficiently, for example by making sound, fast decisions, selling more goods, raising productivity, and earning higher profits.
- Help the organization identify, manage, and run its projects and business operations effectively and accurately.
- Integrate data and metadata from many different sources.


Data warehouse solutions for reaching these goals:
- Improve data quality by cleaning and refining data along specific subject areas.
- Consolidate and link data.
- Synchronize the data sources with the DW.
- Identify and unify the operational database management systems as standard tools serving the DW.
- Manage metadata.
- Provide information that is integrated, summarized, or linked, and organized by subject.
- Serve decision support systems (DSS), operational information systems, or ad hoc queries.


Properties of a data warehouse:
- Integration
- Data is time-stamped and historical
- Data is stable (nonvolatility)
- Data does not fluctuate
- Data is summarized


A data warehouse consists of 7 components:
1. Source data and the tools that extract, clean, and transform it.
2. The metadata repository (MetaData).
3. The techniques for building the warehouse.
4. Subject-oriented data stores (data marts): several data marts can be combined into one data warehouse; conversely, one data warehouse can be split into several data marts.
5. Query, reporting, online analytical processing (OLAP), and data mining tools, which are the techniques for exploiting the warehouse to obtain knowledge.
6. Data warehouse administration.
7. The information delivery system.


A data warehouse is a very large database

[Figure: bar chart of the share of respondents (0% to 35%) by warehouse size, comparing initial size with size projected for 2Q96. Size buckets: 5GB, 5-9GB, 10-19GB, 20-49GB, 50-99GB, 100-249GB, 250-499GB, 500GB-1TB. Source: META Group, Inc.]


Terabytes  -- 10^12 bytes: Walmart -- 24 Terabytes
Petabytes  -- 10^15 bytes: Geographic Information Systems
Exabytes   -- 10^18 bytes: National Medical Records
Zettabytes -- 10^21 bytes: Weather images
Yottabytes -- 10^24 bytes: Intelligence Agency Videos
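To make the scale of these units concrete, here is a small sketch of the arithmetic using the decimal powers listed above. The names `UNITS` and `to_bytes` are hypothetical helpers, not part of the slides.

```python
# Decimal storage units as powers of ten, matching the list above.
UNITS = {
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
    "YB": 10**24,
}


def to_bytes(value: int, unit: str) -> int:
    """Convert a quantity in the given unit to bytes."""
    return value * UNITS[unit]


# The 24-terabyte warehouse cited on the slide, expressed in bytes.
walmart_bytes = to_bytes(24, "TB")
print(walmart_bytes)

# Each unit in the list is 1000 times larger than the one before it.
print(UNITS["PB"] // UNITS["TB"])
```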

Differences between operational database systems and information systems

Feature           | Operational database                 | Information system
Characteristic    | Operational processing               | Informational processing
Orientation       | Transaction                          | Analysis
User              | Clerk, DBA, database professional    | Manager, analyst, executive
Function          | Day-to-day operations                | Decision support
Data              | Current                              | Historical (long-term)
View              | Detailed, flat relational            | Summarized, multidimensional
DB design         | Application-oriented                 | Subject-oriented
Unit of work      | Short, simple transaction            | Complex query
Access            | Read/write                           | Mostly read-only
Focus             | Data in                              | Information out
Records accessed  | Tens                                 | Millions
Number of users   | Thousands                            | Hundreds
Data size         | 100 MB to GB                         | 100 GB to TB
Priority          | High performance, high availability  | High flexibility, end-user autonomy
Metric            | Transaction processing speed         | Query speed


Data warehousing:
- Applies techniques for consolidating and managing data from many different sources.
- Aims to answer business questions and support decisions that could not be handled before.
- A decision support database is built and maintained separately from the organization's operational database.
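The consolidation step described above can be sketched as follows. Everything here is assumed for illustration: the two source systems, their field names, and the unified record layout are hypothetical and do not come from the slides.

```python
from datetime import date

# Hypothetical extracts from two operational systems with different conventions.
sales_system = [{"cust": " An ", "amount": "120.50", "day": "1996-04-01"}]
billing_system = [{"customer_name": "BINH", "total_cents": 9900, "date": "02/04/1996"}]


def from_sales(row):
    """Clean and convert a sales-system row (ISO dates, amounts as strings)."""
    y, m, d = map(int, row["day"].split("-"))
    return {
        "customer": row["cust"].strip().title(),
        "amount": float(row["amount"]),
        "date": date(y, m, d),
        "source": "sales",
    }


def from_billing(row):
    """Clean and convert a billing-system row (dd/mm/yyyy dates, cents)."""
    d, m, y = map(int, row["date"].split("/"))
    return {
        "customer": row["customer_name"].strip().title(),
        "amount": row["total_cents"] / 100,
        "date": date(y, m, d),
        "source": "billing",
    }


# The warehouse holds one integrated record layout, kept apart from the sources.
warehouse = [from_sales(r) for r in sales_system]
warehouse += [from_billing(r) for r in billing_system]

for record in warehouse:
    print(record)
```

Keeping one converter per source isolates each system's quirks, so adding a third source would not touch the existing converters.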


A data warehouse is exploited in 3 main ways:
1. Traditional exploitation: queries and reports, yielding refined data.
2. Online analytical processing (OLAP): analysis and hypothesis testing; it does not itself generate hypotheses.
3. Data mining: produces knowledge from the data.


ONLINE ANALYTICAL PROCESSING (OLAP)
In-depth decision support, with 4 main characteristics:
- Multidimensional data analysis
- Advanced database support
- Easy-to-use interface for end users
- Support for client/server architecture

Data in the warehouse is presented in multidimensional form, called a cube. Each dimension describes one characteristic of the data.
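As a rough illustration of the cube idea, the sketch below stores each cell under one value per dimension. The dimensions (time, product, region) and the figures are assumed sample data, and `slice_cube` and `aggregate` are hypothetical helper names.

```python
from collections import defaultdict

# Assumed dimensions; each cube cell is addressed by one value per dimension.
DIMENSIONS = ("time", "product", "region")

cube = {
    ("1996-Q1", "TV", "North"): 10,
    ("1996-Q1", "TV", "South"): 7,
    ("1996-Q1", "Radio", "North"): 4,
    ("1996-Q2", "TV", "North"): 12,
    ("1996-Q2", "Radio", "South"): 5,
}


def slice_cube(cube, dimension, value):
    """Keep only the cells whose coordinate on `dimension` equals `value`."""
    i = DIMENSIONS.index(dimension)
    return {key: v for key, v in cube.items() if key[i] == value}


def aggregate(cube, dimension):
    """Sum the cell values for each member of one dimension."""
    i = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for key, v in cube.items():
        totals[key[i]] += v
    return dict(totals)


print(slice_cube(cube, "region", "North"))
print(aggregate(cube, "product"))
```

A dictionary keyed by tuples stores only the non-empty cells, which suits sparse cubes; dense cubes are more often held in multidimensional arrays.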

MULTIDIMENSIONAL DATA ANALYSIS TECHNIQUES
Advanced data presentation functions:
- 3-D graphics, Pivot Tables, Crosstabs
- Compatibility with spreadsheets and statistical packages
- Advanced data aggregation, consolidation, and classification across time dimensions
- Advanced computational functions
- Advanced data modeling functions

ADVANCED DATABASE SUPPORT
Features of advanced database processing:
- Access to many kinds of DBMS, flat files, and internal and external data sources
- Access to aggregated data warehouse data
- Advanced data navigation (drill-downs and roll-ups)
- Ability to map end-user requests to the appropriate data sources
- Support for very large databases
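Drill-down and roll-up move between levels of detail. The following is a minimal sketch assuming a day, month, year time hierarchy and ISO-formatted date keys; the sample figures and the function names `roll_up` and `drill_down` are hypothetical.

```python
from collections import defaultdict

# Assumed sample data: sales per day, keyed by ISO date strings.
daily_sales = {
    "1996-04-01": 5,
    "1996-04-02": 3,
    "1996-05-01": 4,
    "1997-01-15": 6,
}

# Length of the ISO date prefix that identifies each level of the hierarchy.
PREFIX_LENGTH = {"year": 4, "month": 7, "day": 10}


def roll_up(daily, level):
    """Aggregate daily values to a coarser level (month or year)."""
    width = PREFIX_LENGTH[level]
    totals = defaultdict(int)
    for day, amount in daily.items():
        totals[day[:width]] += amount
    return dict(totals)


def drill_down(daily, prefix):
    """Return the detailed daily values behind one aggregated member."""
    return {day: amount for day, amount in daily.items() if day.startswith(prefix)}


print(roll_up(daily_sales, "month"))
print(roll_up(daily_sales, "year"))
print(drill_down(daily_sales, "1996-04"))
```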

EASY-TO-USE INTERFACE FOR END USERS
- Graphical interface
- Many utilities that make data access easy

CLIENT/SERVER ARCHITECTURE
- Provides the foundation for designing, implementing, and developing many new systems
- Divides the OLAP system into several components that define its architecture; these components can be placed:
  - On the same machine
  - Distributed across several machines


OLAP ARCHITECTURE
03 main components:
- Graphical user interface (GUI)
- Analytical processing logic
- Data processing logic

RELATIONAL OLAP (ROLAP)
Relational Online Analytical Processing is OLAP that uses relational databases and a family of query tools to store and analyze multidimensional data. It provides:
- Support for multidimensional database schemas
- High-performance queries and data access language
- Support for large databases
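The relational approach can be illustrated with a short sketch that keeps cube cells as rows of an ordinary table and uses SQL `GROUP BY` to aggregate over two dimensions. It relies on Python's standard `sqlite3` module; the table name `sales`, its columns, and the sample rows are assumptions made for this example.

```python
import sqlite3

# In-memory relational database holding multidimensional data as plain rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (quarter TEXT, product TEXT, region TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("1996-Q1", "TV", "North", 10),
        ("1996-Q1", "TV", "South", 7),
        ("1996-Q1", "Radio", "North", 4),
        ("1996-Q2", "TV", "North", 12),
        ("1996-Q2", "Radio", "South", 5),
    ],
)

# Aggregate away the region dimension, leaving a quarter-by-product crosstab.
rows = conn.execute(
    """
    SELECT quarter, product, SUM(amount)
    FROM sales
    GROUP BY quarter, product
    ORDER BY quarter, product
    """
).fetchall()

for row in rows:
    print(row)
```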

SUPPORT FOR MULTIDIMENSIONAL DATABASE SCHEMAS
Decision support data tends to be:
- Nonnormalized
- Duplicated
- Preaggregated

Data models used in OLAP:
- Star Schema
- Fact Constellation Schema
- Snowflake Schema

These are special design techniques for representing multidimensional data. They optimize data query operations rather than data update operations.


STAR SCHEMA
- A design specialized for representing multidimensional data
- Optimizes data query operations rather than data update operations
- Maps decision support data onto the relational data model

4 components:
- Facts
- Dimensions
- Attributes
- Attribute Hierarchies

FACTS
- Numeric measurements (values) that represent a specific business aspect or activity
- Stored in a fact table at the center of the star schema
- The fact table contains facts linked through their dimensions
- Can be computed or derived at run time
- Updated periodically with data from the operational databases

Fact Table: used to track changes in the data. Its structure consists of foreign keys that are the primary keys of the dimension tables.
Measure: a quantity that can be computed over the attributes of the fact table.


DIMENSIONS
Each dimension describes one characteristic of the data. Dimension tables describe the characteristics of dimensions such as the time dimension, the customer dimension, the product dimension, and so on.


ATTRIBUTES
- Dimension tables contain attributes.
- Attributes are used to search, filter, and classify facts.
- Dimensions describe the characteristics of facts through their attributes.
- There is no mathematical limit on the number of dimensions (3-D is easy to model).
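The pieces described so far (facts, dimensions, attributes) fit together roughly as in the sketch below. It uses Python's standard `sqlite3` module; the table names `fact_sales`, `dim_time`, and `dim_product`, their columns, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: one row per member, described by attributes.
    CREATE TABLE dim_time (
        time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER
    );
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY, name TEXT, category TEXT
    );
    -- Fact table at the center: foreign keys to the dimensions plus measures.
    CREATE TABLE fact_sales (
        time_id INTEGER REFERENCES dim_time(time_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        units INTEGER,
        revenue REAL
    );
    INSERT INTO dim_time VALUES
        (1, '1996-04-01', '1996-04', 1996),
        (2, '1996-05-01', '1996-05', 1996);
    INSERT INTO dim_product VALUES
        (1, 'TV', 'Electronics'),
        (2, 'Rice', 'Food');
    INSERT INTO fact_sales VALUES
        (1, 1, 2, 600.0),
        (1, 2, 10, 50.0),
        (2, 1, 1, 300.0);
    """
)

# Filter facts by a dimension attribute (category) and group by another (month).
rows = conn.execute(
    """
    SELECT p.category, t.month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_time AS t ON t.time_id = f.time_id
    WHERE p.category = 'Electronics'
    GROUP BY p.category, t.month
    ORDER BY t.month
    """
).fetchall()

for row in rows:
    print(row)
```

The query touches the fact table once and reaches every descriptive attribute through a single join per dimension, which is the query-oriented trade-off the star schema makes.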

ATTRIBUTE HIERARCHIES
This concept describes an ordered hierarchy of levels (the level of detail of the data). For example, for the time dimension we have the following hierarchy: day