你的一小步,我的一大步 Jen-Wei Huang 黃仁暐 jwhuang@gmail.com National Taiwan University.

  • Published on
    12-Jan-2016

  • View
    230

  • Download
    6

Embed Size (px)

Transcript

  • Jen-Wei Huangjwhuang@gmail.com

    National Taiwan University

    Jen-Wei Huang

  • *Jen-Wei Huang*

    Jen-Wei Huang

  • *Jen-Wei Huang** http://www.wretch.cc/blog/EtudeBIKE

    Jen-Wei Huang

  • *Jen-Wei Huang** http://www.giant-bicycles.com/zh-TW/

    Jen-Wei Huang

  • *Jen-Wei Huang*

    Jen-Wei Huang

  • *Jen-Wei Huang*

    Jen-Wei Huang

  • *Jen-Wei Huang** http://cape7.pixnet.net/blog

    Jen-Wei Huang

  • *Jen-Wei Huang** http://cape7.pixnet.net/blog

    Jen-Wei Huang

  • *Jen-Wei Huang** http://cape7.pixnet.net/blog

    Jen-Wei Huang

  • *Jen-Wei Huang** http://www.wretch.cc/blog/orzboyz* http://blog.sina.com.tw/9winds/* http://atomcinema.pixnet.net/blog

    Jen-Wei Huang

  • *Jen-Wei Huang*

    Jen-Wei Huang

  • *Jen-Wei Huang** http://www.amazon.com

    Jen-Wei Huang

  • *Jen-Wei Huang** http://www.amazon.com

    Jen-Wei Huang

  • *Jen-Wei Huang** http://www.hq.nasa.gov/office/pao/History/ap11ann/kippsphotos/apollo.html

    Jen-Wei Huang

  • A General Model for Sequential Pattern Mining with a Progressive DatabaseJen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou and Ming-Syan Chen

    National Taiwan University* IEEE Trans. on Knowledge and Data Engineering, Vol. 20, No. 6, June 2008

    Jen-Wei Huang

  • *Jen-Wei Huang*OutlinesIntroductionPreliminariesAlgorithm PisaExperimentsConclusionsQ & A*

    Jen-Wei Huang

  • *Jen-Wei Huang*Introduction to SPMMining of frequently occurring patterns related to time or other sequences.J. Han, Data Mining Concepts and TechniquesGiven a set of sequences, find the complete set of frequent subsequencesJ. Pei, PrefixSpanEx) What items one will buy if he/she has bought some certain items*

    Jen-Wei Huang

  • *Jen-Wei Huang*Time-related dataCustomers buying behaviorNatural phenomenaSensor network dataWeb access patternsStock price changesDNA sequence applications*

    Jen-Wei Huang

  • *Jen-Wei Huang*DefinitionLet I = {x1, x2, ..., xn} be a set of different items. An element e, denoted by (xi xj ...), is a subset of items I of which items appear in a sequence at the same time.A sequence s, denoted by < e1, e2, ..., em >, is an ordered list of elements. A sequence database Db contains a set of sequences and |Db| represents the number of sequences in Db.*

    Jen-Wei Huang

  • *Jen-Wei Huang*DefinitionA sequence = < a1, a2, ..., an > is a subsequence of another sequence = < b1, b2, ..., bm > if there exists a set of integers, 1 i1 < i2 < ... < in m, such that a1 bi1 , a2 bi2 , ..., and an bin .*

    Jen-Wei Huang

  • *Jen-Wei Huang*DefinitionThe sequential pattern mining can be defined as "Given a sequence database, Db, and a user-defined minimum support, min_sup, find the complete set of subsequences whose occurrence frequencies min_sup |Db|."*

    Jen-Wei Huang

  • *Jen-Wei Huang*Three CategoriesDepending on the management of the corresponding database, sequential pattern mining can be divided into three categories, namely sequential pattern mining witha static database.an incremental database.a progressive database.*

    Jen-Wei Huang

  • How To Do Sequential Pattern Mining on a Static DatabaseAn Overview

    Jen-Wei Huang

  • 2006/03/24jwhuang National Taiwan University*How?Apriori-like algorithmsAprioriAll by Agrawal et alGSP by R. Srikant et al Partition-based algorithmsFreeSpan by J. Han et alPrefixSpan by J. Pei et alVertical format algorithmsSPADE by Zaki et alSPAM by Ayres et al

    jwhuang National Taiwan University

  • 2006/03/24jwhuang National Taiwan University*Apriori-like Algorithms1.Sort phaseSort the databaseCustomer id as the primary key and time as the second key2.Litemset phase Count the frequency of each itemsetThe fraction of customers who bought the itemset

    jwhuang National Taiwan University

  • 2006/03/24jwhuang National Taiwan University*Apriori-like Algorithms3.Transformation phase Transform each tx to all litemsets in the form of C01: C02: C03: C04: C05:

    jwhuang National Taiwan University

  • *Jen-Wei Huang*

    CIDItems2 10 205 902 302 40 60 704 303 30 50 701 301 904 40 704 903 105 101 40 705 202 903 20

    CIDItems1 30 90 {40 70}2 {10 20} 30 {40 60 70} 903 {30 50 70} 10 204 30 {40 70} 905 90 10 20

    Itemset# 103 203 304 403 501 601 704 904 {10 20}1 {40 60}1 {40 70}3 {60 70}1 {40 60 70}1 {30 50}1 {30 70}1 {50 70}1 {30 50 70}1

    Jen-Wei Huang

  • *Jen-Wei Huang*

    Itemset#New 1031 2032 3043 4034 7045 9046 {40 70}37

    CIDItems1 3 6 {4, 5, 7}2 {1, 2} 3 {4, 5, 7} 63 {3, 5} 1 24 3 {4, 5, 7} 65 6 1 2

    Jen-Wei Huang

  • 2006/03/24jwhuang National Taiwan University*Apriori-like Algorithms4.Mining phaseApriori-like algorithm5.Maximal phase Find the maximum patterns

    jwhuang National Taiwan University

  • *Jen-Wei Huang*

    CIDItems1 3 6 {4, 5, 7}2 {1, 2} 3 {4, 5, 7} 63 {3, 5} 1 24 3 {4, 5, 7} 65 6 1 2

    Itemset#1 221 311 411 511 611 712 102 312 412 512 612 713 113 21

    Itemset#3 433 533 633 734 104 204 304 504 624 705 115 215 305 40

    Itemset#5 625 706 116 216 306 416 516 717 107 207 307 407 507 62

    Jen-Wei Huang

  • *Jen-Wei Huang*Therefore, frequent sequential patterns are: According to mappings, original frequent sequential patterns are:

    CIDItems1 3 6 {4, 5, 7}2 {1, 2} 3 {4, 5, 7} 63 {3, 5} 1 24 3 {4, 5, 7} 65 6 1 2

    Itemset#3 4 623 5 623 7 62

    Itemset# 1031 2032 3043 4034 7045 9046 {40 70}37

    Jen-Wei Huang

  • *Jen-Wei Huang*According to mappings, original frequent sequential patterns are:

    Because and are contained by and are contained by and are contained by ,final maximal sequential patterns are:

    Jen-Wei Huang

  • *Jen-Wei Huang*Related WorksStatic databaseAprioriAll by Agrawal et alGSP by R. Srikant et alSPADE by Zaki et alFreeSpan by J. Han et alPrefixSpan by J. Pei et alSPAM by Ayres et al*

    Jen-Wei Huang

  • *Jen-Wei Huang*Related WorksIncremental databaseISM by Parthasarathy et alIncSP by Lin et alISE by Masseglia et alIncSpan by Cheng et alMILE by Chen et al*

    Jen-Wei Huang

  • *Jen-Wei Huang*MotivationThe assumption of having a static database may not hold in practice.The data in real world change on the fly.Finding sequential patterns in an incremental database may lack of interest to the users.It is noted that users are usually more interested in the recent data than the old ones. *

    Jen-Wei Huang

  • *Jen-Wei Huang*MotivationIf a certain sequence does not have any newly arriving elements, this sequence will still stay in the database and undesirably contribute to |Db|.New sequential patterns which appear frequently in the recent sequences may not be considered as frequent sequential patterns.*

    Jen-Wei Huang

  • *Jen-Wei Huang*Definition -- Period of InterestPeriod of Interest (abbreviated as POI) is a sliding windowwhose length is a user-specified time interval, continuously advancing as the time goes by. The sequences having elements whose timestamps fall into this period, POI, contribute to the |Db| for current sequential patterns.*

    Jen-Wei Huang

  • timeACADBBADBBCDCCDBDAAABCBCAACACBDCDDSIDPOI=5, min_supp=0.5*

  • *Jen-Wei Huang*OutlinesIntroductionPreliminariesAlgorithm PisaExperimentsConclusionsQ & A*

    Jen-Wei Huang

  • *Jen-Wei Huang*Progressive Sequential PatternProgressive sequential pattern mining problem is defined as follows"Given a progressive sequence database, a user-specified period of interest, and a user-defined minimum support threshold, find the complete set of frequent subsequences whose occurrence frequencies are greater than or equal to the minimum support times the number of sequences in every period of interest of the database."*

    Jen-Wei Huang

  • *Jen-Wei Huang*Nave AlgorithmUse conventional static sequential pattern mining algorithms to mine sequential patterns separately from all combination of POIse.g., Db1,5, Db2,6, Db3,7, Db4,8, Db5,9, etc. For the sequence database which has the elements appearing in the interval of n timestamps, the total number of POIs in this interval is equal to (n POI +1).*

    Jen-Wei Huang

  • *Jen-Wei Huang*Prior WorkThe only prior work on progressive database is GSP+ and MFS+ proposed by Zhang based on static algorithms GSP and MFS (also derived by the same authors).However, these algorithms still have to re-mine each sub-database using the static algorithms GSP and MFS.Nevertheless, the performance improvement of GSP+ and MFS+ over GSP and MFS is only within 15% as reported by their authors.*

    Jen-Wei Huang

  • *Jen-Wei Huang*Algorithm DirAppStands for Direct Append.Consists of two proceduresProgressively Updating abbreviated as PrUpImmediately Filteringabbreviated as ImFi

    *

    Jen-Wei Huang

  • *Jen-Wei Huang*Procedure PrUpWhen progressively reading newly incoming elements, Procedure PrUp canupdate each sequence in the sequence databasegenerate candidate sequential patternscalculate occurrence frequencies of all candidate equential patterns in the current POI.

    *

    Jen-Wei Huang

  • *Jen-Wei Huang*Procedure ImFiDirApp uses Procedure ImFi to filter out obsolete data from the existing sequence databaseprune away obsolete candidate sequential patterns from the candidate set. report the most up-to-date frequent sequential patterns to the user in every POI*

    Jen-Wei Huang

  • ABCADB*

  • *Jen-Wei Huang*Example*

    Jen-Wei Huang

  • (1)(4)(2)(3)*

    Db1,1A1

    Db1,4A1B2AB1C4AC1BC2ABC1

    Db1,2A1B2AB1

    Db1,3A1B2AB1

  • (4)(5)*

    Db1,4A1B2AB1C4AC1BC2ABC1

    Db1,5A5B(AD)2B2ABD1AB1AB(AD)1C4CA4AC1CD4BC2C(AD)4ABC1ACD1D5AC(AD)1(AD)5BCA2AD1BCD2A(AD)1BC(AD)2BA2ABCD1BD2ABC(AD)1

  • (5)(6)*

    Db1,5A5B(AD)2B2ABD1AB1AB(AD)1C4CA4AC1CD4BC2C(AD)4ABC1ACD1D5AC(AD)1(AD)5BCA2AD1BCD2A(AD)1BC(AD)2BA2ABCD1BD2ABC(AD)1

    Db2,6A5B2C4BC2D5(AD)5BA2BD2B(AD)2CA4CD4C(AD)4BCA2BCD2BC(AD)2

  • (6)(7)*

    Db2,6A5B2C4BC2D5(AD)5BA2BD2B(AD)2CA4CD4C(AD)4BCA2BCD2BC(AD)2

    Db3,7A5C4D5(AD)5CA4CD4C(AD)4B7AB5CB4DB5(AD)B5CAB4CDB4C(AD)B4

  • (1)(4)(5)(6)(7)(2)(3)*

    Db1,1A1

    Db1,5A5B(AD)2B2ABD1AB1AB(AD)1C4CA4AC1CD4BC2C(AD)4ABC1ACD1D5AC(AD)1(AD)5BCA2AD1BCD2A(AD)1BC(AD)2BA2ABCD1BD2ABC(AD)1

    Db1,4A1B2AB1C4AC1BC2ABC1

    Db1,2A1B2AB1

    Db2,6A5B2C4BC2D5(AD)5BA2BD2B(AD)2CA4CD4C(AD)4BCA2BCD2BC(AD)2

    Db1,3A1B2AB1

    Db3,7A5C4D5(AD)5CA4CD4C(AD)4B7AB5CB4DB5(AD)B5CAB4CDB4C(AD)B4

  • S01S02S03AB1(3)S04*

    Db1,2A1B2AB1

    Db1,2A1AB1D1DB1(AD)1(AD)B1B2

    Db1,2A1AB1B2AC1C2A(BC)1(BC)2

    Db1,2(4)AB13A(BC)11AC11(AD)B11DB11

    Db1,2D2

  • AB1(3)AB1(3)AB1(3)AB1(3)DA3(3)BA4(3)(2)(3)(4)(5)*

    Db1,2(4)AB13A(BC)11AC11(AD)B11DB11

    Db1,3(5)AB13A(BC)11AC11(AD)B11DB11A(BC)B11ACB11(BC)B21CB21DC21

    Db1,4(5)AB13A(BC)BC11A(BC)11A(BC)C11AC12(AD)A11(AD)B11(AD)BA11DB32BA21A(BC)B11BC32ACB11(BC)BC21(BC)B21(BC)C21CB21DA11DC21DBA11ABC12

    Db1,5(5)AB13ABC12DBA32BCA21A(BC)11A(BC)BC11A(AD)11BC(AD)21AC12A(BC)C11AB(AD)11BCD21(AD)B11(AD)A11ABC(AD)11BD21DB32(AD)BA11ABCD11CA42A(BC)B11BA43ABD11C(AD)41ACB11BC32AC(AD)11CD41(BC)B21(BC)BC21ACD11DCA21CB21(BC)C21AD11DC21DA33B(AD)21

  • BA4(4)CA4(3)(6)(7)(8)AC6(4)(9)CA4(3)AC8(5)*

    Db2,6(5)DB31BC(AD)21(BC)B21BCD21CB21BD21DC21CA43BA44C(AD)41BC32CD41(BC)BC21DCA21(BC)C21(BC)A21DA32(BC)BA21DBA31(BC)BCA21B(AD)21(BC)CA21BCA32CBA21

    Db3,7(5)DB52(AD)B51BA42BAC41BC42CAB42DA31CA(BC)31DBA31C(AD)B41BCA31CB42CA43C(BC)31C(AD)41CDB41CD41DAC31AB52DBAC31A(BC)51DBC31AC52DC31

    Db4,8(6)DB51BAC41BA41CAB41BC72C(AD)B41CA42CB41C(AD)41CDB41CD41ABC51AB52(AD)BC51A(BC)51(AD)C51AC64DBC51(AD)B51DC51

    Db5,9(5)DB51BC71AB52A(BC)51AC85(AD)B51ABC51(AD)BC51(AD)C51DBC51DC51ACD62AD62CD82

  • *Jen-Wei Huang*The Advantages of DirAppDirApp needs only one scan of newly arriving elements and the candidate set at each timestamp rather than quadratic scans by conventional algorithms.DirApp canmaintain latest data sequencesfind the complete set of up-to-date sequential patternsdelete obsolete data and patterns rapidly*

    Jen-Wei Huang

  • *Jen-Wei Huang*The Disadvantages of DirAppDirApp needs lots of working space to store the candidate sets for all sequences.Scanning all candidate sets induces huge computation in execution time.DirApp needs another data structure to calculate the occurrence frequencies of all candidate sequential patterns.

    *

    Jen-Wei Huang

  • *Jen-Wei Huang*OutlinesIntroductionPreliminariesAlgorithm PisaExperimentsConclusionsQ & A*

    Jen-Wei Huang

  • *Jen-Wei Huang*Algorithm PisaPisa stands for Progressive mIning of Sequential pAtternsPisa utilizes a Progressive Sequential tree (abbreviated as PS-tree) to maintain the information of all sequences in each POI toupdate each sequencefind up-to-date sequential patterns

    *

    Jen-Wei Huang

  • *Jen-Wei Huang*PS-treeThe nodes in PS-tree can be divided into two different typesRoot nodeCommon nodesEach common node stores two informationNode label = element in a sequenceSequence listsequence IDs containing this elementmarked by corresponding timestampsRoot*

    Jen-Wei Huang

  • *Jen-Wei Huang*PS-treeWhenever there are a series of elements appearing in the same sequence, there will be a series of nodes labeled by each element with the same sequence IDs in their sequence lists.The first node will be connected to the Root node representing the first element.The other nodes will be connected to the first node analogously.*

    Jen-Wei Huang

  • *Jen-Wei Huang*PS-treeRootRoot*

    Jen-Wei Huang

  • *Jen-Wei Huang*PS-treeThe path from Root node to any other node represents the candidate sequential pattern appearing in this sequence.The appearing timestamp for each candidate sequential pattern will be marked in the node labeled by the last element.

    *

    Jen-Wei Huang

  • *Jen-Wei Huang*PS-treeRootRoot*

    Jen-Wei Huang

  • *Jen-Wei Huang*Algorithm PisaWhen receiving elements at timestamp t+1, Pisa traverses the PS-tree in post-order to delete the obsolete elements fromupdate current sequences ininsert newly arriving elements into

    the PS-tree of timestamp t andtransforms it into PS-tree of timestamp t+1.*

    Jen-Wei Huang

  • *Jen-Wei Huang*For a common nodePisa deletes the obsolete sequences in the sequence list of this nodeIf there is no sequence ID left in the sequence list, Pisa prunes this node away from its parentPisa checks the sequence IDs left in the sequence list to see if there is newly arriving element of the sequencesIf there is no newly arriving element, Pisa goes to the next node*

    Jen-Wei Huang

  • *Jen-Wei Huang*For a common nodeOtherwise, Pisa generates all combination of candidate elements from the arriving elementEx) ABC -> A, B, C, AB, AC, BC, ABCFor each candidate element that does not exist on the path from Root to the current node :If there is a child of the same label, Pisa updates the timestamp of this sequence to the timestamp of the same sequence in parents sequence list.Otherwise, Pisa creates a new child of this element with the sequence ID and the timestamp of the same sequence in parents sequence list.*

    Jen-Wei Huang

  • *Jen-Wei Huang*For Root nodeInstead of checking the sequence list, Pisa examines all sequences that have newly arriving elements.After Pisa generates all combination of candidate element, for each of them :If there is a child of the same label, Pisa updates the timestamp of this sequence to t+1.Otherwise, Pisa creates a new child of this element with sequence ID and timestamp t+1.*

    Jen-Wei Huang

  • *Jen-Wei Huang*Algorithm PisaAfter Pisa processes a common node, if the number of sequence IDs in the sequence list is larger than the min_supp*|Dbp,q|, the path from Root to this node will be outputted as a frequent sequential pattern.*

    Jen-Wei Huang

  • *Jen-Wei Huang*PS-treeRootRoot*

    Jen-Wei Huang

  • RootPOI=5, min_supp=0.5*

  • Db1,1(3)*

  • Db1,2(4)DAB1(3)BBDBCDb1,1(3)*

  • Db1,3(5)AB1(3)CDB*

  • Db1,4(5)AB1(3)CBAC*

  • Db1,5(5)AB1(3)BA4(3)DA3(...

Recommended

View more >