
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 60, NO. 12, DECEMBER 2014

Consistent Procedures for Cluster Tree Estimation and Pruning

Kamalika Chaudhuri, Sanjoy Dasgupta, Samory Kpotufe, and Ulrike von Luxburg

Abstract: For a density f on R^d, a high-density cluster is any connected component of {x : f(x) ≥ λ}, for some λ > 0. The set of all high-density clusters forms a hierarchy called the cluster tree of f. We present two procedures for estimating the cluster tree given samples from f. The first is a robust variant of the single linkage algorithm for hierarchical clustering. The second is based on the k-nearest neighbor graph of the samples. We give finite-sample convergence rates for these algorithms, which also imply consistency, and we derive lower bounds on the sample complexity of cluster tree estimation. Finally, we study a tree pruning procedure that guarantees, under milder conditions than usual, to remove clusters that are spurious while recovering those that are salient.

Index Terms: Clustering algorithms, convergence.

I. INTRODUCTION

WE CONSIDER the problem of hierarchical clustering in a density-based setting, where a cluster is formalized as a region of high density. Given data drawn i.i.d. from some unknown distribution with density f in R^d, the goal is to estimate the hierarchical cluster structure of the density, where a cluster is defined as a connected subset of an f-level set {x ∈ X : f(x) ≥ λ}. These subsets form an infinite tree structure as λ ≥ 0 varies, in the sense that each cluster at some level λ is contained in a cluster at a lower level λ′ < λ. This infinite tree is called the cluster tree of f and is illustrated in Figure 1.

Our formalism of the cluster tree (Section II-B) and our notion of consistency follow early work on clustering, in particular that of Hartigan [1]. Much subsequent work has been devoted to estimating the connected components of a single level set; see, for example, [2]–[7]. In contrast to these

Manuscript received November 4, 2013; revised July 16, 2014; accepted August 2, 2014. Date of publication October 3, 2014; date of current version November 18, 2014. K. Chaudhuri and S. Dasgupta were supported by the National Science Foundation under Grant IIS-1162581. U. von Luxburg was supported by the German Research Foundation under Grant LU1718/1-1 and Research Unit 1735 through the Project entitled "Structural Inference in Statistics: Adaptation and Efficiency."

K. Chaudhuri and S. Dasgupta are with the University of California at San Diego, La Jolla, CA 92093 USA (e-mail: kamalika@cs.ucsd.edu; dasgupta@cs.ucsd.edu).

S. Kpotufe was with the Toyota Technological Institute at Chicago, Chicago, IL 60637 USA. He is now with Princeton University, Princeton, NJ 08544 USA (e-mail: samory@princeton.edu).

U. von Luxburg is with the University of Hamburg, Hamburg, Germany(e-mail: luxburg@informatik.uni-hamburg.de).

Communicated by N. Cesa-Bianchi, Associate Editor for Pattern Recognition, Statistical Learning and Inference.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIT.2014.2361055

Fig. 1. Top: A probability density f on R, and two clusters at a fixed level λ. Bottom: The same density, with the overall branching structure of the corresponding cluster tree; within each branch, the corresponding cluster shrinks gradually as one moves upward.

results, the present work is concerned with the simultaneous estimation of all level sets of an unknown density: recovering the cluster tree as a whole.

Are there hierarchical clustering algorithms which converge to the cluster tree? Previous theory work, [1] and [8], has provided partial consistency results for the well-known single-linkage clustering algorithm, while other work [9] has suggested ways to overcome the deficiencies of this algorithm by making it more robust, but without proofs of convergence. In this paper, we propose a novel way to make single linkage more robust, while retaining most of its elegance and simplicity (see Figure 3). We show that this algorithm implicitly creates a hierarchy of geometric graphs, and we relate connected components of these graphs to clusters of the underlying density. We establish finite-sample rates of convergence for the clustering procedure (Theorem 3.3); the centerpiece of our argument is a result on continuum percolation (Theorem 4.7). This also implies consistency in the sense of Hartigan.

We then give an alternative procedure based on the k-nearest neighbor graph of the sample (see Figure 4). Such graphs are widely used in machine learning, and interestingly there is still much to understand about their expressiveness. We show that by successively removing points from this graph, we can create a hierarchical clustering that also converges to the cluster

0018-9448 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


tree, at roughly the same rate as the linkage-based scheme (Theorem 3.4).

Next, we use tools from information theory to give a lower bound on the problem of cluster tree estimation (Theorem VI.1), which matches our upper bounds in its dependence on most of the parameters of interest.

The convergence results for our two hierarchical clustering procedures nevertheless leave open the possibility that the trees they produce contain spurious branching. This is a well-studied problem in the cluster tree literature, and we address it with a pruning method (Figure 9) that preserves the consistency properties of the tree estimators while providing finite-sample guarantees on the removal of false clusters (Theorem VII.5). This procedure is based on simple intuition that can carry over to other cluster tree estimators.

II. DEFINITIONS AND PREVIOUS WORK

Let X be a subset of R^d. We exclusively consider Euclidean distance on X, denoted ∥·∥. Let B(x, r) be the closed ball of radius r around x.

A. Clustering

We start by considering the more general context of clustering. While clustering procedures abound in statistics and machine learning, it remains largely unclear whether clusters in finite data (for instance, the clusters returned by a particular procedure) reveal anything meaningful about the underlying distribution from which the data is sampled. Understanding what statistical estimation based on a finite data set reveals about the underlying distribution is a central preoccupation of statistics and machine learning; however, this kind of analysis has proved elusive in the case of clustering, except perhaps in the case of density-based clustering.

Consider for instance k-means, possibly the most popular clustering procedure in use today. If this procedure returns k clusters on an n-sample from a distribution f, what do these clusters reveal about f? Pollard [10] proved a basic consistency result: if the algorithm always finds the global minimum of the k-means cost function (which, incidentally, is NP-hard and thus computationally intractable in general; see [11, Th. 3]), then as n → ∞, the clustering is the globally optimal k-means solution for f, suitably defined. Even then, it is unclear whether the best k-means solution to f is an interesting or desirable quantity in settings outside of vector quantization.

Our work, and more generally work on density-based clustering, relies on meaningful formalisms of how a clustering of data generalizes to unambiguous structures of the underlying distribution. The main such formalism is that of the cluster tree.

B. The Cluster Tree

We start with notions of connectivity. A path P in S ⊆ X is a continuous function P : [0, 1] → S. If x = P(0) and y = P(1), we write x ⇝_P y and we say that x and y are connected in S. This relation ("connected in S") is an equivalence relation that partitions S into its connected components. We say S ⊆ X is connected if it has a single connected component.

Fig. 2. A probability density f, and the restriction of C_f to a finite set of eight points.

The cluster tree is a hierarchy each of whose levels is a partition of a subset of X, which we will occasionally call a subpartition of X. Write Π(X) = {subpartitions of X}.

Definition 2.1: For any f : X → R, the cluster tree of f is a function C_f : R → Π(X) given by C_f(λ) = connected components of {x ∈ X : f(x) ≥ λ}. Any element of C_f(λ), for any λ, is called a cluster of f.

For any λ, C_f(λ) is a set of disjoint clusters of X. They form a hierarchy in the following sense.

Lemma 2.2: Pick any λ ≤ λ′. Then:
1) For any C ∈ C_f(λ′), there exists C′ ∈ C_f(λ) such that C ⊆ C′.
2) For any C ∈ C_f(λ) and C′ ∈ C_f(λ′), either C′ ⊆ C or C ∩ C′ = ∅.

We will sometimes deal with the restriction of the cluster tree to a finite set of points x_1, . . . , x_n. Formally, the restriction of a subpartition 𝒞 ∈ Π(X) to these points is defined to be 𝒞[x_1, . . . , x_n] = {C ∩ {x_1, . . . , x_n} : C ∈ 𝒞}. Likewise, the restriction of the cluster tree is C_f[x_1, . . . , x_n] : R → Π({x_1, . . . , x_n}), where C_f[x_1, . . . , x_n](λ) = C_f(λ)[x_1, . . . , x_n] (Figure 2).
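To make the restriction concrete, here is a small Python sketch (not from the paper) that computes C_f(λ) restricted to a finite sample, for a one-dimensional density. In one dimension, two sample points above level λ lie in the same restricted cluster exactly when f stays at or above λ on the whole segment between them; the sketch checks this on a fine grid. The function name, grid step, and example density are illustrative assumptions.

```python
import numpy as np

def cluster_tree_restriction(f, points, lam, grid_step=1e-3):
    """Restrict C_f(lam) to a finite point set, for a 1-D density f.

    Two sample points with f >= lam lie in the same restricted cluster
    iff f stays >= lam on the segment between them (checked on a grid).
    """
    pts = sorted(x for x in points if f(x) >= lam)
    clusters, current = [], []
    for x in pts:
        if current:
            # Check the path from the previous kept point to x.
            seg = np.arange(current[-1], x, grid_step)
            if all(f(t) >= lam for t in seg):
                current.append(x)
                continue
            clusters.append(current)
        current = [x]
    if current:
        clusters.append(current)
    return clusters

# A bimodal density (mixture of two Gaussians), in the spirit of Figure 2.
def f(x):
    return 0.5 * (np.exp(-(x + 2)**2) + np.exp(-(x - 2)**2)) / np.sqrt(np.pi)

sample = [-2.5, -2.0, -1.5, 0.0, 1.5, 2.0, 2.5, 3.0]
print(cluster_tree_restriction(f, sample, lam=0.1))
```

At λ = 0.1 the density dips below the level between the two modes, so the eight points split into two restricted clusters; at a low enough level (e.g. λ = 0.01) all eight points merge into one, illustrating the hierarchy of Lemma 2.2.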

C. Notion of Convergence and Previous Work

Suppose a sample X_n ⊆ X of size n is used to construct a tree C_n that is an estimate of C_f. Hartigan [1] provided a sensible notion of consistency for this setting.

Definition 2.3: For any sets A, A′ ⊆ X, let A_n (resp., A′_n) denote the smallest cluster of C_n containing A ∩ X_n (resp., A′ ∩ X_n). We say C_n is consistent if, whenever A and A′ are different connected components of {x : f(x) ≥ λ} (for some λ > 0), Pr(A_n is disjoint from A′_n) → 1 as n → ∞.

It is well known that if X_n is used to build a uniformly consistent density estimate f_n (that is, sup_x |f_n(x) − f(x)| → 0), then the cluster tree C_{f_n} is consistent; see Appendix A for details. The problem is that C_{f_n} is not easy to compute for typical density estimates f_n: imagine, for instance, how one might go about trying to find level sets of a mixture of Gaussians. Wong and Lane [12] have an efficient procedure that tries to approximate C_{f_n} when f_n is a k-nearest neighbor


Fig. 3. An algorithm for hierarchical clustering. The input is a sample X_n = {x_1, . . . , x_n} from density f on X. Parameters k and α need to be set. Single linkage is (α = 1, k = 2). Wishart suggested α = 1 and larger k.

density estimate, but they have not shown that it preserves the consistency of C_{f_n}.

On the other hand, there is a simple and elegant algorithm that is a plausible estimator of the cluster tree: single linkage (or Kruskal's algorithm). Given a data set x_1, . . . , x_n ∈ R^d, it operates as follows.

1) For each i, set r_2(x_i) to the distance from x_i to its nearest neighbor.
2) As r grows from 0 to ∞:
   a) Construct a graph G_r with nodes {x_i : r_2(x_i) ≤ r}. Include edge (x_i, x_j) if ∥x_i − x_j∥ ≤ r.
   b) Let C_n(r) be the connected components of G_r.
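The process above can be sketched in a few lines of Python using Kruskal's algorithm with union-find; this is an illustrative sketch, not the paper's robust variant. A point's shortest incident edge has length exactly r_2(x_i), so by the time any edge touching x_i is processed in increasing order of length, x_i has already entered the node set; sorting edges and merging components therefore reproduces the sequence of clusterings C_n(r). The function name and toy data are assumptions.

```python
import itertools
import numpy as np

def single_linkage(X):
    """Single linkage as described above: as r grows, a point enters G_r once
    r_2(x_i) <= r, and edge (i, j) appears once ||x_i - x_j|| <= r.
    Returns the merge events as (r, number of components after the merge)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    parent = list(range(n))

    def find(i):
        # Union-find with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, processed in increasing order (Kruskal).
    edges = sorted((np.linalg.norm(X[i] - X[j]), i, j)
                   for i, j in itertools.combinations(range(n), 2))
    merges = []
    for r, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            merges.append((r, len({find(m) for m in range(n)})))
    return merges

pts = [[0.0], [0.1], [0.2], [5.0], [5.1]]
for r, ncomp in single_linkage(pts):
    print(f"r = {r:.2f}: {ncomp} components")
```

On the toy data, the three nearby points and the two distant ones merge at small r, and the final merge joining the two groups happens only at r = 4.8.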

Hartigan [1] has shown that single linkage is consistent in one dimension (that is, for d = 1). But he also demonstrates, by a lovely reduction to continuum percolation, that this consistency fails in higher dimension d ≥ 2. The problem is the requirement that A ∩ X_n ⊆ A_n: by the time the clusters are large enough that one of them contains all of A, there is a reasonable chance that this cluster will be so big as to also contain part of A′.

With this insight, Hartigan defines a weaker notion of fractional consistency, under which A_n (resp., A′_n) need not contain all of A ∩ X_n (resp., A′ ∩ X_n), but merely a positive fraction of it, and ought to pass arbitrarily close (as n → ∞) to the remainder. He then shows that single linkage achieves this weaker consistency for any pair A, A′ for which the ratio

inf{ f(x) : x ∈ A ∪ A′ } / sup{ inf{ f(x) : x ∈ P } : paths P from A to A′ }

is sufficiently large. More recent work by Penrose [8] closes the gap and shows fractional consistency whenever this ratio is > 1.

A more robust version of single linkage has been proposed by Wishart [9]: when connecting points at distance r from each other, only consider points that have at least k neighbors within distance r (for some k > 2). Thus initially, when r is small, only the regions of highest density are available for linkage, while the rest of the data set is ignored. As r gets larger, more and more of the data points become candidates for linkage. This scheme is intuitively sensible, but Wishart does not provide a proof of convergence. Thus it is unclear how to set k, for instance.
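Wishart's criterion at a single scale r can be sketched as follows. This is a hypothetical illustration of the rule just described (keep a point only if it has at least k sample points, counting itself, within distance r), not the algorithm of Figure 3; the function name and toy data are assumptions.

```python
import numpy as np

def wishart_components(X, r, k):
    """At scale r: keep points with at least k sample points (counting the
    point itself) within distance r, connect kept points within distance r
    of each other, and return the connected components (as index lists)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distance matrix.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    kept = [i for i in range(n) if np.sum(D[i] <= r) >= k]
    # Depth-first search over the kept points.
    comps, seen = [], set()
    for s in kept:
        if s in seen:
            continue
        comp, stack = [], [s]
        seen.add(s)
        while stack:
            i = stack.pop()
            comp.append(i)
            for j in kept:
                if j not in seen and D[i, j] <= r:
                    seen.add(j)
                    stack.append(j)
        comps.append(sorted(comp))
    return comps

# Two dense groups plus an isolated point; with k = 3, the isolated point
# is ignored at small r and only becomes a linkage candidate at large r.
pts = [[0.0], [0.1], [0.2], [2.0], [5.0], [5.1], [5.2]]
print(wishart_components(pts, r=0.25, k=3))
```

At r = 0.25 only the two dense groups survive the k-neighbor test, giving two components; by r = 3.0 the isolated point qualifies and bridges them into a single component, matching the intuition that high-density regions link first.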

Several papers [4]–[7] have recently considered the problem of recovering the connected components of {x : f(x) ≥ λ} for a user-specified λ: the "flat" version of our problem. Most similar to the work in this paper is the algorithm of [4], which uses the k-nearest neighbor graph of the data. These level set results invariably require niceness conditions on the

specific level set being recovered, often stated in terms of the smoothness of the boundary of clusters, and/or regularity conditions on the density f on clusters of the given level set. It is unclear whether these conditions hold for all level sets of a general density; in other words, it is unclear how restrictive these conditions are in the context of recovering the entire cluster tree. In contrast, under mild requirements on the distribution, our conditions on the recovered level sets hold for any level set as the sample size n increases. The main distributional requirement for consistency is that of continuity of the density f on a compact support X.

A different approach is taken in a paper of Steinwart [13], which does not require the user to specify a density level, but rather automatically determines the smallest λ at which C_f(λ) has two components. In [13], the continuity requirements on the density are milder than for other results in the literature, including ours. However, it restricts attention to bimodal densities because of technical hurdles in the flat case: there, distinct levels of the cluster tree are collapsed together, making it difficult to recover a given level from data, especially for multimodal densities. Interestingly, the hierarchical setting resolves some of these hurdles, since distinct levels of the cluster tree generally appear at different levels of a sensible hierarchical estimator. This makes it possible in this paper to give particularly simple estimators, and to analyze them under quite modest assumptions on the data.

A related issue that has received quite a lot of attention is that of pruning a cluster tree estimate: removing spurious clusters. A recent result of Rinaldo et al. [14] gives meaningful statistical guarantees, but is based on the cluster tree of an empirical density estimate, which is algorithmically problematic as discussed earlier. Stuetzle and Nugent [15] have an appealing top-down scheme for estimating the cluster tree, along with a post-processing step (called runt pruning) that helps identify modes of the distribution. The consistency of this method has not yet been established. We provide...