clusterdata

Construct agglomerative clusters from data

Syntax

T = clusterdata(X,cutoff)

T = clusterdata(X,Name,Value)

Description

T = clusterdata(X,cutoff) returns cluster indices for each observation (row) of an input data matrix X, given a threshold cutoff for cutting an agglomerative hierarchical tree that the linkage function generates from X.

clusterdata supports agglomerative clustering and incorporates the pdist, linkage, and cluster functions, which you can use separately for more detailed analysis. See Algorithm Description for more details.

T = clusterdata(X,Name,Value) specifies clustering options using one or more name-value pair arguments. You must specify either Cutoff or MaxClust. For example, specify 'MaxClust',5 to find a maximum of five clusters.

Examples

Find Limited Number of Clusters from Sample Data

Find a maximum of three clusters in a randomly generated data set and visualize them, using two different approaches:

Specify a value for the cutoff input argument.
Specify a value for the 'MaxClust' name-value pair argument.

Create a sample data set consisting of randomly generated data from three standard uniform distributions.

rng('default'); % For reproducibility
X = [gallery('uniformdata',[10 3],12); ...
    gallery('uniformdata',[10 3],13)+1.2; ...
    gallery('uniformdata',[10 3],14)+2.5];
y = [ones(10,1); 2*(ones(10,1)); 3*(ones(10,1))]; % Actual classes

Create a scatter plot of the data.

scatter3(X(:,1),X(:,2),X(:,3),100,y,'filled')
title('Randomly Generated Data in Three Clusters');

Find a maximum of three clusters in the data by specifying the value 3 for the cutoff input argument.

T1 = clusterdata(X,3);

Because the value of cutoff is greater than 2, clusterdata interprets cutoff as the maximum number of clusters.

Plot the data with the resulting cluster assignments.

scatter3(X(:,1),X(:,2),X(:,3),100,T1,'filled')
title('Result of Clustering');

Find a maximum of three clusters by specifying the value 3 for the 'MaxClust' name-value pair argument.

T2 = clusterdata(X,'Maxclust',3);

Plot the data with the resulting cluster assignments.

scatter3(X(:,1),X(:,2),X(:,3),100,T2,'filled')
title('Result of Clustering');

Using both approaches, clusterdata identifies the three distinct clusters in the data.

Create and Cluster Hierarchical Tree

Create a hierarchical cluster tree and find clusters in one step. Visualize the clusters using a 3-D scatter plot.

Create a 20,000-by-3 matrix of sample data generated from the standard uniform distribution.

rng('default'); % For reproducibility
X = rand(20000,3);

Find a maximum of four clusters in a hierarchical cluster tree created using the ward linkage method. Specify 'SaveMemory' as 'on' to construct clusters without computing the distance matrix. Otherwise, you can receive an out-of-memory error if your machine does not have enough memory to hold the distance matrix.

T = clusterdata(X,'Linkage','ward','SaveMemory','on','Maxclust',4);

Plot the data with each cluster shown in a different color.

scatter3(X(:,1),X(:,2),X(:,3),10,T)

clusterdata identifies four clusters in the data.

Input Arguments

`X`—Input data
numeric matrix

Input data, specified as a numeric matrix with two or more rows. The rows represent observations, and the columns represent categories or dimensions.

Data Types: single | double

`cutoff`—Threshold for cutting the hierarchical tree
positive scalar between `0` and `2` | positive integer `≥ 2`

Threshold for cutting the hierarchical tree defined by linkage, specified as a positive scalar between 0 and 2 or a positive integer ≥ 2. clusterdata behaves differently depending on the value specified for cutoff.

If 0 < cutoff < 2, then clusterdata uses cutoff as a threshold on the inconsistent values: it cuts the tree at nodes whose inconsistent values are greater than cutoff, so a node and its subnodes form one cluster only if their inconsistent values are all less than cutoff.
If cutoff is an integer ≥ 2, then clusterdata forms a maximum of cutoff clusters.

When you specify cutoff, you cannot specify any name-value pair arguments.

Example: clusterdata(X,3)

Data Types: single | double
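The two interpretations of cutoff can be compared side by side. The following is a minimal sketch, assuming a small hypothetical data set made of two shifted groups of uniform random points; the variable names are illustrative only.

```matlab
rng('default');                     % For reproducibility
X = [rand(10,2); rand(10,2)+3];     % Hypothetical data: two separated groups

% 0 < cutoff < 2: cutoff acts as a threshold on the inconsistent values
Tincon = clusterdata(X,1.1);

% Integer cutoff >= 2: cutoff acts as the maximum number of clusters
Tmax = clusterdata(X,3);

numIncon = numel(unique(Tincon));   % Number of clusters found by thresholding
numMax   = numel(unique(Tmax));     % Never more than 3
```

The number of clusters returned by the first call depends on the data, whereas the second call is bounded by the integer you pass.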

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: clusterdata(X,'Linkage','ward','MaxClust',3) specifies creating a maximum of three clusters of X using Ward linkage.

`'Criterion'`—Criterion for defining clusters
`'inconsistent'` | `'distance'`

Criterion for defining clusters in a hierarchical cluster tree, specified as the comma-separated pair consisting of 'Criterion' and either 'inconsistent' or 'distance'. When you specify 'Criterion', you must also specify a value for MaxClust or Cutoff.

Example: clusterdata(X,'Criterion','distance','Cutoff',.5)

Data Types: char | string

`'Cutoff'`—Cutoff for inconsistent or distance criterion
positive scalar

Cutoff for inconsistent or distance criterion, specified as the comma-separated pair consisting of 'Cutoff' and a positive scalar. clusterdata uses Cutoff as a threshold for either the heights or the inconsistency coefficients of nodes, depending on the value of Criterion. If you specify a value for 'Cutoff' without specifying the criterion for defining clusters, then clusterdata uses the 'inconsistent' criterion by default.

If 'Criterion' is 'distance', then clusterdata groups all leaves at or below a node into a cluster, provided that the height of the node is less than Cutoff.
If 'Criterion' is 'inconsistent', then the inconsistent values of a node and all its subnodes must be less than Cutoff for clusterdata to group them into a cluster.

You must specify either Cutoff or MaxClust.

Example: clusterdata(X,'Cutoff',0.2)

Data Types: single | double
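As an illustration of the 'distance' criterion, the sketch below uses hypothetical data in which two groups of points lie far apart. With the default single linkage, every merge inside a group has a height below the chosen cutoff, and only the final merge between the groups exceeds it. For a monotonic tree, the expected cluster count is one more than the number of nodes whose height is at or above the cutoff.

```matlab
rng('default');                       % For reproducibility
X = [rand(10,2); rand(10,2)+5];       % Hypothetical data: two groups far apart
c = 2;                                % Height threshold (illustrative value)

% Cut the tree by node height instead of by inconsistency coefficient
T = clusterdata(X,'Criterion','distance','Cutoff',c);
numByDistance = numel(unique(T));

% Cross-check against the node heights stored in column 3 of the tree
Z = linkage(X,'single');
expectedCount = 1 + sum(Z(:,3) >= c);
```

Here the points within each group are at most sqrt(2) apart and the groups are separated by more than 5, so both counts equal 2.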

`'Depth'`—Depth for computing inconsistent values
numeric scalar

Depth for computing inconsistent values, specified as the comma-separated pair consisting of 'Depth' and a numeric scalar. clusterdata evaluates inconsistent values by looking to the specified depth below each node in the hierarchical cluster tree. When you specify 'Depth', you must also specify a value for MaxClust or Cutoff.

Example: clusterdata(X,'Depth',3,'Cutoff',0.5)

Data Types: single | double
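To see the values that a depth setting produces, you can compute them directly with the inconsistent function. This is a sketch on hypothetical random data; the depth of 3 and the cutoff of 0.8 are arbitrary illustrative choices.

```matlab
rng('default');                 % For reproducibility
X = rand(30,2);                 % Hypothetical sample data
d = 3;                          % Depth used for the inconsistency calculation

Z = linkage(X,'single');        % Single-linkage tree on Euclidean distances
I = inconsistent(Z,d);          % One row per node; column 4 holds the coefficient
coeffs = I(:,4);

% Cut the same tree with the same depth, and the one-step equivalent
Tstep   = cluster(Z,'Cutoff',0.8,'Depth',d);
Tdirect = clusterdata(X,'Depth',d,'Cutoff',0.8);
```

Inspecting coeffs before choosing a cutoff can help you judge how many nodes fall above or below a candidate threshold.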

`'Distance'`—Distance metric
`'euclidean'` (default) | `'squaredeuclidean'` | `'seuclidean'` | `'mahalanobis'` | function handle | ...

Distance metric, specified as the comma-separated pair consisting of 'Distance' and any distance metric accepted by the pdist function, as described in the following table. When you specify 'Distance', you must also specify a value for MaxClust or Cutoff.

Metric	Description
`'euclidean'`	Euclidean distance (default)
`'squaredeuclidean'`	Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.)
`'seuclidean'`	Standardized Euclidean distance. Each coordinate difference between observations is scaled by dividing by the corresponding element of the standard deviation, `S = std(X,'omitnan')`.
`'mahalanobis'`	Mahalanobis distance using the sample covariance of `X`, `C = cov(X,'omitrows')`
`'cityblock'`	City block distance
`'minkowski'`	Minkowski distance. The default exponent is 2. To use a different exponent `P`, specify `P` after `'minkowski'`, where `P` is a positive scalar value: `'minkowski',P`.
`'chebychev'`	Chebychev distance (maximum coordinate difference)
`'cosine'`	One minus the cosine of the included angle between points (treated as vectors)
`'correlation'`	One minus the sample correlation between points (treated as sequences of values)
`'hamming'`	Hamming distance, which is the percentage of coordinates that differ
`'jaccard'`	One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ
`'spearman'`	One minus the sample Spearman's rank correlation between observations (treated as sequences of values)
`@distfun`	Custom distance function handle. A distance function has the form `function D2 = distfun(ZI,ZJ)`, followed by the calculation of the distance, where `ZI` is a `1`-by-`n` vector containing a single observation. `ZJ` is an `m2`-by-`n` matrix containing multiple observations. `distfun` must accept a matrix `ZJ` with an arbitrary number of observations. `D2` is an `m2`-by-`1` vector of distances, and `D2(k)` is the distance between observations `ZI` and `ZJ(k,:)`. If your data is not sparse, using a built-in distance metric is generally faster than using a function handle.

For more information, see Distance Metrics.

Example: clusterdata(X,'Distance','minkowski','MaxClust',4)

Data Types: char | string | function_handle
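The custom distance form described in the table can be sketched with an anonymous function. The example below is an assumption-laden illustration: it reimplements the city block distance by hand (the name distfun and the data are hypothetical) so that the result can be compared with the built-in 'cityblock' metric.

```matlab
rng('default');                 % For reproducibility
X = rand(20,3);                 % Hypothetical sample data

% ZI is 1-by-n, ZJ is m2-by-n, and the output is an m2-by-1 vector.
% bsxfun subtracts the single observation ZI from every row of ZJ.
distfun = @(ZI,ZJ) sum(abs(bsxfun(@minus,ZJ,ZI)),2);

Tcustom  = clusterdata(X,'Distance',distfun,'MaxClust',3);
Tbuiltin = clusterdata(X,'Distance','cityblock','MaxClust',3);
```

In practice you would prefer the built-in metric here; a function handle is mainly useful for distances that pdist does not provide.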

`'Linkage'`—Algorithm for computing the distance between clusters
`'average'` | `'centroid'` | `'complete'` | `'median'` | `'single'` | `'ward'` | `'weighted'`

Algorithm for computing distance between clusters, specified as the comma-separated pair consisting of 'Linkage' and any algorithm accepted by the linkage function, as described in the following table. When you specify 'Linkage', you must also specify a value for MaxClust or Cutoff.

Algorithm	Description
`'average'`	Unweighted average distance (UPGMA)
`'centroid'`	Centroid distance (UPGMC), appropriate for Euclidean distances only
`'complete'`	Farthest distance
`'median'`	Weighted center of mass distance (WPGMC), appropriate for Euclidean distances only
`'single'`	Shortest distance
`'ward'`	Inner squared distance (minimum variance algorithm), appropriate for Euclidean distances only
`'weighted'`	Weighted average distance (WPGMA)

For more information, see Linkages.

Example: clusterdata(X,'Linkage','median','MaxClust',4)

Data Types: char | string

`'MaxClust'`—Maximum number of clusters
positive integer

Maximum number of clusters to form, specified as the comma-separated pair consisting of 'MaxClust' and a positive integer.

You must specify either Cutoff or MaxClust.

Example: clusterdata(X,'MaxClust',4)

Data Types: single | double

`'SaveMemory'`—Option for saving memory
`'on'` | `'off'`

Option for saving memory, specified as the comma-separated pair consisting of 'SaveMemory' and either 'on' or 'off'. When you specify 'SaveMemory', you must also specify a value for MaxClust or Cutoff.

The 'on' setting causes clusterdata to construct clusters without computing the distance matrix. The 'on' setting applies when both of these conditions are satisfied:

Linkage is 'centroid', 'median', or 'ward'.
Distance is 'euclidean' (default).

When these two conditions apply, the default value for 'SaveMemory' is 'on' if X has 20 columns or fewer, or if the computer does not have enough memory to store the distance matrix. Otherwise, the default value for 'SaveMemory' is 'off'.

When 'SaveMemory' is 'on', the linkage run time is proportional to the number of dimensions (number of columns of X). When 'SaveMemory' is 'off', the linkage memory requirement is proportional to N², where N is the number of observations. Choosing the best (least-time) setting for 'SaveMemory' depends on the problem dimensions, number of observations, and available memory. The default 'SaveMemory' setting is a rough approximation of an optimal setting.

Example: 'SaveMemory','on'

Data Types: char | string
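A rough back-of-the-envelope calculation shows why the N² memory requirement matters. The sketch below assumes double-precision storage (8 bytes per value) and counts only the pairwise distances themselves, one per pair of observations, for the 20,000-observation example used earlier on this page.

```matlab
N = 20000;                      % Number of observations
numPairs = N*(N-1)/2;           % Distinct pairs of observations: 199990000
bytesNeeded = 8*numPairs;       % 8 bytes per double-precision distance
gigabytes = bytesNeeded/1e9;    % Roughly 1.6 GB
```

Doubling the number of observations roughly quadruples this figure, which is why 'SaveMemory','on' can be the only workable option for large data sets.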

Output Arguments

`T`— Cluster indices
numeric column vector

Cluster indices, returned as a numeric column vector. T has as many rows as X, and each row of T indicates the cluster assignment of the corresponding observation in X.
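Because T lines up row by row with X, it can be used directly for counting and indexing. A minimal sketch, assuming hypothetical random data and an arbitrary limit of four clusters:

```matlab
rng('default');                        % For reproducibility
X = rand(25,3);                        % Hypothetical sample data
T = clusterdata(X,'MaxClust',4);

counts = accumarray(T,1);              % counts(k) is the size of cluster k
firstCluster = X(T == 1,:);            % Observations assigned to cluster 1
```

The sizes in counts always add up to the number of rows of X, since every observation receives exactly one cluster index.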

Tips

If 'Linkage' is 'centroid' or 'median', then linkage can produce a cluster tree that is not monotonic. This result occurs when the distance from the union of two clusters, r and s, to a third cluster is less than the distance between r and s. In this case, in a dendrogram drawn with the default orientation, the path from a leaf to the root node takes some downward steps. To avoid this result, specify another value for 'Linkage'. The following image shows a nonmonotonic cluster tree.

In this case, cluster 1 and cluster 3 are joined into a new cluster, while the distance between this new cluster and cluster 2 is less than the distance between cluster 1 and cluster 3.
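One way to check a tree for this behavior is to look at the merge heights in the third column of the linkage output: in a monotonic tree they never decrease from one merge to the next. The helper isMonotonic below is a hypothetical name introduced for this sketch, and the data is random.

```matlab
rng('default');                              % For reproducibility
X = rand(30,2);                              % Hypothetical sample data

Zcentroid = linkage(X,'centroid');           % May be nonmonotonic
Zsingle   = linkage(X,'single');             % Single linkage is monotonic

% A tree is monotonic if successive merge heights never decrease
isMonotonic = @(Z) all(diff(Z(:,3)) >= 0);

centroidMonotonic = isMonotonic(Zcentroid);  % Depends on the data
singleMonotonic   = isMonotonic(Zsingle);
```

If the check fails for a centroid or median tree, switching to a method such as 'average' or 'ward' avoids the inversions.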

Algorithms

If you specify a value c for the cutoff input argument, then T = clusterdata(X,c) performs the following steps:

Create a vector of the Euclidean distance between pairs of observations in X by using pdist.
Y = pdist(X,'euclidean')
Create an agglomerative hierarchical cluster tree from Y by using linkage with the 'single' method for computing the shortest distance between clusters.
Z = linkage(Y,'single')
If 0 < c < 2, use cluster to define clusters from Z when inconsistent values are less than c.
T = cluster(Z,'Cutoff',c)
If c is an integer value ≥ 2, use cluster to find a maximum of c clusters from Z.
T = cluster(Z,'MaxClust',c)
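The steps above can be written out next to the one-step call for comparison. This is a sketch on hypothetical random data with an integer cutoff; according to the description above, the two sets of assignments should agree.

```matlab
rng('default');                   % For reproducibility
X = rand(50,3);                   % Hypothetical sample data
c = 4;                            % Integer cutoff: maximum number of clusters

% One-step call
T1 = clusterdata(X,c);

% Step-by-step equivalent
Y  = pdist(X,'euclidean');        % Pairwise Euclidean distances
Z  = linkage(Y,'single');         % Single-linkage hierarchical tree
T2 = cluster(Z,'MaxClust',c);     % Cut the tree into at most c clusters

sameAssignments = isequal(T1,T2); % Compare the two results
```

Running the steps separately is useful when you want to inspect Y or Z, for example to plot a dendrogram before deciding where to cut.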

Alternative Functionality

If you have a hierarchical cluster tree Z (the output of the linkage function for the input data matrix X), you can use cluster to perform agglomerative clustering on Z and return the cluster assignment for each observation (row) in X.
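One practical reason to keep the tree is that you can cut it several times without rebuilding it. A short sketch, assuming hypothetical random data and Ward linkage:

```matlab
rng('default');                   % For reproducibility
X = rand(40,3);                   % Hypothetical sample data

Z = linkage(X,'ward');            % Build the tree once

T3 = cluster(Z,'MaxClust',3);     % Cut into at most three clusters
T5 = cluster(Z,'MaxClust',5);     % Reuse the same tree for at most five
```

Each call to cluster reuses Z, so trying several cluster counts costs far less than calling clusterdata repeatedly on X.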
