Clustering Algorithm for Data Mining
For Data with Categorical Values
For Mixed Numeric and Nominal Data
For Large Database
Conceptual Clustering
For Data Warehouse
For Spatial Database
Visualization
Extention from k-means
Superparamagnetic Data Clustering
For Document Clustering
For Sequences of Composite Objects
Hierarchical Clustering
For Multimedia Database
Applications
TITLE:
Extensions to the k-means algorithm for clustering large data
sets with categorical values
AUTHOR: Zhexue Huang;
ACSys CRC, CSIRO, Canberra, ACT, Australia
PUBLICATION: Data Mining and Knowledge Discovery,
vol.2, no.3, p. 283-304,
1384-5810
Kluwer Academic Publishers 1998
34 Refs.
LANGUAGE: English
ABSTRACT: The k-means algorithm
is well known for its efficiency in
clustering large data sets. However, working only on numeric
values prohibits it from bring used to cluster real world
data containing categorical values. In this paper we present
two algorithms which extend the k-means algorithm to
categorical domains and domains with mixed numeric and
categorical values. The k-modes algorithm uses a simple
matching dissimilarity measure to deal with categorical
objects, replaces the means of clusters with modes, and uses
a frequency-based method to update modes in the clustering
process to minimise the clustering cost function. With these
extensions the k-modes algorithm enables the clustering of
categorical data in a fashion similar to k-means. The
k-prototypes algorithm, through the definition of a combined
dissimilarity measure, further integrates the k-means and
k-modes algorithms to allow for clustering objects described
by mixed numeric and categorical attributes. We use the well
known soybean disease and credit approval data sets to
demonstrate the clustering performance of the two algorithms.
Our experiments on two real world data sets with half a
million objects each show that the two algorithms are
efficient when clustering large data sets, which is critical
to data mining applications.
THESAURUS: data mining; pattern
recognition
OTHER SUBJECTS: k-means algorithm; clustering large data sets;
categorical objects; clustering process; data mining
TITLE:
CURE: an efficient clustering algorithm for large databases
AUTHOR: Guha, S.; Rastogi,
R.; Shim, K.; Stanford Univ., CA, USA
CONFERENCE: 1998 ACM SIGMOD International Conference
on Management of Data
Seattle, WA, USA 1-4 June 1998
PUBLICATION: SIGMOD Rec. (USA), SIGMOD Record, vol.27,
no.2, p. 73-84,
0163-5808
ACM June 1998
14 Refs.
LANGUAGE: English
ABSTRACT: Clustering in data mining
is useful for discovering groups and
identifying interesting distributions in the underlying data.
Traditional clustering algorithms either favor clusters with
spherical shapes and similar sizes, or are very fragile in
the presence of outliers. We propose a novel clustering
algorithm called CURE that is more robust to outliers, and
identifies clusters having non spherical shapes and wide
variances in size. CURE achieves this by representing each
cluster by a certain fixed number of points that are
generated by selecting well scattered points from the cluster
and then shrinking them toward the center of the cluster by a
specified fraction. Having more than one representative point
per cluster allows CURE to adjust well to the geometry of non
spherical shapes and the shrinking helps to dampen the
effects of outliers. To handle large databases, CURE employs
a combination of random sampling and partitioning. A random
sample drawn from the data set is first partitioned and each
partition is partially clustered. The partial clusters are
then clustered in a second pass to yield the desired
clusters. Our experimental results confirm that the quality
of clusters produced by CURE is much better than those found
by existing algorithms. Furthermore, they demonstrate that
random sampling and partitioning enable CURE to not only
outperform existing algorithms but also to scale well for
large databases without sacrificing clustering quality.
THESAURUS: deductive databases;
knowledge acquisition;
pattern recognition; very large databases
OTHER SUBJECTS: CURE; clustering algorithm; large databases;
data mining;
outliers; non spherical shapes; well scattered points;
representative point; random sampling; partitioning;
data set; partial clusters; clustering quality
TITLE:
ITERATE: a conceptual clustering algorithm for data mining
AUTHOR: Biswas, G.;
Weinberg, J.B.; Fisher, D.H.; Dept. of Comput.
Sci., Vanderbilt Univ., Nashville, TN, USA
PUBLICATION: IEEE Transactions on Systems, Man and
Cybernetics, Part C
(Applications and Reviews), vol.28, no.2, p. 219-30,
1094-6977
IEEE May 1998
42 Refs.
LANGUAGE: English
ABSTRACT: The data exploration
task can be divided into three
interrelated subtasks: 1) feature selection, 2) discovery,
and 3) interpretation. This paper describes an unsupervised
discovery method with biases geared toward partitioning
objects into clusters that improve interpretability. The
algorithm ITERATE employs: 1) a data ordering scheme and 2)
an iterative redistribution operator to produce maximally
cohesive and distinct clusters. Cohesion or intraclass
similarity is measured in terms of the match between
individual objects and their assigned cluster prototype.
Distinctness or interclass dissimilarity is measured by an
average of the variance of the distribution match between
clusters. The authors demonstrate that interpretability, from
a problem-solving viewpoint, is addressed by the intraclass
and interclass measures. Empirical results demonstrate the
properties of the discovery algorithm and its applications to
problem solving.
THESAURUS: feature extraction;
iterative methods;
knowledge acquisition; problem solving
OTHER SUBJECTS: ITERATE; conceptual clustering algorithm;
data mining;
data exploration; feature selection; discovery;
interpretation; unsupervised discovery method; biases;
object partitioning; data ordering scheme;
iterative redistribution operator;
maximally cohesive clusters; maximally distinct clusters;
intraclass similarity; cohesion; interclass dissimilarity;
distinctness; distribution match; problem solving;
intraclass measures; interclass measures
TITLE:
BIRCH: a new data clustering algorithm and its applications
AUTHOR: Tian Zhang;
Ramakrishnan, R.; Livny, M.; Dept. of Comput. Sci.,
Wisconsin Univ., Madison, WI, USA
PUBLICATION: Data Mining and Knowledge Discovery,
vol.1, no.2, p. 141-82,
1384-5810
Kluwer Academic Publishers 1997
28 Refs.
LANGUAGE: English
ABSTRACT: Data clustering is an
important technique for exploratory data
analysis, and has been studied for several years. It has been
shown to be useful in many practical domains such as data
classification and image processing. Recently, there has been
a growing emphasis on exploratory analysis of very large
datasets to discover useful patterns and/or correlations
among attributes. This is called data mining, and data
clustering is regarded as a particular branch. However
existing data clustering methods do not adequately address
the problem of processing large datasets with a limited
amount of resources (e.g., memory and cpu cycles). So as the
dataset size increases, they do not scale up well in terms of
memory requirement, running time, and result quality. In this
paper, an efficient and scalable data clustering method is
proposed, based on a new in-memory data structure called
CF-tree, which serves as an in-memory summary of the data
distribution. We have implemented it in a system called BIRCH
(Balanced Iterative Reducing and Clustering using
Hierarchies), and studied its performance extensively in
terms of memory requirements, running time, clustering
quality, stability and scalability; we also compare it with
other available methods. Finally, BIRCH is applied to solve
two real-life problems: one is building an iterative and
interactive pixel classification tool, and the other is
generating the initial codebook for image compression.
THESAURUS: data analysis; data
compression; deductive databases;
image coding; knowledge acquisition;
pattern classification; software performance evaluation;
tree data structures; very large databases
OTHER SUBJECTS: BIRCH; data clustering algorithm; exploratory
data analysis;
data classification; image processing;
very large datasets; data mining; memory requirement;
running time; result quality; tree data structure;
CF-tree; data distribution; Balanced Iterative Reducing and
Clustering using Hierarchies; performance;
incremental algorithm; stability; scalability;
pixel classification; codebook; image compression
TITLE:
A distribution-based clustering algorithm for mining in large
spatial databases
AUTHOR: Xiaowei Xu;
Ester, M.; Kriegel, H.-P.; Sander, J.; Munich
Univ., Germany
CONFERENCE: Proceedings 14th International
Conference on Data Engineering
Orlando, FL, USA 23-27 Feb. 1998
SPONSOR: IEEE Comput. Soc. Tech. Committee on Data Eng
PUBLICATION: Proceedings. 14th International Conference
on Data Engineering
(Cat. No.98CB36164), p. 324-31
Los Alamitos, CA, USA IEEE Comput. Soc 1998
xxi+605 p.
ISBN: 0818682892
14 Refs.
LANGUAGE: English
ABSTRACT: The problem of detecting
clusters of points belonging to a
spatial point process arises in many applications. In this
paper, we introduce the new clustering algorithm DBCLASD
(Distribution-Based Clustering of LArge Spatial Databases) to
discover clusters of this type. The results of experiments
demonstrate that DBCLASD, contrary to partitioning algorithms
such as CLARANS (Clustering Large Applications based on
RANdomized Search), discovers clusters of arbitrary shape.
Furthermore, DBCLASD does not require any input parameters,
in contrast to the clustering algorithm DBSCAN (Density-Based
Spatial Clustering of Applications with Noise) requiring two
input parameters, which may be difficult to provide for large
databases. In terms of efficiency, DBCLASD is between CLARANS
and DBSCAN, close to DBSCAN. Thus, the efficiency of DBCLASD
on large spatial databases is very attractive when
considering its nonparametric nature and its good quality for
clusters of arbitrary shape.
THESAURUS: data analysis; deductive
databases; knowledge acquisition;
very large databases; visual databases
OTHER SUBJECTS: distribution-based clustering algorithm; data
mining;
large spatial databases; point cluster detection;
spatial point process; DBCLASD; partitioning algorithms;
CLARANS; randomized search; arbitrary shaped clusters;
DBSCAN; density-based spatial clustering; noise;
input parameters; efficiency; nonparametric nature;
quality
TITLE:
Unsupervised clustering with mixed numeric and nominal data-a
new similarity based agglomerative system
AUTHOR: Cen Li; Biswas,
G.; Dept. of Comput. Sci., Vanderbilt Univ.,
Nashville, TN, USA
CONFERENCE: Proceedings of First Pacific-Asia
Conference. Knowledge
Discovery and Data Mining: Techniques and Applications
Singapore 23-24 Feb. 1997
PUBLICATION: Proceedings of the First Pacific-Asia
Conference on Knowledge
Discovery and Data Mining. KDD: Techniques and Applications,
p. 35-48
EDITOR: Lu, H.
Motoda, H.
Liu, H.
Singapore World Scientific 1997
xvi+367 p.
ISBN: 9810229194
8 Refs.
LANGUAGE: English
ABSTRACT: Presents a new similarity-based
agglomerative clustering (SBAC)
algorithm that works well for data with mixed numeric and
nominal features. A similarity measure, proposed by D.W.
Goodall (1966) for biological taxonomies, that gives greater
weight to uncommon feature-value matches in similarity
computations and makes no assumptions about the underlying
distributions of the feature values, is adopted to define the
similarity measure between pairs of objects. An agglomerative
algorithm is employed to construct a concept tree, and a
simple distinctness heuristic is used to extract a partition
of the data. The performance of SBAC has been studied on real
and artificially generated data sets. The results demonstrate
the effectiveness of this algorithm in unsupervised discovery
tasks. Comparisons with other schemes illustrate the superior
performance of the algorithm.
THESAURUS: data mining; pattern
clustering; pattern matching;
semantic networks; software performance evaluation;
statistical analysis; unsupervised learning
OTHER SUBJECTS: unsupervised clustering; numerical data;
nominal data;
similarity-based agglomerative clustering algorithm;
similarity measure; biological taxonomy;
uncommon feature-value matches;
feature value distributions; concept tree;
distinctness heuristic; data partitioning;
algorithm performance; unsupervised discovery tasks
TITLE:
Refining initial points for K-means clustering
AUTHOR: Bradley, P.S.;
Fayyad, U.M.; Dept. of Comput. Sci., Wisconsin
Univ., Madison, WI, USA
CONFERENCE: Proceedings of Machine Learning
(ICML-98) Madison, WI, USA
24-27 July 1998
PUBLICATION: Machine Learning. Proceedings of the
Fifteenth International
Conference (ICML'98), p. 91-9
EDITOR: Shavlik, J.
San Francisco, CA, USA Morgan Kaufmann Publishers 1998
x+580 p.
23 Refs.
LANGUAGE: English
ABSTRACT: Practical approaches
to clustering use an iterative procedure
(e.g. K-means, expectation maximization) which converges to
one of numerous local minima. It is known that these
iterative techniques are especially sensitive to initial
starting conditions. We present a procedure for computing a
refined starting condition from a given initial one that is
based on an efficient technique for estimating the modes of a
distribution. The refined initial starting condition allows
the iterative algorithm to converge to a "better" local
minimum. The procedure is applicable to a wide class of
clustering algorithms for both discrete and continuous data.
We demonstrate the application of this method to the K-means
clustering algorithm and show that refined initial starting
points indeed lead to improved solutions. The refinement
run-time is considerably lower than the time required to
cluster the full database. The method is scalable and can be
coupled with a scalable clustering algorithm to address the
large-scale clustering problems in data mining.
THESAURUS: convergence of numerical methods;
data mining;
database theory; initial value problems;
iterative methods; minimisation; pattern clustering
OTHER SUBJECTS: initial point refinement; K-means clustering
algorithm;
iterative algorithm; convergence; local minima; refined
initial starting condition; discrete data; distribution
mode estimation technique; continuous data;
refinement run-time; database clustering;
scalable clustering algorithm; data mining;
large-scale clustering problems
TITLE:
WaveCluster: a multi-resolution clustering approach for very
large spatial databases
AUTHOR: Sheikholeslami,
G.; Chatterjee, S.; Zhang, A.; Dept. of Comput.
Sci., State Univ. of New York, Buffalo, NY, USA
CONFERENCE: Proceedings of 24th Annual International
Conference on Very
Large Data Bases (VLDB'98) New York, NY, USA 24-27 Aug. 1998
SPONSOR: Oracle; AT&T Lab.; IBM; Informix; Microsoft
PUBLICATION: Proceedings of the Twenty-Fourth International
Conference on
Very-Large Databases, p. 428-39
EDITOR: Gupta,A.
Shmueli,O.
Widom,J.
San Francisco, CA, USA Morgan Kaufmann Publishers Inc 1998
xvii+708 p.
ISBN: 1558605665
0 Refs.
LANGUAGE: English
ABSTRACT: Many applications require
the management of spatial data.
Clustering large spatial databases is an important problem
which tries to find the densely populated regions in the
feature space to be used in data mining, knowledge discovery,
or efficient information retrieval. A good clustering
approach should be efficient and detect clusters of arbitrary
shape. It must be insensitive to the outliers (noise) and the
order of input data. We propose WaveCluster, a novel
clustering approach based on wavelet transforms, which
satisfies all the above requirements. Using the
multiresolution property of wavelet transforms, we can
effectively identify arbitrary shape clusters at different
degrees of accuracy. We also demonstrate that WaveCluster is
highly efficient in terms of time complexity. Experimental
results on very large data sets are presented which show the
efficiency and effectiveness of the proposed approach
compared to the other recent clustering methods.
THESAURUS: computational complexity;
data mining;
spatial data structures; very large databases;
visual databases; wavelet transforms
OTHER SUBJECTS: WaveCluster; multi-resolution clustering approach;
very large spatial databases; data mining;
knowledge discovery; information retrieval; outliers;
input data; wavelet transforms; time complexity;
experimental results; very large data sets
TITLE:
Incremental clustering for mining in a data warehousing
environment
AUTHOR: Ester, M.;
Kriegel, H.-P.; Sander, J.; Wimmer, M.; Xiaowei Xu;
Inst. for Comput. Sci., Munchen Univ., Germany
CONFERENCE: Proceedings of 24th Annual International
Conference on Very
Large Data Bases (VLDB'98) New York, NY, USA 24-27 Aug. 1998
SPONSOR: Oracle; AT&T Lab.; IBM; Informix; Microsoft
PUBLICATION: Proceedings of the Twenty-Fourth International
Conference on
Very-Large Databases, p. 323-33
EDITOR: Gupta,A.
Shmueli,O.
Widom,J.
San Francisco, CA, USA Morgan Kaufmann Publishers Inc 1998
xvii+708 p.
ISBN: 1558605665
22 Refs.
LANGUAGE: English
ABSTRACT: Data warehouses provide
a great deal of opportunities for
performing data mining tasks such as classification and
clustering. Typically, updates are collected and applied to
the data warehouse periodically in a batch mode, e.g., during
the night. Then, all patterns derived from the warehouse by
some data mining algorithm have to be updated as well. Due to
the very large size of the databases, it is highly desirable
to perform these updates incrementally. We present the first
incremental clustering algorithm. Our algorithm is based on
the clustering algorithm DBSCAN which is applicable to any
database containing data from a metric space, e.g., to a
spatial database or to a WWW-log database. Due to the
density-based nature of DBSCAN, the insertion or deletion of
an object affects the current clustering only in the
neighborhood of this object. Thus, efficient algorithms can
be given for incremental insertions and deletions to an
existing clustering. Based on the formal definition of
clusters, it can be proven that the incremental algorithm
yields the same result as DBSCAN. A performance evaluation of
IncrementalDBSCAN on a spatial database as well as on a
WWW-log database is presented, demonstrating the efficiency
of the proposed algorithm. IncrementalDBSCAN yields
significant speed-up factors over DBSCAN even for large
numbers of daily updates in a data warehouse.
THESAURUS: data mining; data structures;
data warehouses;
software performance evaluation; visual databases
OTHER SUBJECTS: incremental clustering; data warehouses;
data mining;
classification; data clustering; updates; batch mode;
very large database; incremental clustering algorithm;
DBSCAN; performance evaluation; IncrementalDBSCAN;
spatial database; World Wide Web log database
TITLE:
Clustering categorical data: an approach based on dynamical
systems
AUTHOR: Gibson, D.;
Kleinberg, J.; Raghavan, P.; Dept. of Comput. Sci.,
California Univ., Berkeley, CA, USA
CONFERENCE: Proceedings of 24th Annual International
Conference on Very
Large Data Bases (VLDB'98) New York, NY, USA 24-27 Aug. 1998
SPONSOR: Oracle; AT&T Lab.; IBM; Informix; Microsoft
PUBLICATION: Proceedings of the Twenty-Fourth International
Conference on
Very-Large Databases, p. 311-22
EDITOR: Gupta,A.
Shmueli,O.
Widom,J.
San Francisco, CA, USA Morgan Kaufmann Publishers Inc 1998
xvii+708 p.
ISBN: 1558605665
39 Refs.
LANGUAGE: English
ABSTRACT: We describe a novel approach
for clustering collections of
sets, and its application to the analysis and mining of
categorical data. By categorical data we mean tables with
fields that cannot be naturally ordered by a metric-e.g., the
names of producers of automobiles, or the names of products
offered by a manufacturer. Our approach is based on an
iterative method for assigning and propagating weights on the
categorical values in a table; this facilitates a type of
similarity measure arising from the co-occurrence of values
in the dataset. Our techniques can be studied analytically in
terms of certain types of non-linear dynamical systems. We
discuss experiments on a variety of tables of synthetic and
real data; we find that our iterative methods converge
quickly to prominently correlated values of various
categorical fields.
THESAURUS: data analysis; data
mining; data structures;
very large databases
OTHER SUBJECTS: categorical data clustering; data analysis;
data mining;
tables; iterative method; categorical value weights;
dataset; nonlinear dynamical systems; experiments
TITLE:
Clustering large datasets in arbitrary metric spaces
AUTHOR: Ganti, V.;
Ramakrishnan, R.; Gehrke, J.; Powell, A.; French, J.
;
Dept. of Comput. Sci., Virginia Univ., Charlottesville, VA,
USA
CONFERENCE: Proceedings of IEEE Computer Society
15th International
Conference on Data Engineering Sydney, NSW, Australia 23-26
March 1999
SPONSOR: IEEE Comput. Soc. Tech. Committee on Data Eng
PUBLICATION: Proceedings 15th International Conference
on Data Engineering
(Cat. No.99CB36337), p. 502-11
EDITOR: Kitsuregawa, M.
Maciaszek, L.
Papazoglou, M.
Pu, C.
Los Alamitos, CA, USA IEEE Comput. Soc 1999
xxiii+648 p.
ISBN: 0769500714
26 Refs.
LANGUAGE: English
ABSTRACT: Clustering partitions
a collection of objects into groups
called clusters, such that similar objects fall into the same
group. Similarity between objects is defined by a distance
function satisfying the triangle inequality; this distance
function along with the collection of objects describes a
distance space. In a distance space, the only operation
possible on data objects is the computation of distance
between them. All scalable algorithms in the literature
assume a special type of distance space, namely a
k-dimensional vector space, which allows vector operations on
objects. We present two scalable algorithms designed for
clustering very large datasets in distance spaces. Our first
algorithm BUBBLE is, to our knowledge, the first scalable
clustering algorithm for data in a distance space. Our second
algorithm BUBBLE-FM improves upon BUBBLE by reducing the
number of calls to the distance function, which may be
computationally very expensive. Both algorithms make only a
single scan over the database while producing high clustering
quality. In a detailed experimental evaluation, we study both
algorithms in terms of scalability and quality of clustering.
We also show results of applying the algorithms to a real
life dataset.
THESAURUS: data handling; data
mining; database theory;
trees (mathematics); very large databases
OTHER SUBJECTS: large datasets; arbitrary metric spaces;
similar objects;
distance function; triangle inequality; distance space;
data objects; scalable algorithms;
k-dimensional vector space; vector operations;
very large dataset clustering; distance spaces; BUBBLE;
scalable clustering algorithm; BUBBLE-FM;
clustering quality; scalability; quality;
real life dataset
TITLE:
Superparamagnetic clustering of data. The definitive solution
of an ill-posed problem
AUTHOR: Domany, E.;
Dept. of Phys. of Complex Syst., Weizmann Inst. of
Sci., Rehovoth, Israel
CONFERENCE: STATPHYS 20. 20th IUPAP International
Conference on Statistical
Physics Paris, France 20-24 July 1998
PUBLICATION: Physica A (Netherlands), Physica A,
vol.263, no.1-4, p. 158-69,
0378-4371
Elsevier 1 Feb. 1999
18 Refs.
LANGUAGE: English
ABSTRACT: Clustering is an important
technique in exploratory data
analysis, with applications in image processing, object
classification, target recognition, data mining etc. The aim
is to partition data according to natural classes present in
it, assigning data points that are "more similar" to the same
"cluster". We solved this ill-posed problem without making
any assumptions about the structure of the data, by using a
physical system as an analog computer. The physical system we
use is a disordered (granular) magnet. The method was tested
successfully on a variety of artificial and real-life
problems, such as classification of flowers, processing of
satellite images, speech recognition and identification of
textures and images. We are currently involved in several
collaborations, applying the method to image classification,
fMRI data analysis and classification of protein structures.
THESAURUS: analogue computers;
botany; data analysis;
granular materials; image classification; image texture;
molecular biophysics; molecular configurations;
paramagnetic materials; pattern clustering; proteins;
speech recognition; statistics; superparamagnetism
OTHER SUBJECTS: superparamagnetic data clustering; ill-posed
problem;
exploratory data analysis; data partitioning;
physical system; analog computer; disordered magnet;
granular magnet; flower classification;
satellite image processing; speech recognition;
texture identification; image identification;
image classification; fMRI data analysis;
protein structure classification; statistics
TITLE:
Incremental conceptual clustering in the framework of Galois
lattice
AUTHOR: Tu Bao Ho;
Japan Adv. Inst. of Sci. & Technol., Ishikawa, Japan
CONFERENCE: Proceedings of First Pacific-Asia
Conference. Knowledge
Discovery and Data Mining: Techniques and Applications
Singapore 23-24 Feb. 1997
PUBLICATION: Proceedings of the First Pacific-Asia
Conference on Knowledge
Discovery and Data Mining. KDD: Techniques and Applications,
p. 49-61
EDITOR: Lu, H.
Motoda, H.
Liu, H.
Singapore World Scientific 1997
xvi+367 p.
ISBN: 9810229194
15 Refs.
LANGUAGE: English
ABSTRACT: Conceptual clustering
is a branch of concept learning that aims
at discovering conceptual knowledge from data. Like most
methods employing the classical view of concepts, the OSHAM
method allows a concept hierarchy to be induced
nonincrementally in the framework of a Galois lattice. This
paper presents an incremental version of OSHAM (called
INCOSHAM) for a discovery task in which training instances
are presented serially. Experimental results suggest an
alternative use of nonincremental and incremental algorithms
in order to improve the performance of the discovery process.
THESAURUS: data mining; Galois
fields;
learning (artificial intelligence); pattern clustering;
semantic networks
OTHER SUBJECTS: incremental conceptual clustering; Galois lattice;
concept learning; conceptual knowledge discovery; OSHAM;
concept hierarchy induction; INCOSHAM;
serially presented training instances;
nonincremental algorithms; discovery process performance
TITLE:
Clustering large data sets with mixed numeric and categorical
values
AUTHOR: Zhexue Huang;
Math. & Inf. Sci., CSIRO, Canberra, ACT,
Australia
CONFERENCE: Proceedings of First Pacific-Asia
Conference. Knowledge
Discovery and Data Mining: Techniques and Applications
Singapore 23-24 Feb. 1997
PUBLICATION: Proceedings of the First Pacific-Asia
Conference on Knowledge
Discovery and Data Mining. KDD: Techniques and Applications,
p. 21-34
EDITOR: Lu, H.
Motoda, H.
Liu, H.
Singapore World Scientific 1997
xvi+367 p.
ISBN: 9810229194
16 Refs.
LANGUAGE: English
ABSTRACT: Efficient partitioning
of large data sets into homogenous
clusters is a fundamental problem in data mining. The
standard hierarchical clustering methods provide no solution
for this problem, due to their computational inefficiency.
The k-means based methods are promising for their efficiency
in processing large data sets. However, their use is often
limited to numeric data. In this paper, we present a
k-prototypes algorithm which is based on the k-means paradigm
but removes the numeric data limitation whilst preserving its
efficiency. In the algorithm, objects are clustered against k
prototypes. A method is developed to dynamically update the k
prototypes in order to maximise the intra-cluster similarity
of objects. When applied to numeric data, the algorithm is
identical to the k-means method. To assist in the
interpretation of clusters, we use decision tree induction
algorithms to create rules for clusters. These rules,
together with other statistics about clusters, can assist
data miners to understand and identify interesting clusters.
THESAURUS: data mining; pattern
clustering; statistical databases;
very large databases
OTHER SUBJECTS: large data set clustering; numerical values;
categorical values; data set partitioning;
homogenous clusters; data mining;
hierarchical clustering methods; computational efficiency;
k-means paradigm; k-prototypes algorithm;
dynamic updating; intra-cluster similarity;
decision tree induction algorithms; cluster rules;
cluster statistics; interesting cluster identification
TITLE:
Fast and intuitive clustering of Web documents
AUTHOR: Zamir, O.;
Etzioni, O.; Madani, O.; Karp, R.M.; Dept. of
Comput. Sci. & Eng., Washington Univ., Seattle, WA, USA
CONFERENCE: Proceedings of the Third International
Conference on Knowledge
Discovery and Data Mining -KDD 97 Newport Beach, CA, USA
14-17 Aug. 1997
PUBLICATION: Proceedings of the Third International
Conference on Knowledge
Discovery and Data Mining, p. 287-90
EDITOR: Heckerman, D.
Mannila, H.
Pregibon, D.
Uthurusamy, R.
Menlo Park, CA, USA AAAI Press 1997
xiv+311 p.
10 Refs.
LANGUAGE: English
ABSTRACT: Conventional document
retrieval systems (e.g., Alta Vista)
return long lists of ranked documents in response to user
queries. Recently, document clustering has been put forth as
an alternative method of organizing retrieval results (D.R.
Cutting et al., 1992). A person browsing the clusters can
discover patterns that could be overlooked in the traditional
presentation. The paper describes two novel clustering
methods that intersect the documents in a cluster to
determine the set of words (or phrases) shared by all the
documents in the cluster. We report on experiments that
evaluate these intersection based clustering methods on
collections of snippets returned from Web search engines.
First, we show that word intersection clustering produces
superior clusters and does so faster than standard
techniques. Second, we show that our O(nlog n) time phrase
intersection clustering method produces comparable clusters
and does so more than two orders of magnitude faster than all
methods tested.
THESAURUS: computational complexity;
data mining;
information retrieval; Internet; pattern clustering;
search engines
OTHER SUBJECTS: intuitive clustering; Web documents;
document retrieval systems; Alta Vista; ranked documents;
user queries; document clustering; retrieval results;
clustering methods; intersection based clustering methods;
snippets; Web search engines;
word intersection clustering;
phrase intersection clustering method; pattern discovery
TITLE:
Clustering sequences of complex objects
AUTHOR: Ketterlin,
A.; CNRS, Strasbourg, France
CONFERENCE: Proceedings of the Third International
Conference on Knowledge
Discovery and Data Mining -KDD 97 Newport Beach, CA, USA
14-17 Aug. 1997
PUBLICATION: Proceedings of the Third International
Conference on Knowledge
Discovery and Data Mining, p. 215-18
EDITOR: Heckerman, D.
Mannila, H.
Pregibon, D.
Uthurusamy, R.
Menlo Park, CA, USA AAAI Press 1997
xiv+311 p.
6 Refs.
LANGUAGE: English
ABSTRACT: This paper is about the
unsupervised discovery of patterns in
sequences of composite objects. A composite object may be
described as a sequence of other, simpler data. In such
cases, not only the nature of the components is important,
but also the order in which these components appear. The work
studies the problem of generalizing sequences of complex
objects. A formal definition of generalized sequences is
given, and an algorithm is derived. Because of the excessive
computational complexity of this algorithm, a heuristic
version is described. This algorithm is then integrated in a
general-purpose clustering algorithm. The result is a
knowledge discovery system which is able to analyze any
structured database on the base of a unified, unsupervised
mechanism.
THESAURUS: computational complexity;
data mining;
heuristic programming; pattern recognition;
very large databases
OTHER SUBJECTS: complex object sequence clustering;
unsupervised pattern discovery; composite objects;
generalized sequences; computational complexity;
heuristic algorithm; general-purpose clustering algorithm;
knowledge discovery; data mining
TITLE: Interactive interpretation of hierarchical
clustering
AUTHOR: Boudaillier,
E.; Hebrail, G.
PUBLICATION: Intelligent Data Analysis, vol.2, no.3,
1088-467X
Elsevier Aug. 1998
10 Refs.
LANGUAGE: English
ABSTRACT: Automatic clustering
methods are part of data mining methods.
They aim at building clusters of items so that similar items
fall into the same cluster while unsimilar items fall into
separate clusters. A particular class of clustering methods
are hierarchical ones where recursive clusters are formed to
grow a binary tree representing an approximation of
similarities between items. We propose a new interactive
interface to help the user to interpret the result of such a
clustering process, according to the item characteristics.
The prototype has been applied successfully to a special case
of items providing nice graphical representations (electric
load curves) but can also be used with other types of curves
or with more standard items.
THESAURUS: data analysis; data
mining; tree data structures;
user interfaces; very large databases
OTHER SUBJECTS: interactive interpretation; hierarchical clustering;
automatic clustering methods; data mining;
recursive clusters; binary tree; interactive interface;
prototype; graphical representations; curves;
data analysis
TITLE:
Human performance on clustering Web pages: a preliminary study
AUTHOR: Macskassy,
S.A.; Banerjee, A.; Davison, B.D.; Hirsh, H.; Dept.
of Comput. Sci., Rutgers Univ., New Brunswick, NJ, USA
CONFERENCE: Proceedings of the Fourth International
Conference on Knowledge
Discovery and Data Mining New York, NY, USA 27-31 Aug. 1998
PUBLICATION: Proceedings Fourth International Conference
on Knowledge
Discovery and Data Mining, p. 264-8
EDITOR: Agrawal, R.
Stolorz, P.
Menlo Park, CA, USA AAAI Press 1998
xii+382 p.
ISBN: 1577350707
15 Refs.
LANGUAGE: English
ABSTRACT: With the increase in
information on the World Wide Web it has
become difficult to quickly find desired information without
using multiple queries or using a topic-specific search
engine. One way to help in the search is by grouping HTML
pages together that appear in some way to be related. In
order to better understand this task, we performed an initial
study of human clustering of web pages, in the hope that it
would provide some insight into the difficulty of automating
this task. Our results show that subjects did not cluster
identically; in fact, on average, any two subjects had little
similarity in their web-page clusters. We also found that
subjects generally created rather small clusters, and those
with access only to URLs created fewer clusters than those
with access to the full text of each web page. Generally the
overlap of documents between clusters for any given subject
increased when given the full text, as did the percentage of
documents clustered. When analyzing individual subjects, we
found that each had different behavior across queries, both
in terms of overlap, size of clusters, and number of
clusters. These results provide a sobering note on any quest
for a single clearly correct clustering method for web pages.
THESAURUS: hypermedia markup languages;
information resources;
information retrieval
OTHER SUBJECTS: World Wide Web; multiple queries;
topic-specific search engine; HTML pages;
human clustering
TITLE:
Initialization of iterative refinement clustering algorithms
AUTHOR: Fayyad, U.;
Reina, C.; Bradley, P.S.; Microsoft Res., Redmond,
WA, USA
CONFERENCE: Proceedings of the Fourth International
Conference on Knowledge
Discovery and Data Mining New York, NY, USA 27-31 Aug. 1998
PUBLICATION: Proceedings Fourth International Conference
on Knowledge
Discovery and Data Mining, p. 194-8
EDITOR: Agrawal, R.
Stolorz, P.
Menlo Park, CA, USA AAAI Press 1998
xii+382 p.
ISBN: 1577350707
19 Refs.
LANGUAGE: English
ABSTRACT: Iterative refinement
clustering algorithms (e.g. K-Means, EM)
converge to one of numerous local minima. It is known that
they are especially sensitive to initial conditions. We
present a procedure for computing a refined starting
condition from a given initial one that is based on an
efficient technique for estimating the modes of a
distribution. The refined initial starting condition leads to
convergence to "better" local minima. The procedure is
applicable to a wide class of clustering algorithms for both
discrete and continuous data. We demonstrate the application
of this method to the expectation maximization (EM)
clustering algorithm and show that refined initial points
indeed lead to improved solutions. Refinement run time is
considerably lower than the time required to cluster the full
database. The method is scalable and can be coupled with a
scalable clustering algorithm to address large-scale
clustering in data mining.
THESAURUS: convergence of numerical methods;
database theory;
pattern clustering; very large databases
OTHER SUBJECTS: initialization; iterative refinement clustering
algorithms;
local minima; convergence; initial conditions;
refined starting condition; continuous data;
discrete data;
expectation maximization clustering algorithm;
refinement run time; scalable clustering algorithm;
large-scale clustering; data mining
TITLE:
Robust incremental clustering with bad instance orderings: a
new strategy
AUTHOR: Roure, J.;
Talavera, L.; Dept. d'Inf. i Gestio, Escola Univ.
Politecnica de Mataro, Catalonia, Spain
CONFERENCE: Progress in Artificial Intelligence
- IBERAMIA 98 Lisbon,
Portugal 5-9 Oct. 1998
PUBLICATION: Progress in Artificial Intelligence
- IBERAMIA 98. 6th
Ibero-American Conference on AI. Proceedings, p. 136-47
EDITOR: Coelho, H.
Berlin, Germany Springer-Verlag 1998
xiii+420 p.
ISBN: 3540649921
13 Refs.
LANGUAGE: English
ABSTRACT: It is widely reported
in the literature that incremental
clustering systems suffer from instance ordering effects and
that under some orderings, extremely poor clusterings may be
obtained. We present a new general strategy aimed to mitigate
these effects, the Not-Yet strategy which has a general and
open formulation and it is not coupled to any particular
system. Unlike other proposals, this strategy maintains the
incremental nature of learning process. In addition, we
propose a classification of strategies to avoid ordering
effects which clarifies the benefits and disadvantages one
can expect from the proposal made in the paper as well from
the existing ones. A particular implementation of the Not-Yet
strategy is used to conduct several experiments. Results
suggest that the strategy improves the clustering quality. We
also show that, when combined with other local strategies,
the Not-Yet strategy allows the clustering system to get high
quality clusterings.
THESAURUS: data mining; learning
(artificial intelligence);
learning systems
OTHER SUBJECTS: incremental clustering; ordering effect;
not-yet strategy;
learning process; machine learning; data mining
TITLE:
An efficient approach to clustering in large multimedia
databases with noise
AUTHOR: Hinneburg,
A.; Keim, D.A.; Inst. of Comput. Sci., Halle Univ.,
Germany
CONFERENCE: Proceedings of the Fourth International
Conference on Knowledge
Discovery and Data Mining New York, NY, USA 27-31 Aug. 1998
PUBLICATION: Proceedings Fourth International Conference
on Knowledge
Discovery and Data Mining, p. 58-65
EDITOR: Agrawal, R.
Stolorz, P.
Menlo Park, CA, USA AAAI Press 1998
xii+382 p.
ISBN: 1577350707
19 Refs.
LANGUAGE: English
ABSTRACT: Several clustering algorithms
can be applied to clustering in
large multimedia databases. The effectiveness and efficiency
of the existing algorithms, however, is somewhat limited,
since clustering in multimedia databases requires clustering
high dimensional feature vectors and since multimedia
databases often contain large amounts of noise. We therefore
introduce a new algorithm to clustering in large multimedia
databases called DENCLUE (DENsity-based CLUstEring). The
basic idea of our new approach is to model the overall point
density analytically as the sum of influence functions of the
data points. Clusters can then be identified by determining
density attractors and clusters of arbitrary shape can be
easily described by a simple equation based on the overall
density function. The advantages of our new approach are: (1)
it has a firm mathematical basis; (2) it has good clustering
properties in data sets with large amounts of noise; (3) it
allows a compact mathematical description of arbitrarily
shaped clusters in high dimensional data sets; and (4) it is
significantly faster than existing algorithms. To demonstrate
the effectiveness and efficiency of DENCLUE, we perform a
series of experiments on a number of different data sets from
CAD and molecular biology. A comparison with DBSCAN shows the
superiority of our new approach.
THESAURUS: data handling; data
mining; multimedia databases;
pattern clustering; very large databases
OTHER SUBJECTS: large multimedia database clustering; clustering
algorithms;
high dimensional feature vectors; noise; DENCLUE;
DENsity-based CLUstEring; overall point density;
influence functions; data points; density attractors;
simple equation; overall density function;
clustering properties; data sets;
compact mathematical description;
arbitrarily shaped clusters; high dimensional data sets;
CAD; molecular biology; DBSCAN
TITLE:
Scaling clustering algorithms to large databases
AUTHOR: Bradley, P.S.;
Fayyad, U.; Reina, C.; Microsoft Res., Redmond,
WA, USA
CONFERENCE: Proceedings of the Fourth International
Conference on Knowledge
Discovery and Data Mining New York, NY, USA 27-31 Aug. 1998
PUBLICATION: Proceedings Fourth International Conference
on Knowledge
Discovery and Data Mining, p. 9-15
EDITOR: Agrawal, R.
Stolorz, P.
Menlo Park, CA, USA AAAI Press 1998
xii+382 p.
ISBN: 1577350707
29 Refs.
LANGUAGE: English
ABSTRACT: Practical clustering
algorithms require multiple data scans to
achieve convergence. For large databases, these scans become
prohibitively expensive. We present a scalable clustering
framework applicable to a wide class of iterative clustering.
We require at most one scan of the database. In this work,
the framework is instantiated and numerically justified with
the popular K-Means clustering algorithm. The method is based
on identifying regions of the data that are compressible,
regions that must be maintained in memory, and regions that
are discardable. The algorithm operates within the confines
of a limited memory buffer. Empirical results demonstrate
that the scalable scheme outperforms a sampling based
approach. In our scheme, data resolution is preserved to the
extent possible based upon the size of the allocated memory
buffer and the fit of current clustering model to the data.
The framework is naturally extended to update multiple
clustering models simultaneously. We empirically evaluate on
synthetic and publicly available data sets.
THESAURUS: data handling; data
mining; pattern clustering;
very large databases
OTHER SUBJECTS: clustering algorithm scaling; large databases;
practical clustering algorithms; multiple data scans;
scalable clustering framework; iterative clustering;
database scanning; numerical justification;
K-Means clustering algorithm; limited memory buffer;
scalable scheme; sampling based approach; data resolution;
allocated memory buffer; multiple clustering models;
publicly available data sets
TITLE:
Conceptual clustering on partitioned data: Tree-Weaver
AUTHOR: Jungsoon Yoo;
Sung Yoo; Dept. of Comput. Sci., Middle Tennessee
State Univ., Murfreesboro, TN, USA
PUBLICATION: Expert Systems with Applications, vol.15,
no.3-4, p. 367-74,
0957-4174
Elsevier Oct.-Nov. 1998
9 Refs.
LANGUAGE: English
ABSTRACT: Most knowledge discovery
in databases (KDD) research is
concentrated on supervised inductive learning. Conceptual
clustering is an unsupervised inductive learning technique
that organizes observations into an abstraction hierarchy
without using predefined class values. However, a typical
conceptual clustering algorithm is not suitable for a KDD
task because of space and time constraints. Furthermore,
typical incremental and non-incremental clustering algorithms
are not designed for a partitioned data set. We present a
conceptual clustering algorithm that works on partitioned
data. The proposed algorithm improves the clustering process
by using less computation time and less space while
maintaining the clustering quality.
THESAURUS: data mining; deductive
databases; learning by example;
unsupervised learning; very large databases
OTHER SUBJECTS: conceptual clustering; partitioned data;
Tree-Weaver;
knowledge discovery in databases;
supervised inductive learning;
unsupervised inductive learning; abstraction hierarchy;
predefined class values; space constraints;
time constraints; incremental clustering algorithms;
nonincremental clustering algorithms; computation time
TITLE:
Similarity clustering of dimensions for an enhanced visualization of multidimensional
data
AUTHOR: Ankerst, M.;
Berchtold, S.; Keim, D.A.; Munich Univ., Germany
CONFERENCE: Proceedings IEEE Symposium on Information
Visualization
Research Triangle, CA, USA 19-20 Oct. 1998
SPONSOR: IEEE Comput. Soc Tech. Committee on Comput. Graphics
PUBLICATION: Proceedings IEEE Symposium on Information
Visualization (Cat.
No.98TB100258), p. 52-60, 153
EDITOR: Wills, G.
Dill, J.
Los Almaitos, CA, USA IEEE Comput. Soc 1998
xiii+162 p.
ISBN: 0818690933
30 Refs.
LANGUAGE: English
ABSTRACT: The order and arrangement
of dimensions (variates) is crucial
for the effectiveness of a large number of visualization
techniques such as parallel coordinates, scatterplots,
recursive pattern, and many others. We describe a systematic
approach to arrange the dimensions according to their
similarity. The basic idea is to rearrange the data
dimensions such that dimensions showing a similar behavior
are positioned next to each other. For the similarity
clustering of dimensions, we need to define similarity
measures which determine the partial or global similarity of
dimensions. We then consider the problem of finding an
optimal one- or two-dimensional arrangement of the dimensions
based on their similarity. Theoretical considerations show
that both, the one- and the two-dimensional arrangement
problem are surprisingly hard problems, i.e. they are NP
complete. Our solution of the problem is therefore based on
heuristic algorithms. An empirical evaluation using a number
of different visualization techniques shows the high impact
of our similarity clustering of dimensions on the
visualization results.
THESAURUS: computational complexity;
data mining; data visualisation
OTHER SUBJECTS: similarity clustering; enhanced visualization;
multidimensional data; visualization techniques;
parallel coordinates; scatterplots; recursive pattern;
systematic approach; data dimensions; similar behavior;
similarity measures; global similarity;
two dimensional arrangement; hard problems; NP complete;
heuristic algorithms
TITLE:
CCAIIA: Clustering categorical attributes into interesting association
rules
AUTHOR: Gray, B.; Orlowska,
M.E.; Sch. of Inf. Technol., Queensland
Univ., Brisbane, Qld., Australia
CONFERENCE: Research and Development in Knowledge
Discovery and Data
Mining. Second Pacific-Asia Conference, PAKDD-98 Melbourne,
Vic., Australia 15-17 April 1998
PUBLICATION: Research and Development in Knowledge
Discovery and Data
Mining. Second Pacific-Asia Conference, PAKDD-98.
Proceedings, p. 132-43
EDITOR: Wu, X.
Kotagiri, R.
Korb, K.B.
Berlin, Germany Springer-Verlag 1998
xvi+424 p.
ISBN: 3540643834
15 Refs.
LANGUAGE: English
ABSTRACT: We investigate the problem
of mining interesting association
rules over a pair of categorical attributes at any level of
data granularity. We do this by integrating the rule
discovery process with a form of clustering. This allows
associations between groups of items to be formed where the
groping of items is based on maximising the "interestingness"
of the associations discovered. Previous work on mining
generalised associations assumes either a distance metric on
the attribute values or a taxonomy over the items mined.
These methods use the metric/taxonomy to limit the spare of
possible associations that can be found. We develop a measure
of the interestingness of association rules based on support
and the dependency between the item sets and use this measure
to guide the search. We apply the method to a data set and
observe the extraction of "interesting" associations. This
method could allow interesting and unexpected associations to
be discovered as the search space is not being limited by
user defined hierarchies.
THESAURUS: database management systems;
knowledge acquisition;
knowledge based systems
OTHER SUBJECTS: CCAIIA; data mining; categorical attributes
clustering;
association rules; rule discovery process;
distance metric; user defined hierarchies
TITLE:
Automatic subspace clustering of high dimensional data for data mining
applications
AUTHOR: Agrawal, R.;
Gehrke, J.; Gunopulos, D.; Raghavan, P.; IBM
Almaden Res. Center, San Jose, CA, USA
CONFERENCE: 1998 ACM SIGMOD International Conference
on Management of Data
Seattle, WA, USA 1-4 June 1998
PUBLICATION: SIGMOD Rec. (USA), SIGMOD Record, vol.27,
no.2, p. 94-105,
0163-5808
ACM June 1998
45 Refs.
LANGUAGE: English
ABSTRACT: Data mining applications
place special requirements on
clustering algorithms including: the ability to find clusters
embedded in subspaces of high dimensional data, scalability,
end user comprehensibility of the results, non presumption of
any canonical data distribution, and insensitivity to the
order of input records. We present CLIQUE, a clustering
algorithm that satisfies each of these requirements. CLIQUE
has the following properties: it identifies dense clusters in
subspaces of maximum dimensionality; it generates cluster
descriptions in the form of DNF expressions that are
minimized for ease of comprehension; it produces identical
results irrespective of the order in which input records are
presented and does not presume any specific mathematical form
for data distribution. Through experiments, we show that
CLIQUE efficiently finds accurate clusters in large high
dimensional datasets.
THESAURUS: data handling; deductive
databases; knowledge acquisition;
pattern recognition
OTHER SUBJECTS: automatic subspace clustering; high dimensional
data;
data mining applications; clustering algorithms;
scalability; end user comprehensibility; non presumption;
canonical data distribution; CLIQUE; clustering algorithm;
dense clusters; maximum dimensionality;
cluster descriptions; DNF expressions; input records;
data distribution; accurate clusters;
large high dimensional datasets
TITLE:
Efficient point data clustering by database operations
AUTHOR: Yan-Nong Huang;
Tarau, P.; Dept. of Comput. Sci., Moncton
Univ., NB, Canada
CONFERENCE: Proceedings of 8th International
Hong Kong Computer Society
Database Workshop. Data Mining, Data Warehousing and
Client/Server Databases Hong Kong 29-31 July 1997
SPONSOR: Oracle Syst. Hong Kong; NCR (Hong Kong); Sybase Hong
Kong; Hewlett-Packard Hong Kong; et al
PUBLICATION: Data Mining, Data Warehousing and Client/Server
Databases.
Proceedings of the 8th International Database Workshop, p.
236-50
EDITOR: Fong, J.
Singapore Springer-Verlag Singapore 1997
xi+332 p.
ISBN: 9813083549
11 Refs.
LANGUAGE: English
ABSTRACT: This paper gives a definition
for point data clustering and
shows that the clustering can be realized with several
database operators (spatial joins, aggregate and transitive
closure). With the proposed method, one doesn't have to
define a pool as the number of smaller groups that the
initial set of objects is partitioned into. An efficient
implementation of spatial joins is studied. Some experimental
results on using the proposed method for point data
clustering are reported.
THESAURUS: data analysis; database
theory; deductive databases;
pattern recognition; very large databases
OTHER SUBJECTS: point data clustering; database operations;
database operators; spatial joins; aggregate;
transitive closure; object set partitioning; data mining
TITLE:
A comparison of group-based and object-based data clustering techniques
AUTHOR: Bouguettaya,
A.; Le Viet, Q.; Colea, M.; Sch. of Inf. Syst.,
Queensland Univ. of Technol., Brisbane, Qld., Australia
CONFERENCE: Proceedings of 8th International
Hong Kong Computer Society
Database Workshop. Data Mining, Data Warehousing and
Client/Server Databases Hong Kong 29-31 July 1997
SPONSOR: Oracle Syst. Hong Kong; NCR (Hong Kong); Sybase Hong
Kong; Hewlett-Packard Hong Kong; et al
PUBLICATION: Data Mining, Data Warehousing and Client/Server
Databases.
Proceedings of the 8th International Database Workshop, p.
119-36
EDITOR: Fong, J.
Singapore Springer-Verlag Singapore 1997
xi+332 p.
ISBN: 9813083549
24 Refs.
LANGUAGE: English
ABSTRACT: We investigate the behavior
and stability of two data
clustering methods. The Slink method is based on a single
object comparison test, whereas the Ward method is based on
comparing groups of objects. Different sensitivity parameters
were used to test the behavior and stability of the two
techniques when clustering objects were drawn from the 3D
spare. The objects were generated according to three commonly
used statistical distributions. This study tends to confirm
the findings of earlier (less exhaustive) studies, namely,
the high similarity in behavior and stability among
clustering methods.
THESAURUS: data handling; deductive
databases; knowledge acquisition;
object-oriented databases; pattern recognition;
statistical analysis
OTHER SUBJECTS: object based data clustering techniques;
data clustering methods; Slink method;
single object comparison test; Ward method;
sensitivity parameters; clustering objects; 3D spare;
statistical distributions; clustering methods; group based
data clustering techniques
TITLE:
A comparative study of clustering methods
AUTHOR: Zait, M.; Messatfa,
H.; Oracle Corp., Redwood Shores, CA, USA
PUBLICATION: Future Generation Computer Systems,
vol.13, no.2-3, p. 149-59,
0167-739X
Elsevier Nov. 1997
12 Refs.
LANGUAGE: English
ABSTRACT: In this paper we propose
a methodology for comparing clustering
methods based on the quality of the result and the
performance of the execution. We applied it to several known
clustering methods: FastClust, Autoclass, Relational data
analysis, and Kohonen nets. The quality of a clustering
result depends on both the similarity measure used by the
method and sets with specific (or desired) patterns using a
combination of parameters, such as the number and the type of
the attributes, the number of records, etc. We define a
metric to measure the quality of a clustering method, i.e.,
its ability to discover some or all of the "hidden" patterns.
The performance study is based on the resource consumption,
i.e., CPU time and memory space.
THESAURUS: computational complexity;
pattern recognition;
unsupervised learning; very large databases
OTHER SUBJECTS: clustering methods; comparative study;
FastClust;
Autoclass; Relational data analysis; Kohonen nets;
performance study; data mining; data generation;
unsupervised classification
TITLE:
Clustering via concave minimization
AUTHOR: Bradley, P.S.;
Mangasarian, O.L.; Street, W.N.; Dept. of
Comput. Sci., Wisconsin Univ., Madison, WI, USA
CONFERENCE: Advances in Neural Information
Processing Systems 9.
Proceedings of the 1996 Conference Denver, CO, USA 2-5 Dec.
1996
PUBLICATION: Advances in Neural Information Processing
Systems 9.
Proceedings of the 1996 Conference, p. 368-74
EDITOR: Mozer, M.C.
Jordan, M.I.
Petsche, T.
London, UK MIT Press 1997
xvi+1098 p.
ISBN: 0262100657
18 Refs.
LANGUAGE: English
ABSTRACT: The problem of assigning
m points in the n-dimensional real
space R/sup n/ to k clusters is formulated as that of
determining k centers in R/sup n/ such that the sum of
distances of each point to the nearest center is minimized.
If a polyhedral distance is used, the problem can be
formulated as that of minimizing a piecewise-linear concave
function on a polyhedral set which is shown to be equivalent
to a bilinear program: minimizing a bilinear function on a
polyhedral set. A fast finite k-median algorithm consisting
of solving few linear programs in closed form leads to a
stationary point of the bilinear program. Computational
testing on a number of real-world databases was carried out.
On the Wisconsin Diagnostic Breast Cancer database, k-median
training set correctness was comparable to that of the k-mean
algorithm, however its testing set correctness was better.
Additionally, on the Wisconsin Prognostic Breast Cancer
database, distinct and clinically important survival curves
were extracted by the k-median algorithm, whereas the K-mean
algorithm failed to obtain such distinct survival curves for
the same database.
THESAURUS: database management systems;
learning (artificial intelligence); linear programming;
medical diagnostic computing; minimisation; neural nets;
nonlinear programming; pattern classification
OTHER SUBJECTS: concave minimization; clustering; polyhedral
distance;
bilinear programming; k-median algorithm; Wisconsin
Diagnostic Breast Cancer database;
Wisconsin Prognostic Breast Cancer database;
k-mean algorithm; training set; data mining
TITLE:
En route to data mining in legal text corpora: clustering, neural computation,
and international treaties
AUTHOR: Merkl, D.;
Schweighofer, E.; Dept. of Comput. Sci., R.
Melbourne Inst. of Technol., Vic., Australia
CONFERENCE: Database and Expert Systems Applications.
8th International
Conference, DEXA '97. Proceedings Toulouse, France 1-2 Sept.
1997
PUBLICATION: Proceedings. Eighth International Workshop
on Database and
Expert Systems Applications (Cat. No. 97TB100181), p. 465-70
EDITOR: Wagner, R.R.
Los Alamitos, CA, USA IEEE Comput. Soc 1997
xvii+770 p.
ISBN: 0818681470
35 Refs.
LANGUAGE: English
ABSTRACT: The huge amount of data
in legal information systems requires a
new generation of techniques and tools to assist lawyers in
analyzing data and finding critical nuggets of useful
knowledge. A promising approach for data mining in legal text
corpora is classification. What we are looking for are
powerful methods for the exploration of such libraries
whereby the detection of similarities between documents is
the overall goal. These methods may be used to gain insight
in the inherent structure of the various items contained in a
text archive. In this paper, we present the results from a
case study in legal document classification based on an
experimental document archive comprising important treaties
in public international law. The essentials of our approach
are the usage of a vector space document representation and
the utilization of an unsupervised artificial neural network
for document classification.
THESAURUS: classification; data
analysis; document handling;
knowledge acquisition; law administration; neural nets;
pattern classification; very large databases
OTHER SUBJECTS: data mining; legal text corpora; data clustering;
neural computation; international treaties;
legal information systems; lawyers; data analysis; legal
document classification; document similarity detection;
text archive; case study; document archive;
public international law; vector space;
document representation;
unsupervised artificial neural network
TITLE:
A connectionist approach to structural similarity determination as a basis
of clustering, classification and feature detection
AUTHOR: Schadler, I.;
Wysotzki, I.; Tech. Univ. Berlin, Germany
CONFERENCE: Principles of Data Mining and Knowledge
Discovery. First
European Symposium, PKDD '97. Proceedings Trondheim, Norway
24-27 June 1997
SPONSOR: Dept. Comput. Inf. Sci.; Norwegian Res. Council;
Norwegian Artificial Intelligence Soc
PUBLICATION: Principles of Data Mining and Knowledge
Discovery. First
European Symposium, PKDD '97. Proceedings, p. 254-64
EDITOR: Komorowski, J.
Zytkow, J.
Berlin, Germany Springer-Verlag 1997
ix+396 p.
ISBN: 3540632239
42 Refs.
LANGUAGE: English
ABSTRACT: Many algorithms in machine
learning, knowledge discovery,
pattern recognition and classification are based on the
estimation of the similarity or the distance between the
analysed objects. Objects with higher structural complexity
often cannot be described by feature vectors without losing
important structural information. These objects can
adequately be represented in the language of logic or by
labeled graphs. The similarity of such descriptions is
difficult to define and to compute. A connectionist approach
for the determination of the similarity of arbitrary labeled
graphs is introduced. Using an example from organic
chemistry, the application of the approach within one
distance based and one generalisation based classification
algorithm is demonstrated. The generalisation based algorithm
forms clusters or subclasses of similar examples of the same
class and extracts the parts of the objects which determine
the class of the object. The algorithms perform very
satisfactorily in comparison with recent logical and feature
vector approaches. Moreover, being able to handle structural
data directly, the algorithms need only a subset of the given
features of the objects for classification.
THESAURUS: chemistry computing;
feature extraction;
knowledge acquisition; learning (artificial intelligence);
neural nets; pattern classification
OTHER SUBJECTS: connectionist approach; structural similarity
determination;
clustering; pattern classification; feature detection;
machine learning; knowledge discovery;
structural complexity; feature vectors; labeled graphs;
arbitrary labeled graph similarity; organic chemistry;
generalisation based classification algorithm;
feature vector approaches; structural data
TITLE:
Inferring hierarchical clustering structures by deterministic annealing
AUTHOR: Hofmann, T.;
Buhmann, J.M.; Inst. fur Inf. III, Rheinische
Friedrich-Wilhelms-Univ., Bonn, Germany
CONFERENCE: Proceedings of 2nd International
Conference on Knowledge
Discovery and Data Mining (KDD-96) Portland, OR, USA 2-4 Aug.
1996
PUBLICATION: KDD-96 Proceedings. Second International
Conference on
Knowledge Discovery and Data Mining, p. 363-6
EDITOR: Simoudis, E.
Han, J.
Fayyad, U.
Menlo Park, CA, USA AAAI Press 1996
xiv+391 p.
15 Refs.
LANGUAGE: English
ABSTRACT: The unsupervised detection
of hierarchical structures is a
major topic in unsupervised learning and one of the key
questions in data analysis and representation. We propose a
novel algorithm for the problem of learning decision trees
for data clustering and related problems. In contrast to many
other methods based on successive tree growing and pruning,
we propose an objective function for tree evaluation and we
derive a non greedy technique for tree growing. Applying the
principles of maximum entropy and minimum cross entropy, a
deterministic annealing algorithm is derived in a meanfield
approximation. This technique allows us to canonically
superimpose tree structures and to fit parameters to averaged
or fuzzified' trees.
THESAURUS: deterministic algorithms;
inference mechanisms;
knowledge acquisition; trees (mathematics);
unsupervised learning
OTHER SUBJECTS: hierarchical clustering structure inference;
meanfield approximation; unsupervised detection;
unsupervised learning; data analysis; data representation;
decision trees; data clustering; successive tree growing;
tree pruning; objective function; tree evaluation;
non greedy technique; maximum entropy;
minimum cross entropy; deterministic annealing algorithm;
tree structures; fuzzified trees
TITLE:
Clustering using Monte Carlo cross-validation
AUTHOR: Smyth, P.;
Dept. of Inf. & Comput. Sci., California Univ.,
Irvine, CA, USA
CONFERENCE: Proceedings of 2nd International
Conference on Knowledge
Discovery and Data Mining (KDD-96) Portland, OR, USA 2-4 Aug.
1996
PUBLICATION: KDD-96 Proceedings. Second International
Conference on
Knowledge Discovery and Data Mining, p. 126-33
EDITOR: Simoudis, E.
Han, J.
Fayyad, U.
Menlo Park, CA, USA AAAI Press 1996
xiv+391 p.
19 Refs.
LANGUAGE: English
ABSTRACT: Finding the "right" number
of clusters, K, for a data set is a
difficult, and often ill posed, problem. In a probabilistic
clustering context, likelihood ratios, penalized likelihoods,
and Bayesian techniques are among the more popular
techniques. A novel cross validated likelihood criterion is
investigated for determining cluster structure. A practical
clustering algorithm based on Monte Carlo cross validation
(MCCV) is introduced. The algorithm permits the data analyst
to judge if there is strong evidence for a particular k, or
perhaps weaker evidence over a subrange of k values.
Experimental results with Gaussian mixtures on real and
simulated data suggest that MCCV provides genuine insight
into cluster structure. v-fold cross validation appears
inferior to the penalized likelihood method (BIC), a Bayesian
algorithm (AutoClass v2.0), and the new MCCV algorithm.
Overall, MCCV and AutoClass appear the most reliable of the
methods. MCCV provides the date miner with a useful data
driven clustering tool which complements the fully Bayesian
approach.
THESAURUS: Bayes methods; deductive
databases; formal verification;
knowledge acquisition; Monte Carlo methods; probability
OTHER SUBJECTS: Monte Carlo cross validation; data set;
probabilistic clustering context; likelihood ratios;
penalized likelihoods; Bayesian techniques; cross validated
likelihood criterion; cluster structure;
practical clustering algorithm; MCCV; data analyst;
Gaussian mixtures; simulated data;
v-fold cross validation; penalized likelihood method; BIC;
Bayesian algorithm; AutoClass; date miner;
data driven clustering tool; knowledge discovery
TITLE:
Conceptual clustering in structured databases: a practical approach
AUTHOR: Ketterlin,
A.; Gancarski, P.; Korczak, J.J.; LSIIT, Univ. Louis
Pasteur, Strasbourg, France
CONFERENCE: Proceedings of First International
Conference on Knowledge
Discovery and Data Mining (KDD-95) Montreal, Que., Canada
20-21 Aug. 1995
SPONSOR: AAAI; AT&T Global Inf. Solutions; GTE Lab.; et al
PUBLICATION: KDD-95 Proceedings. First International
Conference on Knowledge
Discovery and Data Mining, p. 180-5
EDITOR: Fayyad, U.M.
Uthurusamy, R.
Menlo Park, CA, USA AAAI 1995
xiv+345 p.
ISBN: 0929280822
12 Refs.
LANGUAGE: English
ABSTRACT: Many machine-learning
(either supervised or unsupervised)
techniques assume that data present themselves in an
attribute-value form, but this formalism is largely
insufficient to account for many applications. Therefore,
much of the ongoing research now focuses on first-order
learning systems, but such complex formalisms lead to high
computational complexities. On the other hand, most of the
currently installed databases have been designed according to
the entity-relationship formalism, and usually implemented on
a relational database management system. This formalism is
far less complex than first-order logic, but much more
expressive than attribute-value lists. In that context, the
database schema defines an abstraction space, and learning
must occur at each level of abstraction. This paper describes
a clustering system which is able to discover useful
groupings in structured databases. It is based on the COBWEB
algorithm, to which it adds the ability to cluster structured
objects.
THESAURUS: computational complexity;
deductive databases;
entity-relationship modelling; formal logic;
knowledge acquisition; learning (artificial intelligence);
pattern recognition; relational databases
OTHER SUBJECTS: conceptual clustering; structured databases;
machine learning techniques; supervised learning;
unsupervised learning; attribute-value lists;
first-order learning systems; computational complexities;
entity-relationship formalism;
relational database management system; first-order logic;
expressiveness; database schema; abstraction space;
COBWEB algorithm
TITLE:
A database interface for clustering in large spatial databases
AUTHOR: Ester, M.; Kriegel,
H.-P.; Xiaowei Xu; Inst. for Comput. Sci.,
Munchen Univ., Germany
CONFERENCE: Proceedings of First International Conference
on Knowledge
Discovery and Data Mining (KDD-95) Montreal, Que., Canada
20-21 Aug. 1995
SPONSOR: AAAI; AT&T Global Inf. Solutions; GTE Lab.; et al
PUBLICATION: KDD-95 Proceedings. First International Conference
on Knowledge
Discovery and Data Mining, p. 94-9
EDITOR: Fayyad, U.M.
Uthurusamy, R.
Menlo Park, CA, USA AAAI 1995
xiv+345 p.
ISBN: 0929280822
15 Refs.
LANGUAGE: English
ABSTRACT: Both the number and the size of
spatial databases are rapidly
growing because of the large amount of data obtained from
satellite images, X-ray crystallography or other scientific
equipment. Therefore, automated knowledge discovery becomes
more and more important in spatial databases. So far, most of
the methods for knowledge discovery in databases (KDD) have
been based on relational database systems. We address the
task of class identification in spatial databases using
clustering techniques. We present an interface to the
database management system (DBMS), which is crucial for the
efficiency of KDD on large databases. This interface is based
on a spatial access method, the R* tree. It clusters the
objects according to their spatial neighborhood and supports
efficient processing of spatial queries. Furthermore, we
propose a method for spatial data sampling as part of the
focusing component, significantly reducing the number of
objects to be clustered. Thus, we achieve a considerable
speed up for clustering in large databases. We have applied
the proposed techniques to real data from a large protein
database used for predicting protein-protein docking. A
performance evaluation on this database indicates that
clustering on large spatial databases can be performed both
efficiently and effectively using the approach.
THESAURUS: deductive databases; knowledge
acquisition;
pattern recognition; spatial data structures;
tree data structures; very large databases;
visual databases
OTHER SUBJECTS: database interface; large spatial database clustering;
automated knowledge discovery; KDD; class identification;
spatial databases; clustering techniques;
database management system interface;
spatial access method; R* tree; spatial neighborhood;
spatial queries; spatial data sampling; large databases;
protein database; protein-protein docking;
performance evaluation
A general probabilistic framework for
clustering individuals and objects
Igor V. Cadez, Scott Gaffney and Padhraic Smyth
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery
and data mining August 20 - 23, 2000, Boston, MA USA
Pages 140-149
Efficient clustering of high-dimensional
data sets with application to reference matching
Andrew McCallum, Kamal Nigam and Lyle H. Ungar
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery
and data mining August 20 - 23, 2000, Boston, MA USA
Pages 169-178
Visualization of navigation patterns
on a Web site using model-based clustering
Igor Cadez, David Heckerman, Christopher Meek, Padhraic Smyth and Steven White
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery
and data mining August 20 - 23, 2000, Boston, MA USA
Pages 280-284
Agglomerative clustering of a search
engine query log
Doug Beeferman and Adam Berger
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery
and data mining August 20 - 23, 2000, Boston, MA USA
Pages 407-416
CHAMELEON: A hierarchical clustering algorithm using
dynamic modeling.
G. Karypis, E.-H. Han, and V. Kumar.
COMPUTER, 32:68-75, 1999
Fast and effective text mining using linear-time
document clustering
Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery
and data mining August 15 - 18, 1999, San Diego, CA USA
Bjornar Larsen and Chinatsu Aone
Pages 16-22
Trajectory clustering with mixtures of
regression models
Scott Gaffney and Padhraic Smyth
Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery
and data mining August 15 - 18, 1999, San Diego, CA USA
Pages 63-72
CACTUS¡Xclustering categorical data using summaries
Venkatesh Ganti, Johannes Gehrke and Raghu Ramakrishnan
Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery
and data mining August 15 - 18, 1999, San Diego, CA USA
Pages 73-83
Entropy-based subspace clustering for mining numerical
data
Chun-Hung Cheng, Ada Waichee Fu and Yi Zhang
Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery
and data mining August 15 - 18, 1999, San Diego, CA USA
Pages 84-93
Evaluating a class of distance-mapping algorithms
for data mining and clustering
Jason Tsong-Li Wang, Xiong Wang, King-Ip Lin, Dennis Shasha, Bruce A. Shapiro
and Kaizhong Zhang
Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery
and data mining August 15 - 18, 1999, San Diego, CA USA
Pages 307-311
Identifying distinctive subsequences in multivariate
time series by clustering
Tim Oates
Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery
and data mining August 15 - 18, 1999, San Diego, CA USA
Pages 322-326
Clustering techniques for large data sets--from
the past to the future
Daniel A. Keim and Alexander Hinneburg
Tutorial notes for ACM SIGKDD 1999 international conference on Knowledge discovery
and data mining August 15 - 18, 1999, San Diego, CA USA
Pages 141-181
Data bubbles: quality preserving performance boosting
for hierarchical clustering
Markus M. Breunig, Hans-Peter Kriegel, Peer Kroger and Jorg Sander
Proceedings of the 2001 ACM SIGMOD international conference on Management of
Data on Management of data May 21 - 24, 2001, Santa Barbara, CA USA
Pages 79-90
Density biased sampling: an improved
method for data mining and clustering
Christopher R. Palmer and Christos Faloutsos
Proceedings of the 2000 ACM SIGMOD on Management of data May 15 - 18, 2000,
Dallas, TX USA
Pages 82-92
SQLEM: fast clustering in SQL using the EM algorithm
Carlos Ordonez and Paul Cereghini
Proceedings of the 2000 ACM SIGMOD on Management of data May 15 - 18, 2000,
Dallas, TX USA
Pages 559-570
Locality preserving dictionaries: theory
application to clustering in databases
Vijayshankar Raman
Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles
of database systems May 31 - June 3, 1999, Philadelphia, PA USA
Pages 337-345
Snakes and sandwiches: optimal clustering strategies
for a data warehouse Srivastava
H. V. Jagadish, Laks V. S. Lakshmanan and Divesh
Proceedings of the 1999 ACM SIGMOD international conference on Management of
data May 31 - June 3, 1999, Philadelphia, PA USA
Pages 37-48
OPTICS: ordering points to identify the clustering structure
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jorg Sander
Proceedings of the 1999 ACM SIGMOD international conference on Management of
data May 31 - June 3, 1999, Philadelphia, PA USA
Pages 49-60
Fast algorithms for projected clustering
Charu C. Aggarwal, Joel L. Wolf, Philip S. Yu, Cecilia Procopiuc and Jong Soo
Park
Proceedings of the 1999 ACM SIGMOD international conference on Management of
data May 31 - June 3, 1999, Philadelphia, PA USA
Pages 61-72
Clustering methods for large databases: from
the past to the future
Alexander Hinneburg and Daniel A. Keim
Proceedings of the 1999 ACM SIGMOD international conference on Management of
data May 31 - June 3, 1999, Philadelphia, PA USA
Page 509
Clustering through decision tree construction
Bing Liu, Yiyuan Xia and Philip S. Yu
Proceedings of the ninth international conference on Information knowledge
management CIKM 2000 November 6 - 11, 2000, McLean, VA USA
Pages 20-29
A semi-supervised document clustering technique
for information organization
Han-joon Kim and Sang-goo Lee
Proceedings of the ninth international conference on Information knowledge
management CIKM 2000 November 6 - 11, 2000, McLean, VA USA
Pages 30-37
High performance clustering based on the similarity
join
Christian Bohm, Bernhard Braunmuller, Markus Breunig and Hans-Peter Kriegel
Proceedings of the ninth international conference on Information knowledge
management CIKM 2000 November 6 - 11, 2000, McLean, VA USA
Pages 298-305
An improved network clustering method for I/O-efficient
query processing
Sung-Ho Woo and Sung-Bong Yang
Proceedings of the eighth ACM symposium on Advances in geographic information
systems November 6 - 11, 2000, McLean, VA USA
Pages 62-68
Clustering declustered data for efficient retrieval
Hakan Ferhatosmanoglu, Divyakant Agrawal and Amr El Abbadi
Proceedings of the eighth international conference on Information knowledge
management November 2 - 6, 1999, Kansas City, MO USA
Pages 343-350
Indexing techniques for wireless data broadcast under
data clustering and scheduling
Quinglong Hu, Wang-Chien Lee and Dik Lun Lee
Proceedings of the eighth international conference on Information knowledge
management November 2 - 6, 1999, Kansas City, MO USA
Pages 351-358
Clustering transactions using large items
Ke Wang, Chu Xu and Bing Liu
Proceedings of the eighth international conference on Information knowledge
management November 2 - 6, 1999, Kansas City, MO USA
Pages 483-490
A multiple-resolution method for edge-centric data
clustering
Scott Epter and Mukkai Krishnamoorthy
Proceedings of the eighth international conference on Information knowledge
management November 2 - 6, 1999, Kansas City, MO USA
Pages 491-498
Relevance feedback with a small number of
relevance judgements: incremental relevance feedback vs. document clustering
Makoto Iwayama
Proceedings of the 23rd annual international ACM SIGIR conference on Research
and development in information retrieval July 24 - 28, 2000, Athens Greece
Pages 10-16
Document clustering using word clusters via the
information bottleneck method
Noam Slonim and Naftali Tishby
Proceedings of the 23rd annual international ACM SIGIR conference on Research
and development in information retrieval July 24 - 28, 2000, Athens Greece
Pages 208-215
An investigation of linguistic features and
clustering algorithms for topical document clustering
Vasileios Hatzivassiloglou, Luis Gravano and Ankineedu Maganti
Proceedings of the 23rd annual international ACM SIGIR conference on Research
and development in information retrieval July 24 - 28, 2000, Athens Greece
Pages 224-231
AUTOCLUST: Automatic Clustering via Boundary Extraction
for Mining Massive Point-Data Sets.
V. Estivill-Castro and I. Lee.
In Proceedings of the 5th International Conference on Geocomputation, 2000.
AMOEBA: Hierarchical Clustering Based on Spatial Proximity
Using Delaunay Diagram.
Estivill-Castro, Vladimir and Ickjai Lee. (2000).
Proceedings of the 9th International Symposium on Spatial Data Handling (SDH2000).
Beijing, China.
Graph-Based Hierarchical Conceptual Clustering
I. Jonyer, L. B. Holder and D. J. Cook
Proceedings of the Thirteenth Annual Florida AI Research Symposium (2000).