Clustering Algorithm for Data Mining

For Data with Categorical Values
For Mixed Numeric and Nominal Data
For Large Database
Conceptual Clustering
For Data Warehouse
For Spatial Database
Visualization
Extention from k-means
Superparamagnetic Data Clustering
For Document Clustering
For Sequences of Composite Objects
Hierarchical Clustering
For Multimedia Database
Applications

 


TITLE:          Extensions to the k-means algorithm for clustering large data
                  sets with categorical values
AUTHOR:         Zhexue Huang; ACSys CRC, CSIRO, Canberra, ACT, Australia
PUBLICATION:    Data Mining and Knowledge Discovery, vol.2, no.3, p. 283-304,
                  1384-5810
                Kluwer Academic Publishers 1998
                34 Refs.
LANGUAGE:       English
ABSTRACT:       The k-means algorithm is well known for its efficiency in
                  clustering large data sets. However, working only on numeric
                  values prohibits it from bring used to cluster real world
                  data containing categorical values. In this paper we present
                  two algorithms which extend the k-means algorithm to
                  categorical domains and domains with mixed numeric and
                  categorical values. The k-modes algorithm uses a simple
                  matching dissimilarity measure to deal with categorical
                  objects, replaces the means of clusters with modes, and uses
                  a frequency-based method to update modes in the clustering
                  process to minimise the clustering cost function. With these
                  extensions the k-modes algorithm enables the clustering of
                  categorical data in a fashion similar to k-means. The
                  k-prototypes algorithm, through the definition of a combined
                  dissimilarity measure, further integrates the k-means and
                  k-modes algorithms to allow for clustering objects described
                  by mixed numeric and categorical attributes. We use the well
                  known soybean disease and credit approval data sets to
                  demonstrate the clustering performance of the two algorithms.
                  Our experiments on two real world data sets with half a
                  million objects each show that the two algorithms are
                  efficient when clustering large data sets, which is critical
                  to data mining applications.
THESAURUS:      data mining;  pattern recognition
OTHER SUBJECTS: k-means algorithm;  clustering large data sets;
                  categorical objects;  clustering process;  data mining


TITLE:          CURE: an efficient clustering algorithm for large databases
AUTHOR:         Guha, S.; Rastogi, R.; Shim, K.; Stanford Univ., CA, USA
CONFERENCE:     1998 ACM SIGMOD International Conference on Management of Data
                  Seattle, WA, USA 1-4 June 1998
PUBLICATION:    SIGMOD Rec. (USA), SIGMOD Record, vol.27, no.2, p. 73-84,
                  0163-5808
                ACM June 1998
                14 Refs.
LANGUAGE:       English
ABSTRACT:       Clustering in data mining is useful for discovering groups and
                  identifying interesting distributions in the underlying data.
                  Traditional clustering algorithms either favor clusters with
                  spherical shapes and similar sizes, or are very fragile in
                  the presence of outliers. We propose a novel clustering
                  algorithm called CURE that is more robust to outliers, and
                  identifies clusters having non spherical shapes and wide
                  variances in size. CURE achieves this by representing each
                  cluster by a certain fixed number of points that are
                  generated by selecting well scattered points from the cluster
                  and then shrinking them toward the center of the cluster by a
                  specified fraction. Having more than one representative point
                  per cluster allows CURE to adjust well to the geometry of non
                  spherical shapes and the shrinking helps to dampen the
                  effects of outliers. To handle large databases, CURE employs
                  a combination of random sampling and partitioning. A random
                  sample drawn from the data set is first partitioned and each
                  partition is partially clustered. The partial clusters are
                  then clustered in a second pass to yield the desired
                  clusters. Our experimental results confirm that the quality
                  of clusters produced by CURE is much better than those found
                  by existing algorithms. Furthermore, they demonstrate that
                  random sampling and partitioning enable CURE to not only
                  outperform existing algorithms but also to scale well for
                  large databases without sacrificing clustering quality.
THESAURUS:      deductive databases;  knowledge acquisition;
                  pattern recognition;  very large databases
OTHER SUBJECTS: CURE;  clustering algorithm;  large databases;  data mining;
                  outliers;  non spherical shapes;  well scattered points;
                  representative point;  random sampling;  partitioning;
                  data set;  partial clusters;  clustering quality


TITLE:          ITERATE: a conceptual clustering algorithm for data mining
AUTHOR:         Biswas, G.; Weinberg, J.B.; Fisher, D.H.; Dept. of Comput.
                  Sci., Vanderbilt Univ., Nashville, TN, USA
PUBLICATION:    IEEE Transactions on Systems, Man and Cybernetics, Part C
                  (Applications and Reviews), vol.28, no.2, p. 219-30,
                  1094-6977
                IEEE May 1998
                42 Refs.
LANGUAGE:       English
ABSTRACT:       The data exploration task can be divided into three
                  interrelated subtasks: 1) feature selection, 2) discovery,
                  and 3) interpretation. This paper describes an unsupervised
                  discovery method with biases geared toward partitioning
                  objects into clusters that improve interpretability. The
                  algorithm ITERATE employs: 1) a data ordering scheme and 2)
                  an iterative redistribution operator to produce maximally
                  cohesive and distinct clusters. Cohesion or intraclass
                  similarity is measured in terms of the match between
                  individual objects and their assigned cluster prototype.
                  Distinctness or interclass dissimilarity is measured by an
                  average of the variance of the distribution match between
                  clusters. The authors demonstrate that interpretability, from
                  a problem-solving viewpoint, is addressed by the intraclass
                  and interclass measures. Empirical results demonstrate the
                  properties of the discovery algorithm and its applications to
                  problem solving.
THESAURUS:      feature extraction;  iterative methods;
                  knowledge acquisition;  problem solving
OTHER SUBJECTS: ITERATE;  conceptual clustering algorithm;  data mining;
                  data exploration;  feature selection;  discovery;
                  interpretation;  unsupervised discovery method;  biases;
                  object partitioning;  data ordering scheme;
                  iterative redistribution operator;
                  maximally cohesive clusters;  maximally distinct clusters;
                  intraclass similarity;  cohesion;  interclass dissimilarity;
                  distinctness;  distribution match;  problem solving;
                  intraclass measures;  interclass measures



TITLE:          BIRCH: a new data clustering algorithm and its applications
AUTHOR:         Tian Zhang; Ramakrishnan, R.; Livny, M.; Dept. of Comput. Sci.,
                  Wisconsin Univ., Madison, WI, USA
PUBLICATION:    Data Mining and Knowledge Discovery, vol.1, no.2, p. 141-82,
                  1384-5810
                Kluwer Academic Publishers 1997
                28 Refs.
LANGUAGE:       English
ABSTRACT:       Data clustering is an important technique for exploratory data
                  analysis, and has been studied for several years. It has been
                  shown to be useful in many practical domains such as data
                  classification and image processing. Recently, there has been
                  a growing emphasis on exploratory analysis of very large
                  datasets to discover useful patterns and/or correlations
                  among attributes. This is called data mining, and data
                  clustering is regarded as a particular branch. However
                  existing data clustering methods do not adequately address
                  the problem of processing large datasets with a limited
                  amount of resources (e.g., memory and cpu cycles). So as the
                  dataset size increases, they do not scale up well in terms of
                  memory requirement, running time, and result quality. In this
                  paper, an efficient and scalable data clustering method is
                  proposed, based on a new in-memory data structure called
                  CF-tree, which serves as an in-memory summary of the data
                  distribution. We have implemented it in a system called BIRCH
                  (Balanced Iterative Reducing and Clustering using
                  Hierarchies), and studied its performance extensively in
                  terms of memory requirements, running time, clustering
                  quality, stability and scalability; we also compare it with
                  other available methods. Finally, BIRCH is applied to solve
                  two real-life problems: one is building an iterative and
                  interactive pixel classification tool, and the other is
                  generating the initial codebook for image compression.
THESAURUS:      data analysis;  data compression;  deductive databases;
                  image coding;  knowledge acquisition;
                  pattern classification;  software performance evaluation;
                  tree data structures;  very large databases
OTHER SUBJECTS: BIRCH;  data clustering algorithm;  exploratory data analysis;
                  data classification;  image processing;
                  very large datasets;  data mining;  memory requirement;
                  running time;  result quality;  tree data structure;
                  CF-tree;  data distribution;  Balanced Iterative Reducing and
                  Clustering using Hierarchies;  performance;
                  incremental algorithm;  stability;  scalability;
                  pixel classification;  codebook;  image compression


TITLE:          A distribution-based clustering algorithm for mining in large
                  spatial databases
AUTHOR:         Xiaowei Xu; Ester, M.; Kriegel, H.-P.; Sander, J.; Munich
                  Univ., Germany
CONFERENCE:     Proceedings 14th International Conference on Data Engineering
                  Orlando, FL, USA 23-27 Feb. 1998
                SPONSOR: IEEE Comput. Soc. Tech. Committee on Data Eng
PUBLICATION:    Proceedings. 14th International Conference on Data Engineering
                  (Cat. No.98CB36164), p. 324-31
                Los Alamitos, CA, USA IEEE Comput. Soc 1998
                xxi+605 p.
                ISBN: 0818682892
                14 Refs.
LANGUAGE:       English
ABSTRACT:       The problem of detecting clusters of points belonging to a
                  spatial point process arises in many applications. In this
                  paper, we introduce the new clustering algorithm DBCLASD
                  (Distribution-Based Clustering of LArge Spatial Databases) to
                  discover clusters of this type. The results of experiments
                  demonstrate that DBCLASD, contrary to partitioning algorithms
                  such as CLARANS (Clustering Large Applications based on
                  RANdomized Search), discovers clusters of arbitrary shape.
                  Furthermore, DBCLASD does not require any input parameters,
                  in contrast to the clustering algorithm DBSCAN (Density-Based
                  Spatial Clustering of Applications with Noise) requiring two
                  input parameters, which may be difficult to provide for large
                  databases. In terms of efficiency, DBCLASD is between CLARANS
                  and DBSCAN, close to DBSCAN. Thus, the efficiency of DBCLASD
                  on large spatial databases is very attractive when
                  considering its nonparametric nature and its good quality for
                  clusters of arbitrary shape.
THESAURUS:      data analysis;  deductive databases;  knowledge acquisition;
                  very large databases;  visual databases
OTHER SUBJECTS: distribution-based clustering algorithm;  data mining;
                  large spatial databases;  point cluster detection;
                  spatial point process;  DBCLASD;  partitioning algorithms;
                  CLARANS;  randomized search;  arbitrary shaped clusters;
                  DBSCAN;  density-based spatial clustering;  noise;
                  input parameters;  efficiency;  nonparametric nature;
                  quality


TITLE:          Unsupervised clustering with mixed numeric and nominal data-a
                  new similarity based agglomerative system
AUTHOR:         Cen Li; Biswas, G.; Dept. of Comput. Sci., Vanderbilt Univ.,
                  Nashville, TN, USA
CONFERENCE:     Proceedings of First Pacific-Asia Conference. Knowledge
                  Discovery and Data Mining: Techniques and Applications
                  Singapore 23-24 Feb. 1997
PUBLICATION:    Proceedings of the First Pacific-Asia Conference on Knowledge
                  Discovery and Data Mining. KDD: Techniques and Applications,
                  p. 35-48
                EDITOR: Lu, H.
                Motoda, H.
                Liu, H.
                Singapore World Scientific 1997
                xvi+367 p.
                ISBN: 9810229194
                8 Refs.
LANGUAGE:       English
ABSTRACT:       Presents a new similarity-based agglomerative clustering (SBAC)
                  algorithm that works well for data with mixed numeric and
                  nominal features. A similarity measure, proposed by D.W.
                  Goodall (1966) for biological taxonomies, that gives greater
                  weight to uncommon feature-value matches in similarity
                  computations and makes no assumptions about the underlying
                  distributions of the feature values, is adopted to define the
                  similarity measure between pairs of objects. An agglomerative
                  algorithm is employed to construct a concept tree, and a
                  simple distinctness heuristic is used to extract a partition
                  of the data. The performance of SBAC has been studied on real
                  and artificially generated data sets. The results demonstrate
                  the effectiveness of this algorithm in unsupervised discovery
                  tasks. Comparisons with other schemes illustrate the superior
                  performance of the algorithm.
THESAURUS:      data mining;  pattern clustering;  pattern matching;
                  semantic networks;  software performance evaluation;
                  statistical analysis;  unsupervised learning
OTHER SUBJECTS: unsupervised clustering;  numerical data;  nominal data;
                  similarity-based agglomerative clustering algorithm;
                  similarity measure;  biological taxonomy;
                  uncommon feature-value matches;
                  feature value distributions;  concept tree;
                  distinctness heuristic;  data partitioning;
                  algorithm performance;  unsupervised discovery tasks


TITLE:          Refining initial points for K-means clustering
AUTHOR:         Bradley, P.S.; Fayyad, U.M.; Dept. of Comput. Sci., Wisconsin
                  Univ., Madison, WI, USA
CONFERENCE:     Proceedings of Machine Learning (ICML-98) Madison, WI, USA
                  24-27 July 1998
PUBLICATION:    Machine Learning. Proceedings of the Fifteenth International
                  Conference (ICML'98), p. 91-9
                EDITOR: Shavlik, J.
                San Francisco, CA, USA Morgan Kaufmann Publishers 1998
                x+580 p.
                23 Refs.
LANGUAGE:       English
ABSTRACT:       Practical approaches to clustering use an iterative procedure
                  (e.g. K-means, expectation maximization) which converges to
                  one of numerous local minima. It is known that these
                  iterative techniques are especially sensitive to initial
                  starting conditions. We present a procedure for computing a
                  refined starting condition from a given initial one that is
                  based on an efficient technique for estimating the modes of a
                  distribution. The refined initial starting condition allows
                  the iterative algorithm to converge to a "better" local
                  minimum. The procedure is applicable to a wide class of
                  clustering algorithms for both discrete and continuous data.
                  We demonstrate the application of this method to the K-means
                  clustering algorithm and show that refined initial starting
                  points indeed lead to improved solutions. The refinement
                  run-time is considerably lower than the time required to
                  cluster the full database. The method is scalable and can be
                  coupled with a scalable clustering algorithm to address the
                  large-scale clustering problems in data mining.
THESAURUS:      convergence of numerical methods;  data mining;
                  database theory;  initial value problems;
                  iterative methods;  minimisation;  pattern clustering
OTHER SUBJECTS: initial point refinement;  K-means clustering algorithm;
                  iterative algorithm;  convergence;  local minima;  refined
                  initial starting condition;  discrete data;  distribution
                  mode estimation technique;  continuous data;
                  refinement run-time;  database clustering;
                  scalable clustering algorithm;  data mining;
                  large-scale clustering problems


TITLE:          WaveCluster: a multi-resolution clustering approach for very
                  large spatial databases
AUTHOR:         Sheikholeslami, G.; Chatterjee, S.; Zhang, A.; Dept. of Comput.
                  Sci., State Univ. of New York, Buffalo, NY, USA
CONFERENCE:     Proceedings of 24th Annual International Conference on Very
                  Large Data Bases (VLDB'98) New York, NY, USA 24-27 Aug. 1998
                SPONSOR: Oracle; AT&T Lab.; IBM; Informix; Microsoft
PUBLICATION:    Proceedings of the Twenty-Fourth International Conference on
                  Very-Large Databases, p. 428-39
                EDITOR: Gupta,A.
                Shmueli,O.
                Widom,J.
                San Francisco, CA, USA Morgan Kaufmann Publishers Inc 1998
                xvii+708 p.
                ISBN: 1558605665
                0 Refs.
LANGUAGE:       English
ABSTRACT:       Many applications require the management of spatial data.
                  Clustering large spatial databases is an important problem
                  which tries to find the densely populated regions in the
                  feature space to be used in data mining, knowledge discovery,
                  or efficient information retrieval. A good clustering
                  approach should be efficient and detect clusters of arbitrary
                  shape. It must be insensitive to the outliers (noise) and the
                  order of input data. We propose WaveCluster, a novel
                  clustering approach based on wavelet transforms, which
                  satisfies all the above requirements. Using the
                  multiresolution property of wavelet transforms, we can
                  effectively identify arbitrary shape clusters at different
                  degrees of accuracy. We also demonstrate that WaveCluster is
                  highly efficient in terms of time complexity. Experimental
                  results on very large data sets are presented which show the
                  efficiency and effectiveness of the proposed approach
                  compared to the other recent clustering methods.
THESAURUS:      computational complexity;  data mining;
                  spatial data structures;  very large databases;
                  visual databases;  wavelet transforms
OTHER SUBJECTS: WaveCluster;  multi-resolution clustering approach;
                  very large spatial databases;  data mining;
                  knowledge discovery;  information retrieval;  outliers;
                  input data;  wavelet transforms;  time complexity;
                  experimental results;  very large data sets


TITLE:          Incremental clustering for mining in a data warehousing
                  environment
AUTHOR:         Ester, M.; Kriegel, H.-P.; Sander, J.; Wimmer, M.; Xiaowei Xu;
                  Inst. for Comput. Sci., Munchen Univ., Germany
CONFERENCE:     Proceedings of 24th Annual International Conference on Very
                  Large Data Bases (VLDB'98) New York, NY, USA 24-27 Aug. 1998
                SPONSOR: Oracle; AT&T Lab.; IBM; Informix; Microsoft
PUBLICATION:    Proceedings of the Twenty-Fourth International Conference on
                  Very-Large Databases, p. 323-33
                EDITOR: Gupta,A.
                Shmueli,O.
                Widom,J.
                San Francisco, CA, USA Morgan Kaufmann Publishers Inc 1998
                xvii+708 p.
                ISBN: 1558605665
                22 Refs.
LANGUAGE:       English
ABSTRACT:       Data warehouses provide a great deal of opportunities for
                  performing data mining tasks such as classification and
                  clustering. Typically, updates are collected and applied to
                  the data warehouse periodically in a batch mode, e.g., during
                  the night. Then, all patterns derived from the warehouse by
                  some data mining algorithm have to be updated as well. Due to
                  the very large size of the databases, it is highly desirable
                  to perform these updates incrementally. We present the first
                  incremental clustering algorithm. Our algorithm is based on
                  the clustering algorithm DBSCAN which is applicable to any
                  database containing data from a metric space, e.g., to a
                  spatial database or to a WWW-log database. Due to the
                  density-based nature of DBSCAN, the insertion or deletion of
                  an object affects the current clustering only in the
                  neighborhood of this object. Thus, efficient algorithms can
                  be given for incremental insertions and deletions to an
                  existing clustering. Based on the formal definition of
                  clusters, it can be proven that the incremental algorithm
                  yields the same result as DBSCAN. A performance evaluation of
                  IncrementalDBSCAN on a spatial database as well as on a
                  WWW-log database is presented, demonstrating the efficiency
                  of the proposed algorithm. IncrementalDBSCAN yields
                  significant speed-up factors over DBSCAN even for large
                  numbers of daily updates in a data warehouse.
THESAURUS:      data mining;  data structures;  data warehouses;
                  software performance evaluation;  visual databases
OTHER SUBJECTS: incremental clustering;  data warehouses;  data mining;
                  classification;  data clustering;  updates;  batch mode;
                  very large database;  incremental clustering algorithm;
                  DBSCAN;  performance evaluation;  IncrementalDBSCAN;
                  spatial database;  World Wide Web log database


TITLE:          Clustering categorical data: an approach based on dynamical
                  systems
AUTHOR:         Gibson, D.; Kleinberg, J.; Raghavan, P.; Dept. of Comput. Sci.,
                  California Univ., Berkeley, CA, USA
CONFERENCE:     Proceedings of 24th Annual International Conference on Very
                  Large Data Bases (VLDB'98) New York, NY, USA 24-27 Aug. 1998
                SPONSOR: Oracle; AT&T Lab.; IBM; Informix; Microsoft
PUBLICATION:    Proceedings of the Twenty-Fourth International Conference on
                  Very-Large Databases, p. 311-22
                EDITOR: Gupta,A.
                Shmueli,O.
                Widom,J.
                San Francisco, CA, USA Morgan Kaufmann Publishers Inc 1998
                xvii+708 p.
                ISBN: 1558605665
                39 Refs.
LANGUAGE:       English
ABSTRACT:       We describe a novel approach for clustering collections of
                  sets, and its application to the analysis and mining of
                  categorical data. By categorical data we mean tables with
                  fields that cannot be naturally ordered by a metric-e.g., the
                  names of producers of automobiles, or the names of products
                  offered by a manufacturer. Our approach is based on an
                  iterative method for assigning and propagating weights on the
                  categorical values in a table; this facilitates a type of
                  similarity measure arising from the co-occurrence of values
                  in the dataset. Our techniques can be studied analytically in
                  terms of certain types of non-linear dynamical systems. We
                  discuss experiments on a variety of tables of synthetic and
                  real data; we find that our iterative methods converge
                  quickly to prominently correlated values of various
                  categorical fields.
THESAURUS:      data analysis;  data mining;  data structures;
                  very large databases
OTHER SUBJECTS: categorical data clustering;  data analysis;  data mining;
                  tables;  iterative method;  categorical value weights;
                  dataset;  nonlinear dynamical systems;  experiments


TITLE:          Clustering large datasets in arbitrary metric spaces
AUTHOR:         Ganti, V.; Ramakrishnan, R.; Gehrke, J.; Powell, A.; French, J.
;                 Dept. of Comput. Sci., Virginia Univ., Charlottesville, VA,
                  USA
CONFERENCE:     Proceedings of IEEE Computer Society 15th International
                  Conference on Data Engineering Sydney, NSW, Australia 23-26
                  March 1999
                SPONSOR: IEEE Comput. Soc. Tech. Committee on Data Eng
PUBLICATION:    Proceedings 15th International Conference on Data Engineering
                  (Cat. No.99CB36337), p. 502-11
                EDITOR: Kitsuregawa, M.
                Maciaszek, L.
                Papazoglou, M.
                Pu, C.
                Los Alamitos, CA, USA IEEE Comput. Soc 1999
                xxiii+648 p.
                ISBN: 0769500714
                26 Refs.
LANGUAGE:       English
ABSTRACT:       Clustering partitions a collection of objects into groups
                  called clusters, such that similar objects fall into the same
                  group. Similarity between objects is defined by a distance
                  function satisfying the triangle inequality; this distance
                  function along with the collection of objects describes a
                  distance space. In a distance space, the only operation
                  possible on data objects is the computation of distance
                  between them. All scalable algorithms in the literature
                  assume a special type of distance space, namely a
                  k-dimensional vector space, which allows vector operations on
                  objects. We present two scalable algorithms designed for
                  clustering very large datasets in distance spaces. Our first
                  algorithm BUBBLE is, to our knowledge, the first scalable
                  clustering algorithm for data in a distance space. Our second
                  algorithm BUBBLE-FM improves upon BUBBLE by reducing the
                  number of calls to the distance function, which may be
                  computationally very expensive. Both algorithms make only a
                  single scan over the database while producing high clustering
                  quality. In a detailed experimental evaluation, we study both
                  algorithms in terms of scalability and quality of clustering.
                  We also show results of applying the algorithms to a real
                  life dataset.
THESAURUS:      data handling;  data mining;  database theory;
                  trees (mathematics);  very large databases
OTHER SUBJECTS: large datasets;  arbitrary metric spaces;  similar objects;
                  distance function;  triangle inequality;  distance space;
                  data objects;  scalable algorithms;
                  k-dimensional vector space;  vector operations;
                  very large dataset clustering;  distance spaces;  BUBBLE;
                  scalable clustering algorithm;  BUBBLE-FM;
                  clustering quality;  scalability;  quality;
                  real life dataset


TITLE:          Superparamagnetic clustering of data. The definitive solution
                  of an ill-posed problem
AUTHOR:         Domany, E.; Dept. of Phys. of Complex Syst., Weizmann Inst. of
                  Sci., Rehovoth, Israel
CONFERENCE:     STATPHYS 20. 20th IUPAP International Conference on Statistical
                  Physics Paris, France 20-24 July 1998
PUBLICATION:    Physica A (Netherlands), Physica A, vol.263, no.1-4, p. 158-69,
                  0378-4371
                Elsevier 1 Feb. 1999
                18 Refs.
LANGUAGE:       English
ABSTRACT:       Clustering is an important technique in exploratory data
                  analysis, with applications in image processing, object
                  classification, target recognition, data mining etc. The aim
                  is to partition data according to natural classes present in
                  it, assigning data points that are "more similar" to the same
                  "cluster". We solved this ill-posed problem without making
                  any assumptions about the structure of the data, by using a
                  physical system as an analog computer. The physical system we
                  use is a disordered (granular) magnet. The method was tested
                  successfully on a variety of artificial and real-life
                  problems, such as classification of flowers, processing of
                  satellite images, speech recognition and identification of
                  textures and images. We are currently involved in several
                  collaborations, applying the method to image classification,
                  fMRI data analysis and classification of protein structures.
THESAURUS:      analogue computers;  botany;  data analysis;
                  granular materials;  image classification;  image texture;
                  molecular biophysics;  molecular configurations;
                  paramagnetic materials;  pattern clustering;  proteins;
                  speech recognition;  statistics;  superparamagnetism
OTHER SUBJECTS: superparamagnetic data clustering;  ill-posed problem;
                  exploratory data analysis;  data partitioning;
                  physical system;  analog computer;  disordered magnet;
                  granular magnet;  flower classification;
                  satellite image processing;  speech recognition;
                  texture identification;  image identification;
                  image classification;  fMRI data analysis;
                  protein structure classification;  statistics


TITLE:          Incremental conceptual clustering in the framework of Galois
                  lattice
AUTHOR:         Tu Bao Ho; Japan Adv. Inst. of Sci. & Technol., Ishikawa, Japan
CONFERENCE:     Proceedings of First Pacific-Asia Conference. Knowledge
                  Discovery and Data Mining: Techniques and Applications
                  Singapore 23-24 Feb. 1997
PUBLICATION:    Proceedings of the First Pacific-Asia Conference on Knowledge
                  Discovery and Data Mining. KDD: Techniques and Applications,
                  p. 49-61
                EDITOR: Lu, H.
                Motoda, H.
                Liu, H.
                Singapore World Scientific 1997
                xvi+367 p.
                ISBN: 9810229194
                15 Refs.
LANGUAGE:       English
ABSTRACT:       Conceptual clustering is a branch of concept learning that aims
                  at discovering conceptual knowledge from data. Like most
                  methods employing the classical view of concepts, the OSHAM
                  method allows a concept hierarchy to be induced
                  nonincrementally in the framework of a Galois lattice. This
                  paper presents an incremental version of OSHAM (called
                  INCOSHAM) for a discovery task in which training instances
                  are presented serially. Experimental results suggest an
                  alternative use of nonincremental and incremental algorithms
                  in order to improve the performance of the discovery process.
THESAURUS:      data mining;  Galois fields;
                  learning (artificial intelligence);  pattern clustering;
                  semantic networks
OTHER SUBJECTS: incremental conceptual clustering;  Galois lattice;
                  concept learning;  conceptual knowledge discovery;  OSHAM;
                  concept hierarchy induction;  INCOSHAM;
                  serially presented training instances;
                  nonincremental algorithms;  discovery process performance


TITLE:          Clustering large data sets with mixed numeric and categorical
                  values
AUTHOR:         Zhexue Huang; Math. & Inf. Sci., CSIRO, Canberra, ACT,
                  Australia
CONFERENCE:     Proceedings of First Pacific-Asia Conference. Knowledge
                  Discovery and Data Mining: Techniques and Applications
                  Singapore 23-24 Feb. 1997
PUBLICATION:    Proceedings of the First Pacific-Asia Conference on Knowledge
                  Discovery and Data Mining. KDD: Techniques and Applications,
                  p. 21-34
                EDITOR: Lu, H.
                Motoda, H.
                Liu, H.
                Singapore World Scientific 1997
                xvi+367 p.
                ISBN: 9810229194
                16 Refs.
LANGUAGE:       English
ABSTRACT:       Efficient partitioning of large data sets into homogenous
                  clusters is a fundamental problem in data mining. The
                  standard hierarchical clustering methods provide no solution
                  for this problem, due to their computational inefficiency.
                  The k-means based methods are promising for their efficiency
                  in processing large data sets. However, their use is often
                  limited to numeric data. In this paper, we present a
                  k-prototypes algorithm which is based on the k-means paradigm
                  but removes the numeric data limitation whilst preserving its
                  efficiency. In the algorithm, objects are clustered against k
                  prototypes. A method is developed to dynamically update the k
                  prototypes in order to maximise the intra-cluster similarity
                  of objects. When applied to numeric data, the algorithm is
                  identical to the k-means method. To assist in the
                  interpretation of clusters, we use decision tree induction
                  algorithms to create rules for clusters. These rules,
                  together with other statistics about clusters, can assist
                  data miners to understand and identify interesting clusters.
THESAURUS:      data mining;  pattern clustering;  statistical databases;
                  very large databases
OTHER SUBJECTS: large data set clustering;  numerical values;
                  categorical values;  data set partitioning;
                  homogenous clusters;  data mining;
                  hierarchical clustering methods;  computational efficiency;
                  k-means paradigm;  k-prototypes algorithm;
                  dynamic updating;  intra-cluster similarity;
                  decision tree induction algorithms;  cluster rules;
                  cluster statistics;  interesting cluster identification


TITLE:          Fast and intuitive clustering of Web documents
AUTHOR:         Zamir, O.; Etzioni, O.; Madani, O.; Karp, R.M.; Dept. of
                  Comput. Sci. & Eng., Washington Univ., Seattle, WA, USA
CONFERENCE:     Proceedings of the Third International Conference on Knowledge
                  Discovery and Data Mining -KDD 97 Newport Beach, CA, USA
                  14-17 Aug. 1997
PUBLICATION:    Proceedings of the Third International Conference on Knowledge
                  Discovery and Data Mining, p. 287-90
                EDITOR: Heckerman, D.
                Mannila, H.
                Pregibon, D.
                Uthurusamy, R.
                Menlo Park, CA, USA AAAI Press 1997
                xiv+311 p.
                10 Refs.
LANGUAGE:       English
ABSTRACT:       Conventional document retrieval systems (e.g., Alta Vista)
                  return long lists of ranked documents in response to user
                  queries. Recently, document clustering has been put forth as
                  an alternative method of organizing retrieval results (D.R.
                  Cutting et al., 1992). A person browsing the clusters can
                  discover patterns that could be overlooked in the traditional
                  presentation. The paper describes two novel clustering
                  methods that intersect the documents in a cluster to
                  determine the set of words (or phrases) shared by all the
                  documents in the cluster. We report on experiments that
                  evaluate these intersection based clustering methods on
                  collections of snippets returned from Web search engines.
                  First, we show that word intersection clustering produces
                  superior clusters and does so faster than standard
                  techniques. Second, we show that our O(nlog n) time phrase
                  intersection clustering method produces comparable clusters
                  and does so more than two orders of magnitude faster than all
                  methods tested.
THESAURUS:      computational complexity;  data mining;
                  information retrieval;  Internet;  pattern clustering;
                  search engines
OTHER SUBJECTS: intuitive clustering;  Web documents;
                  document retrieval systems;  Alta Vista;  ranked documents;
                  user queries;  document clustering;  retrieval results;
                  clustering methods;  intersection based clustering methods;
                  snippets;  Web search engines;
                  word intersection clustering;
                  phrase intersection clustering method;  pattern discovery


TITLE:          Clustering sequences of complex objects
AUTHOR:         Ketterlin, A.; CNRS, Strasbourg, France
CONFERENCE:     Proceedings of the Third International Conference on Knowledge
                  Discovery and Data Mining -KDD 97 Newport Beach, CA, USA
                  14-17 Aug. 1997
PUBLICATION:    Proceedings of the Third International Conference on Knowledge
                  Discovery and Data Mining, p. 215-18
                EDITOR: Heckerman, D.
                Mannila, H.
                Pregibon, D.
                Uthurusamy, R.
                Menlo Park, CA, USA AAAI Press 1997
                xiv+311 p.
                6 Refs.
LANGUAGE:       English
ABSTRACT:       This paper is about the unsupervised discovery of patterns in
                  sequences of composite objects. A composite object may be
                  described as a sequence of other, simpler data. In such
                  cases, not only the nature of the components is important,
                  but also the order in which these components appear. The work
                  studies the problem of generalizing sequences of complex
                  objects. A formal definition of generalized sequences is
                  given, and an algorithm is derived. Because of the excessive
                  computational complexity of this algorithm, a heuristic
                  version is described. This algorithm is then integrated in a
                  general-purpose clustering algorithm. The result is a
                  knowledge discovery system which is able to analyze any
                  structured database on the base of a unified, unsupervised
                  mechanism.
THESAURUS:      computational complexity;  data mining;
                  heuristic programming;  pattern recognition;
                  very large databases
OTHER SUBJECTS: complex object sequence clustering;
                  unsupervised pattern discovery;  composite objects;
                  generalized sequences;  computational complexity;
                  heuristic algorithm;  general-purpose clustering algorithm;
                  knowledge discovery;  data mining


TITLE: Interactive interpretation of hierarchical clustering
AUTHOR:         Boudaillier, E.; Hebrail, G.
PUBLICATION:    Intelligent Data Analysis, vol.2, no.3, 1088-467X
                Elsevier Aug. 1998
                10 Refs.
LANGUAGE:       English
ABSTRACT:       Automatic clustering methods are part of data mining methods.
                  They aim at building clusters of items so that similar items
                  fall into the same cluster while unsimilar items fall into
                  separate clusters. A particular class of clustering methods
                  are hierarchical ones where recursive clusters are formed to
                  grow a binary tree representing an approximation of
                  similarities between items. We propose a new interactive
                  interface to help the user to interpret the result of such a
                  clustering process, according to the item characteristics.
                  The prototype has been applied successfully to a special case
                  of items providing nice graphical representations (electric
                  load curves) but can also be used with other types of curves
                  or with more standard items.
THESAURUS:      data analysis;  data mining;  tree data structures;
                  user interfaces;  very large databases
OTHER SUBJECTS: interactive interpretation;  hierarchical clustering;
                  automatic clustering methods;  data mining;
                  recursive clusters;  binary tree;  interactive interface;
                  prototype;  graphical representations;  curves;
                  data analysis


TITLE:          Human performance on clustering Web pages: a preliminary study
AUTHOR:         Macskassy, S.A.; Banerjee, A.; Davison, B.D.; Hirsh, H.; Dept.
                  of Comput. Sci., Rutgers Univ., New Brunswick, NJ, USA
CONFERENCE:     Proceedings of the Fourth International Conference on Knowledge
                  Discovery and Data Mining New York, NY, USA 27-31 Aug. 1998
PUBLICATION:    Proceedings Fourth International Conference on Knowledge
                  Discovery and Data Mining, p. 264-8
                EDITOR: Agrawal, R.
                Stolorz, P.
                Menlo Park, CA, USA AAAI Press 1998
                xii+382 p.
                ISBN: 1577350707
                15 Refs.
LANGUAGE:       English
ABSTRACT:       With the increase in information on the World Wide Web it has
                  become difficult to quickly find desired information without
                  using multiple queries or using a topic-specific search
                  engine. One way to help in the search is by grouping HTML
                  pages together that appear in some way to be related. In
                  order to better understand this task, we performed an initial
                  study of human clustering of web pages, in the hope that it
                  would provide some insight into the difficulty of automating
                  this task. Our results show that subjects did not cluster
                  identically; in fact, on average, any two subjects had little
                  similarity in their web-page clusters. We also found that
                  subjects generally created rather small clusters, and those
                  with access only to URLs created fewer clusters than those
                  with access to the full text of each web page. Generally the
                  overlap of documents between clusters for any given subject
                  increased when given the full text, as did the percentage of
                  documents clustered. When analyzing individual subjects, we
                  found that each had different behavior across queries, both
                  in terms of overlap, size of clusters, and number of
                  clusters. These results provide a sobering note on any quest
                  for a single clearly correct clustering method for web pages.
THESAURUS:      hypermedia markup languages;  information resources;
                  information retrieval
OTHER SUBJECTS: World Wide Web;  multiple queries;
                  topic-specific search engine;  HTML pages;
                  human clustering


TITLE:          Initialization of iterative refinement clustering algorithms
AUTHOR:         Fayyad, U.; Reina, C.; Bradley, P.S.; Microsoft Res., Redmond,
                  WA, USA
CONFERENCE:     Proceedings of the Fourth International Conference on Knowledge
                  Discovery and Data Mining New York, NY, USA 27-31 Aug. 1998
PUBLICATION:    Proceedings Fourth International Conference on Knowledge
                  Discovery and Data Mining, p. 194-8
                EDITOR: Agrawal, R.
                Stolorz, P.
                Menlo Park, CA, USA AAAI Press 1998
                xii+382 p.
                ISBN: 1577350707
                19 Refs.
LANGUAGE:       English
ABSTRACT:       Iterative refinement clustering algorithms (e.g. K-Means, EM)
                  converge to one of numerous local minima. It is known that
                  they are especially sensitive to initial conditions. We
                  present a procedure for computing a refined starting
                  condition from a given initial one that is based on an
                  efficient technique for estimating the modes of a
                  distribution. The refined initial starting condition leads to
                  convergence to "better" local minima. The procedure is
                  applicable to a wide class of clustering algorithms for both
                  discrete and continuous data. We demonstrate the application
                  of this method to the expectation maximization (EM)
                  clustering algorithm and show that refined initial points
                  indeed lead to improved solutions. Refinement run time is
                  considerably lower than the time required to cluster the full
                  database. The method is scalable and can be coupled with a
                  scalable clustering algorithm to address large-scale
                  clustering in data mining.
THESAURUS:      convergence of numerical methods;  database theory;
                  pattern clustering;  very large databases
OTHER SUBJECTS: initialization;  iterative refinement clustering algorithms;
                  local minima;  convergence;  initial conditions;
                  refined starting condition;  continuous data;
                  discrete data;
                  expectation maximization clustering algorithm;
                  refinement run time;  scalable clustering algorithm;
                  large-scale clustering;  data mining

TITLE:          Robust incremental clustering with bad instance orderings: a
                  new strategy
AUTHOR:         Roure, J.; Talavera, L.; Dept. d'Inf. i Gestio, Escola Univ.
                  Politecnica de Mataro, Catalonia, Spain
CONFERENCE:     Progress in Artificial Intelligence - IBERAMIA 98 Lisbon,
                  Portugal 5-9 Oct. 1998
PUBLICATION:    Progress in Artificial Intelligence - IBERAMIA 98. 6th
                  Ibero-American Conference on AI. Proceedings, p. 136-47
                EDITOR: Coelho, H.
                Berlin, Germany Springer-Verlag 1998
                xiii+420 p.
                ISBN: 3540649921
                13 Refs.
LANGUAGE:       English
ABSTRACT:       It is widely reported in the literature that incremental
                  clustering systems suffer from instance ordering effects and
                  that under some orderings, extremely poor clusterings may be
                  obtained. We present a new general strategy aimed to mitigate
                  these effects, the Not-Yet strategy which has a general and
                  open formulation and it is not coupled to any particular
                  system. Unlike other proposals, this strategy maintains the
                  incremental nature of learning process. In addition, we
                  propose a classification of strategies to avoid ordering
                  effects which clarifies the benefits and disadvantages one
                  can expect from the proposal made in the paper as well from
                  the existing ones. A particular implementation of the Not-Yet
                  strategy is used to conduct several experiments. Results
                  suggest that the strategy improves the clustering quality. We
                  also show that, when combined with other local strategies,
                  the Not-Yet strategy allows the clustering system to get high
                  quality clusterings.
THESAURUS:      data mining;  learning (artificial intelligence);
                  learning systems
OTHER SUBJECTS: incremental clustering;  ordering effect;  not-yet strategy;
                  learning process;  machine learning;  data mining


TITLE:          An efficient approach to clustering in large multimedia
                  databases with noise
AUTHOR:         Hinneburg, A.; Keim, D.A.; Inst. of Comput. Sci., Halle Univ.,
                  Germany
CONFERENCE:     Proceedings of the Fourth International Conference on Knowledge
                  Discovery and Data Mining New York, NY, USA 27-31 Aug. 1998
PUBLICATION:    Proceedings Fourth International Conference on Knowledge
                  Discovery and Data Mining, p. 58-65
                EDITOR: Agrawal, R.
                Stolorz, P.
                Menlo Park, CA, USA AAAI Press 1998
                xii+382 p.
                ISBN: 1577350707
                19 Refs.
LANGUAGE:       English
ABSTRACT:       Several clustering algorithms can be applied to clustering in
                  large multimedia databases. The effectiveness and efficiency
                  of the existing algorithms, however, is somewhat limited,
                  since clustering in multimedia databases requires clustering
                  high dimensional feature vectors and since multimedia
                  databases often contain large amounts of noise. We therefore
                  introduce a new algorithm to clustering in large multimedia
                  databases called DENCLUE (DENsity-based CLUstEring). The
                  basic idea of our new approach is to model the overall point
                  density analytically as the sum of influence functions of the
                  data points. Clusters can then be identified by determining
                  density attractors and clusters of arbitrary shape can be
                  easily described by a simple equation based on the overall
                  density function. The advantages of our new approach are: (1)
                  it has a firm mathematical basis; (2) it has good clustering
                  properties in data sets with large amounts of noise; (3) it
                  allows a compact mathematical description of arbitrarily
                  shaped clusters in high dimensional data sets; and (4) it is
                  significantly faster than existing algorithms. To demonstrate
                  the effectiveness and efficiency of DENCLUE, we perform a
                  series of experiments on a number of different data sets from
                  CAD and molecular biology. A comparison with DBSCAN shows the
                  superiority of our new approach.
THESAURUS:      data handling;  data mining;  multimedia databases;
                  pattern clustering;  very large databases
OTHER SUBJECTS: large multimedia database clustering;  clustering algorithms;
                  high dimensional feature vectors;  noise;  DENCLUE;
                  DENsity-based CLUstEring;  overall point density;
                  influence functions;  data points;  density attractors;
                  simple equation;  overall density function;
                  clustering properties;  data sets;
                  compact mathematical description;
                  arbitrarily shaped clusters;  high dimensional data sets;
                  CAD;  molecular biology;  DBSCAN


TITLE:          Scaling clustering algorithms to large databases
AUTHOR:         Bradley, P.S.; Fayyad, U.; Reina, C.; Microsoft Res., Redmond,
                              WA, USA
CONFERENCE:     Proceedings of the Fourth International Conference on Knowledge
                              Discovery and Data Mining New York, NY, USA 27-31 Aug. 1998
PUBLICATION:    Proceedings Fourth International Conference on Knowledge
                              Discovery and Data Mining, p. 9-15
                            EDITOR: Agrawal, R.
                            Stolorz, P.
                            Menlo Park, CA, USA AAAI Press 1998
                            xii+382 p.
                            ISBN: 1577350707
                            29 Refs.
LANGUAGE:       English
ABSTRACT:       Practical clustering algorithms require multiple data scans to
                              achieve convergence. For large databases, these scans become
                              prohibitively expensive. We present a scalable clustering
                              framework applicable to a wide class of iterative clustering.
                              We require at most one scan of the database. In this work,
                              the framework is instantiated and numerically justified with
                              the popular K-Means clustering algorithm. The method is based
                              on identifying regions of the data that are compressible,
                              regions that must be maintained in memory, and regions that
                              are discardable. The algorithm operates within the confines
                              of a limited memory buffer. Empirical results demonstrate
                              that the scalable scheme outperforms a sampling based
                              approach. In our scheme, data resolution is preserved to the
                              extent possible based upon the size of the allocated memory
                              buffer and the fit of current clustering model to the data.
                              The framework is naturally extended to update multiple
                              clustering models simultaneously. We empirically evaluate on
                              synthetic and publicly available data sets.
THESAURUS:      data handling;  data mining;  pattern clustering;
                              very large databases
OTHER SUBJECTS: clustering algorithm scaling;  large databases;
                              practical clustering algorithms;  multiple data scans;
                              scalable clustering framework;  iterative clustering;
                              database scanning;  numerical justification;
                              K-Means clustering algorithm;  limited memory buffer;
                              scalable scheme;  sampling based approach;  data resolution;
                              allocated memory buffer;  multiple clustering models;
                              publicly available data sets

TITLE:          Conceptual clustering on partitioned data: Tree-Weaver
AUTHOR:         Jungsoon Yoo; Sung Yoo; Dept. of Comput. Sci., Middle Tennessee
                              State Univ., Murfreesboro, TN, USA
PUBLICATION:    Expert Systems with Applications, vol.15, no.3-4, p. 367-74,
                              0957-4174
                            Elsevier Oct.-Nov. 1998
                            9 Refs.
LANGUAGE:       English
ABSTRACT:       Most knowledge discovery in databases (KDD) research is
                              concentrated on supervised inductive learning. Conceptual
                              clustering is an unsupervised inductive learning technique
                              that organizes observations into an abstraction hierarchy
                              without using predefined class values. However, a typical
                              conceptual clustering algorithm is not suitable for a KDD
                              task because of space and time constraints. Furthermore,
                              typical incremental and non-incremental clustering algorithms
                              are not designed for a partitioned data set. We present a
                              conceptual clustering algorithm that works on partitioned
                              data. The proposed algorithm improves the clustering process
                              by using less computation time and less space while
                              maintaining the clustering quality.
THESAURUS:      data mining;  deductive databases;  learning by example;
                              unsupervised learning;  very large databases
OTHER SUBJECTS: conceptual clustering;  partitioned data;  Tree-Weaver;
                              knowledge discovery in databases;
                              supervised inductive learning;
                              unsupervised inductive learning;  abstraction hierarchy;
                              predefined class values;  space constraints;
                              time constraints;  incremental clustering algorithms;
                              nonincremental clustering algorithms;  computation time



TITLE:          Similarity clustering of dimensions for an enhanced visualization of multidimensional data
AUTHOR:         Ankerst, M.; Berchtold, S.; Keim, D.A.; Munich Univ., Germany
CONFERENCE:     Proceedings IEEE Symposium on Information Visualization
                  Research Triangle, CA, USA 19-20 Oct. 1998
                SPONSOR: IEEE Comput. Soc Tech. Committee on Comput. Graphics
PUBLICATION:    Proceedings IEEE Symposium on Information Visualization (Cat.
                  No.98TB100258), p. 52-60, 153
                EDITOR: Wills, G.
                Dill, J.
                Los Almaitos, CA, USA IEEE Comput. Soc 1998
                xiii+162 p.
                ISBN: 0818690933
                30 Refs.
LANGUAGE:       English
ABSTRACT:       The order and arrangement of dimensions (variates) is crucial
                  for the effectiveness of a large number of visualization
                  techniques such as parallel coordinates, scatterplots,
                  recursive pattern, and many others. We describe a systematic
                  approach to arrange the dimensions according to their
                  similarity. The basic idea is to rearrange the data
                  dimensions such that dimensions showing a similar behavior
                  are positioned next to each other. For the similarity
                  clustering of dimensions, we need to define similarity
                  measures which determine the partial or global similarity of
                  dimensions. We then consider the problem of finding an
                  optimal one- or two-dimensional arrangement of the dimensions
                  based on their similarity. Theoretical considerations show
                  that both, the one- and the two-dimensional arrangement
                  problem are surprisingly hard problems, i.e. they are NP
                  complete. Our solution of the problem is therefore based on
                  heuristic algorithms. An empirical evaluation using a number
                  of different visualization techniques shows the high impact
                  of our similarity clustering of dimensions on the
                  visualization results.
THESAURUS:      computational complexity;  data mining;  data visualisation
OTHER SUBJECTS: similarity clustering;  enhanced visualization;
                  multidimensional data;  visualization techniques;
                  parallel coordinates;  scatterplots;  recursive pattern;
                  systematic approach;  data dimensions;  similar behavior;
                  similarity measures;  global similarity;
                  two dimensional arrangement;  hard problems;  NP complete;
                  heuristic algorithms


TITLE:          CCAIIA: Clustering categorical attributes into interesting association rules
AUTHOR:         Gray, B.; Orlowska, M.E.; Sch. of Inf. Technol., Queensland
                  Univ., Brisbane, Qld., Australia
CONFERENCE:     Research and Development in Knowledge Discovery and Data
                  Mining. Second Pacific-Asia Conference, PAKDD-98 Melbourne,
                  Vic., Australia 15-17 April 1998
PUBLICATION:    Research and Development in Knowledge Discovery and Data
                  Mining. Second Pacific-Asia Conference, PAKDD-98.
                  Proceedings, p. 132-43
                EDITOR: Wu, X.
                Kotagiri, R.
                Korb, K.B.
                Berlin, Germany Springer-Verlag 1998
                xvi+424 p.
                ISBN: 3540643834
                15 Refs. 
LANGUAGE:       English
ABSTRACT:       We investigate the problem of mining interesting association
                  rules over a pair of categorical attributes at any level of
                  data granularity. We do this by integrating the rule
                  discovery process with a form of clustering. This allows
                  associations between groups of items to be formed where the
                  groping of items is based on maximising the "interestingness"
                  of the associations discovered. Previous work on mining
                  generalised associations assumes either a distance metric on
                  the attribute values or a taxonomy over the items mined.
                  These methods use the metric/taxonomy to limit the spare of
                  possible associations that can be found. We develop a measure
                  of the interestingness of association rules based on support
                  and the dependency between the item sets and use this measure
                  to guide the search. We apply the method to a data set and
                  observe the extraction of "interesting" associations. This
                  method could allow interesting and unexpected associations to
                  be discovered as the search space is not being limited by
                  user defined hierarchies.
THESAURUS:      database management systems;  knowledge acquisition;
                  knowledge based systems
OTHER SUBJECTS: CCAIIA;  data mining;  categorical attributes clustering;
                  association rules;  rule discovery process;
                  distance metric;  user defined hierarchies 

TITLE:          Automatic subspace clustering of high dimensional data for data mining applications
AUTHOR:         Agrawal, R.; Gehrke, J.; Gunopulos, D.; Raghavan, P.; IBM
                  Almaden Res. Center, San Jose, CA, USA
CONFERENCE:     1998 ACM SIGMOD International Conference on Management of Data
                  Seattle, WA, USA 1-4 June 1998
PUBLICATION:    SIGMOD Rec. (USA), SIGMOD Record, vol.27, no.2, p. 94-105,
                  0163-5808
                ACM June 1998
                45 Refs.
LANGUAGE:       English
ABSTRACT:       Data mining applications place special requirements on
                  clustering algorithms including: the ability to find clusters
                  embedded in subspaces of high dimensional data, scalability,
                  end user comprehensibility of the results, non presumption of
                  any canonical data distribution, and insensitivity to the
                  order of input records. We present CLIQUE, a clustering
                  algorithm that satisfies each of these requirements. CLIQUE
                  has the following properties: it identifies dense clusters in
                  subspaces of maximum dimensionality; it generates cluster
                  descriptions in the form of DNF expressions that are
                  minimized for ease of comprehension; it produces identical
                  results irrespective of the order in which input records are
                  presented and does not presume any specific mathematical form
                  for data distribution. Through experiments, we show that
                  CLIQUE efficiently finds accurate clusters in large high
                  dimensional datasets.
THESAURUS:      data handling;  deductive databases;  knowledge acquisition;
                  pattern recognition
OTHER SUBJECTS: automatic subspace clustering;  high dimensional data;
                  data mining applications;  clustering algorithms;
                  scalability;  end user comprehensibility;  non presumption;
                  canonical data distribution;  CLIQUE;  clustering algorithm;
                  dense clusters;  maximum dimensionality;
                  cluster descriptions;  DNF expressions;  input records;
                  data distribution;  accurate clusters;
                  large high dimensional datasets


TITLE:          Efficient point data clustering by database operations
AUTHOR:         Yan-Nong Huang; Tarau, P.; Dept. of Comput. Sci., Moncton
                  Univ., NB, Canada
CONFERENCE:     Proceedings of 8th International Hong Kong Computer Society
                  Database Workshop. Data Mining, Data Warehousing and
                  Client/Server Databases Hong Kong 29-31 July 1997
                SPONSOR: Oracle Syst. Hong Kong; NCR (Hong Kong); Sybase Hong
                  Kong; Hewlett-Packard Hong Kong; et al
PUBLICATION:    Data Mining, Data Warehousing and Client/Server Databases.
                  Proceedings of the 8th International Database Workshop, p.
                  236-50
                EDITOR: Fong, J.
                Singapore Springer-Verlag Singapore 1997
                xi+332 p.
                ISBN: 9813083549
                11 Refs.
LANGUAGE:       English
ABSTRACT:       This paper gives a definition for point data clustering and
                  shows that the clustering can be realized with several
                  database operators (spatial joins, aggregate and transitive
                  closure). With the proposed method, one doesn't have to
                  define a pool as the number of smaller groups that the
                  initial set of objects is partitioned into. An efficient
                  implementation of spatial joins is studied. Some experimental
                  results on using the proposed method for point data
                  clustering are reported.
THESAURUS:      data analysis;  database theory;  deductive databases;
                  pattern recognition;  very large databases
OTHER SUBJECTS: point data clustering;  database operations;
                  database operators;  spatial joins;  aggregate;
                  transitive closure;  object set partitioning;  data mining

TITLE:          A comparison of group-based and object-based data clustering techniques
AUTHOR:         Bouguettaya, A.; Le Viet, Q.; Colea, M.; Sch. of Inf. Syst.,
                  Queensland Univ. of Technol., Brisbane, Qld., Australia
CONFERENCE:     Proceedings of 8th International Hong Kong Computer Society
                  Database Workshop. Data Mining, Data Warehousing and
                  Client/Server Databases Hong Kong 29-31 July 1997
                SPONSOR: Oracle Syst. Hong Kong; NCR (Hong Kong); Sybase Hong
                  Kong; Hewlett-Packard Hong Kong; et al
PUBLICATION:    Data Mining, Data Warehousing and Client/Server Databases.
                  Proceedings of the 8th International Database Workshop, p.
                  119-36
                EDITOR: Fong, J.
                Singapore Springer-Verlag Singapore 1997
                xi+332 p.
                ISBN: 9813083549
                24 Refs.
LANGUAGE:       English
ABSTRACT:       We investigate the behavior and stability of two data
                  clustering methods. The Slink method is based on a single
                  object comparison test, whereas the Ward method is based on
                  comparing groups of objects. Different sensitivity parameters
                  were used to test the behavior and stability of the two
                  techniques when clustering objects were drawn from the 3D
                  spare. The objects were generated according to three commonly
                  used statistical distributions. This study tends to confirm
                  the findings of earlier (less exhaustive) studies, namely,
                  the high similarity in behavior and stability among
                  clustering methods.
THESAURUS:      data handling;  deductive databases;  knowledge acquisition;
                  object-oriented databases;  pattern recognition;
                  statistical analysis
OTHER SUBJECTS: object based data clustering techniques;
                  data clustering methods;  Slink method;
                  single object comparison test;  Ward method;
                  sensitivity parameters;  clustering objects;  3D spare;
                  statistical distributions;  clustering methods;  group based
                  data clustering techniques 

TITLE:          A comparative study of clustering methods
AUTHOR:         Zait, M.; Messatfa, H.; Oracle Corp., Redwood Shores, CA, USA
PUBLICATION:    Future Generation Computer Systems, vol.13, no.2-3, p. 149-59,
                  0167-739X
                Elsevier Nov. 1997
                12 Refs.
LANGUAGE:       English
ABSTRACT:       In this paper we propose a methodology for comparing clustering
                  methods based on the quality of the result and the
                  performance of the execution. We applied it to several known
                  clustering methods: FastClust, Autoclass, Relational data
                  analysis, and Kohonen nets. The quality of a clustering
                  result depends on both the similarity measure used by the
                  method and sets with specific (or desired) patterns using a
                  combination of parameters, such as the number and the type of
                  the attributes, the number of records, etc. We define a
                  metric to measure the quality of a clustering method, i.e.,
                  its ability to discover some or all of the "hidden" patterns.
                  The performance study is based on the resource consumption,
                  i.e., CPU time and memory space.
THESAURUS:      computational complexity;  pattern recognition;
                  unsupervised learning;  very large databases
OTHER SUBJECTS: clustering methods;  comparative study;  FastClust;
                  Autoclass;  Relational data analysis;  Kohonen nets;
                  performance study;  data mining;  data generation;
                  unsupervised classification


TITLE:          Clustering via concave minimization
AUTHOR:         Bradley, P.S.; Mangasarian, O.L.; Street, W.N.; Dept. of
                  Comput. Sci., Wisconsin Univ., Madison, WI, USA
CONFERENCE:     Advances in Neural Information Processing Systems 9.
                  Proceedings of the 1996 Conference Denver, CO, USA 2-5 Dec.
                  1996
PUBLICATION:    Advances in Neural Information Processing Systems 9.
                  Proceedings of the 1996 Conference, p. 368-74
                EDITOR: Mozer, M.C.
                Jordan, M.I.
                Petsche, T.
                London, UK MIT Press 1997
                xvi+1098 p.
                ISBN: 0262100657
                18 Refs.
LANGUAGE:       English
ABSTRACT:       The problem of assigning m points in the n-dimensional real
                  space R/sup n/ to k clusters is formulated as that of
                  determining k centers in R/sup n/ such that the sum of
                  distances of each point to the nearest center is minimized.
                  If a polyhedral distance is used, the problem can be
                  formulated as that of minimizing a piecewise-linear concave
                  function on a polyhedral set which is shown to be equivalent
                  to a bilinear program: minimizing a bilinear function on a
                  polyhedral set. A fast finite k-median algorithm consisting
                  of solving few linear programs in closed form leads to a
                  stationary point of the bilinear program. Computational
                  testing on a number of real-world databases was carried out.
                  On the Wisconsin Diagnostic Breast Cancer database, k-median
                  training set correctness was comparable to that of the k-mean
                  algorithm, however its testing set correctness was better.
                  Additionally, on the Wisconsin Prognostic Breast Cancer
                  database, distinct and clinically important survival curves
                  were extracted by the k-median algorithm, whereas the K-mean
                  algorithm failed to obtain such distinct survival curves for
                  the same database.
THESAURUS:      database management systems;
                  learning (artificial intelligence);  linear programming;
                  medical diagnostic computing;  minimisation;  neural nets;
                  nonlinear programming;  pattern classification
OTHER SUBJECTS: concave minimization;  clustering;  polyhedral distance;
                  bilinear programming;  k-median algorithm;  Wisconsin
                  Diagnostic Breast Cancer database;
                  Wisconsin Prognostic Breast Cancer database;
                  k-mean algorithm;  training set;  data mining


TITLE:          En route to data mining in legal text corpora: clustering, neural computation, and international treaties
AUTHOR:         Merkl, D.; Schweighofer, E.; Dept. of Comput. Sci., R.
                  Melbourne Inst. of Technol., Vic., Australia
CONFERENCE:     Database and Expert Systems Applications. 8th International
                  Conference, DEXA '97. Proceedings Toulouse, France 1-2 Sept.
                  1997
PUBLICATION:    Proceedings. Eighth International Workshop on Database and
                  Expert Systems Applications (Cat. No. 97TB100181), p. 465-70
                EDITOR: Wagner, R.R.
                Los Alamitos, CA, USA IEEE Comput. Soc 1997
                xvii+770 p.
                ISBN: 0818681470
                35 Refs.
LANGUAGE:       English
ABSTRACT:       The huge amount of data in legal information systems requires a
                  new generation of techniques and tools to assist lawyers in
                  analyzing data and finding critical nuggets of useful
                  knowledge. A promising approach for data mining in legal text
                  corpora is classification. What we are looking for are
                  powerful methods for the exploration of such libraries
                  whereby the detection of similarities between documents is
                  the overall goal. These methods may be used to gain insight
                  in the inherent structure of the various items contained in a
                  text archive. In this paper, we present the results from a
                  case study in legal document classification based on an
                  experimental document archive comprising important treaties
                  in public international law. The essentials of our approach
                  are the usage of a vector space document representation and
                  the utilization of an unsupervised artificial neural network
                  for document classification.
THESAURUS:      classification;  data analysis;  document handling;
                  knowledge acquisition;  law administration;  neural nets;
                  pattern classification;  very large databases
OTHER SUBJECTS: data mining;  legal text corpora;  data clustering;
                  neural computation;  international treaties;
                  legal information systems;  lawyers;  data analysis;  legal
                  document classification;  document similarity detection;
                  text archive;  case study;  document archive;
                  public international law;  vector space;
                  document representation;
                  unsupervised artificial neural network


TITLE:          A connectionist approach to structural similarity determination as a basis of clustering, classification and feature detection
AUTHOR:         Schadler, I.; Wysotzki, I.; Tech. Univ. Berlin, Germany
CONFERENCE:     Principles of Data Mining and Knowledge Discovery. First
                  European Symposium, PKDD '97. Proceedings Trondheim, Norway
                  24-27 June 1997
                SPONSOR: Dept. Comput. Inf. Sci.; Norwegian Res. Council;
                  Norwegian Artificial Intelligence Soc
PUBLICATION:    Principles of Data Mining and Knowledge Discovery. First
                  European Symposium, PKDD '97. Proceedings, p. 254-64
                EDITOR: Komorowski, J.
                Zytkow, J.
                Berlin, Germany Springer-Verlag 1997
                ix+396 p.
                ISBN: 3540632239
                42 Refs.
LANGUAGE:       English
ABSTRACT:       Many algorithms in machine learning, knowledge discovery,
                  pattern recognition and classification are based on the
                  estimation of the similarity or the distance between the
                  analysed objects. Objects with higher structural complexity
                  often cannot be described by feature vectors without losing
                  important structural information. These objects can
                  adequately be represented in the language of logic or by
                  labeled graphs. The similarity of such descriptions is
                  difficult to define and to compute. A connectionist approach
                  for the determination of the similarity of arbitrary labeled
                  graphs is introduced. Using an example from organic
                  chemistry, the application of the approach within one
                  distance based and one generalisation based classification
                  algorithm is demonstrated. The generalisation based algorithm
                  forms clusters or subclasses of similar examples of the same
                  class and extracts the parts of the objects which determine
                  the class of the object. The algorithms perform very
                  satisfactorily in comparison with recent logical and feature
                  vector approaches. Moreover, being able to handle structural
                  data directly, the algorithms need only a subset of the given
                  features of the objects for classification.
THESAURUS:      chemistry computing;  feature extraction;
                  knowledge acquisition;  learning (artificial intelligence);
                  neural nets;  pattern classification
OTHER SUBJECTS: connectionist approach;  structural similarity determination;
                  clustering;  pattern classification;  feature detection;
                  machine learning;  knowledge discovery;
                  structural complexity;  feature vectors;  labeled graphs;
                  arbitrary labeled graph similarity;  organic chemistry;
                  generalisation based classification algorithm;
                  feature vector approaches;  structural data

TITLE:          Inferring hierarchical clustering structures by deterministic annealing
AUTHOR:         Hofmann, T.; Buhmann, J.M.; Inst. fur Inf. III, Rheinische
                  Friedrich-Wilhelms-Univ., Bonn, Germany
CONFERENCE:     Proceedings of 2nd International Conference on Knowledge
                  Discovery and Data Mining (KDD-96) Portland, OR, USA 2-4 Aug.
                  1996
PUBLICATION:    KDD-96 Proceedings. Second International Conference on
                  Knowledge Discovery and Data Mining, p. 363-6
                EDITOR: Simoudis, E.
                Han, J.
                Fayyad, U.
                Menlo Park, CA, USA AAAI Press 1996
                xiv+391 p.
                15 Refs.
LANGUAGE:       English
ABSTRACT:       The unsupervised detection of hierarchical structures is a
                  major topic in unsupervised learning and one of the key
                  questions in data analysis and representation. We propose a
                  novel algorithm for the problem of learning decision trees
                  for data clustering and related problems. In contrast to many
                  other methods based on successive tree growing and pruning,
                  we propose an objective function for tree evaluation and we
                  derive a non greedy technique for tree growing. Applying the
                  principles of maximum entropy and minimum cross entropy, a
                  deterministic annealing algorithm is derived in a meanfield
                  approximation. This technique allows us to canonically
                  superimpose tree structures and to fit parameters to averaged
                  or fuzzified' trees.
THESAURUS:      deterministic algorithms;  inference mechanisms;
                  knowledge acquisition;  trees (mathematics);
                  unsupervised learning
OTHER SUBJECTS: hierarchical clustering structure inference;
                  meanfield approximation;  unsupervised detection;
                  unsupervised learning;  data analysis;  data representation;
                  decision trees;  data clustering;  successive tree growing;
                  tree pruning;  objective function;  tree evaluation;
                  non greedy technique;  maximum entropy;
                  minimum cross entropy;  deterministic annealing algorithm;
                  tree structures;  fuzzified trees

TITLE:          Clustering using Monte Carlo cross-validation
AUTHOR:         Smyth, P.; Dept. of Inf. & Comput. Sci., California Univ.,
                  Irvine, CA, USA
CONFERENCE:     Proceedings of 2nd International Conference on Knowledge
                  Discovery and Data Mining (KDD-96) Portland, OR, USA 2-4 Aug.
                  1996
PUBLICATION:    KDD-96 Proceedings. Second International Conference on
                  Knowledge Discovery and Data Mining, p. 126-33
                EDITOR: Simoudis, E.
                Han, J.
                Fayyad, U.
                Menlo Park, CA, USA AAAI Press 1996
                xiv+391 p.
                19 Refs.
LANGUAGE:       English
ABSTRACT:       Finding the "right" number of clusters, K, for a data set is a
                  difficult, and often ill posed, problem. In a probabilistic
                  clustering context, likelihood ratios, penalized likelihoods,
                  and Bayesian techniques are among the more popular
                  techniques. A novel cross validated likelihood criterion is
                  investigated for determining cluster structure. A practical
                  clustering algorithm based on Monte Carlo cross validation
                  (MCCV) is introduced. The algorithm permits the data analyst
                  to judge if there is strong evidence for a particular k, or
                  perhaps weaker evidence over a subrange of k values.
                  Experimental results with Gaussian mixtures on real and
                  simulated data suggest that MCCV provides genuine insight
                  into cluster structure. v-fold cross validation appears
                  inferior to the penalized likelihood method (BIC), a Bayesian
                  algorithm (AutoClass v2.0), and the new MCCV algorithm.
                  Overall, MCCV and AutoClass appear the most reliable of the
                  methods. MCCV provides the date miner with a useful data
                  driven clustering tool which complements the fully Bayesian
                  approach.
THESAURUS:      Bayes methods;  deductive databases;  formal verification;
                  knowledge acquisition;  Monte Carlo methods;  probability
OTHER SUBJECTS: Monte Carlo cross validation;  data set;
                  probabilistic clustering context;  likelihood ratios;
                  penalized likelihoods;  Bayesian techniques;  cross validated
                  likelihood criterion;  cluster structure;
                  practical clustering algorithm;  MCCV;  data analyst;
                  Gaussian mixtures;  simulated data;
                  v-fold cross validation;  penalized likelihood method;  BIC;
                  Bayesian algorithm;  AutoClass;  date miner;
                  data driven clustering tool;  knowledge discovery


TITLE:          Conceptual clustering in structured databases: a practical approach
AUTHOR:         Ketterlin, A.; Gancarski, P.; Korczak, J.J.; LSIIT, Univ. Louis
                  Pasteur, Strasbourg, France
CONFERENCE:     Proceedings of First International Conference on Knowledge
                  Discovery and Data Mining (KDD-95) Montreal, Que., Canada
                  20-21 Aug. 1995
                SPONSOR: AAAI; AT&T Global Inf. Solutions; GTE Lab.; et al
PUBLICATION:    KDD-95 Proceedings. First International Conference on Knowledge
                  Discovery and Data Mining, p. 180-5
                EDITOR: Fayyad, U.M.
                Uthurusamy, R.
                Menlo Park, CA, USA AAAI 1995
                xiv+345 p.
                ISBN: 0929280822
                12 Refs.
LANGUAGE:       English
ABSTRACT:       Many machine-learning (either supervised or unsupervised)
                  techniques assume that data present themselves in an
                  attribute-value form, but this formalism is largely
                  insufficient to account for many applications. Therefore,
                  much of the ongoing research now focuses on first-order
                  learning systems, but such complex formalisms lead to high
                  computational complexities. On the other hand, most of the
                  currently installed databases have been designed according to
                  the entity-relationship formalism, and usually implemented on
                  a relational database management system. This formalism is
                  far less complex than first-order logic, but much more
                  expressive than attribute-value lists. In that context, the
                  database schema defines an abstraction space, and learning
                  must occur at each level of abstraction. This paper describes
                  a clustering system which is able to discover useful
                  groupings in structured databases. It is based on the COBWEB
                  algorithm, to which it adds the ability to cluster structured
                  objects.
THESAURUS:      computational complexity;  deductive databases;
                  entity-relationship modelling;  formal logic;
                  knowledge acquisition;  learning (artificial intelligence);
                  pattern recognition;  relational databases
OTHER SUBJECTS: conceptual clustering;  structured databases;
                  machine learning techniques;  supervised learning;
                  unsupervised learning;  attribute-value lists;
                  first-order learning systems;  computational complexities;
                  entity-relationship formalism;
                  relational database management system;  first-order logic;
                  expressiveness;  database schema;  abstraction space;
                  COBWEB algorithm


TITLE:          A database interface for clustering in large spatial databases
AUTHOR:         Ester, M.; Kriegel, H.-P.; Xiaowei Xu; Inst. for Comput. Sci.,
                  Munchen Univ., Germany
CONFERENCE:     Proceedings of First International Conference on Knowledge
                  Discovery and Data Mining (KDD-95) Montreal, Que., Canada
                  20-21 Aug. 1995
                SPONSOR: AAAI; AT&T Global Inf. Solutions; GTE Lab.; et al
PUBLICATION:    KDD-95 Proceedings. First International Conference on Knowledge
                  Discovery and Data Mining, p. 94-9
                EDITOR: Fayyad, U.M.
                Uthurusamy, R.
                Menlo Park, CA, USA AAAI 1995
                xiv+345 p.
                ISBN: 0929280822
                15 Refs.
LANGUAGE:       English
ABSTRACT:       Both the number and the size of spatial databases are rapidly
                  growing because of the large amount of data obtained from
                  satellite images, X-ray crystallography or other scientific
                  equipment. Therefore, automated knowledge discovery becomes
                  more and more important in spatial databases. So far, most of
                  the methods for knowledge discovery in databases (KDD) have
                  been based on relational database systems. We address the
                  task of class identification in spatial databases using
                  clustering techniques. We present an interface to the
                  database management system (DBMS), which is crucial for the
                  efficiency of KDD on large databases. This interface is based
                  on a spatial access method, the R* tree. It clusters the
                  objects according to their spatial neighborhood and supports
                  efficient processing of spatial queries. Furthermore, we
                  propose a method for spatial data sampling as part of the
                  focusing component, significantly reducing the number of
                  objects to be clustered. Thus, we achieve a considerable
                  speed up for clustering in large databases. We have applied
                  the proposed techniques to real data from a large protein
                  database used for predicting protein-protein docking. A
                  performance evaluation on this database indicates that
                  clustering on large spatial databases can be performed both
                  efficiently and effectively using the approach.
THESAURUS:      deductive databases;  knowledge acquisition;
                  pattern recognition;  spatial data structures;
                  tree data structures;  very large databases;
                  visual databases
OTHER SUBJECTS: database interface;  large spatial database clustering;
                  automated knowledge discovery;  KDD;  class identification;
                  spatial databases;  clustering techniques;
                  database management system interface;
                  spatial access method;  R* tree;  spatial neighborhood;
                  spatial queries;  spatial data sampling;  large databases;
                  protein database;  protein-protein docking;
                  performance evaluation


A general probabilistic framework for clustering individuals and objects

Igor V. Cadez, Scott Gaffney and Padhraic Smyth

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining August 20 - 23, 2000, Boston, MA USA

Pages 140-149


Efficient clustering of high-dimensional data sets with application to reference matching

Andrew McCallum, Kamal Nigam and Lyle H. Ungar

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining August 20 - 23, 2000, Boston, MA USA

Pages 169-178


Visualization of navigation patterns on a Web site using model-based clustering

Igor Cadez, David Heckerman, Christopher Meek, Padhraic Smyth and Steven White

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining August 20 - 23, 2000, Boston, MA USA

Pages 280-284


Agglomerative clustering of a search engine query log

Doug Beeferman and Adam Berger

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining August 20 - 23, 2000, Boston, MA USA

Pages 407-416


CHAMELEON: A hierarchical clustering algorithm using dynamic modeling.

G. Karypis, E.-H. Han, and V. Kumar.

COMPUTER, 32:68-75, 1999


Fast and effective text mining using linear-time document clustering

Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining August 15 - 18, 1999, San Diego, CA USA

Bjornar Larsen and Chinatsu Aone

Pages 16-22


Trajectory clustering with mixtures of regression models

Scott Gaffney and Padhraic Smyth

Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining August 15 - 18, 1999, San Diego, CA USA

Pages 63-72


CACTUS¡Xclustering categorical data using summaries

Venkatesh Ganti, Johannes Gehrke and Raghu Ramakrishnan

Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining August 15 - 18, 1999, San Diego, CA USA

Pages 73-83


Entropy-based subspace clustering for mining numerical data

Chun-Hung Cheng, Ada Waichee Fu and Yi Zhang

Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining August 15 - 18, 1999, San Diego, CA USA

Pages 84-93


Evaluating a class of distance-mapping algorithms for data mining and clustering

Jason Tsong-Li Wang, Xiong Wang, King-Ip Lin, Dennis Shasha, Bruce A. Shapiro and Kaizhong Zhang

Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining August 15 - 18, 1999, San Diego, CA USA

Pages 307-311


Identifying distinctive subsequences in multivariate time series by clustering

Tim Oates

Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining August 15 - 18, 1999, San Diego, CA USA

Pages 322-326


Clustering techniques for large data sets--from the past to the future

Daniel A. Keim and Alexander Hinneburg

Tutorial notes for ACM SIGKDD 1999 international conference on Knowledge discovery and data mining August 15 - 18, 1999, San Diego, CA USA

Pages 141-181


Data bubbles: quality preserving performance boosting for hierarchical clustering

Markus M. Breunig, Hans-Peter Kriegel, Peer Kroger and Jorg Sander

Proceedings of the 2001 ACM SIGMOD international conference on Management of Data on Management of data May 21 - 24, 2001, Santa Barbara, CA USA

Pages 79-90


Density biased sampling: an improved method for data mining and clustering

Christopher R. Palmer and Christos Faloutsos

Proceedings of the 2000 ACM SIGMOD on Management of data May 15 - 18, 2000, Dallas, TX USA

Pages 82-92


SQLEM: fast clustering in SQL using the EM algorithm

Carlos Ordonez and Paul Cereghini

Proceedings of the 2000 ACM SIGMOD on Management of data May 15 - 18, 2000, Dallas, TX USA

Pages 559-570


Locality preserving dictionaries: theory application to clustering in databases

Vijayshankar Raman

Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems May 31 - June 3, 1999, Philadelphia, PA USA

Pages 337-345


Snakes and sandwiches: optimal clustering strategies for a data warehouse Srivastava

H. V. Jagadish, Laks V. S. Lakshmanan and Divesh

Proceedings of the 1999 ACM SIGMOD international conference on Management of data May 31 - June 3, 1999, Philadelphia, PA USA

Pages 37-48


OPTICS: ordering points to identify the clustering structure

Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jorg Sander

Proceedings of the 1999 ACM SIGMOD international conference on Management of data May 31 - June 3, 1999, Philadelphia, PA USA

Pages 49-60


Fast algorithms for projected clustering

Charu C. Aggarwal, Joel L. Wolf, Philip S. Yu, Cecilia Procopiuc and Jong Soo Park

Proceedings of the 1999 ACM SIGMOD international conference on Management of data May 31 - June 3, 1999, Philadelphia, PA USA

Pages 61-72


Clustering methods for large databases: from the past to the future

Alexander Hinneburg and Daniel A. Keim

Proceedings of the 1999 ACM SIGMOD international conference on Management of data May 31 - June 3, 1999, Philadelphia, PA USA

Page 509


Clustering through decision tree construction

Bing Liu, Yiyuan Xia and Philip S. Yu

Proceedings of the ninth international conference on Information knowledge management CIKM 2000 November 6 - 11, 2000, McLean, VA USA

Pages 20-29


A semi-supervised document clustering technique for information organization

Han-joon Kim and Sang-goo Lee

Proceedings of the ninth international conference on Information knowledge management CIKM 2000 November 6 - 11, 2000, McLean, VA USA

Pages 30-37


High performance clustering based on the similarity join

Christian Bohm, Bernhard Braunmuller, Markus Breunig and Hans-Peter Kriegel

Proceedings of the ninth international conference on Information knowledge management CIKM 2000 November 6 - 11, 2000, McLean, VA USA

Pages 298-305


An improved network clustering method for I/O-efficient query processing

Sung-Ho Woo and Sung-Bong Yang

Proceedings of the eighth ACM symposium on Advances in geographic information systems November 6 - 11, 2000, McLean, VA USA

Pages 62-68


Clustering declustered data for efficient retrieval

Hakan Ferhatosmanoglu, Divyakant Agrawal and Amr El Abbadi

Proceedings of the eighth international conference on Information knowledge management November 2 - 6, 1999, Kansas City, MO USA

Pages 343-350


Indexing techniques for wireless data broadcast under data clustering and scheduling

Quinglong Hu, Wang-Chien Lee and Dik Lun Lee

Proceedings of the eighth international conference on Information knowledge management November 2 - 6, 1999, Kansas City, MO USA

Pages 351-358


Clustering transactions using large items

Ke Wang, Chu Xu and Bing Liu

Proceedings of the eighth international conference on Information knowledge management November 2 - 6, 1999, Kansas City, MO USA

Pages 483-490


A multiple-resolution method for edge-centric data clustering

Scott Epter and Mukkai Krishnamoorthy

Proceedings of the eighth international conference on Information knowledge management November 2 - 6, 1999, Kansas City, MO USA

Pages 491-498


Relevance feedback with a small number of relevance judgements: incremental relevance feedback vs. document clustering

Makoto Iwayama

Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval July 24 - 28, 2000, Athens Greece

Pages 10-16


Document clustering using word clusters via the information bottleneck method

Noam Slonim and Naftali Tishby

Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval July 24 - 28, 2000, Athens Greece

Pages 208-215


An investigation of linguistic features and clustering algorithms for topical document clustering

Vasileios Hatzivassiloglou, Luis Gravano and Ankineedu Maganti

Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval July 24 - 28, 2000, Athens Greece

Pages 224-231


AUTOCLUST: Automatic Clustering via Boundary Extraction for Mining Massive Point-Data Sets.

V. Estivill-Castro and I. Lee.

In Proceedings of the 5th International Conference on Geocomputation, 2000.


AMOEBA: Hierarchical Clustering Based on Spatial Proximity Using Delaunay Diagram.

Estivill-Castro, Vladimir and Ickjai Lee. (2000).

Proceedings of the 9th International Symposium on Spatial Data Handling (SDH2000). Beijing, China.


Graph-Based Hierarchical Conceptual Clustering

I. Jonyer, L. B. Holder and D. J. Cook

Proceedings of the Thirteenth Annual Florida AI Research Symposium (2000).