LMW-tree

web scale clustering

View project on GitHub

Learning M-Way Tree

LMW-tree is a generic template library written in C++ that implements several algorithms that use the m-way nearest neighbor tree structure to store their data. The algorithms and data structures are generic to support different data representations and functionality.

The algorithms are primarily focussed on computationally efficient clustering. Clustering is an unsupervised machine learning process that finds interesting patterns in data. It can be used for nearest neighbor search, supervised learning and other machine learning applications.

The Parallel Streaming Signature EM-tree has produced document clusters for the ClueWeb large scale web crawls. This result was presented at the 2015 International World Wide Web conference. Document clustering at this scale has not previously been reported.

See the ClueWeb09 and ClueWeb12 clusters for examples of document clusters produced by the Parallel Streaming Signature EM-tree algorithm. ClueWeb09 and 12 contain 500 and 733 million web pages and were clustered into 600,000 to 700,000 clusters. The document to cluster mappings and other related files area available for download.

Algorithms

  • EM-tree
  • K-tree
  • k-means
  • Tree Structured Vector Quantization (TSVQ)
  • Novel variants of all of the above to work with bit vectors, bit strings or signatures.

License

LMW-tree is licensed under the BSD license.

Authors and Contributors

This project has been supported by the Queensland University Technology via the Big Data Lab and the Deputy Vice Chancellor's PhD scholarship.

Support or Contact

Having troubles with LMW-tree? Have feedback? Contact Chris de Vries via email.

Publications

This software and research has resulted in many publications. Please cite one or more of these papers when using this software.

Algorithms and Data Structures

Publications about algorithms and data structures based upon the m-way nearest neighbor search tree.

[PDF] De Vries, C.M., De Vine, L., Geva, S., Nayak, R.: Parallel Streaming Signature EM-tree: A Clustering Algorithm for Web Scale Applications. In: Proceedings of the 24th International Conference on World Wide Web Conference (2015)

[PDF] De Vries, C.M., De Vine, L., Geva, S.: The EM-tree Algorithm. In: PhD thesis "Document Clustering Algorithms, Representations and Evaluation for Information Retrieval" (2014) page 168

[PDF] De Vries, C.M., De Vine, L., Geva, S.: Random Indexing K-tree. In: ADCS09: Australian Document Computing Symposium 2009, Sydney, Australia, December 4 (2009) 43-50

[PDF] Geva, S.: K-tree: a height balanced tree structured vector quantizer. Proceedings of the 2000 IEEE Signal Processing Society Workshop Neural Networks for Signal Processing X, 2000. 1 (2000) 271-280 vol.1

Document Representations

Publications about document representations for learning and search.

[PDF] De Vries, C.M. and Geva, S.: Pairwise similarity of TopSig document signatures. In Australasian Document Computing Symposium 2012, 5-6 December 2012, University of Otago, Dunedin.

[PDF] Geva, S. and De Vries, C.M.: TopSig: Topology Preserving Document Signatures. In: Conference on Information and Knowledge Management 2011, 24-28 October 2011, Glasgow, Scotland.

Information Retrieval

Publications about cluster based search engines.

[PDF] De Vries, C.M., Geva, S., and, Trotman, A.: Distributed Information Retrieval: Collection Distribution, Selection and the Cluster Hypothesis for Evaluation of Document Clustering. In: PhD thesis "Document Clustering Algorithms, Representations and Evaluation for Information Retrieval" (2014) page 129

Clustering Evaluation

Publications on metrics, measures and evaluation tasks to compare different approaches to clustering.

[PDF] Reuter, T., Papadopoulos, S., Petkos, G., Mezaris, V., Kompatsiaris, V., Cimiano, P., De Vries, C.M., Geva, S. (2013) Social event detection at MediaEval 2013 : challenges, datasets and evaluation. In Eslevich, M. and van Laere, O. (Eds.) Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, MediaEval Multimedia Benchmark , Barcelona, Spain, pp. 1-2.

[PDF] De Vries, C.M., Geva, S., and, Trotman, A. Document clustering evaluation : Divergence from a random baseline. In Workshop "Information Retrieval 2012" (IR-2012), 12-14 September, 2012, Technical University of Dortmund, Dortmund, Germany.

[PDF] De Vries, C.M., Nayak, R., Kutty, S., Geva, S., Tagarelli, A.: Overview of the INEX 2010 XML mining track: clustering and classification of XML documents. In Lecture Notes in Computer Science, INEX 2010, Amsterdam (2011) In Press

[PDF] Nayak, R., De Vries, C.M., Kutty, S., Geva, S., Denoyer, L., Gallinari, P.: Overview of the INEX 2009 XML mining track: clustering and classification of XML documents. In Focused Retrieval and Evaluation : Proceedings of 8th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2009, Brisbane, Queensland (2010) 366-378

Document Clustering and Classification

Publications applying these family of algorithms to document clustering and classification.

[PDF] Nayak, R., Mills, R., De Vries, C.M., Geva, S.: Clustering and labeling a web scale document collection using Wikipedia clusters. In Yi, Zeng, Kotoulas, Spyros, & Huang, Zhisheng (Eds.) Web-KR '14 Proceedings of the 5th International Workshop on Web-scale Knowledge Representation Retrieval & Reasoning, ACM New York, NY, USA, Shanghai, China, (2014) pp. 23-30.

[PDF] De Vries, C.M., De Vine, L., Geva, S.: Clustering with Random Indexing K-tree and XML structure. In: Focused Retrieval and Evaluation: Proceedings of 8th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2009, Brisbane, Queensland (2010) 366-378

[PDF] De Vries, C.M., Geva, S.: K-tree: large scale document clustering. In: SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, Boston, MA, USA, ACM (2009) 718-719

[PDF] De Vries, C.M., Geva, S.: Document Clustering with K-tree. In: Advances in Focused Retrieval: 7th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2008, Schloss Dagstuhl, Germany, December 15-18 (2009) 420-431

Software

Publications on implementations of software related to clustering.

[PDF] De Vries, C.M., Geva, S.: ClusterEval 1.0 : Cluster quality Evaluation software. (2013)

[PDF] Grossekathoefer, U., De Vries, C.M., Geva, S.: pyktree: a K-tree implementation in Python (2011)

Theses

Combinations of the papers above presented as Masters and PhD theses.

[PDF] De Vries, C.M.: Document clustering Algorithms, Representations and Evaluation for Information Retrieval. PhD by Publication Thesis, QUT. (2014)

[PDF] De Vries, C.M.: Application of K-tree to Document Clustering. Masters by Research Thesis, QUT. (2010)