With Himalaya Data Mining Tools we are developing new functionality for data mining and working on techniques to improve existing models. The long-term goal of the project is to publish the source code of new cutting-edge algorithms from the Cornell Database Group so that other users can use and test them in a structured environment.
The initial release included two algorithms for mining patterns from itemsets in a transactional database. A new algorithm for regression tree construction was recently posted. We plan to include additional techniques, such as data stream mining, in the future.
All algorithms are implemented in C++ and have been ported to both the Linux and Windows platforms. Extensive documentation is provided with Doxygen-generated HTML pages that detail all of the various components of both programs.
This research has been supported by the National Science Foundation under Grants IIS-0084762 and IIS-0121175. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
MAFIA is an algorithm for mining maximal frequent itemsets from a transactional database. By using innovative pruning and compression techniques, the algorithm is especially efficient when the itemsets in the database are very long. It is one of the fastest published algorithms for mining long itemsets and outperforms previous work by up to an order of magnitude.
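To make the term "maximal frequent itemset" concrete, the sketch below computes such itemsets by brute force on a tiny database. It is not MAFIA and uses none of its pruning or compression techniques; it enumerates every subset of the items, so it is only usable for a handful of items. The names `Itemset`, `support`, and `maximal_frequent_itemsets` are hypothetical and are not taken from the released source code. An itemset is treated as frequent when at least `min_support` transactions contain it, and as maximal when no frequent proper superset exists.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical type for this sketch: an itemset is a sorted set of item ids.
using Itemset = std::set<int>;

// Number of transactions that contain every item of `candidate`.
std::size_t support(const std::vector<Itemset>& db, const Itemset& candidate) {
    std::size_t count = 0;
    for (const Itemset& transaction : db) {
        if (std::includes(transaction.begin(), transaction.end(),
                          candidate.begin(), candidate.end())) {
            ++count;
        }
    }
    return count;
}

// Brute-force illustration of the definition (exponential in num_items).
// Items are assumed to be numbered 0 .. num_items - 1.
std::vector<Itemset> maximal_frequent_itemsets(const std::vector<Itemset>& db,
                                               int num_items,
                                               std::size_t min_support) {
    assert(num_items >= 0 && num_items < 20);  // keep the enumeration small

    // Step 1: collect every non-empty frequent itemset.
    std::vector<Itemset> frequent;
    for (unsigned mask = 1; mask < (1u << num_items); ++mask) {
        Itemset candidate;
        for (int i = 0; i < num_items; ++i) {
            if (mask & (1u << i)) candidate.insert(i);
        }
        if (support(db, candidate) >= min_support) frequent.push_back(candidate);
    }

    // Step 2: keep only those with no frequent proper superset.
    std::vector<Itemset> maximal;
    for (const Itemset& a : frequent) {
        bool has_frequent_superset = false;
        for (const Itemset& b : frequent) {
            if (b.size() > a.size() &&
                std::includes(b.begin(), b.end(), a.begin(), a.end())) {
                has_frequent_superset = true;
                break;
            }
        }
        if (!has_frequent_superset) maximal.push_back(a);
    }
    return maximal;
}
```

The set of maximal itemsets is a compact summary: every frequent itemset is a subset of at least one maximal itemset, which is why the maximal sets are a natural target when frequent itemsets are long.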
SPAM is a new algorithm for finding all frequent sequences within a transactional database, and is the fastest published algorithm for this task today. Various pruning mechanisms are implemented to reduce the search space and speed up processing. Our performance numbers show that the algorithm outperforms previous work by a factor of 3 to over an order of magnitude.
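As an illustration of what it means for a sequence to be frequent, the sketch below counts the support of one candidate pattern. It assumes the common sequential-pattern setting, in which each customer's history is an ordered list of itemsets and a pattern is supported when its itemsets can be matched, in order, to supersets in that history. This is only the support test, not SPAM's search or pruning, and the names `Sequence`, `contains_pattern`, and `sequence_support` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical types for this sketch.
using Itemset = std::set<int>;
using Sequence = std::vector<Itemset>;  // ordered list of itemsets

// True if the itemsets of `pattern` can be matched, in order, to
// itemsets of `data` that contain them. Matching each pattern element
// at the earliest possible position is sufficient for this test.
bool contains_pattern(const Sequence& data, const Sequence& pattern) {
    std::size_t next = 0;  // index of the next pattern element to match
    for (const Itemset& element : data) {
        if (next == pattern.size()) break;
        if (std::includes(element.begin(), element.end(),
                          pattern[next].begin(), pattern[next].end())) {
            ++next;
        }
    }
    return next == pattern.size();
}

// Number of data sequences that contain `pattern`.
std::size_t sequence_support(const std::vector<Sequence>& db,
                             const Sequence& pattern) {
    assert(!pattern.empty());
    std::size_t count = 0;
    for (const Sequence& data : db) {
        if (contains_pattern(data, pattern)) ++count;
    }
    return count;
}
```

A pattern would then be reported as frequent when `sequence_support` reaches a user-chosen minimum support threshold.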
SECRET (Scalable EM Classification-based Regression Trees) is a new construction algorithm for regression trees with linear models in the leaves.
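The kind of model such an algorithm produces can be sketched as follows: internal nodes route an input by comparing one feature against a threshold, and each leaf holds a linear model instead of a constant. The code shows prediction only and says nothing about how SECRET builds the tree. The `Node` layout, the `<=` split convention, and the helper `make_leaf` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical node layout for a regression tree with linear leaves.
struct Node {
    // Internal node: go left when x[split_feature] <= threshold.
    std::size_t split_feature = 0;
    double threshold = 0.0;
    std::unique_ptr<Node> left;
    std::unique_ptr<Node> right;

    // Leaf: y = intercept + sum_i weights[i] * x[i].
    double intercept = 0.0;
    std::vector<double> weights;

    bool is_leaf() const { return !left && !right; }
};

// Build a leaf holding the given linear model.
std::unique_ptr<Node> make_leaf(double intercept, std::vector<double> weights) {
    auto leaf = std::make_unique<Node>();
    leaf->intercept = intercept;
    leaf->weights = std::move(weights);
    return leaf;
}

// Walk from the root to a leaf, then evaluate that leaf's linear model.
double predict(const Node& root, const std::vector<double>& x) {
    const Node* node = &root;
    while (!node->is_leaf()) {
        assert(node->left && node->right);  // internal nodes have two children
        assert(node->split_feature < x.size());
        node = (x[node->split_feature] <= node->threshold) ? node->left.get()
                                                           : node->right.get();
    }
    assert(node->weights.size() == x.size());
    double y = node->intercept;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y += node->weights[i] * x[i];
    }
    return y;
}
```

Compared with a tree that stores a single constant per leaf, linear leaves let a shallow tree approximate a piecewise-linear target.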