Modifying MUSTAFA to capture salient data

University of South Florida

Sandia National Labs

Nitesh Chawla

Overview

Large-scale data sets are becoming ubiquitous.

Intelligent visualization is needed to unearth the information in the terabytes of data.

To successfully mine large-scale data sets for information, efficient and scalable parallel and/or distributed algorithms are needed.

The relevant portions of data may be missed during visualization.

Time-consuming exploratory analysis of data is needed to find important anomalies.

This calls for a visualization tool that provides intelligent guidance.

Large data sets can be distributed and learning applied in parallel to develop a model of interesting regions.

User profiles, or ‘avatars’, are created as a result.

An avatar is unique to each user and is attached to a type or category of data set.

Learning, Creating Avatars

Mustafa, a visualization tool, has been modified by USF to facilitate learning and the creation of avatars.

During a training session, a user browses a data set using Mustafa and identifies salient regions.

A labeled data set is generated at the end of the training session.

This data set can be used by a data mining system to create avatars.

This avatar is the user profile and can be used to classify data sets in domains similar to its training set.

The avatar can then be queried for interesting regions.

How to Learn?

Disjoint sets of labeled data, generated during a training session, will be created.

Decision trees will be learned on each of the disjoint data sets in parallel and converted to rules.

The rules will be a reflection of the saliencies selected by the user.

The rules will be combined together into a single model. This rule model will be the learned representation of regions likely to interest the user.
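The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in Mustafa: one-level decision stumps stand in for full decision trees, a thread pool stands in for real parallel or distributed workers, and the record layout, function names, and rule format are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical record layout: (feature_values, saliency_label).

def partition(examples, n_parts):
    """Split the labeled examples into n_parts disjoint subsets (round-robin)."""
    return [examples[i::n_parts] for i in range(n_parts)]

def learn_stump(subset):
    """Learn a one-level tree: the (feature, threshold) split with the fewest errors.

    A stump is a stand-in for a full decision tree, used to keep the sketch short.
    Returns None when no feature can split the subset.
    """
    best = None
    n_features = len(subset[0][0])
    for f in range(n_features):
        for threshold in sorted({x[f] for x, _ in subset}):
            left = [y for x, y in subset if x[f] <= threshold]
            right = [y for x, y in subset if x[f] > threshold]
            if not left or not right:
                continue
            l_label, l_hits = Counter(left).most_common(1)[0]
            r_label, r_hits = Counter(right).most_common(1)[0]
            errors = len(subset) - l_hits - r_hits
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_label, r_label)
    return best

def stump_to_rules(stump):
    """Turn each leaf of the stump into an if-then rule."""
    _, f, threshold, l_label, r_label = stump
    return [(f, "<=", threshold, l_label), (f, ">", threshold, r_label)]

def build_rule_model(examples, n_parts=4):
    """Learn one stump per disjoint subset and merge the rules into one model."""
    subsets = partition(examples, n_parts)
    # Threads keep the sketch simple; separate processes or machines
    # would be needed for a real speed-up.
    with ThreadPoolExecutor() as pool:
        stumps = list(pool.map(learn_stump, subsets))
    rules = []
    for stump in stumps:
        if stump is None:
            continue
        for rule in stump_to_rules(stump):
            if rule not in rules:  # drop exact duplicates when combining
                rules.append(rule)
    return rules
```

Each rule is a tuple (feature index, operator, threshold, saliency). How rules learned on different subsets should be reconciled when they disagree is not described here; the sketch simply keeps all distinct rules.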

Avatar Created!

The created rule set will be consulted to classify the unseen data.
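Consulting a rule set to classify an unseen node might look like the sketch below. The rule format, the majority vote among matching rules, and the default class are assumptions made for illustration; the slides do not say how conflicting rules are resolved.

```python
from collections import Counter

# Hypothetical rule format: (feature_index, operator, threshold, saliency).

def rule_matches(rule, features):
    """Check whether a node's feature values satisfy the rule's condition."""
    f, op, threshold, _ = rule
    return features[f] <= threshold if op == "<=" else features[f] > threshold

def classify(rules, features, default="unseen low"):
    """Return the saliency voted for by most matching rules, or the default."""
    votes = Counter(rule[3] for rule in rules if rule_matches(rule, features))
    if not votes:
        return default
    return votes.most_common(1)[0][0]
```

A node that matches no rule falls back to the default class, which here mirrors the label given to unlabeled regions during training.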

Thus, an AVATAR is built for a user.

A Training Example

The "can" Exodus data set was used for training.

The can data set has 10088 nodes, 9 nodal variables and 44 time steps.

The user browsed the odd-numbered time steps and identified regions as interesting or very interesting. All unlabeled regions were given a saliency of "unseen low".

The size of the training data set was 221,936 examples.

There were approximately 1300 rules created. The rules were created from a pruned decision tree (default pruning).

The test bed was the even-numbered time steps of the same data set.
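The quoted training set size can be checked with a short calculation. The sketch assumes one example per node per browsed time step and time steps numbered 1 to 44; both are assumptions, though they reproduce the figure of 221,936.

```python
N_NODES = 10088
N_TIME_STEPS = 44

# Assumption: time steps are numbered 1..44.
time_steps = range(1, N_TIME_STEPS + 1)
train_steps = [t for t in time_steps if t % 2 == 1]  # odd-numbered: training
test_steps = [t for t in time_steps if t % 2 == 0]   # even-numbered: testing

# Assumption: every node at every training time step yields one example.
n_train_examples = N_NODES * len(train_steps)
print(len(train_steps), n_train_examples)  # 22 221936
```

By the same reasoning the even-numbered test bed would contain the same number of node records, although that count is not stated in the slides.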

Training output

Flat data file : A file containing the data for every node together with the corresponding saliencies.

Flat names file : A file listing all the attributes and the classes (saliency values).

Exodus file : The training exodus file recreated with saliency added as another nodal variable.
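As an illustration of what the two flat files might contain, the sketch below writes them in a layout loosely modelled on the names/data file pair used by C4.5-style learners. The slides do not specify the actual format, so the layout, the function names, and the attribute names in the usage example are assumptions.

```python
def names_file_text(classes, attributes):
    """Classes on the first line, then one 'name: continuous.' line per attribute."""
    lines = [", ".join(classes) + "."]
    lines += [f"{name}: continuous." for name in attributes]
    return "\n".join(lines) + "\n"

def data_file_text(rows):
    """One comma-separated line per node: attribute values, then the saliency."""
    return "".join(
        ",".join(str(v) for v in values) + "," + saliency + "\n"
        for values, saliency in rows
    )

# Usage with hypothetical attribute names and values.
print(names_file_text(["interesting", "very interesting", "unseen low"], ["dispx", "velx"]))
print(data_file_text([((0.1, 2.0), "interesting")]))
```

Treating every nodal variable as continuous is also an assumption of the sketch.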

Exodus data set

Testing output

Flat results file : A file containing a grouping of nodes by time step and saliency.

HTML file : An HTML file providing a web interface for node queries.

Exodus file : A similar test file is created with saliency added as a feature.
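The grouping in the flat results file could be produced along the following lines. The input tuples, the function names, and the text layout are hypothetical; the slides only say that nodes are grouped by time step and saliency.

```python
from collections import defaultdict

def group_by_step_and_saliency(classified):
    """Map (time_step, saliency) to the sorted list of node ids in that group.

    `classified` is an iterable of (time_step, node_id, saliency) tuples.
    """
    groups = defaultdict(list)
    for time_step, node_id, saliency in classified:
        groups[(time_step, saliency)].append(node_id)
    return {key: sorted(nodes) for key, nodes in groups.items()}

def results_text(groups):
    """Render one line per group, ordered by time step and then saliency."""
    lines = []
    for (time_step, saliency), nodes in sorted(groups.items()):
        lines.append(f"time step {time_step}, {saliency}: " + " ".join(map(str, nodes)))
    return "\n".join(lines) + "\n"
```

The same grouped dictionary could back the node query interface, since it answers "which nodes are salient at this time step" with a single lookup.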

Input Rule set and Exodus data set

output

The following presentation shows an animated display of snapshots of the can Exodus data set used for training. The training region is indicated by a probe point.

Then we will show three representative classified regions from the testing or Avatar application stage.

LET US SEE HOW WELL THE AVATAR HAS LEARNED

The highlighted node was given a saliency of Interesting in the testing phase. Correctly classified.

An error! The region was classified as Interesting, which is wrong: the nodes in the section of the can being crushed received a saliency of very interesting during training.

A salient region correctly classified as very interesting.