Modifying MUSTAFA to capture salient data
University of South Florida
Overview
- Large scale data sets are becoming ubiquitous.
- Intelligent visualization is needed to unearth the information in the terabytes of data.
- To successfully mine large scale data sets for information, efficient and scalable parallel and/or distributed algorithms are needed.
- The relevant portions of data may be missed during visualization.
- Time-consuming exploratory analysis of data is needed to find important anomalies.
- This calls for a visualization tool that provides intelligent guidance.
- Large data sets can be distributed and learning applied in parallel to develop a model of interesting regions.
- User profiles, or 'avatars', are created as a result.
- An avatar is unique to each user and attached to a type or category of data set.
Learning, Creating Avatars
- Mustafa, a visualization tool, has been modified by USF to facilitate learning and the creation of avatars.
- During a training session, a user browses a data set using Mustafa and identifies salient regions.
- A labeled data set is generated at the end of the training session.
- This data set can be used by a data mining system to create avatars.
- This avatar is the user profile and can be used to classify data sets in domains similar to its training set.
- The avatar can then be queried for interesting regions.
How to Learn?
- The labeled data generated during a training session will be split into disjoint sets.
- Decision trees will be learned on each of the disjoint data sets in parallel and converted to rules.
- The rules will be a reflection of the saliencies selected by the user.
- The rules will be combined together into a single model. This rule model will be the learned representation of regions likely to interest the user.
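The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the learner used with Mustafa: the tiny tree learner is an unpruned stand-in, the toy feature vectors and the helper names (`learn_tree`, `tree_to_rules`, `build_model`, `classify`) are hypothetical, and the way rules are combined (pooling all rules and taking a majority vote) is one simple choice among several.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def majority(rows):
    return Counter(label for _, label in rows).most_common(1)[0][0]

def gini(rows):
    counts = Counter(label for _, label in rows)
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())

def learn_tree(rows, depth=0, max_depth=3):
    """Greedy binary tree on numeric features (no pruning; a stand-in learner)."""
    if len({label for _, label in rows}) == 1 or depth == max_depth:
        return majority(rows)
    best = None
    for f in range(len(rows[0][0])):
        # Skip the largest value so the right branch is never empty.
        for t in sorted({x[f] for x, _ in rows})[:-1]:
            left = [r for r in rows if r[0][f] <= t]
            right = [r for r in rows if r[0][f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:  # no feature can split these rows
        return majority(rows)
    _, f, t, left, right = best
    return (f, t,
            learn_tree(left, depth + 1, max_depth),
            learn_tree(right, depth + 1, max_depth))

def tree_to_rules(tree, conds=()):
    """Turn every root-to-leaf path into a (conditions, label) rule."""
    if not isinstance(tree, tuple):
        return [(conds, tree)]
    f, t, left, right = tree
    return (tree_to_rules(left, conds + ((f, "<=", t),)) +
            tree_to_rules(right, conds + ((f, ">", t),)))

def build_model(rows, n_parts=3):
    parts = [rows[i::n_parts] for i in range(n_parts)]  # disjoint subsets
    # Threads keep the sketch short; real speedup would need processes or a cluster.
    with ThreadPoolExecutor() as pool:
        trees = list(pool.map(learn_tree, parts))
    return [rule for tree in trees for rule in tree_to_rules(tree)]

def classify(rules, x, default="unseen low"):
    """Majority vote over all rules whose conditions match x."""
    votes = Counter(
        label for conds, label in rules
        if all((x[f] <= t) if op == "<=" else (x[f] > t) for f, op, t in conds)
    )
    return votes.most_common(1)[0][0] if votes else default

# Toy labeled data: two hypothetical nodal variables per example.
rows = [((float(i % 10), float(i % 7)),
         "interesting" if i % 10 > 5 else "unseen low") for i in range(70)]
model = build_model(rows)
print(classify(model, (8.0, 1.0)))  # interesting
```

Because each tree contributes exactly one matching rule for any example, the vote here amounts to a vote among the trees learned on the disjoint subsets.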
Avatar Created !!!
- The created rule set will be consulted to classify the unseen data.
- Thus, an AVATAR is built for a user.
A Training Example
- The can exodus data set was used for training.
- The can data set has 10088 nodes, 9 nodal variables and 44 time steps.
- The user browsed odd numbered time steps and identified regions as interesting or very interesting. All unlabeled regions were given a saliency of unseen low.
- The training data set contained 221,936 examples.
- Approximately 1300 rules were created from pruned decision trees (default pruning).
- The test bed was the even numbered time steps of the same data set.
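The training set size can be checked with a short calculation. The sketch below assumes that time steps are numbered from 1 and that each node at each browsed time step contributes one example; neither assumption is stated on the slides, but together they reproduce the reported count.

```python
NODES = 10088
TIME_STEPS = 44

# Assumption: time steps are numbered 1..44.
train_steps = [t for t in range(1, TIME_STEPS + 1) if t % 2 == 1]  # odd: training
test_steps = [t for t in range(1, TIME_STEPS + 1) if t % 2 == 0]   # even: testing

# Assumption: one example per node per time step.
train_examples = NODES * len(train_steps)
print(len(train_steps), train_examples)  # 22 221936
```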
Training output
Flat data file: A file containing the data for every node and the corresponding saliencies.
Flat names file: A file listing all the attributes and the classes (saliencies).
Exodus file: The training exodus file recreated with saliency added as another nodal variable.
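As an illustration of what a data file and names file pair might look like, the sketch below writes them in the style of the C4.5 `.names` and `.data` files. That layout is an assumption, since the slides do not specify the format, and the attribute names and the helpers `names_text` and `data_text` are hypothetical.

```python
def names_text(classes, attributes):
    """Classes on the first line, then one 'name: continuous.' line per attribute."""
    lines = [", ".join(classes) + "."]
    lines += [f"{name}: continuous." for name in attributes]
    return "\n".join(lines) + "\n"

def data_text(rows):
    """One comma-separated line per example, with the saliency label last."""
    return "".join(
        ",".join(str(v) for v in values) + f",{label}\n" for values, label in rows
    )

classes = ["unseen low", "interesting", "very interesting"]
attributes = ["var_a", "var_b"]  # hypothetical nodal variable names
rows = [((0.12, 3.4), "unseen low"), ((0.98, 7.1), "very interesting")]

print(names_text(classes, attributes), end="")
print(data_text(rows), end="")
```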
Testing output
Flat results file: A file that groups nodes by time step and saliency.
HTML file: An HTML file providing a web interface for node queries.
Exodus file: A corresponding test exodus file, with saliency added as a feature.
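The grouping behind the flat results file can be sketched as follows. The actual file layout is not given on the slides, so the record shape `(time_step, node_id, saliency)` and the helper `group_nodes` are assumptions made for illustration.

```python
from collections import defaultdict

def group_nodes(records):
    """Group node ids by (time step, predicted saliency)."""
    groups = defaultdict(list)
    for time_step, node_id, saliency in records:
        groups[(time_step, saliency)].append(node_id)
    return dict(groups)

# Hypothetical classified records from the test stage.
records = [
    (2, 101, "interesting"),
    (2, 102, "very interesting"),
    (2, 103, "interesting"),
    (4, 101, "very interesting"),
]
for (time_step, saliency), nodes in sorted(group_nodes(records).items()):
    print(time_step, saliency, nodes)
```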
Input Rule set and Exodus data set
- The following presentation shows an animated display of snapshots of the can exodus data set used for training. The training region is marked by a probe point.
- Then we will show three representative classified regions from the testing or Avatar application stage.
LET US SEE HOW WELL THE AVATAR HAS LEARNED
The highlighted node was given a saliency of Interesting in the testing phase. Correctly classified.
An error! This region was classified as Interesting, which is wrong: the nodes in the section of the can being crushed received a saliency of Very Interesting during training.
A salient region correctly classified as very interesting.