[Anderson] Visual Data Mining

First, as you are using the parallel plot for other exploration purposes, you may run across pattern-based outliers and should be able to recognize them. Second, the parallel plot is effective in isolating and removing the outlier.

Drag the top var2 range slider down, approaching the outlying observation. Getting really close is not necessary; just get close enough to see the outlier clearly isolated on the var1 axis.

Figure 3.6 PCP of Table6.csv, back side

Drag the bottom var1 range slider up until only the single outlying observation is visible.
Right-click on a range slider, selecting Make dataset from filter to create a subset containing only the outlying observation.
Name the dataset Table6Outlier.

At times when removing outliers using the parallel plot it is easier to create a subset of all but the outliers; at other times, as with Table6, it is easier to create the subset of just the outlying observations. In the latter case, another step is required using the Control Center's Difference operation.

In the Control Center, drag Table6Outlier over Table6.csv and drop.
Select Create dataset from Difference.

After subtracting (or removing) the outliers, the resulting dataset contains just valid observations, which was the objective. The dataset name assigned by the Difference operation is a combination of the names of both datasets involved. In this case it is Table6.csv-Table6Outlier. A shorter name may be preferable.

Right-click on Table6.csv-Table6Outlier; select View/Edit name and notes.
Change the name to Table6Valid.
Click Save.

A pattern check of experimental data

The dataset ResponseTime.csv contains the results of benchmark tests comparing two widely used web servers, identified in the dataset as Platform A and Platform B. The data was collected by using a simulator to repeatedly request web pages from the servers from hundreds of different locations, then measuring the response times.

Open
ResponseTime.csv.
View the Summary Statistics.

The AvgPgRsp is the average time in milliseconds that it took the server to respond, given the requested page size (FileSize in kilobytes) and the number of requests per second (TPS) hitting the server. The average was based on thousands of requests to the server from hundreds of clients requesting files of the specified size at the given traffic level. As you see from the summary statistics, requested file sizes ranged from 5 KB up to 50 KB and traffic levels ranged from 100 to 550 requests per second.

View ResponseTime.csv in a scatter plot.
Select TPS on the X axis and FileSize on the Y axis.

In this view, we clearly see the benchmark design. This is not randomly sampled data, but a carefully crafted experiment. The apparently missing observations in the upper right corner of the grid (A in Figure 3.7) represent trials that overloaded the server so much that it failed to respond to the page requests. The missing observations in the column at 225 TPS (B in Figure 3.7) were unintentional omissions made by the lab technician running the simulations. These omissions were not detected until results were viewed in a similar scatter plot.

Figure 3.7 Scatter Plot of ResponseTime.csv

Change the Y axis to AvgPgRsp, the Z axis to FileSize, and the category to Server.

Numerous outliers are visible among the observations at the 250 TPS level. After review, these turned out to be simulations run without first returning the servers to a predetermined initial state, again a mistake made by the technician. They need to be removed from the dataset before continuing the analysis. The higher page response times at the upper TPS and FileSize settings were valid measurements reflecting degradation of the server at these levels. The phenomenon observed in the plot is frequently referred to as the hockey stick.

Exercise 3.6

Using the ResponseTime.csv dataset in VisMiner:
a. Use the parallel plot to extract a subset named
outliers containing the invalid observations.
b. In the Control Center, create a subset of valid observations using the difference between the full dataset and the outlier set.
c. Name the dataset ValidResponseTime.csv.

Summary

In preparation for the application of data mining algorithms, most datasets need some modification. These modifications include:
- projection (attribute selection)
- restriction (filtering of observations)
- sub-population extraction
- aggregation (combining of observations)
- elimination of missing values
- derivation of new columns
- merging of datasets
- detection and elimination of outliers.

VisMiner supports all of these operations. Where suitable, some are supported within the parallel plot, correlation matrix, and location plot viewers. Other modification operations are implemented directly in the Control Center.

4 Prediction Algorithms for Data Mining

In support of the data mining process, VisMiner implements algorithms for prediction modeling. It supports modelers both for classification (predicting nominal or class values) and regression (predicting continuous numeric values). In this chapter we introduce the basic algorithms implemented by VisMiner. These include decision trees, support vector machines, and artificial neural networks for classification and artificial neural networks for regression.

For the most part, the algorithms of VisMiner are a black box. One does not need to know precisely how the algorithms work in order to deploy them in data mining exercises. Consequently, this chapter may be skipped. However, knowledge of the algorithms can help in the following ways:
- Algorithm selection: each algorithm has its strengths and weaknesses. An understanding of the internal workings of an algorithm leads to a better appreciation of its strengths and weaknesses. Consequently it results in better decision making when it comes to algorithm selection as dictated by the dataset characteristics and data mining objective.
- Results evaluation: knowing how the algorithm arrived at its
results helps in assessing their applicability and the confidence that can be placed in them. For example, with respect to a decision tree, how does a root level split variable compare in importance to a leaf level split?

Visual Data Mining: The VisMiner Approach, First Edition. Russell K. Anderson. © 2013 John Wiley & Sons, Ltd. Published 2013 by John Wiley & Sons, Ltd.
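
For readers who want to see the outlier-removal workflow of Exercise 3.6 outside VisMiner, the following is a minimal Python sketch of the same two steps: build a subset from a filter, then subtract it from the full dataset. It is not VisMiner's implementation. The function names make_subset and difference, the sample rows, and the 50 ms cutoff are all assumptions made for illustration; only the column names (Server, TPS, FileSize, AvgPgRsp) and the 250 TPS level come from the text.

```python
# Invented rows standing in for ResponseTime.csv (values are not from the book).
rows = [
    {"Server": "A", "TPS": 200, "FileSize": 5, "AvgPgRsp": 12.0},
    {"Server": "A", "TPS": 250, "FileSize": 5, "AvgPgRsp": 13.0},
    {"Server": "A", "TPS": 250, "FileSize": 5, "AvgPgRsp": 95.0},    # invalid run
    {"Server": "B", "TPS": 250, "FileSize": 10, "AvgPgRsp": 18.0},
    {"Server": "B", "TPS": 250, "FileSize": 10, "AvgPgRsp": 120.0},  # invalid run
    {"Server": "B", "TPS": 300, "FileSize": 10, "AvgPgRsp": 21.0},
]


def make_subset(dataset, keep):
    """Rough analogue of 'Make dataset from filter': keep rows passing the test."""
    return [row for row in dataset if keep(row)]


def row_key(row):
    """Hashable representation of a row, used to compare rows by value."""
    return tuple(sorted(row.items()))


def difference(full, subset):
    """Rough analogue of the Difference operation: rows of `full` not in `subset`.

    Rows are compared by value, so every copy of a matching row is removed.
    """
    excluded = {row_key(row) for row in subset}
    return [row for row in full if row_key(row) not in excluded]


# Assumed filter: at the 250 TPS level, treat responses slower than 50 ms as invalid.
outliers = make_subset(rows, lambda r: r["TPS"] == 250 and r["AvgPgRsp"] > 50.0)
valid = difference(rows, outliers)

print(len(outliers), "outliers removed;", len(valid), "valid observations remain")
```

In practice the filter would be chosen by inspecting the plot, as the exercise describes, rather than by a fixed threshold.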
