Data Analysis and Pattern Recognition

The experiments in this part of Lab 4 use the Analysis -> Covariance tool in the Pattern Recognition and Feature Extraction Toolbox. (The manual provides a systematic description of all of its features.) The purpose is to use data to create a classifier and to estimate its error rate. Start as usual:

1. Launch Matlab.

2. Connect to your directory (e.g., at the Matlab prompt, type cd my-dir )

3. Type lab4

4. Choose Analysis -> Covariance

5. Choose Options -> Covariance Analysis

6. A popup menu will let you select two files: File 1 for Class 1 data and File 2 for Class 2 data.

Enter

7. A second popup menu lets you select two of the (possibly several) features.

Enter

8. A third popup menu lets you select the test data.

For both File 1 and File 2, enter the range from 4 to 25 and click "Continue".

(Thus, we are using only 6 training points, 3 from Class 1 and 3 from Class 2.)

Warning: The program is not bulletproofed. It does not check to see that your selections are in the legal ranges. It does not even warn you that you have to have at least 2 training points and at least 1 test point. If you make an invalid selection, expect to get Matlab error messages!

9. After the program computes the covariance matrix, it plots the training data -- Class 1 in red and Class 2 in blue -- giving you the options you saw in the earlier covariance example.

- Display the Euclidean Separator and note the number of misclassified training examples

- Display the Mahalanobis Separator and note the number of misclassified training examples

- Hide the separators
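The two decision rules behind these separators can be sketched in a few lines. This is a minimal NumPy illustration (the means, covariance, and test point are synthetic values chosen for illustration, not the lab's data) of the Euclidean rule, which assigns a point to the nearest class mean, versus the Mahalanobis rule, which measures distance relative to the covariance of the features:

```python
import numpy as np

def euclidean_classify(x, mean1, mean2):
    """Assign x to the class with the nearest mean (Euclidean distance)."""
    d1 = np.linalg.norm(x - mean1)
    d2 = np.linalg.norm(x - mean2)
    return 1 if d1 <= d2 else 2

def mahalanobis_classify(x, mean1, mean2, cov):
    """Assign x using Mahalanobis distance with a shared covariance matrix."""
    inv = np.linalg.inv(cov)
    d1 = (x - mean1) @ inv @ (x - mean1)
    d2 = (x - mean2) @ inv @ (x - mean2)
    return 1 if d1 <= d2 else 2

# Tiny synthetic example (values are illustrative only)
m1 = np.array([0.0, 0.0])
m2 = np.array([3.0, 1.0])
cov = np.array([[2.0, 0.9],
                [0.9, 1.0]])   # elongated, correlated spread
x = np.array([1.6, 0.9])
print(euclidean_classify(x, m1, m2), mahalanobis_classify(x, m1, m2, cov))
# prints "2 1"
```

Note that the two rules can disagree, as they do here: the Euclidean separator treats all directions equally, while the Mahalanobis separator discounts distance along the direction in which the data are spread out.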

10. Note the new menu choices that appear on the Analysis pulldown menu:

Compute Mahalanobis Statistics

Display Mahalanobis Error Data

Clear Mahalanobis Error Data

Compute Euclidean Statistics

Display Euclidean Error Data

Clear Euclidean Error Data

Display Test Points

11. Select Analysis -> Display Test Points, and note that many Class 1 examples are misclassified.

Note: The program does not compute the classification errors until you ask it to. However, once asked, it keeps a history of the training error rate and the test error rate, and will display them on request. Use the "Clear" command if you want to get rid of the history.
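The history behavior described in this note can be sketched as follows; this is a hypothetical illustration of the bookkeeping, not the toolbox code, and the percentages are made-up values:

```python
class ErrorHistory:
    """Sketch of the tool's history: statistics are stored only when
    computed, kept until explicitly cleared, displayed on request."""

    def __init__(self):
        self.records = []          # (n_train, train_pct, test_pct) tuples

    def compute(self, n_train, train_pct, test_pct):
        self.records.append((n_train, train_pct, test_pct))

    def display(self):
        for n, tr, te in self.records:
            print(f"n_train={n}: train {tr:.1f}%  test {te:.1f}%")

    def clear(self):
        self.records = []

hist = ErrorHistory()
hist.compute(6, 100.0, 63.6)       # illustrative numbers only
hist.compute(44, 86.4, 83.3)
hist.display()
hist.clear()                       # history is gone until recomputed
```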

12. Select Analysis -> Hide Test Points.

13. Select Analysis -> Compute Euclidean Statistics; note that the percentage correct is much higher for the training data than for the test data.

- Record the number of training points, the number of test points, and the percentage correct on the training and the test data

14. Select Analysis -> Display Euclidean Error Data. The graph shows the percentage correct plotted versus the number of training points. Close the graph window.

15. Repeat Steps 5 to 8, except when selecting the range of points for test in Step 8, use 23 to 25.

16. Select Analysis -> Compute Euclidean Statistics; note that the percentage correct is about the same for the training data and the test data; in fact, in this case we are doing better on the test set than the training set!

- Record the number of training points, the number of test points, and the percentage correct on the training and the test data

18. Repeat Steps 15 and 16, except when selecting the range of points for test, use 1 to 3.

- Record the number of training points, the number of test points, and the percentage correct on the training and the test data.

- Explain why there is so much variability in the test results.
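One way to see where the variability comes from: with only 3 test points, the estimated percentage correct can only take the values 0, 33.3, 66.7, or 100 percent, and under a simple binomial model the standard deviation of the estimate shrinks only as the square root of the number of test points. A small sketch (the true error probability p = 0.2 is illustrative, not a lab result):

```python
import math

def error_rate_std(p, n):
    """Standard deviation of an error-rate estimate from n independent
    test points, when the true error probability is p (binomial model)."""
    return math.sqrt(p * (1 - p) / n)

# Compare a 3-point test set with larger ones (illustrative p = 0.2)
for n in (3, 22, 1000):
    print(f"n={n:4d}: std of estimate = {error_rate_std(0.2, n):.3f}")
```

With n = 3 the standard deviation is about 0.23, comparable to the error rate itself, so estimates from different 3-point test sets scatter widely.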

19. Select Analysis -> Clear Euclidean Error Data

1.1. Choose Options -> Covariance Analysis

1.2. Enter Lab4A1 and Lab4A2 and click "Continue".

1.3. Enter 1 for Feature 1 and 2 for Feature 2 and click "Continue".

1.4. Enter the range from 1 to N and click "Continue".

1.5. Select Analysis -> Compute Euclidean Statistics.

1.6. Select Analysis -> Compute Mahalanobis Statistics.

2. After the last results are obtained, select Analysis -> Display Euclidean Error Data, and

Analysis -> Display Mahalanobis Error Data.

- Record the Euclidean results

- Record the Mahalanobis results

- What limits the accuracy of the error-rate estimates when the number of training points is small?

- What limits the accuracy of the error-rate estimates when the number of training points is large?

Select Analysis -> Clear Mahalanobis Error Data.

The purpose of this last experiment with synthetic data is to see how cross-validation produces more reliable error estimates from a small number of test samples. Start as usual:

1. Choose Analysis -> Covariance

2. Choose Options -> Covariance Analysis

3. Enter

4. Enter

5. Repeat the following steps for the following values of N1 and N2: (1, 5), (6, 10), (11, 15), (16, 20), (21, 25):

5.1. Enter the range from N1 to N2 and click "Continue".

5.2. Select Analysis -> Compute Euclidean Statistics; record the results on paper.

5.3. Select Analysis -> Compute Mahalanobis Statistics; record the results on paper.

6. Quit:

6.1 Select Options -> exit

6.2 Select Exit -> Close HCI Lab

6.3 At the MATLAB prompt, type quit.

7. Using your recorded test results:

- Calculate the average Euclidean percentage correct

- Calculate the average Mahalanobis percentage correct

- Which results do you think are more believable, these results or the results that you obtained in Experiment 2.2? Explain.
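The procedure in Steps 5 and 7 is 5-fold cross-validation: each block of 5 points serves as the test set once while the remaining 20 points train the classifier, and the five accuracies are averaged, so every point is tested exactly once. A hypothetical sketch with a nearest-class-mean (Euclidean) classifier on synthetic data (the function names and data are assumptions, not the toolbox code):

```python
import numpy as np

def nearest_mean(train_X, train_y, test_X):
    """Euclidean classifier: assign each test point to the nearest class mean."""
    classes = np.unique(train_y)
    means = np.stack([train_X[train_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_X[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

def five_fold_accuracy(X, y, classify_fn):
    """Average percentage correct over the lab's folds
    (1-5), (6-10), (11-15), (16-20), (21-25)."""
    n = len(y)
    accs = []
    for start in range(0, n, 5):
        test_idx = np.arange(start, start + 5)
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        pred = classify_fn(X[train_idx], y[train_idx], X[test_idx])
        accs.append(np.mean(pred == y[test_idx]) * 100)
    return float(np.mean(accs))

# Synthetic 2-class, 2-feature data: 13 + 12 = 25 points (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (13, 2)),
               rng.normal(3.0, 1.0, (12, 2))])
y = np.array([1] * 13 + [2] * 12)
print(f"average percentage correct: {five_fold_accuracy(X, y, nearest_mean):.1f}")
```

Averaging over folds reduces the variance of the estimate relative to any single small test set, which is the point of this experiment.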

On to Lab # 4, Part c: BioMuse Data
