Data Analysis and Pattern Recognition

The experiments in this part of Lab 4 use the Analysis -> Covariance tool in the Pattern Recognition and Feature Extraction Toolbox. (The manual provides a systematic description of all of its features.) The purpose is to use data to create a classifier and to estimate its error rate. Start as usual:

1. Launch Matlab.

2. Connect to your directory (e.g., at the Matlab prompt, type cd my-dir )

3. Type lab4

4. Choose Analysis -> Covariance

5. Choose Options -> Covariance Analysis

6. A popup menu will let you select two files: File 1 for Class 1 data and File 2 for Class 2 data.

Enter

7. A second popup menu lets you select two of the (possibly several) features.

Enter

8. A third popup menu lets you select the test data.

For both File 1 and File 2, enter the range from 4 to 25 and click "Continue".

(Thus, we are using only 6 training points, 3 from Class 1 and 3 from Class 2.)

Warning: The program is not bulletproofed. It does not check to see that your selections are in the legal ranges. It does not even warn you that you have to have at least 2 training points and at least 1 test point. If you make an invalid selection, expect to get Matlab error messages!

9. After the program computes the covariance matrix, it plots the training data -- Class 1 in red and Class 2 in blue -- giving you the options you saw in the earlier covariance example.

- Display the Euclidean Separator and note the number of misclassified training examples

- Display the Mahalanobis Separator and note the number of misclassified training examples

- Hide the separators
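The two decision rules behind these separators can be sketched in a few lines. This is a minimal NumPy illustration (the means, covariance, and test point are synthetic values chosen for illustration, not the lab's data) of the Euclidean rule, which assigns a point to the nearest class mean, versus the Mahalanobis rule, which measures distance relative to the covariance of the features:

```python
import numpy as np

def euclidean_classify(x, mean1, mean2):
    """Assign x to the class with the nearest mean (Euclidean distance)."""
    d1 = np.linalg.norm(x - mean1)
    d2 = np.linalg.norm(x - mean2)
    return 1 if d1 <= d2 else 2

def mahalanobis_classify(x, mean1, mean2, cov):
    """Assign x using Mahalanobis distance with a shared covariance matrix."""
    inv = np.linalg.inv(cov)
    d1 = (x - mean1) @ inv @ (x - mean1)
    d2 = (x - mean2) @ inv @ (x - mean2)
    return 1 if d1 <= d2 else 2

# Tiny synthetic example (values are illustrative only)
m1 = np.array([0.0, 0.0])
m2 = np.array([3.0, 1.0])
cov = np.array([[2.0, 0.9],
                [0.9, 1.0]])   # elongated, correlated spread
x = np.array([1.6, 0.9])
print(euclidean_classify(x, m1, m2), mahalanobis_classify(x, m1, m2, cov))
# prints "2 1"
```

Note that the two rules can disagree, as they do here: the Euclidean separator treats all directions equally, while the Mahalanobis separator discounts distance along the direction in which the data are spread out.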

10. Note the new menu choices that appear on the Analysis pulldown menu:

Compute Mahalanobis Statistics

Display Mahalanobis Error Data

Clear Mahalanobis Error Data

Compute Euclidean Statistics

Display Euclidean Error Data

Clear Euclidean Error Data

Display Test Points

11. Select Analysis -> Display Test Points, and note that many Class 1 examples are misclassified.

Note: The program does not compute the classification errors until you ask it to. However, once asked, it keeps a history of the training error rate and the test error rate, and will display them on request. Use the "Clear" command if you want to get rid of the history.
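The history behavior described in this note can be sketched as follows; this is a hypothetical illustration of the bookkeeping, not the toolbox code, and the percentages are made-up values:

```python
class ErrorHistory:
    """Sketch of the tool's history: statistics are stored only when
    computed, kept until explicitly cleared, displayed on request."""

    def __init__(self):
        self.records = []          # (n_train, train_pct, test_pct) tuples

    def compute(self, n_train, train_pct, test_pct):
        self.records.append((n_train, train_pct, test_pct))

    def display(self):
        for n, tr, te in self.records:
            print(f"n_train={n}: train {tr:.1f}%  test {te:.1f}%")

    def clear(self):
        self.records = []

hist = ErrorHistory()
hist.compute(6, 100.0, 63.6)       # illustrative numbers only
hist.compute(44, 86.4, 83.3)
hist.display()
hist.clear()                       # history is gone until recomputed
```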

12. Select Analysis -> Hide Test Points.

13. Select Analysis -> Compute Euclidean Statistics; note that the percentage correct is much higher for the training data than for the test data.

- Record the number of training points, the number of test points, and the percentage correct on the training and the test data

14. Select Analysis -> Display Euclidean Error Data. The graph shows the percentage correct plotted versus the number of training points. Close the graph window.

15. Repeat Steps 5 to 8, except when selecting the range of points for test in Step 8, use 23 to 25.

16. Select Analysis -> Compute Euclidean Statistics; note that the percentage correct is about the same for the training data and the test data; in fact, in this case we are doing better on the test set than the training set!

- Record the number of training points, the number of test points, and the percentage correct on the training and the test data

18. Repeat Steps 15 and 16, except when selecting the range of points for test, use 1 to 3.

- Record the number of training points, the number of test points, and the percentage correct on the training and the test data.

- Explain why there is so much variability in the test results.
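One way to see where the variability comes from: with only 3 test points, the estimated percentage correct can only take the values 0, 33.3, 66.7, or 100 percent, and under a simple binomial model the standard deviation of the estimate shrinks only as the square root of the number of test points. A small sketch (the true error probability p = 0.2 is illustrative, not a lab result):

```python
import math

def error_rate_std(p, n):
    """Standard deviation of an error-rate estimate from n independent
    test points, when the true error probability is p (binomial model)."""
    return math.sqrt(p * (1 - p) / n)

# Compare a 3-point test set with larger ones (illustrative p = 0.2)
for n in (3, 22, 1000):
    print(f"n={n:4d}: std of estimate = {error_rate_std(0.2, n):.3f}")
```

With n = 3 the standard deviation is about 0.23, comparable to the error rate itself, so estimates from different 3-point test sets scatter widely.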

19. Select Analysis -> Clear Euclidean Error Data

1.1. Choose Options -> Covariance Analysis

1.2. Enter Lab4A1 and Lab4A2 and click "Continue".

1.3. Enter 1 for Feature 1 and 2 for Feature 2 and click "Continue".

1.4. Enter the range from 1 to N and click "Continue".

1.5. Select Analysis -> Compute Euclidean Statistics.

1.6. Select Analysis -> Compute Mahalanobis Statistics.

2. After the last results are obtained, select Analysis -> Display Euclidean Error Data, and

Analysis -> Display Mahalanobis Error Data.

- Record the Euclidean results

- Record the Mahalanobis results

- What limits the accuracy of the error-rate estimates when the number of training points is small?

- What limits the accuracy of the error-rate estimates when the number of training points is large?

Select Analysis -> Clear Mahalanobis Error Data.

The purpose of this last experiment with synthetic data is to see how cross-validation produces more reliable error estimates from a small number of test samples. Start as usual:

1. Choose Analysis -> Covariance

2. Choose Options -> Covariance Analysis

3. Enter

4. Enter

5. Repeat the following steps for the following values of N1 and N2: (1, 5), (6, 10), (11, 15), (16, 20), (21, 25):

5.1. Enter the range from N1 to N2 and click "Continue".

5.2. Select Analysis -> Compute Euclidean Statistics; record the results on paper.

5.3. Select Analysis -> Compute Mahalanobis Statistics; record the results on paper.

6. Quit:

6.1 Select Options -> exit

6.2 Select Exit -> Close HCI Lab

6.3 At the MATLAB prompt, type quit.

7. Using your recorded test results:

- Calculate the average Euclidean percentage correct

- Calculate the average Mahalanobis percentage correct

- Which results do you think are more believable, these results or the results that you obtained in Experiment 2.2? Explain.
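The procedure in Steps 5 and 7 is 5-fold cross-validation: each block of 5 points serves as the test set once while the remaining 20 points train the classifier, and the five accuracies are averaged, so every point is tested exactly once. A hypothetical sketch with a nearest-class-mean (Euclidean) classifier on synthetic data (the function names and data are assumptions, not the toolbox code):

```python
import numpy as np

def nearest_mean(train_X, train_y, test_X):
    """Euclidean classifier: assign each test point to the nearest class mean."""
    classes = np.unique(train_y)
    means = np.stack([train_X[train_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_X[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

def five_fold_accuracy(X, y, classify_fn):
    """Average percentage correct over the lab's folds
    (1-5), (6-10), (11-15), (16-20), (21-25)."""
    n = len(y)
    accs = []
    for start in range(0, n, 5):
        test_idx = np.arange(start, start + 5)
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        pred = classify_fn(X[train_idx], y[train_idx], X[test_idx])
        accs.append(np.mean(pred == y[test_idx]) * 100)
    return float(np.mean(accs))

# Synthetic 2-class, 2-feature data: 13 + 12 = 25 points (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (13, 2)),
               rng.normal(3.0, 1.0, (12, 2))])
y = np.array([1] * 13 + [2] * 12)
print(f"average percentage correct: {five_fold_accuracy(X, y, nearest_mean):.1f}")
```

Averaging over folds reduces the variance of the estimate relative to any single small test set, which is the point of this experiment.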

On to Lab # 4, Part c: BioMuse Data
