Data Analysis and Pattern Recognition

The experiments in this part of Lab 4 use the Analysis -> Covariance tool in the Pattern Recognition and Feature Extraction Toolbox. (The manual provides a systematic description of all of its features.) The purpose is to use data to create a classifier and to estimate its error rate. Start as usual:

1. Launch Matlab.

2. Connect to your directory (e.g., at the Matlab prompt, type cd my-dir )

3. Type lab4

4. Choose Analysis -> Covariance

5. Choose Options -> Covariance Analysis

6. A popup menu will let you select two files: File 1 for Class 1 data and File 2 for Class 2 data.

Enter

7. A second popup menu lets you select two of the (possibly several) features.

Enter

8. A third popup menu lets you select the test data.

For both File 1 and File 2, enter the range from 4 to 25 and click "Continue".

(Thus, we are using only 6 training points, 3 from Class 1 and 3 from Class 2.)

Warning: The program is not bulletproofed. It does not check to see that your selections are in the legal ranges. It does not even warn you that you have to have at least 2 training points and at least 1 test point. If you make an invalid selection, expect to get Matlab error messages!

9. After the program computes the covariance matrix, it plots the training data -- Class 1 in red and Class 2 in blue -- giving you the options you saw in the earlier covariance example.

- Display the Euclidean Separator and note the number of misclassified training examples

- Display the Mahalanobis Separator and note the number of misclassified training examples

- Hide the separators
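The two decision rules behind these separators can be sketched in a few lines. This is a minimal NumPy illustration (the means, covariance, and test point are synthetic values chosen for illustration, not the lab's data) of the Euclidean rule, which assigns a point to the nearest class mean, versus the Mahalanobis rule, which measures distance relative to the covariance of the features:

```python
import numpy as np

def euclidean_classify(x, mean1, mean2):
    """Assign x to the class with the nearest mean (Euclidean distance)."""
    d1 = np.linalg.norm(x - mean1)
    d2 = np.linalg.norm(x - mean2)
    return 1 if d1 <= d2 else 2

def mahalanobis_classify(x, mean1, mean2, cov):
    """Assign x using Mahalanobis distance with a shared covariance matrix."""
    inv = np.linalg.inv(cov)
    d1 = (x - mean1) @ inv @ (x - mean1)
    d2 = (x - mean2) @ inv @ (x - mean2)
    return 1 if d1 <= d2 else 2

# Tiny synthetic example (values are illustrative only)
m1 = np.array([0.0, 0.0])
m2 = np.array([3.0, 1.0])
cov = np.array([[2.0, 0.9],
                [0.9, 1.0]])   # elongated, correlated spread
x = np.array([1.6, 0.9])
print(euclidean_classify(x, m1, m2), mahalanobis_classify(x, m1, m2, cov))
# prints "2 1"
```

Note that the two rules can disagree, as they do here: the Euclidean separator treats all directions equally, while the Mahalanobis separator discounts distance along the direction in which the data are spread out.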

10. Note the new menu choices that appear on the Analysis pulldown menu:

Compute Mahalanobis Statistics

Display Mahalanobis Error Data

Clear Mahalanobis Error Data

Compute Euclidean Statistics

Display Euclidean Error Data

Clear Euclidean Error Data

Display Test Points

11. Select Analysis -> Display Test Points, and note that many Class 1 examples are misclassified.

Note: The program does not compute the classification errors until you ask it to. However, once asked, it keeps a history of the training error rate and the test error rate, and will display them on request. Use the "Clear" command if you want to get rid of the history.
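The history behavior described in this note can be sketched as follows; this is a hypothetical illustration of the bookkeeping, not the toolbox code, and the percentages are made-up values:

```python
class ErrorHistory:
    """Sketch of the tool's history: statistics are stored only when
    computed, kept until explicitly cleared, displayed on request."""

    def __init__(self):
        self.records = []          # (n_train, train_pct, test_pct) tuples

    def compute(self, n_train, train_pct, test_pct):
        self.records.append((n_train, train_pct, test_pct))

    def display(self):
        for n, tr, te in self.records:
            print(f"n_train={n}: train {tr:.1f}%  test {te:.1f}%")

    def clear(self):
        self.records = []

hist = ErrorHistory()
hist.compute(6, 100.0, 63.6)       # illustrative numbers only
hist.compute(44, 86.4, 83.3)
hist.display()
hist.clear()                       # history is gone until recomputed
```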

12. Select Analysis -> Hide Test Points.

13. Select Analysis -> Compute Euclidean Statistics; note that the percentage correct is much higher for the training data than for the test data.

- Record the number of training points, the number of test points, and the percentage correct on the training and the test data

14. Select Analysis -> Display Euclidean Error Data. The graph shows the percentage correct plotted versus the number of training points. Close the graph window.

15. Repeat Steps 5 to 8, except when selecting the range of points for test in Step 8, use 23 to 25.

16. Select Analysis -> Compute Euclidean Statistics; note that the percentage correct is about the same for the training data and the test data; in fact, in this case we are doing better on the test set than the training set!

- Record the number of training points, the number of test points, and the percentage correct on the training and the test data

18. Repeat Steps 15 and 16, except when selecting the range of points for test, use 1 to 3.

- Record the number of training points, the number of test points, and the percentage correct on the training and the test data.

- Explain why there is so much variability in the test results.
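One way to see where the variability comes from: with only 3 test points, the estimated percentage correct can only take the values 0, 33.3, 66.7, or 100 percent, and under a simple binomial model the standard deviation of the estimate shrinks only as the square root of the number of test points. A small sketch (the true error probability p = 0.2 is illustrative, not a lab result):

```python
import math

def error_rate_std(p, n):
    """Standard deviation of an error-rate estimate from n independent
    test points, when the true error probability is p (binomial model)."""
    return math.sqrt(p * (1 - p) / n)

# Compare a 3-point test set with larger ones (illustrative p = 0.2)
for n in (3, 22, 1000):
    print(f"n={n:4d}: std of estimate = {error_rate_std(0.2, n):.3f}")
```

With n = 3 the standard deviation is about 0.23, comparable to the error rate itself, so estimates from different 3-point test sets scatter widely.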

19. Select Analysis -> Clear Euclidean Error Data

1.1. Choose Options -> Covariance Analysis

1.2. Enter Lab4A1 and Lab4A2 and click "Continue".

1.3. Enter 1 for Feature 1 and 2 for Feature 2 and click "Continue".

1.4. Enter the range from 1 to N and click "Continue".

1.5. Select Analysis -> Compute Euclidean Statistics.

1.6. Select Analysis -> Compute Mahalanobis Statistics.

2. After the last results are obtained, select Analysis -> Display Euclidean Error Data, and

Analysis -> Display Mahalanobis Error Data.

- Record the Euclidean results

- Record the Mahalanobis results

- What limits the accuracy of the error-rate estimates when the number of training points is small?

- What limits the accuracy of the error-rate estimates when the number of training points is large?

Select Analysis -> Clear Mahalanobis Error Data.

The purpose of this last experiment with synthetic data is to see how cross-validation produces more reliable error estimates from a small number of test samples. Start as usual:

1. Choose Analysis -> Covariance

2. Choose Options -> Covariance Analysis

3. Enter

4. Enter

5. Repeat the following steps for the following values of N1 and N2: (1, 5), (6, 10), (11, 15), (16, 20), (21, 25):

5.1. Enter the range from N1 to N2 and click "Continue".

5.2. Select Analysis -> Compute Euclidean Statistics; record the results on paper.

5.3. Select Analysis -> Compute Mahalanobis Statistics; record the results on paper.

6. Quit:

6.1 Select Options -> exit

6.2 Select Exit -> Close HCI Lab

6.3 At the MATLAB prompt, type quit.

7. Using your recorded test results:

- Calculate the average Euclidean percentage correct

- Calculate the average Mahalanobis percentage correct

- Which results do you think are more believable, these results or the results that you obtained in Experiment 2.2? Explain.
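The procedure in Steps 5 and 7 is 5-fold cross-validation: each block of 5 points serves as the test set once while the remaining 20 points train the classifier, and the five accuracies are averaged, so every point is tested exactly once. A hypothetical sketch with a nearest-class-mean (Euclidean) classifier on synthetic data (the function names and data are assumptions, not the toolbox code):

```python
import numpy as np

def nearest_mean(train_X, train_y, test_X):
    """Euclidean classifier: assign each test point to the nearest class mean."""
    classes = np.unique(train_y)
    means = np.stack([train_X[train_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_X[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

def five_fold_accuracy(X, y, classify_fn):
    """Average percentage correct over the lab's folds
    (1-5), (6-10), (11-15), (16-20), (21-25)."""
    n = len(y)
    accs = []
    for start in range(0, n, 5):
        test_idx = np.arange(start, start + 5)
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        pred = classify_fn(X[train_idx], y[train_idx], X[test_idx])
        accs.append(np.mean(pred == y[test_idx]) * 100)
    return float(np.mean(accs))

# Synthetic 2-class, 2-feature data: 13 + 12 = 25 points (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (13, 2)),
               rng.normal(3.0, 1.0, (12, 2))])
y = np.array([1] * 13 + [2] * 12)
print(f"average percentage correct: {five_fold_accuracy(X, y, nearest_mean):.1f}")
```

Averaging over folds reduces the variance of the estimate relative to any single small test set, which is the point of this experiment.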

On to Lab # 4, Part c: BioMuse Data
