Wednesday, May 16, 2012

Classification Fail

Classification Results


The only somewhat promising classification result I obtained was using color histograms to classify by genre. The error rate is around .5 (at least better than chance!), and the area under the ROC curve was highest for DietFitness (~.8), second highest for History (~.75), and lowest for Romance and SciFiFantasy (~.65).



Using LDA-reduced SIFT histograms as input to AdaBoost is performing at around chance. The error rate (number of misclassified images / total number of images) never drops below .7 on testing data, where chance would be .75 for 4 classes, for classification by both rating and genre. The area under the ROC curve also hovers around .5 for all classes.

I took a look at the LDA-projected data and it's clear that there is not good class separation for SIFT histograms (for class definition by either genre or rating).

Figure 1: First two dimensions of LDA-reduced SIFT descriptors (color-coded by rating)


Figure 2: First two dimensions of LDA-reduced SIFT descriptors (color-coded by genre)

This could mean that a) SIFT features are not doing a good job of capturing class differences, or b) there are no clear class differences based on the cover image.

In support of (b) (that there are not good class differences), I also tried AdaBoost with the color histograms as feature input and class labels by rating, with very similar results (~.7 error rate on testing data, area under the ROC curve ~.5 for all classes).
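For reference, plots like Figures 1 and 2 only take a couple of lines in Matlab. A minimal sketch, assuming the projected descriptors sit in a matrix Xproj (one row per keypoint) and classLabels holds the per-keypoint genre or rating label (both names are just placeholders):

% Scatter of the first two LDA dimensions, color-coded by class
% (Xproj: numKeypoints x (k-1), classLabels: one label per keypoint)
gscatter(Xproj(:,1), Xproj(:,2), classLabels);   % requires the Statistics Toolbox
xlabel('LDA dimension 1'); ylabel('LDA dimension 2');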


To Do


- Get actual ROC curves/confusion matrices for classification (see the sketch after this list)
- Take a look at which colors are good between-genre specifiers
- Try classification by year/number of reviews
- Suggestions?
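For the first item, here is a minimal sketch of what I have in mind, assuming the Statistics Toolbox is available and that trueLabels, predLabels, and scores are placeholders for the actual test-set outputs (true class names, predicted class names, and per-class classifier scores):

% Confusion matrix over the four classes
C = confusionmat(trueLabels, predLabels);

% One-vs-rest ROC curve and AUC for a single class
% (assumes column 1 of scores corresponds to DietFitness)
[fpr, tpr, ~, auc] = perfcurve(trueLabels, scores(:,1), 'DietFitness');
plot(fpr, tpr); xlabel('False positive rate'); ylabel('True positive rate');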

Matlab Tip


Check it out:


keyboard

A life-changing debugging tool.
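For anyone who hasn't used it: drop the command anywhere in a script or function, and execution pauses at that point with full access to the local workspace. For example (someExpensiveComputation is just a made-up placeholder):

err = someExpensiveComputation(features, labels);   % placeholder for whatever you're debugging
keyboard     % pauses here and opens a K>> prompt; inspect or modify variables,
             % then type dbcont to resume or dbquit to stop debugging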


Preprocessing


SIFT descriptors were extracted using VLFeat and then reduced to k-1 dimensions (where k is the number of classes) using Linear Discriminant Analysis. The LDA output was then histogrammed to make the final features for input to AdaBoost.
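Roughly, the per-image pipeline looks like the sketch below. It assumes VLFeat is on the Matlab path, W is a 128 x (k-1) LDA projection matrix already estimated from the training descriptors, and edges{dim} holds the chosen bin edges for each projected dimension (W and edges are placeholder names):

im = single(rgb2gray(imread('cover.jpg')));      % vl_sift expects a single-precision grayscale image
[~, d] = vl_sift(im);                            % d is 128 x numKeypoints
proj = double(d)' * W;                           % numKeypoints x (k-1) LDA-projected descriptors
feat = [];
for dim = 1:size(proj, 2)
    feat = [feat, histc(proj(:, dim), edges{dim})'];   % concatenate the per-dimension histograms
end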


Labels


Rating


Based on the distribution of ratings (ranging from 1 to 5) for my set of book cover images (see Figure 3), I chose 4 classes:

1) "Bad" rating <= 3.5
2) "Okay" 3.5 < rating <= 4
3) "Good" 4 < rating <-= 4.5
4) "Great" rating > 4.5



Genre

I'm still working with 4 different genres:

1) Romance
2) History
3) DietFitness
4) SciFiFantasy


Histogram Construction


Since I don't want to assume any parametric form for the distribution of the projected SIFT data, I chose a bin width W determined by the interquartile range (IQR) of the histogrammed data and the number of samples N (the average number of SIFT descriptors per image is ~700), i.e. the Freedman-Diaconis rule:


W = 2 · IQR · N^(-1/3)
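As a sketch, for one dimension of the projected data (x is a placeholder vector; iqr comes from the Statistics Toolbox):

N = numel(x);
W = 2 * iqr(x) * N^(-1/3);          % Freedman-Diaconis bin width
edges = min(x):W:(max(x) + W);      % bin edges covering the range of the data
counts = histc(x, edges);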






Monday, May 7, 2012

Multiboost in C++; ARFF; Weka in Java

Last week I left off having computed SIFT histograms but struggling with the slowness of the Matlab implementation of AdaBoost that I had been using.

I found an entire collection of boosting C++ code [1], which includes an AdaBoost implementation for multiple classes. One of its input formats is ARFF (Attribute-Relation File Format), developed by the Computer Science department at the University of Waikato (in New Zealand) for use in their open-source Java-based Weka machine learning software. Conveniently, Matt Dunham has already shared a Matlab-Weka interface, which I was able to use to convert my .mat datasets to ARFF format and run them through the MultiBoost framework to get weak learner results (training was very fast).
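For reference, an ARFF file is just plain text: a header declaring the relation and its attributes, followed by a @data section with one comma-separated instance per line. A toy example (the attribute names and values here are made up, not my actual features):

@relation book_covers

@attribute hist_bin_1 numeric
@attribute hist_bin_2 numeric
@attribute genre {Romance,History,DietFitness,SciFiFantasy}

@data
0.12,0.03,Romance
0.00,0.41,History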

For this week I still need to 1) separate the covers into training and testing sets and get cross-validation results for both genre and rating, and 2) understand what's going on with the multi-class generalization of AdaBoost.
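For (1), one way to set up the splits is a stratified k-fold partition; a minimal sketch assuming the Statistics Toolbox and placeholder variables labels (one class per cover) and features (one histogram row per cover):

cv = cvpartition(labels, 'KFold', 5);            % stratified 5-fold partition over the covers
for k = 1:cv.NumTestSets
    trainIdx = training(cv, k);                  % logical index of training covers for fold k
    testIdx  = test(cv, k);                      % logical index of held-out covers
    % write features(trainIdx,:) / features(testIdx,:) to ARFF and run MultiBoost on each fold
end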

If you have time you should check out the Weka project. I really like their mission statement and open-source spirit.

[1] D. Benbouzid, R. Busa-Fekete, N. Casagrande, F.-D. Collin, and B. Kégl. MultiBoost: a multi-purpose boosting package. Journal of Machine Learning Research, 13:549–553, 2012.

Wednesday, May 2, 2012

SIFT histograms, EMD, and classification

I found a paper, "Home Interior Classification Using SIFT Keypoint Histograms," in which the authors used SIFT histograms to classify indoor scenes. This was great because they had the same problem that I had, and they implemented and compared a few different solutions.

For clarification, right now I'm playing around with a toy dataset of 40 images (10 from each of 4 genres).

The problem is that for 40 images I get about 26,000 SIFT keypoints, each with a 128-dimensional descriptor. One approach would be to classify each keypoint individually and then assign each image the class with the most matching keypoints, but that is very computationally expensive.

In the aforementioned Ayers and Boutell paper, the best solution they found to the above problem was to reduce the dimensionality of the SIFT descriptors from 128 to c-1 (where c is the number of classes) using Fisher's Linear Discriminant, create histograms from the lower-dimensional descriptors, and then classify the images using AdaBoost.

I thought I'd start there, so I used Linear Discriminant Analysis (LDA), which maximizes the between-class variance and minimizes the within-class variance of the projection onto a (c-1)-dimensional subspace, to reduce the dimensionality of the keypoints, and then made histograms with 3 bins for each dimension and ran AdaBoost on the results. It gets reasonable output right now; I just need to make sure I'm using LDA correctly, do cross-validation, and play around with histogram binning.
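For the record, the LDA step amounts to building within-class and between-class scatter matrices and taking the top c-1 generalized eigenvectors. A rough sketch of what I mean (X is a placeholder numKeypoints x 128 descriptor matrix, y a placeholder vector of class indices 1..c):

c = max(y);
mu = mean(X, 1);
Sw = zeros(size(X, 2));                                    % within-class scatter
Sb = zeros(size(X, 2));                                    % between-class scatter
for i = 1:c
    Xi = X(y == i, :);
    Xc = bsxfun(@minus, Xi, mean(Xi, 1));                  % class-centered descriptors
    Sw = Sw + Xc' * Xc;
    Sb = Sb + size(Xi, 1) * (mean(Xi, 1) - mu)' * (mean(Xi, 1) - mu);
end
[V, D] = eig(Sb, Sw);                                      % generalized eigenvectors of (Sb, Sw)
[~, order] = sort(diag(D), 'descend');
W = V(:, order(1:c-1));                                    % 128 x (c-1) projection matrix
Xproj = X * W;                                             % descriptors projected to c-1 dimensions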

A problem with AdaBoost that I've run into is that it gets very, very slow with ~250 features as input. Does anybody know the time complexity of AdaBoost?

In this case, AdaBoost eliminates the need to compare histograms directly. However, if I were to do so (which I probably will at some point), Serge pointed me in the direction of Earth Mover's Distance (EMD) as a distance metric between histograms. This is better than the histogram intersection method I previously used for color histograms, because it takes bin distance/local bin similarity into account.

Something I could try with EMD is to use it as the distance in k-means to get cluster centers, and then use a nearest-mean classifier.
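One handy special case: for 1-D histograms with equal total mass (e.g. each LDA dimension's histogram after normalization), EMD reduces to the L1 distance between the cumulative histograms, so no general-purpose EMD solver is needed. A sketch, with h1 and h2 as placeholder histograms over the same equal-width bins:

h1 = h1 / sum(h1);                               % normalize to equal total mass
h2 = h2 / sum(h2);
emd1d = sum(abs(cumsum(h1) - cumsum(h2)));       % 1-D EMD, in units of bin widths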

Histograms, histograms everywhere!

Preliminary classification results will definitely be out next week.