Friday, March 15, 2013

Final Write Up

When this class came to an end last year, I completed a final write up summarizing the methods and results of this project. Here is the link to a copy.

Monday, June 4, 2012

ROC curves + ending objectives

For this week I worked on loading classifier definitions learned with MultiBoost into MATLAB and writing code to generate ROC curves. They show, once more, that color histograms perform much better for genre classification than SIFT features.

SIFT to classify genre


AUC
Diet/Fitness: .6642
History: .6237
Romance: .5533
SciFi/Fantasy: .5821

Color to classify genre

The following are two ROC curves for two different cross-validation folds:


AUC
Diet/Fitness: .7892
History: .7583
Romance: .7733
SciFi/Fantasy: .7433


AUC
Diet/Fitness: .9258
History: .8067
Romance: .6750
SciFi/Fantasy: .6950

The classification performance clearly differs across train/test cross-validation folds. Should I average them together to get final performance measures?
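What I'll probably do is macro-average the per-class AUCs across folds and report the mean with its spread. A quick Python sketch of that bookkeeping, using the AUC numbers above:

```python
from statistics import mean, stdev

# Per-class AUC for the two cross-validation folds reported above.
folds = [
    {"DietFitness": 0.7892, "History": 0.7583, "Romance": 0.7733, "SciFiFantasy": 0.7433},
    {"DietFitness": 0.9258, "History": 0.8067, "Romance": 0.6750, "SciFiFantasy": 0.6950},
]

def average_auc(folds):
    """Return (mean, standard deviation) of AUC per class across folds."""
    classes = folds[0].keys()
    return {c: (mean(f[c] for f in folds), stdev(f[c] for f in folds)) for c in classes}

for cls, (m, s) in average_auc(folds).items():
    print(f"{cls}: mean AUC = {m:.4f} (sd {s:.4f})")
```

Reporting the standard deviation alongside the mean also makes it obvious how unstable the folds are (e.g. DietFitness swings from .79 to .93).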

I've also been working on removing between-class redundant SIFT features prior to LDA projection to see if this improves classification, but have run into the MATLAB 'for' loop bottleneck. My goal is to write a MATLAB-interfaceable C (MEX) file to accomplish this for inclusion in the report.

I've also looked for relationships between SIFT features and both publication year and number of user reviews, with no positive results.

Wednesday, May 16, 2012

Classification Fail

Classification Results


The only somewhat promising classification result I attained was using color histograms to classify by genre. The error rate is around .5 (at least below chance!), and the area under the ROC curve was highest for DietFitness (~.8), second highest for History (~.75), and lowest for Romance and SciFiFantasy (~.65).



Using LDA-reduced SIFT histograms for AdaBoost classification performs at around chance. The error rate (number of misclassified images / total number of images) never drops below .7 on testing data, where chance would be .75 for 4 classes, for classification by both rating and genre. Area under the ROC curve also hovers around .5 for all classes.
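For reference, here's how I'm computing the error rate and the chance baseline, as a minimal Python sketch (the labels below are made up just to exercise the function):

```python
def error_rate(predicted, actual):
    """Fraction of misclassified images: misclassified / total."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

# With 4 equally likely classes, guessing a class at random is wrong
# 3 times out of 4, so chance-level error is 0.75.
chance = 1 - 1 / 4

# Hypothetical labels, just to exercise the function.
actual    = ["Romance", "History", "DietFitness", "SciFiFantasy"]
predicted = ["Romance", "History", "Romance",     "Romance"]
print(error_rate(predicted, actual))  # 2 of 4 wrong -> 0.5
```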

I took a look at the LDA-projected data and it's clear that there is not good class separation for SIFT histograms (for class definition by either genre or rating).

Figure 1: First two dimensions of LDA-reduced SIFT descriptors (color-coded by rating)


Figure 2: First two dimensions of LDA-reduced SIFT descriptors (color-coded by genre)

This could mean that (a) SIFT features are not doing a good job of capturing class differences or (b) there are no clear class differences in the cover images.

In support of (b) (that there are no clear class differences), I also tried AdaBoost with the color histograms as feature input and class labels by rating, with very similar results: ~.7 error rate on testing data and area under the ROC curve of ~.5 for all classes.


To Do


- Get actual ROC curves/ confusion matrices for classification
- Take a look at which colors are good between-genre specifiers
- Try classification by year/number of reviews
- Suggestions?

Matlab Tip


Check it out:


keyboard

A life-changing debugging tool.


Preprocessing


SIFT descriptors were obtained using VLFeat and then reduced to k-1 dimensions (where k = number of classes) using Linear Discriminant Analysis. The LDA output was then histogrammed to make the final features for input to AdaBoost.


Labels


Rating


Based on the distribution of labels for my set of book cover images, for ratings ranging from 1 to 5 (see Figure 3), I chose 4 classes:

1) "Bad": rating <= 3.5
2) "Okay": 3.5 < rating <= 4
3) "Good": 4 < rating <= 4.5
4) "Great": rating > 4.5
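As a sketch, the binning above maps a numeric rating to a class like this (Python; the function name is mine):

```python
def rating_class(rating):
    """Map a numeric rating to one of the 4 classes defined above."""
    if rating <= 3.5:
        return "Bad"
    elif rating <= 4:
        return "Okay"
    elif rating <= 4.5:
        return "Good"
    else:
        return "Great"

print(rating_class(3.2), rating_class(3.8), rating_class(4.3), rating_class(4.9))
```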



Genre

I'm still working with 4 different genres:

1) Romance
2) History
3) DietFitness
4) SciFiFantasy


Histogram Construction


Because the distribution of the projected SIFT data isn't known in advance, I chose a bin width W determined by the interquartile range (IQR) of the histogrammed data and the number of samples N (average number of SIFT descriptors per image ~= 700):


W = 2 · IQR · N^(-1/3)
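This is the Freedman-Diaconis rule, W = 2 · IQR · N^(-1/3). A quick pure-Python sketch of it (quartiles estimated by linear interpolation, which may differ slightly from MATLAB's quantile method):

```python
def fd_bin_width(samples):
    """Freedman-Diaconis bin width: W = 2 * IQR * N^(-1/3)."""
    xs = sorted(samples)
    n = len(xs)

    def quantile(q):
        # Linear interpolation between order statistics.
        pos = q * (n - 1)
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, n - 1)
        return xs[lo] * (1 - frac) + xs[hi] * frac

    iqr = quantile(0.75) - quantile(0.25)
    return 2 * iqr * n ** (-1 / 3)

print(fd_bin_width(list(range(101))))  # IQR = 50, N = 101
```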






Monday, May 7, 2012

Multiboost in C++; ARFF; Weka in Java

Last week I left off having obtained SIFT histograms but having trouble with the slowness of the MATLAB implementation of AdaBoost that I had been using.

I found an entire collection of boosting C++ code [1], which includes an AdaBoost implementation for multiple classes. One of the input formats is ARFF (Attribute-Relation File Format), developed by the Computer Science department at the University of Waikato (in New Zealand) for use in their open-source Java-based Weka software for machine learning. Conveniently, Matt Dunham has already shared a MATLAB-Weka interface, which I was able to use to convert my .mat datasets to ARFF format, run through the MultiBoost framework, and get weak learner results (training was so fast).
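ARFF files are just plain text, so the conversion is simple in any language. A minimal Python sketch of writing one from feature vectors (the relation and attribute names are made up):

```python
def to_arff(relation, features, labels, classes):
    """Serialize numeric feature vectors plus class labels as a minimal ARFF string."""
    lines = [f"@RELATION {relation}", ""]
    for i in range(len(features[0])):
        lines.append(f"@ATTRIBUTE feat{i} NUMERIC")
    lines.append(f"@ATTRIBUTE class {{{','.join(classes)}}}")
    lines.extend(["", "@DATA"])
    for row, label in zip(features, labels):
        lines.append(",".join(str(v) for v in row) + f",{label}")
    return "\n".join(lines)

arff = to_arff("covers", [[0.1, 0.2], [0.3, 0.4]],
               ["Romance", "History"], ["Romance", "History"])
print(arff)
```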

For this week I still need to 1) separate the covers into training/testing sets and run cross-validation for both genre and ratings, and 2) understand what's going on with the multi-class generalization of AdaBoost.
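The train/test separation in (1) amounts to a k-fold split of the image indices. A quick sketch of the idea in Python (contiguous folds with no shuffling, for brevity; in practice I'd shuffle first):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds; yields (train, test) pairs."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
print(folds[0])  # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
```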

If you have time you should check out the Weka project. I really like their mission statement and open-source spirit.

[1] D. Benbouzid, R. Busa-Fekete, N. Casagrande, F.-D. Collin, and B. Kégl
MultiBoost: a multi-purpose boosting package
Journal of Machine Learning Research, 13:549–553, 2012. 

Wednesday, May 2, 2012

SIFT histograms, EMD, and classification

I found a paper, "Home Interior Classification Using SIFT Keypoint Histograms," in which the authors used SIFT histograms to classify indoor scenes. This was great because they had the same problem that I had, and they implemented and compared a few different solutions.

For clarification, right now I'm playing around with a toy dataset of 40 images (10 from each of 4 genres).

The problem is that for 40 images, I get about 26,000 SIFT keypoints, each with 128 descriptor dimensions. One way to compare images would be to classify each keypoint individually and then assign each image the class with the most matching keypoints, but that's very computationally expensive.

In the aforementioned Ayers and Boutell paper, the best solution the authors found to the above problem was reducing the dimensionality of the SIFT descriptors from 128 to c-1 (where c is the number of classes) using Fisher's Linear Discriminant, creating histograms from the lower-dimensional SIFT descriptors, and then classifying the images using AdaBoost.

I thought I'd start there, so I used Linear Discriminant Analysis (LDA), which maximizes the between-class variance and minimizes the within-class variance of the projection onto a (c-1)-dimensional space, to reduce the dimensionality of the keypoints, then made histograms with 3 bins for each dimension and ran AdaBoost on the results. It gets reasonable output right now; I just need to make sure I'm using LDA correctly, do cross-validation, and play around with histogram binning.

A problem I've run into with AdaBoost is that it gets very, very slow with ~250 features as input. Does anybody know the time complexity of AdaBoost?

In this case, AdaBoost eliminates the need to compare histograms directly. However, if I were to do so (which I probably will at some point), Serge pointed me in the direction of Earth Mover's Distance (EMD) as a distance metric between histograms. This is better than the histogram intersection method I previously used for color histograms, because it takes bin distance/local bin similarity into account.
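For 1-D histograms of equal total mass, EMD has a neat closed form: it's the L1 distance between the cumulative histograms, i.e. you sweep along the bins carrying surplus "dirt" to the next bin. A small Python sketch of that special case (function name is mine; the general 2-D/cross-bin EMD needs a transportation solver):

```python
def emd_1d(h1, h2):
    """Earth Mover's Distance between two 1-D histograms of equal total mass,
    with unit cost per bin of movement."""
    assert abs(sum(h1) - sum(h2)) < 1e-9, "histograms must have equal mass"
    work, carry = 0.0, 0.0
    for a, b in zip(h1, h2):
        carry += a - b          # dirt carried over to the next bin
        work += abs(carry)      # moving it one bin costs |carry|
    return work

print(emd_1d([1, 0, 0], [0, 0, 1]))  # all mass moves 2 bins -> 2.0
print(emd_1d([1, 0, 0], [0, 1, 0]))  # all mass moves 1 bin  -> 1.0
```

Note how the second distance is smaller: unlike histogram intersection, mass that only shifts to a neighboring bin is penalized lightly, which is exactly the "local bin similarity" property mentioned above.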

Something I could try with EMD is using it as the distance in k-means to get cluster centers, and then using a nearest-mean classifier.

Histograms, histograms everywhere!

Preliminary classification results will definitely be out next week.

Wednesday, April 25, 2012

SIFT -> Codebook -> Features -> Classification

Last time my update was about moving forward with feature extraction and logistic regression classification, having completed installation of VLFeat.

vl_sift outputs X, Y frame center position coordinates, a frame scale, and a frame orientation in radians. When mapped onto an image, SIFT features look something like:

The yellow circles represent the output values mentioned above, and the green boxes represent descriptor vectors calculated using local gradient information.

These raw position/scale/orientation values won't immediately work for logistic regression purposes because each value on its own doesn't mean anything; I need a one-value representation of each SIFT frame. One way of doing this is generating a "codebook" of similar features across the training data and using the codebook to generate histograms for classification. This concept is visualized below (from Kinnunen et al. 2009):


A commonly used codebook creation method is to run k-means and then use the cluster centers as codevectors (for SIFT we could run k-means on the descriptor vectors). Having defined a number of codevectors, one can then define similarity bins centered around each codevector.
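The histogram step is then just nearest-codevector assignment plus counting. A Python sketch with tiny 2-D toy descriptors (in practice this would run on the 128-D SIFT descriptors, and the codebook would come from k-means):

```python
def nearest(vec, codebook):
    """Index of the codevector closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])))

def codebook_histogram(descriptors, codebook):
    """Bag-of-words histogram: one bin per codevector."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[nearest(d, codebook)] += 1
    return hist

codebook = [[0.0, 0.0], [1.0, 1.0]]           # pretend k-means centers
descs = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.9]]  # pretend descriptors for one image
print(codebook_histogram(descs, codebook))    # [1, 2]
```

The resulting fixed-length histogram is the per-image feature vector that logistic regression (or AdaBoost) can actually consume.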

So for next week I'll focus on generating a codebook of SIFT features, creating feature histograms for each image, and running logistic regression on SIFT and HoC.

My MATLAB aside:

MATLAB can store images in 64-bit (double), 32-bit (single), 16-bit (uint16), or 8-bit (uint8) form. double is MATLAB's standard 64-bit representation of numeric data. Storing images in uint8 form is a good idea for using less storage space, but many MATLAB operations are carried out on the double form.

To convert an image I to a double:

I = im2double(I)

The same can be carried out with any of the image types:

I = im2uint8(I)

I = im2uint16(I)

I = im2single(I)

Note that VLFeat asks for image input in single form.
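For intuition, im2double/im2single don't just cast; they rescale integer pixel values to [0, 1]. The equivalent operation sketched in Python (function name is mine):

```python
def im2float(pixels, dtype_max=255):
    """Rescale integer pixel values (e.g. uint8 with max 255) to floats in
    [0, 1], mirroring what MATLAB's im2double/im2single do for integer input."""
    return [[p / dtype_max for p in row] for row in pixels]

img = [[0, 128, 255]]
print(im2float(img))                      # [[0.0, 0.50196..., 1.0]]
print(im2float([[0, 65535]], 65535))      # uint16 case -> [[0.0, 1.0]]
```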


Monday, April 16, 2012

Color Histograms + k-means

After initially attempting to write my own color histogram code (wouldn't recommend it), I found some effective code (getPatchHist.m) on a computer vision source code website, whose method matched what I'd seen elsewhere. Essentially, the RGB values at each pixel are transformed into one value using a weighted sum. These values are then counted into a series of bins for each image.
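A sketch of that binning as I understand it (this is my reconstruction, not getPatchHist.m itself): quantize each 8-bit channel to 16 levels and combine them with a weighted sum into a single bin index, giving 16^3 bins total.

```python
def rgb_to_bin(r, g, b, levels=16):
    """Map an 8-bit RGB pixel to one of levels^3 histogram bins via a
    weighted sum of the quantized channel values."""
    q = 256 // levels                      # quantization step (16 for 8-bit input)
    return (r // q) * levels ** 2 + (g // q) * levels + (b // q)

def color_histogram(pixels, levels=16):
    """Count an image's (r, g, b) pixels into levels^3 bins."""
    hist = [0] * levels ** 3
    for r, g, b in pixels:
        hist[rgb_to_bin(r, g, b, levels)] += 1
    return hist

print(rgb_to_bin(0, 0, 0), rgb_to_bin(255, 255, 255))  # first and last bin: 0 4095
```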

Using 16^3 bins, the color histogram for Adora looks like:


The Histogram Intersection between each pair of images was then computed by summing the minimum of each pair of corresponding color bins and normalizing the result.
Taking a look at the histogram intersections using imagesc, we can double-check that histograms intersect completely with themselves (diagonal values are 1). We also see that most color histograms are only distantly related, with a few that are very similar.
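The intersection itself is just the sum of bin-wise minima, normalized so a histogram matches itself with score 1 (which is exactly the diagonal-of-ones check above). A quick sketch:

```python
def histogram_intersection(h1, h2):
    """Normalized histogram intersection: 1.0 means identical histograms,
    0.0 means no overlapping mass in any bin."""
    return sum(min(a, b) for a, b in zip(h1, h2)) / min(sum(h1), sum(h2))

print(histogram_intersection([2, 3, 5], [2, 3, 5]))  # 1.0
print(histogram_intersection([10, 0], [0, 10]))      # 0.0
```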
Running k-means on this intersection matrix, using 4 clusters and taking the closest 8 images from each cluster center, we get the following:

Cluster 1
Big black areas: 33% Romance, 37% History, 13% DietFitness, 17% SciFiFantasy
Cluster 2
Pastel colors: 24% Romance, 28% History, 36% DietFitness, 12% SciFiFantasy




Cluster 3
White with font: 11% Romance, 28% History, 56% DietFitness, 1% SciFiFantasy


Cluster 4
Bright colors: 29% Romance, 24% History, 21% DietFitness, 27% SciFiFantasy
Cluster Visualization
Taking the first 2 principal components of the histogram intersection matrix and color-coding them according to cluster, we get:


Goals for Wed
- get familiar with VLFeat for more features