Monday, June 4, 2012

ROC curves + ending objectives

For this week I worked on loading classifier definitions learned by MultiBoost into MATLAB and writing code to generate ROC curves. They show, once more, that color histograms perform much better than SIFT features for genre classification.
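The ROC curves themselves came from MATLAB, but the AUC values below can be sanity-checked with a rank-based estimate: AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (one-vs-rest per genre). A minimal Python sketch (function name is mine):

```python
# Rank-based AUC (the Wilcoxon-Mann-Whitney statistic), equal to the area
# under the ROC curve for a binary "this genre vs. the rest" labeling.
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

An uninformative classifier ranks positives above negatives half the time, which is why AUC ≈ .5 means chance performance.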

SIFT to classify genre


AUC
Diet/Fitness: .6642
History: .6237
Romance: .5533
SciFi/Fantasy: .5821

Color to classify genre

The following are ROC curves for two different cross-validation folds:


AUC
Diet/Fitness: .7892
History: .7583
Romance: .7733
SciFi/Fantasy: .7433


AUC
Diet/Fitness: .9258
History: .8067
Romance: .6750
SciFi/Fantasy: .6950

Classification performance clearly differs across train/test cross-validation folds. Should I average them together to get final performance measures?
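One common convention is to report the per-class AUC averaged over folds, ideally with the standard deviation to convey fold-to-fold variability. For the two folds above, a quick sketch:

```python
# Per-class AUC from the two cross-validation folds reported above.
fold1 = {"Diet/Fitness": .7892, "History": .7583, "Romance": .7733, "SciFi/Fantasy": .7433}
fold2 = {"Diet/Fitness": .9258, "History": .8067, "Romance": .6750, "SciFi/Fantasy": .6950}

# Average each class's AUC over the folds.
mean_auc = {genre: (fold1[genre] + fold2[genre]) / 2 for genre in fold1}
```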

I've also been working on removing between-class redundant SIFT features prior to LDA projection to see if this improves classification, but have run into the MATLAB 'for' loop bottleneck. My goal is to write a MATLAB-interfaceable C (MEX) file to accomplish this for inclusion in the report.

I've also looked for relationships between SIFT features and publication year, and between SIFT features and number of user reviews, with no positive results.

Wednesday, May 16, 2012

Classification Fail

Classification Results


The only somewhat promising classification result I attained was using color histograms to classify by genre. The error rate is around .5 (at least below the .75 chance error rate!), and the area under the ROC curve was highest for DietFitness (~.8), second highest for History (~.75), and lowest for Romance and SciFiFantasy (~.65).



Using LDA-reduced SIFT histograms for AdaBoost classification is performing at around chance. The error rate (number of misclassified images / total number of images) never drops below .7 on testing data, where chance would be .75 for 4 classes, for classification by both rating and genre. Area under the ROC curve also hovers around .5 for all classes.

I took a look at the LDA-projected data and it's clear that there is not good class separation for SIFT histograms (for class definition by either genre or rating).

Figure 1: First two dimensions of LDA-reduced SIFT descriptors (color-coded by rating)


Figure 2: First two dimensions of LDA-reduced SIFT descriptors (color-coded by genre)

This could mean that a) SIFT features are not doing a good job of capturing class differences or b) there are not clear class differences based on cover image.

In support of (b) (that there are no clear class differences), I also tried AdaBoost with the color histograms as feature input and class labels by rating, with very similar results: ~.7 error rate on testing data and area under the ROC curve ~.5 for all classes.


To Do


- Get actual ROC curves/ confusion matrices for classification
- Take a look at which colors are good between-genre specifiers
- Try classification by year/number of reviews
- Suggestions?

Matlab Tip


Check it out:


keyboard

A life-changing debugging tool.


Preprocessing


SIFT descriptors were obtained using VLFeat, and then reduced to k-1 dimensions (where k = number of classes) using Linear Discriminant Analysis. The LDA output was then histogrammed to make the final features for input to AdaBoost.


Labels


Rating


Based on the distribution of ratings for my set of book cover images (see Figure 3), I chose 4 classes:

1) "Bad" rating <= 3.5
2) "Okay" 3.5 < rating <= 4
3) "Good" 4 < rating <= 4.5
4) "Great" rating > 4.5
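For reference, the binning above can be written as a small helper (a Python sketch; the function name is mine):

```python
def rating_class(r):
    """Map a 1-5 star rating to one of the four classes defined above."""
    if r <= 3.5:
        return "Bad"
    elif r <= 4.0:
        return "Okay"
    elif r <= 4.5:
        return "Good"
    else:
        return "Great"
```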



Genre

I'm still working with 4 different genres:

1) Romance
2) History
3) DietFitness
4) SciFiFantasy


Histogram Construction


Since the distribution of the projected SIFT data is unknown, I chose a bin width W based on the interquartile range (IQR) of the histogrammed data and the number of samples N (average number of SIFT descriptors per image ≈ 700), i.e., the Freedman-Diaconis rule:


W = 2 * IQR * N^(-1/3)
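As a sketch of how this bin width is computed (in Python rather than MATLAB, and using simple linear-interpolation quartiles, which may differ slightly from other quartile conventions):

```python
def fd_bin_width(samples):
    """Freedman-Diaconis bin width: W = 2 * IQR * N^(-1/3)."""
    s = sorted(samples)
    n = len(s)

    def quantile(q):
        # Linear interpolation between adjacent order statistics.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    iqr = quantile(0.75) - quantile(0.25)
    return 2.0 * iqr * n ** (-1.0 / 3.0)
```

For N ≈ 700 descriptors, N^(-1/3) ≈ 0.113, so the bin width works out to roughly 0.23 times the IQR.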






Monday, May 7, 2012

MultiBoost in C++; ARFF; Weka in Java

Last week I left off with having attained SIFT histograms but having trouble with the slowness of the Matlab implementation of AdaBoost that I had been using.

I found an entire collection of boosting C++ code [1], which includes an AdaBoost implementation for multiple classes. One of the input formats is ARFF (Attribute-Relation File Format), developed by the Computer Science department at the University of Waikato (in New Zealand) for use in their open-source, Java-based Weka machine learning software. Conveniently, Matt Dunham has already shared a MATLAB-Weka interface, which I was able to use to convert my .mat datasets to ARFF format to run through the MultiBoost framework and get weak learner results (training was very fast).

For this week I still need to 1) separate the covers into training/testing and run for cross-validation results for both genre and ratings and 2) understand what's going on with the multi-class generalization of AdaBoost.

If you have time you should check out the Weka project. I really like their mission statement and open-source spirit.

[1] D. Benbouzid, R. Busa-Fekete, N. Casagrande, F.-D. Collin, and B. Kégl
MultiBoost: a multi-purpose boosting package
Journal of Machine Learning Research, 13:549–553, 2012. 

Wednesday, May 2, 2012

SIFT histograms, EMD, and classification

I found a paper, "Home Interior Classification Using SIFT Keypoint Histograms," where the authors used SIFT histograms to classify indoor scenes. This was great because they had the same problem I have, and they implemented and compared a few different solutions.

For clarification, right now I'm playing around with a toy dataset of 40 images (10 from each of 4 genres).

The problem is that for 40 images, I get about 26,000 SIFT keypoints, each with a 128-dimensional descriptor. One approach would be to classify each keypoint individually and then assign each image the class with the most matching keypoints, but that's very computationally expensive.

In the aforementioned Ayers and Boutell paper, the optimal solution to the above problem that they found was reducing the dimensionality of the SIFT descriptors from 128 to c-1 (where c is number of classes) using Fisher's Linear Discriminant, creating histograms from the lower-dimensionality SIFT descriptors, and then classifying the images using AdaBoost.

I thought I'd start there, so I used Linear Discriminant Analysis (LDA) (which maximizes the between-class variance and minimizes the within-class variance for projection onto a c-1 dimensional vector) to reduce the dimensionality of the key points, and then made histograms with 3 bins for each dimension and ran AdaBoost on the results. It gets reasonable output right now, I just need to make sure I'm using LDA correctly and do cross-validation, and play around with histogram binning.
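The histogram step can be sketched as follows, assuming the per-dimension bin range [lo, hi] is fixed across images so that the resulting features are comparable (a Python sketch; all names are mine):

```python
def image_feature(projected, lo, hi, bins=3):
    """Histogram each LDA dimension of an image's projected descriptors into
    `bins` equal-width bins over [lo, hi], then concatenate the counts."""
    dims = len(projected[0])
    feat = [0] * (dims * bins)
    width = (hi - lo) / bins
    for desc in projected:
        for d in range(dims):
            b = int((desc[d] - lo) / width)
            b = min(max(b, 0), bins - 1)  # clamp out-of-range values
            feat[d * bins + b] += 1
    return feat
```

With c = 4 classes, LDA gives 3 dimensions, so 3 bins per dimension yields a 9-dimensional feature vector per image.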

A problem with AdaBoost that I've run into is that it gets very, very slow with ~250 features as input. Does anybody know the time complexity of AdaBoost?

In this case, AdaBoost eliminates the need to compare histograms directly. However, if I were to do so (which I probably will at some point), Serge pointed me in the direction of Earth Mover's Distance (EMD) as a distance metric between histograms. This is better than the histogram intersection method I previously used for color histograms, because it takes bin distance/local bin similarity into account.
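For 1-D histograms with equal bin spacing and unit distance between adjacent bins, EMD has a simple closed form: the sum of absolute differences between the two cumulative histograms. A sketch, assuming both histograms are normalized to the same total mass:

```python
def emd_1d(h1, h2):
    """EMD between two 1-D histograms of equal total mass, with unit
    distance between adjacent bins: L1 distance of cumulative sums."""
    cum1, cum2, total = 0.0, 0.0, 0.0
    for a, b in zip(h1, h2):
        cum1 += a
        cum2 += b
        total += abs(cum1 - cum2)
    return total
```

Unlike bin-by-bin measures, mass in a neighboring bin costs only 1 unit of "work," which is exactly the local bin similarity property mentioned above.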

Something I could try with EMD is to use it to get cluster centers using k-means, and then use a nearest mean classifier.

Histograms, histograms everywhere!

Preliminary classification results will definitely be out next week.

Wednesday, April 25, 2012

SIFT -> Codebook -> Features -> Classification

Last time my update was about moving forward with feature extraction and logistic regression classification upon having completed installation of VLFeat.

vl_sift outputs X, Y frame center position coordinates, a frame scale coordinate, and a frame orientation in radians. When mapped onto an image, SIFT features look something like:

The yellow circles represent the output values mentioned above, and the green boxes represent descriptor vectors calculated using local gradient information.

These raw position/scale/orientation values won't immediately work for logistic regression purposes because each value on its own doesn't mean anything - I need a one-value representation of each SIFT frame. A way of doing this is generating a "codebook" of similar features across the training data and using the codebook to generate histograms for classification. This concept is visualized below from Kinnunen et al. 2009:


A commonly used codebook creation method is to run k-means and then use the cluster centers as codevectors (for SIFT we could run k-means on the descriptor vectors). Having defined a number of codevectors, one can then define similarity bins centered around each codevector.
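Given a codebook, the per-image histogram comes from assigning each descriptor to its nearest codevector and counting hits. A brute-force Python sketch (names are mine; a real implementation would vectorize this):

```python
def codebook_histogram(descriptors, codevectors):
    """Vector-quantize each descriptor to its nearest codevector
    (squared Euclidean distance) and count assignments per codevector."""
    hist = [0] * len(codevectors)
    for d in descriptors:
        best, best_dist = 0, float("inf")
        for i, c in enumerate(codevectors):
            dist = sum((x - y) ** 2 for x, y in zip(d, c))
            if dist < best_dist:
                best, best_dist = i, dist
        hist[best] += 1
    return hist
```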

So for next week I'll focus on generating a codebook of SIFT features, creating feature histograms for each image, and running logistic regression on SIFT and HoC.

My MATLAB aside:

MATLAB can store images in 64-bit (double), 32-bit (single), 16-bit (uint16), or 8-bit (uint8) form. double is MATLAB's standard 64-bit representation of numeric data. Storing images in uint8 form is a good idea for using less storage space, but many MATLAB operations are carried out on the double form.

To convert an image I to a double:

I = im2double(I)

The same can be carried out with any of the image types:

I = im2uint8(I)

I = im2uint16(I)

I = im2single(I)

Note that VLFeat asks for image input in single form.


Monday, April 16, 2012

Color Histograms + k-means

After initially attempting to write my own color histogram code (wouldn't recommend it), I found some effective code (getPatchHist.m) from a computer vision source code website, whose method matched what I'd seen elsewhere. Essentially, the RGB values at each pixel are combined into one value using a weighted sum, and these values are then counted into a series of bins for each image.
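With weights of 16^2, 16, and 1 on the quantized R, G, and B channels, that weighted sum is exactly a bin index into the 16^3 bins. A sketch of the indexing (my reading of the approach, not getPatchHist.m itself):

```python
def color_hist(pixels, levels=16):
    """Build a levels^3-bin color histogram from (r, g, b) tuples in 0-255.
    Each channel is quantized to `levels` values; the three quantized
    values are combined into one bin index via a weighted sum."""
    hist = [0] * (levels ** 3)
    step = 256 // levels  # channel values per quantization level
    for (r, g, b) in pixels:
        idx = (r // step) * levels * levels + (g // step) * levels + (b // step)
        hist[idx] += 1
    return hist
```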

Using 16^3 bins, the color histogram for Adora looks like:


The Histogram Intersection between each pair of images was then computed by summing the minimum of corresponding color bin values across all bins, and then normalizing.
Taking a look at the histogram intersections using imagesc, we can double-check that histograms intersect completely with themselves (diagonal values are 1). We also see that most color histograms are only distantly related, with a few that are very similar.
Running k-means on this intersection matrix, using 4 clusters and taking the closest 8 images from each cluster center, we get the following:

Cluster 1
Big black areas: 33% Romance, 37% History, 13% DietFitness, 17% SciFiFantasy
Cluster 2
Pastel colors: 24% Romance, 28% History, 36% DietFitness, 12% SciFiFantasy




Cluster 3
White with font: 11% Romance, 28% History, 56% DietFitness, 1% SciFiFantasy


Cluster 4
Bright colors: 29% Romance, 24% History, 21% DietFitness, 27% SciFiFantasy
Cluster Visualization
Taking the first 2 principal components of the histogram intersection matrix, and color-coding them according to cluster, we get:


Goals for Wed
- get familiar with VLFeat for more features

Wednesday, April 11, 2012

More covers! Still working on color histograms.

Kyle helped once more with image collection by pointing me towards openlibrary.org, which has a huge collection of book cover images for download (on the order of tens of thousands). These do not come with ratings, but even without ratings they can be used for unsupervised exploration.

Monday, April 9, 2012

Image collection, pre-processing, and clustering

I was able to add 90 book covers from the health and fitness genre to the dataset, so I now have 60-100 book covers from each of 4 genres. Kyle generously shared a collection of .png files of philosophy book covers, grabbed from a collection of PDFs, but there are issues with finding corresponding ratings on Amazon (social commentary?). Rotten Tomatoes does not rate books, BUT I came across Goodreads just yesterday, and it seems promising as an alternate source for ratings and cover images.

I've written a script to load the images into MATLAB using eval and imread. From here I can directly access the RGB color values for each of the images, which I will use for my first set of image features: color histograms. I've chosen color as a first step because it's readily available and worth exploring.

I ran across this related project by Dr. Sai Chaitanya Gaddam at Boston University, Judging a movie by its cover: A clustering analysis. The gist:
  • use the Netflix Prize data set of ~100 million ratings on 17,770 movies from 480,189 users
  • find movie similarity matrix 
  • use k-means to find movie clusters
  • find "average" movie poster images from exemplar images from each cluster
He came up with a few averaged poster images like the ones below:



        

Note: he did not use the poster image itself for clustering.

Along the way, to overcome the issue of a 2D representation of 3D data, he came up with a neat visualization of the color properties of an image:






As for using color histograms to capture similarities between pictures, a classic technique called Histogram Intersection comes from Swain and Ballard 1991, where, given two histograms I and M with n bins each, their (normalized) intersection is:

∩(I, M) = Σ_j min(I_j, M_j) / Σ_j M_j
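In code, the normalized intersection is just a sum of per-bin minima divided by the total mass of the model histogram. A Python sketch:

```python
def histogram_intersection(h_img, h_model):
    """Swain & Ballard histogram intersection: sum of per-bin minima,
    normalized by the total mass of the model histogram."""
    return sum(min(a, b) for a, b in zip(h_img, h_model)) / float(sum(h_model))
```

Identical histograms give 1, and histograms with no shared mass give 0.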

Goals for Wednesday include:
  • Finish the dataset and upload everything to MATLAB
  • Calculate color histogram intersections between images
Goals for the rest of the week include:
  • Find something meaningful to do with intersection values - run k-means?
  • Read more classification literature



Tuesday, April 3, 2012

Image Collection

I was initially optimistic about finding an automated way of harvesting book cover images from Amazon, but after a few attempts at using website copying software (HTTrack, SiteSucker), difficulty with the website format (the covers aren't even directly downloadable files?), and an approaching deadline, I resorted to using screenshots. For future reference, I would still be very interested in learning an automated way of doing this.

Finding a good variety of ratings for books has also been difficult. Books can only be sorted from highest-rated to lowest-rated, and I was not able to access search results past 100 pages.
Anyway, I at least have a range of ratings from 3-5 stars. I've been saving the images in PNG format to avoid loss of image quality. I use XnView, free image-editing software that lets me crop and resize many images at a time (available at http://www.xnview.com/en/index.html). So far I have 66 SciFi/Fantasy, 98 Romance, and 85 History covers, with more being added every day. For next week, I'll need to finish image collection and have a solid idea of what features I'm going to extract and how. Some of my favorite book covers so far:



If you're curious, here's a link to the full project proposal.

Related Work - Paper Gestalt

I realized that when Professor Belongie asked me about related work on Monday, he had (probably) meant this (humorous) paper he sent me about using computer vision to determine the quality of CVPR paper submissions.

Very witty paper by "von Bearnensquash", which can be found at http://vision.ucsd.edu/sites/default/files/gestalt.pdf. They used "standard computer vision features" (LUV histograms, HoG, and gradient magnitude) with AdaBoost classification, and found that "good" paper features include brightly colored graphs and math equations, while "bad" paper features include complicated tables and missing pages (illustrated below).


They found that allowing for a false positive rate of 15%, they could successfully reject half of the "bad" papers.

The problem I'm addressing is similar, but there's an important distinction: they use the content of the thing itself to evaluate its quality, so it is sensible for there to be a relationship, whereas book cover images are not necessarily related to the content of what I'm evaluating for quality (the book itself).

In any case, AdaBoost could be a good classification method to try, as it is simple and doubles as a feature selection method. There is a nice overview at https://hpcrd.lbl.gov/~meza/projects/MachineLearning/EnsembleMethods/introBoosting.pdf
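To make the boosting loop concrete, here is a toy binary AdaBoost with decision stumps (a Python sketch, not the multi-class MultiBoost implementation): each round picks the single best feature/threshold split under the current example weights, which is exactly why boosting doubles as feature selection.

```python
import math

def train_adaboost(X, y, rounds=10):
    """Toy binary AdaBoost with decision stumps; labels y must be -1 or +1."""
    n = len(X)
    w = [1.0 / n] * n  # uniform initial example weights
    model = []
    for _ in range(rounds):
        best = None  # (weighted error, feature, threshold, polarity)
        for f in range(len(X[0])):
            for t in sorted(set(x[f] for x in X)):
                for pol in (1, -1):
                    err = sum(wi for xi, yi, wi in zip(X, y, w)
                              if (pol if xi[f] > t else -pol) != yi)
                    if best is None or err < best[0]:
                        best = (err, f, t, pol)
        err, f, t, pol = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((alpha, f, t, pol))
        # Increase weight on misclassified examples, decrease on correct ones.
        w = [wi * math.exp(-alpha * yi * (pol if xi[f] > t else -pol))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return model

def adaboost_predict(model, x):
    """Sign of the alpha-weighted vote over all learned stumps."""
    s = sum(alpha * (pol if x[f] > t else -pol) for alpha, f, t, pol in model)
    return 1 if s >= 0 else -1
```

The exhaustive stump search is O(features × thresholds × examples) per round, which hints at why training slows down so much with ~250 features as input.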