I was able to add 90 book covers from the health and fitness genre to the dataset, so I now have 60-100 book covers for each of 4 genres. Kyle generously shared a collection of .png files of philosophy book covers, extracted from a collection of PDFs, but it is hard to find corresponding Amazon ratings for them (social commentary?). Rotten Tomatoes does not rate books, but I came across Goodreads just yesterday, and it looks promising as an alternative source for ratings and cover images.
I've written a script to load the images into MATLAB using eval and imread. From there I can directly access the RGB color values of each image, which I will use for my first set of image features: color histograms. I've chosen color as a first step because it's readily available and worth exploring.
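To make the idea of a color histogram feature concrete, here is a minimal sketch. It is written in plain Python rather than MATLAB, and the function name `color_histogram`, the `bins_per_channel` parameter, and the choice of a coarse joint RGB grid are all my own assumptions for illustration, not the script described above. It assumes the pixels are already available as (r, g, b) tuples with values from 0 to 255.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Count pixels in a coarse joint RGB grid.

    pixels: iterable of (r, g, b) tuples, each value in 0..255 (assumed).
    Returns a flat list of bins_per_channel ** 3 counts.
    """
    hist = [0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..bins_per_channel-1.
        r_bin = r * bins_per_channel // 256
        g_bin = g * bins_per_channel // 256
        b_bin = b * bins_per_channel // 256
        # Flatten the 3D bin coordinate into a single list index.
        index = (r_bin * bins_per_channel + g_bin) * bins_per_channel + b_bin
        hist[index] += 1
    return hist


if __name__ == "__main__":
    # Four made-up pixels: black, white, and two shades of red.
    toy_pixels = [(0, 0, 0), (255, 255, 255), (255, 0, 0), (250, 10, 5)]
    print(color_histogram(toy_pixels, bins_per_channel=2))
```

Fewer bins per channel make the feature more tolerant of small color shifts between covers, at the cost of lumping visibly different colors together, so the bin count is something to experiment with.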
I ran across a related project by Dr. Sai Chaitanya Gaddam at Boston University, "Judging a movie by its cover: A clustering analysis". The gist:
- Use the Netflix Prize data set of ~100 million ratings on 17,770 movies from 480,189 users
- Compute a movie similarity matrix
- Use k-means to find movie clusters
- Build "average" movie poster images from exemplar images in each cluster
He came up with a few averaged poster images like the ones below:
Note: he did not use the poster image itself for clustering.
Along the way, to get around the problem of showing 3D color data in a 2D picture, he came up with a neat visualization of the color properties of an image:
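As for using color histograms to capture similarities between pictures, a classic technique called Histogram Intersection comes from Swain and Ballard (1991). Given an image histogram $I$ and a model histogram $M$ with $n$ bins each, their intersection is:

$$\sum_{j=1}^{n} \min(I_j, M_j)$$

Dividing by $\sum_{j=1}^{n} M_j$ normalizes the value to the range $[0, 1]$.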
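Histogram intersection sums, bin by bin, the smaller of the two counts, and the normalized form divides that sum by the total count of the model histogram. The sketch below shows that calculation in plain Python rather than MATLAB. The function name `histogram_intersection` and the decision to return 0.0 for an empty model histogram are my own assumptions, not something taken from the paper.

```python
def histogram_intersection(image_hist, model_hist):
    """Normalized histogram intersection of two equal-length histograms.

    Returns sum(min(I_j, M_j)) / sum(M_j), a value between 0 and 1
    for non-negative counts.
    """
    if len(image_hist) != len(model_hist):
        raise ValueError("histograms must have the same number of bins")
    # Overlap: for every bin, keep the smaller of the two counts.
    overlap = sum(min(i, m) for i, m in zip(image_hist, model_hist))
    total = sum(model_hist)
    # Assumption: treat an empty model histogram as having no overlap.
    return overlap / total if total else 0.0


if __name__ == "__main__":
    # Two made-up 8-bin histograms, each built from 4 pixels.
    cover_a = [1, 0, 0, 0, 2, 0, 0, 1]
    cover_b = [0, 0, 0, 0, 3, 0, 0, 1]
    print(histogram_intersection(cover_a, cover_b))
    print(histogram_intersection(cover_a, cover_a))
```

Note that the normalized value is only symmetric when both histograms have the same total count, so covers of different sizes would need to be resized or have their histograms rescaled before being compared.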

Goals for Wednesday include:
- Finish the dataset and upload everything to MATLAB
- Calculate color histogram intersections between images
Goals for the rest of the week include:
- Find something meaningful to do with intersection values - run k-means?
- Read more classification literature






