Machine Learning Image Popularity - Predict Image Success

Updated by steve on Feb. 6, 2019.

What makes people click, like & share the images you post online?

In this post, we'll walk through developing an algorithm to predict whether or not an image is popular on GrubHub with 65% accuracy.

Predicting whether an image will be popular is a valuable problem to solve for any online marketer or Instagram influencer.

In this post, we'll explore how machine learning can help answer this very important question and get your images more attention online.

1. Training Data

In this exercise, we'll keep things simple and focus on predicting whether an image's click through rate will exceed a certain threshold.

Ideally, we'd like to train our system on images with a 50:50 split: 50% of images have a click through rate below the threshold and 50% have a click through rate above it.

If you already have your own data of images and associated click throughs (thousands of images are recommended), feel free to skip to part 2.
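If you do bring your own click through data, one simple way to get a roughly 50:50 split is to use the median click through rate as the threshold. The sketch below assumes the rates are plain fractions in a list; `label_by_median` and the sample values are made up for illustration.

```python
from statistics import median


def label_by_median(click_through_rates):
    """Label each rate True if it is above the median, otherwise False."""
    threshold = median(click_through_rates)
    labels = [rate > threshold for rate in click_through_rates]
    return threshold, labels


# Hypothetical click through rates for six images
rates = [0.010, 0.042, 0.015, 0.031, 0.008, 0.027]
threshold, labels = label_by_median(rates)

print(threshold)
print(labels)
```

Rates that tie with the median all land in the lower class, so heavily tied data can skew the split away from 50:50.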

Infer Click Through Rates from Most Popular Food Images

Don't have thousands of images with click through data to get started with? No problem - if we get a little creative & resourceful, we can build our own training dataset using publicly available information.

Take a look at GrubHub and click on your favorite restaurant. Check out their menu and you'll notice some items are featured as Most Popular, typically with images attached.

Let's inspect this data more closely - we'll find that within some sections of the menu where multiple items have images, there are cases where one item rose to "Most Popular" while other menu items were left behind. We can assume that the "Most Popular" item was ordered more than its peers and thus rose to the "Most Popular" section.

We can't say for sure that the most popular items were ordered more because of their photo (the title and description can make all the difference too, or the popular item may have just been around longer), but for this exercise we will assume that within a menu category, the popular item has a higher click through rate than its peers with photos.

Now we can create our training data: popular food images vs. non-popular food images.

Collecting the Data

We'll use the Stevesie GrubHub Data integration to collect our training data. We'll need to create 3 workers to get started:

Step 1: Get a GrubHub Access Token

Create a worker for Anonymous Login and get an access token. Save this somewhere as you'll need it for the next steps. This token is good for only one hour, so you'll need to do this again if you need more than an hour.

Step 2: Build a Restaurant List

On your worker, fill in your location and run with auto-pagination to get 1,000 restaurants near you (or wherever you're interested in).

Step 3: Aggregate Restaurant Details



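    import csv
    import os

    import requests

    # SOURCE_IDS_FILEPATH (the CSV of restaurant IDs) is not defined in the original post
    with open(os.path.expanduser(SOURCE_IDS_FILEPATH), 'r') as f:
        csv_reader = csv.reader(f)
        next(csv_reader)  # skip the header row

        for row in csv_reader:
            restaurant_id = row[0]
            # the request made for each restaurant_id is missing from the original post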

Wait a bit for Stevesie to index everything. Once it's done, click the green export button and download your data in JSON format.

Step 4: Download Popular & Unpopular Images

Now that we have our JSON, we can forge the URLs of where the images live online and download them locally to build our machine learning algorithm.



```python
import json
import os

import requests

RESTAURANT_JSON_FILEPATH = '~/Desktop/trainingrestaurants.json'
TARGET_IMAGE_DIRECTORY = '~/Desktop/trainingimages'

all_popular_urls = []
all_unpopular_urls = []

with open(os.path.expanduser(RESTAURANT_JSON_FILEPATH), 'r') as f:
    all_restaurants = json.load(f)

for item in all_restaurants['items']:
    restaurant = item['object']['restaurant']
    for menu_category in restaurant.get('menu_category_list', []):

        menu_items_with_images = \
            [menu_item for menu_item in menu_category['menu_item_list'] if 'media_image' in menu_item]

        if len(menu_items_with_images) > 0:
            category_popular_urls = []
            category_unpopular_urls = []

            for menu_item in menu_items_with_images:
                is_popular = menu_item['popular']
                media_image = menu_item['media_image']
                image_url = '{}{}.{}'.format(
                    media_image['base_url'], media_image['public_id'], media_image['format'])

                # the two branch bodies are missing from the original; reconstructed from context
                if is_popular:
                    category_popular_urls.append(image_url)
                else:
                    category_unpopular_urls.append(image_url)

            # only keep categories that contain both popular and unpopular images
            if len(category_popular_urls) > 0 and len(category_unpopular_urls) > 0:
                all_popular_urls += category_popular_urls
                all_unpopular_urls += category_unpopular_urls


def dedupe_list(inspect_list, check_list):
    # drop URLs that appear in both lists
    return [url for url in inspect_list if url not in check_list]


def write_urls(image_urls, directory_path):
    directory_path = os.path.expanduser(directory_path)
    if not os.path.exists(directory_path):
        os.makedirs(directory_path)

    for image_url in image_urls:
        url_name = image_url.split('/')[-1]
        r = requests.get(image_url, stream=True)
        with open(os.path.join(directory_path, url_name), 'wb') as write_image:
            for block in r.iter_content(1024):
                # the loop body is cut off in the original; reconstructed from context
                if not block:
                    break
                write_image.write(block)


write_urls(dedupe_list(all_popular_urls, all_unpopular_urls), os.path.join(TARGET_IMAGE_DIRECTORY, 'popular'))
write_urls(dedupe_list(all_unpopular_urls, all_popular_urls), os.path.join(TARGET_IMAGE_DIRECTORY, 'unpopular'))
```
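To make the URL forging step concrete, here is a small sketch of how one image URL and its local file name are derived. The record values and the `image_url_from_media` helper are hypothetical; only the three field names come from the script above.

```python
def image_url_from_media(media_image):
    """Join the base URL, public ID and file format into one URL."""
    return '{}{}.{}'.format(
        media_image['base_url'], media_image['public_id'], media_image['format'])


# Hypothetical record; real values come from the exported JSON
example = {
    'base_url': 'https://example.com/images/',
    'public_id': 'abc123',
    'format': 'jpg',
}

url = image_url_from_media(example)
file_name = url.split('/')[-1]  # name used when saving the image locally

print(url)
print(file_name)
```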

2. Strategize

If you fail at this step, everything you do going forward will be a waste of time and you'll have to come back here. Take a deep breath and really think about the problem you're trying to solve and its context.

Don't Throw AI at It

You may be saying to yourself: now I have 2 sets of images I want to classify... I know what that sounds like! If you Google how to train and perform image classification, you'll likely land on something like Simple Image Classification using Convolutional Neural Network.

If you throw your images at this algorithm, you're going to see poor results.

We're Learning the Invisible

The problem of predicting image engagement is not a traditional image classification problem (e.g. what is in this image). Rather, the question is whether the image will be well received online based on the how of the image: the colors, lighting, angles, etc.

Popular Training Images:

[Image: Popular Training Images]

Unpopular Training Images:

[Image: Unpopular Training Images]

Something About Those Colors

Just by looking at my training data, I can see that the popular images just seem to... pop more than the other images. They have a certain characteristic about the colors used, their distributions and the contrasts they command to make them more appealing.

I'm going to hypothesize going forward that we can get some predictive power just by analyzing the main colors of images, so we will proceed by focusing on those features and ignoring everything else (e.g. raw pixels, lines, shadows, etc...).

3. Extract Features

Feature extraction is arguably one of the most important steps in machine learning. A learning algorithm is only as good as the data that it's fed - feed your algorithm the wrong data (or irrelevant data), and you're going to see poor results.

Dominant Colors

I want to get the dominant color from each image. After Googling around a little bit, Finding Dominant Image Colours Using Python proved extremely helpful in documenting the approach using K-Means clustering.
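As a rough illustration of the idea (not the code from that article, and not the MiniBatchKMeans call used later in this post), here is a minimal pure-Python k-means sketch. It groups pixels into k clusters, treats each cluster center as a dominant color, and weights it by its share of pixels. The function name, the fixed starting centers, and the sample pixels are assumptions made for the example.

```python
from collections import Counter


def dominant_colors(pixels, k, iterations=10):
    """Return [(center, weight), ...] sorted from most to least common."""
    # Assumption for reproducibility: start from the first k distinct pixels.
    centers = []
    for pixel in pixels:
        if pixel not in centers:
            centers.append(pixel)
        if len(centers) == k:
            break

    def nearest(pixel):
        # Index of the center with the smallest squared RGB distance
        return min(range(len(centers)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(pixel, centers[i])))

    for _ in range(iterations):
        labels = [nearest(pixel) for pixel in pixels]
        for i in range(len(centers)):
            members = [p for p, label in zip(pixels, labels) if label == i]
            if members:
                # Move the center to the mean color of its members
                centers[i] = tuple(sum(channel) / len(members) for channel in zip(*members))

    # Weight = share of pixels assigned to each center
    counts = Counter(nearest(pixel) for pixel in pixels)
    weighted = [(centers[i], counts[i] / len(pixels)) for i in range(len(centers))]
    return sorted(weighted, key=lambda item: item[1], reverse=True)


# Hypothetical image: six reddish pixels and two bluish pixels
pixels = [(250, 10, 10)] * 3 + [(10, 10, 250)] + [(240, 20, 20)] * 3 + [(20, 20, 240)]
result = dominant_colors(pixels, k=2)
for center, weight in result:
    print(center, weight)
```

Sorting by weight gives every image the same column order (most dominant color first), which keeps rows comparable across images.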

Build a Training CSV File

We now want to transform our raw data (the images in each folder) into a CSV file with the color summary for each image. We'll write a quick Python script to accomplish this:



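    import csv
    import os

    import cv2
    from sklearn.cluster import MiniBatchKMeans

    SOURCE_IMAGES_FOLDER = '~/Desktop/trainingimages'
    TARGET_FILEPATH = '~/Desktop/training_features.csv'
    NUM_COLORS = 5  # the later scripts read five (r, g, b, weight) groups per row


    def features_from_image(image_filepath):
        img = cv2.imread(image_filepath)

        # OpenCV loads images as BGR; convert to RGB
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        # shape to pixels
        img = img.reshape((img.shape[0] * img.shape[1], 3))

        kmeans = MiniBatchKMeans(n_clusters=NUM_COLORS)
        kmeans.fit(img)  # must run before reading cluster_centers_ and labels_

        colors = kmeans.cluster_centers_
        labels = kmeans.labels_

        # todo - get weights and SORT before putting in CSV!
        # (the rest of this function, including its return value, is missing from the original post)


    # write this so we can use it in the predictor
    def image_features(directory_path, is_popular):
        all_features = []
        for filename in os.listdir(directory_path):
            filepath = os.path.join(directory_path, filename)
            # (the per-image feature extraction is missing from the original post)

        return all_features


    with open(os.path.expanduser(TARGET_FILEPATH), 'w') as f:
        writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_ALL)
        # write_image_features is not defined in the surviving code
        write_image_features(os.path.join(SOURCE_IMAGES_FOLDER, 'popular'), True, writer)
        write_image_features(os.path.join(SOURCE_IMAGES_FOLDER, 'unpopular'), False, writer)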

We'll also be able to use this script to generate our testing features, which we'll get to later.
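The scripts later in this post read the popularity label from column 1 and five groups of r, g, b, weight from columns 2 through 21. A row with that layout could be assembled as in the sketch below. Putting the file name in column 0 is an assumption (the original does not show what goes there), and `feature_row` and the color values are hypothetical.

```python
import csv
import io


def feature_row(filename, is_popular, weighted_colors):
    """Flatten [((r, g, b), weight), ...] into a single CSV row."""
    row = [filename, is_popular]  # column 0 as the file name is an assumption
    for (r, g, b), weight in weighted_colors:
        row.extend([r, g, b, weight])
    return row


# Hypothetical values for one image with five dominant colors
colors = [((245, 15, 15), 0.4), ((15, 15, 245), 0.3), ((200, 200, 200), 0.15),
          ((30, 30, 30), 0.1), ((90, 160, 60), 0.05)]
row = feature_row('example.jpg', True, colors)

# Write to an in-memory buffer instead of a file to keep the sketch self-contained
buffer = io.StringIO()
csv.writer(buffer, delimiter=',', quoting=csv.QUOTE_ALL).writerow(row)

print(len(row))
print(buffer.getvalue())
```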

4. Explore Features

Once we have our features from the raw data, let's examine them a bit to make sure they line up with our assumptions.



```python
import csv
import os

import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d.art3d import Poly3DCollection

FEATURE_FILEPATH = '~/Desktop/training_features.csv'

fig = plt.figure()
ax_unpop = fig.add_subplot(1, 2, 1, projection='3d', title='Unpopular')
ax_pop = fig.add_subplot(1, 2, 2, projection='3d', title='Popular')


def feature_tuple_from_row(row):
    # (is_popular, colors); CSV fields are strings, so convert them explicitly
    values = [float(value) for value in row[2:22]]
    # five [r, g, b, weight] groups
    return (row[1] == 'True', [values[i:i + 4] for i in range(0, 20, 4)])


def rgb_to_hex(r, g, b):
    return '#%02x%02x%02x' % (int(r), int(g), int(b))


with open(os.path.expanduser(FEATURE_FILEPATH), 'r') as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        is_popular, colors = feature_tuple_from_row(row)

        to_plot = ax_pop if is_popular else ax_unpop
        points = []

        for color in colors:
            rgb = [color[0], color[1], color[2]]
            points.append(rgb)
            to_plot.scatter(*rgb, s=100 * color[3], color=rgb_to_hex(*rgb))

        # poly = geoms.Polygon(np.array(points))  # unused below; the `geoms` import is missing from the original

        # Several lines are missing from the original here; x, y, z and max_weight
        # are reconstructed from how they are used (weights assumed to be in [0, 1]).
        x, y, z = zip(*points)
        max_weight = max(color[3] for color in colors)

        verts = [list(zip(x, y, z))]
        pc = Poly3DCollection(verts, linewidths=1, alpha=max_weight)

        avg_r = np.array([color[0] for color in colors]).mean()
        avg_g = np.array([color[1] for color in colors]).mean()
        avg_b = np.array([color[2] for color in colors]).mean()

        face_color = [avg_r / 255, avg_g / 255, avg_b / 255]
        pc.set_facecolor(face_color)
        to_plot.add_collection3d(pc)

plt.show()
```
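One detail to watch when reading the features back: `csv.reader` returns every field as a string, so labels and color values need explicit conversion before plotting or training. A minimal sketch, assuming the label was written as the text True or False and the row layout matches the indices used above (`parse_feature_row` is a hypothetical helper):

```python
def parse_feature_row(row):
    """Turn one CSV row of strings into (is_popular, colors)."""
    # bool('False') is True because any non-empty string is truthy,
    # so compare against the text instead.
    is_popular = row[1] == 'True'
    values = [float(value) for value in row[2:22]]
    colors = [values[i:i + 4] for i in range(0, 20, 4)]  # r, g, b, weight
    return is_popular, colors


# Hypothetical row as csv.reader would return it
row = ['example.jpg', 'False'] + ['10', '20', '30', '0.2'] * 5
is_popular, colors = parse_feature_row(row)

print(bool(row[1]), is_popular)
print(colors[0])
```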

5. Build Your Model

Once you've gotten to know your features, consider which machine learning algorithm would be a good fit. In our color data, each datapoint has several different colors, and we hypothesize that the differences between those colors are important. We believe the relationship between these raw colors and their interactions with each other will help determine whether an image reaches popularity.

This sounds like a good case for using a Support Vector Machine (or SVM).



```python
import csv
import itertools
import os

import joblib
from sklearn import svm
from sklearn.model_selection import cross_val_score

FEATURE_FILEPATH = '~/Desktop/training_features.csv'
TARGET_MODEL_FILEPATH = '~/Desktop/svm_model.joblib'


def extract_label_and_features(row):
    # feature_tuple_from_row is the helper from the feature exploration script
    is_popular, colors = feature_tuple_from_row(row)

    # NOTE: the definitions of sorted_colors, avg_r/g/b and unweighted_r/g/b
    # are missing from the original post at this point.

    features = [
        unweighted_r - avg_r,
        unweighted_g - avg_g,
        unweighted_b - avg_b,
        # NOTE: several feature lines are missing from the original post here.
    ]

    # r, g, b and weight differences between every pair of the five sorted colors,
    # in the same order as the original hand-written list
    for i, j in itertools.combinations(range(5), 2):
        for k in range(4):
            features.append(sorted_colors[i][k] - sorted_colors[j][k])

    return (is_popular, features)


X_train = []
Y_train = []

with open(os.path.expanduser(FEATURE_FILEPATH), 'r') as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        y, x = extract_label_and_features(row)
        X_train.append(x)
        Y_train.append(y)

clf = svm.SVC(kernel='poly', max_iter=1000000, degree=3, C=100000.0)

scores = cross_val_score(clf, X_train, Y_train, cv=5)
print('Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))

# cross_val_score works on copies, so fit clf on the full training set before saving
clf.fit(X_train, Y_train)
joblib.dump(clf, os.path.expanduser(TARGET_MODEL_FILEPATH))
```
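The long list of color differences in the script above follows a regular pattern: for every pair of the five colors, subtract r, g, b and weight. A compact sketch of that pattern, assuming each color is a list of [r, g, b, weight] (`pairwise_differences` and the sample colors are illustrative, and the original does not show how `sorted_colors` is ordered):

```python
from itertools import combinations


def pairwise_differences(sorted_colors):
    """Component-wise differences between every pair of colors."""
    features = []
    for i, j in combinations(range(len(sorted_colors)), 2):
        for k in range(4):  # r, g, b, weight
            features.append(sorted_colors[i][k] - sorted_colors[j][k])
    return features


# Hypothetical colors as [r, g, b, weight]
colors = [
    [200, 50, 50, 0.5],
    [50, 200, 50, 0.25],
    [50, 50, 200, 0.125],
    [0, 0, 0, 0.0625],
    [255, 255, 255, 0.0625],
]
features = pairwise_differences(colors)

print(len(features))
print(features[:4])
```

Five colors give ten pairs, and four components per pair give forty difference features.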

6. Test & Evaluate Your Model

```python
import csv
import itertools
import os

import joblib
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix

FEATURE_DATA_FILEPATH_TEST = '~/Desktop/testfeatures.csv'
SVM_MODEL_FILEPATH = '~/Desktop/svm_model.joblib'


# TODO - cite source
# The signature is cut off in the original; the cmap default is reconstructed.
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting normalize=True.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')


X_test = []
Y_test = []

# extract_label_and_features is the function from the model-building script
with open(os.path.expanduser(FEATURE_DATA_FILEPATH_TEST), 'r') as f:
    csv_reader = csv.reader(f)

    for row in csv_reader:
        y, x = extract_label_and_features(row)
        X_test.append(x)
        Y_test.append(y)

clf = joblib.load(os.path.expanduser(SVM_MODEL_FILEPATH))

scores = clf.score(X_test, Y_test)
print('accuracy')
print(scores)

Y_predict = clf.predict(X_test)

# Compute confusion matrix (rows are true labels, columns are predictions)
cnf_matrix = confusion_matrix(Y_test, Y_predict)

class_names = ['Unpopular', 'Popular']

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

total_popular_predictions = cnf_matrix[0][1] + cnf_matrix[1][1]
correct_popular_predictions = cnf_matrix[1][1]

print('*')
print('Popular Distribution')
print(
    (cnf_matrix[1][0] + cnf_matrix[1][1]) /
    (cnf_matrix[1][0] + cnf_matrix[1][1] + cnf_matrix[0][0] + cnf_matrix[0][1])
)
print('Relevant Precision')
print(correct_popular_predictions / total_popular_predictions)

plt.show()
```

Evaluate the Confusion Matrix

[Image: Confusion Matrix]

Of the images the algorithm predicts to be popular, 29% actually are (compared to a baseline of 20% of images that are popular in our test set). In other words, when the algorithm predicts that a photo will become popular, it has a 29% chance of making it.

What's more impressive, however, is how good the algorithm is at filtering out images that will be unpopular.
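To show how these figures fall out of a confusion matrix, here is a small worked sketch. The counts are hypothetical: they were picked only so that the results line up with the 65% accuracy, 20% baseline and 29% precision quoted in this post, and they are not the actual test results. `popular_metrics` is an illustrative helper.

```python
def popular_metrics(cm):
    """cm[i][j] = images with true class i and predicted class j
    (0 = unpopular, 1 = popular)."""
    tn, fp = cm[0]
    fn, tp = cm[1]
    total = tn + fp + fn + tp
    return {
        'accuracy': (tn + tp) / total,
        'baseline': (fn + tp) / total,          # share of truly popular images
        'precision': tp / (fp + tp),            # correct among "popular" predictions
        'unpopular_precision': tn / (tn + fn),  # correct among "unpopular" predictions
    }


# Hypothetical counts for 100 test images (not the post's real matrix)
cm = [[55, 25],
      [10, 10]]
metrics = popular_metrics(cm)

for name, value in metrics.items():
    print(name, round(value, 2))
```

With counts like these, a popular prediction is only modestly better than the baseline, while an unpopular prediction is right far more often, which is the pattern described above.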

7. Build Your Predictor

Now for the fun part. Once we're happy with our model, we can use it to predict whether a photo will do well online. Remember that the accuracy is only 65%, so take its predictions with a grain of salt. Still, when you're on the fence between two images to post, this algorithm may make a decent tie breaker.



```python
import os

import joblib

IMAGE_FILEPATH = '~/Desktop/predictme.jpg'
SVM_MODEL_FILEPATH = '~/Desktop/svm_model.joblib'

clf = joblib.load(os.path.expanduser(SVM_MODEL_FILEPATH))

# features_from_image is the function from the feature extraction script
image_features = features_from_image(os.path.expanduser(IMAGE_FILEPATH))

predictions = clf.predict([image_features])

print(predictions)
```