Building an Image Search/Tagging Engine

Exploring the power of vector embeddings

Abhishek Swain
9 min read · Sep 29


Photo by saeed mhmdi on Unsplash

My Portfolio:

This post is inspired by: Building an image search service from scratch | by Emmanuel Ameisen | Insight


For a very long time, I have wanted to build a search engine that finds similar images, something like Reverse Image Search (Search by Image to Find Similar Photos). This led me to study the field of content-based image retrieval.

From Wikipedia we have,

Content-based image retrieval, also known as query by image content (QBIC) and content-based visual information retrieval (CBVIR), is the application of computer vision techniques to the image retrieval problem, that is, the problem of searching for digital images in large databases.

The term “content” in this context might refer to colors, shapes, textures, or any other information that can be derived from the image itself.

The most basic way to search images is to compare two images directly, i.e., to measure image similarity. Since every image can be flattened into a vector of pixel values, we can use a formula such as cosine similarity (Cosine similarity: How does it measure the similarity, Maths behind and usage in Python | by Varun | Towards Data Science) and recommend the most similar images. But can we do better? Rather than comparing raw image vectors, we can find better representations and compare those instead.
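
To make the idea concrete, here is a minimal sketch of cosine similarity on raw pixel vectors. It is plain Python, and the 2x2 "images" (`img_a`, `img_b`, `img_c`) are made-up values rather than anything from the dataset, so treat it as an illustration and not as the code used later in this post.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "images": flattened 2x2 grayscale pixel grids (hypothetical values).
img_a = [0.9, 0.1, 0.8, 0.2]
img_b = [0.8, 0.2, 0.9, 0.1]  # similar brightness pattern to img_a
img_c = [0.1, 0.9, 0.2, 0.8]  # inverted pattern

sim_ab = cosine_similarity(img_a, img_b)
sim_ac = cosine_similarity(img_a, img_c)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

The score depends only on the direction of the vectors, not on their length, which is one reason it is a popular choice for comparing embeddings.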

Vector Embeddings

When I talk about “representations”, I essentially mean vector embeddings. From the blog: What are vector embeddings? | A Comprehensive Vector Embeddings Guide | Elastic, we have,

Vector embeddings are a way to convert words and sentences and other data into numbers that capture their meaning and relationships. They represent different data types as points in a multidimensional space, where similar data points are clustered closer together. These numerical representations help machines understand and process this data more effectively.

Image Search

There are various types of vector embeddings, of which we are only concerned with image embeddings. How do we get them? We will use a CNN (Convolutional Neural Network) to extract our image embeddings.

Once we have our image embeddings, we can use cosine similarity to score the search image against every image in the database and recommend the images with the highest similarity.
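
Ranking a database by similarity could look roughly like the sketch below. The file names and the 3-dimensional embeddings are invented for illustration; real embeddings would come from the CNN and be much larger.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def most_similar(query, database, k=2):
    """Return the k (name, score) pairs most similar to the query embedding."""
    q = normalize(query)
    scored = (
        (name, sum(a * b for a, b in zip(q, normalize(vec))))
        for name, vec in database.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical embeddings; a real system would compute these with the CNN.
database = {
    "cat_1.jpg": [0.9, 0.1, 0.0],
    "cat_2.jpg": [0.8, 0.2, 0.1],
    "truck_1.jpg": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

results = most_similar(query, database, k=2)
top_names = [name for name, _ in results]
print(top_names)
```

This scans every stored embedding, which gets slow for large collections; that is the motivation for the nearest neighbor index used later in this post.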

Image Tagging

This is a simple case of multi-class or multi-label classification. We train a CNN on a dataset like ImageNet or CIFAR to assign probabilities to various classes. The issue is that no dataset can have labels for everything in the universe. Besides, as you may have noticed, an image usually has several objects in it.

E.g., an image of a cat can have tags like ‘cat’, ‘animal’, ‘fur’, ‘meow’, etc. All these words are different, but they are qualities/features that can be ascribed to a cat.

This is the notion of “semantic similarity”, and vector embeddings can help us capture it. If you project the words ‘cat’, ‘animal’, ‘fur’, and ‘meow’ into an embedding space, that is, find the vector embedding for each word and plot it, you’ll see that they are all clustered together.

From: Word Embedding: Basics. Create a vector from a word | by Hariom Gautam | Medium

Now we have two sets of embeddings: image embeddings and word embeddings. Can we combine the two? Enter “Visual Semantic Embedding (VSE)” models, specifically DeViSE: A Deep Visual-Semantic Embedding Model. The approach proposed in this paper:

Our objective is to leverage semantic knowledge learned in the text domain and transfer it to a model trained for visual object recognition. We begin by pre-training a simple neural language model well-suited for learning semantically meaningful, dense vector representations of words. In parallel, we pre-train a state-of-the-art deep neural network for visual object recognition, complete with a traditional SoftMax output layer. We then construct a deep visual-semantic model by taking the lower layers of the pre-trained visual object recognition network and re-training them to predict the vector representation of the image label text as learned by the language model.

Plan Of Attack

The dataset I have used is fastai/imagenette: A smaller subset of 10 easily classified classes from Imagenet, and a little more French.

  1. Get GloVe (Global Vectors for Word Representation) embeddings for all the labels in the dataset.
  2. Get image embeddings for the images using any CNN. I use ResNet50V2.
  3. Train ResNet50V2 to predict the GloVe embeddings.
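
Step 1 can be sketched as follows. The 2-dimensional `toy_vectors` are invented stand-ins for real GloVe vectors, and averaging every word of a multi-word label is a simplification I am assuming here for illustration.

```python
def label_to_embedding(label, word_vectors):
    """Average the vectors of the words that make up a label."""
    words = label.lower().split()
    missing = [w for w in words if w not in word_vectors]
    if missing:
        raise KeyError(f"no vector for: {missing}")
    dim = len(word_vectors[words[0]])
    return [sum(word_vectors[w][i] for w in words) / len(words) for i in range(dim)]


# Hypothetical word vectors; real GloVe vectors have far more dimensions.
toy_vectors = {
    "french": [1.0, 0.0],
    "horn": [0.0, 1.0],
    "church": [0.5, 0.25],
}

targets = {
    label: label_to_embedding(label, toy_vectors)
    for label in ["French horn", "church"]
}
print(targets)
```

Each image of a given class then gets that class's vector as its regression target, which is what step 3 trains the CNN to predict.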

Let’s Code!

I will be using TensorFlow, but you can use PyTorch as well.

Complete code available at: Abhiswain97/Search-Engine: Search or tag images

Creating the datasets

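def load_glove_embeddings():
    # Map every GloVe word to its vector
    embedding_dict = {}
    with open(file=CFG.GLOVE_EMBEDDINGS, encoding="utf-8") as f:
        for line in tqdm(f.readlines(), total=400000):
            splits = line.split(maxsplit=1)
            embedding_dict[splits[0]] = np.fromstring(splits[1], dtype="f", sep=" ")

    # Save the dictionary to disk. The target path is cut off in the source;
    # CFG.GLOVE_EMBEDDINGS_INDEX is assumed because the next function loads it.
    np.save(file=CFG.GLOVE_EMBEDDINGS_INDEX, arr=embedding_dict, allow_pickle=True)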

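def create_labels2embeddings():
    em = np.load(CFG.GLOVE_EMBEDDINGS_INDEX, allow_pickle=True).tolist()

    final_embeddings = {}

    for label in np.unique(CFG.TRAIN_LABELS):
        sp_lbl = label.lower().split()
        if len(sp_lbl) > 1:
            # Multi-word label: average the vectors of the first two words
            final_embeddings[label] = (em[sp_lbl[0]] + em[sp_lbl[1]]) / 2
        else:
            final_embeddings[label] = em[sp_lbl[0]]

    # Save the dictionary. The save call is missing in the source; the path is
    # assumed to be CFG.LABEL_EMBEDDINGS_PATH, which create_tf_dataset loads.
    np.save(file=CFG.LABEL_EMBEDDINGS_PATH, arr=final_embeddings, allow_pickle=True)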

The first function loads the GloVe embeddings and saves them as a dictionary; the second creates the labels-to-embeddings dictionary and saves it. Next, we split the data into training and validation sets:

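# Create Dataframe
df = pd.DataFrame({"image_path": CFG.TRAIN_PATHS, "label": CFG.TRAIN_LABELS})

# Sample 1000 images from the DataFrame (100 per class, with replacement).
# The groupby and reset_index lines are missing in the source and are assumed.
df = (
    df.groupby("label", group_keys=False)
    .apply(lambda x: x.sample(n=100, replace=True))
    .reset_index(drop=True)
)

# Split the dataset
train_df, val_df = train_test_split(df, test_size=0.20, stratify=df["label"].values)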
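
What the per-class sampling step does can be sketched in plain Python. The paths and the helper name `sample_per_class` are hypothetical; the point is only that every label contributes the same number of rows, drawn with replacement.

```python
import random
from collections import defaultdict


def sample_per_class(paths, labels, n, seed=0):
    """Draw n (path, label) pairs per label, with replacement."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in zip(paths, labels):
        by_label[label].append(path)

    sampled = []
    for label, items in sorted(by_label.items()):
        sampled.extend((rng.choice(items), label) for _ in range(n))
    return sampled


# Hypothetical, unbalanced toy data: 3 churches, 2 tench, 1 parachute.
paths = [f"img_{i}.jpg" for i in range(6)]
labels = ["church", "church", "church", "tench", "tench", "parachute"]

sampled = sample_per_class(paths, labels, n=4)
print(len(sampled))
```

Because the draw is with replacement, the same image can appear more than once, and copies of it may end up on both sides of a later train/validation split. That is worth keeping in mind when reading validation scores.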

The function below creates the TensorFlow dataset and saves it:

# Create the TensorFlow dataset and save it to disk
def create_tf_dataset(df, mode, save=True):
    final_embeddings = np.load(
        file=CFG.LABEL_EMBEDDINGS_PATH, allow_pickle=True
    ).tolist()

    images, labels = [], []
    for idx, row in tqdm(df.iterrows(), total=len(df)):
        # Load and resize the image
        image = Image.open(row["image_path"]).convert("RGB").resize((224, 224))

        # Normalize the image
        image = np.array(image) / 255.0

        # Get the embedding for the corresponding label
        label = final_embeddings[row["label"]]

        # Create a list of images and labels
        images.append(image)
        labels.append(label)

    ds = tf.data.Dataset.from_tensor_slices((images, labels))

    if save:
        # The call name is cut off in the source; the `dataset=` keyword
        # matches tf.data.experimental.save, which is assumed here.
        tf.data.experimental.save(
            dataset=ds, path=(CFG.SAVE_DATA_DIR / mode).as_posix()
        )

The Model

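pretrained_head = tf.keras.applications.ResNet50V2(
    include_top=True, input_shape=(224, 224, 3)
)
# Take the pooled features that feed the original classifier
last_layer = pretrained_head.get_layer(name="avg_pool")
x = last_layer.output
x = tf.keras.layers.Dense(2000, name="intermediate_layer")(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(rate=0.5)(x)
# Project to 300 dimensions to match the word vectors
x = tf.keras.layers.Dense(300, name="embedding_layer")(x)
embed_output = tf.keras.layers.BatchNormalization()(x)

# The arguments are cut off in the source; inputs/outputs are assumed from the layers above
model = tf.keras.Model(inputs=pretrained_head.input, outputs=embed_output)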


To speed up image search, we will use nearest neighbor search (Nearest neighbor search, Wikipedia). Spotify has a nice library for this, spotify/annoy: Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk. Let’s create some utilities:

def create_word_index(file_path):
    """
    Function to create AnnoyIndex of glove word embeddings & index-to-word dict.
    In case they are already present, just load and return them.

    Args:
        file_path: Path to the glove txt file

    Returns:
        idx2word: dictionary mapping all indexes to words
        word_embedding_index: AnnoyIndex of glove word vectors
        word2idx: dictionary mapping all words to indexes
    """
    word_embedding_index = AnnoyIndex(f=300, metric="angular")  # Init the Annoy Index
    idx2word = {}  # Init the dict
    word2idx = {}
    # NOTE: the first condition and the load/build/save calls are cut off in the
    # source. CFG.WORD_INDEX_PATH and the tree count below are assumed names/values.
    if (
        CFG.WORD_INDEX_PATH.exists()
        and CFG.IDX2WORD_PATH.exists()
        and CFG.WORD2IDX_PATH.exists()
    ):  # If all files exist then load and return them
        word_embedding_index.load(CFG.WORD_INDEX_PATH.as_posix())
        with open(CFG.IDX2WORD_PATH, "r") as f:
            idx2word = json.load(fp=f)
        with open(CFG.WORD2IDX_PATH, "r") as f:
            word2idx = json.load(fp=f)
    else:  # Else create the AnnoyIndex as well as the JSON dicts and save them
        with open(file_path, encoding="utf-8", mode="r") as f:
            for idx, line in tqdm(
                enumerate(f),
                desc="Processing glove vectors: ",
            ):  # Loop over each line in the file
                word, coefs = line.split(
                    maxsplit=1
                )  # Split the word and the glove vectors.
                coefs = np.fromstring(
                    coefs, "f", sep=" "
                )  # Convert the glove vectors to numpy array
                # Add the vector to index
                word_embedding_index.add_item(i=idx, vector=coefs)
                # Map the index to word (string keys, matching what json.load returns)
                idx2word[str(idx)] = word

        word2idx = {v: int(k) for k, v in idx2word.items()}

        print("Building the index")
        word_embedding_index.build(20)  # number of trees: assumed value

        print("Saving to disk")
        word_embedding_index.save(CFG.WORD_INDEX_PATH.as_posix())

        with open(
            file=CFG.IDX2WORD_PATH, mode="w"
        ) as f:  # Save the json dict as well
            json.dump(obj=idx2word, fp=f)

        with open(
            file=CFG.WORD2IDX_PATH, mode="w"
        ) as f:  # Save the reverse-json dict as well
            json.dump(obj=word2idx, fp=f)

    # Return all of them
    return idx2word, word_embedding_index, word2idx

def create_image_embedding_index(paths, model):
    """
    Function to create AnnoyIndex of image embeddings & index-to-path dict.
    In case they are already present, just load and return them.

    Args:
        paths: Path to all the images to index
        model: the DL model

    Returns:
        idx2path: dictionary mapping all indexes to paths
        all_image_embeddings_indexed: AnnoyIndex of image embedding vectors
    """
    all_image_embeddings_indexed = AnnoyIndex(
        f=300, metric="angular"
    )  # Init the AnnoyIndex

    idx2path = {
        k: v
        for k, v in tqdm(
            enumerate(paths),
            desc="Creating index-to-path dict: ",
        )
    }  # Create the index to image path dict

    # NOTE: the existence check and the load/build/save calls are cut off in the
    # source. CFG.IMAGE_INDEX_PATH and the tree count below are assumed names/values.
    if CFG.IMAGE_INDEX_PATH.exists():
        all_image_embeddings_indexed.load(
            CFG.IMAGE_INDEX_PATH.as_posix()
        )  # If AnnoyIndex exists then just load it
    else:
        for idx, pth in tqdm(
            enumerate(paths), total=len(paths), desc="Indexing images: "
        ):  # Loop over all the image paths
            image = (
                Image.open(pth).convert("RGB").resize((224, 224))
            )  # Open, convert, resize
            image = np.array(image) / 255.0  # Convert to numpy array and normalize

            embeds = model(image[np.newaxis, :])  # Get the embedding from the model

            all_image_embeddings_indexed.add_item(
                idx, embeds.numpy().flatten()
            )  # Add it to the AnnoyIndex

        # Build the Index
        print("Building the index")
        all_image_embeddings_indexed.build(20)  # number of trees: assumed value

        print("Saving the index to disk...")
        all_image_embeddings_indexed.save(CFG.IMAGE_INDEX_PATH.as_posix())

    # Return them
    return idx2path, all_image_embeddings_indexed

def find_similar_tags(file, model, num_tags):
    """
    Function to find tags for an image.

    Args:
        file: The image file
        model: The DL model
        num_tags: Number of tags to create

    Returns:
        tags: Tags for the image
    """
    image = Image.open(file).convert("RGB").resize((224, 224))  # Open, convert, resize
    image = np.array(image) / 255.0  # Convert to numpy array & normalize

    embeds = model(image[np.newaxis, :])  # Get the image embedding from the model

    # The argument is cut off in the source; the GloVe text file from CFG is assumed
    idx2word, annoy_word_embed_idx, _ = create_word_index(
        file_path=CFG.GLOVE_EMBEDDINGS
    )  # Get the idx2word dict & AnnoyIndex

    idxs = annoy_word_embed_idx.get_nns_by_vector(
        vector=embeds.numpy().flatten(), n=num_tags, search_k=-1
    )  # Search for the nearest `num_tags` vectors

    tags = [idx2word[str(idx)] for idx in idxs]  # Get the tags

    # Return the tags
    return tags
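
The `idx2word[str(idx)]` lookup above is easy to trip over, so here is a small standalone sketch of why the string conversion matters once the dictionary has been through a JSON save/load cycle. The words and neighbour ids are made up.

```python
import json

idx2word = {0: "cat", 1: "dog"}  # integer keys while building the index

# A save/load round trip through JSON, done in memory here for brevity
restored = json.loads(json.dumps(idx2word))

# JSON object keys are always strings, so the integers come back as "0" and "1"
print(list(restored))

# Hypothetical neighbour ids; nearest neighbor libraries return plain integers
neighbour_ids = [1, 0]
tags = [restored[str(idx)] for idx in neighbour_ids]
print(tags)
```

A dictionary that was just built in memory still has integer keys, while one loaded from disk has string keys, so it helps to pick one convention and convert at the boundary.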

def find_similar_images(model, num_images, search_by_word=None, image_pth=None):
    idx2pth, ann_idx = create_image_embedding_index(paths=CFG.TEST_PATHS, model=model)

    if search_by_word is None:
        # Search by image: embed the query image and look up its nearest neighbours
        if image_pth is None:
            raise ValueError("An image needs to be provided")
        image = Image.open(image_pth).convert("RGB").resize((224, 224))
        image = np.array(image) / 255.0

        embeds = model(image[np.newaxis, :]).numpy().flatten()

        idxs = ann_idx.get_nns_by_vector(vector=embeds, n=num_images)

        paths = [idx2pth[idx] for idx in idxs]

        return paths

    # Search by text: use the word's GloVe vector as the query.
    # The argument is cut off in the source; the GloVe text file from CFG is assumed.
    _, word_ann, word2idx = create_word_index(file_path=CFG.GLOVE_EMBEDDINGS)

    word_embed = word_ann.get_item_vector(word2idx[search_by_word])

    idxs = ann_idx.get_nns_by_vector(vector=word_embed, n=num_images)

    paths = [idx2pth[idx] for idx in idxs]

    return paths

def plot_images(paths, ncols, nrows):
    fig, axs = plt.subplots(ncols=ncols, nrows=nrows)

    idx = 0
    for i in range(nrows):
        for j in range(ncols):
            # Open, convert, and resize the next image, then draw it without axes
            image = np.array(Image.open(paths[idx]).convert("RGB").resize((224, 224)))
            axs[i, j].imshow(image)
            axs[i, j].xaxis.set_visible(False)
            axs[i, j].yaxis.set_visible(False)
            idx += 1

    return fig
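
Both indexes above are created with `metric="angular"`. According to the Annoy README, that distance is the Euclidean distance between normalized vectors, i.e. sqrt(2 * (1 - cos)). The sketch below (plain Python, helper names are my own) shows how such a distance relates to the cosine similarity used in the rest of this post.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def angular_distance(a, b):
    """Annoy-style angular distance: sqrt(2 * (1 - cosine similarity))."""
    # Clamp to [-1, 1] to guard against floating-point drift
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.sqrt(2.0 * (1.0 - cos))


def cosine_from_angular(distance):
    """Invert the formula to recover cosine similarity from a distance."""
    return 1.0 - distance**2 / 2.0


same_direction = angular_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = angular_distance([1.0, 0.0], [0.0, 1.0])
opposite = angular_distance([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

So a smaller angular distance means a higher cosine similarity, and the ranking of neighbours is the same under either measure.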

Training the model

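# Load the datasets.
# The loader call, the prefetch buffer size, and the checkpoint arguments are cut
# off in the source; the calls and values used here are assumptions.
train_ds = tf.data.experimental.load("../data/Training")
train_ds = train_ds.batch(8).cache().prefetch(tf.data.AUTOTUNE)

val_ds = tf.data.experimental.load("../data/Validation")
val_ds = val_ds.batch(8).cache().prefetch(tf.data.AUTOTUNE)

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath="models/resnet50v2_model-{epoch:02d}.hdf5"
)

# The model.compile(...) step is not shown in the source; the model must be
# compiled with an optimizer and a loss before calling fit.
model.fit(train_ds, validation_data=val_ds, epochs=30, callbacks=[checkpoint])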

I trained the model for just over 2 hours, and it achieved a cosine similarity of 0.80.

Creating a Web-App

We will use Streamlit to create a web-app. Here’s the code:

import streamlit as st
from time import time
import tensorflow as tf

from src.utils import find_similar_images, find_similar_tags, plot_images

st.set_option("deprecation.showPyplotGlobalUse", False)

# Several widget calls are cut off in the source; st.radio, st.markdown,
# and st.write below are assumed from the arguments that survive.
radio = st.radio(
    label="Find images or tags ?", options=["Find images", "Find tags"]
)

st.markdown(
    "<h1><i><center>Find Images/Tags</center></i></h1>",
    unsafe_allow_html=True,
)

model = tf.keras.models.load_model(r"models\resnet50v2_model-30-0.80.hdf5")

if radio == "Find images":
    search = st.radio(
        label="Search by image/text ?", options=["Search by image", "Search by text"]
    )

    if search == "Search by image":
        file = st.file_uploader("Upload image!")

        if file is not None:
            st.image(file, use_column_width=True)
            button = st.button("Find images")

            if button:
                with st.spinner("Finding images....."):
                    start = time()
                    paths = find_similar_images(
                        model=model, num_images=9, search_by_word=None, image_pth=file
                    )
                    fig = plot_images(paths=paths, ncols=3, nrows=3)
                    end = time() - start
                st.success(f"Prediction done in: {end} secs")

                st.pyplot(fig, use_container_width=True)
    else:
        word = st.text_input(label="Enter the text to search")

        if word != "":
            button = st.button("Find images")

            if button:
                with st.spinner("Finding images....."):
                    start = time()
                    paths = find_similar_images(
                        model=model, num_images=9, search_by_word=word
                    )
                    fig = plot_images(paths=paths, ncols=3, nrows=3)
                    end = time() - start
                st.success(f"Prediction done in: {end} secs")

                st.pyplot(fig, use_container_width=True)
else:
    file = st.file_uploader(label="Upload an image")
    num_tags = st.number_input("Enter number of tags")

    if file:
        st.image(file, use_column_width=True)
        button = st.button("Find tags")

        if button:
            with st.spinner("Finding tags....."):
                start = time()
                tags = find_similar_tags(file=file, model=model, num_tags=int(num_tags))
                end = time() - start
            st.success(f"Prediction done in: {end} secs")

            st.write(tags)



With Annoy, inference takes less than 1 s locally.

Image Tagging

Searching similar images using text

Searching similar images using images


We explored the power of embeddings for image tagging and searching. If you have made it this far, I hope it was informative for you. Please let me know if there is something I can improve upon. The original article that inspired this post (Building an image search service from scratch | by Emmanuel Ameisen | Insight) explores several other ideas as well.

I am actively looking for data science opportunities. Please reach out to me on my LinkedIn: Abhishek Swain | LinkedIn if you can offer one.