Efficient vector similarity search with Redis: a step-by-step tutorial

9 minutes to read - Mar 30, 2023


The ability to search for information is crucial in today's digital landscape, with users expecting search functionality in nearly every application and website. To improve search results, architects and developers must continuously explore new methods and architectures. One approach is to use vector embeddings generated by deep learning models, which can improve the accuracy and relevance of search results.

Table of Contents

1. Let's get started!

2. Connect to Redis

3. Create embeddings

4. Preparing utility functions

5. Comparison of flat and HNSW indexing for approximate nearest neighbor search

6. Index & Query the new data

7. Query our FLAT index

8. Query our HNSW index

To this end, many organizations use embedding models to transform their data into a vector space and then index the resulting vectors. By representing data as vectors, it becomes possible to perform similarity searches that return the most relevant results.

In this tutorial, we will explore how deep learning models can be used to create vector embeddings, which can then be indexed for efficient and accurate search with the help of Redis. With a thorough understanding of this approach, architects and developers can better appreciate the potential of AI-powered search and find the best ways to improve the search experience for users.

We will go through the process of:

creating vector embeddings for an Amazon product dataset,

indexing them with Redis,

searching for similar vectors.

We will also explore the pros and cons of different indexing methods and how they can be used to improve search performance.

Let's get started!

Create a new directory and, inside it, a new Jupyter notebook. Get the dataset CSV file from here and store it in the ./data/ directory. We will be using Python 3.8. Install the following dependencies in the first cell:

After installing the dependencies, you can start by importing the necessary libraries and defining any necessary classes or functions. In this case, we can import the following libraries and define a color class for later use:

The Redis library is imported to interact with Redis, an in-memory data structure store often used as a database, cache, and message broker. We also import the following classes from the redis.commands.search.field, redis.commands.search.query, and redis.commands.search.result modules:

VectorField: used to represent vector fields in Redis, such as embeddings.

TextField: used to represent text fields in Redis.

TagField: used to represent tag fields in Redis.

Query: used to create search queries for Redis.

Result: used to represent search results returned by Redis.

In addition, we define a color class that can be used to print colored text to the console. The color class has several attributes, such as PURPLE, CYAN, BOLD, etc., which can be used to colorize text output in the console.
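A minimal sketch of such a color class, assuming standard ANSI escape sequences. Only PURPLE, CYAN, and BOLD are named in the text; the remaining attributes and the `styled` helper are illustrative additions, not part of the original notebook.

```python
class color:
    """ANSI escape sequences for styling terminal output."""
    PURPLE = "\033[95m"
    CYAN = "\033[96m"
    BLUE = "\033[94m"       # assumption: not named in the tutorial text
    GREEN = "\033[92m"      # assumption
    YELLOW = "\033[93m"     # assumption
    RED = "\033[91m"        # assumption
    BOLD = "\033[1m"
    UNDERLINE = "\033[4m"   # assumption
    END = "\033[0m"         # resets all styling


def styled(text: str, *styles: str) -> str:
    """Hypothetical helper: wrap text in the given codes and reset afterwards."""
    return "".join(styles) + text + color.END


print(styled("item_name:", color.BOLD, color.CYAN), "example product")
```

Always ending with `color.END` keeps the styling from leaking into later output in the same cell.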

The next step involves loading Amazon product data into a Pandas DataFrame, and truncating long text fields to a maximum length of 512 characters, which is the maximum length supported by the pre-trained sentence embedding generator that we will be using later on.

Here's the cell code to load the product data and truncate long text fields:

The code loads the product data from a CSV file, truncates long text fields, adds a new primary_key column to the DataFrame, filters out any products without keywords, and extracts the metadata for the first 1000 products with non-empty item keywords and stores it in a dictionary called product_metadata.
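The steps described above can be sketched as follows. This is an approximation under stated assumptions rather than the original cell: the `item_id` and `domain_name` columns used to build `primary_key`, the choice of text columns, and the `prepare_products` function name are all hypothetical, and the function takes an already loaded DataFrame instead of reading the CSV itself.

```python
import pandas as pd

MAX_TEXT_LENGTH = 512    # character cap described in the text
NUMBER_PRODUCTS = 1000   # number of products kept for indexing


def prepare_products(df: pd.DataFrame,
                     text_columns=("item_name", "item_keywords"),
                     limit: int = NUMBER_PRODUCTS) -> dict:
    """Truncate text, add primary_key, drop rows without keywords, return metadata."""
    df = df.copy()
    for col in text_columns:
        # Cut every non-missing string down to MAX_TEXT_LENGTH characters.
        df[col] = df[col].astype("string").str.slice(0, MAX_TEXT_LENGTH)
    # ASSUMPTION: primary_key combines hypothetical item_id and domain_name columns.
    df["primary_key"] = df["item_id"].astype(str) + "-" + df["domain_name"].astype(str)
    # Treat missing and blank keywords alike, then drop those products.
    keywords = df["item_keywords"].fillna("").str.strip()
    df = df[keywords.ne("").astype(bool)].reset_index(drop=True)
    # One dict per product, keyed by the new row index 0..limit-1.
    return df.head(limit).to_dict(orient="index")
```

With a real file, this would be called as `product_metadata = prepare_products(pd.read_csv(path))`, where `path` points at the downloaded CSV in ./data/.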

Let's take a look at the first few rows of the product data:

Connect to Redis

After loading the product data into a Pandas DataFrame and extracting metadata for the first 1000 products, the next step is to connect to Redis. We will be using a cloud-hosted Redis instance provided by RedisLabs, which offers a free tier. You can sign up for a free account at redis.com/try-free/.

Spin up a new Redis instance and copy its connection details (host and port). You will also need to provide a password when connecting. You can find it on the connection details page; it is the password of the default user.

Create embeddings

Using SentenceTransformer

The next step is to generate embeddings (vectors) for the item keywords using a pre-trained Sentence Transformer model called distilroberta-v1, which is available from the Sentence Transformers library. We will use the SentenceTransformer class to load the model and generate the embeddings.

Then we will check the dimensions of the embeddings:
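A hedged sketch of both steps. The helpers below accept any object with an `encode` method, so they can be exercised without downloading a model; `embed_keywords` and `embedding_dimension` are hypothetical names, and the full model identifier in the comment is an assumption based on the model name given above.

```python
from typing import Any, Dict, List, Sequence


def embed_keywords(model: Any, product_metadata: Dict[int, dict],
                   text_field: str = "item_keywords") -> Dict[int, Sequence[float]]:
    """Encode one text field per product and key the vectors by product index."""
    indexes: List[int] = list(product_metadata.keys())
    texts = [product_metadata[i][text_field] for i in indexes]
    vectors = model.encode(texts)  # one vector per input text, in input order
    return dict(zip(indexes, vectors))


def embedding_dimension(vectors: Dict[int, Sequence[float]]) -> int:
    """Return the shared vector length, failing loudly if lengths differ."""
    dims = {len(v) for v in vectors.values()}
    if len(dims) != 1:
        raise ValueError(f"expected exactly one dimension, found {sorted(dims)}")
    return dims.pop()


# Intended use (downloads the model on first run; the model id is an assumption):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("sentence-transformers/all-distilroberta-v1")
# item_keywords_vectors = embed_keywords(model, product_metadata)
# print(embedding_dimension(item_keywords_vectors))
```

If the printed dimension differs from the `vector_dimensions` default used when creating an index, pass the printed value explicitly, because the index DIM must match the stored vectors.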

Preparing utility functions

We now have 1000 embeddings, one for each of the 1000 products. Next, we will define three utility functions: one for loading the product data into Redis and two for creating an index on the vector field.

def load_vectors(client: Redis, product_metadata, vector_dict, vector_field_name):
    # Queue all writes in a non-transactional pipeline and send them in one batch.
    p = client.pipeline(transaction=False)
    for index in product_metadata.keys():
        # hash key
        key = 'product:' + str(index) + ':' + product_metadata[index]['primary_key']
        # hash values
        item_metadata = product_metadata[index]
        item_keywords_vector = vector_dict[index].astype(np.float32).tobytes()
        item_metadata[vector_field_name] = item_keywords_vector
        # HSET
        p.hset(key, mapping=item_metadata)
    p.execute()

def create_flat_index(redis_conn, vector_field_name, number_of_vectors, vector_dimensions=512, distance_metric='L2'):
    redis_conn.ft().create_index([
        VectorField(vector_field_name, "FLAT", {
            "TYPE": "FLOAT32",
            "DIM": vector_dimensions,
            "DISTANCE_METRIC": distance_metric,
            "INITIAL_CAP": number_of_vectors,
            "BLOCK_SIZE": number_of_vectors,
        }),
        TagField("product_type"),
        TextField("item_name"),
        TextField("item_keywords"),
        TagField("country"),
    ])

def create_hnsw_index(redis_conn, vector_field_name, number_of_vectors, vector_dimensions=512, distance_metric='L2', M=40, EF=200):
    redis_conn.ft().create_index([
        VectorField(vector_field_name, "HNSW", {
            "TYPE": "FLOAT32",
            "DIM": vector_dimensions,
            "DISTANCE_METRIC": distance_metric,
            "INITIAL_CAP": number_of_vectors,
            "M": M,
            "EF_CONSTRUCTION": EF,
        }),
        TagField("product_type"),
        TextField("item_keywords"),
        TextField("item_name"),
        TagField("country"),
    ])

Comparison of flat and HNSW indexing for approximate nearest neighbor search

Flat indexing and HNSW are both methods for nearest neighbor search in high-dimensional spaces, but they differ in how they construct and search the index: a flat index performs an exact search, while HNSW performs an approximate one.

Flat indexing is a straightforward approach where all vectors are stored in a single flat structure. To find the nearest neighbors of a query point, a brute-force search computes the distance between the query point and every point in the index, so the results are exact. While simple and easy to implement, the cost of each query grows linearly with the number of vectors, which makes flat indexing computationally expensive and often impractical for large datasets.
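The brute-force idea can be illustrated in a few lines of plain Python. This is a conceptual sketch of exact k-nearest-neighbor search with the L2 distance, not how Redis implements its FLAT index; the function names are hypothetical.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(vectors: Sequence[Sequence[float]],
                query: Sequence[float], k: int = 5) -> List[Tuple[int, float]]:
    """Exact k-NN: score every stored vector, keep the k smallest distances."""
    scored = ((l2_distance(v, query), i) for i, v in enumerate(vectors))
    return [(i, d) for d, i in heapq.nsmallest(k, scored)]


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
for position, distance in flat_search(vectors, [0.9, 0.0], k=2):
    print(position, round(distance, 3))
```

Every query touches every stored vector, which is why the cost grows linearly with the size of the index.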

HNSW (Hierarchical Navigable Small World), on the other hand, is a more complex indexing algorithm that organizes data points into a multi-layer graph. Each point is connected to a limited number of nearby points. The upper layers contain only a sparse subset of the points and provide long-range links, while the bottom layer contains every point. A search starts at the top layer and greedily moves toward the query, descending layer by layer, so it visits only a small subset of the graph and returns approximate nearest neighbors quickly.
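To give a feel for graph-based search, here is a deliberately simplified sketch: greedy routing on a single-layer neighbor graph. It is not HNSW itself, which adds multiple layers, candidate lists, and neighbor-selection heuristics, and the graph here is built by brute force purely for illustration. All names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_knn_graph(points: Sequence[Sequence[float]], m: int = 2) -> Dict[int, List[int]]:
    """Link every point to its m closest points; links are kept in both directions."""
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        for j in sorted(others, key=lambda j: l2(p, points[j]))[:m]:
            graph[i].add(j)
            graph[j].add(i)
    return {i: sorted(neighbors) for i, neighbors in graph.items()}


def greedy_search(points, graph, query, entry: int = 0) -> int:
    """Move to the neighbor closest to the query until no neighbor improves."""
    current, best = entry, l2(points[entry], query)
    while True:
        candidate = min(graph[current], key=lambda j: l2(points[j], query), default=None)
        if candidate is None or l2(points[candidate], query) >= best:
            return current  # local minimum: the approximate nearest neighbor
        current, best = candidate, l2(points[candidate], query)


points = [[float(x)] for x in range(10)]   # ten points on a line
graph = build_knn_graph(points, m=2)
print(greedy_search(points, graph, [7.2]))
```

The walk only evaluates the neighbors of the points it passes through, which is the source of the speed-up. It can also stop in a local minimum, which is why results from graph-based indexes are approximate.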

The key advantage of HNSW over flat indexing is its ability to scale to large datasets and high-dimensional spaces. However, HNSW returns approximate rather than exact results, requires more careful tuning of its parameters (such as M and EF_CONSTRUCTION in the function above), and may have a higher index construction time than flat indexing.

Index & Query the new data

Next, we will load and index the product data using a flat index first.

Query our FLAT index

After this, we can query our index and search for the top 5 (topK) nearest neighbors of a given query vector: