There are many different ways to create text embeddings, but the most common is to use a neural network. A neural network is a machine learning algorithm that is very good at learning complex relationships. The input to a neural network is a vector, and the output is a vector of the same size. The neural network learns to map the input vectors to the output vectors in a way that captures the relationships between the inputs and outputs.
To create text embedding, the neural network is first trained on a large corpus of text. The training data is a set of sentences, and each sentence is represented as a vector. The vectors are created by taking the word vectors of the words in the sentence and summing them together. The neural network is then trained to map the sentence vectors to a fixed vector size.
Once the neural network has been trained, it can then be used to create text embeddings for new pieces of text. The new text is first represented as a vector, and then the neural network is used to map the vector to the fixed vector size. The result is a text embedding that captures the meaning of the text.
Text embeddings can be used for a variety of machine learning tasks. For example, they can be used to improve the performance of a machine learning algorithm that is used to classify texts. Text embeddings can also be used to find similar pieces of text, or to cluster texts together.
There are many different ways to create text embeddings, and the choice of method will depend on the application. However, neural networks are a powerful and widely used method for creating text embeddings.
Co:here is a powerful neural network, which can generate, embed, and classify text. In this tutorial, we will use Co:here to embed descriptions. To use Co:here you need to create account on Co:here and get API key.
We will be programming in Python, so we need to install cohere library by pip
Firstly, we have to implement cohere.Client. In arguments of Client should be API key, which you have generated before, and version 2021-11-08. I will create the class CoHere, it will be useful in the next steps.
The main part of each neural network is a dataset. In this tutorial, I will use a dataset that includes 1000 descriptions of 10 classes. If you want to use the same, you can download it here.
The downloaded dataset has 10 folders in each folder is 100 files.txt with descriptions. The name of files is a label of description, e.g.sport_3.txt.
We will compare Random Forest with Co:here Classifier, so we have to prepare data in two ways. For Random Forest we will use Co:here Embedder, we will focus on this in this tutorial. Cohere classifier requires samples, in which each sample should be designed as a list [description, label] and I did it in my previous tutorial (here)
In the beginning, we need to load all data, to do that. We create the function load_examples. In this function we will use three external libraries:
os.path to go into the folder with data. The code is executed in a path where is a python's file.py. This is an internal library, so we do not need to install it.
numpy this library is useful to work with arrays. In this tutorial, we will use it to generate random numbers. You have to install this library by pip pip install numpy.
glob helps us to read all files and folder names. This is an external library, so the installation is needed - pip install glob.
The downloaded dataset should be extracted in the folder data. By os.path.join we can get universal paths of folders.
In windows, a return is equal to data\*.
Then we can use glob method to get all names of folders.
folders_name is a list, which contains window paths of folders. In this tutorial, these are the names of labels.
Size of Co:here training dataset can not be bigger than 50 examples and each class has to have at least 5 examples, but for Random Forest we can use 1000 examples. With loop for we can get the names of each file. The entire function looks like that:
The last loop is taking randomly N-paths of each label and appending them into a new list examples_path.
Now, we have to create a training set. To make it we will load examples with load_examples(). In each path is the name of a class, we will use it to create samples. Descriptions need to be read from files, a length can not be long, so in this tutorial, the length will be equal to 100. To list texts is appended list of [descroption, class_name]. Thus, a return is that list.
We back to CoHere class. We have to add one method - to embed examples.
The second cohere method is to embed text. The method has serval arguments, such as:
model size of a model.
texts list of texts to embed.
truncate if the text is longer than the available token, which part of the text should be taken LEFT, RIGHT or NONE.
All of them you can find here.
In this tutorial, the cohere method will be implemented as a method of our CoHere class.
X_train_embeded will be an array of numbers, which looks like that:
To create an application, which will be comparison of two likelihood displays, we will use Stramlit. This is an easy and very useful library.
Installation
We will need text inputs for co:here API key.
In docs of streamlit we can find methods:
st.header() to make a header on our app
st.test_input() to send a text request
st.button() to create button
st.write() to display the results of cohere model.
st.progress() to display a progrss bar
st.column() to split an app
To run the streamlit app use command
The created app looks like this
Text embedding is a powerful tool that can be used to improve the performance of machine learning algorithms. Neural networks are a widely used and effective method for creating text embeddings. Text embeddings can be used for tasks such as text classification, text similarity, and text clustering.
In this tutorial, we compare the Random Forest with Co:here Classifier, but the possibilities of Co:here Embedder are huge. You can build a lot of stuff with it.