What are Vector Embeddings and Vector Databases?

Why Rememberizer is more than just a database or keyword search engine.

Rememberizer uses vector embeddings in vector databases to enable searches for semantic similarity within user knowledge sources. This is a fundamentally more advanced and nuanced form of information retrieval than simply looking for keywords in content through a search engine or database.
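To illustrate the difference, here is a toy sketch (not Rememberizer's implementation): keyword search only matches literal tokens, while embedding search ranks documents by how close their vectors lie to the query's vector. The three-number vectors below are made up for readability; real embeddings have hundreds of dimensions.

```python
import numpy as np

documents = {
    "doc_a": "Steps to recover access to your account",
    "doc_b": "Quarterly password policy review meeting notes",
}
query = "How do I reset my password?"

# Keyword search: only doc_b contains the literal token "password",
# even though doc_a is the answer the user actually wants.
print([d for d, text in documents.items() if "password" in text.lower()])  # ['doc_b']

# Semantic search: compare (made-up) embedding vectors by cosine similarity instead.
vectors = {
    "query": np.array([0.9, 0.1, 0.2]),
    "doc_a": np.array([0.8, 0.2, 0.3]),   # close in meaning to the query
    "doc_b": np.array([0.1, 0.9, 0.4]),   # shares a keyword, not the intent
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for doc in ("doc_a", "doc_b"):
    print(doc, round(cosine(vectors["query"], vectors[doc]), 3))
# doc_a scores higher despite sharing no keywords with the query.
```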

In their most advanced form (as used by Rememberizer), vector embeddings are created by language models with architectures similar to the AI LLMs (Large Language Models) that underpin OpenAI's GPT models and the ChatGPT service, as well as models/services from Google (Gemini), Anthropic (Claude), Facebook (Llama 2) and others. For this reason it is natural to use vector embeddings to discover relevant knowledge to include in the context of AI model prompts. The technologies are complementary and built on much of the same underlying machinery. This is also why most providers of LLMs as a service also produce vector embeddings as a service (for example: a blog from Together AI or another blog from OpenAI).
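A minimal sketch of that retrieval pattern: embed the user's question, fetch the most similar passages from a vector database, and paste them into the LLM prompt as context. The `embed`, `vector_db.search`, and `llm.complete` names are provider-agnostic placeholders, not any specific vendor's API.

```python
def answer_with_context(question: str, vector_db, embed, llm, k: int = 3) -> str:
    """Retrieval-augmented prompting: embeddings find the relevant knowledge,
    the LLM reasons over it. All three callables are placeholders for whichever
    embeddings service, vector database, and LLM service you use."""
    query_vector = embed(question)                  # embeddings-as-a-service call
    passages = vector_db.search(query_vector, k)    # nearest-neighbor lookup
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.complete(prompt)                     # LLM-as-a-service call
```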

What does a vector embedding look like? Consider a coordinate (x,y) in two dimensions. If it represents a line from the origin to this point, we can think of it as a line with a direction, in other words a vector in two dimensions. In our context, a vector embedding will be a list of something like 768 numbers representing a vector in a 768-dimensional space. Ultimately this list of numbers captures, through the internal representations of a Transformer model, the meaning of a phrase such as "A bolt of lightning out of the blue." This is fundamentally the same underlying representation of meaning used in GPT-4, for example. As a result, we can expect a good vector embedding to enable the same brilliant apparent understanding that we see in modern AI language models.
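To make the geometry concrete, here is a small sketch with made-up numbers: a 2D coordinate treated as a vector from the origin has a length and a direction, and exactly the same arithmetic works on a list of 768 numbers, which is all a vector embedding is.

```python
import numpy as np

# A 2-dimensional point, treated as a vector: a line from the origin with a direction.
v2 = np.array([3.0, 4.0])
print(np.linalg.norm(v2))          # length of the vector: 5.0

# An embedding is the same idea with many more numbers, e.g. 768 of them.
v768 = np.random.default_rng(0).normal(size=768)
print(v768.shape)                  # (768,)

# Similarity of meaning is measured by the angle between two vectors
# (cosine similarity); the formula is identical in 2 or 768 dimensions.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(v2, np.array([6.0, 8.0])))   # 1.0: same direction, so "same meaning"
```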

It is worth noting that vector embeddings can represent more than just text; they can also represent other types of data such as images or sound. And with a properly trained model one can compare across media, so that a vector embedding of a block of text can be compared to an image, or vice versa. Today Rememberizer enables searches within just the text component of user documents and knowledge, but text-to-image and image-to-text search are on the roadmap. Google uses vector embeddings to power their text search (text-to-text) and also their image search (text-to-image) (reference). Facebook has contemplated using embeddings for their social network search (reference). Snapchat uses vector embeddings to understand context in order to serve the right ad to the right user at the right time (reference).
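For a sense of how cross-media comparison works, here is a hedged sketch using a CLIP-style model through the sentence-transformers library. The model name and image path are only examples, and this is not how Rememberizer works today (its search is text-only, as noted above).

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# A CLIP-style model maps text and images into the same embedding space,
# so a text vector can be compared directly to an image vector.
model = SentenceTransformer("clip-ViT-B-32")   # example model name

image_embedding = model.encode(Image.open("dog_in_park.jpg"))   # example image path
text_embeddings = model.encode([
    "a dog playing in a park",
    "a quarterly sales report",
])

# Higher cosine similarity means the caption better describes the image.
print(util.cos_sim(image_embedding, text_embeddings))
```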

To deeply understand how vector embeddings and vector databases work, start with the overview from Hugging Face. Pinecone (a vector embedding database as a service) has a good overview as well.

Another great source for understanding search over vectors is the Meta/Facebook paper and code for the FAISS library. "FAISS: A Library for Efficient Similarity Search and Clustering of Dense Vectors" by Johnson, Douze, and Jégou (2017) provides a comprehensive overview of a library designed for efficient similarity search and clustering of dense vectors. It discusses methods for optimizing the indexing and search processes in large-scale vector databases, including those based on Product Quantization. The best place to learn more is the documentation along with the code on GitHub.
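A minimal FAISS sketch of the basic workflow: build an index over document vectors, then retrieve the nearest neighbors of a query vector. The random vectors here are stand-ins for real embeddings, and the sizes are illustrative.

```python
import faiss
import numpy as np

d = 768                                           # embedding dimensionality
rng = np.random.default_rng(0)
document_vectors = rng.random((10_000, d), dtype=np.float32)  # stand-ins for real embeddings
query_vectors = rng.random((1, d), dtype=np.float32)

index = faiss.IndexFlatL2(d)                      # exact (brute-force) L2 index
index.add(document_vectors)                       # index the document embeddings
distances, ids = index.search(query_vectors, 5)   # 5 nearest neighbors per query
print(ids[0], distances[0])
```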

Be sure to consider the June 2017 paper that started the genAI (generative artificial intelligence) revolution, "Attention Is All You Need" (reference), which introduces the Transformer architecture behind GPT models and all the LLMs that followed from OpenAI, Google, Meta (Facebook), Nvidia, Microsoft, IBM, Anthropic, Mistral, Salesforce, xAI (Elon Musk), Stability AI, Cohere, and many others, including open-source projects. Consider also "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality" (reference 1998, reference 2010). These papers discuss the theory behind approximate nearest neighbor (ANN) search in high-dimensional spaces, a core concept in vector databases for efficiently retrieving similar items.
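To illustrate the approximate-nearest-neighbor idea those papers formalize, here is another hedged FAISS sketch: an IVF index clusters the vectors and searches only a few clusters per query, trading a little recall for much faster lookups. The vectors are random stand-ins and the nlist/nprobe values are illustrative.

```python
import faiss
import numpy as np

d, nlist = 768, 128                           # dimensionality, number of clusters
rng = np.random.default_rng(0)
xb = rng.random((10_000, d), dtype=np.float32)
xq = rng.random((1, d), dtype=np.float32)

# Exact search, used as the reference answer.
exact = faiss.IndexFlatL2(d)
exact.add(xb)
_, exact_ids = exact.search(xq, 5)

# Approximate search: cluster the vectors (IVF), then probe only a few clusters.
quantizer = faiss.IndexFlatL2(d)
approx = faiss.IndexIVFFlat(quantizer, d, nlist)
approx.train(xb)                              # learn the cluster centroids
approx.add(xb)
approx.nprobe = 8                             # search 8 of the 128 clusters
_, approx_ids = approx.search(xq, 5)

# Recall of the approximate result against the exact one.
print(len(set(approx_ids[0]) & set(exact_ids[0])) / 5)
```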

One exciting thing about these Transformer-based models is that the more data they are trained on and the larger they become (more parameters), the better their understanding and capabilities. OpenAI first noticed this when they trained their GPT-2 model. Realizing this potential, they immediately stopped being an open-source-oriented non-profit and became a closed-source, for-profit company focused on producing GPT-3, GPT-4 and their famous front end, ChatGPT. Interestingly, Google owns the patent on this technology; it was their researchers who were behind Transformers and "Attention Is All You Need" (reference). ChatGPT begs to differ a bit about my characterization, writing that "The narrative around OpenAI's transition from an open-source-oriented non-profit to a closed-source for-profit entity simplifies a complex evolution. OpenAI's shift included a focus on safety and responsible AI development alongside commercialization aspects. It's also worth noting that while OpenAI has prioritized developing proprietary technology like GPT-3 and beyond, it continues to engage with the research community through publications and collaborations."

BERT language models are based on Transformers and are often used in advanced vector embedding engines. BERT was introduced in the 2018 paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (reference). BERT (Bidirectional Encoder Representations from Transformers) marked a significant shift towards pre-trained models that can be fine-tuned for a wide range of NLP tasks. Its innovative use of bidirectional training and the Transformer architecture set new standards for model performance across numerous benchmarks. Earlier innovative methods for creating vector embeddings were introduced by GloVe (2014, Stanford) and Word2Vec (2013, Google). "GloVe: Global Vectors for Word Representation" (reference) proposed a new global log-bilinear regression model for the unsupervised learning of word representations, combining the benefits of the two main approaches to embedding: global matrix factorization and local context window methods. "Efficient Estimation of Word Representations in Vector Space" (reference) introduced Word2Vec, a groundbreaking approach to generating word embeddings. Word2Vec models, including the Continuous Bag of Words (CBOW) and Skip-Gram models, are pivotal in the evolution of word embeddings.
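As a concrete, hedged example of a BERT-derived embedding engine, the sentence-transformers library wraps such models behind a simple encode call; the model name below is just one commonly used example, not necessarily what Rememberizer uses.

```python
from sentence_transformers import SentenceTransformer, util

# A BERT-derived sentence embedding model (example model name).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "A bolt of lightning out of the blue.",
    "A sudden and unexpected event.",
    "Quarterly revenue grew by three percent.",
]
embeddings = model.encode(sentences)          # one dense vector per sentence
print(embeddings.shape)                       # e.g. (3, 384) for this model

# Semantically related sentences end up close together in the embedding space.
print(util.cos_sim(embeddings[0], embeddings[1]))   # higher similarity
print(util.cos_sim(embeddings[0], embeddings[2]))   # lower similarity
```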
