8/25/2024

Persisting Data with Embeddings Using LangChain Chroma

In the world of AI & machine learning, especially when dealing with Natural Language Processing (NLP), the management of data is critical. One innovative tool that's gaining traction is LangChain. It provides a comprehensive framework for developing applications powered by language models, and its integration with Chroma has revolutionized how we handle embeddings and persistence of data. This blog post delves into the intricacies of using LangChain with Chroma to persist data through embeddings and effectively manage your information storage needs.

What is Chroma?

Chroma is an AI-native open-source vector database that emphasizes developer productivity & happiness. It allows you to efficiently store & manage embeddings, making it easier to execute queries on unstructured data. Chroma’s architecture supports modern-day applications that require fast & scalable solutions for complex data retrieval tasks. Integrating Chroma with LangChain enhances its capabilities in managing & querying embeddings, setting the stage for more advanced applications.

Understanding the Concept of Embeddings

Before diving into the technical details, let's clarify what embeddings are. Embeddings convert words or phrases into numerical vectors, capturing semantic meaning. This representation allows algorithms to understand text in mathematical terms rather than just as strings of characters. For instance, the words 'king' and 'queen' sit close together in vector space, reflecting their related meanings.
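
You can see this for yourself with a small experiment. The sketch below assumes the sentence_transformers library (installed in the next section) and an open-source MiniLM model:

from sentence_transformers import SentenceTransformer, util

# Encode a few words into vectors and compare them
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["king", "queen", "bicycle"])

print(util.cos_sim(vectors[0], vectors[1]))  # 'king' vs 'queen': relatively high similarity
print(util.cos_sim(vectors[0], vectors[2]))  # 'king' vs 'bicycle': noticeably lower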

Setting Up Your Environment

To start with LangChain & Chroma, you need to set up your Python environment. Here’s how you can get started:
pip install openai langchain sentence_transformers chromadb -q

This command installs all the libraries you need to work with embeddings in LangChain and Chroma!

Basic Initialization

Once you have the necessary libraries installed, you can initialize your Chroma DB like so:
import os
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

os.environ["OPENAI_API_KEY"] = "your_api_key"

# The embedding function Chroma will use for every document and query
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# persist_directory tells Chroma where to write data so it survives restarts
vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db"
)

In this example, we've set up a connection to Chroma, prepared our OpenAI embeddings, & specified a directory for persistent storage.
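
Because persist_directory is set, the data survives restarts: in a later session you can reload the same collection simply by constructing the store with the same name and path. A minimal sketch:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Re-create the embedding function, then point Chroma at the same directory
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_langchain_db"
)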

Loading and Splitting Documents

Next, let’s load some documents & split them into manageable chunks. This is crucial because it allows us to embed text in a way that each piece retains its context, which results in better performance when querying later.
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Specify the directory containing documents
directory = '/path/to/documents'

def load_docs(directory):
    loader = DirectoryLoader(directory)
    documents = loader.load()
    return documents

documents = load_docs(directory)

# Split the loaded documents
def split_docs(docs, chunk_size=1000, chunk_overlap=20):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return text_splitter.split_documents(docs)

chunked_docs = split_docs(documents)

This code snippet demonstrates how to load a directory of documents and split them into chunks that can be embedded into your vector database.
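
To get a feel for what the splitter produced, you can inspect the chunks; because of chunk_overlap, the tail of one chunk roughly reappears at the head of the next:

# Inspect the chunking result
print(f"{len(documents)} documents became {len(chunked_docs)} chunks")
print(chunked_docs[0].page_content[-50:])  # tail of the first chunk...
print(chunked_docs[1].page_content[:50])   # ...roughly overlaps the head of the second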

Creating Embeddings

Now that we have our chunked documents, it’s time to convert these into embeddings. This process translates your documents into a format that can be understood by machine learning models, allowing for efficient querying.
from langchain.embeddings import SentenceTransformerEmbeddings

# Define the embedding function
def create_embeddings(docs):
    embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
    # embed_documents expects raw strings, so pull the text out of each Document
    return embeddings.embed_documents([doc.page_content for doc in docs])

# Create embeddings from the chunked documents
embedded_docs = create_embeddings(chunked_docs)

With this setup, your documents are transformed into embeddings ready to be stored in Chroma.
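
As a quick sanity check, each chunk should now be a fixed-length list of floats; all-MiniLM-L6-v2 produces 384-dimensional vectors:

print(len(embedded_docs))     # one vector per chunk
print(len(embedded_docs[0]))  # 384 dimensions for all-MiniLM-L6-v2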

Adding Documents to the Vector Store

Once you have your embeddings, it’s time to insert them into the Chroma database. This step completes the process of persisting your data.
# The store embeds each chunk with its configured embedding function
vector_store.add_documents(documents=chunked_docs)
vector_store.persist()  # Ensure data is saved

In this case, the add_documents call takes your chunked Document objects and embeds them with the embedding function the store was configured with. (To store the SentenceTransformer vectors from the previous section instead, pass that model as the embedding_function when constructing the store.) The persist call ensures that this data remains accessible even after your program ends.
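
One judgment call worth making here: if your ingestion script runs more than once, passing stable IDs keeps the same chunks from piling up in the collection. The doc-{i} naming below is just an illustrative assumption; in practice, derive IDs from your source files:

# Assign a stable, deterministic ID to each chunk (hypothetical scheme)
ids = [f"doc-{i}" for i in range(len(chunked_docs))]
vector_store.add_documents(documents=chunked_docs, ids=ids)

Depending on your Chroma version, re-adding with the same IDs either updates the existing records or flags them as duplicates rather than silently doubling your data.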

Querying Data from Chroma

The real power of embeddings comes into play when you need to query this information. For instance, suppose you want to find documents relevant to a specific question.
query = "What are the benefits of using embeddings in machine learning?"
results = vector_store.similarity_search(query)

for res in results:
    print(res.page_content)

Here, the similarity_search function queries the vector store for content similar to your specified query. It's like asking Chroma, "Hey, what should I know about embeddings?"
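
If you also want to know how close each match is, Chroma exposes similarity_search_with_score, which returns a distance alongside each document (lower means more similar), and the k parameter controls how many results come back:

# Fetch the top 3 matches along with their distance scores
results = vector_store.similarity_search_with_score(query, k=3)

for doc, score in results:
    print(f"{score:.4f}  {doc.page_content[:80]}")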

Enhancing Data Management with LangChain

LangChain allows for better management of your data with a combination of tools that enhance persistence, retrieval, & analytics. For instance, you can connect your vector store to an OpenAI LLM so that retrieved documents flow directly into conversational AI applications. Here's how you can set that up:

Using RetrievalQA Chain for Query Processing

You can utilize the RetrievalQA class from LangChain to connect your database directly to an LLM.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),  # an LLM, not an embedding model; use any LLM you prefer
    chain_type="stuff",
    retriever=vector_store.as_retriever()
)

# Ask a question
response = qa.run(query)
print(response)

This setup allows you to get answers grounded in your stored data without writing any retrieval plumbing yourself. Simply pass a query & the chain fetches the relevant chunks and hands them to the LLM!
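
If you also want to see which chunks an answer was grounded in, the chain can return its source documents. The model name below is an assumption; substitute whichever LLM you prefer:

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),  # hypothetical model choice
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 4}),  # retrieve the top 4 chunks
    return_source_documents=True
)

result = qa({"query": query})
print(result["result"])                  # the generated answer
for doc in result["source_documents"]:   # the chunks the answer drew on
    print(doc.metadata)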

Monitoring & Managing Your Database

It's crucial to have insight into how your data is performing. Keep an eye on collection size, query latency, & the relevance of returned results, and adjust your chunking or indexing strategy based on this feedback. Collecting insights from user interactions lets you refine your application over time.
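
Chroma itself doesn't ship a monitoring dashboard, but you can inspect a persisted collection directly with the chromadb client (this assumes chromadb 0.4+, where PersistentClient is available):

import chromadb

# Open the on-disk store and check how many embeddings it holds
client = chromadb.PersistentClient(path="./chroma_langchain_db")
collection = client.get_collection("example_collection")
print(collection.count())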

The Power of Arsturn

Once you've embraced the world of embeddings and vector databases, consider taking the next step with your AI solutions. Watch your engagement thrive with Arsturn! Creating custom chatbots for your website has never been easier, and no coding skills are required. With our platform:
  • Effortlessly create customizable chatbots and save time on operational tasks.
  • Enhance customer engagement through powerful conversation analytics and instant responses.
  • Connect seamlessly with your audience using AI, leveraging your own data.

Conclusion

Persisting data with embeddings using LangChain Chroma is a game-changer for developers working with AI & NLP. The integration of these tools helps create a sophisticated, intelligent product that not only stores data but retrieves it efficiently and effectively when needed. By leveraging this technology, you can build applications that respond dynamically to user queries based on rich datasets, enhancing your user experience significantly. Start building your intelligent applications today with LangChain & Chroma, and boost your engagement with a powerful chatbot from Arsturn to connect better with your audience!
Happy coding!

Copyright © Arsturn 2024