Step 1: How to Choose an Embeddings Model
If possible, we recommend using voyage-code-3, which will give the most accurate answers of any existing embeddings model for code. You can obtain an API key here. Because their API is OpenAI-compatible, you can use any OpenAI client by swapping out the URL.
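For example, generating embeddings through the OpenAI Python client might look like the following sketch; the base_url shown is an assumption, so confirm the current endpoint in Voyage AI’s documentation:

```python
# A sketch of generating embeddings with voyage-code-3 via the OpenAI client.
# The base_url is an assumption; confirm the endpoint in Voyage AI's docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_VOYAGE_API_KEY",
    base_url="https://api.voyageai.com/v1",
)

response = client.embeddings.create(
    model="voyage-code-3",
    input=["def binary_search(arr, target): ..."],
)
embedding = response.data[0].embedding  # a list of floats
```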
Step 2: How to Choose a Vector Database
There are a number of available vector databases, but because most of them can performantly handle large codebases, we recommend choosing one for ease of setup and experimentation. LanceDB is a good choice here because it runs in-process, with libraries for both Python and Node.js, which means that in the beginning you can focus on writing code rather than setting up infrastructure. If you have already chosen a vector database, then using it instead of LanceDB is also a fine choice.
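As a minimal sketch of getting started with LanceDB (the table name, fields, and vector size here are illustrative, not prescribed):

```python
# A sketch of creating and querying a LanceDB table; pip install lancedb.
import lancedb

db = lancedb.connect("/tmp/lancedb")  # embedded, so no server to run
table = db.create_table(
    "code_chunks",  # illustrative table name
    data=[{"vector": [0.1] * 1024, "path": "src/main.py", "content": "def main(): ..."}],
)
results = table.search([0.1] * 1024).limit(5).to_list()
```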
Step 3: How to Choose a “Chunking” Strategy
Most embeddings models can only handle a limited amount of text at once. To get around this, we “chunk” our code into smaller pieces. If you use voyage-code-3, it has a maximum context length of 16,000 tokens, which is enough to fit most files. This means that in the beginning you can get away with a more naive strategy of truncating files that exceed the limit. In order from easiest to most comprehensive, 3 chunking strategies you can use are:
1. Truncate the file when it goes over the context length: in this case you will always have one chunk per file.
2. Split the file into chunks of a fixed length: starting at the top of the file, add lines to your current chunk until it reaches the limit, then start a new chunk (a sketch of this follows the list).
3. Use a recursive, abstract syntax tree (AST)-based strategy: this is the most exact, but most complex. In most cases you can achieve high-quality results by using (1) or (2), but if you’d like to try this you can find a reference example in our code chunker or in LlamaIndex.
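Here is a sketch of strategy (2); it budgets by characters rather than tokens as a simplification, whereas a real implementation would count tokens with the model’s tokenizer:

```python
# A sketch of fixed-length, line-based chunking; max_chars approximates a token budget.
def chunk_file(text: str, max_chars: int = 8000) -> list[str]:
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        # Close the current chunk when adding this line would exceed the limit
        if size + len(line) > max_chars and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```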
Step 4: How to Put Together an Indexing Script
Indexing, in which we will insert your code into the vector database in a retrievable format, happens in three steps:
- Chunking
- Generating embeddings
- Inserting into the vector database
If you are indexing more than one repository, it is best to store these in separate “tables” (the terminology used by LanceDB) or “collections” (the terminology used by some other vector DBs). The alternative, adding a “repository” field and then filtering on it, is less performant.
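Putting the three steps together, an indexing script might look roughly like this sketch; it reuses the chunk_file helper from Step 3, assumes the Voyage endpoint from Step 1, and writes each repository to its own table:

```python
# A sketch of an end-to-end indexing script: chunk, embed, insert.
import os
import lancedb
from openai import OpenAI

client = OpenAI(api_key="YOUR_VOYAGE_API_KEY", base_url="https://api.voyageai.com/v1")
db = lancedb.connect("/tmp/lancedb")

def index_repository(repo_path: str, table_name: str) -> None:
    rows = []
    for root, _, files in os.walk(repo_path):
        for name in files:
            if not name.endswith((".py", ".ts", ".go")):  # file types you care about
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                chunks = chunk_file(f.read())  # the helper sketched in Step 3
            if not chunks:
                continue
            # Generate embeddings, batched per file for simplicity
            response = client.embeddings.create(model="voyage-code-3", input=chunks)
            for chunk, item in zip(chunks, response.data):
                rows.append({"vector": item.embedding, "path": path, "content": chunk})
    # One table per repository, as recommended above
    if rows:
        db.create_table(table_name, data=rows, mode="overwrite")
```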
Step 5: How to Run Your Indexing Script
In a perfect production version, you would build “automatic, incremental indexing”, so that whenever a file changes, that file and nothing else is automatically re-indexed. This has the benefits of perfectly up-to-date embeddings and lower cost. That said, we highly recommend first building and testing the pipeline before attempting this. Unless your codebase is being entirely rewritten frequently, a periodic refresh of the index is likely to be sufficient and reasonably cheap.
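If you later want change detection without the full incremental machinery, one simple approach is to hash each file and compare against a manifest saved by the previous run. This sketch (the manifest filename is illustrative) returns just the files that need re-indexing:

```python
# A sketch of hash-based change detection for cheaper re-indexing.
import hashlib
import json
import os

def changed_files(repo_path: str, manifest_path: str = "index_manifest.json") -> list[str]:
    old = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            old = json.load(f)
    new, changed = {}, []
    for root, _, files in os.walk(repo_path):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                new[path] = hashlib.sha256(f.read()).hexdigest()
            if old.get(path) != new[path]:
                changed.append(path)
    with open(manifest_path, "w") as f:
        json.dump(new, f)
    return changed
```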
Step 6: How to Set Up an MCP Server
To integrate your custom RAG system with Continue, you’ll create an MCP (Model Context Protocol) server. MCP provides a standardized way for AI tools to access external resources.
Create your MCP server
Here’s a reference implementation using Python that queries your vector database:
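The sketch below assumes the official MCP Python SDK (pip install mcp) and the LanceDB table built in the earlier steps; the table name and Voyage endpoint carry over from those sketches:

```python
# A sketch of an MCP server exposing a codebase-search tool over the LanceDB table.
import lancedb
from openai import OpenAI
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("codebase-rag")
client = OpenAI(api_key="YOUR_VOYAGE_API_KEY", base_url="https://api.voyageai.com/v1")
db = lancedb.connect("/tmp/lancedb")
table = db.open_table("my_repo")  # the table created by the indexing script

@mcp.tool()
def search_codebase(query: str) -> str:
    """Return the most relevant code chunks for a natural-language query."""
    embedding = client.embeddings.create(
        model="voyage-code-3", input=[query]
    ).data[0].embedding
    results = table.search(embedding).limit(10).to_list()
    return "\n\n".join(f"{r['path']}:\n{r['content']}" for r in results)

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```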
Configure Continue to use your MCP server
Add your MCP server to Continue’s configuration in config.yaml:
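This sketch uses a placeholder server name and script path; confirm the exact schema against Continue’s config reference:

```yaml
mcpServers:
  - name: codebase-rag
    command: python
    args:
      - /path/to/your/mcp_server.py
```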
Step 7 (Bonus): How to Set Up Reranking
If you’d like to improve the quality of your results, a great first step is to add reranking. This involves retrieving a larger initial pool of results from the vector database, and then using a reranking model to order them from most to least relevant. This works because the reranking model can perform a slightly more expensive calculation on the small set of top results, and so can give a more accurate ordering than similarity search, which has to search over all entries in the database. If, for example, you wish to return 10 total results for each query, then you would:
- Retrieve ~50 results from the vector database using similarity search
- Send all 50 of these results, along with the query, to the reranker API to get a relevancy score for each
- Sort the results by relevancy score and return the top 10
For the reranking model, we recommend rerank-2 from Voyage AI; you can find examples of usage here.
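A sketch of that flow using Voyage AI’s Python client (pip install voyageai); the helper name is illustrative:

```python
# A sketch of reranking ~50 similarity-search results down to the top 10.
import voyageai

vo = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")

def rerank_results(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    # candidates are the ~50 chunk contents returned by the vector database
    reranking = vo.rerank(query, candidates, model="rerank-2", top_k=top_k)
    return [r.document for r in reranking.results]  # already sorted by relevance
```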