Fine-tuning LLMs
Large Language Models (LLMs) are pre-trained at an extraordinary cost. Pre-training learns the underlying weights and biases by minimising the loss between the model's predictions and the training data. In general, given a diverse training data set and a sufficiently large number of weights, the model generalises well.
It is possible to fine-tune a model to improve its performance on a particular task by performing additional downstream training on new examples. Downstream fine-tuning is very powerful: the original large neural network has already learnt general features, which it can reuse to quickly learn new features related to your use case. Because the additional training examples are curated specifically for the new use case, the cost of such fine-tuning is relatively low. A number of techniques are used for fine-tuning; a minimal sketch of one common approach is shown below.
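To make the idea concrete, the sketch below fine-tunes a small pre-trained model on a new set of labelled examples using the Hugging Face Trainer. This is only one possible approach; the model name, dataset name and hyperparameters are illustrative assumptions, not part of the original text.

```python
# A minimal fine-tuning sketch using Hugging Face Transformers (assumed stack;
# the text does not prescribe a specific library). The dataset name
# "my_org/support_tickets" is a hypothetical placeholder for your curated examples.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

dataset = load_dataset("my_org/support_tickets")   # hypothetical curated dataset

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=3),
    train_dataset=tokenized["train"],
)
trainer.train()   # downstream training pass on the new examples
```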
The problem with fine-tuning LLMs
LLMs are more useful to organisations when they are adapted to the organisation's business context. That adaptation requires private, proprietary data which is not in the public domain to create the additional training subsets. Since that data is available only within the particular organisation, it could not have been used during pre-training of the foundation LLM.
Similarly, sensitive or real-time data cannot be included in the LLM's training data because it is created after the pre-training cut-off. For both categories, fine-tuning may not be ideal due to privacy concerns, cost, and the constantly changing nature of the data. In addition, fine-tuning requires advanced ML expertise and resources which many companies do not have access to. Being able to supply such data as context therefore opens up important use cases for LLMs and makes them more accessible. This process is often referred to as grounding the LLM. Grounding is not possible with the LLM alone; it requires a system that performs a late fusion with a non-parametric memory, which can substantially improve the performance of the end-to-end LLM-based system.
In-Context Learning
In-context learning is an alternative to fine-tuning the model: task-specific examples are supplied at inference time, so the model's weights remain unaltered. Prompts are used to provide the examples, with pairs of inputs and example outputs combined and passed to the LLM as part of the prompt, as in the sketch below.
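Below is a minimal sketch of how such input/output pairs can be packed into a single few-shot prompt. The example pairs and the `complete` call are illustrative placeholders; any chat or completion API could be substituted.

```python
# Build a few-shot prompt from (input, output) example pairs.
examples = [
    ("The parcel arrived two weeks late.", "negative"),
    ("Setup took five minutes and it just worked.", "positive"),
]
new_input = "The battery barely lasts an hour."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {new_input}\nSentiment:"

# `complete` stands in for whichever LLM completion API you use (hypothetical).
# response = complete(prompt)
print(prompt)
```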
Limiting Context during Prompting
Context limits are a function of the model and its architecture. For most of the early published LLMs, the context length available to the prompt was limited to a few hundred tokens at most. In more recent LLMs, context limits have grown to 8k and even 100k tokens. For example, GPT-4 is offered in 8k and 32k token variants, while Anthropic's Claude supports a 100k-token context window.
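One practical consequence is that you need to know how many tokens a prompt consumes before sending it. The sketch below uses the tiktoken library to count tokens against an 8k limit; the limit and encoding name are example figures, not prescribed by the text.

```python
# Count prompt tokens before calling the model (assumes the tiktoken library
# and the cl100k_base encoding; the 8,192-token limit is an example figure).
import tiktoken

CONTEXT_LIMIT = 8_192
encoding = tiktoken.get_encoding("cl100k_base")

prompt = "..."  # your prompt plus any context you plan to attach
n_tokens = len(encoding.encode(prompt))

if n_tokens > CONTEXT_LIMIT:
    print(f"Prompt is {n_tokens} tokens; it exceeds the {CONTEXT_LIMIT}-token limit.")
else:
    print(f"Prompt fits: {n_tokens} of {CONTEXT_LIMIT} tokens used.")
```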
The problem: passing context in prompts
There is a significant issue with in-context learning on closed LLMs. The models are stateless, and the organisation-specific context data has never actually been learnt by the LLM. The context data therefore has to be sent to the LLM again with each prompt. Because prompts are stateless, sending thousands of tokens with every prompt is infeasible for most use cases.
Cost involved with consumption of APIs
Many of the cutting-edge models offered by AI companies charge a per-token fee for consumption of their APIs. If that per-token fee is also applied to prompts carrying a large number of tokens, say 1k or more, this quickly adds up to significant API usage fees, as the rough arithmetic below illustrates.
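The figures below are purely illustrative assumptions (the prices and volumes are not from the original text); they only show how per-token pricing scales with prompt size.

```python
# Rough cost estimate for sending a large context with every prompt.
# All numbers are assumptions chosen for illustration only.
PRICE_PER_1K_TOKENS = 0.03      # assumed USD price per 1,000 prompt tokens
CONTEXT_TOKENS = 3_000          # context attached to every prompt
QUESTION_TOKENS = 200           # the actual user question
REQUESTS_PER_DAY = 10_000

tokens_per_request = CONTEXT_TOKENS + QUESTION_TOKENS
daily_cost = tokens_per_request / 1_000 * PRICE_PER_1K_TOKENS * REQUESTS_PER_DAY
print(f"~${daily_cost:,.0f} per day just for prompt tokens")   # ~$960 per day
```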
Vector Databases as a context source
One stateful solution is to use a vector database to store any context you would like to make available to an LLM. The vector database acts as a context provider for user or application inputs, sitting as an intermediary between the user and the LLM.
In the world of Large Language Models, semantic search plays a crucial role. Vector databases are ideal here because the user's query can be embedded and used to perform a semantic search: a distance metric retrieves the neighbouring vectors that are 'relatively near' to the query or input, as sketched below.
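The sketch below shows the underlying idea with plain NumPy: embed the query, compare it to stored vectors with cosine similarity, and return the nearest chunks. A production vector database does this at scale with approximate indexes; the `embed` function is a hypothetical placeholder for whatever embedding API you use.

```python
# Semantic search in miniature: cosine similarity between a query vector and
# stored context vectors. `embed` is a placeholder for your embedding API.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec, stored_vecs, chunks, k=3):
    scores = [cosine_similarity(query_vec, v) for v in stored_vecs]
    best = np.argsort(scores)[::-1][:k]          # indices of the k nearest chunks
    return [chunks[i] for i in best]

# query_vec = embed("How do I reset my corporate VPN password?")   # hypothetical
# context = top_k(query_vec, stored_vecs, chunks)
```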
Using hypernear as an enterprise context source for LLMs
The process of transforming text into embeddings begins with tokenization: the text data to be used as context is decomposed into smaller chunks or tokens. Tokens can be as small as individual characters or as large as entire sentences, but in most cases they represent individual words or sub-words. The amount of text fed to the embedding model at one time can range from a sentence or a paragraph all the way up to a small document, as in the chunking sketch below.
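A simple way to prepare the corpus is to split each document into overlapping chunks of a fixed size before embedding them. The chunk and overlap sizes below, and the file name, are illustrative assumptions.

```python
# Split a document into overlapping chunks before embedding.
# Chunk and overlap sizes (counted in words here, for simplicity) are example values.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

chunks = chunk_text(open("employee_handbook.txt").read())   # hypothetical source file
```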
Once the chunks have been prepared, an LLM embeddings API can convert each chunk of text into a vector. This is a one-time process for the entire corporate dataset. hypernear integrates with the industry-leading embedding APIs, and you can also use a custom embedding API with hypernear.
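As one example of such an API, the sketch below embeds each chunk with the OpenAI Python client; the model name is an assumption, and any other provider or custom embedding model could be plugged in the same way.

```python
# Embed each chunk once, up front. Assumes the OpenAI Python client (>= 1.0)
# and an example embedding model; swap in whichever provider you use.
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",   # example model name
    input=chunks,                     # the chunks produced earlier
)
vectors = [item.embedding for item in response.data]
```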
Once the vectors are ready, they can be inserted into hypernear. This creates persistent storage of the vectors in hypernear, which enables your organisation to reuse this context across different LLM sessions, prompts and even across different LLMs.
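The insert step might look roughly like the sketch below. The `hypernear` package, its `Client` class and `insert` method are hypothetical stand-ins used only for illustration; consult the actual hypernear SDK for the real interface.

```python
# Hydration sketch. The `hypernear` module, Client class and insert method
# shown here are hypothetical placeholders, not the documented SDK.
from hypernear import Client   # hypothetical import

db = Client(api_key="YOUR_API_KEY")               # placeholder credentials
db.insert(
    collection="corporate-docs",                  # example collection name
    vectors=vectors,                              # embeddings from the previous step
    metadata=[{"text": chunk} for chunk in chunks],
)
```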
3 simple steps to leverage hypernear
- Embedding - build or choose an embedding model designed to encode your data corpus.
- Hydration - import the vectors into hypernear.
- Search - use hypernear to find similar data by encoding a video, image, document or audio item and using its vector to query for similar content.
Once hypernear has been hydrated, it serves the relevant queries whenever your prompt or LLM needs to look up context data, as in the retrieval sketch below.
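Putting it together, a request might be handled roughly as below: embed the user's question, retrieve the nearest chunks, and prepend them to the prompt. The `embed`, `db.search` and `complete` calls are hypothetical placeholders for your embedding API, the hypernear query interface and your LLM API respectively.

```python
# Retrieval-augmented prompting sketch. `embed`, `db.search` and `complete`
# are hypothetical placeholders for the embedding API, the hypernear query
# interface and the LLM completion API.
question = "What is our travel reimbursement limit for international trips?"

query_vec = embed(question)                        # hypothetical embedding call
hits = db.search(collection="corporate-docs",      # hypothetical hypernear query
                 vector=query_vec, top_k=3)

context = "\n\n".join(hit["text"] for hit in hits)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\nAnswer:"
)
answer = complete(prompt)                          # hypothetical LLM call
```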
- This allows you to use models with smaller context lengths, which carry lower per-API usage fees.
- It reduces the amount of data that has to be sent to the LLM with each prompt.
- And by supplying the most relevant context with each prompt, the LLM produces better-quality responses.