Customizing LLMs – Part 1: the Concept

Image by Samuel Ijimakin. Downloaded from Pixabay

I wanted to better understand our capacity to customize Large Language Models (LLMs). By “our” I mean us, users. I will register my current understanding in two parts:

  • Part 1 (this post) describes my understanding of how LLMs work and dives a bit into Retrieval Augmented Generation (RAG).
  • Part 2 (see post on my Engelberg Huller page) describes my experiment with building a RAG system on my laptop computer (and, yes, I am not a developer, just a user, so check it out).

How LLMs work

LLMs are neural network AI models that are developed to process, understand and generate human language based on a large number of parameters. They have two main components:

  1. The first – and the fundamental breakthrough in the development of LLMs – is a transformer. A transformer is a form of neural network architecture that creates vectors of parameters representing not just a token (a unit of analysis in neural networks, like a word or part of a word, or a pixel if we were thinking of images) but also the context in which that token appears (for example, a word’s position in a phrase, the syntax of the phrase and the word’s semantics). This consideration of context is referred to as “self-attention.” The final vectors generated by transformers are referred to as “final embeddings” (although sometimes the term “embedding” is used just for the vectors representing the tokens, before context is taken into account).
  2. The second component is a task specific model that takes the final embeddings and generates outputs, such as the predicted next word in a sentence.
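To make the self-attention idea a bit more concrete, here is a deliberately tiny sketch in Python (my own toy illustration, not code from any real LLM; real transformers use learned projection matrices to produce queries, keys and values, which I skip here):

```python
import numpy as np

def self_attention(X):
    """Toy single-head self-attention over a sequence of token embeddings.

    X: (seq_len, d) matrix, one row per token embedding.
    Returns contextualized embeddings of the same shape: each output row
    is a weighted mix of all rows of X, so every token "sees" its context.
    """
    d = X.shape[1]
    # In a real transformer, queries/keys/values come from learned
    # projections of X; here we use X directly to keep the sketch minimal.
    scores = X @ X.T / np.sqrt(d)                  # pairwise token similarities
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ X                             # context-aware embeddings

# Three "tokens" with 4-dimensional embeddings.
X = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0]])
out = self_attention(X)
print(out.shape)  # (3, 4)
```

Each row of the output blends information from the other tokens, which is the sense in which the final embeddings carry context, not just the token itself.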

Each of the two components above is developed through training of optimization models: the first component requires training a model to find the best vector of parameters (final embedding) that represents a token and its context, and the second component requires training a model to find the best output given a collection or sequence of embeddings.

An important detail, however, is that these two components and their trainings are not developed in sequence but, rather, at the same time, using feedback loops (backpropagation). Below I reproduce a diagram that ChatGPT made for me (with some alterations; it seems ChatGPT is not yet very good at making these diagrams).
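Here is a toy illustration of that joint training (my own invention, nothing like a real transformer in scale): a made-up two-stage model where one error signal updates both the “embedding” stage and the “task head” stage in the same step.

```python
import numpy as np

# Toy illustration of joint training: an "embedding" weight w1 and an
# "output head" weight w2 are updated together by backpropagation,
# not trained one after the other.
rng = np.random.default_rng(0)
w1, w2 = 0.1, 0.1        # both stages start untrained
lr = 0.05

# Tiny regression task: we want w2 * (w1 * x) to approximate 2 * x.
xs = rng.normal(size=32)
ys = 2.0 * xs

for _ in range(500):
    h = w1 * xs          # "embedding" stage
    pred = w2 * h        # "task head" stage
    err = pred - ys
    # Backpropagation: the same error signal yields gradients for BOTH
    # stages, so they improve simultaneously in each step.
    g2 = np.mean(2 * err * h)
    g1 = np.mean(2 * err * w2 * xs)
    w1 -= lr * g1
    w2 -= lr * g2

print(round(w1 * w2, 2))  # ≈ 2.0: the two stages jointly learn the target slope
```

Neither weight is “finished” before the other starts; they converge together, which is the point of the feedback loop.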

Some aspects that may influence the power of a transformer include the number of dimensions of its embeddings, the total number of parameters (usually in the billions) and the number of tokens that it can consider simultaneously (the size of the context window).

Customizing LLMs

I understand there are two main ways of customizing LLMs: Fine Tuning and Retrieval Augmented Generation (RAG). Before exploring those, however, a few words on “prompt engineering.”

Prompt engineering is also seen as a way to customize the results obtained from an LLM. You can find lots of recommendations online and take full courses on it. I do not want to diminish the attention this seems to get, but it essentially means learning how to interact with LLMs to obtain the best possible answers to your questions. From my experience, the most useful asset in doing so is your own knowledge about the subject you are exploring. This allows you to pursue your questions in detail, inducing the LLM to refine its answers. I often do more traditional Google research on a subject before interacting with ChatGPT, so I better know what to ask and how to phrase my questions.

Fine-Tuning

Fine-tuning is an approach to customizing an LLM that consists of altering the collection of parameter values (weights) that the LLM uses in responding to a question (prompt), to improve the LLM’s performance for specific purposes. Traditionally, it is considered expensive because LLMs can have billions of parameters and “retraining” them would typically be out of reach for all but the companies that own the LLMs (or their transformers). But there are approaches that avoid having to “retrain” an LLM.

A common approach is the one referred to as Parameter-Efficient Fine-Tuning (PEFT). This approach makes use of adapters: small neural network modules that are trained for a specific purpose and then added to an LLM without needing to touch the remaining parameter values. One approach in particular, known as Low-Rank Adaptation (LoRA), seems to have substantially reduced the cost of fine-tuning in projects where it was used, by using low-rank matrices (a limited number of trainable parameters).
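As a rough sketch of the LoRA idea (my own simplified illustration, with made-up layer sizes): instead of retraining the full weight matrix W, you learn two small matrices A and B whose product is a low-rank correction added to W’s output.

```python
import numpy as np

rng = np.random.default_rng(0)

# A frozen pretrained weight matrix (in a real LLM this would be huge).
d_out, d_in = 64, 64
W = rng.normal(size=(d_out, d_in))

# LoRA: instead of retraining W, learn a low-rank update B @ A.
r = 4  # the rank; the key knob that keeps the trainable parameter count small
A = rng.normal(size=(r, d_in)) * 0.01  # trainable
B = np.zeros((d_out, r))               # trainable; zero-init so W is unchanged at start

def forward(x):
    # The adapted layer: original output plus the low-rank correction.
    return W @ x + B @ (A @ x)

full_params = W.size           # 4096 parameters to fine-tune naively
lora_params = A.size + B.size  # only 512 trainable parameters
print(full_params, lora_params)
```

With a real LLM layer (say 4096×4096) and the same rank, the ratio is far more dramatic, which is where the cost savings come from.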

One limitation of this approach seems to be that, once “fine-tuned,” the resulting set of parameter values is applied to any interaction with that LLM. This may generate better results for the specific case the LLM was fine-tuned for, but may generate worse results for other use cases. The circumstances in which it is worthwhile for an organization to invest in fine-tuning an LLM therefore need to be looked at carefully.

Retrieval Augmented Generation (RAG)

RAG consists of submitting to an LLM a specific set of information that you ask the LLM to consider when responding to your prompts. To do so, you need to build a pipeline that ingests the additional information you want to submit, breaks it down into small parts (chunks), and transforms those into embeddings (vectors) in a way that aligns with the way your LLM also uses embeddings. When you ask a question, the pipeline embeds your prompt, uses that embedding to search for the most similar embeddings of the (chunked) additional information, and feeds the matching chunks, together with your prompt, to the LLM to seek a response. The diagram below illustrates how RAG works and I discuss the steps in red further below.

1. Chunking – this step breaks down the added information into segments called chunks, so that the number of tokens per segment fits within the limits of the embedding model and the LLM’s context window. The chunks still need to have semantic meaning, however, and there is discussion out there around the optimal size of chunks, which may depend on each situation. A general ballpark figure common in discussions seems to be the range of 128–256 tokens, which seems quite specific to me (there must be a back story to this range, I just don’t know what it is). Chunks typically end up being sentences or small paragraphs. Examples of tools that can assist in doing this are LangChain and LlamaIndex.
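A naive chunker, just to illustrate the idea (real tools like LangChain and LlamaIndex split far more carefully, e.g. respecting sentence and paragraph boundaries; the sizes below are made up):

```python
def chunk_text(text, max_tokens=200, overlap=20):
    """Naive chunker: split on whitespace "tokens" and emit overlapping windows.

    The overlap means neighboring chunks share a few words, so a sentence
    cut at a boundary still appears whole in at least one chunk.
    """
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

# A 450-"word" document becomes 3 overlapping chunks of at most 200 words.
doc = ("word " * 450).strip()
chunks = chunk_text(doc, max_tokens=200, overlap=20)
print(len(chunks))  # 3
```

Splitting on whitespace is a crude stand-in for real tokenization, but the windowing-with-overlap pattern is the same one the dedicated tools use.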

2. Embedding – the chunks are then turned into vectors intended to capture both the content and the context of the chunks. These embeddings are typically stored in what is called a vector database, which can then be searched. The result is that chunks become “findable” based on their meaning and context. There are various embedding tools out there from OpenAI, Google and others. However, some of these models may be better aligned than others with the specific LLM that a RAG project intends to use: for example, an OpenAI embedding model would be appropriate for use with OpenAI transformers, or Sentence-BERT with Hugging Face transformers. The vector database, on the other hand, may not need such alignment, and there are many popular options being used in RAG projects (e.g. Pinecone, Weaviate, Milvus). The user’s questions also need to be embedded so they can be matched against the database, but they are typically not stored there because they are not meant for repeated use; this embedding happens every time a prompt is submitted. The existence of embedding models seems to be one of the reasons why prompts can be better or worse “engineered” to extract the desired information from an LLM.
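To illustrate how chunks become “findable” by meaning, here is a toy similarity search with hand-made three-dimensional vectors (entirely my invention; real embedding models produce vectors with hundreds or thousands of dimensions, and a vector database does this search at scale):

```python
import numpy as np

# Hypothetical, hand-made "embeddings": in a real pipeline these would come
# from an embedding model (e.g. an OpenAI or Sentence-BERT model).
chunk_embeddings = np.array([
    [0.9, 0.1, 0.0],   # chunk 0: about cats
    [0.1, 0.9, 0.0],   # chunk 1: about finance
    [0.7, 0.3, 0.1],   # chunk 2: also (mostly) about cats
])
chunks = ["Cats sleep a lot.", "Markets fell today.", "Kittens play often."]

def top_k(query_embedding, k=2):
    """Rank chunks by cosine similarity to the query embedding."""
    q = query_embedding / np.linalg.norm(query_embedding)
    db = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    sims = db @ q                       # cosine similarity to each chunk
    best = np.argsort(sims)[::-1][:k]   # indices of the k closest chunks
    return [chunks[i] for i in best]

query = np.array([0.85, 0.15, 0.05])    # embedding of a cat-related question
print(top_k(query))  # ['Cats sleep a lot.', 'Kittens play often.']
```

The finance chunk scores low even though no keywords are compared at all: proximity in the vector space is what stands in for “similar meaning.”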

3&4. Retrieval and Generation – these steps consist of using the embedded question to retrieve the most similar chunks of the new material we wish the LLM to consider, then feeding those chunks, together with the question we are prompting the LLM to answer, to the LLM – and then obtaining the answer.
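One detail worth making explicit: what ultimately reaches the LLM in the generation step is text, not vectors; the embeddings are only used to find the relevant chunks. A sketch of how the retrieved chunks and the question might be assembled into a single prompt (the template wording is my own invention, not a standard):

```python
def build_rag_prompt(question, retrieved_chunks):
    """Assemble the augmented prompt sent to the LLM.

    The retrieved chunks are pasted in as plain text context; the LLM
    never sees the vectors used during retrieval.
    """
    context = "\n\n".join(f"- {c}" for c in retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_rag_prompt(
    "How much do cats sleep?",
    ["Cats sleep a lot.", "Kittens play often."],
)
print(prompt)
```

This assembled string is what gets submitted as the actual prompt, which is why the total size of the retrieved chunks has to fit within the LLM’s context window.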

I reached the understanding above, in part, by dialoguing with ChatGPT 5, which in some situations seems to generate responses that are not very good, perhaps worse than those of ChatGPT 4o (see sources below). But, as I mentioned when discussing prompt engineering, I do have the habit of cross-referencing with other information I find on the internet, including feeding it back to ChatGPT, so I am hopeful the understanding above is relatively accurate.

I then proceeded to try to build a RAG system on my laptop and compare it with simply uploading a set of documents to ChatGPT (I did use the 4o version, in this case). Because my RAG system connected with the OpenAI LLM, I figured the comparison would highlight the RAG components of the system built on my laptop and help me better understand those components. For the results of this exercise, please see the Engelberg Huller post Customizing LLMs – Part 2: an Experiment.

Sources

3Blue1Brown. 2025 (last updated Sept 26). Neural Networks. Course (9 videos), YouTube. Available: https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi. Accessed: October 06, 2025

Belcic, I. and Stryker, C. Undated. What is LLM Customization? IBM Think. Available: https://www.ibm.com/think/topics/llm-customization. Accessed: October 06, 2025

Google. Gemini backed AI Overview on Google search

OpenAI. ChatGPT 5, accessed October 2025