Customizing LLMs – Part 3: The Orchestration Layer
- Post author:Alex Uriarte
- Post published:March 22, 2026
- Post category:Fan
This is the third post in a series where I try to better understand how LLMs work and how to build custom applications using LLMs. In this post, I will dive further into the architecture of custom applications beyond RAG.
In part 1 of this series, I was somewhat dismissive of “prompt engineering,” describing it as mostly learning how to interact with LLMs. Although I still think that description is not necessarily wrong, many custom applications seem to rely largely on developing a layer in between the user and the LLM that molds both inputs and outputs from the interaction and that can greatly enhance the benefits of using LLMs. This layer may combine:
- User prompts
- Any saved interactions between the user and the LLM
- Specific instructions on how the LLM should make use of user input (system prompts)
- Tools made available to the LLM
- RAG content
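To make the combination above concrete, here is a minimal sketch of how an orchestration layer might assemble these sources into a single prompt. All names and the section layout are illustrative assumptions, not taken from any specific framework:

```python
def build_prompt(system_prompt, history, rag_chunks, tools, user_prompt):
    """Combine system instructions, tool descriptions, RAG content,
    saved conversation history, and the user's message into one prompt."""
    sections = [
        "## Instructions\n" + system_prompt,
        "## Available tools\n" + "\n".join(f"- {t}" for t in tools),
        "## Retrieved context\n" + "\n\n".join(rag_chunks),
        "## Conversation so far\n" + "\n".join(history),
        "## User\n" + user_prompt,
    ]
    return "\n\n".join(sections)

prompt = build_prompt(
    system_prompt="Answer using only the retrieved context.",
    history=["User: hi", "Assistant: hello"],
    rag_chunks=["Chunk A ...", "Chunk B ..."],
    tools=["search(query)", "calculator(expr)"],
    user_prompt="Summarize chunk A.",
)
```

Real frameworks such as LangChain do essentially this with templating, but the ordering decision (instructions first, user message last) is itself an orchestration choice.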
This layer is often called the orchestration layer or the control plane. To combine the information above, it makes use of a collection of tools:
- Frameworks – these are possibly the main tools; they combine content from prompt templates, tool calling, memory, RAG pipelines and agent workflows. Examples of frameworks commonly used for AI applications are LangChain, AutoGen (Microsoft) and Semantic Kernel (Microsoft).
- Other tools that can be part of the orchestration layer include:
- Code libraries
- Core API and traffic management tools
- Workflow and agent orchestration tools
- Security and audit tools
In the diagram below, I highlight this orchestration layer and its prompt building function in red.
How does this orchestration layer combine the various sources of input it receives into what it feeds the LLM? How much does it influence, for example, the extent to which outputs from a user’s interaction with the LLM rely on the content retrieved from a RAG process relative to the content already incorporated in the LLM?
What I have been able to understand is the following:
- All input coordinated by the orchestration layer needs to ultimately be transformed into tokens and then into vectors that the LLM transformer can work with. The number of tokens an LLM can handle at a time is called its context window. As of March 2026, it seems like most readily available LLMs have context windows of anywhere between 128 thousand and 1 million tokens.
- If the custom LLM uses RAG, what the orchestration layer receives from the RAG system are chunks selected by a vector search engine by semantically comparing the embeddings from the prompt given to it with those in the vector database that contains the RAG content.
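The vector search step can be sketched with toy embeddings. In a real RAG system the embeddings come from an embedding model and the search runs in a vector database; here, the hand-written three-dimensional vectors and the `retrieve` function are purely illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy vector database: chunk text -> precomputed embedding.
vector_db = {
    "The cat sat on the mat.": [0.9, 0.1, 0.0],
    "Quarterly revenue grew 12%.": [0.1, 0.9, 0.2],
    "Attention is all you need.": [0.2, 0.3, 0.9],
}

def retrieve(query_embedding, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(vector_db.items(),
                    key=lambda kv: cosine(query_embedding, kv[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# A finance-flavored query embedding lands closest to the revenue chunk.
chunks = retrieve([0.15, 0.85, 0.1])
```

The selected `chunks` are what the orchestration layer receives and then places into the prompt.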
- The transformer treats all tokens received as input equally and what determines how some tokens influence others is the transformer’s self-attention mechanism. This mechanism can be influenced by several factors (according to my exchange with ChatGPT):
- Tokens that are semantically aligned with the question get higher attention.
- Strong instructions bias attention patterns.
- Proximity in context window can matter.
- Longer RAG context can dilute influence.
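The dilution point can be illustrated with a toy softmax over attention scores. This is a deliberately simplified single-query picture, not a real transformer: one "relevant" token competes with a growing number of distractor tokens, and its attention weight shrinks as the context grows.

```python
import math

def softmax(scores):
    """Normalize raw scores into attention-like weights."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One relevant token (score 2.0) among N distractors (score 0.0).
# As N grows, the weight on the relevant token shrinks:
# longer context dilutes the influence of any single chunk.
for n_distractors in (4, 64, 512):
    weights = softmax([2.0] + [0.0] * n_distractors)
    print(n_distractors, round(weights[0], 3))
```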
- So, to some extent, we can influence the importance given by the LLM to its various sources of input by affecting some of the factors above in the orchestration layer. In particular, we are able to influence:
- the size of RAG chunks and where they are placed in the context window
- the instructions received by the LLM and whether the system prompt forces “grounding”
- Grounding consists of forcing an LLM to rely completely or mostly on selected sources rather than on its pre-trained content. There are several ways of enforcing strong grounding in the orchestration/control layer, including:
- removing the user message from the context;
- forcing an answer drawn only from citations;
- rejecting answers that lack citations, requiring the LLM to support every answer with retrieved text
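A guard for the last point could look like the sketch below. The `[1]`-style citation format and the function name are my own illustrative assumptions; real systems vary in how they mark citations:

```python
import re

def enforce_grounding(answer, num_sources):
    """Reject an answer unless every sentence cites a retrieved source.

    Assumes citations look like [1], [2], ... referring to retrieved
    chunks; this format is illustrative, not a standard.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cited = re.findall(r"\[(\d+)\]", sentence)
        if not cited:
            return False, f"No citation in: {sentence!r}"
        if any(int(c) < 1 or int(c) > num_sources for c in cited):
            return False, f"Unknown source cited in: {sentence!r}"
    return True, "ok"

ok, _ = enforce_grounding("Revenue grew 12% [1]. Costs fell [2].", num_sources=2)
bad, why = enforce_grounding("Revenue grew 12%.", num_sources=2)
```

In a pipeline, a `False` result would trigger a retry or a refusal rather than showing the ungrounded answer to the user.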
I recently noticed how effective the system prompt of custom LLMs can be (the “prompt engineering” that I had previously downplayed) in a custom GPT I built to summarize weekly updates I was receiving from my team into monthly reports. By just inputting the information on a regular ChatGPT interface, the results I was getting were unusable: the LLM was having a hard time matching sentences from different weeks that referred to the same workstream, and was also messing up the timeline of the updates. By building a custom GPT (using ChatGPT’s own tool), I was able to fix those issues: the difference was just the instruction layer translating the user input into the actual instructions received by the LLM.
Context Window Sizes, RAG and the Role of the Orchestration Layer
I have the ChatGPT Plus plan (I pay $20/month) and I asked ChatGPT about the size of the context windows of the LLMs I have access to:
- For OpenAI’s GPT-5.3 (the “instant” model), the typical context window is 128k, if accessed directly through an API. But when I access it through the ChatGPT Plus plan that I have, it goes through an orchestration layer. This orchestration layer may:
- Reserve tokens for system + tool instructions
- Keep headroom for the model’s response
- Trim conversation history dynamically
- Enforce tier-based limits
All of the above take context window space. The result is that the effective context window that I typically have access to using ChatGPT is closer to 32k.
- For OpenAI’s GPT-5.4 (the “thinking” model), the API-accessed context window is around 1 million tokens, while the effective context window I have access to through my ChatGPT Plus plan interface is closer to 256k – 400k “depending on tier or configuration” (I did not dive deeper to figure out what that means)
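The reservations described above amount to simple token arithmetic. The sketch below uses illustrative figures chosen to match the 128k-to-roughly-32k example; the actual reservation sizes are not published:

```python
def effective_window(total, system_tools, response_headroom, history):
    """Tokens left for new user input after the orchestration layer's
    reservations. All figures are illustrative assumptions."""
    return total - system_tools - response_headroom - history

# A 128k model window can shrink to roughly 32k of usable space once
# system/tool instructions, response headroom, and kept history are reserved.
remaining = effective_window(
    total=128_000,
    system_tools=16_000,
    response_headroom=32_000,
    history=48_000,
)
print(remaining)  # 32000
```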
In theory, the larger the context window, the less necessary RAG becomes: more of the material that would typically go into the RAG can be inserted directly into the context window, instead of being stored in a vector database with only the relevant chunks being retrieved. To give an idea of what these context window dimensions mean in terms of the size of text (again, according to ChatGPT), roughly 100,000 tokens corresponds to about 75,000 words of plain English text.
However:
- PDF documents use more tokens, so 100,000 tokens may be closer to 40k-60k words, instead of 75k
- Code and structured data are also less efficiently translated into tokens
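These rules of thumb can be captured in a small helper. The ratios below are ballpark assumptions taken from the rough figures above, not measurements:

```python
# Rough words-per-token ratios; ballpark assumptions, not measurements.
WORDS_PER_TOKEN = {
    "plain_text": 0.75,  # ~100k tokens ≈ 75k words
    "pdf": 0.5,          # extraction overhead: ~100k tokens ≈ 40k-60k words
    "code": 0.4,         # identifiers and punctuation tokenize inefficiently
}

def approx_words(tokens, kind="plain_text"):
    """Estimate how many words fit in a given token budget."""
    return int(tokens * WORDS_PER_TOKEN[kind])

print(approx_words(100_000))         # 75000
print(approx_words(100_000, "pdf"))  # 50000
```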
In reality, even with larger context windows, it seems like RAG can still be beneficial for two reasons:
- The self-attention mechanism of LLMs seems to degrade when large amounts of content are inserted into the context window;
- More content in the context window increases the cost of using LLMs
Whether or not the custom application uses RAG considerably changes the focus of the orchestration layer:
- Without a RAG, but relying on a larger context window pipeline, the orchestration layer will focus on whether to load full texts or summaries, in what order, and how much of older conversations to include. The focus is on context management, and output failures often reflect failures in that management, such as too much irrelevant content.
- With a RAG, the orchestration layer will focus more on which chunks to retrieve from the RAG’s uploaded documents, and output failures often reflect failures in that retrieval.
Hybrid architectures are often used where the orchestration layer may include full text documents in the context window in some cases (e.g. for smaller documents) and retrieve chunks through a RAG in other cases (e.g. large documents).
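A hybrid routing rule can be as simple as a size threshold. The 8,000-token cutoff below is an arbitrary illustrative choice; real systems would tune it against cost and retrieval quality:

```python
# Hypothetical routing rule for a hybrid architecture: small documents go
# into the context window whole; large ones go through RAG retrieval.
FULL_TEXT_TOKEN_LIMIT = 8_000

def route_document(doc_tokens):
    """Decide whether a document is inlined or indexed for retrieval."""
    if doc_tokens <= FULL_TEXT_TOKEN_LIMIT:
        return "inline"  # paste the full text into the prompt
    return "rag"         # index the document and retrieve chunks on demand

assert route_document(2_000) == "inline"
assert route_document(150_000) == "rag"
```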
Sources:
OpenAI. ChatGPT 5.3. Accessed March 2026.
IBM Technology. 2026. “Is RAG Still Needed? Choosing the Best Approach for LLMs.” YouTube video, March. Available: https://www.youtube.com/watch?v=UabBYexBD4k. Accessed March 21, 2026.
