Customizing LLMs – Part 3: The Orchestration Layer
- Post author:Alex Uriarte
- Post published:March 22, 2026
- Post category:Fan
This is the third post in a series where I try to better understand how LLMs work and how to build custom applications using LLMs. In this post, I will dive further into the architecture of custom applications beyond RAG.
In part 1 of this series, I was somewhat dismissive of “prompt engineering,” describing it as mostly learning how to interact with LLMs. Although I still think that description is not necessarily wrong, many custom applications seem to rely largely on developing a layer in between the user and the LLM that molds both inputs and outputs from the interaction and that can greatly enhance the benefits of using LLMs. This layer may combine:
- User prompts
- Any saved interactions between the user and the LLM
- Specific instructions on how the LLM should make use of user input (system prompts)
- Tools made available to the LLM
- RAG content
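To make the combination above concrete, here is a minimal sketch of how an orchestration layer might assemble these sources into a single prompt. All names and the section layout are illustrative assumptions, not taken from any specific framework:

```python
def build_prompt(system_prompt, history, rag_chunks, tools, user_prompt):
    """Combine system instructions, tool descriptions, RAG content,
    saved conversation history, and the user's message into one prompt."""
    sections = [
        "## Instructions\n" + system_prompt,
        "## Available tools\n" + "\n".join(f"- {t}" for t in tools),
        "## Retrieved context\n" + "\n\n".join(rag_chunks),
        "## Conversation so far\n" + "\n".join(history),
        "## User\n" + user_prompt,
    ]
    return "\n\n".join(sections)

prompt = build_prompt(
    system_prompt="Answer using only the retrieved context.",
    history=["User: hi", "Assistant: hello"],
    rag_chunks=["Chunk A ...", "Chunk B ..."],
    tools=["search(query)", "calculator(expr)"],
    user_prompt="Summarize chunk A.",
)
```

Real frameworks such as LangChain do essentially this with templating, but the ordering decision (instructions first, user message last) is itself an orchestration choice.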
This layer is often called the orchestration layer or the control plane. To combine the information above, it makes use of a collection of tools:
- Frameworks – these are possibly the main tools; they combine content from prompt templates, tool calling, memory, RAG pipelines and agent workflows. Examples of frameworks commonly used for AI applications are LangChain, AutoGen (Microsoft) and Semantic Kernel (Microsoft).
- Other tools that can be part of the orchestration layer include:
- Code libraries
- Core API and traffic management tools
- Workflow and agent orchestration tools
- Security and audit tools
In the diagram below, I highlight this orchestration layer and its prompt building function in red.
How does this orchestration layer combine the various sources of input it receives into what it feeds the LLM? How much does it influence, for example, the extent to which outputs from a user’s interaction with the LLM rely on the content retrieved from a RAG process relative to the content already incorporated in the LLM?
What I have been able to understand is the following:
- All input coordinated by the orchestration layer needs to ultimately be transformed into tokens and then into vectors that the LLM transformer can work with. The number of tokens an LLM can handle at a time is called its context window. As of March 2026, it seems like most readily available LLMs have context windows of anywhere between 128 thousand and 1 million tokens.
- If the custom LLM uses RAG, what the orchestration layer receives from the RAG system are chunks selected by a vector search engine by semantically comparing the embeddings from the prompt given to it with those in the vector database that contains the RAG content.
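The vector search step can be sketched with toy embeddings. In a real RAG system the embeddings come from an embedding model and the search runs in a vector database; here, the hand-written three-dimensional vectors and the `retrieve` function are purely illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy vector database: chunk text -> precomputed embedding.
vector_db = {
    "The cat sat on the mat.": [0.9, 0.1, 0.0],
    "Quarterly revenue grew 12%.": [0.1, 0.9, 0.2],
    "Attention is all you need.": [0.2, 0.3, 0.9],
}

def retrieve(query_embedding, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(vector_db.items(),
                    key=lambda kv: cosine(query_embedding, kv[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# A finance-flavored query embedding lands closest to the revenue chunk.
chunks = retrieve([0.15, 0.85, 0.1])
```

The selected `chunks` are what the orchestration layer receives and then places into the prompt.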
- The transformer treats all tokens received as input equally and what determines how some tokens influence others is the transformer’s self-attention mechanism. This mechanism can be influenced by several factors (according to my exchange with ChatGPT):
- Tokens that are semantically aligned with the question get higher attention.
- Strong instructions bias attention patterns.
- Proximity in context window can matter.
- Longer RAG context can dilute influence.
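The dilution point can be illustrated with a toy softmax over attention scores. This is a deliberately simplified single-query picture, not a real transformer: one "relevant" token competes with a growing number of distractor tokens, and its attention weight shrinks as the context grows.

```python
import math

def softmax(scores):
    """Normalize raw scores into attention-like weights."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One relevant token (score 2.0) among N distractors (score 0.0).
# As N grows, the weight on the relevant token shrinks:
# longer context dilutes the influence of any single chunk.
for n_distractors in (4, 64, 512):
    weights = softmax([2.0] + [0.0] * n_distractors)
    print(n_distractors, round(weights[0], 3))
```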
- So, to some extent, we can influence the importance given by the LLM to its various sources of input by affecting some of the factors above in the orchestration layer. In particular, we are able to influence:
- the size of RAG chunks and where they are placed in the context window
- the instructions received by the LLM and whether the system prompt forces “grounding”
- Grounding consists of forcing an LLM to rely completely or mostly on selected sources rather than on its pre-trained content. There are several ways of enforcing strong grounding in the orchestration/control layer, including:
- removing the user message from the context;
- forcing an answer drawn only from citations;
- rejecting answers that lack citations, requiring the LLM to support every answer with retrieved text
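A guard for the last point could look like the sketch below. The `[1]`-style citation format and the function name are my own illustrative assumptions; real systems vary in how they mark citations:

```python
import re

def enforce_grounding(answer, num_sources):
    """Reject an answer unless every sentence cites a retrieved source.

    Assumes citations look like [1], [2], ... referring to retrieved
    chunks; this format is illustrative, not a standard.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cited = re.findall(r"\[(\d+)\]", sentence)
        if not cited:
            return False, f"No citation in: {sentence!r}"
        if any(int(c) < 1 or int(c) > num_sources for c in cited):
            return False, f"Unknown source cited in: {sentence!r}"
    return True, "ok"

ok, _ = enforce_grounding("Revenue grew 12% [1]. Costs fell [2].", num_sources=2)
bad, why = enforce_grounding("Revenue grew 12%.", num_sources=2)
```

In a pipeline, a `False` result would trigger a retry or a refusal rather than showing the ungrounded answer to the user.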
I recently noticed how effective the system prompt of custom LLMs can be (the “prompt engineering” that I had previously downplayed) in a custom GPT I built to summarize weekly updates I was receiving from my team into monthly reports. By just inputting the information on a regular ChatGPT interface, the results I was getting were unusable: the LLM was having a hard time matching sentences from different weeks that referred to the same workstream, and was also messing up the timeline of the updates. By building a custom GPT (using ChatGPT’s own tool), I was able to fix those issues: the difference was just the instruction layer translating the user input into the actual instructions received by the LLM.
Context Window Sizes, RAG and the Role of the Orchestration Layer
I have the ChatGPT Plus plan (I pay $20/month) and I asked ChatGPT about the size of the context windows of the LLMs I have access to:
- For OpenAI’s GPT-5.3 (the “instant” model), the typical context window is 128k, if accessed directly through an API. But when I access it through the ChatGPT Plus plan that I have, it goes through an orchestration layer. This orchestration layer may:
- Reserve tokens for system + tool instructions
- Keep headroom for the model’s response
- Trim conversation history dynamically
- Enforce tier-based limits
All of the above take context window space. The result is that the effective context window that I typically have access to using ChatGPT is closer to 32k.
- For OpenAI’s GPT-5.4 (the “thinking” model), the API-accessed context window is around 1 million tokens, while the effective context window I have access to through my ChatGPT Plus plan interface is closer to 256k – 400k “depending on tier or configuration” (I did not dive deeper to figure out what that means)
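The reservations described above amount to simple token arithmetic. The sketch below uses illustrative figures chosen to match the 128k-to-roughly-32k example; the actual reservation sizes are not published:

```python
def effective_window(total, system_tools, response_headroom, history):
    """Tokens left for new user input after the orchestration layer's
    reservations. All figures are illustrative assumptions."""
    return total - system_tools - response_headroom - history

# A 128k model window can shrink to roughly 32k of usable space once
# system/tool instructions, response headroom, and kept history are reserved.
remaining = effective_window(
    total=128_000,
    system_tools=16_000,
    response_headroom=32_000,
    history=48_000,
)
print(remaining)  # 32000
```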
In theory, the larger the context window, the less necessary RAG becomes: more of the material that would typically go into the RAG can be inserted directly into the context window, instead of being stored in a vector database with only the relevant chunks being retrieved. To give an idea of what these context window dimensions mean in terms of the size of text (again, according to ChatGPT), roughly 100,000 tokens corresponds to about 75,000 words of plain English text.
However:
- PDF documents use more tokens, so 100,000 tokens may be closer to 40k-60k words, instead of 75k
- Code and structured data are also less efficiently translated into tokens
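These rules of thumb can be captured in a small helper. The ratios below are ballpark assumptions taken from the rough figures above, not measurements:

```python
# Rough words-per-token ratios; ballpark assumptions, not measurements.
WORDS_PER_TOKEN = {
    "plain_text": 0.75,  # ~100k tokens ≈ 75k words
    "pdf": 0.5,          # extraction overhead: ~100k tokens ≈ 40k-60k words
    "code": 0.4,         # identifiers and punctuation tokenize inefficiently
}

def approx_words(tokens, kind="plain_text"):
    """Estimate how many words fit in a given token budget."""
    return int(tokens * WORDS_PER_TOKEN[kind])

print(approx_words(100_000))         # 75000
print(approx_words(100_000, "pdf"))  # 50000
```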
In reality, even with larger context windows, it seems like RAG can still be beneficial for two reasons:
- The self-attention mechanism of LLMs seems to degrade when large amounts of content are inserted into the context window;
- More content in the context window increases the cost of using LLMs
Whether or not the custom application uses RAG considerably changes the focus of the orchestration layer:
- Without a RAG, but relying on a larger context window pipeline, the orchestration layer will focus on whether to load full texts or summaries, in what order, and how much of older conversations to include. The focus is on context management, and output failures often reflect failures in that management, such as too much irrelevant content.
- With a RAG, the orchestration layer will focus more on which chunks to retrieve from the RAG’s uploaded documents, and output failures often reflect failures in that retrieval.
Hybrid architectures are often used where the orchestration layer may include full text documents in the context window in some cases (e.g. for smaller documents) and retrieve chunks through a RAG in other cases (e.g. large documents).
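A hybrid routing rule can be as simple as a size threshold. The 8,000-token cutoff below is an arbitrary illustrative choice; real systems would tune it against cost and retrieval quality:

```python
# Hypothetical routing rule for a hybrid architecture: small documents go
# into the context window whole; large ones go through RAG retrieval.
FULL_TEXT_TOKEN_LIMIT = 8_000

def route_document(doc_tokens):
    """Decide whether a document is inlined or indexed for retrieval."""
    if doc_tokens <= FULL_TEXT_TOKEN_LIMIT:
        return "inline"  # paste the full text into the prompt
    return "rag"         # index the document and retrieve chunks on demand

assert route_document(2_000) == "inline"
assert route_document(150_000) == "rag"
```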
Sources:
OpenAI. ChatGPT 5.3. Accessed March 2026.
IBM Technology. 2026. “Is RAG Still Needed? Choosing the Best Approach for LLMs.” YouTube video, March. Available: https://www.youtube.com/watch?v=UabBYexBD4k. Accessed March 21, 2026.
