[For the data and Python code used for this blog post, please visit my GitHub repository https://github.com/EngelbergHuller/Customizing-LLMs-Experiment/tree/main]
As mentioned in Part 1 of this two-part series (the first part is on my FAN page, titled Customizing LLMs – Part 1: the Concept), I wanted to better understand our capacity to customize LLMs, so I decided to explore the concept and then experiment with building a RAG system on my laptop. I am not a developer, so I “vibe coded” with ChatGPT 4o to set up the RAG system, and then compared its results with those I obtained by simply uploading the same set of documents to ChatGPT. Because my RAG system connected to the OpenAI LLM, the idea was that the comparison would help isolate the components particular to RAG from those of the underlying LLM. I describe this experiment here.
A Bit More Background and Outline
I have the ChatGPT personal Plus plan, for which I pay $20 a month. This plan allows me to upload a limited number of documents to ChatGPT and ask it to analyze them. If I understand correctly, the limits are about 50 MB in total, with no file larger than about 20 MB. If more needs to be analyzed concurrently, ChatGPT will offer to analyze the documents in batches and then compare the batches for a broader summary. I understand that when ChatGPT analyzes documents uploaded to its projects, it follows a RAG-type pipeline, but using its own specific tools.
Below I:
- Describe the specifications of the RAG system on my laptop relative to those that ChatGPT tells me it uses when analyzing documents uploaded to it
- Describe the documents I used as input to both the laptop RAG system and ChatGPT
- Describe the questions (prompts) I used and the relative responses I received from each of the two systems
- Try to understand how the different responses resulted from different system specifications
Laptop RAG and ChatGPT Specifications
I used PyCharm as an IDE (Integrated Development Environment), created an account with OpenAI to access their LLM, and then asked ChatGPT 4o to walk me through the process of creating a RAG system on my laptop. The system on my laptop uses:
- A customized script for chunking the PDFs:
- It defined a maximum number of words per chunk (800), not tokens
- It defined an overlap of 100 words between adjacent chunks (for continuity)
- The OpenAI embedding model “text-embedding-3-small”
- The FAISS library of algorithms for indexing and searching vectors by similarity
- The OpenAI LLM ChatGPT 4o
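The chunking step above can be sketched roughly as follows. This is a minimal illustration, not the actual script; the function name is my own, but the 800-word chunk size and 100-word overlap match the values listed:

```python
# Illustrative sketch of word-based chunking with overlap (my own code,
# not the script ChatGPT generated; the 800/100 values match those above).

def chunk_words(text, max_words=800, overlap=100):
    """Split text into overlapping chunks of at most max_words words."""
    words = text.split()
    chunks = []
    step = max_words - overlap  # each chunk starts 700 words after the last
    for start in range(0, len(words), step):
        chunk = words[start:start + max_words]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + max_words >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks

# Example: a 1,600-word text yields chunks starting at words 0, 700, 1400
sample = " ".join(f"w{i}" for i in range(1600))
print([len(c.split()) for c in chunk_words(sample)])  # [800, 800, 200]
```

Counting words rather than tokens keeps the script simple (no tokenizer dependency), at the cost of chunks whose token counts vary with vocabulary.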
In comparison, when documents are uploaded to ChatGPT 4o, it tells me it uses:
- A chunking pipeline with:
- Chunk sizes of ~500–700 tokens (~400–600 words)
- Overlap of ~50–100 tokens
- Splitting by semantic logic, often prioritizing punctuation
- An embedding model “similar in capability to”:
- text-embedding-3-large or
- text-embedding-ada-002 (depending on optimization and routing)
- “an internal vector index that behaves similarly to FAISS, but is not FAISS itself”
- Also OpenAI LLM ChatGPT 4o
Input
I used as input 20 PDF documents containing old newspaper editorials that I downloaded from a ProQuest database (accessed through my local public library), retrieved by searching for the keywords “foreign aid,” “foreign assistance,” or USAID. I won’t discuss the broader document search because, for the purposes of this experiment with RAG, the 20-document set is the universe of interest. It is relevant, however, that these are scanned images of old newspaper editorials because, as I discuss further below, one of the issues I encountered seemed to stem from the quality of the Optical Character Recognition (OCR) software I was able to use.
Q&A
Having uploaded the 20 PDFs to both ChatGPT’s interface and the RAG system on my laptop, I asked both the same three questions:
- In one short paragraph, tell me what these documents are about
- Are all articles critical of foreign aid or do any of them praise or defend it?
- Please provide a table with these 20 articles categorized by stance, date, and key quotes
One advantage of the ChatGPT tool that became clear up front was that it followed its responses with meaningful questions or suggestions, in a way that my RAG system did not. I added the last question above at ChatGPT’s suggestion.
I also modified the laptop RAG a couple of times as I noticed some of the limitations in the results I was obtaining:
First, as I collected the responses from my custom RAG system, a sentence in one of its responses made clear that it was not accessing the entire content of the PDF documents but seemed to rely mostly on the title and perhaps some other metadata. Because the PDFs were mostly scans of newspapers, the RAG needed to use OCR, but it was not doing so; it relied only on what it could read as embedded text. I had to install two new pieces of software, add them to my Windows environment, and ensure PyCharm could access them: Tesseract and Poppler, two open-source tools that work together to enable OCR.
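The text-versus-OCR decision can be sketched roughly as below. This is my own illustration, not the actual script: the 200-character threshold is an assumption, and the pdf2image and pytesseract packages are the usual Python wrappers around Poppler and Tesseract.

```python
# Sketch of a text-or-OCR extraction step (my own illustration; the
# 200-character threshold is an assumed value, not the script's).

def needs_ocr(text_layer: str, min_chars: int = 200) -> bool:
    """Heuristic: if the embedded text layer is nearly empty,
    the page is probably a scanned image and needs OCR."""
    return len(text_layer.strip()) < min_chars

def extract_text(pdf_path: str) -> str:
    """Use the PDF's text layer when present; otherwise OCR the pages."""
    # Imports are local so needs_ocr stays usable without these packages
    from pypdf import PdfReader
    text = "\n".join(
        page.extract_text() or "" for page in PdfReader(pdf_path).pages
    )
    if not needs_ocr(text):
        return text  # a real text layer exists; no OCR needed
    # Scanned pages: render images via Poppler, then OCR via Tesseract
    from pdf2image import convert_from_path
    import pytesseract
    images = convert_from_path(pdf_path, dpi=300)
    return "\n".join(pytesseract.image_to_string(img) for img in images)
```

Note that Tesseract and Poppler are external binaries; the Python packages only call out to them, which is why both had to be installed and added to the Windows PATH.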
After rebuilding the RAG system with OCR, the results were better, but the responses still did not seem to make use of all 20 documents. It turned out that the FAISS tool was retrieving information using an 8-nearest-neighbor criterion and ignoring everything else, so I expanded that to 20. ChatGPT warned me that doing so could decrease the relevance of the responses; still, comprehensive coverage was what I was looking for.
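The retrieval change amounts to asking the index for more neighbors per query. FAISS’s flat index performs an exact nearest-neighbor search over the embedding vectors; the NumPy sketch below reproduces that computation (with random stand-in vectors rather than real OpenAI embeddings) just to illustrate the k = 8 → 20 change:

```python
# Illustration of the k-nearest-neighbor retrieval step using plain NumPy.
# A FAISS flat index does the same exact search, only faster; the vectors
# here are random stand-ins for real chunk embeddings.
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=20):
    """Return the indices of the k chunks closest to the query (L2 distance)."""
    dists = np.linalg.norm(chunk_vecs - query_vec, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
chunks = rng.random((100, 8), dtype=np.float32)  # 100 fake chunk embeddings
query = chunks[42] + 0.001                       # a query very near chunk 42

ids = top_k_chunks(query, chunks, k=20)          # was k=8; widened to 20
print(len(ids), ids[0])  # 20 42
```

Raising k trades precision for coverage: the 9th through 20th neighbors are, by construction, less similar to the query, but with only 20 documents that breadth was the point.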
Below I copy the responses obtained by the two approaches (I reformatted the responses considerably for presentation but did not modify the content).
Prompt 1. In one short paragraph, tell me what these documents are about
Prompt 2. Are all articles critical of foreign aid or do any of them praise or defend it?
Prompt 3. Please provide a table with these 20 articles categorized by stance, date, and key quotes
ChatGPT 4o
Custom RAG – OCR K=20
Comments on the responses to questions
On the responses to question 1. Unlike ChatGPT, the laptop RAG did not seem able to identify the Washington Post as one of the newspapers from which editorials were sourced. Three of the 20 articles were from the Washington Post; the other 17 were from the New York Times. Since the RAG system reports that it “falls back on OCR” when the text content of a chunk is low, I wondered whether it then discarded the content of that chunk that was in text format. But I asked the program, and that seems not to be the case. It appears to treat “references from the context” as references extracted in text format, and it stated that it used content extracted both with OCR and as text when responding to questions.
The laptop RAG also did not identify foreign aid and foreign assistance as the central theme of the 20 documents as clearly as ChatGPT did. On further examination, two of the 20 documents are full pages from the Washington Post that each contain several editorials, only one of which refers to foreign aid. When producing an overall summary, the laptop RAG system seems to have taken the other editorials in those two documents into account much more than ChatGPT did.
On the responses to question 2. The ChatGPT answers provide a sentence summarizing the articles that share a common stance (e.g., critical, favorable) and then give examples. The laptop RAG system stays focused on the individual chunks, providing only short summary sentences at the beginning and end. Also, the ChatGPT answers refer to specific documents when giving examples (numbers in brackets), while the laptop RAG system refers to chunks.
On the responses to question 3. ChatGPT responded to the request to categorize the articles by “stance” by defining four buckets into which it divided all 20 articles (supportive, critical, mixed/reformist, and neutral/analytical). The laptop RAG system responded with individualized “stances” for each article.
The laptop RAG system did not interpret each document as being one “article” as mentioned in the prompt. I believe this was likely an issue with my prompt. As mentioned, 2 of the 20 documents had more than one editorial in them. But we may also need to think about how best to customize the retrieval of information. The FAISS library retrieves information by vector similarity. When it was set to retrieve the 8 nearest-neighbor chunks, it produced a list of 12 articles in response to question 3; when I expanded this to 20 nearest-neighbor chunks, it retrieved 19.
Where the responses became most troubling to me was when I noticed that neither ChatGPT nor the laptop RAG correctly identified the titles of all the articles. The laptop RAG actually performed better here than ChatGPT, getting 13 titles correct, while ChatGPT got only 9. As previously mentioned, 2 of the 20 documents contained more than one editorial, which would have made selecting one title per document harder, and this seems to be reflected in some of the titles offered by the laptop RAG system. But the titles offered by ChatGPT were particularly bewildering, and I could not figure out where they came from.
ChatGPT also sometimes identified the dates incorrectly, while the laptop RAG sometimes returned NA when it could not identify a date. For example, ChatGPT listed five documents as being from 1972 when only two of the documents are from that year.
So what do I draw from the experiment?
First, I would not currently feel comfortable relying directly on either ChatGPT or this initial laptop RAG system to provide good information about scanned pages of newspaper articles. That said, given the responses to question 3, I am left with the impression that what may have seemed to be better responses by ChatGPT to questions 1 and 2 may actually reflect a greater inclination of ChatGPT to “fill in gaps” with made-up information than the laptop RAG system shows.
In addition, given that there seem to be several ways to improve this laptop RAG (based on my discussions about it with ChatGPT), it may actually be the more promising avenue for obtaining reliable information from these types of documents.
Second, based on information provided by ChatGPT itself, the two systems differed in the chunking approach taken (words vs. tokens), the embedding models used (even though both were OpenAI embedding models), and the indexing method used for search and retrieval. Because the laptop RAG system is transparent and customizable in each of these elements in ways that directly using ChatGPT does not seem to be, it should leave room for improvement.
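The differences just listed boil down to a handful of tunable parameters. The dataclass below is simply my own way of organizing them for comparison, not code from either system; the values come from the specifications given earlier:

```python
# My own organizational sketch of the knobs that differed between the
# two systems, using the values reported earlier in this post.
from dataclasses import dataclass

@dataclass
class RagConfig:
    chunk_unit: str       # "words" (laptop RAG) vs. "tokens" (ChatGPT)
    chunk_size: int       # 800 words vs. ~500-700 tokens
    chunk_overlap: int    # 100 words vs. ~50-100 tokens
    embedding_model: str  # named model vs. an unnamed internal one
    top_k: int            # nearest-neighbor chunks retrieved per query

laptop_rag = RagConfig(
    chunk_unit="words",
    chunk_size=800,
    chunk_overlap=100,
    embedding_model="text-embedding-3-small",
    top_k=20,  # expanded from 8 for broader coverage
)
print(laptop_rag.top_k)  # 20
```

Each field here is something the laptop system lets me change directly, which is exactly the transparency the ChatGPT upload path lacks.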
Third, one of the main reasons to look into a customized RAG system (for my purposes) is the limitations of directly using an LLM like ChatGPT to analyze large amounts of scanned documents. In further exploring a customized RAG system for this purpose, however, it seems some effort should go into:
- Further scrutinizing which types of documents will be interpreted well by the RAG system and which may not, and perhaps finding ways to exclude documents that would not be read well. For example, how can we better deal with documents that scan entire newspaper pages when only one of the articles on the page is of interest?
- Further looking into how well OCR is working and how well the RAG system combines information captured via OCR with information already available in text format
Sources
OpenAI. ChatGPT 4o. Accessed October 2025.
ProQuest. The New York Times and The Washington Post historical editions. Accessed through Fairfax County Public Libraries, October 2025.
