Customizing LLMs – Part 2: an Experiment

Image by Samuel Ijimakin. Downloaded from Pixabay

[For the data and Python code used for this blog post, please visit my GitHub repository https://github.com/EngelbergHuller/Customizing-LLMs-Experiment/tree/main]

As mentioned in Part 1 of this two-part series (the first part is on my FAN page, titled Customizing LLMs – Part 1: the Concept), I wanted to better understand our capacity to customize LLMs, so I decided to explore the concept and then experiment by building a RAG system on my laptop. I am not a developer, so I “vibe coded” with ChatGPT 4o to set up the RAG system, and then compared its results with those I obtained by simply uploading the same set of documents to ChatGPT. Because my RAG system connected to the OpenAI LLM, the thought was that the comparison would help me isolate the components that are particular to RAG rather than to the underlying LLM. I describe this experiment here.

A Bit More Background and Outline

I have the personal ChatGPT Plus plan at $20 a month. This plan allows me to upload a limited number of documents to ChatGPT and ask it to analyze them. If I understand correctly, the limits are about 50 MB in total, with no file larger than about 20 MB. If more needs to be analyzed concurrently, ChatGPT will offer to analyze the documents in batches and then compare the batches for a broader summary. I understand that when ChatGPT analyzes documents uploaded to its projects, it follows a RAG-type pipeline, but with its own specific tools.

Below I:

  1. Describe the specifications of the RAG system on my laptop relative to those that ChatGPT tells me it uses when analyzing documents uploaded to it
  2. Describe the documents I used as input to both the laptop RAG system and ChatGPT
  3. Describe the questions (prompts) I used and the relative responses I received from each of the two systems
  4. Try to understand how the different responses resulted from different system specifications

Laptop RAG and ChatGPT Specifications

I used PyCharm as an IDE (Integrated Development Environment), created an account on Open AI to access their LLM, and then asked ChatGPT 4o to walk me through the process of creating a RAG system on my laptop. The system on my laptop uses:

  • A customized script for chunking the PDFs:
    • It defined a maximum number of words per chunk (800), not tokens
    • It defined an overlap of 100 words between adjacent chunks (for continuity)
  • The OpenAI embedding model “text-embedding-3-small”
  • The FAISS library of algorithms for indexing and searching vectors by similarity
  • The OpenAI LLM ChatGPT 4o
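To make the chunking step concrete, here is a minimal Python sketch of word-based chunking with overlap (the function name and exact logic are illustrative; my actual script differs in detail):

```python
def chunk_words(text, max_words=800, overlap=100):
    """Split text into chunks of up to max_words words, with adjacent
    chunks sharing `overlap` words for continuity (a sketch of the
    word-based approach described above, not the exact script)."""
    words = text.split()
    if not words:
        return []
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - overlap
    return chunks
```

Each chunk would then be embedded with text-embedding-3-small and added to a FAISS index for similarity search.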

In comparison, when documents are uploaded to ChatGPT 4o, it tells me it uses:

  • A chunking pipeline with:
    • Chunk sizes of ~500–700 tokens (~400–600 words)
    • Overlap of ~50–100 tokens
    • Splitting by semantic logic, often prioritizing punctuation
  • An embedding model “similar in capability to”:
    • text-embedding-3-large or
    • text-embedding-ada-002 (depending on optimization and routing)
  • “An internal vector index that behaves similarly to FAISS, but is not FAISS itself”
  • The same OpenAI LLM, ChatGPT 4o

Input

I used as input 20 PDF documents containing old newspaper editorials that I downloaded from a ProQuest database (accessed through my local public library), the results of a search for the keywords “foreign aid,” “foreign assistance,” or USAID. I won’t discuss the broader document search because, for the purposes of this RAG experiment, the 20-document set is the universe of interest. It is relevant, however, that these are scanned images of old newspaper editorials because, as I discuss further below, one of the issues I encountered seemed to stem from the quality of the Optical Character Recognition (OCR) software I was able to use.

Q&A

Having uploaded the 20 PDFs to both ChatGPT’s interface and to the RAG system on my laptop, I asked both the same three questions:

  • In one short paragraph, tell me what these documents are about
  • Are all articles critical of foreign aid or do any of them praise or defend it?
  • Please provide a table with these 20 articles categorized by stance, date, and key quotes

One advantage of the ChatGPT tool that became clear upfront was that it followed its responses with meaningful questions or suggestions, in a way that my RAG system did not. I added the last question above at ChatGPT’s suggestion.

I also modified the laptop RAG a couple of times as I noticed some of the limitations in the results I was obtaining:

First, as I collected the responses from my custom RAG system, a sentence in one of its responses made clear that it was not accessing the entire content of the PDF documents but seemed to rely mostly on titles and perhaps some other metadata. It became clear that, because the PDFs were mostly scans of newspapers, the RAG needed Optical Character Recognition (OCR) but was not using it, relying instead on whatever it could read as embedded text. I had to install two new pieces of software, add them to my Windows environment, and ensure PyCharm could access them: Tesseract and Poppler, two open-source tools that work together to enable OCR.
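The resulting extraction logic can be sketched roughly as follows (the 200-character threshold and function names are my own illustrative choices; pdf2image is the Python wrapper around Poppler, and pytesseract the wrapper around Tesseract):

```python
def needs_ocr(text_layer, min_chars=200):
    """Heuristic: if a page's embedded text layer is very short, the page
    is probably a scanned image and we should fall back on OCR.
    (The threshold is an illustrative assumption, not my exact setting.)"""
    return len(text_layer.strip()) < min_chars

def ocr_pdf(pdf_path):
    """Rasterize a scanned PDF with Poppler (via pdf2image), then read
    the text from each page image with Tesseract (via pytesseract)."""
    from pdf2image import convert_from_path  # requires Poppler on PATH
    import pytesseract                       # requires Tesseract on PATH
    pages = convert_from_path(pdf_path, dpi=300)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```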

After rebuilding the RAG system with OCR, the results were better, but I noticed the responses did not seem to make use of all 20 documents. It turned out that the FAISS tool was retrieving chunks using a k=8 nearest-neighbor criterion and ignoring the rest. So I expanded that to k=20. ChatGPT warned me that doing so could decrease the relevance of the responses; and yet, comprehensive coverage was what I was looking for.
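The effect of raising k can be illustrated with a brute-force version of the nearest-neighbor search FAISS performs (an illustrative sketch, not my actual retrieval code; my index settings may differ):

```python
import numpy as np

def retrieve(query_vec, chunk_vecs, k=8):
    """Return the indices of the k chunks closest to the query embedding
    by L2 distance, mimicking what a flat FAISS index's search does.
    Raising k from 8 to 20 widens coverage at the cost of pulling in
    chunks that are less relevant to the query."""
    dists = np.linalg.norm(chunk_vecs - query_vec, axis=1)
    return np.argsort(dists)[:k]
```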

Below I copy the responses obtained by the two approaches (I reformatted the responses considerably for presentation but did not modify the content).

Prompt 1. In one short paragraph, tell me what these documents are about

Prompt 2. Are all articles critical of foreign aid or do any of them praise or defend it?

Prompt 3. Please provide a table with these 20 articles categorized by stance, date, and key quotes

ChatGPT 4o

Custom RAG – OCR K=20

Comments on the responses to questions

On the responses to question 1. Unlike ChatGPT, the laptop RAG did not seem able to identify the Washington Post as one of the newspapers from which editorials were sourced. Three of the 20 articles were from the Washington Post; the other 17 were from the New York Times. Since the RAG system reports that when the text content of a chunk is low it “falls back on OCR,” I thought that perhaps it then dropped whatever content of that chunk was in text format. But I asked the program, and that seems not to be the case: it refers to “references from the context” as references extracted in text format, and it stated that it used both OCR-extracted and text-extracted content when responding to questions.

The laptop RAG also did not identify foreign aid and foreign assistance as the central theme of the 20 documents as clearly as ChatGPT did. After further examining the documents, I found that two of the 20 are full pages from the Washington Post containing several editorials each, with only one editorial per page referring to foreign aid. The laptop RAG system seems to have taken the other editorials on those pages into account much more than ChatGPT did when providing an overall summary.

On the responses to question 2. The ChatGPT answers provide a sentence summarizing the articles with a common stance (e.g., critical, favorable) and then give examples. The laptop RAG system stays focused on the individual chunks, only providing short summary sentences at the beginning and end. Also, the ChatGPT answers refer to specific documents when providing examples (numbers in brackets), while the laptop RAG system refers to chunks.

On the responses to question 3. ChatGPT responded to the request to categorize the articles by “stance” by defining four buckets into which it divided all 20 articles (supportive, critical, mixed/reformist, and neutral/analytical). The laptop RAG system responded with an individualized “stance” for each article.

The laptop RAG system did not interpret each document as one of the “articles” mentioned in the prompt. I believe this was likely an issue with my prompt. As mentioned, 2 of the 20 documents contained more than one editorial. But we may also need to think about how best to customize the retrieval of information. The FAISS library retrieves chunks by vector similarity. When it was set to retrieve the 8 nearest-neighbor chunks, it produced a list of 12 articles in response to question 3. When I expanded this to 20 nearest-neighbor chunks, it retrieved 19.

Where the responses became most troubling to me was when I noticed that neither ChatGPT nor the laptop RAG correctly identified the titles of all the articles. The laptop RAG actually performed better here than ChatGPT, getting 13 titles correct, while ChatGPT got only 9. As previously mentioned, 2 of the 20 documents contained more than one editorial, which would have contributed to the difficulty of selecting one title per document, and this seems reflected in some of the titles offered by the laptop RAG system. But the titles offered by ChatGPT were particularly bewildering, and I could not figure out where they came from.

ChatGPT also sometimes identified dates incorrectly, while the laptop RAG sometimes returned NA when it could not identify a date. For example, ChatGPT listed five documents as being from 1972 when only two are from that year.

So what do I draw from the experiment?

First, I would not currently feel comfortable relying on either ChatGPT or this initial laptop RAG system to provide good information about scanned pages of newspaper articles. That said, given the responses to question 3, I am left with the impression that what may have seemed to be better responses by ChatGPT to questions 1 and 2 may actually reflect a greater inclination by ChatGPT to “fill in gaps” with made-up information, compared to the laptop RAG system.

In addition, given that there seem to be several ways to improve on this laptop RAG (based on my discussion about it with ChatGPT), the laptop RAG may actually be a more promising avenue to obtain more reliable information from such types of documents. 

Second, based on information provided by ChatGPT itself, the two systems differed in the chunking approach taken (words vs. tokens), the embedding models used (even though both used OpenAI embedding models), and the indexing method used for search and retrieval. Because the laptop RAG system is transparent and customizable in each of these elements in ways that directly using ChatGPT does not seem to be, it should leave room for improvement.

Third, one of the main reasons to look into a customized RAG system (for my purposes) is the existence of limitations in directly using an LLM like ChatGPT to analyze large amounts of scanned documents. However, in further exploring a customized RAG system for this purpose it seems like some effort should go into:

  1. Further scrutinizing which types of documents will be interpreted well by the RAG system and which may not be, and finding ways to, perhaps, exclude documents that would not be read well. For example, how can we better deal with documents that scan entire newspaper pages where only one of the articles on the page is of interest?
  2. Further looking into how well OCR is working and how well the RAG system combines information captured through OCR with information already in text format
A final note on prompt engineering: I seem not to have paid sufficient attention to it, in Part 1 of this two-post series and in this exercise as well. Given the limitations of both ChatGPT and the laptop RAG system, it is possible that I would have obtained better results just by specifying more clearly to the systems the output I was looking for.
 
Oh, and on the OpenAI cost of the laptop RAG system exercise: $0.15
 

Sources

OpenAI. ChatGPT 4o, accessed October 2025

ProQuest access to The New York Times and Washington Post historical editions. Accessed through Fairfax County Public Libraries, October 2025 


Similarities in GDP Per Capita Trajectories

[For the data and R code for this blog post, please visit my GitHub repository EngelbergHuller/GDP-Growth-Similarities]

In my previous Engelberg Huller post, “Catch-up” (January 20), GDP per capita growth data over a 60-year period seemed to suggest similarities in the growth trajectories of countries geographically close to each other, whether these reflect similar institutions and histories, economic integration and interdependency patterns, or some other factor.

In an attempt to further explore these similarities, and also to teach myself a bit of the open-source statistical software R, I decided to look at growth data using an R package called “SimilarityMeasures.” This package offers functions that compare two vectors and assess the numerical proximity between their elements. Such functions are often used to compare the distance between two geographical trajectories, such as those of migrating animals or traffic, but they can also be used to compare trajectories of single variables over time.

I used the same dataset of Gross Domestic Product in constant local currency units over the 60-year 1961-2020 period that I used in the “Catch-up” post. I had to exclude 3 of the 93 countries used in “Catch-up” for lack of complete data for all 60 years, and I transformed GDP in constant LCUs into an index with 1961 = 100 to compare trajectories in the same unit of measurement.
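Rebasing to an index is a simple transformation; a quick sketch (in Python for brevity, though my actual work was in R):

```python
def to_index(series, base=100):
    """Rebase a GDP per capita series so the first year equals `base`
    (here 1961 = 100), making trajectories measured in different local
    currency units comparable."""
    first = series[0]
    return [base * value / first for value in series]
```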

I used a function called Longest Common Subsequence (LCSS). This function counts the number of elements in two trajectories that are considered equivalent under certain criteria, determined by three parameters. The following is my understanding of these parameters:

  • The first establishes which elements of each vector are compared. In the R function, this is the “pointSpacing” argument. A value of 2 means that a gap of up to 2 index positions between the compared elements of the two vectors is allowed.
  • The second establishes the difference allowed between the values of the compared elements for those elements to be considered equivalent. In the R function, this is the “pointDistance” argument.
  • The third parameter I understand less well, but it is a margin of error for the algorithm’s calculations, influencing the “accuracy and speed of the calculation.” In the R function, this is the “errorMarg” argument. In calibration, this parameter seemed to make little difference to the outcomes.
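My reading of the first two parameters can be expressed as a simplified dynamic-programming sketch (in Python rather than R, with errorMarg omitted; this reflects my understanding of the algorithm, not the package’s actual code):

```python
def lcss(a, b, point_spacing=2, point_distance=2):
    """Count matched elements between two series: a[i] and b[j] can
    match only if their index gap |i - j| is within point_spacing and
    their values differ by at most point_distance."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(i - j) <= point_spacing and abs(a[i-1] - b[j-1]) <= point_distance:
                dp[i][j] = dp[i-1][j-1] + 1  # pair matches: extend the subsequence
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[n][m]
```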

I initially applied the LCSS function to the country GDP per capita index trajectories, where 1961 was set to 100 for all countries. Because the LCSS function compares years based on the distance between their values, many more years are considered equivalent by the function in the early period of the trajectories (say, the 1960s) than in the later period (say, the 2010s). This is not what we would like: we would like all periods of the trajectories to carry the same weight when assessing the similarity of two trajectories.

So I turned to applying the LCSS function to the growth rates themselves. Doing so means that when one country’s GDP per capita index goes up, say, 3 percentage points, and another one’s does too, the two trajectories are considered equivalent in that year, even if by that point in time their cumulative growth histories had pulled their trajectories apart.

To calibrate the LCSS function (choose the parameters to use), I used the trajectories for Argentina and Uruguay, two countries whose GDP per capita growth trajectories appeared closely related in my “Catch-up” post. I chose parameters that seemed intuitively reasonable and did not generate extreme outcomes (e.g., entire trajectories for two countries being considered the same, or only the first year, 1961 = 100, being considered the same). I ended up with:

  • pointSpacing = 2
  • pointDistance = 2
  • errorMarg = 0.5

Running the LCSS function to compare the 90 countries pairwise, in all possible combinations, generates a matrix of 8100 results with a diagonal of 59 (each country’s trajectory compared to itself shows all 59 growth years as equivalent). This leaves 8100 − 90 = 8010 results comparing different countries. Because the function compares, say, Argentina to Uruguay and then Uruguay to Argentina, the number of unique results comparing two different countries is actually 8010 / 2 = 4005. Because it took my laptop a few seconds to compare each pair of trajectories, running the function for the entire set of 90 countries took over 11 hours (so I did each run overnight).
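The pair counting can be checked in a few lines; iterating over unordered pairs directly, rather than filling the full matrix, would also have roughly halved the 11-hour run:

```python
from itertools import combinations

countries = [f"country_{i}" for i in range(90)]  # placeholder labels

# The full 90 x 90 matrix holds 8100 results; dropping the 90 diagonal
# entries and each duplicated (A, B) / (B, A) pair leaves 4005.
pairs = list(combinations(countries, 2))
print(len(pairs))  # 4005
```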

Out of the 90 countries, the two that had the closest GDP growth trajectories were France and Austria. With the parameters chosen, their growth rates were equivalent in 58 of the 59 years. The least similar trajectories had growth rates that were equivalent in 20 of the 59 years and there were three pairs of trajectories with that score: Burma and Bahamas, Greece and Chad, Iran and Indonesia.

The eight most similar pairs of trajectories were among five European countries: France, Belgium, Netherlands, Italy and Austria. Their growth trajectories are shown in Figure 1, below.

Figure 2 shows the most similar GDP per capita trajectories, those of France and Belgium, and Figure 3 shows their growth rates.

The South American Southern Cone countries had GDP per capita similarity scores in the 20s and 30s, i.e., their growth rates were similar in 20 to 39 of the 59 years compared (F4).

From the “Catch-up” post, we saw that the two highest growth countries in the 1960-2020 period were China and South Korea. Figures 5 and 6 below show how their growth trajectories compare. Their growth rates were comparable until the early 1990s, when South Korea’s growth rate slowed down and China continued its accelerated pace.

Two other interesting pairs of growth trajectories are the United States and the United Kingdom, and Bolivia and Guatemala. After the five aforementioned European countries, the next closest pair of growth trajectories is that of the United States and the United Kingdom (F7). All other countries with trajectories similar to others in 50 or more of the 59 years compared are rich countries (other European countries and Australia), the exception being the pair Guatemala and Bolivia (F8). Both countries saw their GDP per capita fall in the first half of the 1980s. I will leave the reasons for a potential future post.

The exercise above suggests strong connections between the growth trajectories of rich countries, not as much for the rest of the world. It also proved to be a nice little contribution to my own R learning. I intend to further explore growth data in future posts.

References

World Bank: World Development Indicators. Available from USAID IDEA: https://idea.usaid.gov/.  Accessed: January 14, 2023


Catch-up

Look at the figure F1 below. It shows Gross National Income (GNI) per capita for countries in the Southern Cone of South America relative to that of the United States over a period of 26 years (1995-2020), as much data as I found available in Purchasing Power Parity (PPP). What do you see?

I see two main things:

  • Paraguay’s per capita income is pretty much the same share of the U.S.’s in 2020 as it was in 1995. Chile’s and Uruguay’s are slightly higher in 2020 than in 1995, Brazil’s is slightly lower than it was, and Argentina’s is quite lower than it was.
  • The biggest fluctuation in the ratio of GNI per capita relative to the U.S. was Argentina’s, particularly during the 10-year period between 1998 and 2008, when the ratio fell from around 0.4 to about 0.3 and then back up to 0.4 (interval shown by the vertical blue lines).

For someone interested in the economic development of the Southern Cone of South America, the two bullets are not very comforting. They suggest little to no “catch-up” happening relative to the United States. More generally, they show little movement at all in the ratio of national per capita incomes relative to the U.S., raising the question of how easy or hard it is to achieve some kind of catch-up. Even Argentina’s growth between 2002 and 2008 was likely mostly recovery from the decline between 1998 and 2002.

I looked at similar data for Central America, an area of particular importance to the U.S. and its foreign assistance, given the strong links of its population to the U.S. through migration flows.

Here too the main trend seems to be a relative stability in the ratio of national income of Central American countries relative to the U.S., the exception being some apparent progress being made by Panama since 2006.

Perhaps a secondary suggestion of both charts above is that there seem to be stronger similarities in the trajectories of some countries in the same region than of others. For example, Argentina and Uruguay. Or perhaps Brazil and Paraguay. Costa Rica, Panama’s neighbor, shows a slight upward trend from 2006, potentially associated with Panama’s. In other words, it is worth exploring the strength of economic integration between neighboring countries (in a future post).

I decided to look at longer term growth trends. I used Gross Domestic Product (GDP) per capita data measured in constant local currency units (LCUs) for three reasons: the World Bank has data for over 90 countries starting in 1960 for this indicator, GDP is presumably a better indicator of productivity growth inside the country than GNI, and constant LCUs would circumvent the exchange rate issues that other units of measurement (like constant U.S. dollars or PPP international dollars) have to deal with. The drawback is that the absolute measures of output are not comparable between countries. It only makes sense to use LCUs to compare growth rates. I divided the average GDP per capita of a country between 2018-2020 by the average for that same country between 1961-1963. The result is how many times the GDP per capita of that country was multiplied over a 60 year period, in constant local currency. This is a measure of productivity growth.
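The growth-factor calculation described above can be sketched as follows (with made-up numbers; the actual data come from the World Bank’s WDI):

```python
def growth_factor(gdp_by_year):
    """How many times GDP per capita (constant LCU) was multiplied over
    the period: mean of 2018-2020 divided by mean of 1961-1963.
    Averaging three years at each end smooths single-year fluctuations."""
    start = sum(gdp_by_year[y] for y in (1961, 1962, 1963)) / 3
    end = sum(gdp_by_year[y] for y in (2018, 2019, 2020)) / 3
    return end / start
```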

Figure 3 below is a histogram for the 93 countries for which data were available. For 82 of those countries, the resulting growth factor in per capita GDP over the 60-year period was between 0 and 6. Three countries had a growth factor between 6 and 7, and the remaining 8 countries had higher growth factors, including factors of 58 for China, 28 for South Korea, 17 for Botswana, and 15 for Singapore. I did not include a 94th country, Somalia, for which the factor was 551, which seemed unreasonably large to me. I hope to explore this in a future post.

Looking at these data, I again have two observations:

  • If the U.S. GDP per capita was almost 3 times higher in 2020 than in 1960, all the other countries that grew their GDP per capita by multiples between 0 and 4 or 5 did little or no catching up. If your GDP per capita is, say, a quarter of that of the U.S. and the U.S. grows its GDP per capita three times over a given period, you would need to grow yours by 3 × 4 = 12 times to catch up. If your GDP per capita was a tenth of that of the U.S., you would need to grow your GDP per capita by a multiple of 30.
  • The countries that did some catching up seem geographically concentrated around China, with the exception of Botswana, as shown by the circle in the map below. The darker the red, the higher the GDP per capita growth factor. The darker the blue, the lower the GDP per capita growth factor.

F4. Map: countries in red shade are those in the first two buckets of the above histogram

Data Source: World Bank, World Development Indicators (WDI). GDP Constant LCUs and Population data. Map built using Tableau Public.

If China performed so well over the 60-year period, how come its GDP per capita today is not comparable to or even larger than that of the U.S.? We do not have data in comparable units (e.g., PPP) going back to 1960. But based on GNI data measured in current US$ (averaging exchange rates over a three-year period – the Atlas method), China’s GNI per capita in 1962 (the oldest year available) was approximately 2% (1/50) of that of the U.S. China would have had to grow by a factor of 50 × 3 = 150 during that period to have caught up with the U.S.

So here are some questions for potential exploration in future posts:

  1. How common/rare is it for a country to catch up? Are there particular circumstances that are always/often present when countries do catch up? Are these circumstances different for countries at different levels of GDP/GNI per capita?
  2. How good/bad are GDP and GNI as indicators of the standards of living and/or well-being of the population of a country?
  3. To what extent do fluctuations in exchange rates affect standards of living and well-being? How well are the different units of measurement of GDP and GNI able to remove the effect of any share of those fluctuations that do not reflect standards of living or well-being?
  4. What accounts for the apparent similarity in growth trajectories of some neighboring countries? Is it level of trade and/or economic integration? Is it similarity in their economies and exposure to similar external circumstances (shocks)?
  5. On the Southern Cone countries: Uruguay’s GNI per capita fluctuations seem to follow somewhat those of Argentina up to around 2013 or so, but then not so much. Same thing seems to have happened to Paraguay’s relative to Brazil. Was that actually the case and what could explain that?
  6. On Central America: what explains Panama’s performance after 2006?

References

World Bank: World Development Indicators. Available from USAID IDEA: https://idea.usaid.gov/.  Accessed: January 14, 2023

