A Thought After Reading Daniel Kahneman’s “Thinking, Fast and Slow”

Image by Maryam62. Downloaded from pixabay.com

I can tell that Daniel Kahneman’s book, Thinking, Fast and Slow, is one of those books that I will find myself coming back to over and over again. Why? On one hand, it provides evidence from experimental psychology for phenomena that I was already inclined to believe, that is, it appeals to my own confirmation bias. At the same time, it offers a number of insights and explanations that are new to me and eye opening. Kahneman, a psychologist, won a Nobel Prize in Economics for his work on decision-making under uncertainty. His book reflects a lifetime of learning about human judgement and is an absolute delight to read.

Its central framework revolves around the idea that our minds address problems in two distinct ways, which he describes using the metaphors of System 1 and System 2. The table below summarizes some of the characteristics of each system.

System 1                                                                | System 2
Fast                                                                    | Slow
Intuitive                                                               | Deliberate
Automatic, cannot be turned off at will                                 | Effortful and therefore lazy
Impulsive                                                               | Requires self-control
Associative and causal, but lacks the capacity for statistical thinking | Requires training for statistical thinking
Prone to bias and to believing                                          | In charge of doubting and unbelieving

Much of Kahneman’s book is then focused on System 1 and its characteristics and biases. We learn of its need for coherence and the central role of cognitive ease in driving belief. System 2 is much less explored in the book and I was left with the desire to learn more about what can be done to strengthen our use of System 2.

Why would we want to strengthen our reliance on System 2? It is not obvious that doing so would necessarily lead to better social outcomes, if we are willing to rely on expertise. System 1’s intuitive thought draws both on heuristics (shortcuts that replace a complex problem with a simpler one) and on expertise (recognition of previously acquired information). Becoming an expert means you can draw on that acquired information to reach better conclusions with less effort. In other words, the more we rely on experts, the more likely we are to avoid incorrect conclusions in a world governed by System 1 thinkers.

However, we are unlikely (and it is probably impossible) to seek expertise in the myriad of problems and decisions we encounter on a daily basis. And as Kahneman points out in a reference to the work of another psychologist, Daniel Gilbert, System 1 is prone to believe (Kahneman, 2011, Chapter 7). “Unbelieving” is the effortful task of System 2. If we want to form opinions and draw conclusions about much of the information we encounter on any given day, we will need to spend energy, potentially a lot of energy: we will need to be willing to lead an effortful, even if potentially fulfilling, life. Those less susceptible to the System 1 biases and more prone to calling on System 2 Kahneman calls “engaged.”

Should we strive to be more “engaged?” If so, how?

References

Kahneman, Daniel, 2011. Thinking, Fast and Slow. Farrar, Straus and Giroux.


Challenges in Exploratory Data Analysis

Image by bluebudgie. Downloaded from pixabay.com

You are given a dataset and asked: “What do the data tell us? Do not assume we know anything about the subject; just tell us what the data say.” This is the task often referred to as “exploratory data analysis,” and it is harder than it might seem. I see two main challenges.

The first is the request to “not assume we know anything about the subject.” This request is easy to violate without realizing it. For example, say you have a dataset with twenty variables. It is perfectly fine during exploratory analysis to look not just at individual variables in your dataset, but also at how variables fluctuate relative to each other, that is, at correlation. Now, how easy is it to look at correlations within the dataset with no prior inclination to think some of the twenty variables are more likely to be correlated than others? We can fight the urge to pay more attention to those by always including all twenty variables in any and all considerations about correlation, but this requires discipline. One could even argue that we should, indeed, spend more time exploring correlations that we have a basis to believe reflect a causal connection, and that focusing equally on other correlations is a waste of time and possibly misleading. In any case, how to explore data given the mental models we all approach them with is a potential issue to be dealt with. I will likely return to this in a future post.
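As a sketch of the discipline described above, here is one way to scan every pairwise correlation rather than a hand-picked subset. The dataset, its size, and the variable count are hypothetical and purely illustrative:

```python
import numpy as np

# Hypothetical dataset: 100 observations of 20 variables.
rng = np.random.default_rng(42)
data = rng.normal(size=(100, 20))

# corrcoef returns the full 20 x 20 correlation matrix, so every
# pair of variables is considered, not just the pairs we expect
# to be related.
corr = np.corrcoef(data, rowvar=False)

# Rank all 190 distinct pairs by absolute correlation, so that no
# pair is privileged by our prior expectations.
pairs = [(i, j, corr[i, j]) for i in range(20) for j in range(i + 1, 20)]
pairs.sort(key=lambda p: abs(p[2]), reverse=True)
print(pairs[:5])  # the five strongest correlations, whatever they are
```

Listing the strongest correlations regardless of which variables are involved is one simple guard against looking only where our mental model points.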

The second challenge I see in exploratory data analysis is identifying, and keeping in mind at all times, the sources of uncertainty in our data. The sources of uncertainty are several: from what we don’t know about how the variables were chosen and the data were collected, cleaned, stored and checked, to whether we are, consciously or not, asking questions, not about the dataset itself, but about the underlying generating process, that is, about a population of which we can consider the dataset to be a sample.

This last point I find is often overlooked. In some cases, we know that we are looking at a sample and asking questions about a population. For example, survey data are often clearly extracted from a broader population in which we are interested. This is the classic use of inferential statistics that we all learn about in college – although, even in this case, we often see analyses focusing on point estimates rather than the more appropriate confidence intervals. But there are cases where we lose track of the sources of uncertainty in our data (or in our analysis) and must maintain discipline to correctly assess what the analysis is actually telling us.

For example, say we have data for five characteristics (five variables) for every inhabitant of a community. We are only interested in that community, so we understand we have “population” data (not a sample). In looking at correlations among our five variables, we decide to examine linear relationships through a linear regression. Our statistical software spits out a summary of results that includes coefficients and p-values for those coefficients. But p-values assume a distribution for the estimated coefficients. If there is a distribution, there is a source of uncertainty (a random variable). Where did that uncertainty come from? Aren’t we looking at population data and, therefore, isn’t what we see all there is to know?

My answer is that the uncertainty stems from assuming a linear relationship among variables when what we observe does not perfectly fit that linear relationship. There is, therefore, an “error” term associated with each observation relative to the fitted linear relationship. The whole linear regression exercise is asking questions about an assumed underlying generating process, not about the observed data themselves. We started making assumptions about the data and asking questions about an underlying process, very possibly without noticing.
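A small numerical sketch (with made-up data and an invented community of 200 inhabitants) makes the point: the standard error of a regression slope, from which any p-value derives, is defined only relative to an assumed generating process, even when every member of the “population” is in the dataset.

```python
import numpy as np

# Made-up "population" data: one record per inhabitant (n = 200).
rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)          # one observed characteristic
y = 2.0 * x + rng.normal(0, 5, 200)  # another, imperfectly related to x

# Ordinary least squares fit of y = b0 + b1 * x.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The standard error (and hence any p-value) exists only because we
# posit a process y = b0 + b1 * x + error: the uncertainty lives in
# that assumed error term, not in the fully observed data.
resid = y - X @ beta
sigma2 = resid @ resid / (len(x) - 2)  # residual variance estimate
se_slope = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
print(beta[1], se_slope)
```

Without the assumed error term, there is no variance, no standard error, and nothing for a p-value to refer to; the software reports them only because the regression model has quietly gone “beyond the data.”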

So here are my tentative initial guidelines for doing exploratory data analysis:

  1. Start by understanding the data: the publishing source; when, where, and by whom the data were collected; what universe they are supposed to represent and whether they were intended as a sample of a larger population; definitions (are the variables well defined?); and what errors may have been introduced during transmission, cleaning, storing or other manipulation.
  2. Go on to univariate analyses and then cover correlations, being mindful of any assumptions we are making; if we feel we absolutely need to make them, be explicit about them and keep them in mind when drawing conclusions.
  3. Keep in mind at all times whether our questions are focusing on the data at hand or on an underlying generating process, i.e., whether we are “going beyond the data.” Again, be explicit if doing so.
  4. Be aware that exploratory analysis is supposed to focus on extracting inspiration from the data. It is not sufficient for drawing conclusions. Those require a separate step: testing hypotheses with a second set of data that can be assumed to be extracted through the same generating process (from the same population). We do not test hypotheses during exploratory data analysis, nor discuss causality and modelling, other than possibly as suggestions for the next step of hypothesis testing.
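The last guideline can be operationalized by setting aside a confirmation subset before any exploration begins. A minimal sketch, where the dataset, its size, and the split are hypothetical choices of mine:

```python
import numpy as np

# Hypothetical dataset of 1,000 rows and 5 variables.
rng = np.random.default_rng(7)
data = rng.normal(size=(1000, 5))

# Shuffle once, then split: explore freely on one half, and keep the
# other half untouched until we have a fixed hypothesis to test.
idx = rng.permutation(len(data))
explore = data[idx[:500]]   # for generating hypotheses
confirm = data[idx[500:]]   # reserved for hypothesis testing
```

Because the two halves come from the same dataset, the confirmation half can reasonably be treated as drawn from the same generating process, which is exactly what testing a hypothesis formed during exploration requires.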

Repetition and Belief

Image by Pham Thoai. Downloaded from pexels.com

“You can fool all the people some of the time and some of the people all the time, but you cannot fool all the people all the time.” Respond quickly: who is the author of this statement? The most common attribution is Abraham Lincoln although, as often happens with quotes, there is little evidence to support it. There seems to be some evidence that the statement was actually made by the seventeenth-century French Protestant Jacques Abbadie (Quote Investigator, 2013; Parker, 2016). However, we have heard the attribution to Abraham Lincoln so often that we assume it to be true.

Over the end-of-year holidays I read (most of) Daniel Kahneman’s book, “Thinking, Fast and Slow.” Kahneman, a psychologist, won the Nobel Prize for Economics in 2002 (shared with Vernon Smith) for the contributions to economics of his research on human judgement and decision-making under uncertainty. One aspect of human cognition he describes in his book is that we are more likely to believe what we find easy to process. Various factors can contribute to this “cognitive ease”: a clear display of the information we are exposed to, having been “primed” by association with a prior piece of information, being in a good mood, and repeated exposure to the information, whether that information is true or not (Kahneman, 2011, Chapter 5).

Repetition is also often discussed as an important component in learning (see, for example, the literature on spaced repetition), change management, and other aspects of our daily life that depend on our understanding and perception of reality. I once read (and never forgot) a passage in Machiavelli’s “The Prince” where he states that injuries should be inflicted all at once, while benefits should be provided piecemeal, over time, if a ruler is to ensure permanence in power. I always understood this as reflecting how repetition affects perception (Machiavelli, 1998, chapter VIII).

Recent political discussions in the U.S. have referred to the idea of the “Big Lie”: the idea that less plausible lies are often easier to sell to the public than small ones… if sufficiently repeated (RationalWiki contributors, 2021). A recent paper by Fazio et al. (2015) argues, based on a couple of experiments, that repetition of false information affects belief even when those exposed know better, an effect they call “knowledge neglect,” which reflects a primacy of processing fluency (cognitive ease) over retrieval of knowledge under certain conditions, including repetition.

What do I take from the above? The more confident I am in newly acquired knowledge, the more I will repeat it and remind myself of it, to better internalize it and harness the power of repetition. The less confident I am about new knowledge, the more suspicious I will be when I see it repeated. I guess that is a New Year’s resolution and, yes, those work: I heard it many times.

References

Fazio, L. K.; N. M. Brashier; B. K. Payne and E. J. Marsh, 2015. “Knowledge Does Not Protect Against Illusory Truth.” In Journal of Experimental Psychology: General 144(5): 993-1002. Available: https://www.apa.org/pubs/journals/features/xge-0000098.pdf. Accessed: January 18, 2021.

Kahneman, Daniel, 2011. Thinking, Fast and Slow. Farrar, Straus and Giroux.

Machiavelli, Nicolo, 1998. The Prince. The Project Gutenberg eBook. Translated by W.K.Marriott. Available: https://www.gutenberg.org/files/1232/1232-h/1232-h.htm#chap08. Accessed: January 09, 2021.

Parker, David B., 2016. “You Can Fool All the People”: Did Lincoln Say It?. History News Network. Available: https://historynewsnetwork.org/article/161924. Accessed: January 09, 2021

Quote Investigator, 2013. You cannot fool all the people all the time. Available: https://quoteinvestigator.com/2013/12/11/cannot-fool/#return-note-7793-2. Accessed: January 09, 2021.

RationalWiki contributors, 2021. “Big Lie,” In RationalWiki. Available: https://rationalwiki.org/wiki/Big_lie. Accessed: January 09, 2021.


Defining Data

Image by Alex Uriarte

A few weeks ago I watched a few of Crash Course’s Data Literacy e-learning videos on YouTube (Arizona State University and Crash Course, 2020). Its first episode defines “data” as “specific information we collect to make decisions.” This definition differs from others I have heard, and it has some interesting aspects. Under this definition:

  • Data would be a subset of information. That is, all data would be information but not all information data.
  • It uses collection and decision-making to define which information is data and which is not.

Other definitions are very different.

A common distinction between data and information is the one found in the so-called DIKW pyramid and similar representations. DIKW stands for Data – Information – Knowledge – Wisdom, and usually suggests a hierarchy in which data is the broadest concept, filtered into information, which is in turn filtered into knowledge and finally into wisdom. This hierarchy seems to be commonly used in the knowledge management community and is often attributed to a 1989 article by Russell Lincoln Ackoff in the Journal of Applied Systems Analysis (e.g., see Bernstein 2009).

Under this representation, data are often interpreted as facts, noise or signals. There are many criticisms of this representation, from whether “filtering” is actually a good way of thinking about the connections between these concepts, to proposed changes to the pyramid, to which is actually the broader concept, data or information (for just a few examples of a relatively large literature, see Weinberger 2010; Tuomi 1999; and Dammann 2018).

Yet a third way of thinking about data is the definition contained in US law. US Federal statutes define data as “recorded information, regardless of form or the media on which the data is recorded” (44 U.S. Code § 3502). The definition is less innocuous than it may seem at first. Recording information is in good part what distinguishes our handling of information from cultures that rely (or relied) on word-of-mouth transmission, with the potential loss of content associated with such practices: think of the telephone game kids play, in which a sentence is whispered from one child’s ear to the next until the last child says it out loud, often to find it completely altered along the way. Under this definition, however, information is the broader concept.

The table below summarizes the three different definitions of “data.”

                             | ASU and Crash Course 2020                         | U.S. Federal Statutes                                                               | DIKW pyramid
Definition or understanding  | Specific information we collect to make decisions | Recorded information, regardless of form or the media on which the data is recorded | Facts, noise, signals
Highlight                    | Data has a purpose: decision-making              | Data must be recorded                                                               | Data as facts, no specific purpose or characteristic
Data relative to information | Information > Data                               | Information > Data                                                                  | Data > Information

I do not find the last row – showing the relation between data and information – particularly useful in understanding data: it is a result of how we define not just data but also information, and may be more useful for discussions focused on knowledge. I include it in the table only for the sake of comparison and may explore it in other posts. I do find the “highlight” of each definition useful in thinking about data, how to manage and use them: 

  • Data should reflect facts. Whether they do or not depends on how they were collected and managed. This is important to keep in mind in discussions about data collection, data curation and trusted repositories.
  • Data should be recorded. This reinforces the importance of data curation, and particularly of metadata, in enabling us to understand what “facts” the data actually capture.
  • Data may be used for decision-making. Hence, it is important to keep in mind the many considerations around data bias, completeness, presentation and interpretation.

In this blog, I will use the highlight of each of the three definitions to discuss data.

References

44 U.S. Code § 3502. Legal Information Institute. Cornell Law School. Available: https://www.law.cornell.edu/uscode/text/44/3502#:~:text=(A)%20means%20the%20obtaining%2C,or%20format%2C%20calling%20for%20either%E2%80%94. Accessed: November 14, 2020

Arizona State University and Crash Course, 2020. Study Hall: Data Literacy. Available: https://www.youtube.com/watch?v=0H8awA3GBPg&list=PLNrrxHpJhC8m_ifiOWl1hquDmdgvcviOt&index=14. Accessed: November 27, 2020

Bernstein, J. H., 2009. The Data-Information-Knowledge-Wisdom Hierarchy and its Antithesis. In: Proceedings from North American Symposium on Knowledge Organization. Vol. 2.  Available: https://journals.lib.washington.edu/index.php/nasko/article/viewFile/12806/11288. Accessed: November 27, 2020.

Dammann, Olaf. 2018. Data, Information, Evidence, and Knowledge: A Proposal for Health Informatics and Data Science. In: Online Journal of Public Health Informatics 10(3): e224. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6435353/pdf/ojphi-10-e224.pdf. Accessed: November 27, 2020.

Tuomi, Ilkka. 1999. Data is more than knowledge: Implications of the reversed knowledge hierarchy for knowledge management and organizational memory. In: Journal of Management Information Systems 16(3): 103-117. Available: https://www.researchgate.net/publication/328803142_Data_is_more_than_knowledge_Implications_of_the_reversed_knowledge_hierarchy_for_knowledge_management_and_organizational_memory. Accessed: November 27, 2020.

Weinberger, David, 2010. The Problem with the Data-Information-Knowledge-Wisdom Hierarchy. In: Harvard Business Review. Available: https://hbr.org/2010/02/data-is-to-info-as-info-is-not. Accessed: November 27, 2020.


On Using and Citing Wikipedia

Image by Unattributed. Downloaded from pexels.com

I am a big fan of Wikipedia. Although I expect never to use it as the sole source of information in any of my posts (ahem: with the exception of this one), I do often use it as a starting point when researching a subject, and I often click on some of the references cited in Wikipedia entries to follow up on whatever I am researching. You will see me referencing Wikipedia articles often.

Wikipedia warns that “nothing found here has necessarily been reviewed by people with the expertise required to provide you with complete, accurate or reliable information” (Wikipedia contributors, 2020a). In addition, they warn that, as a community-built reference, it may contain errors. Yet, I often cite Wikipedia for two reasons:

  • First, I typically use it as a starting point and in combination with other sources in my posts. Its encyclopedic nature – comprehensive, even if not necessarily in-depth – makes it exactly the great starting point it is intended to be;
  • Second, because it is a community-built source that has come to occupy such a central role as a reference on the internet, it is subject to frequent spot checks and reviews, even if not necessarily by recognized specialists in any given area, and not consistently across entries. I use my judgment about which entries I am more likely to trust. I am more likely to trust one on, say, butterflies than one on a small country’s minor dictator from a hundred years ago, where there may be fewer people capable of or interested in verifying it, as well as fewer reliable sources. A subject on which there is less information and less general interest will not receive the same amount and frequency of authoritative checks.

Some minimum standards do exist for Wikipedia entries. According to its “About” page, material “must fit within Wikipedia’s policies, including being verifiable against a published reliable source. Editors’ opinions and beliefs and unreviewed research will not remain.” Furthermore, “many experienced editors are watching to ensure that edits are improvements. […] its contributors work on improving quality, removing or repairing misinformation and other errors” (Wikipedia contributors, 2020b).

As food for thought: compare what you find on Wikipedia to what you would typically find circulating on social media about any specific subject. I know, that is a very low bar but, nonetheless, it illustrates that it is possible to build value collaboratively under certain rules and within defined processes (I am now thinking of open source software, but will leave that for another post). 

For those ready to never read one of my posts again, check the article on “Criticism of Wikipedia” on ummm… shhh, Wikipedia (Wikipedia contributors, 2020c).

References

Wikipedia contributors, 2020a. “Wikipedia: General Disclaimer,” In Wikipedia, The Free Encyclopedia. Available: https://en.wikipedia.org/wiki/Wikipedia:General_disclaimer. Accessed: October 25, 2020.

Wikipedia contributors, 2020b. “Wikipedia: About,” In Wikipedia, The Free Encyclopedia. Available: https://en.wikipedia.org/wiki/Wikipedia:About. Accessed: October 25, 2020.

Wikipedia contributors, 2020c. “Criticism of Wikipedia,” In Wikipedia, The Free Encyclopedia. Available: https://en.wikipedia.org/w/index.php?title=Criticism_of_Wikipedia&oldid=985922235. Accessed: November 1, 2020.

