This post was originally published on LinkedIn.
Have you heard about the ongoing attempts to create AI chatbots that mimic professionals – lawyers, doctors, and especially therapists? There are already apps, such as Yuna, attempting to deliver therapy with generative AI. If you've talked to as many therapists as I have about this, they'll tell you it's impossible for AI to replace their job, and I'd have to agree – for now. That said, the current technology has untapped potential to aid therapists if used correctly, and future models, like the one I am trying to develop in this series, may be able to take on some of the less complex caseloads.
In this series of articles, you will join me as I replicate these efforts by creating my own open-source therapy LLM bot, and then surpass what is currently available with a more advanced model architecture. Therapy, specifically Cognitive Behavioral Therapy (CBT), makes sense as the starting place. Unlike some therapies that have a physical component, CBT relies heavily on conversation: the therapist guides the patient toward self-reflection and toward reframing unhealthy and/or untrue narratives the patient holds about their own life. That makes therapy the most obvious application of LLMs to attempt first.
The Simplified Origin Of LLMs
Before we pass judgment on the abilities of LLMs, I think it would be helpful to get a general idea of how they emerged so that we can better utilize their strengths and avoid their weaknesses later.
To put the history of how we got here overly simply, LLMs came to us from the field of Natural Language Processing (NLP) within Artificial Intelligence. As the name implies, this is the subfield of AI concerned with computers understanding our language and using it back with us.
Human language is extraordinarily complex – ask any linguist. Words carry intention and meaning far beyond their written definitions. It is not uncommon for learners of a non-native language to accidentally commit a faux pas because something they just said could be taken differently than intended. Communication is such a complex task that socializing with one another may have been a significant driver of our high level of intelligence as a species – an idea known as the social brain hypothesis.
To tackle this complexity, humans have developed two main ways of teaching each other language. One is to sit down, break language into its constituent parts, and analyze their use. The other is much simpler and is how humans naturally learn language: those who already speak it use it around and with the learner until the learner sees the patterns and makes the connections between concepts. Thousands of years of practice have converged on this method, and its wisdom is simple: you cannot teach every conceivable combination of words. What is transferred instead is a general sense of how language should be used in response to anything.
Instead of teaching a computer the components of human language, what really needed to happen was to teach it how to learn language independently. From this task, a model can be created of language itself: many words and topics are formed into a web that is then refined over many cycles. Training the model, as it is called, adjusts the connections in that web based on new samples of language. Topics are identified, and the relationships between every word and every other word are tweaked. It is like the predictive text on your phone, but fed as much of the entirety of human written work as the trainer can provide. Once complete, you have a very large model of human language – a large language model.

[Figure: a web map of the Open Orca large language model, produced by Nomic AI's Atlas. Each dot represents a topic, related topics share the same color, and groups cluster around subjects like "dogs playing" or "twitter".]
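To make the "web of connections" idea concrete, here is a toy sketch in Python. It trains a bigram model – just counts of which word follows which – and predicts the most likely next word. This is my own illustrative simplification, nothing like a real LLM's implementation, but the training loop is the same in spirit: show the model text, strengthen the connections it sees, repeat.

```python
# A toy version of the "web of connections": a bigram model that counts
# which word follows which, then predicts the most reinforced next word.
from collections import Counter, defaultdict

def train(samples: list[str]) -> dict[str, Counter]:
    web = defaultdict(Counter)          # word -> counts of following words
    for text in samples:
        words = text.lower().split()
        for a, b in zip(words, words[1:]):
            web[a][b] += 1              # strengthen the a -> b connection
    return web

def predict_next(web: dict[str, Counter], word: str) -> str:
    options = web.get(word.lower())
    return options.most_common(1)[0][0] if options else "?"

model = train(["I feel anxious today", "I feel better now", "I feel anxious again"])
print(predict_next(model, "feel"))      # -> "anxious" (the most reinforced link)
```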
Simply put, LLMs are excellent at talking like us because they have studied as much as we could write for them to read. At the end of the day, however, it is mimicry. The model does not generate entirely new ideas; instead, it combines existing ideas in ways that it predicts should go together. This insight is useful in the pursuit of a therapy chatbot – if a solution to a problem has genuinely never been thought of, the chatbot will most likely not come up with it, since an output that has never appeared in its training data is improbable.
LLMs In The Context Of Today
Let's begin with an analogy: the LLM as a well-read librarian – a librarian so well read on the library's collection that they can greet any visitor and talk to them about their interests. This exceptional librarian knows the finest details of the collection on all sorts of subjects, such as law or psychology. However, much of the hype around AI today places the librarian in the role of the lawyer or psychologist merely because the librarian can speak the language of those professions. The librarian is still a librarian, not a specialist.
The idea of a virtual assistant is not new; products like Apple's Siri were released to the public 13 years ago, in 2011. This reflects the strong demand for human-computer interaction that is as intuitive as talking to another person. To this day, my elderly family members still navigate their phones with Siri because it is the only way they know to gather and interact with the data on them.
LLMs offer not only another means of interacting with computers but may be the only accessible means for some. The true innovation of LLMs is that they let humans communicate with data far more directly and intuitively than ever before. The recent arrival of LLMs of sufficient quality to hold these natural conversations is only the first step; the next steps involve building a whole ecosystem around this new ability.
Back to our librarian: they are a generalist mouthpiece for the enormous data they are trained on. But what if we restricted the generalist LLM to be a specialist? To continue the metaphor, what if the librarian went back to school to become a therapist? Although a nice thought, the reason more specialized models have not gotten far is that the model still needs to be able to communicate concepts in broad human language.
When we have conversations with our doctors and lawyers, we often forget that the person we are talking to is not just a professional but also a person. Despite their specialized knowledge, they have the interpersonal skills to talk with a client or patient on that person's level; they are performing two roles at once. If you were to talk with a narrowly specialized counselor LLM, the conversation would be impersonal, like talking to a textbook, and the information dispensed might not be tailored or empathetic enough for an individual user's needs. Even then, it would still be vulnerable to being trained on too little data and hallucinating – making up – facts. No, what we need instead is for the LLM to communicate well but rely on a larger system for the specialization.
The Current Limitations of LLMs and the Path Forward
Here is what we have been building toward: multi-model systems. We have seen that one super-large generalist LLM will not be good enough for our therapist bot. As mentioned before, a more specialized model would still lack the context to be as informed and caring as a counselor. Further, it could be quite dangerous for a model cast in the role of an all-knowing expert to start hallucinating advice when it does not know the answer. Instead, the system should plainly acknowledge its own limitations to the user.
I believe an architecture of multiple models, or a Mixture of Experts (MoE), will be needed to create professional AI that even starts to be comparable to humans. A primary LLM can direct and carry the conversation, while many other models work in the background, advising on what should happen next.
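As a rough sketch of what I mean – with every model call stubbed out, since we have not chosen any concrete models yet – the orchestration might look something like this:

```python
# A minimal sketch of the multi-model idea: one generalist LLM talks to the
# user while specialist models advise from the background. Every model call
# here is a hypothetical stub standing in for a real LLM or classifier.
from dataclasses import dataclass

@dataclass
class Advice:
    expert: str  # which background model produced this note
    note: str    # guidance passed along to the conversational model

def risk_screen(user_msg: str) -> Advice:
    # Stub safety classifier: flags messages that should be escalated.
    flagged = any(w in user_msg.lower() for w in ("hopeless", "hurt myself"))
    note = "Escalate to a human counselor." if flagged else "No risk detected."
    return Advice("risk_screener", note)

def cbt_advisor(user_msg: str) -> Advice:
    # Stub CBT specialist: suggests a technique, not the exact wording.
    return Advice("cbt_advisor",
                  "Ask the user to weigh the evidence for and against "
                  "their negative thought.")

def generalist_llm(prompt: str) -> str:
    # Stub for the conversational "mouthpiece" LLM.
    return "(reply written by the generalist, shaped by the experts' notes)"

def respond(user_msg: str) -> str:
    # The experts annotate the turn; only the generalist speaks to the user.
    advice = [risk_screen(user_msg), cbt_advisor(user_msg)]
    context = "\n".join(f"[{a.expert}] {a.note}" for a in advice)
    return generalist_llm(f"{context}\nUser: {user_msg}\nTherapist:")

print(respond("I feel hopeless about work lately."))
```

The key design choice here is the direction of information flow: the specialists never speak to the user directly; they only annotate the conversation for the generalist mouthpiece.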
An MoE architecture can take many different forms for a single application, and we will explore what that looks like in the next article. Until then, here are some questions the architecture needs to answer: Should one LLM maintain the entire conversation, or should the patient move from one specialized LLM to another depending on the type of conversation they need? How should those models be divided up – by diagnosis? By therapy type? Should the model even attempt to diagnose? If so, should a classifying model be trained for each condition?
What’s Next?
In the next article, I will survey what is currently on the market and attempt to reverse-engineer the prompts behind it, to see if we can achieve comparable results with only one LLM. We will then take the lessons from that well-spoken generalist and use them to inform the mouthpiece LLM in our MoE design.
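To give a flavor of what that single-LLM attempt might involve, here is the kind of system prompt we will iterate on – my own guess at a starting point, not any existing product's actual prompt:

```python
# Purely illustrative: a first guess at a single-prompt CBT persona.
# This is not reverse-engineered from any app; it is the baseline we
# will refine by comparing against products already on the market.
SYSTEM_PROMPT = """You are a supportive conversational assistant that uses
techniques from Cognitive Behavioral Therapy (CBT). Listen reflectively,
ask open-ended questions, and help the user notice and gently reframe
unhelpful thoughts. You are not a licensed therapist: say so when asked,
and encourage professional help for anything serious."""
```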