Long-term memory for Generative AI

Large language models (LLMs) such as ChatGPT embed the huge amount of information they were fed during training and can draw on it when responding. A user can access that embedded knowledge by giving the LLM instructions over the course of a conversation with it. At present, however, an LLM has a limited capacity to remember the details of a conversation; that capacity is determined by the size of its context window.

The context window is the part of the input text that the LLM processes when producing the next word of its response to an instruction. Although it varies across LLMs, the context window is typically a few thousand words. Once the conversation exceeds the size of the context window, the LLM can no longer make use of everything the user has input over the course of the conversation; it ‘forgets’ things from earlier parts of the conversation. The context window can be made larger, but that increases the amount of processing needed to produce a response, and beyond a certain point the cost becomes impractical. Researchers at UC Berkeley are exploring one approach to getting around this limitation and have explained it in their paper MemGPT: Towards LLMs as Operating Systems.
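As a rough illustration of this limitation, the sketch below builds a prompt from only the most recent messages that fit inside a fixed budget. The budget, the word-based counting and the function name are assumptions made for this example, not how any particular LLM actually tokenizes its input.

```python
# Minimal sketch of why a fixed context window causes "forgetting".
# The window size and word-based counting are illustrative assumptions,
# not the tokenization any particular LLM actually uses.

CONTEXT_WINDOW_WORDS = 4000  # assumed budget; real models count tokens, not words


def build_prompt(conversation: list[str]) -> list[str]:
    """Keep only the most recent messages that fit within the window.

    Anything older than the cut-off never reaches the model, so from
    the user's point of view the LLM has 'forgotten' it.
    """
    kept: list[str] = []
    used = 0
    for message in reversed(conversation):   # walk from newest to oldest
        words = len(message.split())
        if used + words > CONTEXT_WINDOW_WORDS:
            break                            # everything older is dropped
        kept.append(message)
        used += words
    return list(reversed(kept))              # restore chronological order
```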

In MemGPT, the researchers have given the LLM a memory system similar in principle to that of a personal computer (PC). They call the context window the LLM’s main context and view it as the LLM’s short-term memory, analogous to a PC’s random-access memory (RAM). In addition, MemGPT has been given an external context analogous to a PC’s hard disk or solid-state drive. The external context comprises:
recall storage, which stores the entire history of events processed by the LLM processor, and
archival storage, which is a general read-write datastore that can act as overflow for the main context.

During a conversation, archival storage allows MemGPT to store facts, experiences, preferences, etc. about the user, while recall storage allows MemGPT to find past interactions related to a particular query or within a specific time period. For document analysis, archival storage can be used to search over (and add to) an expansive document database.
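The sketch below gives one way to picture the external context. The class and method names (ExternalContext, search_recall, archival_insert and so on) are invented for illustration and are not the paper's actual interface.

```python
# Rough sketch of MemGPT's external context; class and method names are
# assumptions made for this example, not the paper's actual interface.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Event:
    timestamp: datetime
    role: str          # e.g. "user" or "assistant"
    text: str


@dataclass
class ExternalContext:
    # Recall storage: the entire history of events processed by the LLM processor.
    recall: list[Event] = field(default_factory=list)
    # Archival storage: a general read-write datastore (user facts, documents,
    # overflow from the main context).
    archival: list[str] = field(default_factory=list)

    def log_event(self, event: Event) -> None:
        self.recall.append(event)

    def search_recall(self, query: str, since: Optional[datetime] = None) -> list[Event]:
        """Find past interactions matching a query, optionally within a time window."""
        return [e for e in self.recall
                if query.lower() in e.text.lower()
                and (since is None or e.timestamp >= since)]

    def archival_insert(self, entry: str) -> None:
        self.archival.append(entry)

    def archival_search(self, query: str) -> list[str]:
        return [doc for doc in self.archival if query.lower() in doc.lower()]
```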

To achieve the above, MemGPT’s main context is divided into three components:
system instructions, which set out the control logic for MemGPT and how its functions manage the interaction with the external context;
conversational context, which holds a first-in-first-out (FIFO) queue of recent event history (e.g., messages between the LLM and the user); and
working context, which serves as a working-memory scratchpad.

System instructions are read-only and pinned to main context (they do not change during the lifetime of the MemGPT agent).
Conversational context is read-only with a special eviction policy (if the queue reaches a certain size, a portion of the front is truncated or compressed via recursive summarization).
Working context is writeable by the LLM processor via function calls.

Combined, the three parts of main context cannot exceed the underlying LLM processor’s maximum context size.
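To tie these pieces together, the sketch below models a main context with a pinned system prompt, a FIFO conversational queue whose oldest entries are compressed by summarization once a threshold is crossed, a working-context scratchpad writeable via function calls, and a combined token budget. The names (MainContext, summarize, count_tokens), the thresholds and the word-count tokenizer are assumptions for this example, not MemGPT's implementation.

```python
# Illustrative sketch of the main-context layout and its eviction policy.
# Names, thresholds and the word-count "tokenizer" are assumptions for this
# example, not MemGPT's actual code.

MAX_CONTEXT_TOKENS = 8192     # assumed limit of the underlying LLM processor
EVICTION_THRESHOLD = 0.7      # assumed fraction of the budget that triggers eviction


def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer


def summarize(messages: list[str]) -> str:
    # Stand-in for recursive summarization, which MemGPT delegates to the LLM.
    return f"[summary of {len(messages)} earlier messages]"


class MainContext:
    def __init__(self, system_instructions: str):
        self._system = system_instructions    # read-only, pinned for the agent's lifetime
        self.queue: list[str] = []             # conversational context: FIFO event history
        self.working: dict[str, str] = {}      # working context: scratchpad

    # Working context is writeable by the LLM processor via function calls.
    def working_context_write(self, key: str, value: str) -> None:
        self.working[key] = value

    def append_message(self, message: str) -> None:
        self.queue.append(message)
        self._maybe_evict()

    def _maybe_evict(self) -> None:
        # When usage crosses the threshold, compress the oldest half of the
        # queue into a single summary message.
        if self._tokens_used() > EVICTION_THRESHOLD * MAX_CONTEXT_TOKENS and len(self.queue) > 1:
            cut = len(self.queue) // 2
            head, tail = self.queue[:cut], self.queue[cut:]
            self.queue = [summarize(head)] + tail

    def _tokens_used(self) -> int:
        # The three parts together must stay within the processor's context size.
        parts = [self._system, *self.queue, *self.working.values()]
        return sum(count_tokens(p) for p in parts)
```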