Large Language Models (LLMs) have revolutionized natural language processing, greatly improving the state of the art on a wide range of comprehension and generation tasks. Most LLMs benefit from self-supervised training on huge corpora, gathering information from a fixed-size local context and showing emergent abilities such as zero-shot prompting, in-context learning, and Chain-of-Thought (CoT) reasoning. However, the input length limit of current LLMs prevents them from generalizing to real-world applications, such as long-horizon planning, where the ability to handle long-form material beyond a fixed-size session is crucial.
The simplest remedy for the length limit is to increase the input context length. GPT-3, for example, raises the input length from GPT-2's 1k tokens to 2k tokens to capture longer-range dependencies. In-context dense attention is nevertheless severely constrained by the quadratic computational complexity of Transformer self-attention, and this approach typically requires computationally expensive training from scratch. Another emerging line of research, which still largely requires training from scratch, focuses on in-context sparse attention to sidestep the quadratic cost of self-attention.
The Memorizing Transformer (MemTRM) is a well-known study in this direction: it approximates in-context dense attention by attending to both in-context tokens and memorized tokens retrieved from a non-differentiable memory. MemTRM scales the resulting language model to handle up to 65k tokens and yields significant perplexity gains when modeling large books or documents. However, MemTRM's coupled memory design, which uses a single model both to encode and to fuse memory for language modeling, suffers from memory staleness during training: as model parameters change, older representations cached in memory drift away from the distribution of the latest model, reducing the usefulness of memory augmentation.
In this paper, researchers from UC Santa Barbara and Microsoft Research propose the LONGMEM framework, which allows language models to cache long-form previous context or knowledge in a non-differentiable memory bank and exploit it via a decoupled memory module, addressing the memory-staleness problem. They design a novel residual side-network (SideNet) to achieve decoupled memory. A frozen backbone LLM extracts the attention keys and paired values from the previous context into the memory bank. In SideNet's memory-augmented layer, the attention queries of the current input retrieve the cached keys and values of previous contexts, and the corresponding memory augmentations are then fused into hidden-state learning via a joint-attention process.
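To make the joint-attention idea concrete, here is a minimal, hedged sketch (not the authors' released code) in which the current queries attend jointly over the local keys/values and the keys/values retrieved from the memory bank; the tensor names and shapes are illustrative assumptions.

```python
# Minimal single-head sketch of joint attention over local and retrieved
# memory keys/values. Shapes and names are illustrative, not LONGMEM's code.
import torch
import torch.nn.functional as F

def joint_attention(q, k_local, v_local, k_mem, v_mem):
    # q, k_local, v_local: (T, d) for the current T tokens
    # k_mem, v_mem: (M, d) retrieved from the cached memory bank
    d = q.size(-1)
    k = torch.cat([k_mem, k_local], dim=0)   # (M + T, d)
    v = torch.cat([v_mem, v_local], dim=0)   # (M + T, d)
    scores = (q @ k.t()) / d ** 0.5          # (T, M + T) scaled dot products
    attn = F.softmax(scores, dim=-1)
    return attn @ v                          # (T, d) memory-fused outputs
```

A full implementation would also apply a causal mask over the local tokens and use multiple attention heads; the sketch only shows how retrieved memory and the local context are fused in a single attention operation.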
Newly introduced residual cross-network connections between the SideNet and the frozen backbone LLM enable better knowledge transfer from the pre-trained backbone. By efficiently training only the residual SideNet to retrieve and fuse the augmented long context from memory, the pre-trained LLM can be adapted to exploit long contextual memory. Their decoupled memory design has two main advantages. First, separating the frozen backbone LLM from the SideNet isolates memory retrieval and fusion from the encoding of previous inputs into memory.
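A rough sketch of this training setup is shown below, assuming two hypothetical `nn.Module` objects, `backbone` and `sidenet`; it simply freezes the backbone, exposes only the SideNet parameters to the optimizer, and adds the frozen backbone's hidden state to the SideNet's output as a residual cross-network connection.

```python
# Hedged sketch: freeze the backbone LLM, train only the SideNet, and fuse
# hidden states with a residual cross-network connection. `backbone` and
# `sidenet` are hypothetical modules, not the released LONGMEM code.
import torch

def trainable_sidenet_params(backbone: torch.nn.Module, sidenet: torch.nn.Module):
    for p in backbone.parameters():
        p.requires_grad = False              # backbone stays frozen
    return list(sidenet.parameters())        # only SideNet gets gradients

def residual_cross_network(sidenet_hidden, backbone_hidden):
    # add the frozen backbone's layer output to the SideNet's layer output
    return sidenet_hidden + backbone_hidden

# optimizer = torch.optim.AdamW(trainable_sidenet_params(backbone, sidenet), lr=1e-4)
```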
This neatly resolves the memory-staleness problem, because the backbone LLM only acts as the encoder of long-context knowledge, while the residual SideNet acts as the memory retriever and reader. Second, directly adapting the LLM with memory augmentations is computationally inefficient and suffers from catastrophic forgetting. Since the backbone LLM is frozen during the memory-augmented adaptation stage, LONGMEM can tap into long-form memorized knowledge without forgetting what it learned during pre-training. Depending on the downstream task, LONGMEM can load different types of long-form text and knowledge into the memory bank.
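As a rough illustration of such a memory bank, consider the sketch below: a simplified FIFO cache with top-k retrieval by key similarity, which is an assumption for clarity rather than the paper's exact chunking and retrieval scheme.

```python
# Simplified memory bank: cache key/value tensors produced by the frozen
# backbone and retrieve the top-k most relevant entries for the current query.
import torch

class MemoryBank:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.keys, self.values = [], []

    def add(self, k: torch.Tensor, v: torch.Tensor):
        # k, v: (chunk_len, d) from a previous context chunk
        self.keys.append(k)
        self.values.append(v)
        while sum(t.size(0) for t in self.keys) > self.max_tokens:
            self.keys.pop(0)                 # evict the oldest chunk (FIFO)
            self.values.pop(0)

    def retrieve(self, q: torch.Tensor, top_k: int):
        # q: (T, d) current queries; returns the top_k cached keys/values
        K = torch.cat(self.keys, dim=0)      # (N, d)
        V = torch.cat(self.values, dim=0)    # (N, d)
        sims = (q @ K.t()).mean(dim=0)       # (N,) mean similarity to queries
        idx = sims.topk(min(top_k, K.size(0))).indices
        return K[idx], V[idx]
```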
They focus on two illustrative cases: language modeling with full-length book contexts, and memory-augmented in-context learning with thousands of task-relevant demonstration examples. They evaluate how well the proposed LONGMEM performs on long-text language modeling and on memory-augmented in-context learning for language understanding, and the experimental results show that their model consistently surpasses strong baselines on both. Their approach substantially improves long-context language modeling, reducing perplexity by 1.38 to 1.62 on various length splits of the Gutenberg-2022 corpus.
Remarkably, their model far exceeds current strong x-former baselines, achieving state-of-the-art identification accuracy of 40.5% on ChapterBreak, a challenging long-context modeling benchmark. Finally, compared with MemTRM and baselines without memory augmentation, LONGMEM shows strong in-context learning advantages on common NLU tasks.
Check out the Paper and GitHub link. Don't forget to join our 24k+ ML SubReddit, Discord channel, and Email newsletter, where we share the latest news on AI research, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, please do not hesitate to email us at Asif@marktechpost.com
Aneesh Tickoo is a Consulting Intern at MarktechPost. She is currently pursuing her BA in Data Science and Artificial Intelligence from Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects that harness the power of machine learning. Her research interest is image processing and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.