Discover LOMO (Low-Memory Optimization): a new AI optimizer that merges gradient computation and parameter update in one step to reduce memory usage

Large language models (LLMs) have transformed natural language processing, showing striking abilities such as emergence and grokking and driving model sizes ever larger. Training models with billions of parameters, from 30B to 175B, raises the bar for NLP research. Small labs and companies struggle to participate in this field because fine-tuning LLMs often requires expensive GPU resources, such as 8×80GB machines. Recently, parameter-efficient fine-tuning techniques such as LoRA and Prefix-tuning have made LLM tuning possible under resource constraints.

Although full parameter fine-tuning has been regarded as a more effective strategy than parameter-efficient fine-tuning, these techniques do not offer a practical way to perform it. The researchers therefore want to explore methods for full parameter fine-tuning under resource-limited conditions. They examine the four components of memory usage in LLM training, namely activations, optimizer states, gradient tensors, and parameters, and optimize the training process in three ways: 1) They reevaluate the algorithmic role of the optimizer and find that SGD is a suitable replacement for full parameter fine-tuning of LLMs. Since SGD keeps no intermediate states, the entire optimizer-state portion of memory can be eliminated. 2) Their proposed optimizer, LOMO, shown in Figure 1, reduces the memory usage of gradient tensors to O(1), equal to the memory usage of the largest gradient tensor. 3) They integrate gradient normalization and loss scaling and switch certain computations to full precision during training to stabilize mixed-precision training with LOMO. With their method, memory usage equals the sum of the parameters, the activations, and the largest gradient tensor.
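To make the second point concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a toy chain of bias-free linear layers with a squared-error loss, and the helper names (layer_backward, train_step_standard, train_step_fused) are hypothetical. The standard step keeps every gradient tensor alive until the update, while the fused step applies the plain SGD update as soon as a layer's gradient exists and then discards it.

```python
import copy
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(weights, x):
    """Run the linear chain; return each layer's input and the final output."""
    inputs = []
    for W in weights:
        inputs.append(x)
        x = matvec(W, x)
    return inputs, x


def layer_backward(W, x, grad_out):
    """Gradients of one linear layer w.r.t. its weight and its input."""
    grad_W = [[g * xi for xi in x] for g in grad_out]
    grad_x = [sum(W[r][c] * grad_out[r] for r in range(len(W)))
              for c in range(len(x))]
    return grad_W, grad_x


def sgd_update(W, grad_W, lr):
    """In-place plain SGD: W <- W - lr * grad_W (no optimizer state)."""
    for row, grad_row in zip(W, grad_W):
        for c, g in enumerate(grad_row):
            row[c] -= lr * g


def numel(M):
    """Number of scalar entries in a matrix stored as a list of rows."""
    return sum(len(row) for row in M)


def train_step_standard(weights, x, target, lr):
    """Full backward pass first, update afterwards: all gradients coexist."""
    inputs, out = forward(weights, x)
    grad = [o - t for o, t in zip(out, target)]  # d(0.5*||out-target||^2)/d(out)
    grads = [None] * len(weights)
    for i in reversed(range(len(weights))):
        grads[i], grad = layer_backward(weights[i], inputs[i], grad)
    for W, g in zip(weights, grads):
        sgd_update(W, g, lr)
    return sum(numel(g) for g in grads)  # gradient values alive at once


def train_step_fused(weights, x, target, lr):
    """Update each layer as soon as its gradient exists, then drop the gradient."""
    inputs, out = forward(weights, x)
    grad = [o - t for o, t in zip(out, target)]
    peak = 0
    for i in reversed(range(len(weights))):
        # Both gradients are computed from the old weights, before the update.
        grad_W, grad = layer_backward(weights[i], inputs[i], grad)
        peak = max(peak, numel(grad_W))
        sgd_update(weights[i], grad_W, lr)
        del grad_W  # only one weight gradient is ever alive
    return peak


rng = random.Random(0)
sizes = [4, 8, 3, 2]  # toy layer widths: weights of shape 8x4, 3x8 and 2x3
weights_a = [[[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
             for n_in, n_out in zip(sizes, sizes[1:])]
weights_b = copy.deepcopy(weights_a)
x, target = [1.0, -2.0, 0.5, 3.0], [0.0, 1.0]

peak_standard = train_step_standard(weights_a, x, target, lr=0.01)
peak_fused = train_step_fused(weights_b, x, target, lr=0.01)
print(peak_standard, peak_fused, weights_a == weights_b)
```

In this toy setup both steps produce the same weights, while the number of gradient values held at once drops from 62 (all three layers together) to 32 (the largest layer alone). A real implementation would hook into the framework's backward pass rather than hand-written backpropagation.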

They push the memory consumption of full parameter fine-tuning to its limit, reducing it to the level of inference, since the forward and backward passes together cannot require less memory than the forward pass alone. Notably, saving memory with LOMO does not compromise fine-tuning, because the parameter update process remains equivalent to that of SGD. By empirically evaluating LOMO’s memory and throughput, researchers at Fudan University demonstrate that it makes it possible to successfully train a 65B model with only 8 RTX 3090 GPUs. Furthermore, to validate downstream performance, they use LOMO to tune all parameters of LLMs on the SuperGLUE dataset collection. The empirical results show how effectively LOMO optimizes LLMs with many parameters.
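A rough back-of-the-envelope calculation shows why dropping optimizer states and stored gradients matters at this scale. The sketch below is illustrative only and rests on several assumptions that are not from the article: 2 bytes per parameter for fp16 weights and gradients, 12 bytes per parameter of AdamW state in mixed precision (fp32 master weights, momentum, and variance), a hypothetical largest gradient tensor of 0.3B parameters, and 24 GB per RTX 3090 treated as GiB. Activations and communication buffers are ignored, and the function name state_memory_gib is hypothetical.

```python
GIB = 2 ** 30  # bytes per GiB


def state_memory_gib(n_params, weight_bytes=2, grad_bytes=2,
                     optim_bytes=0, live_grad_params=None):
    """Memory for weights, gradients and optimizer state, activations excluded.

    live_grad_params is the number of gradient values held at once;
    by default every parameter's gradient is stored.
    """
    if live_grad_params is None:
        live_grad_params = n_params
    total_bytes = (n_params * weight_bytes
                   + live_grad_params * grad_bytes
                   + n_params * optim_bytes)
    return total_bytes / GIB


N_PARAMS = 65e9          # the 65B model mentioned in the article
LARGEST_TENSOR = 0.3e9   # hypothetical size of the largest gradient tensor

adamw = state_memory_gib(N_PARAMS, optim_bytes=12)               # assumed AdamW state
sgd = state_memory_gib(N_PARAMS)                                 # no optimizer state
fused = state_memory_gib(N_PARAMS, live_grad_params=LARGEST_TENSOR)

budget_3090 = 8 * 24  # total memory of 8 RTX 3090 cards
budget_80gb = 8 * 80  # total memory of an 8x80GB machine

for name, gib in [("AdamW", adamw), ("plain SGD", sgd), ("fused update", fused)]:
    print(f"{name:>12}: {gib:7.1f} GiB | "
          f"within 8x24: {gib <= budget_3090} | within 8x80: {gib <= budget_80gb}")
```

Under these assumptions, only the fused-update estimate stays below the combined memory of 8 RTX 3090 cards. The numbers give an order-of-magnitude picture, not a feasibility proof, since activations and the way parameters are sharded across GPUs are left out.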

These are their overall contributions:

They offer a theoretical analysis suggesting that SGD can successfully fine-tune all parameters of LLMs. The obstacles that once prevented the widespread use of SGD may be less severe when fine-tuning LLMs.

They propose LOMO, or low-memory optimization, to drastically reduce GPU memory usage without impairing the fine-tuning process.

They empirically demonstrate the efficiency of LOMO in optimizing LLMs under resource-constrained conditions by closely analyzing memory usage and throughput. Performance evaluations on downstream tasks provide further support.

The implementation of the code is available on GitHub.


Aneesh Tickoo is a Consulting Intern at MarktechPost. She is currently pursuing her BA in Data Science and Artificial Intelligence from Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects that harness the power of machine learning. Her research interest is image processing and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.
