The more lottery tickets you buy, the higher your chances of winning, but spending more than you win is obviously not a wise strategy. Something similar happens in AI powered by deep learning: we know that the larger a neural network is (i.e., the more parameters it has), the better it can learn the task we set for it.
However, making a network arbitrarily large during training is not only impossible but also extremely inefficient. Scientists have tried to imitate the way biological brains learn, which is highly resource-efficient, by giving machines a gradual training process that starts with simpler examples and progresses to more complex ones, an approach known as “curriculum learning.”
Surprisingly, however, they found that this seemingly sensible strategy is irrelevant for overparameterized (very large) networks.
A study in the Journal of Statistical Mechanics: Theory and Experiment sought to understand why this “failure” occurs. It suggests that overparameterized networks are so “rich” that they tend to learn along a path driven more by quantity (of resources) than by quality (input organized by increasing difficulty).
This may actually be good news: it suggests that, by carefully tuning a network’s initial size, curriculum learning could still be a viable strategy, a promising route to more resource-efficient, and therefore less energy-hungry, neural networks.
There is great excitement around neural network-based AI such as ChatGPT: every day a new bot or feature emerges that everyone wants to try, and the phenomenon is growing in scientific research and industrial applications as well. This requires ever more computing power, and therefore energy, and concerns about both the energy sources needed and the emissions produced by this sector are on the rise. Making the technology capable of doing more with less is thus crucial.
Neural networks are computational models made up of many “nodes” that perform calculations, distantly resembling the networks of neurons in biological brains, and capable of learning autonomously from the input they receive. For example, they “see” a vast number of images and learn to categorize and recognize their content without direct instruction.
Among experts, it is well known that the larger a neural network is during the training phase (i.e., the more parameters it has), the more precisely it can perform the required tasks. One popular explanation is the “Lottery Ticket Hypothesis”: a huge, randomly initialized network is like a fistful of lottery tickets, likely to contain at least one fortunately initialized subnetwork, a “winning ticket,” that can learn the task well. Betting on size has a significant drawback, however: it requires a massive amount of computing resources, with all the associated problems (increasingly powerful computers are needed, which demand more and more energy).
To find a solution, many scientists have looked at a place where this problem appears to have been at least partially solved: biological brains. Our brains, running on just two or three meals a day, can perform tasks that would demand supercomputers and enormous amounts of energy from a neural network. How do they do it?
The order in which we learn things might be the answer. “If someone has never played the piano and you put them in front of a Chopin piece, they’re unlikely to make much progress learning it,” explains Luca Saglietti, a physicist at Bocconi University in Milan, who coordinated the study. “Normally, there’s a whole learning path spanning years, starting from playing ‘Twinkle Twinkle Little Star’ and eventually leading to Chopin.”
When input is provided to machines in order of increasing difficulty, it is called “curriculum learning.” The most common practice, however, is the opposite: feeding input in random order to highly powerful, overparameterized networks.
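To make the distinction concrete, here is a minimal sketch of the two feeding strategies in Python. Everything in it is illustrative: the difficulty() function is a hypothetical stand-in for whatever task-specific score (a teacher model’s loss, human ratings, input complexity) a practitioner would actually use.

```python
# Minimal sketch: curriculum ordering vs. the standard random ordering.
# difficulty() is a hypothetical placeholder, not a real library call.
import numpy as np

rng = np.random.default_rng(0)

def difficulty(x):
    # Illustrative proxy only: pretend larger-norm inputs are "harder".
    return np.linalg.norm(x)

def curriculum_batches(X, y, batch_size):
    """Yield batches from easiest to hardest example."""
    order = np.argsort([difficulty(x) for x in X])
    for i in range(0, len(X), batch_size):
        idx = order[i:i + batch_size]
        yield X[idx], y[idx]

def random_batches(X, y, batch_size):
    """Standard practice: yield batches in random order."""
    order = rng.permutation(len(X))
    for i in range(0, len(X), batch_size):
        idx = order[i:i + batch_size]
        yield X[idx], y[idx]
```

A training loop would simply consume curriculum_batches(...) in place of the usual shuffled loader; nothing else about the optimization changes.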
Once such a network has learned, its parameter count can be cut dramatically, often to less than 10% of the initial amount, because most of the parameters are no longer needed. But if you start with only 10% of the parameters, the network fails to learn. So while a trained AI might eventually fit on our phones, training it requires massive servers.
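That “train big, then shrink” pattern can be sketched with standard magnitude pruning. The snippet below is a generic illustration, not the procedure used in the study; the tiny model and the 90% pruning fraction (mirroring the “less than 10% of parameters” figure above) are placeholders.

```python
# Sketch of post-training magnitude pruning with PyTorch.
import torch.nn as nn
import torch.nn.utils.prune as prune

# Placeholder overparameterized model; training would happen first.
model = nn.Sequential(nn.Linear(784, 1024), nn.ReLU(), nn.Linear(1024, 10))

# Zero out the 90% of weights with the smallest magnitude in each layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)
        prune.remove(module, "weight")  # bake the mask into the weights

kept = sum((p != 0).sum().item() for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"nonzero parameters after pruning: {kept}/{total}")
```

The article’s point stands either way: pruning like this works only after the large network has been trained; starting directly from the small network does not.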
Scientists have wondered whether curriculum learning could save resources. Research so far, however, suggests that for heavily overparameterized networks it is irrelevant: presenting examples in a curated order does not seem to improve training performance.
The new work by Saglietti and colleagues attempted to understand why.
“What we’ve seen is that an overparameterized neural network doesn’t need this path because, instead of being guided through learning by examples, it’s guided by the fact that it has so many parameters—resources that are already close to what it needs,” explains Saglietti.
In other words, even if you offer it data curated for learning, the network prefers to rely on its vast processing resources, finding parts within itself that, with a few tweaks, can already perform the task.
This is actually good news: it does not mean that networks cannot take advantage of curriculum learning, only that, given their high number of initial parameters, they are pushed in a different direction. In principle, then, one could start with smaller networks and adopt curriculum learning.
“This is one part of the hypothesis explored in our study,” Saglietti explains.
“At least within the experiments we conducted, we observed that if we start with smaller networks, the effect of the curriculum—showing examples in a curated order—begins to show improvement in performance compared to when the input is provided randomly. This improvement is greater than when you keep increasing the parameters to the point where the order of the input no longer matters.”
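As a rough feel for that protocol, the toy sketch below trains the same task at a small and a large width, with and without a curriculum. It is emphatically not the authors’ setup: the synthetic data, the difficulty proxy, the widths, and the single training pass are all assumptions chosen only to make the comparison runnable.

```python
# Toy comparison (illustrative only): curriculum vs. random ordering
# at two network widths on a synthetic task.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(2000, 20)
y = (X[:, :5].sum(dim=1) > 0).long()   # synthetic binary labels
difficulty = X.abs().sum(dim=1)        # hypothetical difficulty proxy

def train_and_score(width, curriculum):
    model = nn.Sequential(nn.Linear(20, width), nn.ReLU(), nn.Linear(width, 2))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    order = difficulty.argsort() if curriculum else torch.randperm(len(X))
    for i in range(0, len(X), 50):     # one easy-to-hard (or random) pass
        idx = order[i:i + 50]
        opt.zero_grad()
        loss_fn(model(X[idx]), y[idx]).backward()
        opt.step()
    with torch.no_grad():
        return (model(X).argmax(dim=1) == y).float().mean().item()

for width in (4, 1024):                # "small" vs. overparameterized
    print(width, train_and_score(width, True), train_and_score(width, False))
```

In the study’s language, the interesting regime is the small width, where the ordering of examples can still tilt the outcome.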
More information: Stefano Sarao Mannelli et al, Tilting the odds at the lottery: the interplay of overparameterisation and curricula in neural networks, Journal of Statistical Mechanics: Theory and Experiment (2024). DOI: 10.1088/1742-5468/ad864b
Provided by International School of Advanced Studies (SISSA)