Scaling laws in AI generally relate the performance of a model to its inputs: training data, model parameters, and compute. The performance of a model describes its accuracy in choosing the "right" answer on known data. A large language model is trained to predict text completions; the more often it correctly predicts how to complete a text, the better its performance. A close antonym of performance is "loss," which is a measure of how far off a model's predictions were from reality; lower loss means better performance.

The number of parameters in the model is a measure of its complexity - roughly, the number of weighted connections between nodes in the neural network. Training data is the size of the dataset a model is trained on. Finally, the amount of "compute" used for a model, measured in floating point operations, or flops, is simply the number of computer operations (typically matrix multiplications) that must be performed throughout the model's training. Compute is therefore influenced by both the amount of data and the number of parameters.

The scaling relationship between loss and compute found by OpenAI in 2020 is a power law: if a model has 10 times the compute, its loss will be about 11% lower. This tells us how much "better" models can get from scaling compute alone.
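To make that power law concrete, here is a minimal sketch in Python. The exponent of 0.05 is an assumed value - roughly what OpenAI's 2020 scaling-law work reports for compute, and consistent with the "10 times the compute, about 11% lower loss" figure above - not a number taken from this article.

```python
# Minimal sketch of a compute-loss power law: loss is proportional to compute**(-alpha).
# alpha = 0.05 is an assumed exponent, roughly consistent with the
# "10x compute -> ~11% lower loss" rule of thumb quoted in the text.

ALPHA = 0.05  # assumed compute exponent

def relative_loss(compute_multiplier: float, alpha: float = ALPHA) -> float:
    """Return the new loss as a fraction of the old loss when compute is scaled up."""
    return compute_multiplier ** (-alpha)

if __name__ == "__main__":
    for factor in (10, 100, 1000):
        frac = relative_loss(factor)
        print(f"{factor:>5}x compute -> loss falls to {frac:.3f} of its old value "
              f"({(1 - frac) * 100:.0f}% lower)")
```

Running it shows the diminishing absolute returns of a power law: each additional factor of 10 in compute trims the same percentage off the current loss, which is a smaller and smaller slice of the original.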
It's difficult to say exactly what "11% lower loss" means in terms of how powerful or accurate a model is, but we can use existing models for context. GPT-2, which OpenAI released in 2019, was trained on 300 million tokens of text data and had 1.5 billion parameters. GPT-3 - the model behind ChatGPT - was trained on 300 billion to 400 billion tokens of text data and had 175 billion parameters. The details of their newest model, GPT-4, have not been made public, but outside estimates of its size range from 400 billion to 1 trillion parameters and around 8 trillion tokens of training data. In other words, training GPT-3 took about 200,000 times as much compute as GPT-2, and GPT-4 probably took between 60 and 150 times more than GPT-3.

In practical terms, GPT-2 could produce coherent sentences, but its output tended to degenerate into repetitive noise after about a paragraph. The much larger GPT-3 can reliably generate on-topic, sensible completions. GPT-4's performance - on everything from programming problems to the bar exam - is even more impressive.

Looking at a longer time horizon, Epoch AI estimates that the compute used for training state-of-the-art machine learning models has increased by about eight orders of magnitude (that is, 100 million times over) between 20. If the largest AI models continue to grow at their current pace through the end of this decade, that would be the equivalent of three orders of magnitude of compute growth. That's more than the compute growth between GPT-3 and GPT-4, though less than the compute growth between GPT-2 and GPT-3. As extremely large models have become more compute-intensive, the pace of their growth seems to have slowed.

It's still possible that the compute devoted to AI models will accelerate faster than the current trend. Perhaps AI will attract greater investment and resources as the first LLM-driven products are released and become widely popular. But there are some reasons to expect that we may run into fundamental limits to how much compute can go into LLMs by the end of this decade.

In other words, compute (and hence performance) scales with the amount of time devoted to training a model, the number of computers (these days, largely GPUs) performing computations in parallel, the speed of the GPU when it's running, and the utilization rate, i.e., the percentage of the time the GPU is actually executing tasks while the model is training.

"Wait a minute, why would the GPU be idle?" Because training an AI model involves more than just multiplying numbers. Critically, it also involves calling memory and communicating between different processors. Even the most efficient models on today's hardware spend 40% of training time making calls to memory. Empirically, utilization rates seem to be 30-75% at best. Utilization rates also decline with the number of GPU processors used in parallel, since the more processors you use, the more time you'll have to "waste" sending data between them.

Training time probably can't scale up much from here: the largest language models are already spending months training, and firms may not find it profitable to spend years training a single model.
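As a rough illustration of how those factors interact, here is a back-of-the-envelope sketch in Python. It relies on the common approximation that training compute is about 6 x parameters x training tokens; the cluster size, per-GPU speed, and utilization figures are placeholder assumptions for illustration - none of those numbers come from the article except the GPT-3 parameter and token counts.

```python
# Back-of-the-envelope estimate of training compute and wall-clock training time.
# Assumptions (not from the article): total training compute ≈ 6 * parameters * tokens,
# plus placeholder hardware figures chosen only for illustration.

def training_flops(parameters: float, tokens: float) -> float:
    """Approximate total training compute in floating point operations."""
    return 6 * parameters * tokens

def training_days(total_flops: float, num_gpus: int,
                  peak_flops_per_gpu: float, utilization: float) -> float:
    """Wall-clock days of training given cluster size, per-GPU peak speed, and utilization."""
    effective_rate = num_gpus * peak_flops_per_gpu * utilization  # flops per second
    return total_flops / effective_rate / 86_400  # 86,400 seconds per day

if __name__ == "__main__":
    # A GPT-3-scale run, using the article's figures: 175 billion parameters, ~300 billion tokens.
    flops = training_flops(175e9, 300e9)  # roughly 3e23 flops
    days = training_days(
        flops,
        num_gpus=1_000,           # assumed cluster size
        peak_flops_per_gpu=3e14,  # assumed ~300 teraflops per GPU
        utilization=0.4,          # assumed 40% utilization, within the 30-75% range above
    )
    print(f"total compute: {flops:.1e} flops")
    print(f"wall-clock time: about {days:.0f} days on this hypothetical cluster")
```

Plugging in larger models, more tokens, or lower utilization shows how quickly another three orders of magnitude of compute growth pushes a training run from months toward years, which is the limit the passage above is pointing at.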