Issues and Concerns in the AI Era

Infrastructure and Environment


Learning Objectives

  • You understand why training large language models requires massive computational infrastructure.
  • You can identify the major cost factors: hardware, energy, and engineering effort.
  • You know the environmental impacts of AI training and deployment.

The infrastructure challenge

Training a large language model is not primarily about discovering better algorithms — the fundamental techniques are well understood and publicly documented. The real challenge is scale.

Building large language models requires:

  • hundreds of billions to trillions of words of training data,
  • thousands of specialized processors running continuously for weeks or months,
  • high-speed networks with minimal latency, and
  • storage systems managing petabytes of data.

Without this infrastructure operating at scale, the algorithms alone cannot produce modern large language models.
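To make the scale concrete, a commonly used back-of-envelope rule estimates training compute as roughly 6 floating-point operations per model parameter per training token. The Python sketch below applies this rule to GPT-3-scale numbers (175 billion parameters, roughly 300 billion training tokens); treat both figures as illustrative assumptions rather than exact specifications.

```python
# Rough training-compute estimate via the common ~6 * N * D rule of thumb:
# about 6 floating-point operations per parameter per training token.

n_params = 175e9   # model parameters (illustrative, GPT-3-scale)
n_tokens = 300e9   # training tokens (illustrative)

total_flops = 6 * n_params * n_tokens
print(f"Estimated training compute: {total_flops:.2e} FLOPs")
# -> about 3.15e+23 FLOPs for a single training run
```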


Hardware and infrastructure requirements

Large language models are trained on specialized hardware designed for high-throughput numerical computation. These include GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units), which can perform the massive number of matrix multiplications required for training neural networks far more efficiently than general-purpose CPUs.
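To get a feel for why specialized hardware matters, the short sketch below times one large single-precision matrix multiplication on a CPU with NumPy and reports the achieved throughput. A typical laptop or desktop CPU reaches tens to a few hundred GFLOP/s this way, whereas a single modern accelerator is rated at hundreds of TFLOP/s, roughly a thousandfold difference (the exact ratio depends on the hardware).

```python
import time
import numpy as np

# Time one large single-precision matrix multiplication on the CPU.
n = 4096
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

# A dense n-by-n matmul costs about 2 * n^3 floating-point operations.
gflops = 2 * n**3 / elapsed / 1e9
print(f"{elapsed:.2f} s -> roughly {gflops:.0f} GFLOP/s on this CPU")
```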

As an example, training GPT-3 reportedly used a system with around 10,000 NVIDIA GPUs running for several weeks. Larger recent models use even more resources. And it is not just the GPUs: the entire system, including CPUs, memory, storage, and networking, must be designed to handle the required data throughput and parallelism.
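Dividing the earlier compute estimate by a cluster's sustained throughput gives a rough sense of training time. Every number below (GPU count, per-GPU peak, utilization) is an illustrative assumption; real figures depend on the hardware generation and on engineering effort.

```python
total_flops = 3.15e23   # from the 6 * N * D estimate above
num_gpus = 10_000       # assumed cluster size
peak_per_gpu = 125e12   # assumed peak FLOP/s per GPU (V100-class tensor cores)
utilization = 0.3       # assumed fraction of peak actually sustained

seconds = total_flops / (num_gpus * peak_per_gpu * utilization)
print(f"Roughly {seconds / 86_400:.0f} days of continuous training")
# -> on the order of ten days under these assumptions; lower utilization
#    or fewer GPUs stretches this to weeks or months.
```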

At the moment, the main provider of AI training hardware is NVIDIA, whose GPUs dominate the market. Visit NVIDIA’s stock price page to see how the market values this dominance — note that NVIDIA has done multiple stock splits, so the price per share is not directly comparable over time.

Furthermore, the data center housing this hardware must supply enough power and cooling to keep thousands of high-performance processors running reliably around the clock. Doing so for weeks or months consumes a huge amount of electricity, easily comparable to the power usage of a small town.
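The small-town comparison can be sanity-checked with simple arithmetic. The power draw per GPU, the overhead factor, and the duration below are all illustrative assumptions.

```python
num_gpus = 10_000      # assumed cluster size
watts_per_gpu = 300    # assumed average draw per GPU, in watts
pue = 1.2              # assumed data-center overhead (cooling, networking, ...)
days = 30              # assumed training duration

power_mw = num_gpus * watts_per_gpu * pue / 1e6
energy_mwh = power_mw * 24 * days
print(f"Continuous draw: {power_mw:.1f} MW")
print(f"Energy over {days} days: {energy_mwh:,.0f} MWh")
# -> about 3.6 MW around the clock and a few thousand MWh in total,
#    comparable to the electricity use of thousands of households.
```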

For more details on energy consumption, see e.g. How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference.


Environmental impacts

In addition to the direct energy consumption and the associated carbon emissions (which depend on the energy source), there are broader environmental impacts to consider, in particular the hardware supply chain and water usage.

First, AI systems depend on specialized chips that require the mining and refining of rare earth elements, energy-intensive manufacturing processes, and global supply chains with transport-related emissions. Mining the necessary materials can lead to habitat destruction, pollution, and significant carbon emissions, especially in regions with lax environmental regulations.

Compared to electricity use, the costs of manufacturing and distributing the hardware are often overlooked, but they are significant.

Second, AI data centers require substantial cooling to prevent overheating, and they often rely on water-intensive cooling methods. In water-scarce regions, this can strain local water supplies and affect local ecosystems.
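A similar order-of-magnitude estimate works for water. The sketch below multiplies the energy figure from the earlier sketch by an assumed water usage effectiveness (WUE, liters of water per kilowatt-hour); actual values vary widely by site and cooling technology.

```python
energy_kwh = 2_592_000   # training energy from the earlier sketch, in kWh
wue = 1.8                # assumed water usage effectiveness, in L/kWh

liters = energy_kwh * wue
print(f"Roughly {liters / 1e6:.1f} million liters of water")
# -> several million liters for one run under these assumptions, on the
#    order of one or two Olympic-size swimming pools.
```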


Broader implications

As the costs of training LLMs are high and rising, several broader implications arise:

  • Only well-resourced organizations can afford to train the largest models from scratch, concentrating capabilities and raising questions about equitable access to AI technology. Currently, companies like OpenAI, Google, Microsoft, and Meta invest more in AI infrastructure than many countries spend on research and development.

  • As AI deployment and use grow, the cumulative energy consumption and environmental impact become significant, prompting calls for more sustainable practices in AI development. Many search engines are now integrating AI features, which increases computational load and energy use even for routine queries.

  • Access to advanced semiconductors has become a strategic consideration for nations seeking to develop AI capabilities, giving the supply chain, including the minerals it depends on, geopolitical weight. For instance, countries have restricted exports of advanced chips to competitors.
