The AI Stack for the 3rd Epoch of Computing
Hot-takes on everything from AI applications to AI hardware and the "flippening" of AI compute towards inference.
It’s hard not to see this new wave of AI as potentially the biggest disruption since the internet. More value will be created in the next decade than in the last 100 years combined. ChatGPT is simply the beginning - it lets us communicate with systems naturally through language (i.e. English) while reasoning with one of the most powerful computational engines ever built. If we expect this capability to become as mainstream as Google search, a lot needs to change in the current AI stack to bring that reality to fruition.
Simply put, we’re just scratching the surface of a much bigger Cambrian explosion in the AI stack - new AI-native interfaces, applications, models, clouds, and hardware will emerge over the next decade, potentially disrupting every layer of today’s compute stack.
AI-native products will affect the entire compute stack
I truly believe that this elastic demand we’re seeing for AI and AI-native products will soon give rise to the 3rd epoch of computing1 - one that’s dominated by specialized AI inference chips, purpose-built and high-performance AI clouds, and foundational models that we interact with over APIs and fundamentally new interfaces. This platform shift will inevitably affect the entire computing stack — from end-users all the way to specialized AI hardware (HW). Here’s what I think we can expect to see at each layer in the coming years:
1. AI Applications
This layer of the stack will arguably be the most dynamic and value-creating layer any industry has seen in decades, yet we have little idea of what’s possible - much like the internet in the 90s. Let me try to sketch a computational framework to predict where we’re headed. Our best guess is that we’re going to see a slow but steady progression through automation levels (i.e. L1-L5, akin to the automation levels defined for vehicles) - from purely functional APIs (generally stateless, singular utility), to higher-order APIs (multiple functional APIs combined), to co-pilots (human-assisted higher-order APIs), to full-fledged agents (fully automated, objective-driven).
Functional AI (L1): Models augment or fully replace existing / traditional APIs. Companies like Google, Meta, Uber, Spotify, Instacart, and Stripe have already deployed ML across their engineering and business, including email classification, spam filtering, recommendation systems, demand forecasting, etc. We will continue to see ubiquitous adoption of these data-driven systems with more compute-hungry algorithms in the coming years, across the whole gamut of modalities including text, image, audio, video, code, 3D assets, and actions. For example, Midjourney already has over 2M subscribers paying $10/month (> $200M ARR) for its wildly popular but singular text-to-image Discord bot API.
Higher-order AI APIs: Models composed in creative and domain-specific ways to provide real value to both enterprises and consumers. Each offering is tailored to a specific domain/niche/industry such as legal, healthcare, finance, or software development. Products are typically composed of several capabilities (sentence/document embeddings, image understanding, RAG with vector DBs, domain-specific personalization, etc.), often via functional APIs, to enable fundamentally new automation that functional APIs cannot individually provide.
For example, companies like Harvey.ai have partnered with OpenAI to dramatically speed up legal research and diligence, improving overall firm efficiency by over 10x. Other companies delivering similar 10x domain-specific productivity gains include Copy.ai for marketing, Midjourney for creative artists, and RunwayML for video generation.
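To make the composition idea concrete, here is a minimal, hypothetical sketch of a higher-order legal-research flow that chains three functional capabilities - an embedding-backed vector store, retrieval, and an LLM call. The libraries (chromadb, openai) follow their public Python clients, but the collection, documents, and model choice are illustrative assumptions, not any vendor’s actual product.

```python
# A minimal sketch of a "higher-order" API: retrieval-augmented answering
# over a small set of legal documents. Library calls follow the public
# chromadb and openai Python clients; documents and model name are illustrative.
import chromadb
from openai import OpenAI

# 1. Functional capability: embed + index documents in a vector DB.
db = chromadb.Client()
docs = db.create_collection("contracts")  # uses chroma's default embedder
docs.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "This agreement may be terminated with 30 days written notice.",
        "Liability under this contract is capped at the fees paid.",
    ],
)

# 2. Functional capability: retrieve the passages most relevant to a query.
question = "How much notice is required to terminate?"
hits = docs.query(query_texts=[question], n_results=2)
context = "\n".join(hits["documents"][0])

# 3. Functional capability: ask an LLM to answer, grounded in the retrieved text.
llm = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
answer = llm.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```

None of the individual pieces is novel; the value comes from composing them for a specific domain, which is exactly what the higher-order layer sells.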
Co-pilots (L2-L3): Models augment humans in their day-to-day operational tasks, amplifying their work and overall productivity. For instance, GitHub Copilot has, since its release, sped up development tasks by over 55%. NVIDIA recently deployed ChipNeMo internally to help hardware and software architects co-design their next-generation AI chips - their blockbuster $100B+ business. In some scenarios, co-pilots can even reduce the barrier to entry for operational roles that previously required highly skilled workers - for example, Replit’s developer ecosystem could grow to a billion users in the next decade with co-pilots drastically lowering the barrier to entry for software development.
Agents (L4-L5): Models fully automate the development and orchestration of complex tasks from simple, high-level, user-informed objectives. While we’re far from this reality today, recent agent-based frameworks such as AutoGen and MemGPT have exploded in developer interest given the massive expected utility of fully automated agents. This is still at a proof-of-concept stage and will require significant R&D before mainstream utility. However, the compute demands for these systems are already roughly two orders of magnitude higher than those of the previous levels, given the recursive nature of their execution.
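For intuition on why agents are so compute-hungry, here is a bare-bones, hypothetical agent loop (not AutoGen or MemGPT specifically): every step re-sends the growing history to the model and may trigger tool calls, so token counts and inference calls compound quickly. The call_llm and run_tool helpers are stand-ins to be wired to a real model API and real tools.

```python
# A bare-bones sketch of an objective-driven agent loop, illustrating why
# agents multiply inference cost: each iteration re-sends the full (growing)
# history to the model, and tool results feed back into the next call.
# call_llm and run_tool are hypothetical stand-ins, not a real framework.

def call_llm(messages: list[dict]) -> dict:
    """Stand-in for a chat-completion API call; returns the model's next action."""
    raise NotImplementedError("wire this to your model API of choice")

def run_tool(name: str, args: dict) -> str:
    """Stand-in for tool execution (search, code, retrieval, ...)."""
    raise NotImplementedError("wire this to your tool implementations")

def run_agent(objective: str, max_steps: int = 10) -> str:
    history = [{"role": "user", "content": objective}]
    for _ in range(max_steps):
        reply = call_llm(history)        # one full inference pass per step
        history.append(reply)
        if reply.get("tool"):            # model asked to use a tool
            result = run_tool(reply["tool"], reply.get("args", {}))
            history.append({"role": "tool", "content": result})
        else:                            # model produced a final answer
            return reply["content"]
    return "stopped: step budget exhausted"
```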
Each progression in automation level will demand an order of magnitude more AI compute than the previous level, creating even greater need for purpose-built infrastructure that can keep up with these growing inference demands.
2. AI Models & APIs
Over the past two decades, the API ecosystem has matured significantly, providing enterprises and developers with a much-needed separation of concerns between application and business logic. APIs continue to be the most powerful distribution channel for AI today - the surge in adoption of OpenAI’s GPT API has allowed developers to quickly incorporate and evaluate new AI-powered capabilities without having to worry much about the complexities of the infrastructure, the availability of GPUs, or upfront AI training costs.
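As an illustration of how thin this distribution layer is, the sketch below calls a hosted LLM with a single HTTP request - no GPUs, drivers, or training pipelines involved. The endpoint and payload follow OpenAI’s public chat-completions API; the model name and prompt are purely illustrative.

```python
# A single HTTP request is all it takes to consume a frontier model as an API.
# Endpoint and payload follow OpenAI's public chat-completions API.
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",  # illustrative model choice
        "messages": [{"role": "user", "content": "Classify this email as spam or not: ..."}],
    },
    timeout=30,
)
print(resp.json()["choices"][0]["message"]["content"])
```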
Closed-source Models: The OpenAI GPT API today holds over 80% of the LLM API market share, primarily due to the unrivaled quality of its models. Several competitors (Anthropic, Mistral) have come out with alternatives, forcing OpenAI to aggressively cut prices and improve the overall developer experience in order to retain market share.
Open-source Models: With the rise of open-source ML and tools such as transformers and diffusers, there’s a parallel world of foundation model development that aims to democratize and open-source this breakthrough technology - aka the “Linux movement for AI”. Open-source software has become the heartbeat of AI models today, championed by the likes of Hugging Face, Meta (ironically), and others, with over 200K developers contributing and 18M members across AI Discord channels.
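As a quick illustration of how accessible these open models have become, the sketch below runs an open text-generation model locally with the transformers library; the checkpoint is an illustrative choice and any compatible causal LM would work.

```python
# Running an open-source language model locally with Hugging Face transformers.
# The checkpoint (gpt2) is an illustrative choice; any compatible causal LM works.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator(
    "The third epoch of computing will be defined by",
    max_new_tokens=40,
)
print(out[0]["generated_text"])
```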
Multi-modal APIs: Most of the recent foundation model growth and adoption has centered on LLMs. However, more powerful modeling capabilities through transformers and the rapid increase in available compute have led to fundamentally new capabilities in generative AI - modalities such as image, audio, video, and 3D, along with multimedia such as PDFs, can all be generated or interpreted by AI in genuinely novel ways today. We can expect more of this to come, allowing enterprises to fully leverage their data advantage, including unstructured data.
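For a taste of the image modality, here is a minimal text-to-image sketch using the diffusers library; the checkpoint, dtype, and device are illustrative assumptions, and a CUDA GPU is assumed for reasonable speed.

```python
# Minimal text-to-image generation with Hugging Face diffusers.
# Checkpoint and device choices are illustrative; a CUDA GPU is assumed.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe("an isometric illustration of an AI datacenter, pastel colors").images[0]
image.save("datacenter.png")
```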
Needless to say, AI-powered APIs will continue to be mainstream with more and more capabilities being enabled in the coming years, abstracting away much of the sophisticated infrastructure and models from developers / enterprises.
3. AI Developer Tools
While investors are still getting over the red ocean that was MLOps, a new category of tooling has emerged to provide the “picks-and-shovels” for the generative AI and recently coined LLMOps industry. Open-source tools have emerged across every aspect of the new AI dev-to-prod stack (foundation models, compute, inference & fine-tuning, workflow orchestration, monitoring & observability), some re-hashing their MLOps strategies for LLMs, while others build from the ground up for this new era.
In general, though, we’re seeking better abstractions for this new class of AI models, compute, and the underlying HW infrastructure, while keeping the SW abstractions as familiar to developers as the existing MLOps software stack.
Better abstractions for models: An interesting observation is how accessible these large, sophisticated language models have become, with language (e.g. English) as the new API for communicating with our systems. Companies like Replit and OpenAI are well-positioned to grow their developer ecosystems to 1B+ users through these “language-as-code” tools (OpenAI GPTs, Replit AI), enabling a fundamentally new developer platform - it’s likely that a substantially longer tail of use-cases will be captured, with faster market penetration than mobile or the internet achieved.
Better abstractions for developers: The accessibility of LLMs through “language” has also led to the rise of the AI engineer, a role that straddles the application layer and foundation model APIs. Arguably this role may be much broader - spanning SW engineers who build the infrastructure tools to product and business teams who simply use plain English to instruct and build the wide range of automation they need as part of their day-to-day operations.
Better abstractions for compute: Several GPU providers and infrastructure tools have emerged in the past year, allowing developers to provision compute and run models without having to worry about the underlying GPU infrastructure. While “serverless” and “fast cold-boots” are casually thrown around, the underlying AI HW virtualization (i.e. GPU virtualization) is far from this reality. NVIDIA only recently added support for MIG (Multi-Instance GPU) with its A100 and H100 line-ups. Orchestration tools such as k8s have historically been limited to a single-pod-per-GPU model. The onus is still on the developer to manually tune and scale their models per GPU, often leaving modern datacenter GPUs heavily underutilized. Needless to say, there’s plenty of room for improvement here, and we’re only starting to see the early abstractions emerge.
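One way to see the underutilization for yourself is to sample GPU utilization directly through NVIDIA’s NVML bindings; a minimal sketch follows, assuming the nvidia-ml-py (pynvml) package and an NVIDIA driver are installed.

```python
# Quick check of how busy each GPU actually is, via NVIDIA's NVML bindings.
# Assumes the nvidia-ml-py (pynvml) package and an NVIDIA driver are installed.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # % of time kernels were running
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(
        f"GPU {i}: compute {util.gpu}%, "
        f"memory {mem.used / mem.total:.0%} of {mem.total / 2**30:.0f} GiB used"
    )
pynvml.nvmlShutdown()
```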
Better abstractions for hardware: If we take a step back from NVIDIA GPUs and look at the broader AI HW landscape, it gets hairy very quickly. While NVIDIA has spent years maturing its CUDA and driver stack for both developers and enterprises, the rest of the AI HW ecosystem is in disarray. Most developers shy away from new AI HW devices simply because of the complexity of building device drivers, compiling PyTorch from source, or setting up training and inference libraries to fully utilize the underlying HW. Today, PyTorch, XLA, and other ML compiler/IR solutions have significantly reduced the barrier to entry for any hot AI HW startup looking to reach the ML developers building on PyTorch/JAX, but there’s still more work to be done.
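The abstraction the ecosystem is converging on looks roughly like the sketch below: the same PyTorch model code with only the backend selection varying. The torch_xla import is the real package for XLA devices such as TPUs; the surrounding selection logic is just an illustration.

```python
# The same model code, with only the backend selection varying.
# torch_xla provides the XLA backend (e.g. TPUs); everything else is stock PyTorch.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():                # NVIDIA (or ROCm-built) GPUs
        return torch.device("cuda")
    try:
        import torch_xla.core.xla_model as xm    # XLA devices such as TPUs
        return xm.xla_device()
    except ImportError:
        return torch.device("cpu")               # portable fallback

device = pick_device()
model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device)
print(model(x).shape, "on", device)
```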
4. AI Cloud & Data Center
This is the world of AI chips (CPUs, GPUs, ASICs), high-bandwidth interconnects, high power-density cloud data centers, and HW virtualization that accounts for the bulk of the AI we see today. Notably, several opinionated AI clouds have sprung up in recent years to fully capitalize on AI compute demand (Lambda Labs, Crusoe, CoreWeave, BlueSea Frontier Compute Cluster, etc). Some of these cloud vendors have even successfully raised billions of dollars in funding to outfit their clouds with the latest and greatest NVIDIA chips. Lambda Labs sold out over $100M of on-demand H100 cloud capacity in just over an hour. CoreWeave, in just a few years, has become one of the largest GPU operators, alongside the 3 major cloud providers (AWS, GCP, Azure).
NVIDIA is the king of training today: There’s no denying it. NVIDIA arguably runs more than 95% of AI workloads globally, and it’s going to take the competition a few long and painful years to capture even a fraction of that. Most of this has been attributed to NVIDIA’s fast HW release cycles and its large, vibrant ecosystem of CUDA developers who write custom kernels and embarrassingly parallel programs designed specifically for its chipsets. Needless to say, the recent surge in AI demand has resulted in NVIDIA completely owning the data-center market, with breakout $10B+ earnings in 23Q4 and more than $60B (2M H100s) in GPU orders on its books for the next year.
The AI inference “flippening” is coming: While NVIDIA may remain the king of accelerators for a few more quarters to come, a new surge of both HW (Tenstorrent, Graphcore, Cerebras, etc) and SW (Modular / Mojo, Together, OctoML, etc) competitors is vying separately for training and inference market share. Much of this interest is aimed at where the puck is headed - the inference market. Even though NVIDIA dominates the market for training chips, the increased demand for inference and the growing viability of these foundation models mean that we’re soon going to see a “flippening” of AI workloads towards inference, with inference accounting for 100x the compute of training in the next few years. ASIC manufacturers such as Google (with its TPU v5e) are expected to ramp up capacity to millions of chips, well aware of the need for inference capacity and reduced price per FLOP. AWS has also doubled down on its Inferentia2 chips, inference-specialized accelerators that promise significantly better price-performance than NVIDIA GPUs.
Inference ≠ Training: Opportunistically, AI cloud vendors have capitalized on the surging demand for GPUs, racking up thousands of NVIDIA A100/H100 GPUs. However, this may backfire when demand shifts to inference and the market floods with specialized AI inference accelerators that are more competitive on price per FLOP than their NVIDIA GPU counterparts. While training foundation models demanded very different network fabrics and interconnects with low tolerance for HW failures, inference and the economics of operationalizing these large models are going to be wildly different. My wild prediction (see figure below) is that AI CSPs will start partnering with custom ASIC manufacturers to differentiate and provide a competitive advantage to application developers, while slowly diversifying away from their NVIDIA GPU portfolios.
The inference economics of AI
Ironically, we’re still in the early days of this market transformation. While AI training, compute spend, and multi-billion-dollar early-stage startups have predominantly stolen the limelight, the real money is going to be in AI inference. It may be counter-intuitive, but inference is extremely expensive today - far more than training. ChatGPT, the fastest-growing AI application on the market today, requires over 30,000 GPUs to serve its rapidly expanding 100M+ user base. However, if ChatGPT were to replace every Google search today, it would cost $36B/yr in inference alone and require 4M+ GPUs to power that one breakout product (or $100B in capex to purchase the GPUs from NVIDIA). If we expect AI to be as mainstream in enterprises and consumer tech as Google search is today, the inference economics need to be at least 2-3 orders of magnitude better.
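A rough back-of-envelope sketch of where figures of that magnitude come from is below; every input is an illustrative assumption (not a sourced figure), chosen only so the outputs land near the numbers cited above.

```python
# Back-of-envelope inference economics for "an LLM answers every web search".
# All inputs are illustrative assumptions, not sourced figures.
searches_per_day = 8.5e9           # assumed global search volume
cost_per_llm_query = 0.0116        # assumed $ of GPU time per answered query
queries_per_gpu_per_day = 2_000    # assumed sustained throughput per GPU
gpu_capex = 25_000                 # assumed $ per datacenter GPU

annual_cost = searches_per_day * cost_per_llm_query * 365
gpus_needed = searches_per_day / queries_per_gpu_per_day
capex = gpus_needed * gpu_capex

print(f"annual inference cost: ${annual_cost / 1e9:.0f}B / yr")  # ~ $36B/yr
print(f"GPUs needed:           {gpus_needed / 1e6:.1f}M")        # ~ 4.2M GPUs
print(f"GPU capex:             ${capex / 1e9:.0f}B")             # ~ $106B
```

Tweak any single input by an order of magnitude and the conclusion flips, which is exactly why driving the per-query cost down matters so much.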
This doesn’t mean there isn’t real value we can provide today. Companies like Harvey, Copy.ai, and Midjourney have already brought breakthrough products to market simply by reducing the overall cost of human-in-the-loop work by 1000x. Creating images that once cost $100+ for an hour of a graphic artist’s work now costs ~$0.001. Combing through thousands of legal documents now costs AI < $1, where paralegals or clerks would have cost several thousand dollars. Again, we have only started to augment and automate these specific jobs because the unit economics of inference (i.e. inference economics) are finally >1000x better than human labor today.
Conclusion
All of this raises the question - when are we going to see this market transformation with AI?
Only when the inference economics make it viable. In other words, the sooner we can reduce the cost of AI inference by 100-1000x, the sooner we’ll see the kind of long-tail, broad market disruption that humanity previously saw with foundational technologies like the electric grid, the laser, the transistor, and the internet.
The disruption is coming though - AI inference will soon dominate training and become ubiquitous, inference costs will drop, software abstractions will improve, more AI inference accelerators will flood the market to meet exponentially growing compute needs, more builders will use natural language to communicate with and build on our systems, and we’ll arguably capture more value and productivity gains in the next 10 years than in the last 100 years combined.
So, it turns out there’s a lot left to build in this full-stack AI cloud of our future. Alan Kay’s words resonate more than they ever did in this new computing era - “People who are really serious about software should make their own hardware”.
The microchip was the first, bringing the marginal cost of compute to zero. The internet was the second, bringing the marginal cost of distribution to zero. AI will be the third, bringing the marginal cost of labor/work to zero. (Credit: Martin Casado, a16z)