
The future of AI compute - with Jonathan Ross

Dennis Kuriakose

Jonathan Ross is the founder and CEO of Groq, a company building chips for AI inference. Jonathan is interviewed by Social Capital CEO Chamath Palihapitiya, who also led the initial funding rounds for Groq back in 2016.




Jonathan is at the leading edge here, with a rare level of insight into where the industry is today and where it will go - at every level: unit economics, market dynamics, and of course the operating system of it all, the chip architecture.


There are some very important concepts explained, which I will try to summarise and hopefully do justice to the way they are explained.


Here is the recording for Substack followers - The future of AI compute - Jonathan Ross


This article from CNBC (dated 2017) tells the backstory - Ex-Googlers-left-secretive-AI-unit-to-form-Groq


My construction of the summary is below.



Why is Groq important for the industry

Jonathan's breakout came when he began developing chips at Google, just as chips were emerging as the limiting factor for the research computations Google's AI engineers needed. They recognized that matrix multiplication was the crucial computation in AI and that neither CPUs nor GPUs handled it well. Consequently, they designed a chip called the TPU (Tensor Processing Unit) to accelerate AI workloads - a significant endeavor for Google in terms of both R&D investment and strategic maneuvering, dating back to 2016. Relying solely on chip providers for critical AI workload infrastructure would have severely hindered Google's ability to innovate and scale its services.
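To make the matrix-multiplication point concrete, here is a purely illustrative sketch: a single dense layer of a neural network is one matmul, and a large model chains thousands of them (the layer sizes below are made up for illustration).

```python
# Illustrative only: matrix multiplication as the core AI computation.
import numpy as np

batch, d_in, d_out = 32, 4096, 4096                   # hypothetical layer sizes
x = np.random.randn(batch, d_in).astype(np.float32)   # activations
W = np.random.randn(d_in, d_out).astype(np.float32)   # layer weights

y = x @ W                                             # the operation accelerators are built around

flops = 2 * batch * d_in * d_out                      # multiply-adds for this one layer
print(f"~{flops / 1e9:.1f} GFLOPs for a single layer on a single batch")
```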


Following his success with TPUs at Google, Jonathan received similar invitations from other compute providers. However, he wanted to take a leap of faith and build a chip company from the ground up. Leaving chip access in the hands of only a few players would concentrate too much power in a layer that is fundamental to AI computation and indirectly affects every other industry.


Groq's focus lay specifically on language inference, recognizing that this was where latency in AI applications mattered most. Their product, the Language Processing Unit (LPU), aimed to address this need.


What are the product considerations that went into Groq

Even during the development of TPUs, the team had to ensure that the chip could execute code written for GPUs seamlessly. This required writing kernels for chips they weren't directly building, enabling users to run their GPU-written models on TPUs with little to no modification. Initially, the focus was on designing a compiler to streamline this process, eliminating the need for custom kernels for existing chip architectures like GPUs and TPUs.


Another key insight was that market demand for chips would be significantly higher for inference loads compared to training loads.


A third crucial insight was that generative AI is inherently sequential. Until that point, AI had primarily focused on tasks such as classification, summarization, and contextual reasoning. Generative AI, however, produces a sequence token by token, with each token depending on the ones before it. While existing chips excelled at parallelization, they were inefficient at sequential token generation, a critical requirement for generative AI.
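A minimal sketch of why generation is sequential (the `model.predict_next` interface below is hypothetical, not any particular library's API):

```python
# Sketch: autoregressive generation is inherently sequential --
# token t+1 cannot be computed until token t exists.
def generate(model, prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model.predict_next(tokens)   # depends on all previous tokens
        tokens.append(next_token)                 # so the steps cannot run in parallel
        if next_token == model.eos_token:
            break
    return tokens
```

A classification pass can process its whole input in one parallel shot; this loop cannot, which is why per-token latency matters so much here.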


What metrics is the AI compute industry predominantly using

Quality, of course, is paramount. However, speed becomes the crucial factor, considering that models are already available in open source, making quality accessible to anyone in need. Thus, the differentiator becomes speed.


In terms of speed, there are two distinct metrics. The first is the time to the initial response (time to first token) - minimizing any awkward pause before an answer starts is imperative. The second is the rate at which subsequent tokens are generated (tokens per second).
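A small sketch of how one might measure both metrics against any streaming endpoint (the token stream here is a stand-in for a real client):

```python
import time

def measure(token_stream):
    """token_stream yields tokens as they arrive (stand-in for a real streaming client)."""
    start = time.perf_counter()
    time_to_first_token = None
    count = 0
    for _ in token_stream:
        if time_to_first_token is None:
            time_to_first_token = time.perf_counter() - start   # metric 1: time to first token
        count += 1
    elapsed = time.perf_counter() - start
    tokens_per_sec = count / elapsed if elapsed > 0 else 0.0    # metric 2: generation rate
    return time_to_first_token, tokens_per_sec

# Example with a dummy stream: ttft, rate = measure(iter(range(500)))
```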


In applications built from multiple layers of models, where the language model is just one layer, the latencies of the layers add up and cannot be hidden. A long latency at the language-model layer would therefore render the whole application unviable.
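A back-of-the-envelope illustration (the numbers are invented, not from the interview) of how layer latencies add up in, say, a voice assistant built around an LLM:

```python
# Latencies of pipeline stages simply add; a slow LLM layer dominates the budget.
layer_latency_ms = {
    "speech_to_text": 150,
    "llm_inference": 1200,   # the language-model layer
    "text_to_speech": 200,
}
total_ms = sum(layer_latency_ms.values())
print(f"end-to-end: {total_ms} ms, of which the LLM is "
      f"{layer_latency_ms['llm_inference'] / total_ms:.0%}")
```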


Furthermore, an important consideration is a chip architecture that can handle workloads with long sequences without relying on hand-written kernels. Additionally, in the case of the LPU, the model is loaded into the memory of all 64 chips on the board, a capability GPUs lack.


Beyond these performance factors, economic considerations come into play: platform lock-in and cost. These models require fine-tuning for specific applications, and startups that fine-tune on a given compute platform become locked into it, hindering their ability to move to rapidly improving models.


Applications built on top of these models often opt for lower-parameter models to save costs, inadvertently introducing hallucinations in the generated content: the output may initially appear correct, but closer inspection reveals inaccuracies. So there is a need to be able to deploy large models rather than their smaller replacements (Llama 2 rather than Llama 1, for example).


Regarding costs, there are notable differences between cloud providers who build their own chips and providers sourcing chips from NVIDIA for their data centers - the latter tend to be more expensive. However, a third option exists with Groq, whose chips are 60-70% cheaper in terms of both cost and electricity consumption. At that point, it becomes more sensible to build on a significantly cheaper chip than to rent from a platform for the product's lifetime.
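Rough arithmetic on the 60-70% figure (the baseline cost below is a placeholder; only the percentages come from the interview):

```python
baseline_monthly_cost = 10_000          # hypothetical $/month for an NVIDIA-based deployment
for savings in (0.60, 0.70):            # "60-70% cheaper" per the interview
    print(f"at {savings:.0%} savings: ${baseline_monthly_cost * (1 - savings):,.0f}/month")
```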


Industry dynamics in the AI compute space

The industry is undergoing a fundamental shift, with cloud players backward integrating into chip manufacturing, recognizing the criticality of inference loads for their businesses, while NVIDIA is forward integrating into data centers and workload management.


At first glance, it may appear to be an even battle. However, what many fail to realize is that NVIDIA's software stack, not just its chip, is its differentiator. For instance, Tesla possesses its own chip, but it's the software layered on top that users can adopt and utilize. While cloud providers have successfully developed chips, they lack compelling software to manage them, hindering industry-wide adoption. It would take years for a new provider to develop an end-to-end stack that rivals NVIDIA's.


Thus, it's evident that NVIDIA holds a sustainable competitive advantage that is unlikely to dissipate, and the industry is keenly aware of it. Not only has NVIDIA created this advantage, it also leverages it to price aggressively and lock in customers. For instance, all chips must be paid for in advance, and NVIDIA can choose to deliver or delay them based on the customer's loyalty. Under such market conditions, it's exceedingly challenging for a new player to enter.


The closest competition to NVIDIA is AMD, yet they lag behind in terms of the software stack despite having a superior GPU to begin with.


What are Groq's competitive advantages

First is architecture. Groq is focused on the AI inference side, specifically on sequential workloads, which are largely language workloads. The architecture therefore spreads a large memory across a band of chips, which is orders of magnitude better than the competition. The core of the technology is a deterministic, synchronous chip: the compiler can calendar-schedule work across 640 chips, whereas GPUs typically coordinate around 8 chips at a time, and Groq can extend this to 2,560 chips with further advancements. In addition, the novel thinking and architecture around networking, system software, and the compiler are strong differentiators.
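A toy sketch of what "calendar scheduling" means in principle: because each chip's timing is deterministic, a compiler can assign every operation a fixed (chip, start-cycle) slot ahead of time instead of arbitrating at runtime. This is only an illustration of the idea, not Groq's actual compiler.

```python
def calendar_schedule(ops, num_chips, cycles_per_op):
    """Assign each op a fixed (chip, start_cycle) slot at compile time."""
    next_free_cycle = [0] * num_chips          # when each chip is next available
    schedule = []
    for i, op in enumerate(ops):
        chip = i % num_chips                   # simple round-robin placement
        start = next_free_cycle[chip]
        schedule.append((op, chip, start))
        next_free_cycle[chip] = start + cycles_per_op
    return schedule

for op, chip, start in calendar_schedule(
        [f"matmul_{i}" for i in range(8)], num_chips=4, cycles_per_op=100):
    print(f"{op}: chip {chip}, cycles {start}-{start + 99}")
```

With a fixed plan like this, every chip knows exactly what it will be doing at every cycle, which is what allows hundreds of chips to be coordinated as if they were one.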


Second is a consequence of the first. Because of the unique architecture, response times on LPUs are much better, and that is a unique advantage for real-time inference loads such as audio and speech. This increases overall demand, creating new opportunities for value creation.


Third is the foresight Groq had in building a very good software stack to sit over the chip. This is where NVIDIA currently differentiates itself, and Groq has already solved it (drawing on Jonathan's experience building the same thing at Google).


Fourth is raw material. Groq uses 12-14nm silicon, which is generations behind the 3-4nm processes NVIDIA's GPUs use. The leading-edge nodes face supply-side constraints in addition to being expensive. (It is unclear to me why NVIDIA could not replicate this in some way - I believe it's about the whole system: the lack of kernels, a single compiler, the larger memory footprint, and so on.)


Fifth is a consequence of the fourth. Since it is so much cheaper to build racks of this hardware compared to NVIDIA chips, customers can likely build far better resilience into their infrastructure than even the best-in-class LLM inference services are able to provide.


Where does it go from here - potential 1000x multiplier on Language Inference!

Regarding Groq, they are currently constrained only by the supply of a specific type of memory, HBM, primarily produced by Samsung and SK Hynix. However, the supply-side constraints are expected to ease over the next year, significantly enhancing Groq's ability to improve by several multiples.


Currently, token rates hover around 120 tokens/sec for leading providers, while Groq's baseline is 240 tokens/sec. Internally, they have made improvements reaching around 330 tokens/sec. Industry expectations project this rising to 350 tokens/sec within the next 6 months to a year. Groq, however, is aiming for multiples of that, potentially around 700 to 1000 tokens/sec, an 8-10x increase over current performance levels.
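To make these rates tangible, here is the time each implies for a typical answer of roughly 500 tokens (the answer length is my assumption; the rates are the ones quoted above):

```python
# What the quoted generation rates mean for a ~500-token answer.
answer_tokens = 500
rates = {
    "leading providers today": 120,
    "Groq baseline": 240,
    "Groq internal": 330,
    "Groq target (low)": 700,
    "Groq target (high)": 1000,
}
for name, rate in rates.items():
    print(f"{name}: {rate} tok/s -> {answer_tokens / rate:.1f} s per answer")
```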


This constitutes the underlying hardware stack. Additionally, we typically observe a 2-5x improvement in model throughput every six months. Consequently, we anticipate a 50x improvement in overall AI application performance, leading to reduced costs and making more applications viable.


Considering the additional capital inflow into this sector, estimated at 5-10x, we could potentially witness a 1000x improvement in overall economics. Imagine the possibilities!
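Taking the rough multipliers from the last few paragraphs at their upper ends, the compounding looks like this (order-of-magnitude figures from the interview, not precise projections; the interview rounds the final figure to ~1000x):

```python
hardware_speedup = 10        # ~8-10x from faster inference hardware
model_improvement = 5        # ~2-5x model/throughput gains every six months
capital_inflow = 10          # ~5-10x more capital into the sector

app_performance = hardware_speedup * model_improvement   # ~50x, as above
overall_economics = app_performance * capital_inflow     # ~500x, approaching the ~1000x figure
print(f"application performance: ~{app_performance}x")
print(f"overall economics: ~{overall_economics}x")
```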


Furthermore, this isn't just about increasing access to existing knowledge (the information age) at the edges; we're fundamentally generating new ideas and information (generative age), the full implications of which we have yet to fully comprehend!
