Carving Up the TPU
Leftovers for Jensen or Just Gravy on the AI Trade?
The first words of our August State of the Themes AI section were direct: Long GOOGL.
We elaborated…
OpenAI is compute-constrained. Without their own datacenters or their own chips, they are at a significant disadvantage to Google, which is not only an ascendant hyperscaler but also a chip designer behind the only mass-produced chip competitive with Nvidia’s core offering across many metrics, and which has optimized its entire research, training, and inference pipelines for it [...]
Google has low customer acquisition costs, best-in-class first-party data, full vertical integration, and TPUs enabling cheaper training [...]
It’s clear that both the technical and market tailwinds favor Google, but the market does not seem to be pricing that in.
While the writing has been on the wall for some time, the market’s perception of Google has dramatically reversed over the past several months – transforming from an AI loser bleeding its search dominance into a dark horse destined to undercut the most consensus AI winners.
This chart from Coatue shows the shift occurring around the time we wrote up Google (or, perhaps, around the time the legal overhang lessened):
This sentiment has only accelerated in the past week with the release of Gemini 3. Not only is Google now firmly positioned at the cutting edge of frontier models, but it’s doing it on its own terms – or rather, its own TPUs. It is indeed possible to train a frontier model without NVDA... that is, if you’re an ML-pioneering hyperscaler that has spent the past 10 years developing and optimizing for its own custom silicon.
The announcements that both Anthropic and Meta plan to adopt TPUs raise further questions about NVDA’s dominance. Surprisingly, META reportedly wants TPUs for training, not just inference. None of this information is truly new (see above), but the one-two punch, combined with growing skepticism of OpenAI, seems to have culminated in a passing of the public torch.
While we are happy that the market has caught up to the trade – Google remains our largest allocation in our Dynamic AI basket – we are also wary of over-extrapolation, recency bias, and consensus crowding. After all, as Oracle has shown, you can go from a 40% gap up to CDS watching in a matter of weeks.
Is Google’s vertical integration both a cost and a structural advantage? Is the CUDA moat shrinking? Are NVIDIA margins at risk?
Perhaps.
But maybe there are two ways to think about this:
1. Switching costs for architecture/ecosystem are high for AI accelerators. Gemini 3 should be bullish for compute demand as labs locked into NVDA rush to buy more next-gen chips to compete.
2. NVDA’s margins are so high that any competitive threat must be aggressively priced in, even if it means higher volumes.
In other words, isn’t a breakthrough model broadly bullish?
The market is clearly leaning toward #2, but we could find ourselves tempted to dip back into NVDA if sentiment overshoots reality. Regardless – if a TPU scale-up is on the horizon, who else stands to benefit?
First, we will run through some technical details, then highlight where we think the broader “TPU trade” goes from here.
TPUs for Idiots
At its core, a neural net is just a machine for doing enormous numbers of weighted sums. As an investor, not an AI engineer, I find it useful to reduce neural nets to:
Inputs x Weights = Outputs
Inputs are arranged as a matrix (a batch of tokens, images, whatever). The weights for a layer are another matrix. Take a simple example: handwritten digit recognition. A 28x28 grayscale image is 784 numbers – the “8-detector” neuron holds 784 weights (28x28), one per pixel. For that neuron, the core operation is multiplying each pixel by its weight, adding everything up, and finally comparing the resulting score against the scores from the other digit detectors.
Once you scale this up to modern AI/ML models, you’re doing trillions of these “matrix multiplication” (matmul, for short) operations – each, in effect, a bundled batch of those weighted sums.
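To make that concrete, here is a toy sketch in plain Python/NumPy (my own illustration with random stand-in values, not anything Google ships): one digit detector is a single weighted sum, and a batch of images scored against all ten detectors collapses into one matmul.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "8-detector": a 28x28 grayscale image flattened to 784 numbers,
# scored against 784 learned weights (random stand-ins here).
image = rng.random(784)                  # one flattened 28x28 image
weights_for_8 = rng.standard_normal(784)

# The core operation: multiply each pixel by its weight and add it all up.
score_for_8 = np.dot(image, weights_for_8)

# Scale up: a batch of 32 images x weights for all 10 digits = one matmul.
batch = rng.random((32, 784))                  # Inputs
all_weights = rng.standard_normal((784, 10))   # Weights (one column per digit)
scores = batch @ all_weights                   # Outputs: a 32 x 10 score matrix
predictions = scores.argmax(axis=1)            # pick the highest-scoring digit
```

Everything a frontier model does at training or inference time is, at this level of abstraction, that last line repeated at absurd scale.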
Enter: CPUs, GPUs, TPUs.
A CPU is like one very smart and capable worker – good at anything, but slow at repetitive jobs. A GPU is like thousands of simpler workers doing the same jobs in parallel. A TPU is like an automated factory line built for one specific task, with the machines arranged so the work involves almost no back-and-forth.
It’s important to recognize that even within the AI/ML space, different functions require different chips. Google itself publishes the following guide for how to choose chips depending on the task at hand.
As we laid out in Interconnects 101, the layman’s understanding of the difference between a CPU and a GPU is that a CPU is built with a relatively small number of cores, each capable of executing hundreds of distinct instructions across dozens of specialized units. Consumer CPUs generally top out around 16 cores and data center chips around 128, but even with significantly higher clock speeds, that is no match for the parallelization that GPUs allow.
What we think of as a ‘core’ in a GPU is known as a streaming multiprocessor (SM) in Nvidia parlance. Rather than spending the core’s resources on routing instructions through specialized units, GPU SMs run dozens or hundreds of copies of the same units simultaneously. For example, each SM in Nvidia’s Blackwell architecture has 4 “tensor cores” and 128 “CUDA cores”; for a B200 with 192 SMs, that totals 768 and 24,576 physical cores, respectively.
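As a quick sanity check on that parallelism gap, the back-of-the-envelope math using only the figures cited above:

```python
# Core counts from the Blackwell figures cited in the text.
sms = 192                   # streaming multiprocessors on a B200
tensor_cores = sms * 4      # 4 tensor cores per SM  -> 768
cuda_cores = sms * 128      # 128 CUDA cores per SM  -> 24,576

cpu_cores = 128             # a high-end data center CPU, for comparison
print(tensor_cores, cuda_cores, cuda_cores // cpu_cores)  # 768 24576 192
```

Even before accounting for what each unit does per clock, the GPU is working on roughly two hundred times as many things at once.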
Though not comparable on a core-to-core basis, TPUs take this philosophy to the next level and dedicate nearly the entire silicon area to crunching matmuls. They can do this because the cores aren’t instructed in the traditional sense; instead, data is drawn through them, more like an assembly line than a laboratory.
The architectural driver behind the TPU is the systolic array. Though reports differ on the size and arrangement of the array across TPU versions, the philosophical departure from GPU SMs is consistent.
The array is made up of MAC cells, short for multiply-accumulate, which Google describes as ALUs (arithmetic logic units) – a classic computing building block. The difference is that MAC cells are ALUs that don’t accept instructions, just data. They don’t have to choose between FMA, ADD, SHIFT, or any of the hundreds of other possible instructions a less specialized processor might see – they just multiply and accumulate. This saves space by eliminating unit-level instruction caches and, because data flows through the array in a predetermined fashion, most data caches as well; it saves power by pushing utilization to the max.
This is why TPUs can offer very high performance per watt and per dollar on machine learning but are essentially useless outside that domain. You get extremely high throughput on the kind of linear algebra that defines neural nets, which means lower power per operation and a smaller die area for a given level of ML performance. However, this comes at the price of much less flexibility than a GPU.
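To make the “data, not instructions” idea concrete, here is a toy Python simulation of an output-stationary systolic array – my own illustrative sketch, not Google’s actual design, which differs in size, dataflow, and numerics. The input matrices are skewed and streamed in from the left and top edges, each cell only multiplies and accumulates whatever arrives, and the full matmul falls out at the end.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy output-stationary systolic array computing C = A @ B.
    Each grid cell is a MAC unit: multiply the value arriving from the left
    by the value arriving from above, add the product to a local accumulator,
    and pass both inputs on. Illustrative only."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"

    acc = np.zeros((n, m))      # one running sum per MAC cell
    a_reg = np.zeros((n, m))    # value each cell forwards to the right
    b_reg = np.zeros((n, m))    # value each cell forwards downward

    for t in range(n + m + k - 2):          # enough cycles to drain the skewed inputs
        a_prev, b_prev = a_reg.copy(), b_reg.copy()
        for i in range(n):
            for j in range(m):
                # Left edge is fed row i of A, skewed by i cycles; interior
                # cells take whatever their left neighbor held last cycle.
                if j == 0:
                    a_in = A[i, t - i] if 0 <= t - i < k else 0.0
                else:
                    a_in = a_prev[i, j - 1]
                # Top edge is fed column j of B, skewed by j cycles.
                if i == 0:
                    b_in = B[t - j, j] if 0 <= t - j < k else 0.0
                else:
                    b_in = b_prev[i - 1, j]
                acc[i, j] += a_in * b_in    # the only "instruction": multiply-accumulate
                a_reg[i, j], b_reg[i, j] = a_in, b_in
    return acc

# Sanity check against an ordinary matmul.
rng = np.random.default_rng(0)
A, B = rng.random((4, 6)), rng.random((6, 5))
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The point is the shape of the computation, not the code: no cell ever decides what to do next, so nearly all of the hardware budget can go to the multiply-accumulate itself.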
Google has made Gemini’s per-prompt footprint a PR point, so Gemini likely has a real advantage in energy per text call – helped by TPUs and Google’s own tuning – but there is no transparent or quantified comparison for training runs. Inference, at least, looks to be much more energy efficient per user interaction on TPUs.
In short, TPUs are hyper-specialized ML chips that sacrifice flexibility for efficiency. But as AI increasingly moves into its “mass-production” phase, even small efficiency gains can prove very meaningful.