
Pipeshift cuts GPU usage for AI inference by 75% with modular inference engine




DeepSeek’s release of R1 this week was a watershed moment in the field of AI. Few expected that a Chinese startup would be the first to release a reasoning model that matches OpenAI’s o1 and, at the same time, open-source it (in line with OpenAI’s original mission).

Companies can easily download R1’s weights from Hugging Face, but access has never been the issue: more than 80% of teams are using or plan to use open models. Deployment is the real culprit. Go with hyperscaler services like Vertex AI and you are locked into a specific cloud; go it alone and build in-house and you face resource constraints, since you have to configure a dozen different components just to get started, let alone optimize or scale downstream.

To address this challenge, Pipeshift, a startup backed by Y Combinator and SenseAI, is launching an end-to-end platform that enables companies to train, deploy, and scale open-source generative AI models (LLMs, vision models, audio models, and image models) on any cloud or on-premises GPUs. The company competes in a fast-growing domain that includes Baseten, Domino Data Lab, Together AI, and Simplismart.

The key value proposition? Pipeshift uses a modular inference engine that can be quickly optimized for speed and efficiency, helping teams not only deploy 30 times faster but also accomplish more with the same infrastructure, leading to cost savings of up to 60%.

Imagine running four GPUs’ worth of inference on just one.

The orchestration bottleneck

When you have to run different models, assembling a functional MLOps stack in-house, from accessing compute, training, and fine-tuning to production-grade deployment and monitoring, becomes the problem. You need to set up 10 different components and inference instances just to get things up and running, and then put in thousands of engineering hours for even the smallest optimizations.

“There are multiple components of an inference engine,” Arko Chattopadhyay, co-founder and CEO of Pipeshift, told VentureBeat. “Each combination of these components creates a different engine with varying performance for the same workload. Identifying the optimal combination to maximize ROI requires weeks of repetitive experimentation and fine-tuning of settings. In most cases, it can take internal teams years to develop pipelines that enable flexibility and modularization of the infrastructure, pushing companies behind in the market while they accumulate massive technical debt.”

While there are startups offering platforms to deploy open models in cloud or on-premises environments, Chattopadhyay says most of them are GPU brokers, offering one-size-fits-all inference solutions. As a result, they maintain separate GPU instances for different LLMs, which doesn’t help when teams want to save costs and optimize for performance.

To solve this, Chattopadhyay started Pipeshift and developed a framework called Modular Architecture for GPU-based Inference Clusters (MAGIC), aimed at breaking the inference stack down into plug-and-play pieces. The work created a LEGO-like system that allows teams to configure the right inference stack for their workloads, without the hassle of infrastructure engineering.

This way, a team can quickly add or swap out different inference components to rebuild a custom inference engine that can pull more from existing infrastructure to meet cost, performance, or even scalability expectations.

For example, a team could set up a unified inference system, where multiple domain-specific LLMs could run with hot swapping on a single GPU, utilizing it to its full advantage.
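As a rough illustration of the hot-swapping idea (this is not Pipeshift’s actual API, and the model names are placeholders), the sketch below keeps several domain-specific models resident in CPU memory and moves only the one needed for the current request onto a single GPU:

```python
# Illustrative sketch only: hot-swapping domain-specific models on one GPU by
# keeping weights in CPU RAM and moving the active model onto the device on demand.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class HotSwapServer:
    def __init__(self, model_ids, device="cuda"):
        self.device = device
        # Every model stays resident in CPU memory; only one lives on the GPU.
        self.models = {
            name: AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
            for name in model_ids
        }
        self.tokenizers = {name: AutoTokenizer.from_pretrained(name) for name in model_ids}
        self.active = None

    def _activate(self, name):
        # Swap the requested model onto the GPU, evicting the previous one.
        if self.active == name:
            return
        if self.active is not None:
            self.models[self.active].cpu()
            torch.cuda.empty_cache()
        self.models[name].to(self.device)
        self.active = name

    @torch.inference_mode()
    def generate(self, name, prompt, max_new_tokens=128):
        self._activate(name)
        tok = self.tokenizers[name]
        inputs = tok(prompt, return_tensors="pt").to(self.device)
        out = self.models[name].generate(**inputs, max_new_tokens=max_new_tokens)
        return tok.decode(out[0], skip_special_tokens=True)

# Usage: route requests for different domains to their model on the same GPU.
# server = HotSwapServer(["org/support-llm", "org/docs-llm"])  # placeholder IDs
# print(server.generate("org/support-llm", "How do I reset my password?"))
```

A real production system would add batching, queuing, and smarter eviction; the point here is only that one device can serve multiple models when the stack can swap them in and out cheaply.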

Run four GPU workloads on one

Since claiming to offer a modular inference solution is one thing and delivering it is entirely another, Pipeshift’s founder was quick to point out the benefits of the company’s offering.

“In terms of opex…MAGIC allows you to run LLMs like Llama 3.1 8B at >500 tokens/sec on a given set of Nvidia GPUs without any quantization or model compression,” he said. “This unlocks a massive reduction in scaling costs, as GPUs can now handle workloads an order of magnitude larger, 20-30 times what they were originally able to achieve using the native platforms offered by cloud providers.”
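For context on the metric behind that claim, tokens per second is usually measured by timing a batched generation run and dividing the number of newly generated tokens by the elapsed wall-clock time. The minimal single-GPU sketch below shows one way to take such a measurement; it is not a reproduction of Pipeshift’s benchmark, and the model ID, batch size, and prompt are placeholders:

```python
# Minimal throughput measurement sketch; not Pipeshift's benchmark setup.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model ID, for illustration
tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.eos_token          # Llama tokenizers ship without a pad token
tok.padding_side = "left"              # left-pad so batched decoding stays aligned
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

prompts = ["Summarize the return policy for a customer."] * 8  # stand-in batch
inputs = tok(prompts, return_tensors="pt", padding=True).to("cuda")

torch.cuda.synchronize()
start = time.perf_counter()
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Approximate count: new tokens per sequence times batch size.
new_tokens = (out.shape[1] - inputs["input_ids"].shape[1]) * out.shape[0]
print(f"throughput: {new_tokens / elapsed:.1f} tokens/sec")
```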

The CEO noted that the company is already working with 30 companies on an annual license-based model.

One of these is a Fortune 500 retailer that initially used four independent GPU instances to run four open models for its automated support and document processing workflows. Each of these GPU instances scaled independently, adding massive cost overheads.

“Large-scale fine-tuning was not possible as data sets became larger, since all pipelines supported only single-GPU workloads while requiring them to load all the data at once. Additionally, there was no auto-scaling support with tools like AWS SageMaker, which made it difficult to ensure optimal use of infra, leading the company to pre-approve quotas and reserve capacity in advance for a theoretical maximum scale that was reached only 5% of the time,” Chattopadhyay noted.

Interestingly, after switching to Pipeshift’s modular architecture, all fine-tuning was reduced to a single GPU instance serving them in parallel, without any memory partitioning or model degradation. This reduced the requirement of running these workloads from four GPUs to a single GPU.
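The article does not detail how the consolidation was implemented. One common way to serve several fine-tuned variants of the same base model from a single GPU is to load one copy of the base weights and attach multiple LoRA adapters, switching the active adapter per request; the sketch below uses Hugging Face’s peft library under that assumption, with placeholder model and adapter names:

```python
# Hedged sketch: serving several fine-tuned variants of one base model on a
# single GPU via LoRA adapters. The article does not state that Pipeshift uses
# LoRA specifically; base model and adapter paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base model
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16).to("cuda")
tok = AutoTokenizer.from_pretrained(base_id)

# Attach several fine-tuned adapters to the same base weights on one GPU.
model = PeftModel.from_pretrained(base, "org/support-adapter", adapter_name="support")
model.load_adapter("org/docs-adapter", adapter_name="docs")

def answer(task: str, prompt: str) -> str:
    model.set_adapter(task)  # switch workloads without reloading the base weights
    inputs = tok(prompt, return_tensors="pt").to("cuda")
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=128)
    return tok.decode(out[0], skip_special_tokens=True)

# print(answer("support", "Where is my order?"))
# print(answer("docs", "Extract the invoice total from this document."))
```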

“Without additional optimizations, we were able to scale the GPU capabilities to a point where it was serving tokens five times faster for inference and could handle higher scaling,” the CEO added. In total, he said the company saw a 30x faster deployment timeline and a 60% reduction in infrastructure costs.

With its modular architecture, Pipeshift wants to position itself as the go-to platform for deploying all cutting-edge open-source AI models, including DeepSeek R1.

However, it won’t be an easy journey as competitors continue to evolve their offerings.

For example, Simplismart, which raised $7 million a few months ago, is taking a similar software-led approach to inference optimization. Cloud service providers such as Google Cloud and Microsoft Azure are also strengthening their respective offerings, although Chattopadhyay believes these CSPs will be more like partners than competitors in the long term.

“We are a platform for tools and orchestration of AI workloads, like Databricks has been for data intelligence,” he explained. “In most scenarios, most cloud service providers will become GTM partners in the growth stage for the type of value their customers will be able to get from Pipeshift on their AWS/GCP/Azure clouds.”

In the coming months, Pipeshift will also introduce tools to help teams build and scale their data sets, along with model evaluation and testing. This will speed up the experimentation and data preparation cycle exponentially, allowing customers to leverage orchestration more efficiently.
