
From local to cloud

Develop with DeepSeek R1 on Apple GPUs, Deploy with Serverless Inference

Mar 27, 2025

In this post, we show how Tower makes it possible to develop apps that use local GPUs for inference and then deploy the same app to a Tower production environment that uses a serverless inference provider.

Everyone is trying to incorporate AI into their applications these days, but the developer experience has been challenging, and GPU costs have been high. It has been common until now to use cloud-based GPUs for both production deployments and development, but this comes with problems:

  • Network roundtrips add latency when developers use cloud-based GPUs for inference, and they must wait. 

  • Resource contention often occurs when multiple developers share dedicated inference resources (developers must wait again).

  • When serverless providers are used for inference, developers get throttled and must also wait.

  • Running inference on ML models on cloud GPUs is expensive.

Not surprisingly, many developers have been asking if they could use their local GPUs during development to reduce contention for shared resources, eliminate throttling, and reduce costs and latency. 


Are models even available for local inference?

Problem: Many models are proprietary and cost a lot of $$s to train

Before a model can be used for inference, it must first be created, i.e., trained. Training modern models takes enormous hardware and financial resources. OpenAI has been among the leaders in LLMs, and it keeps the models behind ChatGPT proprietary. You can't just download them to your development machine; you have to use OpenAI's inference endpoints.

Solution: Vendors are launching solid models in Open Source

Open-source LLMs began appearing as early as 2019, with the Google T5 model among the first. But for a long time, they lagged in accuracy compared to closed-source models.

In the last 12 months, open-source models have started approaching the quality of commercial models. Llama 3 from Meta was released in April 2024, and R1 from DeepSeek was released in January 2025. DeepSeek R1 is the model that caused turmoil in the US stock market on its release, because many assumed that demand for GPUs would decline and that the US-based, AI-driven "Magnificent Seven" companies were in serious trouble from China-based competition. DeepSeek R1 went head-to-head on accuracy and quality with OpenAI's then top-of-class commercial model, o1 (released in December 2024), and scored higher than o1 on several coding, math, and other benchmarks.

Comparison of benchmark results for DeepSeek R1 and OpenAI o1. Source: DeepSeek, HuggingFace


Do models even fit developer-grade hardware? 

Problem: Models are growing in size

To make models more capable, vendors like Meta, DeepSeek, and OpenAI have added more parameters, and model sizes have grown accordingly. For example, DeepSeek R1 has 671 billion parameters. Each parameter usually needs about 1 byte of memory (although there are models and model variants where parameters take 0.2, 0.5, or even 2 bytes), so the full DeepSeek R1 model needs roughly 720 GB of memory uncompressed. GPT-4 is rumored to have about 1.8 trillion parameters. The growing size of ML models makes running them locally challenging.
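As a rough sanity check, the memory footprint can be estimated as parameters times bytes per parameter. The short Python sketch below illustrates that back-of-envelope math using the figures quoted above; the result is a lower bound that ignores runtime overhead such as the KV cache and activations.

```python
def model_memory_gb(parameters: float, bytes_per_param: float) -> float:
    """Rough lower bound: weights only, ignoring KV cache and runtime overhead."""
    return parameters * bytes_per_param / 1e9

# DeepSeek R1: 671B parameters at ~1 byte each -> ~671 GB of weights
# (closer to ~720 GB once overhead is included, as quoted above)
print(model_memory_gb(671e9, 1.0))

# Rumored GPT-4 size at 2 bytes/parameter (FP16) would be ~3.6 TB
print(model_memory_gb(1.8e12, 2.0))
```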

Solution #1: Techniques to manage model memory requirements are emerging

Model vendors popularized two approaches to address the model size issue and fit the models into developer hardware. 

Distillation: First, model vendors released model versions with fewer parameters in a process called "distillation." For example, the DeepSeek R1 model comes in variants with 1.5B, 7B, 14B, and up to 70B instead of 671B parameters. These model variants have lower accuracy than the full model, but they are fine for development purposes, and they fit into development laptops with 32-48 GB of memory. 

Quantization: Second, model-tuning startups emerged that reduce the memory requirement of each parameter. For example, Unsloth AI developed "quantization" techniques that shrink the storage requirement of each DeepSeek R1 parameter from 1 byte to 0.5 bytes (4-bit quantization) or even to roughly 0.22 bytes (1.73-bit quantization). The 4-bit variant of R1 needs "only" about 400 GB of memory, and the 1.73-bit variant needs about 160 GB. Quantized variants also lose some accuracy, but the loss is smaller than for the distilled variants with fewer parameters.
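Applying the same back-of-envelope formula to the quantized bit widths shows where those numbers come from. Published quantized checkpoints come out somewhat larger than the raw formula suggests because selected layers and embeddings are kept at higher precision; the sketch below only illustrates the arithmetic.

```python
PARAMS = 671e9  # DeepSeek R1

variants = {
    "original (~8-bit)": 8,
    "4-bit quantized": 4,
    "1.73-bit quantized": 1.73,
}

for name, bits in variants.items():
    gb = PARAMS * (bits / 8) / 1e9
    print(f"{name}: ~{gb:.0f} GB of weights (lower bound)")

# The published checkpoints are a bit larger (~400 GB for 4-bit, ~160 GB for 1.73-bit)
# because some layers are kept at higher precision to preserve accuracy.
```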

"While ML models grow in sophistication and increase the number of their parameters, techniques emerge that put a lid on the memory requirements of these models, making them available for local inference."

Solution #2: Developer hardware is improving

In parallel, hardware resources available to developers have made remarkable strides in recent years.

Available for < $5K: A mid-range Apple MacBook Pro developer laptop (with an M4 Max chip and 48 GB of memory) is available for about $4K and can comfortably fit the 70B-parameter, 4-bit-quantized variant of the DeepSeek R1 model, which needs about 38 GB of memory. The M4 Max chip achieves about 34 TFLOPS (trillions of floating-point operations per second) in half-precision (16-bit, or FP16) format. That is enough to produce on the order of a hundred tokens per second (an important performance metric for LLMs) when running smaller variants of modern LLMs, which is sufficient for development tasks. Not only do the models now fit on a dev machine, but they also provide snappy response times.
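For token throughput, a common rule of thumb is that autoregressive decoding is memory-bandwidth-bound: every generated token streams roughly the whole set of weights through memory once, so tokens per second is approximately memory bandwidth divided by model size. The sketch below assumes roughly 0.5 TB/s of memory bandwidth (an assumed ballpark for a high-end M4 Max configuration) and is only a rough estimate.

```python
def decode_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float = 500.0) -> float:
    """Rough estimate for memory-bandwidth-bound decoding: bandwidth / bytes read per token."""
    return bandwidth_gb_s / model_size_gb

# Small distilled variant (e.g. ~8B parameters at 4-bit, ~5 GB): ~100 tokens/s
print(decode_tokens_per_second(5.0))

# 70B-parameter variant at 4-bit (~38 GB): ~13 tokens/s
print(decode_tokens_per_second(38.0))
```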

Available for ~ $10K: In March 2025, Apple announced the Mac Studio mini-desktop with an optional M3 Ultra chip (twice the TFLOPS performance of the M4 Max chip) and support for up to 512 GB of memory. For ~ $10K, you can now deploy the full 671B-parameter DeepSeek R1 model (in its "4-bit quantized" variant, it needs ~400GB of memory) and do local inference from your desk. 

Mac Studio (March 2025). Source: Wikipedia

Not to be outdone, NVIDIA announced the DGX Spark mini-desktop and the DGX Station last week (March 18).

The DGX Spark has 128GB of memory and supports serving models with up to 200 billion 4-bit parameters. Two Sparks can be connected and serve a 400 billion 4-bit parameter model. Note the important qualification around the 4-bit parameter size. Models that don't have 4-bit quantized variants will need more bytes per parameter!

DGX Spark mini-desktop (March 2025). Source: Nvidia

The DGX Station supports 784 GB of memory, but the GPUs can access only 288 GB. While the DGX Station can deploy the full 671B-parameter DeepSeek R1 model, it can only do it in its 1.73-bit quantized variant (which needs 160GB).

DGX Station under-the-desk tower (March 2025). Source: Nvidia


Does ML inference even work on the developer's software stack?

Problem: Using popular ML inference frameworks with Apple GPUs is not straightforward

Many developers use Apple Mac mini-desktops or laptops to develop code. Apple offers solid GPU capabilities on those machines. However, to use GPUs, popular ML frameworks such as PyTorch or TensorFlow must use Apple's Metal Performance Shaders technology, or MPS for short. 

The PyTorch framework, for example, has dozens of operations that are not supported on the MPS backend. This is very frustrating for developers because they hit errors they cannot fix themselves. PyTorch has also not yet been fully performance-optimized for MPS.
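As a quick illustration of what working against the MPS backend looks like, the minimal sketch below checks whether MPS is available and selects a device accordingly. Setting the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 tells PyTorch to fall back to the CPU for operations the MPS backend does not implement (at a performance cost).

```python
import torch

# Pick the best available device: Apple GPU (MPS), NVIDIA GPU (CUDA), or CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

# A simple tensor op that runs on the Apple GPU when device == "mps".
x = torch.randn(1024, 1024, device=device)
y = x @ x
print(device, y.shape)

# For operations the MPS backend doesn't implement, exporting
# PYTORCH_ENABLE_MPS_FALLBACK=1 before starting Python makes PyTorch
# fall back to the CPU instead of raising an error (slower, but it runs).
```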

Solution: Local inference servers like llama.cpp, Ollama, and vLLM make Apple GPU issues go away

To solve the problem with Apple hardware's LLM model compatibility and performance, the open-source llama.cpp (short for “LLaMa C++”) project was born and grew to more than 1000 contributors. Startups like Ollama then built on top of llama.cpp, providing a solid foundation for local inference on Apple hardware and elsewhere. vLLM is another example of a widely used local inference engine. 
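For example, once Ollama is installed and a distilled R1 variant has been pulled (e.g. with `ollama pull deepseek-r1:14b`), a few lines of Python are enough to run local inference on the Apple GPU. This is a minimal sketch using the `ollama` Python package, assuming the Ollama server is running locally.

```python
import ollama  # pip install ollama; assumes the Ollama server is running locally

response = ollama.chat(
    model="deepseek-r1:14b",  # a distilled DeepSeek R1 variant that fits in laptop memory
    messages=[{"role": "user", "content": "Explain quantization in two sentences."}],
)
print(response["message"]["content"])
```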

The "distillation" and "quantization" techniques that make smaller versions of models, the recent hardware improvements with Mac Studio and DGX Spark & Station, and the cottage industry of local inference servers like Ollama and vLLM make local inference with GPUs much more feasible.


Can code be easily deployed to production environments?

This brings us to the last remaining significant barrier - the problem that Tower is addressing. 

Problem: Development and Production environments are not the same

While "doing local inference" sounds easy at a high level, significant differences exist between the GPUs on development machines and in production environments. 

As we've seen above, developers love MacBooks; however, in production environments, Nvidia GPUs are often used. To run on Nvidia GPUs, the ML inference software must support Nvidia's CUDA framework. Because Apple's and Nvidia's GPUs are different, Apple's MPS and CUDA frameworks also differ.

When developers need to run a piece of software in two different environments, they either:

1. Find some middleware or container solution that abstracts their code from the differences or 

2. Make the two environments completely the same.

(Unworkable) Solution #1: Use Docker to abstract differences in the environments

Developers would have loved to abstract away the differences between the dev and prod environments using container technology like Docker. However, Docker does not have drivers for Apple Silicon GPUs, and there are no plans to add them. Therefore, it's impossible to access GPUs when running apps in Docker containers on macOS.

Here are three true statements:

  1. You can run apps that don't use GPUs in Docker containers on macOS.

  2. You can run apps that use GPUs in Docker containers on Linux.

  3. You can't have all three at once: Docker, macOS, and GPUs.

(Unworkable) Solution #2: Run Macs in production

If you can't containerize your apps and run them in Docker anywhere you want, how about using the same stack in production as you use in development?

Cloud providers like AWS do offer Apple Mac EC2 instances, but they are not cheap, and they are limited to 32 GB of memory, so they won't be able to serve any of the modern LLMs. To compound the issue, one needs hundreds, if not thousands, of times more inference throughput in production than in development. Renting a thousand Apple Macs in AWS would cost much more than paying for the equivalent Nvidia GPU capacity. 


The Tower Solution

How do we provide developers with a better local development experience while guaranteeing that the code they develop can be deployed and run in production? 

Tower developed a solution for that, and today, we are sharing an example that illustrates our approach (see example on GitHub).

Tower allows you to have two or more environments: one for local development, another for integration testing, a third for production, and so on. Tower keeps these environments as similar as possible while allowing the differences that make apps work in each particular environment. Your apps are compatible with all environments and can move seamlessly from development to production.

In our example, we will demonstrate Tower's local development capability. During development, we will use a local Ollama inference server to run a reduced-parameter variant of the DeepSeek R1 model. This inference server will use our dev machine's GPUs, yielding reasonable inference speed.

In production, however, the same app will use a serverless inference provider (Together.AI). We chose a serverless inference provider instead of an accelerator-enabled node because it is more cost-efficient for smaller workloads: inference providers charge by the token rather than for the node's uptime. We also used Hugging Face's new inference providers feature on the Hub, which lets us make inference calls through a single interface (the HF Hub). This gives us flexibility to switch inference providers based on latency, availability, and cost.
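To make the idea concrete, here is a minimal sketch of how an app might route the same chat request either to a local Ollama server or to a serverless provider via the Hugging Face Hub client. The environment-variable switch and its name are hypothetical illustrations, not Tower's actual mechanism; in our example, the real wiring is handled through Tower environments and secrets (see the GitHub example).

```python
import os

import ollama  # local development path
from huggingface_hub import InferenceClient  # serverless production path

MESSAGES = [{"role": "user", "content": "Summarize this GitHub issue thread: ..."}]

# Hypothetical switch for illustration; Tower manages this via environments/secrets.
if os.getenv("APP_ENVIRONMENT", "local") == "local":
    # Development: distilled DeepSeek R1 on the laptop's GPU via Ollama.
    response = ollama.chat(model="deepseek-r1:14b", messages=MESSAGES)
    answer = response["message"]["content"]
else:
    # Production: full DeepSeek R1 via a serverless provider (e.g. Together),
    # called through the Hugging Face Hub's single interface.
    client = InferenceClient(provider="together", api_key=os.environ["HF_TOKEN"])
    completion = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1", messages=MESSAGES, max_tokens=512
    )
    answer = completion.choices[0].message.content

print(answer)
```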

What does the example app do? It's a use case many developers are familiar with. Developers submit issues (feature requests and bugs) against projects hosted on GitHub, and these often turn into lively conversations with the project maintainer and other users. An issue and its comments look like a chat - something that LLMs like DeepSeek R1 should understand well. Our app uses the DeepSeek R1 model to analyze the discussion on a GitHub issue thread and propose the best course of action to the project maintainer and the issue's original author.
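The sketch below shows one way such an app could assemble its prompt: it pulls an issue and its comments from the public GitHub REST API and formats them as a chat-like transcript. This is a simplified illustration rather than the actual example code; the repository, issue number, and token handling are placeholders, and the resulting prompt would then be sent to whichever inference backend the current environment selects (as in the previous sketch).

```python
import os

import requests


def fetch_issue_thread(owner: str, repo: str, number: int) -> str:
    """Fetch a GitHub issue plus its comments and format them as a chat-like transcript."""
    base = f"https://api.github.com/repos/{owner}/{repo}/issues/{number}"
    headers = {"Accept": "application/vnd.github+json"}
    if os.getenv("GITHUB_TOKEN"):  # optional token to avoid rate limits
        headers["Authorization"] = f"Bearer {os.environ['GITHUB_TOKEN']}"

    issue = requests.get(base, headers=headers, timeout=30).json()
    comments = requests.get(f"{base}/comments", headers=headers, timeout=30).json()

    turns = [f"{issue['user']['login']} (author): {issue['title']}\n{issue['body'] or ''}"]
    turns += [f"{c['user']['login']}: {c['body']}" for c in comments]
    return "\n\n".join(turns)


# Placeholder repository and issue number for illustration.
thread = fetch_issue_thread("octocat", "Hello-World", 1)
prompt = (
    "You are assisting a project maintainer. Read the GitHub issue thread below and "
    "propose the best course of action for the maintainer and the original author.\n\n"
    + thread
)
# `prompt` is then sent to the model via the local or serverless backend shown above.
```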

Let's see this code in action!

See the Tower demo on YouTube or click play below.


Ready to try out local inference?

If you want to try out local inference with Tower, sign up for our beta!

