At Computex 2025, NVIDIA presented its modular software stack for so-called “AI PCs” – a concept that primarily aims to turn commercially available PC systems with RTX graphics cards into locally usable AI workstations. The presented stack is made up of several components that interlock at different levels of the workflow: CUDA as a programming interface and base layer for parallel computing on GPUs, TensorRT as an optimized inference backend, OptiX for raytracing-based image computation, Maxine for audio and video AI features, Riva for speech and text processing, and the Broadcast SDK for streaming and communication applications. This stack is complemented by a large number of specialized software development kits (SDKs) aimed at developers, creative professionals and AI researchers.
The declared aim is to meet the increasing demand for generative AI computing power directly on the end device – without constant dependence on cloud backends or internet connections. Local inference – i.e. running already trained neural networks – should thus also be possible on consumer systems, provided an RTX graphics card is available. NVIDIA is thereby positioning its AI PC platform as a logical continuation of the trend towards “edge inference”: computing tasks are no longer necessarily carried out centrally in data centers, but directly on the user’s device, which not only reduces latency but also addresses data protection concerns.
The stack has been specifically optimized to run even complex models such as large language models (LLMs), image generators or multimodal AI applications efficiently on GPUs. TensorRT takes on the role of an inference accelerator that precompiles trained models for the respective hardware and adapts them to the specific GPU at runtime. According to NVIDIA, this means that even demanding models such as Llama 3 or Mistral, as well as RAG pipelines, can be used without dedicated server hardware – for example in development environments, interactive tools or offline applications. Unlike traditional cloud concepts, where the computing load is outsourced to remote servers, AI PCs focus on local control of data and computing processes.
The presentation at Computex also indicates that NVIDIA is increasingly modularizing its existing software infrastructure to cater to both end users and professional developers. This means that individual components of the AI stack can be flexibly integrated into existing workflows – for example via container solutions (such as NIMs), Python bindings or dedicated plug-in systems. At the same time, support for developer tools such as Visual Studio Code, ComfyUI or generative scripting environments is being expanded.
In the long term, NVIDIA’s strategy is to position the GPU not only as an accelerator for graphics rendering or deep learning, but also as a universal AI processor for hybrid desktop and workstation environments. In this context, Computex 2025 serves less as a platform for new hardware innovations and more as a manifesto for the shift in computing logic – from the cloud back to end devices. AI PCs are therefore not just a marketing term, but could establish themselves as a new category between gaming PCs and workstations.
TensorRT for RTX: optimizing inference performance
With “TensorRT for RTX”, NVIDIA is presenting a further developed inference backend at Computex 2025 that is specifically tailored to consumer and developer devices with RTX graphics cards. At its core, it is a variant of the TensorRT framework known from the data center sector, but now with a focus on local, GPU-accelerated execution of AI models in the end user area. The special feature lies in the so-called just-in-time optimization: models are not only compiled once, but dynamically adapted to the respective RTX GPU, including architecture variants, memory expansion and available computing units.
According to the company, “TensorRT for RTX” enables up to twice the inference performance compared to Microsoft’s DirectML – an API framework used in Windows environments for AI acceleration. This increase is based not only on more efficient use of the CUDA cores and Tensor Cores, but also on graph fusion, quantization and a leaner runtime. At the same time, aggressive optimization also reduces the resulting model size – in some cases by up to 70 percent. This is particularly advantageous for large language models (LLMs) or multimodal models, whose memory requirements have often been an obstacle to local execution up to now.
Another advantage results from the broad compatibility: according to NVIDIA, TensorRT for RTX is available for all RTX graphics cards – regardless of generation and performance class. This includes older Turing and Ampere models as well as the latest Ada and Blackwell chips. This makes it possible for the first time to run complex AI models locally on mid-range GPUs – for example for text generation, speech recognition, 3D content synthesis or image analysis.
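How broad this compatibility is on a given machine can be checked directly on the target GPU. The following minimal sketch queries the installed card’s CUDA compute capability and maps it to the architecture generations mentioned above; it uses PyTorch purely as a convenient query mechanism, and the mapping table is an illustrative assumption rather than an official NVIDIA reference.

```python
# Illustrative sketch: map the installed GPU's CUDA compute capability to the
# architecture generations mentioned in the text (Turing, Ampere, Ada, Blackwell).
# PyTorch is used only as a convenient way to query the device; the mapping of
# capability versions to consumer generations is an assumption for illustration.
import torch

GENERATIONS = {
    (7, 5): "Turing (RTX 20 series)",
    (8, 6): "Ampere (RTX 30 series)",
    (8, 9): "Ada Lovelace (RTX 40 series)",
    (12, 0): "Blackwell (RTX 50 series)",
}

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    generation = GENERATIONS.get((major, minor), f"unknown (compute {major}.{minor})")
    print(f"{name}: {generation}")
else:
    print("No CUDA-capable GPU detected.")
```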
The developer workflow is specifically designed for easy integration: Models that are available in ONNX (Open Neural Network Exchange) format, for example, can be converted into TensorRT-compatible binary packages using prepared toolchains. NVIDIA provides both an SDK and preconfigured reference environments, including compatibility with popular development platforms such as PyTorch, TensorFlow and JAX. In addition, pre-optimized variants of popular models such as Llama 3.1, Deepseek, Mistral or Riva will be made directly accessible via the in-house hub (build.nvidia.com).
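In practice, such a conversion could look roughly like the following sketch, which uses the standard TensorRT Python API to parse an ONNX file and serialize an engine. Whether “TensorRT for RTX” exposes exactly this interface was not detailed in the presentation, so the package and calls shown here are an assumption based on the established TensorRT toolchain, and the model file name is a placeholder.

```python
# Minimal sketch: build a serialized TensorRT engine from an ONNX model using the
# standard tensorrt Python package. The exact API surface of "TensorRT for RTX"
# was not shown at Computex, so this follows the existing toolchain as an assumption.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# The ONNX parser requires an explicit-batch network (the default on newer releases).
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder for an exported ONNX model
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit("ONNX parsing failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # reduced precision where the GPU supports it

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```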
NVIDIA is thus taking a significant step towards democratizing AI inference on the end device. With the right software support, TensorRT for RTX could become a cornerstone of modern local workflows – for example for developers of smaller LLMs, experimental GUI applications or content creators who want to use generative models without having to deal with cloud latency or data security issues – with compatibility spanning all RTX graphics cards, from entry-level to high-end.
Software integration: CUDA everywhere
In May 2025, as part of NVIDIA’s AI PC initiative, five software packages were introduced as new integration partners of the expanded CUDA and RTX software stack. These applications come from different areas – from video editing to rendering to real-time communication – and exemplify the increasing shift of complex AI functionalities from the data center to locally operated RTX hardware.
LM Studio is a development environment for the local execution of large language models (LLMs) with user-defined prompts, modifications and visualizations. According to NVIDIA, the latest integration of CUDA 12.8 in conjunction with TensorRT has enabled up to 30% higher inference performance compared to previous versions. It is particularly worth mentioning that LM Studio also supports quantized versions of Llama 3 models and Mistral 7B – models that previously required a dedicated cloud environment in many cases. GPU acceleration should now make it possible to run such models with high performance on consumer hardware with RTX 30- or 40-series cards.
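Addressing such a locally hosted model is deliberately simple: LM Studio exposes an OpenAI-compatible local server, so a plain HTTP request is enough. In the sketch below, the port, endpoint and model name are defaults or placeholders and may differ from one installation to the next.

```python
# Sketch: query a locally running LLM through LM Studio's OpenAI-compatible
# local server. Port 1234 is the tool's usual default, and the model name is a
# placeholder for whatever model has been loaded locally.
import json
import urllib.request

payload = {
    "model": "llama-3-8b-instruct",  # placeholder for the locally loaded model
    "messages": [
        {"role": "user", "content": "Explain in one sentence what an AI PC is."}
    ],
    "temperature": 0.7,
}

request = urllib.request.Request(
    "http://localhost:1234/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    answer = json.load(response)

print(answer["choices"][0]["message"]["content"])
```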
Topaz Video AI, a widely used application for AI-supported video processing (upscaling, noise reduction, motion smoothing), was also CUDA-accelerated in May. The manufacturer Topaz Labs particularly emphasizes the advantages of the new tensor operations that are accessible via CUDA. The support of RTX GPUs should not only lead to shorter rendering times, but also ensure a noticeably higher output quality when using generative functions – such as frame rate interpolation or object tracking.
Bilibili, one of the largest Chinese video platforms with over 300 million monthly users, has also been equipped with NVIDIA’s Broadcast SDK. The integration allows streamers and content creators to use AI effects such as background blur, noise reduction or automatic camera tracking – in real time and directly on the end device. The feature is primarily aimed at semi-professional user groups who need an easily accessible improvement to their video output.
Two other platforms were also mentioned in the area of professional visualization: Autodesk VRED, a software for high-end visualizations in the automotive industry, and Chaos Enscape, a widely used real-time rendering tool in architecture and construction planning. Both applications now support DLSS 4, which is expected to deliver significant performance gains, particularly for interactive scenes with path tracing or complex material structures. The use of DLSS 4 is not limited to games, but is increasingly being extended to industrial and creative application areas.
These developments document a strategically motivated expansion of RTX-specific functions to professional software solutions. At the same time, however, it remains to be seen to what extent these applications can actually manage without cloud support and whether the entire inference pipeline is handled locally on the GPU. In many cases – such as licenses for large models or external data sources – hybrid solutions are likely to remain. Nevertheless, the increasing implementation of CUDA, TensorRT and DLSS 4 in third-party applications marks a clear step towards locally sovereign AI use on consumer hardware.
NIMs: microservices for AI models
Another central element of NVIDIA’s AI PC initiative is the so-called “NIMs” – short for NVIDIA Inference Microservices. These are modular, prefabricated container solutions designed specifically for the rapid, locally executable integration of AI functionality into proprietary software environments. Each of these microservices contains a quantized AI model, a complete inference pipeline with all dependencies and libraries, and standardized interfaces (APIs) through which applications and scripts can access the model.
Deployment is containerized – usually via OCI-compliant formats such as Docker or Podman – and can therefore take place on both Windows and Linux systems. NVIDIA explicitly advertises this architecture as open across platforms, provided an RTX graphics card is available as the computing unit. RTX models from Turing (RTX 20 series) upwards are supported, although inference speed and memory requirements depend heavily on the model included in each case.
In terms of content, the current NIM portfolio covers a wide range of application areas:
Available models include:
- Llama 3.1 (8B) and Mistral 7B – two widely used language models (LLMs) that can be used for text generation, prompt parsing and semantic retrieval.
- YOLOX – a real-time object detection model often used in automated image analysis and robotics projects.
- PaddleOCR – an OCR model optimized for multilingual text recognition in images and scans.
- NV CLIP – a multimodal model for image-text linking that can be used for image description, captioning or visual retrieval.
- Riva Parakeet and Maxine Voice Studio – modules for automatic speech recognition (ASR) and text-to-speech (TTS), including voice adaptation and timing synthesis.
The NIMs can be obtained either directly via the build.nvidia.com platform or via partner integrations such as HuggingFace, GitHub or Docker Hub. According to NVIDIA, the models are fully optimized for TensorRT and CUDA, so that they should also perform well on single-GPU systems with limited resources – provided that the VRAM requirements are met. Integration into existing software projects is planned via RESTful APIs and Python or C bindings.
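Assuming an LLM NIM is already running in a local container, such an integration can be sketched with the standard openai client pointed at the local endpoint – NVIDIA advertises the LLM microservices as OpenAI-compatible. Port 8000 and the model identifier used below are assumptions that depend on how the container was started.

```python
# Sketch: talk to a locally deployed NIM over its REST interface. The openai
# client is simply pointed at the local container; the base URL, port and model
# identifier are assumptions depending on the actual deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # assumed identifier of the deployed NIM
    messages=[
        {"role": "user", "content": "Name three advantages of local inference."}
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```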
In contrast to classic SDKs or frameworks, NVIDIA is pursuing a low-configuration approach with the NIMs: developers should receive an executable setup in just a few steps without having to deal with build processes, dependencies or hardware configurations in detail. At the same time, the system offers a certain degree of modularity: several NIMs can be operated in parallel, integrated into distributed architectures or connected to existing automation environments.
This positions NVIDIA Inference Microservices as a bridging technology between locally operated AI infrastructure and cloud-based inference services. Especially for developers, start-ups or research institutions that want to control data protection, latency or operating costs themselves, NIMs offer a technical alternative to full cloud integration – without having to sacrifice the performance of modern AI models.
G-Assist: the personal GPU butler?
The “G-Assist” project, which NVIDIA presented at Computex 2025 as a modular assistance system, is aimed more at end users and consumer-oriented scenarios. In contrast to the more developer- or production-focused components of the AI PC stack, G-Assist aims to create an everyday, interactive interface between the user and PC applications – comparable to a configurable co-pilot that can be customized using various plug-ins.
The platform is based on an open plug-in architecture that allows functional modules to be loaded for specific tasks such as media control, information retrieval or system automation. The following areas of application were shown as examples in the presentation: control of music and streaming services, web-based searches via context menus, status displays for live streams, automated in-game commands, system control of peripheral devices (such as lighting, ventilation or macro buttons) and simple IoT interactions in the local network. G-Assist is thus clearly positioned on the borderline between voice assistance, overlay systems and intelligent UI extension frameworks.
Technically, the system relies on a local runtime environment that is embedded in the existing NVIDIA app (formerly GeForce Experience) via a visual interface. Users can activate and configure plug-ins directly there and exchange them via a rating system. The development of custom modules is made possible via an interface that can be connected to common tools such as ChatGPT. NVIDIA provides API documentation, GitHub templates and a web IDE for this purpose. A central distribution and discussion of extensions is planned via Discord, giving the community a central role in the ecosystem.
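Since the plug-in interface itself was only outlined in broad strokes, the following is a purely hypothetical sketch of how such a module might process commands: a handler receives a JSON command, performs a local action and returns a JSON reply. The message format, command names and dispatch mechanism are invented for illustration and do not reflect NVIDIA’s documented API.

```python
# Purely hypothetical sketch of a G-Assist style plug-in handler: it receives a
# JSON command, performs a local action and returns a JSON reply. All command
# names and the message format are invented for illustration.
import json

def handle_command(raw_message: str) -> str:
    """Dispatch one assistant command and return a JSON-encoded result."""
    message = json.loads(raw_message)
    command = message.get("command")

    if command == "get_gpu_status":
        # A real plug-in would query NVML or the NVIDIA app here.
        result = {"success": True, "message": "GPU idle, 42 °C"}
    elif command == "set_keyboard_lighting":
        color = message.get("params", {}).get("color", "white")
        result = {"success": True, "message": f"Lighting set to {color}"}
    else:
        result = {"success": False, "message": f"Unknown command: {command}"}

    return json.dumps(result)

# Example exchange as it might occur between the assistant runtime and the plug-in:
print(handle_command('{"command": "set_keyboard_lighting", "params": {"color": "red"}}'))
```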
It is worth noting that although G-Assist provides for the connection to large language models such as GPT – in particular for the semantic interpretation of user queries or for generating context-related responses – this functionality is not part of the system itself. Instead, G-Assist sees itself as a framework for integrating existing AI services, not as a proprietary assistance model. The user decides whether requests are forwarded to local models, the OpenAI API, HuggingFace endpoints or their own tools.
This deliberate openness underlines the rather experimental nature of the project: G-Assist is not intended to be a replacement for Alexa, Siri or the Google Assistant, but rather a modular supplement tailored to everyday desktop use for tech-savvy users who want to put together their own digital assistance environment. In the gaming environment in particular, but also in creative or production-related workflows, new fields of application could open up in the future – for example through situation-dependent automation, context-based macros or voice-supported processes with real-time feedback. However, this requires a certain level of technical understanding on the part of users, especially when creating or expanding their own plug-ins.
Conclusion: continuity instead of surprise
NVIDIA is not using Computex 2025 for disruptive announcements, but is consistently continuing the strategic course of recent years. The RTX 5060 extends the Blackwell lineup downwards, while DLSS 4 establishes itself as the link between performance and image quality. The expansion of generative AI functionality on RTX GPUs seems ambitious, but so far lacks tangible metrics on practical relevance outside of specialized applications. The attempt to position the GPU as a universal data center in the home PC could nevertheless prove successful in the long term – provided that users beyond the gaming sector are actually prepared to adapt their workflows accordingly.