VideoRAG with NVIDIA NIMs: Revolutionizing On-Prem AI Infrastructure
At Moonlite, we’re redefining the way organizations manage and deploy AI. Our mission is to provide complete control over your AI infrastructure: on-premises, secure, and cost-effective. In today’s post, we’ll take an in-depth look at one of our flagship technical use cases: a video retrieval and analysis system that leverages NVIDIA Inference Microservices (NIMs) to deliver fast, high-quality AI insights entirely on-prem.
This system is built using state-of-the-art Run Anywhere NVIDIA NIMs for both embedding generation and multimodal reasoning. With our platform, you can build applications faster, deploy models instantly, and maintain total control over your data. Everything from AI inference to access management is handled on-prem, with a cloud-native experience.
A Comprehensive End-to-End AI Pipeline
Our video retrieval and analysis system is designed to capture, process, and analyze security camera feeds in real time. It integrates multiple components, from low-level video processing to high-level natural language querying. Here are the key steps in our pipeline:
1. Real-Time Video Processing and Motion Detection
The first step in our system is connecting to a live security camera feed. Instead of storing and processing every single frame, which would be both costly and inefficient, we use OpenCV to do some light image processing and motion detection, filtering out imagery that we don’t expect to contain relevant information. (More embedding-clustering-centric methods could also be integrated into this flow.) This module monitors the camera feed and identifies periods of activity by detecting significant changes between consecutive frames.
This early-stage filtering ensures that our system only processes frames where something noteworthy is happening. By eliminating redundant data and focusing on frames with motion, we keep the data set smaller and more relevant, reducing load across the rest of the system. This type of approach is fairly standard in remote sensing and would be critical for a larger-scale system.
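To make this concrete, here is a minimal sketch of frame-differencing motion detection with OpenCV. The feed URL, thresholds, and output directory are illustrative placeholders, not our production values:

```python
import time
from pathlib import Path

import cv2

FEED_URL = "rtsp://camera.local/stream"  # hypothetical camera endpoint
OUT_DIR = Path("/mnt/frames/incoming")   # on-prem storage mount (placeholder)
MIN_CONTOUR_AREA = 500   # ignore tiny changes (sensor noise, leaves, etc.)
COOLDOWN_SECONDS = 2.0   # avoid saving near-duplicate frames

def watch_feed() -> None:
    cap = cv2.VideoCapture(FEED_URL)
    prev_gray = None
    last_saved = 0.0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Blur + grayscale so small pixel noise doesn't register as motion.
        gray = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (21, 21), 0)
        if prev_gray is not None:
            # Difference against the previous frame and look for large changed regions.
            delta = cv2.absdiff(prev_gray, gray)
            thresh = cv2.threshold(delta, 25, 255, cv2.THRESH_BINARY)[1]
            thresh = cv2.dilate(thresh, None, iterations=2)
            contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
            motion = any(cv2.contourArea(c) >= MIN_CONTOUR_AREA for c in contours)
            now = time.time()
            if motion and now - last_saved >= COOLDOWN_SECONDS:
                OUT_DIR.mkdir(parents=True, exist_ok=True)
                cv2.imwrite(str(OUT_DIR / f"frame_{int(now)}.jpg"), frame)
                last_saved = now
        prev_gray = gray
    cap.release()
```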
2. Image Capture and Embedding
Once motion is detected, the system captures the image and stores the frames in an on-prem storage appliance. We then deploy a service that wakes up periodically, checks for new files in the folder, embeds those images, and generates all of the metadata needed for downstream querying. This decoupling ensures that the real-time capture process is never blocked by the more resource-intensive embedding computations.
Under the hood, we use Chroma DB as our vector database to store multimodal embeddings. The images are embedded using NVClip, an NVIDIA NIM specifically designed for high-quality, rapid embedding generation. NVClip transforms each image into a vector representation that encapsulates its visual features, stored alongside crucial metadata such as the timestamp, the location of the referenced image, and the source camera feed.
One of the key strengths of our approach is the asynchronous architecture. By running image capture and embedding on separate threads or services, we maintain high throughput and ensure that no single process becomes a bottleneck. This design not only maximizes resource utilization but also improves the responsiveness of the entire system.
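Below is a sketch of what this embedding service can look like. It assumes the NVClip NIM is reachable locally behind an OpenAI-compatible /v1/embeddings endpoint that accepts base64 data URLs; the endpoint URL, model name, and request shape are assumptions to adapt to your own deployment:

```python
import base64
import time
from datetime import datetime
from pathlib import Path

import chromadb
import requests

NVCLIP_URL = "http://localhost:8000/v1/embeddings"  # assumed local NIM endpoint
NVCLIP_MODEL = "nvidia/nvclip"                      # assumed model identifier
FRAME_DIR = Path("/mnt/frames/incoming")

client = chromadb.PersistentClient(path="/mnt/chroma")
collection = client.get_or_create_collection("camera_frames")

def embed_image(path: Path) -> list[float]:
    """Send a base64-encoded frame to the NVClip NIM and return its vector."""
    b64 = base64.b64encode(path.read_bytes()).decode()
    resp = requests.post(
        NVCLIP_URL,
        json={"model": NVCLIP_MODEL, "input": [f"data:image/jpeg;base64,{b64}"]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]

def run_once() -> None:
    seen = set(collection.get()["ids"])  # skip frames we already embedded
    for path in sorted(FRAME_DIR.glob("*.jpg")):
        if path.name in seen:
            continue
        collection.add(
            ids=[path.name],
            embeddings=[embed_image(path)],
            metadatas=[{
                "timestamp": datetime.fromtimestamp(path.stat().st_mtime).isoformat(),
                "path": str(path),
                "camera": "front-door",  # illustrative feed label
            }],
        )

while True:
    run_once()
    time.sleep(10)  # wake up periodically and check for new files
```

Because this loop runs as its own service, a burst of motion events never stalls capture: frames simply queue on disk until the embedder catches up.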
3. Querying the Embedded Data with a Streamlit App
After the images have been embedded, our system offers a user-friendly Streamlit app that allows users to query the stored data using natural language. Users can ask questions like:
- “When did packages get delivered?”
- “When did the mailman arrive?”
- “How many dogs passed by the camera?”
The app processes the query and performs a similarity search against the stored embeddings to retrieve candidate images from the relevant windows of time. The underlying search leverages our NVIDIA NIM-powered NVClip model, which returns candidate images that we can then filter by distance score, timeframe, and any other downstream search criteria.
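Here is a minimal sketch of that query path in Streamlit: embed the user’s question with the same NVClip NIM (CLIP-style models place text and images in a shared embedding space) and run a similarity search against Chroma. The endpoint, model name, and `embed_text` helper are assumptions mirroring the embedding call above, not our exact production code:

```python
import chromadb
import requests
import streamlit as st

NVCLIP_URL = "http://localhost:8000/v1/embeddings"  # assumed local NIM endpoint
NVCLIP_MODEL = "nvidia/nvclip"                      # assumed model identifier

client = chromadb.PersistentClient(path="/mnt/chroma")
collection = client.get_or_create_collection("camera_frames")

def embed_text(text: str) -> list[float]:
    # Text lands in the same vector space as the image embeddings.
    resp = requests.post(
        NVCLIP_URL, json={"model": NVCLIP_MODEL, "input": [text]}, timeout=30
    )
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]

st.title("Camera Feed Search")
query = st.text_input("Ask a question", placeholder="When did packages get delivered?")
if query:
    results = collection.query(query_embeddings=[embed_text(query)], n_results=10)
    # Each hit carries the metadata written at embedding time.
    for meta, dist in zip(results["metadatas"][0], results["distances"][0]):
        st.image(meta["path"], caption=f"{meta['timestamp']} (distance {dist:.3f})")
```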
4. Validating and Summarizing Results with a Multimodal LLM
Once candidate images are retrieved, we deploy a two-step process to further refine the results (a code sketch of both steps follows this list):
- Candidate Validation:
Each candidate image undergoes a validation step using a vision LLM, the Llama Vision Instruct 11B model, also deployed as an NVIDIA NIM. This model is tasked with determining whether an image actually contains visual evidence to answer the user’s question. For example, if the query is “When did packages get delivered?”, the model checks for cues like delivery vehicles, packages, or relevant timestamp overlays. Only images that pass this validation step are considered for further processing.
- Summarization and Final Aggregation:
Once validated images are grouped by day, the vision LLM generates summaries for each of the images. This summary captures key observations such as specific times when relevant activity was detected. The model is instructed to include precise timestamps formatted consistently in 24-hour time (YYYY-MM-DD HH:MM:SS Central Time) to avoid any ambiguity. Finally, these summaries are aggregated into a final, cohesive answer that directly responds to the user’s question. This final answer not only compiles the evidence but also provides context—indicating which days had definitive events and highlighting any ambiguous findings.
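To illustrate the two-step refinement, here is a sketch built on the OpenAI-compatible chat API that NIMs expose. The base URL, model identifier, and prompt wording are assumptions to adapt to your own deployment:

```python
import base64

from openai import OpenAI

# Assumed local NIM endpoint and model id; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="not-needed")
MODEL = "meta/llama-3.2-11b-vision-instruct"

def ask_about_image(path: str, prompt: str) -> str:
    """Send one frame plus a prompt to the vision LLM and return its reply."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=256,
    )
    return resp.choices[0].message.content

def validate(path: str, question: str) -> bool:
    # Step 1: does the frame actually contain evidence for the question?
    answer = ask_about_image(
        path,
        f"Does this image contain visual evidence to answer: '{question}'? "
        "Reply YES or NO.",
    )
    return answer.strip().upper().startswith("YES")

def summarize(path: str, question: str, timestamp: str) -> str:
    # Step 2: describe the validated frame with an unambiguous timestamp.
    return ask_about_image(
        path,
        f"Describe the activity relevant to '{question}'. Include the timestamp "
        f"{timestamp} formatted as YYYY-MM-DD HH:MM:SS Central Time.",
    )
```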
How NVIDIA NIMs Drive Our Platform
One of the core differentiators of our solution is the use of NVIDIA Inference Microservices (NIMs) at every stage of the pipeline:
- NVClip for Embeddings:
Our image embeddings are generated by NVClip, a state-of-the-art NVIDIA NIM. This model efficiently converts images into high-dimensional vectors that capture essential visual features. The speed and accuracy of NVClip are critical for ensuring that our similarity search retrieves the most relevant images quickly.
- Llama Vision Instruct 11B for Multimodal Reasoning:
Both the candidate validation and final summarization are powered by our vision LLM, the Llama Vision Instruct model. By deploying this model as an NVIDIA NIM, we achieve best-in-class inference speed. Additionally, NVIDIA’s provided hardware profiles streamline the infrastructure selection and tuning process: we know up front what performance to expect from a given configuration. The model’s ability to understand and integrate visual and textual information is what enables our system to deliver clear, concise, and accurate responses.
- On-Prem Deployment:
All these models are hosted and deployed within our own secure, on-premises infrastructure; in other words, they are all Run Anywhere NIMs. This means that all data processing occurs internally, ensuring maximum security, compliance, and data privacy. By eliminating dependency on external cloud services, we provide our clients with a truly private AI solution that never compromises on performance.
Building a Future-Proof AI Infrastructure
At Moonlite, we believe that the future of AI lies in platforms that provide developers and enterprises with a cloud-native experience and complete control over their on-prem infrastructure. Our platform not only simplifies the deployment of cutting-edge AI models but also provides robust management tools including role-based access control, end-to-end observability, and rapid iteration capabilities.
Imagine being able to cache and deploy models instantly with just a few clicks, or launching an application without having to worry about the underlying load balancing and security challenges. Our model router and deployment framework take care of these complexities, so you can focus on what matters most—developing innovative AI applications that drive your business forward.
Conclusion
Our video retrieval and analysis system is a shining example of how Moonlite is transforming AI infrastructure management. By leveraging NVIDIA NIMs for both image embeddings and multimodal reasoning, we offer a solution that is fast, secure, and entirely on-premises. From real-time video processing to asynchronous image embedding and natural language querying, every component of our pipeline is designed to deliver maximum efficiency and clarity.
With Moonlite, you’re not just building AI applications; you’re building on a platform that scales with your needs, protects your data, and empowers your organization with complete AI visibility and control. Explore our platform today and discover how you can build applications faster, secure your data, and achieve unparalleled performance in your AI deployments.
If you’re interested in learning more about our solutions or would like to see a demo, please visit our website or contact our team. With Moonlite’s robust, on-prem AI infrastructure, you can unlock the full potential of your data and build smarter, more secure applications that truly stand out in today’s competitive landscape.