Prerequisites
Before beginning the deployment, ensure your infrastructure meets the following requirements:
- Dedicated GPU Server: A bare-metal machine running Ubuntu 22.04 LTS.
- VRAM Requirements: Minimum 24GB VRAM (e.g., NVIDIA RTX 3090 / 4090) for 7B–8B parameter models. Multi-GPU configurations (e.g., NVIDIA A100 or H100) are required for 70B+ models.
- System Access: Root or sudo privileges.
- Software Dependencies: Python 3.10+, Docker, and Docker Compose installed.
- Drivers: NVIDIA Display Drivers (v535+) and CUDA Toolkit (v12.1+).
Quick Summary
- Optimize Inference: Deploy vLLM to utilize PagedAttention, maximizing GPU memory utilization and throughput.
- Vector Storage: Spin up a Qdrant container to manage and query document embeddings efficiently.
- Orchestration: Use LangChain to parse internal documents, generate vector embeddings via HuggingFace, and structure the retrieval chain.
- Data Sovereignty: Execute the entire query lifecycle locally on bare-metal hardware, ensuring zero external API calls.
Deploying a Retrieval-Augmented Generation (RAG) pipeline is the standard approach for allowing Large Language Models (LLMs) to securely interact with your proprietary enterprise data. However, relying on public LLM APIs exposes sensitive corporate documents to third-party networks and can introduce unacceptable latency for high-throughput applications.
By self-hosting your inference architecture, you retain absolute data sovereignty. This guide demonstrates how to architect and deploy a high-performance, fully private RAG pipeline using vLLM for inference, LangChain for orchestration, and a local vector database, all running on a dedicated GPU server.
Step 1: Prepare the GPU Environment
First, verify that your NVIDIA drivers and CUDA environment are correctly installed and recognized by the operating system.
nvidia-smi
Create a dedicated Python virtual environment to isolate the pipeline dependencies:
python3 -m venv rag_env
source rag_env/bin/activate
Upgrade pip and install the core AI frameworks:
pip install --upgrade pip
pip install vllm langchain langchain-openai langchain-community sentence-transformers qdrant-client pypdf
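Before moving on, it is worth confirming that the PyTorch build pulled in by vLLM can actually see your GPU. The following minimal sanity check assumes a single-GPU machine; if it prints False, embedding generation and inference will silently fall back to the CPU (see the FAQ at the end of this guide).
import torch

# Confirm that PyTorch can reach the CUDA runtime and the GPU driver
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # Report the detected GPU and its total VRAM (index 0 assumes a single-GPU server)
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GiB")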
Step 2: Deploy the vLLM API Server
vLLM is a highly optimized inference engine that uses PagedAttention and continuous batching to maximize throughput and reduce latency. We will start vLLM in OpenAI-compatible API mode, allowing LangChain to interact with it seamlessly. For this tutorial, we will serve the Meta-Llama-3-8B-Instruct model. Note that Meta's Llama 3 weights are gated on Hugging Face, so make sure your account has accepted the license and that the server is authenticated (for example via huggingface-cli login) before starting vLLM.
Run the following command to initialize the server:
python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--dtype auto \
--api-key private-rag-key \
--max-model-len 4096 \
--port 8000
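Once the server logs show it is listening on port 8000, you can confirm the OpenAI-compatible endpoint is reachable. Below is a minimal sketch using the openai Python client (installed as a dependency of langchain-openai); it simply lists the models vLLM is serving and assumes the host, port, and API key from the command above.
from openai import OpenAI

# Point the OpenAI client at the local vLLM server instead of api.openai.com
client = OpenAI(base_url="http://localhost:8000/v1", api_key="private-rag-key")

# List the served models; expect meta-llama/Meta-Llama-3-8B-Instruct
for model in client.models.list():
    print(model.id)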
💡 Infrastructure Tip: Serving LLMs via vLLM requires immense memory bandwidth and continuous I/O operations. Running this on virtualized cloud instances often introduces hypervisor overhead, leading to token generation latency. Deploying this stack directly on a GPUYard Dedicated GPU Server ensures your pipeline has unshared access to the PCIe lanes and GPU VRAM, delivering maximum tokens-per-second (TPS) for concurrent enterprise users.
Step 3: Initialize the Vector Database (Qdrant)
A highly performant RAG pipeline requires a robust vector database to store and search document embeddings. We will deploy Qdrant locally using Docker.
Execute the following command to spin up the Qdrant instance:
docker run -d -p 6333:6333 -p 6334:6334 \
-v $(pwd)/qdrant_storage:/qdrant/storage:z \
qdrant/qdrant
Verify the container is running:
curl http://localhost:6333
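You can also verify connectivity from Python using the qdrant-client package installed in Step 1. The sketch below simply lists collections; on a fresh deployment the list will be empty until Step 4 ingests documents.
from qdrant_client import QdrantClient

# Connect to the local Qdrant instance started by Docker
client = QdrantClient(url="http://localhost:6333")

# List existing collections (empty until documents are ingested in Step 4)
print(client.get_collections())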
Step 4: Document Ingestion and Embedding
Next, we write a Python script to ingest a sample PDF, chunk the text, convert it into embeddings using a local HuggingFace model, and load it into Qdrant. Create a file named ingest.py:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant
# 1. Load the internal document
loader = PyPDFLoader("enterprise_policy.pdf")
documents = loader.load()
# 2. Chunk the text
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)
# 3. Initialize local embeddings model (Executes on CPU/GPU)
embeddings = HuggingFaceEmbeddings(
model_name="BAAI/bge-large-en-v1.5",
model_kwargs={'device': 'cuda'}
)
# 4. Store vectors in the local Qdrant instance
qdrant = Qdrant.from_documents(
chunks,
embeddings,
url="http://localhost:6333",
collection_name="enterprise_knowledge",
)
print(f"Successfully ingested {len(chunks)} document chunks into Qdrant.")
Run the ingestion script:
python3 ingest.py
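Before wiring up the full chain, you can sanity-check retrieval with a quick similarity search against the new collection. This is a minimal sketch; the query string is only an illustrative example.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant
from qdrant_client import QdrantClient

# Reconnect to the collection created by ingest.py
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
client = QdrantClient(url="http://localhost:6333")
store = Qdrant(client=client, collection_name="enterprise_knowledge", embeddings=embeddings)

# Print the two chunks most similar to a sample query
for doc in store.similarity_search("remote work policy", k=2):
    print(doc.page_content[:200], "\n---")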
Step 5: Build the Retrieval and Generation Loop
Finally, connect LangChain to the local Qdrant database and the vLLM API server to execute the RAG query. Create a file named query.py:
from langchain_openai import ChatOpenAI
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant
from qdrant_client import QdrantClient
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
# 1. Connect to local vLLM API
llm = ChatOpenAI(
openai_api_base="http://localhost:8000/v1",
openai_api_key="private-rag-key",
model_name="meta-llama/Meta-Llama-3-8B-Instruct",
temperature=0.1
)
# 2. Connect to local Qdrant Vector Store
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
client = QdrantClient(url="http://localhost:6333")
vector_store = Qdrant(client=client, collection_name="enterprise_knowledge", embeddings=embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
# 3. Define the RAG Prompt
system_prompt = (
"You are a helpful enterprise assistant. Use the provided context to answer the user's question."
"If you don't know the answer based on the context, say so.\n\n"
"Context: {context}"
)
prompt = ChatPromptTemplate.from_messages([
("system", system_prompt),
("human", "{input}"),
])
# 4. Construct the RAG Chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
# 5. Execute a query
response = rag_chain.invoke({"input": "What is the company policy on remote work?"})
print("\nAnswer:", response["answer"])
Run the query script:
python3 query.py
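For auditing, create_retrieval_chain also returns the retrieved chunks alongside the generated answer under the context key. You can append the following lines to query.py to print which document chunks grounded the response (the source and page metadata fields come from PyPDFLoader; other loaders may name them differently).
# Inspect which chunks the retriever supplied for the answer above
for doc in response["context"]:
    source = doc.metadata.get("source", "unknown")
    page = doc.metadata.get("page", "n/a")
    print(f"- {source} (page {page}): {doc.page_content[:120]}...")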
Real-World FAQ & Troubleshooting
- Q: I am getting a CUDA Out of Memory (OOM) error when starting vLLM. How do I fix this?
A: This occurs when the model requires more VRAM than your GPU has available. You can mitigate this by reducing the gpu_memory_utilization value in your vLLM startup command (e.g., --gpu-memory-utilization 0.85), or by using a quantized model version (such as AWQ or GPTQ formats).
- Q: Why is the embedding generation step slow?
A: By default, HuggingFace embeddings may run on the CPU if PyTorch is not correctly configured for CUDA. Ensure you have the CUDA-enabled build of PyTorch installed. You can force the device mapping by passing model_kwargs={'device': 'cuda'} to the HuggingFaceEmbeddings initialization, as shown in Step 4.
- Q: Can I scale this pipeline to handle hundreds of concurrent users?
A: Yes. vLLM is specifically designed for high concurrency via continuous batching. However, to scale effectively at an enterprise level, you will need to load-balance multiple vLLM instances across a multi-GPU cluster (e.g., 4x or 8x NVIDIA A100s) and deploy Qdrant in distributed cluster mode rather than a single standalone container.
Conclusion
You have successfully architected and deployed a highly secure, private RAG pipeline. By combining vLLM's superior inference speed, Qdrant's vector search efficiency, and LangChain's orchestration capabilities, you have built an enterprise-grade AI system that never sends data to external providers.