[Performance] Docker Build & Runtime Performance Optimization #70

@aarora79

Docker Performance Analysis: Critical Issues & Optimization Plan

Analysis Date: 2025-06-22
Impact: Build times of 10-15 minutes, startup delays of 2-10 minutes
Priority: High - affects developer productivity and deployment speed

🔴 Critical Performance Issues

1. Registry Container: Massive Dependency Hell

Location: docker/Dockerfile.registry:33-50
Impact: 5-10 minutes build time

  • Heavy ML dependencies: installs torch>=1.6.0, sentence-transformers>=2.2.2, faiss-cpu>=1.7.4
  • PyTorch alone is 800MB+ and takes 5-10 minutes to download and install
  • faiss-cpu is a large native package; it needs compilation only where no prebuilt wheel matches the platform
  • sentence-transformers pulls in additional model dependencies

2. Model Download During Runtime

Location: docker/registry-entrypoint.sh:81-113
Impact: 2-10 minutes startup delay

  • Downloads 400MB+ sentence-transformer model at container startup
  • Multiple fallback attempts with SSL verification disabled
  • Network-dependent startup time - varies by connection speed
  • Blocks container readiness until model download completes

3. Inefficient Layer Caching

Location: docker/Dockerfile.registry:24
Impact: Full rebuild on any code change

  • COPY . /app/ invalidates cache on ANY file change
  • Dependencies reinstalled on every code change
  • No multi-stage builds to separate dependencies from application code

4. Sequential Service Dependencies

Location: docker-compose.yml:31-32
Impact: Cascading startup delays

  • Registry depends on auth-server
  • MCPGW depends on registry
  • Each service waits for previous to be ready

5. Build Context Size

Impact: Slow context transfer to Docker daemon

  • Entire project copied to each container (including .venv, logs, etc.)
  • No .dockerignore to exclude unnecessary files
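
Before writing the .dockerignore, it can help to see which top-level entries dominate the build context. The helper below is a minimal sketch for that; the function name `largest_entries` is hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): prints the N largest top-level
# entries of a directory as "<size in KB> <path>", largest first.
# Dot-directories such as .venv and .git are included.
largest_entries() {
    local dir="${1:-.}" count="${2:-10}"
    find "$dir" -mindepth 1 -maxdepth 1 -exec du -sk {} + \
        | sort -rn \
        | head -n "$count"
}
```

Running `largest_entries . 5` from the project root lists the five biggest candidates for exclusion.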

🚀 Immediate Wins (1-2 days implementation)

1. Add .dockerignore

Impact: 50-80% faster build context transfer

.venv/
__pycache__/
*.pyc
.git/
logs/
*.log
.pytest_cache/
.coverage
node_modules/
.DS_Store

2. Multi-stage Dockerfile for Registry

Impact: 60-70% faster rebuilds

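# Stage 1: Dependencies
FROM python:3.12-slim AS deps
RUN pip install --no-cache-dir uv
# Copy only the requirements file so this layer stays cached until it changes
COPY requirements.txt .
# --system installs into the image's Python; uv otherwise expects a virtualenv
RUN uv pip install --system --no-cache -r requirements.txt

# Stage 2: Runtime
FROM python:3.12-slim
ENV PYTHONUNBUFFERED=1
# Bring over the installed packages and their console scripts
COPY --from=deps /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY --from=deps /usr/local/bin /usr/local/bin
WORKDIR /app
COPY . /app/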

3. Parallel Builds

Impact: 40-50% faster total build time

# In build_and_run.sh
# (Compose v2 already builds in parallel by default; the flag matters for docker-compose v1)
docker-compose build --parallel

4. Pre-download Models in Build Stage

Impact: Eliminates 2-10 minute startup delay

# Add to Dockerfile.registry after dependencies
RUN mkdir -p /app/registry/models/all-MiniLM-L6-v2 && \
    huggingface-cli download sentence-transformers/all-MiniLM-L6-v2 \
    --local-dir /app/registry/models/all-MiniLM-L6-v2
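
With the model baked into the image, the entrypoint only needs to download when the baked copy is missing. The sketch below shows one possible guard. The function names are hypothetical, and treating a top-level `config.json` as the marker of a complete download is an assumption, not something taken from registry-entrypoint.sh.

```shell
#!/usr/bin/env bash
# Hypothetical guard for registry-entrypoint.sh (names are illustrative).
# The default path mirrors the one used in the build step above.
MODEL_DIR="${MODEL_DIR:-/app/registry/models/all-MiniLM-L6-v2}"

# Returns 0 when the directory looks like a downloaded model.
# Assumption: a top-level config.json is a good enough marker.
model_is_cached() {
    local dir="$1"
    [ -f "$dir/config.json" ]
}

# Downloads the model only when no cached copy is present.
ensure_model() {
    local dir="$1"
    if model_is_cached "$dir"; then
        echo "model found in $dir, skipping download"
        return 0
    fi
    echo "model missing in $dir, downloading"
    huggingface-cli download sentence-transformers/all-MiniLM-L6-v2 \
        --local-dir "$dir"
}
```

Calling `ensure_model "$MODEL_DIR"` early in the entrypoint keeps the fast path free of network access while preserving a fallback for images built without the model.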

📈 Expected Performance Improvements

| Optimization | Build Time Reduction | Startup Time Reduction |
|---|---|---|
| .dockerignore | 30-50% | 0% |
| Multi-stage builds | 60-70% | 0% |
| Parallel builds | 40-50% | 0% |
| Pre-download models | 0% | 90-95% |
| Combined | 85-95% | 90-95% |

🎯 Implementation Plan

Phase 1 (Week 1) - Immediate Wins

  • Add comprehensive .dockerignore file
  • Implement parallel builds in build_and_run.sh
  • Pre-download ML models during build stage
  • Add build timing instrumentation
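
The build timing instrumentation could start as a small wrapper around each step in build_and_run.sh. The following is a sketch only: the `timed` function and its log format are assumptions, not the instrumentation strategy from the full analysis.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not in the repo): runs a command, reports its
# wall-clock duration in whole seconds on stderr, and returns the
# command's own exit status.
timed() {
    local label="$1"
    shift
    local start end status=0
    start=$(date +%s)
    "$@" || status=$?
    end=$(date +%s)
    echo "[timing] ${label}: $((end - start))s" >&2
    return "$status"
}
```

For example, `timed "registry build" docker-compose build registry` prints one `[timing]` line per step, which can be compared across runs.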

Phase 2 (Week 2) - Architecture Improvements

  • Implement multi-stage Dockerfile for registry
  • Optimize health check intervals
  • Improve dependency layer caching

Phase 3 (Week 3) - Advanced Optimizations

  • Create base ML image with pre-installed dependencies
  • Implement model caching volumes
  • Optimize service startup dependencies

📋 Success Metrics

Target Performance Goals

  • Total build time: <2 minutes (currently 10-15 minutes)
  • Container startup time: <30 seconds (currently 2-10 minutes)
  • Image sizes: <1GB per service
  • Cache hit rates: >80%
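
One rough way to track the cache hit rate target is to count cached versus executed steps in BuildKit's plain progress output. The sketch below assumes that output marks steps with lines such as `#7 CACHED` and `#8 DONE 1.2s`. The function name is hypothetical, and BuildKit's internal steps also report `DONE`, so the figure is an approximation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): reads BuildKit plain-progress
# output on stdin and prints the share of finished steps that were
# served from cache, as an integer percentage (rounded down).
cache_hit_rate() {
    awk '
        /^#[0-9]+ CACHED/ { cached++ }
        /^#[0-9]+ DONE/   { executed++ }
        END {
            total = cached + executed
            if (total == 0) { print 0; exit }
            printf "%d\n", (cached * 100) / total
        }
    '
}
```

A possible invocation is `docker build --progress=plain -f docker/Dockerfile.registry . 2>&1 | cache_hit_rate`.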

Validation Commands

# Measure build time
time docker-compose build

# Measure startup time until services report healthy
# (--wait needs Compose v2 and a healthcheck on each service)
time docker-compose up -d --wait

# Check image sizes
docker images --format "table {{.Repository}}\t{{.Size}}"
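
Where no healthcheck is defined, startup time can be approximated by polling a readiness command after the containers start. This is a minimal sketch: the `wait_until` name is hypothetical, and the URL in the usage comment is a placeholder, not an endpoint taken from this issue.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): polls a command once per second
# until it succeeds or the timeout (in seconds) expires.
# Prints the elapsed seconds on success; returns 1 on timeout.
wait_until() {
    local timeout="$1"
    shift
    local start now
    start=$(date +%s)
    while true; do
        if "$@" >/dev/null 2>&1; then
            now=$(date +%s)
            echo "$((now - start))"
            return 0
        fi
        now=$(date +%s)
        if [ $((now - start)) -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
    done
}

# Usage (placeholder URL, adjust to the service's real health endpoint):
#   docker-compose up -d && wait_until 120 curl -fsS http://localhost:8080/health
```

The printed number can be compared directly against the 30-second startup target.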

🔧 Technical Details

The full performance analysis includes:

  • Detailed instrumentation strategy for build_and_run.sh
  • Layer-by-layer Docker build analysis
  • Medium-term solutions (base images, caching volumes)
  • Long-term architectural improvements (dependency splitting)

This issue addresses critical performance bottlenecks that significantly impact developer productivity and deployment efficiency. The proposed optimizations can reduce build times from 10-15 minutes to under 2 minutes, and startup times from 2-10 minutes to under 30 seconds.
