Introduction to Vector Embeddings and Sentence Transformers
Vector embeddings have revolutionized natural language processing (NLP) by converting text into numerical representations that machines can understand and compare. Sentence Transformers, built on top of transformer models like BERT and RoBERTa, provide state-of-the-art sentence and text embeddings that power semantic search, clustering, and similarity comparison applications.
In production environments, containerizing these models with Docker ensures consistency, scalability, and simplified deployment across different infrastructure. This comprehensive guide walks you through building, deploying, and optimizing vector embedding services using Sentence Transformers and Docker.
Understanding Vector Embeddings and Their Applications
Vector embeddings transform text into dense numerical vectors where semantically similar texts have similar vector representations. Unlike traditional keyword matching, embeddings capture contextual meaning, enabling more intelligent search and comparison capabilities.
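To make this concrete, here is a minimal sketch using the all-MiniLM-L6-v2 model featured throughout this guide (assuming sentence-transformers is installed): related sentences score noticeably higher than unrelated ones.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode three sentences; the first two are semantically related
embeddings = model.encode([
    "A cat sits on the mat",
    "A kitten rests on the rug",
    "The stock market fell sharply today",
])

# Cosine similarity: the related pair scores well above the unrelated pair
print(util.cos_sim(embeddings[0], embeddings[1]))  # high
print(util.cos_sim(embeddings[0], embeddings[2]))  # low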
Common Use Cases
- Semantic Search: Finding documents based on meaning rather than exact keyword matches
- Recommendation Systems: Suggesting similar content based on semantic similarity
- Clustering and Classification: Grouping similar documents automatically
- Question Answering: Matching questions to relevant answers in knowledge bases
- Duplicate Detection: Identifying similar or duplicate content at scale
Setting Up Your Development Environment
Before diving into Docker containerization, let’s establish a working local environment to understand the core functionality.
Installing Dependencies
Create a project directory and set up a virtual environment:
mkdir sentence-transformers-docker
cd sentence-transformers-docker
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Create a requirements.txt file with necessary dependencies:
sentence-transformers==2.2.2
huggingface_hub==0.25.2  # sentence-transformers 2.2.2 breaks with huggingface_hub >= 0.26 (cached_download was removed)
flask==3.0.0
gunicorn==21.2.0
numpy==1.24.3
torch==2.1.0
Install the packages:
pip install -r requirements.txt
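A quick sanity check confirms everything is wired up (the first run downloads the model, which is well under 100 MB):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode("Hello, world!")
print(embedding.shape)  # (384,) -- this model produces 384-dimensional vectors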
Building a Vector Embedding Service
Let’s create a REST API service that generates embeddings for input text. This service will serve as the foundation for our containerized application.
Creating the Flask Application
Create a file named app.py:
from flask import Flask, request, jsonify
from sentence_transformers import SentenceTransformer
import numpy as np
import logging
import os
app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Load model at startup (name configurable via the MODEL_NAME env var, as used in docker-compose.yml below)
MODEL_NAME = os.environ.get('MODEL_NAME', 'all-MiniLM-L6-v2')
logger.info(f"Loading model: {MODEL_NAME}")
model = SentenceTransformer(MODEL_NAME)
logger.info("Model loaded successfully")
@app.route('/health', methods=['GET'])
def health_check():
return jsonify({"status": "healthy", "model": MODEL_NAME}), 200
@app.route('/embed', methods=['POST'])
def generate_embeddings():
try:
data = request.get_json()
if not data or 'texts' not in data:
return jsonify({"error": "Missing 'texts' field"}), 400
texts = data['texts']
        if not isinstance(texts, list):
            texts = [texts]
        if not texts:
            # Fail fast on empty input instead of erroring later on embeddings_list[0]
            return jsonify({"error": "'texts' must not be empty"}), 400
# Generate embeddings
embeddings = model.encode(texts, convert_to_numpy=True)
# Convert to list for JSON serialization
embeddings_list = embeddings.tolist()
return jsonify({
"embeddings": embeddings_list,
"dimensions": len(embeddings_list[0]),
"count": len(embeddings_list)
}), 200
except Exception as e:
logger.error(f"Error generating embeddings: {str(e)}")
return jsonify({"error": str(e)}), 500
@app.route('/similarity', methods=['POST'])
def calculate_similarity():
try:
data = request.get_json()
if not data or 'text1' not in data or 'text2' not in data:
return jsonify({"error": "Missing text1 or text2"}), 400
# Generate embeddings
embedding1 = model.encode(data['text1'], convert_to_numpy=True)
embedding2 = model.encode(data['text2'], convert_to_numpy=True)
# Calculate cosine similarity
similarity = np.dot(embedding1, embedding2) / (
np.linalg.norm(embedding1) * np.linalg.norm(embedding2)
)
return jsonify({
"similarity": float(similarity),
"text1": data['text1'],
"text2": data['text2']
}), 200
except Exception as e:
logger.error(f"Error calculating similarity: {str(e)}")
return jsonify({"error": str(e)}), 500
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000, debug=False)
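Before containerizing, you can smoke-test the service locally:

python app.py
# In a second terminal:
curl http://localhost:5000/health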
Containerizing with Docker
Now let’s containerize our application for production deployment. We’ll create an optimized multi-stage Docker build.
Creating the Dockerfile
Create a Dockerfile in your project root:
FROM python:3.10-slim as builder
WORKDIR /app
# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
# Production stage
FROM python:3.10-slim
WORKDIR /app
# Copy Python dependencies from builder
COPY --from=builder /root/.local /root/.local
# Copy application code
COPY app.py .
# Make sure scripts in .local are usable
ENV PATH=/root/.local/bin:$PATH
# Pre-download the model during build
RUN python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
# Expose port
EXPOSE 5000
# Health check (urllib is in the standard library; requests is not a declared
# dependency, and requests.get would not fail the check on an HTTP error status)
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:5000/health')"
# Run with gunicorn for production
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "--workers", "2", "--threads", "4", "--timeout", "120", "app:app"]
Building and Running the Container
Build your Docker image:
docker build -t sentence-transformers-api:latest .
Run the container:
docker run -d \
--name embeddings-service \
-p 5000:5000 \
--memory="2g" \
--cpus="2" \
sentence-transformers-api:latest
Check the logs:
docker logs -f embeddings-service
Testing Your Embedding Service
Once your container is running, test the endpoints using curl or any HTTP client.
Health Check
curl http://localhost:5000/health
Generate Embeddings
curl -X POST http://localhost:5000/embed \
-H "Content-Type: application/json" \
-d '{"texts": ["Docker makes deployment easy", "Kubernetes orchestrates containers"]}'
Calculate Similarity
curl -X POST http://localhost:5000/similarity \
-H "Content-Type: application/json" \
-d '{"text1": "Machine learning with Python", "text2": "AI development using Python"}'
Kubernetes Deployment
For production environments, deploying to Kubernetes provides scalability and resilience. One prerequisite: push the image to a registry your cluster can pull from (the manifest references sentence-transformers-api:latest), or load it directly into a local cluster. Here's a complete Kubernetes configuration.
Kubernetes Deployment YAML
Create k8s-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: sentence-transformers-deployment
labels:
app: embeddings-service
spec:
replicas: 3
selector:
matchLabels:
app: embeddings-service
template:
metadata:
labels:
app: embeddings-service
spec:
containers:
- name: embeddings-api
image: sentence-transformers-api:latest
ports:
- containerPort: 5000
name: http
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "2000m"
livenessProbe:
httpGet:
path: /health
port: 5000
initialDelaySeconds: 60
periodSeconds: 30
timeoutSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 5000
initialDelaySeconds: 30
periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
name: embeddings-service
spec:
selector:
app: embeddings-service
ports:
- protocol: TCP
port: 80
targetPort: 5000
type: LoadBalancer
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: embeddings-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: sentence-transformers-deployment
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
Deploy to Kubernetes:
kubectl apply -f k8s-deployment.yaml
# Check deployment status
kubectl get pods -l app=embeddings-service
# View service details
kubectl get svc embeddings-service
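If your cluster has no external load balancer (a local minikube or kind cluster, for example), port forwarding still lets you reach the service:

kubectl port-forward svc/embeddings-service 8080:80
# Then, in another terminal:
curl http://localhost:8080/health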
Docker Compose for Development
For local development with multiple services, use Docker Compose. Create docker-compose.yml:
version: '3.8'
services:
embeddings-api:
build: .
container_name: embeddings-service
ports:
- "5000:5000"
environment:
- MODEL_NAME=all-MiniLM-L6-v2
- FLASK_ENV=production
volumes:
- model-cache:/root/.cache/torch/sentence_transformers
deploy:
resources:
limits:
cpus: '2'
memory: 2G
reservations:
cpus: '1'
memory: 1G
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:5000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
restart: unless-stopped
volumes:
model-cache:
driver: local
Start the services:
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down
Performance Optimization Best Practices
Model Selection
Choose the right model based on your requirements (a sketch for switching between models follows this list):
- all-MiniLM-L6-v2: Fast, lightweight (384 dimensions), good for most applications
- all-mpnet-base-v2: Higher quality (768 dimensions), slower inference
- paraphrase-multilingual-MiniLM-L12-v2: Supports 50+ languages
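Because app.py reads MODEL_NAME from the environment, you can try another model without touching the code. Keep in mind that the Dockerfile above only pre-downloads all-MiniLM-L6-v2, so any other model is fetched on container start, making the first boot slower:

docker run -d \
  --name embeddings-mpnet \
  -p 5001:5000 \
  -e MODEL_NAME=all-mpnet-base-v2 \
  sentence-transformers-api:latest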
Caching Strategies
Implement caching to avoid recomputing embeddings for frequently accessed texts:
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_embedding_cached(text: str):
    # Convert the NumPy array to a plain tuple of floats so the cached
    # value is immutable and JSON-serializable
    return tuple(float(x) for x in model.encode(text, convert_to_numpy=True))
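Keep in mind that lru_cache is per-process: with two gunicorn workers, each maintains its own cache, and entries vanish on restart. For a cache shared across workers or replicas, an external store such as Redis, keyed on a hash of the input text, is a common choice.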
Batch Processing
Always process texts in batches when possible for better throughput:
# Good: Batch processing
texts = ["text1", "text2", "text3"]
embeddings = model.encode(texts, batch_size=32)
# Avoid: Processing one at a time
embeddings = [model.encode(text) for text in texts]
Troubleshooting Common Issues
Out of Memory Errors
If you encounter OOM errors, reduce batch size or increase container memory:
docker run -d \
--name embeddings-service \
-p 5000:5000 \
--memory="4g" \
--memory-swap="4g" \
sentence-transformers-api:latest
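Reducing the batch size helps as well. The /embed endpoint above relies on encode's default; one way to make it tunable is to accept an optional batch_size field in the request body (a hypothetical extension, not part of the API defined earlier):

# Inside generate_embeddings(), after validating 'texts':
batch_size = int(data.get('batch_size', 32))  # hypothetical request field
embeddings = model.encode(texts, batch_size=batch_size, convert_to_numpy=True)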
Slow Model Loading
Pre-download models during Docker build rather than at runtime. The Dockerfile example above includes this optimization.
Connection Timeouts
Increase gunicorn timeout for large batch processing:
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "--workers", "2", "--timeout", "300", "app:app"]
Security Considerations
Implement these security best practices in production:
- Run as non-root user: Add a USER directive in the Dockerfile (see the sketch after this list)
- Rate limiting: Implement API rate limiting to prevent abuse
- Authentication: Add API key authentication for production endpoints
- Network policies: Use Kubernetes NetworkPolicies to restrict traffic
- Image scanning: Regularly scan Docker images for vulnerabilities
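For the first item, a minimal sketch of the USER directive (note that the Dockerfile above installs dependencies under /root/.local and caches the model there, so those paths would need to move to the new user's home as well):

# Create an unprivileged user and drop root before starting the server
RUN useradd --create-home appuser
USER appuser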
Monitoring and Observability
Add Prometheus metrics for production monitoring (this uses the prometheus-flask-exporter package, which you would add to requirements.txt):
from prometheus_flask_exporter import PrometheusMetrics
metrics = PrometheusMetrics(app)
metrics.info('embeddings_api_info', 'Embeddings API', version='1.0.0')
Monitor key metrics:
- Request latency and throughput
- Memory usage and CPU utilization (GPU utilization, if you deploy on GPU nodes)
- Error rates and types
- Model inference time
Conclusion
Containerizing Sentence Transformers with Docker provides a robust, scalable solution for deploying vector embedding services. By following the patterns and best practices outlined in this guide, you can build production-ready NLP services that handle semantic search, similarity comparison, and other embedding-based applications efficiently.
The combination of Docker’s portability, Kubernetes’ orchestration capabilities, and Sentence Transformers’ powerful models creates a solid foundation for modern AI/ML applications. Start with the basic Docker setup, test thoroughly, and scale to Kubernetes when your application demands it.
Remember to monitor performance, optimize resource usage, and keep your models updated to maintain the best results in production environments.