TinyMCE AI on-premises: Production deployment guide
Architecture overview
The AI service is stateless: it persists all state to MySQL/PostgreSQL and Redis, so it scales horizontally behind a load balancer.
TLS / HTTPS
The AI service does not terminate Transport Layer Security (TLS). Place a reverse proxy in front.
Nginx example
server {
    listen 443 ssl;
    server_name ai.example.com;
    ssl_certificate /etc/ssl/certs/ai.example.com.pem;
    ssl_certificate_key /etc/ssl/private/ai.example.com.key;

    location / {
        proxy_pass http://ai-service:8000;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # SSE streaming support
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
    }
}
Server-Sent Events (SSE) streaming requires `proxy_buffering off` and a generous `proxy_read_timeout` (300 s in the example above); without them the proxy buffers or closes long-lived streaming responses.
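To confirm the proxy terminates TLS and reaches the service, a quick check against the health endpoint is usually enough (a sketch; it assumes the `server_name` above resolves to the proxy):

curl -fsS https://ai.example.com/health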
Horizontal scaling
The AI service is stateless. All persistent state lives in the SQL database, Redis, and the file-storage back end. Any number of replicas can run behind a load balancer. All replicas must share identical environment variable configuration.
Scaling considerations
| Component | Scaling approach |
|---|---|
| AI service | Add more containers (stateless) |
| MySQL / PostgreSQL | Read replicas or managed DB (RDS, Cloud SQL, Azure Database) |
| Redis | Redis Cluster or Sentinel; managed Redis (ElastiCache, Memorystore, Azure Cache) |
| File storage | S3 / Azure Blob recommended for production |
When deploying for the first time or upgrading to a new version, start a single instance and wait for it to become healthy before scaling up. Subsequent scale events do not require this precaution.
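As an illustration of this precaution on Kubernetes (a sketch; the namespace and deployment names match the Kubernetes deployment section later in this guide):

kubectl -n tinymce-ai scale deployment/ai-service --replicas=1
kubectl -n tinymce-ai rollout status deployment/ai-service   # wait until the single instance is healthy
kubectl -n tinymce-ai scale deployment/ai-service --replicas=4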
Podman deployment
The AI service works with Podman as an alternative to Docker. In Podman, containers within a pod share a network namespace, so use 127.0.0.1 instead of container names for hostnames.
podman login -u 'TINY_REGISTRY_USERNAME' registry.containers.tiny.cloud
podman pull registry.containers.tiny.cloud/ai-service:latest
podman pod create --name ai-pod -p 8000:8000 -p 3306:3306 -p 6379:6379
podman run -d --pod ai-pod --name mysql \
-e MYSQL_ROOT_PASSWORD=ROOT_PASSWORD \
-e MYSQL_DATABASE=ai_service \
mysql:8.0
podman run -d --pod ai-pod --name redis redis:7
podman run --init -d --pod ai-pod --name ai-service \
-e LICENSE_KEY='T8LK:...' \
-e ENVIRONMENTS_MANAGEMENT_SECRET_KEY='MANAGEMENT_SECRET' \
-e DATABASE_DRIVER='mysql' \
-e DATABASE_HOST='127.0.0.1' \
-e DATABASE_USER='root' \
-e DATABASE_PASSWORD='ROOT_PASSWORD' \
-e DATABASE_DATABASE='ai_service' \
-e REDIS_HOST='127.0.0.1' \
-e PROVIDERS='{"openai":{"type":"openai","apiKeys":["sk-proj-..."]}}' \
-e STORAGE_DRIVER='database' \
registry.containers.tiny.cloud/ai-service:latest
Pin to `mysql:8.0`. The `mysql:8` tag floats to MySQL 8.4, which removes the `default-authentication-plugin` flag and causes a crash loop. See Database, Redis, and storage for details.
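To confirm the pod came up correctly, check the containers and the health endpoint through the port mapping defined in the pod create command above:

podman ps --pod
curl -f http://localhost:8000/health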
Kubernetes deployment
Namespace and image pull secret
kubectl create namespace tinymce-ai
kubectl create secret docker-registry tiny-registry \
--namespace tinymce-ai \
--docker-server=registry.containers.tiny.cloud \
--docker-username=TINY_REGISTRY_USERNAME \
--docker-password='TINY_REGISTRY_ACCESS_TOKEN'
Application secrets
apiVersion: v1
kind: Secret
metadata:
  name: ai-service-secrets
  namespace: tinymce-ai
type: Opaque
stringData:
  license-key: "EXAMPLE_LICENSE_KEY"
  management-secret: "EXAMPLE_MANAGEMENT_SECRET"
  db-password: "EXAMPLE_DB_PASSWORD"
  redis-password: "EXAMPLE_REDIS_PASSWORD"
  providers: |
    {
      "openai": {
        "type": "openai",
        "apiKeys": ["sk-proj-EXAMPLE_KEY"]
      }
    }
In production, use Sealed Secrets, External Secrets Operator, or HashiCorp Vault rather than committing raw secret manifests.
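If you prefer not to keep a Secret manifest on disk at all, the same secret can be created imperatively; a sketch, assuming the PROVIDERS JSON shown above is saved as providers.json:

kubectl create secret generic ai-service-secrets \
  --namespace tinymce-ai \
  --from-literal=license-key='EXAMPLE_LICENSE_KEY' \
  --from-literal=management-secret='EXAMPLE_MANAGEMENT_SECRET' \
  --from-literal=db-password='EXAMPLE_DB_PASSWORD' \
  --from-literal=redis-password='EXAMPLE_REDIS_PASSWORD' \
  --from-file=providers=providers.json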
Deployment
Full Kubernetes Deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-service
  namespace: tinymce-ai
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-service
  template:
    metadata:
      labels:
        app: ai-service
    spec:
      imagePullSecrets:
        - name: tiny-registry
      containers:
        - name: ai-service
          image: registry.containers.tiny.cloud/ai-service:latest
          ports:
            - containerPort: 8000
          env:
            - name: LICENSE_KEY
              valueFrom:
                secretKeyRef:
                  name: ai-service-secrets
                  key: license-key
            - name: ENVIRONMENTS_MANAGEMENT_SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: ai-service-secrets
                  key: management-secret
            - name: DATABASE_DRIVER
              value: "mysql"
            - name: DATABASE_HOST
              value: "mysql.tinymce-ai.svc.cluster.local"
            - name: DATABASE_USER
              value: "ai_service"
            - name: DATABASE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: ai-service-secrets
                  key: db-password
            - name: DATABASE_DATABASE
              value: "ai_service"
            - name: REDIS_HOST
              value: "redis.tinymce-ai.svc.cluster.local"
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: ai-service-secrets
                  key: redis-password
            - name: PROVIDERS
              valueFrom:
                secretKeyRef:
                  name: ai-service-secrets
                  key: providers
            - name: STORAGE_DRIVER
              value: "s3"
            - name: STORAGE_REGION
              value: "us-east-1"
            - name: STORAGE_BUCKET
              value: "example-ai-storage-bucket"
            - name: ENABLE_METRIC_LOGS
              value: "true"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
Service
apiVersion: v1
kind: Service
metadata:
  name: ai-service
  namespace: tinymce-ai
spec:
  selector:
    app: ai-service
  ports:
    - port: 8000
      targetPort: 8000
Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ai-service
  namespace: tinymce-ai
  annotations:
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  tls:
    - hosts:
        - ai.example.com
      secretName: ai-tls-cert
  rules:
    - host: ai.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ai-service
                port:
                  number: 8000
Horizontal pod autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-service-hpa
  namespace: tinymce-ai
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
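Assuming the manifests above are saved as deployment.yaml, service.yaml, ingress.yaml, and hpa.yaml (filenames are placeholders), a typical apply-and-verify sequence looks like:

kubectl apply -f deployment.yaml -f service.yaml -f ingress.yaml -f hpa.yaml
kubectl -n tinymce-ai rollout status deployment/ai-service
kubectl -n tinymce-ai get pods -l app=ai-service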
AWS ECS / Fargate
Task definition
Full ECS Fargate task definition
{
"family": "ai-service",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "1024",
"memory": "2048",
"containerDefinitions": [
{
"name": "ai-service",
"image": "registry.containers.tiny.cloud/ai-service:latest",
"portMappings": [{ "containerPort": 8000 }],
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3
},
"secrets": [
{ "name": "LICENSE_KEY", "valueFrom": "arn:aws:secretsmanager:us-east-1:111122223333:secret:ai-license" },
{ "name": "ENVIRONMENTS_MANAGEMENT_SECRET_KEY", "valueFrom": "arn:aws:secretsmanager:us-east-1:111122223333:secret:ai-mgmt-secret" },
{ "name": "DATABASE_PASSWORD", "valueFrom": "arn:aws:secretsmanager:us-east-1:111122223333:secret:ai-db" },
{ "name": "PROVIDERS", "valueFrom": "arn:aws:secretsmanager:us-east-1:111122223333:secret:ai-providers" }
],
"environment": [
{ "name": "DATABASE_DRIVER", "value": "mysql" },
{ "name": "DATABASE_HOST", "value": "example-rds-endpoint.region.rds.amazonaws.com" },
{ "name": "DATABASE_USER", "value": "ai_service" },
{ "name": "DATABASE_DATABASE", "value": "ai_service" },
{ "name": "REDIS_HOST", "value": "example-elasticache-endpoint.region.cache.amazonaws.com" },
{ "name": "STORAGE_DRIVER", "value": "s3" },
{ "name": "STORAGE_BUCKET", "value": "example-ai-storage-bucket" },
{ "name": "STORAGE_REGION", "value": "us-east-1" }
]
}
]
}
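A sketch of registering the task definition and creating the service with the AWS CLI; the cluster name, subnet, security group, and target group ARN are placeholders to adapt to your account:

aws ecs register-task-definition --cli-input-json file://ai-service-task.json
aws ecs create-service \
  --cluster example-cluster \
  --service-name ai-service \
  --task-definition ai-service \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0],assignPublicIp=DISABLED}' \
  --load-balancers 'targetGroupArn=arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/ai-service/0123456789abcdef,containerName=ai-service,containerPort=8000'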
Infrastructure recommendations
| Service | AWS recommendation |
|---|---|
| Database | RDS for MySQL 8.0 (Multi-AZ for high availability (HA)) |
| Redis | ElastiCache for Redis 7 (cluster mode) |
| Storage | Same-region S3 bucket |
| Load balancer | ALB with an idle timeout long enough for SSE streaming (300 s or more) |
| Secrets | AWS Secrets Manager |
| Registry pull credentials | Secrets Manager + ECR pull-through cache, or a private repository mirroring `registry.containers.tiny.cloud` |
Security hardening
| Practice | Implementation |
|---|---|
| Network isolation | Place the AI service in a private subnet; expose only through a load balancer. Restrict database and Redis to the AI service security group. |
| Block panel from the public internet | Restrict the management panel routes at the reverse proxy or load balancer so that only internal networks or a VPN can reach them. |
| TLS everywhere | Terminate TLS 1.3 at the reverse proxy. Use internal mutual TLS (mTLS) between the AI service and the data layer where supported. |
| Secrets management | Use Vault, AWS Secrets Manager, Azure Key Vault, or GCP Secret Manager. Never store secrets directly in orchestration manifests or commit them to source control. |
| Database encryption at rest | Turn on encryption at rest in the cloud provider console. RDS, Cloud SQL, and Azure Database enable this by default. |
| Redis authentication | Always set a Redis password (`REDIS_PASSWORD`) in production. |
| Container security | Run as non-root, use a read-only filesystem where possible, and drop unnecessary Linux capabilities. |
| Image scanning | Scan the `ai-service` image in your CI pipeline or registry before promoting it to production. |
| Least-privilege JSON Web Tokens (JWTs) | Grant only the permissions each user role requires. Avoid full-access tokens in production. |
| API secret rotation | Periodically create a new access key, add the new key to the configuration, then revoke the old key. The token endpoint reads the secret at request time. |
| Audit logging | Enable `ENABLE_METRIC_LOGS` and retain the structured request logs in a central aggregator. |
| Large language model (LLM) API key rotation | Add the new key to the `PROVIDERS` configuration, deploy, then remove the old key once it is no longer needed. |
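As an illustration of the LLM API key rotation flow (a sketch; it assumes the `apiKeys` array accepts more than one key, as its plural form suggests, and the key values are placeholders):

# Step 1: deploy with both keys while the new key is validated
-e PROVIDERS='{"openai":{"type":"openai","apiKeys":["sk-proj-NEW_KEY","sk-proj-OLD_KEY"]}}'

# Step 2: once the new key is confirmed working, redeploy with the old key removed
-e PROVIDERS='{"openai":{"type":"openai","apiKeys":["sk-proj-NEW_KEY"]}}'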
Rate limiting
The AI service has no built-in rate limiting. Place rate-limit rules in front of the service to prevent a runaway client from consuming LLM provider quota or overloading the database.
nginx
limit_req_zone $http_authorization zone=ai_jwt:10m rate=10r/s;

server {
    location /v1/ {
        limit_req zone=ai_jwt burst=20 nodelay;
        proxy_pass http://ai-service:8000;
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
Observability
Health monitoring
Poll /health on each instance to confirm it is running. A healthy instance responds with HTTP 200.
curl -f http://ai-service:8000/health
Structured metric logs
Set the ENABLE_METRIC_LOGS environment variable to enable request-level JSON logs to stdout:
-e ENABLE_METRIC_LOGS='true'
When enabled, the service writes a structured JSON entry for each request. Key fields include the request duration, HTTP status code, and outcome status. These entries are suitable for ingestion into any log aggregator that supports JSON parsing.
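The exact field names depend on the service version; as a hypothetical example, assuming a numeric status field, failed requests can be pulled out of the container logs with jq:

docker logs ai-service 2>&1 | grep '^{' \
  | jq -c 'select((.status? // 0) >= 500)'   # "status" is a hypothetical field name; adjust to the actual schema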
OpenTelemetry
-e LLM_TELEMETRY_ENABLED='true' \
-e OTEL_EXPORTER_OTLP_TRACES_ENDPOINT='http://otel-collector:4318/v1/traces' \
-e OTEL_TRACES_SAMPLER_ARG='1.0' \
-e OTEL_DEBUG='true'
| Variable | Required | Default | Description |
|---|---|---|---|
| `LLM_TELEMETRY_ENABLED` | Yes | `false` | Primary telemetry switch |
| `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` | Yes | - | OpenTelemetry Protocol (OTLP) endpoint URL |
| `OTEL_TRACES_SAMPLER_ARG` | No | `1.0` | Sampling rate (0.0 to 1.0) |
| `OTEL_DEBUG` | No | - | Verbose OTLP diagnostic logging |
Compatible with Jaeger, Grafana Tempo, Datadog, New Relic, Honeycomb, and any OTLP-compatible back end.
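For local testing, a collector can run next to the service; a sketch using the upstream OpenTelemetry Collector image, assuming an otel-config.yaml that enables the OTLP HTTP receiver on port 4318 and a debug exporter:

docker run -d --name otel-collector \
  -p 4318:4318 \
  -v "$PWD/otel-config.yaml:/etc/otelcol/config.yaml" \
  otel/opentelemetry-collector:latest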
Langfuse
Langfuse provides AI-specific observability: token usage, latency per LLM call, prompt quality scores, and cost tracking.
-e LANGFUSE_PUBLIC_KEY='pk-lf-...' \
-e LANGFUSE_SECRET_KEY='sk-lf-...' \
-e LANGFUSE_BASE_URL='https://cloud.langfuse.com' \
-e LANGFUSE_DEBUG='true'
| Variable | Required | Default | Description |
|---|---|---|---|
| `LANGFUSE_PUBLIC_KEY` | Yes (if used) | - | Langfuse public key |
| `LANGFUSE_SECRET_KEY` | Yes (if used) | - | Langfuse secret key |
| `LANGFUSE_BASE_URL` | No | `https://cloud.langfuse.com` | Self-hosted Langfuse URL |
| `LANGFUSE_DEBUG` | No | - | Verbose Langfuse logging |
Langfuse also requires `LLM_TELEMETRY_ENABLED=true` and a valid `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`.
OpenTelemetry and Langfuse can run at the same time. The service emits to both without conflict.
Distributed logging
For production multi-instance deployments, ship container logs to a central aggregator.
| Platform | Log driver / approach |
|---|---|
| AWS | CloudWatch Logs through the `awslogs` log driver |
| GCP | Cloud Logging (automatic on GKE), or Fluent Bit |
| Azure | Azure Monitor (automatic on Azure Container Apps and AKS) |
| Self-hosted (ELK) | Fluent Bit or Filebeat to Elasticsearch + Kibana |
| Self-hosted (Loki) | Fluent Bit or Promtail to Grafana Loki |
| Fluentd | Use the Docker fluentd log driver |
docker run ... \
--log-driver=fluentd \
--log-opt fluentd-address=localhost:24224 \
--log-opt tag=ai-service \
...
The metric logs produced by the ENABLE_METRIC_LOGS option are already structured JSON and parse cleanly in any aggregator.
Recommended monitoring
The following checks help catch common issues early:
- Health endpoint — poll `/health` on each instance; alert if any instance returns a non-200 response for more than 60 seconds (a minimal polling sketch follows this list).
- Error rate — monitor the HTTP 5xx rate in the metric logs or traces; a sustained increase may indicate an LLM provider outage or a misconfigured environment.
- Latency — track request duration; a sudden increase typically points to LLM provider throttling or network issues.
- Container restarts — alert on repeated container restarts, which may indicate a missing environment variable or a database connectivity problem.
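A minimal polling sketch for the health-endpoint check; the instance list is a placeholder, and in practice this belongs in your monitoring system rather than a standalone script:

#!/usr/bin/env bash
# Poll each instance's /health endpoint; report if it stays non-200 for roughly 60 seconds.
INSTANCES="ai-1.internal:8000 ai-2.internal:8000"   # placeholder hostnames
for host in $INSTANCES; do
  for attempt in $(seq 12); do                      # 12 attempts x 5 s = 60 s
    curl -sf "http://$host/health" > /dev/null && continue 2
    sleep 5
  done
  echo "ALERT: $host has been unhealthy for 60 seconds"
done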
For troubleshooting specific error patterns, see Troubleshooting.
Backup and recovery
Database
The database contains environments, access keys, conversations, messages, and file metadata. Back up the database using standard production practices:
- MySQL: `mysqldump` or managed snapshots (RDS automated backups).
- PostgreSQL: `pg_dump` or managed snapshots.
Enable point-in-time recovery.
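A minimal logical-backup sketch for the MySQL case; the host is a placeholder, and managed snapshots are usually preferable in production:

mysqldump -h mysql.internal -u ai_service -p \
  --single-transaction --routines ai_service \
  | gzip > "ai_service_$(date +%F).sql.gz"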
Upgrade process
- Pull the new image: `docker pull registry.containers.tiny.cloud/ai-service:NEW_VERSION`
- For rolling deploys across version boundaries: start one instance at the new version and wait for it to become healthy before rolling the rest.
- For Kubernetes: update the image tag in the Deployment. The default `RollingUpdate` strategy handles zero-downtime upgrades, provided the first new pod becomes Ready before the rollout continues (a kubectl sketch follows this list).
- Verify `/health` on every replica before declaring the upgrade complete.
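A sketch of the Kubernetes path, where NEW_VERSION is a placeholder tag:

kubectl -n tinymce-ai set image deployment/ai-service \
  ai-service=registry.containers.tiny.cloud/ai-service:NEW_VERSION
kubectl -n tinymce-ai rollout status deployment/ai-service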
Review the release notes for the target version and take a database backup before upgrading.
License keys are per-deployment, not per-replica. One key covers any number of replicas of a single deployment.
Performance characteristics
| Metric | Typical value |
|---|---|
| Cold start | Approximately 3 seconds |
| Health check response | Less than 10 ms |
| Token validation | Less than 5 ms |
| Time to first token (LLM) | 200 ms to 2 s (depends on provider and model) |
| Memory per instance | 256 to 512 MB |
| Concurrent connections | 1,000+ per instance |
These values are approximate and vary with hardware, provider latency, and prompt complexity. The LLM provider’s rate limits are typically the binding constraint before the AI service becomes one.
Sizing guide
| Users | AI service replicas | Database | Redis | Notes |
|---|---|---|---|---|
| 1 to 50 | 1 | db.t3.small (or 2 vCPU / 4 GB self-managed) | cache.t3.micro | Development and small teams |
| 50 to 500 | 2 | db.r6g.large | cache.r6g.large | Small production |
| 500 to 5,000 | 3 to 5 | db.r6g.xlarge (Multi-AZ) | cache.r6g.xlarge (cluster) | Medium production |
| 5,000+ | 5+ (Horizontal Pod Autoscaler (HPA)) | db.r6g.2xlarge+ | cache.r6g.2xlarge+ | Large production; contact Tiny for guidance |
Starting point for self-managed deployments:
- AI service instance: 2 vCPU / 4 GB RAM
- Database instance: 2 vCPU / 8 GB RAM
- Redis instance: 1 vCPU / 2 GB RAM
Scale based on user count, average prompt size, and concurrent streaming connections. The LLM provider’s rate limits are usually the binding constraint long before the AI service or database becomes one.