Kubernetes is the easy part. Production Kubernetes is the work.
Standing up a managed Kubernetes cluster takes an hour. Running production workloads on it for two years, surviving cluster upgrades, passing compliance audits, and handling a 3am incident takes engineering practice most teams have not built yet.
We have built and migrated Kubernetes platforms for healthcare engineering teams running clinical workloads, defense subcontractors with CUI in their clusters, and B2B SaaS companies under SOC 2 scrutiny. The pattern is the same in every engagement: the cluster is fine, the workloads run, and the platform is one upgrade or one incident away from a bad weekend.
Production Kubernetes is a set of decisions made consistently: namespace isolation, network policy, pod security standards, image admission, secrets handling, GitOps discipline, and operational runbooks. We make those decisions with you, ship them as code, and hand back a platform your team can run.
What a production-grade cluster actually looks like.
The architecture baseline that survives audits and reduces 3am pages rests on the following decisions. Each is shipped as code in every engagement.
- Namespace isolation: One namespace per team or workload domain. ResourceQuotas, LimitRanges, and NetworkPolicies enforced at the namespace boundary.
- Network policy default-deny: Calico, Cilium, or platform-native policy with default-deny ingress and egress. Explicit allow rules per dependency.
- Pod Security Standards: Restricted profile enforced cluster-wide. Baseline profile only where a specific workload constraint requires it, with a documented exception.
- Image admission control: Kyverno or OPA Gatekeeper rejecting unsigned images, images from untrusted registries, and images whose known vulnerabilities exceed the agreed CVE severity threshold.
- Secrets management: External Secrets Operator backed by a cloud secret store (AWS Secrets Manager, Google Cloud Secret Manager, Azure Key Vault). No long-lived secrets in cluster YAML.
- GitOps deployment: Argo CD or Flux as the only path to production. Every cluster state derives from a Git commit; manual kubectl edits are an exception, not a workflow.
- Observability stack: Prometheus, OpenTelemetry, and Loki, or equivalents. Cluster, node, and workload-level dashboards with SLO-aligned alerting.
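To make "shipped as code" concrete, here is a minimal Python sketch of the namespace portion of such a baseline. It is an illustration under stated assumptions, not the manifests from any engagement: the function names, the team name, and every quota and limit value are hypothetical. The Pod Security Admission label keys and the ResourceQuota and LimitRange field names are the standard Kubernetes ones.

```python
import json


def namespace_baseline(team: str, cpu: str = "8", memory: str = "16Gi") -> list[dict]:
    """Build Namespace, ResourceQuota, and LimitRange manifests for one team.

    The cpu and memory defaults are placeholder values, not recommendations.
    """
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": team,
            "labels": {
                # Pod Security Admission: enforce, audit, and warn on Restricted.
                "pod-security.kubernetes.io/enforce": "restricted",
                "pod-security.kubernetes.io/audit": "restricted",
                "pod-security.kubernetes.io/warn": "restricted",
            },
        },
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{team}-quota", "namespace": team},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
            }
        },
    }
    limit_range = {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": f"{team}-defaults", "namespace": team},
        "spec": {
            "limits": [
                {
                    "type": "Container",
                    # Applied to containers that declare no limits or requests.
                    "default": {"cpu": "500m", "memory": "512Mi"},
                    "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
                }
            ]
        },
    }
    return [namespace, quota, limit_range]


def render(manifests: list[dict]) -> str:
    """Wrap the manifests in a v1 List and serialize them as JSON."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2)


if __name__ == "__main__":
    print(render(namespace_baseline("payments")))
```

Because the output is plain JSON, it can be committed to Git and reconciled by Argo CD or Flux like any other manifest. Generating it from one function keeps every namespace on the same baseline.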
How we move production workloads without breaking them.
The standard Stonebridge migration uses a parallel-environment approach. We build the new Kubernetes platform alongside the existing environment, replay traffic to validate behavior, and cut over once we are confident. Not before.
Phase 1: Assess and inventory
We catalog every workload, dependency, and external integration. We capture traffic patterns, resource consumption, and operational runbooks. The output is a migration roadmap with named risks and explicit owners.
Phase 2: Build the parallel platform
The new Kubernetes platform is built fresh, in code, with the architecture baseline above. It runs in parallel with the existing production environment, with its own DNS, observability, and on-call posture.
Phase 3: Replay and shadow
Traffic is replayed against the new platform from production logs. Workloads run in shadow (receiving real traffic, producing real responses, but not serving customers) until the response surface is provably identical.
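The comparison step of shadowing can be sketched in a few lines of Python. This is a simplified illustration, not the tooling used in any engagement. It assumes each response has already been captured as a dict with `status` and `body` keys, and that the fields in `VOLATILE_FIELDS` (a hypothetical list) are expected to differ between environments.

```python
# Hypothetical set of fields that legitimately differ between environments.
VOLATILE_FIELDS = {"timestamp", "request_id", "trace_id"}


def normalize(value):
    """Recursively drop volatile fields so only meaningful differences remain."""
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items() if k not in VOLATILE_FIELDS}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    return value


def responses_match(legacy: dict, shadow: dict) -> bool:
    """Compare status codes and normalized bodies of two captured responses."""
    return (
        legacy["status"] == shadow["status"]
        and normalize(legacy["body"]) == normalize(shadow["body"])
    )


def mismatch_rate(pairs) -> float:
    """Fraction of (legacy, shadow) response pairs that differ."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    mismatches = sum(1 for legacy, shadow in pairs if not responses_match(legacy, shadow))
    return mismatches / len(pairs)
```

A cutover gate built on this would require `mismatch_rate` to stay at zero over a representative window. The field list needs care: ignoring too many fields can hide a real regression.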
Phase 4: Cut over and decommission
Cutover is a DNS or load balancer change, not a forklift event. Rollback is one command. We do not decommission the old environment until the new platform has run customer traffic without incident for a defined burn-in period.
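One way to make the burn-in rule mechanical is a small gate like the Python sketch below. It assumes the burn-in clock restarts after any incident, which is one reasonable reading of "without incident" rather than a stated policy. The function name and the 30-day figure in the usage example are illustrative.

```python
from datetime import datetime, timedelta


def may_decommission(
    cutover_at: datetime,
    incidents: list[datetime],
    now: datetime,
    burn_in: timedelta,
) -> bool:
    """Return True once a full burn-in period has passed with no incident.

    Assumption: the clock starts at cutover and restarts at the most recent
    incident that occurred after cutover.
    """
    relevant = [t for t in incidents if cutover_at <= t <= now]
    clock_start = max([cutover_at, *relevant])
    return now - clock_start >= burn_in


if __name__ == "__main__":
    cutover = datetime(2024, 1, 1)
    print(may_decommission(cutover, [], datetime(2024, 1, 31), timedelta(days=30)))
```

Keeping the rule in code means the decision to delete the old environment does not depend on anyone's memory of when the last incident happened.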
Five patterns that burn weekends and fail audits.
We see the same mistakes repeatedly when teams build Kubernetes platforms without architectural help. None are about not knowing Kubernetes. They are about not knowing how to make the platform survive a year of production load.
One cluster, hard multi-tenant, no isolation
Production, staging, and dev sharing a cluster with namespace separation only. The first incident that takes the cluster down takes all three environments with it. Separate clusters for separate trust boundaries.
Privileged pods because the chart asked for them
Helm charts from upstream often default to running as root or asking for hostPath mounts. Accepting those defaults turns the cluster into a single trust boundary. We fork or replace charts that cannot run under the Restricted pod security profile.
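A quick way to see whether a rendered chart can run under the Restricted profile is to inspect its pod spec. The Python sketch below checks only a subset of the profile (host namespaces, hostPath volumes, privileged mode, privilege escalation, non-root, dropped capabilities). It does not replace Pod Security Admission or a policy engine, and the function name is hypothetical.

```python
def restricted_violations(pod_spec: dict) -> list[str]:
    """Return human-readable violations for a subset of the Restricted profile."""
    problems = []

    # Host namespaces are not allowed.
    for flag in ("hostNetwork", "hostPID", "hostIPC"):
        if pod_spec.get(flag):
            problems.append(f"{flag} is enabled")

    # hostPath volumes are not an allowed volume type.
    for volume in pod_spec.get("volumes", []):
        if "hostPath" in volume:
            problems.append(f"volume {volume.get('name')} uses hostPath")

    pod_sc = pod_spec.get("securityContext", {})
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext", {})
        if sc.get("privileged"):
            problems.append(f"container {name} is privileged")
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"container {name} allows privilege escalation")
        # runAsNonRoot may be set on the container or inherited from the pod.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            problems.append(f"container {name} may run as root")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            problems.append(f"container {name} does not drop ALL capabilities")
    return problems
```

Running a check like this against `helm template` output in CI flags a chart that needs forking before it reaches the cluster.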
Default-allow network policy
Many teams install a CNI with policy support and then never write a policy. Egress from any pod can reach anywhere. Start with default-deny and add explicit allows. Not the other way around.
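As an illustration of "default-deny first, explicit allows second", the Python sketch below generates a deny-all NetworkPolicy for a namespace and a pair of allow rules for one dependency. The helper names, the `app` label convention, and the example port are assumptions. The manifest fields are those of the standard `networking.k8s.io/v1` NetworkPolicy API.

```python
import json


def default_deny(namespace: str) -> dict:
    """Select every pod in the namespace and allow no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_dependency(namespace: str, client: str, server: str, port: int) -> tuple[dict, dict]:
    """Allow client -> server on one TCP port.

    Under default-deny both directions must be opened: egress on the client
    and ingress on the server. Pods are matched by a hypothetical `app` label.
    """
    ports = [{"protocol": "TCP", "port": port}]
    egress = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{client}-egress-to-{server}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": client}},
            "policyTypes": ["Egress"],
            "egress": [{"to": [{"podSelector": {"matchLabels": {"app": server}}}], "ports": ports}],
        },
    }
    ingress = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{server}-ingress-from-{client}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": server}},
            "policyTypes": ["Ingress"],
            "ingress": [{"from": [{"podSelector": {"matchLabels": {"app": client}}}], "ports": ports}],
        },
    }
    return egress, ingress


if __name__ == "__main__":
    policies = [default_deny("payments"), *allow_dependency("payments", "api", "db", 5432)]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": policies}, indent=2))
```

Note that a deny-all egress policy also blocks DNS lookups, so a real policy set needs an explicit allow for the cluster DNS service as well.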
Manual kubectl in production
A platform managed by kubectl is unauditable, irreversible, and unfixable at 3am. GitOps is the audit trail and the recovery story rolled into one. Manual operations are exceptions, not workflows.
No documented upgrade path
Kubernetes releases a new minor version roughly every four months. A platform with no documented, rehearsed upgrade process accumulates technical debt until an upgrade is impossible without a re-platform. We ship a documented upgrade runbook with every engagement.
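Part of why skipped upgrades compound is that control-plane upgrades are expected to move one minor version at a time, so a cluster three releases behind needs three rehearsed upgrades, not one. The Python sketch below lists those steps. The function name is hypothetical, and it assumes versions are written as `major.minor` within the same major version.

```python
def upgrade_path(current: str, target: str) -> list[str]:
    """List each minor version to step through from current to target.

    Assumes 'major.minor[.patch]' strings within a single major version.
    """
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than the current version")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


if __name__ == "__main__":
    # Three minors behind means three sequential upgrades.
    print(upgrade_path("1.27", "1.30"))  # ['1.28', '1.29', '1.30']
```

An upgrade runbook would attach a rehearsal, an API deprecation check, and a rollback plan to each step in that list.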
Three ways to engage. Fixed scope, fixed price.
Most clients start with an audit, then move to a migration or build engagement. Some come straight to migration when a customer or compliance deadline is fixed.
Kubernetes Platform Audit
Two-week, fixed-fee assessment of your existing Kubernetes platform against security, reliability, and compliance baselines. Produces a written report with a prioritized remediation roadmap.
- 2 weeks duration
- Cluster + workload review
- Security baseline mapping
- Reliability + cost findings
- Prioritized roadmap
Kubernetes Migration
Eight-week hands-on engagement to migrate production workloads to a Kubernetes platform with zero customer-visible downtime. Parallel environments, traffic replay, documented cutover.
- 8 weeks duration
- Parallel-environment cutover
- Traffic replay + shadow workloads
- Documented rollback path
- 30-day post-cutover support
Kubernetes Platform Build
Ten-week hands-on engagement to architect and ship a production Kubernetes platform from scratch. Architecture baseline, GitOps, observability, runbooks, and on-call training included.
- 10 weeks duration
- Architecture baseline shipped
- GitOps + admission control
- Observability stack
- On-call training + runbooks
A representative engagement.
One of our most recent Kubernetes engagements was a full migration for a repeat client moving production workloads from a legacy environment onto Kubernetes with strict isolation requirements.
Full Kubernetes migration with zero production downtime.
The client had outgrown their existing platform. Workloads needed namespace isolation, network policies, and resource limits to satisfy their compliance posture, and a previous attempt at migration had stalled out without a clear rollback path.
We scoped the work, built the manifests, set up namespace isolation, network policies, and pod-level resource limits, and migrated workloads using the parallel-environment pattern. We walked the internal team through the architecture and trained them on the runbooks before decommissioning the old environment.
Zero customer-visible downtime during cutover. The client has since come back for additional engagements.