Service · S/03 Platform

Production Kubernetes, built for regulated workloads.

We architect, migrate, and operate production Kubernetes platforms for healthcare, defense, federal, and B2B SaaS teams. Namespace isolation, network policies, pod security, signed-artifact admission, and zero-downtime migrations across EKS, GKE, AKS, and OKE, including their regulated variants.

Engagement: Fixed-fee build, migration, or audit
Timeline: 2 weeks (audit) · 8 weeks (migration)
Platforms: EKS · GKE · AKS · OKE, incl. GovCloud · AW
Deployment: GitOps · Argo CD · Flux · Helm
01 — The Problem

Kubernetes is the easy part. Production Kubernetes is the work.

Standing up a managed Kubernetes cluster takes an hour. Running production workloads on it for two years, surviving cluster upgrades, passing compliance audits, and handling a 3am incident takes engineering practice most teams have not built yet.

We have built and migrated Kubernetes platforms for healthcare engineering teams running clinical workloads, defense subcontractors with CUI in their clusters, and B2B SaaS companies under SOC 2 scrutiny. The pattern is the same in every engagement: the cluster is fine, the workloads run, and the platform is one upgrade or one incident away from a bad weekend.

Production Kubernetes is a set of decisions made consistently: namespace isolation, network policy, pod security standards, image admission, secrets handling, GitOps discipline, and operational runbooks. We make those decisions with you, ship them as code, and hand back a platform your team can run.

02 — Architecture Baseline

What a production-grade cluster actually looks like.

The architecture baseline that survives audits and reduces 3am pages rests on the following decisions. Each is shipped as code in every engagement.

  • Namespace isolation: One namespace per team or workload domain. ResourceQuotas, LimitRanges, and NetworkPolicies enforced at the namespace boundary.
  • Network policy default-deny: Calico, Cilium, or platform-native policy with default-deny ingress and egress. Explicit allow rules per dependency.
  • Pod Security Standards: Restricted profile enforced cluster-wide. Baseline profile only where a specific workload constraint requires it, with a documented exception.
  • Image admission control: Kyverno or OPA Gatekeeper rejecting unsigned images, images from untrusted registries, and images whose known vulnerabilities exceed the allowed CVE severity threshold.
  • Secrets management: External Secrets Operator backed by a cloud secrets manager (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault). No long-lived secrets in cluster YAML.
  • GitOps deployment: Argo CD or Flux as the only path to production. Every cluster state derives from a Git commit; manual kubectl edits are an exception, not a workflow.
  • Observability stack: Prometheus / OpenTelemetry / Loki or equivalent. Cluster, node, and workload-level dashboards with SLO-aligned alerting.
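
As a rough illustration of what "shipped as code" can mean for the first three decisions, the sketch below builds three manifests for one namespace: the Namespace itself with Pod Security Admission labels for the Restricted profile, a default-deny NetworkPolicy, and a ResourceQuota. The function name, the `payments` namespace, and the quota figures are hypothetical placeholders, not values from an engagement; in practice these would be templated through Helm or Kustomize and tuned per workload.

```python
import json


def namespace_baseline(name: str, cpu: str = "8", memory: str = "16Gi") -> list[dict]:
    """Return baseline manifests for one namespace (illustrative values only)."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                # Pod Security Admission: reject pods that violate the Restricted profile.
                "pod-security.kubernetes.io/enforce": "restricted",
                "pod-security.kubernetes.io/audit": "restricted",
            },
        },
    }
    default_deny = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": name},
        "spec": {
            "podSelector": {},  # empty selector matches every pod in the namespace
            # Both directions are listed with no allow rules, so all traffic is denied.
            "policyTypes": ["Ingress", "Egress"],
        },
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "compute", "namespace": name},
        "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory}},
    }
    return [namespace, default_deny, quota]


# Print the manifests as JSON, which kubectl accepts alongside YAML.
for manifest in namespace_baseline("payments"):
    print(json.dumps(manifest, indent=2))
```

Explicit allow rules per dependency would then be added as further NetworkPolicy objects on top of the default-deny policy.
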
03 — Zero-Downtime Migration Pattern

How we move production workloads without breaking them.

The standard Stonebridge migration uses a parallel-environment approach. We build the new Kubernetes platform alongside the existing environment, replay traffic to validate behavior, and cut over once we are confident. Not before.

Assess → Build Parallel → Replay Traffic → Shadow Workloads → Cut Over → Decommission

Phase 1: Assess and inventory

We catalog every workload, dependency, and external integration. We capture traffic patterns, resource consumption, and operational runbooks. The output is a migration roadmap with named risks and explicit owners.

Phase 2: Build the parallel platform

The new Kubernetes platform is built fresh, in code, with the architecture baseline above. It runs in production, in parallel with the existing environment, with its own DNS, observability, and on-call posture.

Phase 3: Replay and shadow

Traffic is replayed against the new platform from production logs. Workloads run in shadow (receiving real traffic, producing real responses, but not serving customers) until the response surface is provably identical.
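
One way to make an "identical response surface" measurable is to compare each request's response from the old and new platforms after stripping fields that legitimately differ between environments. The sketch below assumes the responses have already been captured as simple records; the `Response` type, the `VOLATILE_HEADERS` set, and the sample data are hypothetical and stand in for whatever capture tooling is actually used.

```python
from dataclasses import dataclass, field

# Headers expected to differ between environments (assumed list; tune per service).
VOLATILE_HEADERS = {"date", "x-request-id", "server"}


@dataclass
class Response:
    status: int
    body: str
    headers: dict = field(default_factory=dict)


def stable_headers(headers: dict) -> dict:
    """Lower-case header names and drop the ones expected to vary."""
    return {k.lower(): v for k, v in headers.items() if k.lower() not in VOLATILE_HEADERS}


def diff_responses(old: Response, new: Response) -> list[str]:
    """Describe how the new platform's response differs from the old one."""
    diffs = []
    if old.status != new.status:
        diffs.append(f"status: {old.status} != {new.status}")
    if old.body != new.body:
        diffs.append("body differs")
    if stable_headers(old.headers) != stable_headers(new.headers):
        diffs.append("headers differ")
    return diffs


def mismatch_rate(pairs: list[tuple[Response, Response]]) -> float:
    """Fraction of (old, new) pairs with at least one difference."""
    if not pairs:
        return 0.0
    mismatched = sum(1 for old, new in pairs if diff_responses(old, new))
    return mismatched / len(pairs)


pairs = [
    # Same response apart from a volatile header: counts as a match.
    (Response(200, "ok", {"Date": "Mon"}), Response(200, "ok", {"Date": "Tue"})),
    # Different status and body: counts as a mismatch.
    (Response(200, "ok"), Response(500, "error")),
]
print(mismatch_rate(pairs))  # 0.5
```

A cutover gate would then require the mismatch rate to stay at zero, or within an agreed tolerance, over a representative window of traffic.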

Phase 4: Cut over and decommission

Cutover is a DNS or load balancer change, not a forklift event. Rollback is one command. We do not decommission the old environment until the new platform has run customer traffic without incident for a defined burn-in period.
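
The burn-in rule can be reduced to a small gate. In the sketch below, the old environment becomes eligible for decommissioning only once the new platform has gone a full burn-in window without an incident. The function name and the 14-day window are hypothetical, and restarting the clock after an incident is an assumption of this sketch; the actual period and policy are defined per engagement.

```python
from datetime import datetime, timedelta


def can_decommission(
    cutover_at: datetime,
    incidents: list[datetime],
    now: datetime,
    burn_in: timedelta = timedelta(days=14),  # hypothetical window
) -> bool:
    """True once a full burn-in window has passed with no incident."""
    # Only incidents on the new platform since cutover are relevant.
    relevant = [t for t in incidents if cutover_at <= t <= now]
    # Assumption: each incident restarts the burn-in clock.
    clock_start = max([cutover_at, *relevant])
    return now - clock_start >= burn_in


cutover = datetime(2024, 1, 1)
today = datetime(2024, 1, 20)
print(can_decommission(cutover, [], today))                      # True
print(can_decommission(cutover, [datetime(2024, 1, 10)], today))  # False
```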

04 — Common Mistakes

Five patterns that burn weekends and fail audits.

We see the same mistakes repeatedly when teams build Kubernetes platforms without architectural help. None are about not knowing Kubernetes. They are about not knowing how to make the platform survive a year of production load.

  1. One cluster, hard multi-tenant, no isolation

    Production, staging, and dev sharing a cluster with namespace separation only. The first incident that takes the cluster down takes all three environments with it. Separate clusters for separate trust boundaries.

  2. Privileged pods because the chart asked for them

    Helm charts from upstream often default to running as root or asking for hostPath mounts. Accepting those defaults turns the cluster into a single trust boundary. We fork or replace charts that cannot run under the Restricted pod security profile.

  3. Default-allow network policy

    Many teams install a CNI with policy support and then never write a policy. Egress from any pod can reach anywhere. Start with default-deny and add explicit allows. Not the other way around.

  4. Manual kubectl in production

    A platform managed by kubectl is unauditable, irreversible, and unfixable at 3am. GitOps is the audit trail and the recovery story rolled into one. Manual operations are exceptions, not workflows.

  5. No documented upgrade path

    Kubernetes ships a new minor version roughly every four months. A platform with no documented, rehearsed upgrade process accumulates technical debt until an upgrade is impossible without a re-platform. We ship a documented upgrade runbook with every engagement.
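
Part of any upgrade runbook is the sequence of versions to step through. Control-plane upgrades are generally performed one minor version at a time, so a cluster that has fallen several releases behind needs several rehearsed hops rather than one jump. A minimal sketch of that planning step follows; the function name and the example versions are hypothetical.

```python
def upgrade_hops(current: str, target: str) -> list[str]:
    """List the minor versions to step through, one hop at a time."""
    cur_major, cur_minor = (int(p) for p in current.split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("major-version changes are out of scope for this sketch")
    if tgt_minor < cur_minor:
        raise ValueError("downgrades are not supported")
    # Every intermediate minor version is its own upgrade.
    return [f"{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


hops = upgrade_hops("1.27", "1.30")
print(hops)  # ['1.28', '1.29', '1.30']
```

Each hop in the list is a separate rehearsal in a non-production cluster before it is repeated in production.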

05 — Engagement

Three ways to engage. Fixed scope, fixed price.

Most clients start with an audit, then move to a migration or build engagement. Some come straight to migration when a customer or compliance deadline is fixed.

06 — Recent Work

A representative engagement.

One of our most recent Kubernetes engagements was a full migration for a repeat client moving production workloads from a legacy environment onto Kubernetes with strict isolation requirements.

REPEAT CLIENT · MIGRATION

Full Kubernetes migration with zero production downtime.

The client had outgrown their existing platform. Workloads needed namespace isolation, network policies, and resource limits to satisfy their compliance posture, and a previous attempt at migration had stalled out without a clear rollback path.

We scoped the work cleanly, built out the manifests, set up namespace isolation, network policies, and pod-level resource limits, and migrated workloads using the parallel-environment pattern. The internal team was walked through the architecture and trained on the runbooks before we decommissioned the old environment.

Zero customer-visible downtime during cutover. The client has since come back for additional engagements.

Zero: Customer-visible downtime during migration
Full: Namespace isolation + network policies enforced
Repeat: Client returned for additional engagements
07 — Questions

Frequently asked, directly answered.

Q/01 Which managed Kubernetes platforms do you support?
Amazon EKS, Google GKE (Standard and Autopilot), Azure AKS, and Oracle OKE. Also their regulated variants: EKS on GovCloud, GKE on Assured Workloads, and AKS on Azure Government. The platform pattern is largely the same across all four; the differences are in identity integration, networking primitives, and the regulated overlays.
Q/02 Can you migrate us to Kubernetes without production downtime?
Yes. That is the standard expectation for a Stonebridge migration. We use a parallel-environment approach: build the Kubernetes platform alongside the existing environment, replay traffic, run shadow workloads, and cut over via DNS or load balancer once the new platform is proven stable. Our most recent K8s migration moved a production workload with zero customer-visible downtime.
Q/03 Should we use Autopilot / Fargate or manage our own nodes?
Depends on the workload profile and compliance posture. For HIPAA and lower FedRAMP impact levels, GKE Autopilot and EKS Fargate eliminate a class of patching and node-security concerns and we frequently recommend them. For workloads needing GPU access, custom kernels, or specific compliance overlays, managed node pools are usually the right call. We assess and recommend rather than defaulting.
Q/04 How do you handle multi-tenancy and namespace isolation?
Hard multi-tenancy with namespace isolation, network policies, resource quotas, pod security standards, and admission control. Where the compliance posture requires it (PHI, CUI, regulated data), we recommend cluster-level isolation rather than namespace isolation. We make the call based on the data classification, not on cost optimization.
Q/05 Do you do GitOps, or imperative deployments?
GitOps for almost every engagement, using Argo CD or Flux. The audit trail, the rollback story, and the disaster-recovery posture are all materially better with GitOps. Imperative workflows survive only where a legacy constraint forces them, and we will tell you when that is the case.
Q/06 Can you operate the platform after handoff, or is this build-only?
Both. Build engagements are fixed fee with on-call training and 30 days of post-handoff support. Operational retainers are available for clients who need ongoing capacity: upgrade cadences, cluster lifecycle, on-call rotation, and incident response. Most clients combine the two.
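
The GitOps answer in Q/05 rests on one idea: the cluster should match what Git declares, and any difference is drift to be reconciled or investigated. Argo CD and Flux do this comparison themselves; the sketch below only illustrates the concept on plain dictionaries. The function name, the image reference, and the sample objects are hypothetical.

```python
def find_drift(desired: dict, live: dict, path: str = "") -> list[str]:
    """Return dotted paths where the live object differs from the Git-declared one."""
    drift = []
    for key, want in desired.items():
        here = f"{path}.{key}" if path else key
        if key not in live:
            drift.append(f"{here}: missing in cluster")
        elif isinstance(want, dict) and isinstance(live[key], dict):
            # Recurse into nested objects such as spec and template.
            drift.extend(find_drift(want, live[key], here))
        elif want != live[key]:
            drift.append(f"{here}: want {want!r}, got {live[key]!r}")
    return drift


# Declared in Git.
desired = {"spec": {"replicas": 3, "template": {"image": "registry.example.com/api:1.4.2"}}}
# Observed in the cluster after someone scaled the workload by hand.
live = {
    "spec": {"replicas": 5, "template": {"image": "registry.example.com/api:1.4.2"}},
    "status": {"readyReplicas": 5},
}
print(find_drift(desired, live))  # ['spec.replicas: want 3, got 5']
```

Only fields declared in Git are checked, so server-populated fields such as `status` are ignored rather than reported as drift.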

Ship to production. Survive the upgrade.

Most discovery calls take 30 minutes. We come back with a written proposal within 48 hours. If we are not the right fit for the engagement, we will tell you in the first call and point you somewhere that is.

Book a 30-minute call