Nguyen Nhat Minh

DevOps Engineer & Cloud Solution Architect

📍 Vietnam

Azure · AWS · GCP · Kubernetes · Terraform · CI/CD · Argo CD · Cilium · SRE

🧭 Profile Summary

I started in tech as an IT support engineer in 2017 and moved through several IT roles before stepping into cloud solution architecture. I now work as a DevOps Engineer and Cloud Solution Architect, designing, migrating, and modernizing cloud-native platforms across Azure, AWS, and GCP.

Eight years building and running production cloud infrastructure across Germany, Singapore, Malaysia, Australia, Hong Kong, Japan, Korea, the US, and India. Strong on Kubernetes/AKS, GitOps, Terraform, and CI/CD — with a track record of measurable wins: cost down 35%, security breaches down 40%, troubleshooting time down 70%.

📌 Quick Highlights

Current Role

Senior DevOps Engineer @ NTT DATA — Automobile project, Germany

Experience

8+ years · IT Support → Cloud → DevOps & Solution Architect

Cloud Focus

Azure (primary) · AWS · GCP · multi-cloud migration & modernization

Platform Stack

Kubernetes, Argo CD (GitOps), Cilium, Gateway API, Terraform, Azure DevOps

Top Certifications

CKA · Azure Solutions Architect Expert · Azure Security Engineer

Key Numbers

Cost ↓35% · Security breaches ↓40% · Troubleshooting time ↓70% · ~10h/week saved

🎤 Self-Pitch

I'm a DevOps Engineer and Cloud Solution Architect with 8+ years of experience, starting from IT support and growing into designing and running production cloud-native platforms across Azure, AWS, and GCP. I specialize in Kubernetes/AKS, GitOps with Argo CD, Terraform, and CI/CD. I've delivered measurable results — cutting infrastructure cost by 35%, reducing security breaches by 40%, and dropping troubleshooting time by 70% through centralized observability. I work across regions and time zones (Germany, Singapore, Australia, Japan, Korea, US, India) and I'm comfortable owning a platform end to end, from architecture and IaC to day-2 operations and incident response.

Who I am

DevOps + Solution Architect, 8+ yrs, multi-cloud, hands-on through to design.

What I do best

AKS/Kubernetes platform, GitOps, Terraform, CI/CD, SRE & cost optimization.

Why me

Proven numbers, experience running regulated workloads (banking/finance/healthcare), strong written and spoken English.

⭐ STAR Stories

AKS HA Upgrade — Zero-downtime production upgrade

Situation

A production AKS cluster running a customer-facing API was several minor versions behind and hitting deprecated-API warnings. Upgrade was blocked because nobody wanted downtime on a 24/7 service.

Task

Plan and execute the upgrade with no user-visible downtime and a clean rollback path.

Action

Audited deprecated APIs with kubent and checked Helm charts before touching anything.
Upgraded control plane first, then node pools, keeping version skew within 2 minors.
Used a blue-green node pool: spun up a new pool on the target version, cordoned the old pool, drained gradually while watching metrics.
Paired drains with sane PDBs (minAvailable, replicas ≥ 2) and surge upgrade so capacity never dipped.

Result

Upgrade completed with zero downtime, no failed requests during drains, and the old pool removed cleanly. The runbook became the team's standard for future upgrades.

Central Logging — Troubleshooting time −70%

Situation

Performance incidents dragged on because logs and metrics were scattered across services and tools — engineers spent more time finding data than fixing problems.

Task

Make incident context instantly available so engineers can diagnose fast.

Action

Rolled out centralized logging + monitoring with standardized dashboards per service.
Normalized alert rules so each alert answers: what broke, where, and the likely cause.
Wired KQL queries on ContainerLogV2 for fast per-namespace/pod tracing.

Result

70% reduction in troubleshooting time for performance issues — engineers open a dashboard and see context immediately.

Cost Optimization — Infra cost −35%

Situation

A Vietnam-customer project was overspending on cloud infrastructure with oversized and idle resources.

Task

Bring cost down without hurting performance or reliability.

Action

Ran Cost Analysis broken down by resource type and tag to find top spenders — orphan disks, idle VMs, oversized gateways.
Right-sized requests/limits and node pools, applied autoscaling, and bought Reserved Instances for stable workloads.
Removed redundant firewall rules and double-billed logging, and built a cost dashboard for the team.

Result

35% lower infrastructure cost through better resource allocation, plus a dashboard so the team keeps the discipline long-term.

DR — 12 months of 99.9% SLA (Healthcare)

Situation

A healthcare client needed guaranteed business continuity with strict compliance and a 99.9% SLA.

Task

Design and prove a DR strategy that actually works under failure, not just on paper.

Action

Wrote PowerShell automation for snapshots, cross-region replication, failover validation, and smoke tests.
Built per-scenario runbooks: region down, DB corruption, cert expiry, account lockout.
Ran real monthly DR drills and auto-sent drill reports to stakeholders.

Result

12 consecutive months at 99.9% SLA, recovery time cut 40%, clean compliance audits. The lesson: drill for real, measure for real.

GitOps Migration — Argo CD on the Automobile project

Situation

Continuous delivery relied on pipelines writing straight into the cluster — hard to audit, easy to drift, broad security blast radius.

Task

Move to a Git-as-source-of-truth model with auditability and safe rollbacks.

Action

Introduced Argo CD with an App-of-Apps structure; cluster state lives entirely in a config repo.
Separated the config repo from source code so an image-updater bot commits tags without pipeline loops.
Removed cluster write access from CI — pipelines only build and push images.
Added sync waves/hooks for ordering and Argo Rollouts for canary releases.

Result

Every change now goes through PR review with full history; rollback is a git revert. Drift is auto-detected and healed, and the security blast radius of CI dropped sharply.

🛠️ Technical Skills

Cloud Technologies

Azure · AWS · Google Cloud

Scripting & Programming

Python · Bash · PowerShell · JavaScript

DevOps Tools

Terraform · Azure DevOps · Docker · Kubernetes · CI/CD

Monitoring & Management

Grafana · Prometheus · Datadog

Databases

SQL Database · Kusto (KQL)

Other

Active Directory · Power Apps · Dynamics 365

🎖️ Certifications

✓ CKA — Certified Kubernetes Administrator

✓ Microsoft Certified: Azure Solutions Architect Expert

✓ Microsoft Certified: Azure Security Engineer Associate

✓ AZ-303 Microsoft Azure Architect Technologies

✓ Microsoft Certified: Azure Administrator Associate

✓ MB-900 Microsoft Dynamics Fundamentals

✓ MS-900 Microsoft 365 Fundamentals

✓ AZ-900 Microsoft Azure Fundamentals

✓ PL-900 Power Platform Fundamentals

🎓 Education

Bachelor's Degree — Information Assurance

FPT University

Practical Bachelor's Degree — Information Technology

College of Technology and Tourism

🌐 Languages

English — Advanced Proficiency

Korean — Upper Beginner

💼 Work Experience

Senior DevOps Engineer · NTT DATA · Dec 2025 – Present

Automobile Project — Germany

Kubernetes · Argo CD · Cilium · Gateway API · CI/CD

Lead efforts to migrate and modernize cloud infrastructure, moving workloads to new resources while enhancing scalability, performance, and security to support future growth.
Enhance and optimize DevOps pipelines, introducing automation improvements, refining CI/CD workflows, and reducing deployment time to accelerate delivery cycles.
Work extensively with Kubernetes, integrating advanced tooling: Argo CD for GitOps-based continuous delivery, Cilium for secure and efficient networking, and Gateway API for modern traffic routing and service management.
Collaborate with cross-functional teams to ensure seamless integration, improved observability, and resilient cloud-native operations across environments.

FPT Software Journey

Solution Architect · FPT Software · Apr 2025 – Dec 2025

Energy Project — Germany

Designed and deployed a disaster recovery strategy to ensure business continuity in catastrophic failures and reduce potential downtime.
Collaborated with development teams to define and enforce service-level objectives (SLOs) for critical applications.
Led containerization of applications using Kubernetes, streamlining deployment processes and decreasing time to market.
Developed and maintained runbooks and documentation for common incidents, enabling faster onboarding and improving knowledge sharing.

Lead DevOps Engineer · FPT Software · Oct 2023 – May 2025

Integration Project — Singapore, Malaysia & Australia Customer

Built integration systems for customers that run on a daily schedule.
Planned system requirements with clients, wrote Python automation scripts for data transfer to ERP systems, and provisioned Azure resources using Terraform.
Built DevOps pipelines, set up network services, conducted API and performance testing, and configured SFTP servers for secure data connections.
Reduced data transfer time by automating processes with Python scripts and provisioning Azure resources with Terraform.

L3 Engineer · FPT Software · Dec 2022 – May 2024

Managed Service Project — Hong Kong Customer

Planned and discussed system requirements with clients, managed access and permissions, and optimized AWS and Azure cloud environments.
Oversaw server management, Active Directory, and system migration from AWS to Azure; collaborated with teams to provide solutions and resolve issues.

DevOps Engineer · FPT Software · Sep 2022 – May 2023

DevOps Project — Vietnam Customer

Designed high-level architecture, drafted proposals, and provisioned infrastructure based on client requirements.
Managed access, deployed applications on Kubernetes, wrote Dockerfiles and CI/CD pipelines in Azure DevOps, troubleshot systems, and collaborated with dev teams on solutions and bug fixes.
Optimized cloud infrastructure costs by 35% through efficient resource allocation and utilization.
Developed custom scripts for automated server provisioning, saving an average of 10 hours per week on repetitive tasks.

Solution Architect · FPT Software · Sep 2022 – Feb 2023

Azure Re-architect Project — Singapore Customer

Gathered requirements, designed high-level architecture, and provisioned Azure infrastructure for Dev, UAT, and Prod environments.
Managed access, security, and service integration while deploying applications and Kubernetes components, and developing monitoring tools and technical documentation.

Solution Architect · FPT Software · May 2022 – Jan 2023

Department Service Offering — Global Customer

Drafted migration proposal templates for offshore teams, covering architecture design, solution strategy, and cost management.
Engaged with multiple teams to estimate, strategize, and provide Q&A on key points, testing methods, and client-driven assumptions.

Early Career

Team Member · FPT Software · Apr 2022 – Jul 2022

AWS Connect Managed Service — Korea Customer

Managed the customer's AWS Connect system and reporting.
Monitored network performance metrics, proactively identifying and resolving issues to improve overall uptime.
Implemented Quality of Service (QoS) policies, improving network stability and prioritizing critical applications.
Conducted thorough security audits and implemented firewall rules, leading to a 40% decrease in security breaches and ensuring compliance with industry standards.

Site Reliability Engineer · FPT Software · Oct 2021 – Feb 2022

Technology & Solution Project — US & India Customer

Optimized client monitoring solutions, managed Azure Kubernetes Service, automated procedures, and ensured business continuity through DR planning.
Provided technical support, deployed SaaS applications, and implemented monitoring tools while documenting environments and delivering training for client teams.
Defined and enforced service-level objectives (SLOs) and established incident communication protocols for timely, transparent stakeholder updates.

Technical Lead · FPT Software · Oct 2020 – Feb 2022

Azure Kubernetes System Project — Asia Customer

Managed, troubleshot, and optimized client AKS systems and Azure cloud environments — resource deployment, security hardening, performance monitoring, and issue resolution with internal teams and Microsoft support.
Implemented automation, integrated with ServiceNow, developed technical documentation, and ensured compliance with security policies.
Designed and deployed a disaster recovery strategy ensuring business continuity in catastrophic failures.
Implemented centralized logging and monitoring, resulting in a 70% reduction in troubleshooting time for performance-related issues.

Cloud Operator · FPT Software · Dec 2020 – Mar 2021

Cloud Infrastructure Project — Japanese Company

Managed operation and administration of customer systems on AWS (EC2, ECR, EKS, S3, Load Balancer, CDN).
Deployed security patches and software release modules for the customer's system.
Collaborated with cross-functional teams to plan and execute network capacity upgrades, accommodating rapid growth and avoiding bottlenecks.

Technical Support / Cloud Interaction Engineer · Tek-Experts / Microsoft · Sep 2019 – Oct 2020

Microsoft Global Customer

Built Model-Driven Applications using Microsoft Power Apps; worked directly with Azure SQL Database, Common Data Service, and data integration.
Troubleshot and fixed the Dynamics CRM platform and customizations across on-premises and cloud environments; integrated with Office 365 and Common Data Service.
Assisted customers with Dynamics 365 F&O and AX2012 (on-prem and cloud), including environment management and Azure AD application authentication for D365 F&O integration.

☸️ Kubernetes & AKS — Deep Dive

This is the technical area I have gone deepest into across the AKS projects at FPT Software and NTT DATA. The notes follow one rule: everything must be explainable down to the mechanism.

Pod Lifecycle & Scheduling

How a Pod gets scheduled

API Server: receives the request and validates it against the schema and admission controllers (ResourceQuota, MutatingWebhook, PodSecurity).
etcd: persists the Pod object with nodeName="" (not yet scheduled).
Scheduler: watches for Pods without a nodeName and runs two phases. Filter drops nodes that lack resources or do not match taints/affinity/nodeSelector; Score ranks the remaining nodes (least-allocated, image-locality). It then binds the Pod to the highest-scoring node.
Kubelet: on the target node, calls the CRI to pull the image and create the containers, attaches volumes, and sets up networking through the CNI plugin.
Probes: once the startup probe passes, liveness and readiness probes begin; when readiness passes, the endpoint controller adds the Pod IP to the Service.

Bonus points in an interview: mention topologySpreadConstraints for HA and priority-based preemption when nodes are full.
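
The Filter and Score phases described above can be sketched in a few lines. This is a simplified illustration, not the kube-scheduler implementation: the Node and Pod classes, their field names, and the scoring formula (raw free capacity left after placement rather than the real percentage-based least-allocated score) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int       # unrequested CPU, in millicores
    free_mem_mi: int      # unrequested memory, in MiB
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    cpu_request_m: int
    mem_request_mi: int
    tolerations: set = field(default_factory=set)


def filter_nodes(pod, nodes):
    """Filter phase: drop nodes that cannot run the pod at all."""
    return [
        n for n in nodes
        if n.free_cpu_m >= pod.cpu_request_m
        and n.free_mem_mi >= pod.mem_request_mi
        and n.taints <= pod.tolerations   # every taint must be tolerated
    ]


def score_node(pod, node):
    """Score phase (simplified): more capacity left after placement scores higher."""
    return (node.free_cpu_m - pod.cpu_request_m) + (node.free_mem_mi - pod.mem_request_mi)


def schedule(pod, nodes):
    """Return the chosen node name, or None if the pod would stay Pending."""
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None
    return max(feasible, key=lambda n: score_node(pod, n)).name
```

A pod that no node can fit returns None, which corresponds to the Pending state in the debugging table that follows.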

Debugging Pods by symptom

Symptom	First command	Common causes
Pending	`kubectl describe pod` → check Events	Insufficient CPU/memory, taint mismatch, pending PVC, no schedulable node
ImagePullBackOff	`kubectl describe pod` + check the pull secret	Wrong image tag, registry auth (imagePullSecret), ACR firewall
CrashLoopBackOff	`kubectl logs <pod> --previous`	App crash, wrong env var, missing config, OOM (exit code 137)
OOMKilled	`kubectl describe pod` → Reason: OOMKilled	Memory limit too low, leak, untuned JVM heap
Terminating (stuck)	`kubectl get pod -o yaml` → check finalizers	Finalizer not removed, volume detach failed

Resource Requests/Limits & QoS Class

Request: the minimum resources the scheduler reserves for the pod; used for scheduling decisions.
Limit: the ceiling. Exceeding the CPU limit means throttling; exceeding the memory limit means OOMKilled.
QoS class (derived automatically): Guaranteed → Burstable → BestEffort, from most to least protected. Under node pressure, eviction runs the other way: BestEffort pods go first and Guaranteed pods last.

Classic trap: "Why does the CPU request matter more than the CPU limit?" CPU is compressible (exceeding the limit throttles, never kills), while memory is incompressible (exceeding the limit triggers an immediate OOMKill). So the CPU request must be accurate because it drives scheduling and HPA, and the memory limit must be set and kept close to the request.
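
How the QoS class falls out of requests and limits can be shown with a small sketch. It follows the documented rules (Guaranteed needs CPU and memory limits on every container with requests equal to limits, BestEffort means nothing is set anywhere), but the function name and the dict shape are assumptions for illustration, and quantities are compared as plain strings, so "500m" and "0.5" would wrongly count as different.

```python
def qos_class(containers):
    """Derive the pod QoS class.

    containers: list of dicts such as
    {"requests": {"cpu": "500m", "memory": "256Mi"}, "limits": {...}}
    """
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An omitted request defaults to the limit, so only a missing
            # limit or an explicit mismatch breaks Guaranteed.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

Setting only limits still yields Guaranteed, while setting only a CPU request yields Burstable, which is why a pod with partial settings gets evicted earlier than expected.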

Pod Disruption Budget (PDB): lessons from the AKS upgrade

A PDB guarantees a minimum number of available pods during voluntary disruptions (node drain, cluster upgrade). It does not protect against node crashes or OOM kills.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb        # illustrative name
spec:
  minAvailable: 2      # or maxUnavailable: 1
  selector:
    matchLabels: { app: api }

Pitfall: minAvailable: 100% on a single-replica Deployment blocks drains forever. Always pair a PDB with replicas ≥ 2, and keep the replica count above minAvailable so at least one pod can be evicted.
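
The pitfall is easier to see as arithmetic. The sketch below computes how many pods a drain may evict at a given moment; the function name and parameters are hypothetical, and it assumes percentages round up, as the Kubernetes documentation describes for PDBs.

```python
import math


def allowed_disruptions(healthy_pods, expected_pods, min_available=None, max_unavailable=None):
    """Number of pods an eviction may remove right now.

    min_available / max_unavailable are ints or percentage strings like "50%".
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")

    def resolve(value):
        if isinstance(value, str) and value.endswith("%"):
            # Percentages are rounded up to a whole pod.
            return math.ceil(expected_pods * int(value[:-1]) / 100)
        return int(value)

    if min_available is not None:
        desired_healthy = resolve(min_available)
    else:
        desired_healthy = expected_pods - resolve(max_unavailable)
    return max(0, healthy_pods - desired_healthy)
```

With one replica and minAvailable: 100% the result is 0, so the drain never proceeds; the same happens with two replicas and minAvailable: 2.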

AKS Networking — CNI Modes

CNI Mode	Pod IP	Pros	Cons
Kubenet (legacy)	Internal CIDR (NAT through the node)	Uses few VNet IPs, quick setup	Lower performance, no Windows node pool support
Azure CNI	VNet CIDR (one VNet IP per pod)	High performance, routable from other VNets	Consumes VNet IPs (each node reserves ~30 IPs)
Azure CNI Overlay	Separate overlay CIDR	Saves VNet IPs, scales well beyond 1,000 nodes	Pods are not directly routable from other VNets
Azure CNI + Cilium (eBPF)	Overlay or VNet	Fastest network policy, eBPF observability	Added complexity, requires eBPF knowledge

Practical talking point: with several clusters in the same VNet, traditional Azure CNI exhausts IPs quickly. Choosing Overlay keeps the pod IP space separate, so only nodes consume VNet IPs.
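
A rough IP budget makes the exhaustion point concrete. The sketch assumes traditional Azure CNI reserves one address per node plus one per potential pod (30 pods per node, matching the figure in the table), that one extra surge node is needed during upgrades, and that Azure reserves 5 addresses in every subnet. The function names are hypothetical.

```python
def usable_ips(subnet_prefix):
    """Usable addresses in an Azure subnet (Azure reserves 5 per subnet)."""
    return 2 ** (32 - subnet_prefix) - 5


def vnet_ips_needed(nodes, max_pods_per_node=30, overlay=False, surge_nodes=1):
    """Rough VNet IP count for one node pool.

    Traditional Azure CNI: one IP per node plus one per potential pod.
    Overlay: only nodes take VNet IPs; pods live in a separate overlay CIDR.
    """
    per_node = 1 if overlay else 1 + max_pods_per_node
    return (nodes + surge_nodes) * per_node


if __name__ == "__main__":
    print(vnet_ips_needed(10), "IPs with traditional Azure CNI")
    print(vnet_ips_needed(10, overlay=True), "IPs with Overlay")
    print(usable_ips(24), "usable IPs in a /24")
```

Under these assumptions a 10-node pool already overflows a /24 with traditional Azure CNI, while the same pool on Overlay needs only a handful of addresses.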

Storage on AKS: Disk vs File vs Blob CSI

Azure Disk (RWO): for StatefulSets/databases; bound to a single zone, so cross-zone use needs a ZRS disk.
Azure File (RWX): shared config and media; NFS v4.1 gives the best performance.
Azure Blob CSI (RWX): object storage, ML data, backups.

On multi-zone clusters, always set volumeBindingMode: WaitForFirstConsumer so the disk is created in the same zone as the pod; a zone mismatch leaves the pod unable to mount it.

🏢 Real-world AKS scale: I run production AKS for demanding domains: e-commerce (campaign-driven traffic spikes) and banking/finance (compliance, audit, strict isolation). The sections below cover what I need to know in depth to run AKS in those environments.

Autoscaling — HPA / VPA / Cluster Autoscaler / KEDA

HPA · KEDA · Cluster Autoscaler · AKS

HPA: scales the replica count on CPU/memory or custom metrics (via Prometheus Adapter). The core mechanism for e-commerce web tiers that flex with load.
VPA: recommends or adjusts requests and limits. Do not combine HPA and VPA on the same metric; they will conflict.
Cluster Autoscaler: adds or removes nodes when pods are pending. On AKS I use the built-in autoscaler rather than Karpenter, which originated on AWS (AKS offers it separately as node auto-provisioning). Tune scale-down-unneeded-time and scale-down-delay-after-add.
KEDA: event-driven autoscaling on queue length (Service Bus, Storage Queue), Kafka lag, or custom metrics. A strong fit for e-commerce: scale workers on the number of orders waiting in the queue, and scale to zero when idle to save cost.
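
The scaling decisions above reduce to two small formulas. The first function follows the documented HPA rule, desired = ceil(current × currentMetric / targetMetric), with the default 10% tolerance. The second is a simplified model of queue-based scaling with scale-to-zero, not KEDA's actual implementation. Function names and default bounds are assumptions for illustration.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=10, tolerance=0.1):
    """HPA rule: scale proportionally to how far the metric is from its target."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas          # close enough, avoid flapping
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def queue_workers(queue_length, messages_per_worker, max_workers=20):
    """Queue-driven scaling: one worker per batch of messages, zero when idle."""
    if queue_length == 0:
        return 0
    return min(max_workers, math.ceil(queue_length / messages_per_worker))
```

For example, 4 replicas at 90% CPU against a 60% target scale to 6, and an empty order queue scales the workers to zero.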

Flash-sale spike pattern (e-commerce)

Pre-scale on a schedule (cron HPA / scheduled scaling) before the sale starts instead of waiting for metrics to react.
Overprovision with low-priority "pause" pods to keep nodes warm, so real pods schedule immediately without waiting for new nodes to boot.
Split node pools by workload (system / app / batch) so each autoscales independently.

Security & Identity: banking/finance grade

Workload Identity · Key Vault · RBAC · Azure Policy

Workload Identity (federated, preferred): pods obtain Azure AD tokens via OIDC with no secrets stored in the cluster. It replaces the legacy Pod Identity, needs no NMI/MIC, and scales better.
Secrets: Azure Key Vault + Secrets Store CSI Driver mounts secrets into pods, with auto-rotation enabled. No plaintext secrets in manifests or Git.
RBAC + AAD integration: AKS-managed Azure AD with permissions assigned by AAD group; least-privilege, namespace-scoped Roles instead of cluster-admin.
Private cluster: the API server sits behind a private endpoint and is not exposed to the internet, which is mandatory for banking. Access goes through a jumpbox, VPN, or Bastion.
Azure Policy for AKS (Gatekeeper/OPA): enforces no privileged containers, mandatory resource limits, image pulls only from trusted ACRs, and no hostPath.
Defender for Containers: runtime threat detection and image vulnerability scanning in ACR.

Network hardening inside the cluster

NetworkPolicy default-deny per namespace, with each flow explicitly allowlisted (Cilium/Calico).
Egress control: route outbound traffic through Azure Firewall / NAT Gateway with an FQDN allowlist; finance needs control over data exfiltration.
Private Link to PaaS services (SQL, Storage, Key Vault) so traffic never goes out to the internet.
mTLS service-to-service through a service mesh where internal zero-trust is required.

AKS Upgrade & Day-2 Operations

AKS · Node Pool · Helm

Control plane vs node pool: upgrade the control plane first, then the node pools; keep skew within 2 minor versions.
Surge upgrade: max-surge creates new nodes before old ones are cordoned and drained, giving zero downtime. Pair it with sensible PDBs.
Blue-green node pool: create a new pool on the new version, cordon the old pool, drain gradually, observe, then delete the old pool. The safe option for banking Prod.
Compatibility audit: scan for deprecated APIs before upgrading (kubent / kubectl deprecations) and check that Helm charts do not use APIs slated for removal.
Maintenance window & auto-upgrade channel: set a window and use a channel (patch/stable) for automatic security patching, but keep Prod upgrades controlled.

Multi-tenancy & Isolation

Banking/finance usually needs strong separation between teams, environments, and customers:

Level	How	When to use
Namespace	ResourceQuota + LimitRange + NetworkPolicy + RBAC per namespace	Soft multi-tenancy, internal teams
Node pool	Taint/toleration + nodeSelector, pools split by tenant/workload	Compute separation, avoiding noisy neighbors
Dedicated cluster	One cluster per tenant or per environment	High compliance, small blast radius; common in finance

Quota + LimitRange stop one tenant from consuming all resources; PriorityClass protects critical workloads (payment) under node pressure.

Reliability patterns for payment/critical services

PodTopologySpreadConstraints + pod anti-affinity: spread replicas across zones/nodes so losing one zone is survivable.
PDBs protect during drains/upgrades; multi-zone node pools provide HA.
Correct probes: keep readiness separate from liveness; use a startupProbe for slow-starting apps (Magento, JVM) so they are not killed prematurely.
Graceful shutdown: preStop hook + terminationGracePeriodSeconds to drain connections instead of cutting transactions midway.
Resilience4j / circuit breaker in the app, plus retries with backoff and an idempotency key for payments.
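
The last pattern, retry with backoff plus an idempotency key, can be sketched as follows. This is a minimal illustration assuming the payment gateway deduplicates on a key supplied by the caller. TransientError, call_with_retry, and the delay values are hypothetical names and defaults, not part of any real SDK.

```python
import random
import time
import uuid


class TransientError(Exception):
    """Hypothetical error type for failures that are safe to retry."""


def call_with_retry(operation, max_attempts=4, base_delay=0.2, max_delay=5.0, sleep=time.sleep):
    """Call operation(idempotency_key) with exponential backoff and full jitter.

    The same key is sent on every attempt, so a gateway that deduplicates on it
    charges at most once even if an earlier attempt succeeded but timed out.
    """
    idempotency_key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(idempotency_key)
        except TransientError:
            if attempt == max_attempts:
                raise                      # budget exhausted, surface the error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

The sleep parameter is injectable so the retry loop can be exercised in tests without real waiting.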

Scenario: Magento 2 on AKS under Black Friday load (e-commerce)

Situation: traffic is 8–10× normal, checkout latency must stay stable, and no order may be lost.

Compute: separate node pools (web / queue worker / cron), HPA on p95 latency + RPS, KEDA scaling workers on order-queue length.

Caching: Redis for sessions + full-page cache, CDN in front of static assets, reducing load on the DB.

Pre-warm: scheduled scale-up before the sale + overprovisioned pause pods to keep nodes warm.

Protection: rate limiting at the Gateway/WAF, PDB + multi-zone, circuit breaker towards the payment gateway.

Target outcome: checkout SLO held, smooth autoscaling, and scale-down after the campaign to save cost.

AKS-specific observability

Container Insights (Azure Monitor) + Managed Prometheus + Grafana for cluster/node/pod metrics.
Control plane logs (kube-apiserver, kube-audit, scheduler): enable Diagnostic Settings → Log Analytics; kube-audit is mandatory for banking compliance.
KQL on ContainerLogV2 to trace errors by namespace/pod, with alerts tied to specific error patterns.

📦 Workload Types & Strategy

Choosing the right controller for each workload

Deployment · StatefulSet · DaemonSet · Job · CronJob

Workload	Controller	Characteristics	Real-world examples
Stateless API / web	Deployment	Replaceable pods, RollingUpdate, easy horizontal scaling	REST API, frontend, Magento web tier
Stateful with identity	StatefulSet	Stable network ID + a dedicated PVC per pod, ordered deploy/scale	DB, Kafka, Zookeeper, Elasticsearch
Node-level agent	DaemonSet	One pod per node, starts automatically on new nodes	Log shipper, CNI, node-exporter, security agent
Run once to completion	Job	Runs to completion, retries up to backoffLimit	DB migration, batch processing
Scheduled	CronJob	Jobs on a cron schedule, governed by concurrencyPolicy	Nightly backup, reports, cleanup

Frequently asked talking points

Why a StatefulSet for a database instead of a Deployment? A StatefulSet gives stable identity (pod-0, pod-1) plus a PVC pinned to each pod, which matters for replication/clustering. With a Deployment, pods and their volumes are anonymous and interchangeable.
A headless Service (clusterIP: None) accompanies the StatefulSet so each pod gets its own DNS name.
CronJob concurrencyPolicy: Forbid prevents overlapping runs when the previous job has not finished.
Init containers handle setup before the app starts (wait-for-db, light migrations).

💾 Storage & PV/PVC

The PV / PVC / StorageClass model

PV · PVC · StorageClass · CSI

PersistentVolume (PV): the actual storage resource (Azure Disk/File/Blob), created by an admin or a dynamic provisioner.
PersistentVolumeClaim (PVC): the app's storage request (capacity + accessMode). Pods mount the PVC, never the PV directly.
StorageClass: defines how dynamic provisioning works (disk type, SKU, reclaimPolicy, binding mode).

Access Modes

Mode	Meaning	Suitable Azure backend
RWO (ReadWriteOnce)	Mounted read-write by one node	Azure Disk: DB, StatefulSet
RWX (ReadWriteMany)	Mounted read-write by many nodes at once	Azure File (SMB/NFS), Blob CSI: shared config/media
ROX (ReadOnlyMany)	Mounted read-only by many nodes	Shared read-only data sets

reclaimPolicy & volumeBindingMode

reclaimPolicy: Delete (the default for dynamic provisioning) deletes the disk when the PVC is deleted; Retain keeps the data, which is a must for production databases so an accidental PVC delete does not lose data.
volumeBindingMode: WaitForFirstConsumer on multi-zone clusters, so the disk is created in the same zone as the pod and a zone mismatch does not block mounting.
allowVolumeExpansion: enable it to resize PVCs online (Azure Disk supports expansion, not shrinking).

Pitfalls from the field

RWO blocks horizontal scaling: a multi-replica Deployment sharing one RWO PVC gets stuck because a second pod on another node cannot mount it. Use RWX (Azure File) or one PVC per pod (StatefulSet volumeClaimTemplate).
Disks are zonal: an Azure Disk lives in one zone, so a pod rescheduled to another zone fails to mount it. Use ZRS disks or WaitForFirstConsumer.
Pod stuck Terminating after a failed volume detach: usually the disk has not detached from the old node; check VolumeAttachment.
Backup: snapshots via CSI VolumeSnapshot, or Velero for both manifests and PVs.

🏛️ Domain Q&A

Domain-specific questions from the environments I have operated.

E-commerce

Q: How do you handle a flash-sale traffic spike?

Pre-scale on a schedule before the sale, overprovision low-priority pods to keep nodes warm, HPA on p95 latency + RPS, KEDA scaling workers on order-queue length. Redis for sessions + full-page cache, CDN in front of static assets. Scale down after the campaign to save cost.

Q: How do you make sure no order is lost when a pod dies?

Orders go into a queue (Service Bus) before processing; workers are idempotent on order ID; graceful shutdown (preStop + terminationGracePeriod) drains connections instead of cutting transactions midway.

Banking / Finance

Q: Isolation and compliance in banking?

Private cluster (API server not exposed to the internet), a dedicated cluster per environment for a small blast radius, Azure Policy/OPA enforcing no privileged containers and pulls only from trusted ACRs, mandatory kube-audit logs, Private Link to SQL/Storage/Key Vault.

Q: How do you manage secrets safely?

Key Vault + Secrets Store CSI Driver, Workload Identity (federated, no secrets in the cluster), auto-rotation. No plaintext secrets in Git.

Healthcare

Q: How do you guarantee a 99.9% SLA and business continuity?

Real monthly DR drills, PowerShell automation for snapshot/replication/failover, per-scenario runbooks (region down, DB corruption, cert expiry). Recovery time down 40%, SLA met for 12 consecutive months.

Q: How do you protect sensitive data (PHI)?

Encryption at rest and in transit, namespace-scoped least-privilege RBAC, complete audit logs, network default-deny + egress control against data exfiltration.

🔄 GitOps & Argo CD

Why GitOps?

On the Automobile project (NTT DATA) I moved continuous delivery to a GitOps model with Argo CD. Git is the single source of truth: the cluster's desired state lives entirely in the repo, and Argo CD reconciles continuously so the cluster matches Git.

Argo CD · Kubernetes · Helm · Kustomize

Practical benefits

Auditability: every change goes through a PR, so there is review, history, and rollback = git revert.
Drift detection: when someone edits the cluster by hand (kubectl edit), Argo CD reports OutOfSync and self-heals back to the state in Git.
Decoupling deploy from CI: pipelines only build and push images and need no direct write access to the cluster, which shrinks the security blast radius.
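
The drift-detection and self-heal loop can be illustrated with a toy reconciler. This is a conceptual sketch over plain dicts, not how Argo CD is implemented: diff_state, deep_merge, and reconcile are hypothetical names, and fields that exist only in the live object (status, controller defaults) are deliberately ignored.

```python
import copy


def diff_state(desired, live, path=""):
    """Return the paths where live differs from desired."""
    drift = []
    for key, want in desired.items():
        here = f"{path}.{key}" if path else key
        if key not in live:
            drift.append(here)
        elif isinstance(want, dict) and isinstance(live[key], dict):
            drift.extend(diff_state(want, live[key], here))
        elif live[key] != want:
            drift.append(here)
    return drift


def deep_merge(base, override):
    """Copy of base with every field from override applied on top."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def reconcile(desired, live, self_heal=True):
    """Return (status, drifted_paths, resulting_live_state)."""
    drift = diff_state(desired, live)
    if not drift:
        return "Synced", [], live
    if not self_heal:
        return "OutOfSync", drift, live
    return "Synced", drift, deep_merge(live, desired)
```

A manual change of replicas from 3 to 5 shows up as drift on spec.replicas; with self-heal on, the Git value wins and the live status fields are left untouched.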

Patterns I use often

App of Apps: a root Application manages many child Applications, so onboarding a new env/team only means adding a manifest.
Sync waves & hooks: control apply order (CRDs first, workloads after; migration jobs run as PreSync hooks).
Progressive delivery: combine with Argo Rollouts for canary/blue-green by traffic percentage, observe metrics, then promote or roll back.

GitOps lessons

Separate the config repo from the source code repo. Do not let a pipeline commit image tags into the same repo that triggers it; that easily creates a loop. The image updater (or a bot) commits to a dedicated config repo, and Argo CD watches that repo.

🕸️ Cilium & Gateway API

Cilium (eBPF): why modern teams choose it

Cilium · eBPF · Hubble

Network policy enforced at kernel level (eBPF) instead of iptables: noticeably lower latency, and it scales well as pod and rule counts grow.
Identity-based policy: matches on pod labels/identity rather than IPs, so policies stay stable when pods restart with new IPs.
Hubble: L3/L4/L7 observability without sidecars; shows flows between services, policy drops, DNS, and HTTP.
Default deny: start with deny-all and allowlist explicitly, in line with zero-trust inside the cluster.

Gateway API: the successor to Ingress

On the Automobile project I use Gateway API for traffic routing in place of traditional Ingress. The reasons:

Aspect	Ingress (old)	Gateway API (new)
Role separation	One resource mixing infra and app concerns	Clear split: GatewayClass (provider) / Gateway (infra team) / HTTPRoute (app team)
Extensibility	Depends on controller-specific annotations	Standardized API, less annotation magic
Traffic splitting	Limited	Native weight-based routing, header matching, canary
Multi-protocol	Mostly HTTP	HTTP, gRPC, TCP, TLS routing

Operational upside: the infra team owns the Gateway (TLS, listeners, IPs) and app teams manage only their own HTTPRoutes, which reduces permission clashes and fits the multi-team, shared-cluster model.
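
Weight-based routing in an HTTPRoute means each backend receives traffic in proportion to its weight over the sum of all weights. The sketch below models that behaviour for a canary split; it illustrates the arithmetic only, not how any gateway implementation selects backends, and the backend names are hypothetical.

```python
import random


def traffic_share(backend_refs):
    """backend_refs: list of (name, weight). Returns the expected share per backend."""
    total = sum(weight for _, weight in backend_refs)
    return {name: (weight / total if total else 0.0) for name, weight in backend_refs}


def pick_backend(backend_refs, rng=random):
    """Choose a backend for one request, proportionally to its weight."""
    names = [name for name, _ in backend_refs]
    weights = [weight for _, weight in backend_refs]
    if sum(weights) <= 0:
        raise ValueError("at least one backend needs a non-zero weight")
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    canary = [("api-v1", 90), ("api-v2", 10)]   # hypothetical 90/10 canary split
    print(traffic_share(canary))
    print(pick_backend(canary))
```

Promoting the canary is then a matter of shifting the weights step by step (90/10, 50/50, 0/100) while watching metrics.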

🏗️ Terraform / Infrastructure as Code

How I organize Terraform

Terraform · Azure · Module

Remote state: Azure Storage backend + state locking (blob lease) so two people cannot apply over each other.
Modules: reusable modules (network, aks, keyvault), so Dev/UAT/Prod differ only in their tfvars files, with no copy-pasted code.
Workspace/folder per env: each env has its own state, so a blow-up in Dev never touches Prod.
Plan in CI: terraform plan runs automatically on PRs, and its output is reviewed before the apply that follows the merge.

Example: provisioning AKS (abridged)

module "aks" {
  source              = "../../modules/aks"
  cluster_name        = var.cluster_name
  kubernetes_version  = "1.30"
  network_plugin      = "azure"
  network_plugin_mode = "overlay" # CNI Overlay saves VNet IPs
  default_node_pool = {
    name                = "system"
    vm_size             = "Standard_D4s_v5"
    min_count           = 3
    max_count           = 10
    enable_auto_scaling = true
  }
}

IaC lessons

Avoid ad-hoc imports: resources created by hand and then imported into state tend to drift. Standardize on "everything through Terraform" from day one.
Pin versions: pin provider and module versions so tomorrow's apply does not produce a different result.
State is a sensitive asset: it holds secrets in plaintext, so lock down access to the storage account that stores it.

🚀 CI/CD Pipelines

Pipeline philosophy

Azure DevOps · Docker · Kubernetes · GitOps

I have built pipelines in Azure DevOps across many projects. My model keeps CI (build + test + push image) clearly separate from CD (deploy), especially when the cluster uses GitOps.

Standard stages

Build: multi-stage Dockerfile, layer caching, immutable images tagged by commit SHA (never latest).
Test: unit + lint + image scan (Trivy) + IaC scan (tfsec/checkov); fail early here.
Push: push the image to ACR, signed and tagged by SHA.
Deploy: CD never runs kubectl apply directly; it updates the image tag in the config repo and Argo CD reconciles.
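
The Deploy stage boils down to a text change in the config repo. Below is a minimal sketch of that tag bump, assuming the manifest references the image on a plain "image: name:tag" line; the function name, the registry name in the usage example, and the regex-based approach are illustrative assumptions (a real setup would more likely use Kustomize or an image updater).

```python
import re


def set_image_tag(manifest_text, image, commit_sha):
    """Return the manifest with the given image pinned to commit_sha."""
    if commit_sha == "latest":
        raise ValueError("use an immutable tag such as the commit SHA, not 'latest'")
    # Match "image: <image>:<old-tag>" and keep everything up to the colon.
    pattern = re.compile(rf"(image:\s*{re.escape(image)}):[\w.-]+")
    updated, count = pattern.subn(lambda m: f"{m.group(1)}:{commit_sha}", manifest_text)
    if count == 0:
        raise ValueError(f"image {image!r} not found in manifest")
    return updated


if __name__ == "__main__":
    manifest = "containers:\n  - name: api\n    image: myacr.azurecr.io/shop/api:1a2b3c4\n"
    print(set_image_tag(manifest, "myacr.azurecr.io/shop/api", "9f8e7d6"))
```

The pipeline commits the resulting file to the config repo; the cluster itself is only ever touched by the GitOps controller.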

Update Strategies

RollingUpdate (default): zero downtime, requires correct readiness probes. maxSurge / maxUnavailable control the pace.
Recreate: kills everything, then creates the new version. There is downtime, but it avoids version skew (e.g. a DB migration that is not backward-compatible).
Blue-Green: swap the Service selector; rollback is instant but it costs 2× the resources.
Canary: release by traffic percentage (Argo Rollouts / service mesh), observe metrics, then promote.
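
For RollingUpdate, the effect of maxSurge and maxUnavailable is simple arithmetic. The sketch assumes the documented rounding (percentage maxSurge rounds up, percentage maxUnavailable rounds down) and the 25% defaults; the function name and the returned keys are hypothetical.

```python
import math


def rolling_update_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Pod-count limits a Deployment respects during a rolling update."""

    def resolve(value, round_up):
        if isinstance(value, str) and value.endswith("%"):
            fraction = replicas * int(value[:-1]) / 100
            return math.ceil(fraction) if round_up else math.floor(fraction)
        return int(value)

    surge = resolve(max_surge, round_up=True)
    unavailable = resolve(max_unavailable, round_up=False)
    if surge == 0 and unavailable == 0:
        # With both at zero the rollout could never make progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }
```

With 10 replicas and the defaults, the rollout may run up to 13 pods and must keep at least 8 available; maxSurge: 1 with maxUnavailable: 0 gives the slowest, safest pace.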

📊 Monitoring & SRE

Observability stack

Prometheus · Grafana · Datadog · Azure Monitor · KQL

My view: alerts need context, not volume. Every alert should answer three questions: what is broken, where, and what the suspected cause is.

The four golden signals

Latency: measure by percentile (p95/p99), not the average, because the average hides outliers.
Traffic: requests per second, throughput.
Errors: error rate, split into 4xx vs 5xx.
Saturation: how full the resources are (CPU, memory, queue depth).
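A quick Python sketch of why the average hides outliers. The latency samples are invented for illustration, and `percentile` is a simple nearest-rank implementation, not what any particular monitoring backend uses.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in ms: 95 fast requests and 5 very slow ones.
latencies_ms = [100] * 95 + [2000] * 5

avg = mean(latencies_ms)            # 195 ms, looks healthy
p95 = percentile(latencies_ms, 95)  # 100 ms
p99 = percentile(latencies_ms, 99)  # 2000 ms, exposes the slow tail
print(avg, p95, p99)
```

The average stays under 200 ms even though 1 in 20 requests takes 2 seconds. Only the p99 makes that tail visible.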

SLOs with percentiles

Availability: 99.9% uptime (about 43 minutes of downtime per month).
Latency: p95 < 200 ms, p99 < 500 ms.
Error rate: < 0.1%.

Why p95/p99 rather than p50 or p99.9? p50 is too easy (the median hides the worst case), and p99.9 is too strict (it is dominated by outliers that are hard to control). p95/p99 represent the majority of users and are still achievable.

requests
| where timestamp > ago(5m)
| summarize p95=percentile(duration,95), p99=percentile(duration,99)
| where p95 > 200 or p99 > 500
// Alert: "API latency SLO violated"
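The "about 43 minutes per month" figure for 99.9% availability is plain arithmetic. A short Python sketch, assuming a 30-day month (the function name is illustrative):

```python
def downtime_budget_minutes(slo_percent, days=30):
    """Minutes of downtime allowed per period for a given availability SLO."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {downtime_budget_minutes(slo):.1f} min / 30 days")
```

For 99.9% this gives roughly 43.2 minutes. Each extra nine cuts the allowance by a factor of ten, which is the cost side of any SLO discussion.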

How I cut troubleshooting time by 70%

Problem: every performance incident took a long time because logs and metrics were scattered across many places.

Action: rolled out centralized logging and monitoring, and standardized dashboards and alert rules per service.

Result: 70% less time spent troubleshooting performance issues; engineers open a dashboard and see the context immediately.

🛟 DR & Cost Optimization

Disaster Recovery (Healthcare, SLA 99.9%)

PowerShell · Azure · Compliance

PowerShell scripts: snapshot, cross-region replication, failover validation, automated smoke tests.
Runbooks per scenario: region down, DB corruption, certificate expiry, account lockout.
Vendor escalation playbook (Microsoft Premier support flow).
Automated DR drill report sent to stakeholders every month.

Result: 12 consecutive months meeting the 99.9% SLA, recovery time down 40%, clean audit pass. Lesson: DR is not a plan on paper; you have to run real drills and measure real numbers.

Cost Optimization Landing Zone

Azure Cost Mgmt · Reserved Instance

Cost Analysis breakdown by subscription / resource type / tag to find the top spenders: orphaned disks, idle VMs, oversized gateways.
Network/firewall review: merge redundant rules and turn off logging that is billed twice (FW → Storage → Log Analytics).
Reserved Instances for steady workloads.
Knowledge transfer: workshops plus a cost dashboard so the customer's team keeps the discipline on its own.

Result: on the Vietnam Customer project, infrastructure cost dropped 35% through better resource allocation; on the landing zone engagement, monthly Azure spend dropped 15%. Lesson: dashboards and automated alerts last longer than a one-off cleanup.

🏢 BCDR — Business Continuity & Disaster Recovery

What BCDR is, and how it differs from DR

BCP · DR · Azure Site Recovery · Backup · RTO/RPO

BCDR combines two areas: Business Continuity (BCP), which covers how the business keeps operating during an incident (people, process, communication), and Disaster Recovery (DR), the technical side of restoring systems and data after a disaster.

In plain terms: DR handles “how do we bring servers and data back”, BCP handles “how does the business keep serving customers in the meantime”. A good DR plan without BCP means the technology survives while operations are still in chaos.

RTO & RPO: the two core numbers

RTO (Recovery Time Objective): the maximum downtime allowed, i.e. “how soon must it be running again”.
RPO (Recovery Point Objective): the maximum data loss allowed, i.e. “how far back can we roll”. An RPO of 15 minutes means backing up or replicating at least every 15 minutes.
Tradeoff: the smaller the RTO/RPO, the higher the cost. Map them to the criticality of each system instead of “RPO=0 for everything”.
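The link between RPO and replication frequency can be checked mechanically. A minimal Python sketch, assuming the worst-case data loss equals the replication interval plus the time a copy takes to complete; both helper names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(interval, copy_duration=timedelta(0)):
    """Data at risk if the disaster hits just before the next copy completes."""
    return interval + copy_duration


def meets_rpo(rpo_target, interval, copy_duration=timedelta(0)):
    """True when the replication schedule can actually honor the RPO target."""
    return worst_case_data_loss(interval, copy_duration) <= rpo_target


# RPO 15 min with replication every 15 min: just meets the target.
print(meets_rpo(timedelta(minutes=15), timedelta(minutes=15)))  # True
# RPO 5 min promised, but the backup runs hourly: the promise cannot hold.
print(meets_rpo(timedelta(minutes=5), timedelta(hours=1)))      # False
```

A check like this fits naturally into a DR drill report, so the promised RPO is compared against the configured schedule rather than assumed.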

4 DR strategies, from cheapest to most expensive

Strategy	How it works	RTO/RPO	Cost
Backup & Restore	Periodic backups, restore when needed	Hours → days	Lowest
Pilot Light	Core (replicated DB) runs small in region 2, everything else is off	Minutes → hours	Low-medium
Warm Standby	A scaled-down but complete copy runs in region 2 and scales up on failover	Minutes	Medium-high
Active-Active (Multi-site)	Both regions run at full size, traffic split via Front Door/Traffic Manager	~0 (near-zero)	Highest

Talking point: pick the tier by RTO/RPO plus budget. Core banking is usually Warm Standby or Active-Active; for less critical internal systems, Backup & Restore is enough.

Azure BCDR toolkit

ASR · Azure Backup · Front Door · Traffic Manager · GRS/ZRS

Azure Site Recovery (ASR): replicates VMs/workloads to another region and orchestrates failover with a recovery plan (boot order, pre/post scripts). Supports test failover without touching production.
Azure Backup: backs up VMs, disks, SQL, File Shares; keeps recovery points per policy; enable soft delete + immutable vault against ransomware and accidental deletion.
Storage redundancy: LRS (1 datacenter) → ZRS (cross-zone) → GRS/GZRS (cross-region); choose by the RPO requirement.
Front Door / Traffic Manager: steer traffic to a healthy region when the primary goes down, using health probes + priority/weighted routing.
DB: Azure SQL active geo-replication / failover groups; Cosmos DB multi-region writes for near-zero RPO.
AKS: clusters in 2 regions, manifests in Git (Argo CD) for a fast rebuild; the data layer relies on DB/storage replication, not on the cluster.

Scenario: the primary region goes down at peak hour

1. Detect: the Front Door health probe reports region A unhealthy and the alert fires immediately.

2. Declare: declare a DR event by severity, open the incident bridge, notify stakeholders per the communication plan (the BCP part).

3. Failover: run the ASR recovery plan / switch the DB to the replica in region B; Front Door shifts traffic to B automatically.

4. Validate: automated smoke tests (login, checkout, critical APIs) before declaring “back to service”.

5. Failback: once region A recovers, sync data back and fail back in a controlled way (no rushing, avoid split-brain).

6. Postmortem: measure actual RTO/RPO against the targets and update the runbook.

BCDR lessons from the field

A plan without drills is a dead plan. On the Healthcare project I ran real drills every month, which is why we held the 99.9% SLA for 12 months and cut recovery time by 40%.
Test failover ≠ real failover: ASR offers test failover in an isolated network so you can drill without affecting prod. Use it often.
Backups must be restore-tested: a green backup job is not enough; only periodic trial restores prove the backup is usable.
Dependency mapping: know what the app depends on (DNS, secrets, identity); one missing link breaks the failover.
Real RPO = replication frequency: don't promise a 5-minute RPO when the backup runs hourly.

☁️ Multi-Cloud Mapping

I work on Azure (primary), AWS, and GCP. Knowing which services are equivalent speeds up migration and cross-cloud design.

Need	Azure	AWS	GCP
Managed Kubernetes	AKS	EKS	GKE
Compute VM	Virtual Machines / VMSS	EC2 / ASG	Compute Engine / MIG
Serverless function	Azure Functions	Lambda	Cloud Functions
Object storage	Blob Storage	S3	Cloud Storage
Block storage	Managed Disk	EBS	Persistent Disk
Managed SQL	Azure SQL / Flexible Server	RDS / Aurora	Cloud SQL / AlloyDB
NoSQL	Cosmos DB	DynamoDB	Firestore / Bigtable
Container registry	ACR	ECR	Artifact Registry
Identity / IAM	Entra ID (Azure AD)	IAM	Cloud IAM
Secrets	Key Vault	Secrets Manager	Secret Manager
Load balancer	Azure LB / App Gateway	ELB / ALB	Cloud Load Balancing
Private network	VNet	VPC	VPC
Message queue	Service Bus	SQS / SNS	Pub/Sub
Monitoring	Azure Monitor	CloudWatch	Cloud Monitoring
Node autoscaler	Cluster Autoscaler	Karpenter / CA	Node auto-provisioning
IaC (native)	ARM / Bicep	CloudFormation	Deployment Manager

My view: use Terraform as the common IaC layer for all three clouds instead of each cloud's native tool: one language, one workflow, easier team moves and audits. When discussing a migration, I map service pairs as in the table above, then focus on the differences in networking and IAM models, which is where things break most often.

🚨 Incident / SRE

Incident response: how I handle it

On-call · SLO · Postmortem · ServiceNow

Lifecycle of an incident

Detect: alerts based on golden signals + SLO burn rate, not on gut feeling.

Triage: determine severity (SEV1-3), impact, and blast radius; open an incident channel.

Mitigate: restore service first (rollback, failover, scale), then look for the root cause.

Communicate: update stakeholders at regular intervals following a pre-agreed protocol.

Postmortem: blameless, identify contributing factors, produce action items with an owner + deadline.

SLO / Error Budget

SLI: a measurable indicator (latency p95, error rate, availability).
SLO: the target set on an SLI (e.g. 99.9% of requests < 200 ms).
Error budget: the amount of failure still allowed. Budget exhausted → freeze features and focus on reliability. Plenty left → ship faster.
Burn-rate alert: alert when the budget is being consumed too fast (e.g. 2% in 1 h) instead of only alerting once the threshold is already crossed.
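The "2% in 1 h" example can be turned into a burn-rate threshold with a few lines of Python. This sketch assumes a 30-day SLO window, which the notes above do not state. The function names are illustrative.

```python
def error_budget(slo):
    """Allowed failure fraction, e.g. SLO 0.999 -> budget of about 0.001."""
    return 1 - slo


def burn_rate(observed_error_rate, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_rate / error_budget(slo)


def burn_rate_threshold(budget_fraction, alert_window_hours, slo_window_hours=30 * 24):
    """Burn rate at which budget_fraction of the budget is spent within the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


# Spending 2% of a 30-day budget in 1 hour corresponds to a burn rate of about 14.4.
print(burn_rate_threshold(0.02, 1))
# With a 99.9% SLO, a 1.44% error rate burns at roughly that rate.
print(burn_rate(0.0144, 0.999))
```

At a burn rate of 1 the budget lasts exactly the whole window, so 14.4 means the budget would be gone in about two days if nothing changes.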

Scenario: API 5xx spike at 2 a.m.

1. Dashboard: 5xx rising, p99 latency spiking, error budget burning fast → SEV2.

2. Narrow down: one service or the whole cluster? Check the most recent deploy (Argo CD history).

3. Mitigate first: if a new release caused it → rollback (git revert, Argo sync); if a downstream dependency caused it → circuit breaker + scale.

4. Confirm recovery with metrics and update stakeholders.

5. Postmortem the next day: why the tests missed it, and which guardrails to add.

📜 Scripting

Python / Bash / PowerShell: the right tool for the job

Python · Bash · PowerShell

Python: complex automation such as data transfer to ERP, API calls, JSON/CSV processing, glue logic. It is the main language I used for the data transfer scripts on the Integration project (Singapore/Malaysia/Australia).
Bash: glue in CI/CD, Dockerfile entrypoints, k8s init containers, quick file/log work on Linux.
PowerShell: automation on Azure/Windows, including DR runbooks (snapshot, replicate, failover validation), Active Directory management, post-failover smoke tests.

Principles for production scripts

Idempotent: running it many times yields the same result, with no duplicated side effects.
Fail-fast: set -euo pipefail in Bash; check exit codes and never swallow errors silently.
Retry with backoff: API calls retry with backoff and respect 429/Retry-After.
No hardcoded secrets: read them from Key Vault / env / managed identity, never commit them to Git.
Clear logging: log to stdout with timestamps so runs in a pipeline or cron are easy to trace.
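The retry principle above can be sketched in Python without tying it to a specific HTTP client. `RateLimited` is a hypothetical exception standing in for an HTTP 429 response, and `call_with_retry` is an illustrative helper, not a library API.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429, optionally carrying Retry-After seconds."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def call_with_retry(func, max_attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call func(); on RateLimited, wait and retry up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RateLimited as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error, never swallow it
            if exc.retry_after is not None:
                delay = exc.retry_after  # the server told us how long to wait
            else:
                # Exponential backoff with full jitter, capped at max_delay.
                delay = random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1)))
            sleep(delay)
```

Retries are only safe when the wrapped call is idempotent, which is why the two principles sit next to each other in the list.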

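Example: safe Bash for automation

#!/usr/bin/env bash
# Fail fast: exit on errors, on unset variables, and on failures inside pipelines.
set -euo pipefail

# Required argument; abort with a message if it is missing.
RESOURCE_GROUP="${1:?need resource group}"
# Log to stdout with a UTC timestamp.
log(){ echo "[$(date -u +%FT%TZ)] $*"; }

log "Snapshotting disks in ${RESOURCE_GROUP}..."
# Assign first: a failing command substitution in a plain assignment trips set -e,
# while the same failure inside "for x in $(...)" would be silently ignored.
disks="$(az disk list -g "$RESOURCE_GROUP" --query "[].name" -o tsv)"
for disk in $disks; do
  az snapshot create -g "$RESOURCE_GROUP" \
    --name "${disk}-$(date -u +%Y%m%d)" \
    --source "$disk" --only-show-errors
  log "snapshot done: $disk"
done

Habits: required variables use ${VAR:?} to fail early when an argument is missing, logs carry a UTC timestamp, and errors are handled with set -euo pipefail.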

💬 Behavioral

Tell me about yourself

Use the Self-Pitch above: 8+ years, IT support → Cloud → DevOps/Architect, strong in AKS/GitOps/Terraform, with numbers (cost -35%, breaches -40%, troubleshooting -70%), multi-region work.

How do you handle conflict within the team?

Short STAR: a disagreement about the approach → focus on data and tradeoffs rather than personal opinion → agree on criteria (cost, risk, time) → choose a solution that can be measured. Close with: the decision was evidence-based and the relationship stayed healthy.

A time you failed?

Pick a real mistake that you fixed and learned from (e.g. a PDB with minAvailable=100% that left node drain stuck) → emphasize the lesson and the guardrails added afterwards.

🧩 Tricky / Weak-Spot

Common trick questions

liveness vs readiness? A liveness failure restarts the container; a readiness failure temporarily removes the pod from Service endpoints without restarting it. Don't let liveness check external dependencies, or you get cascading restarts.
Should you set a CPU limit? In many cases no CPU limit is set (to avoid needless throttling), but always set a memory limit (memory is incompressible). It depends on the SLA and on how multi-tenant the cluster is.
kubectl apply vs replace? apply is a declarative 3-way merge; replace overwrites the whole object. With GitOps, Argo CD does the applying, not manual kubectl.
Why not use the latest tag? Not reproducible, rollback is ambiguous, caching gets messy. Use immutable tags based on the commit SHA.

Honest weak spot: when asked about an area I have not gone deep in (e.g. advanced service mesh, eBPF internals), I state my actual level of experience and how I would get up to speed quickly, instead of making things up. Honesty plus the ability to self-learn is valued more than pretending to know.

❓ Questions to Ask Back

At the end of the interview, ask questions back to show you care about the right things:

Where is the team on its GitOps / platform engineering journey? Is there an internal developer platform yet?
How is on-call set up: rotation, escalation, and is the error budget respected?
Are technical decisions made by the team or top-down? How is tech debt prioritized?
What is the platform's biggest challenge over the next 6-12 months?
How is success in this role measured after 3-6 months?

✅ Checklist

Before the interview

Review the 5 STAR stories and memorize the numbers (35% / 40% / 70% / 99.9%)
Get the Self-Pitch under 60 seconds
Review the AKS deep-dive: pod lifecycle, CNI, storage, upgrade, autoscaling
Review GitOps / Argo CD, Terraform, CI/CD flow
Prepare 3-5 questions to ask back
Research the company + JD, map experience to the requirements

Answering skills

Use STAR for behavioral questions
Talk in tradeoffs, not absolutes (“it depends on context”)
Be honest when you don't know, and say how you would find out
Tie each answer to real impact

📐 System Design

Framework for answering a system design question

Requirements · Scale est. · Tradeoffs

1. Clarify requirements: functional + non-functional (RPS, latency, data size, SLA). Don't start drawing right away.

2. Estimate: QPS, storage per year, bandwidth, so you pick the right components.

3. High-level design: client → LB → service → cache → DB → queue; draw the data flow.

4. Deep-dive on 1-2 components: the hardest parts (data partitioning, consistency, hot keys).

5. Tradeoffs & bottlenecks: state why you chose it, what you gave up, and how it scales further.

Building blocks to know

Load balancing: L4 vs L7, sticky sessions, health checks, consistent hashing.
Caching: cache-aside vs write-through; TTL; cache stampede protection (lock/jitter); Redis for session/hot data; CDN for static content.
DB scaling: read replicas, sharding (by key), connection pooling, proper indexes; know when NoSQL is needed.
Async: message queues (Service Bus/Kafka) to decouple, smooth spikes, and retry; outbox pattern so events are not lost.
API design: REST vs gRPC, idempotency, pagination, rate limiting, versioning.
Scenario: design a URL shortener (a frequent question)

Write: generate a short key (base62 from a counter, or hash + collision check) and store the mapping.

Read (reads outnumber writes 100:1): hot entries cached in Redis, DB as the source of truth.

Scale: stateless services behind an LB, DB sharded by key, CDN/edge for redirects.

Tradeoff: 301 (cached for a long time, hard to count clicks) vs 302 (analytics are measurable, but load is higher).
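The "base62 from a counter" option in the Write step can be sketched in a few lines of Python. The alphabet order (digits, lowercase, uppercase) is an arbitrary choice made for this example.

```python
import string

# 10 digits + 26 lowercase + 26 uppercase = 62 symbols.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(n):
    """Turn a non-negative counter value into a short key."""
    if n < 0:
        raise ValueError("counter must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_base62(key):
    """Inverse of encode_base62."""
    n = 0
    for ch in key:
        n = n * 62 + ALPHABET.index(ch)
    return n


print(encode_base62(62))  # "10"
print(62 ** 7)            # 3521614606208 distinct keys with 7 characters
```

A counter never collides, so no collision check is needed, but sequential keys are guessable. That is the tradeoff against the hash-based option.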

🕸️ Service Mesh

When you need a service mesh, and when you don't

Istio · Linkerd · Cilium Mesh · mTLS

A mesh solves 3 problems at the infrastructure layer (without touching app code): traffic management (canary, retry, timeout), security (mTLS, service-to-service authz), and observability (golden metrics, automatic distributed tracing).

Senior framing: a mesh is not free. Adding sidecars adds latency, RAM, and operational complexity. With few microservices, a library (Resilience4j) or Gateway API is enough. A mesh pays off when there are many services, when zero-trust mTLS is needed across the whole network, or when progressive delivery needs to be standardized.

Sidecar vs Sidecar-less (eBPF)

Sidecar (Istio Envoy): one proxy per pod; powerful and flexible at L7, but costs resources and adds a hop.
Ambient / sidecar-less: Istio ambient uses a per-node ztunnel proxy, while Cilium service mesh builds on its eBPF datapath; both drop the per-pod sidecar overhead and handle mTLS at the node level.
Linkerd: lightweight, Rust micro-proxy, easier to operate than Istio; a good fit when the full feature set is not needed.

I currently run Cilium on the Automobile project, so the natural mesh direction is Cilium/eBPF + Gateway API rather than adding Istio sidecars.

🔒 DevSecOps & Supply Chain Security

Shift-left security: security starts at code time

Trivy · Checkov · SBOM · Defender · OPA/Gatekeeper

Scan images (Trivy/Grype): run in CI and fail the build on HIGH/CRITICAL CVEs that already have a fix; don't block on CVEs with no patch available yet.
Scan IaC (Checkov/tfsec): catch Terraform/Helm misconfigurations before apply.
SBOM (Software Bill of Materials): lists every dependency in the image; mandatory in many regulated industries.
Image signing (Cosign/Notation): sign the image after build; Gatekeeper only admits signed images from trusted registries.
Supply chain (SLSA): provenance, i.e. knowing where an image came from, who built it, and from which source, to guard against tampering.
Runtime (Defender for Containers / Falco): detect runtime anomalies (unauthorized exec in a container, privilege escalation).

OPA / Gatekeeper: policy as code

Write ConstraintTemplates (Rego) to enforce policy: pull only from ACR, require labels, forbid hostPath, require resource limits. Checks happen before apply (at admission) rather than after the workload is running. Combine with Azure Policy for AKS to get audit mode.

🐧 Linux & Troubleshooting

Questions that come up at senior level

Linux · strace · tcpdump · perf · eBPF

Symptom	First command	Next steps
CPU 100%	`top / htop`, `pidstat 1`	Find the heavy PID, `perf top -p PID` to see hot functions, profile the code
Memory leak	`free -m`, `cat /proc/meminfo`	`valgrind` / heap dump; OOM score; check the slab cache
Slow disk I/O	`iostat -x 1`, `iotop`	High %util → disk saturated; high await → deep queue; check filesystem options (ext4 noatime)
Network drops	`ss -s`, `netstat -s`	High retransmits → network congestion / MTU; capture packets with `tcpdump` for analysis
DNS errors in k8s	`kubectl exec <pod> -- nslookup <svc>`	CoreDNS logs, ndots config, search domains, DNS caching
App cannot bind port	`ss -ltnp`, `strace -p PID`	Permissions (port <1024), firewall, SELinux/AppArmor

Linux internals that get probed

Process vs Thread: threads share an address space; fork() creates a child process (copy-on-write).
Container vs VM overhead: containers are lighter than VMs because they share the host kernel and carry no full guest OS.
eBPF: runs verified code inside the kernel without a kernel module; the foundation of Cilium/Hubble/Falco.
cgroups + namespaces: the basis of container isolation; cgroups limit resources, namespaces isolate PID/net/mount/user.
inode / file descriptor leak: a process opens too many fds → lsof -p PID | wc -l, ulimit -n.

⚗️ Distributed Systems

Core concepts that come up often

CAP · PACELC · Consistency · Consensus

CAP theorem: a distributed system can guarantee only 2 of 3: Consistency, Availability, Partition tolerance. During a network partition you must choose CP (e.g. a SQL database with synchronous replication, etcd) or AP (e.g. Cassandra, Cosmos DB with relaxed consistency).
Eventual consistency: replicas converge over time, so reads can be stale. Acceptable for carts and feeds; not acceptable for balances/payments.
Idempotency: retries are safe because the result does not change; mandatory for payment APIs and queue consumers.
Distributed lock: Redis SETNX + TTL or a lease. Beware: a GC pause can outlast the TTL, so the lock expires while the holder still believes it owns it.
Saga pattern: a long-running transaction across several services, via choreography (events) or orchestration (saga coordinator); rollback uses compensating transactions.
Outbox pattern: write to the DB and record the event in an outbox table in a single transaction; a polling relay then publishes to the queue. This gives at-least-once delivery with no lost events.
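Because the outbox pattern delivers at-least-once, consumers have to be idempotent. A minimal Python sketch, assuming each request carries an idempotency key. The `PaymentProcessor` class is hypothetical and keeps results in memory, whereas a real service would need a durable store and an atomic check-and-write.

```python
class PaymentProcessor:
    """Applies each idempotency key at most once and replays the stored result on retries."""

    def __init__(self):
        self.total_charged = 0
        self._results = {}  # idempotency_key -> result of the first successful call

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._results:
            # Duplicate delivery or client retry: return the original outcome, no side effect.
            return self._results[idempotency_key]
        self.total_charged += amount
        result = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = result
        return result


processor = PaymentProcessor()
processor.charge("order-42", 100)
processor.charge("order-42", 100)  # retried message, must not charge twice
print(processor.total_charged)     # 100
```

The key comes from the caller (an order ID or a client-generated UUID), so a retry of the same request maps to the same key.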

Consensus & leader election

Raft / Paxos: etcd (the brain of k8s) uses Raft. The leader handles writes, followers replicate, and a quorum is a majority (2 of 3, 3 of 5). If etcd loses 2 of 3 nodes, quorum is lost and the cluster can no longer accept writes.
Split-brain: 2 nodes both believe they are the leader → overwrites / data loss. Fencing tokens guard against it.
Thundering herd: many processes wake up at once after a cache entry expires → jitter + single-flight/mutex to prevent a stampede.
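The quorum numbers are simple majority arithmetic, sketched here in Python (the function names are illustrative):

```python
def quorum(members):
    """Smallest majority of a cluster with the given number of voting members."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in (3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
# 3 members: quorum 2, tolerates 1
# 4 members: quorum 3, tolerates 1
# 5 members: quorum 3, tolerates 2
```

Four members tolerate no more failures than three, which is why etcd clusters are sized with odd member counts.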

🏗️ Platform Engineering & FinOps

Internal Developer Platform (IDP)

Backstage · Golden Path · Self-service

Platform engineering means building an “internal product” for developers, instead of every team setting up CI/CD, Terraform, and observability from scratch. Goal: reduce cognitive load for developers and standardize the golden path.

Self-service portal (Backstage): a developer clicks and gets a repo + pipeline + cluster namespace + monitoring, without waiting for ops.
Golden path templates: standard Helm chart, hardened Dockerfile, pipeline templates; teams that follow them get best practices for free.
Platform team as a product team: measure NPS, lead time, DORA metrics, not just support tickets.

FinOps: cloud cost is a shared responsibility

Tagging strategy: every resource must carry env/team/project tags; without tags it cannot be deployed (enforced by policy).
Chargeback / showback: each team sees how much it spends, which creates the incentive to optimize.
Right-sizing: Prometheus + VPA recommendations to spot oversized pods.
Spot/preemptible nodes: batch/ML workloads run on spot (60-80% savings), stateless apps run on regular nodes.
KEDA scale-to-zero: workers with no work → 0 replicas → no cost.
DORA metrics: Deployment Frequency, Lead Time, Change Failure Rate, MTTR; they measure platform effectiveness, not just uptime.
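The tagging rule is the kind of check that normally lives in a policy engine. The logic itself is small, shown here as a Python sketch in which the `missing_tags` helper and the sample resources are hypothetical.

```python
REQUIRED_TAGS = ("env", "team", "project")


def missing_tags(resource_tags):
    """Return the required tags that are absent or empty on a resource."""
    return [tag for tag in REQUIRED_TAGS if not resource_tags.get(tag)]


# Hypothetical resources as they might appear in a plan or an inventory export.
resources = {
    "vm-api-01": {"env": "prod", "team": "platform", "project": "shop"},
    "disk-tmp-7": {"env": "dev"},
}

for name, tags in resources.items():
    gaps = missing_tags(tags)
    print(name, "OK" if not gaps else f"blocked, missing: {gaps}")
```

Running the same check in CI gives developers the failure before the deploy-time policy rejects the resource.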

🤝 Leadership & Mentoring

Being senior is not just doing good work; you have to multiply it

Technical mentoring: pair programming, code/IaC reviews that explain the “why” instead of just fixing; writing decision records (ADRs) so later teams understand the context.
Runbooks & documentation: knowledge in someone's head does not scale. I prioritize clear runbooks so a junior on-call can handle incidents without pinging a senior.
Blameless culture: postmortems focus on system failures, not on blaming people, which creates an environment where problems get reported early.
Leading without authority: on cross-functional projects I lead through thorough preparation, the right questions, and data-backed proposals, not through title.
Delegation: know which work to hand off (clear ownership, guardrails in place) and which to take on personally, avoiding both abdication and hoarding.

Frequently asked: “How do you get a team to follow best practices without forcing them?” → Make the golden path easier than the hard way; use policy-as-code as a silent guardrail; run workshops + ADRs so the team understands the reasons, not just the rules.

🤖 AIOps

What AIOps is, and which problems it solves

Anomaly Detection · Event Correlation · Auto-remediation · Azure Monitor

AIOps (AI for IT Operations) uses ML + big data to process the huge volume of signals (logs, metrics, traces, events) that humans cannot read through. Goals: reduce alert fatigue, detect anomalies early, speed up root cause analysis, and move toward automated remediation.

Senior view: AIOps does not replace SRE; it filters noise so people can focus on the hard problems. Follow "AI suggests, humans decide" until it is trusted enough to enable auto-remediation.

Core AIOps capabilities

Anomaly detection: ML learns the baseline of a metric (CPU, latency, RPS) and alerts on abnormal deviation, instead of static thresholds that misfire easily. Azure Monitor provides dynamic thresholds for this out of the box.
Event correlation & noise reduction: group hundreds of related alerts into a single incident, preventing alert storms when one root failure cascades.
Root cause analysis: correlate logs + metrics + traces + recent deploy changes to suggest a cause and shorten MTTR.
Predictive / forecasting: forecast disk exhaustion, capacity shortfalls, and seasonal traffic spikes, so you scale before things fall over.
Auto-remediation: hook into runbook automation (Azure Automation, Logic Apps, KEDA) to restart, scale, or fail over automatically once a pattern is confirmed safe.
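To show the idea behind baseline-driven alerting, here is a toy z-score detector in Python. This is a deliberately simple stand-in and not the algorithm behind Azure Monitor dynamic thresholds. The CPU samples and the threshold of 3 standard deviations are assumptions made for this example.

```python
from statistics import mean, pstdev


def is_anomaly(history, value, z_threshold=3.0):
    """Flag value when it deviates from the learned baseline by more than z_threshold sigmas."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return value != mu  # flat baseline: any change is a deviation
    return abs(value - mu) / sigma > z_threshold


# Hypothetical CPU % baselines for the same service at night and during the day.
night_cpu = [20, 22, 19, 21, 20, 18, 21, 19]
day_cpu = [70, 72, 68, 71, 69, 73, 70, 71]

print(is_anomaly(night_cpu, 60))  # True: far above the night baseline
print(is_anomaly(day_cpu, 73))    # False: normal for daytime
```

A static "CPU > 80%" rule would miss the night-time jump to 60% entirely, while a baseline-relative check catches it and stays quiet at 73% during the day.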

Scenario: applying AIOps to reduce alert fatigue

1. Problem: on-call receives hundreds of alerts a day, mostly noise, and the real alerts get buried.

2. Group: enable dynamic thresholds + correlation to merge alerts with the same root into one incident and cut duplicates.

3. Classify: ML ranks by impact + history and pushes the important ones to the top.

4. Result: on-call sees only a handful of real incidents, MTTR drops, and burnout eases, in line with the centralized observability work I did (troubleshooting -70%).

Field note: AIOps only works when the data is clean and has context: standardized logs, correctly labeled metrics, complete traces. Garbage in, garbage out. Start with anomaly detection + correlation (fast ROI, low risk) and add auto-remediation once trust is established.

Built with 🍅 — last updated June 2026