Enterprise system architecture and microservices FAQ

Question 1

How does OKAXI design systems to handle heavy load?

Accepted Answer

OKAXI uses an event-driven architecture with Kafka as the backbone. All communication between microservices goes through event topics rather than synchronous REST calls. The receiving service processes events at its own capacity and is not forced to keep pace with the sending service. Results measured on a retail project: sustained throughput of 5,000 events per second with no degradation, and peak bursts of 12,000 events per second with no events lost.

Question 2

What are Kafka's main roles in asynchronous processing?

Accepted Answer

Kafka plays five roles in the OKAXI stack. First, it is a durable buffer: events are persisted to disk before the producer is acknowledged. Second, it is a partitioned log for parallel consumers: within a consumer group, each partition is processed by one consumer instance. Third, it is a replay log: a consumer can seek back to an older offset when it needs to reprocess. Fourth, it provides fan-out: the same event reaches multiple independent consumer groups. Fifth, it provides natural backpressure: a slow consumer does not slow down the producer.
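
The partitioning, fan-out and replay roles can be illustrated with a toy in-memory log. This is only a sketch, not Kafka and not OKAXI's code: the MiniLog class and its methods are hypothetical, and crc32 stands in for Kafka's real partitioner.

```python
import zlib
from collections import defaultdict


class MiniLog:
    """Toy in-memory stand-in for a partitioned, replayable log (not Kafka)."""

    def __init__(self, partitions: int) -> None:
        self.partitions = [[] for _ in range(partitions)]
        # offsets[group][partition] is the next offset that group will read.
        self.offsets = defaultdict(lambda: defaultdict(int))

    def publish(self, key: str, value: str) -> int:
        # The same key always lands in the same partition, keeping per-key order.
        p = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append(value)
        return p

    def poll(self, group: str, partition: int) -> list:
        # Each consumer group tracks its own offset, which is what gives fan-out.
        start = self.offsets[group][partition]
        events = self.partitions[partition][start:]
        self.offsets[group][partition] = len(self.partitions[partition])
        return events

    def seek(self, group: str, partition: int, offset: int) -> None:
        # Rewinding the offset replays events that were already consumed.
        self.offsets[group][partition] = offset


event_log = MiniLog(partitions=3)
p = event_log.publish("customer-42", "order_created")
event_log.publish("customer-42", "order_paid")

billing = event_log.poll("billing", p)
analytics = event_log.poll("analytics", p)  # independent group, same events
event_log.seek("billing", p, 0)
replayed = event_log.poll("billing", p)     # billing reprocesses from offset 0
```

Both groups read the same two events, and after the seek the billing group sees them again.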

Question 3

How does OKAXI avoid bottlenecks during traffic bursts?

Accepted Answer

There are four main mechanisms. First, Kafka topics are partitioned by a natural key (customer_id, region) to spread load evenly across consumers. Second, the consumer pool is elastic and auto-scales on the topic's lag metric. Third, asynchronous processing separates the latency-critical path from the heavy-computation path. Fourth, circuit breakers on downstream API calls fall back quickly when an external system is slow.
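
The fourth mechanism, the circuit breaker, can be sketched as follows. This is a minimal illustration under assumptions: the CircuitBreaker class, its thresholds and the injectable clock are all hypothetical and are not taken from OKAXI's implementation.

```python
import time


class CircuitBreaker:
    """Fails fast to a fallback after repeated downstream failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()   # open: skip the downstream call entirely
            self.opened_at = None   # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0           # success closes the circuit again
        return result


now = [0.0]
attempts = []


def slow_api():
    attempts.append(1)
    raise TimeoutError("downstream too slow")


breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])
results = [breaker.call(slow_api, lambda: "cached") for _ in range(4)]
```

After two failures the breaker opens, so the last two calls return the fallback without touching the slow dependency.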

Question 4

Can a message broker other than Kafka be used?

Accepted Answer

Yes. OKAXI has integration templates for RabbitMQ, NATS, AWS SQS and Redis Streams. The choice depends on the specific requirements. Kafka suits high throughput with long retention. RabbitMQ suits complex routing and priority queues. NATS suits ultra-low latency. SQS suits serverless workloads on AWS. Redis Streams suits lightweight queues that share infrastructure with the cache layer.

Question 5

How does backpressure work in OKAXI's event-driven architecture?

Accepted Answer

Backpressure is handled naturally through Kafka's offset model. Producers publish events quickly, consumers process at their own capacity, and lag accumulates on the broker. When lag exceeds a configured threshold, an alert fires and auto-scaling spins up new consumer instances. Producers are throttled only when the Kafka disks fill up or the partition count reaches its limit. A real-time dashboard shows consumer lag per topic.
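
The lag-driven scaling decision can be sketched as below. The function names and the per-consumer lag target are hypothetical. The cap at the partition count reflects the fact that, within one consumer group, consumers beyond the number of partitions stay idle.

```python
import math


def total_lag(end_offsets: dict, committed: dict) -> int:
    """Sum of (log end offset - committed offset) over all partitions."""
    return sum(max(end_offsets[p] - committed.get(p, 0), 0) for p in end_offsets)


def desired_consumers(lag: int, lag_per_consumer: int, partitions: int) -> int:
    """Consumers needed to keep lag per consumer under the target."""
    wanted = math.ceil(lag / lag_per_consumer)
    # At least one consumer, at most one per partition.
    return min(max(wanted, 1), partitions)


end_offsets = {0: 1500, 1: 900, 2: 600}
committed = {0: 1000, 1: 900, 2: 100}
lag = total_lag(end_offsets, committed)                      # 500 + 0 + 500
consumers = desired_consumers(lag, lag_per_consumer=300, partitions=3)
```

Here the raw calculation asks for four consumers, but the result is capped at three because the topic has only three partitions.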

Question 6

How does OKAXI ensure that no data is lost between microservices?

Accepted Answer

OKAXI combines the outbox pattern with the saga pattern. Outbox: the business transaction and the event write happen in the same local DB transaction. A relay process reads the outbox table and publishes to Kafka, which guarantees at-least-once delivery. Saga: a long-running workflow is split into small steps, and each step has a compensating action to roll back when a later step fails. Combining the two patterns avoids both lost events and inconsistent business state.
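
The outbox half of this answer can be sketched with SQLite standing in for the service's local database. The table layout, the place_order and relay functions and the publish callback are assumptions made for illustration. A real relay would publish to Kafka.

```python
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
db.execute(
    "CREATE TABLE outbox (event_id TEXT PRIMARY KEY, topic TEXT, "
    "payload TEXT, published INTEGER DEFAULT 0)"
)


def place_order(order_id: str, total: float) -> None:
    # One local transaction: both rows commit, or both roll back.
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute(
            "INSERT INTO outbox (event_id, topic, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "orders",
             json.dumps({"order_id": order_id, "total": total})),
        )


def relay(publish) -> int:
    """Publish unpublished outbox rows in insertion order, then mark them."""
    rows = db.execute(
        "SELECT event_id, topic, payload FROM outbox "
        "WHERE published = 0 ORDER BY rowid"
    ).fetchall()
    for event_id, topic, payload in rows:
        # A crash after publish but before the UPDATE causes a re-publish
        # on the next run, which is why delivery is at-least-once.
        publish(topic, event_id, payload)
        with db:
            db.execute(
                "UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,)
            )
    return len(rows)


sent = []
place_order("o-1", 99.5)
first_run = relay(lambda topic, event_id, payload: sent.append(json.loads(payload)))
second_run = relay(lambda topic, event_id, payload: sent.append(json.loads(payload)))
```

Because the relay can re-publish after a crash, consumers still need to deduplicate on event_id.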

Question 7

How do idempotency keys work in OKAXI's API?

Accepted Answer

Every mutating endpoint accepts an Idempotency-Key header containing a UUID. The server stores the key together with the response in Redis with a 24-hour TTL. A repeated request with the same key returns the stored response instead of being processed again. Kafka producers also tag each event with an event_id UUID, and consumers deduplicate against a processed_events table before committing business logic. Together, the two idempotency layers give exactly-once business semantics even though the transport is at-least-once.
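
Both layers can be sketched in memory. This is an illustration under assumptions: a dict stands in for Redis, a set stands in for the processed_events table, and the IdempotencyStore class and consume function are hypothetical names.

```python
import time
import uuid


class IdempotencyStore:
    """In-memory stand-in for the Redis key store, with a TTL per entry."""

    def __init__(self, ttl=24 * 3600, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}  # key -> (stored_at, response)

    def run_once(self, key, handler):
        entry = self._entries.get(key)
        if entry is not None and self.clock() - entry[0] < self.ttl:
            return entry[1]             # replay the stored response
        response = handler()            # first request, or key has expired
        self._entries[key] = (self.clock(), response)
        return response


processed_events = set()  # stand-in for the processed_events table


def consume(event: dict, apply) -> bool:
    """Apply an event at most once, keyed by event_id."""
    if event["event_id"] in processed_events:
        return False                    # duplicate delivery, skip it
    apply(event)
    processed_events.add(event["event_id"])
    return True


charges = []


def charge():
    charges.append(100)
    return {"status": "charged", "attempt": len(charges)}


store = IdempotencyStore()
key = str(uuid.uuid4())
first = store.run_once(key, charge)
retry = store.run_once(key, charge)     # same key: handler is not run again
```

In a real consumer, the dedupe record and the business write would need to commit in the same database transaction. Otherwise a crash between the two could still apply an event twice or drop it.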

Question 8

How are distributed transactions handled?

Accepted Answer

OKAXI does not use two-phase commit in production. Every cross-service workflow is implemented as either saga choreography or saga orchestration. Choreography: each service reacts to events from the previous service. Orchestration: a central coordinating service invokes each step and handles compensation. The choice of pattern depends on workflow complexity and visibility requirements.
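
The orchestration variant can be sketched as a loop that runs steps in order and, on failure, runs the compensations of completed steps in reverse. The run_saga function and the step names are hypothetical. A production orchestrator would also persist its progress so it can resume after a crash.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps; undo completed ones on failure.

    Returns (True, None) on success or (False, failed_step_name) on failure.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            # Compensate in reverse order, most recent step first.
            for _done_name, undo in reversed(completed):
                undo()
            return False, name
        completed.append((name, compensate))
    return True, None


journal = []


def decline_card():
    raise RuntimeError("card declined")


steps = [
    ("reserve_stock",
     lambda: journal.append("reserve_stock"),
     lambda: journal.append("release_stock")),
    ("charge_payment",
     decline_card,
     lambda: journal.append("refund_payment")),
    ("create_shipment",
     lambda: journal.append("create_shipment"),
     lambda: journal.append("cancel_shipment")),
]

outcome = run_saga(steps)
```

The failed step itself is not compensated, since it never completed. Only the stock reservation is released.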

Question 9

How are network partitions and split-brain handled?

Accepted Answer

Following the CAP theorem, OKAXI chooses AP for the event delivery layer and CP for the configuration and identity layer. The event layer uses a Kafka cluster with replication factor 3 and min.insync.replicas set to 2, which tolerates the temporary loss of one broker. The configuration layer uses etcd or ZooKeeper with quorum writes and rejects writes when a quorum is not available. Consumer-side services detect a partition through heartbeats and fall back to a local cache.
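
The arithmetic behind these settings can be checked with a few lines. The function names are hypothetical, and the Kafka check assumes producers use acks=all, which the answer does not state. With that setting, a broker rejects writes to a partition whose in-sync replica count is below min.insync.replicas.

```python
def kafka_accepts_write(in_sync_replicas: int, min_insync_replicas: int = 2) -> bool:
    # Assumption: the producer is configured with acks=all.
    return in_sync_replicas >= min_insync_replicas


def quorum(members: int) -> int:
    # Majority quorum, as used by etcd and ZooKeeper ensembles.
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    # Members that can be lost while a majority is still reachable.
    return members - quorum(members)


one_broker_down = kafka_accepts_write(in_sync_replicas=2)   # 3 replicas, 1 lost
two_brokers_down = kafka_accepts_write(in_sync_replicas=1)  # 3 replicas, 2 lost
```

With replication factor 3, losing one broker leaves two in-sync replicas and writes continue. Losing two drops below the minimum and writes are rejected. A three-member quorum cluster likewise tolerates one failed member.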

Question 10

Which data consistency model does OKAXI choose?

Accepted Answer

OKAXI applies eventual consistency to cross-service state and strong consistency within each service boundary. Each service has its own local ACID database. Cross-service consistency is achieved through eventual replication and saga compensation. Customers who need strong consistency across services are advised to add synchronous orchestration on top, and the latency cost is explained clearly from the architecture review stage.

Question 11

How are OKAXI's microservices designed to be stateless?

Accepted Answer

Services do not keep state in memory or on local disk. State lives in three separate layers. First, a database (PostgreSQL, MongoDB) holds persistent state. Second, a Redis cluster holds sessions and cache. Third, Kafka holds the event log. A service instance can be scaled up or killed at any time without losing data. A container restart or node failure does not affect business state.

Question 12

How are session state and cache distributed?

Accepted Answer

OKAXI uses a Redis cluster with hash-slot partitioning for both the session store and the cache. Sticky sessions are avoided so that any instance can handle any request. The session token is a signed JWT, and the server side stores only a blacklist for revocation. Cache entries have a short TTL (a few seconds to a few minutes) for hot data and a long TTL (up to 24 hours) for lookup tables. Cache invalidation is driven by Kafka events when the source data changes.
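
The TTL and event-driven invalidation behaviour can be sketched with an in-memory cache. The TTLCache class, the on_change_event handler and the event shape are hypothetical. In the setup described above, Redis would hold the entries and the event would arrive from a Kafka topic.

```python
import time


class TTLCache:
    """In-memory cache with per-entry TTL and event-driven invalidation."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl):
        self._data[key] = (value, self.clock() + ttl)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self.clock() >= expires_at:
            del self._data[key]         # expired: drop and report a miss
            return None
        return value

    def on_change_event(self, event):
        # Called when the source data changes; drops the stale entry.
        self._data.pop(event["key"], None)


now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.set("price:sku-1", 19.9, ttl=30)                 # hot data, short TTL
cache.set("country:VN", "Vietnam", ttl=24 * 3600)      # lookup table, long TTL

cache.on_change_event({"key": "country:VN"})           # source row was updated
after_invalidation = cache.get("country:VN")
now[0] = 31.0
after_expiry = cache.get("price:sku-1")
```

Both reads miss: the first because of the change event, the second because the short TTL has elapsed.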

Question 13

What is OKAXI's auto-scaling policy?

Accepted Answer

OKAXI uses the Kubernetes HPA with two main signals. CPU utilization above 70 percent triggers a scale-up. A custom metric, Kafka consumer lag above a threshold, also triggers a scale-up. Scale-down is slower (5-minute cooldown) to avoid thrashing when traffic fluctuates. A pre-warmed pod pool is kept ready for expected traffic peaks (campaign launches, sale events). The cluster autoscaler adds nodes when the number of pending pods exceeds a threshold.
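
The HPA's documented core formula is desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). When several metrics are configured, it takes the largest result. The sketch below applies that formula to the two signals. The lag target of 1000 and the replica cap are made-up values, and the HPA's tolerance band and stabilization window are left out.

```python
import math


def desired_replicas(current: int, metric_value: float, target_value: float) -> int:
    """Core HPA formula for a single metric."""
    return math.ceil(current * metric_value / target_value)


def scale_decision(current: int, cpu_percent: float, avg_lag: float,
                   cpu_target: float = 70.0, lag_target: float = 1000.0,
                   max_replicas: int = 20) -> int:
    # With multiple metrics the largest proposal wins.
    wanted = max(
        desired_replicas(current, cpu_percent, cpu_target),
        desired_replicas(current, avg_lag, lag_target),
    )
    return max(1, min(wanted, max_replicas))


busy = scale_decision(current=4, cpu_percent=91.0, avg_lag=2500.0)
quiet = scale_decision(current=4, cpu_percent=35.0, avg_lag=400.0)
```

In the busy case CPU alone would ask for 6 replicas, but lag asks for 10, so lag drives the decision.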

Question 14

What is OKAXI's database scaling strategy?

Accepted Answer

OKAXI applies several strategies depending on the use case. Read-heavy workloads use read replicas, with the connection pool routing reads to a replica. Write-heavy single-tenant workloads use partitioning by a natural key (customer_id, time range). Multi-tenant systems use sharding or database-per-tenant. Multi-master is avoided except in special cases because conflict resolution is complex. Customers receive specific recommendations after the architecture review.
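
Two of these strategies, key-based sharding and read/write routing, can be sketched as follows. The shard_for function, the ReadWriteRouter class and the node names are hypothetical. The SELECT check is deliberately simple and would not be sufficient for real SQL.

```python
import hashlib
import itertools


def shard_for(tenant_id: str, shard_count: int) -> int:
    """Stable shard index derived from a hash of the tenant key."""
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ReadWriteRouter:
    """Sends reads to replicas in rotation and everything else to the primary."""

    def __init__(self, primary: str, replicas: list):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def target(self, sql: str) -> str:
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary


router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
targets = [
    router.target("SELECT * FROM orders"),
    router.target("select id from orders"),
    router.target("INSERT INTO orders VALUES (1)"),
    router.target("SELECT 1"),
]
shard = shard_for("tenant-acme", shard_count=4)
```

Replicas can lag behind the primary, so reads that must see a write just made usually go to the primary instead.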

Question 15

What is OKAXI's load balancer strategy?

Accepted Answer

OKAXI combines L7 and L4 load balancing depending on the layer. Ingress uses L7 with Nginx or Envoy for HTTP routing, TLS termination and rate limiting. The internal service mesh uses L4 for the gRPC binary protocol to minimize parsing cost. Sticky sessions are strictly avoided, and every load balancer uses round-robin or least-connection. Active health checks run every 5 seconds to remove unhealthy pods from the pool quickly.
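
Least-connection selection combined with health status can be sketched in a few lines. The backend dictionaries and the pick_backend function are hypothetical. In practice Nginx or Envoy makes this choice internally.

```python
def pick_backend(backends: list) -> dict:
    """Return the healthy backend with the fewest active connections."""
    healthy = [b for b in backends if b["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy backend available")
    return min(healthy, key=lambda b: b["active"])


pool = [
    {"name": "pod-a", "active": 12, "healthy": True},
    {"name": "pod-b", "active": 3, "healthy": False},   # failed its health check
    {"name": "pod-c", "active": 7, "healthy": True},
]
chosen = pick_backend(pool)
```

pod-b has the fewest connections but is skipped because its health check failed, so pod-c is chosen.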

Question 16

How does OKAXI collect logs across distributed services?

Accepted Answer

OKAXI attaches an OpenTelemetry agent to each service container. The agent collects logs, metrics and traces and sends them to a collector cluster. The collector routes logs to Loki or the ELK stack, depending on the customer. The standard format is structured JSON with correlation_id, request_id and customer_id. Search and filtering go through Grafana or Kibana. The retention policy is 30 days hot and 1 year cold on object storage.
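
The structured JSON format can be sketched with Python's standard logging module. The JsonFormatter class, the logger name and the fields other than the three IDs named above are assumptions. This does not show OKAXI's actual log schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Renders each record as one JSON object with the correlation fields."""

    FIELDS = ("correlation_id", "request_id", "customer_id")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.FIELDS:
            # Values passed through `extra=` become attributes on the record.
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


stream = io.StringIO()              # stands in for stdout in this sketch
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info(
    "order accepted",
    extra={"correlation_id": "c-1", "request_id": "r-9", "customer_id": "cust-42"},
)
line = json.loads(stream.getvalue())
```

One JSON object per line keeps the output easy to parse and filter by any of the three IDs.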

Question 17

How does OKAXI implement distributed tracing?

Accepted Answer

OKAXI uses the OpenTelemetry SDK across its Go, Python, Java and Node backends. A trace consists of a root span at the ingress and child spans for each service hop, database query and external API call. Traces are stored in Jaeger or Tempo. The sampling rate is configured per service: 100 percent for debug environments and 1 to 10 percent for production, depending on traffic. The trace ID is propagated through HTTP headers and Kafka message headers.
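
Propagation and ratio-based sampling can be sketched without the SDK. The answer does not name a header format, so this assumes the W3C traceparent layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The function names are hypothetical and the sampler is a simplified stand-in for the SDK's own ratio sampler.

```python
import secrets


def new_traceparent(sampled: bool = True) -> str:
    """Root context: fresh trace ID and span ID in W3C traceparent layout."""
    trace_id = secrets.token_hex(16)    # 32 hex characters
    span_id = secrets.token_hex(8)      # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent: str) -> str:
    """Context for the next hop: same trace ID and flags, new span ID."""
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def should_sample(trace_id: str, ratio: float) -> bool:
    # Deterministic: every service reaches the same decision for a trace ID.
    return int(trace_id[:16], 16) < ratio * 2**64


root = new_traceparent()
http_headers = {"traceparent": child_traceparent(root)}
# Kafka headers are (key, bytes) pairs, so the same value is encoded.
kafka_headers = [("traceparent", child_traceparent(root).encode())]
```

Deriving the sampling decision from the trace ID means a trace is either kept by all services or dropped by all of them.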

Question 18

How does OKAXI handle real-time metrics and alerting?

Accepted Answer

OKAXI uses Prometheus to scrape metrics from every service every 15 seconds. Grafana provides real-time dashboards. AlertManager routes alerts to Slack, email or PagerDuty depending on severity. The main metrics tracked are request rate, error rate and latency at p50/p95/p99 (RED method), CPU/memory/disk (USE method), and business KPIs that vary by industry. Alert thresholds are configured from the target SLOs.

Question 19

How are failures detected and isolated in microservices?

Accepted Answer

There are three layers of detection. First, automatic alerts fire when error rate or latency breaches the SLO. Second, distributed traces are available for each failing request: the on-call engineer pulls the trace ID from the logs and follows the chain of service hops. Third, internal chaos engineering injects faults periodically to verify circuit breakers and fallbacks. Once the faulty service is identified, its traffic is drained gradually through the load balancer while the team fixes the root cause.

Question 20

What is OKAXI's approach to SLI, SLO and SLA?

Accepted Answer

An SLI (Service Level Indicator) is an actual measurement, such as availability, latency, error rate or freshness. An SLO (Service Level Objective) is an internal target, for example 99.9 percent availability, p95 latency under 300 ms for APIs, or 1,000 RPS sustained throughput. An SLA (Service Level Agreement) is a commitment to the customer, usually set one step looser than the SLO to leave a margin. The error budget is calculated from the SLO. When the budget is exhausted, the team freezes new feature releases and focuses on reliability work.
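
The error budget arithmetic can be worked through for the 99.9 percent availability example. The 30-day window, the request counts and the function names are assumptions chosen for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over the window, in minutes."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent."""
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


def release_frozen(slo: float, total_requests: int, failed_requests: int) -> bool:
    # Freeze feature releases once the budget is used up.
    return budget_remaining(slo, total_requests, failed_requests) <= 0


minutes = error_budget_minutes(0.999)                      # about 43.2 minutes
remaining = budget_remaining(0.999, 1_000_000, 250)        # about 0.75
frozen = release_frozen(0.999, 1_000_000, 1_200)
```

A 99.9 percent SLO over 30 days allows roughly 43 minutes of downtime. With 250 failures out of a million requests, about three quarters of the budget is left. With 1,200 failures the budget is overspent and releases freeze.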
