THM System Design

2022-01-25

TryHackMe Infrastructure & System Design

A comprehensive technical guide to the AWS infrastructure, system design patterns, and cloud architecture that power the TryHackMe platform.

Infrastructure Overview
AWS Multi-Region Architecture
Infrastructure as Code (SST v3)
VPC & Networking
VM Deployment Pipeline
Guacamole Remote Desktop System
Networks v2 (Pulumi Dynamic IaC)
AWS Services Map
Data Layer
Real-Time Communication
Job Scheduling & Background Processing
Authentication & Session Management
Observability Stack
Request Lifecycle
Configuration & Secrets
Key File Reference

1. Infrastructure Overview

2. AWS Multi-Region Architecture

Why multi-region? VM interactions are latency-sensitive -- typing in a remote desktop session with 200ms latency is unusable. By deploying Guacamole clusters and VM instances in 5 regions, users connect to the nearest one. The API itself stays in eu-west-1 (single-region) because it only handles REST/WebSocket traffic where 100-200ms latency is acceptable.

TryHackMe operates across 5 AWS regions to minimize latency for global users deploying VMs.

Region Responsibilities

Region	Role	Key Resources
`eu-west-1`	Primary. API, database, all services	API server, MongoDB, Redis, Kafka, all AWS services
`us-east-1`	Modern region with dynamic VPC	Dynamically provisioned VPC (`10.201.0.0/16`), S3 gateway endpoint
`us-west-1`	Legacy VM region	Guacamole ASG, VM instances
`ap-south-1`	APAC VM region	Guacamole cluster, Nginx proxy, VMs
`eu-central-1`	EU secondary VM region	Guacamole cluster, Nginx proxy, VMs

Cross-Region Connectivity

Transit Gateway peering connects all VM regions for:

VPN client pool routing (10.8.0.0/24, 10.9.0.0/24, 10.13.0.0/24)
Network CIDR routing between regions
External BADR account peering (account 439472160432)

3. Infrastructure as Code (SST v3)

Why SST v3? SST v3 wraps Pulumi and Terraform under a TypeScript-first API, so infrastructure is defined in the same language as the application. Unlike SST v2 (which was CDK/CloudFormation), v3 deploys faster, supports multi-provider (AWS + Cloudflare in one config), and avoids CloudFormation's 500-resource stack limits. The Link system lets Lambda functions access secrets and resource ARNs without manual env var wiring.

All infrastructure is managed via SST v3 (not v2/CDK). The root config is at sst.config.ts.

SST Apps

Shared Infrastructure (`infrastructure/shared.ts`)

Defines resources shared across all SST apps:

Resource	Type	Purpose
`MongoConnectionString`	SST Secret	MongoDB connection string
`CloudflareAccessClientId/Secret`	SST Secret	Cloudflare Zero Trust auth (non-prod)
`NetworksIacInvocationQueue`	SQS FIFO Queue	Network provisioning message bus
`MainWebAppBaseUrl`	SST Linkable	Base URL for the main web app
Application VPC IDs	SST Linkable	VPC, subnet, and SG references per region

Deployment

# Deploy all apps to staging
npx sst deploy --stage staging
 
# Deploy specific app
APPS=core npx sst deploy --stage production
 
# Selective deployment via APPS env var
APPS=guacamole,nginx npx sst deploy --stage staging

4. VPC & Networking

VPC Layout (Production, eu-west-1)

Security Groups

Security Group	Ports	Direction	CIDR	Purpose
`VulnerableVmsSecurityGroup`	All except 80	Ingress	Private IPs only	Isolate vulnerable machines
`AllowAllSecurityGroup`	All	All	`0.0.0.0/0`	Permissive SG for specific services
`GuacamoleSecurityGroup`	8080	Ingress	Cloudflare IPs	Guacamole web app access
`GuacdLoadBalancerSecurityGroup`	4822	Ingress	Guacamole SG	Guacd protocol traffic
`GuacdSecurityGroup`	4822	Ingress	LB SG	Guacd daemon access
`NginxSecurityGroup`	80, 8443	Ingress	`0.0.0.0/0`	Nginx reverse proxy

us-east-1 (Modern Region, Dynamic VPC)

Unlike legacy regions, us-east-1 creates its VPC dynamically in SST:

VPC: 10.201.0.0/16
├── Vulnerable VMs Subnet: 10.201.0.0/17
├── Remote Access Subnet:  10.201.128.0/18
├── LB Subnet:             10.201.192.0/24
├── Internet Gateway (dynamic)
├── Transit Gateway Attachment (tgw-01c8741ec73fb2f5c)
└── S3 Gateway Endpoint (private S3 access)
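
The subnet layout above can be sanity-checked with a little CIDR arithmetic. The following is a minimal sketch, not code from the repository: the helper names (`cidrRange`, `contains`, `overlaps`) are made up for illustration, and only the CIDR blocks come from the layout above.

```typescript
// Parse "a.b.c.d/n" into an inclusive [start, end] range of 32-bit addresses.
function cidrRange(cidr: string): [number, number] {
  const [ip, prefixText] = cidr.split("/");
  const prefix = Number(prefixText);
  // Multiplication (not bit shifts) keeps the address an unsigned number.
  const base = ip.split(".").map(Number).reduce((acc, octet) => acc * 256 + octet, 0);
  const size = 2 ** (32 - prefix);
  const start = Math.floor(base / size) * size; // align to the network boundary
  return [start, start + size - 1];
}

// True when `inner` lies entirely within `outer`.
function contains(outer: string, inner: string): boolean {
  const [outerStart, outerEnd] = cidrRange(outer);
  const [innerStart, innerEnd] = cidrRange(inner);
  return innerStart >= outerStart && innerEnd <= outerEnd;
}

// True when the two blocks share at least one address.
function overlaps(a: string, b: string): boolean {
  const [aStart, aEnd] = cidrRange(a);
  const [bStart, bEnd] = cidrRange(b);
  return aStart <= bEnd && bStart <= aEnd;
}

const vpc = "10.201.0.0/16";
const subnets = [
  "10.201.0.0/17", // Vulnerable VMs
  "10.201.128.0/18", // Remote Access
  "10.201.192.0/24", // LB
];

const allInsideVpc = subnets.every((subnet) => contains(vpc, subnet));
const anyOverlap = subnets.some((a, i) =>
  subnets.slice(i + 1).some((b) => overlaps(a, b)),
);

console.log({ allInsideVpc, anyOverlap });
```

With the blocks listed above, every subnet falls inside the /16 and none of them overlap, which leaves `10.201.193.0` and above unallocated.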

5. VM Deployment Pipeline

This is the core of TryHackMe's infrastructure -- deploying EC2 instances for users to interact with.

Deployment Flow

Key Concepts

Upload (VM Template): An Upload defines a VM image -- its AMI, instance type, OS, remote connection type, credentials, and boot time. Uploads are stored in MongoDB and reference AMIs in each region.

Instance: A running VM instance. Tracks AWS instance ID, connection details, expiry time, and state. Stored in MongoDB.

Instance Types & Selection:

Retry Strategy:

Spot instances: up to AWS_SPOT_MAX_RETRIES retries (default 4)
On-demand: up to AWS_ON_DEMAND_MAX_RETRIES retries (default 8)
Backoff delay: AWS_BACKOFF_DELAY_MS (default 100ms)
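
As a rough illustration of how these three settings could fit together, here is a minimal sketch. It makes several assumptions that the settings above do not confirm: that "retries" means attempts after the first one, that the delay between attempts is fixed rather than exponential, and that the values are read from environment variables of the same names. `resolveRetryConfig` and `withRetries` are hypothetical helpers, not functions from the codebase.

```typescript
interface RetryConfig {
  spotMaxRetries: number;
  onDemandMaxRetries: number;
  backoffDelayMs: number;
}

// Read the documented variables, falling back to the documented defaults.
function resolveRetryConfig(env: Record<string, string | undefined>): RetryConfig {
  const num = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw.trim() === "") return fallback;
    const parsed = Number(raw);
    return Number.isFinite(parsed) ? parsed : fallback;
  };
  return {
    spotMaxRetries: num("AWS_SPOT_MAX_RETRIES", 4),
    onDemandMaxRetries: num("AWS_ON_DEMAND_MAX_RETRIES", 8),
    backoffDelayMs: num("AWS_BACKOFF_DELAY_MS", 100),
  };
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Assumption: one initial attempt plus up to `maxRetries` retries,
// separated by a fixed delay.
async function withRetries<T>(
  launch: () => Promise<T>,
  maxRetries: number,
  delayMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await launch();
    } catch (error) {
      lastError = error;
      if (attempt < maxRetries) await sleep(delayMs);
    }
  }
  throw lastError;
}
```

A caller would pass `spotMaxRetries` or `onDemandMaxRetries` depending on the purchase option being attempted. Whether a failed spot launch falls back to on-demand is not stated above, so the sketch leaves that decision to the caller.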

Instance Lifecycle:

Default expiry: varies by context (1-3 hours)
Extension: +1 hour (once, only in last hour, max 6 hours total)
Termination: User-initiated, auto-expiry, or spot interruption

Cost Center Tagging:

Every EC2 instance is tagged for billing attribution:

CostCenter	Usage
`ROOM`	Standard room VMs
`EXAM`	Certification exam VMs
`SOC_SIM`	SOC Simulator VMs
`THREAT_HUNTER`	Threat Hunting VMs
`NETWORK`	Network lab VMs

6. Guacamole Remote Desktop System

Why Guacamole? Users need browser-based access to Windows (RDP) and Linux (VNC/SSH) VMs without installing any client software. Guacamole is the only mature open-source solution that renders RDP/VNC/SSH in a browser via HTML5 Canvas. The guacd daemon handles protocol translation, while the web app serves the UI and manages connections. Running it on ECS Fargate per-region keeps latency low for users connecting to their nearest VM.

Apache Guacamole provides browser-based access to VMs via RDP, VNC, and SSH.

Connection Types

Type	Protocol	Port	Use Case
`GUACAMOLE`	RDP/VNC/SSH via Guacamole	8080	Primary VM access method
`NO_VNC`	VNC via Nginx reverse proxy	80/8443	Lightweight VNC access
`DCV`	Amazon DCV	Custom	High-performance desktop
`WEB_PROXY`	HTTP reverse proxy	80	Web-based tools

Guac-A-Worker (Cloudflare)

The Guac-A-Worker is a Cloudflare Worker that routes VM requests:

User requests *.guacaworker.{domain}
Worker extracts task ID from subdomain
Looks up public DNS name in Cloudflare KV
Routes request to correct Guacamole instance
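
The four steps above can be sketched as two small functions. This is an illustrative outline rather than the Worker's actual source: it assumes the task ID is the left-most hostname label, that the KV value is a plain DNS name, and that the path and query are forwarded unchanged. `KvLike`, `extractTaskId`, and `resolveUpstream` are hypothetical names; only the `get(key)` call mirrors the Cloudflare KV binding, which resolves to a string or `null`.

```typescript
// Minimal shape of the KV binding used here.
interface KvLike {
  get(key: string): Promise<string | null>;
}

// Assumption: the task ID is the left-most label,
// e.g. "<taskId>.guacaworker.example.com".
function extractTaskId(hostname: string, baseDomain: string): string | null {
  const suffix = "." + baseDomain.toLowerCase();
  const host = hostname.toLowerCase();
  if (!host.endsWith(suffix)) return null;
  const label = host.slice(0, -suffix.length);
  // Reject empty or nested labels such as "a.b.guacaworker.example.com".
  return label.length > 0 && !label.includes(".") ? label : null;
}

async function resolveUpstream(
  requestUrl: string,
  baseDomain: string,
  kv: KvLike,
): Promise<URL | null> {
  const url = new URL(requestUrl);
  const taskId = extractTaskId(url.hostname, baseDomain);
  if (taskId === null) return null;

  const dnsName = await kv.get(taskId); // task-to-DNS mapping
  if (dnsName === null) return null;

  // Keep path and query; swap only the host for the looked-up DNS name.
  const upstream = new URL(url.toString());
  upstream.hostname = dnsName;
  return upstream;
}
```

In a Worker, the `fetch` handler would call `resolveUpstream` with the incoming request URL and the KV binding, then proxy to the returned URL or answer with a 404 when it is `null`.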

Resources created by SST:

KV Namespace for task-to-DNS mapping
Cloudflare DNS records (wildcard + base)
Advanced Certificate Manager (Google CA)
Worker route: *guacaworker.{domain}/*
IAM user with ECS/EC2 describe permissions

Nginx Reverse Proxy

For VNC connections, an Nginx reverse proxy runs per-region:

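# IP-based subdomain routing
# 10-x-x-x.reverse-proxy-{region}.{domain} -> 10.x.x.x
# $target_ip and $target_port are derived from the request elsewhere (not shown here).
server {
    location / {
        proxy_pass http://$target_ip:$target_port;
    }
    location /websockify/ {
        proxy_pass http://$target_ip:$target_port;
        # WebSocket upgrade for VNC
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}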
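
The hostname-to-IP mapping in the comment above can be expressed as a small pure function. This is a sketch under assumptions: `targetIpFromHost` is a hypothetical helper (the real mapping happens inside the Nginx config), and it assumes the first label always encodes a `10.x.x.x` address with dashes in place of dots.

```typescript
// Map "10-x-x-x.reverse-proxy-{region}.{domain}" to "10.x.x.x".
// Returns null for anything that is not a well-formed 10.0.0.0/8 address.
function targetIpFromHost(host: string): string | null {
  const match = /^10-(\d{1,3})-(\d{1,3})-(\d{1,3})\.reverse-proxy-/.exec(host);
  if (match === null) return null;
  const octets = match.slice(1).map(Number);
  if (octets.some((octet) => octet > 255)) return null;
  return `10.${octets.join(".")}`;
}

console.log(targetIpFromHost("10-201-3-17.reverse-proxy-us-east-1.example.com"));
```

Restricting the pattern to the `10.` prefix means a crafted hostname cannot point the proxy at an arbitrary public address, which seems to be the intent of the naming scheme.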

Scaling:

Production: 10-50 ARM64 instances (0.25 vCPU, 0.5GB each)
Staging: 2-4 instances

7. Networks v2 (Pulumi Dynamic IaC)

Why Pulumi inside Lambda (and not SST/Terraform)? Each user's network lab needs its own VPC, subnets, VPN server, and multiple VMs -- created on-demand and destroyed hours later. SST/Terraform are for long-lived infrastructure. Pulumi's Automation API lets you run pulumi up and pulumi destroy programmatically from code, making it ideal for ephemeral, per-user infrastructure. Lambda provides the compute, SQS provides the queue, and EFS persists the Pulumi binary and state between invocations.

Networks v2 creates isolated network environments for users using Pulumi Automation API inside a Lambda function.

Lambda Configuration

Setting	Value
Runtime	Node.js 22.x
Memory	1 GB
Timeout	15 minutes
Storage	EFS mount at `/mnt/efs`
VPC	Connected to application VPC
Permissions	Full AWS access (`*`)
Trigger	SQS FIFO queue

What Pulumi Creates Per Network

Network Expiry

A cron Lambda runs every minute to check for expired networks:

Queries MongoDB for expired network instances
Sends SQS message to destroy the Pulumi stack
Lambda destroys all AWS resources for that network
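
The first two steps can be sketched as a pure function that turns expired network records into FIFO queue messages. The document shape (`id`, `expiresAt`) and the message body are hypothetical; only `MessageGroupId` and `MessageDeduplicationId` are real SQS FIFO send parameters.

```typescript
// Hypothetical record shape; field names are illustrative only.
interface NetworkInstance {
  id: string;
  expiresAt: Date;
}

interface DestroyMessage {
  MessageBody: string;
  MessageGroupId: string; // FIFO: messages within one group are delivered in order
  MessageDeduplicationId: string; // FIFO: duplicates inside the dedup window are dropped
}

function buildDestroyMessages(networks: NetworkInstance[], now: Date): DestroyMessage[] {
  return networks
    .filter((network) => network.expiresAt.getTime() <= now.getTime())
    .map((network) => ({
      MessageBody: JSON.stringify({ action: "destroy", networkId: network.id }),
      MessageGroupId: network.id,
      MessageDeduplicationId: `destroy-${network.id}`,
    }));
}
```

Because the cron fires every minute, a network that is still being torn down would be selected again on the next run. Using a stable deduplication ID per network is one way to keep the queue from filling with repeated destroy requests during SQS's deduplication window; whether the real handler does this is not stated above.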

8. AWS Services Map

Service Usage Summary -- What & Why

Service	What It Does	Why This Choice
EC2	Deploys VMs (spot + on-demand), manages AMIs, key pairs, volumes	The entire product is built around giving users real machines to hack. EC2 is the only AWS service that provides full OS-level virtual machines with arbitrary network configs and pre-baked AMIs. Spot instances cut costs ~60-70% for ephemeral, short-lived VMs that users will destroy in 1-3 hours anyway.
S3	Stores images, VPN configs, badges, certificates, VM uploads, RAG files, skills matrices	Cheap, durable, globally-accessible object storage. VM disk images need to live in S3 before being imported as AMIs. Presigned URLs let the frontend upload/download directly without proxying through the API.
ECS Fargate	Runs Guacamole (web + guacd) and Nginx reverse proxy per region	Guacamole and Nginx are long-running services that need to scale independently per region. Fargate removes the need to manage EC2 hosts for these containers -- AWS handles the underlying compute.
Lambda	Networks IaC handler, Turbo S3 cache handler, cron jobs, log ingestion, AMI state change handler	Perfect for event-driven, short-lived work: a SOC-Sim log needs to be ingested at a specific time, a network needs provisioning, an AMI import completed. Pay-per-invocation means zero cost when idle.
EventBridge Scheduler	Schedules one-time events for SOC-Sim log ingestion into SIEM VMs	SOC-Sim logs must appear in the SIEM at precise timestamps to simulate a real attack timeline. EventBridge one-time schedules let you say "fire this Lambda at exactly 14:32:07" per log entry -- no polling, no cron.
SQS	FIFO queue for network infrastructure provisioning messages	Network creation via Pulumi takes up to 15 minutes. SQS decouples the API request from the Lambda execution, guarantees message ordering (FIFO), and provides at-least-once delivery with built-in retry.
API Gateway	HTTP APIs for Turbo S3 cache and Mock API	Lightweight serverless HTTP endpoints that don't justify running a full server. API Gateway + Lambda is the cheapest way to expose an HTTP endpoint that's called infrequently.
Transit Gateway	Cross-region VPC peering for VM connectivity	Users in India deploy VMs in ap-south-1, but the API runs in eu-west-1. TGW peering lets the API server's VPC talk to VM VPCs across regions without setting up individual VPC peering per region-pair.
Step Functions	Cloud training account lifecycle (assign, move OU, recycle)	AWS account provisioning is a multi-step workflow (create account, move to OU, set permissions, assign to user). Step Functions model this as a state machine with built-in retries and rollback -- much safer than chaining Lambda calls manually.
SSM Parameter Store	Cell-based infrastructure configs, subnet/SG references	Infrastructure metadata (which subnets are available, which security groups to use) changes per cell and region. SSM provides a centralized, versioned key-value store that both IaC and runtime code can read from. Cheaper than Secrets Manager for non-sensitive config.
Secrets Manager	Guacamole secret keys, sensitive credentials per region	Guacamole JSON secret keys and other credentials must be encrypted at rest and rotatable. Secrets Manager provides automatic encryption, access audit trails, and cross-region replication.
Auto Scaling	Manages Guacamole load balancer instance pools	VM traffic is bursty -- a cohort of students might start a room simultaneously. Auto Scaling adjusts Guacamole capacity between 2-20 instances based on CPU, so we don't over-provision during off-peak hours.
CloudWatch	Logs for all Lambda functions, ECS Container Insights, CPU alarms	Lambda logs go to CloudWatch by default (no choice). Container Insights gives ECS-level metrics (CPU, memory, network) without instrumenting the app. Alarms trigger when Guacamole CPU hits 60% to flag scaling issues early.
EFS	Persistent storage for Pulumi binaries and state in Lambda	Lambda's ephemeral `/tmp` storage defaults to 512 MB and does not survive across execution environments. Pulumi's binary, node_modules, and state files exceed that default and must outlive a single invocation. EFS provides a shared, persistent filesystem mounted at `/mnt/efs` that persists across Lambda invocations.
IAM	Multi-account credential profiles (default, admin, network-infra, vm-upload, cloud-training)	Different operations require different AWS accounts with different permission boundaries. VM uploads go to a dedicated account, cloud training uses its own account, network infra has its own. IAM profiles isolate blast radius -- a compromised key only affects one concern.
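
To make the EventBridge Scheduler row concrete: one-time schedules use an `at(yyyy-mm-ddThh:mm:ss)` expression. The helper below is a small sketch that formats a `Date` into that form in UTC; `oneTimeScheduleExpression` is a hypothetical name, and it assumes the schedule is created without a custom time zone.

```typescript
// Format a Date as an EventBridge Scheduler one-time expression in UTC.
function oneTimeScheduleExpression(when: Date): string {
  // toISOString() yields e.g. "2025-03-01T14:32:07.000Z";
  // the first 19 characters drop the milliseconds and the trailing "Z".
  return `at(${when.toISOString().slice(0, 19)})`;
}

console.log(oneTimeScheduleExpression(new Date(Date.UTC(2025, 2, 1, 14, 32, 7))));
```

The resulting string would be passed as the schedule expression when creating one schedule per log entry, with the ingestion Lambda as the target.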

9. Data Layer

MongoDB

Why MongoDB? The data is deeply nested and schema-diverse -- a VM instance document looks nothing like a user document or a SOC-Sim run. MongoDB's flexible schema handles this naturally. Mongoose adds just enough structure (schemas, validation, indexes) without the rigidity of a relational DB. The platform also needs fast reads on varied query patterns (by user, by room, by instance ID, by status), which compound indexes on a document store handle well.

Connection settings:

Pool: min 2, max 50
Read preference: primary
Read concern: local
Write concern: majority
Retry reads and writes enabled
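
Expressed as driver options, the settings above might look like the following sketch. The option names follow the MongoDB Node.js driver and connection-string format; the object and helper names are hypothetical, and the real setup lives in the database module listed in the file reference.

```typescript
// Values are the ones listed above; names follow the MongoDB Node.js driver.
const mongoOptions = {
  minPoolSize: 2,
  maxPoolSize: 50,
  readPreference: "primary",
  readConcern: { level: "local" },
  writeConcern: { w: "majority" },
  retryReads: true,
  retryWrites: true,
};

// The same settings expressed as connection-string query parameters.
function toConnectionStringParams(options: typeof mongoOptions): string {
  const params = new URLSearchParams({
    minPoolSize: String(options.minPoolSize),
    maxPoolSize: String(options.maxPoolSize),
    readPreference: options.readPreference,
    readConcernLevel: options.readConcern.level,
    w: options.writeConcern.w,
    retryReads: String(options.retryReads),
    retryWrites: String(options.retryWrites),
  });
  return params.toString();
}

console.log(toConnectionStringParams(mongoOptions));
```

Reading from the primary with a `majority` write concern trades some write latency for durability: an acknowledged write has reached most replica set members before the API responds.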

Key collections:

instances -- Running VM instances
uploads -- VM templates/images
users -- User accounts
soc_sim_runs / soc_sim_run_alerts -- SOC Sim data
threat_hunting_runs -- Threat Hunting data
scheduler_jobs -- Agenda.js job queue
sessions -- Express sessions
Network templates, paths, modules, rooms, companies, etc.

Redis

Why Redis? Sub-millisecond reads for rate limiting (every single API request checks Redis), Socket.IO pub/sub across multiple API instances, and Bull queue backing. Two instances isolate the latency-critical main workload from the Tutor feature's higher-volume writes.

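Two separate Redis instances serve different purposes: a main instance for rate limiting, Socket.IO pub/sub, and Bull queues, and a dedicated instance for the Tutor feature.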

Kafka

Used for event streaming between the API and analytics consumers.

Why Kafka? High-throughput, ordered event streaming that decouples producers from consumers. Analytics events, tutor interactions, and MongoDB change events generate a firehose of data -- Kafka handles this without back-pressure on the API. Consumer groups allow multiple independent consumers (analytics pipeline, tutor service) to process the same events at their own pace.

Topics include MongoDB change events and analytics events. A Vercel-to-ECS HTTP bridge exists for the Next.js frontend to publish events.

10. Real-Time Communication

Socket.IO Architecture

Why Socket.IO over raw WebSockets? Socket.IO provides automatic reconnection, room-based broadcasting (critical for SOC-Sim multiplayer), binary streaming, and -- most importantly -- the Redis adapter for multi-instance scaling. Raw WebSockets would require building all of this from scratch.

The Redis adapter (@socket.io/redis-adapter) ensures events published on one API instance reach clients connected to other instances. A Redis emitter (@socket.io/redis-emitter) bridges Vercel serverless functions to ECS-hosted Socket.IO.

Socket Handlers:

Handler	Events
SOC Sim	`soc-sim-join-run`, alert assign/unassign, run status changes
TTX	Tabletop exercise events
Tutor	AI tutor chat events
Rooms	Room join/leave, task progress
Networks	Network instance status
Missions	Mission progress updates
KOTH	King of the Hill game events
Chat	General chat

11. Job Scheduling & Background Processing

Four Scheduling Systems

System	Backing Store	Use Case	Why This One
Agenda.js	MongoDB (`scheduler_jobs`)	Short-interval polling (health checks, score reduction)	Already have MongoDB. Agenda stores jobs as documents -- no extra infrastructure. Supports `repeatEvery('3 seconds')` for VM health checks, which is too frequent for EventBridge (1-min minimum) and too short-lived for Bull.
node-cron	In-process	Periodic maintenance (runs when `THM_NODE_CRON=cron`)	Simplest option for periodic tasks that don't need persistence or distribution. Runs on a dedicated `cron` process via `THM_NODE_CRON` env flag so it doesn't execute on every API instance.
EventBridge	AWS managed	One-time scheduled events (log delivery), periodic cron (network expiry)	SOC-Sim needs thousands of one-shot schedules ("fire Lambda at 14:32:07 to ingest log #47"). EventBridge handles this natively without polling. Also ideal for AWS-native event patterns like AMI state changes.
Bull	Redis (via ioredis)	Async job processing (company ops, room ops, leagues)	Jobs that need retries, priority, concurrency control, and a monitoring UI. Bull Board provides a web dashboard for inspecting failed jobs. Redis-backed means jobs survive API restarts.

12. Authentication & Session Management

Why Passport.js? Strategy-based architecture means adding a new auth provider (e.g., WorkOS for enterprise SSO) is just adding a new strategy file -- no changes to the core auth flow. The ecosystem has battle-tested strategies for every provider TryHackMe uses.

Why connect-mongo for sessions? Sessions must survive API restarts and be shared across multiple API instances. MongoDB is already in the stack, so connect-mongo avoids adding another dependency. Redis sessions would be faster but add operational complexity for marginal gain (sessions are read once per request, not per millisecond).

Why WorkOS? Enterprise B2B customers need SAML/OIDC SSO with their identity provider (Okta, Azure AD, etc.). WorkOS provides a unified API for all SSO protocols, directory sync, and admin portal -- building this in-house would take months.

Sessions stored in MongoDB via connect-mongo
Session cookie max age: 7 days
Passport serializes user ID into session
Socket.IO connections share the same Passport session middleware

13. Observability Stack

Why Pino over Winston? Pino is one of the fastest Node.js loggers: it writes newline-delimited JSON with very little per-call overhead and can move transports (formatting and shipping logs) to worker threads, keeping that work off the event loop. On a platform that handles thousands of concurrent VM deployments, logging overhead matters, and a heavier logger such as Winston would add latency to every request.

Why Sentry? Stack traces, breadcrumbs, release tracking, and performance profiling in one tool. The @sentry/profiling-node integration provides continuous profiling in production to catch slow functions. Sentry's Express integration automatically captures unhandled errors with full request context.

Why Segment? Single API for routing analytics events to multiple destinations (Mixpanel, Amplitude, data warehouse, etc.). Changing analytics providers means updating Segment config, not rewriting event calls across the codebase.

Why Customer.io over SES directly? Templated transactional emails with visual builders, drip campaigns, and user segmentation. SES is used directly (via Nodemailer) only for custom emails such as calendar invitations, where Customer.io templates don't apply.

Log levels: trace, debug, info, warn, error, fatal (configurable via LOG_LEVEL)

Redaction: Cookies and sensitive headers are redacted from HTTP logs.

14. Request Lifecycle

Rate Limiting

20+ endpoint-specific rate limiters backed by Redis:

Redis Store (rate-limit-redis)
  └── Fallback: In-memory store

Why Redis-backed rate limiting? Rate limits must be shared across all API instances -- if instance A sees 15 requests from a user, instance B must know about them. Redis provides the shared counter. The rate-limit-redis store handles atomic increments. In-memory fallback ensures the API doesn't crash if Redis is temporarily unavailable (it just rate-limits per-instance instead of globally).

Why express-rate-limit over a WAF? Fine-grained, per-endpoint control. The "start VM" endpoint has a different limit than "get user profile." WAF rules can't express "5 VM deploys per hour per user but 100 profile reads per minute." 20+ distinct limiter configs in the codebase reflect this granularity.

Rate limiters are warmed up on server start. Socket.IO has separate per-minute and per-day rate limits via rate-limiter-flexible.
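
The shared-store-with-fallback behaviour described above can be modelled in a few lines. This is a simplified, synchronous sketch and not the `express-rate-limit` or `rate-limit-redis` API: a real Redis store is asynchronous, and `CounterStore`, `InMemoryStore`, `FallbackStore`, and `createLimiter` are hypothetical names used only to illustrate the idea.

```typescript
// Simplified model of a counter store with a fixed window.
interface CounterStore {
  increment(key: string, windowMs: number, now: number): number;
}

class InMemoryStore implements CounterStore {
  private readonly hits = new Map<string, { count: number; resetAt: number }>();

  increment(key: string, windowMs: number, now: number): number {
    const entry = this.hits.get(key);
    if (entry === undefined || now >= entry.resetAt) {
      // Start a new window for this key.
      this.hits.set(key, { count: 1, resetAt: now + windowMs });
      return 1;
    }
    entry.count += 1;
    return entry.count;
  }
}

class FallbackStore implements CounterStore {
  private readonly primary: CounterStore;
  private readonly fallback: CounterStore;

  constructor(primary: CounterStore, fallback: CounterStore) {
    this.primary = primary;
    this.fallback = fallback;
  }

  increment(key: string, windowMs: number, now: number): number {
    try {
      return this.primary.increment(key, windowMs, now);
    } catch {
      // Shared store unavailable: count per instance instead of globally.
      return this.fallback.increment(key, windowMs, now);
    }
  }
}

// One limiter per endpoint, each with its own limit and window.
function createLimiter(store: CounterStore, name: string, limit: number, windowMs: number) {
  return (userId: string, now: number = Date.now()): boolean =>
    store.increment(`${name}:${userId}`, windowMs, now) <= limit;
}
```

Using the figures from the example above, `createLimiter(store, "vm-deploy", 5, 60 * 60 * 1000)` and `createLimiter(store, "profile-read", 100, 60 * 1000)` would share one store while enforcing different budgets per endpoint.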

15. Configuration & Secrets

Why Cloudflare in front of everything? DDoS protection, global CDN, DNS management, and the Workers edge compute platform -- all in one. The Guac-A-Worker runs at Cloudflare's edge (300+ PoPs worldwide), routing VM desktop requests to the correct Guacamole instance with minimal added latency. Cloudflare also provides the wildcard SSL certificates that make *.guacaworker.{domain} subdomain routing possible.

Why GrowthBook for feature flags? Open-source, self-hostable, and provides server-side SDK for Node.js. Feature flags gate every major feature (SOC-Sim VM deploy, curriculum mapping, threat hunting SIEM tool modal). GrowthBook's targeting rules allow gradual rollouts by user segment, company, or percentage -- critical for safely shipping infrastructure changes like new VM connection types.

Why Sanity CMS? Scenario content (SOC-Sim alerts, Threat Hunting timelines, evaluation criteria) changes frequently and is authored by non-engineers. Sanity provides a structured content authoring UI, real-time collaboration, and a GROQ query API. Content is fetched at runtime, not baked into deploys, so scenario updates ship instantly without redeployment.

Config Architecture

AWS Credential Profiles

Profile	Env Vars	Purpose
`default`	`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`	Primary account operations
`admin`	`AWS_ADMIN_ACCESS_KEY_ID`, `AWS_ADMIN_SECRET_ACCESS_KEY`	Admin EC2 operations
`vm-upload`	`AWS_VM_UPLOAD_ACCESS_KEY_ID`, `AWS_VM_UPLOAD_SECRET_ACCESS_KEY`	VM image uploads
`cloud-training`	`AWS_CLOUD_TRAINING_ACCESS_KEY_ID`, `AWS_CLOUD_TRAINING_SECRET_KEY`	Cloud training account provisioning
`network-infra`	`AWS_NETWORK_INFRA_ACCESS_KEY_ID`, `AWS_NETWORK_INFRA_SECRET_ACCESS_KEY`	Network IaC operations
Region-specific	`AWS_REGION_*_ACCESS_KEY_ID`, etc.	Per-region operations (us-east-1, ap-south-1, eu-central-1)

S3 Bucket Map

Bucket	Env Var	Content
Images	`S3_IMAGES_BUCKET_NAME`	Room icons, avatars, team logos
VPN Configs	`S3_VPN_CONFIG_BUCKET_NAME`	Generated OpenVPN configs
VPN Templates	`S3_VPN_TEMPLATES_BUCKET_NAME`	VPN configuration templates
Badges	`S3_BADGES_BUCKET_NAME`	Achievement badge images
Certificates	`S3_CERTIFICATES_BUCKET_NAME`	Completion certificates
Skills Matrix	`S3_SKILLS_MATRIX_BUCKET_NAME`	Skills assessment exports
VM Uploads	Region-specific	VM disk images for import
RAG	`TTX_S3_BUCKET_NAME`	RAG documents for AI features
BADR Deployment	`tryhackme-badr-deployment-{stage}-{region}`	Network deployment artifacts
Turbo Cache	SST-managed	Turborepo remote build cache

16. Key File Reference

Infrastructure as Code

File	What It Defines
`sst.config.ts`	Root SST v3 config, app registry, stage/region settings
`infrastructure/shared.ts`	Shared secrets, SQS queue, VPC references
`infrastructure/core.ts`	VPCs, subnets, security groups, Transit Gateways, S3 endpoints
`infrastructure/guacamole.ts`	ECS clusters, Fargate services, ALBs, auto-scaling for Guacamole
`infrastructure/nginx.ts`	Nginx reverse proxy ECS services, Cloudflare DNS, TLS
`infrastructure/utils.ts`	SST helper functions
`apps/networks-iac/infrastructure.ts`	Lambda + SQS for dynamic Pulumi provisioning
`apps/turbo-s3/infrastructure.ts`	S3 bucket + API Gateway for Turborepo cache
`apps/cron/infrastructure.ts`	Cron Lambda + EventBridge subscriptions
`apps/mock-api/infrastructure.ts`	Mock API Gateway + Lambda
`apps/guac-a-worker/infrastructure.ts`	Cloudflare Worker, KV, DNS, certificates

AWS Service Wrappers

File	Service
`src/services/aws/ec2/index.ts`	EC2 instances, AMIs, volumes, tags
`src/services/aws/ec2/amis.ts`	AMI state change processing
`src/services/s3/index.ts`	S3 uploads, downloads, presigned URLs
`src/services/aws/scheduler/index.ts`	EventBridge one-time schedules
`src/services/aws/sqs/index.ts`	SQS message publishing
`src/services/aws/ssm/index.ts`	SSM Parameter Store lookups
`src/services/aws/secrets-manager/index.ts`	Secrets Manager retrieval
`src/services/aws/auto-scaling/index.ts`	Auto Scaling Group queries
`src/services/aws/step-functions/index.ts`	Step Functions execution

VM & Connection Services

File	Purpose
`src/services/vms/index.ts`	VM deployment orchestrator
`src/services/vms/deployment.ts`	EC2 instance creation, tagging, user data
`src/services/vms/termination.ts`	VM termination and cleanup
`src/services/vms/connections/index.ts`	Connection type router
`src/services/vms/connections/guacamole.ts`	Guacamole connection creation
`src/services/vms/connections/novnc.ts`	NoVNC connection handling
`src/services/vms/connections/dcv.ts`	Amazon DCV connections
`src/services/vms/connections/web-proxy.ts`	Web proxy connections
`src/services/vms/data-assignment.ts`	VM data for SOC-Sim/Threat Hunting
`src/services/vms/utils.ts`	Cell selection, regional configs
`src/services/guacamole/client.ts`	Guacamole API client
`src/services/azure/labs.ts`	Azure Lab management
`src/services/azure/azure-repo.ts`	Azure lab repository

System Design Components

File	Component
`src/infra/database/index.ts`	MongoDB connection setup
`src/services/redis/index.ts`	Main Redis client
`src/services/redis/tutor.ts`	Tutor Redis client
`src/services/queues/connection.ts`	Bull queue Redis connection
`src/services/agenda/index.ts`	Agenda.js scheduler
`src/services/agenda/definitions.ts`	Job definitions (health checks)
`src/infra/sockets/index.ts`	Socket.IO server setup
`src/infra/sockets/middleware.ts`	Socket auth middleware
`src/middlewares/auth/passport/index.ts`	Passport strategies
`src/services/kafka/index.ts`	Kafka producer/consumer
`src/common/logging/index.ts`	Pino logger configuration
`src/middlewares/limiters/index.ts`	Rate limiting middleware
`src/services/health-check/index.ts`	Health check endpoints
`src/server.ts`	Express server + middleware chain
`apps/api/config.js`	Central configuration

All src/ paths above are relative to apps/api/app/api/v2/.