THM System Design

2022-01-25

TryHackMe Infrastructure & System Design

A comprehensive technical guide to the AWS infrastructure, system design patterns, and cloud architecture that powers the TryHackMe platform.


Table of Contents

  1. Infrastructure Overview
  2. AWS Multi-Region Architecture
  3. Infrastructure as Code (SST v3)
  4. VPC & Networking
  5. VM Deployment Pipeline
  6. Guacamole Remote Desktop System
  7. Networks v2 (Pulumi Dynamic IaC)
  8. AWS Services Map
  9. Data Layer
  10. Real-Time Communication
  11. Job Scheduling & Background Processing
  12. Authentication & Session Management
  13. Observability Stack
  14. Request Lifecycle
  15. Configuration & Secrets
  16. Key File Reference

1. Infrastructure Overview


2. AWS Multi-Region Architecture

Why multi-region? VM interactions are latency-sensitive -- typing in a remote desktop session with 200ms latency is unusable. By deploying Guacamole clusters and VM instances in 5 regions, users connect to the nearest one. The API itself stays in eu-west-1 (single-region) because it only handles REST/WebSocket traffic where 100-200ms latency is acceptable.

TryHackMe operates across 5 AWS regions to minimize latency for global users deploying VMs.

Region Responsibilities

Region | Role | Key Resources
eu-west-1 | Primary. API, database, all services | API server, MongoDB, Redis, Kafka, all AWS services
us-east-1 | Modern region with dynamic VPC | Dynamically provisioned VPC (10.201.0.0/16), S3 gateway endpoint
us-west-1 | Legacy VM region | Guacamole ASG, VM instances
ap-south-1 | APAC VM region | Guacamole cluster, Nginx proxy, VMs
eu-central-1 | EU secondary VM region | Guacamole cluster, Nginx proxy, VMs
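
The source does not say how a user's nearest region is chosen. As a minimal sketch, assuming the client (or an edge service) has measured a round-trip time to each VM region, the selection could look like this. The `RegionLatency` shape and `pickNearestRegion` name are hypothetical.

```typescript
// Region names come from the table above; everything else is illustrative.
type VmRegion = "eu-west-1" | "us-east-1" | "us-west-1" | "ap-south-1" | "eu-central-1";

interface RegionLatency {
  region: VmRegion;
  rttMs: number; // measured round-trip time in milliseconds (assumed input)
}

// Return the region with the lowest measured round-trip time.
function pickNearestRegion(samples: RegionLatency[]): VmRegion {
  if (samples.length === 0) {
    throw new Error("no latency measurements available");
  }
  return samples.reduce((best, s) => (s.rttMs < best.rttMs ? s : best)).region;
}
```

A user in India would typically measure the lowest latency to ap-south-1 and be routed there, while API calls still go to eu-west-1.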

Cross-Region Connectivity

Transit Gateway peering connects all VM regions for:


3. Infrastructure as Code (SST v3)

Why SST v3? SST v3 builds on Pulumi and Terraform providers behind a TypeScript-first API, so infrastructure is defined in the same language as the application. Unlike SST v2 (which was built on CDK/CloudFormation), v3 deploys faster, supports multiple providers (AWS + Cloudflare in one config), and avoids CloudFormation's limit of 500 resources per stack. The Link system lets Lambda functions access secrets and resource ARNs without manual env var wiring.

All infrastructure is managed via SST v3 (not v2/CDK). The root config is at sst.config.ts.

SST Apps

Shared Infrastructure (infrastructure/shared.ts)

Defines resources shared across all SST apps:

Resource | Type | Purpose
MongoConnectionString | SST Secret | MongoDB connection string
CloudflareAccessClientId/Secret | SST Secret | Cloudflare Zero Trust auth (non-prod)
NetworksIacInvocationQueue | SQS FIFO Queue | Network provisioning message bus
MainWebAppBaseUrl | SST Linkable | Base URL for the main web app
Application VPC IDs | SST Linkable | VPC, subnet, and SG references per region

Deployment

# Deploy all apps to staging
npx sst deploy --stage staging

# Deploy a single app to production
APPS=core npx sst deploy --stage production

# Deploy a subset of apps (comma-separated APPS list)
APPS=guacamole,nginx npx sst deploy --stage staging
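
How sst.config.ts interprets APPS is not shown in this guide. The sketch below is one plausible way to parse it, assuming an unset value means "deploy everything" and unknown names should fail fast. The app list is inferred from the infrastructure files in the Key File Reference and may not match the real registry; `selectApps` is a hypothetical helper.

```typescript
// Assumed app registry, inferred from the infrastructure file names in this guide.
const ALL_APPS = [
  "core",
  "guacamole",
  "nginx",
  "networks-iac",
  "turbo-s3",
  "cron",
  "mock-api",
  "guac-a-worker",
] as const;

type AppName = (typeof ALL_APPS)[number];

// Parse the APPS env var: undefined or empty selects every app,
// otherwise a comma-separated list selects a subset.
function selectApps(appsEnv: string | undefined): AppName[] {
  if (appsEnv === undefined || appsEnv.trim() === "") {
    return [...ALL_APPS];
  }
  const requested = appsEnv
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
  const known: readonly string[] = ALL_APPS;
  const unknown = requested.filter((name) => !known.includes(name));
  if (unknown.length > 0) {
    throw new Error(`Unknown app(s) in APPS: ${unknown.join(", ")}`);
  }
  return requested as AppName[];
}
```

Failing on unknown names is a design choice in this sketch: a typo in APPS would otherwise silently deploy nothing.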

4. VPC & Networking

VPC Layout (Production, eu-west-1)

Security Groups

Security Group | Ports | Direction | Source | Purpose
VulnerableVmsSecurityGroup | All except 80 | Ingress | Private IPs only | Isolate vulnerable machines
AllowAllSecurityGroup | All | All | 0.0.0.0/0 | Permissive SG for specific services
GuacamoleSecurityGroup | 8080 | Ingress | Cloudflare IPs | Guacamole web app access
GuacdLoadBalancerSecurityGroup | 4822 | Ingress | Guacamole SG | Guacd protocol traffic
GuacdSecurityGroup | 4822 | Ingress | LB SG | Guacd daemon access
NginxSecurityGroup | 80, 8443 | Ingress | 0.0.0.0/0 | Nginx reverse proxy

us-east-1 (Modern Region, Dynamic VPC)

Unlike legacy regions, us-east-1 creates its VPC dynamically in SST:

VPC: 10.201.0.0/16
├── Vulnerable VMs Subnet: 10.201.0.0/17
├── Remote Access Subnet:  10.201.128.0/18
├── LB Subnet:             10.201.192.0/24
├── Internet Gateway (dynamic)
├── Transit Gateway Attachment (tgw-01c8741ec73fb2f5c)
└── S3 Gateway Endpoint (private S3 access)
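
A layout like this is easy to get wrong when subnets are added later. The following is a small, dependency-free sketch (not code from the repository) that checks each subnet falls inside the VPC block and that no two subnets overlap. The helper names are hypothetical.

```typescript
// Convert a dotted-quad IPv4 address to an unsigned integer.
function ipToInt(ip: string): number {
  const octets = ip.split(".").map(Number);
  if (octets.length !== 4 || octets.some((o) => !Number.isInteger(o) || o < 0 || o > 255)) {
    throw new Error(`invalid IPv4 address: ${ip}`);
  }
  return octets.reduce((acc, o) => acc * 256 + o, 0);
}

// Return the first and last address covered by a CIDR block.
function cidrRange(cidr: string): { start: number; end: number } {
  const slash = cidr.indexOf("/");
  if (slash < 0) throw new Error(`missing prefix length: ${cidr}`);
  const prefix = Number(cidr.slice(slash + 1));
  if (!Number.isInteger(prefix) || prefix < 0 || prefix > 32) {
    throw new Error(`invalid prefix length: ${cidr}`);
  }
  const size = 2 ** (32 - prefix);
  const start = Math.floor(ipToInt(cidr.slice(0, slash)) / size) * size;
  return { start, end: start + size - 1 };
}

// True when every address of `inner` lies within `outer`.
function contains(outer: string, inner: string): boolean {
  const o = cidrRange(outer);
  const i = cidrRange(inner);
  return i.start >= o.start && i.end <= o.end;
}

// True when the two blocks share at least one address.
function overlaps(a: string, b: string): boolean {
  const x = cidrRange(a);
  const y = cidrRange(b);
  return x.start <= y.end && y.start <= x.end;
}
```

Applied to the layout above, the three subnets cover 10.201.0.0 through 10.201.192.255 without overlapping, which leaves the range from 10.201.193.0 upward unallocated inside the /16.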

5. VM Deployment Pipeline

This is the core of TryHackMe's infrastructure -- deploying EC2 instances for users to interact with.

Deployment Flow

Key Concepts

Upload (VM Template): An Upload defines a VM image -- its AMI, instance type, OS, remote connection type, credentials, and boot time. Uploads are stored in MongoDB and reference AMIs in each region.

Instance: A running VM instance. Tracks AWS instance ID, connection details, expiry time, and state. Stored in MongoDB.

Instance Types & Selection:

Retry Strategy:

Instance Lifecycle:

Cost Center Tagging:

Every EC2 instance is tagged for billing attribution:

CostCenter | Usage
ROOM | Standard room VMs
EXAM | Certification exam VMs
SOC_SIM | SOC Simulator VMs
THREAT_HUNTER | Threat Hunting VMs
NETWORK | Network lab VMs
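
To illustrate how such tagging might be expressed in code, here is a minimal sketch that builds EC2-style tag objects. The `Key`/`Value` shape matches what the EC2 API expects for tags, but the tag key name `CostCenter` is an assumption taken from the table header, and `buildCostCenterTags` is a hypothetical helper.

```typescript
// Cost centers from the table above.
type CostCenter = "ROOM" | "EXAM" | "SOC_SIM" | "THREAT_HUNTER" | "NETWORK";

interface Ec2Tag {
  Key: string;
  Value: string;
}

// Build the tag list for a new instance. The "CostCenter" key name is assumed.
function buildCostCenterTags(
  costCenter: CostCenter,
  extra: Record<string, string> = {},
): Ec2Tag[] {
  const extraTags = Object.entries(extra).map(([Key, Value]) => ({ Key, Value }));
  return [{ Key: "CostCenter", Value: costCenter }, ...extraTags];
}
```

Restricting the value to a union type means an unknown cost center is rejected at compile time rather than showing up as an unattributed line in the bill.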

6. Guacamole Remote Desktop System

Why Guacamole? Users need browser-based access to Windows (RDP) and Linux (VNC/SSH) VMs without installing any client software. Guacamole is a mature open-source gateway that covers all three protocols and renders them in the browser via HTML5 Canvas. The guacd daemon handles protocol translation, while the web app serves the UI and manages connections. Running it on ECS Fargate per region keeps latency low for users connecting to their nearest VM.

Apache Guacamole provides browser-based access to VMs via RDP, VNC, and SSH.

Connection Types

Type | Protocol | Port | Use Case
GUACAMOLE | RDP/VNC/SSH via Guacamole | 8080 | Primary VM access method
NO_VNC | VNC via Nginx reverse proxy | 80/8443 | Lightweight VNC access
DCV | Amazon DCV | Custom | High-performance desktop
WEB_PROXY | HTTP reverse proxy | 80 | Web-based tools

Guac-A-Worker (Cloudflare)

The Guac-A-Worker is a Cloudflare Worker that routes VM requests:

  1. User requests *.guacaworker.{domain}
  2. Worker extracts task ID from subdomain
  3. Looks up public DNS name in Cloudflare KV
  4. Routes request to correct Guacamole instance
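
The four steps above can be sketched as follows. This is not the Worker's actual source: it assumes the task ID is the single leftmost label of the hostname and that the KV value is the Guacamole instance's public DNS name. `KeyValueStore` is a hypothetical interface covering only the async `get` call that Cloudflare KV namespaces provide.

```typescript
// Minimal key-value interface; a Cloudflare KV namespace binding is compatible with it.
interface KeyValueStore {
  get(key: string): Promise<string | null>;
}

// Step 2: extract the task ID from "<taskId>.guacaworker.<domain>".
// Returns null when the hostname does not have that shape.
function extractTaskId(hostname: string): string | null {
  const labels = hostname.toLowerCase().split(".");
  const first = labels[0] ?? "";
  const hasDomain = labels.length >= 3;
  if (labels[1] !== "guacaworker" || first.length === 0 || !hasDomain) {
    return null;
  }
  return first;
}

// Steps 3 and 4: look up the public DNS name for the task and return it,
// or null when the hostname is malformed or the task is unknown.
async function resolveUpstream(hostname: string, kv: KeyValueStore): Promise<string | null> {
  const taskId = extractTaskId(hostname);
  if (taskId === null) {
    return null;
  }
  return kv.get(taskId);
}
```

In a real Worker, a null result would most likely become a 404 response, and a non-null result would be used as the origin for the proxied request.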

Resources created by SST:

Nginx Reverse Proxy

For VNC connections, an Nginx reverse proxy runs per-region:

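# IP-based subdomain routing
# 10-x-x-x.reverse-proxy-{region}.{domain} -> 10.x.x.x
server {
    # Sketch (assumed, not the production config): capture the last three
    # octets from the subdomain and rebuild the private 10.x.x.x address.
    server_name ~^10-(?<b>\d+)-(?<c>\d+)-(?<d>\d+)\.reverse-proxy-;
    set $target_ip "10.$b.$c.$d";
    # $target_port is assumed to be set elsewhere in the real config.

    location / {
        proxy_pass http://$target_ip:$target_port;
    }
    location /websockify/ {
        proxy_pass http://$target_ip:$target_port;
        # WebSocket upgrade for VNC
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}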
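
The hostname-to-IP mapping described in the comment can also be expressed outside Nginx, for example when the API builds or validates a proxy URL. The sketch below is an illustration under that assumption. It only accepts addresses in 10.0.0.0/8, mirroring the `10-x-x-x` pattern, and `subdomainToPrivateIp` is a hypothetical name.

```typescript
// Map "10-x-x-x.reverse-proxy-{region}.{domain}" to "10.x.x.x".
// Returns null for hostnames that do not match or contain an octet above 255.
function subdomainToPrivateIp(hostname: string): string | null {
  const match = /^(10)-(\d{1,3})-(\d{1,3})-(\d{1,3})\.reverse-proxy-/.exec(hostname);
  if (match === null) {
    return null;
  }
  const octets = match.slice(1, 5).map(Number);
  if (octets.some((octet) => octet > 255)) {
    return null;
  }
  return octets.join(".");
}
```

Rejecting anything outside the 10.x.x.x pattern matters because the proxy forwards to whatever address the hostname encodes, so a permissive parser would let a caller point it at arbitrary hosts.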

Scaling:


7. Networks v2 (Pulumi Dynamic IaC)

Why Pulumi inside Lambda (and not SST/Terraform)? Each user's network lab needs its own VPC, subnets, VPN server, and multiple VMs -- created on-demand and destroyed hours later. SST/Terraform are for long-lived infrastructure. Pulumi's Automation API lets you run pulumi up and pulumi destroy programmatically from code, making it ideal for ephemeral, per-user infrastructure. Lambda provides the compute, SQS provides the queue, and EFS persists the Pulumi binary and state between invocations.

Networks v2 creates isolated network environments for users using Pulumi Automation API inside a Lambda function.

Lambda Configuration

Setting | Value
Runtime | Node.js 22.x
Memory | 1 GB
Timeout | 15 minutes
Storage | EFS mount at /mnt/efs
VPC | Connected to application VPC
Permissions | Full AWS access (*)
Trigger | SQS FIFO queue

What Pulumi Creates Per Network

Network Expiry

A cron Lambda runs every minute to check for expired networks:

  1. Queries MongoDB for expired network instances
  2. Sends SQS message to destroy the Pulumi stack
  3. Lambda destroys all AWS resources for that network
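
As a rough sketch of steps 1 and 2, the cron handler could filter the expired networks and build one FIFO message per network. `MessageBody`, `MessageGroupId`, and `MessageDeduplicationId` are real SQS SendMessage parameters. The `NetworkInstance` shape and the message payload are assumptions, not the platform's actual schema.

```typescript
// Assumed shape of a network instance document.
interface NetworkInstance {
  id: string;
  expiresAt: Date;
}

// Subset of the SQS SendMessage parameters relevant to a FIFO queue.
interface DestroyMessage {
  MessageBody: string;
  MessageGroupId: string;
  MessageDeduplicationId: string;
}

// Build one destroy message for every network whose expiry time has passed.
function buildDestroyMessages(networks: NetworkInstance[], now: Date): DestroyMessage[] {
  return networks
    .filter((network) => network.expiresAt.getTime() <= now.getTime())
    .map((network) => ({
      // Hypothetical payload; the real message format is not documented here.
      MessageBody: JSON.stringify({ action: "destroy", networkId: network.id }),
      // Grouping by network keeps operations on the same network in order.
      MessageGroupId: network.id,
      // A stable ID lets SQS drop repeats sent within its deduplication window.
      MessageDeduplicationId: `destroy-${network.id}`,
    }));
}
```

Because the cron runs every minute and SQS FIFO deduplication covers a five-minute interval, a stable deduplication ID keeps the same expired network from being queued for destruction on consecutive runs.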

8. AWS Services Map

Service Usage Summary -- What & Why

Service | What It Does | Why This Choice
EC2 | Deploys VMs (spot + on-demand), manages AMIs, key pairs, volumes | The entire product is built around giving users real machines to hack. EC2 is the only AWS service that provides full OS-level virtual machines with arbitrary network configs and pre-baked AMIs. Spot instances cut costs ~60-70% for ephemeral, short-lived VMs that users will destroy in 1-3 hours anyway.
S3 | Stores images, VPN configs, badges, certificates, VM uploads, RAG files, skills matrices | Cheap, durable, globally-accessible object storage. VM disk images need to live in S3 before being imported as AMIs. Presigned URLs let the frontend upload/download directly without proxying through the API.
ECS Fargate | Runs Guacamole (web + guacd) and Nginx reverse proxy per region | Guacamole and Nginx are long-running services that need to scale independently per region. Fargate removes the need to manage EC2 hosts for these containers -- AWS handles the underlying compute.
Lambda | Networks IaC handler, Turbo S3 cache handler, cron jobs, log ingestion, AMI state change handler | Perfect for event-driven, short-lived work: a SOC-Sim log needs to be ingested at a specific time, a network needs provisioning, an AMI import completed. Pay-per-invocation means zero cost when idle.
EventBridge Scheduler | Schedules one-time events for SOC-Sim log ingestion into SIEM VMs | SOC-Sim logs must appear in the SIEM at precise timestamps to simulate a real attack timeline. EventBridge one-time schedules let you say "fire this Lambda at exactly 14:32:07" per log entry -- no polling, no cron.
SQS | FIFO queue for network infrastructure provisioning messages | Network creation via Pulumi takes up to 15 minutes. SQS decouples the API request from the Lambda execution, guarantees message ordering (FIFO), and provides at-least-once delivery with built-in retry.
API Gateway | HTTP APIs for Turbo S3 cache and Mock API | Lightweight serverless HTTP endpoints that don't justify running a full server. API Gateway + Lambda is the cheapest way to expose an HTTP endpoint that's called infrequently.
Transit Gateway | Cross-region VPC peering for VM connectivity | Users in India deploy VMs in ap-south-1, but the API runs in eu-west-1. TGW peering lets the API server's VPC talk to VM VPCs across regions without setting up individual VPC peering per region-pair.
Step Functions | Cloud training account lifecycle (assign, move OU, recycle) | AWS account provisioning is a multi-step workflow (create account, move to OU, set permissions, assign to user). Step Functions model this as a state machine with built-in retries and rollback -- much safer than chaining Lambda calls manually.
SSM Parameter Store | Cell-based infrastructure configs, subnet/SG references | Infrastructure metadata (which subnets are available, which security groups to use) changes per cell and region. SSM provides a centralized, versioned key-value store that both IaC and runtime code can read from. Cheaper than Secrets Manager for non-sensitive config.
Secrets Manager | Guacamole secret keys, sensitive credentials per region | Guacamole JSON secret keys and other credentials must be encrypted at rest and rotatable. Secrets Manager provides automatic encryption, access audit trails, and cross-region replication.
Auto Scaling | Manages Guacamole load balancer instance pools | VM traffic is bursty -- a cohort of students might start a room simultaneously. Auto Scaling adjusts Guacamole capacity between 2-20 instances based on CPU, so we don't over-provision during off-peak hours.
CloudWatch | Logs for all Lambda functions, ECS Container Insights, CPU alarms | Lambda logs go to CloudWatch by default (no choice). Container Insights gives ECS-level metrics (CPU, memory, network) without instrumenting the app. Alarms trigger when Guacamole CPU hits 60% to flag scaling issues early.
EFS | Persistent storage for Pulumi binaries and state in Lambda | Lambda's ephemeral /tmp storage defaults to 512 MB and is not guaranteed to persist between invocations. Pulumi's binary + node_modules + state files exceed that default. EFS provides a shared filesystem mounted at /mnt/efs that persists across Lambda invocations.
IAM | Multi-account credential profiles (default, admin, network-infra, vm-upload, cloud-training) | Different operations require different AWS accounts with different permission boundaries. VM uploads go to a dedicated account, cloud training uses its own account, network infra has its own. IAM profiles isolate blast radius -- a compromised key only affects one concern.

9. Data Layer

MongoDB

Why MongoDB? The data is deeply nested and schema-diverse -- a VM instance document looks nothing like a user document or a SOC-Sim run. MongoDB's flexible schema handles this naturally. Mongoose adds just enough structure (schemas, validation, indexes) without the rigidity of a relational DB. The platform also needs fast reads on varied query patterns (by user, by room, by instance ID, by status), which compound indexes on a document store handle well.

Connection settings:

Key collections:

Redis

Why Redis? Sub-millisecond reads for rate limiting (every single API request checks Redis), Socket.IO pub/sub across multiple API instances, and Bull queue backing. Two instances isolate the latency-critical main workload from the Tutor feature's higher-volume writes.

Two separate Redis instances serve different purposes:

Kafka

Used for event streaming between the API and analytics consumers:

Why Kafka? High-throughput, ordered event streaming that decouples producers from consumers. Analytics events, tutor interactions, and MongoDB change events generate a firehose of data -- Kafka handles this without back-pressure on the API. Consumer groups allow multiple independent consumers (analytics pipeline, tutor service) to process the same events at their own pace.

Topics include MongoDB change events and analytics events. A Vercel-to-ECS HTTP bridge exists for the Next.js frontend to publish events.


10. Real-Time Communication

Socket.IO Architecture

Why Socket.IO over raw WebSockets? Socket.IO provides automatic reconnection, room-based broadcasting (critical for SOC-Sim multiplayer), binary streaming, and -- most importantly -- the Redis adapter for multi-instance scaling. Raw WebSockets would require building all of this from scratch.

The Redis adapter (@socket.io/redis-adapter) ensures events published on one API instance reach clients connected to other instances. A Redis emitter (@socket.io/redis-emitter) bridges Vercel serverless functions to ECS-hosted Socket.IO.

Socket Handlers:

Handler | Events
SOC Sim | soc-sim-join-run, alert assign/unassign, run status changes
TTX | Tabletop exercise events
Tutor | AI tutor chat events
Rooms | Room join/leave, task progress
Networks | Network instance status
Missions | Mission progress updates
KOTH | King of the Hill game events
Chat | General chat

11. Job Scheduling & Background Processing

Four Scheduling Systems

System | Backing Store | Use Case | Why This One
Agenda.js | MongoDB (scheduler_jobs) | Short-interval polling (health checks, score reduction) | Already have MongoDB. Agenda stores jobs as documents -- no extra infrastructure. Supports repeatEvery('3 seconds') for VM health checks, which is too frequent for EventBridge (1-min minimum) and too short-lived for Bull.
node-cron | In-process | Periodic maintenance (runs when THM_NODE_CRON=cron) | Simplest option for periodic tasks that don't need persistence or distribution. Runs on a dedicated cron process via THM_NODE_CRON env flag so it doesn't execute on every API instance.
EventBridge | AWS managed | One-time scheduled events (log delivery), periodic cron (network expiry) | SOC-Sim needs thousands of one-shot schedules ("fire Lambda at 14:32:07 to ingest log #47"). EventBridge handles this natively without polling. Also ideal for AWS-native event patterns like AMI state changes.
Bull | Redis (via ioredis) | Async job processing (company ops, room ops, leagues) | Jobs that need retries, priority, concurrency control, and a monitoring UI. Bull Board provides a web dashboard for inspecting failed jobs. Redis-backed means jobs survive API restarts.
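
EventBridge Scheduler expresses a one-time schedule as `at(yyyy-mm-ddThh:mm:ss)`. The sketch below shows how per-log schedule expressions could be derived for a SOC-Sim run, assuming each log entry carries an offset in seconds from the run's start. That offset model and both function names are illustrative assumptions.

```typescript
// Format a Date as an EventBridge Scheduler one-time expression.
// toISOString() yields "YYYY-MM-DDTHH:mm:ss.sssZ"; the first 19 characters
// drop the milliseconds and the trailing "Z".
function toAtExpression(when: Date): string {
  return `at(${when.toISOString().slice(0, 19)})`;
}

// Hypothetical helper: turn per-log offsets (seconds after the run starts)
// into one schedule expression per log entry.
function planLogSchedules(runStart: Date, offsetsSeconds: number[]): string[] {
  return offsetsSeconds.map((offset) =>
    toAtExpression(new Date(runStart.getTime() + offset * 1000)),
  );
}
```

The expression carries no time zone of its own. Scheduler reads it as UTC unless a schedule time zone is set on the schedule, which is why this sketch formats from the UTC ISO string.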

12. Authentication & Session Management

Why Passport.js? Strategy-based architecture means adding a new auth provider (e.g., WorkOS for enterprise SSO) is just adding a new strategy file -- no changes to the core auth flow. The ecosystem has battle-tested strategies for every provider TryHackMe uses.

Why connect-mongo for sessions? Sessions must survive API restarts and be shared across multiple API instances. MongoDB is already in the stack, so connect-mongo avoids adding another dependency. Redis sessions would be faster but add operational complexity for marginal gain (sessions are read once per request, not per millisecond).

Why WorkOS? Enterprise B2B customers need SAML/OIDC SSO with their identity provider (Okta, Azure AD, etc.). WorkOS provides a unified API for all SSO protocols, directory sync, and admin portal -- building this in-house would take months.


13. Observability Stack

Why Pino over Winston? Pino is among the fastest Node.js loggers: it keeps JSON serialization cheap and can run transports (formatting and shipping logs) in worker threads, so logging does less work on the event loop. On a platform that handles thousands of concurrent VM deployments, logging overhead matters, and a slower logger would add avoidable latency to every request.

Why Sentry? Stack traces, breadcrumbs, release tracking, and performance profiling in one tool. The @sentry/profiling-node integration provides continuous profiling in production to catch slow functions. Sentry's Express integration automatically captures unhandled errors with full request context.

Why Segment? Single API for routing analytics events to multiple destinations (Mixpanel, Amplitude, data warehouse, etc.). Changing analytics providers means updating Segment config, not rewriting event calls across the codebase.

Why Customer.io over SES directly? Templated transactional emails with visual builders, drip campaigns, and user segmentation. SES is used under the hood (via Nodemailer) only for custom emails like calendar invitations where Customer.io templates don't apply.

Log levels: trace, debug, info, warn, error, fatal (configurable via LOG_LEVEL)

Redaction: Cookies and sensitive headers are redacted from HTTP logs.
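
Pino supports this through its `redact` option, which censors values at configured paths. The dependency-free sketch below only illustrates the effect on a headers object. The list of header names is an assumption, since the source says only "cookies and sensitive headers".

```typescript
// Assumed set of headers to censor; the platform's actual list is not documented here.
const REDACTED_HEADERS = new Set(["cookie", "set-cookie", "authorization"]);

// Return a copy of the headers with sensitive values replaced by a placeholder.
// Header names are matched case-insensitively; the input object is not mutated.
function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const redacted: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    redacted[name] = REDACTED_HEADERS.has(name.toLowerCase()) ? "[Redacted]" : value;
  }
  return redacted;
}
```

Redacting at the logger level rather than at each call site means a new log statement cannot accidentally leak a session cookie.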


14. Request Lifecycle

Rate Limiting

20+ endpoint-specific rate limiters backed by Redis:

Redis Store (rate-limit-redis)
  └── Fallback: In-memory store

Why Redis-backed rate limiting? Rate limits must be shared across all API instances -- if instance A sees 15 requests from a user, instance B must know about them. Redis provides the shared counter. The rate-limit-redis store handles atomic increments. In-memory fallback ensures the API doesn't crash if Redis is temporarily unavailable (it just rate-limits per-instance instead of globally).

Why express-rate-limit over a WAF? Fine-grained, per-endpoint control. The "start VM" endpoint has a different limit than "get user profile." WAF rules can't express "5 VM deploys per hour per user but 100 profile reads per minute." 20+ distinct limiter configs in the codebase reflect this granularity.

Rate limiters are warmed up on server start. Socket.IO has separate per-minute and per-day rate limits via rate-limiter-flexible.
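
The combination of per-endpoint limits, a shared counter, and an in-memory fallback can be sketched without any libraries. This is not the express-rate-limit or rate-limit-redis API: `CounterStore`, `MemoryStore`, and `EndpointLimiter` are hypothetical names, the store is synchronous for brevity (a real Redis store is asynchronous), and the window is a simple fixed window.

```typescript
// A counter store returns the hit count for a key in the current window,
// after incrementing it. A shared store would be backed by Redis.
interface CounterStore {
  increment(key: string, windowMs: number, nowMs: number): number;
}

// Fixed-window counter kept in process memory.
class MemoryStore implements CounterStore {
  private windows = new Map<string, { count: number; resetAt: number }>();

  increment(key: string, windowMs: number, nowMs: number): number {
    const current = this.windows.get(key);
    if (current === undefined || nowMs >= current.resetAt) {
      this.windows.set(key, { count: 1, resetAt: nowMs + windowMs });
      return 1;
    }
    current.count += 1;
    return current.count;
  }
}

interface LimitRule {
  windowMs: number;
  max: number;
}

class EndpointLimiter {
  private rules: Map<string, LimitRule>;
  private shared: CounterStore;
  private fallback = new MemoryStore();

  constructor(rules: Record<string, LimitRule>, shared: CounterStore) {
    this.rules = new Map(Object.entries(rules));
    this.shared = shared;
  }

  // Returns true when the request is within the endpoint's limit.
  allow(endpoint: string, userId: string, nowMs: number): boolean {
    const rule = this.rules.get(endpoint);
    if (rule === undefined) {
      return true; // no limiter configured for this endpoint
    }
    const key = `${endpoint}:${userId}`;
    let hits: number;
    try {
      hits = this.shared.increment(key, rule.windowMs, nowMs);
    } catch {
      // Shared store unavailable: limit per instance instead of globally.
      hits = this.fallback.increment(key, rule.windowMs, nowMs);
    }
    return hits <= rule.max;
  }
}
```

With rules such as `{ "vm:deploy": { windowMs: 3_600_000, max: 5 }, "profile:read": { windowMs: 60_000, max: 100 } }`, the limiter expresses the "5 VM deploys per hour but 100 profile reads per minute" distinction, and a failing shared store degrades to per-instance counting instead of rejecting or crashing requests.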


15. Configuration & Secrets

Why Cloudflare in front of everything? DDoS protection, global CDN, DNS management, and the Workers edge compute platform -- all in one. The Guac-A-Worker runs at Cloudflare's edge (300+ PoPs worldwide), routing VM desktop requests to the nearest Guacamole cluster with near-zero latency. Cloudflare also provides the wildcard SSL certificates that make *.guacaworker.{domain} subdomain routing possible.

Why GrowthBook for feature flags? Open-source, self-hostable, and provides server-side SDK for Node.js. Feature flags gate every major feature (SOC-Sim VM deploy, curriculum mapping, threat hunting SIEM tool modal). GrowthBook's targeting rules allow gradual rollouts by user segment, company, or percentage -- critical for safely shipping infrastructure changes like new VM connection types.

Why Sanity CMS? Scenario content (SOC-Sim alerts, Threat Hunting timelines, evaluation criteria) changes frequently and is authored by non-engineers. Sanity provides a structured content authoring UI, real-time collaboration, and a GROQ query API. Content is fetched at runtime, not baked into deploys, so scenario updates ship instantly without redeployment.

Config Architecture

AWS Credential Profiles

Profile | Env Vars | Purpose
default | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY | Primary account operations
admin | AWS_ADMIN_ACCESS_KEY_ID, AWS_ADMIN_SECRET_ACCESS_KEY | Admin EC2 operations
vm-upload | AWS_VM_UPLOAD_ACCESS_KEY_ID, AWS_VM_UPLOAD_SECRET_ACCESS_KEY | VM image uploads
cloud-training | AWS_CLOUD_TRAINING_ACCESS_KEY_ID, AWS_CLOUD_TRAINING_SECRET_KEY | Cloud training account provisioning
network-infra | AWS_NETWORK_INFRA_ACCESS_KEY_ID, AWS_NETWORK_INFRA_SECRET_ACCESS_KEY | Network IaC operations
Region-specific | AWS_REGION_*_ACCESS_KEY_ID, etc. | Per-region operations (us-east-1, ap-south-1, eu-central-1)

S3 Bucket Map

Bucket | Env Var | Content
Images | S3_IMAGES_BUCKET_NAME | Room icons, avatars, team logos
VPN Configs | S3_VPN_CONFIG_BUCKET_NAME | Generated OpenVPN configs
VPN Templates | S3_VPN_TEMPLATES_BUCKET_NAME | VPN configuration templates
Badges | S3_BADGES_BUCKET_NAME | Achievement badge images
Certificates | S3_CERTIFICATES_BUCKET_NAME | Completion certificates
Skills Matrix | S3_SKILLS_MATRIX_BUCKET_NAME | Skills assessment exports
VM Uploads | Region-specific | VM disk images for import
RAG | TTX_S3_BUCKET_NAME | RAG documents for AI features
BADR Deployment | tryhackme-badr-deployment-{stage}-{region} | Network deployment artifacts
Turbo Cache | SST-managed | Turborepo remote build cache

16. Key File Reference

Infrastructure as Code

File | What It Defines
sst.config.ts | Root SST v3 config, app registry, stage/region settings
infrastructure/shared.ts | Shared secrets, SQS queue, VPC references
infrastructure/core.ts | VPCs, subnets, security groups, Transit Gateways, S3 endpoints
infrastructure/guacamole.ts | ECS clusters, Fargate services, ALBs, auto-scaling for Guacamole
infrastructure/nginx.ts | Nginx reverse proxy ECS services, Cloudflare DNS, TLS
infrastructure/utils.ts | SST helper functions
apps/networks-iac/infrastructure.ts | Lambda + SQS for dynamic Pulumi provisioning
apps/turbo-s3/infrastructure.ts | S3 bucket + API Gateway for Turborepo cache
apps/cron/infrastructure.ts | Cron Lambda + EventBridge subscriptions
apps/mock-api/infrastructure.ts | Mock API Gateway + Lambda
apps/guac-a-worker/infrastructure.ts | Cloudflare Worker, KV, DNS, certificates

AWS Service Wrappers

File | Service
src/services/aws/ec2/index.ts | EC2 instances, AMIs, volumes, tags
src/services/aws/ec2/amis.ts | AMI state change processing
src/services/s3/index.ts | S3 uploads, downloads, presigned URLs
src/services/aws/scheduler/index.ts | EventBridge one-time schedules
src/services/aws/sqs/index.ts | SQS message publishing
src/services/aws/ssm/index.ts | SSM Parameter Store lookups
src/services/aws/secrets-manager/index.ts | Secrets Manager retrieval
src/services/aws/auto-scaling/index.ts | Auto Scaling Group queries
src/services/aws/step-functions/index.ts | Step Functions execution

VM & Connection Services

File | Purpose
src/services/vms/index.ts | VM deployment orchestrator
src/services/vms/deployment.ts | EC2 instance creation, tagging, user data
src/services/vms/termination.ts | VM termination and cleanup
src/services/vms/connections/index.ts | Connection type router
src/services/vms/connections/guacamole.ts | Guacamole connection creation
src/services/vms/connections/novnc.ts | NoVNC connection handling
src/services/vms/connections/dcv.ts | Amazon DCV connections
src/services/vms/connections/web-proxy.ts | Web proxy connections
src/services/vms/data-assignment.ts | VM data for SOC-Sim/Threat Hunting
src/services/vms/utils.ts | Cell selection, regional configs
src/services/guacamole/client.ts | Guacamole API client
src/services/azure/labs.ts | Azure Lab management
src/services/azure/azure-repo.ts | Azure lab repository

System Design Components

File | Component
src/infra/database/index.ts | MongoDB connection setup
src/services/redis/index.ts | Main Redis client
src/services/redis/tutor.ts | Tutor Redis client
src/services/queues/connection.ts | Bull queue Redis connection
src/services/agenda/index.ts | Agenda.js scheduler
src/services/agenda/definitions.ts | Job definitions (health checks)
src/infra/sockets/index.ts | Socket.IO server setup
src/infra/sockets/middleware.ts | Socket auth middleware
src/middlewares/auth/passport/index.ts | Passport strategies
src/services/kafka/index.ts | Kafka producer/consumer
src/common/logging/index.ts | Pino logger configuration
src/middlewares/limiters/index.ts | Rate limiting middleware
src/services/health-check/index.ts | Health check endpoints
src/server.ts | Express server + middleware chain
apps/api/config.js | Central configuration

All src/ paths above are relative to apps/api/app/api/v2/.