THM System Design
2022-01-25
TryHackMe Infrastructure & System Design
A comprehensive technical guide to the AWS infrastructure, system design patterns, and cloud architecture that powers the TryHackMe platform.
Table of Contents
- Infrastructure Overview
- AWS Multi-Region Architecture
- Infrastructure as Code (SST v3)
- VPC & Networking
- VM Deployment Pipeline
- Guacamole Remote Desktop System
- Networks v2 (Pulumi Dynamic IaC)
- AWS Services Map
- Data Layer
- Real-Time Communication
- Job Scheduling & Background Processing
- Authentication & Session Management
- Observability Stack
- Request Lifecycle
- Configuration & Secrets
- Key File Reference
1. Infrastructure Overview
2. AWS Multi-Region Architecture
Why multi-region? VM interactions are latency-sensitive -- typing in a remote desktop session with 200ms latency is unusable. By deploying Guacamole clusters and VM instances in 5 regions, users connect to the nearest one. The API itself stays in eu-west-1 (single-region) because it only handles REST/WebSocket traffic where 100-200ms latency is acceptable.
TryHackMe operates across 5 AWS regions to minimize latency for global users deploying VMs.
Region Responsibilities
| Region | Role | Key Resources |
|---|---|---|
| eu-west-1 | Primary. API, database, all services | API server, MongoDB, Redis, Kafka, all AWS services |
| us-east-1 | Modern region with dynamic VPC | Dynamically provisioned VPC (10.201.0.0/16), S3 gateway endpoint |
| us-west-1 | Legacy VM region | Guacamole ASG, VM instances |
| ap-south-1 | APAC VM region | Guacamole cluster, Nginx proxy, VMs |
| eu-central-1 | EU secondary VM region | Guacamole cluster, Nginx proxy, VMs |
Cross-Region Connectivity
Transit Gateway peering connects all VM regions for:
- VPN client pool routing (10.8.0.0/24, 10.9.0.0/24, 10.13.0.0/24)
- Network CIDR routing between regions
- External BADR account peering (account 439472160432)
3. Infrastructure as Code (SST v3)
Why SST v3? SST v3 wraps Pulumi and Terraform under a TypeScript-first API, so infrastructure is defined in the same language as the application. Unlike SST v2 (which was CDK/CloudFormation), v3 deploys faster, supports multi-provider (AWS + Cloudflare in one config), and avoids CloudFormation's 500-resource stack limits. The Link system lets Lambda functions access secrets and resource ARNs without manual env var wiring.
All infrastructure is managed via SST v3 (not v2/CDK). The root config is at sst.config.ts.
SST Apps
Shared Infrastructure (infrastructure/shared.ts)
Defines resources shared across all SST apps:
| Resource | Type | Purpose |
|---|---|---|
| MongoConnectionString | SST Secret | MongoDB connection string |
| CloudflareAccessClientId/Secret | SST Secret | Cloudflare Zero Trust auth (non-prod) |
| NetworksIacInvocationQueue | SQS FIFO Queue | Network provisioning message bus |
| MainWebAppBaseUrl | SST Linkable | Base URL for the main web app |
| Application VPC IDs | SST Linkable | VPC, subnet, and SG references per region |
Deployment
# Deploy all apps to staging
npx sst deploy --stage staging
# Deploy specific app
APPS=core npx sst deploy --stage production
# Selective deployment via APPS env var
APPS=guacamole,nginx npx sst deploy --stage staging
4. VPC & Networking
VPC Layout (Production, eu-west-1)
Security Groups
| Security Group | Ports | Direction | CIDR | Purpose |
|---|---|---|---|---|
| VulnerableVmsSecurityGroup | All except 80 | Ingress | Private IPs only | Isolate vulnerable machines |
| AllowAllSecurityGroup | All | All | 0.0.0.0/0 | Permissive SG for specific services |
| GuacamoleSecurityGroup | 8080 | Ingress | Cloudflare IPs | Guacamole web app access |
| GuacdLoadBalancerSecurityGroup | 4822 | Ingress | Guacamole SG | Guacd protocol traffic |
| GuacdSecurityGroup | 4822 | Ingress | LB SG | Guacd daemon access |
| NginxSecurityGroup | 80, 8443 | Ingress | 0.0.0.0/0 | Nginx reverse proxy |
us-east-1 (Modern Region, Dynamic VPC)
Unlike legacy regions, us-east-1 creates its VPC dynamically in SST:
VPC: 10.201.0.0/16
├── Vulnerable VMs Subnet: 10.201.0.0/17
├── Remote Access Subnet: 10.201.128.0/18
├── LB Subnet: 10.201.192.0/24
├── Internet Gateway (dynamic)
├── Transit Gateway Attachment (tgw-01c8741ec73fb2f5c)
└── S3 Gateway Endpoint (private S3 access)
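A layout like the one above is easy to get wrong when a subnet is resized. As a rough sketch (the helper names `cidrRange`, `contains`, and `overlaps` are hypothetical and not taken from the codebase), the following checks that every subnet sits inside the VPC range and that no two subnets overlap:

```typescript
// Convert "a.b.c.d/n" into the numeric [first, last] addresses of the block.
function cidrRange(cidr: string): [number, number] {
  const parts = cidr.split("/");
  const octets = (parts[0] ?? "").split(".").map(Number);
  const prefix = Number(parts[1]);
  const badOctet = octets.some((o) => !Number.isInteger(o) || o < 0 || o > 255);
  if (octets.length !== 4 || badOctet || !Number.isInteger(prefix) || prefix < 0 || prefix > 32) {
    throw new Error(`Invalid CIDR: ${cidr}`);
  }
  // Multiplication instead of bit shifts avoids 32-bit signed overflow.
  const base = octets.reduce((acc, o) => acc * 256 + o, 0);
  const size = Math.pow(2, 32 - prefix);
  const first = Math.floor(base / size) * size;
  return [first, first + size - 1];
}

// True when `inner` lies entirely within `outer`.
function contains(outer: string, inner: string): boolean {
  const [oFirst, oLast] = cidrRange(outer);
  const [iFirst, iLast] = cidrRange(inner);
  return iFirst >= oFirst && iLast <= oLast;
}

// True when the two blocks share at least one address.
function overlaps(a: string, b: string): boolean {
  const [aFirst, aLast] = cidrRange(a);
  const [bFirst, bLast] = cidrRange(b);
  return aFirst <= bLast && bFirst <= aLast;
}

// Values copied from the us-east-1 layout above.
const usEast1Vpc = "10.201.0.0/16";
const usEast1Subnets = ["10.201.0.0/17", "10.201.128.0/18", "10.201.192.0/24"];
```

With the values from the layout, every subnet is contained in the VPC and no pair overlaps, which leaves 10.201.193.0 onward unallocated.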
5. VM Deployment Pipeline
This is the core of TryHackMe's infrastructure -- deploying EC2 instances for users to interact with.
Deployment Flow
Key Concepts
Upload (VM Template): An Upload defines a VM image -- its AMI, instance type, OS, remote connection type, credentials, and boot time. Uploads are stored in MongoDB and reference AMIs in each region.
Instance: A running VM instance. Tracks AWS instance ID, connection details, expiry time, and state. Stored in MongoDB.
Instance Types & Selection:
Retry Strategy:
- Spot instances: up to AWS_SPOT_MAX_RETRIES retries (default 4)
- On-demand: up to AWS_ON_DEMAND_MAX_RETRIES retries (default 8)
- Backoff delay: AWS_BACKOFF_DELAY_MS (default 100ms)
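A minimal sketch of how these limits could drive a retry loop. The defaults come from the list above; everything else is an assumption: the document does not say whether the delay is fixed or growing (this sketch doubles it on each retry), and `withRetries`, `backoffSchedule`, and `RetryPolicy` are hypothetical names.

```typescript
interface RetryPolicy {
  maxRetries: number;     // retries allowed after the first attempt
  backoffDelayMs: number; // base delay before the first retry
}

// Defaults as quoted above (AWS_SPOT_MAX_RETRIES, AWS_ON_DEMAND_MAX_RETRIES, AWS_BACKOFF_DELAY_MS).
const SPOT_POLICY: RetryPolicy = { maxRetries: 4, backoffDelayMs: 100 };
const ON_DEMAND_POLICY: RetryPolicy = { maxRetries: 8, backoffDelayMs: 100 };

// Assumption: exponential backoff, doubling the base delay on each retry.
function backoffSchedule(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maxRetries; retry++) {
    delays.push(policy.backoffDelayMs * Math.pow(2, retry));
  }
  return delays;
}

// Runs `attempt` once, then retries after each scheduled delay until it
// succeeds or the schedule is exhausted. `sleep` is injected so the loop
// can be exercised without real timers.
async function withRetries<T>(
  attempt: () => Promise<T>,
  policy: RetryPolicy,
  sleep: (ms: number) => Promise<void>,
): Promise<T> {
  const delays = backoffSchedule(policy);
  let lastError: unknown;
  for (let i = 0; i <= delays.length; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      const delay = delays[i];
      if (delay === undefined) break; // no retries left
      await sleep(delay);
    }
  }
  throw lastError;
}
```

Under the doubling assumption, a spot request waits 100, 200, 400, and 800 ms between attempts before giving up.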
Instance Lifecycle:
- Default expiry: varies by context (1-3 hours)
- Extension: +1 hour (once, only in last hour, max 6 hours total)
- Termination: User-initiated, auto-expiry, or spot interruption
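The extension rule can be read as a small predicate. The sketch below is one interpretation, not the platform's code: it treats "only in last hour" as one hour or less remaining, treats "max 6 hours total" as a cap on the time between start and the new expiry, and does not track how many extensions were already granted because the source leaves "once" ambiguous. `extendExpiry` and `InstanceTimes` are hypothetical names.

```typescript
const HOUR_MS = 60 * 60 * 1000;
const MAX_LIFETIME_MS = 6 * HOUR_MS;

interface InstanceTimes {
  startedAt: number; // epoch milliseconds
  expiresAt: number; // epoch milliseconds
}

// Returns the new expiry time, or null when the extension is not allowed.
function extendExpiry(instance: InstanceTimes, now: number): number | null {
  const remaining = instance.expiresAt - now;
  // Already expired, or not yet inside the final hour.
  if (remaining <= 0 || remaining > HOUR_MS) return null;
  const extended = instance.expiresAt + HOUR_MS;
  // Extending must not push the total lifetime past six hours.
  if (extended - instance.startedAt > MAX_LIFETIME_MS) return null;
  return extended;
}
```

For example, an instance started at hour 0 with a 2-hour expiry can be extended at 1h30 to expire at hour 3, while the same request at 0h30 is rejected.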
Cost Center Tagging:
Every EC2 instance is tagged for billing attribution:
| CostCenter | Usage |
|---|---|
| ROOM | Standard room VMs |
| EXAM | Certification exam VMs |
| SOC_SIM | SOC Simulator VMs |
| THREAT_HUNTER | Threat Hunting VMs |
| NETWORK | Network lab VMs |
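As an illustration of how the tagging might be kept consistent, the sketch below builds a tag list in the `{ Key, Value }` shape that EC2 tagging APIs accept. The tag key `CostCenter` is taken from the table header; the helper name `buildInstanceTags` and the idea of extra caller-supplied tags are assumptions.

```typescript
// Allowed values, copied from the table above.
type CostCenter = "ROOM" | "EXAM" | "SOC_SIM" | "THREAT_HUNTER" | "NETWORK";

interface Ec2Tag {
  Key: string;
  Value: string;
}

// Builds the tag list for a new instance. The CostCenter tag always comes
// first and cannot be overridden by the extra tags.
function buildInstanceTags(costCenter: CostCenter, extra: Record<string, string> = {}): Ec2Tag[] {
  const tags: Ec2Tag[] = [{ Key: "CostCenter", Value: costCenter }];
  for (const key of Object.keys(extra)) {
    if (key === "CostCenter") continue;
    tags.push({ Key: key, Value: extra[key] ?? "" });
  }
  return tags;
}
```

Typing the value as a union means a misspelled cost center fails at compile time rather than showing up as an unattributed line in the bill.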
6. Guacamole Remote Desktop System
Why Guacamole? Users need browser-based access to Windows (RDP) and Linux (VNC/SSH) VMs without installing any client software. Guacamole is a mature open-source gateway that renders all three protocols (RDP, VNC, SSH) in a browser via HTML5 Canvas. The guacd daemon handles protocol translation, while the web app serves the UI and manages connections. Running it on ECS Fargate per-region keeps latency low for users connecting to their nearest VM.
Apache Guacamole provides browser-based access to VMs via RDP, VNC, and SSH.
Connection Types
| Type | Protocol | Port | Use Case |
|---|---|---|---|
| GUACAMOLE | RDP/VNC/SSH via Guacamole | 8080 | Primary VM access method |
| NO_VNC | VNC via Nginx reverse proxy | 80/8443 | Lightweight VNC access |
| DCV | Amazon DCV | Custom | High-performance desktop |
| WEB_PROXY | HTTP reverse proxy | 80 | Web-based tools |
Guac-A-Worker (Cloudflare)
The Guac-A-Worker is a Cloudflare Worker that routes VM requests:
- User requests *.guacaworker.{domain}
- Worker extracts task ID from subdomain
- Looks up public DNS name in Cloudflare KV
- Routes request to correct Guacamole instance
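The routing steps above can be sketched as two small functions. This is an illustration under assumptions: the task ID is taken to be the single leftmost DNS label, and Cloudflare KV (whose real `get` is asynchronous) is replaced by a synchronous lookup interface so the logic stays self-contained. `extractTaskId`, `resolveUpstream`, and `TaskDnsStore` are hypothetical names.

```typescript
// Stand-in for the KV namespace; only the read path is modelled.
interface TaskDnsStore {
  get(taskId: string): string | undefined;
}

// Pulls the task ID out of "<taskId>.guacaworker.<baseDomain>".
// Returns null for the bare domain, nested subdomains, or foreign hosts.
function extractTaskId(hostname: string, baseDomain: string): string | null {
  const suffix = `.guacaworker.${baseDomain}`.toLowerCase();
  const host = hostname.toLowerCase();
  if (!host.endsWith(suffix)) return null;
  const label = host.slice(0, host.length - suffix.length);
  // Accept exactly one non-empty DNS label.
  return /^[a-z0-9-]+$/.test(label) ? label : null;
}

// Maps an incoming hostname to the public DNS name of its Guacamole target,
// or null when the host is malformed or the task is unknown.
function resolveUpstream(hostname: string, baseDomain: string, store: TaskDnsStore): string | null {
  const taskId = extractTaskId(hostname, baseDomain);
  if (taskId === null) return null;
  return store.get(taskId) ?? null;
}
```

A `Map<string, string>` satisfies `TaskDnsStore`, which makes the lookup easy to exercise locally before wiring it to a real KV binding.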
Resources created by SST:
- KV Namespace for task-to-DNS mapping
- Cloudflare DNS records (wildcard + base)
- Advanced Certificate Manager (Google CA)
- Worker route: *guacaworker.{domain}/*
- IAM user with ECS/EC2 describe permissions
Nginx Reverse Proxy
For VNC connections, an Nginx reverse proxy runs per-region:
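# IP-based subdomain routing
# 10-x-x-x.reverse-proxy-{region}.{domain} -> 10.x.x.x
# $target_ip and $target_port are set elsewhere in the config (not shown here)
server {
    location / {
        proxy_pass http://$target_ip:$target_port;
    }
    location /websockify/ {
        proxy_pass http://$target_ip:$target_port;
        # WebSocket upgrade for VNC
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
Scaling: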
- Production: 10-50 ARM64 instances (0.25 vCPU, 0.5GB each)
- Staging: 2-4 instances
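The dashed-IP hostname convention used by the Nginx proxy (10-x-x-x.reverse-proxy-{region}.{domain} -> 10.x.x.x) can be expressed outside Nginx as well, for example to build or validate proxy URLs in application code. The function below is a hypothetical sketch; restricting results to the 10.0.0.0/8 range is an assumption based on the pattern shown in the config comment.

```typescript
// Translates "10-201-5-17.reverse-proxy-us-east-1.example.com" to "10.201.5.17".
// Returns null for hosts that do not follow the convention.
function targetIpFromHost(hostname: string): string | null {
  const pattern = /^(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.reverse-proxy-[a-z0-9-]+\./;
  const match = pattern.exec(hostname.toLowerCase());
  if (match === null) return null;
  const octets = match.slice(1, 5).map(Number);
  if (octets.some((o) => o > 255)) return null;
  // Only private 10.x.x.x targets are expected behind this proxy (assumption).
  if (octets[0] !== 10) return null;
  return octets.join(".");
}
```

Rejecting anything outside 10.x.x.x in the same step keeps a crafted hostname from turning the proxy into an open relay to arbitrary addresses.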
7. Networks v2 (Pulumi Dynamic IaC)
Why Pulumi inside Lambda (and not SST/Terraform)? Each user's network lab needs its own VPC, subnets, VPN server, and multiple VMs -- created on-demand and destroyed hours later. SST/Terraform are for long-lived infrastructure. Pulumi's Automation API lets you run pulumi up and pulumi destroy programmatically from code, making it ideal for ephemeral, per-user infrastructure. Lambda provides the compute, SQS provides the queue, and EFS persists the Pulumi binary and state between invocations.
Networks v2 creates isolated network environments for users using Pulumi Automation API inside a Lambda function.
Lambda Configuration
| Setting | Value |
|---|---|
| Runtime | Node.js 22.x |
| Memory | 1 GB |
| Timeout | 15 minutes |
| Storage | EFS mount at /mnt/efs |
| VPC | Connected to application VPC |
| Permissions | Full AWS access (*) |
| Trigger | SQS FIFO queue |
What Pulumi Creates Per Network
Network Expiry
A cron Lambda runs every minute to check for expired networks:
- Queries MongoDB for expired network instances
- Sends SQS message to destroy the Pulumi stack
- Lambda destroys all AWS resources for that network
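A sketch of the middle step, turning expired networks into queue messages. The field names `MessageBody`, `MessageGroupId`, and `MessageDeduplicationId` are the standard SQS FIFO send parameters; the message body shape, the `NetworkInstance` type, and `buildDestroyMessages` are assumptions made for illustration.

```typescript
interface NetworkInstance {
  id: string;
  expiresAt: number; // epoch milliseconds
}

interface DestroyMessage {
  MessageBody: string;
  MessageGroupId: string;         // FIFO ordering scope: one group per network
  MessageDeduplicationId: string; // lets SQS drop repeat sends inside its deduplication window
}

// Selects networks whose expiry has passed and builds one destroy message each.
function buildDestroyMessages(networks: NetworkInstance[], now: number): DestroyMessage[] {
  return networks
    .filter((n) => n.expiresAt <= now)
    .map((n) => ({
      MessageBody: JSON.stringify({ action: "destroy", networkId: n.id }),
      MessageGroupId: n.id,
      MessageDeduplicationId: `destroy-${n.id}`,
    }));
}
```

Because the cron fires every minute, the same expired network can be picked up on consecutive runs while its stack is still being torn down. A stable deduplication ID means SQS discards those repeats within its five-minute deduplication interval instead of queueing a second destroy.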
8. AWS Services Map
Service Usage Summary -- What & Why
| Service | What It Does | Why This Choice |
|---|---|---|
| EC2 | Deploys VMs (spot + on-demand), manages AMIs, key pairs, volumes | The entire product is built around giving users real machines to hack. EC2 is the only AWS service that provides full OS-level virtual machines with arbitrary network configs and pre-baked AMIs. Spot instances cut costs ~60-70% for ephemeral, short-lived VMs that users will destroy in 1-3 hours anyway. |
| S3 | Stores images, VPN configs, badges, certificates, VM uploads, RAG files, skills matrices | Cheap, durable, globally-accessible object storage. VM disk images need to live in S3 before being imported as AMIs. Presigned URLs let the frontend upload/download directly without proxying through the API. |
| ECS Fargate | Runs Guacamole (web + guacd) and Nginx reverse proxy per region | Guacamole and Nginx are long-running services that need to scale independently per region. Fargate removes the need to manage EC2 hosts for these containers -- AWS handles the underlying compute. |
| Lambda | Networks IaC handler, Turbo S3 cache handler, cron jobs, log ingestion, AMI state change handler | Perfect for event-driven, short-lived work: a SOC-Sim log needs to be ingested at a specific time, a network needs provisioning, an AMI import completed. Pay-per-invocation means zero cost when idle. |
| EventBridge Scheduler | Schedules one-time events for SOC-Sim log ingestion into SIEM VMs | SOC-Sim logs must appear in the SIEM at precise timestamps to simulate a real attack timeline. EventBridge one-time schedules let you say "fire this Lambda at exactly 14:32:07" per log entry -- no polling, no cron. |
| SQS | FIFO queue for network infrastructure provisioning messages | Network creation via Pulumi takes up to 15 minutes. SQS decouples the API request from the Lambda execution, guarantees message ordering (FIFO), and provides at-least-once delivery with built-in retry. |
| API Gateway | HTTP APIs for Turbo S3 cache and Mock API | Lightweight serverless HTTP endpoints that don't justify running a full server. API Gateway + Lambda is the cheapest way to expose an HTTP endpoint that's called infrequently. |
| Transit Gateway | Cross-region VPC peering for VM connectivity | Users in India deploy VMs in ap-south-1, but the API runs in eu-west-1. TGW peering lets the API server's VPC talk to VM VPCs across regions without setting up individual VPC peering per region-pair. |
| Step Functions | Cloud training account lifecycle (assign, move OU, recycle) | AWS account provisioning is a multi-step workflow (create account, move to OU, set permissions, assign to user). Step Functions model this as a state machine with built-in retries and rollback -- much safer than chaining Lambda calls manually. |
| SSM Parameter Store | Cell-based infrastructure configs, subnet/SG references | Infrastructure metadata (which subnets are available, which security groups to use) changes per cell and region. SSM provides a centralized, versioned key-value store that both IaC and runtime code can read from. Cheaper than Secrets Manager for non-sensitive config. |
| Secrets Manager | Guacamole secret keys, sensitive credentials per region | Guacamole JSON secret keys and other credentials must be encrypted at rest and rotatable. Secrets Manager provides automatic encryption, access audit trails, and cross-region replication. |
| Auto Scaling | Manages Guacamole load balancer instance pools | VM traffic is bursty -- a cohort of students might start a room simultaneously. Auto Scaling adjusts Guacamole capacity between 2-20 instances based on CPU, so we don't over-provision during off-peak hours. |
| CloudWatch | Logs for all Lambda functions, ECS Container Insights, CPU alarms | Lambda logs go to CloudWatch by default (no choice). Container Insights gives ECS-level metrics (CPU, memory, network) without instrumenting the app. Alarms trigger when Guacamole CPU hits 60% to flag scaling issues early. |
| EFS | Persistent storage for Pulumi binaries and state in Lambda | Lambda's ephemeral /tmp storage defaults to 512MB (configurable up to 10GB) and is not guaranteed to survive between invocations. Pulumi's binary + node_modules + state files exceed the default and must outlive a single invocation. EFS provides a shared, persistent filesystem mounted at /mnt/efs that persists across Lambda invocations. |
| IAM | Multi-account credential profiles (default, admin, network-infra, vm-upload, cloud-training) | Different operations require different AWS accounts with different permission boundaries. VM uploads go to a dedicated account, cloud training uses its own account, network infra has its own. IAM profiles isolate blast radius -- a compromised key only affects one concern. |
9. Data Layer
MongoDB
Why MongoDB? The data is deeply nested and schema-diverse -- a VM instance document looks nothing like a user document or a SOC-Sim run. MongoDB's flexible schema handles this naturally. Mongoose adds just enough structure (schemas, validation, indexes) without the rigidity of a relational DB. The platform also needs fast reads on varied query patterns (by user, by room, by instance ID, by status), which compound indexes on a document store handle well.
Connection settings:
- Pool: min 2, max 50
- Read preference: primary
- Read concern: local
- Write concern: majority
- Retry reads and writes enabled
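These settings map directly onto standard MongoDB connection string options. The sketch below expresses them as URI query parameters. The application may well pass them as driver or Mongoose options instead, so treat this as an equivalent illustration; `withConnectionOptions` is a hypothetical helper and the base URI is assumed to already include the database path.

```typescript
// Settings from the list above, written as MongoDB URI option names.
const MONGO_OPTIONS: Record<string, string> = {
  minPoolSize: "2",
  maxPoolSize: "50",
  readPreference: "primary",
  readConcernLevel: "local",
  w: "majority",
  retryReads: "true",
  retryWrites: "true",
};

// Appends the options to a base URI such as "mongodb://host:27017/dbname".
function withConnectionOptions(baseUri: string, options: Record<string, string> = MONGO_OPTIONS): string {
  const query = Object.keys(options)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(options[key] ?? "")}`)
    .join("&");
  // Extend an existing query string rather than starting a second one.
  const separator = baseUri.includes("?") ? "&" : "?";
  return `${baseUri}${separator}${query}`;
}
```

Reading from the primary with a majority write concern favours consistency over read scaling: a write is acknowledged only once most replica set members have it, and subsequent reads go to the node that accepted it.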
Key collections:
- instances -- Running VM instances
- uploads -- VM templates/images
- users -- User accounts
- soc_sim_runs / soc_sim_run_alerts -- SOC Sim data
- threat_hunting_runs -- Threat Hunting data
- scheduler_jobs -- Agenda.js job queue
- sessions -- Express sessions
- Network templates, paths, modules, rooms, companies, etc.
Redis
Why Redis? Sub-millisecond reads for rate limiting (every single API request checks Redis), Socket.IO pub/sub across multiple API instances, and Bull queue backing. Two instances isolate the latency-critical main workload from the Tutor feature's higher-volume writes.
Two separate Redis instances serve different purposes:
Kafka
Used for event streaming between the API and analytics consumers:
Why Kafka? High-throughput, ordered event streaming that decouples producers from consumers. Analytics events, tutor interactions, and MongoDB change events generate a firehose of data -- Kafka handles this without back-pressure on the API. Consumer groups allow multiple independent consumers (analytics pipeline, tutor service) to process the same events at their own pace.
Topics include MongoDB change events and analytics events. A Vercel-to-ECS HTTP bridge exists for the Next.js frontend to publish events.
10. Real-Time Communication
Socket.IO Architecture
Why Socket.IO over raw WebSockets? Socket.IO provides automatic reconnection, room-based broadcasting (critical for SOC-Sim multiplayer), binary streaming, and -- most importantly -- the Redis adapter for multi-instance scaling. Raw WebSockets would require building all of this from scratch.
The Redis adapter (@socket.io/redis-adapter) ensures events published on one API instance reach clients connected to other instances. A Redis emitter (@socket.io/redis-emitter) bridges Vercel serverless functions to ECS-hosted Socket.IO.
Socket Handlers:
| Handler | Events |
|---|---|
| SOC Sim | soc-sim-join-run, alert assign/unassign, run status changes |
| TTX | Tabletop exercise events |
| Tutor | AI tutor chat events |
| Rooms | Room join/leave, task progress |
| Networks | Network instance status |
| Missions | Mission progress updates |
| KOTH | King of the Hill game events |
| Chat | General chat |
11. Job Scheduling & Background Processing
Three Scheduling Systems
| System | Backing Store | Use Case | Why This One |
|---|---|---|---|
| Agenda.js | MongoDB (scheduler_jobs) | Short-interval polling (health checks, score reduction) | Already have MongoDB. Agenda stores jobs as documents -- no extra infrastructure. Supports repeatEvery('3 seconds') for VM health checks, which is too frequent for EventBridge (1-min minimum) and too short-lived for Bull. |
| node-cron | In-process | Periodic maintenance (runs when THM_NODE_CRON=cron) | Simplest option for periodic tasks that don't need persistence or distribution. Runs on a dedicated cron process via THM_NODE_CRON env flag so it doesn't execute on every API instance. |
| EventBridge | AWS managed | One-time scheduled events (log delivery), periodic cron (network expiry) | SOC-Sim needs thousands of one-shot schedules ("fire Lambda at 14:32:07 to ingest log #47"). EventBridge handles this natively without polling. Also ideal for AWS-native event patterns like AMI state changes. |
| Bull | Redis (via ioredis) | Async job processing (company ops, room ops, leagues) | Jobs that need retries, priority, concurrency control, and a monitoring UI. Bull Board provides a web dashboard for inspecting failed jobs. Redis-backed means jobs survive API restarts. |
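For the EventBridge one-time schedules, the schedule expression uses the `at(yyyy-mm-ddThh:mm:ss)` form. A small sketch of producing that string from a `Date`, assuming the schedule is created in UTC (the scheduler's default time zone); `oneTimeScheduleExpression` is a hypothetical helper name.

```typescript
// Formats a Date as an EventBridge Scheduler one-time expression in UTC,
// for example "at(2024-05-01T14:32:07)".
function oneTimeScheduleExpression(when: Date): string {
  if (Number.isNaN(when.getTime())) {
    throw new Error("Invalid date");
  }
  // toISOString() returns "YYYY-MM-DDTHH:mm:ss.sssZ"; the first 19
  // characters are the second-precision timestamp the expression needs.
  return `at(${when.toISOString().slice(0, 19)})`;
}
```

One-time expressions carry second precision at most, so any sub-second offsets in a SOC-Sim log timeline are dropped by this formatting.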
12. Authentication & Session Management
Why Passport.js? Strategy-based architecture means adding a new auth provider (e.g., WorkOS for enterprise SSO) is just adding a new strategy file -- no changes to the core auth flow. The ecosystem has battle-tested strategies for every provider TryHackMe uses.
Why connect-mongo for sessions? Sessions must survive API restarts and be shared across multiple API instances. MongoDB is already in the stack, so connect-mongo avoids adding another dependency. Redis sessions would be faster but add operational complexity for marginal gain (sessions are read once per request, not per millisecond).
Why WorkOS? Enterprise B2B customers need SAML/OIDC SSO with their identity provider (Okta, Azure AD, etc.). WorkOS provides a unified API for all SSO protocols, directory sync, and admin portal -- building this in-house would take months.
- Sessions stored in MongoDB via connect-mongo
- Session cookie max age: 7 days
- Passport serializes user ID into session
- Socket.IO connections share the same Passport session middleware
13. Observability Stack
Why Pino over Winston? Pino is consistently among the fastest Node.js loggers -- it emits newline-delimited JSON with minimal overhead and can run log transports in worker threads, keeping log processing off the main event loop. On a platform that handles thousands of concurrent VM deployments, logging overhead matters, and a heavier logger such as Winston would add per-request cost.
Why Sentry? Stack traces, breadcrumbs, release tracking, and performance profiling in one tool. The @sentry/profiling-node integration provides continuous profiling in production to catch slow functions. Sentry's Express integration automatically captures unhandled errors with full request context.
Why Segment? Single API for routing analytics events to multiple destinations (Mixpanel, Amplitude, data warehouse, etc.). Changing analytics providers means updating Segment config, not rewriting event calls across the codebase.
Why Customer.io over SES directly? Templated transactional emails with visual builders, drip campaigns, and user segmentation. SES is used under the hood (via Nodemailer) only for custom emails like calendar invitations where Customer.io templates don't apply.
Log levels: trace, debug, info, warn, error, fatal (configurable via LOG_LEVEL)
Redaction: Cookies and sensitive headers are redacted from HTTP logs.
14. Request Lifecycle
Rate Limiting
20+ endpoint-specific rate limiters backed by Redis:
Redis Store (rate-limit-redis)
└── Fallback: In-memory store
Why Redis-backed rate limiting? Rate limits must be shared across all API instances -- if instance A sees 15 requests from a user, instance B must know about them. Redis provides the shared counter. The rate-limit-redis store handles atomic increments. In-memory fallback ensures the API doesn't crash if Redis is temporarily unavailable (it just rate-limits per-instance instead of globally).
Why express-rate-limit over a WAF? Fine-grained, per-endpoint control. The "start VM" endpoint has a different limit than "get user profile." WAF rules can't express "5 VM deploys per hour per user but 100 profile reads per minute." 20+ distinct limiter configs in the codebase reflect this granularity.
Rate limiters are warmed up on server start. Socket.IO has separate per-minute and per-day rate limits via rate-limiter-flexible.
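The shared-store-with-fallback behaviour can be illustrated with a fixed-window counter. This is a simplified sketch, not the express-rate-limit or rate-limit-redis implementation: stores are synchronous here (a real Redis store is asynchronous), and `CounterStore`, `MemoryStore`, and `FallbackLimiter` are hypothetical names.

```typescript
interface CounterStore {
  // Increments the counter for `key` and returns the count in the current
  // window. May throw when the backing service is unavailable.
  increment(key: string, windowMs: number, now: number): number;
}

// Per-process store; also usable as the fallback when the shared store fails.
class MemoryStore implements CounterStore {
  private readonly windows = new Map<string, { count: number; resetAt: number }>();

  increment(key: string, windowMs: number, now: number): number {
    const current = this.windows.get(key);
    if (current === undefined || current.resetAt <= now) {
      this.windows.set(key, { count: 1, resetAt: now + windowMs });
      return 1;
    }
    current.count += 1;
    return current.count;
  }
}

class FallbackLimiter {
  private readonly primary: CounterStore;
  private readonly fallback: CounterStore;
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(primary: CounterStore, fallback: CounterStore, limit: number, windowMs: number) {
    this.primary = primary;
    this.fallback = fallback;
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // True when the request is within the limit. If the primary store throws,
  // counting continues per-instance in the fallback store.
  allow(key: string, now: number): boolean {
    let count: number;
    try {
      count = this.primary.increment(key, this.windowMs, now);
    } catch {
      count = this.fallback.increment(key, this.windowMs, now);
    }
    return count <= this.limit;
  }
}
```

A limiter for the "5 VM deploys per hour per user" case would be constructed with a limit of 5 and a window of 3,600,000 ms, keyed by user ID. During a Redis outage each API instance enforces that limit on its own, so the effective global limit temporarily grows with the number of instances.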
15. Configuration & Secrets
Why Cloudflare in front of everything? DDoS protection, global CDN, DNS management, and the Workers edge compute platform -- all in one. The Guac-A-Worker runs at Cloudflare's edge (300+ PoPs worldwide), routing VM desktop requests to the nearest Guacamole cluster with near-zero latency. Cloudflare also provides the wildcard SSL certificates that make *.guacaworker.{domain} subdomain routing possible.
Why GrowthBook for feature flags? Open-source, self-hostable, and provides server-side SDK for Node.js. Feature flags gate every major feature (SOC-Sim VM deploy, curriculum mapping, threat hunting SIEM tool modal). GrowthBook's targeting rules allow gradual rollouts by user segment, company, or percentage -- critical for safely shipping infrastructure changes like new VM connection types.
Why Sanity CMS? Scenario content (SOC-Sim alerts, Threat Hunting timelines, evaluation criteria) changes frequently and is authored by non-engineers. Sanity provides a structured content authoring UI, real-time collaboration, and a GROQ query API. Content is fetched at runtime, not baked into deploys, so scenario updates ship instantly without redeployment.
Config Architecture
AWS Credential Profiles
| Profile | Env Vars | Purpose |
|---|---|---|
| default | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY | Primary account operations |
| admin | AWS_ADMIN_ACCESS_KEY_ID, AWS_ADMIN_SECRET_ACCESS_KEY | Admin EC2 operations |
| vm-upload | AWS_VM_UPLOAD_ACCESS_KEY_ID, AWS_VM_UPLOAD_SECRET_ACCESS_KEY | VM image uploads |
| cloud-training | AWS_CLOUD_TRAINING_ACCESS_KEY_ID, AWS_CLOUD_TRAINING_SECRET_KEY | Cloud training account provisioning |
| network-infra | AWS_NETWORK_INFRA_ACCESS_KEY_ID, AWS_NETWORK_INFRA_SECRET_ACCESS_KEY | Network IaC operations |
| Region-specific | AWS_REGION_*_ACCESS_KEY_ID, etc. | Per-region operations (us-east-1, ap-south-1, eu-central-1) |
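One way to keep the profile-to-variable mapping in a single place is a lookup table plus a resolver that fails loudly on missing values. The variable names below are copied from the table (including the shorter `AWS_CLOUD_TRAINING_SECRET_KEY`); `resolveCredentials`, `AwsCredentials`, and passing the environment in as a parameter are assumptions for the sake of a self-contained sketch. Region-specific profiles are omitted because their full variable names are not listed.

```typescript
type Env = Record<string, string | undefined>;

interface AwsCredentials {
  accessKeyId: string;
  secretAccessKey: string;
}

// [access key variable, secret key variable] per profile, as listed above.
const PROFILE_ENV: Record<string, [string, string]> = {
  default: ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
  admin: ["AWS_ADMIN_ACCESS_KEY_ID", "AWS_ADMIN_SECRET_ACCESS_KEY"],
  "vm-upload": ["AWS_VM_UPLOAD_ACCESS_KEY_ID", "AWS_VM_UPLOAD_SECRET_ACCESS_KEY"],
  "cloud-training": ["AWS_CLOUD_TRAINING_ACCESS_KEY_ID", "AWS_CLOUD_TRAINING_SECRET_KEY"],
  "network-infra": ["AWS_NETWORK_INFRA_ACCESS_KEY_ID", "AWS_NETWORK_INFRA_SECRET_ACCESS_KEY"],
};

// Looks up the key pair for a profile, throwing if the profile is unknown
// or either variable is unset or empty.
function resolveCredentials(profile: string, env: Env): AwsCredentials {
  const names = PROFILE_ENV[profile];
  if (names === undefined) {
    throw new Error(`Unknown profile: ${profile}`);
  }
  const accessKeyId = env[names[0]];
  const secretAccessKey = env[names[1]];
  if (!accessKeyId || !secretAccessKey) {
    throw new Error(`Missing credentials for profile "${profile}"`);
  }
  return { accessKeyId, secretAccessKey };
}
```

In a Node.js process the second argument would typically be `process.env`. Failing at lookup time avoids falling back silently to the default profile's broader permissions.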
S3 Bucket Map
| Bucket | Env Var | Content |
|---|---|---|
| Images | S3_IMAGES_BUCKET_NAME | Room icons, avatars, team logos |
| VPN Configs | S3_VPN_CONFIG_BUCKET_NAME | Generated OpenVPN configs |
| VPN Templates | S3_VPN_TEMPLATES_BUCKET_NAME | VPN configuration templates |
| Badges | S3_BADGES_BUCKET_NAME | Achievement badge images |
| Certificates | S3_CERTIFICATES_BUCKET_NAME | Completion certificates |
| Skills Matrix | S3_SKILLS_MATRIX_BUCKET_NAME | Skills assessment exports |
| VM Uploads | Region-specific | VM disk images for import |
| RAG | TTX_S3_BUCKET_NAME | RAG documents for AI features |
| BADR Deployment | tryhackme-badr-deployment-{stage}-{region} | Network deployment artifacts |
| Turbo Cache | SST-managed | Turborepo remote build cache |
16. Key File Reference
Infrastructure as Code
| File | What It Defines |
|---|---|
| sst.config.ts | Root SST v3 config, app registry, stage/region settings |
| infrastructure/shared.ts | Shared secrets, SQS queue, VPC references |
| infrastructure/core.ts | VPCs, subnets, security groups, Transit Gateways, S3 endpoints |
| infrastructure/guacamole.ts | ECS clusters, Fargate services, ALBs, auto-scaling for Guacamole |
| infrastructure/nginx.ts | Nginx reverse proxy ECS services, Cloudflare DNS, TLS |
| infrastructure/utils.ts | SST helper functions |
| apps/networks-iac/infrastructure.ts | Lambda + SQS for dynamic Pulumi provisioning |
| apps/turbo-s3/infrastructure.ts | S3 bucket + API Gateway for Turborepo cache |
| apps/cron/infrastructure.ts | Cron Lambda + EventBridge subscriptions |
| apps/mock-api/infrastructure.ts | Mock API Gateway + Lambda |
| apps/guac-a-worker/infrastructure.ts | Cloudflare Worker, KV, DNS, certificates |
AWS Service Wrappers
| File | Service |
|---|---|
| src/services/aws/ec2/index.ts | EC2 instances, AMIs, volumes, tags |
| src/services/aws/ec2/amis.ts | AMI state change processing |
| src/services/s3/index.ts | S3 uploads, downloads, presigned URLs |
| src/services/aws/scheduler/index.ts | EventBridge one-time schedules |
| src/services/aws/sqs/index.ts | SQS message publishing |
| src/services/aws/ssm/index.ts | SSM Parameter Store lookups |
| src/services/aws/secrets-manager/index.ts | Secrets Manager retrieval |
| src/services/aws/auto-scaling/index.ts | Auto Scaling Group queries |
| src/services/aws/step-functions/index.ts | Step Functions execution |
VM & Connection Services
| File | Purpose |
|---|---|
| src/services/vms/index.ts | VM deployment orchestrator |
| src/services/vms/deployment.ts | EC2 instance creation, tagging, user data |
| src/services/vms/termination.ts | VM termination and cleanup |
| src/services/vms/connections/index.ts | Connection type router |
| src/services/vms/connections/guacamole.ts | Guacamole connection creation |
| src/services/vms/connections/novnc.ts | NoVNC connection handling |
| src/services/vms/connections/dcv.ts | Amazon DCV connections |
| src/services/vms/connections/web-proxy.ts | Web proxy connections |
| src/services/vms/data-assignment.ts | VM data for SOC-Sim/Threat Hunting |
| src/services/vms/utils.ts | Cell selection, regional configs |
| src/services/guacamole/client.ts | Guacamole API client |
| src/services/azure/labs.ts | Azure Lab management |
| src/services/azure/azure-repo.ts | Azure lab repository |
System Design Components
| File | Component |
|---|---|
| src/infra/database/index.ts | MongoDB connection setup |
| src/services/redis/index.ts | Main Redis client |
| src/services/redis/tutor.ts | Tutor Redis client |
| src/services/queues/connection.ts | Bull queue Redis connection |
| src/services/agenda/index.ts | Agenda.js scheduler |
| src/services/agenda/definitions.ts | Job definitions (health checks) |
| src/infra/sockets/index.ts | Socket.IO server setup |
| src/infra/sockets/middleware.ts | Socket auth middleware |
| src/middlewares/auth/passport/index.ts | Passport strategies |
| src/services/kafka/index.ts | Kafka producer/consumer |
| src/common/logging/index.ts | Pino logger configuration |
| src/middlewares/limiters/index.ts | Rate limiting middleware |
| src/services/health-check/index.ts | Health check endpoints |
| src/server.ts | Express server + middleware chain |
| apps/api/config.js | Central configuration |
All src/ paths above are relative to apps/api/app/api/v2/.