Curriculum Mapping
2022-01-25
Curriculum Mapping -- Technical Documentation
1. Overview
Curriculum Mapping is a B2B feature in TryHackMe's Management Dashboard that allows company administrators to upload their existing cybersecurity curriculum documents and have the system automatically:
- Parse the document (XLSX, PDF, DOCX, TXT, CSV, Markdown)
- Extract structured learning requirements using GPT-4o-mini
- Match those requirements to TryHackMe's room catalogue using a multi-signal scoring algorithm
- Create a custom learning path with auto-generated modules
AI extraction sits at the core of the flow -- the system transforms unstructured educational documents into structured, actionable training paths mapped to TryHackMe content.
2. High-Level Architecture
3. The Pipeline: 4-Stage Architecture
4. Stage 1: Document Parsing (parser.ts)
Supported Formats
| Format | Library | Extraction Method |
|---|---|---|
| .xlsx, .xls | SheetJS (XLSX) | Structured column parsing with header detection |
| .pdf | pdf-parse | Raw text extraction |
| .docx, .doc | mammoth | Raw text extraction |
| .txt, .csv, .md | Native | UTF-8 buffer read |
XLSX Parsing (Structured Path)
XLSX files get the most sophisticated treatment:
- Merged cell propagation -- fills merged cell ranges so every row has complete data
- Header row detection -- scans the first 10 rows, scores each by keyword density (title, code, room, topic, module, difficulty, time, week, etc.), and picks the row with the highest match ratio
- Column mapping -- identifies title, time, difficulty, description, and section columns by keyword matching
- Row extraction -- iterates data rows, skipping noise rows, and extracts structured ParsedCurriculumItem objects
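The header row detection step above can be sketched roughly as follows. This is a minimal illustration, not the code in parser.ts: the keyword list is only the subset named above, and the names HEADER_KEYWORDS, headerMatchRatio, and detectHeaderRow are assumptions introduced for the example.

```typescript
// Illustrative subset of header keywords; the real list is longer ("etc." above).
const HEADER_KEYWORDS = ["title", "code", "room", "topic", "module", "difficulty", "time", "week"];

type Row = ReadonlyArray<string | number | null | undefined>;

/** Fraction of non-empty cells in a row that contain at least one header keyword. */
function headerMatchRatio(row: Row): number {
  const cells = row
    .map((cell) => String(cell ?? "").trim().toLowerCase())
    .filter((cell) => cell.length > 0);
  if (cells.length === 0) return 0;
  const hits = cells.filter((cell) => HEADER_KEYWORDS.some((kw) => cell.includes(kw))).length;
  return hits / cells.length;
}

/** Scans the first `maxScan` rows and returns the index of the best-scoring row, or -1 if none match. */
function detectHeaderRow(rows: ReadonlyArray<Row>, maxScan = 10): number {
  let bestIndex = -1;
  let bestRatio = 0;
  rows.slice(0, maxScan).forEach((row, index) => {
    const ratio = headerMatchRatio(row);
    if (ratio > bestRatio) {
      bestRatio = ratio;
      bestIndex = index;
    }
  });
  return bestIndex;
}
```

Scoring by ratio rather than raw hit count keeps a wide data row with one incidental keyword from outranking a narrow row made up entirely of header labels.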
Text-Based Document Classification
For PDF/DOCX/TXT, the parser classifies the document into one of three tiers:
| Tier | Heuristic | Parsing Strategy |
|---|---|---|
| Structured | High bullet/numbered ratio, short lines | Line-by-line extraction of titles, time, difficulty |
| Semi-structured | 3+ headings (Module/Phase/Chapter patterns) | Heading-based hierarchy extraction |
| Unstructured | Long lines, few headings (prose-heavy) | Smart sampling (beginning + middle + end, 10K chars each) sent to AI |
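A rough sketch of how the three-tier classification might look is shown below. The thresholds (BULLET_RATIO_MIN, SHORT_LINE_MAX_AVG) and both regular expressions are assumptions made for illustration; the source only states the heuristics qualitatively, apart from the "3+ headings" rule.

```typescript
type DocumentTier = "structured" | "semi-structured" | "unstructured";

// Assumed thresholds for illustration only; the real values are not documented here.
const BULLET_RATIO_MIN = 0.5;
const SHORT_LINE_MAX_AVG = 80;
const HEADING_MIN = 3; // "3+ headings" from the table above

// Assumed patterns: bullet or numbered list items, and Module/Phase/Chapter headings.
const BULLET_RE = /^(?:[-*•]|\d+[.)])\s+/;
const HEADING_RE = /^(?:module|phase|chapter)\s+\d+/i;

function classifyDocument(text: string): DocumentTier {
  const lines = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (lines.length === 0) return "unstructured";

  const bulletRatio = lines.filter((line) => BULLET_RE.test(line)).length / lines.length;
  const avgLength = lines.reduce((sum, line) => sum + line.length, 0) / lines.length;
  const headingCount = lines.filter((line) => HEADING_RE.test(line)).length;

  // Structured: mostly list items, and lines are short on average.
  if (bulletRatio >= BULLET_RATIO_MIN && avgLength <= SHORT_LINE_MAX_AVG) return "structured";
  // Semi-structured: enough recognisable headings to build a hierarchy.
  if (headingCount >= HEADING_MIN) return "semi-structured";
  // Everything else is treated as prose and sampled for the AI stage.
  return "unstructured";
}
```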
5. Stage 2: AI Requirement Extraction (extractor.ts)
This is the core AI stage. All documents go through GPT-4o-mini extraction, regardless of the parsing tier.
AI Configuration
| Parameter | Value |
|---|---|
| Model | gpt-4o-mini |
| Temperature | 0.2 (low -- focused, near-deterministic output) |
| Response format | json_object (structured JSON) |
| Max items sent | 150 (sampled: beginning + middle + end) |
The Extraction Prompt
The prompt instructs GPT-4o-mini to act as a cybersecurity curriculum analyst.
Program Constraints
When the user provides program settings, the prompt is enriched:
| Setting | Effect on Prompt |
|---|---|
| Program length (e.g., 8 weeks) | "Structure and scope topics to fit within this timeframe" |
| Difficulty = easy | "Prioritise foundational topics (networking basics, Linux fundamentals). Skip advanced exploitation." |
| Difficulty = intermediate | "Include both foundational and hands-on topics. Balance breadth with depth." |
| Difficulty = hard | "Prioritise complex and specialist topics. Assume foundational knowledge." |
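The enrichment in the table above could be assembled along these lines. This is a sketch under assumptions: the ProgramSettings shape mirrors the query parameters of the extract endpoint, buildConstraintLines is a hypothetical helper, and the "Program length:" prefix is invented for the example. Only the instruction sentences are taken from the table.

```typescript
type Difficulty = "easy" | "intermediate" | "hard";

interface ProgramSettings {
  programLength?: number;
  programLengthUnit?: string;
  difficulty?: Difficulty;
}

// Guidance sentences quoted from the table above.
const DIFFICULTY_GUIDANCE: Record<Difficulty, string> = {
  easy: "Prioritise foundational topics (networking basics, Linux fundamentals). Skip advanced exploitation.",
  intermediate: "Include both foundational and hands-on topics. Balance breadth with depth.",
  hard: "Prioritise complex and specialist topics. Assume foundational knowledge.",
};

/** Returns the extra prompt lines for the given settings; empty when no settings are provided. */
function buildConstraintLines(settings: ProgramSettings): string[] {
  const lines: string[] = [];
  if (settings.programLength !== undefined && settings.programLengthUnit) {
    // The "Program length:" prefix is an assumption; the instruction sentence is from the table.
    lines.push(
      `Program length: ${settings.programLength} ${settings.programLengthUnit}. ` +
        "Structure and scope topics to fit within this timeframe."
    );
  }
  if (settings.difficulty) {
    lines.push(DIFFICULTY_GUIDANCE[settings.difficulty]);
  }
  return lines;
}
```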
Fallback Strategy
If the AI call fails, the system falls back to keyword-based extraction: it scans each parsed item's title/description for ~100 cybersecurity terms (firewall, nmap, wireshark, phishing, sql injection, etc.) and uses those as skill tags.
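A minimal sketch of the keyword fallback, assuming a simple substring match: SECURITY_TERMS holds only the five terms named above (the real list has roughly 100), and ParsedItem is a simplified stand-in for ParsedCurriculumItem.

```typescript
// Illustrative subset of the keyword list.
const SECURITY_TERMS = ["firewall", "nmap", "wireshark", "phishing", "sql injection"];

// Simplified stand-in for ParsedCurriculumItem (assumed shape).
interface ParsedItem {
  title: string;
  description?: string;
}

/** Returns every known term found in the item's title or description, case-insensitively. */
function extractSkillTags(item: ParsedItem): string[] {
  const haystack = `${item.title} ${item.description ?? ""}`.toLowerCase();
  return SECURITY_TERMS.filter((term) => haystack.includes(term));
}
```

Plain substring matching is cheap and has no external dependency, which suits a fallback path, but it can over-match short terms inside longer words.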
6. Stage 3: Content Matching (matcher.ts)
Once requirements are extracted, the matcher scores every TryHackMe room against every requirement using a weighted multi-signal scoring algorithm.
Scoring Formula
totalScore = (tagMatch × 0.50) + (titleSimilarity × 0.35) + (descriptionSimilarity × 0.15)
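As a worked illustration of the formula, the sketch below assumes each signal is already normalised to the range 0 to 1. MATCH_WEIGHTS is the constant name listed in the directory structure, but its shape here is a guess, and the Signals interface is introduced only for the example.

```typescript
// Weights from the formula above; the object shape is an assumption.
const MATCH_WEIGHTS = { tag: 0.5, title: 0.35, description: 0.15 } as const;

interface Signals {
  tagMatch: number; // assumed to be in [0, 1]
  titleSimilarity: number; // assumed to be in [0, 1]
  descriptionSimilarity: number; // assumed to be in [0, 1]
}

/** Weighted sum of the three signals; the result stays in [0, 1] when the inputs do. */
function totalScore(signals: Signals): number {
  return (
    signals.tagMatch * MATCH_WEIGHTS.tag +
    signals.titleSimilarity * MATCH_WEIGHTS.title +
    signals.descriptionSimilarity * MATCH_WEIGHTS.description
  );
}
```

For example, a room with a full tag match (1.0), 78% title similarity, and 40% description similarity scores 0.50 + 0.273 + 0.06 = 0.833.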
Match Thresholds
| Status | Score Threshold | Meaning |
|---|---|---|
| Matched | >= 0.60 | Strong content match |
| Partial | >= 0.30 | Some relevant content found |
| Unmatched | < 0.30 | No suitable TryHackMe room |
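The thresholds map onto a small classification function. MATCH_THRESHOLDS is the constant name from the directory structure; its shape and the statusForScore helper are assumptions for illustration.

```typescript
type MatchStatus = "matched" | "partial" | "unmatched";

// Threshold values from the table above; the object shape is an assumption.
const MATCH_THRESHOLDS = { matched: 0.6, partial: 0.3 } as const;

/** Classifies a total score; both thresholds are inclusive lower bounds. */
function statusForScore(score: number): MatchStatus {
  if (score >= MATCH_THRESHOLDS.matched) return "matched";
  if (score >= MATCH_THRESHOLDS.partial) return "partial";
  return "unmatched";
}
```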
Coverage Calculation
coverage% = ((matched + partial × 0.5) / total) × 100
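The coverage formula can be checked against the sample Match Response, where 15 matched, 4 partial, and 3 unmatched requirements yield 77. The sketch below assumes rounding to the nearest integer, which is consistent with that example but not stated in the source; coveragePercentage is a hypothetical helper name.

```typescript
interface CoverageCounts {
  matched: number;
  partial: number;
  unmatched: number;
}

/** Partial matches count as half; rounding to an integer is an assumption. */
function coveragePercentage(counts: CoverageCounts): number {
  const total = counts.matched + counts.partial + counts.unmatched;
  if (total === 0) return 0; // avoid dividing by zero for an empty requirement list
  return Math.round(((counts.matched + counts.partial * 0.5) / total) * 100);
}
```

With the sample numbers: (15 + 4 × 0.5) / 22 × 100 = 77.27, which rounds to 77.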
Deduplication Strategy
Each room is assigned to only one requirement -- the one where it scored highest. This prevents the same room from appearing in multiple modules.
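One way to implement this deduplication is a single pass that keeps the best-scoring requirement per room. The ScoredPair shape and the assignRoomsToBestRequirement name are assumptions, and on an exact tie this sketch keeps the first pair seen, which the source does not specify.

```typescript
// Hypothetical shape for one (requirement, room) score produced by the matcher.
interface ScoredPair {
  requirementId: string;
  roomCode: string;
  score: number;
}

/** Maps each room code to the single pair where that room scored highest. */
function assignRoomsToBestRequirement(pairs: ReadonlyArray<ScoredPair>): Map<string, ScoredPair> {
  const bestByRoom = new Map<string, ScoredPair>();
  for (const pair of pairs) {
    const current = bestByRoom.get(pair.roomCode);
    // Strict ">" means the earliest pair wins a tie (an assumption).
    if (current === undefined || pair.score > current.score) {
      bestByRoom.set(pair.roomCode, pair);
    }
  }
  return bestByRoom;
}
```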
Room Pool
The matcher fetches all public TryHackMe rooms of type CHALLENGE or WALKTHROUGH with their tag entities. This forms the candidate pool (on the order of thousands of rooms).
7. Stage 4: Path Creation (index.ts)
When the user confirms the matched content and personalises the path, the orchestrator creates the custom modules and the learning path.
The created path has:
- source: PathSource.CURRICULUM -- identifies it as AI-generated from a curriculum
- contentType: PathContentType.MODULES -- uses custom module structure
- public: false -- company-private by default
8. Frontend: 4-Step Wizard
Step Details
| Step | Component | API Call | What Happens |
|---|---|---|---|
| 0. Upload | UploadStep | -- | User uploads file or pastes text, configures settings |
| Loading | LoadingStep | POST /extract | AI parses + extracts requirements |
| 1. Requirements | RequirementsStep | -- | User reviews extracted topics, skills, sections |
| Loading | LoadingStep | POST /match-requirements | Scoring algorithm runs against room catalogue |
| 2. Content | ContentStep | -- | User reviews matched modules, removes unwanted rooms |
| 3. Personalise | PersonaliseStep | -- | User edits AI-suggested title, description, intro |
| Creating | LoadingStep | POST /create-path | Custom modules + learning path created |
| Done | PathCreatedStep → PathPreviewStep | -- | Preview + assign path |
9. API Endpoints
All routes are gated behind the curriculum-mapping GrowthBook feature flag.
| Method | Endpoint | Description |
|---|---|---|
| POST | /companies/curriculum/extract | Parse document + AI extract requirements |
| POST | /companies/curriculum/match-requirements | Match requirements to rooms |
| POST | /companies/curriculum/match | Full pipeline (parse + extract + match) |
| POST | /companies/curriculum/create-path | Create custom modules + learning path |
Extract Request
POST /companies/curriculum/extract?companyId=xxx&programLength=8&programLengthUnit=weeks&difficulty=intermediate
Content-Type: multipart/form-data
Body: file (binary)
Extract Response
{
requirements: [
{
id: "abc123",
topic: "Network Fundamentals",
skills: ["networking", "tcp", "udp", "wireshark"],
objective: "Understand TCP/IP stack and packet analysis",
difficulty: "easy",
timeEstimate: 120,
section: "Week 1"
},
// ...
],
pathTitle: "Cybersecurity Foundations",
cardDescription: "Build a strong foundation in...",
pageIntroduction: "This comprehensive path covers...",
metadata: {
format: "xlsx",
parsedItems: 45,
extractedRequirements: 22,
sections: ["Week 1", "Week 2", ...]
}
}
Match Response
{
estimatedCompletionTime: { minutes: 1440, hours: 24, label: "24 hours" },
pathDifficulty: { level: 2, label: "Medium" },
coverage: { total: 22, matched: 15, partial: 4, unmatched: 3, percentage: 77 },
modules: [
{
requirementId: "abc123",
title: "Network Fundamentals",
section: "Week 1",
rooms: [
{
roomId: "...",
code: "introtnetworking",
title: "Intro to Networking",
matchScore: 0.82,
matchReasons: ["Tag: networking, tcp", "Title: 78%"],
timeToComplete: 60,
difficultyLevel: 1
}
]
}
],
requirements: [
{ id: "abc123", topic: "Network Fundamentals", status: "matched", matchedRoomCodes: ["introtnetworking"] }
]
}
10. AI Integration Deep Dive
Where AI Is Used
AI vs Algorithmic Split
The architecture deliberately separates AI extraction from algorithmic matching:
- AI handles the ambiguous, creative work: understanding what a curriculum document is trying to teach, extracting meaningful learning objectives from noisy data, generating human-readable descriptions
- Algorithms handle the deterministic, repeatable work: scoring rooms against requirements, deduplicating, grouping, and creating database entities
This split means:
- The matching step is fast, cheap, and deterministic -- no AI costs per match
- The extraction step is only called once per document upload
- Requirements can be re-matched with different difficulty settings without re-extracting (separate endpoints)
Token Optimization
For large documents (>150 items), the system uses smart sampling to stay within token limits:
selectedItems = items[0..50] + items[middle-25..middle+25] + items[-50..]
This captures the beginning, middle, and end of the curriculum, giving the AI representative coverage without exceeding context limits.
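The slice notation above translates into something like the following sketch. The sampleItems name and the generic signature are assumptions; the 50/50/50 split and the 150-item cap follow the text.

```typescript
/**
 * Keeps the first third, a window centred on the middle, and the last third
 * of the budget. With the default budget of 150 that is 50 + 50 + 50 items.
 */
function sampleItems<T>(items: ReadonlyArray<T>, maxItems = 150): T[] {
  if (items.length <= maxItems) return [...items]; // small enough to send whole
  const edge = Math.floor(maxItems / 3); // 50 for the default budget
  const half = Math.floor(edge / 2); // 25 on each side of the midpoint
  const middle = Math.floor(items.length / 2);
  return [
    ...items.slice(0, edge),
    ...items.slice(middle - half, middle + half),
    ...items.slice(items.length - edge),
  ];
}
```

For a 300-item curriculum this selects indices 0 to 49, 125 to 174, and 250 to 299; the three windows never overlap once the input exceeds the 150-item budget.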
11. Directory Structure
Backend
apps/api/app/api/v2/src/
├── routes/companies/curriculum.ts # Route definitions (feature-flagged)
├── controllers/companies/curriculum.ts # Request handlers
├── middlewares/companies/curriculum.ts # Multer file upload middleware
├── services/companies/curriculum/
│ ├── index.ts # Orchestrator (matchCurriculum, extractCurriculumRequirements, createCurriculumPath)
│ ├── parser.ts # Document parsing (XLSX, PDF, DOCX, TXT)
│ ├── extractor.ts # AI requirement extraction (GPT-4o-mini)
│ └── matcher.ts # Room matching algorithm
└── common/interfaces/company/
└── curriculum.ts # TypeScript interfaces, MATCH_WEIGHTS, MATCH_THRESHOLDS
Frontend
apps/frontend/src/features/management-dashboard/curriculum-mapping/
├── curriculum-mapping.tsx # Main 4-step wizard component
├── curriculum-mapping.slice.ts # RTK Query mutations (extract, match, create)
├── curriculum-mapping.types.ts # TypeScript interfaces
├── curriculum-mapping.constants.ts # Step labels, accepted file types
├── curriculum-mapping.styles.ts # Styled components
├── curriculum-rich-text.ts # Rich text → plain text conversion
├── steps/
│ ├── upload-step/ # File upload + paste + settings
│ ├── requirements-step/ # Requirements review table
│ ├── content-step/ # Matched modules + rooms
│ └── personalise-step/ # Title/description/intro editor
└── components/
├── loading/ # Loading spinner + message
├── path-created/ # Success animation
└── path-preview/ # Final path preview
12. Feature Flag
The entire feature is gated behind:
curriculumRouter.use(requireGrowthBookFeatureFlag('curriculum-mapping'));
When curriculum-mapping is off, all /companies/curriculum/* routes return 403.
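For readers unfamiliar with this pattern, a flag-gating middleware might look roughly like the sketch below. It is not the implementation of requireGrowthBookFeatureFlag: the local Res, Next, and Middleware types are structural stand-ins for Express's types, isFeatureOn is a hypothetical lookup injected by the caller, and the error body is invented for the example.

```typescript
// Minimal structural stand-ins for Express types, declared locally so the
// sketch has no dependencies.
interface Res {
  status(code: number): Res;
  json(body: unknown): Res;
}
type Next = () => void;
type Middleware = (req: unknown, res: Res, next: Next) => void;

/**
 * Returns a middleware that responds with 403 unless the flag is on.
 * `isFeatureOn` is a hypothetical lookup; the real middleware presumably
 * consults GrowthBook.
 */
function requireFeatureFlag(flag: string, isFeatureOn: (flag: string) => boolean): Middleware {
  return (_req, res, next) => {
    if (!isFeatureOn(flag)) {
      res.status(403).json({ error: `Feature "${flag}" is not enabled` });
      return; // stop here so the route handler never runs
    }
    next();
  };
}
```

Mounting the middleware once on the router, as in the line above, gates every route beneath it without repeating the check in each handler.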