Curriculum Mapping

2022-01-25

Curriculum Mapping -- Technical Documentation

1. Overview

Curriculum Mapping is a B2B feature in TryHackMe's Management Dashboard that allows company administrators to upload their existing cybersecurity curriculum documents and have the system automatically:

Parse the document (XLSX, PDF, DOCX, TXT, CSV, Markdown)
Extract structured learning requirements using GPT-4o-mini
Match those requirements to TryHackMe's room catalogue using a multi-signal scoring algorithm
Create a custom learning path with auto-generated modules

AI drives the extraction step, and a deterministic scoring algorithm handles the matching. Together they transform unstructured educational documents into structured, actionable training paths mapped to TryHackMe content.

2. High-Level Architecture

3. The Pipeline: 4-Stage Architecture

4. Stage 1: Document Parsing (`parser.ts`)

Supported Formats

Format	Library	Extraction Method
`.xlsx`, `.xls`	SheetJS (XLSX)	Structured column parsing with header detection
`.pdf`	pdf-parse	Raw text extraction
`.docx`, `.doc`	mammoth	Raw text extraction
`.txt`, `.csv`, `.md`	Native	UTF-8 buffer read

XLSX Parsing (Structured Path)

XLSX files get the most sophisticated treatment:

Merged cell propagation -- fills merged cell ranges so every row has complete data
Header row detection -- scans first 10 rows, scores each by keyword density (title, code, room, topic, module, difficulty, time, week, etc.), picks the row with the highest match ratio
Column mapping -- identifies title, time, difficulty, description, and section columns by keyword matching
Row extraction -- iterates data rows, skipping noise rows, extracting structured ParsedCurriculumItem objects
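
The header row detection step can be sketched as follows. This is a minimal illustration, not the code in `parser.ts`: the keyword list contains only the terms named above, and the names `headerMatchRatio` and `detectHeaderRow`, the substring matching, and the tie-breaking rule are all assumptions.

```typescript
// Keywords named in the description above; the real list is longer ("etc.").
const HEADER_KEYWORDS = ['title', 'code', 'room', 'topic', 'module', 'difficulty', 'time', 'week'];

type Row = Array<string | number | null | undefined>;

// Fraction of non-empty cells in a row that contain at least one header keyword.
function headerMatchRatio(row: Row): number {
  const cells = row
    .map((cell) => String(cell == null ? '' : cell).trim().toLowerCase())
    .filter((cell) => cell.length > 0);
  if (cells.length === 0) return 0;
  const hits = cells.filter((cell) =>
    HEADER_KEYWORDS.some((keyword) => cell.indexOf(keyword) !== -1)
  ).length;
  return hits / cells.length;
}

// Scans the first `maxRows` rows and returns the index with the highest ratio,
// or -1 when no row contains a keyword. Ties keep the earliest row (assumption).
function detectHeaderRow(rows: Row[], maxRows: number = 10): number {
  let bestIndex = -1;
  let bestRatio = 0;
  const limit = Math.min(rows.length, maxRows);
  for (let i = 0; i < limit; i++) {
    const ratio = headerMatchRatio(rows[i]);
    if (ratio > bestRatio) {
      bestRatio = ratio;
      bestIndex = i;
    }
  }
  return bestIndex;
}

const sheet: Row[] = [
  ['Security Bootcamp 2024', null, null], // banner row, no keywords
  ['Week', 'Topic', 'Time (hours)'],      // every cell matches a keyword
  [1, 'Network Fundamentals', 2],
];
console.log(detectHeaderRow(sheet)); // 1
```

Scoring by ratio rather than raw hit count keeps a wide banner or notes row from outranking a short but genuine header row.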

Text-Based Document Classification

For text-based inputs (PDF, DOCX, TXT), the parser classifies each document into one of three tiers:

Tier	Heuristic	Parsing Strategy
Structured	High bullet/numbered ratio, short lines	Line-by-line extraction of titles, time, difficulty
Semi-structured	3+ headings (Module/Phase/Chapter patterns)	Heading-based hierarchy extraction
Unstructured	Long lines, few headings (prose-heavy)	Smart sampling (beginning + middle + end, 10K chars each) sent to AI
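
A rough sketch of how such a tier classifier could look. Only the "3+ headings" rule and the Module/Phase/Chapter patterns come from the table above; the numeric cut-offs for "high ratio" and "short lines", the order of the checks, and the name `classifyDocument` are assumptions made for illustration.

```typescript
type DocumentTier = 'structured' | 'semi-structured' | 'unstructured';

// Hypothetical cut-offs: the source only says "high ratio" and "short lines".
const MIN_LIST_RATIO = 0.5;
const MAX_AVG_LINE_LENGTH = 80;
// "3+ headings" is taken from the table above.
const MIN_HEADINGS = 3;

// Bulleted ("-", "*", bullet character) or numbered ("1.", "2)") lines.
const LIST_ITEM = /^\s*(?:[-*\u2022]|\d+[.)])\s+/;
// Lines that start with Module / Phase / Chapter.
const HEADING = /^\s*(?:module|phase|chapter)\b/i;

function classifyDocument(text: string): DocumentTier {
  const lines = text.split(/\r?\n/).filter((line) => line.trim().length > 0);
  if (lines.length === 0) return 'unstructured';

  const listLines = lines.filter((line) => LIST_ITEM.test(line)).length;
  const headingLines = lines.filter((line) => HEADING.test(line)).length;
  const avgLength = lines.reduce((sum, line) => sum + line.length, 0) / lines.length;

  if (listLines / lines.length >= MIN_LIST_RATIO && avgLength <= MAX_AVG_LINE_LENGTH) {
    return 'structured';
  }
  if (headingLines >= MIN_HEADINGS) return 'semi-structured';
  return 'unstructured';
}

console.log(classifyDocument('- Nmap basics (2h)\n- Wireshark intro\n- Phishing analysis'));
// structured
```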

5. Stage 2: AI Requirement Extraction (`extractor.ts`)

This is the core AI stage. All documents go through GPT-4o-mini extraction, regardless of the parsing tier.

AI Configuration

Parameter	Value
Model	`gpt-4o-mini`
Temperature	`0.2` (low, for focused and more repeatable output)
Response format	`json_object` (structured JSON)
Max items sent	150 (sampled: beginning + middle + end)
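
For illustration, the configuration above maps onto a Chat Completions request body roughly as follows. The field names (`model`, `temperature`, `response_format`, `messages`) are those of the OpenAI Chat Completions API; the function name `buildExtractionRequest`, the placeholder prompt wording, and the choice to throw when more than 150 items are passed are assumptions, not the behaviour of `extractor.ts`.

```typescript
interface ChatMessage {
  role: 'system' | 'user';
  content: string;
}

interface ExtractionRequest {
  model: string;
  temperature: number;
  response_format: { type: 'json_object' };
  messages: ChatMessage[];
}

const MAX_ITEMS_SENT = 150;

// Builds the request body only; sending it is left to the caller.
// The prompt text below is a placeholder, not the real extraction prompt.
function buildExtractionRequest(itemTitles: string[]): ExtractionRequest {
  if (itemTitles.length > MAX_ITEMS_SENT) {
    throw new Error(`Expected at most ${MAX_ITEMS_SENT} items; sample the list first.`);
  }
  return {
    model: 'gpt-4o-mini',
    temperature: 0.2,
    response_format: { type: 'json_object' },
    messages: [
      // JSON mode expects the word "JSON" to appear somewhere in the messages.
      { role: 'system', content: 'You are a cybersecurity curriculum analyst. Reply with a JSON object.' },
      { role: 'user', content: itemTitles.join('\n') },
    ],
  };
}

const body = buildExtractionRequest(['Network Fundamentals', 'Intro to Nmap']);
console.log(body.model, body.temperature); // gpt-4o-mini 0.2
```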

The Extraction Prompt

The prompt instructs GPT-4o-mini to act as a cybersecurity curriculum analyst.

Program Constraints

When the user provides program settings, the prompt is enriched:

Setting	Effect on Prompt
Program length (e.g., 8 weeks)	"Structure and scope topics to fit within this timeframe"
Difficulty = easy	"Prioritise foundational topics (networking basics, Linux fundamentals). Skip advanced exploitation."
Difficulty = intermediate	"Include both foundational and hands-on topics. Balance breadth with depth."
Difficulty = hard	"Prioritise complex and specialist topics. Assume foundational knowledge."
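
One way to express this enrichment is a lookup from settings to prompt lines. The sketch below reuses the quoted wording from the table and the setting names from the extract request's query string (`programLength`, `programLengthUnit`, `difficulty`); the function name `buildConstraintLines` and the "Program length: ..." prefix are assumptions.

```typescript
type Difficulty = 'easy' | 'intermediate' | 'hard';

interface ProgramSettings {
  programLength?: number;
  programLengthUnit?: string; // e.g. "weeks"
  difficulty?: Difficulty;
}

// Guidance text copied from the table above.
const DIFFICULTY_GUIDANCE: Record<Difficulty, string> = {
  easy: 'Prioritise foundational topics (networking basics, Linux fundamentals). Skip advanced exploitation.',
  intermediate: 'Include both foundational and hands-on topics. Balance breadth with depth.',
  hard: 'Prioritise complex and specialist topics. Assume foundational knowledge.',
};

// Returns the extra prompt lines for the settings that were actually provided.
function buildConstraintLines(settings: ProgramSettings): string[] {
  const lines: string[] = [];
  if (settings.programLength !== undefined && settings.programLengthUnit) {
    lines.push(
      `Program length: ${settings.programLength} ${settings.programLengthUnit}. ` +
        'Structure and scope topics to fit within this timeframe.'
    );
  }
  if (settings.difficulty) {
    lines.push(DIFFICULTY_GUIDANCE[settings.difficulty]);
  }
  return lines;
}

const constraintLines = buildConstraintLines({
  programLength: 8,
  programLengthUnit: 'weeks',
  difficulty: 'easy',
});
console.log(constraintLines.length); // 2
```

When no settings are supplied the function returns an empty array, so the base prompt is sent unchanged.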

Fallback Strategy

If the AI call fails, the system falls back to keyword-based extraction: it scans each parsed item's title/description for ~100 cybersecurity terms (firewall, nmap, wireshark, phishing, sql injection, etc.) and uses those as skill tags.
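
A minimal sketch of the keyword fallback, assuming simple case-insensitive substring matching. Only the five terms named above are listed, since the full list of ~100 terms is not reproduced in this document; `ParsedItem` is a simplified stand-in for `ParsedCurriculumItem`, and `extractSkillTags` is a hypothetical name.

```typescript
// A handful of the ~100 terms; the full list is not given in this document.
const SECURITY_TERMS = ['firewall', 'nmap', 'wireshark', 'phishing', 'sql injection'];

// Simplified stand-in for ParsedCurriculumItem (assumed shape).
interface ParsedItem {
  title: string;
  description?: string;
}

// Returns the terms found in the item's title or description, in list order.
function extractSkillTags(item: ParsedItem): string[] {
  const haystack = `${item.title} ${item.description || ''}`.toLowerCase();
  return SECURITY_TERMS.filter((term) => haystack.indexOf(term) !== -1);
}

const tags = extractSkillTags({
  title: 'Scanning with Nmap',
  description: 'Enumerate hosts behind a firewall',
});
console.log(tags); // [ 'firewall', 'nmap' ]
```

Substring matching is cheap but can over-match, for example a term embedded inside a longer word; word-boundary matching would be a stricter alternative.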

6. Stage 3: Content Matching (`matcher.ts`)

Once requirements are extracted, the matcher scores every TryHackMe room against every requirement using a weighted multi-signal scoring algorithm.

Scoring Formula

totalScore = (tagMatch × 0.50) + (titleSimilarity × 0.35) + (descriptionSimilarity × 0.15)
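
As a worked example of the formula, the sketch below computes the weighted sum for one room. The weights are the ones stated above; the shape of the `MATCH_WEIGHTS` object, the `MatchSignals` interface, and the assumption that each signal is normalised to the range 0 to 1 are illustrative.

```typescript
// Weights from the formula above; the object shape is assumed.
const MATCH_WEIGHTS = { tag: 0.5, title: 0.35, description: 0.15 };

// Each signal is assumed to be normalised to the range 0..1.
interface MatchSignals {
  tagMatch: number;
  titleSimilarity: number;
  descriptionSimilarity: number;
}

function totalScore(signals: MatchSignals): number {
  return (
    signals.tagMatch * MATCH_WEIGHTS.tag +
    signals.titleSimilarity * MATCH_WEIGHTS.title +
    signals.descriptionSimilarity * MATCH_WEIGHTS.description
  );
}

// 1.00 * 0.50 + 0.78 * 0.35 + 0.30 * 0.15 = 0.50 + 0.273 + 0.045 = 0.818
const exampleScore = totalScore({ tagMatch: 1, titleSimilarity: 0.78, descriptionSimilarity: 0.3 });
console.log(exampleScore.toFixed(2)); // "0.82"
```

Because the weights sum to 1.0, the total stays in the range 0 to 1 as long as each input signal does.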

Match Thresholds

Status	Score Threshold	Meaning
Matched	>= 0.60	Strong content match
Partial	>= 0.30 and < 0.60	Some relevant content found
Unmatched	< 0.30	No suitable TryHackMe room
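
The thresholds translate directly into a small classifier. The cut-off values come from the table above; the shape of `MATCH_THRESHOLDS` and the name `statusForScore` are assumptions, as is applying it to a requirement's best room score.

```typescript
type MatchStatus = 'matched' | 'partial' | 'unmatched';

// Cut-offs from the table above; the object shape is assumed.
const MATCH_THRESHOLDS = { matched: 0.6, partial: 0.3 };

// Checks the higher threshold first so each score lands in exactly one band.
function statusForScore(score: number): MatchStatus {
  if (score >= MATCH_THRESHOLDS.matched) return 'matched';
  if (score >= MATCH_THRESHOLDS.partial) return 'partial';
  return 'unmatched';
}

console.log(statusForScore(0.82)); // matched
console.log(statusForScore(0.45)); // partial
console.log(statusForScore(0.1));  // unmatched
```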

Coverage Calculation

coverage% = ((matched + partial × 0.5) / total) × 100
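
A short worked example of the coverage formula, using the counts that appear in the sample match response in this document (15 matched, 4 partial, 3 unmatched). Rounding to the nearest integer and returning 0 for an empty requirement list are assumptions; `computeCoverage` is a hypothetical name.

```typescript
interface Coverage {
  total: number;
  matched: number;
  partial: number;
  unmatched: number;
  percentage: number;
}

// Partial matches count for half a requirement, as in the formula above.
function computeCoverage(matched: number, partial: number, unmatched: number): Coverage {
  const total = matched + partial + unmatched;
  const raw = total === 0 ? 0 : ((matched + partial * 0.5) / total) * 100;
  return { total, matched, partial, unmatched, percentage: Math.round(raw) };
}

// (15 + 4 * 0.5) / 22 * 100 = 77.27..., which rounds to 77
console.log(computeCoverage(15, 4, 3).percentage); // 77
```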

Deduplication Strategy

Each room is assigned to only one requirement -- the one where it scored highest. This prevents the same room from appearing in multiple modules.
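
The deduplication rule can be sketched as a single pass that keeps the best-scoring requirement per room. The `ScoredPair` shape, the name `assignRoomsOnce`, and the tie-breaking rule (the first pair seen wins) are assumptions for illustration.

```typescript
interface ScoredPair {
  roomCode: string;
  requirementId: string;
  score: number;
}

// Keeps, for each room, only the pair with the highest score.
// On equal scores the first pair seen wins (assumption).
function assignRoomsOnce(pairs: ScoredPair[]): ScoredPair[] {
  // Prototype-free object so room codes can never collide with built-in keys.
  const bestByRoom: Record<string, ScoredPair> = Object.create(null);
  for (const pair of pairs) {
    const current: ScoredPair | undefined = bestByRoom[pair.roomCode];
    if (current === undefined || pair.score > current.score) {
      bestByRoom[pair.roomCode] = pair;
    }
  }
  return Object.keys(bestByRoom).map((roomCode) => bestByRoom[roomCode]);
}

const assigned = assignRoomsOnce([
  { roomCode: 'room-a', requirementId: 'req-1', score: 0.8 },
  { roomCode: 'room-a', requirementId: 'req-2', score: 0.5 },
  { roomCode: 'room-b', requirementId: 'req-2', score: 0.4 },
]);
console.log(assigned.map((p) => `${p.roomCode}:${p.requirementId}`));
// [ 'room-a:req-1', 'room-b:req-2' ]
```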

Room Pool

The matcher fetches all public TryHackMe rooms of type CHALLENGE or WALKTHROUGH together with their tag entities. This forms the candidate pool, which is on the order of thousands of rooms.

7. Stage 4: Path Creation (`index.ts`)

When the user confirms the matched content and personalises the path, the system creates the custom modules and the learning path. The created path has:

source: PathSource.CURRICULUM -- identifies it as AI-generated from curriculum
contentType: PathContentType.MODULES -- uses custom module structure
public: false -- company-private by default

8. Frontend: 4-Step Wizard

Step Details

Step	Component	API Call	What Happens
0. Upload	`UploadStep`	--	User uploads file or pastes text, configures settings
Loading	`LoadingStep`	`POST /extract`	Document is parsed, then AI extracts requirements
1. Requirements	`RequirementsStep`	--	User reviews extracted topics, skills, sections
Loading	`LoadingStep`	`POST /match-requirements`	Scoring algorithm runs against room catalogue
2. Content	`ContentStep`	--	User reviews matched modules, removes unwanted rooms
3. Personalise	`PersonaliseStep`	--	User edits AI-suggested title, description, intro
Creating	`LoadingStep`	`POST /create-path`	Custom modules + learning path created
Done	`PathCreatedStep` → `PathPreviewStep`	--	Preview + assign path

9. API Endpoints

All routes are gated behind the curriculum-mapping GrowthBook feature flag.

Method	Endpoint	Description
`POST`	`/companies/curriculum/extract`	Parse document + AI extract requirements
`POST`	`/companies/curriculum/match-requirements`	Match requirements to rooms
`POST`	`/companies/curriculum/match`	Full pipeline (parse + extract + match)
`POST`	`/companies/curriculum/create-path`	Create custom modules + learning path

Extract Request

POST /companies/curriculum/extract?companyId=xxx&programLength=8&programLengthUnit=weeks&difficulty=intermediate
Content-Type: multipart/form-data
Body: file (binary)
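
A small client-side sketch of assembling this request URL. The query parameter names and the endpoint path are taken from the request shown above; the base URL, the `ExtractQuery` interface, and the helper name `buildExtractUrl` are hypothetical. The document itself would then be sent as the `file` field of a multipart/form-data body.

```typescript
interface ExtractQuery {
  companyId: string;
  programLength?: number;
  programLengthUnit?: string;
  difficulty?: 'easy' | 'intermediate' | 'hard';
}

// Builds the extract URL, skipping optional settings that were not provided.
function buildExtractUrl(baseUrl: string, query: ExtractQuery): string {
  const pairs: Array<[string, string | number | undefined]> = [
    ['companyId', query.companyId],
    ['programLength', query.programLength],
    ['programLengthUnit', query.programLengthUnit],
    ['difficulty', query.difficulty],
  ];
  const queryString = pairs
    .filter((pair) => pair[1] !== undefined)
    .map((pair) => `${pair[0]}=${encodeURIComponent(String(pair[1]))}`)
    .join('&');
  return `${baseUrl}/companies/curriculum/extract?${queryString}`;
}

// "https://api.example.test" is a placeholder host, not a real endpoint.
const extractUrl = buildExtractUrl('https://api.example.test', {
  companyId: 'xxx',
  programLength: 8,
  programLengthUnit: 'weeks',
  difficulty: 'intermediate',
});
console.log(extractUrl);
// https://api.example.test/companies/curriculum/extract?companyId=xxx&programLength=8&programLengthUnit=weeks&difficulty=intermediate
```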

Extract Response

{
  requirements: [
    {
      id: "abc123",
      topic: "Network Fundamentals",
      skills: ["networking", "tcp", "udp", "wireshark"],
      objective: "Understand TCP/IP stack and packet analysis",
      difficulty: "easy",
      timeEstimate: 120,
      section: "Week 1"
    },
    // ...
  ],
  pathTitle: "Cybersecurity Foundations",
  cardDescription: "Build a strong foundation in...",
  pageIntroduction: "This comprehensive path covers...",
  metadata: {
    format: "xlsx",
    parsedItems: 45,
    extractedRequirements: 22,
    sections: ["Week 1", "Week 2", ...]
  }
}

Match Response

{
  estimatedCompletionTime: { minutes: 1440, hours: 24, label: "24 hours" },
  pathDifficulty: { level: 2, label: "Medium" },
  coverage: { total: 22, matched: 15, partial: 4, unmatched: 3, percentage: 77 },
  modules: [
    {
      requirementId: "abc123",
      title: "Network Fundamentals",
      section: "Week 1",
      rooms: [
        {
          roomId: "...",
          code: "introtnetworking",
          title: "Intro to Networking",
          matchScore: 0.82,
          matchReasons: ["Tag: networking, tcp", "Title: 78%"],
          timeToComplete: 60,
          difficultyLevel: 1
        }
      ]
    }
  ],
  requirements: [
    { id: "abc123", topic: "Network Fundamentals", status: "matched", matchedRoomCodes: ["introtnetworking"] }
  ]
}

10. AI Integration Deep Dive

Where AI Is Used

AI vs Algorithmic Split

The architecture deliberately separates AI extraction from algorithmic matching:

AI handles the ambiguous, creative work: understanding what a curriculum document is trying to teach, extracting meaningful learning objectives from noisy data, generating human-readable descriptions
Algorithms handle the deterministic, repeatable work: scoring rooms against requirements, deduplicating, grouping, and creating database entities

This split means:

The matching step is fast, cheap, and deterministic -- no AI costs per match
The extraction step is only called once per document upload
Requirements can be re-matched with different difficulty settings without re-extracting, because extraction and matching are exposed as separate endpoints

Token Optimization

For large documents (>150 items), the system uses smart sampling to stay within token limits:

selectedItems = items[0..50] + items[middle-25..middle+25] + items[-50..]

This captures the beginning, middle, and end of the curriculum, giving the AI representative coverage without exceeding context limits.
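
The sampling rule above can be written out as follows. The slice sizes (50 + 50 + 50) follow the pseudocode; the function name `sampleItems`, the constant names, and the choice to return a copy of the input when it already fits within the limit are assumptions.

```typescript
const MAX_ITEMS = 150;
const EDGE_SIZE = 50;   // items taken from the start and from the end
const HALF_MIDDLE = 25; // items taken on each side of the midpoint

// Returns at most 150 items: first 50, 50 around the midpoint, last 50.
function sampleItems<T>(items: T[]): T[] {
  if (items.length <= MAX_ITEMS) return items.slice();
  const middle = Math.floor(items.length / 2);
  return [
    ...items.slice(0, EDGE_SIZE),
    ...items.slice(middle - HALF_MIDDLE, middle + HALF_MIDDLE),
    ...items.slice(items.length - EDGE_SIZE),
  ];
}

const allItems: number[] = [];
for (let i = 0; i < 1000; i++) allItems.push(i);

const sampled = sampleItems(allItems);
console.log(sampled.length);                      // 150
console.log(sampled[0], sampled[50], sampled[100]); // 0 475 950
```

For any input longer than 150 items the three slices do not overlap, so the result always contains exactly 150 distinct positions.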

11. Directory Structure

Backend

apps/api/app/api/v2/src/
├── routes/companies/curriculum.ts       # Route definitions (feature-flagged)
├── controllers/companies/curriculum.ts  # Request handlers
├── middlewares/companies/curriculum.ts   # Multer file upload middleware
├── services/companies/curriculum/
│   ├── index.ts                         # Orchestrator (matchCurriculum, extractCurriculumRequirements, createCurriculumPath)
│   ├── parser.ts                        # Document parsing (XLSX, PDF, DOCX, TXT)
│   ├── extractor.ts                     # AI requirement extraction (GPT-4o-mini)
│   └── matcher.ts                       # Room matching algorithm
└── common/interfaces/company/
    └── curriculum.ts                    # TypeScript interfaces, MATCH_WEIGHTS, MATCH_THRESHOLDS

Frontend

apps/frontend/src/features/management-dashboard/curriculum-mapping/
├── curriculum-mapping.tsx               # Main 4-step wizard component
├── curriculum-mapping.slice.ts          # RTK Query mutations (extract, match, create)
├── curriculum-mapping.types.ts          # TypeScript interfaces
├── curriculum-mapping.constants.ts      # Step labels, accepted file types
├── curriculum-mapping.styles.ts         # Styled components
├── curriculum-rich-text.ts             # Rich text → plain text conversion
├── steps/
│   ├── upload-step/                     # File upload + paste + settings
│   ├── requirements-step/              # Requirements review table
│   ├── content-step/                   # Matched modules + rooms
│   └── personalise-step/              # Title/description/intro editor
└── components/
    ├── loading/                         # Loading spinner + message
    ├── path-created/                   # Success animation
    └── path-preview/                   # Final path preview

12. Feature Flag

The entire feature is gated behind:

curriculumRouter.use(requireGrowthBookFeatureFlag('curriculum-mapping'));

When curriculum-mapping is off, all /companies/curriculum/* routes return 403.
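
The internals of `requireGrowthBookFeatureFlag` are not shown in this document. As a rough, framework-free sketch of the gating behaviour, the example below uses a hypothetical `isOn` callback in place of the GrowthBook SDK and a minimal response interface in place of the real framework types; the error body is also an assumption.

```typescript
// Minimal stand-ins for the framework's response and next() types (assumed).
interface MinimalResponse {
  status(code: number): MinimalResponse;
  json(body: unknown): void;
}
type Next = () => void;

// Returns a middleware that lets the request through only when the flag is on.
// `isOn` is a hypothetical flag lookup, not a GrowthBook SDK call.
function requireFeatureFlag(flagName: string, isOn: (flag: string) => boolean) {
  return (_req: unknown, res: MinimalResponse, next: Next): void => {
    if (isOn(flagName)) {
      next();
      return;
    }
    res.status(403).json({ error: `Feature '${flagName}' is disabled` });
  };
}

// Usage with a fake response object that records the status code.
let recordedStatus = 0;
let nextCalls = 0;
const fakeRes: MinimalResponse = {
  status(code) {
    recordedStatus = code;
    return fakeRes;
  },
  json() {},
};

const guard = requireFeatureFlag('curriculum-mapping', () => false);
guard({}, fakeRes, () => { nextCalls += 1; });
console.log(recordedStatus, nextCalls); // 403 0
```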