Curriculum Mapping

2022-01-25

Curriculum Mapping -- Technical Documentation

1. Overview

Curriculum Mapping is a B2B feature in TryHackMe's Management Dashboard that allows company administrators to upload their existing cybersecurity curriculum documents and have the system automatically:

  1. Parse the document (XLSX, PDF, DOCX, TXT, CSV, Markdown)
  2. Extract structured learning requirements using GPT-4o-mini
  3. Match those requirements to TryHackMe's room catalogue using a multi-signal scoring algorithm
  4. Create a custom learning path with auto-generated modules

AI drives the extraction step: the system turns unstructured educational documents into structured requirements, then matches them algorithmically to TryHackMe content to produce actionable training paths.


2. High-Level Architecture


3. The Pipeline: 4-Stage Architecture


4. Stage 1: Document Parsing (parser.ts)

Supported Formats

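Format          | Library        | Extraction Method
----------------|----------------|------------------------------------------------
.xlsx, .xls     | SheetJS (XLSX) | Structured column parsing with header detection
.pdf            | pdf-parse      | Raw text extraction
.docx, .doc     | mammoth        | Raw text extraction
.txt, .csv, .md | Native         | UTF-8 buffer read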

XLSX Parsing (Structured Path)

XLSX files get the most sophisticated treatment:

  1. Merged cell propagation -- fills merged cell ranges so every row has complete data
  2. Header row detection -- scans first 10 rows, scores each by keyword density (title, code, room, topic, module, difficulty, time, week, etc.), picks the row with the highest match ratio
  3. Column mapping -- identifies title, time, difficulty, description, and section columns by keyword matching
  4. Row extraction -- iterates data rows, skipping noise rows, extracting structured ParsedCurriculumItem objects
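
Header row detection (step 2 above) can be sketched roughly as follows. This is an illustration only, not the code in parser.ts: the keyword list is limited to the examples named above, "match ratio" is assumed to mean the fraction of non-empty cells containing a keyword, and the names headerMatchRatio and detectHeaderRow are hypothetical.

```typescript
// Hypothetical keyword list: the source names these examples and ends with "etc."
const HEADER_KEYWORDS = ["title", "code", "room", "topic", "module", "difficulty", "time", "week"];

type SheetRow = Array<string | number | null | undefined>;

// Assumed definition of "match ratio": the fraction of non-empty cells
// in a row that contain at least one header keyword.
function headerMatchRatio(row: SheetRow): number {
  const cells = row
    .map((cell) => String(cell ?? "").trim().toLowerCase())
    .filter((cell) => cell.length > 0);
  if (cells.length === 0) return 0;
  const hits = cells.filter((cell) => HEADER_KEYWORDS.some((kw) => cell.includes(kw))).length;
  return hits / cells.length;
}

// Scan the first `maxRows` rows and return the index of the best-scoring row,
// or -1 when no row contains any keyword.
function detectHeaderRow(rows: SheetRow[], maxRows = 10): number {
  let bestIndex = -1;
  let bestRatio = 0;
  rows.slice(0, maxRows).forEach((row, index) => {
    const ratio = headerMatchRatio(row);
    if (ratio > bestRatio) {
      bestRatio = ratio;
      bestIndex = index;
    }
  });
  return bestIndex;
}

const sampleRows: SheetRow[] = [
  ["Cyber Bootcamp 2022", null, null],
  ["Week", "Topic", "Difficulty"],
  [1, "Network Fundamentals", "easy"],
];
const headerIndex = detectHeaderRow(sampleRows);
```

In this sample the second row wins because all three of its cells contain a keyword, while the banner row and the data row contain none.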

Text-Based Document Classification

For PDF/DOCX/TXT, the parser classifies the document into one of three tiers:

Tier            | Heuristic                                    | Parsing Strategy
----------------|----------------------------------------------|---------------------------------------------------------------------
Structured      | High bullet/numbered ratio, short lines      | Line-by-line extraction of titles, time, difficulty
Semi-structured | 3+ headings (Module/Phase/Chapter patterns)  | Heading-based hierarchy extraction
Unstructured    | Long lines, few headings (prose-heavy)       | Smart sampling (beginning + middle + end, 10K chars each) sent to AI
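
A minimal sketch of how such a three-tier classifier could look. Only the "3+ headings" rule comes from the table; the bullet-ratio and line-length cut-offs, the regular expressions, and the function name classifyDocument are assumptions made for illustration.

```typescript
type DocumentTier = "structured" | "semi-structured" | "unstructured";

// Assumed patterns and cut-offs; only the "3+ headings" rule comes from the table.
const BULLET_OR_NUMBERED = /^(?:[-*•]|\d+[.)])\s+/;
const HEADING = /^(?:module|phase|chapter)\b/i;
const MIN_BULLET_RATIO = 0.5; // hypothetical
const MAX_AVG_LINE_LENGTH = 80; // hypothetical

function classifyDocument(text: string): DocumentTier {
  const lines = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (lines.length === 0) return "unstructured";

  const bulletRatio = lines.filter((line) => BULLET_OR_NUMBERED.test(line)).length / lines.length;
  const avgLength = lines.reduce((sum, line) => sum + line.length, 0) / lines.length;
  const headingCount = lines.filter((line) => HEADING.test(line)).length;

  // Structured: mostly bullets or numbered items, and short lines.
  if (bulletRatio >= MIN_BULLET_RATIO && avgLength <= MAX_AVG_LINE_LENGTH) return "structured";
  // Semi-structured: at least three Module/Phase/Chapter headings.
  if (headingCount >= 3) return "semi-structured";
  // Everything else is treated as prose.
  return "unstructured";
}

const bulletTier = classifyDocument("- Nmap basics\n- Wireshark intro\n1. Linux fundamentals");
```

The order of the checks matters in this sketch: a bulleted document that also has headings is treated as structured, since line-by-line extraction keeps the most detail.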

5. Stage 2: AI Requirement Extraction (extractor.ts)

This is the core AI stage. All documents go through GPT-4o-mini extraction, regardless of the parsing tier.

AI Configuration

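Parameter       | Value
----------------|-----------------------------------------------------
Model           | gpt-4o-mini
Temperature     | 0.2 (low, for more deterministic and focused output)
Response format | json_object (structured JSON)
Max items sent  | 150 (sampled: beginning + middle + end)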
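
As a rough illustration, the configuration above maps onto a Chat Completions request body like the one below. The field names follow the OpenAI Chat Completions API; the function name buildExtractionRequest, the way items are joined into the user message, and the placeholder system prompt are assumptions, not the code in extractor.ts.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface ExtractionRequest {
  model: string;
  temperature: number;
  response_format: { type: "json_object" };
  messages: ChatMessage[];
}

// systemPrompt is passed in because the real prompt text is not reproduced here.
// sampledItems is expected to already be capped at 150 entries.
function buildExtractionRequest(systemPrompt: string, sampledItems: string[]): ExtractionRequest {
  return {
    model: "gpt-4o-mini",
    temperature: 0.2,
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: sampledItems.join("\n") },
    ],
  };
}

const extractionRequest = buildExtractionRequest(
  "You are a cybersecurity curriculum analyst. Respond in JSON.", // placeholder prompt
  ["Week 1: Network Fundamentals", "Week 2: Linux Basics"]
);
```

Note that OpenAI's JSON mode expects the word "JSON" to appear somewhere in the messages, which is why the placeholder prompt mentions it.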

The Extraction Prompt

The prompt instructs GPT-4o-mini to act as a cybersecurity curriculum analyst:

Program Constraints

When the user provides program settings, the prompt is enriched:

Setting                        | Effect on Prompt
-------------------------------|----------------------------------------------------------------------------------------------------
Program length (e.g., 8 weeks) | "Structure and scope topics to fit within this timeframe"
Difficulty = easy              | "Prioritise foundational topics (networking basics, Linux fundamentals). Skip advanced exploitation."
Difficulty = intermediate      | "Include both foundational and hands-on topics. Balance breadth with depth."
Difficulty = hard              | "Prioritise complex and specialist topics. Assume foundational knowledge."
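
One way to picture the enrichment is a small helper that turns the settings into extra prompt lines. The quoted guidance strings are taken from the table; the helper name buildConstraintLines, the "Program length: ..." prefix, and the settings shape (modelled on the query parameters of the extract endpoint) are assumptions.

```typescript
type Difficulty = "easy" | "intermediate" | "hard";

interface ProgramSettings {
  programLength?: number;
  programLengthUnit?: string;
  difficulty?: Difficulty;
}

// Guidance strings as listed in the table above.
const DIFFICULTY_GUIDANCE: Record<Difficulty, string> = {
  easy: "Prioritise foundational topics (networking basics, Linux fundamentals). Skip advanced exploitation.",
  intermediate: "Include both foundational and hands-on topics. Balance breadth with depth.",
  hard: "Prioritise complex and specialist topics. Assume foundational knowledge.",
};

// Returns the extra lines to append to the base prompt; empty when no settings are given.
function buildConstraintLines(settings: ProgramSettings): string[] {
  const lines: string[] = [];
  if (settings.programLength && settings.programLengthUnit) {
    // The "Program length:" prefix is a hypothetical phrasing.
    lines.push(
      `Program length: ${settings.programLength} ${settings.programLengthUnit}. ` +
        "Structure and scope topics to fit within this timeframe."
    );
  }
  if (settings.difficulty) {
    lines.push(DIFFICULTY_GUIDANCE[settings.difficulty]);
  }
  return lines;
}

const constraintLines = buildConstraintLines({
  programLength: 8,
  programLengthUnit: "weeks",
  difficulty: "easy",
});
```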

Fallback Strategy

If the AI call fails, the system falls back to keyword-based extraction: it scans each parsed item's title/description for ~100 cybersecurity terms (firewall, nmap, wireshark, phishing, sql injection, etc.) and uses those as skill tags.
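
The keyword fallback can be sketched as a simple substring scan. The term list below is a five-item subset taken from the examples above (the real list has roughly 100 entries), and the type and function names are hypothetical.

```typescript
// Illustrative subset; the source describes roughly 100 terms.
const SECURITY_TERMS = ["firewall", "nmap", "wireshark", "phishing", "sql injection"];

interface ParsedItemLike {
  title: string;
  description?: string;
}

// Returns every known term that appears in the item's title or description.
// Matching is case-insensitive and substring-based in this sketch.
function extractSkillTags(item: ParsedItemLike): string[] {
  const haystack = `${item.title} ${item.description ?? ""}`.toLowerCase();
  return SECURITY_TERMS.filter((term) => haystack.includes(term));
}

const fallbackTags = extractSkillTags({
  title: "Scanning with Nmap",
  description: "Capture the resulting traffic in Wireshark",
});
// fallbackTags is ["nmap", "wireshark"]
```

Substring matching is cheap but can over-match short terms inside longer words, so a production list would likely need word-boundary checks.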


6. Stage 3: Content Matching (matcher.ts)

Once requirements are extracted, the matcher scores every TryHackMe room against every requirement using a weighted multi-signal scoring algorithm.

Scoring Formula

totalScore = (tagMatch × 0.50) + (titleSimilarity × 0.35) + (descriptionSimilarity × 0.15)
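
Written as code, the formula is a plain weighted sum. The sketch below assumes each signal is already normalised to the range 0 to 1; the shape of the MATCH_WEIGHTS object is a guess, only its name appears in the interfaces file.

```typescript
// Weights from the formula above; the object shape is an assumption.
const MATCH_WEIGHTS = { tag: 0.5, title: 0.35, description: 0.15 } as const;

interface MatchSignals {
  tagMatch: number; // assumed to be in [0, 1]
  titleSimilarity: number; // assumed to be in [0, 1]
  descriptionSimilarity: number; // assumed to be in [0, 1]
}

function totalScore(signals: MatchSignals): number {
  return (
    signals.tagMatch * MATCH_WEIGHTS.tag +
    signals.titleSimilarity * MATCH_WEIGHTS.title +
    signals.descriptionSimilarity * MATCH_WEIGHTS.description
  );
}

// 1.0 * 0.50 + 0.78 * 0.35 + 0.40 * 0.15 = 0.833
const exampleScore = totalScore({ tagMatch: 1, titleSimilarity: 0.78, descriptionSimilarity: 0.4 });
```

Because the weights sum to 1, a room that is perfect on every signal scores exactly 1, and tag overlap alone can contribute at most 0.50.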

Match Thresholds

Status    | Score Threshold | Meaning
----------|-----------------|-----------------------------
Matched   | >= 0.60         | Strong content match
Partial   | >= 0.30         | Some relevant content found
Unmatched | < 0.30          | No suitable TryHackMe room
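
The thresholds translate into a two-step comparison, checked from the highest band down. The shape of MATCH_THRESHOLDS and the function name classifyScore are assumptions for illustration.

```typescript
// Thresholds from the table above; the object shape is an assumption.
const MATCH_THRESHOLDS = { matched: 0.6, partial: 0.3 } as const;

type MatchStatus = "matched" | "partial" | "unmatched";

function classifyScore(score: number): MatchStatus {
  if (score >= MATCH_THRESHOLDS.matched) return "matched";
  if (score >= MATCH_THRESHOLDS.partial) return "partial";
  return "unmatched";
}

const exampleStatus = classifyScore(0.82);
```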

Coverage Calculation

coverage% = ((matched + partial × 0.5) / total) × 100
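
A short sketch of the coverage calculation, in which partial matches count for half. Rounding to the nearest integer and returning 0 for an empty requirement list are assumptions; the counts in the example are the ones shown in the sample Match Response.

```typescript
interface CoverageCounts {
  total: number;
  matched: number;
  partial: number;
}

function coveragePercentage({ total, matched, partial }: CoverageCounts): number {
  if (total === 0) return 0; // assumed behaviour for an empty requirement list
  return Math.round(((matched + partial * 0.5) / total) * 100);
}

// (15 + 4 * 0.5) / 22 = 0.7727..., which rounds to 77
const exampleCoverage = coveragePercentage({ total: 22, matched: 15, partial: 4 });
```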

Deduplication Strategy

Each room is assigned to only one requirement -- the one where it scored highest. This prevents the same room from appearing in multiple modules.
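
The deduplication rule can be sketched as keeping, for each room, only its highest-scoring candidate match. The ScoredMatch shape, the function name, and the room codes in the example are hypothetical; ties are resolved in favour of the first match seen, which is also an assumption.

```typescript
interface ScoredMatch {
  requirementId: string;
  roomCode: string;
  score: number;
}

// Keeps one match per room: the requirement where that room scored highest.
function dedupeRooms(matches: ScoredMatch[]): ScoredMatch[] {
  const bestByRoom = new Map<string, ScoredMatch>();
  for (const match of matches) {
    const current = bestByRoom.get(match.roomCode);
    if (!current || match.score > current.score) {
      bestByRoom.set(match.roomCode, match);
    }
  }
  return Array.from(bestByRoom.values());
}

const dedupedMatches = dedupeRooms([
  { requirementId: "req-1", roomCode: "room-a", score: 0.8 },
  { requirementId: "req-2", roomCode: "room-a", score: 0.6 },
  { requirementId: "req-2", roomCode: "room-b", score: 0.7 },
]);
```

Here room-a stays with req-1 only, so req-2 is left with room-b.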

Room Pool

The matcher fetches all public TryHackMe rooms of type CHALLENGE or WALKTHROUGH, together with their tag entities. These rooms form the candidate pool (on the order of thousands of rooms).


7. Stage 4: Path Creation (index.ts)

When the user confirms the matched content and personalises the path:

The created path has:


8. Frontend: 4-Step Wizard

Step Details

Step            | Component                        | API Call                 | What Happens
----------------|----------------------------------|--------------------------|-------------------------------------------------------
0. Upload       | UploadStep                       | --                       | User uploads file or pastes text, configures settings
Loading         | LoadingStep                      | POST /extract            | AI parses + extracts requirements
1. Requirements | RequirementsStep                 | --                       | User reviews extracted topics, skills, sections
Loading         | LoadingStep                      | POST /match-requirements | Scoring algorithm runs against room catalogue
2. Content      | ContentStep                      | --                       | User reviews matched modules, removes unwanted rooms
3. Personalise  | PersonaliseStep                  | --                       | User edits AI-suggested title, description, intro
Creating        | LoadingStep                      | POST /create-path        | Custom modules + learning path created
Done            | PathCreatedStep, PathPreviewStep | --                       | Preview + assign path

9. API Endpoints

All routes are gated behind the curriculum-mapping GrowthBook feature flag.

Method | Endpoint                                 | Description
-------|------------------------------------------|------------------------------------------
POST   | /companies/curriculum/extract            | Parse document + AI extract requirements
POST   | /companies/curriculum/match-requirements | Match requirements to rooms
POST   | /companies/curriculum/match              | Full pipeline (parse + extract + match)
POST   | /companies/curriculum/create-path        | Create custom modules + learning path

Extract Request

POST /companies/curriculum/extract?companyId=xxx&programLength=8&programLengthUnit=weeks&difficulty=intermediate
Content-Type: multipart/form-data
Body: file (binary)
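
For client code, the request line above could be assembled with a helper along these lines. The parameter names come from the example request; the ExtractQuery type and buildExtractPath are hypothetical, and the file itself would still be sent separately as the multipart field named file.

```typescript
interface ExtractQuery {
  companyId: string;
  programLength: number;
  programLengthUnit: string;
  difficulty: "easy" | "intermediate" | "hard";
}

// Builds the path and query string; values are URL-encoded.
function buildExtractPath(query: ExtractQuery): string {
  const pairs: Array<[string, string | number]> = [
    ["companyId", query.companyId],
    ["programLength", query.programLength],
    ["programLengthUnit", query.programLengthUnit],
    ["difficulty", query.difficulty],
  ];
  const queryString = pairs
    .map(([key, value]) => `${key}=${encodeURIComponent(String(value))}`)
    .join("&");
  return `/companies/curriculum/extract?${queryString}`;
}

const extractPath = buildExtractPath({
  companyId: "xxx",
  programLength: 8,
  programLengthUnit: "weeks",
  difficulty: "intermediate",
});
```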

Extract Response

{
  requirements: [
    {
      id: "abc123",
      topic: "Network Fundamentals",
      skills: ["networking", "tcp", "udp", "wireshark"],
      objective: "Understand TCP/IP stack and packet analysis",
      difficulty: "easy",
      timeEstimate: 120,
      section: "Week 1"
    },
    // ...
  ],
  pathTitle: "Cybersecurity Foundations",
  cardDescription: "Build a strong foundation in...",
  pageIntroduction: "This comprehensive path covers...",
  metadata: {
    format: "xlsx",
    parsedItems: 45,
    extractedRequirements: 22,
    sections: ["Week 1", "Week 2", ...]
  }
}

Match Response

{
  estimatedCompletionTime: { minutes: 1440, hours: 24, label: "24 hours" },
  pathDifficulty: { level: 2, label: "Medium" },
  coverage: { total: 22, matched: 15, partial: 4, unmatched: 3, percentage: 77 },
  modules: [
    {
      requirementId: "abc123",
      title: "Network Fundamentals",
      section: "Week 1",
      rooms: [
        {
          roomId: "...",
          code: "introtnetworking",
          title: "Intro to Networking",
          matchScore: 0.82,
          matchReasons: ["Tag: networking, tcp", "Title: 78%"],
          timeToComplete: 60,
          difficultyLevel: 1
        }
      ]
    }
  ],
  requirements: [
    { id: "abc123", topic: "Network Fundamentals", status: "matched", matchedRoomCodes: ["introtnetworking"] }
  ]
}

10. AI Integration Deep Dive

Where AI Is Used

AI vs Algorithmic Split

The architecture deliberately separates AI extraction from algorithmic matching:

This split means:

Token Optimization

For large documents (>150 items), the system uses smart sampling to stay within token limits:

selectedItems = items[0..50] + items[middle-25..middle+25] + items[-50..]

This captures the beginning, middle, and end of the curriculum, giving the AI representative coverage without exceeding context limits.
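
The sampling rule can be sketched as three slices. The slice sizes follow the formula above; the function name sampleItems, the constant names, and the use of Math.floor to find the middle are assumptions.

```typescript
const MAX_ITEMS_SENT = 150;
const HEAD_COUNT = 50;
const MIDDLE_HALF_WIDTH = 25;
const TAIL_COUNT = 50;

// Returns the input unchanged when it fits; otherwise returns
// the first 50 items, 50 items around the midpoint, and the last 50.
function sampleItems<T>(items: T[]): T[] {
  if (items.length <= MAX_ITEMS_SENT) return items;
  const middle = Math.floor(items.length / 2);
  return [
    ...items.slice(0, HEAD_COUNT),
    ...items.slice(middle - MIDDLE_HALF_WIDTH, middle + MIDDLE_HALF_WIDTH),
    ...items.slice(-TAIL_COUNT),
  ];
}

// 300 numbered items: the sample holds 0..49, 125..174, and 250..299.
const numberedItems = Array.from({ length: 300 }, (_, index) => index);
const sampledItems = sampleItems(numberedItems);
```

The three slices never overlap once the input has more than 150 items, so the sample always contains exactly 150 distinct entries.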


11. Directory Structure

Backend

apps/api/app/api/v2/src/
├── routes/companies/curriculum.ts       # Route definitions (feature-flagged)
├── controllers/companies/curriculum.ts  # Request handlers
├── middlewares/companies/curriculum.ts  # Multer file upload middleware
├── services/companies/curriculum/
│   ├── index.ts                         # Orchestrator (matchCurriculum, extractCurriculumRequirements, createCurriculumPath)
│   ├── parser.ts                        # Document parsing (XLSX, PDF, DOCX, TXT)
│   ├── extractor.ts                     # AI requirement extraction (GPT-4o-mini)
│   └── matcher.ts                       # Room matching algorithm
└── common/interfaces/company/
    └── curriculum.ts                    # TypeScript interfaces, MATCH_WEIGHTS, MATCH_THRESHOLDS

Frontend

apps/frontend/src/features/management-dashboard/curriculum-mapping/
├── curriculum-mapping.tsx               # Main 4-step wizard component
├── curriculum-mapping.slice.ts          # RTK Query mutations (extract, match, create)
├── curriculum-mapping.types.ts          # TypeScript interfaces
├── curriculum-mapping.constants.ts      # Step labels, accepted file types
├── curriculum-mapping.styles.ts         # Styled components
├── curriculum-rich-text.ts             # Rich text → plain text conversion
├── steps/
│   ├── upload-step/                     # File upload + paste + settings
│   ├── requirements-step/              # Requirements review table
│   ├── content-step/                   # Matched modules + rooms
│   └── personalise-step/              # Title/description/intro editor
└── components/
    ├── loading/                         # Loading spinner + message
    ├── path-created/                   # Success animation
    └── path-preview/                   # Final path preview

12. Feature Flag

The entire feature is gated behind:

curriculumRouter.use(requireGrowthBookFeatureFlag('curriculum-mapping'));

When curriculum-mapping is off, all /companies/curriculum/* routes return 403.
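
The gate's behaviour can be illustrated with an Express-style middleware sketch. This is not the implementation of requireGrowthBookFeatureFlag: the flag lookup is injected as a plain function, the request and response types are minimal stand-ins, and the error body is invented for the example.

```typescript
// Minimal stand-ins for the Express response and next() types.
interface ResLike {
  status(code: number): ResLike;
  json(body: unknown): void;
}
type NextFn = () => void;

// Hypothetical lookup: returns true when the named flag is on.
type FlagLookup = (flagKey: string) => boolean;

function requireFeatureFlagSketch(flagKey: string, isOn: FlagLookup) {
  return (_req: unknown, res: ResLike, next: NextFn): void => {
    if (isOn(flagKey)) {
      next(); // flag on: continue to the route handler
      return;
    }
    // Flag off: reject with 403, matching the behaviour described above.
    res.status(403).json({ error: `Feature '${flagKey}' is disabled` });
  };
}

// A guard that always rejects, as if the flag were switched off.
const disabledGuard = requireFeatureFlagSketch("curriculum-mapping", () => false);
```

Registering the guard on the router, rather than on each route, is what makes every /companies/curriculum/* endpoint share the same gate.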


13. End-to-End Data Flow