A cohort-based severity analysis of 101 university students — moving beyond prevalence counting into structured risk segmentation and intervention prioritisation.
Most mental health analyses stop at prevalence: counting how many students report a condition. This project was structured around a different question — does psychological burden intensify as students advance academically?
The analysis was built on three principles: cohort-based comparison using Year of Study as the primary lens, severity stratification separating single from overlapping conditions, and proportional normalisation to prevent enrollment imbalance from distorting results. The goal was risk identification and intervention prioritisation, not descriptive reporting.
The raw dataset had 11 columns and 101 rows collected via a survey form. Several structural problems needed resolving before any analysis could begin.
Dropped Timestamp — narrow collection window made it analytically meaningless. Dropped Course — excessive categorical fragmentation with low signal value would have added noise, not insight.
Column headers were renamed from verbose survey labels (e.g. "Do you have Anxiety?") to clean field names. CGPA was grouped into standardised ranges: 0–1.99, 2.00–2.49, 2.50–2.99, 3.00–3.49, 3.50–4.00. Year labels normalised to Year 1–4.
Blank and incomplete rows were removed to ensure analytical reliability. The cleaned dataset retained all 101 valid student records across the four year cohorts.
Three derived classification columns were created to measure escalation intensity rather than simple presence. This is the core feature engineering step that powers the entire severity analysis.
| Column | Status | Notes |
|---|---|---|
| Timestamp | REMOVED | Narrow survey window made time analysis meaningless |
| What is your course? | REMOVED | Too many fragmented categories, low analytical signal |
Choose your gender → Gender |
RENAMED | Shortened from verbose survey phrasing |
Your current year of Study → Year Of Study |
RENAMED | Standardised to clean label |
What is your CGPA? → CGPA range |
RENAMED | Grouped into 5 standardised ranges for pivot analysis |
Do you have Depression? → Depression? |
RENAMED | Shortened; same for Anxiety and Panic Attack |
S/N |
NEW | Serial number column added for row identification |
At least 1 — Any Condition |
NEW | TRUE if student reports at least one condition. Baseline prevalence flag. |
At least 2 — Multiple Cases |
NEW | TRUE if two or more conditions present concurrently. Elevated burden flag. |
All three — Complex Cases |
NEW | TRUE if all three conditions (Anxiety + Depression + Panic Attack) reported together. Highest severity tier. |
These three columns were calculated directly in Excel using OR/AND logic on the three condition columns before the data was loaded into Power BI.
Raw counts alone would have been misleading — Year 1 has 43 students while Year 4 has only 8. DAX measures were written to calculate within-cohort prevalence, making year-on-year comparison valid.
Year 1 had the largest raw case count (25 students). Without proportional normalisation, this looks like the most at-risk cohort. Proportional analysis tells the opposite story — Year 3 carries the highest burden at 75% prevalence.
| Cohort | n | Any Condition | Multiple Cases | Complex Cases |
|---|---|---|---|---|
| Year 1 | 43 | |||
| Year 2 | 26 | |||
| Year 3 | 24 | |||
| Year 4 | 8 ⚠ |
⚠ Year 4 n=8. Low sample size limits generalisability — interpret with caution.
CGPA segmentation shows that psychological burden is not isolated to lower-performing students. The highest concentration of students sits in the 3.00–4.00 CGPA range — and conditions are present across all tiers. Stress exposure is cross-performance, which means targeting only low-CGPA students for support would miss the majority of cases.
The core contribution of this project is methodological. By combining Excel based severity engineering with DAX-driven proportional analysis in Power BI, the work moves beyond descriptive dashboarding into structured risk identification.
The finding that Year 3 represents the peak risk cohort, not Year 1 which had the highest raw count, would have been invisible without normalisation. That distinction is the difference between resource allocation that works and resource allocation that misses the problem entirely.
// download source files to explore the dataset, dashboard, and full report