Methodology

A detailed overview of the statistical methods and analytical techniques used in this student performance analysis.

🔍

Exploratory Data Analysis (EDA)

Initial exploration of the dataset to understand distributions, patterns, and relationships.

  • Examined distributions of all 33 variables
  • Identified missing values and outliers
  • Computed summary statistics (mean, median, std, quartiles)
  • Created correlation matrices to identify related variables
  • Visualized relationships through scatter plots, histograms, and box plots
Tools: Python/R, Matplotlib/ggplot2, Pandas/dplyr
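
The summary-statistics and outlier steps listed above can be sketched for a single numeric column as follows. This is a minimal illustration that uses only Python's standard library rather than Pandas. The `summarize` helper, the 1.5 × IQR outlier rule, and the toy grades are assumptions made for the example; the source does not say how outliers were flagged.

```python
from statistics import mean, median, stdev, quantiles

def summarize(values):
    """Summary statistics for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)  # quartile cut points
    fence = 1.5 * (q3 - q1)              # assumed 1.5 * IQR outlier rule
    return {
        "missing": len(values) - len(present),
        "mean": mean(present),
        "median": median(present),
        "std": stdev(present),           # sample standard deviation
        "q1": q1,
        "q3": q3,
        "outliers": [v for v in present if v < q1 - fence or v > q3 + fence],
    }

# Made-up toy grades, not values from the dataset.
toy_grades = [10, 12, None, 11, 13, 12, 0, 14, 11]
summary = summarize(toy_grades)
print(summary["missing"], summary["median"], summary["outliers"])
```

Running the same helper over every numeric column gives a quick table of distributions before any modeling.
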
📊

One-Way ANOVA

Analysis of Variance to test whether group means differ significantly.

  • Used to compare grades across categorical groups (e.g., education levels, failure counts)
  • Tests the null hypothesis that all group means are equal
  • F-statistic measures ratio of between-group to within-group variance
  • Significant results (p < 0.05) indicate that at least one group mean differs from the others
  • Effect sizes computed to assess practical significance
F = MS_between / MS_within
In this study: A large F-value suggests group means differ more than would be expected by chance. For example, comparing grades across failure groups (0, 1, 2, 3+) yielded F > 30, p < 0.001.
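
To make the F ratio concrete, here is a small sketch that computes it from raw groups with the standard library. The `one_way_anova` function and the toy groups are illustrative, and eta squared is used as the effect size only as an assumption, since the source does not name the measure it reports. The p-value is left out because it needs the F distribution; a library routine such as `scipy.stats.f_oneway` would supply it.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, eta_squared) for a list of groups of numeric observations."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations

    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)   # df_between = k - 1
    ms_within = ss_within / (n - k)     # df_within = n - k
    f_stat = ms_between / ms_within
    eta_squared = ss_between / (ss_between + ss_within)  # assumed effect size
    return f_stat, eta_squared

# Made-up toy grades for three groups, not values from the dataset.
toy_groups = [[12, 14, 13, 15], [10, 11, 9, 10], [7, 8, 6, 7]]
f_stat, eta_squared = one_way_anova(toy_groups)
print(round(f_stat, 2), round(eta_squared, 3))
```
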
📈

Spearman Rank Correlation

Non-parametric measure of rank correlation between two variables.

  • Used for ordinal variables (like alcohol consumption 1-5)
  • Does not assume linear relationships
  • More robust to outliers than Pearson correlation, because it uses ranks rather than raw values
  • Values range from -1 (perfect negative) to +1 (perfect positive)
  • Tests significance of monotonic relationships
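ρ = 1 - (6Σd²) / (n(n²-1))
Here d is the difference between an observation's two ranks and n is the number of observations. This shortcut is exact only when there are no tied ranks; with ties, which are common for 1-5 ordinal scales, ρ is the Pearson correlation of the average ranks.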
In this study: For weekend alcohol vs grades: ρ ≈ -0.17, indicating a weak but consistent negative relationship. Higher weekend alcohol consumption tends to be associated with lower grades.
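
The rank-based calculation can be sketched in plain Python as below. The helper names are illustrative, and the toy inputs are made up. `spearman` handles tied values by correlating average ranks, while `spearman_no_ties` applies the Σd² shortcut and matches it only when no values are tied. In practice a routine such as `scipy.stats.spearmanr` would also return the p-value.

```python
from statistics import mean

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def spearman_no_ties(x, y):
    """Shortcut 1 - 6*sum(d^2) / (n(n^2 - 1)); exact only without ties."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Made-up toy values without ties: both functions agree here.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(round(spearman(x, y), 3), round(spearman_no_ties(x, y), 3))
```
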
🎯

Principal Component Analysis (PCA)

Dimensionality reduction technique to identify major patterns in multivariate data.

  • Reduces many variables to fewer uncorrelated dimensions
  • Each successive component captures the maximum remaining variance
  • Loadings show which variables contribute to each dimension
  • Used to identify "lifestyle" vs "family background" patterns
  • Visualization in 2D reveals clustering by performance
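X = TW^T + E, where T holds the component scores, W the loadings, and E the residual left by the discarded components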
In this study: PC1 captured "social/alcohol" behaviors (goout, Walc, Dalc load high). PC2 captured "parental education" (Medu, Fedu dominate). Students scoring high on PC1 but low on PC2 tended to have lower grades.
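
As a rough illustration of how the first component's loadings and scores arise, the sketch below centers the data, builds the covariance matrix, and finds its leading eigenvector by power iteration. Power iteration is swapped in here only to keep the example dependency-free; it is not claimed to be the method used in the study, which would more typically rely on an SVD routine. The function name and toy rows are hypothetical, and the sketch centers but does not rescale the variables.

```python
def first_principal_component(rows, iterations=200):
    """Return (loadings, scores, explained_ratio) for PC1 of centered data."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]

    # Sample covariance matrix (p x p).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]

    # Power iteration: repeatedly apply cov and renormalize.
    # Assumes the start vector is not orthogonal to the leading eigenvector.
    w = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * w[b] for b in range(p)) for a in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]

    scores = [sum(c[j] * w[j] for j in range(p)) for c in centered]
    variance = sum(w[a] * cov[a][b] * w[b] for a in range(p) for b in range(p))
    total_variance = sum(cov[j][j] for j in range(p))
    return w, scores, variance / total_variance

# Made-up toy rows with two perfectly correlated columns.
toy_rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loadings, scores, explained = first_principal_component(toy_rows)
print([round(v, 3) for v in loadings], round(explained, 3))
```

In terms of the formula above, `loadings` is the first column of W and `scores` is the first column of T.
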
🌳

Decision Tree Classification

Tree-based model that recursively splits data to predict outcomes.

  • Intuitive, interpretable structure
  • Automatically handles non-linear relationships and interactions
  • Variable importance derived from split gains
  • Shows hierarchical importance of predictors
  • Can handle both categorical and continuous variables
In this study: The tree revealed "failures" as the root split (most important), followed by parental education for students without failures, and lifestyle factors for those with failures.
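
The idea of a split gain can be illustrated with a short sketch that searches for the single best threshold split. Gini impurity is assumed as the criterion because the source does not name one, and the `best_split` helper, the pass/fail labels, and the toy rows are all made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature, threshold, gain) for the best 'feature <= threshold' split."""
    parent = gini(labels)
    best = None
    for feature in rows[0]:
        # Every distinct value except the largest is a candidate threshold.
        for threshold in sorted({r[feature] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            gain = parent - weighted  # impurity reduction from this split
            if best is None or gain > best[2]:
                best = (feature, threshold, gain)
    return best

# Made-up toy students, not rows from the dataset.
toy_rows = [
    {"failures": 0, "Medu": 4}, {"failures": 0, "Medu": 2}, {"failures": 0, "Medu": 3},
    {"failures": 1, "Medu": 4}, {"failures": 2, "Medu": 1}, {"failures": 3, "Medu": 2},
]
toy_labels = ["pass", "pass", "pass", "fail", "fail", "fail"]
feature, threshold, gain = best_split(toy_rows, toy_labels)
print(feature, threshold, gain)
```

A full tree repeats this search inside each resulting subset, and summing the gains per feature gives one common notion of variable importance.
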

Assumptions & Limitations

⚠️

Selection Bias in Support Variables

Students receiving school support may have lower grades not because support is harmful, but because struggling students are selected into support programs. This is a classic case of confounding.

⚠️

Self-Reported Data Limitations

Variables like alcohol consumption and study time are self-reported and may be subject to social desirability bias. Actual values may differ from reported values.

⚠️

Cross-Sectional Design

The data represent a single point in time, so we can establish only associations, not causation. That prior grades (G1, G2) predict G3 suggests stability but does not prove a causal mechanism.

⚠️

Sample Specificity

Data comes from two Portuguese schools. Findings may not generalize to other countries, cultures, or educational systems.

Data Pipeline

Raw CSV → Parsing & Cleaning → Type Conversion → Derived Variables → Interactive Filtering → Visualization

Data is loaded from /public/data/student-mat.csv, parsed using PapaParse, converted to typed TypeScript objects, and made available through React Context for all components.
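
The application itself does this in TypeScript with PapaParse. Purely to illustrate the parse, type-conversion, derived-variable, and filtering stages, here is an equivalent sketch in Python using the standard `csv` module. The delimiter, the choice of numeric columns, the `passed` derived variable (G3 of at least 10), and the inline sample rows are all assumptions for the example, not details taken from the project.

```python
import csv
import io

# Columns converted to integers in this sketch (an assumed subset).
NUMERIC_COLUMNS = ("failures", "Walc", "G3")

def load_students(text, delimiter=","):
    """Parse CSV text into typed dicts and attach one derived variable."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    students = []
    for raw in reader:
        row = dict(raw)
        for column in NUMERIC_COLUMNS:
            row[column] = int(row[column])  # type conversion
        row["passed"] = row["G3"] >= 10     # hypothetical derived variable
        students.append(row)
    return students

# Made-up sample rows standing in for the real file.
sample = "failures,Walc,G3\n0,1,15\n2,4,8\n"
students = load_students(sample)

# Filtering stage: keep students with no past failures.
no_failures = [s for s in students if s["failures"] == 0]
print(len(students), len(no_failures))
```
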