Methodology
A detailed overview of the statistical methods and analytical techniques used in this student performance analysis.
Exploratory Data Analysis (EDA)
Initial exploration of the dataset to understand distributions, patterns, and relationships.
- Examined distributions of all 33 variables
- Identified missing values and outliers
- Computed summary statistics (mean, median, std, quartiles)
- Created correlation matrices to identify related variables
- Visualized relationships through scatter plots, histograms, and box plots
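The summary statistics listed above can be sketched for a single numeric column as follows. This is a minimal illustration, not the project's actual code: the names `Summary`, `quantile`, and `summarize` are hypothetical, the standard deviation uses the sample (n - 1) denominator, and quartiles use linear interpolation, which is only one of several common conventions.

```typescript
// Hypothetical helper types and functions for illustration only.
interface Summary {
  mean: number;
  median: number;
  std: number; // sample standard deviation (n - 1 denominator)
  q1: number;
  q3: number;
}

// Linear-interpolation quantile on an already sorted array (p in [0, 1]).
function quantile(sorted: number[], p: number): number {
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

function summarize(values: number[]): Summary {
  if (values.length < 2) {
    throw new Error("summarize needs at least two values");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...values].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((sum, v) => sum + v, 0) / n;
  const variance =
    sorted.reduce((sum, v) => sum + (v - mean) ** 2, 0) / (n - 1);
  return {
    mean,
    median: quantile(sorted, 0.5),
    std: Math.sqrt(variance),
    q1: quantile(sorted, 0.25),
    q3: quantile(sorted, 0.75),
  };
}
```

Calling `summarize` on a grade column returns all five statistics in one pass over a sorted copy of the data.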
One-Way ANOVA
Analysis of Variance to test whether group means differ significantly.
- Used to compare grades across categorical groups (e.g., education levels, failure counts)
- Tests the null hypothesis that all group means are equal
- The F-statistic is the ratio of between-group variance to within-group variance
- A significant result (p < 0.05) indicates that at least one group mean differs from the others
- Effect sizes were computed to assess practical significance
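The F-statistic described above can be computed directly from the group data. The sketch below is illustrative, not the code used in the analysis: `AnovaResult` and `oneWayAnova` are hypothetical names, and eta squared is assumed as the effect size because the text does not say which measure was used. It stops short of the p-value, which requires the F-distribution CDF from a statistics library.

```typescript
// Hypothetical result shape for illustration only.
interface AnovaResult {
  fStatistic: number;
  dfBetween: number;
  dfWithin: number;
  etaSquared: number; // assumed effect size: SS_between / SS_total
}

function oneWayAnova(groups: number[][]): AnovaResult {
  const k = groups.length;
  const n = groups.reduce((sum, g) => sum + g.length, 0);
  if (k < 2 || n <= k || groups.some((g) => g.length === 0)) {
    throw new Error("need at least two non-empty groups and n > k");
  }

  const total = groups.reduce(
    (sum, g) => sum + g.reduce((a, v) => a + v, 0),
    0
  );
  const grandMean = total / n;

  let ssBetween = 0; // variation of group means around the grand mean
  let ssWithin = 0; // variation of observations around their group mean
  for (const g of groups) {
    const groupMean = g.reduce((a, v) => a + v, 0) / g.length;
    ssBetween += g.length * (groupMean - grandMean) ** 2;
    for (const v of g) {
      ssWithin += (v - groupMean) ** 2;
    }
  }

  const dfBetween = k - 1;
  const dfWithin = n - k;
  // If every group is internally constant, ssWithin is 0 and F is Infinity.
  const fStatistic = ssBetween / dfBetween / (ssWithin / dfWithin);
  const etaSquared = ssBetween / (ssBetween + ssWithin);
  return { fStatistic, dfBetween, dfWithin, etaSquared };
}
```

For example, each inner array could hold the final grades of students with the same number of past failures.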
Spearman Rank Correlation
Non-parametric measure of rank correlation between two variables.
- Used for ordinal variables (like alcohol consumption 1-5)
- Does not assume a linear relationship; it measures any monotonic association
- Less sensitive to outliers than Pearson correlation, because it operates on ranks
- Values range from -1 (perfect negative) to +1 (perfect positive)
- Tests significance of monotonic relationships
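Spearman's coefficient is the Pearson correlation of the ranked values, which is why it responds to monotonic rather than strictly linear relationships. A minimal sketch follows. The function names are hypothetical, and tied values receive the average of their ranks, which matters for ordinal scales with few distinct levels. The significance test is omitted.

```typescript
// Assign 1-based ranks; tied values share the mean of their positions.
function averageRanks(values: number[]): number[] {
  const order = values
    .map((value, index) => ({ value, index }))
    .sort((a, b) => a.value - b.value);
  const ranks = new Array<number>(values.length).fill(0);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (
      end + 1 < order.length &&
      order[end + 1].value === order[start].value
    ) {
      end++;
    }
    const rank = (start + end) / 2 + 1; // mean position of the tie block
    for (let j = start; j <= end; j++) {
      ranks[order[j].index] = rank;
    }
    start = end + 1;
  }
  return ranks;
}

// Pearson correlation; returns NaN if either input is constant.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const meanX = x.reduce((s, v) => s + v, 0) / n;
  const meanY = y.reduce((s, v) => s + v, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - meanX;
    const dy = y[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

function spearman(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length < 2) {
    throw new Error("inputs must have equal length of at least 2");
  }
  return pearson(averageRanks(x), averageRanks(y));
}
```

A perfectly monotonic but non-linear pair, such as x and x squared for positive x, yields a coefficient of 1 here, whereas Pearson correlation on the raw values would be below 1.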
Principal Component Analysis (PCA)
Dimensionality reduction technique to identify major patterns in multivariate data.
- Reduces many variables to fewer uncorrelated dimensions
- Each successive component captures the maximum remaining variance
- Loadings show how strongly each variable contributes to each component
- Used to identify "lifestyle" vs "family background" patterns
- A 2D visualization reveals clustering by performance
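To make "loadings" and "variance captured" concrete, the sketch below extracts only the first principal component by power iteration on the correlation matrix. It is an illustration under stated assumptions, not the method used in the analysis: all function names are hypothetical, the variables are standardized first, and a production analysis would normally use a full eigendecomposition or SVD from a numerical library.

```typescript
// Standardize each column to mean 0 and unit sample variance.
// Assumes no column is constant (a constant column would divide by zero).
function standardize(rows: number[][]): number[][] {
  const n = rows.length;
  const p = rows[0].length;
  const means = Array.from({ length: p }, (_, j) =>
    rows.reduce((s, r) => s + r[j], 0) / n
  );
  const sds = Array.from({ length: p }, (_, j) =>
    Math.sqrt(rows.reduce((s, r) => s + (r[j] - means[j]) ** 2, 0) / (n - 1))
  );
  return rows.map((r) => r.map((v, j) => (v - means[j]) / sds[j]));
}

// Covariance of standardized data, i.e. the correlation matrix.
function correlationMatrix(z: number[][]): number[][] {
  const n = z.length;
  const p = z[0].length;
  return Array.from({ length: p }, (_, a) =>
    Array.from({ length: p }, (_, b) =>
      z.reduce((s, r) => s + r[a] * r[b], 0) / (n - 1)
    )
  );
}

function multiply(m: number[][], v: number[]): number[] {
  return m.map((row) => row.reduce((s, c, j) => s + c * v[j], 0));
}

// Power iteration: repeated multiplication converges to the leading
// eigenvector unless the start vector is exactly orthogonal to it.
function firstComponent(
  corr: number[][],
  iterations = 200
): { loadings: number[]; variance: number } {
  // Non-uniform start vector to reduce the chance of that orthogonal case.
  let v = corr.map((_, j) => j + 1);
  for (let it = 0; it < iterations; it++) {
    const w = multiply(corr, v);
    const norm = Math.sqrt(w.reduce((s, x) => s + x * x, 0));
    v = w.map((x) => x / norm);
  }
  // Rayleigh quotient of the unit vector v gives the eigenvalue,
  // which is the variance captured by this component.
  const variance = multiply(corr, v).reduce((s, x, j) => s + x * v[j], 0);
  return { loadings: v, variance };
}
```

Dividing `variance` by the number of variables gives the share of total variance explained by the component, since each standardized variable contributes a variance of 1.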
Decision Tree Classification
Tree-based model that recursively splits data to predict outcomes.
- Intuitive, interpretable structure
- Automatically handles non-linear relationships and interactions
- Variable importance derived from split gains
- Shows hierarchical importance of predictors
- Can handle both categorical and continuous variables
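The "split gain" behind variable importance can be illustrated with a single split on one numeric feature. The sketch below assumes Gini impurity as the splitting criterion, which the text does not specify. `Sample`, `gini`, and `bestSplit` are hypothetical names, and a real tree would repeat this search over every feature and recurse into each child node.

```typescript
// Hypothetical sample shape: one numeric predictor and a class label.
interface Sample {
  feature: number;
  label: string;
}

// Gini impurity: 1 minus the sum of squared class proportions.
function gini(labels: string[]): number {
  if (labels.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const l of labels) {
    counts.set(l, (counts.get(l) ?? 0) + 1);
  }
  let sumSquares = 0;
  counts.forEach((c) => {
    sumSquares += (c / labels.length) ** 2;
  });
  return 1 - sumSquares;
}

// Try each midpoint between consecutive distinct feature values and keep
// the threshold with the largest reduction in weighted impurity.
function bestSplit(
  samples: Sample[]
): { threshold: number; gain: number } | null {
  const parentImpurity = gini(samples.map((s) => s.label));
  const values = Array.from(new Set(samples.map((s) => s.feature))).sort(
    (a, b) => a - b
  );
  let best: { threshold: number; gain: number } | null = null;
  for (let i = 0; i + 1 < values.length; i++) {
    const threshold = (values[i] + values[i + 1]) / 2;
    const left = samples
      .filter((s) => s.feature <= threshold)
      .map((s) => s.label);
    const right = samples
      .filter((s) => s.feature > threshold)
      .map((s) => s.label);
    const weighted =
      (left.length * gini(left) + right.length * gini(right)) /
      samples.length;
    const gain = parentImpurity - weighted;
    if (best === null || gain > best.gain) {
      best = { threshold, gain };
    }
  }
  return best; // null when the feature has a single distinct value
}
```

Summing these gains over every node where a variable is used for splitting gives one common definition of that variable's importance.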
Assumptions & Limitations
Selection Bias in Support Variables
Students receiving school support may have lower grades not because support is harmful, but because struggling students are selected into support programs. This is a classic case of confounding: the students' prior difficulty drives both their enrollment in support and their lower grades.
Self-Reported Data Limitations
Variables like alcohol consumption and study time are self-reported and may be subject to social desirability bias. Actual values may differ from reported values.
Cross-Sectional Design
The data represent a single point in time, so we can establish only associations, not causation. That prior grades (G1, G2) predict G3 suggests stable performance but does not prove a causal mechanism.
Sample Specificity
Data comes from two Portuguese schools. Findings may not generalize to other countries, cultures, or educational systems.
Data Pipeline
Data is loaded from /public/data/student-mat.csv, parsed using PapaParse, converted to typed TypeScript objects, and made available through React Context for all components.
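The conversion step from parsed rows to typed objects might look like the sketch below. It is a minimal illustration, not the project's actual code. It assumes the parser yields one object of string values per row, keyed by column header (as PapaParse does when its `header` option is enabled). `StudentRecord`, `RawRow`, `toNumber`, and `toStudent` are hypothetical names, and the `school` and `failures` columns are assumed for illustration. Only G1, G2, and G3 are named in the text above.

```typescript
// Hypothetical subset of the 33 columns; the real type would cover them all.
interface StudentRecord {
  school: string; // assumed column name
  failures: number; // assumed column name
  G1: number;
  G2: number;
  G3: number;
}

// One parsed CSV row: header name mapped to the raw string value.
type RawRow = Record<string, string | undefined>;

// Convert a raw cell to a number, failing loudly on missing or bad values
// instead of letting NaN propagate into later statistics.
function toNumber(row: RawRow, key: string): number {
  const raw = row[key];
  if (raw === undefined || raw.trim() === "") {
    throw new Error(`Missing value for column "${key}"`);
  }
  const value = Number(raw);
  if (Number.isNaN(value)) {
    throw new Error(`Non-numeric value "${raw}" in column "${key}"`);
  }
  return value;
}

function toStudent(row: RawRow): StudentRecord {
  return {
    school: row["school"] ?? "",
    failures: toNumber(row, "failures"),
    G1: toNumber(row, "G1"),
    G2: toNumber(row, "G2"),
    G3: toNumber(row, "G3"),
  };
}
```

Validating at this boundary means every component that reads from the shared context can rely on numeric fields being real numbers.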