Methodology

A detailed overview of the statistical methods and analytical techniques used in this student performance analysis.

🔍

Exploratory Data Analysis (EDA)

Initial exploration of the dataset to understand distributions, patterns, and relationships.

•Examined distributions of all 33 variables
•Identified missing values and outliers
•Computed summary statistics (mean, median, std, quartiles)
•Created correlation matrices to identify related variables
•Visualized relationships through scatter plots, histograms, and box plots

Tools: Python/R, Matplotlib/ggplot2, Pandas/dplyr
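As a rough illustration of the summary-statistics and outlier steps listed above, here is a minimal sketch using only the Python standard library. The grade values are made up, and the 1.5 × IQR outlier rule is an assumption: the analysis does not state which outlier criterion it used.

```python
import statistics

def summarize(values):
    """Return mean, median, sample standard deviation, and quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),
        "q1": q1,
        "q3": q3,
    }

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule, not from the study)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up final grades on a 0-20 scale, for illustration only.
grades = [10, 12, 11, 14, 9, 13, 0, 15, 11, 12]

print(summarize(grades))
print("Possible outliers:", iqr_outliers(grades))
```

In practice the same numbers would usually come from Pandas or dplyr; the point here is only to make the computed quantities explicit.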

📊

One-Way ANOVA

Analysis of Variance to test whether group means differ significantly.

•Used to compare grades across categorical groups (e.g., education levels, failure counts)
•Tests the null hypothesis that all group means are equal
•F-statistic measures ratio of between-group to within-group variance
•Significant results (p < 0.05) indicate that at least one group mean differs from the others
•Effect sizes computed to assess practical significance

F = MS_between / MS_within

In this study: A large F-value suggests group means differ more than would be expected by chance. For example, comparing grades across failure groups (0, 1, 2, 3+) yielded F > 30, p < 0.001.
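To make the F ratio concrete, the sketch below computes MS_between, MS_within, and F from scratch for a few made-up groups. It also reports eta squared as one possible effect size; that choice is an assumption, since the analysis does not say which effect size it computed. The p-value is omitted because it requires the F distribution, which the standard library does not provide.

```python
def one_way_anova(groups):
    """Return (F, eta_squared) for a list of groups of numeric values."""
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group size times squared distance
    # of each group mean from the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: squared distance of each value
    # from its own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    ms_between = ss_between / (k - 1)        # df = k - 1
    ms_within = ss_within / (n_total - k)    # df = N - k
    f_stat = ms_between / ms_within
    eta_squared = ss_between / (ss_between + ss_within)  # assumed effect size
    return f_stat, eta_squared

# Made-up grades grouped by number of past failures (0, 1, 2+).
groups = [[12, 13, 14], [10, 11, 12], [6, 7, 8]]
f_stat, eta_squared = one_way_anova(groups)
print(f"F = {f_stat:.2f}, eta squared = {eta_squared:.3f}")
```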

📈

Spearman Rank Correlation

Non-parametric measure of rank correlation between two variables.

•Used for ordinal variables (such as alcohol consumption, rated 1-5)
•Assumes a monotonic relationship rather than a linear one
•Less sensitive to outliers than Pearson correlation because it works on ranks
•Values range from -1 (perfect negative) to +1 (perfect positive)
•Tests significance of monotonic relationships

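ρ = 1 - (6Σd²) / (n(n²-1)), where d is the difference between an observation's two ranks and n is the number of observations. This shortcut is exact only when there are no tied ranks.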

In this study: For weekend alcohol vs grades: ρ ≈ -0.17, indicating a weak but consistent negative relationship. Higher alcohol consumption tends to be associated with lower grades.
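A 1-5 scale produces many tied ranks, so a tie-safe way to compute ρ is to assign average ranks and take the Pearson correlation of those ranks. The sketch below does this with the standard library only. The sample values are made up, and this is not necessarily how the study computed its coefficient.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up weekend alcohol ratings (1-5) and final grades, for illustration.
walc = [1, 1, 2, 3, 5, 4, 2, 1]
g3 = [15, 14, 12, 11, 8, 10, 13, 12]
print(f"rho = {spearman(walc, g3):.3f}")
```

When there are no ties, this agrees with the Σd² formula.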

🎯

Principal Component Analysis (PCA)

Dimensionality reduction technique to identify major patterns in multivariate data.

•Reduces many variables to fewer uncorrelated dimensions
•Each successive component captures the maximum remaining variance
•Loadings show which variables contribute to each dimension
•Used to identify "lifestyle" vs "family background" patterns
•Visualization in 2D reveals clustering by performance

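X = TW^T + E, where X is the data matrix, T holds the component scores, W holds the loadings, and E is the residual left by the discarded components.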

In this study: PC1 captured "social/alcohol" behaviors (goout, Walc, Dalc load high). PC2 captured "parental education" (Medu, Fedu dominate). Students scoring high on PC1 but low on PC2 tended to have lower grades.
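To show what loadings and scores mean in practice, the following sketch extracts the first principal component by power iteration on the correlation matrix. It is a standard-library illustration under assumptions: the data are made up, the variables are standardized before the decomposition (the analysis does not say whether it did this), and a real analysis would normally use a library routine instead.

```python
import math

def standardize(rows):
    """Center each column and scale to unit sample std (assumes no constant column)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(p)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in rows]

def covariance(z):
    """Covariance matrix of centered data (a correlation matrix if standardized)."""
    n, p = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(p)] for a in range(p)]

def first_component(cov, iterations=500):
    """Leading eigenvector (loadings) and eigenvalue via power iteration."""
    p = len(cov)
    # Non-uniform start vector, so it is unlikely to be orthogonal to the answer.
    w = [1.0 + j for j in range(p)]
    for _ in range(iterations):
        v = [sum(cov[a][b] * w[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        w = [x / norm for x in v]
    eigenvalue = sum(
        w[a] * sum(cov[a][b] * w[b] for b in range(p)) for a in range(p)
    )
    return w, eigenvalue

# Made-up rows of (goout, Walc, Medu), for illustration only.
rows = [[1, 1, 4], [2, 1, 3], [3, 2, 4], [4, 4, 2], [5, 5, 3], [2, 2, 1]]
z = standardize(rows)
loadings, eigenvalue = first_component(covariance(z))
scores = [sum(r[j] * loadings[j] for j in range(len(loadings))) for r in z]

print("PC1 loadings:", [round(x, 3) for x in loadings])
print("Share of variance explained:", round(eigenvalue / len(loadings), 3))
print("PC1 scores:", [round(s, 3) for s in scores])
```

The loadings correspond to one column of W and the scores to one column of T. Variables with large absolute loadings are the ones that define the component.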

🌳

Decision Tree Classification

Tree-based model that recursively splits data to predict outcomes.

•Intuitive, interpretable structure
•Automatically handles non-linear relationships and interactions
•Variable importance derived from split gains
•Shows hierarchical importance of predictors
•Can handle both categorical and continuous variables

In this study: The tree revealed "failures" as the root split (most important), followed by parental education for students without failures, and lifestyle factors for those with failures.
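The idea of a "split gain" can be sketched as the drop in impurity that a candidate split achieves. The example below uses Gini impurity, which is an assumption: the analysis does not state which splitting criterion or target labels it used. The rows and the pass/fail labels are made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for more mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gain(rows, labels, feature, threshold):
    """Impurity reduction from splitting on rows[feature] <= threshold."""
    left = [lab for r, lab in zip(rows, labels) if r[feature] <= threshold]
    right = [lab for r, lab in zip(rows, labels) if r[feature] > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

# Made-up students; "pass"/"fail" is a hypothetical target for illustration.
rows = [
    {"failures": 0, "Medu": 4},
    {"failures": 0, "Medu": 2},
    {"failures": 1, "Medu": 3},
    {"failures": 2, "Medu": 1},
]
labels = ["pass", "pass", "fail", "fail"]

# The tree picks the candidate split with the largest gain at each node.
for feature, threshold in [("failures", 0), ("Medu", 2)]:
    gain = split_gain(rows, labels, feature, threshold)
    print(f"{feature} <= {threshold}: gain = {gain:.3f}")
```

In this toy data the split on failures separates the classes perfectly, which is the kind of pattern that would place a variable at the root of a tree.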

Assumptions & Limitations

⚠️

Selection Bias in Support Variables

Students receiving school support may have lower grades not because support is harmful, but because struggling students are selected into support programs. This is a classic case of confounding: prior academic difficulty influences both who receives support and what grades they earn.

⚠️

Self-Reported Data Limitations

Variables like alcohol consumption and study time are self-reported and may be subject to social desirability bias. Actual values may differ from reported values.

⚠️

Cross-Sectional Design

The data represent a single point in time, so we can establish associations but not causation. That prior grades (G1, G2) predict G3 suggests stability but does not prove a causal mechanism.

⚠️

Sample Specificity

Data comes from two Portuguese schools. Findings may not generalize to other countries, cultures, or educational systems.

Data Pipeline

Raw CSV → Parsing & Cleaning → Type Conversion → Derived Variables → Interactive Filtering → Visualization

Data is loaded from /public/data/student-mat.csv, parsed using PapaParse, converted to typed TypeScript objects, and made available through React Context for all components.
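The parsing, type-conversion, and derived-variable stages can be illustrated with a short sketch. It is written in Python for consistency with the other examples on this page and is not the application's PapaParse/TypeScript code. The sample rows, the semicolon delimiter, the list of numeric fields, and the `has_failures` derived variable are all assumptions made for illustration.

```python
import csv
import io

# Made-up two-row sample; the delimiter and columns are assumptions.
SAMPLE = "school;failures;Medu;G3\nGP;0;4;15\nMS;2;1;8\n"

# Fields to convert from strings to integers (hypothetical subset).
NUMERIC_FIELDS = ("failures", "Medu", "G3")

def load_students(text, delimiter=";"):
    """Parse CSV text, convert numeric fields, and add a derived variable."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    students = []
    for row in reader:
        # Cleaning: strip stray whitespace from every value.
        record = {key: value.strip() for key, value in row.items()}
        # Type conversion: CSV values arrive as strings.
        for field in NUMERIC_FIELDS:
            record[field] = int(record[field])
        # Derived variable (hypothetical): whether the student has past failures.
        record["has_failures"] = record["failures"] > 0
        students.append(record)
    return students

students = load_students(SAMPLE)
for student in students:
    print(student)
```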