Performance comparison of machine-learning and deep-learning models for software cost estimation.
We benchmarked a Decision Tree (DT) baseline, XGBoost, and a Deep Neural Decision Tree (DNDT) on the COCOMO II dataset. With only 97 historical projects available, we expanded the data with Gaussian-noise augmentation and evaluated each model on an 80/20 split with k-fold tuning. XGBoost delivered the best MAE, MRE, and R² scores and proved more robust to noise than the DNDT, while the simple DT remained surprisingly competitive on variance metrics.
Accurate software-effort estimation informs staffing, budget, and project feasibility. Earlier approaches relied on expert judgment, but modern ML/DL models offer reproducible predictions. Our study investigates whether the added complexity of DNDTs is worthwhile compared to strong tabular learners such as XGBoost when data is scarce.
The revised COCOMO II dataset contains 97 projects with four key columns: E (estimated effort), PEMi (effort multipliers), Size KLOC (thousands of lines of code), and the target ACT_EFFORT. We log-transformed the skewed distributions and applied a RobustScaler to limit the influence of outliers. Exploratory figure analyses (not shown) highlighted a strong correlation between E and PEMi (0.60) and confirmed Size KLOC as the dominant predictor.
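The preprocessing described above can be sketched as follows; the column names and toy values are illustrative stand-ins for the actual dataset, and the choice of log1p (rather than a plain log) is an assumption:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy frame mimicking the described columns (values are hypothetical).
df = pd.DataFrame({
    "E": [120.0, 45.0, 980.0, 60.0],
    "PEMi": [1.1, 0.9, 1.4, 1.0],
    "Size_KLOC": [25.0, 8.0, 310.0, 12.0],
    "ACT_EFFORT": [150.0, 40.0, 1200.0, 70.0],
})

# log1p compresses the right-skewed effort and size distributions.
skewed = ["E", "Size_KLOC", "ACT_EFFORT"]
df[skewed] = np.log1p(df[skewed])

# RobustScaler centers on the median and scales by the IQR,
# so residual outliers have limited influence on the features.
scaler = RobustScaler()
X = scaler.fit_transform(df[["E", "PEMi", "Size_KLOC"]])
```

Scaling only the predictors (and leaving the log-transformed target unscaled) is one reasonable convention; the study may have scaled the target as well.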
After cleaning, we injected Gaussian noise (σ = 0.05) to double the dataset to 190 points. We then trained three models: the Decision Tree baseline, XGBoost, and a Deep Neural Decision Tree.
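The augmentation step might look like the sketch below. It assumes roughly 95 projects survive cleaning (consistent with the reported 190 augmented points) and that noise is added to the scaled features only; whether the target was also perturbed is not stated:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment_gaussian(X, y, sigma=0.05):
    """Append one Gaussian-noised copy of each sample, doubling the dataset."""
    noisy_X = X + rng.normal(0.0, sigma, size=X.shape)
    return np.vstack([X, noisy_X]), np.concatenate([y, y])

# Illustrative placeholder data: 95 cleaned projects, 3 features.
X = rng.random((95, 3))
y = rng.random(95)
X_aug, y_aug = augment_gaussian(X, y)  # 190 points after augmentation
```

Because σ is applied in scaled feature space, the perturbation is small relative to the robust-scaled spread of each column, which keeps the augmented points close to the original distribution.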
Metrics included MAE, MRE, R², and Adjusted R². The top model received extra tuning with k-fold cross-validation.
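The four metrics can be computed directly; this sketch assumes MRE is the mean magnitude of relative error against strictly positive actual efforts, which is the usual convention in effort-estimation studies:

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, MRE, R², and Adjusted R² for a regression model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    mae = np.mean(np.abs(y_true - y_pred))
    # MRE assumes y_true > 0 everywhere (true for effort values).
    mre = np.mean(np.abs(y_true - y_pred) / y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R² penalizes model complexity via the feature count.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return mae, mre, r2, adj_r2
```

With only ~19 test points in an 80/20 split of 97 projects, Adjusted R² shifts noticeably with the feature count, which is why it is reported alongside plain R².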
XGBoost achieved MAE 0.30, MRE 0.36, R² 0.59, and Adjusted R² 0.55 on the test set, outperforming DNDT (MAE 0.40, MRE 0.74) and narrowly edging the DT baseline (MAE 0.31). Feature importance confirmed that Size KLOC drives predictions, aligning with the correlation analysis.
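A feature-importance check of this kind can be reproduced as below. The sketch uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost (the `feature_importances_` interface is the same), with synthetic data in which a size-like feature dominates the target, echoing the paper's finding:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 300
e_proxy = rng.random(n)      # stand-in for E
pemi = rng.random(n)         # stand-in for PEMi
size_kloc = rng.random(n)    # stand-in for Size KLOC

# Synthetic target dominated by the size feature (illustrative only).
y = 5.0 * size_kloc + 0.3 * pemi + rng.normal(0.0, 0.05, n)
X = np.column_stack([e_proxy, pemi, size_kloc])

model = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
model.fit(X, y)
imp = model.feature_importances_  # impurity-based importances, sum to 1
```

On the constructed data, the third column (the Size KLOC proxy) receives the largest importance, mirroring how the study confirmed Size KLOC as the dominant predictor.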
DNDT struggled with overfitting because the dataset is too small to exploit its expressiveness; even with augmentation, it remained sensitive to noise. XGBoost balanced the bias/variance trade-off best, though a training-set Adjusted R² of ≈ 1.0 suggests some overfitting, and richer features would likely help. The DT baseline remains compelling when interpretability and stability matter more than raw accuracy.
XGBoost is the most reliable choice for low-data software-effort estimation tasks, especially when paired with lightweight augmentation. DNDTs require more data diversity, while decision trees provide a trustworthy fallback with minimal tuning.
We used Python, scikit-learn, XGBoost, and the DNDT reference implementation. ChatGPT and Quillbot assisted with text polishing, Scribbr handled APA references, and Miro documented the research pipeline.
Balint coordinated experiments and preprocessing; Niels led the analyses; Jonas owned model training/evaluation; Benjamin wrote discussion and future work; Mateusz delivered the initial research pitch, coordinated documentation, and supported model comparisons.