Performance comparison of machine-learning and deep-learning models for software cost estimation.
We benchmarked a Decision Tree (DT) baseline, XGBoost, and a Deep Neural Decision Tree (DNDT) on the COCOMO II dataset. With only 97 historical projects available, we expanded the data with Gaussian-noise augmentation and evaluated each model on an 80/20 split with k-fold tuning. XGBoost delivered the best MAE, MRE, and R² scores and proved more robust to noise than the DNDT, while the simple DT remained surprisingly competitive on variance metrics.
Accurate software-effort estimation informs staffing, budget, and project feasibility. Earlier approaches relied on expert judgment, but modern ML/DL models offer reproducible predictions. Our study investigates whether the added complexity of DNDTs is worthwhile compared to strong tabular learners such as XGBoost when data is scarce.
The revised COCOMO II dataset contains 97 projects with four key columns: E (estimated effort), PEMi (effort multipliers), Size KLOC (thousands of lines of code), and the target ACT_EFFORT. We log-transformed the skewed distributions and applied a RobustScaler to limit the influence of outliers. Exploratory figure analyses (not shown) highlighted a strong correlation between E and PEMi (0.60) and confirmed Size KLOC as the dominant predictor.
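The preprocessing described above can be sketched as follows; the column names and toy values are illustrative stand-ins for the actual dataset, and the choice of log1p (rather than a plain log) is an assumption:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy frame mimicking the described columns (values are hypothetical).
df = pd.DataFrame({
    "E": [120.0, 45.0, 980.0, 60.0],
    "PEMi": [1.1, 0.9, 1.4, 1.0],
    "Size_KLOC": [25.0, 8.0, 310.0, 12.0],
    "ACT_EFFORT": [150.0, 40.0, 1200.0, 70.0],
})

# log1p compresses the right-skewed effort and size distributions.
skewed = ["E", "Size_KLOC", "ACT_EFFORT"]
df[skewed] = np.log1p(df[skewed])

# RobustScaler centers on the median and scales by the IQR,
# so residual outliers have limited influence on the features.
scaler = RobustScaler()
X = scaler.fit_transform(df[["E", "PEMi", "Size_KLOC"]])
```

Scaling only the predictors (and leaving the log-transformed target unscaled) is one reasonable convention; the study may have scaled the target as well.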
After cleaning, we injected Gaussian noise (σ = 0.05) to double the dataset to 190 points. We then trained three models: the Decision Tree baseline, XGBoost, and a Deep Neural Decision Tree.
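The augmentation step might look like the sketch below. It assumes roughly 95 projects survive cleaning (consistent with the reported 190 augmented points) and that noise is added to the scaled features only; whether the target was also perturbed is not stated:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment_gaussian(X, y, sigma=0.05):
    """Append one Gaussian-noised copy of each sample, doubling the dataset."""
    noisy_X = X + rng.normal(0.0, sigma, size=X.shape)
    return np.vstack([X, noisy_X]), np.concatenate([y, y])

# Illustrative placeholder data: 95 cleaned projects, 3 features.
X = rng.random((95, 3))
y = rng.random(95)
X_aug, y_aug = augment_gaussian(X, y)  # 190 points after augmentation
```

Because σ is applied in scaled feature space, the perturbation is small relative to the robust-scaled spread of each column, which keeps the augmented points close to the original distribution.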
Metrics included MAE, MRE, R², and Adjusted R². The top model received extra tuning with k-fold cross-validation.
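The four metrics can be computed directly; this sketch assumes MRE is the mean magnitude of relative error against strictly positive actual efforts, which is the usual convention in effort-estimation studies:

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, MRE, R², and Adjusted R² for a regression model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    mae = np.mean(np.abs(y_true - y_pred))
    # MRE assumes y_true > 0 everywhere (true for effort values).
    mre = np.mean(np.abs(y_true - y_pred) / y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R² penalizes model complexity via the feature count.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return mae, mre, r2, adj_r2
```

With only ~19 test points in an 80/20 split of 97 projects, Adjusted R² shifts noticeably with the feature count, which is why it is reported alongside plain R².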
XGBoost achieved MAE 0.30, MRE 0.36, R² 0.59, and Adjusted R² 0.55 on the test set, outperforming DNDT (MAE 0.40, MRE 0.74) and narrowly edging the DT baseline (MAE 0.31). Feature importance confirmed that Size KLOC drives predictions, aligning with the correlation analysis.
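A feature-importance check of this kind can be reproduced as below. The sketch uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost (the `feature_importances_` interface is the same), with synthetic data in which a size-like feature dominates the target, echoing the paper's finding:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 300
e_proxy = rng.random(n)      # stand-in for E
pemi = rng.random(n)         # stand-in for PEMi
size_kloc = rng.random(n)    # stand-in for Size KLOC

# Synthetic target dominated by the size feature (illustrative only).
y = 5.0 * size_kloc + 0.3 * pemi + rng.normal(0.0, 0.05, n)
X = np.column_stack([e_proxy, pemi, size_kloc])

model = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
model.fit(X, y)
imp = model.feature_importances_  # impurity-based importances, sum to 1
```

On the constructed data, the third column (the Size KLOC proxy) receives the largest importance, mirroring how the study confirmed Size KLOC as the dominant predictor.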
DNDT struggled with overfitting because the dataset is too small to exploit its expressiveness; even with augmentation, it remained sensitive to noise. XGBoost balanced the bias/variance trade-off best, though a training-set Adjusted R² of ≈ 1.0 suggests some overfitting, and richer features would likely help. The DT baseline remains compelling when interpretability and stability matter more than raw accuracy.
XGBoost is the most reliable choice for low-data software-effort estimation tasks, especially when paired with lightweight augmentation. DNDTs require more data diversity, while decision trees provide a trustworthy fallback with minimal tuning.
We used Python, scikit-learn, XGBoost, and the DNDT reference implementation. ChatGPT and Quillbot assisted with text polishing, Scribbr handled APA references, and Miro documented the research pipeline.
Balint coordinated experiments and preprocessing; Niels led the analyses; Jonas owned model training/evaluation; Benjamin wrote discussion and future work; Mateusz delivered the initial research pitch, coordinated documentation, and supported model comparisons.