IEEE CSNT 2026: Operationalizing Site Reliability in Large-Scale Distributed Systems

I presented our paper, "Operationalizing Site Reliability in Large-Scale Distributed Systems: Shifting Ownership Left," at IEEE CSNT 2026 (Al-Khobar, Saudi Arabia). It describes predictive models that forecast failures to keep large-scale distributed systems reliable.

Ankur Gupta presenting at IEEE CSNT 2026: Operationalizing Site Reliability in Large-Scale Distributed Systems

I presented our paper, "Operationalizing Site Reliability in Large-Scale Distributed Systems: Shifting Ownership Left," at the 2026 IEEE 15th International Conference on Communication Systems and Network Technologies (CSNT 2026), held in Al-Khobar, Saudi Arabia, in April 2026. The paper is published by IEEE and available on IEEE Xplore.

What the paper is about

Guaranteeing site reliability in large-scale distributed systems is hard because these systems have complex, interacting failure modes. Conventional models often cannot predict failures far enough in advance, which leads to service interruptions and higher operating costs. The paper tackles this with a predictive approach that forecasts failures, so operators can keep sites reliable without retraining the entire platform, and providers can pursue reliability and cost optimization at the same time.

The method combines three ideas: a multiscale-attention transformer (MATrans) to learn the temporal and spatial relationships between failures; optimized feature selection and hyperparameter tuning using bird swarm optimization (EBSO); and failure forecasting with Hamiltonian-based neural networks (HNN). We evaluated it on the Grid5000 failure data from the Failure Trace Archive (FTA), with extensive exploratory data analysis and failure-pattern statistics at the cluster, site, and system levels. The hybrid model predicts failed nodes, time between failures (TBF), and time to repair or recovery (TTR) with strong accuracy across distributed-system types, supporting predictive site-reliability management.
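To make the two time-based prediction targets concrete, here is a minimal Python sketch of how TBF and TTR could be derived from raw failure records. It is an illustration only, not the paper's pipeline: the record layout, the node names, the sample values, and the function name tbf_and_ttr are all hypothetical, and the definitions used (TTR as failure end minus failure start, TBF as the gap between one failure's end and the next failure's start on the same node) are one common convention that may differ from the one in the paper or in the FTA tooling.

```python
from collections import defaultdict

# Hypothetical failure records: (node_id, failure_start, failure_end), in hours.
# These values are made up for illustration and are not Grid5000 data.
events = [
    ("node-a", 10.0, 12.0),
    ("node-a", 30.0, 31.0),
    ("node-b", 5.0, 9.0),
    ("node-a", 50.0, 53.0),
]


def tbf_and_ttr(events):
    """Return per-node lists of TTR and TBF values.

    Assumed definitions (not taken from the paper):
      TTR = failure_end - failure_start
      TBF = next failure_start - previous failure_end (same node)
    """
    by_node = defaultdict(list)
    for node, start, end in events:
        by_node[node].append((start, end))

    result = {}
    for node, spans in by_node.items():
        spans.sort()  # order each node's failures by start time
        ttr = [end - start for start, end in spans]
        # Pair each failure with the next one on the same node.
        tbf = [nxt[0] - cur[1] for cur, nxt in zip(spans, spans[1:])]
        result[node] = {"ttr": ttr, "tbf": tbf}
    return result


stats = tbf_and_ttr(events)
for node, values in sorted(stats.items()):
    print(node, values)
```

A node with a single recorded failure has one TTR value and no TBF value, which is why per-node series like these are usually aggregated at the cluster or site level before modeling.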

Authors

Ankur Gupta, Karan Gupta, Bhakti Hinduja, Naveen Kumar Mylarappa, and Niharika Pramod Pujari.

Keywords

Site reliability, distributed computing, failure prediction, system performance, operational efficiency, predictive maintenance.

Citation

A. Gupta, K. Gupta, B. Hinduja, N. K. Mylarappa, and N. P. Pujari, "Operationalizing Site Reliability in Large-Scale Distributed Systems: Shifting Ownership Left," in 2026 IEEE 15th International Conference on Communication Systems and Network Technologies (CSNT), Al-Khobar, Saudi Arabia, Apr. 2026, pp. 1438-1443, doi: 10.1109/CSNT69054.2026.11502502.