Manufacturing · Machine Learning
Reducing unplanned downtime by 67% across 3 production lines through ML-based predictive maintenance
A mid-size US precision parts manufacturer operating 3 high-throughput CNC production lines was experiencing 340+ hours of unplanned downtime annually — each hour costing $18,000 in lost output, scrapped material, and emergency maintenance labour. Maintenance was entirely reactive: machines failed, production stopped, technicians responded. We built an ML-based predictive maintenance system on top of existing sensor infrastructure that predicts failures 6–18 hours in advance, reducing unplanned downtime by 67% and delivering $4.1M in measurable first-year savings.
Business Context
The machines were telling them something was wrong.
Nobody was listening.
The manufacturer ran three CNC production lines — 47 machines in total — producing precision aerospace and automotive components under tight tolerance specifications. Each machine had between 8 and 24 sensors already installed: vibration, temperature, spindle load, coolant pressure, and acoustic emission. The data was being collected by the SCADA system and stored — but never analysed. Maintenance was scheduled on fixed calendar intervals regardless of actual machine condition, and failures between scheduled maintenance windows were handled reactively. The maintenance team was experienced and capable. They simply had no early warning system.
The cost of reactive maintenance
- 340 hrs of unplanned downtime per year across 3 lines (average across 2022–2023; individual incidents ranged from 2 hours to 3 days)
- $18K cost per hour of unplanned downtime (lost output, scrapped in-process parts, emergency labour, and expedited parts procurement)
- 23% of maintenance budget spent on emergency repairs (vs. industry benchmark of 8–10% for facilities with condition-based maintenance)
The failure modes were well understood by the maintenance team — spindle bearing degradation, coolant pump cavitation, and tool holder runout were responsible for 74% of unplanned stops. The team could often tell a machine was "running rough" hours before failure, but had no systematic way to act on that intuition across 47 machines simultaneously. One experienced technician could monitor a handful of machines closely. Nobody could monitor all 47.
The sensor data was the asset. Two years of vibration, temperature, and load data sat in the SCADA historian — including the signatures of every failure event that had occurred. The problem was not a lack of data. It was the absence of a system that could read that data in real time and translate it into actionable maintenance alerts before the failure occurred.
Scope of Work
What we were asked to build
Sensor data pipeline and feature extraction
Real-time ingestion pipeline from the SCADA historian — pulling vibration, temperature, spindle load, coolant pressure, and acoustic emission data at 1-second intervals per machine. Feature extraction computing 60+ time-domain and frequency-domain features per sensor per machine: RMS, kurtosis, spectral entropy, bearing fault frequencies, and trend derivatives.
Failure prediction models per failure mode
Separate ML models trained per failure mode per machine class — spindle bearing degradation, coolant system faults, and tool holder anomalies. Models trained on 2 years of historical sensor data with failure event labels provided by the maintenance team. Outputs a health score per machine updated every 15 minutes with a predicted time-to-failure range.
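The health-score idea can be sketched as a binary classifier whose failure probability is mapped onto a 0–100 scale. The synthetic data, the choice of gradient boosting, the 8-feature layout, and the score mapping below are all our assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Stand-in training set: each row is one 15-minute feature window, labelled 1
# when the window fell inside the lead-up to a recorded failure event.
X_healthy = rng.normal(0.0, 1.0, size=(400, 8))
X_prefail = rng.normal(2.0, 1.0, size=(100, 8))   # features drift before failure
X = np.vstack([X_healthy, X_prefail])
y = np.array([0] * 400 + [1] * 100)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

def health_score(window_features: np.ndarray) -> float:
    """Map the model's failure probability to a 0-100 score (100 = healthy)."""
    p_fail = model.predict_proba(window_features.reshape(1, -1))[0, 1]
    return round(100.0 * (1.0 - p_fail), 1)
```

One such model per failure mode per machine class keeps each classifier's job narrow; the per-machine score shown on the floor is then the minimum across its failure-mode models.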
Maintenance alert and work order integration
Alert engine generating prioritised maintenance work orders when health scores cross configurable thresholds. Alerts routed to the CMMS (computerised maintenance management system) automatically — creating a work order with the predicted failure mode, recommended action, and estimated urgency window. No new tooling for the maintenance team to learn.
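The alert-to-work-order handoff looks roughly like the following. The threshold value, the recommended-action table, and the work-order fields are placeholder assumptions standing in for the client's CMMS schema:

```python
from dataclasses import dataclass
from typing import Optional

ALERT_THRESHOLD = 60.0   # illustrative: health scores below this trigger a work order

@dataclass
class WorkOrder:
    machine_id: str
    failure_mode: str
    recommended_action: str
    urgency_hours: int   # window in which intervention should be scheduled

def maybe_create_work_order(machine_id: str, failure_mode: str,
                            health_score: float, ttf_hours: int) -> Optional[WorkOrder]:
    """Create a CMMS work order when a machine's health score crosses threshold."""
    if health_score >= ALERT_THRESHOLD:
        return None   # machine healthy: no alert, no work order
    actions = {
        "spindle_bearing": "Schedule spindle bearing replacement",
        "coolant_fault": "Inspect coolant pump and lines",
        "tool_holder": "Check tool holder runout and clamping",
    }
    return WorkOrder(machine_id, failure_mode,
                     actions.get(failure_mode, "Inspect machine"), ttf_hours)
```

Because the work order arrives pre-populated in the existing CMMS, the maintenance team schedules it exactly as they would any other job.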
Production floor dashboard
Real-time health status dashboard for all 47 machines — colour-coded by health score, showing trend direction, active alerts, and predicted maintenance windows. Accessible on floor terminals and mobile devices. Maintenance manager can drill into any machine to see the sensor signals driving the health score.
Constraints we worked within
- SCADA system and sensor infrastructure could not be modified — data pipeline had to read from the historian without impacting production systems
- CMMS integration required work orders in a specific format — custom connector built to match existing workflow
- Some machines had incomplete failure history — cold-start handling required for 11 machines with fewer than 3 recorded failure events
- Model alerts had to be actionable within the maintenance team's shift structure — 6-hour minimum advance warning required to schedule planned intervention
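The cold-start constraint above is the interesting one: with fewer than 3 recorded failures, a supervised classifier cannot be trained, so those machines fall back to unsupervised anomaly detection fitted on healthy operation only. A minimal sketch of that fallback, with an isolation forest, synthetic data, and a contamination value chosen purely for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Fit on healthy-operation feature windows only; no failure labels required.
healthy_windows = rng.normal(0.0, 1.0, size=(500, 8))
detector = IsolationForest(contamination=0.01, random_state=0).fit(healthy_windows)

def is_anomalous(window: np.ndarray) -> bool:
    """True when a feature window looks unlike anything seen in normal operation."""
    return bool(detector.predict(window.reshape(1, -1))[0] == -1)
```

An anomaly flag is weaker than a failure-mode prediction (it says "this machine looks unusual", not "this bearing will fail"), which is why those 11 machines accumulate labelled failure data for eventual promotion to supervised models.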
Explicitly not in scope
- New sensor installation or hardware procurement
- Quality control or defect detection on finished parts
- Supply chain or spare parts inventory optimisation
- ERP integration or production scheduling changes
System Architecture
Existing sensors. New intelligence layer. Failures predicted hours before they happen.
How We Worked
7 months. Maintenance team as domain experts throughout. Zero production disruption.
Data Audit & Failure Mode Mapping
Extracted and audited 2 years of SCADA historian data across all 47 machines. Worked with the maintenance team to label every failure event in the historical record — 127 distinct failure events across the 3 primary failure modes. Identified 11 machines with insufficient failure history for supervised training — flagged for anomaly detection approach rather than supervised classification.
Feature Engineering & Model Development
Built the feature extraction pipeline — 60+ features per sensor per machine computed on a rolling 15-minute window — and trained failure prediction models per failure mode. The spindle bearing model achieved 89% precision and 84% recall on held-out test data; the coolant fault model achieved 91% precision. The tool holder model was harder, reaching 78% precision due to its thinner failure history; we flagged this to the client and recommended collecting more labelled failure data over the next 6 months.
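For readers less familiar with these metrics, precision and recall are defined on held-out windows labelled by whether they preceded a recorded failure. The arrays below are toy stand-ins, not the actual evaluation data:

```python
from sklearn.metrics import precision_score, recall_score

# y_true = 1 where a held-out window preceded a recorded failure,
# y_pred = 1 where the model alerted. (Illustrative values only.)
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 0, 1, 1, 0, 0, 1, 1, 0, 1]

precision = precision_score(y_true, y_pred)  # of the alerts raised, how many were real
recall = recall_score(y_true, y_pred)        # of the real failures, how many were caught
```

High precision keeps false alerts (and alert fatigue) down; high recall keeps missed failures down. The two trade off against each other through the alert threshold.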
Alert Engine & CMMS Integration
Alert thresholds configured with maintenance manager — calibrated to generate 3–5 actionable alerts per day across all 47 machines, avoiding alert fatigue. CMMS connector built and tested. Dashboard deployed to floor terminals. Maintenance team ran a 3-week shadow period — alerts generated but not acted on, team compared predictions against their own assessments.
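Calibrating to an alert budget rather than a fixed score can be done by replaying historical health scores and picking the quantile that yields the target rate. This is a simplified sketch of that idea; in the engagement the thresholds were set per failure mode together with the maintenance manager:

```python
import numpy as np

def calibrate_threshold(historical_scores: np.ndarray,
                        windows_per_day: int,
                        target_alerts_per_day: float) -> float:
    """Choose a health-score threshold so that, replayed over historical data,
    roughly `target_alerts_per_day` windows per day would have alerted."""
    alert_fraction = target_alerts_per_day / windows_per_day
    # The score below which that fraction of historical windows fall
    return float(np.quantile(historical_scores, alert_fraction))
```

For scale: 47 machines scored every 15 minutes is 4,512 windows per day, so a 3–5 alert budget means flagging roughly the worst 0.1% of windows.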
Live Operation & Model Refinement
System went live. First predicted failure caught: spindle bearing on Line 2, Machine 14 — alert fired 11 hours before the bearing would have failed based on degradation rate. Planned replacement completed in a scheduled 2-hour window. Equivalent reactive failure would have caused an estimated 14-hour unplanned stop. Model performance monitored weekly; 2 refinement cycles completed in month 7 based on new failure event data.
Working rhythm
- Cadence: Two-week sprints, weekly maintenance team reviews
- Decision owner: VP of Operations and Maintenance Manager
- Primary metric: Unplanned downtime hours vs. prior-year baseline
- Escalation SLA: 24 hours with written recommendation
Results
Measured at 6 months post go-live.
67% reduction in unplanned downtime hours
Was: 340 hours of unplanned downtime per year across 3 lines
Annualised from 6-month post-go-live data. The system predicted 34 of the 41 failure events that occurred in the measurement period — 83% catch rate. The 7 missed predictions were all on the tool holder model, which had the thinnest training data. Additional failure event labelling is ongoing to improve this model.
$4.1M in measurable first-year savings
Was: $18,000/hour × 340 hours = $6.1M annual downtime cost
Savings calculated as avoided downtime cost ($3.7M) plus reduction in emergency maintenance spend ($0.4M). Does not include quality improvements from catching degraded machines before they produce out-of-tolerance parts — estimated at an additional $0.3M in scrap reduction.
average advance warning before predicted failure
Was: zero advance warning — failures discovered when production stopped
Advance warning range across all caught predictions: 6 hours (minimum, tool holder faults) to 31 hours (spindle bearing degradation). The 6-hour minimum was sufficient for the maintenance team to schedule planned interventions within shift structure in all but 2 cases.
83% failure prediction catch rate across all 3 failure modes
Was: 0% — no predictive capability, all failures discovered reactively
Spindle bearing: 94% catch rate. Coolant faults: 88% catch rate. Tool holder: 67% catch rate (improving as more labelled failure data accumulates). False positive rate: 1.2 false alerts per week across all 47 machines — maintenance team reports this as acceptable given the cost of a missed failure.
What This Means for You
The sensor data already exists in most manufacturing facilities. The gap is not hardware — it is the absence of a system that reads that data continuously and translates it into maintenance decisions before failures occur.
1. Your maintenance team responds to failures rather than preventing them — unplanned stops are a regular operational reality
2. You have sensor data being collected by your SCADA or historian system that is not being used for predictive purposes
3. Emergency maintenance and expedited parts procurement represent a disproportionate share of your maintenance budget
This engagement was built entirely on top of existing sensor infrastructure — no new hardware, no SCADA modifications, no production disruption during implementation. The maintenance team's domain knowledge was the most valuable input to the model: their failure event labels and their assessment of alert thresholds shaped the system from day one.
See how we approach Machine Learning for manufacturing