Data Foundations for AI-Driven HVAC: From Sensors to Machine Learning Models | CKY Refrigeration & Air Conditioning Engineering Office

When the industry begins discussing "AI-driven HVAC systems," what engineers most often overlook is not algorithm selection but data quality and completeness. Even the most sophisticated machine learning model, if built on noisy and severely incomplete data, will produce predictions of no engineering value. As the first article in the "AI x HVAC" series, this piece starts from the sensor layer, the most fundamental one, and works upward through the data infrastructure that AI HVAC applications depend on.

AI x HVAC Series

Data Foundations: From Sensors to Machine Learning Models (This Article)
Fault Detection and Predictive Maintenance
Chiller Plant Optimization: From MPC to Deep Reinforcement Learning
Future Vision: Digital Twins, Generative AI, and Edge Intelligence

1. The Sensor Layer: AI HVAC's Nerve Endings

AI applications for HVAC systems begin with sensors. Every temperature probe, pressure transducer, and flow meter serves as the system's "senses," providing raw signals for model training and real-time inference. ASHRAE Handbook -- Fundamentals Chapter 37 provides comprehensive technical specifications for measurement and instruments^[1], but for AI application scenarios, sensor deployment strategies require further consideration.

Core Measurement Parameters

A typical chilled water plant system requires at minimum the following measurement points to build effective AI models:

Chiller side: Chilled water supply/return temperatures, condenser water supply/return temperatures, compressor current/power, refrigerant pressures (high/low), refrigerant temperatures
Water system: Chilled water flow rate, condenser water flow rate, branch flow rates, pump frequency and power, pipe pressure differentials
Air side: Supply/return air temperatures, supply/return air humidity, CO2 concentration, fan frequency and power, filter pressure differential
Outdoor conditions: Dry-bulb temperature, relative humidity (or wet-bulb temperature), solar radiation, wind speed and direction

Sensor accuracy and sampling frequency directly affect model quality. Temperature sensors need +/-0.1 degrees C accuracy to effectively capture chiller performance curve changes; power measurements must cover true power (kW) rather than current alone, otherwise power factor variations will introduce systematic errors.
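
To make the power factor point concrete, here is a minimal sketch assuming a balanced three-phase supply. The function name, the 380 V line voltage, and both power factor values are illustrative assumptions, not figures for any particular chiller.

```python
import math

def three_phase_power_kw(line_voltage_v, line_current_a, power_factor):
    """True power of a balanced three-phase load: P = sqrt(3) * V * I * PF."""
    return math.sqrt(3) * line_voltage_v * line_current_a * power_factor / 1000.0

# Same measured current, two hypothetical power factors (e.g. full load vs. part load).
current_a = 200.0
voltage_v = 380.0  # assumed line-to-line voltage
p_full = three_phase_power_kw(voltage_v, current_a, 0.90)
p_part = three_phase_power_kw(voltage_v, current_a, 0.75)

# Relative overestimate if power is inferred from current with a fixed PF of 0.90
# while the real PF has dropped to 0.75.
error = (p_full - p_part) / p_part
print(f"PF 0.90: {p_full:.1f} kW, PF 0.75: {p_part:.1f} kW, overestimate: {error:.0%}")
```

With identical current readings, the inferred power is off by the ratio of the two power factors (0.90 / 0.75, a 20% overestimate in this made-up case), which is the kind of systematic error a true kW meter avoids.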

The Sampling Frequency Decision

Traditional BMS sampling periods of 5-15 minutes are sufficient for monitoring and alarm purposes. However, AI models require different sampling frequencies for different applications: fault detection needs 1-5 minute data to capture anomaly patterns, equipment performance prediction can use 15-minute data, and long-term energy analysis can use hourly data^[2]. Higher sampling frequency increases data storage and transmission load; engineers must balance model requirements against system burden.
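
One practical consequence is that data can be logged once at the finest interval and averaged down for coarser applications. The sketch below assumes readings arrive as (timestamp, value) pairs and that the target period divides 60 minutes evenly; the function name and the synthetic data are illustrative.

```python
from datetime import datetime, timedelta
from statistics import mean

def downsample(readings, period_minutes):
    """Average (timestamp, value) readings into fixed windows aligned to the hour.

    Assumes period_minutes is a divisor of 60 (e.g. 5, 15, 60).
    """
    buckets = {}
    for ts, value in readings:
        # Floor each timestamp to the start of its window.
        floored_minute = (ts.minute // period_minutes) * period_minutes
        window_start = ts.replace(minute=floored_minute, second=0, microsecond=0)
        buckets.setdefault(window_start, []).append(value)
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]

# One hour of synthetic 1-minute chilled water supply temperatures.
start = datetime(2024, 7, 1, 12, 0)
one_minute = [(start + timedelta(minutes=i), 7.0 + 0.01 * i) for i in range(60)]

fifteen = downsample(one_minute, 15)  # for performance prediction
hourly = downsample(one_minute, 60)   # for long-term energy analysis
print(len(one_minute), len(fifteen), len(hourly))
```

Averaging is only one possible aggregation; for fault detection the raw 1-minute series is usually kept, since averaging can smooth away the short transients that matter there.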

2. Communication Protocols: The Convergence of BACnet and MQTT

Building automation communication protocols bridge sensors and AI platforms. ASHRAE Standard 135 (BACnet)^[3] is the most widely adopted standard protocol in building automation, defining data exchange formats and services between devices. BACnet/IP supports Ethernet transmission, enabling real-time reading of large numbers of data points.

However, BACnet was originally designed for device interoperability, not high-volume data streaming. When hundreds of thousands of time-series data points need to be streamed to cloud AI platforms, MQTT (Message Queuing Telemetry Transport), with its lightweight publish/subscribe architecture, has become the mainstream choice for IoT and AI applications. Many modern BMS platforms already support BACnet-to-MQTT gateway conversion, allowing existing BACnet devices to feed AI data pipelines.
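
The gateway's mapping step can be sketched as turning one point reading into a topic and a JSON payload. This example does not read from BACnet or connect to a broker; the topic hierarchy, field names, and point identifiers are all hypothetical, and a real gateway would hand the resulting pair to an MQTT client library's publish call.

```python
import json
from datetime import datetime, timezone

def to_mqtt_message(site, device, point, value, unit, ts):
    """Map one point reading to a (topic, payload) pair for an MQTT publisher."""
    topic = f"{site}/{device}/{point}"  # hypothetical topic hierarchy
    payload = json.dumps({"ts": ts.isoformat(), "value": value, "unit": unit})
    return topic, payload

topic, payload = to_mqtt_message(
    "site-a", "chiller-1", "chws-temp", 7.2, "degC",
    datetime(2024, 7, 1, 12, 0, tzinfo=timezone.utc),
)
print(topic)
print(payload)
```

Carrying the unit and a timezone-aware timestamp in every message costs a few bytes but removes two of the most common sources of ambiguity downstream.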

Data Standardization Challenges

Different BMS brands use different naming conventions, units, and data formats for the same physical quantities. Open initiatives such as Project Haystack and Brick Schema are attempting to establish unified semantic standards for building IoT data. For AI models, semantic consistency of data is critical: if "chilled water supply temperature" is recorded in degrees C at one site and degrees F at another within the training data, model generalization capability will be severely compromised.
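
A minimal sketch of the normalization step, assuming a hand-maintained alias table. The vendor point names and the canonical name are invented for illustration, and this is a plain dictionary lookup, not an implementation of Haystack tagging or the Brick ontology.

```python
# Hypothetical vendor-specific point names mapped to one canonical name.
POINT_ALIASES = {
    "CHWS_T": "chilled_water_supply_temp",
    "ChW Supply Temp": "chilled_water_supply_temp",
}

def normalize(name, value, unit):
    """Return (canonical_name, value_in_degC) for a temperature reading."""
    canonical = POINT_ALIASES.get(name)
    if canonical is None:
        raise KeyError(f"unmapped point name: {name!r}")
    if unit == "degF":
        value = (value - 32.0) * 5.0 / 9.0
    elif unit != "degC":
        raise ValueError(f"unsupported unit: {unit!r}")
    return canonical, round(value, 2)

site_a = normalize("CHWS_T", 7.0, "degC")
site_b = normalize("ChW Supply Temp", 44.6, "degF")
print(site_a, site_b)
```

Raising on unmapped names or unknown units, rather than passing values through, keeps mislabeled data out of the training set by default.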

3. Data Quality: The Make-or-Break Factor for AI Models

Lawrence Berkeley National Laboratory (LBNL) research teams found in their analysis of BMS data from hundreds of U.S. commercial buildings that quality issues typically affect 15-30% of raw data^[4]. These issues include sensor drift, missing values from communication interruptions, timestamp errors, and out-of-range outliers. If they are not properly addressed, they will directly contaminate the ML model training process.

Common Data Quality Issues

Sensor Drift: Temperature and humidity sensors shift over time; without regular calibration, annual drift can reach 0.5-1.0 degrees C
Stuck-at Fault: Sensor output remains fixed at a single value, commonly seen in aging analog sensors
Missing Data: Data gaps caused by communication interruptions, power outages, or controller restarts
Outliers: Anomalous values from electromagnetic interference, poor wiring, or measurement range overflow
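
Two of the issues listed above, out-of-range outliers and stuck-at faults, can be flagged with simple rules. The sketch below assumes missing samples are represented as None; the thresholds (a 2-20 degrees C plausible range, a run of 6 identical samples) are placeholder assumptions that would need tuning per sensor and sampling rate.

```python
def flag_out_of_range(values, low, high):
    """Indices of non-missing readings outside the physically plausible range."""
    return [i for i, v in enumerate(values)
            if v is not None and not (low <= v <= high)]

def flag_stuck(values, min_run=6):
    """Indices belonging to runs of >= min_run identical consecutive readings."""
    flagged, run_start = [], 0
    for i in range(1, len(values) + 1):
        # Close the current run at the end of data or when the value changes.
        if i == len(values) or values[i] != values[run_start]:
            if values[run_start] is not None and i - run_start >= min_run:
                flagged.extend(range(run_start, i))
            run_start = i
    return flagged

# Synthetic chilled water supply temperatures (degrees C) with injected faults.
chws = [7.1, 7.2, 85.0, 7.3, None, 7.4] + [7.5] * 6 + [7.6]

missing = [i for i, v in enumerate(chws) if v is None]
outliers = flag_out_of_range(chws, low=2.0, high=20.0)
stuck = flag_stuck(chws, min_run=6)
print(missing, outliers, stuck)
```

A run-length rule will also flag a healthy sensor with coarse resolution during steady operation, so in practice it is usually combined with an operating-state check. Drift is not covered here because it generally needs a reference, such as a redundant sensor or a calibration record.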

Data Cleaning Strategies

Effective data cleaning workflows include: outlier detection (based on physical range and statistical methods), missing value imputation (linear interpolation, k-NN, or physics model-based imputation), and sensor validation (cross-referencing redundant sensors). Professor Braun's research team at Purdue University systematically explored the impact of HVAC sensor faults on control performance as early as 2002^[5], laying the foundation for subsequent AI data quality research.
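
As a sketch of two steps from this workflow, the code below fills short gaps by linear interpolation and cross-checks a sensor against a redundant one. The gap limit of 3 samples and the 0.5 degrees C tolerance are assumptions chosen for illustration; k-NN and physics-based imputation are not shown.

```python
def interpolate_gaps(values, max_gap=3):
    """Linearly fill runs of None no longer than max_gap; leave longer gaps as None."""
    filled = list(values)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            j = i
            while j < len(filled) and filled[j] is None:
                j += 1
            # Fill only interior gaps bounded by valid readings on both sides.
            if 0 < i and j < len(filled) and (j - i) <= max_gap:
                left, right = filled[i - 1], filled[j]
                for k in range(i, j):
                    filled[k] = left + (right - left) * (k - i + 1) / (j - i + 1)
            i = j
        else:
            i += 1
    return filled

def sensors_disagree(primary, redundant, tolerance=0.5):
    """True where two sensors on the same quantity differ by more than tolerance."""
    return [abs(a - b) > tolerance for a, b in zip(primary, redundant)]

short_gap = interpolate_gaps([7.0, None, None, 7.6])
long_gap = interpolate_gaps([7.0, None, None, None, None, 8.0])
disagree = sensors_disagree([7.0, 7.1, 7.2], [7.1, 7.1, 8.0])
print(short_gap, long_gap, disagree)
```

Capping the gap length matters: interpolating across a multi-hour outage manufactures data the plant never produced, and a model trained on it learns the interpolation rather than the system.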

4. Feature Engineering: Digitalizing Domain Knowledge

Feature engineering is the process of translating HVAC engineers' domain knowledge into numerical features that ML models can understand. Raw sensor readings (temperature, pressure, flow) require transformation before they can be effectively input to models. This is precisely where HVAC engineers provide irreplaceable value on AI teams.

Key Derived Features

Chiller efficiency: kW/RT = Compressor power / Cooling tonnage, where cooling tonnage is calculated from chilled water flow rate and temperature differential
Cooling tower approach temperature: Temperature of the condenser water leaving the cooling tower minus outdoor wet-bulb temperature, reflecting cooling tower operating efficiency
Part Load Ratio (PLR): Current cooling load / Rated capacity, a core indicator for predicting equipment performance
Temporal features: Time of day, day of week, whether it's a holiday -- capturing cyclical patterns in HVAC usage
Lag Features: Adding values from the previous 1-6 time steps, allowing models to understand system dynamic response
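
The derived features above can be sketched as follows. Assumptions: flow is in L/s, water density is taken as roughly 1 kg/L with a specific heat of 4.186 kJ/(kg*K), 1 RT = 3.517 kW, and all dictionary keys are hypothetical column names. The sine/cosine encoding of the hour is one common way to represent time of day, not something prescribed by this article.

```python
import math

KW_PER_RT = 3.517   # 1 refrigeration ton = 3.517 kW
CP_WATER = 4.186    # kJ/(kg*K); water density taken as ~1 kg/L

def cooling_load_rt(flow_lps, chw_return_c, chw_supply_c):
    """Cooling load in RT from chilled water flow (L/s) and temperature differential."""
    load_kw = flow_lps * CP_WATER * (chw_return_c - chw_supply_c)
    return load_kw / KW_PER_RT

def derive_features(row, rated_rt):
    load_rt = cooling_load_rt(row["flow_lps"], row["chwr_c"], row["chws_c"])
    hour_angle = 2 * math.pi * row["hour"] / 24
    return {
        "kw_per_rt": row["compressor_kw"] / load_rt,
        "approach_c": row["cw_leaving_tower_c"] - row["wet_bulb_c"],
        "plr": load_rt / rated_rt,
        # Cyclical encoding keeps 23:00 and 00:00 close together.
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
    }

def add_lags(series, n_lags):
    """For each step t >= n_lags, return [x[t-1], ..., x[t-n_lags]]."""
    return [[series[t - k] for k in range(1, n_lags + 1)]
            for t in range(n_lags, len(series))]

# One synthetic operating point; values are invented for illustration.
row = {"flow_lps": 50.0, "chwr_c": 12.0, "chws_c": 7.0, "compressor_kw": 180.0,
       "cw_leaving_tower_c": 30.0, "wet_bulb_c": 26.0, "hour": 14}
features = derive_features(row, rated_rt=500.0)
lags = add_lags([1, 2, 3, 4], n_lags=2)
print(features, lags)
```

Rows where the chiller is off or the temperature differential is near zero should be filtered out before computing kW/RT, since the division becomes undefined or meaningless there.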

Fan et al. (2018) showed that ML models with proper feature engineering can achieve 15-25% improvement in building energy prediction accuracy^[6]. Feature engineering quality directly determines the model's ceiling -- even with state-of-the-art deep learning architectures, low-quality or irrelevant input features will limit model performance.

5. ML Methodology Primer: From Linear Regression to Deep Learning

For HVAC engineers, ML model selection should not chase the latest, most complex algorithms, but rather make pragmatic choices based on data volume, problem complexity, and interpretability requirements.

Supervised Learning

When clear input-output correspondences exist (e.g., using sensor data to predict chiller kW/RT), supervised learning is the most direct choice. Linear regression suits simple relationships, Random Forest handles nonlinear relationships well and is robust to outliers, and gradient boosting trees (XGBoost/LightGBM) demonstrate top-tier performance in most tabular data prediction tasks.
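
To illustrate the simplest of these options, here is a single-feature least-squares fit written with the standard library only. The (PLR, compressor kW) pairs are synthetic, the interleaved train/test split is purely for illustration, and a real project would more likely use an established ML library.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(x, y, slope, intercept):
    """Mean absolute error of the fitted line on (x, y)."""
    return sum(abs(slope * xi + intercept - yi) for xi, yi in zip(x, y)) / len(x)

# Synthetic (PLR, compressor kW) pairs, invented for illustration only.
plr = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
kw = [70, 88, 105, 124, 146, 170, 197, 228]

# Fit on every other sample and evaluate on the held-out ones.
train_x, train_y = plr[::2], kw[::2]
test_x, test_y = plr[1::2], kw[1::2]
slope, intercept = fit_line(train_x, train_y)
mae = mean_abs_error(test_x, test_y, slope, intercept)
print(f"kW = {slope:.1f} * PLR + {intercept:.1f}, held-out MAE = {mae:.1f} kW")
```

The synthetic data is deliberately slightly curved, so a straight line leaves systematic residuals at low and high load; that pattern is the usual signal to move to a nonlinear model such as the tree-based methods named above.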

Unsupervised Learning

When labeled data is unavailable (e.g., when the explicit definition of "normal" operation is unknown), unsupervised learning can discover latent patterns in data. Principal Component Analysis (PCA) can be used for dimensionality reduction and anomaly detection, k-means clustering can group operating modes, and Autoencoders can learn the data distribution of normal operation, flagging situations that deviate from this distribution as anomalies.
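
As a small illustration of grouping operating modes, the sketch below runs plain k-means on a single feature (PLR). The data is synthetic, the evenly spaced initial centroids are a simplification to keep the result deterministic, and k must be at least 2; real operating-mode clustering would use several features and a library implementation.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain k-means on scalar values; centroids start evenly spread over the range."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: mean of each cluster (keep the old centroid if a cluster is empty).
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels

# Synthetic part load ratios covering low, medium, and high load operation.
plr_samples = [0.22, 0.25, 0.28, 0.55, 0.58, 0.60, 0.92, 0.95]
centroids, labels = kmeans_1d(plr_samples, k=3)
print([round(c, 3) for c in centroids], labels)
```

No labels were supplied, yet the three load regimes separate on their own; a new sample far from every centroid is a candidate anomaly worth inspecting.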

Time Series Methods

HVAC data is inherently time series, with clear daily and seasonal cycles. LSTM (Long Short-Term Memory) networks and Transformer architectures have achieved significant advances in time series prediction in recent years. A research team led by Professor Miller at the National University of Singapore, with collaborators including Professor Nagy, systematically compared the building energy prediction performance of various ML methods using the Building Data Genome 2 dataset^[7], providing an empirical basis for engineers' method selection.

6. Local Challenges in Taiwan

Promoting AI HVAC applications in Taiwan faces several localization-specific challenges beyond the technology itself:

Insufficient sensors in existing buildings: Many existing commercial buildings in Taiwan have BMS sensor coverage far below AI model requirements; sensor retrofitting involves practical challenges of piping, wiring, and construction coordination
Proprietary BMS platforms: Some local BMS vendors operate closed systems without standard protocol interfaces, requiring custom gateway development for data access
High-temperature, high-humidity data characteristics: Taiwan's year-round high humidity makes latent heat load proportions significant; ML models must specifically handle humidity-related features rather than directly applying model architectures from temperate regions
Engineering talent gap: Cross-disciplinary talent with both HVAC engineering and data science capabilities is scarce, which slows technology deployment

The high-performance control sequences defined by ASHRAE Guideline 36^[3], while primarily rule-based logic, provide detailed sensor configuration and data point specifications that serve as the minimum threshold for AI system data foundations. If Taiwan's engineering teams use Guideline 36 standards as a starting point and progressively expand sensor deployment and data infrastructure, they will establish a solid foundation for subsequent AI applications.

Conclusion

Data is the foundation of AI HVAC applications. Sensor deployment strategy, communication protocol selection, data quality assurance, feature engineering design, and ML methodology matching -- each step requires deep collaboration between HVAC engineers and data scientists. In the next article of this series, we will build on this data foundation to explore how AI enables fault detection and predictive maintenance for HVAC systems, letting the system tell you what's wrong.