I led a collaboration between the Institute of Cognitive Science and the Institute of Computer Science (both at Osnabrück University) and the Research Center Jülich to apply Bayesian inference to the prediction of environmental variables. The Institute of Computer Science provided simultaneous measurements of the leaf area index (LAI) of white winter wheat and the accompanying reflectance spectra, while our collaborators at the Research Center Jülich helped us interpret the results and relate them to physical phenomena. The collaboration resulted in a publication titled "Bayesian Hierarchical Models Can Infer Interpretable Predictions of Leaf Area Index From Heterogeneous Datasets". You can find the accompanying code here.

Modeling environmental data

Environmental modeling often relies on predicting the LAI, the one-sided leaf surface area per unit of ground area. LAI depends on factors such as soil type, crop type, and weather conditions. Since these factors change throughout the growing season, LAI estimation requires data collected over time and across different locations (our data was collected in four different fields over four different years). The resulting data can be highly heterogeneous and sparse, with some locations or years having few data points. My approach focused on building a robust model that accounts for these variations while providing reliable LAI predictions.

The biggest challenge in working with such systematically different datasets was to build an accurate predictive model while preserving the information specific to each location and year. We also wanted to include domain knowledge in the model to understand how the measured variables (in our case, different wavelengths of the recorded spectra) affect the LAI predictions. For those reasons, we chose a Bayesian approach.

Bayesian hierarchical models

To predict LAI, I used Bayesian hierarchical models, which are a class of models that allow parameters to vary on multiple levels of abstraction. In our case, it means the model parameters for different fields and years can vary around a common value that preserves similarities among the groups of measurements while allowing for specific differences between the datasets. Their flexibility makes them a strong choice for analyzing complex, multi-level data, such as LAI estimates across different regions and time periods.

To model the relationship between LAI and spectral reflectance, I used spline basis functions with adaptive knot placement. This captures the association between LAI and spectral reflectance at various wavelengths.
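As a rough illustration of how a spline basis can link spectra to a kernel over wavelengths, the sketch below builds a piecewise-linear (degree-1) spline basis with NumPy. Everything specific here is an assumption made for illustration: the wavelength grid, the hand-placed knots (a stand-in for adaptive knot placement), the random spectra and weights, and the helper name `hat_basis`. It is not the basis used in the publication.

```python
import numpy as np

def hat_basis(x, knots):
    """Evaluate a degree-1 spline (hat function) basis at x, one column per knot."""
    eye = np.eye(len(knots))
    # Interpolating the k-th unit vector over the knots yields the k-th hat function.
    return np.column_stack([np.interp(x, knots, eye[k]) for k in range(len(knots))])

# Hypothetical wavelength grid in nm (not the instrument's real sampling).
wavelengths = np.linspace(400.0, 1300.0, 181)

# Hypothetical knots, placed more densely where the spectrum is expected to matter.
knots = np.array([400.0, 500.0, 600.0, 680.0, 720.0, 900.0, 1150.0, 1200.0, 1260.0, 1300.0])

B = hat_basis(wavelengths, knots)  # shape: (n_wavelengths, n_basis)

rng = np.random.default_rng(0)
spectra = rng.uniform(0.0, 0.6, size=(5, wavelengths.size))  # synthetic reflectance spectra

# Project each spectrum onto the basis: one feature per basis function.
features = spectra @ B

# Synthetic basis weights; in a fitted model these would be inferred parameters.
weights = rng.normal(size=knots.size)

# The same weights define a kernel-like function over wavelengths.
kernel = B @ weights
```

With this construction, a linear model on the spline features is equivalent to weighting the raw spectrum by a smooth function of wavelength (`features @ weights` equals `spectra @ kernel`), which is what makes the result readable as a kernel over the spectrum.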

I then developed three Bayesian hierarchical generalized linear models with increasing complexity:

Baseline model – All data is pooled together without distinguishing between fields or years.
Model with a dataset-specific bias term (Figure 1) – Introduces a bias parameter for each dataset to account for the variation in scale between the datasets.
Full hierarchical model – Further adds a bias parameter for each year to capture temporal variation.

Figure 1: Dependency graph of the hierarchical model with dataset-specific bias terms. The predictions for each dataset j depend on an individual bias parameter bj, which in turn depends on the shared mean bias parameter b*.
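
One way to write down how the three models differ is through their linear predictors. The notation below is illustrative and not taken from the publication: x_ij is the feature vector of observation i in dataset j, w the shared weights, g the link function, c_t a bias for year t, and the Normal priors and scale parameters are assumptions. Only b_j and b* correspond to symbols in Figure 1.

```latex
\begin{aligned}
\text{Baseline:}\quad & \eta_{i} = b + \mathbf{x}_{i}^{\top}\mathbf{w} \\
\text{Dataset-specific bias:}\quad & \eta_{ij} = b_j + \mathbf{x}_{ij}^{\top}\mathbf{w},
  \qquad b_j \sim \mathcal{N}\!\left(b^{*}, \sigma_b^{2}\right) \\
\text{Full hierarchical:}\quad & \eta_{ijt} = b_j + c_t + \mathbf{x}_{ijt}^{\top}\mathbf{w},
  \qquad c_t \sim \mathcal{N}\!\left(0, \sigma_c^{2}\right) \\
\text{Prediction:}\quad & \mathbb{E}[\mathrm{LAI}] = g^{-1}(\eta)
\end{aligned}
```

Because each b_j is drawn around the shared b*, datasets with few observations are pulled toward the common value, while well-sampled datasets can deviate from it.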

For each model, I used MCMC sampling in PyMC to infer the full posterior distribution of all model parameters and the posterior predictive distribution of the LAI predictions.

We wanted to assess how each feature contributes to the model’s predictions by computing feature importance. This method breaks the dependence between correlated features and calculates the relative change in the model’s error before and after breaking that dependence. By doing so, we can identify which wavelengths have the most influence on LAI predictions (Figure 2).
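
A common way to compute this kind of importance is permutation feature importance: shuffle one feature column at a time and compare the model's error before and after. The sketch below assumes that variant, which may differ in detail from the procedure used in the publication. The synthetic data, the least-squares stand-in model, and the function name `permutation_importance` are all hypothetical.

```python
import numpy as np

def permutation_importance(predict, X, y, rng, n_repeats=20):
    """Ratio of mean squared error after vs. before shuffling each feature column."""
    base_error = np.mean((predict(X) - y) ** 2)
    importance = np.zeros(X.shape[1])
    for k in range(X.shape[1]):
        ratios = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column k breaks its link to the target and to other features.
            X_perm[:, k] = rng.permutation(X[:, k])
            ratios.append(np.mean((predict(X_perm) - y) ** 2) / base_error)
        importance[k] = np.mean(ratios)
    return importance

# Synthetic example: features 0 and 2 matter, features 1 and 3 do not.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
true_w = np.array([2.0, 0.0, 0.5, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)

# Stand-in model: ordinary least squares instead of a Bayesian hierarchical model.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
importance = permutation_importance(lambda A: A @ w_hat, X, y, rng)

# Normalize by the average importance, as in panel (B) of Figure 2.
relative = importance / importance.mean()
```

A ratio near 1 means shuffling the feature barely changes the error, so the model does not rely on it. Larger ratios mark the features the predictions depend on most.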

Figure 2: Inferred kernel function and feature importance. (A) shows the posterior distribution of the inferred kernel function. The black line represents the expected kernel. We can relate several ranges of the reflectance spectrum to physical phenomena, namely effects due to green leaf pigment [400–700 nm] and photosynthetic capacity [495–680 nm, peak at 670 nm] and the red edge region [690–720 nm] in the visible light range, as well as the canopy’s water content [1,150 nm–1,260 nm, peak absorption at around 1,200 nm] in the near-infrared range. (B) shows a stem-plot of the relative importance of each feature (enumerated; normalized by the average feature importance) as well as the resulting estimated importance of each wavelength.

Conclusion

Each model performs well in predicting LAI. In addition to the predictions, all three models learn an interpretable kernel-like function of reflectance spectra. The main difference between the models is their complexity and how well the inferred kernel function describes the model. By computing feature importance, we can see how parts of reflectance spectra (which correspond to physical phenomena) contribute to predictions, making the model interpretable. Our analysis showed that adding a bias parameter for each dataset significantly improves the model, while adding a bias for each year increases complexity with little additional gain.
