The European Commission is proposing to create soil ‘districts’ to improve the evaluation and monitoring of soil health in European Union member states so that all EU soils are in a healthy condition by 2050 (https://environment.ec.europa.eu/publications/proposal-directive-soilmonitoring-and-resilience_en). European scientists are collaborating with land users and national environmental agencies to provide an operational definition of such districts so that they are maximally effective. These districts of homogeneous soils must be large enough to be managed by a single authority, for soil-health assessment and for implementing policy. Given the diversity of soils in the EU, using homogeneity with respect to soil class or property could lead to an unmanageable number of districts. Spatial units should instead be homogeneous in terms of soil-landscape relationships, or ‘soilscapes’ (https://doi.org/10.1016/S0016-7061(00)00101-4). A district would then comprise a limited sequence of soil types amenable to similar management and policy. An effective soil district definition must also consider the national borders of member states and local EU administrative units, ensuring that the spatial contiguity of the districts is maintained.
Spatial evaluation of the soil's capacity and condition to store carbon across Australia
Alexandre M J-C Wadoux, Mercedes Román Dobarco, Wartini Ng, and
1 more author
The soil security concept has been put forward to maintain and improve soil resources so that they can, inter alia, provide food and clean water, mitigate and adapt to climate change, and protect ecosystems. A provisional framework suggested indicators for the soil security dimensions and a methodology for their quantification. In this study, we illustrate the framework for the soil carbon storage function and the two dimensions of soil capacity and soil condition. The methodology consists of (i) the selection and quantification of a small set of soil indicators for capacity and condition, (ii) the transformation of indicator values into unitless utility values via expert-generated utility graphs, and (iii) a two-level aggregation of the utility values, by soil profile and by dimension. For capacity, we used three indicators: total organic carbon, total inorganic carbon, and mineral-associated organic carbon in the fine fraction (MAOC). Their reference values were estimated from existing maps of pedogenons and current land use, using areas of remnant genosoils for total organic and inorganic carbon and the 90th percentile for MAOC. For condition, we used the same set of indicators, but this time comparing their estimated current values with the reference-state values calculated for capacity. The methodology was applied to the whole of Australia at a spatial resolution of 90 m x 90 m. The results show that the unitless indicator values supporting the function varied greatly across Australia. Aggregation of the indicators into the two dimensions of capacity and condition revealed that most of Australia has a relatively low capacity to support the function, but that most soils are in a generally good condition relative to that capacity, with some exceptions in agricultural areas; more sampling of the remnant genosoils is, however, required for corroboration and improvement.
The maps of capacity and condition may serve as a basis for estimating a spatially explicit local index of Australia’s soil resilience to the threat of decarbonization.
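The utility-based aggregation described in steps (ii) and (iii) can be sketched in a few lines. This is a minimal illustration, not the paper's actual workflow: the saturating utility curve and all indicator values below are invented stand-ins for the expert-generated utility graphs and mapped indicators.

```python
import numpy as np

# Hypothetical utility graph: a simple saturating curve stands in
# for the expert-generated utility graphs of the paper
def utility(x, half_saturation):
    x = np.asarray(x, dtype=float)
    return x / (x + half_saturation)

# Invented indicator values for one soil profile (three depths)
toc  = np.array([2.5, 1.4, 0.6])   # total organic carbon, %
tic  = np.array([0.3, 0.2, 0.1])   # total inorganic carbon, %
maoc = np.array([1.2, 0.8, 0.4])   # mineral-associated OC, %

# Step (ii): transform indicator values into unitless utilities
u = np.vstack([utility(toc, 1.0),
               utility(tic, 0.5),
               utility(maoc, 1.0)])

# Step (iii): two-level aggregation, over depths (profile level)
# and then over indicators (dimension level), using simple means
profile_utility = u.mean(axis=1)     # one utility per indicator
capacity = profile_utility.mean()    # aggregated capacity score
print(round(capacity, 3))
```

The same aggregation applied to current rather than reference values, and compared against the capacity score, would give the condition dimension.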
Drivers and human impacts on topsoil bacterial and fungal community biogeography across Australia
Peipei Xue, Budiman Minasny, Alexandre M J-C Wadoux, and
4 more authors
Soil microbial diversity mediates a wide range of key processes and ecosystem services influencing planetary health. Our knowledge of microbial biogeography patterns, their spatial drivers and human impacts at the continental scale remains limited. Here, we reveal the drivers of bacterial and fungal community distribution in Australian topsoils using 1384 soil samples from diverse bioregions. Our findings highlight that climatic factors, particularly precipitation and temperature, along with soil properties, are the primary drivers of topsoil microbial biogeography. Using random forest machine-learning models, we generated high-resolution maps of soil bacteria and fungi across continental Australia. The maps revealed microbial hotspots: for example, the eastern, southeastern and western coasts were dominated by Proteobacteria and Acidobacteria. Fungal distribution is strongly influenced by precipitation, with Ascomycota dominating the central region. This study also demonstrated the impact of human modification on the belowground microbial community at the continental scale: it significantly increased the relative abundance of Proteobacteria and Ascomycota, but decreased that of Chloroflexi and Basidiomycota. The variations in microbial phyla could be attributed to distinct responses to environmental factors altered by human modification. This study provides insights into the biogeography of soil microbiota that are valuable for regional soil biodiversity assessments and for monitoring microbial responses to global changes.
Multivariate regional deep learning prediction of soil properties from near-infrared, mid-infrared and their combined spectra
Rumbidzai W Nyawasha, Alexandre MJ-C Wadoux, Pierre Todoroff, and
5 more authors
Artificial neural network (ANN) models have been successfully used in infrared spectroscopy research for the prediction of soil properties. They often show better performance than conventional methods such as partial least squares regression (PLSR). In this paper we develop and evaluate a multivariate extension of ANN for predicting correlated soil properties: total carbon (C), total nitrogen (N), clay, silt, and sand contents, using visible near-infrared (vis-NIR), mid-infrared (MIR) or combined spectra (vis-NIR + MIR). We hypothesize that accounting for the correlation through joint modelling of soil properties with a single model can eliminate “pedological chimeras”: unrealistic values that may arise when properties are predicted independently, such as when calculating ratios or soil texture values. We tested two types of ANN models, a univariate (ANN-UV) and a multivariate model (ANN-MV), using a dataset of 228 soil samples collected from Murehwa district in Zimbabwe at two soil depth intervals (0–20 and 20–40 cm). The models were compared with results from a univariate PLSR (PLSR-UV) model. We found that the multivariate ANN model was better at conserving the observed correlations between properties and consequently gave realistic soil C:N and C:Clay ratios, but that there was no improvement in prediction accuracy over using a univariate model (ANN or PLSR). The use of combined spectra (vis-NIR + MIR) did not make any significant improvement in prediction accuracy of the multivariate ANN model compared to using the vis-NIR or MIR only. We conclude that the multivariate ANN model is better suited for the prediction of multiple correlated soil properties, and that it is flexible and can account for compositional constraints. The multivariate ANN model helps to keep ratio values realistic – with strong implications for assessment studies that make use of such predicted soil values.
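A minimal sketch of the multivariate idea, assuming scikit-learn's `MLPRegressor` as a stand-in for the paper's ANN and synthetic data in place of the Murehwa spectra: a single network predicts two correlated properties jointly, so the outputs share one hidden representation and the observed correlation is carried into the predictions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for spectra: 200 samples x 60 "wavelengths"
X = rng.normal(size=(200, 60))

# Two correlated target properties (e.g. total C and total N),
# both driven by the same latent spectral signal
latent = X[:, :10].sum(axis=1)
y = np.column_stack([latent + rng.normal(scale=0.3, size=200),
                     0.1 * latent + rng.normal(scale=0.05, size=200)])

# One multivariate network predicts both properties jointly, so the
# correlation between them is carried by the shared hidden layer
Xs = StandardScaler().fit_transform(X)
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                     random_state=0).fit(Xs, y)
pred = model.predict(Xs)

# Jointly predicted properties preserve the observed correlation
corr_obs = np.corrcoef(y[:, 0], y[:, 1])[0, 1]
corr_pred = np.corrcoef(pred[:, 0], pred[:, 1])[0, 1]
print(round(corr_obs, 2), round(corr_pred, 2))
```

A univariate comparison would fit one such model per property, with no mechanism to keep the predicted ratio of the two properties realistic.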
Applications and challenges of digital soil mapping in Africa
Andree M Nenkam, Alexandre MJ-C Wadoux, Budiman Minasny, and
8 more authors
The mapping of soils in Africa is at least a century old. We currently have access to various maps depicting mapping units locally and for the continent. In the past two decades, there has been a growing interest in alternatives for generating soil maps through digital soil mapping (DSM) techniques. There are, however, numerous challenges pertaining to the implementation of DSM in Africa, such as the unavailability of appropriate covariates, age and positional error in the measurements, low sampling density, and spatial clustering of the soil data used to fit and validate the models. This review aims to investigate the current state of DSM in Africa, identify challenges specific to implementing DSM in Africa, and examine how these have been addressed in the literature. We found that nearly half of African countries had an existing digital soil map covering either a local or national area, and that most studies were performed at a local extent. Soil carbon was the most common property under study, whereas soil hydraulic variables were seldom reported. Nearly all studies performed mapping for the topsoil up to 30 cm and calculated validation statistics using existing datasets but without collecting a post-mapping probability sample. Few studies (i.e., 11%) reported an estimate of map uncertainty. Half of the studies had a downstream application (e.g., soil fertility assessment) in mind when generating the map. We further correlated the area of study and sampling density and found a strong negative relationship. About 30% of the studies relied on legacy soil datasets and lacked sufficient spatial coverage of their area of study. From this review, we highlight some research opportunities and suggest improvements in the current methodologies. Future research should focus on capacity building in DSM, new data collection, and legacy data rescue.
New initiatives, initiated and led from within the continent, could support the long-term monitoring of soils and the updating of soil information systems while ensuring their contextualised usability. This pairs with better delivery of existing DSM studies to stakeholders and the generation of a value-added proposition for governmental institutions.
Some limitations of the concordance correlation coefficient to characterise model accuracy
Perusal of the environmental modelling literature reveals that Lin’s concordance correlation coefficient is a popular validation statistic for characterising model or map quality. In this communication, we illustrate with synthetic examples three undesirable statistical properties of this coefficient. We argue that ignorance of these properties has led to frequent misuse of the coefficient in modelling and mapping studies. The stand-alone use of the concordance correlation coefficient is insufficient because i) it does not inform on the relative contributions of bias and correlation, ii) its values cannot be compared across different datasets or studies, and iii) it is prone to the same problems as other linear correlation statistics. The concordance correlation coefficient was, in fact, initially conceived for evaluating reproducibility over repeated trials of the same variable, not for characterising model accuracy. For the validation of models and maps, we recommend calculating statistics that, combined with the concordance correlation coefficient, represent various aspects of model or map quality and can be visualised together in a single figure with a Taylor or solar diagram.
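The decomposition behind point i) can be made concrete. Lin's coefficient factors as CCC = r × C_b, where r is the Pearson correlation (precision) and C_b a bias correction factor (accuracy); the sketch below shows a perfectly correlated but biased prediction for which r = 1 yet CCC = 0.5.

```python
import numpy as np

def lin_ccc(obs, pred):
    """Lin's concordance correlation coefficient and its decomposition
    into Pearson correlation r (precision) and a bias correction
    factor C_b (accuracy), so that CCC = r * C_b."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    mo, mp = obs.mean(), pred.mean()
    so2, sp2 = obs.var(), pred.var()
    r = np.corrcoef(obs, pred)[0, 1]
    ccc = 2 * r * np.sqrt(so2 * sp2) / (so2 + sp2 + (mo - mp) ** 2)
    return ccc, r, ccc / r

# A biased but perfectly correlated prediction: r = 1, yet CCC < 1
obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
pred = obs + 2.0                      # systematic offset only
ccc, r, cb = lin_ccc(obs, pred)
print(round(ccc, 3), round(r, 3), round(cb, 3))   # 0.5 1.0 0.5
```

Reporting only CCC = 0.5 hides whether the loss of agreement comes from scatter or from bias, which is exactly the ambiguity the communication warns against.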
A global numerical classification of the soil surface layer
The quest for a global soil classification system has been a long-standing challenge in soil science. There currently exist two, seemingly disjoint, global soil classification systems, the USDA Soil Taxonomy and the World Reference Base for Soil Resources, and many regional and national systems. While both systems are acknowledged as international, there remain various examples of their shortcomings in accounting for topsoil features, in local applications, and in communication with established regional classification systems. This calls for a numerical soil classification that addresses these discrepancies and achieves harmonization with existing national systems. In this paper, we report on the development of a natural layer classification system (as opposed to the classification of soil profile entities) as a first step towards a comprehensive global numerical soil classification not based on a priori defined classes. We implemented a modelling approach with a set of key soil properties predicted globally for the soil surface layer over the same 0–5 cm depth range. The set of properties was partitioned into homogeneous and disjoint classes using the k-means clustering algorithm. Next, we investigated the pattern of variation of the clusters in association with the soil property maps using principal component analysis. A three-component nomenclature system is derived in a transformed space of the class-specific centroids to account for the uneven distribution of the centroids in the principal component space. We show that it is possible to build a data-based, objective numerical taxonomic classification of soil layers, and that existing sets of key soil properties, predicted separately, coalesce into identifiable clusters or classes and manifest discernible spatial and/or pedological patterns. This grouping of key soil properties into logical categories is a possible step towards better defining diagnostic horizon features and suggesting new ones.
The general-purpose map of soil surface layer classes of the world also has potential applications in assessing soil change and designing monitoring surveys.
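A compact sketch of the clustering-plus-ordination step, with invented property values in place of the global 0–5 cm layers: k-means partitions the property space into disjoint classes, and the class centroids are then examined in principal-component space, as done before deriving the nomenclature.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical stand-in for gridded key soil properties of the
# 0-5 cm layer: pH, SOC, clay, sand, CEC for 300 "pixels" drawn
# from three distinct soil regimes
centers = np.array([[5.0, 4.0, 35, 20, 25],
                    [7.5, 1.0, 15, 60, 10],
                    [6.2, 2.5, 25, 40, 18]])
X = np.vstack([c + rng.normal(scale=[0.2, 0.3, 3, 4, 2], size=(100, 5))
               for c in centers])

# Partition the property space into disjoint classes with k-means
Xs = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Xs)

# Inspect the class centroids in principal-component space
pca = PCA(n_components=2).fit(Xs)
centroids_pc = pca.transform(km.cluster_centers_)
print(centroids_pc.shape)
```

In the paper the number of classes and the centroid transformation are considerably more elaborate; this only shows the mechanics of the two building blocks.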
2023
Remote sensing of the Earth’s soil color in space and time
Rodnei Rizzo, Alexandre MJ-C Wadoux, José AM Demattê, and
8 more authors
Soil color is a key indicator of soil properties and conditions, exerting influence on both agronomic and environmental variables. Conventional methods for soil color determination have come under scrutiny due to their limited accuracy and reliability. In response to these concerns, we developed an innovative system that leverages 35 years of satellite imagery in conjunction with in-situ soil spectral measurements. This approach enables the creation of a global soil color map with a fine spatial resolution of 30 m x 30 m. The system initially identifies bare earth areas worldwide using reflectance bands acquired from Landsat 4 through Landsat 8 between 1985 and 2020. Soil color was quantified using the CIE-XYZ coordinates, utilizing 8005 soil spectral measurements within the visible range (380–780 nm) as ground truth data. We established transfer functions to convert Landsat reflectance bands to standardized XYZ color coordinates. These transfer functions were subsequently applied to images of bare surfaces, covering approximately 38.5% of the Earth’s surface. We validated the resulting global soil color map using statistical indices derived from an independent set of ground-truth spectral data, demonstrating a high degree of agreement. By creating the world’s first global soil color map, we have set a baseline for future spatial and temporal monitoring of soil conditions, thus enhancing our understanding and management of our planet’s vital soil resources.
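The transfer-function step can be sketched with a simple least-squares fit; the band reflectances and XYZ targets below are simulated stand-ins, and the paper's actual transfer functions may differ in form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transfer function: a linear map from six Landsat
# reflectance bands to CIE-XYZ colour coordinates, fitted by least
# squares against lab-derived XYZ values (both simulated here)
bands = rng.uniform(0.05, 0.45, size=(300, 6))   # surface reflectance
true_T = rng.uniform(-0.5, 1.0, size=(6, 3))     # unknown mapping
xyz = bands @ true_T + rng.normal(scale=0.01, size=(300, 3))

# Fit one multivariate linear transfer function for all three
# colour coordinates at once
T, *_ = np.linalg.lstsq(bands, xyz, rcond=None)
xyz_pred = bands @ T

# Overall coefficient of determination across the three coordinates
r2 = float(1 - ((xyz - xyz_pred) ** 2).sum()
           / ((xyz - xyz.mean(0)) ** 2).sum())
print(round(r2, 3))
```

Applied pixel-wise to a bare-earth composite, such a fitted `T` would turn reflectance bands into a colour map; the published system calibrates against 8005 in-situ spectral measurements rather than simulated targets.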
Spatio-temporal modelling of biophysical soil indicators and of functions provided by ecosystems
Spectroscopic modelling of soil has advanced greatly with the development of large spectral libraries, computational resources and statistical modelling. The use of complex statistical and algorithmic tools from the field of machine learning has become popular for predicting soil properties from their visible, near- and mid-infrared spectra. Many users, however, find it difficult to trust the predictions made with machine learning. We lack interpretation and understanding of how the predictions were made, so that these models are often referred to as black boxes. In this study, I report on the development and application of a model-independent method for interpreting complex machine learning spectroscopic models. The method relies on Shapley values, a statistical approach originally developed in coalitional game theory. In a case study predicting total organic carbon from a large European mid-infrared spectroscopic database, I fitted a random forest machine learning model and showed how Shapley values can help us understand (i) the average contribution of individual wavenumbers, (ii) the spectrum-specific contribution of wavenumbers, and (iii) the average contribution for groups of spectra with similar characteristics taken together. The results show that Shapley values revealed more insights than commonly used interpretation methods based on variable importance. The most striking spectral regions identified as important contributors to the prediction corresponded to the molecular vibrations of organic and inorganic compounds that are known to relate to organic carbon. Shapley values are a useful methodological development that will yield better understanding and trust of complex machine learning and algorithmic tools in soil spectroscopy research.
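The Shapley machinery itself is small enough to write down exactly for toy models. The sketch below enumerates all coalitions (feasible only for a handful of features; spectroscopic models need approximations) and verifies the known closed form for a linear model, where feature i receives w_i (x_i − baseline_i).

```python
import itertools
import math
import numpy as np

def shapley_values(f, x, baseline):
    """Exact Shapley values of prediction f(x), enumerating all
    coalitions. Absent features are set to a baseline value (here
    the all-zeros vector), a common simplification."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in itertools.combinations(others, k):
                # Shapley weight of coalition S for player i
                w = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                     / math.factorial(n))
                z = baseline.copy()
                z[list(S)] = x[list(S)]
                without = f(z)          # prediction without feature i
                z[i] = x[i]
                phi[i] += w * (f(z) - without)
    return phi

# For a linear model f(x) = w.x the Shapley value of feature i is
# w_i * (x_i - baseline_i), which we can verify exactly
w = np.array([2.0, -1.0, 0.5])
f = lambda z: float(w @ z)
x = np.array([1.0, 3.0, -2.0])
base = np.zeros(3)
phi = shapley_values(f, x, base)
print(phi)   # [ 2. -3. -1.]
```

The efficiency property, that the values sum to f(x) − f(baseline), is what lets per-wavenumber contributions be read as an additive decomposition of a single predicted carbon value.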
Mapping soil organic carbon fractions for Australia, their stocks, and uncertainty
Mercedes Román Dobarco, Alexandre M J C Wadoux, Brendan Malone, and
3 more authors
Soil organic carbon (SOC) is the largest terrestrial carbon pool. SOC is composed of a continuous set of compounds with different chemical compositions, origins, and susceptibilities to decomposition that are commonly separated into pools characterised by different responses to anthropogenic and environmental disturbance. Here we map the contribution of three SOC fractions to the total SOC content of Australia’s soils. The three SOC fractions, mineral-associated organic carbon (MAOC), particulate organic carbon (POC), and pyrogenic organic carbon (PyOC), represent SOC components with distinct turnover rates, chemistry, and formation pathways. Data for MAOC, POC, and PyOC were obtained with near- and mid-infrared spectral models calibrated with measured SOC fractions. We transformed the data using an isometric log-ratio (ilr) transformation to account for the closed compositional nature of SOC fractions. The resulting back-transformed ilr components were mapped across Australia. SOC fraction stocks for 0–30 cm were derived with maps of total organic carbon concentration, bulk density, coarse fragments, and soil thickness. Mapping was done by a quantile regression forest fitted with the ilr-transformed data and a large set of environmental variables as predictors. The resulting maps along with the quantified uncertainty show the unique spatial pattern of SOC fractions in Australia. MAOC dominated the total SOC with an average of 59% ± 17%, whereas 28% ± 17% was PyOC and 13% ± 11% was POC. The allocation of total organic carbon (TOC) to the MAOC fraction increased with depth. SOC vulnerability (i.e. POC/(MAOC + PyOC)) was greater in areas with Mediterranean and temperate climates. TOC and the distribution among fractions were the most influential variables in SOC fraction uncertainty.
Further, the diversity of climatic and pedological conditions suggests that different mechanisms will control SOC stabilisation and dynamics across the continent, as shown by the model covariates’ importance metric. We estimated the total SOC stocks (0–30 cm) to be 13 Pg MAOC, 2 Pg POC, and 5 Pg PyOC, which is consistent with previous estimates. The maps of SOC fractions and their stocks can be used for modelling SOC dynamics and forecasting changes in SOC stocks as a response to land use change, management, and climate change.
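The ilr step can be illustrated directly. This sketch builds an orthonormal Helmert-type basis, transforms a hypothetical three-part (MAOC, POC, PyOC) composition, and inverts the transform; the fraction values are illustrative, not Australian averages.

```python
import numpy as np

def ilr_basis(D):
    """Orthonormal Helmert-type contrast basis for the isometric
    log-ratio transform of a D-part composition (D x (D-1))."""
    V = np.zeros((D, D - 1))
    for i in range(1, D):
        V[:i, i - 1] = 1.0 / i
        V[i, i - 1] = -1.0
        V[:, i - 1] *= np.sqrt(i / (i + 1.0))
    return V

def ilr(x):
    """Forward ilr: D parts -> D-1 unconstrained coordinates."""
    x = np.asarray(x, float)
    return np.log(x) @ ilr_basis(x.size)

def ilr_inv(z):
    """Inverse ilr with closure back to proportions."""
    D = z.size + 1
    y = np.exp(ilr_basis(D) @ z)
    return y / y.sum()

# MAOC, POC, PyOC as shares of total SOC (illustrative values)
frac = np.array([0.59, 0.13, 0.28])
z = ilr(frac)          # model/map these two coordinates freely
back = ilr_inv(z)      # back-transform recovers a valid composition
print(np.round(back, 2))
```

Because the two ilr coordinates are unconstrained real numbers, they can be predicted by a quantile regression forest without risking fraction estimates that are negative or fail to sum to the total.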
Uncertainty of spatial averages and totals of natural resource maps
Global, continental and regional maps of concentrations, stocks and fluxes of natural resources provide baseline data to assess how ecosystems respond to human disturbance and global warming. They are also used as input to numerous modelling efforts. But these maps suffer from multiple error sources and hence it is good practice to report estimates of the associated map uncertainty, so that users can evaluate their fitness for use. We explain why quantification of uncertainty of spatial aggregates is more complex than uncertainty quantification at point support, because it must account for spatial autocorrelation of the map errors. Unfortunately, this is not done in a number of recent high-profile studies. We describe how spatial autocorrelation of map errors can be accounted for with block kriging, a method that requires geostatistical expertise. Next, we propose a new, model-based approach that avoids the numerical complexity of block kriging and is feasible for large-scale studies where maps are typically made using machine learning. Our approach relies on Monte Carlo integration to derive the uncertainty of the spatial average or total from point support prediction errors. We account for spatial autocorrelation of the map error by geostatistical modelling of the standardized map error. We show that the uncertainty strongly depends on the spatial autocorrelation of the map errors. In a first case study, we used block kriging to show that the uncertainty of the predicted topsoil organic carbon in France decreases when the support increases. In a second case study, we estimated the uncertainty of spatial aggregates of a machine learning map of the aboveground biomass in Western Africa using Monte Carlo integration. We found that this uncertainty was small because of the weak spatial autocorrelation of the standardized map errors. We present a tool to obtain realistic estimates of the uncertainty of spatial averages and totals from natural resource maps.
The method presented in this paper is essential for parties that need to evaluate whether differences in aggregated environmental variables or natural resources between regions or over time are statistically significant.
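The core point, that the uncertainty of a spatial average depends on the autocorrelation of the map errors, can be sketched analytically for a small grid. The exponential correlogram and unit prediction errors below are assumptions for illustration, not the case-study models.

```python
import numpy as np

def var_of_spatial_mean(coords, sd, corr_range):
    """Variance of the spatial average of a mapped variable, given
    point prediction standard deviations and an assumed exponential
    correlogram of the standardized map error."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    rho = np.exp(-d / corr_range)     # error autocorrelation
    cov = np.outer(sd, sd) * rho      # error covariance matrix
    n = len(sd)
    return cov.sum() / n ** 2         # Var(mean) = 1'C1 / n^2

# 10 x 10 grid of pixels, each with unit prediction standard deviation
gx, gy = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
sd = np.ones(100)

# The same per-pixel errors give very different aggregate uncertainty
weak = var_of_spatial_mean(coords, sd, corr_range=0.5)
strong = var_of_spatial_mean(coords, sd, corr_range=10.0)
print(round(weak, 4), round(strong, 4))
```

With independent errors the variance of the mean would be 1/n; positive autocorrelation inflates it, which is why ignoring it makes aggregate uncertainty look misleadingly small.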
A proposal for the assessment of soil security: Soil functions, soil services and threats to soil
Sandra J Evangelista, Damien J Field, Alex B McBratney, and
5 more authors
Human societies face six existential challenges to their sustainable development. These challenges have been previously addressed by a myriad of concepts such as soil conservation, soil quality, and soil health. Yet, of these, only soil security attempts to integrate the six existential challenges concurrently through the five biophysical and socio-economic dimensions of capacity, condition, capital, connectivity and codification. In this paper, we highlight past and existing concepts, and make a proposal for a provisional assessment of soil security. The proposal addresses three roles of soil: soil functions, soil services and threats to soil. For each identified role, we indicate a potential, but not exhaustive, list of indicators that characterise the five dimensions of soil security. We also briefly raise issues of the quantification and combination of indicators. We found that capacity and condition are theoretically easier to measure and quantify than connectivity and codification. The dimension capital might be conveniently assessed using indicators that relate to the economic value of soils. The next step is to test this proposal, for which we make recommendations on potential case studies and examples. We conclude that the five dimensions of soil security can potentially be assessed quantitatively and comprehensively using indicators that characterise each role, but also found that there is a need for further work to devise an operational measurement methodology to estimate the connectivity of people to soil.
Baseline high-resolution maps of organic carbon content in Australian soils
Alexandre M JC Wadoux, Mercedes Román Dobarco, Brendan Malone, and
3 more authors
We introduce a new dataset of high-resolution gridded total soil organic carbon content data produced at 30 m × 30 m and 90 m × 90 m resolutions across Australia. For each product resolution, the dataset consists of six maps of soil organic carbon content along with an estimate of the uncertainty represented by the 90% prediction interval. Soil organic carbon maps were produced up to a depth of 200 cm, for six intervals: 0–5 cm, 5–15 cm, 15–30 cm, 30–60 cm, 60–100 cm and 100–200 cm. The maps were obtained through interpolation of 90,025 depth-harmonized organic carbon measurements using quantile regression forest and a large set of environmental covariates. Validation with 10-fold cross-validation showed that all six maps had relatively small errors and that prediction uncertainty was adequately estimated. The soil carbon maps provide a new baseline from which change in future carbon stocks can be monitored and the influence of climate change, land management, and greenhouse gas offset can be assessed.
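A rough sketch of forest-based prediction intervals, using scikit-learn's `RandomForestRegressor` and the spread of per-tree predictions as an approximation; a true quantile regression forest as used for these maps instead retains the full response distribution within each leaf. All data below are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for soil organic carbon vs. environmental
# covariates (the real model uses a large covariate set)
X = rng.uniform(size=(500, 4))
y = 3 * X[:, 0] + np.sin(4 * X[:, 1]) + rng.normal(scale=0.3, size=500)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Approximate a 90% prediction interval from the spread of the
# per-tree predictions at new locations
x_new = rng.uniform(size=(20, 4))
per_tree = np.stack([t.predict(x_new) for t in rf.estimators_])
lower = np.percentile(per_tree, 5, axis=0)
upper = np.percentile(per_tree, 95, axis=0)
print(np.round(upper - lower, 2)[:3])
```

Whether such intervals are well calibrated, i.e. contain the true value about 90% of the time, is exactly what the cross-validation of the published maps checks.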
Estimating soil aggregate stability with infrared spectroscopy and pedotransfer functions
Thomas Chalaux Clergue, Nicolas PA Saby, Alexandre MJ-C Wadoux, and
2 more authors
Soil aggregate stability is an important indicator of soil condition and is directly related to soil degradation processes such as erosion and crusting. Aggregate stability is conventionally measured by testing the aggregate resistance to water disturbance mechanisms. Such measurements, however, are costly and time-consuming, which makes them difficult to implement at a regional or country scale. In this study, we explore two different approaches to estimate soil aggregate stability by means of commonly-measured soil properties or mid-infrared spectroscopy measurements. The first approach relies on land use and soil properties. In the second approach aggregate stability is estimated by a model fitted with mid-infrared spectroscopic data. We tested the two approaches with a dataset composed of 202 soil samples from mainland France, in which aggregate stability was measured with a fast wetting test. We found that simple linear models based on common soil properties and models based on mid-infrared spectral data yielded similar results. Interpretation of the models revealed well-known relationships: land use had a major role in predicting aggregate stability, followed by organic carbon and clay content. Overall, we conclude that both approaches offer a reliable, cheap and time-efficient alternative for estimating soil aggregate stability. These approaches offer a tool to estimate aggregate stability over large geographical areas, which can support the development of erosive risk management plans and the implementation of adaptive management strategies to mitigate threats to soil and improve the overall soil condition.
Participatory approaches for soil research and management: A literature-based synthesis
Participatory approaches to data gathering and research which involve farmers, laypeople, amateur soil scientists, concerned community members or school students have attracted much attention recently, not only to enable scientific progress but also to achieve social and educational outcomes. Non-expert participation in soil research and management is diverse and applied variously, ranging from data collection to inform large-scale monitoring schemes in citizen science projects to projects in which the participants define the object of study and the questions to be answered. The growth of participatory projects to tackle complex environmental and soil-related issues has generated literature that describes both how the projects are initiated and implemented, and the outcomes they achieve. We review the existing literature on participatory soil research and management. Existing studies are classified into three categories based on the degree of participation in the different phases of research. The quality of participation is further evaluated systematically through the five elements that participatory projects usually include: inputs, activities, outputs, outcomes and impacts. We found that the majority of existing participatory projects were contributory in nature, where participants contribute to generating data. Co-created projects, which involve a greater level of participation, are less frequent. We also found large disparities in the context in which these types of participation occurred: contributory projects were mostly documented in more economically developed countries, whereas projects that suggest greater involvement of participants were mostly formulated in developing countries in relation to soil management and conservation issues. The long-term sustained outcomes of participatory projects on human well-being and socio-ecological systems are seldom reported.
We conclude that participatory approaches are opportunities for education, communication and scientific progress and that participation is being facilitated by digital convergence. Participatory projects should, however, also be evaluated in terms of their long-term impact on the participants, to be sure that the expectations of the various parties align with the outcomes. All in all, such participation adds to the quantum of soil connectivity and in this sense makes the soil more secure globally.
Shapley values reveal the drivers of soil organic carbon stock prediction
Alexandre MJ-C Wadoux, Nicolas PA Saby, and Manuel P Martin
Insights into the controlling factors of soil organic carbon (SOC) stock variation are necessary both for our scientific understanding of the terrestrial carbon balance and to support policies that intend to promote carbon storage in soils to mitigate climate change. In recent years, complex statistical and algorithmic tools from the field of machine learning have become popular for modelling and mapping SOC stocks over large areas. In this paper, we report on the development of a statistical method for interpreting complex models, which we implemented for the study of SOC stock variation. We fitted a random forest machine learning model with 2206 measurements of SOC stocks for the 0–50 cm depth interval from mainland France and used a set of environmental covariates as explanatory variables. We introduce Shapley values, a method from coalitional game theory, and use them to understand how environmental factors influence SOC stock prediction: what is the functional form of the association in the model between SOC stocks and environmental covariates, and how does the covariate importance vary locally from one location to another and between carbon-landscape zones? Results were validated both in light of existing, well-described soil processes mediating soil carbon storage and with regard to previous studies in the same area. We found that vegetation and topography were overall the most important drivers of SOC stock variation in mainland France but that the set of most important covariates varied greatly among locations and carbon-landscape zones. In two spatial locations with equivalent SOC stocks, there was nearly an opposite pattern in the individual covariate contributions that yielded the prediction – in one case climate variables contributed positively, whereas in the second case climate variables contributed negatively – and this effect was mitigated by land use.
We demonstrate that Shapley values are a methodological development that yield useful insights into the importance of factors controlling SOC stock variation in space. This may provide valuable information to understand whether complex empirical models are predicting a property of interest for the right reasons and to formulate hypotheses on the mechanisms driving the carbon sequestration potential of a soil.
2022
Beyond prediction: methods for interpreting complex models of soil variation
Understanding the spatial variation of soil properties is central to many sub-disciplines of soil science. Commonly in soil mapping studies, a soil map is constructed through prediction by a statistical or non-statistical model calibrated with measured values of the soil property and environmental covariates of which maps are available. In recent years, the field has gradually shifted attention towards more complex statistical and algorithmic tools from the field of machine learning. These models are particularly useful for their predictive capabilities and are often more accurate than classical models, but they lack interpretability and their functioning cannot be readily visualized. There is a need to understand how these models can be used for purposes other than making accurate prediction and whether it is possible to extract information on the relationships among variables found by the models. In this paper we describe and evaluate a set of methods for the interpretation of complex models of soil variation. An overview is presented of how model-independent methods can serve the purpose of interpreting and visualizing different aspects of the model. We illustrate the methods with the interpretation of two mapping models in a case study mapping topsoil organic carbon in France. We reveal the importance of each driver of soil variation, their interaction, as well as the functional form of the association between environmental covariate and the soil property. Interpretation is also conducted locally for an area and two spatial locations with distinct land use and climate. We show that in all cases important insights can be obtained, both into the overall model functioning and into the decision made by the model for a prediction at a location. This underpins the importance of going beyond accurate prediction in soil mapping studies. 
Interpretation of mapping models reveals how the predictions are made and can help us formulate hypotheses on the underlying soil processes and mechanisms driving soil variation.
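One of the model-independent methods in this family, permutation variable importance, can be sketched in a few lines. The synthetic data and the stand-in "model" below are illustrative assumptions: the true function is used in place of a fitted learner so that the example isolates the interpretation method itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "fitted model": here the true function, which depends
# strongly on covariate 0, weakly on covariate 1, and not at all on 2.
def model(X):
    return 3.0 * X[:, 0] + 0.5 * X[:, 1]

X = rng.normal(size=(500, 3))
y = model(X)

def permutation_importance(model, X, y, rng):
    """Increase in MSE when one covariate is shuffled at a time:
    breaking the covariate-target link degrades predictions in
    proportion to how much the model relies on that covariate."""
    base_mse = np.mean((model(X) - y) ** 2)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        imp[j] = np.mean((model(Xp) - y) ** 2) - base_mse
    return imp

imp = permutation_importance(model, X, y, rng)
# Covariate 0 dominates, covariate 2 contributes nothing
```

The same shuffle-and-score loop applies unchanged to any fitted mapping model, which is what makes the method model-independent.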
Dealing with clustered samples for assessing map accuracy by cross-validation
Sytze De Bruin, Dick J Brus, Gerard BM Heuvelink, and
2 more authors
Mapping of environmental variables often relies on map accuracy assessment through cross-validation with the data used for calibrating the underlying mapping model. When the data points are spatially clustered, conventional cross-validation leads to optimistically biased estimates of map accuracy. Several papers have promoted spatial cross-validation as a means to tackle this over-optimism. Many of these papers blame spatial autocorrelation as the cause of the bias and propagate the widespread misconception that spatial proximity of calibration points to validation points invalidates classical statistical validation of maps. We present and evaluate alternative cross-validation approaches for assessing map accuracy from clustered sample data. The first method uses inverse sampling-intensity weighting to correct for selection bias. Sampling intensity is estimated with a two-dimensional kernel approach. The two other approaches are model-based methods rooted in geostatistics, where the first assumes homogeneity of residual variance over the study area whilst the second accounts for heteroscedasticity as a function of the sampling intensity. The methods were tested and compared against conventional k-fold cross-validation and blocked spatial cross-validation to estimate map accuracy metrics of above-ground biomass and soil organic carbon stock maps covering western Europe. Results acquired over 100 realizations of five sampling designs ranging from non-clustered to strongly clustered confirmed that inverse sampling-intensity weighting and the heteroscedastic model-based method had smaller bias than conventional and spatial cross-validation for all but the most strongly clustered design. For the strongly clustered design where large portions of the maps were predicted by extrapolation, blocked spatial cross-validation was closest to the reference map accuracy metrics, but still biased. 
For such cases, extrapolation is best avoided by additional sampling or limitation of the prediction area. Weighted cross-validation is recommended for moderately clustered samples, while conventional random cross-validation suits fairly regularly spread samples.
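The inverse sampling-intensity weighting idea can be sketched as follows. The clustered locations, the simple Gaussian kernel, and the bandwidth are all illustrative assumptions; the squared cross-validation errors are synthetic stand-ins for errors produced by an actual mapping model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Clustered sample locations in a unit square: two tight clusters
# (80 points) plus 20 scattered points.
clusters = np.vstack([
    rng.normal([0.2, 0.2], 0.02, size=(40, 2)),
    rng.normal([0.8, 0.7], 0.02, size=(40, 2)),
    rng.uniform(0, 1, size=(20, 2)),
])
errors = rng.normal(0, 1, size=len(clusters)) ** 2   # squared CV errors

def sampling_intensity(pts, bandwidth=0.1):
    """Two-dimensional Gaussian kernel estimate of sampling intensity."""
    d2 = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
    return np.sum(np.exp(-0.5 * d2 / bandwidth**2), axis=1)

w = 1.0 / sampling_intensity(clusters)    # inverse-intensity weights
weighted_mse = np.sum(w * errors) / np.sum(w)
plain_mse = np.mean(errors)
```

Points inside a cluster receive small weights, so over-sampled regions no longer dominate the accuracy estimate, which is the selection-bias correction described above.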
Using homosoils for quantitative extrapolation of soil mapping models
Andree M Nenkam, Alexandre MJ-C Wadoux, Budiman Minasny, and
4 more authors
Since the early 2000s, digital soil maps have been successfully used for various applications, including precision agriculture, environmental assessments and land use management. Globally, however, there are large disparities in the availability of soil data on which digital soil mapping (DSM) models can be fitted. Several studies have attempted to transfer a DSM model fitted in an area with a well-developed soil database to map the soil in areas with low sampling density. This is usually challenging because two areas in different regions of the world hardly ever share the same soil-forming factors. In this study, we aim to determine whether finding homosoils (i.e., locations sharing similar soil-forming factors) can help transfer soil information by means of DSM model extrapolation. We hypothesize that, within areas of the world considered homosoils, one can leverage areas with high sampling density to fit a DSM model, which can then be extrapolated geographically to an area with little or no data. We collected publicly available soil data for clay, silt, sand, organic carbon (OC), pH and total nitrogen (N) within our study area in Mali, West Africa and its homosoils. We fitted a regression tree model between the soil properties and environmental covariates of the homosoils, and applied this model to our study area in Mali. Several calibration and validation strategies were explored. We also compared our approach with existing maps made at a global and a continental scale. We concluded that geographic model extrapolation within homosoils was possible, but that model accuracy dramatically improved when local data were included in the calibration dataset. The maps produced from models fitted with data from homosoils were more accurate than existing products for this study area, for three (silt, sand, pH) out of six soil properties. 
This study is relevant to areas with very little or no soil data that need to carry out critical soil and environmental risk assessments at a regional level.
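The core homosoil operation, ranking candidate locations by similarity of soil-forming factors, can be sketched with a standardized distance in covariate space. The four covariates, the synthetic data and the planted "near-twin" cell are illustrative assumptions; the actual study uses real climate, parent material and terrain covariates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic soil-forming covariates (e.g. rainfall, temperature,
# elevation, NDVI) for 1000 candidate grid cells.
candidates = rng.normal(size=(1000, 4))
# A reference site nearly identical to cell 10 (a planted near-twin).
reference = candidates[10] + rng.normal(0, 0.01, size=4)

def homosoil_rank(reference, candidates):
    """Rank candidate cells by standardized Euclidean distance to the
    reference site in the space of soil-forming covariates."""
    mu = candidates.mean(axis=0)
    sd = candidates.std(axis=0)
    z_ref = (reference - mu) / sd
    z_all = (candidates - mu) / sd
    d = np.sqrt(np.sum((z_all - z_ref) ** 2, axis=1))
    return np.argsort(d)

ranking = homosoil_rank(reference, candidates)
# The near-twin cell should come out first in the ranking
```

Cells at the top of the ranking are the homosoils from which calibration data would be borrowed.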
A primer on soil analysis using visible and near-infrared (vis-NIR) and mid-infrared (MIR) spectroscopy
“A primer on soil analysis using visible and near-infrared (vis-NIR) and mid-infrared (MIR) spectroscopy” is the first training material on the topic of soil spectroscopy for beginner levels, produced by the Global Soil Laboratory Network Initiative on Soil Spectroscopy (GLOSOLAN-Spec) of the Global Soil Partnership, FAO. This document provides an introduction to the use of soil spectroscopy for soil analysis and covers the basic and fundamental procedures for applying this technology. The series “Soil spectroscopy training material” is part of the Global Soil Laboratory Network (GLOSOLAN) effort to strengthen the capacity of laboratories in soil analysis. It provides a series of training materials covering a wide range of topics in soil vis-NIR and MIR spectroscopy. The overall objective is to develop national and regional soil spectral libraries with an estimation service, and to provide advisory services on appropriate instrumentation.
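One of the fundamental preprocessing steps such a primer covers, the standard normal variate (SNV) transform, can be sketched as follows. The synthetic spectra are illustrative: one underlying spectral shape distorted by per-sample multiplicative scatter and baseline offset, which is exactly what SNV is designed to remove.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic reflectance spectra: one underlying shape with per-sample
# multiplicative scatter and additive baseline offset.
wavelengths = np.linspace(400, 2500, 500)
shape = 0.5 + 0.3 * np.sin(wavelengths / 300.0)
spectra = (rng.uniform(0.8, 1.2, (20, 1)) * shape
           + rng.uniform(-0.1, 0.1, (20, 1)))

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum
    individually to remove scatter and baseline effects."""
    mu = spectra.mean(axis=1, keepdims=True)
    sd = spectra.std(axis=1, keepdims=True)
    return (spectra - mu) / sd

corrected = snv(spectra)
# After SNV every spectrum has zero mean and unit standard deviation,
# and the scatter-distorted copies of the same shape coincide.
```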
Overview of pedometrics
Alexandre MJ-C Wadoux, Inakwu OA Odeh, and Alex B McBratney
Pedometrics is concerned with the application of mathematical and statistical methods to the study of the distribution and genesis of soils. Here, we describe the main areas that pedometric research addresses: distribution of the soil pattern in character space, spatial and spatio-temporal soil variation, quantitative evaluation of the utility and quality of soil, and quantitative pedogenesis. Across these main areas, which are akin to the problems of pedology, pedometrics also considers and represents uncertainty. Pedometric research is undeniably in an expansion phase and now has many areas of application at the interface with questions relevant to the sustainable growth of our societies.
An integrated approach for the evaluation of quantitative soil maps through Taylor and solar diagrams
Alexandre MJ-C Wadoux, Dennis JJ Walvoort, and Dick J Brus
For many decades, soil scientists have produced spatial estimates of soil properties using statistical and non-statistical mapping models. Commonly in soil mapping studies the map quality is assessed through pairwise comparison of observed and predicted values of a soil property, from which statistical indices summarizing the quality of the entire map are computed. Often these indices are based on average error and correlation statistics. In this study, we recommend a more appropriate and effective method of map evaluation by means of Taylor and solar diagrams. Taylor and solar diagrams are summary diagrams exploiting the relationship between statistical indices to visualize different aspects of map quality in a single plot. An important advantage over current map quality evaluation is that map quality can be assessed from the combined effect of a few statistical quantities, not just on the basis of a single index or list of indices. We illustrate the use of common statistical indices and their combination into summary diagrams with a simulation study and two applications on soil data. In the simulation study nine maps with known statistical properties are produced and evaluated with tables and summary diagrams. In the first case study with soil data, change in the quality of a large-scale topsoil organic carbon map is tracked for a number of permutations in the mapping model parameters, whereas in the second case study several maps of topsoil organic carbon content for the same area, made by various statistical and non-statistical models, are compared and evaluated. We consider that in all cases better insights into map quality are obtained with summary diagrams than with a single index or an extensive list of indices. This underpins the importance of using integrated summary graphics to communicate quantitative map quality, so as to avoid the excessive trust that a single map quality index may suggest.
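The relationship between statistical indices that a Taylor diagram exploits can be made explicit with a short sketch. The observed and predicted values are synthetic; the identity verified at the end is the law-of-cosines relation that lets the two standard deviations, the correlation and the centred RMSE share a single plot.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic observed and predicted values of a topsoil property
obs = rng.normal(10, 2, 200)
pred = 0.8 * obs + rng.normal(0, 1, 200)

def taylor_statistics(obs, pred):
    """Statistics underlying a Taylor diagram: the two standard
    deviations, the Pearson correlation, and the centred RMSE."""
    so, sp = obs.std(), pred.std()
    r = np.corrcoef(obs, pred)[0, 1]
    crmse = np.sqrt(np.mean(((pred - pred.mean()) - (obs - obs.mean())) ** 2))
    return so, sp, r, crmse

so, sp, r, crmse = taylor_statistics(obs, pred)
# Law-of-cosines identity linking the four quantities:
#   crmse**2 == so**2 + sp**2 - 2 * so * sp * r
```

Because of this identity, plotting (sp, r) in polar coordinates places each map at a point whose distance from the observation point equals its centred RMSE, which is what makes the diagram an integrated summary.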
2021
Moon, David. The American Steppes: The Unexpected Russian Roots of Great Plains Agriculture, 1870s–1930s. Cambridge University Press, UK, 2020. xxxiii + 432 pp. £90, hardback. ISBN: 9781107103603.
Soil is a complex system in which biological, chemical and physical interactions take place. The behaviour of these interactions changes in spatial scale from the atomic to the global, and in time. To understand how this system works, soil scientists usually rely on incremental improvements in knowledge, refining theories through hypothesis testing and carefully designed experiments. In the last two decades, the primacy of this knowledge construction process has been challenged by the development of large soil databases and algorithms such as machine learning. The data-driven research approach to soil science, the inference of soil knowledge directly from data by using computational tools and modelling techniques, is becoming more popular. Despite the wide adoption of a data-driven research approach to soil science, there has been little discussion on how research driven by data instead of hypotheses affects scientific progress. In this paper, we provide an introductory perspective on data-driven soil research by discussing some of the issues and opportunities of knowledge discovery from soil data. We show that while data-driven soil research may seem revolutionary to some, soil science has a long history of exploratory efforts to generate knowledge from data. Empirical and factual soil classifications, for example, were data driven. We further discuss, with examples, (i) data, databases and the logic of data storage for data-driven soil research, (ii) the issues of extreme empiricist claims that arise as a corollary of the increased use of computational tools, and (iii) the challenge of formulating a scientific explanation based on patterns observed in the data and data analysis tools. 
By considering the epistemic challenges of data-driven scientific research in the light of the historical literature, we found that there is a continuity of practices, some certainly amplified by recent technological changes, but that the core methods of scientific enquiry from data remain essentially unchanged.
Spatial cross-validation is not the right way to evaluate map accuracy
Alexandre MJ-C Wadoux, Gerard BM Heuvelink, Sytze De Bruin, and
1 more author
For decades scientists have produced maps of biological, ecological and environmental variables. These studies commonly evaluate the map accuracy through cross-validation with the data used for calibrating the underlying mapping model. Recent studies, however, have argued that cross-validation statistics of most mapping studies are optimistically biased. They attribute these overoptimistic results to a supposed serious methodological flaw in standard cross-validation methods, namely that these methods ignore spatial autocorrelation in the data. They argue that spatial cross-validation should be used instead, and contend that standard cross-validation methods are inherently invalid in a geospatial context because of the autocorrelation present in most spatial data. Here we argue that these studies propagate a widespread misconception of statistical validation of maps. We explain that unbiased estimates of map accuracy indices can be obtained by probability sampling and design-based inference and illustrate this with a numerical experiment on large-scale above-ground biomass mapping. In our experiment, standard cross-validation (i.e., ignoring autocorrelation) led to smaller bias than spatial cross-validation. Standard cross-validation was deficient in the case of a strongly clustered dataset that had large differences in sampling density, but less so than spatial cross-validation. We conclude that spatial cross-validation methods have no theoretical underpinning and should not be used for assessing map accuracy, while standard cross-validation is deficient in the case of clustered data. Model-free, design-unbiased and valid accuracy assessment is achieved with probability sampling and design-based inference. It is valid without the need to explicitly incorporate or adjust for spatial autocorrelation and is perfectly suited for the validation of large-scale biological, ecological and environmental maps.
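The design-based argument can be illustrated with a minimal sketch. The "population" of true and mapped values is synthetic so that the population map MSE is known exactly; the point is that the mean squared error over a simple random validation sample estimates it without any reference to spatial autocorrelation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic population: true values and map predictions on a grid,
# so the population map MSE is known exactly for comparison.
truth = rng.normal(50, 10, 10_000)
mapped = truth + rng.normal(0, 5, 10_000)
population_mse = np.mean((mapped - truth) ** 2)

def design_based_mse(truth, mapped, n, rng):
    """Estimate map MSE from a simple random validation sample; the
    sample mean of squared errors is design-unbiased, and its
    standard error follows from the same sample."""
    idx = rng.choice(len(truth), size=n, replace=False)
    e2 = (mapped[idx] - truth[idx]) ** 2
    return e2.mean(), e2.std(ddof=1) / np.sqrt(n)

est, se = design_based_mse(truth, mapped, n=500, rng=rng)
# The estimate falls close to the known population MSE
```

The validity comes from the random selection of validation locations (design-based inference), not from any model of, or correction for, spatial structure.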
Ten challenges for the future of pedometrics
Alexandre MJ-C Wadoux, Gerard BM Heuvelink, R Murray Lark, and
6 more authors
Pedometrics, the application of mathematical and statistical methods to the study of the distribution and genesis of soils, has broadened its scope over the past two decades. The primary focus of pedometricians has traditionally been on spatial and spatio-temporal soil inventories with numerical soil classification, geostatistical modelling of spatial variation and mapping. The rapid development of remote and proximal soil sensing as well as data-driven statistical modelling techniques have had a major impact on pedometrics over the past decades. During this time, a general demand for quantitative digital soil information for environmental modelling and management has compelled pedometricians to address other soil-related questions from a quantitative point of view: soil genesis and utility and quality of soil. While scientific progress is largely an autonomous process that is difficult to steer, research efforts could benefit from an agenda with pressing pedometric research topics. This paper defines and discusses ten recent or longstanding pedometrics challenges, with the attempt to identify knowledge gaps and suggest new concepts and methods to overcome them. The ten challenges were selected through a collaborative effort and may serve as a guidance for future pedometrics research and to foster collaboration among soil scientists. The challenges discussed in this paper are also indicators of the current understanding and state of knowledge from which progress can be measured in the future.
Soil Spectral Inference with R: Analysing Digital Soil Spectra Using the R Programming Environment
Alexandre MJ-C Wadoux, Brendan Malone, Budiman Minasny, and
2 more authors
This book provides a didactic overview of techniques for inferring information from soil spectroscopic data, and the codes in the R programming language for performing such analyses. It is intended for students, researchers and practitioners looking to infer soil information from spectroscopic data, focusing mainly on, but not restricted to, the infrared range of the electromagnetic spectrum. Little prior knowledge of the R programming language or digital soil spectra is required. We work through the steps to process spectroscopic data systematically.
Digital convergence is helping us to better understand and study the soil. Fixed and mobile sensors, and wireless communication systems aided by the internet produce cheap and abundant streams of digital soil data that can readily be used for modeling and information generation. Here, we explore the ways in which digital science and technology have affected soil science. We can call this digital soil science and define it as the study of the soil aided by the tools of the digital convergence. To some degree, all of our research and teaching has been enabled, enhanced, and expanded by the digital convergence. We outline how soil science has changed using illustrations of intellectual and technical developments enabled digitally. Digital soil sensors have been widely implemented, and new tools such as cell phones and applications, or metagenomics techniques are becoming available. There are also areas in soil science for which no major obstacles in the digital technologies exist, but which have not been thoroughly investigated: for example, devising a truly digital soil field description or building a formal digital quantitative system of soil classification. The soil science community will need to be alert to some of the dangers brought by digital convergence, such as the lack of new theory and proprietary (black-box) soil prediction. Finally, we discuss a whole set of digital tools that will, or might, gain the stage in the immediate future and take a stab in the dark on what may lie over the horizon of digital soil science.
Hypotheses are of major importance in scientific research. In current applications of machine learning algorithms for soil mapping, the hypotheses being tested or developed are often ambiguous or undefined. Mapping soil properties or classes, however, does not tell much about the dynamics and processes that underlie soil genesis and evolution. When the interest in the soil map is for applications in a context different from soil science, such as policy making or baseline production of quantitative soil information, the interpretation should be made in light of this application. Otherwise, we recommend that soil scientists provide hypotheses to accompany their research. The hypothesis is formulated at the beginning of the research and, in some cases, motivates data collection. Here we argue that when applying data-driven techniques such as machine learning, developing hypotheses can be a useful end point of the research. The spatial pattern predicted by the machine learning model and the correlations found among the covariates are an opportunity to develop hypotheses, which are likely to require additional analyses and datasets to be tested. Systematically providing scientific hypotheses in digital soil mapping studies will enable the soil science community to build on previous work, and to increase the credibility of data-driven algorithms as a means to accelerate discovery on soil processes.
2020
Machine learning for digital soil mapping: Applications, challenges and suggested solutions
Alexandre MJ-C Wadoux, Budiman Minasny, and Alex B McBratney
The uptake of machine learning (ML) algorithms in digital soil mapping (DSM) is transforming the way soil scientists produce their maps. Within the past two decades, soil scientists have applied ML to a wide range of scenarios, mapping soil properties or classes with various ML algorithms, on spatial scales from the local to the global, and with depth. The wide adoption of ML for soil mapping was made possible by the increase in data availability, the ease of accessing environmental spatial data, and the development of software solutions aided by computational tools to analyse them. In this article, we review the current use of ML in DSM, identify the key challenges and suggest solutions from the existing literature. There is a growing interest in the use of ML in DSM. Most studies emphasize prediction and accuracy of the predicted maps for applications, such as baseline production of quantitative soil information. Few studies account for existing soil knowledge in the modelling process or quantify the uncertainty of the predicted maps. Further, we discuss the challenges related to the application of ML for soil mapping and suggest solutions from existing studies in the natural sciences. The challenges are: sampling, resampling, accounting for the spatial information, multivariate mapping, uncertainty analysis, validation, integration of pedological knowledge and interpretation of the models. Overall, the current literature shows few attempts at understanding the underlying soil structure or process using the predicted maps and the ML model, for example by generating hypotheses on mechanistic relationships among variables. In this regard, several additional challenging aspects need to be considered, such as the inclusion of pedological knowledge in the ML algorithm or the interpretability of the calibrated ML model. Tackling these challenges is critical for ML to gain credibility and scientific consistency in soil science. 
We conclude that for future developments, ML could incorporate three core elements: plausibility, interpretability, and explainability, which will trigger soil scientists to couple model prediction with pedological explanation and understanding of the underlying soil processes.
A note on knowledge discovery and machine learning in digital soil mapping
Alexandre MJ-C Wadoux, Alessandro Samuel-Rosa, Laura Poggio, and
1 more author
In digital soil mapping, machine learning (ML) techniques are being used to infer a relationship between a soil property and the covariates. The information derived from this process is often translated to pedological knowledge. This mechanism is referred to as knowledge discovery. This study shows that knowledge discovery based on ML must be treated with caution. We show how pseudo-covariates can be used to accurately predict soil organic carbon in a hypothetical case study. We demonstrate that ML methods can find relevant patterns even when the covariates are meaningless and not related to soil forming factors and processes. We argue that pattern recognition for prediction should not be equated with knowledge discovery. Knowledge discovery requires more than the recognition of patterns and successful prediction. It requires the pre-selection and preprocessing of pedologically relevant environmental covariates and the posterior interpretation and evaluation of the recognized patterns. We argue that important ML covariates could serve the purpose of providing elements to postulate hypotheses about soil processes that, once validated through experiments, could result in new pedological knowledge.
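The pseudo-covariate effect can be reproduced in miniature. Everything below is an illustrative construction, not the study's setup: a smooth synthetic "soil organic carbon" surface is predicted by a one-nearest-neighbour learner from random sinusoidal fields of the coordinates that carry no pedological meaning whatsoever, yet prediction is accurate because the fields implicitly encode location.

```python
import numpy as np

rng = np.random.default_rng(11)

# Spatially smooth "soil organic carbon" surface over a unit square
def soc(u, v):
    return 2.0 + np.sin(4 * u) * np.cos(4 * v)

# Pseudo-covariates: random smooth fields of the coordinates only,
# with no causal relation to SOC.
a = rng.uniform(2, 6, (10, 2))
c = rng.uniform(0, 2 * np.pi, 10)

def pseudo_covariates(u, v):
    return np.sin(np.outer(u, a[:, 0]) + np.outer(v, a[:, 1]) + c)

u_tr, v_tr = rng.uniform(0, 1, (2, 2000))
u_te, v_te = rng.uniform(0, 1, (2, 400))
X_tr, y_tr = pseudo_covariates(u_tr, v_tr), soc(u_tr, v_tr)
X_te, y_te = pseudo_covariates(u_te, v_te), soc(u_te, v_te)

# One-nearest-neighbour "learner" in pseudo-covariate space
d2 = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
pred = y_tr[d2.argmin(axis=1)]
r2 = 1 - np.sum((y_te - pred) ** 2) / np.sum((y_te - y_te.mean()) ** 2)
# r2 is high despite the covariates being pedologically meaningless
```

Because the smooth pseudo-fields preserve spatial proximity, nearest neighbours in covariate space are close in geographic space, so the learner exploits spatial autocorrelation while appearing to have "learned" from the covariates: accurate prediction, no knowledge discovered.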
Optimization of rain gauge sampling density for river discharge prediction using Bayesian calibration
Alexandre MJ-C Wadoux, Gerard BM Heuvelink, Remko Uijlenhoet, and
1 more author
River discharges are often predicted based on a calibrated rainfall-runoff model. The major sources of uncertainty, namely input, parameter and model structural uncertainty, must all be taken into account to obtain realistic estimates of the accuracy of discharge predictions. Over the past years, Bayesian calibration has emerged as a suitable method for quantifying uncertainty in model parameters and model structure, where the latter is usually modelled by an additive or multiplicative stochastic term. Recently, much work has also been done to include input uncertainty in the Bayesian framework. However, the use of geostatistical methods for characterizing the prior distribution of the catchment rainfall is underexplored, particularly in combination with assessments of the influence of increasing or decreasing rain gauge network density on discharge prediction accuracy. In this article we integrate geostatistics and Bayesian calibration to analyze the effect of rain gauge density on river discharge prediction accuracy. We calibrated the HBV hydrological model while accounting for input, initial state, model parameter and model structural uncertainty, and also taking uncertainties in the discharge measurements into account. Results for the Thur basin in Switzerland showed that model parameter uncertainty was the main contributor to the joint posterior uncertainty. We also showed that a low rain gauge density is enough for the Bayesian calibration, and that increasing the number of rain gauges improved model prediction until reaching a density of one gauge per 340 km². While the optimal rain gauge density is case-study specific, we make recommendations on how to handle input uncertainty in Bayesian calibration for river discharge prediction and present the methodology that may be used to carry out such experiments.
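The Bayesian calibration machinery can be illustrated with a minimal Metropolis sampler on a toy linear rainfall-runoff relation. The one-parameter model, the priors, the known noise level and the proposal scale below are illustrative simplifications; the study calibrates the full HBV model with several uncertainty sources.

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy rainfall-runoff relation: discharge = theta * rainfall + noise
theta_true, sigma = 0.6, 0.5
rain = rng.uniform(0, 10, 100)
discharge = theta_true * rain + rng.normal(0, sigma, 100)

def log_posterior(theta):
    """Gaussian likelihood plus a wide Gaussian prior on theta."""
    ll = -0.5 * np.sum((discharge - theta * rain) ** 2) / sigma**2
    lp = -0.5 * theta**2 / 10.0**2
    return ll + lp

# Random-walk Metropolis sampler
chain = np.empty(4000)
theta, lpost = 1.0, log_posterior(1.0)
for i in range(len(chain)):
    prop = theta + rng.normal(0, 0.02)         # propose a small step
    lprop = log_posterior(prop)
    if np.log(rng.random()) < lprop - lpost:   # accept or reject
        theta, lpost = prop, lprop
    chain[i] = theta

posterior_mean = chain[1000:].mean()           # discard burn-in
```

The retained chain approximates the posterior of theta; its mean sits near the true value and its spread quantifies parameter uncertainty, which is the quantity the study decomposes against input and structural uncertainty.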
History and interpretation of early soil and organic matter investigations in Deli, Sumatra, Indonesia
Budiman Minasny, Erwin Nyak Akoeb, Tengku Sabrina, and
2 more authors
This paper provides a history of the investigation of the soils and organic matter of Deli in Sumatra, Indonesia, for growing tobacco in the early 20th century and an interpretation based on current data, knowledge and understanding. We first review some early chemists’ and agrogeologists’ investigations on the soils of Deli to increase tobacco production. Van Bemmelen studied the humus of the soil of Deli in 1890 and formalised an 8-year fallow plantation scheme for growing tobacco. While the importance of maintaining organic matter had been established, the complexity of soil distribution in the area was more important in determining the quality of tobacco. It took another 40 years for the soil in the Deli area to be properly mapped. Jan Henri Druif in the 1930s mapped and classified the soils of Deli based on their parent material and mineralogical composition. We then describe the rise and demise of the tobacco industry from the 1930s to the present. We examine the implications of the fallow system and soil distribution in light of the current understanding of soil carbon processes and recent data. The results are interpreted and discussed considering i) the myth of “poor” tropical soils, ii) nutrient availability after slash and burn, iii) soil organic matter decline after forest conversion and recovery after fallow, and iv) soil mapping and provenance. Based on published studies and observed data coupled with modelling, we attempt to explain early researchers’ observations and deductions. We summarise soil organic carbon dynamics in the tropics after 50 years of forest clearance: under fallow rotation, it is possible to maintain, on average, a constant value of 20% organic carbon (OC) decrease from the original level, while continuous cropping can decrease OC levels by up to 30–40%. An extreme condition with continuous cultivation and little organic matter input can result in an OC decline of up to 80%. 
These historical studies enable us to appreciate aspects of soil mapping and organic matter that are repeatedly overlooked in present-day research.
If a map is constructed through prediction with a statistical or non-statistical model, the sampling design used for selecting the sample on which the model is fitted plays a key role in the final map accuracy. Several sampling designs are available for selecting these calibration samples. Commonly, sampling designs for mapping are compared in real-world case studies by selecting just one sample for each of the sampling designs under study. In this study, we show that sampling designs for mapping are better compared on the basis of the distribution of the map quality indices over repeated selection of the calibration sample. In practice this is only feasible by subsampling a large dataset representing the population of interest, or by selecting calibration samples from a map depicting the study variable. This is illustrated with two real-world case studies. In the first case study a quantitative variable, soil organic carbon, is mapped by kriging with an external drift in France, whereas in the second case a categorical variable, land cover, is mapped by random forest in a region in France. The performance of two sampling designs for mapping are compared: simple random sampling and conditioned Latin hypercube sampling, at various sample sizes. We show that in both case studies the sampling distributions of map quality indices obtained with the two sampling design types, for a given sample size, show large variation and largely overlap. This shows that when comparing sampling designs for mapping on the basis of a single sample selected per design, there is a serious risk of an incidental result.
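The repeated-selection comparison described above can be sketched as follows. The fine grid stands in for a large dataset representing the population of interest, a one-nearest-neighbour map stands in for the mapping model, and the crude index-band stratification is an illustrative stand-in for conditioned Latin hypercube sampling.

```python
import numpy as np

rng = np.random.default_rng(2)

# Population of interest: a smooth property on a fine grid, standing
# in for a large dataset that can be subsampled repeatedly.
g = np.linspace(0, 1, 60)
uu, vv = np.meshgrid(g, g)
u, v, y = uu.ravel(), vv.ravel(), (np.sin(3 * uu) + np.cos(3 * vv)).ravel()

def rmse_of_sample(idx):
    """Map by one-nearest-neighbour from a calibration sample and
    evaluate the map against the whole population."""
    d2 = (u[:, None] - u[idx]) ** 2 + (v[:, None] - v[idx]) ** 2
    pred = y[idx][d2.argmin(axis=1)]
    return np.sqrt(np.mean((pred - y) ** 2))

n, chunks = 50, np.array_split(np.arange(60 * 60), 50)
rmse_srs, rmse_strat = [], []
for _ in range(100):                       # repeated selection per design
    rmse_srs.append(rmse_of_sample(rng.choice(60 * 60, n, replace=False)))
    idx = np.array([rng.choice(ch) for ch in chunks])  # crude stratification
    rmse_strat.append(rmse_of_sample(idx))
# Compare the *distributions* of RMSE, not one draw per design
```

A single draw per design could rank the two designs either way; only the two distributions of 100 RMSE values reveal how much the designs actually differ, which is the point the study makes.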
Precocious 19th century soil carbon science
Budiman Minasny, Alex B McBratney, Alexandre MJ-C Wadoux, and
2 more authors
Soil organic matter is important for nutrient exchange in the soil environment, carbon storage, and soil fertility. Soil scientists usually estimate the amount of organic matter in a soil from its carbon content using the 1.724 conversion factor. The origin of this conversion factor is conventionally attributed to Jacob Maarten van Bemmelen, a Dutch chemist. In the early nineteenth century, science academies devoted considerable attention to understanding soil humus to increase agricultural productivity. Van Bemmelen investigated the fertility of soils for growing tobacco in Indonesia. Van Bemmelen’s 1890 publication used the 1.724 factor for estimating humus content from elemental analysis of C concentration. A survey of the scientific literature from the same period indicated that Emil Wolff was the first to suggest the factor. This paper provides a brief historical summary of van Bemmelen’s research on soil organic matter, and discusses the origin and use of the 1.724 factor using the scientific literature from the 1900s to the 1930s. The origin of the factor is contextualized with the emerging humus theory of the 19th century. Our study suggests that the factor has been erroneously attributed to van Bemmelen and widely used in English, French, Dutch, and German literature. The 1.724 factor was originally developed for the conversion of carbon to humic substances, which themselves do not have a clear definition. Many regional studies have indicated the inadequacy of the factor.
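The arithmetic behind the factor is worth making explicit: 1.724 is simply the reciprocal of the assumed 58% carbon content of humus. A minimal sketch (the function name is illustrative):

```python
# The conventional factor rests on the assumption that soil organic
# matter contains 58 % carbon, hence OM = OC / 0.58 ≈ 1.724 * OC.
CARBON_FRACTION = 0.58
FACTOR = 1 / CARBON_FRACTION           # ≈ 1.724

def organic_matter(oc_percent, factor=FACTOR):
    """Estimate organic matter content (%) from organic carbon (%)."""
    return factor * oc_percent

om = organic_matter(2.9)   # a soil with 2.9 % organic carbon -> 5.0 % OM
```

The regional inadequacy discussed above follows directly: wherever organic matter is not 58% carbon, the single fixed factor misestimates organic matter in proportion to the deviation.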
2019
Sampling design optimization for soil mapping with random forest
Alexandre MJ-C Wadoux, Dick J Brus, and Gerard BM Heuvelink
Machine learning techniques are widely employed to generate digital soil maps. The map accuracy is partly determined by the number and spatial locations of the measurements used to calibrate the machine learning model. However, determining the optimal sampling design for mapping with machine learning techniques has not yet been considered in detail in digital soil mapping studies. In this paper, we investigate sampling design optimization for soil mapping with random forest. A design is optimized using spatial simulated annealing by minimizing the mean squared prediction error (MSE). We applied this approach to mapping soil organic carbon for a part of Europe using subsamples of the LUCAS dataset. The optimized subsamples are used as input for the random forest machine learning model, using a large set of readily available environmental data as covariates. We also predicted the same soil property using subsamples selected by simple random sampling, conditioned Latin Hypercube sampling (cLHS), spatial coverage sampling and feature space coverage sampling. Distributions of the estimated population MSEs are obtained through repeated random splitting of the LUCAS dataset, serving as the population of interest, into subsets used for validation, testing and selection of calibration samples, and repeated selection of calibration samples with the various sampling designs. The differences between the medians of the MSE distributions were tested for significance using the non-parametric Mann-Whitney test. The process was repeated for different sample sizes. We also analyzed the spread of the optimized designs in both geographic and feature space to reveal their characteristics. Results show that optimization of the sampling design by minimizing the MSE is worthwhile for small sample sizes. 
However, an important disadvantage of sampling design optimization using the MSE is that it requires known values of the soil property at all locations, and is consequently only feasible for subsampling an existing dataset. For larger sample sizes, the effect of using an MSE-optimized design diminishes; in this case, we recommend using a sample spread uniformly in the feature (i.e. covariate) space of the most important random forest covariates. The results also show that, for our case study, cLHS sampling performs worse than the other sampling designs for mapping with random forest. We stress that comparing sampling designs for calibration by splitting the data just once is very sensitive to the particular data split when the validation set is small.
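The optimization loop described above can be sketched as follows. This is a simplified illustration, not the paper’s implementation: a 1-nearest-neighbour predictor stands in for the random forest, and the linear cooling schedule and acceptance rule are generic simulated-annealing choices.

```python
import math
import random

def mse_of_design(sample_idx, X, y):
    """Population MSE of a 1-nearest-neighbour predictor calibrated on the
    design (a stand-in for the random forest used in the paper). Requires
    known y at all locations, hence only feasible when subsampling."""
    sse = 0.0
    for i, xi in enumerate(X):
        j = min(sample_idx, key=lambda s: sum((a - b) ** 2 for a, b in zip(X[s], xi)))
        sse += (y[i] - y[j]) ** 2
    return sse / len(X)

def anneal(X, y, n, iters=2000, t0=1.0, seed=0):
    """Simulated annealing: swap one design point at a time and accept
    worse designs with a temperature-controlled probability."""
    rng = random.Random(seed)
    design = rng.sample(range(len(X)), n)
    cur = mse_of_design(design, X, y)
    best, best_design = cur, list(design)
    for k in range(iters):
        t = t0 * (1.0 - k / iters)                 # linear cooling
        cand = list(design)
        pool = [i for i in range(len(X)) if i not in design]
        cand[rng.randrange(n)] = rng.choice(pool)  # swap one point
        new = mse_of_design(cand, X, y)
        if new < cur or rng.random() < math.exp(-(new - cur) / max(t, 1e-9)):
            design, cur = cand, new
            if cur < best:
                best, best_design = cur, list(design)
    return best_design, best
```

In practice the objective evaluation dominates the cost, since each candidate design requires refitting the model and predicting over the whole population.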
Using deep learning for multivariate mapping of soil with quantified uncertainty
Alexandre MJ-C Wadoux
Digital soil mapping (DSM) techniques are widely employed to generate soil maps. Soil properties are typically predicted individually, ignoring the interrelations between them. Models for predicting multiple properties exist, but they are computationally demanding and often fail to provide an accurate description of the associated uncertainty. In this paper a convolutional neural network (CNN) model is described to predict several soil properties with quantified uncertainty. CNN has the advantage that it incorporates the spatial contextual information of environmental covariates surrounding an observation, and a single CNN model can be trained to predict multiple soil properties simultaneously. I further propose a two-step approach to estimate the uncertainty of the predictions made with a neural network model. The methodology is tested by mapping six soil properties over the French metropolitan territory using measurements from the LUCAS dataset and a large set of environmental covariates portraying the factors of soil formation. Results indicate that the multivariate CNN model produces accurate maps, as shown by the coefficient of determination and the concordance correlation coefficient, compared to a conventional machine learning technique. For this country-extent mapping, the maps predicted by the CNN show a detailed pattern with substantial spatial variation. Evaluation of the uncertainty maps using the median of the standardized squared prediction errors and accuracy plots suggests that the uncertainty was accurately quantified, albeit slightly underestimated. Tests conducted with different window sizes of input covariates indicate that the CNN benefits from using local contextual information in a radius of 4.5 km. I conclude that CNN is an effective model to predict several soil properties and that the associated uncertainty can be accurately quantified with the proposed approach.
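The two-step uncertainty idea, fitting a model for the mean and then a second model to the squared residuals, can be illustrated with a deliberately simple stand-in. The paper uses CNNs; here both steps use a one-covariate linear model purely to show the mechanics:

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def two_step_uncertainty(xs, ys):
    """Step 1: fit a model for the mean. Step 2: fit a second model to the
    squared residuals, so the error variance can itself be predicted."""
    a, b = fit_line(xs, ys)
    sq_res = [(y - (a + b * x)) ** 2 for x, y in zip(xs, ys)]
    c, d = fit_line(xs, sq_res)

    def predict(x):
        mean = a + b * x
        var = max(c + d * x, 0.0)  # clamp: a linear variance model can go negative
        return mean, math.sqrt(var)

    return predict
```

The same recipe carries over to any regression model: replace `fit_line` in both steps with the model of choice (a CNN in the paper) and the residual model delivers a spatially varying prediction standard deviation.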
Efficient sampling for geostatistical surveys
Alexandre MJ-C Wadoux, Benjamin P Marchant, and Richard M Lark
A geostatistical survey for soil requires rational choices regarding the sampling strategy. If the variogram of the property of interest is known, then it is possible to optimize the sampling scheme such that an objective function related to the survey error is minimized. However, the variogram is rarely known prior to sampling; instead it must be approximated, either by a variogram estimated from a reconnaissance survey or by a variogram estimated for the same soil property under similar conditions. For this reason, spatial coverage schemes are often preferred: they simply disperse the sampling units as uniformly as possible over the study area, and are similar to the schemes produced by minimizing the kriging variance. If extra sampling locations are added close to those in a spatial coverage scheme, then the scheme might be broadly similar to one produced by minimizing the total error (i.e. the kriging variance plus the prediction error due to uncertainty in the covariance parameters). We consider the relative merits of these different sampling approaches by comparing their mean total error for different specified random functions. Our results showed the considerable benefit of adding close pairs to a spatial coverage scheme, and that optimizing with respect to the total error generally gave only a small further advantage. In our example of a geostatistical survey of the clay content of the soil, an optimized scheme based on the average of previously reported clay variograms was fairly robust compared with the spatial coverage plus close-pairs scheme. We conclude that direct optimization of spatial surveys is only rarely worthwhile; for most cases, it is best to apply a spatial coverage scheme with a proportion of additional sampling locations that provide some closely spaced pairs. Furthermore, our results indicated that the number of observations required for an effective geostatistical survey depends on the variogram parameters.
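A minimal sketch of the spatial coverage plus close-pairs strategy discussed above, assuming a simple k-means dispersion of sampling units and a fixed offset for the paired points (both are illustrative choices, not the paper’s exact algorithm):

```python
import random

def spatial_coverage(points, k, iters=20, seed=0):
    """k-means on coordinates: the cluster centres spread the sampling
    units roughly uniformly over the study area."""
    rng = random.Random(seed)
    centres = list(rng.sample(points, k))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: (p[0] - centres[c][0]) ** 2
                                + (p[1] - centres[c][1]) ** 2)
            clusters[j].append(p)
        # update each centre to its cluster mean; keep it if the cluster emptied
        centres = [(sum(q[0] for q in cl) / len(cl),
                    sum(q[1] for q in cl) / len(cl)) if cl else centres[j2]
                   for j2, cl in enumerate(clusters)]
    return centres

def add_close_pairs(design, n_pairs, offset=0.01, seed=0):
    """Duplicate a few design points at a short offset so the short-range
    part of the variogram can be estimated."""
    rng = random.Random(seed)
    mates = rng.sample(design, n_pairs)
    return list(design) + [(x + offset, y + offset) for x, y in mates]
```

The `offset` controls the shortest inter-point distance in the survey, which is what the close pairs contribute to covariance parameter estimation.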
Robust soil mapping at the farm scale with vis–NIR spectroscopy
Leonardo Ramirez-Lopez, Alexandre MJ-C Wadoux, Marston HD Franceschini, and
4 more authors
Sustainable agriculture practices are often hampered by the prohibitive costs associated with the generation of fine-resolution soil maps. Recently, several papers have been published highlighting how visible and near infrared (vis–NIR) reflectance spectroscopy may offer an alternative to address this problem by increasing the density of soil sampling and by reducing the number of conventional laboratory analyses needed. However, for farm-scale soil mapping, previous studies rarely focused on sample optimization for the calibration of vis–NIR models or on robust modelling of the spatial variation of soil properties predicted by vis–NIR spectroscopy. In the present study, we used soil vis–NIR spectroscopy models optimized in terms of both number of calibration samples and accuracy for high-resolution robust farm-scale soil mapping and addressed some of the most common pitfalls identified in previous research. We collected 910 samples from 458 locations at two depths (A, 0–0.20 m; B, 0.80–1.0 m) in the state of São Paulo, Brazil. All soil samples were analysed by conventional methods and scanned in the vis–NIR spectral range. With the vis–NIR spectra only, we inferred statistically the optimal set size and the best samples with which to calibrate vis–NIR models. The calibrated vis–NIR models were validated and used to predict soil properties for the rest of the samples. The prediction error of the spectroscopic model was propagated through the spatial analysis, in which robust block kriging was used to predict particle-size fractions and exchangeable calcium content for each depth. The results indicated that statistical selection of the calibration samples based on vis–NIR spectra considerably decreased the need for conventional chemical analysis for a given level of mapping accuracy. The methods tested in this research were developed and implemented using open-source software. All codes and data are provided for reproducible research purposes.
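One widely used algorithm for selecting calibration samples from the spectra alone is Kennard–Stone, which successively picks the sample farthest from the already-selected set. The sketch below is illustrative: the paper’s statistical selection procedure is not necessarily this algorithm.

```python
def kennard_stone(spectra, n):
    """Select n calibration samples that successively maximise the minimum
    distance to the already-selected set (Kennard-Stone algorithm)."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    m = len(spectra)
    # start from the pair of most mutually distant spectra
    i0, j0 = max(((i, j) for i in range(m) for j in range(i + 1, m)),
                 key=lambda ij: d2(spectra[ij[0]], spectra[ij[1]]))
    selected = [i0, j0]
    while len(selected) < n:
        rest = [i for i in range(m) if i not in selected]
        nxt = max(rest,
                  key=lambda i: min(d2(spectra[i], spectra[s]) for s in selected))
        selected.append(nxt)
    return selected
```

In a spectroscopy workflow the selected indices are the samples sent for conventional laboratory analysis; the remaining samples are predicted from the calibrated spectral model.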
Sampling design optimization for geostatistical modelling and prediction
Space-time monitoring and prediction of environmental variables requires measurements of the environment. But environmental variables cannot be measured everywhere and all the time: scientists can only collect a fragment, a sample, of the property of interest in space and time, with the objective of using this sample to infer the property at unvisited locations and times. Sampling can be a costly and time-consuming affair, so we need efficient strategies for selecting an optimal design for mapping. Most studies on sampling design optimization consider the case of predictive mapping using geostatistics. In recent years geostatistical models and associated mapping techniques have advanced, which calls for an adaptation of the associated sampling designs. The main objective of this thesis is to address the optimal design of some recent advances in mapping.
Multi-source data integration for soil mapping using deep learning
Alexandre MJ-C Wadoux, José Padarian, and Budiman Minasny
With the advances in proximal soil sensing technologies, soil properties can be inferred by a variety of sensors, each with its own level of accuracy. This measurement error affects subsequent modelling and must therefore be accounted for when calibrating a spatial prediction model. This paper introduces a deep learning model for contextual digital soil mapping (DSM) using uncertain measurements of the soil property. The deep learning model, a convolutional neural network (CNN), has the advantage that it uses as input a local representation of the environmental covariates, thereby leveraging the spatial information contained in the vicinity of a location. Spatial non-linear relationships between measured soil properties and neighbouring covariate pixel values are found by optimizing an objective function, which can be weighted with respect to the measurement error of the soil observations. In addition, a single model can be trained to predict a soil property at several depths. The method is tested by mapping top- and subsoil organic carbon using laboratory-analysed and spectroscopically inferred measurements. Results show that the CNN significantly increased prediction accuracy, as indicated by the coefficient of determination and the concordance correlation coefficient, compared to a conventional DSM technique. Prediction error decreased for the deeper soil layers, while the interrelations between soil property and depth were preserved. The tests conducted suggest that the CNN benefits from using local contextual information up to 260 to 360 m from an observation. We conclude that the CNN is a flexible, effective and promising model to predict soil properties at multiple depths while accounting for contextual covariate information and measurement error.
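The error-weighted objective can be illustrated in isolation: observations with a large measurement error variance (e.g. spectroscopically inferred values) receive a small weight, so they contribute less to the fit than precise laboratory measurements. This is a minimal sketch of the weighting idea, not the CNN training loss itself:

```python
def weighted_mse(y_true, y_pred, error_var):
    """Squared-error loss down-weighted by the measurement error variance
    of each observation (weight proportional to 1 / variance)."""
    w = [1.0 / v for v in error_var]
    total_w = sum(w)
    return sum(wi * (t - p) ** 2
               for wi, t, p in zip(w, y_true, y_pred)) / total_w
```

With equal variances this reduces to the ordinary mean squared error; inflating the variance of a noisy observation shrinks its influence on the objective.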
2018
Epistemological aspects of soil science in the late nineteenth century
Alexandre MJ-C Wadoux
Master’s thesis. University of Nantes, France, 2018
The turn of the twentieth century witnessed the emergence of a new geographical and global vision of soils. This was due in particular to the work of a Russian school of thought, from which the notion of soil as an independent and varying natural body was derived. This notion was confronted with the views of agronomists, chemists and geologists before becoming established, thanks to the very effective dissemination of the Russian writings and the parallel emergence of similar theories in Germany and the United States. In this thesis, I propose a historical and epistemological study of the context, actors and drivers of this change in soil science over the period 1883-1910. I first identify the context in which soil science was situated at the beginning of the study period, describing earlier work, the contradictions in the existing theories, and the conditions under which the ideas of the Russian school emerged. In a second part, I draw up an inventory of the methods used in soil science: what were the daily tools of soil scientists? I continue with a focus on the processes of reasoning employed and the role of the hypothesis. Finally, I analyse the lexicon of two texts chosen to represent two divergent schools of thought. I show that soil science initiated its transformation within the Russian school first, but that the rapid evolution of the theory originated from the joint contribution of methods and reasoning inherited from agricultural chemistry and geology. These two disciplines, after a short period of opposition, also helped develop soil science from a conceptual and methodological point of view.
Accounting for non-stationary variance in geostatistical mapping of soil properties
Alexandre MJ-C Wadoux, Dick J Brus, and Gerard BM Heuvelink
Simple and ordinary kriging assume a constant mean and variance of the soil variable of interest. This assumption is often implausible because the mean and/or variance are linked to terrain attributes, parent material or other soil forming factors. In kriging with external drift (KED) non-stationarity in the mean is accounted for by modelling it as a linear combination of covariates. In this study, we applied an extension of KED that also accounts for non-stationary variance. Similar to the mean, the variance is modelled as a linear combination of covariates. The set of covariates for the mean may differ from the set for the variance. The best combinations of covariates for the mean and variance are selected using Akaike’s information criterion. Model parameters of the selected model are then estimated by differential evolution using the Restricted Maximum Likelihood (REML) in the objective function. The methodology was tested in a small area of the Hunter Valley, NSW Australia, where samples from a fine grid with gamma K measurements were treated as measurements of the variable of interest. Terrain attributes were used as covariates. Both a non-stationary variance and a stationary variance model were calibrated. The mean squared prediction errors of the two models were somewhat comparable. However, the uncertainty about the predictions was much better quantified by the non-stationary variance model, as indicated by the mean and median of the standardized squared prediction error and by accuracy plots. We conclude that the non-stationary variance model is more flexible and better suited for uncertainty quantification of a mapped soil property. However, parameter estimation of the non-stationary variance model requires more attention due to possible singularity of the covariance matrix.
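The core of the non-stationary variance model is a likelihood in which both the mean and the variance depend on covariates. A minimal single-covariate sketch, ignoring spatial correlation (which the full REML treatment would include through the covariance matrix):

```python
import math

def neg_log_lik(params, xs, ys):
    """Gaussian negative log-likelihood with the mean linear in the
    covariate x and the variance log-linear in x. Spatial correlation is
    ignored here; the full model handles it via REML on the joint
    covariance matrix."""
    b0, b1, a0, a1 = params
    nll = 0.0
    for x, y in zip(xs, ys):
        mu = b0 + b1 * x
        var = math.exp(a0 + a1 * x)  # log link keeps the variance positive
        nll += 0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)
    return nll
```

Minimizing this objective over `(b0, b1, a0, a1)`, for example with differential evolution as in the paper, yields a fitted variance surface alongside the mean, which is what improves the uncertainty quantification relative to a stationary variance model.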
2017
Sampling design optimisation for rainfall prediction using a non-stationary geostatistical model
Alexandre MJ-C Wadoux, Dick J Brus, Miguel A Rico-Ramirez, and
1 more author
The accuracy of spatial predictions of rainfall by merging rain-gauge and radar data is partly determined by the sampling design of the rain-gauge network. Optimising the locations of the rain-gauges may increase the accuracy of the predictions. Existing spatial sampling design optimisation methods are based on minimisation of the spatially averaged prediction error variance under the assumption of intrinsic stationarity. Over the past years, substantial progress has been made to deal with non-stationary spatial processes in kriging. Various well-documented geostatistical models relax the assumption of stationarity in the mean, while recent studies show the importance of considering non-stationarity in the variance for environmental processes occurring in complex landscapes. We optimised the sampling locations of rain-gauges using an extension of the Kriging with External Drift (KED) model for prediction of rainfall fields. The model incorporates both non-stationarity in the mean and in the variance, which are modelled as functions of external covariates such as radar imagery, distance to radar station and radar beam blockage. Spatial predictions are made repeatedly over time, each time recalibrating the model. The space-time averaged KED variance was minimised by Spatial Simulated Annealing (SSA). The methodology was tested using a case study predicting daily rainfall in the north of England for a one-year period. Results show that (i) the proposed non-stationary variance model outperforms the stationary variance model, and (ii) a small but significant decrease of the rainfall prediction error variance is obtained with the optimised rain-gauge network. In particular, it pays off to place rain-gauges at locations where the radar imagery is inaccurate, while keeping the distribution over the study area sufficiently uniform.
Sediment reallocations due to erosive rainfall events in the Three Gorges Reservoir Area, Central China
Felix Stumpf, Philipp Goebes, Karsten Schmidt, and
5 more authors
Digital soil mapping (DSM) products represent estimates of spatially distributed soil properties. These estimates comprise an element of uncertainty that is not evenly distributed over the mapped area. If this uncertainty is quantified in a spatially explicit way, it can be used to improve the quality of DSM by optimizing the sampling design. This study follows a DSM approach using a Random Forest regression model, legacy soil samples, and terrain covariates to estimate topsoil silt and clay contents in a small catchment of 4.2 km2 in the Three Gorges Reservoir Area, Central China. We aim (i) to introduce a method to derive spatially explicit uncertainty, and (ii) to improve the initial DSM approach by additional sampling guided by this uncertainty. The proposed uncertainty measure is based on multiple realizations of individual, randomized decision-tree models. We used the spatial uncertainty of the initial DSM approach to stratify the study area and thereby identify potential sampling areas of high uncertainty. Further, we tested how well the available legacy samples cover the variability of the covariates within each potential sampling area, in order to define the final sampling area and to apply a purposive sampling design. For the final Random Forest model calibration, we combined the legacy sample set with the additional samples. This uncertainty-driven DSM refinement was evaluated by comparing it to a second approach, in which the additional samples were replaced by a random sample set of the same size obtained from the entire study area. For the comparative analysis, external validation, bootstrap validation, and cross-validation were applied. The DSM approach using the uncertainty-driven refinement performed best. The averaged spatial uncertainty was reduced by 31% for silt and by 27% for clay compared to the initial DSM approach.
Using external validation, the accuracy increased by the same proportions, while showing an overall accuracy of R2 = 0.59 for silt and R2 = 0.56 for clay.
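The uncertainty measure based on multiple randomized models can be illustrated with a bootstrap ensemble: the spread of the members’ predictions at a location serves as the spatially explicit uncertainty. In this sketch a 1-nearest-neighbour learner stands in for the randomized decision trees of the Random Forest:

```python
import random
import statistics

def bootstrap_ensemble_predict(X, y, x_new, n_models=50, seed=0):
    """Ensemble-spread uncertainty: each member is a 1-nearest-neighbour
    predictor fitted to a bootstrap resample of the data (a stand-in for
    the randomized trees of a random forest). Returns the ensemble mean
    prediction and its standard deviation at x_new."""
    rng = random.Random(seed)
    n = len(X)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        j = min(idx, key=lambda i: sum((a - b) ** 2
                                       for a, b in zip(X[i], x_new)))
        preds.append(y[j])
    return statistics.mean(preds), statistics.stdev(preds)
```

Evaluating the standard deviation over a prediction grid produces the uncertainty map that the study uses to stratify the area and target additional sampling.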
2016
Incorporating limited field operability and legacy soil samples in a hypercube sampling design for digital soil mapping
Felix Stumpf, Karsten Schmidt, Thorsten Behrens, and
6 more authors
Most calibration sampling designs for Digital Soil Mapping (DSM) demarcate spatially distinct sample sites. In practical applications, major challenges are often limited field accessibility and the question of how to integrate legacy soil samples to cope with the usually scarce resources for field sampling and laboratory analysis. This study focuses on the development and application of a more efficient DSM sampling design that (1) applies an optimized sample set size, (2) compensates for limited field accessibility, and (3) enables the integration of legacy soil samples. The proposed sampling design is a modification of conditioned Latin Hypercube Sampling (cLHS), which originally returns distinct sample sites that optimally cover a soil-related covariate space and preserve the correlation of the covariates in the sample set. The sample set size was determined by comparing multiple sample set sizes of original cLHS sets according to their representation of the covariate space. Limited field accessibility and the integration of legacy samples were accommodated by providing alternative sample sites to replace the original cLHS sites. We applied the modified cLHS design (cLHSadapt) in a small catchment (4.2 km2) in Central China to model topsoil sand fractions using Random Forest regression (RF). To evaluate the proposed approach, we compared cLHSadapt with the original cLHS design (cLHSorig). With an optimized sample set size of n = 30, the results show a similar representation of the cLHS covariate space between cLHSadapt and cLHSorig, while the correlation between the covariates is preserved (r = 0.40 vs. r = 0.39). Furthermore, we doubled the sample set size of cLHSadapt by adding available legacy samples (cLHSadapt+) and compared the prediction accuracies. Based on an external validation set cLHSval (n = 20), the coefficients of determination (R2) of the cLHSadapt predictions range between 0.59 and 0.71 for the topsoil sand fractions.
The R2-values of the RF predictions based on cLHSadapt+, using additional legacy samples, are marginally increased on average by 5%.
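The cLHS criterion that any modification (replacement sites, added legacy samples) must preserve can be checked directly: a Latin hypercube sample places exactly one unit in each marginal quantile stratum of every covariate. Below is a sketch of that coverage count, one component of the full cLHS objective, which additionally penalizes distortion of the covariate correlations:

```python
def hypercube_coverage(sample, population, n_strata=None):
    """Count, per covariate, how far the sample is from having exactly one
    unit in each marginal quantile stratum (0 = perfect Latin hypercube
    coverage of the covariate space)."""
    if n_strata is None:
        n_strata = len(sample)
    k = len(population[0])  # number of covariates
    miss = 0
    for c in range(k):
        col = sorted(p[c] for p in population)
        # quantile break points splitting the population into n_strata strata
        breaks = [col[int(len(col) * q / n_strata)] for q in range(1, n_strata)]
        counts = [0] * n_strata
        for s in sample:
            stratum = sum(s[c] > b for b in breaks)
            counts[stratum] += 1
        miss += sum(abs(cn - 1) for cn in counts)
    return miss
```

A replacement site for an inaccessible cLHS location can then be chosen as the accessible candidate that keeps this count (and the covariate correlations) closest to the original design.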
2015
Mid-Infrared spectroscopy for soil and terrain analysis
Diffuse reflectance spectroscopy is a fast, cost-efficient and non-destructive method well suited to deriving a large number of soil properties from a single scan. The ability of infrared spectroscopy to predict soil components has been widely described in numerous studies. Especially in the mid-infrared (MIR) region (400-4000 cm−1), common calibration methods allow the prediction of various soil properties with high accuracy. The bending or stretching vibrations at precise wavelengths allow a qualitative diagnostic of soil components without any coupled chemical analysis. Recent studies show that much of the soil information is summarized in a few wavelengths of the spectrum; its use for soil monitoring, however, remains unexplored. In this thesis, we propose a method to identify quickly which soil attributes are influenced by various, easily obtainable environmental secondary information. We first demonstrate that a few bands of our spectra can represent most of the variability of a target soil property. Consequently, through the study of spectra-terrain relationships, we highlight the link between terrain derivatives and the information content of the spectra. We implemented three calibration methods, Partial Least Squares Regression (PLSR), Cubist and Support Vector Machines (SVM), and then used a robust linear model to assess the precision and significance of the modelled terrain attributes (as independent variables) with respect to the bands of the spectra. The 140 samples were collected from a heterogeneous 4.2 km2 catchment area in Hubei province, central China, and scanned in the mid-infrared range using an Alpha FT-IR spectrometer. The spectra are first linked to laboratory-measured soil properties to calibrate our models, and then linked to 34 terrain attributes derived from a digital elevation model with a resolution of 25 m.
The multivariate relationship is qualitatively interpreted based on terrain spectrograms derived from the fitted models. The results show that (i) the three calibration methods tested are efficient for predicting soil texture and organic matter, so our spectral library contains information about soil properties; and (ii) soil mineralogy, and particularly the clay minerals, is strongly linked to the bedrock properties as well as to elevation. In contrast, soil organic matter is difficult to interpret, showing reasonable correlations with vegetation coverage and slope only for the aromatic and alkyl groups. The method appears suitable for investigating soil-landscape relationships through mid-infrared spectroscopy, without any prior laboratory analysis.