Astraea.earth

By visualizing our model predictions, we are able to identify areas at high risk of food insecurity and limited water accessibility.

Climate change is expected to increase the prevalence of malnutrition in the 21st century through higher temperatures, increasing rainfall volatility, and weather extremes that could have devastating effects on crops and freshwater availability. At Astraea, we specialize in applying machine learning to satellite imagery and have developed a geospatial machine learning platform, called EarthAI, to handle the volume and complexity of remote sensing imagery. In our crop classification project, we trained a model to classify crops and water using remote sensing imagery and used these predictions to study changes in food and freshwater accessibility in Zambia, which ranked 5th worst in the 2018 Global Hunger Index, placing it in the “alarming” category.⁵ In Zambia, 60% of the population lives in poverty and 40% of children have stunted growth due to malnutrition.⁴ Through this study, we were able to detect changes in natural resource accessibility and track crop availability and rotation over time, thereby getting a sense of food insecurity in the region.

“As a benefit corporation, it’s very important for us to connect individuals and organizations to the information they need to improve the world around them. There are almost an unlimited number of ways to do this but it’s really rewarding to work on projects where access to information or technology is simply not a given. Focusing EarthAI on the food shortages in Zambia was an inspiring challenge for us as a company.” ~ Brendan Richardson, CEO

Data

Because of its global coverage and long archive of free and publicly available data, we decided to use Landsat 8 imagery in this project. This satellite revisits the same spot on Earth approximately every 16 days and produces imagery at a spatial resolution of 30 meters in the visible, near-infrared, and short-wave infrared portions of the spectrum.²

A ground truth geo-referenced crop dataset is hard to find for most of the world, but luckily the United States Department of Agriculture (USDA) has produced one annually for the continental U.S. since 2008.³ This dataset, called CropScape, labels 92 different crops and 25 land cover types at 30-meter resolution.³ We used the CropScape dataset to train our model in the United States and then extrapolated this model to Zambia.

Figure 1. Sources: Landsat 8 imagery and CropScape Cropland Data Layer.

Area of Interest

Zambia has enjoyed a long period of peace and stability but has been severely impacted by climate change.⁴ We intentionally selected a politically stable country to control for confounding variables, such as war or economic sanctions. That way, the food insecurity we observe is more likely to result from weather extremes caused by climate change, which affect crops slowly over time, than from a volatile political environment or another intangible variable beyond the analysis capabilities of remote sensing data.

In Zambia, agriculture supports the livelihoods of 85% of the population,⁶ but it is a landlocked country where most farmers rely on rainfall to water their crops, making it especially vulnerable to weather fluctuations caused by climate change.⁴ Because of these factors and because the crops grown in Zambia are well matched to the crops in the U.S., Zambia was an easy choice for model extrapolation.

Using some basic statistics about crop production in Zambia (Data Africa, FAO Country Report, and Zambia Data Portal), we determined the most common crops in Zambia according to the number of hectares planted and yield of each crop in 2015 and 2016. In total, we included nine crop types in our model: corn, wheat, millet, potato, mixed beans, peanuts, sugarcane, tobacco, and cotton.

Spectral Feature Engineering

The features used to train our classification models were the raw bands from Landsat 8 (red, green, blue, near-infrared, two short-wave infrared, and a coastal band) and a derived set of water, vegetation, and soil indices. Because vegetation indices spike at different times for different crops, it’s important to include a time series of feature values to differentiate between crops.

The normalized difference vegetation index (NDVI) is the difference between the near-infrared and red bands divided by their sum, (NIR − Red) / (NIR + Red), and is used to measure plant health.⁷ A higher NDVI value is indicative of a healthier plant.⁷ Figure 2 gives you a sense of how a time series of NDVI is able to differentiate crops. The NDVI of corn peaks early, in June and July; cotton, peanuts, millet, and tobacco peak in August; and dry beans peak in September.
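As a small illustration of the index itself, here is a sketch of the NDVI calculation for a single pixel. The function name and the reflectance values are made up for the example; they are not taken from the study or from EarthAI.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel.

    nir, red: non-negative reflectance values for the near-infrared
    and red bands. Returns a value in [-1, 1], or None when both
    bands are zero and the index is undefined.
    """
    denom = nir + red
    if denom == 0:
        return None  # undefined; treat as missing data
    return (nir - red) / denom


# Illustrative reflectance values (hypothetical, not from the study).
healthy_vegetation = ndvi(nir=0.50, red=0.08)
bare_soil = ndvi(nir=0.30, red=0.25)
print(round(healthy_vegetation, 3), round(bare_soil, 3))
```

Healthy vegetation reflects strongly in the near-infrared and absorbs red light, so its NDVI sits well above that of bare soil.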

Figure 2: Time series chart of average NDVI by crop and month.

Modeling

The Astraea EarthAI platform is built on Apache Spark and Amazon Web Services (AWS) infrastructure. Since the amount of imagery we needed to process for this modeling effort was almost a terabyte, we used EarthAI to perform the modeling in a timely and distributed manner. Our process is outlined below.

Read in the Landsat 8 imagery and ground truth CropScape data and spatially join these two raster datasets by re-projecting and resampling the ground truth data to match the remote sensing data.
Derive the water, vegetation, and soil indices from the raw Landsat 8 bands.
Explode the raster data into pixels, so each row in our dataset is a pixel and the columns are different bands.
Use the Landsat 8 quality assessment (QA) band to filter out bad data, such as pixels covered by clouds.
Create the time-series features by averaging each of the spectral features by month.
Split the data into train and test sets.
Train and evaluate a classification model.
Score the model over Zambia across two years: 2015 and 2016.
Perform post-classification change detection on the model’s predictions.
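
To make the pixel and time-series steps above more concrete, here is a minimal sketch in plain Python of averaging one spectral feature by month once the rasters have been exploded into pixel rows. The row layout (pixel_id, acquisition date, value) and the function name are assumptions for illustration; the actual pipeline ran on Spark through EarthAI and is not shown here.

```python
from collections import defaultdict
from datetime import date


def monthly_means(observations):
    """Average one feature per (pixel, month).

    observations: iterable of (pixel_id, acquisition_date, value) rows,
    i.e. the "exploded" form in which each row is a single pixel reading.
    Returns {pixel_id: {month: mean_value}}.
    """
    # For each pixel and month, keep a running [sum, count] pair.
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for pixel_id, acquired, value in observations:
        cell = sums[pixel_id][acquired.month]
        cell[0] += value
        cell[1] += 1
    return {
        pixel: {month: total / count for month, (total, count) in months.items()}
        for pixel, months in sums.items()
    }


# Hypothetical NDVI readings for two pixels.
rows = [
    ("px1", date(2015, 6, 3), 0.60),
    ("px1", date(2015, 6, 19), 0.70),
    ("px1", date(2015, 7, 5), 0.80),
    ("px2", date(2015, 6, 3), 0.20),
]
features = monthly_means(rows)
print(features["px1"])
```

Each pixel ends up with one value per month per feature, which is the kind of time series that lets a classifier separate crops whose indices peak at different times.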

Figure 3: Modeling workflow using EarthAI platform.

Because not all of the crop calendars of our nine chosen crops were in sync, we decided to have two separate models for summer and winter crops. In the U.S., sugarcane, wheat, and potatoes are winter crops, which are planted in late summer/early fall and harvested the following summer, while corn, cotton, peanuts, millet, dry beans, and tobacco are summer crops, which are planted in early spring and harvested in the fall.⁸ In Zambia, which lies in the Southern Hemisphere, the crop calendars are reversed.⁹

For both the summer and winter growing seasons, our best performing model was a Gradient Boosted Trees (GBT) model. The GBT summer model had an overall accuracy of 0.876 and an F1 score of 0.792. The class metrics are shown in figure 4. Millet has poor recall compared to the rest of the classes because our training set was unbalanced, and we had too few examples of this class.

The GBT winter model had an overall accuracy of 0.956 and an F1 score of 0.895. Fewer classes and a barren winter landscape made these classes easier to model than the summer classes. The class metrics are shown in figure 5.
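
To show how overall accuracy can stay high while a rare class such as millet has poor recall, here is a sketch that derives per-class metrics from a confusion matrix. The counts are invented for illustration and are not the study's results; the source also does not say how the reported F1 score was averaged across classes, so this example only computes per-class values.

```python
def class_metrics(confusion, labels):
    """Per-class precision, recall, and F1 from a square confusion matrix.

    confusion[i][j] = number of test pixels with true class i predicted as j.
    """
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        actual = sum(confusion[k])                    # row total
        predicted = sum(row[k] for row in confusion)  # column total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def overall_accuracy(confusion):
    """Fraction of all test pixels on the diagonal of the matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Made-up counts: corn is common, millet is rare.
labels = ["corn", "millet"]
confusion = [[95, 5],
             [6, 4]]

per_class = class_metrics(confusion, labels)
print(overall_accuracy(confusion), per_class["millet"]["recall"])
```

In this toy matrix, 99 of 110 pixels are classified correctly, yet only 4 of the 10 millet pixels are recovered, which is why class-level metrics matter for unbalanced training sets.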

Figure 4: Summer model metrics for test set.

Figure 5: Winter model metrics for test set.

Findings

We scored the summer and winter models over Zambia across two years and performed post-classification change detection on areas within Zambia. We were able to detect changes in natural resource availability, such as a significant reduction in water levels in the Mita Hills freshwater reservoir between 2015 and 2016, as shown in figure 6.

Figure 6: Model predictions overlaid on Mita Hills Dam, Mkushi District, Zambia for 2015 and 2016.
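
Post-classification change detection, in its simplest form, compares the predicted class of each pixel in one year with its predicted class in the next. The sketch below counts those transitions for two toy label grids. The grids, the "other" class, and the helper name are hypothetical; the 0.09 hectare pixel area follows from Landsat 8's 30-meter resolution.

```python
from collections import Counter

PIXEL_AREA_HA = 0.09  # one 30 m x 30 m pixel = 900 square meters = 0.09 hectares


def class_transitions(before, after):
    """Count (class_before, class_after) pairs over two aligned label grids."""
    if len(before) != len(after) or any(
        len(a) != len(b) for a, b in zip(before, after)
    ):
        raise ValueError("label grids must have identical shape")
    return Counter(
        (b, a)
        for row_before, row_after in zip(before, after)
        for b, a in zip(row_before, row_after)
    )


# Toy 3x3 predictions for two years (illustrative only).
pred_2015 = [["water", "water", "corn"],
             ["water", "corn", "corn"],
             ["corn", "corn", "tobacco"]]
pred_2016 = [["water", "other", "corn"],
             ["other", "tobacco", "corn"],
             ["corn", "tobacco", "tobacco"]]

transitions = class_transitions(pred_2015, pred_2016)
changed = {pair: n for pair, n in transitions.items() if pair[0] != pair[1]}
water_lost_ha = sum(
    n for (b, a), n in changed.items() if b == "water"
) * PIXEL_AREA_HA
print(changed, water_lost_ha)
```

Counting transitions rather than just "changed or not" keeps the direction of change, so a shrinking reservoir (water to another class) can be told apart from a crop rotation (corn to tobacco).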

We were also able to track crop availability and rotation over time, such as the example in figure 7, which shows a shift from food crops to cash crops. In times of need, cash crops generate more income because they are grown for sale, not consumption.¹

Figure 7: Model predictions overlaid on an Agricultural District of Zambia for 2015 and 2016.

By visualizing our model predictions, we are able to identify areas at high risk of food insecurity and limited water accessibility. Sharing this information in a timely manner could accelerate intervention efforts to prevent hunger and malnutrition.

Challenges

Whenever a model trained on one set of data is extrapolated to a new set of data, generalizability is a challenge. Our model was trained on imagery in the U.S. because we couldn’t find ground truth geo-referenced crop data in Zambia. The United States is large and not all states exhibit similar characteristics to Zambia, so we analyzed and selected which U.S. states were most similar to Zambia in terms of climate, weather, soil, biome, ecoregion, and crop availability to create the most representative training set.

Another challenge with this project was the sheer volume of remote sensing data that needed to be processed and the speed at which we could complete feature engineering. Luckily, the use of Astraea EarthAI mitigated these issues as well as all the hidden complexity of handling geospatial data sources.

“One of the major challenges with incorporating Earth-observing (EO) data into data science is the handling of the disparate map projections and coordinate systems. Another challenge is the massive size of the data. We designed Astraea EarthAI from the ground-up to handle both of these challenges with little to no impact on the data scientist. This is exemplified by the EarthAI “Raster Join” feature, whereby large data sets from all over the planet can be joined spatially and temporally with the complexities managed by the platform.” ~ Simeon Fitch, VP of Research and Development

The remote sensing data we used in this project was about a terabyte in size, which was then reduced to 11 GB after aggregating and filtering the data in EarthAI. Part of the reason for that significant reduction in size was that we were using optical data and had to contend with clouds. If a cloud covered a specific pixel at any time during the growing season, that pixel was dropped from our data, which was undesirable for the crops that were already rare in our data, such as beans and millet. As a result, we started investigating different ways of reducing missing data, and one of the most promising techniques was incorporating a synthetic aperture radar (SAR) data source into our model.
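
The effect of this strict cloud rule can be sketched as follows. The code assumes the QA band stores a cloud flag as a single bit in a bit-packed integer; the bit position used here (4) is an assumption and should be checked against the QA band documentation for the specific Landsat product. The function names and sample values are hypothetical.

```python
CLOUD_BIT = 4  # assumption: position of the cloud flag; verify for your product


def is_cloudy(qa_value, cloud_bit=CLOUD_BIT):
    """True when the cloud flag is set in a bit-packed QA value."""
    return bool((qa_value >> cloud_bit) & 1)


def cloud_free_pixels(qa_series):
    """Keep only pixels with no cloudy observation in the whole season.

    qa_series: {pixel_id: [qa_value, ...]} with one QA value per acquisition.
    """
    return {
        pixel
        for pixel, values in qa_series.items()
        if not any(is_cloudy(v) for v in values)
    }


# Three pixels, three acquisitions each (made-up QA values).
season = {
    "px1": [0, 0, 0],
    "px2": [0, 1 << CLOUD_BIT, 0],  # a single cloudy date drops the pixel
    "px3": [0, 0, 1 << CLOUD_BIT],
}
kept = cloud_free_pixels(season)
print(sorted(kept))
```

Because one cloudy acquisition is enough to discard a pixel, the share of surviving pixels shrinks quickly as the number of acquisitions in a season grows, which hits rare classes hardest.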

SAR is a form of radar that works by transmitting microwave pulses down to Earth. The pulses are reflected or scattered when they come into contact with an object and then return to the satellite, where a sensor makes an image from the returned echoes. Unlike passive optical sensors that require sunlight to illuminate an object, SAR can sense during the day as well as the night, thereby producing more imagery. Another advantage of SAR is that microwave pulses are able to penetrate clouds, so SAR can produce imagery even on cloudy days. We’ve had some promising initial results combining SAR with optical imagery to classify crops, as exemplified in figure 8.

Figure 8: Comparison of model predictions without and with SAR data.

By utilizing the data science expertise of staff and scalable processing abilities of the Astraea EarthAI platform, we’ve developed a model with the ability to track water and crop availability over time in a country at risk of food insecurity. As climate change continues to alter weather patterns, countries affected will have to adapt to different growing seasons and water patterns, so a model that allows people to visualize the changing distribution of crops and water over time will help countries adapt as well as identify areas at high risk of food insecurity.

For questions or comments, please visit our website: www.astraea.earth.

Written by Courtney Layman, Senior Data Scientist

Many thanks to the other data scientists at Astraea who worked on this project:
Dr. Kimberly Scott, Co-Founder & VP of Data Science
Jason Brown, Senior Data Scientist
Eric Culbertson, Data Scientist

References


Predicting Food Insecurity in Zambia Using Satellite Imagery