Estimation of tree height using GEDI dataset - Random Forest prediction

JupyterLab setup and library installation [Just for UBUNTU 24.04]

To run this lesson properly on osboxes.org you need a clean installation of JupyterLab done from the command line. If you did not do it for a previous lesson (Python data analysis or Python geospatial analysis), do it for this one, directly in the terminal:

pipx uninstall jupyterlab
pipx install --system-site-packages jupyterlab

and just after that you can proceed to install the Python libraries, using apt install, again in the terminal:

sudo apt install python3-numpy python3-pandas python3-sklearn python3-matplotlib python3-rasterio python3-geopandas python3-seaborn python3-skimage python3-xarray

JupyterLab setup and library installation [Just for OSGEOLIVE16]

In order to run this notebook you need to run these commands in your terminal:

source $HOME/venv/bin/activate
sudo apt update
sudo apt install python3-pip
pip3 install pyspatialml

After that, install all the other needed libraries directly in Jupyter using shell-escape lines (!):

!pip3 install "pandas>=2.2.2,<3.0.0" "numpy>=1.26.4,<2.0.0" "rasterio>=1.3.10,<2.0.0" "scikit-learn>=1.4.2,<2.0.0" "scipy>=1.13.0,<2.0.0" "matplotlib>=3.9.0,<4.0.0" "xarray>=2024.1.0,<2025.0.0"

Open the notebook lesson

Open the lesson using the command line:

cd /media/sf_LVM_shared/my_SE_data/exercise
jupyter-lab Tree_Height_03RF_pred_SM.ipynb

Model definition

Find a relationship between tree height and environmental predictors in order to predict tree height outside the GEDI observations.
The model, in its simplest form, looks like the following:
y ~ f(a*x1 + b*x2 + c*x3 + …) + ε,

where y is the response variable, the x’s are predictor variables, and ε is the associated error.
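
As a minimal sketch of the simplest (linear) case of this model, the snippet below fits y against two synthetic predictors with scikit-learn; the data and coefficients are made up purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic example of y ~ a*x1 + b*x2 + ε, with a=2, b=-3
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                    # two predictors x1, x2
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=200)  # response plus noise ε

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # recovers a, b and an intercept near 0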

Different types of models

Parametric models – make assumptions about the underlying distribution of the data:

  • Maximum likelihood classification,

  • Discriminant analysis,

  • General linear models (e.g. linear regression)

Nonparametric models – make no assumptions about the underlying data distributions:

  • Generalized additive models,

  • Support Vector Machine,

  • Artificial neural networks,

  • Random Forest

Random forest

Random forest is a supervised machine learning algorithm widely used in classification (categorical response variables) and regression (continuous response variables) problems. It builds decision trees on different samples and takes their majority vote for classification and their average in the case of regression.

One of the most important features of the random forest algorithm is that it can handle both continuous and categorical predictors without any assumption on their distribution or dimension, even if the predictors are autocorrelated.

Bagging
Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random forest. Each tree is built from a random sample drawn with replacement from the original data (a bootstrap sample); this row sampling with replacement is the bootstrap step. Each tree is trained independently and generates its own result. The final output is based on majority voting (or averaging) after combining the results of all trees; this step of combining all the results and generating the output is known as aggregation.

Steps involved in random forest algorithm:

  • Step 1: n random records are taken from a data set containing k records.

  • Step 2: An individual decision tree is constructed for each sample.

  • Step 3: Each decision tree generates an output.

  • Step 4: The final output is based on majority voting (classification) or averaging (regression); see the sketch after the source note below.

(source https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/)
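
To make the bagging idea concrete, here is a minimal hand-rolled sketch (synthetic data, scikit-learn decision trees, not the notebook's actual model): each tree is fit on a bootstrap sample, and the regression prediction is the average over all trees.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

trees = []
for _ in range(50):                               # grow 50 trees
    idx = rng.integers(0, len(X), size=len(X))    # bootstrap: row sampling with replacement
    trees.append(DecisionTreeRegressor(max_depth=5).fit(X[idx], y[idx]))

X_new = np.array([[2.5], [7.0]])
pred = np.mean([t.predict(X_new) for t in trees], axis=0)  # aggregation: average over trees
print(pred)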

Random Forest test

In this exercise we will use the Random Forest regression algorithm from the scikit-learn Python package to estimate the tree height. In the last step, we will see how a prediction on a raster could be made with the Pyspatialml library (now deprecated), and how we can do the same with the more robust, up-to-date xarray and rasterio libraries for spatial prediction on the raster files.

[2]:
import pandas as pd
import numpy as np
import rasterio
from rasterio.plot import show
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.pipeline import Pipeline
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10,6.5)

Data analysis

Before anything else we need to import the raw data with the extracted predictors and look at the overall data distribution.

[3]:
predictors = pd.read_csv("tree_height/txt/eu_x_y_height_predictors_select.txt", sep=" ",  index_col=False)
pd.set_option('display.max_columns',None)

# change column name
predictors = predictors.rename({'dev-magnitude':'devmagnitude'} , axis='columns')
predictors.head(10)
[3]:
ID X Y h BLDFIE_WeigAver CECSOL_WeigAver CHELSA_bio18 CHELSA_bio4 convergence cti devmagnitude eastness elev forestheight glad_ard_SVVI_max glad_ard_SVVI_med glad_ard_SVVI_min northness ORCDRC_WeigAver outlet_dist_dw_basin SBIO3_Isothermality_5_15cm SBIO4_Temperature_Seasonality_5_15cm treecover
0 1 6.050001 49.727499 3139.00 1540 13 2113 5893 -10.486560 -238043120 1.158417 0.069094 353.983124 23 276.871094 46.444092 347.665405 0.042500 9 780403 19.798992 440.672211 85
1 2 6.050002 49.922155 1454.75 1491 12 1993 5912 33.274361 -208915344 -1.755341 0.269112 267.511688 19 -49.526367 19.552734 -130.541748 0.182780 16 772777 20.889412 457.756195 85
2 3 6.050002 48.602377 853.50 1521 17 2124 5983 0.045293 -137479792 1.908780 -0.016055 389.751160 21 93.257324 50.743652 384.522461 0.036253 14 898820 20.695877 481.879700 62
3 4 6.050009 48.151979 3141.00 1526 16 2569 6130 -33.654274 -267223072 0.965787 0.067767 380.207703 27 542.401367 202.264160 386.156738 0.005139 15 831824 19.375000 479.410278 85
4 5 6.050010 49.588410 2065.25 1547 14 2108 5923 27.493824 -107809368 -0.162624 0.014065 308.042786 25 136.048340 146.835205 198.127441 0.028847 17 796962 18.777500 457.880066 85
5 6 6.050014 48.608456 1246.50 1515 19 2124 6010 -1.602039 17384282 1.447979 -0.018912 364.527100 18 221.339844 247.387207 480.387939 0.042747 14 897945 19.398880 474.331329 62
6 7 6.050016 48.571401 2938.75 1520 19 2169 6147 27.856503 -66516432 -1.073956 0.002280 254.679596 19 125.250488 87.865234 160.696777 0.037254 11 908426 20.170450 476.414520 96
7 8 6.050019 49.921613 3294.75 1490 12 1995 5912 22.102139 -297770784 -1.402633 0.309765 294.927765 26 -86.729492 -145.584229 -190.062988 0.222435 15 772784 20.855963 457.195404 86
8 9 6.050020 48.822645 1623.50 1554 18 1973 6138 18.496584 -25336536 -0.800016 0.010370 240.493759 22 -51.470703 -245.886719 172.074707 0.004428 8 839132 21.812290 496.231110 64
9 10 6.050024 49.847522 1400.00 1521 15 2187 5886 -5.660453 -278652608 1.477951 -0.068720 376.671143 12 277.297363 273.141846 -138.895996 0.098817 13 768873 21.137711 466.976685 70
[6]:
print("NUMBER OF ROWS:",len(predictors))
NUMBER OF ROWS: 1267239

To analyze the data distribution more deeply it is always good to plot it

[7]:
bins = np.linspace(min(predictors['h']),max(predictors['h']),100)
plt.hist((predictors['h']),bins,alpha=0.8);
../_images/CASESTUDY_Tree_Height_03RF_pred_SM_17_0.png

In this way we can see the range the data span and detect outliers.

Indeed, in this case we will select only trees with height below 70 m, because the tallest tree detected in Germany, located in this area, is 67 m (https://visit.freiburg.de/en/attractions/waldtraut-germany-s-tallest-tree).

We then proceed to sample just 100K points out of the 1.2M, to train a lighter model.

[24]:
predictors_sel = predictors.loc[(predictors['h'] < 7000)].sample(100000)  # h is in cm: 7000 cm = 70 m

# add a height column in metres
predictors_sel.insert(4, 'hm', predictors_sel['h']/100)

# quick look
predictors_sel.head(10)
[24]:
ID X Y h hm BLDFIE_WeigAver CECSOL_WeigAver CHELSA_bio18 CHELSA_bio4 convergence cti devmagnitude eastness elev forestheight glad_ard_SVVI_max glad_ard_SVVI_med glad_ard_SVVI_min northness ORCDRC_WeigAver outlet_dist_dw_basin SBIO3_Isothermality_5_15cm SBIO4_Temperature_Seasonality_5_15cm treecover
81999 82000 6.349125 49.644223 2659.75 26.5975 1530 16 2272 5931 21.476559 -141539392 1.694310 -0.040122 331.223236 23 631.313477 214.736084 365.218750 0.057163 25 717991 18.626610 451.421082 86
965362 965363 9.012956 49.690358 2689.50 26.8950 1536 12 2137 6537 15.065849 120771872 -1.399779 -0.072601 242.992401 26 353.459473 270.740967 220.016357 0.026610 9 719463 20.024574 492.804718 75
1236259 1236260 9.801078 49.395159 2572.00 25.7200 1528 16 2238 6472 16.836056 -162671216 0.910004 -0.033257 418.785919 20 -54.990723 61.513428 257.196045 0.045267 8 806762 19.990988 525.202148 82
1115641 1115642 9.382723 49.106039 2370.00 23.7000 1513 15 2407 6637 -24.503428 -231658992 -0.858004 -0.064393 267.034180 4 536.372314 470.607422 432.341064 0.030045 6 784144 19.837236 517.255615 29
215407 215408 6.773116 49.944617 1392.00 13.9200 1515 12 1952 6118 26.732821 -193019280 0.534557 -0.005609 300.268585 22 -162.502686 -200.892090 -412.381104 0.056135 15 666767 19.065788 437.629272 74
939859 939860 8.948157 49.643079 2365.25 23.6525 1481 13 2466 6341 -10.630081 -171884224 1.417312 -0.121560 400.390808 26 -52.929199 -124.971191 -149.944336 0.015808 9 737709 19.441750 473.700439 89
157727 157728 6.620174 49.837913 3160.75 31.6075 1536 13 2164 6052 -57.077930 -253935056 1.182596 -0.038630 343.578552 24 178.370117 86.893066 392.301758 -0.081356 23 675920 19.584587 443.499878 100
376387 376388 7.163473 49.791920 3010.00 30.1000 1352 12 2377 5621 -3.525097 -193161968 3.196814 0.090454 699.986816 22 435.113525 436.794434 406.962158 -0.108846 32 638925 22.397139 416.059174 87
426482 426483 7.277403 49.534507 277.75 2.7775 1494 15 2251 6162 7.484096 -268103968 0.745217 0.029769 391.105804 14 -40.088135 -26.999023 -72.698975 0.078123 12 641196 20.374083 485.715973 75
286481 286482 6.954634 48.417400 2362.50 23.6250 1331 20 3297 6108 15.627705 -346475904 0.985921 0.047404 547.931702 33 -2.996826 -0.247314 -181.765503 0.207945 50 960411 21.377501 454.721680 95
[25]:
print("NUMBER OF ROWS:",len(predictors_sel))
NUMBER OF ROWS: 100000

Now we can look at the data distribution again with the same plot used before

[26]:
bins = np.linspace(min(predictors_sel['hm']),max(predictors_sel['hm']),100)
plt.hist((predictors_sel['hm']),bins,alpha=0.8);
../_images/CASESTUDY_Tree_Height_03RF_pred_SM_22_0.png

To have a comparison we will analyze the Global Forest Canopy Height, 2019 map, released in 2020 within a scientific publication (https://doi.org/10.1016/j.rse.2020.112165).

The authors used a regression tree model that was calibrated and applied to each individual Landsat GLAD ARD tile (1° × 1°) in a “moving window” mode.

This tree height estimation is available as forestheight.tif, and its extracted values are available in the table as the forestheight column.

To see how this model performed we can draw a correlation plot and compute its Pearson correlation coefficient.

[27]:
plt.rcParams["figure.figsize"] = (8,6)
plt.scatter(predictors['h']/100,predictors['forestheight'])
plt.xlabel('sensed tree height\n (m)')
plt.ylabel('estimated tree height\n (m)')
ident = [0, 50]
plt.plot(ident,ident,'r--')
[27]:
[<matplotlib.lines.Line2D at 0x712ff962c980>]
../_images/CASESTUDY_Tree_Height_03RF_pred_SM_24_1.png
[28]:
# calculating the pearson correlation index
pearsonr_Publication_Estimation = pearsonr(predictors['h'],predictors['forestheight'])[0]

print("PEARSON CORRELATION COEFFINCENT:", pearsonr_Publication_Estimation)
PEARSON CORRELATION COEFFINCENT: 0.4527925130004339

In this procedure we will try to outperform this model's Pearson correlation coefficient.

To do that we are going to use a more advanced machine learning architecture with more, and different, environmental predictors that can better express the ecological and physical conditions.

Data set splitting

To proceed we need to split the dataset (predictors_sel) in order to create the response variable table that will be used against the predictor variables (excluding the forestheight predictor).

[29]:
X = predictors_sel.iloc[:,[5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23]].values
Y = predictors_sel.iloc[:,4:5].values
feat = predictors_sel.iloc[:,[5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23]].columns.values

It is always good to double-check that we selected the right columns in the right order

[30]:
feat
[30]:
array(['BLDFIE_WeigAver', 'CECSOL_WeigAver', 'CHELSA_bio18',
       'CHELSA_bio4', 'convergence', 'cti', 'devmagnitude', 'eastness',
       'elev', 'glad_ard_SVVI_max', 'glad_ard_SVVI_med',
       'glad_ard_SVVI_min', 'northness', 'ORCDRC_WeigAver',
       'outlet_dist_dw_basin', 'SBIO3_Isothermality_5_15cm',
       'SBIO4_Temperature_Seasonality_5_15cm', 'treecover'], dtype=object)
[31]:
print("RESPONSE VARIABLE DIMENSION:", Y.shape)
RESPONSE VARIABLE DIMENSION: (100000, 1)
[32]:
print("PREDICTING FEATURES DIMENSION:", X.shape)
PREDICTING FEATURES DIMENSION: (100000, 18)
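
Positional iloc indexing is fragile if the column order ever changes; a name-based alternative (a sketch equivalent to the selection above, given the columns shown in the table) would be:

# select predictors by name: drop identifiers, coordinates, the response and forestheight
drop_cols = ['ID', 'X', 'Y', 'h', 'hm', 'forestheight']
X = predictors_sel.drop(columns=drop_cols).values
Y = predictors_sel[['hm']].values
feat = predictors_sel.drop(columns=drop_cols).columns.values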

Now we can create 4 different datasets:

  • 2 for model algorithm training (X_train and Y_train)

  • 2 for model testing (X_test and Y_test)

[40]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, random_state=42)
y_train = np.ravel(Y_train)
y_test = np.ravel(Y_test)

Random Forest default parameters

Random Forest can be implemented using the RandomForestRegressor in scikit-learn.

We can proceed to train the Random Forest model using its default parameters.

[41]:
rf = RandomForestRegressor(random_state = 42)
rf.get_params()
[41]:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}
[42]:
rfReg = RandomForestRegressor(min_samples_leaf=50, oob_score=True)  # min_samples_leaf raised from the default 1 to limit overfitting
rfReg.fit(X_train, y_train);
dic_pred = {}
dic_pred['train'] = rfReg.predict(X_train)
dic_pred['test'] = rfReg.predict(X_test)
pearsonr_all = [pearsonr(dic_pred['train'],y_train)[0],pearsonr(dic_pred['test'],y_test)[0]]
print("PEARSON CORRELATION COEFFINCENT:", pearsonr_all)
PEARSON CORRELATION COEFFINCENT: [0.6035710046325166, 0.506707480767017]
[45]:
# checking the oob score
print("OUT OF BAG SCORE:", rfReg.oob_score_)
OUT OF BAG SCORE: 0.2625683482738882

Additional resources on further reducing the OOB error can be found at:

https://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html
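
In that spirit, a minimal sketch (assuming the X_train/y_train split from above) that tracks the OOB error as trees are added, using warm_start to grow the same forest incrementally:

from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt

rf_oob = RandomForestRegressor(min_samples_leaf=50, oob_score=True,
                               warm_start=True, random_state=42, n_jobs=-1)
n_range = range(25, 201, 25)
oob_error = []
for n in n_range:
    rf_oob.set_params(n_estimators=n)  # warm_start keeps the earlier trees and adds new ones
    rf_oob.fit(X_train, y_train)
    oob_error.append(1 - rf_oob.oob_score_)  # for a regressor, oob_score_ is the OOB R²

plt.plot(list(n_range), oob_error)
plt.xlabel('n_estimators')
plt.ylabel('1 - OOB R²')
plt.show()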

[52]:
plt.rcParams["figure.figsize"] = (8,6)
plt.scatter(y_train,dic_pred['train'])
plt.xlabel('Training GEDI Height (all dataset)\n (m)')
plt.ylabel('Training Tree Height prediction\n (m)')
ident = [0, 50]
plt.plot(ident,ident,'r--')
[52]:
[<matplotlib.lines.Line2D at 0x712ff7839910>]
../_images/CASESTUDY_Tree_Height_03RF_pred_SM_40_1.png
[47]:
plt.rcParams["figure.figsize"] = (8,6)
plt.scatter(y_test,dic_pred['test'])
plt.xlabel('Testing GEDI Height (all dataset)\n (m)')
plt.ylabel('Testing Tree Height prediction\n (m)')
ident = [0, 50]
plt.plot(ident,ident,'r--')
[47]:
[<matplotlib.lines.Line2D at 0x712ff7a23f50>]
../_images/CASESTUDY_Tree_Height_03RF_pred_SM_41_1.png
[54]:
# per-feature importance, with its std across trees (axis=0: one value per feature)
impt = [rfReg.feature_importances_, np.std([tree.feature_importances_ for tree in rfReg.estimators_], axis=0)]
ind = np.argsort(impt[0])
[55]:
ind
[55]:
array([ 1, 13,  0,  4,  6,  8, 16,  2, 15,  5, 12,  3, 11,  9, 14,  7, 10,
       17])

To analyze the model behaviour more deeply we can extract the feature importance ranking

[56]:
plt.rcParams["figure.figsize"] = (6,10)
plt.barh(range(len(feat)),impt[0][ind],color="b", xerr=impt[1][ind], align="center")
plt.yticks(range(len(feat)),feat[ind]);
../_images/CASESTUDY_Tree_Height_03RF_pred_SM_45_0.png
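
Impurity-based importances can be biased when predictors are correlated, which is the case here; scikit-learn's permutation importance offers a complementary check. A minimal sketch, assuming the fitted rfReg and the X_test/y_test split from above:

from sklearn.inspection import permutation_importance

# shuffle each feature on the test split and measure the resulting drop in score
perm = permutation_importance(rfReg, X_test, y_test, n_repeats=5, random_state=42, n_jobs=-1)
ind_perm = np.argsort(perm.importances_mean)
plt.barh(range(len(feat)), perm.importances_mean[ind_perm], xerr=perm.importances_std[ind_perm])
plt.yticks(range(len(feat)), feat[ind_perm]);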

Model tuning with hyperparameter tweaking

To improve model performance we can use tools like GridSearchCV, which exhaustively searches for the best combination of the model hyperparameters employed.

Hyperparameters that were left at their defaults in the first model run include:

  • “max_features”: number of features to consider when looking for the best split.

  • “max_samples”: number of samples to draw from X to train each base estimator.

  • “n_estimators”: the number of trees to grow. It must be large enough for the error to stabilize. Default is 100.

  • “max_depth”: maximum number of levels in each decision tree.

With the code below we can tune those parameters to improve the algorithm's performance (this code is not run directly here because of its high computational demand, in both time and memory):

pipeline = Pipeline([('rf',RandomForestRegressor())])

parameters = {
        'rf__max_features':(3,4,5),
        'rf__max_samples':(0.5,0.6,0.7),
        'rf__n_estimators':(500,1000),
        'rf__max_depth':(50,100,200,300)}

grid_search = GridSearchCV(pipeline,parameters,n_jobs=6,cv=5,scoring='r2',verbose=1)
grid_search.fit(X_train,y_train)


# refit with a chosen parameter combination (e.g. informed by the grid search results)
rfReg = RandomForestRegressor(n_estimators=5000, max_features=0.33, max_depth=500, max_samples=0.7,
                              n_jobs=-1, random_state=24, oob_score=True)
rfReg.fit(X_train, y_train);
dic_pred = {}
dic_pred['train'] = rfReg.predict(X_train)
dic_pred['test'] = rfReg.predict(X_test)
pearsonr_all_tune = [pearsonr(dic_pred['train'],y_train)[0],pearsonr(dic_pred['test'],y_test)[0]]
pearsonr_all_tune

grid_search.best_score_

print('Best Training score: %0.3f' % grid_search.best_score_)
print('Optimal parameters:')
best_par = grid_search.best_estimator_.get_params()
for par_name in sorted(parameters.keys()):
    print('\t%s: %r' % (par_name, best_par[par_name]))

The error rate trend can be tracked as in https://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html

The feature importance ranking shows that the most important variable is treecover, followed by the Spectral Variability Vegetation Index and outlet_dist_dw_basin, which is a proxy for water accumulation.

Besides these, eastness and northness describe microclimatic conditions.

Model tuning with data selection

We can also enhance the model performance through a wise data selection.

In this case we base it on the data quality flags (available in the dataset) to select more reliable tree height points.

The file stores tree heights (cm) obtained by 6 algorithms, with their associated quality flags. The quality flags can be used to refine and select the best tree height estimation and use it as the tree height observation.

  • a?_95: tree height (cm) at the 95th percentile, for each algorithm

  • min_rh_95: minimum value of tree height (cm) among the 6 algorithms

  • max_rh_95: maximum value of tree height (cm) among the 6 algorithms

  • BEAM: 1-4 coverage beam = lower power (worse); 5-8 power beam = higher power (better)

  • digital_elev: digital elevation model

  • elev_low: elevation of center of lowest mode

  • qc_a?: quality flag for the six algorithms; quality_flag = 1 (better), = 0 (worse)

  • se_a?: sensitivity for the six algorithms; sensitivity < 0.95 (worse), sensitivity > 0.95 (better)

  • deg_fg: (degrade_flag) not degraded = 0 (better); degraded > 0 (worse)

  • solar_ele: solar elevation; > 0 day (worse), < 0 night (better)

[57]:
height_6algorithms = pd.read_csv("tree_height/txt/eu_y_x_select_6algorithms_fullTable.txt", sep=" ",  index_col=False)
pd.set_option('display.max_columns',None)
height_6algorithms.head(6)
[57]:
ID X Y a1_95 a2_95 a3_95 a4_95 a5_95 a6_95 min_rh_95 max_rh_95 BEAM digital_elev elev_low qc_a1 qc_a2 qc_a3 qc_a4 qc_a5 qc_a6 se_a1 se_a2 se_a3 se_a4 se_a5 se_a6 deg_fg solar_ele
0 1 6.050001 49.727499 3139 3139 3139 3120 3139 3139 3120 3139 5 410.0 383.72153 1 1 1 1 1 1 0.962 0.984 0.968 0.962 0.989 0.979 0 17.7
1 2 6.050002 49.922155 1022 2303 970 872 5596 1524 872 5596 5 290.0 2374.14110 0 0 0 0 0 0 0.948 0.990 0.960 0.948 0.994 0.980 0 43.7
2 3 6.050002 48.602377 380 1336 332 362 1336 1340 332 1340 4 440.0 435.97781 1 1 1 1 1 1 0.947 0.975 0.956 0.947 0.981 0.968 0 0.2
3 4 6.050009 48.151979 3153 3142 3142 3127 3138 3142 3127 3153 2 450.0 422.00537 1 1 1 1 1 1 0.930 0.970 0.943 0.930 0.978 0.962 0 -14.2
4 5 6.050010 49.588410 666 4221 651 33 5611 2723 33 5611 8 370.0 2413.74830 0 0 0 0 0 0 0.941 0.983 0.946 0.941 0.992 0.969 0 22.1
5 6 6.050014 48.608456 787 1179 1187 761 1833 1833 761 1833 3 420.0 415.51581 1 1 1 1 1 1 0.952 0.979 0.961 0.952 0.986 0.975 0 0.2
[58]:
height_6algorithms_sel = height_6algorithms.loc[(height_6algorithms['BEAM'] > 4)
                                            &   (height_6algorithms['qc_a1'] == 1)
                                            &   (height_6algorithms['qc_a2'] == 1)
                                            &   (height_6algorithms['qc_a3'] == 1)
                                            &   (height_6algorithms['qc_a4'] == 1)
                                            &   (height_6algorithms['qc_a5'] == 1)
                                            &   (height_6algorithms['qc_a6'] == 1)
                                            &   (height_6algorithms['se_a1'] > 0.95)
                                            &   (height_6algorithms['se_a2'] > 0.95)
                                            &   (height_6algorithms['se_a3'] > 0.95)
                                            &   (height_6algorithms['se_a4'] > 0.95)
                                            &   (height_6algorithms['se_a5'] > 0.95)
                                            &   (height_6algorithms['se_a6'] > 0.95)
                                            &   (height_6algorithms['deg_fg'] == 0)
                                            &   (height_6algorithms['solar_ele'] < 0)]
[59]:
height_6algorithms_sel
[59]:
ID X Y a1_95 a2_95 a3_95 a4_95 a5_95 a6_95 min_rh_95 max_rh_95 BEAM digital_elev elev_low qc_a1 qc_a2 qc_a3 qc_a4 qc_a5 qc_a6 se_a1 se_a2 se_a3 se_a4 se_a5 se_a6 deg_fg solar_ele
7 8 6.050019 49.921613 3303 3288 3296 3236 3857 3292 3236 3857 7 320.0 297.68533 1 1 1 1 1 1 0.971 0.988 0.976 0.971 0.992 0.984 0 -33.9
11 12 6.050039 47.995344 2762 2736 2740 2747 3893 2736 2736 3893 5 390.0 368.55121 1 1 1 1 1 1 0.975 0.990 0.979 0.975 0.994 0.987 0 -37.3
14 15 6.050046 49.865317 1398 2505 2509 1316 2848 2505 1316 2848 6 340.0 330.40564 1 1 1 1 1 1 0.973 0.990 0.979 0.973 0.994 0.986 0 -18.2
15 16 6.050048 49.050020 984 943 947 958 2617 947 943 2617 6 300.0 291.22598 1 1 1 1 1 1 0.978 0.991 0.982 0.978 0.995 0.988 0 -35.4
16 17 6.050049 48.391359 3362 3332 3336 3351 4467 3336 3332 4467 5 530.0 504.78122 1 1 1 1 1 1 0.973 0.988 0.977 0.973 0.992 0.984 0 -5.1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1267207 1267208 9.949829 49.216272 2160 2816 2816 2104 3299 2816 2104 3299 8 420.0 386.44556 1 1 1 1 1 1 0.980 0.993 0.984 0.980 0.995 0.989 0 -16.9
1267211 1267212 9.949856 49.881190 3190 3179 3179 3171 3822 3179 3171 3822 6 380.0 363.69348 1 1 1 1 1 1 0.968 0.986 0.974 0.968 0.990 0.982 0 -35.1
1267216 1267217 9.949880 49.873435 2061 2828 2046 2024 2828 2828 2024 2828 7 380.0 361.06812 1 1 1 1 1 1 0.967 0.988 0.974 0.967 0.993 0.983 0 -35.1
1267227 1267228 9.949958 49.127182 366 2307 1260 355 3531 2719 355 3531 6 500.0 493.52792 1 1 1 1 1 1 0.973 0.989 0.978 0.973 0.993 0.985 0 -36.0
1267237 1267238 9.949999 49.936763 2513 2490 2490 2494 2490 2490 2490 2513 5 360.0 346.54227 1 1 1 1 1 1 0.968 0.988 0.974 0.968 0.993 0.983 0 -32.2

226892 rows × 28 columns

Calculate the mean height excluding the maximum and minimum values

[60]:
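# the sum of the 6 algorithm heights minus the min and max leaves 4 values (in cm);
# dividing by 4 * 100 = 400 gives their mean in metres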
height_sel =  pd.DataFrame({'ID' : height_6algorithms_sel['ID'] ,
                            'hm_sel': (height_6algorithms_sel['a1_95'] + height_6algorithms_sel['a2_95'] + height_6algorithms_sel['a3_95'] + height_6algorithms_sel['a4_95']
                             + height_6algorithms_sel['a5_95'] + height_6algorithms_sel['a6_95'] - height_6algorithms_sel['min_rh_95'] - height_6algorithms_sel['max_rh_95']) / 400 } )
[61]:
height_sel
[61]:
ID hm_sel
7 8 32.9475
11 12 27.4625
14 15 22.2925
15 16 9.5900
16 17 33.4625
... ... ...
1267207 1267208 26.5200
1267211 1267212 31.8175
1267216 1267217 24.4075
1267227 1267228 16.6300
1267237 1267238 24.9100

226892 rows × 2 columns

Merge the new height with the predictors table, using the ID as primary key

[62]:
predictors_hm_sel = pd.merge( predictors ,  height_sel , left_on='ID' ,  right_on='ID' ,  how='right')
[63]:
predictors_hm_sel.head(6)
[63]:
ID X Y h BLDFIE_WeigAver CECSOL_WeigAver CHELSA_bio18 CHELSA_bio4 convergence cti devmagnitude eastness elev forestheight glad_ard_SVVI_max glad_ard_SVVI_med glad_ard_SVVI_min northness ORCDRC_WeigAver outlet_dist_dw_basin SBIO3_Isothermality_5_15cm SBIO4_Temperature_Seasonality_5_15cm treecover hm_sel
0 8 6.050019 49.921613 3294.75 1490 12 1995 5912 22.102139 -297770784 -1.402633 0.309765 294.927765 26 -86.729492 -145.584229 -190.062988 0.222435 15 772784 20.855963 457.195404 86 32.9475
1 12 6.050039 47.995344 2746.25 1523 12 2612 6181 3.549103 -71279992 0.507727 -0.021408 322.920227 26 660.006104 92.722168 190.979736 -0.034787 16 784807 20.798000 460.501221 97 27.4625
2 15 6.050046 49.865317 2229.25 1517 13 2191 5901 31.054762 -186807440 -1.375050 -0.126880 291.412537 7 1028.385498 915.806396 841.586182 0.024677 16 766444 19.941267 454.185089 54 22.2925
3 16 6.050048 49.050020 959.00 1526 14 2081 6100 9.933455 -183562672 -0.382834 0.086874 246.288010 24 -12.283691 -58.179199 174.205566 0.094175 10 805730 19.849365 470.946533 78 9.5900
4 17 6.050049 48.391359 3346.25 1489 19 2486 5966 -6.957157 -273522688 2.989759 0.214769 474.409088 24 125.583008 6.154297 128.129150 0.017164 15 950190 21.179420 491.398376 85 33.4625
5 19 6.050053 49.877876 529.00 1531 12 2184 5915 -24.278454 -377335296 0.265329 -0.248356 335.534760 25 593.601074 228.712402 315.298340 -0.127365 17 764713 19.760756 448.580811 96 5.2900
[64]:
predictors_hm_sel = predictors_hm_sel.loc[(predictors_hm_sel['h'] < 7000)].sample(100000)  # filter on the merged table itself to keep indices aligned
[65]:
predictors_hm_sel
[65]:
ID X Y h BLDFIE_WeigAver CECSOL_WeigAver CHELSA_bio18 CHELSA_bio4 convergence cti devmagnitude eastness elev forestheight glad_ard_SVVI_max glad_ard_SVVI_med glad_ard_SVVI_min northness ORCDRC_WeigAver outlet_dist_dw_basin SBIO3_Isothermality_5_15cm SBIO4_Temperature_Seasonality_5_15cm treecover hm_sel
67109 358075 7.121301 48.393362 3127.00 1390 14 3278 6149 -18.622808 -349385216 1.205495 0.221449 628.071411 28 16.379883 -1.361572 -254.569824 0.122928 30 870714 21.026922 455.862976 89 31.2700
49443 271106 6.915753 48.398865 2755.00 1403 16 2880 6227 -10.780100 -364243424 0.597950 0.212738 479.239380 28 94.056396 13.479736 -101.955322 0.052124 26 956659 20.500694 460.670288 85 27.5500
225998 1262978 9.924864 49.878408 2404.25 1497 14 2044 6528 -5.170033 -260566736 1.070872 -0.060704 322.041046 20 256.728516 15.479248 -28.410645 0.124602 8 855708 19.397581 507.361115 78 24.0425
46927 257606 6.880155 49.368619 2322.00 1514 15 2292 6233 11.174547 -242390336 -0.602614 -0.084595 268.367157 20 236.149902 236.915039 371.205322 -0.050283 11 769062 21.441999 496.682587 58 23.2200
27199 146987 6.582437 49.476879 2464.25 1521 14 2117 6219 -6.374312 -234200800 -1.728040 0.231585 182.034286 23 73.248047 19.785400 249.853027 0.019138 11 731353 20.218201 483.040649 93 24.6425
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
25984 139170 6.551567 49.175072 2125.00 1511 17 2252 6015 -14.249553 -248558160 2.698703 -0.056752 371.123718 25 141.861816 2.110840 86.527588 0.024905 12 800114 20.031235 479.971527 85 21.2500
46950 257777 6.880572 49.535945 231.00 1519 14 2161 6189 32.561085 89643480 -1.137112 -0.001126 286.474335 0 937.622314 743.368652 405.727051 -0.008619 10 787605 21.355637 486.371185 4 2.3100
160836 912341 8.879877 48.959728 1810.50 1518 15 2200 6501 -45.779213 -249658704 0.581574 0.009894 277.630981 23 321.381836 71.376465 304.113281 -0.130948 15 832408 18.087175 476.243561 99 18.1050
100900 558839 7.634858 49.360323 2422.75 1533 11 2377 6304 -23.222776 -286424256 0.680885 0.017598 367.074890 25 -84.395996 -98.640869 22.901367 -0.179092 10 895793 18.074240 468.273102 80 24.2275
67155 358294 7.121821 49.908679 3338.00 1535 11 1946 6144 -16.786621 -319256448 0.482177 0.128489 345.215790 26 -161.289062 -201.367188 -307.723389 -0.012693 6 591350 18.666250 450.020935 87 33.3800

100000 rows × 24 columns

[66]:
X = predictors_hm_sel.iloc[:,[4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22]].values
Y = predictors_hm_sel.iloc[:,23:24].values
feat = predictors_hm_sel.iloc[:,[4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22]].columns.values
[69]:
feat
[69]:
array(['BLDFIE_WeigAver', 'CECSOL_WeigAver', 'CHELSA_bio18',
       'CHELSA_bio4', 'convergence', 'cti', 'devmagnitude', 'eastness',
       'elev', 'glad_ard_SVVI_max', 'glad_ard_SVVI_med',
       'glad_ard_SVVI_min', 'northness', 'ORCDRC_WeigAver',
       'outlet_dist_dw_basin', 'SBIO3_Isothermality_5_15cm',
       'SBIO4_Temperature_Seasonality_5_15cm', 'treecover'], dtype=object)
[70]:
print("RESPONSE VARIABLE DIMENSION:", Y.shape)
RESPONSE VARIABLE DIMENSION: (100000, 1)
[71]:
print("PREDICTING FEATURES DIMENSION:", X.shape)
PREDICTING FEATURES DIMENSION: (100000, 18)
[72]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, random_state=24)
y_train = np.ravel(Y_train)
y_test = np.ravel(Y_test)

Model run using default parameters and quality-flag-selected data

Training Random Forest using default parameters

[73]:
rfReg = RandomForestRegressor(min_samples_leaf=20, oob_score=True)  # min_samples_leaf raised from the default 1 to limit overfitting
rfReg.fit(X_train, y_train);
dic_pred = {}
dic_pred['train'] = rfReg.predict(X_train)
dic_pred['test'] = rfReg.predict(X_test)
[74]:
# checking the oob score
print("OUT OF BAG SCORE:", rfReg.oob_score_)
OUT OF BAG SCORE: 0.37252902906812235
[75]:
pearsonr_all_sel = [pearsonr(dic_pred['train'],y_train)[0],pearsonr(dic_pred['test'],y_test)[0]]
print("PEARSON CORRELATION COEFFINCENT:", pearsonr_all_sel)
PEARSON CORRELATION COEFFINCENT: [0.746007848830619, 0.6125852590365624]
[85]:
plt.rcParams["figure.figsize"] = (8,6)
plt.scatter(y_train,dic_pred['train'])
plt.xlabel('Training GEDI Height (all dataset)\n (m)')
plt.ylabel('Training Tree Height prediction\n (m)')
ident = [0, 50]
plt.plot(ident,ident,'r--')
[85]:
[<matplotlib.lines.Line2D at 0x712ff41c05c0>]
../_images/CASESTUDY_Tree_Height_03RF_pred_SM_71_1.png
[87]:
plt.rcParams["figure.figsize"] = (8,6)
plt.scatter(y_test,dic_pred['test'])
plt.xlabel('Testing GEDI Height (all dataset)\n (m)')
plt.ylabel('Testing Tree Height prediction\n (m)')
ident = [0, 50]
plt.plot(ident,ident,'r--')
[87]:
[<matplotlib.lines.Line2D at 0x712ff3e5f7d0>]
../_images/CASESTUDY_Tree_Height_03RF_pred_SM_72_1.png
[88]:
# per-feature importance, with its std across trees (axis=0: one value per feature)
impt = [rfReg.feature_importances_, np.std([tree.feature_importances_ for tree in rfReg.estimators_], axis=0)]
ind = np.argsort(impt[0])
[89]:
ind
[89]:
array([ 1, 13,  0,  4, 16,  6,  8, 15,  5,  2, 11,  3, 12,  9, 14,  7, 10,
       17])
[91]:
plt.rcParams["figure.figsize"] = (6,10)
plt.barh(range(len(feat)),impt[0][ind],color="b", xerr=impt[1][ind], align="center")
plt.yticks(range(len(feat)),feat[ind]);
../_images/CASESTUDY_Tree_Height_03RF_pred_SM_75_0.png

Model Final Assessment

We can compare the published model performance with ours.

Global Forest Canopy Height, 2019

[78]:
print("PEARSON CORRELATION INDEX:", pearsonr_Publication_Estimation)
PEARSON CORRELATION INDEX: 0.4527925130004339

Random Forest Model [100k GEDI subset] [fine tuning=NO] [data selection=NO]

[81]:
print("PEARSON CORRELATION COEFFINCENT:", pearsonr_all)
PEARSON CORRELATION COEFFINCENT: [0.6035710046325166, 0.506707480767017]

Random Forest Model [100k GEDI subset] [fine tuning=NO] [data selection=YES]

[83]:
print("PEARSON CORRELATION COEFFINCENT:", pearsonr_all_sel)
PEARSON CORRELATION COEFFINCENT: [0.746007848830619, 0.6125852590365624]
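
Pearson correlation measures linear association only; as a complementary assessment one could also compute error metrics on the test split (a sketch, assuming y_test and dic_pred from the selected-data run above):

from sklearn.metrics import mean_squared_error, mean_absolute_error

rmse = np.sqrt(mean_squared_error(y_test, dic_pred['test']))
mae = mean_absolute_error(y_test, dic_pred['test'])
print("RMSE (m):", rmse, " MAE (m):", mae)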

If we want to improve the selected-data model even more, we can again perform a GridSearchCV hyperparameter fine-tuning (again not run directly here):

pipeline = Pipeline([('rf',RandomForestRegressor())])

parameters = {
    'rf__max_features':("log2","sqrt",0.33),
    'rf__max_samples':(0.5,0.6,0.7),
    'rf__n_estimators':(500,1000),
    'rf__max_depth':(50,100,200)}
grid_search = GridSearchCV(pipeline,parameters,n_jobs=-1,cv=3,scoring='r2',verbose=1)
grid_search.fit(X_train,y_train)

grid_search.best_score_

print ('Best Training score: %0.3f' % grid_search.best_score_)
print ('Optimal parameters:')
best_par = grid_search.best_estimator_.get_params()
for par_name in sorted(parameters.keys()):
    print ('\t%s: %r' % (par_name, best_par[par_name]))

# refit with a chosen parameter combination (e.g. the grid search optimum)
rfReg = RandomForestRegressor(n_estimators=500, max_features='sqrt', max_depth=50, max_samples=0.6, n_jobs=-1, random_state=24)
rfReg.fit(X_train, y_train);
dic_pred = {}
dic_pred['train'] = rfReg.predict(X_train)
dic_pred['test'] = rfReg.predict(X_test)
[pearsonr(dic_pred['train'],y_train)[0],pearsonr(dic_pred['test'],y_test)[0]]

Model Prediction on Raster

Predict on the raster using pyspatialml (deprecated library)

# import Global Landsat Analysis Ready Data
glad_ard_SVVI_min = "tree_height/geodata_raster/glad_ard_SVVI_min.tif"
glad_ard_SVVI_med = "tree_height/geodata_raster/glad_ard_SVVI_med.tif"
glad_ard_SVVI_max = "tree_height/geodata_raster/glad_ard_SVVI_max.tif"

# import climate data
CHELSA_bio4 = "tree_height/geodata_raster/CHELSA_bio4.tif"
CHELSA_bio18 = "tree_height/geodata_raster/CHELSA_bio18.tif"

# import soil data
BLDFIE_WeigAver = "tree_height/geodata_raster/BLDFIE_WeigAver.tif"
CECSOL_WeigAver = "tree_height/geodata_raster/CECSOL_WeigAver.tif"
ORCDRC_WeigAver = "tree_height/geodata_raster/ORCDRC_WeigAver.tif"

# import geomorphological data (DEM derived)
elev = "tree_height/geodata_raster/elev.tif"
convergence = "tree_height/geodata_raster/convergence.tif"
northness = "tree_height/geodata_raster/northness.tif"
eastness = "tree_height/geodata_raster/eastness.tif"
devmagnitude = "tree_height/geodata_raster/dev-magnitude.tif"

# import hydrographic data (from hydrography90m)
cti = "tree_height/geodata_raster/cti.tif"
outlet_dist_dw_basin = "tree_height/geodata_raster/outlet_dist_dw_basin.tif"

# import soil climate data

SBIO3_Isothermality_5_15cm = "tree_height/geodata_raster/SBIO3_Isothermality_5_15cm.tif"
SBIO4_Temperature_Seasonality_5_15cm = "tree_height/geodata_raster/SBIO4_Temperature_Seasonality_5_15cm.tif"

# import forest cover data

treecover = "tree_height/geodata_raster/treecover.tif"

from pyspatialml import Raster
predictors_rasters = [glad_ard_SVVI_min, glad_ard_SVVI_med, glad_ard_SVVI_max,
                      CHELSA_bio4,CHELSA_bio18,
                      BLDFIE_WeigAver,CECSOL_WeigAver,ORCDRC_WeigAver,
                      elev,convergence,northness,eastness,devmagnitude,cti,outlet_dist_dw_basin,
                      SBIO3_Isothermality_5_15cm,SBIO4_Temperature_Seasonality_5_15cm,treecover]

stack = Raster(predictors_rasters)
result = stack.predict(estimator=rfReg, dtype='int16', nodata=-1)
[53]:
# plot regression result
plt.rcParams["figure.figsize"] = (12,12)
result.iloc[0].cmap = "plasma"
result.plot()
plt.show()
../_images/CASESTUDY_Tree_Height_03RF_pred_SM_90_0.png

Predict on the raster using xarray and rasterio (Full size 90m res raster prediction >12h)

import xarray as xr
import rasterio
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Define paths to raster files
raster_paths = {
    'glad_ard_SVVI_min': "tree_height/geodata_raster/glad_ard_SVVI_min.tif",
    'glad_ard_SVVI_med': "tree_height/geodata_raster/glad_ard_SVVI_med.tif",
    'glad_ard_SVVI_max': "tree_height/geodata_raster/glad_ard_SVVI_max.tif",
    'CHELSA_bio4': "tree_height/geodata_raster/CHELSA_bio4.tif",
    'CHELSA_bio18': "tree_height/geodata_raster/CHELSA_bio18.tif",
    'BLDFIE_WeigAver': "tree_height/geodata_raster/BLDFIE_WeigAver.tif",
    'CECSOL_WeigAver': "tree_height/geodata_raster/CECSOL_WeigAver.tif",
    'ORCDRC_WeigAver': "tree_height/geodata_raster/ORCDRC_WeigAver.tif",
    'elev': "tree_height/geodata_raster/elev.tif",
    'convergence': "tree_height/geodata_raster/convergence.tif",
    'northness': "tree_height/geodata_raster/northness.tif",
    'eastness': "tree_height/geodata_raster/eastness.tif",
    'devmagnitude': "tree_height/geodata_raster/dev-magnitude.tif",
    'cti': "tree_height/geodata_raster/cti.tif",
    'outlet_dist_dw_basin': "tree_height/geodata_raster/outlet_dist_dw_basin.tif",
    'SBIO3_Isothermality_5_15cm': "tree_height/geodata_raster/SBIO3_Isothermality_5_15cm.tif",
    'SBIO4_Temperature_Seasonality_5_15cm': "tree_height/geodata_raster/SBIO4_Temperature_Seasonality_5_15cm.tif",
    'treecover': "tree_height/geodata_raster/treecover.tif"
}

# Load raster files into an xarray Dataset
raster_datasets = []
for name, path in raster_paths.items():
    with rasterio.open(path) as src:
        # Read raster data and convert to xarray DataArray
        data = src.read(1)  # Read the first band
        coords = {
            'y': np.linspace(src.bounds.top, src.bounds.bottom, src.height),
            'x': np.linspace(src.bounds.left, src.bounds.right, src.width)
        }
        da = xr.DataArray(data, coords=coords, dims=['y', 'x'], name=name)
        raster_datasets.append(da)

# Combine all DataArrays into a single Dataset
stack = xr.merge(raster_datasets)

# Prepare data for prediction
# Convert the stack to a 2D array where each row is a pixel and columns are features
# (the feature order must match the column order used to train rfReg)
feature_names = list(raster_paths.keys())
data_array = stack.to_array(dim='variable').transpose('y', 'x', 'variable')
data_2d = data_array.values.reshape(-1, len(feature_names))

# Create a mask for valid data (exclude pixels where any predictor is NaN)
valid_mask = ~np.any(np.isnan(data_2d), axis=1)
valid_data = data_2d[valid_mask]

# Predict tree heights for valid pixels using the trained RandomForestRegressor (rfReg)
predictions = np.full(data_2d.shape[0], -1, dtype=np.int16)  # Initialize with nodata value
predictions[valid_mask] = rfReg.predict(valid_data).astype(np.int16)

# Reshape predictions back to the original raster shape
pred_2d = predictions.reshape(data_array.shape[0], data_array.shape[1])

# Save the prediction as a new raster file
output_path = "tree_height/geodata_raster/predicted_tree_height.tif"
with rasterio.open(
    list(raster_paths.values())[0],  # Use the first raster as a template for metadata
) as src:
    profile = src.profile
    profile.update(dtype=rasterio.int16, nodata=-1)

with rasterio.open(output_path, 'w', **profile) as dst:
    dst.write(pred_2d, 1)

# Plot the result using xarray
ds_pred = xr.DataArray(pred_2d, coords={'y': stack.y, 'x': stack.x}, dims=['y', 'x'], name='predicted_height')
ds_pred.plot(cmap='plasma', figsize=(12, 12))
plt.title("Predicted Tree Height")
plt.show()
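
Running the full-size prediction this way loads every raster entirely into memory. A more memory-friendly variant (a sketch with a hypothetical output filename, assuming raster_paths and the fitted rfReg from above) tiles the rasters into windows and predicts block by block:

import numpy as np
import rasterio
from rasterio.windows import Window

# use the first raster as a template for the output profile
template = list(raster_paths.values())[0]
with rasterio.open(template) as src:
    profile = src.profile
    profile.update(dtype=rasterio.int16, nodata=-1)
    height, width = src.height, src.width

srcs = [rasterio.open(p) for p in raster_paths.values()]  # order must match the training feature order
tile = 1024
with rasterio.open("tree_height/geodata_raster/predicted_tree_height_tiled.tif", 'w', **profile) as dst:
    for row_off in range(0, height, tile):
        for col_off in range(0, width, tile):
            win = Window(col_off, row_off, min(tile, width - col_off), min(tile, height - row_off))
            # stack the window from every predictor, mapping each raster's NoData to NaN
            bands = []
            for s in srcs:
                arr = s.read(1, window=win).astype('float64')
                if s.nodata is not None:
                    arr[arr == s.nodata] = np.nan
                bands.append(arr)
            block = np.stack(bands, axis=-1)
            flat = block.reshape(-1, block.shape[-1])
            valid = ~np.any(np.isnan(flat), axis=1)
            out = np.full(flat.shape[0], -1, dtype=np.int16)
            if valid.any():
                out[valid] = rfReg.predict(flat[valid]).astype(np.int16)
            dst.write(out.reshape(block.shape[0], block.shape[1]), 1, window=win)
for s in srcs:
    s.close()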

Predict on the raster using xarray and rasterio (1000x1000 UL corner tile for 90m res raster prediction)

[92]:
# Define paths to raster files
raster_paths = {
    # import Global Landsat Analysis Ready Data
    'glad_ard_SVVI_min': "tree_height/geodata_raster/glad_ard_SVVI_min.tif",
    'glad_ard_SVVI_med': "tree_height/geodata_raster/glad_ard_SVVI_med.tif",
    'glad_ard_SVVI_max': "tree_height/geodata_raster/glad_ard_SVVI_max.tif",
    # import climate data
    'CHELSA_bio4': "tree_height/geodata_raster/CHELSA_bio4.tif",
    'CHELSA_bio18': "tree_height/geodata_raster/CHELSA_bio18.tif",
    # import soil data
    'BLDFIE_WeigAver': "tree_height/geodata_raster/BLDFIE_WeigAver.tif",
    'CECSOL_WeigAver': "tree_height/geodata_raster/CECSOL_WeigAver.tif",
    'ORCDRC_WeigAver': "tree_height/geodata_raster/ORCDRC_WeigAver.tif",
    # import geomorphological data (DEM derived)
    'elev': "tree_height/geodata_raster/elev.tif",
    'convergence': "tree_height/geodata_raster/convergence.tif",
    'northness': "tree_height/geodata_raster/northness.tif",
    'eastness': "tree_height/geodata_raster/eastness.tif",
    'devmagnitude': "tree_height/geodata_raster/dev-magnitude.tif",
    # import hydrographic data (from hydrography90m)
    'cti': "tree_height/geodata_raster/cti.tif",
    'outlet_dist_dw_basin': "tree_height/geodata_raster/outlet_dist_dw_basin.tif",
    # import soil climate data
    'SBIO3_Isothermality_5_15cm': "tree_height/geodata_raster/SBIO3_Isothermality_5_15cm.tif",
    'SBIO4_Temperature_Seasonality_5_15cm': "tree_height/geodata_raster/SBIO4_Temperature_Seasonality_5_15cm.tif",
    # import forest cover data
    'treecover': "tree_height/geodata_raster/treecover.tif"
}
[93]:
import xarray as xr
import rasterio
import rasterio.windows
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Define a small window (1000x1000 UL corner) to avoid high time/computation demand
window = rasterio.windows.Window(0, 0, 1000, 1000)

# Load raster files into an xarray Dataset with NoData handling
raster_datasets = []
for name, path in raster_paths.items():
    with rasterio.open(path) as src:
        # Get NoData value from raster metadata
        nodata = src.nodata if src.nodata is not None else np.nan
        # Read data within the specified window
        data = src.read(1, window=window)

        # Replace raster-specific NoData values with np.nan for consistency
        if src.nodata is not None:
            data = np.where(data == nodata, np.nan, data)

        # Transform coordinates for the window
        transform = src.window_transform(window)
        rows, cols = np.arange(window.height), np.arange(window.width)
        xs, ys = np.meshgrid(cols, rows)
        xs_geo, ys_geo = rasterio.transform.xy(transform, ys, xs)
        x_coords = np.array(xs_geo[0])  # X coordinates for columns
        y_coords = np.array([row[0] for row in ys_geo])  # Y coordinates for rows

        # Create xarray DataArray
        da = xr.DataArray(
            data,
            coords={'y': y_coords, 'x': x_coords},
            dims=['y', 'x'],
            name=name,
            attrs={'nodata': np.nan}  # Store NoData as NaN in attributes
        )
        raster_datasets.append(da)

# Combine all DataArrays into a single Dataset
stack = xr.merge(raster_datasets)

# Prepare data for prediction
# Convert the stack to a 2D array where each row is a pixel and columns are features
feature_names = list(raster_paths.keys())
data_array = stack.to_array(dim='variable').transpose('y', 'x', 'variable')
data_2d = data_array.values.reshape(-1, len(feature_names))

# Create a mask for valid data (exclude pixels where any predictor is NaN)
valid_mask = ~np.any(np.isnan(data_2d), axis=1)
valid_data = data_2d[valid_mask]

# Predict tree heights for valid pixels using the trained RandomForestRegressor
predictions = np.full(data_2d.shape[0], -1, dtype=np.int16)  # Initialize with NoData value (-1)
if valid_data.size > 0:  # Ensure there is valid data to predict
    predictions[valid_mask] = rfReg.predict(valid_data).astype(np.int16)
else:
    print("Warning: No valid data (non-NoData pixels) available for prediction.")

# Reshape predictions back to the original raster shape
pred_2d = predictions.reshape(data_array.shape[0], data_array.shape[1])

# Save the prediction as a new raster file
output_path = "tree_height/geodata_raster/predicted_tree_height_light.tif"
with rasterio.open(list(raster_paths.values())[0]) as src:
    profile = src.profile
    profile.update(
        dtype=rasterio.int16,
        nodata=-1,  # Explicitly set NoData value
        width=window.width,
        height=window.height,
        transform=src.window_transform(window)
    )

with rasterio.open(output_path, 'w', **profile) as dst:
    dst.write(pred_2d, 1)

# Plot the result using xarray with NoData masking
ds_pred = xr.DataArray(
    pred_2d,
    coords={'y': stack.y, 'x': stack.x},
    dims=['y', 'x'],
    name='predicted_tree_height',
    attrs={'nodata': -1}
)

# Mask NoData values (-1) for visualization
ds_pred_masked = ds_pred.where(ds_pred != -1)

# Plot with a colorbar and appropriate scaling for tree heights
plt.figure(figsize=(15, 15))
ds_pred_masked.plot(cmap='viridis', cbar_kwargs={'label': 'Tree Height (m)'})
plt.title("Predicted Tree Height (1000x1000px, NoData masked)")
plt.xlabel("X (geographic)")
plt.ylabel("Y (geographic)")
plt.show()
../_images/CASESTUDY_Tree_Height_03RF_pred_SM_95_0.png