Running ECCO Adjoint and Optimization#
ECCO Ocean State Estimation uses an iterative optimization process to adjust control variables—including ocean surface forcing, mixing parameters, and initial conditions—to minimize the weighted sum of model–data misfits, called the cost function (J), in a least-squares sense. This iterative process involves tens of iterations. Each iteration includes one forward run to compute the model–data misfit (J), one adjoint run to compute the adjoint gradients (i.e., the sensitivity of J to the controls), and an optimization step that uses these gradients to estimate updated control adjustments. The typical steps for conducting multiple iterations, starting from initial control variables (called first-guess controls), are as follows:
1. Execute the forward model using a set of first-guess model control variables to compute the model–data misfits and the initial value of the cost function, J.
2. Run the adjoint model, forced by the model–data misfits from the forward simulation. Upon completion, the adjoint gradients of J with respect to the control variables are available; they will be used in step 3 to compute control adjustments.
3. Compute control adjustments (the optimization step): use the adjoint gradients and J from steps 1 and 2 to compute a set of adjustments to the control variables with the method of steepest descent.
4. Execute the forward model again using the updated control variables—i.e., the sum of the first-guess values (or those from the previous iteration) and the adjustments computed in step 3—and compute the new value of J. This step is equivalent to step 1, except that the control variables are no longer the first guess.
5. Run the adjoint model again (repeating step 2) to compute a new set of adjoint gradients.
6. Update the control adjustments (another optimization step): apply the adjoint gradients, updated control variables, and J from the previous two or more iterations (up to four iterations when producing ECCO V4 estimates, to limit memory usage) to calculate a new set of control adjustments using a quasi-Newton method, such as the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm. Apart from using L-BFGS instead of steepest descent, step 6 is essentially the same as step 3.
7. Repeat from step 4 using the latest control adjustments, and continue this cycle until J is sufficiently minimized. The final control adjustments, combined with the first-guess controls, constitute the optimized controls used in a forward simulation that produces the final ECCO release, such as ECCO Version 4 Release 4.
In practice, a forward run and an adjoint run, such as steps 1 and 2 (or steps 4 and 5), are often executed as a single model run that has both forward and adjoint modes. We have described how to conduct a forward simulation in Reproducing ECCO Version 4 Release 4. In this tutorial, we describe how to run the ECCO adjoint model (steps 2 and 5) and conduct optimization (steps 3 and 6).
Steps 3 and 6 conduct a line search that identifies a direction along which J can be reduced and then calculates a step size for adjusting the controls in order to reduce J. The two line-search methods, sketched in compact notation after this list, are:
- Steepest descent (step 3): a first-derivative method that uses the gradient as the search direction.
- Quasi-Newton method (step 6): a method that uses second-derivative (curvature) information and often achieves a faster reduction of J.
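The sketch below is notational only; the symbols (u for the controls, m_i and d_i for modeled and observed quantities, W_i for the misfit weights, alpha_k for the line-search step size, and B_k for the approximate Hessian at iteration k) are chosen here for illustration and are not taken from the ECCO source code.

% Cost function: weighted least-squares sum of model-data misfits
J(u) = \sum_i \left[ m_i(u) - d_i \right]^{\mathsf{T}} W_i \left[ m_i(u) - d_i \right]
% Steepest descent (step 3): the search direction is the negative gradient
u_{k+1} = u_k - \alpha_k \nabla J(u_k)
% Quasi-Newton, e.g., L-BFGS (step 6): curvature enters through an approximate inverse Hessian
u_{k+1} = u_k - \alpha_k B_k^{-1} \nabla J(u_k)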
Log in to P-Cluster#
Follow the same steps as in Reproducing ECCO Version 4 Release 4 to log in to the P-Cluster.
Modules#
As described in Reproducing ECCO Version 4 Release 4, the required modules should have been loaded automatically if users have updated their /home/USERNAME/.bashrc file with the example .bashrc (you may need to check out the latest version of the example .bashrc).
In addition to the modules listed in Reproducing ECCO Version 4 Release 4, users also need the module intel-oneapi-mkl-2021.2.0-gcc-11.1.0-idxgd2d for the optimization step. This module should also have been loaded automatically by /home/USERNAME/.bashrc if users have updated it with the latest version of the example .bashrc. Alternatively — though not recommended — the module can be loaded manually if it is not loaded automatically:
| Module Load Command |
|---|
| module load intel-oneapi-mkl-2021.2.0-gcc-11.1.0-idxgd2d |
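To verify that the module is available in your environment, a quick optional check (assuming the standard module command on the P-Cluster) is:
# Optional: confirm the MKL module is loaded (module list prints to stderr)
module list 2>&1 | grep -i mkl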
Code, Namelists, Input Files, and Scripts#
Code, Namelists, Input Files#
Note
Obtaining the code and namelist files, as well as creating the symbolic link to the input files, is the same as in Reproducing ECCO Version 4 Release 4. You may skip to Scripts and Optimization Code if you’ve already completed these steps and the files and symbolic link are still intact.
Following the instructions in Reproducing ECCO Version 4 Release 4, copy the code and namelist directories to /efs_ecco/USERNAME/r4/ (be sure to replace USERNAME with your actual username):
rsync -av /efs_ecco/ECCO/V4/r4/WORKINGDIR /efs_ecco/USERNAME/r4/
Then, change into /efs_ecco/USERNAME/r4/ and create a symbolic link input pointing to the input directory /efs_ecco/ECCO/V4/r4/input, as also described in Reproducing ECCO Version 4 Release 4, using the following commands:
cd /efs_ecco/USERNAME/r4/
ln -s /efs_ecco/ECCO/V4/r4/input .
This symbolic link will be used to access the input files in the example run script described below.
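As an optional sanity check, you can confirm that the link points to the shared input directory:
# Optional: verify the symbolic link created above
ls -ld input
# The output should end with: input -> /efs_ecco/ECCO/V4/r4/input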
Scripts and Optimization Code#
In addition, copy the scripts, lsopt, and optim directories from ECCO-v4-Configurations/ECCOv4 Release 4/ (the latter two are located under its optimization/ subdirectory) into ECCOV4/release4/. The scripts directory contains various useful scripts, while lsopt and optim contain the code required for the optimization step.
cd /efs_ecco/$USERNAME/r4/WORKINGDIR/
cp -r "ECCO-v4-Configurations/ECCOv4 Release 4/scripts/" ECCOV4/release4/
cp -r "ECCO-v4-Configurations/ECCOv4 Release 4/optimization/lsopt/" ECCOV4/release4/
cp -r "ECCO-v4-Configurations/ECCOv4 Release 4/optimization/optim/" ECCOV4/release4/
Directory Structure#
The directory structure under /efs_ecco/USERNAME/r4/ now looks like the following:
┌── WORKINGDIR
│   ├── ECCO-v4-Configurations
│   ├── ECCOV4
│   │   └── release4
│   │       ├── code
│   │       ├── namelist
│   │       ├── build
│   │       ├── run
│   │       ├── scripts
│   │       ├── lsopt
│   │       └── optim
│   └── MITgcm
└── input
Compile#
Compile Code for Adjoint Runs#
The commands to compile the code and generate the executable for an adjoint run are shown in the following code block:
cd WORKINGDIR/ECCOV4/release4
mkdir build_ad
cd build_ad
export ROOTDIR=../../../MITgcm
../../../MITgcm/tools/genmake2 -mods=../code -optfile=../code/linux_ifort_impi_aws_sysmodule -mpi
make depend
make adtaf
make adall
cd ..
Note that a new build directory, build_ad, has been created for adjoint runs, distinguishing it from the existing build directory used for forward runs.
The commands are similar to those used for reproducing ECCO V4r4, a forward simulation (see Reproducing ECCO Version 4 Release 4). However, there are some important differences, involving two commands: make adtaf and make adall.
- make adtaf sends the code to the TAF server to generate the adjoint code.
- make adall uses the TAF-generated adjoint code to build the executable used for conducting adjoint runs.
A successful compilation will generate the executable build_ad/mitgcmuv_ad. As stated earlier, in practice, an adjoint run contains both a forward mode and an adjoint mode. The forward mode computes the model–data misfits and their weighted sum, J. During the adjoint mode, the adjoint model—forced by the model–data misfits—computes the gradients of J with respect to the control variables.
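As an optional check that the build completed, confirm that the adjoint executable exists (run from the release4 directory):
# Optional: confirm the adjoint executable was built
ls -l build_ad/mitgcmuv_ad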
Compile Optimization Code#
The code for the optimization steps (steps 3 and 6) is compiled separately to generate its own executable. The commands to compile the optimization code are as follows:
cd WORKINGDIR/ECCOV4/release4
cd lsopt
make clean
make
cd ../optim
make clean
make
cd ..
A successful compilation will generate the executable optim/optim.x.
Conduct Iterations#
This section demonstrates how to run the ECCO adjoint model and optimization over multiple iterations using an automated script. We begin with a high-level overview of the workflow and output, followed by a detailed walkthrough of the script itself.
Overview of the Iteration Workflow#
Here, we use an example run script to describe the detailed steps for conducting adjoint runs and optimization. The script automates the iteration process by performing three iterations (iterations 0 to 2) over a short 3-day model integration, from 12Z on January 1, 1992, to 12Z on January 3, 1992. It uses the ECCO V4r4 configuration but starts from a cold start, in which all control adjustments are set to zero. The run is named v4r4_coldstart, and the total wall clock time for completing the three iterations is under 45 minutes.
The script is available on the P-Cluster at /efs_ecco/owang/r4/WORKINGDIR/ECCOV4/release4/run_script_slurm_autoopt_coldstart_v4r4.bash. Copy it to your working directory at /efs_ecco/USERNAME/r4/WORKINGDIR/ECCOV4/release4 (replace USERNAME with your actual username, but keep the directory structure the same). Then submit the script using sbatch with the following commands:
cd /efs_ecco/USERNAME/r4/WORKINGDIR/ECCOV4/release4
cp /efs_ecco/ECCO/V4/r4/scripts/run_script_slurm_autoopt_coldstart_v4r4.bash .
sbatch run_script_slurm_autoopt_coldstart_v4r4.bash
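After submitting the job, you can monitor its progress with standard Slurm commands and by following the log file named by the #SBATCH -o directive in the script (the job ID in the filename will differ from run to run):
# Check the job status in the queue
squeue -u $USER
# Follow the combined stdout/stderr log written by the script
tail -f ECCOv4r4_autoopt-*-out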
Upon completing the three iterations, five new directories will be generated by the script, as shown by the output of the following commands:
cd /efs_ecco/USERNAME/r4/WORKINGDIR/ECCOV4/release4
ls -1 | grep v4r4_coldstart
The output (order rearranged) shows five directories, along with the script run_script_slurm_autoopt_coldstart_v4r4.bash:
run_script_slurm_autoopt_coldstart_v4r4.bash
v4r4_coldstart.iter0
v4r4_coldstart.iter1
v4r4_coldstart.iter2
ctrlvec.v4r4_coldstart
optim.v4r4_coldstart
The three directories starting with v4r4_coldstart are the run directories for the three iterations. The other two directories, ctrlvec.v4r4_coldstart (hereinafter referred to as the ctrldir directory) and optim.v4r4_coldstart (hereinafter referred to as the optimdir directory), are used in the line search during the optimization step to calculate updated control adjustments for the next iteration.
Each run directory, such as v4r4_coldstart.iter0, contains the following files:
- ecco_ctrl_MIT_CE_000.opt0000: packed control adjustments (hereinafter referred to as the ecco_ctrl file).
- ecco_cost_MIT_CE_000.opt0000: packed adjoint gradients (hereinafter referred to as the ecco_cost file).
- costfunction0000: includes the total cost function J (called fc in the model) in the first line (e.g., fc = 2079042.47585259 0.0000000E+00), as well as individual cost values for different observation types.
The 4-digit number at the end of each filename corresponds to the iteration number.
The total cost function J, or fc, is the quantity that the iterative optimization process seeks to minimize. As shown in the table below, by iteration 2 it has been reduced to 0.618 of its iteration-0 value:
| Iteration Number | Cost (fc) | Cost Ratio w.r.t. Iteration 0 |
|---|---|---|
| 0 | 2079042.47585259 | 1.000 |
| 1 | 2070774.52785874 | 0.996 |
| 2 | 1284971.72606061 | 0.618 |
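The cost values in the table can be extracted directly from the costfunction files in the three run directories, for example:
cd /efs_ecco/USERNAME/r4/WORKINGDIR/ECCOV4/release4
# Print the total cost (fc) reported by each iteration's adjoint run
grep " fc = " v4r4_coldstart.iter*/costfunction*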
Walkthrough of the Example Run Script#
To further help users understand the iterative optimization process, a detailed explanation of the example run script is presented below.
Set Up Slurm Directives#
#!/bin/bash
#SBATCH -J ECCOv4r4_autoopt
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=36
#SBATCH --time=24:00:00
#SBATCH --exclusive
#SBATCH --partition=sealevel-c5n18xl-demand
#SBATCH --mem-per-cpu=1GB
#SBATCH -o ECCOv4r4_autoopt-%j-out
#SBATCH -e ECCOv4r4_autoopt-%j-out
Configure Shell Environment, Load Modules, and Set Up Environment Variables#
# Initialize and set up the environment
umask 022
ulimit -s unlimited
source /etc/profile
source /shared/spack/share/spack/setup-env.sh
source /usr/share/modules/init/sh
# Load required modules
module purge
module load intel-oneapi-compilers-2021.2.0-gcc-11.1.0-adt4bgf
module load intel-oneapi-mpi-2021.2.0-gcc-11.1.0-ibxno3u
module load netcdf-c-4.8.1-gcc-11.1.0-6so76nc
module load netcdf-fortran-4.5.3-gcc-11.1.0-d35hzyr
module load hdf5-1.10.7-gcc-9.4.0-vif4ht3
module load intel-oneapi-mkl-2021.2.0-gcc-11.1.0-idxgd2d
module list
# Set environment variables
export FORT_BUFFERED=1
export MPI_BUFS_PER_PROC=128
export MPI_DISPLAY_SETTINGS=""
Set Up Run-Specific Variables#
The script then sets up run-specific variables, such as the number of processors.
# Run-specific variables
nprocs=96
basedir="/efs_ecco/$USER/r4/WORKINGDIR/ECCOV4/release4/"
inputdir=../../../../input/
# Specify starting iteration number
whichiter=0
swhichiter=$(printf "%010d" ${whichiter})
# Specify the final iteration number (inclusive)
# For example, setting maxiter=2 will run iterations 0, 1, and 2
maxiter=$((whichiter + 2))
# Offset iteration: starting iteration index (usually 0).
# Do not change unless restarting iterations with a steepest descent line search.
offsetiter=0
runnm='v4r4_coldstart'
It uses 96 processes (nprocs=96) to conduct the runs under the directory /efs_ecco/$USER/r4/WORKINGDIR/ECCOV4/release4/ (basedir). The input files are located in /efs_ecco/$USER/r4/input/ (inputdir); symbolic links to these files will be created by the run script and accessed by the model. The example script starts at iteration 0 (whichiter=0) and ends at iteration 2 (with the final iteration, maxiter, set to 2). The run is named v4r4_coldstart.
The variable offsetiter is the starting iteration number for conducting a complete set of iterations (steps 1 to 7, as described at the beginning of the tutorial), and it is typically set to 0. Occasionally, we may want to restart a new set of iterations from step 1. This can happen, for example, if a new set of observations is added. In such cases, offsetiter should be set to the current iteration number, from which the new iteration cycle begins.
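For instance (illustrative values only, not used in this tutorial's cold-start run), restarting a fresh iteration cycle at iteration 10 after new observations are added might look like the following in the run script:
# Illustrative only: restart a new iteration cycle at iteration 10
whichiter=10
offsetiter=10              # the new cycle starts here, so the first pass skips optimization
maxiter=$((whichiter + 2)) # run iterations 10, 11, and 12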
Set Up Optimization#
Two directories are created for the optimization step. The ctrldir directory stores the control adjustments and adjoint gradients from previous iterations. The optimdir directory is used to run the optimization and generate the next ecco_ctrl file. The new ecco_ctrl file is saved to ctrldir and unpacked by the model to obtain the updated controls:
ctrldir=${basedir}/ctrlvec.${runnm}
optimdir=${basedir}/optim.${runnm}
mkdir -p "${ctrldir}" "${optimdir}"
The script then copies the namelists and the executable optim.x from the optim directory, where the executable was generated, to the optimization directory optimdir, where the control adjustments will be generated. Note that these namelists are specific to the optimization step and differ from those used for conducting forward or adjoint runs.
cp -p ${basedir}/optim/data* "$optimdir"
cp -p ${basedir}/optim/optim.x "$optimdir"
Loop Through Iterations#
The while block loops through the iterations and conducts an adjoint run for each iteration. However, depending on certain switches (see unpack and skip_optim below), the script may skip the optimization step—for example, in iteration 0.
while [ ${whichiter} -le ${maxiter} ]; do
# ... per-iteration optimization and adjoint-run steps (described below) ...
done
Switches for Skipping Optimization#
Inside the while block, the following section sets up two switches (unpack and skip_optim), as well as the previous iteration number (iterm1). If skip_optim is set to true, the optimization step is skipped. This is the case for iteration 0, where there is no need to generate an ecco_ctrl file because all control adjustments are zero.
The unpack switch serves a similar purpose—if unpack is set to 0, it means the control adjustments for individual control variables already exist, so there is no need to unpack an ecco_ctrl file, and the optimization step will be skipped.
# unpack=1: obtain control adjustments by unpacking ecco_ctrl
# unpack=0: no unpacking needed; individual control adjustment fields already exist
unpack=1
if [ ${whichiter} -eq ${offsetiter} ]; then
unpack=0
fi
# Previous iteration number
iterm1=$((whichiter - 1))
yiter=$(printf "%04d" ${whichiter})
yiterm1=$(printf "%04d" ${iterm1})
# Determine whether to skip optimization, i.e., skip generating ecco_ctrl for next iteration
if [ ${unpack} -eq 0 ] || ([ ${whichiter} -eq 0 ] && [ ${unpack} -eq 0 ]); then
skip_optim=true
else
skip_optim=false
fi
# Skip optimization if ecco_ctrl already exists
if [ -f ${ctrldir}/ecco_ctrl_MIT_CE_000.opt${yiter} ]; then
skip_optim=true
fi
Optimization Step#
The following if block is where the optimization is conducted to generate the ecco_ctrl file for the next iteration.
if [ "$skip_optim" = false ]; then
# Perform optimization to compute ecco_ctrl for the next iteration
cd "$optimdir"
# optimcycle
optimcycle=$((whichiter - (offsetiter + 1)))
nextcycle=$((optimcycle + 1))
yoptimcycle=$(printf "%04d" ${optimcycle})
ynextcycle=$(printf "%04d" ${nextcycle})
# Abort if required inputs are missing
if [ ! -f ${ctrldir}/ecco_ctrl_MIT_CE_000.opt${yiterm1} ] || \
[ ! -f ${ctrldir}/ecco_cost_MIT_CE_000.opt${yiterm1} ]; then
echo 'run aborted'
exit 1
fi
# Link previous iteration's ecco_ctrl and ecco_cost files
ln -s ${ctrldir}/ecco_ctrl_MIT_CE_000.opt${yiterm1} ecco_ctrl_MIT_CE_000.opt${yoptimcycle}
ln -s ${ctrldir}/ecco_cost_MIT_CE_000.opt${yiterm1} ecco_cost_MIT_CE_000.opt${yoptimcycle}
# Update data.optim: 1) optimcycle, 2) fmin (set once for steepest descent; fmin is computed automatically below)
sed -i "/optimcycle=/c\\ optimcycle=${optimcycle}," data.optim
if [ ${optimcycle} -eq 0 ]; then
# Steepest descent step.
# Abort if OPWARMI or OPWARMD already exists — remove them and resubmit the script.
# These files are only needed for the Quasi-Newton method.
if [ -f OPWARMI ] || [ -f OPWARMD ]; then
echo "Error: OPWARMI or OPWARMD already exists. Remove them and resubmit the script."
exit 99
fi
sed -i "/fmin=/c\\ fmin=${fmin}," data.optim
fi
cp data.optim data.optim_i${iterm1}
# Generate new ecco_ctrl
./optim.x > op_i${iterm1}
cp -f OPWARMI OPWARMI.${iterm1}
# Move and rename the new ecco_ctrl file to the ctrldir directory
mv ecco_ctrl*.opt${ynextcycle} ${ctrldir}/ecco_ctrl_MIT_CE_000.opt${yiter}
fi
In the code block above, the script first changes into the optimdir directory and sets the current optimization iteration (optimcycle), which is the number of iterations since offsetiter, as well as the next optimization iteration (nextcycle). It then checks that both the ecco_ctrl and ecco_cost files from the previous iteration exist; if either is missing, the script terminates, as both are needed for the line search.
Next, the script creates symbolic links to the ecco_ctrl and ecco_cost files in the ctrldir directory. After that, two important changes are made to the namelist file data.optim using the Linux stream editor sed:
- The optimcycle parameter is updated with the current optimcycle value.
- The fmin parameter is set to a value related to the cost reduction target, which is automatically estimated by the script (see below) based on the cost function f0 from iteration 0 (or offsetiter).
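After these edits, for the first optimization call of this cold-start run (optimcycle=0), the relevant entries of data.optim would look roughly like the sketch below. This is illustrative only: the fmin value shown is 0.998 times the iteration-0 cost listed earlier (about 2074884.39), and the actual data.optim provided in the optim directory may contain additional parameters.
&OPTIM
optimcycle=0,
fmin=2074884.39,
&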
In addition to these two sed commands, there is a safety check for iteration 0 (or offsetiter) that aborts the script if either OPWARMI or OPWARMD already exists. These two files contain information such as gradients from previous iterations and should not exist at iteration 0 (or offsetiter). They may be left over from a previous failed run and must be removed before restarting the iterations.
The cp command cp data.optim data.optim_i${iterm1} saves the current data.optim file for archival purposes. The same applies to the command cp -f OPWARMI OPWARMI.${iterm1}.
The optimization is actually performed by the executable in the following command:
./optim.x > op_i${iterm1}
After that, the generated ecco_ctrl file (named with the next optimization cycle number) is renamed and moved to the ctrldir directory, using the current iteration number in the filename. It will be loaded and unpacked to retrieve the control adjustments during the next iteration.
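To make the file numbering concrete, the comment block below traces iteration 1 of this cold-start run, following the variable definitions in the script above.
# Worked example for whichiter=1 with offsetiter=0:
#   yiter=0001, yiterm1=0000
#   optimcycle=1-(0+1)=0 -> yoptimcycle=0000; nextcycle=1 -> ynextcycle=0001
# optim.x reads the linked ecco_ctrl/ecco_cost *.opt0000 files (from ctrldir)
# and writes an ecco_ctrl*.opt0001 file, which is then moved and renamed to
# ${ctrldir}/ecco_ctrl_MIT_CE_000.opt0001 for use in the iteration-1 adjoint run.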
Conduct Adjoint Run#
The next code block conducts an adjoint run.
mkdir ${basedir}/${runnm}.iter${whichiter}
cd ${basedir}/${runnm}.iter${whichiter}
# Link input files from various sources
ln -s ../namelist/* .
ln -s ${inputdir}/input_init/error_weight/data_error/* .
ln -s ${inputdir}/input_init/* .
ln -s ${inputdir}/data_constraints/data_error/*/* .
ln -s ${inputdir}/data_constraints/*/* .
ln -s ${inputdir}/input_forcing/unadjusted/eccov4r4* .
ln -s ${inputdir}/input_forcing/other/*.bin .
ln -s ${inputdir}/input_forcing/control_weights/* .
ln -s ${inputdir}/input_forcing/control_weights/atm_ctrls/* .
ln -s ${inputdir}/native_grid_files/tile*.mitgrid .
python ../scripts/mkdir_subdir_diags.py
# Namelist setup
rm -f data
cp -p data.iter0.3d data
rm -f data.exf
cp -p data.exf.iter0 data.exf
rm -f data.gmredi
cp -p data.gmredi.iter0 data.gmredi
if [ ${whichiter} -eq 0 ]; then
rm -f data.ctrl
cp -p data.ctrl.iter0.inclatmctrl data.ctrl
elif [ ${whichiter} -eq ${offsetiter} ] && [ ${unpack} -eq 0 ]; then
rm -f data.ctrl
cp -p data.ctrl_itXX.inclatmctrl data.ctrl
else
rm -f data.ctrl
cp -p data.ctrl.unpack.inclatmctrl data.ctrl
# Copy ecco_ctrl file to run directory
cp -f ${ctrldir}/ecco_ctrl*${whichiter} .
fi
# Turn off the profiles package, as there are issues with using netCDF on the P-Cluster
unlink data.pkg
cp -p ../namelist/data.pkg .
sed -i '/useProfiles=.TRUE./ s/^/#/' data.pkg
# Create data.optim
rm -f data.optim
cat > data.optim <<EOF
&OPTIM
optimcycle=${whichiter},
&
EOF
# Run the model
cp -p ../build_ad/mitgcmuv_ad .
mpirun -np "${nprocs}" ./mitgcmuv_ad
The code block first creates a run directory, such as v4r4_coldstart.iter0, and changes into it. The script then creates symbolic links to the input files. The line python ../scripts/mkdir_subdir_diags.py creates the diags directory in the run directory, along with a list of subdirectories under diags, based on the information in the namelist file data.diagnostics. These are used by the model to output diagnostics, such as the model state, at user-specified time intervals.
Next, the script replaces some default ECCO V4r4 namelist files with versions specific to this experiment—a cold start of a 3-day model integration. For the namelist file data.ctrl, three different versions are used:
- One for iteration 0, where the model sets all control adjustments to zero.
- A second for iteration offsetiter, or when unpack=0, where pre-unpacked control adjustment files are used as input instead of unpacking an ecco_ctrl file.
- A third for all other cases, where an ecco_ctrl file is copied from ctrldir to the run directory and unpacked by the model to obtain the control adjustments.
Two additional namelist changes are made: temporarily disabling the use of profile data on the P-Cluster (due to issues with the netCDF modules) and setting the current iteration number in data.optim.
The mpirun command below then launches the executable mitgcmuv_ad as a multi-process (96-process) job:
mpirun -np "${nprocs}" ./mitgcmuv_ad
Post-Run Processing#
The remaining code block performs some post-run processing and prepares for the next iteration:
# Save cost and control outputs to ctrldir for optimization
rsync -av ecco_cost_MIT_CE_000.opt${yiter} ${ctrldir}
rsync -av ecco_ctrl_MIT_CE_000.opt${yiter} ${ctrldir}
# Compute fmin from costfunction0000 output (used in data.optim)
# If the target cost reduction is 0.4% relative to the cost from iteration offsetiter (f0),
# then fmin is set to (1 - 0.5 * 0.004) * f0 = 0.998 * f0
if [ ${whichiter} -eq 0 ] || [ ${whichiter} -eq ${offsetiter} ]; then
f0=$(grep " fc = " costfunction0000 | awk '{print $3}')
echo "To have 0.4% cost reduction, set fmin = 0.998 * f0"
fmin=$(echo "${f0} * 0.998" | bc)
echo "fmin: ${fmin}"
fi
cd ..
whichiter=$((whichiter + 1))
echo ${whichiter}
First, the ecco_ctrl and ecco_cost files are copied to the ctrldir directory, where they will be used by the optimization code to generate updated control adjustments for the next iteration.
Another important post-processing step for iteration 0 is to estimate fmin, a value closely related to the target cost reduction. The script extracts the total cost fc (referred to as f0) from the costfunction0000 file in the run directory for iteration 0 (v4r4_coldstart.iter0 in this case). The target cost reduction is set to 0.4% relative to f0, and accordingly, fmin is computed as 99.8% of f0. fmin is used during the line search in the steepest descent step to estimate a step size, which is expected to produce control adjustments that yield a cost reduction matching the target.
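Using the iteration-0 cost reported earlier in this tutorial, the computation in the script works out as follows:
# Worked example with the iteration-0 cost from this run
f0=2079042.47585259
fmin=$(echo "${f0} * 0.998" | bc)
echo "${fmin}"   # approximately 2074884.39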
A complete iteration is now finished. The iteration number whichiter
is incremented by 1, and the script proceeds to the next iteration.