Category Archives: C++

Large-Scale MOO Experiments with SHARK – Oracle Grid Engine

This post explains how to conduct large-scale MOO experiments with the SHARK machine learning library on clusters running Oracle grid engine.

An experiment consists of three phases:

  1. front approximation
  2. performance indicator calculation
  3. result accumulation and statistics calculation

Within this post, I’m going to focus on the first step.

Front Approximation

In this phases, the Pareto front approximations generated by applying multiple multi-objective evolutionary algorithms (MOEAs) to a set of objective functions are recorded.

Here, I assume that we want to evaluate the (µ+1)-MO-CMA-ES relying on the hypervolume indicator on the DTLZ suite of benchmark functions. A ready-to-use command-line application implementing the MO-CMA-ES is bundled with the default Shark installation. The executable is configurable via command-line arguments queryable by passing –help:

  --objectiveFunction arg
  --seed arg (=1)
  --storageInterval arg (=100)
  --searchSpaceDimension arg (=10)
  --maxNoEvaluations arg (=50000)
  --timeLimit arg (=1000)
  --fitnessLimit arg (=1e-10)
  --resultDir arg (=.)
  --algorithmConfigFile arg
  --objectiveSpaceDimension arg (=2)

That is, to execute the MO-CMA-ES for DTLZ2 with 3 objectives and terminating after 50000 objective function evaluations, the following call is required:

  SteadyStateMOCMAMain --objectiveFunction=DTLZ2 --objectiveSpaceDimension=3 --maxNoEvaluations=50000 

Note that we do not specify the rng seed explicitly but rely on the default value 1.

For the scenario considered here, we want to run several independent trials of one specific MOEA and one specific objective function in parallel. To this end, we rely on the array job feature of the grid engine and submit an array of 25 independent trials to the grid engine with the following command:

  qsub -N 'DTLZ2_3' -t 1-25 DTLZ2 /globally/known/path 3

Here, the script is defined as follows:

#$ -S /bin/bash
#$ -o /dev/null

SteadyStateMOCMAMain --seed $SGE_TASK_ID --resultDir=$2 --objectiveFunction=$1 --objectiveSpaceDimension=$3

In summary, the script takes care of actually running the algorithm and setting the seed to environment variable $SGE_TASK_ID. The variable is set by the grid engine to the unique job number and thus, we can ensure independent trials. There is one more thing to note: The result dir needs to be known across the whole cluster. Normally, your dev ops provide you with a scratch environment that is accessible from every computing node.

That’s it. Wait a few minutes until the experiment completes and stay tuned for the second post that explains how to evaluate the quality of the Pareto-front approximations.


Shark 3.x – Continuous Integration

Taken from the SHARK website:

SHARK is a modular C++ library for the design and optimization of adaptive systems. It provides methods for linear and nonlinear optimization, in particular evolutionary and gradient-based algorithms, kernel-based learning algorithms and neural networks, and various other machine learning techniques. SHARK serves as a toolbox to support real world applications as well as research in different domains of computational intelligence and machine learning. The sources are compatible with the following platforms: Windows, Solaris, MacOS X, and Linux.

The library has been in active development for over 10 years now and is in use by scientists all over the world. Last year, we, the core SHARK developers, decided that a rewrite of the library is necessary to support future use cases and provide a solid platform for users and contributors, alike. Our goals were simple:

  • Unify and simplify the library structure.
  • Rely on established components wherever feasible.
  • Documentation, documentation and again, documentation
  • Focus on quality.

In this post, I would like to dive a little deeper into the topic of quality and the processes that we established to ensure a constant and high level of quality. We decided to address quality both from a technical (read: testable) and from an API point of view.

In terms of API quality, we want the programming interface to be consistent, convenient to use and easy to extend. In equivalence to the user experience, we want potential developers to experience a welcoming and friendly environment. As we are a geographically distributed team of developers and scientists, we decided to go for a pre-commit code review approach implemented with the help of ReviewBoard. Despite initial concerns on behalf of the developers, the review process proved to be one of the most useful tools while rewriting the library with developers starting to like the final “Ship It” quickly.

In terms of “technical” quality, we decided to go for continous integration of all (reviewed) commits to the rewrite branch for all of our supported platforms. With the help of Jenkins and a bunch of virtual machines, we finally realized our idea of continous integration testing to prevent from regressions. Our unit test suite is implemented with the unit testing framework provided by boost. Test execution is handled by CTest. Static and dynamic analysis of the library is carried out with the help of cppcheck and valgrind, respectively. Code coverage metrics are calculated with the help of gcov. Finally, we are integrating all of the testing results in the job-specific views of our Jenkins instance, thereby providing developers a single source of information on the state of the library.