[Updated] Mir & XMir Performance

This is the first article in a series of blog posts on Mir’s and XMir’s performance. The idea is to provide further insights into the overall performance work, point out existing bottlenecks and how the team is addressing them.

Our overall goal for Mir and XMir is to provide an absolutely fluid user experience, both in the case of typical desktop usage and in the case of more demanding scenarios like 3D gaming. Moreover, our efforts to provide a fluid user experience on the desktop should have at most a minimal impact on overall 3D application performance.

During the last weeks and months, a lot of people have raised the question of whether, and to what degree, the introduction of a system-level compositor impacts graphical performance. The short answer is: yes, any additional layer between the GPU and the actual rendering process has an impact on the overall performance characteristics of the system. However, there are ways to avoid most of the overhead, and this blog post is the not-so-short answer to the initial question.

As its name implies, a compositor is responsible for taking multiple buffer streams or surfaces and assembling (a.k.a. compositing) a final image that is then scanned out to the connected monitors. In the general case, composition requires buffering of the final image, and it requires GPU resources to render the individual surfaces to the destination buffer in preparation for scanout. Here, the destination buffer is the framebuffer. The overhead of a system-level compositor can be summarized as this additional rendering step in the overall graphics pipeline, in exchange for the obvious benefit of being able to control the final output, enabling flicker-free boot, shutdown, resume, suspend and session switching.
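To make that cost concrete, here is a compilable toy sketch of the additional rendering step; all types and function names are invented for illustration and are not taken from Mir’s code base:

#include <vector>

// All types and functions below are illustrative stand-ins, not Mir API.
struct Surface { /* client buffer, geometry, opacity, ... */ };
struct Framebuffer { /* the scanout target */ };

void render_to(Framebuffer&, Surface const&) { /* GPU draw into the target */ }
void scanout(Framebuffer const&) { /* hand the buffer to the display */ }

// The additional rendering step a system-level compositor introduces:
// draw every visible surface into the destination buffer, then scan it out.
void composite_and_scanout(std::vector<Surface> const& surfaces, Framebuffer& fb)
{
    for (auto const& surface : surfaces)
        render_to(fb, surface); // costs GPU cycles for every surface
    scanout(fb);                // present the assembled image
}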

Both internally and externally, people have been measuring the overall performance impact of XMir as available from the archive today. Roughly speaking, people have been reporting a performance impact of ~20% in the Phoronix Test Suite, and the question becomes: how can we significantly decrease the impact in the specific case of XMir while still keeping all the aforementioned benefits in place?

The underlying idea to solve the issue is straightforward: if the compositor is clever enough, it can recognize situations where an opaque client surface covers a complete output (XMir matches exactly this configuration). In that case, composition can be avoided and the client can be handed the framebuffer as rendering target instead of the usual graphics memory buffer. Moreover, the server-side composition strategy can be smart and completely skip the final composition step, scanning out the framebuffer as soon as the client signals “done”. Luckily, Mir’s composition engine and associated buffer allocation/swapping infrastructure allow for implementing this behavior easily and transparently to the client.

The respective implementation has been living in https://code.launchpad.net/~vanvugt/mir/bypass for some time now, and we have been testing it in parallel to trunk. Our primary test and benchmarking platform was Intel, and we haven’t seen any issues with the patch on that platform. There is a graphical glitch present on ATI cards that we are actively working on. Nouveau gives us some headaches, as it is quite slow both on X and on XMir right now. However, we are confident that we won’t see any major issues in XMir once the underlying cause in the Nouveau driver is fixed.
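To make the bypass decision concrete, here is a minimal, self-contained C++ sketch of the idea; again, all names are invented for illustration and do not reflect Mir’s actual composition API:

#include <vector>

// Illustrative stand-ins; none of this is Mir's actual API.
struct Buffer {};

struct Surface
{
    bool opaque;
    bool covers_output_exactly; // fullscreen at the output's native resolution
    Buffer buffer;              // the buffer the client last rendered to
};

Buffer composited; // destination buffer of the regular composition path

Buffer& composite(std::vector<Surface> const& surfaces)
{
    // Regular path: render all surfaces into the destination buffer (GPU work).
    (void)surfaces;
    return composited;
}

void scanout(Buffer&) { /* hand the buffer to the display hardware */ }

// The bypass decision: a single opaque surface covering the whole output can
// be scanned out directly, skipping composition and one copy per frame.
void present(std::vector<Surface>& surfaces)
{
    if (surfaces.size() == 1 &&
        surfaces.front().opaque &&
        surfaces.front().covers_output_exactly)
    {
        scanout(surfaces.front().buffer); // composition bypassed
    }
    else
    {
        scanout(composite(surfaces));     // regular composition
    }
}

Note that the check has to be re-evaluated on every frame: as soon as any additional surface becomes visible, the compositor must fall back to the regular composition path.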

Results

Measuring graphical performance and developing meaningful benchmarks is a complex task in its own right. Luckily, we have some pretty capable tooling available in the open-source world. During development and evaluation of the bypass feature, we have been relying on selected test cases of the Phoronix Test Suite and on glmark2 to continuously evaluate performance gains and overall impact. We are going to publish the results across Intel, NVIDIA and AMD GPUs as part of our regular QA reporting at http://reports.qa.ubuntu.com/graphics/ as soon as we hit trunk. In summary, we are able to reduce XMir’s total overhead to ~6% on Nexuiz and OpenArena (see the section “Conclusions & Future Work” for reasons for, and approaches to, further reducing the remaining overhead). Please also note that we are actively investigating the results for the “QGears2: OpenGL + Image scaling” test case.

GLMark2 numbers are not yet reported via the public dashboard, but we are actively working on wiring them up as part of our daily quality efforts, too. However, the numbers are quite promising, as can be seen from this preview (Lenovo X220, i7 vPro, Intel(R) HD Graphics 3000):

[Figure: frames per second for X vs. XMir vs. XMir with bypass (glmark2)]

Conclusions & Future Work

Today, we are landing an important GPU-bound optimization for the XMir use-case with the bypass feature, and we see significant performance improvements in our benchmarking scenarios. Everyday users will hardly notice any difference in graphical performance, but they will notice a decrease in power usage on laptops, as the system compositor requires fewer GPU and CPU cycles to carry out its tasks.

However, this is only the first step, and we still see some overhead in the benchmarks. Our GLMark2 numbers for raw Mir, when compared to X as in Saucy today, suggest that we still have GPU-bound optimization potential to leverage in the XMir case. The unity-system-compositor is not the bottleneck in this specific scenario; we need to become more clever on the X side of things. In summary, we need to propagate the bypass approach further down into the X world and its clients, with X/Compiz handing out the raw buffer provided by Mir to fullscreen, opaque X clients. Luckily, Compiz already knows about the notion of composite bypass, too, and the remaining optimization potential lies mostly within X itself, by making it more aware of the fact that it now lives in a world of nested compositors. Quite likely, though, Mir will require adjustments, too, to expose composition bypass end-to-end in the XMir scenario. Stay tuned; we will keep you posted throughout this series of blog posts.

[Update] Michael of Phoronix found out that some games, when run in fullscreen mode but not at native resolution, do not benefit from composition bypass. As mentioned in one of the comments, we are now starting to investigate this sort of issue and will come back with updates once we have identified the root causes. At any rate: thanks for bringing it up; we will make sure that the respective benchmarks and configurations are covered by our benchmarking setup, too.

Running, Stopped, Killed? A user shouldn’t care

To put it straight: a user should be able to invoke an arbitrary number of applications and should neither experience a slow-down nor ever have to wonder whether an app is already running. The app is just there whenever it is invoked, and the system takes care of the rest.

When we started working on Ubuntu Touch roughly a year ago, our primary focus was enabling central HW components so that Ubuntu could run on common mobile form factors. However, after having reached the goal of leveraging HW acceleration for the UI, media decoding and access to the on-board sensors, we started thinking about the application model that we wanted to integrate deeply with the OS. From a user’s and a developer’s perspective, our primary goals are:

  • Provide a consistent application model that spans installation, execution and uninstallation of apps.
  • Ensure security at all stages and account for the fact that apps have to be considered potentially harmful.
  • Ensure a seamless multi-tasking experience that is transparent to the user and does not require the user to think in terms of running/not running.
  • Make the application model as easy to develop with as possible.

From the system’s point of view, our objectives are:

  • Integrate a well-defined confinement model deep within the system.
  • Enable the system to aggressively control the resource consumption of apps.
  • Enable a seamless transition to the converged world.

Every single objective listed before is a challenge in its own right. On top of that, they are interdependent and at times even conflicting. However, one of the most fundamental building blocks of the overall application model is the application lifecycle, and this blog post is dedicated to explaining both our lifecycle model and our policies.

From a User’s Perspective

A mobile device is an environment offering a limited set of computing resources, i.e., CPU cycles, main memory, GPU cycles, graphics memory and power. Running applications compete for these resources, and we have to assume that applications are greedy, trying to use as much of the available resources as possible. The user expects the system to ensure a fluid user experience while at the same time providing the longest possible battery life. On top of this, the user should not be required to carry out any sort of process-management task or to build a mental model of the different run states an app can be in. Ultimately, the application lifecycle should be completely transparent to the user, and multi-tasking should be seamless.

A solution to the problem needs to satisfy the following additional constraints:

  1. The application lifecycle model should be easy to develop with. That is, the changes to the well-known process state machine should be as small as possible, providing sensible and robust fallback behavior.
  2. As Ubuntu is working towards a converged world, the lifecycle model needs to be adaptable to a range of different scenarios: from mobile phones and tablets to classic desktop environments. The differences should be transparent to the developer, and both applications and the overall system need to be able to transition seamlessly from one use-case to the other. This is especially important when thinking about the Ubuntu Edge, with the phone becoming a full-featured desktop/laptop replacement when docked or connected to a large screen.

The Application Lifecycle Model

As noted earlier, one of our goals is the ability to define different lifecycle policies and swap them out dynamically at runtime to account for different usage scenarios. We want to minimize the impact on developers when moving to a converged world and make lifecycle policy changes and decisions transparent to users and developers alike. To this end, we clearly separate the application lifecycle model from the policies that the system executes on top of it. The model we are putting in place extends the well-known process state machine, as presented in the following diagram:

[Diagram: draft of the application lifecycle model]

The states are defined as:

  • Focused: The application is visible to the user and guaranteed to be running and provided with all necessary resources.
  • Unfocused: The application is not guaranteed to be running, i.e., it might not be granted CPU or GPU cycles and the policy is free to trigger a state transition to any of the not-running states. In the phone scenario, the app is not visible to the user.
  • Killed: The app’s process image has been removed from main memory.
  • Stopped: The app’s process is sigstop’d.
  • Stateless: The app’s process has been sigkill’d without prior state preservation. Serves as a way to reset an app’s state.

A “transparent” application lifecycle then translates to: ensure that applications are able to preserve (and subsequently recreate) their state before being transitioned to the “not running” meta-state. This is indicated by the dashed state transitions in the diagram. All of these transitions are preceded by a notification to the app that it is about to be stopped or killed, handing over an archive file that the app can serialize its state to during a grace period. After that, the app is actually transitioned to “not running”. When the app is resurrected, the system provides the archive back to the app, and the app recreates its previous state. The diagram also reveals an interesting aspect: as we want to enable lifecycle policies to kill a stopped app, an application needs to preserve its state even if it is only being sigstop’d.
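To illustrate what this handshake could look like from an app developer’s point of view, here is a hypothetical C++ sketch; the state names follow the diagram, but the interface itself is invented and not the actual platform API:

#include <string>

// Hypothetical sketch of the states and the save/restore handshake.
enum class AppState { Focused, Unfocused, Stopped, Killed, Stateless };

class Application
{
public:
    virtual ~Application() = default;

    // Invoked before any dashed transition to "not running": the app gets an
    // archive path and a grace period to serialize its state. Note that this
    // also happens before a mere sigstop, as a stopped app may later be
    // killed without ever being scheduled again.
    virtual void save_state(std::string const& archive_path) = 0;

    // Invoked on resurrection with the very same archive, allowing the app
    // to recreate its previous state.
    virtual void restore_state(std::string const& archive_path) = 0;
};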

Application Lifecycle Policies

Based on the application lifecycle model presented before, we can now start defining policies that trigger the state transitions. Classical desktop behavior can easily be expressed in this model, too: the current desktop lifecycle policy never automatically triggers a transition from the “running” to the “not running” state and does not limit the resources granted to an app when unfocused.

For version 1 on the phone (only considering the non-converged, standalone phone scenario) we are defining a very strict lifecycle policy. Non-focused apps are not guaranteed to stay in the “running” state and are transitioned to the “not running” state at the policy’s discretion to aggressively save resources. Today, we are already sigstop’ing app processes whenever they are unfocused, and we will go even further and kill unfocused apps when we detect memory pressure (even before the OOM killer potentially kicks in).
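In plain POSIX terms, the strict policy boils down to something like the following sketch (purely illustrative; the real policy lives in the system services, not in three free-standing functions):

#include <signal.h>
#include <sys/types.h>

void on_unfocused(pid_t app)
{
    kill(app, SIGSTOP); // freeze the process: no further CPU cycles are spent
}

void on_focused(pid_t app)
{
    kill(app, SIGCONT); // thaw the process right where it was frozen
}

void on_memory_pressure(pid_t app)
{
    // Reclaim memory before the kernel's OOM killer has to step in; the app
    // is expected to have preserved its state beforehand (see above).
    kill(app, SIGKILL);
}

The SIGKILL path is exactly what ties the policy back to the state-preservation requirement discussed above: a stopped app never gets a second chance to save its state.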

Why are we so strict? As a platform, we take on the responsibility of managing the scarce resources of a mobile device as efficiently as possible. We don’t want to leave it up to reviewers or users to identify and call out resource-hogging apps. However, this is only version 1, and we consciously decided on a very conservative approach that we can open up gradually going forward, as opposed to starting without a clear policy and taking away functionality over time.

Implications of a Strict Lifecycle Policy

Both internally and externally, a lot of discussion has been triggered by the strict lifecycle policy for version 1. We have discussed a multitude of different use-cases, and almost all of them are solvable by separating an app into an engine part and a UI part. First, separating application logic from the presentation layer is good practice in software design. Second, relying on an engine/background service allows apps to escape the lifecycle “trap” easily by dispatching work to an entity whose lifetime exceeds the app’s lifetime as dictated by our lifecycle policy. A nice side effect is easily testable code.

How does this work in version 1 of Ubuntu Touch? In summary, the system will provide a set of system services that cover the most prominent examples and use-cases identified in both external and internal discussions, e.g.:

  • Music playback in the background
  • Downloads happening in the background
  • Alarms/appointments

In this first version, apps will not be able to install their own background services/engines in the default setup. However, going forward, we will provide a mechanism for apps to hand their engine over to the system and have it executed in the background (with resource constraints in place, though).


[Updated] Mir – An outpost envisioned as a new home

Some time ago, Canonical started internal discussions about our convergence strategy, clearly spelling out the distant target of shaping and developing a single computing platform and operating system that is able to power the cloud, classic desktop machines, laptops, TV sets, phones and tablets (fridges, anyone?). Beyond that, we stated that we want a single mobile device, a bundle of portable computing power together with the respective operating system, that seamlessly adapts to different form-factors and use-cases. Or, to put it a little more catchily:

Your phone is your TV, is your desktop, is your … you name it.

And yes, we were aiming for the moon. We worked hard to draft a system that would allow us to reach the moon while at the same time catering to our other central goal of providing a beautiful and lean user experience. It became apparent in conversations and discussions that one of the cornerstones of the future system we were designing would be the graphics stack, including the Unity shell. After much discussion about existing solutions (X & Weston) and how we could leverage them, we took a step back and distilled our (technical) requirements for a future graphics stack:

  • Tailored towards an EGL/GL(ES) world.
  • Minimal assumptions regarding the underlying driver model.
  • Ability to leverage existing drivers implementing the Android driver model.
  • Ability to leverage existing hardware compositors.
  • Efficient in terms of memory, CPU and GPU usage.
  • Tightly integrated with the Unity shell, fulfilling the shell’s requirements while at the same time not dictating any sort of semantics up the stack.
  • An efficient and secure input subsystem supporting demanding mobile use-cases.
  • Fully tested from the ground up.
  • Adaptable to future requirements.

With these priorities in mind, we revisited and carefully evaluated the existing solutions again and found that neither of them satisfies our requirements. In particular, X and its driver model are way too complicated and feature-laden, resulting in a less efficient system and a driver model that is unlikely to be widely supported on mobile platforms. In the case of Weston, the lack of a clearly defined driver model, as well as the lack of a rigorous, test-driven development process grounded in a set of well-defined requirements, gave us doubts as to whether it would help us reach the “moon”.

We looked further and found Google’s SurfaceFlinger, a standalone compositor that fulfilled some but not all of our requirements. It benefits from a consistent driver model that is widely adopted and supported within the industry, and it fulfills a clear set of requirements. It is rock-solid and stable, but we did not think it would empower us to fulfill our mission of a tightly integrated user experience that scales across form-factors. Nevertheless, SurfaceFlinger was chosen as our initial solution for getting started with the overall Ubuntu Touch project, with the plan to replace it with Mir as soon as possible.

In summary, our evaluation of existing solutions led us to the conclusion that none of them fits our requirements and that adjusting or adapting them would require substantial effort, too. For these reasons, we decided to go for our own solution, catering directly to our goals and our vision, effectively saying: yes, we are going to do our own display server.

The project was born and we decided to name it Mir (as in the space station), as it would be our outpost that finally enables us to reach our goal of arriving at the moon: a seamless and beautiful user experience spanning multiple form-factors. The team set off to work on Mir while large parts of Canonical started working on the Ubuntu Touch project (relying on SurfaceFlinger). A lot of knowledge and experience was, and still is, transferred from the Ubuntu Touch project to the Mir team, with the work on the phone and tablet revealing more and more technical requirements and subtleties that heavily influenced Mir’s architecture and implementation.

One of the earliest technical decisions taken by the team concerned the protocol that applications rely upon to communicate with Mir. We evaluated Wayland and compared it to the SurfaceFlinger approach, realizing that neither fits our needs. Wayland is a very promising and sensible attempt at standardizing the way applications talk to a display server. However, it exposes privileged sections like the shell integration that we planned to handle differently, both for security reasons and because we wanted to decouple the way the shell works on top of the display server from the application-facing protocol.

In the end, we decided to implement Mir’s core in a protocol-agnostic way, considering the communication protocol a very important part of the overall story, but not the driving one. We have chosen Google’s protocol buffers as our data and interface description language (DDL & IDL) and implemented a lean RPC layer that abstracts away transport-specific details (relying on an ordinary socket by default). It is well tested, and the protobuf IDL and DDL provide us with forward- as well as backward-versioning capabilities. Moreover, the communication component of the overall system is adaptable and can be adjusted to account for future requirements and developments.
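To illustrate the design choice, here is a rough sketch of what such a protocol-agnostic seam can look like; everything here is invented for illustration and is not Mir’s actual RPC interface (in Mir’s case, the payloads would be protobuf-serialized messages):

#include <cstdint>
#include <string>
#include <vector>

// The core only ever deals with opaque, serialized payloads.
class Transport
{
public:
    virtual ~Transport() = default;
    virtual void send(std::vector<std::uint8_t> const& wire_bytes) = 0;
    virtual std::vector<std::uint8_t> receive() = 0;
};

// The RPC layer frames a method name plus payload and pushes it through
// whatever transport is configured (an ordinary socket by default), so the
// wire format can evolve without touching the compositor's core.
void invoke(Transport& transport,
            std::string const& method,
            std::vector<std::uint8_t> const& payload)
{
    std::vector<std::uint8_t> frame(method.begin(), method.end());
    frame.push_back('\0'); // naive framing, purely for illustration
    frame.insert(frame.end(), payload.begin(), payload.end());
    transport.send(frame);
}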

A final word on timelines: at the time of this writing, we are preparing to deploy the next version of the Unity shell, tightly integrated with Mir, on the phone and the tablet in the not-too-distant future. Regarding the desktop use-case: we have integrated X, leveraging the prior XWayland work, to run on top of Mir (XMir) and have started initial porting efforts for toolkits, with a focus on Qt5. With our X integration in place, you can run Mir on your desktop machine if your system uses a GPU supported by the free driver stack. For the closed-source desktop drivers: we are in active conversations with GPU vendors to enable Mir on those drivers/GPUs, too. [Updated] In addition, we are working together with NVIDIA towards a more unified driver model sitting on top of EGL.

As we move along with the development of Mir, we will work on integrating existing toolkits with the new display stack, allowing application developers a seamless transition between different form-factors and sparing them the burden of supporting multiple different display servers. As noted before, Qt5 is our first target here, with GTK3 and XUL in the pipeline. To support legacy X applications that cannot be adjusted to work against Mir, we will provide a translation layer that leverages the existing XMir implementation and allows legacy X apps to use Mir transparently.

And that’s pretty much it. Thanks for reading. I hope this post gave you some insight into the history of Mir, its motivation and its trajectory. If you want to dive even further into the Mir world, head over to wiki.ubuntu.com/MirSpec. A lot of work has already been done, but more of it lies ahead of us. So if you are curious, have questions, want to help us or simply want to get to know the team, join the Mir sessions at the upcoming virtual UDS. You might also want to read Olli’s description of the bigger picture, including our overall convergence strategy for Unity.

To the moon,

Thomas

Large-Scale MOO Experiments with SHARK – Oracle Grid Engine

This post explains how to conduct large-scale MOO experiments with the SHARK machine-learning library on clusters running the Oracle Grid Engine.

An experiment consists of three phases:

  1. front approximation
  2. performance indicator calculation
  3. result accumulation and statistics calculation

Within this post, I’m going to focus on the first step.

Front Approximation

In this phase, the Pareto-front approximations generated by applying multiple multi-objective evolutionary algorithms (MOEAs) to a set of objective functions are recorded.
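As a quick reminder, using standard multi-objective optimization notation (not SHARK-specific): for a minimization problem with objectives $f_1, \dots, f_m$, a solution $a$ dominates a solution $b$ iff it is nowhere worse and somewhere strictly better:

\[
a \prec b \;\Longleftrightarrow\; \forall i \in \{1,\dots,m\}\colon f_i(a) \le f_i(b) \;\land\; \exists j \in \{1,\dots,m\}\colon f_j(a) < f_j(b)
\]

The Pareto front is the image of the non-dominated solutions under the objective functions; each run below records a finite approximation of this front.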

Here, I assume that we want to evaluate the (µ+1)-MO-CMA-ES, relying on the hypervolume indicator, on the DTLZ suite of benchmark functions. A ready-to-use command-line application implementing the MO-CMA-ES is bundled with the default Shark installation. The executable is configurable via command-line arguments, queryable by passing --help:

  --objectiveFunction arg
  --seed arg (=1)
  --storageInterval arg (=100)
  --searchSpaceDimension arg (=10)
  --maxNoEvaluations arg (=50000)
  --timeLimit arg (=1000)
  --fitnessLimit arg (=1e-10)
  --resultDir arg (=.)
  --algorithmConfigFile arg
  --algorithmUsage 
  --defaultAlgorithmUsage 
  --objectiveSpaceDimension arg (=2)
  --reportFitnessFunctions 

That is, to execute the MO-CMA-ES for DTLZ2 with 3 objectives, terminating after 50000 objective function evaluations, the following call is required:

  SteadyStateMOCMAMain --objectiveFunction=DTLZ2 --objectiveSpaceDimension=3 --maxNoEvaluations=50000 

Note that we do not specify the RNG seed explicitly but rely on the default value of 1.

For the scenario considered here, we want to run several independent trials of one specific MOEA and one specific objective function in parallel. To this end, we rely on the array job feature of the grid engine and submit an array of 25 independent trials to the grid engine with the following command:

  # -N sets the job name; -t 1-25 spawns 25 array tasks, i.e., 25 trials
  qsub -N 'DTLZ2_3' -t 1-25 RunAlgo.sh DTLZ2 /globally/known/path 3

Here, the script RunAlgo.sh is defined as follows:

#!/bin/bash
#$ -S /bin/bash   # run the job under bash
#$ -o /dev/null   # discard the job's stdout

# Arguments: $1 = objective function, $2 = result directory, $3 = number of
# objectives. $SGE_TASK_ID is set by the grid engine to the task's index
# within the array job (1..25 here) and serves as the RNG seed.
SteadyStateMOCMAMain --seed=$SGE_TASK_ID --resultDir=$2 --objectiveFunction=$1 --objectiveSpaceDimension=$3

In summary, the script takes care of actually running the algorithm, setting the seed to the value of the environment variable $SGE_TASK_ID. The grid engine sets this variable to the task’s unique index within the array job, and thus we can ensure independent trials. There is one more thing to note: the result directory needs to be accessible from every node of the cluster. Normally, your ops team provides you with a scratch environment that is mounted on every computing node.

That’s it. Wait a few minutes until the experiment completes, and stay tuned for the second post, which explains how to evaluate the quality of the Pareto-front approximations.

Shark 3.x – Continuous Integration

Taken from the SHARK website:

SHARK is a modular C++ library for the design and optimization of adaptive systems. It provides methods for linear and nonlinear optimization, in particular evolutionary and gradient-based algorithms, kernel-based learning algorithms and neural networks, and various other machine learning techniques. SHARK serves as a toolbox to support real world applications as well as research in different domains of computational intelligence and machine learning. The sources are compatible with the following platforms: Windows, Solaris, MacOS X, and Linux.

The library has been in active development for over 10 years now and is in use by scientists all over the world. Last year, we, the core SHARK developers, decided that a rewrite of the library was necessary to support future use cases and to provide a solid platform for users and contributors alike. Our goals were simple:

  • Unify and simplify the library structure.
  • Rely on established components wherever feasible.
  • Documentation, documentation and, again, documentation.
  • Focus on quality.

In this post, I would like to dive a little deeper into the topic of quality and the processes we established to ensure a consistently high level of quality. We decided to address quality both from a technical (read: testable) and from an API point of view.

In terms of API quality, we want the programming interface to be consistent, convenient to use and easy to extend. Just as with the user experience, we want potential developers to find a welcoming and friendly environment. As we are a geographically distributed team of developers and scientists, we decided to go for a pre-commit code-review approach implemented with the help of ReviewBoard. Despite initial concerns on behalf of the developers, the review process proved to be one of the most useful tools while rewriting the library, with developers quickly starting to like the final “Ship It”.

In terms of “technical” quality, we decided to go for continuous integration of all (reviewed) commits to the rewrite branch on all of our supported platforms. With the help of Jenkins and a bunch of virtual machines, we finally realized our idea of continuous integration testing to guard against regressions. Our unit test suite is implemented with the unit-testing framework provided by Boost. Test execution is handled by CTest. Static and dynamic analysis of the library are carried out with the help of cppcheck and valgrind, respectively. Code-coverage metrics are calculated with the help of gcov. Finally, we integrate all of the testing results into the job-specific views of our Jenkins instance, thereby providing developers with a single source of information on the state of the library.
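To give a flavor of what such a test looks like, here is a minimal Boost.Test example; the tested function is a made-up stand-in and not actual SHARK code:

// A minimal Boost.Test-based unit test, analogous in structure to the
// tests in the suite; squared_error is a hypothetical example function.
#define BOOST_TEST_MODULE ExampleTests
#include <boost/test/included/unit_test.hpp>

double squared_error(double prediction, double target)
{
    double const diff = prediction - target;
    return diff * diff;
}

BOOST_AUTO_TEST_CASE(squared_error_is_symmetric)
{
    // BOOST_CHECK_CLOSE compares within a relative tolerance (in percent).
    BOOST_CHECK_CLOSE(squared_error(1.0, 3.0), squared_error(3.0, 1.0), 1e-9);
}

BOOST_AUTO_TEST_CASE(exact_prediction_yields_zero_error)
{
    BOOST_CHECK_SMALL(squared_error(2.5, 2.5), 1e-12);
}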