Monday, July 16, 2012

[O]Flux Event-based Run-time

One of the goals is to avoid making a long-running, blocking system call in the run-time while holding a resource (I mentioned this in my last post on the subject).  If a mutex (for example) is held while a thread in the program is doing some slow I/O, performance can suffer tremendously: all the other threads in the program that might have made quick use of that resource are stuck waiting.
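To make the predicament concrete, here is a minimal standalone sketch of the anti-pattern (my own example, not OFlux code): the mutex guarding a shared table is held across a blocking read(), so every other thread that only needs a quick lookup stalls for the whole duration of the I/O.

  #include <pthread.h>
  #include <unistd.h>
  #include <map>
  #include <string>

  static pthread_mutex_t table_mtx = PTHREAD_MUTEX_INITIALIZER;
  static std::map<int, std::string> shared_table;

  void slow_update(int fd, int key) {
    char buf[4096];
    pthread_mutex_lock(&table_mtx);
    ssize_t n = read(fd, buf, sizeof(buf));   // may block for a long time
    if(n > 0)
      shared_table[key] = std::string(buf, n);
    pthread_mutex_unlock(&table_mtx);         // everyone else waited all along
  }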

The Flux event-based run-time tries to avoid this predicament by simplifying the problem somewhat.  To obtain more concurrency, system calls which could block are intercepted (or interposed) using the standard UNIX technique of the LD_PRELOAD environment variable.  This means that any application code which calls read(), for example, passes through the Flux run-time to do so (this interposing code is referred to as a shim since it sits between the OS and the application).  If a shimmed system call would block, the run-time starts up another thread or reschedules a ready one (via a Pthread condition variable) before making the call.
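A sketch of what such a shim for read() might look like is below.  The runtime_* hooks are my own placeholder names, not OFlux symbols; they stand for the unlock/signal and lock/wait steps shown concretely in the sleep() shim later in this post.  The dlsym(RTLD_NEXT, ...) pattern is the standard interposition technique.

  #include <dlfcn.h>
  #include <unistd.h>

  void runtime_before_blocking_call();   // placeholder: release master mutex, signal WIP
  void runtime_after_blocking_call();    // placeholder: wait (WTR) to re-enter the run-time

  typedef ssize_t (*read_fn)(int, void *, size_t);
  static read_fn real_read = 0;

  extern "C" ssize_t read(int fd, void * buf, size_t count) {
    if(!real_read)
      real_read = (read_fn) dlsym(RTLD_NEXT, "read");  // find libc's read()
    runtime_before_blocking_call();
    ssize_t res = real_read(fd, buf, count);           // the call that may block
    runtime_after_blocking_call();
    return res;
  }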


The run-time itself has only one master mutex.  This mutex is held whenever application code is running but a shimmed system call is not.  While a C++ application function is blocked in a system call, the mutex is released so that another thread which is also running a C++ application function gets a chance to run (or so that a new runnable event which runs such a function can be scheduled).  Runnable events are kept in FIFO order on an internal run queue which is protected by the master mutex.
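A rough sketch of that arrangement (the names are mine, not OFlux's): runnable events sit in a FIFO, and the single master mutex guards the queue as well as all application code outside of shimmed system calls.

  #include <pthread.h>
  #include <deque>

  struct RunnableEvent;                  // node function plus its input/output data

  static pthread_mutex_t main_runtime_mtx = PTHREAD_MUTEX_INITIALIZER;
  static std::deque<RunnableEvent *> run_queue;    // FIFO of runnable events

  // caller must already hold main_runtime_mtx
  RunnableEvent * pop_next_event() {
    if(run_queue.empty()) return 0;
    RunnableEvent * ev = run_queue.front();
    run_queue.pop_front();
    return ev;
  }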


This state diagram shows what the life of a thread is like in the event-based run-time (initially the FIFO is seeded with one event for every source node, and one thread is told to grab the first FIFOed event):





The darker blue color indicates the state where a C++ application function is executing.  The green states are waits on condition variables (WTR, waiting to run, is one of them; the other is WIP, waiting in pool).  Idle threads end up blocked on WIP.  Threads which have completed their system call and want back in on the master mutex need to pass through WTR.  The light blue states are Flux run-time steps which manage internal run-time state.
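Here is a rough sketch of the per-thread loop the diagram describes, building on the run-queue sketch above (run_node_function and enqueue_successor_events are placeholder names, not OFlux symbols).  The two condition variables declared here are the same ones the sleep() shim below uses: light blue steps pick events and do bookkeeping under the master mutex, dark blue runs the node function, and the green states are the two condition-variable waits.

  static pthread_cond_t wip_cond = PTHREAD_COND_INITIALIZER;   // waiting in pool
  static pthread_cond_t wtr_cond = PTHREAD_COND_INITIALIZER;   // waiting to run

  void run_node_function(RunnableEvent *);        // placeholder: invoke the C++ node function
  void enqueue_successor_events(RunnableEvent *); // placeholder: schedule downstream events

  void thread_loop() {
    pthread_mutex_lock(&main_runtime_mtx);
    for(;;) {
      RunnableEvent * ev = pop_next_event();       // light blue: run-time bookkeeping
      if(!ev) {
        pthread_cond_wait(&wip_cond, &main_runtime_mtx);   // green: WIP, idle in the pool
        continue;
      }
      run_node_function(ev);         // dark blue: C++ node code; a shim may drop the
                                     // mutex, block, then come back through WTR
      enqueue_successor_events(ev);  // light blue: push downstream events onto the FIFO
    }
  }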

Several threads can be in the blocking state at once, waiting for a system call to return, while one thread holds the mutex and is doing "blue things".  Consider a simple C++ node function Sleep which calls the shimmed system call sleep():


  // OFlux node function: sleep for in->sec seconds, then pass in->val along
  int
  Sleep(const Sleep_in * in, Sleep_out * out, Sleep_atoms *)
  {
    sleep(in->sec);        // goes through the shim below
    out->val = in->val;
    return 0;
  }

Since sleep() is shimmed, the Flux shim code can check whether the argument is 0 (in which case no blocking happens), possibly saving the context switch caused by releasing the master mutex. For a value greater than 0 we don't want to hold the mutex during the system call, so it is released with a call to pthread_mutex_unlock(). Then the real system call happens, and we return into the run-time with a condition variable wait (WTR):


  // shim_sleep points at the real sleep(); its symbol is captured via dlsym in init
  static unsigned int (*shim_sleep)(unsigned int);

  extern "C" unsigned int
  sleep(unsigned int s) {
    if(!s) return 0;                            // not blocking -- bypass the run-time
    pthread_mutex_unlock(&main_runtime_mtx);    // don't hold the master mutex while blocked
    pthread_cond_signal(&wip_cond);             // wake an idle pool thread, if any
    unsigned int res = ((shim_sleep)(s));       // the real (blocking) sleep
    pthread_mutex_lock(&main_runtime_mtx);
    pthread_cond_wait(&wtr_cond,&main_runtime_mtx);  // WTR: wait to re-enter the run-time
    return res;
  }

Imagine a run-time FIFO with 5 Sleep events on it (all with positive in->sec values): the run-time will get 5 threads involved in running those sleeps at the same time (assuming 5 threads are available to run them).
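To see why that pays off, here is a standalone pthread illustration (no OFlux involved, just my own example of the effect): five 2-second sleeps run on five threads, so the total wall time is about one sleep period rather than the sum of all five.

  #include <pthread.h>
  #include <unistd.h>
  #include <cstdio>
  #include <ctime>

  static void * sleeper(void *) { sleep(2); return 0; }

  int main() {
    time_t start = time(0);
    pthread_t t[5];
    for(int i = 0; i < 5; ++i) pthread_create(&t[i], 0, sleeper, 0);
    for(int i = 0; i < 5; ++i) pthread_join(t[i], 0);
    printf("elapsed ~%ld seconds\n", (long)(time(0) - start));  // prints ~2, not ~10
    return 0;
  }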

Summary


In this post I describe the state machine that every run-time thread lives within.  Initially a Flux program has only a single thread; the thread pool grows when there are runnable events available but no idle (WIP) threads left to signal.  Concurrency comes from allowing multiple threads to be blocked in system calls at the same time.  When writing the application we can at least be sure that in-memory structures are accessed atomically between system calls, since the master mutex is held while node code runs.  With extra mechanisms to keep certain node events from dispatching at the same time, further atomicity guarantees are possible.
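A sketch of that growth rule, building on the sketches above (the counter and entry function are placeholder names): wake an idle WIP thread if one exists, otherwise create a new thread for the runnable event.

  static int idle_thread_count = 0;              // threads currently parked on WIP
  extern "C" void * thread_loop_entry(void *);   // placeholder: runs thread_loop()

  // caller must hold main_runtime_mtx
  void enqueue_runnable(RunnableEvent * ev) {
    run_queue.push_back(ev);
    if(idle_thread_count > 0) {
      pthread_cond_signal(&wip_cond);            // an idle thread will pick the event up
    } else {
      pthread_t tid;
      pthread_create(&tid, 0, thread_loop_entry, 0);   // grow the thread pool
    }
  }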

