
Thursday, October 11, 2012

OFlux: Lock-free Mutex Atomic

Unlike a Pthread mutex, which stops a thread in its tracks if the resource is already held, an OFlux atomic with mutual-exclusion semantics does not do that.  Rather, it takes an "event" argument which is parked within the atomic's internal queue.  The thread is still available to execute other events as long as they are runnable (have acquired all the atomic resources they need).  This keeps threads warmer, and avoids a costly context switch.  The cost of parking an event is just a single compare-and-swap (CAS) instruction (this code is geared for the i686 architecture).  Within the same loop where this CAS happens, conditions are checked for whether the atomic is free (not held by another event), and if so another CAS allows acquisition.  So the basic operation acquire_or_wait() provides a simple way for events to interact with an atomic.  When a completed event releases an atomic, it calls release(), which can return the next event (from the waiting queue) as the new holder.


Rudiments of Interface


I will try to explain the top-level interface for this atomic object first, and then delve into the mechanics of the working parts within it.   Here is the AtomicExclusive subclass of AtomicCommon which implements the atomic object with mutual exclusion semantics (meaning at most one event may hold the atomic at one time) from src/runtime/lockfree/atomic/OFluxLFAtomic.h (simplified somewhat):


typedef EventBase * EventBasePtr;

class AtomicExclusive : public AtomicCommon {
public:
  AtomicExclusive(void * data)
    : AtomicCommon(data)
  {}
  virtual int held() const { return ! _waiters.empty(); }
  virtual void release(
     std::vector<EventBasePtr > & rel_ev
   , EventBasePtr &)
  {
    EventBaseHolder * ebh = _waiters.pop();
    if(ebh) {
      rel_ev.push_back(EventBasePtr(ebh->ev));
      AtomicCommon::allocator.put(ebh);
    }
  }
  virtual bool acquire_or_wait(EventBasePtr & ev, int mode)
  {
    EventBaseHolder * ebh = 
       AtomicCommon::allocator.get(ev,mode);
    bool acquired = _waiters.push(ebh);
    if(acquired) {
      AtomicCommon::allocator.put(ebh); 
      // not in use - return it to pool
    }
    return acquired;
  }
  ... // other things omitted
private:
  ExclusiveWaiterList _waiters;
};


Regardless of whether the atomic is currently held or not, multiple threads can be calling the acquire_or_wait() member function on behalf of new events which need to acquire this atomic.  Returning a true value from this function indicates that the acquisition succeeded and the event may proceed with further acquisitions, or actually execute its function.  Returning a false value indicates that the event has been given up to the atomic object (implying it is queued internally within the atomic object, and the current thread is forbidden to access it); it will be released later, when its turn to hold the atomic comes.

The interface supports other semantics like read-write or pools, which I will describe at a later date in a separate blog posting.  The signature of release() adds events to a std::vector (which could be thread-local and pre-sized for efficiency) in anticipation of other kinds of atomic objects.  Generally, the ordering of events within the internal queue is first-in first-out (FIFO), since this has the simplest fairness guarantee and makes writing code easier.  The mode argument can specify a "mode" of acquisition, but it is not very useful for this type of atomic object (there is only one mode of acquisition that it supports: "exclusive").  I won't cover the internals of the AtomicCommon::allocator member object, except to say that it is a thread-safe reclamation facility with a hazard-pointers capability (enabling the implementation of the ExclusiveWaiterList class, which I will talk about next).
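To make the calling protocol concrete, here is a minimal sketch of how a run-time worker thread might use these two member functions (run_event() and make_runnable() are hypothetical stand-ins for the real run-time machinery):


void
try_to_run(EventBasePtr & ev, AtomicExclusive & atomic, int mode)
{
  if(!atomic.acquire_or_wait(ev, mode)) {
    return; // ev is parked inside the atomic; this thread runs other events
  }
  run_event(ev); // acquired: execute the event's node function
  std::vector<EventBasePtr> rel_ev;
  EventBasePtr by_ev = NULL; // unused by the exclusive atomic
  atomic.release(rel_ev, by_ev);
  for(size_t i = 0; i < rel_ev.size(); ++i) {
    make_runnable(rel_ev[i]); // the released waiter becomes runnable
  }
}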


Hang on to this


In order to avoid putting mechanics that ExclusiveWaiterList needs into the events, I have a container object called EventBaseHolder which boxes up the events while they are anywhere near ExclusiveWaiterList.  On its own, it is not very interesting, but it is important to describe it before it is used (again simplified for reading):


struct EventBaseHolder {
  EventBaseHolder(
      EventBasePtr & a_ev
    , int a_mode)
    : ev(a_ev)
    , next(NULL)
    , mode(a_mode)
  {}

  EventBase * ev;
  EventBaseHolder * next;
  int mode;
};

The allocator object's get() member function is used to call the constructor (above).
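The call sites above assume roughly the following shape for that facility (a deliberately naive sketch; the real allocator recycles holders through a pool and defers reclamation using hazard pointers):


struct NaiveAllocator {
  EventBaseHolder * get(EventBasePtr & ev, int mode) {
    return new EventBaseHolder(ev, mode); // real code: take from a free list
  }
  void put(EventBaseHolder * ebh) {
    delete ebh; // real code: defer until no hazard pointer references ebh
  }
};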


Line up!


The declaration of ExclusiveWaiterList is pretty straight-forward (once I have simplified the code a bit):


template<const size_t v, typename T>
inline bool is_val(const T * p)
{
  return reinterpret_cast<size_t>(p) == v;
}

template<const size_t v, typename T>
inline void set_val(T * & p)
{
  reinterpret_cast<size_t &>(p) = v;
}


class ExclusiveWaiterList {
public:
  ExclusiveWaiterList()
    : _head(AtomicCommon::allocator.get(0,
            EventBaseHolder::None))
    , _tail(_head)
  { set_val<0x0001>(_head->next); }

  size_t count() const;
  bool has_waiters() const;
  bool push(EventBaseHolder * e);
  EventBaseHolder * pop();

private:
  EventBaseHolder * _head;
  EventBaseHolder * _tail;
};

When constructed, a new ExclusiveWaiterList has one empty EventBaseHolder element in it (which has mode None).  That initial element is a critical part of the implementation of this list.  I will describe each of the member functions one by one, after first describing the possible states that the list can be in.  Abstractly there are three: empty, held with no waiters (held/no), and held with some waiters (held/some).

In the empty state, there is no event holding the atomic, and no waiters.  The other two states' names are self-explanatory.  The pointer value 0x1 is used as a special marker value since it is not a real (i.e. properly aligned) value for a pointer on i686.  States not covered by these three (e.g. _head == _tail and _head->next > 0x1) should be unreachable as resting states (no operation is observable past its linearization point).
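In terms of the _head and _tail pointers checked by the code below, the three resting states are:


// empty:      _head == _tail, _head->next == 0x1
//             (no event holds the atomic, no waiters)
// held/no:    _head == _tail, _head->next == NULL
//             (some event holds the atomic; nobody waits)
// held/some:  _head != _tail; the boxes from _head to _tail hold the
//             waiting events in FIFO order, ending in an empty box at
//             _tail (the holding event itself is never stored in the list)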

The has_waiters() member function is quite simple to implement with these states in mind:


bool ExclusiveWaiterList::has_waiters() const
{
  EventBaseHolder * h = _head->next;
  return !is_val<0x0001>(h) && (h != NULL);
}

The count() member function is also fairly simple (though it is not used in a context where it needs to be fully safe -- so no hazard pointers are set):


size_t ExclusiveWaiterList::count() const // number of waiters
{
  size_t res = 0;
  EventBaseHolder * h = _head;
  while(h && !is_val<0x0001>(h)  && h->ev) {
    ++res;
    h = h->next;
  }
  return res;
}


Push



The code to push an event is most easily explained by visualizing the transitions it makes between the abstract states described above:


Here is the (simplified) code which does the push() transitions:


template<typename T>
inline T * unmk(T * t)
{
  size_t u = reinterpret_cast<size_t>(t);
  return reinterpret_cast<T *>(u & ~0x0001);
}

inline void ev_swap(EventBase * & ev1, EventBase * & ev2)
{
  EventBase * ev = ev1;
  ev1 = ev2;
  ev2 = ev;
}


bool
ExclusiveWaiterList::push(EventBaseHolder *e)
{
  bool res = false;
  e->next = NULL;
  EventBaseHolder * t = NULL;
  EventBase * ev = NULL; // hold the event locally
  ev_swap(ev,e->ev);
  EventBaseHolder * h = NULL;
  EventBaseHolder * hn = NULL;
  while(1) {
    // h = _head; side-effect:
    HAZARD_PTR_ASSIGN(h,_head,0); 
    hn = h->next;
    if(is_val<0x0001>(hn)
           && __sync_bool_compare_and_swap(
             &(h->next)
           , 0x0001
           , NULL)) {
      // empty->held/no
      ev_swap(e->ev,ev);
      res = true;
      break;
    } else {
      // (held/no,held/some)->held/some
      //t = _tail; side-effect
      HAZARD_PTR_ASSIGN(t,_tail,1); 
      if(unmk(t->next)) {
        continue; // _tail is stale; re-read it
      }
      if(unmk(t) && __sync_bool_compare_and_swap(
           &(t->next)
           , NULL
           , e)) {
        __sync_bool_compare_and_swap(&_tail,t,e);
        ev_swap(t->ev,ev);
        break;
      }
    }
  }
  HAZARD_PTR_RELEASE(0);
  HAZARD_PTR_RELEASE(1);
  return res;
}

__sync_bool_compare_and_swap() is a built-in gcc primitive for CAS, and the HAZARD_PTR macros are part of the deferred reclamation capabilities of the AtomicCommon::allocator object.  Here are the intermediate results of executing push(e1); push(e2); push(e3); with the return result of each invocation:
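Traced against the code above (each box shows its ev field, _head leftmost and _tail rightmost):


start (empty):          [ev=0 | next=0x1]
push(e1) returns true:  [ev=0 | next=NULL]          // e1 holds the atomic
push(e2) returns false: [ev=e2]->[ev=0]             // e2 parked; fresh empty tail box
push(e3) returns false: [ev=e2]->[ev=e3]->[ev=0]    // FIFO order of waiters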



Pop


The pop() operation is done using CAS operations as well, and returns an event holder if one was in the list.



Here is the (simplified) code:


EventBaseHolder *
ExclusiveWaiterList::pop()
{
  EventBaseHolder * r = NULL;
  EventBaseHolder * h = NULL;
  EventBaseHolder * hn = NULL;
  EventBaseHolder * t = NULL;
  while(1) {
    HAZARD_PTR_ASSIGN(h,_head,0); 
    //h = _head; side-effect
    hn = h->next;
    HAZARD_PTR_ASSIGN(t,_tail,1); 
    //t = _tail; side-effect
    if(unmk(t) && unmk(t->next)) {
      __sync_synchronize();
      continue;
    }
    if(h != _tail && hn != NULL
         && !is_val<0x0001>(hn)
         && h->ev != NULL
         && __sync_bool_compare_and_swap(
                   &_head
                   , h
                   , hn)) {
      // held/some->held/no,held/some
      r = h;
      r->next = NULL;
      break;
    } else if(hn==NULL && __sync_bool_compare_and_swap(
         &(h->next)
        , hn
        , 0x0001)) {
      // held/no->empty
      break; //empty
    }
  }
  HAZARD_PTR_RELEASE(0);
  HAZARD_PTR_RELEASE(1);
  return r;
}

Here is an example of what pop(); pop(); pop(); looks like when there are two waiters (in this case, events e1, e2 and then e3 are releasing their holds on the atomic object):
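Continuing the trace from the push example (e1 holding; e2 and e3 parked):


start:               [ev=e2]->[ev=e3]->[ev=0]
pop() returns e2:    [ev=e3]->[ev=0]                // _head advances one box
pop() returns e3:    [ev=0 | next=NULL]             // held, no waiters
pop() returns NULL:  [ev=0 | next=0x1]              // back to empty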



Away to the Races


Is this code correct?  Is it possible for races to happen which mutate the list into a state which should not be reachable?  These are great questions.  Here are some things to check (in terms of parallel operations hitting this data structure happening in thread T1 and thread T2):

  • T1 calls push() and T2 calls push().  This can happen.  Serialization happens on the update to _tail->next.  The CAS transitions that location from either 0x0 or 0x1 -- any other value indicates that the local _tail variable is stale and needs to be re-read.
  • T1 calls pop() and T2 calls pop().  This actually cannot happen.  Only one event holds the atomic object (and only it can invoke pop()), so this case is impossible.
  • T1 calls push() and T2 calls pop().  This can happen, but only when T2 sees a held state.  Transitions (for pop()) from the held/some state modify _head and do not interact with push().  Transitions (for pop()) from the held/no state modify _head->next when _tail == _head, and will serialize on that change (due to CAS).

Summary


I have presented here a lock-free atomic object implementation which is used to serialize event access to a resource (essentially some kind of object).  Rather than block, an acquisition which does not immediately succeed causes the event to queue inside of the atomic object, so that when the holding event releases, the waiting events are granted access (in a FIFO manner).  The underlying data structure making this access safe is a modified linked list which allows parallel access from multiple threads without the use of pthread synchronization or semaphores.  When used in the lock-free OFlux run-time, this atomic object can mediate access to guards in an efficient manner, so that event waits do not turn into thread waits.  Since the underlying structure is linked-list based, the need to mediate reclamation of the link objects with hazard pointers, and the loss of cache locality, are performance concerns.  An alternate implementation based on a dynamic buffer of C arrays might perform even better (present experience suggests this is likely).

Thursday, August 30, 2012

Cleanup: OFlux Guard Garbage Collection

OFlux guards with keys drawn from large ranges (e.g. hashed strings) can cause the underlying map memory to grow too much.  The values in these guards are pointers, so a buildup of (key,0) pairs in the underlying map is responsible.  Since any new key not already present in the map starts off associated with 0 (the NULL pointer, if you like), simply removing these entries from the underlying map when they are detected corrects the problem.  The timing of the removal in the run-time has to ensure the map is not accessed by two threads at the same time.
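Schematically, the buildup in the underlying map looks like this (illustrative only; the real container is internal to the run-time):


/* guard G's underlying map after many distinct keys have been seen:
     { (17, ptr), (18, 0), (19, 0), (20, 0), ... }
   every (key, 0) entry is dead weight -- a fresh key would start at 0
   anyway -- so /gc deletes such an entry once the node using it is done */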



Garbage Collected Guards


The /gc guard modifier enables this feature: an accessed (key,0) pair is removed once the node function is done with it.  The example below (based on gctest1 from the OFlux repo) demonstrates its usefulness:


exclusive/gc G (int v) => int *; 
  /* remove the /gc to cause this program to leak memory */

node GenerateKey () 
  => (int key);
node Populate (int key, guard G(key) as g) 
  => (int key); /* populates the rval guard */
node Depopulate (int key, guard G(key) as g) 
  => (); /* depopulates the rval guard */


source GenerateKey -> Populate -> Depopulate;


The plan is to have the source node GenerateKey generate a number, have node Populate populate the G guard with that number as a key, and then have node Depopulate depopulate it (by assigning it a value of (int *) 0) on the same integer key.  The following C++ implementation of the node functions carries out this plan:


#include <stdio.h>
#include "OFluxGenerate_gctest1.h"

int 
GenerateKey(const GenerateKey_in *
          , GenerateKey_out * out
          , GenerateKey_atoms *)
{
 static int x = 0;
 out->key = ++x;
 return 0;
}

int 
Populate(const Populate_in *in
       , Populate_out *out
       , Populate_atoms * atoms)
{
 int * & g = atoms->g();
 static int x = 0;
 out->key = in->key;
 if(g == NULL) {
  g = &x;
 }
 return 0;
}

int
Depopulate(const Depopulate_in * in
         , Depopulate_out *
         , Depopulate_atoms * atoms)
{
 int * & g = atoms->g();
 g = NULL;
 return 0;
}

Summary


Building this code without the /gc guard modifier causes the G guard's underlying map to explode in size (eventually running out of memory on a 32-bit system once it hits the 4G mark).  In some cases, the key space is a known finite set that does not grow dynamically very much; then there is no need to remove unused (key,0) pairs from the underlying map (and the speed hit of removing them is unnecessary).  Depending on the application, it may be that the value associated with a key will transition from 0 back to a real object in the near future with high probability.  If that is the case, the default setting with no garbage collection is recommended.

Thursday, July 26, 2012

OFlux: Building a Flow

Most of a .flux file's content will be node declarations and the flow connecting them.  There are two types of nodes: abstract nodes, which serve as helpful connection points and do not have C++ code associated with them, and concrete nodes, which are implemented via a C++ function.  In this post, I will describe the major features of the OFlux language in some detail.  First off is something programming language people call "choice".

Routing with Conditions


Suppose we have a source node Src which blocks on some input and produces an output Foo * foo.   The source node is declared as follows (its input set is empty -- which is necessary for a source node):


 node Src () => (Foo * foo);
 source Src;

As written, the oflux compiler will complain about this input, since the flow rooted at Src (with only one node in it) does not end with a node that has a () output set.  Adding a line which has terminate Src will silence the complaint (in effect saying "we know what we are doing, don't complain").  What we really want to accomplish is to apply a condition isFooEnough() to every foo that comes out of Src.  This condition will -- in reality -- be implemented within our C++ code using a function with prototype bool isFooEnough(Foo *):


 condition isFooEnough(Foo *) => bool;


Suppose we want to implement separate nodes ConsumeFooEnough and ConsumeFooLacking as successors, depending on the outcome of the isFooEnough() test.  Needless to say, it is assumed that isFooEnough has no side-effects on its argument or the global state of the program, since multiple calls to such a conditional might occur when the logic gets more complicated.  Once consumed, we will dispose of foo with a node called DisposeFoo.  The abstract node ConsumeFoo is used below only to help describe the decision being made:


 node ConsumeFooEnough (Foo * foo) => (Foo *);
 node ConsumeFooLacking (Foo * foo) => (Foo *);
 node DisposeFoo (Foo * foo) => ();
 node abstract ConsumeFoo (Foo * foo) => ...;
 ConsumeFoo: [isFooEnough] = ConsumeFooEnough -> DisposeFoo;
 ConsumeFoo: [*] = ConsumeFooLacking -> DisposeFoo;
 source Src -> ConsumeFoo; /* changing the original */

Much like pattern matching in a language like OCaml or Scala, routing is done using the first rule which matches the input.  So if isFooEnough(foo) returns true, then the routing to ConsumeFooEnough happens, and otherwise the path to ConsumeFooLacking is chosen.  This syntax survives from the original Flux language design.  The ellipsis (...) is used on the output declaration of ConsumeFoo to indicate to the compiler that we do not care to specify the output set; if we did, only () would be acceptable, since other options would not unify with the output set of DisposeFoo.  Now that a basic flow is described, the application programmer only has to implement the following concrete node functions in their C++ code to create a runnable program: Src, ConsumeFooEnough, ConsumeFooLacking, and DisposeFoo.  If it is not necessary to have DisposeFoo as a separate node, I would recommend that it just be implemented as a function which is called inside of the ConsumeFooEnough/ConsumeFooLacking C++ functions.  The only reason to have a separate node (which implies a separate event and a run-time scheduling of that event -- a non-negligible cost) is if DisposeFoo is interacting with resources in the program (please see my posting on OFlux guards).
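As a sketch of what one such node function might look like on the application side (using the generated types described below; the body and the output field name foo are illustrative assumptions, not code from the OFlux examples):


#include "OFluxGenerate.h"

int
ConsumeFooEnough(const ConsumeFooEnough_in * in
               , ConsumeFooEnough_out * out
               , ConsumeFooEnough_atoms *)
{
 out->foo = in->foo; // pass foo along to DisposeFoo ("foo" field name assumed)
 // ... application-specific consumption of in->foo would go here ...
 return 0; // 0 means success, as in the other node examples
}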

Running oflux


To compile the content above in a file called web.flux we issue the command:


 % ./oflux web.flux
OFlux v1.00-6-gc59d671 on web.flux

This causes the following output to be created locally:

  • web.dot: a description of the OFlux flow which can be turned into a pretty picture showing types, conditions and nodes (blue are abstract).  To create a picture with graphviz, run dot -Tpng web.dot -o web.png:

  • web.xml: the XML description of the OFlux flow which will be loaded at run-time to make the program run (edits to this file can cause changes to the flow without having to recompile. For brevity only the Src node is shown as it has the most interesting entry.):


<flow name="web.flux" ofluxversion="v1.00-6-gc59d671">
 <node name="Src" function="Src" source="true" door="false"
       iserrhandler="false" detached="false" external="false"
       inputunionhash="cbd4d4a285d623ee19470f7d5d68e1c1a765263c3c9242a8d61b222d49c7e64c"
       outputunionhash="6e63ce559f96ef9b1f68aa43702f1343c9638715464dea7d140e6b4d068d9688">
  <successorlist>
   <successor name="0">
    <case nodetarget="ConsumeFooEnough">
     <condition name="isFooEnough" argno="1" isnegated="false"
                unionhash="6e63ce559f96ef9b1f68aa43702f1343c9638715464dea7d140e6b4d068d9688"/>
    </case>
    <case nodetarget="ConsumeFooLacking">
     <condition name="isFooEnough" argno="1" isnegated="true"
                unionhash="6e63ce559f96ef9b1f68aa43702f1343c9638715464dea7d140e6b4d068d9688"/>
    </case>
   </successor>
   <successor name="erste">
    <case nodetarget="Src"/>
   </successor>
  </successorlist>
 </node>
 ...
</flow>


  • OFluxGenerate.h: A header file declaring the needed node and conditional functions, which includes the application's mImpl.h header file (which defines the types used).  This header should be included in the application's .cpp source.  It should not be necessary to understand all of the mechanics inside of the OFluxGenerate.h file.  Using the node declarations inside of web.flux, each node N gives rise to types N_in, N_out and N_atoms, which are needed for the C++ prototype int N(const N_in *, N_out *, N_atoms *).
  • OFluxGenerate.cpp: The generated C++ code needed to bind the OFlux program to the run-time.  This is where the static tables used to look-up symbols read from the XML file (web.xml) live.
Building a complete application means OFlux compiling, C++ compiling, linking, and running with these outputs.  Errors can crop up at any of those stages, so it is typical to have a tool like GNU make build the whole project -- even verifying that the XML file is loadable (that it properly gets its symbols from the OFluxGenerate.cpp code), as sketched below.
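As a rough sketch of that pipeline (not the repo's actual Mk files; the target names, -loflux flag, and check step are placeholders, and recipe lines must begin with a tab):


web: OFluxGenerate.o mImpl.o
	$(CXX) -o $@ OFluxGenerate.o mImpl.o -loflux  # link flag is a placeholder

OFluxGenerate.cpp OFluxGenerate.h web.xml: web.flux
	./oflux web.flux

%.o: %.cpp OFluxGenerate.h
	$(CXX) -c $< -o $@

check: web web.xml
	./web  # placeholder: run once to verify web.xml loads its symbols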


Some Limitations and Philosophy


In the above example there are plenty of things for the oflux compiler to check for us.  When a node is connected to another as a successor, the compiler does an asymmetric unification of the output set with the input set.  This means that each input should exist as an output from the predecessor.  Generally this is done using the name of the argument (so matching foo with foo in my example), but if there is a mismatch in naming, an attempt is made to unify using types.

If the names of the formal parameters do not match, there are consequences in the generated code -- a C union is necessary to give the field two names (this is trouble if any of the argument types is non-POD, which happens quite a bit with C++ code).  When unification fails, the offending node and its argument are indicated by the oflux compiler (the first error halts the compilation).


OFlux adds another dimension to the task of programming a server.  A new server design will have nodes and flow to go with it.  My consistent impression working with developers who are new to the tool is that a great many nodes end up being defined (much like functions are used in C++).  Making a flow very long (lots of nodes from source to any sink) can be quite detrimental to performance if most of the nodes are doing less work than the overhead to execute them.  My best advice is to try an initial flow with the absolute least number of nodes possible, and then investigate refining that program into more nodes in a step-wise manner.  Keep it as simple as possible!


Routing to Many Places


Occasionally concurrency is needed in the flow, since an output is consumed by multiple nodes, and it's inefficient to have them each process that input (a foo perhaps) sequentially.  The OFlux run-time keeps a node event (which holds its output data) alive using reference counting, so it is possible to keep a node event around long enough to be processed by multiple successor node events.

Suppose we want to modify our example above so that DisposeFoo is an abstract node which aliases two nodes we want to run at the same time (ReclaimFoo and CommunicateFoo).  This could be done as follows (replacing the C++ function implementation we have for DisposeFoo with implementations for ReclaimFoo and CommunicateFoo):


 node abstract DisposeFoo (Foo *foo) => ();
 node CommunicateFoo(Foo * foo) => ();
 node ReclaimFoo(Foo * foo) => ();
 DisposeFoo = CommunicateFoo & ReclaimFoo;

Using this technique, the completion of a ConsumeFooEnough event will cause two new events to be created: one for CommunicateFoo and another for ReclaimFoo.  If these nodes were detached or had a shimmed system call in them, they could both dispatch at the same time and run concurrently.


Summary



Using the basic composition syntax within OFlux, an application developer can describe the top-level flow of events in their program without having to explicitly manage a thread pool.  The language offers three main types of basic composition:


  1. sequential (using ->)
  2. concurrent (using &)
  3. choice (using : [ condition ] pattern matching)
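All three were used in the example above; collected in one place:


 source Src -> ConsumeFoo;                 /* sequential */
 DisposeFoo = CommunicateFoo & ReclaimFoo; /* concurrent */
 ConsumeFoo: [isFooEnough] = ConsumeFooEnough -> DisposeFoo; /* choice */
 ConsumeFoo: [*] = ConsumeFooLacking -> DisposeFoo;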

Monday, July 23, 2012

OFlux: Detached Nodes

Tasty



Previously, I described the anti-pattern of blocking while locking, and also how the OFlux run-time escapes this pitfall.  If a node in your program tends to do two or more blocking system calls, each one will be intercepted by the shim mechanism to give up the run-time's main mutex.  The context switching in a case like that can be optimized away if the node does not cause side-effects within the rest of the program (mostly meaning: modifying non-local state).  The optimization is to enter the node's C++ function as if it were one big system call (releasing the main run-time mutex for the duration of the function's execution).  This saves context switching on the mutex and conceivably increases the concurrency in the program (node events for these detached nodes are now able to run independently of the run-time more often).  Here is how we augment the basic state diagram for each run-time thread to accommodate this idea:



For nodes that are declared detached, the ability to run in the new mode (dotted-line box on the right side) is available when the run-time sees that there are enough threads to allow this.  The two dotted boxes indicate the states where the thread is not holding the main run-time mutex.

Example: Sleeping Beauty and the Seven Dwarves


Within a working copy of the OFlux Github repo, you can create a new directory called src/examples/dwarves with the following ex-contents.mk make file:

$(info Reading ex-contents.mk $(COMPONENT_DIR))

OFLUX_PROJECT_NAME:=dwarves

include $(SRCDIR)/Mk/oflux_example.mk

$(OFLUX_PROJECT_NAME)_OFLUX_CXXFLAGS+= -DHASINIT -DHASDEINIT

The dwarves.flux file describes the flow:

node SnowWhite () => (int apple_id);
node Dwarf (int apple_id) => ();
source SnowWhite -> Dwarf;

The C++ code for these nodes is pretty simple. Every 0.10 ms SnowWhite sends out an apple, and a Dwarf picks it up and does ten 0.10 ms sleeps in order to consume it (in mImpl_dwarves.cpp):

#include "OFluxGenerate_dwarves.h"
#include "OFluxRunTimeAbstract.h"
#include <sys/time.h>
#include <unistd.h>
#include <cstdlib>

long dwarf_count = 0;
extern oflux::shared_ptr<oflux::runtimeabstract> theRT;

int
SnowWhite(const SnowWhite_in *
        , SnowWhite_out * out
        , SnowWhite_atoms *)
{
        static int apples = 0;
        out->apple_id = apples++;
        if(apples>10000) {
                theRT->hard_kill();
        }
        usleep(100);
        return 0;
}

int
Dwarf(    const Dwarf_in * in
        , Dwarf_out *
        , Dwarf_atoms *)
{
        __sync_fetch_and_add(&dwarf_count,1);
        for(size_t i = 0;i < 10; ++i) {
                usleep(100);
        }
        return 0;
}

I have also added code to produce statistics when the program exits:
struct timeval tv_start;

void
deinit()
{
        struct timeval tv_end;
        gettimeofday(&tv_end,0);
        double total_time = tv_end.tv_sec-tv_start.tv_sec
                + (tv_end.tv_usec - tv_start.tv_usec)
                   /1000000.00;
        double dps = dwarf_count / total_time;
        printf("ran %lf seconds, dispatched %lf "
               "dwarves per second\n"
                , total_time
                , dps);
}

void
init(int argc,char * argv[])
{
        atexit(deinit);
        gettimeofday(&tv_start,0);
}

As is, the dwarves.flux flow will produce the following output on my Asus 1000HE netbook (which has 2 hardware contexts and 1 core):

 # ./builds/_Linux_i686_production/run-dwarves.sh \
   2> /dev/null  | grep ran
ran 5.109480 seconds, dispatched 1957.146324 dwarves per second

Detaching Dwarves


But if we make the Dwarf node detached (which I claim will likely be of benefit, since the usleep shimmed system call will be called less frequently):

node SnowWhite () => (int apple_id);
node detached Dwarf (int apple_id) => ();
source SnowWhite -> Dwarf;

Re-running the test, we can see that we are running a little faster:

 # ./builds/_Linux_i686_production/run-dwarves.sh \
   2> /dev/null  | grep ran
ran 3.468819 seconds, dispatched 2882.825538 dwarves per second

So detaching nodes can pay off handsomely when it is safe to do so, since it reduces the churn in and out of the main run-time mutex.  It is unsafe when something about the node's source code makes it so (e.g. mutating non-local state).  Detached nodes are also useful when making calls to 3rd-party libraries which (themselves) have mutexes -- in order to avoid a deadlock with the run-time mutex.

Tuesday, July 17, 2012

Anti-pattern: Lock and Block

Poison.


Never hold a lock and then block waiting for I/O. Just do that one thing, and you are mostly out of the woods. Even if your "multi-thread" program ultimately serializes to the equivalent of one thread because of your synchronization choices, doing that one thing (not locking while blocking) should keep your application at above average awesome.

Threads.


So how does locking work?  Why is it done? What is a thread anyway?

So many questions for a neophyte C++ developer to ask.  A thread is an execution context (stack, program counter, thread-local storage) within a process which shares the address space (so global variables, binary code, and text sections) with the other threads in that process.  The kernel schedules each runnable thread in the program onto the physical cores that the machine has.  A core?  They told you about how many of those you've got when you bought your computer -- I know it sounded esoteric at the time.  Each core is capable of moving a thread along through the code, changing the state of the program.  As time progresses, machines have more cores, and software is written to have more threads which make use of that added hardware/machinery.

As time progresses (downwards in the below diagram) two threads spend time on a core (shown in blue), time off the core (shown in green), and time waiting for a blocking system call to return:


Since threads progress through these phases -- doing serious number crunching, waiting for system calls to return, waiting for mutexes/semaphores to be released by other threads and acquired by this one, and just waiting their turn for time on a core -- having more threads than cores makes sense.  Unless all a thread does is number crunching in local memory, it does not spend all of its time on a core.

Locking.


When threads want to share memory (say a structure like a hash table, a vector, or a queue), the textbook way of making that safe (it is unsafe because partial state changes on that memory -- when interleaved -- might transition the structure into a state that should be unreachable or invalid) is a lock of some sort (usually a mutex).  Read and write operations on structures shared by threads use locks to serialize the operations -- allowing uninterrupted access for the duration -- and keep things safe:



Poison.



Serialization costs something.  All this co-ordination has a bit of overhead, and it slows the threads down as they hit the brakes more often to wait for access to the shared structures.  If the duration of the lock covers only some very fast number crunching or local memory (cache) access, then things aren't so bad.  The time in the critical section is so short that the penalty is not onerous.  Things get epically awful if there are blocking system calls happening in a thread while those locks are held:


This can be quite terrible for the threads waiting their turn with the lock: collectively they drop to 0 percent CPU, and I/O ends up dictating the speed of progress.  This is not what you want.
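To make the anti-pattern concrete, here is a minimal sketch (a plain Pthread program with hypothetical names, not OFlux code) of the mistake and its fix:


#include <pthread.h>
#include <unistd.h>

pthread_mutex_t table_mutex = PTHREAD_MUTEX_INITIALIZER;

// Anti-pattern: read() blocks on I/O while table_mutex is held, so every
// other thread needing the shared table stalls behind this I/O.
void lock_and_block(int fd, char * buf, size_t sz)
{
        pthread_mutex_lock(&table_mutex);
        ssize_t n = read(fd, buf, sz);   // blocking system call under the lock
        if(n > 0) {
                // ... update the shared table using buf[0..n) ...
        }
        pthread_mutex_unlock(&table_mutex);
}

// Better: do the blocking I/O first, and take the lock only for the
// short in-memory update.
void block_then_lock(int fd, char * buf, size_t sz)
{
        ssize_t n = read(fd, buf, sz);   // no lock held here
        pthread_mutex_lock(&table_mutex);
        if(n > 0) {
                // ... update the shared table using buf[0..n) ...
        }
        pthread_mutex_unlock(&table_mutex);
}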

Tasty.


In the OFlux run-time, blocking system calls are shimmed via interposition so that the main run-time mutex is not held for the duration of those system calls.  It avoids the "big faux pas" by design:


Other event-based systems do similar tricks to avoid the same problem.  Hopefully this post helps explain what the problem is with blocking while locking, and how life is better (by design) using a run-time which side-steps the issue completely.
