Monday, July 23, 2012

OFlux: Detached Nodes


Previously, I described the anti-pattern of blocking while locking, and how the OFlux run-time escapes this pitfall.  If a node in your program tends to make two or more blocking system calls, each one is intercepted by the shim mechanism so that it gives up the run-time's main mutex.  The context switching in a case like that can be optimized when the node causes no side-effects in the rest of the program (chiefly, modifying non-local state).  The optimization is to enter the node's C++ function as if it were one big system call, releasing the main run-time mutex for the duration of the function's execution.  This saves context switching on the mutex and can increase the concurrency in the program, since node events for these detached nodes are able to run independently of the run-time more often.  Here is how we augment the basic state diagram for each run-time thread to accommodate this idea:
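To make the shim idea concrete, here is a minimal sketch of what interception of one blocking call looks like.  The names (runtime_mutex, shimmed_usleep) are illustrative only; the real OFlux shim interposes on the libc symbols rather than defining wrappers like this:

```cpp
// Illustrative sketch of the shim around one blocking call; the
// names here are hypothetical, not the actual OFlux shim API.
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t runtime_mutex = PTHREAD_MUTEX_INITIALIZER;

// Stand-in for the interposed usleep: give up the run-time's main
// mutex for the duration of the blocking call, then take it back.
int shimmed_usleep(useconds_t usec)
{
        pthread_mutex_unlock(&runtime_mutex); // other node events may run now
        int rc = usleep(usec);                // the actual blocking system call
        pthread_mutex_lock(&runtime_mutex);   // rejoin the run-time
        return rc;
}
```

A node making ten blocking calls pays this unlock/lock round trip ten times -- which is exactly the cost the detached optimization below avoids.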



For nodes that are declared detached, the ability to run in the new mode (the dotted-line box on the right side) is available when the run-time sees that there are enough threads to allow it.  The two dotted boxes indicate the states in which the thread does not hold the main run-time mutex.
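The dispatch decision can be sketched as follows.  Everything here (NodeEvent, execute_event, the enough_threads flag) is hypothetical pseudo-structure for illustration, not the actual OFlux internals:

```cpp
// Illustrative sketch of dispatching one node event; names are
// hypothetical, not the real OFlux run-time code.
#include <pthread.h>

static pthread_mutex_t runtime_mutex = PTHREAD_MUTEX_INITIALIZER;

struct NodeEvent {
        bool detached;   // was the node declared "detached" in the flux file?
        int (*body)();   // the node's C++ function
};

static int example_body() { return 0; }  // stand-in node function

// Run one event.  For a detached node with threads to spare, the
// main mutex is dropped once for the whole function instead of
// once per blocking call inside it.
int execute_event(NodeEvent & ev, bool enough_threads)
{
        int rc;
        if (ev.detached && enough_threads) {
                pthread_mutex_unlock(&runtime_mutex); // one unlock...
                rc = ev.body();                       // runs fully detached
                pthread_mutex_lock(&runtime_mutex);   // ...one relock
        } else {
                rc = ev.body(); // shim pays unlock/lock per blocking call
        }
        return rc;
}
```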

Example: Sleeping Beauty and the Seven Dwarves


Within a working copy of the OFlux GitHub repo, you can create a new directory called src/examples/dwarves with the following ex-contents.mk make file:

$(info Reading ex-contents.mk $(COMPONENT_DIR))

OFLUX_PROJECT_NAME:=dwarves

include $(SRCDIR)/Mk/oflux_example.mk

$(OFLUX_PROJECT_NAME)_OFLUX_CXXFLAGS+= -DHASINIT -DHASDEINIT

The dwarves.flux file describes the flow:

node SnowWhite () => (int apple_id);
node Dwarf (int apple_id) => ();
source SnowWhite -> Dwarf;

The C++ code for these nodes is pretty simple.  Every 0.1 ms SnowWhite sends out an apple, and a Dwarf picks it up and does ten 0.1 ms sleeps in order to consume it (in mImpl_dwarves.cpp):

#include "OFluxGenerate_dwarves.h"
#include "OFluxRunTimeAbstract.h"
#include <sys/time.h>
#include <unistd.h>
#include <cstdlib>
#include <cstdio>

long dwarf_count = 0;
extern oflux::shared_ptr<oflux::RunTimeAbstract> theRT;

int
SnowWhite(const SnowWhite_in *
        , SnowWhite_out * out
        , SnowWhite_atoms *)
{
        static int apples = 0;
        out->apple_id = apples++;
        if(apples>10000) {
                theRT->hard_kill();
        }
        usleep(100);
        return 0;
}

int
Dwarf(    const Dwarf_in * in
        , Dwarf_out *
        , Dwarf_atoms *)
{
        __sync_fetch_and_add(&dwarf_count,1);
        for(size_t i = 0; i < 10; ++i) {
                usleep(100);
        }
        return 0;
}

I have also added code to produce statistics when the program exits:
struct timeval tv_start;

void
deinit()
{
        struct timeval tv_end;
        gettimeofday(&tv_end,0);
        double total_time = tv_end.tv_sec-tv_start.tv_sec
                + (tv_end.tv_usec - tv_start.tv_usec)
                   /1000000.00;
        double dps = dwarf_count / total_time;
        printf("ran %lf seconds, dispatched %lf "
               "dwarves per second\n"
                , total_time
                , dps);
}

void
init(int argc,char * argv[])
{
        atexit(deinit);
        gettimeofday(&tv_start,0);
}

As is, the dwarves.flux flow will produce the following output on my Asus 1000HE netbook (which has 2 hardware contexts and 1 core):

 # ./builds/_Linux_i686_production/run-dwarves.sh \
   2> /dev/null  | grep ran
ran 5.109480 seconds, dispatched 1957.146324 dwarves per second

Detaching Dwarves


But if we make the Dwarf node detached (which I claim will likely be of benefit, since the shimmed usleep system call will be entered less frequently):

node SnowWhite () => (int apple_id);
node detached Dwarf (int apple_id) => ();
source SnowWhite -> Dwarf;

Re-running the test shows a noticeable speedup (about 47% more dwarves per second):

 # ./builds/_Linux_i686_production/run-dwarves.sh \
   2> /dev/null  | grep ran
ran 3.468819 seconds, dispatched 2882.825538 dwarves per second

So detaching nodes can pay off handsomely when it is safe to do so, since it reduces the churn of acquiring and releasing the main run-time mutex.  It is unsafe when the node's source code has side-effects such as mutating non-local state without synchronization.  Detached nodes are also useful when making calls to 3rd party libraries which (themselves) have mutexes -- in order to avoid a deadlock with the run-time mutex.
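To see why unsynchronized non-local state rules out detaching: two detached Dwarf events can run at the same time, so a plain increment of a shared counter is a data race.  This is why the Dwarf node above uses an atomic increment.  A sketch of the two variants (the names plain_count, atomic_count, and consume_apple are illustrative):

```cpp
// Illustrative sketch: plain vs. atomic updates to shared state.
long plain_count  = 0;  // a data race if two detached events increment it
long atomic_count = 0;  // safe: atomic read-modify-write

void consume_apple()
{
        plain_count++;                           // unsafe when detached
        __sync_fetch_and_add(&atomic_count, 1);  // what Dwarf actually does
}
```

If a node genuinely needs non-atomic shared state, leave it undetached (or guard the state with its own mutex) rather than declaring it detached.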
