
Monday, October 1, 2012

Splitting Trade Units

Dividing a single executed trade proportionally for several customers is an interesting problem.  Naive decisions can lead to unpleasant and surprising results.  In this post, I will describe the problem of fairly partitioning a trade with an integer number of units amongst a group of participants, each participating according to some numerical proportion (think pie chart).  The proportions will be specified as double precision floating point numbers, and they will be normalized relative to their overall sum.



Requirement: Make it Whole


Suppose a new trade for U units establishes a new position, and there are N participants where participant i has proportion P[i] >= 0.  If the total proportion (left-associated sum) is P > 0, then the problem becomes finding values U[i] whose sum is U, and which are proportional to the P[i] values.

Using (P[i]/P)*U and rounding seems like a good idea, but it is wrong.  Consider what happens with N = 3, P[0] = P[1] = P[2] = 10.0 and U = 20: each participant is given 6.6666666666 units, rounded to the nearest integer using an unspecified rounding technique.  A deterministic round to 6 units each or 7 units each hands out the wrong number of total units (18 versus 21).  So this won't work.
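
A quick standalone sketch (mine, not from the original code, assuming round-half-up as the rounding technique) makes the failure visible:

  #include <cmath>
  #include <cstdio>

  int main() {
    const double P[3] = {10.0, 10.0, 10.0};
    const double total = P[0] + P[1] + P[2];
    const int U = 20;
    int sum_trunc = 0, sum_round = 0;
    for(int i = 0; i < 3; ++i) {
      double share = U * P[i] / total;            // 6.666... units each
      sum_trunc += (int)share;                    // truncation: 6 each
      sum_round += (int)std::floor(share + 0.5);  // round-half-up: 7 each
    }
    printf("truncated total: %d, rounded total: %d\n",
           sum_trunc, sum_round);                 // 18 and 21; neither is 20
    return 0;
  }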

Instead of trying to establish unit counts per client right away, I attempted to assign unit counts to two groups of participants (essentially abstracting the problem down to two participants).  I wrote the following code:


 #include <cassert>

 // Split total_position into units1 + units2 in proportion prop1:prop2.
 // ROUND_UNITS is the application's round-to-nearest-integer macro.
 inline void position_split(
   int & units1
 , int & units2
 , const double & prop1
 , const double & prop2
 , int total_position)
{
 double total_proportion = prop1 + prop2;
 assert(total_proportion > 0 || total_position == 0);
 if(total_position) {
  units1 = ROUND_UNITS(total_position * prop1 / total_proportion);
  units2 = total_position - units1; // the remainder, so the sum is exact
 } else {
  units1 = 0;
  units2 = 0;
 }
}

Going back to my N = 3 example, this is better: I can use it to determine (U[0], U[1]+U[2]) using (P[0], P[1]+P[2]) and then invoke it again in a similar way to find out what (U[0]+U[1], U[2]) is.  Applying the code above I obtain the following unit values: U[0] = 7, U[1] = 6, U[2] = 7.  Great, now the values add up to the expected total U.
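
Building on position_split, one way to handle N participants is to repeatedly peel the next participant off of the remaining group.  The following sketch is mine (not from the original code) and assumes the ROUND_UNITS macro from above; the per-participant values can differ slightly from the two-group pairing I described, but the grand total is always exactly U:

  #include <vector>

  // Split U units amongst N participants with proportions P[i] >= 0,
  // peeling one participant off of the remainder at each step.
  void position_split_n(
      const std::vector<double> & P
    , int U
    , std::vector<int> & units)
  {
    units.assign(P.size(), 0);
    if(P.empty()) return;
    double rest_prop = 0;
    for(size_t i = 0; i < P.size(); ++i) rest_prop += P[i];
    int rest_units = U;
    for(size_t i = 0; i + 1 < P.size(); ++i) {
      int u_i = 0;
      int u_rest = 0;
      position_split(u_i, u_rest, P[i], rest_prop - P[i], rest_units);
      units[i] = u_i;
      rest_units = u_rest;
      rest_prop -= P[i];
    }
    units[P.size()-1] = rest_units; // last one absorbs the exact remainder
  }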


Selling Out


Now that you are in a trade for U units and each participant has his piece, the most interesting part happens.  How do you attribute profit as each unit is closed (in the worst case, each one is done separately)?  In essence, you need an ordering of the existing units of the position and an idea of whose unit will go next.  If the number of participants is very high, it would also be nice not to have to store the number of units left to each of them at all times.  This is key, since persisting that information can be a painful limit on performance as each update will cause a side-effect to an underlying database.

My initial approach was to leverage position_split to figure out what the position distribution would be for (U-1) total units and then just compare that with the result for U to find out which participant lost a unit.  This does not work.  Back to our example: plugging in 19 for the position size we get U[0] = 6, U[1] = 7, and U[2] = 6.  This is a problem since the middle participant (i=1) actually gained a unit when we reduced the position.

Since position_split is not enough, and we still want to avoid storing detailed per-participant data on the remaining position as it gets reduced, we return to considering how to order the units of the beginning position in a way that fairly strips them from the participants.  How can the ordering of the units sold off be deterministic and, therefore, require no storage?

Partial Binary Trees 


A binary tree is a tree with two paths emanating from each non-leaf node in the tree.  The leaves can be addressed using the binary coded path from the root to the leaf.  A partial binary tree is a binary tree which is missing leaves beyond a particular coded address.  It should be obvious that a unique partial binary tree exists with U leaves for any positive integer U.  Consider the following tree for U = 5 (as an example):


I have written the coded binary path (0s and 1s for lefts and rights) on the leaves backwards.  Those backwards path numbers can be arranged as a sequence of unsigned binary numbers S = (0, 4, 2, 1, 3).  In this case, the sequence has no gaps (missing numbers), but that is not the case in general.  Using OS = sort(S) to denote the ascending totally ordered sequence for S, and OS^-1 for the map taking a value back to its index in OS, we can say that after reducing the position by r units the next (ordered) unit to be picked is OS^-1[S[OS^-1[r]]].  For the above tree, OS = (0,1,2,3,4), so unit 0 goes first, unit 4 goes second, etc.  When there are gaps in OS, the point of using its inverse is clearer.
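
Here is a sketch of how S can be computed (my reconstruction, under the assumption that the partial tree with U leaves fills its deepest level from the left; the real code may differ):

  #include <vector>

  // Reverse the low 'bits' bits of x.
  static unsigned reverse_bits(unsigned x, unsigned bits) {
    unsigned r = 0;
    for(unsigned i = 0; i < bits; ++i) {
      r = (r << 1) | (x & 1);
      x >>= 1;
    }
    return r;
  }

  // S[i] is the reversed root-to-leaf path code of leaf i (left to right)
  // in the unique partial binary tree with U leaves.
  std::vector<unsigned> leaf_codes(unsigned U) {
    std::vector<unsigned> S(U, 0);
    if(U <= 1) return S;
    unsigned depth = 0;
    while((1u << depth) < U) ++depth;              // depth = ceil(log2(U))
    unsigned deep = 2 * (U - (1u << (depth - 1))); // leaves at full depth
    for(unsigned i = 0; i < U; ++i) {
      S[i] = (i < deep)
           ? reverse_bits(i, depth)                        // deepest leaves
           : reverse_bits(deep/2 + (i - deep), depth - 1); // shallower ones
    }
    return S;
  }

For U = 5 this yields S = (0, 4, 2, 1, 3), matching the tree above; for U = 6 it yields (0, 4, 2, 6, 1, 3), whose sorted form has a gap at 5.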

Essentially, for each trade which is originally U units in size, we can keep track of the number of units sold so far (r) and figure out which participants lose the next t units by evaluating OS^-1 • S • OS^-1 at r, r+1, ..., r+t-1.  The position_split function indicates which participant originally had each of the units (e.g. participant 0 gets units 0 to U[0]-1, assuming U[0] > 0).  I have empirically observed that the ordering of sold units using this technique is reasonably fair -- at each point the relative number of units each participant holds is roughly in line with their initial proportion amounts P[i].


Summary

Splitting the initial trade properly requires a technique which is sensitive to rounding and floating point arithmetic.  It distributes the units of a trade to several participants using their proportion amounts P[i], and it ensures that the right total number of units is distributed.

The reversed binary codes for a partial binary tree provide a deterministic permutation which can be used to fairly sell out a position, so that at any point the participants still hold their proportion of what is left in aggregate.  Knowing only the number of units already sold (r) and the additional number to be sold (t), an algorithm can tell you which participants are losing how many units.  The code to efficiently compute the permutation is a bit involved, so I have not posted it here.  The basics needed to reconstruct the approach are here, however. 



Thursday, September 20, 2012

Trailing Stops: Rotating in two dimensions

Trailing stop orders are a computational challenge since they are stateful and react to every price change.  In 2008, I looked into the problem of implementing a data structure which could do better than a naive container at dealing with trailing stop orders.  In this post, I will describe the structure that I came up with, and critique it a little.

Hitching a Ride


A trailing stop order is designed to attempt to lock in profit.  A stop order S (which is less fancy, and should be understood first) is typically used to close an open trade T at price P by specifying a market price at which the trade T has accumulated the maximum allowable loss.  If T is a long position, then the stop order specifies a price lower than P; otherwise (a short position) it specifies a price higher than P.  A trailing stop order TS is specified similarly, but its trigger price creeps along with the market so that it is never further from the market price than it started out.  Using this to control risk for trade T is advantageous since it sort of locks in profit.  If the price moves against T, then the TS price stays put (acting like a watermark).  If you want more explanation of how this works, please have a look at this helpful video.

I should stress that my purpose is to describe how to implement trailing stops, not to endorse their use in any specific context.  Trading is risky, and strategic decisions to use various order types should be made with due caution.


Interface Design


Trailing stops for a particular instrument will enter the system with an insert() method, and be removable using a remove() method.  It is assumed that each of these orders has a unique integer identifier which can be used to refer to them.  Although the data structure should be concerned with the efficiency of these housekeeping methods, the high-frequency operations of deepest concern are related to price movements.

The same structure should be used for orders which are buy or sell; it is only really important to know whether to use bid or ask prices, and to have an idea of which way is "up" or "down".  Prices will be dealt with internally as integers relative to the current market price (units which are pips or pipettes).  A trailing stop order has a pair of integers which describe its state (trailing_stop, trailing_amount).  To insert a new order which is 10 units off of market, the pair (10,10) is put into the structure.  The diagram below shows valid values for resting trailing stop orders (trailing amounts are not allowed below 1 and they cannot exceed the trailing stop):



Trailing amounts, which track how far the trailing stop order is from the market price, will change on the orders as market prices (Px) are applied to the structure.  The two different sides (buy and sell) will each have a data structure which reacts differently to market prices:


TS type   Px side   above/below market   Px increases    Px decreases
BUY       ASK       above                price_down()    price_up()
SELL      BID       below                price_up()      price_down()


If the price moves "down" by a single price unit we call price_down() which notionally changes the state of every trailing stop order from (ts,ta) to (ts,ta-1), and all (ts,0) state orders are removed as triggered orders.  A naive implementation would need to touch every order to update these records.

If the price moves "up" by a single price unit we call price_up() which notionally changes the state of every trailing stop order from (ts,ta) into (ts,min(ts,ta+1)).  Again, a naive implementation would need to touch every order to update all of the trailing amount records which is quite expensive.


Rotating Array of Arrays


In order to simplify the job of updating the trailing amounts, we can use the trailing amount to classify and index each order.  This means that we can, in many cases, just have an array of orders move its position (changing the effective trailing amounts for all of its contents at once).  Consider the following array of arrays of lists (the inner arrays are connected by dotted lines, and each order with a particular (ts,ta) value is added to the list held at that box):


The outer array is the part we will rotate.  If the top box in the diagram (blue grid) is at position [0,0] of the NxN array, then an order with state (ts,ta) should be at logical array position [N-ts+ta-1,ts-1] (in the above diagram N=5).  The top "triangle" of boxes are unused since they do not represent reachable (ts,ta) states.

Completing a price_up() is only a matter of merging the red boxes at (ts,ts) up into the array above and rotating the whole top-level array (which is done in the usual way with modulus operations and an index variable):


The new empty lists (white boxes) simply appear as a result of the rotation, and the green boxes are not disturbed.  The magenta boxes are an inner array which moves as a result of the outer array rotation.

To do a price_down() operation, a similar trick happens.  The diagonal order list elements (trailing amount 1) boxes (shown as red) are removed, and the array is rotated in the other direction:


The new empty array that holds all the new (ts,ts) orders is shown as empty magenta boxes on the lower edge.  As a practical matter, the number of price unit levels supported (the value of N) has to be a fixed, predetermined value.  As described, the data structure will work very well for large numbers of orders, but will use up considerable space when empty (an NxN array).  To mitigate this, the inner arrays (blue boxes linked with dotted lines) could be represented more sparsely with skip-lists or binary trees.  If the remaining (top-level) array is still consuming too much memory, something could be done to make it sparse as well (allowing a larger value of N).
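
To make the index arithmetic concrete, here is a minimal sketch of the structure (the class name, int order ids, and the use of std::list are my assumptions; it assumes N >= 2 and dense inner arrays rather than the sparse representation discussed above):

  #include <list>
  #include <vector>

  // The list at logical cell [N-ts+ta-1][ts-1] holds the ids of orders in
  // state (ts,ta).  Rotating the outer array is just bumping an offset, so
  // whole diagonals of trailing amounts change without touching any order.
  class TrailingStopBook {
    int N;       // supported price levels (1 <= ta <= ts <= N)
    int offset;  // current rotation of the outer array
    std::vector<std::vector<std::list<int> > > cell; // [physical row][col]

    std::list<int> & at(int ts, int ta) { // ta == 0 addresses a spare cell
      return cell[(N - ts + ta - 1 + offset) % N][ts - 1];
    }
  public:
    explicit TrailingStopBook(int n)
      : N(n), offset(0)
      , cell(n, std::vector<std::list<int> >(n)) {}

    void insert(int order_id, int ts, int ta) {
      at(ts, ta).push_back(order_id);
    }

    // (ts,ta) -> (ts,min(ts,ta+1)): merge the capped (ts,ts) diagonal into
    // the cells one row up, then rotate everything else for free.
    void price_up() {
      for(int ts = 1; ts <= N; ++ts)
        at(ts, ts - 1).splice(at(ts, ts - 1).end(), at(ts, ts));
      offset = (offset + N - 1) % N;
    }

    // (ts,ta) -> (ts,ta-1): pull off the ta == 1 diagonal as the triggered
    // orders, then rotate the other way.
    std::list<int> price_down() {
      std::list<int> triggered;
      for(int ts = 1; ts <= N; ++ts)
        triggered.splice(triggered.end(), at(ts, 1));
      offset = (offset + 1) % N;
      return triggered;
    }
  };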

Summary

Using a benchmark which consisted of:

  1. inserting 2 million orders at a spread of (ts,ta) values
  2. price_up() 100 times
  3. price_down() 100 times
  4. insert 200k more random orders
  5. price_down() 100 times
  6. price_up() 100 times
The data structure described above (with sparse BST inner arrays) was 20 times faster than a naive implementation built around the glib hash table.  It is an interesting case of using a frame of reference to get free work done (rotating array) and mapping the problem to a data structure that attempts to reduce the number of operations.

If the structure has evenly distributed (ts,ta)-valued trailing stop orders in it, price_up() and price_down() are both O(N) for the above structure.  In the case of a vector, list, or hash table, something closer to O(N*N) is observed, as every element gets touched.  I have always been interested to know if something more space efficient, but with similar properties, could be imagined in the future.



Monday, July 23, 2012

OFlux: Detached Nodes

Tasty



Previously, I described the anti-pattern of blocking while locking, and also how the OFlux run-time escapes this pitfall.  If a node in your program tends to do two or more blocking system calls, each one will be intercepted by the shim mechanism to give up the run-time's main mutex.  The context switching in a case like that can be optimized if the node does not cause side-effects within the rest of the program (mostly modifying non-local state).  The optimization is to enter the node's C++ function as if it were one big system call (releasing the main run-time mutex for the duration of the function's execution).  This saves context switching on the mutex and conceivably increases the concurrency in the program (node events for these detached nodes are now able to run independently of the run-time more often).  Here is how we augment the basic state diagram for each run-time thread to accommodate this idea:



For nodes that are declared detached, the ability to run in the new mode (dotted-line box on the right side) is available when the run-time sees that there are enough threads to allow this.  These two dotted boxes indicate the states where the thread is not holding the main run-time mutex.

Example: Sleeping Beauty and the Seven Dwarves


Within a working copy of the OFlux Github repo, you can create a new directory called src/examples/dwarves with the following ex-contents.mk make file:

$(info Reading ex-contents.mk $(COMPONENT_DIR))

OFLUX_PROJECT_NAME:=dwarves

include $(SRCDIR)/Mk/oflux_example.mk

$(OFLUX_PROJECT_NAME)_OFLUX_CXXFLAGS+= -DHASINIT -DHASDEINIT

The dwarves.flux file describes the flow:

node SnowWhite () => (int apple_id);
node Dwarf (int apple_id) => ();
source SnowWhite -> Dwarf;

The C++ code for these nodes is pretty simple. Every 0.10 ms SnowWhite sends out an apple, and a Dwarf picks it up and does ten 0.10 ms sleeps in order to consume it (in mImpl_dwarves.cpp):

#include "OFluxGenerate_dwarves.h"
#include "OFluxRunTimeAbstract.h"
#include <sys/time.h>
#include <unistd.h>
#include <cstdlib>

long dwarf_count = 0;
extern oflux::shared_ptr<oflux::runtimeabstract> theRT;

int
SnowWhite(const SnowWhite_in *
        , SnowWhite_out * out
        , SnowWhite_atoms *)
{
        static int apples = 0;
        out->apple_id = apples++;
        if(apples>10000) {
                theRT->hard_kill();
        }
        usleep(100);
        return 0;
}

int
Dwarf(    const Dwarf_in * in
        , Dwarf_out *
        , Dwarf_atoms *)
{
        __sync_fetch_and_add(&dwarf_count,1);
        for(size_t i = 0;i < 10; ++i) {
                usleep(100);
        }
        return 0;
}

I have also added code to produce statistics when the program exits:
struct timeval tv_start;

void
deinit()
{
        struct timeval tv_end;
        gettimeofday(&tv_end,0);
        double total_time = tv_end.tv_sec-tv_start.tv_sec
                + (tv_end.tv_usec - tv_start.tv_usec)
                   /1000000.00;
        double dps = dwarf_count / total_time;
        printf("ran %lf seconds, dispatched %lf "
               "dwarves per second\n"
                , total_time
                , dps);
}

void
init(int argc,char * argv[])
{
        atexit(deinit);
        gettimeofday(&tv_start,0);
}

As is, the dwarves.flux flow will produce the following output on my Asus 1000HE netbook (which has 2 hardware contexts and 1 core):

 # ./builds/_Linux_i686_production/run-dwarves.sh \
   2> /dev/null  | grep ran
ran 5.109480 seconds, dispatched 1957.146324 dwarves per second

Detaching Dwarves


But if we make the Dwarf node detached (which I claim will likely be of benefit, since the run-time mutex no longer has to be given up and re-acquired around each shimmed usleep call):

node SnowWhite () => (int apple_id);
node detached Dwarf (int apple_id) => ();
source SnowWhite -> Dwarf;

Re-running the test, we can see that we are running a little faster:

 # ./builds/_Linux_i686_production/run-dwarves.sh \
   2> /dev/null  | grep ran
ran 3.468819 seconds, dispatched 2882.825538 dwarves per second

So detaching nodes can pay off handsomely when it is safe to do so, since it reduces the churn in and out of the main run-time mutex.  It is unsafe to do this if there is something about the node source code which makes it unsafe (e.g. mutating non-local state).  Detached nodes are also useful when making calls to 3rd-party libraries which (themselves) have mutexes -- in order to avoid a deadlock with the run-time mutex.

Tuesday, July 17, 2012

[O]Flux: En garde!

Tasty.


Even event-based programs tend to have some state.  It's best if that state is minimal, shared as narrowly as possible, accessed as briefly as possible, and viewed with suspicion when evaluating performance bottlenecks.  Message passing systems like Akka or Erlang tend to avoid this problem by locating state within the one thread which runs the Actor that owns it.  That has some advantages for that model, but it lacks some flexibility when we consider that machines have lots of memory these days and that memory is meant to be shared (safely of course).

The way this problem was solved in Flux was with atomic constraints.  In OFlux these were generalized to guards (at first there were only exclusive guards).  Suppose we had an object foo of type Foo which we wanted to allow multiple node functions to access -- but only one event at a time.  We would declare an exclusive guard to do this job in our OFlux code as follows:


  exclusive aFooGuard () => Foo *;

A node Bar would be granted access to the Foo * & underlying this using a declaration like:

  node Bar (InputType1 anInput1,..., guard aFooGuard()) 
     => (OutputType1 anOutput1,...);

At the C++ level, our implementation of Bar will have an argument atoms which can be used to get our Foo * & with atoms->aFooGuard().  A node event which has a guard in its declaration like this is said to acquire the guard before entry, and it releases it after the node event has executed.  Initially this will be a reference to a NULL pointer, but we can change that since it is a reference (allocate a new Foo).  Without the "right hand value" part to store a Foo *, you have basically what Flux used for atomic constraints.
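
As a sketch (the body is mine, following the node-function signature pattern from the earlier Dwarf example), Bar's C++ implementation might look like:

  int
  Bar(const Bar_in * in, Bar_out * out, Bar_atoms * atoms)
  {
    Foo * & foo = atoms->aFooGuard();
    if(!foo) {
      foo = new Foo(); // safe to allocate: we hold the exclusive guard
    }
    // ... read/update *foo, populate *out from in ...
    return 0;
  }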

Furthermore, OFlux extends atomic constraints by allowing some data to be stored, and even de-referenced via an index to get at that data.  Consider what happens if we add a couple of keys to aFooGuard (which has a map data structure inside of it to keep track of these "right hand values"):


  exclusive aFooGuard(int k1, long k2) => Foo *;

Now when we acquire the guard with a node declaration we have a chance to use some of the input arguments of the node to formulate key values:


  node Bar (InputType1 anInput1,InputType2 anInput2, 
      ...,
      guard aFooGuard(inthash(anInput1),longhash(anInput2))) 
   => (OutputType1 anOutput1,...);


The C++ functions inthash() and longhash() are not necessary; they just illustrate how we could use code buried in our C++ application to massage the inputs into appropriate keys. The semantics of an exclusive guard are that a Bar event gets exclusive access to the "right hand value" at the particular key combination it is using. For instance, one Bar event might hold aFooGuard(0,0) while another simultaneously works with aFooGuard(0,1). Finer-grained keys lead to more possibilities when considering which events could run concurrently.

The decision to grab a guard does not need to be made universally for a particular node. It is possible to grab the guard conditionally on its input values:


    node Bar (InputType1 anInput1,InputType2 anInput2,...,
        guard aFooGuard(inthash(anInput1),longhash(anInput2))
            if test(anInput1)) 
     => (OutputType1 anOutput1,...);

Only if test(anInput1) returns true will the guard be acquired. The Bar C++ function is made aware of whether the guard is properly acquired and held via the convenience property atoms->have_aFooGuard().

Guards can also be given a precedence which indicates to the run-time the order in which they should be acquired.  The run-time will use that ordering (complaining at compile time if the ordering is not well-formed -- i.e. a cycle is detected in the relation) when processing node events.  All the required guards for an event need to be acquired before it is allowed to execute.


  exclusive aBazGuard Foo * => Baz *;

  precedence aFooGuard < aBazGuard;

When grabbing multiple guards it is possible to use the "right hand value" of one guard as a key to another guard.  This sets up an implicit precedence (one that does not need to be made explicit by the programmer), and it makes the second acquisition conditional on the first having a non-NULL "right hand value":

    node Bar (InputType1 anInput1,InputType2 anInput2,...,
        guard aFooGuard(inthash(anInput1),longhash(anInput2))
            as foo if test(anInput1),
        guard aBazGuard(foo) as baz) 
     => (OutputType1 anOutput1,...);  

The syntax "as foo" lets us give a local name to the acquired guard so that it can be referenced in the key the subsequent aBazGuard acquisition. By necessity, we have to acquire the aBazGuard(foo) after foo and only when foo is not NULL.

OFlux supports (via  macros) the ability to pre-populate a guard with data before the run-time is allowed to start.  It is also possible to use similar macros to walk the guard's entire contents (all keys and values) in circumstances where you can be sure that it is safe.

Exclusive guards are just one of the acquisition semantics available within the zoo of guards.  Other semantics are available and they affect the number of node events which can hold them (at the same key-set values):


Guard Type (keyword)   Pre-populated?   Semantics
exclusive              not needed       at most 1 node event per unique key combination
free                   probably         no restrictions; useful for read-only cases like configuration
readwrite              not needed       read: only other readers allowed
                                        write: only one node event (no readers)
                                        upgradeable: read when non-NULL, write otherwise
pool                   required         from a prepopulated pool for each key-set; as many node
                                        events can run as there are items in the pool

The readwrite guard is special in that a mode (read/write/upgradeable) needs to be specified in the node declaration in order to control the type of acquisition that is done.  A read-only acquisition provides const, non-reference access to the "right hand value".

Summary


Guards provide an abstraction to co-ordinate control of resources (shared data, or other things) within an OFlux program.  They support a variety of concurrent access semantics, can be acquired conditionally, can be pre-populated when initializing the program, and use keys similar to hash tables.  These have evolved within the history of the OFlux tool from practical use-cases that application developers had at various times. 

Monday, July 16, 2012

[O]Flux Event-based Run-time

One of the goals is to avoid doing a long-running blocking system call in the run-time while holding a resource (I mentioned this in my last post on the subject).  If a mutex (for example) is held while a thread in the program is attempting to do some slow I/O, performance can suffer tremendously.  All the other threads in the program which might have made quick use of that resource are stuck waiting.

The Flux event-based run-time tries to avoid this predicament by simplifying the problem.  To obtain more concurrency, system calls which could block are intercepted (or interposed) using the standard UNIX technique with the LD_PRELOAD environment variable.  This means that any application code which calls read(), for example, passes through the Flux run-time to do so (this code is referred to as a shim since it sits between the OS and the application).  If a shimmed system call would block, the run-time will start up another thread or reschedule a ready thread (via a Pthread condition variable).  


The run-time itself has only one master mutex.  This mutex is held whenever application code is running outside of a shimmed system call.  While a C++ application function is blocked in a system call, the chance to release the mutex is used to give another thread which is also running a C++ application function a chance to run (or to schedule a new runnable event which runs such a function).  Runnable events are kept in FIFO order on an internal run queue which is protected by the main mutex.


This state diagram shows what the life of a thread is like in the event-based run-time (initially the FIFO is seeded with one event for every source node, and one thread is told to grab the first FIFOed event):





The darker blue color indicates the state where the C++ application function is executing.  The green states are waits on condition variables (WTR, or waiting to run, is one of them; the other is WIP, or waiting in pool).  Idle threads end up blocked on WIP.  Threads which have completed their system call and want back in on the main mutex need to pass through WTR.  The light blue states are Flux run-time steps which manage the internal run-time state.

Several threads can be in the blocking state at once, waiting for a system call to return, while one thread has the mutex and is doing "blue things".  Consider a simple C++ node function Sleep which calls the shimmed system call sleep():


  int
  Sleep(const Sleep_in * in,Sleep_out * out, Sleep_atoms *)
  {
    sleep(in->sec);
    out->val = in->val;
    return 0;
  }

Since sleep is shimmed, the Flux shim code could check whether the argument is 0 (in which case no blocking happens), possibly saving a context switch caused by the release of the main mutex. For a greater-than-0 value we don't want to hold the mutex during the system call, so it is released with a call to pthread_mutex_unlock(). Then the system call happens and we return into the run-time with a condition variable wait (WTR):


  // the shim_sleep symbol is captured via dlsym in init
  extern "C" unsigned int
  sleep(unsigned int s) {
    if(!s) return 0; // not blocking -- bypass the run-time
    pthread_mutex_unlock(&main_runtime_mtx);
    pthread_cond_signal(&wip_cond); // wake an idle thread to cover for us
    unsigned int res = ((shim_sleep)(s)); // the real sleep()
    pthread_mutex_lock(&main_runtime_mtx);
    pthread_cond_wait(&wtr_cond,&main_runtime_mtx); // the WTR state
    return res;
  }

Imagine a run-time FIFO with 5 Sleep events on it (all with positive in->sec values); the run-time will get 5 threads involved in running those sleeps at the same time (assuming there are 5 threads available to run them).

Summary


In this post I describe the state machine that every run-time thread lives within.  Initially a Flux program will have only a single thread; thread growth is allowed to happen when there are no threads available to signal (WIP) but there are runnable events available.  Concurrency is had by allowing multiple threads to do blocking system calls at the same time.  We can at least be sure when writing the app that in-memory structures are safely accessed atomically between system calls.  With extra mechanisms to keep certain node events from dispatching at the same time, further atomicity guarantees are possible.


[O]Flux High-level Motivation

Tasty.


If you read the original Flux paper carefully, there are several things that the Flux tool provides:


A data-flow-like top-level language which describes
  1. How the C++ functions in the application should be composed to complete the task flow in the program.
  2. How to protect common resources these functions manipulate using atomicity constraints.
  3. The C++ types and arguments passed from one task to another (though the Flux compiler itself does not comprehend the types very deeply).
An event-based run-time on which to run the final program which prioritizes:
  1. Abstracting concurrency issues from the developer writing the lower level  C++ functions as much as possible.
  2. Trying not to block when doing synchronized work
  3. Completing work which it already has underway (so don't begin something new when there is a partially complete and runnable task available).
Other benefits of the Flux approach include

  1. High re-use of existing code (they cite the migration of an existing BitTorrent client to Flux).
  2. Deadlock avoidance is part of the design
  3. Performance analysis and prediction
Although a seemingly cumbersome choice for a small domain-specific language, Java was used to implement the original Flux compiler.  One advantage of this choice is that the compiler is highly distributable as a JAR (build once, run everywhere -- right?).  The generated C code that the compiler produced from the Flux flow description implemented the flow with a giant switch statement.  The Flux language implements choice using a style that is similar to pattern matching in a functional language (so a single conditional successor is possible after a node event has run).

Webserver Example


This example demonstrated nicely how to use the basic parts of the Flux language and how to mentally visualize the program:

/* Your basic web server flow */

node Page (int socket) => ();

node ReadRequest (int socket) 
   => (int socket, bool close, const char* request);

node Reply (int socket, bool close, int length, 
      const char* content, const char* output) 
   => ();

node ReadWrite (int socket, bool close, const char* file)
   => (int socket, bool close, int length, 
    const char* content, const char* output);

node Listen () => (int socket);

node FourOhFour (int socket, bool close, const char* file) 
   => ();

node BadRequest (int socket) => ();

source Listen -> Page;

Page = ReadRequest -> ReadWrite -> Reply;

handle error ReadWrite -> FourOhFour;

handle error ReadRequest -> BadRequest;


Visualized as a graph, the source node Listen listens on an accept socket and passes the request socket to its successor node.  The Page node is just an abstract node, or expression, for what follows (ReadRequest in this case).  ReadRequest reads the request from the socket and determines if there is any HTTP keep-alive in the request.



This flow is straight-through handling, except that there are error handlers along the way for malformed requests (BadRequest) and resources which the webserver does not have (FourOhFour, named after the famed 404 HTTP return code).

In terms of concurrency, the run-time will juggle the system calls that are used to read files, read requests from sockets, and write responses on those sockets.  Without bothering the application programmer very much about it, the ability to have multiple threads reading different resources (possibly separate files that are not paged into virtual memory yet) at the same time is cool.  In this case we say that there are several threads running events for a node like ReadWrite.



Friday, July 13, 2012

Enumeration in an MT world

Poison.



Learning new programming languages is a healthy thing to do as a programmer.  It keeps your mind vital and your skills spry.  It's also a pretty easy thing to do the more you do it, since metaphors are often copied wholesale from one language to another.  For instance, both C and Java can be credibly described as descendants of Algol 68, an early imperative programming language.  They both do loops and conditionals in much the same way, and the semantics of programs written in those languages are mostly understood operationally.

Although it might be difficult, exposure to other families of programming languages, like functional (e.g. OCaml) or declarative (e.g. Prolog), can make modern everyday skills easier to grasp (Javascript or gnu-make).  Without good cross-pollination, old metaphors which are no longer appropriate outstay their welcome.  My main example of this is C pointers.  They are an ancient concept (as old as C is), but I feel they should have stayed where they were.

Long ago, C programs were mostly single threaded, and locations in memory were tracked with pointers which were de-referenced via the * or -> primitive operators.  These were great ideas in that context.  C structs have locations, and those locations (pointers) are useful handles for referring to them.  Consider the following code to iterate through an array:


  static const char * names[] =
    { "mark"
    , "paul"
    , "kevin"
    , 0 };
  for( const char ** n_ptr = &(names[0])
      ; *n_ptr != 0; ++n_ptr) {
    printf("hi %s\n",*n_ptr);
  }

It's pretty low-level, but you can see how n_ptr is used to traverse the list until the last element, a 0 pointer, is encountered (and not printed), at which point we stop iterating.  My claim is that this pattern of using a pointer for enumerating elements is not such a great idea in a multi-threaded context.  The STL tries to do just that with containers in C++ (and that's the problem).  It's an idea that should not have been copied.

Tasty.


Java did it much better by using Enumeration/Iterator interfaces, which I will explain now using C++.  Suppose you have a container or class that will give you (via a member method) a "list" of T objects:

  template< typename AnyT >
  struct Enumerator {
    virtual bool next(const AnyT * &) = 0;
    virtual ~Enumerator() {}
  };

  // now implement it
  Enumerator<T> * 
  SomeCollection::someMethodGivesList()
  { 
    // build/return an Enumerator<T> subclass
    ...
  }

All of this setup gives us a single thing, an Enumerator<T> *, with a simple type which can be drained of its T items easily, and its subclass implementation can be buried inside of SomeCollection.cpp. This has a side benefit of not polluting the compilation and link steps with extra symbols (bloating header files, for instance).
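
For illustration, the buried subclass might look like the following sketch (assuming SomeCollection keeps its items in a std::vector<T> member named items_; both names are hypothetical):

  #include <cstddef>
  #include <vector>

  Enumerator<T> *
  SomeCollection::someMethodGivesList()
  {
    // a local class keeps the implementation out of the header entirely
    struct VectorEnumerator : public Enumerator<T> {
      const std::vector<T> & v_;
      size_t idx_;
      VectorEnumerator(const std::vector<T> & v) : v_(v), idx_(0) {}
      virtual bool next(const T * & out) {
        if(idx_ >= v_.size()) return false;
        out = &v_[idx_++];
        return true;
      }
    };
    return new VectorEnumerator(items_);
  }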

  // now use that implementation 
  SomeCollection sc;
  const T * tptr = 0;
  Enumerator<T> * enumptr = sc.someMethodGivesList();
  while(enumptr && enumptr->next(tptr)) {
    ... // now use tptr
  }
  delete enumptr; // or hold it in a smart pointer

I am a huge fan of how functional languages deal with this by using either fold or map functions, and we could emulate this somewhat with C function pointers (implementing a visitor call-back argument).

  void
  SomeCollection::someMethodVisitsList(
     void * fp
   , void (*oneach_cb)(void *, const T *))
  {
    // ok traverse the internal list here
    // calling (*oneach_cb)(fp,tptr) all the way
  }

This is a bit of a "poor man's" solution compared to what a real functional language can do with its parametric polymorphism capabilities. The opportunity to write a traversal function once is not really there, but could be part-way done with templates.

Poison.


The real problem with the STL using pointers as the basis for enumeration is that the container's iterators are not safely managed:

  for(std::vector<T>::iterator itr = vContainer.begin()
      ; itr != vContainer.end(); ++itr) {
    ...
  }

For starters, the end() could be a moving target if another thread updates vContainer underneath us at an unfortunate time. The job of synchronization was just never considered with single-threaded C and its pointers. Another issue is that iterators from different STL containers are not really interchangeable, and passing a pair of iterators as a result to something is a bit painful.  The Enumerator<AnyT> * is easily held in a smart pointer (perhaps a shared pointer), so it hops a couple of fences there as well.

Summary.


Of course there are details to make either the Enumerator<AnyT> * solution or the visitor call-back solution work safely for multi-threaded programs -- but the narrower interfaces these idioms have make the job easier.

I know that there is an MT-safe library called TBB which implements STL-like containers, but it sounds hard to me -- and I am getting tired of forgetting to increment my "pointers".

Thursday, July 12, 2012

DRBL worked!

Tasty


DRBL is a useful project that provides a way for a master server to be a PXE boot server and fileserver to thin nodes (computers which do not necessarily have any non-volatile storage, e.g. hard disks, SSDs, or flash drives).  I have installed this successfully on Debian Wheezy following the directions.  My hope is that someone finds this documentation of my choices helpful if they are trying to make DRBL run on similar hardware with Debian Wheezy (and I am included in that).

# dpkg --list | grep drbl
ii  clonezilla                           2.5.8-1drbl                    A partition or disk cloning tool
ii  drbl                                 1.10.90-1drbl                  DRBL (Diskless Remote Boot in Linux).
ii  drbl-chntpw                          20110511-1drbl                 NT SAM password recovery utility
ii  drbl-partimage                       0.6.8-1drbl                    Partition Image
ii  freedos                              1.0-16drbl                     A complete, free, 100% MS-DOS compatible operating system.
ii  ipxe                                 20120123-1drbl                 Open source network boot firmware
ii  mkpxeinitrd-net                      1.6.12-1drbl                   Build initial ramdisk images for DRBL clients.
ii  mkswap-uuid                          0.1.1-1drbl                    set up a Linux swap area with UUID can be assigned
ii  partclone                            0.2.45-2drbl                   The utility to clone and restore a partition. 


There are some choices to be made within the installation instructions, and I want to go into some more detail on those.  So first, we'll assume that all the necessary Debian packages have been installed on your master.  The two commands that remain are drblsrv and drblpush:


# /opt/drbl/sbin/drblsrv -i
*****************************************************.
Hint! When a "yes or no" option is available, the default value is uppercase. E.g. (y/N) the default is "N", so when you press "Enter" without typing "Y or N" it will be as if you typed "N" and then "Enter". If you are not sure which option to choose just press "Enter" key.
*****************************************************.
*****************************************************.
Installing DRBL for Debian Linux...
*****************************************************.
Do you want to install the network installation boot images so that you can let the client computer install some GNU/Linux distributions (Debian, Ubuntu, RedHat Linux, Fedora Core, Mandriva, CentOS and OpenSuSE...) via a network connection?  !!NOTE!! This will download a lot of files (Typically > 100 MB) so it might take a few minutes. If the client computer has a hard drive that you may install GNU/Linux onto, put a Y here. If you answer "no" here, you can run "drbl-netinstall" to install them later.
[y/N] n
*****************************************************.
This GNU/Linux distribution uses one kernel to support SMP and non-SMP arch.
*****************************************************.
Do you want to use the serial console output on the client computer(s)?
If you do NOT know what to pick, say "N" here, otherwise the client computer(s) may show NOTHING on the screen!
[y/N] n
The CPU arch option for your clients: 2
The optimization for your system is set to be the same as this server computer.
*****************************************************.
Cleaning the cache of APT to make some settings take effect...
Get:1 http://security.debian.org wheezy/updates InRelease [87.8 kB]
Get:2 http://security.debian.org wheezy/updates/main Sources [937 B]                                    
Get:3 http://security.debian.org wheezy/updates/main amd64 Packages [979 B]   
...  # apt-get update fallout
...
Reading package lists... Done
*****************************************************.
Do you want to upgrade the operating system?
[y/N] n
*****************************************************.
2nd, installing the necessary files for DRBL...
*****************************************************.
Searching if lvm2 ntfs-3g genisoimage mkisofs lshw hwinfo aoetools vblade dmidecode lzop lzma xz xz-utils pxz lzip pigz pbzip2 lbzip2 plzip lrzip hfsutils hfsprogs dmsetup dmraid kpartx device-mapper tofrodos dos2unix unix2dos dhcp3-server isc-dhcp-server gdisk btrfs-tools ufsutils disktype available... 
Package lvm2 exists in repository.
Package ntfs-3g exists in repository.
Package genisoimage exists in repository.
Package lshw exists in repository.
Package hwinfo exists in repository.
Package aoetools exists in repository.
Package vblade exists in repository.
Package dmidecode exists in repository.
Package lzop exists in repository.
Package lzma exists in repository.
Package xz-utils exists in repository.
Package lzip exists in repository.
Package pigz exists in repository.
Package pbzip2 exists in repository.
Package lbzip2 exists in repository.
Package plzip exists in repository.
Package lrzip exists in repository.
Package hfsutils exists in repository.
Package hfsprogs exists in repository.
Package dmsetup exists in repository.
Package dmraid exists in repository.
Package kpartx exists in repository.
Package tofrodos exists in repository.
Package dos2unix exists in repository.
Package isc-dhcp-server exists in repository.
Package gdisk exists in repository.
Package btrfs-tools exists in repository.
Package ufsutils exists in repository.
Package disktype exists in repository.
Package ntfs-3g does not provide the files of old package ntfsprogs, so ntfsprogs will be installed.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package libdigest-sha1-perl
Warning! Some necessary packages are not installed! If you continue, maybe something will go wrong! It is better to exit now and check your /etc/apt/sources.list and internet link!
Press Ctrl-C to stop the program! Or Press "Enter" to continue... 
*****************************************************.
*****************************************************.
Trying to upgrade some necessary packages if available...
*****************************************************.
Searching for the latest kernel in the repository...  kernel ...
The latest kernel in the ayo repository is linux-image-3.2.0-3-amd64
There are 2 kernels available for clients, which one do you prefer?
[1]: kernel 3.2.0-1-amd64 x86_64 (from this DRBL server)
[2]: linux-image-3.2.0-3-amd64 (from APT repository)
[1] 1
Clients will use the kernel 3.2.0-1-amd64 x86_64 from server.
It might take several minutes to install this kernel, please be patient... 
done!
*****************************************************.
Installing kernel for clients... ...
Searching for the latest kernel in the repository... kernel ...
*****************************************************.
Now run: drblsrv-offline -c -d -a -l en_US.UTF-8 -s 3.2.0-1-amd64 "" ""
Using kernel from this server for client...
*****************************************************.
Your OS version is:: Debian Testing-Unstable
*****************************************************.
*****************************************************.
Installing kernel for clients... ... 
The kernel for client is copied from server.
Installing kernel 3.2.0-1-amd64 for clients... 
It might take several minutes to install this kernel, please be patient... ...done!
Generating modules.dep and map files for clients... done!
Preparing the kernel firmware for clients...
*****************************************************.
Creating config file for PXE clients...
Copying pxelinux.0, gpxelinux.0, menu.c32, vesamenu.c32, chain.c32, mboot.c32, sanboot.c32 and memdisk to /tftpboot/nbi_img...
Copying memtest86+ to /tftpboot/nbi_img...
Copying FreeDOS files to /tftpboot/nbi_img/... 
Generating default pxelinux config (/tftpboot/nbi_img/pxelinux.cfg/default)...
Use com32 module: vesamenu.c32
Adding menus for DRBL, local boot, memtest86+, FreeDOS...
done!
*****************************************************.
*****************************************************.
Creating the image files for PXE and Etherboot client computer(s), this will take a few minutes ...
The latest kernel for the DRBL clients is 3.2.0-1-amd64
Running mknic-nbi --kernel 3.2.0-1-amd64 --all --no-modules
Will client check DHCP server name is "drbl" or not: yes
The maximum times to try to get IP address for a client: 5
The pause time after network card is up: 0
The timeout to wait for network card linked (Unit: 0.1 secs): 70
Setting port for udhcpc request to default...
Using the kernel modules from /tftpboot/node_root//lib/modules...
The selected kernel for DRBL clients is: 3.2.0-1-amd64
Kernel 2.6 or 3 was found, so default to use initramfs.
Creating the network boot initrd for PXE clients by: mkpxeinitrd-net -k 3.2.0-1-amd64 -t initramfs   -nf 
Use kernel modules from /tftpboot/node_root//lib/modules/3.2.0-1-amd64.
Trying to include network card firmwares if they exist in /tftpboot/node_root//lib/firmware/...
Calling hook udev...
Creating the initRAMFS image...
Initramfs, remove ramdisk_size/ramdisk_block in /tftpboot/nbi_img/pxelinux.cfg/default if exists...
Finished!
Done!
*****************************************************.
Done!

The next command, drblpush, goes as follows:
# /opt/drbl/sbin/drblpush -i
******************************************************
Hint! When a yes/no option is available, the default value is uppercase, Ex. (y/N), the default is "N", when you press "Enter", it will use "N". If you are not sure which one to choose, you can just press "Enter" key.
******************************************************
Searching the installed packages for DRBL server...This might take several minutes...
Finished searching the installed packages for DRBL server.
******************************************************
------------------------------------------------------
The interactive mode let you supply the information of your DRBL environment.
------------------------------------------------------
------------------------------------------------------
Please enter DNS domain (such as drbl.sf.net):
[cls.ox.com] cls.ox.com
Set DOMAIN as cls.ox.com
------------------------------------------------------
Please enter NIS/YP domain name:
[penguinzilla] 
Set DOMAIN as penguinzilla
------------------------------------------------------
Please enter the client hostname prefix:
This prefix is used to automatically create hostname for clients. If you want to overwrite some or all automatically created hostnames, press Ctrl-C to quit this program now, edit /opt/drbl/conf/client-ip-hostname, then run this program again.
[monet] monet
Set the client hostname prefix as monet
------------------------------------------------------
eth0: IP address 192.168.1.1, netmask 255.255.255.0
wlan0: IP address 10.1.4.209, netmask 255.255.254.0
Configured ethernet card(s) found in your system: eth0 wlan0 
------------------------------------------------------
The public IP address of this server is NOT found.
Which ethernet port in this server is for public Internet accsess, not for DRBL connection?
Available ethernet ports in this server:
eth0 (192.168.1.1), wlan0 (10.1.4.209), 
[wlan0] wlan0
The ethernet port you choose for the WAN connection: wlan0
The ethernet port(s) for DRBL environment:  eth0 
******************************************************
******************************************************
Now we can collect the MAC address of clients!
If you want to let the DHCP service in DRBL server offer same IP address to client every time when client boot, and you never did this procedure, you should do it now!
If you already have those MAC addresses of clients, you can put them into different group files (These files number is the same number of networks cards for DRBL service). In this case, you can skip this step.
This step helps you to record the MAC addresses of clients, then divide them into different groups. It will save your time and reduce the typos.
The MAC addresses will be recorded turn by turn according to the boot of clients,
and they will be put into different files according to the network card in server, file name will be like macadr-eth1.txt, macadr-eth2.txt... You can find them in directory /etc/drbl.
Please boot the clients by order, make sure they boot from etherboot or PXE!
Do you want to collect them?
[y/N] n
******************************************************
OK! Let's continue...
******************************************************
Do you want to let the DHCP service in DRBL server offer same IP address to the client every time when client boots (If you want this function, you have to collect the MAC addresses of clients, and save them in file(s) (as in the previous procedure)). This is for the clients connected to DRBL server's ethernet network interface eth0 ?
[y/N] n
******************************************************
OK! Let's continue, we will set the IP address of clients by "first boot gets IP first" instead of fixed one!
Hostmin: 192.168.1.1
******************************************************
What is the initial number do you want to use in the last set of digits in the IP (i.e. the initial value of d in the IP address a.b.c.d) for DRBL clients connected to this ethernet port eth0.
[1] 10
******************************************************
How many DRBL clients (PC for students) connected to DRBL server's ethernet network interface eth0 ?
Please enter the number: 
[1] 10
******************************************************
The final number in the last set of digits in the client's IP address is "19".
We will set the IP address for the clients connected to DRBL server's ethernet network interface eth0 as: 192.168.1.10 - 192.168.1.19
Accept ? [Y/n] y
******************************************************
OK! Let's continue...
******************************************************
The Layout for your DRBL environment: 
******************************************************
          NIC    NIC IP                    Clients
+-----------------------------+
|         DRBL SERVER         |
|                             |
|    +-- [wlan0] 10.1.4.209 +- to WAN
|                             |
|    +-- [eth0] 192.168.1.1 +- to clients group 0 [ 10 clients, their IP 
|                             |            from 192.168.1.10 - 192.168.1.19]
+-----------------------------+
******************************************************
Total clients: 10
******************************************************
Press Enter to continue... 
******************************************************
------------------------------------------------------
In the system, there are 3 modes for diskless linux services:
[0] Full DRBL mode, every client has its own NFS based /etc and /var.
[1] DRBL SSI (Single system image) mode, every client uses tmpfs based /etc and /var. In this mode, the loading and necessary disk space of server will be lighter. NOTE! (a) The client machine memory is recommended at least 256 MB. (b) The setting and config files of client will not be saved to the DRBL server! They are just used once and will vanish after the machine shutdowns! Besides, if you modify any file in the template client (located in /tftpboot/nodes), you have to run /opt/drbl/sbin/drbl-gen-ssi-files to create the template tarball in /tftpboot/node_root/drbl_ssi/. (c) If you want to provide some file to overwrite the setting in the template tarball when client boots, check /tftpboot/node_root/drbl_ssi/clients/00_README for more details.
[2] I do NOT want to provide diskless Linux service to client.
Which mode do you prefer?
[0] 1
DRBL SSI mode is set, an elegant mode for DRBL is on the way!
******************************************************
------------------------------------------------------
In the system, there are 4 modes available for clonezilla:
[0] Full Clonezilla mode, every client has its own NFS based /etc and /var.
[1] Clonezilla box mode, every client uses tmpfs based /etc and /var. In this mode, the loading and necessary disk space of server will be lighter than that in Full Clonezilla mode. Note! In Clonezilla box mode, the setting and config files of client will not be saved to the DRBL server! They just use once and will vanish after the machine shutdowns!
[2] I do NOT want clonezilla.
[3] Use Clonezilla live as the OS (Operating System) of clients (Testing).
Which mode do you prefer?
[0] 2
No Clonezilla is the system!
******************************************************
******************************************************
The CPU arch for clients when running Clonezilla job: generic
------------------------------------------------------
If there is a local harddrive with swap partition or writable file system in your client machine,
do you want to use that swap partition or create a swap file in  the writable filesystem so that client has more memory to use? (This step will NOT destroy any data in that harddisk)
[Y/n] n
******************************************************
------------------------------------------------------
Which mode do you want the clients to use after they boot?
"1": Graphic mode (X window system) (default),
"2": Text mode.
[1] 2
The clients will use text mode when they boot.
******************************************************
------------------------------------------------------
Do you want to set the root's password for clients instead of using same root's password copied from server? (For better security)
[y/N] n
OK! Let's continue...
------------------------------------------------------
Do you want to set the pxelinux password for clients so that when client boots, a password must be entered to startup (For better security)
[y/N] n
OK! Let's continue...
------------------------------------------------------
Do you want to set the boot prompt for clients?
[Y/n] y
How many 1/10 sec is the boot prompt timeout for clients?
[70] 70
OK! Let's continue...
------------------------------------------------------
------------------------------------------------------
Do you want to use graphic background for PXE menu when client boots?
Note! If you use graphical PXELinux menu, however client fails to boot, you can switch to text mode by running "/opt/drbl/sbin/switch-pxe-bg-mode -m text".
[y/N] 
Use text PXE Linux menu for the client.
------------------------------------------------------
------------------------------------------------------
Do you want to let audio, cdrom, floppy, video and plugdev (like USB device) open to all users in the DRBL client? If yes, we will add all the users to those device groups in the server and client.
[Y/n] y
OK! Let's continue...
------------------------------------------------------
------------------------------------------------------
By using alias interface, every client can have 2 IPs,
one of them is private IP for clients connected to DRBL server, and the other is public IP for clients directly connected to WAN from switch!
Do you want to setup public IP for clients?
[y/N] n
------------------------------------------------------
Do you want to let DRBL clients have an option to run terminal mode? i.e. you want to let that client run remote display (which will mostly use resources of server), say "Y" here.
Note!
0. If you say yes to this option, this will be a very limited environment for client, i.e. NO local access for USB, CD, audio, printer, etc. in client.
1. If your server is not powerful, say "no" here.
2. By saying "yes" here, we will turn on xdmcp,
It is never a safe thing to turn on that.  Setting up /etc/hosts.allow and /etc/hosts.deny to only allow local access is another alternative but not the safest.
Firewalling port 177 is the safest if you wish to have xdmcp on.
Read the manual for more notes on the security of XDMCP.
Please set it by yourself!
3. If you say "yes" here, you might have to restart your desktop environment manager (gdm/kdm) later, remember to save your data before you close applications!
Do you want to let client has an option to run terminal mode?
[y/N] n
OK! Let's continue...
------------------------------------------------------
------------------------------------------------------
Do you want to let DRBL server as a NAT server? If not, your DRBL client will NOT be able to access Internat.
[Y/n] 
OK! Let's continue...
------------------------------------------------------
------------------------------------------------------
Do you want to keep the old setting of existing DRBL clients if they exist?
[Y/n] 
We will try to keep the setting of the DRBL clients if they already exist.
******************************************************
******************************************************
The running kernel in the server supports NFS over TCP!
Note! If you change the running kernel in the server, and not sure whether the kernel supports NFS over udp or tcp, you'd better to re-run "drblpush -i" again to avoid the client boots in failure!
Press Enter to continue... 
------------------------------------------------------
******************************************************
The calculated NETWORK for eth0 is 192.168.1.0.
******************************************************
******************************************************
We are now ready to deploy the files to system! 
Do you want to continue?
Warning! If you go on, your firewall rules will be overwritten during the setup!
The original rules will be backuped as iptables.drblsave in system config directory (/etc/sysconfig or /etc/default).
[Y/n] 
******************************************************
OK! Let's do it!
------------------------------------------------------
Checking the necessary disk space... done!
Copying the config file to /etc/drbl... done!
Backup the original /etc/hosts as /etc/hosts.drblsave... done!
Generate the /etc/hosts for clients connected to eth0... done!
Cleaning the stale files of the diskless nodes if they exist... done!
*****************************************************.
*****************************************************.
The version number for your GNU/Linux: DBN-TU
Keeping the old common root files if they exist... 
Keeping old nodes if they exist... 
Creating common root files... This might take several minutes...........Copying normal dir /lib32 to /tftpboot/node_root/...Copying normal dir /lib64 to /tftpboot/node_root/... done!
Update the kernel for client if necessary... 
The DRBL client uses x86_64 kernel with version 3.2.0-1-amd64...
Trying to update the /tftpboot/node_root/lib/modules/3.2.0-1-amd64 from server's /lib/modules/... This might take several minutes...
Found kernel modules in /lib/modules/3.2.0-1-amd64 and its arch "x86_64" matches client's "x86_64"...
Syncing /lib/modules/3.2.0-1-amd64 to client's common root...
Syncing /boot/*-3.2.0-1-amd64* to client's common root...
Generating the /tftpboot/node_root/lib/modules/3.2.0-1-amd64/modules.dep
Syncing /lib/firmware/ to client's common root...
Copying the directory /etc/ to clients common root /tftpboot/node_root...
Cleaning the ssh key file ssh_host_dsa_key copied from server... done!
Cleaning the ssh key file ssh_host_dsa_key.pub copied from server... done!
Cleaning the ssh key file ssh_host_rsa_key copied from server... done!
Cleaning the ssh key file ssh_host_rsa_key.pub copied from server... done!
Commenting the TCPwrapper related file /tftpboot/node_root/etc/hosts.deny copied from server... done!
Commenting the TCPwrapper related file /tftpboot/node_root/etc/hosts.allow copied from server... done!
cp: `/lib64/ld-linux-x86-64.so.2' and `/tftpboot/node_root/lib64/ld-linux-x86-64.so.2' are the same file
cp: `/lib64/ld-linux-x86-64.so.2' and `/tftpboot/node_root/lib64/ld-linux-x86-64.so.2' are the same file
The startup services for DRBL client are:
firstboot rpcbind nis nfs-common ssh dbus kbd acpid cups binfmt-support drblthincli arm-wol sendsigs umountfs
cp: `/lib64/ld-linux-x86-64.so.2' and `/tftpboot/node_root/lib64/ld-linux-x86-64.so.2' are the same file
cp: `/lib64/ld-linux-x86-64.so.2' and `/tftpboot/node_root/lib64/ld-linux-x86-64.so.2' are the same file
cp: `/lib64/ld-linux-x86-64.so.2' and `/tftpboot/node_root/lib64/ld-linux-x86-64.so.2' are the same file
cp: `/lib64/ld-linux-x86-64.so.2' and `/tftpboot/node_root/lib64/ld-linux-x86-64.so.2' are the same file
Using udev for clients... Set text mode for Debian DRBL client...
cp: `/lib64/ld-linux-x86-64.so.2' and `/tftpboot/node_root/lib64/ld-linux-x86-64.so.2' are the same file
cp: `/lib64/ld-linux-x86-64.so.2' and `/tftpboot/node_root/lib64/ld-linux-x86-64.so.2' are the same file
cp: `/lib64/ld-linux-x86-64.so.2' and `/tftpboot/node_root/lib64/ld-linux-x86-64.so.2' are the same file
cp: `/lib64/ld-linux-x86-64.so.2' and `/tftpboot/node_root/lib64/ld-linux-x86-64.so.2' are the same file
Deleting the accounts (except root) in the clients common root template... cp: `/lib64/ld-linux-x86-64.so.2' and `/tftpboot/node_root/lib64/ld-linux-x86-64.so.2' are the same file
done!
Enabling the NIS client in the common root template... done!
Creating some necessary files in the clients common root template...... done!
Creating DRBL client: monet010 192.168.1.10... Generating SSH host keys for client 192.168.1.10 if they do not exist... done!
Display manager:"gdm3"...
Setting node 192.168.1.10 as normal_login... done!
mv: target `S99single' is not a directory
Creating DRBL client: monet011 192.168.1.11... Pseudo client is created for DRBL SSI or clonezilla box mode! done!
Creating DRBL client: monet012 192.168.1.12... Pseudo client is created for DRBL SSI or clonezilla box mode! done!
Creating DRBL client: monet013 192.168.1.13... Pseudo client is created for DRBL SSI or clonezilla box mode! done!
Creating DRBL client: monet014 192.168.1.14... Pseudo client is created for DRBL SSI or clonezilla box mode! done!
Creating DRBL client: monet015 192.168.1.15... Pseudo client is created for DRBL SSI or clonezilla box mode! done!
Creating DRBL client: monet016 192.168.1.16... Pseudo client is created for DRBL SSI or clonezilla box mode! done!
Creating DRBL client: monet017 192.168.1.17... Pseudo client is created for DRBL SSI or clonezilla box mode! done!
Creating DRBL client: monet018 192.168.1.18... Pseudo client is created for DRBL SSI or clonezilla box mode! done!
Creating DRBL client: monet019 192.168.1.19... Pseudo client is created for DRBL SSI or clonezilla box mode! done!
Template client for DRBL SSI, Clonezilla box mode or Clonezilla live client is 192.168.1.10
Using template host /tftpboot/nodes/192.168.1.10
Generating SSH host keys for client 192.168.1.10 if they do not exist... done!
Generating the files for DRBL single system image template... root... etc... var... 
Modifying option diskless_client_os in drbl-ocs.conf...
Disable the password in pxelinux simple menu for all clients... 
Disabling PXE password in config file /tftpboot/nbi_img/pxelinux.cfg/default... 
done!
Now add necessary services to this DRBL server: DHCP, TFTP, NFS, NIS...
Generating the NFS exports for DRBL clients... 
Backup the original /etc/exports as /etc/exports.drblsave
Exporting to clients by IP address line-by-line...
The /etc/exports setting is ok now!
Now generate the firewall rules for NAT service...
Stop the NAT service first...
Flushing firewall rules: success
ip_forward is already on.
Now set the YP securenets...
Backup the original /etc/ypserv.securenets as /etc/ypserv.securenets.drblsave
The /etc/ypserv.securenets setting is done!
Update YP...
Now add the service:  rpcbind isc-dhcp-server nis nfs-common nfs-kernel-server tftpd-hpa drbl-clients-nat
Force to add rpcbind service in this Debian DRBL server...
Force to add isc-dhcp-server service in this Debian DRBL server...
Force to add nis service in this Debian DRBL server...
Force to add nfs-common service in this Debian DRBL server...
Force to add nfs-kernel-server service in this Debian DRBL server...
Force to add tftpd-hpa service in this Debian DRBL server...
Force to add drbl-clients-nat service in this Debian DRBL server...
Now start the service:  rpcbind isc-dhcp-server nis nfs-common nfs-kernel-server tftpd-hpa drbl-clients-nat
Stopping rpcbind daemon....
Starting rpcbind daemon....
Stopping ISC DHCP server: dhcpd.
Starting ISC DHCP server: dhcpd.
Stopping NIS services: ypbind ypserv ypppasswdd ypxfrd.
Starting NIS services: ypserv yppasswdd ypxfrd ypbindbinding to YP server.............................failed (backgrounded).
.
Stopping NFS common utilities: idmapd statd.
Starting NFS common utilities: statd idmapd.
Stopping NFS kernel daemon: mountd nfsd.
Unexporting directories for NFS kernel daemon....
Exporting directories for NFS kernel daemon....
Starting NFS kernel daemon: nfsd mountd.
Stopping HPA's tftpd: in.tftpd.
Starting HPA's tftpd: in.tftpd.
Stopping the NAT services for DRBL clients... Now stop the NAT service...
Flushing firewall rules: success
done!
Starting the NAT services for DRBL clients... done!
ip_forward is already on.
The display manager in this DRBL server is "gdm3"
Clean all the previous saved config file if they exist...done!
Turn on the boot prompt for PXE client...done!
Turn off the thin client option in PXE boot menu...done!
Modifying /tftpboot/nbi_img/pxelinux.cfg/default to let DRBL client use text PXE boot menu... done!
DRBL SSI mode. Set clientdir opt for label drbl in pxelinux config... 
Setting drbl_mode="drbl_ssi_mode" in /etc/drbl/drbl_deploy.conf and /etc/drbl/drblpush.conf... done!
Clonezilla service is set as unavailable. Set clientdir opt for label clonezilla in pxelinux config... 
Setting clonezilla_mode="none" in /etc/drbl/drbl_deploy.conf and /etc/drbl/drblpush.conf... done!
*****************************************************.
Adding normal users to group "audio cdrom plugdev floppy video"........ done!
*****************************************************.
Updating the YP/NIS for group...
Note! If you add new or remove accounts in the DRBL server in the future, remember to run the following command again, so that some group (EX:plugdev) will be updated:
tune-debian-dev-group-perm -g "audio cdrom plugdev floppy video" -e
*****************************************************.
Enjoy DRBL!!!
http://drbl.nchc.org.tw; http://drbl.name
NCHC Free Software Labs, Taiwan. http://free.nchc.org.tw
*****************************************************.
If you like, you can reboot the DRBL server now to make sure everything is ready...(This is not necessary, just an option)
*****************************************************.
The DRBL server is ready! Now set the client machines to boot from PXE or Etherboot. (refer to http://drbl.sourceforge.net for more details)
NOTE! If Etherboot is used on client computers, version 5.4.0 or newer is required!
P.S. The config file is saved as /etc/drbl/drblpush.conf. Therefore if you want to run drblpush with the same config again, you may run it as: /opt/drbl/sbin/drblpush -c /etc/drbl/drblpush.conf

Note that any changes on the master, such as newly installed packages that modify /var or /etc, will only take effect on the nodes after drblpush is run again.
A big thanks to the folks at NCHC in Taiwan and the other maintainers of this tool. It's very useful and seems to "just work" for my case.

Wednesday, July 11, 2012

OFlux

Tasty.



Many years ago, I was looking for a way to write C++ applications quickly using re-usable components with an event-based metaphor. The problems we were dealing with felt like describing an assembly line that turned requests into responses.  Each stage of the assembly line could be described by fairly basic functions which mutate state in some way.  Some casting around for solutions to this problem led me to Emery Berger et al.'s Flux project, which was described in a USENIX 2006 paper.  I really liked how this model abstracted away the writing of thread code.  We take this for granted now with event-based programming paradigms, but at the time it looked like a promising line of attack for us.

For several reasons, we decided to rewrite the Flux tool and add features to it that our application developers needed.  First of all, the compiler was redone from scratch in OCaml instead of Java, and then the run-time was redone in C++, loading flow information via XML.  These choices made the code a bit more flexible, and -- to be fair -- moving away from expedient graduate-student project code had to happen for production use.

Account Cache Example


Consider the small example (below) of an integer-indexed account cache containing C++ Account objects.

 exclusive account_lock (int account) => Account *;
 node Start () => (int account);
 node Finish (int account, guard account_lock(account)) => ();
 source Start -> Finish;


Using the exclusive guard account_lock as a resource, the flow consisting of C++ functions Start() and Finish() operates as follows:


  • Start() generates integers referring to the keys of account_lock, which are accessed by the Finish() function.
  • Start() may be doing I/O (reading a socket or a file) to obtain the account integer keys.
  • Multiple Finish() events can run at the same time on different accounts (the semantics of the exclusive keyword).
  • The account_lock guard resource is held for the duration of the Finish() function invocation and is de-referenced with the account integer key.

Here is the shell C++ code for Finish, which populates the object from the heap when the guard de-references to NULL:
int Finish(const Finish_in * fi
         , Finish_out *
         , Finish_atoms * fa)
{
 // First access under this account key: populate the guard's
 // slot from the heap so later events find the cached object.
 if(fa->account_lock() == NULL) {
  fa->account_lock() = new Account(fi->account);
 }
 ...
 return 0;
}

This example demonstrates how the different sub-tasks of producing and consuming work in the program can be decoupled to increase concurrency in a beneficial way.  Often, for the purpose of testing or demonstrating, it is sufficient to write an example source node (one which, like Start(), generates events from nothing) using a call to sleep() and some formula for where the input comes from.  As in the Flux run-time, issuing a blocking system call like sleep() causes the run-time to consult its queue of runnable events to see whether another thread can either finish (preferred) or initiate an event execution.  This is detected via interposition of the system calls, so the application developer is unaware that it is happening.
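
As an illustration only, here is a minimal sketch of what such a source node might look like.  It assumes the generated types follow the same Node_in/Node_out/Node_atoms convention shown above for Finish(); the round-robin key formula is just a stand-in for real input:

#include <unistd.h> // for sleep()

int Start(const Start_in *
        , Start_out * so
        , Start_atoms *)
{
 static int next_key = 0;
 // Pretend to wait on I/O; the run-time interposes on the
 // blocking sleep() call and schedules other runnable events.
 sleep(1);
 // Emit the next account key for downstream Finish() events.
 so->account = next_key++ % 100;
 return 0;
}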

New Repo


The OFlux repo is now publicly available on GitHub and released under AGPLv3.  If you are interested, please have a look at the src/examples sub-directory for more examples (documentation is coming).

Tuesday, July 10, 2012

Nagle's Algorithm

Poison.


Every now and then a little history rears up and teaches us something.  Nagle's algorithm is a legacy part of TCP which economizes on TCP headers by holding up tiny quantities of packet payload to see if more data follows soon (if it doesn't, the tiny packet is sent out after a delay).  This is insidious for an application that distributes real-time market data (prices of things).  Holding up a single price because it is not travelling with a pack of other prices degrades the market data stream in a non-uniform way.

I discovered the effects of Nagle's algorithm (and how to disable it via setsockopt with TCP_NODELAY in the Berkeley sockets interface) when dealing with our FIX rates service.  A trans-pacific connection showed poor transmission quality at the time, and it was a bit of a mystery why so many of the prices were taking so long to get out.


The graph shows the maximum, average and minimum transit times for a single hour on the trans-pacific link before Nagle's algorithm was disabled.  The disturbing thing is that every 10-second period had a wide spread between the minimum time for a price to transit and the maximum, which indicated that something else was at play.  A one-way trip from the eastern seaboard of the United States to Tokyo takes in effect a bit less than 100 msec (Akamai keeps stats which indicate they can achieve that sort of latency).

Cure.


The C code to turn Nagle's algorithm off is pretty simple. One important catch is that it must be done on both sides of the connection (or at least on any side which writes small packets).
  int on = 1;
  int nodelay_ret = setsockopt(fd, IPPROTO_TCP,
        TCP_NODELAY, (void *)&on, sizeof(on));
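If it helps, here is the same call wrapped as a small self-contained helper with the necessary headers and an error check (the name disable_nagle is my own):

  #include <netinet/in.h>   // IPPROTO_TCP
  #include <netinet/tcp.h>  // TCP_NODELAY
  #include <sys/socket.h>   // setsockopt
  #include <stdio.h>        // perror

  // Returns 0 on success, -1 on failure (with errno set).
  int disable_nagle(int fd)
  {
   int on = 1;
   if(setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
                 (void *)&on, sizeof(on)) < 0) {
    perror("setsockopt(TCP_NODELAY)");
    return -1;
   }
   return 0;
  }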
It's also pretty easy to detect dynamically using rudimentary tracing tools on Linux:
  strace -f -e trace=setsockopt -p `pidof <program>`
Or on Solaris:
  truss -f -tsetsockopt -p `pgrep -x <program>`
While running these traces, you need the program to accept a connection somehow, so that it has a chance to show you whether it is setting this esoteric option.


The payoff is huge.  Min and max are very close together once this issue is fixed (and if there are many hops, the fix needs to go into several places).  It was cathartic in the end to discover that a fairly recent Java library for FIX called QuickFIX had properly disabled Nagle's algorithm, and that our custom ("faster") C code needed fixing.
