Friday, July 27, 2012

Enumeration on the stack

Allocating objects in C++ on the heap is not free. Although there are many allocator implementations that are very fast (and particularly good for multi-threaded applications), it is still preferable to avoid allocating too many things on the heap. In Java programs there is not much choice in the matter, since nearly everything lives on the heap; fortunately garbage collection in Java has gotten a whole lot better. In this post I revisit the enumeration interface I described earlier and provide two implementations of it: one uses the program stack only, and the other relies on the heap.


Enumerators Again



In Enumerator.h I declare the same generic enumerator interface as before, along with some convenience C macros that call a function to create the enumerator and then traverse it:


template< typename AnyT >
struct Enumerator {
  virtual bool next(const AnyT * &) = 0;
  virtual ~Enumerator() {}
};

The stack macros within Enumerator.h are as follows (note the explicit call to the virtual destructor; the matching "placement new" happens inside the library function that receives the buffer):

#define stack_for_each_start(EFUN,Tp,X) \
 { /*new scope*/ \
  char _enum_buff[EFUN##__storage_size]; \
  typedef Tp _enum_Tp; \
  Enumerator<Tp> * _enum##X = EFUN(_enum_buff); \
  Enumerator<Tp> * _enum = _enum##X; /*for cleanup*/ \
  const Tp * X = 0; \
  while((_enum##X) && (_enum##X)->next(X)) {

#define stack_for_each_end \
 } if(_enum) { (_enum)->~Enumerator<_enum_Tp>(); } }  

So that is reasonably ugly, as most C macros are, but the payoff on the application side is coming -- be patient! Here are the heap-allocating macros:

#define heap_for_each_start(EFUN,Tp,X) \
 { /*new scope*/ \
  Enumerator<Tp> * _enum##X = EFUN(); \
  Enumerator<Tp> * _enum = _enum##X; /*for cleanup*/ \
  const Tp * X = 0; \
  while((_enum##X) && (_enum##X)->next(X)) {

#define heap_for_each_end \
 } delete (_enum); }


Our Library



In order to write our application library header file app.h, we want to minimize the amount of junk we expose to the user application code. Here we declare a single function app::get_all_ts() which returns the enumerator, using existing storage if it is provided as an argument (which is our way of allocating on the stack):

#include "Enumerator.h"
#include <cstdlib>

namespace app {
  struct T { int id; char name[40]; }; // T defined here
  Enumerator<T> * get_all_ts(void * storage = 0);
  extern const size_t get_all_ts__storage_size;
} // namespace app

The app::get_all_ts__storage_size symbol is provided so that callers know how much storage the underlying enumerator implementation needs. Its name must be exactly the function name with __storage_size appended, because the stack macro forms it by token pasting (EFUN##__storage_size). The implementation of the library code behind app::get_all_ts hides almost all of the detail of the collection being enumerated and of how the enumerator implementation works:


#include "app.h"
#include <new>

namespace app {

T t1 = { 1, "sam" };
T t2 = { 2, "bill" };
T * tarray[] = { &t1, &t2, 0 };

struct AllEnumeratorImpl : public Enumerator<T> {
 AllEnumeratorImpl() : at(&(tarray[0])) {}
 virtual bool next(const T * & tp) {
  if((tp = *at)) {
   ++at;
   return true;
  }
  return false;
 }
private:
 T ** at;
};

Enumerator<T> * 
get_all_ts(void * storage)
{
 return storage
  ? new(storage) AllEnumeratorImpl()
  : new AllEnumeratorImpl();
}

const size_t get_all_ts__storage_size = 
        sizeof(AllEnumeratorImpl);

} // namespace app

My simple C array tarray of objects is just an example. A real library would have a more interesting collection with less static content.

User Code



Finally we can use these mechanics to do a traversal in the ex.cpp user code:


#include "app.h"
#include <cstdio>

int
main()
{
  stack_for_each_start(app::get_all_ts, app::T, tptr)
    printf("%d %s\n", tptr->id, tptr->name);
  stack_for_each_end
  return 0;
}

Building and running the ex.cpp code we get:

 % g++ -g app.cpp ex.cpp -o ex
 % ./ex
 1 sam
 2 bill


Checking for Leaks



Using the heap_ version of the macros available in Enumerator.h produces the same output. The difference in the Valgrind output for the two implementations is a bit interesting (first, the stack macros):

% valgrind --tool=memcheck --leak-check=full ./ex
==29306== Memcheck, a memory error detector.
==29306== Copyright (C) 2002-2008, and GNU GPL'd, by Julian Seward et al.
==29306== Using LibVEX rev 1884, a library for dynamic binary translation.
==29306== Copyright (C) 2004-2008, and GNU GPL'd, by OpenWorks LLP.
==29306== Using valgrind-3.4.1-Debian, a dynamic binary instrumentation framework.
==29306== Copyright (C) 2000-2008, and GNU GPL'd, by Julian Seward et al.
==29306== For more details, rerun with: -v
==29306== 
1 sam
2 bill
==29306== 
==29306== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 17 from 1)
==29306== malloc/free: in use at exit: 0 bytes in 0 blocks.
==29306== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==29306== For counts of detected errors, rerun with: -v
==29306== All heap blocks were freed -- no leaks are possible. 

Comparing this with the heap implementation, which does see one allocation happen:

% valgrind --tool=memcheck --leak-check=full ./ex
==29288== Memcheck, a memory error detector.
==29288== Copyright (C) 2002-2008, and GNU GPL'd, by Julian Seward et al.
==29288== Using LibVEX rev 1884, a library for dynamic binary translation.
==29288== Copyright (C) 2004-2008, and GNU GPL'd, by OpenWorks LLP.
==29288== Using valgrind-3.4.1-Debian, a dynamic binary instrumentation framework.
==29288== Copyright (C) 2000-2008, and GNU GPL'd, by Julian Seward et al.
==29288== For more details, rerun with: -v
==29288== 
1 sam
2 bill
==29288== 
==29288== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 17 from 1)
==29288== malloc/free: in use at exit: 0 bytes in 0 blocks.
==29288== malloc/free: 1 allocs, 1 frees, 8 bytes allocated.
==29288== For counts of detected errors, rerun with: -v
==29288== All heap blocks were freed -- no leaks are possible.

At this point we also know that neither method leaks memory or segfaults, so from the user program's point of view they are equivalent.


Performance



To compare performance I made a new sample, ex_timing.cpp, with no I/O but 10 million iterations:

#include "app.h"
#include <cstdio>

int
main()
{
  for(long i = 0; i < 10000000; ++i) {
    heap_for_each_start(app::get_all_ts, app::T, tptr)
    heap_for_each_end
  }
  return 0;
}

I ran each version (stack and heap) 3 times, compiled with -O3, and picked the middle user time to get this table:

 macro prefix | user time for 10 million iterations
 stack        | 292 ms
 heap         | 960 ms


The stack version is over three times faster!

Summary


For local things like enumerators, the usual way of encapsulating the functionality had us returning a heap-allocated object which the user code had to delete (via our macro) when it was done. That was quite functional, but it carries a performance penalty relative to what can be done with the stack. The trouble with the stack is that the memory for the enumerator object has to be "allocated" before the call into the library function that returns the enumerator -- which forced us to expose the size of the enumerator implementation somehow. I think that minor symbol pollution is an acceptable cost for the performance benefit. Of course, I wish there were a way to do this that did not involve C macros.
