Cpp implementation of an ExternalWorkload

Hi, I am working with @PierreZ on the integration of a Rust Workload in the fdbserver simulation. To do so I wrote a C++ shared object (libcppwrapper.so) to be loaded by the ExternalWorkload of fdbserver. The C++ workload exposes bindings that I can then use in Rust. In the end I simply call fdbserver -r simulation -f ./test.txt with test.txt as follows:

testTitle=MyTest
  testName=External
  workloadName=TestWorkload
  libraryPath=path/to/lib/
  libraryName=cppwrapper

This scheme can already link the methods of ExternalWorkload (init, setup, start, check and getMetrics) to a Rust implementation. Two things are currently not working: I can’t call the logger’s trace method (either through an FDBLogger instance or an FDBWorkloadContext instance), and calling most C bindings from libfdb_c.so results in segmentation faults. To narrow this down I simplified the project by removing the Rust part: the C++ code now calls the logger and the C bindings directly, but I get the exact same errors.

Here is the github repository (GitHub - PierreZ/test-fdb-workload) with all the code and a CI that reproduces the situation I’m describing.

Do you know what might be the cause of this?


One problem I see so far is that you need to call fdb_select_api_version. See C API — FoundationDB 7.1.

There is an example external workload here that seems to work: foundationdb/SimpleWorkload.cpp at 1328c343231e447c1822ba1edea8c315c194955e · apple/foundationdb · GitHub. It does get compiled and linked with some very specific flags (which you can see if you build with ninja -v or make VERBOSE=1), and it’s possible that some of them are necessary to work properly.

Thanks for the quick answer. Indeed, adding fdb_select_api_version fixed some problems, and C bindings like fdb_get_client_version now work correctly. It didn’t fix everything, though: the trace method of the logger still misbehaves, and more complex C bindings like fdb_future_block_until_ready crash fdbserver. From what we can see, most of the pointers passed between ExternalWorkload and our workload implementation seem to be affected. As you suggested, we also think this is due to differences in compilation on both sides. But we can’t make sense of the compilation commands for (for example) SimpleWorkload.cpp:


/opt/rh/devtoolset-8/root/usr/bin/c++ -DBOOST_ERROR_CODE_HEADER_ONLY -DBOOST_SYSTEM_NO_DEPRECATED -DNO_INTELLISENSE -Dc_workloads_EXPORTS -I/root/foundationdb -I. -Ibindings/c -I/root/foundationdb/bindings/c -Ibindings/c/foundationdb -O3 -DNDEBUG -fPIC -DCMAKE_BUILD -ggdb -fno-omit-frame-pointer -mavx -Wno-pragmas -Wno-attributes -Wno-error=format -Wunused-variable -Wno-deprecated -fvisibility=hidden -Wreturn-type -fPIC -Wclass-memaccess -DHAVE_OPENSSL -std=gnu++17 -MD -MT bindings/c/CMakeFiles/c_workloads.dir/test/workloads/SimpleWorkload.cpp.o -MF bindings/c/CMakeFiles/c_workloads.dir/test/workloads/SimpleWorkload.cpp.o.d -o bindings/c/CMakeFiles/c_workloads.dir/test/workloads/SimpleWorkload.cpp.o -c /root/foundationdb/bindings/c/test/workloads/SimpleWorkload.cpp

&& /opt/rh/devtoolset-8/root/usr/bin/c++ -fPIC -O3 -DNDEBUG -static-libstdc++ -static-libgcc -Wl,--version-script=/root/foundationdb/bindings/c/external_workload.map,-z,nodelete -shared -Wl,-soname,libc_workloads.so -o share/foundationdb/libc_workloads.so bindings/c/CMakeFiles/c_workloads.dir/test/workloads/workloads.cpp.o bindings/c/CMakeFiles/c_workloads.dir/test/workloads/SimpleWorkload.cpp.o -Wl,-rpath,/root/build_output/lib lib/libfdb_c.so && :

Using github actions we created two branches:

You can see the results and log files on their respective CI pipeline.

We have separated the two cases for isolation purposes, but we believe the two problems are related.

We tried to compile the example and make it work locally but couldn’t. First, it seems the example is incomplete: FDBWorkloadFactoryImpl::create looks up the name of the workload in a static _factories map, but SimpleWorkload doesn’t seem to be registered there. And manually instantiating SimpleWorkload later yields the same errors (bad trace logging and segfaults on C bindings) as our workload implementation, which further supports the theory of a compilation difference.

Do you see any other possible cause for these problems? Do you think they are related? And if it is a compilation difference, can you see why?

Any idea why we are experiencing SIGNAL: Segmentation fault (11) in this block?

Hi, we have started working on this project again and found a lot of new things. We would really appreciate confirmation of some points, as many were “reverse engineered”.

First of all, for the compilation problem we haven’t identified why exactly this happens but found a temporary fix by simply compiling in the official foundationdb container. Apart from replacing “g++” by “c++”, our Makefile remains unchanged, so the problem is not due to compiler arguments.

With a logger and database finally working, we started working on the Rust bindings again. After many hours of debugging, we arrived at this mental representation of what’s going on:

  • fdb calls the setup method of the workload

  • fdb can’t do anything until setup returns, i.e. any future created in setup won’t be resolved until we exit, thus waiting for a future in setup results in a deadlock

  • fdb won’t call start until the GenericPromise done is resolved, so if no boolean is ever sent to done it results in a deadlock

  • once done is resolved, fdb calls start and the database can move in memory, so any callback set by setup that runs after this point and holds on to the previous database pointer will most likely crash when trying to use it

  • done is a GenericPromise and holds a smart pointer to an FDBPromise; if its last reference is dropped, fdb notices and throws a “broken_promise” error, so we have to be careful when passing it between C++ and Rust (through a C interface)
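To make the two failure modes of done concrete, here is a small self-contained Rust model of how we currently understand it. MockPromise, Outcome and outcome_after are hypothetical names invented for this sketch and are not part of fdb. The only behavior taken from our observations is that dropping the last reference reports a broken promise, while keeping a reference alive and never sending leaves the simulator waiting forever.

```rust
use std::sync::{Arc, Mutex};

/// What the (mock) simulator would observe for a `done` promise.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    Pending,    // nothing sent yet: the simulator keeps waiting
    Sent(bool), // the workload called send
    Broken,     // last reference dropped without sending
}

/// Shared part of the promise, dropped when the LAST handle goes away.
struct Inner {
    outcome: Arc<Mutex<Outcome>>,
}

impl Drop for Inner {
    fn drop(&mut self) {
        let mut o = self.outcome.lock().unwrap();
        if *o == Outcome::Pending {
            *o = Outcome::Broken;
        }
    }
}

/// Hypothetical stand-in for GenericPromise<bool>: cloning bumps a ref count.
#[derive(Clone)]
struct MockPromise(Arc<Inner>);

impl MockPromise {
    fn send(&self, value: bool) {
        *self.0.outcome.lock().unwrap() = Outcome::Sent(value);
    }
}

/// Runs one "step method" with a fresh promise and reports what the
/// simulator would see once that step has returned.
fn outcome_after(step: impl FnOnce(MockPromise)) -> Outcome {
    let outcome = Arc::new(Mutex::new(Outcome::Pending));
    let done = MockPromise(Arc::new(Inner { outcome: outcome.clone() }));
    step(done);
    let result = *outcome.lock().unwrap();
    result
}
```

In this model, outcome_after(|done| drop(done)) reports Broken, a step that clones done, drops the original and later calls send(true) on the clone reports Sent(true), and a step that leaks its handle with std::mem::forget stays Pending, which corresponds to the deadlock case.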

It also seems that start and check behave exactly the same way. More generally it seems that whenever execution is granted to the workload (either in a “step method” like setup or a callback we defined), fdb is paused (we assume this is done to preserve determinism). So we have to chain callbacks every time we want to block on actions that return a future.

As chaining callbacks by hand is very verbose (and we would like to use foundationdb-rs, which abstracts the raw bindings and uses Rust Futures), we tried to use async/await in a blocking runtime on a separate thread:

fn setup(&mut self, db: Database, done: GenericPromise<bool>) {
    std::thread::spawn(move || {
        // on separate thread, create and poll futures
        runtime.block_on(async {
            // it crashes here, as we use db
            let trx = db.create_trx().unwrap();
            trx.set_read_version(42);
            // sets a callback and waits for it to be called
            // similar to fdb_future_block_until_ready
            let version1 = trx.get_read_version().await.unwrap();
            // "chained future"
            let version2 = trx.get_read_version().await.unwrap();
            done.send(true);
        });
    });
    // returns execution to fdb so futures will be resolved
}

but this crashes as soon as we try to use db and it seems to be due to running on a thread that is not managed by fdb.

The “pseudo code” of a working version of the above example would be something like:

fn setup(&mut self, db: Database, done: GenericPromise<bool>) {
    let trx = db.create_trx();
    // ask for the read version and return immediately; fdb runs callback1 later
    let f = fdb_transaction_get_read_version(trx);
    fdb_future_set_callback(f, callback1, CallbackData { trx, done });
}
fn callback1(f: *mut FDBFuture, data: CallbackData) {
    let mut version1 = 0;
    fdb_future_get_int64(f, &mut version1);
    // "chained future": the next step is registered from inside the callback
    let f = fdb_transaction_get_read_version(data.trx);
    fdb_future_set_callback(f, callback2, data);
}
fn callback2(f: *mut FDBFuture, data: CallbackData) {
    let mut version2 = 0;
    fdb_future_get_int64(f, &mut version2);
    // only now tell fdb that setup is finished
    data.done.send(true);
}

We think we can emulate the second version by writing our own runtime, which would simplify the code to:

fn setup(&mut self, db: Database, done: GenericPromise<bool>) {
    our_runtime_callback(async {
        let trx = db.create_trx().unwrap();
        trx.set_read_version(42);
        let version1 = trx.get_read_version().await.unwrap();
        let version2 = trx.get_read_version().await.unwrap();
        done.send(true);
    });
}
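A rough sketch of the kind of runtime we have in mind, using only the Rust standard library. The idea is that waking a task polls it again directly on the thread that fired the callback, so nothing ever runs on a thread fdb does not manage. MockSim and MockFuture are hypothetical stand-ins for the simulator and an FDB future, and the body of our_runtime_callback is our assumption, not something fdb provides. The sketch also assumes a callback never fires while the task is still being polled.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

type BoxFuture = Pin<Box<dyn Future<Output = ()> + Send>>;

/// One spawned async block. `None` while it is being polled or once finished.
struct Task(Mutex<Option<BoxFuture>>);

impl Wake for Task {
    // Waking polls inline, i.e. on the thread that ran the callback.
    fn wake(self: Arc<Self>) {
        poll_task(&self);
    }
}

fn poll_task(task: &Arc<Task>) {
    // Take the future out so the lock is not held during the poll.
    let taken = task.0.lock().unwrap().take();
    let Some(mut fut) = taken else { return };
    let waker = Waker::from(task.clone());
    let mut cx = Context::from_waker(&waker);
    if fut.as_mut().poll(&mut cx).is_pending() {
        *task.0.lock().unwrap() = Some(fut);
    }
}

/// Polls `fut` once on the current thread, then returns to the caller.
fn our_runtime_callback(fut: impl Future<Output = ()> + Send + 'static) {
    let boxed: BoxFuture = Box::pin(fut);
    poll_task(&Arc::new(Task(Mutex::new(Some(boxed)))));
}

/// Mock future state: a value that arrives later, plus the waker to call.
#[derive(Default)]
struct Slot {
    value: Option<i64>,
    waker: Option<Waker>,
}

struct MockFuture(Arc<Mutex<Slot>>);

impl Future for MockFuture {
    type Output = i64;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<i64> {
        let mut slot = self.0.lock().unwrap();
        match slot.value {
            Some(v) => Poll::Ready(v),
            None => {
                slot.waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}

/// Mock simulator: requests are only resolved when `step` is called.
#[derive(Default)]
struct MockSim {
    queue: Mutex<VecDeque<(Arc<Mutex<Slot>>, i64)>>,
}

impl MockSim {
    fn get_read_version(&self, answer: i64) -> MockFuture {
        let slot = Arc::new(Mutex::new(Slot::default()));
        self.queue.lock().unwrap().push_back((slot.clone(), answer));
        MockFuture(slot)
    }

    /// Resolves the oldest request and runs its callback (the waker).
    fn step(&self) -> bool {
        let next = self.queue.lock().unwrap().pop_front();
        let Some((slot, answer)) = next else { return false };
        let waker = {
            let mut s = slot.lock().unwrap();
            s.value = Some(answer);
            s.waker.take()
        };
        if let Some(w) = waker {
            w.wake();
        }
        true
    }
}

/// Mimics setup: spawn, return to the simulator, then let it run callbacks.
fn demo() -> Vec<String> {
    let sim = Arc::new(MockSim::default());
    let log = Arc::new(Mutex::new(Vec::new()));
    let (s, l) = (sim.clone(), log.clone());
    our_runtime_callback(async move {
        let v1 = s.get_read_version(41).await;
        l.lock().unwrap().push(format!("v1={v1}"));
        let v2 = s.get_read_version(42).await;
        l.lock().unwrap().push(format!("v2={v2}"));
        l.lock().unwrap().push("done.send(true)".to_string());
    });
    log.lock().unwrap().push("setup returned".to_string());
    while sim.step() {}
    let out = log.lock().unwrap().clone();
    out
}
```

In demo, the async block only makes progress while the mock simulator is running a callback, so "setup returned" is logged before either version is read. That is the same ordering as the hand-written callback chain.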

But we would like to be sure we really understand how we are supposed to use the simulator before sinking more hours into it and potentially finding out that it simply can’t work like that because we overlooked something. So, did we understand correctly how it works so far? Are we wrong on some points? And can you think of any important points we didn’t mention?

On a side note, I had some trouble passing along the done GenericPromise. The solution I got working is the following (in the C++ wrapper):

struct RustWorkload;

template<typename T>
struct Wrapper {
        T inner;
};

extern "C" void rust_setup(
    RustWorkload*,
    FDBDatabase*,
    Wrapper<GenericPromise<bool>>*
);

class WorkloadTranslater: public FDBWorkload {
    private:
        RustWorkload* rustWorkload;
    public:
        virtual void setup(
            FDBDatabase* db,
            GenericPromise<bool> done
        ) override {
            // this increments the ref counter as done is copied
            auto wrapped = new Wrapper<GenericPromise<bool>> { done };
            rust_setup(this->rustWorkload, db, wrapped);
        } // the ref counter is decremented as done goes out of scope
};

and Rust can call FDBPromise_send_bool which is defined in C++:

extern "C" void FDBPromise_send_bool(
    Wrapper<GenericPromise<bool>>* promise,
    bool val
) {
        promise->inner.send(val);
        delete promise;
}

Does this seem reasonable? Or can you think of a better/simpler solution?

Just letting you know, @Wonshtrum managed to make it work :tada:

We now have an interface/trait:

pub trait RustWorkload {
    fn description(&self) -> String;
    fn init(&mut self, context: WorkloadContext) -> bool;
    fn setup(&'static mut self, db: SimDatabase, done: Promise);
    fn start(&'static mut self, db: SimDatabase, done: Promise);
    fn check(&'static mut self, db: SimDatabase, done: Promise);
    fn get_metrics(&self) -> Vec<FDBPerfMetric>;
    fn get_check_timeout(&self) -> f64;
}

that we can implement to create transactions like so:

impl RustWorkload for MyWorkload {
    // ...
    fn setup(&'static mut self, db: SimDatabase, mut done: Promise) {
        fdb_rt(async move {
            let trx = db.create_trx().expect("cannot create a transaction");
            let setup_key = &("setup_ok").pack_to_vec();
            trx.set(setup_key, "true".as_ref());
            trx.commit().await.expect("could not commit");

            self.context.trace(
                FDBSeverity::Info,
                "Successfully setup workload".to_string(),
                vec![("Uuid".to_string(), self.uuid.to_string())],
            );

            println!("setup({}) done", &self.uuid);
            done.send(true);
        })
    }
    // ...
}

This is still early, but we will open-source it once we are actually using it to validate our internal layer-sdk :upside_down_face: