Last year, together with @PierreZ, we managed to integrate a Rust workload into the fdbserver simulation. We use FoundationDB's ExternalWorkload to load a C++ wrapper that calls into Rust. It worked perfectly with version 7.1 but stopped working with 7.3: we now hit a segmentation fault when instantiating our custom workload. The crash happens before any Rust code is called, so everything stays in “C++ land”. Here is the simplified code of our wrapper:
class MyWorkloadFactory: public FDBWorkloadFactory {
public:
    MyWorkloadFactory(FDBLogger* logger): FDBWorkloadFactory() {
        logger->trace(FDBSeverity::Info, "MyWorkloadFactory", {});
    }
    virtual std::shared_ptr<FDBWorkload> create(const std::string& name) {
        // segmentation fault here:
        std::cout << "MyWorkloadFactory::create(" << name << ")" << std::endl;
        return std::make_shared<MyWorkload>(name);
    }
};
extern "C" FDBWorkloadFactory* workloadFactory(FDBLogger* logger) {
    static MyWorkloadFactory factory(logger);
    return &factory;
}
Here is the fdbserver call site from ExternalWorkload.actor.cpp:
FDBWorkloadFactory* (*workloadFactory)(FDBLogger*);
...
auto wName = ::getOption(options, "workloadName"_sr, ""_sr);
...
workloadFactory = reinterpret_cast<decltype(workloadFactory)>(loadFunction(library, "workloadFactory"));
if (workloadFactory == nullptr) {
    ...
    return;
}
workloadImpl = (*workloadFactory)(FDBLoggerImpl::instance())->create(wName.toString());
I went through all the relevant header files looking for a meaningful change in the contract but found nothing; most of them haven’t changed in years. I then looked at the generated assembly and found that fdbserver and our workload seem to represent the type const std::string& differently. From the disassembly and gdb, it seems that fdbserver passes the const std::string& as a pointer in rdx, which points to a memory region that looks like:
struct fdb_string {
    uint8_t size;            // twice the length of the string
    char content[size / 2];  // pseudocode: the characters follow the size byte
};
This pointer seems to be 8-byte aligned. Strangely, the size is stored in a single byte and is double the length of the string. It feels more like a custom type than a standard library one, but I couldn’t find anything among your Arena, StringRef and similar types that matches exactly.
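To make that layout concrete, here is a minimal sketch of how a buffer shaped like the one observed above could be decoded. The helper name decodeObservedString is hypothetical, and the layout is only what I inferred from gdb, not a documented format.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes a buffer shaped like the region seen in gdb,
// i.e. one byte holding twice the length, immediately followed by the characters.
// The layout is an inference from the disassembly, not a documented format.
std::string decodeObservedString(const unsigned char* raw) {
    assert(raw != nullptr);
    const std::size_t length = raw[0] >> 1; // the size byte stores 2 * length
    return std::string(reinterpret_cast<const char*>(raw + 1), length);
}
```

Since the length lives in a single byte, this reading can only describe short strings (at most 127 characters); longer names presumably use some other representation that I have not observed.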
On our workload side, create seems to expect the string to be passed as a pointer to a pointer in rdx, and the memory region it ultimately points into should look like:
struct w_string {
    uint64_t length;
    uint64_t _1;
    uint64_t _2;
    char content[length];  // pseudocode: the characters follow the three quadwords
};
I don’t know what the two quadwords are for (it resembles a basic_string, but I’m not sure). Strangely, rdx is a double indirection that doesn’t lead to the start of the struct but to the “content” field. The length is retrieved by reading 24 bytes before the address stored in the pointer that rdx points to.
For those interested, here is the simplified assembly of the workload trying to use the string:
mov r13,rdx
mov rsi,QWORD PTR [r13+0x0] // &w_string.content
mov rdx,QWORD PTR [rsi-0x18] // *(&w_string.content-24) = w_string.length
This last instruction is the cause of the segfault, as the value rsi takes when fdbserver makes the call is not a valid pointer.
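The lookup performed by those instructions can be mimicked in plain C++. This is only a sketch of the layout as I understand it: ObservedHeader, makeObservedBlock and lengthBehindContent are hypothetical names, and the meaning of the two unknown quadwords is left open.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical mirror of the 24 bytes that sit in front of the characters.
struct ObservedHeader {
    std::uint64_t length;
    std::uint64_t unknown1; // purpose not identified
    std::uint64_t unknown2; // purpose not identified
};
static_assert(sizeof(ObservedHeader) == 24, "header is expected to be 24 bytes");

// Builds one contiguous block: header first, characters right after it.
std::vector<unsigned char> makeObservedBlock(const std::string& text) {
    const ObservedHeader header{text.size(), 0, 0};
    std::vector<unsigned char> block(sizeof(header) + text.size());
    std::memcpy(block.data(), &header, sizeof(header));
    std::memcpy(block.data() + sizeof(header), text.data(), text.size());
    return block;
}

// Reads the length the way the workload's assembly does: follow the handle to
// the characters, then step back 24 bytes to reach the length field.
std::uint64_t lengthBehindContent(const char* const* handle) {
    assert(handle != nullptr && *handle != nullptr);
    const char* content = *handle;                    // mov rsi, [r13]
    std::uint64_t length = 0;
    std::memcpy(&length, content - sizeof(ObservedHeader),
                sizeof(length));                      // mov rdx, [rsi-0x18]
    return length;
}
```

If the handle instead holds bytes in the single-size-byte layout, the first load yields a value that is not an address at all, which is consistent with the crash on the second load.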
Here is the simplified assembly of the fdbserver calling create:
test rax,rax // if (workloadFactory == nullptr)
je 0x39cacc6
lea rdi,[rip+0x2310c8d] // FDBLoggerImpl::instance()
call rax // (*workloadFactory)(logger)
mov r13,rax // factory pointer returned by workloadFactory
movsxd rbx,DWORD PTR [rbp-0x50] // ??? (a 32-bit value, possibly the name's length)
lea eax,[rbx+rbx*1] // eax = 2 * rbx
mov BYTE PTR [rbp-0x188],al // fdb_string.size
mov rax,QWORD PTR [r13+0x0] // vtable pointer of the factory
lea rdi,[rbp-0x90] // ??? (possibly the hidden return slot for the shared_ptr)
lea rdx,[rbp-0x188] // fdb_string
mov rsi,r13 // "this" pointer of the factory
call QWORD PTR [rax] // vtable[0], FDBWorkloadFactory::create(fdb_string)
I tried to fix this issue by artificially reconstructing the string in my workload:
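class MyWorkloadFactory: public FDBWorkloadFactory {
public:
    ...
    virtual std::shared_ptr<FDBWorkload> create(const std::string& mangled) override {
        // Reinterpret the argument using the layout observed from fdbserver:
        // one size byte (twice the length) followed by the characters.
        const unsigned char* raw = reinterpret_cast<const unsigned char*>(&mangled);
        std::string name(reinterpret_cast<const char*>(raw + 1), raw[0] >> 1);
        return std::make_shared<WorkloadTranslater>(name);
    }
};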
This works: I get the expected name back. But I then hit another segmentation fault a bit later in FDBWorkloadContext::getOption, also due to a mismatched string representation.
I suspect this is due to a difference in how the two binaries are compiled. We use the official FoundationDB Docker image (foundationdb/build:centos7-20240228040135-fc272dd89b) with devtoolset-11. Which Docker image and devtoolset versions did you use to compile the 7.3 fdbserver? Could they have changed between 7.1 and 7.3?
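One way to check this suspicion on the workload side is to print which C++ standard library and string ABI the wrapper is built against. The sketch below relies on the identification macros that libc++ (_LIBCPP_VERSION) and libstdc++ (__GLIBCXX__, _GLIBCXX_USE_CXX11_ABI) define; describeStandardLibrary is a hypothetical helper name. I don't know how the 7.3 fdbserver was built, so comparing against it remains an assumption.

```cpp
#include <string>

// Hypothetical helper: reports the standard library this translation unit is
// compiled against, plus sizeof(std::string), which differs between
// implementations and between string ABIs.
std::string describeStandardLibrary() {
    std::string description;
#if defined(_LIBCPP_VERSION)
    description = "libc++ " + std::to_string(_LIBCPP_VERSION);
#elif defined(__GLIBCXX__)
    description = "libstdc++ " + std::to_string(__GLIBCXX__);
#if defined(_GLIBCXX_USE_CXX11_ABI)
    description += " (_GLIBCXX_USE_CXX11_ABI=" +
                   std::to_string(_GLIBCXX_USE_CXX11_ABI) + ")";
#endif
#else
    description = "unknown standard library";
#endif
    description += ", sizeof(std::string)=" + std::to_string(sizeof(std::string));
    return description;
}
```

Logging this string from the factory constructor, through the FDBLogger already passed to workloadFactory, would show what the workload side uses; the fdbserver side would still have to be confirmed from its build configuration.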