On Tue, Jul 18, 2023 at 11:51 AM Mike Snitzer <snitzer@kernel.org> wrote:

>
> But the long-standing dependency on VDO's work-queue data
> struct is still lingering (drivers/md/dm-vdo/work-queue.c). At a
> minimum we need to work toward pinning down _exactly_ why that is, and
> I think the best way to answer that is by simply converting the VDO
> code over to using Linux's workqueues.  If doing so causes serious
> inherent performance (or functionality) loss then we need to
> understand why -- and fix Linux's workqueue code accordingly. (I've
> cc'd Tejun so he is aware).
>

We tried this experiment and did indeed see significant performance
differences: nearly a 7x slowdown in some cases.

VDO can be pretty CPU-intensive. In addition to hashing and compression, it
scans some big in-memory data structures as part of the deduplication
process. Some data structures are split across one or more "zones" to
enable concurrency (usually split based on bits of an address or something
like that), but some are not, and a couple of the threads that own the
unsplit structures can sometimes exceed 50% CPU utilization, even 90%,
depending on the system and test data configuration. (Usually this is while
pushing over 1 GB/s through the deduplication and compression processing on
a system with fast storage. On a slow VM with spinning storage, the CPU
load is much smaller.)

We use a sort of message-passing arrangement where a worker thread is
responsible for updating certain data structures as needed for the I/Os in
progress, rather than having the processing of each I/O contend for locks
on the data structures. It gives us some good throughput under load, but it
does mean upwards of a dozen handoffs per 4 kB write, depending on
compressibility, whether the block is a duplicate, and various other
factors. At 1 GB/s that is roughly 250,000 4 kB writes per second, so we
end up handling over 3M messages per second, though each step of processing
is generally lightweight. For our dedicated
worker threads, it's not unusual for a thread to wake up and process a few
tens or even hundreds of updates to its data structures (likely benefiting
from CPU caching of the data structures) before running out of available
work and going back to sleep.
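
To make the handoff pattern concrete, here is a minimal userspace sketch of
a single worker that owns a piece of state and drains its mailbox in
batches. It is an analogy using POSIX threads, not VDO code: the names
(struct mailbox, mailbox_post, worker_main, run_demo) and the "delta"
payload are all made up for illustration.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* One queued update; "delta" stands in for whatever a real message carries. */
struct message {
    struct message *next;
    long delta;
};

/* Mailbox shared by producers and the single worker that owns "total". */
struct mailbox {
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
    struct message *head, *tail;  /* pending messages, FIFO order */
    bool stopping;
    long total;    /* touched only by the worker, so it needs no lock */
    long wakeups;  /* number of non-empty batches the worker handled */
};

/* Producers only touch the queue, never the worker-owned state. */
static void mailbox_post(struct mailbox *mb, struct message *msg)
{
    msg->next = NULL;
    pthread_mutex_lock(&mb->lock);
    if (mb->tail)
        mb->tail->next = msg;
    else
        mb->head = msg;
    mb->tail = msg;
    pthread_mutex_unlock(&mb->lock);
    pthread_cond_signal(&mb->nonempty);
}

static void *worker_main(void *arg)
{
    struct mailbox *mb = arg;

    for (;;) {
        pthread_mutex_lock(&mb->lock);
        while (!mb->head && !mb->stopping)
            pthread_cond_wait(&mb->nonempty, &mb->lock);
        /* Detach everything that is pending in one step. */
        struct message *batch = mb->head;
        bool stop = mb->stopping;
        mb->head = mb->tail = NULL;
        pthread_mutex_unlock(&mb->lock);

        if (batch)
            mb->wakeups++;
        /* Apply the whole batch without holding any lock. */
        while (batch) {
            struct message *next = batch->next;
            mb->total += batch->delta;
            free(batch);
            batch = next;
        }
        if (stop)
            return NULL;
    }
}

/* Post n messages of value 1 from the caller; return the worker's total. */
static long run_demo(long n, long *wakeups_out)
{
    struct mailbox mb = { .head = NULL };
    pthread_t worker;

    pthread_mutex_init(&mb.lock, NULL);
    pthread_cond_init(&mb.nonempty, NULL);
    if (pthread_create(&worker, NULL, worker_main, &mb) != 0)
        return -1;
    for (long i = 0; i < n; i++) {
        struct message *msg = malloc(sizeof(*msg));
        if (!msg)
            break;
        msg->delta = 1;
        mailbox_post(&mb, msg);
    }
    pthread_mutex_lock(&mb.lock);
    mb.stopping = true;
    pthread_mutex_unlock(&mb.lock);
    pthread_cond_signal(&mb.nonempty);
    pthread_join(worker, NULL);  /* after this, reading mb.total is safe */
    if (wakeups_out)
        *wakeups_out = mb.wakeups;
    pthread_cond_destroy(&mb.nonempty);
    pthread_mutex_destroy(&mb.lock);
    return mb.total;
}
```

Calling run_demo(1000, &wakeups) should return 1000. The wakeup count
varies from run to run, and when the producer outpaces the worker it will
be well below the message count, which is the batching effect described
above.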

The experiment I ran was to replace each dedicated thread with an ordered
workqueue where we need serialization, and with an unordered workqueue
where concurrency is allowed. On our slower test systems (> 10 year old
Supermicro Xeon E5-1650 v2, RAID-0 storage using SSDs or HDDs), the
slowdown was less significant (under 2x), but on our faster system (4-5?
year old Supermicro 1029P-WTR, 2x Xeon Gold 6128 = 12 cores, NVMe storage)
we got nearly a 7x slowdown overall. I haven't yet dug deeply into _why_
the kernel workqueues are slower in this sort of setup. I did run "perf
top" briefly during one test with kernel workqueues, and the largest single
use of CPU cycles was in spin lock acquisition, but I didn't get call
graphs.
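
For readers who want to picture the conversion, here is a rough, untested
sketch of the general shape using the stock workqueue API. It is not the
code from the experiment. The demo_* names are hypothetical, and the flags
(0 for the ordered queue, WQ_UNBOUND for the unordered one) are assumptions,
since the flags actually used are not stated here.

```c
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* Hypothetical update message; not a VDO structure. */
struct demo_update {
	struct work_struct work;
	long delta;
};

static struct workqueue_struct *serial_wq;   /* stands in for one dedicated thread */
static struct workqueue_struct *parallel_wq; /* for work that may run concurrently */
static long serial_total; /* only touched by work items on serial_wq */

static void demo_serial_fn(struct work_struct *work)
{
	struct demo_update *u = container_of(work, struct demo_update, work);

	/* An ordered workqueue runs one item at a time, in queueing order. */
	serial_total += u->delta;
	kfree(u);
}

/* Each handoff becomes one queue_work() call instead of a message enqueue. */
static int demo_post(long delta)
{
	struct demo_update *u = kmalloc(sizeof(*u), GFP_KERNEL);

	if (!u)
		return -ENOMEM;
	u->delta = delta;
	INIT_WORK(&u->work, demo_serial_fn);
	queue_work(serial_wq, &u->work);
	return 0;
}

static void demo_destroy(void)
{
	/* destroy_workqueue() drains pending work before tearing down. */
	if (serial_wq)
		destroy_workqueue(serial_wq);
	if (parallel_wq)
		destroy_workqueue(parallel_wq);
}

static int __init demo_init(void)
{
	int ret;

	serial_wq = alloc_ordered_workqueue("demo_serial", 0);
	/* max_active of 0 selects the default concurrency limit. */
	parallel_wq = alloc_workqueue("demo_parallel", WQ_UNBOUND, 0);
	if (!serial_wq || !parallel_wq) {
		demo_destroy();
		return -ENOMEM;
	}
	ret = demo_post(1);
	if (ret)
		demo_destroy();
	return ret;
}

static void __exit demo_exit(void)
{
	demo_destroy();
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```

In this shape every handoff is an individually queued work item, where a
dedicated thread could drain many pending updates per wakeup.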

(This was with Fedora 37 6.2.12-200 and 6.2.15-200 kernels, without the
latest submissions from Tejun, which look interesting. Though I suspect we
care more about cache locality for some of our thread-specific data
structures than for accessing the I/O structures.)

Ken