From: Jesper Dangaard Brouer <hawk@kernel.org>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Waiman Long <longman@redhat.com>,
	tj@kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com,
	cgroups@vger.kernel.org, yosryahmed@google.com,
	netdev@vger.kernel.org, linux-mm@kvack.org,
	kernel-team@cloudflare.com,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Daniel Dao <dqminh@cloudflare.com>,
	Ivan Babrou <ivan@cloudflare.com>,
	jr@cloudflare.com
Subject: Re: [PATCH v1] cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints
Date: Mon, 6 May 2024 14:03:47 +0200
Message-ID: <55854a94-681e-4142-9160-98b22fa64d61@kernel.org>
In-Reply-To: <4gdfgo3njmej7a42x6x6x4b6tm267xmrfwedis4mq7f4mypfc7@4egtwzrfqkhp>



On 03/05/2024 21.18, Shakeel Butt wrote:
> On Fri, May 03, 2024 at 04:00:20PM +0200, Jesper Dangaard Brouer wrote:
>>
>>
> [...]
>>>
>>> I may have mistakenly thought the lock hold time refers to just the
>>> cpu_lock. Your reported times here are about the cgroup_rstat_lock,
>>> right? If so, the numbers make sense to me.
>>>
>>
>> True, my reported numbers here are about the cgroup_rstat_lock.
>> Glad to hear we are more aligned then :-)
>>
>> Given I just got some prod machines online with this patch's
>> cgroup_rstat_cpu_lock tracepoints, I can give you some early results
>> about hold-time for the cgroup_rstat_cpu_lock.
> 
> Oh you have already shared the preliminary data.
> 
>>
>> From this one-liner bpftrace command:
>>
>>    sudo bpftrace -e '
>>           tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
>>             @start[tid]=nsecs; @cnt[probe]=count()}
>>           tracepoint:cgroup:cgroup_rstat_cpu_locked {
>>             $now=nsecs;
>>             if (args->contended) {
>>               @wait_per_cpu_ns=hist($now-@start[tid]); delete(@start[tid]);}
>>             @cnt[probe]=count(); @locked[tid]=$now}
>>           tracepoint:cgroup:cgroup_rstat_cpu_unlock {
>>             $now=nsecs;
>>             @locked_per_cpu_ns=hist($now-@locked[tid]); delete(@locked[tid]);
>>             @cnt[probe]=count()}
>>           interval:s:1 {time("%H:%M:%S "); print(@wait_per_cpu_ns);
>>             print(@locked_per_cpu_ns); print(@cnt); clear(@cnt);}'
>>
>> Results from one 1 sec period:
>>
>> 13:39:55 @wait_per_cpu_ns:
>> [512, 1K)              3 |                                                    |
>> [1K, 2K)              12 |@                                                   |
>> [2K, 4K)             390 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [4K, 8K)              70 |@@@@@@@@@                                           |
>> [8K, 16K)             24 |@@@                                                 |
>> [16K, 32K)           183 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
>> [32K, 64K)            11 |@                                                   |
>>
>> @locked_per_cpu_ns:
>> [256, 512)         75592 |@                                                   |
>> [512, 1K)        2537357 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [1K, 2K)          528615 |@@@@@@@@@@                                          |
>> [2K, 4K)          168519 |@@@                                                 |
>> [4K, 8K)          162039 |@@@                                                 |
>> [8K, 16K)         100730 |@@                                                  |
>> [16K, 32K)         42276 |                                                    |
>> [32K, 64K)          1423 |                                                    |
>> [64K, 128K)           89 |                                                    |
>>
>>   @cnt[tracepoint:cgroup:cgroup_rstat_cpu_lock_contended]: 3 /sec
>>   @cnt[tracepoint:cgroup:cgroup_rstat_cpu_unlock]: 3200  /sec
>>   @cnt[tracepoint:cgroup:cgroup_rstat_cpu_locked]: 3200  /sec
>>
>>
>> So, we see the per-CPU lock hold time in the flush code path
>> (@locked_per_cpu_ns) doesn't exceed 128 usec.
> 
> Hmm 128 usec is actually unexpectedly high. 

> How does the cgroup hierarchy on your system look like?

I didn't design this, so hopefully my co-workers (@Daniel or @Jon) can
help me out here.

My low-level view is that there are 17 top-level directories in
/sys/fs/cgroup/.
There are 649 cgroups in total (counting occurrences of memory.stat).
Two directories contain the major part:
  - /sys/fs/cgroup/system.slice = 379
  - /sys/fs/cgroup/production.slice = 233
    (production.slice has two levels of directories)
  - remaining 37

We are open to changing this if you have any advice.
(@Daniel and @Jon are actually working on restructuring this.)
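
For completeness, the counts above can be reproduced with something
like this (rough sketch; the slice paths are from our setup):

   find /sys/fs/cgroup -name memory.stat | wc -l                   # all cgroups
   find /sys/fs/cgroup/system.slice -name memory.stat | wc -l      # system.slice
   find /sys/fs/cgroup/production.slice -name memory.stat | wc -l  # production.slice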

> How many cgroups have actual workloads running?

Do you have a command line trick to determine this?
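The best I could come up with myself is counting cgroups whose
cgroup.procs is non-empty (a rough sketch; it only counts processes
attached directly to that cgroup, and silently skips cgroups it cannot
read):

   find /sys/fs/cgroup -name cgroup.procs -exec grep -q . {} \; -print 2>/dev/null | wc -l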


> Can the network softirqs run on any CPU or on a smaller set of CPUs?
> I am assuming these softirqs are processing packets from any or all
> cgroups and thus have a larger cgroup update tree.

Softirqs, and specifically NET_RX, run on half of the cores (64).
(I'm looking into restructuring this allocation.)
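
A rough way to double-check which CPUs are involved is counting the
non-zero NET_RX columns in /proc/softirqs, e.g.:

   awk '/NET_RX/ {c=0; for (i=2; i<=NF; i++) if ($i+0 > 0) c++; print c " CPUs have handled NET_RX"}' /proc/softirqs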

> I wonder if
> you comment out MEMCG_SOCK stat update and still see the same holding
> time.
>

It doesn't look like MEMCG_SOCK is used.
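(AFAIK MEMCG_SOCK is what surfaces as the "sock" line in memory.stat,
so a quick sanity check for any non-zero values is something like:

   find /sys/fs/cgroup -name memory.stat -exec grep -H '^sock ' {} + | awk '$2 > 0'
)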

I deduce you are asking:
  - What is the update count for different types of mod_memcg_state() calls?

// Dumped via BTF info
enum memcg_stat_item {
         MEMCG_SWAP = 43,
         MEMCG_SOCK = 44,
         MEMCG_PERCPU_B = 45,
         MEMCG_VMALLOC = 46,
         MEMCG_KMEM = 47,
         MEMCG_ZSWAP_B = 48,
         MEMCG_ZSWAPPED = 49,
         MEMCG_NR_STAT = 50,
};
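
(For reference, the enum values above can be dumped with something like:

   bpftool btf dump file /sys/kernel/btf/vmlinux format c | grep -A 9 'enum memcg_stat_item'
)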

sudo bpftrace -e 'kfunc:vmlinux:__mod_memcg_state{@[args->idx]=count()} 
END{printf("\nEND time elapsed: %d sec\n", elapsed / 1000000000);}'
Attaching 2 probes...
^C
END time elapsed: 99 sec

@[45]: 17996     (MEMCG_PERCPU_B)
@[46]: 18603     (MEMCG_VMALLOC)
@[43]: 61858     (MEMCG_SWAP)
@[47]: 21398919  (MEMCG_KMEM)

It seems clear that MEMCG_KMEM = 47 is the main "user".
  - 21398919/99 = 216150 calls per sec

Could someone explain to me what this MEMCG_KMEM is used for?


>>
>> My latency requirement, to avoid RX-queue overflow with 1024 slots at
>> 25 Gbit/s, is 27.6 usec with small packets and 500 usec (0.5 ms) with
>> MTU-size packets.  The hold times observed above are very close to
>> this requirement.
>>
>> --Jesper
>>
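
For reference, the arithmetic behind those budget numbers (assuming 84
bytes on the wire for a minimum-size frame and 1538 bytes for an
MTU-size frame, both including preamble and inter-frame gap):

   25 Gbit/s / (84 * 8 bits)   = ~37.2 Mpps  ->  1024 slots / 37.2 Mpps = ~27.5 usec
   25 Gbit/s / (1538 * 8 bits) = ~2.03 Mpps  ->  1024 slots / 2.03 Mpps = ~504 usec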

