From: Dagaen Golomb <dgolomb@seas.upenn.edu>
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>, Sisu Xi <xisisu@gmail.com>,
	George Dunlap <george.dunlap@eu.citrix.com>,
	xen-devel@lists.xen.org, Meng Xu <mengxu@cis.upenn.edu>,
	Jan Beulich <jbeulich@suse.com>, Chong Li <lichong659@gmail.com>
Subject: Re: [PATCH] Modified RTDS scheduler to use an event-driven model instead of polling.
Date: Wed, 10 Jun 2015 00:18:50 -0400	[thread overview]
Message-ID: <CALcuvTjaUhVvFX3oNZxEPOZ0dehtuVfCLQcwMDvSWac+xCvjGg@mail.gmail.com> (raw)
In-Reply-To: <1433854393.2403.141.camel@citrix.com>



Thanks for the feedback.


> Thanks for doing this. The first question I have is whether you run any
> test/benchmark that you can show the result of here.
>
> For instance, a few weeks ago, Meng (while trying to do a completely
> different thing) sent here on the list some numbers about the frequency
> of the scheduler being called, and the overhead caused by that... Would
> it be hard to run the same experiments and collect the numbers, with and
> without your patch?
>

I could run some experiments to measure this. An easier quick test may be
to count scheduler invocations while running the same workload on each
version, making sure each runs for a long time to remove biases.


> On Mon, 2015-06-08 at 07:46 -0400, Dagaen Golomb wrote:
> > To do this, we create a new list that holds, for each
> > vcpu, the time least into the future that it may need to be
> > rescheduled.
> >
> Ok. Actually, what I really expected to see in this patch was, either:
>  a) a new, scheduler wide, list of replenishment times _and_ a timer,
>     always (re)programmed to fire at the most imminent one; or
>  b) one timer for each vcpu, always (re)programmed to fire at the time
>     instant when the replenishment for that vcpu should happen.
>

a) is essentially what we have going on here. The list is exactly that: a
list of possible scheduling-change times (including replenishments, since
they are coalesced with the period), one per vcpu, and the timer is armed
using the earliest one.


> And note that, when I say "timer", I mean an actual Xen timer, i.e.,
> those things that are started, stopped, and with a timer handling
> routine being called when they expire. For an example, you can have a
> look, in sched_credit.c, at 'ticker' or 'master_ticker'.
>

I will look at this. However, the current solution is likely functionally
equivalent and, with only one timer and a single list used to arm it, I'm
not sure if using a Xen timer is necessary. Do they incur any extra
overhead or coarsen granularity?


>
> Deciding between a) or b) isn't easy, without actually trying to
> implement them. I personally prefer b), and I think it would be a
> superior long term solution (see right below), but given the current
> architecture of the RTDS scheduler (e.g., the fact that it has a global
> runq, etc), I won't nack a), which would most likely be simpler.
>

I agree b) may well be better in the long run, but with the current
architecture a) is simpler; b) can be future work as the scheduler evolves.


> Performance and overhead wise, again, hard to tell in advance. b) would
> introduce more overhead (as there are more timers), but will likely
> prove more scalable (which is not something of great importance for
> this scheduler) and may allow for better latency (which _is_ something
> of great importance for this scheduler :-) ), since there won't be the
> need to lock the whole scheduler during the replenishments (which is,
> OTOH, necessary with a)).
>
> And that's it for the replenishments. For enforcing the budget, well, we
> have the ret.time value of the task_slice struct returned by
> rt_schedule, so that's event driven already, and we don't need to do
> much about it.
>

The timerq will also hold the end of the period if that is the next relevant
event, and if that entry is at the head of the timerq, its value will be used
to arm the timer and let the task run until budget exhaustion.


> Does all this make sense? Am I making myself clear enough?
> If not, feel free to ask.
>
> >  The scheduler chooses the lowest time off of this
> > list and waits until the specified time instead of running every
> > 1 ms as it did before.
> >
> Yeah, well, I see what you mean and how this is actually succeeding
> (at least AFAICT, by looking at the code, again, it would be nice to
> have some numbers) in improving the scheduler behavior.
>
> However, this transition toward event driven-ness has two goals:
>  * improve the scheduler behavior [check, at least to some extent]
>  * improve the code (readability, maintainability, etc.)
>    [not check at all :-(]
>
> As an example, the __repl_update() function: it's called 2 times inside
> rt_schedule() and a third from rt_context_saved(), which is basically
> like saying it's called 3 times from rt_schedule(), and this always made
> very little sense to me. In fact, I think that scheduling decisions and
> replenishment events shouldn't be so tightly coupled. There are
> interdependencies, of course (a replenishment could call for a
> scheduling decision to be taken), but not like this. That's why I say
> I'd like to see a timer handling routine dealing with replenishment, and
> let rt_schedule() do its job, which is:
>  - enforcing the budget of the running vcpu. If it expires,
>    _(re)program_ the replenishment timer
>  - choose the best vcpu to run next, if necessary
>
> With this patch, you're still using rt_schedule() to do both scheduling
> and replenishments, although you're now doing it at actual replenishment
> times, instead of checking for them every millisecond.
>
>
I think once you understand that the timerq holds not only replenishments, but
any event that could cause a change in scheduler behavior, it becomes more
palatable to have them intertwined. Really, the timerq and scheduling aren't
as intertwined as they appear; they are practically isolated functionally.
It's just that the timerq is most naturally serviced during scheduling events,
when the data that affects scheduling decisions is changing. It's
straightforward and efficient. Let me know your thoughts on this.


> Also, the bits and pieces that you need to add in order to deal with
> this new queue are, really, making things even more complex than they are
> now, which is undesirable.
>

While it does add some complexity, I don't feel that a single queue, and the
code to manage it, add that much extra complexity.


> So, all this being said, let me know what you think about it (and that
> applies to everyone else as well, of course, in particular Meng).
>
> I won't comment on the code in too much details, as it will require some
> quite radical restructuring, but I'll skim through it and provide some
> hints anyway.
>
> > Signed-off-by: Dagaen Golomb <dgolomb@seas.upenn.edu>
> > Signed-off-by: Meng Xu <mengxu@cis.upenn.edu>
> > ---
> >  xen/common/sched_rt.c |  319
> ++++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 222 insertions(+), 97 deletions(-)
> >
> First of all, it's a big patch. It's only changing one file and one
> logical component, and for that reason, TBH, it's not that hard to
> review. Still I think you can break it in at least two, and perhaps even
> more, if you try to implement what I described above.
>

Noted.
Note that it's not as large as it seems: a large chunk of the deletions, and
an equal number of insertions, come from moving a function so that it is
visible to the functions that now need it.


> > diff --git a/xen/common/sched_rt.c b/xen/common/sched_rt.c
> > index 7c39a9e..25f0458 100644
> > --- a/xen/common/sched_rt.c
> > +++ b/xen/common/sched_rt.c
>
> > @@ -156,6 +160,7 @@ struct rt_vcpu {
> >      s_time_t cur_budget;        /* current budget */
> >      s_time_t last_start;        /* last start time */
> >      s_time_t cur_deadline;      /* current deadline for EDF */
> > +    s_time_t next_sched_needed; /* next time to make scheduling
> decision */
> >
> As an example of why I said that things should become simpler, and are
> instead becoming more complex: with my solution, you don't need anything
> like this. In fact, budget enforcement is already being done ok in
> rt_schedule(), so no need to do anything more/different about it.
> Replenishments should be programmed to fire at cur_deadline, so again,
> no need to add this new field, and multiplex its actual value between
> budget and deadline (which, yes, counts as being something rather
> complex for me! :-D).
>

As mentioned, the timerq handles all events that could change the scheduling
decision, not just replenishments. It tracks the earliest time anything could
cause this vcpu to be scheduled differently, and could be extended if any more
insights are found. Budget enforcement is being done in rt_schedule(), but it
is being done through this very list: a budget expiration is one possible
event that requires a vcpu reschedule.


>
> > @@ -361,6 +387,12 @@ __q_remove(struct rt_vcpu *svc)
> >          list_del_init(&svc->q_elem);
> >  }
> >
> > +static inline void __t_remove(struct rt_vcpu *svc)
> > +{
> > +     if( !list_empty(&svc->t_elem) )
> > +             list_del_init(&svc->t_elem);
> > +}
> > +
> >
> You're using hard tabs for indenting here. As you see everywhere else,
> Xen wants 4 spaces for this.
>

Sorry, I thought I had gotten them all changed to spaces. My editor wasn't
configured correctly at first so some are lurking around. I will
exterminate the remaining ones.


> >  /*
> >   * Insert svc with budget in RunQ according to EDF:
> >   * vcpus with smaller deadlines go first.
> > @@ -395,6 +427,72 @@ __runq_insert(const struct scheduler *ops, struct
> rt_vcpu *svc)
> >  }
> >
> >  /*
> > + * Insert svc into the timerq, maintaining ascending order by
> next_sched_needed.
> > + */
> > +static void __timerq_insert(const struct scheduler *ops, struct rt_vcpu
> *svc)
> > +{
> > +    struct rt_private *prv = rt_priv(ops);
> > +    struct list_head *timerq = rt_timerq(ops);
> > +    struct list_head *iter = timerq;
> > +
> > +    ASSERT( spin_is_locked(&prv->lock) );
> > +
> > +    ASSERT( list_empty(&svc->t_elem) );
> > +
> The blank line between the asserts, I'd kill it.
>

Can do.


> > +     list_for_each(iter, timerq)
> > +     {
> >
> You're still using tabs, and mixing them with spaces, making things look
> even more cumbersome.
>

Again, sorry. Will exterminate.


>
> > +/*
> > + * Return how far the lowest time on the timerq (that is after NOW) is
> in the
> > + * future.
> > + * Lock must be grabbed before calling this.
> > + */
> > +static s_time_t __timerq_next(const struct scheduler *ops, s_time_t now)
> > +{
> > +    struct list_head *timerq = rt_timerq(ops);
> > +    struct list_head *iter, *tmp;
> > +
> > +    list_for_each_safe(iter, tmp, timerq)
> > +    {
> > +        struct rt_vcpu * iter_svc = __t_elem(iter);
> > +
> > +        if ( iter_svc->next_sched_needed > now )
> > +            return (iter_svc->next_sched_needed - now);
> > +        else
> > +            __t_remove(iter_svc);
> > +    }
> > +
> > +    return MAX_SCHEDULE;
> > +}
> >
> If using a timer, you can just get rid of items while, in the timer
> routine, you handle the event associated to them.
>
> Also, why is MAX_SCHEDULE still there?
>

Honestly, events do not have to be removed here, but it's done so the next
call doesn't have to walk a list of stale values to get to the newest one.
This is done mostly for performance; leaving them in place would be
functionally equivalent.
With the timer alternative you mention, where would the timer events and
their data be held? I think that could be extra complexity, unless I fail
to understand the idea completely.

The current approach is actually somewhat simple if you think about it,
requiring just one piece of information per vcpu (the next_sched_needed
field), which is updated as information becomes available. The way the queue
and next_sched_needed are set up, it also uses a guaranteed O(1) increase in
memory per vcpu, and most of the computational work is rolled into existing
routines that need to run anyway.

MAX_SCHEDULE may not be required, but I have it there as a safety net. For
example, it causes the scheduler to be called after MAX_SCHEDULE even when no
events are occurring (such as when there are no vcpus). It's there to ensure
progress in case of any bugs or unexpected behavior.


> > +/*
> > + * Updates the next_sched_needed field for a vcpu, which is used for
> > + * determining the next needed schedule time for this vcpu. Then updates
> > + * timerq via reinsert.
> > + */
> > +static void update_sched_time(const struct scheduler *ops, s_time_t now,
> > +                              struct rt_vcpu *svc)
> > +{
> > +    /* update next needed schedule time */
> > +    if( test_bit(__RTDS_scheduled, &svc->flags) )
> > +        svc->next_sched_needed = now + svc->cur_budget;
> > +    else
> > +        svc->next_sched_needed = svc->cur_deadline;
> > +
> > +    /* Remove and reinsert to maintain sorted order in timerq */
> > +    __t_remove(svc);
> > +    __timerq_insert(ops, svc);
> > +
> > +    return;
> > +}
> >
> And here's the multiplexing, which --even worse-- (may) require
> rearranging the queue! We really don't need anything like this.
>

I see what you mean here, but this doesn't seem bad to me. Right now it's only
these two events and a simple if-else, but what is nice is that this function
can be updated with any number of tricks or insights that help determine when
something may, or may not, need to be rescheduled. Right now it may look like
overkill for just two cases, but many more cases, or an entirely different
accounting method, could be placed here and it would be fine. It makes for a
nice "ground zero" for modifying the timer update behavior.


> >  /*
> > + * Pick a cpu where to run a vcpu,
> > + * possibly kicking out the vcpu running there
> > + * Called by wake() and context_saved()
> > + * We have a running candidate here, the kick logic is:
> > + * Among all the cpus that are within the cpu affinity
> > + * 1) if the new->cpu is idle, kick it. This could benefit cache hit
> > + * 2) if there are any idle vcpu, kick it.
> > + * 3) now all pcpus are busy;
> > + *    among all the running vcpus, pick lowest priority one
> > + *    if snext has higher priority, kick it.
> > + *
> > + * TODO:
> > + * 1) what if these two vcpus belongs to the same domain?
> > + *    replace a vcpu belonging to the same domain introduces more
> overhead
> > + *
> > + * lock is grabbed before calling this function
> > + */
> > +static void
> > +runq_tickle(const struct scheduler *ops, struct rt_vcpu *new)
> > +{
> > +    struct rt_private *prv = rt_priv(ops);
> > +    struct rt_vcpu *latest_deadline_vcpu = NULL; /* lowest priority */
> > +    struct rt_vcpu *iter_svc;
> > +    struct vcpu *iter_vc;
> > +    int cpu = 0, cpu_to_tickle = 0;
> > +    cpumask_t not_tickled;
> > +    cpumask_t *online;
> > +
> >
> [snip]
>
> And here you are moving a function up. In general, it is better to have
> separate patches that do the moving, with the fact that they are all
> about moving and that the code is not being changed, clearly stated in
> the commit message. This is because it is really really hard to figure
> out, e.g. from here, whether you also changed something in the function
> while moving it, making reviewing a lot harder and more prone to error.
>

Ah, thanks for mentioning. The move was required but I didn't know a
separate patch was customary. I can make it so.


> That being said, in this specific case, you're moving up runq_tickle(),
> the purpose of which is to trigger a call to rt_schedule() (after going
> through the softirq machinery)... in order to call it in rt_schedule().
> That's pretty tricky, and is not how tickling should work.
>

It is set up to only tickle when needed. I'm not sure whether this is an issue.


> Again, with an approach that really takes advantage of timers, this won't
> be necessary.
>

I think there could be some discussion on the approaches; hopefully others
comment as well. As mentioned, this is actually a logically quite simple
change with good efficiency, and the "ground zero" update function makes
changes that use more information easy, although for now it was kept simple to
get the change out. Improvements to the update function were somewhat expected
as future work.

Thanks for the comments.
~Dagaen



Thread overview: 21+ messages
2015-06-08 11:46 [PATCH] Modified RTDS scheduler to use an event-driven model instead of polling Dagaen Golomb
2015-06-09 12:53 ` Dario Faggioli
2015-06-10  4:18   ` Dagaen Golomb [this message]
2015-06-10 22:30     ` Dario Faggioli
2015-06-13 20:33       ` Dagaen Golomb
2015-06-16 17:07         ` Dagaen Golomb
2015-06-17  0:20           ` Dario Faggioli
2015-06-17  5:24             ` Dagaen Golomb
2015-06-17  5:45               ` Meng Xu
2015-06-17  6:03                 ` Dagaen Golomb
2015-06-17  6:09                   ` Meng Xu
2015-06-17  9:20                 ` Dario Faggioli
2015-06-17  8:27               ` Dario Faggioli
2015-06-18 18:07                 ` Dagaen Golomb
2015-06-22  9:11                   ` Dario Faggioli
2015-06-22 11:58                     ` Dagaen Golomb
2015-06-10  5:54   ` Meng Xu
2015-06-10 17:46     ` Dario Faggioli
2015-06-11  5:50       ` Meng Xu
2015-06-12  9:28         ` Dario Faggioli
2015-06-13 17:21           ` Meng Xu
