* [NewStore]About PGLog Workload With RocksDB
@ 2015-09-08 13:58 Haomai Wang
  2015-09-08 14:06 ` Haomai Wang
  2015-09-09  7:28 ` Dałek, Piotr
  0 siblings, 2 replies; 10+ messages in thread
From: Haomai Wang @ 2015-09-08 13:58 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

Hi Sage,

I noticed your post on the RocksDB page about making RocksDB aware of
short-lived key/value pairs.

I think it would be great if a key/value DB implementation could support
different key types with different storage behaviors, but it looks
difficult to add such a feature to an existing DB.

So, combining my experience with FileStore, I think making
NewStore/FileStore aware of these short-lived keys (or just the PGLog
keys) could be easy and effective. The PGLog is owned by a PG and
maintains the history of its ops. It is similar to journal data, but
each entry is only a few hundred bytes, and a few hundred MB at most
is enough to store the pglogs of all PGs. For FileStore, the
FileJournal already holds a copy of the PGLog; I have long thought
about dropping the extra copy in LevelDB to cut down on LevelDB calls,
which consume a lot of CPU cycles, but that needs a lot of work in
FileJournal to make it aware of pglog entries. NewStore doesn't use
FileJournal, so it should be easier to implement this idea there(?).

Actually, I think the omap key/value writes generated by a RADOS write
op in the current ObjectStore implementations hurt performance badly:
a lot of CPU cycles are spent on these short-lived keys (pglog), so
this is an obvious optimization target. On the other hand, the pglog
is dull and doesn't need a rich key/value API. A lightweight
FileJournal-style log just for the pglog keys may also be worth trying.

In short, I think implementing a pglog-optimized structure to store
these keys would be cleaner and easier than improving RocksDB.
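
To make this concrete, here is a minimal sketch of the routing I have
in mind (all type names are hypothetical, not existing Ceph
interfaces): keys under a reserved "pglog" prefix go into a small
dedicated append log, and everything else falls through to the main KV DB.

  #include <functional>
  #include <string>

  // Hypothetical sketch only: divert short-lived pglog keys away from the
  // LSM-backed KV store and into a tiny dedicated log.
  struct ShortLivedKeyRouter {
    // callbacks standing in for the real backends in this sketch
    std::function<void(const std::string&, const std::string&)> append_to_pglog_log;
    std::function<void(const std::string&, const std::string&,
                       const std::string&)> set_in_kvdb;

    void set(const std::string& prefix, const std::string& key,
             const std::string& value) {
      if (prefix == "pglog")               // short-lived: keep out of the LSM
        append_to_pglog_log(key, value);
      else
        set_in_kvdb(prefix, key, value);   // normal omap keys -> leveldb/rocksdb
    }
  };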

PS (off topic): a key/value DB benchmark: http://sphia.org/benchmarks.html



-- 
Best Regards,

Wheat


* Re: [NewStore]About PGLog Workload With RocksDB
  2015-09-08 13:58 [NewStore]About PGLog Workload With RocksDB Haomai Wang
@ 2015-09-08 14:06 ` Haomai Wang
  2015-09-08 14:12   ` Gregory Farnum
  2015-09-08 19:19   ` Sage Weil
  2015-09-09  7:28 ` Dałek, Piotr
  1 sibling, 2 replies; 10+ messages in thread
From: Haomai Wang @ 2015-09-08 14:06 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

Hit "Send" by accident for previous mail. :-(

Some points about the pglog:
1. short-lived, but written at high frequency
2. small, and related to the number of PGs
3. a typical sequential read/write pattern
4. doesn't need a rich structure like an LSM or B-tree behind its API;
it is clearly different from user-side/other omap keys
5. a simple loopback impl is efficient and simple (rough sketch below)
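
A rough sketch of point 5, just to show how little machinery is needed
(error handling, checksums and a persisted head pointer are omitted;
nothing here is existing Ceph code):

  #include <fcntl.h>
  #include <unistd.h>
  #include <cstdint>
  #include <string>

  // Minimal loopback (circular) log: append length-prefixed records into a
  // preallocated file and wrap to the start when the end is reached. A real
  // implementation would also need checksums, a persisted head/tail and an
  // fsync policy, but no LSM/B-tree machinery.
  class LoopbackLog {
    int fd;
    uint64_t size, head = 0;          // head = next write offset
   public:
    LoopbackLog(const std::string& path, uint64_t bytes) : size(bytes) {
      fd = ::open(path.c_str(), O_RDWR | O_CREAT, 0644);
      ::ftruncate(fd, size);          // preallocate the whole region up front
    }
    void append(const std::string& rec) {
      uint32_t len = rec.size();
      if (head + sizeof(len) + len > size)
        head = 0;                     // wrap: old entries are expendable
      ::pwrite(fd, &len, sizeof(len), head);
      ::pwrite(fd, rec.data(), len, head + sizeof(len));
      head += sizeof(len) + len;
    }
  };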


On Tue, Sep 8, 2015 at 9:58 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
> Hi Sage,
>
> I notice your post in rocksdb page about make rocksdb aware of short
> alive key/value pairs.
>
> I think it would be great if one keyvalue db impl could support
> different key types with different store behaviors. But it looks like
> difficult for me to add this feature to an existing db.
>
> So combine my experience with filestore, I just think let
> NewStore/FileStore aware of this short-alive keys(Or just PGLog keys)
> could be easy and effective. PGLog owned by PG and maintain the
> history of ops. It's alike Journal Data but only have several hundreds
> bytes. Actually we only need to have several hundreds MB at most to
> store all pgs pglog. For FileStore, we already have FileJournal have a
> copy of PGLog, previously I always think about reduce another copy in
> leveldb to reduce leveldb calls which consumes lots of cpu cycles. But
> it need a lot of works to be done in FileJournal to aware of pglog
> things. NewStore doesn't use FileJournal and it should be easier to
> settle down my idea(?).
>
> Actually I think a rados write op in current objectstore impl that
> omap key/value pairs hurts performance hugely. Lots of cpu cycles are
> consumed and contributes to short-alive keys(pglog). It should be a
> obvious optimization point. In the other hands, pglog is dull and
> doesn't need rich keyvalue api supports. Maybe a lightweight
> filejournal to settle down pglogs keys is also worth to try.
>
> In short, I think it would be cleaner and easier than improving
> rocksdb to impl a pglog-optimization structure to store this.
>
> PS(off topic): a keyvaluedb benchmark http://sphia.org/benchmarks.html
>
>
>
> --
> Best Regards,
>
> Wheat



-- 
Best Regards,

Wheat


* Re: [NewStore]About PGLog Workload With RocksDB
  2015-09-08 14:06 ` Haomai Wang
@ 2015-09-08 14:12   ` Gregory Farnum
  2015-09-08 14:18     ` Haomai Wang
  2015-09-08 15:47     ` Gregory Farnum
  2015-09-08 19:19   ` Sage Weil
  1 sibling, 2 replies; 10+ messages in thread
From: Gregory Farnum @ 2015-09-08 14:12 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, ceph-devel@vger.kernel.org

On Tue, Sep 8, 2015 at 3:06 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
> Hit "Send" by accident for previous mail. :-(
>
> some points about pglog:
> 1. short-alive but frequency(HIGH)

Is this really true? The default length of the log is 1000 entries,
and most OSDs have ~100 PGs, so on a hard drive running at 80
writes/second that's about 100000 seconds (~27 hours) before we delete
an entry. In reality most deployments aren't writing that
quickly....and if something goes wrong with the PG we increase to
10000 log entries!
-Greg

> 2. small and related to the number of pgs
> 3. typical seq read/write scene
> 4. doesn't need rich structure like LSM or B-tree to support apis, has
> obvious different to user-side/other omap keys.
> 5. a simple loopback impl is efficient and simple
>
>
> On Tue, Sep 8, 2015 at 9:58 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
>> Hi Sage,
>>
>> I notice your post in rocksdb page about make rocksdb aware of short
>> alive key/value pairs.
>>
>> I think it would be great if one keyvalue db impl could support
>> different key types with different store behaviors. But it looks like
>> difficult for me to add this feature to an existing db.
>>
>> So combine my experience with filestore, I just think let
>> NewStore/FileStore aware of this short-alive keys(Or just PGLog keys)
>> could be easy and effective. PGLog owned by PG and maintain the
>> history of ops. It's alike Journal Data but only have several hundreds
>> bytes. Actually we only need to have several hundreds MB at most to
>> store all pgs pglog. For FileStore, we already have FileJournal have a
>> copy of PGLog, previously I always think about reduce another copy in
>> leveldb to reduce leveldb calls which consumes lots of cpu cycles. But
>> it need a lot of works to be done in FileJournal to aware of pglog
>> things. NewStore doesn't use FileJournal and it should be easier to
>> settle down my idea(?).
>>
>> Actually I think a rados write op in current objectstore impl that
>> omap key/value pairs hurts performance hugely. Lots of cpu cycles are
>> consumed and contributes to short-alive keys(pglog). It should be a
>> obvious optimization point. In the other hands, pglog is dull and
>> doesn't need rich keyvalue api supports. Maybe a lightweight
>> filejournal to settle down pglogs keys is also worth to try.
>>
>> In short, I think it would be cleaner and easier than improving
>> rocksdb to impl a pglog-optimization structure to store this.
>>
>> PS(off topic): a keyvaluedb benchmark http://sphia.org/benchmarks.html
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
>
>
>
> --
> Best Regards,
>
> Wheat


* Re: [NewStore]About PGLog Workload With RocksDB
  2015-09-08 14:12   ` Gregory Farnum
@ 2015-09-08 14:18     ` Haomai Wang
  2015-09-08 15:47     ` Gregory Farnum
  1 sibling, 0 replies; 10+ messages in thread
From: Haomai Wang @ 2015-09-08 14:18 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, ceph-devel@vger.kernel.org

On Tue, Sep 8, 2015 at 10:12 PM, Gregory Farnum <gfarnum@redhat.com> wrote:
> On Tue, Sep 8, 2015 at 3:06 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
>> Hit "Send" by accident for previous mail. :-(
>>
>> some points about pglog:
>> 1. short-alive but frequency(HIGH)
>
> Is this really true? The default length of the log is 1000 entries,
> and most OSDs have ~100 PGs, so on a hard drive running at 80
> writes/second that's about 100000 seconds (~27 hours) before we delete

I had SSDs in mind....... Yep, for HDDs the pglog entries are not just
passing travellers.

The main point, I think, is that pglog, journal data, and omap keys are
three different types of data.

> an entry. In reality most deployments aren't writing that
> quickly....and if something goes wrong with the PG we increase to
> 10000 log entries!
> -Greg
>
>> 2. small and related to the number of pgs
>> 3. typical seq read/write scene
>> 4. doesn't need rich structure like LSM or B-tree to support apis, has
>> obvious different to user-side/other omap keys.
>> 5. a simple loopback impl is efficient and simple
>>
>>
>> On Tue, Sep 8, 2015 at 9:58 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
>>> Hi Sage,
>>>
>>> I notice your post in rocksdb page about make rocksdb aware of short
>>> alive key/value pairs.
>>>
>>> I think it would be great if one keyvalue db impl could support
>>> different key types with different store behaviors. But it looks like
>>> difficult for me to add this feature to an existing db.
>>>
>>> So combine my experience with filestore, I just think let
>>> NewStore/FileStore aware of this short-alive keys(Or just PGLog keys)
>>> could be easy and effective. PGLog owned by PG and maintain the
>>> history of ops. It's alike Journal Data but only have several hundreds
>>> bytes. Actually we only need to have several hundreds MB at most to
>>> store all pgs pglog. For FileStore, we already have FileJournal have a
>>> copy of PGLog, previously I always think about reduce another copy in
>>> leveldb to reduce leveldb calls which consumes lots of cpu cycles. But
>>> it need a lot of works to be done in FileJournal to aware of pglog
>>> things. NewStore doesn't use FileJournal and it should be easier to
>>> settle down my idea(?).
>>>
>>> Actually I think a rados write op in current objectstore impl that
>>> omap key/value pairs hurts performance hugely. Lots of cpu cycles are
>>> consumed and contributes to short-alive keys(pglog). It should be a
>>> obvious optimization point. In the other hands, pglog is dull and
>>> doesn't need rich keyvalue api supports. Maybe a lightweight
>>> filejournal to settle down pglogs keys is also worth to try.
>>>
>>> In short, I think it would be cleaner and easier than improving
>>> rocksdb to impl a pglog-optimization structure to store this.
>>>
>>> PS(off topic): a keyvaluedb benchmark http://sphia.org/benchmarks.html
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat



-- 
Best Regards,

Wheat


* Re: [NewStore]About PGLog Workload With RocksDB
  2015-09-08 14:12   ` Gregory Farnum
  2015-09-08 14:18     ` Haomai Wang
@ 2015-09-08 15:47     ` Gregory Farnum
  1 sibling, 0 replies; 10+ messages in thread
From: Gregory Farnum @ 2015-09-08 15:47 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, ceph-devel@vger.kernel.org

On Tue, Sep 8, 2015 at 3:12 PM, Gregory Farnum <gfarnum@redhat.com> wrote:
> On Tue, Sep 8, 2015 at 3:06 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
>> Hit "Send" by accident for previous mail. :-(
>>
>> some points about pglog:
>> 1. short-alive but frequency(HIGH)
>
> Is this really true? The default length of the log is 1000 entries,
> and most OSDs have ~100 PGs, so on a hard drive running at 80
> writes/second that's about 100000 seconds (~27 hours) before we delete
> an entry. In reality most deployments aren't writing that
> quickly....and if something goes wrong with the PG we increase to
> 10000 log entries!
> -Greg

Er, whoops, as Ilya points out I've left out a step here. 100000
entries at 80 entries/sec is only 1250 seconds, or about 20 minutes.
That's quite a different number and you should perhaps just ignore me.
:)


* Re: [NewStore]About PGLog Workload With RocksDB
  2015-09-08 14:06 ` Haomai Wang
  2015-09-08 14:12   ` Gregory Farnum
@ 2015-09-08 19:19   ` Sage Weil
  2015-09-08 19:27     ` Mark Nelson
       [not found]     ` <55EF3639.3060108@redhat.com>
  1 sibling, 2 replies; 10+ messages in thread
From: Sage Weil @ 2015-09-08 19:19 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel@vger.kernel.org

> On Tue, Sep 8, 2015 at 9:58 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
> > Hi Sage,
> >
> > I notice your post in rocksdb page about make rocksdb aware of short
> > alive key/value pairs.
> >
> > I think it would be great if one keyvalue db impl could support
> > different key types with different store behaviors. But it looks like
> > difficult for me to add this feature to an existing db.

WiredTiger comes to mind.. it supports a few different backing 
strategies (both btree and lsm, iirc).  Also, rocksdb has column families.  
That doesn't help with the write log piece (the log is shared, as I 
understand it) but it does mean that we can segregate the log events or 
other bits of the namespace off into regions that have different 
compaction policies (e.g., rarely (if ever) compact so that we avoid 
amplification but suffer on reads during startup).
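
For example, something like this with the stock rocksdb C++ API (the
path and option values are just illustrative):

  #include <rocksdb/db.h>
  #include <string>
  #include <vector>

  // Give the short-lived keys (wal/pglog events) their own column family
  // with auto compaction disabled, so they aren't pushed down the LSM.
  rocksdb::Status open_with_pglog_cf(rocksdb::DB** db,
                                     std::vector<rocksdb::ColumnFamilyHandle*>* handles) {
    rocksdb::Options opts;
    opts.create_if_missing = true;
    opts.create_missing_column_families = true;

    rocksdb::ColumnFamilyOptions log_cf;
    log_cf.disable_auto_compactions = true;   // rarely (if ever) compact

    std::vector<rocksdb::ColumnFamilyDescriptor> cfs = {
      {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
      {"pglog", log_cf},
    };
    return rocksdb::DB::Open(opts, "/var/lib/ceph/osd/ceph-0/db", cfs, handles, db);
  }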

> > So combine my experience with filestore, I just think let
> > NewStore/FileStore aware of this short-alive keys(Or just PGLog keys)
> > could be easy and effective. PGLog owned by PG and maintain the
> > history of ops. It's alike Journal Data but only have several hundreds
> > bytes. Actually we only need to have several hundreds MB at most to
> > store all pgs pglog. For FileStore, we already have FileJournal have a
> > copy of PGLog, previously I always think about reduce another copy in
> > leveldb to reduce leveldb calls which consumes lots of cpu cycles. But
> > it need a lot of works to be done in FileJournal to aware of pglog
> > things. NewStore doesn't use FileJournal and it should be easier to
> > settle down my idea(?).
> >
> > Actually I think a rados write op in current objectstore impl that
> > omap key/value pairs hurts performance hugely. Lots of cpu cycles are
> > consumed and contributes to short-alive keys(pglog). It should be a
> > obvious optimization point. In the other hands, pglog is dull and
> > doesn't need rich keyvalue api supports. Maybe a lightweight
> > filejournal to settle down pglogs keys is also worth to try.
> >
> > In short, I think it would be cleaner and easier than improving
> > rocksdb to impl a pglog-optimization structure to store this.

I've given some thought to adding a FileJournal to newstore to do the wal 
events (which are the main thing we're putting in rocksdb that is *always* 
shortlived and can be reasonably big--and thus cost a lot when it gets 
flushed to L0).  But it just makes things a lot more complex.  We would 
have two synchronization (fsync) targets, or we would want to be smart 
about putting entire transactions in one journal and not the other.  
Still thinking about it, but it makes me a bit sad--it really feels like 
this is a common, simple workload that the KeyValueDB implementation 
should be able to handle.  What I'd really like is a hint on the key, or 
a predetermined key range that we use, so that the backend knows our 
lifecycle expectations and can optimize accordingly.
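
In rough shape, something like this (purely hypothetical -- neither
rocksdb nor our KeyValueDB interface has such a hint today):

  #include <string>

  // Hypothetical lifecycle hint on a KV transaction: the caller tells the
  // backend which keys it expects to overwrite/delete almost immediately.
  enum class KeyLifetime { NORMAL, SHORT_LIVED };

  struct KVTransaction {
    virtual void set(const std::string& prefix, const std::string& key,
                     const std::string& value,
                     KeyLifetime hint = KeyLifetime::NORMAL) = 0;
    virtual ~KVTransaction() = default;
  };

  // e.g.  t->set("pglog", entry_key, encoded_entry, KeyLifetime::SHORT_LIVED);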

I'm hoping I can sell the rocksdb folks on a log rotation and flush 
strategy that prevents these keys from ever making it into L0... that, 
combined with the overwrite change, will give us both low latency and no 
amplification for these writes (and any other keys that get rewritten, 
like hot object metadata).

On Tue, 8 Sep 2015, Haomai Wang wrote:
> Hit "Send" by accident for previous mail. :-(
> 
> some points about pglog:
> 1. short-alive but frequency(HIGH)
> 2. small and related to the number of pgs
> 3. typical seq read/write scene
> 4. doesn't need rich structure like LSM or B-tree to support apis, has
> obvious different to user-side/other omap keys.
> 5. a simple loopback impl is efficient and simple

It's simpler.. though not quite as simple as it could be.  The pg log 
lengths may vary widely (10000 events could be any time span, 
depending on how active the pg is).  And we want to mix all the pg log 
events into a single append stream for write efficiency.  So we still need 
some complex tracking and eventual compaction ... which is part of 
what the LSM is doing for us.

> > PS(off topic): a keyvaluedb benchmark http://sphia.org/benchmarks.html

This looks pretty interesting!  Anyone interested in giving it a spin?  It 
should be pretty easy to write it into the KeyValueDB interface.

sage


* Re: [NewStore]About PGLog Workload With RocksDB
  2015-09-08 19:19   ` Sage Weil
@ 2015-09-08 19:27     ` Mark Nelson
       [not found]     ` <55EF3639.3060108@redhat.com>
  1 sibling, 0 replies; 10+ messages in thread
From: Mark Nelson @ 2015-09-08 19:27 UTC (permalink / raw)
  To: Sage Weil, Haomai Wang; +Cc: ceph-devel@vger.kernel.org



On 09/08/2015 02:19 PM, Sage Weil wrote:
>> On Tue, Sep 8, 2015 at 9:58 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
>>> Hi Sage,
>>>
>>> I notice your post in rocksdb page about make rocksdb aware of short
>>> alive key/value pairs.
>>>
>>> I think it would be great if one keyvalue db impl could support
>>> different key types with different store behaviors. But it looks like
>>> difficult for me to add this feature to an existing db.
>
> WiredTiger comes to mind.. it supports a few different backing
> strategies (both btree and lsm, iirc).  Also, rocksdb has column families.
> That doesn't help with the write log piece (the log is shared, as I
> understand it) but it does mean that we can segregate the log events or
> other bits of the namespace off into regions that have different
> compaction policies (e.g., rarely/ever compact so that we avoid
> amplification and but suffer on reads during startup).
>
>>> So combine my experience with filestore, I just think let
>>> NewStore/FileStore aware of this short-alive keys(Or just PGLog keys)
>>> could be easy and effective. PGLog owned by PG and maintain the
>>> history of ops. It's alike Journal Data but only have several hundreds
>>> bytes. Actually we only need to have several hundreds MB at most to
>>> store all pgs pglog. For FileStore, we already have FileJournal have a
>>> copy of PGLog, previously I always think about reduce another copy in
>>> leveldb to reduce leveldb calls which consumes lots of cpu cycles. But
>>> it need a lot of works to be done in FileJournal to aware of pglog
>>> things. NewStore doesn't use FileJournal and it should be easier to
>>> settle down my idea(?).
>>>
>>> Actually I think a rados write op in current objectstore impl that
>>> omap key/value pairs hurts performance hugely. Lots of cpu cycles are
>>> consumed and contributes to short-alive keys(pglog). It should be a
>>> obvious optimization point. In the other hands, pglog is dull and
>>> doesn't need rich keyvalue api supports. Maybe a lightweight
>>> filejournal to settle down pglogs keys is also worth to try.
>>>
>>> In short, I think it would be cleaner and easier than improving
>>> rocksdb to impl a pglog-optimization structure to store this.
>
> I've given some thought to adding a FileJournal to newstore to do the wal
> events (which are the main thing we're putting in rocksdb that is *always*
> shortlived and can be reasonably big--and thus cost a lot when it gets
> flushed to L0).  But it just makes things a lot more complex.  We would
> have two synchronization (fsync) targets, or we would want to be smart
> about putting entire transactions in one journal and not the other.
> Still thinking about it, but it makes me a bit sad--it really feels like
> this is a common, simple workload that the KeyValueDB implementation
> should be able to handle.  What I'd really like is a hint on they key, or
> a predetermined key range that we use, so that the backend knows our
> lifecycle expectations and can optimize accordingly.

That sure seems like the right way to go to me too.

>
> I'm hoping I can sell the rocksdb folks on a log rotation and flush
> strategy that prevents these keys from every making it into L0... that,
> combined with the overwrite change, will give us both low latency and no
> amplification for these writes (and any other keys that get rewritten,
> like hot object metadata).

Am I correct that the above is assuming we can't convince the rocksdb 
guys that we should be able to explicitly hint that the key should never 
go to L0 anyway?  I.e., it seems to me like most of our problems would be 
fixed by your new log file creation strategy if we could also do the hinting?

>
> On Tue, 8 Sep 2015, Haomai Wang wrote:
>> Hit "Send" by accident for previous mail. :-(
>>
>> some points about pglog:
>> 1. short-alive but frequency(HIGH)
>> 2. small and related to the number of pgs
>> 3. typical seq read/write scene
>> 4. doesn't need rich structure like LSM or B-tree to support apis, has
>> obvious different to user-side/other omap keys.
>> 5. a simple loopback impl is efficient and simple
>
> It simpler.. though not quite as simple as it could be.  The pg log
> lengths may vary widely (10000 events could be any time span,
> depending on how active the pg is).  And we want to pix all the pg log
> events into a single append stream for write efficiency.  So we still need
> some complex tracking and eventual compaction ... which is part of
> what the LSM is doing for us.
>
>>> PS(off topic): a keyvaluedb benchmark http://sphia.org/benchmarks.html
>
> This looks pretty interested!  Anyone interested in giving it a spin?  It
> should be pretty easy to write it into the KeyValueDB interface.
>
> sage
>


* Re: [NewStore]About PGLog Workload With RocksDB
       [not found]     ` <55EF3639.3060108@redhat.com>
@ 2015-09-08 19:32       ` Sage Weil
       [not found]         ` <6F3FA899187F0043BA1827A69DA2F7CC0361ED99@shsmsx102.ccr.corp.intel.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2015-09-08 19:32 UTC (permalink / raw)
  To: Mark Nelson

On Tue, 8 Sep 2015, Mark Nelson wrote:
> On 09/08/2015 02:19 PM, Sage Weil wrote:
> > > On Tue, Sep 8, 2015 at 9:58 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
> > > > Hi Sage,
> > > > 
> > > > I notice your post in rocksdb page about make rocksdb aware of short
> > > > alive key/value pairs.
> > > > 
> > > > I think it would be great if one keyvalue db impl could support
> > > > different key types with different store behaviors. But it looks like
> > > > difficult for me to add this feature to an existing db.
> > 
> > WiredTiger comes to mind.. it supports a few different backing
> > strategies (both btree and lsm, iirc).  Also, rocksdb has column families.
> > That doesn't help with the write log piece (the log is shared, as I
> > understand it) but it does mean that we can segregate the log events or
> > other bits of the namespace off into regions that have different
> > compaction policies (e.g., rarely/ever compact so that we avoid
> > amplification and but suffer on reads during startup).
> > 
> > > > So combine my experience with filestore, I just think let
> > > > NewStore/FileStore aware of this short-alive keys(Or just PGLog keys)
> > > > could be easy and effective. PGLog owned by PG and maintain the
> > > > history of ops. It's alike Journal Data but only have several hundreds
> > > > bytes. Actually we only need to have several hundreds MB at most to
> > > > store all pgs pglog. For FileStore, we already have FileJournal have a
> > > > copy of PGLog, previously I always think about reduce another copy in
> > > > leveldb to reduce leveldb calls which consumes lots of cpu cycles. But
> > > > it need a lot of works to be done in FileJournal to aware of pglog
> > > > things. NewStore doesn't use FileJournal and it should be easier to
> > > > settle down my idea(?).
> > > > 
> > > > Actually I think a rados write op in current objectstore impl that
> > > > omap key/value pairs hurts performance hugely. Lots of cpu cycles are
> > > > consumed and contributes to short-alive keys(pglog). It should be a
> > > > obvious optimization point. In the other hands, pglog is dull and
> > > > doesn't need rich keyvalue api supports. Maybe a lightweight
> > > > filejournal to settle down pglogs keys is also worth to try.
> > > > 
> > > > In short, I think it would be cleaner and easier than improving
> > > > rocksdb to impl a pglog-optimization structure to store this.
> > 
> > I've given some thought to adding a FileJournal to newstore to do the wal
> > events (which are the main thing we're putting in rocksdb that is *always*
> > shortlived and can be reasonably big--and thus cost a lot when it gets
> > flushed to L0).  But it just makes things a lot more complex.  We would
> > have two synchronization (fsync) targets, or we would want to be smart
> > about putting entire transactions in one journal and not the other.
> > Still thinking about it, but it makes me a bit sad--it really feels like
> > this is a common, simple workload that the KeyValueDB implementation
> > should be able to handle.  What I'd really like is a hint on they key, or
> > a predetermined key range that we use, so that the backend knows our
> > lifecycle expectations and can optimize accordingly.
> 
> That sure seems like the right way to go to me too.
> 
> > 
> > I'm hoping I can sell the rocksdb folks on a log rotation and flush
> > strategy that prevents these keys from every making it into L0... that,
> > combined with the overwrite change, will give us both low latency and no
> > amplification for these writes (and any other keys that get rewritten,
> > like hot object metadata).
> 
> Am I correct that the above is assuming we can't convince the rocksdb guys
> that we should be able to explicitly hint that the key should never go to L0
> anyway?

The 'flush strategy' part would make overwritten or deleted keys skip L0, 
without any hint needed.  The rest (compaction policy, I guess) I 
think can be accomplished with column families... so I don't think hints 
are actually needed in the rocksdb case.

Basically, the issue is that some number of log files/buffers are 
compacted together and then written to a new L0 sst.  That means that any 
keys alive at the end of that time period are amplified.  What I'd like to 
do is make it so that we take N + M logs, and we flush/dedup the N log 
segments into SST's... but skip any keys present in the M log buffers 
that follow.  That way we have M logs' worth of time to hide overwrites 
and short-lived keys.  The problem is that it will break the snapshot and 
replication stuff in rocksdb.   We don't need that, so I'm hoping we can 
have an upstream feature that's incompatible with those bits... otherwise, 
we'd need to carry our changes downstream, which would suck, or just 
accept that some of the WAL events will get rewritten (making N very large 
would make it more infrequent, but never eliminate it entirely).
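
In toy form the flush step would look something like this (stand-in
std::maps instead of real memtables, and deletes/tombstones glossed
over):

  #include <map>
  #include <string>
  #include <vector>

  using Memtable = std::map<std::string, std::string>;

  // Dedup the N oldest memtables into what would become the L0 SST, but skip
  // any key that appears again in the M newer memtables that follow -- those
  // writes have already been superseded and never need to hit L0.
  std::map<std::string, std::string>
  flush_skip_superseded(const std::vector<Memtable>& oldest_n,
                        const std::vector<Memtable>& newer_m) {
    std::map<std::string, std::string> sst;
    for (const Memtable& mt : oldest_n)
      for (const auto& kv : mt)
        sst[kv.first] = kv.second;      // last write within N wins
    for (const Memtable& mt : newer_m)
      for (const auto& kv : mt)
        sst.erase(kv.first);            // rewritten in M: skip it entirely
    return sst;
  }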

Anyway, what I meant to say before but forgot was that the other 
reason I like putting this in the kv backend is that it gives us a 
single commit transaction log to reason about.  That will keep 
things simpler while still keeping latency low.

sage


> 
> > 
> > On Tue, 8 Sep 2015, Haomai Wang wrote:
> > > Hit "Send" by accident for previous mail. :-(
> > > 
> > > some points about pglog:
> > > 1. short-alive but frequency(HIGH)
> > > 2. small and related to the number of pgs
> > > 3. typical seq read/write scene
> > > 4. doesn't need rich structure like LSM or B-tree to support apis, has
> > > obvious different to user-side/other omap keys.
> > > 5. a simple loopback impl is efficient and simple
> > 
> > It simpler.. though not quite as simple as it could be.  The pg log
> > lengths may vary widely (10000 events could be any time span,
> > depending on how active the pg is).  And we want to pix all the pg log
> > events into a single append stream for write efficiency.  So we still need
> > some complex tracking and eventual compaction ... which is part of
> > what the LSM is doing for us.
> > 
> > > > PS(off topic): a keyvaluedb benchmark http://sphia.org/benchmarks.html
> > 
> > This looks pretty interested!  Anyone interested in giving it a spin?  It
> > should be pretty easy to write it into the KeyValueDB interface.
> > 
> > sage
> > 
> 
> 



* RE: [NewStore]About PGLog Workload With RocksDB
  2015-09-08 13:58 [NewStore]About PGLog Workload With RocksDB Haomai Wang
  2015-09-08 14:06 ` Haomai Wang
@ 2015-09-09  7:28 ` Dałek, Piotr
  1 sibling, 0 replies; 10+ messages in thread
From: Dałek, Piotr @ 2015-09-09  7:28 UTC (permalink / raw)
  To: Haomai Wang, Sage Weil; +Cc: ceph-devel@vger.kernel.org

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Haomai Wang
> Sent: Tuesday, September 08, 2015 3:58 PM
> To: Sage Weil
 
> Hi Sage,
> 
> I notice your post in rocksdb page about make rocksdb aware of short alive
> key/value pairs.
> 
> I think it would be great if one keyvalue db impl could support different key
> types with different store behaviors. But it looks like difficult for me to add
> this feature to an existing db.
> 
> [..]

Somewhat radical (and maybe also reckless) point of view here, but do we actually need RocksDB at all? I'm perfectly aware of the fact that writing our own RocksDB replacement would take a lot of man-hours, but maybe they'd be better spent that way than on working around current RocksDB limitations, fixing its bugs, and tuning Ceph for its code?



With best regards / Pozdrawiam
Piotr Dałek


* RE: [NewStore]About PGLog Workload With RocksDB
       [not found]         ` <6F3FA899187F0043BA1827A69DA2F7CC0361ED99@shsmsx102.ccr.corp.intel.com>
@ 2015-09-14 12:31           ` Sage Weil
  0 siblings, 0 replies; 10+ messages in thread
From: Sage Weil @ 2015-09-14 12:31 UTC (permalink / raw)
  To: Chen, Xiaoxi; +Cc: Mark Nelson, ceph-devel

On Mon, 14 Sep 2015, Chen, Xiaoxi wrote:
> >What I'd like to do is make it so that we take N + M logs, and we 
> >flush/dedup the N log segments into SST's... but skip any keys present 
> >in the M log buffers that follow.
> >That way we have M logs' worth of time to hide overwrites and 
> >short-lived keys.  The problem is that it will break the snapshot and 
> >replication stuff in rocksdb.  We don't need that.
> 
> It looks to me like this can be partially done with the option 
> "min_write_buffer_number_to_merge". min_write_buffer_number_to_merge is 
> the minimum number of memtables to be merged before flushing to storage. 
> For example, if this option is set to 2, immutable memtables are only 
> flushed when there are two of them - a single immutable memtable will 
> never be flushed.

Yes, and the write/delete/update pairs in those 2 tables will be 
consolidated, but we'll still write out any keys that were 'alive' when 
the most recent of the N memtables was completed.  So it'll reduce the 
writeout of ephemeral keys by a factor of N, but not completely.  And 
there'll still be some ephemeral items in L0 that will have to get 
compacted out later (not sure how expensive that is).

> What if we set it to N+M, so that short-lived keys whose lifetime < N+M 
> will not be flushed to L0? The downside might be that the compaction/flush 
> load would be more spiky than with the approach you proposed.

You mean just set min_write_buffer_number_to_merge to a larger number?  
It'll reduce it further but never completely eliminate it.  I think it'll 
also end up generating large L0 SST's if the number of ephemeral keys is 
low..
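
For reference, those knobs are plain ColumnFamilyOptions in stock
rocksdb (the numbers below are only illustrative):

  #include <rocksdb/options.h>

  // Illustrative values only: keeping more memtables in memory and requiring
  // several of them before a flush gives short-lived keys more time to be
  // overwritten or deleted before anything is written out to L0.
  rocksdb::Options make_opts() {
    rocksdb::Options opts;
    opts.write_buffer_size = 32 << 20;            // 32 MB per memtable
    opts.max_write_buffer_number = 6;             // keep up to 6 memtables
    opts.min_write_buffer_number_to_merge = 4;    // merge 4 before flushing
    return opts;
  }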

> These two approaches depend on size, so by default the WAL 
> items (which are relatively much bigger than PGLog entries) will likely 
> make the PGLog flush more frequently. I'm not sure whether keys belonging 
> to different column families have separate buffers -- it seems they do, 
> which would be great.

All the column families share the same WAL/write buffers.

I think it's hard to tell how much we are losing relative to a largish 
min_write_buffer_number_to_merge value.  Clearly it's less optimal, but 
maybe it's close enough for now.  It would be great to prototype the N+M 
dedup and see.  I forget if I mentioned it on the list, but on the rocksdb 
page Siying Dong says:

"I can't think of a way that you can config it as you like for now. It's 
possible to add a new feature to fulfill that. My idea would be similar to 
Dhruba Borthakur's. We only start flushing when we have two immutable 
memtables M1 and M2 and we only flush the older one M1. When flushing M1, 
we use M2 as a reference. We can iterate the two mem tables at the same 
time, while M2 is only used to check whether there is a new value 
overwriting the values in M1. This is a feature of perhaps 1-2 weeks. If 
someone from the community is interested in helping in implementing it, we 
can provide help to you."

Also, apparently hbase is working on doing something similar.

If anyone is interested in hacking on this, let me know!

sage

> 
> 
> -xiaoxi
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Wednesday, September 9, 2015 3:33 AM
> To: Mark Nelson
> Subject: Re: [NewStore]About PGLog Workload With RocksDB
> 
> On Tue, 8 Sep 2015, Mark Nelson wrote:
> > On 09/08/2015 02:19 PM, Sage Weil wrote:
> > > > On Tue, Sep 8, 2015 at 9:58 PM, Haomai Wang <haomaiwang@gmail.com> wrote:
> > > > > Hi Sage,
> > > > > 
> > > > > I notice your post in rocksdb page about make rocksdb aware of 
> > > > > short alive key/value pairs.
> > > > > 
> > > > > I think it would be great if one keyvalue db impl could support 
> > > > > different key types with different store behaviors. But it looks 
> > > > > like difficult for me to add this feature to an existing db.
> > > 
> > > WiredTiger comes to mind.. it supports a few different backing 
> > > strategies (both btree and lsm, iirc).  Also, rocksdb has column families.
> > > That doesn't help with the write log piece (the log is shared, as I 
> > > understand it) but it does mean that we can segregate the log events 
> > > or other bits of the namespace off into regions that have different 
> > > compaction policies (e.g., rarely/ever compact so that we avoid 
> > > amplification and but suffer on reads during startup).
> > > 
> > > > > So combine my experience with filestore, I just think let 
> > > > > NewStore/FileStore aware of this short-alive keys(Or just PGLog 
> > > > > keys) could be easy and effective. PGLog owned by PG and 
> > > > > maintain the history of ops. It's alike Journal Data but only 
> > > > > have several hundreds bytes. Actually we only need to have 
> > > > > several hundreds MB at most to store all pgs pglog. For 
> > > > > FileStore, we already have FileJournal have a copy of PGLog, 
> > > > > previously I always think about reduce another copy in leveldb 
> > > > > to reduce leveldb calls which consumes lots of cpu cycles. But 
> > > > > it need a lot of works to be done in FileJournal to aware of 
> > > > > pglog things. NewStore doesn't use FileJournal and it should be easier to settle down my idea(?).
> > > > > 
> > > > > Actually I think a rados write op in current objectstore impl 
> > > > > that omap key/value pairs hurts performance hugely. Lots of cpu 
> > > > > cycles are consumed and contributes to short-alive keys(pglog). 
> > > > > It should be a obvious optimization point. In the other hands, 
> > > > > pglog is dull and doesn't need rich keyvalue api supports. Maybe 
> > > > > a lightweight filejournal to settle down pglogs keys is also worth to try.
> > > > > 
> > > > > In short, I think it would be cleaner and easier than improving 
> > > > > rocksdb to impl a pglog-optimization structure to store this.
> > > 
> > > I've given some thought to adding a FileJournal to newstore to do 
> > > the wal events (which are the main thing we're putting in rocksdb 
> > > that is *always* shortlived and can be reasonably big--and thus cost 
> > > a lot when it gets flushed to L0).  But it just makes things a lot 
> > > more complex.  We would have two synchronization (fsync) targets, or 
> > > we would want to be smart about putting entire transactions in one journal and not the other.
> > > Still thinking about it, but it makes me a bit sad--it really feels 
> > > like this is a common, simple workload that the KeyValueDB 
> > > implementation should be able to handle.  What I'd really like is a 
> > > hint on they key, or a predetermined key range that we use, so that 
> > > the backend knows our lifecycle expectations and can optimize accordingly.
> > 
> > That sure seems like the right way to go to me too.
> > 
> > > 
> > > I'm hoping I can sell the rocksdb folks on a log rotation and flush 
> > > strategy that prevents these keys from every making it into L0... 
> > > that, combined with the overwrite change, will give us both low 
> > > latency and no amplification for these writes (and any other keys 
> > > that get rewritten, like hot object metadata).
> > 
> > Am I correct that the above is assuming we can't convince the rocksdb 
> > guys that we should be able to explicitly hint that the key should 
> > never go to L0 anyway?
> 
> The 'flush strategy' part would make overwritten or deleted keys skip L0, without any hint needed.  The rest (compaction policy, I guess) I think can be accomplished with column families... so I don't think hints are actually needed in the rocksdb case.
> 
> Basically, the issue is that some number of log files/buffers are compacted together and then written to a new L0 sst.  That means that any keys alive at the end of that time period are amplified.  What I'd like to do is make it so that we take N + M logs, and we flush/dedup the N log segments into SST's... but skip any keys present in the M log buffers that follow.  That way we have M logs' worth of time to hide overwrites and short-lived keys.  The problem is that it will break the snapshot and 
> replication stuff in rocksdb.   We don't need that, so I'm hoping we can 
> have an upstream feature that's incompatible with those bits... otherwise, we'd need to carry our changes downstream, which would suck, or just accept that some of the WAL events will get rewritten (making N very large would make it more infrequent, but never eliminate it entirely).
> 
> Anyway, what I meant to say before but forgot was that the other reason I like putting this in the kv backend is that it gives us a single commit transaction log to reason about.  That will keep things simpler and while still keeping latency low.
> 
> sage
> 
> 
> > 
> > > 
> > > On Tue, 8 Sep 2015, Haomai Wang wrote:
> > > > Hit "Send" by accident for previous mail. :-(
> > > > 
> > > > some points about pglog:
> > > > 1. short-alive but frequency(HIGH) 2. small and related to the 
> > > > number of pgs 3. typical seq read/write scene 4. doesn't need rich 
> > > > structure like LSM or B-tree to support apis, has obvious 
> > > > different to user-side/other omap keys.
> > > > 5. a simple loopback impl is efficient and simple
> > > 
> > > It simpler.. though not quite as simple as it could be.  The pg log 
> > > lengths may vary widely (10000 events could be any time span, 
> > > depending on how active the pg is).  And we want to pix all the pg 
> > > log events into a single append stream for write efficiency.  So we 
> > > still need some complex tracking and eventual compaction ... which 
> > > is part of what the LSM is doing for us.
> > > 
> > > > > PS(off topic): a keyvaluedb benchmark 
> > > > > http://sphia.org/benchmarks.html
> > > 
> > > This looks pretty interested!  Anyone interested in giving it a 
> > > spin?  It should be pretty easy to write it into the KeyValueDB interface.
> > > 
> > > sage
> > > 
> > 
> > 
> 
> 
> 


Thread overview: 10 messages
2015-09-08 13:58 [NewStore]About PGLog Workload With RocksDB Haomai Wang
2015-09-08 14:06 ` Haomai Wang
2015-09-08 14:12   ` Gregory Farnum
2015-09-08 14:18     ` Haomai Wang
2015-09-08 15:47     ` Gregory Farnum
2015-09-08 19:19   ` Sage Weil
2015-09-08 19:27     ` Mark Nelson
     [not found]     ` <55EF3639.3060108@redhat.com>
2015-09-08 19:32       ` Sage Weil
     [not found]         ` <6F3FA899187F0043BA1827A69DA2F7CC0361ED99@shsmsx102.ccr.corp.intel.com>
2015-09-14 12:31           ` Sage Weil
2015-09-09  7:28 ` Dałek, Piotr
