
Bluestore assert

Message ID BL2PR02MB2115BB77EA5BA38BCF48E52DF4110@BL2PR02MB2115.namprd02.prod.outlook.com (mailing list archive)
State New, archived

Commit Message

Somnath Roy Aug. 14, 2016, 5:24 p.m. UTC
Sage,
I did this..

root@emsnode5:~/ceph-master/src# git diff
> It's not obvious to me how we would get NotFound when doing a Write into the kv store.
> 
> Thanks!
> sage

> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Thursday, August 11, 2016 9:36 AM
> To: Mark Nelson
> Cc: Somnath Roy; ceph-devel
> Subject: Re: Bluestore assert
> 
> On Thu, 11 Aug 2016, Mark Nelson wrote:
> > Sorry if I missed this during discussion, but why are these being 
> > called if the file is deleted?
> 
> I'm not sure... rocksdb is the one consuming the interface.  Looking through the code, though, this is the only way I can see that we could log an op_file_update *after* an op_file_remove.
> 
> sage
> 
> >
> > Mark
> >
> > On 08/11/2016 11:29 AM, Sage Weil wrote:
> > > On Thu, 11 Aug 2016, Somnath Roy wrote:
> > > > Sage,
> > > > Please find the full log for the BlueFS replay bug in the 
> > > > following location.
> > > >
> > > > https://github.com/somnathr/ceph/blob/master/ceph-osd.1.log.zip
> > > >
> > > > For the db transaction one , I have added code to dump the 
> > > > rocksdb error code before the assert as you suggested and waiting to reproduce.
> > >
> > > I'm pretty sure this is the root cause:
> > >
> > > https://github.com/ceph/ceph/pull/10686
> > >
> > > sage
> >
> >
> 
> 

Comments

Sage Weil Aug. 15, 2016, 7:53 p.m. UTC | #1
On Sun, 14 Aug 2016, Somnath Roy wrote:
> Sage,
> I did this..
> 
> root@emsnode5:~/ceph-master/src# git diff
> diff --git a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc
> index 638d231..bcf0935 100644
> --- a/src/kv/RocksDBStore.cc
> +++ b/src/kv/RocksDBStore.cc
> @@ -370,6 +370,10 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>    logger->inc(l_rocksdb_txns);
>    logger->tinc(l_rocksdb_submit_latency, lat);
> +  if (!s.ok()) {
> +    derr << __func__ << " error: " << s.ToString()
> +        << "code = " << s.code() << dendl;
> +  }
>    return s.ok() ? 0 : -1;
>  }
> 
> @@ -385,6 +389,11 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>    logger->inc(l_rocksdb_txns_sync);
>    logger->tinc(l_rocksdb_submit_sync_latency, lat);
> +  if (!s.ok()) {
> +    derr << __func__ << " error: " << s.ToString()
> +        << "code = " << s.code() << dendl;
> +  }
> +
>    return s.ok() ? 0 : -1;
>  }
>  int RocksDBStore::get_info_log_level(string info_log_level)
> @@ -442,7 +451,8 @@ void RocksDBStore::RocksDBTransactionImpl::rmkey(const string &prefix,
>  void RocksDBStore::RocksDBTransactionImpl::rm_single_key(const string &prefix,
>                                                          const string &k)
>  {
> -  bat->SingleDelete(combine_strings(prefix, k));
> +  //bat->SingleDelete(combine_strings(prefix, k));
> +  bat->Delete(combine_strings(prefix, k));
>  }
> 
> But, the db crash is still happening with the following log message.
> 
> rocksdb: submit_transaction_sync error: NotFound: code = 1
> 
> It seems it is not related to rm_single_key as I am hitting this from  https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L5108 as well where rm_single_key is not called.
> Maybe I should dump the transaction and see what's in there?

Yeah.  Unfortunately I think it isn't trivial to dump the kv transactions 
because they're being constructed by rocksdb (as a WriteBatch).  
Not sure if there is a dump for that (I'm guessing not?).  You'd need to 
write one, or build a kludgey lookaside map that can be dumped.
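
A minimal sketch of the lookaside idea suggested above: mirror every mutation into a plain, printable list next to the real rocksdb::WriteBatch so the whole transaction can be dumped when the submit fails. The names here (TxnRecorder, set, rmkey, dump) are made up for illustration; this is not existing ceph code.

#include <iostream>
#include <string>
#include <vector>
#include <rocksdb/write_batch.h>

// Hypothetical lookaside recorder: apply each op to the real WriteBatch
// and keep a human-readable copy that can be dumped on error.
struct TxnRecorder {
  rocksdb::WriteBatch bat;
  std::vector<std::string> ops;

  void set(const std::string& key, const std::string& val) {
    bat.Put(key, val);
    ops.push_back("put " + key + " (" + std::to_string(val.size()) + " bytes)");
  }
  void rmkey(const std::string& key) {
    bat.Delete(key);
    ops.push_back("delete " + key);
  }
  void dump(std::ostream& out) const {
    for (const auto& op : ops)
      out << op << "\n";
  }
};

int main() {
  TxnRecorder t;
  t.set("Pfoo", "bar");
  t.rmkey("Pbaz");
  t.dump(std::cerr);   // in ceph this would go through derr on a failed submit
  return 0;
}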
 
> I am hitting the BlueFS replay bug I mentioned earlier and applied your patch (https://github.com/ceph/ceph/pull/10686) but not helping.
> Is it because I needed to run with this patch from the beginning and not just during replay ?

Yeah, the bug happens before replay.. we are writing a bad entry into the 
bluefs log.

sage
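
For context on the rm_single_key change quoted above: rocksdb's SingleDelete has a narrower contract than Delete. The sketch below uses only the stock rocksdb C++ API (the path and key names are made up) and only illustrates that contract; it is not a claim about where the NotFound is coming from.

#include <cassert>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/singledelete_demo", &db);
  assert(s.ok());

  // Fine: the key is Put() exactly once and removed exactly once.
  rocksdb::WriteBatch ok_batch;
  ok_batch.Put("k1", "v1");
  ok_batch.SingleDelete("k1");
  s = db->Write(rocksdb::WriteOptions(), &ok_batch);
  assert(s.ok());

  // Risky: SingleDelete's documented contract is that the key must not have
  // been overwritten (or single-deleted twice); otherwise the result is
  // undefined.  Plain Delete, the substitution in the quoted patch, has no
  // such restriction.
  rocksdb::WriteBatch risky_batch;
  risky_batch.Put("k2", "v1");
  risky_batch.Put("k2", "v2");       // overwrite ...
  risky_batch.SingleDelete("k2");    // ... so this is outside the contract
  s = db->Write(rocksdb::WriteOptions(), &risky_batch);

  delete db;
  return 0;
}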


> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com] 
> Sent: Thursday, August 11, 2016 3:32 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Thu, 11 Aug 2016, Somnath Roy wrote:
> > Sage,
> > Regarding the db assert , I hit that again on multiple OSDs while I was populating 40TB rbd images (~35TB written before crash).
> > I did the following changes in the code..
> > 
> > @@ -370,7 +370,7 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
> >    utime_t lat = ceph_clock_now(g_ceph_context) - start;
> >    logger->inc(l_rocksdb_txns);
> >    logger->tinc(l_rocksdb_submit_latency, lat);
> > -  return s.ok() ? 0 : -1;
> > +  return s.ok() ? 0 : -s.code();
> >  }
> > 
> >  int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t) 
> > @@ -385,7 +385,7 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
> >    utime_t lat = ceph_clock_now(g_ceph_context) - start;
> >    logger->inc(l_rocksdb_txns_sync);
> >    logger->tinc(l_rocksdb_submit_sync_latency, lat);
> > -  return s.ok() ? 0 : -1;
> > +  return s.ok() ? 0 : -s.code();
> >  }
> >  int RocksDBStore::get_info_log_level(string info_log_level)  { diff 
> > --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc 
> > index fe7f743..3f4ecd5 100644
> > --- a/src/os/bluestore/BlueStore.cc
> > +++ b/src/os/bluestore/BlueStore.cc
> > @@ -4989,6 +4989,9 @@ void BlueStore::_kv_sync_thread()
> >              ++it) {
> >           _txc_finalize_kv((*it), (*it)->t);
> >           int r = db->submit_transaction((*it)->t);
> > +          if (r < 0 ) {
> > +            dout(0) << "submit_transaction returned = " << r << dendl;
> > +          }
> >           assert(r == 0);
> >         }
> >        }
> > @@ -5026,6 +5029,10 @@ void BlueStore::_kv_sync_thread()
> >         t->rm_single_key(PREFIX_WAL, key);
> >        }
> >        int r = db->submit_transaction_sync(t);
> > +      if (r < 0 ) {
> > +        dout(0) << "submit_transaction_sync returned = " << r << dendl;
> > +      }
> > +
> >        assert(r == 0);
> > 
> > 
> > This is printing -1 in the log before the assert. So, the corresponding code from the rocksdb side is "kNotFound".
> > It is not related to space, as I hit this same issue irrespective of whether the db partition size is 100G or 300G.
> > It seems like some kind of corruption within Bluestore?
> > Let me know the next step.
> 
> Can you add this too?
> 
> diff --git a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc index 638d231..b5467f7 100644
> --- a/src/kv/RocksDBStore.cc
> +++ b/src/kv/RocksDBStore.cc
> @@ -370,6 +370,9 @@ int
> RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>    logger->inc(l_rocksdb_txns);
>    logger->tinc(l_rocksdb_submit_latency, lat);
> +  if (!s.ok()) {
> +    derr << __func__ << " error: " << s.ToString() << dendl;  }
>    return s.ok() ? 0 : -1;
>  }
>  
> It's not obvious to me how we would get NotFound when doing a Write into the kv store.
> 
> Thanks!
> sage
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Thursday, August 11, 2016 9:36 AM
> > To: Mark Nelson
> > Cc: Somnath Roy; ceph-devel
> > Subject: Re: Bluestore assert
> > 
> > On Thu, 11 Aug 2016, Mark Nelson wrote:
> > > Sorry if I missed this during discussion, but why are these being 
> > > called if the file is deleted?
> > 
> > I'm not sure... rocksdb is the one consuming the interface.  Looking through the code, though, this is the only way I can see that we could log an op_file_update *after* an op_file_remove.
> > 
> > sage
> > 
> > >
> > > Mark
> > >
> > > On 08/11/2016 11:29 AM, Sage Weil wrote:
> > > > On Thu, 11 Aug 2016, Somnath Roy wrote:
> > > > > Sage,
> > > > > Please find the full log for the BlueFS replay bug in the 
> > > > > following location.
> > > > >
> > > > > https://github.com/somnathr/ceph/blob/master/ceph-osd.1.log.zip
> > > > >
> > > > > For the db transaction one , I have added code to dump the 
> > > > > rocksdb error code before the assert as you suggested and waiting to reproduce.
> > > >
> > > > I'm pretty sure this is the root cause:
> > > >
> > > > https://github.com/ceph/ceph/pull/10686
> > > >
> > > > sage
> > >
> > >
> > 
> > 
> 
> 
Somnath Roy Aug. 16, 2016, 7:45 p.m. UTC | #2
Sage,
The replay bug *is fixed* with your patch. I am able to make the OSDs (and cluster) up after hitting the db assertion bug.
Presently, I am trying to root-cause and debug the db assertion issue.

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Monday, August 15, 2016 12:54 PM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Sun, 14 Aug 2016, Somnath Roy wrote:
> Sage,
> I did this..
> 
> root@emsnode5:~/ceph-master/src# git diff diff --git 
> a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc index 
> 638d231..bcf0935 100644
> --- a/src/kv/RocksDBStore.cc
> +++ b/src/kv/RocksDBStore.cc
> @@ -370,6 +370,10 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>    logger->inc(l_rocksdb_txns);
>    logger->tinc(l_rocksdb_submit_latency, lat);
> +  if (!s.ok()) {
> +    derr << __func__ << " error: " << s.ToString()
> +        << "code = " << s.code() << dendl;  }
>    return s.ok() ? 0 : -1;
>  }
> 
> @@ -385,6 +389,11 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>    logger->inc(l_rocksdb_txns_sync);
>    logger->tinc(l_rocksdb_submit_sync_latency, lat);
> +  if (!s.ok()) {
> +    derr << __func__ << " error: " << s.ToString()
> +        << "code = " << s.code() << dendl;  }
> +
>    return s.ok() ? 0 : -1;
>  }
>  int RocksDBStore::get_info_log_level(string info_log_level) @@ -442,7 
> +451,8 @@ void RocksDBStore::RocksDBTransactionImpl::rmkey(const 
> string &prefix,  void RocksDBStore::RocksDBTransactionImpl::rm_single_key(const string &prefix,
>                                                          const string 
> &k)  {
> -  bat->SingleDelete(combine_strings(prefix, k));
> +  //bat->SingleDelete(combine_strings(prefix, k));  
> + bat->Delete(combine_strings(prefix, k));
>  }
> 
> But, the db crash is still happening with the following log message.
> 
> rocksdb: submit_transaction_sync error: NotFound: code = 1
> 
> It seems it is not related to rm_single_key as I am hitting this from  https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L5108 as well where rm_single_key is not called.
> Maybe I should dump the transaction and see what's in there?

Yeah.  Unfortunately I think it isn't trivial to dump the kv transactions because they're being constructed by rocksdb (WriteBack or something).  
Not sure if there is a dump for that (I'm guessing not?).  You'd need to write one, or build a kludgey lookaside map that can be dumped.
 
> I am hitting the BlueFS replay bug I mentioned earlier and applied your patch (https://github.com/ceph/ceph/pull/10686) but not helping.
> Is it because I needed to run with this patch from the beginning and not just during replay ?

Yeah, the bug happens before replay.. we are writing a bad entry into the bluefs log.

sage


> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Thursday, August 11, 2016 3:32 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Thu, 11 Aug 2016, Somnath Roy wrote:
> > Sage,
> > Regarding the db assert , I hit that again on multiple OSDs while I was populating 40TB rbd images (~35TB written before crash).
> > I did the following changes in the code..
> > 
> > @@ -370,7 +370,7 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
> >    utime_t lat = ceph_clock_now(g_ceph_context) - start;
> >    logger->inc(l_rocksdb_txns);
> >    logger->tinc(l_rocksdb_submit_latency, lat);
> > -  return s.ok() ? 0 : -1;
> > +  return s.ok() ? 0 : -s.code();
> >  }
> > 
> >  int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction 
> > t) @@ -385,7 +385,7 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
> >    utime_t lat = ceph_clock_now(g_ceph_context) - start;
> >    logger->inc(l_rocksdb_txns_sync);
> >    logger->tinc(l_rocksdb_submit_sync_latency, lat);
> > -  return s.ok() ? 0 : -1;
> > +  return s.ok() ? 0 : -s.code();
> >  }
> >  int RocksDBStore::get_info_log_level(string info_log_level)  { diff 
> > --git a/src/os/bluestore/BlueStore.cc 
> > b/src/os/bluestore/BlueStore.cc index fe7f743..3f4ecd5 100644
> > --- a/src/os/bluestore/BlueStore.cc
> > +++ b/src/os/bluestore/BlueStore.cc
> > @@ -4989,6 +4989,9 @@ void BlueStore::_kv_sync_thread()
> >              ++it) {
> >           _txc_finalize_kv((*it), (*it)->t);
> >           int r = db->submit_transaction((*it)->t);
> > +          if (r < 0 ) {
> > +            dout(0) << "submit_transaction returned = " << r << dendl;
> > +          }
> >           assert(r == 0);
> >         }
> >        }
> > @@ -5026,6 +5029,10 @@ void BlueStore::_kv_sync_thread()
> >         t->rm_single_key(PREFIX_WAL, key);
> >        }
> >        int r = db->submit_transaction_sync(t);
> > +      if (r < 0 ) {
> > +        dout(0) << "submit_transaction_sync returned = " << r << dendl;
> > +      }
> > +
> >        assert(r == 0);
> > 
> > 
> > This is printing -1 in the log before the assert. So, the corresponding code from the rocksdb side is "kNotFound".
> > It is not related to space, as I hit this same issue irrespective of whether the db partition size is 100G or 300G.
> > It seems like some kind of corruption within Bluestore?
> > Let me know the next step.
> 
> Can you add this too?
> 
> diff --git a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc index 
> 638d231..b5467f7 100644
> --- a/src/kv/RocksDBStore.cc
> +++ b/src/kv/RocksDBStore.cc
> @@ -370,6 +370,9 @@ int
> RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>    logger->inc(l_rocksdb_txns);
>    logger->tinc(l_rocksdb_submit_latency, lat);
> +  if (!s.ok()) {
> +    derr << __func__ << " error: " << s.ToString() << dendl;  }
>    return s.ok() ? 0 : -1;
>  }
>  
> It's not obvious to me how we would get NotFound when doing a Write into the kv store.
> 
> Thanks!
> sage
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Thursday, August 11, 2016 9:36 AM
> > To: Mark Nelson
> > Cc: Somnath Roy; ceph-devel
> > Subject: Re: Bluestore assert
> > 
> > On Thu, 11 Aug 2016, Mark Nelson wrote:
> > > Sorry if I missed this during discussion, but why are these being 
> > > called if the file is deleted?
> > 
> > I'm not sure... rocksdb is the one consuming the interface.  Looking through the code, though, this is the only way I can see that we could log an op_file_update *after* an op_file_remove.
> > 
> > sage
> > 
> > >
> > > Mark
> > >
> > > On 08/11/2016 11:29 AM, Sage Weil wrote:
> > > > On Thu, 11 Aug 2016, Somnath Roy wrote:
> > > > > Sage,
> > > > > Please find the full log for the BlueFS replay bug in the 
> > > > > following location.
> > > > >
> > > > > https://github.com/somnathr/ceph/blob/master/ceph-osd.1.log.zi
> > > > > p
> > > > >
> > > > > For the db transaction one , I have added code to dump the 
> > > > > rocksdb error code before the assert as you suggested and waiting to reproduce.
> > > >
> > > > I'm pretty sure this is the root cause:
> > > >
> > > > https://github.com/ceph/ceph/pull/10686
> > > >
> > > > sage
> > >
> > >
> > 
> > 
> 
> 
Somnath Roy Aug. 22, 2016, 7:33 a.m. UTC | #3
Sage,
Got the following asserts on two different paths with the latest master.

1.
os/bluestore/BlueFS.cc: 1377: FAILED assert(h->file->fnode.ino != 1)

 ceph version 11.0.0-1688-g6f48ee6 (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55f0d46f9cd0]
 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1bed) [0x55f0d43cb34d]
 3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x55f0d43cb467]
 4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x3b2) [0x55f0d43ccf12]
 5: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x35b) [0x55f0d43ce2fb]
 6: (BlueRocksWritableFile::Sync()+0x62) [0x55f0d43e5c32]
 7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1) [0x55f0d456f4f1]
 8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0) [0x55f0d45709a0]
 9: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6) [0x55f0d45b2506]
 10: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x14ea) [0x55f0d45b4cca]
 11: (rocksdb::CompactionJob::Run()+0x479) [0x55f0d45b5c49]
 12: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) [0x55f0d44a4610]
 13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) [0x55f0d44b147f]
 14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) [0x55f0d4568079]
 15: (()+0x98f113) [0x55f0d4568113]
 16: (()+0x76fa) [0x7f06101576fa]
 17: (clone()+0x6d) [0x7f060dfb7b5d]


2.

5700 time 2016-08-21 23:15:50.962450
os/bluestore/BlueFS.cc: 1377: FAILED assert(h->file->fnode.ino != 1)

 ceph version 11.0.0-1688-g6f48ee6 (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55d9959bfcd0]
 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1bed) [0x55d99569134d]
 3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x55d995691467]
 4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x3b2) [0x55d995692f12]
 5: (BlueFS::sync_metadata()+0x1c3) [0x55d995697c33]
 6: (BlueRocksDirectory::Fsync()+0xd) [0x55d9956ab98d]
 7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13fa) [0x55d995778c2a]
 8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x2a) [0x55d9957797aa]
 9: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x6b) [0x55d99573cb5b]
 10: (BlueStore::_kv_sync_thread()+0x1745) [0x55d995589e65]
 11: (BlueStore::KVSyncThread::entry()+0xd) [0x55d9955b754d]
 12: (Thread::entry_wrapper()+0x75) [0x55d99599f5e5]
 13: (()+0x76fa) [0x7f8a0bdde6fa]
 14: (clone()+0x6d) [0x7f8a09c3eb5d]


I saw this assert is newly introduced in the code.
FYI, I was running rocksdb by enabling universal style compaction during this time.
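
For reference, a standalone sketch of the rocksdb setting referred to above. In a ceph deployment this would normally be passed in through the bluestore_rocksdb_options string rather than set in code, and the database path below is made up.

#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Switch from the default level-style compaction to universal compaction,
  // which compacts whole sorted runs at once and therefore changes when the
  // large compaction output files are written and synced.
  options.compaction_style = rocksdb::kCompactionStyleUniversal;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/universal_demo", &db);
  if (!s.ok())
    return 1;
  delete db;
  return 0;
}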

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Tuesday, August 16, 2016 12:45 PM
To: Sage Weil
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

Sage,
The replay bug *is fixed* with your patch. I am able to make the OSDs (and cluster) up after hitting the db assertion bug.
Presently, I am trying to root-cause and debug the db assertion issue.

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Monday, August 15, 2016 12:54 PM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Sun, 14 Aug 2016, Somnath Roy wrote:
> Sage,
> I did this..
> 
> root@emsnode5:~/ceph-master/src# git diff diff --git 
> a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc index
> 638d231..bcf0935 100644
> --- a/src/kv/RocksDBStore.cc
> +++ b/src/kv/RocksDBStore.cc
> @@ -370,6 +370,10 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>    logger->inc(l_rocksdb_txns);
>    logger->tinc(l_rocksdb_submit_latency, lat);
> +  if (!s.ok()) {
> +    derr << __func__ << " error: " << s.ToString()
> +        << "code = " << s.code() << dendl;  }
>    return s.ok() ? 0 : -1;
>  }
> 
> @@ -385,6 +389,11 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>    logger->inc(l_rocksdb_txns_sync);
>    logger->tinc(l_rocksdb_submit_sync_latency, lat);
> +  if (!s.ok()) {
> +    derr << __func__ << " error: " << s.ToString()
> +        << "code = " << s.code() << dendl;  }
> +
>    return s.ok() ? 0 : -1;
>  }
>  int RocksDBStore::get_info_log_level(string info_log_level) @@ -442,7
> +451,8 @@ void RocksDBStore::RocksDBTransactionImpl::rmkey(const
> string &prefix,  void RocksDBStore::RocksDBTransactionImpl::rm_single_key(const string &prefix,
>                                                          const string
> &k)  {
> -  bat->SingleDelete(combine_strings(prefix, k));
> +  //bat->SingleDelete(combine_strings(prefix, k));
> + bat->Delete(combine_strings(prefix, k));
>  }
> 
> But, the db crash is still happening with the following log message.
> 
> rocksdb: submit_transaction_sync error: NotFound: code = 1
> 
> It seems it is not related to rm_single_key as I am hitting this from  https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L5108 as well where rm_single_key is not called.
> Maybe I should dump the transaction and see what's in there?

Yeah.  Unfortunately I think it isn't trivial to dump the kv transactions because they're being constructed by rocksdb (WriteBack or something).  
Not sure if there is a dump for that (I'm guessing not?).  You'd need to write one, or build a kludgey lookaside map that can be dumped.
 
> I am hitting the BlueFS replay bug I mentioned earlier and applied your patch (https://github.com/ceph/ceph/pull/10686) but not helping.
> Is it because I needed to run with this patch from the beginning and not just during replay ?

Yeah, the bug happens before replay.. we are writing a bad entry into the bluefs log.

sage


> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Thursday, August 11, 2016 3:32 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Thu, 11 Aug 2016, Somnath Roy wrote:
> > Sage,
> > Regarding the db assert , I hit that again on multiple OSDs while I was populating 40TB rbd images (~35TB written before crash).
> > I did the following changes in the code..
> > 
> > @@ -370,7 +370,7 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
> >    utime_t lat = ceph_clock_now(g_ceph_context) - start;
> >    logger->inc(l_rocksdb_txns);
> >    logger->tinc(l_rocksdb_submit_latency, lat);
> > -  return s.ok() ? 0 : -1;
> > +  return s.ok() ? 0 : -s.code();
> >  }
> > 
> >  int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction
> > t) @@ -385,7 +385,7 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
> >    utime_t lat = ceph_clock_now(g_ceph_context) - start;
> >    logger->inc(l_rocksdb_txns_sync);
> >    logger->tinc(l_rocksdb_submit_sync_latency, lat);
> > -  return s.ok() ? 0 : -1;
> > +  return s.ok() ? 0 : -s.code();
> >  }
> >  int RocksDBStore::get_info_log_level(string info_log_level)  { diff 
> > --git a/src/os/bluestore/BlueStore.cc 
> > b/src/os/bluestore/BlueStore.cc index fe7f743..3f4ecd5 100644
> > --- a/src/os/bluestore/BlueStore.cc
> > +++ b/src/os/bluestore/BlueStore.cc
> > @@ -4989,6 +4989,9 @@ void BlueStore::_kv_sync_thread()
> >              ++it) {
> >           _txc_finalize_kv((*it), (*it)->t);
> >           int r = db->submit_transaction((*it)->t);
> > +          if (r < 0 ) {
> > +            dout(0) << "submit_transaction returned = " << r << dendl;
> > +          }
> >           assert(r == 0);
> >         }
> >        }
> > @@ -5026,6 +5029,10 @@ void BlueStore::_kv_sync_thread()
> >         t->rm_single_key(PREFIX_WAL, key);
> >        }
> >        int r = db->submit_transaction_sync(t);
> > +      if (r < 0 ) {
> > +        dout(0) << "submit_transaction_sync returned = " << r << dendl;
> > +      }
> > +
> >        assert(r == 0);
> > 
> > 
> > This is printing -1 in the log before the assert. So, the corresponding code from the rocksdb side is "kNotFound".
> > It is not related to space, as I hit this same issue irrespective of whether the db partition size is 100G or 300G.
> > It seems like some kind of corruption within Bluestore?
> > Let me know the next step.
> 
> Can you add this too?
> 
> diff --git a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc index
> 638d231..b5467f7 100644
> --- a/src/kv/RocksDBStore.cc
> +++ b/src/kv/RocksDBStore.cc
> @@ -370,6 +370,9 @@ int
> RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>    logger->inc(l_rocksdb_txns);
>    logger->tinc(l_rocksdb_submit_latency, lat);
> +  if (!s.ok()) {
> +    derr << __func__ << " error: " << s.ToString() << dendl;  }
>    return s.ok() ? 0 : -1;
>  }
>  
> It's not obvious to me how we would get NotFound when doing a Write into the kv store.
> 
> Thanks!
> sage
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Thursday, August 11, 2016 9:36 AM
> > To: Mark Nelson
> > Cc: Somnath Roy; ceph-devel
> > Subject: Re: Bluestore assert
> > 
> > On Thu, 11 Aug 2016, Mark Nelson wrote:
> > > Sorry if I missed this during discussion, but why are these being 
> > > called if the file is deleted?
> > 
> > I'm not sure... rocksdb is the one consuming the interface.  Looking through the code, though, this is the only way I can see that we could log an op_file_update *after* an op_file_remove.
> > 
> > sage
> > 
> > >
> > > Mark
> > >
> > > On 08/11/2016 11:29 AM, Sage Weil wrote:
> > > > On Thu, 11 Aug 2016, Somnath Roy wrote:
> > > > > Sage,
> > > > > Please find the full log for the BlueFS replay bug in the 
> > > > > following location.
> > > > >
> > > > > https://github.com/somnathr/ceph/blob/master/ceph-osd.1.log.zi
> > > > > p
> > > > >
> > > > > For the db transaction one , I have added code to dump the 
> > > > > rocksdb error code before the assert as you suggested and waiting to reproduce.
> > > >
> > > > I'm pretty sure this is the root cause:
> > > >
> > > > https://github.com/ceph/ceph/pull/10686
> > > >
> > > > sage
> > >
> > >
> > 
> > 
> 
> 
Varada Kari Aug. 22, 2016, 8:02 a.m. UTC | #4
Ideally, we should have enough space already allocated for the log, but
here we are trying to grow the log file, hence the assert.
Are there any relevant log messages about allocation for the log file from
bluefs?
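
A toy model of the check being described, with simplified stand-in types rather than the real BlueFS structures (the real check sits at the quoted assert in BlueFS::_flush_range):

#include <cassert>
#include <cstdint>

// The BlueFS log file is ino 1; flushing it must never require allocating
// fresh space, because recording that allocation would itself mean
// appending to the log.
struct Fnode {
  uint64_t ino;
  uint64_t allocated;   // bytes already backed by extents
};

void flush_range(Fnode& f, uint64_t offset, uint64_t length) {
  if (offset + length > f.allocated) {
    assert(f.ino != 1);               // growing the log here is a bug
    f.allocated = offset + length;    // stand-in for a real allocator call
  }
  // ... write out the dirty range ...
}

int main() {
  Fnode ordinary{2, 0};
  flush_range(ordinary, 0, 4096);     // fine: ordinary files may grow

  Fnode log{1, 1 << 20};
  flush_range(log, 0, 4096);          // fine: space was pre-allocated
  // flush_range(log, 0, 2 << 20);    // would trip the assert
  return 0;
}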

Varada

On Monday 22 August 2016 01:04 PM, Somnath Roy wrote:
> Sage,
> Got the following asserts on two different paths with the latest master.
>
> 1.
> os/bluestore/BlueFS.cc: 1377: FAILED assert(h->file->fnode.ino != 1)
>
>  ceph version 11.0.0-1688-g6f48ee6 (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55f0d46f9cd0]
>  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1bed) [0x55f0d43cb34d]
>  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x55f0d43cb467]
>  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x3b2) [0x55f0d43ccf12]
>  5: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x35b) [0x55f0d43ce2fb]
>  6: (BlueRocksWritableFile::Sync()+0x62) [0x55f0d43e5c32]
>  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1) [0x55f0d456f4f1]
>  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0) [0x55f0d45709a0]
>  9: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6) [0x55f0d45b2506]
>  10: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x14ea) [0x55f0d45b4cca]
>  11: (rocksdb::CompactionJob::Run()+0x479) [0x55f0d45b5c49]
>  12: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) [0x55f0d44a4610]
>  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) [0x55f0d44b147f]
>  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) [0x55f0d4568079]
>  15: (()+0x98f113) [0x55f0d4568113]
>  16: (()+0x76fa) [0x7f06101576fa]
>  17: (clone()+0x6d) [0x7f060dfb7b5d]
>
>
> 2.
>
> 5700 time 2016-08-21 23:15:50.962450
> os/bluestore/BlueFS.cc: 1377: FAILED assert(h->file->fnode.ino != 1)
>
>  ceph version 11.0.0-1688-g6f48ee6 (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55d9959bfcd0]
>  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1bed) [0x55d99569134d]
>  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x55d995691467]
>  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x3b2) [0x55d995692f12]
>  5: (BlueFS::sync_metadata()+0x1c3) [0x55d995697c33]
>  6: (BlueRocksDirectory::Fsync()+0xd) [0x55d9956ab98d]
>  7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13fa) [0x55d995778c2a]
>  8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x2a) [0x55d9957797aa]
>  9: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x6b) [0x55d99573cb5b]
>  10: (BlueStore::_kv_sync_thread()+0x1745) [0x55d995589e65]
>  11: (BlueStore::KVSyncThread::entry()+0xd) [0x55d9955b754d]
>  12: (Thread::entry_wrapper()+0x75) [0x55d99599f5e5]
>  13: (()+0x76fa) [0x7f8a0bdde6fa]
>  14: (clone()+0x6d) [0x7f8a09c3eb5d]
>
>
> I saw this assert is newly introduced in the code.
> FYI, I was running rocksdb by enabling universal style compaction during this time.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Tuesday, August 16, 2016 12:45 PM
> To: Sage Weil
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
>
> Sage,
> The replay bug *is fixed* with your patch. I am able to make the OSDs (and cluster) up after hitting the db assertion bug.
> Presently, I am trying to root-cause and debug the db assertion issue.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Monday, August 15, 2016 12:54 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
>
> On Sun, 14 Aug 2016, Somnath Roy wrote:
>> Sage,
>> I did this..
>>
>> root@emsnode5:~/ceph-master/src# git diff diff --git 
>> a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc index
>> 638d231..bcf0935 100644
>> --- a/src/kv/RocksDBStore.cc
>> +++ b/src/kv/RocksDBStore.cc
>> @@ -370,6 +370,10 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>    logger->inc(l_rocksdb_txns);
>>    logger->tinc(l_rocksdb_submit_latency, lat);
>> +  if (!s.ok()) {
>> +    derr << __func__ << " error: " << s.ToString()
>> +        << "code = " << s.code() << dendl;  }
>>    return s.ok() ? 0 : -1;
>>  }
>>
>> @@ -385,6 +389,11 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>    logger->inc(l_rocksdb_txns_sync);
>>    logger->tinc(l_rocksdb_submit_sync_latency, lat);
>> +  if (!s.ok()) {
>> +    derr << __func__ << " error: " << s.ToString()
>> +        << "code = " << s.code() << dendl;  }
>> +
>>    return s.ok() ? 0 : -1;
>>  }
>>  int RocksDBStore::get_info_log_level(string info_log_level) @@ -442,7
>> +451,8 @@ void RocksDBStore::RocksDBTransactionImpl::rmkey(const
>> string &prefix,  void RocksDBStore::RocksDBTransactionImpl::rm_single_key(const string &prefix,
>>                                                          const string
>> &k)  {
>> -  bat->SingleDelete(combine_strings(prefix, k));
>> +  //bat->SingleDelete(combine_strings(prefix, k));
>> + bat->Delete(combine_strings(prefix, k));
>>  }
>>
>> But, the db crash is still happening with the following log message.
>>
>> rocksdb: submit_transaction_sync error: NotFound: code = 1
>>
>> It seems it is not related to rm_single_key as I am hitting this from  https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L5108 as well where rm_single_key is not called.
>> Maybe I should dump the transaction and see what's in there?
> Yeah.  Unfortunately I think it isn't trivial to dump the kv transactions because they're being constructed by rocksdb (WriteBack or something).  
> Not sure if there is a dump for that (I'm guessing not?).  You'd need to write one, or build a kludgey lookaside map that can be dumped.
>  
>> I am hitting the BlueFS replay bug I mentioned earlier and applied your patch (https://github.com/ceph/ceph/pull/10686) but not helping.
>> Is it because I needed to run with this patch from the beginning and not just during replay ?
> Yeah, the bug happens before replay.. we are writing a bad entry into the bluefs log.
>
> sage
>
>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sweil@redhat.com]
>> Sent: Thursday, August 11, 2016 3:32 PM
>> To: Somnath Roy
>> Cc: Mark Nelson; ceph-devel
>> Subject: RE: Bluestore assert
>>
>> On Thu, 11 Aug 2016, Somnath Roy wrote:
>>> Sage,
>>> Regarding the db assert , I hit that again on multiple OSDs while I was populating 40TB rbd images (~35TB written before crash).
>>> I did the following changes in the code..
>>>
>>> @@ -370,7 +370,7 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>>    logger->inc(l_rocksdb_txns);
>>>    logger->tinc(l_rocksdb_submit_latency, lat);
>>> -  return s.ok() ? 0 : -1;
>>> +  return s.ok() ? 0 : -s.code();
>>>  }
>>>
>>>  int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction
>>> t) @@ -385,7 +385,7 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
>>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>>    logger->inc(l_rocksdb_txns_sync);
>>>    logger->tinc(l_rocksdb_submit_sync_latency, lat);
>>> -  return s.ok() ? 0 : -1;
>>> +  return s.ok() ? 0 : -s.code();
>>>  }
>>>  int RocksDBStore::get_info_log_level(string info_log_level)  { diff 
>>> --git a/src/os/bluestore/BlueStore.cc 
>>> b/src/os/bluestore/BlueStore.cc index fe7f743..3f4ecd5 100644
>>> --- a/src/os/bluestore/BlueStore.cc
>>> +++ b/src/os/bluestore/BlueStore.cc
>>> @@ -4989,6 +4989,9 @@ void BlueStore::_kv_sync_thread()
>>>              ++it) {
>>>           _txc_finalize_kv((*it), (*it)->t);
>>>           int r = db->submit_transaction((*it)->t);
>>> +          if (r < 0 ) {
>>> +            dout(0) << "submit_transaction returned = " << r << dendl;
>>> +          }
>>>           assert(r == 0);
>>>         }
>>>        }
>>> @@ -5026,6 +5029,10 @@ void BlueStore::_kv_sync_thread()
>>>         t->rm_single_key(PREFIX_WAL, key);
>>>        }
>>>        int r = db->submit_transaction_sync(t);
>>> +      if (r < 0 ) {
>>> +        dout(0) << "submit_transaction_sync returned = " << r << dendl;
>>> +      }
>>> +
>>>        assert(r == 0);
>>>
>>>
>>> This is printing -1 in the log before the assert. So, the corresponding code from the rocksdb side is "kNotFound".
>>> It is not related to space, as I hit this same issue irrespective of whether the db partition size is 100G or 300G.
>>> It seems like some kind of corruption within Bluestore?
>>> Let me know the next step.
>> Can you add this too?
>>
>> diff --git a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc index
>> 638d231..b5467f7 100644
>> --- a/src/kv/RocksDBStore.cc
>> +++ b/src/kv/RocksDBStore.cc
>> @@ -370,6 +370,9 @@ int
>> RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>    logger->inc(l_rocksdb_txns);
>>    logger->tinc(l_rocksdb_submit_latency, lat);
>> +  if (!s.ok()) {
>> +    derr << __func__ << " error: " << s.ToString() << dendl;  }
>>    return s.ok() ? 0 : -1;
>>  }
>>  
>> It's not obvious to me how we would get NotFound when doing a Write into the kv store.
>>
>> Thanks!
>> sage
>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sweil@redhat.com]
>>> Sent: Thursday, August 11, 2016 9:36 AM
>>> To: Mark Nelson
>>> Cc: Somnath Roy; ceph-devel
>>> Subject: Re: Bluestore assert
>>>
>>> On Thu, 11 Aug 2016, Mark Nelson wrote:
>>>> Sorry if I missed this during discussion, but why are these being 
>>>> called if the file is deleted?
>>> I'm not sure... rocksdb is the one consuming the interface.  Looking through the code, though, this is the only way I can see that we could log an op_file_update *after* an op_file_remove.
>>>
>>> sage
>>>
>>>> Mark
>>>>
>>>> On 08/11/2016 11:29 AM, Sage Weil wrote:
>>>>> On Thu, 11 Aug 2016, Somnath Roy wrote:
>>>>>> Sage,
>>>>>> Please find the full log for the BlueFS replay bug in the 
>>>>>> following location.
>>>>>>
>>>>>> https://github.com/somnathr/ceph/blob/master/ceph-osd.1.log.zi
>>>>>> p
>>>>>>
>>>>>> For the db transaction one , I have added code to dump the 
>>>>>> rocksdb error code before the assert as you suggested and waiting to reproduce.
>>>>> I'm pretty sure this is the root cause:
>>>>>
>>>>> https://github.com/ceph/ceph/pull/10686
>>>>>
>>>>> sage
>>>>
>>>
>>>
>>
>>
>

Somnath Roy Aug. 22, 2016, 8:12 a.m. UTC | #5
The assert is newly introduced and the space check was there earlier as well. Why is it now expecting fnode.ino != 1 here?

-----Original Message-----
From: Varada Kari 
Sent: Monday, August 22, 2016 1:02 AM
To: Somnath Roy; Sage Weil
Cc: Mark Nelson; ceph-devel
Subject: Re: Bluestore assert

Ideally, we should have enough space already allocated for the log, but here we are trying to grow the log file, hence the assert.
Are there any relevant log messages about allocation for the log file from bluefs?

Varada

On Monday 22 August 2016 01:04 PM, Somnath Roy wrote:
> Sage,
> Got the following asserts on two different paths with the latest master.
>
> 1.
> os/bluestore/BlueFS.cc: 1377: FAILED assert(h->file->fnode.ino != 1)
>
>  ceph version 11.0.0-1688-g6f48ee6 
> (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x80) [0x55f0d46f9cd0]
>  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned 
> long)+0x1bed) [0x55f0d43cb34d]
>  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x55f0d43cb467]
>  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, 
> unsigned long, unsigned long)+0x3b2) [0x55f0d43ccf12]
>  5: (BlueFS::_fsync(BlueFS::FileWriter*, 
> std::unique_lock<std::mutex>&)+0x35b) [0x55f0d43ce2fb]
>  6: (BlueRocksWritableFile::Sync()+0x62) [0x55f0d43e5c32]
>  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1) 
> [0x55f0d456f4f1]
>  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0) [0x55f0d45709a0]
>  9: 
> (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status 
> const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6) 
> [0x55f0d45b2506]
>  10: 
> (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::Compaction
> Job::SubcompactionState*)+0x14ea) [0x55f0d45b4cca]
>  11: (rocksdb::CompactionJob::Run()+0x479) [0x55f0d45b5c49]
>  12: (rocksdb::DBImpl::BackgroundCompaction(bool*, 
> rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) 
> [0x55f0d44a4610]
>  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) 
> [0x55f0d44b147f]
>  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) 
> [0x55f0d4568079]
>  15: (()+0x98f113) [0x55f0d4568113]
>  16: (()+0x76fa) [0x7f06101576fa]
>  17: (clone()+0x6d) [0x7f060dfb7b5d]
>
>
> 2.
>
> 5700 time 2016-08-21 23:15:50.962450
> os/bluestore/BlueFS.cc: 1377: FAILED assert(h->file->fnode.ino != 1)
>
>  ceph version 11.0.0-1688-g6f48ee6 
> (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x80) [0x55d9959bfcd0]
>  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned 
> long)+0x1bed) [0x55d99569134d]
>  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x55d995691467]
>  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, 
> unsigned long, unsigned long)+0x3b2) [0x55d995692f12]
>  5: (BlueFS::sync_metadata()+0x1c3) [0x55d995697c33]
>  6: (BlueRocksDirectory::Fsync()+0xd) [0x55d9956ab98d]
>  7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, 
> unsigned long, bool)+0x13fa) [0x55d995778c2a]
>  8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*)+0x2a) [0x55d9957797aa]
>  9: 
> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::Tra
> nsactionImpl>)+0x6b) [0x55d99573cb5b]
>  10: (BlueStore::_kv_sync_thread()+0x1745) [0x55d995589e65]
>  11: (BlueStore::KVSyncThread::entry()+0xd) [0x55d9955b754d]
>  12: (Thread::entry_wrapper()+0x75) [0x55d99599f5e5]
>  13: (()+0x76fa) [0x7f8a0bdde6fa]
>  14: (clone()+0x6d) [0x7f8a09c3eb5d]
>
>
> I saw this assert is newly introduced in the code.
> FYI, I was running rocksdb by enabling universal style compaction during this time.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Tuesday, August 16, 2016 12:45 PM
> To: Sage Weil
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
>
> Sage,
> The replay bug *is fixed* with your patch. I am able to make the OSDs (and cluster) up after hitting the db assertion bug.
> Presently, I am trying to root-cause and debug the db assertion issue.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Monday, August 15, 2016 12:54 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
>
> On Sun, 14 Aug 2016, Somnath Roy wrote:
>> Sage,
>> I did this..
>>
>> root@emsnode5:~/ceph-master/src# git diff diff --git 
>> a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc index
>> 638d231..bcf0935 100644
>> --- a/src/kv/RocksDBStore.cc
>> +++ b/src/kv/RocksDBStore.cc
>> @@ -370,6 +370,10 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>    logger->inc(l_rocksdb_txns);
>>    logger->tinc(l_rocksdb_submit_latency, lat);
>> +  if (!s.ok()) {
>> +    derr << __func__ << " error: " << s.ToString()
>> +        << "code = " << s.code() << dendl;  }
>>    return s.ok() ? 0 : -1;
>>  }
>>
>> @@ -385,6 +389,11 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>    logger->inc(l_rocksdb_txns_sync);
>>    logger->tinc(l_rocksdb_submit_sync_latency, lat);
>> +  if (!s.ok()) {
>> +    derr << __func__ << " error: " << s.ToString()
>> +        << "code = " << s.code() << dendl;  }
>> +
>>    return s.ok() ? 0 : -1;
>>  }
>>  int RocksDBStore::get_info_log_level(string info_log_level) @@ 
>> -442,7
>> +451,8 @@ void RocksDBStore::RocksDBTransactionImpl::rmkey(const
>> string &prefix,  void RocksDBStore::RocksDBTransactionImpl::rm_single_key(const string &prefix,
>>                                                          const string
>> &k)  {
>> -  bat->SingleDelete(combine_strings(prefix, k));
>> +  //bat->SingleDelete(combine_strings(prefix, k));
>> + bat->Delete(combine_strings(prefix, k));
>>  }
>>
>> But, the db crash is still happening with the following log message.
>>
>> rocksdb: submit_transaction_sync error: NotFound: code = 1
>>
>> It seems it is not related to rm_single_key as I am hitting this from  https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L5108 as well where rm_single_key is not called.
>> Maybe I should dump the transaction and see what's in there?
> Yeah.  Unfortunately I think it isn't trivial to dump the kv transactions because they're being constructed by rocksdb (WriteBack or something).  
> Not sure if there is a dump for that (I'm guessing not?).  You'd need to write one, or build a kludgey lookaside map that can be dumped.
>  
>> I am hitting the BlueFS replay bug I mentioned earlier and applied your patch (https://github.com/ceph/ceph/pull/10686) but not helping.
>> Is it because I needed to run with this patch from the beginning and not just during replay ?
> Yeah, the bug happens before replay.. we are writing a bad entry into the bluefs log.
>
> sage
>
>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sweil@redhat.com]
>> Sent: Thursday, August 11, 2016 3:32 PM
>> To: Somnath Roy
>> Cc: Mark Nelson; ceph-devel
>> Subject: RE: Bluestore assert
>>
>> On Thu, 11 Aug 2016, Somnath Roy wrote:
>>> Sage,
>>> Regarding the db assert , I hit that again on multiple OSDs while I was populating 40TB rbd images (~35TB written before crash).
>>> I did the following changes in the code..
>>>
>>> @@ -370,7 +370,7 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>>    logger->inc(l_rocksdb_txns);
>>>    logger->tinc(l_rocksdb_submit_latency, lat);
>>> -  return s.ok() ? 0 : -1;
>>> +  return s.ok() ? 0 : -s.code();
>>>  }
>>>
>>>  int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction
>>> t) @@ -385,7 +385,7 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
>>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>>    logger->inc(l_rocksdb_txns_sync);
>>>    logger->tinc(l_rocksdb_submit_sync_latency, lat);
>>> -  return s.ok() ? 0 : -1;
>>> +  return s.ok() ? 0 : -s.code();
>>>  }
>>>  int RocksDBStore::get_info_log_level(string info_log_level)  { diff 
>>> --git a/src/os/bluestore/BlueStore.cc 
>>> b/src/os/bluestore/BlueStore.cc index fe7f743..3f4ecd5 100644
>>> --- a/src/os/bluestore/BlueStore.cc
>>> +++ b/src/os/bluestore/BlueStore.cc
>>> @@ -4989,6 +4989,9 @@ void BlueStore::_kv_sync_thread()
>>>              ++it) {
>>>           _txc_finalize_kv((*it), (*it)->t);
>>>           int r = db->submit_transaction((*it)->t);
>>> +          if (r < 0 ) {
>>> +            dout(0) << "submit_transaction returned = " << r << dendl;
>>> +          }
>>>           assert(r == 0);
>>>         }
>>>        }
>>> @@ -5026,6 +5029,10 @@ void BlueStore::_kv_sync_thread()
>>>         t->rm_single_key(PREFIX_WAL, key);
>>>        }
>>>        int r = db->submit_transaction_sync(t);
>>> +      if (r < 0 ) {
>>> +        dout(0) << "submit_transaction_sync returned = " << r << dendl;
>>> +      }
>>> +
>>>        assert(r == 0);
>>>
>>>
>>> This is printing -1 in the log before the assert. So, the corresponding code from the rocksdb side is "kNotFound".
>>> It is not related to space, as I hit this same issue irrespective of whether the db partition size is 100G or 300G.
>>> It seems like some kind of corruption within Bluestore?
>>> Let me know the next step.
>> Can you add this too?
>>
>> diff --git a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc index
>> 638d231..b5467f7 100644
>> --- a/src/kv/RocksDBStore.cc
>> +++ b/src/kv/RocksDBStore.cc
>> @@ -370,6 +370,9 @@ int
>> RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>    logger->inc(l_rocksdb_txns);
>>    logger->tinc(l_rocksdb_submit_latency, lat);
>> +  if (!s.ok()) {
>> +    derr << __func__ << " error: " << s.ToString() << dendl;  }
>>    return s.ok() ? 0 : -1;
>>  }
>>  
>> It's not obvious to me how we would get NotFound when doing a Write into the kv store.
>>
>> Thanks!
>> sage
>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sweil@redhat.com]
>>> Sent: Thursday, August 11, 2016 9:36 AM
>>> To: Mark Nelson
>>> Cc: Somnath Roy; ceph-devel
>>> Subject: Re: Bluestore assert
>>>
>>> On Thu, 11 Aug 2016, Mark Nelson wrote:
>>>> Sorry if I missed this during discussion, but why are these being 
>>>> called if the file is deleted?
>>> I'm not sure... rocksdb is the one consuming the interface.  Looking through the code, though, this is the only way I can see that we could log an op_file_update *after* an op_file_remove.
>>>
>>> sage
>>>
>>>> Mark
>>>>
>>>> On 08/11/2016 11:29 AM, Sage Weil wrote:
>>>>> On Thu, 11 Aug 2016, Somnath Roy wrote:
>>>>>> Sage,
>>>>>> Please find the full log for the BlueFS replay bug in the 
>>>>>> following location.
>>>>>>
>>>>>> https://github.com/somnathr/ceph/blob/master/ceph-osd.1.log.zi
>>>>>> p
>>>>>>
>>>>>> For the db transaction one , I have added code to dump the 
>>>>>> rocksdb error code before the assert as you suggested and waiting to reproduce.
>>>>> I'm pretty sure this is the root cause:
>>>>>
>>>>> https://github.com/ceph/ceph/pull/10686
>>>>>
>>>>> sage
>>>>
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Varada Kari Aug. 22, 2016, 8:16 a.m. UTC | #6
We don't want to grow the log inode (inode 1) during sync and flush,
because recording that allocation would create an allocation loop. We
should already have made enough allocations for this.
It seems we have not here. Do we have any messages in the logs about
the allocations made to the log file in those events? That would help
to identify the allocation failure here.
Otherwise we might have to add some more debug output here.
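
For context, here is a rough, self-contained model of the invariant in
question. This is illustrative only: FileNode and flush_range below are
stand-ins, not the real BlueFS types, and the real check lives in
BlueFS::_flush_range.

#include <cassert>
#include <cstdint>

struct FileNode {
  uint64_t ino;        // ino 1 is reserved for the BlueFS log itself
  uint64_t allocated;  // bytes already reserved on disk for this file
};

void flush_range(FileNode& f, uint64_t offset, uint64_t length)
{
  if (f.allocated < offset + length) {
    // Growing the file means recording a new allocation in the BlueFS
    // log. For the log file itself (ino 1) that would mean appending
    // to the very file we are flushing, hence the hard stop here.
    assert(f.ino != 1);              // the assert that fired
    f.allocated = offset + length;   // stand-in for _allocate()
  }
  // ... write out [offset, offset + length) to the device ...
}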

Varada

On Monday 22 August 2016 01:42 PM, Somnath Roy wrote:
> The assert is newly introduced, and the space check was there earlier as well. Why is it now expecting fnode.ino != 1 here?
>
> -----Original Message-----
> From: Varada Kari 
> Sent: Monday, August 22, 2016 1:02 AM
> To: Somnath Roy; Sage Weil
> Cc: Mark Nelson; ceph-devel
> Subject: Re: Bluestore assert
>
> Ideally, we should have enough space already allocated for the log, but here we are trying to grow the log file, hence the assert.
> Are there any relevant log messages about allocation in the log from bluefs?
>
> Varada
>
> On Monday 22 August 2016 01:04 PM, Somnath Roy wrote:
>> Sage,
>> Got the following asserts on two different path with the latest master.
>>
>> 1.
>> os/bluestore/BlueFS.cc: 1377: FAILED assert(h->file->fnode.ino != 1)
>>
>>  ceph version 11.0.0-1688-g6f48ee6 
>> (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x80) [0x55f0d46f9cd0]
>>  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned 
>> long)+0x1bed) [0x55f0d43cb34d]
>>  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x55f0d43cb467]
>>  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, 
>> unsigned long, unsigned long)+0x3b2) [0x55f0d43ccf12]
>>  5: (BlueFS::_fsync(BlueFS::FileWriter*, 
>> std::unique_lock<std::mutex>&)+0x35b) [0x55f0d43ce2fb]
>>  6: (BlueRocksWritableFile::Sync()+0x62) [0x55f0d43e5c32]
>>  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1) 
>> [0x55f0d456f4f1]
>>  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0) [0x55f0d45709a0]
>>  9: 
>> (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status 
>> const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6) 
>> [0x55f0d45b2506]
>>  10: 
>> (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::Compaction
>> Job::SubcompactionState*)+0x14ea) [0x55f0d45b4cca]
>>  11: (rocksdb::CompactionJob::Run()+0x479) [0x55f0d45b5c49]
>>  12: (rocksdb::DBImpl::BackgroundCompaction(bool*, 
>> rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) 
>> [0x55f0d44a4610]
>>  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) 
>> [0x55f0d44b147f]
>>  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) 
>> [0x55f0d4568079]
>>  15: (()+0x98f113) [0x55f0d4568113]
>>  16: (()+0x76fa) [0x7f06101576fa]
>>  17: (clone()+0x6d) [0x7f060dfb7b5d]
>>
>>
>> 2.
>>
>> 5700 time 2016-08-21 23:15:50.962450
>> os/bluestore/BlueFS.cc: 1377: FAILED assert(h->file->fnode.ino != 1)
>>
>>  ceph version 11.0.0-1688-g6f48ee6 
>> (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x80) [0x55d9959bfcd0]
>>  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned 
>> long)+0x1bed) [0x55d99569134d]
>>  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x55d995691467]
>>  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, 
>> unsigned long, unsigned long)+0x3b2) [0x55d995692f12]
>>  5: (BlueFS::sync_metadata()+0x1c3) [0x55d995697c33]
>>  6: (BlueRocksDirectory::Fsync()+0xd) [0x55d9956ab98d]
>>  7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
>> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, 
>> unsigned long, bool)+0x13fa) [0x55d995778c2a]
>>  8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
>> rocksdb::WriteBatch*)+0x2a) [0x55d9957797aa]
>>  9: 
>> (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::Tra
>> nsactionImpl>)+0x6b) [0x55d99573cb5b]
>>  10: (BlueStore::_kv_sync_thread()+0x1745) [0x55d995589e65]
>>  11: (BlueStore::KVSyncThread::entry()+0xd) [0x55d9955b754d]
>>  12: (Thread::entry_wrapper()+0x75) [0x55d99599f5e5]
>>  13: (()+0x76fa) [0x7f8a0bdde6fa]
>>  14: (clone()+0x6d) [0x7f8a09c3eb5d]
>>
>>
>> I saw this assert is newly introduced in the code.
>> FYI, I was running rocksdb by enabling universal style compaction during this time.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
>> Sent: Tuesday, August 16, 2016 12:45 PM
>> To: Sage Weil
>> Cc: Mark Nelson; ceph-devel
>> Subject: RE: Bluestore assert
>>
>> Sage,
>> The replay bug *is fixed* with your patch. I am able to bring the OSDs (and cluster) up after hitting the db assertion bug.
>> Presently, I am trying to root cause and debug the db assertion issue.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sweil@redhat.com]
>> Sent: Monday, August 15, 2016 12:54 PM
>> To: Somnath Roy
>> Cc: Mark Nelson; ceph-devel
>> Subject: RE: Bluestore assert
>>
>> On Sun, 14 Aug 2016, Somnath Roy wrote:
>>> Sage,
>>> I did this..
>>>
>>> root@emsnode5:~/ceph-master/src# git diff diff --git 
>>> a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc index
>>> 638d231..bcf0935 100644
>>> --- a/src/kv/RocksDBStore.cc
>>> +++ b/src/kv/RocksDBStore.cc
>>> @@ -370,6 +370,10 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>>    logger->inc(l_rocksdb_txns);
>>>    logger->tinc(l_rocksdb_submit_latency, lat);
>>> +  if (!s.ok()) {
>>> +    derr << __func__ << " error: " << s.ToString()
>>> +        << "code = " << s.code() << dendl;  }
>>>    return s.ok() ? 0 : -1;
>>>  }
>>>
>>> @@ -385,6 +389,11 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
>>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>>    logger->inc(l_rocksdb_txns_sync);
>>>    logger->tinc(l_rocksdb_submit_sync_latency, lat);
>>> +  if (!s.ok()) {
>>> +    derr << __func__ << " error: " << s.ToString()
>>> +        << "code = " << s.code() << dendl;  }
>>> +
>>>    return s.ok() ? 0 : -1;
>>>  }
>>>  int RocksDBStore::get_info_log_level(string info_log_level) @@ 
>>> -442,7
>>> +451,8 @@ void RocksDBStore::RocksDBTransactionImpl::rmkey(const
>>> string &prefix,  void RocksDBStore::RocksDBTransactionImpl::rm_single_key(const string &prefix,
>>>                                                          const string
>>> &k)  {
>>> -  bat->SingleDelete(combine_strings(prefix, k));
>>> +  //bat->SingleDelete(combine_strings(prefix, k));
>>> + bat->Delete(combine_strings(prefix, k));
>>>  }
>>>
>>> But, the db crash is still happening with the following log message.
>>>
>>> rocksdb: submit_transaction_sync error: NotFound: code = 1
>>>
>>> It seems it is not related to rm_single_key, as I am hitting this from https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L5108 as well, where rm_single_key is not called.
>>> Maybe I should dump the transaction and see what's in there?
>> Yeah.  Unfortunately I think it isn't trivial to dump the kv transactions because they're being constructed by rocksdb (WriteBack or something).  
>> Not sure if there is a dump for that (I'm guessing not?).  You'd need to write one, or build a kludgey lookaside map that can be dumped.
>>  
>>> I am hitting the BlueFS replay bug I mentioned earlier and applied your patch (https://github.com/ceph/ceph/pull/10686) but not helping.
>>> Is it because I needed to run with this patch from the beginning and not just during replay ?
>> Yeah, the bug happens before replay.. we are writing a bad entry into the bluefs log.
>>
>> sage
>>
>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sweil@redhat.com]
>>> Sent: Thursday, August 11, 2016 3:32 PM
>>> To: Somnath Roy
>>> Cc: Mark Nelson; ceph-devel
>>> Subject: RE: Bluestore assert
>>>
>>> On Thu, 11 Aug 2016, Somnath Roy wrote:
>>>> Sage,
>>>> Regarding the db assert , I hit that again on multiple OSDs while I was populating 40TB rbd images (~35TB written before crash).
>>>> I did the following changes in the code..
>>>>
>>>> @@ -370,7 +370,7 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>>>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>>>    logger->inc(l_rocksdb_txns);
>>>>    logger->tinc(l_rocksdb_submit_latency, lat);
>>>> -  return s.ok() ? 0 : -1;
>>>> +  return s.ok() ? 0 : -s.code();
>>>>  }
>>>>
>>>>  int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction
>>>> t) @@ -385,7 +385,7 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
>>>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>>>    logger->inc(l_rocksdb_txns_sync);
>>>>    logger->tinc(l_rocksdb_submit_sync_latency, lat);
>>>> -  return s.ok() ? 0 : -1;
>>>> +  return s.ok() ? 0 : -s.code();
>>>>  }
>>>>  int RocksDBStore::get_info_log_level(string info_log_level)  { diff 
>>>> --git a/src/os/bluestore/BlueStore.cc 
>>>> b/src/os/bluestore/BlueStore.cc index fe7f743..3f4ecd5 100644
>>>> --- a/src/os/bluestore/BlueStore.cc
>>>> +++ b/src/os/bluestore/BlueStore.cc
>>>> @@ -4989,6 +4989,9 @@ void BlueStore::_kv_sync_thread()
>>>>              ++it) {
>>>>           _txc_finalize_kv((*it), (*it)->t);
>>>>           int r = db->submit_transaction((*it)->t);
>>>> +          if (r < 0 ) {
>>>> +            dout(0) << "submit_transaction returned = " << r << dendl;
>>>> +          }
>>>>           assert(r == 0);
>>>>         }
>>>>        }
>>>> @@ -5026,6 +5029,10 @@ void BlueStore::_kv_sync_thread()
>>>>         t->rm_single_key(PREFIX_WAL, key);
>>>>        }
>>>>        int r = db->submit_transaction_sync(t);
>>>> +      if (r < 0 ) {
>>>> +        dout(0) << "submit_transaction_sync returned = " << r << dendl;
>>>> +      }
>>>> +
>>>>        assert(r == 0);
>>>>
>>>>
>>>> This is printing -1 in the log before the assert. So, the corresponding code from the rocksdb side is "kNotFound".
>>>> It is not related to space, as I hit this same issue irrespective of whether the db partition size is 100G or 300G.
>>>> It seems like some kind of corruption within Bluestore?
>>>> Let me know the next step.
>>> Can you add this too?
>>>
>>> diff --git a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc index
>>> 638d231..b5467f7 100644
>>> --- a/src/kv/RocksDBStore.cc
>>> +++ b/src/kv/RocksDBStore.cc
>>> @@ -370,6 +370,9 @@ int
>>> RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>>>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>>>    logger->inc(l_rocksdb_txns);
>>>    logger->tinc(l_rocksdb_submit_latency, lat);
>>> +  if (!s.ok()) {
>>> +    derr << __func__ << " error: " << s.ToString() << dendl;  }
>>>    return s.ok() ? 0 : -1;
>>>  }
>>>  
>>> It's not obvious to me how we would get NotFound when doing a Write into the kv store.
>>>
>>> Thanks!
>>> sage
>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>> -----Original Message-----
>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>> Sent: Thursday, August 11, 2016 9:36 AM
>>>> To: Mark Nelson
>>>> Cc: Somnath Roy; ceph-devel
>>>> Subject: Re: Bluestore assert
>>>>
>>>> On Thu, 11 Aug 2016, Mark Nelson wrote:
>>>>> Sorry if I missed this during discussion, but why are these being 
>>>>> called if the file is deleted?
>>>> I'm not sure... rocksdb is the one consuming the interface.  Looking through the code, though, this is the only way I can see that we could log an op_file_update *after* an op_file_remove.
>>>>
>>>> sage
>>>>
>>>>> Mark
>>>>>
>>>>> On 08/11/2016 11:29 AM, Sage Weil wrote:
>>>>>> On Thu, 11 Aug 2016, Somnath Roy wrote:
>>>>>>> Sage,
>>>>>>> Please find the full log for the BlueFS replay bug in the 
>>>>>>> following location.
>>>>>>>
>>>>>>> https://github.com/somnathr/ceph/blob/master/ceph-osd.1.log.zip
>>>>>>>
>>>>>>> For the db transaction one , I have added code to dump the 
>>>>>>> rocksdb error code before the assert as you suggested and waiting to reproduce.
>>>>>> I'm pretty sure this is the root cause:
>>>>>>
>>>>>> https://github.com/ceph/ceph/pull/10686
>>>>>>
>>>>>> sage
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe 
>>>>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org 
>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>>
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil Aug. 22, 2016, 2:57 p.m. UTC | #7
On Mon, 22 Aug 2016, Somnath Roy wrote:
> Sage,
> Got the following asserts on two different path with the latest master.
> 
> 1.
> os/bluestore/BlueFS.cc: 1377: FAILED assert(h->file->fnode.ino != 1)
> 
>  ceph version 11.0.0-1688-g6f48ee6 (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55f0d46f9cd0]
>  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1bed) [0x55f0d43cb34d]
>  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x55f0d43cb467]
>  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x3b2) [0x55f0d43ccf12]
>  5: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x35b) [0x55f0d43ce2fb]
>  6: (BlueRocksWritableFile::Sync()+0x62) [0x55f0d43e5c32]
>  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1) [0x55f0d456f4f1]
>  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0) [0x55f0d45709a0]
>  9: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6) [0x55f0d45b2506]
>  10: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x14ea) [0x55f0d45b4cca]
>  11: (rocksdb::CompactionJob::Run()+0x479) [0x55f0d45b5c49]
>  12: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) [0x55f0d44a4610]
>  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) [0x55f0d44b147f]
>  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) [0x55f0d4568079]
>  15: (()+0x98f113) [0x55f0d4568113]
>  16: (()+0x76fa) [0x7f06101576fa]
>  17: (clone()+0x6d) [0x7f060dfb7b5d]
> 
> 
> 2.
> 
> 5700 time 2016-08-21 23:15:50.962450
> os/bluestore/BlueFS.cc: 1377: FAILED assert(h->file->fnode.ino != 1)
> 
>  ceph version 11.0.0-1688-g6f48ee6 (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55d9959bfcd0]
>  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1bed) [0x55d99569134d]
>  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x55d995691467]
>  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x3b2) [0x55d995692f12]
>  5: (BlueFS::sync_metadata()+0x1c3) [0x55d995697c33]
>  6: (BlueRocksDirectory::Fsync()+0xd) [0x55d9956ab98d]
>  7: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13fa) [0x55d995778c2a]
>  8: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x2a) [0x55d9957797aa]
>  9: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x6b) [0x55d99573cb5b]
>  10: (BlueStore::_kv_sync_thread()+0x1745) [0x55d995589e65]
>  11: (BlueStore::KVSyncThread::entry()+0xd) [0x55d9955b754d]
>  12: (Thread::entry_wrapper()+0x75) [0x55d99599f5e5]
>  13: (()+0x76fa) [0x7f8a0bdde6fa]
>  14: (clone()+0x6d) [0x7f8a09c3eb5d]
> 
> 
> I saw this assert is newly introduced in the code.
> FYI, I was running rocksdb by enabling universal style compaction during this time.

This is a new assert in the async compaction code.  I'll see if I can 
reproduce it with the bluefs tests with universal compaction... that 
should make it easy to track down.

sage


> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Tuesday, August 16, 2016 12:45 PM
> To: Sage Weil
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> Sage,
> The replay bug *is fixed* with your patch. I am able to bring the OSDs (and cluster) up after hitting the db assertion bug.
> Presently, I am trying to root cause and debug the db assertion issue.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Monday, August 15, 2016 12:54 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Sun, 14 Aug 2016, Somnath Roy wrote:
> > Sage,
> > I did this..
> > 
> > root@emsnode5:~/ceph-master/src# git diff diff --git 
> > a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc index
> > 638d231..bcf0935 100644
> > --- a/src/kv/RocksDBStore.cc
> > +++ b/src/kv/RocksDBStore.cc
> > @@ -370,6 +370,10 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
> >    utime_t lat = ceph_clock_now(g_ceph_context) - start;
> >    logger->inc(l_rocksdb_txns);
> >    logger->tinc(l_rocksdb_submit_latency, lat);
> > +  if (!s.ok()) {
> > +    derr << __func__ << " error: " << s.ToString()
> > +        << "code = " << s.code() << dendl;  }
> >    return s.ok() ? 0 : -1;
> >  }
> > 
> > @@ -385,6 +389,11 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
> >    utime_t lat = ceph_clock_now(g_ceph_context) - start;
> >    logger->inc(l_rocksdb_txns_sync);
> >    logger->tinc(l_rocksdb_submit_sync_latency, lat);
> > +  if (!s.ok()) {
> > +    derr << __func__ << " error: " << s.ToString()
> > +        << "code = " << s.code() << dendl;  }
> > +
> >    return s.ok() ? 0 : -1;
> >  }
> >  int RocksDBStore::get_info_log_level(string info_log_level) @@ -442,7
> > +451,8 @@ void RocksDBStore::RocksDBTransactionImpl::rmkey(const
> > string &prefix,  void RocksDBStore::RocksDBTransactionImpl::rm_single_key(const string &prefix,
> >                                                          const string
> > &k)  {
> > -  bat->SingleDelete(combine_strings(prefix, k));
> > +  //bat->SingleDelete(combine_strings(prefix, k));
> > + bat->Delete(combine_strings(prefix, k));
> >  }
> > 
> > But, the db crash is still happening with the following log message.
> > 
> > rocksdb: submit_transaction_sync error: NotFound: code = 1
> > 
> > It seems it is not related to rm_single_key as I am hitting this from  https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L5108 as well where rm_single_key is not called.
> > May be I should dump the transaction and see what's in there ?
> 
> Yeah.  Unfortunately I think it isn't trivial to dump the kv transactions because they're being constructed by rocksdb (WriteBack or something).  
> Not sure if there is a dump for that (I'm guessing not?).  You'd need to write one, or build a kludgey lookaside map that can be dumped.
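
(For illustration, a minimal sketch of what such a lookaside recorder
could look like. TxnShadow and its methods are hypothetical, not an
existing Ceph or rocksdb API; the idea is just to mirror each op that
goes into the real transaction into a side structure that can be
printed when the submit fails.)

#include <iostream>
#include <string>
#include <vector>

// Hypothetical lookaside recorder: mirror every op applied to the real
// KeyValueDB transaction so the whole batch can be dumped if
// submit_transaction()/submit_transaction_sync() returns an error.
struct TxnShadow {
  struct Op { std::string what, prefix, key; size_t value_len; };
  std::vector<Op> ops;

  void note_set(const std::string& prefix, const std::string& key,
                size_t value_len) {
    ops.push_back({"set", prefix, key, value_len});
  }
  void note_rmkey(const std::string& prefix, const std::string& key) {
    ops.push_back({"rmkey", prefix, key, 0});
  }
  void dump(std::ostream& out) const {
    for (const auto& o : ops)
      out << "  " << o.what << ' ' << o.prefix << '/' << o.key
          << " (" << o.value_len << " value bytes)" << std::endl;
  }
};

// Usage sketch: call note_set()/note_rmkey() next to the real
// transaction's set()/rmkey()/rm_single_key() calls, and on a failed
// submit print the whole batch, e.g. shadow.dump(std::cerr).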
>  
> > I am hitting the BlueFS replay bug I mentioned earlier and applied your patch (https://github.com/ceph/ceph/pull/10686) but not helping.
> > Is it because I needed to run with this patch from the beginning and not just during replay ?
> 
> Yeah, the bug happens before replay.. we are writing a bad entry into the bluefs log.
> 
> sage
> 
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Thursday, August 11, 2016 3:32 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Thu, 11 Aug 2016, Somnath Roy wrote:
> > > Sage,
> > > Regarding the db assert , I hit that again on multiple OSDs while I was populating 40TB rbd images (~35TB written before crash).
> > > I did the following changes in the code..
> > > 
> > > @@ -370,7 +370,7 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
> > >    utime_t lat = ceph_clock_now(g_ceph_context) - start;
> > >    logger->inc(l_rocksdb_txns);
> > >    logger->tinc(l_rocksdb_submit_latency, lat);
> > > -  return s.ok() ? 0 : -1;
> > > +  return s.ok() ? 0 : -s.code();
> > >  }
> > > 
> > >  int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction
> > > t) @@ -385,7 +385,7 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
> > >    utime_t lat = ceph_clock_now(g_ceph_context) - start;
> > >    logger->inc(l_rocksdb_txns_sync);
> > >    logger->tinc(l_rocksdb_submit_sync_latency, lat);
> > > -  return s.ok() ? 0 : -1;
> > > +  return s.ok() ? 0 : -s.code();
> > >  }
> > >  int RocksDBStore::get_info_log_level(string info_log_level)  { diff 
> > > --git a/src/os/bluestore/BlueStore.cc 
> > > b/src/os/bluestore/BlueStore.cc index fe7f743..3f4ecd5 100644
> > > --- a/src/os/bluestore/BlueStore.cc
> > > +++ b/src/os/bluestore/BlueStore.cc
> > > @@ -4989,6 +4989,9 @@ void BlueStore::_kv_sync_thread()
> > >              ++it) {
> > >           _txc_finalize_kv((*it), (*it)->t);
> > >           int r = db->submit_transaction((*it)->t);
> > > +          if (r < 0 ) {
> > > +            dout(0) << "submit_transaction returned = " << r << dendl;
> > > +          }
> > >           assert(r == 0);
> > >         }
> > >        }
> > > @@ -5026,6 +5029,10 @@ void BlueStore::_kv_sync_thread()
> > >         t->rm_single_key(PREFIX_WAL, key);
> > >        }
> > >        int r = db->submit_transaction_sync(t);
> > > +      if (r < 0 ) {
> > > +        dout(0) << "submit_transaction_sync returned = " << r << dendl;
> > > +      }
> > > +
> > >        assert(r == 0);
> > > 
> > > 
> > > This is printing -1 in the log before the assert. So, the corresponding code from the rocksdb side is "kNotFound".
> > > It is not related to space, as I hit this same issue irrespective of whether the db partition size is 100G or 300G.
> > > It seems like some kind of corruption within Bluestore?
> > > Let me know the next step.
> > 
> > Can you add this too?
> > 
> > diff --git a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc index
> > 638d231..b5467f7 100644
> > --- a/src/kv/RocksDBStore.cc
> > +++ b/src/kv/RocksDBStore.cc
> > @@ -370,6 +370,9 @@ int
> > RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
> >    utime_t lat = ceph_clock_now(g_ceph_context) - start;
> >    logger->inc(l_rocksdb_txns);
> >    logger->tinc(l_rocksdb_submit_latency, lat);
> > +  if (!s.ok()) {
> > +    derr << __func__ << " error: " << s.ToString() << dendl;  }
> >    return s.ok() ? 0 : -1;
> >  }
> >  
> > It's not obvious to me how we would get NotFound when doing a Write into the kv store.
> > 
> > Thanks!
> > sage
> > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Thursday, August 11, 2016 9:36 AM
> > > To: Mark Nelson
> > > Cc: Somnath Roy; ceph-devel
> > > Subject: Re: Bluestore assert
> > > 
> > > On Thu, 11 Aug 2016, Mark Nelson wrote:
> > > > Sorry if I missed this during discussion, but why are these being 
> > > > called if the file is deleted?
> > > 
> > > I'm not sure... rocksdb is the one consuming the interface.  Looking through the code, though, this is the only way I can see that we could log an op_file_update *after* an op_file_remove.
> > > 
> > > sage
> > > 
> > > >
> > > > Mark
> > > >
> > > > On 08/11/2016 11:29 AM, Sage Weil wrote:
> > > > > On Thu, 11 Aug 2016, Somnath Roy wrote:
> > > > > > Sage,
> > > > > > Please find the full log for the BlueFS replay bug in the 
> > > > > > following location.
> > > > > >
> > > > > > https://github.com/somnathr/ceph/blob/master/ceph-osd.1.log.zip
> > > > > >
> > > > > > For the db transaction one , I have added code to dump the 
> > > > > > rocksdb error code before the assert as you suggested and waiting to reproduce.
> > > > >
> > > > > I'm pretty sure this is the root cause:
> > > > >
> > > > > https://github.com/ceph/ceph/pull/10686
> > > > >
> > > > > sage
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe 
> > > > > ceph-devel" in the body of a message to 
> > > > > majordomo@vger.kernel.org More majordomo info at 
> > > > > http://vger.kernel.org/majordomo-info.html
> > > > >
> > > >
> > > >
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil Aug. 22, 2016, 9:57 p.m. UTC | #8
On Mon, 22 Aug 2016, Somnath Roy wrote:
> FYI, I was running rocksdb by enabling universal style compaction during 
> this time.

How are you selecting universal compaction?

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Aug. 22, 2016, 10:01 p.m. UTC | #9
"compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
Here is the option I am using..

        bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"

Here is another one after 4 hours of 4K RW :-)... Sorry to bombard you with all these; I am adding more logging for the next time I hit it.


     0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1 os/bluestore/BlueFS.cc: In function 'int BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)' thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)

 ceph version 11.0.0-1688-g3fcc89c (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5581ed453cb0]
 2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0x836) [0x5581ed11c1b6]
 3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
 4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
 5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&, rocksdb::Logger*)+0x358) [0x5581ed291c18]
 6: (()+0x94fd54) [0x5581ed282d54]
 7: (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
 8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) [0x5581ed28ba68]
 9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
 10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) [0x5581ed25c458]
 11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
 12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x22) [0x5581ed1f4182]
 13: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x157) [0x5581ed1d21d7]
 14: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x55b) [0x5581ed02802b]
 15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
 16: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) [0x5581ed032bc2]
 17: (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
 18: (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39) [0x5581ecef89e9]
 19: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb) [0x5581ecefeb4b]
 20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
 21: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) [0x5581eccdd2e9]
 22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x52) [0x5581eccdd542]
 23: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
 24: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) [0x5581ed440b2f]
 25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5581ed4441f0]
 26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
 27: (()+0x76fa) [0x7f8ed9e4e6fa]
 28: (clone()+0x6d) [0x7f8ed7caeb5d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Monday, August 22, 2016 2:57 PM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Mon, 22 Aug 2016, Somnath Roy wrote:
> FYI, I was running rocksdb by enabling universal style compaction
> during this time.

How are you selecting universal compaction?

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Aug. 22, 2016, 11:10 p.m. UTC | #10
Sage,
I think there is some bug introduced recently in BlueFS, and I am getting corruption like the following, which I was not facing earlier.

   -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
    -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
    -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
    -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
    -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
     0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
osd/OSD.h: 999: FAILED assert(ret)

 ceph version 11.0.0-1688-g6f48ee6 (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5617f2a99e80]
 2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
 3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
 4: (main()+0x2fe0) [0x5617f229d1f0]
 5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
 6: (_start()+0x29) [0x5617f22eb909]

OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Monday, August 22, 2016 3:01 PM
To: 'Sage Weil'
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

"compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
Here is the option I am using..

        bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"

Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..


     0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1 os/bluestore/BlueFS.cc: In function 'int BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)' thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)

 ceph version 11.0.0-1688-g3fcc89c (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5581ed453cb0]
 2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0x836) [0x5581ed11c1b6]
 3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
 4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
 5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&, rocksdb::Logger*)+0x358) [0x5581ed291c18]
 6: (()+0x94fd54) [0x5581ed282d54]
 7: (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
 8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) [0x5581ed28ba68]
 9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
 10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) [0x5581ed25c458]
 11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
 12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x22) [0x5581ed1f4182]
 13: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x157) [0x5581ed1d21d7]
 14: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x55b) [0x5581ed02802b]
 15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
 16: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) [0x5581ed032bc2]
 17: (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
 18: (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39) [0x5581ecef89e9]
 19: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb) [0x5581ecefeb4b]
 20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
 21: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) [0x5581eccdd2e9]
 22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x52) [0x5581eccdd542]
 23: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
 24: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) [0x5581ed440b2f]
 25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5581ed4441f0]
 26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
 27: (()+0x76fa) [0x7f8ed9e4e6fa]
 28: (clone()+0x6d) [0x7f8ed7caeb5d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Monday, August 22, 2016 2:57 PM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Mon, 22 Aug 2016, Somnath Roy wrote:
> FYI, I was running rocksdb by enabling universal style compaction
> during this time.

How are you selecting universal compaction?

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Varada Kari Aug. 23, 2016, 5:09 a.m. UTC | #11
We added a couple of new asserts in the code and changed the log
compaction to async. Those were the recent changes on the BlueFS side.
But this one seems to be a failure to read from BlueStore, while the
other assert you posted was from BlueFS (a read on behalf of rocksdb).
Maybe we should write some unit test cases that reproduce the same
issue with rocksdb-like write patterns.
Can you point me to the logs?

Varada

On Tuesday 23 August 2016 04:42 AM, Somnath Roy wrote:
> Sage,
> I think there are some bug introduced recently in the BlueFS and I am getting the corruption like this which I was not facing earlier.
>
>    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
>     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
>     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
>     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
>     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
>      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> osd/OSD.h: 999: FAILED assert(ret)
>
>  ceph version 11.0.0-1688-g6f48ee6 (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5617f2a99e80]
>  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
>  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
>  4: (main()+0x2fe0) [0x5617f229d1f0]
>  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
>  6: (_start()+0x29) [0x5617f22eb909]
>
> OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Somnath Roy
> Sent: Monday, August 22, 2016 3:01 PM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
>
> "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> Here is the option I am using..
>
>         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
>
> Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
>
>
>      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1 os/bluestore/BlueFS.cc: In function 'int BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)' thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
>
>  ceph version 11.0.0-1688-g3fcc89c (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5581ed453cb0]
>  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0x836) [0x5581ed11c1b6]
>  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
>  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
>  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&, rocksdb::Logger*)+0x358) [0x5581ed291c18]
>  6: (()+0x94fd54) [0x5581ed282d54]
>  7: (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
>  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) [0x5581ed28ba68]
>  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
>  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) [0x5581ed25c458]
>  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
>  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x22) [0x5581ed1f4182]
>  13: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x157) [0x5581ed1d21d7]
>  14: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x55b) [0x5581ed02802b]
>  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
>  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) [0x5581ed032bc2]
>  17: (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
>  18: (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39) [0x5581ecef89e9]
>  19: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb) [0x5581ecefeb4b]
>  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
>  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) [0x5581eccdd2e9]
>  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x52) [0x5581eccdd542]
>  23: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
>  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) [0x5581ed440b2f]
>  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5581ed4441f0]
>  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
>  27: (()+0x76fa) [0x7f8ed9e4e6fa]
>  28: (clone()+0x6d) [0x7f8ed7caeb5d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Monday, August 22, 2016 2:57 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
>
> On Mon, 22 Aug 2016, Somnath Roy wrote:
>> FYI, I was running rocksdb by enabling universal style compaction
>> during this time.
> How are you selecting universal compaction?
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Aug. 23, 2016, 5:18 a.m. UTC | #12
Well, it is trying to get a map for an epoch and failing. This is part of a meta object read that goes through rocksdb/BlueFS, so it seems it is also corruption at the BlueFS level.
Logs are not useful, so I am modifying the code to print detailed output during the crash; hopefully we can figure out something with that for the other crashes.
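
For example, something along these lines near the failing read, just to
make the failing location visible in the log (illustrative only; the
helper and its arguments are generic stand-ins, not the exact variables
in BlueFS::_read_random or OSDService::get_map):

#include <cstdint>
#include <iostream>

// Print the details of a failed low-level read before asserting, so
// the log shows which inode/offset/length could not be read back.
void log_failed_read(uint64_t ino, uint64_t offset, uint64_t length, int r)
{
  // The real code would go through derr ... dendl; std::cerr just
  // keeps this sketch self-contained.
  std::cerr << "read failed: ino " << ino
            << " off 0x" << std::hex << offset
            << " len 0x" << length << std::dec
            << " r = " << r << std::endl;
}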

Thanks & Regards
Somnath

-----Original Message-----
From: Varada Kari 
Sent: Monday, August 22, 2016 10:09 PM
To: Somnath Roy; Sage Weil
Cc: Mark Nelson; ceph-devel
Subject: Re: Bluestore assert

We added a couple of new asserts in the code and changed the log compaction to async. Those were the recent changes on the BlueFS side. But this one seems to be a failure to read from BlueStore.
But the other assert you posted was from BlueFS (a read on behalf of rocksdb). Maybe we should write some unit test cases that reproduce the same issue with rocksdb-like write patterns.
Can you point me to the logs?

Varada

On Tuesday 23 August 2016 04:42 AM, Somnath Roy wrote:
> Sage,
> I think there are some bug introduced recently in the BlueFS and I am getting the corruption like this which I was not facing earlier.
>
>    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
>     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
>     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
>     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
>     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
>      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb6ba94d8c0 
> time 2016-08-22 15:55:27.214638
> osd/OSD.h: 999: FAILED assert(ret)
>
>  ceph version 11.0.0-1688-g6f48ee6 
> (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x80) [0x5617f2a99e80]
>  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
>  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
>  4: (main()+0x2fe0) [0x5617f229d1f0]
>  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
>  6: (_start()+0x29) [0x5617f22eb909]
>
> OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Somnath Roy
> Sent: Monday, August 22, 2016 3:01 PM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
>
> "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> Here is the option I am using..
>
>         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
>
> Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
>
>
>      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1 
> os/bluestore/BlueFS.cc: In function 'int 
> BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)' 
> thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
>
>  ceph version 11.0.0-1688-g3fcc89c 
> (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x80) [0x5581ed453cb0]
>  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned 
> long, char*)+0x836) [0x5581ed11c1b6]
>  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
> rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
>  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
> long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
>  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*, 
> rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, 
> bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&, 
> rocksdb::Logger*)+0x358) [0x5581ed291c18]
>  6: (()+0x94fd54) [0x5581ed282d54]
>  7: 
> (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTab
> le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&, 
> rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
>  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> [0x5581ed28ba68]
>  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, 
> rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, 
> bool, int)+0x158) [0x5581ed252118]
>  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
> rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
> [0x5581ed25c458]
>  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
>  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >*)+0x22) [0x5581ed1f4182]
>  13: (RocksDBStore::get(std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> [0x5581ed1d21d7]
>  14: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x55b) 
> [0x5581ed02802b]
>  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
> ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
>  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*, 
> std::vector<ObjectStore::Transaction, 
> std::allocator<ObjectStore::Transaction> >&, 
> std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> [0x5581ed032bc2]
>  17: 
> (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction
> , std::allocator<ObjectStore::Transaction> >&, 
> std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
>  18: 
> (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39) 
> [0x5581ecef89e9]
>  19: 
> (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb) 
> [0x5581ecefeb4b]
>  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, 
> ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
>  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>, 
> std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> [0x5581eccdd2e9]
>  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> 
> const&)+0x52) [0x5581eccdd542]
>  23: (OSD::ShardedOpWQ::_process(unsigned int, 
> ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
>  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) 
> [0x5581ed440b2f]
>  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) 
> [0x5581ed4441f0]
>  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
>  27: (()+0x76fa) [0x7f8ed9e4e6fa]
>  28: (clone()+0x6d) [0x7f8ed7caeb5d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Monday, August 22, 2016 2:57 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
>
> On Mon, 22 Aug 2016, Somnath Roy wrote:
>> FYI, I was running rocksdb by enabling universal style compaction 
>> during this time.
> How are you selecting universal compaction?
>
> sage
>

Varada Kari Aug. 23, 2016, 5:25 a.m. UTC | #13
Can you try disabling the async log compaction and check once?
bluefs_compact_log_sync to false.

Varada

On Tuesday 23 August 2016 10:48 AM, Somnath Roy wrote:
> Well, it is trying to get a map for an epoch and failing.This is part of meta object read and going to rocksdb/Bluefs, so, it seems it is also a corruption at BlueFS level. 
> Logs are not useful, I am modifying code to print detailed output during crash , hope we can figure out something with that for the other crashes.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Varada Kari 
> Sent: Monday, August 22, 2016 10:09 PM
> To: Somnath Roy; Sage Weil
> Cc: Mark Nelson; ceph-devel
> Subject: Re: Bluestore assert
>
> We added couple of new asserts in the code and changed the log compaction to async. That was the recent change happened on BlueFS. But this one seems we are not able to read from BlueStore.
> But the other assert what you have posted, was from BlueFS(read from rocksdb). May be we should write some unit test cases to reproduce the same issue similar to rocksdb write patterns.
> Can you point me to the logs?
>
> Varada
>
> On Tuesday 23 August 2016 04:42 AM, Somnath Roy wrote:
>> Sage,
>> I think there are some bug introduced recently in the BlueFS and I am getting the corruption like this which I was not facing earlier.
>>
>>    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
>>     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
>>     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
>>     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
>>     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
>>      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
>> function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb6ba94d8c0 
>> time 2016-08-22 15:55:27.214638
>> osd/OSD.h: 999: FAILED assert(ret)
>>
>>  ceph version 11.0.0-1688-g6f48ee6 
>> (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x80) [0x5617f2a99e80]
>>  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
>>  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
>>  4: (main()+0x2fe0) [0x5617f229d1f0]
>>  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
>>  6: (_start()+0x29) [0x5617f22eb909]
>>
>> OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Somnath Roy
>> Sent: Monday, August 22, 2016 3:01 PM
>> To: 'Sage Weil'
>> Cc: Mark Nelson; ceph-devel
>> Subject: RE: Bluestore assert
>>
>> "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
>> Here is the option I am using..
>>
>>         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
>>
>> Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
>>
>>
>>      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1 
>> os/bluestore/BlueFS.cc: In function 'int 
>> BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)' 
>> thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
>> os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
>>
>>  ceph version 11.0.0-1688-g3fcc89c 
>> (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x80) [0x5581ed453cb0]
>>  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned 
>> long, char*)+0x836) [0x5581ed11c1b6]
>>  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
>> rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
>>  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
>> long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
>>  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*, 
>> rocksdb::Footer const&, rocksdb::ReadOptions const&, 
>> rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, 
>> bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&, 
>> rocksdb::Logger*)+0x358) [0x5581ed291c18]
>>  6: (()+0x94fd54) [0x5581ed282d54]
>>  7: 
>> (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTab
>> le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&, 
>> rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
>>  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
>> rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
>> [0x5581ed28ba68]
>>  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
>> rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, 
>> rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, 
>> bool, int)+0x158) [0x5581ed252118]
>>  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
>> rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
>> std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
>> rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
>> [0x5581ed25c458]
>>  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
>> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>> std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
>>  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
>> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>> std::allocator<char> >*)+0x22) [0x5581ed1f4182]
>>  13: (RocksDBStore::get(std::__cxx11::basic_string<char, 
>> std::char_traits<char>, std::allocator<char> > const&, 
>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>> std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
>> [0x5581ed1d21d7]
>>  14: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x55b) 
>> [0x5581ed02802b]
>>  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
>> ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
>>  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*, 
>> std::vector<ObjectStore::Transaction, 
>> std::allocator<ObjectStore::Transaction> >&, 
>> std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
>> [0x5581ed032bc2]
>>  17: 
>> (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction
>> , std::allocator<ObjectStore::Transaction> >&, 
>> std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
>>  18: 
>> (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39) 
>> [0x5581ecef89e9]
>>  19: 
>> (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb) 
>> [0x5581ecefeb4b]
>>  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, 
>> ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
>>  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>, 
>> std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
>> [0x5581eccdd2e9]
>>  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> 
>> const&)+0x52) [0x5581eccdd542]
>>  23: (OSD::ShardedOpWQ::_process(unsigned int, 
>> ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
>>  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) 
>> [0x5581ed440b2f]
>>  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) 
>> [0x5581ed4441f0]
>>  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
>>  27: (()+0x76fa) [0x7f8ed9e4e6fa]
>>  28: (clone()+0x6d) [0x7f8ed7caeb5d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sweil@redhat.com]
>> Sent: Monday, August 22, 2016 2:57 PM
>> To: Somnath Roy
>> Cc: Mark Nelson; ceph-devel
>> Subject: RE: Bluestore assert
>>
>> On Mon, 22 Aug 2016, Somnath Roy wrote:
>>> FYI, I was running rocksdb by enabling universal style compaction 
>>> during this time.
>> How are you selecting universal compaction?
>>
>> sage
>>
>

Somnath Roy Aug. 23, 2016, 5:27 a.m. UTC | #14
bluefs_compact_log_sync is false by default; do you mean I should set it to true?

-----Original Message-----
From: Varada Kari 
Sent: Monday, August 22, 2016 10:25 PM
To: Somnath Roy; Sage Weil
Cc: Mark Nelson; ceph-devel
Subject: Re: Bluestore assert

Can you try disabling the async log compaction and check once?
bluefs_compact_log_sync to false.

Varada

On Tuesday 23 August 2016 10:48 AM, Somnath Roy wrote:
> Well, it is trying to get a map for an epoch and failing.This is part of meta object read and going to rocksdb/Bluefs, so, it seems it is also a corruption at BlueFS level. 
> Logs are not useful, I am modifying code to print detailed output during crash , hope we can figure out something with that for the other crashes.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Varada Kari
> Sent: Monday, August 22, 2016 10:09 PM
> To: Somnath Roy; Sage Weil
> Cc: Mark Nelson; ceph-devel
> Subject: Re: Bluestore assert
>
> We added couple of new asserts in the code and changed the log compaction to async. That was the recent change happened on BlueFS. But this one seems we are not able to read from BlueStore.
> But the other assert what you have posted, was from BlueFS(read from rocksdb). May be we should write some unit test cases to reproduce the same issue similar to rocksdb write patterns.
> Can you point me to the logs?
>
> Varada
>
> On Tuesday 23 August 2016 04:42 AM, Somnath Roy wrote:
>> Sage,
>> I think there are some bug introduced recently in the BlueFS and I am getting the corruption like this which I was not facing earlier.
>>
>>    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
>>     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
>>     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
>>     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
>>     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
>>      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
>> function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb6ba94d8c0 
>> time 2016-08-22 15:55:27.214638
>> osd/OSD.h: 999: FAILED assert(ret)
>>
>>  ceph version 11.0.0-1688-g6f48ee6
>> (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x80) [0x5617f2a99e80]
>>  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
>>  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
>>  4: (main()+0x2fe0) [0x5617f229d1f0]
>>  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
>>  6: (_start()+0x29) [0x5617f22eb909]
>>
>> OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Somnath Roy
>> Sent: Monday, August 22, 2016 3:01 PM
>> To: 'Sage Weil'
>> Cc: Mark Nelson; ceph-devel
>> Subject: RE: Bluestore assert
>>
>> "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
>> Here is the option I am using..
>>
>>         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
>>
>> Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
>>
>>
>>      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
>> os/bluestore/BlueFS.cc: In function 'int 
>> BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
>> thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
>> os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
>>
>>  ceph version 11.0.0-1688-g3fcc89c
>> (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x80) [0x5581ed453cb0]
>>  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
>> unsigned long, char*)+0x836) [0x5581ed11c1b6]
>>  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
>> rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
>>  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
>> long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
>>  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
>> rocksdb::Footer const&, rocksdb::ReadOptions const&, 
>> rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, 
>> bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&,
>> rocksdb::Logger*)+0x358) [0x5581ed291c18]
>>  6: (()+0x94fd54) [0x5581ed282d54]
>>  7: 
>> (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTa
>> b le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&,
>> rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
>>  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
>> rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
>> [0x5581ed28ba68]
>>  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
>> rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
>> const&, rocksdb::Slice const&, rocksdb::GetContext*, 
>> rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
>>  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
>> rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
>> std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
>> rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
>> [0x5581ed25c458]
>>  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
>> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>> std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
>>  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
>> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>> std::allocator<char> >*)+0x22) [0x5581ed1f4182]
>>  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
>> std::char_traits<char>, std::allocator<char> > const&, 
>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>> std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
>> [0x5581ed1d21d7]
>>  14: (BlueStore::Collection::get_onode(ghobject_t const&, 
>> bool)+0x55b) [0x5581ed02802b]
>>  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
>> ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
>>  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
>> std::vector<ObjectStore::Transaction,
>> std::allocator<ObjectStore::Transaction> >&, 
>> std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
>> [0x5581ed032bc2]
>>  17: 
>> (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transactio
>> n , std::allocator<ObjectStore::Transaction> >&,
>> std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
>>  18: 
>> (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39)
>> [0x5581ecef89e9]
>>  19: 
>> (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb)
>> [0x5581ecefeb4b]
>>  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
>> ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
>>  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>> std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
>> [0x5581eccdd2e9]
>>  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
>> const&)+0x52) [0x5581eccdd542]
>>  23: (OSD::ShardedOpWQ::_process(unsigned int,
>> ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
>>  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned 
>> int)+0x89f) [0x5581ed440b2f]
>>  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
>> [0x5581ed4441f0]
>>  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
>>  27: (()+0x76fa) [0x7f8ed9e4e6fa]
>>  28: (clone()+0x6d) [0x7f8ed7caeb5d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sweil@redhat.com]
>> Sent: Monday, August 22, 2016 2:57 PM
>> To: Somnath Roy
>> Cc: Mark Nelson; ceph-devel
>> Subject: RE: Bluestore assert
>>
>> On Mon, 22 Aug 2016, Somnath Roy wrote:
>>> FYI, I was running rocksdb by enabling universal style compaction 
>>> during this time.
>> How are you selecting universal compaction?
>>
>> sage
>>
>

Varada Kari Aug. 23, 2016, 5:30 a.m. UTC | #15
Nope. Then the code is the same; we shouldn't be seeing any issue. Logs
would help anyway. Let's run with debug logging on and see.
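
If it helps, one way to crank the relevant debug levels on a running OSD
without a restart (plain injectargs; the same keys, minus the leading
dashes, can also go under [osd] in ceph.conf):

  ceph tell osd.0 injectargs '--debug-bluefs 20/20 --debug-bluestore 20/20 --debug-rocksdb 10/10'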

Varada

On Tuesday 23 August 2016 10:57 AM, Somnath Roy wrote:
> bluefs_compact_log_sync = false by default, you mean to say make it true ?
>
> -----Original Message-----
> From: Varada Kari 
> Sent: Monday, August 22, 2016 10:25 PM
> To: Somnath Roy; Sage Weil
> Cc: Mark Nelson; ceph-devel
> Subject: Re: Bluestore assert
>
> Can you try disabling the async log compaction and check once?
> bluefs_compact_log_sync to false.
>
> Varada
>
> On Tuesday 23 August 2016 10:48 AM, Somnath Roy wrote:
>> Well, it is trying to get a map for an epoch and failing.This is part of meta object read and going to rocksdb/Bluefs, so, it seems it is also a corruption at BlueFS level. 
>> Logs are not useful, I am modifying code to print detailed output during crash , hope we can figure out something with that for the other crashes.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Varada Kari
>> Sent: Monday, August 22, 2016 10:09 PM
>> To: Somnath Roy; Sage Weil
>> Cc: Mark Nelson; ceph-devel
>> Subject: Re: Bluestore assert
>>
>> We added couple of new asserts in the code and changed the log compaction to async. That was the recent change happened on BlueFS. But this one seems we are not able to read from BlueStore.
>> But the other assert what you have posted, was from BlueFS(read from rocksdb). May be we should write some unit test cases to reproduce the same issue similar to rocksdb write patterns.
>> Can you point me to the logs?
>>
>> Varada
>>
>> On Tuesday 23 August 2016 04:42 AM, Somnath Roy wrote:
>>> Sage,
>>> I think there are some bug introduced recently in the BlueFS and I am getting the corruption like this which I was not facing earlier.
>>>
>>>    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
>>>     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
>>>     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
>>>     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
>>>     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
>>>      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
>>> function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb6ba94d8c0 
>>> time 2016-08-22 15:55:27.214638
>>> osd/OSD.h: 999: FAILED assert(ret)
>>>
>>>  ceph version 11.0.0-1688-g6f48ee6
>>> (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x80) [0x5617f2a99e80]
>>>  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
>>>  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
>>>  4: (main()+0x2fe0) [0x5617f229d1f0]
>>>  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
>>>  6: (_start()+0x29) [0x5617f22eb909]
>>>
>>> OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: Somnath Roy
>>> Sent: Monday, August 22, 2016 3:01 PM
>>> To: 'Sage Weil'
>>> Cc: Mark Nelson; ceph-devel
>>> Subject: RE: Bluestore assert
>>>
>>> "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
>>> Here is the option I am using..
>>>
>>>         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
>>>
>>> Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
>>>
>>>
>>>      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
>>> os/bluestore/BlueFS.cc: In function 'int 
>>> BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
>>> thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
>>> os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
>>>
>>>  ceph version 11.0.0-1688-g3fcc89c
>>> (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x80) [0x5581ed453cb0]
>>>  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
>>> unsigned long, char*)+0x836) [0x5581ed11c1b6]
>>>  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
>>> rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
>>>  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
>>> long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
>>>  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
>>> rocksdb::Footer const&, rocksdb::ReadOptions const&, 
>>> rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, 
>>> bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&,
>>> rocksdb::Logger*)+0x358) [0x5581ed291c18]
>>>  6: (()+0x94fd54) [0x5581ed282d54]
>>>  7: 
>>> (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTa
>>> b le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&,
>>> rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
>>>  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
>>> rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
>>> [0x5581ed28ba68]
>>>  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
>>> rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
>>> const&, rocksdb::Slice const&, rocksdb::GetContext*, 
>>> rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
>>>  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
>>> rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
>>> std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
>>> rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
>>> [0x5581ed25c458]
>>>  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
>>> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
>>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>>> std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
>>>  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
>>> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
>>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>>> std::allocator<char> >*)+0x22) [0x5581ed1f4182]
>>>  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
>>> std::char_traits<char>, std::allocator<char> > const&, 
>>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>>> std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
>>> [0x5581ed1d21d7]
>>>  14: (BlueStore::Collection::get_onode(ghobject_t const&, 
>>> bool)+0x55b) [0x5581ed02802b]
>>>  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
>>> ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
>>>  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
>>> std::vector<ObjectStore::Transaction,
>>> std::allocator<ObjectStore::Transaction> >&, 
>>> std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
>>> [0x5581ed032bc2]
>>>  17: 
>>> (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transactio
>>> n , std::allocator<ObjectStore::Transaction> >&,
>>> std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
>>>  18: 
>>> (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39)
>>> [0x5581ecef89e9]
>>>  19: 
>>> (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb)
>>> [0x5581ecefeb4b]
>>>  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
>>> ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
>>>  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>>> std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
>>> [0x5581eccdd2e9]
>>>  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
>>> const&)+0x52) [0x5581eccdd542]
>>>  23: (OSD::ShardedOpWQ::_process(unsigned int,
>>> ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
>>>  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned 
>>> int)+0x89f) [0x5581ed440b2f]
>>>  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
>>> [0x5581ed4441f0]
>>>  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
>>>  27: (()+0x76fa) [0x7f8ed9e4e6fa]
>>>  28: (clone()+0x6d) [0x7f8ed7caeb5d]
>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sweil@redhat.com]
>>> Sent: Monday, August 22, 2016 2:57 PM
>>> To: Somnath Roy
>>> Cc: Mark Nelson; ceph-devel
>>> Subject: RE: Bluestore assert
>>>
>>> On Mon, 22 Aug 2016, Somnath Roy wrote:
>>>> FYI, I was running rocksdb by enabling universal style compaction 
>>>> during this time.
>>> How are you selecting universal compaction?
>>>
>>> sage
>>>
>

Somnath Roy Aug. 23, 2016, 5:35 a.m. UTC | #16
It is doing async compaction by default, right? Only if the option is set to true does it go for the sync compaction path. So the code is not the same, I guess?
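
My reading of BlueFS.cc on master is roughly the sketch below (helper names
and argument lists are assumed from a quick glance, so treat this as a
sketch rather than the exact code):

  // once the BlueFS log is judged too big to keep replaying:
  if (g_conf->bluefs_compact_log_sync) {
    _compact_log_sync();     // old path: rewrite the log synchronously
  } else {
    _compact_log_async();    // current default: compact in the background
  }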

-----Original Message-----
From: Varada Kari 
Sent: Monday, August 22, 2016 10:31 PM
To: Somnath Roy; Sage Weil
Cc: Mark Nelson; ceph-devel
Subject: Re: Bluestore assert

Nope. Then the code is same. We shouldn't be seeing any issue. Logs would help anyway. Let us run with debugs on and see.

Varada

On Tuesday 23 August 2016 10:57 AM, Somnath Roy wrote:
> bluefs_compact_log_sync = false by default, you mean to say make it true ?
>
> -----Original Message-----
> From: Varada Kari
> Sent: Monday, August 22, 2016 10:25 PM
> To: Somnath Roy; Sage Weil
> Cc: Mark Nelson; ceph-devel
> Subject: Re: Bluestore assert
>
> Can you try disabling the async log compaction and check once?
> bluefs_compact_log_sync to false.
>
> Varada
>
> On Tuesday 23 August 2016 10:48 AM, Somnath Roy wrote:
>> Well, it is trying to get a map for an epoch and failing.This is part of meta object read and going to rocksdb/Bluefs, so, it seems it is also a corruption at BlueFS level. 
>> Logs are not useful, I am modifying code to print detailed output during crash , hope we can figure out something with that for the other crashes.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Varada Kari
>> Sent: Monday, August 22, 2016 10:09 PM
>> To: Somnath Roy; Sage Weil
>> Cc: Mark Nelson; ceph-devel
>> Subject: Re: Bluestore assert
>>
>> We added couple of new asserts in the code and changed the log compaction to async. That was the recent change happened on BlueFS. But this one seems we are not able to read from BlueStore.
>> But the other assert what you have posted, was from BlueFS(read from rocksdb). May be we should write some unit test cases to reproduce the same issue similar to rocksdb write patterns.
>> Can you point me to the logs?
>>
>> Varada
>>
>> On Tuesday 23 August 2016 04:42 AM, Somnath Roy wrote:
>>> Sage,
>>> I think there are some bug introduced recently in the BlueFS and I am getting the corruption like this which I was not facing earlier.
>>>
>>>    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
>>>     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
>>>     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
>>>     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
>>>     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
>>>      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
>>> function 'OSDMapRef OSDService::get_map(epoch_t)' thread 
>>> 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
>>> osd/OSD.h: 999: FAILED assert(ret)
>>>
>>>  ceph version 11.0.0-1688-g6f48ee6
>>> (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x80) [0x5617f2a99e80]
>>>  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
>>>  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
>>>  4: (main()+0x2fe0) [0x5617f229d1f0]
>>>  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
>>>  6: (_start()+0x29) [0x5617f22eb909]
>>>
>>> OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: Somnath Roy
>>> Sent: Monday, August 22, 2016 3:01 PM
>>> To: 'Sage Weil'
>>> Cc: Mark Nelson; ceph-devel
>>> Subject: RE: Bluestore assert
>>>
>>> "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
>>> Here is the option I am using..
>>>
>>>         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
>>>
>>> Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
>>>
>>>
>>>      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
>>> os/bluestore/BlueFS.cc: In function 'int 
>>> BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
>>> thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
>>> os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
>>>
>>>  ceph version 11.0.0-1688-g3fcc89c
>>> (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x80) [0x5581ed453cb0]
>>>  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
>>> unsigned long, char*)+0x836) [0x5581ed11c1b6]
>>>  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
>>> rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
>>>  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
>>> long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
>>>  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
>>> rocksdb::Footer const&, rocksdb::ReadOptions const&, 
>>> rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, 
>>> bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&,
>>> rocksdb::Logger*)+0x358) [0x5581ed291c18]
>>>  6: (()+0x94fd54) [0x5581ed282d54]
>>>  7: 
>>> (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedT
>>> a b le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&,
>>> rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
>>>  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
>>> rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
>>> [0x5581ed28ba68]
>>>  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
>>> rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
>>> const&, rocksdb::Slice const&, rocksdb::GetContext*, 
>>> rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
>>>  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
>>> rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
>>> std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
>>> rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
>>> [0x5581ed25c458]
>>>  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
>>> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
>>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>>> std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
>>>  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
>>> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
>>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>>> std::allocator<char> >*)+0x22) [0x5581ed1f4182]
>>>  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
>>> std::char_traits<char>, std::allocator<char> > const&, 
>>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>>> std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
>>> [0x5581ed1d21d7]
>>>  14: (BlueStore::Collection::get_onode(ghobject_t const&,
>>> bool)+0x55b) [0x5581ed02802b]
>>>  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
>>> ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
>>>  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
>>> std::vector<ObjectStore::Transaction,
>>> std::allocator<ObjectStore::Transaction> >&, 
>>> std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
>>> [0x5581ed032bc2]
>>>  17: 
>>> (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transacti
>>> o n , std::allocator<ObjectStore::Transaction> >&,
>>> std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
>>>  18: 
>>> (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39)
>>> [0x5581ecef89e9]
>>>  19: 
>>> (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb
>>> )
>>> [0x5581ecefeb4b]
>>>  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
>>> ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
>>>  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>>> std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
>>> [0x5581eccdd2e9]
>>>  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
>>> const&)+0x52) [0x5581eccdd542]
>>>  23: (OSD::ShardedOpWQ::_process(unsigned int,
>>> ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
>>>  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
>>> int)+0x89f) [0x5581ed440b2f]
>>>  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
>>> [0x5581ed4441f0]
>>>  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
>>>  27: (()+0x76fa) [0x7f8ed9e4e6fa]
>>>  28: (clone()+0x6d) [0x7f8ed7caeb5d]
>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sweil@redhat.com]
>>> Sent: Monday, August 22, 2016 2:57 PM
>>> To: Somnath Roy
>>> Cc: Mark Nelson; ceph-devel
>>> Subject: RE: Bluestore assert
>>>
>>> On Mon, 22 Aug 2016, Somnath Roy wrote:
>>>> FYI, I was running rocksdb by enabling universal style compaction 
>>>> during this time.
>>> How are you selecting universal compaction?
>>>
>>> sage
>>>
>

Varada Kari Aug. 23, 2016, 5:38 a.m. UTC | #17
Sorry, I got confused; yes, you are correct. Disable the async compaction
and try. That's the recent change that went in, but there shouldn't be any
difference from it.

Varada

On Tuesday 23 August 2016 11:05 AM, Somnath Roy wrote:
> It is by default doing async compaction , right ? Only if the option is set true, it is going for sync compaction..So, code is not same I guess ?
>
> -----Original Message-----
> From: Varada Kari 
> Sent: Monday, August 22, 2016 10:31 PM
> To: Somnath Roy; Sage Weil
> Cc: Mark Nelson; ceph-devel
> Subject: Re: Bluestore assert
>
> Nope. Then the code is same. We shouldn't be seeing any issue. Logs would help anyway. Let us run with debugs on and see.
>
> Varada
>
> On Tuesday 23 August 2016 10:57 AM, Somnath Roy wrote:
>> bluefs_compact_log_sync = false by default, you mean to say make it true ?
>>
>> -----Original Message-----
>> From: Varada Kari
>> Sent: Monday, August 22, 2016 10:25 PM
>> To: Somnath Roy; Sage Weil
>> Cc: Mark Nelson; ceph-devel
>> Subject: Re: Bluestore assert
>>
>> Can you try disabling the async log compaction and check once?
>> bluefs_compact_log_sync to false.
>>
>> Varada
>>
>> On Tuesday 23 August 2016 10:48 AM, Somnath Roy wrote:
>>> Well, it is trying to get a map for an epoch and failing.This is part of meta object read and going to rocksdb/Bluefs, so, it seems it is also a corruption at BlueFS level. 
>>> Logs are not useful, I am modifying code to print detailed output during crash , hope we can figure out something with that for the other crashes.
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: Varada Kari
>>> Sent: Monday, August 22, 2016 10:09 PM
>>> To: Somnath Roy; Sage Weil
>>> Cc: Mark Nelson; ceph-devel
>>> Subject: Re: Bluestore assert
>>>
>>> We added couple of new asserts in the code and changed the log compaction to async. That was the recent change happened on BlueFS. But this one seems we are not able to read from BlueStore.
>>> But the other assert what you have posted, was from BlueFS(read from rocksdb). May be we should write some unit test cases to reproduce the same issue similar to rocksdb write patterns.
>>> Can you point me to the logs?
>>>
>>> Varada
>>>
>>> On Tuesday 23 August 2016 04:42 AM, Somnath Roy wrote:
>>>> Sage,
>>>> I think there are some bug introduced recently in the BlueFS and I am getting the corruption like this which I was not facing earlier.
>>>>
>>>>    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
>>>>     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
>>>>     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
>>>>     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
>>>>     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
>>>>      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
>>>> function 'OSDMapRef OSDService::get_map(epoch_t)' thread 
>>>> 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
>>>> osd/OSD.h: 999: FAILED assert(ret)
>>>>
>>>>  ceph version 11.0.0-1688-g6f48ee6
>>>> (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>> const*)+0x80) [0x5617f2a99e80]
>>>>  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
>>>>  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
>>>>  4: (main()+0x2fe0) [0x5617f229d1f0]
>>>>  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
>>>>  6: (_start()+0x29) [0x5617f22eb909]
>>>>
>>>> OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>> -----Original Message-----
>>>> From: Somnath Roy
>>>> Sent: Monday, August 22, 2016 3:01 PM
>>>> To: 'Sage Weil'
>>>> Cc: Mark Nelson; ceph-devel
>>>> Subject: RE: Bluestore assert
>>>>
>>>> "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
>>>> Here is the option I am using..
>>>>
>>>>         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
>>>>
>>>> Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
>>>>
>>>>
>>>>      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
>>>> os/bluestore/BlueFS.cc: In function 'int 
>>>> BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
>>>> thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
>>>> os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
>>>>
>>>>  ceph version 11.0.0-1688-g3fcc89c
>>>> (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
>>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>> const*)+0x80) [0x5581ed453cb0]
>>>>  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
>>>> unsigned long, char*)+0x836) [0x5581ed11c1b6]
>>>>  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
>>>> rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
>>>>  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
>>>> long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
>>>>  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
>>>> rocksdb::Footer const&, rocksdb::ReadOptions const&, 
>>>> rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, 
>>>> bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&,
>>>> rocksdb::Logger*)+0x358) [0x5581ed291c18]
>>>>  6: (()+0x94fd54) [0x5581ed282d54]
>>>>  7: 
>>>> (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedT
>>>> a b le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&,
>>>> rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
>>>>  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
>>>> rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
>>>> [0x5581ed28ba68]
>>>>  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
>>>> rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
>>>> const&, rocksdb::Slice const&, rocksdb::GetContext*, 
>>>> rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
>>>>  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
>>>> rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
>>>> std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
>>>> rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
>>>> [0x5581ed25c458]
>>>>  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
>>>> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
>>>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>>>> std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
>>>>  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
>>>> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
>>>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>>>> std::allocator<char> >*)+0x22) [0x5581ed1f4182]
>>>>  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
>>>> std::char_traits<char>, std::allocator<char> > const&, 
>>>> std::__cxx11::basic_string<char, std::char_traits<char>, 
>>>> std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
>>>> [0x5581ed1d21d7]
>>>>  14: (BlueStore::Collection::get_onode(ghobject_t const&,
>>>> bool)+0x55b) [0x5581ed02802b]
>>>>  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
>>>> ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
>>>>  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
>>>> std::vector<ObjectStore::Transaction,
>>>> std::allocator<ObjectStore::Transaction> >&, 
>>>> std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
>>>> [0x5581ed032bc2]
>>>>  17: 
>>>> (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transacti
>>>> o n , std::allocator<ObjectStore::Transaction> >&,
>>>> std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
>>>>  18: 
>>>> (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39)
>>>> [0x5581ecef89e9]
>>>>  19: 
>>>> (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb
>>>> )
>>>> [0x5581ecefeb4b]
>>>>  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
>>>> ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
>>>>  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>>>> std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
>>>> [0x5581eccdd2e9]
>>>>  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
>>>> const&)+0x52) [0x5581eccdd542]
>>>>  23: (OSD::ShardedOpWQ::_process(unsigned int,
>>>> ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
>>>>  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
>>>> int)+0x89f) [0x5581ed440b2f]
>>>>  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
>>>> [0x5581ed4441f0]
>>>>  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
>>>>  27: (()+0x76fa) [0x7f8ed9e4e6fa]
>>>>  28: (clone()+0x6d) [0x7f8ed7caeb5d]
>>>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>> -----Original Message-----
>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>> Sent: Monday, August 22, 2016 2:57 PM
>>>> To: Somnath Roy
>>>> Cc: Mark Nelson; ceph-devel
>>>> Subject: RE: Bluestore assert
>>>>
>>>> On Mon, 22 Aug 2016, Somnath Roy wrote:
>>>>> FYI, I was running rocksdb by enabling universal style compaction 
>>>>> during this time.
>>>> How are you selecting universal compaction?
>>>>
>>>> sage
>>>>
>

Sage Weil Aug. 23, 2016, 1:45 p.m. UTC | #18
On Mon, 22 Aug 2016, Somnath Roy wrote:
> Sage,
> I think there are some bug introduced recently in the BlueFS and I am 
> getting the corruption like this which I was not facing earlier.

My guess is the async bluefs compaction.  You can set 'bluefs compact log 
sync = true' to disable it.

Any idea how long you have to run to reproduce?  I'd love to see a 
bluefs log leading up to it.  If it eats too much disk space, you could do 
debug bluefs = 1/20 so that it only dumps recent history on crash.
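
Something like this in ceph.conf should do it (just a sketch; I'm assuming 
you set these in the [osd] section, and the spaced and underscored option 
names are equivalent):

  [osd]
    # fall back to synchronous bluefs log compaction
    bluefs compact log sync = true
    # level 1 goes to the log file; level 20 is kept in memory and dumped on a crash
    debug bluefs = 1/20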

Thanks!
sage

> 
>    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
>     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
>     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
>     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
>     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
>      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> osd/OSD.h: 999: FAILED assert(ret)
> 
>  ceph version 11.0.0-1688-g6f48ee6 (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5617f2a99e80]
>  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
>  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
>  4: (main()+0x2fe0) [0x5617f229d1f0]
>  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
>  6: (_start()+0x29) [0x5617f22eb909]
> 
> OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Somnath Roy
> Sent: Monday, August 22, 2016 3:01 PM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> Here is the option I am using..
> 
>         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> 
> Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> 
> 
>      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1 os/bluestore/BlueFS.cc: In function 'int BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)' thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> 
>  ceph version 11.0.0-1688-g3fcc89c (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5581ed453cb0]
>  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0x836) [0x5581ed11c1b6]
>  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
>  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
>  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&, rocksdb::Logger*)+0x358) [0x5581ed291c18]
>  6: (()+0x94fd54) [0x5581ed282d54]
>  7: (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
>  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) [0x5581ed28ba68]
>  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
>  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) [0x5581ed25c458]
>  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
>  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x22) [0x5581ed1f4182]
>  13: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x157) [0x5581ed1d21d7]
>  14: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x55b) [0x5581ed02802b]
>  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
>  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) [0x5581ed032bc2]
>  17: (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
>  18: (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39) [0x5581ecef89e9]
>  19: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb) [0x5581ecefeb4b]
>  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
>  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) [0x5581eccdd2e9]
>  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x52) [0x5581eccdd542]
>  23: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
>  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) [0x5581ed440b2f]
>  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5581ed4441f0]
>  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
>  27: (()+0x76fa) [0x7f8ed9e4e6fa]
>  28: (clone()+0x6d) [0x7f8ed7caeb5d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Monday, August 22, 2016 2:57 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Mon, 22 Aug 2016, Somnath Roy wrote:
> > FYI, I was running rocksdb by enabling universal style compaction
> > during this time.
> 
> How are you selecting universal compaction?
> 
> sage
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Aug. 23, 2016, 2:46 p.m. UTC | #19
I was running my tests for 2 hours and it happened within that time.
I will try to reproduce with 1/20.

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Tuesday, August 23, 2016 6:46 AM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Mon, 22 Aug 2016, Somnath Roy wrote:
> Sage,
> I think there are some bug introduced recently in the BlueFS and I am 
> getting the corruption like this which I was not facing earlier.

My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.

Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.

Thanks!
sage

> 
>    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
>     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
>     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
>     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
>     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
>      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb6ba94d8c0 
> time 2016-08-22 15:55:27.214638
> osd/OSD.h: 999: FAILED assert(ret)
> 
>  ceph version 11.0.0-1688-g6f48ee6 
> (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x80) [0x5617f2a99e80]
>  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
>  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
>  4: (main()+0x2fe0) [0x5617f229d1f0]
>  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
>  6: (_start()+0x29) [0x5617f22eb909]
> 
> OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Somnath Roy
> Sent: Monday, August 22, 2016 3:01 PM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> Here is the option I am using..
> 
>         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> 
> Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> 
> 
>      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1 
> os/bluestore/BlueFS.cc: In function 'int 
> BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)' 
> thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> 
>  ceph version 11.0.0-1688-g3fcc89c 
> (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x80) [0x5581ed453cb0]
>  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned 
> long, char*)+0x836) [0x5581ed11c1b6]
>  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
> rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
>  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
> long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
>  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*, 
> rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, 
> bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&, 
> rocksdb::Logger*)+0x358) [0x5581ed291c18]
>  6: (()+0x94fd54) [0x5581ed282d54]
>  7: 
> (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTab
> le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&, 
> rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
>  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> [0x5581ed28ba68]
>  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, 
> rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, 
> bool, int)+0x158) [0x5581ed252118]
>  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
> rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
> [0x5581ed25c458]
>  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
>  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >*)+0x22) [0x5581ed1f4182]
>  13: (RocksDBStore::get(std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> [0x5581ed1d21d7]
>  14: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x55b) 
> [0x5581ed02802b]
>  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
> ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
>  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*, 
> std::vector<ObjectStore::Transaction, 
> std::allocator<ObjectStore::Transaction> >&, 
> std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> [0x5581ed032bc2]
>  17: 
> (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction
> , std::allocator<ObjectStore::Transaction> >&, 
> std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
>  18: 
> (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39) 
> [0x5581ecef89e9]
>  19: 
> (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb) 
> [0x5581ecefeb4b]
>  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, 
> ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
>  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>, 
> std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> [0x5581eccdd2e9]
>  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> 
> const&)+0x52) [0x5581eccdd542]
>  23: (OSD::ShardedOpWQ::_process(unsigned int, 
> ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
>  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) 
> [0x5581ed440b2f]
>  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) 
> [0x5581ed4441f0]
>  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
>  27: (()+0x76fa) [0x7f8ed9e4e6fa]
>  28: (clone()+0x6d) [0x7f8ed7caeb5d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Monday, August 22, 2016 2:57 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Mon, 22 Aug 2016, Somnath Roy wrote:
> > FYI, I was running rocksdb by enabling universal style compaction 
> > during this time.
> 
> How are you selecting universal compaction?
> 
> sage
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Aug. 24, 2016, 8:36 p.m. UTC | #20
Sage,
I got the db assert log from submit_transaction in the following location.

https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88a1b28fcc39

This is the log at debug level 1/20, with my hook that prints the rocksdb::WriteBatch transaction. I have uploaded 3 osd logs, and a common pattern before the crash is the following.

   -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: reusing log 266 from recycle list

   -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
   -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
   -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.

It seems to be trying to rename the WAL file, but the old file is not found. You can see the transaction printed in the log, along with the error code, like this.

   -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch = 
Put( Prefix = M key = 0x0000000000001483'.0000000087.00000000000000035303')
Put( Prefix = M key = 0x0000000000001483'._info')
Put( Prefix = O key = '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000000089ace7!'0xfffffffffffffffeffffffffffffffff)
Delete( Prefix = B key = 0x000004e73ae72000)
Put( Prefix = B key = 0x000004e73af72000)
Merge( Prefix = T key = 'bluestore_statfs')

Hopefully my decoding of the keys is correct; I have reused BlueStore's pretty_binary_string() after stripping the first 2 bytes of each key, which are the prefix character and a 0 separator byte.
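
In case it is useful, the hook is roughly the sketch below (paraphrased 
from my branch, so helper and member names may differ; it assumes a copy 
of BlueStore's pretty_binary_string() is visible from RocksDBStore.cc):

#include <sstream>
#include "rocksdb/slice.h"
#include "rocksdb/write_batch.h"

// Sketch only: walk the failed rocksdb::WriteBatch and print each op.
// Keys are stored as <prefix char> + 0 byte + <raw key>, so substr(2)
// strips the prefix and the separator before pretty-printing.
// (IIRC the prefix chars seen here are M = omap, O = object/onode,
//  B = block free list, T = statfs.)
struct WriteBatchPrinter : public rocksdb::WriteBatch::Handler {
  std::ostringstream out;
  void dump(const char *op, const rocksdb::Slice &key) {
    out << op << "( Prefix = " << key[0] << " key = "
        << pretty_binary_string(key.ToString().substr(2)) << ")\n";
  }
  void Put(const rocksdb::Slice &key, const rocksdb::Slice &value) override {
    dump("Put", key);
  }
  void Delete(const rocksdb::Slice &key) override {
    dump("Delete", key);
  }
  void Merge(const rocksdb::Slice &key, const rocksdb::Slice &value) override {
    dump("Merge", key);
  }
};

// On a failed submit it is invoked roughly like this ('bat' being the
// rocksdb::WriteBatch that was just submitted, 's' the returned Status):
//   WriteBatchPrinter p;
//   bat.Iterate(&p);
//   derr << "submit_transaction error: " << s.ToString()
//        << " code = " << s.code()
//        << " rocksdb::WriteBatch = \n" << p.out.str() << dendl;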

Any suggestions on the next step for root-causing this db assert, if the log rename is not enough of a hint?

BTW, I ran this time with bluefs_compact_log_sync = true and without commenting out the inode number asserts. I didn't hit those asserts, and there is no corruption so far either. It looks like a bug in the async compaction; I will try to reproduce that one later with a verbose log.

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy 
Sent: Tuesday, August 23, 2016 7:46 AM
To: 'Sage Weil'
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

I was running my tests for 2 hours and it happened within that time.
I will try to reproduce with 1/20.

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Tuesday, August 23, 2016 6:46 AM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Mon, 22 Aug 2016, Somnath Roy wrote:
> Sage,
> I think there are some bug introduced recently in the BlueFS and I am 
> getting the corruption like this which I was not facing earlier.

My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.

Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.

Thanks!
sage

> 
>    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
>     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
>     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
>     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
>     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
>      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb6ba94d8c0 
> time 2016-08-22 15:55:27.214638
> osd/OSD.h: 999: FAILED assert(ret)
> 
>  ceph version 11.0.0-1688-g6f48ee6
> (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x80) [0x5617f2a99e80]
>  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
>  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
>  4: (main()+0x2fe0) [0x5617f229d1f0]
>  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
>  6: (_start()+0x29) [0x5617f22eb909]
> 
> OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Somnath Roy
> Sent: Monday, August 22, 2016 3:01 PM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> Here is the option I am using..
> 
>         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> 
> Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> 
> 
>      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> os/bluestore/BlueFS.cc: In function 'int 
> BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> 
>  ceph version 11.0.0-1688-g3fcc89c
> (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x80) [0x5581ed453cb0]
>  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned 
> long, char*)+0x836) [0x5581ed11c1b6]
>  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
> rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
>  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
> long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
>  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, 
> bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&,
> rocksdb::Logger*)+0x358) [0x5581ed291c18]
>  6: (()+0x94fd54) [0x5581ed282d54]
>  7: 
> (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTab
> le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&,
> rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
>  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> [0x5581ed28ba68]
>  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, 
> rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, 
> bool, int)+0x158) [0x5581ed252118]
>  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
> rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
> [0x5581ed25c458]
>  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
>  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >*)+0x22) [0x5581ed1f4182]
>  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> std::char_traits<char>, std::allocator<char> > const&, 
> std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> [0x5581ed1d21d7]
>  14: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x55b) 
> [0x5581ed02802b]
>  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
>  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> std::vector<ObjectStore::Transaction,
> std::allocator<ObjectStore::Transaction> >&, 
> std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> [0x5581ed032bc2]
>  17: 
> (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction
> , std::allocator<ObjectStore::Transaction> >&,
> std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
>  18: 
> (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39)
> [0x5581ecef89e9]
>  19: 
> (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb)
> [0x5581ecefeb4b]
>  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
>  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> [0x5581eccdd2e9]
>  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> const&)+0x52) [0x5581eccdd542]
>  23: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
>  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) 
> [0x5581ed440b2f]
>  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> [0x5581ed4441f0]
>  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
>  27: (()+0x76fa) [0x7f8ed9e4e6fa]
>  28: (clone()+0x6d) [0x7f8ed7caeb5d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Monday, August 22, 2016 2:57 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Mon, 22 Aug 2016, Somnath Roy wrote:
> > FYI, I was running rocksdb by enabling universal style compaction 
> > during this time.
> 
> How are you selecting universal compaction?
> 
> sage
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil Aug. 24, 2016, 8:43 p.m. UTC | #21
On Wed, 24 Aug 2016, Somnath Roy wrote:
> Sage,
> I got the db assert log from submit_transaction in the following location.
> 
> https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88a1b28fcc39
> 
> This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> 
>    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: reusing log 266 from recycle list
> 
>    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
>    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
>    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> 
> It is trying to rename the wal file it seems and old file is not found. You can see the transaction printed in the log along with error code like this.

How much of the log do you have? Can you post what you have somewhere?

Thanks!
sage


> 
>    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch = 
> Put( Prefix = M key = 0x0000000000001483'.0000000087.00000000000000035303')
> Put( Prefix = M key = 0x0000000000001483'._info')
> Put( Prefix = O key = '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000000089ace7!'0xfffffffffffffffeffffffffffffffff)
> Delete( Prefix = B key = 0x000004e73ae72000)
> Put( Prefix = B key = 0x000004e73af72000)
> Merge( Prefix = T key = 'bluestore_statfs')
> 
> Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> 
> Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> 
> BTW, I ran this time with bluefs_compact_log_sync = true and without commenting the inode number asserts. I didn't hit those asserts as well as no corruption  so far. Seems like bug of async compaction. Will try to reproduce with verbose log that one later.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Somnath Roy 
> Sent: Tuesday, August 23, 2016 7:46 AM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> I was running my tests for 2 hours and it happened within that time.
> I will try to reproduce with 1/20.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Tuesday, August 23, 2016 6:46 AM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Mon, 22 Aug 2016, Somnath Roy wrote:
> > Sage,
> > I think there are some bug introduced recently in the BlueFS and I am 
> > getting the corruption like this which I was not facing earlier.
> 
> My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> 
> Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> 
> Thanks!
> sage
> 
> > 
> >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> > function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb6ba94d8c0 
> > time 2016-08-22 15:55:27.214638
> > osd/OSD.h: 999: FAILED assert(ret)
> > 
> >  ceph version 11.0.0-1688-g6f48ee6
> > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x80) [0x5617f2a99e80]
> >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> >  4: (main()+0x2fe0) [0x5617f229d1f0]
> >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> >  6: (_start()+0x29) [0x5617f22eb909]
> > 
> > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Monday, August 22, 2016 3:01 PM
> > To: 'Sage Weil'
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > Here is the option I am using..
> > 
> >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > 
> > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > 
> > 
> >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > os/bluestore/BlueFS.cc: In function 'int 
> > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > 
> >  ceph version 11.0.0-1688-g3fcc89c
> > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x80) [0x5581ed453cb0]
> >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned 
> > long, char*)+0x836) [0x5581ed11c1b6]
> >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
> > rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
> > long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
> >  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> > rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, 
> > bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&,
> > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> >  6: (()+0x94fd54) [0x5581ed282d54]
> >  7: 
> > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTab
> > le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&,
> > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> > rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> > [0x5581ed28ba68]
> >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, 
> > rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, 
> > bool, int)+0x158) [0x5581ed252118]
> >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
> > rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
> > [0x5581ed25c458]
> >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> > const&, 
> > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > [0x5581ed1d21d7]
> >  14: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x55b) 
> > [0x5581ed02802b]
> >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > std::vector<ObjectStore::Transaction,
> > std::allocator<ObjectStore::Transaction> >&, 
> > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > [0x5581ed032bc2]
> >  17: 
> > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction
> > , std::allocator<ObjectStore::Transaction> >&,
> > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> >  18: 
> > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39)
> > [0x5581ecef89e9]
> >  19: 
> > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb)
> > [0x5581ecefeb4b]
> >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > [0x5581eccdd2e9]
> >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > const&)+0x52) [0x5581eccdd542]
> >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) 
> > [0x5581ed440b2f]
> >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > [0x5581ed4441f0]
> >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Monday, August 22, 2016 2:57 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > FYI, I was running rocksdb by enabling universal style compaction 
> > > during this time.
> > 
> > How are you selecting universal compaction?
> > 
> > sage
> > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Aug. 24, 2016, 8:49 p.m. UTC | #22
Sage,
It is there in the following GitHub link I posted earlier. You can see 3 logs there.

https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88a1b28fcc39

Thanks & Regards
Somnath


-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Wednesday, August 24, 2016 1:43 PM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Wed, 24 Aug 2016, Somnath Roy wrote:
> Sage,
> I got the db assert log from submit_transaction in the following location.
> 
> https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88
> a1b28fcc39
> 
> This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> 
>    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: reusing 
> log 266 from recycle list
> 
>    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
>    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
>    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> 
> It is trying to rename the wal file it seems and old file is not found. You can see the transaction printed in the log along with error code like this.

How much of the log do you have? Can you post what you have somewhere?

Thanks!
sage


> 
>    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch = 
> Put( Prefix = M key = 
> 0x0000000000001483'.0000000087.00000000000000035303')
> Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix = O key 
> = 
> '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000000089ac
> e7!'0xfffffffffffffffeffffffffffffffff)
> Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B key = 
> 0x000004e73af72000) Merge( Prefix = T key = 'bluestore_statfs')
> 
> Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> 
> Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> 
> BTW, I ran this time with bluefs_compact_log_sync = true and without commenting the inode number asserts. I didn't hit those asserts as well as no corruption  so far. Seems like bug of async compaction. Will try to reproduce with verbose log that one later.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Somnath Roy
> Sent: Tuesday, August 23, 2016 7:46 AM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> I was running my tests for 2 hours and it happened within that time.
> I will try to reproduce with 1/20.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Tuesday, August 23, 2016 6:46 AM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Mon, 22 Aug 2016, Somnath Roy wrote:
> > Sage,
> > I think there are some bug introduced recently in the BlueFS and I 
> > am getting the corruption like this which I was not facing earlier.
> 
> My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> 
> Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> 
> Thanks!
> sage
> 
> > 
> >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> > function 'OSDMapRef OSDService::get_map(epoch_t)' thread 
> > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > osd/OSD.h: 999: FAILED assert(ret)
> > 
> >  ceph version 11.0.0-1688-g6f48ee6
> > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x80) [0x5617f2a99e80]
> >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> >  4: (main()+0x2fe0) [0x5617f229d1f0]
> >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> >  6: (_start()+0x29) [0x5617f22eb909]
> > 
> > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Monday, August 22, 2016 3:01 PM
> > To: 'Sage Weil'
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > Here is the option I am using..
> > 
> >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > 
> > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > 
> > 
> >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > os/bluestore/BlueFS.cc: In function 'int 
> > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > 
> >  ceph version 11.0.0-1688-g3fcc89c
> > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x80) [0x5581ed453cb0]
> >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
> > unsigned long, char*)+0x836) [0x5581ed11c1b6]
> >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
> > rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
> > long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
> >  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> > rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, 
> > bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&,
> > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> >  6: (()+0x94fd54) [0x5581ed282d54]
> >  7: 
> > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedT
> > ab le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&,
> > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> > rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> > [0x5581ed28ba68]
> >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
> > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
> > rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
> > [0x5581ed25c458]
> >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > std::char_traits<char>, std::allocator<char> > const&, 
> > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > [0x5581ed1d21d7]
> >  14: (BlueStore::Collection::get_onode(ghobject_t const&, 
> > bool)+0x55b) [0x5581ed02802b]
> >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > std::vector<ObjectStore::Transaction,
> > std::allocator<ObjectStore::Transaction> >&, 
> > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > [0x5581ed032bc2]
> >  17: 
> > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transacti
> > on , std::allocator<ObjectStore::Transaction> >&,
> > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> >  18: 
> > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39)
> > [0x5581ecef89e9]
> >  19: 
> > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb
> > )
> > [0x5581ecefeb4b]
> >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > [0x5581eccdd2e9]
> >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > const&)+0x52) [0x5581eccdd542]
> >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned 
> > int)+0x89f) [0x5581ed440b2f]
> >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > [0x5581ed4441f0]
> >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Monday, August 22, 2016 2:57 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > FYI, I was running rocksdb by enabling universal style compaction 
> > > during this time.
> > 
> > How are you selecting universal compaction?
> > 
> > sage
> > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil Aug. 24, 2016, 9:34 p.m. UTC | #23
On Wed, 24 Aug 2016, Somnath Roy wrote:
> Sage, It is there in the following github link I posted earlier..You can 
> see 3 logs there..
> 
> https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88a1b28fcc39

Ah sorry, got it.

And looking at the crash and the code, the weird error you're getting makes 
perfect sense: it's coming from the ReuseWritableFile() function (which 
gets an error on the rename and returns it).  It shouldn't ever fail, so 
there is either a bug in the bluefs code or in the rocksdb recycling code.
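
Roughly, the path is (paraphrasing from memory, so these names are 
approximate and not the exact source):

// rocksdb, when it reuses a log from the recycle list, does something like
//   s = env_->ReuseWritableFile(new_log_name, recycled_log_name, &file, opt);
// and the BlueRocksEnv side is essentially:
rocksdb::Status BlueRocksEnv::ReuseWritableFile(
    const std::string &new_fname,
    const std::string &old_fname,
    std::unique_ptr<rocksdb::WritableFile> *result,
    const rocksdb::EnvOptions &options)
{
  std::string old_dir, old_file, new_dir, new_file;
  split(old_fname, &old_dir, &old_file);   // e.g. db.wal/000266.log
  split(new_fname, &new_dir, &new_file);   // e.g. db.wal/000271.log
  int r = fs->rename(old_dir, old_file, new_dir, new_file);
  if (r < 0)
    return err_to_status(r);   // -ENOENT becomes Status::NotFound, code 1
  // (then the renamed file gets reopened for writing and returned in
  //  *result; details elided here)
  return rocksdb::Status::OK();
}

So if bluefs has already lost 000266.log from the db.wal dir, the rename 
fails with ENOENT, rocksdb gets NotFound back, and that is presumably the 
status that eventually surfaces from submit_transaction.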

I think we need a full bluefs log leading up to the crash so we can find 
out what happened to the file that is missing...

sage



 > 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com] 
> Sent: Wednesday, August 24, 2016 1:43 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Wed, 24 Aug 2016, Somnath Roy wrote:
> > Sage,
> > I got the db assert log from submit_transaction in the following location.
> > 
> > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88
> > a1b28fcc39
> > 
> > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > 
> >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: reusing 
> > log 266 from recycle list
> > 
> >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > 
> > It is trying to rename the wal file it seems and old file is not found. You can see the transaction printed in the log along with error code like this.
> 
> How much of the log do you have? Can you post what you have somewhere?
> 
> Thanks!
> sage
> 
> 
> > 
> >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch = 
> > Put( Prefix = M key = 
> > 0x0000000000001483'.0000000087.00000000000000035303')
> > Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix = O key 
> > = 
> > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000000089ac
> > e7!'0xfffffffffffffffeffffffffffffffff)
> > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B key = 
> > 0x000004e73af72000) Merge( Prefix = T key = 'bluestore_statfs')
> > 
> > Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> > 
> > Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> > 
> > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting the inode number asserts. I didn't hit those asserts as well as no corruption  so far. Seems like bug of async compaction. Will try to reproduce with verbose log that one later.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Tuesday, August 23, 2016 7:46 AM
> > To: 'Sage Weil'
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > I was running my tests for 2 hours and it happened within that time.
> > I will try to reproduce with 1/20.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Tuesday, August 23, 2016 6:46 AM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > Sage,
> > > I think there are some bug introduced recently in the BlueFS and I 
> > > am getting the corruption like this which I was not facing earlier.
> > 
> > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > 
> > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > 
> > Thanks!
> > sage
> > 
> > > 
> > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> > > function 'OSDMapRef OSDService::get_map(epoch_t)' thread 
> > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > osd/OSD.h: 999: FAILED assert(ret)
> > > 
> > >  ceph version 11.0.0-1688-g6f48ee6
> > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x80) [0x5617f2a99e80]
> > >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > >  6: (_start()+0x29) [0x5617f22eb909]
> > > 
> > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Monday, August 22, 2016 3:01 PM
> > > To: 'Sage Weil'
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > Here is the option I am using..
> > > 
> > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > 
> > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > 
> > > 
> > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > os/bluestore/BlueFS.cc: In function 'int 
> > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > 
> > >  ceph version 11.0.0-1688-g3fcc89c
> > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x80) [0x5581ed453cb0]
> > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
> > > unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
> > > rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
> > > long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
> > >  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> > > rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, 
> > > bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&,
> > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > >  6: (()+0x94fd54) [0x5581ed282d54]
> > >  7: 
> > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedT
> > > ab le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&,
> > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> > > [0x5581ed28ba68]
> > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
> > > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > > rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > > std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
> > > rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
> > > [0x5581ed25c458]
> > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > std::char_traits<char>, std::allocator<char> > const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > [0x5581ed1d21d7]
> > >  14: (BlueStore::Collection::get_onode(ghobject_t const&, 
> > > bool)+0x55b) [0x5581ed02802b]
> > >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > std::vector<ObjectStore::Transaction,
> > > std::allocator<ObjectStore::Transaction> >&, 
> > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > [0x5581ed032bc2]
> > >  17: 
> > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transacti
> > > on , std::allocator<ObjectStore::Transaction> >&,
> > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > >  18: 
> > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39)
> > > [0x5581ecef89e9]
> > >  19: 
> > > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb
> > > )
> > > [0x5581ecefeb4b]
> > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > [0x5581eccdd2e9]
> > >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > > const&)+0x52) [0x5581eccdd542]
> > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned 
> > > int)+0x89f) [0x5581ed440b2f]
> > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > [0x5581ed4441f0]
> > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Monday, August 22, 2016 2:57 PM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > FYI, I was running rocksdb by enabling universal style compaction 
> > > > during this time.
> > > 
> > > How are you selecting universal compaction?
> > > 
> > > sage
> > > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Aug. 24, 2016, 9:51 p.m. UTC | #24
Sage,
Thanks for looking, glad that we figured out something :-)
So, you want me to reproduce this with only debug_bluefs = 20/20? No bluestore log needed?
Hope my root partition doesn't get full; this crash happened after 6 hours :-)
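
For my own reference, the plan is just to bump this in ceph.conf on the OSDs 
(a sketch, same option syntax we already use):

  [osd]
      # verbose BlueFS logging: both the normal log level and the in-memory
      # level that gets dumped on crash
      debug bluefs = 20/20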

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Wednesday, August 24, 2016 2:34 PM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Wed, 24 Aug 2016, Somnath Roy wrote:
> Sage, It is there in the following github link I posted earlier..You 
> can see 3 logs there..
> 
> https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88
> a1b28fcc39

Ah sorry, got it.

And looking at the crash and the code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or the rocksdb recycling code.

I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...

sage



 > 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, August 24, 2016 1:43 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Wed, 24 Aug 2016, Somnath Roy wrote:
> > Sage,
> > I got the db assert log from submit_transaction in the following location.
> > 
> > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d
> > 88
> > a1b28fcc39
> > 
> > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > 
> >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: reusing 
> > log 266 from recycle list
> > 
> >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > 
> > It is trying to rename the wal file it seems and old file is not found. You can see the transaction printed in the log along with error code like this.
> 
> How much of the log do you have? Can you post what you have somewhere?
> 
> Thanks!
> sage
> 
> 
> > 
> >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch = 
> > Put( Prefix = M key =
> > 0x0000000000001483'.0000000087.00000000000000035303')
> > Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix = O 
> > key = 
> > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000000089
> > ac
> > e7!'0xfffffffffffffffeffffffffffffffff)
> > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B key =
> > 0x000004e73af72000) Merge( Prefix = T key = 'bluestore_statfs')
> > 
> > Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> > 
> > Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> > 
> > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting the inode number asserts. I didn't hit those asserts as well as no corruption  so far. Seems like bug of async compaction. Will try to reproduce with verbose log that one later.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Tuesday, August 23, 2016 7:46 AM
> > To: 'Sage Weil'
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > I was running my tests for 2 hours and it happened within that time.
> > I will try to reproduce with 1/20.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Tuesday, August 23, 2016 6:46 AM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > Sage,
> > > I think there are some bug introduced recently in the BlueFS and I 
> > > am getting the corruption like this which I was not facing earlier.
> > 
> > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > 
> > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > 
> > Thanks!
> > sage
> > 
> > > 
> > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> > > function 'OSDMapRef OSDService::get_map(epoch_t)' thread
> > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > osd/OSD.h: 999: FAILED assert(ret)
> > > 
> > >  ceph version 11.0.0-1688-g6f48ee6
> > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x80) [0x5617f2a99e80]
> > >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > >  6: (_start()+0x29) [0x5617f22eb909]
> > > 
> > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Monday, August 22, 2016 3:01 PM
> > > To: 'Sage Weil'
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > Here is the option I am using..
> > > 
> > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > 
> > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > 
> > > 
> > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > os/bluestore/BlueFS.cc: In function 'int 
> > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > 
> > >  ceph version 11.0.0-1688-g3fcc89c
> > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x80) [0x5581ed453cb0]
> > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
> > > unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
> > > rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
> > > long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
> > >  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> > > rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > rocksdb::PersistentCacheOptions const&,
> > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > >  6: (()+0x94fd54) [0x5581ed282d54]
> > >  7: 
> > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBase
> > > dT ab le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice 
> > > const&,
> > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> > > [0x5581ed28ba68]
> > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
> > > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > > rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > > std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
> > > rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
> > > [0x5581ed25c458]
> > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > std::char_traits<char>, std::allocator<char> > const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > [0x5581ed1d21d7]
> > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > bool)+0x55b) [0x5581ed02802b]
> > >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > std::vector<ObjectStore::Transaction,
> > > std::allocator<ObjectStore::Transaction> >&, 
> > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > [0x5581ed032bc2]
> > >  17: 
> > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transac
> > > ti on , std::allocator<ObjectStore::Transaction> >&,
> > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > >  18: 
> > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd3
> > > 9)
> > > [0x5581ecef89e9]
> > >  19: 
> > > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2
> > > fb
> > > )
> > > [0x5581ecefeb4b]
> > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > [0x5581eccdd2e9]
> > >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > > const&)+0x52) [0x5581eccdd542]
> > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > int)+0x89f) [0x5581ed440b2f]
> > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > [0x5581ed4441f0]
> > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Monday, August 22, 2016 2:57 PM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > FYI, I was running rocksdb by enabling universal style 
> > > > compaction during this time.
> > > 
> > > How are you selecting universal compaction?
> > > 
> > > sage
> > > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More 
> > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil Aug. 24, 2016, 10:27 p.m. UTC | #25
On Wed, 24 Aug 2016, Somnath Roy wrote:
> Sage,
> Thanks for looking , glad that we figured out something :-)..
> So, you want me to reproduce this with only debug_bluefs = 20/20 ? Don't need bluestore log ?

Yeah, that'd be perfect.

> Hope my root partition doesn't get full , this crash happened after 6 hours :-) , 

Fingers crossed!  Otherwise maybe write it off to NFS or something...

Thanks!
sage

> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com] 
> Sent: Wednesday, August 24, 2016 2:34 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Wed, 24 Aug 2016, Somnath Roy wrote:
> > Sage, It is there in the following github link I posted earlier..You 
> > can see 3 logs there..
> > 
> > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88
> > a1b28fcc39
> 
> Ah sorry, got it.
> 
> And looking at the crash and the code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or the rocksdb recycling code.
> 
> I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> 
> sage
> 
> 
> 
>  > 
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Wednesday, August 24, 2016 1:43 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > Sage,
> > > I got the db assert log from submit_transaction in the following location.
> > > 
> > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d
> > > 88
> > > a1b28fcc39
> > > 
> > > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > > 
> > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: reusing 
> > > log 266 from recycle list
> > > 
> > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > 
> > > It is trying to rename the wal file it seems and old file is not found. You can see the transaction printed in the log along with error code like this.
> > 
> > How much of the log do you have? Can you post what you have somewhere?
> > 
> > Thanks!
> > sage
> > 
> > 
> > > 
> > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch = 
> > > Put( Prefix = M key =
> > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix = O 
> > > key = 
> > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000000089
> > > ac
> > > e7!'0xfffffffffffffffeffffffffffffffff)
> > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B key =
> > > 0x000004e73af72000) Merge( Prefix = T key = 'bluestore_statfs')
> > > 
> > > Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> > > 
> > > Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> > > 
> > > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting the inode number asserts. I didn't hit those asserts as well as no corruption  so far. Seems like bug of async compaction. Will try to reproduce with verbose log that one later.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > To: 'Sage Weil'
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > I was running my tests for 2 hours and it happened within that time.
> > > I will try to reproduce with 1/20.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > Sage,
> > > > I think there are some bug introduced recently in the BlueFS and I 
> > > > am getting the corruption like this which I was not facing earlier.
> > > 
> > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > 
> > > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > > 
> > > Thanks!
> > > sage
> > > 
> > > > 
> > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> > > > function 'OSDMapRef OSDService::get_map(epoch_t)' thread
> > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > 
> > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > > const*)+0x80) [0x5617f2a99e80]
> > > >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > 
> > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > To: 'Sage Weil'
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > Here is the option I am using..
> > > > 
> > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > 
> > > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > > 
> > > > 
> > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > 
> > > >  ceph version 11.0.0-1688-g3fcc89c
> > > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > > const*)+0x80) [0x5581ed453cb0]
> > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
> > > > unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
> > > > rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
> > > > long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
> > > >  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> > > > rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > > rocksdb::PersistentCacheOptions const&,
> > > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > >  7: 
> > > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBase
> > > > dT ab le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice 
> > > > const&,
> > > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> > > > [0x5581ed28ba68]
> > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
> > > > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > > > rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > > > std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
> > > > rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
> > > > [0x5581ed25c458]
> > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > > std::char_traits<char>, std::allocator<char> > const&, 
> > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > > [0x5581ed1d21d7]
> > > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > > bool)+0x55b) [0x5581ed02802b]
> > > >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > > std::vector<ObjectStore::Transaction,
> > > > std::allocator<ObjectStore::Transaction> >&, 
> > > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > > [0x5581ed032bc2]
> > > >  17: 
> > > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transac
> > > > ti on , std::allocator<ObjectStore::Transaction> >&,
> > > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > >  18: 
> > > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd3
> > > > 9)
> > > > [0x5581ecef89e9]
> > > >  19: 
> > > > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2
> > > > fb
> > > > )
> > > > [0x5581ecefeb4b]
> > > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > > [0x5581eccdd2e9]
> > > >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > > > const&)+0x52) [0x5581eccdd542]
> > > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > > int)+0x89f) [0x5581ed440b2f]
> > > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > > [0x5581ed4441f0]
> > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > 
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > compaction during this time.
> > > > 
> > > > How are you selecting universal compaction?
> > > > 
> > > > sage
> > > > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Aug. 25, 2016, 9:34 p.m. UTC | #26
Sage,
Hope you are able to download the log I shared via Google Doc.
It seems the bug is around this portion of the log:

2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log 254 to recycle list

2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log 256 to recycle list

2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.347467) [default] Level-0 commit table #258 started
2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: memtable #1 done
2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: memtable #2 done
2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348297) [default] Level summary: base level 1 max bytes base 5368709120 files[3 4 0 0 0 0 0] max score 0.75

2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.

2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink db.wal/000256.log
2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had refs 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev 0 extents [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev 0 extents [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink db.wal/000254.log
2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had refs 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev 0 extents [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])
2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev 0 extents [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])

So, log 254 is added to the recycle list and, at the same time, it is queued for deletion. It looks like there is a race condition in this portion of the code.

I was going through the rocksdb code and I found the following.

1. DBImpl::FindObsoleteFiles is the function responsible for populating log_recycle_files and log_delete_files. It also deletes entries from alive_log_files_. But this is always done under the mutex_ lock.

2. Logs are deleted from DBImpl::DeleteObsoleteFileImpl, which is *not* under the lock but only iterates over log_delete_files. That is fishy, but it shouldn't be the reason the same log number ends up in both lists.

3. I went through all the places alive_log_files_ is accessed, and the following one (within DBImpl::WriteImpl) is the only access that is not under the lock.

4625       alive_log_files_.back().AddSize(log_entry.size());

Could it be reintroducing the same log number (254)? I am not sure (see the sketch below for the access pattern I mean).
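
To make point 3 concrete, here is a tiny standalone sketch of that access 
pattern -- toy names, nothing rocksdb-specific, and it does not try to 
reproduce the recycle/delete double-booking; it only shows an unlocked 
back()/AddSize() next to locked erasures, which a thread sanitizer reports 
as a data race:

  // Toy illustration only (hypothetical names): cleaner() mutates the deque
  // under the lock, writer() touches back() with no lock, mirroring the
  // pattern quoted above.  Build: g++ -std=c++11 -pthread toy_race.cc
  #include <cstdint>
  #include <deque>
  #include <mutex>
  #include <thread>

  struct LogFile {
    uint64_t number;
    uint64_t size;
    void AddSize(uint64_t n) { size += n; }
  };

  std::deque<LogFile> alive_log_files;  // stand-in for alive_log_files_
  std::mutex mu;                        // stand-in for mutex_

  // Like FindObsoleteFiles: retires old logs and appends new ones, locked.
  void cleaner() {
    for (int i = 0; i < 100000; ++i) {
      std::lock_guard<std::mutex> l(mu);
      if (alive_log_files.size() > 1)
        alive_log_files.pop_front();
      alive_log_files.push_back({alive_log_files.back().number + 1, 0});
    }
  }

  // Like the WriteImpl line quoted above: no lock taken here.
  void writer() {
    for (int i = 0; i < 100000; ++i)
      alive_log_files.back().AddSize(42);  // unsynchronized access
  }

  int main() {
    alive_log_files.push_back({254, 0});
    std::thread t1(cleaner), t2(writer);
    t1.join();
    t2.join();
    return 0;
  }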

In summary, it seems to be a rocksdb bug, and setting recycle_log_file_num = 0 should *bypass* it. I need to check the performance impact of this, though.
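
Concretely, bypassing it just means changing recycle_log_file_num from 16 to 
0 in the bluestore_rocksdb_options string I posted earlier, everything else 
unchanged, e.g.:

        bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=0,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"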

Should I post this to the rocksdb community, or is there some other place where I can get a response from the rocksdb folks?

Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy 
Sent: Wednesday, August 24, 2016 2:52 PM
To: 'Sage Weil'
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

Sage,
Thanks for looking , glad that we figured out something :-)..
So, you want me to reproduce this with only debug_bluefs = 20/20 ? Don't need bluestore log ?
Hope my root partition doesn't get full , this crash happened after 6 hours :-) , 

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Wednesday, August 24, 2016 2:34 PM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Wed, 24 Aug 2016, Somnath Roy wrote:
> Sage, It is there in the following github link I posted earlier..You 
> can see 3 logs there..
> 
> https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88
> a1b28fcc39

Ah sorry, got it.

And looking at the crash and the code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or the rocksdb recycling code.

I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...

sage



 > 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, August 24, 2016 1:43 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Wed, 24 Aug 2016, Somnath Roy wrote:
> > Sage,
> > I got the db assert log from submit_transaction in the following location.
> > 
> > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d
> > 88
> > a1b28fcc39
> > 
> > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > 
> >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: reusing 
> > log 266 from recycle list
> > 
> >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > 
> > It is trying to rename the wal file it seems and old file is not found. You can see the transaction printed in the log along with error code like this.
> 
> How much of the log do you have? Can you post what you have somewhere?
> 
> Thanks!
> sage
> 
> 
> > 
> >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch = 
> > Put( Prefix = M key =
> > 0x0000000000001483'.0000000087.00000000000000035303')
> > Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix = O 
> > key =
> > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000000089
> > ac
> > e7!'0xfffffffffffffffeffffffffffffffff)
> > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B key =
> > 0x000004e73af72000) Merge( Prefix = T key = 'bluestore_statfs')
> > 
> > Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> > 
> > Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> > 
> > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting the inode number asserts. I didn't hit those asserts as well as no corruption  so far. Seems like bug of async compaction. Will try to reproduce with verbose log that one later.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Tuesday, August 23, 2016 7:46 AM
> > To: 'Sage Weil'
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > I was running my tests for 2 hours and it happened within that time.
> > I will try to reproduce with 1/20.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Tuesday, August 23, 2016 6:46 AM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > Sage,
> > > I think there are some bug introduced recently in the BlueFS and I 
> > > am getting the corruption like this which I was not facing earlier.
> > 
> > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > 
> > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > 
> > Thanks!
> > sage
> > 
> > > 
> > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> > > function 'OSDMapRef OSDService::get_map(epoch_t)' thread
> > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > osd/OSD.h: 999: FAILED assert(ret)
> > > 
> > >  ceph version 11.0.0-1688-g6f48ee6
> > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x80) [0x5617f2a99e80]
> > >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > >  6: (_start()+0x29) [0x5617f22eb909]
> > > 
> > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Monday, August 22, 2016 3:01 PM
> > > To: 'Sage Weil'
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > Here is the option I am using..
> > > 
> > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > 
> > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > 
> > > 
> > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > os/bluestore/BlueFS.cc: In function 'int 
> > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > 
> > >  ceph version 11.0.0-1688-g3fcc89c
> > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x80) [0x5581ed453cb0]
> > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
> > > unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
> > > rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
> > > long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
> > >  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> > > rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > rocksdb::PersistentCacheOptions const&,
> > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > >  6: (()+0x94fd54) [0x5581ed282d54]
> > >  7: 
> > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBase
> > > dT ab le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice 
> > > const&,
> > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> > > [0x5581ed28ba68]
> > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
> > > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > > rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > > std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
> > > rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
> > > [0x5581ed25c458]
> > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > std::char_traits<char>, std::allocator<char> > const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > [0x5581ed1d21d7]
> > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > bool)+0x55b) [0x5581ed02802b]
> > >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > std::vector<ObjectStore::Transaction,
> > > std::allocator<ObjectStore::Transaction> >&, 
> > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > [0x5581ed032bc2]
> > >  17: 
> > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transac
> > > ti on , std::allocator<ObjectStore::Transaction> >&,
> > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > >  18: 
> > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd3
> > > 9)
> > > [0x5581ecef89e9]
> > >  19: 
> > > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2
> > > fb
> > > )
> > > [0x5581ecefeb4b]
> > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > [0x5581eccdd2e9]
> > >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > > const&)+0x52) [0x5581eccdd542]
> > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > int)+0x89f) [0x5581ed440b2f]
> > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > [0x5581ed4441f0]
> > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Monday, August 22, 2016 2:57 PM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > FYI, I was running rocksdb by enabling universal style 
> > > > compaction during this time.
> > > 
> > > How are you selecting universal compaction?
> > > 
> > > sage
> > > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More 
> > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Aug. 28, 2016, 2:37 p.m. UTC | #27
Sage,
Some updates on this.

1. The issue is reproduced with the latest rocksdb master as well.

2. The issue is *not happening* if I run with rocksdb log recycling disabled (see the config sketch below). This confirms that our root cause analysis is right.

3. I am running some more performance tests with log recycling disabled, but the initial impression is that it introduces spikes and the output is not as stable as with log recycling enabled.

4. Created a rocksdb issue for this (https://github.com/facebook/rocksdb/issues/1303)
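
For reference, disabling recycling is just a matter of setting recycle_log_file_num=0 in the rocksdb options string. A minimal sketch of the ceph.conf override I am using, assuming the same bluestore_rocksdb_options string quoted further down in this thread (only recycle_log_file_num is changed from 16 to 0):

    [osd]
    # recycle_log_file_num=0 turns off WAL recycling, which avoids the
    # NotFound-on-rename path; everything else is unchanged from the earlier runs
    bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=0,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"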

Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy 
Sent: Thursday, August 25, 2016 2:35 PM
To: 'Sage Weil'
Cc: 'Mark Nelson'; 'ceph-devel'
Subject: RE: Bluestore assert

Sage,
Hope you are able to download the log I shared via google doc.
It seems the bug is around this portion.

2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log 254 to recycle list

2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log 256 to recycle list

2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.347467) [default] Level-0 commit table #258 started
2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: memtable #1 done
2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: memtable #2 done
2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348297) [default] Level summary: base level 1 max bytes base 5368709120 files[3 4 0 0 0 0 0] max score 0.75

2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.

2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink db.wal/000256.log
2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had refs 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev 0 extents [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev 0 extents [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink db.wal/000254.log
2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had refs 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev 0 extents [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])
2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev 0 extents [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])

So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).

I was going through the rocksdb code and I found the following.

1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.

2. Log is deleted from DBImpl::DeleteObsoleteFileImpl which is *not* under lock , but iterating over log_delete_files. This is fishy but it shouldn't be the reason for same log number end up in two list.

3. I saw all the places but the following  place alive_log_files_ (within DBImpl::WriteImpl)  is accessed without lock.

4625       alive_log_files_.back().AddSize(log_entry.size());   

Can it be reintroducing the same log number (254) , I am not sure.

Summary, it seems a rocksb bug and making recycle_log_file_num = 0 should *bypass* that. I need to check the performance impact for this though.

Should I post this to rocksdb community or any other place from where I can get response from rocksdb folks ?

Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy
Sent: Wednesday, August 24, 2016 2:52 PM
To: 'Sage Weil'
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

Sage,
Thanks for looking , glad that we figured out something :-)..
So, you want me to reproduce this with only debug_bluefs = 20/20 ? Don't need bluestore log ?
Hope my root partition doesn't get full , this crash happened after 6 hours :-) , 

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Wednesday, August 24, 2016 2:34 PM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Wed, 24 Aug 2016, Somnath Roy wrote:
> Sage, It is there in the following github link I posted earlier..You 
> can see 3 logs there..
> 
> https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88
> a1b28fcc39

Ah sorry, got it.

And looking at the crash and code the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets and error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or the rocksdb recycling code.

I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...

sage



 > 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, August 24, 2016 1:43 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Wed, 24 Aug 2016, Somnath Roy wrote:
> > Sage,
> > I got the db assert log from submit_transaction in the following location.
> > 
> > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d
> > 88
> > a1b28fcc39
> > 
> > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > 
> >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: reusing 
> > log 266 from recycle list
> > 
> >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > 
> > It is trying to rename the wal file it seems and old file is not found. You can see the transaction printed in the log along with error code like this.
> 
> How much of the log do you have? Can you post what you have somewhere?
> 
> Thanks!
> sage
> 
> 
> > 
> >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch = 
> > Put( Prefix = M key =
> > 0x0000000000001483'.0000000087.00000000000000035303')
> > Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix = O 
> > key =
> > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000000089
> > ac
> > e7!'0xfffffffffffffffeffffffffffffffff)
> > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B key =
> > 0x000004e73af72000) Merge( Prefix = T key = 'bluestore_statfs')
> > 
> > Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> > 
> > Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> > 
> > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting the inode number asserts. I didn't hit those asserts as well as no corruption  so far. Seems like bug of async compaction. Will try to reproduce with verbose log that one later.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Tuesday, August 23, 2016 7:46 AM
> > To: 'Sage Weil'
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > I was running my tests for 2 hours and it happened within that time.
> > I will try to reproduce with 1/20.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Tuesday, August 23, 2016 6:46 AM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > Sage,
> > > I think there are some bug introduced recently in the BlueFS and I 
> > > am getting the corruption like this which I was not facing earlier.
> > 
> > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > 
> > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > 
> > Thanks!
> > sage
> > 
> > > 
> > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> > > function 'OSDMapRef OSDService::get_map(epoch_t)' thread
> > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > osd/OSD.h: 999: FAILED assert(ret)
> > > 
> > >  ceph version 11.0.0-1688-g6f48ee6
> > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x80) [0x5617f2a99e80]
> > >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > >  6: (_start()+0x29) [0x5617f22eb909]
> > > 
> > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Monday, August 22, 2016 3:01 PM
> > > To: 'Sage Weil'
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > Here is the option I am using..
> > > 
> > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > 
> > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > 
> > > 
> > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > os/bluestore/BlueFS.cc: In function 'int 
> > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > 
> > >  ceph version 11.0.0-1688-g3fcc89c
> > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x80) [0x5581ed453cb0]
> > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
> > > unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
> > > rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
> > > long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
> > >  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> > > rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > rocksdb::PersistentCacheOptions const&,
> > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > >  6: (()+0x94fd54) [0x5581ed282d54]
> > >  7: 
> > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBase
> > > dT ab le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice 
> > > const&,
> > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> > > [0x5581ed28ba68]
> > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
> > > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > > rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > > std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
> > > rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
> > > [0x5581ed25c458]
> > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > std::char_traits<char>, std::allocator<char> > const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > [0x5581ed1d21d7]
> > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > bool)+0x55b) [0x5581ed02802b]
> > >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > std::vector<ObjectStore::Transaction,
> > > std::allocator<ObjectStore::Transaction> >&, 
> > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > [0x5581ed032bc2]
> > >  17: 
> > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transac
> > > ti on , std::allocator<ObjectStore::Transaction> >&,
> > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > >  18: 
> > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd3
> > > 9)
> > > [0x5581ecef89e9]
> > >  19: 
> > > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2
> > > fb
> > > )
> > > [0x5581ecefeb4b]
> > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > [0x5581eccdd2e9]
> > >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > > const&)+0x52) [0x5581eccdd542]
> > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > int)+0x89f) [0x5581ed440b2f]
> > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > [0x5581ed4441f0]
> > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Monday, August 22, 2016 2:57 PM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > FYI, I was running rocksdb by enabling universal style 
> > > > compaction during this time.
> > > 
> > > How are you selecting universal compaction?
> > > 
> > > sage
> > > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More 
> > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Aug. 29, 2016, 8:20 a.m. UTC | #28
Sage,
The following assert is resolved too. I was able to reproduce it after adding additional logging around the error.

os/bluestore/BlueFS.cc: 852: FAILED assert(r == 0)

 ceph version 11.0.0-1688-g3fcc89c (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55ba4cca84d0]
 2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0x602) [0x55ba4c9684e2]
 3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x20) [0x55ba4c98bfd0]
 4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x83f) [0x55ba4cb1273f]
 5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::ImmutableCFOptions const&, bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&)+0x5ed) [0x55ba4cae51cd]
 6: (()+0x95d731) [0x55ba4cad4731]
 7: (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::BlockIter*)+0x5f9) [0x55ba4cad6559]
 8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x55d) [0x55ba4cade91d]
 9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, bool, int)+0x158) [0x55ba4caa29a8]
 10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x55f) [0x55ba4cab74ef]
 11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool*)+0x5c2) [0x55ba4ca3abe2]
 12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x22) [0x55ba4ca3ae12]
 13: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x157) [0x55ba4ca1f437]
 14: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x55b) [0x55ba4c87458b]
 15: (BlueStore::getattr(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, char const*, ceph::buffer::ptr&)+0xa7) [0x55ba4c8750b7]
 16: (PGBackend::objects_get_attr(hobject_t const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x10c) [0x55ba4c719cbc]
 17: (()+0x526224) [0x55ba4c69d224]
 18: (ReplicatedPG::find_object_context(hobject_t const&, std::shared_ptr<ObjectContext>*, bool, bool, hobject_t*)+0x35d) [0x55ba4c6ab41d]
 19: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1aba) [0x55ba4c6d89ba]
 20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5fe) [0x55ba4c6990ae]
 21: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) [0x55ba4c529819]
 22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x52) [0x55ba4c529a72]
 23: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x73f) [0x55ba4c54983f]
 24: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) [0x55ba4cc9534f]
 25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55ba4cc98a10]
 26: (Thread::entry_wrapper()+0x75) [0x55ba4cc87de5]
 27: (()+0x76fa) [0x7fca8108c6fa]
 28: (clone()+0x6d) [0x7fca7eeecb5d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I printed the error string and it points to an input/output error, which I verified in syslog; it is happening because one of my SSDs has media corruption.

I will send out a PR that prints a verbose log message before the assert wherever possible.
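
The change I have in mind is essentially of this shape (just a sketch, not the exact diff I will send; the surrounding variable names are guesses based on the _read_random() backtrace above):

      // os/bluestore/BlueFS.cc, inside BlueFS::_read_random() -- sketch only
      int r = bdev[p->bdev]->read_random(p->offset + x_off, l, out, buffered);
      if (r < 0) {
        // log the real failure (here: EIO from the bad SSD) before asserting,
        // instead of dying with a bare "FAILED assert(r == 0)"
        derr << __func__ << " read_random returned " << cpp_strerror(r)
             << " on bdev " << (int)p->bdev << dendl;
      }
      assert(r == 0);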

So, 2 down, one more (the bluefs inode assert) to go :-)

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy 
Sent: Sunday, August 28, 2016 7:37 AM
To: 'Sage Weil'
Cc: 'Mark Nelson'; 'ceph-devel'
Subject: RE: Bluestore assert

Sage,
Some updates on this.

1. The issue is reproduced with the latest rocksdb master as well.

2. The issue is *not happening* if I run disabling rocksdb log recycling. This proves our root cause is right.

3. Running some more performance tests by disabling log recycling , but, initial impression is, it is introducing spikes and output is not as stable as enabling log recycling.

4. Created a rocksdb issue for this (https://github.com/facebook/rocksdb/issues/1303)

Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy
Sent: Thursday, August 25, 2016 2:35 PM
To: 'Sage Weil'
Cc: 'Mark Nelson'; 'ceph-devel'
Subject: RE: Bluestore assert

Sage,
Hope you are able to download the log I shared via google doc.
It seems the bug is around this portion.

2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log 254 to recycle list

2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log 256 to recycle list

2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.347467) [default] Level-0 commit table #258 started
2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: memtable #1 done
2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: memtable #2 done
2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348297) [default] Level summary: base level 1 max bytes base 5368709120 files[3 4 0 0 0 0 0] max score 0.75

2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.

2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink db.wal/000256.log
2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had refs 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev 0 extents [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev 0 extents [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink db.wal/000254.log
2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had refs 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev 0 extents [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])
2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev 0 extents [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])

So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).

I was going through the rocksdb code and I found the following.

1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.

2. Log is deleted from DBImpl::DeleteObsoleteFileImpl which is *not* under lock , but iterating over log_delete_files. This is fishy but it shouldn't be the reason for same log number end up in two list.

3. I saw all the places but the following  place alive_log_files_ (within DBImpl::WriteImpl)  is accessed without lock.

4625       alive_log_files_.back().AddSize(log_entry.size());   

Can it be reintroducing the same log number (254) , I am not sure.

Summary, it seems a rocksb bug and making recycle_log_file_num = 0 should *bypass* that. I need to check the performance impact for this though.

Should I post this to rocksdb community or any other place from where I can get response from rocksdb folks ?

Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy
Sent: Wednesday, August 24, 2016 2:52 PM
To: 'Sage Weil'
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

Sage,
Thanks for looking , glad that we figured out something :-)..
So, you want me to reproduce this with only debug_bluefs = 20/20 ? Don't need bluestore log ?
Hope my root partition doesn't get full , this crash happened after 6 hours :-) , 

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Wednesday, August 24, 2016 2:34 PM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Wed, 24 Aug 2016, Somnath Roy wrote:
> Sage, It is there in the following github link I posted earlier..You 
> can see 3 logs there..
> 
> https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88
> a1b28fcc39

Ah sorry, got it.

And looking at the crash and code the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets and error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or the rocksdb recycling code.

I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...

sage



 > 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, August 24, 2016 1:43 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Wed, 24 Aug 2016, Somnath Roy wrote:
> > Sage,
> > I got the db assert log from submit_transaction in the following location.
> > 
> > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d
> > 88
> > a1b28fcc39
> > 
> > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > 
> >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: reusing 
> > log 266 from recycle list
> > 
> >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > 
> > It is trying to rename the wal file it seems and old file is not found. You can see the transaction printed in the log along with error code like this.
> 
> How much of the log do you have? Can you post what you have somewhere?
> 
> Thanks!
> sage
> 
> 
> > 
> >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch = 
> > Put( Prefix = M key =
> > 0x0000000000001483'.0000000087.00000000000000035303')
> > Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix = O 
> > key =
> > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000000089
> > ac
> > e7!'0xfffffffffffffffeffffffffffffffff)
> > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B key =
> > 0x000004e73af72000) Merge( Prefix = T key = 'bluestore_statfs')
> > 
> > Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> > 
> > Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> > 
> > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting the inode number asserts. I didn't hit those asserts as well as no corruption  so far. Seems like bug of async compaction. Will try to reproduce with verbose log that one later.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Tuesday, August 23, 2016 7:46 AM
> > To: 'Sage Weil'
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > I was running my tests for 2 hours and it happened within that time.
> > I will try to reproduce with 1/20.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Tuesday, August 23, 2016 6:46 AM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > Sage,
> > > I think there are some bug introduced recently in the BlueFS and I 
> > > am getting the corruption like this which I was not facing earlier.
> > 
> > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > 
> > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > 
> > Thanks!
> > sage
> > 
> > > 
> > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> > > function 'OSDMapRef OSDService::get_map(epoch_t)' thread
> > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > osd/OSD.h: 999: FAILED assert(ret)
> > > 
> > >  ceph version 11.0.0-1688-g6f48ee6
> > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x80) [0x5617f2a99e80]
> > >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > >  6: (_start()+0x29) [0x5617f22eb909]
> > > 
> > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Monday, August 22, 2016 3:01 PM
> > > To: 'Sage Weil'
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > Here is the option I am using..
> > > 
> > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > 
> > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > 
> > > 
> > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > os/bluestore/BlueFS.cc: In function 'int 
> > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > 
> > >  ceph version 11.0.0-1688-g3fcc89c
> > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x80) [0x5581ed453cb0]
> > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
> > > unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
> > > rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
> > > long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
> > >  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> > > rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > rocksdb::PersistentCacheOptions const&,
> > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > >  6: (()+0x94fd54) [0x5581ed282d54]
> > >  7: 
> > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBase
> > > dT ab le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice 
> > > const&,
> > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> > > [0x5581ed28ba68]
> > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
> > > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > > rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > > std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
> > > rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
> > > [0x5581ed25c458]
> > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > std::char_traits<char>, std::allocator<char> > const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > [0x5581ed1d21d7]
> > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > bool)+0x55b) [0x5581ed02802b]
> > >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > std::vector<ObjectStore::Transaction,
> > > std::allocator<ObjectStore::Transaction> >&, 
> > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > [0x5581ed032bc2]
> > >  17: 
> > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transac
> > > ti on , std::allocator<ObjectStore::Transaction> >&,
> > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > >  18: 
> > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd3
> > > 9)
> > > [0x5581ecef89e9]
> > >  19: 
> > > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2
> > > fb
> > > )
> > > [0x5581ecefeb4b]
> > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > [0x5581eccdd2e9]
> > >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > > const&)+0x52) [0x5581eccdd542]
> > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > int)+0x89f) [0x5581ed440b2f]
> > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > [0x5581ed4441f0]
> > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Monday, August 22, 2016 2:57 PM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > FYI, I was running rocksdb by enabling universal style 
> > > > compaction during this time.
> > > 
> > > How are you selecting universal compaction?
> > > 
> > > sage
> > > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More 
> > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Aug. 30, 2016, 11:57 p.m. UTC | #29
Sage,
I did some debugging on the rocksdb bug; here are my findings.

1. The log file number is added to log_recycle_files and *not* to log_delete_files by the following if block, which is expected.

https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L854

2. But, it is there in the candidate list in the following loop.

https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L1000


3. This means it is added to full_scan_candidate_files by the following full scan (?)

https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L834

I added some log entries to verify; need to wait 6 hours :-(

4. #3 is probably not unusual, but the check in the following does not seem sufficient to keep the file.

https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L1013

Again, I added some logging to see state.log_number at that time. BTW, the following check seems to be a noop, as state.prev_log_number always comes back 0.

(number == state.prev_log_number)

5. So, the quick solution I am thinking of is to add a check for whether the log is in the recycle list and avoid deleting it in the above code path (see the sketch below) (?).
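
Something along these lines is what I have in mind, inside the kLogFile case of PurgeObsoleteFiles. This is only a sketch against the code pointers above, not a tested patch; whether log_recycle_files is reachable as state.log_recycle_files at that point is an assumption.

      case kLogFile:
        keep = ((number >= state.log_number) ||
                (number == state.prev_log_number) ||
                // skip WALs that FindObsoleteFiles just queued for recycling,
                // so the recycle list and the delete list can no longer race
                // the way 000254.log did in the trace above
                (std::find(state.log_recycle_files.begin(),
                           state.log_recycle_files.end(),
                           number) != state.log_recycle_files.end()));
        break;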

Let me know what you think.

Thanks & Regards
Somnath
-----Original Message-----
From: Somnath Roy 
Sent: Sunday, August 28, 2016 7:37 AM
To: 'Sage Weil'
Cc: 'Mark Nelson'; 'ceph-devel'
Subject: RE: Bluestore assert

Sage,
Some updates on this.

1. The issue is reproduced with the latest rocksdb master as well.

2. The issue is *not happening* if I run disabling rocksdb log recycling. This proves our root cause is right.

3. Running some more performance tests by disabling log recycling , but, initial impression is, it is introducing spikes and output is not as stable as enabling log recycling.

4. Created a rocksdb issue for this (https://github.com/facebook/rocksdb/issues/1303)

Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy
Sent: Thursday, August 25, 2016 2:35 PM
To: 'Sage Weil'
Cc: 'Mark Nelson'; 'ceph-devel'
Subject: RE: Bluestore assert

Sage,
Hope you are able to download the log I shared via google doc.
It seems the bug is around this portion.

2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log 254 to recycle list

2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log 256 to recycle list

2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.347467) [default] Level-0 commit table #258 started
2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: memtable #1 done
2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: memtable #2 done
2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348297) [default] Level summary: base level 1 max bytes base 5368709120 files[3 4 0 0 0 0 0] max score 0.75

2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.

2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink db.wal/000256.log
2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had refs 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev 0 extents [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev 0 extents [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink db.wal/000254.log
2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had refs 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev 0 extents [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])
2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev 0 extents [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])

So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).

I was going through the rocksdb code and I found the following.

1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.

2. Log is deleted from DBImpl::DeleteObsoleteFileImpl which is *not* under lock , but iterating over log_delete_files. This is fishy but it shouldn't be the reason for same log number end up in two list.

3. I saw all the places but the following  place alive_log_files_ (within DBImpl::WriteImpl)  is accessed without lock.

4625       alive_log_files_.back().AddSize(log_entry.size());   

Can it be reintroducing the same log number (254) , I am not sure.

Summary, it seems a rocksb bug and making recycle_log_file_num = 0 should *bypass* that. I need to check the performance impact for this though.

Should I post this to rocksdb community or any other place from where I can get response from rocksdb folks ?

Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy
Sent: Wednesday, August 24, 2016 2:52 PM
To: 'Sage Weil'
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

Sage,
Thanks for looking , glad that we figured out something :-)..
So, you want me to reproduce this with only debug_bluefs = 20/20 ? Don't need bluestore log ?
Hope my root partition doesn't get full , this crash happened after 6 hours :-) , 

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Wednesday, August 24, 2016 2:34 PM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Wed, 24 Aug 2016, Somnath Roy wrote:
> Sage, It is there in the following github link I posted earlier..You 
> can see 3 logs there..
> 
> https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88
> a1b28fcc39

Ah sorry, got it.

And looking at the crash and code the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets and error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or the rocksdb recycling code.

I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...

sage



 > 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, August 24, 2016 1:43 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Wed, 24 Aug 2016, Somnath Roy wrote:
> > Sage,
> > I got the db assert log from submit_transaction in the following location.
> > 
> > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d
> > 88
> > a1b28fcc39
> > 
> > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > 
> >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: reusing 
> > log 266 from recycle list
> > 
> >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > 
> > It is trying to rename the wal file it seems and old file is not found. You can see the transaction printed in the log along with error code like this.
> 
> How much of the log do you have? Can you post what you have somewhere?
> 
> Thanks!
> sage
> 
> 
> > 
> >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch = 
> > Put( Prefix = M key =
> > 0x0000000000001483'.0000000087.00000000000000035303')
> > Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix = O 
> > key =
> > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000000089
> > ac
> > e7!'0xfffffffffffffffeffffffffffffffff)
> > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B key =
> > 0x000004e73af72000) Merge( Prefix = T key = 'bluestore_statfs')
> > 
> > Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> > 
> > Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> > 
> > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting the inode number asserts. I didn't hit those asserts as well as no corruption  so far. Seems like bug of async compaction. Will try to reproduce with verbose log that one later.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Tuesday, August 23, 2016 7:46 AM
> > To: 'Sage Weil'
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > I was running my tests for 2 hours and it happened within that time.
> > I will try to reproduce with 1/20.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Tuesday, August 23, 2016 6:46 AM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > Sage,
> > > I think there are some bug introduced recently in the BlueFS and I 
> > > am getting the corruption like this which I was not facing earlier.
> > 
> > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > 
> > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > 
> > Thanks!
> > sage
> > 
> > > 
> > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> > > function 'OSDMapRef OSDService::get_map(epoch_t)' thread
> > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > osd/OSD.h: 999: FAILED assert(ret)
> > > 
> > >  ceph version 11.0.0-1688-g6f48ee6
> > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x80) [0x5617f2a99e80]
> > >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > >  6: (_start()+0x29) [0x5617f22eb909]
> > > 
> > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Monday, August 22, 2016 3:01 PM
> > > To: 'Sage Weil'
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > Here is the option I am using..
> > > 
> > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > 
> > > Here is another one after 4 hours of 4K RW :-) ... Sorry to bombard you with all these; I am adding more logging for the next time I hit it.
> > > 
> > > 
> > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > os/bluestore/BlueFS.cc: In function 'int 
> > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > 
> > >  ceph version 11.0.0-1688-g3fcc89c
> > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x80) [0x5581ed453cb0]
> > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
> > > unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
> > > rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
> > > long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
> > >  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> > > rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > rocksdb::PersistentCacheOptions const&,
> > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > >  6: (()+0x94fd54) [0x5581ed282d54]
> > >  7: 
> > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBase
> > > dT ab le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice 
> > > const&,
> > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> > > [0x5581ed28ba68]
> > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
> > > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > > rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > > std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
> > > rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
> > > [0x5581ed25c458]
> > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > std::char_traits<char>, std::allocator<char> > const&, 
> > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > [0x5581ed1d21d7]
> > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > bool)+0x55b) [0x5581ed02802b]
> > >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > std::vector<ObjectStore::Transaction,
> > > std::allocator<ObjectStore::Transaction> >&, 
> > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > [0x5581ed032bc2]
> > >  17: 
> > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transac
> > > ti on , std::allocator<ObjectStore::Transaction> >&,
> > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > >  18: 
> > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd3
> > > 9)
> > > [0x5581ecef89e9]
> > >  19: 
> > > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2
> > > fb
> > > )
> > > [0x5581ecefeb4b]
> > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > [0x5581eccdd2e9]
> > >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > > const&)+0x52) [0x5581eccdd542]
> > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > int)+0x89f) [0x5581ed440b2f]
> > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > [0x5581ed4441f0]
> > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Monday, August 22, 2016 2:57 PM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > FYI, I was running rocksdb by enabling universal style 
> > > > compaction during this time.
> > > 
> > > How are you selecting universal compaction?
> > > 
> > > sage
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More 
> > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil Aug. 31, 2016, 1:20 p.m. UTC | #30
On Tue, 30 Aug 2016, Somnath Roy wrote:
> Sage,
> I did some debugging on the rocksdb bug; here are my findings.
> 
> 1. The log file number is added to log_recycle_files and *not* to log_delete_files by the following if block, which is expected.
> 
> https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L854
> 
> 2. But, it is there in the candidate list in the following loop.
> 
> https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L1000
> 
> 
> 3. This means it is added to full_scan_candidate_files by the following, i.e. from a full scan (?)
> 
> https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L834
> 
> Added some log entries to verify; need to wait 6 hours :-(
> 
> 4. Probably #3 is not unusual, but the check in the following does not seem sufficient to keep the file.
> 
> https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L1013
> 
> Again, added some logging to see state.log_number at that time. BTW, the following check is probably a noop, as state.prev_log_number always comes back 0.
> 
> (number == state.prev_log_number)
> 
> 5. So, the quick solution I am thinking of is to add a check for whether the log is in the recycle list and avoid deleting it in that code path (?).

That seems reasonable.  I suggest coding this up and submitting a PR to 
github.com/facebook/rocksdb, and asking in the comment whether there is a 
better solution.

Probably the recycle list should be turned into a set so that the check is 
O(log n)...
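
Something along these lines is what I mean (just a rough sketch, not the
actual rocksdb code; the names PurgeSketch, recycled_logs_, mark_recycled
and safe_to_delete are made up for illustration):

  #include <set>
  #include <cstdint>

  struct PurgeSketch {
    std::set<uint64_t> recycled_logs_;   // WAL numbers queued for reuse

    // Record a WAL as recycled instead of deletable.
    void mark_recycled(uint64_t log_number) {
      recycled_logs_.insert(log_number);
    }

    // Guard for the full-scan candidate walk: never delete a WAL that is
    // still sitting on the recycle list.  With a set this is O(log n).
    bool safe_to_delete(uint64_t log_number) const {
      return recycled_logs_.count(log_number) == 0;
    }
  };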

Thanks!
sage


> 
> Let me know what you think.
> 
> Thanks & Regards
> Somnath
> -----Original Message-----
> From: Somnath Roy 
> Sent: Sunday, August 28, 2016 7:37 AM
> To: 'Sage Weil'
> Cc: 'Mark Nelson'; 'ceph-devel'
> Subject: RE: Bluestore assert
> 
> Sage,
> Some updates on this.
> 
> 1. The issue is reproduced with the latest rocksdb master as well.
> 
> 2. The issue is *not happening* if I run with rocksdb log recycling disabled. This proves our root cause is right.
> 
> 3. Running some more performance tests with log recycling disabled, but the initial impression is that it introduces spikes and the output is not as stable as with log recycling enabled.
> 
> 4. Created a rocksdb issue for this (https://github.com/facebook/rocksdb/issues/1303)
> 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Somnath Roy
> Sent: Thursday, August 25, 2016 2:35 PM
> To: 'Sage Weil'
> Cc: 'Mark Nelson'; 'ceph-devel'
> Subject: RE: Bluestore assert
> 
> Sage,
> Hope you are able to download the log I shared via google doc.
> It seems the bug is around this portion.
> 
> 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log 254 to recycle list
> 
> 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log 256 to recycle list
> 
> 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.347467) [default] Level-0 commit table #258 started
> 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: memtable #1 done
> 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: memtable #2 done
> 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348297) [default] Level summary: base level 1 max bytes base 5368709120 files[3 4 0 0 0 0 0] max score 0.75
> 
> 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> 
> 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink db.wal/000256.log
> 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had refs 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev 0 extents [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev 0 extents [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink db.wal/000254.log
> 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had refs 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev 0 extents [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])
> 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev 0 extents [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])
> 
> So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).
> 
> I was going through the rocksdb code and I found the following.
> 
> 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> 
> 2. The log is deleted from DBImpl::DeleteObsoleteFileImpl, which is *not* under the lock but iterates over log_delete_files. This is fishy, but it shouldn't be the reason for the same log number ending up in two lists.
> 
> 3. I looked at all the places, but in the following place alive_log_files_ (within DBImpl::WriteImpl) is accessed without the lock.
> 
> 4625       alive_log_files_.back().AddSize(log_entry.size());   
> 
> Could it be reintroducing the same log number (254)? I am not sure.
> 
> In summary, it seems to be a rocksdb bug, and setting recycle_log_file_num = 0 should *bypass* it. I need to check the performance impact of this, though.
> 
> Should I post this to the rocksdb community, or is there some other place where I can get a response from rocksdb folks?
> 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Somnath Roy
> Sent: Wednesday, August 24, 2016 2:52 PM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> Sage,
> Thanks for looking; glad that we figured out something :-)..
> So, you want me to reproduce this with only debug_bluefs = 20/20? No need for the bluestore log?
> Hope my root partition doesn't get full; this crash happened after 6 hours :-) , 
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, August 24, 2016 2:34 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Wed, 24 Aug 2016, Somnath Roy wrote:
> > Sage, it is there in the following github link I posted earlier. You 
> > can see 3 logs there.
> > 
> > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88
> > a1b28fcc39
> 
> Ah sorry, got it.
> 
> And looking at the crash and code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or in the rocksdb recycling code.
> 
> I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> 
> sage
> 
> 
> 
>  > 
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Wednesday, August 24, 2016 1:43 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > Sage,
> > > I got the db assert log from submit_transaction in the following location.
> > > 
> > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d
> > > 88
> > > a1b28fcc39
> > > 
> > > This is the log with level 1/20 and with my hook for printing the rocksdb::WriteBatch transaction. I have uploaded 3 OSD logs, and a common pattern before the crash is the following.
> > > 
> > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: reusing 
> > > log 266 from recycle list
> > > 
> > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > 
> > > It seems it is trying to rename the WAL file and the old file is not found. You can see the transaction printed in the log along with the error code, like this.
> > 
> > How much of the log do you have? Can you post what you have somewhere?
> > 
> > Thanks!
> > sage
> > 
> > 
> > > 
> > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch = 
> > > Put( Prefix = M key =
> > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix = O 
> > > key =
> > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000000089
> > > ac
> > > e7!'0xfffffffffffffffeffffffffffffffff)
> > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B key =
> > > 0x000004e73af72000) Merge( Prefix = T key = 'bluestore_statfs')
> > > 
> > > Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> > > 
> > > Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> > > 
> > > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting the inode number asserts. I didn't hit those asserts as well as no corruption  so far. Seems like bug of async compaction. Will try to reproduce with verbose log that one later.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > To: 'Sage Weil'
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > I was running my tests for 2 hours and it happened within that time.
> > > I will try to reproduce with 1/20.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > Sage,
> > > > I think there are some bug introduced recently in the BlueFS and I 
> > > > am getting the corruption like this which I was not facing earlier.
> > > 
> > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > 
> > > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > > 
> > > Thanks!
> > > sage
> > > 
> > > > 
> > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> > > > function 'OSDMapRef OSDService::get_map(epoch_t)' thread
> > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > 
> > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > > const*)+0x80) [0x5617f2a99e80]
> > > >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > 
> > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > To: 'Sage Weil'
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > Here is the option I am using..
> > > > 
> > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > 
> > > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > > 
> > > > 
> > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > 
> > > >  ceph version 11.0.0-1688-g3fcc89c
> > > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > > const*)+0x80) [0x5581ed453cb0]
> > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
> > > > unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
> > > > rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned 
> > > > long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
> > > >  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> > > > rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > > rocksdb::PersistentCacheOptions const&,
> > > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > >  7: 
> > > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBase
> > > > dT ab le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice 
> > > > const&,
> > > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> > > > [0x5581ed28ba68]
> > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
> > > > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > > > rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > > > std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, 
> > > > rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) 
> > > > [0x5581ed25c458]
> > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > > std::char_traits<char>, std::allocator<char> > const&, 
> > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > > [0x5581ed1d21d7]
> > > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > > bool)+0x55b) [0x5581ed02802b]
> > > >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > > std::vector<ObjectStore::Transaction,
> > > > std::allocator<ObjectStore::Transaction> >&, 
> > > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > > [0x5581ed032bc2]
> > > >  17: 
> > > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transac
> > > > ti on , std::allocator<ObjectStore::Transaction> >&,
> > > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > >  18: 
> > > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd3
> > > > 9)
> > > > [0x5581ecef89e9]
> > > >  19: 
> > > > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2
> > > > fb
> > > > )
> > > > [0x5581ecefeb4b]
> > > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > > [0x5581eccdd2e9]
> > > >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > > > const&)+0x52) [0x5581eccdd542]
> > > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > > int)+0x89f) [0x5581ed440b2f]
> > > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > > [0x5581ed4441f0]
> > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > 
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > compaction during this time.
> > > > 
> > > > How are you selecting universal compaction?
> > > > 
> > > > sage
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Sept. 1, 2016, 10:59 p.m. UTC | #31
Sage,
Created the following pull request on the rocksdb repo; please take a look.

https://github.com/facebook/rocksdb/pull/1313

The fix is working fine for me.
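
(For anyone who wants to try this before the PR lands: based on the earlier
findings, setting recycle_log_file_num=0 should bypass the problem by
disabling WAL recycling, at some cost in performance stability. An
illustrative ceph.conf snippet -- keep whatever other rocksdb options you
already use:

  bluestore_rocksdb_options = "recycle_log_file_num=0,max_write_buffer_number=16,min_write_buffer_number_to_merge=2"
)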

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Wednesday, August 31, 2016 6:20 AM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Tue, 30 Aug 2016, Somnath Roy wrote:
> Sage,
> I did some debugging on the rocksdb bug., here is my findings.
> 
> 1. The log file number is added to log_recycle_files and *not* in log_delete_files from the following if loop, which is expected.
> 
> https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f46060
> 9bf7cea4b63/db/db_impl.cc#L854
> 
> 2. But, it is there in the candidate list in the following loop.
> 
> https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f46060
> 9bf7cea4b63/db/db_impl.cc#L1000
> 
> 
> 3. This means it is added in full_scan_candidate_files from the 
> following  from a full scan (?)
> 
> https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f46060
> 9bf7cea4b63/db/db_impl.cc#L834
> 
> Added some log entries to verify , need to wait 6 hours :-(
> 
> 4. Probably, #3 is not unusual , but the check in the following seems not sufficient to keep the file.
> 
> https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f46060
> 9bf7cea4b63/db/db_impl.cc#L1013
> 
> Again, added some log to see the state.log_number during that time. BTW, I have seen the following check probably noop as state.prev_log_number is always coming 0.
> 
> (number == state.prev_log_number)
> 
> 5. So, the quick solution I am thinking is to put a check to see if the log is in recycle list and avoid deleting from the above part (?).

That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and ask in the comment if there is a better solution.

Probably the recycle list should be turned into a set so that the check is O(log n)...

Thanks!
sage


> 
> Let me know what you think.
> 
> Thanks & Regards
> Somnath
> -----Original Message-----
> From: Somnath Roy
> Sent: Sunday, August 28, 2016 7:37 AM
> To: 'Sage Weil'
> Cc: 'Mark Nelson'; 'ceph-devel'
> Subject: RE: Bluestore assert
> 
> Sage,
> Some updates on this.
> 
> 1. The issue is reproduced with the latest rocksdb master as well.
> 
> 2. The issue is *not happening* if I run disabling rocksdb log recycling. This proves our root cause is right.
> 
> 3. Running some more performance tests by disabling log recycling , but, initial impression is, it is introducing spikes and output is not as stable as enabling log recycling.
> 
> 4. Created a rocksdb issue for this 
> (https://github.com/facebook/rocksdb/issues/1303)
> 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Somnath Roy
> Sent: Thursday, August 25, 2016 2:35 PM
> To: 'Sage Weil'
> Cc: 'Mark Nelson'; 'ceph-devel'
> Subject: RE: Bluestore assert
> 
> Sage,
> Hope you are able to download the log I shared via google doc.
> It seems the bug is around this portion.
> 
> 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log 254 to 
> recycle list
> 
> 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log 256 to 
> recycle list
> 
> 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original Log Time 
> 2016/08/25-00:44:03.347467) [default] Level-0 commit table #258 
> started
> 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original Log Time 
> 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: 
> memtable #1 done
> 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original Log Time 
> 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: 
> memtable #2 done
> 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original Log Time 
> 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 
> 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": 
> [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original Log Time 
> 2016/08/25-00:44:03.348297) [default] Level summary: base level 1 max 
> bytes base 5368709120 files[3 4 0 0 0 0 0] max score 0.75
> 
> 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> 
> 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink 
> db.wal/000256.log
> 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had refs 
> 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev 
> 0 extents 
> [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+
> 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x1
> 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link 
> destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 
> bdev 0 extents 
> [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+
> 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x1
> 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink 
> db.wal/000254.log
> 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had refs 
> 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev 
> 0 extents 
> [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+
> 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb4
> 00000+800000,0:0xc000000+500000])
> 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link 
> destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 
> bdev 0 extents 
> [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+
> 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb4
> 00000+800000,0:0xc000000+500000])
> 
> So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).
> 
> I was going through the rocksdb code and I found the following.
> 
> 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> 
> 2. Log is deleted from DBImpl::DeleteObsoleteFileImpl which is *not* under lock , but iterating over log_delete_files. This is fishy but it shouldn't be the reason for same log number end up in two list.
> 
> 3. I saw all the places but the following  place alive_log_files_ (within DBImpl::WriteImpl)  is accessed without lock.
> 
> 4625       alive_log_files_.back().AddSize(log_entry.size());   
> 
> Can it be reintroducing the same log number (254) , I am not sure.
> 
> Summary, it seems a rocksb bug and making recycle_log_file_num = 0 should *bypass* that. I need to check the performance impact for this though.
> 
> Should I post this to rocksdb community or any other place from where I can get response from rocksdb folks ?
> 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Somnath Roy
> Sent: Wednesday, August 24, 2016 2:52 PM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> Sage,
> Thanks for looking , glad that we figured out something :-)..
> So, you want me to reproduce this with only debug_bluefs = 20/20 ? Don't need bluestore log ?
> Hope my root partition doesn't get full , this crash happened after 6 
> hours :-) ,
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, August 24, 2016 2:34 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Wed, 24 Aug 2016, Somnath Roy wrote:
> > Sage, It is there in the following github link I posted earlier..You 
> > can see 3 logs there..
> > 
> > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d
> > 88
> > a1b28fcc39
> 
> Ah sorry, got it.
> 
> And looking at the crash and code the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets and error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or the rocksdb recycling code.
> 
> I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> 
> sage
> 
> 
> 
>  >
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Wednesday, August 24, 2016 1:43 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > Sage,
> > > I got the db assert log from submit_transaction in the following location.
> > > 
> > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c68
> > > 7d
> > > 88
> > > a1b28fcc39
> > > 
> > > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > > 
> > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: 
> > > reusing log 266 from recycle list
> > > 
> > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > 
> > > It is trying to rename the wal file it seems and old file is not found. You can see the transaction printed in the log along with error code like this.
> > 
> > How much of the log do you have? Can you post what you have somewhere?
> > 
> > Thanks!
> > sage
> > 
> > 
> > > 
> > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch 
> > > = Put( Prefix = M key =
> > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix = O 
> > > key =
> > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.0000000000
> > > 89
> > > ac
> > > e7!'0xfffffffffffffffeffffffffffffffff)
> > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B key =
> > > 0x000004e73af72000) Merge( Prefix = T key = 'bluestore_statfs')
> > > 
> > > Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> > > 
> > > Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> > > 
> > > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting the inode number asserts. I didn't hit those asserts as well as no corruption  so far. Seems like bug of async compaction. Will try to reproduce with verbose log that one later.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > To: 'Sage Weil'
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > I was running my tests for 2 hours and it happened within that time.
> > > I will try to reproduce with 1/20.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > Sage,
> > > > I think there are some bug introduced recently in the BlueFS and 
> > > > I am getting the corruption like this which I was not facing earlier.
> > > 
> > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > 
> > > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > > 
> > > Thanks!
> > > sage
> > > 
> > > > 
> > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> > > > function 'OSDMapRef OSDService::get_map(epoch_t)' thread
> > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > 
> > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > char
> > > > const*)+0x80) [0x5617f2a99e80]
> > > >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > 
> > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > To: 'Sage Weil'
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > Here is the option I am using..
> > > > 
> > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > 
> > > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > > 
> > > > 
> > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > 
> > > >  ceph version 11.0.0-1688-g3fcc89c
> > > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > char
> > > > const*)+0x80) [0x5581ed453cb0]
> > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
> > > > unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned 
> > > > long, rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, 
> > > > unsigned long, rocksdb::Slice*, char*) const+0x83f) 
> > > > [0x5581ed2c6f4f]
> > > >  5: 
> > > > (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> > > > rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > > rocksdb::PersistentCacheOptions const&,
> > > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > >  7: 
> > > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBa
> > > > se dT ab le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice 
> > > > const&,
> > > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> > > > [0x5581ed28ba68]
> > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
> > > > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > > > rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > > > std::char_traits<char>, std::allocator<char> >*, 
> > > > rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, unsigned 
> > > > long*)+0x4f8) [0x5581ed25c458]
> > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > > std::char_traits<char>, std::allocator<char> > const&, 
> > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > > [0x5581ed1d21d7]
> > > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > > bool)+0x55b) [0x5581ed02802b]
> > > >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > > std::vector<ObjectStore::Transaction,
> > > > std::allocator<ObjectStore::Transaction> >&, 
> > > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > > [0x5581ed032bc2]
> > > >  17: 
> > > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Trans
> > > > ac ti on , std::allocator<ObjectStore::Transaction> >&,
> > > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > >  18: 
> > > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0x
> > > > d3
> > > > 9)
> > > > [0x5581ecef89e9]
> > > >  19: 
> > > > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0
> > > > x2
> > > > fb
> > > > )
> > > > [0x5581ecefeb4b]
> > > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > > [0x5581eccdd2e9]
> > > >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > > > const&)+0x52) [0x5581eccdd542]
> > > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > > int)+0x89f) [0x5581ed440b2f]
> > > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > > [0x5581ed4441f0]
> > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > 
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > compaction during this time.
> > > > 
> > > > How are you selecting universal compaction?
> > > > 
> > > > sage
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More 
> > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Sept. 2, 2016, 4:11 p.m. UTC | #32
Sage,
Tried to do some analysis on the inode assert; the following looks suspicious.

   -10> 2016-08-31 17:55:56.921075 7faf14fff700 10 bluefs _fsync 0x7faf12075140 file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d800000+10e00000])
    -9> 2016-08-31 17:55:56.921117 7faf14fff700 10 bluefs _flush 0x7faf12075140 no dirty data on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d800000+10e00000])
    -8> 2016-08-31 17:55:56.921137 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140
    -7> 2016-08-31 17:55:56.931541 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140 done in 0.010402
    -6> 2016-08-31 17:55:56.931551 7faf14fff700 20 bluefs _fsync file metadata was dirty (31278) on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d800000+10e00000]), flushing log

The above looks good, it is about to call _flush_and_sync_log() after this.

    -5> 2016-08-31 17:55:56.931588 7faf14fff700 10 bluefs _flush_and_sync_log txn(seq 31278 len 0x50e146 crc 0x58a6b9ab)
    -4> 2016-08-31 17:55:56.933951 7faf14fff700 10 bluefs _pad_bl padding with 0xe94 zeros
    -3> 2016-08-31 17:55:56.934079 7faf14fff700 20 bluefs flush_bdev
    -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
    -1> 2016-08-31 17:55:56.934274 7faf14fff700 10 bluefs _flush_range 0x7faf2e824140 pos 0xce9000 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
     0> 2016-08-31 17:55:56.939745 7faf14fff700 -1 os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7faf14fff700 time 2016-08-31 17:55:56.934282
os/bluestore/BlueFS.cc: 1390: FAILED assert(h->file->fnode.ino != 1)

 ceph version 11.0.0-1946-g9a5cfe2 (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x56073c27c7d0]
 2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1d69) [0x56073bf4e109]
 3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x56073bf4e2d7]
 4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x443) [0x56073bf4fe13]
 5: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x35b) [0x56073bf5140b]
 6: (BlueRocksWritableFile::Sync()+0x62) [0x56073bf68be2]
 7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1) [0x56073c0f24b1]
 8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0) [0x56073c0f3960]
 9: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6) [0x56073c1354c6]
 10: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x14ea) [0x56073c137c8a]
 11: (rocksdb::CompactionJob::Run()+0x479) [0x56073c138c09]
 12: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) [0x56073c0275d0]
 13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) [0x56073c03443f]
 14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) [0x56073c0eb039]
 15: (()+0x9900d3) [0x56073c0eb0d3]
 16: (()+0x76fa) [0x7faf3d1106fa]
 17: (clone()+0x6d) [0x7faf3af70b5d]


Now, as you can see, it is calling _flush() with inode 1. Why? Is this expected?

Questions:
------------

1. Why are we using the existing log_writer to do the runway check?

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1280

Shouldn't the log_writer need to be reinitialized with the FileWriter that rocksdb sent with the sync call?

2. The runway check does not consider the request length, so why is it not expecting to allocate here (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1388)?
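
To make question 2 concrete, here is a minimal sketch of the kind of check I mean (the names and the pending_bytes parameter are illustrative, not the actual BlueFS code):

    #include <cstdint>

    // Illustrative only: "runway" = space left in the extents already
    // allocated to the bluefs journal (ino 1) before more must be allocated.
    bool need_more_log_space(uint64_t allocated_bytes,  // sum of the ino 1 extents
                             uint64_t log_writer_pos,   // current append offset
                             uint64_t min_runway,       // e.g. bluefs_min_log_runway
                             uint64_t pending_bytes)    // what is about to be appended
    {
      int64_t runway = (int64_t)allocated_bytes - (int64_t)log_writer_pos;
      // Question 2 above: should pending_bytes be part of this comparison?
      return runway < (int64_t)(min_runway + pending_bytes);
    }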

If the snippet is not sufficient, let me know whether you want me to upload the level 10 log or need a 20/20 log to proceed further.

Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy 
Sent: Thursday, September 01, 2016 3:59 PM
To: 'Sage Weil'
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

Sage,
Created the following pull request on rocksdb repo, please take a look.

https://github.com/facebook/rocksdb/pull/1313

The fix is working fine for me.

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Wednesday, August 31, 2016 6:20 AM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Tue, 30 Aug 2016, Somnath Roy wrote:
> Sage,
> I did some debugging on the rocksdb bug; here are my findings.
> 
> 1. The log file number is added to log_recycle_files and *not* in log_delete_files from the following if loop, which is expected.
> 
> https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f46060
> 9bf7cea4b63/db/db_impl.cc#L854
> 
> 2. But, it is there in the candidate list in the following loop.
> 
> https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f46060
> 9bf7cea4b63/db/db_impl.cc#L1000
> 
> 
> 3. This means it is added in full_scan_candidate_files from the 
> following  from a full scan (?)
> 
> https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f46060
> 9bf7cea4b63/db/db_impl.cc#L834
> 
> Added some log entries to verify , need to wait 6 hours :-(
> 
> 4. Probably, #3 is not unusual , but the check in the following seems not sufficient to keep the file.
> 
> https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f46060
> 9bf7cea4b63/db/db_impl.cc#L1013
> 
> Again, added some log to see the state.log_number during that time. BTW, I have seen the following check probably noop as state.prev_log_number is always coming 0.
> 
> (number == state.prev_log_number)
> 
> 5. So, the quick solution I am thinking is to put a check to see if the log is in recycle list and avoid deleting from the above part (?).

That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and ask in the comment if there is a better solution.

Probably the recycle list should be turned into a set so that the check is O(log n)...
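
A minimal sketch of that shape (illustrative only; it assumes the recycle list becomes a std::set keyed by log number, which is not how the current rocksdb code stores it):

    #include <cstdint>
    #include <set>

    std::set<uint64_t> log_recycle_files_;   // instead of the existing list

    // In the deletion path: keep any log that is queued for recycling.
    bool safe_to_delete_log(uint64_t number) {
      return log_recycle_files_.count(number) == 0;   // O(log n) membership test
    }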

Thanks!
sage


> 
> Let me know what you think.
> 
> Thanks & Regards
> Somnath
> -----Original Message-----
> From: Somnath Roy
> Sent: Sunday, August 28, 2016 7:37 AM
> To: 'Sage Weil'
> Cc: 'Mark Nelson'; 'ceph-devel'
> Subject: RE: Bluestore assert
> 
> Sage,
> Some updates on this.
> 
> 1. The issue is reproduced with the latest rocksdb master as well.
> 
> 2. The issue is *not happening* if I run with rocksdb log recycling disabled. This proves our root cause is right.
> 
> 3. Running some more performance tests by disabling log recycling , but, initial impression is, it is introducing spikes and output is not as stable as enabling log recycling.
> 
> 4. Created a rocksdb issue for this
> (https://github.com/facebook/rocksdb/issues/1303)
> 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Somnath Roy
> Sent: Thursday, August 25, 2016 2:35 PM
> To: 'Sage Weil'
> Cc: 'Mark Nelson'; 'ceph-devel'
> Subject: RE: Bluestore assert
> 
> Sage,
> Hope you are able to download the log I shared via google doc.
> It seems the bug is around this portion.
> 
> 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log 254 to 
> recycle list
> 
> 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log 256 to 
> recycle list
> 
> 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original Log Time
> 2016/08/25-00:44:03.347467) [default] Level-0 commit table #258 
> started
> 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original Log Time
> 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: 
> memtable #1 done
> 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original Log Time
> 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: 
> memtable #2 done
> 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original Log Time
> 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 
> 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": 
> [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original Log Time
> 2016/08/25-00:44:03.348297) [default] Level summary: base level 1 max 
> bytes base 5368709120 files[3 4 0 0 0 0 0] max score 0.75
> 
> 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> 
> 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink 
> db.wal/000256.log
> 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had refs
> 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev
> 0 extents
> [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+
> 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x1
> 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link 
> destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 
> bdev 0 extents 
> [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+
> 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x1
> 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink 
> db.wal/000254.log
> 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had refs
> 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev
> 0 extents
> [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+
> 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb4
> 00000+800000,0:0xc000000+500000])
> 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link 
> destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 
> bdev 0 extents 
> [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+
> 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb4
> 00000+800000,0:0xc000000+500000])
> 
> So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).
> 
> I was going through the rocksdb code and I found the following.
> 
> 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> 
> 2. Log is deleted from DBImpl::DeleteObsoleteFileImpl which is *not* under lock, but iterating over log_delete_files. This is fishy, but it shouldn't be the reason for the same log number ending up in two lists.
> 
> 3. I saw all the places but the following  place alive_log_files_ (within DBImpl::WriteImpl)  is accessed without lock.
> 
> 4625       alive_log_files_.back().AddSize(log_entry.size());   
> 
> Can it be reintroducing the same log number (254) , I am not sure.
> 
> Summary: it seems to be a rocksdb bug, and making recycle_log_file_num = 0 should *bypass* it. I need to check the performance impact of this though.
> 
> Should I post this to rocksdb community or any other place from where I can get response from rocksdb folks ?
> 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Somnath Roy
> Sent: Wednesday, August 24, 2016 2:52 PM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> Sage,
> Thanks for looking , glad that we figured out something :-)..
> So, you want me to reproduce this with only debug_bluefs = 20/20 ? Don't need bluestore log ?
> Hope my root partition doesn't get full , this crash happened after 6 
> hours :-) ,
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, August 24, 2016 2:34 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Wed, 24 Aug 2016, Somnath Roy wrote:
> > Sage, It is there in the following github link I posted earlier..You 
> > can see 3 logs there..
> > 
> > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d
> > 88
> > a1b28fcc39
> 
> Ah sorry, got it.
> 
> And looking at the crash and code the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or the rocksdb recycling code.
> 
> I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> 
> sage
> 
> 
> 
>  >
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Wednesday, August 24, 2016 1:43 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > Sage,
> > > I got the db assert log from submit_transaction in the following location.
> > > 
> > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c68
> > > 7d
> > > 88
> > > a1b28fcc39
> > > 
> > > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > > 
> > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: 
> > > reusing log 266 from recycle list
> > > 
> > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > 
> > > It is trying to rename the wal file it seems and old file is not found. You can see the transaction printed in the log along with error code like this.
> > 
> > How much of the log do you have? Can you post what you have somewhere?
> > 
> > Thanks!
> > sage
> > 
> > 
> > > 
> > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch 
> > > = Put( Prefix = M key =
> > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix = O 
> > > key =
> > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.0000000000
> > > 89
> > > ac
> > > e7!'0xfffffffffffffffeffffffffffffffff)
> > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B key =
> > > 0x000004e73af72000) Merge( Prefix = T key = 'bluestore_statfs')
> > > 
> > > Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> > > 
> > > Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> > > 
> > > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting out the inode number asserts. I didn't hit those asserts, and no corruption so far. Seems like a bug in the async compaction. Will try to reproduce that one with a verbose log later.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > To: 'Sage Weil'
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > I was running my tests for 2 hours and it happened within that time.
> > > I will try to reproduce with 1/20.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > Sage,
> > > > I think there are some bug introduced recently in the BlueFS and 
> > > > I am getting the corruption like this which I was not facing earlier.
> > > 
> > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > 
> > > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > > 
> > > Thanks!
> > > sage
> > > 
> > > > 
> > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> > > > function 'OSDMapRef OSDService::get_map(epoch_t)' thread
> > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > 
> > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > char
> > > > const*)+0x80) [0x5617f2a99e80]
> > > >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > 
> > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > To: 'Sage Weil'
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > Here is the option I am using..
> > > > 
> > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > 
> > > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > > 
> > > > 
> > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > 
> > > >  ceph version 11.0.0-1688-g3fcc89c
> > > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > char
> > > > const*)+0x80) [0x5581ed453cb0]
> > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
> > > > unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned 
> > > > long, rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, 
> > > > unsigned long, rocksdb::Slice*, char*) const+0x83f) 
> > > > [0x5581ed2c6f4f]
> > > >  5: 
> > > > (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> > > > rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > > rocksdb::PersistentCacheOptions const&,
> > > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > >  7: 
> > > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBa
> > > > se dT ab le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice 
> > > > const&,
> > > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> > > > [0x5581ed28ba68]
> > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
> > > > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > > > rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > > > std::char_traits<char>, std::allocator<char> >*, 
> > > > rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, unsigned
> > > > long*)+0x4f8) [0x5581ed25c458]
> > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > > std::char_traits<char>, std::allocator<char> > const&, 
> > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > > [0x5581ed1d21d7]
> > > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > > bool)+0x55b) [0x5581ed02802b]
> > > >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > > std::vector<ObjectStore::Transaction,
> > > > std::allocator<ObjectStore::Transaction> >&, 
> > > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > > [0x5581ed032bc2]
> > > >  17: 
> > > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Trans
> > > > ac ti on , std::allocator<ObjectStore::Transaction> >&,
> > > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > >  18: 
> > > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0x
> > > > d3
> > > > 9)
> > > > [0x5581ecef89e9]
> > > >  19: 
> > > > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0
> > > > x2
> > > > fb
> > > > )
> > > > [0x5581ecefeb4b]
> > > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > > [0x5581eccdd2e9]
> > > >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > > > const&)+0x52) [0x5581eccdd542]
> > > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > > int)+0x89f) [0x5581ed440b2f]
> > > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > > [0x5581ed4441f0]
> > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > 
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > compaction during this time.
> > > > 
> > > > How are you selecting universal compaction?
> > > > 
> > > > sage
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More 
> > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil Sept. 2, 2016, 4:34 p.m. UTC | #33
On Fri, 2 Sep 2016, Somnath Roy wrote:

> Sage,
> Tried to do some analysis on the inode assert, following looks suspicious.
> 
>    -10> 2016-08-31 17:55:56.921075 7faf14fff700 10 bluefs _fsync 0x7faf12075140 file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1:0x75300!
 000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d000000+1000!
 00,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d800000+10e00000])
>     -9> 2016-08-31 17:55:56.921117 7faf14fff700 10 bluefs _flush 0x7faf12075140 no dirty data on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000!
 +100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,!
 1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d800000+10e00000])
>     -8> 2016-08-31 17:55:56.921137 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140
>     -7> 2016-08-31 17:55:56.931541 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140 done in 0.010402
>     -6> 2016-08-31 17:55:56.931551 7faf14fff700 20 bluefs _fsync file metadata was dirty (31278) on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100!
 000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+1000!
 00,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d800000+10e00000]), flushing log
> 
> The above looks good, it is about to call _flush_and_sync_log() after this.

Yes, although this file is huge (0x19c2c265 = 432194149 ~ 400MB)... What 
rocksdb options did you pass in?  I'm guessing this is a log file, but we 
generally want those smallish (maybe 16MB - 64MB, so that L0 SST 
generation isn't too slow).
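
Purely as an illustration (not a recommendation for your setup), an options fragment in that direction would look something like

    write_buffer_size=67108864,min_write_buffer_number_to_merge=1

i.e. roughly one 64MB memtable per flush, so each log file should stay in the same ballpark.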

>     -5> 2016-08-31 17:55:56.931588 7faf14fff700 10 bluefs _flush_and_sync_log txn(seq 31278 len 0x50e146 crc 0x58a6b9ab)
>     -4> 2016-08-31 17:55:56.933951 7faf14fff700 10 bluefs _pad_bl padding with 0xe94 zeros
>     -3> 2016-08-31 17:55:56.934079 7faf14fff700 20 bluefs flush_bdev
>     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
>     -1> 2016-08-31 17:55:56.934274 7faf14fff700 10 bluefs _flush_range 0x7faf2e824140 pos 0xce9000 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
>      0> 2016-08-31 17:55:56.939745 7faf14fff700 -1 os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7faf14fff700 time 2016-08-31 17:55:56.934282
> os/bluestore/BlueFS.cc: 1390: FAILED assert(h->file->fnode.ino != 1)
> 
>  ceph version 11.0.0-1946-g9a5cfe2 (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x56073c27c7d0]
>  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1d69) [0x56073bf4e109]
>  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x56073bf4e2d7]
>  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x443) [0x56073bf4fe13]
>  5: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x35b) [0x56073bf5140b]
>  6: (BlueRocksWritableFile::Sync()+0x62) [0x56073bf68be2]
>  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1) [0x56073c0f24b1]
>  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0) [0x56073c0f3960]
>  9: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6) [0x56073c1354c6]
>  10: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x14ea) [0x56073c137c8a]
>  11: (rocksdb::CompactionJob::Run()+0x479) [0x56073c138c09]
>  12: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) [0x56073c0275d0]
>  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) [0x56073c03443f]
>  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) [0x56073c0eb039]
>  15: (()+0x9900d3) [0x56073c0eb0d3]
>  16: (()+0x76fa) [0x7faf3d1106fa]
>  17: (clone()+0x6d) [0x7faf3af70b5d]
> 
> 
> Now, as you can see, it is calling _flush() with inode 1. Why? Is this expected?

Yes.  The metadata for the log file is dirty (the file size changed), so 
bluefs is flushing its journal (ino 1) to update the fnode.

But this is very concerning:

>     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])

0xce9000~50f000 is a ~5 MB write.  Why would we ever write that much to 
the bluefs metadata journal at once?  That's why it's blowing the runway.  
We have ~17MB allocated (0:0xba00000+500000,0:0xfd00000+c00000), we're at 
offset ~13MB, and we're writing ~5MB.
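
Spelling that arithmetic out (just a sanity check of the numbers above):

    #include <cassert>
    #include <cstdint>

    int main() {
      const uint64_t allocated = 0x500000 + 0xc00000;  // two extents: 5 MiB + 12 MiB = ~17 MiB
      const uint64_t offset    = 0xce9000;             // ~13 MiB already written
      const uint64_t length    = 0x50f000;             // ~5 MiB in this flush
      assert(offset + length > allocated);             // ~18 MiB > ~17 MiB -> off the runway
      return 0;
    }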

> Questions:
> ------------
> 
> 1. Why are we using the existing log_writer to do the runway check?
> 
> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1280
> 
> Shouldn't the log_writer need to be reinitialized with the FileWriter that rocksdb sent with the sync call?

It's the bluefs journal writer.. that's the runway we're worried about.

> 2. The runway check does not consider the request length, so why is it not expecting to allocate here (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1388)?
> 
> If the snippet is not sufficient, let me know whether you want me to upload the level 10 log or need a 20/20 log to proceed further.

The level 10 log is probably enough...

Thanks!
sage

> 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Somnath Roy 
> Sent: Thursday, September 01, 2016 3:59 PM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> Sage,
> Created the following pull request on rocksdb repo, please take a look.
> 
> https://github.com/facebook/rocksdb/pull/1313
> 
> The fix is working fine for me.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, August 31, 2016 6:20 AM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Tue, 30 Aug 2016, Somnath Roy wrote:
> > Sage,
> > I did some debugging on the rocksdb bug., here is my findings.
> > 
> > 1. The log file number is added to log_recycle_files and *not* in log_delete_files from the following if loop, which is expected.
> > 
> > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f46060
> > 9bf7cea4b63/db/db_impl.cc#L854
> > 
> > 2. But, it is there in the candidate list in the following loop.
> > 
> > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f46060
> > 9bf7cea4b63/db/db_impl.cc#L1000
> > 
> > 
> > 3. This means it is added in full_scan_candidate_files from the 
> > following  from a full scan (?)
> > 
> > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f46060
> > 9bf7cea4b63/db/db_impl.cc#L834
> > 
> > Added some log entries to verify , need to wait 6 hours :-(
> > 
> > 4. Probably, #3 is not unusual , but the check in the following seems not sufficient to keep the file.
> > 
> > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f46060
> > 9bf7cea4b63/db/db_impl.cc#L1013
> > 
> > Again, added some log to see the state.log_number during that time. BTW, I have seen the following check probably noop as state.prev_log_number is always coming 0.
> > 
> > (number == state.prev_log_number)
> > 
> > 5. So, the quick solution I am thinking is to put a check to see if the log is in recycle list and avoid deleting from the above part (?).
> 
> That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and ask in the comment if there is a better solution.
> 
> Probably the recycle list should be turned into a set so that the check is O(log n)...
> 
> Thanks!
> sage
> 
> 
> > 
> > Let me know what you think.
> > 
> > Thanks & Regards
> > Somnath
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Sunday, August 28, 2016 7:37 AM
> > To: 'Sage Weil'
> > Cc: 'Mark Nelson'; 'ceph-devel'
> > Subject: RE: Bluestore assert
> > 
> > Sage,
> > Some updates on this.
> > 
> > 1. The issue is reproduced with the latest rocksdb master as well.
> > 
> > 2. The issue is *not happening* if I run disabling rocksdb log recycling. This proves our root cause is right.
> > 
> > 3. Running some more performance tests by disabling log recycling , but, initial impression is, it is introducing spikes and output is not as stable as enabling log recycling.
> > 
> > 4. Created a rocksdb issue for this
> > (https://github.com/facebook/rocksdb/issues/1303)
> > 
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Thursday, August 25, 2016 2:35 PM
> > To: 'Sage Weil'
> > Cc: 'Mark Nelson'; 'ceph-devel'
> > Subject: RE: Bluestore assert
> > 
> > Sage,
> > Hope you are able to download the log I shared via google doc.
> > It seems the bug is around this portion.
> > 
> > 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log 254 to 
> > recycle list
> > 
> > 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log 256 to 
> > recycle list
> > 
> > 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original Log Time
> > 2016/08/25-00:44:03.347467) [default] Level-0 commit table #258 
> > started
> > 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original Log Time
> > 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: 
> > memtable #1 done
> > 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original Log Time
> > 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: 
> > memtable #2 done
> > 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original Log Time
> > 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 
> > 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": 
> > [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> > 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original Log Time
> > 2016/08/25-00:44:03.348297) [default] Level summary: base level 1 max 
> > bytes base 5368709120 files[3 4 0 0 0 0 0] max score 0.75
> > 
> > 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> > 
> > 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink 
> > db.wal/000256.log
> > 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had refs
> > 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev
> > 0 extents
> > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+
> > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x1
> > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link 
> > destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 
> > bdev 0 extents 
> > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+
> > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x1
> > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink 
> > db.wal/000254.log
> > 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had refs
> > 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev
> > 0 extents
> > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+
> > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb4
> > 00000+800000,0:0xc000000+500000])
> > 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link 
> > destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 
> > bdev 0 extents 
> > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+
> > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb4
> > 00000+800000,0:0xc000000+500000])
> > 
> > So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).
> > 
> > I was going through the rocksdb code and I found the following.
> > 
> > 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> > 
> > 2. Log is deleted from DBImpl::DeleteObsoleteFileImpl which is *not* under lock , but iterating over log_delete_files. This is fishy but it shouldn't be the reason for same log number end up in two list.
> > 
> > 3. I saw all the places but the following  place alive_log_files_ (within DBImpl::WriteImpl)  is accessed without lock.
> > 
> > 4625       alive_log_files_.back().AddSize(log_entry.size());   
> > 
> > Can it be reintroducing the same log number (254) , I am not sure.
> > 
> > Summary, it seems a rocksb bug and making recycle_log_file_num = 0 should *bypass* that. I need to check the performance impact for this though.
> > 
> > Should I post this to rocksdb community or any other place from where I can get response from rocksdb folks ?
> > 
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Wednesday, August 24, 2016 2:52 PM
> > To: 'Sage Weil'
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > Sage,
> > Thanks for looking , glad that we figured out something :-)..
> > So, you want me to reproduce this with only debug_bluefs = 20/20 ? Don't need bluestore log ?
> > Hope my root partition doesn't get full , this crash happened after 6 
> > hours :-) ,
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Wednesday, August 24, 2016 2:34 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > Sage, It is there in the following github link I posted earlier..You 
> > > can see 3 logs there..
> > > 
> > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d
> > > 88
> > > a1b28fcc39
> > 
> > Ah sorry, got it.
> > 
> > And looking at the crash and code the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets and error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or the rocksdb recycling code.
> > 
> > I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> > 
> > sage
> > 
> > 
> > 
> >  >
> > > Thanks & Regards
> > > Somnath
> > > 
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Wednesday, August 24, 2016 1:43 PM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > Sage,
> > > > I got the db assert log from submit_transaction in the following location.
> > > > 
> > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c68
> > > > 7d
> > > > 88
> > > > a1b28fcc39
> > > > 
> > > > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > > > 
> > > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: 
> > > > reusing log 266 from recycle list
> > > > 
> > > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > > 
> > > > It is trying to rename the wal file it seems and old file is not found. You can see the transaction printed in the log along with error code like this.
> > > 
> > > How much of the log do you have? Can you post what you have somewhere?
> > > 
> > > Thanks!
> > > sage
> > > 
> > > 
> > > > 
> > > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > > submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch 
> > > > = Put( Prefix = M key =
> > > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > > Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix = O 
> > > > key =
> > > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.0000000000
> > > > 89
> > > > ac
> > > > e7!'0xfffffffffffffffeffffffffffffffff)
> > > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B key =
> > > > 0x000004e73af72000) Merge( Prefix = T key = 'bluestore_statfs')
> > > > 
> > > > Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> > > > 
> > > > Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> > > > 
> > > > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting the inode number asserts. I didn't hit those asserts as well as no corruption  so far. Seems like bug of async compaction. Will try to reproduce with verbose log that one later.
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > > To: 'Sage Weil'
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > I was running my tests for 2 hours and it happened within that time.
> > > > I will try to reproduce with 1/20.
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > Sage,
> > > > > I think there are some bug introduced recently in the BlueFS and 
> > > > > I am getting the corruption like this which I was not facing earlier.
> > > > 
> > > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > > 
> > > > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > > > 
> > > > Thanks!
> > > > sage
> > > > 
> > > > > 
> > > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In 
> > > > > function 'OSDMapRef OSDService::get_map(epoch_t)' thread
> > > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > > 
> > > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > > char
> > > > > const*)+0x80) [0x5617f2a99e80]
> > > > >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> > > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > > 
> > > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > > To: 'Sage Weil'
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > > Here is the option I am using..
> > > > > 
> > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > 
> > > > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > > > 
> > > > > 
> > > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > > 
> > > > >  ceph version 11.0.0-1688-g3fcc89c
> > > > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > > char
> > > > > const*)+0x80) [0x5581ed453cb0]
> > > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
> > > > > unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned 
> > > > > long, rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, 
> > > > > unsigned long, rocksdb::Slice*, char*) const+0x83f) 
> > > > > [0x5581ed2c6f4f]
> > > > >  5: 
> > > > > (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> > > > > rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > > > rocksdb::PersistentCacheOptions const&,
> > > > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > > >  7: 
> > > > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBa
> > > > > se dT ab le::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice 
> > > > > const&,
> > > > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, 
> > > > > rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) 
> > > > > [0x5581ed28ba68]
> > > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > > > rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
> > > > > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > > > > rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > > > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > > > > std::char_traits<char>, std::allocator<char> >*, 
> > > > > rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, unsigned
> > > > > long*)+0x4f8) [0x5581ed25c458]
> > > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > > > std::char_traits<char>, std::allocator<char> > const&, 
> > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > > > [0x5581ed1d21d7]
> > > > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > > > bool)+0x55b) [0x5581ed02802b]
> > > > >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > > > std::vector<ObjectStore::Transaction,
> > > > > std::allocator<ObjectStore::Transaction> >&, 
> > > > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > > > [0x5581ed032bc2]
> > > > >  17: 
> > > > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Trans
> > > > > ac ti on , std::allocator<ObjectStore::Transaction> >&,
> > > > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > > >  18: 
> > > > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0x
> > > > > d3
> > > > > 9)
> > > > > [0x5581ecef89e9]
> > > > >  19: 
> > > > > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0
> > > > > x2
> > > > > fb
> > > > > )
> > > > > [0x5581ecefeb4b]
> > > > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > > > [0x5581eccdd2e9]
> > > > >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > > > > const&)+0x52) [0x5581eccdd542]
> > > > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > > > int)+0x89f) [0x5581ed440b2f]
> > > > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > > > [0x5581ed4441f0]
> > > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > 
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > > To: Somnath Roy
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > > compaction during this time.
> > > > > 
> > > > > How are you selecting universal compaction?
> > > > > 
> > > > > sage
> > > > > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > 
> > > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > 
> > 
> 
>
Somnath Roy Sept. 2, 2016, 5:23 p.m. UTC | #34
Here is my rocksdb option :

        bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"

One discrepancy I can see here is max_bytes_for_level_base; it should be the same as the level 0 size. Initially, I had a bigger min_write_buffer_number_to_merge and that's how I calculated it. Now, the level 0 size is the following:

write_buffer_size * min_write_buffer_number_to_merge * level0_file_num_compaction_trigger = 80MB * 2 * 4 = ~640MB

I should probably adjust max_bytes_for_level_base to a similar value.
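
For instance, keeping everything else the same, that would mean something like this (just the ~640MB figure above expressed in bytes; illustrative value only, not tested):

        max_bytes_for_level_base=671088640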

Please find the level 10 log here. I captured this log during replay (it crashed again) after it originally crashed for the same reason.

https://drive.google.com/file/d/0B7W-S0z_ymMJSnA3R2dyellYZ0U/view?usp=sharing

Thanks for the explanation, I now understand why it is trying to flush inode 1.

But shouldn't we also check the request length during the runway check, rather than relying on bluefs_min_log_runway alone?

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Friday, September 02, 2016 9:35 AM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert



On Fri, 2 Sep 2016, Somnath Roy wrote:

> Sage,
> Tried to do some analysis on the inode assert, following looks suspicious.
> 
>    -10> 2016-08-31 17:55:56.921075 7faf14fff700 10 bluefs _fsync 0x7faf12075140 file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1:0x75300!
 000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d000000+1000!
 00,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d800000+10e00000])
>     -9> 2016-08-31 17:55:56.921117 7faf14fff700 10 bluefs _flush 0x7faf12075140 no dirty data on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000!
 +100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,!
 1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d800000+10e00000])
>     -8> 2016-08-31 17:55:56.921137 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140
>     -7> 2016-08-31 17:55:56.931541 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140 done in 0.010402
>     -6> 2016-08-31 17:55:56.931551 7faf14fff700 20 bluefs _fsync file metadata was dirty (31278) on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100!
 000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+1000!
 00,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d800000+10e00000]), flushing log
> 
> The above looks good, it is about to call _flush_and_sync_log() after this.

Yes, although this file is huge (0x19c2c265 = 432194149 ~ 400MB)... What rocksdb options did you pass in?  I'm guessing this is a log file, but we generally want those smallish (maybe 16MB - 64MB, so that L0 SST generation isn't too slow).

>     -5> 2016-08-31 17:55:56.931588 7faf14fff700 10 bluefs _flush_and_sync_log txn(seq 31278 len 0x50e146 crc 0x58a6b9ab)
>     -4> 2016-08-31 17:55:56.933951 7faf14fff700 10 bluefs _pad_bl padding with 0xe94 zeros
>     -3> 2016-08-31 17:55:56.934079 7faf14fff700 20 bluefs flush_bdev
>     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
>     -1> 2016-08-31 17:55:56.934274 7faf14fff700 10 bluefs _flush_range 0x7faf2e824140 pos 0xce9000 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
>      0> 2016-08-31 17:55:56.939745 7faf14fff700 -1 
> os/bluestore/BlueFS.cc: In function 'int 
> BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 
> 7faf14fff700 time 2016-08-31 17:55:56.934282
> os/bluestore/BlueFS.cc: 1390: FAILED assert(h->file->fnode.ino != 1)
> 
>  ceph version 11.0.0-1946-g9a5cfe2 
> (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x80) [0x56073c27c7d0]
>  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned 
> long)+0x1d69) [0x56073bf4e109]
>  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x56073bf4e2d7]
>  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, 
> unsigned long, unsigned long)+0x443) [0x56073bf4fe13]
>  5: (BlueFS::_fsync(BlueFS::FileWriter*, 
> std::unique_lock<std::mutex>&)+0x35b) [0x56073bf5140b]
>  6: (BlueRocksWritableFile::Sync()+0x62) [0x56073bf68be2]
>  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1) 
> [0x56073c0f24b1]
>  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0) [0x56073c0f3960]
>  9: 
> (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status 
> const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6) 
> [0x56073c1354c6]
>  10: 
> (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::Compaction
> Job::SubcompactionState*)+0x14ea) [0x56073c137c8a]
>  11: (rocksdb::CompactionJob::Run()+0x479) [0x56073c138c09]
>  12: (rocksdb::DBImpl::BackgroundCompaction(bool*, 
> rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) 
> [0x56073c0275d0]
>  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) 
> [0x56073c03443f]
>  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) 
> [0x56073c0eb039]
>  15: (()+0x9900d3) [0x56073c0eb0d3]
>  16: (()+0x76fa) [0x7faf3d1106fa]
>  17: (clone()+0x6d) [0x7faf3af70b5d]
> 
> 
> Now, as you can see it is calling _flush() with inode 1 , why ? is this expected ?

Yes.  The metadata for the log file is dirty (the file size changed), so bluefs is flushing its journal (ino 1) to update the fnode.

But this is very concerning:

>     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 
> 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 
> 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])

0xce9000~50f000 is a ~5 MB write.  Why would we ever write that much to the bluefs metadata journal at once?  That's why it's blowing the runway.  
We have ~17MB allocated (0:0xba00000+500000,0:0xfd00000+c00000), we're at offset ~13MB, and we're writing ~5MB.

> Question :
> ------------
> 
> 1. Why we are using the existing log_write to do a runway check ?
> 
> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1
> 280
> 
> Shouldn't the log_writer needs to be reinitialed with the FileWriter rocksdb sent with sync call ?

It's the bluefs journal writer.. that's the runway we're worried about.

> 2. The runway check is not considering the request length , so, why it 
> is not expecting to allocate here 
> (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L
> 1388)
> 
> If the snippet is not sufficient, let me know if you want me to upload the level 10 log or need 20/20 log to proceed further.

The level 10 log is probably enough...

Thanks!
sage

> 
> Thanks & Regards
> Somnath
> 
> 
> -----Original Message-----
> From: Somnath Roy
> Sent: Thursday, September 01, 2016 3:59 PM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> Sage,
> Created the following pull request on rocksdb repo, please take a look.
> 
> https://github.com/facebook/rocksdb/pull/1313
> 
> The fix is working fine for me.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, August 31, 2016 6:20 AM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Tue, 30 Aug 2016, Somnath Roy wrote:
> > Sage,
> > I did some debugging on the rocksdb bug., here is my findings.
> > 
> > 1. The log file number is added to log_recycle_files and *not* in log_delete_files from the following if loop, which is expected.
> > 
> > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460
> > 60
> > 9bf7cea4b63/db/db_impl.cc#L854
> > 
> > 2. But, it is there in the candidate list in the following loop.
> > 
> > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460
> > 60
> > 9bf7cea4b63/db/db_impl.cc#L1000
> > 
> > 
> > 3. This means it is added in full_scan_candidate_files from the 
> > following  from a full scan (?)
> > 
> > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460
> > 60
> > 9bf7cea4b63/db/db_impl.cc#L834
> > 
> > Added some log entries to verify , need to wait 6 hours :-(
> > 
> > 4. Probably, #3 is not unusual , but the check in the following seems not sufficient to keep the file.
> > 
> > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460
> > 60
> > 9bf7cea4b63/db/db_impl.cc#L1013
> > 
> > Again, added some log to see the state.log_number during that time. BTW, I have seen the following check probably noop as state.prev_log_number is always coming 0.
> > 
> > (number == state.prev_log_number)
> > 
> > 5. So, the quick solution I am thinking is to put a check to see if the log is in recycle list and avoid deleting from the above part (?).
> 
> That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and ask in the comment if there is a better solution.
> 
> Probably the recycle list should be turned into a set so that the check is O(log n)...
> 
> Thanks!
> sage
> 
> 
> > 
> > Let me know what you think.
> > 
> > Thanks & Regards
> > Somnath
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Sunday, August 28, 2016 7:37 AM
> > To: 'Sage Weil'
> > Cc: 'Mark Nelson'; 'ceph-devel'
> > Subject: RE: Bluestore assert
> > 
> > Sage,
> > Some updates on this.
> > 
> > 1. The issue is reproduced with the latest rocksdb master as well.
> > 
> > 2. The issue is *not happening* if I run disabling rocksdb log recycling. This proves our root cause is right.
> > 
> > 3. Running some more performance tests by disabling log recycling , but, initial impression is, it is introducing spikes and output is not as stable as enabling log recycling.
> > 
> > 4. Created a rocksdb issue for this
> > (https://github.com/facebook/rocksdb/issues/1303)
> > 
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Thursday, August 25, 2016 2:35 PM
> > To: 'Sage Weil'
> > Cc: 'Mark Nelson'; 'ceph-devel'
> > Subject: RE: Bluestore assert
> > 
> > Sage,
> > Hope you are able to download the log I shared via google doc.
> > It seems the bug is around this portion.
> > 
> > 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log 254 
> > to recycle list
> > 
> > 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log 256 
> > to recycle list
> > 
> > 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original Log 
> > Time
> > 2016/08/25-00:44:03.347467) [default] Level-0 commit table #258 
> > started
> > 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original Log 
> > Time
> > 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: 
> > memtable #1 done
> > 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original Log 
> > Time
> > 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: 
> > memtable #2 done
> > 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original Log 
> > Time
> > 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 
> > 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": 
> > [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> > 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original Log 
> > Time
> > 2016/08/25-00:44:03.348297) [default] Level summary: base level 1 
> > max bytes base 5368709120 files[3 4 0 0 0 0 0] max score 0.75
> > 
> > 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> > 
> > 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink 
> > db.wal/000256.log
> > 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had 
> > refs
> > 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 
> > bdev
> > 0 extents
> > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe20000
> > 0+
> > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0
> > x1
> > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link 
> > destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 
> > 00:41:26.298423 bdev 0 extents 
> > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe20000
> > 0+
> > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0
> > x1
> > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink 
> > db.wal/000254.log
> > 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had 
> > refs
> > 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 
> > bdev
> > 0 extents
> > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x840000
> > 0+
> > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0x
> > b4
> > 00000+800000,0:0xc000000+500000])
> > 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link 
> > destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25 
> > 00:41:26.299110 bdev 0 extents 
> > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x840000
> > 0+
> > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0x
> > b4
> > 00000+800000,0:0xc000000+500000])
> > 
> > So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).
> > 
> > I was going through the rocksdb code and I found the following.
> > 
> > 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> > 
> > 2. Log is deleted from DBImpl::DeleteObsoleteFileImpl which is *not* under lock , but iterating over log_delete_files. This is fishy but it shouldn't be the reason for same log number end up in two list.
> > 
> > 3. I saw all the places but the following  place alive_log_files_ (within DBImpl::WriteImpl)  is accessed without lock.
> > 
> > 4625       alive_log_files_.back().AddSize(log_entry.size());   
> > 
> > Can it be reintroducing the same log number (254) , I am not sure.
> > 
> > Summary, it seems a rocksb bug and making recycle_log_file_num = 0 should *bypass* that. I need to check the performance impact for this though.
> > 
> > Should I post this to rocksdb community or any other place from where I can get response from rocksdb folks ?
> > 
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Wednesday, August 24, 2016 2:52 PM
> > To: 'Sage Weil'
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > Sage,
> > Thanks for looking , glad that we figured out something :-)..
> > So, you want me to reproduce this with only debug_bluefs = 20/20 ? Don't need bluestore log ?
> > Hope my root partition doesn't get full , this crash happened after 
> > 6 hours :-) ,
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Wednesday, August 24, 2016 2:34 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > Sage, It is there in the following github link I posted 
> > > earlier..You can see 3 logs there..
> > > 
> > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c68
> > > 7d
> > > 88
> > > a1b28fcc39
> > 
> > Ah sorry, got it.
> > 
> > And looking at the crash and the code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or the rocksdb recycling code.
> > 
> > I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> > 
> > sage
> > 
> > 
> > 
> >  >
> > > Thanks & Regards
> > > Somnath
> > > 
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Wednesday, August 24, 2016 1:43 PM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > Sage,
> > > > I got the db assert log from submit_transaction in the following location.
> > > > 
> > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c
> > > > 68
> > > > 7d
> > > > 88
> > > > a1b28fcc39
> > > > 
> > > > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > > > 
> > > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: 
> > > > reusing log 266 from recycle list
> > > > 
> > > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > > 
> > > > It is trying to rename the wal file it seems and old file is not found. You can see the transaction printed in the log along with error code like this.
> > > 
> > > How much of the log do you have? Can you post what you have somewhere?
> > > 
> > > Thanks!
> > > sage
> > > 
> > > 
> > > > 
> > > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > > submit_transaction error: NotFound:  code = 1 
> > > > rocksdb::WriteBatch = Put( Prefix = M key =
> > > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > > Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix = 
> > > > O key =
> > > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.00000000
> > > > 00
> > > > 89
> > > > ac
> > > > e7!'0xfffffffffffffffeffffffffffffffff)
> > > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B key 
> > > > =
> > > > 0x000004e73af72000) Merge( Prefix = T key = 'bluestore_statfs')
> > > > 
> > > > Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> > > > 
> > > > Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> > > > 
> > > > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting the inode number asserts. I didn't hit those asserts as well as no corruption  so far. Seems like bug of async compaction. Will try to reproduce with verbose log that one later.
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > > To: 'Sage Weil'
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > I was running my tests for 2 hours and it happened within that time.
> > > > I will try to reproduce with 1/20.
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > Sage,
> > > > > I think there are some bug introduced recently in the BlueFS 
> > > > > and I am getting the corruption like this which I was not facing earlier.
> > > > 
> > > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > > 
> > > > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > > > 
> > > > Thanks!
> > > > sage
> > > > 
> > > > > 
> > > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: 
> > > > > In function 'OSDMapRef OSDService::get_map(epoch_t)' thread
> > > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > > 
> > > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > > char
> > > > > const*)+0x80) [0x5617f2a99e80]
> > > > >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> > > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > > 
> > > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > > To: 'Sage Weil'
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > > Here is the option I am using..
> > > > > 
> > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > 
> > > > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > > > 
> > > > > 
> > > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > > 
> > > > >  ceph version 11.0.0-1688-g3fcc89c
> > > > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > > char
> > > > > const*)+0x80) [0x5581ed453cb0]
> > > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
> > > > > unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned 
> > > > > long, rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, 
> > > > > unsigned long, rocksdb::Slice*, char*) const+0x83f) 
> > > > > [0x5581ed2c6f4f]
> > > > >  5: 
> > > > > (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> > > > > rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > > > rocksdb::PersistentCacheOptions const&,
> > > > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > > >  7: 
> > > > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::Block
> > > > > Ba se dT ab le::Rep*, rocksdb::ReadOptions const&, 
> > > > > rocksdb::Slice const&,
> > > > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions 
> > > > > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > > > > bool)+0x508) [0x5581ed28ba68]
> > > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > > > rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
> > > > > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > > > > rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > > > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > > > > std::char_traits<char>, std::allocator<char> >*, 
> > > > > rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, 
> > > > > unsigned
> > > > > long*)+0x4f8) [0x5581ed25c458]
> > > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > > > std::char_traits<char>, std::allocator<char> > const&, 
> > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > > > [0x5581ed1d21d7]
> > > > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > > > bool)+0x55b) [0x5581ed02802b]
> > > > >  15: 
> > > > > (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > > > std::vector<ObjectStore::Transaction,
> > > > > std::allocator<ObjectStore::Transaction> >&, 
> > > > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > > > [0x5581ed032bc2]
> > > > >  17: 
> > > > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Tra
> > > > > ns ac ti on , std::allocator<ObjectStore::Transaction> >&,
> > > > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > > >  18: 
> > > > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+
> > > > > 0x
> > > > > d3
> > > > > 9)
> > > > > [0x5581ecef89e9]
> > > > >  19: 
> > > > > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)
> > > > > +0
> > > > > x2
> > > > > fb
> > > > > )
> > > > > [0x5581ecefeb4b]
> > > > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > > > [0x5581eccdd2e9]
> > > > >  22: 
> > > > > (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > > > > const&)+0x52) [0x5581eccdd542]
> > > > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > > > int)+0x89f) [0x5581ed440b2f]
> > > > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > > > [0x5581ed4441f0]
> > > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > 
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > > To: Somnath Roy
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > > compaction during this time.
> > > > > 
> > > > > How are you selecting universal compaction?
> > > > > 
> > > > > sage
> > > > > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > 
> > > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More 
> > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > 
> > 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil Sept. 2, 2016, 5:56 p.m. UTC | #35
On Fri, 2 Sep 2016, Somnath Roy wrote:
> Here is my rocksdb option :
> 
>         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> 
> one discrepancy I can see here on max_bytes_for_level_base , it should be same as level 0 size. Initially, I had bigger min_write_buffer_number_to_merge and that's how I calculated. Now, level 0 size is the following
> 
> write_buffer_size * min_write_buffer_number_to_merge * level0_file_num_compaction_trigger = 80MB * 2 * 4 = ~640MB
> 
> I should adjust max_bytes_for_level_base to the similar value probably.
> 
> Please find the level 10 log here. The log I captured during replay (crashed again) after it crashed originally because of the same reason.
> 
> https://drive.google.com/file/d/0B7W-S0z_ymMJSnA3R2dyellYZ0U/view?usp=sharing
> 
> Thanks for the explanation , I got it now why it is trying to flush 
> inode 1.
> 
> But, shouldn't we check the length as well during runway check rather 
> than just relying on bluefs_min_log_runway only.

That's what this does:

  uint64_t runway = log_writer->file->fnode.get_allocated() - log_writer->pos;
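
Plugging in the numbers from the crash above (extents 0:0xba00000+500000,0:0xfd00000+c00000, pos 0xce9000, append length 0x50f000), that works out to roughly:

  allocated = 0x500000 + 0xc00000 = 0x1100000   (~17 MB)
  runway    = 0x1100000 - 0xce9000 = 0x417000   (~4 MB)
  append    = 0x50f000                          (~5 MB)  > runway

so the pending ~5 MB journal write no longer fits in the space already allocated to ino 1.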

Anyway, I think I see what's going on.  There are a ton of _fsync and 
_flush_range calls that have to flush the fnode, and the fnode is pretty 
big (maybe 5k?) because it has so many extents (your tunables are 
generating really big files).

I think this is just a matter of the runway configurable being too small 
for your configuration.  Try bumping bluefs_min_log_runway by 10x.
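
For example, something like this in ceph.conf (this assumes the stock default is 1 MB; treat the exact number as illustrative):

  bluefs_min_log_runway = 10485760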

Well, actually, we could improve this a bit.  Right now rocksdb is calling 
lots of flushes on a big sst, and a final fsync at the end.  Bluefs is 
logging the updated fnode every time a flush changes the file size, and 
then only writing it to disk when the final fsync happens.  Instead, it 
could/should put the dirty fnode on a list and, only at log flush time, 
append the latest fnodes to the log and write it out.
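
Here is a minimal, self-contained sketch of that idea (toy types only, nothing to do with the real BlueFS code paths):

  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <vector>

  struct Fnode {
    uint64_t ino;
    uint64_t size;                          // grows on every data flush
  };

  struct Journal {
    std::vector<Fnode> entries;             // fnode updates that actually hit the log
    std::map<uint64_t, Fnode> dirty;        // deferred (latest) fnode per ino

    // data-flush path: remember the newest fnode instead of logging it right away
    void note_dirty(const Fnode& f) { dirty[f.ino] = f; }

    // log-flush path (e.g. fsync): append only the latest copy of each dirty fnode
    void flush() {
      for (auto& p : dirty)
        entries.push_back(p.second);
      dirty.clear();
    }
  };

  int main() {
    Journal j;
    Fnode f{693, 0};
    for (int i = 0; i < 1000; ++i) {        // 1000 flushes that each grow the file
      f.size += 0x100000;
      j.note_dirty(f);                      // no journal append per flush...
    }
    j.flush();                              // ...just one append with the final size
    std::cout << "journal entries written: " << j.entries.size() << "\n";  // prints 1
  }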

I'll add it to the trello board.  I think it's not that big a deal.. 
except when you have really big files.

sage


 > 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com] 
> Sent: Friday, September 02, 2016 9:35 AM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> 
> 
> On Fri, 2 Sep 2016, Somnath Roy wrote:
> 
> > Sage,
> > Tried to do some analysis on the inode assert, following looks suspicious.
> > 
> >    -10> 2016-08-31 17:55:56.921075 7faf14fff700 10 bluefs _fsync 0x7faf12075140 file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1:0x753!
 00!
>  000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d000000+10!
 00!
>  00,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d800000+10e00000])
> >     -9> 2016-08-31 17:55:56.921117 7faf14fff700 10 bluefs _flush 0x7faf12075140 no dirty data on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x751000!
 00!
>  +100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+10000!
 0,!
>  1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d800000+10e00000])
> >     -8> 2016-08-31 17:55:56.921137 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140
> >     -7> 2016-08-31 17:55:56.931541 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140 done in 0.010402
> >     -6> 2016-08-31 17:55:56.931551 7faf14fff700 20 bluefs _fsync file metadata was dirty (31278) on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x751!
 00!
>  000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+10!
 00!
>  00,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d800000+10e00000]), flushing log
> > 
> > The above looks good, it is about to call _flush_and_sync_log() after this.
> 
> Yes, although this file is huge (0x19c2c265 = 432194149 ~ 400MB)... What rocksdb options did you pass in?  I'm guessing this is a log file, but we generally want those smallish (maybe 16MB - 64MB, so that L0 SST generation isn't too slow).
> 
> >     -5> 2016-08-31 17:55:56.931588 7faf14fff700 10 bluefs _flush_and_sync_log txn(seq 31278 len 0x50e146 crc 0x58a6b9ab)
> >     -4> 2016-08-31 17:55:56.933951 7faf14fff700 10 bluefs _pad_bl padding with 0xe94 zeros
> >     -3> 2016-08-31 17:55:56.934079 7faf14fff700 20 bluefs flush_bdev
> >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> >     -1> 2016-08-31 17:55:56.934274 7faf14fff700 10 bluefs _flush_range 0x7faf2e824140 pos 0xce9000 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> >      0> 2016-08-31 17:55:56.939745 7faf14fff700 -1 
> > os/bluestore/BlueFS.cc: In function 'int 
> > BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 
> > 7faf14fff700 time 2016-08-31 17:55:56.934282
> > os/bluestore/BlueFS.cc: 1390: FAILED assert(h->file->fnode.ino != 1)
> > 
> >  ceph version 11.0.0-1946-g9a5cfe2 
> > (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > const*)+0x80) [0x56073c27c7d0]
> >  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned 
> > long)+0x1d69) [0x56073bf4e109]
> >  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x56073bf4e2d7]
> >  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, 
> > unsigned long, unsigned long)+0x443) [0x56073bf4fe13]
> >  5: (BlueFS::_fsync(BlueFS::FileWriter*, 
> > std::unique_lock<std::mutex>&)+0x35b) [0x56073bf5140b]
> >  6: (BlueRocksWritableFile::Sync()+0x62) [0x56073bf68be2]
> >  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1) 
> > [0x56073c0f24b1]
> >  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0) [0x56073c0f3960]
> >  9: 
> > (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status 
> > const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6) 
> > [0x56073c1354c6]
> >  10: 
> > (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::Compaction
> > Job::SubcompactionState*)+0x14ea) [0x56073c137c8a]
> >  11: (rocksdb::CompactionJob::Run()+0x479) [0x56073c138c09]
> >  12: (rocksdb::DBImpl::BackgroundCompaction(bool*, 
> > rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) 
> > [0x56073c0275d0]
> >  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) 
> > [0x56073c03443f]
> >  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) 
> > [0x56073c0eb039]
> >  15: (()+0x9900d3) [0x56073c0eb0d3]
> >  16: (()+0x76fa) [0x7faf3d1106fa]
> >  17: (clone()+0x6d) [0x7faf3af70b5d]
> > 
> > 
> > Now, as you can see it is calling _flush() with inode 1 , why ? is this expected ?
> 
> Yes.  The metadata for the log file is dirty (the file size changed), so bluefs is flushing its journal (ino 1) to update the fnode.
> 
> But this is very concerning:
> 
> >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 
> > 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 
> > 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> 
> 0xce9000~50f000 is a ~5 MB write.  Why would we ever write that much to the bluefs metadata journal at once?  That's why it's blowing the runway.  
> We have ~17MB allocated (0:0xba00000+500000,0:0xfd00000+c00000), we're at offset ~13MB, and we're writing ~5MB.
> 
> > Question :
> > ------------
> > 
> > 1. Why we are using the existing log_write to do a runway check ?
> > 
> > https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1
> > 280
> > 
> > Shouldn't the log_writer needs to be reinitialed with the FileWriter rocksdb sent with sync call ?
> 
> It's the bluefs journal writer.. that's the runway we're worried about.
> 
> > 2. The runway check is not considering the request length , so, why it 
> > is not expecting to allocate here 
> > (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L
> > 1388)
> > 
> > If the snippet is not sufficient, let me know if you want me to upload the level 10 log or need 20/20 log to proceed further.
> 
> The level 10 log is probably enough...
> 
> Thanks!
> sage
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Thursday, September 01, 2016 3:59 PM
> > To: 'Sage Weil'
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > Sage,
> > Created the following pull request on rocksdb repo, please take a look.
> > 
> > https://github.com/facebook/rocksdb/pull/1313
> > 
> > The fix is working fine for me.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Wednesday, August 31, 2016 6:20 AM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Tue, 30 Aug 2016, Somnath Roy wrote:
> > > Sage,
> > > I did some debugging on the rocksdb bug., here is my findings.
> > > 
> > > 1. The log file number is added to log_recycle_files and *not* in log_delete_files from the following if loop, which is expected.
> > > 
> > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460
> > > 60
> > > 9bf7cea4b63/db/db_impl.cc#L854
> > > 
> > > 2. But, it is there in the candidate list in the following loop.
> > > 
> > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460
> > > 60
> > > 9bf7cea4b63/db/db_impl.cc#L1000
> > > 
> > > 
> > > 3. This means it is added in full_scan_candidate_files from the 
> > > following  from a full scan (?)
> > > 
> > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460
> > > 60
> > > 9bf7cea4b63/db/db_impl.cc#L834
> > > 
> > > Added some log entries to verify , need to wait 6 hours :-(
> > > 
> > > 4. Probably, #3 is not unusual , but the check in the following seems not sufficient to keep the file.
> > > 
> > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460
> > > 60
> > > 9bf7cea4b63/db/db_impl.cc#L1013
> > > 
> > > Again, added some log to see the state.log_number during that time. BTW, I have seen the following check probably noop as state.prev_log_number is always coming 0.
> > > 
> > > (number == state.prev_log_number)
> > > 
> > > 5. So, the quick solution I am thinking is to put a check to see if the log is in recycle list and avoid deleting from the above part (?).
> > 
> > That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and ask in the comment if there is a better solution.
> > 
> > Probably the recycle list should be turned into a set so that the check is O(log n)...
> > 
> > Thanks!
> > sage
> > 
> > 
> > > 
> > > Let me know what you think.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Sunday, August 28, 2016 7:37 AM
> > > To: 'Sage Weil'
> > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > Subject: RE: Bluestore assert
> > > 
> > > Sage,
> > > Some updates on this.
> > > 
> > > 1. The issue is reproduced with the latest rocksdb master as well.
> > > 
> > > 2. The issue is *not happening* if I run disabling rocksdb log recycling. This proves our root cause is right.
> > > 
> > > 3. Running some more performance tests by disabling log recycling , but, initial impression is, it is introducing spikes and output is not as stable as enabling log recycling.
> > > 
> > > 4. Created a rocksdb issue for this
> > > (https://github.com/facebook/rocksdb/issues/1303)
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Thursday, August 25, 2016 2:35 PM
> > > To: 'Sage Weil'
> > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > Subject: RE: Bluestore assert
> > > 
> > > Sage,
> > > Hope you are able to download the log I shared via google doc.
> > > It seems the bug is around this portion.
> > > 
> > > 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log 254 
> > > to recycle list
> > > 
> > > 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log 256 
> > > to recycle list
> > > 
> > > 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original Log 
> > > Time
> > > 2016/08/25-00:44:03.347467) [default] Level-0 commit table #258 
> > > started
> > > 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original Log 
> > > Time
> > > 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: 
> > > memtable #1 done
> > > 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original Log 
> > > Time
> > > 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: 
> > > memtable #2 done
> > > 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original Log 
> > > Time
> > > 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 
> > > 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": 
> > > [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> > > 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original Log 
> > > Time
> > > 2016/08/25-00:44:03.348297) [default] Level summary: base level 1 
> > > max bytes base 5368709120 files[3 4 0 0 0 0 0] max score 0.75
> > > 
> > > 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> > > 
> > > 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink 
> > > db.wal/000256.log
> > > 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had 
> > > refs
> > > 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 
> > > bdev
> > > 0 extents
> > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe20000
> > > 0+
> > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0
> > > x1
> > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link 
> > > destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 
> > > 00:41:26.298423 bdev 0 extents 
> > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe20000
> > > 0+
> > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0
> > > x1
> > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink 
> > > db.wal/000254.log
> > > 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had 
> > > refs
> > > 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 
> > > bdev
> > > 0 extents
> > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x840000
> > > 0+
> > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0x
> > > b4
> > > 00000+800000,0:0xc000000+500000])
> > > 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link 
> > > destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25 
> > > 00:41:26.299110 bdev 0 extents 
> > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x840000
> > > 0+
> > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0x
> > > b4
> > > 00000+800000,0:0xc000000+500000])
> > > 
> > > So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).
> > > 
> > > I was going through the rocksdb code and I found the following.
> > > 
> > > 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> > > 
> > > 2. The log is deleted from DBImpl::DeleteObsoleteFileImpl, which is *not* under the lock but iterates over log_delete_files. This is fishy, but it shouldn't be the reason for the same log number ending up in two lists.
> > > 
> > > 3. I checked all the places, but in the following place alive_log_files_ (within DBImpl::WriteImpl) is accessed without the lock.
> > > 
> > > 4625       alive_log_files_.back().AddSize(log_entry.size());   
> > > 
> > > Could it be reintroducing the same log number (254)? I am not sure.
> > > 
> > > In summary, it seems to be a rocksdb bug, and setting recycle_log_file_num = 0 should *bypass* it. I need to check the performance impact of this, though.
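> > > 
> > > (For reference, disabling recycling just means dropping recycle_log_file_num
> > > to 0 in the option string we already pass, e.g.
> > > 
> > >   bluestore_rocksdb_options = "...,recycle_log_file_num=0,..."
> > > 
> > > with the rest of the tuning left as is.)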
> > > 
> > > Should I post this to the rocksdb community, or is there some other place where I can get a response from the rocksdb folks?
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Wednesday, August 24, 2016 2:52 PM
> > > To: 'Sage Weil'
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > Sage,
> > > Thanks for looking, glad that we figured out something :-)..
> > > So, you want me to reproduce this with only debug_bluefs = 20/20? No bluestore log needed?
> > > Hope my root partition doesn't get full; this crash happened after 
> > > 6 hours :-)
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Wednesday, August 24, 2016 2:34 PM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > Sage, It is there in the following github link I posted 
> > > > earlier..You can see 3 logs there..
> > > > 
> > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c68
> > > > 7d
> > > > 88
> > > > a1b28fcc39
> > > 
> > > Ah sorry, got it.
> > > 
> > > And looking at the crash and code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or in the rocksdb recycling code.
> > > 
> > > I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> > > 
> > > sage
> > > 
> > > 
> > > 
> > >  >
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Wednesday, August 24, 2016 1:43 PM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > Sage,
> > > > > I got the db assert log from submit_transaction in the following location.
> > > > > 
> > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c
> > > > > 68
> > > > > 7d
> > > > > 88
> > > > > a1b28fcc39
> > > > > 
> > > > > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > > > > 
> > > > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: 
> > > > > reusing log 266 from recycle list
> > > > > 
> > > > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > > > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > > > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > > > 
> > > > > It seems it is trying to rename the wal file and the old file is not found. You can see the transaction printed in the log along with the error code, like this.
> > > > 
> > > > How much of the log do you have? Can you post what you have somewhere?
> > > > 
> > > > Thanks!
> > > > sage
> > > > 
> > > > 
> > > > > 
> > > > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > > > submit_transaction error: NotFound:  code = 1 
> > > > > rocksdb::WriteBatch = Put( Prefix = M key =
> > > > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > > > Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix = 
> > > > > O key =
> > > > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.00000000
> > > > > 00
> > > > > 89
> > > > > ac
> > > > > e7!'0xfffffffffffffffeffffffffffffffff)
> > > > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B key 
> > > > > =
> > > > > 0x000004e73af72000) Merge( Prefix = T key = 'bluestore_statfs')
> > > > > 
> > > > > Hope my decoding of the key is proper; I have reused pretty_binary_string() from BlueStore after removing the first 2 bytes of the key, which are the prefix and a '0'.
> > > > > 
> > > > > Any suggestion on the next step for root-causing this db assert if the log rename is not enough of a hint?
> > > > > 
> > > > > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting out the inode number asserts. I didn't hit those asserts, and no corruption so far either. Seems like a bug in async compaction. Will try to reproduce that one with a verbose log later.
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > > > To: 'Sage Weil'
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > I was running my tests for 2 hours and it happened within that time.
> > > > > I will try to reproduce with 1/20.
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > > > To: Somnath Roy
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > Sage,
> > > > > > I think there are some bugs introduced recently in BlueFS, 
> > > > > > and I am getting corruption like this which I was not facing earlier.
> > > > > 
> > > > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > > > 
> > > > > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > > > > 
> > > > > Thanks!
> > > > > sage
> > > > > 
> > > > > > 
> > > > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: 
> > > > > > In function 'OSDMapRef OSDService::get_map(epoch_t)' thread
> > > > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > > > 
> > > > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > > > char
> > > > > > const*)+0x80) [0x5617f2a99e80]
> > > > > >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> > > > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > > > 
> > > > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Somnath Roy
> > > > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > > > To: 'Sage Weil'
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > > > Here is the option I am using..
> > > > > > 
> > > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > > 
> > > > > > Here is another one after 4 hours of 4K RW :-)... Sorry to bombard you with all these; I am adding more logging for the next time if I hit it..
> > > > > > 
> > > > > > 
> > > > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > > > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > > > 
> > > > > >  ceph version 11.0.0-1688-g3fcc89c
> > > > > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > > > char
> > > > > > const*)+0x80) [0x5581ed453cb0]
> > > > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, 
> > > > > > unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned 
> > > > > > long, rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > > > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, 
> > > > > > unsigned long, rocksdb::Slice*, char*) const+0x83f) 
> > > > > > [0x5581ed2c6f4f]
> > > > > >  5: 
> > > > > > (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*,
> > > > > > rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > > > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > > > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > > > > rocksdb::PersistentCacheOptions const&,
> > > > > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > > > >  7: 
> > > > > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::Block
> > > > > > Ba se dT ab le::Rep*, rocksdb::ReadOptions const&, 
> > > > > > rocksdb::Slice const&,
> > > > > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions 
> > > > > > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > > > > > bool)+0x508) [0x5581ed28ba68]
> > > > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > > > > rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor 
> > > > > > const&, rocksdb::Slice const&, rocksdb::GetContext*, 
> > > > > > rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > > > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > > > > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > > > > > std::char_traits<char>, std::allocator<char> >*, 
> > > > > > rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, 
> > > > > > unsigned
> > > > > > long*)+0x4f8) [0x5581ed25c458]
> > > > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > > > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > > > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > > > > std::char_traits<char>, std::allocator<char> > const&, 
> > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > > > > [0x5581ed1d21d7]
> > > > > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > > > > bool)+0x55b) [0x5581ed02802b]
> > > > > >  15: 
> > > > > > (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > > > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > > > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > > > > std::vector<ObjectStore::Transaction,
> > > > > > std::allocator<ObjectStore::Transaction> >&, 
> > > > > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > > > > [0x5581ed032bc2]
> > > > > >  17: 
> > > > > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Tra
> > > > > > ns ac ti on , std::allocator<ObjectStore::Transaction> >&,
> > > > > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > > > >  18: 
> > > > > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+
> > > > > > 0x
> > > > > > d3
> > > > > > 9)
> > > > > > [0x5581ecef89e9]
> > > > > >  19: 
> > > > > > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)
> > > > > > +0
> > > > > > x2
> > > > > > fb
> > > > > > )
> > > > > > [0x5581ecefeb4b]
> > > > > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > > > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > > > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > > > > [0x5581eccdd2e9]
> > > > > >  22: 
> > > > > > (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > > > > > const&)+0x52) [0x5581eccdd542]
> > > > > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > > > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > > > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > > > > int)+0x89f) [0x5581ed440b2f]
> > > > > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > > > > [0x5581ed4441f0]
> > > > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > > 
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > > > To: Somnath Roy
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > > > compaction during this time.
> > > > > > 
> > > > > > How are you selecting universal compaction?
> > > > > > 
> > > > > > sage
> > > > > > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > 
> > > > > > 
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > 
> > > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Sept. 2, 2016, 7:01 p.m. UTC | #36
Sage,
It is probably the universal compaction that is generating bigger files; other than that, I don't see how the following tuning would be generating large files.
I will try some universal compaction tuning related to file size and confirm.
Yeah, a big bluefs_min_log_runway value will probably shield the assert for now, but I am afraid that since we don't have control over the file size, we can't really be sure what value we should be giving bluefs_min_log_runway so the assert doesn't hit in future long runs.
Can't we do something like this?

// Basically, checking the pending log length as well
if (runway < g_conf->bluefs_min_log_runway ||
    runway < log_writer->buffer.length()) {
  // allocate more log space
}

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Friday, September 02, 2016 10:57 AM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Fri, 2 Sep 2016, Somnath Roy wrote:
> Here is my rocksdb option :
> 
>         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> 
> One discrepancy I can see here is max_bytes_for_level_base; it should 
> be the same as the level 0 size. Initially I had a bigger 
> min_write_buffer_number_to_merge, and that's how I calculated it. Now 
> level 0 size is the following:
> 
> write_buffer_size * min_write_buffer_number_to_merge * 
> level0_file_num_compaction_trigger = 80MB * 2 * 4 = ~640MB
> 
> I should probably adjust max_bytes_for_level_base to a similar value.
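> 
> For example, something like this in the option string (just illustrating the 
> adjustment; 640MB = 671088640):
> 
>   ...,write_buffer_size=83886080,max_bytes_for_level_base=671088640,...
> 
> instead of the current 5368709120.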
> 
> Please find the level 10 log here. The log I captured during replay (crashed again) after it crashed originally because of the same reason.
> 
> https://drive.google.com/file/d/0B7W-S0z_ymMJSnA3R2dyellYZ0U/view?usp=
> sharing
> 
> Thanks for the explanation, I get now why it is trying to flush 
> inode 1.
> 
> But shouldn't we check the length as well during the runway check, rather 
> than just relying on bluefs_min_log_runway only?

That's what this does:

  uint64_t runway = log_writer->file->fnode.get_allocated() - log_writer->pos;

Anyway, I think I see what's going on.  There are a ton of _fsync and _flush_range calls that have to flush the fnode, and the fnode is pretty big (maybe 5k?) because it has so many extents (your tunables are generating really big files).

I think this is just a matter of the runway configurable being too small for your configuration.  Try bumping bluefs_min_log_runway by 10x.
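For example, if you're running the default, something like

  bluefs_min_log_runway = 10485760

in your ceph.conf (10x the 1M default, if that's what you have now).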

Well, actually, we could improve this a bit.  Right now rocksdb is calling lots of flushes on a big sst, and a final fsync at the end.  Bluefs is logging the updated fnode every time a flush changes the file size, and then only writing it to disk when the final fsync happens.  Instead, it could/should put the dirty fnode on a list and, only at log flush time, append the latest fnodes to the log and write it out.

I'll add it to the trello board.  I think it's not that big a deal.. 
except when you have really big files.
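
Roughly, the idea would be something like this (just a sketch, names are 
illustrative, not the actual BlueFS code):

  // when a flush/fsync changes a file's metadata, just remember it:
  if (!h->file->dirty) {
    h->file->dirty = true;
    dirty_files.push_back(h->file);
  }

  // then, in _flush_and_sync_log(), emit one op_file_update per dirty file
  // with its latest fnode, instead of one entry per size change:
  for (auto& f : dirty_files) {
    log_t.op_file_update(f->fnode);
    f->dirty = false;
  }
  dirty_files.clear();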

sage


 > 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, September 02, 2016 9:35 AM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> 
> 
> On Fri, 2 Sep 2016, Somnath Roy wrote:
> 
> > Sage,
> > Tried to do some analysis on the inode assert; the following looks suspicious.
> > 
> >    -10> 2016-08-31 17:55:56.921075 7faf14fff700 10 bluefs _fsync 0x7faf12075140 file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1:0x753!
 00!
>  000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d000000+10!
 00!
>  
> 00,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d8
> 00000+10e00000])
> >     -9> 2016-08-31 17:55:56.921117 7faf14fff700 10 bluefs _flush 0x7faf12075140 no dirty data on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x751000!
 00!
>  +100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+10000!
 0,!
>  
> 1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d6000
> 00+100000,1:0x7d800000+10e00000])
> >     -8> 2016-08-31 17:55:56.921137 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140
> >     -7> 2016-08-31 17:55:56.931541 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140 done in 0.010402
> >     -6> 2016-08-31 17:55:56.931551 7faf14fff700 20 bluefs _fsync file metadata was dirty (31278) on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x751!
 00!
>  000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+10!
 00!
>  
> 00,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d6
> 00000+100000,1:0x7d800000+10e00000]), flushing log
> > 
> > The above looks good; it is about to call _flush_and_sync_log() after this.
> 
> Yes, although this file is huge (0x19c2c265 = 432194149 ~ 400MB)... What rocksdb options did you pass in?  I'm guessing this is a log file, but we generally want those smallish (maybe 16MB - 64MB, so that L0 SST generation isn't too slow).
> 
> >     -5> 2016-08-31 17:55:56.931588 7faf14fff700 10 bluefs _flush_and_sync_log txn(seq 31278 len 0x50e146 crc 0x58a6b9ab)
> >     -4> 2016-08-31 17:55:56.933951 7faf14fff700 10 bluefs _pad_bl padding with 0xe94 zeros
> >     -3> 2016-08-31 17:55:56.934079 7faf14fff700 20 bluefs flush_bdev
> >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> >     -1> 2016-08-31 17:55:56.934274 7faf14fff700 10 bluefs _flush_range 0x7faf2e824140 pos 0xce9000 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> >      0> 2016-08-31 17:55:56.939745 7faf14fff700 -1
> > os/bluestore/BlueFS.cc: In function 'int 
> > BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' 
> > thread
> > 7faf14fff700 time 2016-08-31 17:55:56.934282
> > os/bluestore/BlueFS.cc: 1390: FAILED assert(h->file->fnode.ino != 1)
> > 
> >  ceph version 11.0.0-1946-g9a5cfe2
> > (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x80) [0x56073c27c7d0]
> >  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, 
> > unsigned
> > long)+0x1d69) [0x56073bf4e109]
> >  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) 
> > [0x56073bf4e2d7]
> >  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&,
> > unsigned long, unsigned long)+0x443) [0x56073bf4fe13]
> >  5: (BlueFS::_fsync(BlueFS::FileWriter*,
> > std::unique_lock<std::mutex>&)+0x35b) [0x56073bf5140b]
> >  6: (BlueRocksWritableFile::Sync()+0x62) [0x56073bf68be2]
> >  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1)
> > [0x56073c0f24b1]
> >  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0) [0x56073c0f3960]
> >  9: 
> > (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status
> > const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6)
> > [0x56073c1354c6]
> >  10: 
> > (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::Compacti
> > on
> > Job::SubcompactionState*)+0x14ea) [0x56073c137c8a]
> >  11: (rocksdb::CompactionJob::Run()+0x479) [0x56073c138c09]
> >  12: (rocksdb::DBImpl::BackgroundCompaction(bool*,
> > rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) 
> > [0x56073c0275d0]
> >  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf)
> > [0x56073c03443f]
> >  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) 
> > [0x56073c0eb039]
> >  15: (()+0x9900d3) [0x56073c0eb0d3]
> >  16: (()+0x76fa) [0x7faf3d1106fa]
> >  17: (clone()+0x6d) [0x7faf3af70b5d]
> > 
> > 
> > Now, as you can see, it is calling _flush() with inode 1. Why? Is this expected?
> 
> Yes.  The metadata for the log file is dirty (the file size changed), so bluefs is flushing it's journal (ino 1) to update the fnode.
> 
> But this is very concerning:
> 
> >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush
> > 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime
> > 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> 
> 0xce9000~50f000 is a ~5 MB write.  Why would we ever write that much to the bluefs metadata journal at once?  That's why it's blowing the runway.  
> We have ~17MB allocated (0:0xba00000+500000,0:0xfd00000+c00000), we're at offset ~13MB, and we're writing ~5MB.
> 
> > Question :
> > ------------
> > 
> > 1. Why are we using the existing log_writer to do the runway check?
> > 
> > https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#
> > L1
> > 280
> > 
> > Shouldn't the log_writer need to be reinitialized with the FileWriter rocksdb sent with the sync call?
> 
> It's the bluefs journal writer.. that's the runway we're worried about.
> 
> > 2. The runway check is not considering the request length, so why 
> > is it not expecting to allocate here 
> > (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc
> > #L
> > 1388)
> > 
> > If the snippet is not sufficient, let me know if you want me to upload the level 10 log, or if you need a 20/20 log to proceed further.
> 
> The level 10 log is probably enough...
> 
> Thanks!
> sage
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > 
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Thursday, September 01, 2016 3:59 PM
> > To: 'Sage Weil'
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > Sage,
> > Created the following pull request on rocksdb repo, please take a look.
> > 
> > https://github.com/facebook/rocksdb/pull/1313
> > 
> > The fix is working fine for me.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Wednesday, August 31, 2016 6:20 AM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Tue, 30 Aug 2016, Somnath Roy wrote:
> > > Sage,
> > > I did some debugging on the rocksdb bug., here is my findings.
> > > 
> > > 1. The log file number is added to log_recycle_files and *not* in log_delete_files from the following if loop, which is expected.
> > > 
> > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f4
> > > 60
> > > 60
> > > 9bf7cea4b63/db/db_impl.cc#L854
> > > 
> > > 2. But, it is there in the candidate list in the following loop.
> > > 
> > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f4
> > > 60
> > > 60
> > > 9bf7cea4b63/db/db_impl.cc#L1000
> > > 
> > > 
> > > 3. This means it is added in full_scan_candidate_files from the 
> > > following  from a full scan (?)
> > > 
> > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f4
> > > 60
> > > 60
> > > 9bf7cea4b63/db/db_impl.cc#L834
> > > 
> > > Added some log entries to verify , need to wait 6 hours :-(
> > > 
> > > 4. Probably, #3 is not unusual , but the check in the following seems not sufficient to keep the file.
> > > 
> > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f4
> > > 60
> > > 60
> > > 9bf7cea4b63/db/db_impl.cc#L1013
> > > 
> > > Again, added some log to see the state.log_number during that time. BTW, I have seen the following check probably noop as state.prev_log_number is always coming 0.
> > > 
> > > (number == state.prev_log_number)
> > > 
> > > 5. So, the quick solution I am thinking is to put a check to see if the log is in recycle list and avoid deleting from the above part (?).
> > 
> > That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and ask in the comment if there is a better solution.
> > 
> > Probably the recycle list should be turned into a set so that the check is O(log n)...
> > 
> > Thanks!
> > sage
> > 
> > 
> > > 
> > > Let me know what you think.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Sunday, August 28, 2016 7:37 AM
> > > To: 'Sage Weil'
> > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > Subject: RE: Bluestore assert
> > > 
> > > Sage,
> > > Some updates on this.
> > > 
> > > 1. The issue is reproduced with the latest rocksdb master as well.
> > > 
> > > 2. The issue is *not happening* if I run disabling rocksdb log recycling. This proves our root cause is right.
> > > 
> > > 3. Running some more performance tests by disabling log recycling , but, initial impression is, it is introducing spikes and output is not as stable as enabling log recycling.
> > > 
> > > 4. Created a rocksdb issue for this
> > > (https://github.com/facebook/rocksdb/issues/1303)
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Thursday, August 25, 2016 2:35 PM
> > > To: 'Sage Weil'
> > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > Subject: RE: Bluestore assert
> > > 
> > > Sage,
> > > Hope you are able to download the log I shared via google doc.
> > > It seems the bug is around this portion.
> > > 
> > > 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log 254 
> > > to recycle list
> > > 
> > > 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log 256 
> > > to recycle list
> > > 
> > > 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original Log 
> > > Time
> > > 2016/08/25-00:44:03.347467) [default] Level-0 commit table #258 
> > > started
> > > 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original Log 
> > > Time
> > > 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: 
> > > memtable #1 done
> > > 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original Log 
> > > Time
> > > 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: 
> > > memtable #2 done
> > > 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original Log 
> > > Time
> > > 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 
> > > 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": 
> > > [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> > > 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original Log 
> > > Time
> > > 2016/08/25-00:44:03.348297) [default] Level summary: base level 1 
> > > max bytes base 5368709120 files[3 4 0 0 0 0 0] max score 0.75
> > > 
> > > 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> > > 
> > > 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink 
> > > db.wal/000256.log
> > > 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had 
> > > refs
> > > 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 
> > > bdev
> > > 0 extents
> > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200
> > > 00
> > > 0+
> > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0
> > > :0
> > > x1
> > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link 
> > > destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25
> > > 00:41:26.298423 bdev 0 extents
> > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200
> > > 00
> > > 0+
> > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0
> > > :0
> > > x1
> > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink 
> > > db.wal/000254.log
> > > 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had 
> > > refs
> > > 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 
> > > bdev
> > > 0 extents
> > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400
> > > 00
> > > 0+
> > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:
> > > 0x
> > > b4
> > > 00000+800000,0:0xc000000+500000])
> > > 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link 
> > > destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25
> > > 00:41:26.299110 bdev 0 extents
> > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400
> > > 00
> > > 0+
> > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:
> > > 0x
> > > b4
> > > 00000+800000,0:0xc000000+500000])
> > > 
> > > So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).
> > > 
> > > I was going through the rocksdb code and I found the following.
> > > 
> > > 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> > > 
> > > 2. Log is deleted from DBImpl::DeleteObsoleteFileImpl which is *not* under lock , but iterating over log_delete_files. This is fishy but it shouldn't be the reason for same log number end up in two list.
> > > 
> > > 3. I saw all the places but the following  place alive_log_files_ (within DBImpl::WriteImpl)  is accessed without lock.
> > > 
> > > 4625       alive_log_files_.back().AddSize(log_entry.size());   
> > > 
> > > Can it be reintroducing the same log number (254) , I am not sure.
> > > 
> > > Summary, it seems a rocksb bug and making recycle_log_file_num = 0 should *bypass* that. I need to check the performance impact for this though.
> > > 
> > > Should I post this to rocksdb community or any other place from where I can get response from rocksdb folks ?
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Wednesday, August 24, 2016 2:52 PM
> > > To: 'Sage Weil'
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > Sage,
> > > Thanks for looking , glad that we figured out something :-)..
> > > So, you want me to reproduce this with only debug_bluefs = 20/20 ? Don't need bluestore log ?
> > > Hope my root partition doesn't get full , this crash happened 
> > > after
> > > 6 hours :-) ,
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Wednesday, August 24, 2016 2:34 PM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > Sage, It is there in the following github link I posted 
> > > > earlier..You can see 3 logs there..
> > > > 
> > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c
> > > > 68
> > > > 7d
> > > > 88
> > > > a1b28fcc39
> > > 
> > > Ah sorry, got it.
> > > 
> > > And looking at the crash and code the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets and error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or the rocksdb recycling code.
> > > 
> > > I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> > > 
> > > sage
> > > 
> > > 
> > > 
> > >  >
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Wednesday, August 24, 2016 1:43 PM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > Sage,
> > > > > I got the db assert log from submit_transaction in the following location.
> > > > > 
> > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b3
> > > > > 9c
> > > > > 68
> > > > > 7d
> > > > > 88
> > > > > a1b28fcc39
> > > > > 
> > > > > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > > > > 
> > > > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: 
> > > > > reusing log 266 from recycle list
> > > > > 
> > > > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > > > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > > > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > > > 
> > > > > It is trying to rename the wal file it seems and old file is not found. You can see the transaction printed in the log along with error code like this.
> > > > 
> > > > How much of the log do you have? Can you post what you have somewhere?
> > > > 
> > > > Thanks!
> > > > sage
> > > > 
> > > > 
> > > > > 
> > > > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > > > submit_transaction error: NotFound:  code = 1 
> > > > > rocksdb::WriteBatch = Put( Prefix = M key =
> > > > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > > > Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix 
> > > > > = O key =
> > > > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000
> > > > > 00
> > > > > 00
> > > > > 89
> > > > > ac
> > > > > e7!'0xfffffffffffffffeffffffffffffffff)
> > > > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B 
> > > > > key =
> > > > > 0x000004e73af72000) Merge( Prefix = T key = 
> > > > > 'bluestore_statfs')
> > > > > 
> > > > > Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> > > > > 
> > > > > Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> > > > > 
> > > > > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting the inode number asserts. I didn't hit those asserts as well as no corruption  so far. Seems like bug of async compaction. Will try to reproduce with verbose log that one later.
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > > > To: 'Sage Weil'
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > I was running my tests for 2 hours and it happened within that time.
> > > > > I will try to reproduce with 1/20.
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > > > To: Somnath Roy
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > Sage,
> > > > > > I think there are some bug introduced recently in the BlueFS 
> > > > > > and I am getting the corruption like this which I was not facing earlier.
> > > > > 
> > > > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > > > 
> > > > > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > > > > 
> > > > > Thanks!
> > > > > sage
> > > > > 
> > > > > > 
> > > > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: 
> > > > > > In function 'OSDMapRef OSDService::get_map(epoch_t)' thread
> > > > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > > > 
> > > > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > > > char
> > > > > > const*)+0x80) [0x5617f2a99e80]
> > > > > >  2: (OSDService::get_map(unsigned int)+0x5d) 
> > > > > > [0x5617f2395fdd]
> > > > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > > > 
> > > > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Somnath Roy
> > > > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > > > To: 'Sage Weil'
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > > > Here is the option I am using..
> > > > > > 
> > > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > > 
> > > > > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > > > > 
> > > > > > 
> > > > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > > > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > > > 
> > > > > >  ceph version 11.0.0-1688-g3fcc89c
> > > > > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > > > char
> > > > > > const*)+0x80) [0x5581ed453cb0]
> > > > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned 
> > > > > > long, unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned 
> > > > > > long, rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > > > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, 
> > > > > > unsigned long, rocksdb::Slice*, char*) const+0x83f) 
> > > > > > [0x5581ed2c6f4f]
> > > > > >  5: 
> > > > > > (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*
> > > > > > , rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > > > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > > > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > > > > rocksdb::PersistentCacheOptions const&,
> > > > > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > > > >  7: 
> > > > > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::Blo
> > > > > > ck Ba se dT ab le::Rep*, rocksdb::ReadOptions const&, 
> > > > > > rocksdb::Slice const&,
> > > > > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions
> > > > > > const&, rocksdb::Slice const&, rocksdb::GetContext*,
> > > > > > bool)+0x508) [0x5581ed28ba68]
> > > > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > > > > rocksdb::InternalKeyComparator const&, 
> > > > > > rocksdb::FileDescriptor const&, rocksdb::Slice const&, 
> > > > > > rocksdb::GetContext*, rocksdb::HistogramImpl*, bool, 
> > > > > > int)+0x158) [0x5581ed252118]
> > > > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > > > > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > > > > > std::char_traits<char>, std::allocator<char> >*, 
> > > > > > rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, 
> > > > > > unsigned
> > > > > > long*)+0x4f8) [0x5581ed25c458]
> > > > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > > > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > > > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > > > > std::char_traits<char>, std::allocator<char> > const&, 
> > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > > > > [0x5581ed1d21d7]
> > > > > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > > > > bool)+0x55b) [0x5581ed02802b]
> > > > > >  15: 
> > > > > > (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > > > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > > > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > > > > std::vector<ObjectStore::Transaction,
> > > > > > std::allocator<ObjectStore::Transaction> >&, 
> > > > > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > > > > [0x5581ed032bc2]
> > > > > >  17: 
> > > > > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::T
> > > > > > ra ns ac ti on , std::allocator<ObjectStore::Transaction> 
> > > > > > >&,
> > > > > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > > > >  18: 
> > > > > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>
> > > > > > )+
> > > > > > 0x
> > > > > > d3
> > > > > > 9)
> > > > > > [0x5581ecef89e9]
> > > > > >  19: 
> > > > > > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest
> > > > > > >)
> > > > > > +0
> > > > > > x2
> > > > > > fb
> > > > > > )
> > > > > > [0x5581ecefeb4b]
> > > > > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > > > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > > > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > > > > [0x5581eccdd2e9]
> > > > > >  22: 
> > > > > > (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > > > > > const&)+0x52) [0x5581eccdd542]
> > > > > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > > > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > > > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > > > > int)+0x89f) [0x5581ed440b2f]
> > > > > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > > > > [0x5581ed4441f0]
> > > > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > > 
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > > > To: Somnath Roy
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > > > compaction during this time.
> > > > > > 
> > > > > > How are you selecting universal compaction?
> > > > > > 
> > > > > > sage
Sage Weil Sept. 2, 2016, 7:26 p.m. UTC | #37
On Fri, 2 Sep 2016, Somnath Roy wrote:
> Sage,
> It is probably the universal compaction that is generating bigger files; other than that, I don't see how the following tuning would generate large files.
> I will try some universal compaction tuning related to file size and confirm.
> Yeah, a big bluefs_min_log_runway value will probably shield the assert for now, but I am afraid that since we don't have control over the file size, we can't be sure what bluefs_min_log_runway value will keep the assert from hitting in future long runs.
> Can't we do it like this?
> 
> // Basically, checking the length of the log as well
> if (runway < g_conf->bluefs_min_log_runway || runway < log_writer->buffer.length()) {
>   // allocate
> }

Oh, I see what you mean.  Yeah, I'll add that in--certainly doesn't hurt.
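
For illustration, the combined check could look something like this (just a
sketch against the existing _flush_and_sync_log() code, untested):

  uint64_t runway = log_writer->file->fnode.get_allocated() - log_writer->pos;
  // grow the log allocation if we are under the configured minimum *or* the
  // pending log buffer itself would not fit in what is left
  if (runway < g_conf->bluefs_min_log_runway ||
      runway < log_writer->buffer.length()) {
    // ... allocate more space for the bluefs log here ...
  }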

And I think just configuring a long runway won't hurt either (e.g., 
100MB).
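
For example, in ceph.conf that would be something like (value in bytes):

  bluefs_min_log_runway = 104857600   # 100MB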

That's probably enough to be safe, but once we fix the flush thing I mentioned, that will make this go away.

s

> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com] 
> Sent: Friday, September 02, 2016 10:57 AM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Fri, 2 Sep 2016, Somnath Roy wrote:
> > Here is my rocksdb option :
> > 
> >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > 
> > One discrepancy I can see here is max_bytes_for_level_base; it should 
> > be the same as the level 0 size. Initially, I had a bigger 
> > min_write_buffer_number_to_merge and that's how I calculated it. Now, 
> > the level 0 size is the following:
> > 
> > write_buffer_size * min_write_buffer_number_to_merge * 
> > level0_file_num_compaction_trigger = 80MB * 2 * 4 = ~640MB
> > 
> > I should probably adjust max_bytes_for_level_base to a similar value.
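
For example, with the numbers above, the adjusted option would be roughly:

  max_bytes_for_level_base=671088640    # 80MB * 2 * 4 = ~640MB, was 5368709120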
> > 
> > Please find the level 10 log here. This is the log I captured during replay (it crashed again) after it crashed originally for the same reason.
> > 
> > https://drive.google.com/file/d/0B7W-S0z_ymMJSnA3R2dyellYZ0U/view?usp=
> > sharing
> > 
> > Thanks for the explanation; I now understand why it is trying to flush 
> > inode 1.
> > 
> > But shouldn't we check the length as well during the runway check, rather 
> > than relying on bluefs_min_log_runway only?
> 
> That's what this does:
> 
>   uint64_t runway = log_writer->file->fnode.get_allocated() - log_writer->pos;
> 
> Anyway, I think I see what's going on.  There are a ton of _fsync and _flush_range calls that have to flush the fnode, and the fnode is pretty big (maybe 5k?) because it has so many extents (your tunables are generating really big files).
> 
> I think this is just a matter of the runway configurable being too small for your configuration.  Try bumping bluefs_min_log_runway by 10x.
> 
> Well, actually, we could improve this a bit.  Right now rocksdb is calling lots of flushes on a big sst, and a final fsync at the end.  Bluefs is logging the updated fnode every time the flush changes the file size, and then only writing it to disk when the final fsync happens.  Instead, it could/should put the dirty fnode on a list and, only at log flush time, append the latest fnodes to the log and write it out.
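
Roughly, that deferred-fnode idea might look like this (pseudocode only, not
the current BlueFS code; the dirty_files list and dirty flag are hypothetical):

  // in _flush_range()/_fsync(): note that the fnode changed instead of
  // appending an op_file_update to the bluefs log right away
  if (!h->file->dirty) {
    h->file->dirty = true;
    dirty_files.push_back(h->file);
  }

  // in _flush_and_sync_log(): append each dirty fnode once, then write the log
  for (auto& f : dirty_files) {
    log_t.op_file_update(f->fnode);
    f->dirty = false;
  }
  dirty_files.clear();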
> 
> I'll add it to the trello board.  I think it's not that big a deal.. 
> except when you have really big files.
> 
> sage
> 
> 
>  > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, September 02, 2016 9:35 AM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > 
> > 
> > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > 
> > > Sage,
> > > Tried to do some analysis on the inode assert; the following looks suspicious.
> > > 
> > >    -10> 2016-08-31 17:55:56.921075 7faf14fff700 10 bluefs _fsync 0x7faf12075140 file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1:0x7!
 53!
>  00!
> >  000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d000000+!
 10!
>  00!
> >  
> > 00,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d8
> > 00000+10e00000])
> > >     -9> 2016-08-31 17:55:56.921117 7faf14fff700 10 bluefs _flush 0x7faf12075140 no dirty data on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x7510!
 00!
>  00!
> >  +100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100!
 00!
>  0,!
> >  
> > 1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d6000
> > 00+100000,1:0x7d800000+10e00000])
> > >     -8> 2016-08-31 17:55:56.921137 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140
> > >     -7> 2016-08-31 17:55:56.931541 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140 done in 0.010402
> > >     -6> 2016-08-31 17:55:56.931551 7faf14fff700 20 bluefs _fsync file metadata was dirty (31278) on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x7!
 51!
>  00!
> >  000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+!
 10!
>  00!
> >  
> > 00,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d6
> > 00000+100000,1:0x7d800000+10e00000]), flushing log
> > > 
> > > The above looks good, it is about to call _flush_and_sync_log() after this.
> > 
> > Yes, although this file is huge (0x19c2c265 = 432194149 ~ 400MB)... What rocksdb options did you pass in?  I'm guessing this is a log file, but we generally want those smallish (maybe 16MB - 64MB, so that L0 SST generation isn't too slow).
> > 
> > >     -5> 2016-08-31 17:55:56.931588 7faf14fff700 10 bluefs _flush_and_sync_log txn(seq 31278 len 0x50e146 crc 0x58a6b9ab)
> > >     -4> 2016-08-31 17:55:56.933951 7faf14fff700 10 bluefs _pad_bl padding with 0xe94 zeros
> > >     -3> 2016-08-31 17:55:56.934079 7faf14fff700 20 bluefs flush_bdev
> > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > >     -1> 2016-08-31 17:55:56.934274 7faf14fff700 10 bluefs _flush_range 0x7faf2e824140 pos 0xce9000 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > >      0> 2016-08-31 17:55:56.939745 7faf14fff700 -1
> > > os/bluestore/BlueFS.cc: In function 'int 
> > > BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' 
> > > thread
> > > 7faf14fff700 time 2016-08-31 17:55:56.934282
> > > os/bluestore/BlueFS.cc: 1390: FAILED assert(h->file->fnode.ino != 1)
> > > 
> > >  ceph version 11.0.0-1946-g9a5cfe2
> > > (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x80) [0x56073c27c7d0]
> > >  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, 
> > > unsigned
> > > long)+0x1d69) [0x56073bf4e109]
> > >  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) 
> > > [0x56073bf4e2d7]
> > >  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&,
> > > unsigned long, unsigned long)+0x443) [0x56073bf4fe13]
> > >  5: (BlueFS::_fsync(BlueFS::FileWriter*,
> > > std::unique_lock<std::mutex>&)+0x35b) [0x56073bf5140b]
> > >  6: (BlueRocksWritableFile::Sync()+0x62) [0x56073bf68be2]
> > >  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1)
> > > [0x56073c0f24b1]
> > >  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0) [0x56073c0f3960]
> > >  9: 
> > > (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status
> > > const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6)
> > > [0x56073c1354c6]
> > >  10: 
> > > (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::Compacti
> > > on
> > > Job::SubcompactionState*)+0x14ea) [0x56073c137c8a]
> > >  11: (rocksdb::CompactionJob::Run()+0x479) [0x56073c138c09]
> > >  12: (rocksdb::DBImpl::BackgroundCompaction(bool*,
> > > rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) 
> > > [0x56073c0275d0]
> > >  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf)
> > > [0x56073c03443f]
> > >  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) 
> > > [0x56073c0eb039]
> > >  15: (()+0x9900d3) [0x56073c0eb0d3]
> > >  16: (()+0x76fa) [0x7faf3d1106fa]
> > >  17: (clone()+0x6d) [0x7faf3af70b5d]
> > > 
> > > 
> > > Now, as you can see, it is calling _flush() with inode 1. Why? Is this expected?
> > 
> > Yes.  The metadata for the log file is dirty (the file size changed), so bluefs is flushing its journal (ino 1) to update the fnode.
> > 
> > But this is very concerning:
> > 
> > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush
> > > 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime
> > > 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > 
> > 0xce9000~50f000 is a ~5 MB write.  Why would we ever write that much to the bluefs metadata journal at once?  That's why it's blowing the runway.  
> > We have ~17MB allocated (0:0xba00000+500000,0:0xfd00000+c00000), we're at offset ~13MB, and we're writing ~5MB.
> > 
> > > Question :
> > > ------------
> > > 
> > > 1. Why are we using the existing log_writer to do a runway check?
> > > 
> > > https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#
> > > L1
> > > 280
> > > 
> > > Shouldn't the log_writer be reinitialized with the FileWriter that rocksdb sent with the sync call?
> > 
> > It's the bluefs journal writer.. that's the runway we're worried about.
> > 
> > > 2. The runway check is not considering the request length, so why 
> > > is it not expected to allocate here 
> > > (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc
> > > #L
> > > 1388)
> > > 
> > > If the snippet is not sufficient, let me know if you want me to upload the level 10 log or need 20/20 log to proceed further.
> > 
> > The level 10 log is probably enough...
> > 
> > Thanks!
> > sage
> > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Thursday, September 01, 2016 3:59 PM
> > > To: 'Sage Weil'
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > Sage,
> > > Created the following pull request on rocksdb repo, please take a look.
> > > 
> > > https://github.com/facebook/rocksdb/pull/1313
> > > 
> > > The fix is working fine for me.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Wednesday, August 31, 2016 6:20 AM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Tue, 30 Aug 2016, Somnath Roy wrote:
> > > > Sage,
> > > > I did some debugging on the rocksdb bug; here are my findings.
> > > > 
> > > > 1. The log file number is added to log_recycle_files and *not* to log_delete_files in the following if block, which is expected.
> > > > 
> > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f4
> > > > 60
> > > > 60
> > > > 9bf7cea4b63/db/db_impl.cc#L854
> > > > 
> > > > 2. But, it is there in the candidate list in the following loop.
> > > > 
> > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f4
> > > > 60
> > > > 60
> > > > 9bf7cea4b63/db/db_impl.cc#L1000
> > > > 
> > > > 
> > > > 3. This means it is added to full_scan_candidate_files by the 
> > > > following, i.e. from a full scan (?)
> > > > 
> > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f4
> > > > 60
> > > > 60
> > > > 9bf7cea4b63/db/db_impl.cc#L834
> > > > 
> > > > Added some log entries to verify , need to wait 6 hours :-(
> > > > 
> > > > 4. Probably #3 is not unusual, but the check in the following does not seem sufficient to keep the file.
> > > > 
> > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f4
> > > > 60
> > > > 60
> > > > 9bf7cea4b63/db/db_impl.cc#L1013
> > > > 
> > > > Again, I added some logging to see state.log_number during that time. BTW, the following check is probably a noop, as state.prev_log_number always comes back 0.
> > > > 
> > > > (number == state.prev_log_number)
> > > > 
> > > > 5. So, the quick solution I am thinking of is to add a check to see if the log is in the recycle list and avoid deleting it in the above part (?).
> > > 
> > > That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and ask in the comment if there is a better solution.
> > > 
> > > Probably the recycle list should be turned into a set so that the check is O(log n)...
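
Something along these lines, perhaps (sketch only; 'number' and 'type' come
from ParseFileName on the candidate, and the surrounding purge loop is elided):

  // skip any WAL that is queued for recycling rather than deleting it
  if (type == kLogFile &&
      std::find(log_recycle_files.begin(), log_recycle_files.end(),
                number) != log_recycle_files.end()) {
    continue;  // keep the file; it will be reused
  }
  // (turning log_recycle_files into a std::set<uint64_t> would make this
  //  lookup O(log n) instead of a linear scan)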
> > > 
> > > Thanks!
> > > sage
> > > 
> > > 
> > > > 
> > > > Let me know what you think.
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Sunday, August 28, 2016 7:37 AM
> > > > To: 'Sage Weil'
> > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > Sage,
> > > > Some updates on this.
> > > > 
> > > > 1. The issue is reproduced with the latest rocksdb master as well.
> > > > 
> > > > 2. The issue is *not happening* if I run with rocksdb log recycling disabled. This proves our root cause is right.
> > > > 
> > > > 3. Running some more performance tests with log recycling disabled, but the initial impression is that it introduces spikes and the output is not as stable as with log recycling enabled.
> > > > 
> > > > 4. Created a rocksdb issue for this
> > > > (https://github.com/facebook/rocksdb/issues/1303)
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Thursday, August 25, 2016 2:35 PM
> > > > To: 'Sage Weil'
> > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > Sage,
> > > > Hope you are able to download the log I shared via google doc.
> > > > It seems the bug is around this portion.
> > > > 
> > > > 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log 254 
> > > > to recycle list
> > > > 
> > > > 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log 256 
> > > > to recycle list
> > > > 
> > > > 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original Log 
> > > > Time
> > > > 2016/08/25-00:44:03.347467) [default] Level-0 commit table #258 
> > > > started
> > > > 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original Log 
> > > > Time
> > > > 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: 
> > > > memtable #1 done
> > > > 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original Log 
> > > > Time
> > > > 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: 
> > > > memtable #2 done
> > > > 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original Log 
> > > > Time
> > > > 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 
> > > > 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": 
> > > > [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> > > > 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original Log 
> > > > Time
> > > > 2016/08/25-00:44:03.348297) [default] Level summary: base level 1 
> > > > max bytes base 5368709120 files[3 4 0 0 0 0 0] max score 0.75
> > > > 
> > > > 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> > > > 
> > > > 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink 
> > > > db.wal/000256.log
> > > > 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had 
> > > > refs
> > > > 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 
> > > > bdev
> > > > 0 extents
> > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200
> > > > 00
> > > > 0+
> > > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0
> > > > :0
> > > > x1
> > > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link 
> > > > destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25
> > > > 00:41:26.298423 bdev 0 extents
> > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200
> > > > 00
> > > > 0+
> > > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0
> > > > :0
> > > > x1
> > > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink 
> > > > db.wal/000254.log
> > > > 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had 
> > > > refs
> > > > 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 
> > > > bdev
> > > > 0 extents
> > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400
> > > > 00
> > > > 0+
> > > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:
> > > > 0x
> > > > b4
> > > > 00000+800000,0:0xc000000+500000])
> > > > 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link 
> > > > destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25
> > > > 00:41:26.299110 bdev 0 extents
> > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400
> > > > 00
> > > > 0+
> > > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:
> > > > 0x
> > > > b4
> > > > 00000+800000,0:0xc000000+500000])
> > > > 
> > > > So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).
> > > > 
> > > > I was going through the rocksdb code and I found the following.
> > > > 
> > > > 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> > > > 
> > > > 2. The log is deleted from DBImpl::DeleteObsoleteFileImpl, which is *not* under the lock but iterates over log_delete_files. This is fishy, but it shouldn't be the reason for the same log number ending up in both lists.
> > > > 
> > > > 3. I checked all the places, but in the following place alive_log_files_ (within DBImpl::WriteImpl) is accessed without the lock.
> > > > 
> > > > 4625       alive_log_files_.back().AddSize(log_entry.size());   
> > > > 
> > > > Could it be reintroducing the same log number (254)? I am not sure.
> > > > 
> > > > In summary, it seems like a rocksdb bug, and setting recycle_log_file_num = 0 should *bypass* it. I need to check the performance impact of this, though.
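
For example, that workaround is a one-option change in the
bluestore_rocksdb_options string quoted earlier:

  recycle_log_file_num=0    # instead of recycle_log_file_num=16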
> > > > 
> > > > Should I post this to the rocksdb community, or is there some other place where I can get a response from the rocksdb folks?
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Wednesday, August 24, 2016 2:52 PM
> > > > To: 'Sage Weil'
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > Sage,
> > > > Thanks for looking; glad that we figured out something :-)
> > > > So, you want me to reproduce this with only debug_bluefs = 20/20? No bluestore log needed?
> > > > Hope my root partition doesn't get full; this crash happened after 6 hours :-)
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Wednesday, August 24, 2016 2:34 PM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > Sage, It is there in the following github link I posted 
> > > > > earlier..You can see 3 logs there..
> > > > > 
> > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c
> > > > > 68
> > > > > 7d
> > > > > 88
> > > > > a1b28fcc39
> > > > 
> > > > Ah sorry, got it.
> > > > 
> > > > And looking at the crash and the code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or the rocksdb recycling code.
> > > > 
> > > > I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> > > > 
> > > > sage
> > > > 
> > > > 
> > > > 
> > > >  >
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Wednesday, August 24, 2016 1:43 PM
> > > > > To: Somnath Roy
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > Sage,
> > > > > > I got the db assert log from submit_transaction in the following location.
> > > > > > 
> > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b3
> > > > > > 9c
> > > > > > 68
> > > > > > 7d
> > > > > > 88
> > > > > > a1b28fcc39
> > > > > > 
> > > > > > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > > > > > 
> > > > > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: 
> > > > > > reusing log 266 from recycle list
> > > > > > 
> > > > > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > > > > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > > > > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > > > > 
> > > > > > It seems it is trying to rename the WAL file and the old file is not found. You can see the transaction printed in the log along with the error code, like this.
> > > > > 
> > > > > How much of the log do you have? Can you post what you have somewhere?
> > > > > 
> > > > > Thanks!
> > > > > sage
> > > > > 
> > > > > 
> > > > > > 
> > > > > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > > > > submit_transaction error: NotFound:  code = 1 
> > > > > > rocksdb::WriteBatch = Put( Prefix = M key =
> > > > > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > > > > Put( Prefix = M key = 0x0000000000001483'._info') Put( Prefix 
> > > > > > = O key =
> > > > > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000
> > > > > > 00
> > > > > > 00
> > > > > > 89
> > > > > > ac
> > > > > > e7!'0xfffffffffffffffeffffffffffffffff)
> > > > > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B 
> > > > > > key =
> > > > > > 0x000004e73af72000) Merge( Prefix = T key = 
> > > > > > 'bluestore_statfs')
> > > > > > 
> > > > > > Hope my decoding of the key is proper; I have reused Bluestore's pretty_binary_string() after removing the first 2 bytes of the key, which are the prefix and a '0'.
> > > > > > 
> > > > > > Any suggestions on the next step for root-causing this db assert, if the log rename is not enough of a hint?
> > > > > > 
> > > > > > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting out the inode number asserts. I didn't hit those asserts, and no corruption so far. Seems like a bug in async compaction. Will try to reproduce that one with a verbose log later.
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Somnath Roy
> > > > > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > > > > To: 'Sage Weil'
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > I was running my tests for 2 hours and it happened within that time.
> > > > > > I will try to reproduce with 1/20.
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > > > > To: Somnath Roy
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > Sage,
> > > > > > > I think there is some bug introduced recently in BlueFS, and I am 
> > > > > > > getting corruption like this which I was not seeing earlier.
> > > > > > 
> > > > > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > > > > 
> > > > > > Any idea how long you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > > > > > 
> > > > > > Thanks!
> > > > > > sage
> > > > > > 
> > > > > > > 
> > > > > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > > > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > > > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > > > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > > > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > > > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: 
> > > > > > > In function 'OSDMapRef OSDService::get_map(epoch_t)' thread
> > > > > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > > > > 
> > > > > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > > > > char
> > > > > > > const*)+0x80) [0x5617f2a99e80]
> > > > > > >  2: (OSDService::get_map(unsigned int)+0x5d) 
> > > > > > > [0x5617f2395fdd]
> > > > > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > > > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > > > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > > > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > > > > 
> > > > > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Somnath Roy
> > > > > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > > > > To: 'Sage Weil'
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > > > > Here is the option I am using..
> > > > > > > 
> > > > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > > > 
> > > > > > > Here is another one after 4 hours of 4K RW :-)... Sorry to bombard you with all of these; I am adding more logging for next time, in case I hit it again.
> > > > > > > 
> > > > > > > 
> > > > > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > > > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > > > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > > > > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > > > > 
> > > > > > >  ceph version 11.0.0-1688-g3fcc89c
> > > > > > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > > > > char
> > > > > > > const*)+0x80) [0x5581ed453cb0]
> > > > > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned 
> > > > > > > long, unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > > > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned 
> > > > > > > long, rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > > > > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, 
> > > > > > > unsigned long, rocksdb::Slice*, char*) const+0x83f) 
> > > > > > > [0x5581ed2c6f4f]
> > > > > > >  5: 
> > > > > > > (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*
> > > > > > > , rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > > > > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > > > > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > > > > > rocksdb::PersistentCacheOptions const&,
> > > > > > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > > > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > > > > >  7: 
> > > > > > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::Blo
> > > > > > > ck Ba se dT ab le::Rep*, rocksdb::ReadOptions const&, 
> > > > > > > rocksdb::Slice const&,
> > > > > > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > > > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions
> > > > > > > const&, rocksdb::Slice const&, rocksdb::GetContext*,
> > > > > > > bool)+0x508) [0x5581ed28ba68]
> > > > > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > > > > > rocksdb::InternalKeyComparator const&, 
> > > > > > > rocksdb::FileDescriptor const&, rocksdb::Slice const&, 
> > > > > > > rocksdb::GetContext*, rocksdb::HistogramImpl*, bool, 
> > > > > > > int)+0x158) [0x5581ed252118]
> > > > > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > > > > > rocksdb::LookupKey const&, std::__cxx11::basic_string<char, 
> > > > > > > std::char_traits<char>, std::allocator<char> >*, 
> > > > > > > rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, 
> > > > > > > unsigned
> > > > > > > long*)+0x4f8) [0x5581ed25c458]
> > > > > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, 
> > > > > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > > > > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > > > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > > > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > > > > > std::char_traits<char>, std::allocator<char> > const&, 
> > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > > > > > [0x5581ed1d21d7]
> > > > > > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > > > > > bool)+0x55b) [0x5581ed02802b]
> > > > > > >  15: 
> > > > > > > (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > > > > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > > > > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > > > > > std::vector<ObjectStore::Transaction,
> > > > > > > std::allocator<ObjectStore::Transaction> >&, 
> > > > > > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > > > > > [0x5581ed032bc2]
> > > > > > >  17: 
> > > > > > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore::T
> > > > > > > ra ns ac ti on , std::allocator<ObjectStore::Transaction> 
> > > > > > > >&,
> > > > > > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > > > > >  18: 
> > > > > > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>
> > > > > > > )+
> > > > > > > 0x
> > > > > > > d3
> > > > > > > 9)
> > > > > > > [0x5581ecef89e9]
> > > > > > >  19: 
> > > > > > > (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest
> > > > > > > >)
> > > > > > > +0
> > > > > > > x2
> > > > > > > fb
> > > > > > > )
> > > > > > > [0x5581ecefeb4b]
> > > > > > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > > > > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > > > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > > > > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > > > > > [0x5581eccdd2e9]
> > > > > > >  22: 
> > > > > > > (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>
> > > > > > > const&)+0x52) [0x5581eccdd542]
> > > > > > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > > > > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > > > > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > > > > > int)+0x89f) [0x5581ed440b2f]
> > > > > > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > > > > > [0x5581ed4441f0]
> > > > > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > > > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > > > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > > > 
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > > > > To: Somnath Roy
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > > > > compaction during this time.
> > > > > > > 
> > > > > > > How are you selecting universal compaction?
> > > > > > > 
> > > > > > > sage
Somnath Roy Sept. 2, 2016, 8:18 p.m. UTC | #38
Sage,
I am running with big runway values now (min 100 MB, max 400 MB) and will keep you posted on this.
One point: if I give it these big runway values, the allocation will be very frequent (and probably unnecessary in most cases); is there any harm in that?

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Friday, September 02, 2016 12:27 PM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Fri, 2 Sep 2016, Somnath Roy wrote:
> Sage,
> It is probably the universal compaction that is generating bigger files; other than that, I don't see how the following tuning would generate large files.
> I will try some universal compaction tuning related to file size and confirm.
> Yeah, a big bluefs_min_log_runway value will probably shield the assert for now, but I am afraid that since we don't have control over the file size, we can't be sure what bluefs_min_log_runway value will keep the assert from hitting in future long runs.
> Can't we do it like this?
> 
> // Basically, checking the length of the log as well
> if (runway < g_conf->bluefs_min_log_runway || runway < log_writer->buffer.length()) {
>   // allocate
> }

Oh, I see what you mean.  Yeah, I'll add that in--certainly doesn't hurt.

And I think just configuring a long runway won't hurt either (e.g., 100MB).

That's probably enough to be safe, but once we fix the flush thing I mentioned, that will make this go away.

s

> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, September 02, 2016 10:57 AM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Fri, 2 Sep 2016, Somnath Roy wrote:
> > Here is my rocksdb option :
> > 
> >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > 
> > One discrepancy I can see here is max_bytes_for_level_base; it should 
> > be the same as the level 0 size. Initially, I had a bigger 
> > min_write_buffer_number_to_merge and that's how I calculated it. Now, 
> > the level 0 size is the following:
> > 
> > write_buffer_size * min_write_buffer_number_to_merge * 
> > level0_file_num_compaction_trigger = 80MB * 2 * 4 = ~640MB
> > 
> > I should probably adjust max_bytes_for_level_base to a similar value.
> > 
> > Please find the level 10 log here. This is the log I captured during replay (it crashed again) after it crashed originally for the same reason.
> > 
> > https://drive.google.com/file/d/0B7W-S0z_ymMJSnA3R2dyellYZ0U/view?us
> > p=
> > sharing
> > 
> > Thanks for the explanation; I now understand why it is trying to flush 
> > inode 1.
> > 
> > But shouldn't we check the length as well during the runway check, rather 
> > than relying on bluefs_min_log_runway only?
> 
> That's what this does:
> 
>   uint64_t runway = log_writer->file->fnode.get_allocated() - 
> log_writer->pos;
> 
> Anyway, I think I see what's going on.  There are a ton of _fsync and _flush_range calls that have to flush the fnode, and the fnode is pretty big (maybe 5k?) because it has so many extents (your tunables are generating really big files).
> 
> I think this is just a matter of the runway configurable being too small for your configuration.  Try bumping bluefs_min_log_runway by 10x.
> 
> Well, actually, we could improve this a bit.  Right now rocksdb is calling lots of flushes on a big sst, and a final fsync at the end.  Bluefs is logging the updated fnode every time the flush changes the file size, and then only writing it to disk when the final fsync happens.  Instead, it could/should put the dirty fnode on a list and, only at log flush time, append the latest fnodes to the log and write it out.
> 
> I'll add it to the trello board.  I think it's not that big a deal.. 
> except when you have really big files.
> 
> sage
> 
> 
>  >
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, September 02, 2016 9:35 AM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > 
> > 
> > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > 
> > > Sage,
> > > Tried to do some analysis on the inode assert; the following looks suspicious.
> > > 
> > >    -10> 2016-08-31 17:55:56.921075 7faf14fff700 10 bluefs _fsync 0x7faf12075140 file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1:0x7!
 53!
>  00!
> >  000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d000000+!
 10!
>  00!
> >  
> > 00,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7
> > d8
> > 00000+10e00000])
> > >     -9> 2016-08-31 17:55:56.921117 7faf14fff700 10 bluefs _flush 0x7faf12075140 no dirty data on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x7510!
 00!
>  00!
> >  +100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100!
 00!
>  0,!
> >  
> > 1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d60
> > 00
> > 00+100000,1:0x7d800000+10e00000])
> > >     -8> 2016-08-31 17:55:56.921137 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140
> > >     -7> 2016-08-31 17:55:56.931541 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140 done in 0.010402
> > >     -6> 2016-08-31 17:55:56.931551 7faf14fff700 20 bluefs _fsync file metadata was dirty (31278) on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x7!
 51!
>  00!
> >  000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+!
 10!
>  00!
> >  
> > 00,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7
> > d6
> > 00000+100000,1:0x7d800000+10e00000]), flushing log
> > > 
> > > The above looks good, it is about to call _flush_and_sync_log() after this.
> > 
> > Yes, although this file is huge (0x19c2c265 = 432194149 ~ 400MB)... What rocksdb options did you pass in?  I'm guessing this is a log file, but we generally want those smallish (maybe 16MB - 64MB, so that L0 SST generation isn't too slow).
> > 
> > >     -5> 2016-08-31 17:55:56.931588 7faf14fff700 10 bluefs _flush_and_sync_log txn(seq 31278 len 0x50e146 crc 0x58a6b9ab)
> > >     -4> 2016-08-31 17:55:56.933951 7faf14fff700 10 bluefs _pad_bl padding with 0xe94 zeros
> > >     -3> 2016-08-31 17:55:56.934079 7faf14fff700 20 bluefs flush_bdev
> > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > >     -1> 2016-08-31 17:55:56.934274 7faf14fff700 10 bluefs _flush_range 0x7faf2e824140 pos 0xce9000 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > >      0> 2016-08-31 17:55:56.939745 7faf14fff700 -1
> > > os/bluestore/BlueFS.cc: In function 'int 
> > > BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)'
> > > thread
> > > 7faf14fff700 time 2016-08-31 17:55:56.934282
> > > os/bluestore/BlueFS.cc: 1390: FAILED assert(h->file->fnode.ino != 
> > > 1)
> > > 
> > >  ceph version 11.0.0-1946-g9a5cfe2
> > > (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x80) [0x56073c27c7d0]
> > >  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, 
> > > unsigned
> > > long)+0x1d69) [0x56073bf4e109]
> > >  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) 
> > > [0x56073bf4e2d7]
> > >  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&,
> > > unsigned long, unsigned long)+0x443) [0x56073bf4fe13]
> > >  5: (BlueFS::_fsync(BlueFS::FileWriter*,
> > > std::unique_lock<std::mutex>&)+0x35b) [0x56073bf5140b]
> > >  6: (BlueRocksWritableFile::Sync()+0x62) [0x56073bf68be2]
> > >  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1)
> > > [0x56073c0f24b1]
> > >  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0) 
> > > [0x56073c0f3960]
> > >  9: 
> > > (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Statu
> > > s const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6)
> > > [0x56073c1354c6]
> > >  10: 
> > > (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::Compac
> > > ti
> > > on
> > > Job::SubcompactionState*)+0x14ea) [0x56073c137c8a]
> > >  11: (rocksdb::CompactionJob::Run()+0x479) [0x56073c138c09]
> > >  12: (rocksdb::DBImpl::BackgroundCompaction(bool*,
> > > rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) 
> > > [0x56073c0275d0]
> > >  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf)
> > > [0x56073c03443f]
> > >  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) 
> > > [0x56073c0eb039]
> > >  15: (()+0x9900d3) [0x56073c0eb0d3]
> > >  16: (()+0x76fa) [0x7faf3d1106fa]
> > >  17: (clone()+0x6d) [0x7faf3af70b5d]
> > > 
> > > 
> > > Now, as you can see it is calling _flush() with inode 1 , why ? is this expected ?
> > 
> > Yes.  The metadata for the log file is dirty (the file size changed), so bluefs is flushing its journal (ino 1) to update the fnode.
> > 
> > But this is very concerning:
> > 
> > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush
> > > 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime
> > > 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > 
> > 0xce9000~50f000 is a ~5 MB write.  Why would we ever write that much to the bluefs metadata journal at once?  That's why it's blowing the runway.
> > We have ~17MB allocated (0:0xba00000+500000,0:0xfd00000+c00000), we're at offset ~13MB, and we're writing ~5MB.
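> > 
> > (Spelling out the hex: the offset 0xce9000 is ~12.9 MiB, the write length 0x50f000 is ~5.1 MiB, and the allocation 0x500000 + 0xc00000 = 0x1100000 is 17 MiB, so 12.9 + 5.1 = ~18 MiB overruns the allocated runway by roughly 1 MiB.)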
> > 
> > > Question :
> > > ------------
> > > 
> > > 1. Why we are using the existing log_write to do a runway check ?
> > > 
> > > https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.c
> > > c#
> > > L1
> > > 280
> > > 
> > > Shouldn't the log_writer need to be reinitialized with the FileWriter rocksdb sent with the sync call?
> > 
> > It's the bluefs journal writer.. that's the runway we're worried about.
> > 
> > > 2. The runway check is not considering the request length, so why
> > > is it not expected to allocate here 
> > > (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.
> > > cc
> > > #L
> > > 1388)
> > > 
> > > If the snippet is not sufficient, let me know if you want me to upload the level 10 log or need 20/20 log to proceed further.
> > 
> > The level 10 log is probably enough...
> > 
> > Thanks!
> > sage
> > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > 
> > > -----Original Message-----
> > > From: Somnath Roy
> > > Sent: Thursday, September 01, 2016 3:59 PM
> > > To: 'Sage Weil'
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > Sage,
> > > Created the following pull request on rocksdb repo, please take a look.
> > > 
> > > https://github.com/facebook/rocksdb/pull/1313
> > > 
> > > The fix is working fine for me.
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Wednesday, August 31, 2016 6:20 AM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Tue, 30 Aug 2016, Somnath Roy wrote:
> > > > Sage,
> > > > I did some debugging on the rocksdb bug., here is my findings.
> > > > 
> > > > 1. The log file number is added to log_recycle_files and *not* in log_delete_files from the following if loop, which is expected.
> > > > 
> > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698
> > > > f4
> > > > 60
> > > > 60
> > > > 9bf7cea4b63/db/db_impl.cc#L854
> > > > 
> > > > 2. But, it is there in the candidate list in the following loop.
> > > > 
> > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698
> > > > f4
> > > > 60
> > > > 60
> > > > 9bf7cea4b63/db/db_impl.cc#L1000
> > > > 
> > > > 
> > > > 3. This means it is added to full_scan_candidate_files by the
> > > > following, i.e. from a full scan (?)
> > > > 
> > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698
> > > > f4
> > > > 60
> > > > 60
> > > > 9bf7cea4b63/db/db_impl.cc#L834
> > > > 
> > > > Added some log entries to verify , need to wait 6 hours :-(
> > > > 
> > > > 4. Probably #3 is not unusual, but the check in the following does not seem sufficient to keep the file.
> > > > 
> > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698
> > > > f4
> > > > 60
> > > > 60
> > > > 9bf7cea4b63/db/db_impl.cc#L1013
> > > > 
> > > > Again, added some log to see state.log_number during that time. BTW, the following check is probably a noop, as state.prev_log_number always comes back 0.
> > > > 
> > > > (number == state.prev_log_number)
> > > > 
> > > > 5. So, the quick solution I am thinking is to put a check to see if the log is in recycle list and avoid deleting from the above part (?).
> > > 
> > > That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and ask in the comment if there is a better solution.
> > > 
> > > Probably the recycle list should be turned into a set so that the check is O(log n)...
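> > > 
> > > For illustration, a minimal sketch of that check (hypothetical helper and container names, not the actual rocksdb members):
> > > 
> > > #include <cstdint>
> > > #include <set>
> > > #include <vector>
> > > 
> > > // Sketch only: drop WAL numbers from the delete candidates when they are
> > > // also on the recycle list, using a set so each lookup is O(log n).
> > > std::vector<uint64_t> filter_recycled_logs(
> > >     const std::vector<uint64_t>& candidates,
> > >     const std::vector<uint64_t>& recycle_list) {
> > >   std::set<uint64_t> recycle(recycle_list.begin(), recycle_list.end());
> > >   std::vector<uint64_t> to_delete;
> > >   for (uint64_t number : candidates) {
> > >     if (recycle.count(number))
> > >       continue;  // still queued for reuse; keep the file
> > >     to_delete.push_back(number);
> > >   }
> > >   return to_delete;
> > > }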
> > > 
> > > Thanks!
> > > sage
> > > 
> > > 
> > > > 
> > > > Let me know what you think.
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Sunday, August 28, 2016 7:37 AM
> > > > To: 'Sage Weil'
> > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > Sage,
> > > > Some updates on this.
> > > > 
> > > > 1. The issue is reproduced with the latest rocksdb master as well.
> > > > 
> > > > 2. The issue is *not happening* if I run with rocksdb log recycling disabled. This proves our root cause is right.
> > > > 
> > > > 3. Running some more performance tests with log recycling disabled, but the initial impression is that it introduces spikes and the output is not as stable as with log recycling enabled.
> > > > 
> > > > 4. Created a rocksdb issue for this
> > > > (https://github.com/facebook/rocksdb/issues/1303)
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Thursday, August 25, 2016 2:35 PM
> > > > To: 'Sage Weil'
> > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > Sage,
> > > > Hope you are able to download the log I shared via google doc.
> > > > It seems the bug is around this portion.
> > > > 
> > > > 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log 
> > > > 254 to recycle list
> > > > 
> > > > 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log 
> > > > 256 to recycle list
> > > > 
> > > > 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original 
> > > > Log Time
> > > > 2016/08/25-00:44:03.347467) [default] Level-0 commit table #258 
> > > > started
> > > > 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original 
> > > > Log Time
> > > > 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: 
> > > > memtable #1 done
> > > > 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original 
> > > > Log Time
> > > > 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: 
> > > > memtable #2 done
> > > > 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original 
> > > > Log Time
> > > > 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 
> > > > 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": 
> > > > [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> > > > 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original 
> > > > Log Time
> > > > 2016/08/25-00:44:03.348297) [default] Level summary: base level 
> > > > 1 max bytes base 5368709120 files[3 4 0 0 0 0 0] max score 0.75
> > > > 
> > > > 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> > > > 
> > > > 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink 
> > > > db.wal/000256.log
> > > > 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had 
> > > > refs
> > > > 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 
> > > > bdev
> > > > 0 extents
> > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe2
> > > > 00
> > > > 00
> > > > 0+
> > > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000
> > > > ,0
> > > > :0
> > > > x1
> > > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link 
> > > > destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25
> > > > 00:41:26.298423 bdev 0 extents
> > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe2
> > > > 00
> > > > 00
> > > > 0+
> > > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000
> > > > ,0
> > > > :0
> > > > x1
> > > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink 
> > > > db.wal/000254.log
> > > > 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had 
> > > > refs
> > > > 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 
> > > > bdev
> > > > 0 extents
> > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x84
> > > > 00
> > > > 00
> > > > 0+
> > > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:
> > > > 0x
> > > > b4
> > > > 00000+800000,0:0xc000000+500000])
> > > > 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link 
> > > > destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25
> > > > 00:41:26.299110 bdev 0 extents
> > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x84
> > > > 00
> > > > 00
> > > > 0+
> > > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:
> > > > 0x
> > > > b4
> > > > 00000+800000,0:0xc000000+500000])
> > > > 
> > > > So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).
> > > > 
> > > > I was going through the rocksdb code and I found the following.
> > > > 
> > > > 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> > > > 
> > > > 2. The log is deleted from DBImpl::DeleteObsoleteFileImpl, which is *not* under the lock but iterating over log_delete_files. This is fishy, but it shouldn't be the reason the same log number ends up in two lists.
> > > > 
> > > > 3. I looked at all the places, and the following is the only place where alive_log_files_ (within DBImpl::WriteImpl) is accessed without the lock.
> > > > 
> > > > 4625       alive_log_files_.back().AddSize(log_entry.size());   
> > > > 
> > > > Could that be reintroducing the same log number (254)? I am not sure.
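> > > > 
> > > > If that unguarded access is the culprit, the narrowest change would be to take the same mutex around it. A self-contained sketch of the pattern (stand-in types and names, not the actual rocksdb code):
> > > > 
> > > > #include <cstdint>
> > > > #include <deque>
> > > > #include <mutex>
> > > > 
> > > > struct LogFileNumberSize {            // stand-in for the rocksdb struct
> > > >   uint64_t number;
> > > >   uint64_t size = 0;
> > > >   void AddSize(uint64_t s) { size += s; }
> > > > };
> > > > 
> > > > std::mutex mu;                        // stand-in for DBImpl::mutex_
> > > > std::deque<LogFileNumberSize> alive_log_files;
> > > > 
> > > > void add_to_newest_log(uint64_t entry_size) {
> > > >   std::lock_guard<std::mutex> l(mu);  // WriteImpl currently skips this
> > > >   alive_log_files.back().AddSize(entry_size);
> > > > }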
> > > > 
> > > > In summary, it seems to be a rocksdb bug, and setting recycle_log_file_num = 0 should *bypass* it. I need to check the performance impact of this though.
> > > > 
> > > > Should I post this to rocksdb community or any other place from where I can get response from rocksdb folks ?
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Wednesday, August 24, 2016 2:52 PM
> > > > To: 'Sage Weil'
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > Sage,
> > > > Thanks for looking , glad that we figured out something :-)..
> > > > So, you want me to reproduce this with only debug_bluefs = 20/20 ? Don't need bluestore log ?
> > > > Hope my root partition doesn't get full , this crash happened 
> > > > after
> > > > 6 hours :-) ,
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Wednesday, August 24, 2016 2:34 PM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > Sage, It is there in the following github link I posted 
> > > > > earlier..You can see 3 logs there..
> > > > > 
> > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b3
> > > > > 9c
> > > > > 68
> > > > > 7d
> > > > > 88
> > > > > a1b28fcc39
> > > > 
> > > > Ah sorry, got it.
> > > > 
> > > > And looking at the crash and code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or the rocksdb recycling code.
> > > > 
> > > > I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> > > > 
> > > > sage
> > > > 
> > > > 
> > > > 
> > > >  >
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Wednesday, August 24, 2016 1:43 PM
> > > > > To: Somnath Roy
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > Sage,
> > > > > > I got the db assert log from submit_transaction in the following location.
> > > > > > 
> > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017
> > > > > > b3
> > > > > > 9c
> > > > > > 68
> > > > > > 7d
> > > > > > 88
> > > > > > a1b28fcc39
> > > > > > 
> > > > > > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > > > > > 
> > > > > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: 
> > > > > > reusing log 266 from recycle list
> > > > > > 
> > > > > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > > > > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > > > > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > > > > 
> > > > > > It is trying to rename the wal file it seems and old file is not found. You can see the transaction printed in the log along with error code like this.
> > > > > 
> > > > > How much of the log do you have? Can you post what you have somewhere?
> > > > > 
> > > > > Thanks!
> > > > > sage
> > > > > 
> > > > > 
> > > > > > 
> > > > > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > > > > submit_transaction error: NotFound:  code = 1 
> > > > > > rocksdb::WriteBatch = Put( Prefix = M key =
> > > > > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > > > > Put( Prefix = M key = 0x0000000000001483'._info') Put( 
> > > > > > Prefix = O key =
> > > > > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.0000
> > > > > > 00
> > > > > > 00
> > > > > > 00
> > > > > > 89
> > > > > > ac
> > > > > > e7!'0xfffffffffffffffeffffffffffffffff)
> > > > > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B 
> > > > > > key =
> > > > > > 0x000004e73af72000) Merge( Prefix = T key =
> > > > > > 'bluestore_statfs')
> > > > > > 
> > > > > > Hope my decoding of the key is correct; I reused BlueStore's pretty_binary_string() after removing the first 2 bytes of the key, which are the prefix and a '0'.
> > > > > > 
> > > > > > Any suggestion on the next step of root causing this db assert if log rename is not enough hint ?
> > > > > > 
> > > > > > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting out the inode number asserts. I didn't hit those asserts, and no corruption so far. Seems like a bug in async compaction. Will try to reproduce that one with a verbose log later.
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Somnath Roy
> > > > > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > > > > To: 'Sage Weil'
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > I was running my tests for 2 hours and it happened within that time.
> > > > > > I will try to reproduce with 1/20.
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > > > > To: Somnath Roy
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > Sage,
> > > > > > > I think there is some bug introduced recently in
> > > > > > > BlueFS, and I am getting corruption like this which I was not seeing earlier.
> > > > > > 
> > > > > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > > > > 
> > > > > > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
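> > > > > > 
> > > > > > For example, both of those would go in the [osd] section of ceph.conf roughly like this (a sketch of the two settings named above):
> > > > > > 
> > > > > >   bluefs compact log sync = true   # fall back to synchronous bluefs log compaction
> > > > > >   debug bluefs = 1/20              # quiet at runtime, verbose dump of recent history on crash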
> > > > > > 
> > > > > > Thanks!
> > > > > > sage
> > > > > > 
> > > > > > > 
> > > > > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > > > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > > > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > > > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > > > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > > > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: 
> > > > > > > In function 'OSDMapRef OSDService::get_map(epoch_t)' 
> > > > > > > thread
> > > > > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > > > > 
> > > > > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > int, char
> > > > > > > const*)+0x80) [0x5617f2a99e80]
> > > > > > >  2: (OSDService::get_map(unsigned int)+0x5d) 
> > > > > > > [0x5617f2395fdd]
> > > > > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > > > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > > > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > > > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > > > > 
> > > > > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Somnath Roy
> > > > > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > > > > To: 'Sage Weil'
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > > > > Here is the option I am using..
> > > > > > > 
> > > > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > > > 
> > > > > > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > > > > > 
> > > > > > > 
> > > > > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > > > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > > > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > > > > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > > > > 
> > > > > > >  ceph version 11.0.0-1688-g3fcc89c
> > > > > > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > int, char
> > > > > > > const*)+0x80) [0x5581ed453cb0]
> > > > > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned 
> > > > > > > long, unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > > > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, 
> > > > > > > unsigned long, rocksdb::Slice*, char*) const+0x20) 
> > > > > > > [0x5581ed13f840]
> > > > > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, 
> > > > > > > unsigned long, rocksdb::Slice*, char*) const+0x83f) 
> > > > > > > [0x5581ed2c6f4f]
> > > > > > >  5: 
> > > > > > > (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReade
> > > > > > > r* , rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > > > > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > > > > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > > > > > rocksdb::PersistentCacheOptions const&,
> > > > > > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > > > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > > > > >  7: 
> > > > > > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::B
> > > > > > > lo ck Ba se dT ab le::Rep*, rocksdb::ReadOptions const&, 
> > > > > > > rocksdb::Slice const&,
> > > > > > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > > > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions
> > > > > > > const&, rocksdb::Slice const&, rocksdb::GetContext*,
> > > > > > > bool)+0x508) [0x5581ed28ba68]
> > > > > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > > > > > rocksdb::InternalKeyComparator const&, 
> > > > > > > rocksdb::FileDescriptor const&, rocksdb::Slice const&, 
> > > > > > > rocksdb::GetContext*, rocksdb::HistogramImpl*, bool,
> > > > > > > int)+0x158) [0x5581ed252118]
> > > > > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > > > > > rocksdb::LookupKey const&, 
> > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > std::allocator<char> >*, rocksdb::Status*, 
> > > > > > > rocksdb::MergeContext*, bool*, bool*, unsigned
> > > > > > > long*)+0x4f8) [0x5581ed25c458]
> > > > > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions 
> > > > > > > const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice 
> > > > > > > const&, std::__cxx11::basic_string<char, 
> > > > > > > std::char_traits<char>, std::allocator<char> >*, 
> > > > > > > bool*)+0x5fa) [0x5581ed1f3f7a]
> > > > > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > > > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > > > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > > > > > std::char_traits<char>, std::allocator<char> > const&, 
> > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > > > > > [0x5581ed1d21d7]
> > > > > > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > > > > > bool)+0x55b) [0x5581ed02802b]
> > > > > > >  15: 
> > > > > > > (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > > > > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > > > > >  16: 
> > > > > > > (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > > > > > std::vector<ObjectStore::Transaction,
> > > > > > > std::allocator<ObjectStore::Transaction> >&, 
> > > > > > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > > > > > [0x5581ed032bc2]
> > > > > > >  17: 
> > > > > > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore:
> > > > > > > :T ra ns ac ti on , 
> > > > > > > std::allocator<ObjectStore::Transaction>
> > > > > > > >&,
> > > > > > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > > > > >  18: 
> > > > > > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpReques
> > > > > > > t>
> > > > > > > )+
> > > > > > > 0x
> > > > > > > d3
> > > > > > > 9)
> > > > > > > [0x5581ecef89e9]
> > > > > > >  19: 
> > > > > > > (ReplicatedBackend::handle_message(std::shared_ptr<OpReque
> > > > > > > st
> > > > > > > >)
> > > > > > > +0
> > > > > > > x2
> > > > > > > fb
> > > > > > > )
> > > > > > > [0x5581ecefeb4b]
> > > > > > >  20: 
> > > > > > > (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > > > > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > > > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > > > > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > > > > > [0x5581eccdd2e9]
> > > > > > >  22: 
> > > > > > > (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest
> > > > > > > >
> > > > > > > const&)+0x52) [0x5581eccdd542]
> > > > > > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > > > > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > > > > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > > > > > int)+0x89f) [0x5581ed440b2f]
> > > > > > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > > > > > [0x5581ed4441f0]
> > > > > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > > > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > > > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > > > 
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > > > > To: Somnath Roy
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > > > > compaction during this time.
> > > > > > > 
> > > > > > > How are you selecting universal compaction?
> > > > > > > 
> > > > > > > sage
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> 
> 
Sage Weil Sept. 2, 2016, 8:24 p.m. UTC | #39
On Fri, 2 Sep 2016, Somnath Roy wrote:
> Sage,
> I am running with big runway values now (min 100 MB, max 400MB) and will keep you posted on this.
> One point: if I give it these big runway values, allocation will be very frequent (and probably unnecessary in most cases); no harm with that?

I think it'll actually be less frequent, since it allocates 
bluefs_max_log_runway at a time.  Well, assuming you set that tunable as 
high as well!
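
For reference, the runway values being tested would look something like 
this in ceph.conf (a sketch, with the 100 MB and 400 MB expressed in bytes):

  [osd]
    bluefs_min_log_runway = 104857600   # 100 MB
    bluefs_max_log_runway = 419430400   # 400 MB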

sage

> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com] 
> Sent: Friday, September 02, 2016 12:27 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Fri, 2 Sep 2016, Somnath Roy wrote:
> > Sage,
> > It is probably the universal compaction that is generating bigger files; other than that, I don't see how the following tuning would generate large files.
> > I will try some universal compaction tuning related to file size and confirm.
> > Yeah, a big bluefs_min_log_runway value will probably shield us from the assert for now, but I am afraid that since we don't have control over the file size, we can't be sure what value of bluefs_min_log_runway would keep the assert from hitting in future long runs.
> > Can't we do something like this?
> > 
> > // Basically, check the length of the log as well
> > if ((runway < g_conf->bluefs_min_log_runway) ||
> >     (runway < log_writer->buffer.length())) {
> >   // allocate
> > }
> 
> Oh, I see what you mean.  Yeah, I'll add that in--certainly doesn't hurt.
> 
> And I think just configuring a long runway won't hurt either (e.g., 100MB).
> 
> That's probably enough to be safe, but once we fix the flush thing I mentioned that will make this go away.
> 
> s
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, September 02, 2016 10:57 AM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > Here is my rocksdb option :
> > > 
> > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > 
> > > One discrepancy I can see here is max_bytes_for_level_base; it
> > > should be the same as the level 0 size. Initially, I had a bigger
> > > min_write_buffer_number_to_merge and that's how I calculated it. Now,
> > > the level 0 size is the following:
> > > 
> > > write_buffer_size * min_write_buffer_number_to_merge * 
> > > level0_file_num_compaction_trigger = 80MB * 2 * 4 = ~640MB
> > > 
> > > I should probably adjust max_bytes_for_level_base to a similar value.
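> > > 
> > > i.e. changing just that one entry in the options string to something like (640 MB in bytes):
> > > 
> > >   max_bytes_for_level_base=671088640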
> > > 
> > > Please find the level 10 log here. The log I captured during replay (crashed again) after it crashed originally because of the same reason.
> > > 
> > > https://drive.google.com/file/d/0B7W-S0z_ymMJSnA3R2dyellYZ0U/view?us
> > > p=
> > > sharing
> > > 
> > > Thanks for the explanation; I now get why it is trying to flush
> > > inode 1.
> > > 
> > > But shouldn't we check the length as well during the runway check,
> > > rather than just relying on bluefs_min_log_runway only?
> > 
> > That's what this does:
> > 
> >   uint64_t runway = log_writer->file->fnode.get_allocated() - 
> > log_writer->pos;
> > 
> > Anyway, I think I see what's going on.  There are a ton of _fsync and _flush_range calls that have to flush the fnode, and the fnode is pretty big (maybe 5k?) because it has so many extents (your tunables are generating really big files).
> > 
> > I think this is just a matter of the runway configurable being too small for your configuration.  Try bumping bluefs_min_log_runway by 10x.
> > 
> > Well, actually, we could improve this a bit.  Right now rocksdb is calling lots of flushes on a big sst, and a final fsync at the end.  Bluefs is logging the updated fnode every time the flush changes the file size, and then only writing it to disk when the final fsync happens.  Instead, it could/should put the dirty fnode on a list and, only at log flush time, append the latest fnodes to the log and write it out.
> > 
> > I'll add it to the trello board.  I think it's not that big a deal.. 
> > except when you have really big files.
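> > 
> > A rough sketch of that idea (hypothetical names, not the current BlueFS code): remember which files have a dirty fnode and emit their updates only when the bluefs journal itself is flushed:
> > 
> > #include <set>
> > 
> > struct File;                      // stand-in for the bluefs file handle
> > 
> > std::set<File*> dirty_files;      // fnode changed since the last log flush
> > 
> > void note_dirty(File* f) {
> >   dirty_files.insert(f);          // cheap; no journal append here
> > }
> > 
> > void flush_and_sync_log() {
> >   for (File* f : dirty_files) {
> >     (void)f;  // append one journal update carrying the latest fnode for f
> >   }
> >   dirty_files.clear();
> >   // ... then pad, flush and sync the journal as today ...
> > }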
> > 
> > sage
> > 
> > 
> >  >
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Friday, September 02, 2016 9:35 AM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > 
> > > 
> > > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > 
> > > > Sage,
> > > > Tried to do some analysis on the inode assert, following looks suspicious.
> > > > 
> > > >    -10> 2016-08-31 17:55:56.921075 7faf14fff700 10 bluefs _fsync 0x7faf12075140 file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d800000+10e00000])
> > > >     -9> 2016-08-31 17:55:56.921117 7faf14fff700 10 bluefs _flush 0x7faf12075140 no dirty data on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d800000+10e00000])
> > > >     -8> 2016-08-31 17:55:56.921137 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140
> > > >     -7> 2016-08-31 17:55:56.931541 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140 done in 0.010402
> > > >     -6> 2016-08-31 17:55:56.931551 7faf14fff700 20 bluefs _fsync file metadata was dirty (31278) on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0x7d800000+10e00000]), flushing log
> > > > 
> > > > The above looks good, it is about to call _flush_and_sync_log() after this.
> > > 
> > > Yes, although this file is huge (0x19c2c265 = 432194149 ~ 400MB)... What rocksdb options did you pass in?  I'm guessing this is a log file, but we generally want those smallish (maybe 16MB - 64MB, so that L0 SST generation isn't too slow).
> > > 
> > > >     -5> 2016-08-31 17:55:56.931588 7faf14fff700 10 bluefs _flush_and_sync_log txn(seq 31278 len 0x50e146 crc 0x58a6b9ab)
> > > >     -4> 2016-08-31 17:55:56.933951 7faf14fff700 10 bluefs _pad_bl padding with 0xe94 zeros
> > > >     -3> 2016-08-31 17:55:56.934079 7faf14fff700 20 bluefs flush_bdev
> > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > >     -1> 2016-08-31 17:55:56.934274 7faf14fff700 10 bluefs _flush_range 0x7faf2e824140 pos 0xce9000 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > >      0> 2016-08-31 17:55:56.939745 7faf14fff700 -1
> > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)'
> > > > thread
> > > > 7faf14fff700 time 2016-08-31 17:55:56.934282
> > > > os/bluestore/BlueFS.cc: 1390: FAILED assert(h->file->fnode.ino != 
> > > > 1)
> > > > 
> > > >  ceph version 11.0.0-1946-g9a5cfe2
> > > > (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > > const*)+0x80) [0x56073c27c7d0]
> > > >  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, 
> > > > unsigned
> > > > long)+0x1d69) [0x56073bf4e109]
> > > >  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) 
> > > > [0x56073bf4e2d7]
> > > >  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&,
> > > > unsigned long, unsigned long)+0x443) [0x56073bf4fe13]
> > > >  5: (BlueFS::_fsync(BlueFS::FileWriter*,
> > > > std::unique_lock<std::mutex>&)+0x35b) [0x56073bf5140b]
> > > >  6: (BlueRocksWritableFile::Sync()+0x62) [0x56073bf68be2]
> > > >  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1)
> > > > [0x56073c0f24b1]
> > > >  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0) 
> > > > [0x56073c0f3960]
> > > >  9: 
> > > > (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Statu
> > > > s const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6)
> > > > [0x56073c1354c6]
> > > >  10: 
> > > > (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::Compac
> > > > ti
> > > > on
> > > > Job::SubcompactionState*)+0x14ea) [0x56073c137c8a]
> > > >  11: (rocksdb::CompactionJob::Run()+0x479) [0x56073c138c09]
> > > >  12: (rocksdb::DBImpl::BackgroundCompaction(bool*,
> > > > rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) 
> > > > [0x56073c0275d0]
> > > >  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf)
> > > > [0x56073c03443f]
> > > >  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) 
> > > > [0x56073c0eb039]
> > > >  15: (()+0x9900d3) [0x56073c0eb0d3]
> > > >  16: (()+0x76fa) [0x7faf3d1106fa]
> > > >  17: (clone()+0x6d) [0x7faf3af70b5d]
> > > > 
> > > > 
> > > > Now, as you can see it is calling _flush() with inode 1 , why ? is this expected ?
> > > 
> > > Yes.  The metadata for the log file is dirty (the file size changed), so bluefs is flushing its journal (ino 1) to update the fnode.
> > > 
> > > But this is very concerning:
> > > 
> > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush
> > > > 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime
> > > > 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > 
> > > 0xce9000~50f000 is a ~5 MB write.  Why would we ever write that much to the bluefs metadata journal at once?  That's why it's blowing the runway.
> > > We have ~17MB allocated (0:0xba00000+500000,0:0xfd00000+c00000), we're at offset ~13MB, and we're writing ~5MB.
> > > 
> > > > Question :
> > > > ------------
> > > > 
> > > > 1. Why we are using the existing log_write to do a runway check ?
> > > > 
> > > > https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.c
> > > > c#
> > > > L1
> > > > 280
> > > > 
> > > > Shouldn't the log_writer need to be reinitialized with the FileWriter rocksdb sent with the sync call?
> > > 
> > > It's the bluefs journal writer.. that's the runway we're worried about.
> > > 
> > > > 2. The runway check is not considering the request length , so, 
> > > > why it is not expecting to allocate here 
> > > > (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.
> > > > cc
> > > > #L
> > > > 1388)
> > > > 
> > > > If the snippet is not sufficient, let me know if you want me to upload the level 10 log or need 20/20 log to proceed further.
> > > 
> > > The level 10 log is probably enough...
> > > 
> > > Thanks!
> > > sage
> > > 
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Thursday, September 01, 2016 3:59 PM
> > > > To: 'Sage Weil'
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > Sage,
> > > > Created the following pull request on rocksdb repo, please take a look.
> > > > 
> > > > https://github.com/facebook/rocksdb/pull/1313
> > > > 
> > > > The fix is working fine for me.
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Wednesday, August 31, 2016 6:20 AM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > On Tue, 30 Aug 2016, Somnath Roy wrote:
> > > > > Sage,
> > > > > I did some debugging on the rocksdb bug., here is my findings.
> > > > > 
> > > > > 1. The log file number is added to log_recycle_files and *not* in log_delete_files from the following if loop, which is expected.
> > > > > 
> > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698
> > > > > f4
> > > > > 60
> > > > > 60
> > > > > 9bf7cea4b63/db/db_impl.cc#L854
> > > > > 
> > > > > 2. But, it is there in the candidate list in the following loop.
> > > > > 
> > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698
> > > > > f4
> > > > > 60
> > > > > 60
> > > > > 9bf7cea4b63/db/db_impl.cc#L1000
> > > > > 
> > > > > 
> > > > > 3. This means it is added in full_scan_candidate_files from the 
> > > > > following  from a full scan (?)
> > > > > 
> > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698
> > > > > f4
> > > > > 60
> > > > > 60
> > > > > 9bf7cea4b63/db/db_impl.cc#L834
> > > > > 
> > > > > Added some log entries to verify , need to wait 6 hours :-(
> > > > > 
> > > > > 4. Probably, #3 is not unusual , but the check in the following seems not sufficient to keep the file.
> > > > > 
> > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698
> > > > > f4
> > > > > 60
> > > > > 60
> > > > > 9bf7cea4b63/db/db_impl.cc#L1013
> > > > > 
> > > > > Again, added some log to see the state.log_number during that time. BTW, I have seen the following check probably noop as state.prev_log_number is always coming 0.
> > > > > 
> > > > > (number == state.prev_log_number)
> > > > > 
> > > > > 5. So, the quick solution I am thinking is to put a check to see if the log is in recycle list and avoid deleting from the above part (?).
> > > > 
> > > > That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and ask in the comment if there is a better solution.
> > > > 
> > > > Probably the recycle list should be turned into a set so that the check is O(log n)...
> > > > 
> > > > Thanks!
> > > > sage
> > > > 
> > > > 
> > > > > 
> > > > > Let me know what you think.
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Sunday, August 28, 2016 7:37 AM
> > > > > To: 'Sage Weil'
> > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > Sage,
> > > > > Some updates on this.
> > > > > 
> > > > > 1. The issue is reproduced with the latest rocksdb master as well.
> > > > > 
> > > > > 2. The issue is *not happening* if I run disabling rocksdb log recycling. This proves our root cause is right.
> > > > > 
> > > > > 3. Running some more performance tests by disabling log recycling , but, initial impression is, it is introducing spikes and output is not as stable as enabling log recycling.
> > > > > 
> > > > > 4. Created a rocksdb issue for this
> > > > > (https://github.com/facebook/rocksdb/issues/1303)
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Thursday, August 25, 2016 2:35 PM
> > > > > To: 'Sage Weil'
> > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > Sage,
> > > > > Hope you are able to download the log I shared via google doc.
> > > > > It seems the bug is around this portion.
> > > > > 
> > > > > 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log 
> > > > > 254 to recycle list
> > > > > 
> > > > > 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log 
> > > > > 256 to recycle list
> > > > > 
> > > > > 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original 
> > > > > Log Time
> > > > > 2016/08/25-00:44:03.347467) [default] Level-0 commit table #258 
> > > > > started
> > > > > 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original 
> > > > > Log Time
> > > > > 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: 
> > > > > memtable #1 done
> > > > > 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original 
> > > > > Log Time
> > > > > 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: 
> > > > > memtable #2 done
> > > > > 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original 
> > > > > Log Time
> > > > > 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 
> > > > > 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": 
> > > > > [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> > > > > 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original 
> > > > > Log Time
> > > > > 2016/08/25-00:44:03.348297) [default] Level summary: base level 
> > > > > 1 max bytes base 5368709120 files[3 4 0 0 0 0 0] max score 0.75
> > > > > 
> > > > > 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> > > > > 
> > > > > 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink 
> > > > > db.wal/000256.log
> > > > > 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had 
> > > > > refs
> > > > > 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 
> > > > > bdev
> > > > > 0 extents
> > > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe2
> > > > > 00
> > > > > 00
> > > > > 0+
> > > > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000
> > > > > ,0
> > > > > :0
> > > > > x1
> > > > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link 
> > > > > destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25
> > > > > 00:41:26.298423 bdev 0 extents
> > > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe2
> > > > > 00
> > > > > 00
> > > > > 0+
> > > > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000
> > > > > ,0
> > > > > :0
> > > > > x1
> > > > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink 
> > > > > db.wal/000254.log
> > > > > 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had 
> > > > > refs
> > > > > 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 
> > > > > bdev
> > > > > 0 extents
> > > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x84
> > > > > 00
> > > > > 00
> > > > > 0+
> > > > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:
> > > > > 0x
> > > > > b4
> > > > > 00000+800000,0:0xc000000+500000])
> > > > > 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link 
> > > > > destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25
> > > > > 00:41:26.299110 bdev 0 extents
> > > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x84
> > > > > 00
> > > > > 00
> > > > > 0+
> > > > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:
> > > > > 0x
> > > > > b4
> > > > > 00000+800000,0:0xc000000+500000])
> > > > > 
> > > > > So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).
> > > > > 
> > > > > I was going through the rocksdb code and I found the following.
> > > > > 
> > > > > 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> > > > > 
> > > > > 2. The log is deleted from DBImpl::DeleteObsoleteFileImpl, which is *not* under the lock but iterating over log_delete_files. This is fishy, but it shouldn't be the reason the same log number ends up in two lists.
> > > > > 
> > > > > 3. I saw all the places but the following  place alive_log_files_ (within DBImpl::WriteImpl)  is accessed without lock.
> > > > > 
> > > > > 4625       alive_log_files_.back().AddSize(log_entry.size());   
> > > > > 
> > > > > Can it be reintroducing the same log number (254) , I am not sure.
> > > > > 
> > > > > In summary, it seems to be a rocksdb bug, and setting recycle_log_file_num = 0 should *bypass* it. I need to check the performance impact of this though.
> > > > > 
> > > > > Should I post this to rocksdb community or any other place from where I can get response from rocksdb folks ?
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Wednesday, August 24, 2016 2:52 PM
> > > > > To: 'Sage Weil'
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > Sage,
> > > > > Thanks for looking , glad that we figured out something :-)..
> > > > > So, you want me to reproduce this with only debug_bluefs = 20/20? You don't need the bluestore log?
> > > > > Hope my root partition doesn't get full; this crash happened after
> > > > > 6 hours :-)
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Wednesday, August 24, 2016 2:34 PM
> > > > > To: Somnath Roy
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > Sage, It is there in the following github link I posted 
> > > > > > earlier..You can see 3 logs there..
> > > > > > 
> > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b3
> > > > > > 9c
> > > > > > 68
> > > > > > 7d
> > > > > > 88
> > > > > > a1b28fcc39
> > > > > 
> > > > > Ah sorry, got it.
> > > > > 
> > > > > And looking at the crash and code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or in the rocksdb recycling code.
> > > > > 
> > > > > I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> > > > > 
> > > > > sage
> > > > > 
> > > > > 
> > > > > 
> > > > >  >
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > Sent: Wednesday, August 24, 2016 1:43 PM
> > > > > > To: Somnath Roy
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > > Sage,
> > > > > > > I got the db assert log from submit_transaction in the following location.
> > > > > > > 
> > > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017
> > > > > > > b3
> > > > > > > 9c
> > > > > > > 68
> > > > > > > 7d
> > > > > > > 88
> > > > > > > a1b28fcc39
> > > > > > > 
> > > > > > > This is the log with level 1/20 and with my hook that prints the rocksdb::WriteBatch transaction. I have uploaded 3 osd logs, and a common pattern before the crash is the following.
> > > > > > > 
> > > > > > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: 
> > > > > > > reusing log 266 from recycle list
> > > > > > > 
> > > > > > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > > > > > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > > > > > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > > > > > 
> > > > > > > It seems it is trying to rename the WAL file, and the old file is not found. You can see the transaction printed in the log along with the error code, like this.
> > > > > > 
> > > > > > How much of the log do you have? Can you post what you have somewhere?
> > > > > > 
> > > > > > Thanks!
> > > > > > sage
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > > > > > submit_transaction error: NotFound:  code = 1 
> > > > > > > rocksdb::WriteBatch = Put( Prefix = M key =
> > > > > > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > > > > > Put( Prefix = M key = 0x0000000000001483'._info') Put( 
> > > > > > > Prefix = O key =
> > > > > > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.0000
> > > > > > > 00
> > > > > > > 00
> > > > > > > 00
> > > > > > > 89
> > > > > > > ac
> > > > > > > e7!'0xfffffffffffffffeffffffffffffffff)
> > > > > > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = B 
> > > > > > > key =
> > > > > > > 0x000004e73af72000) Merge( Prefix = T key =
> > > > > > > 'bluestore_statfs')
> > > > > > > 
> > > > > > > Hope my decoding of the key is proper; I have reused pretty_binary_string() from BlueStore after removing the first 2 bytes of the key, which are the prefix and a '0'.
> > > > > > > 
> > > > > > > Any suggestion on the next step for root-causing this db assert, if the log rename is not enough of a hint?
> > > > > > > 
> > > > > > > BTW, this time I ran with bluefs_compact_log_sync = true and without commenting out the inode number asserts. I didn't hit those asserts, and no corruption so far. Seems like a bug in the async compaction. I will try to reproduce that one with a verbose log later.
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Somnath Roy
> > > > > > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > > > > > To: 'Sage Weil'
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > I was running my tests for 2 hours and it happened within that time.
> > > > > > > I will try to reproduce with 1/20.
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > > > > > To: Somnath Roy
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > Sage,
> > > > > > > > I think there are some bugs introduced recently in
> > > > > > > > BlueFS, and I am getting corruption like this which I was not facing earlier.
> > > > > > > 
> > > > > > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > > > > > 
> > > > > > > Any idea how long you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > > > > > > 
> > > > > > > Thanks!
> > > > > > > sage
> > > > > > > 
> > > > > > > > 
> > > > > > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > > > > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > > > > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > > > > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > > > > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > > > > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: 
> > > > > > > > In function 'OSDMapRef OSDService::get_map(epoch_t)' 
> > > > > > > > thread
> > > > > > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > > > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > > > > > 
> > > > > > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > > > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > > int, char
> > > > > > > > const*)+0x80) [0x5617f2a99e80]
> > > > > > > >  2: (OSDService::get_map(unsigned int)+0x5d) 
> > > > > > > > [0x5617f2395fdd]
> > > > > > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > > > > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > > > > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > > > > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > > > > > 
> > > > > > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > > > > > 
> > > > > > > > Thanks & Regards
> > > > > > > > Somnath
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Somnath Roy
> > > > > > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > > > > > To: 'Sage Weil'
> > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > 
> > > > > > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > > > > > Here is the option I am using..
> > > > > > > > 
> > > > > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > > > > 
> > > > > > > > Here is another one after 4 hours of 4K RW :-)... Sorry to bombard you with all these; I am adding more logging for the next time, in case I hit it again.
> > > > > > > > 
> > > > > > > > 
> > > > > > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > > > > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > > > > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > > > > > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > > > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > > > > > 
> > > > > > > >  ceph version 11.0.0-1688-g3fcc89c
> > > > > > > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > > int, char
> > > > > > > > const*)+0x80) [0x5581ed453cb0]
> > > > > > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned 
> > > > > > > > long, unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > > > > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, 
> > > > > > > > unsigned long, rocksdb::Slice*, char*) const+0x20) 
> > > > > > > > [0x5581ed13f840]
> > > > > > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, 
> > > > > > > > unsigned long, rocksdb::Slice*, char*) const+0x83f) 
> > > > > > > > [0x5581ed2c6f4f]
> > > > > > > >  5: 
> > > > > > > > (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReade
> > > > > > > > r* , rocksdb::Footer const&, rocksdb::ReadOptions const&, 
> > > > > > > > rocksdb::BlockHandle const&, rocksdb::BlockContents*, 
> > > > > > > > rocksdb::Env*, bool, rocksdb::Slice const&, 
> > > > > > > > rocksdb::PersistentCacheOptions const&,
> > > > > > > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > > > > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > > > > > >  7: 
> > > > > > > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::B
> > > > > > > > lo ck Ba se dT ab le::Rep*, rocksdb::ReadOptions const&, 
> > > > > > > > rocksdb::Slice const&,
> > > > > > > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > > > > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions
> > > > > > > > const&, rocksdb::Slice const&, rocksdb::GetContext*,
> > > > > > > > bool)+0x508) [0x5581ed28ba68]
> > > > > > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
> > > > > > > > rocksdb::InternalKeyComparator const&, 
> > > > > > > > rocksdb::FileDescriptor const&, rocksdb::Slice const&, 
> > > > > > > > rocksdb::GetContext*, rocksdb::HistogramImpl*, bool,
> > > > > > > > int)+0x158) [0x5581ed252118]
> > > > > > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > > > > > > rocksdb::LookupKey const&, 
> > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > > std::allocator<char> >*, rocksdb::Status*, 
> > > > > > > > rocksdb::MergeContext*, bool*, bool*, unsigned
> > > > > > > > long*)+0x4f8) [0x5581ed25c458]
> > > > > > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions 
> > > > > > > > const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice 
> > > > > > > > const&, std::__cxx11::basic_string<char, 
> > > > > > > > std::char_traits<char>, std::allocator<char> >*, 
> > > > > > > > bool*)+0x5fa) [0x5581ed1f3f7a]
> > > > > > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > > > > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > > > > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > > > > > > std::char_traits<char>, std::allocator<char> > const&, 
> > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > > std::allocator<char> > const&, ceph::buffer::list*)+0x157) 
> > > > > > > > [0x5581ed1d21d7]
> > > > > > > >  14: (BlueStore::Collection::get_onode(ghobject_t const&,
> > > > > > > > bool)+0x55b) [0x5581ed02802b]
> > > > > > > >  15: 
> > > > > > > > (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
> > > > > > > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > > > > > >  16: 
> > > > > > > > (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > > > > > > std::vector<ObjectStore::Transaction,
> > > > > > > > std::allocator<ObjectStore::Transaction> >&, 
> > > > > > > > std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) 
> > > > > > > > [0x5581ed032bc2]
> > > > > > > >  17: 
> > > > > > > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore:
> > > > > > > > :T ra ns ac ti on , 
> > > > > > > > std::allocator<ObjectStore::Transaction>
> > > > > > > > >&,
> > > > > > > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > > > > > >  18: 
> > > > > > > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpReques
> > > > > > > > t>
> > > > > > > > )+
> > > > > > > > 0x
> > > > > > > > d3
> > > > > > > > 9)
> > > > > > > > [0x5581ecef89e9]
> > > > > > > >  19: 
> > > > > > > > (ReplicatedBackend::handle_message(std::shared_ptr<OpReque
> > > > > > > > st
> > > > > > > > >)
> > > > > > > > +0
> > > > > > > > x2
> > > > > > > > fb
> > > > > > > > )
> > > > > > > > [0x5581ecefeb4b]
> > > > > > > >  20: 
> > > > > > > > (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > > > > > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > > > > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > > > > > > std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) 
> > > > > > > > [0x5581eccdd2e9]
> > > > > > > >  22: 
> > > > > > > > (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest
> > > > > > > > >
> > > > > > > > const&)+0x52) [0x5581eccdd542]
> > > > > > > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > > > > > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > > > > > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > > > > > > int)+0x89f) [0x5581ed440b2f]
> > > > > > > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > > > > > > [0x5581ed4441f0]
> > > > > > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > > > > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > > > > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Thanks & Regards
> > > > > > > > Somnath
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > > > > > To: Somnath Roy
> > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > 
> > > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > > > > > compaction during this time.
> > > > > > > > 
> > > > > > > > How are you selecting universal compaction?
> > > > > > > > 
> > > > > > > > sage
> > > > > > > > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > > > > > > > --
> > > > > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > > > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > > > > > majordomo info at 
> > > > > > > > http://vger.kernel.org/majordomo-info.html
> > > > > > > > 
> > > > > > > > 
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > > > > majordomo info at  
> > > > > > > http://vger.kernel.org/majordomo-info.html
> > > > > > > 
> > > > > > > 
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > 
> > 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Sept. 2, 2016, 8:26 p.m. UTC | #40
Yes, did that with a similar ratio, see below: max = 400MB, min = 100MB.
Will see how it goes, thanks..
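
For reference, roughly what that looks like in my ceph.conf (a sketch only; the runway tunables are in bytes, so these are the assumed byte equivalents of the MB figures above):

[osd]
        bluefs_min_log_runway = 104857600    # ~100 MB
        bluefs_max_log_runway = 419430400    # ~400 MB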

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Friday, September 02, 2016 1:25 PM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Fri, 2 Sep 2016, Somnath Roy wrote:
> Sage,
> I am running with big runway values now (min 100 MB, max 400MB) and will keep you posted on this.
> One point: if I use these big runway values, won't the allocation be very frequent (and probably unnecessarily so in most cases)? Is there no harm in that?

I think it'll actually be less frequent, since it allocates bluefs_max_log_runway at a time.  Well, assuming you set that tunable as high as well!

sage

> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, September 02, 2016 12:27 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Fri, 2 Sep 2016, Somnath Roy wrote:
> > Sage,
> > It is probably the universal compaction that is generating the bigger files; other than that, I don't see how the following tuning would generate large files.
> > I will try some universal compaction tuning related to file size and confirm.
> > Yeah, a big bluefs_min_log_runway value will probably shield the assert for now, but I am afraid that since we don't have control over the file size, we can't be sure what value to give bluefs_min_log_runway so that the assert won't hit in future long runs.
> > Can't we do something like this?
> > 
> > // Basically, checking the length of the log as well
> > if (runway < g_conf->bluefs_min_log_runway ||
> >     runway < log_writer->buffer.length()) {
> >   // allocate
> > }
> 
> Oh, I see what you mean.  Yeah, I'll add that in--certainly doesn't hurt.
> 
> And I think just configuring a long runway won't hurt either (e.g., 100MB).
> 
> That's probably enough to be safe, but once we fix the flush thing I mentioned that will make this go away.
> 
> s
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, September 02, 2016 10:57 AM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > Here is my rocksdb option :
> > > 
> > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > 
> > > One discrepancy I can see here is max_bytes_for_level_base; it
> > > should be the same as the level 0 size. Initially, I had a bigger
> > > min_write_buffer_number_to_merge and that's how I calculated it. Now,
> > > the level 0 size is the following:
> > > 
> > > write_buffer_size * min_write_buffer_number_to_merge * 
> > > level0_file_num_compaction_trigger = 80MB * 2 * 4 = ~640MB
> > > 
> > > I should probably adjust max_bytes_for_level_base to a similar value.
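> > > (For reference, that would mean something like max_bytes_for_level_base=671088640 (~640MB) instead of 5368709120 in the option string above -- an assumed target value, not tested yet.)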
> > > 
> > > Please find the level 10 log here. This is the log I captured during replay (it crashed again) after it originally crashed for the same reason.
> > > 
> > > https://drive.google.com/file/d/0B7W-S0z_ymMJSnA3R2dyellYZ0U/view?
> > > us
> > > p=
> > > sharing
> > > 
> > > Thanks for the explanation; now I get why it is trying to
> > > flush inode 1.
> > > 
> > > But shouldn't we check the length as well during the runway check,
> > > rather than relying on bluefs_min_log_runway only?
> > 
> > That's what this does:
> > 
> >   uint64_t runway = log_writer->file->fnode.get_allocated() - 
> > log_writer->pos;
> > 
> > Anyway, I think I see what's going on.  There are a ton of _fsync and _flush_range calls that have to flush the fnode, and the fnode is pretty big (maybe 5k?) because it has so many extents (your tunables are generating really big files).
> > 
> > I think this is just a matter of the runway configurable being too small for your configuration.  Try bumping bluefs_min_log_runway by 10x.
> > 
> > Well, actually, we could improve this a bit.  Right now rocksdb is calling flush lots of times on a big sst, and a final fsync at the end.  Bluefs is logging the updated fnode every time the flush changes the file size, and then only writing it to disk when the final fsync happens.  Instead, it could/should put the dirty fnode on a list and, only at log flush time, append the latest fnodes to the log and write them out.
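> > 
> > Roughly the shape I have in mind, as a standalone sketch (made-up names, not actual BlueFS code, just to illustrate batching the fnode updates):
> > 
> >   #include <memory>
> >   #include <vector>
> > 
> >   struct FileNode { /* size, extents, ... */ };
> >   struct File { FileNode fnode; bool dirty = false; };
> > 
> >   std::vector<std::shared_ptr<File>> dirty_files;
> > 
> >   // called from _flush_range()-like paths when the file size changes:
> >   void mark_dirty(const std::shared_ptr<File>& f) {
> >     if (!f->dirty) {          // remember the file once instead of
> >       f->dirty = true;        // journaling its fnode on every change
> >       dirty_files.push_back(f);
> >     }
> >   }
> > 
> >   // called once, when the bluefs log is actually flushed:
> >   void flush_dirty_fnodes() {
> >     for (auto& f : dirty_files) {
> >       // append the op_file_update for f->fnode to the log transaction here
> >       f->dirty = false;
> >     }
> >     dirty_files.clear();
> >   }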
> > 
> > I'll add it to the trello board.  I think it's not that big a deal.. 
> > except when you have really big files.
> > 
> > sage
> > 
> > 
> >  >
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Friday, September 02, 2016 9:35 AM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > 
> > > 
> > > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > 
> > > > Sage,
> > > > Tried to do some analysis on the inode assert; the following looks suspicious.
> > > > 
> > > >    -10> 2016-08-31 17:55:56.921075 7faf14fff700 10 bluefs _fsync 0x7faf12075140 file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1:0!
 x7!
>  53!
> >  00!
> > >  000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d00000!
 0+!
>  10!
> >  00!
> > >  
> > > 00,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0
> > > x7
> > > d8
> > > 00000+10e00000])
> > > >     -9> 2016-08-31 17:55:56.921117 7faf14fff700 10 bluefs _flush 0x7faf12075140 no dirty data on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75!
 10!
>  00!
> >  00!
> > >  +100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+1!
 00!
>  00!
> >  0,!
> > >  
> > > 1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d
> > > 60
> > > 00
> > > 00+100000,1:0x7d800000+10e00000])
> > > >     -8> 2016-08-31 17:55:56.921137 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140
> > > >     -7> 2016-08-31 17:55:56.931541 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140 done in 0.010402
> > > >     -6> 2016-08-31 17:55:56.931551 7faf14fff700 20 bluefs _fsync file metadata was dirty (31278) on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0!
 x7!
>  51!
> >  00!
> > >  000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce0000!
 0+!
>  10!
> >  00!
> > >  
> > > 00,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0
> > > x7
> > > d6
> > > 00000+100000,1:0x7d800000+10e00000]), flushing log
> > > > 
> > > > The above looks good, it is about to call _flush_and_sync_log() after this.
> > > 
> > > Yes, although this file is huge (0x19c2c265 = 432194149 ~ 400MB)... What rocksdb options did you pass in?  I'm guessing this is a log file, but we generally want those smallish (maybe 16MB - 64MB, so that L0 SST generation isn't too slow).
> > > 
> > > >     -5> 2016-08-31 17:55:56.931588 7faf14fff700 10 bluefs _flush_and_sync_log txn(seq 31278 len 0x50e146 crc 0x58a6b9ab)
> > > >     -4> 2016-08-31 17:55:56.933951 7faf14fff700 10 bluefs _pad_bl padding with 0xe94 zeros
> > > >     -3> 2016-08-31 17:55:56.934079 7faf14fff700 20 bluefs flush_bdev
> > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > >     -1> 2016-08-31 17:55:56.934274 7faf14fff700 10 bluefs _flush_range 0x7faf2e824140 pos 0xce9000 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > >      0> 2016-08-31 17:55:56.939745 7faf14fff700 -1
> > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)'
> > > > thread
> > > > 7faf14fff700 time 2016-08-31 17:55:56.934282
> > > > os/bluestore/BlueFS.cc: 1390: FAILED assert(h->file->fnode.ino 
> > > > !=
> > > > 1)
> > > > 
> > > >  ceph version 11.0.0-1946-g9a5cfe2
> > > > (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > char
> > > > const*)+0x80) [0x56073c27c7d0]
> > > >  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, 
> > > > unsigned
> > > > long)+0x1d69) [0x56073bf4e109]
> > > >  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) 
> > > > [0x56073bf4e2d7]
> > > >  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&,
> > > > unsigned long, unsigned long)+0x443) [0x56073bf4fe13]
> > > >  5: (BlueFS::_fsync(BlueFS::FileWriter*,
> > > > std::unique_lock<std::mutex>&)+0x35b) [0x56073bf5140b]
> > > >  6: (BlueRocksWritableFile::Sync()+0x62) [0x56073bf68be2]
> > > >  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1)
> > > > [0x56073c0f24b1]
> > > >  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0)
> > > > [0x56073c0f3960]
> > > >  9: 
> > > > (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Sta
> > > > tu s const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6)
> > > > [0x56073c1354c6]
> > > >  10: 
> > > > (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::Comp
> > > > ac
> > > > ti
> > > > on
> > > > Job::SubcompactionState*)+0x14ea) [0x56073c137c8a]
> > > >  11: (rocksdb::CompactionJob::Run()+0x479) [0x56073c138c09]
> > > >  12: (rocksdb::DBImpl::BackgroundCompaction(bool*,
> > > > rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) 
> > > > [0x56073c0275d0]
> > > >  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf)
> > > > [0x56073c03443f]
> > > >  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) 
> > > > [0x56073c0eb039]
> > > >  15: (()+0x9900d3) [0x56073c0eb0d3]
> > > >  16: (()+0x76fa) [0x7faf3d1106fa]
> > > >  17: (clone()+0x6d) [0x7faf3af70b5d]
> > > > 
> > > > 
> > > > Now, as you can see, it is calling _flush() with inode 1. Why? Is this expected?
> > > 
> > > Yes.  The metadata for the log file is dirty (the file size changed), so bluefs is flushing its journal (ino 1) to update the fnode.
> > > 
> > > But this is very concerning:
> > > 
> > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush
> > > > 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime
> > > > 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > 
> > > 0xce9000~50f000 is a ~5 MB write.  Why would we ever write that much to the bluefs metadata journal at once?  That's why it's blowing the runway.  
> > > We have ~17MB allocated (0:0xba00000+500000,0:0xfd00000+c00000), we're at offset ~13MB, and we're writing ~5MB.
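> > > (For the record, the arithmetic: 0x500000 + 0xc00000 = 0x1100000 ~= 17MB allocated, while the write would end at
> > > 0xce9000 + 0x50f000 = 0x11f8000 ~= 18MB, i.e. past the end of the allocation -- hence the blown runway.)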
> > > 
> > > > Question :
> > > > ------------
> > > > 
> > > > 1. Why are we using the existing log_writer to do the runway check?
> > > > 
> > > > https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS
> > > > .c
> > > > c#
> > > > L1
> > > > 280
> > > > 
> > > > Shouldn't the log_writer need to be reinitialized with the FileWriter rocksdb sent with the sync call?
> > > 
> > > It's the bluefs journal writer.. that's the runway we're worried about.
> > > 
> > > > 2. The runway check is not considering the request length, so
> > > > why is it not expected to allocate here
> > > > (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.
> > > > cc
> > > > #L
> > > > 1388)
> > > > 
> > > > If the snippet is not sufficient, let me know if you want me to upload the level 10 log, or if you need the 20/20 log to proceed further.
> > > 
> > > The level 10 log is probably enough...
> > > 
> > > Thanks!
> > > sage
> > > 
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Thursday, September 01, 2016 3:59 PM
> > > > To: 'Sage Weil'
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > Sage,
> > > > Created the following pull request on rocksdb repo, please take a look.
> > > > 
> > > > https://github.com/facebook/rocksdb/pull/1313
> > > > 
> > > > The fix is working fine for me.
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Wednesday, August 31, 2016 6:20 AM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > On Tue, 30 Aug 2016, Somnath Roy wrote:
> > > > > Sage,
> > > > > I did some debugging on the rocksdb bug; here are my findings.
> > > > > 
> > > > > 1. The log file number is added to log_recycle_files and *not* to log_delete_files in the following if block, which is expected.
> > > > > 
> > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f6
> > > > > 98
> > > > > f4
> > > > > 60
> > > > > 60
> > > > > 9bf7cea4b63/db/db_impl.cc#L854
> > > > > 
> > > > > 2. But, it is there in the candidate list in the following loop.
> > > > > 
> > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f6
> > > > > 98
> > > > > f4
> > > > > 60
> > > > > 60
> > > > > 9bf7cea4b63/db/db_impl.cc#L1000
> > > > > 
> > > > > 
> > > > > 3. This means it is added to full_scan_candidate_files in
> > > > > the following, from a full scan (?)
> > > > > 
> > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f6
> > > > > 98
> > > > > f4
> > > > > 60
> > > > > 60
> > > > > 9bf7cea4b63/db/db_impl.cc#L834
> > > > > 
> > > > > Added some log entries to verify; need to wait 6 hours :-(
> > > > > 
> > > > > 4. Probably #3 is not unusual, but the check in the following does not seem sufficient to keep the file.
> > > > > 
> > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f6
> > > > > 98
> > > > > f4
> > > > > 60
> > > > > 60
> > > > > 9bf7cea4b63/db/db_impl.cc#L1013
> > > > > 
> > > > > Again, added some logging to see state.log_number during that time. BTW, I have seen that the following check is probably a noop, as state.prev_log_number always comes back 0.
> > > > > 
> > > > > (number == state.prev_log_number)
> > > > > 
> > > > > 5. So, the quick solution I am thinking of is to put a check to see if the log is in the recycle list and avoid deleting it in the above part (?).
> > > > 
> > > > That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and asking in the comment if there is a better solution.
> > > > 
> > > > Probably the recycle list should be turned into a set so that the check is O(log n)...
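> > > > 
> > > > Something along these lines, as a rough sketch (assumed names, just the shape of the check; the real PR will have to fit the FindObsoleteFiles/DeleteObsoleteFileImpl code you quoted):
> > > > 
> > > >   #include <cstdint>
> > > >   #include <set>
> > > > 
> > > >   std::set<uint64_t> recycled_log_numbers;  // filled when a WAL is recycled
> > > > 
> > > >   bool ok_to_delete_log(uint64_t number, uint64_t min_live_log_number) {
> > > >     if (recycled_log_numbers.count(number)) {
> > > >       return false;   // handed to the recycle list; must not be deleted
> > > >     }
> > > >     return number < min_live_log_number;
> > > >   }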
> > > > 
> > > > Thanks!
> > > > sage
> > > > 
> > > > 
> > > > > 
> > > > > Let me know what you think.
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Sunday, August 28, 2016 7:37 AM
> > > > > To: 'Sage Weil'
> > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > Sage,
> > > > > Some updates on this.
> > > > > 
> > > > > 1. The issue is reproduced with the latest rocksdb master as well.
> > > > > 
> > > > > 2. The issue is *not happening* if I run with rocksdb log recycling disabled. This proves our root cause is right.
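> > > > > (Concretely, that should just be the same bluestore_rocksdb_options string as before with recycle_log_file_num=0 instead of 16.)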
> > > > > 
> > > > > 3. Running some more performance tests with log recycling disabled, but the initial impression is that it is introducing spikes and the output is not as stable as with log recycling enabled.
> > > > > 
> > > > > 4. Created a rocksdb issue for this
> > > > > (https://github.com/facebook/rocksdb/issues/1303)
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Thursday, August 25, 2016 2:35 PM
> > > > > To: 'Sage Weil'
> > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > Sage,
> > > > > Hope you are able to download the log I shared via google doc.
> > > > > It seems the bug is around this portion.
> > > > > 
> > > > > 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log
> > > > > 254 to recycle list
> > > > > 
> > > > > 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log
> > > > > 256 to recycle list
> > > > > 
> > > > > 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original 
> > > > > Log Time
> > > > > 2016/08/25-00:44:03.347467) [default] Level-0 commit table 
> > > > > #258 started
> > > > > 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original 
> > > > > Log Time
> > > > > 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: 
> > > > > memtable #1 done
> > > > > 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original 
> > > > > Log Time
> > > > > 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: 
> > > > > memtable #2 done
> > > > > 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original 
> > > > > Log Time
> > > > > 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 
> > > > > 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": 
> > > > > [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> > > > > 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original 
> > > > > Log Time
> > > > > 2016/08/25-00:44:03.348297) [default] Level summary: base 
> > > > > level
> > > > > 1 max bytes base 5368709120 files[3 4 0 0 0 0 0] max score 
> > > > > 0.75
> > > > > 
> > > > > 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> > > > > 
> > > > > 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink 
> > > > > db.wal/000256.log
> > > > > 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link 
> > > > > had refs
> > > > > 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 
> > > > > 00:41:26.298423 bdev
> > > > > 0 extents
> > > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0x
> > > > > e2
> > > > > 00
> > > > > 00
> > > > > 0+
> > > > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+7000
> > > > > 00
> > > > > ,0
> > > > > :0
> > > > > x1
> > > > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link 
> > > > > destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25
> > > > > 00:41:26.298423 bdev 0 extents
> > > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0x
> > > > > e2
> > > > > 00
> > > > > 00
> > > > > 0+
> > > > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+7000
> > > > > 00
> > > > > ,0
> > > > > :0
> > > > > x1
> > > > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink 
> > > > > db.wal/000254.log
> > > > > 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link 
> > > > > had refs
> > > > > 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 
> > > > > 00:41:26.299110 bdev
> > > > > 0 extents
> > > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x
> > > > > 84
> > > > > 00
> > > > > 00
> > > > > 0+
> > > > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:
> > > > > 0x
> > > > > b4
> > > > > 00000+800000,0:0xc000000+500000])
> > > > > 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link 
> > > > > destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25
> > > > > 00:41:26.299110 bdev 0 extents
> > > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x
> > > > > 84
> > > > > 00
> > > > > 00
> > > > > 0+
> > > > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:
> > > > > 0x
> > > > > b4
> > > > > 00000+800000,0:0xc000000+500000])
> > > > > 
> > > > > So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).
> > > > > 
> > > > > I was going through the rocksdb code and I found the following.
> > > > > 
> > > > > 1. DBImpl::FindObsoleteFiles is the one responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But this is always done under the mutex_ lock.
> > > > > 
> > > > > 2. The log is deleted from DBImpl::DeleteObsoleteFileImpl, which is *not* under the lock but iterates over log_delete_files. This is fishy, but it shouldn't be the reason for the same log number ending up in two lists.
> > > > > 
> > > > > 3. I checked all the places, but in the following place alive_log_files_ (within DBImpl::WriteImpl) is accessed without the lock.
> > > > > 
> > > > > 4625       alive_log_files_.back().AddSize(log_entry.size());   
> > > > > 
> > > > > Could that be reintroducing the same log number (254)? I am not sure.
> > > > > 
> > > > > In summary, it seems to be a rocksdb bug, and setting recycle_log_file_num = 0 should *bypass* it. I need to check the performance impact of this, though.
> > > > > 
> > > > > Should I post this to the rocksdb community, or to any other place where I can get a response from the rocksdb folks?
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Wednesday, August 24, 2016 2:52 PM
> > > > > To: 'Sage Weil'
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > Sage,
> > > > > Thanks for looking , glad that we figured out something :-)..
> > > > > So, you want me to reproduce this with only debug_bluefs = 20/20? You don't need the bluestore log?
> > > > > Hope my root partition doesn't get full; this crash happened after
> > > > > 6 hours :-)
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Wednesday, August 24, 2016 2:34 PM
> > > > > To: Somnath Roy
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > Sage, It is there in the following github link I posted 
> > > > > > earlier..You can see 3 logs there..
> > > > > > 
> > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017
> > > > > > b3
> > > > > > 9c
> > > > > > 68
> > > > > > 7d
> > > > > > 88
> > > > > > a1b28fcc39
> > > > > 
> > > > > Ah sorry, got it.
> > > > > 
> > > > > And looking at the crash and code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or in the rocksdb recycling code.
> > > > > 
> > > > > I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> > > > > 
> > > > > sage
> > > > > 
> > > > > 
> > > > > 
> > > > >  >
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > Sent: Wednesday, August 24, 2016 1:43 PM
> > > > > > To: Somnath Roy
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > > Sage,
> > > > > > > I got the db assert log from submit_transaction in the following location.
> > > > > > > 
> > > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf0
> > > > > > > 17
> > > > > > > b3
> > > > > > > 9c
> > > > > > > 68
> > > > > > > 7d
> > > > > > > 88
> > > > > > > a1b28fcc39
> > > > > > > 
> > > > > > > This is the log with level 1/20 and with my hook that prints the rocksdb::WriteBatch transaction. I have uploaded 3 osd logs, and a common pattern before the crash is the following.
> > > > > > > 
> > > > > > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: 
> > > > > > > reusing log 266 from recycle list
> > > > > > > 
> > > > > > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > > > > > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > > > > > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > > > > > 
> > > > > > > It seems it is trying to rename the WAL file, and the old file is not found. You can see the transaction printed in the log along with the error code, like this.
> > > > > > 
> > > > > > How much of the log do you have? Can you post what you have somewhere?
> > > > > > 
> > > > > > Thanks!
> > > > > > sage
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > > > > > submit_transaction error: NotFound:  code = 1 
> > > > > > > rocksdb::WriteBatch = Put( Prefix = M key =
> > > > > > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > > > > > Put( Prefix = M key = 0x0000000000001483'._info') Put( 
> > > > > > > Prefix = O key =
> > > > > > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.00
> > > > > > > 00
> > > > > > > 00
> > > > > > > 00
> > > > > > > 00
> > > > > > > 89
> > > > > > > ac
> > > > > > > e7!'0xfffffffffffffffeffffffffffffffff)
> > > > > > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = 
> > > > > > > B key =
> > > > > > > 0x000004e73af72000) Merge( Prefix = T key =
> > > > > > > 'bluestore_statfs')
> > > > > > > 
> > > > > > > Hope my decoding of the key is proper; I have reused pretty_binary_string() from BlueStore after removing the first 2 bytes of the key, which are the prefix and a '0'.
> > > > > > > 
> > > > > > > Any suggestion on the next step for root-causing this db assert, if the log rename is not enough of a hint?
> > > > > > > 
> > > > > > > BTW, this time I ran with bluefs_compact_log_sync = true and without commenting out the inode number asserts. I didn't hit those asserts, and no corruption so far. Seems like a bug in the async compaction. I will try to reproduce that one with a verbose log later.
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Somnath Roy
> > > > > > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > > > > > To: 'Sage Weil'
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > I was running my tests for 2 hours and it happened within that time.
> > > > > > > I will try to reproduce with 1/20.
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > > > > > To: Somnath Roy
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > Sage,
> > > > > > > > I think there are some bugs introduced recently in
> > > > > > > > BlueFS, and I am getting corruption like this which I was not facing earlier.
> > > > > > > 
> > > > > > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > > > > > 
> > > > > > > Any idea how long you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > > > > > > 
> > > > > > > Thanks!
> > > > > > > sage
> > > > > > > 
> > > > > > > > 
> > > > > > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > > > > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > > > > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > > > > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > > > > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > > > > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: 
> > > > > > > > In function 'OSDMapRef OSDService::get_map(epoch_t)' 
> > > > > > > > thread
> > > > > > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > > > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > > > > > 
> > > > > > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > > > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > > int, char
> > > > > > > > const*)+0x80) [0x5617f2a99e80]
> > > > > > > >  2: (OSDService::get_map(unsigned int)+0x5d) 
> > > > > > > > [0x5617f2395fdd]
> > > > > > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > > > > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > > > > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > > > > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > > > > > 
> > > > > > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > > > > > 
> > > > > > > > Thanks & Regards
> > > > > > > > Somnath
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Somnath Roy
> > > > > > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > > > > > To: 'Sage Weil'
> > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > 
> > > > > > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > > > > > Here is the option I am using..
> > > > > > > > 
> > > > > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > > > > 
> > > > > > > > Here is another one after 4 hours of 4K RW :-)... Sorry to bombard you with all these; I am adding more logging for the next time, in case I hit it again.
> > > > > > > > 
> > > > > > > > 
> > > > > > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > > > > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > > > > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > > > > > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > > > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > > > > > 
> > > > > > > >  ceph version 11.0.0-1688-g3fcc89c
> > > > > > > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > > int, char
> > > > > > > > const*)+0x80) [0x5581ed453cb0]
> > > > > > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned 
> > > > > > > > long, unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > > > > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, 
> > > > > > > > unsigned long, rocksdb::Slice*, char*) const+0x20) 
> > > > > > > > [0x5581ed13f840]
> > > > > > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned 
> > > > > > > > long, unsigned long, rocksdb::Slice*, char*) 
> > > > > > > > const+0x83f) [0x5581ed2c6f4f]
> > > > > > > >  5: 
> > > > > > > > (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileRea
> > > > > > > > de
> > > > > > > > r* , rocksdb::Footer const&, rocksdb::ReadOptions 
> > > > > > > > const&, rocksdb::BlockHandle const&, 
> > > > > > > > rocksdb::BlockContents*, rocksdb::Env*, bool, 
> > > > > > > > rocksdb::Slice const&, rocksdb::PersistentCacheOptions 
> > > > > > > > const&,
> > > > > > > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > > > > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > > > > > >  7: 
> > > > > > > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb:
> > > > > > > > :B lo ck Ba se dT ab le::Rep*, rocksdb::ReadOptions 
> > > > > > > > const&, rocksdb::Slice const&,
> > > > > > > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > > > > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions
> > > > > > > > const&, rocksdb::Slice const&, rocksdb::GetContext*,
> > > > > > > > bool)+0x508) [0x5581ed28ba68]
> > > > > > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions 
> > > > > > > > const&, rocksdb::InternalKeyComparator const&, 
> > > > > > > > rocksdb::FileDescriptor const&, rocksdb::Slice const&, 
> > > > > > > > rocksdb::GetContext*, rocksdb::HistogramImpl*, bool,
> > > > > > > > int)+0x158) [0x5581ed252118]
> > > > > > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > > > > > > rocksdb::LookupKey const&, 
> > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > > std::allocator<char> >*, rocksdb::Status*, 
> > > > > > > > rocksdb::MergeContext*, bool*, bool*, unsigned
> > > > > > > > long*)+0x4f8) [0x5581ed25c458]
> > > > > > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions
> > > > > > > > const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice 
> > > > > > > > const&, std::__cxx11::basic_string<char, 
> > > > > > > > std::char_traits<char>, std::allocator<char> >*,
> > > > > > > > bool*)+0x5fa) [0x5581ed1f3f7a]
> > > > > > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > > > > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > > > > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > > > > > > std::char_traits<char>, std::allocator<char> > const&, 
> > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > > std::allocator<char> > const&, 
> > > > > > > > ceph::buffer::list*)+0x157) [0x5581ed1d21d7]
> > > > > > > >  14: (BlueStore::Collection::get_onode(ghobject_t 
> > > > > > > > const&,
> > > > > > > > bool)+0x55b) [0x5581ed02802b]
> > > > > > > >  15: 
> > > > > > > > (BlueStore::_txc_add_transaction(BlueStore::TransContext
> > > > > > > > *,
> > > > > > > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > > > > > >  16: 
> > > > > > > > (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > > > > > > std::vector<ObjectStore::Transaction,
> > > > > > > > std::allocator<ObjectStore::Transaction> >&, 
> > > > > > > > std::shared_ptr<TrackedOp>, 
> > > > > > > > ThreadPool::TPHandle*)+0x362) [0x5581ed032bc2]
> > > > > > > >  17: 
> > > > > > > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore:
> > > > > > > > :T ra ns ac ti on ,
> > > > > > > > std::allocator<ObjectStore::Transaction>
> > > > > > > > >&,
> > > > > > > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > > > > > >  18: 
> > > > > > > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequ
> > > > > > > > es
> > > > > > > > t>
> > > > > > > > )+
> > > > > > > > 0x
> > > > > > > > d3
> > > > > > > > 9)
> > > > > > > > [0x5581ecef89e9]
> > > > > > > >  19: 
> > > > > > > > (ReplicatedBackend::handle_message(std::shared_ptr<OpReq
> > > > > > > > ue
> > > > > > > > st
> > > > > > > > >)
> > > > > > > > +0
> > > > > > > > x2
> > > > > > > > fb
> > > > > > > > )
> > > > > > > > [0x5581ecefeb4b]
> > > > > > > >  20: 
> > > > > > > > (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > > > > > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > > > > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > > > > > > std::shared_ptr<OpRequest>, 
> > > > > > > > ThreadPool::TPHandle&)+0x409) [0x5581eccdd2e9]
> > > > > > > >  22: 
> > > > > > > > (PGQueueable::RunVis::operator()(std::shared_ptr<OpReque
> > > > > > > > st
> > > > > > > > >
> > > > > > > > const&)+0x52) [0x5581eccdd542]
> > > > > > > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > > > > > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > > > > > >  24: 
> > > > > > > > (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > > > > > > int)+0x89f) [0x5581ed440b2f]
> > > > > > > >  25: 
> > > > > > > > (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > > > > > > [0x5581ed4441f0]
> > > > > > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > > > > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > > > > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Thanks & Regards
> > > > > > > > Somnath
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > > > > > To: Somnath Roy
> > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > 
> > > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > > > > > compaction during this time.
> > > > > > > > 
> > > > > > > > How are you selecting universal compaction?
> > > > > > > > 
> > > > > > > > sage
Somnath Roy Sept. 6, 2016, 5:13 a.m. UTC | #41
Sage,
Here is one of the asserts that I can reproduce consistently while running with the big runway values, during 10 hours of 4K RW without preconditioning.

1. 

in thread 7f9de27ff700 thread_name:rocksdb:bg7

 ceph version 11.0.0-1946-g9a5cfe2 (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
 1: (()+0xa0d94e) [0x55aa63cd194e]
 2: (()+0x113d0) [0x7f9df56723d0]
 3: (gsignal()+0x38) [0x7f9df33f7418]
 4: (abort()+0x16a) [0x7f9df33f901a]
 5: (()+0x2dbd7) [0x7f9df33efbd7]
 6: (()+0x2dc82) [0x7f9df33efc82]
 7: (BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)+0x1d43) [0x55aa63abea33]
 8: (BlueFS::sync_metadata()+0x38b) [0x55aa63abef0b]
 9: (BlueRocksDirectory::Fsync()+0xd) [0x55aa63ad303d]
 10: (rocksdb::CompactionJob::Run()+0xe86) [0x55aa63ca3c96]
 11: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) [0x55aa63b91c50]
 12: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) [0x55aa63b9eabf]
 13: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) [0x55aa63c556b9]
 14: (()+0x991753) [0x55aa63c55753]
 15: (()+0x76fa) [0x7f9df56686fa]
 16: (clone()+0x6d) [0x7f9df34c8b5d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I was able to root cause it as an async log compaction bug. Here is my analysis..

Here is the log snippet (for the crashing thread) it dumped with debug_bluefs = 0/20.

   -95> 2016-09-05 18:09:38.242895 7f8d7a3f5700 10 bluefs _compact_log_async remove 0x32100000 of [1:0x3f2d900000+100000,0:0x1e7700000+19000000]
   -94> 2016-09-05 18:09:38.242903 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 1:0x3f2d900000+100000
   -93> 2016-09-05 18:09:38.242905 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 0:0x1e7700000+19000000
   -92> 2016-09-05 18:09:38.242907 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 0:0x1e7700000+19000000

So, the last two extent entries are identical, and the list is corrupted because we don't check whether the vector is empty in the following loop. We call front() on an empty vector, which is undefined behaviour, and it later crashes when we erase with begin(): begin() on an empty vector can't be dereferenced.

  dout(10) << __func__ << " remove 0x" << std::hex << old_log_jump_to << std::dec
	   << " of " << log_file->fnode.extents << dendl;
  uint64_t discarded = 0;
  vector<bluefs_extent_t> old_extents;
  while (discarded < old_log_jump_to) {
    bluefs_extent_t& e = log_file->fnode.extents.front();
    bluefs_extent_t temp = e;
    if (discarded + e.length <= old_log_jump_to) {
      dout(10) << __func__ << " remove old log extent " << e << dendl;
      discarded += e.length;
      log_file->fnode.extents.erase(log_file->fnode.extents.begin());
    } else {
      dout(10) << __func__ << " remove front of old log extent " << e << dendl;
      uint64_t drop = old_log_jump_to - discarded;
      temp.length = drop;
      e.offset += drop;
      e.length -= drop;
      discarded += drop;
      dout(10) << __func__ << "   kept " << e << " removed " << temp << dendl;
    }
    old_extents.push_back(temp);
  }

But the question is: other than adding an empty check for the vector, do we need to do anything else? And why, in this case after ~7 hours, is old_log_jump_to bigger than the total length of the extent vector (because of the bigger runway config?)?
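
To make the question concrete, here is a minimal, self-contained sketch of the guard I have in mind (hypothetical helper names, not the actual BlueFS code); the point is the empty() check before front()/erase(begin()):

#include <cassert>
#include <cstdint>
#include <vector>

// front() and erase(begin()) on an empty std::vector are undefined behaviour,
// so the trim loop has to stop (or assert) once the extents run out.
uint64_t discard_front(std::vector<uint64_t>& extent_lengths, uint64_t target)
{
  uint64_t discarded = 0;
  while (discarded < target) {
    if (extent_lengths.empty()) {
      // the missing guard: stop instead of touching front() on an empty vector
      assert(false && "old_log_jump_to exceeds the allocated extents");
      break;
    }
    uint64_t len = extent_lengths.front();
    if (discarded + len <= target) {
      discarded += len;
      extent_lengths.erase(extent_lengths.begin());
    } else {
      uint64_t drop = target - discarded;
      extent_lengths.front() -= drop;   // keep the tail of this extent
      discarded += drop;
    }
  }
  return discarded;
}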

2. Here is another assert, during recovery, which I was not able to reproduce again later. Unfortunately, the 0/20 log doesn't say anything about the thread!

2016-09-02 19:04:35.261856 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34638 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).fault
2016-09-02 19:04:35.262428 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34682 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
2016-09-02 19:04:35.263045 7ff3ae1fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.10:6805/937878 pipe(0x7fee55871000 sd=36 :45296 s=2 pgs=50 cs=1 l=0 c=0x7ff3aa87c640).fault, initiating reconnect
2016-09-02 19:04:35.263477 7ff3e31fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.10:6805/937878 pipe(0x7fee55871000 sd=36 :45348 s=1 pgs=50 cs=2 l=0 c=0x7ff3aa87c640).connect got RESETSESSION
2016-09-02 19:04:35.270038 7ff3c5ff4700  0 -- 10.60.194.11:6832/1227385 submit_message MOSDPGPushReply(1.235 22 [PushReplyOp(1:ac44029e:::rbd_data.10176b8b4567.00000000001bb8db:head),PushReplyOp(1:ac44036d:::rbd_data.10176b8b4567.00000000005a5365:head),PushReplyOp(1:ac440604:::rbd_data.10176b8b4567.000000000043d177:head),PushReplyOp(1:ac440608:::rbd_data.10176b8b4567.00000000002aba83:head),PushReplyOp(1:ac44089f:::rbd_data.10176b8b4567.0000000000710e5d:head),PushReplyOp(1:ac4409cd:::rbd_data.10176b8b4567.0000000000689b0d:head),PushReplyOp(1:ac440c37:::rbd_data.10176b8b4567.00000000002d1db3:head),PushReplyOp(1:ac440e0b:::rbd_data.10176b8b4567.00000000009801e1:head)]) v2 remote, 10.60.194.11:6829/227799, failed lossy con, dropping message 0x7ff245058380
2016-09-02 19:04:35.282823 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=43 :34694 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
2016-09-02 19:04:35.293903 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34696 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
2016-09-02 19:04:39.366837 7ff3befe6700  0 log_channel(cluster) log [INF] : 1.130 continuing backfill to osd.5 from (20'392769,22'395772] MIN to 22'395772
2016-09-02 19:04:39.367262 7ff3c17eb700  0 log_channel(cluster) log [INF] : 1.39a continuing backfill to osd.4 from (20'383603,22'386606] MIN to 22'386606
2016-09-02 19:04:39.368695 7ff3c2fee700  0 log_channel(cluster) log [INF] : 1.253 continuing backfill to osd.1 from (20'386883,22'389884] MIN to 22'389884
2016-09-02 19:04:39.408083 7ff3c17eb700  0 log_channel(cluster) log [INF] : 1.70 continuing backfill to osd.1 from (20'388673,22'391675] MIN to 22'391675
2016-09-02 19:04:39.408152 7ff3bf7e7700  0 log_channel(cluster) log [INF] : 1.2bd continuing backfill to osd.1 from (20'389889,22'392892] MIN to 22'392892
2016-09-02 19:04:40.617675 7ff3b51fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.11:6801/226086 pipe(0x7fef0ebfe000 sd=85 :6832 s=0 pgs=0 cs=0 l=0 c=0x7ff20d7ce280).accept connect_seq 0 vs existing 0 state connecting
2016-09-02 19:04:40.617770 7ff3b62fd700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.11:6801/226086 pipe(0x7ff37b01a000 sd=86 :43610 s=4 pgs=0 cs=0 l=0 c=0x7ff37b01c140).connect got RESETSESSION but no longer connecting
2016-09-02 19:04:41.197663 7ff3c0fea700  0 log_channel(cluster) log [INF] : 1.1ed continuing backfill to osd.0 from (20'393177,22'396182] MIN to 22'396182
2016-09-02 19:04:41.197689 7ff3c2fee700  0 log_channel(cluster) log [INF] : 1.3fa continuing backfill to osd.0 from (20'391286,22'394289] MIN to 22'394289
2016-09-02 19:04:41.197736 7ff3c37ef700  0 log_channel(cluster) log [INF] : 1.290 continuing backfill to osd.0 from (20'389914,22'392915] MIN to 22'392915
2016-09-02 19:04:41.197752 7ff3c37ef700  0 log_channel(cluster) log [INF] : 1.290 continuing backfill to osd.13 from (20'389914,22'392915] MIN to 22'392915
2016-09-02 19:04:41.197759 7ff3bffe8700  0 log_channel(cluster) log [INF] : 1.260 continuing backfill to osd.0 from (20'388867,22'391871] MIN to 22'391871
2016-09-02 19:04:41.405458 7ff39f7ff700 -1 os/bluestore/BlueStore.cc: In function 'void BlueStore::OnodeSpace::add(const ghobject_t&, BlueStore::OnodeRef)' thread 7ff39f7ff700 time 2016-09-02 19:04:41.387802
os/bluestore/BlueStore.cc: 1065: FAILED assert(onode_map.count(oid) == 0)

 ceph version 11.0.0-1946-g9a5cfe2 (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x557bda1c9750]
 2: (BlueStore::OnodeSpace::add(ghobject_t const&, boost::intrusive_ptr<BlueStore::Onode>)+0x4bf) [0x557bd9d9cd0f]
 3: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x63e) [0x557bd9d9d3ae]
 4: (BlueStore::get_omap_iterator(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&)+0xc5) [0x557bd9da2c45]
 5: (BlueStore::get_omap_iterator(coll_t const&, ghobject_t const&)+0x7a) [0x557bd9d7e50a]
 6: (OSDriver::get_next(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list>*)+0x45) [0x557bd9af2305]
 7: (SnapMapper::get_next_object_to_trim(snapid_t, hobject_t*)+0x482) [0x557bd9af2bf2]
 8: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x4ac) [0x557bd9bfd8fc]
 9: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc8) [0x557bd9c42418]
 10: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x139) [0x557bd9c2d299]
 11: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x111) [0x557bd9c2d541]
 12: (ReplicatedPG::snap_trimmer(unsigned int)+0x468) [0x557bd9ba1258]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x750) [0x557bd9a727b0]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) [0x557bda1b65cf]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x557bda1b9c90]
 16: (Thread::entry_wrapper()+0x75) [0x557bda1a9065]
 17: (()+0x76fa) [0x7ff402b126fa]
 18: (clone()+0x6d) [0x7ff400972b5d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy 
Sent: Friday, September 02, 2016 1:27 PM
To: 'Sage Weil'
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

Yes, did that with a similar ratio, see below: max = 400MB, min = 100MB.
Will see how it goes, thanks..

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Friday, September 02, 2016 1:25 PM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Fri, 2 Sep 2016, Somnath Roy wrote:
> Sage,
> I am running with big runway values now (min 100 MB, max 400 MB) and will keep you posted on this.
> One point: if I use these big runway values, allocation will be very frequent (and probably unnecessary in most cases); no harm in that?

I think it'll actually be less frequent, since it allocates bluefs_max_log_runway at a time.  Well, assuming you set that tunable as high as well!

sage

> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, September 02, 2016 12:27 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Fri, 2 Sep 2016, Somnath Roy wrote:
> > Sage,
> > It is probably the universal compaction that is generating the bigger files; other than that, I don't see how the following tuning would generate large files.
> > I will try some universal-compaction tuning related to file size and confirm.
> > Yeah, a big bluefs_min_log_runway value will probably shield the assert for now, but I am afraid that since we don't have control over the file size, we can't be sure what value to give bluefs_min_log_runway so that the assert doesn't hit in future long runs.
> > Can't we do something like this?
> > 
> > // Basically, checking the length of the log as well:
> > if (runway < g_conf->bluefs_min_log_runway ||
> >     runway < log_writer->buffer.length()) {
> >   // allocate
> > }
> 
> Oh, I see what you mean.  Yeah, I'll add that in--certainly doesn't hurt.
> 
> And I think just configuring a long runway won't hurt either (e.g., 100MB).
> 
> That's probably enough to be safe, but once we fix the flush thing I mentioned that will make this go away.
> 
> s
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, September 02, 2016 10:57 AM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > Here is my rocksdb option :
> > > 
> > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > 
> > > One discrepancy I can see here is max_bytes_for_level_base; it should be
> > > the same as the level-0 size. Initially, I had a bigger
> > > min_write_buffer_number_to_merge, and that's how I calculated it. Now the
> > > level-0 size is the following:
> > > 
> > > write_buffer_size * min_write_buffer_number_to_merge * 
> > > level0_file_num_compaction_trigger = 80MB * 2 * 4 = ~640MB
> > > 
> > > I should probably adjust max_bytes_for_level_base to a similar value (~640MB rather than the current 5368709120).
> > > 
> > > Please find the level-10 log here. It is the log I captured during replay (it crashed again) after the original crash for the same reason.
> > > 
> > > https://drive.google.com/file/d/0B7W-S0z_ymMJSnA3R2dyellYZ0U/view?usp=sharing
> > > 
> > > Thanks for the explanation; I get now why it is trying to
> > > flush inode 1.
> > > 
> > > But shouldn't we check the length as well during the runway check,
> > > rather than relying on bluefs_min_log_runway only?
> > 
> > That's what this does:
> > 
> >   uint64_t runway = log_writer->file->fnode.get_allocated() - 
> > log_writer->pos;
> > 
> > Anyway, I think I see what's going on.  There are a ton of _fsync and _flush_range calls that have to flush the fnode, and the fnode is pretty big (maybe 5k?) because it has so many extents (your tunables are generating really big files).
> > 
> > I think this is just a matter of the runway configurable being too small for your configuration.  Try bumping bluefs_min_log_runway by 10x.
> > 
> > Well, actually, we could improve this a bit.  Right now rocksdb is calling lots of flushes on a big sst, and a final fsync at the end.  Bluefs is logging the updated fnode every time a flush changes the file size, and then only writing it to disk when the final fsync happens.  Instead, it could/should put the dirty fnode on a list and, only at log flush time, append the latest fnodes to the log and write them out.
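> > 
> > Roughly this shape (toy sketch with made-up types, just to illustrate the idea; not actual BlueFS code):
> > 
> >   #include <cstdint>
> >   #include <memory>
> >   #include <vector>
> > 
> >   struct File {
> >     uint64_t size = 0;
> >     bool dirty = false;
> >   };
> > 
> >   struct Journal {
> >     std::vector<std::shared_ptr<File>> dirty_files;
> > 
> >     // called on every flush that grows a file: just remember the file,
> >     // don't serialize its metadata yet
> >     void note_size_change(std::shared_ptr<File> f, uint64_t new_size) {
> >       f->size = new_size;
> >       if (!f->dirty) {
> >         f->dirty = true;
> >         dirty_files.push_back(f);
> >       }
> >     }
> > 
> >     // called once when the journal itself is flushed: emit only the latest
> >     // metadata for each dirty file (one update per file), then clear the list
> >     void flush() {
> >       for (auto& f : dirty_files)
> >         f->dirty = false;
> >       dirty_files.clear();
> >     }
> >   };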
> > 
> > I'll add it to the trello board.  I think it's not that big a deal.. 
> > except when you have really big files.
> > 
> > sage
> > 
> > 
> >  >
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Friday, September 02, 2016 9:35 AM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > 
> > > 
> > > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > 
> > > > Sage,
> > > > Tried to do some analysis on the inode assert, following looks suspicious.
> > > > 
> > > >    -10> 2016-08-31 17:55:56.921075 7faf14fff700 10 bluefs _fsync 0x7faf12075140 file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1:0!
 x7!
>  53!
> >  00!
> > >  000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d00000!
 0+!
>  10!
> >  00!
> > >  
> > > 00,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0
> > > x7
> > > d8
> > > 00000+10e00000])
> > > >     -9> 2016-08-31 17:55:56.921117 7faf14fff700 10 bluefs _flush 0x7faf12075140 no dirty data on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75!
 10!
>  00!
> >  00!
> > >  +100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+1!
 00!
>  00!
> >  0,!
> > >  
> > > 1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d
> > > 60
> > > 00
> > > 00+100000,1:0x7d800000+10e00000])
> > > >     -8> 2016-08-31 17:55:56.921137 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140
> > > >     -7> 2016-08-31 17:55:56.931541 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140 done in 0.010402
> > > >     -6> 2016-08-31 17:55:56.931551 7faf14fff700 20 bluefs _fsync file metadata was dirty (31278) on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0!
 x7!
>  51!
> >  00!
> > >  000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce0000!
 0+!
>  10!
> >  00!
> > >  
> > > 00,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0
> > > x7
> > > d6
> > > 00000+100000,1:0x7d800000+10e00000]), flushing log
> > > > 
> > > > The above looks good, it is about to call _flush_and_sync_log() after this.
> > > 
> > > Yes, although this file is huge (0x19c2c265 = 432194149 ~ 400MB)... What rocksdb options did you pass in?  I'm guessing this is a log file, but we generally want those smallish (maybe 16MB - 64MB, so that L0 SST generation isn't too slow).
> > > 
> > > >     -5> 2016-08-31 17:55:56.931588 7faf14fff700 10 bluefs _flush_and_sync_log txn(seq 31278 len 0x50e146 crc 0x58a6b9ab)
> > > >     -4> 2016-08-31 17:55:56.933951 7faf14fff700 10 bluefs _pad_bl padding with 0xe94 zeros
> > > >     -3> 2016-08-31 17:55:56.934079 7faf14fff700 20 bluefs flush_bdev
> > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > >     -1> 2016-08-31 17:55:56.934274 7faf14fff700 10 bluefs _flush_range 0x7faf2e824140 pos 0xce9000 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > >      0> 2016-08-31 17:55:56.939745 7faf14fff700 -1
> > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)'
> > > > thread
> > > > 7faf14fff700 time 2016-08-31 17:55:56.934282
> > > > os/bluestore/BlueFS.cc: 1390: FAILED assert(h->file->fnode.ino 
> > > > !=
> > > > 1)
> > > > 
> > > >  ceph version 11.0.0-1946-g9a5cfe2
> > > > (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > char
> > > > const*)+0x80) [0x56073c27c7d0]
> > > >  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, 
> > > > unsigned
> > > > long)+0x1d69) [0x56073bf4e109]
> > > >  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) 
> > > > [0x56073bf4e2d7]
> > > >  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&,
> > > > unsigned long, unsigned long)+0x443) [0x56073bf4fe13]
> > > >  5: (BlueFS::_fsync(BlueFS::FileWriter*,
> > > > std::unique_lock<std::mutex>&)+0x35b) [0x56073bf5140b]
> > > >  6: (BlueRocksWritableFile::Sync()+0x62) [0x56073bf68be2]
> > > >  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1)
> > > > [0x56073c0f24b1]
> > > >  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0)
> > > > [0x56073c0f3960]
> > > >  9: 
> > > > (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Sta
> > > > tu s const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6)
> > > > [0x56073c1354c6]
> > > >  10: 
> > > > (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::Comp
> > > > ac
> > > > ti
> > > > on
> > > > Job::SubcompactionState*)+0x14ea) [0x56073c137c8a]
> > > >  11: (rocksdb::CompactionJob::Run()+0x479) [0x56073c138c09]
> > > >  12: (rocksdb::DBImpl::BackgroundCompaction(bool*,
> > > > rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) 
> > > > [0x56073c0275d0]
> > > >  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf)
> > > > [0x56073c03443f]
> > > >  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) 
> > > > [0x56073c0eb039]
> > > >  15: (()+0x9900d3) [0x56073c0eb0d3]
> > > >  16: (()+0x76fa) [0x7faf3d1106fa]
> > > >  17: (clone()+0x6d) [0x7faf3af70b5d]
> > > > 
> > > > 
> > > > Now, as you can see it is calling _flush() with inode 1 , why ? is this expected ?
> > > 
> > > Yes.  The metadata for the log file is dirty (the file size changed), so bluefs is flushing its journal (ino 1) to update the fnode.
> > > 
> > > But this is very concerning:
> > > 
> > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush
> > > > 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime
> > > > 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > 
> > > 0xce9000~50f000 is a ~5 MB write.  Why would we ever write that much to the bluefs metadata journal at once?  That's why it's blowing the runway.
> > > We have ~17MB allocated (0:0xba00000+500000,0:0xfd00000+c00000), we're at offset ~13MB, and we're writing ~5MB.
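> > > (For reference: 0x500000 + 0xc00000 = 0x1100000 = ~17.8 MB allocated; pos 0xce9000 = ~13.5 MB; write length 0x50f000 = ~5.3 MB, and 13.5 + 5.3 runs past the 17.8 MB allocation.)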
> > > 
> > > > Question :
> > > > ------------
> > > > 
> > > > 1. Why are we using the existing log_writer to do a runway check?
> > > > 
> > > > https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1280
> > > > 
> > > > Shouldn't the log_writer need to be reinitialized with the FileWriter rocksdb sent with the sync call?
> > > 
> > > It's the bluefs journal writer.. that's the runway we're worried about.
> > > 
> > > > 2. The runway check is not considering the request length, so why is it
> > > > not expected to allocate here
> > > > (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1388)?
> > > > 
> > > > If the snippet is not sufficient, let me know if you want me to upload the level 10 log or need 20/20 log to proceed further.
> > > 
> > > The level 10 log is probably enough...
> > > 
> > > Thanks!
> > > sage
> > > 
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: Somnath Roy
> > > > Sent: Thursday, September 01, 2016 3:59 PM
> > > > To: 'Sage Weil'
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > Sage,
> > > > Created the following pull request on rocksdb repo, please take a look.
> > > > 
> > > > https://github.com/facebook/rocksdb/pull/1313
> > > > 
> > > > The fix is working fine for me.
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Wednesday, August 31, 2016 6:20 AM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > On Tue, 30 Aug 2016, Somnath Roy wrote:
> > > > > Sage,
> > > > > I did some debugging on the rocksdb bug; here are my findings.
> > > > > 
> > > > > 1. The log file number is added to log_recycle_files and *not* to log_delete_files by the following if block, which is expected.
> > > > > 
> > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L854
> > > > > 
> > > > > 2. But, it is there in the candidate list in the following loop.
> > > > > 
> > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L1000
> > > > > 
> > > > > 
> > > > > 3. This means it is added to full_scan_candidate_files by the
> > > > > following, i.e. from a full scan (?)
> > > > > 
> > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L834
> > > > > 
> > > > > Added some log entries to verify , need to wait 6 hours :-(
> > > > > 
> > > > > 4. Probably #3 is not unusual, but the check in the following doesn't seem sufficient to keep the file.
> > > > > 
> > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L1013
> > > > > 
> > > > > Again, I added some logging to see state.log_number during that time. BTW, the following check is probably a noop, as state.prev_log_number always comes back 0.
> > > > > 
> > > > > (number == state.prev_log_number)
> > > > > 
> > > > > 5. So, the quick solution I am thinking of is to add a check for whether the log is in the recycle list and avoid deleting it in the above code (?).
> > > > 
> > > > That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and ask in the comment if there is a better solution.
> > > > 
> > > > Probably the recycle list should be turned into a set so that the check is O(log n)...
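> > > > 
> > > > For illustration only (made-up helper name, not actual rocksdb code), the check would look roughly like:
> > > > 
> > > >   #include <cstdint>
> > > >   #include <set>
> > > > 
> > > >   // With the recycled WAL numbers in a std::set, the "is this log being
> > > >   // recycled?" test before deletion is an O(log n) lookup.
> > > >   bool can_delete_wal(const std::set<uint64_t>& recycling, uint64_t number)
> > > >   {
> > > >     return recycling.count(number) == 0;   // skip logs queued for recycling
> > > >   }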
> > > > 
> > > > Thanks!
> > > > sage
> > > > 
> > > > 
> > > > > 
> > > > > Let me know what you think.
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Sunday, August 28, 2016 7:37 AM
> > > > > To: 'Sage Weil'
> > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > Sage,
> > > > > Some updates on this.
> > > > > 
> > > > > 1. The issue is reproduced with the latest rocksdb master as well.
> > > > > 
> > > > > 2. The issue is *not happening* if I run with rocksdb log recycling disabled. This proves our root cause is right.
> > > > > 
> > > > > 3. Running some more performance tests with log recycling disabled, but the initial impression is that it introduces spikes and the output is not as stable as with log recycling enabled.
> > > > > 
> > > > > 4. Created a rocksdb issue for this
> > > > > (https://github.com/facebook/rocksdb/issues/1303)
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Thursday, August 25, 2016 2:35 PM
> > > > > To: 'Sage Weil'
> > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > Sage,
> > > > > Hope you are able to download the log I shared via google doc.
> > > > > It seems the bug is around this portion.
> > > > > 
> > > > > 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log
> > > > > 254 to recycle list
> > > > > 
> > > > > 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log
> > > > > 256 to recycle list
> > > > > 
> > > > > 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original 
> > > > > Log Time
> > > > > 2016/08/25-00:44:03.347467) [default] Level-0 commit table
> > > > > #258 started
> > > > > 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original 
> > > > > Log Time
> > > > > 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: 
> > > > > memtable #1 done
> > > > > 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original 
> > > > > Log Time
> > > > > 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: 
> > > > > memtable #2 done
> > > > > 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original 
> > > > > Log Time
> > > > > 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 
> > > > > 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": 
> > > > > [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> > > > > 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original 
> > > > > Log Time
> > > > > 2016/08/25-00:44:03.348297) [default] Level summary: base 
> > > > > level
> > > > > 1 max bytes base 5368709120 files[3 4 0 0 0 0 0] max score
> > > > > 0.75
> > > > > 
> > > > > 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> > > > > 
> > > > > 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink 
> > > > > db.wal/000256.log
> > > > > 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link 
> > > > > had refs
> > > > > 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25
> > > > > 00:41:26.298423 bdev
> > > > > 0 extents
> > > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link 
> > > > > destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25
> > > > > 00:41:26.298423 bdev 0 extents 
> > > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink 
> > > > > db.wal/000254.log
> > > > > 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link 
> > > > > had refs
> > > > > 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25
> > > > > 00:41:26.299110 bdev
> > > > > 0 extents
> > > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])
> > > > > 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link 
> > > > > destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25
> > > > > 00:41:26.299110 bdev 0 extents 
> > > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])
> > > > > 
> > > > > So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).
> > > > > 
> > > > > I was going through the rocksdb code and I found the following.
> > > > > 
> > > > > 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> > > > > 
> > > > > 2. The log is deleted from DBImpl::DeleteObsoleteFileImpl, which is *not* under the lock but iterates over log_delete_files. This is fishy, but it shouldn't be the reason for the same log number ending up in two lists.
> > > > > 
> > > > > 3. I looked at all the places, but in the following place alive_log_files_ (within DBImpl::WriteImpl) is accessed without the lock.
> > > > > 
> > > > > 4625       alive_log_files_.back().AddSize(log_entry.size());   
> > > > > 
> > > > > Could it be reintroducing the same log number (254)? I am not sure.
> > > > > 
> > > > > In summary, it seems to be a rocksdb bug, and setting recycle_log_file_num = 0 should *bypass* it. I need to check the performance impact of this, though.
> > > > > 
> > > > > Should I post this to the rocksdb community, or is there another place where I can get a response from rocksdb folks?
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Wednesday, August 24, 2016 2:52 PM
> > > > > To: 'Sage Weil'
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > Sage,
> > > > > Thanks for looking, glad that we figured out something :-)..
> > > > > So, you want me to reproduce this with only debug_bluefs = 20/20? You don't need the bluestore log?
> > > > > Hope my root partition doesn't get full; this crash happened after 6 hours :-)
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Wednesday, August 24, 2016 2:34 PM
> > > > > To: Somnath Roy
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > Sage, It is there in the following github link I posted 
> > > > > > earlier..You can see 3 logs there..
> > > > > > 
> > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88a1b28fcc39
> > > > > 
> > > > > Ah sorry, got it.
> > > > > 
> > > > > And looking at the crash and code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or in the rocksdb recycling code.
> > > > > 
> > > > > I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> > > > > 
> > > > > sage
> > > > > 
> > > > > 
> > > > > 
> > > > >  >
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > Sent: Wednesday, August 24, 2016 1:43 PM
> > > > > > To: Somnath Roy
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > > Sage,
> > > > > > > I got the db assert log from submit_transaction in the following location.
> > > > > > > 
> > > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88a1b28fcc39
> > > > > > > 
> > > > > > > This is the log with level 1/20 and with my hook for printing the rocksdb::WriteBatch transaction. I have uploaded 3 osd logs, and a common pattern before the crash is the following.
> > > > > > > 
> > > > > > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: 
> > > > > > > reusing log 266 from recycle list
> > > > > > > 
> > > > > > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > > > > > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > > > > > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > > > > > 
> > > > > > > It is trying to rename the WAL file, it seems, and the old file is not found. You can see the transaction printed in the log along with the error code, like this.
> > > > > > 
> > > > > > How much of the log do you have? Can you post what you have somewhere?
> > > > > > 
> > > > > > Thanks!
> > > > > > sage
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > > > > > submit_transaction error: NotFound:  code = 1 
> > > > > > > rocksdb::WriteBatch = Put( Prefix = M key =
> > > > > > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > > > > > Put( Prefix = M key = 0x0000000000001483'._info') Put( 
> > > > > > > Prefix = O key =
> > > > > > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.00
> > > > > > > 00
> > > > > > > 00
> > > > > > > 00
> > > > > > > 00
> > > > > > > 89
> > > > > > > ac
> > > > > > > e7!'0xfffffffffffffffeffffffffffffffff)
> > > > > > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = 
> > > > > > > B key =
> > > > > > > 0x000004e73af72000) Merge( Prefix = T key =
> > > > > > > 'bluestore_statfs')
> > > > > > > 
> > > > > > > Hope my decoding of the key is proper; I reused Bluestore's pretty_binary_string() after removing the first 2 bytes of the key, which are the prefix and a '0'.
> > > > > > > 
> > > > > > > Any suggestion on the next step for root-causing this db assert, if the log rename is not enough of a hint?
> > > > > > > 
> > > > > > > BTW, this time I ran with bluefs_compact_log_sync = true and without commenting out the inode number asserts. I didn't hit those asserts, and no corruption so far. Seems like a bug in async compaction. Will try to reproduce that one with a verbose log later.
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Somnath Roy
> > > > > > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > > > > > To: 'Sage Weil'
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > I was running my tests for 2 hours and it happened within that time.
> > > > > > > I will try to reproduce with 1/20.
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > > > > > To: Somnath Roy
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > Sage,
> > > > > > > > I think there is some bug introduced recently in
> > > > > > > > BlueFS, and I am getting corruption like this which I was not facing earlier.
> > > > > > > 
> > > > > > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > > > > > 
> > > > > > > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > > > > > > 
> > > > > > > Thanks!
> > > > > > > sage
> > > > > > > 
> > > > > > > > 
> > > > > > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > > > > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > > > > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > > > > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > > > > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > > > > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: 
> > > > > > > > In function 'OSDMapRef OSDService::get_map(epoch_t)' 
> > > > > > > > thread
> > > > > > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > > > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > > > > > 
> > > > > > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > > > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > > int, char
> > > > > > > > const*)+0x80) [0x5617f2a99e80]
> > > > > > > >  2: (OSDService::get_map(unsigned int)+0x5d) 
> > > > > > > > [0x5617f2395fdd]
> > > > > > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > > > > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > > > > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > > > > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > > > > > 
> > > > > > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > > > > > 
> > > > > > > > Thanks & Regards
> > > > > > > > Somnath
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Somnath Roy
> > > > > > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > > > > > To: 'Sage Weil'
> > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > 
> > > > > > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > > > > > Here is the option I am using..
> > > > > > > > 
> > > > > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > > > > 
> > > > > > > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > > > > > > 
> > > > > > > > 
> > > > > > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > > > > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > > > > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > > > > > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > > > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > > > > > 
> > > > > > > >  ceph version 11.0.0-1688-g3fcc89c (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5581ed453cb0]
> > > > > > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > > > > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > > > > > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
> > > > > > > >  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&, rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > > > > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > > > > > >  7: (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > > > > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) [0x5581ed28ba68]
> > > > > > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > > > > > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) [0x5581ed25c458]
> > > > > > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > > > > > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > > > > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x157) [0x5581ed1d21d7]
> > > > > > > >  14: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x55b) [0x5581ed02802b]
> > > > > > > >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > > > > > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) [0x5581ed032bc2]
> > > > > > > >  17: (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > > > > > >  18: (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39) [0x5581ecef89e9]
> > > > > > > >  19: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb) [0x5581ecefeb4b]
> > > > > > > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > > > > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) [0x5581eccdd2e9]
> > > > > > > >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x52) [0x5581eccdd542]
> > > > > > > >  23: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > > > > > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) [0x5581ed440b2f]
> > > > > > > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5581ed4441f0]
> > > > > > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > > > > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > > > > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Thanks & Regards
> > > > > > > > Somnath
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > > > > > To: Somnath Roy
> > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > 
> > > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > > > > > compaction during this time.
> > > > > > > > 
> > > > > > > > How are you selecting universal compaction?
> > > > > > > > 
> > > > > > > > sage
Sage Weil Sept. 6, 2016, 1:27 p.m. UTC | #42
On Tue, 6 Sep 2016, Somnath Roy wrote:
> Sage,
> Here is one of the asserts that I can reproduce consistently while running with big runway values for 10 hours of 4K RW without preconditioning.
> 
> 1. 
> 
> in thread 7f9de27ff700 thread_name:rocksdb:bg7
> 
>  ceph version 11.0.0-1946-g9a5cfe2 (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
>  1: (()+0xa0d94e) [0x55aa63cd194e]
>  2: (()+0x113d0) [0x7f9df56723d0]
>  3: (gsignal()+0x38) [0x7f9df33f7418]
>  4: (abort()+0x16a) [0x7f9df33f901a]
>  5: (()+0x2dbd7) [0x7f9df33efbd7]
>  6: (()+0x2dc82) [0x7f9df33efc82]
>  7: (BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)+0x1d43) [0x55aa63abea33]
>  8: (BlueFS::sync_metadata()+0x38b) [0x55aa63abef0b]
>  9: (BlueRocksDirectory::Fsync()+0xd) [0x55aa63ad303d]
>  10: (rocksdb::CompactionJob::Run()+0xe86) [0x55aa63ca3c96]
>  11: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) [0x55aa63b91c50]
>  12: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) [0x55aa63b9eabf]
>  13: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) [0x55aa63c556b9]
>  14: (()+0x991753) [0x55aa63c55753]
>  15: (()+0x76fa) [0x7f9df56686fa]
>  16: (clone()+0x6d) [0x7f9df34c8b5d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> I was able to root cause it as an async log compaction bug. Here is my analysis.
> 
> Here is the log snippet (for the crashing thread) it dumped with debug_bluefs = 0/20.
> 
>    -95> 2016-09-05 18:09:38.242895 7f8d7a3f5700 10 bluefs _compact_log_async remove 0x32100000 of [1:0x3f2d900000+100000,0:0x1e7700000+19000000]
>    -94> 2016-09-05 18:09:38.242903 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 1:0x3f2d900000+100000
>    -93> 2016-09-05 18:09:38.242905 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 0:0x1e7700000+19000000
>    -92> 2016-09-05 18:09:38.242907 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 0:0x1e7700000+19000000
> 
> So, the last two extent entries are identical and the list is corrupted because we didn't check whether the vector is empty in the following loop. We call front() on an empty vector, which is undefined, and it later crashes when we erase via begin(); begin() on an empty vector can't be dereferenced.
> 
>   dout(10) << __func__ << " remove 0x" << std::hex << old_log_jump_to << std::dec
> 	   << " of " << log_file->fnode.extents << dendl;
>   uint64_t discarded = 0;
>   vector<bluefs_extent_t> old_extents;
>   while (discarded < old_log_jump_to) {
>     bluefs_extent_t& e = log_file->fnode.extents.front();
>     bluefs_extent_t temp = e;
>     if (discarded + e.length <= old_log_jump_to) {
>       dout(10) << __func__ << " remove old log extent " << e << dendl;
>       discarded += e.length;
>       log_file->fnode.extents.erase(log_file->fnode.extents.begin());
>     } else {
>       dout(10) << __func__ << " remove front of old log extent " << e << dendl;
>       uint64_t drop = old_log_jump_to - discarded;
>       temp.length = drop;
>       e.offset += drop;
>       e.length -= drop;
>       discarded += drop;
>       dout(10) << __func__ << "   kept " << e << " removed " << temp << dendl;
>     }
>     old_extents.push_back(temp);
>   }
> 
> But the question is: other than adding an empty check for the vector, do we need to do anything else? And why, in this case, after ~7 hours is old_log_jump_to bigger than the total length of the extent vector (because of the bigger runway config)?

Exactly--it shouldn't be.  old_log_jump_to *must* be less than the 
totally allocated extents.  It should equal just the extents that were 
present/used *prior* to us ensuring that runway is allocated.  Do you 
have a bit more log?  We need to see why it was big enough to empty out 
the vector...
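
For illustration, here is a small self-contained model of the extent-trim
loop quoted above with the empty() guard Somnath suggests.  Names and types
are simplified for the example; this is only a sketch of the guard, not the
actual fix, which also has to explain why old_log_jump_to overshoots the
allocated extents in the first place.

  #include <cassert>
  #include <cstdint>
  #include <vector>

  struct extent_t { uint64_t offset, length; };

  // Trim the first old_log_jump_to bytes worth of extents off the front of
  // the log file's extent list, returning what was removed.
  std::vector<extent_t> trim_front(std::vector<extent_t>& extents,
                                   uint64_t old_log_jump_to)
  {
    std::vector<extent_t> old_extents;
    uint64_t discarded = 0;
    while (discarded < old_log_jump_to) {
      assert(!extents.empty());          // guard: never walk past the fnode
      extent_t& e = extents.front();
      extent_t temp = e;
      if (discarded + e.length <= old_log_jump_to) {
        // whole extent is consumed by the jump
        discarded += e.length;
        extents.erase(extents.begin());
      } else {
        // keep the tail of this extent, discard only its front
        uint64_t drop = old_log_jump_to - discarded;
        temp.length = drop;
        e.offset += drop;
        e.length -= drop;
        discarded += drop;
      }
      old_extents.push_back(temp);
    }
    return old_extents;
  }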

> 2. Here is another assert, during recovery, which I was not able to reproduce again later. Unfortunately the 0/20 log doesn't say anything about that thread!

Hrm, hard to say what's going on there.  My guess is a secondary effect 
from the above.  At least we should rule it out.

sage

> 
> 2016-09-02 19:04:35.261856 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34638 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).fault
> 2016-09-02 19:04:35.262428 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34682 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
> 2016-09-02 19:04:35.263045 7ff3ae1fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.10:6805/937878 pipe(0x7fee55871000 sd=36 :45296 s=2 pgs=50 cs=1 l=0 c=0x7ff3aa87c640).fault, initiating reconnect
> 2016-09-02 19:04:35.263477 7ff3e31fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.10:6805/937878 pipe(0x7fee55871000 sd=36 :45348 s=1 pgs=50 cs=2 l=0 c=0x7ff3aa87c640).connect got RESETSESSION
> 2016-09-02 19:04:35.270038 7ff3c5ff4700  0 -- 10.60.194.11:6832/1227385 submit_message MOSDPGPushReply(1.235 22 [PushReplyOp(1:ac44029e:::rbd_data.10176b8b4567.00000000001bb8db:head),PushReplyOp(1:ac44036d:::rbd_data.10176b8b4567.00000000005a5365:head),PushReplyOp(1:ac440604:::rbd_data.10176b8b4567.000000000043d177:head),PushReplyOp(1:ac440608:::rbd_data.10176b8b4567.00000000002aba83:head),PushReplyOp(1:ac44089f:::rbd_data.10176b8b4567.0000000000710e5d:head),PushReplyOp(1:ac4409cd:::rbd_data.10176b8b4567.0000000000689b0d:head),PushReplyOp(1:ac440c37:::rbd_data.10176b8b4567.00000000002d1db3:head),PushReplyOp(1:ac440e0b:::rbd_data.10176b8b4567.00000000009801e1:head)]) v2 remote, 10.60.194.11:6829/227799, failed lossy con, dropping message 0x7ff245058380
> 2016-09-02 19:04:35.282823 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=43 :34694 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
> 2016-09-02 19:04:35.293903 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34696 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
> 2016-09-02 19:04:39.366837 7ff3befe6700  0 log_channel(cluster) log [INF] : 1.130 continuing backfill to osd.5 from (20'392769,22'395772] MIN to 22'395772
> 2016-09-02 19:04:39.367262 7ff3c17eb700  0 log_channel(cluster) log [INF] : 1.39a continuing backfill to osd.4 from (20'383603,22'386606] MIN to 22'386606
> 2016-09-02 19:04:39.368695 7ff3c2fee700  0 log_channel(cluster) log [INF] : 1.253 continuing backfill to osd.1 from (20'386883,22'389884] MIN to 22'389884
> 2016-09-02 19:04:39.408083 7ff3c17eb700  0 log_channel(cluster) log [INF] : 1.70 continuing backfill to osd.1 from (20'388673,22'391675] MIN to 22'391675
> 2016-09-02 19:04:39.408152 7ff3bf7e7700  0 log_channel(cluster) log [INF] : 1.2bd continuing backfill to osd.1 from (20'389889,22'392892] MIN to 22'392892
> 2016-09-02 19:04:40.617675 7ff3b51fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.11:6801/226086 pipe(0x7fef0ebfe000 sd=85 :6832 s=0 pgs=0 cs=0 l=0 c=0x7ff20d7ce280).accept connect_seq 0 vs existing 0 state connecting
> 2016-09-02 19:04:40.617770 7ff3b62fd700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.11:6801/226086 pipe(0x7ff37b01a000 sd=86 :43610 s=4 pgs=0 cs=0 l=0 c=0x7ff37b01c140).connect got RESETSESSION but no longer connecting
> 2016-09-02 19:04:41.197663 7ff3c0fea700  0 log_channel(cluster) log [INF] : 1.1ed continuing backfill to osd.0 from (20'393177,22'396182] MIN to 22'396182
> 2016-09-02 19:04:41.197689 7ff3c2fee700  0 log_channel(cluster) log [INF] : 1.3fa continuing backfill to osd.0 from (20'391286,22'394289] MIN to 22'394289
> 2016-09-02 19:04:41.197736 7ff3c37ef700  0 log_channel(cluster) log [INF] : 1.290 continuing backfill to osd.0 from (20'389914,22'392915] MIN to 22'392915
> 2016-09-02 19:04:41.197752 7ff3c37ef700  0 log_channel(cluster) log [INF] : 1.290 continuing backfill to osd.13 from (20'389914,22'392915] MIN to 22'392915
> 2016-09-02 19:04:41.197759 7ff3bffe8700  0 log_channel(cluster) log [INF] : 1.260 continuing backfill to osd.0 from (20'388867,22'391871] MIN to 22'391871
> 2016-09-02 19:04:41.405458 7ff39f7ff700 -1 os/bluestore/BlueStore.cc: In function 'void BlueStore::OnodeSpace::add(const ghobject_t&, BlueStore::OnodeRef)' thread 7ff39f7ff700 time 2016-09-02 19:04:41.387802
> os/bluestore/BlueStore.cc: 1065: FAILED assert(onode_map.count(oid) == 0)
> 
>  ceph version 11.0.0-1946-g9a5cfe2 (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x557bda1c9750]
>  2: (BlueStore::OnodeSpace::add(ghobject_t const&, boost::intrusive_ptr<BlueStore::Onode>)+0x4bf) [0x557bd9d9cd0f]
>  3: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x63e) [0x557bd9d9d3ae]
>  4: (BlueStore::get_omap_iterator(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&)+0xc5) [0x557bd9da2c45]
>  5: (BlueStore::get_omap_iterator(coll_t const&, ghobject_t const&)+0x7a) [0x557bd9d7e50a]
>  6: (OSDriver::get_next(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list>*)+0x45) [0x557bd9af2305]
>  7: (SnapMapper::get_next_object_to_trim(snapid_t, hobject_t*)+0x482) [0x557bd9af2bf2]
>  8: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x4ac) [0x557bd9bfd8fc]
>  9: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc8) [0x557bd9c42418]
>  10: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x139) [0x557bd9c2d299]
>  11: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x111) [0x557bd9c2d541]
>  12: (ReplicatedPG::snap_trimmer(unsigned int)+0x468) [0x557bd9ba1258]
>  13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x750) [0x557bd9a727b0]
>  14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) [0x557bda1b65cf]
>  15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x557bda1b9c90]
>  16: (Thread::entry_wrapper()+0x75) [0x557bda1a9065]
>  17: (()+0x76fa) [0x7ff402b126fa]
>  18: (clone()+0x6d) [0x7ff400972b5d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Somnath Roy 
> Sent: Friday, September 02, 2016 1:27 PM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> Yes, did that with a similar ratio, see below: max = 400MB, min = 100MB.
> Will see how it goes, thanks..
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, September 02, 2016 1:25 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Fri, 2 Sep 2016, Somnath Roy wrote:
> > Sage,
> > I am running with big runway values now (min 100 MB, max 400MB) and will keep you posted on this.
> > One point: if I give it these big runway values, the allocation will be very frequent (and probably unnecessarily so in most cases); no harm with that?
> 
> I think it'll actually be less frequent, since it allocates bluefs_max_log_runway at a time.  Well, assuming you set that tunable as high as well!
> 
> sage
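
For reference, the runway values being tested in this exchange would be set
roughly like this in ceph.conf (104857600 and 419430400 are simply 100 MB and
400 MB in bytes; these are the values from the test above, not a recommendation):

        bluefs_min_log_runway = 104857600
        bluefs_max_log_runway = 419430400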
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, September 02, 2016 12:27 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > Sage,
> > > It is probably the universal compaction that is generating bigger files; other than that, I don't see how the following tuning would generate large files.
> > > I will try some universal compaction tuning related to file size and confirm.
> > > Yeah, a big bluefs_min_log_runway value will probably shield the assert for now, but since we don't have control over the file size, I am afraid we can't be sure what value to give bluefs_min_log_runway so the assert won't hit in future long runs.
> > > Can't we do something like this?
> > > 
> > > // Basically, checking the length of the log as well:
> > > if (runway < g_conf->bluefs_min_log_runway ||
> > >     runway < log_writer->buffer.length()) {
> > >   // allocate
> > > }
> > 
> > Oh, I see what you mean.  Yeah, I'll add that in--certainly doesn't hurt.
> > 
> > And I think just configuring a long runway won't hurt either (e.g., 100MB).
> > 
> > That's probably enough to be safe, but once we fix the flush thing I mentioned that will make this go away.
> > 
> > s
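
For illustration, a minimal model of the combined condition being agreed on
here (illustrative names only; the real check lives in BlueFS and this is
not the actual patch):

  #include <cstdint>

  // Allocate more log runway when either the configured minimum runway or
  // the amount of journal data already buffered would not fit into what is
  // currently allocated.
  bool need_more_log_runway(uint64_t allocated, uint64_t pos,
                            uint64_t buffered, uint64_t min_runway)
  {
    uint64_t runway = allocated - pos;   // bytes of allocated space left
    return runway < min_runway || runway < buffered;
  }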
> > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Friday, September 02, 2016 10:57 AM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > > Here is my rocksdb option :
> > > > 
> > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > 
> > > > One discrepancy I can see here is max_bytes_for_level_base; it should be the same as the level 0 size. Initially I had a bigger min_write_buffer_number_to_merge, and that's how I calculated it. Now, the level 0 size is the following:
> > > > 
> > > > write_buffer_size * min_write_buffer_number_to_merge * 
> > > > level0_file_num_compaction_trigger = 80MB * 2 * 4 = ~640MB
> > > > 
> > > > I should probably adjust max_bytes_for_level_base to a similar value.
> > > > 
> > > > Please find the level 10 log here. It is the log I captured during replay (it crashed again) after it originally crashed for the same reason.
> > > > 
> > > > https://drive.google.com/file/d/0B7W-S0z_ymMJSnA3R2dyellYZ0U/view?usp=sharing
> > > > 
> > > > Thanks for the explanation; I understand now why it is trying to flush inode 1.
> > > > 
> > > > But shouldn't we check the length as well during the runway check, rather than relying only on bluefs_min_log_runway?
> > > 
> > > That's what this does:
> > > 
> > >   uint64_t runway = log_writer->file->fnode.get_allocated() - log_writer->pos;
> > > 
> > > Anyway, I think I see what's going on.  There are a ton of _fsync and _flush_range calls that have to flush the fnode, and the fnode is pretty big (maybe 5k?) because it has so many extents (your tunables are generating really big files).
> > > 
> > > I think this is just a matter of the runway configurable being too small for your configuration.  Try bumping bluefs_min_log_runway by 10x.
> > > 
> > > Well, actually, we could improve this a bit.  Right now rocksdb is calling lots of flushes on a big sst, and a final fsync at the end.  Bluefs is logging the updated fnode every time the flush changes the file size, and then only writing it to disk when the final fsync happens.  Instead, it could/should put the dirty fnode on a list and, only at log flush time, append the latest fnodes to the log and write it out.
> > > 
> > > I'll add it to the trello board.  I think it's not that big a deal.. 
> > > except when you have really big files.
> > > 
> > > sage
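
For illustration, a toy sketch of the batching idea described above: flushes
only mark a file's metadata dirty, and the journal append happens once per
log flush.  The types and names here are invented for the example (BlueFS
itself tracks files by their fnode, not strings), so treat this purely as a
sketch of the idea, not the eventual implementation.

  #include <map>
  #include <string>
  #include <vector>

  struct DirtyFnodes {
    std::map<int, std::string> dirty;   // ino -> latest encoded fnode

    // Called from flush/fsync paths instead of appending to the journal.
    void mark_dirty(int ino, const std::string& encoded_fnode) {
      dirty[ino] = encoded_fnode;       // later updates overwrite earlier ones
    }

    // Called once when the bluefs journal is actually flushed: emit a single
    // update per dirty file, then forget them.
    std::vector<std::string> take_for_log_flush() {
      std::vector<std::string> out;
      for (auto& p : dirty)
        out.push_back(p.second);
      dirty.clear();
      return out;
    }
  };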
> > > 
> > > 
> > >  >
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Friday, September 02, 2016 9:35 AM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > 
> > > > 
> > > > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > > 
> > > > > Sage,
> > > > > Tried to do some analysis on the inode assert; the following looks suspicious.
> > > > > 
> > > > >    -10> 2016-08-31 17:55:56.921075 7faf14fff700 10 bluefs _fsync 0x7faf12075140 file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1!
 :0!
>  x7!
> >  53!
> > >  00!
> > > >  000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d000!
 00!
>  0+!
> >  10!
> > >  00!
> > > >  
> > > > 00,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0
> > > > x7
> > > > d8
> > > > 00000+10e00000])
> > > > >     -9> 2016-08-31 17:55:56.921117 7faf14fff700 10 bluefs _flush 0x7faf12075140 no dirty data on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x!
 75!
>  10!
> >  00!
> > >  00!
> > > >  +100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000!
 +1!
>  00!
> >  00!
> > >  0,!
> > > >  
> > > > 1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d
> > > > 60
> > > > 00
> > > > 00+100000,1:0x7d800000+10e00000])
> > > > >     -8> 2016-08-31 17:55:56.921137 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140
> > > > >     -7> 2016-08-31 17:55:56.931541 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140 done in 0.010402
> > > > >     -6> 2016-08-31 17:55:56.931551 7faf14fff700 20 bluefs _fsync file metadata was dirty (31278) on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1!
 :0!
>  x7!
> >  51!
> > >  00!
> > > >  000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00!
 00!
>  0+!
> >  10!
> > >  00!
> > > >  
> > > > 00,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0
> > > > x7
> > > > d6
> > > > 00000+100000,1:0x7d800000+10e00000]), flushing log
> > > > > 
> > > > > The above looks good, it is about to call _flush_and_sync_log() after this.
> > > > 
> > > > Yes, although this file is huge (0x19c2c265 = 432194149 ~ 400MB)... What rocksdb options did you pass in?  I'm guessing this is a log file, but we generally want those smallish (maybe 16MB - 64MB, so that L0 SST generation isn't too slow).
> > > > 
> > > > >     -5> 2016-08-31 17:55:56.931588 7faf14fff700 10 bluefs _flush_and_sync_log txn(seq 31278 len 0x50e146 crc 0x58a6b9ab)
> > > > >     -4> 2016-08-31 17:55:56.933951 7faf14fff700 10 bluefs _pad_bl padding with 0xe94 zeros
> > > > >     -3> 2016-08-31 17:55:56.934079 7faf14fff700 20 bluefs flush_bdev
> > > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > > >     -1> 2016-08-31 17:55:56.934274 7faf14fff700 10 bluefs _flush_range 0x7faf2e824140 pos 0xce9000 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > > >      0> 2016-08-31 17:55:56.939745 7faf14fff700 -1 os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7faf14fff700 time 2016-08-31 17:55:56.934282
> > > > > os/bluestore/BlueFS.cc: 1390: FAILED assert(h->file->fnode.ino != 1)
> > > > > 
> > > > >  ceph version 11.0.0-1946-g9a5cfe2 (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x56073c27c7d0]
> > > > >  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1d69) [0x56073bf4e109]
> > > > >  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) [0x56073bf4e2d7]
> > > > >  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0x443) [0x56073bf4fe13]
> > > > >  5: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x35b) [0x56073bf5140b]
> > > > >  6: (BlueRocksWritableFile::Sync()+0x62) [0x56073bf68be2]
> > > > >  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1) [0x56073c0f24b1]
> > > > >  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0) [0x56073c0f3960]
> > > > >  9: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6) [0x56073c1354c6]
> > > > >  10: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x14ea) [0x56073c137c8a]
> > > > >  11: (rocksdb::CompactionJob::Run()+0x479) [0x56073c138c09]
> > > > >  12: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) [0x56073c0275d0]
> > > > >  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) [0x56073c03443f]
> > > > >  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) [0x56073c0eb039]
> > > > >  15: (()+0x9900d3) [0x56073c0eb0d3]
> > > > >  16: (()+0x76fa) [0x7faf3d1106fa]
> > > > >  17: (clone()+0x6d) [0x7faf3af70b5d]
> > > > > 
> > > > > 
> > > > > Now, as you can see, it is calling _flush() with inode 1. Why? Is this expected?
> > > > 
> > > > Yes.  The metadata for the log file is dirty (the file size changed), so bluefs is flushing its journal (ino 1) to update the fnode.
> > > > 
> > > > But this is very concerning:
> > > > 
> > > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush
> > > > > 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime
> > > > > 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > > 
> > > > 0xce9000~50f000 is a ~5 MB write.  Why would we ever write that much to the bluefs metadata journal at once?  That's why it's blowing the runway.
> > > > We have ~17MB allocated (0:0xba00000+500000,0:0xfd00000+c00000), we're at offset ~13MB, and we're writing ~5MB.
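
For reference, the hex values above work out as follows: the write is
0x50f000 = 5,304,320 bytes (~5 MB) starting at offset 0xce9000 = 13,537,280
bytes (~13 MB), against 0x500000 + 0xc00000 = 0x1100000 = 17,825,792 bytes
(~17 MB) of allocated space.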
> > > > 
> > > > > Question :
> > > > > ------------
> > > > > 
> > > > > 1. Why are we using the existing log_writer to do a runway check?
> > > > > 
> > > > > https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1280
> > > > > 
> > > > > Shouldn't the log_writer need to be reinitialized with the FileWriter rocksdb sent with the sync call?
> > > > 
> > > > It's the bluefs journal writer.. that's the runway we're worried about.
> > > > 
> > > > > 2. The runway check is not considering the request length, so why is it not expected to allocate here
> > > > > (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1388)?
> > > > > 
> > > > > If the snippet is not sufficient, let me know if you want me to upload the level 10 log, or if you need a 20/20 log to proceed further.
> > > > 
> > > > The level 10 log is probably enough...
> > > > 
> > > > Thanks!
> > > > sage
> > > > 
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Thursday, September 01, 2016 3:59 PM
> > > > > To: 'Sage Weil'
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > Sage,
> > > > > Created the following pull request on rocksdb repo, please take a look.
> > > > > 
> > > > > https://github.com/facebook/rocksdb/pull/1313
> > > > > 
> > > > > The fix is working fine for me.
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Wednesday, August 31, 2016 6:20 AM
> > > > > To: Somnath Roy
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > On Tue, 30 Aug 2016, Somnath Roy wrote:
> > > > > > Sage,
> > > > > > I did some debugging on the rocksdb bug; here are my findings.
> > > > > > 
> > > > > > 1. The log file number is added to log_recycle_files and *not* to log_delete_files by the following if block, which is expected.
> > > > > > 
> > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L854
> > > > > > 
> > > > > > 2. But, it is there in the candidate list in the following loop.
> > > > > > 
> > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L1000
> > > > > > 
> > > > > > 
> > > > > > 3. This means it is added to full_scan_candidate_files by the following, i.e. from a full scan (?)
> > > > > > 
> > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L834
> > > > > > 
> > > > > > Added some log entries to verify , need to wait 6 hours :-(
> > > > > > 
> > > > > > 4. #3 is probably not unusual, but the check in the following does not seem sufficient to keep the file.
> > > > > > 
> > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L1013
> > > > > > 
> > > > > > Again, I added some logging to see state.log_number during that time. BTW, the following check is probably a noop, as state.prev_log_number always comes back as 0.
> > > > > > 
> > > > > > (number == state.prev_log_number)
> > > > > > 
> > > > > > 5. So, the quick solution I am thinking of is to add a check for whether the log is in the recycle list and avoid deleting it in the above code (?).
> > > > > 
> > > > > That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and ask in the comment if there is a better solution.
> > > > > 
> > > > > Probably the recycle list should be turned into a set so that the check is O(log n)...
> > > > > 
> > > > > Thanks!
> > > > > sage
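
For illustration, a self-contained model of the guard being proposed (with
the recycle list kept as a set so the membership test is O(log n)).  The
names are invented for the example; the real change is the rocksdb pull
request (facebook/rocksdb#1313) referenced elsewhere in this thread.

  #include <cstdint>
  #include <set>
  #include <vector>

  // Filter the deletion candidates: skip any log number that is still on
  // the recycle list so a recyclable WAL is never unlinked.
  std::vector<uint64_t> logs_safe_to_delete(
      const std::vector<uint64_t>& candidates,
      const std::set<uint64_t>& recycle_logs)
  {
    std::vector<uint64_t> out;
    for (uint64_t number : candidates) {
      if (recycle_logs.count(number))
        continue;
      out.push_back(number);
    }
    return out;
  }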
> > > > > 
> > > > > 
> > > > > > 
> > > > > > Let me know what you think.
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > -----Original Message-----
> > > > > > From: Somnath Roy
> > > > > > Sent: Sunday, August 28, 2016 7:37 AM
> > > > > > To: 'Sage Weil'
> > > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > Sage,
> > > > > > Some updates on this.
> > > > > > 
> > > > > > 1. The issue is reproduced with the latest rocksdb master as well.
> > > > > > 
> > > > > > 2. The issue is *not happening* if I run with rocksdb log recycling disabled. This proves our root cause is right.
> > > > > > 
> > > > > > 3. Running some more performance tests with log recycling disabled, but my initial impression is that it introduces spikes and the output is not as stable as with log recycling enabled.
> > > > > > 
> > > > > > 4. Created a rocksdb issue for this
> > > > > > (https://github.com/facebook/rocksdb/issues/1303)
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Somnath Roy
> > > > > > Sent: Thursday, August 25, 2016 2:35 PM
> > > > > > To: 'Sage Weil'
> > > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > Sage,
> > > > > > Hope you are able to download the log I shared via google doc.
> > > > > > It seems the bug is around this portion.
> > > > > > 
> > > > > > 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log
> > > > > > 254 to recycle list
> > > > > > 
> > > > > > 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log
> > > > > > 256 to recycle list
> > > > > > 
> > > > > > 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original 
> > > > > > Log Time
> > > > > > 2016/08/25-00:44:03.347467) [default] Level-0 commit table
> > > > > > #258 started
> > > > > > 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original 
> > > > > > Log Time
> > > > > > 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: 
> > > > > > memtable #1 done
> > > > > > 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original 
> > > > > > Log Time
> > > > > > 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: 
> > > > > > memtable #2 done
> > > > > > 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original 
> > > > > > Log Time
> > > > > > 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 
> > > > > > 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": 
> > > > > > [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> > > > > > 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original 
> > > > > > Log Time
> > > > > > 2016/08/25-00:44:03.348297) [default] Level summary: base 
> > > > > > level
> > > > > > 1 max bytes base 5368709120 files[3 4 0 0 0 0 0] max score
> > > > > > 0.75
> > > > > > 
> > > > > > 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> > > > > > 
> > > > > > 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink db.wal/000256.log
> > > > > > 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had refs 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev 0 extents [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > > 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev 0 extents [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > > 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink db.wal/000254.log
> > > > > > 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had refs 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev 0 extents [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])
> > > > > > 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev 0 extents [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])
> > > > > > 
> > > > > > So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).
> > > > > > 
> > > > > > I was going through the rocksdb code and I found the following.
> > > > > > 
> > > > > > 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> > > > > > 
> > > > > > 2. The log is deleted from DBImpl::DeleteObsoleteFileImpl, which is *not* under the lock but iterates over log_delete_files. This is fishy, but it shouldn't be the reason the same log number ends up in two lists.
> > > > > > 
> > > > > > 3. I looked at all the places, but in the following place alive_log_files_ (within DBImpl::WriteImpl) is accessed without the lock.
> > > > > > 
> > > > > > 4625       alive_log_files_.back().AddSize(log_entry.size());   
> > > > > > 
> > > > > > Could it be reintroducing the same log number (254)? I am not sure.
> > > > > > 
> > > > > > In summary, it seems to be a rocksdb bug, and making recycle_log_file_num = 0 should *bypass* it. I need to check the performance impact of this, though.
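
For reference, the workaround being described is just a change to one field
of the bluestore_rocksdb_options string quoted elsewhere in this thread:

        recycle_log_file_num=0    (instead of recycle_log_file_num=16)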
> > > > > > 
> > > > > > Should I post this to the rocksdb community, or to any other place where I can get a response from the rocksdb folks?
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Somnath Roy
> > > > > > Sent: Wednesday, August 24, 2016 2:52 PM
> > > > > > To: 'Sage Weil'
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > Sage,
> > > > > > Thanks for looking; glad that we figured out something :-)..
> > > > > > So, you want me to reproduce this with only debug_bluefs = 20/20? No need for the bluestore log?
> > > > > > Hope my root partition doesn't get full; this crash happened after 6 hours :-),
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > Sent: Wednesday, August 24, 2016 2:34 PM
> > > > > > To: Somnath Roy
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > > Sage, it is there in the following github link I posted earlier. You can see 3 logs there.
> > > > > > > 
> > > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88a1b28fcc39
> > > > > > 
> > > > > > Ah sorry, got it.
> > > > > > 
> > > > > > And looking at the crash and the code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or in the rocksdb recycling code.
> > > > > > 
> > > > > > I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> > > > > > 
> > > > > > sage
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > >  >
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > Sent: Wednesday, August 24, 2016 1:43 PM
> > > > > > > To: Somnath Roy
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > > > Sage,
> > > > > > > > I got the db assert log from submit_transaction in the following location.
> > > > > > > > 
> > > > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88a1b28fcc39
> > > > > > > > 
> > > > > > > > This is the log with level 1/20 and with my hook for printing the rocksdb::WriteBatch transaction. I have uploaded 3 osd logs, and a common pattern before the crash is the following.
> > > > > > > > 
> > > > > > > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: 
> > > > > > > > reusing log 266 from recycle list
> > > > > > > > 
> > > > > > > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > > > > > > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > > > > > > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > > > > > > 
> > > > > > > > It seems it is trying to rename the wal file and the old file is not found. You can see the transaction printed in the log along with the error code, like this.
> > > > > > > 
> > > > > > > How much of the log do you have? Can you post what you have somewhere?
> > > > > > > 
> > > > > > > Thanks!
> > > > > > > sage
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > > > > > > submit_transaction error: NotFound:  code = 1 
> > > > > > > > rocksdb::WriteBatch = Put( Prefix = M key =
> > > > > > > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > > > > > > Put( Prefix = M key = 0x0000000000001483'._info') Put( 
> > > > > > > > Prefix = O key =
> > > > > > > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000000089ace7!'0xfffffffffffffffeffffffffffffffff)
> > > > > > > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = 
> > > > > > > > B key =
> > > > > > > > 0x000004e73af72000) Merge( Prefix = T key =
> > > > > > > > 'bluestore_statfs')
> > > > > > > > 
> > > > > > > > Hopefully my decoding of the key is correct; I have reused BlueStore's pretty_binary_string() after removing the first 2 bytes of the key, which are the prefix and a '0'.
> > > > > > > > 
> > > > > > > > Any suggestion on the next step for root-causing this db assert, if the log rename is not enough of a hint?
> > > > > > > > 
> > > > > > > > BTW, this time I ran with bluefs_compact_log_sync = true and without commenting out the inode number asserts. I didn't hit those asserts, and no corruption so far either. Seems like a bug in async compaction. I will try to reproduce that one with a verbose log later.
> > > > > > > > 
> > > > > > > > Thanks & Regards
> > > > > > > > Somnath
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Somnath Roy
> > > > > > > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > > > > > > To: 'Sage Weil'
> > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > 
> > > > > > > > I was running my tests for 2 hours and it happened within that time.
> > > > > > > > I will try to reproduce with 1/20.
> > > > > > > > 
> > > > > > > > Thanks & Regards
> > > > > > > > Somnath
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > > > > > > To: Somnath Roy
> > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > 
> > > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > > Sage,
> > > > > > > > > I think there is some bug introduced recently in BlueFS, and I am getting corruption like this which I was not seeing earlier.
> > > > > > > > 
> > > > > > > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > > > > > > 
> > > > > > > > Any idea how long you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > > > > > > > 
> > > > > > > > Thanks!
> > > > > > > > sage
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > > > > > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > > > > > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > > > > > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > > > > > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > > > > > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: 
> > > > > > > > > In function 'OSDMapRef OSDService::get_map(epoch_t)' 
> > > > > > > > > thread
> > > > > > > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > > > > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > > > > > > 
> > > > > > > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > > > > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > > > int, char
> > > > > > > > > const*)+0x80) [0x5617f2a99e80]
> > > > > > > > >  2: (OSDService::get_map(unsigned int)+0x5d) 
> > > > > > > > > [0x5617f2395fdd]
> > > > > > > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > > > > > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > > > > > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > > > > > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > > > > > > 
> > > > > > > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > > > > > > 
> > > > > > > > > Thanks & Regards
> > > > > > > > > Somnath
> > > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Somnath Roy
> > > > > > > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > > > > > > To: 'Sage Weil'
> > > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > > 
> > > > > > > > > "compaction_style=kCompactionStyleUniversal" in the bluestore_rocksdb_options.
> > > > > > > > > Here is the option I am using..
> > > > > > > > > 
> > > > > > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > > > > > 
> > > > > > > > > Here is another one after 4 hours of 4K RW :-)... Sorry to bombard you with all these; I am adding more logging for the next time I hit it.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1 os/bluestore/BlueFS.cc: In function 'int BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)' thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > > > > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > > > > > > 
> > > > > > > > >  ceph version 11.0.0-1688-g3fcc89c (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5581ed453cb0]
> > > > > > > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > > > > > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > > > > > > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
> > > > > > > > >  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&, rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > > > > > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > > > > > > >  7: (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > > > > > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) [0x5581ed28ba68]
> > > > > > > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > > > > > > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) [0x5581ed25c458]
> > > > > > > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > > > > > > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > > > > > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x157) [0x5581ed1d21d7]
> > > > > > > > >  14: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x55b) [0x5581ed02802b]
> > > > > > > > >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > > > > > > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) [0x5581ed032bc2]
> > > > > > > > >  17: (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > > > > > > >  18: (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39) [0x5581ecef89e9]
> > > > > > > > >  19: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb) [0x5581ecefeb4b]
> > > > > > > > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > > > > > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) [0x5581eccdd2e9]
> > > > > > > > >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x52) [0x5581eccdd542]
> > > > > > > > >  23: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > > > > > > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) [0x5581ed440b2f]
> > > > > > > > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5581ed4441f0]
> > > > > > > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > > > > > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > > > > > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > > > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Thanks & Regards
> > > > > > > > > Somnath
> > > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > > > > > > To: Somnath Roy
> > > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > > 
> > > > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > > > > > > compaction during this time.
> > > > > > > > > 
> > > > > > > > > How are you selecting universal compaction?
> > > > > > > > > 
> > > > > > > > > sage
> > > > > > > > > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > > > > > > > > --
> > > > > > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > > > > > > in the body of a message to majordomo@vger.kernel.org 
> > > > > > > > > More majordomo info at 
> > > > > > > > > http://vger.kernel.org/majordomo-info.html
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > --
> > > > > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > > > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > > > > > majordomo info at 
> > > > > > > > http://vger.kernel.org/majordomo-info.html
> > > > > > > > 
> > > > > > > > 
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > > > > majordomo info at http://vger.kernel.org/majordomo-info.html
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Sept. 6, 2016, 3:29 p.m. UTC | #43
Sage,
Please find the entire 0/20 log in the following location for the first assert.

https://github.com/somnathr/ceph/blob/master/ceph-osd.3.log

This may not be helpful, I will try to reproduce this with debug_bluefs = 10/20.

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Tuesday, September 06, 2016 6:28 AM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Tue, 6 Sep 2016, Somnath Roy wrote:
> Sage,
> Here is one of the asserts that I can reproduce consistently while running with big runway values, during 10 hours of 4K RW without preconditioning.
> 
> 1. 
> 
> in thread 7f9de27ff700 thread_name:rocksdb:bg7
> 
>  ceph version 11.0.0-1946-g9a5cfe2 (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
>  1: (()+0xa0d94e) [0x55aa63cd194e]
>  2: (()+0x113d0) [0x7f9df56723d0]
>  3: (gsignal()+0x38) [0x7f9df33f7418]
>  4: (abort()+0x16a) [0x7f9df33f901a]
>  5: (()+0x2dbd7) [0x7f9df33efbd7]
>  6: (()+0x2dc82) [0x7f9df33efc82]
>  7: (BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)+0x1d43) [0x55aa63abea33]
>  8: (BlueFS::sync_metadata()+0x38b) [0x55aa63abef0b]
>  9: (BlueRocksDirectory::Fsync()+0xd) [0x55aa63ad303d]
>  10: (rocksdb::CompactionJob::Run()+0xe86) [0x55aa63ca3c96]
>  11: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) [0x55aa63b91c50]
>  12: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) [0x55aa63b9eabf]
>  13: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) [0x55aa63c556b9]
>  14: (()+0x991753) [0x55aa63c55753]
>  15: (()+0x76fa) [0x7f9df56686fa]
>  16: (clone()+0x6d) [0x7f9df34c8b5d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> I was able to root cause it as an async log compaction bug. Here is my analysis:
> 
> Here is the log snippet (for the crashing thread) it dumped with debug_bluefs = 0/20.
> 
>    -95> 2016-09-05 18:09:38.242895 7f8d7a3f5700 10 bluefs _compact_log_async remove 0x32100000 of [1:0x3f2d900000+100000,0:0x1e7700000+19000000]
>    -94> 2016-09-05 18:09:38.242903 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 1:0x3f2d900000+100000
>    -93> 2016-09-05 18:09:38.242905 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 0:0x1e7700000+19000000
>    -92> 2016-09-05 18:09:38.242907 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 0:0x1e7700000+19000000
> 
> So, the last two extent entries are identical, and the list is corrupted because we didn't check whether the vector is empty in the following loop. We end up calling front() on an empty vector, which is undefined, and it later crashes when erase() is called with begin(); begin() on an empty vector can't be dereferenced.
> 
>   dout(10) << __func__ << " remove 0x" << std::hex << old_log_jump_to << std::dec
> 	   << " of " << log_file->fnode.extents << dendl;
>   uint64_t discarded = 0;
>   vector<bluefs_extent_t> old_extents;
>   while (discarded < old_log_jump_to) {
>     bluefs_extent_t& e = log_file->fnode.extents.front();
>     bluefs_extent_t temp = e;
>     if (discarded + e.length <= old_log_jump_to) {
>       dout(10) << __func__ << " remove old log extent " << e << dendl;
>       discarded += e.length;
>       log_file->fnode.extents.erase(log_file->fnode.extents.begin());
>     } else {
>       dout(10) << __func__ << " remove front of old log extent " << e << dendl;
>       uint64_t drop = old_log_jump_to - discarded;
>       temp.length = drop;
>       e.offset += drop;
>       e.length -= drop;
>       discarded += drop;
>       dout(10) << __func__ << "   kept " << e << " removed " << temp << dendl;
>     }
>     old_extents.push_back(temp);
>   }
> 
> But the question is: other than adding an empty check for the vector, do we need to do anything else? Why, in this case after ~7 hours, is old_log_jump_to bigger than the total length of the extent vector (because of the bigger runway config?)?

Exactly--it shouldn't be.  old_log_jump_to *must* be less than the 
totally allocated extents.  It should equal just the extents that were 
present/used *prior* to us ensuring that runway is allocated.  Do you 
have a bit more log?  We need to see why it was big enough to empty out 
the vector...
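
The failure mode described above -- front() and erase(begin()) on an emptied vector -- can be shown with a small standalone C++ sketch. This is an illustration of the missing guard only, not BlueFS code and not the eventual fix:

    #include <cassert>
    #include <cstdint>
    #include <vector>

    struct extent { uint64_t offset, length; };

    // Remove up to 'want' bytes worth of extents from the front of 'extents',
    // returning what was removed.  The !extents.empty() test is the guard the
    // loop quoted above is missing.
    std::vector<extent> take_front(std::vector<extent>& extents, uint64_t want) {
      std::vector<extent> removed;
      uint64_t discarded = 0;
      while (discarded < want && !extents.empty()) {
        extent& e = extents.front();
        if (discarded + e.length <= want) {
          discarded += e.length;
          removed.push_back(e);
          extents.erase(extents.begin());
        } else {
          uint64_t drop = want - discarded;
          removed.push_back({e.offset, drop});
          e.offset += drop;
          e.length -= drop;
          discarded += drop;
        }
      }
      return removed;
    }

With the guard, a 'want' larger than the total extent length simply stops early; without it, the loop keeps touching an empty vector, which is exactly the corruption seen in the log above.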

> 2. Here is another assert, during recovery, which I was not able to reproduce again later. Unfortunately, the 0/20 log does not say anything about the thread!

Hrm, hard to say what's going on there.  My guess is a secondary effect 
from the above.  At least we should rule it out.

sage

> 
> 2016-09-02 19:04:35.261856 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34638 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).fault
> 2016-09-02 19:04:35.262428 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34682 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
> 2016-09-02 19:04:35.263045 7ff3ae1fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.10:6805/937878 pipe(0x7fee55871000 sd=36 :45296 s=2 pgs=50 cs=1 l=0 c=0x7ff3aa87c640).fault, initiating reconnect
> 2016-09-02 19:04:35.263477 7ff3e31fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.10:6805/937878 pipe(0x7fee55871000 sd=36 :45348 s=1 pgs=50 cs=2 l=0 c=0x7ff3aa87c640).connect got RESETSESSION
> 2016-09-02 19:04:35.270038 7ff3c5ff4700  0 -- 10.60.194.11:6832/1227385 submit_message MOSDPGPushReply(1.235 22 [PushReplyOp(1:ac44029e:::rbd_data.10176b8b4567.00000000001bb8db:head),PushReplyOp(1:ac44036d:::rbd_data.10176b8b4567.00000000005a5365:head),PushReplyOp(1:ac440604:::rbd_data.10176b8b4567.000000000043d177:head),PushReplyOp(1:ac440608:::rbd_data.10176b8b4567.00000000002aba83:head),PushReplyOp(1:ac44089f:::rbd_data.10176b8b4567.0000000000710e5d:head),PushReplyOp(1:ac4409cd:::rbd_data.10176b8b4567.0000000000689b0d:head),PushReplyOp(1:ac440c37:::rbd_data.10176b8b4567.00000000002d1db3:head),PushReplyOp(1:ac440e0b:::rbd_data.10176b8b4567.00000000009801e1:head)]) v2 remote, 10.60.194.11:6829/227799, failed lossy con, dropping message 0x7ff245058380
> 2016-09-02 19:04:35.282823 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=43 :34694 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
> 2016-09-02 19:04:35.293903 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34696 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
> 2016-09-02 19:04:39.366837 7ff3befe6700  0 log_channel(cluster) log [INF] : 1.130 continuing backfill to osd.5 from (20'392769,22'395772] MIN to 22'395772
> 2016-09-02 19:04:39.367262 7ff3c17eb700  0 log_channel(cluster) log [INF] : 1.39a continuing backfill to osd.4 from (20'383603,22'386606] MIN to 22'386606
> 2016-09-02 19:04:39.368695 7ff3c2fee700  0 log_channel(cluster) log [INF] : 1.253 continuing backfill to osd.1 from (20'386883,22'389884] MIN to 22'389884
> 2016-09-02 19:04:39.408083 7ff3c17eb700  0 log_channel(cluster) log [INF] : 1.70 continuing backfill to osd.1 from (20'388673,22'391675] MIN to 22'391675
> 2016-09-02 19:04:39.408152 7ff3bf7e7700  0 log_channel(cluster) log [INF] : 1.2bd continuing backfill to osd.1 from (20'389889,22'392892] MIN to 22'392892
> 2016-09-02 19:04:40.617675 7ff3b51fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.11:6801/226086 pipe(0x7fef0ebfe000 sd=85 :6832 s=0 pgs=0 cs=0 l=0 c=0x7ff20d7ce280).accept connect_seq 0 vs existing 0 state connecting
> 2016-09-02 19:04:40.617770 7ff3b62fd700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.11:6801/226086 pipe(0x7ff37b01a000 sd=86 :43610 s=4 pgs=0 cs=0 l=0 c=0x7ff37b01c140).connect got RESETSESSION but no longer connecting
> 2016-09-02 19:04:41.197663 7ff3c0fea700  0 log_channel(cluster) log [INF] : 1.1ed continuing backfill to osd.0 from (20'393177,22'396182] MIN to 22'396182
> 2016-09-02 19:04:41.197689 7ff3c2fee700  0 log_channel(cluster) log [INF] : 1.3fa continuing backfill to osd.0 from (20'391286,22'394289] MIN to 22'394289
> 2016-09-02 19:04:41.197736 7ff3c37ef700  0 log_channel(cluster) log [INF] : 1.290 continuing backfill to osd.0 from (20'389914,22'392915] MIN to 22'392915
> 2016-09-02 19:04:41.197752 7ff3c37ef700  0 log_channel(cluster) log [INF] : 1.290 continuing backfill to osd.13 from (20'389914,22'392915] MIN to 22'392915
> 2016-09-02 19:04:41.197759 7ff3bffe8700  0 log_channel(cluster) log [INF] : 1.260 continuing backfill to osd.0 from (20'388867,22'391871] MIN to 22'391871
> 2016-09-02 19:04:41.405458 7ff39f7ff700 -1 os/bluestore/BlueStore.cc: In function 'void BlueStore::OnodeSpace::add(const ghobject_t&, BlueStore::OnodeRef)' thread 7ff39f7ff700 time 2016-09-02 19:04:41.387802
> os/bluestore/BlueStore.cc: 1065: FAILED assert(onode_map.count(oid) == 0)
> 
>  ceph version 11.0.0-1946-g9a5cfe2 (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x557bda1c9750]
>  2: (BlueStore::OnodeSpace::add(ghobject_t const&, boost::intrusive_ptr<BlueStore::Onode>)+0x4bf) [0x557bd9d9cd0f]
>  3: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x63e) [0x557bd9d9d3ae]
>  4: (BlueStore::get_omap_iterator(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&)+0xc5) [0x557bd9da2c45]
>  5: (BlueStore::get_omap_iterator(coll_t const&, ghobject_t const&)+0x7a) [0x557bd9d7e50a]
>  6: (OSDriver::get_next(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list>*)+0x45) [0x557bd9af2305]
>  7: (SnapMapper::get_next_object_to_trim(snapid_t, hobject_t*)+0x482) [0x557bd9af2bf2]
>  8: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x4ac) [0x557bd9bfd8fc]
>  9: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc8) [0x557bd9c42418]
>  10: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x139) [0x557bd9c2d299]
>  11: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x111) [0x557bd9c2d541]
>  12: (ReplicatedPG::snap_trimmer(unsigned int)+0x468) [0x557bd9ba1258]
>  13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x750) [0x557bd9a727b0]
>  14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) [0x557bda1b65cf]
>  15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x557bda1b9c90]
>  16: (Thread::entry_wrapper()+0x75) [0x557bda1a9065]
>  17: (()+0x76fa) [0x7ff402b126fa]
>  18: (clone()+0x6d) [0x7ff400972b5d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Somnath Roy 
> Sent: Friday, September 02, 2016 1:27 PM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> Yes, did that with a similar ratio, see below: max = 400MB, min = 100MB.
> Will see how it goes, thanks..
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, September 02, 2016 1:25 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Fri, 2 Sep 2016, Somnath Roy wrote:
> > Sage,
> > I am running with big runway values now (min 100 MB, max 400MB) and will keep you posted on this.
> > One point: if I give these big runway values, the allocation will be very frequent (and probably unnecessary in most cases). No harm with that?
> 
> I think it'll actually be less frequent, since it allocates bluefs_max_log_runway at a time.  Well, assuming you set that tunable as high as well!
> 
> sage
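
For reference, the 100 MB / 400 MB figures discussed here would look roughly like this in ceph.conf (illustrative values only, matching the numbers mentioned in the thread):

    [osd]
        bluefs_min_log_runway = 104857600   # 100 MB
        bluefs_max_log_runway = 419430400   # 400 MB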
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, September 02, 2016 12:27 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > Sage,
> > > It is probably the universal compaction that is generating bigger files; other than that, I don't see how the following tuning would generate large files.
> > > I will try some universal compaction tuning related to file size and confirm.
> > > Yeah, a big bluefs_min_log_runway value will probably shield the assert for now, but I am afraid that, since we don't have control over the file size, we can't be sure what bluefs_min_log_runway value will keep the assert from hitting in future long runs.
> > > Can't we do something like this?
> > > 
> > > // Basically, checking the length of the pending log data as well:
> > > if (runway < g_conf->bluefs_min_log_runway ||
> > >     runway < log_writer->buffer.length()) {
> > >   // allocate
> > > }
> > 
> > Oh, I see what you mean.  Yeah, I'll add that in--certainly doesn't hurt.
> > 
> > And I think just configuring a long runway won't hurt either (e.g., 100MB).
> > 
> > That's probably enough to be safe, but once we fix the flush thing I mentioned that will make this go away.
> > 
> > s
> > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Friday, September 02, 2016 10:57 AM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > > Here is my rocksdb option :
> > > > 
> > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > 
> > > > One discrepancy I can see here is max_bytes_for_level_base; it should be
> > > > the same as the level 0 size. Initially I had a bigger
> > > > min_write_buffer_number_to_merge, and that's how I calculated it. Now, the
> > > > level 0 size is the following:
> > > > 
> > > > write_buffer_size * min_write_buffer_number_to_merge * 
> > > > level0_file_num_compaction_trigger = 80MB * 2 * 4 = ~640MB
> > > > 
> > > > I should probably adjust max_bytes_for_level_base to a similar value.
> > > > 
> > > > Please find the level 10 log here. The log I captured during replay (crashed again) after it crashed originally because of the same reason.
> > > > 
> > > > https://drive.google.com/file/d/0B7W-S0z_ymMJSnA3R2dyellYZ0U/view?usp=sharing
> > > > 
> > > > Thanks for the explanation; I get now why it is trying to flush inode 1.
> > > > 
> > > > But shouldn't we check the length as well during the runway check, rather
> > > > than relying on bluefs_min_log_runway only?
> > > 
> > > That's what this does:
> > > 
> > >   uint64_t runway = log_writer->file->fnode.get_allocated() - 
> > > log_writer->pos;
> > > 
> > > Anyway, I think I see what's going on.  There are a ton of _fsync and _flush_range calls that have to flush the fnode, and the fnode is pretty big (maybe 5k?) because it has so many extents (your tunables are generating really big files).
> > > 
> > > I think this is just a matter of the runway configurable being too small for your configuration.  Try bumping bluefs_min_log_runway by 10x.
> > > 
> > > Well, actually, we could improve this a bit.  Right now rocksdb is calling lots of flushes on a big sst, and a final fsync at the end.  Bluefs is logging the updated fnode every time the flush changes the file size, and then only writing it to disk when the final fsync happens.  Instead, it could/should put the dirty fnode on a list and, only at log flush time, append the latest fnodes to the log and write it out.
> > > 
> > > I'll add it to the trello board.  I think it's not that big a deal.. 
> > > except when you have really big files.
> > > 
> > > sage
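
A rough standalone sketch of the improvement described above (an assumption about the shape of the change, not actual BlueFS code): track dirty files and journal their fnodes once per log flush instead of on every flush that grows a file.

    #include <vector>

    struct File {
      bool dirty = false;
      // fnode, extents, etc. would live here in the real code
    };

    struct Journal {
      std::vector<File*> dirty_files;

      // called from the per-write flush path when a file's size changes
      void note_dirty(File* f) {
        if (!f->dirty) {
          f->dirty = true;
          dirty_files.push_back(f);
        }
      }

      // called once when the metadata journal itself is flushed
      void flush_dirty_fnodes() {
        for (File* f : dirty_files) {
          // append one op_file_update for f's fnode to the journal transaction
          f->dirty = false;
        }
        dirty_files.clear();
      }
    };

This would turn many small fnode updates per sst into one update per journal flush, which matters most with very large files like the ~400 MB one above.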
> > > 
> > > 
> > >  >
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Friday, September 02, 2016 9:35 AM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > 
> > > > 
> > > > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > > 
> > > > > Sage,
> > > > > Tried to do some analysis on the inode assert, following looks suspicious.
> > > > > 
> > > > >    -10> 2016-08-31 17:55:56.921075 7faf14fff700 10 bluefs _fsync 0x7faf12075140 file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1!
 :0!
>  x7!
> >  53!
> > >  00!
> > > >  000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d000!
 00!
>  0+!
> >  10!
> > >  00!
> > > >  
> > > > 00,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0
> > > > x7
> > > > d8
> > > > 00000+10e00000])
> > > > >     -9> 2016-08-31 17:55:56.921117 7faf14fff700 10 bluefs _flush 0x7faf12075140 no dirty data on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x!
 75!
>  10!
> >  00!
> > >  00!
> > > >  +100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000!
 +1!
>  00!
> >  00!
> > >  0,!
> > > >  
> > > > 1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d
> > > > 60
> > > > 00
> > > > 00+100000,1:0x7d800000+10e00000])
> > > > >     -8> 2016-08-31 17:55:56.921137 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140
> > > > >     -7> 2016-08-31 17:55:56.931541 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140 done in 0.010402
> > > > >     -6> 2016-08-31 17:55:56.931551 7faf14fff700 20 bluefs _fsync file metadata was dirty (31278) on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1!
 :0!
>  x7!
> >  51!
> > >  00!
> > > >  000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00!
 00!
>  0+!
> >  10!
> > >  00!
> > > >  
> > > > 00,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0
> > > > x7
> > > > d6
> > > > 00000+100000,1:0x7d800000+10e00000]), flushing log
> > > > > 
> > > > > The above looks good, it is about to call _flush_and_sync_log() after this.
> > > > 
> > > > Yes, although this file is huge (0x19c2c265 = 432194149 ~ 400MB)... What rocksdb options did you pass in?  I'm guessing this is a log file, but we generally want those smallish (maybe 16MB - 64MB, so that L0 SST generation isn't too slow).
> > > > 
> > > > >     -5> 2016-08-31 17:55:56.931588 7faf14fff700 10 bluefs _flush_and_sync_log txn(seq 31278 len 0x50e146 crc 0x58a6b9ab)
> > > > >     -4> 2016-08-31 17:55:56.933951 7faf14fff700 10 bluefs _pad_bl padding with 0xe94 zeros
> > > > >     -3> 2016-08-31 17:55:56.934079 7faf14fff700 20 bluefs flush_bdev
> > > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > > >     -1> 2016-08-31 17:55:56.934274 7faf14fff700 10 bluefs _flush_range 0x7faf2e824140 pos 0xce9000 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > > >      0> 2016-08-31 17:55:56.939745 7faf14fff700 -1
> > > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > > BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)'
> > > > > thread
> > > > > 7faf14fff700 time 2016-08-31 17:55:56.934282
> > > > > os/bluestore/BlueFS.cc: 1390: FAILED assert(h->file->fnode.ino 
> > > > > !=
> > > > > 1)
> > > > > 
> > > > >  ceph version 11.0.0-1946-g9a5cfe2
> > > > > (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > > char
> > > > > const*)+0x80) [0x56073c27c7d0]
> > > > >  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, 
> > > > > unsigned
> > > > > long)+0x1d69) [0x56073bf4e109]
> > > > >  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) 
> > > > > [0x56073bf4e2d7]
> > > > >  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&,
> > > > > unsigned long, unsigned long)+0x443) [0x56073bf4fe13]
> > > > >  5: (BlueFS::_fsync(BlueFS::FileWriter*,
> > > > > std::unique_lock<std::mutex>&)+0x35b) [0x56073bf5140b]
> > > > >  6: (BlueRocksWritableFile::Sync()+0x62) [0x56073bf68be2]
> > > > >  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1)
> > > > > [0x56073c0f24b1]
> > > > >  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0)
> > > > > [0x56073c0f3960]
> > > > >  9: 
> > > > > (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Sta
> > > > > tu s const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6)
> > > > > [0x56073c1354c6]
> > > > >  10: 
> > > > > (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::Comp
> > > > > ac
> > > > > ti
> > > > > on
> > > > > Job::SubcompactionState*)+0x14ea) [0x56073c137c8a]
> > > > >  11: (rocksdb::CompactionJob::Run()+0x479) [0x56073c138c09]
> > > > >  12: (rocksdb::DBImpl::BackgroundCompaction(bool*,
> > > > > rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) 
> > > > > [0x56073c0275d0]
> > > > >  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf)
> > > > > [0x56073c03443f]
> > > > >  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) 
> > > > > [0x56073c0eb039]
> > > > >  15: (()+0x9900d3) [0x56073c0eb0d3]
> > > > >  16: (()+0x76fa) [0x7faf3d1106fa]
> > > > >  17: (clone()+0x6d) [0x7faf3af70b5d]
> > > > > 
> > > > > 
> > > > > Now, as you can see, it is calling _flush() with inode 1. Why? Is this expected?
> > > > 
> > > > Yes.  The metadata for the log file is dirty (the file size changed), so bluefs is flushing its journal (ino 1) to update the fnode.
> > > > 
> > > > But this is very concerning:
> > > > 
> > > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush
> > > > > 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime
> > > > > 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > > 
> > > > 0xce9000~50f000 is a ~5 MB write.  Why would we ever write that much to the bluefs metadata journal at once?  That's why it's blowing the runway.  
> > > > We have ~17MB allocated (0:0xba00000+500000,0:0xfd00000+c00000), we're at offset ~13MB, and we're writing ~5MB.
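
(Working those hex values out: 0x500000 + 0xc00000 = 0x1100000 ≈ 17 MiB allocated; the write starts at pos 0xce9000 ≈ 12.9 MiB and is 0x50f000 ≈ 5.1 MiB long, so it would end at 0x11f8000 ≈ 18 MiB, about 1 MiB past the allocation.)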
> > > > 
> > > > > Question :
> > > > > ------------
> > > > > 
> > > > > 1. Why are we using the existing log_writer to do a runway check?
> > > > > 
> > > > > https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1280
> > > > > 
> > > > > Shouldn't the log_writer need to be reinitialized with the FileWriter that rocksdb sent with the sync call?
> > > > 
> > > > It's the bluefs journal writer.. that's the runway we're worried about.
> > > > 
> > > > > 2. The runway check is not considering the request length, so why is it not
> > > > > expecting to allocate here
> > > > > (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1388)?
> > > > > 
> > > > > If the snippet is not sufficient, let me know if you want me to upload the level 10 log or need 20/20 log to proceed further.
> > > > 
> > > > The level 10 log is probably enough...
> > > > 
> > > > Thanks!
> > > > sage
> > > > 
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Thursday, September 01, 2016 3:59 PM
> > > > > To: 'Sage Weil'
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > Sage,
> > > > > Created the following pull request on rocksdb repo, please take a look.
> > > > > 
> > > > > https://github.com/facebook/rocksdb/pull/1313
> > > > > 
> > > > > The fix is working fine for me.
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Wednesday, August 31, 2016 6:20 AM
> > > > > To: Somnath Roy
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > On Tue, 30 Aug 2016, Somnath Roy wrote:
> > > > > > Sage,
> > > > > > I did some debugging on the rocksdb bug; here are my findings.
> > > > > > 
> > > > > > 1. The log file number is added to log_recycle_files and *not* in log_delete_files from the following if loop, which is expected.
> > > > > > 
> > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L854
> > > > > > 
> > > > > > 2. But, it is there in the candidate list in the following loop.
> > > > > > 
> > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L1000
> > > > > > 
> > > > > > 
> > > > > > 3. This means it is added to full_scan_candidate_files from the
> > > > > > following, i.e., from a full scan (?)
> > > > > > 
> > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L834
> > > > > > 
> > > > > > Added some log entries to verify , need to wait 6 hours :-(
> > > > > > 
> > > > > > 4. Probably #3 is not unusual, but the check in the following seems insufficient to keep the file.
> > > > > > 
> > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L1013
> > > > > > 
> > > > > > Again, I added some logging to see state.log_number at that time. BTW, the following check is probably a noop, as state.prev_log_number always comes back 0.
> > > > > > 
> > > > > > (number == state.prev_log_number)
> > > > > > 
> > > > > > 5. So, the quick solution I am thinking of is to add a check for whether the log is in the recycle list and avoid deleting it in the above code (?).
> > > > > 
> > > > > That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and ask in the comment if there is a better solution.
> > > > > 
> > > > > Probably the recycle list should be turned into a set so that the check is O(log n)...
> > > > > 
> > > > > Thanks!
> > > > > sage
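
A minimal sketch of the kind of check being discussed (an assumption about its shape; the actual rocksdb patch may look different and hang off different structures):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Given the log numbers that FindObsoleteFiles() queued for recycling,
    // decide whether a candidate WAL number may really be deleted.
    bool safe_to_delete_log(uint64_t number,
                            const std::vector<uint64_t>& log_recycle_files) {
      // linear scan here; a set would make this O(log n), as suggested above
      return std::find(log_recycle_files.begin(), log_recycle_files.end(),
                       number) == log_recycle_files.end();
    }

The purge path would then skip any candidate log file for which safe_to_delete_log() returns false, so a file queued for recycling can never also be unlinked.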
> > > > > 
> > > > > 
> > > > > > 
> > > > > > Let me know what you think.
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > -----Original Message-----
> > > > > > From: Somnath Roy
> > > > > > Sent: Sunday, August 28, 2016 7:37 AM
> > > > > > To: 'Sage Weil'
> > > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > Sage,
> > > > > > Some updates on this.
> > > > > > 
> > > > > > 1. The issue is reproduced with the latest rocksdb master as well.
> > > > > > 
> > > > > > 2. The issue is *not happening* if I run with rocksdb log recycling disabled. This proves our root cause is right.
> > > > > > 
> > > > > > 3. Running some more performance tests with log recycling disabled, but my initial impression is that it introduces spikes and the output is not as stable as with log recycling enabled.
> > > > > > 
> > > > > > 4. Created a rocksdb issue for this
> > > > > > (https://github.com/facebook/rocksdb/issues/1303)
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Somnath Roy
> > > > > > Sent: Thursday, August 25, 2016 2:35 PM
> > > > > > To: 'Sage Weil'
> > > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > Sage,
> > > > > > Hope you are able to download the log I shared via google doc.
> > > > > > It seems the bug is around this portion.
> > > > > > 
> > > > > > 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log
> > > > > > 254 to recycle list
> > > > > > 
> > > > > > 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log
> > > > > > 256 to recycle list
> > > > > > 
> > > > > > 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original 
> > > > > > Log Time
> > > > > > 2016/08/25-00:44:03.347467) [default] Level-0 commit table
> > > > > > #258 started
> > > > > > 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original 
> > > > > > Log Time
> > > > > > 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: 
> > > > > > memtable #1 done
> > > > > > 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original 
> > > > > > Log Time
> > > > > > 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: 
> > > > > > memtable #2 done
> > > > > > 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original 
> > > > > > Log Time
> > > > > > 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 
> > > > > > 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": 
> > > > > > [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> > > > > > 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original 
> > > > > > Log Time
> > > > > > 2016/08/25-00:44:03.348297) [default] Level summary: base 
> > > > > > level
> > > > > > 1 max bytes base 5368709120 files[3 4 0 0 0 0 0] max score
> > > > > > 0.75
> > > > > > 
> > > > > > 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> > > > > > 
> > > > > > 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink 
> > > > > > db.wal/000256.log
> > > > > > 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link 
> > > > > > had refs
> > > > > > 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25
> > > > > > 00:41:26.298423 bdev
> > > > > > 0 extents
> > > > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > > 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link 
> > > > > > destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25
> > > > > > 00:41:26.298423 bdev 0 extents 
> > > > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > > 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink 
> > > > > > db.wal/000254.log
> > > > > > 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link 
> > > > > > had refs
> > > > > > 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25
> > > > > > 00:41:26.299110 bdev
> > > > > > 0 extents
> > > > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])
> > > > > > 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link 
> > > > > > destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25
> > > > > > 00:41:26.299110 bdev 0 extents 
> > > > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])
> > > > > > 
> > > > > > So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).
> > > > > > 
> > > > > > I was going through the rocksdb code and I found the following.
> > > > > > 
> > > > > > 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> > > > > > 
> > > > > > 2. The log is deleted from DBImpl::DeleteObsoleteFileImpl, which is *not* under the lock but is iterating over log_delete_files. This is fishy, but it shouldn't be the reason for the same log number ending up in both lists.
> > > > > > 
> > > > > > 3. I checked all the places; only in the following place is alive_log_files_ (within DBImpl::WriteImpl) accessed without the lock.
> > > > > > 
> > > > > > 4625       alive_log_files_.back().AddSize(log_entry.size());   
> > > > > > 
> > > > > > Could it be reintroducing the same log number (254)? I am not sure.
> > > > > > 
> > > > > > In summary, it seems to be a rocksdb bug, and setting recycle_log_file_num = 0 should *bypass* it. I need to check the performance impact of this, though.
> > > > > > 
> > > > > > Should I post this to the rocksdb community, or is there another place where I can get a response from the rocksdb folks?
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
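
For concreteness, the workaround mentioned above would mean editing the bluestore_rocksdb_options string quoted earlier in the thread so that log recycling is off (illustrative only; the "..." stands for the rest of the unchanged option string, and the performance caveat above still applies):

    bluestore_rocksdb_options = "...,recycle_log_file_num=0,..."   # was recycle_log_file_num=16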
> > > > > > 
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Somnath Roy
> > > > > > Sent: Wednesday, August 24, 2016 2:52 PM
> > > > > > To: 'Sage Weil'
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > Sage,
> > > > > > Thanks for looking; glad that we figured out something :-)..
> > > > > > So, you want me to reproduce this with only debug_bluefs = 20/20? Don't need the bluestore log?
> > > > > > Hope my root partition doesn't get full; this crash happened after 6 hours :-).
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > Sent: Wednesday, August 24, 2016 2:34 PM
> > > > > > To: Somnath Roy
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > > Sage, It is there in the following github link I posted 
> > > > > > > earlier..You can see 3 logs there..
> > > > > > > 
> > > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88a1b28fcc39
> > > > > > 
> > > > > > Ah sorry, got it.
> > > > > > 
> > > > > > And looking at the crash and the code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or in the rocksdb recycling code.
> > > > > > 
> > > > > > I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> > > > > > 
> > > > > > sage
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > >  >
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > Sent: Wednesday, August 24, 2016 1:43 PM
> > > > > > > To: Somnath Roy
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > > > Sage,
> > > > > > > > I got the db assert log from submit_transaction in the following location.
> > > > > > > > 
> > > > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88a1b28fcc39
> > > > > > > > 
> > > > > > > > This is the log with level 1/20 and with my hook of printing rocksdb::writebatch transaction. I have uploaded 3 osd logs and a common pattern before crash is the following.
> > > > > > > > 
> > > > > > > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: 
> > > > > > > > reusing log 266 from recycle list
> > > > > > > > 
> > > > > > > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > > > > > > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > > > > > > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > > > > > > 
> > > > > > > > It seems it is trying to rename the WAL file, and the old file is not found. You can see the transaction printed in the log along with the error code, like this.
> > > > > > > 
> > > > > > > How much of the log do you have? Can you post what you have somewhere?
> > > > > > > 
> > > > > > > Thanks!
> > > > > > > sage
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > > > > > > submit_transaction error: NotFound:  code = 1 
> > > > > > > > rocksdb::WriteBatch = Put( Prefix = M key =
> > > > > > > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > > > > > > Put( Prefix = M key = 0x0000000000001483'._info') Put( 
> > > > > > > > Prefix = O key =
> > > > > > > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.00
> > > > > > > > 00
> > > > > > > > 00
> > > > > > > > 00
> > > > > > > > 00
> > > > > > > > 89
> > > > > > > > ac
> > > > > > > > e7!'0xfffffffffffffffeffffffffffffffff)
> > > > > > > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = 
> > > > > > > > B key =
> > > > > > > > 0x000004e73af72000) Merge( Prefix = T key =
> > > > > > > > 'bluestore_statfs')
> > > > > > > > 
> > > > > > > > Hope my decoding of the key is proper, I have reused pretty_binary_string() of Bluestore after removing 1st 2 bytes of the key which is prefix and a '0'.
> > > > > > > > 
> > > > > > > > Any suggestion on the next step for root-causing this db assert, if the log rename is not enough of a hint?
> > > > > > > > 
> > > > > > > > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting out the inode number asserts. I didn't hit those asserts, and no corruption so far. Seems like a bug in async compaction. Will try to reproduce that one with a verbose log later.
> > > > > > > > 
> > > > > > > > Thanks & Regards
> > > > > > > > Somnath
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Somnath Roy
> > > > > > > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > > > > > > To: 'Sage Weil'
> > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > 
> > > > > > > > I was running my tests for 2 hours and it happened within that time.
> > > > > > > > I will try to reproduce with 1/20.
> > > > > > > > 
> > > > > > > > Thanks & Regards
> > > > > > > > Somnath
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > > > > > > To: Somnath Roy
> > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > 
> > > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > > Sage,
> > > > > > > > > I think there are some bugs introduced recently in BlueFS, and I am
> > > > > > > > > getting corruption like this which I was not facing earlier.
> > > > > > > > 
> > > > > > > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > > > > > > 
> > > > > > > > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > > > > > > > 
> > > > > > > > Thanks!
> > > > > > > > sage
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > > > > > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > > > > > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > > > > > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > > > > > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > > > > > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: 
> > > > > > > > > In function 'OSDMapRef OSDService::get_map(epoch_t)' 
> > > > > > > > > thread
> > > > > > > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > > > > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > > > > > > 
> > > > > > > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > > > > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > > > int, char
> > > > > > > > > const*)+0x80) [0x5617f2a99e80]
> > > > > > > > >  2: (OSDService::get_map(unsigned int)+0x5d) 
> > > > > > > > > [0x5617f2395fdd]
> > > > > > > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > > > > > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > > > > > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > > > > > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > > > > > > 
> > > > > > > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > > > > > > 
> > > > > > > > > Thanks & Regards
> > > > > > > > > Somnath
> > > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Somnath Roy
> > > > > > > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > > > > > > To: 'Sage Weil'
> > > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > > 
> > > > > > > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > > > > > > Here is the option I am using..
> > > > > > > > > 
> > > > > > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > > > > > 
> > > > > > > > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > > > > > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > > > > > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > > > > > > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > > > > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > > > > > > 
> > > > > > > > >  ceph version 11.0.0-1688-g3fcc89c
> > > > > > > > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > > > int, char
> > > > > > > > > const*)+0x80) [0x5581ed453cb0]
> > > > > > > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned 
> > > > > > > > > long, unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > > > > > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, 
> > > > > > > > > unsigned long, rocksdb::Slice*, char*) const+0x20) 
> > > > > > > > > [0x5581ed13f840]
> > > > > > > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned
> > > > > > > > > long, unsigned long, rocksdb::Slice*, char*)
> > > > > > > > > const+0x83f) [0x5581ed2c6f4f]
> > > > > > > > >  5: 
> > > > > > > > > (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileRea
> > > > > > > > > de
> > > > > > > > > r* , rocksdb::Footer const&, rocksdb::ReadOptions 
> > > > > > > > > const&, rocksdb::BlockHandle const&, 
> > > > > > > > > rocksdb::BlockContents*, rocksdb::Env*, bool, 
> > > > > > > > > rocksdb::Slice const&, rocksdb::PersistentCacheOptions 
> > > > > > > > > const&,
> > > > > > > > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > > > > > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > > > > > > >  7: 
> > > > > > > > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb:
> > > > > > > > > :B lo ck Ba se dT ab le::Rep*, rocksdb::ReadOptions 
> > > > > > > > > const&, rocksdb::Slice const&,
> > > > > > > > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > > > > > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions
> > > > > > > > > const&, rocksdb::Slice const&, rocksdb::GetContext*,
> > > > > > > > > bool)+0x508) [0x5581ed28ba68]
> > > > > > > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions
> > > > > > > > > const&, rocksdb::InternalKeyComparator const&, 
> > > > > > > > > rocksdb::FileDescriptor const&, rocksdb::Slice const&, 
> > > > > > > > > rocksdb::GetContext*, rocksdb::HistogramImpl*, bool,
> > > > > > > > > int)+0x158) [0x5581ed252118]
> > > > > > > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > > > > > > > rocksdb::LookupKey const&, 
> > > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > > > std::allocator<char> >*, rocksdb::Status*, 
> > > > > > > > > rocksdb::MergeContext*, bool*, bool*, unsigned
> > > > > > > > > long*)+0x4f8) [0x5581ed25c458]
> > > > > > > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions
> > > > > > > > > const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice 
> > > > > > > > > const&, std::__cxx11::basic_string<char, 
> > > > > > > > > std::char_traits<char>, std::allocator<char> >*,
> > > > > > > > > bool*)+0x5fa) [0x5581ed1f3f7a]
> > > > > > > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > > > > > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > > > > > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > > > > > > > std::char_traits<char>, std::allocator<char> > const&, 
> > > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > > > std::allocator<char> > const&,
> > > > > > > > > ceph::buffer::list*)+0x157) [0x5581ed1d21d7]
> > > > > > > > >  14: (BlueStore::Collection::get_onode(ghobject_t
> > > > > > > > > const&,
> > > > > > > > > bool)+0x55b) [0x5581ed02802b]
> > > > > > > > >  15: 
> > > > > > > > > (BlueStore::_txc_add_transaction(BlueStore::TransContext
> > > > > > > > > *,
> > > > > > > > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > > > > > > >  16: 
> > > > > > > > > (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > > > > > > > std::vector<ObjectStore::Transaction,
> > > > > > > > > std::allocator<ObjectStore::Transaction> >&, 
> > > > > > > > > std::shared_ptr<TrackedOp>,
> > > > > > > > > ThreadPool::TPHandle*)+0x362) [0x5581ed032bc2]
> > > > > > > > >  17: 
> > > > > > > > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore:
> > > > > > > > > :T ra ns ac ti on ,
> > > > > > > > > std::allocator<ObjectStore::Transaction>
> > > > > > > > > >&,
> > > > > > > > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > > > > > > >  18: 
> > > > > > > > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequ
> > > > > > > > > es
> > > > > > > > > t>
> > > > > > > > > )+
> > > > > > > > > 0x
> > > > > > > > > d3
> > > > > > > > > 9)
> > > > > > > > > [0x5581ecef89e9]
> > > > > > > > >  19: 
> > > > > > > > > (ReplicatedBackend::handle_message(std::shared_ptr<OpReq
> > > > > > > > > ue
> > > > > > > > > st
> > > > > > > > > >)
> > > > > > > > > +0
> > > > > > > > > x2
> > > > > > > > > fb
> > > > > > > > > )
> > > > > > > > > [0x5581ecefeb4b]
> > > > > > > > >  20: 
> > > > > > > > > (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > > > > > > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > > > > > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > > > > > > > std::shared_ptr<OpRequest>,
> > > > > > > > > ThreadPool::TPHandle&)+0x409) [0x5581eccdd2e9]
> > > > > > > > >  22: 
> > > > > > > > > (PGQueueable::RunVis::operator()(std::shared_ptr<OpReque
> > > > > > > > > st
> > > > > > > > > >
> > > > > > > > > const&)+0x52) [0x5581eccdd542]
> > > > > > > > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > > > > > > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > > > > > > >  24: 
> > > > > > > > > (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > > > > > > > int)+0x89f) [0x5581ed440b2f]
> > > > > > > > >  25: 
> > > > > > > > > (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > > > > > > > [0x5581ed4441f0]
> > > > > > > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > > > > > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > > > > > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > > > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Thanks & Regards
> > > > > > > > > Somnath
> > > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > > > > > > To: Somnath Roy
> > > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > > 
> > > > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > > > > > > compaction during this time.
> > > > > > > > > 
> > > > > > > > > How are you selecting universal compaction?
> > > > > > > > > 
> > > > > > > > > sage
> > > > > > > > > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > > > > > > > > --
> > > > > > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > > > > > > in the body of a message to majordomo@vger.kernel.org 
> > > > > > > > > More majordomo info at 
> > > > > > > > > http://vger.kernel.org/majordomo-info.html
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > --
> > > > > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > > > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > > > > > majordomo info at 
> > > > > > > > http://vger.kernel.org/majordomo-info.html
> > > > > > > > 
> > > > > > > > 
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > > > > majordomo info at http://vger.kernel.org/majordomo-info.html
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > in the body of a message to majordomo@vger.kernel.org More 
> > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Somnath Roy Sept. 15, 2016, 5:10 a.m. UTC | #44
Sage,
I hit the following assert again with the latest master after 9 hours, so it seems the progress-check change is not fixing the issue.

ceph version v11.0.0-2307-gca74bd9 (ca74bd9f17a76ca16c59f976fc32829b2dff88b2)
 1: (()+0x8c725e) [0x55d39df2d25e]
 2: (()+0x113d0) [0x7fd3b05833d0]
 3: (gsignal()+0x38) [0x7fd3aed93418]
 4: (abort()+0x16a) [0x7fd3aed9501a]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fd3af6d584d]
 6: (()+0x8d6b6) [0x7fd3af6d36b6]
 7: (()+0x8d701) [0x7fd3af6d3701]
 8: (()+0x8d919) [0x7fd3af6d3919]
 9: (std::__throw_length_error(char const*)+0x3f) [0x7fd3af6fc25f]
 10: (()+0x89510f) [0x55d39defb10f]
 11: (BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)+0xc04) [0x55d39def09a4]
 12: (BlueFS::sync_metadata()+0x49b) [0x55d39def1feb]
 13: (BlueRocksDirectory::Fsync()+0xd) [0x55d39df0499d]
 14: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x1402) [0x55d39df53a62]
 15: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x2a) [0x55d39df545ea]
 16: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x6b) [0x55d39de7ac2b]
 17: (BlueStore::_kv_sync_thread()+0x20a8) [0x55d39de2f148]
 18: (BlueStore::KVSyncThread::entry()+0xd) [0x55d39de50ead]
 19: (()+0x76fa) [0x7fd3b05796fa]
 20: (clone()+0x6d) [0x7fd3aee64b5d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

It is the same problem: old_log_jump_to is bigger than the total length of the extent vector.
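(As in the instance quoted further down in this thread, where the old log had only 0x100000 + 0x19000000 = 0x19100000 bytes of extents but old_log_jump_to was 0x32100000, so the trimming loop runs the extent vector empty and keeps going.)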

Thanks & Regards
Somnath
-----Original Message-----
From: Somnath Roy 
Sent: Tuesday, September 06, 2016 8:30 AM
To: 'Sage Weil'
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

Sage,
Please find the entire 0/20 log in the following location for the first assert.

https://github.com/somnathr/ceph/blob/master/ceph-osd.3.log

This may not be helpful; I will try to reproduce this with debug_bluefs = 10/20.

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Tuesday, September 06, 2016 6:28 AM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Tue, 6 Sep 2016, Somnath Roy wrote:
> Sage,
> Here is one of the asserts that I can reproduce consistently while running with big runway values for 10 hours of 4K RW without preconditioning.
> 
> 1. 
> 
> in thread 7f9de27ff700 thread_name:rocksdb:bg7
> 
>  ceph version 11.0.0-1946-g9a5cfe2 (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
>  1: (()+0xa0d94e) [0x55aa63cd194e]
>  2: (()+0x113d0) [0x7f9df56723d0]
>  3: (gsignal()+0x38) [0x7f9df33f7418]
>  4: (abort()+0x16a) [0x7f9df33f901a]
>  5: (()+0x2dbd7) [0x7f9df33efbd7]
>  6: (()+0x2dc82) [0x7f9df33efc82]
>  7: (BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)+0x1d43) [0x55aa63abea33]
>  8: (BlueFS::sync_metadata()+0x38b) [0x55aa63abef0b]
>  9: (BlueRocksDirectory::Fsync()+0xd) [0x55aa63ad303d]
>  10: (rocksdb::CompactionJob::Run()+0xe86) [0x55aa63ca3c96]
>  11: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) [0x55aa63b91c50]
>  12: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) [0x55aa63b9eabf]
>  13: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) [0x55aa63c556b9]
>  14: (()+0x991753) [0x55aa63c55753]
>  15: (()+0x76fa) [0x7f9df56686fa]
>  16: (clone()+0x6d) [0x7f9df34c8b5d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> I was able to root-cause it as an async log compaction bug. Here is my analysis.
> 
> Here is the log snippet (for the crashing thread) that it dumped with debug_bluefs = 0/20.
> 
>    -95> 2016-09-05 18:09:38.242895 7f8d7a3f5700 10 bluefs _compact_log_async remove 0x32100000 of [1:0x3f2d900000+100000,0:0x1e7700000+19000000]
>    -94> 2016-09-05 18:09:38.242903 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 1:0x3f2d900000+100000
>    -93> 2016-09-05 18:09:38.242905 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 0:0x1e7700000+19000000
>    -92> 2016-09-05 18:09:38.242907 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 0:0x1e7700000+19000000
> 
> So, the last two extent entries are identical, and the list is corrupted because we don't check whether the vector is empty in the following loop. We call front() on the empty vector, which is undefined behavior, and later we crash when erasing with begin(); begin() of an empty vector can't be dereferenced.
> 
>   dout(10) << __func__ << " remove 0x" << std::hex << old_log_jump_to << std::dec
> 	   << " of " << log_file->fnode.extents << dendl;
>   uint64_t discarded = 0;
>   vector<bluefs_extent_t> old_extents;
>   while (discarded < old_log_jump_to) {
>     bluefs_extent_t& e = log_file->fnode.extents.front();
>     bluefs_extent_t temp = e;
>     if (discarded + e.length <= old_log_jump_to) {
>       dout(10) << __func__ << " remove old log extent " << e << dendl;
>       discarded += e.length;
>       log_file->fnode.extents.erase(log_file->fnode.extents.begin());
>     } else {
>       dout(10) << __func__ << " remove front of old log extent " << e << dendl;
>       uint64_t drop = old_log_jump_to - discarded;
>       temp.length = drop;
>       e.offset += drop;
>       e.length -= drop;
>       discarded += drop;
>       dout(10) << __func__ << "   kept " << e << " removed " << temp << dendl;
>     }
>     old_extents.push_back(temp);
>   }
> 
> But the question is: other than adding an empty check for the vector, do we need to do anything else? And why, in this case, after ~7 hours is old_log_jump_to bigger than the total length of the extent vector (because of the bigger runway config?)?
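> 
> For illustration, a minimal sketch of the guard I have in mind (same identifiers as the loop above; whether to assert or just break is a separate question):
> 
>   while (discarded < old_log_jump_to) {
>     if (log_file->fnode.extents.empty()) {
>       // old_log_jump_to ran past the allocated extents; stop instead of
>       // calling front()/erase(begin()) on an empty vector (undefined behavior)
>       derr << __func__ << " old_log_jump_to 0x" << std::hex << old_log_jump_to
>            << std::dec << " exceeds the log file's allocated extents" << dendl;
>       break;
>     }
>     bluefs_extent_t& e = log_file->fnode.extents.front();
>     ...  // rest of the loop unchanged
>   }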

Exactly--it shouldn't be.  old_log_jump_to *must* be less than the 
totally allocated extents.  It should equal just the extents that were 
present/used *prior* to us ensuring that runway is allocated.  Do you 
have a bit more log?  We need to see why it was big enough to empty out 
the vector...

> 2. Here is another assert, during recovery, which I was not able to reproduce again later. Unfortunately, the 0/20 log does not say anything about that thread!

Hrm, hard to say what's going on there.  My guess is a secondary effect 
from the above.  At least we should rule it out.

sage

> 
> 2016-09-02 19:04:35.261856 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34638 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).fault
> 2016-09-02 19:04:35.262428 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34682 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
> 2016-09-02 19:04:35.263045 7ff3ae1fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.10:6805/937878 pipe(0x7fee55871000 sd=36 :45296 s=2 pgs=50 cs=1 l=0 c=0x7ff3aa87c640).fault, initiating reconnect
> 2016-09-02 19:04:35.263477 7ff3e31fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.10:6805/937878 pipe(0x7fee55871000 sd=36 :45348 s=1 pgs=50 cs=2 l=0 c=0x7ff3aa87c640).connect got RESETSESSION
> 2016-09-02 19:04:35.270038 7ff3c5ff4700  0 -- 10.60.194.11:6832/1227385 submit_message MOSDPGPushReply(1.235 22 [PushReplyOp(1:ac44029e:::rbd_data.10176b8b4567.00000000001bb8db:head),PushReplyOp(1:ac44036d:::rbd_data.10176b8b4567.00000000005a5365:head),PushReplyOp(1:ac440604:::rbd_data.10176b8b4567.000000000043d177:head),PushReplyOp(1:ac440608:::rbd_data.10176b8b4567.00000000002aba83:head),PushReplyOp(1:ac44089f:::rbd_data.10176b8b4567.0000000000710e5d:head),PushReplyOp(1:ac4409cd:::rbd_data.10176b8b4567.0000000000689b0d:head),PushReplyOp(1:ac440c37:::rbd_data.10176b8b4567.00000000002d1db3:head),PushReplyOp(1:ac440e0b:::rbd_data.10176b8b4567.00000000009801e1:head)]) v2 remote, 10.60.194.11:6829/227799, failed lossy con, dropping message 0x7ff245058380
> 2016-09-02 19:04:35.282823 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=43 :34694 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
> 2016-09-02 19:04:35.293903 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34696 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
> 2016-09-02 19:04:39.366837 7ff3befe6700  0 log_channel(cluster) log [INF] : 1.130 continuing backfill to osd.5 from (20'392769,22'395772] MIN to 22'395772
> 2016-09-02 19:04:39.367262 7ff3c17eb700  0 log_channel(cluster) log [INF] : 1.39a continuing backfill to osd.4 from (20'383603,22'386606] MIN to 22'386606
> 2016-09-02 19:04:39.368695 7ff3c2fee700  0 log_channel(cluster) log [INF] : 1.253 continuing backfill to osd.1 from (20'386883,22'389884] MIN to 22'389884
> 2016-09-02 19:04:39.408083 7ff3c17eb700  0 log_channel(cluster) log [INF] : 1.70 continuing backfill to osd.1 from (20'388673,22'391675] MIN to 22'391675
> 2016-09-02 19:04:39.408152 7ff3bf7e7700  0 log_channel(cluster) log [INF] : 1.2bd continuing backfill to osd.1 from (20'389889,22'392892] MIN to 22'392892
> 2016-09-02 19:04:40.617675 7ff3b51fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.11:6801/226086 pipe(0x7fef0ebfe000 sd=85 :6832 s=0 pgs=0 cs=0 l=0 c=0x7ff20d7ce280).accept connect_seq 0 vs existing 0 state connecting
> 2016-09-02 19:04:40.617770 7ff3b62fd700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.11:6801/226086 pipe(0x7ff37b01a000 sd=86 :43610 s=4 pgs=0 cs=0 l=0 c=0x7ff37b01c140).connect got RESETSESSION but no longer connecting
> 2016-09-02 19:04:41.197663 7ff3c0fea700  0 log_channel(cluster) log [INF] : 1.1ed continuing backfill to osd.0 from (20'393177,22'396182] MIN to 22'396182
> 2016-09-02 19:04:41.197689 7ff3c2fee700  0 log_channel(cluster) log [INF] : 1.3fa continuing backfill to osd.0 from (20'391286,22'394289] MIN to 22'394289
> 2016-09-02 19:04:41.197736 7ff3c37ef700  0 log_channel(cluster) log [INF] : 1.290 continuing backfill to osd.0 from (20'389914,22'392915] MIN to 22'392915
> 2016-09-02 19:04:41.197752 7ff3c37ef700  0 log_channel(cluster) log [INF] : 1.290 continuing backfill to osd.13 from (20'389914,22'392915] MIN to 22'392915
> 2016-09-02 19:04:41.197759 7ff3bffe8700  0 log_channel(cluster) log [INF] : 1.260 continuing backfill to osd.0 from (20'388867,22'391871] MIN to 22'391871
> 2016-09-02 19:04:41.405458 7ff39f7ff700 -1 os/bluestore/BlueStore.cc: In function 'void BlueStore::OnodeSpace::add(const ghobject_t&, BlueStore::OnodeRef)' thread 7ff39f7ff700 time 2016-09-02 19:04:41.387802
> os/bluestore/BlueStore.cc: 1065: FAILED assert(onode_map.count(oid) == 0)
> 
>  ceph version 11.0.0-1946-g9a5cfe2 (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x557bda1c9750]
>  2: (BlueStore::OnodeSpace::add(ghobject_t const&, boost::intrusive_ptr<BlueStore::Onode>)+0x4bf) [0x557bd9d9cd0f]
>  3: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x63e) [0x557bd9d9d3ae]
>  4: (BlueStore::get_omap_iterator(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&)+0xc5) [0x557bd9da2c45]
>  5: (BlueStore::get_omap_iterator(coll_t const&, ghobject_t const&)+0x7a) [0x557bd9d7e50a]
>  6: (OSDriver::get_next(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list>*)+0x45) [0x557bd9af2305]
>  7: (SnapMapper::get_next_object_to_trim(snapid_t, hobject_t*)+0x482) [0x557bd9af2bf2]
>  8: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x4ac) [0x557bd9bfd8fc]
>  9: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc8) [0x557bd9c42418]
>  10: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x139) [0x557bd9c2d299]
>  11: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x111) [0x557bd9c2d541]
>  12: (ReplicatedPG::snap_trimmer(unsigned int)+0x468) [0x557bd9ba1258]
>  13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x750) [0x557bd9a727b0]
>  14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) [0x557bda1b65cf]
>  15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x557bda1b9c90]
>  16: (Thread::entry_wrapper()+0x75) [0x557bda1a9065]
>  17: (()+0x76fa) [0x7ff402b126fa]
>  18: (clone()+0x6d) [0x7ff400972b5d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Somnath Roy 
> Sent: Friday, September 02, 2016 1:27 PM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> Yes, did that with a similar ratio; see below: max = 400MB, min = 100MB.
> Will see how it goes, thanks..
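> For reference, this is roughly what I put in ceph.conf (byte values computed from the MB figures above):
> 
> [osd]
>         bluefs_min_log_runway = 104857600    # 100 MB
>         bluefs_max_log_runway = 419430400    # 400 MB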
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, September 02, 2016 1:25 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Fri, 2 Sep 2016, Somnath Roy wrote:
> > Sage,
> > I am running with big runway values now (min 100MB, max 400MB) and will keep you posted on this.
> > One point: with these big runway values, the allocation will be very frequent (and probably unnecessary in most cases); is there any harm in that?
> 
I think it'll actually be less frequent, since it allocates bluefs_max_log_runway at a time.  Well, assuming you set that tunable as high as well!
> 
> sage
> 
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, September 02, 2016 12:27 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > Sage,
> > > It is probably the universal compaction that is generating bigger files; other than that, I don't see how the following tuning would generate large files.
> > > I will try some universal-compaction tuning related to file size and confirm.
> > > Yeah, a big bluefs_min_log_runway value will probably shield us from the assert for now, but since we don't have control over the file size, we can't be sure what value of bluefs_min_log_runway will keep the assert from hitting in future long runs.
> > > Can't we do something like this?
> > > 
> > > // Basically, check the pending log length as well:
> > > if (runway < g_conf->bluefs_min_log_runway ||
> > >     runway < log_writer->buffer.length()) {
> > >   // allocate
> > > }
> > 
> > Oh, I see what you mean.  Yeah, I'll add that in--certainly doesn't hurt.
> > 
> > And I think just configuring a long runway won't hurt either (e.g., 100MB).
> > 
> > That's probably enough to be safe, but once we fix the flush thing I mentioned that will make this go away.
> > 
> > s
> > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Friday, September 02, 2016 10:57 AM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > > Here is my rocksdb option :
> > > > 
> > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > 
> > > > One discrepancy I can see here is max_bytes_for_level_base; it
> > > > should be the same as the level-0 size. Initially, I had a bigger
> > > > min_write_buffer_number_to_merge and that's how I calculated it. Now,
> > > > the level-0 size is the following:
> > > > 
> > > > write_buffer_size * min_write_buffer_number_to_merge * 
> > > > level0_file_num_compaction_trigger = 80MB * 2 * 4 = ~640MB
> > > > 
> > > > I should probably adjust max_bytes_for_level_base to a similar value.
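> > > > (That is, hypothetically something like max_bytes_for_level_base=671088640 in the options string above, instead of the current 5368709120.)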
> > > > 
> > > > Please find the level-10 log here. I captured this log during replay (it crashed again) after it crashed originally for the same reason.
> > > > 
> > > > https://drive.google.com/file/d/0B7W-S0z_ymMJSnA3R2dyellYZ0U/view?
> > > > us
> > > > p=
> > > > sharing
> > > > 
> > > > Thanks for the explanation; I get now why it is trying to
> > > > flush inode 1.
> > > > 
> > > > But shouldn't we check the length as well during the runway check,
> > > > rather than relying on bluefs_min_log_runway only?
> > > 
> > > That's what this does:
> > > 
> > >   uint64_t runway = log_writer->file->fnode.get_allocated() - 
> > > log_writer->pos;
> > > 
> > > Anyway, I think I see what's going on.  There are a ton of _fsync and _flush_range calls that have to flush the fnode, and the fnode is pretty big (maybe 5k?) because it has so many extents (your tunables are generating really big files).
> > > 
> > > I think this is just a matter of the runway configurable being too small for your configuration.  Try bumping bluefs_min_log_runway by 10x.
> > > 
> > > Well, actually, we could improve this a bit.  Right now rocksdb is calling lots of flushes on a big sst, and a final fsync at the end.  Bluefs is logging the updated fnode every time the flush changes the file size, and then only writing it to disk when the final fsync happens.  Instead, it could/should put the dirty fnode on a list and, only at log flush time, append the latest fnodes to the log and write it out.
> > > 
> > > I'll add it to the trello board.  I think it's not that big a deal.. 
> > > except when you have really big files.
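> > > 
> > > Roughly the shape I have in mind, as a sketch only (reusing the existing names; dirty_files is a new list, and none of this is tested):
> > > 
> > >   // in _flush_range(): when the flush grows the file, just remember the file
> > >   if (h->file->fnode.size < offset + length) {
> > >     h->file->fnode.size = offset + length;
> > >     dirty_files.insert(h->file);      // no log_t entry yet
> > >   }
> > > 
> > >   // in _flush_and_sync_log(): journal each dirty fnode exactly once per flush
> > >   for (auto& f : dirty_files)
> > >     log_t.op_file_update(f->fnode);
> > >   dirty_files.clear();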
> > > 
> > > sage
> > > 
> > > 
> > >  >
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Friday, September 02, 2016 9:35 AM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > 
> > > > 
> > > > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > > 
> > > > > Sage,
> > > > > I tried to do some analysis on the inode assert; the following looks suspicious.
> > > > > 
> > > > >    -10> 2016-08-31 17:55:56.921075 7faf14fff700 10 bluefs _fsync 0x7faf12075140 file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000,1!
 :0!
>  x7!
> >  53!
> > >  00!
> > > >  000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d000!
 00!
>  0+!
> >  10!
> > >  00!
> > > >  
> > > > 00,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0
> > > > x7
> > > > d8
> > > > 00000+10e00000])
> > > > >     -9> 2016-08-31 17:55:56.921117 7faf14fff700 10 bluefs _flush 0x7faf12075140 no dirty data on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x!
 75!
>  10!
> >  00!
> > >  00!
> > > >  +100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000!
 +1!
>  00!
> >  00!
> > >  0,!
> > > >  
> > > > 1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d
> > > > 60
> > > > 00
> > > > 00+100000,1:0x7d800000+10e00000])
> > > > >     -8> 2016-08-31 17:55:56.921137 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140
> > > > >     -7> 2016-08-31 17:55:56.931541 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140 done in 0.010402
> > > > >     -6> 2016-08-31 17:55:56.931551 7faf14fff700 20 bluefs _fsync file metadata was dirty (31278) on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1!
 :0!
>  x7!
> >  51!
> > >  00!
> > > >  000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00!
 00!
>  0+!
> >  10!
> > >  00!
> > > >  
> > > > 00,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0
> > > > x7
> > > > d6
> > > > 00000+100000,1:0x7d800000+10e00000]), flushing log
> > > > > 
> > > > > The above looks good; it is about to call _flush_and_sync_log() after this.
> > > > 
> > > > Yes, although this file is huge (0x19c2c265 = 432194149 ~ 400MB)... What rocksdb options did you pass in?  I'm guessing this is a log file, but we generally want those smallish (maybe 16MB - 64MB, so that L0 SST generation isn't too slow).
> > > > 
> > > > >     -5> 2016-08-31 17:55:56.931588 7faf14fff700 10 bluefs _flush_and_sync_log txn(seq 31278 len 0x50e146 crc 0x58a6b9ab)
> > > > >     -4> 2016-08-31 17:55:56.933951 7faf14fff700 10 bluefs _pad_bl padding with 0xe94 zeros
> > > > >     -3> 2016-08-31 17:55:56.934079 7faf14fff700 20 bluefs flush_bdev
> > > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > > >     -1> 2016-08-31 17:55:56.934274 7faf14fff700 10 bluefs _flush_range 0x7faf2e824140 pos 0xce9000 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > > >      0> 2016-08-31 17:55:56.939745 7faf14fff700 -1
> > > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > > BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)'
> > > > > thread
> > > > > 7faf14fff700 time 2016-08-31 17:55:56.934282
> > > > > os/bluestore/BlueFS.cc: 1390: FAILED assert(h->file->fnode.ino 
> > > > > !=
> > > > > 1)
> > > > > 
> > > > >  ceph version 11.0.0-1946-g9a5cfe2
> > > > > (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > > char
> > > > > const*)+0x80) [0x56073c27c7d0]
> > > > >  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, 
> > > > > unsigned
> > > > > long)+0x1d69) [0x56073bf4e109]
> > > > >  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) 
> > > > > [0x56073bf4e2d7]
> > > > >  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&,
> > > > > unsigned long, unsigned long)+0x443) [0x56073bf4fe13]
> > > > >  5: (BlueFS::_fsync(BlueFS::FileWriter*,
> > > > > std::unique_lock<std::mutex>&)+0x35b) [0x56073bf5140b]
> > > > >  6: (BlueRocksWritableFile::Sync()+0x62) [0x56073bf68be2]
> > > > >  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1)
> > > > > [0x56073c0f24b1]
> > > > >  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0)
> > > > > [0x56073c0f3960]
> > > > >  9: 
> > > > > (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Sta
> > > > > tu s const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6)
> > > > > [0x56073c1354c6]
> > > > >  10: 
> > > > > (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::Comp
> > > > > ac
> > > > > ti
> > > > > on
> > > > > Job::SubcompactionState*)+0x14ea) [0x56073c137c8a]
> > > > >  11: (rocksdb::CompactionJob::Run()+0x479) [0x56073c138c09]
> > > > >  12: (rocksdb::DBImpl::BackgroundCompaction(bool*,
> > > > > rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) 
> > > > > [0x56073c0275d0]
> > > > >  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf)
> > > > > [0x56073c03443f]
> > > > >  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) 
> > > > > [0x56073c0eb039]
> > > > >  15: (()+0x9900d3) [0x56073c0eb0d3]
> > > > >  16: (()+0x76fa) [0x7faf3d1106fa]
> > > > >  17: (clone()+0x6d) [0x7faf3af70b5d]
> > > > > 
> > > > > 
> > > > > Now, as you can see, it is calling _flush() with inode 1. Why? Is this expected?
> > > > 
Yes.  The metadata for the log file is dirty (the file size changed), so bluefs is flushing its journal (ino 1) to update the fnode.
> > > > 
> > > > But this is very concerning:
> > > > 
> > > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush
> > > > > 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime
> > > > > 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > > 
> > > > 0xce9000~50f000 is a ~5 MB write.  Why would we ever write that much to the bluefs metadata journal at once?  That's why it's blowing the runway.  
> > > > We have ~17MB allocated (0:0xba00000+500000,0:0xfd00000+c00000), we're at offset ~13MB, and we're writing ~5MB.
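> > > > (Doing the arithmetic on those numbers: the allocation is 0x500000 + 0xc00000 = 0x1100000 (~17MB); pos 0xce9000 (~13MB) plus the 0x50f000 (~5MB) write ends at 0x11f8000 (~18MB), i.e. this single flush runs about 0xf8000 bytes (~1MB) past the end of the allocation.)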
> > > > 
> > > > > Question :
> > > > > ------------
> > > > > 
> > > > > 1. Why are we using the existing log_writer to do the runway check?
> > > > > 
> > > > > https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS
> > > > > .c
> > > > > c#
> > > > > L1
> > > > > 280
> > > > > 
> > > > > Shouldn't the log_writer need to be reinitialized with the FileWriter that rocksdb sent with the sync call?
> > > > 
> > > > It's the bluefs journal writer.. that's the runway we're worried about.
> > > > 
> > > > > 2. The runway check is not considering the request length, so
> > > > > why is it not expected to allocate here
> > > > > (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.
> > > > > cc
> > > > > #L
> > > > > 1388)
> > > > > 
> > > > > If the snippet is not sufficient, let me know whether you want me to upload the level-10 log or need a 20/20 log to proceed further.
> > > > 
> > > > The level 10 log is probably enough...
> > > > 
> > > > Thanks!
> > > > sage
> > > > 
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Somnath Roy
> > > > > Sent: Thursday, September 01, 2016 3:59 PM
> > > > > To: 'Sage Weil'
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > Sage,
> > > > > I created the following pull request on the rocksdb repo; please take a look.
> > > > > 
> > > > > https://github.com/facebook/rocksdb/pull/1313
> > > > > 
> > > > > The fix is working fine for me.
> > > > > 
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Wednesday, August 31, 2016 6:20 AM
> > > > > To: Somnath Roy
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > On Tue, 30 Aug 2016, Somnath Roy wrote:
> > > > > > Sage,
> > > > > > I did some debugging on the rocksdb bug; here are my findings.
> > > > > > 
> > > > > > 1. The log file number is added to log_recycle_files and *not* to log_delete_files in the following if block, which is expected.
> > > > > > 
> > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f6
> > > > > > 98
> > > > > > f4
> > > > > > 60
> > > > > > 60
> > > > > > 9bf7cea4b63/db/db_impl.cc#L854
> > > > > > 
> > > > > > 2. But, it is there in the candidate list in the following loop.
> > > > > > 
> > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f6
> > > > > > 98
> > > > > > f4
> > > > > > 60
> > > > > > 60
> > > > > > 9bf7cea4b63/db/db_impl.cc#L1000
> > > > > > 
> > > > > > 
> > > > > > 3. This means it is added to full_scan_candidate_files from
> > > > > > the following full scan (?):
> > > > > > 
> > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f6
> > > > > > 98
> > > > > > f4
> > > > > > 60
> > > > > > 60
> > > > > > 9bf7cea4b63/db/db_impl.cc#L834
> > > > > > 
> > > > > > Added some log entries to verify , need to wait 6 hours :-(
> > > > > > 
> > > > > > 4. #3 is probably not unusual, but the following check does not seem sufficient to keep the file.
> > > > > > 
> > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f6
> > > > > > 98
> > > > > > f4
> > > > > > 60
> > > > > > 60
> > > > > > 9bf7cea4b63/db/db_impl.cc#L1013
> > > > > > 
> > > > > > Again, I added some logging to see state.log_number at that time. BTW, the following check is probably a no-op, as state.prev_log_number always comes back 0.
> > > > > > 
> > > > > > (number == state.prev_log_number)
> > > > > > 
> > > > > > 5. So, the quick solution I am thinking of is to check whether the log is in the recycle list and avoid deleting it in the code above (?).
> > > > > 
> > > > > That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and asking in the comment if there is a better solution.
> > > > > 
> > > > > Probably the recycle list should be turned into a set so that the check is O(log n)...
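> > > > > 
> > > > > Something like this, purely as a sketch of the idea (the real change belongs in db/db_impl.cc; names follow the ones discussed above):
> > > > > 
> > > > >   std::set<uint64_t> log_recycle_files;   // a set instead of a list, so lookup is O(log n)
> > > > > 
> > > > >   // in the candidate-deletion loop:
> > > > >   if (log_recycle_files.count(number)) {
> > > > >     continue;   // this WAL is queued for reuse; never delete it here
> > > > >   }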
> > > > > 
> > > > > Thanks!
> > > > > sage
> > > > > 
> > > > > 
> > > > > > 
> > > > > > Let me know what you think.
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > -----Original Message-----
> > > > > > From: Somnath Roy
> > > > > > Sent: Sunday, August 28, 2016 7:37 AM
> > > > > > To: 'Sage Weil'
> > > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > Sage,
> > > > > > Some updates on this.
> > > > > > 
> > > > > > 1. The issue is reproduced with the latest rocksdb master as well.
> > > > > > 
> > > > > > 2. The issue is *not happening* if I run with rocksdb log recycling disabled. This confirms our root cause.
> > > > > > 
> > > > > > 3. I am running some more performance tests with log recycling disabled, but the initial impression is that it introduces spikes and the output is not as stable as with recycling enabled.
> > > > > > 
> > > > > > 4. Created a rocksdb issue for this
> > > > > > (https://github.com/facebook/rocksdb/issues/1303)
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Somnath Roy
> > > > > > Sent: Thursday, August 25, 2016 2:35 PM
> > > > > > To: 'Sage Weil'
> > > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > Sage,
> > > > > > Hope you are able to download the log I shared via google doc.
> > > > > > It seems the bug is around this portion.
> > > > > > 
> > > > > > 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log
> > > > > > 254 to recycle list
> > > > > > 
> > > > > > 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log
> > > > > > 256 to recycle list
> > > > > > 
> > > > > > 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original 
> > > > > > Log Time
> > > > > > 2016/08/25-00:44:03.347467) [default] Level-0 commit table
> > > > > > #258 started
> > > > > > 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original 
> > > > > > Log Time
> > > > > > 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: 
> > > > > > memtable #1 done
> > > > > > 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original 
> > > > > > Log Time
> > > > > > 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: 
> > > > > > memtable #2 done
> > > > > > 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original 
> > > > > > Log Time
> > > > > > 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 
> > > > > > 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": 
> > > > > > [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> > > > > > 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original 
> > > > > > Log Time
> > > > > > 2016/08/25-00:44:03.348297) [default] Level summary: base 
> > > > > > level
> > > > > > 1 max bytes base 5368709120 files[3 4 0 0 0 0 0] max score
> > > > > > 0.75
> > > > > > 
> > > > > > 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> > > > > > 
> > > > > > 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink 
> > > > > > db.wal/000256.log
> > > > > > 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link 
> > > > > > had refs
> > > > > > 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25
> > > > > > 00:41:26.298423 bdev
> > > > > > 0 extents
> > > > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0x
> > > > > > e2
> > > > > > 00
> > > > > > 00
> > > > > > 0+
> > > > > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+7000
> > > > > > 00
> > > > > > ,0
> > > > > > :0
> > > > > > x1
> > > > > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > > 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link 
> > > > > > destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25
> > > > > > 00:41:26.298423 bdev 0 extents 
> > > > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0x
> > > > > > e2
> > > > > > 00
> > > > > > 00
> > > > > > 0+
> > > > > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+7000
> > > > > > 00
> > > > > > ,0
> > > > > > :0
> > > > > > x1
> > > > > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > > 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink 
> > > > > > db.wal/000254.log
> > > > > > 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link 
> > > > > > had refs
> > > > > > 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25
> > > > > > 00:41:26.299110 bdev
> > > > > > 0 extents
> > > > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x
> > > > > > 84
> > > > > > 00
> > > > > > 00
> > > > > > 0+
> > > > > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:
> > > > > > 0x
> > > > > > b4
> > > > > > 00000+800000,0:0xc000000+500000])
> > > > > > 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link 
> > > > > > destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25
> > > > > > 00:41:26.299110 bdev 0 extents 
> > > > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x
> > > > > > 84
> > > > > > 00
> > > > > > 00
> > > > > > 0+
> > > > > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:
> > > > > > 0x
> > > > > > b4
> > > > > > 00000+800000,0:0xc000000+500000])
> > > > > > 
> > > > > > So, log 254 is added to the recycle list and at the same time it is scheduled for deletion. It seems there is a race condition in this area (?).
> > > > > > 
> > > > > > I was going through the rocksdb code and I found the following.
> > > > > > 
> > > > > > 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> > > > > > 
> > > > > > 2. The log is deleted from DBImpl::DeleteObsoleteFileImpl, which is *not* under the lock but iterates over log_delete_files. This is fishy, but it shouldn't be the reason the same log number ends up in two lists.
> > > > > > 
> > > > > > 3. I checked all the places, and the following is the only one where alive_log_files_ (within DBImpl::WriteImpl) is accessed without the lock.
> > > > > > 
> > > > > > 4625       alive_log_files_.back().AddSize(log_entry.size());   
> > > > > > 
> > > > > > Could it be reintroducing the same log number (254)? I am not sure.
> > > > > > 
> > > > > > In summary, it seems to be a rocksdb bug, and setting recycle_log_file_num = 0 should *bypass* it. I need to check the performance impact of that, though.
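> > > > > > (Concretely: changing recycle_log_file_num=16 to recycle_log_file_num=0 in the bluestore_rocksdb_options string quoted earlier in this thread.)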
> > > > > > 
> > > > > > Should I post this to the rocksdb community, or is there another place where I can get a response from the rocksdb folks?
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Somnath Roy
> > > > > > Sent: Wednesday, August 24, 2016 2:52 PM
> > > > > > To: 'Sage Weil'
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > Sage,
> > > > > > Thanks for looking; glad that we figured out something :-)..
> > > > > > So, you want me to reproduce this with only debug_bluefs = 20/20? No bluestore log needed?
> > > > > > I hope my root partition doesn't get full; this crash happened
> > > > > > after 6 hours :-)
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > Sent: Wednesday, August 24, 2016 2:34 PM
> > > > > > To: Somnath Roy
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > > Sage, it is there in the following github link I posted
> > > > > > > earlier. You can see 3 logs there.
> > > > > > > 
> > > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017
> > > > > > > b3
> > > > > > > 9c
> > > > > > > 68
> > > > > > > 7d
> > > > > > > 88
> > > > > > > a1b28fcc39
> > > > > > 
> > > > > > Ah sorry, got it.
> > > > > > 
> > > > > > And looking at the crash and code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or in the rocksdb recycling code.
> > > > > > 
> > > > > > I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> > > > > > 
> > > > > > sage
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > >  >
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > Sent: Wednesday, August 24, 2016 1:43 PM
> > > > > > > To: Somnath Roy
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > > > Sage,
> > > > > > > > I got the db assert log from submit_transaction in the following location.
> > > > > > > > 
> > > > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf0
> > > > > > > > 17
> > > > > > > > b3
> > > > > > > > 9c
> > > > > > > > 68
> > > > > > > > 7d
> > > > > > > > 88
> > > > > > > > a1b28fcc39
> > > > > > > > 
> > > > > > > > This is the log with level 1/20 and with my hook that prints the rocksdb::WriteBatch transaction. I have uploaded 3 OSD logs, and a common pattern before the crash is the following.
> > > > > > > > 
> > > > > > > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: 
> > > > > > > > reusing log 266 from recycle list
> > > > > > > > 
> > > > > > > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > > > > > > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > > > > > > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > > > > > > 
> > > > > > > > It seems it is trying to rename the WAL file, and the old file is not found. You can see the transaction printed in the log along with the error code, like this.
> > > > > > > 
> > > > > > > How much of the log do you have? Can you post what you have somewhere?
> > > > > > > 
> > > > > > > Thanks!
> > > > > > > sage
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: 
> > > > > > > > submit_transaction error: NotFound:  code = 1 
> > > > > > > > rocksdb::WriteBatch = Put( Prefix = M key =
> > > > > > > > 0x0000000000001483'.0000000087.00000000000000035303')
> > > > > > > > Put( Prefix = M key = 0x0000000000001483'._info') Put( 
> > > > > > > > Prefix = O key =
> > > > > > > > '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.00
> > > > > > > > 00
> > > > > > > > 00
> > > > > > > > 00
> > > > > > > > 00
> > > > > > > > 89
> > > > > > > > ac
> > > > > > > > e7!'0xfffffffffffffffeffffffffffffffff)
> > > > > > > > Delete( Prefix = B key = 0x000004e73ae72000) Put( Prefix = 
> > > > > > > > B key =
> > > > > > > > 0x000004e73af72000) Merge( Prefix = T key =
> > > > > > > > 'bluestore_statfs')
> > > > > > > > 
> > > > > > > > I hope my decoding of the key is correct; I reused BlueStore's pretty_binary_string() after removing the first 2 bytes of the key, which are the prefix and a '0'.
> > > > > > > > 
> > > > > > > > Any suggestion on the next step for root-causing this db assert, if the log rename is not enough of a hint?
> > > > > > > > 
> > > > > > > > BTW, I ran this time with bluefs_compact_log_sync = true and without commenting out the inode number asserts. I didn't hit those asserts, and there has been no corruption so far. It seems like a bug in async compaction. I will try to reproduce that one later with a verbose log.
> > > > > > > > 
> > > > > > > > Thanks & Regards
> > > > > > > > Somnath
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Somnath Roy
> > > > > > > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > > > > > > To: 'Sage Weil'
> > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > 
> > > > > > > > I was running my tests for 2 hours and it happened within that time.
> > > > > > > > I will try to reproduce with 1/20.
> > > > > > > > 
> > > > > > > > Thanks & Regards
> > > > > > > > Somnath
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > > > > > > To: Somnath Roy
> > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > 
> > > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > > Sage,
> > > > > > > > > I think there is some bug introduced recently in
> > > > > > > > > BlueFS, and I am getting corruption like this which I was not hitting earlier.
> > > > > > > > 
> > > > > > > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > > > > > > 
> > > > > > > > Any idea how long you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
> > > > > > > > 
> > > > > > > > Thanks!
> > > > > > > > sage
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > > > > > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > > > > > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > > > > > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > > > > > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > > > > > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: 
> > > > > > > > > In function 'OSDMapRef OSDService::get_map(epoch_t)' 
> > > > > > > > > thread
> > > > > > > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > > > > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > > > > > > 
> > > > > > > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > > > > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > > > int, char
> > > > > > > > > const*)+0x80) [0x5617f2a99e80]
> > > > > > > > >  2: (OSDService::get_map(unsigned int)+0x5d) 
> > > > > > > > > [0x5617f2395fdd]
> > > > > > > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > > > > > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > > > > > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > > > > > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > > > > > > 
> > > > > > > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > > > > > > 
> > > > > > > > > Thanks & Regards
> > > > > > > > > Somnath
> > > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Somnath Roy
> > > > > > > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > > > > > > To: 'Sage Weil'
> > > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > > 
> > > > > > > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > > > > > > Here is the option I am using..
> > > > > > > > > 
> > > > > > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > > > > > 
> > > > > > > > > Here is another one after 4 hours of 4K RW :-)... Sorry to bombard you with all these; I am adding more logging for the next time I hit it.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1
> > > > > > > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > > > > > > BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)'
> > > > > > > > > thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > > > > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > > > > > > 
> > > > > > > > >  ceph version 11.0.0-1688-g3fcc89c
> > > > > > > > > (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > > > int, char
> > > > > > > > > const*)+0x80) [0x5581ed453cb0]
> > > > > > > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned 
> > > > > > > > > long, unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > > > > > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, 
> > > > > > > > > unsigned long, rocksdb::Slice*, char*) const+0x20) 
> > > > > > > > > [0x5581ed13f840]
> > > > > > > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned
> > > > > > > > > long, unsigned long, rocksdb::Slice*, char*)
> > > > > > > > > const+0x83f) [0x5581ed2c6f4f]
> > > > > > > > >  5: 
> > > > > > > > > (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileRea
> > > > > > > > > de
> > > > > > > > > r* , rocksdb::Footer const&, rocksdb::ReadOptions 
> > > > > > > > > const&, rocksdb::BlockHandle const&, 
> > > > > > > > > rocksdb::BlockContents*, rocksdb::Env*, bool, 
> > > > > > > > > rocksdb::Slice const&, rocksdb::PersistentCacheOptions 
> > > > > > > > > const&,
> > > > > > > > > rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > > > > > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > > > > > > >  7: 
> > > > > > > > > (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb:
> > > > > > > > > :B lo ck Ba se dT ab le::Rep*, rocksdb::ReadOptions 
> > > > > > > > > const&, rocksdb::Slice const&,
> > > > > > > > > rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > > > > > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions
> > > > > > > > > const&, rocksdb::Slice const&, rocksdb::GetContext*,
> > > > > > > > > bool)+0x508) [0x5581ed28ba68]
> > > > > > > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions
> > > > > > > > > const&, rocksdb::InternalKeyComparator const&, 
> > > > > > > > > rocksdb::FileDescriptor const&, rocksdb::Slice const&, 
> > > > > > > > > rocksdb::GetContext*, rocksdb::HistogramImpl*, bool,
> > > > > > > > > int)+0x158) [0x5581ed252118]
> > > > > > > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, 
> > > > > > > > > rocksdb::LookupKey const&, 
> > > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > > > std::allocator<char> >*, rocksdb::Status*, 
> > > > > > > > > rocksdb::MergeContext*, bool*, bool*, unsigned
> > > > > > > > > long*)+0x4f8) [0x5581ed25c458]
> > > > > > > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions
> > > > > > > > > const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice 
> > > > > > > > > const&, std::__cxx11::basic_string<char, 
> > > > > > > > > std::char_traits<char>, std::allocator<char> >*,
> > > > > > > > > bool*)+0x5fa) [0x5581ed1f3f7a]
> > > > > > > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
> > > > > > > > > rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, 
> > > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > > > std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > > > > > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char,
> > > > > > > > > std::char_traits<char>, std::allocator<char> > const&, 
> > > > > > > > > std::__cxx11::basic_string<char, std::char_traits<char>, 
> > > > > > > > > std::allocator<char> > const&,
> > > > > > > > > ceph::buffer::list*)+0x157) [0x5581ed1d21d7]
> > > > > > > > >  14: (BlueStore::Collection::get_onode(ghobject_t
> > > > > > > > > const&,
> > > > > > > > > bool)+0x55b) [0x5581ed02802b]
> > > > > > > > >  15: 
> > > > > > > > > (BlueStore::_txc_add_transaction(BlueStore::TransContext
> > > > > > > > > *,
> > > > > > > > > ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > > > > > > >  16: 
> > > > > > > > > (BlueStore::queue_transactions(ObjectStore::Sequencer*,
> > > > > > > > > std::vector<ObjectStore::Transaction,
> > > > > > > > > std::allocator<ObjectStore::Transaction> >&, 
> > > > > > > > > std::shared_ptr<TrackedOp>,
> > > > > > > > > ThreadPool::TPHandle*)+0x362) [0x5581ed032bc2]
> > > > > > > > >  17: 
> > > > > > > > > (ReplicatedPG::queue_transactions(std::vector<ObjectStore:
> > > > > > > > > :T ra ns ac ti on ,
> > > > > > > > > std::allocator<ObjectStore::Transaction>
> > > > > > > > > >&,
> > > > > > > > > std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > > > > > > >  18: 
> > > > > > > > > (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequ
> > > > > > > > > es
> > > > > > > > > t>
> > > > > > > > > )+
> > > > > > > > > 0x
> > > > > > > > > d3
> > > > > > > > > 9)
> > > > > > > > > [0x5581ecef89e9]
> > > > > > > > >  19: 
> > > > > > > > > (ReplicatedBackend::handle_message(std::shared_ptr<OpReq
> > > > > > > > > ue
> > > > > > > > > st
> > > > > > > > > >)
> > > > > > > > > +0
> > > > > > > > > x2
> > > > > > > > > fb
> > > > > > > > > )
> > > > > > > > > [0x5581ecefeb4b]
> > > > > > > > >  20: 
> > > > > > > > > (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> > > > > > > > > ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > > > > > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > > > > > > > > std::shared_ptr<OpRequest>,
> > > > > > > > > ThreadPool::TPHandle&)+0x409) [0x5581eccdd2e9]
> > > > > > > > >  22: 
> > > > > > > > > (PGQueueable::RunVis::operator()(std::shared_ptr<OpReque
> > > > > > > > > st
> > > > > > > > > >
> > > > > > > > > const&)+0x52) [0x5581eccdd542]
> > > > > > > > >  23: (OSD::ShardedOpWQ::_process(unsigned int,
> > > > > > > > > ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > > > > > > >  24: 
> > > > > > > > > (ShardedThreadPool::shardedthreadpool_worker(unsigned
> > > > > > > > > int)+0x89f) [0x5581ed440b2f]
> > > > > > > > >  25: 
> > > > > > > > > (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
> > > > > > > > > [0x5581ed4441f0]
> > > > > > > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > > > > > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > > > > > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > > > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Thanks & Regards
> > > > > > > > > Somnath
> > > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > > > > > > To: Somnath Roy
> > > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > > 
> > > > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > > > > > > compaction during this time.
> > > > > > > > > 
> > > > > > > > > How are you selecting universal compaction?
> > > > > > > > > 
> > > > > > > > > sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil Sept. 15, 2016, 3:38 p.m. UTC | #45
On Thu, 15 Sep 2016, Somnath Roy wrote:
> Sage,
> I hit the following assert again with the latest master after 9 hours , so, it seems the progress check stuff is not fixing the issue.
> 
> ceph version v11.0.0-2307-gca74bd9 (ca74bd9f17a76ca16c59f976fc32829b2dff88b2)
>  1: (()+0x8c725e) [0x55d39df2d25e]
>  2: (()+0x113d0) [0x7fd3b05833d0]
>  3: (gsignal()+0x38) [0x7fd3aed93418]
>  4: (abort()+0x16a) [0x7fd3aed9501a]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fd3af6d584d]
>  6: (()+0x8d6b6) [0x7fd3af6d36b6]
>  7: (()+0x8d701) [0x7fd3af6d3701]
>  8: (()+0x8d919) [0x7fd3af6d3919]
>  9: (std::__throw_length_error(char const*)+0x3f) [0x7fd3af6fc25f]
>  10: (()+0x89510f) [0x55d39defb10f]
>  11: (BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)+0xc04) [0x55d39def09a4]
>  12: (BlueFS::sync_metadata()+0x49b) [0x55d39def1feb]
>  13: (BlueRocksDirectory::Fsync()+0xd) [0x55d39df0499d]
>  14: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x1402) [0x55d39df53a62]
>  15: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x2a) [0x55d39df545ea]
>  16: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x6b) [0x55d39de7ac2b]
>  17: (BlueStore::_kv_sync_thread()+0x20a8) [0x55d39de2f148]
>  18: (BlueStore::KVSyncThread::entry()+0xd) [0x55d39de50ead]
>  19: (()+0x76fa) [0x7fd3b05796fa]
>  20: (clone()+0x6d) [0x7fd3aee64b5d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> It is the same problem: old_log_jump_to is bigger than the content of the extent vector.

Ah, I had the condition wrong in the previous fix.  See

	https://github.com/ceph/ceph/pull/11095

Thanks!
sage


> 
> Thanks & Regards
> Somnath
> -----Original Message-----
> From: Somnath Roy 
> Sent: Tuesday, September 06, 2016 8:30 AM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> Sage,
> Please find the entire 0/20 log in the following location for the first assert.
> 
> https://github.com/somnathr/ceph/blob/master/ceph-osd.3.log
> 
> This may not be helpful, I will try to reproduce this with debug_bluefs = 10/20.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com] 
> Sent: Tuesday, September 06, 2016 6:28 AM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Tue, 6 Sep 2016, Somnath Roy wrote:
> > Sage,
> > Here is one of the asserts that I can reproduce consistently while running with big runway values, over 10 hours of 4K RW without preconditioning.
> > 
> > 1. 
> > 
> > in thread 7f9de27ff700 thread_name:rocksdb:bg7
> > 
> >  ceph version 11.0.0-1946-g9a5cfe2 (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> >  1: (()+0xa0d94e) [0x55aa63cd194e]
> >  2: (()+0x113d0) [0x7f9df56723d0]
> >  3: (gsignal()+0x38) [0x7f9df33f7418]
> >  4: (abort()+0x16a) [0x7f9df33f901a]
> >  5: (()+0x2dbd7) [0x7f9df33efbd7]
> >  6: (()+0x2dc82) [0x7f9df33efc82]
> >  7: (BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)+0x1d43) [0x55aa63abea33]
> >  8: (BlueFS::sync_metadata()+0x38b) [0x55aa63abef0b]
> >  9: (BlueRocksDirectory::Fsync()+0xd) [0x55aa63ad303d]
> >  10: (rocksdb::CompactionJob::Run()+0xe86) [0x55aa63ca3c96]
> >  11: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) [0x55aa63b91c50]
> >  12: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) [0x55aa63b9eabf]
> >  13: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) [0x55aa63c556b9]
> >  14: (()+0x991753) [0x55aa63c55753]
> >  15: (()+0x76fa) [0x7f9df56686fa]
> >  16: (clone()+0x6d) [0x7f9df34c8b5d]
> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > 
> > I was able to root-cause it as an async log compaction bug. Here is my analysis:
> > 
> > Here is the log snippet (for the crashing thread) it dumped with debug_bluefs = 0/20.
> > 
> >    -95> 2016-09-05 18:09:38.242895 7f8d7a3f5700 10 bluefs _compact_log_async remove 0x32100000 of [1:0x3f2d900000+100000,0:0x1e7700000+19000000]
> >    -94> 2016-09-05 18:09:38.242903 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 1:0x3f2d900000+100000
> >    -93> 2016-09-05 18:09:38.242905 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 0:0x1e7700000+19000000
> >    -92> 2016-09-05 18:09:38.242907 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 0:0x1e7700000+19000000
> > 
> > So, the last two entries of the extent list are identical, and it is corrupted because we didn't check whether the vector is empty in the following loop. We call front() on the empty vector, which is undefined behavior, and later it crashes when we call erase() with begin(); begin() on an empty vector can't be dereferenced.
> > 
> >   dout(10) << __func__ << " remove 0x" << std::hex << old_log_jump_to << std::dec
> > 	   << " of " << log_file->fnode.extents << dendl;
> >   uint64_t discarded = 0;
> >   vector<bluefs_extent_t> old_extents;
> >   while (discarded < old_log_jump_to) {
> >     bluefs_extent_t& e = log_file->fnode.extents.front();
> >     bluefs_extent_t temp = e;
> >     if (discarded + e.length <= old_log_jump_to) {
> >       dout(10) << __func__ << " remove old log extent " << e << dendl;
> >       discarded += e.length;
> >       log_file->fnode.extents.erase(log_file->fnode.extents.begin());
> >     } else {
> >       dout(10) << __func__ << " remove front of old log extent " << e << dendl;
> >       uint64_t drop = old_log_jump_to - discarded;
> >       temp.length = drop;
> >       e.offset += drop;
> >       e.length -= drop;
> >       discarded += drop;
> >       dout(10) << __func__ << "   kept " << e << " removed " << temp << dendl;
> >     }
> >     old_extents.push_back(temp);
> >   }
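For reference, a minimal sketch of the missing empty-vector check described above (illustrative only, written against the snippet just quoted; as the reply further down notes, the real bug is old_log_jump_to being too large in the first place):

  while (discarded < old_log_jump_to) {
    if (log_file->fnode.extents.empty()) {
      // sketch: fail loudly instead of calling front()/begin() on an empty vector
      derr << __func__ << " ran out of extents: discarded 0x" << std::hex
           << discarded << " < old_log_jump_to 0x" << old_log_jump_to
           << std::dec << dendl;
      assert(0 == "old_log_jump_to exceeds the log file's allocated extents");
    }
    bluefs_extent_t& e = log_file->fnode.extents.front();
    // ... rest of the loop unchanged ...
  }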
> > 
> > But the question is: other than adding an empty check for the vector, do we need to do anything else? Why, in this case, is old_log_jump_to bigger than the content of the extent vector after ~7 hours (because of the bigger runway config?)?
> 
> Exactly--it shouldn't be.  old_log_jump_to *must* be less than the 
> totally allocated extents.  It should equal just the extents that were 
> present/used *prior* to us ensuring that runway is allocated.  Do you 
> have a bit more log?  We need to see why it was big enough to empty out 
> the vector...
> 
> > 2. Here is another assert, during recovery, which I was not able to reproduce again later. Unfortunately, the 0/20 log is not saying anything about that thread!
> 
> Hrm, hard to say what's going on there.  My guess is a secondary effect 
> from the above.  At least we should rule it out.
> 
> sage
> 
> > 
> > 2016-09-02 19:04:35.261856 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34638 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).fault
> > 2016-09-02 19:04:35.262428 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34682 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
> > 2016-09-02 19:04:35.263045 7ff3ae1fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.10:6805/937878 pipe(0x7fee55871000 sd=36 :45296 s=2 pgs=50 cs=1 l=0 c=0x7ff3aa87c640).fault, initiating reconnect
> > 2016-09-02 19:04:35.263477 7ff3e31fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.10:6805/937878 pipe(0x7fee55871000 sd=36 :45348 s=1 pgs=50 cs=2 l=0 c=0x7ff3aa87c640).connect got RESETSESSION
> > 2016-09-02 19:04:35.270038 7ff3c5ff4700  0 -- 10.60.194.11:6832/1227385 submit_message MOSDPGPushReply(1.235 22 [PushReplyOp(1:ac44029e:::rbd_data.10176b8b4567.00000000001bb8db:head),PushReplyOp(1:ac44036d:::rbd_data.10176b8b4567.00000000005a5365:head),PushReplyOp(1:ac440604:::rbd_data.10176b8b4567.000000000043d177:head),PushReplyOp(1:ac440608:::rbd_data.10176b8b4567.00000000002aba83:head),PushReplyOp(1:ac44089f:::rbd_data.10176b8b4567.0000000000710e5d:head),PushReplyOp(1:ac4409cd:::rbd_data.10176b8b4567.0000000000689b0d:head),PushReplyOp(1:ac440c37:::rbd_data.10176b8b4567.00000000002d1db3:head),PushReplyOp(1:ac440e0b:::rbd_data.10176b8b4567.00000000009801e1:head)]) v2 remote, 10.60.194.11:6829/227799, failed lossy con, dropping message 0x7ff245058380
> > 2016-09-02 19:04:35.282823 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=43 :34694 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
> > 2016-09-02 19:04:35.293903 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34696 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
> > 2016-09-02 19:04:39.366837 7ff3befe6700  0 log_channel(cluster) log [INF] : 1.130 continuing backfill to osd.5 from (20'392769,22'395772] MIN to 22'395772
> > 2016-09-02 19:04:39.367262 7ff3c17eb700  0 log_channel(cluster) log [INF] : 1.39a continuing backfill to osd.4 from (20'383603,22'386606] MIN to 22'386606
> > 2016-09-02 19:04:39.368695 7ff3c2fee700  0 log_channel(cluster) log [INF] : 1.253 continuing backfill to osd.1 from (20'386883,22'389884] MIN to 22'389884
> > 2016-09-02 19:04:39.408083 7ff3c17eb700  0 log_channel(cluster) log [INF] : 1.70 continuing backfill to osd.1 from (20'388673,22'391675] MIN to 22'391675
> > 2016-09-02 19:04:39.408152 7ff3bf7e7700  0 log_channel(cluster) log [INF] : 1.2bd continuing backfill to osd.1 from (20'389889,22'392892] MIN to 22'392892
> > 2016-09-02 19:04:40.617675 7ff3b51fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.11:6801/226086 pipe(0x7fef0ebfe000 sd=85 :6832 s=0 pgs=0 cs=0 l=0 c=0x7ff20d7ce280).accept connect_seq 0 vs existing 0 state connecting
> > 2016-09-02 19:04:40.617770 7ff3b62fd700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.11:6801/226086 pipe(0x7ff37b01a000 sd=86 :43610 s=4 pgs=0 cs=0 l=0 c=0x7ff37b01c140).connect got RESETSESSION but no longer connecting
> > 2016-09-02 19:04:41.197663 7ff3c0fea700  0 log_channel(cluster) log [INF] : 1.1ed continuing backfill to osd.0 from (20'393177,22'396182] MIN to 22'396182
> > 2016-09-02 19:04:41.197689 7ff3c2fee700  0 log_channel(cluster) log [INF] : 1.3fa continuing backfill to osd.0 from (20'391286,22'394289] MIN to 22'394289
> > 2016-09-02 19:04:41.197736 7ff3c37ef700  0 log_channel(cluster) log [INF] : 1.290 continuing backfill to osd.0 from (20'389914,22'392915] MIN to 22'392915
> > 2016-09-02 19:04:41.197752 7ff3c37ef700  0 log_channel(cluster) log [INF] : 1.290 continuing backfill to osd.13 from (20'389914,22'392915] MIN to 22'392915
> > 2016-09-02 19:04:41.197759 7ff3bffe8700  0 log_channel(cluster) log [INF] : 1.260 continuing backfill to osd.0 from (20'388867,22'391871] MIN to 22'391871
> > 2016-09-02 19:04:41.405458 7ff39f7ff700 -1 os/bluestore/BlueStore.cc: In function 'void BlueStore::OnodeSpace::add(const ghobject_t&, BlueStore::OnodeRef)' thread 7ff39f7ff700 time 2016-09-02 19:04:41.387802
> > os/bluestore/BlueStore.cc: 1065: FAILED assert(onode_map.count(oid) == 0)
> > 
> >  ceph version 11.0.0-1946-g9a5cfe2 (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x557bda1c9750]
> >  2: (BlueStore::OnodeSpace::add(ghobject_t const&, boost::intrusive_ptr<BlueStore::Onode>)+0x4bf) [0x557bd9d9cd0f]
> >  3: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x63e) [0x557bd9d9d3ae]
> >  4: (BlueStore::get_omap_iterator(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&)+0xc5) [0x557bd9da2c45]
> >  5: (BlueStore::get_omap_iterator(coll_t const&, ghobject_t const&)+0x7a) [0x557bd9d7e50a]
> >  6: (OSDriver::get_next(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list>*)+0x45) [0x557bd9af2305]
> >  7: (SnapMapper::get_next_object_to_trim(snapid_t, hobject_t*)+0x482) [0x557bd9af2bf2]
> >  8: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x4ac) [0x557bd9bfd8fc]
> >  9: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc8) [0x557bd9c42418]
> >  10: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x139) [0x557bd9c2d299]
> >  11: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x111) [0x557bd9c2d541]
> >  12: (ReplicatedPG::snap_trimmer(unsigned int)+0x468) [0x557bd9ba1258]
> >  13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x750) [0x557bd9a727b0]
> >  14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) [0x557bda1b65cf]
> >  15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x557bda1b9c90]
> >  16: (Thread::entry_wrapper()+0x75) [0x557bda1a9065]
> >  17: (()+0x76fa) [0x7ff402b126fa]
> >  18: (clone()+0x6d) [0x7ff400972b5d]
> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Somnath Roy 
> > Sent: Friday, September 02, 2016 1:27 PM
> > To: 'Sage Weil'
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > Yes, did that with a similar ratio, see below: max = 400MB, min = 100MB.
> > Will see how it goes, thanks..
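(For reference, those figures expressed as config options; the option names appear further down in the thread, and the exact byte values are an assumption based on the 100 MB / 400 MB figures above:)

    bluefs_min_log_runway = 104857600     # 100 MB
    bluefs_max_log_runway = 419430400     # 400 MB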
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, September 02, 2016 1:25 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > Sage,
> > > I am running with big runway values now (min 100 MB, max 400 MB) and will keep you posted on this.
> > > One point: if I give it these big runway values, the allocation will be very frequent (and probably unnecessarily so for most cases); is there any harm in that?
> > 
> > I think it'll actually be less frequent, since it allocates bluefs_max_log_runway at a time.  Well, assuming you set that tunable as high as well!
> > 
> > sage
> > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Friday, September 02, 2016 12:27 PM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > > Sage,
> > > > It is probably the universal compaction that is generating the bigger files; other than that, I don't see how the following tuning would generate large files.
> > > > I will try some universal compaction tuning related to file size and confirm.
> > > > Yeah, a big bluefs_min_log_runway value will probably shield the assert for now, but I am afraid that since we don't control the file size, we can't be sure what value to give bluefs_min_log_runway so that the assert won't hit in future long runs.
> > > > Can't we do something like this?
> > > > 
> > > > // Basically, checking the length of the pending log writes as well:
> > > > if (runway < g_conf->bluefs_min_log_runway ||
> > > >     runway < log_writer->buffer.length()) {
> > > >   // allocate more runway
> > > > }
> > > 
> > > Oh, I see what you mean.  Yeah, I'll add that in--certainly doesn't hurt.
> > > 
> > > And I think just configuring a long runway won't hurt either (e.g., 100MB).
> > > 
> > > That's probably enough to be safe, but once we fix the flush thing I mentioned that will make this go away.
> > > 
> > > s
> > > 
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Friday, September 02, 2016 10:57 AM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > > > Here is my rocksdb option :
> > > > > 
> > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > 
> > > > > One discrepancy I can see here is max_bytes_for_level_base; it
> > > > > should be the same as the level 0 size. Initially, I had a bigger
> > > > > min_write_buffer_number_to_merge, and that's how I calculated it.
> > > > > Now, the level 0 size is the following:
> > > > > 
> > > > > write_buffer_size * min_write_buffer_number_to_merge * 
> > > > > level0_file_num_compaction_trigger = 80MB * 2 * 4 = ~640MB
> > > > > 
> > > > > I should probably adjust max_bytes_for_level_base to a similar value.
> > > > > 
> > > > > Please find the level 10 log here. I captured the log during replay (it crashed again) after it originally crashed for the same reason.
> > > > > 
> > > > > https://drive.google.com/file/d/0B7W-S0z_ymMJSnA3R2dyellYZ0U/view?
> > > > > us
> > > > > p=
> > > > > sharing
> > > > > 
> > > > > Thanks for the explanation, I get now why it is trying to
> > > > > flush inode 1.
> > > > > 
> > > > > But shouldn't we check the length as well during the runway check,
> > > > > rather than relying on bluefs_min_log_runway only?
> > > > 
> > > > That's what this does:
> > > > 
> > > >   uint64_t runway = log_writer->file->fnode.get_allocated() - 
> > > > log_writer->pos;
> > > > 
> > > > Anyway, I think I see what's going on.  There are a ton of _fsync and _flush_range calls that have to flush the fnode, and the fnode is pretty big (maybe 5k?) because it has so many extents (your tunables are generating really big files).
> > > > 
> > > > I think this is just a matter of the runway configurable being too small for your configuration.  Try bumping bluefs_min_log_runway by 10x.
> > > > 
> > > > Well, actually, we could improve this a bit.  Right now rocksdb is calling lots of flushes on a big sst, and a final fsync at the end.  Bluefs is logging the updated fnode every time the flush changes the file size, and then only writing it to disk when the final fsync happens.  Instead, it could/should put the dirty fnode on a list and only at log flush time append the latest fnodes to the log and write it out.
> > > > 
> > > > I'll add it to the trello board.  I think it's not that big a deal.. 
> > > > except when you have really big files.
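A rough sketch of that idea, purely illustrative (the dirty_files member and the exact hook points are assumptions, not the eventual implementation):

  // conceptual sketch, not actual BlueFS code
  std::set<FileRef> dirty_files;            // hypothetical: files whose fnode changed

  // in _flush_range()/_fsync(), instead of appending an fnode update to the
  // journal every time the size changes:
  //     dirty_files.insert(h->file);

  // in _flush_and_sync_log(), just before the journal is written out:
  //     for (auto& f : dirty_files)
  //       log_t.op_file_update(f->fnode);  // one update per file, latest size only
  //     dirty_files.clear();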
> > > > 
> > > > sage
> > > > 
> > > > 
> > > >  >
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Friday, September 02, 2016 9:35 AM
> > > > > To: Somnath Roy
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > 
> > > > > 
> > > > > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > > > 
> > > > > > Sage,
> > > > > > Tried to do some analysis on the inode assert; the following looks suspicious.
> > > > > > 
> > > > > >    -10> 2016-08-31 17:55:56.921075 7faf14fff700 10 bluefs _fsync 0x7faf12075140 file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000!
 ,1!
>  :0!
> >  x7!
> > >  53!
> > > >  00!
> > > > >  000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d0!
 00!
>  00!
> >  0+!
> > >  10!
> > > >  00!
> > > > >  
> > > > > 00,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0
> > > > > x7
> > > > > d8
> > > > > 00000+10e00000])
> > > > > >     -9> 2016-08-31 17:55:56.921117 7faf14fff700 10 bluefs _flush 0x7faf12075140 no dirty data on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:!
 0x!
>  75!
> >  10!
> > >  00!
> > > >  00!
> > > > >  +100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce000!
 00!
>  +1!
> >  00!
> > >  00!
> > > >  0,!
> > > > >  
> > > > > 1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d
> > > > > 60
> > > > > 00
> > > > > 00+100000,1:0x7d800000+10e00000])
> > > > > >     -8> 2016-08-31 17:55:56.921137 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140
> > > > > >     -7> 2016-08-31 17:55:56.931541 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140 done in 0.010402
> > > > > >     -6> 2016-08-31 17:55:56.931551 7faf14fff700 20 bluefs _fsync file metadata was dirty (31278) on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000!
 ,1!
>  :0!
> >  x7!
> > >  51!
> > > >  00!
> > > > >  000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce!
 00!
>  00!
> >  0+!
> > >  10!
> > > >  00!
> > > > >  
> > > > > 00,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0
> > > > > x7
> > > > > d6
> > > > > 00000+100000,1:0x7d800000+10e00000]), flushing log
> > > > > > 
> > > > > > The above looks good, it is about to call _flush_and_sync_log() after this.
> > > > > 
> > > > > Yes, although this file is huge (0x19c2c265 = 432194149 ~ 400MB)... What rocksdb options did you pass in?  I'm guessing this is a log file, but we generally want those smallish (maybe 16MB - 64MB, so that L0 SST generation isn't too slow).
> > > > > 
> > > > > >     -5> 2016-08-31 17:55:56.931588 7faf14fff700 10 bluefs _flush_and_sync_log txn(seq 31278 len 0x50e146 crc 0x58a6b9ab)
> > > > > >     -4> 2016-08-31 17:55:56.933951 7faf14fff700 10 bluefs _pad_bl padding with 0xe94 zeros
> > > > > >     -3> 2016-08-31 17:55:56.934079 7faf14fff700 20 bluefs flush_bdev
> > > > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > > > >     -1> 2016-08-31 17:55:56.934274 7faf14fff700 10 bluefs _flush_range 0x7faf2e824140 pos 0xce9000 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > > > >      0> 2016-08-31 17:55:56.939745 7faf14fff700 -1
> > > > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > > > BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)'
> > > > > > thread
> > > > > > 7faf14fff700 time 2016-08-31 17:55:56.934282
> > > > > > os/bluestore/BlueFS.cc: 1390: FAILED assert(h->file->fnode.ino 
> > > > > > !=
> > > > > > 1)
> > > > > > 
> > > > > >  ceph version 11.0.0-1946-g9a5cfe2
> > > > > > (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > > > char
> > > > > > const*)+0x80) [0x56073c27c7d0]
> > > > > >  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, 
> > > > > > unsigned
> > > > > > long)+0x1d69) [0x56073bf4e109]
> > > > > >  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) 
> > > > > > [0x56073bf4e2d7]
> > > > > >  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&,
> > > > > > unsigned long, unsigned long)+0x443) [0x56073bf4fe13]
> > > > > >  5: (BlueFS::_fsync(BlueFS::FileWriter*,
> > > > > > std::unique_lock<std::mutex>&)+0x35b) [0x56073bf5140b]
> > > > > >  6: (BlueRocksWritableFile::Sync()+0x62) [0x56073bf68be2]
> > > > > >  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1)
> > > > > > [0x56073c0f24b1]
> > > > > >  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0)
> > > > > > [0x56073c0f3960]
> > > > > >  9: 
> > > > > > (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Sta
> > > > > > tu s const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6)
> > > > > > [0x56073c1354c6]
> > > > > >  10: 
> > > > > > (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::Comp
> > > > > > ac
> > > > > > ti
> > > > > > on
> > > > > > Job::SubcompactionState*)+0x14ea) [0x56073c137c8a]
> > > > > >  11: (rocksdb::CompactionJob::Run()+0x479) [0x56073c138c09]
> > > > > >  12: (rocksdb::DBImpl::BackgroundCompaction(bool*,
> > > > > > rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) 
> > > > > > [0x56073c0275d0]
> > > > > >  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf)
> > > > > > [0x56073c03443f]
> > > > > >  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) 
> > > > > > [0x56073c0eb039]
> > > > > >  15: (()+0x9900d3) [0x56073c0eb0d3]
> > > > > >  16: (()+0x76fa) [0x7faf3d1106fa]
> > > > > >  17: (clone()+0x6d) [0x7faf3af70b5d]
> > > > > > 
> > > > > > 
> > > > > > Now, as you can see, it is calling _flush() with inode 1. Why? Is this expected?
> > > > > 
> > > > > Yes.  The metadata for the log file is dirty (the file size changed), so bluefs is flushing its journal (ino 1) to update the fnode.
> > > > > 
> > > > > But this is very concerning:
> > > > > 
> > > > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush
> > > > > > 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime
> > > > > > 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > > > 
> > > > > 0xce9000~50f000 is a ~5 MB write.  Why would we ever write that much to the bluefs metadata journal at once?  That's why it's blowign the runway.  
> > > > > We have ~17MB allocated (0:0xba00000+500000,0:0xfd00000+c00000), we're at offset ~13MB, and we're writing ~5MB.
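For reference, the arithmetic behind those figures (hex converted to decimal):

    write length    0x50f000             =  5,304,320 bytes  (~5 MB)
    allocated       0x500000 + 0xc00000  = 0x1100000 = 17,825,792 bytes (~17 MB)
    write position  0xce9000             = 13,537,280 bytes  (~13 MB)

    13 MB + 5 MB ≈ 18 MB > 17 MB, so the write runs past the allocated runway.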
> > > > > 
> > > > > > Question :
> > > > > > ------------
> > > > > > 
> > > > > > 1. Why are we using the existing log_writer to do a runway check?
> > > > > > 
> > > > > > https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1280
> > > > > > 
> > > > > > Shouldn't the log_writer need to be reinitialized with the FileWriter rocksdb sent with the sync call?
> > > > > 
> > > > > It's the bluefs journal writer.. that's the runway we're worried about.
> > > > > 
> > > > > > 2. The runway check is not considering the request length, so
> > > > > > why is it not expected to allocate here
> > > > > > (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1388)?
> > > > > > 
> > > > > > If the snippet is not sufficient, let me know if you want me to upload the level 10 log or need 20/20 log to proceed further.
> > > > > 
> > > > > The level 10 log is probably enough...
> > > > > 
> > > > > Thanks!
> > > > > sage
> > > > > 
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Somnath Roy
> > > > > > Sent: Thursday, September 01, 2016 3:59 PM
> > > > > > To: 'Sage Weil'
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > Sage,
> > > > > > Created the following pull request on rocksdb repo, please take a look.
> > > > > > 
> > > > > > https://github.com/facebook/rocksdb/pull/1313
> > > > > > 
> > > > > > The fix is working fine for me.
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > Sent: Wednesday, August 31, 2016 6:20 AM
> > > > > > To: Somnath Roy
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > On Tue, 30 Aug 2016, Somnath Roy wrote:
> > > > > > > Sage,
> > > > > > > I did some debugging on the rocksdb bug; here are my findings.
> > > > > > > 
> > > > > > > 1. The log file number is added to log_recycle_files and *not* to log_delete_files by the following if block, which is expected.
> > > > > > > 
> > > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L854
> > > > > > > 
> > > > > > > 2. But, it is there in the candidate list in the following loop.
> > > > > > > 
> > > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L1000
> > > > > > > 
> > > > > > > 
> > > > > > > 3. This means it is added to full_scan_candidate_files by
> > > > > > > the following, from a full scan (?)
> > > > > > > 
> > > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L834
> > > > > > > 
> > > > > > > Added some log entries to verify , need to wait 6 hours :-(
> > > > > > > 
> > > > > > > 4. Probably #3 is not unusual, but the check in the following does not seem sufficient to keep the file.
> > > > > > > 
> > > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L1013
> > > > > > > 
> > > > > > > Again, I added some logging to see state.log_number during that time. BTW, the following check is probably a noop, as state.prev_log_number always comes back as 0.
> > > > > > > 
> > > > > > > (number == state.prev_log_number)
> > > > > > > 
> > > > > > > 5. So, the quick solution I am thinking of is to add a check to see if the log is in the recycle list, and avoid deleting it in the code above (?).
> > > > > > 
> > > > > > That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and ask in the comment if there is a better solution.
> > > > > > 
> > > > > > Probably the recycle list should be turned into a set so that the check is O(log n)...
> > > > > > 
> > > > > > Thanks!
> > > > > > sage
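For reference, a minimal sketch of the check being proposed (illustrative only; 'state', 'number', and 'type' follow the naming used in the discussion above, and this is not necessarily the patch that ultimately landed):

  // Sketch of the proposed guard in the obsolete-file purge path
  // (not the verified upstream fix):
  bool recycled = std::find(state.log_recycle_files.begin(),
                            state.log_recycle_files.end(),
                            number) != state.log_recycle_files.end();
  if (type == kLogFile && recycled) {
    continue;   // the WAL was handed to the recycle list; don't delete it
  }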
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > Let me know what you think.
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > -----Original Message-----
> > > > > > > From: Somnath Roy
> > > > > > > Sent: Sunday, August 28, 2016 7:37 AM
> > > > > > > To: 'Sage Weil'
> > > > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > Sage,
> > > > > > > Some updates on this.
> > > > > > > 
> > > > > > > 1. The issue is reproduced with the latest rocksdb master as well.
> > > > > > > 
> > > > > > > 2. The issue is *not happening* if I run with rocksdb log recycling disabled. This proves our root cause is right.
> > > > > > > 
> > > > > > > 3. Running some more performance tests with log recycling disabled, but the initial impression is that it introduces spikes and the output is not as stable as with log recycling enabled.
> > > > > > > 
> > > > > > > 4. Created a rocksdb issue for this
> > > > > > > (https://github.com/facebook/rocksdb/issues/1303)
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Somnath Roy
> > > > > > > Sent: Thursday, August 25, 2016 2:35 PM
> > > > > > > To: 'Sage Weil'
> > > > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > Sage,
> > > > > > > Hope you are able to download the log I shared via google doc.
> > > > > > > It seems the bug is around this portion.
> > > > > > > 
> > > > > > > 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log
> > > > > > > 254 to recycle list
> > > > > > > 
> > > > > > > 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log
> > > > > > > 256 to recycle list
> > > > > > > 
> > > > > > > 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original 
> > > > > > > Log Time
> > > > > > > 2016/08/25-00:44:03.347467) [default] Level-0 commit table
> > > > > > > #258 started
> > > > > > > 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original 
> > > > > > > Log Time
> > > > > > > 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: 
> > > > > > > memtable #1 done
> > > > > > > 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original 
> > > > > > > Log Time
> > > > > > > 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: 
> > > > > > > memtable #2 done
> > > > > > > 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original 
> > > > > > > Log Time
> > > > > > > 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 
> > > > > > > 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": 
> > > > > > > [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> > > > > > > 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original 
> > > > > > > Log Time
> > > > > > > 2016/08/25-00:44:03.348297) [default] Level summary: base 
> > > > > > > level
> > > > > > > 1 max bytes base 5368709120 files[3 4 0 0 0 0 0] max score
> > > > > > > 0.75
> > > > > > > 
> > > > > > > 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> > > > > > > 
> > > > > > > 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink 
> > > > > > > db.wal/000256.log
> > > > > > > 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link 
> > > > > > > had refs
> > > > > > > 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25
> > > > > > > 00:41:26.298423 bdev
> > > > > > > 0 extents
> > > > > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0x
> > > > > > > e2
> > > > > > > 00
> > > > > > > 00
> > > > > > > 0+
> > > > > > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+7000
> > > > > > > 00
> > > > > > > ,0
> > > > > > > :0
> > > > > > > x1
> > > > > > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > > > 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link 
> > > > > > > destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25
> > > > > > > 00:41:26.298423 bdev 0 extents 
> > > > > > > [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0x
> > > > > > > e2
> > > > > > > 00
> > > > > > > 00
> > > > > > > 0+
> > > > > > > 800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+7000
> > > > > > > 00
> > > > > > > ,0
> > > > > > > :0
> > > > > > > x1
> > > > > > > 1100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > > > 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink 
> > > > > > > db.wal/000254.log
> > > > > > > 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link 
> > > > > > > had refs
> > > > > > > 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25
> > > > > > > 00:41:26.299110 bdev
> > > > > > > 0 extents
> > > > > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x
> > > > > > > 84
> > > > > > > 00
> > > > > > > 00
> > > > > > > 0+
> > > > > > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:
> > > > > > > 0x
> > > > > > > b4
> > > > > > > 00000+800000,0:0xc000000+500000])
> > > > > > > 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link 
> > > > > > > destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25
> > > > > > > 00:41:26.299110 bdev 0 extents 
> > > > > > > [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x
> > > > > > > 84
> > > > > > > 00
> > > > > > > 00
> > > > > > > 0+
> > > > > > > 800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:
> > > > > > > 0x
> > > > > > > b4
> > > > > > > 00000+800000,0:0xc000000+500000])
> > > > > > > 
> > > > > > > So, log 254 is added to the recycle list and at the same time it is added for deletion. It seems there is a race condition in this portion (?).
> > > > > > > 
> > > > > > > I was going through the rocksdb code and I found the following.
> > > > > > > 
> > > > > > > 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> > > > > > > 
> > > > > > > 2. The log is deleted from DBImpl::DeleteObsoleteFileImpl, which is *not* under the lock but iterates over log_delete_files. This is fishy, but it shouldn't be the reason the same log number ends up in two lists.
> > > > > > > 
> > > > > > > 3. I checked all the places, but in the following place alive_log_files_ (within DBImpl::WriteImpl) is accessed without the lock.
> > > > > > > 
> > > > > > > 4625       alive_log_files_.back().AddSize(log_entry.size());   
> > > > > > > 
> > > > > > > Could it be reintroducing the same log number (254)? I am not sure.
> > > > > > > 
> > > > > > > In summary, it seems to be a rocksdb bug, and setting recycle_log_file_num = 0 should *bypass* it. I need to check the performance impact of this, though.
> > > > > > > 
> > > > > > > Should I post this to the rocksdb community, or is there any other place where I can get a response from rocksdb folks?
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Somnath Roy
> > > > > > > Sent: Wednesday, August 24, 2016 2:52 PM
> > > > > > > To: 'Sage Weil'
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > Sage,
> > > > > > > Thanks for looking , glad that we figured out something :-)..
> > > > > > > So, you want me to reproduce this with only debug_bluefs = 20/20 ? Don't need bluestore log ?
> > > > > > > Hope my root partition doesn't get full , this crash happened 
> > > > > > > after
> > > > > > > 6 hours :-) ,
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > Sent: Wednesday, August 24, 2016 2:34 PM
> > > > > > > To: Somnath Roy
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > > > Sage, It is there in the following github link I posted 
> > > > > > > > earlier..You can see 3 logs there..
> > > > > > > > 
> > > > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88a1b28fcc39
> > > > > > > 
> > > > > > > Ah sorry, got it.
> > > > > > > 
> > > > > > > > And looking at the crash and code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or the rocksdb recycling code.
> > > > > > > 
> > > > > > > I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> > > > > > > 
> > > > > > > sage
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > >  >
> > > > > > > > Thanks & Regards
> > > > > > > > Somnath
> > > > > > > > 
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > Sent: Wednesday, August 24, 2016 1:43 PM
> > > > > > > > To: Somnath Roy
> > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > 
> > > > > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > > > > Sage,
> > > > > > > > > I got the db assert log from submit_transaction in the following location.
> > > > > > > > > 
> > > > > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88a1b28fcc39
> > > > > > > > > 
> > > > > > > > > This is the log with level 1/20 and with my hook for printing the rocksdb::WriteBatch transaction. I have uploaded 3 osd logs, and a common pattern before the crash is the following.
> > > > > > > > > 
> > > > > > > > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: reusing log 266 from recycle list
> > > > > > > > > 
> > > > > > > > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > > > > > > > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > > > > > > > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > > > > > > > 
> > > > > > > > > It seems it is trying to rename the WAL file, and the old file is not found. You can see the transaction printed in the log along with the error code, like this.
> > > > > > > > 
> > > > > > > > How much of the log do you have? Can you post what you have somewhere?
> > > > > > > > 
> > > > > > > > Thanks!
> > > > > > > > sage
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch =
> > > > > > > > > Put( Prefix = M key = 0x0000000000001483'.0000000087.00000000000000035303')
> > > > > > > > > Put( Prefix = M key = 0x0000000000001483'._info')
> > > > > > > > > Put( Prefix = O key = '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000000089ace7!'0xfffffffffffffffeffffffffffffffff)
> > > > > > > > > Delete( Prefix = B key = 0x000004e73ae72000)
> > > > > > > > > Put( Prefix = B key = 0x000004e73af72000)
> > > > > > > > > Merge( Prefix = T key = 'bluestore_statfs')
> > > > > > > > > 
> > > > > > > > > Hope my decoding of the key is correct; I reused pretty_binary_string() from BlueStore after stripping the first 2 bytes of the key, which are the prefix and a '0'.
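> > > > > > > > > 
> > > > > > > > > (A rough sketch of that decode step -- hypothetical helper, assuming pretty_binary_string() from os/bluestore/BlueStore.cc:
> > > > > > > > > 
> > > > > > > > >   // key[0] is the prefix byte (M/O/B/T...), key[1] is the '0' separator;
> > > > > > > > >   // everything after that is the actual key that gets pretty-printed.
> > > > > > > > >   std::string decode_key_for_dump(const std::string& key) {
> > > > > > > > >     return key.size() > 2 ? pretty_binary_string(key.substr(2)) : std::string();
> > > > > > > > >   }
> > > > > > > > > 
> > > > > > > > > so the prefix and separator are dropped before pretty-printing.)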
> > > > > > > > > 
> > > > > > > > > Any suggestion on the next step for root-causing this db assert, if the log rename is not enough of a hint?
> > > > > > > > > 
> > > > > > > > > BTW, this time I ran with bluefs_compact_log_sync = true and without commenting out the inode number asserts. I didn't hit those asserts, and no corruption so far either. It seems like a bug in async compaction; I will try to reproduce that one with a verbose log later.
> > > > > > > > > 
> > > > > > > > > Thanks & Regards
> > > > > > > > > Somnath
> > > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Somnath Roy
> > > > > > > > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > > > > > > > To: 'Sage Weil'
> > > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > > 
> > > > > > > > > I was running my tests for 2 hours and it happened within that time.
> > > > > > > > > I will try to reproduce with 1/20.
> > > > > > > > > 
> > > > > > > > > Thanks & Regards
> > > > > > > > > Somnath
> > > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > > > > > > > To: Somnath Roy
> > > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > > 
> > > > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > > > Sage,
> > > > > > > > > > I think there is some bug introduced recently in BlueFS, and I am getting corruption like this which I was not seeing earlier.
> > > > > > > > > 
> > > > > > > > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > > > > > > > 
> > > > > > > > > Any idea how long do you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
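> > > > > > > > > 
> > > > > > > > > Illustrative ceph.conf snippet for the above, assuming the usual [osd] section:
> > > > > > > > > 
> > > > > > > > >   [osd]
> > > > > > > > >   bluefs compact log sync = true   # fall back to synchronous bluefs log compaction
> > > > > > > > >   debug bluefs = 1/20              # low runtime level, recent history dumped at 20 on crash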
> > > > > > > > > 
> > > > > > > > > Thanks!
> > > > > > > > > sage
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > > > > > > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > > > > > > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > > > > > > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > > > > > > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > > > > > > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: 
> > > > > > > > > > In function 'OSDMapRef OSDService::get_map(epoch_t)' 
> > > > > > > > > > thread
> > > > > > > > > > 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > > > > > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > > > > > > > 
> > > > > > > > > >  ceph version 11.0.0-1688-g6f48ee6
> > > > > > > > > > (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, 
> > > > > > > > > > int, char
> > > > > > > > > > const*)+0x80) [0x5617f2a99e80]
> > > > > > > > > >  2: (OSDService::get_map(unsigned int)+0x5d) 
> > > > > > > > > > [0x5617f2395fdd]
> > > > > > > > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > > > > > > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > > > > > > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > > > > > > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > > > > > > > 
> > > > > > > > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > > > > > > > 
> > > > > > > > > > Thanks & Regards
> > > > > > > > > > Somnath
> > > > > > > > > > 
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Somnath Roy
> > > > > > > > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > > > > > > > To: 'Sage Weil'
> > > > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > > > 
> > > > > > > > > > "compaction_style=kCompactionStyleUniversal"  in the bluestore_rocksdb_options .
> > > > > > > > > > Here is the option I am using..
> > > > > > > > > > 
> > > > > > > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > > > > > > 
> > > > > > > > > > Here is another one after 4 hours of 4K RW :-)...Sorry to bombard you with all these , I am adding more log for the next time if I hit it..
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1 os/bluestore/BlueFS.cc: In function 'int BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)' thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > > > > > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > > > > > > > 
> > > > > > > > > >  ceph version 11.0.0-1688-g3fcc89c (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5581ed453cb0]
> > > > > > > > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > > > > > > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > > > > > > > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
> > > > > > > > > >  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&, rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > > > > > > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > > > > > > > >  7: (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > > > > > > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) [0x5581ed28ba68]
> > > > > > > > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > > > > > > > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) [0x5581ed25c458]
> > > > > > > > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > > > > > > > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > > > > > > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x157) [0x5581ed1d21d7]
> > > > > > > > > >  14: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x55b) [0x5581ed02802b]
> > > > > > > > > >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > > > > > > > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) [0x5581ed032bc2]
> > > > > > > > > >  17: (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > > > > > > > >  18: (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39) [0x5581ecef89e9]
> > > > > > > > > >  19: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb) [0x5581ecefeb4b]
> > > > > > > > > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > > > > > > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) [0x5581eccdd2e9]
> > > > > > > > > >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x52) [0x5581eccdd542]
> > > > > > > > > >  23: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > > > > > > > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) [0x5581ed440b2f]
> > > > > > > > > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5581ed4441f0]
> > > > > > > > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > > > > > > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > > > > > > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > > > > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Thanks & Regards
> > > > > > > > > > Somnath
> > > > > > > > > > 
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > > > > > > > To: Somnath Roy
> > > > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > > > 
> > > > > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > > > > > > > compaction during this time.
> > > > > > > > > > 
> > > > > > > > > > How are you selecting universal compaction?
> > > > > > > > > > 
> > > > > > > > > > sage
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> 
> 
Somnath Roy Sept. 30, 2016, 4:44 p.m. UTC | #46
Sage,
As I mentioned in the morning call , here is the assert I got.

/root/ceph-master/src/os/bluestore/StupidAllocator.cc: 333: FAILED assert(committing.empty())

 ceph version v11.0.0-2791-gb87d96c (b87d96c6cb8ec5490b452ca0c5bc06f861417d42)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5644fbfe6bf0]
 2: (StupidAllocator::commit_start()+0x30c) [0x5644fbe30fcc]
 3: (BlueFS::sync_metadata()+0x29e) [0x5644fbe0ab4e]
 4: (BlueRocksDirectory::Fsync()+0xd) [0x5644fbe1d82d]
 5: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x1402) [0x5644fbe6d432]
 6: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x2a) [0x5644fbe6dfba]
 7: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0xb9) [0x5644fbd93ab9]
 8: (BlueStore::_kv_sync_thread()+0x2072) [0x5644fbd46ad2]
 9: (BlueStore::KVSyncThread::entry()+0xd) [0x5644fbd690fd]
 10: (()+0x76fa) [0x7f014592a6fa]
 11: (clone()+0x6d) [0x7f0144215b5d]

Not much log, but I tried to find an obvious reason and couldn't.

commit_start()/commit_finish() is only called from sync_metadata() in BlueFS. Since the BlueStore allocator is a separate instance, I am not counting the commit_start()/commit_finish() calls from BlueStore::_kv_sync_thread().
The committing structure is always cleared in commit_finish(), and all of that is well protected by the lock.
The only possibility I see is if alloc[] changes and commit_finish() is not called for that particular bdev->alloc?
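
To spell out the invariant in question (a simplified sketch, not the actual StupidAllocator code):

  // commit_start() expects the previous commit cycle to have finished:
  void Allocator::commit_start() {
    std::lock_guard<std::mutex> l(lock);
    assert(committing.empty());      // <-- the assert that fired
    committing.swap(uncommitted);    // stage extents released since the last commit
  }

  void Allocator::commit_finish() {
    std::lock_guard<std::mutex> l(lock);
    insert_free(committing);         // staged extents become allocatable again
    committing.clear();              // must run before the next commit_start()
  }

So the assert firing means a commit_start() ran without a matching commit_finish() in between, or against a different allocator instance than the one that was started.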

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Thursday, September 15, 2016 8:38 AM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Thu, 15 Sep 2016, Somnath Roy wrote:
> Sage,
> I hit the following assert again with the latest master after 9 hours , so, it seems the progress check stuff is not fixing the issue.
> 
> ceph version v11.0.0-2307-gca74bd9 (ca74bd9f17a76ca16c59f976fc32829b2dff88b2)
>  1: (()+0x8c725e) [0x55d39df2d25e]
>  2: (()+0x113d0) [0x7fd3b05833d0]
>  3: (gsignal()+0x38) [0x7fd3aed93418]
>  4: (abort()+0x16a) [0x7fd3aed9501a]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fd3af6d584d]
>  6: (()+0x8d6b6) [0x7fd3af6d36b6]
>  7: (()+0x8d701) [0x7fd3af6d3701]
>  8: (()+0x8d919) [0x7fd3af6d3919]
>  9: (std::__throw_length_error(char const*)+0x3f) [0x7fd3af6fc25f]
>  10: (()+0x89510f) [0x55d39defb10f]
>  11: (BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)+0xc04) [0x55d39def09a4]
>  12: (BlueFS::sync_metadata()+0x49b) [0x55d39def1feb]
>  13: (BlueRocksDirectory::Fsync()+0xd) [0x55d39df0499d]
>  14: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x1402) [0x55d39df53a62]
>  15: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x2a) [0x55d39df545ea]
>  16: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x6b) [0x55d39de7ac2b]
>  17: (BlueStore::_kv_sync_thread()+0x20a8) [0x55d39de2f148]
>  18: (BlueStore::KVSyncThread::entry()+0xd) [0x55d39de50ead]
>  19: (()+0x76fa) [0x7fd3b05796fa]
>  20: (clone()+0x6d) [0x7fd3aee64b5d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> It is the same problem: old_log_jump_to is bigger than the total length of the extent vector.

Ah, I had the condition wrong in the previous fix.  See

	https://github.com/ceph/ceph/pull/11095

Thanks!
sage


> 
> Thanks & Regards
> Somnath
> -----Original Message-----
> From: Somnath Roy 
> Sent: Tuesday, September 06, 2016 8:30 AM
> To: 'Sage Weil'
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> Sage,
> Please find the entire 0/20 log in the following location for the first assert.
> 
> https://github.com/somnathr/ceph/blob/master/ceph-osd.3.log
> 
> This may not be helpful, I will try to reproduce this with debug_bluefs = 10/20.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com] 
> Sent: Tuesday, September 06, 2016 6:28 AM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel
> Subject: RE: Bluestore assert
> 
> On Tue, 6 Sep 2016, Somnath Roy wrote:
> > Sage,
> > Here is one of the assert that I can reproduce consistently while I was running with big runway values and for 10 hours of 4K RW without preconditioning.
> > 
> > 1. 
> > 
> > in thread 7f9de27ff700 thread_name:rocksdb:bg7
> > 
> >  ceph version 11.0.0-1946-g9a5cfe2 (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> >  1: (()+0xa0d94e) [0x55aa63cd194e]
> >  2: (()+0x113d0) [0x7f9df56723d0]
> >  3: (gsignal()+0x38) [0x7f9df33f7418]
> >  4: (abort()+0x16a) [0x7f9df33f901a]
> >  5: (()+0x2dbd7) [0x7f9df33efbd7]
> >  6: (()+0x2dc82) [0x7f9df33efc82]
> >  7: (BlueFS::_compact_log_async(std::unique_lock<std::mutex>&)+0x1d43) [0x55aa63abea33]
> >  8: (BlueFS::sync_metadata()+0x38b) [0x55aa63abef0b]
> >  9: (BlueRocksDirectory::Fsync()+0xd) [0x55aa63ad303d]
> >  10: (rocksdb::CompactionJob::Run()+0xe86) [0x55aa63ca3c96]
> >  11: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) [0x55aa63b91c50]
> >  12: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf) [0x55aa63b9eabf]
> >  13: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) [0x55aa63c556b9]
> >  14: (()+0x991753) [0x55aa63c55753]
> >  15: (()+0x76fa) [0x7f9df56686fa]
> >  16: (clone()+0x6d) [0x7f9df34c8b5d]
> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > 
> > I was able to root-cause it as an async log compaction bug. Here is my analysis.
> > 
> > Here is the log snippet (for the crashing thread) it dumped with debug_bluefs = 0/20.
> > 
> >    -95> 2016-09-05 18:09:38.242895 7f8d7a3f5700 10 bluefs _compact_log_async remove 0x32100000 of [1:0x3f2d900000+100000,0:0x1e7700000+19000000]
> >    -94> 2016-09-05 18:09:38.242903 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 1:0x3f2d900000+100000
> >    -93> 2016-09-05 18:09:38.242905 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 0:0x1e7700000+19000000
> >    -92> 2016-09-05 18:09:38.242907 7f8d7a3f5700 10 bluefs _compact_log_async remove old log extent 0:0x1e7700000+19000000
> > 
> > So, the last two extent entries are identical, and the list is corrupted because we didn't check whether the vector is empty in the following loop. We call front() on an empty vector, which is undefined behavior, and it later crashes when we call erase() with begin(); begin() on an empty vector can't be dereferenced.
> > 
> >   dout(10) << __func__ << " remove 0x" << std::hex << old_log_jump_to << std::dec
> > 	   << " of " << log_file->fnode.extents << dendl;
> >   uint64_t discarded = 0;
> >   vector<bluefs_extent_t> old_extents;
> >   while (discarded < old_log_jump_to) {
> >     bluefs_extent_t& e = log_file->fnode.extents.front();
> >     bluefs_extent_t temp = e;
> >     if (discarded + e.length <= old_log_jump_to) {
> >       dout(10) << __func__ << " remove old log extent " << e << dendl;
> >       discarded += e.length;
> >       log_file->fnode.extents.erase(log_file->fnode.extents.begin());
> >     } else {
> >       dout(10) << __func__ << " remove front of old log extent " << e << dendl;
> >       uint64_t drop = old_log_jump_to - discarded;
> >       temp.length = drop;
> >       e.offset += drop;
> >       e.length -= drop;
> >       discarded += drop;
> >       dout(10) << __func__ << "   kept " << e << " removed " << temp << dendl;
> >     }
> >     old_extents.push_back(temp);
> >   }
> > 
> > But the question is: other than adding an empty check for the vector, do we need to do anything else? Why, in this case after ~7 hours, is old_log_jump_to bigger than the total length of the extent vector (because of the bigger runway config?)?
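> > 
> > For reference, the empty check I mean is just guarding the loop condition -- same body as above, a sketch only, and it merely papers over whatever makes old_log_jump_to too big in the first place:
> > 
> >   while (discarded < old_log_jump_to &&
> >          !log_file->fnode.extents.empty()) {  // never call front()/begin() on an empty vector
> >     ... // loop body unchanged
> >   }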
> 
> Exactly--it shouldn't be.  old_log_jump_to *must* be less than the 
> totally allocated extents.  It should equal just the extents that were 
> present/used *prior* to us ensuring that runway is allocated.  Do you 
> have a bit more log?  We need to see why it was big enough to empty out 
> the vector...
> 
> > 2. Here is another assert during recovery , which I was not able reproduce again later. The 0/20 log is not saying anything on the thread unfortunately !!
> 
> Hrm, hard to say what's going on there.  My guess is a secondary effect 
> from the above.  At least we should rule it out.
> 
> sage
> 
> > 
> > 2016-09-02 19:04:35.261856 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34638 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).fault
> > 2016-09-02 19:04:35.262428 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34682 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
> > 2016-09-02 19:04:35.263045 7ff3ae1fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.10:6805/937878 pipe(0x7fee55871000 sd=36 :45296 s=2 pgs=50 cs=1 l=0 c=0x7ff3aa87c640).fault, initiating reconnect
> > 2016-09-02 19:04:35.263477 7ff3e31fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.10:6805/937878 pipe(0x7fee55871000 sd=36 :45348 s=1 pgs=50 cs=2 l=0 c=0x7ff3aa87c640).connect got RESETSESSION
> > 2016-09-02 19:04:35.270038 7ff3c5ff4700  0 -- 10.60.194.11:6832/1227385 submit_message MOSDPGPushReply(1.235 22 [PushReplyOp(1:ac44029e:::rbd_data.10176b8b4567.00000000001bb8db:head),PushReplyOp(1:ac44036d:::rbd_data.10176b8b4567.00000000005a5365:head),PushReplyOp(1:ac440604:::rbd_data.10176b8b4567.000000000043d177:head),PushReplyOp(1:ac440608:::rbd_data.10176b8b4567.00000000002aba83:head),PushReplyOp(1:ac44089f:::rbd_data.10176b8b4567.0000000000710e5d:head),PushReplyOp(1:ac4409cd:::rbd_data.10176b8b4567.0000000000689b0d:head),PushReplyOp(1:ac440c37:::rbd_data.10176b8b4567.00000000002d1db3:head),PushReplyOp(1:ac440e0b:::rbd_data.10176b8b4567.00000000009801e1:head)]) v2 remote, 10.60.194.11:6829/227799, failed lossy con, dropping message 0x7ff245058380
> > 2016-09-02 19:04:35.282823 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=43 :34694 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
> > 2016-09-02 19:04:35.293903 7ff3e43fe700  0 -- 10.60.194.11:0/227385 >> 10.60.194.11:6822/227085 pipe(0x7ff38604a000 sd=38 :34696 s=1 pgs=0 cs=0 l=1 c=0x7ff386013900).connect claims to be 10.60.194.11:6822/1226253 not 10.60.194.11:6822/227085 - wrong node!
> > 2016-09-02 19:04:39.366837 7ff3befe6700  0 log_channel(cluster) log [INF] : 1.130 continuing backfill to osd.5 from (20'392769,22'395772] MIN to 22'395772
> > 2016-09-02 19:04:39.367262 7ff3c17eb700  0 log_channel(cluster) log [INF] : 1.39a continuing backfill to osd.4 from (20'383603,22'386606] MIN to 22'386606
> > 2016-09-02 19:04:39.368695 7ff3c2fee700  0 log_channel(cluster) log [INF] : 1.253 continuing backfill to osd.1 from (20'386883,22'389884] MIN to 22'389884
> > 2016-09-02 19:04:39.408083 7ff3c17eb700  0 log_channel(cluster) log [INF] : 1.70 continuing backfill to osd.1 from (20'388673,22'391675] MIN to 22'391675
> > 2016-09-02 19:04:39.408152 7ff3bf7e7700  0 log_channel(cluster) log [INF] : 1.2bd continuing backfill to osd.1 from (20'389889,22'392892] MIN to 22'392892
> > 2016-09-02 19:04:40.617675 7ff3b51fc700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.11:6801/226086 pipe(0x7fef0ebfe000 sd=85 :6832 s=0 pgs=0 cs=0 l=0 c=0x7ff20d7ce280).accept connect_seq 0 vs existing 0 state connecting
> > 2016-09-02 19:04:40.617770 7ff3b62fd700  0 -- 10.60.194.11:6832/1227385 >> 10.60.194.11:6801/226086 pipe(0x7ff37b01a000 sd=86 :43610 s=4 pgs=0 cs=0 l=0 c=0x7ff37b01c140).connect got RESETSESSION but no longer connecting
> > 2016-09-02 19:04:41.197663 7ff3c0fea700  0 log_channel(cluster) log [INF] : 1.1ed continuing backfill to osd.0 from (20'393177,22'396182] MIN to 22'396182
> > 2016-09-02 19:04:41.197689 7ff3c2fee700  0 log_channel(cluster) log [INF] : 1.3fa continuing backfill to osd.0 from (20'391286,22'394289] MIN to 22'394289
> > 2016-09-02 19:04:41.197736 7ff3c37ef700  0 log_channel(cluster) log [INF] : 1.290 continuing backfill to osd.0 from (20'389914,22'392915] MIN to 22'392915
> > 2016-09-02 19:04:41.197752 7ff3c37ef700  0 log_channel(cluster) log [INF] : 1.290 continuing backfill to osd.13 from (20'389914,22'392915] MIN to 22'392915
> > 2016-09-02 19:04:41.197759 7ff3bffe8700  0 log_channel(cluster) log [INF] : 1.260 continuing backfill to osd.0 from (20'388867,22'391871] MIN to 22'391871
> > 2016-09-02 19:04:41.405458 7ff39f7ff700 -1 os/bluestore/BlueStore.cc: In function 'void BlueStore::OnodeSpace::add(const ghobject_t&, BlueStore::OnodeRef)' thread 7ff39f7ff700 time 2016-09-02 19:04:41.387802
> > os/bluestore/BlueStore.cc: 1065: FAILED assert(onode_map.count(oid) == 0)
> > 
> >  ceph version 11.0.0-1946-g9a5cfe2 (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x557bda1c9750]
> >  2: (BlueStore::OnodeSpace::add(ghobject_t const&, boost::intrusive_ptr<BlueStore::Onode>)+0x4bf) [0x557bd9d9cd0f]
> >  3: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x63e) [0x557bd9d9d3ae]
> >  4: (BlueStore::get_omap_iterator(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&)+0xc5) [0x557bd9da2c45]
> >  5: (BlueStore::get_omap_iterator(coll_t const&, ghobject_t const&)+0x7a) [0x557bd9d7e50a]
> >  6: (OSDriver::get_next(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list>*)+0x45) [0x557bd9af2305]
> >  7: (SnapMapper::get_next_object_to_trim(snapid_t, hobject_t*)+0x482) [0x557bd9af2bf2]
> >  8: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x4ac) [0x557bd9bfd8fc]
> >  9: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xc8) [0x557bd9c42418]
> >  10: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x139) [0x557bd9c2d299]
> >  11: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x111) [0x557bd9c2d541]
> >  12: (ReplicatedPG::snap_trimmer(unsigned int)+0x468) [0x557bd9ba1258]
> >  13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x750) [0x557bd9a727b0]
> >  14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) [0x557bda1b65cf]
> >  15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x557bda1b9c90]
> >  16: (Thread::entry_wrapper()+0x75) [0x557bda1a9065]
> >  17: (()+0x76fa) [0x7ff402b126fa]
> >  18: (clone()+0x6d) [0x7ff400972b5d]
> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Somnath Roy 
> > Sent: Friday, September 02, 2016 1:27 PM
> > To: 'Sage Weil'
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > Yes, did that with similar ratio, see below,  max = 400MB , min = 100MB.
> > Will see how it goes, thanks..
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, September 02, 2016 1:25 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; ceph-devel
> > Subject: RE: Bluestore assert
> > 
> > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > Sage,
> > > I am running with big runway values now (min 100 MB, max 400MB) and will keep you posted on this.
> > > One point, if I give this big runway values, the allocation will be very frequent (and probably unnecessarily for most of the cases) , no harm with that ?
> > 
> > I think it'll actually be less frequent, since it allocates bluefs_max_log_runway at a time.  Well, assuming you set that tunable as high well!
> > 
> > sage
> > 
> > > 
> > > Thanks & Regards
> > > Somnath
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Friday, September 02, 2016 12:27 PM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; ceph-devel
> > > Subject: RE: Bluestore assert
> > > 
> > > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > > Sage,
> > > > It is probably the universal compaction that is generating the bigger files; other than that, I don't see how the following tuning would generate large files.
> > > > I will try some universal compaction tuning related to file size and confirm.
> > > > Yeah, a big bluefs_min_log_runway value will probably shield the assert for now, but since we don't have control over the file size, I am afraid we can't be sure what value of bluefs_min_log_runway keeps the assert from hitting in future long runs.
> > > > Can't we do something like this?
> > > > 
> > > > // Basically, check the pending log length as well as the minimum runway:
> > > > if (runway < g_conf->bluefs_min_log_runway ||
> > > >     runway < log_writer->buffer.length()) {
> > > >   // allocate more runway
> > > > }
> > > 
> > > Oh, I see what you mean.  Yeah, I'll add that in--certainly doesn't hurt.
> > > 
> > > And I think just configuring a long runway won't hurt either (e.g., 100MB).
> > > 
> > > That's probably enough to be safe, but once we fix the flush thing I mentioned that will make this go away.
> > > 
> > > s
> > > 
> > > > 
> > > > Thanks & Regards
> > > > Somnath
> > > > 
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Friday, September 02, 2016 10:57 AM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; ceph-devel
> > > > Subject: RE: Bluestore assert
> > > > 
> > > > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > > > Here is my rocksdb option :
> > > > > 
> > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > 
> > > > > One discrepancy I can see here is max_bytes_for_level_base; it should be the same as the level-0 size. Initially I had a bigger min_write_buffer_number_to_merge, and that's how I calculated it. Now the level-0 size is the following:
> > > > > 
> > > > > write_buffer_size * min_write_buffer_number_to_merge * level0_file_num_compaction_trigger = 80MB * 2 * 4 = ~640MB
> > > > > 
> > > > > I should probably adjust max_bytes_for_level_base to a similar value.
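> > > > > 
> > > > > (Illustrative adjustment only, using the ~640MB estimate above: 640MB = 671088640 bytes, so e.g. max_bytes_for_level_base=671088640 in the bluestore_rocksdb_options string instead of the current 5368709120.)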
> > > > > 
> > > > > Please find the level 10 log here. This is the log I captured during replay (it crashed again) after it originally crashed for the same reason.
> > > > > 
> > > > > https://drive.google.com/file/d/0B7W-S0z_ymMJSnA3R2dyellYZ0U/view?usp=sharing
> > > > > 
> > > > > Thanks for the explanation, I get now why it is trying to flush inode 1.
> > > > > 
> > > > > But shouldn't we check the length as well during the runway check, rather than relying on bluefs_min_log_runway only?
> > > > 
> > > > That's what this does:
> > > > 
> > > >   uint64_t runway = log_writer->file->fnode.get_allocated() - log_writer->pos;
> > > > 
> > > > Anyway, I think I see what's going on.  There are a ton of _fsync and _flush_range calls that have to flush the fnode, and the fnode is pretty big (maybe 5k?) because it has so many extents (your tunables are generating really big files).
> > > > 
> > > > I think this is just a matter of the runway configurable being too small for your configuration.  Try bumping bluefs_min_log_runway by 10x.
> > > > 
> > > > Well, actually, we could improve this a bit.  Right now rocksdb is calling lots of flushes on a big sst, and a final fsync at the end.  Bluefs is logging the updated fnode every time the flush changes the file size, and then only writing it to disk when the final fsync happens.  Instead, it could/should put the dirty fnode on a list and, only at log flush time, append the latest fnodes to the log and write it out.
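> > > > 
> > > > Something like this, roughly (a hypothetical sketch of the idea, not the actual BlueFS change; names are approximate):
> > > > 
> > > >   // Instead of appending an fnode update to the bluefs journal on every
> > > >   // size change, remember which files are dirty and emit one update per
> > > >   // file when the journal is actually flushed.
> > > >   std::set<FileRef> dirty_files;           // marked from _flush_range()/_fsync()
> > > > 
> > > >   void note_dirty(FileRef f) { dirty_files.insert(f); }
> > > > 
> > > >   void flush_log() {
> > > >     for (auto& f : dirty_files)
> > > >       log_t.op_file_update(f->fnode);      // one entry per file, latest state only
> > > >     dirty_files.clear();
> > > >     // ... then write the journal out as before
> > > >   }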
> > > > 
> > > > I'll add it to the trello board.  I think it's not that big a deal.. 
> > > > except when you have really big files.
> > > > 
> > > > sage
> > > > 
> > > > 
> > > >  >
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Friday, September 02, 2016 9:35 AM
> > > > > To: Somnath Roy
> > > > > Cc: Mark Nelson; ceph-devel
> > > > > Subject: RE: Bluestore assert
> > > > > 
> > > > > 
> > > > > 
> > > > > On Fri, 2 Sep 2016, Somnath Roy wrote:
> > > > > 
> > > > > > Sage,
> > > > > > Tried to do some analysis on the inode assert, following looks suspicious.
> > > > > > 
> > > > > >    -10> 2016-08-31 17:55:56.921075 7faf14fff700 10 bluefs _fsync 0x7faf12075140 file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:0x75100000+100000!
 ,1!
>  :0!
> >  x7!
> > >  53!
> > > >  00!
> > > > >  000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce00000+100000,1:0x7d0!
 00!
>  00!
> >  0+!
> > >  10!
> > > >  00!
> > > > >  
> > > > > 00,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d600000+100000,1:0
> > > > > x7
> > > > > d8
> > > > > 00000+10e00000])
> > > > > >     -9> 2016-08-31 17:55:56.921117 7faf14fff700 10 bluefs _flush 0x7faf12075140 no dirty data on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000,1:!
 0x!
>  75!
> >  10!
> > >  00!
> > > >  00!
> > > > >  +100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce000!
 00!
>  +1!
> >  00!
> > >  00!
> > > >  0,!
> > > > >  
> > > > > 1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0x7d
> > > > > 60
> > > > > 00
> > > > > 00+100000,1:0x7d800000+10e00000])
> > > > > >     -8> 2016-08-31 17:55:56.921137 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140
> > > > > >     -7> 2016-08-31 17:55:56.931541 7faf14fff700 10 bluefs wait_for_aio 0x7faf12075140 done in 0.010402
> > > > > >     -6> 2016-08-31 17:55:56.931551 7faf14fff700 20 bluefs _fsync file metadata was dirty (31278) on file(ino 693 size 0x19c2c265 mtime 2016-08-31 17:55:56.919038 bdev 1 extents [1:0x6e500000+100000,1:0x6e700000+100000,1:0x6ea00000+200000,1:0x6ed00000+200000,1:0x6f000000+200000,1:0x6f300000+200000,1:0x6f600000+200000,1:0x6f900000+200000,1:0x6fc00000+200000,1:0x6ff00000+100000,1:0x70100000+200000,1:0x70400000+200000,1:0x70700000+200000,1:0x70a00000+100000,1:0x70c00000+200000,1:0x70f00000+100000,1:0x71100000+200000,1:0x71400000+200000,1:0x71700000+200000,1:0x71a00000+100000,1:0x71c00000+100000,1:0x71e00000+100000,1:0x72000000+100000,1:0x72200000+200000,1:0x72500000+200000,1:0x72800000+100000,1:0x72a00000+100000,1:0x72c00000+200000,1:0x72f00000+100000,1:0x73100000+200000,1:0x73400000+100000,1:0x73600000+200000,1:0x73900000+200000,1:0x73c00000+200000,1:0x73f00000+200000,1:0x74300000+200000,1:0x74600000+200000,1:0x74900000+200000,1:0x74c00000+200000,1:0x74f00000+100000!
 ,1!
>  :0!
> >  x7!
> > >  51!
> > > >  00!
> > > > >  000+100000,1:0x75300000+100000,1:0x75600000+100000,1:0x75800000+100000,1:0x75a00000+200000,1:0x75d00000+100000,1:0x75f00000+100000,1:0x76100000+100000,1:0x76400000+200000,1:0x76700000+200000,1:0x76a00000+300000,1:0x76e00000+200000,1:0x77100000+100000,1:0x77300000+100000,1:0x77500000+100000,1:0x77700000+300000,1:0x77b00000+200000,1:0x77e00000+200000,1:0x78100000+100000,1:0x78300000+200000,1:0x78600000+200000,1:0x78900000+100000,1:0x78c00000+100000,1:0x78e00000+100000,1:0x79000000+100000,1:0x79200000+100000,1:0x79400000+200000,1:0x79700000+100000,1:0x79900000+200000,1:0x79c00000+200000,1:0x79f00000+100000,1:0x7a100000+200000,1:0x7a400000+200000,1:0x7a700000+100000,1:0x7a900000+200000,1:0x7ac00000+100000,1:0x7ae00000+100000,1:0x7b000000+100000,1:0x7b200000+200000,1:0x7b500000+100000,1:0x7b800000+200000,1:0x7bb00000+100000,1:0x7bd00000+200000,1:0x7c000000+100000,1:0x7c200000+200000,1:0x7c500000+100000,1:0x7c700000+100000,1:0x7c900000+200000,1:0x7cc00000+100000,1:0x7ce!
 00!
>  00!
> >  0+!
> > >  10!
> > > >  00!
> > > > >  
> > > > > 00,1:0x7d000000+100000,1:0x7d200000+100000,1:0x7d400000+100000,1:0
> > > > > x7
> > > > > d6
> > > > > 00000+100000,1:0x7d800000+10e00000]), flushing log
> > > > > > 
> > > > > > The above looks good, it is about to call _flush_and_sync_log() after this.
> > > > > 
> > > > > Yes, although this file is huge (0x19c2c265 = 432194149 ~ 400MB)... What rocksdb options did you pass in?  I'm guessing this is a log file, but we generally want those smallish (maybe 16MB - 64MB, so that L0 SST generation isn't too slow).
> > > > > 
> > > > > >     -5> 2016-08-31 17:55:56.931588 7faf14fff700 10 bluefs _flush_and_sync_log txn(seq 31278 len 0x50e146 crc 0x58a6b9ab)
> > > > > >     -4> 2016-08-31 17:55:56.933951 7faf14fff700 10 bluefs _pad_bl padding with 0xe94 zeros
> > > > > >     -3> 2016-08-31 17:55:56.934079 7faf14fff700 20 bluefs flush_bdev
> > > > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > > > >     -1> 2016-08-31 17:55:56.934274 7faf14fff700 10 bluefs _flush_range 0x7faf2e824140 pos 0xce9000 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > > > >      0> 2016-08-31 17:55:56.939745 7faf14fff700 -1
> > > > > > os/bluestore/BlueFS.cc: In function 'int 
> > > > > > BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)'
> > > > > > thread
> > > > > > 7faf14fff700 time 2016-08-31 17:55:56.934282
> > > > > > os/bluestore/BlueFS.cc: 1390: FAILED assert(h->file->fnode.ino 
> > > > > > !=
> > > > > > 1)
> > > > > > 
> > > > > >  ceph version 11.0.0-1946-g9a5cfe2
> > > > > > (9a5cfe2e8e8c79b976c34e593993d74b58fce885)
> > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, 
> > > > > > char
> > > > > > const*)+0x80) [0x56073c27c7d0]
> > > > > >  2: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, 
> > > > > > unsigned
> > > > > > long)+0x1d69) [0x56073bf4e109]
> > > > > >  3: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0xa7) 
> > > > > > [0x56073bf4e2d7]
> > > > > >  4: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&,
> > > > > > unsigned long, unsigned long)+0x443) [0x56073bf4fe13]
> > > > > >  5: (BlueFS::_fsync(BlueFS::FileWriter*,
> > > > > > std::unique_lock<std::mutex>&)+0x35b) [0x56073bf5140b]
> > > > > >  6: (BlueRocksWritableFile::Sync()+0x62) [0x56073bf68be2]
> > > > > >  7: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x2d1)
> > > > > > [0x56073c0f24b1]
> > > > > >  8: (rocksdb::WritableFileWriter::Sync(bool)+0xf0)
> > > > > > [0x56073c0f3960]
> > > > > >  9: 
> > > > > > (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Sta
> > > > > > tu s const&, rocksdb::CompactionJob::SubcompactionState*)+0x4e6)
> > > > > > [0x56073c1354c6]
> > > > > >  10: 
> > > > > > (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::Comp
> > > > > > ac
> > > > > > ti
> > > > > > on
> > > > > > Job::SubcompactionState*)+0x14ea) [0x56073c137c8a]
> > > > > >  11: (rocksdb::CompactionJob::Run()+0x479) [0x56073c138c09]
> > > > > >  12: (rocksdb::DBImpl::BackgroundCompaction(bool*,
> > > > > > rocksdb::JobContext*, rocksdb::LogBuffer*, void*)+0x9c0) 
> > > > > > [0x56073c0275d0]
> > > > > >  13: (rocksdb::DBImpl::BackgroundCallCompaction(void*)+0xbf)
> > > > > > [0x56073c03443f]
> > > > > >  14: (rocksdb::ThreadPool::BGThread(unsigned long)+0x1d9) 
> > > > > > [0x56073c0eb039]
> > > > > >  15: (()+0x9900d3) [0x56073c0eb0d3]
> > > > > >  16: (()+0x76fa) [0x7faf3d1106fa]
> > > > > >  17: (clone()+0x6d) [0x7faf3af70b5d]
> > > > > > 
> > > > > > 
> > > > > > Now, as you can see, it is calling _flush() with inode 1. Why? Is this expected?
> > > > > 
> > > > > Yes.  The metadata for the log file is dirty (the file size changed), so bluefs is flushing its journal (ino 1) to update the fnode.
> > > > > 
> > > > > But this is very concerning:
> > > > > 
> > > > > >     -2> 2016-08-31 17:55:56.934257 7faf14fff700 10 bluefs _flush
> > > > > > 0x7faf2e824140 0xce9000~50f000 to file(ino 1 size 0xce9000 mtime
> > > > > > 0.000000 bdev 0 extents [0:0xba00000+500000,0:0xfd00000+c00000])
> > > > > 
> > > > > 0xce9000~50f000 is a ~5 MB write.  Why would we ever write that much to the bluefs metadata journal at once?  That's why it's blowing the runway.  
> > > > > We have ~17MB allocated (0:0xba00000+500000,0:0xfd00000+c00000), we're at offset ~13MB, and we're writing ~5MB.
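> > > > > 
> > > > > (Quick arithmetic on those numbers: allocated = 0x500000 + 0xc00000 = 0x1100000 bytes (~17MB); the current offset 0xce9000 (~13MB) plus the 0x50f000 (~5MB) write is 0x11f8000, which runs past the end of the 0x1100000 allocation.)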
> > > > > 
> > > > > > Question:
> > > > > > ------------
> > > > > > 
> > > > > > 1. Why are we using the existing log_writer to do the runway check?
> > > > > > 
> > > > > > https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1280
> > > > > > 
> > > > > > Shouldn't the log_writer be reinitialized with the FileWriter that rocksdb sent with the sync call?
> > > > > 
> > > > > It's the bluefs journal writer.. that's the runway we're worried about.
> > > > > 
> > > > > > 2. The runway check is not considering the request length, so why is it not expected to allocate here (https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueFS.cc#L1388)?
> > > > > > 
> > > > > > If the snippet is not sufficient, let me know if you want me to upload the level 10 log, or whether you need a 20/20 log to proceed further.
> > > > > 
> > > > > The level 10 log is probably enough...
> > > > > 
> > > > > Thanks!
> > > > > sage
> > > > > 
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Somnath Roy
> > > > > > Sent: Thursday, September 01, 2016 3:59 PM
> > > > > > To: 'Sage Weil'
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > Sage,
> > > > > > Created the following pull request on rocksdb repo, please take a look.
> > > > > > 
> > > > > > https://github.com/facebook/rocksdb/pull/1313
> > > > > > 
> > > > > > The fix is working fine for me.
> > > > > > 
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > Sent: Wednesday, August 31, 2016 6:20 AM
> > > > > > To: Somnath Roy
> > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > Subject: RE: Bluestore assert
> > > > > > 
> > > > > > On Tue, 30 Aug 2016, Somnath Roy wrote:
> > > > > > > Sage,
> > > > > > > I did some debugging on the rocksdb bug; here are my findings.
> > > > > > > 
> > > > > > > 1. The log file number is added to log_recycle_files and *not* in log_delete_files from the following if loop, which is expected.
> > > > > > > 
> > > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L854
> > > > > > > 
> > > > > > > 2. But, it is there in the candidate list in the following loop.
> > > > > > > 
> > > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L1000
> > > > > > > 
> > > > > > > 
> > > > > > > 3. This means it is added to full_scan_candidate_files in the following, from a full scan (?)
> > > > > > > 
> > > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L834
> > > > > > > 
> > > > > > > Added some log entries to verify , need to wait 6 hours :-(
> > > > > > > 
> > > > > > > 4. Probably #3 is not unusual, but the check in the following does not seem sufficient to keep the file.
> > > > > > > 
> > > > > > > https://github.com/facebook/rocksdb/blob/56dd03411534d0957a6f698f460609bf7cea4b63/db/db_impl.cc#L1013
> > > > > > > 
> > > > > > > Again, I added some logging to see state.log_number during that time. BTW, the following check is probably a noop, as state.prev_log_number always comes back 0.
> > > > > > > 
> > > > > > > (number == state.prev_log_number)
> > > > > > > 
> > > > > > > 5. So, the quick solution I am thinking of is to add a check for whether the log is in the recycle list, and avoid deleting it in the above code (?).
> > > > > > 
> > > > > > That seems reasonable.  I suggest coding this up and submitting a PR to github.com/facebook/rocksdb, and ask in the comment if there is a better solution.
> > > > > > 
> > > > > > Probably the recycle list should be turned into a set so that the check is O(log n)...
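> > > > > > 
> > > > > > Something along these lines, perhaps (a hypothetical sketch, not the actual rocksdb code):
> > > > > > 
> > > > > >   // Keep recycled log numbers in a set for an O(log n) membership test,
> > > > > >   // and skip them when purging candidate files found by the full scan.
> > > > > >   std::set<uint64_t> log_recycle_files;
> > > > > > 
> > > > > >   bool should_delete_log(uint64_t number, uint64_t min_log_number) {
> > > > > >     if (log_recycle_files.count(number))
> > > > > >       return false;                  // queued for reuse -> keep it on disk
> > > > > >     return number < min_log_number;  // otherwise obsolete if older than the minimum
> > > > > >   }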
> > > > > > 
> > > > > > Thanks!
> > > > > > sage
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > Let me know what you think.
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > -----Original Message-----
> > > > > > > From: Somnath Roy
> > > > > > > Sent: Sunday, August 28, 2016 7:37 AM
> > > > > > > To: 'Sage Weil'
> > > > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > Sage,
> > > > > > > Some updates on this.
> > > > > > > 
> > > > > > > 1. The issue is reproduced with the latest rocksdb master as well.
> > > > > > > 
> > > > > > > 2. The issue does *not* happen if I run with rocksdb log recycling disabled. This confirms our root cause is right.
> > > > > > > 
> > > > > > > 3. I am running some more performance tests with log recycling disabled, but my initial impression is that it introduces spikes and the output is not as stable as with log recycling enabled.
> > > > > > > 
> > > > > > > 4. Created a rocksdb issue for this
> > > > > > > (https://github.com/facebook/rocksdb/issues/1303)
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Somnath Roy
> > > > > > > Sent: Thursday, August 25, 2016 2:35 PM
> > > > > > > To: 'Sage Weil'
> > > > > > > Cc: 'Mark Nelson'; 'ceph-devel'
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > Sage,
> > > > > > > Hope you are able to download the log I shared via google doc.
> > > > > > > It seems the bug is around this portion.
> > > > > > > 
> > > > > > > 2016-08-25 00:44:03.348710 7f7c117ff700  4 rocksdb: adding log 254 to recycle list
> > > > > > > 
> > > > > > > 2016-08-25 00:44:03.348722 7f7c117ff700  4 rocksdb: adding log 256 to recycle list
> > > > > > > 
> > > > > > > 2016-08-25 00:44:03.348725 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.347467) [default] Level-0 commit table #258 started
> > > > > > > 2016-08-25 00:44:03.348727 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348225) [default] Level-0 commit table #258: memtable #1 done
> > > > > > > 2016-08-25 00:44:03.348729 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348227) [default] Level-0 commit table #258: memtable #2 done
> > > > > > > 2016-08-25 00:44:03.348730 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348239) EVENT_LOG_v1 {"time_micros": 1472111043348233, "job": 88, "event": "flush_finished", "lsm_state": [3, 4, 0, 0, 0, 0, 0], "immutable_memtables": 0}
> > > > > > > 2016-08-25 00:44:03.348735 7f7c117ff700  4 rocksdb: (Original Log Time 2016/08/25-00:44:03.348297) [default] Level summary: base level 1 max bytes base 5368709120 files[3 4 0 0 0 0 0] max score 0.75
> > > > > > > 
> > > > > > > 2016-08-25 00:44:03.348751 7f7c117ff700  4 rocksdb: [JOB 88] Try to delete WAL files size 131512372, prev total WAL file size 131834601, number of live WAL files 3.
> > > > > > > 
> > > > > > > 2016-08-25 00:44:03.348761 7f7c117ff700 10 bluefs unlink db.wal/000256.log
> > > > > > > 2016-08-25 00:44:03.348766 7f7c117ff700 20 bluefs _drop_link had refs 1 on file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev 0 extents [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > > > 2016-08-25 00:44:03.348775 7f7c117ff700 20 bluefs _drop_link destroying file(ino 19 size 0x3f3ddd9 mtime 2016-08-25 00:41:26.298423 bdev 0 extents [0:0xc500000+200000,0:0xcb00000+800000,0:0xd700000+700000,0:0xe200000+800000,0:0xee00000+800000,0:0xfa00000+800000,0:0x10600000+700000,0:0x11100000+700000,0:0x11c00000+800000,0:0x12800000+100000])
> > > > > > > 2016-08-25 00:44:03.348794 7f7c117ff700 10 bluefs unlink db.wal/000254.log
> > > > > > > 2016-08-25 00:44:03.348796 7f7c117ff700 20 bluefs _drop_link had refs 1 on file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev 0 extents [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])
> > > > > > > 2016-08-25 00:44:03.348803 7f7c117ff700 20 bluefs _drop_link destroying file(ino 18 size 0x3f3d402 mtime 2016-08-25 00:41:26.299110 bdev 0 extents [0:0x6500000+400000,0:0x6d00000+700000,0:0x7800000+800000,0:0x8400000+800000,0:0x9000000+800000,0:0x9c00000+700000,0:0xa700000+900000,0:0xb400000+800000,0:0xc000000+500000])
> > > > > > > 
> > > > > > > So, log 254 is added to the recycle list and, at the same time, queued for deletion. It seems there is a race condition in this portion (?).
> > > > > > > 
> > > > > > > I was going through the rocksdb code and I found the following.
> > > > > > > 
> > > > > > > 1. DBImpl::FindObsoleteFiles is the one that is responsible for populating log_recycle_files and log_delete_files. It is also deleting entries from alive_log_files_. But, this is always under mutex_ lock.
> > > > > > > 
> > > > > > > 2. The log is deleted from DBImpl::DeleteObsoleteFileImpl, which is *not* under the lock but iterates over log_delete_files. This is fishy, but it shouldn't be the reason for the same log number ending up in both lists.
> > > > > > > 
> > > > > > > 3. I checked all the call sites, and the following is the only place where alive_log_files_ (within DBImpl::WriteImpl) is accessed without the lock.
> > > > > > > 
> > > > > > > 4625       alive_log_files_.back().AddSize(log_entry.size());   
> > > > > > > 
> > > > > > > Can it be reintroducing the same log number (254)? I am not sure.
> > > > > > > 
> > > > > > > In summary, it seems to be a rocksdb bug, and setting recycle_log_file_num = 0 should *bypass* it. I need to check the performance impact of that, though.
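> > > > > > > 
> > > > > > > (For reference -- a sketch of what that workaround would look like in
> > > > > > > ceph.conf, based on the bluestore_rocksdb_options string quoted
> > > > > > > elsewhere in this thread; everything except recycle_log_file_num
> > > > > > > stays as-is:)
> > > > > > > 
> > > > > > >   [osd]
> > > > > > >   # ... keep the rest of the existing option string unchanged ...
> > > > > > >   bluestore_rocksdb_options = "...,recycle_log_file_num=0,..."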
> > > > > > > 
> > > > > > > Should I post this to the rocksdb community, or is there another place where I can get a response from the rocksdb folks?
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Somnath Roy
> > > > > > > Sent: Wednesday, August 24, 2016 2:52 PM
> > > > > > > To: 'Sage Weil'
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > Sage,
> > > > > > > Thanks for looking, glad that we figured out something :-)..
> > > > > > > So, you want me to reproduce this with only debug_bluefs = 20/20? No bluestore log needed?
> > > > > > > Hope my root partition doesn't get full; this crash happened after 6 hours :-)
> > > > > > > 
> > > > > > > Thanks & Regards
> > > > > > > Somnath
> > > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > Sent: Wednesday, August 24, 2016 2:34 PM
> > > > > > > To: Somnath Roy
> > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > Subject: RE: Bluestore assert
> > > > > > > 
> > > > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > > > Sage, It is there in the following github link I posted 
> > > > > > > > earlier..You can see 3 logs there..
> > > > > > > > 
> > > > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88a1b28fcc39
> > > > > > > 
> > > > > > > Ah sorry, got it.
> > > > > > > 
> > > > > > > And looking at the crash and code, the weird error you're getting makes perfect sense: it's coming from the ReuseWritableFile() function (which gets an error on rename and returns that).  It shouldn't ever fail, so there is either a bug in the bluefs code or in the rocksdb recycling code.
> > > > > > > 
> > > > > > > I think we need a full bluefs log leading up to the crash so we can find out what happened to the file that is missing...
> > > > > > > 
> > > > > > > sage
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > >  >
> > > > > > > > Thanks & Regards
> > > > > > > > Somnath
> > > > > > > > 
> > > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > Sent: Wednesday, August 24, 2016 1:43 PM
> > > > > > > > To: Somnath Roy
> > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > 
> > > > > > > > On Wed, 24 Aug 2016, Somnath Roy wrote:
> > > > > > > > > Sage,
> > > > > > > > > I got the db assert log from submit_transaction in the following location.
> > > > > > > > > 
> > > > > > > > > https://github.com/somnathr/ceph/commit/b69811eb2b87f25cf017b39c687d88a1b28fcc39
> > > > > > > > > 
> > > > > > > > > This is the log with level 1/20 and with my hook for printing the rocksdb::WriteBatch transaction. I have uploaded 3 OSD logs, and a common pattern before the crash is the following.
> > > > > > > > > 
> > > > > > > > >    -34> 2016-08-24 02:37:22.074321 7ff151fff700  4 rocksdb: reusing log 266 from recycle list
> > > > > > > > > 
> > > > > > > > >    -33> 2016-08-24 02:37:22.074332 7ff151fff700 10 bluefs rename db.wal/000266.log -> db.wal/000271.log
> > > > > > > > >    -32> 2016-08-24 02:37:22.074338 7ff151fff700 20 bluefs rename dir db.wal (0x7ff18bdfdec0) file 000266.log not found
> > > > > > > > >    -31> 2016-08-24 02:37:22.074341 7ff151fff700  4 rocksdb: [default] New memtable created with log file: #271. Immutable memtables: 0.
> > > > > > > > > 
> > > > > > > > > It seems it is trying to rename the WAL file, and the old file is not found. You can see the transaction printed in the log along with the error code, like this.
> > > > > > > > 
> > > > > > > > How much of the log do you have? Can you post what you have somewhere?
> > > > > > > > 
> > > > > > > > Thanks!
> > > > > > > > sage
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > >    -30> 2016-08-24 02:37:22.074486 7ff151fff700 -1 rocksdb: submit_transaction error: NotFound:  code = 1 rocksdb::WriteBatch =
> > > > > > > > > Put( Prefix = M key = 0x0000000000001483'.0000000087.00000000000000035303')
> > > > > > > > > Put( Prefix = M key = 0x0000000000001483'._info')
> > > > > > > > > Put( Prefix = O key = '--'0x800000000000000137863de5'.!=rbd_data.105b6b8b4567.000000000089ace7!'0xfffffffffffffffeffffffffffffffff)
> > > > > > > > > Delete( Prefix = B key = 0x000004e73ae72000)
> > > > > > > > > Put( Prefix = B key = 0x000004e73af72000)
> > > > > > > > > Merge( Prefix = T key = 'bluestore_statfs')
> > > > > > > > > 
> > > > > > > > > Hope my decoding of the key is correct; I reused BlueStore's pretty_binary_string() after removing the first 2 bytes of the key, which are the prefix and a '0'.
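> > > > > > > > > 
> > > > > > > > > (A rough sketch of that decode, in case it is useful -- decode_kv_key
> > > > > > > > > is a hypothetical helper, and it assumes BlueStore.cc's
> > > > > > > > > pretty_binary_string(const std::string&) is available:)
> > > > > > > > > 
> > > > > > > > >   #include <string>
> > > > > > > > > 
> > > > > > > > >   extern std::string pretty_binary_string(const std::string& in);
> > > > > > > > > 
> > > > > > > > >   // Strip the 1-byte column prefix ('M', 'O', 'B', 'T', ...) and the
> > > > > > > > >   // separator byte, then pretty-print the remaining key bytes.
> > > > > > > > >   static std::string decode_kv_key(const std::string& raw) {
> > > > > > > > >     if (raw.size() < 2)
> > > > > > > > >       return raw;
> > > > > > > > >     std::string prefix(1, raw[0]);
> > > > > > > > >     return "Prefix = " + prefix + " key = " +
> > > > > > > > >            pretty_binary_string(raw.substr(2));
> > > > > > > > >   }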
> > > > > > > > > 
> > > > > > > > > Any suggestion on the next step for root-causing this db assert, if the log rename is not enough of a hint?
> > > > > > > > > 
> > > > > > > > > BTW, this time I ran with bluefs_compact_log_sync = true and without commenting out the inode number asserts. I didn't hit those asserts, and no corruption so far. Seems like a bug in async compaction. I will try to reproduce that one with a verbose log later.
> > > > > > > > > 
> > > > > > > > > Thanks & Regards
> > > > > > > > > Somnath
> > > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Somnath Roy
> > > > > > > > > Sent: Tuesday, August 23, 2016 7:46 AM
> > > > > > > > > To: 'Sage Weil'
> > > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > > 
> > > > > > > > > I was running my tests for 2 hours and it happened within that time.
> > > > > > > > > I will try to reproduce with 1/20.
> > > > > > > > > 
> > > > > > > > > Thanks & Regards
> > > > > > > > > Somnath
> > > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > > Sent: Tuesday, August 23, 2016 6:46 AM
> > > > > > > > > To: Somnath Roy
> > > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > > 
> > > > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > > > Sage,
> > > > > > > > > > I think some bug was introduced recently in BlueFS, and I am
> > > > > > > > > > getting corruption like the following, which I was not seeing earlier.
> > > > > > > > > 
> > > > > > > > > My guess is the async bluefs compaction.  You can set 'bluefs compact log sync = true' to disable it.
> > > > > > > > > 
> > > > > > > > > Any idea how long you have to run to reproduce?  I'd love to see a bluefs log leading up to it.  If it eats too much disk space, you could do debug bluefs = 1/20 so that it only dumps recent history on crash.
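> > > > > > > > > 
> > > > > > > > > (For reference, those two settings in ceph.conf would look roughly
> > > > > > > > > like this:)
> > > > > > > > > 
> > > > > > > > >   [osd]
> > > > > > > > >   debug bluefs = 1/20               # level 1 to the log, 20 dumped on crash
> > > > > > > > >   bluefs compact log sync = true    # disable async bluefs log compaction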
> > > > > > > > > 
> > > > > > > > > Thanks!
> > > > > > > > > sage
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > >    -5> 2016-08-22 15:55:21.248558 7fb68f7ff700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1471906521248538, "job": 3, "event": "compaction_started", "files_L0": [2115, 2102, 2087, 2069, 2046], "files_L1": [], "files_L2": [], "files_L3": [], "files_L4": [], "files_L5": [], "files_L6": [1998, 2007, 2013, 2019, 2026, 2032, 2039, 2043, 2052, 2060], "score": 1.5, "input_data_size": 1648188401}
> > > > > > > > > >     -4> 2016-08-22 15:55:27.209944 7fb6ba94d8c0  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
> > > > > > > > > >     -3> 2016-08-22 15:55:27.213612 7fb6ba94d8c0  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
> > > > > > > > > >     -2> 2016-08-22 15:55:27.213627 7fb6ba94d8c0  0 _get_class not permitted to load kvs
> > > > > > > > > >     -1> 2016-08-22 15:55:27.214620 7fb6ba94d8c0 -1 osd.0 0 failed to load OSD map for epoch 321, got 0 bytes
> > > > > > > > > >      0> 2016-08-22 15:55:27.216111 7fb6ba94d8c0 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb6ba94d8c0 time 2016-08-22 15:55:27.214638
> > > > > > > > > > osd/OSD.h: 999: FAILED assert(ret)
> > > > > > > > > > 
> > > > > > > > > >  ceph version 11.0.0-1688-g6f48ee6 (6f48ee6bc5c85f44d7ca4c984f9bef1339c2bea4)
> > > > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5617f2a99e80]
> > > > > > > > > >  2: (OSDService::get_map(unsigned int)+0x5d) [0x5617f2395fdd]
> > > > > > > > > >  3: (OSD::init()+0x1f7e) [0x5617f233d3ce]
> > > > > > > > > >  4: (main()+0x2fe0) [0x5617f229d1f0]
> > > > > > > > > >  5: (__libc_start_main()+0xf0) [0x7fb6b7196830]
> > > > > > > > > >  6: (_start()+0x29) [0x5617f22eb909]
> > > > > > > > > > 
> > > > > > > > > > OSDs are not coming up (after a restart) and eventually I had to recreate the cluster.
> > > > > > > > > > 
> > > > > > > > > > Thanks & Regards
> > > > > > > > > > Somnath
> > > > > > > > > > 
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Somnath Roy
> > > > > > > > > > Sent: Monday, August 22, 2016 3:01 PM
> > > > > > > > > > To: 'Sage Weil'
> > > > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > > > 
> > > > > > > > > > "compaction_style=kCompactionStyleUniversal" in bluestore_rocksdb_options.
> > > > > > > > > > Here is the option I am using..
> > > > > > > > > > 
> > > > > > > > > >         bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10,compaction_style=kCompactionStyleUniversal"
> > > > > > > > > > 
> > > > > > > > > > Here is another one after 4 hours of 4K RW :-)... Sorry to bombard you with all of these; I am adding more logging for the next time I hit it.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > >      0> 2016-08-22 17:37:24.730817 7f8e7afff700 -1 os/bluestore/BlueFS.cc: In function 'int BlueFS::_read_random(BlueFS::FileReader*, uint64_t, size_t, char*)' thread 7f8e7afff700 time 2016-08-22 17:37:24.722706
> > > > > > > > > > os/bluestore/BlueFS.cc: 845: FAILED assert(r == 0)
> > > > > > > > > > 
> > > > > > > > > >  ceph version 11.0.0-1688-g3fcc89c (3fcc89c7ab4c92e6c4564e29f4e1a663db36acc0)
> > > > > > > > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5581ed453cb0]
> > > > > > > > > >  2: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0x836) [0x5581ed11c1b6]
> > > > > > > > > >  3: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x20) [0x5581ed13f840]
> > > > > > > > > >  4: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x83f) [0x5581ed2c6f4f]
> > > > > > > > > >  5: (rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, bool, rocksdb::Slice const&, rocksdb::PersistentCacheOptions const&, rocksdb::Logger*)+0x358) [0x5581ed291c18]
> > > > > > > > > >  6: (()+0x94fd54) [0x5581ed282d54]
> > > > > > > > > >  7: (rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::BlockIter*)+0x60c) [0x5581ed284b3c]
> > > > > > > > > >  8: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, bool)+0x508) [0x5581ed28ba68]
> > > > > > > > > >  9: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::HistogramImpl*, bool, int)+0x158) [0x5581ed252118]
> > > > > > > > > >  10: (rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, bool*, bool*, unsigned long*)+0x4f8) [0x5581ed25c458]
> > > > > > > > > >  11: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool*)+0x5fa) [0x5581ed1f3f7a]
> > > > > > > > > >  12: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)+0x22) [0x5581ed1f4182]
> > > > > > > > > >  13: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x157) [0x5581ed1d21d7]
> > > > > > > > > >  14: (BlueStore::Collection::get_onode(ghobject_t const&, bool)+0x55b) [0x5581ed02802b]
> > > > > > > > > >  15: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x1e49) [0x5581ed0318a9]
> > > > > > > > > >  16: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x362) [0x5581ed032bc2]
> > > > > > > > > >  17: (ReplicatedPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<OpRequest>)+0x81) [0x5581ecea5e51]
> > > > > > > > > >  18: (ReplicatedBackend::sub_op_modify(std::shared_ptr<OpRequest>)+0xd39) [0x5581ecef89e9]
> > > > > > > > > >  19: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x2fb) [0x5581ecefeb4b]
> > > > > > > > > >  20: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xbd) [0x5581ece4c63d]
> > > > > > > > > >  21: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x409) [0x5581eccdd2e9]
> > > > > > > > > >  22: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x52) [0x5581eccdd542]
> > > > > > > > > >  23: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x73f) [0x5581eccfd30f]
> > > > > > > > > >  24: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x89f) [0x5581ed440b2f]
> > > > > > > > > >  25: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5581ed4441f0]
> > > > > > > > > >  26: (Thread::entry_wrapper()+0x75) [0x5581ed4335c5]
> > > > > > > > > >  27: (()+0x76fa) [0x7f8ed9e4e6fa]
> > > > > > > > > >  28: (clone()+0x6d) [0x7f8ed7caeb5d]
> > > > > > > > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Thanks & Regards
> > > > > > > > > > Somnath
> > > > > > > > > > 
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > > > Sent: Monday, August 22, 2016 2:57 PM
> > > > > > > > > > To: Somnath Roy
> > > > > > > > > > Cc: Mark Nelson; ceph-devel
> > > > > > > > > > Subject: RE: Bluestore assert
> > > > > > > > > > 
> > > > > > > > > > On Mon, 22 Aug 2016, Somnath Roy wrote:
> > > > > > > > > > > FYI, I was running rocksdb by enabling universal style 
> > > > > > > > > > > compaction during this time.
> > > > > > > > > > 
> > > > > > > > > > How are you selecting universal compaction?
> > > > > > > > > > 
> > > > > > > > > > sage
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> 
> 
diff mbox

Patch

diff --git a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc
index 638d231..bcf0935 100644
--- a/src/kv/RocksDBStore.cc
+++ b/src/kv/RocksDBStore.cc
@@ -370,6 +370,10 @@  int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
   utime_t lat = ceph_clock_now(g_ceph_context) - start;
   logger->inc(l_rocksdb_txns);
   logger->tinc(l_rocksdb_submit_latency, lat);
+  if (!s.ok()) {
+    derr << __func__ << " error: " << s.ToString()
+        << "code = " << s.code() << dendl;
+  }
   return s.ok() ? 0 : -1;
 }

@@ -385,6 +389,11 @@  int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
   utime_t lat = ceph_clock_now(g_ceph_context) - start;
   logger->inc(l_rocksdb_txns_sync);
   logger->tinc(l_rocksdb_submit_sync_latency, lat);
+  if (!s.ok()) {
+    derr << __func__ << " error: " << s.ToString()
+        << "code = " << s.code() << dendl;
+  }
+
   return s.ok() ? 0 : -1;
 }
 int RocksDBStore::get_info_log_level(string info_log_level)
@@ -442,7 +451,8 @@  void RocksDBStore::RocksDBTransactionImpl::rmkey(const string &prefix,
 void RocksDBStore::RocksDBTransactionImpl::rm_single_key(const string &prefix,
                                                         const string &k)
 {
-  bat->SingleDelete(combine_strings(prefix, k));
+  //bat->SingleDelete(combine_strings(prefix, k));
+  bat->Delete(combine_strings(prefix, k));
 }

But, the db crash is still happening with the following log message.

rocksdb: submit_transaction_sync error: NotFound: code = 1

It seems it is not related to rm_single_key, as I am also hitting this from https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L5108, where rm_single_key is not called.
Maybe I should dump the transaction and see what's in there?
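
In case it helps, a minimal sketch of such a dump hook using rocksdb's WriteBatch::Handler (the handler below is illustrative; the wiring into RocksDBStore::submit_transaction is assumed):

  #include <iostream>
  #include "rocksdb/write_batch.h"

  // Print every operation contained in a batch before it is submitted.
  struct BatchDumper : public rocksdb::WriteBatch::Handler {
    void Put(const rocksdb::Slice& key, const rocksdb::Slice& value) override {
      std::cout << "Put key=" << key.ToString(true)     // hex-encoded key
                << " value_len=" << value.size() << "\n";
    }
    void Delete(const rocksdb::Slice& key) override {
      std::cout << "Delete key=" << key.ToString(true) << "\n";
    }
    void Merge(const rocksdb::Slice& key, const rocksdb::Slice& value) override {
      std::cout << "Merge key=" << key.ToString(true)
                << " value_len=" << value.size() << "\n";
    }
  };

  // Usage, e.g. just before the DB write:
  //   BatchDumper dumper;
  //   bat->Iterate(&dumper);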

I am hitting the BlueFS replay bug I mentioned earlier; I applied your patch (https://github.com/ceph/ceph/pull/10686) but it is not helping.
Is that because I needed to run with this patch from the beginning, and not just during replay?

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Thursday, August 11, 2016 3:32 PM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: RE: Bluestore assert

On Thu, 11 Aug 2016, Somnath Roy wrote:
> Sage,
> Regarding the db assert, I hit it again on multiple OSDs while I was populating 40TB rbd images (~35TB written before the crash).
> I did the following changes in the code..
> 
> @@ -370,7 +370,7 @@ int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>    logger->inc(l_rocksdb_txns);
>    logger->tinc(l_rocksdb_submit_latency, lat);
> -  return s.ok() ? 0 : -1;
> +  return s.ok() ? 0 : -s.code();
>  }
> 
>  int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t) 
> @@ -385,7 +385,7 @@ int RocksDBStore::submit_transaction_sync(KeyValueDB::Transaction t)
>    utime_t lat = ceph_clock_now(g_ceph_context) - start;
>    logger->inc(l_rocksdb_txns_sync);
>    logger->tinc(l_rocksdb_submit_sync_latency, lat);
> -  return s.ok() ? 0 : -1;
> +  return s.ok() ? 0 : -s.code();
>  }
>  int RocksDBStore::get_info_log_level(string info_log_level)  {
> diff --git a/src/os/bluestore/BlueStore.cc b/src/os/bluestore/BlueStore.cc
> index fe7f743..3f4ecd5 100644
> --- a/src/os/bluestore/BlueStore.cc
> +++ b/src/os/bluestore/BlueStore.cc
> @@ -4989,6 +4989,9 @@ void BlueStore::_kv_sync_thread()
>              ++it) {
>           _txc_finalize_kv((*it), (*it)->t);
>           int r = db->submit_transaction((*it)->t);
> +          if (r < 0 ) {
> +            dout(0) << "submit_transaction returned = " << r << dendl;
> +          }
>           assert(r == 0);
>         }
>        }
> @@ -5026,6 +5029,10 @@ void BlueStore::_kv_sync_thread()
>         t->rm_single_key(PREFIX_WAL, key);
>        }
>        int r = db->submit_transaction_sync(t);
> +      if (r < 0 ) {
> +        dout(0) << "submit_transaction_sync returned = " << r << dendl;
> +      }
> +
>        assert(r == 0);
> 
> 
> This is printing -1 in the log before the assert. So, the corresponding code from the rocksdb side is "kNotFound".
> It is not related to space, as I hit this same issue whether the db partition size is 100G or 300G.
> It seems like some kind of corruption within Bluestore?
> Let me know the next step.
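> 
> (For reference, and hedged against the exact rocksdb version in use: the
> low-numbered values of rocksdb::Status::Code in include/rocksdb/status.h
> are approximately the following, which is why -s.code() == -1 maps to
> NotFound.)
> 
>   enum Code {
>     kOk = 0,
>     kNotFound = 1,
>     kCorruption = 2,
>     kNotSupported = 3,
>     kInvalidArgument = 4,
>     kIOError = 5,
>     // ...
>   };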

Can you add this too?

diff --git a/src/kv/RocksDBStore.cc b/src/kv/RocksDBStore.cc
index 638d231..b5467f7 100644
--- a/src/kv/RocksDBStore.cc
+++ b/src/kv/RocksDBStore.cc
@@ -370,6 +370,9 @@  int RocksDBStore::submit_transaction(KeyValueDB::Transaction t)
   utime_t lat = ceph_clock_now(g_ceph_context) - start;
   logger->inc(l_rocksdb_txns);
   logger->tinc(l_rocksdb_submit_latency, lat);
+  if (!s.ok()) {
+    derr << __func__ << " error: " << s.ToString() << dendl;
+  }
   return s.ok() ? 0 : -1;
 }