Message ID | 20130506091822.GA26022@upset.ux.pdb.fsc.net (mailing list archive)
---|---
State | New, archived
2013/5/6 Andreas Friedrich <andreas.friedrich@ts.fujitsu.com>:
> To enable the LD_PRELOAD mechanism for the Ceph daemons only, a little
> generic extension in the global section of /etc/ceph/ceph.conf would
> be helpful, e.g.:
>
> [global]
> environment = LD_PRELOAD=/usr/lib64/libsdp.so.1
>
> The appended patch adds 5 lines to the Bobtail (0.56.6) init script.
> The init script will then read the environment setting and - if present -
> call the Ceph daemons with the preceding environment string.

Cool! We are planning the same infrastructure with IB for networking.
Could you share more details about this?
- What performance are you getting, and on which hardware?
- Any issues with IB and SDP?
- Are you able to use the newer rsockets instead of SDP? It also has a
  preload library and is still developed (SDP is deprecated).
On 05/06/2013 11:18 AM, Andreas Friedrich wrote:
> Hello,
>
> we are using Infiniband instead of Ethernet for cluster interconnection.
> Instead of IPoIB (IP-over-InfiniBand Protocol) we want to use SDP
> (Sockets Direct Protocol) as a mid-layer protocol.
>
> To connect the Ceph daemons to SDP without changing the Ceph code, the
> LD_PRELOAD mechanism can be used.
>
> To enable the LD_PRELOAD mechanism for the Ceph daemons only, a little
> generic extension in the global section of /etc/ceph/ceph.conf would
> be helpful, e.g.:
>
> [global]
> environment = LD_PRELOAD=/usr/lib64/libsdp.so.1
>
> The appended patch adds 5 lines to the Bobtail (0.56.6) init script.
> The init script will then read the environment setting and - if present -
> call the Ceph daemons with the preceding environment string.
>

You can also submit this via a pull request on GitHub. That way the
authorship is preserved and you get all the credit for this patch :)

> With best regards
> Andreas Friedrich
> ----------------------------------------------------------------------
> FUJITSU
> Fujitsu Technology Solutions GmbH
> Heinz-Nixdorf-Ring 1, 33106 Paderborn, Germany
> Tel: +49 (5251) 525-1512
> Fax: +49 (5251) 525-321512
> Email: andreas.friedrich@ts.fujitsu.com
> Web: ts.fujitsu.com
> Company details: de.ts.fujitsu.com/imprint
> ----------------------------------------------------------------------
>
On 05/06/2013 07:36 AM, Gandalf Corvotempesta wrote:
> 2013/5/6 Andreas Friedrich <andreas.friedrich@ts.fujitsu.com>:
>> To enable the LD_PRELOAD mechanism for the Ceph daemons only, a little
>> generic extension in the global section of /etc/ceph/ceph.conf would
>> be helpful, e.g.:
>>
>> [global]
>> environment = LD_PRELOAD=/usr/lib64/libsdp.so.1
>>
>> The appended patch adds 5 lines to the Bobtail (0.56.6) init script.
>> The init script will then read the environment setting and - if present -
>> call the Ceph daemons with the preceding environment string.
>
> Cool! We are planning the same infrastructure with IB for networking.
> Could you share more details about this?
> - What performance are you getting, and on which hardware?
> - Any issues with IB and SDP?
> - Are you able to use the newer rsockets instead of SDP? It also has a
>   preload library and is still developed (SDP is deprecated).

Yeah, that's more or less the conclusion I came to as well. With SDP
being deprecated, rsockets is looking like an attractive potential
alternative. I'll let Sage or someone comment on the patch though.

It would be very interesting to hear how SDP does. With IPoIB I've
gotten about 2GB/s on QDR with Ceph, which is roughly also what I can
get in an ideal round-robin setup with 2 bonded 10GbE links.

Mark
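As a side note on the rsockets option mentioned above: in principle the same
proposed `environment` setting could point at the rsockets preload library
instead of libsdp. A rough sketch, assuming the librdmacm rsocket preload
library is installed; its path (shown here as /usr/lib64/rsocket/librspreload.so)
varies by distribution and is an assumption, not something from this thread:

    [global]
    # preload rsockets instead of SDP (path is an assumption)
    environment = LD_PRELOAD=/usr/lib64/rsocket/librspreload.so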
2013/5/6 Mark Nelson <mark.nelson@inktank.com>:
> It would be very interesting to hear how SDP does. With IPoIB I've gotten
> about 2GB/s on QDR with Ceph, which is roughly also what I can get in an
> ideal round-robin setup with 2 bonded 10GbE links.

Yes, but IB costs about a quarter of what 10GbE does and will be much
more expandable in the future.
On 05/06/2013 08:35 AM, Gandalf Corvotempesta wrote:
> 2013/5/6 Mark Nelson <mark.nelson@inktank.com>:
>> It would be very interesting to hear how SDP does. With IPoIB I've gotten
>> about 2GB/s on QDR with Ceph, which is roughly also what I can get in an
>> ideal round-robin setup with 2 bonded 10GbE links.
>
> Yes, but IB costs about a quarter of what 10GbE does and will be much
> more expandable in the future.

QDR shouldn't be that much cheaper. Maybe SDR or DDR. But I agree with
your general sentiment. I think rsockets may be a really good
benefit/cost solution in the short term to get IB support into Ceph. It
sounds like there is some work planned on a kernel implementation, which
would be fantastic on the filesystem side as well.

Mark
Hi,
> Yes, but IB costs about a quarter of what 10GbE does and will be much
> more expandable in the future.
I'm testing Ceph currently with bonded 1 GbE links and contemplating
moving to 10 GbE or IB. I have to pay for the costs out of my own
pocket, so price is a major factor.
I have noted that just recently it has become possible to buy 10 GbE
switches for a lot less than before. I don't know what IB equipment
costs, mainly because I don't know much about IB and hence don't know
which equipment to buy.
My costs for a copper-based 10 GbE setup per server would be approximately:
Switch: 115$
NIC: 345$
Cable: 2$
Total: 462$ (per server)
If anyone could comment on how that compares to IB pricing, I would
appreciate it.
Is it possible to quantify how much Ceph would benefit from the improved
latency between 1 GbE and 10 GbE? (i.e. assuming that 1 GbE gives me
enough bandwidth, would I see any gains from lower latency?)
And similarly, would there be a significant gain from lower latency when
comparing 10 GbE to IB? (I assume that IB has lower latency than 10 GbE)
2013/5/6 Jens Kristian Søgaard <jens@mermaidconsulting.dk>:
> My costs for a copper-based 10 GbE setup per server would be approximately:
>
> Switch: 115$
> NIC: 345$
> Cable: 2$

115$ for a 10GbE switch? Which kind of switch?
Hi again,
> 115$ for a 10GbE switch? Which kind of switch?
No no - not 115$ for a switch. The cost was per port!
So an 8-port switch would be approx. 920$.
I'm looking at just a bare-bones switch that does VLANs, jumbo frames
and port trunking. The network would be used exclusively for Ceph.
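For concreteness, the host-side counterpart of the features mentioned above
(port trunking, jumbo frames, a VLAN dedicated to Ceph) might look roughly like
the following iproute2 sketch; the interface names, bonding mode, VLAN ID and
address are placeholders, not details taken from this thread:

    # bond two 10GbE ports (names and 802.3ad mode are assumptions)
    ip link add bond0 type bond mode 802.3ad
    ip link set eth0 down; ip link set eth0 master bond0
    ip link set eth1 down; ip link set eth1 master bond0
    # jumbo frames on the bond
    ip link set bond0 mtu 9000 up
    # a tagged VLAN carrying only Ceph traffic (ID 100 is a placeholder)
    ip link add link bond0 name bond0.100 type vlan id 100
    ip addr add 192.168.100.1/24 dev bond0.100
    ip link set bond0.100 up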
2013/5/6 Jens Kristian Søgaard <jens@mermaidconsulting.dk>:
> So an 8-port switch would be approx. 920$.
>
> I'm looking at just a bare-bones switch that does VLANs, jumbo frames and
> port trunking. The network would be used exclusively for Ceph.

You should also consider 10GbE for the public network, and there you
might need more features.
Hi,

>> I'm looking at just a bare-bones switch that does VLANs, jumbo frames and
>> port trunking. The network would be used exclusively for Ceph.
> You should also consider 10GbE for the public network, and there you
> might need more features.

I was actually going to put both the public and private networks on the
same 10 GbE. Why do you think I need more features?

My public and private networks are completely dedicated to Ceph - so
nothing else takes place on them.

I have a third network, which is just plain 1 GbE, that handles all
other communication between the servers (and the Internet).
2013/5/6 Jens Kristian Søgaard <jens@mermaidconsulting.dk>:
> I was actually going to put both the public and private networks on the
> same 10 GbE. Why do you think I need more features?

I'm just supposing; I'm also evaluating the same network topology as
you. There is a low-cost 12x 10GBase-T switch from Netgear.

I'm also evaluating creating a full IB network dedicated to Ceph (IB on
the cluster network and IB on the public network). Do you think it will
be possible to use all Ceph services (RBD, RGW, CephFS, QEMU, ...) via
SDP/rsockets?

On eBay there are many IB switches sold at 1000-1500$ with 24 or 36 DDR
ports. If used with SDP/rsockets you should achieve more or less 18Gbps.
Hi,

> I'm just supposing; I'm also evaluating the same network topology as
> you.

I have tested my current setup for a while with just triple-bonded GbE,
and haven't found the need for more features. I might be wrong of course :-)

> There is a low-cost 12x 10GBase-T switch from Netgear.

I was looking at the Netgear XS708E, which is very low cost compared to
what the prices were 6 months ago when I last looked at it.

What I don't know is how those prices compare to IB pricing.

> (IB on the cluster network and IB on the public network). Do you think it
> will be possible to use all Ceph services (RBD, RGW, CephFS, QEMU, ...)
> via SDP/rsockets?

I haven't tried SDP/rsockets, so I don't know. From what I have
experienced in the past, these things don't seem to be completely "drop
in and forget about it" - there's always something that is not
completely compatible with ordinary TCP/IP sockets.

> On eBay there are many IB switches sold at 1000-1500$ with 24 or 36 DDR
> ports. If used with SDP/rsockets you should achieve more or less 18Gbps.

That sounds cheap, yes, but I assume that is used equipment?
I was looking at costs for new equipment.
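One low-risk way to probe the compatibility question raised here is to preload
the library only for a client-side tool and see whether it can still reach the
cluster; a minimal sketch, assuming the libsdp path from earlier in the thread
and the default 'rbd' pool (both assumptions):

    # does the status command still work over the preloaded sockets?
    LD_PRELOAD=/usr/lib64/libsdp.so.1 ceph -s
    # quick 30-second write benchmark against a pool ('rbd' is assumed to exist)
    LD_PRELOAD=/usr/lib64/libsdp.so.1 rados -p rbd bench 30 write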
On 05/06/2013 09:18 AM, Jens Kristian Søgaard wrote:
> Hi,
>
>> I'm just supposing; I'm also evaluating the same network topology as
>> you.
>
> I have tested my current setup for a while with just triple-bonded GbE,
> and haven't found the need for more features. I might be wrong of course :-)
>
>> There is a low-cost 12x 10GBase-T switch from Netgear.
>
> I was looking at the Netgear XS708E, which is very low cost compared to
> what the prices were 6 months ago when I last looked at it.

I'm really not up to speed on the capabilities of cheap 10GbE switches.
I'm sure other people will have comments about what features are
worthwhile, etc. Just from a raw performance perspective, you might want
to make sure that the switch can handle lots of randomly distributed
traffic between all of the ports well. I'd expect that it shouldn't be
too terrible, but who knows.

I expect most IB switches you run into should deal with this kind of
pattern competently enough for Ceph workloads to do OK. On the front-end
portion of the network you'll always have client<->server communication,
so the pattern there will be less all-to-all than the backend traffic
(more 1st half <-> 2nd half).

On really big deployments, static routing becomes a pretty big issue.
Dynamic routing, or at least well-optimized routes, can make a huge
difference. ORNL has done some work in this area for their Lustre IB
networks and I think LLNL is investigating it as well.

Mark

> What I don't know is how those prices compare to IB pricing.
>
>> (IB on the cluster network and IB on the public network). Do you think it
>> will be possible to use all Ceph services (RBD, RGW, CephFS, QEMU, ...)
>> via SDP/rsockets?
>
> I haven't tried SDP/rsockets, so I don't know. From what I have
> experienced in the past, these things don't seem to be completely "drop
> in and forget about it" - there's always something that is not
> completely compatible with ordinary TCP/IP sockets.
>
>> On eBay there are many IB switches sold at 1000-1500$ with 24 or 36 DDR
>> ports. If used with SDP/rsockets you should achieve more or less 18Gbps.
>
> That sounds cheap, yes, but I assume that is used equipment?
> I was looking at costs for new equipment.
Hi,

> Just from a raw performance perspective, you might
> want to make sure that the switch can handle lots of randomly
> distributed traffic between all of the ports well. I'd expect that it
> shouldn't be too terrible, but who knows.

It has been many years since I've seen a switch have problems with that.
This switch has a 160 Gb/s backplane, so I don't foresee any problems.

Especially considering that I'm mixing the public and private networks
on the same switch.

> On really big deployments, static routing becomes a pretty big issue.
> Dynamic routing, or at least well-optimized routes, can make a huge

I use dedicated routers for routing; the switch is just a "dumb" L2 switch.

I'm looking at a very small deployment (probably 8 servers).
On 05/06/2013 10:16 AM, Jens Kristian Søgaard wrote:
> Hi,
>
>> Just from a raw performance perspective, you might
>> want to make sure that the switch can handle lots of randomly
>> distributed traffic between all of the ports well. I'd expect that it
>> shouldn't be too terrible, but who knows.
>
> It has been many years since I've seen a switch have problems with that.
> This switch has a 160 Gb/s backplane, so I don't foresee any problems.
>
> Especially considering that I'm mixing the public and private networks
> on the same switch.
>
>> On really big deployments, static routing becomes a pretty big issue.
>> Dynamic routing, or at least well-optimized routes, can make a huge
>
> I use dedicated routers for routing; the switch is just a "dumb" L2 switch.
>
> I'm looking at a very small deployment (probably 8 servers).

Ha, ok. I suspect you'll be fine. Nothing like over-thinking the
problem, right? :)
On Mon, May 06, 2013 at 02:36:39PM +0200, Gandalf Corvotempesta wrote:
> 2013/5/6 Andreas Friedrich <andreas.friedrich@ts.fujitsu.com>:
> > To enable the LD_PRELOAD mechanism for the Ceph daemons only, a little
> > generic extension in the global section of /etc/ceph/ceph.conf would
> > be helpful, e.g.:
> >
> > [global]
> > environment = LD_PRELOAD=/usr/lib64/libsdp.so.1
> >
> > The appended patch adds 5 lines to the Bobtail (0.56.6) init script.
> > The init script will then read the environment setting and - if present -
> > call the Ceph daemons with the preceding environment string.
>
> Cool! We are planning the same infrastructure with IB for networking.
> Could you share more details about this?
> - What performance are you getting, and on which hardware?

6x storage servers, each with 3x Intel 910 PCIe SSDs (OSDs on xfs) and
the journal in a 1GB RAM disk per OSD, plus 1x client server; each
server with 10GbE, Intel QDR as well as MLX FDR.

The performance values are still scattering ...
(1 fio client job, 128 QD, 1TB of data, 60 seconds)

e.g.
- read_4m_128, read_8m_128, randread_4m_128, randread_8m_128:
  approx. 2.2 GB/s on 56Gb IPoIB_CM and 56Gb SDP
- write_4m_128, write_8m_128, randwrite_4m_128, randwrite_8m_128:
  approx. 500 MB/s on 56Gb IPoIB_CM and 56Gb SDP,
  but 900 MB/s on 40Gb IPoIB_CM

> - Any issues with IB and SDP?

Well, you have to compile the complete OFED stack (ib_core, ... ib_sdp)
from one vendor. MLX is running; the Intel status is 'wip' ... the
cluster is up, but jerking ... (OSDs are getting marked out/in ...
timeouts). Looks like a shortage of resources?!

> - Are you able to use the newer rsockets instead of SDP? It also has a
>   preload library and is still developed (SDP is deprecated).

rsockets.so is up, but stuttering ...
... and for the moment rsockets is userland only, in contrast to sdp.ko.

Cheers,
-Dieter
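To make the benchmark parameters above concrete, a job such as randwrite_4m_128
(1 fio client job, 4 MB blocks, queue depth 128, 60 seconds) might be expressed
roughly as follows; the target path, the libaio engine and the time_based/size
settings are assumptions, not details taken from the original mail:

    # sketch of one of the named fio jobs; /dev/rbd0 and the engine are assumptions
    fio --name=randwrite_4m_128 --filename=/dev/rbd0 --rw=randwrite \
        --bs=4m --iodepth=128 --numjobs=1 --ioengine=libaio --direct=1 \
        --runtime=60 --time_based --size=1T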
2013/5/6 Mark Nelson <mark.nelson@inktank.com>:
> On the front-end portion of the network you'll always have client<->server
> communication, so the pattern there will be less all-to-all than the
> backend traffic (more 1st half <-> 2nd half).

What do you suggest for the frontend portion? Would 10GbE or 2Gb (2x 1GbE
bonded together) be enough in the case of RBD with QEMU?
--- a/src/init-ceph.in	2013-05-03 21:31:07.000000000 +0200
+++ b/src/init-ceph.in	2013-05-06 10:56:56.000000000 +0200
@@ -212,6 +212,12 @@
     # conf file
     cmd="$cmd -c $conf"
 
+    environment=""
+    get_conf environment '' 'environment'
+    if [ ! -z "$environment" ]; then
+        cmd="env $environment $cmd"
+    fi
+
     if echo $name | grep -q ^osd; then
         get_conf osd_data "/var/lib/ceph/osd/ceph-$id" "osd data"
         get_conf fs_path "$osd_data" "fs path"  # mount point defaults so osd data
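With this change applied, the option proposed earlier in the thread would be
read from the global section, e.g.:

    [global]
    environment = LD_PRELOAD=/usr/lib64/libsdp.so.1

and the init script would then prefix each daemon command with that
environment, effectively running something along the lines of the following
(the binary path, daemon id and argument list are illustrative, not the exact
command the script builds):

    env LD_PRELOAD=/usr/lib64/libsdp.so.1 /usr/bin/ceph-osd -i 0 -c /etc/ceph/ceph.conf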