I looked at it, I promise I did. Really wanted to avoid writing this if I could....

rabbitmq · on Oct 22, 2012

This does not make any sense. All software runs on machines. Each machine is potentially a 'single point of failure'. This sometimes is a problem, and sometimes is not.

Just because software runs on multiple machines does not mean that there is 'no SPoF'. For example, in a distributed messaging system you need to replicate state, unless you don't mind losing it when you have a crash.

If you don't mind losing state, then it is very easy to run RabbitMQ on more than one server.

If you do mind losing state, then RabbitMQ provides several ways to use multiple servers in a decentralised and redundant manner. See e.g. http://www.rabbitmq.com/distributed.html

It is very easy to write a system with multiple components. The difficulty is preserving all the additional capability (e.g. uptime) while maintaining a coherent system. This role can be played by a "broker", which does NOT imply a single machine or "SPoF"; it can be a distributed service. See e.g. http://www.rabbitmq.com/blog/2010/09/22/broker-vs-brokerless...

paddyforan · on Oct 22, 2012

I mean this in the most neutral and unassuming way possible, but...

Are you familiar with the basics of a distributed hash table? To say that "each machine is potentially a single point of failure" is... inaccurate.

According to the algorithm's authors, you can lose the vast majority of your cluster simultaneously without affecting the stability of the cluster, in certain cases. As a worst-case scenario, you'll need 16 servers to drop offline simultaneously before there's a problem.
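As a rough illustration of why nodes dropping out of a DHT need not strand any keys, here is a minimal sketch of a hash ring. It is not Pastry: Pastry routes to the node whose ID is numerically closest to the key, while this sketch assumes the simpler "first node at or after the key" rule, and it models only key ownership, not replication or routing. The names `ring_id` and `Ring` are made up for the example.

```python
import hashlib
from bisect import bisect_left


def ring_id(name):
    """Map a node name or key onto the ring (hypothetical ID scheme: SHA-1 as an int)."""
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: a key belongs to the first live node at or after its ID."""

    def __init__(self, node_names):
        self.nodes = {ring_id(n): n for n in node_names}

    def owner(self, key):
        # Sort live node IDs, find the first one >= the key's ID,
        # and wrap around to the smallest ID if the key is past the end.
        ids = sorted(self.nodes)
        i = bisect_left(ids, ring_id(key)) % len(ids)
        return self.nodes[ids[i]]

    def remove(self, node_name):
        # A node going offline: its keys fall through to the next live node.
        del self.nodes[ring_id(node_name)]


ring = Ring([f"node-{i}" for i in range(20)])
owner_before = ring.owner("some-key")
for i in range(15):
    ring.remove(f"node-{i}")
owner_after = ring.owner("some-key")  # still resolves to some live node
```

In this sketch no node plays a special role: any survivor can answer for a key, and a key only changes owner when its previous owner is one of the nodes that left. Whether the data itself survives depends on replication, which the sketch leaves out.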

Am I saying my solution is better than RabbitMQ? No. Am I saying people should use my solution instead of RabbitMQ? No. Am I saying my solution is closer to what I want than RabbitMQ is? Yes. Yes I am.

rabbitmq · on Oct 23, 2012

Paddy, thanks for asking me if I am "familiar with the basics" of a DHT. Just to be clear, I don't think DHTs are the same as messaging systems. But, they touch on a set of related issues.

My assertion was that "Just because software runs on multiple machines it does not mean that there is 'no SPoF'". You seem to have read this as saying "in all cases where there are multiple machines, there is always a SPoF". In any case, I was trying to make a more general point, which is that (A) lots of cases which seem to have no SPoF, in fact do have SPoFs, (B) some cases get accused of having a SPoF, but doing so is a mistake, and Rabbit is in this category because you can set up a number of multi-machine scenarios with the required redundancy, and (C) in many cases having a SPoF is not a problem anyway, and may even be a good thing.

You said "From what I can see in the docs, RabbitMQ is a client/server relationship. Meaning there is a server. Meaning a single point of failure and a bottleneck. I hate those..". With all due respect this represents a misunderstanding of Rabbit. Moreover you assert that 'there is a server' means that there is a 'single point of failure'. Because the conclusion does not follow from the premise, I assumed you must have meant something else.

Perhaps you could explain why you think your system has fewer 'bottlenecks' than RabbitMQ or other messaging systems. I don't think that it does but would love to be enlightened.

From your more recent comment, you seem to be saying that using a DHT is a good thing. Yes, I agree, sometimes this is the case. Are you familiar with how RabbitMQ does any of the following: clustering, HA, federation? You say: "Am I saying my solution is closer to what I want than RabbitMQ is? Yes. Yes I am." I would love to understand this.

paddyforan · on Oct 23, 2012

> thanks for asking me if I am "familiar with the basics" of a DHT.

I get the impression that you feel I asked this in a demeaning manner. I had no such intent. But some of your statements seem to contradict the fundamentals of a DHT, so I was unsure what level of explanation I needed to do. I meant only to gauge your level of prior knowledge, in order to engage you on that level. I apologise if I come off as brusque; everyone seems to be taking the fact that I did not use their technology of choice as a personal insult, so I'm growing weary of explaining why each individual technology is not what I wanted.

> Just to be clear, I don't think DHTs are the same as messaging systems. But, they touch on a set of related issues.

Agreed. To be clear: I released a DHT. I did not release a messaging system of any sort. I do intend to build a messaging system on top of it, but its uses are not limited to that.

> My assertion was that "Just because software runs on multiple machines it does not mean that there is 'no SPoF'". You seem to have read this as saying "in all cases where there are multiple machines, there is always a SPoF".

Not really. My assertion is just that any software that is built on a DHT and the principles behind it has no single point of failure. And if you were not referring to software built on DHTs, I'm not entirely sure how the comment is relevant to the discussion?

> In any case, I was trying to make a more general point, which is that (A) lots of cases which seem to have no SPoF, in fact do have SPoFs

I would argue that many of the alternatives people have proposed to me fall under this description.

> (B) some cases get accused of having a SPoF, but doing so is a mistake, and Rabbit is in this category because you can set up a number of multi-machine scenarios with the required redundancy

Redundancy does not make the SPoF disappear, it just manages the risk of that SPoF. Perhaps I am using the term beyond its meaning here (sorry!), but I think of this more in the architecture of the cluster. If traffic is routed through a small, centralised subset of machines whose specific purpose is to handle or route that traffic, I consider that to be a SPoF, no matter how unlikely that failure may be. I consider it so because rather than architecting your cluster to avoid issues, you are offsetting the issues to an ops/deployment problem. Yes, you can achieve HA, but it is not an inherent part of your cluster; it's bolted on afterwards by some clever duct-taping as you ping servers and swap them out if they seem to be down.

> (C) in many cases having a SPoF is not a problem anyway, and may even be a good thing.

A DHT is not appropriate for all cases, though I would challenge anyone to quote me on ever saying it is. I won't even say it's the best tool for the job I'm using it for; it simply is the tool that fit best with my desired approach to the problem.

> With all due respect this represents a misunderstanding of Rabbit.

This is entirely possible. I have a decidedly rudimentary understanding of Rabbit, something I tried to convey by qualifying all of my statements about it. "From what I can see in the docs", etc.

> Moreover you assert that 'there is a server' means that there is a 'single point of failure'. Because the conclusion does not follow from the premise, I assumed you must have meant something else.

In my understanding, the conclusion does follow from the premise: a server designates a machine that is specifically intended to handle requests. Clearly, I'm misusing the term SPoF. My apologies for the confusion caused by that. In the six months I worked on this, I never once had to explain why I thought a DHT fit my needs; I was speaking to people who worked with distributed systems, so no explanation was needed. This left me a little ill-prepared to explain myself.

> Perhaps you could explain why you think your system has fewer 'bottlenecks' than RabbitMQ or other messaging systems. I don't think that it does but would love to be enlightened.

It's entirely possible my system does not have fewer bottlenecks than RabbitMQ. Again, I'm no RabbitMQ expert. Here's why I think my system has few bottlenecks:

* No change in code or deploy practices is needed between one server and one billion servers.

* Unless a catastrophic event hits the cluster (an event that would leave your application non-functioning, even if Pastry continued functioning), the cluster will remain healthy as servers come and go. This is not a remedy put in place by ops, it is not a bolted on feature, it is a core premise of the algorithm. I prefer to solve my availability concerns at the software level, rather than on the deploy level. This might be a virtue of the fact that my software is open source, so I try to make it simple for others to deploy. It might be a virtue of the fact that I am more familiar with writing software than I am with deploying software.

* The messaging component is not an element in the architecture; rather, it is an embedded piece of every single element in the architecture that wishes to take advantage of it. There are no messaging servers, there are no brokers, no queues. There is simply your architecture, except now it can communicate efficiently.

Based on these three points, Pastry fit in with the approach I wanted to take in my architecture. It seemed like every other messaging protocol I could think of preferred to have a messaging server, instead. Even if you have a pool of these servers, allowing for HA through redundancy, that is not really what I was looking for.

> you seem to be saying that using a DHT is a good thing. Yes, I agree, sometimes this is the case.

We are in agreement, then. Examine your problem, then choose a tool for the job. A lot of people seem to be taking issue with the fact that I did not contort the problem until it fit pre-existing solutions, instead of creating a solution to the problem I saw.

> Are you familiar with how RabbitMQ does any of the following: clustering, HA, federation?

I am familiar with this page: http://www.rabbitmq.com/ha.html In addition, I have seen this page: http://www.rabbitmq.com/distributed.html

Both of them feel like a bolted on solution to the problem of distributed message passing, rather than an inherent design characteristic. Allow me to quote:

> some important caveats apply: whilst exchanges and bindings survive the loss of individual nodes, queues and their messages do not. This is because a queue and its contents reside on exactly one node, thus the loss of a node will render its queues unavailable.

This is understandable; queuing is pretty much impossible (as far as I know) to achieve in a distributed system. But I don't need queuing, so why should I let that hamstring me unnecessarily?

> should one node of a cluster fail, the queue can automatically switch to one of the mirrors and continue to operate, with no unavailability of service.

This does not feel like an inherent aspect of the design. This feels a lot like a deploy detail.

> In normal operation, for each mirrored-queue, there is one master and several slaves,

When I hear "master" and "slave", they translate in my head to "bottleneck/SPoF" and "backups".

> Clustering connects multiple machines together to form a single logical broker.

That sounds an awful lot like a single element in the system that is responsible for message traffic. There may be a lot of machines in that single element, but it is a single element nonetheless.

I am not saying that RabbitMQ is a bad solution for message passing, nor am I saying my problem couldn't be solved by RabbitMQ if I changed my architecture to fit Rabbit's needs. All I am saying is that I have a preference for architectures that do not take advantage of the things Rabbit does really, really well, and do take advantage of the things that a DHT does really, really well. So I'm more than a little confused that people are up in arms over the fact that I used the paradigm that matched my preference instead of trying to force something to do what it was not intended to do.

rabbitmq · on Oct 23, 2012

Thanks for this lengthy reply. I'll try to keep mine short and that means skipping over a bunch of stuff.

First, re "whilst exchanges and bindings survive the loss of individual nodes, queues and their messages do not" -- this needs to be clarified. When queues are replicated, their messages do survive, in order, in the replica. The queue itself may die but the client can get back into context by finding another node in the group. This is observationally similar to DHTs.

Second, "a single logical broker" and "sounds an awful lot like a single element in the system that is responsible for message traffic".... NOOOOOO the point of 'logical' is the opposite of what you say!

Finally the BIG difference between DHTs and something like Rabbit is that in a DHT each datum is replicated around the ring N times, non-uniformly. Whereas in a system like Rabbit each datum is replicated N times in a uniform manner, such that ordered pairs of messages A and B will exist in every queue which has either A or B.
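To make the "replicated in a uniform manner" point concrete, here is a toy model of a mirrored queue. It is only a sketch of the idea described above, not RabbitMQ's implementation: the `MirroredQueue` class and its methods are hypothetical, everything runs in one process, and real mirroring has to deal with network failures that are ignored here.

```python
from collections import deque


class MirroredQueue:
    """Toy model: every mirror holds the same messages in the same order."""

    def __init__(self, node_names):
        self.mirrors = {name: deque() for name in node_names}
        self.master = node_names[0]

    def publish(self, msg):
        # Each message is appended to every mirror, so relative order
        # (A before B) is identical on all of them.
        for queue in self.mirrors.values():
            queue.append(msg)

    def consume(self):
        # Take the head from the master, then drop the same head everywhere else.
        msg = self.mirrors[self.master].popleft()
        for name, queue in self.mirrors.items():
            if name != self.master:
                queue.popleft()
        return msg

    def fail(self, name):
        # Losing a node removes its copy; if it was the master, promote a survivor.
        del self.mirrors[name]
        if name == self.master:
            if not self.mirrors:
                raise RuntimeError("all mirrors lost")
            self.master = next(iter(self.mirrors))


q = MirroredQueue(["rabbit@a", "rabbit@b", "rabbit@c"])
for m in ["A", "B", "C"]:
    q.publish(m)
first = q.consume()      # "A", served by the original master
q.fail("rabbit@a")       # master dies; a mirror takes over
rest = [q.consume(), q.consume()]  # "B" then "C", still in order
```

The contrast with a DHT is that here every replica sees the whole ordered stream, whereas a DHT would place each datum on whichever nodes are nearest to that datum's key.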

JulianMorrison · on Oct 20, 2012

0mq doesn't have a client/server relationship. http://www.zeromq.org/

paddyforan · on Oct 20, 2012

I believe I looked at that as well. It was six months ago, so forgive me if my memory is a bit off.

Looking at the intro, it looks like I have to hardcode the IPs and ports of all of the machines I want to publish or subscribe to. Is that correct? If so, that's a little fragile for my tastes. I like "upgrading servers" by standing up new servers, testing they work, and then changing the floating IP associated with the DNS to point to them instead.

And I can hear you say "Aha! Floating IP! Just use that!" but for most cloud providers (I believe), that is billed as external bandwidth, which is not free like internal requests are. Finally, if I want to stand up more servers to scale horizontally, I'd still have to modify my code and redeploy it to all my servers. I think. Unless I'm missing something about how 0mq works. Which I totally could be.

On the whole, learning what I needed to learn to make Pastry happen made me a better programmer, too. So if for no other reason than that, I'm glad I did it.

JulianMorrison · on Oct 20, 2012

No more so than with TCP sockets you have to "hard code" the IP connected to.

0mq is low level, it's not going to do routing like Pastry, but you can get away with having one fixed location for a "lookup server" that keeps track of everything else's location (something like the Doozer project, also in Go https://github.com/ha/doozerd ).

Also, 0mq only needs to know the IP of the publisher, for pub-sub.

paddyforan · on Oct 20, 2012

Very cool. Where were you six months ago? ;)

I was familiar with doozer, but doozer wasn't(/isn't) being actively maintained and it doesn't compile against the latest version of Go, so I'd have to bring it up to speed before I could use it anyways. Not saying it would be harder, but between that and writing the 0mq library (pretty sure one does not exist in Go yet), I'd estimate the work would be more or less equivalent and yield just about the same stability. Seat of the pants guess, but it makes me feel better, at least.

JulianMorrison · on Oct 20, 2012

Heh, thanks.

The mailing list suggests http://github.com/4ad/doozer and http://github.com/4ad/doozerd are being maintained as forks of the original. Or there's Zookeeper. Or brew your own central lookup server using 0mq that does the same job of "set a key, get a key, notify subscribers when it changes". It's a single point of failure, but since the system will run fine without it (only lacking topology updates) it's not a hugely problematic one.
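For what "set a key, get a key, notify subscribers when it changes" might look like at its smallest, here is an in-process sketch. It is not Doozer, Zookeeper, or 0mq code; the `Registry` class and the addresses are invented for illustration, and a real lookup server would expose these operations over the network.

```python
from collections import defaultdict

_MISSING = object()  # sentinel so that storing None still counts as a change


class Registry:
    """Toy lookup service: set a key, get a key, notify watchers on change."""

    def __init__(self):
        self._data = {}
        self._watchers = defaultdict(list)

    def set(self, key, value):
        changed = self._data.get(key, _MISSING) != value
        self._data[key] = value
        if changed:
            for callback in self._watchers[key]:
                callback(key, value)

    def get(self, key, default=None):
        return self._data.get(key, default)

    def watch(self, key, callback):
        self._watchers[key].append(callback)


class Subscriber:
    """Caches the last known publisher address, so it keeps working
    with that address even if the registry becomes unreachable."""

    def __init__(self, registry):
        self.publisher = registry.get("publisher")
        registry.watch("publisher", self._on_change)

    def _on_change(self, key, value):
        self.publisher = value


registry = Registry()
registry.set("publisher", "tcp://10.0.0.1:5556")   # hypothetical address
sub = Subscriber(registry)
registry.set("publisher", "tcp://10.0.0.2:5556")   # topology update
```

The `Subscriber` cache is the part that matches the claim above: if the registry goes away, subscribers still hold the last address they were told about and only miss later topology updates.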

paddyforan · on Oct 20, 2012

This reply link just appeared for me. Not cool -_-

The lookup server is a single point of failure. And while it's not a hugely problematic one, it also is a dedicated machine whose sole purpose is keeping the other machines running. Which feels wrong to me.

It probably wasn't the best business decision to invest time in this, and I won't even argue this is the best technical solution. But it's the only technical solution that didn't make me feel like I was working around limitations; things just worked the way they were supposed to. The API servers received an event, they told the WebSockets servers about it. It felt very conceptually pure to me. I'm a sucker for that.

JulianMorrison · on Oct 20, 2012

To get the reply on a deep comment, click "link" first to get the comment as a single page, and "reply" will be there.

Doozer is only slightly a SPoF: it has high availability through multi-master replication. It's also not doing a lot of communicating, or a lot of CPU work, and it keeps its data in RAM, so it may be OK for it to live on a non-dedicated machine.

Sorry if I come across as criticizing your admittedly cool creation. It was just that you said there weren't alternatives, and I knew of one.

paddyforan · on Oct 20, 2012

No, I definitely appreciate the conversation, because I looked for forever for a pre-built solution to this, and couldn't figure out how nobody had needed a solution before now. So I'm glad there are solutions, and I'm not just crazy.

alexchamberlain · on Oct 20, 2012

Isn't the lookup server a SPoF?

brian_cloutier · on Oct 20, 2012

What does it have?

JulianMorrison · on Oct 20, 2012

Peer-to-peer connections analogous to TCP. In the case of pub/sub all subscribers connect to the publisher. http://zguide2.zeromq.org/page:all

paddyforan · on Oct 20, 2012

I guess I was wrong about hard-coding the IPs. Maybe?

It doesn't have many-to-many baked in, though. That I can see, at least.

It looks like I may have been able to make this work, but the setup would be a little complex.

... because a distributed hash table is not complex at all. -_-

JulianMorrison · on Oct 20, 2012

It doesn't have many-to-many but in my experience most things aren't true many-to-many. More like, there are a handful of publishers in the system and each has a handful of subscribers.

Suppose your architecture is API -> Web sockets and you want to make it highly scalable, then what I'd do is:

1. Many API -> few relays (use doozerd to keep up to date as to where the relays are and select one randomly, use ZMQ_ROUTER socket type to enqueue API->relay messages).

2. Few Relay -> many web socket (again using doozerd for the web socket nodes to find the relay, and using ZMQ_PUB on the relay and ZMQ_SUB on the web socket nodes).
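The shape of those two steps can be sketched without any ZeroMQ at all. The classes below are plain in-process stand-ins: `ApiServer.emit` plays the role of the API-to-relay hop, `Relay.handle` plays the role of the publish side, and `WebSocketNode.deliver` plays the role of the subscribe side. No real 0mq or doozerd calls are made, and one assumption goes beyond the description above: each web socket node subscribes to every relay, so it sees an event no matter which relay an API server happened to pick.

```python
import random


class WebSocketNode:
    """Stand-in for a web socket server with a subscribe-side connection."""

    def __init__(self, name):
        self.name = name
        self.received = []

    def deliver(self, event):
        self.received.append(event)


class Relay:
    """Stand-in for a relay: takes events in, fans them out to subscribers."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, node):
        self.subscribers.append(node)

    def handle(self, event):
        for node in self.subscribers:
            node.deliver(event)


class ApiServer:
    """Stand-in for an API server that picks one known relay at random."""

    def __init__(self, relays, rng=None):
        self.relays = relays              # in practice, looked up from a registry
        self.rng = rng or random.Random()

    def emit(self, event):
        self.rng.choice(self.relays).handle(event)


relays = [Relay(), Relay()]
ws_nodes = [WebSocketNode(f"ws-{i}") for i in range(3)]
for relay in relays:
    for node in ws_nodes:
        relay.subscribe(node)             # assumption: subscribe to every relay

api_servers = [ApiServer(relays) for _ in range(4)]
for i, api in enumerate(api_servers):
    api.emit(f"event-{i}")
```

Because this runs synchronously in one process, every web socket node ends up with all four events in emit order; over a real network, ordering across different relays would not be guaranteed.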

paddyforan · on Oct 20, 2012

Sounds like it would work, and have great availability and scalability. It just, conceptually, doesn't mesh with how I think about the system. An absolutely aesthetic distinction, I know.

jeremyjh · on Oct 20, 2012

You didn't look very hard.

http://www.rabbitmq.com/ha.html

paddyforan · on Oct 20, 2012

I thought you just missed this. Apparently I was wrong.

I think you're missing the point a little.

old_sound · on Oct 22, 2012

Whether you have a single point of failure depends on how you orchestrate your architecture, not on whether you use RabbitMQ, ZeroMQ or whatever. There's no magic in the software world.

paddyforan · on Oct 23, 2012

And my application will certainly have its single points of failure. But they will be things that I could not architect around with new software.