Technical Design

Transport

DerbyOS is built on top of ZeroMQ, which provides the basic transport, including many-to-many messaging with more reliability than UDP. There are a number of peer-to-peer protocols (such as JeffMQ and ZRE/Zyre) partially built on top of ZeroMQ that could be used. These were investigated but abandoned, either because of the complexity of getting them to compile or because they are not available for the needed languages/platforms. While they do add some useful features (such as discovery or reliability), the restrictions they impose outweigh their advantages. This is not to say, however, that good ideas haven't been taken from them... Specifically, ZRE is used as the topology model for DerbyOS.

Topology

DerbyOS is peer-to-peer. There are a number of ways this can be implemented using ZeroMQ. Fortunately, our requirements are modest:

Limited number of clients

Low bandwidth (both limited number of messages sent, as well as the size of their content)

Discovery handled outside of ZeroMQ

Must support having multiple bouts running on the same network at the same time, with a way to filter events between different bouts

ZRE provides all these capabilities and then some (such as discovery), and will be used as the technological blueprint, though the actual protocol built on top of it shall be different.

Implementation Details

This section assumes familiarity with ZeroMQ. Here there be dragons.

Interconnection Model

Each node SHALL create a ZeroMQ ROUTER socket and bind it to an ephemeral TCP port (in the range 0xC000-0xFFFF, i.e., 49152-65535). The node SHALL broadcast this IP address and port number using Bonjour/mDNS (if support is provided).

This ROUTER socket SHALL be used for all incoming ZeroMQ messages from other nodes. A node SHALL NOT send messages to peers via this socket.

When a node discovers a new peer, it SHALL create a ZeroMQ DEALER socket, set that socket's identity to the node's binary 16-octet UUID, and connect it to the peer's address/port as advertised in the mDNS message. Immediately after connecting, a node MAY start to send messages to the peer via this DEALER socket.

A node SHALL connect each DEALER socket to at most one peer. A node MAY disconnect its DEALER socket if the peer has failed to respond within some time (see Heartbeat).

This DEALER socket SHALL be used for all outgoing ZeroMQ messages to a specific peer. A node SHALL NOT receive messages on this socket. The sender MAY set a high-water mark (HWM) of, for example, 100 messages per second (if the timeout period is 30 seconds, this means an HWM of 3,000 messages). The sender SHOULD set the send timeout on the socket to zero so that a full send buffer can be detected and treated as "peer not responding".

Note that the ROUTER socket provides the caller with the UUID of the sender for any message received on the socket, as an identity frame that precedes other frames in the message. A peer can thus use the identity on received messages to look up the appropriate DEALER socket for messages back to that peer. The identity SHALL be a binary 16-octet UUID value.

When a node receives, on its ROUTER socket, a HELO message from an unknown node, it SHALL treat this as a new peer, exactly as if a Bonjour/mDNS response had been received from an unknown node.

NOTE: the ROUTER-to-DEALER pattern that DerbyOS uses is designed to ensure that messages are never lost due to synchronization issues: sending to a ROUTER socket that does not (yet) have a connection to a peer causes the message to be dropped, whereas a DEALER socket queues outgoing messages until the connection is established.

Messages

All messages received on the ROUTER socket will be multipart (in the ZeroMQ sense of "multipart") with three components:

1. Sender, the 16-byte UUID of the sender

2. Message type, a simple 4-ASCII-character string

3. Body, an arbitrary UTF-8 encoded string
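The three-frame layout can be sketched in a few lines of Python; a minimal illustration (the helper names are hypothetical; in practice these frames would be handed to ZeroMQ as a multipart message):

```python
import uuid

def build_frames(sender: uuid.UUID, msg_type: str, body: str) -> list:
    """Pack a DerbyOS message as the three frames described above."""
    if len(msg_type) != 4 or not msg_type.isascii():
        raise ValueError("message type must be exactly 4 ASCII characters")
    return [sender.bytes,              # 1. 16-byte binary UUID of the sender
            msg_type.encode("ascii"),  # 2. 4-character message type
            body.encode("utf-8")]      # 3. arbitrary UTF-8 body

def parse_frames(frames: list) -> tuple:
    """Unpack the three frames back into (sender, type, body)."""
    sender_bytes, type_bytes, body_bytes = frames
    return (uuid.UUID(bytes=sender_bytes),
            type_bytes.decode("ascii"),
            body_bytes.decode("utf-8"))
```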

There are only a small number of message types - most deal with handshaking and other "out of band"/infrastructure information, while the bulk of all messages should be XML messages related to the actual bout.

Introductions

When a node discovers another node (either via Bonjour/mDNS or manually), it will connect its DEALER socket to the remote node's ROUTER socket and immediately send a HELO message containing the (local) node's address and port. Until a node sends the HELO message, all its other messages will be ignored (since the node's address and port are not yet known). Once a node receives a HELO message, it will add that node to its table of peers, send a corresponding HELO message back to the initiator, and then send a PEER message, containing the remote node's address and port, to all other peers. When a node receives a PEER message, if there isn't already a peer with that address and port, it will connect to that peer (and start the whole introduction process over).

This introduction is equivalent to this dialogue, where Alice and Bob are already holding a conversation, and Charlie walks up to them (where a,b,c are addresses, and A,B,C are unique identifiers):


Charlie (to Alice): Hello, I am Charlie (c->a HELO C)
Alice (to Charlie): Hello, I am Alice (a->c HELO A)
Alice (to Bob): Bob, introduce yourself to our new friend (a->b PEER c)
Bob (to Charlie): Hello, I am Bob (b->c HELO B)
Charlie (to Bob): Hello, I am Charlie (c->b HELO C)
Bob (to Alice): Have you met this person? (b->a PEER c)
Alice, having already met Charlie, does nothing further
Charlie (to Alice): Have you met this person? (c->a PEER b)
Alice, having already met Bob, does nothing further

This protocol can be optimized to have Bob realize that it was Alice who sent the PEER message about Charlie in the first place, so there is no need to send that PEER message back to her.

NB: This introduction scheme is noisy (since each new peer results in N^2 messages), but it is simple and robust; even in a typical case of up to 16 NSOdes, that still isn't a lot of messages, and there shouldn't be a constant influx of new NSOdes.
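The introduction flow can be exercised without any real sockets; the sketch below (class and method names are hypothetical) models each node's peer table as a dict and shows that after the handshake every node knows every other:

```python
class Node:
    """Toy model of the HELO/PEER introduction flow (no real networking)."""
    def __init__(self, name, addr):
        self.name, self.addr = name, addr
        self.peers = {}          # addr -> Node

    def connect(self, other):
        """Discover a node (via mDNS or manually) and send it HELO."""
        if other.addr in self.peers:
            return
        self.peers[other.addr] = other
        other.on_helo(self)

    def on_helo(self, sender):
        if sender.addr in self.peers:
            return
        others = list(self.peers.values())
        self.peers[sender.addr] = sender
        sender.on_helo(self)                 # HELO back to the initiator
        for peer in others:                  # PEER message to all other peers
            peer.on_peer(sender)

    def on_peer(self, newcomer):
        if newcomer.addr not in self.peers:  # already met? do nothing
            self.connect(newcomer)           # else start the introduction over
```

Running the Alice/Bob/Charlie scenario through this model ends with all three peer tables fully populated, as in the dialogue above.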

Goodbye

If an NSOde quits or restarts gracefully, a GBYE message should be sent to all connected NSOdes to inform them of the NSOde's disappearance. The remote peers should then remove the departing NSOde's entry from their list of known peers. Sending this message is not mandatory, but should be done (handling this message, however, is mandatory, to support switching bouts).

Bout Identification

To support having multiple bouts on the same network, a mandatory BOUT message is sent after the HELO message. The body of the message is an arbitrary bout identification string (which could be a UUID string, a user-visible bout name, etc...). If there is only a single bout, a zero-length string is used. If an NSOde receives a BOUT message, it must respond with one of its own to that node (even if the NSOde isn't configured with a bout identification string - in which case a zero-length string is used). If the bout identification strings do not match, then that connection must be closed (optionally with a GBYE message sent first) and the peer removed from the NSOde's table of peers.
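The bout-matching rule is small enough to state as code; a minimal sketch, assuming the peer table maps peer UUIDs to connections (all names here are hypothetical):

```python
def on_bout(local_bout_id, remote_bout_id, peer_uuid, peers):
    """Handle an incoming BOUT message.

    A node not configured with a bout identification string uses the
    zero-length string, per the rule above."""
    if local_bout_id == remote_bout_id:
        return "accepted"
    # Mismatch: close the connection and drop the peer
    # (a GBYE message may optionally be sent first).
    peers.pop(peer_uuid, None)
    return "rejected"
```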

NB: ideally, forwarding introductions should be done only after BOUT messages are received, but that requires extra state that could possibly become corrupted if a connection dies midway through the handshake (or could otherwise interact badly with threading models if two HELO messages arrive at the same time). If the client can support this "delayed PEER", it should, but this is not mandatory.

If possible, this bout identifier should also be present in the Bonjour/mDNS entry, to prevent clients from a different bout from joining in the first place.

Chat

While this protocol isn't specifically designed to support arbitrary chat, there may be a need to send textual messages. This is done by sending a CHAT message, the body of which contains the textual message. Clients need not support sending such messages, but should (though not must) support displaying them. By default, messages are sent to all peers, but a client could implement this message to send to only a single peer NSOde if so desired.

Heartbeat (TBD)

There currently isn't a heartbeating scheme defined, but one will be added to allow for determining whether a peer has crashed (or the network has gone down). This will be a BEAT message, the body of which is currently undefined.

Time Synchronization (TBD)

All NSOde peers should have a synchronized concept of time (even if a device's internal clock is off from the rest) to allow "wall clock" based timestamps to be accurate. When an NSOde joins another peer, the new node will synchronize itself with that existing node only if the joining node does not have any other peers. This is all done via the TIME message. The TIME message includes a number of seconds since a defined epoch (it is also assumed that all NSOdes are physically in the same time zone). This epoch shall be the Unix epoch of January 1, 1970. Internally, time should be stored as a double-precision value (64 bits) relative to this epoch, and the body of the TIME message should be the decimal representation of that double-precision value.
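A minimal sketch of encoding the TIME body and computing a local clock offset from it (helper names are hypothetical; this ignores network latency, which a real implementation might smooth over several exchanges):

```python
import time

def make_time_body(now: float = None) -> str:
    """Body of a TIME message: decimal seconds since the Unix epoch,
    carried as the decimal representation of a double-precision value."""
    return repr(time.time() if now is None else now)

def clock_offset(time_body: str, local_now: float) -> float:
    """Seconds to add to the local clock to agree with the peer's clock."""
    return float(time_body) - local_now
```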

Note that time synchronization may have problems if the network access point reboots. This is because separate groups of peers may become connected, forming disjoint groups, as opposed to joining a single existing group.

XML

The bulk of all messages will be XMSG messages, where the body of the message contains XML. The top-level node of the message is the action element, which contains zero or more DerbyXML element nodes.

Discovery

Discovery of nodes is ideally done using mDNS, assuming the clients and access point support it. If needed, however, raw IP addresses can be used, though this causes problems (since there are multiple ports that are needed). The "Introductions" protocol above is designed around not having mDNS. While it would be easier to just rely on mDNS (which also provides notification of node disappearance), a robust protocol should be flexible enough to support both mDNS and "manual IP" configurations (even when mixed on the same network). In practice, this proves to be a bit tricky.

If mDNS is supported, entries are registered with the type "_nsode_derbyos_com._tcp". TXT records must also be provided with the node's UUID and bout identification. The DerbyOS sample code provides a simple Python class that uses pybonjour and provides two entry points - one to broadcast the NSOde's visible name, port, UUID, and bout id, and a second to look for peers, which takes two callback routines: one called when a node is discovered (with the node's full name, host target, port, and TXT record), and a second called when a node disappears (with the node's service name).

Implementations should store a peer's address information as IP octets (for IPv4), rather than any ad-hoc name that mDNS may provide. This allows mixing nodes that use mDNS with those using manual IP addresses. It is also easier to get the IP address than the DNS name in a number of circumstances (provided allowances are made for multi-homed devices). If DNS names were used and a node had several names, confusion could arise as to the identity of the node.

“What jam are we on?”

One common problem is determining what jam we are currently on (or going into, or just finished, etc...). This is extra problematic since a scoreboard may display the jam that just finished, and not the jam we are going into - obviously, mid-jam, any display of jam numbers should show the current jam. During lineup, however, it is more subtle, due to the fact that lineup information will refer to the upcoming jam, scoring information will refer to the completed jam, and penalty information can refer to either.

This gets worse in systems where a single NSO is responsible for "starting the next jam" (such as the lineup tracker for the home team). If that person is behind, or their computer has issues, data entry for everybody else can be compromised. Our solution is that anybody can refer to any jam at any time (while in theory this would allow somebody to enter data for jam 99, in practice this should be dealt with by the client software). We further define:

Active Jam - The jam which is currently being skated. This is the jam with the most recent starting timestamp but no duration.

Just Completed Jam - During lineup, the just completed jam is the jam with the most recent starting timestamp and a duration. If there is an active jam, there is still a just completed jam (but hopefully, by the time the next jam has started, all the information for the just completed jam will have been entered).

Upcoming Jam - A jam whose number is one greater than the just completed jam, if any. At the start of the bout, this is period one, jam one. If the just completed jam is in period one and its ending timestamp is after the ending timestamp of the period, the upcoming jam is period two, jam one. If the bout is completed, there is obviously no upcoming jam. If there is an active jam, there is no upcoming jam (this could be defined, but one should not be entering information into a future jam while an active jam is ongoing).

So long as the bout has not completed, there is always either an active jam or an upcoming jam. There is also a just completed jam so long as period one, jam one has a duration (i.e., it has completed).
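These definitions translate directly into code; a simplified sketch, assuming each jam is a record with a start timestamp and a duration (the period-two rollover and bout-completed cases from above are omitted for brevity, and all names are hypothetical):

```python
def active_jam(jams):
    """The jam with the most recent starting timestamp but no duration."""
    started = [j for j in jams
               if j.get("start") is not None and j.get("duration") is None]
    return max(started, key=lambda j: j["start"], default=None)

def just_completed_jam(jams):
    """The jam with the most recent starting timestamp and a duration."""
    done = [j for j in jams
            if j.get("start") is not None and j.get("duration") is not None]
    return max(done, key=lambda j: j["start"], default=None)

def upcoming_jam(jams):
    """One greater than the just completed jam; undefined while a jam is active."""
    if active_jam(jams) is not None:
        return None
    last = just_completed_jam(jams)
    if last is None:
        return {"period": 1, "jam": 1}    # start of the bout
    return {"period": last["period"], "jam": last["jam"] + 1}
```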

A side effect of the "no single jam creator" rule, taken to its logical conclusion, is that there should be no need to explicitly create a jam. This is also a logical extension of the default bout XML - namely, it is already pre-populated with three teams (officials are the third team) and two periods. We can thus extend this to say that any reference to a given jam will automatically create that jam.

Collaborative Editing Engine

Internally, being able to construct, edit, and reconcile the DerbyXML file is essentially similar to traditional "Commutative Replicated Data Types" algorithms (see papers such as <http://arxiv.org/abs/1010.3615> for more details). Fortunately, we don't need quite that level of flexibility, since under normal usage patterns the same element isn't altered by multiple clients at the same time. However, by focusing on the underlying DerbyXML file (as opposed to a domain-specific protocol), we can easily add/extend features. For example, we don't need protocol extensions specifically to deal with action and error tracking - those events are elements in DerbyXML just like lineup is. If we were to add "location on the track" for events, those would be child elements or attributes of existing DerbyXML elements, and again would not need protocol extensions to support them.

As a result, the XMSG message, which is used to handle all derby-specific events, can be relatively simple. The body contains XML (with a "payload" as the child element(s) of that body) - and that root XML element can be one of the following tags:

add - adds the child element(s) to a parent

remove - removes a specified element (no payload)

set - sets the value of an attribute, or the textual content of an element (textual content of an element can be appended to using the add tag). To set an attribute, the name of the attribute is given in an "attribute" element.

The detail glossed over above is how to specify a parent/element. In general, any element can have an "eid" attribute, which is a UUID string. The eid is not saved as part of the DerbyXML document by default, since it is only relevant during bout time. With a few exceptions, use of eid is all but mandatory, since there is no other reliable way to access most elements in the DerbyXML tree (the order of elements may not be consistent between NSOdes). We can then refer to an element as follows:

Attributes                     Reference

eid="123...123123"             The eid attribute of a specific XML element
period="1"                     A given period (either 1 or 2)
period="1" jam="3"             A given jam within a given period
team="home"                    A specific team ("home", "away", "official")
team="home" person="123"       A specific person on a specific team - the person
                               attribute can be either their number or their name
path="path/to/node"            Useful for very specific cases, such as venue or
                               ruleset; for example, this can be used to specify
                               the duration of halftime.
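Reference resolution against the tree can be sketched with the standard library's ElementTree; the element and attribute names below (period, jam, team, person, number, name) follow the table's wording but are assumptions about the DerbyXML schema:

```python
import xml.etree.ElementTree as ET

def resolve(root, attrs):
    """Find the element that a reference (a dict of the attributes above)
    points at, or None.  Schema element names are assumptions."""
    if "eid" in attrs:
        return root.find(".//*[@eid='%s']" % attrs["eid"])
    if "path" in attrs:
        return root.find(attrs["path"])
    if "team" in attrs:
        team = root.find(".//team[@name='%s']" % attrs["team"])
        if team is not None and "person" in attrs:
            return team.find(".//person[@number='%s']" % attrs["person"])
        return team
    if "period" in attrs:
        period = root.find(".//period[@number='%s']" % attrs["period"])
        if period is not None and "jam" in attrs:
            return period.find(".//jam[@number='%s']" % attrs["jam"])
        return period
    return None
```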

XML Reconciliation and Auditing

There are various algorithms that can be used to deal with unifying distributed XML, though most are based on having a centralized server to synchronize with. Instead of using something with that level of complexity (with the ability to manage a wide variety of inconsistencies across arbitrary hierarchies), we instead use a different approach that leverages an auditing stream.

The basic idea is that, besides keeping a full XML tree of the bout, there is also an audit log of all manipulations of the tree by all the clients. First, this provides a clear-cut audit of who has done what. Second, it provides the ability to reconstruct the XML tree from scratch (which, in and of itself, isn't terribly useful, since if we have the audit log, we would also have the bout XML tree). However, this audit log also provides a way to know which messages have been received from which other NSOdes. For example, if we know that some information is missing, we can retrieve it and fill in our local knowledge of the state of the bout (and this is where "reconstruct the XML tree from scratch" comes in).

In order to keep track of what messages we have seen locally, the XMSG payload XML includes the following attributes:

Attribute                      Reference

source                         The UUID of the originator of the message (i.e., which
                               NSOde actually recorded/generated the event)
timestamp                      An absolute epoch-based timestamp of when the event
                               happened. This is informative only (since timestamps
                               between multiple devices may not be exactly the same);
                               it is used only to sort the recorded messages in the
                               audit log
sequence                       An integer value specific to the originating NSOde,
                               which starts at 1 at "start up" and is incremented
                               sequentially

NSOdes keep track of the latest sequence number for each known peer. When a message is received, the latest sequence number of the source is compared with the sequence number of the message - if the message's sequence number isn't one more than the last known sequence number, we know that messages are missing (and we can ask the peer for those specific messages).
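The gap check is a few lines of bookkeeping; a minimal sketch (the class name is hypothetical):

```python
class SequenceTracker:
    """Track the latest sequence number seen from each source and report gaps."""
    def __init__(self):
        self.latest = {}    # source UUID -> highest sequence number seen

    def receive(self, source, sequence):
        """Record a message; returns the missing sequence numbers to request."""
        last = self.latest.get(source, 0)
        missing = list(range(last + 1, sequence))   # empty when in order
        if sequence > last:
            self.latest[source] = sequence
        return missing
```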

Note that this algorithm still has one major weakness - we don't know that we are missing messages from a given node until we get a new message from it (with a skipped sequence number). If a node crashes, we will not get a new message at all. And if the network is spotty, a given node may never get messages delivered from a specific node (even though its neighbors have seen them).

To deal with this, there is another message used to help synchronize all the nodes: the GSIP ("gossip") message. At various times, an NSOde will send the most recent sequence number of a given source to another peer ("did you hear message 5 from JamTimerNode?"). When an NSOde gets a GSIP message, it need not reply; it simply checks whether the latest sequence number from that peer matches. If the sequence number is greater than its local knowledge of the sequence number, it will ask the "talked about" peer to re-send the missing messages. If the sequence number is less than its local knowledge, it must send its local audit information to catch the sender of the GSIP message up on all the messages it is missing.

Note that no special message is needed to request missing messages - a node can simply send a GSIP message to another node (either the node sending the message, or the node referred to in the message, or even any other node) with the lower sequence number, and that node will reply with the missing messages (and then the tree can be rebuilt).

This algorithm works in a number of different situations:

A node joining an existing network - every time the node gets a HELO message, it sends a GSIP message with a sequence number of zero. That node will then reply with all events it has generated.

A node temporarily loses the network - this is similar to the above, but when the other nodes reconnect, they will send an outdated GSIP message, and the reconnected node will reply with whatever was missed.

A node crashes (and is gone for good) - when the client fires back up, it should have a new UUID; since the audit information for the old node before the crash is stored on every other node, it can be replayed.

If the network crashes, then when it returns and the mesh reforms, all lost messages will be sent.

GSIP Engine Implementation Details

The "gossip management system", as described above, is used to synchronize all the NSOdes' XML trees. The actual implementation is flexible - it must, however:

Keep track of all XMSG bodies received in a list

Be able to find the latest sequence number in that list for any given peer (and be able to detect gaps)

Whenever a HELO message is received from a given node, send a GSIP message to that node with the latest sequence number it knows about from that peer (allowing the node to provide past messages to "catch up").

"Periodically", send GSIP messages to all connected peers with the latest sequence numbers of other nodes. Exactly how this is best done is a matter of tuning:

A message should not be forwarded from another node unless all messages from that node are known (i.e., the sequence numbers have no gaps). E.g., node B should not forward a message with sequence number 7 from node C to a third node A unless all messages with sequence numbers 1..6 are also known by node B.

A node should keep track of what it hears in other nodes' GSIP messages, so that if node A hears that node B "GSIP C 7", it will not later send the same message back to B (since it already knows that node B has seen all messages from node C up through sequence number 7).

Receiving a message whose source attribute is different from the sending channel should be treated as a corresponding GSIP message. E.g., if node B sends node A an XMSG with node C as the source and 7 as the sequence number, node A knows that node B has seen messages from node C up through sequence number 7, the same as “GSIP C 7”.
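The bookkeeping rules above (store bodies, advance only through gap-free runs, treat relayed messages as implicit GSIPs, and skip redundant gossip) can be sketched together; a minimal illustration, with all names hypothetical:

```python
class GossipEngine:
    """Sketch of the GSIP bookkeeping rules described above."""
    def __init__(self):
        self.log = {}         # (source, sequence) -> XMSG body
        self.latest = {}      # source -> highest gap-free sequence number
        self.peer_knows = {}  # (peer, source) -> highest sequence peer has seen

    def record(self, source, sequence, body):
        """Store a message; advance `latest` only while there are no gaps,
        so messages past a gap are never forwarded."""
        self.log[(source, sequence)] = body
        while (source, self.latest.get(source, 0) + 1) in self.log:
            self.latest[source] = self.latest.get(source, 0) + 1

    def on_message_via(self, peer, source, sequence):
        """An XMSG relayed by `peer` whose source differs from the sending
        channel counts as an implicit 'GSIP source sequence'."""
        key = (peer, source)
        self.peer_knows[key] = max(self.peer_knows.get(key, 0), sequence)

    def should_gossip(self, peer, source):
        """Skip a GSIP the peer is already known to have heard."""
        return self.latest.get(source, 0) > self.peer_knows.get((peer, source), 0)
```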