Sarracenia

MetPX-Sarracenia is a data duplication or distribution engine that leverages existing standard technologies (file and web servers and AMQP brokers) to achieve real-time message delivery and end to end transparency in file transfers. Whereas in Sundew, each switch is a standalone configuration which transforms data in complex ways, in Sarracenia, the data sources establish a file tree hierarchy which is carried through any number of intervening pumps until they arrive at a client. The client can provide explicit acknowledgement that propagates back through the network to the source. Whereas traditional file switching is a point-to-point affair where knowledge is only between each segment, in Sarracenia, information flows from end to end in both directions.

Overview

At it's heart, Sarracenia exposes a tree of web accessible folders (WAF), using any standard HTTP server (tested with apache) or SFTP server, with other types of servers as a pluggable option. Weather applications are soft real-time, where data should be delivered as quickly as possible to the next hop, and minutes, perhaps seconds, count. The standard web push technologies, ATOM, RSS, etc... are actually polling technologies that when used in low latency applications consume a great deal of bandwidth and overhead. For exactly these reasons, those standards stipulate a minimum polling interval of five minutes. Advanced Message Queueing Protocol (AMQP) messaging brings true push to notifications, and makes real-time sending far more efficient.

e-ddsr-components.jpg

Sources of data announce their products, pumping systems pull the data using HTTP or SFTP onto their WAF trees, and then announce their trees for downstream clients. When clients download data, they may write a report message back to the server. Servers are configured to forward those client report messages back through the intervening servers back to the source. The Source can see the entire path that the data took to get to each client. With traditional switching applications, sources only see that they delivered to the first hop in a chain. Beyond that first hop, routing is opaque, and tracing the path of data required assistance from administrators of each intervening system. With Sarracenia's report forwarding, the switching network is relatively transparent to the sources. Diagnostics are vastly simplified.

For large files / high performance, files are segmented on ingest if they are sufficiently large to make this worthwhile. Each file can traverse the data pumping network independently, and reassembly is only needed at end points. A file of sufficient size will announce the availability of several segments for transfer, multiple threads or transfer nodes will pick up segments and transfer them. The more segments available, the higher the parallelism of the transfer. In many cases, Sarracenia manages parallelism and network usage without explicit user intervention. As intervening pumps do not store and forward entire files, the maximum file size which can traverse the network is maximized.

Where Sundew supports a wide variety of file formats, protocols, and conventions specific to the real-time meteorology, Sarracenia takes a step further away from specific applications and is a ruthlessly generic tree replication engine, which should allow it to be used in other domains. The initial prototype client, dd_subscribe, in use since 2013, was replaced in 2016 by the full blown Sarracenia package, with all components necessary for production as well as consumption of file trees.

Sarracenia is expected to be a far simpler application than sundew from every point of view: Operator, Developer, Analyst, Data Sources, Data consumers. Sarracenia imposes a single interface mechanism, but that mechanism is completely portable and generic. It should run without issue on any modern platform (Linux, Windows, Mac)

For more information about Sarra, view the Sarracenia in 10 Minutes Video or proceed to the detailed documentation

Implementations

Part of Sarracenia defines an application layer message over AMQP as a transport. Sarracenia's has multiple implementations:

  • Sarracenia itself ( http://github.com/MetPX/sarracenia ) a complete reference implementation in Python >= 3.4. It runs on Linux, Mac, and Windows.
  • sarrac ( https://github.com/MetPX/sarrac ) is a C implementation of data insertion (post & watch.) It is Linux only. There is also a libcshim to be able to tranparently implement data insertion with this tool, and libsarra allows C programs to post directly. There is consumer code as well (to read queues) but no downloading so far. This subset is meant to be used where python3 environments are impractical (some HPC environments.)
  • node-sarra ( https://github.com/darkskyapp/node-sarra ) An embryonic implementation for node.js.
  • dd_subscribe ( https://github.com/MetPX/sarracenia ) python2 stripped down download only client. The predecessor of Sarracenia. still compatible.

More implementations are welcome.

Deployments/Use Cases

Deployment status in 2015: Sarracenia in 10 Minutes Video (5:26 in)

Deployment status in 2018: Deployments as of January 2018

Why Not Just Use Rsync?

There are a number of tree replication tools that are widely used, why invent another? RSync, for example is a fabulous tool, and we recommend it highly for many use cases. but there are times when Sarracenia can go 72 times faster than rsync: Case Study: HPC Mirroring Use Case

Rsync and other tools are comparison based (dealing with a single Source and Destination) Sarracenia, while it does not require or use multi-casting, is oriented towards a delivery to multiple receivers, particularly when the source does not know who all the receivers are (pub/sub.) Where rsync synchronization is typically done by walking a large tree, that means that the synchronization interval is inherently limited to the frequency at which you can do the file tree walks (in large trees, that can be a long time.) Each file tree walk reads the entire tree in order to generate signatures, so supporting larger numbers of clients causes large overhead. Sarracenia avoids file tree walks by having writers calculate the checksums once, and signal their activity directly to readers by messages, reducing overhead by orders of magnitude. Lsyncd is a tool that leverages the INOTIFY features of Linux to achieve the same liveness, and it might be more suitable but it is obviously not portable. Doing this through the file system is thought to be cumbersome and less general than explicit middleware message passing, which also handles the logs in a straight-forward way.

One of the design goals of Sarracenia is to be end-to-end. Rsync is point-to-point, meaning it does not support the transitivity of transfers across multiple data pumps that is desired. On the other hand, the first use case for Sarracenia is the distribution of new files. Updates to files were not common initially. ZSync is much closer in spirit to this use case. Sarracenia now has a similar approach based on file partitions (or blocks), but with user selectable size (50M is a good choice), generally much larger than Zsync blocks (typically 4k), more amenable to accelleration. Using an announcement per checksummed block allows transfers to be accellerated more easily.

The use of the AMQP message bus enables use of flexible third party transfers, straight-forward system-wide monitoring and integration of other features such as security scanning within the flow.

Another consideration is that Sarracenia doesn't actually implement any transport. It is completely agnostic to the actual protocol used to tranfer data. Once can post arbitrary protocol URLs, and add plugins to work with those arbitrary protocols, or substitute accellerated downloaders to deal with certain types of downloads. The download_scp plugin, included with the package, shows the use of the built-in python transfer mechanisms, but the simple use of a binary to accellerate downloads when the file exceeds a threshold size, making that method more efficient. Use of another compatible binary, such as dd or cp, (for local files), scp, or wget via plugins is also straightforward.

Why No FTP?

The transport protocols fully supported by Sarracenia are http(s) and SFTP (SSH File Transfer Protocol.) In many cases, when public data is being exchanged, FTP is a lingua franca that is used. The main advantage being relatively simple programmatic access, but that advantage is obviated by the use of Sarracenia itself. Further, these days, with increased security concerns, and with cpu instructions for encryption and multiple cores something of a cpu glut, it no longer makes much sense not to encrypt traffic. Additionally, to support multi-streaming, Sarracenia makes use of byte-ranges, which are provided by SFTP and HTTP servers, but not FTP. So we cannot support file partitioning on FTP. So while FTP sort-of-works, it is not now, nor ever will be, fully supported.

AMQP

AMQP is the Advanced Message Queuing Protocol, which emerged from the financial trading industry and has gradually matured. Implementations first appeared in 2007, and there are now several open source ones. AMQP implementations are not JMS plumbing. JMS standardizes the API programmers use, but not the on the wire protocol. So typically, one cannot exchange messages between people using different JMS providers. AMQP standardizes for interoperability, and functions effectively as an interoperability shim for JMS, without being limited to Java. AMQP is language neutral, and message neutral. there are many deployments using python, C++, and ruby. One could adapt WMO-GTS protocols very easily to function over AMQP. JMS providers are very Java oriented.

Sarracenia relies heavily on the use of brokers and topic based exchanges, which were prominent in AMQP standards efforts prior to version 1.0, at which point they were removed. It is hoped that these concepts will be re-introduced at some point. Until that time, the application will rely on pre-1.0 standard message brokers, such as rabbitmq.

Downloading Sarracenia

Steps for downloading the latest version of Sarracenia are available on our downloads page .

Getting The Source Code

The source code is available from our git repository .

Documentation

The documentation for Sarracenia can be found on our documentation .