SR_Subscribe

Select and Conditionally Download Posted Files

Manual section:1
Date: March 2017
Version: 2.17.03a4
Manual group:Metpx-Sarracenia Suite

SYNOPSIS

sr_subscribe configfile foreground|start|stop|restart|reload|status (formerly dd_subscribe )

DESCRIPTION

sr_subscribe is a program to efficiently download files from websites or file servers that provide sr_post(7) protocol notifications. Such sites publish a message for each file as soon as it is available. Clients connect to a broker (often the same as the server itself) and subscribe to the notifications. The sr_post notifications provide true push notices for web-accessible folders (WAF), and are far more efficient than either periodic polling of directories, or ATOM/RSS style notifications.

sr_subscribe can also be used for purposes other than downloading, such as supplying URLs to an external program. Specifying the -n option (notify_only, or no download) suppresses the download behaviour and only prints the URL on standard output. The standard output can be piped to other processes in classic UNIX text filter style.

The sr_subscribe command takes two arguments: a configuration file, described below, followed by an action: start|stop|restart|reload|status (self-explanatory).

The foreground action is different. It is meant for building a configuration or debugging, when the user wants to run the program and its configfile interactively. The foreground instance is not affected by the other actions, but if configured instances are running, it shares the same (configured) message queue with them. The user stops the foreground instance by pressing <ctrl-c> on Linux, or by killing its process by other means. This corresponds to the old dd_subscribe behaviour.

CONFIGURATION

In general, the options for this component are described by the sr_config(7) page which should be read first. It fully explains the option configuration language, and how to find the option settings.

CREDENTIAL OPTIONS

The broker option sets all the credential information needed to connect to the RabbitMQ server:

  • broker amqp{s}://<user>:<pw>@<brokerhost>[:port]/<vhost>
(default: amqp://anonymous:anonymous@dd.weather.gc.ca/ )

All sr_ tools store all sensitive authentication info in the credentials.conf file. Passwords for SFTP, AMQP, and HTTP accounts are stored in URLs there, as well as pointers to other things such as private keys, or FTP modes.

For more details, see: sr_config(7) credentials

AMQP QUEUE BINDINGS

Once connected to an AMQP broker, the user needs to create a queue and bind it to an exchange. These options define which messages (URL notifications) the program receives:

  • exchange <name> (default: xpublic)
  • topic_prefix <amqp pattern> (default: v00.dd.notify -- developer option)
  • subtopic <amqp pattern> (subtopic must be set)

Several subtopic options may be declared. For details on giving a correct value to the subtopic, see: sr_config(7)

One has the choice of filtering using subtopic with only AMQP's limited wildcarding and length limited to 255 encoded bytes, or the more powerful regular expression based accept/reject mechanisms described below. The difference being that the AMQP filtering is applied by the broker itself, saving the notices from being delivered to the client at all. The accept/reject patterns apply to messages sent by the broker to the subscriber. In other words, accept/reject are client side filters, whereas subtopic is server side filtering.

It is best practice to use server side filtering to reduce the number of announcements sent to the client to a small superset of what is relevant, and perform only a fine-tuning with the client side mechanisms, saving bandwidth and processing for all.

topic_prefix is primarily of interest during protocol version transitions, where one wishes to specify a non-default protocol version of messages to subscribe to.
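As an illustration of the server-side half of this filtering, the following Python sketch mimics how an AMQP topic exchange compares a binding pattern with a routing key: `*` stands for exactly one dot-separated word and `#` for zero or more words. The function name topic_matches and the sample routing keys are hypothetical, and joining topic_prefix and subtopic with a dot is an assumption made for the example. The real matching happens inside the broker, not in sr_subscribe.

```python
def topic_matches(pattern, key):
    """Return True if an AMQP topic pattern matches a routing key.

    '*' matches exactly one word, '#' matches zero or more words.
    This only imitates the broker's behaviour for illustration.
    """
    def match(p, k):
        if not p:
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may absorb any number of words, including none.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), key.split("."))


# Assumed composition: topic_prefix + "." + subtopic
binding = "v00.dd.notify" + "." + "model_gem_global.25km.grib2.#"

# Hypothetical routing keys, for illustration only.
print(topic_matches(binding, "v00.dd.notify.model_gem_global.25km.grib2.lat_lon.12.015"))
print(topic_matches(binding, "v00.dd.notify.radar.PRECIP.GIF.WGJ"))
```

Messages whose key does not match the binding are never sent to the client, which is why a tight subtopic saves more bandwidth than a broad subtopic followed by a reject pattern.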

DELIVERY SPECIFICATIONS

These options set which files the user wants, where they will be placed, and under which name.

  • accept <regexp pattern> (must be set)
  • attempts <count> (default: 3)
  • destfn_script (sundew compatibility... see that section)
  • directory <path> (default: .)
  • filename (for sundew compatibility.. see that section)
  • flatten <boolean> (default: false)
  • inflight <.string> (default: .tmp)
  • mirror <boolean> (default: false)
  • overwrite <boolean> (default: true)
  • reject <regexp pattern> (optional)
  • strip <count> (default: 0)
  • discard <boolean> (default: false)

The attempts option indicates how many times to attempt downloading the data before giving up. The default of 3 should be appropriate in most cases. The inflight option is a suffix added to the file name during the download and removed when the download is complete. If inflight is set to . (a single dot), the file name is instead prefixed with a dot during the download, and the dot is removed when the download is complete. This provides a means of avoiding premature processing of the file.
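The inflight naming rule can be sketched in Python as follows. This is only an illustration of the convention described above, not the code used by sr_subscribe; the function names inflight_name and download, and the fetch callable, are hypothetical.

```python
import os


def inflight_name(final_path, inflight=".tmp"):
    """Return the temporary name used while a file is being downloaded."""
    directory, name = os.path.split(final_path)
    if inflight == ".":
        # A lone dot means: hide the file behind a dot prefix.
        return os.path.join(directory, "." + name)
    # Otherwise the inflight string is appended as a suffix.
    return final_path + inflight


def download(final_path, fetch, inflight=".tmp"):
    """Write the data under the inflight name, then rename it into place.

    fetch is a hypothetical callable returning the file content as bytes.
    """
    temp_path = inflight_name(final_path, inflight)
    with open(temp_path, "wb") as f:
        f.write(fetch())
    # The rename makes the complete file appear under its final name.
    os.replace(temp_path, final_path)
```

A process watching the target directory can then ignore names ending in the inflight suffix (or starting with a dot) and only pick up complete files.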

The option directory defines where to put the files on your server. Combined with accept / reject options, the user can select the files of interest and their directories of residence. (see the mirror option for more directory settings).

The accept and reject options use regular expressions (regexp) to match URLs. These options are processed sequentially. The URL of a file that matches a reject pattern is never downloaded. One that matches an accept pattern is downloaded into the directory declared by the closest directory option above the matching accept option.

ex.   directory /mylocaldirectory/myradars
      accept    .*RADAR.*

      directory /mylocaldirectory/mygribs
      reject    .*Reg.*
      accept    .*GRIB.*
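The sequential evaluation of the example above can be sketched in Python. This is a reading of the rules as described, not the actual implementation: it assumes that the first matching pattern decides, that patterns are applied with re.match, and that a URL matching no pattern is skipped. The name select_directory and the test URLs are hypothetical.

```python
import re

# (option, argument) pairs, in the order they appear in the configuration.
CONFIG = [
    ("directory", "/mylocaldirectory/myradars"),
    ("accept", ".*RADAR.*"),
    ("directory", "/mylocaldirectory/mygribs"),
    ("reject", ".*Reg.*"),
    ("accept", ".*GRIB.*"),
]


def select_directory(url, config=CONFIG):
    """Return the target directory for url, or None if it is not downloaded."""
    current = "."  # documented default of the directory option
    for option, value in config:
        if option == "directory":
            current = value  # applies to the accept options that follow
        elif re.match(value, url):
            # Assumption: the first matching accept/reject decides.
            return current if option == "accept" else None
    return None  # assumption: no match means no download


print(select_directory("http://example.host/RADAR/scan.gif"))
print(select_directory("http://example.host/GRIB/Reg/file.grib2"))
print(select_directory("http://example.host/GRIB/glb/file.grib2"))
```

Because order matters, a reject line only protects the accept lines that come after it.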

The mirror option can be used to mirror the dd.weather.gc.ca tree of files. If set to True, the directory given by the directory option becomes the root of a local tree, and each accepted file is placed in the same subdirectory it occupies under dd.weather.gc.ca. For example, retrieving the following URL with these options:

http://dd.weather.gc.ca/radar/PRECIP/GIF/WGJ/201312141900_WGJ_PRECIP_SNOW.gif

  mirror    True
  directory /mylocaldirectory
  accept    .*RADAR.*

would result in the creation of the directories and the file /mylocaldirectory/radar/PRECIP/GIF/WGJ/201312141900_WGJ_PRECIP_SNOW.gif

You can modify the mirrored directories with the strip option. If set to N (an integer), the first N directories are removed from the path. For example:

http://dd.weather.gc.ca/radar/PRECIP/GIF/WGJ/201312141900_WGJ_PRECIP_SNOW.gif

  mirror    True
  strip     3
  directory /mylocaldirectory
  accept    .*RADAR.*

would result in the creation of the directories and the file /mylocaldirectory/WGJ/201312141900_WGJ_PRECIP_SNOW.gif
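The two mirror examples can be reproduced with a small Python sketch. The function local_path is hypothetical and only models the behaviour described here. It assumes strip has an effect only when mirror is set, since the text presents strip as a modifier of the mirrored directories.

```python
import posixpath
from urllib.parse import urlparse


def local_path(url, directory, mirror=False, strip=0):
    """Compute where a file announced at url would be written."""
    parts = urlparse(url).path.lstrip("/").split("/")
    subdirs, filename = parts[:-1], parts[-1]
    if not mirror:
        # Without mirroring, the file lands directly in directory.
        return posixpath.join(directory, filename)
    # With mirroring, keep the remote subdirectories minus the first `strip`.
    return posixpath.join(directory, *subdirs[strip:], filename)


url = "http://dd.weather.gc.ca/radar/PRECIP/GIF/WGJ/201312141900_WGJ_PRECIP_SNOW.gif"
print(local_path(url, "/mylocaldirectory", mirror=True))
print(local_path(url, "/mylocaldirectory", mirror=True, strip=3))
```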

The flatten option is used to set a separator character. This character replaces each '/' in the URL directory, creating a "flattened" filename from its dd.weather.gc.ca path. For example, retrieving the following URL with these options:

http://dd.weather.gc.ca/model_gem_global/25km/grib2/lat_lon/12/015/CMC_glb_TMP_TGL_2_latlon.24x.24_2013121612_P015.grib2

  flatten   -
  directory /mylocaldirectory
  accept    .*model_gem_global.*

would result in the creation of the filepath

/mylocaldirectory/model_gem_global-25km-grib2-lat_lon-12-015-CMC_glb_TMP_TGL_2_latlon.24x.24_2013121612_P015.grib2
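The flatten example amounts to a simple string replacement, sketched here in Python. The function name flattened_path is hypothetical and the code only illustrates the naming rule described above.

```python
import posixpath
from urllib.parse import urlparse


def flattened_path(url, directory, separator):
    """Replace every '/' of the remote path with separator."""
    relative = urlparse(url).path.lstrip("/")
    return posixpath.join(directory, relative.replace("/", separator))


url = ("http://dd.weather.gc.ca/model_gem_global/25km/grib2/lat_lon/12/015/"
       "CMC_glb_TMP_TGL_2_latlon.24x.24_2013121612_P015.grib2")
print(flattened_path(url, "/mylocaldirectory", "-"))
```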

The overwrite option, if set to false, avoids unnecessary downloads when both of these conditions hold: 1- the file to be downloaded is already on the user's file system at the right place, and 2- the checksum in the AMQP message matches that of the file. The default is True (overwrite without checking).
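The decision described for overwrite can be sketched in Python as follows. The function needs_download is hypothetical, and the use of an MD5 hex digest is an assumption made for the example; the checksum algorithm actually in use is carried in the announcement and is described in sr_post(7).

```python
import hashlib
import os


def needs_download(local_file, announced_md5, overwrite=True):
    """Decide whether the announced file should be fetched."""
    if overwrite:
        return True  # default: download without checking
    if not os.path.isfile(local_file):
        return True  # condition 1 fails: nothing at the target path
    digest = hashlib.md5()  # assumption: the announcement carries an MD5 sum
    with open(local_file, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    # Condition 2: skip the download only if the checksums agree.
    return digest.hexdigest() != announced_md5
```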

The discard option, if set to true, deletes the file once downloaded. This option can be useful when debugging or testing a configuration.

EXAMPLES

Here is a short complete example configuration file:

broker amqp://dd.weather.gc.ca/

subtopic model_gem_global.25km.grib2.#
accept .*

The above file will connect to the dd.weather.gc.ca broker, connecting as anonymous with password anonymous (defaults) to obtain announcements about files in the http://dd.weather.gc.ca/model_gem_global/25km/grib2 directory. All files which arrive in that directory or below it will be downloaded into the current directory (or just printed to standard output if the -n option was specified).

A variety of example configuration files are available here:

http://sourceforge.net/p/metpx/git/ci/master/tree/sarracenia/samples/config/

for more details, see: sr_config(7)

QUEUES and MULTIPLE STREAMS

When executed, sr_subscribe chooses a queue name, which it writes to a file named after the configuration file given as an argument to sr_subscribe, with a .queue suffix (.<configfile>.queue). If sr_subscribe is stopped, the posted messages continue to accumulate on the broker in the queue. When the program is restarted, it uses the queue name stored in that file to connect to the same queue, and so does not lose any messages.

File downloads can be parallelized by running multiple sr_subscribe processes using the same queue. The processes will share the queue and each will download part of what has been selected. Simply launch multiple instances of sr_subscribe in the same user/directory using the same configuration file.

You can also run several sr_subscribe processes with different configuration files to have multiple download streams delivering into the same directory, and each download stream can be multi-streamed as well.

RABBITMQ LOGGING

For each download, by default, an AMQP report message is sent back to the broker. This is controlled by the option:

  • report_back <boolean> (default: True)

These reports are used for delivery tuning and for data sources to generate statistical information. Set this option to False, to prevent generation of reports for this usage.

ADVANCED FEATURES

Scripts can be inserted into the flow of messages and file downloads, to carry out tasks at various points in the execution of the program:

  • do_download <script> (default: None)
  • on_message <script> (default: msg_log)
  • on_file <script> (default: file_log)
  • on_parts <script> (default: None)

A do_nothing.py script for on_message, on_file, and on_part could be (this one being for on_file):

class Transformer(object):
     def __init__(self):
         pass

     def perform(self,parent):
         logger = parent.logger

         logger.info("I have no effect but adding this log line")

         return True

transformer  = Transformer()
self.on_file = transformer.perform

The only argument the script receives is parent, which is an instance of the sr_subscribe class. Should one of these scripts return False, the processing of the message/file will stop there and another message will be consumed from the broker.
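As a sketch of a script that makes use of the False return value, the following on_file plugin stops processing of empty files. It follows the shape of the do_nothing.py example; the class name SkipEmptyFiles is hypothetical, and it assumes that parent.local_file (mentioned under destfn_script) holds the path of the downloaded file when on_file runs.

```python
import os


class SkipEmptyFiles(object):
    def __init__(self):
        pass

    def perform(self, parent):
        logger = parent.logger
        # Assumption: parent.local_file is the path of the file just written.
        if os.path.getsize(parent.local_file) == 0:
            logger.warning("empty file ignored: %s", parent.local_file)
            return False  # stop processing; the next message is consumed
        return True  # continue normal processing


skip_empty = SkipEmptyFiles()
# In a real plugin file, the hook would be registered as in do_nothing.py:
# self.on_file = skip_empty.perform
```

The registration line is left as a comment because self only exists when the script is loaded by sr_subscribe.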

for more details, see: sr_config(7)

DEPRECATED SETTINGS

These settings pertain to previous versions of the client, and have been superseded.

  • host <broker host> (unsupported)
  • amqp-user <broker user> (unsupported)
  • amqp-password <broker pass> (unsupported)
  • http-user <url user> (now in credentials.conf)
  • http-password <url pass> (now in credentials.conf)
  • topic <amqp pattern> (deprecated)
  • exchange_type <type> (default: topic)
  • exchange_key <amqp pattern> (deprecated)
  • lock <locktext> (renamed to inflight)

SEE ALSO

sr_config(7) - the format of configurations for MetPX-Sarracenia.

sr_report(7) - the format of report messages.

sr_report(1) - process report messages.

sr_post(1) - post announcements of specific files.

sr_post(7) - The format of announcement messages.

sr_sarra(1) - Subscribe, Acquire, and ReAdvertise tool.

sr_watch(1) - the directory watching daemon.

http://metpx.sf.net/ - sr_subscribe is a component of MetPX-Sarracenia, the AMQP based data pump.

SUNDEW COMPATIBILITY OPTIONS

For compatibility with sundew, there are some additional delivery options which can be specified.

destfn_script <script> (default:None)

This option defines a script to be run when everything is ready for the delivery of the product. The script receives the sr_sender class instance. The script takes the parent as an argument, and for example, any modification to parent.local_file will change the name of the file written locally.

filename <keyword> (default:WHATFN)

Support for this option comes from metpx-sundew, and gives all sorts of possibilities for setting the remote filename. Some keywords rely on the fact that metpx-sundew filenames are strings of five (or six) fields separated by colons. The possible keywords are:

WHATFN
  • the first part of the sundew filename (string before first :)
HEADFN
  • HEADER part of the sundew filename
SENDER
  • the sundew filename may end with a string SENDER=<string> in this case the <string> will be the remote filename
NONE
  • deliver with the complete sundew filename (without :SENDER=...)
NONESENDER
  • deliver with the complete sundew filename (with :SENDER=...)
TIME
  • time stamp appended to filename. Example of use: WHATFN:TIME
DESTFN=str
  • direct filename declaration str
SATNET=1,2,3,A
  • cmc internal satnet application parameters
DESTFNSCRIPT=script.py
  • invoke a script (same as destfn_script) to generate the name of the file to write

accept <regexp pattern> [<keyword>]

A keyword can be added to the accept option. The keyword is any one of the filename options. A message that matches the accept regexp pattern will have this keyword option applied to its remote_file. This keyword has priority over the preceding filename option.

The regexp pattern can be used to set directory parts if part of the pattern is put in parentheses. sr_sender can use these parts to build the directory name. The string matched by the first pair of parentheses replaces the keyword ${0} in the directory name, the second replaces ${1}, etc.

example of use:

filename NONE

directory /this/first/target/directory

accept .*file.*type1.*

directory /this/target/directory

accept .*file.*type2.*

accept .*file.*type3.*  DESTFN=file_of_type3

directory /this/${0}/pattern/${1}/directory

accept .*(2016....).*(RAW.*GRIB).*

A message selected by the first accept would be delivered unchanged to the first directory. A message selected by the second accept would be delivered unchanged to the second directory. A message selected by the third accept would be renamed "file_of_type3" in the second directory. A message selected by the fourth accept would be delivered unchanged to a directory named /this/20160123/pattern/RAW_MERGER_GRIB/directory if the message had a notice like:

20150813161959.854 http://this.pump.com/ relative/path/to/20160123_product_RAW_MERGER_GRIB_from_CMC
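The ${0}, ${1} substitution of the fourth accept can be sketched in Python. The function expand_directory is hypothetical, and matching the pattern against the whole notice with re.match is an assumption; the sketch only shows how parenthesized groups map onto the numbered keywords.

```python
import re


def expand_directory(template, accept_pattern, notice):
    """Fill ${0}, ${1}, ... in template from the groups of accept_pattern."""
    match = re.match(accept_pattern, notice)
    if match is None:
        return None  # the accept pattern did not select this notice
    result = template
    for index, value in enumerate(match.groups()):
        # Group numbering in the directory keywords starts at 0.
        result = result.replace("${%d}" % index, value or "")
    return result


notice = ("20150813161959.854 http://this.pump.com/ "
          "relative/path/to/20160123_product_RAW_MERGER_GRIB_from_CMC")
print(expand_directory("/this/${0}/pattern/${1}/directory",
                       ".*(2016....).*(RAW.*GRIB).*", notice))
```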

HISTORY

Dd_subscribe was initially developed for dd.weather.gc.ca, an Environment Canada website where a wide variety of meteorological products are made available to the public. It is from the name of this site that the sarracenia suite takes the dd_ prefix for its tools. The initial version was deployed in 2013 on an experimental basis. The following year, support of checksums was added, and in the fall of 2015, the feeds were updated to v02.

In 2007, when the MetPX was originally open sourced, the staff responsible were part of Environment Canada. In honour of the Species At Risk Act (SARA), to highlight the plight of disappearing species which are not furry (the furry ones get all the attention) and because search engines will find references to names which are more unusual more easily, the original MetPX WMO switch was named after a carnivorous plant on the Species At Risk Registry: The Thread-leaved Sundew.

The organization behind MetPX has since moved to Shared Services Canada, but when it came time to name a new module, we kept with a theme of carnivorous plants, and chose another one indigenous to some parts of Canada: Sarracenia, any of a variety of insectivorous pitcher plants. We like plants that eat meat!

dd_subscribe Renaming

The new module (MetPX-Sarracenia) has many components, and is used for more than distribution and for more than one web site. The dd_ prefix also caused confusion for sys-admins, who thought it was associated with the dd(1) command (to convert and copy files). So, we switched all the components to use the sr_ prefix.