<link href="css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="css/metpx-sidebar.css" rel="stylesheet" type="text/css"/> </head> <body data-offset="15" data-spy="scroll" data-target="#sidenav"> <nav class="navbar navbar-default navbar-static-top"> <div class="container"> <div class="navbar-header"> <button aria-controls="navbar" aria-expanded="false" class="navbar-toggle collapsed" data-target="#navbar" data-toggle="collapse" type="button"> <span class="sr-only">Toggle navigation</span> <span class="icon-bar"/> <span class="icon-bar"/> <span class="icon-bar"/> </button> <a class="navbar-brand" href="index-e.html"><strong>MetPX</strong></a> </div> <div class="navbar-collapse collapse" id="navbar"> <ul class="nav navbar-nav"> <li class="dropdown"> <a aria-expanded="false" aria-haspopup="true" class="dropdown-toggle" data-toggle="dropdown" href="#" role="button">Sundew <span class="caret"/></a> <ul class="dropdown-menu"> <li><a href="https://github.com/MetPX/Sundew/blob/doc/sundew-e.rst">Overview</a></li> <li><a href="https://github.com/MetPX/Sundew/tree/doc">Documentation</a></li> <li><a href="https://github.com/MetPX/Sundew">Download</a></li> <li><a href="https://github.com/MetPX/Sundew">Git Source and Issues</a></li> </ul> </li> <li class="dropdown"> <a aria-expanded="false" aria-haspopup="true" class="dropdown-toggle" data-toggle="dropdown" href="#" role="button">Sarracenia <span class="caret"/></a> <ul class="dropdown-menu"> <li><a href="https://github.com/MetPX/sarracenia/blob/master/doc/sarra.rst">Overview</a></li> <li><a href="https://github.com/MetPX/sarracenia/blob/master/doc/CHANGES.rst">Release Notes</a></li> <li><a href="https://github.com/MetPX/sarracenia/blob/master/doc/UPGRADING.rst">Upgrade Guide</a></li> <li><a href="https://github.com/MetPX/sarracenia/blob/master/doc/sr_subscribe.1.rst#documentation">Documentation</a></li> <li><a href="https://github.com/MetPX/sarracenia/blob/master/doc/Install.html">Download</a></li> <li><a 
href="https://github.com/MetPX/sarracenia">Git Source and Issues</a></li> </ul> </li> </ul> <ul class="nav navbar-nav navbar-right"> <li><a href="index-f.html">Français</a></li> </ul> </div>  </div>  </nav><div class="container"> <div class="row"> <nav class="col-md-3" id="sidenav"> <ul class="nav nav-underline nav-stacked hidden-xs hidden-sm" data-offset-bottom="200" data-offset-top="20" data-spy="affix" id="sidebar"> <li><a class="reference internal" href="#data-sources" id="id2">Data Sources</a><ul class="nav nav-underline sub-nav"> <li><a class="reference internal" href="#injecting-data-into-a-metpx-sarracenia-pump-network" id="id3">Injecting Data into a MetPX-Sarracenia Pump Network</a><ul> <li><a class="reference internal" href="#revision-record" id="id4">Revision Record</a></li> <li><a class="reference internal" href="#sftp-injection" id="id5">SFTP Injection</a></li> <li><a class="reference internal" href="#http-injection" id="id6">HTTP Injection</a></li> <li><a class="reference internal" href="#polling-external-sources" id="id7">Polling External Sources</a></li> <li><a class="reference internal" href="#report-messages" id="id8">Report Messages</a></li> <li><a class="reference internal" href="#large-files" id="id9">Large Files</a></li> <li><a class="reference internal" href="#reliability-and-checksums" id="id10">Reliability and Checksums</a></li> <li><a class="reference internal" href="#user-headers" id="id11">User Headers</a><ul> <li><a class="reference internal" href="#efficiency-considerations" id="id12">Efficiency Considerations</a></li> </ul> </li> </ul> </li> </ul> </li> <li><a class="reference internal" href="#quickly-announcing-very-large-trees-on-linux" id="id13">Quickly Announcing Very Large Trees On Linux</a></li> </ul></nav>  <div class="col-md-9"> <div class="section" id="data-sources"> <h1 class="page-header">Data Sources</h1> <div class="section" id="injecting-data-into-a-metpx-sarracenia-pump-network"> <h2>Injecting Data into a 
MetPX-Sarracenia Pump Network</h2> <div class="admonition warning"> <p class="last"><strong>FIXME</strong>: Missing sections are highlighted by <strong>FIXME</strong>. What is here should be accurate!</p> </div> <div class="admonition note"> <p class="last"><strong>FIXME</strong>: known missing elements: good discussion of checksum choice. Caveat about file update strategies. Use case of a file that is constantly updated, rather than issuing new files.</p> </div> <div class="section" id="revision-record"> <h3>Revision Record</h3> <table class="docutils field-list" frame="void" rules="none"> <col class="field-name"/> <col class="field-body"/> <tbody valign="top"> <tr class="field"><th class="field-name">version:</th><td class="field-body"> 2.18.05b4</td> </tr> <tr class="field"><th class="field-name">date:</th><td class="field-body">June 2018</td> </tr> </tbody> </table>
<p>A Sarracenia data pump is a web (or sftp) server with notifications that let subscribers know quickly when new data has arrived. To find out what data is already available on a pump, view the tree with a web browser. For simple immediate needs, one can download data using the browser itself, or a standard tool such as wget. The usual intent is for sr_subscribe to automatically download the data wanted to a directory on a subscriber machine where other software can process it. Note that this manual uses subscriptions to test data injection, so the subscriber guide should likely be read before this one.</p>
<p>Regardless of how it is done, injecting data means telling the pump where the data is so that it can be forwarded to and/or by the pump. This can be done either by using the active and explicit sr_post command, or just by using sr_watch on a directory. Where there are large numbers of files, and/or tight timeliness constraints, invocation of sr_post directly by the producer of the file is optimal, as sr_watch may provide disappointing performance.
Another explicit, but low-frequency, approach is the sr_poll command, which allows one to query remote systems to pull data into the network efficiently.</p>
<p>While sr_watch is written as an optimal directory watching system, there simply is no quick way to watch large (say, more than 100,000 files) directory trees. On dd.weather.gc.ca, as an example, there are 60 million files in about a million directories. To walk that directory tree once takes several hours. To find new files, the best temporal resolution is every few (say 3) hours. So on average notification will occur 1.5 hours after the file has shown up. Using inotify (on Linux), it still takes several hours to start up, because it needs to do an initial file tree walk to set up all the watches. After that it will be instant, but if there are too many files (and 60 million is very likely too many) it will just crash and refuse to work. These are inherent limitations of watching directories, no matter how it is done. If it is really necessary to do this, there is hope. Please consult <a class="reference internal" href="#quickly-announcing-very-large-trees-on-linux">Quickly Announcing Very Large Trees On Linux</a>.</p>
<p>With sr_post, the program that puts the file anywhere in the arbitrarily deep tree<a class="footnote-reference" href="#id1">[1]</a> tells the pump (which will tell subscribers) exactly where to look. There are no system limits to worry about. That’s how dd.weather.gc.ca works, and notifications are sub-second, with 60 million files on the disk. It is much more efficient, in general, to do direct notifications rather than pass by the indirection of the file system, but in small and simple cases, it makes little practical difference.</p>
<p>In the simplest case, the pump takes data from your account, wherever you have it, provided you give it permission.
We describe that case first.</p> <table class="docutils footnote" frame="void" id="id1" rules="none"> <colgroup><col class="label"/><col/></colgroup> <tbody valign="top"> <tr><td class="label">[1]</td><td>While the file tree itself has no limits in depth or number, the ability to filter based on <em>topics</em> is limited by AMQP to 255 characters. So the <em>subtopic</em> configuration item is limited to somewhat less than that. There isn't a fixed limit because topics are UTF-8 encoded, which is variable length. Note that the <em>subtopic</em> directive is meant to provide coarse classification, and use of <em>accept/reject</em> is meant for more detailed work. <em>accept/reject</em> clauses do not rely on AMQP headers, using path names stored in the body of the message, and so are not affected by this limit.</td></tr> </tbody> </table> </div> <div class="section" id="sftp-injection"> <h3>SFTP Injection</h3>
<p>Using the sr_post(1) command directly is the most straightforward way to inject data into the pump network. To use sr_post, you have to know:</p> <ul class="simple"> <li>the name of the local broker (say: ddsr.cmc.ec.gc.ca)</li> <li>your authentication info for that broker (say: user=rnd, password=rndpw)</li> <li>your own server name (say: grumpy.cmc.ec.gc.ca)</li> <li>your own user name on your server (say: peter)</li> </ul>
<p>Assume the goal is for the pump to access peter's account via SFTP. Then you need to take the pump's public key, and place it in peter's .ssh/authorized_keys. On the server you are using (<em>grumpy</em>), one needs to do something like this:</p> <pre class="literal-block">
wget -O - http://ddsr.cmc.ec.gc.ca/config/pump.pub >>~peter/.ssh/authorized_keys
</pre> <div class="admonition warning"> <p class="last"><strong>FIXME</strong>: this config directory is not implemented yet. For now, one needs to get the public key by talking to an admin.</p> </div>
<p>This will enable the pump to access peter's account on grumpy using its private key.
So assuming one is logged in to peter's account on grumpy, one can store the broker credentials safely:</p> <pre class="literal-block">
echo 'amqps://rnd:rndpw@ddsr.cmc.ec.gc.ca' >> ~/.config/sarra/credentials.conf
</pre> <div class="admonition note"> <p class="last">Passwords are always stored in the credentials.conf file.</p> </div>
<p>So now the command line for sr_post is just the URL for ddsr to retrieve the file on grumpy:</p> <pre class="literal-block">
sr_post -post_broker amqp://guest:guest@localhost/ -post_base_dir /var/www/posts/ \
        -post_base_url http://localhost:81/frog.dna

2016-01-20 14:53:49,014 [INFO] Output AMQP broker(localhost) user(guest) vhost(/)
2016-01-20 14:53:49,019 [INFO] message published :
2016-01-20 14:53:49,019 [INFO] exchange xs_guest topic v02.post.frog.dna
2016-01-20 14:53:49,019 [INFO] notice 20160120145349.19 http://localhost:81/ frog.dna
2016-01-20 14:53:49,020 [INFO] headers parts=1,16,1,0,0 sum=d,d108dcff28200e8d26d15d1b3dfeac1c to_clusters=localhost
</pre>
<p>An sr_subscribe instance is subscribed to all <tt class="docutils literal">*.dna</tt> posts.
Here is the config file:</p> <pre class="literal-block">
broker amqp://guest:guest@localhost
directory /var/www/subscribed
subtopic #
accept .*dna*
</pre>
<p>Here is the related output from the subscribe log file:</p> <pre class="literal-block">
2016-01-20 14:53:49,418 [INFO] Received notice 20160120145349.19 http://grumpy:80/ 20160120/guest/frog.dna
2016-01-20 14:53:49,419 [INFO] downloading/copying into /var/www/subscribed/frog.dna
2016-01-20 14:53:49,420 [INFO] Downloads: http://grumpy:80/20160120/guest/frog.dna into /var/www/subscribed/frog.dna 0-16
2016-01-20 14:53:49,424 [INFO] 201 Downloaded : v02.report.20160120.guest.frog.dna 20160120145349.19 http://grumpy:80/ 20160120/guest/frog.dna 201 sarra-server-trusty guest 0.404653 parts=1,16,1,0,0 sum=d,d108dcff28200e8d26d15d1b3dfeac1c from_cluster=test_cluster source=guest to_clusters=test_cluster rename=/var/www/subscribed/frog.dna message=Downloaded
</pre>
<p>Or alternatively, here is the log from an sr_sarra instance:</p> <pre class="literal-block">
2016-01-20 14:53:49,376 [INFO] Received v02.post.frog.dna '20160120145349.19 http://grumpy:81/ frog.dna' parts=1,16,1,0,0 sum=d,d108dcff28200e8d26d15d1b3dfeac1c to_clusters=test_cluster
2016-01-20 14:53:49,377 [INFO] downloading/copying into /var/www/test/20160120/guest/frog.dna
2016-01-20 14:53:49,377 [INFO] Downloads: http://grumpy:81/frog.dna into /var/www/test/20160120/guest/frog.dna 0-16
2016-01-20 14:53:49,380 [INFO] 201 Downloaded : v02.report.frog.dna 20160120145349.19 http://grumpy:81/ frog.dna 201 sarra-server-trusty guest 0.360282 parts=1,16,1,0,0 sum=d,d108dcff28200e8d26d15d1b3dfeac1c from_cluster=test_cluster source=guest to_clusters=test_cluster message=Downloaded
2016-01-20 14:53:49,381 [INFO] message published :
2016-01-20 14:53:49,381 [INFO] exchange xpublic topic v02.post.20160120.guest.frog.dna
2016-01-20 14:53:49,381 [INFO] notice 20160120145349.19 http://grumpy:80/ 20160120/guest/frog.dna @
</pre>
<p>The command asks ddsr to retrieve the
treefrog/frog.dna file by logging in to grumpy as peter (using the pump's private key), and then to post it on the pump, for forwarding to the other pump destinations.</p>
<p>Similar to sr_subscribe, one can also place configuration files in an sr_post specific directory:</p> <pre class="literal-block">
blacklab% sr_post edit dissem.conf

broker amqps://rnd@ddsr.cmc.ec.gc.ca/
to DDIEDM,DDIDOR,ARCHPC
url sftp://peter@grumpy
</pre> <p>and then:</p> <pre class="literal-block">
sr_post -c dissem -url treefrog/frog.dna
</pre>
<p>If there are different varieties of posting used, configurations can be saved for each one.</p> <div class="admonition warning"> <p><strong>FIXME</strong>: Need to do a real example. This made-up stuff isn't sufficiently helpful.</p> <p><strong>FIXME</strong>: sr_post does not accept config files right now, says the man page. True/False?</p> <p class="last">sr_post command lines could be a lot simpler if it did.</p> </div>
<p>sr_post typically returns immediately, as its only job is to advise the pump of the availability of files. The files have not yet been transferred when sr_post returns, so one should not delete files after posting without being sure the pump actually picked them up.</p> <div class="admonition note"> <p>sftp is perhaps the simplest for the user to implement and understand, but it is also the most costly in terms of CPU on the server. All of the work of data transfer is done at the Python application level when sftp acquisition is done, which isn't great.</p> <p class="last">A lower-CPU version would be for the client to send the file somehow (sftp?)
and then just tell where the file is on the pump (basically the sr_sender2 version.)</p> </div>
<p>Note that this example used sftp, but if the file is available on a local web site, then http would work, or if the data pump and the source server share a file system, then even a file url could work.</p> </div> <div class="section" id="http-injection"> <h3>HTTP Injection</h3>
<p>If we take a similar case, but in this case there is some http accessible space, the steps are the same or even simpler if no authentication is required for the pump to acquire the data. One needs to install a web server of some kind.</p>
<p>Assume a configuration that shows all files under /var/www as folders, running under the www-data user. Data posted in such directories must be readable by the www-data user, to allow the web server to read it. The server running the web server is called <em>blacklab</em>, and the user on the server is <em>peter</em>. Running as peter on blacklab, a directory that is writable by peter is created under /var/www/project/outgoing, which results in a configuration like so:</p> <pre class="literal-block">
sr_watch edit project.conf

broker amqp://feeder@localhost/
url http://blacklab/
post_base_dir /var/www/project/outgoing
</pre> <p>Then a watch is started:</p> <pre class="literal-block">
sr_watch start project
</pre> <div class="admonition warning"> <p><strong>FIXME</strong>: real example.</p> <dl class="last docutils"> <dt><strong>FIXME</strong>: sr_watch was supposed to take configuration files, but might not have</dt> <dd>been modified to that effect yet.</dd> </dl> </div>
<p>While sr_watch is running, any time a file is created in the <em>document_root</em> directory, it will be announced to the pump (on localhost, i.e.
the server blacklab itself):</p> <pre class="literal-block">
cp frog.dna /var/www/project/outgoing
</pre> <div class="admonition warning"> <p class="last"><strong>FIXME</strong>: real example.</p> </div>
<p>This triggers a post to the pump. Any subscribers will then be able to download the file.</p> <div class="admonition warning"> <p class="last"><strong>FIXME</strong>: too much is broken for now to really run this easily, so creating a real demo is deferred.</p> </div> </div> <div class="section" id="polling-external-sources"> <h3>Polling External Sources</h3>
<p>Some sources are inherently remote, and we are unable to interest or affect them. One can configure sr_poll to pull in data from external sources, typically web sites. The sr_poll command typically runs as a singleton that tracks what is new at a source tree and creates source messages for the pump network to process.</p>
<p>External servers, especially web servers, often have different ways of posting their product listings, so custom processing of the list is often needed. That is why sr_poll has the do_poll setting, meaning that use of a plugin script is virtually required to use it.</p> <div class="admonition note"> <p class="last">See the poll_script included in the package plugins directory for an example. <strong>FIXME</strong>:</p> </div> </div> <div class="section" id="report-messages"> <h3>Report Messages</h3>
<p>If the sr_post worked, that means the pump agreed to take a look at your file. To find out where your data goes afterward, one needs to examine source log messages. It is also important to note that the initial pump, or any other pump downstream, may refuse to forward your data for various reasons, which will only be reported to the source in these report messages.</p>
<p>To view source report messages, use the sr_report command, which is just a version of sr_subscribe with the same options where they make sense.
If the configuration file (~/.config/sarra/default.conf) is set up, then all that is needed is:</p> <pre class="literal-block">
sr_report
</pre>
<p>This shows report messages indicating what has happened to the items inserted into the network from the same pump using that account (rnd, in the example). One can trigger arbitrary post-processing of report messages by using an on_message plugin.</p> <div class="admonition warning"> <p class="last"><strong>FIXME</strong>: need some examples.</p> </div> </div> <div class="section" id="large-files"> <h3>Large Files</h3>
<p>Larger files are not sent as a single block. They are sent in parts, and each part is fingerprinted, so that when files are updated, unchanged portions are not sent again. There is a default threshold built into the sr_ commands, above which partitioned announcements will be done by default. This threshold can be adjusted to taste using the <em>part_threshold</em> option.</p>
<p>Different pumps along the route may have different maximum part sizes. To traverse a given path, the part must be no larger than the threshold setting of all the intervening pumps. A pump will send the source an error log message if it refuses to forward a file.</p>
<p>As each part is announced, there is a corresponding report message for each part. This allows senders to monitor the progress of delivery of large files.</p> </div> <div class="section" id="reliability-and-checksums"> <h3>Reliability and Checksums</h3>
<p>Every piece of data injected into the pumping network needs to have a unique fingerprint (or checksum). Data will flow if it is new, and determining if the data is new is based on the fingerprint. To get reliability in a sarracenia network, multiple independent sources are provisioned.
Each source announces its products, and if they have the same name and fingerprint, then the products are considered the same.</p>
<p>The sr_winnow component of sarracenia looks at incoming announcements and notes which products are received (by file name and checksum). If a product is new, it is forwarded on to other components for processing. If a product is a duplicate, then the announcement is not forwarded further. Similarly, when sr_subscribe or sr_sarra components receive an announcement for a product that is already present on the local system, they will examine the fingerprint and not download the data unless it is different. Checksum methods need to be known across a network, as downstream components will re-apply them.</p>
<p>Different fingerprinting algorithms are appropriate for different types of data, so the algorithm to apply needs to be chosen by the data source, and not imposed by the network. Normally, the 'd' algorithm is used, which applies the well-known Message-Digest 5 (md5sum) algorithm to the data in the file.</p>
<p>When there is one origin for data, this algorithm works well. For high availability, production chains will operate in parallel, preferably with no communication between them. Items produced by independent chains may naturally have different processing times, log stamps and serial numbers applied, so the same data processed through different chains will not be identical at the binary level. For products produced by different production chains to be accepted as equivalent, they need to have the same fingerprint.</p>
<p>One solution for that case, if the two processing chains will produce data with the same name, is to checksum based on the file name instead of the data. This is called 'n'. In many cases, the names themselves are production chain dependent, so a custom algorithm is needed.
If a custom algorithm is chosen, it needs to be published on the network:</p> <pre class="literal-block"> http://dd.cmc.ec.gc.ca/config/msc-radar/sums/ u.py </pre>
<p>This way, downstream clients can obtain and apply the same algorithm to compare announcements from multiple sources.</p> <div class="admonition warning"> <p><strong>FIXME</strong>: science fiction again: no such config directories exist yet. no means to update them. search path for checksum algos? built-in,system-wide,per-source?</p> <p>Also, if each source defines their own algorithm, then they need to pick the same one (with the same name) in order to have a match.</p> <p><strong>FIXME</strong>: verify that fingerprint verification includes matching the algorithm as well as value.</p> <p class="last"><strong>FIXME</strong>: not needed at the beginning, but likely at some point. In the meantime, we just talk to people and include their algorithms in the package.</p> </div> <div class="admonition note"> <p class="last">Fingerprint methods that are based on the name, rather than the actual data, will cause the entire file to be re-sent when it is updated.</p> </div> </div> <div class="section" id="user-headers"> <h3>User Headers</h3>
<p>What if there is some piece of metadata that a data source has chosen for some reason not to include in the filename hierarchy? How can data consumers know that information without having to download the file in order to determine that it is uninteresting? An example would be weather warnings. The file names might include weather warnings for an entire country. If consumers are only interested in downloading warnings that are local to them, then a data source could use the on_post hook to add additional headers to the message.</p> <div class="admonition note"> <p class="last">With great flexibility comes great potential for harm. The path names should include as much information as possible, as sarracenia is built to optimize routing using them.
Additional meta-data should be used to supplement, rather than replace, the built-in routing.</p> </div>
<p>To add headers to messages being posted, one can use the header option. In a configuration file, add the following statements:</p> <pre class="literal-block">
header CAP_province=Ontario
header CAP_area-desc=Uxbridge%20-%20Beaverton%20-%20Northern%20Durham%20Region
header CAP_polygon=43.9984,-79.2175 43.9988,-79.219 44.2212,-79.3158 44.4664,-79.2343 44.5121,-79.1451 44.5135,-79.1415 44.5136,-79.1411 44.5137,-79.1407 44.5138,-79.14 44.5169,-79.0917 44.517,-79.0879 44.5169,-79.0823 44.218,-78.7659 44.0832,-78.7047 43.9984,-79.2175
</pre>
<p>When a file advertisement is posted, it will then include the headers with the given values. This example is artificial in that it statically assigns the header values, which is appropriate only for simple cases. For this specific case, it is likely more appropriate to implement a specialized on_post plugin for Common Alerting Protocol files to extract the above header information and place it in the message headers for each alert.</p> <div class="section" id="efficiency-considerations"> <h4>Efficiency Considerations</h4>
<p>It is not recommended to put overly complex logic in the plugin scripts, as they execute synchronously with post and receive operations. Note that the use of built-in facilities of AMQP (headers) is done explicitly to be as efficient as possible. As an extreme example, including encoded XML in messages will not affect performance slightly; it will slow processing by orders of magnitude. One will not be able to compensate for it with multiple instances, as the penalty is simply too large to overcome.</p>
<p>Consider, for example, Common Alerting Protocol (CAP) messages for weather alerts. These alerts routinely exceed 100 KBytes in size, whereas a sarracenia message is on the order of 200 bytes.
The sarracenia messages go to many more recipients than the alert itself: they go to anyone considering downloading an alert, as opposed to just the subscribers who are actually interested in it. This metadata will also be included in the report messages, and so replicated in many additional locations where the data itself will not be present.</p>
<p>Including all the information that is in the CAP would mean, just in terms of pure transport, 500 times more capacity used for a single message. When there are many millions of messages to transfer, this adds up. Only the minimal information required by the subscriber to make the decision to download or not should be added to the message. It should also be noted that in addition to the above, there is typically a 10x to 100x CPU and memory penalty for parsing an XML data structure compared to a plain text representation, which will affect the processing rate.</p> </div> </div> </div> </div> <div class="section" id="quickly-announcing-very-large-trees-on-linux"> <h1 class="page-header">Quickly Announcing Very Large Trees On Linux</h1>
<p>Tools like rsync or find take too long to traverse very large trees (millions of files) and generate lists of files to copy, so they cannot mirror such trees in real time. On Linux, one can intercept calls for file operations using the well-known shim library technique. This technique provides virtually real-time announcements of files regardless of the size of the tree, with minimal overhead, as it imposes much less load than tree traversal mechanisms and makes use of the C implementation of Sarracenia, which uses very little memory or processor resources.</p>
<p>To use this technique, one needs to have the C implementation of Sarracenia installed.
The libsrshim library is part of that package, and the environment needs to be configured to intercept calls to the C library like so:</p> <pre class="literal-block">
export SR_POST_CONFIG=somepost.conf
export LD_PRELOAD=libsrshim.so.1.0.0
</pre>
<p>Here <em>somepost.conf</em> is a valid configuration that can be tested with sr_post to manually post a file. Any process invoked from a shell with these settings will have all calls to routines like close(2) intercepted by libsrshim. Libsrshim will check if the file is being written, and then apply the somepost configuration (accept/reject clauses) and post the file if it is appropriate. Example:</p> <pre class="literal-block">
blacklab% more pyiotest
f=open("hoho", "w+" )
f.write("hello")
f.close()
blacklab%
blacklab% more test2.sh
echo "called with: $* "
if [ ! "${LD_PRELOAD}" ]; then
   export SR_POST_CONFIG=`pwd`/test_post.conf
   export LD_PRELOAD=`pwd`/libsrshim.so.1.0.0
   exec $0 #the exec here makes the LD_PRELOAD affect this shell, as well as sub-processes.
fi
set -x
echo "FIXME: exec above fixes ... builtin i/o like redirection not being posted!"
bash -c 'echo "hoho" >>~/test/hoho'
/usr/bin/python2.7 pyiotest
cp libsrshim.c ~/test/hoho_my_darling.txt
blacklab%
blacklab% ./test2.sh
called with:
called with:
+++ echo 'FIXME: exec above fixes ... builtin i/o like redirection not being posted!'
FIXME: exec above fixes ... builtin i/o like redirection not being posted!
+++ bash -c 'echo "hoho" >>~/test/hoho'
2017-10-21 20:20:44,092 [INFO] sr_post settings: action=foreground log_level=1 follow_symlinks=no sleep=0 heartbeat=300 cache=0 cache_file=off
2017-10-21 20:20:44,092 [DEBUG] setting to_cluster: localhost
2017-10-21 20:20:44,092 [DEBUG] post_broker: amqp://tsource:&lt;pw&gt;@localhost:5672
2017-10-21 20:20:44,094 [DEBUG] connected to post broker amqp://tsource@localhost:5672/#xs_tsource_cpost_watch
2017-10-21 20:20:44,095 [DEBUG] isMatchingPattern: /home/peter/test/hoho matched mask: accept .*
2017-10-21 20:20:44,096 [DEBUG] connected to post broker amqp://tsource@localhost:5672/#xs_tsource_cpost_watch
2017-10-21 20:20:44,096 [DEBUG] sr_post file2message called with: /home/peter/test/hoho sb=0x7ffef2aae2f0 islnk=0, isdir=0, isreg=1
2017-10-21 20:20:44,096 [INFO] published: 20171021202044.096 sftp://peter@localhost /home/peter/test/hoho topic=v02.post.home.peter.test sum=s,a0bcb70b771de1f614c724a86169288ee9dc749a6c0bbb9dd0f863c2b66531d21b65b81bd3d3ec4e345c2fea59032a1b4f3fe52317da3bf075374f7b699b10aa source=tsource to_clusters=localhost from_cluster=localhost mtime=20171021202002.304 atime=20171021202002.308 mode=0644 parts=1,2,1,0,0
+++ /usr/bin/python2.7 pyiotest
2017-10-21 20:20:44,105 [INFO] sr_post settings: action=foreground log_level=1 follow_symlinks=no sleep=0 heartbeat=300 cache=0 cache_file=off
2017-10-21 20:20:44,105 [DEBUG] setting to_cluster: localhost
2017-10-21 20:20:44,105 [DEBUG] post_broker: amqp://tsource:&lt;pw&gt;@localhost:5672
2017-10-21 20:20:44,107 [DEBUG] connected to post broker amqp://tsource@localhost:5672/#xs_tsource_cpost_watch
2017-10-21 20:20:44,107 [DEBUG] isMatchingPattern: /home/peter/src/sarracenia/c/hoho matched mask: accept .*
2017-10-21 20:20:44,108 [DEBUG] connected to post broker amqp://tsource@localhost:5672/#xs_tsource_cpost_watch
2017-10-21 20:20:44,108 [DEBUG] sr_post file2message called with: /home/peter/src/sarracenia/c/hoho sb=0x7ffeb02838b0 islnk=0, isdir=0, isreg=1
2017-10-21 20:20:44,108 
[INFO] published: 20171021202044.108 sftp://peter@localhost /c/hoho topic=v02.post.c sum=s,9b71d224bd62f3785d96d46ad3ea3d73319bfbc2890caadae2dff72519673ca72323c3d99ba5c11d7c7acc6e14b8c5da0c4663475c2e5c3adef46f73bcdec043 source=tsource to_clusters=localhost from_cluster=localhost mtime=20171021202044.101 atime=20171021202002.320 mode=0644 parts=1,5,1,0,0
+++ cp libsrshim.c /home/peter/test/hoho_my_darling.txt
2017-10-21 20:20:44,112 [INFO] sr_post settings: action=foreground log_level=1 follow_symlinks=no sleep=0 heartbeat=300 cache=0 cache_file=off
2017-10-21 20:20:44,112 [DEBUG] setting to_cluster: localhost
2017-10-21 20:20:44,112 [DEBUG] post_broker: amqp://tsource:&lt;pw&gt;@localhost:5672
2017-10-21 20:20:44,114 [DEBUG] connected to post broker amqp://tsource@localhost:5672/#xs_tsource_cpost_watch
2017-10-21 20:20:44,114 [DEBUG] isMatchingPattern: /home/peter/test/hoho_my_darling.txt matched mask: accept .*
2017-10-21 20:20:44,115 [DEBUG] connected to post broker amqp://tsource@localhost:5672/#xs_tsource_cpost_watch
2017-10-21 20:20:44,115 [DEBUG] sr_post file2message called with: /home/peter/test/hoho_my_darling.txt sb=0x7ffc8250d950 islnk=0, isdir=0, isreg=1
2017-10-21 20:20:44,116 [INFO] published: 20171021202044.115 sftp://peter@localhost /home/peter/test/hoho_my_darling.txt topic=v02.post.home.peter.test sum=s,f5595a47339197c9e03e7b3c374d4f13e53e819b44f7f47b67bf1112e4bd6e01f2af2122e85eda5da633469dbfb0eaf2367314c32736ae8aa7819743f1772935 source=tsource to_clusters=localhost from_cluster=localhost mtime=20171021202044.109 atime=20171021202002.328 mode=0644 parts=1,15117,1,0,0
blacklab%
</pre> <dl class="docutils"> <dt>Note:</dt> <dd><blockquote class="first">
<p>File re-direction of i/o resulting from shell builtins (no process spawn) in the shell where the environment variables are first set WILL NOT BE POSTED. Only sub-shells are affected:</p> <pre class="literal-block">
# will not be posted...
echo "hoho" >kk.conf

# will be posted.
bash -c 'echo "hoho" >kk.conf'
</pre>
<p>This is a limitation of the technique, as the dynamic library load order is resolved on process startup, and cannot be modified afterward. One work-around:</p> <pre class="literal-block">
if [ ! "${LD_PRELOAD}" ]; then
   export SR_POST_CONFIG=`pwd`/test_post.conf
   export LD_PRELOAD=`pwd`/libsrshim.so.1.0.0
   exec $*
fi
</pre> </blockquote> <p class="last">This activates the shim library for the calling environment by restarting it. This particular code may have an impact on command line options and may not be directly applicable.</p> </dd> </dl>
<p>As an example, we have a tree of 22 million files that is written continuously day and night. We need to copy that tree to a second file system as quickly as possible, with an aspirational maximum copy time of about five minutes.</p> </div> </div>  </div>  </div><footer class="footer"> <div class="container"> <div class="col-sm-6"> <p>Code licenced <a href="https://github.com/MetPX/sarracenia/blob/master/LICENSE">GPLv2</a></p> <p class="text-muted"> <span class="glyphicon glyphicon-copyright-mark"> </span> 2004-2011 Environment Canada<br/><span class="glyphicon glyphicon-copyright-mark"> </span> 2011-2015 Government of Canada</p> </div> <div class="col-sm-6"> <ul class="list-inline"> <li><a href="http://sourceforge.net/p/metpx">SourceForge</a></li> <li><a href="https://github.com/MetPX/sarracenia">GitHub</a></li> <li><a href="https://github.com/MetPX/sarracenia/blob/master/doc/sarra.rst">About</a></li> </ul> </div> </div> </footer><script src="js/anchor.js"> </script><script>anchors.add();</script><script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"> </script><script src="js/bootstrap.min.js"> </script></body> </html>