How to use SSH based Rsync in Cron Job


the other day I was trying to setup an automatic rsync job to sync data contents from prod to dev, due to the newly enforced policy that the same nfs storage can only be mounted in either prod or dev.

Was thinking it would be just as simple as adding a command “rsync -avh –delete src dst” to crontab, but quickly ran to below issue:

Permission denied (publickey)

Then it flashed back onto me that we use RSA for remote server login authentication and we use putty’s keygen to generate public/private key protected by a passphrase.

To avoid entering passphrase over and over again we started “ssh-gen -s” in ~/.bash_login and exported the environment variable SSH_AUTH_SOCK.

The .bash_login will also call “ssh-add ~/.ssh/id_rsa” which pops up prompt asking for passphrase for first time login and added to the password cache of ssh-agent.

For any server we only need to enter the passphrase once.

All future ssh/rsync command will know to automatically talk to ssh-agent via local domain socket given by SSH_AUTH_SOCK env variable hence becomes password free operation.

So there are several challenges to setting this up in crontab:

  • ensure we have an ssh-agent to use for fetching private key to auth to remote side, this will need to work also in the case no one logs to local dev host hence we cannot count on ~/.bash_login.
  • ssh-add normally requires terminal and prompt for passphrase, it does support external program for using in terminal-less environment like in crontab’s minimized execution env.
  • also if the cron job needs to run frequently we cannot create too many copy of ssh-agent


Uniquely Identify ssh-agent

we need to a way to uniquely identify the ssh-agent we created for every boot.

we cannot just use hostname as file created from last boot will persist to next and may interfere with the identification.

we just use the simple solution of combining boot time with host name.

# get unique per boot file
bootTime=$(who -b|sed -e "s/^\s\+//" -e "s/\s\+/_/g" -e "s/-/_/g" -e "s/:/_/g")

the auth_sock_file contains the output from running “ssh-agent -s”, a sample run of “ssh-agent -s” is like below

$ssh-agent -s
SSH_AUTH_SOCK=/tmp/ssh-uYP3qOghCJoa/agent.9516; export SSH_AUTH_SOCK;
echo Agent pid 9517;

Check if the ssh-agent is still alive

for safety check even if the auth_sock_file exists we still need to check if ssh-agent is still running which we simply use “ps” command:

  if (( SSH_AGENT_PID > 0 ))
    liveProcess=$(ps -p ${SSH_AGENT_PID} --no-headers -o user,cmd|wc -l)
    printf "liveProcess=${liveProcess}\n"     >> ${log_file}

    if (( liveProcess > 0 ))
      printf "process ${SSH_AGENT_PID} already exists, use it as it is\n" >> ${log_file}
      printf "process ${SSH_AGENT_PID} does not exist\n"                  >> ${log_file}
      export SSH_AGENT_PID=0

Recreating ssh-agent if it is not running or first time cron job after boot

we choose to use expect to enter passphrase and read passphrase from a file

if (( SSH_AGENT_PID <= 0 ))
  printf "recreating ssh-agent\n"       >> ${log_file}

  eval `ssh-agent -s`
  printf "export SSH_AUTH_SOCK=${SSH_AUTH_SOCK};export SSH_AGENT_PID=${SSH_AGENT_PID}\n" >> ${log_file}

  passphrase=$(cat ${passphrase_file})

  # now running expect
  expect <<- EOF
  spawn ssh-add ${id_rsa_file}
  expect -re "Enter passphrase for .*:"
  send "${passphrase}\r"
  expect -re ".+"

Check if ssh-add succeeds

for safety check we still need to check if ssh-add successfully added the rsa key to the cache:

  # check if ssh-add succeeds
  ssh-add -l  >> ${log_file}
  count=$(ssh-add -l |grep -c ${id_rsa_file})

  if (( count > 0 ))
    printf "export SSH_AUTH_SOCK=${SSH_AUTH_SOCK};export SSH_AGENT_PID=${SSH_AGENT_PID}\n"  > ${auth_sock_file}

    printf "successfully added ${id_rsa_file}\n"                                           >> ${log_file}
    printf "export SSH_AUTH_SOCK=${SSH_AUTH_SOCK};export SSH_AGENT_PID=${SSH_AGENT_PID}\n" >> ${log_file}
    printf "failed to add ${id_rsa_file}\n"                                                >> ${log_file}
    kill -9 ${SSH_AGENT_PID}

    exit -1

The full code is attached below

#! /bin/bash

# first check if ssh-agent is running

# get unique per boot file
bootTime=$(who -b|sed -e "s/^\s\+//" -e "s/\s\+/_/g" -e "s/-/_/g" -e "s/:/_/g")

mkdir -p /var/tmp/${user}

if [[ -f ${auth_sock_file} ]]
  printf "now=$(date)\n"                      >> ${log_file}
  printf "auth sock file ${auth_sock_file}\n" >> ${log_file}

  source ${auth_sock_file}

  printf "SSH_AGENT_PID from file: ${SSH_AGENT_PID}\n" >> ${log_file}

  if (( SSH_AGENT_PID > 0 ))
    liveProcess=$(ps -p ${SSH_AGENT_PID} --no-headers -o user,cmd|wc -l)
    printf "liveProcess=${liveProcess}\n"     >> ${log_file}

    if (( liveProcess > 0 ))
      printf "process ${SSH_AGENT_PID} already exists, use it as it is\n" >> ${log_file}
      printf "process ${SSH_AGENT_PID} does not exist\n"                  >> ${log_file}
      export SSH_AGENT_PID=0
  printf "now=$(date)\n"                        > ${log_file}
  printf "auth sock file ${auth_sock_file}\n"  >> ${log_file}

if (( SSH_AGENT_PID <= 0 ))
  printf "recreating ssh-agent\n"       >> ${log_file}

  eval `ssh-agent -s`
  printf "export SSH_AUTH_SOCK=${SSH_AUTH_SOCK};export SSH_AGENT_PID=${SSH_AGENT_PID}\n" >> ${log_file}

  passphrase=$(cat ${passphrase_file})

  # now running expect
  expect <<- EOF
  spawn ssh-add ${id_rsa_file}
  expect -re "Enter passphrase for .*:"
  send "${passphrase}\r"
  expect -re ".+"

  # check if ssh-add succeeds
  ssh-add -l  >> ${log_file}
  count=$(ssh-add -l |grep -c ${id_rsa_file})

  if (( count > 0 ))
    printf "export SSH_AUTH_SOCK=${SSH_AUTH_SOCK};export SSH_AGENT_PID=${SSH_AGENT_PID}\n"  > ${auth_sock_file}

    printf "successfully added ${id_rsa_file}\n"                                           >> ${log_file}
    printf "export SSH_AUTH_SOCK=${SSH_AUTH_SOCK};export SSH_AGENT_PID=${SSH_AGENT_PID}\n" >> ${log_file}
    printf "failed to add ${id_rsa_file}\n"                                                >> ${log_file}
    kill -9 ${SSH_AGENT_PID}

    exit -1

printf "start rsync\n"              >> ${log_file}
rsync -avh --delete remote_user@remote_host:src dst         >> ${log_file} 2>&1
printf "finished rsync\n"           >> ${log_file}

printf "finished running\n\n"       >> ${log_file}

Setup Cron Job

finally setup cron job similar to below

*/5 * * * * >/dev/null 2>&1


while the simple solution works for its intended purpose, and can execute using designated user’s identify without requiring the user to login so can also be applied to special service account.

However it may not be recommended as it requires clear text passphrase to be put in a file for cron job fetching.

Leave a comment

all possible ways you can talk to Q/KDB+


Q/KDB is really popular and powerful and playing on REPL console is fun – yet there are times you have to talk to a running Q/KDB using other approaches that makes best sense to you.

In this post I will try to list different ways you can do that and how they differ from each other.

Overall there are 4 ways of communication that I am aware of, not all of them may be called “protocol” per se, but I am using the term in a less strict context:

  • IPC protocol
  • QCon protocol
  • HTTP protocol
  • WebSocket

Test Q Server

to Compare the different ways of communication I have started a Q server on a remote host and create 2 simple functions with one returning a table and another returning an integer:

KDB+ 3.6 2019.09.19 Copyright (C) 1993-2019 Kx Systems
l64/ 12(16)core 31807MB codywu codycent EXPIRE 2021.01.10 KOD #12345678

q)returnTable:{([] x:til 3;y:`a`b`c)}
q)\p 8080
q)returnTable 3
x y

0 a
1 b
2 c
q)f . (3;6)

IPC Protocol

IPC protocol is probably the most common ones we use.

when you are doing “hopen” in your client q process to open a connect to remote q server, IPC protocol is at play.

IPC protocol does some special encoding which can be easily reproduced with q’s “-8!” instruction.

Details can be found at:

Example of using hopen from a client q process, including intentionally causing errors.

q)h:hopen `:codycent:8080:codywu:password
q)h (`returnTable;3)
x y
0 a
1 b
2 c
q)h (`returnTab;3)
[0] h (`returnTab;3)
q)h (`f;3;6)
q)h "f[9][6]"

if we look at the pcap on we can see the bytes showing exactly what we expect IPC would encode:


as you can see the “returnTable 3” is encoded similar to “-8!” encoding


The only difference is the second byte which indicates request and response, “-8!” simply encodes it as 0x00 yet in the sync request it will be set to 0x1.

similarly there are other language bindings built on IPC protocol, most notably we have C, Node JS and Python.

C binding is described here: which uses a header file provided by KDB for serialization and deserialization.

I will skip C as there are good examples on that “Interface” page, instead I will briefly cover NodeJS and Python.

for NodeJS, there is a “node-q” package available at:

the code leverage on “c.js” provided by KDB from:

sample code for accessing Q server using node-q is like below:

let nodeq = require("node-q");

nodeq.connect({host: "codycent", port: 8080}, (err, con) => {
if (err) throw err;
// interact with con like demonstrated below

con.k("returnTable 3", (err, res) => {
if (err) throw err;
console.log("result:", res); // 6

and by running it we see result similar to below:

$node test.js
result: [ { x: 0, y: 'a' }, { x: 1, y: 'b' }, { x: 2, y: 'c' } ]

for Python, we have a less frequently used project providing

one can connect to remote Q server by simply:

python 2.7.16 (default, Nov 9 2019, 05:55:08)
[GCC 4.2.1 Compatible Apple LLVM 11.0.0 (clang-1100.0.32.4) (-macos10.15-objc-s on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path.insert(0, "/Users/codywu/qpy")
>>> import kdb
>>> conn=kdb.q('codycent', 8080, 'codywu')
>>> conn.k('f[3][6]')
>>> r = conn.k('returnTable 3')
>>> r
>>> dir(r)
['__doc__', '__eq__', '__getitem__', '__init__', '__iter__', '__len__', '__module__', '__str__', 'index', 'length', 'next', 'x', 'y']
>>> r.x
['x', 'y']
>>> r.y
[[0, 1, 2], ['a', 'b', 'c']]

more resource can be found on:

QCon Protocol

qcon is another interesting program that allows you to connect to remote Q server and execute anything as if you are directly executing on remote Q process.

so you no longer have to worry about using handle explicitly to send your request – qcon emulate it for you under the cover and echo back the response when received from q server.

Example of using qcon:

$qcon codycent:8080:codywu:password
codycent:8080>returnTable 3
x y
0 a
1 b
2 c

if we pcap the traffic it shows something different from IPC, for request the request string is sent as plain string together with username:password:


for response it also comes back as plain string without any encoding:


another observation is qcon seems to connection/disconnect for every request hence we see many SYN/FIN in the pcap.

there does not seem to be any documentation online on qcon protocol so there is only some guess based on pcap.

HTTP Protocol

Q comes naturally with HTTP protocol support and one simply connect to the same port using any decent browser.

Q will respond following standard HTTP response.

If we simply direct browser at host:port we get a list of available variables.

I will first create 2 global variables on the q server:

q)v: til 10

below is what one can see when connecting:

one can also use “?” to run query:



pretty neat!

if we examine the pcap it shows all data communicated as plain string using standard http GET request/response:


some good reference can be found:


websocket is another leap over HTTP protocol as it provides more flexibility especially on certain use cases where server needs to constantly “push” data to the client.

This helps the clients as clients no longer have to keep polling the server for data and with one connection client and server can properly sequence the request/response to workaround the case where multiple HTTP connection might be running in parallel as typically in the case of AJAX.

WebSocket is layered on TCP, though it starts with HTTP request then ugprade to WS.

It comes with its own framing hence it is message based instead of byte stream based.

It also provides its own masking and compression but it is not for security purposes, rather to mitigate some attacks on existing socket cache servers.

the protocol is detailed in:

the diagram that makes sense to us is the below framing:


note however, websocket is just a carrier protocol in that it doesn’t specify what its payload should look like, this opens door for certain flexibility as we will see that we can  pass any format of data we like: be it IPC encoded data, JSON encoded data, compressed data, as long as both ends understand how to interpret the data.

There are some nice and detailed document talking about using websocket so I will not repeat them here:

Node JS has a handy package “ws” that allows us to emulate browser from command line:

Q relies on handle to support websocket connections and by default it is not defined so we actually need to define it explicitly to make it work.

here we simply make it return a string representation of the data without any special encoding.

q){neg[.z.w] .Q.s @[value;x;{`$"'",x}]}

below is a simple program to connect to our q server:

const WebSocket = require('ws');

const ws = new WebSocket('ws://codycent:8080');

let first = false;

ws.on('open', () => {

ws.on('message', (data) => {
console.log("data=\n" + data);

if(first == false)
first = true;

when we run it, it outputs something like below:

$node wstest.js
x y
0 a
1 b
2 c


if we look at the pcap we can see the initial HTTP then upgrade:


it is masked but really no secrete, wireshark readily “unmasked” it for us:

the response is also encoded in plain string as we can see:


the referenced document provides further refining by replacing plain payload with IPC encoded data or with JSON encoded data so client can do more sophisticated processing instead of dealing with string only.


we have reviewed 4 different ways one can communicate with a remote Q server.

IPC is a very widely used protocol especially for communication between Q processes.

for most other interfacing, the trend is to use websocket so Q can readily serve as backend for GUI apps with web based GUI now the main stream deployment.


1 Comment

why touching a file does not cause Bazel to rebuild my project?


Recently I have been playing with Bazel as it seems more and more projects are moved to Bazel.

While playing with the examples/cpp/helloworld that comes with the source code, I noticed my usual trick of “touching a file” does not cause it to recompile.

Suspecting this might be due to a bug rather than a feature, I dug into the source code and it turns out Bazel actually has a sophisticated way to reduce unnecessary compiling to the minimum, leveraging on things called “ActionCache” and file digest such as SHA256.

I will try to reveal some details in the rest of the post.

First to show my observations of touching and build examples/cpp:hello-world

$bazel build examples/cpp:hello-world
INFO: Analyzed target //examples/cpp:hello-world (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //examples/cpp:hello-world up-to-date:
INFO: Elapsed time: 0.061s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action

$touch examples/cpp/ 

$bazel build examples/cpp:hello-world 
INFO: Analyzed target //examples/cpp:hello-world (0 packages loaded, 0 targets configured). 
INFO: Found 1 target... 
Target //examples/cpp:hello-world up-to-date: 
INFO: Elapsed time: 0.050s, Critical Path: 0.00s 
INFO: 0 processes. 
INFO: Build completed successfully, 1 total action

I can go even further by opening the file with vi, force saving it without changing anything and see how Bazel behaves.
The reason I use vi is vi will usually do some trick like “creating a temp file then rename it”, hence it goes more aggressive than mere touch, and this aggressiveness can be seen in “stat” command where you will see inode is changed.

if you however write a small program to simply open a file, change a char in place, and save it, it will not actually change inode, let me show it with the example.

$stat examples/cpp/
  File: ‘examples/cpp/’
  Size: 747             Blocks: 8          IO Block: 4096   regular file
Device: fd03h/64771d    Inode: 50383878    Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  codywu)   Gid: ( 1000/  codywu)
Context: unconfined_u:object_r:user_home_t:s0
Access: 2020-01-05 14:35:52.179841956 -0500
Modify: 2020-01-05 14:35:52.179841956 -0500
Change: 2020-01-05 14:35:52.181841946 -0500
 Birth: -

$touch examples/cpp/

$stat examples/cpp/
  File: ‘examples/cpp/’
  Size: 747             Blocks: 8          IO Block: 4096   regular file
Device: fd03h/64771d    Inode: 50383878    Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  codywu)   Gid: ( 1000/  codywu)
Context: unconfined_u:object_r:user_home_t:s0
Access: 2020-01-05 14:38:07.496045397 -0500
Modify: 2020-01-05 14:38:07.496045397 -0500
Change: 2020-01-05 14:38:07.496045397 -0500
 Birth: -

$vim examples/cpp/

$stat examples/cpp/
  File: ‘examples/cpp/’
  Size: 747             Blocks: 8          IO Block: 4096   regular file
Device: fd03h/64771d    Inode: 13229799    Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  codywu)   Gid: ( 1000/  codywu)
Context: unconfined_u:object_r:user_home_t:s0
Access: 2020-01-05 14:38:14.179003529 -0500
Modify: 2020-01-05 14:38:14.179003529 -0500
Change: 2020-01-05 14:38:14.180003523 -0500
 Birth: -

and not surprisingly that does not help either. (note I did not change any content)

$bazel build examples/cpp:hello-world
INFO: Analyzed target //examples/cpp:hello-world (1 packages loaded, 5 targets configured).
INFO: Found 1 target...
Target //examples/cpp:hello-world up-to-date:
INFO: Elapsed time: 0.069s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action

Now let’s change the file for real and I am going to change the “world” to “world!” and this time bazel builds it for real.

Hello world

$vim examples/cpp/

$bazel build examples/cpp:hello-world
INFO: Analyzed target //examples/cpp:hello-world (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //examples/cpp:hello-world up-to-date:
INFO: Elapsed time: 0.255s, Critical Path: 0.21s
INFO: 2 processes: 2 processwrapper-sandbox.
INFO: Build completed successfully, 3 total actions

Hello world!

note in the above the output it no longer “1 total actions” but “3 total actions” and 2 other actions are “compiling” and “linking” the updated

If you run bazel with “bazel build -s examples/cpp:hello-world” you will actually see “gcc” showing up on the command line.

So what happened here? Bazel seems to be pretty smart in this case to figure out when “gcc” is actually needed.

Let’s go in and dive into the details.

First Level Dirtiness Check

In the first round, Bazel will walk through the dependent node map trying to find all the dependencies based on the target specified on the command line, which in this case, is the target //examples/cpp:hello-world.

After all the dependencies are identified, Bazel will run Dirtiness Check and find all the Dirty nodes.

After it finds the dirty nodes it will then evaluate all those dirty nodes and come up with the actions that needs to be performed.

Below is the relevant code that deals with changed files

 531   private void handleChangedFiles(
 532       Collection diffPackageRootsUnderWhichToCheck,
 533       Diff diff,
 534       boolean managedDirectoriesChanged) {
 535     int numWithoutNewValues = diff.changedKeysWithoutNewValues().size();
 536     Iterable keysToBeChangedLaterInThisBuild = diff.changedKeysWithoutNewValues();
 537     Map changedKeysWithNewValues = diff.changedKeysWithNewValues();
 539     // If managed directories settings changed, do not inject any new values, just invalidate
 540     // keys of the changed values. {@link #invalidateCachedWorkspacePathsStates()}
 541     if (managedDirectoriesChanged) {
 542       numWithoutNewValues += changedKeysWithNewValues.size();
 543       keysToBeChangedLaterInThisBuild =
 544           Iterables.concat(keysToBeChangedLaterInThisBuild, changedKeysWithNewValues.keySet());
 545       changedKeysWithNewValues = ImmutableMap.of();
 546     }
 548     logDiffInfo(
 549         diffPackageRootsUnderWhichToCheck,
 550         keysToBeChangedLaterInThisBuild,
 551         numWithoutNewValues,
 552         changedKeysWithNewValues);
 554     recordingDiffer.invalidate(keysToBeChangedLaterInThisBuild);
 555     recordingDiffer.inject(changedKeysWithNewValues);
 556     modifiedFiles += getNumberOfModifiedFiles(keysToBeChangedLaterInThisBuild);
 557     modifiedFiles += getNumberOfModifiedFiles(changedKeysWithNewValues.keySet());
 558     incrementalBuildMonitor.accrue(keysToBeChangedLaterInThisBuild);
 559     incrementalBuildMonitor.accrue(changedKeysWithNewValues.keySet());
 560   }

the call stack of handleChangedFiles looks like below

changedKeysWithNewValues={FILE_STATE:[/home/codywu/bazel]/[examples/cpp/]=RegularFileStateValue{digest=null, size=749, contentsProxy=ctime of 1578250746334 and nodeId of 170525141}}$run$2(
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
java.base/java.util.concurrent.ThreadPoolExecutor$ Source)
java.base/ Source)

The diffs are calculated using function getDirtyValues, which applies dirtiness checker to those files:

481   private BatchDirtyResult getDirtyValues(ValueFetcher fetcher,
482       Iterable keys, final SkyValueDirtinessChecker checker,
483       final boolean checkMissingValues) throws InterruptedException {
484     ExecutorService executor =
485         Executors.newFixedThreadPool(
487             new ThreadFactoryBuilder().setNameFormat("FileSystem Value Invalidator %d").build());
489     final BatchDirtyResult batchResult = new BatchDirtyResult();
490     ThrowableRecordingRunnableWrapper wrapper =
491         new ThrowableRecordingRunnableWrapper("FilesystemValueChecker#getDirtyValues");
492     final AtomicInteger numKeysScanned = new AtomicInteger(0);
493     final AtomicInteger numKeysChecked = new AtomicInteger(0);
494     ElapsedTimeReceiver elapsedTimeReceiver =
495         elapsedTimeNanos -> {
496           if (elapsedTimeNanos > 0) {
498                 String.format(
499                     "Spent %d ms checking %d filesystem nodes (%d scanned)",
500                     TimeUnit.MILLISECONDS.convert(elapsedTimeNanos, TimeUnit.NANOSECONDS),
501                     numKeysChecked.get(),
502                     numKeysScanned.get()));
503           }
504         };
505     try (AutoProfiler prof = AutoProfiler.create(elapsedTimeReceiver)) {
506       for (final SkyKey key : keys) {
507         numKeysScanned.incrementAndGet();
508         if (!checker.applies(key)) {
509           continue;
510         }
511         Preconditions.checkState(
512             key.functionName().getHermeticity() == FunctionHermeticity.NONHERMETIC,
513             "Only non-hermetic keys can be dirty roots: %s",
514             key);
515         executor.execute(
516             wrapper.wrap(
517                 () -> {
518                   SkyValue value;
519                   try {
520                     value = fetcher.get(key);
521                   } catch (InterruptedException e) {
522                     // Exit fast. Interrupt is handled below on the main thread.
523                     return;
524                   }
525                   if (!checkMissingValues && value == null) {
526                     return;
527                   }
529                   numKeysChecked.incrementAndGet();
530                   DirtyResult result = checker.check(key, value, tsgm);
531                   if (result.isDirty()) {
532                     batchResult.add(key, value, result.getNewValue());
533                   }
534                 }));
535       }

the most critical part of code above is line #530 where check tries to check dependent nodes and add to the changed files if the result is “dirty”.

The dirtiness checker is a UnionDirtinessChecker of ExternalDirtinessChecker and MissingDiffDirtinessChecker

 508       diff =
 509           fsvc.getDirtyKeys(
 510               memoizingEvaluator.getValues(),
 511               new UnionDirtinessChecker(
 512                   Iterables.concat(
 513                       customDirtinessCheckers,
 514                       ImmutableList.of(
 515                           new ExternalDirtinessChecker(tmpExternalFilesHelper, fileTypesToCheck),
 516                           new MissingDiffDirtinessChecker(diffPackageRootsUnderWhichToCheck)))));
 517     }

In this case, the ultimate check is done by MissingDiffDirtinessChecker, which uses FileDirtinessChecker::createNewValues() to create new value for comparing with cached value saved in Bazel’s cache.

 44     @Override
 45     @Nullable
 46     public SkyValue createNewValue(SkyKey key, @Nullable TimestampGranularityMonitor tsgm) {
 47       RootedPath rootedPath = (RootedPath) key.argument();
 48       try {
 49         return FileStateValue.create(rootedPath, tsgm);
 50       } catch (IOException e) {
 51         // TODO(bazel-team): An IOException indicates a failure to get a file digest or a symlink
 52         // target, not a missing file. Such a failure really shouldn't happen, so failing early
 53         // may be better here.
 54         return null;
 55       }
 56     }
 57   }

the returned SkyValue is a RegularFileStateValue created from below function:

 76   public static FileStateValue create(
 77       RootedPath rootedPath,
 78       FilesystemCalls syscallCache,
 79       @Nullable TimestampGranularityMonitor tsgm)
 80       throws InconsistentFilesystemException, IOException {
 81     Path path = rootedPath.asPath();
 82     Dirent.Type type = syscallCache.getType(path, Symlinks.NOFOLLOW);
 83     if (type == null) {
 85     }
 86     switch (type) {
 87       case DIRECTORY:
 88         return DIRECTORY_FILE_STATE_NODE;
 89       case SYMLINK:
 90         return new SymlinkFileStateValue(path.readSymbolicLinkUnchecked());
 91       case FILE:
 92       case UNKNOWN:
 93         {
 94           FileStatus stat = syscallCache.statIfFound(path, Symlinks.NOFOLLOW);
 95           Preconditions.checkNotNull(
 96               stat, "File %s found in directory, but stat failed", rootedPath);
 97           return createWithStatNoFollow(rootedPath, FileStatusWithDigestAdapter.adapt(stat), tsgm);
 98         }
 99       default:
100         throw new IllegalStateException(type.toString());
101     }
102   }

The RegularFileStateValue has below defined to compare old and new values:

281     @Override
282     public boolean equals(Object obj) {
283       if (obj == this) {
284         return true;
285       }
286       if (!(obj instanceof RegularFileStateValue)) {
287         return false;
288       }
289       RegularFileStateValue other = (RegularFileStateValue) obj;
290       return size == other.size
291           && Arrays.equals(digest, other.digest)
292           && Objects.equals(contentsProxy, other.contentsProxy);
293     }

Digest value in this case is null and FileContentsProxy only compares change time (ctime) and Inode Id:

 61   @Override
 62   public boolean equals(Object other) {
 63     if (other == this) {
 64       return true;
 65     }
 67     if (!(other instanceof FileContentsProxy)) {
 68       return false;
 69     }
 71     FileContentsProxy that = (FileContentsProxy) other;
 72     return ctime == that.ctime && nodeId == that.nodeId;
 73   }

so we end up comparing 3 things for first level dirtiness:

  • File Size
  • Change Time
  • Inode Id

“touching” will change “Change Time” so touching indeed makes the file dirty, which will cause “” to be added to the Action for evaluating.

Second Level Digest Check

After first stage identified “dirty” nodes we actually get into the second level check where the concept of “ActionCache” as well as “File Digest” is used to prevent unnecessary compiling & linking.

the entry point of the second level is the calling to funciton mustExecute() which checks to see if the action is indeed needed.

It starts by first fetching the cache entry for the action and then calls into mustExecute().

280     ActionCache.Entry entry = getCacheEntry(action);
281     if (mustExecute(
282         action,
283         entry,
284         handler,
285         metadataHandler,
286         actionInputs,
287         clientEnv,
288         remoteDefaultPlatformProperties)) {
289       if (entry != null) {
290         removeCacheEntry(action);
291       }
292       return new Token(getKeyString(action));
293     }

mustExecute() does a few checks and most important to us is validateArtifacts on line #326 and comparison of ActionKey on line #330.

301   protected boolean mustExecute(
302       Action action,
303       @Nullable ActionCache.Entry entry,
304       EventHandler handler,
305       MetadataHandler metadataHandler,
306       Iterable actionInputs,
307       Map clientEnv,
308       Map remoteDefaultPlatformProperties) {
309     // Unconditional execution can be applied only for actions that are allowed to be executed.
310     if (unconditionalExecution(action)) {
311       Preconditions.checkState(action.isVolatile());
312       reportUnconditionalExecution(handler, action);
313       actionCache.accountMiss(MissReason.UNCONDITIONAL_EXECUTION);
314       return true;
315     }
316     if (entry == null) {
317       reportNewAction(handler, action);
318       actionCache.accountMiss(MissReason.NOT_CACHED);
319       return true;
320     }
322     if (entry.isCorrupted()) {
323       reportCorruptedCacheEntry(handler, action);
324       actionCache.accountMiss(MissReason.CORRUPTED_CACHE_ENTRY);
325       return true;
326     } else if (validateArtifacts(entry, action, actionInputs, metadataHandler, true)) {
327       reportChanged(handler, action);
328       actionCache.accountMiss(MissReason.DIFFERENT_FILES);
329       return true;
330     } else if (!entry.getActionKey().equals(action.getKey(actionKeyContext))) {
331       reportCommand(handler, action);
332       actionCache.accountMiss(MissReason.DIFFERENT_ACTION_KEY);
333       return true;
334     }
335     Map usedEnvironment =
336         computeUsedEnv(action, clientEnv, remoteDefaultPlatformProperties);
337     if (!Arrays.equals(entry.getUsedClientEnvDigest(), DigestUtils.fromEnv(usedEnvironment))) {
338       reportClientEnv(handler, action, usedEnvironment);
339       actionCache.accountMiss(MissReason.DIFFERENT_ENVIRONMENT);
340       return true;
341     }
343     entry.getFileDigest();
344     actionCache.accountHit();
345     return false;
346   }

First let’s talk about ActionKey comparison on line #330 as it is actually not our focus here.

ActionKey is a digest created from a certain Action’s inputs/outputs and various other arguments.

The goal is to check to see if certain action is forced to update due to command line option or build files changes etc.

In our example, we only modified the file contents hence it actually does not apply to us anyway.

The more interesting one is the validateArtifacts which actually goes into getting the SHA256 digest of the file:

135   /**
136    * Validate metadata state for action input or output artifacts.
137    *
138    * @param entry cached action information.
139    * @param action action to be validated.
140    * @param actionInputs the inputs of the action. Normally just the result of action.getInputs(),
141    *     but if this action doesn't yet know its inputs, we check the inputs from the cache.
142    * @param metadataHandler provider of metadata for the artifacts this action interacts with.
143    * @param checkOutput true to validate output artifacts, Otherwise, just validate inputs.
144    * @return true if at least one artifact has changed, false - otherwise.
145    */
146   private static boolean validateArtifacts(
147       ActionCache.Entry entry,
148       Action action,
149       Iterable actionInputs,
150       MetadataHandler metadataHandler,
151       boolean checkOutput) {
152     Map mdMap = new HashMap();
153     if (checkOutput) {
154       for (Artifact artifact : action.getOutputs()) {
155         mdMap.put(artifact.getExecPathString(), getMetadataMaybe(metadataHandler, artifact));
156       }
157     }
158     for (Artifact artifact : actionInputs) {
159       mdMap.put(artifact.getExecPathString(), getMetadataMaybe(metadataHandler, artifact));
160     }
161     return !Arrays.equals(DigestUtils.fromMetadata(mdMap), entry.getFileDigest());
162   }

as we can see, we add both inputs and outputs files to the mdMap and then generate digests from the map, and use this final digest to compare with the cached digest.

DigestUtils does simple XOR of everything’s digest (and in this case SHA256 Digest) in the map:

278   /**
279    * @param mdMap A collection of (execPath, FileArtifactValue) pairs. Values may be null.
280    * @return an order-independent digest from the given "set" of (path, metadata) pairs.
281    */
282   public static byte[] fromMetadata(Map mdMap) {
283     byte[] result = new byte[1]; // reserve the empty string
284     // Profiling showed that MessageDigest engine instantiation was a hotspot, so create one
285     // instance for this computation to amortize its cost.
286     Fingerprint fp = new Fingerprint();
287     for (Map.Entry entry : mdMap.entrySet()) {
288       result = xor(result, getDigest(fp, entry.getKey(), entry.getValue()));
289     }
290     return result;
291   }

The file digest is calculated when evaluation the target File with stack trace like below:

getDigest java.base/java.lang.Thread.getStackTrace(Unknown Source)
getDigest java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
getDigest java.base/java.util.concurrent.ThreadPoolExecutor$ Source)
getDigest java.base/ Source)

and the digest function is done with below getDigest():

310   /**
311    * Returns the digest of the file denoted by the path, following symbolic links.
312    *
313    * 

Subclasses may (and do) optimize this computation for a particular digest functions. 314 * 315 * @return a new byte array containing the file's digest 316 * @throws IOException if the digest could not be computed for any reason 317 */ 318 protected byte[] getDigest(final Path path) throws IOException { 319 return new ByteSource() { 320 @Override 321 public InputStream openStream() throws IOException { 322 return getInputStream(path); 323 } 324 }.hash(digestFunction.getHashFunction()).asBytes(); 325 }

and currently the two hash functions are registered as digest:

 70   public static final DigestHashFunction SHA1 = register(Hashing.sha1(), "SHA-1", "SHA1");
 71   public static final DigestHashFunction SHA256 = register(Hashing.sha256(), "SHA-256", "SHA256");

this is why Bazel is able to actually detect if the file content is changed or not and why “touching” cannot trigger a rebuild.

Leave a comment

ssh connect proxy command and timeout, syscall restart

The ssh command we use can have a configured option to use proxy command to do the real connection,as described in this post:ssh through proxy.

One of the proxy command is connect which is now moved to bitbucket:BitBucket Connect.

The place where I work has deployed ssh command in the same way and one of the observation is when we connect to a non-existent IP address the connection hangs endlessly.

We have already explicitly specified timeout parameter in the proxy command like below:

connect -w 5 22

yet it always stucks without time out. why?

Interested I took a look at the source of the connect.c and tried to understand how timeout is implemented.

We use a pretty old version of connect.c which is even before it was moved to bitbucket.

And the logic of timeout is actually simpler than I expected.

The timeout is scheduled by setting alarm signal using API alarm(unsigned seconds) and the intention is to let signal interrupts the long waiting connect() sys call.

However, it turns out that we have received signal as expected but the connect syscall returns SA_RESTARTSYS and is automatically restarted:

connect(3, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr(&quot;;)}, 16) = ? ERESTARTSYS (To be restarted)
--- SIGALRM (Alarm clock) @ 0 (0) ---
rt_sigreturn(0x1) = 42
connect(3, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr(&quot;;)}, 16 &lt;unfinished ...&gt;

As it turns out, the idea of using signal to interrupt long running syscall works but it has a missing chain – to prevent the syscall from automatically restarted.

In Linux, we an use sigaction to specifies the actions taken when a signal is received.
One of the params to sigaction is sa_flags where we can specify the SA_RESTART option which specifies the behavior:

If a signal is caught during the system calls listed below, the call may be forced to terminate with the error EINTR, the call may return with a data transfer shorter than requested, or the call may be restarted. Restart of pending calls is requested by setting the SA_RESTART bit in sa_flags. The affected system calls include open(2), read(2), write(2), sendto(2), recvfrom(2), sendmsg(2) and recvmsg(2) on a communications channel or a slow device (such as a terminal, but not a regular file) and during a wait(2) or ioctl(2). However, calls that have already committed are not restarted, but instead return a partial success (for example, a short read count).

By default, when we call alarm to schedule the signal we specify the SA_RESTART by default so even if our ALARM signal successfully interrupts connect syscall it failed to interrupt it completely.

One useful API in this case is siginterrupt() API which changes signal behavior by way of sigaction.

The API manual reads like:

The siginterrupt() function is used to change the system call restart behavior when a system call is interrupted by the specified signal. If the flag is false (0), then system calls will be restarted if they are interrupted by the specified signal and no data has been transferred yet. System call restart has been the default behavior since 4.2BSD, and is the default behaviour for signal(3) on FreeBSD.

If the flag is true (1), then restarting of system calls is disabled. If a system call is interrupted by the specified signal and no data has been transferred, the system call will return -1 with the global variable errno set to EINTR. Interrupted system calls that have started transferring data will return the amount of data actually transferred. System call interrupt is the signal behavior found on 4.1BSD and AT&T System V UNIX systems.

So we can use this to change the ALARM signal behavior and that immediately solves the problem.

However, when looking at the bitBucket code solution obviously it follows a different way by allowing the signal handler to exit the whole program directly:

/* timeout signal hander */
    signal( SIGALRM, SIG_IGN );
    alarm( 0 );
    error( &quot;timed out\n&quot; );
Leave a comment

PyQ working with 32bit free kdb on CentOS 64bit

I have been playing recently with PyQ which is an interesting effort trying to get best of both worlds of Q and Python.

There is also a good presentation for view at Exploring KDB+ Data in IPython Notebooks.

Also has a special page in its Contrib section introducing the module Contrib/PyQ

I like the idea of PyQ which is ingenious in terms of its approach.

Basically it tries to execute Q binary and from therein launches the python main function Py_Main and passing in any needed arguments.

Py_Main is loaded from Python’s own libPython shared library.

By running it this way it allows Python to have direct access to any Q symbol references without any more effort (since Q is running so no need to load anything in for using Q functions such as r0).

There is only one tiny annoyance: nowadays most of the linux distributions are 64bit yet only provides free 32bit kdb for downloading.

So basically unless you intend to purchase a copy of 64bit license you are facing the task of getting 32bit free kdb working on 64bit linux system.

The limitation comes largely from the design, even if 32bit app can run happily without any effort on 64bit system.

Because of the way Q loads libPython dynamically via dlopen under the cover, and Q itself is 32bit process, it cannot load any shared library of different bits.

To put it simply, if Q binary itself is ELF32 it can only load ELF32 shared libraries.
And it would not work if you have only 64bit Python installed, which is nearly the default nowadays especially with many automatic package installation and management tools.

That’s exactly what I ran into and default python setup install simply gets me nowhere.

Fascinated and I decided to give it a bit of effort so I have created my own github fork of the original distribution.

This is the original github distro from the author:Original PyQ GitHub

And this is my modified version providing a CentOS based make file:64bit CentOS customized PyQ for 32bit free kdb

To use it, first you need to have a working Kdb installation.
For 32bit free KDB you can download and follow the instruction here:Free 32KDB downloading

After installation and proper environment variable setting you should be able to start Q by simply issue command q and see the banner like below:

KDB+ 3.3 2015.11.03 Copyright (C) 1993-2015 Kx Systems
l32/ 3()core 7823MB codywu centos NONEXPIRE

Welcome to kdb+ 32bit edition
For support please see
Tutorials can be found at
To exit, type \\
To remove this startup msg, edit q.q

Next we need to download a copy of Python 2.7 and compile it as 32bit.

Here is the stackoverflow post talking about how to do it:Compile 32bit Python on CentOS

For me, I am happy to compile it without installing to any external directory so my way of doing it is slightly different.

First I download the zip file installation to a directory, in this case it is /var/tmp/Python32.
Zip Source file can be downloaded by wget

Python-2.7.7  Python-2.7.7.tgz

Then I simply run configure and build so it will just create shared library and binary within the same directory:

$CFLAGS=-m32 LDFLAGS=-m32 ./configure --enable-shared
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking for --enable-universalsdk... no
checking for --with-universal-archs... 32-bit
config.status: creating Makefile.pre
config.status: creating Modules/Setup.config
config.status: creating Misc/python.pc
config.status: creating Modules/ld_so_aix
config.status: creating pyconfig.h
creating Modules/Setup
creating Modules/Setup.local
creating Makefile

gcc -pthread -c -fno-strict-aliasing -m32 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes  -I. -IInclude -I./Include  -fPIC -DPy_BUILD_CORE -o Modules/python.o ./Modules/python.c
copying and adjusting /var/tmp/Python32/Python-2.7.7/Tools/scripts/pydoc -> build/scripts-2.7
copying and adjusting /var/tmp/Python32/Python-2.7.7/Tools/scripts/idle -> build/scripts-2.7
copying and adjusting /var/tmp/Python32/Python-2.7.7/Tools/scripts/2to3 -> build/scripts-2.7
copying and adjusting /var/tmp/Python32/Python-2.7.7/Lib/ -> build/scripts-2.7
changing mode of build/scripts-2.7/pydoc from 664 to 775
changing mode of build/scripts-2.7/idle from 664 to 775
changing mode of build/scripts-2.7/2to3 from 664 to 775
changing mode of build/scripts-2.7/ from 664 to 775
/usr/bin/install -c -m 644 ./Tools/gdb/

After compiling is successful your directory should look like below:

build  Lib                  Makefile         Parser          python         Tools
config.guess   Demo          libpython2.7.a       Makefile.pre     PC              Python
config.log     Doc   PCbuild
config.status  Grammar  Misc             pybuilddir.txt  README
config.sub     Include       LICENSE              Modules          pyconfig.h      RISCOS
configure      install-sh    Mac                  Objects

We are good with Python 32bit part now we can download my modified github Pyq and simply run gmake all:

$git clone
Cloning into 'pyq'...
remote: Counting objects: 118, done.
remote: Total 118 (delta 0), reused 0 (delta 0), pack-reused 118
Receiving objects: 100% (118/118), 72.71 KiB | 0 bytes/s, done.
Resolving deltas: 100% (38/38), done.

Before running gmake all we just need to do one more step to modify makefile and set the correct Python32 src directory like this:

$head -5 Makefile
# for 64bit centos + 32bit free kdb only (tested only on centos 7)

# change below to where you put your downloaded and unzipped python2.7 dir

After that is confirmed we simply run gmake all from pyq directory:

$gmake all
mkdir -p install/pyq
gcc -o install/pyq/ -shared -fPIC -m32 -DKXVER=3 src/pyq/py.c
gcc -o install/pyq/ -shared -fPIC -m32 -DQVER=3 -DKXVER=3 -I/var/tmp/Python32/Python-2.7.7/Include -I/var/tmp/Python32/Python-2.7.7 src/pyq/_k.c
mkdir -p install/bin
gcc -o install/bin/pyq -DQBITS="\"l32\"" -DPYQ_SRC_DIR="\"/var/tmp/git/pyq\"" -DPYTHON32_SRC_DIR="\"/var/tmp/Python32/Python-2.7.7\"" -m32 src/pyq.c
cp src/pyq/*.py install/pyq/
cp src/pyq/*.q install/pyq/

Now the binary pyq is installed to install/bin/pyq and that’s all you need to use it.

Run that binary and you will get a python shell:

Welcome to kdb+ 32bit edition
For support please see
Tutorials can be found at
To exit, type \\
To remove this startup msg, edit q.q
Python 2.7.7 (default, Jan  4 2016, 23:24:02)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

Now to begin using pyq, we just need to import pyq and use whatever is available is provided by pyq.

Let’s try some examples from Kx’s Contrib webpage:Kx Contrib/PyQ

Python 2.7.7 (default, Jan  4 2016, 23:24:02)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyq import *
>>> q
<pyq._Q object at 0xf2f6a44c>
>>> = q('([]date:();sym:();qty:())')
>>> q.insert('trade', (date(2006,10,6), 'IBM', 200))
>>> q.insert('trade', (date(2006,10,6), 'MSFT', 100))
date       sym  qty
2006.10.06 IBM  200
2006.10.06 MSFT 100

I did not try further to enable Ipython as that would need compile 32bit Ipython as well and import IPython module from the session.

But still it should not be that difficult now that you have some basic PyQ running.

Leave a comment

Why your python script should not modify LD_LIBRARY_PATH

It starts with some colleague trying to manipulate LD_LIBRARY_PATH in his python scripts before importing a module.

His shell starts up environment has a LD_LIBRARY_PATH setting which he would rather not change.

But before loading a specific module that is implemented in shared library, he would like the LD_LIBRARY_PATH be changed so it picks up its dependent libraries elsewhere.

Surely enough, that does not work, but why?

It has to do with how LD_LIBRARY_PATH is used in a typical process.

Here I am trying to dig a bit deeper and explain why it would not work.

Let’s first see where LD_LIBRARY_PATH is initially parsed.
It is first parsed by elf rtld component in glibc as we can see in the following code:

06 void
307 internal_function
308 _dl_non_dynamic_init (void)
309 {
310   _dl_main_map.l_origin = _dl_get_origin ();
311   _dl_main_map.l_phdr = GL(dl_phdr);
312   _dl_main_map.l_phnum = GL(dl_phnum);
315     HP_TIMING_NOW (_dl_cpuclock_offset);
317   _dl_verbose = *(getenv (&quot;LD_WARN&quot;) ?: &quot;&quot;) == '\0' ? 0 : 1;
319   /* Set up the data structures for the system-supplied DSO early,
320      so they can influence _dl_init_paths.  */
321   setup_vdso (NULL, NULL);
323   /* Initialize the data structures for the search paths for shared
324      objects.  */
325   _dl_init_paths (getenv (&quot;LD_LIBRARY_PATH&quot;));

what _dl_init_paths does is to fill in a static global variable env_path_list:

  97 /* This is the decomposed LD_LIBRARY_PATH search path.  */
  98 static struct r_search_path_struct env_path_list attribute_relro;
 658 void
 659 internal_function
 660 _dl_init_paths (const char *llp)
 661 {
 811       env_path_list.dirs = (struct r_search_path_elem **)
 812     malloc ((nllp + 1) * sizeof (struct r_search_path_elem *));
 813       if (env_path_list.dirs == NULL)
 814     {
 815       errstring = N_(&quot;cannot create cache for search path&quot;);
 816       goto signal_error;
 817     }
 819       (void) fillin_rpath (llp_tmp, env_path_list.dirs, &quot;:;&quot;,
 820                __libc_enable_secure, &quot;LD_LIBRARY_PATH&quot;,
 821                NULL, l);

After the variable env_path_list is filled in it is later used in _dl_map_object:

1929 struct link_map *
1930 internal_function
1931 _dl_map_object (struct link_map *loader, const char *name,
1932         int type, int trace_mode, int mode, Lmid_t nsid)
1933 {
2068       /* Try the LD_LIBRARY_PATH environment variable.  */
2069       if (fd == -1 &amp;&amp; env_path_list.dirs != (void *) -1)
2070     fd = open_path (name, namelen, mode, &amp;env_path_list,
2071             &amp;realname, &amp;fb,
2072             loader ?: GL(dl_ns)[LM_ID_BASE]._ns_loaded,
2073             LA_SER_LIBPATH, &amp;found_other_class);
2075       /* Look at the RUNPATH information for this binary.  */
2076       if (fd == -1 &amp;&amp; loader != NULL
2077       &amp;&amp; cache_rpath (loader, &amp;loader-&gt;l_runpath_dirs,
2078               DT_RUNPATH, &quot;RUNPATH&quot;))
2079     fd = open_path (name, namelen, mode,
2080             &amp;loader-&gt;l_runpath_dirs, &amp;realname, &amp;fb, loader,
2081             LA_SER_RUNPATH, &amp;found_other_class);

so _dl_map_object first tries to find the object by name from env_path_list then find in current binary’s RUNPATH.

The logic is normally what we expect to see when using LD_LIBRARY_PATH.

And _dl_map_object is actually used in dl_open which is the API exposed normally for dynamically loading shared library.

184 static void
185 dl_open_worker (void *a)
186 {
224   /* Load the named object.  */
225   struct link_map *new;
226   args-&gt;map = new = _dl_map_object (call_map, file, lt_loaded, 0,
227                     mode | __RTLD_CALLMAP, args-&gt;nsid);

We can try to write a simple program and see where _dl_init_paths is called:

$cat a.C
#include &lt;stdio.h&gt;

int main(void)
  return 0;

$g++ -o a a.C

We start the program with gdb and set to break on _dl_init_paths:

$gdb ./a
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
(gdb) b _dl_init_paths
Function &quot;_dl_init_paths&quot; not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (_dl_init_paths) pending.
(gdb) r
Starting program: ./a

Breakpoint 1, 0x00007ffff7de2d40 in _dl_init_paths () from /lib64/
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.1.x86_64
(gdb) bt
#0  0x00007ffff7de2d40 in _dl_init_paths () from /lib64/
#1  0x00007ffff7dde0dd in dl_main () from /lib64/
#2  0x00007ffff7df17d5 in _dl_sysdep_start () from /lib64/
#3  0x00007ffff7ddfcc1 in _dl_start () from /lib64/
#4  0x00007ffff7ddc438 in _start () from /lib64/
#5  0x0000000000000001 in ?? ()
#6  0x00007fffffffe550 in ?? ()
#7  0x0000000000000000 in ?? ()
gdb) info shared
From                To                  Syms Read   Shared Object Library
0x00007ffff7ddbae0  0x00007ffff7df627a  Yes (*)     /lib64/
(*): Shared library is missing debugging information.

As can be seen from stack trace it is called from _start entry point and at this time only one library is loaded: /lib64/

This is because normally the first step for starting an elf program is to load the interpreter which is responsible for loading the real program.

The interpreter is specified in interpreter section which we can show using readelf:

$readelf -a ./a|grep -i interp
  [ 1] .interp           PROGBITS         0000000000400238  00000238
  INTERP         0x0000000000000238 0x0000000000400238 0x0000000000400238
      [Requesting program interpreter: /lib64/]
   01     .interp

So the interpreter program runs its own stuff and read from the process environment for critical variables such as “LD_LIBRARY_PATH” and initialize its own static variables.

After that it goes on to execute at the entry point coded in ELF header which we can also display via readelf:

$readelf -a ./a|grep -i entry
  Entry point address:               0x400500

We can show that the function at that address is actually the target program’s _start which will later loads the main function we wrote:

(gdb) b _start
Breakpoint 2 at 0x400440
(gdb) cont

Breakpoint 2, 0x0000000000400440 in _start ()
(gdb) bt
#0  0x0000000000400440 in _start ()
(gdb) info shared
From                To                  Syms Read   Shared Object Library
0x00007ffff7ddbae0  0x00007ffff7df627a  Yes (*)     /lib64/
0x00007ffff7a393e0  0x00007ffff7b7cba0  Yes (*)     /lib64/
(*): Shared library is missing debugging information.

We can even manually set the environment variable at the start of gdb session and examine the internal data structure to show it is already filled in:

(gdb) set env LD_LIBRARY_PATH /var/tmp
(gdb) b _start
Breakpoint 1 at 0x400500
(gdb) r
Starting program: ./a

Breakpoint 1, 0x0000000000400500 in _start ()
(gdb) x/4xg &amp;env_path_list
0x7ffff7ffcdc0 &lt;env_path_list&gt;: 0x00007ffff7ff9640      0x0000000000000000
0x7ffff7ffcdd0 &lt;__stack_prot&gt;:  0x0000000001000000      0x0000000000000000
(gdb) x/4xg 0x00007ffff7ff9640
0x7ffff7ff9640: 0x00007ffff7ff9650      0x0000000000000000
0x7ffff7ff9650: 0x00007ffff7ff9000      0x00007ffff7df770c
(gdb) x/4xg 0x00007ffff7ff9650
0x7ffff7ff9650: 0x00007ffff7ff9000      0x00007ffff7df770c
0x7ffff7ff9660: 0x0000000000000000      0x00007ffff7ff9688
(gdb) x/s 0x00007ffff7df770c
0x7ffff7df770c: &quot;LD_LIBRARY_PATH&quot;
(gdb) x/s 0x00007ffff7ff9688
0x7ffff7ff9688: &quot;/var/tmp/&quot;

The displayed values are for below variables what and where:

 160 struct r_search_path_elem
 161   {
 162     /* This link is only used in the `all_dirs' member of `r_search_path'.  */
 163     struct r_search_path_elem *next;
 165     /* Strings saying where the definition came from.  */
 166     const char *what;
 167     const char *where;
 169     /* Basename for this search path element.  The string must end with
 170        a slash character.  */
 171     const char *dirname;
 172     size_t dirnamelen;
 174     enum r_dir_status status[0];
 175   };

As you can see, the environment variable LD_LIBRARY_PATH is already consumed well before python binary even starts to parse the python script.

So any attempt from within the python scripts to change LD_LIBRARY_PATH is like a character in a novel trying to talk to the author to change the story line.

So not just for python scripts it would be the same for perl scripts and even for a C program that tries to manipulate LD_LIBRARY_PATH from within the code. is only called when restarting a new program and that gives us the natural solution – simply to restart the same program after fixing any environment variables.

Here is a sample program skeleton:


import os
import sys

if &quot;LD_LIBRARY_PATH&quot; in os.environ and os.environ[&quot;LD_LIBRARY_PATH&quot;] == &quot;/var/tmp/test&quot;:
  print &quot;bad one&quot;
  del os.environ[&quot;LD_LIBRARY_PATH&quot;]
  os.execve(__file__, sys.argv, os.environ)
  print &quot;good one&quot;

When we run it in bad environment it will restart itself, otherwise it will proceed normally:


bad one
good one


good one

Also we have checked how dlopen is implemented based on using LD_LIBRARY_PATH and that function is exactly what python’s ctypes.CDLL and imp.load_dynamic uses for loading shared library.

So we can imagine if glibc exposes API to manipulate env_path_list and python provides interfaces to use that we can indeed allow ourselves to load shared library by dynamically updating library loading path.

Leave a comment

I raised my stack limit and it began to crash?

It all starts with a mysterious crash that my colleague noticed when he runs the same server with the same setup on two different boxes (but with identical hardware) and one of them crashes every time while the other works out just fine.

After some trial and error the only difference found is the default resource limit for stack size.

The crashing one has maximum stack size set to 100Mb while the working one has maximum stack size set to 10Mb.

That did not really answer the ultimate question – why would my server crash when I have increased its maximum stack size limit with the good intention to give it more space to run?

It was counter-intuitive, caught everyone by surprise and the only error information we get is a mysterious uncaught exception leading to SIGABT:

boost_1_55_0::thread_resource_error: Resource temporarily unavailable thread xxxxx exiting

Looking at the boost code we can see it is a failure when creating the thread and on linux platform start_thread_noexcept ultimately goes to calling pthread_create():

73     private:
174         bool start_thread_noexcept();
175         bool start_thread_noexcept(const attributes& attr);
176         void start_thread()
177         {
178           if (!start_thread_noexcept())
179           {
180             boost::throw_exception(thread_resource_error());
181           }
182         }
246     bool thread::start_thread_noexcept()
247     {
248         thread_info->self=thread_info;
249         int const res = pthread_create(&thread_info->thread_handle, 0, &thread_proxy, thread_info.get());
250         if (res != 0)
251         {
252             thread_info->self.reset();
253             return false;
254         }
255         return true;
256     }

So something failed when calling into pthread_create() and we know pthread_create() uses syscall clone to do part of the work so we tried to strace the server start up to see if any syscall around clone failed:

$strace -f ./server
[ Process PID=5928 runs in 32bit mode]
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x55575000
Process 5953 resume (parent 5928 ready)
[pid 5928] mmap2(NULL, 10481696, PROD_READ|PORT_WRITE|PORT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = -1 ENOMEM (Cannot allocate memory)
[pid 5928] -- SIGABRT (Aborted) @ 0(0) ---
[pid 5953] +++ killed by SIGABRT (core dumped) +++

Indeed we see syscall mmap2 failed due to ENOMEM.

So here the process is trying to map 104861696 bytes for stack uses and that number happens to be a little bit over 100Mb.

So why does it try to mmap such a big stack? And that number seems to be what we specified for our stack limit?

$ulimit -s

Starting the server with gdb and we set the breakpoint on mmap() function trying to capture the stack of mmap syscall and that clear shows it is from within pthread_create:

#0 0x585ae200 in mmap() from /lib/
#1 0x584ba3dd in pthread_create@GLIBC_2.1() from /lib/
#2 0x57e5433f in boost_1_55_0::thread::start_thread_noexcept(tis=this@entry=0xffffb678) at libs/thread/src/pthread/thread.cpp:246
#3 0x0825f3e1 in start_thread(this=0xffffb678) at boost/thread/detail/thread.hpp:180

Why would pthread_create want to create such a large stack?

We have to dig into pthread code and it turns out one step in pthread_create is to allocate stack space and unless you pass in thread init attribute explicitly specifying the stack size, alloc_stack will try to allocate by default attribute:

89 int
490 __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
491               void *(*start_routine) (void *), void *arg)
492 {
495   const struct pthread_attr *iattr = (struct pthread_attr *) attr;
496   struct pthread_attr default_attr;
497   bool free_cpuset = false;
498   if (iattr == NULL)
499     {
500       lll_lock (__default_pthread_attr_lock, LLL_PRIVATE);
501       default_attr = __default_pthread_attr;
502       size_t cpusetsize = default_attr.cpusetsize;
525   struct pthread *pd = NULL;
526   int err = ALLOCATE_STACK (iattr, &pd);
527   int retval = 0;

ALLOCATE_STACK is a macro which calls function allocate_stack and does the real mmap function call:

  51 /* This is how the function is called.  We do it this way to allow
  52    other variants of the function to have more parameters.  */
  53 # define ALLOCATE_STACK(attr, pd) allocate_stack (attr, pd, &stackaddr)
 345 /* Returns a usable stack for a new thread either by allocating a
 346    new stack or reusing a cached stack of sufficient size.
 347    ATTR must be non-NULL and point to a valid pthread_attr.
 348    PDP must be non-NULL.  */
 349 static int
 350 allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
 352 {
 360   /* Get the stack size from the attribute if it is set.  Otherwise we
 361      use the default we determined at start time.  */
 362   if (attr->stacksize != 0)
 363     size = attr->stacksize;
 364   else
 365     {
 366       lll_lock (__default_pthread_attr_lock, LLL_PRIVATE);
 367       size = __default_pthread_attr.stacksize;
 368       lll_unlock (__default_pthread_attr_lock, LLL_PRIVATE);
 369     }
 504       mem = mmap (NULL, size, prot,
 505               MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

So in most of our use of boost bind we never both specifying any init attributes and by above logic, we shall take the stack size from __default_pthread_attr.stacksize. What is that?

Turns out that is just our resource limit stack size:

299 void
300 __pthread_initialize_minimal_internal (void)
301 {
42   /* Determine the default allowed stack size.  This is the size used
443      in case the user does not specify one.  */
444   struct rlimit limit;
445   if (__getrlimit (RLIMIT_STACK, &limit) != 0
446       || limit.rlim_cur == RLIM_INFINITY)
447     /* The system limit is not usable.  Use an architecture-specific
448        default.  */
449     limit.rlim_cur = ARCH_STACK_DEFAULT_SIZE;
450   else if (limit.rlim_cur < PTHREAD_STACK_MIN)
451     /* The system limit is unusably small.
452        Use the minimal size acceptable.  */
453     limit.rlim_cur = PTHREAD_STACK_MIN;
455   /* Make sure it meets the minimum size that allocate_stack
456      (allocatestack.c) will demand, which depends on the page size.  */
457   const uintptr_t pagesz = GLRO(dl_pagesize);
458   const size_t minstack = pagesz + __static_tls_size + MINIMAL_REST_STACK;
459   if (limit.rlim_cur < minstack)
460     limit.rlim_cur = minstack;
462   /* Round the resource limit up to page size.  */
463   limit.rlim_cur = ALIGN_UP (limit.rlim_cur, pagesz);
464   lll_lock (__default_pthread_attr_lock, LLL_PRIVATE);
465   __default_pthread_attr.stacksize = limit.rlim_cur;

Now it is clear: when you specify a larger stack limit pthread by default will allocate such a big stack.

And if we run a multi-threaded server process that creates many threads soon we may request too huge an address space than what a typical 32bit application can tolerate, which on linux is about 3GB with 1GB saved for kernel mapping.

We can now do a quick strace and count how many times we request such 100Mb stack:

$strace -f -emmap2 server 2>&1|grep -i "mmap.*10481696"|wc -l

Merely for creating thread stack we have requessted about 2.7Gb memory address which is definitely pressing towards the 3Gb limit.

And that also explains why lower the stack limit helps even though it might be a bit counter-intuitive.

Of course, what pthread does is only allocating stack address space, it does not really ask for physical memory so better solution would be to migrate to 64bit process which has far more address space for user space.

Leave a comment

ADF and TRF – what are they, how are they different

Was trying to understand the two confusing terms and could not find good articles giving clear explanations so I have decided to start my own one.


Alternative Display Facility(ADF) and Trade Reporting Facility(TRF) are both trade reporting infrastructure under the purview of Financial Industry Regulatory Authority (FINRA).

They both can help their own participants to report trades that happened off the exchange.

How are they different

Operated by different organization

TRF is attached and operated by an exchange while ADF is directly looked after by FINRA.

Right now there are 2 TRF: NYSE TRF and NASDAQ TRF, operated by NYSE and NASDAQ separately.

Initially the only such reporting infrastructure is running by NASDAQ/NASD and then SEC decides it wants to open it up for competition, especially after the separation of NASDAQ and NASD which later becomes the nowadays FINRA.

Soon NYSE opened up its own TRF and FINRA came up with ADF.

Below table may give some idea on the timeline (TRF Exchange Participants):


ADF can also display quotation

Aside from help its participants to report trades as part of post trade work, ADF also help its participants display quotation which can join NBBO and become part of the protected quotation.

The quotation will display on SIP so allowing other market centers to route orders to it.

However ADF does not provide interconnection or routing services which needs to be implemented outside the scope of ADF.


Is that NYSE TRF only reports trades for NYSE listed securities and NASDAQ TRF only reports trades for NASDAQ listed securities?

No, unlike SIP, TRF can reports trades no matter where the corresponding securities is listed.

So as an ATS, e.g. a dark pool, for the same trade you can choose to report to NYSE TRF or to NASDAQ TRF or to FINRA ADF.

You can even be smart and report some trades to NYSE TRF while others to NASDAQ TRF.

Below is a snapshot from BATS which shows that indeed NYSE TRF reports securities for all exchanges and likewise for NASDAQ TRF.Bats Market Volume Summary


But I don’t see ADF trades reported in above snapshot?

That’s astute.

It seems ADF currently has no participants so no one reports trades via ADF.

This is from FINRA’s page on ADF participantsADF Participants:


The only participants LAVAFlow ECN exit ADF this FebruaryLavaFlow ADF Deactivation News:


So if we check the historical market volume before deactivation we indeed can see ADF reports:


And you can also see past inactive participants here with LAVAFlow the last (Inactive Participants):

Are hidden trades reported by ADF/TRF?

Hidden trades refer to those hidden orders on the lit venues.
They are considered as on-exchange trades so reported to separate SIP directly.

ADF/TRF normally reports OTC-like trades such as dark pool cross or off-exchange VWAP like crossing trades, broker-dealer internalization trades etc.

There are also trades/transactions not happening on exchange but outside the scope of ADF/TRF as well, you can check the FINRA Trade Reporting FAQ for details.

Can I see where trade was executed so as to do some liquidity mapping ?

Unfortunately no.
Currently trades are reported by the reporting facility’s MPID (Market Participant Identification Number) so you won’t be able to know where the trade was executed.

This also raises some concern that SEC said it prevents public from discovering where liquidity is and it wants to improve on the transparency.

Is it Real time reporting?

Near real time but not quite.

The current requirement is within 10 seconds so probably too much to be used to generate real time trade signals.

Notice can be read here FINRA Notice: Report OTC Trades as soon as Practicable.

is TRF reported volume all dark?

No, not really.

TRF reported volumes can come from dark pool or ECN or any upstairs trades and ECN trades can well be lit.

Details can be read from this good paper:Dark Pool Debate

is TRF reported volume equivalent to off-exchange trade volume?

Yes that’s right.


How ADF was motivated to address ECN concern

The “ADF” – Alternate Display Facility – was established under an SEC rule in 2002 as a place for ECNs that did not wish to place their quotes in the NASDAQ Single Book system because these ECNs felt that NASDAQ’s order processing algorithms favored NASDAQ Market Makers over ECNs

An unlinked ECN is one that is not linked into NASDAQ Single Book or another exchange, so its quotes are shown in a separate system – the ADF – Alternate Display Facility. Trades of NASDAQ securities executed by unlinked ECNs are reported through TRACS – the Trade Reporting and Comparison Service.



Uncaughtable runtime error on Solaris

It all begins when one of my colleagues tries to create a C++ plugin for Q/KDB binary.

The design of Q/KDB allows you to create plugin in your favorite language, implement functions you need and then import and use them in your Q working environment seamlessly.

And it has a nice wiki page detailing how InterfacingWithC.

So he starts with a really simple function that does basic exception catching:

#include "k.h"
#include <stdexcept>

void throwing_cpp_func()
   throw std::runtime_error("unoh");

extern "C"
  K catching()
       return ki(0);
    catch(const std::runtime_error& e)
      return ki(1);
       return ki(2);

The code is quite self-explanatory and if you have never working on Q/Kdb interfacing with C, ki(n) simply is K’s data type of integer with value n.

So with this function, we expect to get back an integer and sounds like we should get back “1”.

Test on x86_64 linux works exactly as we supposed:

$g++ -m64 -shared -fPIC -o -DKXVER=3

q)f: `:~/ 2:(`catching;1)

2: is Q/Kdb’s way of doing dynamic library loading and its grammar details can be looked up here: TwoColon

Now we try to do the same on x86_64 5.11 Solaris:

$g++ -m64 -shared -fPIC -o -DKXVER=3

q)f: `:~/ 2:(`catching;1)
Sorry, this application or an associated library has encountered a fatal error and will exit
If known, please email the steps to reproduce this error to
thank you

Something funny happened here obviously but what is it?
Why is this standard runtime_error escaped our catching net and cause the Q to shutdown itself?

Let’s try by having dbx attach to the Q process right before we call the vital “f” function and see what happened:

q)f: `:~/ 2:(`catching;1)

(execute below when finished running "cont" on dbx)

For information about new features see `help changes'
(dbx0 attach 7832
Reading q
Attached to process 7832
stopped in __pollsys at 0xffff80ffbf49f03a
0xffff80ffbf49f03a: __pollsys+0x000a: jb __cerror
(dbx) cont
signal SEGV (no mapping at the fault address) in (unknown) at 0x1116d
0x000000001116d: <bad address 0x1116d>
(dbx) where -h
  [1] 0x1116d(0x1, 0x1, 0x474e5543432b2b00, ...) at ...
  [2] _Unwind_RaiseException_Body(0x0, 0x0...) at ...
  [3] _Unwind_RaiseException(0x0, 0x0...) at ...
  [4] __cxa_throw(0x0,....) at ...
  [5] throwing_cpp_func(0x0,...) at ...
  [6] catching_func(0x0,...) at ....

So we end up crashing because of SIGSEGV (segmentation fault) and clearly our code is running to some wonderland where RIP is with some funny value of 0x1116d.

And the stack shows we did try to execute __cxa_throw which is basically the translation by c++ compiler when we write “throw …”.

And somehow the exception was thrown successfully and we were in the process of unwinding the exception yet we suddenly trespassed into some alien land and crashed ourself.

At least we know it’s not the exception is not caught, otherwise we would see uncaught exception leadin gto std::terminate and then the program “aborted”.

We have no clue so let’s back out a little and set a break point at Unwind_RaiseException and maybe we can stepi manually till we saw some smoking guns.

For information about new features see `help changes'
(dbx0 attach 7832
Reading q
Attached to process 7832
stopped in __pollsys at 0xffff80ffbf49f03a
0xffff80ffbf49f03a: __pollsys+0x000a: jb __cerror
(dbx) stop in _Unwind_RaiseException
More than one identifier '_Unwind_RaiseException'
Select one of the follow:
 0) Cancel
 1) ``_Unwind_RaiseException
 2) ``_Unwind_RaiseException
 3) ``_Unwind_RaiseException
 a) All

Wait a second!
That reads suspicious – why would we have multiple definition of the same function. Won’t application when running face the same ambiguity as we do?

Let’s try to check manually if that’s really the case.

$ldd $(which q) => /lib/64/
... => /lib/64/
... => /lib/64/

$ldd => /usr/lib/64/ => /lib/64/ => /usr/lib/64/ => /lib/64/

$nm -C /lib/64/|grep -i _Unwind_RaiseException
[10764] | 1236640| 93|FUNC | WEAK |3 |19 |_SUNW_Unwind_RaiseException
[10586] | 1236640| 93|FUNC | GLOB |3 |19 |_Unwind_RaiseException
[445]   | 1235940| 713|FUNC |LOCAL|2 | 19 |_Unwind_RaiseException_Body

$nm -C /usr/lib/64/|grep -i Unwind_RaiseException
[314] | 79712| 341| FUNC| GLOB|0 |15  |_Unwind_RaiseException
[252] | 78976| 196| FUNC| LOCL|0 |15  |_Unwind_RaiseException_Phase2

Clearly our stack shows we are using the “_Unwind_RaiseException” in because that’s the only lib with _Unwind_RaiseException_Body defined.

So is it possible that the ambiguity caused us to call the wrong _Unwind_RaiseException?

That seems to be exactly the case.

We can show the effect by reducing it to a more independent example and show how it works differently for application that loads the shared library, when it is built either as a C or C++ application.

We will see that C-built app works fine while C++-built app suffers.

Here goes a Q-free and striped-down sample shared lib:

#include <stdexcept>

extern "C"
  int catching()
      throw std::runtime_error("uhoh");
      return 0;
    catch(const std::runtime_error& e)
      return 1;
      return 2;

Here goes our C-built version of app:

#include <stdio.h>
int catching();

int main(void)
  printf("a=%d\n", catching());
  return 0;

Here goes our C++-built version of app:

#include <iostream>

extern "C" int catching();

int main(void)
  std::cout << "a=" << catching() << std::endl;
  return 0;

Now let’s build the app differently with gcc and g++ and show how that works out.

$g++ -m64 -shared -fPIC -o test.cpp

$gcc -m64 -o c-app -ltest -L. c-app.c

$g++ -m64 -o cpp-app -ltest -L. cpp-app.cpp

Memory fault(coredump)


And let’s show why that would be the case and what caused one app to pick up correct version of _Unwind_RaiseException while not for the other one.

It all has to do with the library it links – let’s list them using tool elfdump:

$elfdump cpp-app|grep -i needed
     [0] NEEDED    0x19e
     [1] NEEDED    0x1aa
     [2] NEEDED    0x1b9
     [3] NEEDED    0x1c3
     [4] NEEDED    0x17f

$elfdump c-app|grep -i needed
     [0] NEEDED   0x1a7
     [1] NEEDED   0x188

So it is clear – the working cpp-app has listed before and that makes all the difference.

The crashing c-app, just like the Q/kdb process does not even have linked.
And it probably only picks up that library when dynamically loading dependent library of our plugin library which would be too late for symbol resolving, hence it always picks up‘s version.

To further prove our theory we can force changing the library loading sequence and force making to be the first one to consider for symbol resolving and magically make broken c-app work.

Our trick is LD_PRELOAD which will force application to load the specified library first.

$LD_PRELOAD=/usr/lib/64/ ./c-app


So we have successfully made it work!

A little googling shows the same story and it is actually a reported bug for x86_64 Solaris gcc and was later remediated in newer version of gcc.

Here is the link for bug reporting:Bug 59788 – Mixing libc and libgcc_s unwinders on 64-bit Solaris 10+/x86 breaks EH

And here is another blog talking about the same issue:Exceptions, gcc and Solaris 10 AMD 64bit

Leave a comment

RegNMS Order Protection Rule Explained

Order Protection Rule (a.k.a. Trade Through Rule) is always a bit confusing given what it wants to achieve in a complicated, fragmented market environments.

Here I am trying to summarize and clarify some of the confusions ,that initially puzzled myself, and given references where it is possible.

Only Top of Book is Protected

First we need to understand what the rule is protecting and how does it enforce that.

Using US Stock as example, we will model our trading world as a list of so-called “trading centers” where stocks are quoted by all its participants while at the same time, order execution/modification/cancellation constantly changes the bid and ask quotes.

So each trading center can be modeled as one data source emitting constant flow of its quote books which changes quickly.

For example, below is what a typical depth of level 2 book for Apple looks like, with many levels of price for both buy and sell.


This trading center’s best bid and best ask is currently at 324.02 ~ 324.10 with spread being 8 cents.

If we aggregate all those different trading center’s best bid and best ask and sort them from high to low we would have something called “quote montage”.

It is called “Montage” because it does not give you top N best bids and ask quotes but simply glues best bid/ask from each trading center.

And that’s our protected quotations and that’s also what gets published onto the consolidated feed such as SIP.

To give an example:

if NYSE has below quotes (only price is showing):

100     102
99       103

and BATS has below quotes:

98     101
97     102.5

Then the montage will be created using:

NYSE              100      101      BATS
BATS               98       102      NYSE

So even if NYSE’s second best BID price (99) is better than BAT’s best BID price it’s not protected.

Now that we know what quotation is protected we can tell what kind of protection is enforced:
Basically you are supposed to give those quotation higher priority so if you got a trade executed at inferior price than the montage you virtually “trade through” them and you will have some hard time facing scrutiny of SEC.

So key thing here is: only top of book of each trading center is protected, not the depth of the book, even if they “can” be better.

Not every trading center is protected

US Stock trading market has become fragmented over the years and now if you send an order to your broken it might travel a long way before it gets executed.

It probably will attempt first your broker’s dark pool to see if there is any internalization opportunity to save cost for the broker (e.g. if Goldman is your broker they might try their own Sigma-X first).

Then it may try one or several low-to-zero-commission dark pools to save commission and reduce price impact by not leaking order info to the lit markets.

Those external dark pools are sometimes maintained with low commission to give HFT traders a chance to fill any order flow therein so would be different from those internalization-purposed dark pools.

It may then also try some the inverted rebate venue to get some rebate for the broker.

And finally it may go to one of lit markets seeking best execution by NBBO or post there if it is not marketable.

So overall you have lots of places to trade and that number nowadays in US is about 40 to 50!!

Not all of those trading centers are protected however.
For example Dark Pool are not included for the apparent reason they won’t even publish quotes book!

So what kind of trading centers are protected?
According to SEC there are 2 categories of trading centers:

  • National Security Exchanges that provide fast automatic quotations, currently there are about 13 such exchanges
  • a national securities association (currently FINRA through its Alternative Display Facility (“ADF”)


And here is a list from NASDAQ’s NASDAQ regnms FAQ:


And very naturally if you wish to route to those market centers when they have the protected quotes you need to hit you have to be sure you have the connection to them all.


National Best Bid Offer,a.k.a. NBBO is the best of the “montage” book we talked about earlier.

That’s the best price you are expected to get and the price you are not supposed to ignore.

To put it this way, even dark pool has to use that as reference price to decide on its crossing price. Though in practice some dark pool maintainers use slow SIP feed which may always give accurate NBBO.

When are they protected

Modern markets can trade longer time than you are awake each day.

Regular markets for US goes from 9:30am to 4:00pm ET and can do half day on holiday events.

But there are also post market trading in some markets with less liquidity and wilder prices for you to trade.

They exist for a reason. For example, if you are desperate to get rid of your position before end of day you might find post market trading your only choice.

So SEC has been clear in its rule that protection is only enforced during Regular Trading Hour.

Trade Through Definition

So now we can properly follow SEC in its definition of Trade Through:

A trade-through is defined as the purchase or sale of an “NMS stock” during “regular trading hours” (9:30 a.m. to 4:00 p.m. ET), either as agent or principal, at a price that is lower than a protected bid or higher than a protected offer. An NMS stock generally means any exchangelisted security (other than listed options) for which consolidated market data is disseminated.

Routing is not enforced if you can do better

So normally if there is a better price at a different trading center you are required to route there for better execution.

But remember the protected quotes are nothing but a “montage” so in our earlier example NYSE actually has better quotes than BATS’s protected quotes.

So here is the deal – if you are able to trade at equal or better price than another trading center’s protected quotes you are free to trade by your own quotes!

And this has actually been how some of the dark pool works by claiming to give “price improvements” than NBBO: for example to cross at mid point price that save half of the spread for both sides.

You have your discretionary of routing, sometimes

So we know if you don’t have better or equal matching quotes you are required to route.
But what if two exchanges happen to have the equal best “price”?

This is the so-called “trade at situation and you have the freedom to decide on where you want to route your orders.

Isn’t that great?

Protection Exception: Intermarket Sweep Orders (ISO)

ISO order is an interesting and frequently used exception to the order protection rule.

It is brought up to address institutional traders’ concern when it has to fill large qty quickly.

By the default rule it has to route small orders around to chase protected quotes and that normally is small in size, a couple of lots maybe due to the use of “display size”.

Worse due to slow feeds or High Frequency Traders that protected quotes may be long gone by the time your order gets there and you end up chasing phantom quotes.

And finally if you always have to satisfy relatively smaller-sized protected quotes before you chip into the depth of the books for chunky quotes, by the time you do so, other predatory algorithms may have figured out your intention already and use that to go against you.

To address this, SEC has this ISO exception that you are allowed to take out depth of book of a certain trading center as long as you have sent out the order by the better protected quotes to those other trading centers simultaneously.

No waiting and you are allowed to happily dig deep!

More importantly trading centers such as NYSE may have specially devised such order type to help you.

For example, NYSE offers an “NYSE IOC Order” that “will be automatically executed against the displayed
quotation up to its full size and sweep the Display Book® system, to the extent possible, with portions of the order
routed to other markets if necessary in compliance with Regulation NMS and the portion not executed will be immediately cancelled.


NYSE rules

And NASDAQ provides similar MOPP orders:



ISO is your duty

Though NYSE and NASDAQ can help you fulfill ISO duty by allowing you to send special order types, it is largely your duty if you directly send ISO order types to them.

So if you, as a broker, send ISO to NYSE with the intention to capture NYSE’s depth of the book, you must make sure you have sent other orders by the displayed protected quotes at the same time.

Any trade though therein will be charged against you only so it is vital you preserve any vital evidence if you do plan to use ISO.

Day ISO can be a loaded gun

people normally use IOC ISO to have the order cancelled back if it is unable to get fully filled within its limit price, even if it has tried to capture local depth of book.

Day ISO however is a harder to use order type since anything not filled may be left at the book, without routing and potentially cause the markets to get locked or crossed.

And in that case, SEC will chase you for locking/crossing the markets.

Below are extracted from again NASDAQ’s NASDAQ Regnms FAQ:


Protection Exception: Benchmark Execution Price

For certain type of matching engines, they cross the order at benchmark execution price rather than market quote price at the time of crossing.

For example, if an engine holds periodic crossing phase of 10 minutes and cross both sides of participants with 10min VWAP price, it is okay the VWAP price can be a trade through compared against then-protected market quotes.

There are however certain restrictions to make the exception justified – for example, order has to be entered before start of the crossing phase otherwise someone may try to game the system.

There is no such global consistent protected quotes or even NBBO

Contrary to most common belief, there is no global consist protected quotes or even globally agreed NBBO.

It mainly has to do with technology challenge there in – even if everyone uses SIP as the sole source of quotes it may arrive at traders at different time due to propagation delay.

And that explains many of the exchange provided Co-location initiative which allows you to station your server in the same data center of exchange’s matching engine server’s.

In some extreme cases, COLO providers even promise to wire the same length of cables to each of the COLO users to make sure they have a level playing field, at least within the data center.

So that is one obvious technical difficulty but obviously not the only one.

SEC does not mandate you to use a specific feed for protected quotes.
In fact, private exchange direct feed is usually faster and more reliable than SIP feeds.
A sophisticate algorithm provider or HFT firm may choose to source from various exchange direct feed and create its own vision of protected quotes and NBBO.

It is usually more accurate than SIP which suffered from phantom quotes issue – the quote is stale and you are guaranteed to be not able to hit it.

Then again, even if you use the same exchange direct feed as another firm, those various feeds, carrying identical data, may arrive at your trading engines at different timing.
After all, you cannot Colo your server to every exchanges at the same time.

So you are fine to use any data feed you like as long as you are able to present evidence that you have followed and respected protected quotes that you saw at the time – especially you tend to trade through a lot and will need to explain to the regulators.

This has also brought with it the well know issue of Latency Arbitrage where a HFT engine may see more up-to-date quotes and use that to trade against users trading by the slower SIP quotes.

For example, Goldman’s Sigma-X was said to switch from slow SIP to Redline for better performance.

One Second Window Exception

This is a rather interesting exception and was probably invented due to the difficulty of maintaining a globally consistent protected quotes and NBBO view.

as quoted from SEC doc:

This exception is primarily designed to deal with the practical difficulties of preventing intermarket
trade-throughs during a fast-moving market when quotations can change rapidly. If a trade is
executed at a price that would not have been a trade-through of protected quotations as they
stood at any point within the previous one second (the one-second window), then the trade is
excepted from Rule 611.

And an example given in SEC rule is:

Trading centers would be entitled to trade at any price equal to or
better than the least aggressive best bid or best offer, as applicable, displayed by the other trading
center during that one-second window. For example, if the best bid price displayed by another
trading center has flickered between $10.00 and $10.01 during the one-second window, the
trading center that received the order could execute a trade at $10.00 without violating Rule 611

I am not sure how this works in reality and whether that can become a source of abusing.

Help yourself when things goes wrong

There are cases when one exchange is facing technical issues and struggling to keep your flow of updates of protected quotes, which may well gets stale or even lock/cross the markets.

Self help is the term that is used to describe the action other trading centers can take to declare that they will ignore the protected quotes on the troubled trading center.

Of course you need to notify the trading center which you declare self help agains in the first time.

And hopefully if they receive enough self help notification they would realize things go awry and begin to fix it.

Oddlot Quotes are not protected

The Rule does not apply to odd­lot orders or to the odd­lot portions of mixed­lot orders.

Rule 600(b)(8) defines “bid” or “offer” as the bid price or offer price for one or more round lots of an NMS security. This definition is embedded in the definition of “quotation” in Rule 600(b)(62), as well as the definition of “protected bid” or “protected offer” in Rule 600(b)(57).

Consequently, trading centers are permitted to establish their own rules for handling odd­lot orders and the odd­lot portions of mixed­lot orders. For example, although trading centers are not required to handle odd­lot orders or the odd­lot portions of mixed lot orders in accordance with the requirements for automated quotations set forth in Rule 600(b)(3), they are free to incorporate such requirements in their rules if they wish to do so.

Save your trade record

SEC is constantly monitoring trade through rate at various brokers so it is important you preserve any such trading documents and corresponding market data record to show that your trading engines has done the best it can and proper due diligence is not skipped intentionally or unintentionally.

For example, Goldman was fined a big sum of money to SEC for unable to follow NBBO in its dark pool Sigma-X. (link)


Below lists some useful sources of information while preparing this blog.


RegNMS wiki



NASDAQ RegNMS Facts Sheet


BATS Order Types

KCG Demystifying Order Types

1 Comment