
Trading bot : data provider.

Trading Bot
Author
Raphael Outhier
Kernel engineer
Trading_bot - This article is part of a series.
Part 2: This Article

This chapter describes the design choices made when implementing the data provision system and justifies them by covering, in a simplified but functional way, how disk access, memory translation, kernel buffer management and memory mapping work.

Participants
#

The data provider system features the following participants :

  • multiple remote providers, each providing data for a specific set of instruments in a specific time range. It is assumed that those remote providers provide the exact same information and that when multiple providers can provide the same data elements, any can be chosen.
  • multiple data consumers, which want to use data provided by remote providers. Consumers can be strategies, or simulation brokers.

Consumer needs
#

The data consumers need to make the following two types of requests :

  • range request : takes a start minute index (stt) and an end minute index (end) and, for the time interval [stt, end[, generates :
    • a volumes array with one volume per minute index in increasing mid order. If no transactions were made at a minute index, the related array element is 0.
    • a values array with one value per minute index in increasing mid order. If no transactions were made at a minute index, the related array element is 0.
  • value request : takes a target minute index (tgt), determines the greatest minute index mid less than or equal to tgt with a non-zero volume, then returns :
    • mid.
    • vol the volume at mid.
    • val the value at mid.

In practice, the range request is used when we want to cache data or compute statistics like moving averages or correlations, and the value request is used to find the most recent volume and price at a given moment in time.
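
To make this more concrete, here is a minimal sketch of what the consumer-facing interface could look like in Rust (the trait and type names are hypothetical, not the bot's actual API) :

/// Minute index (mid) : number of minutes elapsed since the epoch (hypothetical convention).
pub type Mid = u64;

/// Result of a value request.
pub struct ValueResult {
    pub mid: Mid, // greatest mid <= tgt with a non-zero volume
    pub vol: f64, // volume traded at `mid`
    pub val: f64, // volume-weighted value at `mid`
}

/// Consumer-facing interface of the data provider (sketch).
pub trait DataProvider {
    /// Range request : volumes and values for [stt, end[, zero-padded,
    /// one element per minute, in increasing mid order.
    fn range(&mut self, stt: Mid, end: Mid) -> (Vec<f64>, Vec<f64>);

    /// Value request : most recent (mid, vol, val) with a non-zero volume
    /// at or before `tgt`, or None if the instrument never traded before `tgt`.
    fn value(&mut self, tgt: Mid) -> Option<ValueResult>;
}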

Remote provider format
#

Remote providers like Polygon.io expose RESTful APIs to fetch data. Those APIs typically transfer data in plain-text formats like JSON, which the local provider has to parse and convert into floats before the rest of the system can use them.

REST APIs are thread-safe by design, so there is no need to coordinate the possibly multiple active trading cores when fetching data.

Constraints
#

First, as stated in the previous chapters, it is essential to avoid downloading anything from remote providers if it is not strictly necessary. That gives us our first two constraints :

  • data downloaded from the provider must be stored to disk and used in place of the remote provider when possible.
  • local providers running on the same machine must collaboratively use the same storage.

Then, as stated in the ‘Consumer needs’ section, data is ultimately to be either read by value or stored in contiguous memory arrays. This gives us our third constraint :

  • disk storage must be efficient for both value and range read.

In practice, this means that our disk storage stores data in contiguous arrays which follow the same padding rules as the arrays generated by range requests : if at a given minute ID there have been no transactions, then both the volume and value array elements are 0.

My proposal for the design answering those three constraints uses the file system and its memory mapping capabilities as storage medium. To understand why, let’s cover how modern operating systems manage disk storage and file memory mapping in a simplified way.

Architectural and kernel concepts
#

The content of this section originally covered the following things :

  • how modern OSes manage storage devices like SATA disks.
  • how the MMU works and how VA -> PA translation is done.
  • what happens when you call mmap.

After writing it, I decided to move it to a dedicated article that can be found here, as it may be of interest beyond the scope of the trading bot.

Please take the time to read it to get familiar with the concepts that the current design choice of the provider is based on.

File storage design
#

File structure, values, volumes
#

Data downloaded from the remote provider is stored in the file system, and mmap is used to efficiently access it.

Since mmap prevents us from resizing a file once it is mapped in memory :

  • we need to store data in fixed-sized files.
  • we need to extend the files to their largest possible size at creation before the first mmap is done.

My decision was to use a dedicated file per year per instrument. One could use a different time base but one ultimately has to pick one.

Each file contains data starting at Jan 1st 00:00:00 for the file’s year, and contains all zero-padded data until the end of the year, one data element per minute :

  • if at a given minute no transaction happened, the data element is :
    • volume = 0;
    • value = 0;
  • if at a given minute at least one transaction happened, the data element is :
    • volume = sum of all transactions volumes;
    • value = average of all transaction values weighted by their volumes;

Data is written in time order, starting at Jan 1st 00:00:00. An element at mid A cannot be written before all elements with mids B < A have been written. If an instrument has no data in a time range, volumes and values for this range are set to 0.
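
As an illustration of those aggregation and padding rules, here is a sketch (in Rust, with a hypothetical Trade type) of how raw transactions could be folded into the per-minute volume and value arrays :

/// A raw transaction as returned by the remote provider (hypothetical type).
struct Trade {
    mid: usize, // minute index within the year
    volume: f64,
    price: f64,
}

/// Fold raw trades into zero-padded per-minute (volumes, values) arrays.
/// `minutes` is the number of minutes covered by the file's year.
fn aggregate(trades: &[Trade], minutes: usize) -> (Vec<f64>, Vec<f64>) {
    let mut volumes = vec![0.0f64; minutes];
    let mut values = vec![0.0f64; minutes];
    // Accumulate total volume and the volume-weighted sum of prices.
    for t in trades {
        volumes[t.mid] += t.volume;
        values[t.mid] += t.volume * t.price;
    }
    // Turn the weighted sums into volume-weighted averages; untraded minutes stay at 0.
    for (vol, val) in volumes.iter().zip(values.iter_mut()) {
        if *vol != 0.0 {
            *val /= *vol;
        }
    }
    (volumes, values)
}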

Each file is thus divided into the following sections, which are independently mmap-ed by every provider needing to access the data that it contains :

  • descriptor : identifies the file using the instrument name and the year.
  • sync : data used by multiple providers using this file to synchronize their writes.
  • volumes : the array of all volumes of transactions for each minute of this year. Expressed with f64s.
  • values : the array of all values of transactions for each minute of this year. Expressed with f64s.
  • prv_ids : another array containing an array index for each minute of this year. Expressed with u32s. It has a special purpose which is not needed to understand the structure of the provider; it is described in a dedicated section at the end of the article for the curious reader.

Mapping attributes
#

As stated above, each file section is mmap-ed by every provider who needs to use its data. Let’s recap how it is mapped and accessed.

  • descriptor : only written by the provider that creates the file. Mappable as read-only by every other provider. Could be mapped as private once we’re sure that the provider that created the file has initialized the descriptor and called msync.
  • sync : is actively read and written using atomics by the different providers who use the file’s data. Needs to be mapped as shareable read-write by everyone.
  • volumes, values, prv_ids : written in order, which causes those sections to be filled incrementally. Only one page at a time is being written and all the others could be read-only. In practice, to avoid the cost of TLB updates, all pages are writable. All pages >= the currently written page must be shared to observe writes from other providers. All pages < the currently written page could be remapped as private, but this brings no benefit in practice since no one is supposed to write to them (or it is a bug). In practice, all those pages are mapped as shareable read-write.
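
For reference, mapping one section as shareable read-write boils down to a single mmap call. A sketch using the libc crate (error handling kept minimal, names hypothetical) :

use std::os::unix::io::AsRawFd;

/// Map one section of a storage file as shareable read-write.
/// `offset` and `len` must respect the page-alignment rules discussed later.
fn map_section(file: &std::fs::File, offset: i64, len: usize) -> *mut u8 {
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),               // let the kernel pick the virtual address
            len,                                // length of the section
            libc::PROT_READ | libc::PROT_WRITE, // read-write access
            libc::MAP_SHARED,                   // writes must be visible to other providers
            file.as_raw_fd(),
            offset,                             // page-aligned offset of the section
        )
    };
    assert_ne!(ptr, libc::MAP_FAILED, "mmap failed");
    ptr as *mut u8
}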

Sync
#

Since providers may use the same data at the same time, they may also need the same non-downloaded data at the same time.

The sync section contains data that providers may need to read and write in order to coordinate their accesses to the file’s data. In practice it contains :

  • u8 lck : a lock which protects the access to the two fields below.
  • u8 wrt : a flag reporting that someone is actively downloading and writing data.
  • u64 siz : the number of elements that have been downloaded and written and can be freely read by anyone.
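
Viewed from a provider's mapping, the sync section can be pictured as the following Rust struct (a sketch; field names follow the article, the real file layout may differ) :

use std::sync::atomic::{AtomicU8, AtomicU64};

/// Sync section, as seen through the shared mapping.
#[repr(C)]
struct SyncSection {
    /// lck : lock protecting `wrt` and `siz` (0 = free, 1 = held).
    lck: AtomicU8,
    /// wrt : set while a provider is actively downloading and writing data.
    wrt: AtomicU8,
    /// siz : number of elements already written and safe to read by anyone.
    siz: AtomicU64,
}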

One could argue that, since we assumed that all remote providers provide identical data, all local providers could just download data independently and write it to the file sections that they all mapped as shared.

This would work in practice but :

  • is dirty (personal opinion).
  • would lead to unnecessary cache coherency latency for writes to the same locations. Those writes would ultimately write the same data, but the cache coherency system would probably not be aware of that and the perf hit could be noticeable.
  • we would still need to synchronize the siz field of the sync data to ensure that it is monotonically increasing. Otherwise, a provider A could write a lot of data and increase it, while another provider B, which started to write at the same time but downloaded a smaller segment (more slowly), could complete later and update siz to a lesser value. This would force other providers to re-download and re-write data already written by A.

Hence, we need the sync section.

Now, let’s describe a write sequence example that shows how multiple providers collaborate to download the same data and how they all end up seeing what the others write :

  • The storage file only contains data for [Jan 1st 00:00:00, Jan 3rd 00:00:00[ (<= note the open bracket, we have no data for the end time).
  • Providers A and B want to read data for [Jan 10th 00:00:00, Jan 20th 00:00:00[
  • They both attempt to lock lck using atomics. The cache coherency system ensures that exactly one acquires the lock.
  • Provider A succeeds. Provider B is blocked until A releases the lock.
  • Provider A reads siz and sees that the data that it wants is not here. It either has to download it or wait for another provider to do so.
  • Provider A sees that wrt is clear : no one is actively downloading data so provider A must do it itself.
  • Provider A sets wrt and releases lck.
  • Provider B is unblocked, acquires lck.
  • Provider B reads siz and sees that the data that it wants is not here. It either has to download it or wait for another provider to do so.
  • Provider B sees that wrt is set : someone is actively downloading data so provider B must wait for it to complete.
  • Provider B releases lck and goes to sleep for some time. It will periodically re-lock lck and check siz and wrt again to check if data is being downloaded or if it needs to do it itself.
  • Provider A downloads data, processes it and writes it to its locally mmap-ed sections. Updates are propagated to B by the cache coherency system.
  • Provider A attempts and succeeds in acquiring lck. It then clears wrt, updates siz to reflect its writes, and then releases lck.
  • Provider A then proceeds to read the data range that it was originally interested in.
  • Provider B wakes up and acquires lck. It then checks siz and notices that the data that it wants is now here. It releases lck.
  • Provider B then proceeds to read the data range that it was originally interested in.
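
The coordination logic followed by A and B above could be sketched as follows, reusing the SyncSection sketch from the previous section (a simplification : the real implementation would pick memory orderings and back-off delays more carefully) :

use std::sync::atomic::Ordering;
use std::time::Duration;

/// Outcome of the coordination : either the data is already readable,
/// or the caller has claimed the write flag and must download it.
enum Access {
    Ready,
    MustDownload,
}

/// Wait until `needed` elements are available, or claim the right to download them.
fn wait_or_download(sync: &SyncSection, needed: u64) -> Access {
    loop {
        // Acquire lck, which protects wrt and siz.
        while sync.lck.compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed).is_err() {
            std::hint::spin_loop();
        }
        let ready = sync.siz.load(Ordering::Acquire) >= needed;
        let writing = sync.wrt.load(Ordering::Relaxed) != 0;
        if !ready && !writing {
            // No one is downloading : claim the write flag ourselves.
            sync.wrt.store(1, Ordering::Relaxed);
        }
        sync.lck.store(0, Ordering::Release); // release lck
        if ready {
            return Access::Ready;
        }
        if !writing {
            return Access::MustDownload;
        }
        // Someone else is downloading : sleep, then re-check siz and wrt.
        std::thread::sleep(Duration::from_millis(100));
    }
}

After downloading and writing, the provider would re-acquire lck, clear wrt, advance siz (never decreasing it), then release lck, exactly as in the sequence above.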

Top level design
#

Let’s take a look at what a system with two bots attempting to process years 2020 and 2021 of NVDA would look like.

[Figure : structure of the data provider]

On the top, we see the remote provider, used by bots to download data.

In the middle, we see our two bots. Each ultimately wants to create a buffer containing all volumes and values data for NVDA [2020, 2021].

Both have their own local provider, and they have opened the storage files containing NVDA data for 2020 and 2021.

On the bottom, we see a representation of those two files and of their internal state :

  • prv/nvda.2020.stg is complete : all its data has been downloaded, and hence, both bots have been able to extract data from it and to populate half of their destination buffers.
  • prv/nvda.2021.stg is incomplete : half of its data is missing, and both bots 0 and 1 need its missing data to finish populating their buffers.

Bot 0 has attempted to acquire write privileges but failed, as Bot 1 was quicker to acquire them. It is now repeatedly sleeping and checking whether the write is complete and whether all the data that it requires has been written.

Bot 1 has successfully acquired write privileges, and is now actively downloading data from the remote provider and writing it to its local mapping of the data sections.

Once Bot 1 is done writing all the remaining data :

  • prv/nvda.2021.stg will be reported as complete.
  • Bot 0 will detect that the data that it requires is present and finish populating its data buffer from the data that was written by Bot 1.
  • Bot 1 will be able to populate its own buffer from the data that it downloaded.

Storage file layout
#

  Offset        Description    Data
┌────────────┬──────────────┬───────────────────────────┐
│            │              │                           │
│            │              │  Marketplace              │
│  0         │  Descriptor  │  Symbol                   │
│            │              │  Total number of minutes  │
│            │              │                           │
├────────────┼──────────────┼───────────────────────────┤
│            │              │  Sync data access lock    │
│ PG_SIZ     │  Sync data   │  Write flag               │
│            │              │  size counter             │
├────────────┼──────────────┼───────────────────────────┤
│            │              │                           │
│            │              │  Raw array of values      │
│ 2 * PG_SIZ │  Values      │  in mid order             │
│            │              │  (f64)                    │
│            │              │                           │
├────────────┼──────────────┼───────────────────────────┤
│            │              │                           │
│ 2 * PG_SIZ │              │  Raw array of volumes     │
│ + VA_SIZ   │  Volumes     │  in mid order             │
│            │              │  (f64)                    │
│            │              │                           │
├────────────┼──────────────┼───────────────────────────┤
│            │              │                           │
│ 2 * PG_SIZ │              │  Raw array of indices     │
│ + VA_SIZ   │  Prv_ids     │  in mid order             │
│ + VO_SIZ   │              │  (u32)                    │
│            │              │                           │
├────────────┼──────────────┼───────────────────────────┤
│            │              │                           │
│ 2 * PG_SIZ │              │                           │
│ + VA_SIZ   │   End        │  N/A                      │
│ + VO_SIZ   │              │                           │
│ + PR_SIZ   │              │                           │
│            │              │                           │
└────────────┴──────────────┴───────────────────────────┘

First, let’s remember that each section (except END…) will be mmap-ed by the trading bot’s local provider in order to read and write data. Hence, each section needs to be placed at an offset which allows its mapping.

As stated in the dedicated article, the MMU only allows us to map entire pages, which means that each section must reside at an offset which is a multiple of the page size.

Since our trading bot may run on multiple systems, we only need to ensure that our file layout is compatible with the systems that we want to run on, by choosing an alignment at least as large as any page size we may encounter. PG_SIZ = 64KiB is enough for a modern system.

Since a year has a maximal number of minutes, equal to 366 * 24 * 60 = 527040, each array (volumes, values and prev IDs) has at most 527040 elements. Hence their maximal sizes are :

  • VA_SIZ, VO_SIZ : 527040 * sizeof(f64) = 527040 * 8 = 4216320 bytes.
  • PR_SIZ : 527040 * sizeof(u32) = 527040 * 4 = 2108160 bytes.

Since we must respect page alignment, the effective size of those arrays can be determined by :

  • VA_SIZ, VO_SIZ : (4216320 + (PG_SIZ - 1)) & ~(PG_SIZ - 1) = 0x410000 = 4259840
  • PR_SIZ : (2108160 + (PG_SIZ - 1)) & ~(PG_SIZ - 1) = 0x210000 = 2162688

(This uses the fact that PG_SIZ is a power of 2, so rounding down to it is just masking some bits, and that rounding a number N up to a multiple of X is equivalent to rounding N + (X - 1) down to a multiple of X.)
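
As a sketch, the layout constants could be written down as follows (assuming PG_SIZ = 64 KiB and the section order of the table above) :

/// Round `n` up to the next multiple of `align` (a power of two).
const fn align_up(n: usize, align: usize) -> usize {
    (n + (align - 1)) & !(align - 1)
}

const PG_SIZ: usize = 64 * 1024;          // 64 KiB, conservative page size
const MAX_MINUTES: usize = 366 * 24 * 60; // 527040 minutes in a leap year

const VA_SIZ: usize = align_up(MAX_MINUTES * 8, PG_SIZ); // values  (f64)
const VO_SIZ: usize = align_up(MAX_MINUTES * 8, PG_SIZ); // volumes (f64)
const PR_SIZ: usize = align_up(MAX_MINUTES * 4, PG_SIZ); // prv_ids (u32)

// Page-aligned offsets of each section within the storage file.
const OFF_DESCRIPTOR: usize = 0;
const OFF_SYNC: usize = PG_SIZ;
const OFF_VALUES: usize = 2 * PG_SIZ;
const OFF_VOLUMES: usize = OFF_VALUES + VA_SIZ;
const OFF_PRV_IDS: usize = OFF_VOLUMES + VO_SIZ;
const FILE_SIZE: usize = OFF_PRV_IDS + PR_SIZ; // total size the file is extended to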

Going further : mapping, shareability, cache coherency and performance
#

As we covered in the file system section, whenever a local provider wants to read data for a given year of a given instrument, it creates (if needed) then opens the file corresponding to this (year, instrument) and maps its different sections in memory.

If multiple local providers want to use the same data, they will end up mapping the same data pages in their respective virtual address space. This allows them to use simple atomic operations to coordinate themselves when writing to those buffers.

When doing so, the processor’s cache coherency system takes care of :

  • actually ensuring the atomicity of those instructions (like Compare And Swap) even in a multi-processing context.
  • propagating the writes that one CPU makes to other CPUs, with the assumption that the correct memory barriers are used at the correct places.

Cache coherency operations have an inherent cost. When CPU A writes to a location which is in CPU B’s cache, both CPUs need to coordinate so that they both see the same value. If the architecture provides implicit ordering guarantees, they also need to ensure that all previous writes made by A are visible to B if it reads the related locations.

This is why one of the first rules of performance is : “avoid mutable shared data”. Memory accesses are slow, and shared memory accesses are even slower.

In our design, things are not dramatic as :

  • the descriptor section is never written, except by the local provider which takes care of creating the file and initializing it. Other local providers never write this section.
  • the synchronization page is the page whose accesses are the slowest, as it is inherently there to take advantage of the CPU’s cache coherency mechanism. We accept that accesses to this page are slow, as most of them are atomic operations that exist for correctness rather than perf.
  • data pages should not cause us too much latency, as the data that they contain is written by only one CPU before being read by the others. The situation where multiple local providers wait for the same data element should be relatively rare, but in this case there is some latency involved, as cache coherency is maintained at cache-line granularity (64 bytes = 8 64-bit floats). What could happen is that :
    • CPU A gets write rights, downloads data, writes it in memory. The write finishes with filling half (4 elements) of a cache line L.
    • CPU B reads the newly written data until the end -> cache line L must be migrated from CPU A to CPU B so that B sees A’s updates.
    • CPU B gets write rights for the next data, downloads it, and writes it in memory. The write starts by filling the remaining half of L.
    • CPU A reads the newly written data until the end -> cache line L must be migrated back from CPU B to CPU A so that A sees B’s updates.

Since our system never assumes that writes made by a local provider to a location should not be visible to other local providers, we can map all our pages as shareable, and avoid the expensive copy-on-write kernel mechanism.

We could map the pages that we do not intend to modify (descriptor page + fully written data pages) as read-only, in order to actually prevent anyone from writing to them and to catch any buggy write in our code. However, it may not be a good idea performance-wise, as remap operations are expensive in terms of :

  • code path, since they involve taking a syscall.
  • kernel processing, as they involve updating the kernel’s bookkeeping structures to reflect our new attributes.
  • HW cost, as they involve updating our MMU’s TLB which could take long.

Though, as fully writing a page of data is relatively rare (one page covers 64KiB / 8 = 8192 minutes ≈ 136 hours ≈ 5.7 days worth of data), one could disregard this perf hit.
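
If one chose to do it anyway, remapping a fully written page as read-only is a single mprotect call (sketch using the libc crate) :

/// Remap a fully written, page-aligned region as read-only.
/// This is the operation whose syscall / TLB cost is discussed above.
unsafe fn protect_read_only(addr: *mut u8, len: usize) {
    let ret = libc::mprotect(addr as *mut libc::c_void, len, libc::PROT_READ);
    assert_eq!(ret, 0, "mprotect failed");
}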

Going further : prv_ids
#

As I mentioned before, there is a third array in our file storage, that has not been explained yet.

Let’s first understand the problem that it solves, by re-stating the two different data read procedures that the provider must support :

The data consumers need to make the following two types of requests :

  • range request : takes a start minute index (stt) and an end minute index (end) and, for the time interval [stt, end[, generates :
    • a volumes array with one volume per minute index in increasing mid order. If no transactions were made at a minute index, the related array element is 0.
    • a values array with one value per minute index in increasing mid order. If no transactions were made at a minute index, the related array element is 0.
  • value request : takes a target minute index (tgt), determines the greatest minute index mid less than or equal to tgt with a non-zero volume, then returns :
    • mid.
    • vol the volume at mid.
    • val the value at mid.

Based on the storage method implemented by the provider, which this chapter described in detail, the first request should look straightforward. To perform a (multi-year) range request, the provider performs multiple sequential sub-range requests, one for each year that the main range request spans; for each of these single-year sub-requests, it downloads data to (if needed) and reads data from the year’s storage file. It is relatively efficient, in the sense that performing the data reads costs a lot of memory accesses, but those are strictly required, so there is not much room for optimization.

A valid optimization could be, instead of having the consumers expect full data buffers, to have them expect an array of pointers into the mapped areas where the actual data resides. That would avoid the initial copy, but if the data ultimately has to be copied anyway, we would still pay for that copy.

The second request (the value request) is less obvious and more prone to footgunning ourselves perf-wise.

Indeed, in order to find the first mid with a non-null volume at or before a given mid tgt, we need to traverse the history minute by minute in reverse mid order starting at tgt. This can take an unnecessarily long time and can easily be avoided if, for every mid x, we keep track of the last mid <= x with a non-null volume.

The prv_ids array serves this purpose. For every mid m that a file contains data for, it has an element which :

  • if the volume at mid m is non-null, it is equal to the index of this mid in the values / volumes arrays.
  • if the volume at mid m is null, and the previous mid at which the volume was non-null is in the same year (and hence its index in the file’s arrays exists), it is equal to that index.
  • if the volume at mid m is null, and the previous mid at which the volume was non-null is in a previous year (and hence its index in this file’s arrays does not exist), it is equal to -1.
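
A sketch of how the prv_ids array could be filled while data is written in mid order (u32::MAX playing the role of the -1 sentinel) :

const NO_PREV: u32 = u32::MAX; // the "-1" sentinel

/// Fill prv_ids for a year's arrays, in increasing mid order.
/// `volumes` holds the zero-padded per-minute volumes of the same year.
fn fill_prv_ids(volumes: &[f64], prv_ids: &mut [u32]) {
    let mut last_traded: u32 = NO_PREV;
    for (i, &vol) in volumes.iter().enumerate() {
        if vol != 0.0 {
            // A traded minute is its own reference.
            last_traded = i as u32;
        }
        // Otherwise : the last traded minute of this year, or the sentinel
        // if no minute of this year has traded yet.
        prv_ids[i] = last_traded;
    }
}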

The value request lookup (starting at tgt) is then straightforward for the provider :

  • it first starts in the file that contains tgt.
  • it looks up in the prv_ids array at the offset corresponding to tgt.
  • if the read value is -1, it knows that the mid that it looks for is not in this file. It then reiterates on the previous year starting at the year’s last mid.
  • if the read value is x != -1, it knows that the mid that it looks for is contained in this file and resides at index x in this file’s array. It then performs the reads and returns the expected results.
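
Put together, the lookup could be sketched as follows (YearFile is a hypothetical view over one year's mapped arrays) :

use std::collections::BTreeMap;

const NO_PREV: u32 = u32::MAX; // the "-1" sentinel in prv_ids

/// Minimal view over one year's mapped storage file (hypothetical type).
struct YearFile {
    prv_ids: Vec<u32>,
    volumes: Vec<f64>,
    values: Vec<f64>,
}

/// Value request : one prv_ids read per visited year.
/// `tgt_year` / `tgt_off` locate the target minute inside its year's arrays.
fn value_request(
    files: &BTreeMap<i32, YearFile>,
    tgt_year: i32,
    tgt_off: usize,
) -> Option<(i32, usize, f64, f64)> {
    let mut year = tgt_year;
    let mut off = tgt_off;
    loop {
        // Missing year file : the instrument was never traded this early -> fail.
        let file = files.get(&year)?;
        match file.prv_ids[off] {
            NO_PREV => {
                // No traded minute at or before `off` in this year :
                // reiterate on the previous year, starting at its last mid.
                year -= 1;
                off = files.get(&year)?.prv_ids.len() - 1;
            }
            idx => {
                // Found the most recent traded minute : read its volume and value.
                let i = idx as usize;
                return Some((year, i, file.volumes[i], file.values[i]));
            }
        }
    }
}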

Instead of traversing the volumes array one value at a time, this method allows us to perform only one read per year that we look into (plus the overhead required to map the file data, which would be needed anyway).

The reader may wonder what happens if we start a value request at a mid for which an instrument never actually had any trade (e.g. before the instrument was introduced on the market). This would theoretically cause the provider to look up every year in decreasing order until the beginning of time (i.e. 1970, as everyone knows). In practice, we can just consider that an instrument having no transactions in an entire year simply was not traded before, and have the request fail.
