Strix devlog #6

2018/12/11

11 minute read

So some more progress this week, mostly optimizations to existing jobs - I needed to get things running smoothly before I can concentrate on the new features I have planned. The issues for the new features are on gitea waiting to be tackled, but accumulating too much technical debt makes it harder in the long run to keep a smoothly running system. Here's a summary of this devlog:

Bugfixes

Optimization

Experiments

New features

Bugfixes

First off, a couple of bug fixes. Finding bugs in this system is notoriously difficult, as testing isn't done on behaviors but on generated data. Verifying the data against a source of truth is quite time consuming, and I only have a certain amount of time I can dedicate to manual testing. One of the defects that caught my attention this week was stochastic signals firing for both oversold and overbought on the same day, which is absolute nonsense. At first I thought it was a date issue, because the symbols I checked were from the top gainers and had huge increases that could pull the stochastic up from oversold to overbought in a day. My guess was that the stochastic was oversold on day T-1 and became overbought on day T. But deeper investigation and hours of debugging showed that this was not the case. The triggering code for the indicator is fairly straightforward:

if (def.srt.contains("overbought") && d[len - 2] < ob && ob < d[len - 1]) {
            return new ScanResult(data, def);
        }
        if (def.srt.contains("oversold") && d[len - 2] > os && os > d[len - 1]) {
            return new ScanResult(data, def);
        }

        if (def.srt.contains("neutral") && (d[len - 2] > ob && ob > d[len - 1]) ||
                (d[len - 2] < os && os < d[len - 1])) {
            return new ScanResult(data, def);
        }

There is a subtle bug in this code even though it looks quite simple. The 3rd condition has the shape a && b || c. In Java, && binds tighter than ||, so this parses as (a && b) || c: the second crossing check isn't gated by the "neutral" check at all, and the branch fires for overbought and oversold scan definitions too.

The correct code is a && (b || c). Yet again the great syntax of Java produces a bug that is easy to miss and will make you go blind in the process. Well, it did cost a couple of hours, but at least it was an easy fix.
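
Applied to the snippet above, the fixed neutral branch just needs the two crossing checks wrapped so they are both gated by the "neutral" check:

if (def.srt.contains("neutral") && ((d[len - 2] > ob && ob > d[len - 1]) ||
        (d[len - 2] < os && os < d[len - 1]))) {
    return new ScanResult(data, def);
}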

Another bug that I introduced during the performance evaluator development came to light after I tackled some optimization tasks on that module. I was expecting way more scans to be scored than there were, which led me to investigate the scoring code and revealed a premature exit from the evaluation loop. A misplaced return statement instead of a continue statement was causing the loop to terminate early and not score the remaining predictions. Another easy fix at least.
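
The actual evaluator is more involved, but the shape of the bug was roughly this (the method and type names here are made up for illustration, not the real code):

// Buggy shape: a misplaced return abandons every remaining prediction
for (Prediction p : predictions) {
    if (!hasEnoughDataToScore(p)) {
        return;              // BUG: exits the whole evaluation
    }
    score(p);
}

// Fixed shape: continue skips just the current prediction
for (Prediction p : predictions) {
    if (!hasEnoughDataToScore(p)) {
        continue;            // move on to the next prediction
    }
    score(p);
}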

Optimization

Optimizations were the meat of this week's work. Originally I thought that running the modules on a single thread with good caching would suffice in terms of performance, but I was wrong.

So I went ahead and parallelized the portions of the modules where it made sense: the scan module, PerfEval and the Correlation Processor.

Thanks to Java 8 parallel streams this turned out to be quite easy. I just had to make sure that critical sections in the code were atomic and that I used thread-safe classes. One aspect to take into consideration is preserving cache locality: don't evict items from the cache only to have a second loop query that item again. So the order of processing is important.

symbols -> combos -> dates

will make sure that the cache contains the symbol data ready to be processed and will not trigger a DB query. Initially I had just parallelized the dates loop, but that loop doesn't contain enough data to make it worthwhile. The CPU cores were only about 60% busy, which is not ideal. So I moved the parallel streaming up 2 levels to the symbol level, which now utilizes all cores at 100%. PerfEval and Scans use the same code base so that was a bit easier than the Correlation Processor, which needed some extra attention due to memory issues. I operate under a RAM constraint (because I need to keep server costs to a minimum), so I needed to implement a specific cache for correlation calculations that just holds each symbol's last 30 closing prices.
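
Roughly, the shape of the parallelized loop looks like this - just a sketch, the class and collection names are stand-ins rather than the actual module code:

// Hypothetical sketch: parallelize at the symbol level so each worker
// loads one symbol's cached data once and works through combos and dates in order.
List<String> symbols = symbolDao.findAll();           // stand-in DAO
symbols.parallelStream().forEach(symbol -> {
    List<double[]> ohlc = ohlcCache.get(symbol);      // one cache load per symbol
    for (ScanDef def : scanDefs) {                    // "combos"
        for (int i = warmup; i < ohlc.size(); i++) {  // "dates"
            ScanResult result = runScan(def, ohlc, i);
            if (result != null) {
                resultQueue.add(result);              // thread-safe collection
            }
        }
    }
});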

The second area of optimization was syncing data from IEX and running scans on the synced data. I currently orchestrate the module jobs via Jenkins, and the pipeline only supports triggering based on a fail/success return code. This means that every time I run a sync job, a scan job will trigger on success - even if there is no new data. It will just calculate the same results over and over again. This wasn't really a problem when the scan job only took a couple of minutes to complete with 1 year's worth of data, but it doesn't work with 5 years of data. So I implemented a check in the scan module that will only run the scans if the latest scan date for a symbol is earlier than the latest OHLC date. The job will still run, but it will just skip the scans, so it takes only about 3 - 4 minutes to complete as opposed to 1 - 1.5 hours.

Another issue is that IEX doesn't clearly define when they will update the API with the day's stock data. Previously I was fetching the data at 7 pm and 11 pm local time and processing it even if it was old (the 7 pm fetch could return stale data; by 11 pm it would be updated for sure). But this means that the new scans/signals are not shown until almost 12 am, which is less than ideal. I didn't want to check every hour or two, because it would trigger the whole job pipeline and it's a lot of data to download. My solution came after I discovered an API endpoint that lists the symbols and has a date field showing when it was last updated. Why didn't I just check a single symbol for the last date? Because on any given date a symbol may not be traded. The probability of AAPL not being traded is rather low, but I still prefer a robust solution if there is one. The cost of making this API call to the symbol list is low, so now I poll for new data every hour between 4 pm - 11 pm on weekdays. There is still no way to abort the pipeline without a failure on Jenkins, but with the scan check this is now less of a problem.
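
The scan-skip check itself is conceptually just a per-symbol date comparison; something along these lines (the DAO names are made up for illustration):

// Hypothetical sketch of the "is there anything new to scan?" guard
LocalDate lastScanDate = scanResultDao.findLatestScanDate(symbol); // may be null for new symbols
LocalDate lastOhlcDate = ohlcDao.findLatestDate(symbol);

if (lastScanDate != null && !lastScanDate.isBefore(lastOhlcDate)) {
    return; // nothing newer than the last scan, skip this symbol
}
runScans(symbol);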

A huge pain point was the duration of the performance evaluation. I realized this week that I was doing a lot of unnecessary processing that was causing the job to take forever. I only needed to calculate performance for the scans that had been triggered for that symbol, but I was running all the scans for that symbol. Duh!
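
The fix is basically a relevance filter in front of the evaluator; roughly (again, hypothetical names):

// Hypothetical sketch: only evaluate scan definitions that actually fired for this symbol
Set<String> triggeredScanIds = scanResultDao.findTriggeredScanIds(symbol);

List<ScanDef> relevant = scanDefs.stream()
        .filter(def -> triggeredScanIds.contains(def.getId()))
        .collect(Collectors.toList());

for (ScanDef def : relevant) {
    evaluatePerformance(symbol, def);
}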

With a new filter that skips scans that are not relevant, plus the parallel processing, the performance evaluation now takes 1 hour to complete, which is reasonable. Even though I won't be running this job that often, optimizing it kind of became my holy grail. Looking back at the comments on the issue on gitea, the first iteration resulted in 1K scored scans and 6K unscored (this was due to the bug I mentioned previously, which I didn't know about at the time). This was way too few, so I thought I'd throw more data at it and increased the data interval from 1 year to 5. This increase resulted in 2.4K vs 6K. Still not good enough. After fixing the premature loop termination issue the final ratio is 5K/6K, which looks OK to me. It is possible that some scans just didn't occur frequently enough to be scored.

The DAO layer also got some love this week. I had previously implemented the DAOs in a way that each operation would open a new connection to the DB, even though this is not good practice. I didn't want to integrate a connection pooling solution, as the processes are not long lived, so I just refactored the connection objects to be reused class-wide. I'm using a relatively new SQL library called sql2o which has a nice, plain API for DB operations, but the way they implemented batch insertions is not optimal. It just wraps the inserts in a transaction and still inserts each row individually. MySQL's grouped insert performs much better than this, so I refactored the batch inserts to generate a grouped insert query. This increased the performance of the inserts quite a bit, even though I didn't measure by how much.
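
The grouped insert is just the classic multi-row VALUES statement. A stripped-down sketch with plain JDBC (the real code goes through sql2o, and the table, columns and Row type here are made up):

// Hypothetical sketch: build one multi-row INSERT instead of N single-row inserts
StringBuilder sql = new StringBuilder("INSERT INTO ohlcv (symbol, dt, close) VALUES ");
for (int i = 0; i < rows.size(); i++) {
    sql.append(i == 0 ? "(?, ?, ?)" : ", (?, ?, ?)");
}

try (PreparedStatement ps = connection.prepareStatement(sql.toString())) {
    int p = 1;
    for (Row row : rows) {                            // Row is a stand-in for the actual DTO
        ps.setString(p++, row.symbol);
        ps.setDate(p++, java.sql.Date.valueOf(row.date));
        ps.setBigDecimal(p++, row.close);
    }
    ps.executeUpdate();                               // one round trip for the whole batch
}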

After examining the MySQL advisor in phpMyAdmin I saw that it complained about a lot of row sorting. The culprit was that each query to the OHLCV table needed a sort by date for the caching and range query to work properly. The problem here is that I already get the data sorted from the data API and lose that information after the insertion into the OHLCV table. I thought about getting rid of the table and querying the JSON data directly, but some of the functions in the TOP module and the market overview module take advantage of this table to reduce the amount of code and offload processing to the database, so the table had to stay. I now store the API response in the K/V store, queried only by the ChartData component for caching. I also saw that selecting the last scan date from the scan_result table was doing full table scans, so I save those in the K/V store too. This type of optimization can be good for performance, but it's important not to let these data points get out of sync. I also implemented a fallback mechanism that queries the DB if the value is not found in the K/V store. This case can occur when a symbol is introduced and the last scan date is not yet inserted.
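
The fallback is essentially a read-through lookup. Something like this, with the write-back being one way to keep the two in sync (the kvStore and DAO methods are illustrative, not the actual API):

// Hypothetical sketch: read the K/V store first, fall back to the DB if the key is missing
LocalDate lastScanDate = kvStore.getDate("last_scan_date:" + symbol);
if (lastScanDate == null) {
    // key missing, e.g. a freshly introduced symbol
    lastScanDate = scanResultDao.findLatestScanDate(symbol);
    if (lastScanDate != null) {
        kvStore.putDate("last_scan_date:" + symbol, lastScanDate); // write back so the next lookup hits the store
    }
}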

The Jenkins job pipeline also needs to be separated by exchange to reduce the amount of processing. This can be achieved by extracting the exchange as a CLI parameter to the jobs. No need to run scans for IEX after BFX data is synced; just run the BFX scans.

Experiments

I decided to change the way the prediction checker scores scans. My initial implementation was to exit on +/- 2 x ATR of the symbol. This yielded around 50% average scores. Next I tried a 15 period low/high exit strategy. There is really no correct way of doing this, as strategy is very personal. My reasoning was that on an uptrend the 15 period low would still lock in decent gains, and on a downtrend it would exit quite early to cut losses. After the performance evaluation runs, the average score was 0.23959762958591568 and the average error was 0.17946781069881107 at 95% confidence. I will still run the 2xATR, 3xATR and percentage-based exit strategies to compare their outputs. Maybe running multiple strategies, selecting the best performing one and showing that could also be a nice feature, but again it's very personal. I believe a 1:1 take profit / stop loss ratio will result in an average score of around 50%, and 2:1 will result in an average score of around 33%, as it basically comes down to expected value since the price changes are distributed randomly.
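
A quick sanity check of that intuition, under the same random-price assumption: for a driftless random walk, the probability of hitting a take-profit at distance a before a stop-loss at distance b is b / (a + b), so

\[
P(\text{TP before SL}) = \frac{b}{a+b}, \qquad
a = b \Rightarrow \tfrac{1}{2} = 50\%, \qquad
a = 2b \Rightarrow \tfrac{1}{3} \approx 33\%
\]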

I also changed the way the Bollinger Bands squeeze scan works. It was scanning the last N periods and triggering if the last period's bands were within a certain limit of the minimum band width of the last N periods. This is kind of overcomplicated, as I just want to get the periods where the bandwidth is low, so now it will trigger if the bandwidth is < 4%.
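
For reference, bandwidth here is the usual (upper - lower) / middle ratio, so the new trigger boils down to a one-liner (the variable names and the "squeeze" tag are illustrative):

// Hypothetical sketch of the simplified squeeze trigger
double bandwidth = (upperBand - lowerBand) / middleBand;   // standard Bollinger bandwidth
if (def.srt.contains("squeeze") && bandwidth < 0.04) {     // trigger below 4%
    return new ScanResult(data, def);
}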

New features

As I was sifting through the scans I realized that I kept looking for a chart to see the relevant data that triggered the scan. If the scan is a Bollinger Bands squeeze, I wanted to see the bands on the chart. So I added this feature. This required me to store the band data in the K/V store, because that's the only storage the API will access. To be consistent I also stored the last 100 periods of OHLC data in the K/V store to generate the candlestick data for the chart. The K/V store is backed by MySQL, but this may change in the future. Redis is a strong contender for K/V storage, but I'll cross that bridge when I get there. I added a new field to the scan definitions file that defines which indicators will be shown on the chart when that scan is triggered. Here is a screenshot of this new feature in action:

Wow, yet another very long post for a short week. Looks like a lot has been done and development is going full steam ahead.