Our Experience with Numerai

Saahil Barai
Published in Analytics Vidhya
36 min read · May 7, 2021


Saahil Barai, Amit Verma, Hasanain Manesia, Stephen Chang, Yash Dhaduti, and Brian Menezes.

Photo by Maxim Hopman on Unsplash

An Introduction to Numerai

Numerai is a platform where users have the opportunity to build machine learning models on abstract financial data to predict the stock market. Numerai is composed of several facets that, when combined, make the platform incredibly interesting. We will begin with a brief introduction to each of the main facets of Numerai: Data, Modeling, Submissions, Scoring, and Staking.

Data is one of the core components of Numerai. Numerai provides its users with high-quality financial data that has been cleaned, regularized, and obfuscated. By doing so, Numerai enables data scientists to apply their talents and knowledge to the stock market using data that is otherwise not readily available to the common investor. On top of this, the data is already cleaned and essentially ready to be used "out of the box". Using this data, users can build a model to predict a target variable indicative of future price from features that describe the current state of the stock market. Each Saturday, a new test dataset is released, and users must submit their new predictions by Monday. After each submission, Numerai shows the user submission diagnostics: metrics that describe the model's performance and risk characteristics over the historical validation data. The main scoring component of Numerai is the correlation between your predictions and the targets, computed over a tournament round composed of 20 live market days. Finally, users can optionally stake currency on their models. The better the model does, the larger the payout at the end of the round. In this way, users are encouraged to participate and refine their models. These five facets come together to create an incredibly unique community that enables data scientists to take on the realm of the stock market.

Data

Snippet of the tournament data

A snippet of the tournament data provided to users is shown above. The first column contains an id that corresponds to a particular stock. The second column, titled era, represents the time period over which the data on each stock was collected; each era represents a month. Lastly, the data is obfuscated: each feature and the target take on one of five values between 0 and 1 (0, 0.25, 0.5, 0.75, and 1).

Metrics and Diagnostics

The metrics on Numerai are split into three categories: Performance, Risk, and Meta Model Contribution (MMC).

Performance consists of a sharpe score, a correlation score, and a feature neutral correlation score. The sharpe score is calculated by dividing the mean of the per-era correlations by the standard deviation of the per-era correlations. The correlation score is simply the mean of the per-era correlations, and the feature neutral correlation score is the mean of the per-era correlations after predictions have been neutralized to all features. This category measures performance over the validation set.

Risk consists of standard deviation, feature exposure and max drawdown. Feature exposure represents the maximum correlation any one feature has with the target predictions. Max drawdown represents the largest decrease in correlation between any two eras. This category measures how likely it is that the model will suffer significant losses in correlation in the future.

To explain what MMC consists of, we must first discuss what the Meta Model is. The Meta Model is a weighted ensemble of all submissions made in a round; the more a user stakes on their model, the more it is represented in the ensemble. So in addition to the correlation scoring component, there exists the MMC component, which measures the unique contribution of a user's submission relative to the Meta Model. This is done to encourage the performance of the collective users on the platform rather than solely individual correlation. The MMC category consists of the MMC + Correlation sharpe score, the MMC mean, and the correlation to example predictions. The MMC + Correlation sharpe score is very similar to the sharpe score except that per-era MMC scores are added to the per-era correlations: it is the mean of per-era (correlation + MMC) divided by the standard deviation of per-era (correlation + MMC). This category measures how unique a model is in comparison to other users' submissions on the platform. For more information on these metrics, Numerai has a great page explaining diagnostics.
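As a reference for how these diagnostics can be reproduced locally, below is a minimal sketch of a per-era correlation and sharpe calculation on a validation dataframe. The column names ("era", "prediction", "target") are assumptions on our part, mirroring the Numerai example scripts rather than any official tooling.

import numpy as np
import pandas as pd

def per_era_correlations(df, pred_col="prediction", target_col="target"):
    # Rank predictions within each era, then correlate with the target (Numerai-style scoring).
    def era_corr(era_df):
        ranked = era_df[pred_col].rank(pct=True, method="first")
        return np.corrcoef(ranked, era_df[target_col])[0, 1]
    return df.groupby("era").apply(era_corr)

def validation_sharpe(df):
    # Sharpe = mean of per-era correlations divided by their standard deviation.
    corrs = per_era_correlations(df)
    return corrs.mean() / corrs.std()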

Modeling

There are many models that can be used in the Numerai Tournament, and each model has its own advantages and disadvantages. Some models are highly useful in certain applications but perform poorly in others. Our goal in this section is to find models that are well suited to the structure of the data and the metrics provided by Numerai. Below we detail our experience with various regression models, which fall into two main categories: linear and nonlinear. The nonlinear models are mostly tree-based, whereas the linear models are based on linear regression.

General Insights from running Linear Models:

Overall, the performance of all linear models was poor relative to the other models created in our project as well as those of others in the Numerai competition. The results for Models 1–6 weren't surprising given that we had time-series data, which isn't expected to be modeled well by a linear relationship. Even with some idea of the results we would get, we still wanted to see how well we could model the dataset using a linear model. Vanilla versions of ridge and linear regression provided the best validation sharpe, while the other linear models performed far worse relative to the baseline validation sharpe of 0.4918. Despite this unfavorable performance, it was interesting to see that lasso yielded a low validation standard deviation relative to all models tested in this project. That said, after analyzing Models 1–6, the biggest takeaway was that it is hard to rely on any linear model, optimized or not, for the provided dataset.

Model 1: Vanilla Linear Regression

Metrics
Validation Sharpe: 0.4918
Validation Correlation: 0.0149
Validation Feature Neutral Correlation: 0.0025
Validation Standard Deviation: 0.0303

From the performance metric perspective, the validation sharpe was very low, which was understandable given that this was the first model we ran. From a risk metric perspective, the validation standard deviation was very high relative to all other models we used, being almost twice that of the nonlinear models. From this linear model, we realized that we needed to find a way to either increase validation correlation or decrease validation standard deviation. Because plain linear regression doesn't really have any numerical hyperparameters to tune, we chose to evaluate other linear models instead, such as ridge, lasso, and elastic net.
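For context, here is a minimal sketch of how a vanilla linear baseline like Model 1 can be fit and scored. The training_data and validation_data dataframes and the per_era_correlations helper from the diagnostics sketch above are assumed to already be defined; this is an illustration, not our exact notebook code.

from sklearn.linear_model import LinearRegression

# Tournament feature columns all share the "feature" prefix.
feature_cols = [c for c in training_data.columns if c.startswith("feature")]

model = LinearRegression()
model.fit(training_data[feature_cols], training_data["target"])

validation_data["prediction"] = model.predict(validation_data[feature_cols])
corrs = per_era_correlations(validation_data)
print("validation sharpe:", corrs.mean() / corrs.std())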

Model 2: Vanilla Ridge Regression

Metrics 
Validation Sharpe: 0.4918
Validation Correlation: 0.0149
Validation Feature Neutral Correlation: 0.0025
Validation Standard Deviation: 0.0303

We hoped that adding a penalty would improve validation sharpe, but ridge regression performed essentially the same as vanilla linear regression (for validation sharpe, validation standard deviation, and validation correlation). Both the validation correlation and the validation standard deviation remained the same as with linear regression, which we did not expect. As mentioned earlier, the hope was for this model to at least give us a different validation correlation or validation standard deviation to work from. Because this was not the case, we decided to tune alpha and max iterations, hoping to increase validation correlation and/or decrease validation standard deviation.

Model 3: Ridge Regression with GridSearchCV over alpha and max_iter

Metrics
Validation Sharpe: 0.4908
Validation Correlation: 0.0149
Validation Feature Neutral Correlation: 0.0026
Validation Standard Deviation: 0.0304

In an attempt to increase the validation correlation and decrease the validation standard deviation, we ran a grid search (over alpha and max iterations) using ridge regression. In comparison to our vanilla ridge regression model, the tuned one didn't yield favorable results. To our surprise, the validation sharpe decreased as a result of the validation standard deviation increasing by 0.0001. Even after using GridSearchCV with multiple ranges for both alpha and max iterations, the validation correlation and validation standard deviation didn't change enough to move the validation sharpe from 0.4908 back to anywhere near 0.4918. Given that these were the only numerical parameters, this was our best approach to grid searching over ridge regression.
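Below is a sketch of the kind of grid search we ran over ridge regression. The exact parameter values and the scoring metric shown here are illustrative assumptions, since we experimented with multiple ranges.

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

param_grid = {
    "alpha": [0.01, 0.1, 1.0, 10.0, 100.0],  # example range; we tried several
    "max_iter": [1000, 5000, 10000],
}

search = GridSearchCV(Ridge(), param_grid, scoring="neg_mean_squared_error", cv=3)
search.fit(training_data[feature_cols], training_data["target"])
print(search.best_params_, search.best_score_)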

Model 4: Vanilla Lasso Regression

Metrics
Validation Sharpe: -0.2058
Validation Correlation: -0.0031
Validation Feature Neutral Correlation: -0.0030
Validation Standard Deviation: 0.0151

Given the performance of the linear regression and ridge regression models, we were curious to see how lasso regression would perform relative to them. We were rather surprised by our findings, as the validation sharpe was far worse than that of ridge and linear regression, mainly due to the negative validation correlation. The only metric that did improve was the validation standard deviation, which was nearly half of what it was for vanilla linear regression and vanilla ridge regression. This improvement motivated us to tune lasso regression in the hope of making the validation correlation positive.

Model 5: Lasso Regression with GridSearchCV over alpha and max_iter

Metrics
Validation Sharpe: -0.2058
Validation Correlation: -0.0031
Validation Feature Neutral Correlation: -0.0030
Validation Standard Deviation: 0.0151

Because the vanilla lasso model's validation standard deviation was roughly half that of our linear and ridge regression models, we hoped that tuning lasso over alpha and max iterations would make the validation correlation positive and thus push the validation sharpe near 0.5. Sadly, this was not the case. We weren't able to change the validation correlation at all, much like when we grid searched over ridge regression. That said, we realized that grid searching over parameters like alpha and max iterations wasn't very useful for linear models, as adding a penalty only decreased the validation correlation on this time-series data.

Model 6: Vanilla Elastic Net

Metrics 
Validation Sharpe: -0.2058
Validation Correlation: -0.0031
Validation Feature Neutral Correlation: -0.0030
Validation Standard Deviation: 0.0151

Elastic net was our final attempt at using a linear model, to see if any combination of ridge and lasso penalties could provide a validation sharpe higher than 0.4918. Unfortunately, it gave the same results as both of the lasso regression models we ran. These results did not motivate us to tune it further, as we expected the performance metrics to remain unfavorable. In other words, we didn't want to proceed with optimizing elastic net because we expected the same, or barely changed, metrics as the vanilla and tuned lasso models.

General Insights from running Non-Linear Models:

Overall, the nonlinear models gave better results than the linear models, with the average validation sharpe being above 0.7. Across the vanilla models, we generally saw validation standard deviations around 0.02 and validation correlations around 0.015, with the exception of some models like RandomForestRegressor. These nonlinear models proved to be more worthwhile to investigate even though they took longer to train. Additionally, we found it beneficial to apply additional tools to improve validation sharpe, including data pre-processing, data post-processing, era boosting, and neural networks.

Model 7: Vanilla XGBRegressor

Metrics 
Validation Sharpe: 0.6822
Validation Correlation: 0.0119
Validation Feature Neutral Correlation: 0.0064
Validation Standard Deviation: 0.0175

Transitioning away from linear models, XGBoost proved to be a noticeable improvement over any of the linear models we ran. The validation standard deviation was very close to the lowest one found among our linear models (0.0151). The only downside was that the validation correlation (0.0119) was lower than the highest validation correlation from our linear models (0.0149). Given this improvement, we sought to tune XGBoost to see whether more favorable changes in validation correlation and standard deviation could push the validation sharpe higher.
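All of the vanilla tree-based models in this section were fit with the same basic pattern; a minimal sketch with XGBRegressor (default parameters, as in Model 7) is shown below, reusing the assumed feature_cols and scoring helper from the earlier sketches.

from xgboost import XGBRegressor

model = XGBRegressor()  # vanilla: default hyperparameters
model.fit(training_data[feature_cols], training_data["target"])

validation_data["prediction"] = model.predict(validation_data[feature_cols])
corrs = per_era_correlations(validation_data)
print("validation sharpe:", corrs.mean() / corrs.std())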

Model 8: Vanilla LightGBMRegressor

Metrics
Validation Sharpe: 0.6348
Validation Correlation: 0.0151
Validation Feature Neutral Correlation: 0.0076
Validation Standard Deviation: 0.0238

Like XGBoost, LightGBM also proved to be a big improvement over our linear models. The validation correlation was 0.0151, one of the highest correlations we received. Conversely, the high validation standard deviation left this model performing at roughly the same level as XGBoost.

Model 9: Vanilla CatBoostRegressor

Metrics
Validation Sharpe: 0.8704
Validation Correlation: 0.0172
Validation Feature Neutral Correlation: 0.0122
Validation Standard Deviation: 0.0198

CatBoost was by far the best vanilla model we ran in terms of validation sharpe. Because of the high validation correlation, we were able to get a validation sharpe of 0.8704. The validation standard deviation was below 0.02, so this model looked promising as part of our final solution for the competition. That said, we believed the model was worth optimizing further.

Model 10: Vanilla RandomForestRegressor

Metrics 
Validation Sharpe: 0.3854
Validation Correlation: 0.0081
Validation Feature Neutral Correlation: 0.0023
Validation Standard Deviation: 0.0210

Despite being a nonlinear model, RandomForestRegressor was one of the worst-performing models in our project. The validation standard deviation was on par with that of the other nonlinear models such as XGBoost, CatBoost, and LightGBM, but the severely low validation correlation proved costly: the validation sharpe was 0.3854, far worse than the 0.4918 of vanilla linear regression. For this reason, we decided not to look further into optimizing RandomForestRegressor.

Model 11: CatBoostRegressor with GridSearchCV over n_estimators and learning_rate

Parameters Chosen
learning_rate: .5
n_estimators: 500
Metrics
Validation Sharpe: 0.8057
Validation Correlation: 0.0196
Validation Feature Neutral Correlation: 0.0129
Validation Standard Deviation: 0.0244

Given that n_estimators and learning_rate are core parameters for boosting regressors like CatBoost, it was surprising that we weren't able to beat the vanilla model's validation sharpe of 0.8704 even with GridSearchCV. Our biggest problem was that the validation standard deviation increased much more than the validation correlation. For this reason, we chose to use vanilla CatBoost as part of our final model.

Model 12: Vanilla CatBoost with Feature Neutralization

Metrics
Validation Sharpe: 0.6967
Validation Correlation: 0.0148
Validation Feature Neutral Correlation: 0.016
Validation Standard Deviation: 0.0213

In an attempt to lower validation standard deviation, we found that data post-processing would be a useful method. We applied it to CatBoost since it was already one of our highest-performing models without any tuning. That said, we ended up with a lower validation sharpe after applying per-era feature neutralization.

Model 13: Vanilla Gradient Boosting Regressor

Metrics 
Validation Sharpe: 0.5997
Validation Correlation: 0.0165
Validation Feature Neutral Correlation: 0.0091
Validation Standard Deviation: 0.0275

The vanilla Gradient Boosting Regressor performed around the same as LightGBM and XGBoost. The validation correlation was fairly close to that of vanilla CatBoost, which was interesting. However, the validation standard deviation was the highest of all the nonlinear models we ran, which did not help validation sharpe.

Model 14: LightGBMRegressor with GridSearchCV over n_estimators, max_depth and learning_rate

Parameters Chosen
n_estimators: 900
learning_rate: .01
max_depth: 10
Metrics
Validation Sharpe: 0.7188
Validation Correlation: 0.0177
Validation Feature Neutral Correlation: 0.0110
Validation Standard Deviation: 0.0247

Tuning n_estimators, max_depth and learning_rate led to a significant improvement over the vanilla LightGBMRegressor: validation sharpe, validation correlation and feature neutral correlation all increased noticeably. However, the vanilla CatBoostRegressor still held a sizable lead over the optimized LightGBMRegressor, as seen in the validation sharpe of each model: 0.7188 versus 0.8704. It is also worth noting that the vanilla CatBoostRegressor has a lower validation standard deviation as well, indicating to some degree that its sharpe is not a result of overfitting. Consequently, CatBoostRegressor remained our top option.

Model 15: XGBRegressor with GridSearchCV over n_estimators, max_depth, colsample_bytree and learning_rate

Parameters Chosen
colsample_bytree: 0.1
learning_rate: 0.01
n_estimators: 1000
max_depth: 5
Metrics
Validation Sharpe: 0.7676
Validation Correlation: 0.0228
Validation Feature Neutral Correlation: 0.0172
Validation Standard Deviation: 0.0297

Hyper parameterizing over n_estimators, max_depth, colsample_bytree and learning_rate led to a significant improvement over the vanilla XGBRegressor. This improvement was, however, not large enough to surpass our top performing models. The validation sharpe saw an increase from 0.6822 to 0.7676. On the other hand, the validation standard deviation saw a large increase. This shows that while we may have increased correlation through hyper parameterization, the model is less consistent on validation eras.

Model 16: Stacking Vanilla CatBoost with a hyper parameterized XGBRegressor

Metrics
Validation Sharpe: 0.7542
Validation Correlation: 0.0206
Validation Feature Neutral Correlation: 0.0127
Validation Standard Deviation: 0.0273

Now that we had a good selection of models tested, we felt it was time to diversify our experimentation by stacking nonlinear models. To be clear, the base estimator used was Vanilla CatBoost while the model suite consisted solely of our hyper parameterized XGBoost model from Model 15. This was our first attempt at stacking a combination of models that previously worked well. We were surprised to see that this stacking had in fact hurt the validation sharpe, given that both vanilla CatBoost and the hyper parameterized XGBRegressor had individual validation sharpes that were higher than 0.7542.
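Below is a minimal sketch of one way to wire such a stack with scikit-learn's StackingRegressor. The arrangement shown here (the Model 15 XGBRegressor as a base estimator, with vanilla CatBoost combining its predictions) is illustrative rather than a record of our exact configuration.

from sklearn.ensemble import StackingRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor

stack = StackingRegressor(
    estimators=[
        ("xgb", XGBRegressor(colsample_bytree=0.1, learning_rate=0.01,
                             n_estimators=1000, max_depth=5)),
    ],
    final_estimator=CatBoostRegressor(verbose=0),  # vanilla CatBoost combines the base predictions
)
stack.fit(training_data[feature_cols], training_data["target"])
validation_data["prediction"] = stack.predict(validation_data[feature_cols])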

Model 17: Stacking Vanilla CatBoost with a hyper parameterized XGBRegressor and performing Feature Neutralization

Metrics
Validation Sharpe: 0.9314
Validation Correlation: 0.0191
Validation Feature Neutral Correlation: 0.0138
Validation Standard Deviation: 0.0205

Model 17 is identical to model 16 with the exception that model 17 has feature neutralization. This was our attempt at performing A/B testing to see the true effect of neutralization on stacking regressors. The results were significantly positive as seen by the increase in validation sharpe from 0.7542 to 0.9314. This change was likely due to the reduction in standard deviation between the two models. Feature neutralization in this case made correlation across eras more consistent resulting in a higher validation sharpe. This model stands as one of the highest performing models tested thus far.

Model 18: Era boosting model with GradientBoostingRegressor

Parameters Chosen  
max_depth: 5
learning_rate: 0.01
subsample: 0.5
n_estimators: 10
num_iters: 200
Metrics
Validation Sharpe: 0.8071
Validation Correlation: 0.0179
Validation Feature Neutral Correlation: 0.0141
Validation Standard Deviation: 0.0222

This model uses the GradientBoostingRegressor with some hyperparameter tuning but, most importantly, uses the era boosting algorithm. The goal of era boosting is to lower the standard deviation of per-era correlations and thus improve validation sharpe. This allowed the validation sharpe to be significantly greater than that of Model 13, the vanilla Gradient Boosting Regressor. However, some models achieved a better validation sharpe than this one, so while the era boosting algorithm may be effective, a different boosting regressor should be used to maximize validation sharpe.

Model 19: Era boosting model with GradientBoostingRegressor and Dimensionality Reduction

Parameters Chosen 
max_depth: 5
learning_rate: 0.01
subsample: 0.5
n_estimators: 10
num_iters: 200
Metrics
Validation Sharpe: 0.7112
Validation Correlation: 0.0149
Validation Feature Neutral Correlation: 0.0106
Validation Standard Deviation: 0.0210

Model 19 is identical to Model 18 with the exception that this model uses dimensionality reduction. In comparison with Model 18, dimensionality reduction showed increases in both the validation sharpe and CORR + MMC sharpe metrics. This suggests that treating the 'id' and 'era' columns as features was not conducive to the model, and that this small amount of feature selection allowed us to improve on it.

Model 20: Neural network with two hidden layers of size 4000 and Leaky ReLU activation

Metrics 
Validation Sharpe: 0.0782
Validation Correlation: 0.0029
Validation Standard Deviation: 0.0153

Model 20 trains a PyTorch implementation of a neural network. Early in training, the network performed well on the validation data. Over time, however, validation correlation quickly fell to near zero, presumably because the network was overfitting the training data. Improving this model would therefore likely require an architecture with additional regularization.

Model 21: Neural network with additional dropout.

Metrics
Validation Sharpe: 0.7114
Validation Correlation: 0.0260
Validation Standard Deviation: 0.0361

Model 21 is almost identical to Model 20. The only difference is the addition of 35% dropout on the input and first hidden layer. This model is much better at avoiding overfitting and performs almost as well as the tree-based models. However, compared to previous models, the standard deviation is much higher, indicating that the model is less robust and may not perform as well if the market starts producing increasingly complex eras (test eras that do not resemble the training eras).
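A minimal PyTorch sketch of the architecture described for Models 20 and 21 is shown below. The layer sizes, activation, and dropout rate match the text, while the optimizer, learning rate, loss, and feature count are assumptions; the training loop is omitted.

import torch
import torch.nn as nn

class NumeraiMLP(nn.Module):
    def __init__(self, n_features, hidden=4000, dropout=0.35):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),            # dropout on the input (Model 21 only)
            nn.Linear(n_features, hidden),
            nn.LeakyReLU(),
            nn.Dropout(dropout),            # dropout on the first hidden layer (Model 21 only)
            nn.Linear(hidden, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = NumeraiMLP(n_features=310)  # roughly 310 feature columns in the tournament data
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()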

Data preprocessing

Data preprocessing is a major step in creating and improving on machine learning models, especially the models we’ve created for the Numerai competition. The quality of the data fed into the model directly affects how it learns, so this step is integral in creating models that learn efficiently. Some steps that we’ve taken in our data preprocessing stage include a data quality assessment, feature aggregation, feature sampling, feature encoding, and dimensionality reduction.

Data Quality Assessment

When checking the quality of our data, we generally want to look for duplicate values, missing values, and inconsistent values.

First, to check for duplicate values, we can run df.duplicated().any() on the dataframe loaded from 'numerai_training_data.csv', which checks whether any rows are duplicates. It returns False, meaning the table is free of duplicate rows.

For missing values, we can run df.isnull().values.any(), which tells us whether any value in the dataframe is NaN. This also returns False, indicating that the table is free of NaN values.

Lastly, for inconsistent values, we want to check that the values within each categorical or quantitative column match the column's expected type. For example, we don't want a quantitative value in a column that is supposed to be categorical. After inspecting the values in every column, we can see that 'era' and 'data_type' are categorical features, while the feature columns and the target are quantitative. Because the values in the categorical columns are strictly categorical and the values in the quantitative columns are strictly numeric, we can conclude that there are no inconsistent values in this dataset.
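A short sketch of these checks, assuming the training data has been loaded into a pandas dataframe named df:

import pandas as pd

df = pd.read_csv("numerai_training_data.csv")

# Duplicate rows: .any() collapses the per-row boolean Series into a single flag.
print("any duplicate rows:", df.duplicated().any())

# Missing values anywhere in the frame.
print("any NaN values:", df.isnull().values.any())

# Inconsistent values: the categorical columns should only hold their expected labels,
# and the feature/target columns should all be numeric.
print(df["data_type"].unique())
print(df.drop(columns=["id", "era", "data_type"]).dtypes.unique())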

Feature Aggregation

Feature aggregation is performed so we can get a better perspective of our data by reducing the number of data objects. Not only does this reduce processing time and memory consumption, but it can also provide a more stable, high-level view of the data because we aren't looking at each individual data object.

For ‘numerai_training_data.csv,’ we can see that a categorical feature we can aggregate on is ‘era’. Upon closer inspection, there are around 120 ‘eras’ from ‘era1’ to ‘era120’. These eras can be grouped per column by average, as seen below.

Example of Feature Aggregation on column ‘era’

The figure above gives a simplified version of the dataframe after performing feature aggregation on the 'era' column, as indicated by the ellipses. After this operation, the table becomes a much higher-level view, summarizing each feature by the average of its values within each era. Although this greatly reduces the time needed to build our models, we sacrifice some information, particularly the individual values of each row.
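A minimal sketch of this aggregation with pandas, reusing the df dataframe from the checks above:

# Average every feature (and the target) within each era, producing one row per era.
era_means = (
    df.drop(columns=["id", "data_type"])
      .groupby("era")
      .mean(numeric_only=True)
)
print(era_means.shape)  # roughly 120 rows, one per era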

Feature Sampling

Because we are working with a very large dataset that has over 500,000 rows and over 300 features, feature sampling is an important technique for reducing the runtime of model creation by focusing on only a subset of the data. Most importantly, we want to take samples that are representative of the dataset as a whole and that give us a model that neither overfits nor underfits.

One of the best approaches for feature sampling is stratified sampling through train_test_split, which takes in the dataset, the target feature, and any additional parameters to specify how the split should be done, particularly parameter “stratify”. Having columns ‘feature_intelligence1’ to ‘feature_wisdom46’ as our X, and column ‘target’ as our y, we can obtain a valid training and test data split from the main dataset, which is mostly representative of the samples we could take to train our model.
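A sketch of this split is shown below, with X and y defined as described in the paragraph above; the split size and random seed are arbitrary choices for illustration.

from sklearn.model_selection import train_test_split

X = df[[c for c in df.columns if c.startswith("feature")]]
y = df["target"]

# Stratifying on the (discrete) target keeps the five target levels
# proportionally represented in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)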

Because we normally use ‘numerai_training_data.csv’ as our training data, this training and test data split can be used to locally predict mostly valid MMC scores without having to use our daily submissions in the Numerai competition. To clarify, mostly valid means that by creating a train test split from the given training data from Numerai, the resulting MMC score will differ slightly from the MMC scores from the Numerai competition submissions.

Feature Encoding

Feature Encoding is a way to reframe our dataset as an input to the machine/model, all the while still retaining its meaning. Nominal and ordinal encoding are the most common types of feature encoding. To elaborate, nominal encoding is a one-to-one mapping that retains the meaning of our data, such as the one-hot encoding technique. Ordinal encoding employs any range of integer mappings, such as assigning “red” as 1, “blue” as 2, and “green” as 3.

In our case, we employed a strong example of one-hot encoding, which is a nominal encoding technique. To specify, our features from ‘feature_intelligence1’ to ‘feature_wisdom46’ all use values found in this array: [0, 0.25, 0.50, 0.75, 1]. Because one-hot encoding is used mainly to translate categorical variables to numeric values, we need to create 5 new features for every existing feature; for example, for ‘feature_intelligence1,’ we needed to make ‘feature_intelligence1_0.0’, ‘feature_intelligence1_0.25’, ‘feature_intelligence1_0.5’, ‘feature_intelligence1_0.75’, and ‘feature_intelligence1_1.0’, which would replace ‘feature_intelligence1’. A sample of the employed one-hot encoding technique can be seen below.

Result of one-hot encoding on feature ‘feature_intelligence1’
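A sketch of this expansion using pandas get_dummies; the generated column names follow pandas' default prefix_value convention, which should match the names listed above.

feature_cols = [c for c in df.columns if c.startswith("feature")]

# Treat each 5-level feature as categorical and expand it into five indicator columns.
one_hot = pd.get_dummies(df[feature_cols].astype("category"))
print(one_hot.columns[:5])  # e.g. feature_intelligence1_0.0, feature_intelligence1_0.25, ...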

After running a few sample models (none worth detailing), we found that the models unfortunately do worse with one-hot encoding. While training was much faster (the same model took fewer minutes to run), the CORR + MMC and validation sharpe measures suffered.

Numerai diagnostics after performing one-hot encoding on our GradientBoost model
Validation Sharpe: 0.7112
Validation Correlation: 0.0149
Validation Feature Neutral Correlation: 0.0106
Validation Standard Deviation: 0.0210
Feature Exposure: 0.2816
Max Drawdown: -0.049
Correlation + MMC Sharpe: 0.4598
MMC Mean: -0.0024

Dimensionality Reduction

Most datasets, including the one Numerai has provided to us, have an extremely large number of features, or dimensions. Generally, the more features/dimensions a dataset has, the more complex it is. This increases the time needed to train our model, as well as the chance of overfitting, that is, fitting the parameters too tightly to the training data.

In the Numerai training data, the dimensionality reduction step we applied was to drop the 'id' and 'era' columns from the feature set. In doing so, we eliminate noise and nonsensical features, and our model is easier to visualize.
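A short sketch of this step on the training dataframe, with column names as in the tournament files:

# Keep only the feature columns as model inputs; 'id', 'era', and 'data_type'
# identify rows but are not used as predictive features.
X = df.drop(columns=["id", "era", "data_type", "target"])
y = df["target"]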

Using Model 18 (a GradientBoostingRegressor model with a focus on era boosting, with a validation sharpe of 0.8 and an MMC + CORR of 0.56), employing our dimensionality reduction (Model 19) improved the scores slightly, to 0.83 and 0.61 for validation sharpe and MMC + CORR, respectively. This suggests that treating the 'id' and 'era' columns as features was not conducive to the model, and that this small amount of feature selection allowed us to improve on it.

Era Boosting

In the Performance category of Numerai, it is crucial to have a good sharpe score, which essentially means having a low standard deviation of correlation across eras. Implementing a vanilla XGBoost model, for example, could give you the graph below.

Graph of per era Correlations

This graph measures the correlation for every era. At first glance the predictions look to perform well, with the model achieving positive correlation in many eras, but there are also many eras with weak or negative correlation. Using more trees will increase the mean performance of the model, with the correlation of most eras increasing, but there will still be some eras with weak or negative correlation, leaving the model inconsistent. Essentially, the XGBoost model is trying to maximize mean performance over all of the data, but we also want the model to minimize the standard deviation across eras.

In order to improve these models, we can utilize a concept called Era Boosting, where we are essentially boosting the weights of the lower eras to increase correlation, decrease standard deviation, and thus increase the sharpe score and performance.

The era boosting algorithm is as follows, with the main objective being to build 10 trees on the data, predict with the model, find the half of the eras that performed worst, build 10 new trees on that subset of eras, and repeat the process. A code sketch follows the steps below.

Era Boosting Algorithm

  1. Build 10 Trees on all eras from the training data.
  2. Predict using your model over the data and find eras which are in the worst half of performance
  3. Build 10 new trees only on the worst half of eras.
  4. Predict again, find the eras which are in the worst half of performance, build 10 new trees, and repeat.
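Below is a minimal sketch of this loop, assuming a training dataframe with "era" and "target" columns and using scikit-learn's warm_start option to keep adding trees to the same GradientBoostingRegressor. Our actual implementation may have differed in details such as the scoring function and parameter values.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def per_era_corr(df, preds):
    # Correlation of (ranked) predictions with the target, computed separately for each era.
    return df.assign(pred=preds).groupby("era").apply(
        lambda d: np.corrcoef(d["pred"].rank(pct=True), d["target"])[0, 1]
    )

def era_boost(train_df, feature_cols, trees_per_step=10, num_iters=20):
    model = GradientBoostingRegressor(
        max_depth=5, learning_rate=0.01, subsample=0.5,
        n_estimators=trees_per_step, warm_start=True,
    )
    fit_df = train_df
    for i in range(num_iters):
        model.n_estimators = trees_per_step * (i + 1)
        model.fit(fit_df[feature_cols], fit_df["target"])  # adds trees_per_step new trees
        # Score every era, then keep only the worst-performing half for the next step.
        era_scores = per_era_corr(train_df, model.predict(train_df[feature_cols]))
        worst_eras = era_scores.sort_values().index[: len(era_scores) // 2]
        fit_df = train_df[train_df["era"].isin(worst_eras)]
    return model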

After running this algorithm for 20 iterations, we end up with about 200 trees, every 10 of which were built on the worst half of eras at that point. The new per-era correlation graph of the model is shown below.

Graph of per era Correlations after Era Boosting

Now, in comparison to the model previously shown, the model above, which was made using 20 iterations of the era boosting algorithm, has no eras with weak or negative correlation; all eras have consistent, similar performance and a low standard deviation overall. The standard deviation went from 0.28 to 0.003 after all of the iterations, which is a significant improvement. By building trees on the worst-performing half of eras, we allow the model to give more equal performance and minimize the differences across all eras. The exact scores of the implemented era boosting on Numerai can be found in Model 18 of the Modeling section, with the sharpe score being 0.8071, a significant improvement over other models. The table below also shows the decrease in standard deviation with every 5 iterations of the boosting algorithm, to further show the relationship between the score and each repetition of era boosting. Keep in mind that the sharpe score depends on the mean and standard deviation of per-era correlations, so decreasing the standard deviation will greatly improve the sharpe score.

Table showing the decrease in standard deviation with every 5 iterations of the boosting algorithm

There are some questions, however, regarding how much era boosting can improve the overall validation sharpe. We are implementing boosting, which is used to combat underfitting, but after building 200 trees the model will most likely start to overfit, so the number of trees, the learning rate, and other parameters should be adjusted to combat this. Otherwise, training sharpe values can be very misleading when trying to predict tournament sharpe values. Also, this model is not actively using other ways to improve overall mean performance, but is instead focused on lowering the variance of the model, so it is important to combine it with other models or techniques to improve the overall result.

In the end, era boosting is one algorithm we can add on top of other models to further improve the model's standard deviation, which aids the performance metrics on Numerai. This concept helps ensure consistent results across all eras while still keeping relatively high positive correlation values, which is crucial for improving validation sharpe.

Feature Exposure

Feature exposure plays an important role in a model's consistency; therefore, this is one metric we decided to focus our efforts on. Feature exposure's importance can be conveyed through the context of the regression problem. The goal of Numerai is to predict future values in relation to the stock market. The nature of the stock market is that it cannot be predicted well over the long run by a single feature. Features that do well in one market regime may not do so well in another. Having a small number of highly weighted features may put the model at risk of performing very poorly over time in the market.

The calculation of feature exposure for the Numerai competition is based on the Spearman's rank correlation coefficient (SRCC). The SRCC between two variables (here, the predictions and a feature) is calculated by first converting each into rankings; the SRCC is then the covariance of the two ranked variables divided by the product of their standard deviations.

Once the Spearman's rank correlation coefficient is computed for each feature column, the coefficients are combined into a single metric using the root mean square. However, looking at each column's correlation coefficient can provide useful information as well. For instance, the maximum SRCC value can be used as an additional risk indicator: it indicates the maximum correlation of the predicted targets to any single feature column.
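A sketch of this calculation on a validation dataframe, assuming a "prediction" column and the usual feature columns; spearmanr comes from SciPy.

import numpy as np
from scipy.stats import spearmanr

def feature_exposure(df, pred_col="prediction"):
    feature_cols = [c for c in df.columns if c.startswith("feature")]
    # Spearman rank correlation between the predictions and each individual feature.
    exposures = np.array([spearmanr(df[pred_col], df[f])[0] for f in feature_cols])
    # Root mean square combines the per-feature exposures into a single metric;
    # the maximum absolute exposure serves as the additional risk indicator.
    return np.sqrt(np.mean(exposures ** 2)), np.max(np.abs(exposures))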

Feature Exposure vs. Correlation Tradeoff

By reading in the forums and conducting experiments of our own, we came to realize that there is a tradeoff between feature exposure and correlation. A model that has very low feature exposure may not be good at indicating anything significant whereas a model that has high feature exposure can have highly inconsistent correlation. Somewhere in the middle of these two extremes exists a conservative model that performs moderately well in the long run at the cost of some correlation.

Relative graph of correlation and consistency as a function of feature exposure.

The graph above shows that as feature exposure increases, correlation also increases at the cost of consistency. On the lower ranges of feature exposure, the model will generally not be able to pick up on anything significant and as a result it will consistently perform poorly. On the higher ranges of feature exposure, the model will overfit to the current state of the market, performing very inconsistently over multiple different states. In other words, it would have high variability over time.

Looking at the graph, the ideal scenario is the sweet spot in the middle of the two extremes, where consistency and correlation intersect. This intersection is purely conceptual and is meant to paint a picture of our goal in optimizing feature exposure.

Now begins our quest to find this sweet spot. The supplemental resources provided by Numerai elucidate one method to resolve high feature exposure: feature neutralization.

# Code from Numerai Analysis and Tips Notebook
import numpy as np

def _neutralize(df, columns, by, proportion=1.0):
    scores = df[columns]          # predictions to neutralize
    exposures = df[by].values     # features to neutralize by
    # Subtract `proportion` of the least-squares projection of scores onto the features,
    # then rescale to unit standard deviation.
    scores = scores - proportion * exposures.dot(np.linalg.pinv(exposures).dot(scores))
    return scores / scores.std(ddof=0)

Feature neutralization takes the entire dataframe, the column to neutralize, the features to neutralize by, and the neutralization proportion as inputs. The first two lines of the function body isolate the column to neutralize (scores) and the features to neutralize by (exposures). The next line reduces the neutralization column by a vector scaled by the specified proportion. That vector is computed by first taking the dot product of the pseudo-inverse of the exposures with the scores, and then taking the dot product of the exposures with the result.

Moore Penrose Pseudo Inverse Example

In the context of our problem, the Moore-Penrose matrix is represented by the result of "np.linalg.pinv(exposures)". The vector y can be thought of as the scores, represented by ".dot(scores)". Lastly, x can be thought of as a vector of beta values. It is important to keep in mind that the Moore-Penrose solution is not an exact solution because of the "m>n" constraint placed on the problem; if m=n we could obtain an exact solution. The Moore-Penrose solution produces the solution with the least squared error, which is why we can think of the x vector as a beta vector, where beta represents the coefficients of a least-squares linear fit. Once these beta values are computed, we take another dot product, this time multiplying the exposures by the beta values. This produces what we can think of as a prediction from the least-squares solution. This prediction is then multiplied by the desired proportion and subtracted from the original score vector to create a new score vector. Finally, the new score vector is divided by its standard deviation to rescale it and then returned. The goal of this process is to reduce feature exposure.
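To make the least-squares interpretation concrete, here is a small illustration with made-up shapes: with more rows than columns, np.linalg.pinv gives the same coefficients as an ordinary least-squares fit.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))   # "exposures": m rows > n columns
y = rng.normal(size=1000)         # "scores"

beta = np.linalg.pinv(X) @ y                       # beta via the pseudo-inverse
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # beta via ordinary least squares
print(np.allclose(beta, beta_lstsq))               # True: same least-squares solution

prediction = X @ beta  # the linear component that neutralization subtracts (scaled by the proportion)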

Below we will examine the effect of feature neutralization on tree based models. If we applied this process to linear regression models, we would be subtracting two linear equations resulting in a poor overall model as shown by jrb’s great post on feature exposure. The two tree based models we will use to form a baseline measure of feature exposure are an XGB regression model and a random forest model. These models will be trained without hyper parameterization and used to predict values in the tournament data or test set. With these predicted values we will calculate feature exposure and maximum feature exposure to obtain our baseline. Following this we will take the predictions from each model and neutralize them prior to calculating feature exposure and maximum feature exposure to see the change in values. The proportion we will use to neutralize for this analysis is 0.50.

Effect of Feature Neutralization with proportion 0.50 on Random Forest and XGB Regressors.

We see from the table above that in both cases neutralization has significantly reduced feature exposure. These findings are in line with our original goal of finding the sweet spot between consistency and correlation. For the second part of this analysis, we will vary the proportion value instead of holding it fixed at 0.50. By checking the correlation and feature exposure at different proportion levels, we hope to visualize the feature exposure and correlation tradeoff.

Random Forest Regressor Graphs
XGB Regressor Graphs

The graphs above show the effect of feature neutralization on feature exposure, maximum feature exposure and Numerai score which can be used as a measure of correlation.

# Code from Numerai Analysis and Tips Notebook: models are scored based on the rank-correlation (spearman) with the target
import numpy as np
import pandas as pd

def numerai_score(y_true, y_pred):
    # `eras` holds the per-row era labels and is defined earlier in the notebook.
    rank_pred = pd.Series(y_pred).groupby(eras).apply(lambda x: x.rank(pct=True, method="first"))
    return np.corrcoef(y_true, rank_pred)[0, 1]

Prior to going over the findings from the graphs, it is important to keep in mind our hypothesis: increasing the proportion value should represent an increase in neutralization strength and, simultaneously, a reduction in feature exposure at the cost of correlation.

Right off the bat, we can see that increasing the proportion value decreases the correlation in a near linear fashion. The third graph produced by both models is a clear indicator of this idea that there is a tradeoff between the correlation of your model and how much you can limit any one feature’s predictive power. If we decrease our exposure to features too much the model loses its ability to predict well.

The second graph of proportion value versus maximum feature exposure also exemplifies our hypothesis to a high degree. As the neutralization proportion goes up we see a near linear decrease in max feature exposure. The tail end of the graph from 0.80 to 1.00, however, shows an increase in max feature exposure.

The first graph of proportion value versus feature exposure is the most interesting and runs contrary to our initial hypothesis. These graphs look like parabolas, indicating that feature exposure goes down until some proportion value between 0.20 and 0.40. What is interesting is that the maximum feature exposure continues to go down well after the trough of the feature exposure. This indicates that we may be reintroducing some exposure to features through our neutralization, an indication further supported by the tail end of the maximum feature exposure graph, which also shows an increase in exposure. Possibly what is happening is that across many features there is a slight uptick in exposure to account for a decrease across some of the most predictive features. This slight uptick begins to take effect after the 0.20 to 0.40 range and compounds from then onwards, whereas the decrease in the most predictive features continues until around the 0.80 range.

At this point in the analysis, we began to wonder whether the trough in the first set of graphs was the sweet spot we had been looking for. To test this claim, we looked at the max drawdown metric, validation sharpe, and validation standard deviation. Max drawdown is the maximum decrease in correlation between any two validation eras, and validation sharpe is the mean of per-era correlations divided by their standard deviation.

From the table below, we can see that the trough, indicated by row 1 (0.3 neutralization proportion), does not maximize validation sharpe. The validation sharpe continues to decrease as the proportion goes up. This is likely because the mean of per-era correlations falls as the proportion increases, while the standard deviation falls at a slower rate, resulting in a dropping validation sharpe.

Table showing effect of neutralization proportion on max drawdown, validation Sharpe, and validation standard deviation.

Overall, our original hypothesis was in line with our findings to a high degree. We did see that increasing the proportion value, which represents an increase in neutralization strength and simultaneously a reduction in feature exposure, resulted in a loss of correlation. We also saw that an increase in neutralization strength results in an increase in consistency.

Overall findings graph

The graph above is similar to the conceptual graph proposed earlier, even though it may not look the same. In this graph we used an increase in neutralization proportion as a proxy for a decrease in feature exposure. Additionally, we used the per-era correlation mean as a proxy for correlation and the per-era correlation standard deviation as a proxy for consistency. We then see that a decrease in feature exposure leads to a decrease in standard deviation, which can be viewed as an increase in consistency, as the per-era correlations are more closely distributed. Furthermore, we see that a decrease in feature exposure leads to a decrease in per-era correlation mean, which can be viewed as a decrease in overall correlation.

There is no crossing point in the graph above due to the fact that each metric graphed is on its own scale and we were using standard deviation as a proxy for consistency.

Lastly, it is worth mentioning that this tradeoff also speaks to a user's risk appetite. A user with a high risk appetite may perform little to no neutralization in hopes of getting higher returns. Conversely, a user with a low risk appetite may neutralize significantly to try to take a consistently small return each round. In our model, we felt that a higher neutralization proportion would give us more consistency on live stock market data at the cost of some correlation; we were closer to a low-risk-appetite strategy. As such, our final choice of neutralization proportion was 0.60. This choice was reflected in the post-processing of Models 12 and 17 above.

While we did not find a definite crossing point or quantitative range of the optimal neutralization value, we were able to show that there is indeed a relationship or tradeoff that exists and exemplify it through testing.

MMC

MMC, or Meta Model Contribution, is one of the ways user submissions to the Numerai competition are scored. To explain what MMC consists of, we must first discuss what the Meta Model is. The Meta Model is a weighted ensemble of all submissions made in a round; the more a user stakes on their model, the more it is represented in the ensemble. So in addition to the correlation scoring component, there exists the MMC component, which measures the unique contribution of a user's submission relative to the Meta Model. This is done to encourage the performance of the collective users on the platform rather than solely individual correlation. The MMC category consists of the MMC + Correlation score, the MMC mean, and the correlation (CORR) to example predictions.

In Numerai's competition, originality is heavily rewarded. Their process of quantifying it through MMC is as follows:

  1. Build a model out of all user submissions.
  2. Take the difference of each user submission and the meta model predictions
  3. Said difference is compared to the true stock market results
  4. Users are now incentivized to improve on these differences by looking for unique data
  5. Profit!

While most data science competitions come down to some variation of tweaking the data preprocessing stage, trialing thousands of XGBoost variations with cross validation, and tuning hyperparameters, Numerai wants to stray from that trend through the use of the Meta Model Contribution.

performance * (1-correlation_with_all_other_models)

The above equation is a way to determine how much a user should be paid according to the uniqueness of the model in their submission. However, before July 20th, Numerai had recognized that users would attempt to sacrifice their CORR score for a higher MMC score, and vice versa. This is seen in the way that users are paid. To explain through an example, if a user had 0.15 CORR and -0.04 MMC, they would be paid stake * (0.15 - 0.04) = stake * 0.11, where the stake is the amount the user places on their submission as a quantitative expression of confidence.

After July 20th, Numerai changed their leaderboards to rank users based on MMC + CORR, a combination of both previous metrics. This lets users stop worrying about whether to trade CORR for MMC or vice versa and gears them solely towards making genuinely unique models. It is based on the belief that users shouldn't be penalized for consistently improving the meta model with a positive MMC while also scoring high on their own models.

However, users still debate whether to focus on a CORR payout or an MMC + CORR payout approach, based on maximizing average payout and payout sharpe, concepts discussed earlier in this blog. After plotting scatter plots of pure CORR vs. MMC + CORR for validation and training data, and scatter plots of the correlation between CORR and MMC scores per era, it becomes apparent that the MMC + CORR payout approach is more favorable in most cases for models with positive MMC in the Numerai competition.

From our model building showcased earlier, we experienced various increases in our CORR + MMC score, particularly after switching from linear to non-linear models. For simplicity in this section, we will refer to the CORR + MMC score as representative of both the MMC mean and the correlation metrics. To elaborate, Models 1–3 employed linear and ridge regression with grid search cross validation, resulting in a maximum MMC + CORR score of 0.3166. However, for Models 4–6, where we switched to vanilla lasso regression and elastic net, we experienced major decreases in MMC + CORR score, dipping down to -0.1992. This was a result of negative validation correlation.

Moving on to non-linear models, we experienced big jumps in our MMC + CORR score. Starting at Model 7 with a vanilla XGBRegressor, the reported MMC + CORR was 0.4750, improving on the previous high of 0.3166. This is due not only to the fact that the algorithm is non-linear, but also to XGBoost's defaults handling some of the tuning for us, resulting in a slight improvement in the score. Later, our CatBoostRegressor and RandomForestRegressor models produced MMC + CORR scores of around 0.65, our highest thus far. For this reason, we continued to tune CatBoost and RandomForest models. However, after finding that RandomForestRegressor had severely low validation correlation, we switched back to CatBoostRegressor, which gave us the best results for CORR + MMC.

Results and Conclusion

Overall, our approach to modeling revealed a lot of interesting findings, especially as we tuned, stacked, and applied many different tools to our suite of models. We saw that the linear models performed far worse on average than the nonlinear models we tested. We believe that the process of testing more nonlinear models and trying combinations of hyperparameter tuning, data pre-processing, and more resulted in higher validation sharpe scores. After all is said and done, our best model was Model 17, which stacked vanilla CatBoost with a hyper parameterized XGBRegressor and applied feature neutralization. We believe that a large part of this model's good performance came from the feature neutralization, which lowered the standard deviation of correlation across eras and made them more consistent.

In terms of our work with data preprocessing, we researched a variety of techniques, including data quality assessments, feature aggregation, feature sampling, feature encoding, and dimensionality reduction. However, the ones put into action were data quality assessment, feature sampling, dimensionality reduction, and feature encoding. The first three of these were used to create Model 19, which was built off of Model 18's GradientBoostingRegressor model with a focus on era boosting. Dimensionality reduction slightly improved our model's validation sharpe and CORR + MMC metrics, bringing them to 0.83 and 0.61, respectively.

Now, given our highest validation sharpe performance from Model 17, we also want to reflect on why our other models didn't perform as well. As mentioned earlier, all the linear models we ran performed very poorly because of the high standard deviation of the per-era correlations. This led to either very high validation standard deviation or very low (and possibly negative) validation correlation, which in turn hurt validation sharpe severely. Additionally, the lack of consistently positive per-era correlation resulted in very low, and even negative, validation correlation scores for all of our linear models. As for our nonlinear models, we believe they performed significantly better because they were able to capture relationships that the linear models could not. In other words, the stock market data did not lend itself to many linear relationships. Additionally, tree-based models take advantage of the power of many models: each tree can pick up on a different relationship within the data, and the trees' results can then be combined or averaged. This is another advantage that we believe tree-based models brought to the table.

Implementing era boosting proved beneficial in improving the validation sharpe of a model. The era boosting algorithm boosts the eras that performed worst, which lowers the standard deviation across eras. This did a good job of accounting for the variability between eras, which matters because, with the rest of the data purposefully obfuscated, eras are our main grouping of data points. This method allowed us to specifically increase the validation sharpe of the model. However, it did not affect other metrics much, so to get better results, era boosting should be combined with other models or techniques. The key takeaway from implementing era boosting is that it was very successful at reducing the standard deviation across eras, but the model itself struggled to improve mean performance and other risk metrics.

Feature exposure proved to be one of the important knobs that we could turn to lower risk and increase the validation sharpe of a model. The method of feature neutralization allowed us to turn that knob and reduce the predictive power of each feature. Doing so led to a model that had less per-era correlation on average but was much more consistent across eras. That consistency was reflected in an increase in validation sharpe.

In our MMC section, we introduced the meaning of MMC, and the role it plays in the strategy that users employ when creating submissions for the competition, as well as how they get paid depending on their MMC. In regards to our experience with the CORR + MMC metric (correlation + meta model contribution), we experienced our major increases in this score due to our switch from linear to non-linear models. A combination of switching to models such as CatboostRegressor and employing techniques such as neutralization reduced the standard deviation of our models and increased the validation sharpe, resulting in larger increases for the CORR + MMC.

In conclusion, this competition proved to be a great avenue for us to expand our knowledge of machine learning and see how we can use our skill sets to analyze time-series data and predict the stock market. Furthermore, as many of us had not taken part in a regression-based machine learning competition before, we realized there was a lot to learn in the search for our best model, all while exploring important techniques to increase validation sharpe. That said, we hope to apply these important facets of analyzing data and producing models not only in our next data science competition, but also as we enter industry jobs and perform data analysis on real-world problems.
