Why backtests are useless, EAs are flawed and their parameters are bad [DISCUSS]

I am still curious about how a walk-forward is any different to testing on out-of-sample data, and how being adaptive is anything other than having just another layer of test clauses to trigger a trade.

Walk-forward analysis gives you 100-150 "in-sample/out-of-sample test pairs" instead of just one, as in classical out-of-sample testing (more on that in the paper I will release in 1-3 days, where I explain how WFA works).

The thing about adaptivity is that you don't get another layer to trigger a trade; rather, your normal layer (which consists of an EA and its corresponding parameters) gets re-adjusted so that it always gives you the best results. ;)

-------------------------------------------------------------------------------------------------------------------------------------



And like you, I don't see what extra significance there is in calling trading with no money "walk forward analysis". Can someone turn on the light?

Well, that's what it is about: doing the exact same thing as "trading", BUT with the difference that you can do it with 13 years of data in just a few hours.
So it simulates trading, just like you do live, but on past data :) Clear now?

-------------------------------------------------------------------------------------------------------------------------------------



Excellent response. But can I play Devil’s Advocate? (please don’t take offense if I appear confrontational – I don’t mean to be!).

So “adaptive” means “changing parameters to optimise results as time goes by”?

[...]
b). are you speculating that “if the optimum parameters for the current market are A, B and C, if we trade using these parameters then we can expect X% return”? In this instance, we would, for example, go back 3 years then walk forward each month, “optimising” as we go, then seeing how well this approach worked for the next month, etc, etc. Then if we get sufficiently good results over, say, the last 3 years, we confidently conclude that we have a viable trading strategy.

[...]

b). is just data mining/fitting?

b) is correct, and of course it is "just" data mining, as this is what the whole EA trading approach is about :)
But it is data mining with the goal of having a trading plan that produces parameter sets for my EA that have a high chance of being successful in the future.
And not data mining with the goal of having some nice-looking backtests in the past.
That's the difference :)

But I don't really get how you want to reproduce it using coin tosses, could you please explain that? ;)

-------------------------------------------------------------------------------------------------------------------------------------



But when someone says backtesting is useless, it's so far off the mark it's just not worth discussing.
[...]
None of the tests will account for slippage, late entries, gaps etc etc.

Ok, you are right, perhaps my thread title was a bit too provocative, but that was exactly what I intended with it: to start a discussion ;) And you are reading the thread, so perhaps it was the right decision :p

As for the other things you mentioned: yes, that's the reason I prefer the H4 or D1 timeframes, as on them the "randomness" of live trading does not influence the strategy's performance so heavily.
But you are right, real live testing (on demo or not) is absolutely vital for success :)

-------------------------------------------------------------------------------------------------------------------------------------



IMO: The only way to learn trading is to practice day and night and on weekends, using a simulator and historical data in real time. Practice, practice, study and practice.

I agree with the learning part, but for an algo-trader it's more about research/discussions/learning than about practice :)



-Darwin
 
"A WFA takes, for example, the data from 2000 to optimise your system, then tests it on 2001 (which is, from the point of optimisation, the "future" or "live trading").
Then it walks forward and optimises the system on 2001, test it on 2002. Then optimise on 2002, test on 2003 etc.
Do this until you walked through your whole data and only consider the "live trading" in your evaluation.

You see? With this simple tactic you can tackle the 3 problems I have described above, the lack of adaptability, the need for a evaluated parameter selection process and the uselessness of a "past-performance"-backtest."
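To make the quoted mechanics concrete, a minimal Python sketch of that rolling optimise/test scheme could look like the following. It is only an illustration, not Darwin's implementation: optimise() and backtest() are hypothetical stand-ins for whatever optimiser and tester you actually use, and data_by_year is assumed to map each year to its price data.

def walk_forward(data_by_year, optimise, backtest):
    """Optimise on one year, test on the next, then slide the window forward."""
    years = sorted(data_by_year)
    oos_results = []                                        # only these count in the evaluation
    for in_year, oos_year in zip(years[:-1], years[1:]):
        params = optimise(data_by_year[in_year])            # fit on the "past"
        result = backtest(data_by_year[oos_year], params)   # trade the relative "future"
        oos_results.append((oos_year, params, result))
    return oos_results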

IMO this does not tackle the problem, because if you look hard enough you can always find a system that will pass WFA and then fail in forward trading. Just do a neural-network search and you will see what I mean. WFA also does not work if new conditions develop in the markets that make a system which was unprofitable in WFA turn profitable in forward trading. It is a very conservative approach that will basically reject 99.99% of systems, and you will end up trading none. Just my 2 cents.
 
Quote:
b) is correct, and of course it is "just" data mining, as this is what the whole EA trading approach is about :)
But it is data mining with the goal of having a trading plan that produces parameter sets for my EA that have a high chance of being successful in the future.
And not data mining with the goal of having some nice-looking backtests in the past.
That's the difference :)

But I don't really get how you want to reproduce it using coin tosses, could you please explain that? ;)

OK. Randomness. It was mathematician George Spencer-Brown who showed that in a random sequence of 10 to the power of 1,000,007 zeros and ones, we would expect at least 10 non-overlapping sub-sequences of 1 million consecutive zeros. So if you bought a book of random binary numbers for a research project you'd be rather alarmed if all the pages contained just zeros - yet a random sample of random numbers might produce just that. The first iPods had a true random shuffle feature, but users complained when it played the same songs back-to-back (so Steve Jobs famously made the feature "less random to make it feel more random"). Those random anecdotes demonstrate that randomness is a tricky subject!

See attached spreadsheet. I've put together a simple spreadsheet that randomly generates coin tosses (represented by ones and zeroes). Let's pretend they're stock prices. For each Tail the price goes up by 0.25% and for each Head the price goes down by 0.25%. The resultant chart looks like a real stock chart BUT IS TOTALLY RANDOM. We look at the chart, or the numbers, and come up with an idea we want to test - say that if we get 4 heads in a row we then buy/Long. We only have to go down to row 68 to show that this is a very profitable trading strategy. We then "walk forward test" and find that it works over the entire 500 row sample.

Therefore, the million dollar question is, how do we know whether the results of our testing represent real "relationships" or are just random results?
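The spreadsheet idea is easy to reproduce outside Excel. Here is a minimal Python sketch using the same assumptions as above (Tail +0.25%, Head -0.25%, 500 tosses, go long after 4 heads in a row); depending on the seed the "strategy" comes out profitable or not, which is exactly the point.

import random

random.seed(1)                                          # change the seed and watch the result swing
tosses = [random.randint(0, 1) for _ in range(500)]     # 1 = Head, 0 = Tail

price, prices = 100.0, []
for t in tosses:
    price *= 0.9975 if t == 1 else 1.0025               # Head: -0.25%, Tail: +0.25%
    prices.append(price)

# "Strategy": whenever the last four tosses were all Heads, buy and hold for one more toss.
pnl = 0.0
for i in range(3, len(prices) - 1):
    if tosses[i - 3:i + 1] == [1, 1, 1, 1]:
        pnl += prices[i + 1] - prices[i]

print(f"Net result on pure noise: {pnl:+.2f}")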
 

Attachments

  • Random Coin Tosses 1.xls
    290 KB
Quote:
Therefore, the million dollar question is, how do we know whether the results of our testing represent real "relationships" or are just random results?

Since you're comfortable with maths, you'll know that you can make a null hypothesis and test for statistical significance of your trading results.
 
Since you're comfortable with maths, you'll know that you can make a null hypothesis and test for statistical significance of your trading results.

My problem with a Null Hypothesis: Using my coin tossing example I formulate a null hypothesis that “longing after tossing 4 heads” does not generate profits over the ensuing 432 coin tosses. But the test shows that the “longing after 4 heads” theory was profitable, therefore the null hypothesis is falsified. As there is no middle ground possible, my original theory must be true. But it clearly isn’t – it is random coin tosses. In a nutshell, the null hypothesis method can’t cope with randomness.

If you can spot an error in my logic I’d be delighted to be enlightened.
 
IMO this does not tackle the problem, because if you look hard enough you can always find a system that will pass WFA and then fail in forward trading. Just do a neural-network search and you will see what I mean. WFA also does not work if new conditions develop in the markets that make a system which was unprofitable in WFA turn profitable in forward trading. It is a very conservative approach that will basically reject 99.99% of systems, and you will end up trading none. Just my 2 cents.

If you use WFA in the wrong way then, of course, you will find systems that are "overfitted" towards the WFA approach and will fail in live trading.
If you have a normal system-development process and just use WFA as the last step, it will be hard to overfit. The thing is, if the WFA gives you "bad" results, then in the best case you just throw the EA away or make only minor changes; but as soon as you optimise the system towards the WFA results, you might overfit, yes.

Also, a good WFA (or better, the trader who uses the WFA) should be able to detect when completely new conditions develop for which the EA is not suited, and then simply stop trading the EA. Though my new algo will have ways to determine this automatically :)

It's as conservative as you want it to be. To make it very non-conservative, just use the last few years for analysis (instead of 13) and tune your parameter ranges accordingly :)

It's just a tool after all; it doesn't do your work as a trader ;)

-------------------------------------------------------------------------------------------------------------------------------------


Therefore, the million dollar question is, how do we know whether the results of our testing represent real "relationships" or are just random results?

Hey, thanks for the whole writeup. Yeah, I know of the random-walk experiments, and yes, markets tend to move in that random way most of the time.

The problem you describe is curve fitting, right?
Well, the way we make sure we have real relationships is the following:

First we optimise our parameters on some data, to find the best match for the "relationships" we want to prove exist.

Then, if these parameters hold "in the future" and still describe the relationship well, we can conclude that it is a real relationship; had we optimised our parameters on some random data/relationship and then used those parameters "in the future" (which would also be random), they could not work in the average test case.

So, if we then repeat this test 100-150 times, always looking only at the past=>future relationship, and most of the pairs are "valid", chances are high that we have a real, non-overfitted relationship.

My new algo will be able to measure up to 100,000 of these past=>future relationships, giving us the whole picture, but for the moment 150 should be a good start! :)
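In code terms, and continuing the walk_forward sketch from earlier in the thread (so still assuming the hypothetical optimise/backtest helpers and a numeric result per window), the check described above reduces to counting how many past=>future pairs come out "valid":

def fraction_valid(oos_results, is_valid=lambda result: result > 0):
    """oos_results: the (year, params, result) list from walk_forward();
    is_valid: your own definition of a 'valid' window, e.g. positive net profit."""
    hits = sum(1 for _, _, result in oos_results if is_valid(result))
    return hits / len(oos_results) if oos_results else 0.0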

-------------------------------------------------------------------------------------------------------------------------------------


As for the discussion about the null hypothesis: carry on without me, I am not good at math ;)


-Darwin
 
My problem with a Null Hypothesis: Using my coin tossing example I formulate a null hypothesis that “longing after tossing 4 heads” does not generate profits over the ensuing 432 coin tosses. But the test shows that the “longing after 4 heads” theory was profitable, therefore the null hypothesis is falsified. As there is no middle ground possible, my original theory must be true. But it clearly isn’t – it is random coin tosses. In a nutshell, the null hypothesis method can’t cope with randomness.

If you can spot an error in my logic I’d be delighted to be enlightened.

Ok, logical flaws:

1) Excel does not do a great job of generating random numbers (unless it has updated its RNG in a recent version that I haven't used), but it should be sufficient for the purposes here, so not a big problem.

2) You haven't actually carried out a hypothesis test, have you? It's not about whether a set of 20 trades ends up profitable. It's about whether the results are statistically significant. And for testing that, you would want a lot more than 21 trades, especially when they're so easy to generate, and you would want to choose a sensible null hypothesis. Do you need help on this point?

3) On your sheet, 15 wins, 6 losses. If we recalculate the cells, then on my first recalculation it becomes a losing system, with one more loss than win. The next recalculation has 7 wins and 11 losses, and so on. It should be obvious that the results vary quite a bit, that you can't attach such significance to your 15-wins-and-6-losses result, and furthermore that you must apply some statistical test of significance, which is what I suggested.

It's not that the null hypothesis method can't cope with randomness; it's great for exactly this type of thing.
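As a minimal illustration of the kind of test meant here, take the 15-wins/6-losses sample from the sheet and a deliberately simple null hypothesis, namely that every trade is a fair 50/50 bet with symmetric payoff. A real test would want far more trades and a null matched to the actual payoff structure.

from math import comb

wins, trades, p = 15, 21, 0.5
# One-sided p-value: the probability of doing at least this well if the null is true.
p_value = sum(comb(trades, k) * p**k * (1 - p)**(trades - k)
              for k in range(wins, trades + 1))
print(f"p-value = {p_value:.3f}")    # roughly 0.04: borderline, and the sample is tiny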
 
My problem with a Null Hypothesis: [...] In a nutshell, the null hypothesis method can’t cope with randomness.

If you can spot an error in my logic I’d be delighted to be enlightened.

I don't understand how the test shows that. Each time I hit F9 I get a different result. You may have to do that something like 100,000 times to calculate a statistic and see if the null can be rejected. Am I missing something?
 
Darwin - good thread, well done!

I use Amibroker for testing strategies. The walk-forward analysis is excellent and is just a click of a button, like the backtest.

There is a decent book on forward testing by Robert Pardo, you may have come across this. What is not entirely satisfactory (to me at least!) about his approach is that his analysis of the results themselves is a little qualitative, focusing primarily on raw profit numbers (without regard to drawdown).

Would you care to share with us how you process walk forward data? In other words, which output are you looking to maximise, or what would lead you to believe a system is robust - is this something you determine quantitatively?
 
Would you care to share with us how you process walk forward data? In other words, which output are you looking to maximise, or what would lead you to believe a system is robust - is this something you determine quantitatively?

Thanks for the nice feedback!

Well, with this algorithm it's up to the user to define a desired characteristic, like profit or drawdown or profit-per-drawdown or something, that is then used to pick a parameter set for live trading. (The one I am currently developing can seek the best characteristics on its own, btw :) )
So it's not really the output that is maximised, but the "how to choose parameters for live trading" methodology that is tested.

But then, to see if a system is robust, I assemble all the out-of-sample trades into one backtest, not just a few meaningless numbers :) So you get an equity curve, "overall statistics", statistics split by short/long, trade-return distributions, monthly-return distributions and also some diagrams that show you the fluctuation of your fitness values (like profit or profit factor etc.) on a per-year and a per-month basis :)

So, where the initial WFA approach only gives you a "WFE" number (which is useless, btw, as far as my experience shows), I give you an in-depth analysis of the system :)
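As a rough sketch of what "assembling all out-of-sample trades into one backtest" could look like (this is an illustration, not the actual tool; oos_trade_lists is assumed to hold one list of trade P&Ls per out-of-sample window, in chronological order):

def combined_oos_report(oos_trade_lists, start_equity=10_000.0):
    trades = [pnl for window in oos_trade_lists for pnl in window]
    equity, equity_curve = start_equity, []
    for pnl in trades:
        equity += pnl
        equity_curve.append(equity)
    gross_win = sum(p for p in trades if p > 0)
    gross_loss = -sum(p for p in trades if p < 0)
    return {
        "trades": len(trades),
        "win_rate": sum(p > 0 for p in trades) / len(trades) if trades else 0.0,
        "profit_factor": gross_win / gross_loss if gross_loss else float("inf"),
        "equity_curve": equity_curve,
    }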

-Darwin
 
Shakone – thanks for your response. The coin-tossing example was meant to be a simplified version to explain my concern, which is that even with a large sample size we cannot distinguish between patterns that are due to randomness and patterns which are due to genuine relationships. However, having thought about your response for a couple of days, I conclude that we can apportion a likelihood of such a pattern being random or not. Thus, we never know if an individual test or strategy is showing randomness or relationships, but we can determine that, say, 9 times out of 10 the results are due to relationships. Have I got this now?

Darwin – Surely past observations can be explained by an infinite number of hypotheses. The only way to decide which ones are better is by seeing how well they can predict observations whose outcomes are not yet known, ie the future. Can you give us an idea of how well your approach is working? (eg average ROC over 12 months with max 50% drawdown in any one year).

Also, I thought that financial markets were non-stationary, ie the statistical characteristics change over time (in mathematical terms it is a multi-agent system/complex adaptive system). So how does your approach deal with this issue?
 
Hey Tyger :)

Darwin – Surely past observations can be explained by an infinite number of hypotheses. The only way to decide which ones are better is by seeing how well they can predict observations whose outcomes are not yet known, ie the future.

Yes, the solution space is infinite, right.
As stated in my new article (http://www.trade2win.com/boards/tra...uccessor-backtesting-discuss.html#post2230130), WFA tries to tackle the problem by taking only the relative future into account for system evaluation.

Of course, all the data used to do this is still just the past, so we cannot be sure that the one trading system we have chosen, out of the infinite number of trading systems, will be profitable in the future. But we can at least analyse whether the performance in the past and the performance in the relative future are somehow correlated.
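One simple way to quantify that, as a sketch only (it assumes you have one in-sample and one out-of-sample result per window, and that both lists have some variance), is a plain Pearson correlation between the two:

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# e.g. pearson(in_sample_profits, out_of_sample_profits): a value near zero warns you
# that good optimisation results say little about the relative future.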

But still, it's up to the trader to see when a traded inefficiency no longer exists and the system is "broken".

I have one or two ideas for solving this automatically and 100% algorithmically (let systems trade less frequently as they get worse and worse, and vice versa), and I will run some experiments on this topic once the core of my new algo is working, but that is still a bit down the road.

Can you give us an idea of how well your approach is working? (eg average ROC over 12 months with max 50% drawdown in any one year).

No, firstly because I am still coding and researching. But the main problem with these statistics is that they depend on the particular trading system, not on the analysis method.
So even if I had years of live trading, these statistics would only prove that the EA I traded was good; they can't say much about the approach itself (for that we would need to test many, many EAs over many, many years).


Also, I thought that financial markets were non-stationary, ie the statistical characteristics change over time (in mathematical terms it is a multi-agent system/complex adaptive system). So how does your approach deal with this issue?

As long as the statistical characteristics don't change too rapidly and the underlying inefficiency is still intact (for example, there is still a course correction after a heavy trend, but the length of the correction has changed or something), WFA can, by definition, handle this by adjusting the parameters of the strategy.

If the underlying inefficiency no longer exists, we have the problem I described in the first part of this post.


-Darwin
 
Well, with this algorithm it's up to the user to define a desired characteristic, like profit or drawdown or profit-per-drawdown or something, that is then used to pick a parameter set for live trading. [...]

But then, to see if a system is robust, I assemble all the out-of-sample trades into one backtest, not just a few meaningless numbers :) [...]

Yes, Amibroker does all this as well: it assembles the OOS simulations together to give an overall OOS backtest.

Nonetheless, you still need to decide the output which determines whether you think the system is "robust" or not. Typically, I will look for return/drawdown of greater than 1, but other factors such as % win rate and profit factor are also relevant.

Are you going to give us an example of your work?
 
Nice discussion D-FX!
I'm sure your product will be enlightening to all who experience it. Best of luck to you and thank you management for simplifying the thread for those who might otherwise have been confused!

Cheers
 
Nonetheless, you still need to decide the output which determines whether you think the system is "robust" or not. Typically, I will look for return/drawdown of greater than 1, but other factors such as % win rate and profit factor are also relevant.

Are you going to give us an example of your work?

Ah, now I understand. Well, I can only give you an analysis report, and then it's up to the trader to decide whether the system is good enough or not.
Though a cloud-based evaluation of analysis reports or something would be cool, that is too far down the road to even think about yet.

Well, yes, sure, here is an example: http://85.214.116.235/btana_example.html
That was done on the default "Moving Average" EA that ships with every MT4 installation, so don't expect anything good.

This analysis report was from the first alpha; I think it's not quite the current state, but it will give you an idea of what to expect. I will also add more and more stuff to it as I feel the need for it.

Also, I will release the tool tomorrow on some forums, and starting next week on all the others, so you can soon form your own opinion :)

Nice discussion D-FX!
I'm sure your product will be enlightening to all who experience it. Best of luck to you and thank you management for simplifying the thread for those who might otherwise have been confused!

Cheers

Thank you very much for this feedback, I hope so too, haha :)
Also, if someone is still confused, hopefully they will post here so I can un-confuse them ;)

-Darwin
 
@meanreversion:

Well, I have to revise my statement above.
There IS a fixed number that indicates robustness.

If, on a huge number of samples (e.g. 100,000), the chance of having a profitable trade window is > 50% (and the average trade window result is > $0), that would mean it does not matter which parameter set you trade; as long as you trade the EA with any settings, you will make a profit in the long run :)

I think that is how robust strategies are defined, isn't it?
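As a sketch, and only under the assumption that window_results holds the net result of every (parameter set, out-of-sample window) sample, the check could be as small as this:

def looks_robust(window_results):
    profitable_share = sum(1 for r in window_results if r > 0) / len(window_results)
    average_result = sum(window_results) / len(window_results)
    return profitable_share > 0.5 and average_result > 0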

-Darwin
 