Credible Granger-Causality Inference with Modest Sample Lengths: A Cross-Sample Validation Approach

Credible Granger-causality analysis appears to require post-sample inference, as it is well-known that in-sample fit can be a poor guide to actual forecasting effectiveness. But post-sample model testing requires an often-consequential a priori partitioning of the data into an 'in-sample' period - purportedly utilized only for model specification/estimation - and a 'post-sample' period, purportedly utilized (only at the end of the analysis) for model validation/testing purposes. This partitioning is usually infeasible, however, with samples of modest length - e.g., T <= 150 - as is common in quarterly data sets and in monthly data sets where institutional arrangements vary over time, simply because there is in such cases insufficient data available to credibly accomplish both purposes separately. A cross-sample validation (CSV) testing procedure is proposed below which both eliminates the aforementioned a priori partitioning and substantially ameliorates this power versus credibility predicament - preserving most of the power of in-sample testing (by utilizing all of the sample data in the test), while also retaining most of the credibility of post-sample testing (by always basing model forecasts on data not utilized in estimating that particular model's coefficients). Simulations show that the price paid, in terms of power relative to the in-sample Granger-causality F test, is manageable. An illustrative application is given, to a re-analysis of the Engel and West (2005) study of the causal relationship between macroeconomic fundamentals and the exchange rate.


Introduction
The seminal contribution of Granger (1966) introduced the notion of "Granger-causality" and sparked a flurry of empirical implementations. In brief, the fluctuations in a time series $x_t$ are said to Granger-cause fluctuations in a time series $y_t$ if and only if an optimal forecasting model for $y_t$ based on an otherwise-appropriately-wide information set, but omitting the past of $x_t$, forecasts $y_t$ less well than an analogous model which additionally includes the past of $x_t$ in the information set. Attention is usually restricted to linear forecasting models, in which restricted setting optimal modeling is relatively straightforward. This linearity assumption can itself be an issue, but it is the 'appropriately-wide' information set restriction which can more easily be problematic. Indeed, this is the ultimate source of all examples in which the concept of Granger-causality yields apparently-spurious results. It should be noted, however, that this problem with Granger-causality is essentially equivalent to the usual omitted-variables problem in econometric modeling, in which variables wrongly omitted from a model that are correlated with included ones lead to distorted inference on the included variables. Thus, Granger-causality testing merely calls on us to explicitly confront a problem which is endemic, but usually swept under the rug.
The initial spate of implementations - e.g., Sims (1972) and Pierce and Haugh (1977) - relied entirely on in-sample tests (usually just simple F-tests of the relevant model parameter restrictions) to infer whether or not the forecasting model for $y_t$ over the wider information set is superior. Granger himself, however, soon became worried - as the result of observing a multitude of multivariate linear time series models which fit the sample data well, but forecast post-sample data very poorly - that these in-sample tests of causality were characteristically prone to distortion from 'data mining.' Such data mining is, of course, based on the fact that the model specification (variables and lag structure) is identified based on the same data used to fit and then to evaluate the model - e.g., see Granger and Newbold (1977, pp. 281 and 311). In essence, because we all tend to discard models which do not fit well, the fitted models we produce frequently fail to forecast well - or at all. This concern led to the first post-sample implementations of Granger-causality - in Ashley, Granger, and Schmalensee (1980) and Ashley (1981) - testing explicitly whether or not the post-sample forecasts based on the wider information set are an actual improvement. A number of alternative tests for post-sample forecasting improvement - e.g., Diebold and Mariano (1995), West (1996), Ashley (1998), Gilbert (2001), McCracken (2001, 2005), West (2006, 2007), and McCracken (2007), among many others - were then developed in the ensuing years. Ashley and Ye (2012) provides an up-to-date example describing and implementing a selection of the post-sample methods still popular.[3]
However - and despite one of us being an early and vocal advocate of post-sample testing - we must note that all post-sample implementations of the Granger-causality concept suffer from two inherent (and related) drawbacks:
1) The analyst is always obliged to partition the available data set at the outset into an 'in-sample' period - for use in identifying and estimating model specifications - and a 'post-sample' or 'holdout' period - which is to be reserved solely for evaluating which model provides superior forecasts. If done pristinely - i.e., without looking at the data and at the forecasting performance of the models over various in-sample/post-sample splits - this partitioning is, at best, somewhat ad hoc and arbitrary.
2) Post-sample Granger-causality testing tends to be feasible only where either the available data set is very long - so that a quite lengthy (and representative) post-sample period can be selected - or where the causal effect is so overwhelmingly strong as to hardly require statistical testing.
This first drawback has recently received renewed attention in Hansen and Timmermann (2012) and in Rossi and Inoue (2012), both of which demonstrate the consequentiality of this choice for Granger-causation inference, and both of which therefore go on to propose statistical tests which are constructed so as to be robust to this choice.
In their work one could say that the principal problem is actually an over-abundance of feasible in-sample/post-sample splits, leading to an awkwardly consequential nuisance parameter. The present work is distinct from theirs in that it is primarily aimed at settings in which the total amount of data available is modest - i.e., where the principal problem is an overall paucity of data - which renders this sort of robustification problematic. Foreshadowing, our proposed procedure ameliorates this difficulty both by explicitly considering every possible in-sample/post-sample partitioning and by completely utilizing all of the scarce sample data, while always basing model forecasts on data not utilized in estimating that particular model's coefficients.
This data scarcity issue brings up the second drawback alluded to above. Using simulated data in an idealized setting, Ashley (2003) showed that statistical testing for a mean square forecasting error improvement requires more data than one might expect. In particular, even with data generated from linear models with normally, identically, and independently distributed (NIID) error terms, one typically needs 80 to 100 post-sample observations in order to conclude that a 20% mean square error reduction is statistically significant at the 5% level. Basically, this is because one needs that much data in order to estimate a second moment (such as a mean squared error) with the requisite precision.[4] Additionally, another problem with a short post-sample period is that it can easily constitute a non-representative sample with regard to the putatively causing explanatory variables. For example, there might (or might not) be a strong and stable causal relationship between a variable $y_t$ and lagged values of a variable $x_t$. But if there happens to be an unusually large (or small) amount of sample variation in $x_t$ during the last portion of the data set, then a short post-sample model validation period can easily yield misleading results.
[3] Over time it also became apparent - see Clark and McCracken (2001) and Inoue and Kilian (2004) - that there is an important distinction to be made between choosing which model is closest to the true (population) model versus which model provides the most accurate forecasts, especially for nested models. Despite the fact that the intuitive justification for the Granger-causation concept is grounded in forecasting, it is the former rather than the latter choice which is causation-relevant. The MSE-F post-sample test used in Ashley and Ye (2012) and in the comparisons reported below takes this feature into account, however, so this particular aspect of post-sample testing for Granger-causation is not further discussed here. The MSE-F test is briefly described in Section 3 below and also in Ashley and Ye (2012); see Gilbert (2001), McCracken (2001, 2005), and McCracken (2007) for details.
Thus, in settings where the available relevant data set is not very long - as is typically the case with quarterly macroeconomic data and as is more generally the case where institutional arrangements vary over time, so that only the most recent data are relevant - reliably informative statistical testing of the proposition that the post-sample mean square forecasting error from one model exceeds that of another is likely to require a post-sample period so lengthy as to leave insufficient in-sample data available for model identification and estimation.[5]
[4] Even when, as here, the issue is relative predictability at the population level across two different information sets, Inoue and Kilian (2004) correctly argue that post-sample testing is inefficient; essentially, this is because it only uses a portion of the sample data available. On the other hand, this efficiency loss is empirically significant only where a lack of data forces one to specify a post-sample period which is short: it was probably not a very important factor in Ashley and Ye (2012), for example, where a sample nearly 500 months in length allowed the authors to reserve 180 observations for post-sample testing.
[5] The small sample lengths concentrated upon here could well be the natural consequence of having discarded a good deal of available sample data because one has statistically identified structural breaks in the data. In large-sample settings one might try to test for breaks and for Granger-causality all at once, as in Rossi (2005); see also Pesaran and Timmermann (2007).
Section 2 introduces an elegant new Granger-causality test which - because it uses all of the available data at once in the testing procedure - both eliminates the need to decide a priori upon an in-sample versus post-sample partitioning of the available data set and also dramatically ameliorates the problems caused by a data set of modest length inducing the choice of a short post-sample period. Yet this new testing procedure retains a good deal of the credibility attached to post-sample testing, in that the relative performance of the estimated models used in the new test is always evaluated over data not used in the estimation of their coefficients; for this reason the new tests are denoted 'cross-sample validation' or 'CSV' Granger-causality tests below.[6] The results of calculations using simulated data to compare the empirical power of the new tests to that of the usual in-sample F test and to that of the MSE-F post-sample test are presented in Section 3 for sample lengths of 30, 60, and 120 periods.
These results indicate that the power of the CSV Granger-causality tests proposed in Section 2 below is only modestly lower than that of the usual in-sample F test and (for post-sample periods of reasonable length) that their power is distinctly higher than that of the MSE-F post-sample test.
Note that this is the desired outcome - not a test with higher power than the in-sample F test. Rather, what the CSV Granger-causality tests proposed here provide is a causality test with higher credibility than that of the in-sample F test, which credibility is obtained at a tolerable loss in power and while avoiding the problems (sample split arbitrariness, relatively low power, etc.) of the post-sample tests.
An illustrative application is given in Section 4, to a re-analysis of the Engel and West (2005) study of the causal relationship between macroeconomic fundamentals and the exchange rate. In their setting -with only 88 to 106 quarters of sample data available -post-sample Granger-causality testing was justifiably not considered feasible. Applying the cross-sample validation Granger-causality tests introduced here, we find that some of the Engel and West causality results are actually strengthened, but that the breadth of applicability of their conclusions is reduced. Section 5 concludes the paper.

Causality
For notational simplicity, we write the model for $y_t$ over the full (unrestricted) information set in the usual multiple regression model format:
$$Y = X\beta^{u} + \varepsilon^{u} \qquad (1)$$
where $X$ is $T \times k$, and write the model for $y_t$ over the restricted information set as:
$$Y = X^{r}\beta^{r} + \varepsilon^{r} \qquad (2)$$
where the $T \times (k-g)$ array $X^{r}$ is identical to $X$ but omits the columns containing the data on the $g$ putatively causative variables and where $\beta^{r}$ omits the corresponding components. Here $X$ might contain additional explanatory variables - e.g., error-correction terms if the model for $y_t$ is co-integrated - as well as lagged values of $y_t$.
[6] These CSV tests are new to the literature on Granger-causality with modest data sets, but the idea of estimating model coefficients over one part of the sample and then using them in another has long appeared in the statistics literature, where it is usually called "cross-validation." For example, Racine and Parmeter (2013) have independently proposed a model-comparison procedure which can be used to compare both cross-sectional and time series models; their approach 'cross-validates' in a somewhat similar, but not identical, way to that of the CSV tests proposed here. (In particular, their procedure utilizes a large number of randomly-chosen sample-splits, whereas - as will be explained in Section 2 - the CSV tests explicitly examine every possible in-sample versus post-sample partitioning.) The Racine and Parmeter procedure requires block bootstrap re-sampling when applied to serially dependent time series data, however, so it is not suitable for use with the relatively short time series for which the CSV Granger-causality tests are designed.
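As a minimal illustration of Equations 1 and 2, the following Python sketch builds $X^{r}$ by deleting the $g$ putatively causative columns from $X$, fits both models by OLS, and forms the usual in-sample F ratio. The function names, the simulated data, and the coefficient values are hypothetical choices made only for this illustration.

```python
import numpy as np

def ols_rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Hypothetical data: a constant plus four regressors; columns 3 and 4
# play the role of the g = 2 putatively causative variables.
rng = np.random.default_rng(0)
T = 60
X = np.column_stack([np.ones(T), rng.normal(size=(T, 4))])
y = X @ np.array([1.0, 0.5, 0.0, 0.8, 0.0]) + rng.normal(size=T)

causal_cols = [3, 4]
X_r = np.delete(X, causal_cols, axis=1)   # restricted design, T x (k - g)

k, g = X.shape[1], len(causal_cols)
rss_u = ols_rss(X, y)      # unrestricted model (Equation 1)
rss_r = ols_rss(X_r, y)    # restricted model (Equation 2)

# Usual in-sample F ratio for the g zero restrictions.
F = ((rss_r - rss_u) / g) / (rss_u / (T - k))
print(f"in-sample F = {F:.3f}")
```

Because both models are fitted to the same T observations here, the restricted sum of squares can never fall below the unrestricted one; this is the in-sample benchmark against which the cross-sample quantities defined next are compared.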
It is tacitly assumed here that the coefficient vector $\beta^{u}$ is a constant over all T observations. This is assumed to have been assured either by having pruned the sample (which is why T might well be so modest in length) or by inclusion of appropriate explanatory variables at least approximately allowing for any structural changes within the data set.
Because the sampling distribution of the test statistic derived below is obtained using bootstrap simulation, Equation 1 must be specified with enough dynamics - e.g., lagged values of $y_t$ and the other variables - that an assumption to the effect that the model errors ($\varepsilon^{u}$) are serially independent is tenable, but neither normality nor homoscedasticity needs to be assumed.
In Granger-causality analysis attention is usually restricted to linear models, but [...].[7] Indeed, it should be pointed out that an important implicit assumption in Equation 1 is that this specification includes the past values of all time series substantially relevant to the current value of $y_t$ and especially any which are causally connected with the $g$ time series omitted from $X^{r}$. This implicit assumption is both necessary and sufficient as to eliminate all of the usual counter-examples in which the Granger-causality concept itself becomes problematic, but it is nonetheless a strong assumption. On the other hand, it is also worth noting that this assumption is tacitly (and equally) made in any and all reduced-form regression modeling, so perhaps all that should be further mentioned here is that reasonable care must be taken (and common sense utilized) in specifying Equation 1.
[7] Parametric nonlinear specifications for Equations 1 and 2 - and consequent CSV Granger-causality analysis in that setting - are by no means ruled out, but would likely require substantially larger samples than are envisioned here. (It is not so much that nonlinear least squares requires so much more data than does OLS; the problem is that the class of nonlinear models is so broad that the specification search process requires larger samples.) Diks and Panchenko (2006) provide a non-parametric in-sample Granger-causality analysis framework, but effective non-parametric estimation requires even larger samples.
Now suppose that the sample of T observations is split into two parts: the first τ observations and the remaining T − τ observations, where the value of τ is (for the moment) taken as given. Let the subscript 'τ ' denote an array consisting of just the first τ elements of the corresponding un-subscripted array and let the subscript '−τ ' similarly denote an array consisting of just the remaining T − τ elements.
Analogously, let $\hat\beta^{u}_{\tau}$ be the estimator of $\beta^{u}$ in Equation 1 using only the first $\tau$ observations, let $\hat\beta^{u}_{-\tau}$ be the estimator of $\beta^{u}$ in Equation 1 using only the last $T-\tau$ observations, and define $\hat\beta^{r}_{\tau}$ and $\hat\beta^{r}_{-\tau}$ similarly with regard to the estimators of $\beta^{r}$ in the restricted regression, Equation 2. Clearly, $\hat\beta^{u}_{\tau}$ is simply $(X_{\tau}' X_{\tau})^{-1} X_{\tau}' Y_{\tau}$ if (as would be likely for small T) OLS estimation is used, and similarly for $\hat\beta^{u}_{-\tau}$, $\hat\beta^{r}_{\tau}$, and $\hat\beta^{r}_{-\tau}$.
Thus, if one takes the first $\tau$ observations to be the 'in-sample' period and the remaining $T-\tau$ observations to be the 'post-sample' period, then - without parameter updating - the $T-\tau$ post-sample forecasting errors made by the unrestricted model are
$$\hat\varepsilon^{u}_{-\tau} \equiv Y_{-\tau} - X_{-\tau}\hat\beta^{u}_{\tau}$$
Similarly, the post-sample forecasting errors made by the restricted model comprise the array $\hat\varepsilon^{r}_{-\tau} \equiv Y_{-\tau} - X^{r}_{-\tau}\hat\beta^{r}_{\tau}$. Note, however, that one can also (in a completely analogous fashion) define the $\tau \times 1$ array of 'sample precasting' errors $\hat\varepsilon^{u}_{\tau} \equiv Y_{\tau} - X_{\tau}\hat\beta^{u}_{-\tau}$, in which $\hat\beta^{u}_{-\tau}$ - the parameter estimator based on the (unrestricted model) data for the $T-\tau$ post-sample periods - is used to obtain the model errors for the first $\tau$ periods. Similarly, for the restricted model, the corresponding 'sample precasting' errors are $\hat\varepsilon^{r}_{\tau} \equiv Y_{\tau} - X^{r}_{\tau}\hat\beta^{r}_{-\tau}$. In both cases these are just the prediction errors made in the first $\tau$ periods, using the parameter estimates obtained using only the data from the final $T-\tau$ periods.
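The forecasting and 'precasting' errors just defined can be sketched in a few lines of Python. This is only an illustration under the OLS assumption mentioned above; the function names and the simulated data are hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS coefficient estimates via least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def cross_sample_errors(X, y, tau):
    """Return (post-sample forecasting errors, sample precasting errors).

    The first tau rows are the 'in-sample' period; the remaining
    T - tau rows are the 'post-sample' period. No parameter updating.
    """
    beta_first = ols(X[:tau], y[:tau])   # estimated on the first tau rows
    beta_last = ols(X[tau:], y[tau:])    # estimated on the last T - tau rows
    forecast_err = y[tau:] - X[tau:] @ beta_first   # length T - tau
    precast_err = y[:tau] - X[:tau] @ beta_last     # length tau
    return forecast_err, precast_err

# Hypothetical example data.
rng = np.random.default_rng(0)
T, k, tau = 30, 3, 12
X = rng.normal(size=(T, k))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=T)

forecast_err, precast_err = cross_sample_errors(X, y, tau)
print(forecast_err.shape, precast_err.shape)
```

The same function applies to the restricted model by passing $X^{r}$ in place of $X$. In every case a coefficient vector is applied only to rows that played no part in its estimation.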
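For any given sample-split $\tau$, one can easily compute both an unrestricted and a restricted sum of T squared 'out-of-sample' prediction errors, $URSS(\tau)$ and $RSS(\tau)$. Each of these is the sum of the squared prediction errors from all T periods in the data set, yet each is entirely based on errors made applying an estimated coefficient vector to explanatory variable data never used in its estimation. More explicitly:
$$URSS(\tau) \equiv (Y_{-\tau} - X_{-\tau}\hat\beta^{u}_{\tau})'(Y_{-\tau} - X_{-\tau}\hat\beta^{u}_{\tau}) + (Y_{\tau} - X_{\tau}\hat\beta^{u}_{-\tau})'(Y_{\tau} - X_{\tau}\hat\beta^{u}_{-\tau})$$
and
$$RSS(\tau) \equiv (Y_{-\tau} - X^{r}_{-\tau}\hat\beta^{r}_{\tau})'(Y_{-\tau} - X^{r}_{-\tau}\hat\beta^{r}_{\tau}) + (Y_{\tau} - X^{r}_{\tau}\hat\beta^{r}_{-\tau})'(Y_{\tau} - X^{r}_{\tau}\hat\beta^{r}_{-\tau})$$
In parsing the above equations, the reader is reminded that superscript 'r' signifies that the $g$ columns corresponding to the explanatory variables which putatively Granger-cause fluctuations in $y_t$ have been removed, whereas subscript '$\tau$' signifies that only the first $\tau$ rows corresponding to the first $\tau$ sample periods are used in the $Y$, $X$, and $X^{r}$ arrays, and that the subscript '$-\tau$' signifies that only the last $T-\tau$ rows corresponding to the final $T-\tau$ sample periods are used in the $Y$, $X$, and $X^{r}$ arrays.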
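A hedged Python sketch of these two sums follows. It assumes OLS estimation on each sub-sample; the function names, the data, and the choice of which column is 'causal' are hypothetical.

```python
import numpy as np

def out_of_sample_rss(X, y, tau):
    """Sum of all T squared prediction errors for sample split tau.

    Rows tau.. are predicted with coefficients estimated on rows ..tau-1,
    and rows ..tau-1 with coefficients estimated on rows tau.. .
    """
    b_first, *_ = np.linalg.lstsq(X[:tau], y[:tau], rcond=None)
    b_last, *_ = np.linalg.lstsq(X[tau:], y[tau:], rcond=None)
    e_post = y[tau:] - X[tau:] @ b_first
    e_pre = y[:tau] - X[:tau] @ b_last
    return float(e_post @ e_post + e_pre @ e_pre)

def urss_and_rss(X, y, causal_cols, tau):
    """Return (URSS(tau), RSS(tau)) for the given causal columns."""
    X_r = np.delete(X, causal_cols, axis=1)   # drop the g causal columns
    return out_of_sample_rss(X, y, tau), out_of_sample_rss(X_r, y, tau)

# Hypothetical example data: the last column is putatively causal.
rng = np.random.default_rng(0)
T, tau = 40, 20
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])
y = X @ np.array([0.5, 1.0, 0.0, 0.7]) + rng.normal(size=T)

urss, rss = urss_and_rss(X, y, causal_cols=[3], tau=tau)
print(f"URSS(tau) = {urss:.2f}, RSS(tau) = {rss:.2f}")
```

Unlike their in-sample counterparts, these sums are not nested: nothing forces the unrestricted sum to be the smaller of the two, since the extra coefficients are never fitted to the rows on which they are evaluated.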
Having obtained $URSS(\tau)$ and $RSS(\tau)$, the pseudo-F statistic
$$F_{\tau} \equiv \frac{[RSS(\tau) - URSS(\tau)]/g}{URSS(\tau)/(T-k)}$$
would be potentially useful in testing the null hypothesis that the coefficients on all $g$ putatively Granger-causing explanatory variables are zero.
In practice, however, $F_{\tau}$ itself is of minimal interest, because it depends on the value chosen for the sample split, $\tau$. Inference is therefore based instead on $\hat{Q}_{\nu}$, the $\nu$-th sample quantile of the feasible $F_{\tau}$ values,[9] where $\tau$ must lie in the interval $[k+1, T-k-1]$ so that both $\hat\beta^{u}_{\tau}$ and $\hat\beta^{u}_{-\tau}$ are computable. Thus, for example, $\hat{Q}_{0.50}$ is just the sample median of $F_{k+1} \ldots F_{T-k-1}$. These sample order statistics, by construction, do not depend on $\tau$. However - like $F_{\tau}$ itself - their finite-sample sampling distributions are unknown, even for conditionally homoscedastic model errors. Recalling that the raison d'être for the present approach is to obtain credible inference results despite the fact that the value of T is modest, the sample lengths envisioned here for use in calculating $\hat{Q}_{\nu}$ are inherently too small for the use of asymptotic results. Consequently, Granger-causation inferences based on $\hat{Q}_{\nu}$ must in practice be obtained using bootstrap methods, and results obtained in this way are quoted in Section 3 below.[10]
Granger-causality tests based on $\hat{Q}_{\nu}$ are aptly called 'cross-sample validation' tests because they are based on applying the model coefficients estimated on one portion of the data to predicting the other portion of the data. Consequently, below we denote the test based on $\hat{Q}_{\nu}$ as CSV followed by the value of $100\nu$; e.g., CSV75 is the test based on the third quartile, $\hat{Q}_{0.75}$. Bootstrap inference ensures that the sizes of these cross-sample validation tests are reasonably accurate, even for the modest sample lengths considered here.[11] But - compared to the power of the usual in-sample F test to detect Granger-causality - how large a price in power must one pay for the added credibility provided by these cross-validation tests? This issue is addressed in the next section.
[9] The empirical power of tests based on the sample mean of the feasible $F_{\tau}$ values (and several variations involving un-equally weighted averages) was examined also, but the power of these tests is lower than that of tests based on the sample median. In addition, we naturally turned to quantile statistics because the distributions of the simulated $F_{\tau}$ are quite non-gaussian.
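The quantile statistic described above can be sketched as follows. Two points are assumptions of this illustration rather than statements from the text: the pseudo-F ratio is scaled by $g$ and $T-k$ in the usual F-ratio fashion, and `np.quantile` is used with its default interpolation rule. All function names and data are hypothetical.

```python
import numpy as np

def _oos_rss(X, y, tau):
    """Sum of squared cross-sample prediction errors for split tau."""
    b_first, *_ = np.linalg.lstsq(X[:tau], y[:tau], rcond=None)
    b_last, *_ = np.linalg.lstsq(X[tau:], y[tau:], rcond=None)
    e_post = y[tau:] - X[tau:] @ b_first
    e_pre = y[:tau] - X[:tau] @ b_last
    return float(e_post @ e_post + e_pre @ e_pre)

def f_tau(X, y, causal_cols, tau):
    """Pseudo-F statistic for one split (assumed scaling: g and T - k)."""
    T, k = X.shape
    g = len(causal_cols)
    urss = _oos_rss(X, y, tau)
    rss = _oos_rss(np.delete(X, causal_cols, axis=1), y, tau)
    return ((rss - urss) / g) / (urss / (T - k))

def csv_statistic(X, y, causal_cols, nu=0.75):
    """nu-th sample quantile of F_tau over tau = k+1, ..., T-k-1."""
    T, k = X.shape
    taus = range(k + 1, T - k)   # inclusive of T - k - 1
    f_values = np.array([f_tau(X, y, causal_cols, tau) for tau in taus])
    return float(np.quantile(f_values, nu))

# Hypothetical example data: the last column is putatively causal.
rng = np.random.default_rng(0)
T, k = 40, 4
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([0.5, 1.0, 0.0, 0.7]) + rng.normal(size=T)

q50 = csv_statistic(X, y, [3], nu=0.50)    # median of the F_tau values
q75 = csv_statistic(X, y, [3], nu=0.75)    # third quartile
q100 = csv_statistic(X, y, [3], nu=1.00)   # sup over tau
print(q50, q75, q100)
```

Individual $F_{\tau}$ values can be negative, because the cross-sample sums of squares are not nested. This is one more reason why the reference distribution has to come from the bootstrap rather than from F tables.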

Cross-Sample Validation Test Power Comparisons Using Simulated Data
This section uses simulated data to compare the power of the cross-sample validation tests proposed above to that of both the usual in-sample F test and a typical post-sample test. These results are designed to answer the following two questions:
• Is the power of the cross-sample validation tests to detect Granger-causality close enough to that of the in-sample F test as to be a reasonable compromise?[12]
• For a reasonable post-sample forecasting period length, is the power of the cross-sample validation tests to detect Granger-causality so substantially higher than that of the post-sample test as to make this an attractive alternative?
[10] Bootstrap inference requires hardly more computer coding than does the sample evaluation of $F_{\tau}$. And, using present equipment, bootstrap inference with $N_{boot}$ = 10,000 simulations requires only 10 to 65 seconds of computer time as T varies from 30 to 120. Windows-based software is available from the authors which conveniently implements bootstrap-based Granger-causality inference based on $\hat{Q}_{\nu}$ for models such as Equation 1 (with k <= 40 and T <= 4,000), optionally including multiple lags in the dependent variable and - where conditional heteroscedasticity in the model errors is a concern - using the 'wild' bootstrap, as described in Gonçalves and Kilian (2004). The reader should note, however, that, while not requiring NIID model errors, the ordinary bootstrap still requires $\varepsilon^{u} \sim IID(0, \sigma^{2})$ in Equation 1 and the wild bootstrap still requires serial independence in $\varepsilon^{u}$. As noted in Section 2, where linear modeling is sufficient this serial independence can be ensured by including a sufficient number of lagged values of the dependent and explanatory variables in the specification of Equation 1.
[11] Size results were calculated at an early stage in this project, but only as a check on the computer codes and on the finite-sample use of the bootstrap: absent problems of that sort, the empirical size of a bootstrapped 5% test is 0.05 by construction, aside from simulation noise arbitrarily reducible by increasing the number of simulations. This was not a problem in our checks at T = 30.
[12] At the risk of repetitiveness, we note again that the sizes of the tests are not at issue here since the use of bootstrap inference virtually assures that the empirical size equals the nominal size. Recall also that our CSV tests are not expected to have higher power than the in-sample F test: their added value (as with the post-sample tests) lies in their enhanced credibility.
More specifically, three kinds of test are considered:
1. The usual in-sample F test.[13] This test utilizes all T observations at once, with no sample split at all.
2. The MSE-F post-sample test, whose test statistic is
$$MSE\text{-}F \equiv P \times \frac{\sum \left( e_{r,t+1}^{2} - e_{u,t+1}^{2} \right)}{\sum e_{u,t+1}^{2}} \qquad (9)$$
where each sum runs over the post-sample periods, P is the number of post-sample periods chosen, $e_{u,t+1}$ is the one-step-ahead forecasting error made by the unrestricted model in period t, and $e_{r,t+1}$ is the corresponding one-step-ahead forecasting error made by the restricted model. Both models are estimated using all data up to period t.[14]
3. And, finally, the cross-sample validation tests - based on the sample quantiles of the feasible $F_{\tau}$ values, as defined in Section 2.
The issue as to whether the cross-sample validation tests can provide interestingly-distinct Granger-causality results in a practical setting is deferred to Section 4. Here the relative power of these three kinds of tests is compared using M = 10,000 artificially generated data sets, each of length T; results are given in Table 1 for T = 30, 60, and 120.
[13] This is the standard test covered in most textbooks - e.g., Davidson and MacKinnon (1993, p. 92).
[14] The notation used for the post-sample forecasting errors in Equation 9 is consistent with that of McCracken (2007). The notation used in Section 2 - which defined analogous vectors of post-sample forecast errors, $\hat\varepsilon^{u}_{-(T-P)}$ and $\hat\varepsilon^{r}_{-(T-P)}$ - is intentionally distinct from that used in Equation 9 because the parameter estimates in the models used to obtain $e_{u,t+1}$ and $e_{r,t+1}$ are updated each period, whereas the out-of-sample prediction errors used in $\hat\varepsilon^{u}_{-(T-P)}$ and $\hat\varepsilon^{r}_{-(T-P)}$ are not.
These artificial data sets were generated from an (intentionally unremarkable) dynamic multiple regression model (Equation 10), where $u_t$ is generated as an NIID(0,1) variate for each observation in each data set. As would be common, this regression model includes a lagged dependent variable and several explanatory control variables: $x_{1,t}$, $x_{2,t}$, and $x_{3,t}$, not all of which actually belong in the model. Equation 10 also includes two putatively causal variables: $x_{4,t}$ and $x_{5,t}$, one of which actually is causal. Aside from the lagged dependent variable, all of the explanatory variable values for each data set were generated (once) as AR(1) variates (with first-order autocorrelation of 0.50) and then held 'fixed in repeated samples' across all M artificial data sets. The data on $y_t$ for each artificial data set were then generated recursively from Equation 10. Equation 10 is typical, in size and kind, of the sorts of unrestricted models commonly used in Granger-causality analysis. In particular, its assumption of NIID model errors seems reasonably innocuous since one might expect an analyst to include sufficient lagged terms in such a model as to eliminate any serial correlation in the errors and since the bootstrap inference used in implementing our method would in any case allow for any departures from normality and homoscedasticity.
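The data-generating steps just described can be sketched in Python as below. The coefficient values, the zero starting value for $y_t$, and the AR(1) burn-in length are hypothetical choices made only for this illustration; they are not the values used in Equation 10.

```python
import numpy as np

def ar1(T, rho, rng, burn=50):
    """One AR(1) series of length T (after discarding a burn-in)."""
    x = np.zeros(T + burn)
    e = rng.normal(size=T + burn)
    for t in range(1, T + burn):
        x[t] = rho * x[t - 1] + e[t]
    return x[burn:]

def simulate_y(X_exog, coefs, phi, rng, y0=0.0):
    """Generate y recursively: y_t = phi*y_{t-1} + x_t'coefs + u_t, u_t ~ NIID(0,1)."""
    T = X_exog.shape[0]
    u = rng.normal(size=T)
    y = np.empty(T)
    prev = y0
    for t in range(T):
        y[t] = phi * prev + X_exog[t] @ coefs + u[t]
        prev = y[t]
    return y

rng = np.random.default_rng(1)
T, M = 60, 3                      # M kept tiny here; the text uses M = 10,000

# x1..x5 generated once as AR(1) variates (autocorrelation 0.50),
# then held fixed across all M artificial data sets.
X_exog = np.column_stack([ar1(T, 0.5, rng) for _ in range(5)])

# Hypothetical coefficients: x3 does not belong in the model, and only
# x4 (of the two putatively causal variables x4, x5) is actually causal.
coefs = np.array([0.5, 0.5, 0.0, 0.3, 0.0])
phi = 0.5                         # hypothetical lagged-dependent-variable coefficient

datasets = [simulate_y(X_exog, coefs, phi, rng) for _ in range(M)]
print(len(datasets), datasets[0].shape)
```

Holding `X_exog` fixed while redrawing only the errors mirrors the 'fixed in repeated samples' design, so that differences in rejection frequencies across tests reflect the tests themselves rather than different regressor draws.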
The null hypothesis that neither $x_{4,t}$ nor $x_{5,t}$ Granger-causes $y_t$ was then tested by applying all three kinds of test to a regression model (analogous to Equation 10) which was fitted, using OLS, to each of these M data sets. Because exact sampling distributions are available for none of these tests with sample lengths this small, 5% critical points (and corresponding test rejection P-values for each artificial data set) were obtained using non-parametric bootstrap re-sampling to generate $N_{boot}$ = 10,000 new T-samples based on this fitted regression model. More specifically, simulated values of $y_1 \ldots y_T$ were obtained by recursion of this fitted model, using T 'new' model errors generated by picking at random amongst the fitting errors. Rejection frequency results (i.e., estimates of the empirical power) for each of the three kinds of Granger-causality tests listed above - in each instance testing the null hypothesis that the coefficients on $x_{4,t}$ and $x_{5,t}$ are both zero - are collected in Table 1 for M = 10,000 artificial data sets of length T = 30, 60, or 120 generated in this way from Equation 10. Table 1 in a few cases includes entries for post-sample tests with forecasting period lengths which are ludicrously small. For example, it is hardly credible that an analyst would truly sequester a post-sample period of length 10 or 20 periods from a total sample which is only 30 periods in length. On the other hand, it is interesting to at least look at the power of a test based on a five-period post-sample test in this case, so that entry is included in Table 1 nevertheless.
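A residual-bootstrap P-value of the kind described above might be computed along the following lines. This is a sketch under several assumptions that the text does not spell out: the null is imposed by recursing the fitted restricted model, the fitting errors are re-centered before re-sampling, and the P-value uses the common (count + 1)/(N + 1) convention. The in-sample F ratio serves as the example statistic; any other statistic with the same call signature could be passed instead. All names and data are hypothetical.

```python
import numpy as np

def fit(X, y):
    """OLS fit: return (coefficients, fitting errors)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def f_stat(X, y, causal):
    """Usual in-sample F ratio for zero coefficients on the causal columns."""
    T, k = X.shape
    _, e_u = fit(X, y)
    _, e_r = fit(np.delete(X, causal, axis=1), y)
    return ((e_r @ e_r - e_u @ e_u) / len(causal)) / (e_u @ e_u / (T - k))

def bootstrap_pvalue(y_full, Z, causal_z, stat, n_boot, rng):
    """y_full holds y_0..y_T; row t of Z pairs with y_full[t + 1]."""
    y, y_lag = y_full[1:], y_full[:-1]
    X = np.column_stack([y_lag, Z])
    causal = [c + 1 for c in causal_z]          # shift past the lagged-y column
    observed = stat(X, y, causal)
    Z_r = np.delete(Z, causal_z, axis=1)
    beta_r, resid = fit(np.column_stack([y_lag, Z_r]), y)   # null model (assumed)
    resid = resid - resid.mean()                # re-center (assumed)
    T = len(y)
    exceed = 0
    for _ in range(n_boot):
        e_star = rng.choice(resid, size=T, replace=True)    # pick amongst fitting errors
        y_star = np.empty(T + 1)
        y_star[0] = y_full[0]
        for t in range(T):                      # recursion of the fitted model
            y_star[t + 1] = beta_r[0] * y_star[t] + Z_r[t] @ beta_r[1:] + e_star[t]
        X_star = np.column_stack([y_star[:-1], Z])
        exceed += int(stat(X_star, y_star[1:], causal) >= observed)
    return (exceed + 1) / (n_boot + 1)

# Hypothetical example: constant, one control, one causal variable.
rng = np.random.default_rng(0)
T = 40
Z = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y_full = np.zeros(T + 1)
for t in range(T):
    y_full[t + 1] = 0.4 * y_full[t] + Z[t] @ np.array([0.2, 0.5, 0.6]) + rng.normal()

p_value = bootstrap_pvalue(y_full, Z, causal_z=[2], stat=f_stat, n_boot=199, rng=rng)
print(f"bootstrap P-value = {p_value:.3f}")
```

Passing the statistic as a callable keeps the re-sampling loop identical across the three kinds of test, which is what makes their rejection frequencies directly comparable.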
It is evident from the empirical power results in Table 1 that the in-sample F test has the highest power in each case. This result is to be expected: it is obviously helpful to estimate the model parameters using the entire data set; the object here is to obtain Granger-causality test results which are more convincing than those provided by the in-sample test because they do not rely for inference on the same sample data used to specify and estimate the models. Such credibility is to some extent provided by the post-sample MSE-F test, but the results in Table 1 indicate that this additional credibility comes at a high cost in terms of power. The cross-sample validation tests introduced here also provide higher-credibility Granger-causality inference than does the in-sample F test, but - in most cases - with substantially higher power than the post-sample tests.
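For concreteness, the MSE-F statistic used as the post-sample comparison can be sketched as P times the difference between the restricted and unrestricted post-sample mean squared errors, divided by the unrestricted one, with both models re-estimated each period on all earlier data. The code below is an illustration only: it assumes that each row of the design array already contains appropriately lagged regressors, and the function name and data are hypothetical.

```python
import numpy as np

def mse_f(X, y, causal_cols, P):
    """MSE-F statistic over the last P periods with recursive re-estimation."""
    T = len(y)
    X_r = np.delete(X, causal_cols, axis=1)
    e_u = np.empty(P)
    e_r = np.empty(P)
    for i, t in enumerate(range(T - P, T)):
        # Estimate both models on rows 0..t-1, then forecast row t.
        b_u, *_ = np.linalg.lstsq(X[:t], y[:t], rcond=None)
        b_r, *_ = np.linalg.lstsq(X_r[:t], y[:t], rcond=None)
        e_u[i] = y[t] - X[t] @ b_u
        e_r[i] = y[t] - X_r[t] @ b_r
    return float(P * (np.sum(e_r**2) - np.sum(e_u**2)) / np.sum(e_u**2))

# Hypothetical example data: the last column is putatively causal.
rng = np.random.default_rng(0)
T, P = 60, 20
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])
y = X @ np.array([0.5, 1.0, 0.0, 0.7]) + rng.normal(size=T)

stat = mse_f(X, y, causal_cols=[3], P=P)
print(f"MSE-F = {stat:.3f}")
```

Only the final P forecast errors enter this statistic, whereas the cross-sample sums of squares use a prediction error from every one of the T periods; this difference is a natural candidate explanation for the power gap seen in Table 1.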
Notably [...] noisiness. In particular, we note that the empirical power of the CSV100 test - whose test statistic is sup $F_{\tau}$ and is thus reminiscent of other sup F tests in the literature - is typically much smaller than the empirical power of the CSV75 test. Because a unique cross-sample validation test is desirable, our recommendation is to simply use the 'third-quartile' or CSV75 cross-sample validation test in empirical applications.
The application given in the next section provides additional insights with regard to the relative merits of these different tests.

An Empirical Application: Do Fluctuations in Macroeconomic Fundamentals Granger-Cause Fluctuations in the Exchange Rate?
The standard view on the determination of a country's exchange rate is the asset-pricing model, in which the exchange rate is a function of the expected discounted values of future macroeconomic fundamentals - i.e., cross-country differentials in output, money, interest rates, etc. But it is also a long-standing puzzle that models based on this theory have bleak empirical performance. In particular, exchange rates are well-approximated as random walks and do not appear to be forecastable using macroeconomic fundamentals. In an influential study, Engel and West (2005) enhance the asset-pricing model by adding two reasonable assumptions: that the subjective discount factor is close to one and that the macroeconomic fundamentals are highly persistent. Under these assumptions, the model predicts that exchange rates will behave like random walks and that innovations in exchange rates are correlated with news about future values of the macroeconomic fundamentals. They test their enhanced model by using quarterly data for six countries (with the U.S. as base, and a sample period of 1974Q1 to 2001Q3 in most cases) to look for Granger-causality between the growth rate in each country's exchange rate and the growth rate in each of several fundamental macroeconomic differential time series, relative to the U.S. The fundamentals variables Engel and West consider are, in their notation: [...] rates for Germany, Italy, and Japan.[20] In particular, Engel and West are able to reject the null hypothesis of no-Granger-causality for the exchange rate at either the 5% or the 1% level of significance for $ppd_t$ (in the case of Germany), for $mmd_t$ and for $ppd_t$ (in the case of Italy), and for all of $mmd_t$, $ppd_t$, $ii_t$, and $iid_t$ (in the case of Japan).
But are these findings of Granger-causation merely artifacts of using the same data both to estimate the bivariate relationships and to test for causality?
As noted above, the Engel and West data sets are too short for post-sample testing to be useful, but this is an excellent setting for the application of the cross-sample validation causality tests introduced here. Table 2 summarizes the results. The sample lengths are given because several of the fundamental data series (for Italy and Japan) begin subsequent to 1974. Table 2 also indicates whether the results for each column were obtained using the usual bootstrap or using the wild bootstrap. The latter was necessary in four of the seven cases because the fitting errors of the underlying estimated regression model - for the fundamentals variable in terms of both its own past values and the past values of this country's exchange rate - displayed severe conditional heteroscedasticity in these instances. 21 The next row of Table 2 displays the P-value at which the null hypothesis of no-Granger-causality can be rejected using the usual (in-sample) F test; as in Engel and West's work, all seven of these P-values remain less than 0.05 in these bootstrapped results.
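The difference between the two resampling schemes can be illustrated with a minimal Python sketch. It assumes Rademacher (plus or minus one) weights for the wild bootstrap, which is one common choice and not necessarily the one used for Table 2; the function names and the simulated fitting errors are hypothetical.

```python
import numpy as np

def iid_bootstrap_errors(resid, rng):
    """Usual bootstrap: draw residuals with replacement, ignoring their dates."""
    return rng.choice(resid, size=len(resid), replace=True)

def wild_bootstrap_errors(resid, rng):
    """Wild bootstrap (assumed Rademacher weights): flip the sign of each
    residual at random while keeping its magnitude at its original date."""
    signs = rng.choice(np.array([-1.0, 1.0]), size=len(resid))
    return resid * signs

# Hypothetical fitting errors whose scale triples in the second half of the sample.
rng = np.random.default_rng(0)
scale = np.where(np.arange(100) < 50, 1.0, 3.0)
resid = scale * rng.standard_normal(100)

wild = wild_bootstrap_errors(resid, rng)  # variance pattern over time is preserved
iid = iid_bootstrap_errors(resid, rng)    # variance pattern over time is scrambled
```

Because the wild draws keep each residual's magnitude at its own date, any time-varying error variance in the fitted model carries over to the bootstrap samples, whereas the usual bootstrap implicitly treats the errors as homoscedastic.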
The third-quartile cross-sample validation test - CSV75 - rejection P-values are given in the next row of Table 2 [...] test results, although it is (as one might expect) a bit weaker for Japan than that provided by the in-sample test. Finally, the evidence for a causal link from ii t or iid t to the exchange rate in the data for Japan is again mixed: there is still evidence (albeit somewhat weaker than from the in-sample result) for Granger-causality running from iid t to the exchange rate, but the in-sample evidence for Granger-causality running from ii t to the exchange rate appears to have been artifactual.

[...] conditional heteroscedasticity and these two in-sample Granger-causality results disappeared when White-Eicker standard error estimates were used.
21 In these four cases the conditional heteroscedasticity was so clearly visible in a time plot of the fitting errors that formal testing was beside the point. The use of robust standard error estimates in these models also yielded substantial changes in the in-sample F test P-values, with the ppd t , ii t , and iid t rejections for Japan becoming notably more significant and the ppd t rejection P-value for Italy rising from 0.004 to 0.033.
22 Based on the empirical power results in Section 3, Table 2 focuses solely on the third-quartile cross-sample validation test results.
The remaining four rows of Table 2 [...]

In summary, the cross-sample validation test results on the Engel and West data turn out to be quite illustrative: they are clearly distinct from the in-sample results, yet these more-credible results enrich - rather than broadly invalidate - Engel and West's original conclusions. In particular, the cross-validation results reinforce their contention that macroeconomic fundamentals can Granger-cause exchange rate fluctuations in some instances: certainly in the cases of ppd t and the German exchange rate and of mmd t and the Italian exchange rate - and probably also in the cases of ppd t and the Italian exchange rate and the cases of both mmd t and iid t for the Japanese exchange rate. Yet several of Engel and West's causality inferences - ppd t and ii t for the Japanese exchange rate - turn out to be artifacts of the in-sample testing method used. Thus, our re-examination of Engel and West's data strengthens the empirical support for their exchange rate model in some countries, while also indicating that their in-sample causality analysis over-states the breadth of the evidence in support of the model's predictions.

23 Recall that the MSE-F tests are conducted with recursive parameter updating.
Post-sample Granger-causality testing still seems preferable for sufficiently large values of the sample length, T , as its efficiency loss will in such cases be out-weighed by its higher credibility. 24 But for small values of T - e.g., T = 30 - post-sample testing may be simply infeasible: here the data are so scarce as to make the assertion that the analyst 'held back' even five observations from the model identification/estimation process risible. And, even for somewhat larger values of T - e.g., T = 60 or T = 120 - a post-sample period of credible length might necessarily be so short as to provide little power in testing whether the post-sample forecasting errors of the restricted model are significantly larger than those of the unrestricted model. Or post-sample Granger-causality testing might easily yield erratic results in these settings, due to unusual sample variation in the putatively causing variables during the course of such a brief post-sample period.
In contrast, the usual in-sample F test utilizes all T observations to test the null hypothesis that the coefficients on the putatively causative variables are all zero, so one is in a notably better position to deal with a small data set. But this in-sample test is apt to routinely yield misleading Granger-causality inferences, simply because our modeling processes inherently pre-dispose us to find models which fit well. (After all, who among us has not found that our models tend typically to fit better than they forecast?) This tendency to systematically over-fit leads to in-sample detections of Granger-causation which is not actually present. Still, with samples of modest length, an analyst has heretofore had little choice but to utilize the in-sample test, despite this danger.
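For concreteness, the in-sample F statistic can be computed as in the following minimal Python sketch. It assumes a bivariate model with an intercept and p lags of each variable; the function names, the lag length, and the simulated data are illustrative assumptions rather than anything taken from the application above.

```python
import numpy as np

def lagmat(series, p):
    """Matrix whose columns are the series lagged 1..p, aligned with series[p:]."""
    T = len(series)
    return np.column_stack([series[p - k:T - k] for k in range(1, p + 1)])

def ssr(X, target):
    """Sum of squared OLS residuals from regressing target on X."""
    beta = np.linalg.lstsq(X, target, rcond=None)[0]
    return float(np.sum((target - X @ beta) ** 2))

def granger_f_stat(y, x, p):
    """F statistic for H0: all p lags of x have zero coefficients in the y equation."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    target = y[p:]
    const = np.ones((len(target), 1))
    X_r = np.hstack([const, lagmat(y, p)])  # restricted model: own lags only
    X_u = np.hstack([X_r, lagmat(x, p)])    # unrestricted model: adds lags of x
    ssr_r, ssr_u = ssr(X_r, target), ssr(X_u, target)
    dof = len(target) - X_u.shape[1]        # residual degrees of freedom
    return ((ssr_r - ssr_u) / p) / (ssr_u / dof)

# Simulated example in which x does Granger-cause y (hypothetical coefficients).
rng = np.random.default_rng(1)
T = 120
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.standard_normal()

f_stat = granger_f_stat(y, x, p=2)
```

Every one of the T observations enters both the estimation and the test here, which is the source of both the power and the over-fitting risk discussed above.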
The 'third-quartile' - CSV75 - cross-sample validation test proposed here resolves this predicament by utilizing all of the scarce sample data, while nevertheless always basing model predictions on coefficients estimated over data not used in making these predictions. Calculations based on simulated data indicate that this new test has empirical power only modestly lower than that of the less-credible in-sample test, so that the penalty paid for the additional confidence which can be accorded the results it provides is manageable. The empirical example based on the Engel and West (2005) data set, presented above in Section 4, illustrates the value of this new approach to Granger-causality analysis in settings where only modest amounts of sample data are available.
Finally, then, how does the present work indicate that modern Granger-causality analysis with a sample of modest length should be done?
• The first step should consist of a thoughtful specification of the unrestricted model for each variable - in I(0) form - including (i.e., conditioning upon) a reasonable approximation to the set of all importantly causative variables, an error-correction term (if cointegration is present), a sufficient number of lagged values of the dependent and other variables as to yield serially uncorrelated fitting errors, and whatever additional conditioning variables are necessary in order to remove obvious signs of structural drift (or shifts) over the sample period. A plot of the model fitting errors is useful and warranted in this regard.
24 One might, however, for very large T turn instead to the method of Racine and Parmeter (2013).
• Each of the restricted models is then specified and diagnostically checked in a similar fashion, as a nested model within the unrestricted model specification.
• The in-sample F test can then be applied to test for any particular causal link. If the null hypothesis (of no causality for this link) cannot be rejected at a culturally acceptable level of significance (5%, usually), then one can conclude that there is no real evidence for the existence of this causal link.
• If, in contrast, the null hypothesis of no causality is in fact rejected on the in-sample F test, then we suggest that a degree of skepticism is warranted: Could this result be an artifact of model mis-specification? Or - and what is usually essentially the same thing - are there non-homogeneities across the sample inducing a spurious inference? Or, is this rejection of the null hypothesis simply the result of an ordinary sampling fluctuation? To ameliorate, if not entirely relieve, these skepticism-inducing worries, we suggest the application of the CSV Granger-causality test - typically Q(0.50) or Q(0.75) - as described above. If the null hypothesis of no causality is still rejected on the CSV tests, then it is reasonable to take this set of results as strong evidence in favor of the causal link actually existing. If, in contrast, the null hypothesis is no longer rejected on the CSV tests, then the initial skepticism - most especially with regard to model mis-specification and/or structural instability over the sample period - would seem to be warranted.
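The last two steps of this procedure amount to a simple decision rule, sketched below in Python. The function name, the wording of the returned labels, and the default 5% level are illustrative assumptions; the two P-values are taken as given, so the model specification and diagnostic checking of the first two steps, and the computation of the CSV test itself, lie outside the sketch.

```python
def assess_causal_link(p_in_sample, p_csv=None, alpha=0.05):
    """Combine an in-sample F test P-value with a CSV test P-value (if available).

    p_in_sample : P-value from the in-sample F test
    p_csv       : P-value from a CSV test, or None if it has not yet been run
    alpha       : significance level (assumed 5% by default)
    """
    if p_in_sample >= alpha:
        # The in-sample test fails to reject: no real evidence for the link.
        return "no evidence of causality"
    if p_csv is None:
        # In-sample rejection alone warrants skepticism; the CSV test is needed.
        return "in-sample rejection only: apply the CSV test"
    if p_csv < alpha:
        # Both tests reject the null hypothesis of no causality.
        return "strong evidence of causality"
    # The in-sample rejection is not confirmed by the CSV test.
    return "not confirmed: suspect mis-specification or instability"

# Hypothetical P-values, for illustration only.
verdict = assess_causal_link(p_in_sample=0.01, p_csv=0.03)
```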