Optimal weight settings in locally weighted regression: A guidance through cross-validation approach
MetadataShow full item record
Locally weighted regression is a powerful tool that allows the estimation of different sets of coefficients for each location in the underlying data, challenging the assumption of stationary regression coefficients across a study region. The accuracy of LWR largely depends on how a researcher establishes the relationship across locations, which is often constructed using a weight matrix or function. This paper explores the different kernel functions used to assign weights to observations, including Gaussian, bi-square, and tri-cubic, and how the choice of weight variables and window size affects the accuracy of the estimates. We guide this choice through the cross-validation approach and show that the bi-square function outperforms the choice of other kernel functions. Our findings demonstrate that an optimal window size for LWR models depends on the cross-validation (CV) approach employed. In our empirical application, the full-sample CV guides the choice of a higher window-size case, and CV by proxy guides the choice of a lower window size. Since the CV by Proxy approach focuses on the predictive ability of the model in the vicinity of one specific point (usually a policy point/site), we note that guiding a model choice through this approach makes more intuitive sense when the aim of the researcher is to predict the outcome in one specific site (policy or target point). To identify the optimal weight variables, while we suggest exploring various combinations of weight variables, we argue that an efficient alternative is to merge all continuous variables in the dataset into a single weight variable.
General Audience Abstract
Locally weighted regression (LWR) is a statistical technique that establishes a relationship between dependent and explanatory variables, focusing primarily on data points in proximity to a specific point of interest/target point. This technique assigns varying degrees of importance to the observations that are in proximity to the target point, thereby allowing for the modeling of relationships that may exhibit spatial variability within the dataset. The accuracy of LWR largely depends on how researchers define relationships across different locations/studies, which is often done using a “weight setting”. We define weight setting as a combination of weight functions (determines how the observations around a point of interest are weighted before they enter the model), weight variables (determines proximity between the point of interest and all other observations), and window sizes (determines the number of observations that can be allowed in the local regression). To find which weight setting is an optimal one or which combination of weight functions, weight variables, and window sizes generates the lowest predictive error, researchers often employ a cross-validation (CV) approach. Cross-validation is a statistical method used to assess and validate the performance of a predictive model. It entails removing a host observation (a point of interest), predicting that point, and evaluating the accuracy of such predicted point by comparing it with its actual value. In our study, we employ two CV approaches. The first one is a full-sample CV approach, where we remove a host observation, and predict it using the full set of observations used in the given local regression. The second one is the CV by proxy approach, which uses a similar mechanism as full-sample CV to check the accuracy of the prediction, however, by focusing only on the vicinity points that share similar characteristics as a target point. We find that the bi-square function consistently outperforms the choice of Gaussian and tri-cubic weight functions, regardless of the CV approaches. However, the choice of an optimal window size in LWR models depends on the CV approach that we employ. While the full-sample CV method guides us toward the selection of a larger window size, the CV by proxy directs us toward a smaller window size. In the context of identifying the optimal weight variables, we recommend exploring various combinations of weight variables. However, we also propose an efficient alternative, which involves using all continuous variables within the dataset into a single-weight variable instead of striving to identify the best of thousands of different weight variable settings.
- Masters Theses