Generative models meet similarity search: efficient, heuristic-free and robust retrieval
The rapid growth of digital data, especially visual and textual contents, brings many challenges to the problem of finding similar data. Exact similarity search, which aims to exhaustively find all relevant items through a linear scan in a dataset, is impractical due to its high computational complexity. Approximate-nearest-neighbor (ANN) search methods, especially the Learning-to-hash or Hashing methods, provide principled approaches that balance the trade-offs between the quality of the guesses and the computational cost for web-scale databases. In this era of data explosion, it is crucial for the hashing methods to be both computationally efficient and robust to various scenarios such as when the application has noisy data or data that slightly changes over time (i.e., out-of-distribution).
This Thesis focuses on the development of practical generative learning-to-hash methods and explainable retrieval models. We first identify and discuss the various aspects where the framework of generative modeling can be used to improve the model designs and generalization of the hashing methods. Then we show that these generative hashing methods similarly enjoy several appealing empirical and theoretical properties of generative modeling. Specifically, the proposed generative hashing models generalize better with important properties such as low-sample requirement, and out-of-distribution and data-corruption robustness. Finally, in domains with structured data such as graphs, we show that the computational methods in generative modeling have an interesting utility beyond estimating the data distribution and describe a retrieval framework that can explain its decision by borrowing the algorithmic ideas developed in these methods.
Two subsets of generative hashing methods and a subset of explainable retrieval methods are proposed. For the first hashing subset, we propose a novel adversarial framework that can be easily adapted to a new problem domain and three training algorithms that learn the hash functions without several hyperparameters commonly found in the previous hashing methods. The contributions of our work include: (1) Propose novel algorithms, which are based on adversarial learning, to learn the hash functions; (2) Design computationally efficient Wasserstein-related adversarial approaches which have low computational and sample efficiency; (3) Conduct extensive experiments on several benchmark datasets in various domains, including computational advertising, and text and image retrieval, for performance evaluation. For the second hashing subset, we propose energy-based hashing solutions which can improve the generalization and robustness of existing hashing approaches. The contributions of our work for this task include: (1) Propose data-synthesis solutions to improve the generalization of existing hashing methods; (2) Propose energy-based hashing solutions which exhibit better robustness against out-of-distribution and corrupted data; (3) Conduct extensive experiments for performance evaluations on several benchmark datasets in the image retrieval domain.
Finally, for the last subset of explainable retrieval methods, we propose an optimal alignment algorithm that achieves a better similarity approximation for a pair of structured objects, such as graphs, while capturing the alignment between the nodes of the graphs to explain the similarity calculation. The contributions of our work for this task include: (1) Propose a novel optimal alignment algorithm for comparing two sets of bag-of-vectors embeddings; (2) Propose a differentiable computation to learn the parameters of the proposed optimal alignment model; (3) Conduct extensive experiments, for performance evaluation of both the similarity approximation task and the retrieval task, on several benchmark graph datasets.