
    Towards Interpretable Vision Systems

    Files
    Zhang_P_D_2017.pdf (15.59 MB, 917 downloads)
    Supporting documents (52.67 MB, 41 downloads)
    Date
    2017-12-06
    Author
    Zhang, Peng
    Abstract
    Artificial intelligence (AI) systems are booming today, and they are used both to solve new tasks and to improve performance on existing ones. However, most AI systems work in a black-box fashion, which prevents users from accessing their inner modules. This leads to two major problems: (i) users have no idea when the underlying system will fail, so it can fail abruptly without any warning or explanation, and (ii) users' limited understanding of the system makes it hard to push AI progress beyond the current state of the art.

    In this work, we address these problems along the following directions. First, we develop a failure prediction system that acts as an input filter: it raises a flag when the underlying system is likely to fail on a given input. Second, we develop a portfolio computer vision system that predicts which of several candidate vision systems will perform best on a given input. Both systems have the benefit of looking only at the input, without running the underlying vision systems, and both are applicable to any vision system. By equipping different applications with these systems, we confirm the improved performance.

    Finally, instead of identifying errors, we develop more interpretable AI systems that reveal their inner modules directly. We take two tasks as examples: word semantic matching and Visual Question Answering (VQA). In VQA, we begin with binary questions on abstract scenes and then extend to all question types on real images. In both cases, we treat attention as an important intermediate output; by explicitly forcing the systems to attend to the correct regions, we improve their correctness. For semantic matching, we build a neural network that learns the matching directly, instead of relying on the relational similarity between words. Across all of these directions, we show that by diagnosing errors and making systems more interpretable, we are able to improve the performance of current models.
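    To make the input-filter idea concrete, here is a minimal Python sketch, assuming only a learned predictor that maps an input to an estimated failure probability. All names (FilteredVisionSystem, toy_risk, toy_vision_system) are hypothetical illustrations of the concept, not the dissertation's actual implementation.

    from dataclasses import dataclass
    from typing import Any, Callable, Dict

    @dataclass
    class FilteredVisionSystem:
        """Wrap a black-box vision system with a failure predictor that inspects only the input."""
        predictor: Callable[[Any], float]  # maps an input to an estimated failure probability
        system: Callable[[Any], Any]       # the underlying (black-box) vision system
        threshold: float = 0.5             # flag inputs whose estimated risk exceeds this

        def run(self, x: Any) -> Dict[str, Any]:
            risk = self.predictor(x)
            if risk > self.threshold:
                # Raise a flag instead of letting the system fail silently downstream.
                return {"flagged": True, "risk": risk, "output": None}
            return {"flagged": False, "risk": risk, "output": self.system(x)}

    # Hypothetical stand-ins: a trivial risk model and a trivial vision system.
    def toy_risk(x: dict) -> float:
        return 0.9 if x.get("blurry") else 0.1

    def toy_vision_system(x: dict) -> str:
        return "detected: cat"

    wrapped = FilteredVisionSystem(predictor=toy_risk, system=toy_vision_system)
    print(wrapped.run({"blurry": False}))  # low risk: the vision system runs
    print(wrapped.run({"blurry": True}))   # high risk: flagged, system never runs

    Note that, as in the abstract, the wrapper never needs to run the underlying vision system to raise a flag, so the same filter could front any candidate system.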
    URI
    http://hdl.handle.net/10919/81074
    Collections
    • Doctoral Dissertations [16334]
