Browsing by Author "Xie, Zhiwu"
Now showing 1 - 20 of 43
- 3D Data Repository Features, Best Practices, and Implications for Preservation Models: Findings from a National Forum. Hardesty, Juliet; Johnson, Jennifer; Wittenberg, Jamie; Hall, Nathan; Cook, Matt; Lischer-Katz, Zack; Xie, Zhiwu; McDonald, Robert H. (2020-07). This study identifies challenges and directions for 3D/VR repository standards and practices. As 3D technologies become more affordable and accessible, academic libraries need to implement workflows, standards, and practices that support the full lifecycle of 3D data. This study invited experts across several disciplines to analyze current national repository and preservation efforts. The outlined models provide frameworks to identify features, examine workflows, and determine the implications of 3D data for current preservation models. Participants identified challenges in supporting 3D data, including intellectual property and fair use, providing repository system management beyond academic libraries, and seeking guidance outside of academia for workflow models.
- Archiving the Relaxed Consistency Web. Xie, Zhiwu; Van de Sompel, Herbert; Liu, Jinyang; van Reenen, Johann; Jordan, Ramiro (ACM, 2013). The historical, cultural, and intellectual importance of archiving the web has been widely recognized. Today, all countries with a high Internet penetration rate have established high-profile archiving initiatives to crawl and archive fast-disappearing web content for long-term use. As web technologies evolve, established web archiving techniques face challenges. This paper focuses on the potential impact of relaxed consistency web design on crawler-driven web archiving. Relaxed consistency websites may disseminate, albeit ephemerally, inaccurate and even contradictory information. If captured and preserved in web archives as historical records, such information will degrade the overall archival quality. To assess the extent of such quality degradation, we build a simplified feed-following application and simulate its operation with synthetic workloads. The results indicate that a non-trivial portion of a relaxed consistency web archive may contain observable inconsistency, and the inconsistency window may extend significantly longer than that observed at the data store. We discuss the nature of such quality degradation and propose a few possible remedies.
- Are Repositories Impeding Big Data Reuse? Xie, Zhiwu; Galad, Andrej; Chen, Yinlin; Fox, Edward A. (Virginia Tech, 2016-06-14). In this intentionally provocative presentation, we question the scalability of popular digital repositories and whether they are suitable for big data reuse. Are the layers of API these repositories have painted over file system primitives necessary? How essential is it for the repository to insist on being the sole manager of the content, arranging files in ways that prevent access other than through its own APIs? We explore these questions from the perspective of big data reuse, and describe controlled reuse experiments against Fedora 4 to evaluate the cost of these practices.
- Big Data Processing in the Cloud: a Hydra/Sufia Experience. Brittle, Collin; Xie, Zhiwu (2014-06-10). Presentation video available at https://connectpro.helsinki.fi/p1txjdy74ts/. This presentation addresses the challenge of processing big data in a cloud-based data repository. Using the Hydra Project’s Hydra and Sufia Ruby gems and working with the Hydra community, we created a special repository for the project and set up background jobs. Our approach is to create the metadata with these jobs, which are distributed across multiple computing cores. This allows us to scale our infrastructure out on an as-needed basis and decouples automatic metadata creation from the response times seen by the user. While the metadata is not immediately available after ingestion, the object itself is. By distributing the jobs, we can compute complex properties without impacting the repository server. Hydra and Sufia gave us a head start by providing a simple self-deposit repository, complete with background job support via Redis and Resque.
- The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle. Wang, Xinyue; Xie, Zhiwu (ACM, 2020-08). The WARC file format is widely used by web archives to preserve collected web content for future use. With the rapid growth of web archives and the increasing interest in reusing these archives as big data sources for statistical and analytical research, the speed of turning these data into insights becomes critical. In this paper we show that the WARC format carries significant performance penalties for batch processing workloads. We trace the root cause of these penalties to its data structure, encoding, and addressing method. We then run controlled experiments to illustrate how severe these problems can be. Indeed, a performance gain of one to two orders of magnitude can be achieved simply by reformatting WARC files into Parquet or Avro formats. While these results do not necessarily constitute an endorsement of Avro or Parquet, the time has come for the web archiving community to consider replacing WARC with more efficient web archival formats.
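To make the reformatting idea above concrete, here is a minimal sketch, assuming the warcio and pyarrow libraries, that rewrites the response records of a WARC file as a Parquet table. The selected columns and file paths are illustrative choices, not the authors' pipeline.

```python
# Minimal sketch: convert WARC response records into a Parquet table.
# Assumes warcio and pyarrow are installed; column choices and paths are illustrative.
from warcio.archiveiterator import ArchiveIterator
import pyarrow as pa
import pyarrow.parquet as pq

def warc_to_parquet(warc_path, parquet_path):
    uris, dates, payloads = [], [], []
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            uris.append(record.rec_headers.get_header("WARC-Target-URI"))
            dates.append(record.rec_headers.get_header("WARC-Date"))
            payloads.append(record.content_stream().read())
    table = pa.table({"uri": uris, "date": dates, "payload": payloads})
    pq.write_table(table, parquet_path)  # columnar layout enables selective column reads

warc_to_parquet("example.warc.gz", "example.parquet")
```

Because Parquet stores each column contiguously, analyses that only touch URIs or timestamps can skip the (large) payload column entirely, which is one source of the batch-processing speedup the paper reports.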
- Clustering. Xie, Zhiwu (2015-06-11). This presentation is part of a panel, Fedora Technical Working Group - Assessment of Fedora 4, at Open Repositories 2015.
- Deep Learning Approach for Cell Nuclear Pore Detection and Quantification over High Resolution 3D Data. He, Chongyu (Virginia Tech, 2023-12-21). The intricate task of segmenting and quantifying cell nuclear pores in high-resolution 3D microscopy data is critical for cellular biology and disease research. This thesis introduces a deep learning pipeline crafted to automate the segmentation and quantification of nuclear pores from high-resolution 3D cell organelle images. Our aim is to refine computational methods capable of handling the data's complexity and size, thus improving accuracy and reducing manual labor in biological image analysis. The developed pipeline incorporates data preprocessing, augmentation strategies, random block sampling, and a three-stage post-processing algorithm. It utilizes a 3D U-Net with a VGG-16 backbone, optimized through cyclical data augmentation and random block sampling to tackle the challenges posed by limited labeled data and the processing of large-scale 3D images. The pipeline has demonstrated its capability to effectively learn and predict nuclear pore structures, achieving improvements in validation metrics compared to baseline models. Our experiments suggest that cyclical augmentation helps prevent overfitting, and random block sampling contributes to managing data imbalance. The post-processing phase successfully automates the quantification of nuclear pores without the need for manual intervention. The proposed pipeline offers an efficient and scalable approach to segmenting and quantifying nuclear pores in 3D microscopy images. Despite the ongoing challenges of computational intensity and data volume, the techniques developed in this study provide insights into the automation of complex biological image analysis tasks, with potential applications extending beyond the detection of nuclear pores.
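As an illustration of the random block sampling step mentioned in this abstract, the following NumPy sketch draws random sub-volumes from a large 3D image and its label mask. The block size and the synthetic data are assumptions for demonstration, not the thesis pipeline.

```python
# Illustrative sketch of random block sampling for 3D patch-based training.
# Block size and the toy volume/mask below are assumptions, not the thesis setup.
import numpy as np

def sample_block(volume, labels, block=(64, 64, 64), rng=np.random.default_rng()):
    """Draw one random sub-volume and its label mask from a large 3D image."""
    z, y, x = (rng.integers(0, volume.shape[i] - block[i] + 1) for i in range(3))
    sl = (slice(z, z + block[0]), slice(y, y + block[1]), slice(x, x + block[2]))
    return volume[sl], labels[sl]

# Toy usage: a synthetic 256^3 volume with a sparse binary mask.
vol = np.random.rand(256, 256, 256).astype(np.float32)
mask = (np.random.rand(256, 256, 256) > 0.999).astype(np.uint8)
patch, patch_mask = sample_block(vol, mask)
print(patch.shape, float(patch_mask.mean()))
```

Sampling fixed-size blocks keeps GPU memory bounded regardless of the full image size, and drawing many blocks per volume is one way to expose the model to rare foreground voxels more often.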
- Developing Library Strategy for 3D and Virtual Reality Collection Development and Reuse: An IMLS National Digital Platform Project. Hall, Nathan; Hardesty, Juliet; Cook, Robert; McDonald, Robert H.; Lischer-Katz, Zack; Wittenberg, Jamie; Carlisle, Tara; Johnson, Jennifer; Griffin, Julie; Xie, Zhiwu; Ogier, Andrea (2018). These are the preliminary and full proposals for an IMLS grant to host three national forums of invited experts to support library adoption of 3D and virtual reality (VR) services, and to develop a white paper (to be added here in late 2018). The forums were hosted by Virginia Tech University Libraries, Indiana University Libraries, and the University of Oklahoma Libraries. Each forum covered a different 3D and VR theme: content creation and publishing, visualization and analysis, and repository practice and standards. Lower costs and greater computational power have made 3D and VR technologies financially realistic for a broad variety of institutions. Many academic libraries have developed archives for other forms of research data, but there is an absence of standards and best practices for producing, managing, and preserving 3D and VR content. This gap is an information management problem suited to the strengths of libraries, and addressing it can benefit librarians and researchers alike across institutions.
- DLA: Who We Are and What We Do. McMillan, Gail; Gilbertson, Keith; Hall, Nathan; Lawrence, Anne S.; Weeks, Kimberli; Wills, D. Jane; Xie, Zhiwu (2012-05-24). In Service Day (ISD) 2012 presentation about the Digital Library and Archives (DLA).
- Evaluating Cost of Cloud Execution in a Data Repository. Xie, Zhiwu; Chen, Yinlin; Griffin, Julie; Walters, Tyler (ACM, 2016-06). In this paper, we use a set of controlled experiments to benchmark the cost associated with the cloud execution of typical repository functions such as ingestion, fixity checking, and heavy data processing. We focus on the repository service pattern in which content is explicitly stored away from where it is processed. We measured the processing speed and unit cost of each scenario using a large sensor dataset and Amazon Web Services (AWS). The initial results reveal three distinct cost patterns: 1) spending more can buy up to proportionally faster service; 2) more money does not necessarily buy better performance; and 3) sometimes spending less is also faster. Further investigation into these performance and cost patterns will help repositories form a more effective operation strategy.
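The cost patterns above amount to comparing unit cost (dollars per unit of data processed) across execution scenarios. The toy calculation below illustrates that comparison with made-up prices and throughputs rather than actual AWS figures.

```python
# Back-of-the-envelope sketch of the unit-cost comparison described above.
# Instance names, hourly prices, and throughputs are placeholders, not AWS quotes.
scenarios = {
    "small-instance":  {"usd_per_hour": 0.10, "gb_per_hour": 20.0},
    "large-instance":  {"usd_per_hour": 0.80, "gb_per_hour": 200.0},
    "xlarge-instance": {"usd_per_hour": 1.60, "gb_per_hour": 220.0},
}

for name, s in scenarios.items():
    usd_per_gb = s["usd_per_hour"] / s["gb_per_hour"]
    print(f"{name}: {usd_per_gb:.4f} USD per GB processed")
# Comparing USD per GB across scenarios surfaces the three patterns noted in the
# abstract: proportional speedup, diminishing returns, and cheaper-yet-faster options.
```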
- Event-related Collections Understanding and Services. Li, Liuqing (Virginia Tech, 2020-03-18). Event-related collections, including both tweets and webpages, have valuable information, and are worth exploring in interdisciplinary research and education. Unfortunately, such data is noisy, so this variety of information has not been adequately exploited. Further, for better understanding, more knowledge hidden behind events needs to be unearthed. Regarding these collections, different communities may have different requirements in particular scenarios. Some may need relatively clean datasets for data exploration and data mining. Social researchers require preprocessing of information so they can conduct analyses. The general public is interested in overall descriptions of events. However, few systems, tools, or methods exist to support the flexible use of event-related collections. In this research, we propose a new, integrated system to process and analyze event-related collections at different levels (i.e., data, information, and knowledge). It also provides various services and covers the most important stages in a system pipeline, including collection development, curation, analysis, integration, and visualization. Firstly, we propose a query likelihood model with pre-query design and post-query expansion to rank a webpage corpus by query generation probability, and retrieve relevant webpages from event-related tweet collections. We further preserve webpage data into WARC files and enrich original tweets with webpages in JSON format. As an application of data management, we conduct an empirical study of the embedded URLs in tweets based on collection development and data curation techniques. Secondly, we develop TwiRole, an integrated model for 3-way user classification on Twitter, which detects brand-related, female-related, and male-related tweeters through multiple features with both machine learning (i.e., random forest classifier) and deep learning (i.e., an 18-layer ResNet) techniques. As guidance to user-centered social research at the information level, we combine TwiRole with a pre-trained recurrent neural network-based emotion detection model, and carry out tweeting pattern analyses on disaster-related collections. Finally, we propose a tweet-guided multi-document summarization (TMDS) model, which generates summaries of the event-related collections by using tweets associated with those events. The TMDS model also considers three aspects of named entities (i.e., importance, relatedness, and diversity) as well as topics, to score sentences in webpages, and then rank selected relevant sentences in proper order for summarization. The entire system is realized using many technologies, such as collection development, natural language processing, machine learning, and deep learning. For each part, comprehensive evaluations are carried out that confirm the effectiveness and accuracy of our proposed approaches. Regarding broader impact, the outcomes proposed in our study can be easily adopted or extended for further event analyses and service development.
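As a rough illustration of the query likelihood ranking mentioned in this abstract, the sketch below scores documents by the probability of generating the query from each document's language model, using Dirichlet smoothing. The smoothing parameter, tokenization, and toy corpus are assumptions, not the dissertation's exact setup.

```python
# Sketch of query-likelihood ranking with a Dirichlet-smoothed unigram model.
# mu, tokenization, and the toy corpus are illustrative assumptions.
import math
from collections import Counter

def score(query_terms, doc_terms, collection_counts, collection_len, mu=2000.0):
    """log P(query | document) under a Dirichlet-smoothed language model."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_p = 0.0
    for t in query_terms:
        p_coll = collection_counts.get(t, 0) / collection_len
        log_p += math.log((doc_counts.get(t, 0) + mu * p_coll + 1e-12) / (doc_len + mu))
    return log_p

docs = {"d1": "hurricane harvey houston flooding".split(),
        "d2": "solar eclipse viewing safety".split()}
coll = Counter(t for d in docs.values() for t in d)
coll_len = sum(coll.values())
query = "hurricane flooding".split()
print(sorted(docs, key=lambda d: score(query, docs[d], coll, coll_len), reverse=True))
```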
- Facilitate Cross-Repository Big Data Discovery and Reuse. Xie, Zhiwu (Virginia Tech, 2013-03-13). Researchers have accumulated large amounts of observational, experimental, and simulation data. Much effort has been made to collect, curate, preserve, and provide open access to them, but putting the data online is only the start. Coined by Jim Gray as the fourth paradigm, data-intensive science strives to uncover hidden patterns and correlations across research topics and disciplines by aggregating and cross-interrogating these data silos. The productivity of e-Research may be much improved if we can provide researchers with fast, easy, and cost-effective methods to discover and reuse these datasets in an ad hoc and explorative manner.
- FishTraits version 2: integrating ecological, biogeographic and bibliographic information. Xie, Zhiwu; Frimpong, Emmanuel A.; Lee, Sunshin (ACM, 2013-07-22). In this paper we describe the new development of FishTraits. Originating from an ecological database that documents and consolidates more than 100 traits for 809 fish species, the new version focuses on integrating these trait data with bibliographic and biogeographic information. We explain the overall design as well as the implementation details.
- Improving scalability by self-archiving. Xie, Zhiwu; Liu, Jinyang; Van de Sompel, Herbert; van Reenen, Johann; Jordan, Ramiro (ACM, 2011-06-13). The newer generation of web browsers supports client-side databases, making it possible to run the full web application stack entirely in the web client. Still, the server-side database is indispensable as the central hub for exchanging persistent data between web clients. Assuming this characterization, we propose a novel web application framework in which the server archives its database states at predefined periods and then makes them available on the web. The clients then use these archives to synchronize their local databases. Although the main purpose is to reduce the database scalability bottleneck, this approach also promotes self-archiving and can be used for time traveling. We discuss the consistency properties provided by this framework, as well as the tradeoffs imposed.
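A minimal sketch of the archive-and-synchronize idea follows, assuming a simple file-based snapshot store: the server periodically writes immutable, timestamped snapshots of its database state, and each client replays any snapshots newer than its last sync point. The file layout, naming scheme, and merge step are illustrative assumptions, not the paper's implementation.

```python
# Sketch: server archives database states; clients sync from the archived snapshots.
# File layout, naming, and the dict-merge "apply" step are illustrative assumptions.
import glob, json, os, time

ARCHIVE_DIR = "db_snapshots"

def archive_state(state: dict) -> str:
    """Server side: write the current state as a timestamped, immutable snapshot."""
    os.makedirs(ARCHIVE_DIR, exist_ok=True)
    path = os.path.join(ARCHIVE_DIR, f"snapshot_{int(time.time())}.json")
    with open(path, "w") as f:
        json.dump(state, f)
    return path

def sync_client(local_state: dict, last_sync: int) -> int:
    """Client side: merge every snapshot newer than last_sync into the local database."""
    for path in sorted(glob.glob(os.path.join(ARCHIVE_DIR, "snapshot_*.json"))):
        ts = int(os.path.basename(path).split("_")[1].split(".")[0])
        if ts > last_sync:
            with open(path) as f:
                local_state.update(json.load(f))
            last_sync = ts
    return last_sync

archive_state({"articles": {"1": "hello"}})
local, cursor = {}, 0
cursor = sync_client(local, cursor)
print(local, cursor)
```

Because snapshots are static files, they can be cached and replicated like any other web resource, which is where the scalability gain comes from; the cost is that clients may lag behind the server by up to one archiving period.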
- The Institutional Repository's Role in Preserving Research Data. Xie, Zhiwu; McMillan, Gail; Walters, Tyler (Virginia Tech, 2012-07-25). In recent years, many funding agencies have started to require long-term preservation of and open access to research data. While most research universities already run their own institutional repositories (IR), it is not clear what role the IR can play in managing these data. Unlike the textual and even multimedia content currently archived by the conventional IR, research data are much more diverse in terms of format, metadata, storage, rendering, and access requirements. The differences between geospatial data, astronomical observation data, DNA sequencing data, and computational fluid dynamics simulation data can be so large as to deserve their own disciplinary data repositories. A disciplinary repository can customize its structure and functionality for a specific type of data, a luxury not available to the general-purpose IR. On the other hand, the IR is uniquely positioned to manage research data. The university provides the IT infrastructure where most of the data are initially generated, processed, stored, and managed. As part of that infrastructure, the IR usually presents the lowest migration barrier and the cheapest cost for data created within the same institution. To meet these data management challenges, we must therefore clearly define the core functionality an IR must provide during the lifecycle of research data, which may include:
  - closely integrate the IR with the university's IT infrastructure to allow easy deposit and access control;
  - provide the baseline storage needs, which may be further differentiated by usage pattern to lower the cost;
  - act as a metadata hub that not only can understand various disciplinary metadata, but can also translate them into more widely understood terms for easy discovery and access;
  - facilitate reuse and preservation by at least maintaining the preservation metadata that document the environment where the data originally lived;
  - provide programming interfaces to facilitate data visualization, presentation, and usage from external services;
  - provide data exchange interfaces to various disciplinary data repositories.
  Virginia Tech is working towards building its IR, VTechWorks, as an exemplary general-purpose repository that fulfills these data management roles.
- Large Web Archive Collection Infrastructure and Services. Wang, Xinyue (Virginia Tech, 2023-01-20). The web has evolved to be the primary carrier of human knowledge during the information age. The ephemeral nature of much web content makes web knowledge preservation vital in preserving human knowledge and memories. Web archives are created to preserve the current web and make it available for future reuse. A growing number of web archive initiatives are actively engaging in web archiving activities. Web archiving standards like WARC, for formatted storage, have been established to standardize the preservation of web archive data. In addition to its preservation purpose, web archive data is also used as a source for research and for lost information recovery. However, the reuse of web archive data is inherently challenging because of the scale of the data and the big data tools required to serve and analyze web archive data efficiently. In this research, we propose to build web archive infrastructure that can support efficient and scalable web archive reuse with big data formats like Parquet, enabling more efficient quantitative data analysis and browsing services. On top of the Hadoop big data processing platform, with components like Apache Spark and HBase, we propose to replace the WARC (web archive) data format with the columnar data format Parquet to facilitate more efficient reuse. Such a columnar data format can provide the same features as WARC for long-term preservation. In addition, the columnar data format introduces the potential for better computational efficiency and data reuse flexibility. The experiments show that this proposed design can significantly improve quantitative data analysis tasks for common web archive data usage. This design can also serve web archive data for a web browsing service. Unlike the conventional web hosting design for large data, this design primarily works on top of the raw large data in file systems to provide a hybrid environment around web archive reuse. In addition to standard web archive data, we also integrate Twitter data into our design as part of web archive resources. Twitter is a prominent source of data for researchers in a variety of fields and an integral element of the web's history. However, Twitter data is typically collected through non-standardized tools for different collections. We aggregate the Twitter data from different sources and integrate it into the suggested design for reuse. We are able to greatly increase the processing performance of workloads around social media data by overcoming the data loading bottleneck with a web-archive-like Parquet data format.
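For a sense of what batch reuse over a Parquet-formatted archive might look like, here is an illustrative PySpark sketch that counts captures per domain per month. The column names mirror the WARC-to-Parquet sketch earlier on this page and, like the HDFS path, are assumptions rather than the dissertation's actual schema.

```python
# Illustrative batch analysis over a Parquet-formatted web archive with PySpark.
# Column names (uri, date, payload) and the input path are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("warc-parquet-analysis").getOrCreate()

pages = spark.read.parquet("hdfs:///webarchive/pages.parquet")

# Count captures per domain per month; with a columnar layout only the uri and
# date columns are read from disk, which is where much of the speedup comes from.
by_domain = (pages
             .withColumn("domain", F.regexp_extract("uri", r"https?://([^/]+)/?", 1))
             .withColumn("month", F.substring("date", 1, 7))
             .groupBy("domain", "month")
             .count()
             .orderBy(F.desc("count")))
by_domain.show(20, truncate=False)
```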
- Librarian-in-the-Loop Deep Learning to Curate Very Large Biomedical Image Datasets. Xie, Zhiwu; Chen, Yinlin (2024-02-01). We present a research data management project in which librarians from the University of California, Riverside and Virginia Tech are deeply embedded in a research team at the Yale School of Medicine to directly answer specific research questions by applying AI/deep learning techniques to very large biomedical images. Leveraging library resources and expertise, we have developed a prototype pipeline that identifies nuclear pores in whole-cell images captured at 8 nanometer resolution by a cutting-edge microscope, in the hope of revealing the cellular mechanism of one type of epilepsy and autism. This project exemplifies our data management approach, which strives to engage at much earlier stages of research, e.g., even during ideation and data collection, instead of waiting until most research activities are completed to "consult" or "advise" on very general questions about data storage or preservation. It also highlights the importance of non-generative AI approaches, which have already been widely used as research tools in a much more mature manner.
- Nearline Web Archiving. Xie, Zhiwu; Nayyar, Krati; Fox, Edward A. (2016-06-23). In this paper, we propose a modified approach to real-time transactional web archiving. It leverages the web caching infrastructure that is already prevalent on web servers. Instead of archiving web content at HTTP transaction time, in our approach the archiving happens when a cached copy expires and is about to be expunged. Before deletion, all expired cache copies are combined and sent to the web archive in small batches. Since the cache is purged at a much lower frequency than HTTP transactions occur, the archival workload is also much lower than that of transactional archiving. To further decrease the processing load at the origin server, archival copy deduplication is carried out at the archive instead of at the origin server. It is crucial to note that the cache purging process is separate from the processes that serve HTTP requests; it can be, and usually is, set to a lower priority. Archiving therefore occurs only when the server is not busy fulfilling its more mission-critical tasks, which is much less disruptive to the origin server. This approach, however, does not guarantee that the freshest copy is archived, although the cache purging policy may be adjusted to bound the freshness of the archive.
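The sketch below illustrates the nearline batching idea described in the abstract: a hook invoked by the cache purger accumulates expired copies and ships them to an archive endpoint in small batches. The endpoint URL, batch size, and JSON payload shape are hypothetical, not the paper's protocol.

```python
# Sketch: batch expired cache entries and push them to a web archive endpoint.
# ARCHIVE_ENDPOINT, BATCH_SIZE, and the payload shape are hypothetical assumptions.
import json
import urllib.request

ARCHIVE_ENDPOINT = "https://archive.example.org/ingest"  # hypothetical endpoint
BATCH_SIZE = 50
_pending = []

def on_cache_expire(uri: str, body: bytes, fetched_at: str) -> None:
    """Called by the cache purger just before an expired entry is expunged."""
    _pending.append({"uri": uri, "fetched_at": fetched_at,
                     "body": body.decode("utf-8", errors="replace")})
    if len(_pending) >= BATCH_SIZE:
        flush()

def flush() -> None:
    """Ship the accumulated batch to the archive; runs at low priority."""
    if not _pending:
        return
    data = json.dumps(_pending).encode("utf-8")
    req = urllib.request.Request(ARCHIVE_ENDPOINT, data=data,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)  # deduplication happens on the archive side
    _pending.clear()
```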
- Newman Library Pecha Kucha: Digital Library and Archives. Hall, Nathan; Lawrence, Anne S.; Xie, Zhiwu (2012-05-24). 2012 Pecha Kucha about the Digital Library and Archives (DLA) at the Virginia Tech Libraries In Service Day.
- Numerically Trained Ultrasound AI for Monitoring Tool Degradation. Jin, Yuqi; Wang, Xinyue; Fox, Edward A.; Xie, Zhiwu; Neogi, Arup; Mishra, Rajiv S.; Wang, Tianhao (Wiley, 2022-01-13). Monitoring tool degradation during manufacturing can ensure product accuracy and reliability. However, due to variations in degradation conditions and complexity in signal analysis, effective and broadly applicable monitoring is still challenging to achieve. Herein, a novel monitoring method using ultrasound signals augmented with a numerically trained machine learning technique is reported to monitor the wear condition of friction stir welding and processing tools. Ultrasonic signals travel axially inside the tools, and even minor tool wear will change the time and amplitude of the reflected signal. An artificial intelligence (AI) algorithm is selected as a suitable referee to identify the small variations in the tool conditions based on the reflected ultrasound signals. To properly train the AI referee, a human-error-free data bank using numerical simulation is generated. The simulation models the experimental conditions with high fidelity and can provide comparable ultrasound signals. As a result, the trained AI model can recognize the tool wear from real experiments with subwavelength accuracy prediction of the worn amount on the tool pins.
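To illustrate the simulation-trained approach at a toy scale, the sketch below generates synthetic reflected signals whose echo delay shifts with wear and fits a regressor to predict wear from the waveform. The signal model, noise level, and choice of regressor are assumptions for demonstration, not the paper's setup.

```python
# Toy sketch: train a model on simulated ultrasound echoes to predict tool wear.
# The echo/delay model, noise level, and regressor are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256)

def reflected_signal(wear_mm):
    delay = 0.5 - 0.02 * wear_mm            # assumed relation: shorter pin, earlier echo
    echo = np.exp(-((t - delay) ** 2) / 1e-4)
    return echo + 0.02 * rng.standard_normal(t.size)

wear = rng.uniform(0.0, 2.0, size=500)      # millimetres of simulated wear
X = np.stack([reflected_signal(w) for w in wear])
model = RandomForestRegressor(n_estimators=100).fit(X, wear)

test = reflected_signal(1.2)
print("predicted wear (mm):", model.predict(test.reshape(1, -1))[0])
```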