Alex Kinnaman 0:01
Hello everyone, and thank you for listening to our talk, Preserving DH Projects: Creating an Environment for Emulation. My name is Alex Kinnaman, and I'm the Digital Preservation Coordinator at Virginia Tech.

Corinne Guimont 0:12
And I'm Corinne Guimont, the Digital Scholarship Coordinator at Virginia Tech. Next. As many of you know, digital humanities projects are complex. Each project can contain a wide variety of content, data, code, and more. Every project is different and can pose a new challenge. The variety, while exciting in the world of digital scholarship, does make complete, perfect preservation a challenge. Here at Virginia Tech, we currently have about a dozen hosted and active DH projects in the Library, with several others hosted elsewhere. While all of these projects are web archived, we do not currently have a set preservation plan or strategy for this work. With these challenges in mind, Alex and I set out to come up with a basic strategy for preservation that we could apply to different projects. We decided to approach preservation for DH by breaking each project down into its different components, so it could be packaged and then reproduced or emulated. We wanted to create a workflow that, at minimum, preserved the content of the project while also maintaining its integrity. In this presentation, we will discuss how we went about doing this, the challenges we faced in doing so, and what our next steps are. Next.

Our ultimate goals for this project were to identify some preservation best practices that could be applied across a variety of DH projects, especially those on common platforms. For example, we have multiple projects hosted in Omeka. We wanted to minimize the workload to a reasonable number of hours per project, so we could ensure the work could be done for multiple projects. And of course, we wanted to create a preservation package suitable for reproducibility or emulation. As an added bonus, since a lot of the DH work falls within our Library's publishing unit, where I am, I wanted to create language or a policy that could fit into a larger preservation plan that I'm currently working on for all of our Virginia Tech Publishing publications.

So our strategy was to first look at existing, hosted, active projects and identify a test case. We then wanted to define the different project components and their file types, locate any available metadata, and create new metadata if necessary. We also wanted to look at what documentation existed and then think about what other documentation would be needed to reproduce the project. And then ultimately, we wanted to retest the whole process on another project.

The project we ultimately chose is Redlining Virginia. This is an online exhibit built in Omeka by a researcher in the History Department, LaDale Winling. He looks at maps from the Home Owners' Loan Corporation to see how different areas in major cities throughout Virginia were redlined. The work is part of a larger project, Mapping Inequality, at the University of Richmond, that LaDale is a part of. One of his undergraduate history classes compiled the project for a physical exhibit that was in the Library in the winter of 2016 to 2017. Then he had a graduate research assistant work on organizing all the content, including maps, text, videos, and the post-it notes that were an interactive piece of the exhibit, shown in the upper corner of the slide.
And then ultimately, another graduate assistant worked on putting all of this into Omeka. So this was about a year-and-a-half process. We chose this specific project for a few reasons. First of all, it's a smaller project, with about 110 items, making it a bit easier for us to get a handle on what work we're dealing with. It's also built on a fairly common platform. As I mentioned earlier, we have at least four projects currently hosted by the Library built on Omeka, and several other researchers on campus use it regularly. LaDale had already packaged many of the data files and uploaded them into VTechData, our institutional data repository, so we had easy access to them. Some of the other content lives in other places, such as photos from the physical exhibit in VTechWorks, our institutional repository. And with all of these existing pieces, we felt this project could reasonably be reconstructed or migrated. As an added perk, with this being a project that I've been involved with from the beginning, it's fairly easy for me to reach out to LaDale with any questions. Next.

Alex Kinnaman 4:15
So once we were really digging into this project and its various components, I wanted to get a sense of the file formats we were looking at, to see if there were any special considerations that needed to be made in terms of preservation. This list is just to give you a basic idea of what we were working with. You can see there are some common file formats for image, text, and video. But more important were the data files that generate the georectified maps, which are a primary component of Redlining Virginia. We also created new components, namely the preservation profile and the Digital Exhibits Metadata Application Profile, or MAP, both at the collection level, as well as generating metadata for the Shapefile bundles at the item level. And of course, documentation on all components, including information on opening and reuse. This was especially important for the various data files listed.

So not only were we working with different file formats and component types, but we also had to piece together exactly where the known components were actually accessible, as well as where new components appeared. We found different information on the Redlining Virginia website, its parent project Mapping Inequality, both of our institutional repositories, VTechWorks and VTechData, and from the researcher himself, specifically to access some original videos that are only available on YouTube. You can see that there were a lot more components than we expected, and a lot more locations. Some of this information overlapped, and other bits of information did not.

So the workflow that we designed followed this basic outline over the course of about three months. First, we located all of the components, collected as much existing metadata and documentation as we could, and compiled it in a Team Drive. With everything in one spot, we then filled out the preservation profile, which includes preservation-specific information, as well as contact information, degree of homogeneity, various access locations, and so on. And we also updated the Digital Exhibits MAP. Then I got to do some heavy lifting with the file formats. We did migrate some HTML to TXT, and the MP4s to MOVs, which didn't take too long because there wasn't too much content.
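[To give a sense of what this kind of small-batch migration can look like in practice, here is a minimal sketch in Python; the speakers don't name their actual tooling, so the paths, the BeautifulSoup dependency, and the use of ffmpeg are all assumptions for illustration.]

```python
"""Hypothetical sketch of the two small migrations described above:
HTML -> TXT and MP4 -> MOV. Paths and tools are illustrative only,
not the presenters' actual workflow."""
import pathlib
import subprocess

from bs4 import BeautifulSoup  # pip install beautifulsoup4

components = pathlib.Path("redlining-virginia/components")  # hypothetical layout

# HTML -> TXT: strip the markup but keep the readable text content.
for html_file in components.rglob("*.html"):
    soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")
    html_file.with_suffix(".txt").write_text(soup.get_text("\n"), encoding="utf-8")

# MP4 -> MOV: rewrap the existing streams into a QuickTime container
# without re-encoding, so no quality is lost (ffmpeg must be on PATH;
# -n refuses to overwrite any existing output).
for mp4 in components.rglob("*.mp4"):
    subprocess.run(
        ["ffmpeg", "-n", "-i", str(mp4), "-c", "copy", str(mp4.with_suffix(".mov"))],
        check=True,
    )
```

[Because `-c copy` only changes the container, a rewrap like this is fast and lossless; re-encoding would only be needed if the target format required different codecs.]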
And we opted not to migrate the JPEG files to TIFFs, because there were over 100 objects and the goal of this project was to be reasonable. Fortunately for us, though, the data files for the interactive maps were already in the ideal preservation format according to the most recent Library of Congress Recommended Formats Statement, so major migrations didn't need to happen there. Next, I looked into metadata schemas for the items without metadata, and the content with the most need was the Shapefile bundles. There were a few standards I looked at, and we opted for a simpler schema that was more reasonable to fill out, given that we had eight different maps and the longest standard ran 91 pages. Simultaneously, as we were testing out the metadata, I wanted to test how to open each of these components and then document any successful outcomes. We did find several open-source tools that could open even the data files, both individually and all pieced together, largely GIS software. Finally, we documented these workflows for reuse on other projects. This included both documentation that goes into the preservation package, as well as an outline that will be used to recommend project-building guidelines for new projects.

So this is a breakdown of the time spent collating all of the project components, broken out by general tasks. Ultimately, we met about weekly or bi-weekly for three months, and spent approximately 33 hours total between the two of us to get to this point, where we have a preservation package ready.

This project did come with its challenges, some anticipated and others not. The physical exhibit featured in Virginia Tech Libraries that we mentioned earlier was actually documented both in VTechWorks and on the website itself, which archived, respectively, the images of the physical exhibit and the post-it comments left by attendees (featured to the right there). Fortunately, both of them had some item-level metadata we could use. But because they were in different locations, we weren't sure how to portray that initially. As we mentioned before, nothing existed in a single location, so we had to do a lot of digging and tracking to ensure that we had all of the components without duplicating efforts. The maps, which we knew existed, ended up being a challenge given their complex file bundles and lack of metadata. This was particularly relevant, again, to the Shapefile bundles, which really do require some metadata and documentation if we want these items to be reused and reconstructed later. Finally, while we wanted to try to do this project without the aid of the researcher, which is the likely case for most of us working on DH projects, we can't download videos from YouTube (legally), so we did have to go to the researcher for those.

Despite our challenges, we were successful: we have a preservation package and comprehensive documentation put together, and we are very excited about that. Some other outcomes: we realized there is a need for a preservation professional, a metadata professional, or other experts in order to get a sense of the file formats and ensure long-term access. We also realized that there is much more to consider than what is present on just the website, or whatever platform the project is on. In our case, it was an Omeka instance, and there were various components that were present and accessible elsewhere that we needed to compile.
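[For the Shapefile bundles specifically, much of the minimal item-level metadata can be pulled straight out of the files with open-source GIS tooling. A sketch of the idea using the fiona library; the file paths and the exact fields recorded are hypothetical, not the presenters' actual script or schema.]

```python
"""Hypothetical sketch: derive baseline item-level metadata for one
Shapefile bundle using fiona, an open-source GIS library."""
import json

import fiona  # pip install fiona

SHAPEFILE = "holc_richmond/holc_richmond.shp"  # hypothetical path

with fiona.open(SHAPEFILE) as src:
    record = {
        "driver": src.driver,                        # e.g. "ESRI Shapefile"
        "crs": str(src.crs),                         # coordinate reference system
        "bounds": src.bounds,                        # (minx, miny, maxx, maxy)
        "feature_count": len(src),
        "geometry_type": src.schema["geometry"],
        "attribute_schema": dict(src.schema["properties"]),
    }

# Store the record alongside the bundle so a future user can reopen
# and reconstruct the map without guessing at its structure.
with open("holc_richmond_metadata.json", "w") as out:
    json.dump(record, out, indent=2)
```

[Any of the open-source GIS toolchains (QGIS, GDAL/OGR, geopandas) can recover the same information; the point is that the coordinate system, bounds, and attribute schema are exactly the details a future user needs to reopen the georectified maps.]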
We also struggled with how best to represent the physical exhibit images along with the post-it feedback, because they are one and the same physically but represented in different locations, with different metadata and different schemas. Ultimately, they ended up in the same folder within our preservation package file structure. Finally, we know that we need to have a level of familiarity with whatever platform the project is hosted on, so that we can automate some of the content and metadata exporting and save time.

Corinne Guimont 10:56
We also found some general outcomes that weren't really technical but are also important to consider. First off, this was a very manual process, and we could not automate everything due to the variety of content. We found that, because of this, if we know what is needed from the researcher early on, we can ask for those pieces before the project is complete, and we can ask for them in a specific, organized format. This should also help ease the process for future projects: now that we know what's needed, maybe we can talk to researchers early on, tell them we need these different components within these different standards, and work with them to get those together before we're handed the final project to preserve.

Unknown Speaker 11:37
Next.

Alex Kinnaman 11:39
Right, so I just wanted to provide a sample of the preservation package file structure as one of our successes. I won't go into the weeds, but you can see that there is a format that is separated out by components, and then a documentation folder with our newly created documentation. Pretty excited.

Corinne Guimont 12:04
But with all of that, we also found some limitations. We know that this is just one project, and other projects will have different components with new challenges. Especially given that this was a smaller and more static project, we know that we have a lot of larger, more complex projects, perhaps some built on custom platforms, and these will pose a much larger set of challenges. We also know that for this project we had access to the researcher to ask questions. Some of our other hosted projects are a bit older; the researchers might have left, and even if we were able to get in touch with them, we can't guarantee that they still have the materials we would need, given the time that has passed.

Alex Kinnaman 12:45
And separately, bandwidth was another huge limitation we faced. It was just the two of us available to do the work, and given the current state of the world, our developers and tech support were not very available. One preservation-specific limitation that we chose was to limit the migration of file formats. We were lucky that most of the content was in a preservation format, even if it wasn't the best preservation format, and we considered full migration unreasonable if we wanted to repeat this workflow on a dozen other DH projects. We also did not migrate the Omeka instance to a Library-managed platform, so access is still the responsibility of the researcher. Finally, as I mentioned earlier, we attempted to compile our package with no assistance from the researcher, but the videos are vital to the context and integrity of the project, so we needed to include them.
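[For the packaging step itself, one widely used way to turn a component folder structure like the one shown above into a fixity-checked preservation package is the BagIt format. A minimal sketch with the Library of Congress bagit-python library follows; the talk doesn't specify the packaging tool, so the folder name and bag-info fields here are assumptions.]

```python
"""Hypothetical sketch: wrap the assembled component folders in a
BagIt bag with SHA-256 manifests for fixity checking at ingest."""
import bagit  # pip install bagit  (Library of Congress bagit-python)

# make_bag moves the folder's contents into a data/ payload directory
# and writes checksum manifests plus a bag-info.txt alongside it.
bag = bagit.make_bag(
    "redlining-virginia-package",  # hypothetical package folder
    {
        "Source-Organization": "Virginia Tech University Libraries",
        "External-Description": "Preservation package for the Redlining "
                                "Virginia Omeka exhibit",
    },
    checksums=["sha256"],
)

# Before ingest, or after any move between systems, verify fixity
# against the stored manifests.
if bag.is_valid():
    print("Bag is complete and checksums match.")
```

[The checksum manifests are what let a repository confirm, years later, that every component of the package is still intact and unaltered.]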
Our immediate next step is to work with our Metadata Services department and our Digital Libraries department to ensure that we have the required metadata for ingest into our Digital Library Platform preservation system, and to consider any feedback from our colleagues. We will also be testing this workflow on another of our DH projects, examples shown to the right there, sometime next spring, to see if this workflow will work again. And finally, for me, the one that I'm most excited about is to develop more concrete levels of DH preservation, which we can now base on the amount of work it has actually taken us to preserve a DH project.

Corinne Guimont 14:16
And then, as I mentioned earlier in this presentation, I plan to integrate the documentation we've created out of this project into a much larger plan for preserving all types of publications within our Library and within our publishing unit. Next. And with that, we'd love to answer any questions you might have. Here are our emails, and a link to the project if anybody's interested in looking at it further.