0:02 Hello everyone. Thank you for listening to our talk on a cloud-based serverless microservices application for digital preservation. This is presented by Yinlin Chen and Alex Kinnaman from Virginia Tech, with support from James Tuttle, also at Virginia Tech.

0:23 Today, just a quick run-through of what we'll be chatting about: we'll give an introduction to what we're doing at Virginia Tech, a bit about our problem statement and then our solution to those problems, an overview of our entire infrastructure as well as of the performance and cost of our decisions, and finally a conclusion and future work.

0:49 Digital preservation, as we all know, combines policies, strategies, and actions that enable access to digital content over time, and digital preservation strategies and actions address content creation, integrity, and maintenance. Since content integrity includes verification methods and routine audits, we want to make sure that we can check the fixity of the content held on our preservation storage systems at regular intervals. This helps us maintain logs of fixity information, supply audits on demand, and detect any corrupt data.

1:27 This is an overview of the Virginia Tech Digital Library Platform, which we're in the process of building. You can see that it is built on several different types of microservices, and for this talk we'll be focusing specifically on the fixity service. We'll get a little more into this diagram as the presentation moves on, but this gives you an idea of the overall system we're working within.

1:54 Our issue is that we needed to move our data from on-premise servers to cloud vendors in order to reduce our maintenance obligations and significantly reduce expense. Cloud vendors such as AWS S3 and Azure Storage advertise 99.9999…% data durability for objects over a given year. However, we want to be able to verify the data in these black-box vendor systems: to confirm file integrity from our version to their version, and to get notifications when custom events occur. So our solution is to develop a cloud-based, serverless microservice application. This allows us to run large amounts of fixity creation and validation asynchronously, and it allows routine checking based on the policies we have defined in our digital libraries department. The system can scale up and down depending on the amount of content we need to move, and what's really helpful to us is that we only use the resources that are required: we don't have to buy one package and use only a few pieces of it, and we can tailor individual components. And of course, we want to make this as automatic as possible, with as little human intervention and system maintenance as we can manage. Now I'll pass it to Yinlin Chen for a deeper look at the system and its performance and cost.

3:30 The entire infrastructure is serverless; we use only AWS managed services. The infrastructure has two parts. In the first part, we implement several microservices to do the fixity work: they retrieve files from S3 and run the fixity check, implemented as three Lambda functions. In the second part, based on the policies we have implemented, we can define different rules and generate reports using AWS Athena. Finally, based on the results, we notify our users.
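[Editor's sketch] To make the checksum step concrete, here is a minimal sketch of what the middle Lambda in a pipeline like this could look like in Python: it streams an object from S3 and computes its MD5 without loading the whole file into memory. This is illustrative, not the team's actual code; the event field names `bucket`, `key`, and `computed_md5` are assumptions.

```python
import hashlib

import boto3  # AWS SDK for Python, available in the Lambda runtime

s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Compute the MD5 of an S3 object by streaming it in chunks.

    `event` is assumed to carry `bucket` and `key` from the previous
    Step Functions state; these field names are hypothetical.
    """
    obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])

    digest = hashlib.md5()
    # Stream the body in 8 MB chunks so large files fit in Lambda's memory.
    for chunk in obj["Body"].iter_chunks(chunk_size=8 * 1024 * 1024):
        digest.update(chunk)

    # Pass the computed checksum on to the validation state.
    event["computed_md5"] = digest.hexdigest()
    return event
```

A downstream validation step would then compare `computed_md5` against the MD5 recorded at ingest and write the outcome to a results bucket.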
Inside this serverless fixity workflow we have three Lambda functions, implemented as microservices, connected to each other inside an AWS Step Functions state machine. These functions work on files in an S3 bucket: first they retrieve a file from S3; once the file is retrieved, they compute its checksum (we use MD5); and finally they validate the computed checksum against the original MD5 to confirm whether the file is intact. We then save all of the results into another S3 bucket so we can further analyze this log data. We can trigger this process from the command line, or we can trigger it from the web interface.

5:30 This is our Step Functions workflow. The green boxes are functions: we retrieve the file, and if it is retrieved successfully we compute the checksum and then validate it. You can see here that we may wait four minutes, four hours, or twelve hours. This is because we have all kinds of different files: some are in the S3 Standard storage class, which we can retrieve very quickly, while others are archival files stored in S3 Glacier, which take at least twelve hours to retrieve. We support multiple scenarios, so no matter whether a file is stored in S3 Standard or in Glacier, we are able to retrieve it and do the fixity check. That is the workflow for the fixity check.

6:48 In S3 we have different storage classes, from Standard to S3 Glacier Deep Archive. Files in Standard can be retrieved very quickly, in less than four minutes; files in Glacier need twelve hours. You can also see the pricing here: files stored in Glacier are cheaper than files in Standard. So, based on the kind of file, we store it in a different S3 storage class, and we can still do the fixity check across all of these files.

7:34 This is the execution history of all the Step Functions events. Every operation has a timestamp, and you can see the total execution time; for this example it is 39 seconds. We can see every Lambda function that was executed, and we can also see all the Lambda logs through CloudWatch Logs. Every single step is recorded, so we can get all the information we need to do the analysis.

8:17 This is the other part of the system. We have different preservation rules, and for each rule we implement a Lambda function with the rule's condition inside it, connected to CloudWatch. Based on the rules, we choose the Lambda functions, and they query the data in Amazon Athena: from the log files we run queries, and based on the rule we decide whether to trigger the step function and do the fixity work. We also use Amazon SQS so we can process thousands of files at the same time, concurrently. So no matter whether today we need to do fixity for 100 files, 1,000 files, or 10,000 files, this approach supports it, and we can do fixity checks on thousands of files very quickly.

9:32 We record everything, all the fixity results, in S3, and with the AWS Athena service we can use basic SQL queries to gather all the information we need. Even for very complex scenarios, we can write a SQL query to get the report we need. You can see here that we record how long each check took, whether the fixity check matched, and when it was done; this table shows all of that information. Underneath, this is just text files stored in S3, but through Athena we can treat it like a database and run simple queries for everything we need.
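[Editor's sketch] As a rough illustration of that reporting step, the sketch below starts an Athena query from Python with boto3 and polls until it finishes. The database, table, and column names (`fixity_db`, `fixity_results`, `status`, `checked_at`) and the output bucket are hypothetical, not the project's actual schema.

```python
import time

import boto3

athena = boto3.client("athena")

# Hypothetical schema: adjust the database, table, columns, and
# output location to match your own Athena setup.
SQL = """
SELECT key, status, elapsed_seconds, checked_at
FROM fixity_results
WHERE status <> 'MATCHED'
ORDER BY checked_at DESC
"""


def run_fixity_report():
    """Run a SQL query over the fixity logs in Athena and return its rows."""
    qid = athena.start_query_execution(
        QueryString=SQL,
        QueryExecutionContext={"Database": "fixity_db"},
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
```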
10:20 This is the performance. We tested files from 3 megabytes to 800 megabytes. Our current setup is focused on image data, where the largest files are about 800 MB; in the future we will try video data, which can run into the gigabytes. You can see here that we processed over 1,000 files in just 40 seconds, so we can do this very quickly.

11:14 Then the cost. Right now our preservation data is around 500 gigabytes, and most of the processing is free to us because of the AWS free tiers, so most of it costs us nothing. We ran several tests over this data, multiple different kinds of tests over multiple rounds, and all of it cost something like $2. That is almost nothing to us, because with a serverless infrastructure, if we don't use the resources, we don't pay anything. We considered comparing this against using instances, but we didn't want to waste that kind of money, since instances are charged by the hour; we would also have had to write a program to process thousands of files concurrently, which would take a lot of work. We didn't want to spend the time or the money on that. But you can see the costs are very low for us.

12:25 Conclusion. You can see that with this approach we save a lot of time, and because it is just multiple microservices communicating with each other, our rules give us a lot of flexibility. We can tune the performance and cost of each individual microservice, and there are many combinations we can choose from and test to reduce cost and improve performance.

12:54 In the future, we will try to support more kinds of preservation data, like video; continue improving our performance and reducing costs; write reporting tools for AWS Athena; and support more different kinds of rules. Right now we support just some basic rules, but in the future we will support more kinds of rules and policies for preservation. That's it. Thank you.