PresQT Conference

PresQT: presqt.crc.nd.edu

I have just returned from PresQT, an academic conference focused on the preservation of digital data, especially scientific datasets and software. The goal is to make research reproducible so that others can check the results. I was invited to give the corporate perspective on preservation.

As I wrote several weeks ago in my post on innovation, I am not a fan of silos. Attending this conference, I found another set of silos: academic and corporate. People were discussing so many tools I had never heard of. The first day left me reeling from all the new information. I also felt WAY out of my depth, wondering why I was there, how I could contribute and how this effort could engage the corporate and law firm worlds.

I gave my presentation relatively early on the morning of the second day and afterward felt much more able to contribute to the discussion. I was warmly welcomed, good things were said about my talk, and people asked a lot of questions. That answered my question about why I was there. Part of my presentation was about content strategy (see my posts on the components of a good content management plan), but another part was about how academia and corporate environments can work together on this project to achieve better, more useful results.

Part of my point was that preservation starts with a content strategy. Preserving at the end is great, but much can be lost if you don’t start from the beginning. Of course, nobody in the corporate world, law firms (though law firms do have a good handle on records management for client files) or academia does this. Preservation is an afterthought. Content management is a huge undertaking. I hope to make a difference, but it will be small, as I am one person shouting about it into a canyon. Whatever small inroads I can make in my work will be worth it.

I learned a lot about new tools and sites. One is ReproZip, which allows people to back up an entire project “by tracing the system calls used by the experiment to automatically identify which files should be included. You can review and edit this list and the metadata before creating the final package file. Packages can be reproduced in different ways, including chroot environments, Vagrant-built virtual machines, and Docker containers; more can be added through plugins.” What this means to someone like me is that I can pack up my entire website and someone in the future will be able to unpack the whole thing and look at it in all its glory. Software, databases, widgets, text, everything is included so that the project/site/whatever can run as it did when it was posted on the web.

The difference between this and the Internet Archive’s Wayback Machine is that pieces can be missing from the Wayback Machine, because its crawler can’t access everything. The downside of ReproZip is that the creator, or someone involved in the project or site, must create the archive. The ReproZip tool comes out of the NYU Center for Data Science. A librarian, Vicky Steeves, is heavily involved in the project as a trainer and outreach coordinator. It makes my librarian heart happy.
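
To make that workflow concrete, here is a minimal sketch of the pack-and-reproduce cycle as I understand it from the ReproZip documentation, assuming ReproZip (and ReproUnzip, with Docker) is installed; the script name and package name are placeholders, not anything from the conference.

    import subprocess

    # On the machine where the project runs: trace its system calls, then pack
    # everything the trace identified (code, data, libraries) into one .rpz file.
    # "./run_experiment.sh" and "my_site.rpz" are placeholder names.
    subprocess.run(["reprozip", "trace", "./run_experiment.sh"], check=True)
    subprocess.run(["reprozip", "pack", "my_site.rpz"], check=True)

    # Later, on any machine with reprounzip and Docker: unpack the archive into
    # a container and rerun the project as it originally ran.
    subprocess.run(["reprounzip", "docker", "setup", "my_site.rpz", "unpacked"], check=True)
    subprocess.run(["reprounzip", "docker", "run", "unpacked"], check=True)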

I was also very interested in Euan Cochrane’s Wikidata project. Wikidata, funded in part by the Wikimedia Foundation, is a free and open knowledge base that can be read and edited by both humans and machines. It acts as central storage for the structured data of its Wikimedia sister projects, including Wikipedia, Wikivoyage, Wikisource, and others. As you know, I like the idea of central storage for information. I like wikis a lot, and this platform allows all kinds of people to edit and contribute. They do have reviewers to verify information.
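
As a small illustration of what “read by machines” means, the sketch below pulls one item from Wikidata’s public API; the item ID (Q42, the entry for Douglas Adams) and the fields requested are just examples, not anything tied to the PresQT project.

    import requests

    # Fetch the English label and description for one Wikidata item (Q42) via
    # the wbgetentities action of the public API at wikidata.org.
    response = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "ids": "Q42",
            "props": "labels|descriptions",
            "languages": "en",
            "format": "json",
        },
        timeout=30,
    )
    entity = response.json()["entities"]["Q42"]
    print(entity["labels"]["en"]["value"])        # human-readable name
    print(entity["descriptions"]["en"]["value"])  # one-line description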

The thing I like most about Code Ocean, aside from the name, is the look and feel. It is pretty and looks like it has a user interface that is easy to use. Code Ocean is a cloud-based executable research platform. I am sure that it works very well, too, though I can’t really speak to that as it is outside my area of expertise. I hope we can make the PresQT tool look and feel just as good.

I also think that Code Ocean could create a section that would be useful for tech companies. They could deposit their code as a safeguard against changes to their corporate structure, as a backup, and for historic preservation. So many companies with interesting tools have gone out of business as the tech industry waxes and wanes. Think about some of the early search engines. As a result we have lost that knowledge. What could be done with it if it were deposited somewhere like Code Ocean and made available once certain corporate events or actions took place?

The big coup, from my point of view, was learning about the Open Science Framework (OSF). This is essentially a streamlined but well-thought-out and useful content management system. One thing that is very intriguing is that it overlays (not sure if that is the right word) on top of cloud-based storage systems and allows users to search across them. I think this tool could solve the problem of companies using Box, Dropbox and Google Drive. I have a lot to learn about it, but I hope to become better versed and use it to its fullest extent. Stay tuned for more on OSF.
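
For anyone who wants to poke at OSF from code, here is a rough sketch that lists the storage providers attached to a public project (OSF Storage plus any connected add-ons such as Dropbox or Google Drive) through the public OSF API (v2). The project ID is a placeholder and the field names follow the JSON-API layout described in the OSF API documentation as I understand it, so treat this as an illustration rather than a recipe.

    import requests

    API = "https://api.osf.io/v2"
    NODE_ID = "abc12"  # placeholder: the short GUID of a public OSF project

    # List the storage providers attached to the project, then the top-level
    # files and folders under each one.
    providers = requests.get(f"{API}/nodes/{NODE_ID}/files/", timeout=30).json()
    for provider in providers["data"]:
        print(provider["attributes"]["provider"])  # e.g. osfstorage, dropbox
        files_url = provider["relationships"]["files"]["links"]["related"]["href"]
        for item in requests.get(files_url, timeout=30).json()["data"]:
            print("   ", item["attributes"]["name"])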

Just because I didn’t mention every single tool or presentation doesn’t mean I didn’t find them interesting or potentially useful. There was so much interesting work shown at this event that I feel somewhat overwhelmed. I hope to be able to participate further.

EVERYONE can participate in PresQT in different ways. One is to take the needs assessment. This survey will allow the team to get feedback from more people and make the tool more useful. The permanent link will be available soon and I will post it here. Personally, I had a hard time with the survey, as some of the language and subject matter was not in my wheelhouse. Still, I soldiered on and provided as much helpful feedback on behalf of my community as I could.

People can also explore the resources from the conference on OSF. More of the presentations will be posted there shortly.