May | 2017 | Jaye Lapachet

At one of the sessions during PresQT, we discussed the value of auto-selecting metadata terms. Someone brought up the problem of getting researchers, lawyers, basically anyone, to fill out DMS profiles, choose tags and complete the other tasks associated with attaching metadata. Those groups want to use metadata after the fact, when they are searching, but they don’t want to take the time to add good metadata. This means that auto-selecting metadata is at least worth talking about.

First, here is another point where academia intersects with the corporate/law firm world.

We finally got to the point of AI. We were discussing having machines read documents, assign metadata and leave the ‘fun’ work to the humans. This is a good solution, when and if it works, but what about language? Someone brought up patents, which are written to talk around the subject. If someone wants to patent a bottle, the word ‘bottle’ is never used. The question was how a machine would know the precise meaning of certain language constructions.

This came to mind when I read an article by Bob Ambrogi about Judicata, a start-up legal research system that purports to be better than WestlawNext and Lexis Advance. In the article, Ambrogi writes “It does this, he explains, by mapping the legal genome — that is, mapping the law with extreme accuracy and granularity.”

Hmm.

I am very interested in this process and how it can supplant the mundane, time-sucking work of lawyers and law firm professionals by automating processes through AI. This process, theoretically, leaves the value-added work to the humans. Ravel has done this to a certain degree. Now Judicata is claiming to be good enough at the process to rival Wexis.

The process was compared to the technology used to guide driverless cars and was more fully explained when the article continued with ‘…driverless cars require highly detailed, three-dimensional computerized maps that can pinpoint a car’s location and understand its surroundings.

Judicata has been trying to build that kind of a map for law. “This is different than what you might be thinking about given all the hype around AI,” he explains. “AI and machine learning only do as well as the data that goes into them. We’ve focused on creating better data.”’

Better data. Yes. That is what we need all around. We can’t always get it by having people fill out profiles. We need a combination of humans and machines. Humans get bored and frustrated and just want to accomplish their immediate task so they can go have a beer with friends. Humans don’t always think long term to the point where they will need to retrieve their data in the future.

I am looking at this from an access point of view. In my world it is possible to make all information accessible. We just need the right tools and the will to do it. If Judicata can do this for law, perhaps it can be done using similar technology for other disciplines as well. Granted, the law has special needs and requirements for retrieving information, but researchers need what they need in their own disciplines as well.

Perhaps presenting researchers, lawyers, academics and others with a machine-produced profile and allowing them to modify it or overlay it with their own terms would be a start to improving accessibility? Ambrogi writes “The curse is that the mapping can be only partially automated and requires a significant level of human effort.” This is absolutely true and requires money. I would love to see researchers be able to provide some effort in that area based on the papers they publish. This is my idealized view of the world where everyone shares and wants universal access to all information. In my dreams, right? Perhaps PresQT can start us down that road and Judicata can take it further.
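To make the idea of a machine-produced profile with a human overlay a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the function names `suggest_terms` and `overlay`, the tiny stopword list and the word-frequency approach are all hypothetical, and none of it reflects how PresQT, Judicata or any DMS actually works. A real system would need far better language handling, as the patent example shows.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"this", "that", "with", "from", "have", "been", "their", "will", "they"}


def suggest_terms(text, limit=5):
    """Machine step: propose candidate tags from the most frequent longer words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 3 and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(limit)]


def overlay(machine_terms, added=(), removed=()):
    """Human step: drop suggestions the person rejects, then append their own terms."""
    rejected = {term.lower() for term in removed}
    profile = [term for term in machine_terms if term not in rejected]
    for term in added:
        term = term.lower()
        if term not in profile:
            profile.append(term)
    return profile


if __name__ == "__main__":
    document = (
        "The lease agreement covers the lease term and the lease renewal. "
        "The tenant signs the agreement."
    )
    suggested = suggest_terms(document, limit=2)
    final = overlay(suggested, added=["Landlord-Tenant"], removed=["agreement"])
    print(suggested)
    print(final)
```

The point of the design is that the person never starts from a blank profile. The machine does the boring first pass, and the human only has to correct it, which is a much smaller ask than filling out a profile from scratch.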

PresQT: presqt.crc.nd.edu

I have just returned from PresQT. This is an academic conference focused on preservation of digital data, especially scientific datasets and software. The point is to allow others to check results through reproducibility. I was invited to give the corporate perspective on preservation.

As I wrote several weeks ago in my post on innovation, I am not a fan of silos. Attending this conference I found another set of silos: academic and corporate. There were so many tools people were discussing that I had never heard of. The first day left me reeling from all the new information. I also felt WAY out of my depth. I was wondering why I was there, how I could contribute and how this effort could engage the corporate and law firm worlds.

I did my presentation relatively early in the morning on the second day and felt much more like I could contribute to the discussion. I was warmly welcomed, good things were said about my talk and people asked a lot of questions. This answered my question. Part of my presentation was about content strategy (see my posts on the components of a good content management plan), but another part was about how academia and corporate environments can work together on this project to achieve better, more useful results.

Part of my point was that preservation starts with a content strategy. Preserving at the end is great, but much can be lost if you don’t start from the beginning. Of course, nobody in corporate, law firms (law firms do have a good handle on records management for client files) or academia does this. Preservation is an afterthought. Content management is a huge undertaking. I hope to make a difference, but it will be small as I am one person shouting about it into a canyon. Whatever small inroads I can make in my work will be worth it.

I learned a lot. I learned about new tools and sites. One is ReproZip, which allows people to back up an entire project “by tracing the systems calls used by the experiment to automatically identify which files should be included. You can review and edit this list and the metadata before creating the final package file. Packages can be reproduced in different ways, including chroot environments, Vagrant-built virtual machines, and Docker containers; more can be added through plugins.” What this means to someone like me is that I can pack up my entire website and someone in the future will be able to unpack the whole thing and look at it in all of its glory. Software, databases, widgets, text, everything is included so that the project/site/whatever can run as it did when it was posted on the web. The difference between this and the Internet Archive’s Wayback Machine is that pieces can be missing from the Wayback Machine as its crawler can’t access everything. The downside of ReproZip is that the creator, or someone involved in the project or site, must create the archive. The ReproZip tool comes out of the NYU Center for Data Science. A librarian, Vicky Steeves, is heavily involved in the project as a trainer and outreach coordinator. It makes my librarian heart happy.

I was also very interested in Euan Cochrane’s Wikidata project. Wikidata, funded in part by the Wikimedia Foundation, is a free and open knowledge base that can be read and edited by both humans and machines. Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wikisource, and others. As you know, I like the idea of central storage for information. I like wikis a lot and this platform allows all different people to edit and contribute. They do have reviewers to verify information.

The thing I like most about Code Ocean, aside from the name, is the look and feel. It is pretty and looks like it has a user interface that is easy to use. Code Ocean is a cloud-based executable research platform. I am sure that it works very well, too, but I can’t really speak to that as it is out of my area of expertise. I hope we can make the PresQT tool look and feel just as good.

I also think that Code Ocean could create a section that would be useful for tech companies. They could deposit their code as a safeguard against changes to their corporate structure, as a backup and for historic preservation. So many companies with interesting tools have gone out of business as the tech industry waxes and wanes. Think about some of the early search engines. As a result we have lost that knowledge. What could be done with it if it were deposited somewhere like Code Ocean and made available once certain corporate events or actions took place?

The big coup, from my point of view, is learning about Open Science Framework. This is essentially a streamlined, but well thought out and useful content management system. One thing that is very intriguing is that it overlays (not sure if that is the right idea) on top of cloud-based storage systems and allows users to search across them. I think this tool could solve the problem of companies using Box, Dropbox and Google Drive. I have a lot to learn about it, but I hope to become better versed and use it to its fullest extent. Stay tuned for more on OSF.

Just because I didn’t mention every single tool or presentation doesn’t mean that I didn’t find it interesting or potentially useful. There was so much interesting work shown at this event that I feel somewhat overwhelmed. I hope to be able to participate further.

EVERYONE can participate in PresQT in different ways. One is to take the needs assessment. This survey will allow the team to get feedback from more people to make the tool more useful. The permanent link will be available soon and I will post it here. Personally, I had a hard time with the survey as the language, and some of the subject matter, was not in my wheelhouse. Still, I soldiered on and provided as much helpful feedback from my community as I could.

People can also explore the resources from the conference on OSF. More of the presentations will be posted there shortly.

Jaye Lapachet

Knowledge Manager, San Francisco, Calif

Monthly Archives: May 2017

PresQT and the Law

PresQT Conference