The HathiTrust Digital Library is a consortium of 90 academic institutions around the world that are redefining what it means to preserve books and provide access to knowledge in an age of cheap data storage and globalization.
The project provides digital access to more than 11 million volumes of books and other printed materials that have been scanned by Google as part of its Google Books project, as well as by other local library book scanning projects. While the sheer act of scanning millions of books sounds epic, what has really required a lot of thought and work is the complex process of providing access to the information in a useful and legal way.
About a third of the works are in the public domain, and therefore can be read for free in a digital format. For the rest of the works that are still in copyright, a search of HathiTrust’s database yields the bibliographic information on books containing the relevant search terms. Researchers can also search within the text of books to discover the frequencies of specific terms. If a researcher wants full access to a work, the HathiTrust provides links to libraries that hold physical copies.
Despite the organization’s careful steps to stay within the bounds of the U.S. legal concept of fair use, HathiTrust became the target of a copyright infringement lawsuit in 2011. The plaintiffs are a group of authors and the Authors Guild in the United States.
As in the Google Books lawsuit, the authors essentially argued that as copyright owners, they wield ultimate control over the decision on whether to digitize their works or not, and on what terms access to their works should be granted – even if the library wasn’t providing full-text access to their works online to the general public. A federal district court judge in New York disagreed and dismissed the suit in the Fall of 2012, writing: “I cannot imagine a definition of fair use that would not encompass the transformative uses made by the defendants.”
Judge Harold Baer, Jr. dismissed the authors’ arguments that the inclusion of their books in the HathiTrust would by default result in lost sales. The goal of the project isn’t to replace the marketplace for the books, he noted. It is “to provide superior search capabilities” instead, by providing what amounts to a more efficient map of the world’s corpus of works to academics, as well as access to those who are disabled, such as the blind.
The Authors Guild is appealing the decision. Meanwhile, HathiTrust this February named Mike Furlough, Penn State University’s Associate Dean for Research and Scholarly Communications, as its executive director. Furlough, who is 47, begins his new role on May 19th.
The Disruptive Competition Project checked in with Furlough recently to chat about his work, and his new job.
Q: So what does it mean to be a member of the HathiTrust Digital Library?
A: The membership is really large – larger than what most people, I think, realize. There are a little more than 90 libraries that have joined HathiTrust.
Originally, I think a lot of people thought that anybody who joins HathiTrust is joining so that they can put their digital collections in the HathiTrust. And so far, not every library has done that. It is maybe a third of the member libraries that have contributed content.
But all of our members have made a commitment to collectively sharing the responsibility for preserving the digital collection that HathiTrust holds. One of the things that I want to do early in my tenure is to talk to all the member libraries, and ask them what their aspirations are in joining HathiTrust as a collections repository as well.
If you are a member library, your users have the same kind of access that the public has, which includes full-text access to out-of-copyright works, and the ability to search in-copyright materials. But there are certain benefits we can extend to the general membership that we can’t to the general public.
For example, we have a collection of 11 million texts that are scanned with full-text. Under certain circumstances, those books can all be made accessible to the individuals at member universities who have a print disability that requires them to have additional technology to read. For this to happen, the user has to be certified by the University, and the member must own or have owned a copy of the physical book.
We do this in a way that is consistent with fair use and section 121 of the U.S. copyright law, which means we can only do this for members in the United States. What Hathi has done is to vastly increase the collection of the materials that are available to those students with those needs. That’s a huge, huge benefit. That’s something that the general public doesn’t see, or have knowledge about. But for those who need it, it’s a major, major thing.
Q: The concept of fair use is quite unique to the United States apart from places like Israel. It doesn’t exist in Europe, for example. How has that impacted the world of research, and in particular HathiTrust?
A: That’s a very good question. Let me talk about research and data science first. There is a lot of emphasis these days, particularly in the U.S., on data sharing and data mining. Everybody is generating huge amounts of data, and federal agencies that fund research are encouraging that it be shared for re-use and analysis. The theory is that if we get more computing power onto masses of data, there’s the ability to solve bigger problems.
But if you’ve got data coming not only from different researchers and projects but also from different countries, which have different legal, copyright, or privacy regimes, you really do start to have a legal mess, and it can become something that impedes the progress of science.
At HathiTrust, we are housed in the United States, and we take international copyright law extremely seriously. We do that by recognizing works that are published outside of the United States that we have digitized, and providing only the legally appropriate access to them, which in some cases might be very limited.
Q: How does HathiTrust’s system that manages the intellectual property rights work? Could you describe it to me?
A: We have a metadata management system for HathiTrust, a core tool called Zephir, which we use to manage metadata submitted by our different partner libraries. It’s where we have all the information to manage the volumes in HathiTrust. From this master source we export certain parts of that database to the public catalogue so you can search for title, author, and that sort of thing in addition to full-text search.
We also have a rights database, which allows us to record the copyright status and the corresponding level of access that is allowed. And the copyright status may be determined in any number of ways. In the United States, there are very clear dates for when a work is out of copyright. Anything first published in the United States prior to 1923 is very clearly out of copyright in the United States.
For works in the United States that were published from 1923 to 1964, there are additional possibilities that a work may be out of copyright, for instance if the copyright was not renewed. The copyright term during that period was much shorter. It was originally 28 years and then another 28 when it was renewed. So if people didn’t re-register, then those works would be out of copyright.
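The date rules described in this answer can be sketched in a few lines of Python. This is only an illustration of the rules as stated in the interview, not HathiTrust's actual rights database logic: the function name, the status labels, and the `renewed` parameter are hypothetical, the sketch covers only works first published in the United States, and anything published after 1964 is conservatively treated as in copyright.

```python
from typing import Optional


def us_copyright_status(first_pub_year: int, renewed: Optional[bool] = None) -> str:
    """Classify a work first published in the United States.

    Hypothetical helper: the labels and parameters are illustrative assumptions.
    `renewed` is True or False when a renewal search was conclusive,
    and None when no conclusive answer is available.
    """
    if first_pub_year < 1923:
        # Published before 1923: clearly out of copyright in the U.S.
        return "public-domain"
    if 1923 <= first_pub_year <= 1964:
        # Initial 28-year term; protection continued only if renewed.
        if renewed is False:
            return "public-domain"
        if renewed is True:
            return "in-copyright"
        return "needs-review"
    # Later works are outside the scope of this sketch.
    return "in-copyright"


print(us_copyright_status(1950, renewed=False))
```

A work from 1950 with no renewal on record comes back as public domain, while the same work with an unknown renewal status is sent for review rather than opened up.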
With funding from the Institute of Museum and Library Services, the University of Michigan led development of the Copyright Review Management System, which allows us to record the trail of investigation into a work’s copyright status. Trained staff search the U.S. Copyright Office records, and search other related records to find evidence of a renewal, and record the results of their search. There are also processes for cross-checking and quality control, so that it’s not just one person determining this. And when those staff can’t make a clear determination (say they are finding conflicting information), the review will be flagged for additional review by one of the managers of the system.
So we record the work of those determinations and that determines the level of access HathiTrust provides. In the United States, we had a little over 300,000 items that were published between those dates, and we have investigated just about all of them, and about 160,000 of them were determined to be out of copyright.
The balance of them were still in copyright, or we couldn’t make a conclusive ruling, in which case we treat them as in-copyright. And we feel comfortable with that. It’s close to what we expected. It’s not as high as what some people would expect. It’s close to 50-50. Sometimes we’ll get requests to investigate the copyright status of specific works because the researcher might not be otherwise able to access it, so if someone has reason to believe that something is out of copyright, they might get in touch with us and say: can you check? In some cases it is out of copyright, and we can open it up, and in some cases not.
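To make the figures concrete, here is a rough Python sketch using the approximate numbers quoted above (about 300,000 reviewed items, about 160,000 found to be out of copyright). The function names and the access labels are hypothetical and chosen only for illustration. The one rule taken from the interview is that an inconclusive review is treated the same as an in-copyright result.

```python
def access_level(determination: str) -> str:
    """Map a review outcome to an access level (labels are illustrative)."""
    # Only a clear public-domain finding opens the full text;
    # in-copyright and inconclusive results both stay search-only.
    return "full-text" if determination == "public-domain" else "search-only"


def public_domain_share(reviewed: int, public_domain: int) -> float:
    """Fraction of reviewed items found to be out of copyright."""
    return public_domain / reviewed


# Approximate figures from the interview.
share = public_domain_share(300_000, 160_000)
print(f"{share:.1%}")  # 53.3%
```

At roughly 53 percent, the result matches the "close to 50-50" description.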
A: We in the preservation community don’t think of Google as a preservation agent. I’m sure it’s in the interest of Google to ensure the longevity of the data that they’re collecting, but in terms of preserving it for the good of the public, that’s the library’s mission, and that’s why HathiTrust was founded: to preserve and share the collections of our member libraries, whether scanned by Google or others.
It’s an evolving ecosystem. The Library of Congress has a very broad mission. It’s a primarily physical library, which has digitized some of its collections, and it serves Congress, and it serves the public, and it serves libraries. HathiTrust’s mission focuses on building a shared digital library, and by preserving and providing access collectively to materials partners contribute, we can free resources for libraries to do more, both with their digital and print collections.
The Digital Public Library of America is an “aggregator” that provides some really nice features for search and browse of collections at multiple libraries, archives, and museums. There’s a lot of cool stuff, and it’s from all over the place. I can go there, and I can discover photographs from civil rights archives in Alabama, or I can find books on the topic, and they’re doing it in a very attractive way too.
But DPLA does not provide full-text search for the materials in its collections. It also doesn’t currently host the collections, just the metadata that points back to the item at the host. HathiTrust takes advantage of DPLA’s services by having our records of our books in their catalogue. So you might search in the DPLA catalogue, and find an item, and it will take you to HathiTrust. I think what we do is very complementary and I hope we can work together more.
I have always believed that you have to make sure that researchers can find what we have in our libraries no matter where they are online—you can’t make them go to the library website, and we can’t assume they are going to find the HathiTrust website, either. For the purposes of discovery and findability, anything that is worth getting online, we need to get it in as many places online as possible. DPLA gives us another opportunity to bring our content to another potential audience.
Q: Can you tell me about the HathiTrust Research Center?
A: The HathiTrust Research Center is co-directed by Stephen Downie at the University of Illinois and Beth Plale at Indiana University, and those institutions serve as the co-hosts. The Research Center holds a copy of a portion of the HathiTrust digital corpus.
They are working to create a collection of tools for large-scale textual mining and analysis. There are some really smart people at those institutions and elsewhere who are working on fascinating things, like can you determine the prevalence of particular genres of works by running an algorithm across [the corpus of books,] or can you track the evolution of certain concepts across time by running algorithms across those texts. The Research Center has received funding from both the Alfred P. Sloan Foundation and the Andrew W. Mellon Foundation to support its development and use.
Q: Has HathiTrust tried to incorporate other media, other than books?
A: HathiTrust has focused primarily on preservation of and access to books. We’ve done some prototyping to support collections of images, but that’s not in production at this time.
There are definitely needs that go beyond books, of course. Libraries often find it difficult to make accessible films and photographs. The copyright issues for those formats are challenging.
Take photographs: a library might have collections in its archives that are really well documented, and we know who produced them, but sometimes when a library receives a collection it isn’t well documented. There may not be identifying information about the photographs. You don’t know who took the photograph. You don’t know what the agreements between the photographer and the owner of the collection were, you don’t know if they did it as a work for hire, you don’t know who actually owns the rights. So this is a big problem for archives in general. This isn’t just for photographs, but sound recordings, sometimes film too.
In those cases libraries aren’t always sure whether they can make the items accessible, or in what way, or what would be fair use in that realm. You just don’t have enough information to fully evaluate the collection in so many cases. For audio or film collections there can be so many creators involved. Those are multi-authored works. You’ve got a director, a scriptwriter; you’ve got the composer of the score. The distribution company might own the copyright overall, but the scriptwriter may have some claims too, and so on. When you’re dealing with relatively major motion pictures, it’s pretty straightforward to find those things out, but when you’re dealing with sound recordings in your collections from the 1950s, the song might still be under copyright, but the recording is probably not under copyright, and yet the arrangement might be copyrighted. Or you can’t find the rights owners for the recordings. It becomes very challenging for some organizations to make informed decisions about access and what constitutes fair use in those contexts.
Q: But why can’t you do with sound recordings what you’ve done with books?
A: It’s not an area we’ve explored to date. Digitizing and hosting audio on the scale we’ve done with books will be much more expensive, and as a whole libraries don’t have the same degree of knowledge of our archival audio or video collections that we do of our published book collections. We also couldn’t provide an equivalent of full-text search with audio or film recordings, though with a transcript of those recordings, yes. There’s really not a great way to search sound recordings right now on the scale of what HathiTrust does. If you had a transcription of a film, you could do a search on that. It’s the nature of the media. We’re just not there yet with other kinds of media. I certainly know of research underway to do just that, where people are working on identifying a sound print, or a vocal pattern, and searching within an audio collection for a particular poet speaking, or something like that. But it’s still pretty early days for that kind of technology.
One thing that libraries often do when they have performed a fair use analysis is to put such photographs online and ask the public to help. We put it up, people find it and tell us things about it. We learn who the people in the photograph are, where it was taken, or where the photograph actually came from. And learning from our users, that’s a fantastic part of my job, or of any librarian’s job.
You think you know very little about something, but then you make it accessible to the public, and almost everybody knows something about something, and then they’ll find it and come and tell you about it.