In 2018 the HathiTrust Collections Committee was interested in comparing the book holdings of the Library of Congress, recently made available as a dataset in the MARC Open-Access release, to the open-access Hathifiles to get a sense of the digital library's coverage. The plots below represent my findings from this investigation.
The process involved first extracting the title and, where provided, an OCLC number from the MARC fields of the Library of Congress dataset, then matching the more than 10 million records against the 16-million-record Hathifiles. A match on OCLC number was attempted first; if that failed, a Levenshtein distance match on titles was attempted, combined with year of publication (only titles of at least eight tokens were considered for this step). The plots below indicate the type of match in a stacked bar format: for each year, a baseline of exact OCLC matches, with the additional title-and-year matches stacked on top.
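The two-stage matching described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code used to produce the plots: the dictionary-shaped records, the helper names `build_indexes` and `match_record`, the title normalization, and the edit-distance cutoff `MAX_EDIT_DISTANCE` are all assumptions, since the text does not specify them.

```python
from collections import defaultdict

MIN_TITLE_TOKENS = 8   # from the text: only titles of eight or more tokens are fuzzy-matched
MAX_EDIT_DISTANCE = 3  # assumption: the actual cutoff is not stated in the text


def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(title):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(title.lower().split())


def build_indexes(hathi_records):
    """Index the Hathifiles-style records by OCLC number and by year."""
    by_oclc = {}
    by_year = defaultdict(list)
    for rec in hathi_records:
        if rec.get("oclc"):
            by_oclc[rec["oclc"]] = rec
        if rec.get("year") is not None:
            by_year[rec["year"]].append(rec)
    return by_oclc, by_year


def match_record(loc_rec, by_oclc, by_year):
    """Return 'oclc', 'title+year', or None for one Library of Congress record."""
    # Stage 1: exact match on OCLC number, when the record has one.
    oclc = loc_rec.get("oclc")
    if oclc and oclc in by_oclc:
        return "oclc"

    # Stage 2: fuzzy title match, restricted to long titles and the same year.
    title = normalize(loc_rec.get("title", ""))
    year = loc_rec.get("year")
    if year is None or len(title.split()) < MIN_TITLE_TOKENS:
        return None
    for candidate in by_year.get(year, []):
        distance = levenshtein(title, normalize(candidate.get("title", "")))
        if distance <= MAX_EDIT_DISTANCE:
            return "title+year"
    return None
```

Grouping candidates by year before computing edit distances keeps the fuzzy stage tractable, since each record is compared only against titles from the same publication year rather than against all 16 million entries.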
The top plot gives matches in raw numbers of books; the bottom gives matches as a percentage of the entire collection.
A few things stand out. First, a considerable portion of the Library of Congress collection, particularly for the in-copyright years, is not yet in the HathiTrust digital library. One also notices the decline in the ability to match by OCLC number in the 1970s; my Library of Congress contact tells me that there was a period when external identifiers were not added to the library's MARC records. Finally, there is the growing problem of undercoverage as publishing explodes in the modern period. Up until the 1920s, Hathi's coverage of a large library's holdings is strong. But after the copyright bright line, and especially for publications since World War II, contributions to HathiTrust's holdings appear to be strongly needed.