Developing Trust in Your Sources in a Generative AI World
Letter from the Executive Director, December 2024
Provenance is a fundamental element of trust. Put another way, confidence in the trustworthiness of information is inherently tied to its source. Perhaps you’ve determined you can rely on this newsletter because it’s coming from NISO, and your experience with NISO’s communications is that they’re factual, reliable, and valuable to your business. You similarly have other trusted sources for information, be it scholarly, business-related, or political.
Recognizing the sources of materials, where they come from, and how results are ranked has long been an issue in information discovery. Originally, it was taken as a given that any discovery service would provide details of the content that was included and the criteria for inclusion. As digital systems have grown in scale and scope, though, maintaining that information has become challenging. The NISO Open Discovery Initiative (ODI) has focused attention on these issues since 2012. Earlier this fall, the ODI group circulated a survey on generative artificial intelligence (AI) systems in search and discovery. A summary of some of the results of that survey should be released before the year’s end. With new tools appearing on the market regularly, these questions about transparency are ever more important, if ever more complex.
Last month, a team at the Allen Institute and the University of Washington announced the development of a new academic search tool, OpenScholar. This new AI search tool was described in the announcement as “a specialized retrieval-augmented LM [language model] that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses.” Perhaps the release was intentionally timed to coincide with the 20th anniversary of Google Scholar, which also fell last month. The developers of OpenScholar highlighted its trustworthiness by comparing its outputs against ScholarQABench, a benchmark they released simultaneously. Without judging the benchmark’s quality, there is perhaps something a bit unseemly about evaluating one’s own tool against a benchmark one developed for that purpose. Still, the desire, and even the expectation, that transparency (the model’s code is open source) and quality assessment should accompany such a release is worth celebrating.
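To make the retrieval-augmented pattern concrete, here is a minimal sketch of the two-step loop the announcement describes: retrieve relevant passages, then synthesize a response that cites them. The tiny corpus, the crude word-overlap scoring, and the string-assembly “synthesis” step are all illustrative stand-ins, not OpenScholar’s actual implementation, which uses a trained retriever and language model over 45 million papers.

```python
# A minimal sketch of the retrieval-augmented generation (RAG) pattern:
# retrieve relevant passages, then produce an answer that cites them.
# Everything here is a stand-in for demonstration purposes only.

from collections import Counter

# Toy stand-in for an indexed corpus of open-access papers.
CORPUS = {
    "smith-2023": "Retrieval-augmented models ground answers in source passages.",
    "lee-2022": "Citation-backed responses let readers verify each claim.",
    "park-2021": "Benchmarks measure answer quality against expert references.",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank passages by naive word overlap with the query. Real systems
    use dense embeddings and approximate nearest-neighbor search."""
    q_terms = Counter(query.lower().split())
    scores = {
        doc_id: sum(q_terms[t] for t in text.lower().split())
        for doc_id, text in CORPUS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

def answer(query: str) -> str:
    """Assemble a citation-backed response from the retrieved passages.
    In a real RAG system, a language model synthesizes this text."""
    hits = retrieve(query)
    body = " ".join(f"{CORPUS[d]} [{d}]" for d in hits)
    return f"Q: {query}\nA: {body}"

print(answer("How do retrieval-augmented models support citation-backed answers?"))
```

The point of the pattern, and its relevance to provenance, is that every statement in the answer can be traced back to a retrieved, citable passage.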
One significant limitation: the OpenScholar tool has ingested only open content. While understandable as a premise for a research project, drawing on only a segment of the scholarly literature limits the tool’s value. For such a tool to be truly useful to researchers, it needs to contain all, or at least the vast majority of, relevant content, both open access and subscription-based. A user might not have access to all of that content, but the results the tool produces should be based on it. If one wants reliable and trustworthy answers from a large language model (LLM) tool, the tool needs to be built on as much data as possible, or at least on enough to be considered comprehensive.
Whether this tool competes favorably with other AI-powered discovery tools, such as ScopusAI, Clarivate’s Web of Science Research Assistant or its Primo Research Assistant, and more generalized services such as Perplexity, remains to be seen. While OpenScholar earns high marks for its transparency and for adherence to POSI-like principles, basing its results solely on open publications seems troubling for a variety of reasons.
Open content is an important element in this communications system, but it’s worth noting that a variety of scholarly publishers are exploring licensing their content for ingest and reuse by these next-generation services. Providing that content to LLM and AI tool developers is important for giving researchers a complete view of their domains.
Last month, I participated in a webinar panel hosted by the Society for Scholarly Publishing and its Scholarly Kitchen blog (where I contribute) about licensing content for LLM-based tools. During this session, Roanie Levy (Copyright Clearance Center), Josh Jarrett (Wiley), Mandy Hill (Cambridge University Press), and I discussed the licensing landscape for AI systems, including approaches to integrating content into LLM tools and retrieval-augmented generation systems; best practices for things like citation; and different approaches to controlling community behavior through standards, contracts, or regulation. One might expect that the issues of provenance and attribution will be best addressed through community consensus or model license terms rather than by regulation or legislation, since the latter would take considerably longer to implement. These are two areas where NISO might play a valued role in drawing our community together.
When it comes to content generated by AI systems, the question of provenance is an area of growing need and increasing activity. Some of this work is taking place in regulatory and legislative arenas worldwide. Earlier this year, the adoption of the EU Artificial Intelligence Act recognized the importance of transparency about the authorship of content produced by generative AI systems, as well as the value of copyright. Similarly, just recently, a new US interagency task force was launched to explore best practices for the sourcing of content and its provenance.
The framework of standards provides trading partners a basis upon which trust can be built. Conformance to standards provides some measure of quality expectation for purchasers of a product or service. In support of community understanding and trust building, work on the technical problems of content provenance is also advancing.
In November, the ISO Technical Committee on Document Management Applications’ subcommittee on document file formats (ISO TC 171/SC 2) began fast-track work to standardize the Coalition for Content Provenance and Authenticity (C2PA) technical specification on content credentials. This work seeks to create technical metadata and a hard cryptographic binding of content objects to their provenance metadata. Ideally, this metadata can help keep misinformation, manipulated content, or machine-generated content from being widely distributed. That draft is being circulated for feedback. US-based members of NISO can provide commentary through NISO’s US Technical Advisory Group, which is a liaison to this working group.
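The core mechanism is easier to see in miniature. The sketch below shows the binding idea only: hash the content object, embed that hash in a provenance manifest, and sign the manifest so that tampering with either the content or the metadata is detectable. C2PA itself specifies COSE signatures backed by X.509 certificates; the standard-library HMAC used here is a simplified stand-in, and all field names are illustrative.

```python
# A minimal sketch of "hard binding" for content credentials: hash the
# content, embed the hash in a provenance manifest, and sign the manifest.
# C2PA uses COSE/X.509 signatures; this HMAC is a simplified stand-in,
# and every field name below is hypothetical.

import hashlib, hmac, json

SIGNING_KEY = b"demo-key-not-for-production"

def issue_credential(content: bytes, provenance: dict) -> dict:
    """Bind provenance metadata to a content object via its hash."""
    manifest = {
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "provenance": provenance,
    }
    payload = json.dumps(manifest, sort_keys=True).encode()
    manifest["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return manifest

def verify_credential(content: bytes, manifest: dict) -> bool:
    """Check the signature, then check the content matches the bound hash."""
    claimed = dict(manifest)
    sig = claimed.pop("signature")
    payload = json.dumps(claimed, sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and (
        hashlib.sha256(content).hexdigest() == claimed["content_sha256"]
    )

photo = b"...image bytes..."
cred = issue_credential(photo, {"tool": "ExampleCam", "ai_generated": False})
assert verify_credential(photo, cred)          # intact content verifies
assert not verify_credential(b"edited", cred)  # altered content fails
```

A verifier holding the key (or, in the real specification, the signer’s certificate chain) can thus confirm both who asserted the provenance and that the content has not been altered since.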
Since January, a different committee has been working on an ISO project to track the beneficial ownership of channels of content. The International Workshop Agreement project to create a unique media identifier (UMId) was circulated last month for public comment. Now that the comment period has closed, the working group developing the UMId system will move quickly to address the public feedback and will seek to publish the final agreement by early 2025. Ideally, after publication, that work will continue to develop into both practical services for tracking media channels and a more formal standard within ISO under the Subcommittee on Identification and Description.
As 2024 draws to a close, the scope of work needed to understand and position our community for the coming wave of generative AI tools is coming into focus. In the coming year, we’ll collectively need to address the issues I’ve touched on here and many, many more. From license terms to content identification, and from attribution to transparency and trust, there is a lot to do. All of this is taking place in a rapidly changing technological environment, which compounds the need for both speed and community consensus. As in the period from the mid-1990s to the early 2000s, the decisions we take now will shape much of the next several decades of information creation, processing, and preservation.
On this note of looking forward, I hope that all of you have a joyous holiday season and that the new year brings you seamless interoperability, prosperity, and peace.
Sincerely,
Todd A. Carpenter, Executive Director
NISO