The modern digital landscape that facilitates real-time collaboration has also created a huge attack surface. Today, documents containing malicious constructs are a leading cause of data breaches in the public and private sectors [1].
Warnings about the dangers of suspicious files abound, yet people’s jobs depend on documents: reading and reacting to them, filling out forms, etc. Even if you can authenticate the immediate provider of the document, its contents may come from an untrusted source and contain malicious data payloads.
Researchers at DARPA’s Safe Documents (SafeDocs) program have developed new methods and tools that allow people to confidently open documents and trust what they see on their screens.
Launched in 2018, SafeDocs set out to improve the security of electronic communications, especially in sensitive or critical applications such as military or government operations. Since then, SafeDocs research and development has reduced the complexity of document formats, the rules a document must obey for software to open it. The teams also radically improved software's ability to reject invalid and harmful data without impacting the core functionality of new and existing electronic data formats. SafeDocs tools have also helped preserve electronic document history and keep electronic documents feature-rich.
“Today, electronic data is the attack surface,” said Dr. Sergey Bratus, DARPA’s SafeDocs program manager at the Information Innovation Office. “Attackers abuse the excessive complexity and ambiguity of document format rules to sneak malicious payloads past the scanners. SafeDocs’ formal methods approach helps to discover and eliminate dark corners where attackers like to hide. The resulting technologies make incoming data trustworthy through documents that are viable for many industries, including those dealing with critical infrastructure.”
Document formats are quite complex. You might think of a document as an inert piece of digital paper, but it includes many technical characteristics. These features interact with the complexities within the software that interprets the document and displays it on the screen.
Complexities in file formats create opportunities for attackers to hide. Today’s software that processes digital data such as documents, messages, and data streams is prone to errors and vulnerable to exploitation by malicious input.
Complexity can also lead to ambiguity and misunderstanding, creating openings for attackers who craft data that exploits complex and confusing formats. For example, the specification for the widely used Portable Document Format (PDF) is 1,000 pages of English text with over 70 normative references to other documents, many of which have voluminous normative references of their own.
The size and complexity of the PDF specification can lead, and has led, to different interpretations. Research suggests that despite official standards, most implementations follow de facto standards defined by file malformations deemed benign and supported by other permissive software. According to a number of recent articles [2], a PDF file containing encrypted data could be manipulated to exfiltrate the data to a specific location when the user interacts with the document. Even cryptographically signed PDF files could be manipulated to make a fake signature appear valid or a tampered file appear intact. Malicious payloads included in PDF files might also be hidden from security scanning software.
SafeDocs performers have developed methodologies and tools for capturing and defining human-understandable, machine-readable descriptions of electronic data formats to address the ambiguity and complexity of file formats. Performing teams also created automated software construction kits for building secure, verified scanners for simplified subsets of existing formats, in which the inherent complexity or ambiguity has been reduced for security reasons.
According to Bratus, this approach strikes at the root cause of the scanner’s vulnerabilities: programmer errors due to misreading format rules or not checking them. To successfully implement a scanner for a modern document format like PDF, a programmer must understand the thousands of rules and their interactions and ensure that the code checks them all—an impossible task for even the most observant programmer.
“Acting on an unchecked assumption is the recipe for code vulnerability,” Bratus said. “SafeDocs helps the programmer avoid implementation errors due to misunderstandings or accidental omissions by automatically generating the code.”
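The idea of deriving a scanner from a machine-readable format description can be illustrated with a minimal sketch. This is not SafeDocs code and does not use any SafeDocs tool: the toy binary format and the names `Field`, `TOY_HEADER`, `make_parser`, `parse_toy`, and `FormatError` are all hypothetical, chosen only to show how rules written once as data can be checked mechanically rather than re-implemented by hand.

```python
import struct
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


class FormatError(ValueError):
    """Raised when input violates any rule in the format description."""


@dataclass(frozen=True)
class Field:
    name: str                     # key used in the parsed result
    fmt: str                      # struct format code for one fixed-size value
    check: Callable[[Any], bool]  # rule the decoded value must satisfy


# Hypothetical toy format (not PDF): magic, version, big-endian payload length.
TOY_HEADER: List[Field] = [
    Field("magic", "4s", lambda v: v == b"TOY1"),
    Field("version", "B", lambda v: v in (1, 2)),
    Field("length", ">H", lambda v: v <= 1024),
]


def make_parser(spec: List[Field], length_field: str) -> Callable[[bytes], Dict[str, Any]]:
    """Derive a strict parser from the description; every rule is checked."""

    def parse(data: bytes) -> Dict[str, Any]:
        out: Dict[str, Any] = {}
        offset = 0
        for field in spec:
            size = struct.calcsize(field.fmt)
            if offset + size > len(data):
                raise FormatError(f"truncated input at field {field.name!r}")
            (value,) = struct.unpack_from(field.fmt, data, offset)
            if not field.check(value):
                raise FormatError(f"rule violated for field {field.name!r}: {value!r}")
            out[field.name] = value
            offset += size
        payload = data[offset:]
        # Reject missing or trailing bytes instead of guessing what was meant.
        if len(payload) != out[length_field]:
            raise FormatError("payload size does not match declared length")
        out["payload"] = payload
        return out

    return parse


parse_toy = make_parser(TOY_HEADER, "length")
```

Calling `parse_toy` on a well-formed message returns a dictionary of the decoded fields plus the payload; truncated, padded, or rule-breaking input raises `FormatError`. Because the checks live in the description rather than in hand-written parsing code, a rule cannot be silently skipped by the programmer.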
PDF’s widespread use, complexity, occasional ambiguity, and diversity of implementations prompted DARPA to engage the PDF Association. The PDF Association is the umbrella organization representing the PDF technology ecosystem, including companies such as Adobe, DocuSign and Foxit, stakeholders such as Boeing, free software projects such as LaTeX, and government agencies such as the US National Archives and the Library of Congress. DARPA sought to use the format as a test and demonstration vehicle for SafeDocs performers to create the systems, tools, and specifications that help improve the security of PDF and other digital document formats.
Together, the PDF Association and other SafeDocs performers [3] faced a critical challenge: to create unambiguous definitions that help computers reason about document formats, and to use automatically generated scanners to reject malicious input and avoid the confusion caused by ambiguity. As a result, they accomplished the following:
- Submitted 117 disambiguating changes to the International Standard for PDF (ISO 32000-2 AKA PDF 2.0), 88 of which have been fully resolved and approved by ISO with publicly available solutions;
- Developed the Arlington PDF model, the first vendor-neutral, open-source, human- and machine-readable definition of PDF data objects derived from the specification;
- Completed a security audit of the International Color Consortium (ICC) color profile format used in PDF and many image formats, resulting in an update to the ICC specification and a move to incorporate machine-readable data descriptions to assist implementers. ICC color profiles are integral to accurate image rendering and can be used for malicious purposes, as River Loop Security and the PDF Association describe in this analysis;
- Identified the need for, and directed the curation of, a new corpus of PDF files, CC-MAIN-2021-31-PDF-UNTRUNCATED, to support research and format awareness; and
- Generated tests and parsers automatically to reduce human coding error and cut labor time from three years to one day.
“DARPA and the PDF Association are helping standards organizations redefine software specifications and even standards development processes that could help mitigate billions [4] of dollars in lost productivity caused by data breaches,” Bratus said. “Through our collaborative efforts, we have demonstrated the ability to eliminate the root cause of ambiguity: the place where attackers can hide within the complexity of modern documents.”
Bratus plans to expand SafeDocs solutions beyond documents to other file formats, such as those used to operate automobiles and military systems, stream video and more.
“If every data format could be engineered with SafeDocs tools, we would significantly reduce the vulnerabilities of systems to prearranged malicious data attacks,” he said.
Therefore, DARPA is in the process of transferring the tools to government partners.
As part of its third phase, the Open Group Sensor Open Systems Architecture (SOSA) consortium explored SafeDocs data modeling technologies for incorporation within the SOSA standard. The SOSA approach establishes guidelines for Command, Control, Communications, Computers, Cyber, Intelligence, Surveillance and Reconnaissance (C5ISR) systems. The goal is to enable flexibility in the selection and acquisition of sensors and subsystems that provide sensor data collection, processing, exploitation, communication, and related functions throughout the entire lifecycle of the C5ISR system.
The Electronic Records Processing Branch at the National Archives and Records Administration (NARA) has also benefited from SafeDocs. As a SafeDocs performer, NASA’s Jet Propulsion Laboratory (JPL) enhanced one of NARA’s tools, Apache Tika, which automatically identifies embedded and corrupted files and extracts critical text and metadata from PDFs to establish the files' characteristics and provenance, an essential function for digital file preservation. According to a senior IT specialist at NARA, the improved Apache Tika toolkit helps the team perform tasks more efficiently and securely. The specialist added that his team is successfully using the updated tool to accelerate the processing of large sets of records and to find new ways to process records more efficiently.
In addition, the PDF Association, DARPA Embedded Entrepreneur Initiative performer Galois, and other performers continue to focus on transitioning SafeDocs format knowledge and approaches to industry and international standard-setting bodies. The agency also encourages industry to adopt its solutions, as seen in this industry example, where a company describes its application of the Arlington model to improve its PDF creation software.
The following tools can help software developers and information security/privacy researchers improve their organization’s security posture when handling electronic documents. These vary in functionality and specificity for a variety of uses. Check each description and click on the tool links for more information.
Resources for the Portable Document Format (PDF):
Programmer resources for describing data formats and auto-generated analysis code:
Tools for understanding document collections and formatting rules:
Tools for understanding the behavior of existing parser code: