Why Should Industry Care about the FAIR Data Principles?
Formally published in Nature Scientific Data in 2016, the FAIR Data Principles provide a framework for scientific data management and stewardship. “FAIR” is an acronym for the Findability, Accessibility, Interoperability, and Reusability of data—for both humans and machines. In this Q&A-style blog, Carmen Nitsche (CCDC US general manager who is also active in several InChI and IUPAC data standards initiatives) answers common questions about how the FAIR Data Principles can help solve real-world challenges.
CCDC takes its role as the keeper of the world’s collection of small-molecule structural data very seriously. Along with carefully curating the data in the Cambridge Structural Database (CSD), we’re also dedicated to understanding how our communities want to use data in the CSD—today and in the future. We have anchored these efforts in the FAIR Data Principles to help us prepare for our community’s future needs. Here, we’ll talk with Carmen about what this means to our industrial community members. (If you’re new to the FAIR Data Principles, read them in this blog post and in the original 2016 Nature publication.)
What is your background working with industrial clients on the FAIR Data Principles?
I was introduced to FAIR around 2017 while working at the Pistoia Alliance, a nonprofit industry group focused on precompetitive collaboration in life sciences R&D. We knew that FAIR was making headway in academia, but key consortium members were asking for our help in exploring whether and how FAIR was relevant to the life sciences industry.
Looking back, most of the project development I was involved with at the Pistoia Alliance ultimately boiled down to the need to FAIR-ify. Examples include:
- The HELM project (Hierarchical Editing Language for Macromolecules) involved establishing an open standard notation to represent large biomolecules, like antibodies, based on the original algorithm developed and published by Pfizer. HELM is now included in technical guidance for ISO 11238 TS 19844.
- In 2019, I co-presented at Bio-IT about the Unified Data Model (UDM), an open data format that allows consistent capture of experimental compound synthesis and testing data and makes the information readily sharable. The ultimate goal was to break open the silos of data stored in proprietary lab notebook repositories.
What is “siloed data” and how can FAIR Data Principles address it?
“Data silo” refers to creating, storing, and using information in an isolated manner. This covers everything from compiling data on a computer that few can access to using a file format or data identifier that others are unfamiliar with or do not use. However, “silos” are not just physical constructs. They also affect how we think and interact with one another. For example, a chemist may not think to record a specific data point because they don’t need it for the experiment at hand, even though it would have been easy to collect and would have made the results useful to the biologist down the hall. Addressing non-physical silos requires a more holistic and strategic view of data. It also highlights a difficult challenge: the data creators and the data consumers are often not the same people, and their interests, resources, and vision may not align.
Why should industry care about FAIR Data?
Making data findable, accessible, interoperable, and reusable is an issue not just across different organizations but within companies as well. This makes FAIR directly relevant to industry. For example, one of the most valuable assets a pharma company creates is its data. Getting the most out of that investment requires attention to the FAIR Data Principles, by whatever name you might give such an effort. One clear example of this occurs every time one pharma company acquires another. If the corporate compound databases use different identifiers and representations, it might take the new company months or even years to figure out exactly what its IP assets really are!
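To make the merger example concrete, here is a minimal sketch of why a shared, standard identifier matters. It assumes both companies have already annotated their compounds with InChIKeys; the registry contents, internal IDs, and the `merge_by_inchikey` helper are all hypothetical and only illustrate the idea, not any particular company's or vendor's workflow.

```python
from collections import defaultdict

# Hypothetical registries: each company uses its own internal compound IDs
# and naming conventions, but both also store a standard InChIKey.
company_a = [
    {"id": "A-0001", "name": "aspirin", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
    {"id": "A-0002", "name": "caffeine", "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N"},
]
company_b = [
    {"id": "B-77", "name": "acetylsalicylic acid", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
    {"id": "B-78", "name": "ibuprofen", "inchikey": "HEFNNWSXXWATRW-UHFFFAOYSA-N"},
]


def merge_by_inchikey(*registries):
    """Group internal IDs from any number of registries by shared InChIKey."""
    merged = defaultdict(list)
    for registry in registries:
        for record in registry:
            merged[record["inchikey"]].append(record["id"])
    return dict(merged)


merged = merge_by_inchikey(company_a, company_b)

# Compounds registered by both companies under different names and IDs.
duplicates = {key: ids for key, ids in merged.items() if len(ids) > 1}
print(duplicates)
```

Without the common identifier, matching "aspirin" to "acetylsalicylic acid" would need name lookups or structure comparison for every record, which is where the months of reconciliation work tend to go.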
Challenges related to data management can also affect productivity and morale. For example:
- Data scientists spend about 45% of their time on data preparation tasks, including data loading and cleansing, according to a 2020 survey of data scientists by Anaconda that included over 2,300 people from over 100 countries.
What advice would you give to an organization trying to adopt the FAIR Principles?
Of course, especially in a corporate environment, you’ll need to justify investing in a new effort and answer the question: “What is the ROI?” The Pistoia Alliance established an interest group made up of dedicated professionals from across its membership. This team built a free and open FAIR Toolkit, which provides practical advice and tools for industrial organizations. It also has case studies, best practices, FAIR Maturity Indicators, and other guidance required to get started.
You can also bring in specialists to advise and consult. For example, at the CCDC we offer our data consultancy services to top pharma companies and help to curate their proprietary data into forms that empower them to use the knowledge in future projects.
What is the biggest challenge industry faces when adopting FAIR Principles?
Most practitioners understand the myriad data management issues, which frankly go beyond FAIR to include areas such as security, governance, legacy data, systems, and more. But to address FAIR, you need enterprise-wide buy-in on the need for a strategic and holistic approach to data management. So you really have to build a solid business case. That is why it’s great that the FAIR Toolkit includes some industrial case studies. It really helps to point to other organizations that have successfully navigated the FAIR waters and found clear, tangible benefits.
One should note that FAIR is not a one-time effort. Nor does it need to be tackled all at once. Start with a FAIR maturity audit to learn where on the continuum your organization sits. Then you can pick the highest-value, highest-return activities for your organization.
Read more
Read about CCDC’s FAIR Data journey.
Read about the CSD and CCDC’s data tools and philosophy.
Read how CCDC can leverage 50 years of experience at the forefront of structural chemistry to help you with data services.
See how keeping data accessible across organizations can lead to new insights! In a recent paper, CCDC and GlaxoSmithKline (GSK) combined published data (from the CSD) with proprietary data (from GSK) to better inform machine learning models towards the development of novel compounds.