Generative AI for R&D: Building the “Digital Chemist” for Covestro

A high-tech chemical laboratory setting with a monitor displaying a 'Digital Chemist' Generative AI interface analyzing polymer formulations and 3D molecular structures.When Covestro, a global leader in high-tech polymer materials, needed to accelerate their R&D cycle, they leveraged their strategic partnership with Cardinal Peak / FPT Software. Building on our deep institutional knowledge of their application ecosystem, we engineered a secure, scalable Generative AI for R&D solution—a “Digital Chemist”—that allows scientists to programmatically query decades of unstructured testing data, reducing research time by 50%.

The Challenge: “Dark Data” in the Lab

Covestro possessed a goldmine of proprietary data: 30 years of PDF reports, Excel spreadsheets, and Word documents detailing thousands of chemical formulations. However, this data was “dark”—unstructured, unsearchable, and scattered across siloed drives.

  • The Search Barrier: Scientists spent hours manually digging through folders to find past experiments, often re-running expensive tests because previous results couldn’t be located.
  • The Security Gap: Off-the-shelf GenAI tools were non-starters; Covestro could not risk exposing proprietary formulations to public models.
  • The Complexity: Chemical data is contextual. A standard keyword search for “Polymer X” returns thousands of irrelevant hits without understanding the relationship between formulations and results.

One of the factors that makes this partnership amazing is the flexibility, the speed, and the diligence of the team.

Walter Gruner
CIO
Covestro

The Solution: A Secure RAG Architecture

Leveraging our AI Center of Excellence (CoE), we architected a custom Retrieval-Augmented Generation (RAG) solution that acts as a secure “Digital Chemist” assistant.

Data Ingestion & "Chunking"

We engineered an automated ingestion pipeline to process the ‘messy’ reality of R&D data. The system ingests PDFs, PPTs, and DOCX files—including complex Safety Data Sheets (SDS)—utilizing advanced computer vision to distinguish between text, chemical diagrams, and tables. We implemented intelligent ‘chunking’ strategies that respect the structure of scientific documents, ensuring that chemical formulas and their associated results stay semantically linked in the vector database.

Vector Search & LLM Integration

We utilized a high-performance vector database to index the semantic meaning of the documents. When a scientist asks, “How did Formulation A perform at 80°C?”, the system retrieves the specific paragraphs containing that thermal data and feeds them into a secure, private instance of a Large Language Model (LLM).

Enterprise-Grade Security

Security was paramount. We deployed the solution within a Virtual Private Cloud (VPC), ensuring that:

  • No Data Training: The LLM is used only for inference (reasoning), not training. Covestro’s data never leaves their secure environment.
  • Access Control: The system respects existing file permissions—scientists can only query documents they are authorized to view.

Covestro’s proprietary chemical data is never used to train base models, ensuring that a competitor using the same LLM can never access Covestro’s unique research insights.

Engineer Your Secure GenAI Knowledge Base

Stop re-running expensive experiments. From chemical formulation optimization to secure knowledge retrieval, our engineers design the private GenAI systems that accelerate your R&D.

RAG Architecture Diagram

Technical architecture diagram of a secure Retrieval-Augmented Generation (RAG) pipeline for Generative AI in R&D, illustrating the flow from unstructured data ingestion and intelligent chunking to semantic indexing in a vector database and private LLM inference within a secure VPC.

The Results: 50% Faster Innovation

By transforming dark data into an interactive knowledge base, we fundamentally changed the daily workflow of Covestro’s scientists.

  • 50% Reduction in Research Time: Questions that took days to answer via manual search are now resolved in seconds.
  • Zero “Re-Work”: Scientists can instantly verify if an experiment has been run before, saving thousands of dollars in wasted materials and lab time.
  • Adoption at Scale: The tool was rapidly adopted by the R&D team, becoming a central part of their innovation process for chemical formulation optimization.

Technical Stack

Core Tech

Generative AI, RAG (Retrieval-Augmented Generation), LLM

Cloud Infrastructure

AWS (Bedrock, Lambda, OpenSearch)

Data Processing

Python, LangChain, OCR

The Engine Behind the Solution

This project was accelerated by our AI Center of Excellence (CoE). By leveraging pre-built accelerators, proven MLOps frameworks, and a deep bench of 1,000+ AI engineers, we delivered an enterprise-grade, secure RAG solution in a fraction of the time required for a ground-up build.

The GenAI Accelerator Stack

Cardinal Peak/FPT AI CoE: A three-layered GenAI Accelerator Stack illustrating our infrastructure layer (AWS, NVIDIA, Kubernetes), AI/ML Ops layer (LangChain, MLflow, Vector Databases), and a security and compliance layer for governance and privacy.

Explore Materials Informatics & GenAI Related Solutions

Generative AI Services

Generative AI Services

Learn how we build custom LLM applications that are secure, scalable, and private.

Market: AI for Materials Science

Market: AI for Materials Science

Discover other ways we are helping the chemical and materials industry innovate faster.

Generative AI for R&D Frequently Asked Questions

Can Generative AI (LLMs) be used securely with proprietary corporate R&D data?

Yes, Gen AI (LLMs) can be used securely with proprietary corporate R&D data if architected correctly. It requires an “inference-only” approach within a secure environment (like a Virtual Private Cloud), ensuring your proprietary data is never used to train public models. Leading enterprises use Retrieval-Augmented Generation (RAG), where the AI is granted temporary access to internal documents to answer a specific query without absorbing the data into its core model.

How do you prevent AI models from "hallucinating" incorrect scientific facts or chemical formulas?

In scientific contexts, relying on an LLM’s general knowledge is risky. The solution is a strict RAG architecture. The model must be instructed to only generate answers based on facts retrieved from a verified internal knowledge base (like a vector database of verified reports) and to provide citations for every claim, allowing scientists to instantly verify the source.

Additionally, because our RAG architecture utilizes intelligent chunking that understands document structure, the system can reliably retrieve data from Safety Data Sheets (SDS) without losing the context of the chemical properties listed.

What is the best way to digitize unstructured scientific legacy data like PDFs and handwritten lab notes for AI?

Standard OCR is often insufficient for complex scientific documents. A robust data engineering pipeline is required, utilizing advanced computer vision to recognize document structure ( differentiating between text, chemical diagrams, and tables) and intelligent “chunking” strategies to preserve the semantic context of the data before it is ingested into a vector database.

What is the role of a Vector Database in a RAG architecture for R&D?

A vector database is essential for searching unstructured scientific data. Unlike standard keyword search, it stores the “semantic meaning” of data as mathematical vectors. This allows an R&D system to understand contextual relationships between chemical formulations, test parameters, and results across thousands of disparate PDF reports, retrieving the exact context needed to answer complex scientific queries.

Why use AWS Bedrock for building secure enterprise Generative AI applications?

AWS Bedrock is often chosen for enterprise applications because it provides secure, private access to leading foundation models via a single API. Crucially for R&D, Bedrock ensures data privacy: your proprietary data used for inference is encrypted, remains within your Virtual Private Cloud (VPC), and is never shared with model providers or used to train their public base models.