Over the past five years, artificial intelligence (AI) and machine learning (ML) have made great strides, and the pharmaceutical industry has started applying these technologies in areas as diverse as clinical trial design and the design of DNA-encoded libraries. Not a day goes by without a new ChatGPT article coming across my newsfeed, so we thought we’d take a look at a class of AI algorithms known as Generative AI (of which ChatGPT is one example).
Over the past few months we’ve been working on a whitepaper to document the state of the art of AI/ML in drug discovery. As part of the preparation for writing this whitepaper, we experimented with different tasks to help answer the question “is AI ready for prime time for a given task?”
In this case, we were interested in finding out if we could use ChatGPT (and other Generative AI algorithms) to mine drug development pipeline information. Our Pharm2Market database contains the pipelines of over 800 drug discovery companies in the US. The process of keeping the data up-to-date involves a lot of manual curation work. So we were interested in looking for ways to apply Generative AI to help extract pipeline information from the websites of these companies and reduce the amount of manual labour required to curate the Pharm2Market database.
In this experiment, we looked at the pipeline of Foghorn Therapeutics. As you’ll see from the transcript of our experiment, the user experience for ChatGPT is “chat-like”. And what you end up with is an iterative “conversation” where you are trying to constantly refine the results you get into something that is usable.
The conversation started with:
What drug development projects does Foghorn Therapeutics have in its pipeline?
Foghorn Therapeutics is a biotechnology company focused on developing medicines that target the chromatin regulatory system to treat cancer and other serious diseases. As of my knowledge cutoff of 2021, Foghorn Therapeutics had several drug development projects in its pipeline. Here are some examples:
- FT-4202: Foghorn’s lead candidate is a small molecule modulator of sickle cell disease (SCD) that targets the fetal hemoglobin (HbF) pathway. FT-4202 is currently in Phase 2 clinical trials for the treatment of SCD.
- FT-7051: This is a selective inhibitor of the bromodomain and extraterminal (BET) family of proteins, which play a role in regulating gene expression. FT-7051 is being developed for the treatment of various solid tumors and is currently in preclinical development.
- FT-8225: This is a small molecule inhibitor of the WDR5 protein, which is involved in the formation of chromatin regulatory complexes. FT-8225 is being developed for the treatment of acute myeloid leukemia (AML) and other hematologic malignancies and is currently in preclinical development.
- FT-2102: This is a selective inhibitor of mutant isocitrate dehydrogenase 1 (IDH1) enzymes, which are implicated in the development of certain types of cancer, including AML and gliomas. FT-2102 is currently in Phase 1 clinical trials for the treatment of these diseases.
Please note that drug development is a complex and ever-changing process, and the information I provided may no longer be up-to-date or accurate. It is always best to consult the company’s official website or other reliable sources for the most current information.
Where did you get this information?
As a language model, I do not have access to the internet or any external sources of information. However, I was trained on a large corpus of text that includes various sources of information, such as scientific articles, news articles, and other publicly available sources.
In the case of Foghorn Therapeutics' drug development projects, the information I provided was based on publicly available sources such as the company's official website, press releases, and news articles. It is important to note that my knowledge cutoff is September 2021, so there may be updates or changes to the company's pipeline since then.
The key part of that answer is the knowledge cutoff date of September 2021, which means that any data ChatGPT gives us might be outdated. So we used the Wayback Machine, an internet archive, to see what Foghorn’s pipeline looked like in September 2021.
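For readers who want to repeat the archive lookup programmatically, here is a minimal sketch that builds a query for the Internet Archive's availability endpoint, which reports the capture closest to a given date. The function name and the `example.com/pipeline` address are hypothetical placeholders, not Foghorn's real pipeline page; the sketch only constructs the URL and leaves the HTTP request to whatever client you prefer.

```python
from datetime import date
from urllib.parse import urlencode

# Availability endpoint of the Internet Archive's Wayback Machine.
WAYBACK_AVAILABILITY_API = "https://archive.org/wayback/available"


def wayback_availability_url(page_url: str, snapshot_date: date) -> str:
    """Build a query asking for the archived capture closest to snapshot_date."""
    query = urlencode({
        "url": page_url,
        # The endpoint expects a YYYYMMDD (or longer) timestamp.
        "timestamp": snapshot_date.strftime("%Y%m%d"),
    })
    return f"{WAYBACK_AVAILABILITY_API}?{query}"


# Hypothetical address: substitute the real pipeline page of the company.
print(wayback_availability_url("example.com/pipeline", date(2021, 9, 30)))
```

Fetching the printed URL should return a small JSON document describing the closest capture, if one exists.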

Already we can see a number of discrepancies. The project IDs are all wrong, not to mention the targets, indications and phases.
At this point in the experiment, though, we wanted to improve the form of the answer first, and then work on improving its accuracy.
Express your previous answer in Foghorn Therapeutics pipeline in json format.
{
"lead_candidate":{
"name":"FT-4202",
"indication":"sickle cell disease",
"clinical_phase":"Phase 2"
},
"candidates":[
{
"name":"FT-7051",
"indication":"various solid tumors",
"clinical_phase":"preclinical"
},
{
"name":"FT-8225",
"indication":"acute myeloid leukemia and other hematologic malignancies",
"clinical_phase":"preclinical"
},
{
"name":"FT-2102",
"indication":"acute myeloid leukemia and gliomas",
"clinical_phase":"Phase 1"
}
]
}
For the drug Target information please include the entrez gene ID for each drug target
Unfortunately, I do not have access to a database that provides the Entrez Gene IDs for the drug targets mentioned in my previous responses. However, I can provide you with a general format for including Entrez Gene IDs in a JSON representation of drug targets:
{
"candidates":[
{
"name":"FT-4202",
"drug_target":{
"name":"fetal hemoglobin (HbF) pathway",
"entrez_gene_id":"12345"
},
"indication":"sickle cell disease"
},
{
"name":"FT-7051",
"drug_target":{
"name":"bromodomain and extraterminal (BET) family of proteins",
"entrez_gene_id":"67890"
},
"indication":"various solid tumors"
},
{
"name":"FT-8225",
"drug_target":{
"name":"WDR5 protein",
"entrez_gene_id":"23456"
},
"indication":"acute myeloid leukemia and other hematologic malignancies"
},
{
"name":"FT-2102",
"drug_target":{
"name":"mutant isocitrate dehydrogenase 1 (IDH1) enzymes",
"entrez_gene_id":"78901"
},
"indication":"acute myeloid leukemia and gliomas"
}
]
}
Here we can see that ChatGPT has made up placeholder gene IDs rather than looking up the appropriate ones, so we next specified where to find the right answer.
You can find the entrez gene IDs here https://www.ncbi.nlm.nih.gov/gene/
Thank you for providing the source of the Entrez Gene IDs. Here's an updated example of the JSON representation of drug targets with the corresponding Entrez Gene IDs:
{
"candidates":[
{
"name":"FT-4202",
"drug_target":{
"name":"fetal hemoglobin (HbF) pathway",
"entrez_gene_id":"3043"
},
"indication":"sickle cell disease"
},
{
"name":"FT-7051",
"drug_target":{
"name":"bromodomain and extraterminal (BET) family of proteins",
"entrez_gene_id":"6872"
},
"indication":"various solid tumors"
},
{
"name":"FT-8225",
"drug_target":{
"name":"WDR5 protein",
"entrez_gene_id":"11091"
},
"indication":"acute myeloid leukemia and other hematologic malignancies"
},
{
"name":"FT-2102",
"drug_target":{
"name":"mutant isocitrate dehydrogenase 1 (IDH1) enzymes",
"entrez_gene_id":"3417"
},
"indication":"acute myeloid leukemia and gliomas"
}
]
}
And this was key to getting the right information. We found that if we specified where to find the right information, ChatGPT would start improving its answers. When we later went back and specified the URL for Foghorn’s pipeline, we started getting better answers for the pipeline. The final result (after much back-and-forth) is shown below:
{
"projects": [
{
"project_name": "FHD-286",
"targets": [
{
"gene_symbol": "SMARCA4",
"entrez_gene_id": "6597",
"uniprot_id": "P51532"
},
{
"gene_symbol": "SMARCA2",
"entrez_gene_id": "6595",
"uniprot_id": "P51531"
}
],
"therapeutic_area": "Cancer",
"therapeutic_class/modality": "Small molecule",
"indications": ["AML", "MDS", "Uveal Melanoma"]
},
{
"project_name": "FHD-609",
"targets": [
{
"gene_symbol": "PRMT5",
"entrez_gene_id": "10419",
"uniprot_id": "P41229"
}
],
"therapeutic_area": "Cancer",
"therapeutic_class/modality": "PRMT5 inhibitor",
"indications": ["Cancer"]
},
{
"project_name": "FHD-286/609 Combination",
"targets": [
{
"gene_symbol": "SMARCA4",
"entrez_gene_id": "6597",
"uniprot_id": "P51532"
},
{
"gene_symbol": "SMARCA2",
"entrez_gene_id": "6595",
"uniprot_id": "P51531"
},
{
"gene_symbol": "PRMT5",
"entrez_gene_id": "10419",
"uniprot_id": "P41229"
}
],
"therapeutic_area": "Cancer",
"therapeutic_class/modality": "BET/PRMT5 inhibitor",
"indications": ["Cancer"]
},
{
"project_name": "FHD-818",
"targets": [
{
"gene_symbol": "PPARG",
"entrez_gene_id": "5468",
"uniprot_id": "P37231"
}
],
"therapeutic_area": "Metabolic Disorders",
"therapeutic_class/modality": "PPARγ modulator",
"indications": ["Metabolic Disorders"]
},
{
"project_name": "FHD-286/818 Combination",
"targets": [
{
"gene_symbol": "SMARCA4",
"entrez_gene_id": "6597",
"uniprot_id": "P51532"
},
{
"gene_symbol": "SMARCA2",
"entrez_gene_id": "6595",
"uniprot_id": "P51531"
},
{
"gene_symbol": "PPARG",
"entrez_gene_id": "5468",
"uniprot_id": "P37231"
}
],
"therapeutic_area": "Metabolic Disorders",
"therapeutic_class/modality": "BET/PPARγ modulator",
"indications": ["Metabolic Disorders"]
}
]
}
There are still discrepancies in the results though. In some cases, the system returned “Cancer” for the indications, despite the fact that the pipeline page clearly showed a number of specific cancers — “cancer” vs “uveal melanoma” for example.
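This kind of discrepancy can be caught mechanically before a curator reviews the output. The sketch below assumes the JSON layout of the result above (`projects`, `therapeutic_area`, `indications`) and flags any project whose indications merely repeat its therapeutic area. The function name is our own, and the sample data is an abbreviated copy of two entries from the result.

```python
import json


def flag_generic_indications(pipeline: dict) -> list[str]:
    """Return names of projects whose indications only repeat the therapeutic area."""
    flagged = []
    for project in pipeline.get("projects", []):
        area = project.get("therapeutic_area", "").strip().lower()
        indications = [i.strip().lower() for i in project.get("indications", [])]
        # An empty list, or a list that only echoes the area, carries no detail.
        if not indications or all(i == area for i in indications):
            flagged.append(project.get("project_name", "<unnamed>"))
    return flagged


# Abbreviated copy of two entries from the result above.
sample = json.loads("""
{"projects": [
  {"project_name": "FHD-286", "therapeutic_area": "Cancer",
   "indications": ["AML", "MDS", "Uveal Melanoma"]},
  {"project_name": "FHD-609", "therapeutic_area": "Cancer",
   "indications": ["Cancer"]}
]}
""")
print(flag_generic_indications(sample))  # prints ['FHD-609']
```

A flagged project is only a prompt for manual review; the check cannot tell whether the specific indications that were returned are correct.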
Conclusions
Although it is possible to eventually get something approximating what you want out of the system, there are numerous caveats:
- Be as specific as you can as early in the conversation as possible.
- Give ChatGPT the URLs of the sources that you want to draw on in order to generate the data that you’re after.
- It doesn’t support ontologies yet, which means that the results you get may not be exactly what you’re looking for.
- The results highlight the need for more open, domain-specific Large Language Models (LLMs).
- It doesn’t have a way of citing its sources yet. So when you ask it, “where did you get this data”, it’s not going to pull up the page and highlight the sentence where it got the information.
- Double-check all data.
- The current data cutoff date of September 2021 means that if your conversation requires more recent data, you’re out of luck.
- Lastly, what this exercise points out is the need for a common FAIRified data format for expressing drug pipeline information. In my BioITWorld talk on “The Secret Names of Things”, I highlighted the lack of a standard format as an obstacle to making pipeline data easily searchable, which in turn makes it harder for potential investors to find likely investments.
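One way to act on the "double-check all data" caveat is to compare each gene symbol and Entrez Gene ID pair in the output against NCBI itself. The sketch below builds a request for NCBI's E-utilities ESummary service and compares its JSON answer with the claimed symbols. The function names are our own, and the code assumes the response nests each record under `result` and then the gene ID, with the official symbol in the `name` field; treat that layout as an assumption to verify against a live response.

```python
from urllib.parse import urlencode

# ESummary endpoint of NCBI's E-utilities.
ESUMMARY_ENDPOINT = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(gene_ids: list[str]) -> str:
    """Build an ESummary request for a batch of Entrez Gene IDs."""
    query = urlencode({"db": "gene", "id": ",".join(gene_ids), "retmode": "json"})
    return f"{ESUMMARY_ENDPOINT}?{query}"


def mismatched_targets(targets: list[dict], esummary: dict) -> list[str]:
    """List targets whose claimed symbol differs from the one NCBI reports.

    Assumes the ESummary JSON layout: result -> <gene id> -> "name".
    """
    records = esummary.get("result", {})
    problems = []
    for target in targets:
        gene_id = target["entrez_gene_id"]
        claimed = target["gene_symbol"]
        official = records.get(gene_id, {}).get("name")
        if official is None:
            problems.append(f"{claimed}: gene ID {gene_id} not found")
        elif official.upper() != claimed.upper():
            problems.append(f"{claimed}: gene ID {gene_id} is {official}")
    return problems
```

To use it, collect the `targets` entries from the generated JSON, fetch `esummary_url(...)` with any HTTP client, parse the response as JSON, and pass both to `mismatched_targets`. An empty list means every claimed symbol matched the record for its ID.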

