

Basecamp Research Launches BaseFold: A Breakthrough in 3D Protein Structure Prediction


Basecamp Research has announced the launch of BaseFold, a new deep learning model that predicts the 3D structures of large, complex proteins more accurately than other AI-powered tools, including the industry gold standard, AlphaFold2. The results were recently published as a preprint on bioRxiv.


BaseFold was created by augmenting the AlphaFold2 model, which predicts the 3D structure of a protein based on its amino acid sequence, with BaseGraph. BaseGraph is Basecamp Research’s purpose-built foundational dataset for biological AI, collected via access and benefit-sharing partnerships with over 25 biodiversity-rich countries. The published accuracy improvements are only a starting point: BaseFold continues to improve week over week as Basecamp Research scales its global network of biodiversity partnerships. Furthermore, Basecamp Research will be working with NVIDIA to optimise and productionise BaseFold for NVIDIA BioNeMo, a generative AI platform for drug discovery.


Experimental methods such as X-ray crystallography remain the scientific benchmark for determining protein structure, but they are slow. AlphaFold2’s development in 2020 provided a breakthrough for the use of AI across biotechnology, giving scientists confidence in AI-based structural predictions. A wide array of structure prediction models has since followed AlphaFold2, most notably RoseTTAFold and ESMFold.


However, the performance of these models is highly dependent on their training data; all are trained on public protein databases that are widely seen as unfit for biotech’s AI era. These public training datasets are small, unreliable and heavily biased toward proteins from laboratory model organisms. The sequence data captured in these public databases is estimated to represent less than 0.000001% of life on Earth. These data limitations mean that existing AI tools work well for predicting the structures of smaller, simpler proteins that are well-represented in public datasets but often struggle beyond that, creating major problems for those using AI to develop complex new medicines.


AlphaFold2 draws heavily from the public MGnify database, known for having issues with incomplete sequences, which can impact the quality of structures predicted for larger proteins. Basecamp Research’s BaseFold tackles the next big computational challenge, which is to achieve crystallography-level accuracy for larger, more complex proteins, especially those underrepresented in existing protein sequence databases.


To do this, BaseFold extracts orders of magnitude more meaningful evolutionary information from over 6 billion relationships in BaseGraph. BaseGraph is replete with extensive genomic context and comprehensive metadata, and training algorithms on it has been shown to yield significant advances in the performance of a wide range of biological AI models, including AlphaFold2 as presented here.


In this preprint, Basecamp Research scientists evaluated BaseFold’s performance in predicting the structure of various proteins selected from the CASP15 (Critical Assessment of Structure Prediction) competition and CAMEO (Continuous Automated Model EvaluatiOn) community project.


Publication Result Highlights

  • Basecamp Research’s purpose-built foundational dataset allowed BaseFold to improve the accuracy of AlphaFold2’s predicted structures by up to 6-fold.
  • The team demonstrated up to a 3-fold improvement in modelling accuracy for small molecule interactions with protein targets.
  • BaseFold unlocks more reliable 3D structure predictions and small molecule docking for larger and more complex proteins than ever before, particularly those that are underrepresented in public datasets.
  • This step change is poised to greatly accelerate drug discovery efforts, where understanding these interactions will allow more advanced therapeutic molecules to be developed using AI.

“We have redesigned and rebuilt the entire data acquisition process, making us the first team ever to collect and annotate biodiversity data with the same quality as human clinical genetic data — all purpose-built for the AI era,” said Dr. Phil Lorenz, CTO of Basecamp Research. “BaseGraph, the most diverse and comprehensive dataset of its kind, is the core driver of our advances in AI. The results of this publication prove that more diverse, representative genomics data allows for step-change algorithm improvements without the need for extensive lab-in-the-loop infrastructure. Our database is growing every week, and as a result, BaseFold is improving every week, too.”


“AlphaFold is one of the most useful AI tools in drug discovery, and for good reason. It enables researchers to better predict how medicines may interact with proteins in the body, shaving off years of work. However, AlphaFold still has significant room for improvement – particularly when being used to predict large, complex and underrepresented proteins, which are often the most critical for the development of new therapeutics. Even just a few percentage points of error can have major implications in accurately predicting protein-molecule interactions,” said Dr. Glen Gowers, co-founder of Basecamp Research.


“We know that when it comes to AI, the best data produces the best outcomes, and it’s rewarding to know that the new, purpose-built foundational dataset that we have built is already having widespread implications for drug development and human health,” Dr. Gowers added. “We’re not stopping here, though – we are continuing to scale our biodiversity partnerships and apply this data advantage across more and more biological AI models.”