Data from hospitals is critical in training AI and making healthcare better, says structural biologist Janet Thornton
“Do you know what ontology is?” asks Professor Janet Thornton, a pioneer in structural bioinformatics, as I sit down with her for a conversation. I shake my head.
Using a table in the room as an example, she explains the concept within a minute: an ontology is essentially a framework that represents knowledge in a structured format so that AI systems can interpret it. I remember what she said earlier, during a more informal chat: “Science is very easy if you try to understand it, if you break it down.”
Director Emeritus at the European Bioinformatics Institute (EBI), Prof Thornton led research groups studying the biology of proteins and ageing during her time at EBI and has been vocal about the importance of open data and tools. She was in Bengaluru to deliver a lecture organised by Shibulal Family Philanthropic Initiatives (SFPI), in collaboration with TIFR.
The Hindu caught up with Prof Thornton to understand her views on the importance of data infrastructure in improving healthcare, developments in the field of bioinformatics, and where these studies could lead.
When you started off your career, there were about 20 known protein structures. And now there are about 200,000 structures…
There were 230,000 in the database last Thursday. But now we can predict structures using AlphaFold and AI, and they have predicted the structures for 214 million proteins. Going from 20 to 240 million is just phenomenal. It’s like discovering a new world. With the technologies, crystallography and electron microscopy, and now the prediction built on that data, we can see the world of proteins that basically enact the body. And it’s fantastic.
Research has come a long way in this field. What does that mean for the health sector?
It has had an impact, but it’s still really just developing. We know much more about our physiology – how metabolism works and how viruses invade.
For example, we saw with the COVID virus how quickly the structures of the spike protein, which helped in the design of the vaccine, could be characterised. I think we’ve come a long way in our general understanding of proteins. Now, translating this knowledge into new medicines is only just starting. I don’t think we’re quite there, but I think eventually this will lead to a new way of doing medicine.
I’m not a medical doctor, so this is just from my perspective. But often medicines only treat the symptoms. I think that will change because once we have enough data, we can improve the rational design of new medicines. I think every health service in the world is struggling with the demand, but I believe we will gradually improve our understanding, and that will impact medicine for everybody.
I sit on the board of Health Data Research U.K. It’s a government-funded organisation to try to coordinate health data and make it available for research. It’s taken 50 years to establish this huge set of data resources that can be used openly for basic biological research. Health data is quite different. And in my opinion, it’s going to take 10 to 20 years to coordinate that health data and make it accessible. It’s difficult because the health data relates to people. It’s personal. So, there are ethical concerns which must be considered.
The other side is that you want to make this freely available for research, but there are also commercial interests.
So, I think the impact at the moment is still for basic biological research. The impact on medicine is just beginning, mainly with genetic diseases. There are some diseases that are more tractable. But chronic diseases, like cardiovascular problems, dementia, diseases of old age and so on are much more challenging. I think these areas remain challenging.
There are also opportunities with respect to the delivery of medicine in the hospitals. In the U.K., the NHS is really struggling because of the demand. I think there are ways to make hospitals more efficient and to handle the flow of patients.
Have you ever felt that when AI becomes an obsession, other critical areas of research in bioinformatics might get overshadowed?
The problem I see is different. In order for AI to be effective, you need good data. The only reason AlphaFold worked was because it was trained on 170,000 structures that had already been solved. The protein structure prediction problem is almost the ideal AI problem. When we look at health, the question is whether we can effectively use the data currently available. There’s a huge amount of data, but it’s not curated, stored, and shared. The lack of such data is a game-breaker. So, what question do you get the AI to learn how to solve? Adequately defining the question for the AI to solve is not easy and it is critically important.
Then there’s the problem of ontologies. If you don’t use the same words, the computer doesn’t know that it’s the same object. Similarly, when it comes to health, the descriptions of diseases are not sufficiently standardised, making it difficult to combine different types of data.
The other big problem is that every human being is an individual. And we all have different genetic backgrounds. Things affect each of us differently. That’s another complication.
Capturing this data and making it available to improve clinical practice is also difficult because of the ethics and privacy concerns. Implementing solutions is another challenge due to the different procedures and protocols, which are often long and tortuous.
Then there is the commercial involvement. Drug companies don’t share their data because commercially such data can be valuable.
In the U.K. alone, we have different hospitals run by different Trusts, and they don’t share their data. They don’t store it in the same way. They don’t use the same protocols. And that makes the data not so useful, so sorting that data out is a big, big problem.
For health data, it’s really challenging to make the data accessible for research. But it’s a critical health infrastructure. We need to make the data from hospitals visible so that it can be used to train AI programs that really can tease out new information from data. But you need good data. You need clean data. And what you get out reflects what you put in, as always.
In which direction would you like to see the research progress and impact humankind in the years to come?
I think I’m quite naive about this because I’m not a clinician. But there are 8 billion people in the world and out of these there will be somebody who has a similar life journey to me with the same problems and the same positives and negatives. We could see if we can somehow bring together all those life journeys.
A doctor normally relies on their own experience, which is limited by the amount of information any one doctor holds. My feeling is that if we could gather the data out there, there’s the potential for better, more accurate prognosis and diagnosis. I’d like to see the real understanding come from all the different technologies we’ve got in life sciences.
There’s also the potential of improving ecology and agriculture.
And then there is green chemistry. We’ve been looking at enzyme mechanisms. There are many new enzymes in the bacteria at the bottom of the sea and in wastewater and so on that will be able to do new chemistries. So, if we learn how to design proteins that will do the green chemistry, that would be fantastic. It might even help to tackle climate change.
Published – February 07, 2025 09:00 am IST