Over the next few minutes, he explained that his brother-in-law is Nacho Esteban – the CEO of a biotechnology company called 24Genétics. Up till then, Nacho’s business had focused on DNA tests to show people their ancestry. But now, with Madrid facing another wave of Covid-19 cases, he had decided his company should try to do something to help.
“He’s started working with La Paz,” said Jose Luis. Where I’m from, everyone knows Hospital Universitario La Paz. It’s one of the most important clinics in Madrid. “He needs somebody with your skills, Carlos,” Jose Luis continued. “Someone who’s passionate about data, and knows how to attack it.” Now, I worry that over the years Jose Luis has built up a higher opinion of my skills than I have myself. But he was right in at least one sense: my background is in physics, but data has always been my major fascination – ever since I started at Lucent Technologies in the late 1990s.
In December last year, Spain was being hit especially hard by the second wave of infections, and Intensive Care Units were rapidly filling up. So, naturally I agreed to speak to Nacho and find out about his arrangement with Hospital Universitario La Paz. He described how they had sent him a modest batch of anonymised Covid-19 patient data, so he could investigate ways it might be harnessed to help fight the virus. Now, he was asking if I could help.
Unsurprisingly, I pounced.
At the end of a year like no other, I confess I was quietly looking forward to a rest over the holiday season. But there’s a saying which goes: “When opportunities fall into your lap, be ready to pounce.” And this was an opportunity. A chance to not only do what I love doing – but to boost my knowledge and perhaps even help in the battle against coronavirus. So unsurprisingly – I pounced.
At this point, perhaps I should clarify that I’m not very familiar with the healthcare sector. My work as a data scientist for Teradata is largely focused on telecoms and banking. So I spend my time answering questions like: “What’s the likelihood this person will buy themselves a new smartphone?” or “Should this person be approved for a mortgage?”. So it was a pleasant surprise when – having scribbled my name on all the requisite NDAs – Nacho sent me the patient data and I found the structure and variables to be very similar to what I’m used to.
I was looking at roughly 200 rows of data. Each row represented a Covid-19 patient. And each of the 50-or-so columns contained what we call ‘covariates’ – variables such as: age, sex, heart rate, pulse etc. The final and most sobering covariate stated whether the patient survived the virus or not. Nacho’s challenge to me was to take all that data, and build an algorithm that could analyse a new patient’s data, and predict how at-risk they were from the virus.
I reached for my toolbox of algorithms.
The first job was to investigate the covariates and cleanse the data, being careful to avoid what’s called ‘over-adjustment’. After that, it was a case of taking out my toolbox of algorithms, some of which date back as far as the 1960s. Others are more recent: ‘state of the art’ algorithms with the power to capture detailed patterns in the target variable.
You may be surprised when I tell you that working with this Covid-19 data was actually very similar to my work with telcos and banks. The principles were almost exactly the same, and the algorithms I used were very similar. The main difference was that – while I often work with terabytes of information spread across millions of rows – this was a comparatively small data set.
After working with the data for some time, the moment finally came when I’d be able to test what I’d created. A 75-85% accuracy level is considered very good for a working model. 90-95% is considered excellent. My new algorithm was showing 95-97% accuracy – far better than I could have dared hope for. It was only after checking several times that I allowed myself to sit back and smile.
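For readers curious how an accuracy figure is commonly checked on a small data set, here is a minimal Python sketch of k-fold cross-validation. It is not the model or the patient data described above: the rows are synthetic, the nearest-centroid classifier is a deliberately simple stand-in, and every function name is hypothetical. The idea is simply that each row is scored by a model that never saw it during training.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle the row indices and deal them into k near-equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_centroids(rows, labels):
    """Stand-in model (an assumption): the mean feature vector per class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Return the class whose centroid is nearest in squared distance."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def cross_validated_accuracy(rows, labels, k=5, seed=0):
    """Train on k-1 folds, score the held-out fold, repeat for every fold."""
    folds = k_fold_indices(len(rows), k, seed)
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        centroids = fit_centroids(
            [rows[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        correct += sum(predict(centroids, rows[i]) == labels[i] for i in held_out)
    return correct / len(rows)


# Synthetic stand-in data: 200 rows, 2 made-up covariates, a binary outcome.
rng = random.Random(42)
labels = [1 if i % 4 else 0 for i in range(200)]  # 150 ones, 50 zeros
rows = [[rng.gauss(2.0 * y, 1.0), rng.gauss(-1.0 * y, 1.0)] for y in labels]

accuracy = cross_validated_accuracy(rows, labels)
# Accuracy of always guessing the more common outcome, for comparison.
majority_baseline = max(labels.count(0), labels.count(1)) / len(labels)
print(f"cross-validated accuracy: {accuracy:.2f}")
print(f"majority-class baseline:  {majority_baseline:.2f}")
```

Printing the majority-class baseline next to the cross-validated figure gives the accuracy number some context, because a data set where most outcomes are the same can yield a high score from guessing alone.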
Now, it’s over to the doctors.
As I write this, doctors are discussing how to use the algorithm as a diagnostic tool. The truth is: it’s still very early days. And there are still more tests to be done and ethical considerations to be addressed.
What I can tell you is that the algorithm will be passed over to the Covid-19 Host Genetics Initiative, a project that aims to bring together the world’s genetic community to analyse and share data to grow our understanding of the virus.
For my part, I’m now spending evenings and weekends working through the anonymised genetic information from the initial data set. It’s far more complex work, and it requires far more understanding and processing power. However, if I can build a new algorithm that takes genetics into account, it could open up all sorts of new opportunities for fighting coronavirus.
I’m beginning to think I may be on to something. But then again, I may not.
Either way, it’s a hugely rewarding side-project for me, and was certainly a worthwhile use of my Christmas vacation. I think it goes to show how the sort of data analytics capabilities we offer at Teradata can be applied in new and unexpected areas.
It’s not the first time data science has been used in healthcare, and it certainly won’t be the last. In fact, maybe Jose Luis is planning to introduce me to some other relatives before the Easter holiday!