Why data cleaning and curation is not something that can be handled easily
Please explain who you are and what you do?
I'm working as a research scientist at the University of Luxembourg. Basically, I work across the whole area of bioinformatics IT, which includes both infrastructure and data science in translational research. I'm also in charge of coordinating activities related to high-performance computing, especially for biomedical research applications.
I don't know if you've heard of Simon Sinek’s golden circle which is about understanding what your ‘why’ is. What is your why?
I think the main reason is that life science and medical research are the starting points of tackling all the health-related problems. So, all the findings from that research are providing new knowledge and insights that will be, eventually, translated into a new diagnosis method or treatment.
But currently, with the progress of high-throughput technologies, this whole area of biomedical research is becoming more data intensive. The problem is that many of the key elements in this research area are not prepared for this.
I think the first and most important key element is the researchers, the people. Most are trained as bench scientists. They are very good at handling certain technologies and producing research findings. But if you ask them to handle this massive amount of data, with the underlying links between them, this is quite challenging for them.
Secondly, I think the traditional tools we are using are also not sufficient, and this is worsened by the complexity of the data we are handling now. Last, but not least, the whole infrastructure, especially the IT infrastructure, is not yet set up to provide flexible and sustainable data platforms for researchers to carry out this research. That's the main reason we tried to tackle this from the bottom up.
What do you think the consequences would be if we don't do that right? If we don't do that well?
Currently, what I see is that many people get lost in the data. They generate a huge amount of data, but the first thing they do is call their friends or colleagues who know a bit about computers and ask, “Can you do something for me? Because I can generate the data, but I will not be able to analyse it properly.” And that's where the research gets slowed down. Slowing down the research means that innovations and findings reach the patients they would help more slowly as well.
It takes about 17 years, on average, for a research finding to reach clinical implementation. And that's an incredibly long time. People can find a new biomarker or a new target, but what matters is the things that can actually move us across that gap faster, such as making the data process much more efficient and robust.
What is the best project you've ever worked on in the life sciences and why?
For me, personally, I would say the eTRIKS project. This is the first project I have worked on where the whole team is openly and extensively trying to explore the challenges of data and knowledge integration.
This is a multi-partner public-private partnership project. It gathers people from different areas, including the pharmaceutical industry, data scientists, people who are familiar with infrastructure management, and people who develop data standards. We also have experts on the ethical and legal aspects of the research. And not to forget that the whole project is supported by an excellent project management and operations team. I really think this project works well compared with the other projects I've been working on.
You mentioned that other projects were different. What was different? They just didn't talk to each other?
I think in eTRIKS we have a very well defined problem and everyone was feeling the same pain. Besides that, I think the team really tried to understand each other from different aspects. And they worked towards the same goal.
In the other projects I've been working on at a similar scale, people tend to work in small silos. So, they have their own part of the whole project. It's not an integrated project, but a kind of glued-together set of sub-projects, which is not as interactive as eTRIKS, where everyone is actually pushing towards the same goal in the end.
What's the biggest challenge you've encountered in projects?
I think the big challenge is communication. At the beginning, people, especially people like me who come from a scientific, academic background, always think the biggest challenge is going to be research problems or technical problems. But in the end, it's communication. Because one thing I learned is that if you cannot effectively understand what other people need, and make other people understand the challenges you are facing, it's very difficult to actually develop anything that is satisfying for the users. The typical scenario is that you didn't ask the users before developing something that they'll need, or the users think that what they are asking for is an easy task and then give you limited resources to do it.
And I think what you touched upon earlier is the idea that researchers don't have the skills or the knowledge to handle data. I think that issue of communication that I've seen between data scientists and researchers is important. But it's also extremely challenging in that same way. So, I agree with you on that, for sure. What is the number one misconception that you've encountered about data?
The typical answer we get, especially when we ask about data quality, is “my data is clean, and if not, I will ask my student to handle it”. Then you dive into the data and you always find problems. That's a general problem. But beyond that, people think of data cleaning and curation as something that can be handled easily. That's also a misconception.
Because what we learned in the projects is that curation is a highly professional area. It requires highly trained people who have not only the domain knowledge of the data generated, but also adequate IT skills to handle data. They need good communication skills and the knowledge to produce traceable and reproducible work, because they should be able to explain what they have done to the data. These are all elements that are expected, needed, of curators. We have learned this also from other curators, and especially from the reference data management team. If you hire a student to curate data, what you get is a data set that you have to curate again with post-docs.
That's a bit provocative, but yes. There's a lot in what you just said, and I want to pull some things out here. Just to start with the end point there, I think this is a really interesting point that ties back to all of it. I've recently done some reading which says that when people have a little bit of knowledge, they tend to think they know what needs to happen. As you get more knowledge, you think you have no idea what needs to happen. And that's an interesting tension there.
And I think the key is you have to just be humble that you don't know. And at the same time, the flip side is, when you know, you kind of have to be more certain that you do know. That you actually have something to add to this whole picture. I think that's what you're kind of getting at. And I think that's really interesting.
You talked in the beginning there about they say it's clean. When they said that, what did you find that wasn't clean? What was the problem?
Much of the retrospective data is collected either at multiple sites or using different tools, sometimes even on paper. And of course, without the perfect tool or validation steps, there will be many data points that violate the definition of the variable you are collecting. This is a typical error that you will encounter. Later, of course, you have to go deeper into the data, for example to check whether there are duplications.
So, I think that's an interesting one, because people often think in a very simple way: that unclean data means missing data or a misplaced decimal point. But what you're really saying is that the rules for that data collection were not applied, or were applied inconsistently. Right?
Or they don't have the right tools to check this. They probably have a very good study protocol with defined data dictionaries. But following this dictionary during data collection is the challenging part.
What is a data dictionary?
A data dictionary, from a data curator's point of view, is the definition of each variable you are collecting. What type it is, for example. Is there a certain pattern or range of values? Does it have a certain link to other variables you are collecting? Take BMI, for example. You don't collect it directly. You are actually calculating it based on weight and height. These are all in the definitions. Without these definitions, you cannot do any curation. This is like the guidelines or, if you want, the metadata of everything that you are collecting.
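To make the idea concrete, here is a minimal Python sketch of a data dictionary that records a type, a value range, and a derivation rule for BMI. The `VariableDef` structure, the field names, the numeric ranges, and the tolerance are all illustrative assumptions, not the template used in eTRIKS or in any particular study.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class VariableDef:
    """One data dictionary entry (hypothetical structure, for illustration)."""
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    derived_from: Tuple[str, ...] = ()
    derive: Optional[Callable[..., float]] = None


def bmi(weight_kg: float, height_m: float) -> float:
    return weight_kg / height_m ** 2


# Ranges are placeholder assumptions; real limits come from the study protocol.
DICTIONARY = {
    "weight_kg": VariableDef(float, min_value=2.0, max_value=400.0),
    "height_m": VariableDef(float, min_value=0.3, max_value=2.6),
    "bmi": VariableDef(float, derived_from=("weight_kg", "height_m"), derive=bmi),
}


def check_record(record, dictionary, tolerance=0.1):
    """Return a list of readable rule violations for one record."""
    problems = []
    for name, rule in dictionary.items():
        value = record.get(name)
        if not isinstance(value, rule.dtype):
            problems.append(f"{name}: expected {rule.dtype.__name__}, got {value!r}")
            continue
        if rule.min_value is not None and value < rule.min_value:
            problems.append(f"{name}: {value} is below {rule.min_value}")
        if rule.max_value is not None and value > rule.max_value:
            problems.append(f"{name}: {value} is above {rule.max_value}")
        if rule.derive is not None:
            inputs = [record.get(source) for source in rule.derived_from]
            if all(isinstance(v, float) for v in inputs):
                try:
                    expected = rule.derive(*inputs)
                except ZeroDivisionError:
                    problems.append(f"{name}: cannot be derived from {inputs}")
                    continue
                if abs(expected - value) > tolerance:
                    problems.append(
                        f"{name}: {value} does not match derived value {expected:.1f}"
                    )
    return problems


good = {"weight_kg": 70.0, "height_m": 1.75, "bmi": 22.9}
bad = {"weight_kg": 70.0, "height_m": 175.0, "bmi": 22.9}  # height entered in cm

print(check_record(good, DICTIONARY))
print(check_record(bad, DICTIONARY))
```

In this sketch the second record fails twice: once on the height range, and once because the stored BMI no longer agrees with the value recomputed from weight and height.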
Is conditional logic supposed to be in the data dictionary?
Yes, it should be. What we usually do during a project is send a data dictionary template to the project team, and they help us fill in this information. In that template there will be logic. For example, whether a field needs to be unique, meaning you should not have duplicate values in it. Or a combination of several fields has to be unique. Or the field has to be derived from certain other fields. Or, for a string that has to follow a certain pattern, you would have a regular expression there.
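The uniqueness and pattern rules mentioned here can be sketched in a few lines of Python. The field names, the sample rows, and the ID pattern below are invented for illustration only.

```python
import re
from collections import Counter


def duplicates(rows, fields):
    """Return the value combinations of `fields` that occur more than once."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def pattern_violations(rows, field, pattern):
    """Return the values of `field` that do not fully match `pattern`."""
    regex = re.compile(pattern)
    return [row[field] for row in rows if not regex.fullmatch(str(row[field]))]


# Hypothetical rows: patient_id may repeat, but (patient_id, visit) may not.
rows = [
    {"patient_id": "LU-0001", "visit": 1},
    {"patient_id": "LU-0001", "visit": 2},
    {"patient_id": "LU-0002", "visit": 1},
    {"patient_id": "LU-0002", "visit": 1},  # repeated patient/visit pair
    {"patient_id": "lu_3", "visit": 1},     # does not follow the ID pattern
]

print(duplicates(rows, ["patient_id", "visit"]))
print(pattern_violations(rows, "patient_id", r"[A-Z]{2}-\d{4}"))
```

Passing a single field name checks a simple uniqueness rule, while passing several checks that the combination is unique.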
How do you get clinical input into the conditional logic?
I think for certain fields this is common sense. For example, take age. You know roughly the range of age. But for certain fields, especially the disease-related fields, this can only be solved by communicating with the data collectors and data controllers.
Do you think that researchers or clinicians need to understand a bit more about the data process?
That's a good point. They need to know about the curation process, roughly how it works, and what the challenges are. Again, communication is central here.
Can you briefly highlight what the curation process is? Curation or integrating data sets. It's often kind of the same thing, although not quite exactly.
Yeah. In terms of curation, normally you start with initial communication to collect metadata. You don't collect data first, as we discussed before. That means the curator should understand the source of the data, and the destination of the data being curated: in which form it needs to be curated, and for which system, for example.
And then you start to talk about data dictionaries, as we discussed. You get all the information you need for the dictionary, and for some data types, for example, a mixed data type, you also need to understand how the data is generated and pre-processed. Because this has a huge influence on the values that you will be handling in the future.
For example, whether they have done normalization. Without knowing this, you would tend to mis-curate the data in the future. After this information is collected, the first step is a tedious one, called data format conversion. You have to convert whatever type of data you get, sometimes even paper, into a common computer-friendly format that any tool you are using can read.
And we have found out that this step can be really error prone. For example, people use the comma as both the delimiter and the decimal separator. This is very common. And the encoding of different special characters could lead to bugs in the future. I would say these two parts will be at least 50% of the whole effort.
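The comma and encoding problems described above can be reproduced with Python's standard `csv` module. This is a minimal sketch. The sample exports, the `read_export` helper, and the choice of Latin-1 are assumptions made up for the example.

```python
import csv
import io


def read_export(raw_bytes, numeric_fields, encoding, delimiter=";"):
    """Decode a delimited export and parse decimal-comma numbers as floats."""
    # Decoding with the wrong encoding either raises or silently garbles text.
    text = raw_bytes.decode(encoding)
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter)
    rows = []
    for row in reader:
        for field in numeric_fields:
            row[field] = float(row[field].replace(",", "."))
        rows.append(row)
    return rows


# Ambiguous file: the comma is both the delimiter and the decimal separator,
# so "70,5" is split into two cells and the data row is wider than the header.
ambiguous = "site,weight_kg\nZürich,70,5\n"
widths = [len(r) for r in csv.reader(io.StringIO(ambiguous, newline=""))]
print(widths)

# Hypothetical export: semicolon-delimited, decimal commas, Latin-1 encoded.
raw = "site;weight_kg\nZürich;70,5\nLuxembourg;82,0\n".encode("latin-1")
rows = read_export(raw, ["weight_kg"], encoding="latin-1")
print(rows[0])
```

Comparing the width of every row against the header is a cheap first check for delimiter clashes. Decoding the same bytes as UTF-8 fails on the `ü`, which is why the encoding has to be agreed with the data provider and not guessed.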
Then you enter the real, or conventional, curation step. You start with cleaning and data quality control, based on the rules you set up from the data dictionaries. Once that step is done, the values will not be changed anymore. Then you do the transformation, such as reformatting to the data model that is required. So, you mostly reshuffle the data without changing the values.
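As an illustration of a transformation that reshuffles data without changing values, the sketch below turns one row per patient into one row per patient and variable. The `wide_to_long` helper and the field names are hypothetical. The target model in a real project is defined by the destination system.

```python
def wide_to_long(rows, id_field, value_fields):
    """Turn one row per subject into one row per (subject, variable) pair."""
    long_rows = []
    for row in rows:
        for field in value_fields:
            long_rows.append(
                {id_field: row[id_field], "variable": field, "value": row[field]}
            )
    return long_rows


wide = [
    {"patient_id": "LU-0001", "weight_kg": 70.5, "height_m": 1.75},
    {"patient_id": "LU-0002", "weight_kg": 82.0, "height_m": 1.80},
]
long_rows = wide_to_long(wide, "patient_id", ["weight_kg", "height_m"])

# Compare the values before and after: the layout changed, the values did not.
values_before = sorted(v for row in wide for k, v in row.items() if k != "patient_id")
values_after = sorted(row["value"] for row in long_rows)
print(values_before == values_after)
```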
Depending on the needs of the project, the final step will be standardization. Some projects need standardization so that the data is comparable to other data sets. Sometimes it is an internal data set, so the vocabulary does not need to be standardized.
When all this is done, the curator has to provide a report that documents everything that has been done with the data set.
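A simple form of vocabulary standardization is a lookup from local labels to agreed terms. In the sketch below, the mapping and the labels are invented for illustration. A real project would map to whichever controlled vocabulary it has chosen.

```python
# Hypothetical mapping from locally used labels to a shared vocabulary.
SEX_VOCABULARY = {"m": "male", "male": "male", "f": "female", "female": "female"}


def standardize(values, vocabulary):
    """Map values to the shared vocabulary and collect the ones left unmapped."""
    mapped, unmapped = [], []
    for value in values:
        key = value.strip().lower()
        if key in vocabulary:
            mapped.append(vocabulary[key])
        else:
            mapped.append(value)    # keep the original value untouched
            unmapped.append(value)  # and flag it for the data provider
    return mapped, unmapped


mapped, unmapped = standardize(["M", " f ", "Female", "unknown"], SEX_VOCABULARY)
print(mapped)
print(unmapped)
```

Returning the unmapped values, and not guessing at them, keeps the decision with the people who collected the data.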
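One way to keep such a report traceable is to log every step as it is applied, together with a fingerprint of the data before and after. This is a minimal sketch under that assumption. The `CurationLog` class and the step shown are hypothetical, not a description of the eTRIKS tooling.

```python
import hashlib
import json


def fingerprint(rows):
    """Stable SHA-256 digest of a list of JSON-serializable rows."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each identical row."""
    seen, kept = set(), []
    for row in rows:
        key = json.dumps(row, sort_keys=True)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


class CurationLog:
    """Applies curation steps and records what each one did."""

    def __init__(self):
        self.entries = []

    def apply(self, description, step, rows):
        result = step(rows)
        self.entries.append({
            "step": description,
            "rows_in": len(rows),
            "rows_out": len(result),
            "sha256_before": fingerprint(rows),
            "sha256_after": fingerprint(result),
        })
        return result

    def report(self):
        return json.dumps(self.entries, indent=2)


raw_rows = [
    {"patient_id": "LU-0001", "weight_kg": 70.5},
    {"patient_id": "LU-0001", "weight_kg": 70.5},
    {"patient_id": "LU-0002", "weight_kg": 82.0},
]

log = CurationLog()
curated = log.apply("drop exact duplicates", drop_exact_duplicates, raw_rows)
print(log.report())
```

Because each entry stores the row counts and both digests, someone rerunning the same steps on the same input can confirm that they reproduce the same output.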
What would be the one thing you would like to see happen that would be a game changer in terms of getting the most value out of life science research data?
I think the answer is close to what is in everyone's heart. You need to have an open, community-driven curation and sharing process. The very important thing here is to have a mechanism that gives credit to all the stakeholders: the one who shared the data, the one who curated the data, and also the one who provided the sharing platform. Because currently none of them are getting credit. Only the one who analyzed the data and published the results gets credit.
Because it's often one of the challenges for clinician researchers that they set up the study, generate the data, and then make it open, and somebody else does the analysis and gets the New England Journal paper. It really requires a big shift in how we credit academic research.
I think currently there are good movements in that direction. There are journals that publish datasets instead of research findings. And even in the publication area, credit is given to anonymous reviewers for their reviewing work. I think now we should talk about curators, who actually do the silent work and don't get any credit out of it.
Well, it just made me think also, one of the bigger challenges that we've seen is that if you're working in an academic setting, doing the work of curation is hard, because it's not something you're going to get a publication from, or something you can write your PhD thesis on. There's a lot that has to happen in making curation better. And, I think, one of the things we ran across in eTRIKS and other projects is that it's hard, in an academic setting, to have professional curators. Right?
Exactly. Unless they are, you know, part of core facilities. Or a platform where their daily job is to do data curation. Researchers normally try to avoid this.
Well that's interesting because that's different than what I anticipated. I thought you were going to say standards.
I think it's quite difficult to convince people, especially researchers, to adopt standards or to dive into data curation. Because this is not really, at the current stage, part of their career. What you mentioned, having everything under a standard, is the ultimate goal. But what I was describing is a path, one of the paths by which we can reach that goal.
Because, to me, that's one of the bigger drivers of good collaboration. There's the bigger picture: we want to have this curated data set so that we can do some research. But there's also the individual picture, right? And if you can marry those two, so that you know that if you're helping to curate the data set, you're gonna get credit and you're gonna advance in your career, then you get synergy. Because then you get a whole group of people working on a common problem.
Because it solves their individual needs and it solves the bigger picture. And, quite frankly, that's a fulfilling way to work as well.