4 Lessons I learned using data science tools on a real scientific article
In the past few years, there's been an outrageous growth in interest in data science; that's a fact. However, this "new" field of knowledge had a few main audiences:
- Scientists who produce data-science-related articles (deep learning models, comparisons of transformation techniques)
- People from the internet who were curious about the hype
- People who wanted new skills to work in a high-paying industry
I always felt it was good that data science's potential as a tool was finally being popularised. However, one point bothered me: why were scientists writing articles about data science, but not using it as a research tool in other fields of knowledge?
So, being a Physics major in Brazil and a self-taught data science enthusiast, I tried to find a research group at the university where I could use my data science skills to produce better evidence for an article.
Soon, I joined a group writing an article that used statistics to determine which comorbidities were statistically associated with more deaths from COVID-19. Along with that, I also asked the question: how well can you predict someone's death based on the symptoms they have? The results and conclusions can be seen here.
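To make that second question concrete, here is a minimal sketch of the kind of baseline classifier it implies. This is my own illustration, not the article's actual pipeline; the file name and column names are hypothetical placeholders.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; the real dataset and encoding may differ.
df = pd.read_csv("covid_cases.csv")
symptoms = ["fever", "cough", "dyspnea"]          # assumed 0/1 symptom columns
X, y = df[symptoms], df["death"]                  # assumed 1 = died, 0 = survived

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```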
Finally, here are the 4 lessons I learned over 8 months of working on the article.
What I mean by that is: science is not your beginner project, with a well-defined dataset and guaranteed success. The data is a mess, and the decision-making process of what to do with each variable is tiring and makes all the difference to the outcome.
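As a toy illustration of that decision-making (my own example, not the article's code), even something as basic as missing values demands a separate, defensible choice for each variable. The file and column names below are hypothetical.

```python
import pandas as pd

df = pd.read_csv("covid_cases.csv")                    # hypothetical file name
print(df.isna().mean().sort_values(ascending=False))   # fraction missing per column

# Hypothetical, variable-specific decisions -- each one changes the outcome:
df = df.dropna(subset=["death"])                       # missing outcome: drop the row
df["age"] = df["age"].fillna(df["age"].median())       # numeric: impute the median
df["fever"] = df["fever"].fillna(0)                    # unrecorded symptom: assume absent
```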
Also, when working on an article, there is always a hypothesis, and those who came up with the hypothesis surely have a guess about how the result will turn out. However, you always have to be careful with that, since holding on to guesses too tightly can lead you into confirmation bias (look it up if you haven't heard of it; it's important). That leads us to the second point:
If a model or analysis didn't result in pretty visualizations and excellent metrics, there's still always something to learn from it. Is your knowledge enough? How well did you model your data?
Maybe the data just doesn't answer the question you're asking. Maybe there's some information about the data (how it was collected, the meaning of the variables) that you still don't know, and that would make all the difference. Data is not a human: it won't lie to you. So try to listen to what it is telling you, rather than seeing only what you want to see.
While working on the article, our main goal was to measure the relative importance of our variables to the model's outcome. After working on it for some weeks, we faced a problem: the model performed better when we removed two specific variables. If this were an industry problem, it wouldn't be a problem: results are the main goal, so just delete them.
However, in our case, it was not wise to remove these variables. Although they were irrelevant to the model's performance, they were relevant healthcare indicators. Does this mean we designed bad features? Does it indicate a systematic problem in the way the data was collected? This is real-world data, collected by stressed-out healthcare workers in a developing country; we can't simply close our eyes to the logistical challenge. Or maybe it just means that the variables we thought were important indicators of COVID-19 deaths turned out not to be. Don't forget the first point.
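For concreteness, here is a hedged sketch of one common way to measure that kind of relative importance: permutation importance, which asks how much the test score drops when a single variable is shuffled. The article's exact method may have differed, and the file and column names below are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("covid_cases.csv")                                # hypothetical file name
features = ["age", "diabetes", "hypertension", "fever", "cough"]   # hypothetical columns
X, y = df[features], df["death"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
imp = permutation_importance(clf, X_test, y_test, n_repeats=30, random_state=42)

# Variables whose shuffling barely hurts the score look "irrelevant" to the model,
# even if they matter as healthcare indicators.
for name, drop in sorted(zip(features, imp.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {drop:.4f}")
```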
When working with data science, we (at least I did!) tend to treat our models and analyses as our own little monsters. No one else is going to read them, so why bother making them pretty?
First of all, reproducibility is not about making things look good: it is about making sure that every decision you made is clear, including where you got the data from, what feature engineering decisions you made, how you dealt with missing values, how you chose the model, and how and why you chose your metrics. The question that has to be answered is: given the dataset, can I reproduce your exact results using only the information you gave me in the article? If not, it shouldn't be considered science. You're no better than a magician.
Don't get me wrong, it's not that I don't trust you: science isn't about trust. Would you trust a rocket that only one scientist said works, or would you rather have it checked by dozens of experienced engineers?
It's only science if it can be reproduced.
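To close, here is a minimal sketch of what that means in code. This is my own illustration, not the article's pipeline: every decision, from imputation to the metric, is written down and seeded, so anyone with the same dataset can regenerate the exact same numbers. The file and column names are hypothetical.

```python
import json
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

SEED = 42                                          # fixed seed, reported explicitly
df = pd.read_csv("covid_cases.csv")                # hypothetical file; cite the real source
features = ["age", "diabetes", "fever", "cough"]   # hypothetical, but listed in full

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # explicit missing-value decision
    ("model", LogisticRegression(max_iter=1000)),  # explicit model choice
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
scores = cross_val_score(pipeline, df[features], df["death"], cv=cv, scoring="roc_auc")

# Log every decision next to the results, so someone with the same dataset
# can regenerate the exact same numbers.
with open("experiment_log.json", "w") as f:
    json.dump({"seed": SEED, "features": features, "imputation": "median",
               "metric": "roc_auc", "cv_scores": scores.tolist()}, f, indent=2)
```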