Research data: Why bigger isn't always better

Since the technology revolution and the rise of the internet, we have been constantly surrounded by data, and by companies harvesting and storing that data, supposedly for our benefit. In the world of scientific and medical research, big research data is often touted as the way we will solve some of our biggest questions and treat as-yet-untreatable diseases. But is more data really better? And what do we need to do to make the most of all this information?

What is big data?

Big data deals with large data sets that are too complicated to be managed and analysed by traditional methods.¹,² It is an entire field that has grown with the internet to find new ways to analyse the quantity of data we now generate on a daily basis. Medical research is just one area where big data has the potential to answer many as-yet-unsolved questions, ultimately transforming and even saving patients’ lives.

One example of this is in the treatment of rare diseases with small patient populations.³ Having access to more data from all over the world can be transformative, giving healthcare professionals the opportunity to compare relationships across larger data sets and monitor patterns with more accuracy. This can then be used to speed up the diagnosis and treatment of diseases that would traditionally have been difficult to identify.

In any scientific study, more data points would be considered beneficial to corroborate and reinforce conclusions. Big data is therefore a great opportunity to support, or disprove, ongoing research. But in order to do this, we need access to the data, so the rise of open science and shared data resources is just as important. This brings its own challenges in a world of data protection and the GDPR, where access to personal, and in this case medical, data requires anonymisation without loss of the key medical information. Big data is therefore powerless without correct, intelligent management and protection.

Quality not quantity

Data management is not the only key concern; it is also important to make sure we use the right data. One helpful way of looking at it is that we want depth of data, not breadth. More data points can make relationships clearer, but adding extra fields that have no relevance in a specific context can lead to spurious results and even the observation of non-existent correlations.⁴
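To make the point about irrelevant fields concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not an analysis of any real data set: the "outcome" and every extra "field" are pure random noise, and the function names and sample size are chosen purely for this example. It measures the strongest correlation that turns up by chance as more unrelated fields are added.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_fields, n_samples=30, seed=0):
    """Largest absolute correlation between a random outcome and
    n_fields random, unrelated fields (all values are simulated noise)."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_fields):
        field = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(field, outcome)))
    return best


if __name__ == "__main__":
    # Hypothetical field counts; none of the fields has any real link
    # to the outcome, so every correlation found here is spurious.
    for n_fields in (1, 10, 100, 1000):
        strongest = strongest_chance_correlation(n_fields)
        print(f"{n_fields:>5} unrelated fields: strongest |r| = {strongest:.2f}")
```

Because every field is noise, any strong correlation the script reports exists only by chance, and the strongest one can only grow as more fields are added. That is the risk of breadth without relevance.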

As with smaller data sets, there are concerns that confirmation bias and signal error could confuse results. Looking for data that confirms existing opinions is a human characteristic, so using algorithms to analyse data makes sense as a way to reduce confirmation bias. However, humans program the algorithms, so they could still be biased, and if so, any conclusions drawn will be too.

Overlooking gaps in data can also have significant implications and lead to inaccurate correlations. A study of data collected from Twitter following Hurricane Sandy suggested that Manhattan had experienced the worst of the storm, when in fact that wasn’t the case. Most of the tweets did originate there, but the number of tweets reflected the population density and proportion of smartphone users in New York, not the actual impact of the storm.⁵
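One common way to guard against this kind of gap is to compare rates rather than raw counts. The sketch below is illustrative only: the area names and every figure in it are made up for the example and are not taken from the Hurricane Sandy study, and the function names are hypothetical. It shows how a ranking can flip once tweet counts are divided by the number of people who could have tweeted.

```python
def tweets_per_1000_users(tweet_count, smartphone_users):
    """Tweets per 1,000 smartphone users in an area."""
    if smartphone_users <= 0:
        raise ValueError("smartphone_users must be positive")
    return 1000 * tweet_count / smartphone_users


def rank_areas(areas, normalise):
    """Return area names ordered from most to least activity.

    With normalise=False the raw tweet count is used; with
    normalise=True the count is scaled by smartphone users.
    """
    def score(name):
        area = areas[name]
        if normalise:
            return tweets_per_1000_users(area["tweets"], area["smartphone_users"])
        return area["tweets"]

    return sorted(areas, key=score, reverse=True)


# Entirely hypothetical figures, invented for this illustration.
EXAMPLE_AREAS = {
    "dense urban area": {"tweets": 50_000, "smartphone_users": 1_000_000},
    "hard-hit coastal area": {"tweets": 4_000, "smartphone_users": 20_000},
}

if __name__ == "__main__":
    print("By raw count:  ", rank_areas(EXAMPLE_AREAS, normalise=False))
    print("By rate per 1k:", rank_areas(EXAMPLE_AREAS, normalise=True))
```

Normalising does not fix every gap. Areas that lost power or signal may send no tweets at all, so the denominator is only one of several things worth checking before drawing conclusions.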

Data sharing in the publishing community

Science publishing is another area where the availability of published research has increased exponentially with the rise of the internet. Just as with data, this availability is a fantastic opportunity to advance research, but without the correct management and algorithms, finding high-quality, relevant research becomes more challenging as the quantity of research grows. A platform that collates research from multiple peer-reviewed journals and analyses it to find papers relevant to the reader is therefore essential to further research, support the learning of scientists and healthcare professionals, and save time.

The data is only as good as the analysis. The only way to make sure we use the right data and come to reliable conclusions is through the selection of sensible data sets and the careful design of the algorithms we use. That’s not to say that big data doesn’t have huge potential; it’s just important to remember that we need to be intelligent in the way we use it, especially in medicine, where patients’ health is at risk.

Ben Mansfield is the Founder of ClinOwl, a leading content discovery platform for healthcare professionals.

1. https://en.wikipedia.org/wiki/Big_data
2. https://dictionary.cambridge.org/dictionary/english/big-data
3. https://www.pharmaceutical-technology.com/comment/can-big-data-improve-diagnosis-rare-diseases/
4. https://towardsdatascience.com/ai-ml-practicalities-more-data-isnt-always-better-ae1dac9ad28f
5. https://www.wired.com/insights/2013/05/more-data-more-problems-is-big-data-always-right/
