What will the data environment look like in Horizon 2020? Modern science faces a huge challenge in managing data. On the one hand, every year a large share of scientific data is lost or becomes unreadable. On the other hand, the 80 billion Euro investment in research (the expected budget for Horizon 2020) could generate an unmanageable amount of data, with a loss of efficiency and a huge waste of public money. Ultimately, one of the main goals of Horizon 2020 will be to face Big Data.
Data loss
80% of the data behind scientific articles is lost within 20 years. This is the main finding of a study published in December 2013 by the University of British Columbia (Canada). The results immediately opened a wide debate, as they exposed the “dark side” of technological progress in data storage. The study reports that:
“For papers where the authors gave the status of their data, the odds of a data set being extant fell by 17% per year. In addition, the odds that we could find a working e-mail address for the first, last, or corresponding author fell by 7% per year. Our results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.”
The study suggested that this can also be seen as an economic problem. Part of the research funding is spent rediscovering data that have probably already been collected but are locked away somewhere. This is particularly true for Western countries, where storage technologies are developed and replaced at a fast pace. Who still uses floppy disks? Yet how many scientists started collecting data on them?
The geopolitical aspect is not the only one. The Canadian institution that conducted the research went further: among other fields, ecology and medicine are the areas most affected, owing to their dependence on historical datasets. Ecology and medicine: the cross-cutting issues Horizon 2020 deals with most. But, as noted above, Big Data does not only mean losing old data; it also means being overwhelmed by new data.
Data tsunami
The development of sophisticated measuring instruments produces large sets of findings that have to be stored, managed and reused, and the scientific community has not yet agreed on a standard procedure for the whole process. Moreover, analysts fear what is called a “data tsunami”. A data tsunami is generated when a small amount of data, once reused, reprocessed or simply linked to other data, creates a new, much larger amount of data. Repeated thousands of times, this can produce something impossible to manage, as the rough calculation below illustrates.
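To make the scale of this effect concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it (dataset size, derived products per round, number of rounds) is a hypothetical assumption chosen purely for illustration, not a figure from Horizon 2020 documents; the point is only how quickly derived data can dwarf the original.

```python
# Toy model of a "data tsunami": each round of reuse links or reprocesses
# existing datasets and produces derived data that must also be stored.
# All numbers below are illustrative assumptions, not Horizon 2020 figures.

initial_gb = 10          # assumed size of one original dataset, in GB
derived_per_round = 3    # assumed derived products created per dataset per round
rounds = 5               # assumed rounds of reuse/reprocessing

total_gb = initial_gb
current_gb = initial_gb
for r in range(1, rounds + 1):
    # each round, the data already in circulation spawns derived copies
    current_gb *= derived_per_round
    total_gb += current_gb
    print(f"after round {r}: {total_gb} GB to store and manage")
```

Under these made-up assumptions, a single 10 GB dataset grows to well over a terabyte of material to curate after only five rounds of reuse; multiplied across thousands of projects, the storage and management burden becomes the "tsunami" the analysts describe.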
With an 80 billion Euro budget at stake, Horizon 2020 has good reason to fear a data tsunami or, at the very least, an enormous waste of money caused by being unprepared to preserve and disseminate the data it collects (data loss). Authorities and institutions are therefore attempting to set a new standard for the whole life cycle of scientific research and data management.
A living experiment
The European Union is aware of both challenges, data loss and the data tsunami, and both affect the economics of H2020 in terms of efficiency and money. The H2020 approach is therefore multi-pronged. It involves research on new technologies (mainly digital infrastructure) through the e-Infrastructures funding. It involves the rules for awarding grants, as some calls will explicitly require a Data Management Plan. And it involves policy making, as the H2020 Open Access policy was debated for a long time, mainly over the balance between open access and patenting scientific discoveries. Will it work? It is impossible to predict. For the moment, however, Horizon 2020 can be seen as a unique experiment in which, perhaps for the first time, the Big Data issue is being faced seriously.