The EBI, the European Bioinformatics Institute, is a research center of the European Molecular Biology Laboratory (EMBL) based at the Wellcome Trust Genome Campus in Hinxton, near Cambridge, in Great Britain. Its mission is to provide data and, more generally, IT services to the entire scientific community, entirely free of charge. The EBI's activity, therefore, is to collect, preserve and distribute the data obtained by researchers working in every field of the life sciences, from basic biology to clinical medicine to ecology. In the three years between 2008 and 2010, the institute almost quadrupled the quantity of bytes (the unit used to measure information) it has accumulated: from roughly 1,000 to almost 4,000 terabytes (one terabyte is 10¹² bytes, a thousand billion bytes). For comparison, one of the largest libraries in the world, the Library of Congress of the United States, in Washington, with its 28 million books and 50 million manuscripts, contains the equivalent of roughly 20 terabytes. It is as if the EBI's virtual library contained 200 Libraries of Congress.
Nevertheless, this is hardly anything compared with the data gathered in a single year by the scientists at CERN, the European laboratory for particle physics in Geneva. In just one experiment conducted at the Large Hadron Collider (LHC), CMS, more than 2,000 terabytes of data were gathered in 2008. By 2010 the figure had risen to 10,000 terabytes, and fifty sites around the world were needed to store and manage the data.
The future, however, appears to be even more data-intensive, as technicians would say. The Square Kilometre Array (SKA), the enormous radio telescope being developed between South Africa and Australia and expected to begin working in the twenties of this century, will gather 1,000,000 terabytes of data per day: a mammoth task. This means that every day this enormous 'ear' will collect information equivalent to 50,000 Libraries of Congress. Nor is the massive amount of information we produce and upload onto the net every day something to be overlooked: Twitter produces 1 terabyte every 2.6 days, Facebook 15 terabytes a day. But the quantity produced by the scientific community is enormous. The CMS experiment at CERN alone, run by a few hundred physicists, produces almost double the amount generated by Facebook's 900 million users. And the few thousand astronomers around the world collaborating on the SKA project will, in the future, be called on to manage a flow of data some five orders of magnitude (roughly 100,000 times) larger than the data handled by Facebook.
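These comparisons are simple ratios. The following is a minimal back-of-the-envelope sketch in Python, using only the rough, round figures quoted above (none of them precise measurements), to show how they are obtained:

```python
# Back-of-the-envelope check of the comparisons quoted in the text.
# All figures are in terabytes and are the rough estimates given above.

LIBRARY_OF_CONGRESS_TB = 20       # estimated digital equivalent of the Library of Congress
EBI_TB = 4_000                    # data held by the EBI around 2010
CMS_TB_PER_YEAR = 10_000          # data gathered by the CMS experiment in 2010
SKA_TB_PER_DAY = 1_000_000        # data the SKA is expected to gather each day
TWITTER_TB_PER_DAY = 1 / 2.6      # 1 terabyte every 2.6 days
FACEBOOK_TB_PER_DAY = 15

print(f"EBI holds ~{EBI_TB / LIBRARY_OF_CONGRESS_TB:.0f} Libraries of Congress")
print(f"SKA will gather ~{SKA_TB_PER_DAY / LIBRARY_OF_CONGRESS_TB:,.0f} Libraries of Congress per day")
print(f"CMS produces ~{CMS_TB_PER_YEAR / 365 / FACEBOOK_TB_PER_DAY:.1f}x Facebook's daily output")
print(f"SKA will handle ~{SKA_TB_PER_DAY / FACEBOOK_TB_PER_DAY:,.0f}x Facebook's daily output")
```

Run as it stands, the sketch gives roughly 200 libraries for the EBI, about 50,000 libraries per day for the SKA, a little under twice Facebook's daily output for CMS, and some 67,000 times Facebook's output for the SKA, which is indeed close to five orders of magnitude.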
Nowadays there are over 7 million researchers throughout the world. Not all of them collect as much data as their colleagues at the EBI, CERN or SKA. It is true, nevertheless, that each of them, thanks to new technology, and not only IT-related technology, creates a quantity of information that was unheard of in the past. The sum of all this information is colossal. Of course, information is not necessarily knowledge. Or at least it wasn't in the past, because today many are convinced that the sheer quantity of information produced can make up for its quality: the gathering, storage and analysis of the enormous amount of data obtained by those more than 7 million researchers around the world can be transformed into new scientific knowledge, or rather into a new way of producing scientific knowledge. It is for this reason that some now speak of a "fourth paradigm".
Take, for example, the Biobank in Great Britain, an example suggested by the Royal Society in a recent report, Science as an open enterprise. It preserves blood, urine and saliva samples from 500,000 people, and with them an enormous quantity of clinical and genetic data. All these people have given their consent to the use of their data. This information, which is not only electronic, has the potential to generate a phase transition in our knowledge of a vast number of illnesses: cancer, diabetes, heart attacks, depression. We just need to learn how to collect the data (in the long run, everyone will have to transmit everything to everyone), store it and analyze it. The same applies to astronomy, and to the SKA. The network of computers that will manage its database will come close to that hypothetical intelligence which the Marquis Pierre-Simon de Laplace imagined at the beginning of the nineteenth century: one that, knowing the state of every particle in the cosmos, would be capable of knowing the present, past and future of the entire universe.
But the fields of scientific information – ranging from ecology to climate, from particle physics to sociology – are so many that the phase transition in the creation of new knowledge could be practically unlimited. Today, though, most of the potential knowledge contained in the enormous amount of information gathered without a specific objective risks being lost, because we still lack the tools that would let us automatically find the needle of knowledge in the haystack of information. History offers an example. At the beginning of the 1980s a satellite sent into orbit to study ozone in the atmosphere, the Solar Mesosphere Explorer (SME), accumulated enough data to reveal the decline of that molecule in the stratosphere. But an automatic data-correction system dismissed those small variations. It took the acuteness of Paul Crutzen, Mario Molina and Sherwood Rowland to recognize the new knowledge contained in that immense amount of raw data. The story of Crutzen, Molina and Rowland – who were awarded the Nobel Prize in Chemistry in 1995 – demonstrates that it is not enough to have a large amount of raw data: you need to know how to analyze it. And since the amount of data available is colossal, we can no longer count on human ingenuity alone. We need to rely on the power of machines to turn information into understanding.
Ultimately, it is the combination of a large amount of information and the ability to analyze it that becomes the fundamental factor capable of producing new knowledge. Today that combination is within reach. This is why the Royal Society asks whether we are in the presence of a new epistemological paradigm, the fourth. In fact, the first person to speak of a fourth paradigm, generated by eScience and the eruption of electronics into the scientist's working world, was Jim Gray, a computer scientist and winner of the Turing Award, who spent the last years of his life at Microsoft trying to convince the world that we have entered a new epistemological era.
The first and second paradigms are those Galileo called "sensible experiences" and "certain demonstrations", that is, empirical observation and theory, expressed where possible in mathematical form. The third arrived with the advent of the computer, which opened up a new way of producing scientific knowledge: simulation. Today much scientific research deals with a simulated world rather than the natural world. The disadvantage is that the results do not concern reality itself, only a more or less acceptable approximation of it. The advantage is that you can repeat controlled experiments indefinitely, modify the parameters and navigate in space and time as you wish. According to Jim Gray, the fourth paradigm consists of navigating an endless sea of data in search, among other things, of order and regularities that we cannot see and that theories cannot predict. It is an interdisciplinary navigation capable of generating new knowledge. We do not know – not yet, anyway – whether Jim Gray is right, whether we can begin to speak of an epistemological phase transition. It is certain, however, that enormous quantities of data exist in every sector and that we have the technical means to navigate through them, even on automatic pilot (with algorithms). It would be a shame to lose or compromise this opportunity. To grasp it fully, as the Royal Society rightly maintains, we must make three radical choices.
The first is that everybody, in total transparency, must contribute all the data in their possession to a global database. Researchers, for example, should not limit themselves to writing an article and selecting and making public a limited amount of significant data; they should make available all the data they have collected. In other words, they should expand their scientific communication.
The second choice is that everyone should have free access to this global database and be free to navigate in this sea of data.
The third choice is that national and international public authorities should make available at least the minimum resources needed to create suitable IT infrastructures.
Some governments – starting with the British – have shown considerable sensitivity to the issues raised by the Royal Society, convinced that this will be one of the main paths of innovation in the future. The Italian government and the scientific institutions of our country should do the same.