Empirico Research Corp. - August 1, 2013

A Data-centric View of Science

Accurately collected and analyzed data arbitrates hypotheses...

Evidence of the successful execution and application of scientific research is all around us. Just visit a factory. Read the list of ingredients on any packaged food. Start your computer and browse the internet. Read a resource report prepared by any hydrocarbon or mining company on any of their exploration or development projects. Get in your car and drive across a large bridge to the airport to catch a flight. Our mobility, knowledge, creative capabilities and high standard of living are largely the result of the application of science. It has transformed our economies into powerhouse employers and spawned wealth-creating technologies, some of which are hazardous to the environment but have nonetheless added substantially to democratic wealth. Evidence-based scientific methods have created the modern world.

Scientific competencies are required to advance many businesses, maintain current infrastructures and generally shape the future. We now need to use science to protect vital ecosystems and advance our societies for the benefit of all citizens. Policy-makers who understand data can independently assess competing interpretations, and are less at the mercy of expert opinions. But how does properly collected and analyzed data constrain interpretations and provide objective answers to questions? I'm going to discuss the role of data analytics.

Describing the components of good scientific research has occupied my thoughts since I studied research design and analysis over three decades ago while completing a degree in neuroscience. I currently run my own media business doing mostly production and advertising, but even these creative pursuits involve the effective use of strategic data from Google Analytics and market research. My past work included a 26-year stretch working with geologists and geophysicists at ten different resource companies, as well as several technology start-ups and new product ventures. I love learning about the latest science and technology. A family friend, Walter Wardrop, is a former Industrial Technology Advisor (ITA) with the National Research Council, and he keeps me in the loop about the latest developments in "green" technologies. My stepfather was a professional physicist. Our many discussions have increased my scientific literacy tremendously. He helped me analyze Rutherford's gold foil experiment described at the end of this article.

Writing about the use of data in science is a bit ironic, because there is so little data relevant to it. Is there agreement among scientists that planning to collect data, then collecting data, and finally analyzing the data are essential components of scientific research? I would say yes, but I can't speak for them. What they answer would only be an opinion anyway. I am not aware of any study that actually measures how much time scientists spend on data-related activities, or how many scientists are full-time theorists. Maybe most research scientists develop an hypothesis and then also get involved in collecting relevant data, but without such time studies that remains a guess.

But if you read actual scientific research reports, and believe what the authors say, then a clear methodology emerges of testing hypotheses against experimental and observational data. The most frequently published type of data in the physical sciences is quantitative measurement of physical properties and processes. As you read more research, you notice that many studies use new data collection methods to test ever more detailed hypotheses. As you become an expert, you begin to notice that not all research is well done. How much is bad? In the physical sciences, important research findings are normally replicated before being accepted as true. This is not generally so in the social sciences, unfortunately. A study published in Science on August 28, 2015, "Estimating the reproducibility of psychological science," was only able to replicate the results of 36 out of 100 randomly selected experimental and correlational studies published in three psychology journals. This level of professional failure in psychological research (which I always suspected) is one of the reasons I am writing this article.

It is evident from the literature that successful scientific research generates data that verifies conclusions about what exists, existed, or may come into existence, and how existents cause effects. In more basic terms, scientists collect data to answer questions about things and events. The explanatory usefulness of data is always relative to the specific question being asked. The same dataset can be cogent with regard to one question, suggestive with regard to another and irrelevant or deficient with regard to all other questions. The critical scientific criterion is whether the hypothesis accords with experimental and observational data. But that concordance does not instantly make an hypothesis into a theory. On that issue, physicist Christian Beck made the following comment about verifying new discoveries such as dark matter:

A true discovery of dark matter that is convincing for most scientists would require consistent results from several different experiments using different detection methods, in addition to what has been observed by the Leicester group.

Sometimes the verifying data is no more than pointing a telescope at the right place in the sky, a chance finding of a new species, or just taking some pictures of an object at the right moment. Increasingly, testing theoretical predictions involves extremely complicated, time-consuming and expensive experiments, notably at the Large Hadron Collider at CERN, which is the largest and most complex machine ever built. Sometimes it looks like the low-hanging fruit of cogent data collection has been picked, but then new technology comes along, like optical DNA sequencers, enabling even deeper and more detailed probing of nature. It is evident that science progresses in lockstep with data generation, collection and analysis technology.

The glory associated with generating new scientific knowledge often goes to theorists, and not unjustly in many situations. It is quite an amazing thing when a theorist anticipates the existence of something not previously known to exist, and then that thing is found using experimental methods. Examples from the history of science abound: the discovery of Neptune, Darwin’s prediction of an African hawk moth with a ten-inch proboscis, Dirac’s antimatter calculations, Einstein’s prediction of the warping of space-time, new elements predicted by the periodic table, and many new subatomic particles including the recently verified Higgs boson. However, this culture of theoretical glorification should not distract the public from the essential need for good data collection and analysis.

Not all scientific experiments involve hypothesis testing. The British biophysicist Rosalind Franklin used X-ray crystallography to image DNA molecules, independently generating the cogent data used by Crick and Watson to develop their chemical model of DNA. Anyone can search an area not knowing what they might find, or simply measure a physical property or process. The history of physics indicates that it is probably more effective to think about prospective hypotheses before collecting data. Doing so can surface questions that no one had thought to ask and point to more relevant tests. For example, when it was first proposed by Einstein, the general theory of relativity raised the question of whether a strong gravitational field could bend light. Someone might have asked that question before, but the hypothesis caught Eddington's attention and he was inspired to photograph the 1919 solar eclipse to confirm this incredible prediction.

Hypothesis formulation in science is probably essential, but most initial ideas are wrong. There were several competing theoretical models of the Higgs boson, some of which predicted five or more different masses. In the case of the Michelson-Morley experiment, the aether hypothesis was discredited, and there was no theory at the time capable of explaining the finding that the speed of light is independent of frame of reference. Only cogent data can reveal which hypothesis is right, or maybe that they are all wrong.

So how does this data-centric approach to science jibe with actual scientific practices? One of the great scientific experiments of all time is Ernest Rutherford’s Gold Foil Experiment. Under the guidance of Rutherford, Hans Geiger and Ernest Marsden bombarded very thin gold foil with alpha particles and discovered that some actually bounced back toward the source. This is how Rutherford described the experience:

It was quite the most incredible event that has ever happened to me in my life. It was almost as incredible as if you fired a 15-inch shell at a piece of tissue paper and it came back and hit you. On consideration, I realized that this scattering backward must be the result of a single collision, and when I made calculations I saw that it was impossible to get anything of that order of magnitude unless you took a system in which the greater part of the mass of the atom was concentrated in a minute nucleus. It was then that I had the idea of an atom with a minute massive centre, carrying a charge.

My freshman-level physics is not up to the task of properly interpreting this experiment, so I emailed a professional physicist I know, my stepfather Derek Paul, who replied:

At that time there was only one model of the composition of atoms, which were known to contain electrons, and it was also known that the electrons' mass was only a rather small fraction of the mass of the atoms. A first guess was that the positive charge on an atom was uniformly spread over its volume, and that the electrons were somehow distributed so as to neutralize the positive charge overall. Such a model is incapable of causing a fast alpha particle (having four times the hydrogen mass) to scatter at any but a very small angle. Rutherford, having observed some scattering at around 120 degrees (from the incident direction) eventually twigged that this could only occur if the electrons were spread over the full volume of the atom, but the nucleus was concentrated at the centre. Furthermore, from the scattering formula (how many scatter at each angle), he was able to show that the positive charge of the gold atoms occupied a volume within a radius of around 2 X 10 E -12 cm (the expression 10 E -x means 10 to the power –x), which is around one ten-thousandth of the atomic radius. This implied also an extremely high density for the positively charged matter at the centre, which we now call the nucleus of the atom. If you compress the steel of a 100,000 tonne battleship to the density of the atomic nucleus, it would fit within a cubic cm.
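The arithmetic in this reply can be checked roughly with a short Python sketch. The nucleus radius and the battleship mass are the figures quoted above. The gold atomic radius, the mass number of gold and the atomic mass unit are standard reference values that I am assuming here, they do not come from the email. The sketch also treats all of the atom's mass as sitting in the nucleus, which ignores the small contribution of the electrons.

```python
import math

# Figures quoted in the reply above
NUCLEUS_RADIUS_CM = 2e-12         # radius bound for the gold nucleus
BATTLESHIP_TONNES = 100_000       # mass of the battleship

# Assumptions not stated in the reply (approximate reference values)
GOLD_ATOMIC_RADIUS_CM = 1.44e-8   # approximate radius of a gold atom
GOLD_MASS_NUMBER = 197            # nucleons in gold-197
ATOMIC_MASS_UNIT_G = 1.6605e-24   # grams per atomic mass unit
GRAMS_PER_TONNE = 1e6


def radius_ratio():
    """Nucleus radius as a fraction of the atomic radius."""
    return NUCLEUS_RADIUS_CM / GOLD_ATOMIC_RADIUS_CM


def nuclear_density_g_per_cm3():
    """Density if the whole atomic mass sits in a sphere of the quoted radius."""
    mass_g = GOLD_MASS_NUMBER * ATOMIC_MASS_UNIT_G
    volume_cm3 = (4.0 / 3.0) * math.pi * NUCLEUS_RADIUS_CM ** 3
    return mass_g / volume_cm3


def compressed_battleship_volume_cm3():
    """Volume of the battleship's mass at the density computed above."""
    mass_g = BATTLESHIP_TONNES * GRAMS_PER_TONNE
    return mass_g / nuclear_density_g_per_cm3()


if __name__ == "__main__":
    print(f"radius ratio:       {radius_ratio():.1e}")
    print(f"nuclear density:    {nuclear_density_g_per_cm3():.1e} g/cm^3")
    print(f"battleship volume:  {compressed_battleship_volume_cm3():.1e} cm^3")
```

Under these assumptions the ratio comes out at roughly one part in ten thousand, the same order of magnitude as the figure in the reply, and the compressed battleship occupies well under one cubic centimetre.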

Obviously mathematics plays the key role in the interpretation of this experiment. Without quantitative knowledge of alpha particles and how to calculate their energy, the experiment is impossible to appreciate. This is the source of the gun shell metaphor used by Rutherford in the quote above. Rutherford used the already known physics of alpha particle scattering as the mathematics for data analysis to calculate the size and mass of the atomic nucleus of gold. It is typical of the physical sciences that verified theories provide the mathematics for analyzing new and more complex phenomena.
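One simple piece of that mathematics is the distance of closest approach in a head-on collision, the point where all of the alpha particle's kinetic energy has been converted into electrostatic potential energy. The sketch below is only an illustration of that textbook energy balance, not a reconstruction of Rutherford's own calculation. The alpha energy of 5 MeV is an assumed round number that does not appear in this article, the gold nucleus is treated as a fixed point charge, and the function name is hypothetical.

```python
# Coulomb constant times the elementary charge squared, e^2 / (4*pi*eps0),
# in convenient nuclear units (approximate value).
COULOMB_E2_MEV_FM = 1.44

ALPHA_CHARGE = 2      # charge of an alpha particle, in units of e
GOLD_CHARGE = 79      # charge of a gold nucleus, in units of e
FM_TO_CM = 1e-13      # one femtometre in centimetres


def closest_approach_cm(alpha_energy_mev):
    """Head-on distance at which the kinetic energy equals the Coulomb
    potential energy between the alpha particle and the gold nucleus."""
    distance_fm = COULOMB_E2_MEV_FM * ALPHA_CHARGE * GOLD_CHARGE / alpha_energy_mev
    return distance_fm * FM_TO_CM


if __name__ == "__main__":
    # 5 MeV is an assumed, illustrative alpha energy.
    print(f"closest approach: {closest_approach_cm(5.0):.1e} cm")
```

With the assumed energy the result is a few times 10^-12 cm, the same order of magnitude as the radius quoted in the reply above. A more energetic alpha particle gets closer, which is why the energy of the projectile sets how small a structure the experiment can probe.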

It is not entirely correct to call the Gold Foil Experiment hypothesis testing, since the data actually shaped and refined the sketchy model. Even if there had been an hypothesis, what starts off as hypothesis testing can turn into hypothesis refinement or abandonment, or produce an unexpected result suggestive of an entirely new hypothesis. In all cases, mathematics provides the basis on which the match between data and hypothesis can be fine-tuned and assessed.

Semantically speaking, the agreement between theory and empirical data is an answer to the question: "What kind of real world structure could produce the experimental results?" An hypothesis is an educated guess which turns into a theory or fact when it is supported by data from a variety of sources. That our expectations don't always accord with measurements is what makes science objective. The experimental result from the Gold Foil Experiment enabled Rutherford to fine-tune the atomic model of 1909 to arrive at a size for the nucleus of the atom which previously had not been known or even anticipated.

Many have served science well in the trenches of data collection and analysis. Humanity will need many more in the years ahead.

Craig Farlinger B.Sc. - September 24, 2015