Cometh the man; Francis Bacon’s insight was that the process of discovery was inherently algorithmic. Photo courtesy NPG/Wikipedia
The duty of man who investigates the writings of scientists, if learning the truth is his goal, is to make himself an enemy of all that he reads and … attack it from every side. He should also suspect himself as he performs his critical examination of it, so that he may avoid falling into either prejudice or leniency.
– Ibn al-Haytham (965-1040 CE)
Science is in the midst of a data crisis. Last year, there were more than 1.2 million new papers published in the biomedical sciences alone, bringing the total number of peer-reviewed biomedical papers to over 26 million. However, the average scientist reads only about 250 papers a year. Meanwhile, the quality of the scientific literature has been in decline. Some recent studies found that the majority of biomedical papers were irreproducible.
The twin challenges of too much quantity and too little quality are rooted in the finite neurological capacity of the human mind. Scientists are deriving hypotheses from a smaller and smaller fraction of our collective knowledge and consequently, more and more, asking the wrong questions, or asking ones that have already been answered. Also, human creativity seems to depend increasingly on the stochasticity of previous experiences – particular life events that allow a researcher to notice something others do not. Although chance has always been a factor in scientific discovery, it is currently playing a much larger role than it should.
One promising strategy to overcome the current crisis is to integrate machines and artificial intelligence in the scientific process. Machines have greater memory and higher computational capacity than the human brain. Automation of the scientific process could greatly increase the rate of discovery. It could even begin another scientific revolution. That huge possibility hinges on an equally huge question: can scientific discovery really be automated?
I believe it can, using an approach that we have known about for centuries. The answer to this question can be found in the work of Sir Francis Bacon, the 17th-century English philosopher and a key progenitor of modern science.
The first iterations of the scientific method can be traced back many centuries earlier to Muslim thinkers such as Ibn al-Haytham, who emphasised both empiricism and experimentation. However, it was Bacon who first formalised the scientific method and made it a subject of study. In his book Novum Organum (1620), he proposed a model for discovery that is still known as the Baconian method. He argued against syllogistic logic for scientific synthesis, which he considered to be unreliable. Instead, he proposed an approach in which relevant observations about a specific phenomenon are systematically collected, tabulated and objectively analysed using inductive logic to generate generalisable ideas. In his view, truth could be uncovered only when the mind is free from incomplete (and hence false) axioms.
The Baconian method attempted to remove logical bias from the process of observation and conceptualisation, by delineating the steps of scientific synthesis and optimising each one separately. Bacon’s vision was to leverage a community of observers to collect vast amounts of information about nature and tabulate it into a central record accessible to inductive analysis. In Novum Organum, he wrote: ‘Empiricists are like ants; they accumulate and use. Rationalists spin webs like spiders. The best method is that of the bee; it is somewhere in between, taking existing material and using it.’
The Baconian method is rarely used today. It proved too laborious and extravagantly expensive; its technological applications were unclear. However, at the time the formalisation of a scientific method marked a revolutionary advance. Before it, science was metaphysical, accessible only to a few learned men, mostly of noble birth. By rejecting the authority of the ancient Greeks and delineating the steps of discovery, Bacon created a blueprint that would allow anyone, regardless of background, to become a scientist.
Bacon’s insights also revealed an important hidden truth: the discovery process is inherently algorithmic. It is the outcome of a finite number of steps that are repeated until a meaningful result is uncovered. Bacon explicitly used the word ‘machine’ in describing his method. His scientific algorithm has three essential components: first, observations have to be collected and integrated into the total corpus of knowledge. Second, the new observations are used to generate new hypotheses. Third, the hypotheses are tested through carefully designed experiments.
If science is algorithmic, then it must have the potential for automation. This futuristic dream has eluded information and computer scientists for decades, in large part because the three main steps of scientific discovery occupy different planes. Observation is sensory; hypothesis-generation is mental; and experimentation is mechanical. Automating the scientific process will require the effective incorporation of machines in each step, with all three feeding into each other without friction. Nobody has yet figured out how to do that.
Experimentation has seen the most substantial recent progress. For example, the pharmaceutical industry commonly uses automated high-throughput platforms for drug design. Startups such as Transcriptic and Emerald Cloud Lab, both in California, are building systems to automate almost every physical task that biomedical scientists do. Scientists can submit their experiments online, where they are converted to code and fed into robotic platforms that carry out a battery of biological experiments. These solutions are most relevant to disciplines that require intensive experimentation, such as molecular biology and chemical engineering, but analogous methods can be applied in other data-intensive fields, and even extended to theoretical disciplines.
Automated hypothesis-generation is less advanced, but the work of Don Swanson in the 1980s provided an important step forward. He demonstrated the existence of hidden links between unrelated ideas in the scientific literature; using a simple deductive logical framework, he could connect papers from various fields with no citation overlap. In this way, Swanson was able to hypothesise a novel link between dietary fish oil and Raynaud’s syndrome without conducting any experiments or being an expert in either field. Other, more recent approaches, such as those of Andrey Rzhetsky at the University of Chicago and Albert-László Barabási at Northeastern University, rely on mathematical modelling and graph theory. They incorporate large datasets, in which knowledge is projected as a network, where nodes are concepts and links are relationships between them. Novel hypotheses would show up as undiscovered links between nodes.
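The idea of hypotheses as missing links can be sketched in a few lines of Python. What follows is a simplified illustration in the spirit of Swanson's reasoning (if A is linked to B, and B to C, but A and C never appear together, then A and C is a candidate hypothesis); it is not his actual procedure, nor the models used by Rzhetsky or Barabási. The function name `abc_hypotheses` and the four-paper 'literature' are assumptions made up for the example.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Set, Tuple


def abc_hypotheses(papers: Dict[str, Set[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Map each unlinked concept pair (A, C) to the concepts B that bridge it."""
    # Build the knowledge network: concepts are nodes, and two concepts are
    # linked whenever at least one paper mentions both.
    linked: Dict[str, Set[str]] = defaultdict(set)
    for concepts in papers.values():
        for a, b in combinations(sorted(concepts), 2):
            linked[a].add(b)
            linked[b].add(a)

    # A candidate hypothesis is a pair that shares a neighbour B
    # but has no direct link of its own.
    hypotheses: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for b, neighbours in linked.items():
        for a, c in combinations(sorted(neighbours), 2):
            if c not in linked[a]:
                hypotheses[(a, c)].add(b)
    return dict(hypotheses)


# Toy literature (illustrative placeholders, not real citations).
papers = {
    "paper_1": {"fish oil", "blood viscosity"},
    "paper_2": {"blood viscosity", "Raynaud's syndrome"},
    "paper_3": {"fish oil", "platelet aggregation"},
    "paper_4": {"platelet aggregation", "Raynaud's syndrome"},
}

for pair, bridges in abc_hypotheses(papers).items():
    print(pair, "via", sorted(bridges))
```

No toy paper mentions fish oil and Raynaud's syndrome together, yet the pair surfaces as a candidate because two intermediate concepts bridge it. The sketch also proposes a link between the two intermediates themselves, a reminder that such methods generate candidates to be tested, not conclusions.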
The most challenging step in the automation process is how to collect reliable scientific observations on a large scale. There is currently no central data bank that holds humanity’s total scientific knowledge on an observational level. Natural-language processing has advanced to the point at which it can automatically extract not only relationships but also context from scientific papers. However, major scientific publishers have placed severe restrictions on text-mining. More importantly, the text of papers is biased towards the scientist’s interpretations (or misconceptions), and it contains synthesised complex concepts and methodologies that are difficult to extract and quantify.
Nevertheless, recent advances in computing and networked databases make the Baconian method practical for the first time in history. And even before scientific discovery can be automated, embracing Bacon’s approach could prove valuable at a time when pure reductionism is reaching the edge of its usefulness.
Human minds simply cannot reconstruct highly complex natural phenomena efficiently enough in the age of big data. A modern Baconian method that incorporates reductionist ideas through data-mining, but then analyses this information through inductive computational models, could transform our understanding of the natural world. Such an approach would enable us to generate novel hypotheses that have higher chances of turning out to be true, to test those hypotheses, and to fill gaps in our knowledge. It would also provide a much-needed reminder of what science is supposed to be: truth-seeking, anti-authoritarian, and limitlessly free.