Sunday, September 09, 2012

'Pixels of information'

My friend Barend Mons wrote to me, and I think his letter is worth sharing on this blog. I checked with him, and he agreed that it can be shared here.
Dear Jan,

I'm writing to you inspired by your remark that "OA is not a goal in itself but one means to an end: more effective knowledge discovery".

What we need for eScience is Open Information to support the Knowledge Discovery process. As eScience can be pictured as 'science that cannot be done without a computer', computer-reasonable information is the most important element to be 'open'.
You're right, Barend. That's why I think CC-BY is a necessary element of open access. 
As we discussed many times before, computer reasoning and 'in silico' knowledge discovery lead essentially to 'hypotheses', not to final discoveries. There are two very important next steps. The first is what I would call 'in cerebro' validation: mainly browsing the suggestions provided by computer algorithms mining the literature and 'validating' individual assertions (call them triples if you wish) in their original context. 'Who asserted it, where, based on what experimental evidence, which assay...?' etc. In other words, why should I believe (in the context of my knowledge discovery process) this individual element of my 'hypothesis-graph' to be 'true' or 'valid'? Obviously, in the end, the entire hypothesis put forward by a computer algorithm and 'pre'-validated by human reasoning based on 'what we collectively already know' needs to be experimentally proven (call it 'in origine' validation).

What I would like to discuss in a bit more depth is the 'in cerebro' part. For practical purposes I here define 'everything we collectively know', or at least what we have 'shared', as the 'explicitome' (I hope Jon Eisen doesn't include that in his 'bad -omes'): essentially a huge dynamic graph of 'nanopublications', or rather 'cardinal assertions', in which identical, repetitive nanopublications have already been aggregated and assigned an 'evidence factor'. Whenever a given assertion (connecting triple) is not a 'completely established fact' (the sort of assertion you repeat in a new narrative without the need to add a reference or citation), we will, in my opinion, go back to the narrative text 'forever' to 'check its validity'.
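To make the aggregation idea concrete, here is a minimal sketch in Python, using invented example data (the triples, authors and PMIDs are illustrative, not an actual nanopublication schema): identical assertions are collapsed into one 'cardinal assertion' whose evidence factor counts its distinct supporting sources, while the full provenance is kept for 'in cerebro' checking.

```python
from collections import defaultdict

# Hypothetical nanopublications: a (subject, predicate, object) triple
# plus provenance (who asserted it, and in which source).
nanopubs = [
    {"triple": ("malaria", "is_treated_by", "artemisinin"),
     "author": "Smith", "source": "PMID:1111"},
    {"triple": ("malaria", "is_treated_by", "artemisinin"),
     "author": "Jones", "source": "PMID:2222"},
    {"triple": ("artemisinin", "derived_from", "Artemisia annua"),
     "author": "Liu", "source": "PMID:3333"},
]

def aggregate(pubs):
    """Collapse identical assertions into 'cardinal assertions'.

    The evidence factor is the number of distinct supporting sources;
    the provenance list is retained so a human can go back and validate
    each assertion in its original context.
    """
    grouped = defaultdict(list)
    for p in pubs:
        grouped[p["triple"]].append((p["author"], p["source"]))
    return {triple: {"evidence_factor": len({src for _, src in prov}),
                     "provenance": prov}
            for triple, prov in grouped.items()}

explicitome = aggregate(nanopubs)
for triple, info in explicitome.items():
    print(triple, "evidence:", info["evidence_factor"])
```

The point of the sketch is only the shape of the data: repeated assertions do not grow the graph, they strengthen one edge of it.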

Major computer power is now exploited in various intelligent ways to infer the 'implicitome' of what we implicitly know (sorry, Jon, should you ever see this!), but triples captured in RDF are certainly no replacement for narrative when it comes to reading a good line of reasoning, an explanation of why conclusions are warranted, extensive descriptions of materials and methods, etc. So the 'validation' of triples outside their context will be a very important process in eScience for many decades to come. In fact, your earlier metaphor of the 'minutes of science' fits perfectly in this model. 'Why would I believe this particular assertion?' ... 'Well, look in the minutes to see by whom, where, and based on what evidence it was made.'
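As a toy illustration of inferring 'implicit' knowledge from explicit triples (the predicates and entities here are invented for the example, not drawn from any real resource): if a drug inhibits a protein and that protein is implicated in a condition, a simple rule can hypothesise a drug-condition link. The result is exactly the kind of unvalidated hypothesis the letter describes, which still needs 'in cerebro' and ultimately experimental validation.

```python
# Explicitly asserted triples (invented example data).
explicit = {
    ("aspirin", "inhibits", "COX-2"),
    ("COX-2", "implicated_in", "inflammation"),
}

def infer(triples):
    """One naive inference rule over the explicit graph:
    X inhibits Y, Y implicated_in Z  =>  X may_modulate Z (a hypothesis)."""
    inferred = set()
    for s1, p1, o1 in triples:
        for s2, p2, o2 in triples:
            if p1 == "inhibits" and p2 == "implicated_in" and o1 == s2:
                # The inferred triple is a candidate hypothesis,
                # not a validated fact.
                inferred.add((s1, "may_modulate", o2))
    return inferred

implicitome = infer(explicit)
print(implicitome)  # {('aspirin', 'may_modulate', 'inflammation')}
```

Real literature-mining systems are of course far more sophisticated, but the asymmetry the letter points to survives at any scale: the machine proposes edges, and the narrative record remains the place to check why each edge should be believed.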

Now here is a very relevant part of the OA discussion. The time when some people thought that OA was a sort of charity model for scientific publishing is definitely over, with profitable OA publishers around us. The only real difference is: do we (the authors) pay up front, or do we refuse that (for whatever good reason, see below), so that the reader has to pay 'after the fact'? So let's first agree that there is no 'moral superiority', whatever that is, of OA over the traditional subscription model.
I'm not sure I agree, Barend. OK, let's leave morals out of it, but first of all, articles in subscription journals can also be made open access via the so-called 'green' route of depositing the accepted manuscript in an open repository; and secondly, OA at source, the so-called 'gold' route, is, both practically and in terms of transparency, the superior way to share scientific information with anyone who needs or wants it.
We have also seen the downsides of OA, for instance for researchers in developing countries, who may still have great difficulty finding the substantial fees needed to publish in the leading Open Access journals.

I believe, however, that we have a great paradigm shift right in front of us. Computer reasoning and ultralight 'RDF graphs' distributing the results to (inter alia) mobile devices will allow global open distribution of such 'pixels of information' at affordable cost, even in developing countries. An associated practice, obviously, will be to 'go and check' the validity of individual assertions in these graphs. That is exactly where the 'classical' narrative article will continue to have its great value. Reviewing, formatting, cross-linking and sustainably providing the 'minutes of science' is clearly costly, and the community will have to pay for it via various routes. I feel it is perfectly defensible that articles for which the publishing costs have not been paid by the authors, and that are still being provided by classical publishing houses, should continue to 'have a price'. As long as all nanopublications (let's say the assertions representing the 'dry facts' contained in the narrative legacy, as well as data in databases) are exposed in Open (RDF) Spaces for people and computers to reason with, the knowledge discovery process will be enormously accelerated. Some people may still resent that they may have to pay (at least for some time to come) for narrative that was published following the 'don't pay now, subscribe later' adage. We obviously believe that the major players from the 'subscription age' have a responsibility, but also a very strong incentive, to develop new methods and business models that allow a smooth transition to eScience-supportive publication without becoming extinct before they can adapt.

Your views are certainly worth a serious and in-depth discussion, Barend. I invite readers of this blog to join that discussion.

Jan Velterop