Wednesday, July 20, 2005

In support of PubChem: towards open chemical information

Public release date: 18-Jul-2005

Contact: Juliette Savin
BioMed Central

XML architecture provides a new way of publishing chemical information

An XML-based approach to the communication of chemical information in the biomedical literature would prevent the loss of crucial information and facilitate the re-use of data - and would be easily achievable using existing open tools and resources. A commentary article published today in the Open Access journal BMC Bioinformatics argues that it is time chemistry followed in the footsteps of bioinformatics and structural biology and moved towards the creation of an open semantic web facilitating access to chemical information.

In the article, Peter Murray-Rust, from the University of Cambridge, UK, and John Mitchell and Henry Rzepa from Imperial College London, UK, argue, using three case studies, that conventional methods such as cutting-and-pasting chemical information are time-consuming and introduce errors. The authors argue in favour of an open XML architecture linking to connection tables or open databases such as PubChem, to identify chemical compounds mentioned in the biomedical literature. This comes as additional support for open chemical databases like the NIH's PubChem, which is currently at the centre of a legal battle between the NIH and the American Chemical Society (ACS). The ACS runs the very lucrative Chemical Abstracts Service and is directly threatened by public databases.

Murray-Rust et al. explain that an open XML-based architecture would provide a cost-effective and user-friendly way to publish chemical information.

Such a structure would avoid the loss of data - currently 80-99% of chemical information is never published due to the lack of a simple technical protocol to access it. It would make chemical information easier to read, save time, and would allow published data to be aggregated and re-used. Murray-Rust et al. recognise that implementing such a system might take time and money and might not be supported by all publishers. However, "if publishers adopt these tools and protocols, then the quality and quantity of chemical information available to bioscientists will increase and the authors, publishers and readers will find the process cost-effective", write the authors. They add that most chemical information already exists in electronic format in the chemists' computers and could be converted into XML format very easily, without any loss.

Murray-Rust et al. used three recent articles containing chemical information, all published in BMC-series journals from BioMed Central, as the basis for case studies on the usefulness of an XML-based tool for the identification of chemical compounds in biomedical literature.

Chemical compounds can be listed using connection tables and associated chemical structure diagrams, but also by structural information such as that provided by IUPAC-NIST Chemical Identifiers (INChI). They can also be found using open semantically free identifiers such as those provided by PubChem or based on their common names using Open lexicons; or by systematic chemical name. XML-based information embedded in the text of digitally published chemistry documents could refer to one or more of these, to help readers identify the compounds.

In their first case study, Murray-Rust et al. coded each molecule mentioned in the article in XML-based Chemical Markup Language (CML), using a simple conversion protocol and giving the molecules their PubChem IDs. They estimate that the entire coding process took them the same amount of time as it would take a reader to look up the molecules in chemical databases. In addition to the PubChem ID, CML could contain the INChI identifier and meta-data for each molecule. For the second article, they show that, even using an automated system, looking for information about chemical compounds mentioned in the article takes around 45 minutes. This could have been avoided if the compounds had been marked up and linked to connection tables and open databases. In the third article, the name of one compound had been misspelt and others were unclear. This made it difficult for text-mining robots to find information about the compounds, and not all the data needed was retrieved.
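To make the first case study concrete, here is a minimal sketch of what marking up a single molecule might look like. It is not the authors' actual protocol or tooling. The namespace URI, the `molecule`, `name` and `identifier` element names, and the `convention`/`value` attributes follow common CML usage as I understand it and should be checked against the CML schema; the `pubchem:cid` and `iupac:inchi` convention labels and both helper functions are hypothetical names chosen for illustration. Aspirin is used only as a familiar example and does not come from the articles studied.

```python
import xml.etree.ElementTree as ET

# Assumed CML namespace URI; verify against the CML schema before relying on it.
CML_NS = "http://www.xml-cml.org/schema"
ET.register_namespace("cml", CML_NS)


def make_molecule(mol_id, name, pubchem_cid, inchi):
    """Build a CML-style <molecule> element carrying a name and two identifiers.

    Hypothetical helper: the convention labels below are illustrative only.
    """
    mol = ET.Element(f"{{{CML_NS}}}molecule", {"id": mol_id})
    name_el = ET.SubElement(mol, f"{{{CML_NS}}}name")
    name_el.text = name
    ET.SubElement(
        mol,
        f"{{{CML_NS}}}identifier",
        {"convention": "pubchem:cid", "value": str(pubchem_cid)},
    )
    ET.SubElement(
        mol,
        f"{{{CML_NS}}}identifier",
        {"convention": "iupac:inchi", "value": inchi},
    )
    return mol


def lookup_identifiers(mol):
    """Return {convention: value} for every identifier found under a molecule."""
    return {
        el.get("convention"): el.get("value")
        for el in mol.iter(f"{{{CML_NS}}}identifier")
    }


# Illustrative compound only (aspirin), not taken from the case-study articles.
aspirin = make_molecule(
    "m1",
    "aspirin",
    2244,
    "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
)

xml_text = ET.tostring(aspirin, encoding="unicode")
print(xml_text)

# A reader's tool can recover the identifiers without guessing from the name.
round_trip = ET.fromstring(xml_text)
print(lookup_identifiers(round_trip))
```

The point of the sketch is that once identifiers travel with the text, a misspelt or ambiguous compound name no longer blocks a lookup: software reads the embedded identifier instead of trying to interpret the name.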


This press release is based on the article:
Communication and re-use of chemical information in bioscience
Peter S Murray-Rust, John BO Mitchell, and Henry S Rzepa
BMC Bioinformatics 2005, 6:180 (18 July 2005)

This article is available free of charge, according to BMC Bioinformatics' Open Access policy at:

Monday, July 18, 2005

In lieu of flowers

A rather poignant open letter from Heather Morrison to the American Association for Cancer Research:

Dear Margaret Foti, CEO, American Association for Cancer Research,

According to the SPARC Open Access News of July 13, AACR is one of a group that has signed a letter on July 7 to Senator Arlen Specter, expressing "significant concerns about the National Institutes of Health duplicating private sector on-line publishing".

The banner at the top of your website this morning does not say: defending the interests of the private sector in the publishing industry.

What your banner says is quite different. It is "Saving Lives Through Research".

This is a noble reason for the existence of your association. My request is that AACR review its mission, and reconsider its position on the NIH Public Access Policy. I cannot see how such a review could possibly come to any other conclusion than that your mission compels you to fully support and participate in Public Access.

Change is difficult for anyone, and I have no doubt that the small changes needed for Public Access will be a little bit uncomfortable for your association. I urge you, however, to consider how many families - not only in the U.S. but throughout the world - have asked for donations to cancer research in lieu of flowers. How many have wanted to set aside their own comforts in bereavement to speed the research, so that others would be spared the agony that they and their loved ones went through. When so many are seeing the need to speed the research and placing it above their own comfort, surely your association can, too.

Surely you realize that the best way to "accelerate the dissemination of new research findings" - to borrow a phrase from your mission statement - is for cancer researchers to share their findings as openly as possible, as soon as possible. The ideal is to post the findings openly on the web, just as soon as the quality control process (peer review) is complete - generally before publication. Imposing any delay, or any restrictions on dissemination, is contrary to your mission statement.

Your mission also says that you will "advance the understanding of cancer etiology, prevention, diagnosis, and treatment throughout the world." Outside the wealthy nations, there are many universities with no journal subscriptions at all; and, many places where lack of funds to purchase resources is a deterring factor to education, period. Participating in the NIH Public Access program clearly advances your mission. Lack of access is a factor in the U.S. too, of course; not all states are equally wealthy, and not all can afford all the journals for their university libraries.

Please share this message with your Board, and your members. If your basic mission has changed from saving lives to private sector profits, your mission statement needs updating. If your mission continues to be to accelerate cancer research, then you need to reverse your stance on the NIH's Public Access Policy, from opposition to enthusiastic support and

To facilitate dissemination and encourage other associations to consider their missions when thinking about open access, this is an open letter, copied to the SPARC Open Access Forum.

I congratulate the U.S. National Institutes of Health and the U.S. Senate for their support for Public Access. This is one policy area where many, myself included, see the United States as providing an example of visionary leadership, which other nations would be well advised to follow.

best wishes,

Heather G. Morrison

Thursday, July 07, 2005

Duelling Databases

From The Scientist:
Can companies still make money selling genomic and molecular information?

Celera Genomics made hundreds of millions of dollars by selling access to its proprietary genome sequence information. But this month, Celera discontinued its database subscription service and made its 30 billion base pairs of genomic data of humans, rats, and mice freely available through GenBank, operated by the US National Center for Biotechnology Information.
Some see Celera's decision to exit the sequence business as proof of the adage that information wants to be free, and yet another sign that selling access to data is no longer a viable business model. "The trend is perfectly clear. It would be surprising to find any company setting up a business plan that was based on a subscription database of precompetitive information," says Francis Collins, director of the National Human Genome Research Institute and leader of the Human Genome Project, Celera's publicly funded rival in the race to sequence the human genome.

More (including about PubChem, CAS, et cetera)...