Science Commons

By Mags McGeever, University of Edinburgh

Published: 2006

1. Introduction

Many readers will be familiar with Creative Commons, its ethos and the suite of licences it provides. An organisation they may be less familiar with is Science Commons.

Science Commons is a branch of Creative Commons that aims to make the Web work for science the way that it currently works for culture. It is a non-profit organisation aimed at accelerating the research cycle, which it defines as "the continuous production and reuse of knowledge that is at the heart of the scientific method." Science Commons describes itself as having three interlocking initiatives: making scientific research 'reuseful'; enabling 'one-click' access to research materials; and integrating fragmented information sources. Its work is of relevance to anyone within the scientific cycle looking to reduce legal and technical barriers to research and discovery.

2. The Goal — Interoperability of Scientific Data

Science Commons believe that copyright and other legal devices stand in the way of the reuse of scientific scholarship in new discoveries, and advocate making data available in the public domain via the Web so that other scientists can use it for new purposes. The purported advantage is that this systematically increases the chance of making major discoveries and reduces the likelihood of overlooking information that could be useful for progress.

Science Commons are not alone in pondering this issue. Numerous scientists have pointed out the irony that, at a time when we have the technologies to permit global access and distributed processing of scientific data, legal restrictions are making it harder to connect the dots.

"Modern technologies, especially the evolving use of the World Wide Web as a library, have forever changed the mechanisms for delivery and replication of documents. In many fields, results are published nearly as quickly as they are discovered. But copyright law has evolved at a different rate. Progress in modern technology, combined with a legal system that was crafted for the analog era, is now having unintended consequences. One of these is a kind of legal "friction" that hinders the reuse of knowledge and slows innovation." — Source: Science Commons website

The costs of dealing with legal matters (whether these costs are financial or time-based) can take research 'out of play' simply because the legal work can cost more than the likely return from using the data. This stifles scientific innovation, as the value of scientific information increases exponentially when it is connected with other scientific information, and is diminished when segregated by law.

Recently, Science Commons, having considered the legal issues surrounding data sharing in great depth, has changed its recommendations on how to make scientific data and databases available for greater and easier reuse and integration.

3. The Legal Background

When discussing intellectual property and data, a key concept is often overlooked. This is that facts or raw data in themselves are not copyrightable. The thing that is copyrightable is the expression of that data. So, for example, a scientific article could be copyrighted but the raw data on which it rests could not. This is one of the most fundamental premises of intellectual property: that no one can own facts, or ideas, only the inventions or expressions created by their intersection.

However, copyright does protect the structure of a database if, by reason of the selection or arrangement of the contents, it constitutes the author's own intellectual creation (this is the UK position — there are slight international variations).

In relation to this, Science Commons used to provide an FAQ on database licensing that aimed to help scientists understand how a Creative Commons licence could be applied to the copyrightable elements of a database and how to manage the (non-copyrightable) facts stored in that database. However, they discovered that scientists were uncomfortable applying the FAQ because they found it hard to recognise the distinction between what is copyrightable and what is not. They concluded that any approach based on licensing data and databases faces a major difficulty: the need for a widespread understanding of which parts of a dataset attract intellectual property protection (and are thus subject to the licence terms) and which parts do not. As explained above, a database is divided into copyrightable and non-copyrightable elements. Unfortunately, the average user tends to take an all-or-nothing approach and assume that either the whole thing is protected or no part of it is.

In reality, it is difficult not only for the scientist but also for legally trained professionals to provide clear guidance on the boundaries of how the legislation applies to data and databases. Many of the questions about the precise boundaries are still undecided and will be resolved only over time through individual court cases. That is of little use to a data provider who needs to decide on a policy for his/her institution today.

In addition, problems are compounded when more than one jurisdiction is involved (as is frequently the case with data integration) because different countries have different laws in this area. For example, in EU countries there is the added difficulty of understanding how to apply the sui generis database protection (for more on this please see International Considerations). Furthermore, the precise application of that right differs between the various Member States.

A further difficulty identified by Science Commons when licensing data and databases is cascading attributions (if attribution is required as part of the licence terms). As technological capabilities increase for data integration, would a scientist who had run a query over 20,000 data sets need to attribute all 20,000 data depositors?
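The scale of the cascading attribution problem can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `DataSet` record, the `attributions_owed` helper and the generated depositor names are invented for illustration and are not part of any Science Commons tool. It assumes each data set carries a licence flag saying whether attribution is legally required.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    """Hypothetical record describing one deposited data set."""
    name: str
    depositor: str
    requires_attribution: bool


def attributions_owed(datasets):
    """Return the sorted, de-duplicated depositors a derived result must credit."""
    return sorted({d.depositor for d in datasets if d.requires_attribution})


# Illustrative corpus: 20,000 data sets, each from a distinct depositor,
# all under terms that make attribution a legal requirement.
corpus = [DataSet(f"set-{i}", f"depositor-{i}", True) for i in range(20_000)]

owed = attributions_owed(corpus)
print(len(owed), "depositors would need to be credited for a single query")
```

If the same data sets were instead released without a legal attribution requirement, the list returned would be empty, and the question of whom to cite would fall to the norms of the discipline.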

Some of the key questions Science Commons found itself facing were:

  • What kinds of property rights apply to data and databases and how?
  • What happens to those property rights when data is shared globally on the Internet and the rules of different jurisdictions come into play?
  • What is the best way to ensure interoperability of data at both a technical and legal level?

4. The Solution

Science Commons concluded that, in order for scientists to utilise it, any usage system would need to be both legally accurate and simple to use. It would also need to be interdisciplinary, multinational and involve both public and private initiatives.

Having hitherto promoted existing Creative Commons licences for making data available, Science Commons has now moved towards recommending committing works to the public domain, and has been developing a collection of best practices for doing this. This process has evolved into Science Commons' new Protocol for Implementing Open Access Data.

4.1 The Protocol for Implementing Open Access Data

The Protocol for Implementing Open Access Data is a method for ensuring that scientific databases can be legally integrated with one another. The Protocol is built on the public domain status of data in many countries and provides legal certainty to both data deposit and data use. The Protocol is not a licence or legal tool in itself, but instead a methodology for a) creating such legal tools and b) marking data already in the public domain for machine-assisted discovery.

Amongst other things, the Protocol states that any implementation of it must waive all rights necessary for data extraction and reuse (including copyright, sui generis database rights, claims of unfair competition, implied contracts, and other legal rights), and must not impose any obligations on the user of the data or database such as 'copyleft' or 'share alike' or even the legal requirement to provide attribution. In addition, any implementation should define a non-legally binding set of citation norms in clear, lay-readable language.

Terms for use of a database are often dictated by contract in addition to or instead of intellectual property. For that reason the Protocol calls for providers to affirmatively declare that no contractual constraints apply to the database.
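The requirements described above can be read as a checklist. The following Python sketch is one possible way of expressing that checklist, offered as an illustration only: the `LegalTool` record, its field names and the string labels for rights and obligations are all assumptions made for this example, and the Protocol itself also refers to "other legal rights" that are not enumerated here. It is in no sense a substitute for legal review.

```python
from dataclasses import dataclass, field

# Rights the text names as needing to be waived (labels are illustrative).
REQUIRED_WAIVERS = {
    "copyright",
    "sui_generis_database_right",
    "unfair_competition",
    "implied_contract",
}

# Obligations the text says must not be placed on the user.
FORBIDDEN_OBLIGATIONS = {"copyleft", "share_alike", "attribution"}


@dataclass
class LegalTool:
    """Hypothetical summary of a waiver or licence applied to a database."""
    name: str
    waived_rights: set = field(default_factory=set)
    obligations: set = field(default_factory=set)
    declares_no_contract: bool = False
    has_citation_norms: bool = False


def protocol_problems(tool):
    """List the ways a tool falls short of the checklist; empty means none found."""
    problems = []
    missing = REQUIRED_WAIVERS - tool.waived_rights
    if missing:
        problems.append("rights not waived: " + ", ".join(sorted(missing)))
    imposed = FORBIDDEN_OBLIGATIONS & tool.obligations
    if imposed:
        problems.append("obligations imposed: " + ", ".join(sorted(imposed)))
    if not tool.declares_no_contract:
        problems.append("no declaration that contractual constraints are absent")
    if not tool.has_citation_norms:
        problems.append("no non-binding citation norms defined")
    return problems


# Two invented tools: one that meets every item, one that does not.
compliant = LegalTool(
    name="example waiver",
    waived_rights=set(REQUIRED_WAIVERS),
    declares_no_contract=True,
    has_citation_norms=True,
)
share_alike = LegalTool(
    name="example share-alike licence",
    waived_rights={"copyright"},
    obligations={"share_alike", "attribution"},
)

for tool in (compliant, share_alike):
    print(tool.name, "->", protocol_problems(tool) or "no problems found")
```

Keeping citation norms as a separate, non-legal item in the checklist mirrors the Protocol's distinction between what is waived in law and what is merely requested of users.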

Where data cannot be made available under the Protocol the next best thing is for the metadata to be made available under the Protocol. In that way, at least the existence of the protected data is discoverable.

Current examples of legal tools that place a user in compliance with the Protocol are the Open Data Commons Public Domain Dedication and Licence, and the Creative Commons CC-Zero waiver.

4.2 Open Data Commons Public Domain Dedication and Licence

The Open Data Commons Public Domain Dedication and Licence is a document intended to allow free sharing, modification, and use of a work for any purpose, and without any restrictions. The legal tool is intended for use with databases or data, either together or individually.

This waiver and licence tries, to the fullest extent possible, to eliminate or fully license any rights that cover databases and data. Any community norms or similar statements of use in relation to a particular database or data do not form part of the document, and do not act as a contract for access or other terms of use for the database or data.

4.3 CC-Zero Waiver

Creative Commons' CC-Zero initiative was instigated largely as an adjunct to Science Commons' Protocol for Implementing Open Access Data. The CC-Zero Waiver is essentially a replacement for Creative Commons' old Public Domain Dedication. It provides a way for authors either to assign their works public domain status or to put them under a licence as near to public domain as the law allows.

4.4 Community Norms

Having addressed the issue of intellectual property, Science Commons were still left with the issue of attribution, which is no longer legally required where copyright has been rescinded. They concluded that requesting behaviour, such as citation (the non-legally required version of attribution), through norms rather than as a legal requirement based on copyright, contract or other legal grounds for database protection, is greatly beneficial as it allows different scientific disciplines to develop their own norms for citation. This results in certainty and reward without constraining one community by asking them to conform to the norms of another.

One implication of using community norms instead of legally enforceable agreements is a philosophical move away from the concept of punishing the 'bad' actors towards the concept of rewarding the 'good' actors. One way of doing this is by using trademarks which may be applied by those acting in accordance with the appropriate norms. Such trademarks would act as a 'badge of honour' that organisations would be keen to display. Where previously a party may have sued for infringement, they would now reward compliance.

An example of the move towards using norms is the Community Norms document made available by Open Data Commons. They suggest that a user may wish to use it (or a community norms statement the user has created for himself/herself) along with the Open Data Commons Public Domain Dedication and Licence. It is important to note that norms do not have legal status and are not legally binding, so they need to be used in conjunction with a legally binding document, rather than in place of one.

5. NeuroCommons

Science Commons has a proof-of-concept project in the field of neuroscience called NeuroCommons. The NeuroCommons project seeks to make all scientific research materials, such as research articles, annotations, data and physical materials, as available and as useable as they can be. This is achieved by fostering practices that render information in a form that promotes uniform access by computational agents (sometimes called interoperability). They intend to create knowledge sources that combine meaningfully, enabling semantically precise queries that span multiple information sources.

The work of NeuroCommons covers general data and knowledge sources used in computational biology as well as sources specific to neuroscience and neuromedicine. The practices they are developing and promoting are designed to work well on the Semantic Web.

6. The Reproducible Research Standard (RRS)

A related development is a new standard in line with the philosophy of Science Commons: the Reproducible Research Standard. The RRS has been developed in response to the situation where most of the components necessary for reproduction of the results of research and for building upon research (such as code, parameters used, the dataset, acquisition details, documentation, and any metaknowledge used in the experiment) usually remain unpublished.

As explained by Jon Claerbout of Stanford University:

"[a]n article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures."

Supporters argue that publishing the complete research product will accelerate the pace of research in the field and benefit the scientist because open research is built upon and cited more frequently than work published in closed journals. The RRS would provide a mechanism for scientists to license the meta-knowledge associated with the creation and perfection of their data. It would also provide a mechanism through which metadata could be encoded and associated with the data in a machine-readable way.

The RRS sets forth the steps a scientist can take to ensure their work is recognised as reproducible, meaning that the following conditions are satisfied:

  1. The full compendium of information relating to the research is available on the Internet
  2. The media components, including the original selection and arrangement of the data, are licensed under the Creative Commons BY licence which is an attribution-only licence
  3. The code components are licensed under the Modified Berkeley Software Distribution licence
  4. The data has been released into the public domain according to the Science Commons Protocol (since the raw facts are not copyrightable it does not make sense to apply a copyright rescinding licence to them)

A range of standards are suggested to demonstrate how fully a researcher is complying with the RRS. Where a 'Gold Standard' is applied to a work, this demonstrates that it has satisfied all four of the above criteria. A 'Silver Standard' signifies that the work is not fully released but that the researcher has promised to release the full compendium at Gold Standard level to anyone who asks for it. The 'Bronze Standard' will be used where work satisfies some of the above criteria. This could be for example that the code has been released under a share-alike licence or that the data has not been made available in the public domain.
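The grading scheme just described can be sketched in Python as follows. This is an illustration under stated assumptions, not an official RRS tool: the `Compendium` record, its field names and the licence strings "CC-BY" and "Modified-BSD" are labels invented for this example, and the decision to test the Silver promise before falling back to Bronze is an assumption, since the text does not say which takes precedence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Compendium:
    """Hypothetical description of how a research compendium was released."""
    available_online: bool
    media_licence: str
    code_licence: str
    data_public_domain: bool
    promised_on_request: bool = False


def rrs_standard(c: Compendium) -> Optional[str]:
    """Return 'Gold', 'Silver', 'Bronze', or None if no criterion is met."""
    criteria = [
        c.available_online,                 # 1. full compendium on the Internet
        c.media_licence == "CC-BY",         # 2. media under attribution-only licence
        c.code_licence == "Modified-BSD",   # 3. code under Modified BSD licence
        c.data_public_domain,               # 4. data released per the Protocol
    ]
    if all(criteria):
        return "Gold"
    if c.promised_on_request:
        # Assumption: a promise to release at Gold level outranks partial release.
        return "Silver"
    if any(criteria):
        return "Bronze"
    return None


gold = Compendium(True, "CC-BY", "Modified-BSD", True)
silver = Compendium(False, "CC-BY", "Modified-BSD", True, promised_on_request=True)
bronze = Compendium(True, "CC-BY", "share-alike", False)

for label, work in (("gold", gold), ("silver", silver), ("bronze", bronze)):
    print(label, "->", rrs_standard(work))
```

The `bronze` example corresponds to the case mentioned above, where the code has been released under a share-alike licence and the data has not been placed in the public domain.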

Efforts are currently underway to make the RRS an official mark of Science Commons. With Science Commons' support, this could become a standard for funding institutions, publishers or collaborators who wish to ensure public availability of research products in their entirety. In addition, it could become the standard of choice for the scientist/researcher who wants to make their work more visible in order to increase the likelihood of collaboration and citation.

There appears to be some conflict between the philosophy of the RRS and that of Science Commons. For example, the RRS suggests applying a Creative Commons attribution licence to the protectable elements of the database, whereas the Protocol rules out any legal requirement to provide attribution. However, the RRS is still being developed and hopefully the kinks will be ironed out.

7. International Considerations

Science Commons is based in the U.S. Although they are very mindful of the need for data sharing methods to be international in their approach and demonstrate an excellent knowledge of the different international implications, they are inherently U.S. focused. The first difference to bear in mind in the UK is the existence of the database right. This is a sui generis form of intellectual property protection developed exclusively to protect databases. The database right subsists in a database if there has been a substantial investment in obtaining, verifying or presenting the contents of the database (even if the contents and/or structure of the database are not original and therefore do not attract copyright).

A second difference is the moral rights tradition. The U.S. has little tradition of moral rights in copyright but outside the U.S. moral rights are stronger. Moral rights are 'author's rights' and are regarded as inalienable under some legal systems. In the UK they include the right to be identified as the author or director of a work as appropriate, the right to object to the derogatory treatment of a work and the right to object to false attribution of a work. In civil law countries such as France they are stronger still.

Thirdly, the U.S. has a stronger tradition of the Commons. Historically, until the U.S. signed the Berne Convention (an international agreement governing copyright), works there automatically fell into the public domain unless they were labelled and registered as copyrighted.
