Biomedical Big Data

March 4, 2015

This post is part of the invited post collection on the Mind the Byte blog and was written by Prof. Ferran Sanz, head of the Integrative Biomedical Informatics group of GRIB (IMIM-UPF). You can read the original by following this link.

Biomedical sciences are today characterized by the huge amount of data they generate and by the fact that a large fraction of these data is publicly available in electronic format through the Internet. One example is the knowledge contained in the biomedical literature. The summaries of almost all the relevant scientific articles published worldwide are freely available in PubMed, a resource that currently contains the bibliographic references and summaries of more than 20 million articles, with roughly another million added each year. The full text of an increasing fraction of these articles (more than 15% in 2013) is also publicly available. Given the vast amount of electronic text that is accessible, gathering information about a particular topic effectively and comprehensively requires developing and applying computational tools that read the papers automatically (text mining applications). A further challenge, which also requires specific methods and tools, is filtering and prioritizing the information that this automatic gathering produces, so that attention is focused on the most relevant and reliable knowledge.
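To make the idea of text mining followed by prioritization more concrete, here is a minimal sketch in Python. It is only an illustration under strong assumptions: the gene and disease vocabularies and the three abstracts are invented for the example, and plain sentence-level co-occurrence counting stands in for the far more sophisticated methods used by real text mining tools.

```python
import re
from collections import Counter

# Hypothetical vocabularies; a real system would rely on curated dictionaries.
GENES = {"BRCA1", "TP53", "APOE"}
DISEASES = {"breast cancer", "alzheimer's disease"}

# Invented abstracts, used only to exercise the functions below.
ABSTRACTS = [
    "Mutations in BRCA1 increase the risk of breast cancer. TP53 was not studied.",
    "BRCA1 and TP53 are both altered in breast cancer.",
    "APOE variants are associated with Alzheimer's disease.",
]


def find_cooccurrences(abstract):
    """Return the (gene, disease) pairs mentioned within the same sentence."""
    pairs = set()
    # Naive sentence splitting on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", abstract):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        lowered = sentence.lower()
        genes = GENES & tokens
        diseases = {d for d in DISEASES if d in lowered}
        pairs.update((g, d) for g in genes for d in diseases)
    return pairs


def rank_associations(abstracts):
    """Count in how many abstracts each pair co-occurs, most frequent first."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(find_cooccurrences(abstract))
    return counts.most_common()


if __name__ == "__main__":
    for (gene, disease), n in rank_associations(ABSTRACTS):
        print(f"{gene} - {disease}: {n} abstract(s)")
```

Ranking by the number of supporting abstracts is the crudest possible form of prioritization; it is shown here only because it is easy to follow, not because it reflects how reliability is assessed in practice.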

Another example of biomedical big data is the information about known associations between genes (or proteins) and diseases. A substantial amount of this information is freely available, but it tends to be scattered among different databases. To overcome this problem, an open access resource (DisGeNET) has been developed and launched. DisGeNET integrates most of the existing information about gene/protein-disease associations (currently, 381,056 associations between 16,666 genes and 13,172 diseases). Moreover, it offers a user-friendly interface for exploring the resource, as well as tools for analyzing its contents. It is worth pointing out that DisGeNET is available in RDF format, which facilitates the integration of this database with other complementary resources within the Linked Data framework.
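The reason RDF makes integration easier can be sketched in a few lines of Python. RDF describes data as subject-predicate-object triples whose elements are identified by URIs, so two sources that use the same URI for the same gene or disease can be combined by simply taking the union of their triples. The example below is a simplified illustration using plain tuples rather than an RDF library, and every URI in it is a hypothetical placeholder under example.org, not an actual DisGeNET identifier.

```python
# Hypothetical namespace; these are NOT real DisGeNET URIs.
EX = "http://example.org/"

# Source 1: gene-disease associations expressed as (subject, predicate, object).
associations = {
    (EX + "gene/G1", EX + "associatedWith", EX + "disease/D1"),
    (EX + "gene/G2", EX + "associatedWith", EX + "disease/D1"),
}

# Source 2: complementary annotations that reuse the same identifiers.
annotations = {
    (EX + "gene/G1", EX + "encodes", EX + "protein/P1"),
    (EX + "disease/D1", EX + "label", "example disease"),
}


def merge_graphs(*graphs):
    """Merge triple sets; shared URIs link the sources without extra mapping."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph, subject):
    """Return every (predicate, object) known about a subject, sorted."""
    return sorted((p, o) for s, p, o in graph if s == subject)


graph = merge_graphs(associations, annotations)

if __name__ == "__main__":
    for predicate, obj in describe(graph, EX + "gene/G1"):
        print(predicate, "->", obj)
```

After merging, a single query about gene G1 returns both its disease association and the protein it encodes, even though the two facts came from different sources.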