Text mining web services
BioTec TU Dresden and Humboldt-Universität zu Berlin are participating in the BioCreative MetaServer (BCMS) project. BCMS is a joint effort, currently of 13 groups, to provide web services for annotating biomedical texts. The platform unites multiple systems that annotate gene/protein identifiers from EntrezGene, UniProt, and other sequence databases; gene and protein mentions in text (without IDs); species occurring in texts (mapped to NCBI Taxonomy IDs); and predictions of whether or not an article discusses protein-protein interactions.
Please refer to reference [1] for a more detailed introduction to the BCMS project.
Access to our services
The services in the BCMS project use the XML-RPC scheme for requests and responses. Please find below the access URLs for requesting GN (gene normalization), GM (gene mention), TX (taxonomy/species), and PI (protein interaction) annotations. We provide Java clients (source and binary) that query our servers and can easily be modified to suit your specific needs.
- GN annotations:
- Hosted by Biotec TU Dresden at http://gopubmed2.biotec.tu-dresden.de/XmlRpcServlet
- GM, TX, PI annotations:
- Hosted by Humboldt-Universität zu Berlin at http://141.20.27.241:81/XmlRpcServlet
- Method name:
- The method to invoke is called Annotator.getAnnotation and expects a single PubMed ID as parameter.
- Java client:
- Package with sources and binaries: client.tar.gz.
You will also need the libraries xmlrpc-common-3.0.jar, xmlrpc-client-3.0.jar, and ws-commons-util-1.0.1.jar (or newer versions), which you can download from the Apache mirrors; see http://www.apache.org/dyn/closer.cgi/ws/xmlrpc/. Download the file called xmlrpc-current-bin.tar.gz and unpack it. The libraries are contained in the lib/ folder.
Please also read this short summary
- Output:
- The returned values are tuples that describe each annotation. For gene mention normalization and protein mentions, the tuple consists of four elements: the referenced database (dbname, either EntrezGene or UniProt), the gene's or protein's ID in that database (dbid), the species for this gene/protein (taxid, from NCBI Taxonomy), and a confidence value indicating how reliable this annotation is (confidence, between 0 and 1). For species, the tuple contains the NCBI Taxonomy ID (taxid) and a confidence value. For protein-protein interactions, the tuple states whether an interaction was predicted (true) and with which confidence. Note that for articles not predicted to contain an interaction, no tuple is returned (rather than a tuple with the value 'false').
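For readers who want to call the services without the Apache XML-RPC libraries, the request described in the list above can be sketched with the HTTP client built into Java 11 and later. This is a minimal illustration, not the code shipped in client.tar.gz. The class and method names (BcmsRequestSketch, buildPayload, send) are hypothetical, the PubMed ID 12345678 is a placeholder, and encoding the ID as an XML-RPC string is an assumption; the servers may expect an int instead.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper; not part of the BCMS client package. */
final class BcmsRequestSketch {

    /** GN endpoint as listed in the access details. */
    static final String GN_URL = "http://gopubmed2.biotec.tu-dresden.de/XmlRpcServlet";

    /** Builds the XML-RPC methodCall body for Annotator.getAnnotation. */
    static String buildPayload(String pubmedId) {
        // Assumption: the PubMed ID is sent as an XML-RPC <string>;
        // the server may expect <int> instead.
        if (!pubmedId.matches("\\d+")) {
            throw new IllegalArgumentException("PubMed ID must be numeric: " + pubmedId);
        }
        return "<?xml version=\"1.0\"?>"
             + "<methodCall>"
             + "<methodName>Annotator.getAnnotation</methodName>"
             + "<params><param><value><string>" + pubmedId
             + "</string></value></param></params>"
             + "</methodCall>";
    }

    /** Posts the payload and returns the raw XML response; needs network access. */
    static String send(String url, String pubmedId) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "text/xml")
                .POST(HttpRequest.BodyPublishers.ofString(buildPayload(pubmedId)))
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    public static void main(String[] args) {
        // Prints the request body only; no network call is made here.
        System.out.println(buildPayload("12345678"));
    }
}
```

Calling BcmsRequestSketch.send(BcmsRequestSketch.GN_URL, "12345678") would post the request to the GN server; the response is an XML-RPC methodResponse document that still has to be decoded into the tuples described above.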
The output of the client on the command line will be a list of annotations, for example:
taxons confidence 1.0
taxons taxid 2759
normalizations confidence 1.0
normalizations dbname UniProt
normalizations dbid O12705
normalizations taxid 51677
normalizations confidence 0.25
normalizations dbname EntrezGene
normalizations dbid 9054
normalizations taxid 9606
interaction true
interaction 0.8
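To post-process this command-line output, the lines can be grouped back into tuples. The following is a rough sketch under assumptions inferred only from the sample above: each line holds an annotation type, a field name, and a value; interaction lines carry no field name; and a new tuple starts whenever the type changes or a field name repeats. The class AnnotationOutputParser and the field names "predicted" and "confidence" used for interaction lines are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parser for the client's command-line output. */
final class AnnotationOutputParser {

    /** One annotation tuple: its type (e.g. "taxons") and its fields. */
    static final class Tuple {
        final String type;
        final Map<String, String> fields = new LinkedHashMap<>();
        Tuple(String type) { this.type = type; }
    }

    static List<Tuple> parse(List<String> lines) {
        List<Tuple> tuples = new ArrayList<>();
        Tuple current = null;
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            String type, key, value;
            if (parts.length == 3) {
                type = parts[0]; key = parts[1]; value = parts[2];
            } else if (parts.length == 2) {
                // Assumption: interaction lines have no field name, so one is
                // derived from the value (boolean flag vs. numeric confidence).
                type = parts[0];
                value = parts[1];
                key = (value.equals("true") || value.equals("false"))
                        ? "predicted" : "confidence";
            } else {
                continue; // skip blank or unexpected lines
            }
            // Start a new tuple when the type changes or a field repeats.
            if (current == null || !current.type.equals(type)
                    || current.fields.containsKey(key)) {
                current = new Tuple(type);
                tuples.add(current);
            }
            current.fields.put(key, value);
        }
        return tuples;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "taxons confidence 1.0", "taxons taxid 2759",
            "normalizations confidence 1.0", "normalizations dbname UniProt",
            "normalizations dbid O12705", "normalizations taxid 51677",
            "normalizations confidence 0.25", "normalizations dbname EntrezGene",
            "normalizations dbid 9054", "normalizations taxid 9606",
            "interaction true", "interaction 0.8");
        for (Tuple t : parse(sample)) {
            System.out.println(t.type + " " + t.fields);
        }
    }
}
```

Applied to the sample, this grouping yields four tuples: one species annotation, two normalizations, and one interaction prediction. If the real client prints fields in a different order or repeats types differently, the tuple boundary rule would need to be adjusted.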
References
- [1] Florian Leitner et al.: Introducing Meta-Services for Biomedical Information Extraction. Genome Biology, Special Issue on the BioCreative Challenge Evaluation, 2007. To appear.
- [2] Jörg Hakenberg, Loic Royer, Conrad Plake, Hendrik Strobelt, Michael Schroeder:
Me and my friends: gene mention normalization with background knowledge.
Proceedings of the Second BioCreative Challenge Evaluation Workshop, April 23-25 2007, Madrid, Spain, ISBN 84-933255-6-2 (oral presentation and proceedings).
- [3] Jörg Hakenberg, Michael Schroeder, Ulf Leser:
Consensus pattern alignment to find protein-protein interactions in text.
Proceedings of the Second BioCreative Challenge Evaluation Workshop, April 23-25 2007, Madrid, Spain, ISBN 84-933255-6-2 (oral presentation and proceedings).
Last changes: JH, 10/03/2007.