Gene annotation data

Data sources

We currently obtain gene annotation data from several public data resources and keep it up to date, so that you don’t have to:

Source        Update frequency                      Notes
------------  ------------------------------------  --------------------------------------------------------------
NCBI Entrez   weekly snapshot
Ensembl       whenever a new release is available   Ensembl Pre! and EnsemblGenomes are not included at the moment
Uniprot       whenever a new release is available
NetAffx       whenever a new release is available   For “reporter” field
PharmGKB      whenever a new release is available
UCSC          whenever a new release is available   For “exons” field
CPDB          whenever a new release is available   For “pathway” field

The most up-to-date information on these data sources can be accessed here.

Gene object

Gene annotation data are both stored and returned as a gene object, which is essentially a collection of fields (attributes) and their values:

{
    "_id": "1017",
    "_score": 20.4676,
    "taxid": 9606,
    "symbol": "CDK2",
    "entrezgene": 1017,
    "name": "cyclin-dependent kinase 2",
    "genomic_pos": {
        "start": 55966769,
        "chr": "12",
        "end": 55972784,
        "strand": 1
    }
}

The example above omits most of the available fields. For full examples, check out a few genes: CDK2, ADA. You can also try our interactive API page.
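As a minimal sketch of working with a gene object on the client side, the snippet below parses the trimmed example above and reads a few of its fields. The helper name summarize_gene is hypothetical, and the code assumes "genomic_pos" is a single object, as it is in the example.

```python
import json

# The trimmed gene object from the example above.
GENE_JSON = """
{
    "_id": "1017",
    "_score": 20.4676,
    "taxid": 9606,
    "symbol": "CDK2",
    "entrezgene": 1017,
    "name": "cyclin-dependent kinase 2",
    "genomic_pos": {"start": 55966769, "chr": "12", "end": 55972784, "strand": 1}
}
"""


def summarize_gene(gene):
    """Return a one-line summary; missing fields fall back to "?"."""
    pos = gene.get("genomic_pos", {})
    location = "chr{}:{}-{}".format(
        pos.get("chr", "?"), pos.get("start", "?"), pos.get("end", "?")
    )
    return "{} ({}), taxid {}, {}".format(
        gene.get("symbol", "?"),
        gene.get("name", "?"),
        gene.get("taxid", "?"),
        location,
    )


gene = json.loads(GENE_JSON)
summary = summarize_gene(gene)
print(summary)
```

Using .get() with a fallback keeps the helper usable on gene objects that lack some fields, since not every field is present for every gene.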

_id field

Each gene object contains an “_id” field as its primary key. The value of the “_id” field is the NCBI gene ID (the same as the “entrezgene” field, but as a string) when one is available for the gene object; otherwise, the Ensembl gene ID is used (e.g. for Ensembl-only genes). Here is an example. We recommend using the “entrezgene” field for the NCBI gene ID and the “ensembl.gene” field for the Ensembl gene ID, instead of relying on the “_id” field.

Note

Regardless of what the value of the “_id” field looks like, either the NCBI gene ID or the Ensembl gene ID always works with our gene annotation service, /v3/gene/<geneid>.
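The recommendation above can be sketched as follows: read the explicit ID fields rather than "_id", then build the annotation path from whichever ID is available. The helper names are hypothetical, the Ensembl ID is a made-up placeholder, and the code assumes that "ensembl.gene" appears in the returned JSON as a nested "ensembl" object with a "gene" key.

```python
def preferred_ids(gene):
    """Pull the explicit ID fields instead of relying on "_id"."""
    ensembl = gene.get("ensembl")
    # Assumption: "ensembl.gene" is stored as {"ensembl": {"gene": ...}}.
    ensembl_gene = ensembl.get("gene") if isinstance(ensembl, dict) else None
    return {"entrezgene": gene.get("entrezgene"), "ensembl_gene": ensembl_gene}


def gene_path(gene_id):
    """Build the /v3/gene/<geneid> path for an NCBI or Ensembl gene ID."""
    return "/v3/gene/{}".format(gene_id)


# A gene with an NCBI gene ID, taken from the example gene object.
ncbi_gene = {"_id": "1017", "entrezgene": 1017}

# A hypothetical Ensembl-only gene; the ID below is a placeholder.
ensembl_only_gene = {
    "_id": "ENSG00000000001",
    "ensembl": {"gene": "ENSG00000000001"},
}

for record in (ncbi_gene, ensembl_only_gene):
    ids = preferred_ids(record)
    # Prefer the NCBI gene ID when present, else fall back to Ensembl.
    chosen = ids["entrezgene"] if ids["entrezgene"] is not None else ids["ensembl_gene"]
    print(gene_path(chosen))
```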

_score field

You will often see a “_score” field in the returned gene object. It is an internal score representing how well the query matches the returned gene object. It carries little meaning in the gene annotation service, where only one gene object is returned. In the gene query service, the returned gene hits are sorted by score in descending order by default.
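If hits from several queries are merged on the client side, the same ordering can be reproduced locally. The sketch below sorts a list of hits by “_score” in descending order; the hits and their scores are made-up illustrative values, and treating a missing score as 0.0 is an assumption of this example, not documented service behavior.

```python
def sort_hits_by_score(hits):
    """Return a new list of hits, highest "_score" first."""
    # Assumption: a hit without "_score" is ranked as if its score were 0.0.
    return sorted(hits, key=lambda hit: hit.get("_score", 0.0), reverse=True)


# Hypothetical hits with made-up scores.
hits = [
    {"_id": "a", "_score": 1.5},
    {"_id": "b", "_score": 20.4676},
    {"_id": "c"},
    {"_id": "d", "_score": 7.0},
]

ranked = sort_hits_by_score(hits)
print([hit["_id"] for hit in ranked])
```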

Species

We support all species annotated by NCBI and Ensembl. All of our services accept a “species” parameter to limit the query results. The “species” parameter accepts taxonomy IDs as input. You can look up the taxonomy IDs for your species of interest in NCBI Taxonomy.

For convenience, you can also pass the following common names for commonly used species (e.g. “species=human,mouse,rat”):

Common name   Genus name                Taxonomy id
------------  ------------------------  -----------
human         Homo sapiens              9606
mouse         Mus musculus              10090
rat           Rattus norvegicus         10116
fruitfly      Drosophila melanogaster   7227
nematode      Caenorhabditis elegans    6239
zebrafish     Danio rerio               7955
thale-cress   Arabidopsis thaliana      3702
frog          Xenopus tropicalis        8364
pig           Sus scrofa                9823

If needed, you can pass “species=all” to query against all available species, although we recommend passing only the species you need for a faster response.
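A minimal sketch of preparing a “species” value on the client side: it resolves the common names listed above to taxonomy ids and joins them with commas. The helper names are hypothetical, and the comma-separated form for taxonomy ids is assumed by analogy with the “species=human,mouse,rat” example.

```python
# Common names and taxonomy ids, copied from the table above.
COMMON_SPECIES = {
    "human": 9606,
    "mouse": 10090,
    "rat": 10116,
    "fruitfly": 7227,
    "nematode": 6239,
    "zebrafish": 7955,
    "thale-cress": 3702,
    "frog": 8364,
    "pig": 9823,
}


def to_taxids(species):
    """Resolve a mix of common names and taxonomy ids to integer taxonomy ids."""
    taxids = []
    for item in species:
        text = str(item).strip().lower()
        if text in COMMON_SPECIES:
            taxids.append(COMMON_SPECIES[text])
        elif text.isdigit():
            taxids.append(int(text))
        else:
            raise ValueError("unknown species: {!r}".format(item))
    return taxids


def species_param(species):
    """Build a comma-separated value for the "species" parameter."""
    return ",".join(str(taxid) for taxid in to_taxids(species))


print(to_taxids(["human", "Mouse", 10116]))
print(species_param(["human", "zebrafish"]))
```

Rejecting unknown names locally gives an early error instead of a query that silently matches nothing.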

Genome assemblies

Our gene query service supports genome interval queries. We import genomic location data from Ensembl, so all species available there are supported. You can find their reference genome assembly information here.

This table lists the genome assemblies for commonly used species:

Common name   Genus name                Genome assembly
------------  ------------------------  -------------------------------------
human         Homo sapiens              GRCh38 (hg38); hg19 is also supported
mouse         Mus musculus              GRCm38 (mm10); mm9 is also supported
rat           Rattus norvegicus         Rnor_6.0 (rn6)
fruitfly      Drosophila melanogaster   BDGP6 (dm6)
nematode      Caenorhabditis elegans    WBcel235 (ce11)
zebrafish     Danio rerio               GRCz10 (danRer10)
frog          Xenopus tropicalis        JGI_7.0 (xenTro7)
pig           Sus scrofa                Sscrofa10.2 (susScr3)
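Coordinates are only meaningful relative to one assembly, so an interval should be expressed in the same assembly as the gene positions it is compared against. As a client-side illustration of what matching an interval means, the sketch below tests whether the “genomic_pos” of a gene object overlaps a given interval. The helper names are hypothetical, and treating both ends as inclusive is an assumption of this example rather than a statement about how the service matches intervals.

```python
def normalize_chrom(chrom):
    """Drop an optional "chr" prefix so that "chr12" and "12" compare equal."""
    text = str(chrom)
    return text[3:] if text.lower().startswith("chr") else text


def overlaps(genomic_pos, chrom, start, end):
    """Return True if genomic_pos shares at least one base with chrom:start-end.

    Assumption: start and end are inclusive on both sides.
    """
    if normalize_chrom(genomic_pos["chr"]) != normalize_chrom(chrom):
        return False
    return genomic_pos["start"] <= end and genomic_pos["end"] >= start


# "genomic_pos" from the CDK2 example gene object.
cdk2_pos = {"start": 55966769, "chr": "12", "end": 55972784, "strand": 1}

print(overlaps(cdk2_pos, "chr12", 55970000, 55980000))
print(overlaps(cdk2_pos, "12", 55972785, 55980000))
```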

Available fields

The table below lists all of the possible fields that could be in a gene object.

Field Indexed Type Notes