Data and Statistics

 

A – G

Aquatic Commons
The Aquatic Commons is a thematic digital repository covering the natural marine, estuarine/brackish and freshwater environments. It includes all aspects of the science, technology, management and conservation of these environments, their organisms and resources, and the economic, sociological and legal aspects. It is complementary to OceanDocs, which is supported by the Intergovernmental Oceanographic Commission (IOC)/International Oceanographic Data and Information Exchange (IODE) specifically to collect, preserve and facilitate access to all research output from members of their Ocean Data and Information Networks (ODINS).
ARAPORT: Arabidopsis Information Portal
The Arabidopsis Information Portal is an open-access online community resource for Arabidopsis research. Araport enables biologists to navigate from the Arabidopsis thaliana Col-0 reference genome sequence to its associated annotation, including gene structure, gene expression, protein function, and interaction networks. Araport was funded in 2013 and came online in 2014. Araport already offers a single interface through which to access a wide range of Arabidopsis information. Araport will grow through contributions from other labs in the form of modules: data, computation, and visualization tools.
Bilbao Crystallographic Server
Free crystallographic programs and databases from the Materials Laboratory at the University of the Basque Country, Spain.
Binding Database
BindingDB is a public, web-accessible database of measured binding affinities, focusing chiefly on the interactions of proteins considered to be candidate drug-targets with ligands that are small, drug-like molecules. BindingDB supports medicinal chemistry and drug discovery via literature awareness and development of structure-activity relations (SAR and QSAR); validation of computational chemistry and molecular modeling approaches such as docking, scoring and free energy methods; chemical biology and chemical genomics; and basic studies of the physical chemistry of molecular recognition. BindingDB also includes a small collection of host-guest binding data of interest to chemists studying supramolecular systems.
Biochemical Pathways Map
Interactive mapping of the biochemical pathways, created by Roche.
Biological Macromolecule Crystallization Database (BMCD)
The BMCD stores information on protein and nucleic acid crystals that have been reported in the literature or deposited in the Protein Data Bank. Crystal growth conditions have been parsed into separate chemicals with numerical concentrations to facilitate data mining. The mission of the BMCD is to enable the discovery of relations among protein properties, crystal conditions, and crystal behavior, in order to facilitate the design of crystal screening strategies for the determination of new structures.
BLAST: Basic Local Alignment Search Tool
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
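To illustrate what a "region of local similarity" means, here is a minimal sketch of a classic dynamic-programming local alignment score (Smith-Waterman style). It is not BLAST's method: BLAST relies on a much faster heuristic and also reports statistical significance, which this sketch omits. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Scoring values are arbitrary illustrative choices, not BLAST defaults.
    """
    best = 0
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The shared stretch "ACG" is the best-scoring local region here.
    print(local_alignment_score("TTACGTT", "GGACGGG"))
```

A higher score indicates a longer or cleaner shared region; two sequences with nothing in common score 0.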
BRENDA: Comprehensive Enzyme Information System
BRENDA is the main collection of enzyme functional data available to the scientific community. The enzymes are classified according to the Enzyme Commission list of enzymes. Some 6500 “different” enzymes are covered. The data collection is being developed into a metabolic network information system with links to enzyme expression and regulation information.
CDC’s Data and Statistics
Data and Statistics information from the Centers for Disease Control and Prevention.
ChEBI
Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on ‘small’ chemical compounds. The term ‘molecular entity’ refers to any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc., identifiable as a separately distinguishable entity. The molecular entities in question are either products of nature or synthetic products used to intervene in the processes of living organisms. All data in the database is non-proprietary or is derived from a non-proprietary source. It is thus freely accessible and available to anyone. In addition, each data item is fully traceable and explicitly referenced to the original source.
ChEMBL
ChEMBL is a database of bioactive drug-like small molecules. It contains 2-D structures, calculated properties (e.g. logP, molecular weight, Lipinski parameters, etc.) and abstracted bioactivities (e.g. binding constants, pharmacology and ADMET data).
ChemDB
ChemDB is a suite of chemical datasets and learning tools created by UC Irvine. Includes a chemical search feature for about 4 million compounds from vendor catalogs.
Chemical Structure Lookup Service
Look up whether a structure occurs in over 100 different databases, both public and commercial. Search input must be InChIs, FICuS or uuuuu identifiers, molecular formulas, SMILES, or IDs used in the original individual database.
Crystallography Open Database
Open-access collection of crystal structures in CIF format of organic, inorganic, metal-organic compounds and minerals, excluding biopolymers.
Data.gov
The home of the U.S. Government’s open data. Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more.
The Data Repository for University of Minnesota (DRUM)
DRUM is a publicly available collection of digital research data generated by U of M researchers, students, and staff. Anyone can search and download the data housed in the repository, instantly or by request.
DrugBank
The DrugBank database is a comprehensive, freely accessible, online database containing information on drugs and drug targets. As both a bioinformatics and a cheminformatics resource, DrugBank combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. Because of its broad scope, comprehensive referencing and unusually detailed data descriptions, DrugBank is more akin to a drug encyclopedia than a drug database. As a result, links to DrugBank are maintained for nearly all drugs listed in Wikipedia. DrugBank is widely used by the drug industry, medicinal chemists, pharmacists, physicians, students and the general public.
The Electron Microscopy Data Bank (EMDB) at PDBe
Owned by the Protein Data Bank in Europe (PDBe), the Electron Microscopy Data Bank (EMDB) is a public repository for electron microscopy density maps of macromolecular complexes and subcellular structures. It covers a variety of techniques, including single-particle analysis, electron tomography, and electron (2D) crystallography.
EMDataResource: Unified Data Resource for 3DEM
EMDataResource is the unified global portal for one-stop deposition and retrieval of 3DEM density maps, atomic models and associated metadata, and is a joint effort among investigators of the Stanford/SLAC CryoEM Facility and the Research Collaboratory for Structural Bioinformatics (RCSB) at Rutgers, in collaboration with the EMDB team at the European Bioinformatics Institute. EMDataResource also serves as a resource for news, events, software tools, data standards, and validation methods for the 3DEM community.
ENZYME
ENZYME is a repository of information relative to the nomenclature of enzymes. It is primarily based on the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB) and it describes each type of characterized enzyme for which an EC (Enzyme Commission) number has been provided. ENZYME now includes entries with preliminary EC numbers. Preliminary EC numbers include an ‘n’ as part of the fourth (serial) digit (e.g. EC 3.5.1.n3).
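The EC number format described above can be checked with a short script. This is a minimal sketch, assuming EC numbers are written as four dot-separated fields with an optional "EC" prefix; the function name `parse_ec_number` and the pattern are illustrative and not part of ENZYME itself.

```python
import re

# Four dot-separated fields; the last is a serial number, optionally
# prefixed with "n" to mark a preliminary EC number (e.g. "n3").
EC_PATTERN = re.compile(r"^(?:EC\s*)?(\d+)\.(\d+)\.(\d+)\.(n?)(\d+)$")


def parse_ec_number(text):
    """Return (class, subclass, sub-subclass, serial, is_preliminary), or None.

    Hypothetical helper for illustration; it rejects partial entries
    such as "3.5.1" that lack the fourth field.
    """
    match = EC_PATTERN.match(text.strip())
    if match is None:
        return None
    first, second, third, prelim_flag, serial = match.groups()
    return (int(first), int(second), int(third), int(serial), prelim_flag == "n")


if __name__ == "__main__":
    print(parse_ec_number("EC 3.5.1.n3"))  # preliminary entry from the text above
    print(parse_ec_number("1.1.1.1"))
```

Separating the "n" flag from the serial number makes it easy to filter preliminary entries out of, or into, a list of EC numbers.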
European Bioinformatics Institute (EMBL-EBI) Data Resources
We maintain the world’s most comprehensive range of freely available molecular data resources. Developed in collaboration with our colleagues worldwide, our databases and tools help scientists share data efficiently, perform complex queries and analyse the results in different ways. Our work supports millions of researchers, who are wet-lab and computational biologists working in all areas of the life sciences, from biomedicine to biodiversity and agri-food research.
FAOSTAT
A multilingual database currently containing more than 1 million time-series records covering international statistics in the following areas: Food Balance Sheets, Food Aid Shipments, and Population.
Fishes of Texas
A product of the Ichthyology Collection from the University of Texas at Austin. The database includes records from over 40 institutions based on specimens collected as far back as the mid-1800s. The project has focused on standardizing and merging the data, subjecting it to a rigorous error detection and correction process, and making it available to researchers, natural resource managers and the public. The resulting fish occurrence records now include the state’s approximately 280 species found in freshwaters and many more from its bays. The database is available online and allows for powerful data queries, on-the-fly mapping of results, and downloading of records to facilitate its use in diverse and complex research and management applications, as well as education.
GenBank
(NLM/NIH) GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. A new release is made every two months. It is best to start with the GenBank home page. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at the National Center for Biotechnology Information. These three organizations exchange data on a daily basis. Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, and a table of features that identifies coding regions and other sites of biological significance, such as transcription units, sites of mutations or modifications, and repeats. Protein translations for coding regions are included in the feature table. Bibliographic references are included along with a link to the Medline unique identifier for all published sequences.
GenomeNet
Published and maintained by Kyoto University, GenomeNet is a Japanese network of database and computational services for genome research and related research areas in biomedical sciences.
Global Biodiversity Information Facility (GBIF)
GBIF, the Global Biodiversity Information Facility, is an international network and research infrastructure funded by the world’s governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth. Coordinated through its Secretariat in Copenhagen, the GBIF network of participating countries and organizations, working through participant nodes, provides data-holding institutions around the world with common standards and open-source tools that enable them to share information about where and when species have been recorded. This knowledge derives from many sources, including everything from museum specimens collected in the 18th and 19th centuries to geotagged smartphone photos shared by amateur naturalists in recent days and weeks.
Google Dataset Search
“Similar to how Google Scholar works, Dataset Search lets you find datasets wherever they’re hosted, whether it’s a publisher’s site, a digital library, or an author’s personal web page.” — Google

H – N

HealthData.gov
On this site, you can find data on a wide range of topics, including environmental health, medical devices, Medicare & Medicaid, social services, community health, mental health, and substance abuse. The data is collected and supplied by agencies of the U.S. Department of Health and Human Services as well as state partners. These include the Centers for Medicare and Medicaid Services, the Centers for Disease Control and Prevention, the Food and Drug Administration, and the Agency for Healthcare Research and Quality, among others.
HealthyPeople.gov’s DATA2020
Healthy People provides science-based, 10-year national objectives for improving the health of all Americans. For 3 decades, Healthy People has established benchmarks and monitored progress over time in order to: encourage collaborations across communities and sectors, empower individuals toward making informed health decisions, and measure the impact of prevention activities.
HMDB: The Human Metabolome Database
The Human Metabolome Database (HMDB) is a freely available electronic database containing detailed information about small molecule metabolites found in the human body. It is intended to be used for applications in metabolomics, clinical chemistry, biomarker discovery and general education. The database is designed to contain or link three kinds of data: 1) chemical data, 2) clinical data, and 3) molecular biology/biochemistry data. The database contains 114,020 metabolite entries including both water-soluble and lipid-soluble metabolites as well as metabolites that would be regarded as either abundant (> 1 µM) or relatively rare (< 1 nM). Additionally, 5,702 protein sequences are linked to these metabolite entries. Each MetaboCard entry contains 130 data fields, with 2/3 of the information devoted to chemical/clinical data and the other 1/3 devoted to enzymatic or biochemical data.
Integrated Digitized Biocollections (iDigBio)
The national resource for Advancing Digitization of Biodiversity Collections (ADBC) funded by the National Science Foundation. Through ADBC, data and images for millions of biological specimens are being made available in electronic format for the research community, government agencies, students, educators, and the general public.
Integrated Taxonomic Information System (ITIS)
Here you will find authoritative taxonomic information on plants, animals, fungi, and microbes of North America and the world. We are a partnership of U.S., Canadian, and Mexican agencies (ITIS-North America); other organizations; and taxonomic specialists. ITIS is also a partner of Species 2000 and the Global Biodiversity Information Facility (GBIF). The ITIS and Species 2000 Catalogue of Life (CoL) partnership is proud to provide the taxonomic backbone to the Encyclopedia of Life (EOL).
National Center for Health Statistics (NCHS)
The National Center for Health Statistics is the United States’ principal health statistics agency. Its mission is to provide statistics and data that can guide public policies and actions, and its goal is to improve the health of Americans. The NCHS website provides access to many health statistics sources, from published reports to data briefs on specific topics and public-use data files. Its FastStats section gives quick and easy access to statistics on specific health topics, from diseases and conditions to health care and insurance.
National Centers for Environmental Information (NCEI)
The demand for high-value environmental data and information has dramatically increased in recent years. To improve its ability to meet that demand, NOAA merged its three former data centers (the National Climatic Data Center, the National Geophysical Data Center, and the National Oceanographic Data Center, which includes the National Coastal Data Development Center) into the National Centers for Environmental Information (NCEI). NCEI is responsible for hosting and providing access to one of the most significant archives on Earth, with comprehensive oceanic, atmospheric, and geophysical data. From the depths of the ocean to the surface of the sun and from million-year-old sediment records to near real-time satellite images, NCEI is the Nation’s leading authority for environmental information.
National Institutes of Health Data Sharing Repositories
Directory developed at the National Library of Medicine that lists NIH supported data repositories and resources with aggregated information about biomedical data.
NIST Science Data Portal
Data products developed and distributed by the National Institute of Standards and Technology span multiple disciplines of research and are widely used in research and development programs by industry and academia. NIST’s publicly available data sets showcase its commitment to providing accurate, well-curated measurements of physical properties, exemplified by the Standard Reference Data program, as well as its commitment to advancing basic research. The featured data domains include:
Nucleic Acid Database (NDB)
Published and maintained by Rutgers University, the NDB contains information about experimentally determined nucleic acids and complex assemblies.
Use the NDB to perform searches based on annotations relating to sequence, structure and function, and to download, analyze, and learn about nucleic acids.

O – T

OMIM (Online Mendelian Inheritance in Man)
This database is a catalog of human genes and genetic disorders authored and edited by Dr. Victor A. McKusick and his colleagues at Johns Hopkins and elsewhere, and developed for the World Wide Web by NCBI, the National Center for Biotechnology Information. The database contains textual information and references. It also contains copious links to MEDLINE and sequence records in the Entrez system, and links to additional related resources at NCBI and elsewhere. OMIM is intended for use primarily by physicians and other professionals concerned with genetic disorders, by genetics researchers, and by advanced students in science and medicine.
RCSB Protein Data Bank
A leading global resource for experimental data central to scientific discovery, the RCSB PDB (Research Collaboratory for Structural Bioinformatics PDB) operates the US data center for the global PDB archive and makes PDB data available at no charge to all data consumers without limitations on usage. The vision of the RCSB PDB is to enable open access to the accumulating knowledge of 3D structure, function, and evolution of biological macromolecules, expanding the frontiers of fundamental biology, biomedicine, and biotechnology.
re3data.org
Comprehensive list of research data repositories with downloadable datasets. Browse by subject, content type, and country.
SABIO-RK: Biochemical Reaction Kinetics Database
SABIO-RK is a curated database that contains information about biochemical reactions, their kinetic rate equations with parameters and experimental conditions.
STITCH
STITCH is a database of known and predicted interactions between chemicals and proteins. The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other (primary) databases. Chemicals are linked to other chemicals and proteins by evidence derived from experiments, databases and the literature. STITCH contains interactions between 500,000 small molecules and 9.6 million proteins from over 2000 organisms.
Texas Data Repository
Hosted by the Texas Digital Library. The Texas Data Repository is a platform for publishing and archiving small datasets (and other data products) created by faculty, staff, and students at Texas higher education institutions. The repository is built in an open-source application called Dataverse, originally developed and used by Harvard University. Dataverse is interoperable with other Dataverse installations and systems (like Open Journal Systems), providing opportunities for greater visibility of data.
Texas Health Data
Published and maintained by the Texas Department of State Health Services, the Center for Health Statistics’ Texas Health Data website is an interactive public data system that allows you to query DSHS public health datasets for statistical reports and summaries.

U – Z

UniProt
The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. Also worth noting is UniProt’s comprehensive list of 173 cross-referenced databases, which displays explicit and implicit links to databases such as nucleotide sequence databases, model organism databases and genomics and proteomics resources.
UCSC Genome Browser
Published by the University of California, Santa Cruz, the UCSC Genome Browser contains the reference sequence and working draft assemblies for a large collection of genomes. It is developed and maintained by the Genome Bioinformatics Group, a cross-departmental team within the UCSC Genomics Institute.
United Nations Statistics Division
The United Nations Statistics Division is committed to the advancement of the global statistical system. We compile and disseminate global statistical information, develop standards and norms for statistical activities, and support countries’ efforts to strengthen their national statistical systems. We facilitate the coordination of international statistical activities and support the functioning of the United Nations Statistical Commission as the apex entity of the global statistical system.
United States Census Bureau
The Census Bureau’s mission is to serve as the nation’s leading provider of quality data about its people and economy.
USGS: Science Explorer: Data, Tools, and Technology
This site seeks to provide the scientific understanding and technologies needed to support the sound management and conservation of our Nation’s biological resources.
Wolfram Data Repository
The Wolfram Data Repository is a public resource that hosts an expanding collection of computable datasets, curated and structured to be suitable for immediate use in computation, visualization, analysis and more. Building on the Wolfram Data Framework and the Wolfram Language, the Wolfram Data Repository provides a uniform system for storing data and making it immediately computable and useful. With datasets of many types and from many sources, the Wolfram Data Repository is built to be a global resource for public data and data-backed publication.
