Database Description
====================

The built-in downloader uses the ``db_current`` HTTP snapshot from ``pro.unl.edu`` by default.
You can also download the pinned S3 release for faster and more stable transfers by using the ``--aws_s3`` flag with the ``run_dbcan database`` command (see :doc:`Preparing Databases <../user_guide/prepare_the_database>`).
The databases are generally updated annually between July and September.

The databases used by run_dbCAN are described below.

CAZyme Databases
----------------

- DIAMOND database
  - Description: Fast protein alignment database used to identify CAZyme sequences.
  - Filename: ``CAZy.dmnd``

- dbCAN HMM database
  - Description: HMM profiles for CAZyme families; sensitive identification.
  - Filename: ``dbCAN.hmm``

- dbCAN-sub HMM database
  - Description: Subfamily-level HMM profiles for fine-grained CAZyme subfamily identification.
  - Filename: ``dbCAN-sub.hmm``


CGC Databases
-------------

- Transporter DIAMOND database (from TCDB)
  - Description: Transporter proteins used to identify transporters in CGCs.
  - Filename: ``TCDB.dmnd``

- Transcription Factor HMM database (fungi, from MycoCosm)
  - Description: HMM profiles for transcription factors used in fungal datasets.
  - Filename: ``TF.hmm``

- Transcription Factor DIAMOND database (prokaryotes, from PRODORIC)
  - Description: TF protein database used to identify TFs in prokaryotic CGCs.
  - Filename: ``TF.dmnd``

- Signal Transduction Protein HMM database
  - Description: HMM profiles for signal transduction proteins.
  - Filename: ``STP.hmm``

- Sulfatase DIAMOND database (from SulfAtlas)
  - Description: Sulfatase protein database used to identify sulfatases.
  - Filename: ``sulfatlas_db.dmnd``

- Peptidase DIAMOND database (from MEROPS)
  - Description: Peptidase protein database used to identify peptidases.
  - Filename: ``peptidase_db.dmnd``

- dbCAN-PUL DIAMOND database
  - Description: Polysaccharide Utilization Loci (PUL) protein database for PUL identification.
  - Filename: ``PUL.dmnd``


Substrate Prediction Databases
------------------------------

- Substrate mapping table
  - Description: Mapping from CAZyme family/EC to known substrates.
  - Filename: ``fam-substrate-mapping.tsv``

- dbCAN-PUL substrate table
  - Description: Substrate mapping associated with PULs from the dbCAN-PUL database.
  - Directory: ``dbCAN-PUL/``
  - Spreadsheet: ``dbCAN-PUL.xlsx``
