NCBI-BLAST Installation and Configuration
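BLAST (Basic Local Alignment Search Tool) is a very well-known bioinformatics tool for searching a query sequence against a sequence database to find regions of local similarity. For occasional use the web version (https://blast.ncbi.nlm.nih.gov/Blast.cgi) is enough, but many bioinformatics tools require a local NCBI-BLAST installation as a dependency.
This post briefly covers, following the official documentation, how to install NCBI-BLAST and set up the databases it needs. For basic usage, see the Chinese-language tutorials listed in the references at the end.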
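1. Installing NCBI-BLAST
1. Download NCBI BLAST
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Choose the package that matches your system.
2. Install (CentOS 7 as an example)
I am still really not fluent in Perl, but I hear it is a useful tool for a lot of biological data processing and database work. I have bought two of the "Llama book" tutorials and will definitely study them when I get the chance (in fact, now is exactly the time to learn, so no more procrastinating).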
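A minimal sketch of one common installation route: unpack the Linux tarball from the LATEST directory and put its bin/ directory on PATH. The archive name (with the X.Y.Z version placeholder), the $HOME/opt prefix, and the helper names install_blast_tarball and blast_path_line are assumptions for illustration only; substitute the file you actually downloaded.

```shell
#!/bin/sh
# Sketch only: helper names, paths and the X.Y.Z version are placeholders.

# Unpack a BLAST+ tarball into a prefix directory (created if missing).
install_blast_tarball() {
    tarball="$1"
    prefix="$2"
    mkdir -p "$prefix" || return 1
    tar -xzf "$tarball" -C "$prefix"
}

# Print the line to append to ~/.bashrc so the BLAST+ programs are on PATH.
blast_path_line() {
    install_dir="$1"
    printf 'export PATH="%s/bin:$PATH"\n' "$install_dir"
}

# Example usage (commented out so that sourcing this file changes nothing):
# install_blast_tarball ncbi-blast-X.Y.Z+-x64-linux.tar.gz "$HOME/opt"
# blast_path_line "$HOME/opt/ncbi-blast-X.Y.Z+" >> ~/.bashrc
# . ~/.bashrc && blastn -version
```

Running `blastn -version` afterwards is a quick way to confirm that the shell finds the newly installed binaries.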
2. Installing the databases
2.1 Database directories and descriptions
- /blast/db/ directory
file name | description |
---|---|
16SMicrobial.tar.gz | Bacterial and Archaeal 16S rRNA sequences from BioProjects 33175 and 33117 |
FASTA/ | Subdirectory for FASTA formatted sequences |
README | README for this subdirectory (this file) |
Representative_Genomes.*tar.gz | Representative bacterial/archaeal genomes database |
cdd_delta.tar.gz | Conserved Domain Database sequences for use with stand alone deltablast |
cloud/ | Subdirectory of databases for BLAST AMI; see http://1.usa.gov/TJAnEt |
env_nr.*tar.gz | Protein sequences for metagenomes |
env_nt.*tar.gz | Nucleotide sequences for metagenomes |
est.tar.gz | This file requires est_human.*.tar.gz, est_mouse.*.tar.gz, and est_others.*.tar.gz files to function. It contains the est.nal alias so that searches against est (-db est) will include est_human, est_mouse and est_others. |
est_human.*.tar.gz | Human subset of the est database from the est division of GenBank, EMBL and DDBJ. |
est_mouse.*.tar.gz | Mouse subset of the est database |
est_others.*.tar.gz | Non-human and non-mouse subset of the est database |
gss.*tar.gz | Sequences from the GSS division of GenBank, EMBL, and DDBJ |
htgs.*tar.gz | Sequences from the HTG division of GenBank, EMBL, and DDBJ |
human_genomic.*tar.gz | Human RefSeq (NC_######) chromosome records with gap adjusted concatenated NT_ contigs |
nr.*tar.gz | Non-redundant protein sequences from GenPept, Swissprot, PIR, PRF, PDB, and NCBI RefSeq |
nt.*tar.gz | Partially non-redundant nucleotide sequences from all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS,STS, PAT, EST, HTG, and WGS. |
other_genomic.*tar.gz | RefSeq chromosome records (NC_######) for non-human organisms |
pataa.*tar.gz | Patent protein sequences |
patnt.*tar.gz | Patent nucleotide sequences. Both patent databases are directly from the USPTO, or from the EPO/JPO via EMBL/DDBJ |
pdbaa.*tar.gz | Sequences for the protein structure from the Protein Data Bank |
pdbnt.*tar.gz | Sequences for the nucleotide structure from the Protein Data Bank. They are NOT the protein coding sequences for the corresponding pdbaa entries. |
refseq_genomic.*tar.gz | NCBI genomic reference sequences |
refseq_protein.*tar.gz | NCBI protein reference sequences |
refseq_rna.*tar.gz | NCBI Transcript reference sequences |
sts.*tar.gz | Sequences from the STS division of GenBank, EMBL, and DDBJ |
swissprot.tar.gz | Swiss-Prot sequence database (last major update) |
taxdb.tar.gz | Additional taxonomy information for the databases listed here providing common and scientific names |
tsa_nt.*tar.gz | Sequences from the TSA division of GenBank, EMBL, and DDBJ |
vector.tar.gz | Vector sequences from 2010, see Note 2 in section 4. |
- /blast/db/FASTA directory
file name | description |
---|---|
alu.a.gz | translation of alu.n repeats |
alu.n.gz | alu repeat elements (from 2003) |
drosoph.aa.gz | CDS translations from drosophila.nt |
drosoph.nt.gz | genomic sequences for drosophila (from 2003) |
env_nr.gz* | Protein sequences for metagenomes, taxid 408169 |
env_nt.gz* | Nucleotide sequences for metagenomes, taxid 408169 |
est_human.gz* | human subset of the est database (see Note 1) |
est_mouse.gz* | mouse subset of the est database |
est_others.gz* | non-human and non-mouse subset of the est database |
gss.gz* | sequences from the GSS division of GenBank, EMBL, and DDBJ |
htgs.gz* | sequences from the HTG division of GenBank, EMBL, and DDBJ |
human_genomic.gz* | human RefSeq (NC_######) chromosome records with gap adjusted concatenated NT_ contigs |
igSeqNt.gz | human and mouse immunoglobulin variable region nucleotide sequences |
igSeqProt.gz | human and mouse immunoglobulin variable region protein sequences |
mito.aa.gz | CDS translations of complete mitochondrial genomes |
mito.nt.gz | complete mitochondrial genomes |
nr.gz* | non-redundant protein sequence database with entries from GenPept, Swissprot, PIR, PRF, PDB, and RefSeq |
nt.gz* | nucleotide sequence database, with entries from all traditional divisions of GenBank, EMBL, and DDBJ; excluding bulk divisions (gss, sts, pat, est, htg) and wgs entries. Partially non-redundant. |
other_genomic.gz* | RefSeq chromosome records (NC_######) for organisms other than human |
pataa.gz* | patent protein sequences |
patnt.gz* | patent nucleotide sequences. Both patent sequence files are from the USPTO, or EPO/JPO via EMBL/DDBJ |
pdbaa.gz* | protein sequences from pdb protein structures |
pdbnt.gz* | nucleotide sequences from pdb nucleic acid structures. They are NOT the protein coding sequences for the corresponding pdbaa entries. |
sts.gz* | database for sequence tag site entries |
swissprot.gz* | swiss-prot database (last major release) |
vector.gz | vector sequences from 2010. (See Note 2) |
yeast.aa.gz | protein translations from yeast.nt |
yeast.nt.gz | yeast genomes (from 2003) |
2.2 Installing online with a tool (Ubuntu)
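One tool for this job is the update_blastdb.pl script shipped in the bin/ directory of BLAST+, which downloads preformatted databases by name. The sketch below assumes that script is already on PATH; the helper name fetch_blast_db and the target directory are hypothetical, and the database names are taken from the table in 2.1.

```shell
#!/bin/sh
# Sketch only: fetch_blast_db and the paths used below are illustrative.

# Download and unpack one or more preformatted databases into a directory.
# --decompress tells update_blastdb.pl to extract the archives after download.
fetch_blast_db() {
    dest="$1"
    shift
    mkdir -p "$dest" || return 1
    ( cd "$dest" && update_blastdb.pl --decompress "$@" )
}

# Example usage (commented out; it needs network access and disk space):
# update_blastdb.pl --showall            # list the databases on offer
# fetch_blast_db "$HOME/blastdb" swissprot taxdb
# export BLASTDB="$HOME/blastdb"         # lets BLAST find the databases by name
```

Small databases such as swissprot are a reasonable first test, since nr and nt are split into many large volumes.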
2.3 Downloading manually from the official site (FASTA files or preformatted databases)
1. FASTA files (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA)
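A FASTA file is not searchable until it has been indexed with makeblastdb. The following is a minimal sketch, assuming wget, gunzip and makeblastdb are installed; the helper name build_db_from_fasta is hypothetical, and the example file name comes from the FASTA table above, so check that it is still present on the server before relying on it.

```shell
#!/bin/sh
# Sketch only: build_db_from_fasta is an illustrative helper name.

# Download one gzipped FASTA file into the current directory, decompress it,
# and build a BLAST database from it.
#   $1 = file name without the .gz suffix (e.g. mito.nt)
#   $2 = database type: nucl for nucleotide, prot for protein
build_db_from_fasta() {
    name="$1"
    dbtype="$2"
    wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/${name}.gz" || return 1
    gunzip "${name}.gz" || return 1
    makeblastdb -in "$name" -dbtype "$dbtype" -out "$name"
}

# Example usage (commented out; it needs network access):
# build_db_from_fasta mito.nt nucl
# blastn -query my_query.fasta -db mito.nt -out hits.txt
```

The same helper works for protein files by passing prot as the second argument and searching with blastp instead of blastn.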
2. Preformatted databases (ftp://ftp.ncbi.nlm.nih.gov/blast/db/)
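Preformatted databases only need to be unpacked, with no makeblastdb step. The sketch below assumes the archives, and optionally the matching .md5 files published next to them, have already been downloaded into one directory, and that GNU md5sum is available; the helper name unpack_volumes is hypothetical.

```shell
#!/bin/sh
# Sketch only: unpack_volumes is an illustrative helper name.

# Verify (when a .md5 file is present) and extract every *.tar.gz volume
# found in the given directory. Stops at the first checksum mismatch.
unpack_volumes() {
    dir="$1"
    for archive in "$dir"/*.tar.gz; do
        [ -e "$archive" ] || continue    # no archives matched the pattern
        if [ -f "$archive.md5" ]; then
            ( cd "$dir" && md5sum -c "$(basename "$archive").md5" ) || return 1
        fi
        tar -xzf "$archive" -C "$dir" || return 1
    done
}

# Example usage (commented out; it needs network access):
# mkdir -p "$HOME/blastdb" && cd "$HOME/blastdb"
# wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/swissprot.tar.gz
# wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/swissprot.tar.gz.md5
# unpack_volumes "$HOME/blastdb"
```

For multi-volume databases such as nr or nt, every numbered volume has to be downloaded and extracted into the same directory before a search will work.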
References
- https://go.usa.gov/xPhky
- ftp://ftp.ncbi.nlm.nih.gov/blast/documents/blastdb.html
- ftp://ftp.ncbi.nlm.nih.gov/blast/db/
- https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
- "Possibly the most complete BLAST tutorial I have written" (in Chinese) https://www.jianshu.com/p/de28be1a3bea
- "Tutorial on setting up NCBI BLAST+ locally on Linux" (in Chinese) http://blog.shenwei.me/local-blast-installation/