
NCBI-BLAST Installation and Configuration

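BLAST, short for Basic Local Alignment Search Tool, is a very well-known bioinformatics tool that searches a query sequence against a sequence database for regions of local similarity. For simple use the web version (https://blast.ncbi.nlm.nih.gov/Blast.cgi) is enough, but many bioinformatics tools require a local NCBI-BLAST installation as a dependency.
This article briefly covers installing NCBI-BLAST and setting up the databases it needs, following the official documentation. For basic usage, see the Chinese-language tutorials listed in the references at the end.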

1. Installing NCBI-BLAST

1. Download NCBI BLAST

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Choose the package that matches your system.
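
Before installing, it may be worth checking that the download is intact. The sketch below assumes that a .md5 file is published next to each package in the LATEST directory and that md5sum is available. The helper name verify_md5 is my own, and the package name in the usage comment is only the one used later in this article, so it has to be replaced by whatever LATEST currently lists.

```shell
#!/bin/sh
# verify_md5 FILE MD5FILE
# Succeeds only if the MD5 digest of FILE equals the first field of MD5FILE.
verify_md5() {
    expected=$(awk '{print $1; exit}' "$2")
    actual=$(md5sum "$1" | awk '{print $1}')
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Hypothetical usage (assumes the .md5 file exists next to the package):
#   pkg=ncbi-blast-2.8.1+-2.x86_64.rpm
#   base=ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST
#   curl -O "$base/$pkg" -O "$base/$pkg.md5"
#   verify_md5 "$pkg" "$pkg.md5" && echo "checksum OK"
```

If the check fails, downloading the package again is usually the simplest fix.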

2. Install (CentOS 7 as an example)

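I am still really not proficient in Perl, but I hear it is a useful tool for a lot of biological data processing and databases. I bought two of the "Llama book" tutorials and will definitely study them when I get the chance (actually, now is the time to learn, so no more procrastinating).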

# Perl dependencies (this step was dropped: yum localinstall resolves dependencies automatically)
# Install:
sudo yum localinstall ncbi-blast-2.8.1+-2.x86_64.rpm
# Upgrade:
sudo rpm -Uvh ncbi-blast-2.8.1+-2.x86_64.rpm
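
After the install it helps to confirm that the programs actually landed on PATH. This is a minimal sketch: missing_tools is a helper name I made up, and the list of programs is simply the ones used in this article.

```shell
#!/bin/sh
# missing_tools NAME...
# Prints every NAME that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Programs used in this article.
missing=$(missing_tools blastn blastp makeblastdb blastdbcmd)
if [ -z "$missing" ]; then
    # Everything is there: print the installed version.
    blastn -version
else
    echo "not found on PATH:" $missing
fi
```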

2. Installing the databases

2.1 Database directories and descriptions

1. /blast/db/ directory

File name                       Description
16SMicrobial.tar.gz             Bacterial and Archaeal 16S rRNA sequences from BioProjects 33175 and 33117
FASTA/                          Subdirectory for FASTA formatted sequences
README                          README for this subdirectory (the file this listing comes from)
Representative_Genomes.*tar.gz  Representative bacterial/archaeal genomes database
cdd_delta.tar.gz                Conserved Domain Database sequences for use with stand alone deltablast
cloud/                          Subdirectory of databases for BLAST AMI; see http://1.usa.gov/TJAnEt
env_nr.*tar.gz                  Protein sequences for metagenomes
env_nt.*tar.gz                  Nucleotide sequences for metagenomes
est.tar.gz                      Requires the est_human.*.tar.gz, est_mouse.*.tar.gz, and est_others.*.tar.gz files to function. It contains the est.nal alias so that searches against est (-db est) will include est_human, est_mouse and est_others.
est_human.*.tar.gz              Human subset of the est database from the est division of GenBank, EMBL and DDBJ
est_mouse.*.tar.gz              Mouse subset of the est database
est_others.*.tar.gz             Non-human and non-mouse subset of the est database
gss.*tar.gz                     Sequences from the GSS division of GenBank, EMBL, and DDBJ
htgs.*tar.gz                    Sequences from the HTG division of GenBank, EMBL, and DDBJ
human_genomic.*tar.gz           Human RefSeq (NC_######) chromosome records with gap adjusted concatenated NT_ contigs
nr.*tar.gz                      Non-redundant protein sequences from GenPept, Swissprot, PIR, PRF, PDB, and NCBI RefSeq
nt.*tar.gz                      Partially non-redundant nucleotide sequences from all traditional divisions of GenBank, EMBL, and DDBJ, excluding GSS, STS, PAT, EST, HTG, and WGS
other_genomic.*tar.gz           RefSeq chromosome records (NC_######) for non-human organisms
pataa.*tar.gz                   Patent protein sequences
patnt.*tar.gz                   Patent nucleotide sequences. Both patent databases are directly from the USPTO, or from the EPO/JPO via EMBL/DDBJ
pdbaa.*tar.gz                   Sequences for the protein structure from the Protein Data Bank
pdbnt.*tar.gz                   Sequences for the nucleotide structure from the Protein Data Bank. They are NOT the protein coding sequences for the corresponding pdbaa entries.
refseq_genomic.*tar.gz          NCBI genomic reference sequences
refseq_protein.*tar.gz          NCBI protein reference sequences
refseq_rna.*tar.gz              NCBI transcript reference sequences
sts.*tar.gz                     Sequences from the STS division of GenBank, EMBL, and DDBJ
swissprot.tar.gz                Swiss-Prot sequence database (last major update)
taxdb.tar.gz                    Additional taxonomy information for the databases listed here, providing common and scientific names
tsa_nt.*tar.gz                  Sequences from the TSA division of GenBank, EMBL, and DDBJ
vector.tar.gz                   Vector sequences from 2010; see Note 2 in section 4 of the README

2. /blast/db/FASTA directory

File name           Description
alu.a.gz            translation of alu.n repeats
alu.n.gz            alu repeat elements (from 2003)
drosoph.aa.gz       CDS translations from drosophila.nt
drosoph.nt.gz       genomic sequences for drosophila (from 2003)
env_nr.gz*          protein sequences for metagenomes, taxid 408169
env_nt.gz*          nucleotide sequences for metagenomes, taxid 408169
est_human.gz*       human subset of the est database (see Note 1)
est_mouse.gz*       mouse subset of the est database
est_others.gz*      non-human and non-mouse subset of the est database
gss.gz*             sequences from the GSS division of GenBank, EMBL, and DDBJ
htgs.gz*            sequences from the HTG division of GenBank, EMBL, and DDBJ
human_genomic.gz*   human RefSeq (NC_######) chromosome records with gap adjusted concatenated NT_ contigs
igSeqNt.gz          human and mouse immunoglobulin variable region nucleotide sequences
igSeqProt.gz        human and mouse immunoglobulin variable region protein sequences
mito.aa.gz          CDS translations of complete mitochondrial genomes
mito.nt.gz          complete mitochondrial genomes
nr.gz*              non-redundant protein sequence database with entries from GenPept, Swissprot, PIR, PRF, PDB, and RefSeq
nt.gz*              nucleotide sequence database, with entries from all traditional divisions of GenBank, EMBL, and DDBJ, excluding bulk divisions (gss, sts, pat, est, htg) and wgs entries. Partially non-redundant.
other_genomic.gz*   RefSeq chromosome records (NC_######) for organisms other than human
pataa.gz*           patent protein sequences
patnt.gz*           patent nucleotide sequences. Both patent sequence files are from the USPTO, or EPO/JPO via EMBL/DDBJ
pdbaa.gz*           protein sequences from pdb protein structures
pdbnt.gz*           nucleotide sequences from pdb nucleic acid structures. They are NOT the protein coding sequences for the corresponding pdbaa entries.
sts.gz*             database for sequence tag site entries
swissprot.gz*       swiss-prot database (last major release)
vector.gz           vector sequences from 2010 (see Note 2)
yeast.aa.gz         protein translations from yeast.nt
yeast.nt.gz         yeast genomes (from 2003)

2.2 Online installation with the update tool (Ubuntu)

curl -sO ftp://ftp.ncbi.nlm.nih.gov/blast/temp/update_blastdb.pl
chmod +x update_blastdb.pl
apt-get install -qqy libjson-perl perl-doc
./update_blastdb.pl -version

# By default update_blastdb.pl connects to NCBI
./update_blastdb.pl --showall

# But you can specify to connect to Google Cloud Platform (GCP)
./update_blastdb.pl --source gcp --showall pretty

# Download the vector database from NCBI
./update_blastdb.pl --decompress vector

# And download swissprot from GCP in BLASTDBv5 format
./update_blastdb.pl --source gcp swissprot_v5
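
update_blastdb.pl puts the files in the directory it is run from, so a common approach is to keep all databases in one place and point the BLASTDB environment variable at it. The sketch below is only an illustration: the directory $HOME/blastdb and the helper db_present are my own choices, and the check just looks for index files (.nin/.pin, volume files such as NAME.00.pin, or a .nal/.pal alias).

```shell
#!/bin/sh
# db_present DIR NAME
# Succeeds if DIR holds at least one BLAST index or alias file for NAME.
db_present() {
    for f in "$1/$2".[np]in "$1/$2".[0-9]*.[np]in "$1/$2".[np]al; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Hypothetical layout: all databases live in one directory (an assumption).
BLASTDB="${BLASTDB:-$HOME/blastdb}"
export BLASTDB

if db_present "$BLASTDB" vector; then
    echo "vector database found in $BLASTDB"
else
    echo "vector database not found in $BLASTDB"
fi
```

With BLASTDB exported, a database can be referred to by name alone (for example -db vector) from any working directory.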

2.3 Manual download from the official site (FASTA files or preformatted databases)

1. FASTA files (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA)

# input_db is the decompressed FASTA file (gunzip the download first)
# For a nucleotide FASTA file:
makeblastdb -in input_db -dbtype nucl -parse_seqids
# For a protein FASTA file:
makeblastdb -in input_db -dbtype prot -parse_seqids
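
To try makeblastdb without downloading anything large, a toy database is enough. Everything in this sketch is made up for illustration: the two sequences, the IDs seq1 and seq2, and the database name toy. It builds the database in a scratch directory, fetches one record back by ID, and runs the FASTA file against its own database.

```shell
#!/bin/sh
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)

# Two made-up nucleotide records; the IDs seq1 and seq2 are placeholders.
cat > "$workdir/toy.fa" <<'EOF'
>seq1 toy sequence 1
ACGTACGTTAGCTAGCTAGGATCCGATCGATCGTACGATCGATCGTAGCTAGCTAGCATCG
>seq2 toy sequence 2
TTGACCGATAGCTAGGCTAACGGATTACGCTAGGCTAATCGGCTAGGATCCTAGGCTAACG
EOF

if command -v makeblastdb >/dev/null 2>&1; then
    # Build a nucleotide database named "toy" next to the FASTA file.
    makeblastdb -in "$workdir/toy.fa" -dbtype nucl -parse_seqids -out "$workdir/toy"
    # Fetch one record by its ID to confirm that -parse_seqids indexed it.
    blastdbcmd -db "$workdir/toy" -entry seq1
    # Search the sequences against their own database, tabular output.
    blastn -query "$workdir/toy.fa" -db "$workdir/toy" -outfmt 6
else
    echo "makeblastdb not found; install BLAST+ first"
fi
```

The same steps apply to a protein file by switching to -dbtype prot and blastp.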

2. Preformatted databases (ftp://ftp.ncbi.nlm.nih.gov/blast/db/)

# Download nr.90.tar.gz and extract it (run the commands from the extraction directory, or give -db an absolute path)
blastdbcmd -db nr.90 -info
blastdbcmd -db nr.90 -dbtype prot -entry all -outfmt "%f" -out nr90.fa
# Generate sequences in FASTA format
blastdbcmd -db nr -dbtype prot -entry all -outfmt "%f" -out nr.fa
# Show database information
blastdbcmd -db nr -dbtype prot -info
# Dump all entries
blastdbcmd -db nr -dbtype prot -entry all | head
# Look up a specific identifier, such as a GI number
blastdbcmd -db nr -dbtype prot -entry 3 | head
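
Large databases such as nr come as many numbered archives, and as far as I understand a search with -db nr only works once every volume has been unpacked into the same directory. A small loop can do the unpacking. This is a sketch under my own assumptions: the helper name extract_volumes and both directory names in the usage comment are hypothetical.

```shell
#!/bin/sh
# extract_volumes SRC_DIR DEST_DIR
# Unpacks every *.tar.gz found in SRC_DIR into DEST_DIR.
extract_volumes() {
    mkdir -p "$2" || return 1
    for archive in "$1"/*.tar.gz; do
        # Skip the literal pattern when SRC_DIR holds no archives.
        [ -e "$archive" ] || continue
        tar -xzf "$archive" -C "$2" || return 1
    done
}

# Hypothetical usage: archives in ~/downloads, databases in ~/blastdb.
#   extract_volumes "$HOME/downloads" "$HOME/blastdb"
#   export BLASTDB="$HOME/blastdb"
#   blastdbcmd -db nr -info
```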

References

  1. https://go.usa.gov/xPhky
  2. ftp://ftp.ncbi.nlm.nih.gov/blast/documents/blastdb.html
  3. ftp://ftp.ncbi.nlm.nih.gov/blast/db/
  4. https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
  5. "Perhaps the most complete BLAST tutorial I have written" (in Chinese) https://www.jianshu.com/p/de28be1a3bea
  6. "Tutorial on setting up NCBI BLAST+ locally on Linux" (in Chinese) http://blog.shenwei.me/local-blast-installation/