NCBI-BLAST Installation and Configuration
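BLAST, short for Basic Local Alignment Search Tool, is a very well-known bioinformatics tool that searches a query sequence against a sequence database for regions of local similarity. For simple use, the web version (https://blast.ncbi.nlm.nih.gov/Blast.cgi) is enough, but many bioinformatics tools require a local NCBI-BLAST installation as a dependency.
This post briefly covers installing NCBI-BLAST and setting up the databases it needs, following the official documentation. For basic usage, see the Chinese-language tutorials listed under References at the bottom.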
1. Installing NCBI-BLAST
1. Download NCBI BLAST
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Choose the package that matches your operating system and architecture.
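Which file to pick depends on the operating system and CPU. As a rough sketch, the helper below maps `uname` output to the file-name suffix to look for in the LATEST directory. The function name `blast_pkg_suffix` is made up for this post, and the two suffixes are an assumption based on the naming of recent releases, so check the directory listing itself before relying on them.

```shell
# Hypothetical helper: print the file-name suffix of the BLAST+ tarball
# for an OS / CPU pair (defaults to the machine it runs on).
blast_pkg_suffix() {
    os=${1:-$(uname -s)}
    arch=${2:-$(uname -m)}
    case "$os/$arch" in
        Linux/x86_64)   echo "x64-linux.tar.gz" ;;
        Darwin/x86_64)  echo "x64-macosx.tar.gz" ;;
        *)              # anything else: look at the directory listing by hand
                        echo "no suffix known for $os/$arch" >&2
                        return 1 ;;
    esac
}
```

On a 64-bit CentOS 7 machine, `blast_pkg_suffix` prints `x64-linux.tar.gz`, so the file to download would be the one named `ncbi-blast-<version>+-x64-linux.tar.gz`.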
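2. Install (CentOS 7 as an example)
I am still really not fluent in Perl, but I hear it is a useful tool for a lot of biological data processing and database work. I have also bought two of the "Little Camel" Perl tutorial books and will definitely study them when I get the chance (in fact, now is the time to learn, so no more procrastinating).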
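One common way to install on CentOS 7 is to unpack the Linux tarball and put its `bin` directory on `PATH`. The following is a minimal sketch under that assumption; the helper name `install_blast_tarball` and the default location `$HOME/opt` are invented for illustration, and NCBI also publishes an `.rpm` package that can be installed with the system package manager instead.

```shell
# Hypothetical helper: unpack a BLAST+ tarball and print the PATH line to add.
install_blast_tarball() {
    tarball=$1                  # e.g. ncbi-blast-<version>+-x64-linux.tar.gz
    prefix=${2:-$HOME/opt}      # assumed install location
    mkdir -p "$prefix" || return 1
    tar -xzf "$tarball" -C "$prefix" || return 1
    # Assumes the archive unpacks into one top-level directory
    # (ncbi-blast-<version>+); take its name from the first archive entry.
    top=$(tar -tzf "$tarball" | sed -n '1s|/.*||p')
    echo "export PATH=\"$prefix/$top/bin:\$PATH\""
}
```

Appending the printed line to `~/.bashrc` and opening a new shell should make the programs available; `blastn -version` is a quick way to confirm.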
2. Installing the databases
2.1 Database directories and descriptions
- /blast/db/ directory
| file name | description |
|---|---|
| 16SMicrobial.tar.gz | Bacterial and Archaeal 16S rRNA sequences from BioProjects 33175 and 33117 |
| FASTA/ | Subdirectory for FASTA formatted sequences |
| README | README for this subdirectory (this file) |
| Representative_Genomes.*tar.gz | Representative bacterial/archaeal genomes database |
| cdd_delta.tar.gz | Conserved Domain Database sequences for use with stand alone deltablast |
| cloud/ | Subdirectory of databases for BLAST AMI; see http://1.usa.gov/TJAnEt |
| env_nr.*tar.gz | Protein sequences for metagenomes |
| env_nt.*tar.gz | Nucleotide sequences for metagenomes |
| est.tar.gz | This file requires est_human.*.tar.gz, est_mouse.*.tar.gz, and est_others.*.tar.gz files to function. It contains the est.nal alias so that searches against est (-db est) will include est_human, est_mouse and est_others. |
| est_human.*.tar.gz | Human subset of the est database from the est division of GenBank, EMBL and DDBJ. |
| est_mouse.*.tar.gz | Mouse subset of the est database |
| est_others.*.tar.gz | Non-human and non-mouse subset of the est database |
| gss.*tar.gz | Sequences from the GSS division of GenBank, EMBL, and DDBJ |
| htgs.*tar.gz | Sequences from the HTG division of GenBank, EMBL, and DDBJ |
| human_genomic.*tar.gz | Human RefSeq (NC_######) chromosome records with gap adjusted concatenated NT_ contigs |
| nr.*tar.gz | Non-redundant protein sequences from GenPept, Swissprot, PIR, PRF, PDB, and NCBI RefSeq |
| nt.*tar.gz | Partially non-redundant nucleotide sequences from all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS, STS, PAT, EST, HTG, and WGS. |
| other_genomic.*tar.gz | RefSeq chromosome records (NC_######) for non-human organisms |
| pataa.*tar.gz | Patent protein sequences |
| patnt.*tar.gz | Patent nucleotide sequences. Both patent databases are directly from the USPTO, or from the EPO/JPO via EMBL/DDBJ |
| pdbaa.*tar.gz | Sequences for the protein structure from the Protein Data Bank |
| pdbnt.*tar.gz | Sequences for the nucleotide structure from the Protein Data Bank. They are NOT the protein coding sequences for the corresponding pdbaa entries. |
| refseq_genomic.*tar.gz | NCBI genomic reference sequences |
| refseq_protein.*tar.gz | NCBI protein reference sequences |
| refseq_rna.*tar.gz | NCBI Transcript reference sequences |
| sts.*tar.gz | Sequences from the STS division of GenBank, EMBL, and DDBJ |
| swissprot.tar.gz | Swiss-Prot sequence database (last major update) |
| taxdb.tar.gz | Additional taxonomy information for the databases listed here providing common and scientific names |
| tsa_nt.*tar.gz | Sequences from the TSA division of GenBank, EMBL, and DDBJ |
| vector.tar.gz | Vector sequences from 2010, see Note 2 in section 4. |
- /blast/db/FASTA directory
| file name | description |
|---|---|
| alu.a.gz | translation of alu.n repeats |
| alu.n.gz | alu repeat elements (from 2003) |
| drosoph.aa.gz | CDS translations from drosophila.nt |
| drosoph.nt.gz | genomic sequences for drosophila (from 2003) |
| env_nr.gz* | Protein sequences for metagenomes, taxid 408169 |
| env_nt.gz* | Nucleotide sequences for metagenomes, taxid 408169 |
| est_human.gz* | human subset of the est database (see Note 1) |
| est_mouse.gz* | mouse subset of the est database |
| est_others.gz* | non-human and non-mouse subset of the est database |
| gss.gz* | sequences from the GSS division of GenBank, EMBL, and DDBJ |
| htgs.gz* | sequences from the HTG division of GenBank, EMBL, and DDBJ |
| human_genomic.gz* | human RefSeq (NC_######) chromosome records with gap adjusted concatenated NT_ contigs |
| igSeqNt.gz | human and mouse immunoglobulin variable region nucleotide sequences |
| igSeqProt.gz | human and mouse immunoglobulin variable region protein sequences |
| mito.aa.gz | CDS translations of complete mitochondrial genomes |
| mito.nt.gz | complete mitochondrial genomes |
| nr.gz* | non-redundant protein sequence database with entries from GenPept, Swissprot, PIR, PRF, PDB, and RefSeq |
| nt.gz* | nucleotide sequence database, with entries from all traditional divisions of GenBank, EMBL, and DDBJ; excluding bulk divisions (gss, sts, pat, est, htg) and wgs entries. Partially non-redundant. |
| other_genomic.gz* | RefSeq chromosome records (NC_######) for organisms other than human |
| pataa.gz* | patent protein sequences |
| patnt.gz* | patent nucleotide sequences. Both patent sequence files are from the USPTO, or EPO/JPO via EMBL/DDBJ |
| pdbaa.gz* | protein sequences from pdb protein structures |
| pdbnt.gz* | nucleotide sequences from pdb nucleic acid structures. They are NOT the protein coding sequences for the corresponding pdbaa entries. |
| sts.gz* | database for sequence tag site entries |
| swissprot.gz* | swiss-prot database (last major release) |
| vector.gz | vector sequences from 2010. (See Note 2) |
| yeast.aa.gz | protein translations from yeast.nt |
| yeast.nt.gz | yeast genomes (from 2003) |
2.2 Online installation with a tool (Ubuntu)
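BLAST+ ships with a Perl script, `update_blastdb.pl`, that downloads preformatted databases from NCBI. The sketch below shows one way it might be wrapped. The wrapper name `fetch_blast_db`, the `$HOME/blastdb` directory, and the `DRY_RUN` switch are assumptions of this example; `--decompress` is an option of the script itself.

```shell
# Hypothetical wrapper around update_blastdb.pl (bundled with BLAST+).
# Set DRY_RUN to any non-empty value to print the command instead of
# contacting NCBI.
fetch_blast_db() {
    db_name=$1                     # a database name, e.g. swissprot
    db_dir=${2:-$HOME/blastdb}     # assumed database directory
    mkdir -p "$db_dir" || return 1
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$db_dir" && ${DRY_RUN:+echo} update_blastdb.pl --decompress "$db_name" )
}
```

Running `update_blastdb.pl --showall` first lists the database names currently offered. After `fetch_blast_db swissprot`, setting `export BLASTDB=$HOME/blastdb` lets the BLAST programs find the database by name.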
2.3 Download manually from the official site (FASTA files or preformatted databases)
1. FASTA files (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA)
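A downloaded FASTA file has to be decompressed and then formatted with `makeblastdb` before BLAST can search it. The sketch below assumes the `.gz` file has already been fetched (for example with `wget`); the helper name `build_blast_db` and the `DRY_RUN` switch are inventions of this example.

```shell
# Hypothetical helper: decompress a FASTA file from /blast/db/FASTA and
# format it as a BLAST database. Set DRY_RUN to any non-empty value to
# print the makeblastdb call instead of running it.
build_blast_db() {
    fasta_gz=$1               # e.g. mito.nt.gz
    db_type=$2                # "nucl" for nucleotide, "prot" for protein
    fasta=${fasta_gz%.gz}     # mito.nt.gz -> mito.nt
    gunzip -c "$fasta_gz" > "$fasta" || return 1
    ${DRY_RUN:+echo} makeblastdb -in "$fasta" -dbtype "$db_type" -out "$fasta"
}
```

For instance, after `wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/mito.nt.gz` (file name taken from the table above; the server listing may have changed since), `build_blast_db mito.nt.gz nucl` would produce a nucleotide database named `mito.nt`.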
2. Preformatted databases (ftp://ftp.ncbi.nlm.nih.gov/blast/db/)
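Preformatted databases need no `makeblastdb` step; the archives only have to be unpacked into one directory. Large databases such as nt are split into several volumes (the `nt.*tar.gz` pattern in the table), and every volume has to be downloaded and extracted. A minimal sketch, with a hypothetical helper name `unpack_blast_volumes`:

```shell
# Hypothetical helper: extract every <db>.*tar.gz volume found in a directory.
unpack_blast_volumes() {
    db_name=$1                # e.g. nt or swissprot
    db_dir=${2:-.}            # directory holding the downloaded archives
    for archive in "$db_dir"/"$db_name".*tar.gz; do
        # If nothing matched, the pattern is left unexpanded and does not exist.
        [ -e "$archive" ] || { echo "no archives for $db_name in $db_dir" >&2; return 1; }
        tar -xzf "$archive" -C "$db_dir" || return 1
    done
}
```

After downloading `swissprot.tar.gz` into the current directory, `unpack_blast_volumes swissprot` extracts it in place; running `blastdbcmd -db swissprot -info` from that directory is one way to check the result.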
References
- https://go.usa.gov/xPhky
- ftp://ftp.ncbi.nlm.nih.gov/blast/documents/blastdb.html
- ftp://ftp.ncbi.nlm.nih.gov/blast/db/
- https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
- Possibly the most complete BLAST tutorial I have written (in Chinese) https://www.jianshu.com/p/de28be1a3bea
- Tutorial on setting up NCBI BLAST+ locally on Linux (in Chinese) http://blog.shenwei.me/local-blast-installation/
UNIVERONE