Installing and Using KneadData

tech  2025-07-20

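Contents

Introduction
Installation
Downloading the databases
Creating a custom database (optional)
Running KneadData
Summarizing read counts after QC
Adding FastQC
Appendix: parameter details
References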

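Introduction

Note: raw sequencing data must go through quality control before it is analyzed with HUMAnN or MetaPhlAn.

KneadData is a quality-control pipeline for metagenomic and metatranscriptomic sequencing data. Its two main functions are trimming reads with Trimmomatic and removing host and other contaminant sequences by aligning reads with bowtie2 against the genomes in the corresponding reference databases.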

Installation

    ## Create a virtual environment and install kneaddata
    conda create -n kneaddata
    conda activate kneaddata
    conda install -c biobakery kneaddata

Downloading the databases

    mkdir kneaddata_database
    cd kneaddata_database/
    kneaddata_database --download human_genome bowtie2 ./
    kneaddata_database --download mouse_C57BL bowtie2 ./
    kneaddata_database --download human_transcriptome bowtie2 ./
    kneaddata_database --download ribosomal_RNA bowtie2 ./

    ## Extract each database archive into its own directory
    mkdir kneaddata_db_DATABASE_NAME
    tar -zxvf DATABASE_NAME.tar.gz -C ./kneaddata_db_DATABASE_NAME/
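The extraction step above is written for a single archive with the placeholder DATABASE_NAME. As a minimal sketch, assuming the downloaded archives sit in one directory and end in .tar.gz, the same two commands can be repeated for every archive. The function name extract_all is hypothetical and not part of kneaddata.

```shell
# Sketch: unpack every *.tar.gz in a directory into kneaddata_db_<name>/.
# extract_all is a hypothetical helper name, not a kneaddata command.
extract_all() {
    src_dir=$1
    for archive in "$src_dir"/*.tar.gz; do
        # Skip the unexpanded pattern when no archive is present
        [ -e "$archive" ] || continue
        name=$(basename "$archive" .tar.gz)
        dest="$src_dir/kneaddata_db_$name"
        mkdir -p "$dest"
        tar -zxf "$archive" -C "$dest"
        echo "$name -> $dest"
    done
}
```

Calling extract_all . from inside kneaddata_database/ would create one kneaddata_db_<name> directory per archive and print where each one went.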

Creating a custom database (optional)

    # bowtie2-build <reference> <db-name>
    mkdir kneaddata_db_Rnor_6
    cd kneaddata_db_Rnor_6
    wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/895/GCF_000001895.5_Rnor_6.0/GCF_000001895.5_Rnor_6.0_genomic.fna.gz
    bowtie2-build GCF_000001895.5_Rnor_6.0_genomic.fna.gz Rnor_6 --threads 8
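Before pointing kneaddata at a freshly built database, it can help to confirm that bowtie2-build finished. A bowtie2 index is a set of six files sharing one prefix (.1, .2, .3, .4, .rev.1 and .rev.2, ending in .bt2, or in .bt2l for large indexes). The sketch below only checks that those files exist; has_bt2_index is a hypothetical helper name.

```shell
# Sketch: succeed only if all six bowtie2 index files exist for a prefix.
# has_bt2_index is a hypothetical helper name.
has_bt2_index() {
    prefix=$1
    # Small indexes end in .bt2, large ones in .bt2l
    for ext in bt2 bt2l; do
        missing=0
        for part in 1 2 3 4 rev.1 rev.2; do
            [ -f "$prefix.$part.$ext" ] || missing=1
        done
        if [ "$missing" -eq 0 ]; then
            return 0
        fi
    done
    return 1
}
```

For the example above, has_bt2_index kneaddata_db_Rnor_6/Rnor_6 && echo "index complete" would print the message only when every index file is present.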

Running KneadData

    ## Set the database paths; list them all, or only the ones you need
    KNEADDATA_DB_MOUSE_C57BL_6NJ=/home/dengysh/anaconda3/envs/kneaddata/kneaddata_database/kneaddata_db_mouse_C57BL_6NJ/mouse_C57BL_6NJ
    KNEADDATA_DB_HUMAN_GENOME=/home/dengysh/anaconda3/envs/kneaddata/kneaddata_database/kneaddata_db_human_genome/Homo_sapiens
    KNEADDATA_DB_HUMAN_TRANSCRIPTOME=/home/dengysh/anaconda3/envs/kneaddata/kneaddata_database/kneaddata_db_human_transcriptome/human_hg38_refMrna
    KNEADDATA_DB_RIBOSOMAL_RNA=/home/dengysh/anaconda3/envs/kneaddata/kneaddata_database/kneaddata_db_ribosomal_rna/SILVA_128_LSUParc_SSUParc_ribosomal_RNA
    KNEADDATA_DB_RNOR_6=/home/dengysh/anaconda3/envs/kneaddata/kneaddata_database/kneaddata_db_Rnor_6/Rnor_6

    ## Single-end data
    kneaddata -i D84-1.fastq.gz -o ./D84-1 -t 20 -p 20 \
        -db $KNEADDATA_DB_HUMAN_GENOME

    ## Paired-end data
    ## When filtering against several databases, adding '--serial' is recommended:
    ## it chains the databases, so the output of one search is the input of the next.
    ## Adding '--cat-final-output' is also recommended: it concatenates the final
    ## output files, which simplifies downstream analysis.
    kneaddata -i D84-1.R1.fastq.gz -i D84-1.R2.fastq.gz -o ./D84-1 \
        --output-prefix D84-1 -t 20 -p 20 --cat-final-output --serial \
        -db $KNEADDATA_DB_HUMAN_GENOME \
        -db $KNEADDATA_DB_RNOR_6 \
        -db $KNEADDATA_DB_RIBOSOMAL_RNA
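The paired-end command above handles one sample. A minimal sketch for several samples follows, assuming every sample uses the same <sample>.R1.fastq.gz / <sample>.R2.fastq.gz naming as D84-1 and that paths contain no spaces. The helper paired_cmd is hypothetical; it only prints the command, so the result can be inspected before anything is run.

```shell
# Sketch: print (not run) the paired-end kneaddata command for one sample.
# paired_cmd is a hypothetical helper; it assumes the R2 file sits next to
# the R1 file and differs only in the ".R1." / ".R2." part of its name.
paired_cmd() {
    r1=$1
    shift
    sample=$(basename "$r1" .R1.fastq.gz)
    r2="${r1%.R1.fastq.gz}.R2.fastq.gz"
    cmd="kneaddata -i $r1 -i $r2 -o ./$sample --output-prefix $sample -t 20 -p 20 --cat-final-output --serial"
    # Every remaining argument is a database prefix, added in the given order
    for db in "$@"; do
        cmd="$cmd -db $db"
    done
    echo "$cmd"
}
```

Looping over the R1 files, for example for r1 in ./*.R1.fastq.gz; do paired_cmd "$r1" "$KNEADDATA_DB_HUMAN_GENOME" "$KNEADDATA_DB_RIBOSOMAL_RNA"; done, prints one command per sample. Printing instead of executing is a deliberate choice here: with 20 threads per run it is worth checking the list before launching it.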

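Summarizing read counts after QC

Once QC has finished, you can tabulate how many reads survived each step. The kneaddata_read_count_table tool does this from the log files that kneaddata writes to the output directory, which record the whole QC process. For several independent samples, collect their log files in one directory and summarize all of them in a single run.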

    kneaddata_read_count_table --input log_file/ --output kneaddata_read_count_table.tsv

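Result: columns with "pair" in the name give the number of paired reads; columns with "orphan" give the number of reads left without a mate after filtering.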

The output file has one row per sample and 22 count columns; it is shown transposed here for readability.

    Column                                                                  Sample1      Sample2      Sample3
    raw pair1                                                               34911652.0   34680773.0   35418717.0
    raw pair2                                                               34911652.0   34680773.0   35418717.0
    trimmed pair1                                                           34908958.0   34678094.0   35416028.0
    trimmed pair2                                                           34908958.0   34678094.0   35416028.0
    trimmed orphan1                                                         400.0        369.0        423.0
    trimmed orphan2                                                         2294.0       2310.0       2266.0
    decontaminated Homo_sapiens pair1                                       34605696.0   34380372.0   35139583.0
    decontaminated Homo_sapiens pair2                                       34605696.0   34380372.0   35139583.0
    decontaminated Rnor_6 pair1                                             34602788.0   34378883.0   35138485.0
    decontaminated Rnor_6 pair2                                             34602788.0   34378883.0   35138485.0
    decontaminated SILVA_128_LSUParc_SSUParc_ribosomal_RNA pair1            34602777.0   34378878.0   35138475.0
    decontaminated SILVA_128_LSUParc_SSUParc_ribosomal_RNA pair2            34602777.0   34378878.0   35138475.0
    decontaminated Homo_sapiens orphan1                                     111066.0     109089.0     105007.0
    decontaminated Homo_sapiens orphan2                                     114822.0     112894.0     108089.0
    decontaminated Rnor_6 orphan1                                           111600.0     109539.0     105384.0
    decontaminated Rnor_6 orphan2                                           115489.0     113322.0     108460.0
    decontaminated SILVA_128_LSUParc_SSUParc_ribosomal_RNA orphan1          111605.0     109542.0     105389.0
    decontaminated SILVA_128_LSUParc_SSUParc_ribosomal_RNA orphan2          115495.0     113324.0     108465.0
    final pair1                                                             34602777.0   34378878.0   35138475.0
    final pair2                                                             34602777.0   34378878.0   35138475.0
    final orphan1                                                           110989.0     109076.0     104997.0
    final orphan2                                                           114745.0     112872.0     108081.0
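A quick way to read this table is to ask what share of the raw read pairs survived to the end. The sketch below computes "final pair1" as a percentage of "raw pair1" for each sample. It assumes the file is tab-separated with the sample name in the first column and header names exactly as shown above; retained_pct is a hypothetical helper name.

```shell
# Sketch: per sample, print "final pair1" as a percentage of "raw pair1".
# retained_pct is a hypothetical helper; it expects a tab-separated file
# whose header contains the columns "raw pair1" and "final pair1".
retained_pct() {
    awk -F '\t' '
        NR == 1 {
            # Locate the two columns by header name rather than by position
            for (i = 1; i <= NF; i++) {
                if ($i == "raw pair1")   raw = i
                if ($i == "final pair1") fin = i
            }
            next
        }
        raw && fin && $raw > 0 {
            printf "%s\t%.2f\n", $1, 100 * $fin / $raw
        }
    ' "$1"
}
```

Running retained_pct kneaddata_read_count_table.tsv prints one line per sample. Looking the columns up by name keeps the sketch working when the number of "decontaminated" columns changes with the databases used.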

Adding FastQC

FastQC must be downloaded and installed first. Download link: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip. The --run-fastqc-start and --run-fastqc-end options then check read quality before and after QC.

    ## After the download finishes, unzip the archive and fix the permissions
    unzip fastqc_v0.11.9.zip
    chmod -R 777 FastQC

    ## Run kneaddata together with FastQC
    kneaddata --input input/SE_extra.fastq --reference-db input/demo_db \
        --output kneaddataOutputFastQC --run-fastqc-start --run-fastqc-end
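According to the usage text in the appendix, kneaddata looks for FastQC on $PATH unless --fastqc FASTQC_PATH is given, so the unzipped FastQC directory has to be visible one way or the other. As a minimal sketch, the hypothetical helper below prepends a directory to PATH only when the tool is not already found, then prints where the tool resolves.

```shell
# Sketch: make a tool visible on PATH and report where it was found.
# add_tool_dir is a hypothetical helper name, not part of kneaddata or FastQC.
add_tool_dir() {
    tool=$1
    dir=$2
    # Only touch PATH when the tool cannot be found yet
    if ! command -v "$tool" >/dev/null 2>&1; then
        PATH="$dir:$PATH"
        export PATH
    fi
    # Prints the resolved location; fails if the tool is still missing
    command -v "$tool"
}
```

With the archive unzipped in the current directory, add_tool_dir fastqc "$PWD/FastQC" would be called before kneaddata. Passing the directory through --fastqc instead avoids changing PATH at all.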

Appendix: parameter details

    usage: kneaddata [-h] [--version] [-v] -i INPUT -o OUTPUT_DIR
                     [-db REFERENCE_DB] [--bypass-trim]
                     [--output-prefix OUTPUT_PREFIX] [-t <1>] [-p <1>]
                     [-q {phred33,phred64}] [--run-bmtagger] [--bypass-trf]
                     [--run-fastqc-start] [--run-fastqc-end]
                     [--store-temp-output] [--remove-intermediate-output]
                     [--cat-final-output]
                     [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                     [--log LOG] [--trimmomatic TRIMMOMATIC_PATH]
                     [--max-memory MAX_MEMORY]
                     [--trimmomatic-options TRIMMOMATIC_OPTIONS]
                     [--sequencer-source {NexteraPE,TruSeq2,TruSeq3}]
                     [--bowtie2 BOWTIE2_PATH]
                     [--bowtie2-options BOWTIE2_OPTIONS] [--no-discordant]
                     [--reorder] [--serial] [--bmtagger BMTAGGER_PATH]
                     [--trf TRF_PATH] [--match MATCH] [--mismatch MISMATCH]
                     [--delta DELTA] [--pm PM] [--pi PI]
                     [--minscore MINSCORE] [--maxperiod MAXPERIOD]
                     [--fastqc FASTQC_PATH]

    KneadData

    optional arguments:
      -h, --help            show this help message and exit
      -v, --verbose         additional output is printed

    global options:
      --version             show program's version number and exit
      -i INPUT, --input INPUT
                            input FASTQ file (add a second argument instance to
                            run with paired input files)
      -o OUTPUT_DIR, --output OUTPUT_DIR
                            directory to write output files
      -db REFERENCE_DB, --reference-db REFERENCE_DB
                            location of reference database (additional
                            arguments add databases)
      --bypass-trim         bypass the trim step
      --output-prefix OUTPUT_PREFIX
                            prefix for all output files
                            [ DEFAULT : $SAMPLE_kneaddata ]
      -t <1>, --threads <1>
                            number of threads
                            [ Default : 1 ]
      -p <1>, --processes <1>
                            number of processes
                            [ Default : 1 ]
      -q {phred33,phred64}, --quality-scores {phred33,phred64}
                            quality scores
                            [ DEFAULT : phred33 ]
      --run-bmtagger        run BMTagger instead of Bowtie2 to identify
                            contaminant reads
      --bypass-trf          option to bypass the removal of tandem repeats
      --run-fastqc-start    run fastqc at the beginning of the workflow
      --run-fastqc-end      run fastqc at the end of the workflow
      --store-temp-output   store temp output files
                            [ DEFAULT : temp output files are removed ]
      --remove-intermediate-output
                            remove intermediate output files
                            [ DEFAULT : intermediate output files are stored ]
      --cat-final-output    concatenate all final output files
                            [ DEFAULT : final output is not concatenated ]
      --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                            level of log messages
                            [ DEFAULT : DEBUG ]
      --log LOG             log file
                            [ DEFAULT : $OUTPUT_DIR/$SAMPLE_kneaddata.log ]

    trimmomatic arguments:
      --trimmomatic TRIMMOMATIC_PATH
                            path to trimmomatic
                            [ DEFAULT : $PATH ]
      --max-memory MAX_MEMORY
                            max amount of memory
                            [ DEFAULT : 500m ]
      --trimmomatic-options TRIMMOMATIC_OPTIONS
                            options for trimmomatic
                            [ DEFAULT : ILLUMINACLIP:/-SE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50 ]
                            MINLEN is set to 50 percent of total input read length
      --sequencer-source {NexteraPE,TruSeq2,TruSeq3}
                            options for sequencer-source
                            [ DEFAULT : NexteraPE]

    bowtie2 arguments:
      --bowtie2 BOWTIE2_PATH
                            path to bowtie2
                            [ DEFAULT : $PATH ]
      --bowtie2-options BOWTIE2_OPTIONS
                            options for bowtie2
                            [ DEFAULT : --very-sensitive ]
      --no-discordant       do not include discordant alignments for pairs
                            (ie one of the two pairs aligns)
                            [ DEFAULT : Discordant alignments are included ]
      --reorder             order the sequences in the same order as the input
                            [ DEFAULT : With discordant paired alignments sequences are not ordered ]
      --serial              filter the input in serial for multiple databases so
                            a subset of reads are processed in each database
                            search

    bmtagger arguments:
      --bmtagger BMTAGGER_PATH
                            path to BMTagger
                            [ DEFAULT : $PATH ]

    trf arguments:
      --trf TRF_PATH        path to TRF
                            [ DEFAULT : $PATH ]
      --match MATCH         matching weight
                            [ DEFAULT : 2 ]
      --mismatch MISMATCH   mismatching penalty
                            [ DEFAULT : 7 ]
      --delta DELTA         indel penalty
                            [ DEFAULT : 7 ]
      --pm PM               match probability
                            [ DEFAULT : 80 ]
      --pi PI               indel probability
                            [ DEFAULT : 10 ]
      --minscore MINSCORE   minimum alignment score to report
                            [ DEFAULT : 50 ]
      --maxperiod MAXPERIOD
                            maximum period size to report
                            [ DEFAULT : 500 ]

    fastqc arguments:
      --fastqc FASTQC_PATH  path to fastqc
                            [ DEFAULT : $PATH ]

References

https://github.com/biobakery/kneaddata
https://github.com/biobakery/biobakery/wiki/kneaddata
