A couple questions of analyzing Hiseq data following Miseq SOP

ch3coch3 · October 2, 2015, 9:42pm

Hi all,

I’m new to mothur and it’s a bumpy road so far.

So I’m following Miseq SOP to analyze our own data. We happen to sequence v6 region as well. Although a couple problems that I ran into:

We got Hiseq data, with only one .fastq and .fna data per sample, but the sequencing was PE. I didn’t know if the company merge/contigs rfastq and ffastq for us or not. How can I tell by looking at fastq or fna data, please?

For example, fna data looks like this

H1_0
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGACGGGTCCTTAAGTCAGTTGTGAAAGTTT
GCGGCTCAACCGTAAAATTGCAGTTGATACTGGGGACCTTGAGTGCGGCAGAGGCAGGCGGAATTCGTGGTGTAGCGGTG
AAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTGCTGGACCGTAACTGACGTTGATGCTCGAAAGTGCG
GGTATCAAACAGG
H1_1
TACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCC
CCGGCTTAACCGGGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAAAGCGGAATTCCATGTGTAGCGGTG
AAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTG
GGGAGCAAACAGG
H1_2
TACGGGGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGCGCGTAGGCGGGACGCCAAGTCAGCGGTAAAAGACT
GCAGCTAAACTGTAGCACGCCGTTGAAACTGGCGCCCTCGAGACGAGACGAGGGAGGCGGAACAAGTGAAGTAGCGGTGA
AATGCTTAGATATCACTTGGAACCCCGATAGCGAAGGCAGCTTCCCAGGCTCGATCTGACGCTGATGCGCGAGAGCGTGG
GTAGCGAACAGG
H1_3
TACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGGACAGCAAGTCTGATGTGAAAGGCG
GGGGCTCAACCCCCGGACTGCATTGGAAACTGCTGACCTGGAGTACCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTG
AAATGCGTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGCTTACTGGACGGTAACTGACGTTGAGGCTCGAAAGCGTG
GGGAGCAAACAGG

and fastq data looks like this
@H1_0 SN7001328:511:H2WKLBCXX:1:1101:4739:2211 1:N:0:GACAGTGC orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGACGGGTCCTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGGGACCTTGAGTGCGGCAGAGGCAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTGCTGGACCGTAACTGACGTTGATGCTCGAAAGTGCGGGTATCAAACAGG
+
IGIEIHHHHHEEF@DHDHH@FFHFHGHHHI@HHHH00DCC@FHH?EHHIHCGHIIFHHHIHHHHHIEHHFIHHHEHHIHGIHHEHIGHGH?HIIIHIHHHHHIHHGG@HHFHHEGDGHHIIIHHHIIHHHHHHICHHHEHIHHIGHFFIHGIIIIIIIIIHIIFIHHHIHIHHG@AGDG@HIIHDIIHEDEHIIIIHFHHHHEHEHIHHGIIIIIIIIIIIIIIIHDIHGIIIIHDCIHIIHHHHIIIIIIII
@H1_1 SN7001328:511:H2WKLBCXX:1:1101:8024:2126 1:N:0:GACAGTGC orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
TACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAAAGCGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGGCTTTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG
+
IIIIIIIIHHFHFHHIIIGIGHHHIIIIIIGHIIIGHIIIGII<GIIIIIIIIIHHIIIIIIIICGIIIIIIHGIIEHHHHIIIIHHHHHHHHHIHIIIHIGIHHIIHHHHHHHHHGHIIIHHHEHIIIIHHGGIIIIIIIIIIIIHHIHHIHIIHGIIIIIIIIHHIIIIIIHIIHHHIIIHIIIHIIIIIHHHIHGIHIIIIIIIIIIIHHEHEHEHHHIHHIIIHEIHIIIIGIIIHIIIIIHIIIIIHH
@H1_2 SN7001328:511:H2WKLBCXX:1:1101:14224:2150 1:N:0:GACAGTGC orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
TACGGGGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGCGCGTAGGCGGGACGCCAAGTCAGCGGTAAAAGACTGCAGCTAAACTGTAGCACGCCGTTGAAACTGGCGCCCTCGAGACGAGACGAGGGAGGCGGAACAAGTGAAGTAGCGGTGAAATGCTTAGATATCACTTGGAACCCCGATAGCGAAGGCAGCTTCCCAGGCTCGATCTGACGCTGATGCGCGAGAGCGTGGGTAGCGAACAGG
+
IIIIIIIIIIIGIIIIIIIIIIIIIHGHIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIGIIIIHIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIHIIIIIIGIGIIHIGIIIIIIIIIIIIIIIIIIHGIIIIIIIIIIIIHIIIIIIIIIIIIIIHIIHIHFIIIIIIGIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIGIIIIHIIIIIIIIIIIIHIIIIIIIGIIIIIHIII
@H1_3 SN7001328:511:H2WKLBCXX:1:1101:18363:2111 1:N:0:GACAGTGA orig_bc=AAAAAAAAAAAA new_bc=AAAAAAAAAAAA bc_diffs=0
TACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGGACAGCAAGTCTGATGTGAAAGGCGGGGGCTCAACCCCCGGACTGCATTGGAAACTGCTGACCTGGAGTACCGGAGGGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGCTTACTGGACGGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAGCAAACAGG

Since we only have one fastq file per sample instead of rfastq and ffastq files, I didn’t use make.contigs command. And I merged all the fastq and fna data into a big fastq and fna data per treatment (we got 3 groups). So I’ve got 3 huge fna files and 3 fastq files. I skipped make.contigs and did screen.seqs and the rest. Would that be correct?
I don’t know how to make .file, like stability.file in SOP. And I don’t have group= command, 'cause I don’t have a group. Would it affect my analysis?
At the step mothur > remove.lineage(fasta=Lgroup.good.unique.good.filter.unique.precluster.pick.fasta,count=Lgroup.good.unique.good.filter.unique.precluster.uchime.pick.count_table, taxonomy=Lgroup.good.unique.good.filter.unique.precluster.pick.pds.wang.taxonomy, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)

Mothur stopped and warning, Unable to open Lgroup.good.unique.good.filter.unique.precluster.uchime.pick.count_table
[ERROR]: did not complete remove.lineage.

I looked the database, I didn’t have an output file named Lgroup.good.unique.good.filter.unique.precluster.uchime.pick.count_table, I only got Lgroup.good.unique.good.filter.unique.precluster.count_table. But I did chimera.uchime

mothur > chimera.uchime(fasta=Lgroup.good.unique.good.filter.unique.precluster.fasta, count=Lgroup.good.unique.good.filter.unique.precluster.count_table, dereplicate=t)

Using 1 processors.

uchime by Robert C. Edgar
http://drive5.com/uchime
This code is donated to the public domain.

Checking sequences from Lgroup.good.unique.good.filter.unique.precluster.fasta …

It took 6932 secs to check 46130 sequences. 23222 chimeras were found.

Output File Names:
Lgroup.good.unique.good.filter.unique.precluster.uchime.chimeras
Lgroup.good.unique.good.filter.unique.precluster.uchime.accnos

So the process stopped right here unless I figure out what went wrong.

I tried stability.batch file using just one sample fna file, started with screen.seqs(fasta=H1.fna, minlength=252, maxhomop=8, maxambig=0, maxlength=275), that went surprisingly well, except an error saying

mothur > count.seqs(name=current, group=current)
[WARNING]: no file was saved for group parameter.
Using H1.good.names as input file for the name parameter.

Because I don’t have a group.

I used hpc to run the data, but it got terminated, the last log info was

Output File Names:
H1.good.unique.good.filter.unique.precluster.pick.pick.fasta.30.dist

It took 0 to calculate the distances for 2 sequences.
It took 38 seconds to split the distance file.

Reading H1.good.unique.good.filter.unique.precluster.pick.pick.fasta.3.dist

Any idea why it didn’t go through?

Sorry for tons of questions here. I’m brand new and still learning.

Much thanks in advance for your help!

Cheers

westcott · October 5, 2015, 5:52pm

Welcome to the mothur community! It looks like your sequencing company has assembled the reads for you so there is no need to run make.contigs. You can create a group file for you dataset using the make.group command, http://www.mothur.org/wiki/Make.group.

mothur > make.group(treatment1.fna-treatment2.fna-treatment3.fna, groups=treatment1-treatment2-treatment3)

When you have the group file, you should be able to follow the rest of the tutorial.

ch3coch3 · October 6, 2015, 3:48am

Hi westcott,

Thanks much for your reply! It did make progress after make.group command.
I did

make.group(fasta=Hgroup.fasta-Mgroup.fasta-Lgroup.fasta, groups=H-M-L.group)

and the output file was Hgroup.Mgroup.Lgroup.groups (hope that’s the way it’s supposed to be)

Then I adapted SOP batch file to run our data

summary.seqs(fasta=Lgroup.fasta)
screen.seqs(fasta=Lgroup.fasta, group=Hgroup.Mgroup.Lgroup.groups, minlength=252, maxambig=0, maxlength=275, maxhomop=8)
unique.seqs()
count.seqs(name=current, group=current)
align.seqs(fasta=current, reference=silva.v4.fasta)
summary.seqs(fasta=current, count=current)
screen.seqs(fasta=current, count=current, start=1968, end=11550, maxhomop=8)
filter.seqs(fasta=current, vertical=T, trump=.)
unique.seqs(fasta=current, count=current)
pre.cluster(fasta=current, count=current, diffs=2)
chimera.uchime(fasta=current, count=current, dereplicate=t)
remove.seqs(fasta=current, accnos=current)
classify.seqs(fasta=current, count=current, reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80)
remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)
remove.groups(count=current, fasta=current, taxonomy=current, groups=current)
cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.15)
make.shared(list=current, count=current, label=0.03)
classify.otu(list=current, count=current, taxonomy=current, label=0.03)
phylotype(taxonomy=current)
make.shared(list=current, count=current, label=1)
classify.otu(list=current, count=current, taxonomy=current, label=1)

But at the step
mothur > cluster.split(fasta=current, count=current, taxonomy=current, splitmethod=classify, taxlevel=4, cutoff=0.15)

Using Hgroup.good.unique.good.filter.unique.precluster.uchime.pick.pick.pick.count_table as input file for the count parameter.
Using Hgroup.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta as input file for the fasta parameter.
Using Hgroup.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.pick.taxonomy as input file for the taxonomy parameter.

It went wrong and saying

[ERROR]: Hgroup.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta is blank, aborting.
[ERROR]: Hgroup.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.pick.taxonomy is blank, aborting.
Using Hgroup.good.unique.good.filter.unique.precluster.pick.pick.pick.fasta as input file for the fasta parameter.
Using Hgroup.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.pick.taxonomy as input file for the taxonomy parameter.

Using 1 processors.
Using splitmethod classify.
Splitting the file…
[ERROR]: Hgroup.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.pick.taxonomy is blank. Please correct.
splitcutoff is greater than the longest taxonomy, using 1
[ERROR]: Could not open
It took 0 seconds to split the distance file.
[ERROR]: unable to spawn the necessary processes.

I went back and checked, turns out at the step
mothur > remove.groups(count=current, fasta=current, taxonomy=current, groups=Hgroup.Mgroup.Lgroup.good.groups)
Using Hgroup.good.unique.good.filter.unique.precluster.uchime.pick.pick.count_table as input file for the count parameter.
Using Hgroup.good.unique.good.filter.unique.precluster.pick.pick.fasta as input file for the fasta parameter.
Using Hgroup.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.taxonomy as input file for the taxonomy parameter.

[NOTE]: The count file should contain only unique names, so mothur assumes your fasta, list and taxonomy files also contain only uniques.

Hgroup.Mgroup.Lgroup.groups is not a valid group, and will be disregarded.
You provided no valid groups. I will run the command using all the groups in your file.
Your file contains only sequences from the groups you wish to remove.
Removed 28474 sequences from your fasta file.
Your file does NOT contain sequences from the groups you wish to get.
Removed 1819009 sequences from your count file.
Your file contains only sequences from the groups you wish to remove.
Removed 28474 sequences from your taxonomy file.

Apparently the group setting was incorrect and mothur removed all the sequenced. So Hgroup.good.unique.good.filter.unique.precluster.pick.pds.wang.pick.pick.taxonomy is blank.

Would you please help me to fix the problem?

Much much thanks!!

westcott · October 6, 2015, 1:55pm

You are correct the remove.groups command is the problem. You gave the remove.groups command the name of a group file instead of the groups you want to remove. What groups are you trying to remove here?

The group parameter is used to provide a group file, http://www.mothur.org/wiki/Group_file.
The groups parameter is used to list group names. In your case, H, M or L.group.

So you could use the command to remove group H by running:

mothur > remove.groups(count=current, fasta=current, taxonomy=current, groups=H)

or H and M:

mothur > remove.groups(count=current, fasta=current, taxonomy=current, groups=H-M)

Make sense?

ch3coch3 · October 6, 2015, 7:42pm

Yes, westcott,

It totally made sense. And I finally realized that I didn’t need to remove groups at this step. So I # this command and the rest ran perfectly fine.

Thanks much!! Your inputs really helped me solve the problem

westcott · October 7, 2015, 2:29pm

Glad to help,

Topic		Replies	Views
Mismatch between fasta and group files Commands in mothur	1	2782	October 23, 2014
Groups missing after cluster command	3	601	October 16, 2020
Discrepancy between fasta and group counts Commands in mothur	16	2676	December 3, 2019
screen.seq command problem Commands in mothur	13	5206	February 8, 2016
Name file and group file sequence discrepancy Commands in mothur	5	3850	May 29, 2013

A couple questions of analyzing Hiseq data following Miseq SOP

Related topics