I’m not sure if this is the correct place to post about 454 SOP, if not I am sorry
I am trying to repeat every step of 454 SOP with a dataset retrieved from SRA. It consists in 11 fastq files. I performed the steps from below.
fastq.info -> so I get the fasta and qual files
merge.files -> to get one fasta and one qual files with all 11 samples together (taking care of put all files in the same order)
make.groups -> to get a group file of all 11 samples. Taking care of put all files in the same order too.
trim.seqs(fasta=all.fasta, qfile=all.qual, maxambig=0, maxhomop=8, flip=T, bdiffs=1, pdiffs=2, minlength=200, qwindowaverage=25, qwindowsize=50, processors=4)
unique.seqs()
align.seqs(reference=silva.bacteria.fasta)
screen.seqs(fasta=current, name=current, group=current, end=6447, optimize=start, criteria=95)
filter.seqs(fasta=current, vertical=T, trump=.)
unique.seqs() And Now is when I get the following error:
pre.cluster(fasta=current, name=current, group=current)
A lot of “missing name name_of_sequence” appears in the prompt, it says that my name file contains less sequences than my group file and aborts the program to Linux shell with a “segmentation fault”.
I’m not sure why this is happening and if I should use another file like a count file or something.
Hopefully you could help me.
Thanks in advance,
Galo
When you run the unique.seqs() command is mothur auto-detecting your current name file? If it isn’t then that could be where your problem is, as all the information stored in your old name file will be ignored and you’ll generate a new one, based solely off the dereplication of the filtered fasta file.
In fact, since I don’t have the flowfiles, because data retrieved from SRA is in fastq format, I’ve been using that unique.seqs() command without a name file. I don’t know if I should generate a name file in a previous step, and I don’t know how to do it either.
Ah…just re-reading your process there’s another place the problem could be coming from. The trim.seqs command can’t process a group file so it’s likely that your quality-filtering is discarding reads from the fastq file but you still have references to them in your groups file. You might need to repeat the trim.seqs on each individual fastq file and them perform your merge.files/make.groups command afterwards.
So for example, shuffle your commands around like
fastq.info()
trim.seqs(fasta=current,qfile=current, maxambig=0, maxhomop=8, flip=T, minlength=200, qwindowaverage=25, qwindowsize=50, processors=4)
[Repeat 11 times]
...
merge.files()
make.groups()
unique.seqs(fasta=all.fasta)
count.seqs(name=current, group=current) #The name file is generated by the unique.seqs command previously.
align.seqs(fasta=current, reference=silva.bacteria.fasta)
screen.seqs(fasta=current, count=current, end=6447, optimize=start, criteria=95)
filter.seqs(fasta=current, vertical=T, trump=.)
unique.seqs(fasta=current, count=current)
And from there you can rejoin the SOP as you would normally.
It might also be worth looking into raising your qwindowaverage value as well. I know the 454 SOP recommends a higher one for pyrosequencing data. I suppose it depends on what you plan to do with the output though.
Dwaite, I tried to process with trim.seqs() the fasta files one by one and then merge the files and make a group file: it totally works! Thank you very much!
Just one question. I didn’t use the count.seqs() command since it is not used in the 454 SOP and, apparently, it works nice. I know what count.seqs() does but actually I don’t understand when to use it and what is this command intended for.
It’s really just a way of condensing your names/groups files into a single, smaller file. Since it removes all the redundant sequences names that are stored in the names/groups files and just tallies the number of identical sequences you end up needing much less space. It’s not in the 454 SOP because it was added to mothur quite a while after that was written but it is in the MiSeq SOP, which is where it’s probably more important since there are so many more sequences generated.
There are probably a few performance improvements to using it over names/groups (I don’t know that for a fact, but if I was writing code for mothur I would be taking advantage of it), but the main reasons I use it as soon as possible is that it reduces the number of files you’re handling and if you need to jump out of mothur into Excel, R or whatever then it’s much more amenable to number-crunching.