Let the games begin...

Ok, here goes… My former advisor gave me three sets (reads?) of 454 fasta 16S rRNA sequences from three water sources. I have the forward and reverse primers, but no barcodes. No .sff files, flowgrams, etc. How would you proceed? I'm looking at your 454 SOP, and thinking I should probably do as you suggest in such a case: jump in with your trim.seqs command. I ran summary.seqs for the first set:

mothur > summary.seqs(fasta=454reads.MID218.fasta)

and ended up with the following:

            Start  End      NBases   Ambigs    Polymer  NumSeqs
Minimum:    1      47       47       0         3        1
2.5%-tile:  1      440      440      0         4        72
25%-tile:   1      446      446      0         4        715
Median:     1      446      446      0         5        1430
75%-tile:   1      448      448      0         6        2145
97.5%-tile: 1      467      467      0         7        2788
Maximum:    1      803      803      3         10       2859
Mean:       1      445.319  445.319  0.010843  5.09759

# of Seqs: 2859

If the forward and reverse primers are 34 and 37 bp long, respectively, can I take my median sequence length (446 bp) and subtract 71 bp to get my target length? How do I format my oligos file? With both forward and reverse primers? Something as simple as this?

forward GGACTACAGGGTTTCTAATCCTGTTTGCTCCCAC
reverse CCGTTGCGCAAGATTCCCCACTGCTGCCTCCCGTAGG

Thanks,
Mike

So you just need the forward line in your oligos file - if you include the reverse line, everything will likely get scrapped since it won’t find the reverse sequence at the end of the read. If you only have the fasta data then you’ll want to chop the reads to the first 200 bp using the keepfirst option in trim.seqs.
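In other words, your oligos file would just be the one forward line you listed:

forward GGACTACAGGGTTTCTAATCCTGTTTGCTCCCAC

and the trim step would then be something along the lines of (substituting whatever you name your oligos file):

mothur > trim.seqs(fasta=454reads.MID218.fasta, oligos=454reads.oligos, keepfirst=200)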

Pat

I see, Pat, thanks. Regarding the 200 bp, doesn’t that mean throwing data away? Can it be avoided? And in that case, I would be able to use the reverse primer, no? Or what about reversing and complementing the remaining sequence and doing the same with the r primer?

When I tried it last night, I set the min to 440 bp. But I was ending up with nothing in my trim and name files. So, this evening, I took one sequence from the fasta set and started playing with it, varying the settings in the trim.seqs command. I had to remove bdiffs, because there are no barcodes. I scrapped the reverse primer. And I had to set pdiffs to 18. And I kept the first 200 bp. Then it worked… Problem is, I may have to go up to 30-32 pdiffs for the full set.

Mike

One question that occurs to me is what to do with the three separate sets (reads? is my terminology correct?) of 16S sequences I have? They are not barcoded, but at some point I'm going to want to look at community structure, phylogenetics, diversity, etc. So, there must be some way to label and then merge the three… "Back in the day," I'd have opened them in Sequencher or MacClade or even Wrangler and added some identifier to the names…

> I see, Pat, thanks. Regarding the 200 bp, doesn't that mean throwing data away? Can it be avoided? And in that case, I would be able to use the reverse primer, no? Or what about reversing and complementing the remaining sequence and doing the same with the r primer?

Well… Sure, but it’s crappy data that you’re discarding. Since you don’t have the quality scores or the flow data, there isn’t much more you can do than to just use the first 200 bp. If you look at the PLoS ONE paper you’ll see that the error rates spike after 200 bp without any correction.

> When I tried it last night, I set the min to 440 bp. But I was ending up with nothing in my trim and name files. So, this evening, I took one sequence from the fasta set and started playing with it, varying the settings in the trim.seqs command. I had to remove bdiffs, because there are no barcodes. I scrapped the reverse primer. And I had to set pdiffs to 18. And I kept the first 200 bp. Then it worked… Problem is, I may have to go up to 30-32 pdiffs for the full set.

Hmmm, I'm not really sure what you're asking here. There's no reason that you would ever use those values for pdiffs or bdiffs; I suspect that you have the wrong primer or barcode sequences. Could you look at the first few sequences and see whether you can spot the primers?

> One question that occurs to me is what to do with the three separate sets (reads? is my terminology correct?) of 16S sequences I have? They are not barcoded, but at some point I'm going to want to look at community structure, phylogenetics, diversity, etc. So, there must be some way to label and then merge the three…

The only problem would be the groups file, which you can make with make.group.
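As a rough sketch, assuming the three fasta files are called water1.fasta, water2.fasta, and water3.fasta (substitute your real file and group names), that would look something like:

mothur > make.group(fasta=water1.fasta-water2.fasta-water3.fasta, groups=source1-source2-source3)
mothur > merge.files(input=water1.fasta-water2.fasta-water3.fasta, output=water.fasta)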

Good luck…
pat

Ok, I was informed that the forward primer sequence was:
GGACTACAGGGTTTCTAATCCTGTTTGCTCCCAC

The first few sequences begin:
GGACTACCAGGGTATCTAATCCTGTTTGCTCCCCAC
GGACTACCCGGGTATCTAATCCTGTTTGCTACCCAC
GGACTACACGGGTATCTAATCCTGTTCGCTCCCCGC
GGACTACTGGGTTTCTAATCCTGTTTGCTCCCCACG
GGACTACCAGGGTTTCTAATCCTGTTCGCTCCCCGC

etc.

I had assumed that maybe the primers had been designed with a lot of ambiguity. Yeah, 18 out of 36 IS a lot of ambiguity.

How do I use the make.group command?

Thanks,
Mike

Hey Pat,

Ok, got the make.group command. In order to work with the 3 reads, I used the merge.files command to merge them into a single file; is that correct? And presumably, this will still correspond to the group file I created…

Then I used unique.seqs to create a name file. This brought my total number of sequences down from 17,479 to 15,549.
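(That step was nothing fancy; just mothur > unique.seqs(fasta=454reads.fasta), or something very close to that, run on the merged file.)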

I used trim.seqs to get rid of primers. Here, I ran into that primer problem. I assume that, with pdiffs=2, trim.seqs will remove any sequence whose primer mismatches in more than 2 bp? When I set pdiffs=2, I only end up with 1,248 sequences. When I set pdiffs=4, it goes up to 9,567. At pdiffs=10, it goes to 15,462. It would be nice to have across-the-board homology in my primer sequences, but am I wrong in thinking that that is secondary to having maximum true homology in my read sequences? So, unless I'm misunderstanding this procedure, wouldn't I want to keep as many sequences as possible, trimming putative primer sequence and allowing your homology assessment process (alignment) to get rid of false homology statements? What if I were to trim the primers after I do my alignment?
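(For reference, the sort of command I've been running, with the file names abbreviated, looks roughly like:

mothur > trim.seqs(fasta=454reads.fasta, oligos=454reads.oligos, pdiffs=2, keepfirst=200)

followed by summary.seqs on the resulting .trim.fasta file to see how many sequences survive.)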


Thanks, Mike

Hi Pat,

I'm at the pre.cluster command, and I'm getting the following error, which crashes the program:

mothur > pre.cluster(fasta=454reads.unique.trim.unique.good.filter.unique.fasta, name=454reads.unique.trim.unique.good.filter.names, group=454Reads.good.groups, diffs=2)

Using 2 processors.
missing name ICCKWPX01A0463 ... [long list of missing names] ... missing name ICH233E01DLVD4

[ERROR]: Your name file contains 9398 valid sequences, and your groupfile contains 9629, please correct.
[ERROR]: process 0 only processed 1 of 2 groups assigned to it, quitting.

/******************************************/
Running command: unique.seqs(fasta=454reads.unique.trim.unique.good.filter.unique.precluster.fasta, name=454reads.unique.trim.unique.good.filter.unique.precluster.names)
[ERROR]: 454reads.unique.trim.unique.good.filter.unique.precluster.fasta is blank, aborting.
Using 454reads.unique.trim.unique.good.filter.unique.fasta as input file for the fasta parameter.
[ERROR]: 454reads.unique.trim.unique.good.filter.unique.precluster.names is blank, aborting.
/******************************************/

I guess one question would be: when I run that final unique.seqs, is there some way to ensure that the names in the groups file are tracking the changes? Or does my "original sin" lie with my use of the make.group and merge.files commands? Perhaps I should work with the files separately through the clean-up and alignment, and then merge them?

As a P.S., I tried using the get.groups command, which I assumed would make the groups file agree with the names and fasta files, but nope, it didn't work.

Thanks for your time!
Mike

Ok, now I have the correct primer. But when I get to the pre.cluster command, I'm still getting a discrepancy between my name file and my group file, and it's shutting me down. It seems I'm missing something. My final fasta and names files prior to pre.cluster are:

218.a2.a3.trim.unique.unique.good.filter.names
218.a2.a3.trim.unique.unique.good.filter.unique.fasta

Your group file at that point (in the SOP) is:
group=GQY1XT001.shhh.good.groups

Which implies that you've somehow trimmed your group file as you proceeded. How? My group file at that point is still the one I created at the beginning, using the merge.files and make.group commands.
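(My guess, for what it's worth, is that the SOP keeps everything in sync by handing the name and group files to screen.seqs along with the fasta, something like:

mothur > screen.seqs(fasta=218.a2.a3.trim.unique.fasta, name=218.a2.a3.trim.unique.names, group=218.a2.a3.groups, minlength=200)

so that the same sequences get removed from all three files. The file names there are just my guesses at what mine would be at that stage.)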

I DID figure out a work-around: wait to make my groups and merge my reads until just prior to pre-clustering. Problem is, then I have to redo my alignment, filtering, and screening, since the net effect, I'd imagine, is just as though I had created three separate alignments in Sequencher and then brought them together.

Thanks,
Mike