make.contigs() segmentation fault candidate seqs too long

To whom it may concern,

Background
I am interested in seeing how the count of OTUs and taxonomic resolution change as the quality trimming from the end of sample reads changes. I speculate that more trimming from the read ends will result in more spurious OTUs and lower taxonomic resolution.

I wanted to do this with data that I know should theoretically have the same sequences (and assumptions of how those sequences were generated) as my reference database. Since I typically have used Green Genes, I have randomly sampled sequences from Green Genes and mapped quality scores that drop n nucleotides from the end of both the forward and reverse read. I will iterate across n=0 through n=N, where N is the length of the longest sequence in Green Genes. Sequences shorter than n will be removed from downstream analysis.
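To make the quality-mapping step concrete, here is a minimal sketch of how such scores could be generated. It assumes that "drop" means assigning a low Phred score to the last n positions of a read. The function names and the `high`/`low` score values are hypothetical and are not taken from my actual script.

```python
def drop_tail_quality(length, n, high=40, low=2, offset=33):
    """Build a Phred+33 quality string whose last n positions are low quality.

    high and low are illustrative Phred scores (assumptions, not real data).
    """
    if n < 0 or n > length:
        raise ValueError("n must be between 0 and the read length")
    return chr(high + offset) * (length - n) + chr(low + offset) * n


def simulate_qualities(seqs, n):
    """Map a quality string onto every sequence in a {name: sequence} dict.

    Sequences shorter than n are removed from downstream analysis.
    """
    return {
        name: drop_tail_quality(len(seq), n)
        for name, seq in seqs.items()
        if len(seq) >= n
    }
```

Calling `simulate_qualities` once per value of n gives one set of quality strings per iteration, with the too-short sequences already filtered out.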

Description of Problem
I have encountered an issue with make.contigs(): it prints several warnings of the form "One of your candidate sequences is longer than you longest template sequence. Your longest template sequence is 1000. […]" and then terminates with a Segmentation fault.

Question
Does this indicate that make.contigs() will make template sequences no longer than 1000 nt long even when candidate sequences are longer?

mothur > make.contigs(file=stability7.files, processors=8)

Using 8 processors.
Reading fastq data...
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
Done.

>>>>>   Processing insilico_sample7_100000seqs_drop0_L001_R1_001.variable_cut.0ffastatemp (file 1 of 1) <<<<<
Making contigs...
One of your candidate sequences is longer than you longest template sequence. Your longest template sequence is 1000. Your candidate is 1366.
One of your candidate sequences is longer than you longest template sequence. Your longest template sequence is 1000. Your candidate is 1353.
One of your candidate sequences is longer than you longest template sequence. Your longest template sequence is 1000. Your candidate is 1355.
One of your candidate sequences is longer than you longest template sequence. Your longest template sequence is 1000. Your candidate is 1358.
One of your candidate sequences is longer than you longest template sequence. Your longest template sequence is 1000. Your candidate is 1467.
One of your candidate sequences is longer than you longest template sequence. Your longest template sequence is 1000. Your candidate is 1504.
One of your candidate sequences is longer than you longest template sequence. Your longest template sequence is 1000. Your candidate is 1390.
One of your candidate sequences is longer than you longest template sequence. Your longest template sequence is 1000. Your candidate is 1453.
Segmentation fault

I’m not 100% sure what you’re trying to do with make.contigs. This command is designed to take paired reads and assemble them into a single sequence. What technology is currently able to generate a 1000 nt contig? I suspect you want trim.seqs, which should be able to do what you want.

Pat

Thank you for the response, Pat.

For a little more background: what I am trying to do comes from my reading of http://blog.mothur.org/2014/09/11/Why-such-a-large-distance-matrix/. Specifically, I am interested in observing the effect of read overlap on the number of spurious OTUs, using a customization of http://www.mothur.org/wiki/MiSeq_SOP with Green Genes as the reference database and mock data derived from Green Genes.

I am putting Green Genes sequences through the mothur MiSeq SOP because, if I run a reference Green Genes sequence through the pipeline while using Green Genes as the pipeline's reference database, an ideal output should include an OTU matching the taxonomy of that reference sequence. Of course, the sequences in Green Genes are not paired and do not have quality scores. To get around this without touching my pipeline script, I figured I could treat the reverse complement of each sequence as the reverse read and map quality scores onto them.
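The pairing trick can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, both reads share the same header (the header format my real files use may differ), and ambiguity codes such as N are passed through uncomplemented.

```python
# Translation table for complementing DNA/RNA bases; other characters
# (for example N or gaps) are left unchanged by str.translate.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def paired_fastq_records(name, seq, qual_fwd, qual_rev):
    """Return (forward, reverse) FASTQ records for one reference sequence.

    The forward read is the sequence itself; the reverse read is its
    reverse complement. Both quality strings must match the read length.
    """
    if len(qual_fwd) != len(seq) or len(qual_rev) != len(seq):
        raise ValueError("quality strings must be as long as the sequence")
    fwd = f"@{name}\n{seq}\n+\n{qual_fwd}\n"
    rev = f"@{name}\n{reverse_complement(seq)}\n+\n{qual_rev}\n"
    return fwd, rev
```

Because both reads span the full reference sequence, each read is as long as the reference itself, which is consistent with the candidate lengths above 1000 reported in the log.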

I don’t know of any sequencing technology that can generate a 1000 nt contig, and I am not exactly sure how Green Genes has such long sequences (de novo reconstruction?). I’ve looked at a kernel density estimate of the Green Genes sequence lengths, and they tend to be longer than 1000 nt.

I agree that using a function like trim.seqs() would shorten my sequences, and I think that the intended usage of make.contigs() is a sensible one. What I think I shall do is look at an MSA of Green Genes to identify a subinterval that is short enough for the corresponding subsequences to be used in make.contigs(), yet variable enough to be informative for taxonomy.
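One way I might pick that subinterval is to score each alignment column by Shannon entropy and slide a fixed-width window across the MSA. The sketch below is an assumption about how to do this, not an established mothur procedure. The function names are hypothetical, and the alignment is taken to be a list of equal-length strings with `-` or `.` as gap characters.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps."""
    residues = [c for c in column.upper() if c not in "-."]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((k / total) * math.log2(k / total) for k in counts.values())


def most_variable_window(alignment, width):
    """Return (start_column, mean_entropy) of the most variable window.

    Ties are resolved in favour of the leftmost window.
    """
    ncols = len(alignment[0])
    if any(len(row) != ncols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    if not 0 < width <= ncols:
        raise ValueError("width must be between 1 and the alignment length")
    entropies = [
        column_entropy("".join(row[i] for row in alignment))
        for i in range(ncols)
    ]
    window_sum = sum(entropies[:width])
    best_start, best_sum = 0, window_sum
    for start in range(1, ncols - width + 1):
        # Slide the window one column to the right.
        window_sum += entropies[start + width - 1] - entropies[start - 1]
        if window_sum > best_sum + 1e-12:
            best_start, best_sum = start, window_sum
    return best_start, best_sum / width
```

Note that the window width is measured in alignment columns, so the ungapped subsequences cut from it can be shorter than `width`; the lengths would still need checking before running make.contigs().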

Green Genes has long sequences because they come from Sanger sequencing data.