some puzzles of the command "make contigs"

Hi dear teacher ,

I have some questions for help about the command “make contigs”

  1. I want to know how much the value of default quality in the command,when the spliced bases are not ambiguous.

  2. What is the file format of the “fqfile and rqfile”

  3. Recently, I tested to splice a sequence with a overlap of 5-10bp, and the result was successful. So I want to know how much bp of a overlap at least or default, when using this command?

  4. When there was no overlap between the forward and reverse of the sequence, the spliced results of “make contiges” were the reverse sequence result in the front of the forward sequence result. I can not understand this phenomenon, so I want to know why?

Thank you very much! Have a good day. :smiley:

  1. I want to know how much the value of default quality in the command,when the spliced bases are not ambiguous.

Please see http://www.ncbi.nlm.nih.gov/pubmed/23793624

  1. What is the file format of the “fqfile and rqfile”

Those are quality score files. They are similar to a fasta file where instead of putting in the DNA sequence we have the phred quality scores separated by spaces.

  1. Recently, I tested to splice a sequence with a overlap of 5-10bp, and the result was successful. So I want to know how much bp of a overlap at least or default, when using this command?

If you look at the above referenced paper, you’ll see that if you want decent data, you need to have fully overlapping reads.

  1. When there was no overlap between the forward and reverse of the sequence, the spliced results of “make contiges” were the reverse sequence result in the front of the forward sequence result. I can not understand this phenomenon, so I want to know why?

I’d be curious to see the sequences where this happens. Of course, I’m not sure why you’d want to deal with reads that do not (fully) overlap.

Pat

Hi Pat:

Thank you for your paper! I will read it seriously.


4. When there was no overlap between the forward and reverse of the sequence, the spliced results of "make contiges" were the reverse sequence result in the front of the forward sequence result. I can not understand this phenomenon, so I want to know why?

I’d be curious to see the sequences where this happens. Of course, I’m not sure why you’d want to deal with reads that do not (fully) overlap.


My study was the microdiversity of environment(for example:soil) based on the 16Sr DNA V1-V3 through 2*300 illumina miseq, and there exist the length over 600bp to some species. When I deal with my sequencing results, I find the reverse sequence result in the front of the forward sequence result.

For example:
the forward sequence:
AGAGTTTGATCCTGGCTCAGATTGAACGCTGGCGGCAGGCTTAACACGTGCAAGTCGAACGGTAACAGAAGACCTTGCTCTTGCTGACGAGTGGCGGACGGGTGAGTAATGTATGGGGATCTGCTGAATGGTGGGGGACAACAGTTGGAAACGACTGCTAATACCGATTAATGTCGTAAGACCAAAGCGTGGGACTTTCGGGACACGTACAATTTGATGAACCCATATGGGATTAGCTAGTAGGTGGGGTAATGGCACACCTAGGCGACGATCTCTAACAGATATGAGAGGATGACAAGCA

the reverse sequence:
TTGCCGCGGCTGCTGGCATGGAGTTAGGCAGTGCATCTCATGTAATTAACGTCAATTAACAACCCTGTTACAATTGTTATATTCCTCATTACCGCAAGTATTTTACAATCCCAAAGTCTTCTTCATACACCCGGCATGGCTGAAGGCGGGTTTCTACCATTGTGCAATATTACTCACTGCTGCACAACGTAGGAGTGAGGACAGTGGACAAGTAACATAGCAGCTGGTAAACAAAGCAAGCCAAAAAGAACACCTCGGACAAGCAAACCAGTAACCAAACAACAAACCAAACGCCAAACGA

after the make.contigs:
the reverse complementary of the reverse sequencing results in front of the forward sequencing results
TCGTTTGGCGTTTGGTTTGTTGTTTGGTTACTGGTTTGCTTGTCCGAGGTGTTCTTTTTGGCTTGCTTTGTTTACCAGCTGCTATGTTACTTGTCCACTGTCCTCACTCCTACGTTGTGCAGCAGTGAGTAATATTGCACAATGGTAGAAACCCGCCTTCAGCCATGCCGGGTGTATGAAGAAGACTTTGGGATTGTAAAATACTTGCGGTAATGAGGAATATAACAATTGTAACAGGGTTGTTAATTGACGTTAATTACATGAGATGCACTGCCTAACTCCATGCCAGCAGCCGCGGCAAGAGTTTGATCCTGGCTCAGATTGAACGCTGGCGGCAGGCTTAACACGTGCAAGTCGAACGGTAACAGAAGACCTTGCTCTTGCTGACGAGTGGCGGACGGGTGAGTAATGTATGGGGATCTGCTGAATGGTGGGGGACAACAGTTGGAAACGACTGCTAATACCGATTAATGTCGTAAGACCAAAGCGTGGGACTTTCGGGACACGTACAATTTGATGAACCCATATGGGATTAGCTAGTAGGTGGGGTAATGGCACACCTAGGCGACGATCTCTAACAGATATGAGAGGATGACAAGCA

So I am queer about this phenomenon.

Meiling

If you take any two sequences they will align however poorly. I think that is what is happening here. I really question whether you can get useful data out of sequencing the V13 regions with the 300PE kit. I suspect your error rate is through the roof.

Pat

thak you for your reply.

I analyzed the silva database using the primer 8f and 533r which amplified the 16s rDNA V1-3 by the “pcr.seq” command. I found that the length of most of sequents was less than 600bp, that is why I ask you how length of the overlap at least when using “make.contigs” .

You really need the reads to fully overlap to generate usable sequence data for use with OTUs. If you have a >500 bp fragment and 300 PE chemistry, the reads will not fully overlap. With the 300 PE kit and the 250 PE kit, you need 300 and 250 bases, respectively, of overlapping sequence data.

Hi Pat
Thank you for your reply. I think I know what you said.
Now , I have another problem about my data analysis. The dist file was too large(200G) after the “dist.seqs” command, but my unique sequence was only 100 000.
It took a long time to cluster for OTU. So could you help me to reduce the dist file. Which parameter should I add to the command.
My command is
dist.seqs(fasta=total.final.fasta,cutoff=0.01,countends=F,processors=24)

Are there other reasons to cause the dist file too large and waht should I do.

Meiling

You really need the reads to fully overlap to generate usable sequence data for use with OTUs. If you have a >500 bp fragment and 300 PE chemistry, the reads will not fully overlap. With the 300 PE kit and the 250 PE kit, you need 300 and 250 bases, respectively, of overlapping sequence data.

That is my point. If the reads do not fully overlap you will get data with a very high error rate leading to the generation of many spurious sequences and large distance matrices. You likely only option is to use classify.seqs and to pursue the phlylotype-based approach.

Hi Pat:
When I process my data in mothur, I find there are lots of problems. I am very glad that you can help me.
Now the illumina platform generates large number of data, so it will take long time to process. When using the cluster command (the dist file 155G), the result was “It took 7 seconds to cluster” and the “total.final.fn.list” file was only “unique”. The RAM of my servers upgrades from 100G to 288G, but the result is as before. The “hcluster” command can do this well while it take a long time. So I want to know how much “dist file” can be processed with the RAM of 288G?

I"m not sure how large of a distance matrix can fit into 288 GB, I’ve never had a 155 G distance matrix and we’ve done analyses with thousands of samples. The problem is really the error rate - it would probably be cheaper to do the sequencing “right” than to buy additional RAM or to sit around and wait for the clustering to complete (weeks?) if you have enough RAM.

Thank you for your suggession. I also want to reduce the error rate.
I used the primer 8F and 533R to analysis the bacteria V1-V3.

After I aligned with the silva database, the summary file was

Start End NBases Ambigs Polymer NumSeqs
Minimum: 0 0 0 0 1 1
2.5%-tile: 1 6 1 0 1 16733
25%-tile: 8 12862 21 0 3 167330
Median: 8 12862 526 0 5 334660
75%-tile: 9 12862 531 0 5 501989
97.5%-tile: 42274 42276 541 0 7 652586
Maximum: 42284 42284 595 0 300 669318
Mean: 5955.41 15305.2 386.185 0 4.27128

of unique seqs: 469808

total # of seqs: 669318

Next, I will use screen.seq to remove some sequences. According to the summary result, I cannot determine the start and end position. Now I used “start=1000,end=12862”, and this could not remove many of sequences. Can you give me some proposal? Or before that how can I process the sequences?

I think we’re missing something in communication. Your problem is that you have a high error rate. You really can’t do anything about that now. To reduce the error rate, the reads must fully overlap.