Can't open LookUp_Titanium.pat.

Hi, Pat,

I am running the shhh.flows command on a new dataset, but it cannot open LookUp_Titanium.pat. I have analyzed several datasets with this command before and it worked just fine. The LookUp_Titanium.pat file is in the same folder as mothur, and the file looks fine when I open it in TextEdit. I am running this on a Mac.

mothur > shhh.flows(file=hui/hui.flow.files)
Unable to open LookUp_Titanium.pat. Trying mothur’s executable location /Users/mpat-group/hui/LookUp_Titanium.pat

Many thanks.

Hui

The only explanation would be that it isn’t really where you think it is or that you aren’t where you think you are in the file system.
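
If it turns out the file really is somewhere else, you can also point mothur at it directly. Here is a minimal sketch, assuming a mothur version that supports the lookup and tempdefault options (the /path/to/ locations are placeholders):

mothur > shhh.flows(file=hui/hui.flow.files, lookup=/path/to/LookUp_Titanium.pat)
mothur > set.dir(tempdefault=/path/to/lookup/folder)

The first hands shhh.flows the file’s location explicitly; the second gives mothur a default folder to search whenever it can’t find a reference file next to the executable.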

Hi, Pat,

I really have no idea what was wrong with the folder, but anyway, it works again. Thanks.

Still one more question about the following steps: after shhh.flows, the number of reads dropped drastically, and it dropped again after trim.seqs and unique.seqs. Is this because the reads are from fungi, or is it just bad sequence quality? For bacteria the drop after shhh.flows is not so drastic…

  1. before shhh.flows:

mothur > summary.seqs(fasta=hui/hui.fasta)

            Start  End      NBases   Ambigs     Polymer  NumSeqs
Minimum:    1      39       39       0          2        1
2.5%-tile:  1      51       51       0          3        10790
25%-tile:   1      239      239      0          4        107899
Median:     1      325      325      0          5        215798
75%-tile:   1      364      364      0          5        323697
97.5%-tile: 1      440      440      0          8        420806
Maximum:    1      769      769      35         31       431595
Mean:       1      299.108  299.108  0.0249841  4.81248

# of Seqs: 431595

  2. after trim.flows and shhh.flows:

mothur > summary.seqs(fasta=hui/hui.shhh.trim.fasta, name=hui/hui.shhh.trim.names)

            Start  End      NBases   Ambigs  Polymer  NumSeqs
Minimum:    1      210      210      0       3        1
2.5%-tile:  1      236      236      0       3        464
25%-tile:   1      245      245      0       4        4638
Median:     1      248      248      0       4        9276
75%-tile:   1      248      248      0       5        13913
97.5%-tile: 1      250      250      0       5        18087
Maximum:    1      250      250      0       8        18550
Mean:       1      245.623  245.623  0       4.1683

# of unique seqs: 1192

total # of seqs: 18550
Output File Names:
hui/hui.shhh.trim.summary

  3. after unique.seqs:

mothur > summary.seqs(fasta=hui/hui.shhh.trim.unique.fasta, name=hui/hui.shhh.trim.unique.names)

            Start  End      NBases   Ambigs  Polymer  NumSeqs
Minimum:    1      210      210      0       3        1
2.5%-tile:  1      236      236      0       3        464
25%-tile:   1      245      245      0       4        4638
Median:     1      248      248      0       4        9276
75%-tile:   1      248      248      0       5        13913
97.5%-tile: 1      250      250      0       5        18087
Maximum:    1      250      250      0       8        18550
Mean:       1      245.623  245.623  0       4.1683

# of unique seqs: 679

total # of seqs: 18550

Output File Names:
hui/hui.shhh.trim.unique.summary

Many thanks.

So to recap…

Before trim.flows: 431595 total sequences
After shhh.flows: 18550 total sequences / 1192 unique sequences
After unique.seqs: 18550 total sequences / 679 unique sequences

The shhh.flows/unique.seqs numbers look consistent. shhh.flows outputs the unique denoised sequences within each group, and unique.seqs then uniques across groups, so your total number of sequences after shhh.flows and after unique.seqs is the same. shhh.flows does not remove sequences, so you are going from 431595 to 18550 sequences in the trim.flows step. Either your primer/barcode sequences are messed up or your flowgrams are too short. If you look at the scrap.flow file you will see codes indicating why your sequences are getting discarded. Look there for your answer…
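
One quick way to skim those codes without leaving mothur is the system command. A minimal sketch, assuming trim.flows followed its usual naming for your files (swap in whatever scrap file name your run actually produced):

mothur > system(head -20 hui/hui.scrap.flow)

Each scrapped read is tagged with the reason it was removed (e.g. a barcode/primer mismatch or a too-short flowgram), so a glance at the file shows which filter is eating your sequences.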

Pat

Great, thanks Pat!

You are right that the flowgrams are too short. I should have noticed that my sequences are much shorter (about 200-250 bp for ITS2). Do you think it is reasonable to set 150-450 for minflows & maxflows?

Hui

Well, whatever you pick, minflows needs to be the same as maxflows or the denoising won’t really work.

Pat

Hi, Pat,

What do you mean by “minflows needs to be the same as maxflows”? Does that mean I cannot set a range of flowgram lengths, such as between 150 (min) and 450 (max)?

What would you suggest for the flowgram settings in this case, since too few sequences are left at 450?

Many thanks. Hui

Take a look at Figure 1 of http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0027310. See the column that says “shhh.flows (360-720)”? That shows the error rate using minflows=360, maxflows=720. You’ll notice that the columns to the right of it show a much better error reduction when minflows and maxflows have the same value. So while you can set them to different values, it’s really a bad idea.
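
In practice that means picking a single flow count and giving it to both parameters. A minimal sketch, assuming your flow and oligos files follow the naming from your earlier commands; the value 450 is only an illustration (minflows/maxflows count flows, not base pairs, and on Titanium data 450 flows comes out to roughly 250 bp of sequence):

mothur > trim.flows(flow=hui/hui.flow, oligos=hui/hui.oligos, minflows=450, maxflows=450)

Whatever value you choose, it should be one that most of your flowgrams actually reach, since reads with fewer flows than minflows get scrapped.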

I’ve been thinking about the minflows-maxflows requirement for some time, and I just don’t get why they have to be the same size to work.

(Yes, yes, I get that’s what the data show.) But WHY does this happen? What is the probability of getting most of your sequences to be a minimum size of 450 and a maximum size of 450?

This is of course presuming that the minflow/maxflow size refers to the length in bp of the sequence…

Does this step trim longer sequences to be that size and exclude smaller sequences?


(And if so, can’t you just take care of this after you align?)

Here’s what I think is the reason… If the reads vary in length between 360 and 720 flows, then some of the 360-flow reads come from the same templates as the 720-flow reads; you just don’t have the rest of those data. So when the distances are calculated between sequences, they may be treated differently because of their difference in length (the same reason we make all our 16S sequences the same length before clustering). So they don’t get denoised together, and the denoising doesn’t work as well because the errors accumulate in the longer reads. And yes, trim.flows trims the longer flowgrams down to maxflows and scraps the ones shorter than minflows; you can’t just take care of it after you align, because the denoising happens in flow space before you ever have sequences to align.