80 samples help

Hi
I hope you are well

I have 80 samples to analyze with mothur.
This is big data, and I have only 7 processors to work with. What do you suggest to speed up the analyses? dist.seqs will take forever; what should I do?

Thank you for any help

Hi there,

You’re not giving me much to go off of here. 80 samples isn’t necessarily “big data”. What region are you sequencing? How far into the pipeline have you gotten? You might want to give this a read…

Pat

Thank you, Pat. I read it several years ago. The quality of the sequences is fine, but yes, the overlap issue may be affecting the overall data.

Now I have run into another problem with this data set, this time with the alignment in mothur. I got a huge number of unclassified sequences, which reduced my data from 600k to 70k.
I was very suspicious about that, so I took some of the sequences that came out as unclassified and ran BLAST on them. Of 10, 3 indeed did not match any sequence in the database, but the others had close relatives at 80% and 70%.

I looked at the alignment and noticed that for those particular sequences it was really, really bad, and that is why they had not been classified.

Is this a known problem in mothur? Is there any parameter I can change for that? Thank you

Hello,

I don’t know if this is good advice or not. That said, I work with environmental water samples, and it is not rare for a huge part of my sequences to come out as unclassified or ‘unknown,’ which I later remove with the ‘remove.lineage’ command (the average number of unclassified sequences is always around 200K). I also work with a lot of samples, around 50-100.

I also ran into problems creating OTUs from my data (and we have a very decent computer in the lab). I found 2 solutions: (1) increase the ‘diffs’ number in the ‘pre.cluster’ step, or (2) create ASVs instead of OTUs (which is what I do). I know there is a whole battle between ASVs and OTUs, but in my field of research, ASVs are accepted. Be aware that you should know what question you want to answer with your data and decide whether ASVs make sense for your problem.
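To make option (1) a bit more concrete, here is a rough Python sketch of the general idea behind abundance-ranked pre-clustering: sequences are ranked by abundance, and each rarer sequence is folded into the first more abundant one that lies within `diffs` mismatches. This is only a simplified illustration and not mothur's actual implementation. The names `hamming` and `precluster_sketch` and the toy sequences are made up for the example, and it assumes the sequences are already aligned to the same length.

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def precluster_sketch(counts, diffs=2):
    """Merge rarer sequences into more abundant ones within `diffs` mismatches.

    counts: dict mapping aligned sequence -> abundance (hypothetical input).
    Returns a new dict of surviving sequences with merged abundances.
    """
    # Most abundant first; ties broken alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    merged = {}
    for seq in ordered:
        for kept in merged:
            if hamming(seq, kept) <= diffs:
                # Fold the rarer sequence's reads into the abundant one.
                merged[kept] += counts[seq]
                break
        else:
            # No abundant neighbour close enough: keep it as its own sequence.
            merged[seq] = counts[seq]
    return merged


toy = {
    "ACGTACGTAC": 100,
    "ACGTACGTAA": 3,   # 1 mismatch from the abundant sequence
    "ACGAACGTTC": 2,   # 2 mismatches from the abundant sequence
    "TTTTACGTAC": 5,   # 3 mismatches, stays separate when diffs=2
}
print(precluster_sketch(toy, diffs=2))
```

With the toy data and `diffs=2`, the two rare near-identical sequences fold into the abundant one (105 reads in total) and the 3-mismatch sequence stays on its own. Raising `diffs` leaves fewer unique sequences for the distance step, at the cost of merging sequences that may be genuinely different.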

You can also start with ASVs to understand your data and, later on, group them further into OTUs. For example, you could start with ASVs, look at your data, and decide which ASVs are low abundance (if it is 1 sequence in 80 samples, I don’t think it’ll do any harm to remove it). Then eliminate these low-abundance sequences and try clustering into OTUs again.
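As a rough illustration of that filtering step, here is a minimal Python sketch that drops sequences whose total abundance across all samples falls below a threshold. The nested-dict layout, the function name `drop_rare`, and the `min_total` parameter are all hypothetical stand-ins for a real count table. Inside mothur itself, the `split.abund` command is, as far as I know, the tool meant for separating rare from abundant sequences. Whether removing rare sequences is a good idea at all is a separate question.

```python
def drop_rare(count_table, min_total=2):
    """Keep sequences whose total abundance across all samples is >= min_total.

    count_table: dict mapping sequence name -> dict of sample name -> count
    (a hypothetical in-memory stand-in for a count table).
    """
    kept = {}
    for name, per_sample in count_table.items():
        if sum(per_sample.values()) >= min_total:
            kept[name] = per_sample
    return kept


toy_counts = {
    "seq1": {"sampleA": 40, "sampleB": 12},
    "seq2": {"sampleA": 1, "sampleB": 0},   # 1 read in total: dropped
    "seq3": {"sampleA": 0, "sampleB": 2},
}
filtered = drop_rare(toy_counts, min_total=2)
print(sorted(filtered))
```

Here `seq2` is the "1 sequence in all samples" case and is the only one removed; `seq1` and `seq3` would go on to clustering.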

I hope this helps or makes sense. Again, I don’t know if it is good advice or not, but it works for us. I’m also here to learn and to hear everyone’s opinion on this method, because it is not the best.

Good luck!!!

Hi, thank you for replying. I have been working with quite different environments, and this is a very particular one, with a low microbial quantity to start with. But my concern now is that I can find close relatives of my sequences with BLAST, yet they fail to classify in mothur because the mothur alignment was poor. That worries me.
I will try other alignment parameters and see if that does the trick.

Moreover, I already increase diffs by default, since I am using V4-V5 for this particular set and mismatches tend to occur.
I don’t want to use ASVs because of those mismatches, which would artificially increase the diversity. And at the end I always remove singletons, doubletons, and tripletons.

Did you remove low-abundance sequences before making OTUs? When you remove a singleton, it is usually after making OTUs. What if you remove the ‘1 sequence in 1 sample’ before clustering? This could further eliminate noise and make it easier to create OTUs. Is this possible? If yes, how could it be done?

Just to be clear… I do not advocate removing rare sequences (however defined) and do not advocate removing “unclassified” sequences.

V4-V5 sequences will give you horrible quality on the assembled reads because they do not fully overlap. The 2x300 chemistry is still bad, with error rates climbing after ~500 total nucleotides. This will result in inflated numbers of unique sequences/ASVs/OTUs. This is why I posted the blog post. This is not a mothur problem. This is a MiSeq problem.
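The overlap arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the amplicon lengths used below (about 253 bp for V4 and about 400 bp for V4-V5) are assumed round numbers and depend on the primers, and the function name `overlap_stats` is made up. It also assumes untrimmed reads that start at opposite ends of the amplicon.

```python
def overlap_stats(amplicon_len, read_len):
    """Return (overlap_bp, fraction of the amplicon covered by both reads).

    Assumes paired reads start at opposite ends of the amplicon and are
    not quality-trimmed (an idealised, hypothetical setup).
    """
    # A read cannot extend past the far end of the amplicon.
    effective = min(read_len, amplicon_len)
    overlap = max(0, 2 * effective - amplicon_len)
    return overlap, overlap / amplicon_len


# Assumed, illustrative amplicon lengths; real values depend on the primers.
for region, amplicon_len, read_len in [("V4", 253, 250), ("V4-V5", 400, 300)]:
    overlap, fraction = overlap_stats(amplicon_len, read_len)
    print(f"{region}: {overlap} bp overlap, {fraction:.0%} of positions read twice")
```

Under these assumptions, nearly every V4 position is read twice, so the two reads can correct each other, while only about half of a 400 bp V4-V5 amplicon is. The positions covered by a single read are also the low-quality ends of each read once quality trimming is taken into account.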

Pat