Size of .align file

Dear Users,

Apologies if this topic was discussed before or is on the wrong board.

We recently began generating reads from the v3v5 and v4v6 regions (~450-500 nucleotides); consequently, our sequences are longer and have fewer unique reads (add that we are working with soil). After initial preprocessing (unique.seqs, trim.seqs, screen.seqs and others) we obtained .align files on the order of 5.6 GB to 10 GB. However, no downstream analysis is possible. For example, if we apply the filter.seqs, screen.seqs or summary.seqs commands, we obtain a file with only one sequence (with no error, blue screen or stop). We are using version 1.13.0 on a PC running XP with 4 GB of RAM (which may be the culprit, since XP cannot support additional memory).

Just to clarify: other than getting a new computer that can handle maybe >10 GB of RAM, is there any solution (e.g. merging files) for downstream analysis?


Thanks,
Vicente

Vicente,

First, I don’t believe that your reads are “really” that long… We’re finding that the read quality craps out with about 100-200 bp to go in the sequence. To deal with this I’m suggesting that people use a sliding window trimming procedure in trim.seqs. This should radically drop the number of uniques. For example…

trim.seqs(fasta=stool.fasta, oligos=stool.oligos, qfile=stool.qual, maxambig=0, maxhomop=8, flip=T, bdiffs=1, pdiffs=2, qwindowaverage=35, qwindowsize=50)

This will make sure that every 50-bp window has an average Q-score of at least 35. The paper describing this is in the works.
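To make the sliding window idea concrete, here is a minimal Python sketch of that kind of check. It is not mothur's code: the function name, the choice to cut the read at the start of the first failing window, and the handling of reads shorter than one window are all assumptions made for illustration, and trim.seqs may apply a different rule.

```python
def sliding_window_trim(seq, quals, window_size=50, min_average=35):
    """Truncate seq at the first window whose mean quality is below min_average.

    Hypothetical helper for illustration; mothur's exact rule may differ.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    if len(seq) < window_size:
        # Assumption: a read shorter than one window is judged as a single window.
        passes = bool(quals) and sum(quals) / len(quals) >= min_average
        return seq if passes else ""
    # Keep a running sum so each window costs O(1) to evaluate.
    window_sum = sum(quals[:window_size])
    for start in range(len(quals) - window_size + 1):
        if start > 0:
            # Slide by one base: add the new last score, drop the old first score.
            window_sum += quals[start + window_size - 1] - quals[start - 1]
        if window_sum / window_size < min_average:
            # Assumption: cut at the start of the first failing window.
            return seq[:start]
    return seq


# Small example with a 4-base window so the behaviour is easy to follow.
read = "ACGTACGTAC"
scores = [40, 40, 40, 40, 40, 40, 10, 10, 10, 10]
trimmed = sliding_window_trim(read, scores, window_size=4, min_average=35)
```

In this example the first window to include a low score starts at position 3, so only the first three bases survive. With real 454 reads the window and threshold would be the qwindowsize and qwindowaverage values given in the command above.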

However, filter.seqs shouldn’t have a problem with large data files since it only stores one sequence at a time. I suspect the problem you would encounter would be at read.dist or cluster and then you would probably need a real computer :). We’ll also be posting a 64-bit Windows version soon.
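As an illustration of what "one sequence at a time" means, the following Python sketch reads a FASTA-formatted stream record by record and counts the records without loading the whole file. It is only a sketch, not how mothur is implemented; the names iter_fasta and count_records are made up here. Something like it could be used to check how many sequences a large .align file actually contains.

```python
import io


def iter_fasta(handle):
    """Yield (name, sequence) pairs, keeping only one record in memory."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            # Sequence lines may be wrapped, so collect them until the next header.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_records(handle):
    """Count records in an open FASTA stream without storing them."""
    return sum(1 for _ in iter_fasta(handle))


# Tiny in-memory example; for a real file, pass open("yourfile.align") instead.
sample = io.StringIO(">seq1\nAC-GT\n--ACG\n>seq2\nACGT-\n")
records = list(iter_fasta(sample))
sample.seek(0)
total = count_records(sample)
```

Memory use depends on the longest single record rather than on the file size, which is why a step built this way should cope with multi-gigabyte alignments.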

Pat

Hi Pat,

Thanks for the reply. I forgot to mention that I obtained the reads from VAMPS. The files contained reads with an average of 420-450 nucleotides (for both v3v5 and v4v6). Since I downloaded them from the “trimmed FASTA sequences” section, I assumed (and in some cases tested) that they had already removed the primers, tags and ambiguous bases, flipped the reads, etc. The only preprocessing steps that I performed were maxlength=450, maxhomop=10 and unique.seqs. That reduced the number of sequences from 170,000 to 120,000. In addition, I cannot perform the quality/window step since I do not have access to the .qual file.

Well, I’m able to perform all preprocessing steps, even the alignment, but when I run the filter.seqs step on the .align file it only produces a fasta file with one sequence (this also occurs with the summary step).

In addition, I repeated the process with the precluster step added (which reduced the set to 97,000 reads) and aligned again, but it still only produced a fasta file with one sequence.


I took a subsample of 30,000 sequences from the same file, performed all the initial steps, aligned them, and was able to continue with all downstream analysis (filter, summary, dist.seqs, cluster, etc).
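The post does not say how the 30,000 sequences were drawn. For anyone wanting to repeat a test like this, one possible approach is reservoir sampling, which picks a fixed number of records at random in a single pass without loading the whole file. The sketch below is an assumption about how it could be done, not the method actually used; the function name subsample_records is hypothetical.

```python
import random


def subsample_records(records, k, seed=None):
    """Return k records chosen uniformly at random from an iterable (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Record i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


# Example with made-up (name, sequence) pairs standing in for FASTA records.
all_records = ((f"read{i}", "ACGT") for i in range(1000))
subset = subsample_records(all_records, 30, seed=42)
```

Because only the reservoir is kept in memory, the same function would work on a stream of records read from a very large file; if the input has fewer than k records, all of them are returned.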

I’m aware that the bottleneck for the analysis (using our computers) will probably be the dist.seqs or the cluster step, but having the alignment in a FASTA file is an advantage.

Thanks, and keep up the good work,
Vicente

V - feel free to send us the fasta file and I can take a quick look. You might also try to get the original fasta and qual files from MBL.

Pat