After the pre.cluster, chimera.vsearch, remove.seqs, classify.seqs, and remove.lineage commands, I have about 770,000 unique sequences out of roughly 4,902,143 total sequences. I am analyzing Ion Torrent data, which is single-end (no paired reads).
For reference, the commands I have used so far are below:
fastq.info(fastq=R_2021_10_05_12_00_01_user_S51010052021_Chip.fastq)
summary.seqs(fasta=R_2021_10_05_12_00_01_user_S51010052021_Chip.fasta, processors=8)
trim.seqs(fasta=R_2021_10_05_12_00_01_user_S51010052021_Chip.fasta, oligos=CXMicroMothurOligos.txt, maxambig=0, maxhomop=6, bdiffs=0, pdiffs=0, minlength=265, keepfirst=310, flip=F, processors=8)
summary.seqs(fasta=current, processors=8)
get.current()
unique.seqs(fasta=current)
get.current()
summary.seqs(fasta=current, name=current, processors=8)
count.seqs(name=current, group=current)
get.current()
#pcr.seqs(fasta=silva.nr_v138_1.align,start=11895,end=25318,keepdots=F,processors=32)
#rename.file(input=silva.nr_v138_1.pcr.align,new=silva.v4.fasta)
#summary.seqs(fasta=silva.v4.fasta)
#get.current()
align.seqs(fasta=current, reference=silva.v4.fasta, processors=2)
summary.seqs(fasta=current, count=current, processors=8)
get.current()
summary.seqs(fasta=current, count=current, processors=8)
screen.seqs(fasta=current, count=current, summary=current, start=1967, optimize=end, criteria=95, processors=8)
summary.seqs(fasta=current, count=current, processors=8)
filter.seqs(fasta=current, vertical=T, trump=.)
summary.seqs(fasta=current, count=current, processors=8)
unique.seqs(fasta=current, count=current)
summary.seqs(fasta=current, count=current, processors=8)
get.current()
pre.cluster(fasta=current, count=current, diffs=2)
summary.seqs(fasta=current, count=current, processors=8)
chimera.vsearch(fasta=current, count=current, dereplicate=t)
remove.seqs(accnos=current, fasta=current)
summary.seqs(fasta=current, count=current, processors=8)
classify.seqs(fasta=current, count=current, template=silva.nr_v138_1.align, taxonomy=silva.nr_v138_1.tax, cutoff=80, processors=8)
remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Eukaryota)
summary.seqs(fasta=current, count=current, processors=8)
summary.tax(taxonomy=current, count=current)
get.current()
rename.file(fasta=current, count=current, taxonomy=current, prefix=10052021)
dist.seqs(fasta=current, cutoff=0.03)
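The dist.seqs step at the end of this pipeline is usually where resource needs explode: with N unique sequences, there are N*(N-1)/2 pairwise comparisons. A rough back-of-the-envelope sketch (the ~770,000 unique-sequence count is from my data above; the bytes-per-pair figure and the fraction of pairs kept under the cutoff are illustrative assumptions, not Mothur output):

```python
# Rough estimate of dist.seqs output size for N unique sequences.
# Assumptions (illustrative only): ~30 bytes per line in a column-format
# distance file (two sequence names plus a distance), and that the
# cutoff=0.03 filter keeps about 1% of all pairs.

def dist_matrix_estimate(n_unique, bytes_per_pair=30, kept_fraction=0.01):
    pairs = n_unique * (n_unique - 1) // 2      # all pairwise comparisons
    kept = int(pairs * kept_fraction)           # pairs written under the cutoff (assumed)
    return pairs, kept * bytes_per_pair / 1e9   # (total pairs, estimated GB on disk)

pairs, gb = dist_matrix_estimate(770_000)
print(f"{pairs:,} pairwise comparisons, roughly {gb:.0f} GB if 1% pass the cutoff")
```

Even under these generous assumptions the numbers get large quickly, which is the kind of concrete figure a cloud rep can size a machine around.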
I have been struggling to understand the computing power and storage I need, since I am new to Mothur and to analyzing data without prebuilt tools (e.g., the RDP Pipeline).
Since starting with Mothur I can see my file sizes (32 KB to 56,000 KB for the barcoded count tables/map files, and about 500,000 KB for the precluster files), the number of sequences, and the average read length (265-285 bp). The raw multiplexed FASTQ file was about 2 million KB, and the FASTA was also about 2 million KB. The trimmed/unique/good/aligned files were about 22 million KB. I do not have enough of a computing background to know how this translates into the CPUs, storage, RAM, etc. that are needed, so when I talk to IT or cloud-provider reps it is hard for me to communicate the requirements.
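To translate those file sizes into the units IT and cloud reps usually quote, 1 GB is about 1,000,000 KB (decimal units). A quick sketch converting the sizes I listed above:

```python
# Convert the file sizes mentioned above from KB to GB (1 GB = 1e6 KB, decimal).
sizes_kb = {
    "raw multiplexed FASTQ": 2_000_000,
    "converted FASTA": 2_000_000,
    "precluster files": 500_000,
    "trimmed/unique/good/aligned files": 22_000_000,
}
for name, kb in sizes_kb.items():
    print(f"{name}: {kb / 1e6:.1f} GB")
```

So the intermediates for one run come to a few tens of GB, which is modest for storage but says nothing yet about the RAM needed by the memory-hungry steps.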
I have a call scheduled with a rep from the Microsoft Azure team in the new year, but I have no idea what to tell them to help design a better setup.
Do you have any pointers on what I should communicate to IT and cloud reps to make clear what is needed, based on my data?
If it helps, here is what I know now that I am further along in analyzing this data set:
• I only have access to Microsoft Azure (I work in a business environment that uses Microsoft products, where the type of computer, parts, setup, etc. has to go through IT after I communicate what is needed).
• To test out Mothur I chose a FASTQ file with about 47 barcodes (47 samples). However, at times I will have a FASTQ file with 32 to 72 barcodes (32-72 samples), so these sizes can be smaller or larger than what I am testing now. In a year I process about 10-15 chips, each producing a multiplexed FASTQ file, and all of this data must be saved.
• During this trial with 47 barcoded samples, up to the summary.tax step, the file sizes range from 9 KB to 22 million KB, which does not seem like a lot of space when converted to GB.
• I am very comfortable with Ubuntu, Notepad++, the command line, and virtual machines. I was able to install the Ubuntu shell, the ARB project files, and Mothur without trouble. However, in the past this was all set up with university help, and I really only had experience using pre-configured tools; now that I am not in a university setting, it is a bit different.
• The computer I was initially provided has 16 GB RAM, a 64-bit operating system, and an Intel Core i5 processor (2.21 GHz base speed, 4 cores, 8 logical processors).
• I have also tried a computer with 4 GB RAM, a 64-bit operating system, and an Intel Core processor at 1.00 GHz (1.61 GHz base speed, 2 cores, 4 logical processors).
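Putting the numbers above together, a rough yearly storage budget may be the most useful figure to bring to the Azure call. The per-chip intermediate total (~26 GB of files up to summary.tax, per the trial above) and the 10-15 chips per year are from my own figures; the safety factor for dist.seqs/clustering output is an assumption:

```python
# Rough yearly storage budget, assuming each chip accumulates roughly the
# same intermediates as the 47-sample trial above (~26 GB up to summary.tax)
# plus the ~2 GB raw multiplexed FASTQ that must be kept.
per_chip_gb = 26 + 2      # intermediates + raw data (from the trial figures above)
chips_per_year = 15       # upper end of the stated 10-15 chips per year
safety_factor = 2         # assumed headroom for dist.seqs/clustering output
yearly_gb = per_chip_gb * chips_per_year * safety_factor
print(f"~{yearly_gb} GB per year")
```

That kind of single headline number (storage per year, plus peak RAM and core count for the heaviest step) is usually what a cloud rep needs to recommend a VM size and disk tier.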