Processing MiSeq single (unpaired) reads

I am trying to process some old 16S datasets that were sequenced on Illumina before paired-end reads were widely used. How do I run them through mothur’s MiSeq SOP? I am downloading from NCBI’s SRA, so the files are de-multiplexed fastq files with the primer sequences removed. My understanding is that make.contigs is where the fastq files are turned into fasta files, but that command requires both forward and reverse files as input.
Thanks!


If your files are already demultiplexed and have no barcode/primer information, I usually find the quickest way is to quality filter each fastq file separately, and then merge the output files together.

A basic workflow is:

  1. Run fastq.info over each fastq file to split them into fasta and qual files.
  2. Quality filter them with trim.seqs.
  3. Merge the QC-ed fasta files together with merge.files to get your full fasta file.
  4. Create a groups file using make.group.
  5. Run unique.seqs to dereplicate the fasta file.
  6. Run count.seqs over the resulting names file, and your groups file, to get the count table.
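As a rough sketch, the six steps could look like this in a mothur batch file for two samples. The file names (sampleA.fastq, sampleB.fastq, all.fasta, all.groups) are made up for illustration, qthreshold=25 is just an example value, and the intermediate file names are my assumption of what mothur writes; they can differ between mothur versions, so check the output names mothur reports after each command.

```
# Hypothetical inputs: sampleA.fastq and sampleB.fastq (single-end, already demultiplexed)

# 1. Split each fastq file into fasta and qual files
fastq.info(fastq=sampleA.fastq)
fastq.info(fastq=sampleB.fastq)

# 2. Quality filter each sample separately (example threshold only)
trim.seqs(fasta=sampleA.fasta, qfile=sampleA.qual, qthreshold=25)
trim.seqs(fasta=sampleB.fasta, qfile=sampleB.qual, qthreshold=25)

# 3. Merge the trimmed fasta files into one file
merge.files(input=sampleA.trim.fasta-sampleB.trim.fasta, output=all.fasta)

# 4. Build a groups file mapping each sequence to its sample
#    (the output name is an assumption; use whatever name mothur reports)
make.group(fasta=sampleA.trim.fasta-sampleB.trim.fasta, groups=sampleA-sampleB, output=all.groups)

# 5. Dereplicate the merged fasta file
unique.seqs(fasta=all.fasta)

# 6. Combine the names file and the groups file into a count table
count.seqs(name=all.names, group=all.groups)
```

With more than a handful of samples, it is probably easier to generate the repeated fastq.info and trim.seqs lines with a small script than to type them by hand.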

From there, you should be able to go back to the MiSeq SOP at the alignment step using the *.unique.fasta file and the count table.


Thanks!

One follow-up question: I am trying to quality filter with similar stringency to the standard SOP for paired-end data.

In trim.seqs the default quality cutoff is qthreshold=25 and in make.contigs the default quality cutoff is insert=20. Are these equivalent cutoffs? That is, should I set qthreshold to 20?

Also, should I add a screen.seqs step to your suggested list of commands for the same reason, i.e. after make.group and before unique.seqs?

I’m pretty sure the insert in make.contigs is the same parameter as the qthreshold in trim.seqs. Personally I try to go a bit higher than Q20, at least Q25, but I think standard practice here varies a bit.

I usually just wait until after alignment to do screen.seqs, because it’s only removing sequences that are too short (or too long), contain Ns, or contain suspicious homopolymers. You definitely need it after alignment to make sure you have good start/stop positions; otherwise filtering will get pretty messy. But if you want to screen the fasta/count files before alignment, it won’t be a bad thing. It would probably make your alignment faster, since there will be fewer sequences to process.
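
If you do want to screen before alignment, a minimal sketch might look like the following. The file names are hypothetical (whatever unique.seqs and count.seqs produced for you), and maxambig=0 and maxhomop=8 are example values of the kind used in the paired-end SOP. I have left out length cutoffs because they depend on your read length; summary.seqs should help you pick them.

```
# Hypothetical file names from the dereplication and count steps

# Look at the length, ambiguous-base and homopolymer distributions first
summary.seqs(fasta=all.unique.fasta, count=all.count_table)

# Drop sequences with any Ns or with long homopolymer runs (example values)
screen.seqs(fasta=all.unique.fasta, count=all.count_table, maxambig=0, maxhomop=8)
```

You would still run screen.seqs again after alignment with start/end positions chosen from the summary of the aligned sequences.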