Using preprocessed merged reads

elximo · March 2, 2017, 10:06pm

Hello All,

This is probably a simple question. I have a set of preprocessed, merged fasta files, one per sample in an experiment. These files have already been quality-trimmed, and the overlapping reads were merged. Is it possible to load them somehow in MOTHUR? Is there a way I can load the associated sample information and continue working through any of the SOPs, starting after make.contigs? If I understand correctly, I need to create a groups file. I created one with two columns, SampleID and FastaFilename. I intend to expand that file with an additional column for treatments.

I can read one fasta file at a time, display its summary, etc. Also, it seems like the get.group() function requires merging all fasta files and editing the sequence headers with groups.

Could you please advise on how to proceed without reprocessing the original fastq files?

PS. Searching the forum is not working properly. I get the message "Sorry but you are not permitted to use the search system."

pschloss · March 10, 2017, 1:42pm

You would need to concatenate the fasta files and generate a groups file. Then you can start up after make.contigs in the SOP. To be honest, I would strongly encourage you to start from scratch with raw fastq files as we have yet to find an alternative protocol that generates contigs with as low an error rate as we can achieve with make.contigs.

dnasaurus · March 10, 2017, 1:49pm

I had the same issue/question: Skipping make.contigs
I just wrote a script or two to generate the needed format, and it worked like a charm after that.

Yeah, unfortunately you can only search relatively recent topics with Google, and sometimes not even that. I can’t even display all of my topics and posts, lol.

brjoce · March 16, 2017, 9:24pm

Could you give me more info regarding the “just wrote a script or two to generate the needed…” part? How did you go about doing this?
Thanks

dnasaurus · March 27, 2017, 2:54pm

I described the steps in the post I linked, but if you can’t write some bash code or other yourself then it’s a little bit trickier. Do you just need more details on the steps or how to actually do it with commands and code?

Merge all sample FASTA files into a single FASTA file. Just concatenate them together with something like the cat command.
The groups file is basically a single long list with one line per read: the read ID from the fasta file, then a TAB, then the name of the sample the read belongs to. It covers the reads from every sample and should look like this:

M00967_43_000000000-A3JHG_1_1101_10011_3881     gut1
M00967_43_000000000-A3JHG_1_1101_10050_15564    gut1
M00967_43_000000000-A3JHG_1_1101_10051_26098    gut1
M00967_43_000000000-A3JHG_1_1101_10133_8460     gut1

I believe your FASTA files might look like this:

>M00967:43:000000000-A3JHG:1:1101:18327:1699 1:N:0:188
NACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTGCCAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTACAGCCGTTGAAACTGCCGGGCTCGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGTGCGGGGATCAAACAG
>M00967:43:000000000-A3JHG:1:1101:14069:1827 1:N:0:188
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTGCCAAGTCAGCGGTAAAATTGCGGGGCTCAACCCCGTACAGCCGTTGAAACTGCCGGGCTCGAGTGGGCGAGAAGTATGCGGAATGCGTGGTGTAGCGGTGAAATGCATAGATATCACGCAGAACCCCGATTGCGAAGGCAGCATACCGGCGCCCTACTGACGCTGAGGCACGAAAGTGCGGGGATCAAACAG
>M00967:43:000000000-A3JHG:1:1101:18044:1900 1:N:0:188
TACGGAGGATGCGAGCGTTGTCCGGAATCACTGGGCGTAAAGGGCGCGTAGGCGGTTTAATAAGTCAGTGGTGAAAACTGAGGGCTCAACCCTCAGCCTGCCACTGATACTGTTAGACTTGAGTATGGAAGAGGAGAATGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGATTCTCTGGGCCAAGACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACA

You need to go through each sample's FASTA file, pull out every read ID (every line that starts with “>” in this example), and write it on its own line followed by a tab separator and the sample name/abbreviation. Depending on what your read IDs look like, you might have to clean up the line a bit: drop the leading “>”, keep only the part before the first space (to get rid of the “1:N:0:188” that is included in the headers), and change colons to underscores (though Pat told me that this might not be necessary). I did this part with some scripting; in “pseudocode”: for each line, if the line starts with “>”, then print the read ID, print \t, print sample_name, print \n.
Then you just concatenate all these individual files into a single one which contains these lines from all samples.
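
For anyone who wants something more concrete than the pseudocode, here is a minimal Python sketch of the steps above. It is not the script I actually used, just one way it could be done. The function names (read_id, build_inputs) and the file names in the usage note are made up for illustration. It assumes one FASTA file per sample, and it rewrites the headers in the merged FASTA the same way as in the groups file so that the two always agree; as mentioned above, the colon-to-underscore change may not be strictly necessary.

```python
def read_id(header: str) -> str:
    """Turn a FASTA header line into a bare read ID.

    Drops the leading '>', keeps only the text before the first
    whitespace (removing e.g. ' 1:N:0:188'), and replaces colons
    with underscores.
    """
    return header[1:].split()[0].replace(":", "_")


def build_inputs(samples: dict, merged_fasta: str, groups_file: str) -> int:
    """Write one merged FASTA file and one groups file.

    samples maps a sample name (e.g. 'gut1') to the path of that
    sample's FASTA file. Returns the number of reads written.
    """
    n_reads = 0
    with open(merged_fasta, "w") as fasta_out, open(groups_file, "w") as groups_out:
        for sample_name, path in samples.items():
            with open(path) as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if not line:
                        continue  # skip blank lines
                    if line.startswith(">"):
                        rid = read_id(line)
                        # Same cleaned ID goes into both output files.
                        fasta_out.write(f">{rid}\n")
                        groups_out.write(f"{rid}\t{sample_name}\n")
                        n_reads += 1
                    else:
                        # Sequence lines are copied unchanged.
                        fasta_out.write(line + "\n")
    return n_reads
```

You would call it with something like build_inputs({"gut1": "gut1.fasta", "gut2": "gut2.fasta"}, "merged.fasta", "merged.groups"), where those file names are placeholders for your own. The two output files are what you would then hand to the SOP steps that follow make.contigs.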

Sorry, since I have no idea what your files look like or how they are organized, I can’t be more specific.
