This is probably a simple question. I have a set of preprocessed, merged fasta files, each belonging to a sample in an experiment. These files have already been trimmed for quality, and the overlapping reads were merged. Is it possible to load them somehow in MOTHUR? I have one fasta file per sample. Is there a way I can load the associated information and continue working through any of the SOPs, starting after the make.contigs step? If I understand correctly, I need to create a groups file. I created one with two columns, SampleID and FastaFilename. I intend to expand that file with an additional column for treatments.
I can read one fasta file at a time, display its summary, etc. Also, it seems like the get.group() function requires merging all the fasta files and editing the sequence headers to include the groups.
Could you please advise on how to proceed without reprocessing the original fastq files?
PS. Searching the forum is not working properly. I get the message "Sorry but you are not permitted to use the search system."
You would need to concatenate the fasta files and generate a groups file. Then you can start up after make.contigs in the SOP. To be honest, I would strongly encourage you to start from scratch with raw fastq files as we have yet to find an alternative protocol that generates contigs with as low an error rate as we can achieve with make.contigs.
I had the same issue/question: Skipping make.contigs
Just wrote a script or two to generate the needed format, worked like a charm after that.
Yeah, unfortunately you can only search relatively recent topics with Google, and sometimes not even that. I can’t even display all of my own topics and posts, lol.
I described the steps in the post I linked, but if you can’t write some bash code or other yourself then it’s a little bit trickier. Do you just need more details on the steps or how to actually do it with commands and code?
Merge all sample FASTA files into a single FASTA file. Just concatenate them together with something like the cat command.
The groups file is basically a single long list of lines, each containing one read ID from the FASTA files, separated by a TAB from the name of the sample that read belongs to. Each line should look like this: read ID, TAB, sample name.
For each sample, go through its FASTA file and pull out all the read IDs (all lines that start with “>”), then append a tab separator and the sample name/abbreviation to each one. Depending on what your read IDs look like, you might have to format the line a bit so that you only keep everything before the first space (to get rid of the “1:N:0:188” part included in the headers) and change colons to underscores (though Pat told me that this might not be necessary). I did this part with some scripting; in “pseudocode”: if line starts with “>” then print line, print \t, print sample_name, print \n.
Then you just concatenate all these individual files into a single one which contains these lines from all samples.
Sorry, since I have no idea what your files look like or how they are organized, I can’t be more specific.
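That said, here is a rough Python sketch of the steps above, in case it helps. It is not the script I actually used. The function name, the file names and the sample names are all made up for illustration. It assumes one FASTA file per sample, headers that start with “>”, and that the read ID is everything before the first space. It writes the read IDs to the groups file without the leading “>”, so that they match the sequence names in the merged FASTA. Replacing colons with underscores is left as an option, since it may not be needed.

```python
import os
import tempfile


def merge_fasta_with_groups(sample_to_fasta, merged_fasta, groups_file,
                            replace_colons=True):
    """Concatenate per-sample FASTA files and write a groups file.

    sample_to_fasta maps a sample name to the path of its FASTA file.
    The groups file gets one line per read: read_ID <TAB> sample_name.
    Returns the number of reads seen for each sample.
    """
    counts = {}
    with open(merged_fasta, "w") as fasta_out, open(groups_file, "w") as groups_out:
        for sample, fasta_path in sample_to_fasta.items():
            counts[sample] = 0
            with open(fasta_path) as fasta_in:
                for line in fasta_in:
                    line = line.rstrip("\r\n")
                    if not line:
                        continue  # skip blank lines
                    if line.startswith(">"):
                        # Keep only the part before the first space,
                        # which drops suffixes such as "1:N:0:188".
                        read_id = line[1:].split()[0]
                        if replace_colons:
                            read_id = read_id.replace(":", "_")
                        fasta_out.write(f">{read_id}\n")
                        groups_out.write(f"{read_id}\t{sample}\n")
                        counts[sample] += 1
                    else:
                        # Sequence lines are copied through unchanged.
                        fasta_out.write(line + "\n")
    return counts


if __name__ == "__main__":
    # Tiny self-contained demo with two made-up samples.
    workdir = tempfile.mkdtemp()
    inputs = {}
    for sample, header in [("sampleA", ">M01:1:2 1:N:0:188"),
                           ("sampleB", ">M01:3:4 1:N:0:189")]:
        path = os.path.join(workdir, sample + ".fasta")
        with open(path, "w") as handle:
            handle.write(header + "\nACGTACGT\n")
        inputs[sample] = path
    merged = os.path.join(workdir, "merged.fasta")
    groups = os.path.join(workdir, "merged.groups")
    print(merge_fasta_with_groups(inputs, merged, groups))
    with open(groups) as handle:
        print(handle.read(), end="")
```

The headers in the merged FASTA are rewritten in the same pass as the groups file, so the two files always use identical read IDs. If your read IDs are not unique across samples you would need to handle that separately, for example by prefixing them with the sample name.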