I am attempting to analyze a data set from the HMP Crohn's project. I have found all the SRA files I need, but I’m having a hard time figuring out how to find the primers and barcodes used.
Here is the page for one of the SRA entries as an example: http://www.ncbi.nlm.nih.gov/sra/SRX021306
As you can see, at the bottom it has a section listed as barcode. Clicking on it pulls up a rather ugly overlay with the barcode listed as TCTCTATGCG. I also found a script that parses out the metadata for this particular study, and it reports the same barcode.
This all seemed well and good, but after using sratoolkit to extract the sff file and then mothur to extract the fasta and qual files, searching the fasta file for “TCTCTATGCG” gives only 56 matches in a 132 MB fasta file. This seems low to me, but I don’t have much experience in this area, so perhaps it’s normal?
Additionally, I can’t seem to locate the primers that were used. They don’t seem to be listed on the SRA page, and the script that the study authors designed to pull down the metadata gives only the barcode and says that it’s forward, with no primer sequence. Here’s the relevant XML that the metadata script generated (the whole file is 220 lines, so I haven’t pasted it).
I also tried searching the fasta file for the sequence immediately to the right of the barcode, but I had to shorten the search string considerably before it matched an appreciable number of reads.
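For anyone who wants to repeat these checks more systematically than with a plain text search, here is a minimal Python sketch. It counts whether the barcode appears at the start of a read, somewhere inside it, or only as its reverse complement, and it tallies the bases that directly follow a leading barcode (a conserved primer would be expected to show up there as one dominant string). The function names, the window length `k`, and the toy reads are assumptions made up for illustration; nothing here comes from the study's own scripts or from mothur.

```python
from collections import Counter
from io import StringIO


def read_fasta(handle):
    """Yield upper-cased sequences from FASTA text, joining wrapped lines."""
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks).upper()
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks).upper()


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def barcode_report(handle, barcode, k=20):
    """Return (where, downstream) counters for one barcode.

    where      -- how many reads carry the barcode at the start, internally,
                  or only as its reverse complement
    downstream -- tally of the k bases right after a leading barcode
    """
    barcode = barcode.upper()
    rc = reverse_complement(barcode)
    end = len(barcode)
    where = Counter()
    downstream = Counter()
    for seq in read_fasta(handle):
        where["reads"] += 1
        if seq.startswith(barcode):
            where["at_start"] += 1
            if len(seq) >= end + k:
                downstream[seq[end:end + k]] += 1
        elif barcode in seq:
            where["internal"] += 1
        elif rc in seq:
            where["reverse_complement"] += 1
    return where, downstream


if __name__ == "__main__":
    # Toy reads, invented for the demo; swap in open("your_file.fasta").
    toy = StringIO(
        ">r1\nTCTCTATGCGACGTAC\n"
        ">r2\nTCTCTATGCGACGTTT\n"
        ">r3\nGGTCTCTATGCGAA\n"
        ">r4\nAACGCATAGAGATT\n"
        ">r5\nACGTACGT\n"
    )
    where, downstream = barcode_report(toy, "TCTCTATGCG", k=4)
    print(dict(where))
    print(downstream.most_common(3))
```

If most hits turn out to be internal or reverse-complemented rather than at the start of the read, that would at least suggest the reads are not laid out the way a simple search assumes.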
I searched through a fair amount of the mothur forum via Google, but I can’t seem to find anyone having a similar problem. Is it just something simple that I am completely missing?
Do I need to have the primer sequence in order to use trim.flows(), or can it be done with just the barcodes (not that I fully trust them anyway…)?