Finding primer/barcode information from 454 SRA files

I am attempting to analyze a data set from the HMP Crohn’s project. I have found all the SRA files I need, but I’m having a hard time figuring out how to find the primers and barcodes used.

This is an example of the page for one of the SRA experiments: http://www.ncbi.nlm.nih.gov/sra/SRX021306. At the bottom there is a section labeled “barcode”; clicking on it pulls up a rather ugly overlay with the barcode listed as TCTCTATGCG. I also found a script that parses out the metadata for this particular study, and it reports the same barcode.

This all seemed well and good, but after using sratoolkit to extract the sff file and then mothur to extract the fasta and qual files, searching the fasta file for “TCTCTATGCG” gives only 56 matches in a 132 MB fasta file. This seems low to me, but I don’t have much experience in this area, so perhaps it’s normal?

Additionally, I can’t seem to locate the primers that were used. They aren’t listed on the SRA page, and the script written by the study authors to pull down the metadata gives only the barcode and marks the read as forward, with no primer sequence. Here’s the relevant XML that the metadata script generated (the whole file is 220 lines, so I haven’t pasted all of it):

<SPOT_DESCRIPTOR>
<SPOT_DECODE_SPEC>
<READ_SPEC>
<READ_INDEX>0</READ_INDEX>
<READ_CLASS>Technical Read</READ_CLASS>
<READ_TYPE>Adapter</READ_TYPE>
<EXPECTED_BASECALL_TABLE>
TCAG
</EXPECTED_BASECALL_TABLE>
</READ_SPEC>
<READ_SPEC>
<READ_INDEX>1</READ_INDEX>
<READ_LABEL>barcode</READ_LABEL>
<READ_CLASS>Technical Read</READ_CLASS>
<READ_TYPE>BarCode</READ_TYPE>
<EXPECTED_BASECALL_TABLE>
TCTCTATGCG
</EXPECTED_BASECALL_TABLE>
</READ_SPEC>
<READ_SPEC>
<READ_INDEX>2</READ_INDEX>
<READ_CLASS>Application Read</READ_CLASS>
<READ_TYPE>Forward</READ_TYPE>
<RELATIVE_ORDER follows_read_index="1"/>
</READ_SPEC>
</SPOT_DECODE_SPEC>
</SPOT_DESCRIPTOR>
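
For what it’s worth, the expected basecalls can be pulled out of a descriptor like the one above with a few lines of Python. This is only a sketch: it assumes the XML has exactly the layout pasted here (basecalls as plain text inside EXPECTED_BASECALL_TABLE, plain ASCII quotes around attribute values), and the names `SPOT_XML` and `expected_basecalls` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The descriptor from the post, with plain ASCII quotes around the attribute value.
SPOT_XML = """<SPOT_DESCRIPTOR>
<SPOT_DECODE_SPEC>
<READ_SPEC>
<READ_INDEX>0</READ_INDEX>
<READ_CLASS>Technical Read</READ_CLASS>
<READ_TYPE>Adapter</READ_TYPE>
<EXPECTED_BASECALL_TABLE>
TCAG
</EXPECTED_BASECALL_TABLE>
</READ_SPEC>
<READ_SPEC>
<READ_INDEX>1</READ_INDEX>
<READ_LABEL>barcode</READ_LABEL>
<READ_CLASS>Technical Read</READ_CLASS>
<READ_TYPE>BarCode</READ_TYPE>
<EXPECTED_BASECALL_TABLE>
TCTCTATGCG
</EXPECTED_BASECALL_TABLE>
</READ_SPEC>
<READ_SPEC>
<READ_INDEX>2</READ_INDEX>
<READ_CLASS>Application Read</READ_CLASS>
<READ_TYPE>Forward</READ_TYPE>
<RELATIVE_ORDER follows_read_index="1"/>
</READ_SPEC>
</SPOT_DECODE_SPEC>
</SPOT_DESCRIPTOR>"""


def expected_basecalls(xml_text):
    """Map READ_TYPE -> expected basecall string for each READ_SPEC that has one."""
    root = ET.fromstring(xml_text)
    found = {}
    for spec in root.iter("READ_SPEC"):
        calls = spec.findtext("EXPECTED_BASECALL_TABLE")
        if calls is None or not calls.strip():
            continue  # e.g. the application read has no expected basecalls
        read_type = (spec.findtext("READ_TYPE") or "").strip()
        found[read_type] = calls.strip()
    return found


for read_type, calls in expected_basecalls(SPOT_XML).items():
    print(read_type, calls)
```

Note that the descriptor only ever names the adapter key and the barcode; nothing in it describes a primer, which matches what the metadata script reports.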

I also tried searching the fasta file for the sequence immediately following the barcode, but I had to shorten the search string considerably before it matched an appreciable number of reads.
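
One way to tell whether so few hits mean the barcode is wrong or simply sitting somewhere unexpected is to tally where in each read it first occurs. The sketch below is a rough check in plain Python, not a mothur feature; `read_fasta` and `barcode_offsets` are hypothetical helper names, and the three reads in the usage example are toy data, not sequences from the SRA run.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def barcode_offsets(lines, barcode):
    """Count reads by the offset of the first barcode match (-1 = not found)."""
    offsets = Counter()
    for _, seq in read_fasta(lines):
        offsets[seq.find(barcode.upper())] += 1
    return offsets


# Toy data only: one read starting with the barcode, one with the TCAG key
# still in front of it, and one multi-line read without the barcode.
toy = [">r1", "TCTCTATGCGAAAA", ">r2", "TCAGTCTCTATGCGAA", ">r3", "GGGG", "CCCC"]
print(barcode_offsets(toy, "TCTCTATGCG"))
```

For a real file, `barcode_offsets(open("reads.fasta"), "TCTCTATGCG")` would give the same tally (the file name is a placeholder). If nearly all reads land in the -1 bin, the barcode probably does not belong to these reads at all; a pile-up at offset 0 or 4 would suggest it is there, with or without the key still in front of it.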

I searched through a fair amount of the mothur board on google, but I can’t seem to find anyone having a similar problem. Is it just something simple that I am completely missing?

Do I need to have the primer sequence in order to use trim.flows(), or can it be done with just the barcodes (not that I fully trust them anyway…)?

In fact, now that I look further, many of the SRA pages for this project don’t even list a barcode. Here’s an example: http://www.ncbi.nlm.nih.gov/sra/SRX021316. I’m starting to suspect that barcode/primer information may be unavailable to me. What’s the best way to proceed through the mothur 454 SOP without this information?

I’m not sure that the example you sent actually contains 16S rRNA gene sequences. On the link you provided you’ll see this: “Sample: Human metagenome DNA sample from a female participant in the dbGaP study ‘Human Gut Microbiome in Crohn’s Disease’”. It also indicates this under “Library”:

Library: 4L100003197
Strategy: WGS
Source: GENOMIC
Selection: RANDOM
Layout: SINGLE

If you click on the “Study” link and from there the “Experiments” link, you’ll see there are 196 different libraries that seem to be a mix of amplicon and shotgun sequence libraries. As I understand SRA, each “experiment” represents a different sample. I think the amplicon libraries are the ones you want.

As for trim.flows/shhh.flows/trim.seqs, you do want both the barcode and primer sequences. You might go back to the original paper to find the primers they used. It looks like they sequenced V13, going from V3 towards V1.

Hope this helps! SRA is a pain, but I know they’re trying to improve things.

Pat

Thanks for the reply. I realized that mothur wasn’t intended for analyzing this type of data about 2 days in >.<. I selected some alternate 16S rRNA runs that will work for my purposes instead and have carried on since then. However, even the later 16S rRNA runs from those 196 libraries don’t have complete primer/barcode information (they’re missing the barcode), but that part was at least easy to extract myself.
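
For anyone landing here with the same problem, one possible way to recover a missing barcode (not necessarily how it was done in this case) is to look for the most common sequence at the start of the reads. The sketch below assumes a 10-base barcode, like the TCTCTATGCG listed above, and assumes that the 4-base TCAG key may or may not still be attached, depending on how the reads were extracted. `common_prefixes` is a made-up name and the reads are toy data.

```python
from collections import Counter


def common_prefixes(seqs, length=10, key="TCAG", top=3):
    """Return the most frequent leading `length`-mers, skipping `key` if present."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        if key and seq.startswith(key):
            seq = seq[len(key):]  # drop the adapter key before taking the prefix
        if len(seq) >= length:
            counts[seq[:length]] += 1
    return counts.most_common(top)


# Toy reads: three share a leading 10-mer (one still carries the key), one differs.
toy_reads = [
    "TCAGACGTACGTACTTT",
    "ACGTACGTACGGG",
    "ACGTACGTACAAA",
    "GGGGGGGGGGCC",
]
print(common_prefixes(toy_reads, top=1))
```

In a single-sample run, a real barcode should dominate the tally by a wide margin. If the counts are spread over many different prefixes, the reads are probably not barcoded at the start, or the assumed length is off.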