General Sequencing Questions

Aloha Mothur Folks,

Hope all is well with you and yours.

I am a retired Medical Laboratory Technologist with a strong background in Microbiology and Serology. I am healthy and have a healthy lifestyle in Hawaii. My interest in monitoring my gut microbiome started last year (2019) with my introduction to the services provided by uBiome, and I have been collecting data since then. The 16S rRNA amplicon nucleotide sequences were produced on an Illumina machine.

uBiome, as you probably know, is no longer in business. I was happy with their service and I am thankful that I was able to download all of the FastQ files before their server went offline. Their gut microbiome test collection kit and 16S rRNA amplicon analysis was priced at about $100 per sample. I am currently looking for a company/laboratory that can supply a similar service. uBiome’s report included a comparison against all of the samples in their database. Customers provided general health information, such as height, weight, physical activity, diet and disease states, and all of these variables were considered in a very elaborate and graphic report. Most of the information was interesting, especially for people who consider themselves normal or average or at some extreme.

The bacteria that populate my intestinal tract are unique, and their variety and quantity are of interest to me. When I stopped my strenuous cardiovascular exercise for four weeks, it was interesting that the quantity of the genus Veillonella fell by more than half, with everything else unchanged. Please see the visualization of the FastQ data files which I uploaded to Sequentia Biotech.

https://metagenomics.sequentiabiotech.com/shared/TaskFlow/1288cb88-8578-4251-b6e9-7780cc30e71e/538b1f51-2b93-4cf9-865c-41fe1843ebac

I am 74 years old and I plan to monitor my gut microbiota yearly for the rest of my life. I am interested in how my diet affects the diversity of microorganisms in my gut as I age.

I am a participant in the “All of Us” Research Program and the Million Veterans Research Program, neither of which is associated with my work as an unaffiliated “Citizen Scientist”.

I am currently accumulating knowledge and data using R, Qiime2, Phyloseq, and BioPython as well as Pandas as they apply to my FastQ files.

I have posed this same question to several sequencing companies and other places with no satisfactory answer. I think I understand the basic principles of this new-to-me technology with regard to barcoding and multiplexing.

I don’t understand how I can have a file composed of almost 500,000 sequences of 150 bp each with no duplicated sequences. For that matter, none of the large datasets that I have processed to date contain any duplicated sequences! My old-timer’s experience has taught me that there should be numerous bacterial cells of the same species in the stool samples that I submitted for analysis.
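
For anyone who wants to repeat this duplicate check, here is a minimal Python sketch of one way to do it. It assumes a well-formed FASTQ file with exactly four lines per record; the function names and the toy records are made up for illustration and are not part of mothur or any other tool mentioned here.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the second line of every record holds the bases
            yield line.strip()


def duplicate_summary(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total reads, number of distinct sequences)."""
    counts = Counter(fastq_sequences(lines))
    return sum(counts.values()), len(counts)


# Toy records (hypothetical data): two identical reads and one different read.
toy_records = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "ACGTACGT", "+", "IIIIIIII",
    "@read3", "ACGTACGA", "+", "IIIIIIII",
]
total, distinct = duplicate_summary(toy_records)
print(f"{total} reads, {distinct} distinct sequences")
```

To run it on a real file, pass an open file handle instead of the toy list, for example `duplicate_summary(open("ssr_1266562__R1__L001.fastq"))`. If the two numbers are equal, the file contains no exact duplicates.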

I was hoping someone might write a reply that would help me understand what I currently consider to be abnormal (not correct).

Thanks,

John Hasty BS. MT(ASCP) Retired

The sequence quality of an individual 150 nt read is quite poor. We strongly advocate sequencing the V4 region with paired 250 nt reads. Because the V4 region is ~250 nt, nearly every base is sequenced twice. Back to a 150 nt read. The per-base error rate with Illumina is about 1%, so it’s a safe assumption that your sequences have an average of 1.5 errors per sequence (150 × 0.01). The errors are random-ish. So if I sequenced the same fragment many times, I’d probably see every possible error. When you do this for a whole community, voilà, there won’t be many duplicates.
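
The arithmetic above can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption that is not stated in the post: every base has the same independent 1% chance of being wrong. Real Illumina errors are only "random-ish", so treat the numbers as rough. The function names are made up for this example.

```python
def expected_errors(read_length: int, per_base_error: float) -> float:
    """Mean number of wrong bases in a read of the given length."""
    return read_length * per_base_error


def prob_error_free(read_length: int, per_base_error: float) -> float:
    """Chance that every base in the read is correct (independence assumed)."""
    return (1.0 - per_base_error) ** read_length


read_length = 150      # read length discussed in this thread
per_base_error = 0.01  # "about 1%" per-base error rate

print("expected errors per read:", expected_errors(read_length, per_base_error))
print("fraction of reads with no errors:", prob_error_free(read_length, per_base_error))
```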

The 150 nt approach taken by uBiome and American Gut is cheap, but generates error-prone data. It is good enough for classification using a database, but is pretty horrendous for OTU-based analyses at a level below genus.

Hope this helps a bit…


Hi John
I might be able to help you with sequencing, but I don’t know about the report; I’ve never seen a uBiome report.

I fully agree with Pat on the quality of the 2x150 reads; my lab’s process is based on his wet-lab recommendations. We run 2x250 on a MiSeq.

Feel free to check out mars.uconn.edu to see what we offer.

Kendra

Thanks for the good information. The genus level is fine with me; I am primarily interested in the total quantity of each nucleotide in the sequences - I think!

Thanks Kendra, I will definitely check out mars. I would love doing business with a .edu


The following is how the uncompressed files appear in my file folder:
ssr_1266562__R1__L001.fastq
ssr_1266562__R1__L002.fastq
ssr_1266562__R1__L003.fastq
ssr_1266562__R1__L004.fastq
ssr_1266562__R2__L001.fastq
ssr_1266562__R2__L002.fastq
ssr_1266562__R2__L003.fastq
ssr_1266562__R2__L004.fastq

Can the forward and reverse reads be determined from the above names?

Assuming that R1_L001, R1_L002, R1_L003 and R1_L004 are the forward reads and
the R2s are the reverse reads, why am I getting the following error message? I have tried
reordering the files as well as changing the file names to match those used in the tutorial,
and I still get the same message.

mothur > make.contigs(file=MYstability.files, processors=4)

Using 4 processors.

Processing file pair MyBiomeSeq/ssr_1266562__R1__L001.fastq - MyBiomeSeq/ssr_1266562__R1__L002.fastq (files 1 of 4) <<<<<
Making contigs…
NS500419_382_HJH22AFXY_1_11101_18065_1049 is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.
NS500419_382_HJH22AFXY_1_11206_16315_8892 is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.
NS500419_382_HJH22AFXY_1_11312_20752_11924 is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.
NS500419_382_HJH22AFXY_1_21206_24618_6085 is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.
Segmentation fault: 11

The following is a copy of the first lines of one of the forward .fastq files.

@NS500419:382:HJH22AFXY:1:11101:18065:1049 1:N:0:GAAGGAGT+GCTTGAGT
GTGCCAGCCGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGTGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGA
+
AAAAAEEEEEEEEEEEEEEEEEEAEAEEEEEEEEEEEEEEAEEEEEEEEAEEEEEEEEAE<AEEEEEEEEEEEEEEEEEEEAAAEEEEEEEEEAEEAE<AEEEEEAAAEEEEEEEEEEAEEEAEEE<AAAE<A<EEE<AA6A6AAEEAEE<
@NS500419:382:HJH22AFXY:1:11101:8905:1080 1:N:0:GAAGGAGT+GCTTGAGT
GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGTGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACTGTTGAACTTGAGTGCAGAAGA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEAEEEEEEEEE/EEEEEEEE<EEEEEEEEEEEEEEEE/<EEEE<EAAA<EAAEA6
@NS500419:382:HJH22AFXY:1:11101:2587:1091 1:N:0:GAAGGAGT+GCTTGAGT
GTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGA
+

The first lines of the reverse file for the same sample are as follows:

@NS500419:382:HJH22AFXY:1:11101:18065:1049 2:N:0:GAAGGAGT+GCTTGAGT
CCGGACTACCAGGGTATCTAATCCTGTTCGCTACCCATGCTTTCGAGCCTCAGCGTCAGTTGCAGACCAGAGAGCCGCCTTCGCCACTGGTGTTCTTCCATATATCTACGCATTCCACCGCTACACATGGAGTTCCACTCTCCTCTTCTG
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEAEEEEEEEEEEEEEEEEEEEAE/EAEE/A/EE<EEEEEEEEEEEAEEEEEEEEEEE/AA<EAEEAEEEEEEAAAEE/E/AAAA<<<A<E<<AEAA666A<6<A<AA<
@NS500419:382:HJH22AFXY:1:11101:8905:1080 2:N:0:GAAGGAGT+GCTTGAGT
CCGGACTACCGGGGTTTCTAATCCTGTTCGCTACCCATGCTTTCGAGCCTCAGCGTCAGTTGCAGACCAGAGAGCCGCCTTCGCCACTGGTGTTCTTCCATATATCTACGCATTCCACCGCTACACATGGAGTTCCACTCTCCTCTTCTG
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEE6EEEEEEEEEEEEEEEEEAE/EEEE/</EEAEEEEEEEEEEEAEEEEEEEAEEE/EEEAAEE/EEE<EAAA<AEEEAE<AAA<<AEEEAEEAE666A<6<6<A<A
@NS500419:382:HJH22AFXY:1:11101:2587:1091 2:N:0:GAAGGAGT+GCTTGAGT
CCGGACTACTCGGGTTTCTAATCCTGTTTGATCCCCACGCTTTCGCACATCAGCGTCAGTTACAGACCAGAAAGTCGCCTTCGCCACTGGTGTTCCTCCATATCTCTGCGCATTTCACCGCTACACATGGAATTCCACTTTCCTCTTCTG
+

Can these uBiome files be used in Mothur?
Perhaps I need to use the sequences from more than one sample?
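
Since the error message is about read names found in the forward file but not in the reverse file, one thing worth checking is whether two files really are a matching pair. Below is a minimal Python sketch of such a check. It assumes well-formed four-line FASTQ records and treats everything before the first space in the header as the read ID, as in the headers shown above; the function names are made up for illustration.

```python
from typing import Iterable, List


def read_ids(fastq_lines: Iterable[str]) -> List[str]:
    """Collect read IDs: header text before the first space, without the '@'."""
    ids = []
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:  # the first line of every record is the header
            fields = line.split()
            if fields:
                ids.append(fields[0][1:])
    return ids


def ids_match(forward_lines: Iterable[str], reverse_lines: Iterable[str]) -> bool:
    """True when both files list the same read IDs in the same order."""
    return read_ids(forward_lines) == read_ids(reverse_lines)


# Headers copied from the forward and reverse files shown above;
# the bases and quality strings are shortened placeholders.
forward = [
    "@NS500419:382:HJH22AFXY:1:11101:18065:1049 1:N:0:GAAGGAGT+GCTTGAGT",
    "GTGCCAGC", "+", "AAAAAEEE",
]
reverse = [
    "@NS500419:382:HJH22AFXY:1:11101:18065:1049 2:N:0:GAAGGAGT+GCTTGAGT",
    "CCGGACTA", "+", "AAAAAEEE",
]
print("files are a matching pair:", ids_match(forward, reverse))
```

With real data, pass two open file handles, for example an R1 file and the R2 file from the same lane. Two R1 files from different lanes would be expected to fail this check.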

Any help appreciated.

Thanks

What does your file file look like?

I solved my own problem by removing all but one pair of forward and reverse read files.
This is what the set of files that works looks like:
ssr_1266562__R1__L001.fastq
ssr_1266562__R2__L001.fastq
This process was repeated for all ten of my 2019 samples, making a folder that contained 20 fastq files. I am not sure why the uBiome download contained 4 repeats of the same sample.

This is the original download from uBiome for one sample, after the files are uncompressed.

ssr_1373003__R1__L001.fastq
ssr_1373003__R1__L002.fastq
ssr_1373003__R1__L003.fastq
ssr_1373003__R1__L004.fastq
ssr_1373003__R2__L001.fastq
ssr_1373003__R2__L002.fastq
ssr_1373003__R2__L003.fastq
ssr_1373003__R2__L004.fastq

The following changes must be made manually before contigs can be made from the sequences.

ssr_A1373003__R1__L001.fastq
ssr_A1373003__R2__L001.fastq
ssr_B1373003__R1__L002.fastq
ssr_B1373003__R2__L002.fastq
ssr_C1373003__R1__L003.fastq
ssr_C1373003__R2__L003.fastq
ssr_D1373003__R1__L004.fastq
ssr_D1373003__R2__L004.fastq

If the initial name is created by the sequencing machine, then it should be changed.
I had to do this to all of my samples to make a contig of 990,000 sequences.
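
As an alternative to renaming by hand, the pairing could be written out by a short script. The Python sketch below pairs each R1 file with the R2 file from the same sample and lane and builds rows in the three-column layout (label, forward file, reverse file) that the file= option of make.contigs reads. The file-name pattern is taken from the names in this thread; the label format and the function name are made up for illustration, and I have not checked whether mothur accepts the same label on several rows, so each lane gets its own label here.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Pattern from this thread: ssr_<sample>__R<1|2>__L<lane>.fastq
NAME_RE = re.compile(r"^ssr_(?P<sample>\d+)__R(?P<read>[12])__L(?P<lane>\d+)\.fastq$")


def build_file_rows(filenames: Iterable[str]) -> List[str]:
    """Pair R1/R2 by sample and lane; return 'label<TAB>forward<TAB>reverse' rows."""
    pairs: Dict[Tuple[str, str], Dict[str, str]] = defaultdict(dict)
    for name in filenames:
        match = NAME_RE.match(name)
        if match is None:
            continue  # skip names that do not follow the pattern
        pairs[(match["sample"], match["lane"])][match["read"]] = name

    rows = []
    for (sample, lane), reads in sorted(pairs.items()):
        if "1" in reads and "2" in reads:  # only write complete pairs
            label = f"s{sample}L{lane}"  # hypothetical label, one per lane
            rows.append(f"{label}\t{reads['1']}\t{reads['2']}")
    return rows


names = [
    "ssr_1373003__R1__L001.fastq", "ssr_1373003__R1__L002.fastq",
    "ssr_1373003__R2__L001.fastq", "ssr_1373003__R2__L002.fastq",
]
for row in build_file_rows(names):
    print(row)
```

The printed rows can be saved to a text file and passed to make.contigs with file=. The point of the sketch is that R1 is always matched with R2 from the same lane, never with another R1 file as in the "Processing file pair" line of the log above.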

Now, it would really be fun if I could run “make.nucleotides” to get the total number of each
nucleotide in the contigs.
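
Outside of mothur, a per-nucleotide total is easy to get with a few lines of Python. The sketch below is one possible way, assuming the contigs are in a FASTA file (header lines start with ">"); the function name and the toy records are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def count_nucleotides(fasta_lines: Iterable[str]) -> Counter:
    """Tally every base in a FASTA file, skipping header and blank lines."""
    totals: Counter = Counter()
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        totals.update(line.upper())  # count each character, case-insensitively
    return totals


# Toy FASTA records (hypothetical data).
toy_fasta = [">contig1", "ACGTAC", ">contig2", "GGTN"]
for base, count in sorted(count_nucleotides(toy_fasta).items()):
    print(base, count)
```

To use it on real data, pass an open handle to the contigs FASTA file in place of the toy list. Ambiguous bases such as N are counted under their own letter.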