mgcluster issue - multiple commas in fn.list file

Hello,

I’m trying to get OPFs using mgcluster. Mothur runs well with mgcluster(blast=my_self_blast); however, when I open the resulting fn.list file, there are OPFs like

FKX0NLA01C6PXB_1_2_253,FKX0NLA01BRCMX_1_252_1,FK1S85202HGOWC_2_3_269,FK1S85202IFO3W_2_19_273,,FKX0NLA01C5BUR_2_30_278

or even

,,,FK1S85201CQN3W_1_3_242,FKWF80P02JJTP2_3_246_1,FK1S85202JR3T5_1_1_240,FK1S85202IDPCI_2_1_249,FKX0NLA02IAVUW_2_3_248,FK1S85201AVF6S_2_1_258

I.e., there are sequences with NULL (empty) accession numbers in my OPFs.

Is this a bug or a feature?
Could it be the result of the long accno names?

It’s not the length of the accession numbers. Can you tell us how you are generating the blast table?

The blast table was generated by blasting all ORFs under analysis against themselves with the following options:

-i orfs.fasta -d orfs.db -p blastp -m 8 -F F -o orfs.selfblast

NB: mpiblast was used

Upd:

I checked whether the number of sequences changed and whether there are duplicate seqs for different OPF definitions with the following command:

perl -ne 'chop; $g=$s=$_; $s=~s/\t/,/g; $s=~s/,{1,}/,/g; @a=split(",",$s); %seqids=map{$_ => 1}grep{!/^$/}@a[2..$#a]; @a1=split("\t",$g); print join("\t", (@a[0..1], $#a1-1, $#a-1, scalar keys %seqids))."\n"' < my.fn.list

and it turned out that the sequences are in fact unique and nothing is missing from the fn.list file:

unique 907091 907091 907091 907091
unique 907090 907090 907091 907091
unique 901253 901253 907091 907091
0.00 896471 896471 907091 907091
0.01 883862 883862 907091 907091
0.02 868195 868195 907091 907091
0.03 849879 849879 907091 907091
0.04 831429 831429 907091 907091
0.05 812922 812922 907091 907091
0.06 793794 793794 907091 907091
0.07 775795 775795 907091 907091
0.08 758211 758211 907091 907091
0.09 740337 740337 907091 907091
0.10 728230 728230 907091 907091
0.11 699455 699455 907091 907091
0.12 522723 522723 907091 907091
...

columns:
1 and 2 - OPF definition and the respective number of OPFs from the fn.list file,
3 - number of groups for the OPF definition, obtained by splitting the line on TABs,
4 and 5 - the numbers of seqs and unique seqs for the OPF definition, obtained by substituting TABs with commas, collapsing multiple commas, and splitting the line on commas.

If so, the multiple commas could easily be stripped from my fn.list to allow subsequent analyses. Is that right, or did I miss something?
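For anyone who needs the same workaround, here is a minimal sketch of that clean-up in Python. It is not mothur code. It assumes each line of the list file is TAB-separated (label, number of OPFs, then one comma-separated field per OPF), as the column description above implies. The function names clean_list_line and clean_list_file are made up for illustration, and recomputing the count in column 2 is my own choice in case an OPF turns out to contain only empty names.

```python
def clean_list_line(line):
    """Drop empty sequence names from one line of a list file.

    Assumed layout: label <TAB> count <TAB> OPF1 <TAB> OPF2 ...
    where each OPF is a comma-separated list of sequence names.
    """
    fields = line.rstrip("\n").split("\t")
    label, otus = fields[0], fields[2:]
    cleaned = []
    for otu in otus:
        # Splitting on "," turns ",,A,,B" into ['', '', 'A', '', 'B'];
        # keeping only non-empty items removes the stray commas.
        names = [name for name in otu.split(",") if name]
        if names:  # skip fields that held no real names (e.g. a trailing TAB)
            cleaned.append(",".join(names))
    # Recompute the OPF count so it matches what is actually written out.
    return "\t".join([label, str(len(cleaned))] + cleaned)


def clean_list_file(src_path, dst_path):
    """Write a cleaned copy of the list file at src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(clean_list_line(line) + "\n")
```

Calling clean_list_file("my.fn.list", "my.clean.fn.list") would leave the original file untouched and write the cleaned version next to it (the output name is just an example).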

Could you send your blast file and mothur logfile to mothur.bugs@gmail.com?

First, the blast output file is too large for email. I’ll upload it to an FTP server and send you a link.

Second, which log file do you need? No errors occurred with the mgcluster command: it reads the blast.out file and writes the my.fn.list, my.fn.rabund and my.fn.sabund files. However, when I run the read.otu command on the my.fn.list file to move on to the multiple-sample analysis:

read.otu(list=my.fn.list, group=my.group)

mothur exits with an error:

mothur v.1.11.0
Last updated: 6/18/2010

by
Patrick D. Schloss

Department of Microbiology & Immunology
University of Michigan
pschloss@umich.edu
http://www.mothur.org

When using, please cite:
Schloss, P.D., et al., Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.

Distributed under the GNU General Public License

Type 'help()' for information on the commands that are available

Type 'quit()' to exit program



mothur > read.otu(list=2008_11.fn.list, group=2008_11.group)

unique
unique
unique
Error: Sequence '' was not found in the group file, please correct.

because the my.fn.list file generated by the mgcluster command contains strings like those in my first post.
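To see which lines trigger that error before running read.otu, something like the following Python sketch could be used. It is only an illustration, not how mothur itself does the check. It assumes the group file has one sequence per line with the sequence name first and the group name second, separated by whitespace, and that the list file is TAB-separated with comma-separated names inside each OPF. The function names read_group_names and missing_from_groups are hypothetical.

```python
def read_group_names(group_lines):
    """Collect sequence names from group-file lines (name first, group second)."""
    names = set()
    for line in group_lines:
        parts = line.split()
        if parts:
            names.add(parts[0])
    return names


def missing_from_groups(list_lines, group_names):
    """Return (line_number, label, missing_names) for each problem line.

    An empty string in missing_names corresponds to the stray commas,
    i.e. the '' sequence that read.otu complains about.
    """
    problems = []
    for line_number, line in enumerate(list_lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a label/count/OPF line
        missing = set()
        for otu in fields[2:]:
            if not otu:
                continue  # empty field caused by a trailing TAB
            for name in otu.split(","):
                if name not in group_names:
                    missing.add(name)
        if missing:
            problems.append((line_number, fields[0], missing))
    return problems
```

Both functions take any iterable of lines, so open file handles for my.fn.list and my.group can be passed in directly.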

Let me know if you need additional data.
Please let me know, either by e-mail or in this thread, when you find the cause.

Best regards, Yuri.

I was able to find the problem; the fix will be part of 1.12.0. Thanks for your help in tracking down the bug!