Code to check for duplicated sequences?

Kendra · May 23, 2019, 7:04pm

I’m trying to reanalyze a big dataset that was generated over several years. The samples were originally not demultiplexed in BaseSpace, so I’ve gone back and done that in fits and starts over the past many months. I have samples that apparently were demultiplexed twice but with different names. How can I figure out which samples those are? I’m trying to sort and then use awk to pull the duplicates from the good.groups file, but I’m doing something wrong and pulling all lines. I’m using this awk: https://stackoverflow.com/questions/1450085/list-only-duplicate-lines-based-on-one-column-from-a-semi-colon-delimited-file

Kendra · May 23, 2019, 10:01pm

Hmm, so the sequences are duplicated in the fasta but not in the .groups file? Does this make sense?

Kendra · May 23, 2019, 10:58pm

OK, figured it out: one sample is duplicated, so its sequences only got into the groups file once but into the fasta twice.

For anyone who has duplicate-sample issues, here’s what I did. There is likely a more elegant solution, but this works.

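#pull out seq names (fasta header lines)
grep '^>' file.trim.good.fasta > seqs
#keep only duplicated headers (uniq needs sorted input)
sort seqs > seqs.s
uniq -d seqs.s > dups
#remove ee=00... (everything after the first tab)
awk -F'\t' '{print $1}' dups > dups1
#remove the leading >
sed 's/^>//' dups1 > dups
#pull lines with duplicated seqs from the groups file
#-F treats names as fixed strings, -w avoids matching a name inside a longer name
grep -F -w -f dups file.contigs.good.groups > dup.groups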
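
The steps above can probably be collapsed into a single pipeline without intermediate files. This is only a minimal sketch: it assumes the fasta headers are tab-delimited (name first, then ee=...), that the groups file has the sequence name in its first tab-delimited column, and the function name find_dup_seqs is made up for illustration. Unlike the steps above, it strips the ee= part before looking for duplicates, so a name counts as duplicated even if the rest of the header differs.

```shell
# find_dup_seqs FASTA GROUPS  (hypothetical helper, not a mothur command)
# Prints the groups-file lines whose sequence name occurs more than once
# in the fasta file.
find_dup_seqs() {
    fasta=$1
    groups=$2
    grep '^>' "$fasta" |   # header lines only
        sed 's/^>//' |     # drop the leading >
        cut -f1 |          # keep the name, drop ee=... after the first tab
        sort |
        uniq -d |          # names that occur more than once
        # load duplicated names from stdin, then print matching groups lines
        awk -F'\t' 'NR == FNR { dup[$1] = 1; next } $1 in dup' - "$groups"
}
```

It could then be called as `find_dup_seqs file.trim.good.fasta file.contigs.good.groups > dup.groups`. The awk step compares the whole first column exactly, so a name that is a prefix of another name does not produce a false match.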
