Code to check for duplicated sequences?

I’m trying to reanalyze a big dataset that was generated over several years. The samples were originally not demultiplexed in BaseSpace, so I’ve gone back and done that in fits and starts over the past many months. I have samples that apparently were demultiplexed twice but with different names. How can I figure out which samples those are? I’m trying to sort and then use awk to pull the duplicates from the good.groups file, but I’m doing something wrong and pulling all lines. I’m using this awk: https://stackoverflow.com/questions/1450085/list-only-duplicate-lines-based-on-one-column-from-a-semi-colon-delimited-file

Hmm, so the sequences are duplicated in the fasta but not in the .groups file? Does this make sense?

OK, figured it out: one sample is duplicated, so its sequences only got into the groups file once but into the fasta twice.

For anyone who has duplicate-sample issues, here’s what I did. There is likely a more elegant solution, but this works.

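# pull out the sequence header lines
grep '>' file.trim.good.fasta > seqs
# keep only the headers that appear more than once
sort seqs > seqs.s
uniq -d seqs.s dups
# remove ee=00... (keep only the first tab-separated field)
awk -F'\t' '{print $1}' dups > dups1
# remove the leading >
sed 's/^>//' dups1 > dups
# pull lines with duplicated seqs from the groups file
# (-F treats the names as literal strings, -w matches whole words only,
# so a name cannot match inside a longer name)
grep -F -w -f dups file.contigs.good.groups > dup.groups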
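If it helps anyone, here is a shorter sketch of the same idea, not a tested replacement for the steps above. It assumes each fasta header is the sequence name followed by a tab, and that the groups file is name, tab, group. The two tiny input files are made up so the commands can be run as-is, and dup.names and dup.counts are just names I picked. One difference from the steps above: it looks for repeated names rather than repeated whole header lines. The awk line compares the first column exactly, and the last step counts the duplicated sequences per group, which should point at the sample that was demultiplexed twice.

```shell
# Hypothetical example inputs in a scratch directory; use your real files instead.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>seq1\tee=0.1\nACGT\n>seq2\tee=0.2\nGGCC\n>seq1\tee=0.1\nACGT\n>seq10\tee=0.3\nTTAA\n' > file.trim.good.fasta
printf 'seq1\tsampleA\nseq2\tsampleA\nseq10\tsampleB\n' > file.contigs.good.groups

# Sequence names (header text after '>' up to the first tab) that occur more than once.
grep '>' file.trim.good.fasta | sed 's/^>//' | cut -f1 | sort | uniq -d > dup.names

# Keep the groups-file lines whose first column is exactly one of those names.
awk -F'\t' 'NR == FNR { dup[$1] = 1; next } $1 in dup' dup.names file.contigs.good.groups > dup.groups

# Count the duplicated sequences per group (second column of the groups file).
cut -f2 dup.groups | sort | uniq -c > dup.counts

cat dup.groups
cat dup.counts
```

Because the comparison is on the whole first column, a name like seq1 does not also pull in seq10.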