python - Extract each sequence as an individual file


I have an ecoli.ffn file whose header rows give the name of each sequenced gene:

    $ head ecoli.ffn
    >ecoli16:g027092:gcf_000460315:gi|545267691|ref|nz_ke701669.1|:551259-572036
    atgagcctgattattgatgttatttcgcgt
    aaaacatccgtcaaacaaacgctgattaat
    >ecoli16:g000011:55989:gi|218693476|ref|nc_011748.1|:1128430-1131042
    gtgtacgctatggcgggtaattttgccgat
    >ecoli16:g000012:55989:gi|218693476|ref|nc_011748.1|:1128430-1131042
    gtgtacgctatggcgggtaattttgccgat
    ctgacagctgttcttacactggattcaacc
    ctgacagctgttcttacactggattcaacc

As shown above, the gene name is the text between the first and second colon:

    g027092
    g000011
    g000012
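Since the question is tagged python, the naming rule can be sketched in a few lines. This is only an illustration: `gene_name` is a hypothetical helper, and it assumes every header has at least two colon-separated fields, as in the sample.

```python
def gene_name(header: str) -> str:
    """Return the text between the first and second colon of a header line."""
    # Drop the leading '>' if present, then take the second colon-separated field.
    return header.lstrip(">").split(":")[1]


header = ">ecoli16:g027092:gcf_000460315:gi|545267691|ref|nz_ke701669.1|:551259-572036"
print(gene_name(header))  # g027092
```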

I want to use ecoli.ffn to generate three files, g027092.txt, g000011.txt, and g000012.txt, each containing the sequence data for that gene.

For example, g027092.txt contains the raw sequence data without the header:

    $ cat g027092.txt
    atgagcctgattattgatgttatttcgcgt
    aaaacatccgtcaaacaaacgctgattaat

How can I do this?

awk to the rescue!

    $ awk -F: -v RS=">" 'NR==FNR{n=split($0,t,"\n");
                                 for(i=1;i<n;i++) a[t[i]];
                                 next}
                         $2 in a{file=$2".txt";
                                 sub(/[^\n]+\n/,"");
                                 print > file}' index file

    $ head g*.txt
    ==> g000011.txt <==
    gtgtacgctatggcgggtaattttgccgat

    ==> g000012.txt <==
    gtgtacgctatggcgggtaattttgccgat
    ctgacagctgttcttacactggattcaacc
    ctgacagctgttcttacactggattcaacc

    ==> g027092.txt <==
    atgagcctgattattgatgttatttcgcgt
    aaaacatccgtcaaacaaacgctgattaat

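Explanation

-F: splits each record into fields on colons, and RS=">" makes every entry (header plus sequence lines) a single record, so $2 is the gene name.

NR==FNR{n=sp... this block runs only for the first file (index). With RS=">" the whole file is read as one record, which is split on newlines to build a lookup table of gene names.

$2 in a{file=$2".txt"; if the gene name of the current record is in the lookup table, set the file name to that key plus the .txt extension.

sub(/[^\n]+\n/,"") deletes the header line from the record.

print > file writes what is left of the record (the sequence lines) to that file.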
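For readers who prefer Python (the question's tag), the same idea can be sketched as follows. This is not the answer's method, just a minimal equivalent under stated assumptions: `split_ffn` and its parameters are hypothetical names, the optional `wanted` argument stands in for the index file, every header is assumed to have at least two colon-separated fields, and each gene name is assumed to occur once (a repeated name would overwrite the earlier file).

```python
from pathlib import Path


def split_ffn(ffn_path, out_dir=".", wanted=None):
    """Write the sequence lines of each record in ffn_path to <gene>.txt.

    wanted: optional collection of gene names to keep (the role played by
    the index file in the awk command); None keeps every record.
    Returns the gene names written, in file order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None  # output handle for the current record, if it is being kept
    try:
        with open(ffn_path) as ffn:
            for line in ffn:
                if line.startswith(">"):
                    # Header line: finish the previous record's file.
                    if out is not None:
                        out.close()
                        out = None
                    # Gene name is the second colon-separated field.
                    gene = line[1:].split(":")[1]
                    if wanted is None or gene in wanted:
                        out = open(out_dir / f"{gene}.txt", "w")
                        written.append(gene)
                elif out is not None:
                    # Sequence line: copy it; the header is never written.
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Calling `split_ffn("ecoli.ffn")` on the sample above would be expected to create g027092.txt, g000011.txt and g000012.txt in the current directory. Unlike the awk version, no extra blank line is appended to each file, because lines are copied as they are read.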

