python - Extract each sequence record as an individual file
I have a file, ecoli.ffn, whose header rows indicate the names of the sequenced genes:
$ head ecoli.ffn
>ecoli16:g027092:gcf_000460315:gi|545267691|ref|nz_ke701669.1|:551259-572036
atgagcctgattattgatgttatttcgcgt
aaaacatccgtcaaacaaacgctgattaat
>ecoli16:g000011:55989:gi|218693476|ref|nc_011748.1|:1128430-1131042
gtgtacgctatggcgggtaattttgccgat
>ecoli16:g000012:55989:gi|218693476|ref|nc_011748.1|:1128430-1131042
gtgtacgctatggcgggtaattttgccgat
ctgacagctgttcttacactggattcaacc
ctgacagctgttcttacactggattcaacc
As shown above, the gene name is the field between the 1st and 2nd colon of each header:
g027092
g000011
g000012
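The naming rule above can be sketched in Python, the language in the question's tag. This is only an illustration: gene_name is a hypothetical helper, not something from the original post, and it assumes every header has at least two colon-separated fields.

```python
def gene_name(header: str) -> str:
    """Return the field between the 1st and 2nd colon of a header line."""
    # Drop the leading ">" and split on ":"; field 0 is the prefix
    # (e.g. "ecoli16") and field 1 is the gene name.
    return header.lstrip(">").split(":")[1]


header = ">ecoli16:g027092:gcf_000460315:gi|545267691|ref|nz_ke701669.1|:551259-572036"
print(gene_name(header))  # g027092
```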
I want to use ecoli.ffn to generate 3 files: g027092.txt, g000011.txt, and g000012.txt, each containing the sequence data for that gene.
For example, g027092.txt contains the raw data without the header:
$ cat g027092.txt
atgagcctgattattgatgttatttcgcgt
aaaacatccgtcaaacaaacgctgattaat
How can I do this?
awk to the rescue!
$ awk -F: -v RS=">" 'NR==FNR{n=split($0,t,"\n"); for(i=1;i<n;i++) a[t[i]]; next} $2 in a{file=$2".txt"; sub(/[^\n]+\n/,""); print > file}' index file
$ head g*.txt
==> g000011.txt <==
gtgtacgctatggcgggtaattttgccgat

==> g000012.txt <==
gtgtacgctatggcgggtaattttgccgat
ctgacagctgttcttacactggattcaacc
ctgacagctgttcttacactggattcaacc

==> g027092.txt <==
atgagcctgattattgatgttatttcgcgt
aaaacatccgtcaaacaaacgctgattaat
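Explanation

-F: splits each record into fields on colons, and RS=">" makes every entry that starts with ">" a single record, so $2 is the gene name.

NR==FNR{n=sp...
    This block parses the first file (index, one gene name per line) and creates a lookup table.

$2 in a{file=$2".txt";
    If the current record's gene name is in the lookup table, set the file name using the key and a .txt extension.

sub(/[^\n]+\n/,"")
    Delete the header line.

print > file
    Print the remaining sequence lines to the specified file.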
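
Since the question is tagged python, here is a minimal Python sketch of the same idea. It is not the answer's method: split_ffn and its out_dir parameter are hypothetical names, and unlike the awk command it does not use an index file but writes one file for every record it finds. It assumes gene names are unique, because a repeated name would overwrite the earlier file.

```python
from pathlib import Path


def split_ffn(ffn_path, out_dir="."):
    """Write the sequence lines of each record to <gene>.txt; return the gene names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    genes = []
    out = None
    with open(ffn_path) as ffn:
        for line in ffn:
            if line.startswith(">"):
                # Header line: close the previous output file, then open a
                # new one named after the field between the 1st and 2nd colon.
                if out is not None:
                    out.close()
                gene = line[1:].split(":")[1]
                genes.append(gene)
                out = open(out_dir / f"{gene}.txt", "w")
            elif out is not None:
                # Sequence line: copy it unchanged, so the header is left out.
                out.write(line)
    if out is not None:
        out.close()
    return genes
```

Calling split_ffn("ecoli.ffn") in the directory that holds the file should create g027092.txt, g000011.txt, and g000012.txt next to it for the sample shown in the question.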