perl - Analyse text records based on the value of two fields -
i using perl test couple of conditions in each line of input file. one-liner below works records not all.
in current output, lines 2,3, , 5 correct, lines 1 , 4 not, presumably because stb
value has 2 comma-separated values in instead of one. instance stb=0.5,0.645036;
instead of stb=0.590597;
.
i can not seem figure out how apply same logic both conditions, if stb >= 0.8
, "strand bias" , reads
value of fdp
field.
the input file have lines in 1 stb
value , two.
perl
perl -ple '/^[^#].*fdp=(\d+);.*stb=(\d+\.\d+);/ , $_.=($2 >= 0.8?" strand bias ":" ").$1." reads"' input > out
input
chr1 93159358 . ct ac,c 51.3482 pass af=0,0.538462;ao=4,12;dp=39;fao=0,21;fdp=39;fr=.;fro=18;fsaf=0,11;fsar=0,10;fsrf=15;fsrr=3;fwdb=0.0379899,0.0954749;fxx=0;hrun=1,5;len=2,1;mlld=22.441,10.1519;oalt=ac,-;oid=.,.;omapalt=ac,c;opos=93159358,93159359;oref=ct,t;pb=0.5,0.5;pbp=1,1;qd=5.26648;rbi=0.0698716,0.219287;refb=-0.0299799,-0.0774582;revb=0.0586414,0.197411;ro=22;saf=0,9;sar=4,3;srf=17;srr=5;ssen=0,0;ssep=0,0;sssb=-0.747246,-0.0336118;stb=0.5,0.645036;stbp=1,0.086;type=mnp,del;varb=0.059091,0.135819;ann=evi5 gt:gq:dp:fdp:ro:fro:ao:fao:af:sar:saf:srf:srr:fsar:fsaf:fsrf:fsrr 0/2:46:39:39:22:18:4,12:0,21:0,0.538462:4,3:0,9:17:5:0,10:0,11:15:3 chr1 93073228 . c 142.937 pass af=0.4;ao=42;dp=105;fao=42;fdp=105;fr=.;fro=63;fsaf=25;fsar=17;fsrf=28;fsrr=35;fwdb=-0.00213313;fxx=0.00943307;hrun=2;len=1;mlld=178.966;oalt=a;oid=.;omapalt=a;opos=93073228;oref=c;pb=0.5;pbp=1;qd=5.44523;rbi=0.00753887;refb=-0.0179184;revb=-0.00723079;ro=63;saf=25;sar=17;srf=28;srr=35;ssen=0;ssep=0;sssb=0.159972;stb=0.590597;stbp=0.144;type=snp;varb=0.0207923;ann=evi5 gt:gq:dp:fdp:ro:fro:ao:fao:af:sar:saf:srf:srr:fsar:fsaf:fsrf:fsrr 0/1:142:105:105:63:63:42:42:0.4:17:25:28:35:17:25:28:35 chr1 93089823 . t c 1038.33 pass af=1;ao=110;dp=111;fao=111;fdp=111;fr=.;fro=0;fsaf=76;fsar=35;fsrf=0;fsrr=0;fwdb=0.0247073;fxx=0.00892777;hrun=2;len=1;mlld=59.5565;oalt=c;oid=.;omapalt=c;opos=93089823;oref=t;pb=0.5;pbp=1;qd=37.4173;rbi=0.025038;refb=-0.0649256;revb=-0.0040564;ro=1;saf=75;sar=35;srf=1;srr=0;ssen=0;ssep=0;sssb=-0.00628837;stb=0.5;stbp=1;type=snp;varb=0.000880627;ann=evi5 gt:gq:dp:fdp:ro:fro:ao:fao:af:sar:saf:srf:srr:fsar:fsaf:fsrf:fsrr 1/1:42:111:111:1:0:110:111:1:35:75:1:0:35:76:0:0 chr11 36596027 . ag aa,a 1031.71 pass af=0.121875,0.703125;ao=52,118;dp=333;fao=39,225;fdp=320;fr=.;fro=56;fsaf=2,136;fsar=37,89;fsrf=14;fsrr=42;fwdb=0.0148693,0.00188064;fxx=0.0615818;hrun=5,5;len=1,1;mlld=11.6837,10.3394;oalt=a,-;oid=.,.;omapalt=aa,a;opos=36596028,36596028;oref=g,g;pb=0.5,0.5;pbp=1,1;qd=12.8964;rbi=0.065829,0.11083;refb=-0.0510698,-0.110624;revb=0.0641277,0.110814;ro=85;saf=2,84;sar=50,34;srf=17;srr=68;ssen=0,0;ssep=0,0.328125;sssb=-0.504054,0.394265;stb=0.789128,0.571642;stbp=0.007,0;type=snp,del;varb=0.0245642,0.134756;ann=rag1 gt:gq:dp:fdp:ro:fro:ao:fao:af:sar:saf:srf:srr:fsar:fsaf:fsrf:fsrr 0/2:42:333:320:85:56:52,118:39,225:0.121875,0.703125:50,34:2,84:17:68:37,89:2,136:14:42 chr11 95825383 . c t 143.023 pass af=0.47561;ao=28;dp=71;fao=39;fdp=82;fr=.;fro=43;fsaf=6;fsar=33;fsrf=40;fsrr=3;fwdb=-0.0301041;fxx=0.0238067;hrun=1;len=1;mlld=189.321;oalt=t;oid=.;omapalt=t;opos=95825383;oref=c;pb=0.5;pbp=1;qd=6.97675;rbi=0.153139;refb=-0.0165525;revb=0.150151;ro=43;saf=6;sar=22;srf=40;srr=3;ssen=0;ssep=0;sssb=-0.666847;stb=0.875258;stbp=0;type=snp;varb=0.0275999;ann=maml2 gt:gq:dp:fdp:ro:fro:ao:fao:af:sar:saf:srf:srr:fsar:fsaf:fsrf:fsrr 0/1:143:71:82:43:43:28:39:0.47561:22:6:40:3:33:6:40:3
current output (lines 2,3, , 5 correct)
line 1 stb=0.5,0.645036 line 4 stb=0.789128,0.571642
chr1 93159358 . ct ac,c 51.3482 pass af=0,0.538462;ao=4,12;dp=39;fao=0,21;fdp=39;fr=.;fro=18;fsaf=0,11;fsar=0,10;fsrf=15;fsrr=3;fwdb=0.0379899,0.0954749;fxx=0;hrun=1,5;len=2,1;mlld=22.441,10.1519;oalt=ac,-;oid=.,.;omapalt=ac,c;opos=93159358,93159359;oref=ct,t;pb=0.5,0.5;pbp=1,1;qd=5.26648;rbi=0.0698716,0.219287;refb=-0.0299799,-0.0774582;revb=0.0586414,0.197411;ro=22;saf=0,9;sar=4,3;srf=17;srr=5;ssen=0,0;ssep=0,0;sssb=-0.747246,-0.0336118;stb=0.5,0.645036;stbp=1,0.086;type=mnp,del;varb=0.059091,0.135819;ann=evi5 gt:gq:dp:fdp:ro:fro:ao:fao:af:sar:saf:srf:srr:fsar:fsaf:fsrf:fsrr 0/2:46:39:39:22:18:4,12:0,21:0,0.538462:4,3:0,9:17:5:0,10:0,11:15:3 chr1 93073228 . c 142.937 pass af=0.4;ao=42;dp=105;fao=42;fdp=105;fr=.;fro=63;fsaf=25;fsar=17;fsrf=28;fsrr=35;fwdb=-0.00213313;fxx=0.00943307;hrun=2;len=1;mlld=178.966;oalt=a;oid=.;omapalt=a;opos=93073228;oref=c;pb=0.5;pbp=1;qd=5.44523;rbi=0.00753887;refb=-0.0179184;revb=-0.00723079;ro=63;saf=25;sar=17;srf=28;srr=35;ssen=0;ssep=0;sssb=0.159972;stb=0.590597;stbp=0.144;type=snp;varb=0.0207923;ann=evi5 gt:gq:dp:fdp:ro:fro:ao:fao:af:sar:saf:srf:srr:fsar:fsaf:fsrf:fsrr 0/1:142:105:105:63:63:42:42:0.4:17:25:28:35:17:25:28:35 105 reads chr1 93089823 . t c 1038.33 pass af=1;ao=110;dp=111;fao=111;fdp=111;fr=.;fro=0;fsaf=76;fsar=35;fsrf=0;fsrr=0;fwdb=0.0247073;fxx=0.00892777;hrun=2;len=1;mlld=59.5565;oalt=c;oid=.;omapalt=c;opos=93089823;oref=t;pb=0.5;pbp=1;qd=37.4173;rbi=0.025038;refb=-0.0649256;revb=-0.0040564;ro=1;saf=75;sar=35;srf=1;srr=0;ssen=0;ssep=0;sssb=-0.00628837;stb=0.5;stbp=1;type=snp;varb=0.000880627;ann=evi5 gt:gq:dp:fdp:ro:fro:ao:fao:af:sar:saf:srf:srr:fsar:fsaf:fsrf:fsrr 1/1:42:111:111:1:0:110:111:1:35:75:1:0:35:76:0:0 111 reads chr11 36596027 . ag aa,a 1031.71 pass af=0.121875,0.703125;ao=52,118;dp=333;fao=39,225;fdp=320;fr=.;fro=56;fsaf=2,136;fsar=37,89;fsrf=14;fsrr=42;fwdb=0.0148693,0.00188064;fxx=0.0615818;hrun=5,5;len=1,1;mlld=11.6837,10.3394;oalt=a,-;oid=.,.;omapalt=aa,a;opos=36596028,36596028;oref=g,g;pb=0.5,0.5;pbp=1,1;qd=12.8964;rbi=0.065829,0.11083;refb=-0.0510698,-0.110624;revb=0.0641277,0.110814;ro=85;saf=2,84;sar=50,34;srf=17;srr=68;ssen=0,0;ssep=0,0.328125;sssb=-0.504054,0.394265;stb=0.789128,0.571642;stbp=0.007,0;type=snp,del;varb=0.0245642,0.134756;ann=rag1 gt:gq:dp:fdp:ro:fro:ao:fao:af:sar:saf:srf:srr:fsar:fsaf:fsrf:fsrr 0/2:42:333:320:85:56:52,118:39,225:0.121875,0.703125:50,34:2,84:17:68:37,89:2,136:14:42 chr11 95825383 . c t 143.023 pass af=0.47561;ao=28;dp=71;fao=39;fdp=82;fr=.;fro=43;fsaf=6;fsar=33;fsrf=40;fsrr=3;fwdb=-0.0301041;fxx=0.0238067;hrun=1;len=1;mlld=189.321;oalt=t;oid=.;omapalt=t;opos=95825383;oref=c;pb=0.5;pbp=1;qd=6.97675;rbi=0.153139;refb=-0.0165525;revb=0.150151;ro=43;saf=6;sar=22;srf=40;srr=3;ssen=0;ssep=0;sssb=-0.666847;stb=0.875258;stbp=0;type=snp;varb=0.0275999;ann=maml2 gt:gq:dp:fdp:ro:fro:ao:fao:af:sar:saf:srf:srr:fsar:fsaf:fsrf:fsrr 0/1:143:71:82:43:43:28:39:0.47561:22:6:40:3:33:6:40:3 strand bias 82 reads
desired output
chr1 93159358 . ct ac,c 51.3482 pass af=0,0.538462;ao=4,12;dp=39;fao=0,21;fdp=39;fr=.;fro=18;fsaf=0,11;fsar=0,10;fsrf=15;fsrr=3;fwdb=0.0379899,0.0954749;fxx=0;hrun=1,5;len=2,1;mlld=22.441,10.1519;oalt=ac,-;oid=.,.;omapalt=ac,c;opos=93159358,93159359;oref=ct,t;pb=0.5,0.5;pbp=1,1;qd=5.26648;rbi=0.0698716,0.219287;refb=-0.0299799,-0.0774582;revb=0.0586414,0.197411;ro=22;saf=0,9;sar=4,3;srf=17;srr=5;ssen=0,0;ssep=0,0;sssb=-0.747246,-0.0336118;stb=0.5,0.645036;stbp=1,0.086;type=mnp,del;varb=0.059091,0.135819;ann=evi5 39 reads readsgt:gq:dp:fdp:ro:fro:ao:fao:af:sar:saf:srf:srr:fsar:fsaf:fsrf:fsrr 0/2:46:39:39:22:18:4,12:0,21:0,0.538462:4,3:0,9:17:5:0,10:0,11:15:3 chr1 93073228 . c 142.937 pass af=0.4;ao=42;dp=105;fao=42;fdp=105;fr=.;fro=63;fsaf=25;fsar=17;fsrf=28;fsrr=35;fwdb=-0.00213313;fxx=0.00943307;hrun=2;len=1;mlld=178.966;oalt=a;oid=.;omapalt=a;opos=93073228;oref=c;pb=0.5;pbp=1;qd=5.44523;rbi=0.00753887;refb=-0.0179184;revb=-0.00723079;ro=63;saf=25;sar=17;srf=28;srr=35;ssen=0;ssep=0;sssb=0.159972;stb=0.590597;stbp=0.144;type=snp;varb=0.0207923;ann=evi5 gt:gq:dp:fdp:ro:fro:ao:fao:af:sar:saf:srf:srr:fsar:fsaf:fsrf:fsrr 0/1:142:105:105:63:63:42:42:0.4:17:25:28:35:17:25:28:35 105 reads chr1 93089823 . t c 1038.33 pass af=1;ao=110;dp=111;fao=111;fdp=111;fr=.;fro=0;fsaf=76;fsar=35;fsrf=0;fsrr=0;fwdb=0.0247073;fxx=0.00892777;hrun=2;len=1;mlld=59.5565;oalt=c;oid=.;omapalt=c;opos=93089823;oref=t;pb=0.5;pbp=1;qd=37.4173;rbi=0.025038;refb=-0.0649256;revb=-0.0040564;ro=1;saf=75;sar=35;srf=1;srr=0;ssen=0;ssep=0;sssb=-0.00628837;stb=0.5;stbp=1;type=snp;varb=0.000880627;ann=evi5 gt:gq:dp:fdp:ro:fro:ao:fao:af:sar:saf:srf:srr:fsar:fsaf:fsrf:fsrr 1/1:42:111:111:1:0:110:111:1:35:75:1:0:35:76:0:0 111 reads chr11 36596027 . ag aa,a 1031.71 pass af=0.121875,0.703125;ao=52,118;dp=333;fao=39,225;fdp=320;fr=.;fro=56;fsaf=2,136;fsar=37,89;fsrf=14;fsrr=42;fwdb=0.0148693,0.00188064;fxx=0.0615818;hrun=5,5;len=1,1;mlld=11.6837,10.3394;oalt=a,-;oid=.,.;omapalt=aa,a;opos=36596028,36596028;oref=g,g;pb=0.5,0.5;pbp=1,1;qd=12.8964;rbi=0.065829,0.11083;refb=-0.0510698,-0.110624;revb=0.0641277,0.110814;ro=85;saf=2,84;sar=50,34;srf=17;srr=68;ssen=0,0;ssep=0,0.328125;sssb=-0.504054,0.394265;stb=0.789128,0.571642;stbp=0.007,0;type=snp,del;varb=0.0245642,0.134756;ann=rag1 gt:gq:dp:fdp:ro:fro:ao:fao:af:sar:saf:srf:srr:fsar:fsaf:fsrf:fsrr 0/2:42:333:320:85:56:52,118:39,225:0.121875,0.703125:50,34:2,84:17:68:37,89:2,136:14:42 320 reads chr11 95825383 . c t 143.023 pass af=0.47561;ao=28;dp=71;fao=39;fdp=82;fr=.;fro=43;fsaf=6;fsar=33;fsrf=40;fsrr=3;fwdb=-0.0301041;fxx=0.0238067;hrun=1;len=1;mlld=189.321;oalt=t;oid=.;omapalt=t;opos=95825383;oref=c;pb=0.5;pbp=1;qd=6.97675;rbi=0.153139;refb=-0.0165525;revb=0.150151;ro=43;saf=6;sar=22;srf=40;srr=3;ssen=0;ssep=0;sssb=-0.666847;stb=0.875258;stbp=0;type=snp;varb=0.0275999;ann=maml2 gt:gq:dp:fdp:ro:fro:ao:fao:af:sar:saf:srf:srr:fsar:fsaf:fsrf:fsrr 0/1:143:71:82:43:43:28:39:0.47561:22:6:40:3:33:6:40:3 strand bias 82 reads
it nice have more information
on face of it, long you're happy ignore second value stb
, can remove insistence of semicolon ;
after value. this
perl -ple '/^[^#].*fdp=(\d+);.*stb=(\d+\.\d+)/ , $_.=($2 >= 0.8?" strand bias ":" ").$1." reads"'
but that's pretty awful perl code and, have found, damned hard debug
i prefer this: script extracts all of labelled values hash , uses them directly there
i've guessed want all of stb
values under 0.8 result, i've used all
function list::util
test that. split stb
value on commas , create boolean $all_ok
status variable indicates whether true. works fine whether there's 1 value or 1 thousand
then printf
builds output string components we've calculated
i've emptied $line
variable can see appended each line debugging purposes. delete statement or comment out real run
use strict; use warnings 'all'; use list::util 'all'; while ( <> ) { next unless /\s/; @fields = split; chomp( $line = $_ ); %values = map { split /=/ } split /;/, $fields[7]; $all_ok = { $_ < 0.8 } split /,/, $values{stb}; $line = ''; # debugging printf "%s %s %s reads\n", $line, $all_ok ? 'good' : 'strand bias', $values{fdp}; }
output
39 reads 105 reads 111 reads 320 reads strand bias 82 reads
update
now you've explained want list of status values in output can write better solution. no longer need list::util::all
, instead need create array of boolean status values list of stb
data
don't forget comment out $line = ''
before run data real
it looks this
use strict; use warnings 'all'; use list::util 'all'; while ( <> ) { next unless /\s/; @fields = split; chomp( $line = $_ ); %values = map { split /=/ } split /;/, $fields[7]; @stb_ok = map { $_ < 0.8 } split /,/, $values{stb}; @good = map { $_ ? 'good' : 'strand bias' } @stb_ok; $line = ''; # debugging printf "%s %s %s reads\n", $line, "@good", $values{fdp}; }
output
good 39 reads 105 reads 111 reads good 320 reads strand bias 82 reads
Comments
Post a Comment