Python - Filter information from a txt file with regular expressions
I have a file with information that looks like this:
    ****alignment****
    sequence: gi|86755972|gb|abd15130.1| cold acclimation protein cor413-pm1 [chimonanthus praecox]
    length: 201
    e-value: 2.66576e-82
    kylamktdqlavanmidsdinelkmatmrlindasmlghygfgthflkwlaclaaiyllildrtnwrtnmltsll...
    +ylamktd+ + +i +d+ e+ +l+ da+ lg g gt lkw+a aaiyllildrtnw+tnmlt+ll...
    eylamktdewsaqqliqtdlkemgkaakklvydatklgslgvgtsilkwvasfaaiyllildrtnwktnmltall...
Now I want to filter this information and store each piece in a variable. I think I should use a regular expression for this, but I don't know how, because a lot of the information sits together on the second line, for example.

I need the hitsid, the protein, the organism, and the evalue.
The corresponding data:

    hitsid = 86755972
    protein = cold acclimation protein cor413-pm1
    organism = chimonanthus praecox
    evalue = 2.66576e-82
So what I want is that, when I ask for hitsid, Python prints '86755972'.
Could someone help me with this? Thanks!
Use a regex like this, compiled with re.MULTILINE so that ^ matches at the start of each line:

    ^sequence:[^|]*\|(?P<hitsid>[^|]*)\|\s*\S*\s*(?P<protein>[^][]*?)\s*\[(?P<organism>[^][]*)][\s\S]*?\ne-value:\s*(?P<evalue>.*)

See the regex demo.
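To make the pattern easier to follow, here is a sketch of the same regex written with re.VERBOSE, one commented piece per line. It assumes the field labels are lowercase ("sequence:", "e-value:") exactly as in the sample above; the names PATTERN, sample, and fields are only for illustration.

```python
import re

# Same pattern as above, split up with re.VERBOSE so each part can be commented.
# Assumption: labels are lowercase, as in the sample shown in the question.
PATTERN = re.compile(r"""
    ^sequence:[^|]*\|          # the label and everything up to the first pipe
    (?P<hitsid>[^|]*)\|        # the id between the first and second pipe
    \s*\S*\s*                  # skip the accession token that follows (gb|...|)
    (?P<protein>[^][]*?)\s*    # protein description, up to the opening bracket
    \[(?P<organism>[^][]*)]    # organism name inside the square brackets
    [\s\S]*?                   # lazily skip any lines in between
    \ne-value:\s*              # the e-value label at the start of a line
    (?P<evalue>.*)             # the rest of that line
""", re.MULTILINE | re.VERBOSE)

sample = (
    "****alignment****\n"
    "sequence: gi|86755972|gb|abd15130.1| cold acclimation protein cor413-pm1 [chimonanthus praecox]\n"
    "length: 201\n"
    "e-value: 2.66576e-82\n"
)

m = PATTERN.search(sample)
fields = m.groupdict() if m else {}
print(fields.get("hitsid"))  # prints 86755972
```

The named groups are what let you ask for a value by name: fields["organism"] returns the organism, fields["evalue"] the e-value string, and so on.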
Sample Python code that collects the values for every match into a list of dictionaries:
    import re

    # re.MULTILINE lets ^ match at the start of every line, not only at the start of the string
    p = re.compile(r'^sequence:[^|]*\|(?P<hitsid>[^|]*)\|\s*\S*\s*(?P<protein>[^][]*?)\s*\[(?P<organism>[^][]*)][\s\S]*?\ne-value:\s*(?P<evalue>.*)', re.MULTILINE)

    s = ("****alignment****\n"
         "sequence: gi|86755972|gb|abd15130.1| cold acclimation protein cor413-pm1 [chimonanthus praecox]\n"
         "length: 201\n"
         "e-value: 2.66576e-82\n"
         "kylamktdqlavanmidsdinelkmatmrlindasmlghygfgthflkwlaclaaiyllildrtnwrtnmltsll...\n"
         "+ylamktd+ + +i +d+ e+ +l+ da+ lg g gt lkw+a aaiyllildrtnw+tnmlt+ll...\n"
         "eylamktdewsaqqliqtdlkemgkaakklvydatklgslgvgtsilkwvasfaaiyllildrtnwktnmltall...")

    # One dictionary per match, keyed by the group names in the pattern
    res = [m.groupdict() for m in p.finditer(s)]
    for x in res:
        print(x['hitsid'])
        print(x['protein'])
        print(x['organism'])
        print(x['evalue'])
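Since the question is about a txt file, the same idea can be wrapped in a small helper that reads the file and returns the records. This is only a sketch: the names HIT_RE, parse_hits, and parse_hits_file, the UTF-8 encoding, and the file path are assumptions, and the labels are assumed to be lowercase as in the sample. It also converts the e-value to a float, which raises ValueError if that line does not hold a number.

```python
import re

# Same pattern as in the answer; lowercase labels are assumed, as in the sample.
HIT_RE = re.compile(
    r'^sequence:[^|]*\|(?P<hitsid>[^|]*)\|\s*\S*\s*(?P<protein>[^][]*?)\s*'
    r'\[(?P<organism>[^][]*)][\s\S]*?\ne-value:\s*(?P<evalue>.*)',
    re.MULTILINE,
)


def parse_hits(text):
    """Return one dict per alignment block, with evalue converted to float."""
    hits = []
    for m in HIT_RE.finditer(text):
        hit = m.groupdict()
        hit['evalue'] = float(hit['evalue'])  # ValueError if not numeric
        hits.append(hit)
    return hits


def parse_hits_file(path):
    """Read the whole file at `path` (a hypothetical file name) and parse it."""
    with open(path, encoding='utf-8') as fh:
        return parse_hits(fh.read())


sample = (
    "****alignment****\n"
    "sequence: gi|86755972|gb|abd15130.1| cold acclimation protein cor413-pm1 [chimonanthus praecox]\n"
    "length: 201\n"
    "e-value: 2.66576e-82\n"
)
hits = parse_hits(sample)
print(hits[0]['hitsid'])  # prints 86755972
```

Reading the whole file at once keeps the multi-line pattern simple; for very large files a block-by-block reader would be a better fit.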