Suppose we have a file in SWISS-PROT flat format named OPSD_BOVIN, of which the following is an excerpt:

    ID   OPSD_BOVIN STANDARD; PRT; 348 AA.
    AC   P02699;
    TR   A.
    TS   TMHMM83; TMHMM160; COPRETHI; HTP; HMMTOP; NON_RED.
    DE   RHODOPSIN.
    FT   DOMAIN 1 36 EXTRACELLULAR.
    FT   TRANSMEM 37 63 1. (MEDLINE; 20385054)
    FT   DOMAIN 64 73 CYTOPLASMIC.
    FT   TRANSMEM 74 96 2. (MEDLINE; 20385054)
    FT   DOMAIN 97 110 EXTRACELLULAR.
    FT   TRANSMEM 111 133 3. (MEDLINE; 20385054)
    FT   DOMAIN 134 152 CYTOPLASMIC.
    FT   TRANSMEM 153 173 4. (MEDLINE; 20385054)
    FT   DOMAIN 174 202 EXTRACELLULAR.
    FT   TRANSMEM 203 224 5. (MEDLINE; 20385054)
    FT   DOMAIN 225 252 CYTOPLASMIC.
    FT   TRANSMEM 253 274 6. (MEDLINE; 20385054)
    FT   DOMAIN 275 286 EXTRACELLULAR.
    FT   TRANSMEM 287 308 7. (MEDLINE; 20385054)
    FT   DOMAIN 309 348 CYTOPLASMIC.
    SQ   SEQUENCE 348 AA; 39007 MW; 33FDA196803E81F3 CRC64;
         MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY
         VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG
         GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP
         EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES
         ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV
         YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA
    //

If we want to extract only the sequence, we can use code such as:

    def read_sp_seq(filename):
        '''Read a SWISS-PROT flat file and return the protein sequence.'''
        try:
            f = open(filename, 'r')
        except IOError:
            print('Error in file', filename, 'cannot be opened for reading')
            return None  # nothing to do
        wholefile = f.readlines()  # read the whole file
        f.close()  # close file
        numlines = len(wholefile)
        # skip lines until we find the 'SQ' keyword;
        # the bounds check comes first so we never index past the end
        i = 0
        while i < numlines and wholefile[i][0:2] != 'SQ':
            i = i + 1
        if i == numlines:
            print('Error in file', filename, 'Not a swissprot file')
            return None  # nothing to do
        # the sequence starts on the line after 'SQ' and ends at '//'
        seq = ""  # our sequence
        i = i + 1
        while i < numlines and wholefile[i][0:2] != '//':
            listline = wholefile[i].split()  # drop the blanks between blocks
            seq = seq + "".join(listline)
            i = i + 1
        return seq

and use it as follows:

    >>> seq=read_sp_seq('OPSD_BOVIN')
    >>> seq
    'MNGTEGPNFYVPFSNKTGVVRSPFEAPQYYLAEPWQFSMLAAYMFLLIMLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQFRNCMVTTLCCGKNPLGDDEASTTVSKTETSQVAPA'
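
The same extraction can also be written without loading the whole file into a list. The following is only a minimal sketch, assuming a record laid out like the excerpt above (an 'SQ' line, then the sequence lines, then '//'). The names extract_sequence and read_sp_seq_stream are hypothetical and do not come from any library.

```python
def extract_sequence(lines):
    """Return the sequence between the 'SQ' line and the '//' terminator.

    `lines` can be any iterable of text lines (a list or an open file).
    Returns None if no 'SQ' line is found.
    """
    chunks = []
    in_sequence = False
    for line in lines:
        if in_sequence:
            if line.startswith('//'):
                break  # end of the record
            # remove the blanks that separate the 10-residue blocks
            chunks.append("".join(line.split()))
        elif line.startswith('SQ'):
            in_sequence = True  # sequence lines start on the next line
    if not in_sequence:
        return None
    return "".join(chunks)


def read_sp_seq_stream(filename):
    """Hypothetical variant of read_sp_seq that reads line by line."""
    # 'with' closes the file even if an error occurs while reading
    with open(filename, 'r') as f:
        return extract_sequence(f)
```

Because extract_sequence accepts any iterable of lines, it can be tried on a small list of strings before being pointed at a real file. Unlike read_sp_seq, this sketch does not catch the error raised when the file cannot be opened, so the caller has to handle it.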