A common task is to get the lineage for a list of species. I did not find anything easily adaptable, so I wrote a small script that uses the NCBI Taxonomy to annotate the lineage of every species listed in an input file.
To get the script and all the data, visit the Download section and get lineage.zip.
To run the script (posted below and also included in lineage.zip) from the shell (terminal), you need a file with species names, one per line. For example, if the file is called species.txt, the command is:
python lineage.py species.txt
When the script finishes, it produces a file, species_lineage.txt, with one line per input species. Each line reports the complete lineage, starting with the species itself and moving up toward the root, with every element followed by “|”.
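If you want to use the result in another script, the output file is easy to read back. The snippet below is only a minimal sketch: the function names parse_lineage_line and read_lineages are my own and are not part of lineage.py, and it assumes every line follows the “|”-separated layout described above. Lines on which the script reported a problem are not treated specially, so they end up in the dictionary as well.

```python
def parse_lineage_line(line):
    """Split one line of the output file into its lineage elements."""
    # Every element is followed by "|", so the last field is empty; drop empties.
    return [part for part in line.strip().split("|") if part]


def read_lineages(path):
    """Return {first element: list of the remaining elements} for each line."""
    lineages = {}
    with open(path) as handle:
        for line in handle:
            parts = parse_lineage_line(line)
            if parts:
                lineages[parts[0]] = parts[1:]
    return lineages


# Hypothetical usage, once species_lineage.txt exists:
# lineages = read_lineages("species_lineage.txt")
```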
# -*- coding: utf-8 -*-
"""
Created on Fri Nov  4 14:54:12 2016

@author: Krzysztof
"""
# Requires Python 3 (uses print(..., file=...)).
import collections as col
import sys


def Tree():
    # Nested defaultdict used as a generic lookup table.
    return col.defaultdict(Tree)


def get_values(listin, val):
    # Pick the columns listed in val from one split line, upper-cased.
    out = []
    for i in val:
        out.append(listin[i].upper())
    return out


def build_info(f, idxkey, idxvals, filtering="", sep="\t|\t"):
    # Parse an NCBI dump file into {key column: value column(s)}.
    # If filtering is set, only lines containing that text are used.
    tree = Tree()
    with open(f) as f:
        for i in f:
            if filtering != "" and filtering not in i:
                continue
            data = i.split(sep)
            if len(idxvals) == 1:
                tree[data[idxkey]] = data[idxvals[0]].upper()
            else:
                val = get_values(data, idxvals)
                tree[data[idxkey]] = val
    return tree


def reverse_dict(dictold):
    # Swap keys and values: {taxid: name} becomes {name: taxid}.
    new = Tree()
    for i in dictold:
        new[dictold[i]] = i
    return new


def traverse_tree(dictionary, idx):
    # Walk from a taxid up through its parents; stop just below the root
    # (taxid 1) or as soon as a taxid is missing from nodes.dmp.
    allelements = [idx]
    key = idx
    while key in dictionary:
        parent = dictionary[key][0]
        if parent == key or parent == '1':
            break
        allelements.append(parent)
        key = parent
    return allelements


def get_lineage(taxids, names, nodes):
    # Join the scientific names of all taxids, each followed by "|".
    string = ""
    for i in taxids:
        string += names[i] + "|"
    return string


def main(nodes, names, namesrev):
    f = sys.argv[1]
    out = f.replace(".txt", "_lineage.txt")
    with open(f) as f, open(out, "w") as out:
        for i in f:
            i = i.strip().split()
            if len(i) != 2 and "Virus" not in i:
                # i is a list of words here, so join it before writing.
                print(" ".join(i) + "|" + "Species name should consist of "
                      "two words if it is not a virus", file=out)
                continue
            specie = " ".join(i).upper()
            if specie not in namesrev:
                print(specie + "|" + "Most likely your species contains a typo",
                      file=out)
            else:
                taxidspecies = namesrev[specie]
                taxids = traverse_tree(nodes, taxidspecies)
                lineage = get_lineage(taxids, names, nodes)
                print(lineage, file=out)


if __name__ == "__main__":
    # nodes: {taxid: [parent taxid, rank]}; names: {taxid: scientific name}
    nodes = build_info("nodes.dmp", 0, [1, 2])
    names = build_info("names.dmp", 0, [1], "scientific name")
    namesrev = reverse_dict(names)
    main(nodes, names, namesrev)
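To make it easier to see what the script does with nodes.dmp and names.dmp, here is a small self-contained sketch of the same idea on toy data. The taxids and names below are made up for illustration, the real dump files have more columns per line, and the helper names (parse_nodes, parse_names, lineage) are mine rather than the ones used in lineage.py.

```python
SEP = "\t|\t"  # column separator used by the NCBI dump files

# Toy stand-ins for nodes.dmp (taxid, parent taxid, rank); values are made up.
NODES_TEXT = (
    "1\t|\t1\t|\tno rank\t|\n"
    "10\t|\t1\t|\tfamily\t|\n"
    "20\t|\t10\t|\tgenus\t|\n"
    "30\t|\t20\t|\tspecies\t|\n"
)
# Toy stand-ins for names.dmp (taxid, name, unique name, name class).
NAMES_TEXT = (
    "10\t|\tToyidae\t|\t\t|\tscientific name\t|\n"
    "20\t|\tToyus\t|\t\t|\tscientific name\t|\n"
    "30\t|\tToyus exampli\t|\t\t|\tscientific name\t|\n"
    "30\t|\ttoy example\t|\t\t|\tcommon name\t|\n"
)


def parse_nodes(text):
    """Map every taxid to its parent taxid."""
    parents = {}
    for line in text.splitlines():
        fields = line.split(SEP)
        parents[fields[0]] = fields[1]
    return parents


def parse_names(text):
    """Map every taxid to its upper-cased scientific name."""
    names = {}
    for line in text.splitlines():
        if "scientific name" not in line:
            continue  # skip synonyms, common names, etc.
        fields = line.split(SEP)
        names[fields[0]] = fields[1].upper()
    return names


def lineage(taxid, parents, names):
    """Follow parent links upward, stopping just below the root (taxid 1)."""
    chain = [taxid]
    while chain[-1] in parents:
        parent = parents[chain[-1]]
        if parent == chain[-1] or parent == "1":
            break
        chain.append(parent)
    return "|".join(names[t] for t in chain) + "|"


parents = parse_nodes(NODES_TEXT)
names = parse_names(NAMES_TEXT)
taxid_by_name = {name: taxid for taxid, name in names.items()}
print(lineage(taxid_by_name["TOYUS EXAMPLI"], parents, names))
# prints: TOYUS EXAMPLI|TOYUS|TOYIDAE|
```

The lookup by upper-cased name is why the script is not sensitive to the capitalisation of the species names in the input file.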
Alternatively, to use the newest annotation, visit the NCBI Taxonomy FTP site, download taxdmp.zip, and unzip the archive. Then create a file called lineage.py inside the new directory and paste in the script posted above; it reads nodes.dmp and names.dmp from the directory it is run in. Run it as described above.
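The download and unzip steps can also be done from Python. This is a rough sketch, not something shipped in lineage.zip: the URL is my assumption about where taxdmp.zip currently lives on the NCBI server, so please check it before relying on it, and the function names are hypothetical. Only the two files the script actually reads are extracted.

```python
import os
import urllib.request
import zipfile

# Assumed location of the archive; verify it on the NCBI Taxonomy FTP site.
TAXDMP_URL = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip"
NEEDED = ("nodes.dmp", "names.dmp")  # the files lineage.py opens


def download_taxdmp(url=TAXDMP_URL, dest="taxdmp.zip"):
    """Download the taxonomy dump archive and return the local path."""
    urllib.request.urlretrieve(url, dest)
    return dest


def extract_dumps(zip_path, target_dir="."):
    """Extract nodes.dmp and names.dmp from the archive into target_dir."""
    with zipfile.ZipFile(zip_path) as archive:
        available = set(archive.namelist())
        missing = [name for name in NEEDED if name not in available]
        if missing:
            raise FileNotFoundError("archive lacks: " + ", ".join(missing))
        for name in NEEDED:
            archive.extract(name, target_dir)
    return [os.path.join(target_dir, name) for name in NEEDED]


# Hypothetical usage (needs network access):
# extract_dumps(download_taxdmp())
```

After that, put lineage.py and your species file next to the extracted files and run the script as before.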