Calculating taxonomic purity of extracted bins
This is documentation on how to use scripts/tax-classify.py to estimate the taxonomic purity of bins of k-mers.
Getting set up
First, we'll need an LCA database, which combines signatures with taxonomic information; the command sourmash lca index builds such a database.
For this we need two things:
- a bunch of signatures, with names;
- the taxonomic information for those signatures, calculated e.g. by dib-lab/2018-ncbi-lineages.
For the podar data set, the taxonomic lineages are here: spacegraphcats/data/podar-lineages.csv
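As a rough illustration of what a lineage file of this kind contains, the sketch below parses a tiny in-memory CSV. The column layout (an identifier, a taxid, then ranks from superkingdom to species) and both rows are assumptions made for illustration, not the contents of podar-lineages.csv; check the header of the real file before relying on any of these names.

```python
import csv
import io

# Hypothetical lineage rows; the real file may use different columns.
EXAMPLE_CSV = """accession,taxid,superkingdom,phylum,class,order,family,genus,species
ACC0001,1111,Bacteria,Proteobacteria,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,Escherichia coli
ACC0002,2222,Archaea,Euryarchaeota,Methanococci,Methanococcales,Methanocaldococcaceae,Methanocaldococcus,
"""

RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def load_lineages(handle, ranks=RANKS, id_column="accession"):
    """Map each identifier to a tuple of (rank, name) pairs.

    A lineage stops at the first rank that is empty or missing.
    """
    lineages = {}
    for row in csv.DictReader(handle):
        lineage = []
        for rank in ranks:
            name = (row.get(rank) or "").strip()
            if not name:
                break
            lineage.append((rank, name))
        lineages[row[id_column]] = tuple(lineage)
    return lineages


lineages = load_lineages(io.StringIO(EXAMPLE_CSV))
for ident, lineage in lineages.items():
    # Print each identifier with its lineage as a semicolon-separated string
    print(ident, ";".join(name for _, name in lineage))
```

In this made-up example the second genome has no species entry, so its lineage ends at genus.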
To calculate the signatures for the podar data set, we should use a low scaled value (here, 100), which retains more hashes per genome and so gives higher resolution to the taxonomic classification:
sourmash compute -k 31 --scaled=100 {?,??}.fa -f --name-from-first
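To make the effect of --scaled concrete, here is a minimal sketch of scaled subsampling: a hash is kept only if it falls below roughly 2^64 / scaled, so about one in every `scaled` distinct k-mers is retained. Note the assumptions: sourmash hashes canonical k-mers with MurmurHash, whereas this sketch swaps in a SHA-1-derived 64-bit hash from the standard library, skips canonicalization, and runs on random DNA, purely for illustration.

```python
import hashlib
import random


def hash64(kmer):
    """Stand-in 64-bit hash (SHA-1 based); sourmash itself uses MurmurHash."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def scaled_hashes(seq, ksize, scaled):
    """Return the hashes of all k-mers in seq that fall below 2**64 / scaled."""
    max_hash = 2**64 // scaled
    kmers = (seq[i:i + ksize] for i in range(len(seq) - ksize + 1))
    return {h for h in map(hash64, kmers) if h < max_hash}


# Random sequence used only as illustrative input
random.seed(1)
seq = "".join(random.choice("ACGT") for _ in range(50_000))

h100 = scaled_hashes(seq, 31, 100)
h1000 = scaled_hashes(seq, 31, 1000)

# A lower scaled value keeps more hashes
print(len(h100), len(h1000))

# Hashes kept at scaled=1000 are a subset of those kept at scaled=100
print(h1000 <= h100)
```

Because the threshold for a larger scaled value is lower, a sketch computed at scaled=100 can be downsampled to scaled=1000, but not the other way around.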
Then, to calculate an LCA database, do:
sourmash lca index podar-lineage.csv podar-ref.scaled100.lca.json \
{?,??}.fa.sig \
--scaled 100 -C 3 --split-identifiers
This will result in a file podar-ref.scaled100.lca.json that you can use with tax-classify.py.
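For intuition about what an LCA database lets you compute: when the same hash occurs in several genomes, it can be assigned to the lowest common ancestor of their lineages, that is, the longest prefix the lineages share. The sketch below is a minimal illustration using hypothetical lineages written as tuples of names; it is not sourmash's implementation.

```python
def lowest_common_ancestor(lineages):
    """Return the longest prefix shared by every lineage (possibly empty)."""
    lineages = list(lineages)
    if not lineages:
        return ()
    prefix = []
    # zip stops at the shortest lineage, so the prefix never overruns any of them
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        prefix.append(names[0])
    return tuple(prefix)


# Hypothetical lineages, superkingdom first
ecoli = ("Bacteria", "Proteobacteria", "Gammaproteobacteria",
         "Enterobacterales", "Enterobacteriaceae",
         "Escherichia", "Escherichia coli")
salmonella = ("Bacteria", "Proteobacteria", "Gammaproteobacteria",
              "Enterobacterales", "Enterobacteriaceae",
              "Salmonella", "Salmonella enterica")
archaeon = ("Archaea", "Euryarchaeota")

# Shared down to family level
print(lowest_common_ancestor([ecoli, salmonella]))

# Nothing shared, not even the superkingdom
print(lowest_common_ancestor([ecoli, archaeon]))
```

A hash found only in one genome keeps that genome's full lineage, while a hash shared across distant genomes collapses toward the root and carries little taxonomic information.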
Running tax-classify on catlas region output
First, generate the .node_mh file from a catlas with the appropriate scaled value; it should be no smaller than the one used when computing the signatures and building the LCA database, above.
python -m search.characterize_catlas_regions twofoo_k31_r1 twofoo.vec --scaled=100
Then:
scripts/tax-classify.py twofoo.vec podar-ref.scaled100.lca.json
Running tax-classify-sigs on catlas search output
You can run the same code as in tax-classify.py on signatures produced by search.extract_nodes_by_query, using a different front-end script:
scripts/tax-classify-sigs.py ./podarV_k31_r1_search_oh0_jan19/[012].fa.contigs.sig \
~/dev/sourmash/podar-ref.1k.lca.json
Here, the output is more detailed because we expect more of these bins to be low purity.
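The purity calculation itself is not spelled out here. As a hedged sketch of one plausible definition (an assumption, not necessarily what tax-classify.py or tax-classify-sigs.py reports), the purity of a bin at a given rank can be taken as the fraction of its classified hashes that agree with the most common taxon at that rank. The function name and the per-hash lineage assignments below are made up for illustration.

```python
from collections import Counter


def purity_at_rank(assignments, depth):
    """Fraction of classified hashes sharing the most common taxon at `depth`.

    assignments: one lineage tuple per hash. Hashes whose lineage is shorter
    than `depth` are treated as unclassified at that rank and skipped.
    Returns (top_taxon, purity), or (None, 0.0) if nothing is classified.
    """
    counts = Counter(lin[:depth] for lin in assignments if len(lin) >= depth)
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    top, n = counts.most_common(1)[0]
    return top, n / total


# Hypothetical bin: 8 hashes from one phylum, 2 from another,
# and 5 that can only be placed at superkingdom level
bin_assignments = (
    [("Bacteria", "Proteobacteria")] * 8
    + [("Bacteria", "Firmicutes")] * 2
    + [("Bacteria",)] * 5
)

# Depth 1 is superkingdom, depth 2 is phylum
print(purity_at_rank(bin_assignments, 1))
print(purity_at_rank(bin_assignments, 2))
```

Under this definition the example bin is pure at superkingdom level but only 80% pure at phylum level, which is the kind of drop a low-purity bin would show as you move toward species.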