Quick usage of khloraa scaffolding python package#

Here we will run and explore the results step by step.

Configure a virtual environment#

You can use .venv_39 to run this Jupyter notebook.

python3.9 -m pip install virtualenv
python3.9 -m virtualenv .venv_39
source ./.venv_39/bin/activate

pip3.9 install khloraascaf

Run khloraascaf#

In this section we will run khloraascaf.

Import what we need#

#
# * Path utilities from standard library
#
from pathlib import Path
#
# * The main scaffolding function from khloraascaf
#  
from khloraascaf import scaffolding
#
# * One of the accepted solver option constant
#
from khloraascaf.inputs import SOLVER_CBC

Use khloraascaf function#

#
# Prepare the scaffolding result directory
#
outdir = Path('scaffolding_result')
outdir.mkdir(exist_ok=True)
#
# Compute the scaffolding using the assembly data
#
outdir_gen = scaffolding(
    Path('../data/ir_sc/contig_attrs.tsv'),
    Path('../data/ir_sc/contig_links.tsv'),
    'C0',
    solver=SOLVER_CBC,
    outdir=outdir,
)
#
# khloraascaf creates a directory with unique name
#   to put all the files it has created
#
assert outdir_gen in outdir.glob('*')
print(f'Uniquely named directory: {outdir_gen}')

Retrieve the input / output configuration of your run#

Thanks to the IOConfig class you can easily retrieve the arguments and the options used to run khloraa_scaffolding.

from khloraascaf import IOConfig

io_cfg = IOConfig.from_run_directory(outdir_gen)

print('Arguments used:')
print(f'* Contig attributes file: {io_cfg.contig_attrs()}')
print(f'* Contig links file: {io_cfg.contig_links()}')
print(f'* Contig starter: {io_cfg.contig_starter()}')
print()
print('Options used:')
print(f'* Solver: {io_cfg.solver()}')
print(f'* Output directory: {io_cfg.outdir()}')
print(f'* Instance name: {io_cfg.instance_name()}')
# And other that are more for combinatorial and technical issues ...

Dive into the results#

In this section we will see how to manipulate the result and first know what has been produced.

Explore the produced files with the solutions metadata class#

from khloraascaf import MetadataAllSolutions
#
# Use metadata class to easily dive into the results
# (you can also see by hand the solutions.yaml file that has been produced)
#
all_solutions_metadata = MetadataAllSolutions.from_run_directory(outdir_gen)
#
# * How many solutions the scaffolding has produced?
#
print(f'Number of solutions: {len(all_solutions_metadata.solutions())}')
#
# Let's pick the solution
#
sol_metadata = all_solutions_metadata.solutions()[0]
#
# What is the problem succession that has built this solution?
#
print('Problem succession:')
for ilp_code in sol_metadata.ilp_combination():
    print(f'* {ilp_code}')
#
# See which files the scaffolding has produced:
#
print(f'Files in {outdir_gen}:')
for file in outdir_gen.glob('*'):
    print(f'* {file.relative_to(outdir_gen)}')
#
# The metadata classes give the path of the produced file 
# relatively to the YAML file path
#
print('The list of oriented contigs for each region')
print(f'* {sol_metadata.contigs_of_regions()}')
print()
print('The list of oriented regions')
print(f'* {sol_metadata.map_of_regions()}')
print()
#
# The YAML files
#
print(
    'YAML file containing all the arguments and options you used'
    ' to run khloraascaf',
)
print(f'* {IOConfig.YAML_FILE}')
print()
print('YAML file that contains metadata on the solutions')
print(f'* {MetadataAllSolutions.YAML_FILE}')

Explore the multimeric forms of the chloroplast#

The solution we get represents one order of the oriented contigs. But it is known that chloroplast genomes can possess multimeric forms, especially due to inverted repeats.

Here the solution was produced from the combination of the \(\mathcal{IRP}\) and \(\mathcal{SCP}\) problems. This means that the solution represents a genome architecture with inverted repeats (IR) and single-copy regions (SC).

Let’s explore the multimeric forms thanks to the graph of regions.

from khloraascaf import AssemblyGraph
#
# Use the previous solution metadata object: easy!
#
asm_graph = AssemblyGraph.from_solution_metadata(sol_metadata)
#
# Get all the regions configuration
#
multimeric_forms = list(asm_graph.all_region_paths())
#
# Let explore them (with a bit of string formatting)
#
from khloraascaf import ORIENT_INT_STR
print(f'Number of multimeric forms: {len(multimeric_forms)}')
for region_path in multimeric_forms:
    string_orreg_order = ''
    for region_ind, region_or in region_path:
        string_orreg_order += f' {region_ind} {ORIENT_INT_STR[region_or]}'
    print(f'* {string_orreg_order}')

Each number corresponds to a region identifier, and it is followed by a sign (+ or -) that gives the orientation of the region.

We obtain two multimeric forms. They differ with the region 2: in one of the form, the region 2 is in reverse orientation (2 -) comparing to the other (2 +). This is due to the presence of an inverted repeat, here the region 1.

For each multimeric form (corresponding to an oriented region order), we can obtain the corresponding oriented contig order:

for region_path in multimeric_forms:
    string_orc_order = ''
    for c_id, c_or in asm_graph.region_path_to_oriented_contigs(region_path):
        string_orc_order += f' {c_id} {ORIENT_INT_STR[c_or]}'
    print(f'* {string_orc_order}')

Already here we observe the inversion of the region 2, that is composed of contig C4.

If you want to write the multimeric forms in a file, you can:

from khloraascaf import (
    write_region_paths, fmt_region_paths_filename, 
    write_oriented_contig_paths, fmt_oriented_contig_paths_filename,
)
#
# Write the orders of oriented regions in a file
#
region_orders_file = outdir_gen / fmt_region_paths_filename(
    io_cfg.instance_name(), sol_metadata.ilp_combination(),
)
write_region_paths(multimeric_forms, region_orders_file)
#
# Write the orders of oriented contigs in a file
#
contig_orders_file = outdir_gen / fmt_oriented_contig_paths_filename(
    io_cfg.instance_name(), sol_metadata.ilp_combination(),
)
write_oriented_contig_paths(
    asm_graph.all_oriented_contig_paths(),
    contig_orders_file,
)
#
# See new files
#
print(f'File in {outdir_gen}:')
for file in outdir_gen.glob('*'):
    print(f'* {file.relative_to(outdir_gen)}')

Cleaning the bazar#

Clean what you have done:

#
# Clean all
#
for file in outdir_gen.glob('*'):
    file.unlink()
outdir_gen.rmdir()
outdir.rmdir()