FlatGFA: An Efficient Pangenome Representation

FlatGFA is an efficient on-disk and in-memory way to represent pangenomic variation graphs. It can losslessly represent GFA files. Here’s a quick example:

import flatgfa
from collections import Counter

graph = flatgfa.parse("something.gfa")
depths = Counter()
for path in graph.paths:
    for step in path:
        depths[step.segment.id] += 1

print('#node.id\tdepth')
for seg in graph.segments:
    print('{}\t{}'.format(seg.name, depths[seg.id]))

This example computes the node depth for every segment in a graph. It starts by parsing a GFA text file, but FlatGFA also has its own efficient binary representation—you can read and write this format with load() and FlatGFA.write_flatgfa().

The library is on PyPI, so you can get started by typing pip install flatgfa.

API Reference

Loading Data

The FlatGFA library can both read and write files in two formats: the standard GFA text format, and its own efficient binary representation (called “FlatGFA” files). Each of the functions below returns a FlatGFA object. Parsing GFA text can take some time, but loading a binary FlatGFA file should be very fast.

flatgfa.parse(filename)

Parse a GFA file into our FlatGFA representation.

flatgfa.parse_bytes(bytes)

Parse a GFA file from a bytestring into our FlatGFA representation.

flatgfa.load(filename)

Load a binary FlatGFA file.

This function should be fast to call because it does not actually read the file’s data. It memory-maps the file so subsequent accesses will actually read the data “on demand.” You can produce these files with FlatGFA.write_flatgfa().
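Because parsing is the slow step and loading is the fast one, a common pattern is to convert a GFA text file once and reuse the binary file afterwards. The following is a minimal sketch of that idea, not part of the library: the helper name load_or_convert and the file names in the usage comment are hypothetical, and it relies only on parse(), load(), and FlatGFA.write_flatgfa() as described here.

```python
import os


def load_or_convert(gfa_path, flat_path):
    """Return a graph, converting GFA text to the binary format on first use.

    Hypothetical helper: the name and the caching policy are assumptions.
    """
    # Imported lazily so the helper can be defined before flatgfa is installed.
    import flatgfa

    if not os.path.exists(flat_path):
        # Slow path, taken once: parse the text and save the binary form.
        flatgfa.parse(gfa_path).write_flatgfa(flat_path)
    # Fast path: memory-map the binary file.
    return flatgfa.load(flat_path)


# Example call (hypothetical file names):
# graph = load_or_convert("something.gfa", "something.flatgfa")
```

This sketch does not check whether the GFA text file has changed since the binary file was written; a real cache would probably compare modification times.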

GFA Graphs

The FlatGFA class provides the entry point to access the data either loaded from a FlatGFA binary file or parsed from a GFA text file. Most importantly, you can iterate over the Segment, Path, and Link objects that it contains. The FlatGFA class exposes list-like containers for each of these types:

for seg in graph.segments:
    print(seg.name)
print(graph.segments[0].sequence())

These containers support both iteration (like the for above) and random access (like graph.segments[0] above).

You can also write graphs out to disk using FlatGFA.write_gfa() (producing a standard GFA text file) and FlatGFA.write_flatgfa() (our binary format). If you just want a GFA string, use str(graph).
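As a small illustration of working with these containers, here is a sketch that gathers a few summary numbers using iteration alone. The function name summarize is an assumption, not a library function. It counts by iterating rather than calling len(), since only iteration and subscripting are described here.

```python
def summarize(graph):
    """Collect simple counts from a FlatGFA-like graph object.

    Hypothetical helper: uses only graph.segments, graph.paths, and
    Segment.sequence() as documented.
    """
    return {
        "segments": sum(1 for _ in graph.segments),
        "paths": sum(1 for _ in graph.paths),
        # Total number of bases across all segment sequences.
        "bases": sum(len(seg.sequence()) for seg in graph.segments),
    }


# Example call:
# print(summarize(flatgfa.parse("something.gfa")))
```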

class flatgfa.FlatGFA

An efficient representation of a Graphical Fragment Assembly (GFA) file.

links

The links (edges) in the graph, as a LinkList.

paths

The paths in the graph, as a PathList.

segments

The segments (nodes) in the graph, as a SegmentList.

write_flatgfa(filename)

Write the graph as a binary FlatGFA file.

You can read the resulting file with load().

write_gfa(filename)

Write the graph as a GFA text file.

The GFA Data Model

These classes represent the core data model for GFA graphs: Segment for vertices in the graph, Path for walks through the graph, and Link for edges in the graph. Internally, all of these objects only contain references to the underlying data stored in a FlatGFA, so they are very small, but accessing any of the associated data (such as the nucleotide sequence for a segment) requires further lookups.

The Handle class is a segment–orientation pair: both paths and links traverse these handles.

To get a GFA text representation of any of these objects, use str(obj). All these objects are equatable (so you can compare them with ==) and hashable (so you can store them in dicts and sets). This reflects equality on the underlying references to the data store, so two objects are equal if they refer to the same index in the same FlatGFA.
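Because the objects are hashable, they can go straight into sets. The sketch below uses that to find the segments two paths have in common. The function name shared_segments is hypothetical, and both paths are assumed to come from the same FlatGFA, since equality is defined per data store.

```python
def shared_segments(path_a, path_b):
    """Return the set of Segment objects visited by both paths.

    Hypothetical helper: relies on Segment objects being hashable and on
    each step exposing a .segment attribute.
    """
    seen = {step.segment for step in path_a}
    return {step.segment for step in path_b if step.segment in seen}


# Example call:
# common = shared_segments(graph.paths[0], graph.paths[1])
```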

class flatgfa.Segment

A segment in a GFA graph.

Segments are the nodes in the GFA graph. They have a unique ID and an associated nucleotide sequence.

id

The unique identifier for the segment, an int.

name

The segment’s name as declared in the GFA file, an int.

sequence()

Get the nucleotide sequence for the segment as a byte string.

This copies the underlying sequence data to construct the Python bytes object, so it is slow to use for large sequences.
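Since each call copies the data, it is usually worth calling sequence() once and reusing the result. A minimal sketch, with a hypothetical helper name gc_fraction, that computes GC content this way:

```python
def gc_fraction(segment):
    """Return the fraction of G/C bases in a segment's sequence.

    Hypothetical helper: calls sequence() a single time and reuses the
    resulting bytes object for both counts.
    """
    seq = segment.sequence().upper()  # normalize soft-masked (lowercase) bases
    if not seq:
        return 0.0
    return (seq.count(b"G") + seq.count(b"C")) / len(seq)


# Example call:
# print(gc_fraction(graph.segments[0]))
```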

class flatgfa.Path

A path in a GFA graph.

Paths are walks through the GFA graph, where each step is an oriented segment. This class is iterable over the steps in the path, each of which is a Handle, so use something like this:

for step in path:
    print(step.segment.name)

to walk through a path’s steps.

id

The unique identifier for the path, an int.

name

Get the name of this path as declared in the GFA file, as a string.

steps

Get a list of steps in this path.

For convenience, the path itself provides direct access to the step list. So, for example, path.steps[4] is the same as path[4].
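To show how steps combine a segment with an orientation, the sketch below renders a path as a comma-separated list such as 1+,2-. The function name format_steps is hypothetical, and it assumes that is_forward is truthy for forward-oriented steps.

```python
def format_steps(path):
    """Render a path's steps as a string like "1+,2-".

    Hypothetical helper: assumes each step has .segment.name and a
    truthy .is_forward for the forward orientation.
    """
    return ",".join(
        "{}{}".format(step.segment.name, "+" if step.is_forward else "-")
        for step in path
    )


# Example call:
# print(format_steps(graph.paths[0]))
```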

class flatgfa.Link

A link in a GFA graph.

Links are directed edges between oriented segments. The source and sink are both Handle objects, i.e., the “forward” or “backward” direction of a given segment.

from_

The edge’s source handle.

id

The unique identifier for the link.

to

The edge’s sink handle.
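A link's two handles are enough to build a simple adjacency map. The following sketch is one way to do that. The function name successors is hypothetical, and it takes any iterable of Link objects (for example, the graph's LinkList) as its argument.

```python
from collections import defaultdict


def successors(links):
    """Map each source handle to the list of handles it links to.

    Hypothetical helper: handles are keyed as (seg_id, is_forward) tuples
    so the result is plain Python data.
    """
    succ = defaultdict(list)
    for link in links:
        src = (link.from_.seg_id, link.from_.is_forward)
        dst = (link.to.seg_id, link.to.is_forward)
        succ[src].append(dst)
    return dict(succ)
```

Keying by tuple rather than by Handle is a design choice for this sketch: it keeps the result independent of the graph object that produced it.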

class flatgfa.Handle

An oriented segment reference.

Because both paths and links connect oriented segments rather than the segments themselves, they use this class to distinguish between (for example) 5+ and 5-.

is_forward

The orientation: whether the handle refers to the segment in its forward direction (as opposed to backward).

seg_id

The segment ID, an int.

segment

The segment, as a Segment object.
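One practical use of the orientation is to read a segment's sequence in the direction the handle traverses it. The sketch below returns the reverse complement for backward handles. The function name oriented_sequence is hypothetical, and it assumes is_forward is truthy for the forward orientation and that sequences use the A, C, G, T alphabet (other bytes, such as N, pass through unchanged).

```python
# Translation table for complementing nucleotides, preserving case.
_COMPLEMENT = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")


def oriented_sequence(handle):
    """Return the handle's sequence, reverse-complemented if backward.

    Hypothetical helper built on Handle.segment, Handle.is_forward, and
    Segment.sequence().
    """
    seq = handle.segment.sequence()
    if handle.is_forward:
        return seq
    # Complement each base, then reverse the whole sequence.
    return seq.translate(_COMPLEMENT)[::-1]
```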

Iteration

The FlatGFA library exposes special container classes to access the Segment, Path, and Link objects that make up a GFA graph. These classes are meant to behave much like Python list objects while supporting efficient iteration over FlatGFA’s internal representation.

All of these container objects support subscripting (like graph.segments[i] where i is an integer index) and iteration.

class flatgfa.SegmentList

A sequence of Segment objects.

find(name)

Find a segment by its name (an int), or return None if not found.

class flatgfa.PathList

A sequence of Path objects.

find(name)

Find a path by its name (a string), or return None if not found.
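Since find() returns None for a missing name, callers need to check the result before using it. A minimal sketch of a wrapper that raises instead follows. The name require is hypothetical, and it works with any container offering a find(name) method as described above.

```python
def require(container, name):
    """Look up an item by name, raising KeyError if it is absent.

    Hypothetical helper around the documented find(name) method, which
    returns None when nothing matches.
    """
    item = container.find(name)
    if item is None:
        raise KeyError(name)
    return item


# Example calls:
# seg = require(graph.segments, 5)        # segment names are ints
# path = require(graph.paths, "chr1")     # path names are strings; "chr1" is a made-up name
```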

class flatgfa.LinkList

A sequence of Link objects.

class flatgfa.StepList

A list of Handle objects, such as a sequence of path steps.