FlatGFA: An Efficient Pangenome Representation¶

FlatGFA is an efficient on-disk and in-memory way to represent pangenomic variation graphs. It can losslessly represent GFA files. Here’s a quick example:

import flatgfa
from collections import Counter

graph = flatgfa.parse("something.gfa")
depths = Counter()
for path in graph.paths:
    for step in path:
        depths[step.segment.id] += 1

print('#node.id\tdepth')
for seg in graph.segments:
    print('{}\t{}'.format(seg.name, depths[seg.id]))

This example computes the node depth for every segment in a graph. It starts by parsing a GFA text file, but FlatGFA also has its own efficient binary representation, which you can read with load() and write with FlatGFA.write_flatgfa().

The library is on PyPI, so you can get started by typing pip install flatgfa.

API Reference¶

Loading Data¶

The FlatGFA library can both read and write files in two formats: the standard GFA text format, and its own efficient binary representation (called “FlatGFA” files). Each of the functions below returns a FlatGFA object. Parsing GFA text can take some time, but loading a binary FlatGFA file should be very fast.

flatgfa.parse(filename)¶: Parse a GFA file into our FlatGFA representation.

flatgfa.parse_bytes(bytes)¶: Parse a GFA file from a bytestring into our FlatGFA representation.

flatgfa.load(filename)¶

Load a binary FlatGFA file.

This function should be fast to call because it does not read the file’s data up front. Instead, it memory-maps the file, so subsequent accesses read the data “on demand.” You can produce these files with FlatGFA.write_flatgfa().
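One way to take advantage of the fast binary format is to parse a GFA text file once and reuse the binary file on later runs. The sketch below assumes that caching pattern: load_or_parse and both file paths are hypothetical, and only parse(), load(), and FlatGFA.write_flatgfa() come from the library. The import sits inside the function only so that the snippet can be defined without the package installed.

```python
import os


def load_or_parse(gfa_path, flatgfa_path):
    """Return a graph, preferring the binary file if it already exists.

    Hypothetical helper: the name and the caching policy are illustrative
    and are not part of the flatgfa API.
    """
    import flatgfa  # imported lazily; requires `pip install flatgfa`

    if os.path.exists(flatgfa_path):
        # load() memory-maps the file instead of reading it all up front.
        return flatgfa.load(flatgfa_path)

    # Slow path: parse the text once, then save the binary form for next time.
    graph = flatgfa.parse(gfa_path)
    graph.write_flatgfa(flatgfa_path)
    return graph
```

A call might look like load_or_parse("something.gfa", "something.flatgfa"), where the binary file name is arbitrary. This sketch does not check whether the text file has changed since the binary file was written.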

GFA Graphs¶

The FlatGFA class provides the entry point to access the data either loaded from a FlatGFA binary file or parsed from a GFA text file. Most importantly, you can iterate over the Segment, Path, and Link objects that it contains. The FlatGFA class exposes list-like containers for each of these types:

for seg in graph.segments:
    print(seg.name)
print(graph.segments[0].sequence())

These containers support both iteration (like the for above) and random access (like graph.segments[0] above).

You can also write graphs out to disk using FlatGFA.write_gfa() (producing a standard GFA text file) and FlatGFA.write_flatgfa() (our binary format). If you just want a GFA string, use str(graph).
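As a rough illustration of how the three containers fit together, the following sketch walks each of them once to build a small summary. The summarize helper and its dictionary keys are hypothetical; the sketch relies only on iteration over graph.segments, graph.paths, and graph.links, and on Segment.sequence() returning bytes.

```python
def summarize(graph):
    """Count the objects in a graph and its total sequence length.

    `graph` is assumed to be a flatgfa.FlatGFA object; `summarize` itself
    is a hypothetical helper, not part of the library.
    """
    num_segments = 0
    total_bases = 0
    for seg in graph.segments:
        num_segments += 1
        total_bases += len(seg.sequence())  # sequence() returns bytes

    # Counting by iteration uses only the documented iteration support.
    num_paths = sum(1 for _ in graph.paths)
    num_links = sum(1 for _ in graph.links)

    return {
        "segments": num_segments,
        "paths": num_paths,
        "links": num_links,
        "bases": total_bases,
    }
```

With a parsed graph, print(summarize(graph)) would give a quick overview before any heavier analysis.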

class flatgfa.FlatGFA¶

An efficient representation of a Graphical Fragment Assembly (GFA) file.

links¶: The links (edges) in the graph, as a LinkList.

paths¶: The paths in the graph, as a PathList.

segments¶: The segments (nodes) in the graph, as a SegmentList.

write_flatgfa(filename)¶

Write the graph as a binary FlatGFA file.

You can read the resulting file with load().

write_gfa(filename)¶: Write the graph as a GFA text file.

The GFA Data Model¶

These classes represent the core data model for GFA graphs: Segment for vertices in the graph, Path for walks through the graph, and Link for edges in the graph. Internally, all of these objects only contain references to the underlying data stored in a FlatGFA, so they are very small, but accessing any of the associated data (such as the nucleotide sequence for a segment) requires further lookups.

The Handle class is a segment–orientation pair: both paths and links traverse these handles.

To get a GFA text representation of any of these objects, use str(obj). All these objects are equatable (so you can compare them with ==) and hashable (so you can store them in dicts and sets). This reflects equality on the underlying references to the data store, so two objects are equal if they refer to the same index in the same FlatGFA.
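Because these objects are hashable, they can go straight into Python sets. As a sketch of what that enables, the hypothetical helper below finds the segments that every path in a graph visits; it assumes only that iterating a path yields steps with a segment attribute.

```python
def segments_on_every_path(graph):
    """Return the set of Segment objects that every path visits.

    Hypothetical helper; it relies on Segment objects being hashable and
    comparable with ==, so they can be stored in sets and intersected.
    """
    shared = None
    for path in graph.paths:
        visited = {step.segment for step in path}
        # The first path seeds the set; later paths narrow it down.
        shared = visited if shared is None else shared & visited

    # A graph with no paths has no shared segments.
    return shared if shared is not None else set()
```

The returned set holds Segment objects, so something like sorted(seg.name for seg in segments_on_every_path(graph)) would list their names.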

class flatgfa.Segment¶

A segment in a GFA graph.

Segments are the nodes in the GFA graph. They have a unique ID and an associated nucleotide sequence.

id¶: The unique identifier for the segment, an int.

name¶: The segment’s name as declared in the GFA file, an int.

sequence()¶

Get the nucleotide sequence for the segment as a byte string.

This copies the underlying sequence data to construct the Python bytes object, so it is slow to use for large sequences.
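Since every call to sequence() makes a copy, it can help to call it once and reuse the result. The sketch below does that to compute a segment's GC fraction; gc_fraction is a hypothetical helper, and treating lowercase bases as equivalent to uppercase ones is an assumption of this example.

```python
def gc_fraction(seg):
    """Return the fraction of G/C bases in a segment's sequence.

    Hypothetical helper. `seg` is assumed to be a flatgfa.Segment.
    """
    seq = seg.sequence()  # call once: each call copies the data
    if not seq:
        return 0.0  # avoid dividing by zero for an empty sequence

    # bytes.count() scans in C, so no Python-level loop over bases is needed.
    gc = sum(seq.count(base) for base in (b"G", b"C", b"g", b"c"))
    return gc / len(seq)
```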

class flatgfa.Path¶

A path in a GFA graph.

Paths are walks through the GFA graph, where each step is an oriented segment. This class is an iterable over the steps in the path (each one a Handle), so use something like this:

for step in path:
    print(step.segment.name)

to walk through a path’s steps.

id¶: The unique identifier for the path, an int.

name¶: Get the name of this path as declared in the GFA file, as a string.

steps¶

Get a list of steps in this path.

For convenience, the path itself provides direct access to the step list. So, for example, path.steps[4] is the same as path[4].
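To show how the steps can be consumed, here is a sketch that renders a path's steps in the comma-separated, signed style that GFA path lines use. The helper name path_steps_text is hypothetical, and the code assumes that a step's is_forward attribute is truthy for the forward orientation.

```python
def path_steps_text(path):
    """Format a path's steps as signed segment names, such as "1+,2-".

    Hypothetical helper. `path` is assumed to be a flatgfa.Path; iterating
    it directly is the same as iterating over path.steps.
    """
    parts = []
    for step in path:
        # Assumption: is_forward is truthy for "+" and falsy for "-".
        sign = "+" if step.is_forward else "-"
        parts.append(f"{step.segment.name}{sign}")
    return ",".join(parts)
```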

class flatgfa.Link¶

A link in a GFA graph.

Links are directed edges between oriented segments. The source and sink are both Handle objects, i.e., the “forward” or “backward” direction of a given segment.

from_¶: The edge’s source handle.

id¶: The unique identifier for the link.

to¶: The edge’s sink handle.
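As a sketch of how links can be turned into a more convenient structure, the hypothetical helper below builds an adjacency map from each source handle to its sink handles. It keys on (seg_id, is_forward) tuples, which is a choice made for this example and needs nothing beyond the documented Handle attributes.

```python
from collections import defaultdict


def successors(graph):
    """Map each oriented segment to the oriented segments its links reach.

    Hypothetical helper. Keys and values are (seg_id, is_forward) tuples
    built from the documented Handle attributes.
    """
    adjacency = defaultdict(list)
    for link in graph.links:
        source = (link.from_.seg_id, link.from_.is_forward)
        sink = (link.to.seg_id, link.to.is_forward)
        adjacency[source].append(sink)

    # Return a plain dict so that missing keys raise KeyError as usual.
    return dict(adjacency)
```

Handles with no outgoing links do not appear as keys, so successors(graph).get(key, []) is a safe way to query the result.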

class flatgfa.Handle¶

An oriented segment reference.

Because both paths and links connect oriented segments rather than the segments themselves, they use this class to distinguish between (for example) 5+ and 5-.

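is_forward¶: The orientation: whether the handle refers to the segment in its forward direction (as in 5+) rather than its backward direction (as in 5-).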

seg_id¶: The segment ID, an int.

segment¶: The segment, as a Segment object.
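One common use of the orientation is to read a segment's sequence in the direction the handle points. The sketch below assumes the usual convention that a backward handle means the reverse complement of the segment's sequence, and that is_forward is truthy for the forward orientation; oriented_sequence is a hypothetical helper, not a library function.

```python
# Translation table that complements nucleotide bytes; other bytes pass through.
_COMPLEMENT = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")


def oriented_sequence(handle):
    """Return the sequence a handle spells out, as bytes.

    Hypothetical helper. Assumes a backward handle denotes the reverse
    complement of the segment's stored sequence.
    """
    seq = handle.segment.sequence()
    if handle.is_forward:
        return seq
    # Complement every base, then reverse the result.
    return seq.translate(_COMPLEMENT)[::-1]
```

Under the same assumptions, joining the results over a path's steps, as in b"".join(oriented_sequence(step) for step in path), would spell out the path's full sequence.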

Iteration¶

The FlatGFA library exposes special container classes to access the Segment, Path, and Link objects that make up a GFA graph. These classes are meant to behave much like Python list objects while supporting efficient iteration over FlatGFA’s internal representation.

All of these container objects support subscripting (like graph.segments[i] where i is an integer index) and iteration.

class flatgfa.SegmentList¶

A sequence of Segment objects.

find(name)¶: Find a segment by its name (an int), or return None if not found.

class flatgfa.PathList¶

A sequence of Path objects.

find(name)¶: Find a path by its name (a string), or return None if not found.

class flatgfa.LinkList¶: A sequence of Link objects.

class flatgfa.StepList¶: A list of Handle objects, such as a sequence of path steps.
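Because both find methods return None for a missing name, callers need to check the result before using it. The wrappers below are a hypothetical sketch of one way to turn that into an exception; only SegmentList.find and PathList.find are library calls.

```python
def require_segment(graph, name):
    """Look up a segment by its int name, raising KeyError if it is absent.

    Hypothetical wrapper around graph.segments.find().
    """
    seg = graph.segments.find(name)
    if seg is None:
        raise KeyError(f"no segment named {name!r}")
    return seg


def require_path(graph, name):
    """Look up a path by its string name, raising KeyError if it is absent.

    Hypothetical wrapper around graph.paths.find().
    """
    path = graph.paths.find(name)
    if path is None:
        raise KeyError(f"no path named {name!r}")
    return path
```

Raising early keeps a missing name from surfacing later as an AttributeError on None.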
