FlatGFA: An Efficient Pangenome Representation¶
FlatGFA is an efficient on-disk and in-memory way to represent pangenomic variation graphs. It can losslessly represent GFA files. Here’s a quick example:
import flatgfa
from collections import Counter

graph = flatgfa.parse("something.gfa")

# Count how many times any path steps onto each segment.
depths = Counter()
for path in graph.paths:
    for step in path:
        depths[step.segment.id] += 1

# Print one tab-separated row per segment.
print('#node.id\tdepth')
for seg in graph.segments:
    print('{}\t{}'.format(seg.name, depths[seg.id]))
This example computes the node depth for every segment in a graph.
It starts by parsing a GFA text file, but FlatGFA also has its own efficient
binary representation; you can read and write this format with load() and
FlatGFA.write_flatgfa().
The library is on PyPI, so you can get started by typing pip install flatgfa.
API Reference¶
Loading Data¶
The FlatGFA library can both read and write files in two formats: the standard
GFA text format, and its own efficient binary representation (called
“FlatGFA” files). Each of the functions below returns a FlatGFA
object. Parsing GFA text can take some time, but loading a binary FlatGFA file
should be very fast.
- flatgfa.parse(filename)¶
Parse a GFA file into our FlatGFA representation.
- flatgfa.parse_bytes(bytes)¶
Parse a GFA file from a bytestring into our FlatGFA representation.
- flatgfa.load(filename)¶
Load a binary FlatGFA file.
This function should be fast to call because it does not actually read the file’s data. It memory-maps the file, so subsequent accesses read the data “on demand.” You can produce these files with FlatGFA.write_flatgfa().
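Since parsing GFA text is the slow step and loading the binary format is the fast one, a common pattern is to convert once and load from then on. The following is a minimal sketch of that idea, not part of the library: load_or_convert is a hypothetical helper name, the ".flatgfa" file extension in the usage note is an assumption, and the parse and load callables are passed in as arguments (they are expected to be flatgfa.parse and flatgfa.load).

```python
import os


def load_or_convert(gfa_path, flat_path, parse, load):
    """Return a graph for gfa_path, reusing a binary copy at flat_path.

    Hypothetical helper: `parse` and `load` are expected to be
    flatgfa.parse and flatgfa.load. The binary file is rebuilt only when
    it is missing or older than the GFA text file.
    """
    stale = (
        not os.path.exists(flat_path)
        or os.path.getmtime(flat_path) < os.path.getmtime(gfa_path)
    )
    if stale:
        # Slow path: parse the text once and save the binary form.
        parse(gfa_path).write_flatgfa(flat_path)
    # Fast path: load the binary file.
    return load(flat_path)
```

With the library installed, a call might look like load_or_convert("something.gfa", "something.flatgfa", flatgfa.parse, flatgfa.load).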
GFA Graphs¶
The FlatGFA class provides the entry point to access the data either
loaded from a FlatGFA binary file or parsed from a GFA text file. Most
importantly, you can iterate over the Segment, Path, and Link objects that it
contains. The FlatGFA class exposes list-like containers for each of these
types:
for seg in graph.segments:
    print(seg.name)
print(graph.segments[0].sequence())
These containers support both iteration (like the for loop above) and random
access (like graph.segments[0] above).
You can also write graphs out to disk using FlatGFA.write_gfa() (producing a
standard GFA text file) and FlatGFA.write_flatgfa() (our binary format). If you
just want a GFA string, use str(graph).
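As an illustration of iterating over one of these containers, here is a small sketch that summarizes segment lengths. It relies only on iteration, the name attribute, and the sequence() method; segment_length_stats is a hypothetical helper name, not a library function.

```python
def segment_length_stats(segments):
    """Return (count, total_bases, name_of_longest) for a segment container.

    Hypothetical helper: `segments` is expected to be something like
    graph.segments. Ties for the longest segment go to the first one seen.
    """
    count = 0
    total = 0
    longest_name = None
    longest_len = -1
    for seg in segments:
        n = len(seg.sequence())  # each call builds a bytes object
        count += 1
        total += n
        if n > longest_len:
            longest_len = n
            longest_name = seg.name
    return count, total, longest_name
```

You would call it as segment_length_stats(graph.segments). Each sequence() call constructs a new bytes object, so this kind of scan may be slow on graphs with very long sequences.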
- class flatgfa.FlatGFA¶
An efficient representation of a Graphical Fragment Assembly (GFA) file.
- segments¶
The segments (nodes) in the graph, as a SegmentList.
- write_flatgfa(filename)¶
Write the graph as a binary FlatGFA file.
You can read the resulting file with load().
- write_gfa(filename)¶
Write the graph as a GFA text file.
The GFA Data Model¶
These classes represent the core data model for GFA graphs: Segment for
vertices in the graph, Path for walks through the graph, and Link for edges in
the graph.
Internally, all of these objects only contain references to the underlying
data stored in a FlatGFA, so they are very small, but accessing any of the
associated data (such as the nucleotide sequence for a segment) requires
further lookups.
The Handle class is a segment–orientation pair: both paths and links
traverse these handles.
To get a GFA text representation of any of these objects, use str(obj).
All these objects are equatable (so you can compare them with ==) and
hashable (so you can store them in dicts and sets). This reflects equality on
the underlying references to the data store, so two objects are equal if they
refer to the same index in the same FlatGFA.
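Because these objects are hashable, they can go directly into sets. The sketch below finds the segments that every path visits; it assumes only the step.segment access pattern used in the examples in this document, and segments_in_every_path is a hypothetical helper name rather than a library function.

```python
def segments_in_every_path(paths):
    """Return the set of segments visited at least once by every path.

    Hypothetical helper: `paths` is expected to be something like
    graph.paths. Returns an empty set when there are no paths.
    """
    shared = None
    for path in paths:
        # Set membership relies on segments being hashable and equatable.
        visited = {step.segment for step in path}
        shared = visited if shared is None else shared & visited
    return shared if shared is not None else set()
```

Since equality is defined on references into the same data store, this comparison is only meaningful for paths that come from a single graph.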
- class flatgfa.Segment¶
A segment in a GFA graph.
Segments are the nodes in the GFA graph. They have a unique ID and an associated nucleotide sequence.
- id¶
The unique identifier for the segment, an int.
- name¶
The segment’s name as declared in the GFA file, an int.
- sequence()¶
Get the nucleotide sequence for the segment as a byte string.
This copies the underlying sequence data to construct the Python bytes object, so it is slow to use for large sequences.
- class flatgfa.Path¶
A path in a GFA graph.
Paths are walks through the GFA graph, where each step is an oriented segment. This class is an iterable over the steps in the path, so use something like this:
for step in path:
    print(step.segment.name)
to walk through a path’s steps.
- id¶
The unique identifier for the path, an int.
- name¶
Get the name of this path as declared in the GFA file, as a string.
- steps¶
Get a list of steps in this path.
For convenience, the path itself provides direct access to the step list. So, for example, path.steps[4] is the same as path[4].
- class flatgfa.Link¶
A link in a GFA graph.
Links are directed edges between oriented segments. The source and sink are both Handle objects, i.e., the “forward” or “backward” direction of a given segment.
- from_¶
The edge’s source handle.
- id¶
The unique identifier for the link.
- to¶
The edge’s sink handle.
- class flatgfa.Handle¶
An oriented segment reference.
Because both paths and links connect oriented segments rather than the segments themselves, they use this class to distinguish between (for example) 5+ and 5-.
- is_forward¶
The orientation.
- seg_id¶
The segment ID, an int.
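To make the role of handles concrete, here is a sketch that renders a path's steps in the comma-separated notation that GFA path lines use, such as 5+,6-. It assumes that is_forward is truthy for the forward orientation and that each step exposes segment as in the examples in this document; format_steps is a hypothetical helper name, not a library function.

```python
def format_steps(path):
    """Render a path's steps as a string like "5+,6-".

    Hypothetical helper. Assumes step.is_forward is truthy for the
    forward orientation and step.segment.name is the segment's name.
    """
    parts = []
    for step in path:
        sign = "+" if step.is_forward else "-"
        parts.append(f"{step.segment.name}{sign}")
    return ",".join(parts)
```

For whole objects, str(obj) already gives the GFA text, so a helper like this is mainly useful when you want only the step list.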
Iteration¶
The FlatGFA library exposes special container classes to access the
Segment, Path, and Link objects that make up a GFA graph. These classes are
meant to behave much like Python list objects while supporting efficient
iteration over FlatGFA’s internal representation.
All of these container objects support subscripting (like graph.segments[i]
where i is an integer index) and iteration.
- class flatgfa.SegmentList¶
A sequence of Segment objects.
- find(name)¶
Find a segment by its name (an int), or return None if not found.
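Because find() returns None for a missing name, callers need to check the result before using it. The following sketch looks up several names at once; sequences_by_name is a hypothetical helper name, and the code uses only find() and sequence() as described above.

```python
def sequences_by_name(segments, names):
    """Map each requested name to its sequence, or None if it is absent.

    Hypothetical helper: `segments` is expected to be a SegmentList such
    as graph.segments, and `names` an iterable of integer segment names.
    """
    result = {}
    for name in names:
        seg = segments.find(name)  # None when no segment has this name
        result[name] = seg.sequence() if seg is not None else None
    return result
```

A call such as sequences_by_name(graph.segments, [1, 2, 3]) would then return a dict keyed by the requested names.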