@gmod/gff
Read and write GFF3 data performantly. This module aims to be a complete implementation of the GFF3 specification.
- streaming parsing and streaming formatting
- proper escaping and unescaping of attribute and column values
- supports features with multiple locations and features with multiple parents
- reconstructs feature hierarchies of both Parent and Derives_from relationships
- parses FASTA sections
- does no validation except for referential integrity of Parent and Derives_from relationships
- only compatible with GFF3
Install
$ npm install --save @gmod/gff
Usage
const gff = require('@gmod/gff').default
// or in ES6 (recommended)
import gff from '@gmod/gff'
const fs = require('fs')
// parse a file from a file name
// parses only features and sequences by default,
// set options to parse directives and/or comments
fs.createReadStream('path/to/my/file.gff3')
  .pipe(gff.parseStream({ parseAll: true }))
  .on('data', (data) => {
    if (data.directive) {
      console.log('got a directive', data)
    } else if (data.comment) {
      console.log('got a comment', data)
    } else if (data.sequence) {
      console.log('got a sequence from a FASTA section')
    } else {
      console.log('got a feature', data)
    }
  })
// parse a string of gff3 synchronously
const stringOfGFF3 = fs.readFileSync('my_annotations.gff3').toString()
const arrayOfThings = gff.parseStringSync(stringOfGFF3)
// format an array of items to a string
const newStringOfGFF3 = gff.formatSync(arrayOfThings)
// format a stream of things to a stream of text.
// inserts sync marks automatically.
myStreamOfGFF3Objects
  .pipe(gff.formatStream())
  .pipe(fs.createWriteStream('my_new.gff3'))
// format a stream of things and write it to
// a gff3 file. inserts sync marks and a
// '##gff-version 3' header if one is not
// already present
gff.formatFile(
  myStreamOfGFF3Objects,
  fs.createWriteStream('my_new_2.gff3', { encoding: 'utf8' }),
)
Object format
features
In GFF3, features can have more than one location. We parse features as arrays of all the lines that share that feature's ID. Values that are . in the GFF3 are null in the output.
A simple feature that's located in just one place:
[
{
"seq_id": "ctg123",
"source": null,
"type": "gene",
"start": 1000,
"end": 9000,
"score": null,
"strand": "+",
"phase": null,
"attributes": {
"ID": [
"gene00001"
],
"Name": [
"EDEN"
]
},
"child_features": [],
"derived_features": []
}
]
A CDS called cds00001 located in two places:
[
{
"seq_id": "ctg123",
"source": null,
"type": "CDS",
"start": 1201,
"end": 1500,
"score": null,
"strand": "+",
"phase": "0",
"attributes": {
"ID": ["cds00001"],
"Parent": ["mRNA00001"]
},
"child_features": [],
"derived_features": []
},
{
"seq_id": "ctg123",
"source": null,
"type": "CDS",
"start": 3000,
"end": 3902,
"score": null,
"strand": "+",
"phase": "0",
"attributes": {
"ID": ["cds00001"],
"Parent": ["mRNA00001"]
},
"child_features": [],
"derived_features": []
}
]
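To make the "one array per feature ID" shape concrete, here is a minimal sketch that groups already-parsed feature lines by their ID attribute. The helper name groupLinesById is hypothetical and is not part of @gmod/gff; the library's parser does this grouping for you, in addition to rebuilding the hierarchy.

```javascript
// Hypothetical helper, not part of @gmod/gff: groups parsed feature lines
// that share an ID into one array per feature.
function groupLinesById(lines) {
  const byId = new Map()
  const features = []
  for (const line of lines) {
    const id = line.attributes?.ID?.[0]
    if (id === undefined) {
      // a line without an ID is a feature on its own
      features.push([line])
    } else if (byId.has(id)) {
      // another location of a feature we have already seen
      byId.get(id).push(line)
    } else {
      const feature = [line]
      byId.set(id, feature)
      features.push(feature)
    }
  }
  return features
}

const lines = [
  { type: 'CDS', start: 1201, end: 1500, attributes: { ID: ['cds00001'] } },
  { type: 'gene', start: 1000, end: 9000, attributes: { ID: ['gene00001'] } },
  { type: 'CDS', start: 3000, end: 3902, attributes: { ID: ['cds00001'] } },
]
const features = groupLinesById(lines)
console.log(features.map((f) => f.length)) // [ 2, 1 ]
```

The two CDS lines end up in one two-element array, while the gene stays a one-element array.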
directives
parseDirective("##gff-version 3\n")
// returns
{
"directive": "gff-version",
"value": "3"
}
parseDirective('##sequence-region ctg123 1 1497228\n')
// returns
{
"directive": "sequence-region",
"value": "ctg123 1 1497228",
"seq_id": "ctg123",
"start": "1",
"end": "1497228"
}
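The directive objects above can be reproduced with a few lines of plain JavaScript. This is only a sketch of the output shape, not the library's implementation; sketchParseDirective is a hypothetical name, and it handles only the generic case plus sequence-region.

```javascript
// Hypothetical sketch, not the implementation used by @gmod/gff.
function sketchParseDirective(line) {
  const match = /^##\s*(\S+)\s*(.*)$/.exec(line.trim())
  // "###" is a synchronization mark, not a directive
  if (!match || match[1].startsWith('#')) return null
  const [, name, value] = match
  const parsed = { directive: name }
  if (value.length > 0) parsed.value = value
  if (name === 'sequence-region') {
    // start and end stay strings, as in the example output above
    const [seqId, start, end] = value.split(/\s+/)
    parsed.seq_id = seqId
    parsed.start = start
    parsed.end = end
  }
  return parsed
}

console.log(sketchParseDirective('##gff-version 3\n'))
console.log(sketchParseDirective('##sequence-region ctg123 1 1497228\n'))
```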
comments
parseComment('# hi this is a comment\n')
// returns
{
"comment": "hi this is a comment"
}
sequences
These come from any embedded ##FASTA section in the GFF3 file.
parseSequences(`##FASTA
>ctgA test contig
ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA`)
// returns
[
{
"id": "ctgA",
"description": "test contig",
"sequence": "ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA"
}
]
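As an illustration of how a FASTA section maps onto the objects above, the following sketch splits header lines into id and description and concatenates the sequence lines. sketchParseFasta is a hypothetical name; the library's own parser may treat edge cases differently.

```javascript
// Hypothetical sketch, not the implementation used by @gmod/gff.
function sketchParseFasta(text) {
  const sequences = []
  let current = null
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim()
    if (line === '' || line.startsWith('##FASTA')) continue
    if (line.startsWith('>')) {
      // header: ">id optional description"
      const match = /^>\s*(\S+)\s*(.*)$/.exec(line)
      current = match ? { id: match[1], sequence: '' } : null
      if (current && match[2]) current.description = match[2]
      if (current) sequences.push(current)
    } else if (current) {
      // sequence data may span several lines
      current.sequence += line
    }
  }
  return sequences
}

const parsed = sketchParseFasta(`##FASTA
>ctgA test contig
ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA`)
console.log(parsed)
```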
API
ParseOptions
Parser options
encoding
Text encoding of the input GFF3. default 'utf8'
Type: BufferEncoding
parseFeatures
Whether to parse features, default true
Type: boolean
parseDirectives
Whether to parse directives, default false
Type: boolean
parseComments
Whether to parse comments, default false
Type: boolean
parseSequences
Whether to parse sequences, default true
Type: boolean
parseAll
Parse all features, directives, comments, and sequences. Overrides other parsing options. Default false.
Type: boolean
bufferSize
Maximum number of GFF3 lines to buffer, default 1000
Type: number
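One way to read the defaults and the parseAll override described above is sketched below. resolveParseOptions and defaultParseOptions are hypothetical names used only for illustration; the code models the documented behavior and is not taken from the library.

```javascript
// Hypothetical sketch of the documented defaults; not the library's code.
const defaultParseOptions = {
  encoding: 'utf8',
  parseFeatures: true,
  parseDirectives: false,
  parseComments: false,
  parseSequences: true,
  parseAll: false,
  bufferSize: 1000,
}

function resolveParseOptions(options = {}) {
  const merged = { ...defaultParseOptions, ...options }
  if (merged.parseAll) {
    // parseAll overrides the individual switches
    merged.parseFeatures = true
    merged.parseDirectives = true
    merged.parseComments = true
    merged.parseSequences = true
  }
  return merged
}

console.log(resolveParseOptions({ parseAll: true, parseComments: false }))
```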
parseStream
Parse a stream of text data into a stream of feature, directive, comment, and sequence objects.
Parameters
options
ParseOptions Parsing options (optional, default {})
Returns GFFTransform stream (in objectMode) of parsed items
parseStringSync
Synchronously parse a string containing GFF3 and return an array of the parsed items.
Parameters
str
string GFF3 string
inputOptions
({encoding: BufferEncoding?, bufferSize: number?} | undefined)? Parsing options
Returns Array<(GFF3Feature | GFF3Sequence)> array of parsed features, directives, comments and/or sequences
formatSync
Format an array of GFF3 items (features, directives, comments) into a string of GFF3. Does not insert synchronization (###) marks.
Parameters
items
Array Array of features, directives, comments and/or sequences
Returns string the formatted GFF3
formatStream
Format a stream of features, directives, comments and/or sequences into a stream of GFF3 text.
Inserts synchronization (###) marks automatically.
Parameters
options
FormatOptions parser options (optional, default {})
Returns FormattingTransform
formatFile
Format a stream of features, directives, comments and/or sequences into a GFF3 file and write it to the filesystem.
Inserts synchronization (###) marks and a ##gff-version directive automatically (if one is not already present).
Parameters
stream
Readable the stream to write to the file
writeStream
Writable
options
FormatOptions parser options (optional, default {})
filename
the file path to write to
Returns Promise promise for null that resolves when the stream has been written
About util
There is also a util module that contains super-low-level functions for dealing with lines and parts of lines.
// non-ES6
const util = require('@gmod/gff').default.util
// or, with ES6
import gff from '@gmod/gff'
const util = gff.util
const gff3Lines = util.formatItem({
  seq_id: 'ctgA',
  ...
})
util
unescape
Unescape a string value used in a GFF3 attribute.
Parameters
stringVal
string Escaped GFF3 string value
Returns string An unescaped string value
escape
Escape a value for use in a GFF3 attribute value.
Parameters
Returns string An escaped string value
escapeColumn
Escape a value for use in a GFF3 column value.
Parameters
Returns string An escaped column value
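The GFF3 specification uses URL-style percent-encoding for characters that have a reserved meaning. The sketch below shows the idea for attribute values, where control characters, %, ;, =, & and , are encoded. sketchEscape and sketchUnescape are hypothetical names, and the exact set of characters escaped by the library functions above may differ.

```javascript
// Hypothetical sketch of GFF3-style percent-encoding for attribute values.
// Not the implementation used by @gmod/gff.
function sketchEscape(value) {
  return String(value).replace(
    /[\x00-\x1f\x7f%;=&,]/g,
    (ch) => '%' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0'),
  )
}

function sketchUnescape(value) {
  // turn every %XX back into the character with that code
  return value.replace(/%([0-9A-Fa-f]{2})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16)),
  )
}

console.log(sketchEscape('a;b=c,d')) // a%3Bb%3Dc%2Cd
console.log(sketchUnescape('a%3Bb%3Dc%2Cd')) // a;b=c,d
```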
parseAttributes
Parse the 9th column (attributes) of a GFF3 feature line.
Parameters
attrString
string String of GFF3 9th column
Returns GFF3Attributes Parsed attributes
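For readers unfamiliar with the 9th column, here is a minimal sketch of turning tag=value pairs into the Record<string, Array<string>> shape used in the examples. sketchParseAttributes is a hypothetical name; returning an empty object for "." is a choice made for this sketch and is not a statement about what the library returns.

```javascript
// Hypothetical sketch, not the implementation used by @gmod/gff.
function sketchParseAttributes(attrString) {
  const attrs = {}
  if (!attrString || attrString.trim() === '.') return attrs
  const unescape = (s) =>
    s.replace(/%([0-9A-Fa-f]{2})/g, (_, hex) =>
      String.fromCharCode(parseInt(hex, 16)),
    )
  for (const pair of attrString.trim().split(';')) {
    const eq = pair.indexOf('=')
    if (eq <= 0) continue // skip empty or malformed pairs
    const tag = pair.slice(0, eq).trim()
    // split on "," first, then unescape, so an escaped comma stays in its value
    const values = pair.slice(eq + 1).split(',').map(unescape)
    attrs[tag] = (attrs[tag] || []).concat(values)
  }
  return attrs
}

console.log(sketchParseAttributes('ID=cds00001;Parent=mRNA00001,mRNA00002'))
```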
parseFeature
Parse a GFF3 feature line.
Parameters
line
string GFF3 feature line
Returns GFF3FeatureLine The parsed feature
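The mapping from the nine tab-separated columns to the feature-line object can be sketched as follows. sketchParseFeatureLine and parseAttributeColumn are hypothetical names, and unescaping is left out to keep the example short, so this is not a replacement for parseFeature.

```javascript
// Hypothetical sketch, not the implementation used by @gmod/gff.
// Escaped characters are NOT decoded here.
function parseAttributeColumn(col) {
  const attributes = {}
  for (const pair of col.split(';')) {
    const eq = pair.indexOf('=')
    if (eq <= 0) continue
    attributes[pair.slice(0, eq)] = pair.slice(eq + 1).split(',')
  }
  return attributes
}

function sketchParseFeatureLine(line) {
  const columns = line.replace(/\r?\n$/, '').split('\t')
  // pad short lines to nine columns and map "." (or nothing) to null
  const [seqId, source, type, start, end, score, strand, phase, attrs] =
    Array.from({ length: 9 }, (_, i) => {
      const col = columns[i]
      return col === undefined || col === '' || col === '.' ? null : col
    })
  return {
    seq_id: seqId,
    source,
    type,
    start: start === null ? null : parseInt(start, 10),
    end: end === null ? null : parseInt(end, 10),
    score: score === null ? null : parseFloat(score),
    strand,
    phase,
    attributes: attrs === null ? null : parseAttributeColumn(attrs),
  }
}

const parsedLine = sketchParseFeatureLine(
  'ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN\n',
)
console.log(parsedLine)
```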
parseDirective
Parse a GFF3 directive line.
Parameters
line
string GFF3 directive line
Returns (GFF3Directive | GFF3SequenceRegionDirective | GFF3GenomeBuildDirective | null) The parsed directive
formatAttributes
Format an attributes object into a string suitable for the 9th column of GFF3.
Parameters
attrs
GFF3Attributes Attributes
Returns string GFF3 9th column string
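Going the other way, an attributes object can be serialized by escaping each value, joining multiple values with commas, and joining tags with semicolons. The sketch below assumes that an empty attributes object is written as "."; sketchFormatAttributes is a hypothetical name and the library's output may differ in details such as tag ordering.

```javascript
// Hypothetical sketch, not the implementation used by @gmod/gff.
function sketchFormatAttributes(attrs) {
  const escape = (value) =>
    String(value).replace(
      /[\x00-\x1f\x7f%;=&,]/g,
      (ch) => '%' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0'),
    )
  const parts = []
  for (const [tag, values] of Object.entries(attrs || {})) {
    if (!values || values.length === 0) continue // nothing to write for this tag
    parts.push(`${tag}=${values.map(escape).join(',')}`)
  }
  // assumption: an empty 9th column is written as "."
  return parts.length > 0 ? parts.join(';') : '.'
}

console.log(
  sketchFormatAttributes({ ID: ['cds00001'], Parent: ['mRNA00001', 'mRNA00002'] }),
) // ID=cds00001;Parent=mRNA00001,mRNA00002
```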
formatFeature
Format a feature object or array of feature objects into one or more lines of GFF3.
Parameters
featureOrFeatures
(GFF3FeatureLine | GFF3FeatureLineWithRefs | Array<(GFF3FeatureLine | GFF3FeatureLineWithRefs)>) A feature object or array of feature objects
Returns string A string of one or more GFF3 lines
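To show how null values map back to "." on output, here is a sketch that writes a single feature line as nine tab-separated columns. sketchFormatFeatureLine is a hypothetical name; it ignores child_features and derived_features and does no escaping, both of which the real formatFeature has to deal with.

```javascript
// Hypothetical sketch, not the implementation used by @gmod/gff.
// No escaping, and child/derived features are ignored.
function sketchFormatFeatureLine(f) {
  const attrs = Object.entries(f.attributes || {})
    .filter(([, values]) => values && values.length > 0)
    .map(([tag, values]) => `${tag}=${values.join(',')}`)
    .join(';')
  const columns = [
    f.seq_id, f.source, f.type, f.start, f.end,
    f.score, f.strand, f.phase, attrs === '' ? null : attrs,
  ]
  // null (or missing) values become "." in the output
  return (
    columns.map((c) => (c === null || c === undefined ? '.' : String(c))).join('\t') +
    '\n'
  )
}

const gffLine = sketchFormatFeatureLine({
  seq_id: 'ctg123', source: null, type: 'gene', start: 1000, end: 9000,
  score: null, strand: '+', phase: null,
  attributes: { ID: ['gene00001'], Name: ['EDEN'] },
})
console.log(JSON.stringify(gffLine))
```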
formatDirective
Format a directive into a line of GFF3.
Parameters
directive
GFF3Directive A directive object
Returns string A directive line string
formatComment
Format a comment into a GFF3 comment. Yes I know this is just adding a # and a newline.
Parameters
comment
GFF3Comment A comment object
Returns string A comment line string
formatSequence
Format a sequence object as FASTA.
Parameters
seq
GFF3Sequence A sequence object
Returns string Formatted single FASTA sequence string
formatItem
Format a directive, comment, sequence, or feature, or array of such items, into one or more lines of GFF3.
Parameters
itemOrItems
(GFF3FeatureLineWithRefs | GFF3Directive | GFF3Comment | GFF3Sequence | Array<(GFF3FeatureLineWithRefs | GFF3Directive | GFF3Comment | GFF3Sequence)>) A comment, sequence, or feature, or array of such items
Returns (string | Array<string>) A formatted string or array of strings
GFF3Attributes
A record of GFF3 attribute identifiers and the values of those identifiers
Type: Record<string, (Array<string> | undefined)>
GFF3FeatureLine
A representation of a single line of a GFF3 file
seq_id
The ID of the landmark used to establish the coordinate system for the current feature
Type: (string | null)
source
A free text qualifier intended to describe the algorithm or operating procedure that generated this feature
Type: (string | null)
type
The type of the feature
Type: (string | null)
start
The start coordinates of the feature
Type: (number | null)
end
The end coordinates of the feature
Type: (number | null)
score
The score of the feature
Type: (number | null)
strand
The strand of the feature
Type: (string | null)
phase
For features of type "CDS", the phase indicates where the next codon begins relative to the 5' end of the current CDS feature
Type: (string | null)
attributes
Feature attributes
Type: (GFF3Attributes | null)
GFF3FeatureLineWithRefs
Extends GFF3FeatureLine
A GFF3 Feature line that includes references to other features defined in their "Parent" or "Derives_from" attributes
child_features
An array of child features
Type: Array<GFF3Feature>
derived_features
An array of features derived from this feature
Type: Array<GFF3Feature>
GFF3Feature
A GFF3 feature, which may include multiple individual feature lines
Type: Array<GFF3FeatureLineWithRefs>
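Because a GFF3Feature is an array of lines and each line carries child_features, walking a hierarchy means looping over lines and then over children. The sketch below collects the IDs of a feature and its descendants; collectIds is a hypothetical helper, and the Set guards against visiting the same child twice in case several lines of one feature reference the same child array.

```javascript
// Hypothetical helper, not part of @gmod/gff: depth-first walk over a
// feature and everything reachable through child_features.
function collectIds(feature, seen = new Set(), ids = []) {
  if (seen.has(feature)) return ids
  seen.add(feature)
  // all lines of one feature share an ID, so the first line is enough
  const id = feature[0]?.attributes?.ID?.[0]
  if (id !== undefined) ids.push(id)
  for (const line of feature) {
    for (const child of line.child_features ?? []) {
      collectIds(child, seen, ids)
    }
  }
  return ids
}

const cds = [
  { type: 'CDS', start: 1201, end: 1500, attributes: { ID: ['cds00001'] }, child_features: [], derived_features: [] },
  { type: 'CDS', start: 3000, end: 3902, attributes: { ID: ['cds00001'] }, child_features: [], derived_features: [] },
]
const mrna = [
  { type: 'mRNA', attributes: { ID: ['mRNA00001'] }, child_features: [cds], derived_features: [] },
]
const gene = [
  { type: 'gene', attributes: { ID: ['gene00001'] }, child_features: [mrna], derived_features: [] },
]
console.log(collectIds(gene)) // [ 'gene00001', 'mRNA00001', 'cds00001' ]
```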
GFF3Directive
A GFF3 directive
directive
The name of the directive
Type: string
value
The string value of the directive
Type: string
GFF3SequenceRegionDirective
Extends GFF3Directive
A GFF3 sequence-region directive
value
The string value of the directive
Type: string
seq_id
The sequence ID parsed from the directive
Type: string
start
The sequence start parsed from the directive
Type: string
end
The sequence end parsed from the directive
Type: string
GFF3GenomeBuildDirective
Extends GFF3Directive
A GFF3 genome-build directive
value
The string value of the directive
Type: string
source
The genome build source parsed from the directive
Type: string
buildName
The genome build name parsed from the directive
Type: string
GFF3Comment
A GFF3 comment
comment
The text of the comment
Type: string
GFF3Sequence
A GFF3 FASTA single sequence
id
The ID of the sequence
Type: string
description
The description of the sequence
Type: string
sequence
The sequence
Type: string