- 2-3x faster than the node-glob module, with the same glob syntax
- generic: supports multiple glob engines, allowing you to use the fastest glob engine available and compare performance between glob engines
- supports multiple glob expressions for each search
- supports exclude/negative glob expressions
- supports filtering out duplicates
- supports streaming (can return a duplex stream, where the glob results are returned from the stream, and any values written to the stream are passed through)
- norahiko/globy; looks interesting but comes with a native binding which seems a bit too much, fnmatch is undocumented but benchmarked here anyway.
I also had a look at kthompson/glob-js and fitzgen/glob-to-regexp but they were missing key features such as globstar support and relative glob support and so I could not benchmark them. Most other glob modules on npm are just wrappers over minimatch or node-glob.
The following numbers are from a VM running on a Macbook Pro:
When run with a no-op matcher (e.g. a useless but fast function that returns true for every item), wildglob ran in 3.490s. From this we can conclude:
- that the
.sync()calls which use synchronous I/O are faster than the
wildglobhas an overhead of ~800ms over just using
- that adding a matcher algorithm such as minimatch adds another 2000ms or so; in
wildglobthe majority of the runtime is spent on the matching itself
- that interestingly, minimatch itself is quite fast, faster than wildmatch and the slowness in node-glob is mostly from management overhead rather than the matching engine itself (which definitely surprised me)
glob(patterns, [opts], onDone)
patterns: a single glob string, or an array of glob expressions. To exclude items, start the glob with
!to negate the match.
opts.cwd: The working directory; defaults to
process.cwd(). All glob expressions which refer to a relative path e.g
./foo/*are resolved in the working directory.
opts.root: When glob expressions refer to an absolute path e.g.
/foo/*, resolve them as if the initial
/was replaced with
opts.root. Basically allows you to pretend that the fs root is somewhere else. Defaults to
opts.match: a function with the following signature
function(filepath, glob). Called for each file to perform an exact glob match; should return true if the glob pattern matches.
opts.abspath: normally, relative globs return relative paths and absolute globs return absolute paths. Setting
abspathto true will cause all paths returned from
wildglobto be full absolute paths.
opts.fs: allows substituting
fswith a different object.
AlgorithmA glob expression is an expression which can potentially match against any path in the file system. For example
/foo/**basically says take the whole file system and compare it against that expression, and return the result.
However, stat'ing the whole file system is obviously inefficient, since we can use the glob expression itself to narrow down the potential matches:
- first, we can parse the path prefix. This is done as follows:
glob-parse- take the prefix which consists of strings and brace expansion expressions. If you encounter an item which is not a string and not a brace expression (e.g. is a
*, a globstar (
**), a set (
) or a extglob (
@())), stop adding things to the prefix. - expand the brace expansion expressions to produce an array of paths
- next, start a directory traversal for each path.
- (TODO): apply exclude globs (up to their constant portion, including exclude globs with a globstar as the last item)
- once you have read the results, check against the include globs, then the any exclude globs.
wildglobtakes the position that perfectly processing globstars and other wildcard expressions is probably more trouble than it is worth, since these expressions will generally not exclude any additional directories (which is the only way to reduce fs operations and provide a potential speedup). As you can see in the benchmark, this works out OK compared to Minimatch which does a more exact but more CPU-intensive matching before looking at the file system.
wildglobmay perform some additional directory reads, but only if your file tree is such that only a very small portion of the files are included and you have not used exclude expressions to prune the search. If the majority of the files are included, then very little additional work takes place - often none at all, if all directories needed to be
readdir'ed anyway. For example, if your include expression ends in the a globstar (as is typical), then this is the optimal behavior.
When the directory traversal starts, each include glob has been expanded so that only "tricky" parts remains. Matching a
*, a globstar or a extglob is rather tricky - typically, glob implementations use backtracking to deal with wildcard expressions such as these expressions. This results in a fairly high branching factor particularly for globstars.
Further performance improvementsAn optimal implementation should use a minimum amount of CPU time and also avoid recursing into directories which will never produce matches. The latter part relies on the fixed portions of the glob expression having appropriate matches, which has diminishing returns once the prefix has been processed. Exclusions which will only exclude files will probably only have small returns, while excluding large folders early on can have a larger impact.
Here are a couple of ideas:
- rel-to-absglob: convert a relative glob into an absolute glob expression (mainly avoids the hassles with converting back and forth on mixed glob expressions as abspaths are what the fs API uses)
- also makes exclude prefix matching easy
- Include mode: - ieie - ie - Exclude prefix mode: - eie - e
- adding set expansion support (only improves performance for globs with sets)
- adding expansion support for extglob contents (only improves performance for globs with extglob expressions)
- performing full matching before traversing into a subdirectory (matched has a nice description of why you would want to do this)
- performing partial matching before traversing into a subdirectory
- speeding up the actual glob matching: specifically, using a finite state machine to perform the matching