utility script for detecting and (optionally) fixing differences between multiple DynamoDB tables (in the same or different regions)



Silvermine DynamoDB Table Sync

What is it?

A script that will scan two or more DynamoDB tables and report the differences to you. Optionally, it will also synchronize them by writing updates to the slave table to keep it in sync with the master table.

How do I use it?

Here's an example of how you can use this script:
node src/cli.js \
   --master us-east-1:my-dynamodb-table \
   --slave us-west-2:my-dynamodb-table \
   --slave eu-west-1:my-dynamodb-table \
   --write-missing \
   --write-differing

Using the arguments shown above, the synchronizer would scan your master table (the table in the us-east-1 region), and check that every item in the master table exists in each of your two slave tables (in us-west-2 and eu-west-1 in this example). Because we supplied both the --write-missing and --write-differing flags, it would write to the slaves any items that the slave was missing or where the slave's item differed from the master's item.

Running in Docker

Build the image:
docker-compose build

Then run the container:
docker-compose run \
   --rm \
   --user node \
   --volume "${HOME}/.aws:/home/node/.aws" \
   dynamodb_table_sync \
   src/cli.js \
   --master us-east-1:my-dynamodb-table \
   --slave us-west-2:my-dynamodb-table \
   --slave eu-west-1:my-dynamodb-table \
   --write-missing

Installing Globally

By installing the library globally (e.g. npm install -g @silvermine/dynamodb-table-sync), you will get a dynamodb-table-sync executable in your node bin folder. Assuming you have the node bin folder in your path (you probably do if you've ever installed any other npm package globally), then you can simply run the command from any folder like this:
dynamodb-table-sync -m us-east-1:my-dynamodb-table -s eu-west-1:my-dynamodb-table

Of course, you can use any CLI arguments you want to regardless of where you run the script from.

Using Silvermine DynamoDB Table Sync as a Library

You can also use this codebase as a library. Here's a brief example of how to do so:
var Synchronizer = require('@silvermine/dynamodb-table-sync'),
    synchronizer;

synchronizer = new Synchronizer(
   { region: 'us-east-1', name: 'my-master-table-name' },
   [
      { region: 'us-west-2', name: 'my-slave-1-table-name' },
      { region: 'eu-west-1', name: 'my-slave-2-table-name' },
   ],
   {
      writeMissing: true,
      writeDiffering: true,
   }
);

// Assumption: run() starts the sync and returns a promise (see src/Synchronizer.js).
synchronizer.run()
   .then(() => {
      console.log('Done!');
   })
   .catch((err) => {
      console.error(`Failed: ${err.message}`, err.stack);
   });

See the comments in src/Synchronizer.js for more documentation on names of options. Since options all map to CLI arguments, see the list of CLI arguments below for details on what each option is used for.

Command Line Flags

Note that everywhere that a table is supplied as a command line argument, it should be in the form <region>:<table-name>.
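
As an illustration of that format, here is a minimal sketch of how such an argument could be split into its two parts. The `parseTableArg` helper is hypothetical and is not part of this library's API; it relies only on the fact that DynamoDB table names cannot contain a colon.

```javascript
// Hypothetical helper (not part of the library's API): splits a
// "<region>:<table-name>" argument into its region and table name.
function parseTableArg(arg) {
   var idx = arg.indexOf(':');

   // Both the region and the table name must be non-empty.
   if (idx <= 0 || idx === arg.length - 1) {
      throw new Error('Expected <region>:<table-name>, got "' + arg + '"');
   }

   return { region: arg.slice(0, idx), name: arg.slice(idx + 1) };
}

console.log(parseTableArg('us-east-1:my-dynamodb-table'));
```
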
* `--master` (or `-m`) **required**: region and name of the table that should be
  treated as the master table. This is the table that will be scanned to find items
  that are missing or different in each of the tables listed as "slaves".
* `--slave` (or `-s`) **at least one required**: region and name of a table that should
  have the same content as the master table.
   * Note that you can supply this argument as many times as needed if you want to
     compare the master table to multiple slaves.
   * Also note that all tables (the master and all slaves) must have the same key
     schema since we create a key from the items returned by the master table and use
     that key to look up items on the slave table(s).
* `--starting-key`: A JSON string of the key to start the scan at. This allows you to
  restart a scan that had failed previously by supplying one of the previous keys that
  was logged. Example: `--starting-key '{"hashKey":"abc","rangeKey":"xyz"}'`
* `--ignore` (or `--ignore-att`): An attribute that should be ignored when comparing an
  item from the master table and the corresponding item in the slave table.
   * As an example of where this might be used, if items are written to each of your
     regions independently, and each writer sets its own `lastModified` field, this
     field will differ between regions, but all the other fields/attributes on the item
     should be the same between regions. In that case you could specify `--ignore
     lastModified` so that the `lastModified` field is not compared when comparing
     items.
   * Note that you can supply this argument as many times as needed if you want to
     ignore multiple attributes, e.g. `--ignore lastModified --ignore firstCreated`.
* `--scan-limit`: a number, used to (optionally) limit the number of items returned by
  each call to `scan`. Use this if you need to slow down the scan to stay under your
  provisioned read capacity.
* `--batch-read-limit`: a number, by default 50, that is used as the maximum number of
  items to read from the slave tables in a batch read operation. When a table
  (generally the master) is being scanned, we use `BatchGetItem` to get the
  corresponding items out of the other tables to make sure they exist and compare them.
  At times you may want to lower the number of items we read from those tables to keep
  the script running within your provisioned capacity. Additionally, on a batch read
  operation, if the response would be too large, DynamoDB will return "unprocessed
  keys" - that is, keys that you requested but are not included in the response. This
  requires recursive querying from the table, which we have not implemented at this
  time. Thus, if you get an error about `UnprocessedKeys`, you may need to lower this
  number to keep the responses within the size that DynamoDB can return.
* `--max-retries`: the maximum number of retries to attempt for a request. Defaults
  to 10, as in the AWS SDK.
* `--retry-delay-base`: the base number of milliseconds to use in the exponential
  backoff for operation retries on retryable errors. Defaults to 50, as in the AWS SDK.
* `--parallel`: a number of parallel scanners that should run concurrently. By default
  we do a serial scan using a single "thread" (so to speak). However, if you have a
  larger table and enough provisioned read capacity on the master and all slaves, we
  can do the scan in parallel. You just specify how many parallel segments you want
  scanned simultaneously, e.g. `--parallel 8`.
   * Note that due to how DynamoDB partitions data, if you specify more parallel
     scanners than you have partitions in your table, some segments may end much sooner
     than others. I believe that DynamoDB is basically just guessing at some random
     keys in the partition, and then reading forward from that key until the end of the
     partition or until it reaches the starting point of another segment. Thus, you may
     not be "fully parallel" throughout your entire script run. The script will
     indicate when each segment finishes. For very small tables, some segments may not
     get any data at all - even from the first call to `scan`.
   * See the DynamoDB documentation on parallel scans for more details.
* `--write-missing`: a boolean flag, that when present will tell the synchronizer to
  write to the slave table(s) any items that they are missing.
* `--write-differing`: a boolean flag, that when present will tell the synchronizer to
  write to the slave table(s) any items where their item is different from that of the
  master table.
* `--scan-for-extra`: a boolean flag, that when present will tell the synchronizer to
  also scan the slave tables to find any items in them that are not present in the
  master table.
   * Note that whereas all other operations can be done by only scanning the master
     table, supplying this option will result in an additional scan of each slave
     table. While the slave is being scanned, additional reads will occur on the master
     table because for each item scanned from the slave table, we must do a read on the
     master table to see if the item exists in the master. The only way to avoid these
     additional reads back to the (already-scanned) master table would be to keep a
     list of all items that we'd seen in the master table during its own scan. This
     would create multiple problems:
      * Keeping that list of "items the master table has" for very large tables would
        require a lot of memory, or some other temporary storage mechanism.
      * Because the list would be quite old, the period for a "race condition" would
        become quite long - essentially the length of time of all scans combined.
* `--delete-extra`: a boolean flag, that when present will tell the synchronizer to
  also scan the slave tables and delete any items in them that are not present in the
  master table.
   * Note that if supplying `--delete-extra` there is no need to also provide
    `--scan-for-extra`. Essentially `--scan-for-extra` is the way of invoking a "dry
    run" scan of the slave tables to see if they contain extras.
   * Supplying `--delete-extra` implies that we will also scan the slave tables and all
    of the same warnings and notes from `--scan-for-extra` thus apply here.
* `--profile`: the name of a profile in your AWS credentials file if the script should
  use a specific named profile for its credentials.
* `--role-arn`: the ARN of a role to assume if you need the script to assume a role
  before it can access DynamoDB.
* `--mfa-serial`: the serial number or ARN of an MFA device if assuming a role requires
  an MFA token (can only be used with `--role-arn` **and** `--mfa-token`).
* `--mfa-token`: the one-time password generated by your MFA device if assuming a role
  requires an MFA token (can only be used with `--role-arn` **and** `--mfa-serial`).
* `--slave-profile`, `--slave-role-arn`, `--slave-mfa-serial`, and `--slave-mfa-token`:
  see the subheading "Syncing Tables in Two Accounts" for more information.
* `--verbose` (or `-v`) enables additional verbose info & warnings to console logger on
  specific item discrepancies and actions being performed.
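
To make the scan-related flags more concrete, here is a rough sketch of how `--scan-limit`, `--starting-key`, and `--parallel` could translate into DynamoDB `Scan` request parameters. The `buildScanParams` function and its option names are hypothetical and do not describe this script's internals; `Limit`, `ExclusiveStartKey`, `Segment`, and `TotalSegments` are standard `Scan` parameters. The sketch applies the starting key only to a serial scan, because DynamoDB requires an `ExclusiveStartKey` to come from the same segment that returned it.

```javascript
// Sketch only: builds one Scan request per parallel segment.
// The option names (tableName, scanLimit, parallel, startingKey) are
// hypothetical; the keys of the returned objects are DynamoDB Scan parameters.
function buildScanParams(opts) {
   var total = opts.parallel || 1,
       requests = [],
       params, segment;

   for (segment = 0; segment < total; segment++) {
      params = { TableName: opts.tableName };

      if (opts.scanLimit) {
         // --scan-limit: cap the number of items returned by each call to scan
         params.Limit = opts.scanLimit;
      }

      if (total > 1) {
         // --parallel: each worker scans its own segment of the table
         params.Segment = segment;
         params.TotalSegments = total;
      } else if (opts.startingKey) {
         // --starting-key: resume a serial scan from a previously logged key
         params.ExclusiveStartKey = JSON.parse(opts.startingKey);
      }

      requests.push(params);
   }

   return requests;
}

console.log(buildScanParams({ tableName: 'my-dynamodb-table', parallel: 2, scanLimit: 100 }));
```
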

Syncing Tables in Two Accounts

It is possible to synchronize tables in two accounts. To do so, you must provide credentials arguments that configure the credentials providers to be used for the slave tables. This is done by providing one of the following combinations of arguments:
* `--slave-profile` can be used on its own to simply use a profile from your AWS
  credentials (INI) file for the slave tables.
* `--slave-role-arn` can be used on its own to assume a role. When used on its own, the
  SDK-default credentials serve as the "master credentials" that grant permission to
  assume the role.
   * It can also be combined with `--slave-profile`, in which case the slave profile
     will provide the master credentials that are used to assume the role.
   * It can also be used in conjunction with the `--slave-mfa-serial` and
     `--slave-mfa-token` arguments if assuming the role requires an MFA token.
By configuring credentials for the slaves using these arguments, the script will use the master credentials (provided by default, or by the --profile, etc args) to describe and read from the master table, and the slave credentials to describe, read from, and write to the slave table(s).
See the heading below entitled "Authentication and Authorization to AWS DynamoDB API" for more information related to credentials.
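
The combinations above can be summarized in a small sketch. The `describeSlaveCredentials` function, its camelCase argument names, and the returned shapes are hypothetical and only illustrate the rules described in this section; the fallback to the master's credentials when no slave arguments are given is an assumption.

```javascript
// Sketch only: describes which credentials would be used for the slave tables.
// Argument names mirror the CLI flags in camelCase and are hypothetical.
function describeSlaveCredentials(args) {
   var hasMfa = Boolean(args.slaveMfaSerial || args.slaveMfaToken);

   // The MFA arguments only make sense together, and only when assuming a role.
   if (hasMfa && !(args.slaveRoleArn && args.slaveMfaSerial && args.slaveMfaToken)) {
      throw new Error('MFA requires --slave-role-arn, --slave-mfa-serial and --slave-mfa-token');
   }

   if (args.slaveRoleArn) {
      return {
         type: 'assume-role',
         roleArn: args.slaveRoleArn,
         // Credentials used to assume the role: the slave profile if given,
         // otherwise whatever the SDK discovers by default.
         source: args.slaveProfile ? 'profile:' + args.slaveProfile : 'sdk-default',
         mfa: hasMfa,
      };
   }

   if (args.slaveProfile) {
      return { type: 'profile', profile: args.slaveProfile };
   }

   // Assumption: with no slave arguments, slaves use the master's credentials.
   return { type: 'same-as-master' };
}

console.log(describeSlaveCredentials({ slaveProfile: 'other-account' }));
```
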

"Dry Run" Mode

If you run the synchronizer without any of the modification flags (--write-missing, --write-differing, and --delete-extra), then the script will run in a dry run / report-only mode. It will log each item that is different or missing, and will report stats at the end of the run.
As noted above, the dry run will (by default) only scan the master table - finding items that are missing from the slaves or where the slave item differs from the master. If you want to also scan the slave tables to find items in them that do not exist in the master ("extra" items), supply the --scan-for-extra flag.
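
The relationship between the flags and what a run does can be sketched as follows. The `planRun` function and its camelCase flag names are hypothetical and only restate the rules described above.

```javascript
// Sketch only: derives the run mode from the modification and scan flags.
function planRun(flags) {
   var writes = Boolean(flags.writeMissing || flags.writeDiffering || flags.deleteExtra);

   return {
      // No modification flags means report-only mode.
      dryRun: !writes,
      // The master table is always scanned.
      scanMaster: true,
      // Slaves are scanned only to find (or delete) extra items.
      scanSlaves: Boolean(flags.scanForExtra || flags.deleteExtra),
   };
}

console.log(planRun({}));
console.log(planRun({ scanForExtra: true }));
console.log(planRun({ writeMissing: true, deleteExtra: true }));
```
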

Note About Race Conditions

There is no way to make the multi-region/multi-table operation atomic. Because time passes between our reads from and writes to the various tables, race conditions will exist.
For example, if you are replicating data from the master table to the slave table(s), it may be that we read an item that has not yet replicated to the slave. Or, it's possible that we read a new version of the item from the master, and an old version of the item from the slave(s). In either of these scenarios, you are "safe" with the --write-missing and --write-differing flags because the synchronizer will write what it read from the master, which in both of these scenarios is the newer data.
Of course, there are scenarios that are not safe as well. Consider, for example, the following scenario - portrayed as a serial list of events:
* Item X is written to the master table, and replicated to the slave tables.
* Another update to item X is written to the master table, bumping it to version 2.
* Our scan reads item X (version 2) from the master table.
* Another update to item X is written to the master table (from a separate process -
  not our synchronizer), bumping it to **version 3**.
   * This change is replicated to the slave tables.
* Our synchronizer reads item X from a slave table and receives version 3 of the item.
   * It now detects that there is a difference between the master and slave table
     because it read version 2 from the master and now reads version 3 from the slave.
* Because of the difference, our synchronizer writes version 2 to the slave table, thus
  **un-**synchronizing a change that had previously been in sync.
What can be done to avoid that scenario?
* You could re-run the script after it completes, and on the subsequent run, item X
  would be "fixed" because we would see version 3 in the master table and version 2
  (that we wrote on the last run) in the slave table(s).
   * Of course, running the script again introduces the potential for the race
     condition to happen again - on item X or other items.
* Better: if you have a version number or "last updated" field on your items, you could
  subclass or modify the `Synchronizer` class, overriding or modifying the
  `isItemDifferent` method. You could compare the version or last update of the items,
  and if the slave is newer than the master, you could skip the write.
* Only run the synchronizer when there are no writes happening on the table
  (understandably, this is not possible for most uses of DynamoDB since most use cases
  involve constant 24/7 writes).
* Pause your normal replication process while the synchronizer is running, and resume
  it after the synchronizer runs. This assumes that the replication process is written
  in a way that it is safe to pause it for the length of time necessary for the
  synchronizer to run.
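
As a rough illustration of the version-comparison idea, the following standalone sketch decides whether the master's copy should overwrite the slave's copy. It assumes every item carries a numeric `version` attribute (a hypothetical name), and the `isDifferent` callback stands in for whatever comparison is already in use. It does not reproduce the actual signature of `isItemDifferent`; check src/Synchronizer.js before adapting it.

```javascript
// Sketch only: returns true when the master's item should be written to the slave.
// Assumes a numeric `version` attribute on every item (hypothetical name).
function shouldOverwriteSlave(masterItem, slaveItem, isDifferent) {
   if (!slaveItem) {
      return true; // the item is missing on the slave
   }

   if (slaveItem.version > masterItem.version) {
      // The slave already received a newer write than the one we scanned,
      // so writing our (stale) copy would un-synchronize it.
      return false;
   }

   return isDifferent(masterItem, slaveItem);
}

// The scenario above: we scanned version 2, but the slave already has version 3.
var differs = function(a, b) { return a.version !== b.version; };

console.log(shouldOverwriteSlave({ id: 'X', version: 2 }, { id: 'X', version: 3 }, differs));
```
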

Authentication and Authorization to AWS DynamoDB API

The script can run without any specific AWS credentials configuration (for example, without --profile, --role-arn or the MFA-related arguments). In those cases the script will simply use the built-in authentication mechanism from the SDK. Thus, you will need to ensure that one of the methods that the SDK uses to auto-discover credentials will work in your environment. For example, you could:
* Supply the credentials via environment variables
* Use the default credentials in your shared credentials file
* Run the script on an EC2 instance that has permissions
Note that if you assume a role that requires MFA, the temporary credentials the script receives will only last for one hour, and cannot be refreshed without a new MFA token (which the script does not attempt to implement). Thus, the provisioned capacity, parallelism, and related limiting arguments must be configured to allow your entire table to be scanned during a single hour if you are using MFA.
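
To get a feel for whether a table can be scanned within that hour, a back-of-the-envelope estimate may help. The sketch below is hypothetical and deliberately simplistic: it considers only the scan of the master table on provisioned capacity, and ignores the batch reads on the slaves, throttling, retries, and other traffic on the table. It relies on the DynamoDB rule that one read capacity unit covers 4 KB per second for strongly consistent reads and twice that for eventually consistent reads; which kind of read the script performs is left as a parameter.

```javascript
// Rough estimate of how long a full Scan of the master table would take,
// assuming the scan can use all of the given provisioned read capacity.
function estimateScanSeconds(tableSizeBytes, readCapacityUnits, consistentRead) {
   // 1 RCU = 4 KB/s strongly consistent, or 8 KB/s eventually consistent
   var bytesPerUnit = 4096 * (consistentRead ? 1 : 2);

   return tableSizeBytes / (readCapacityUnits * bytesPerUnit);
}

// Example: a 10 GiB table with 1000 RCUs and eventually consistent reads.
var seconds = estimateScanSeconds(10 * Math.pow(1024, 3), 1000, false);

console.log(seconds <= 3600 ? 'fits in one hour' : 'too slow for one-hour MFA credentials');
```
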

What Permissions Are Needed?

On the master table you will need:
* DynamoDB:DescribeTable
* DynamoDB:Scan
* If using the `--delete-extra` flag:
   * DynamoDB:BatchGetItem
On the slave table(s) you will need:
* DynamoDB:DescribeTable
* DynamoDB:BatchGetItem
* If using the `--write-missing` or `--write-differing` flags:
   * DynamoDB:PutItem
* If using the `--delete-extra` or `--scan-for-extra` flags:
   * DynamoDB:Scan
* If using the `--delete-extra` flag:
   * DynamoDB:DeleteItem
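
The lists above can be turned into a small helper that computes the IAM actions needed for a given set of flags, for example when generating a least-privilege policy. The `requiredActions` function and its camelCase flag names are hypothetical; the action names are written with the lowercase `dynamodb:` prefix commonly used in IAM policies.

```javascript
// Sketch only: mirrors the permission lists above for a given set of flags.
function requiredActions(flags) {
   var master = [ 'dynamodb:DescribeTable', 'dynamodb:Scan' ],
       slave = [ 'dynamodb:DescribeTable', 'dynamodb:BatchGetItem' ];

   if (flags.deleteExtra) {
      master.push('dynamodb:BatchGetItem');
   }
   if (flags.writeMissing || flags.writeDiffering) {
      slave.push('dynamodb:PutItem');
   }
   if (flags.deleteExtra || flags.scanForExtra) {
      slave.push('dynamodb:Scan');
   }
   if (flags.deleteExtra) {
      slave.push('dynamodb:DeleteItem');
   }

   return { master: master, slave: slave };
}

console.log(requiredActions({ writeMissing: true, writeDiffering: true }));
```
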

How do I contribute?

We genuinely appreciate external contributions. See our extensive documentation on how to contribute.


This software is released under the MIT license. See the license file for more details.