[1]:
%load_ext autoreload
%autoreload 2

Intro to accsr

The goal of accsr is to simplify programmatic access to data on disk and in remote storage from Python. We often found ourselves repeating the same lines of code for pulling something from a bucket, loading a file from a tar archive, or creating a configuration module for storing paths to existing or to-be-loaded files. accsr allows doing all of this directly from Python, without relying on a CLI or external tools.

One of the design goals of accsr is to let users run the same code for loading data and configuration, independently of the state of the local file system.

For example, a developer who already has all data locally and wants to run an experiment on an extended data set would load the configuration with get_config(), instantiate a RemoteStorage object and call pull() to download any missing data from the remote storage. If no data is missing, nothing is downloaded, so no overhead is created.

A user who does not have the data locally would also call get_config() (possibly with a different config_local.json file, containing different access keys or namespaces) and then call pull() with exactly the same code. The data will be downloaded from the remote storage and stored locally.

Thus, the code never needs to change between development, testing and deployment, and unnecessary overhead for loading data is reduced as much as possible.
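
A minimal sketch of this pattern is shown below. Here, get_config comes from a project-level config.py as described in the next section, and the remote_storage property holding a RemoteStorageConfig is purely illustrative, not something accsr defines for you.

from accsr.remote_storage import RemoteStorage

from my_project.config import get_config  # your own config.py, see the next section

c = get_config()
storage = RemoteStorage(c.remote_storage)  # illustrative property containing the storage settings
storage.pull("data")  # only downloads files that are missing locally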

This approach also makes it easy to collaborate on data sets with the same code-base, and avoid stepping on each other’s toes by accident.

The configuration module

The configuration module provides utilities for reading configuration from a hierarchy of files and customizing access to them. Let us look at some use case examples for this.

[2]:
from accsr.config import ConfigProviderBase, DefaultDataConfiguration, ConfigurationBase
from accsr.remote_storage import RemoteStorage, RemoteStorageConfig
import os
from pathlib import Path

Setting up configuration providers

The recommended way of using accsr's configuration utils is to create a module called config.py within your project and set up classes and methods for managing and providing configuration. The cell below shows a minimal example of such a file.

Under the hood, the config provider reads out the __Configuration class from the generic type parameter at runtime and makes sure that only one global instance of your custom __Configuration exists in memory. Don't worry if you are unfamiliar with the coding patterns used here; you don't need to understand them to use the config utils. You will probably never need to adjust the ConfigProvider-related code.

[3]:
class __Configuration(ConfigurationBase):
    pass


class ConfigProvider(ConfigProviderBase[__Configuration]):
    pass


_config_provider = ConfigProvider()


def get_config(
    reload=False, config_files=("config.json", "config_local.json")
) -> __Configuration:
    """
    :param reload: if True, the configuration will be reloaded from the json files
    :param config_files: the list of files to load the configuration from
    :return: the configuration instance
    """
    return _config_provider.get_config(reload=reload, config_files=config_files)
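
Elsewhere in your project you would then obtain the configuration like this (the module path my_project.config is just a placeholder for wherever you put the file above):

from my_project.config import get_config

config = get_config()  # loads config.json and config_local.json on the first call
# subsequent calls return the same instance unless reload=True is passed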

Loading configuration from files

We found the following workflow useful for managing configuration files:

  1. Create a config.json file in the root of your project. This file should contain all the default configuration and be committed to version control.

  2. Create a config_local.json file in the root of your project with the user-specific configuration. This file should not be committed to version control. It does not need to contain all the configuration, only the parts that are different from the default configuration.

NOTE: YAML files are also supported, but by default the configuration is read from the two JSON files mentioned above. You can freely mix YAML and JSON and define your own hierarchy by passing config_files; for example, config_files=("config.json", "config_local.yaml") is allowed.

A typical use case is to have default configuration for the RemoteStorage in config.json and to have secrets (like the access key and secret), as well as a user-specific base path in config_local.json. In this way, multiple users can use the same code for loading data while still being able to experiment on their own data sets - for example storing these data sets in the same bucket but in different namespaces.
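
As a sketch of this setup, the configuration class could expose the remote storage settings roughly as follows. The remote_storage key and the property name are illustrative; the entry is read with the same _get_non_empty_entry accessor shown in the next section, and the RemoteStorageConfig fields themselves appear further below.

class __Configuration(ConfigurationBase):
    @property
    def remote_storage(self) -> RemoteStorageConfig:
        # expects a "remote_storage" entry with provider, bucket, key, secret etc.;
        # the defaults live in config.json, secrets and base_path in config_local.json
        return RemoteStorageConfig(**self._get_non_empty_entry("remote_storage"))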

Another use case is to include a read-only access key in config.json, which is then distributed to users in version-control, and a read-write access key in config_local.json for the developers who need to update data.

Including environment variables

One can tell the configuration to read a value from an environment variable instead of writing the value directly to the file. This is useful, for example, when running code in CI, where it might be easier to adjust environment variables than files (while GitLab CI offers file-type secrets, there is no such feature in GitHub Actions at the time of writing).

To instruct the configuration to read a value from the environment, simply prepend "env:" to the configured value. For example, if your config.json looks like

{
  "configured_val": "fixed_value",
  "from_env_var": "env:MY_ENV_VAR"
}

then implementing the configuration as

class __Configuration(ConfigurationBase):
    @property
    def configured_val(self) -> str:
        return self._get_non_empty_entry("configured_val")

    @property
    def from_env_var(self) -> str:
        return self._get_non_empty_entry("from_env_var")

will result in the value of the from_env_var property being read at runtime from the environment variable MY_ENV_VAR. Thus, changing the value of the environment variable changes the value of the property. This is in contrast to plain values like configured_val, which are read at config-loading time and only change when the config is reloaded.
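
At runtime, this could then behave roughly as follows (a sketch; it assumes the config.json shown above and that MY_ENV_VAR is set before the properties are accessed):

import os

os.environ["MY_ENV_VAR"] = "value-from-environment"

config = get_config(reload=True)
print(config.configured_val)  # "fixed_value", read from config.json when the config was loaded
print(config.from_env_var)  # "value-from-environment", read from the environment on access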

Default Configurations

accsr includes a default implementation of the ConfigurationBase class meant for typical ML and data-driven projects. To use this, simply inherit from DefaultDataConfiguration instead of ConfigurationBase. The resulting configuration class will have some default properties and methods for managing paths to data.
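
A minimal sketch of the switch (the concrete default properties and path helpers are documented in accsr.config and not spelled out here):

class __Configuration(DefaultDataConfiguration):
    # path handling for the project's data directories is inherited;
    # add your project-specific properties here as before
    pass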

The RemoteStorage facilities

accsr makes it easy to interact with data stored in a remote blob storage such as S3, Google Cloud Storage or Azure Blob Storage. The RemoteStorage class implements git-like push/pull logic and uses apache-libcloud under the hood.

In order to demonstrate the RemoteStorage functionality, we will start minIO, an object store with an S3-compatible interface, using docker-compose. We also switch to the tests directory, where the docker-compose file and some resource files for testing have been prepared.

[4]:
notebooks_dir = Path(os.getcwd()).absolute()
tests_dir = notebooks_dir.parent / "tests" / "accsr"

os.chdir(tests_dir)
[5]:
if not os.getenv("CI"):
    # In CI, we start the minIO container separately
    !docker-compose up -d
    host = "localhost"
else:
    host = "remote-storage"

port = 9001
api_port = 9000

We should now have minIO up and running.

Now we can instantiate a RemoteStorage object and interact with minIO.

[6]:
remote_storage_config = RemoteStorageConfig(
    provider="s3",
    key="minio-root-user",
    secret="minio-root-password",
    bucket="accsr-demo",
    base_path="my_remote_dir",
    host=host,
    port=api_port,
    secure=False,
)

storage = RemoteStorage(remote_storage_config)

The base_path is a "directory" (or rather a namespace) within the bucket. All calls to the storage object only affect files under the base_path. For example, with the configuration above, pushing the local file resources/sample.txt creates the object my_remote_dir/resources/sample.txt in the accsr-demo bucket.

The bucket itself does not exist yet, so let us create it. This has to be done by the user explicitly, to prevent accidental costs. Of course, if the configuration is pointing to an existing bucket, this step is not necessary.

[7]:
storage.create_bucket()

Now we can push, pull, list and generally interact with objects inside base_path within the bucket. Let us first push the resources directory to have something to start with.

The pull and push commands will return a summary of the transaction with the bucket. If the flag dryrun=True is specified, then the transaction is only computed but not executed - a good way to make sure that you are doing what is desired before actually interacting with data.

[8]:
dry_run_summary = storage.push("resources", dryrun=True)

print(f"Here the dryrun summary: ")
dry_run_summary.print_short_summary()
Scanning files in /__w/accsr/accsr/tests/accsr/resources: 100%|██████████| 11/11 [00:00<00:00, 171.59it/s]
Here is the dryrun summary:
{
  "sync_direction": "push",
  "files_to_sync": 11,
  "total_size": 599,
  "unresolvable_collisions": 0,
  "synced_files": 0
}

The summary shows that this call would push 11 files if we removed the dryrun flag. Every detail of the transaction can be retrieved from the summary object.

[9]:
local_files_checked = dry_run_summary.matched_source_files
would_be_pushed = dry_run_summary.not_on_target
pushed_files = dry_run_summary.synced_files

print(
    f"Out of {len(local_files_checked)} files that we found inside the 'resources' dir, "
    f"we would push {len(would_be_pushed)}. In the last transaction {len(pushed_files)} files were synced."
)
Out of 11 files that we found inside the 'resources' dir, we would push 11. In the last transaction 0 files were synced.

Now let us actually perform the push.

[10]:
def push_and_print():
    push_summary = storage.push("resources")
    local_files_checked = push_summary.matched_source_files
    pushed_files = push_summary.synced_files

    print(
        f"Out of {len(local_files_checked)} files that we found inside the "
        f"'resources' dir, we pushed {len(pushed_files)}."
    )
[11]:
push_and_print()
Scanning files in /__w/accsr/accsr/tests/accsr/resources: 100%|██████████| 11/11 [00:00<00:00, 404.20it/s]
pushing (bytes): 100%|██████████| 599/599 [00:00<00:00, 16981.91it/s]
Out of 11 files that we found inside the 'resources' dir, we pushed 11.

If we now push again, no new files will be synced. This holds even if force=True is specified, because the file hashes are equal. The force=True flag is useful when file names collide but the hashes differ; without it, such a transaction fails as a whole and nothing is executed, much like with git. This prevents ending up in an uncertain state where a transaction breaks in the middle of execution.

In accsr, this behaviour is achieved by always computing the full transaction summary before performing any changes on the file systems, and rejecting the transaction entirely if collisions are found and force=False (the default).

[12]:
push_and_print()
Scanning files in /__w/accsr/accsr/tests/accsr/resources: 100%|██████████| 11/11 [00:00<00:00, 167.12it/s]
pushing (bytes): 0it [00:00, ?it/s]
Out of 11 files that we found inside the 'resources' dir, we pushed 0.
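
If you want to guard against such collisions explicitly, you can inspect a dryrun summary before committing to the push. A sketch, relying on the unresolvable_collisions entry that also appears in the printed summaries above:

summary = storage.push("resources", dryrun=True)
if summary.unresolvable_collisions:
    # same file names but different content on the remote; resolve manually
    # or overwrite deliberately with force=True
    print("Collisions detected, not pushing.")
else:
    storage.push("resources")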

If we delete one file on the remote and push again, a single file will be pushed.

[13]:
deleted_files = storage.delete("resources/sample.txt")
print(f"Deleted {len(deleted_files)} files.")

push_and_print()
Deleted 1 files.
Scanning files in /__w/accsr/accsr/tests/accsr/resources: 100%|██████████| 11/11 [00:00<00:00, 367.63it/s]
pushing (bytes): 100%|██████████| 11/11 [00:00<00:00, 3046.78it/s]
Out of 11 files that we found inside the 'resources' dir, we pushed 1.

The same logic applies to pulling. Generally, RemoteStorage only downloads and uploads data if it is strictly necessary, so it is safe, for example, to always call pull from a script or notebook: nothing is pulled if the necessary files are already present. Even pulling with force=True is "safe", in the sense that it is fast. Using force=True is a good option for making sure that the data one uses is the latest version from the remote.
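
For example, a data-loading script can simply always pull at the start. A sketch, assuming pull mirrors the push calls used above:

pull_summary = storage.pull("resources", force=True)
print(f"Pulled {len(pull_summary.synced_files)} files from the remote.")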

On top of the basic usage presented above, RemoteStorage also supports filtering files with regular expressions via the optional include_regex and exclude_regex parameters of the corresponding methods. Glob expressions are also permitted when pushing files. See the docstrings of RemoteStorage for more details.
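
A sketch of such filtering (the pattern is illustrative; see the docstrings for the exact matching semantics):

# only consider .txt files under "resources"; with dryrun=True nothing is transferred
summary = storage.pull("resources", include_regex=r".*\.txt$", dryrun=True)
summary.print_short_summary()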

[14]:
# Shutting down minio and going back to notebooks dir

if not os.getenv("CI"):
    # In CI we start the minIO container separately
    !docker-compose down
os.chdir(notebooks_dir)