remote_storage
- class Provider(value)[source]
Bases: str, Enum
- GOOGLE_STORAGE = 'google_storage'
- S3 = 's3'
- AZURE_BLOBS = 'azure_blobs'
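Because Provider derives from both str and Enum, its members can be looked up by value and compared directly with plain strings. The sketch below re-declares the enum locally so that it runs without the package installed; it mirrors the members listed above and is not the library's source.

```python
from enum import Enum


class Provider(str, Enum):
    """Local re-declaration mirroring the documented members (illustration only)."""

    GOOGLE_STORAGE = "google_storage"
    S3 = "s3"
    AZURE_BLOBS = "azure_blobs"


# Look up a member by its string value, e.g. one read from a config file.
provider = Provider("s3")

# The str mixin makes members compare equal to their plain string value.
is_equal_to_string = Provider.S3 == "s3"

# Iterating over the enum yields the members in definition order.
provider_values = [member.value for member in Provider]
```

Since RemoteStorageConfig declares provider as a plain str, passing Provider.S3.value is presumably the safest way to hand a member to it.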
- class RemoteObjectProtocol(*args, **kwargs)[source]
Bases: Protocol
Protocol of classes that describe remote objects. Describes information about the remote object and functionality to download the object.
- name: str
- size: int
- hash: int
- download(download_path, overwrite_existing=False) RemoteObjectProtocol | None
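A protocol is satisfied structurally: any object with the listed attributes and a download method qualifies, without inheriting from anything. The following is a minimal sketch of that idea. RemoteObjectLike and InMemoryObject are hypothetical names, and returning None when an existing file is left untouched is an assumption made for this example, not documented behavior.

```python
import hashlib
import os
import tempfile
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class RemoteObjectLike(Protocol):
    """Hypothetical local mirror of the documented protocol members."""

    name: str
    size: int
    hash: int

    def download(self, download_path, overwrite_existing=False): ...


class InMemoryObject:
    """Toy object that satisfies the protocol without inheriting from it."""

    def __init__(self, name: str, payload: bytes):
        self.name = name
        self.payload = payload
        self.size = len(payload)
        # Assumption: the hash is the MD5 digest interpreted as an integer.
        self.hash = int(hashlib.md5(payload).hexdigest(), 16)

    def download(self, download_path, overwrite_existing=False) -> Optional["InMemoryObject"]:
        # Assumption: an existing file is skipped unless overwriting is requested.
        if os.path.exists(download_path) and not overwrite_existing:
            return None
        with open(download_path, "wb") as f:
            f.write(self.payload)
        return self


obj = InMemoryObject("data/example.txt", b"hello")
satisfies_protocol = isinstance(obj, RemoteObjectLike)

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.txt")
    first = obj.download(target)  # file does not exist yet
    second = obj.download(target)  # file exists, not overwritten
    third = obj.download(target, overwrite_existing=True)
```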
- class SyncObject(sync_direction: Literal['push', 'pull'], local_path: str | None = None, remote_obj: RemoteObjectProtocol | None = None, remote_path: str | None = None, remote_obj_overridden_md5_hash: int | None = None)[source]
Bases: _JsonReprMixin
Class representing the sync status between a local path and a remote object. It is mainly used for creating summaries and syncing within RemoteStorage, and for introspection before and after push/pull transactions.
Creating or manipulating instances of this class outside RemoteStorage, in particular in user code, is not recommended. This class forms part of the public interface because instances of it are given to users for introspection.
- property name
- property exists_on_target: bool
True iff the file exists in both locations.
- set_local_path(path: str | None)
Changes the local path of the SyncObject.
- Parameters:
path – the new local path
- Returns:
None
- property exists_on_remote
- property equal_md5_hash_sum
- get_bytes_transferred() int
- Returns:
the number of bytes (to be) transferred for this object
- to_dict(make_serializable=True)
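The hash fields above are typed as int, and equal_md5_hash_sum compares checksums of the local and the remote side. How the library derives the integer is not stated in this reference. The sketch below shows one plausible approach, reading the MD5 hex digest as a base-16 integer; md5_as_int is a hypothetical helper and not part of the documented API.

```python
import hashlib
import os
import tempfile


def md5_as_int(path: str, chunk_size: int = 1 << 20) -> int:
    """Return the MD5 digest of a file as an integer (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so that large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return int(digest.hexdigest(), 16)


with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "example.txt")
    with open(file_path, "wb") as f:
        f.write(b"hello")
    local_hash = md5_as_int(file_path)

# Stand-in for the hash reported by a remote object with the same content.
remote_hash = int(hashlib.md5(b"hello").hexdigest(), 16)
hashes_match = local_hash == remote_hash
```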
- class TransactionSummary(matched_source_files: ~typing.List[~accsr.remote_storage.SyncObject] = <factory>, not_on_target: ~typing.List[~accsr.remote_storage.SyncObject] = <factory>, on_target_eq_md5: ~typing.List[~accsr.remote_storage.SyncObject] = <factory>, on_target_neq_md5: ~typing.List[~accsr.remote_storage.SyncObject] = <factory>, unresolvable_collisions: ~typing.Dict[str, ~typing.List[~accsr.remote_storage.RemoteObjectProtocol] | str] = <factory>, skipped_source_files: ~typing.List[~accsr.remote_storage.SyncObject] = <factory>, synced_files: ~typing.List[~accsr.remote_storage.SyncObject] = <factory>, sync_direction: ~typing.Literal['push', 'pull'] | None = None)[source]
Bases: _JsonReprMixin
Class representing the summary of a push or pull operation. It is mainly used for introspection before and after push/pull transactions.
Creating or manipulating instances of this class outside RemoteStorage, in particular in user code, is not recommended. This class forms part of the public interface because instances of it are given to users for introspection.
- matched_source_files: List[SyncObject]
- not_on_target: List[SyncObject]
- on_target_eq_md5: List[SyncObject]
- on_target_neq_md5: List[SyncObject]
- unresolvable_collisions: Dict[str, List[RemoteObjectProtocol] | str]
- skipped_source_files: List[SyncObject]
- synced_files: List[SyncObject]
- sync_direction: Literal['push', 'pull'] | None = None
- property files_to_sync: List[SyncObject]
Returns the list of files that need synchronization.
- Returns:
list of all files that are not on the target or whose md5sums differ between source and target
- size_files_to_sync() int
Computes the total size of all objects that need synchronization. Raises a RuntimeError if sync_direction is not set to 'push' or 'pull'.
- Returns:
the total size of all local objects that need synchronization if self.sync_direction='push', and the total size of all remote objects that need synchronization if self.sync_direction='pull'
- property requires_force: bool
Getter of the requires_force property.
- Returns:
True iff a failure of the transaction can only be prevented by setting force=True.
- property has_unresolvable_collisions: bool
Getter of the has_unresolvable_collisions property.
- Returns:
True iff there exists a collision that cannot be resolved.
- property all_files_analyzed: List[SyncObject]
Getter of the all_files_analyzed property.
- Returns:
list of all analyzed source files
- add_entry(synced_object: SyncObject, collides_with: List[RemoteObjectProtocol] | str | None = None, skip: bool = False)
Adds a SyncObject to the summary.
- Parameters:
synced_object – either a SyncObject or a path to a local file.
collides_with – specification of unresolvable collisions for the given sync object
skip – if True, the object is marked to be skipped
- Returns:
None
- get_short_summary_dict()
Returns a short summary of the transaction as a dictionary.
- print_short_summary()
Prints a short summary of the transaction (shorter than the full repr, which contains information about local and remote objects).
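The relationship between the file lists, files_to_sync and size_files_to_sync() described above can be sketched with simplified stand-ins. FileEntry and SummarySketch are hypothetical classes written for this illustration; they follow the documented descriptions but are not the library's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FileEntry:
    """Hypothetical stand-in for SyncObject with only the sizes needed here."""

    name: str
    local_size: int
    remote_size: int


@dataclass
class SummarySketch:
    """Hypothetical stand-in for TransactionSummary."""

    not_on_target: List[FileEntry] = field(default_factory=list)
    on_target_eq_md5: List[FileEntry] = field(default_factory=list)
    on_target_neq_md5: List[FileEntry] = field(default_factory=list)
    sync_direction: Optional[str] = None

    @property
    def files_to_sync(self) -> List[FileEntry]:
        # Files missing on the target plus files whose checksum differs.
        return self.not_on_target + self.on_target_neq_md5

    def size_files_to_sync(self) -> int:
        if self.sync_direction == "push":
            return sum(f.local_size for f in self.files_to_sync)
        if self.sync_direction == "pull":
            return sum(f.remote_size for f in self.files_to_sync)
        raise RuntimeError("sync_direction must be 'push' or 'pull'")


summary = SummarySketch(
    not_on_target=[FileEntry("a.txt", local_size=10, remote_size=12)],
    on_target_eq_md5=[FileEntry("c.txt", local_size=5, remote_size=5)],
    on_target_neq_md5=[FileEntry("b.txt", local_size=20, remote_size=25)],
    sync_direction="push",
)
names_to_sync = [f.name for f in summary.files_to_sync]
push_size = summary.size_files_to_sync()

summary.sync_direction = "pull"
pull_size = summary.size_files_to_sync()
```

Files whose checksum already matches on the target (c.txt here) contribute to neither the list nor the size.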
- class RemoteStorageConfig(provider: str, key: str, bucket: str, secret: str, region: str | None = None, host: str | None = None, port: int | None = None, base_path: str = '', secure: bool = True, use_pbar: bool = True, log_level: int = 20)[source]
Bases: object
Contains all necessary information to establish a connection to a bucket within the remote storage, and the base path on the remote.
- provider: str
- key: str
- bucket: str
- secret: str
- region: str | None = None
- host: str | None = None
- port: int | None = None
- base_path: str = ''
- secure: bool = True
- use_pbar: bool = True
whether to use progress bars, which are printed to stderr. If set to False, progress will instead be logged at the log level specified in log_level.
- log_level: int = 20
level at which to log progress for the case where use_pbar is disabled.
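The field list above can be mirrored by a small dataclass to show which values are required and what the defaults mean. ConfigSketch is a hypothetical local stand-in that copies the documented signature, and every value in raw_settings is a placeholder. The default log_level of 20 corresponds to logging.INFO in the standard library.

```python
import logging
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfigSketch:
    """Hypothetical stand-in mirroring the documented fields and defaults."""

    provider: str
    key: str
    bucket: str
    secret: str
    region: Optional[str] = None
    host: Optional[str] = None
    port: Optional[int] = None
    base_path: str = ""
    secure: bool = True
    use_pbar: bool = True
    log_level: int = 20


# Placeholder settings, e.g. as they might be loaded from a JSON or YAML file.
raw_settings = {
    "provider": "s3",
    "key": "my-access-key",
    "bucket": "my-bucket",
    "secret": "my-secret",
    "base_path": "experiments",
    "use_pbar": False,
}

config = ConfigSketch(**raw_settings)

# With use_pbar=False, progress would be logged at log_level, i.e. INFO by default.
logs_at_info = config.log_level == logging.INFO
```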
- class RemoteStorage(conf: RemoteStorageConfig, add_extra_to_upload: Callable[[SyncObject], dict] | None = None, remote_hash_extractor: Callable[[RemoteObjectProtocol], int] | None = None)[source]
Bases: object
Wrapper around Apache Libcloud for accessing remote storage services.
- create_bucket(exist_ok: bool = True)
- property conf: RemoteStorageConfig
- property provider: str
- property remote_base_path: str
- set_remote_base_path(path: str | None)
Changes the base path in the remote storage (overriding the base path extracted from RemoteStorageConfig during instantiation). Pull and push operations will only affect files within the remote base path.
- Parameters:
path – a path with linux-like separators
- property bucket: Container
- property driver: StorageDriver
- pull(remote_path: str, local_base_dir: str = '', force: bool = False, include_regex: str | Pattern | None = None, exclude_regex: str | Pattern | None = None, convert_to_linux_path: bool = True, dryrun: bool = False, path_regex: str | Pattern | None = None, strip_abspath_prefix: str | None = None, strip_abs_local_base_dir: bool = True, use_pbar: bool | None = None) TransactionSummary
Pulls either a file or a directory from the given remote path into a local path constructed relative to local_base_dir.
- Parameters:
remote_path – Remote path on the storage bucket, relative to the configured remote base path, e.g. 'data/ground_truth/some_file.json'. Can also be an absolute local path if strip_abspath_prefix is specified.
local_base_dir – Local base directory for constructing the local path. E.g. passing 'local_base_dir' will download to the path 'local_base_dir/data/ground_truth/some_file.json' in the above example.
force – If False, pull will raise an error if an already existing file deviates from the remote in its md5sum. If True, these files are overwritten.
include_regex – If not None, only files with paths matching the regex will be pulled. This is useful for filtering files within a remote directory before pulling them.
exclude_regex – If not None, files with paths matching the regex will be excluded from the pull. Takes precedence over include_regex, i.e. if a file matches both, it will be excluded.
convert_to_linux_path – If True, will convert a Windows path to a Linux path (as needed by remote storage); thus a remote path like 'data\my\path' will be converted to 'data/my/path' before pulling. This should only be set to False if you want to pull a remote object with '\' in its file name (which is discouraged).
dryrun – If True, simulates the pull operation and returns the remote objects that would have been pulled.
path_regex – DEPRECATED! Use include_regex instead.
strip_abspath_prefix – Will only have an effect if the remote_path is absolute. Then the given prefix is removed from it before pulling. This is useful for pulling files from a remote storage by directly specifying absolute local paths instead of first converting them to actual remote paths. Similar in logic to local_path_prefix in push. A common use case is to always set local_base_dir to the same value and to always pass absolute paths as remote_path to pull.
strip_abs_local_base_dir – If True, and local_base_dir is an absolute path, then the local_base_dir will be treated as strip_abspath_prefix. See explanation of strip_abspath_prefix.
use_pbar – If not None, overrides the configured default value for this flag. Specifically, if True, will use a progress bar for the pull operation; if False, will use logging.
- Returns:
An object describing the summary of the operation.
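The precedence of exclude_regex over include_regex, and the Windows-to-Linux path conversion, can be sketched as follows. filter_paths and to_linux_path are hypothetical helpers written for this illustration. The sketch assumes that patterns are applied with re.match (anchored at the start of the path); the library's exact matching function is not stated in this reference.

```python
import re
from typing import Iterable, List, Optional, Pattern, Union

Regex = Union[str, Pattern]


def to_linux_path(path: str) -> str:
    """Replace Windows separators with forward slashes (hypothetical helper)."""
    return path.replace("\\", "/")


def filter_paths(
    paths: Iterable[str],
    include_regex: Optional[Regex] = None,
    exclude_regex: Optional[Regex] = None,
) -> List[str]:
    """Keep paths that match include_regex and do not match exclude_regex."""
    # re.compile accepts both strings and already compiled patterns.
    include = re.compile(include_regex) if include_regex is not None else None
    exclude = re.compile(exclude_regex) if exclude_regex is not None else None
    selected = []
    for path in paths:
        # Exclusion is checked first, so it wins if both patterns match.
        if exclude is not None and exclude.match(path):
            continue
        if include is not None and not include.match(path):
            continue
        selected.append(path)
    return selected


remote_paths = ["data/a.json", "data/b.csv", "data/tmp/c.json"]
pulled = filter_paths(remote_paths, include_regex=r".*\.json$", exclude_regex=r".*/tmp/.*")
converted = to_linux_path("data\\my\\path")
```

Here data/tmp/c.json matches both patterns and is therefore dropped, leaving only data/a.json.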
- get_push_remote_path(local_path: str) str
Get the full path within a remote storage bucket for pushing.
- Parameters:
local_path – the local path to the file
- Returns:
the remote path that corresponds to the local path
- push(path: str, local_path_prefix: str | None = None, force: bool = False, include_regex: str | Pattern | None = None, exclude_regex: str | Pattern | None = None, dryrun: bool = False, path_regex: str | Pattern | None = None, use_pbar: bool | None = None) TransactionSummary
Upload files into the remote storage. Does not upload files for which the md5sum matches existing remote files. The remote path for uploading will be constructed from the remote_base_path and the provided path. The local_path_prefix serves for finding the directory on the local system or for stripping off parts of absolute paths if path is absolute, see examples below.
Examples
1. path=foo/bar, local_path_prefix=None -> ./foo/bar uploaded to remote_base_path/foo/bar
2. path=/home/foo/bar, local_path_prefix=None -> /home/foo/bar uploaded to remote_base_path/home/foo/bar
3. path=bar, local_path_prefix=/home/foo -> /home/foo/bar uploaded to remote_base_path/bar
4. path=/home/foo/bar, local_path_prefix=/home/foo -> /home/foo/bar uploaded to remote_base_path/bar (same as 3)
5. path=/home/baz/bar, local_path_prefix=/home/foo -> ValueError: Specified path=/home/baz/bar is not a child of local_path_prefix=/home/foo
- Parameters:
path – Path to the local object (file or directory) to be uploaded; may be absolute or relative. Globs are supported as well, thus path may be a pattern like *.txt.
local_path_prefix – Prefix to be concatenated with path.
force – If False, push will raise an error if an already existing remote file deviates from the local in its md5sum. If True, these files are overwritten.
include_regex – If not None, only files with paths matching the regex will be pushed. Note that paths matched against the regex will be relative to local_path_prefix.
exclude_regex – If not None, only files with paths not matching the regex will be pushed. Takes precedence over include_regex, i.e. if a file matches both regexes, it will be excluded. Note that paths matched against the regex will be relative to local_path_prefix.
dryrun – If True, simulates the push operation and returns the summary (with synced_files being an empty list).
path_regex – DEPRECATED! Same as include_regex.
use_pbar – If not None, overrides the configured default value for this flag. Specifically, if True, will use a progress bar for the push operation; if False, will use logging.
- Returns:
An object describing the summary of the operation.
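The path resolution shown in the push examples can be reproduced with a few lines of path arithmetic. resolve_push_paths is a hypothetical helper written only to mirror those five examples; it uses posixpath so that the result is the same on every platform, and it ignores globs and any further normalization the library may perform.

```python
import posixpath
from typing import Optional, Tuple


def resolve_push_paths(
    path: str,
    local_path_prefix: Optional[str] = None,
    remote_base_path: str = "remote_base_path",
) -> Tuple[str, str]:
    """Return (local_path, remote_path) following the documented examples."""
    if local_path_prefix is None:
        # Relative paths are resolved against the working directory.
        local_path = path if posixpath.isabs(path) else "./" + path
        relative = path.lstrip("/")
    else:
        prefix = posixpath.normpath(local_path_prefix)
        if posixpath.isabs(path):
            normalized = posixpath.normpath(path)
            # An absolute path must lie inside the prefix directory.
            if posixpath.commonpath([normalized, prefix]) != prefix:
                raise ValueError(
                    f"Specified path={path} is not a child of "
                    f"local_path_prefix={local_path_prefix}"
                )
            relative = posixpath.relpath(normalized, prefix)
        else:
            relative = path
        local_path = posixpath.join(prefix, relative)
    return local_path, posixpath.join(remote_base_path, relative)


example_1 = resolve_push_paths("foo/bar")
example_2 = resolve_push_paths("/home/foo/bar")
example_3 = resolve_push_paths("bar", local_path_prefix="/home/foo")
example_4 = resolve_push_paths("/home/foo/bar", local_path_prefix="/home/foo")

try:
    resolve_push_paths("/home/baz/bar", local_path_prefix="/home/foo")
    example_5_error = None
except ValueError as e:
    example_5_error = str(e)
```

In this sketch the third and fourth calls produce the same pair of paths, which matches the note in the examples that both forms are equivalent.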
- delete(remote_path: str, include_regex: str | Pattern | None = None, exclude_regex: str | Pattern | None = None, path_regex: str | Pattern | None = None) List[RemoteObjectProtocol]
Deletes a file or a directory under the given path, relative to the configured remote base path. Use with caution!
- Parameters:
remote_path – remote path on storage bucket relative to the configured remote base path.
include_regex – If not None, only files with paths matching the regex will be deleted.
exclude_regex – If not None, only files with paths not matching the regex will be deleted. Takes precedence over include_regex, i.e. if a file matches both regexes, it will be excluded.
path_regex – DEPRECATED! Same as include_regex.
- Returns:
list of remote objects referring to all deleted files
- list_objects(remote_path: str) List[RemoteObjectProtocol]
- Parameters:
remote_path – remote path on storage bucket relative to the configured remote base path.
- Returns:
list of remote objects under the remote path (multiple entries if the remote path is a directory)