Data Store Guide¶
Goal¶
The Data Store is more than having a place to save your files. The Data Store is a way to manage the life cycle of your data. From the moment you create data, to publication and beyond, there are a number of practices you should follow to ensure the integrity and value of your data are maintained. This includes making your data FAIR (Findable, Accessible, Interoperable, and Reusable). The Data Store helps to achieve this with less effort. This guide will cover the minimum needed to get you started. Please look through the Data Store Manual for a more comprehensive look at Data Store capabilities.
Maintainer | Institution | Contact |
---|---|---|
Jason Williams | CyVerse / CSHL | Williams@cshl.edu |
Drag-and-drop Data Transfer with Cyberduck¶
Cyberduck is a third-party tool (available as freeware) that allows you to drag and drop files between your local computer and the Data Store. Cyberduck can also be used to rename files and browse other shared or public Data Store locations.
Download and first-time configuration of Cyberduck¶
Download Cyberduck at the Cyberduck Website; follow the installation instructions for your operating system.
Download the CyVerse Cyberduck connection profile. Double-click on the downloaded file; this should open the installed Cyberduck software.
In the Cyberduck configuration window, enter your CyVerse username in the field ‘iPlant username’.
Under ‘Advanced Options’ ensure ‘Transfer Files’ option is set to ‘Open Multiple Connections’. Close this window - your entries will be automatically saved.
Double-click on the Data Store bookmark in the Cyberduck window. Enter your CyVerse credentials.
You should now be connected to the CyVerse Data Store and viewing the contents of your home directory.
Upload from local computer to Data Store using Cyberduck¶
Warning
When uploading your data to the Data Store you should not upload files/folders with names containing spaces (e.g. experiment one.fastq) or names that contain special characters (e.g. ~ ` ! @ # $ % ^ & * ( ) + = { } [ ] | : ; " ' < > , ? /). The Apps on the Discovery Environment and most command line applications will typically not tolerate these characters. For long file/folder names the use of underscores (e.g. experiment_one.fastq) is the recommended practice.
Double-click on the Data Store bookmark to connect to the Data Store
Select file(s)/folder(s) from your local machine and drag them into the Cyberduck window. (You may drag them directly into an existing folder, or create a new folder from the Cyberduck ‘File’ menu.)
A ‘Transfers’ window will appear. Monitor the upload to completion.
Download from Data Store to local computer using Cyberduck¶
Tip
In the Cyberduck ‘File’ menu, there are several more functionalities. You can for example directly specify files and folders to move without dragging and dropping them. You can also ‘synchronize’ folders - only copying items that are missing in a folder rather than copying all contents.
Fix or improve this documentation
- Search for an answer: CyVerse Learning Center
- Ask us for help: click on the lower right-hand side of the page
- Report an issue or submit a change: Github Repo Link
- Send feedback: learning@CyVerse.org
Command Line Transfer with iCommands¶
iCommands is a collection of tools developed by the iRODS project. iRODS is the technology that supports the CyVerse Data Store. Using iCommands is the most flexible way to interact with the Data Store. This section will cover the basics of installation and use; see also the official iRODS iCommands Documentation.
Some things to remember about iCommands
- This is a command line tool, operated in a terminal.
- There is poor support for Windows OS: we have not yet tested a Windows-only shell version of iCommands. We suggest installing the Windows Subsystem for Linux (WSL) and following the Linux installation instructions.
iCommands Installation for Linux¶
On a Linux OS you can use a package manager to install iCommands in the terminal.
CentOS:
Instructions for configuring the iRODS repository can be found on the iRODS Packages webpage. After configuring the repository, yum can be used to install the iCommands package irods-icommands.
sudo rpm --import https://packages.irods.org/irods-signing-key.asc
wget -qO - https://packages.irods.org/renci-irods.yum.repo \
| sudo tee /etc/yum.repos.d/renci-irods.yum.repo
sudo yum install irods-icommands
If that does not work, an older version of iCommands, 4.1.12, can be installed from RENCI’s website.
Ubuntu 18.04:
Instructions for configuring the iRODS repository can be found on the iRODS Packages webpage. After configuring the repository, apt can be used to install the iCommands package irods-icommands.
wget -qO - https://packages.irods.org/irods-signing-key.asc \
| sudo apt-key add -
echo "deb [arch=amd64] https://packages.irods.org/apt/ $(lsb_release -sc) main" \
| sudo tee /etc/apt/sources.list.d/renci-irods.list
sudo apt-get update
sudo apt install irods-icommands
Ubuntu 20.04:
iRODS doesn’t currently support Ubuntu 20.04. However, the packages for Ubuntu 18.04 work as long as a few extra packages are installed.
Here are the commands to configure the iRODS repository.
wget -qO - https://packages.irods.org/irods-signing-key.asc \
| sudo apt-key add -
echo "deb [arch=amd64] https://packages.irods.org/apt/ bionic main" \
| sudo tee /etc/apt/sources.list.d/renci-irods.list
sudo apt update
Prior to installing the iCommands package, a few 18.04 packages need to be installed that are not available for 20.04. Here are the commands to install these packages.
wget --directory-prefix /tmp/ \
http://security.ubuntu.com/ubuntu/pool/main/p/python-urllib3/python-urllib3_1.22-1ubuntu0.18.04.2_all.deb \
http://security.ubuntu.com/ubuntu/pool/main/r/requests/python-requests_2.18.4-2ubuntu0.1_all.deb \
http://security.ubuntu.com/ubuntu/pool/main/o/openssl1.0/libssl1.0.0_1.0.2n-1ubuntu5.6_amd64.deb
sudo apt install \
/tmp/libssl1.0.0_1.0.2n-1ubuntu5.6_amd64.deb \
/tmp/python-urllib3_1.22-1ubuntu0.18.04.2_all.deb \
/tmp/python-requests_2.18.4-2ubuntu0.1_all.deb
Now apt can be used to install the iCommands package irods-icommands.
sudo apt install irods-icommands
If the above does not work, e.g., because of incomplete support for Ubuntu 20.04, an older version of iCommands, 4.1.10, can be installed by doing the following.
sudo apt update
wget \
http://mirrors.kernel.org/ubuntu/pool/main/g/glibc/multiarch-support_2.27-3ubuntu1.4_amd64.deb \
http://ftp.se.debian.org/debian/pool/main/o/openssl/libssl1.0.0_1.0.1t-1+deb8u8_amd64.deb \
https://files.renci.org/pub/irods/releases/4.1.10/ubuntu14/irods-icommands-4.1.10-ubuntu14-x86_64.deb
sudo dpkg --install \
multiarch-support_2.27-3ubuntu1.4_amd64.deb \
libssl1.0.0_1.0.1t-1+deb8u8_amd64.deb \
irods-icommands-4.1.10-ubuntu14-x86_64.deb
Arm64/Aarch64:
A CyVerse community user compiled iCommands for Raspberry Pi (also tested on NVIDIA Jetson devices): https://github.com/jmscslgroup/libpanda/blob/master/scripts/irods-icommands-debs.tgz
wget https://github.com/jmscslgroup/libpanda/raw/master/scripts/irods-icommands-debs.tgz
tar zxvf irods-icommands-debs.tgz
cd irods-icommands-debs/
./install.sh
iCommands Installation for Mac OS X¶
iRODS doesn’t currently support Mac OS X, but CyVerse has built an installer for it.
- Download the CyVerse-specific Mac OS iCommands Download.
- Open the file by locating it in your Finder; right-click to run the installer. You may get a security warning noting the file is from an “unidentified developer.” You may bypass this warning by going to ‘System Preferences’, selecting the ‘Security & Privacy’ menu, and clicking the ‘Open Anyway’ button to proceed.
- Follow the prompts to begin the installation. You will need to know your administrator password to install new software.
Note
Newer versions of Mac OS X ship with zsh as the default shell rather than bash. The installer will attempt to write some environment variables to the .bashrc file, which for zsh is called .zshrc.
By default, this installation will place iCommands in your system PATH so you should be ready to run iCommands immediately at the terminal. If this does not happen (i.e. you get an error when trying to run iinit), you can add the iCommands path by editing your .zshrc file:
# add iCommands Path
export PATH="/Applications/icommands/:$PATH"
export IRODS_PLUGINS_HOME=/Applications/icommands/plugins/
and then, in the terminal, reload the file with source ~/.zshrc.
iCommands First-time Configuration¶
Note
If using iCommands in an HPC environment that already has iCommands installed, run the module load irods command to get access to iRODS iCommands.
Once iCommands is installed and in the system PATH, these instructions apply at a terminal on Mac OS X and Linux systems.
Open terminal
Type the iinit command to start the configuration process. When prompted, enter the values shown in the table below.
CyVerse Data Store configuration:
host name | port # | username | zone | password |
---|---|---|---|---|
data.cyverse.org | 1247 | CyVerse UserID | iplant | CyVerse Password |
Note
You can reconfigure iCommands for other iRODS data stores by changing your environment file
Verify that your iCommands installation works and is properly configured using the ils command to list the contents of your Data Store home directory.
$ ils
/iplant/home/your_home_directory:
  file1
  file2
  file3
  C- /iplant/home/your_home_directory/analyses
  C- /iplant/home/your_home_directory/another_folder
Anonymous access to the CyVerse Data Store¶
You can access public data in the CyVerse Data Store with iCommands using:
- Username: anonymous
- Password: <leave blank>
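The connection values above can also be kept in the iCommands environment file. The following is a minimal sketch, assuming an iRODS 4.x client that reads `~/.irods/irods_environment.json` with the key names shown; `write_irods_env` is a hypothetical helper, not part of iCommands, and it overwrites any existing file in the target directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of iCommands).
# write_irods_env DIR [USER]: write DIR/irods_environment.json for the
# CyVerse Data Store. USER defaults to "anonymous" (public data only).
# Assumption: an iRODS 4.x client, which uses these JSON key names.
write_irods_env() {
    local env_dir=$1
    local user=${2:-anonymous}
    mkdir -p "$env_dir" || return 1
    cat > "$env_dir/irods_environment.json" <<EOF
{
    "irods_host": "data.cyverse.org",
    "irods_port": 1247,
    "irods_user_name": "$user",
    "irods_zone_name": "iplant"
}
EOF
}
```

For example, `write_irods_env "$HOME/.irods"` followed by `iinit` should prompt only for a password, which is left blank for the anonymous user. Back up any existing environment file first.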
Upload Files/folders from local Computer to Data Store¶
Warning
When uploading your data to the Data Store you should not upload files/folders with names containing spaces (e.g. experiment one.fastq) or names that contain special characters (e.g. ~ ` ! @ # $ % ^ & * ( ) + = { } [ ] | : ; " ' < > , ? /). The Apps on the Discovery Environment and most command line applications will typically not tolerate these characters. For long file/folder names the use of underscores (e.g. experiment_one.fastq) is the recommended practice.
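As a rough illustration of the naming advice in this warning, the sketch below checks a single file or folder name (not a full path) before upload. Both functions are hypothetical helpers, and the allowed character set (letters, digits, underscore, hyphen, period) is a conservative assumption rather than an official CyVerse rule.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of iCommands or Cyberduck).

# is_safe_name NAME: succeed only if NAME is non-empty and contains
# nothing but letters, digits, underscores, hyphens, and periods.
is_safe_name() {
    case "$1" in
        "" | *[!A-Za-z0-9._-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# suggest_safe_name NAME: print NAME with every other character
# replaced by an underscore.
suggest_safe_name() {
    printf '%s\n' "$1" | sed 's/[^A-Za-z0-9._-]/_/g'
}
```

For instance, `is_safe_name "experiment one.fastq" || suggest_safe_name "experiment one.fastq"` prints a suggested replacement name.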
See the full iCommands iput documentation for more information.
- Upload a directory using the iput command. Remember, the -r flag is to recursively upload a directory, so if you are uploading a single file, omit the -r flag.
$ iput -rPT /local_directory /iplant/home/cyverse_username/destination_folder
# This command will output the progress as it uploads your local directory
There are several optional arguments that the upload iCommand iput can take:
$ iput -r # For recursive transfer of directories and their contents
$ iput -P # display the progress of the upload
$ iput -f # force the upload and overwrite
$ iput -T # Renew socket connection after 10 min (May help connections that are failing due to some connection/firewall settings)
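The rule about when to include -r can be captured in a small wrapper. This is only a sketch: `iput_flags` is a hypothetical helper that picks the flag set from the options listed above, depending on whether the local path is a directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of iCommands).
# iput_flags LOCAL_PATH: print "-rPT" when LOCAL_PATH is a directory
# (recursive upload) and "-PT" otherwise (single file, no -r).
iput_flags() {
    if [ -d "$1" ]; then
        printf '%s\n' "-rPT"
    else
        printf '%s\n' "-PT"
    fi
}
```

A possible use is `iput "$(iput_flags /local_directory)" /local_directory /iplant/home/cyverse_username/destination_folder`.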
Download Files/folders from Data Store to local Computer¶
See the full iCommands iget documentation for more information.
- Download a file using the iget command. Remember, the -r flag is to recursively download a directory, so if you are downloading a single file, omit the -r flag.
$ iget -PT /iplant/home/cyverse_username/target_file /local_destination
# This command will output the progress as it downloads to your local machine
Tip
There are several optional arguments that the download iCommand iget can take:
$ iget -r # For recursive transfer of directories and their contents
$ iget -P # display the progress of the download
$ iget -f # force the download and overwrite
$ iget -T # Renew socket connection after 10 min (May help connections that are failing due to some connection/firewall settings)
NetCDF iCommands¶
For the Linux distributions there are three extra iCommands that support common NetCDF operations:
- inc performs data operations on a list of NetCDF files,
- incarch archives open-ended time series data,
- incattr performs operations on attributes of NetCDF files.
Each of these commands accepts the -h command line option. When a command is called with this option, it displays the command’s help documentation. Please see this help documentation for more information.
Installation
Install iRODS Runtime. Before the NetCDF iCommands can be installed, the current version of the iRODS run-time library needs to be installed. Please install the appropriate version (e.g. irods-runtime-X-X-XX). The distribution-specific packages can be found on RENCI's iRODS website.
Install NetCDF API. Once the run-time library is installed, the iRODS NetCDF API library needs to be installed. Please use the appropriate link to download the installation package and install it. The package installer will likely warn that the irods user and/or group don’t exist, and that it will be using root instead. These warnings are harmless, since the package contents should be installed with root ownership.
Additional Frequently Used iCommands¶
In addition to the commands above, there are several frequently used iCommands - most of which you would expect following the Linux paradigm:
- ipwd: Print current directory
- imkdir: Create a directory
- icd: Change directory
- irsync: Sync local directory with iRODS directory
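As a sketch of how these commands combine, the hypothetical helper below prints (rather than runs) the two commands that would create a Data Store folder and sync a local directory into it. It assumes the standard iRODS conventions that `imkdir -p` creates missing parent collections and that irsync marks Data Store paths with an `i:` prefix; the paths in the usage note are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of iCommands).
# sync_plan LOCAL_DIR REMOTE_COLLECTION: print the iCommands that would
# create REMOTE_COLLECTION and sync LOCAL_DIR into it, for review.
# Assumes neither path contains spaces or special characters.
sync_plan() {
    printf 'imkdir -p %s\n' "$2"
    printf 'irsync -r %s i:%s\n' "$1" "$2"
}
```

Running `sync_plan ./results /iplant/home/cyverse_username/results` shows the commands; after checking them, they can be typed at the terminal or piped to `sh`.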
Associate Data with Metadata¶
The Data Store supports a variety of solutions that allow you to associate your raw data with metadata. Metadata is critically important to quality research (see this article on FAIR Principles), yet it is often given little consideration until the time of publication and sharing. Here are a few metadata features that you can adopt and be aware of at the outset; see more of CyVerse’s capabilities on the Data Commons Wiki page.
Some things to remember about the platform
- You can add metadata to a single file/folder, or in bulk to large collections of data. You can use your own metadata schema, or apply one of our metadata templates.
- The Discovery Environment supports several metadata templates that can be used for submission of metadata. Additional templates you may wish to use can be found at resources like FairSharing.org.
- Metadata can be managed through a graphical interface in the Discovery Environment or using iCommands at the command line. We will only cover the Discovery Environment in this guide. See instructions for iCommands metadata on the CyVerse wiki.
Viewing and Editing Metadata for a Single File/folder in the Discovery Environment¶
Note
You must have write or own permission to edit an object’s metadata.
Log into the Discovery Environment.
Click on the Data icon to browse for data. Select (checkbox) a single file/folder to add metadata to.
Under the More Actions menu, click Metadata. You will see existing metadata for the file/folder in the Attribute, Value, Unit (AVU) format.
Tip
A single piece of metadata, or an AVU, is made up of an attribute, a value, and a unit. An attribute is a changeable property or characteristic of the file or folder you have selected that can be set to a value. For example, ‘time point’ might be an attribute of a file, while ‘7’ could be its value and ‘hour’ its unit.
Adding metadata
- Click the “+ Add Metadata” button to add a new entry. Then follow the directions for editing metadata below.
Editing or deleting metadata
- You may use the “pencil” icon to edit an existing entry or the “trash can” icon to delete an entry.
- After you have made any edits or deletions, click ‘Save’ (on the top right of the screen) to save all entries and apply the metadata.
Adding Metadata to Multiple Files/folder in the Discovery Environment¶
Adding Metadata using a CyVerse Template
Log into the Discovery Environment.
Click on the Data icon to open a Data window. Select (checkbox) a single file/folder to add metadata to.
Under the More Actions menu, click Metadata. Then click the subsequent More Actions menu and select View in Template. You have two choices in using the template: edit it in the Discovery Environment, or download it and edit it on your computer. To download the template, click OK. (In this example, we will use the DOI Request - DataCite Metadata template.)
Editing metadata template in DE
- Follow the steps under “Editing or deleting metadata” in the previous section.
Editing a downloaded metadata template
Unzip the downloaded template; it will contain two files blank.csv and guide.csv. Open these files using the spreadsheet editor of your choice.
Tip
- blank.csv is the metadata template that you will complete for your data.
- guide.csv contains instructions for your template, and will usually include controlled vocabulary terms for metadata descriptors.
Edit the template in one of two ways:
If all data will be in a single folder
In the blank.csv spreadsheet, in the ‘file name or path’ column, enter the file names of all the files/folders in that folder you wish to annotate with metadata.
In the remaining columns of the template, enter the values for each file/attribute combination that apply.
If desired, add additional columns to the end of the template. The metadata in the additional columns will be saved in the Data Store but will not be stored as part of the template.
Save the file in CSV format. Make sure none of the names of the files or the parent folder includes spaces or special characters. You may name this metadata file anything you wish, but keep it in CSV format.
If data will be in multiple folders
- In the blank.csv spreadsheet, in the ‘file name or path’ column, enter the full path of the top-level folder (e.g. /iplant/home/YOURUSERNAME/FOLDERNAME)
- In the remaining columns in the first row, enter the values for each file/attribute combination.
- Repeat for each file, but make sure to add the full file path (e.g. /iplant/home/YOURUSERNAME/FOLDERNAME) for each file.
- If desired, add additional columns to the end of the template. The metadata in the columns will be saved in the Data Store but will not be stored as part of the template.
- Save the file in CSV format. Make sure none of the names of the files or the parent folder includes spaces or special characters. You may name this metadata file anything you wish, but keep it in CSV format.
In an open ‘Data’ window in the Discovery Environment, navigate to the appropriate location for uploading the template:
If the first column of your metadata file contains only file names (that is, all data files are in the same folder), navigate to the folder and use the Upload button (Browse local) or your choice of upload tool to upload the metadata (csv file) to that folder.
If the first column of your metadata file contains the full path to each file (that is, the data files are in different folders), it does not matter where the metadata file is located on the Data Store. Use the Upload button (Browse local) or your choice of upload tool to upload the metadata (csv file) to an appropriate location on the Data Store.
Tip
If you commit to using absolute file paths (e.g. /iplant/home/your_file_location) you can keep all of your metadata spreadsheets in one location on the Data Store for convenient management and editing.
- To apply the metadata, in the Data window, select (checkbox) the name of the folder containing the data files to which you want to apply the metadata in bulk.
- Click the More Actions menu and select ‘Apply Bulk Metadata’; browse to the uploaded metadata spreadsheet and select it.
Your metadata should now be applied to your files. You should receive a notification in the Discovery Environment and you can confirm the metadata has been correctly applied by following the steps in the preceding section to view metadata.
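When the data are spread across multiple folders, the ‘file name or path’ column needs one full Data Store path per file. The sketch below lists those paths from a local copy of the data so they can be pasted into the first column of blank.csv. It assumes the local folder has the same layout as the uploaded Data Store folder; `list_datastore_paths` is a hypothetical helper and the paths in the usage note are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of iCommands or the Discovery Environment).
# list_datastore_paths LOCAL_DIR REMOTE_COLLECTION: print one full
# Data Store path for every file under LOCAL_DIR, assuming LOCAL_DIR
# was uploaded as REMOTE_COLLECTION with the same folder layout and
# that names follow the no-spaces, no-special-characters advice.
list_datastore_paths() {
    local local_dir=${1%/}
    local remote=${2%/}
    (
        cd "$local_dir" || exit 1
        # Swap the leading "./" printed by find for the remote prefix.
        find . -type f | sed "s|^\./|$remote/|" | LC_ALL=C sort
    )
}
```

For example, `list_datastore_paths ./FOLDERNAME /iplant/home/YOURUSERNAME/FOLDERNAME` prints the paths in sorted order.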
Data Sharing and Other Features¶
One of the most powerful features of the Data Store is the ability to share your data instantly, with fine-grained permission control. You can share your data with other CyVerse users, and you can also make data available to anonymous users and with identifiers (e.g. DOIs, ARKs) through the CyVerse Data Commons. We will cover the most basic, commonly used sharing features in this guide.
WebDAV¶
WebDAV is an extension to the HTTP protocol that allows users to remotely edit and manage files. CyVerse has added support for WebDAV to the Data Store. This means users can access their home and public folders in the CyVerse Data Store from their local computers using web browsers and other WebDAV enabled applications such as common operating system file managers. With WebDAV, users can copy files between local computer and the Data Store as easily as if they were copying them between two folders on their computer.
Limitations¶
WebDAV works best for small files or small collections of files. There is no hard size limit for files, but it is recommended not to work with files over 1 GiB in size. However, 10 GiB files have been downloaded from the CyVerse WebDAV service using a web browser with decent performance. iCommands still outperforms WebDAV. For better ways to download large files or large sets of files, please see the pages for iCommands or Cyberduck.
Accessing CyVerse data via WebDAV Services¶
There are two access points to CyVerse WebDAV services: one for anonymous read-only access and one for authenticated access. These services can be accessed directly in a web browser, or with command line tools.
The simplest way to access WebDAV in a browser is to go to https://data.cyverse.org/dav. This will bring up a menu for the options described below.
WebDAV provides anonymous read-only access through URLs rooted at https://data.cyverse.org/dav-anon/. All data that can be seen by the anonymous user can be accessed anonymously through this service, excluding the immediate contents of /iplant/home and the immediate contents of /iplant/home/<username>, where <username> is any CyVerse login name.
The service also provides authenticated access through URLs rooted at https://data.cyverse.org/dav/. Once users have authenticated with their CyVerse credentials, they can access any file or folder with the permission level they have on that file or folder.
User Data¶
A user with a CyVerse login of <username> would use the WebDAV link https://data.cyverse.org/dav/iplant/home/<username>/ to access their data.
Community Released Data/Project Data¶
To access data from specific projects stored in iRODS at /iplant/home/shared/<project>/, use the link https://data.cyverse.org/dav/iplant/projects/<project>/.
CyVerse Curated Data (Data with a DOI)¶
To access the data curated by CyVerse in the Data Commons (that is, datasets with DOIs), use the following link: https://data.cyverse.org/dav-anon/iplant/commons/cyverse_curated/ .
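The URL patterns above can be sketched as a small helper that prefixes a Data Store path with the authenticated or anonymous WebDAV root. `dav_url` is a hypothetical function, and the username `alice` in the test values is a placeholder. It does not apply the /iplant/projects/ mapping described for project data, so pass that form of the path directly.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any CyVerse tool).
DAV_BASE="https://data.cyverse.org"

# dav_url DATA_STORE_PATH [anon]: print the WebDAV URL for a path.
# With "anon" as the second argument, use the read-only dav-anon root;
# otherwise use the authenticated dav root.
dav_url() {
    local path=$1
    local service="dav"
    if [ "${2:-}" = "anon" ]; then
        service="dav-anon"
    fi
    printf '%s/%s%s\n' "$DAV_BASE" "$service" "$path"
}
```

A file could then be fetched with a standard tool such as curl, for example `curl -u your_username -O "$(dav_url /iplant/home/your_username/your_file)"`, which prompts for the password and saves the file under its remote name.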
Common Ways to Access the WebDAV Service¶
Web Browser
Since WebDAV is an extension of HTTP, any web browser can be used to browse and download data through the service, using the links provided above.
File Manager
Most common operating systems come with a file manager application that can interface with a WebDAV service and can mount a WebDAV folder into the file system being managed. This allows other applications running on the same computer as the file manager to access data hosted by a WebDAV service as if it were local.
Accessing through OS X Finder
Use these instructions to connect to the WebDAV service with OS X Finder.
- Open Finder.
- From the menu bar, select Go, then Connect to Server (or press Command-K).
- Enter the URL for the folder to access.
- Provide your CyVerse username and password if prompted.
Accessing through Windows File Explorer
Use these instructions to connect to the WebDAV service with Windows File Explorer.
- Open the File Explorer.
- Right-click on This PC.
- Select Map Network Drive.
- Select Choose a custom network location and click next.
- Enter the URL for the folder to access.
- Provide your CyVerse username and password if prompted.
Accessing through Gnome Files
Use these instructions to open the WebDAV service from the Gnome desktop using Files.
- Open Files.
- Select Other Locations in the Places sidebar.
- In the Connect to Server footer, enter the URL for the desired folder to access. Note: Files identifies TLS encrypted WebDAV URLs with the scheme davs. This means the base for the CyVerse URLs is davs://data.cyverse.org/ instead of https://data.cyverse.org/ (e.g., davs://data.cyverse.org/dav/iplant/home/<username>/).
- Click the neighboring Connect button.
- Provide your CyVerse username and password if prompted.
Accessing through Linux Terminal
This requires root or at least sudo access. Use these instructions to mount a WebDAV folder into the file system from a Linux terminal.
- Ensure that davfs2 is installed, e.g., for Ubuntu, sudo apt install davfs2.
- Create a directory where you want to mount the data, e.g., mkdir /tmp/data.
- Mount the data as root, i.e., sudo mount -o gid=<you>,uid=<you> -t davfs <link> /tmp/data, where <you> is your username on the Linux machine and <link> is the URL to the WebDAV folder you want to mount.
- Provide your CyVerse username and password if prompted.
Summary¶
This guide has introduced the basic tools you need to manage the life cycle of your data in CyVerse. There are many more features to explore, and these are detailed in the full Data Store Manual.
Prerequisites¶
Downloads, access, and services¶
In order to complete this tutorial you will need access to the following services/software
Prerequisite | Preparation/Notes | Link/Download |
---|---|---|
CyVerse account (optional) | CyVerse supports anonymous data access to public data sets in the CyVerse Data Commons. This guide is written with the assumption you are a CyVerse account holder. See the Data Store Manual for more info on anonymous access. | |
Cyberduck (optional) | Cyberduck is a 3rd party application with a graphical user interface that allows you to easily upload and download data (available for Mac/PC). You will also need to download our connection profile (bookmark). | |
iCommands (optional) | iCommands are a set of command line binaries that can be used to interact with the Data Store. Download iCommands (available for Mac/Linux) if you want to use these functionalities. | |
Spreadsheet editor (optional) | To edit a metadata template in .csv format, we recommend using a spreadsheet editor such as Microsoft Excel or LibreOffice Calc. | |
Warning
When uploading your data to the Data Store you should not upload files/folders with names containing spaces (e.g. experiment one.fastq) or names that contain special characters (e.g. ~ ` ! @ # $ % ^ & * ( ) + = { } [ ] | : ; " ' < > , ? /). The Apps on the Discovery Environment and most command line apps will typically not tolerate these characters. For long file/folder names the use of underscores (e.g. experiment_one.fastq) is the recommended practice.