Getting started

Once you have completed the Installation procedure, you are ready to set up your file system. This usually consists of two steps. The first defines the location of the files on the different hosts, while the second covers accessing the files and updating them through Python.

Setting-up the system

The first thing needed to benefit from the tools offered in this package is to define the set of files to work with, that is, a hep_rfm.Table object. A hep_rfm.Table is simply a dictionary whose keys are strings, corresponding to the names of the files, and whose values are hep_rfm.FileInfo objects. The latter store the information that this package needs to work:

  • Name: this is the user identifier of the file. A table cannot contain two files with the same name.
  • Path: the path to the file. This can be either a local or remote path.
  • Protocol ID (PID): name of the protocol to be used to access the file.
  • Time-stamp: the system time-stamp of the file, stored as a float value. This will be needed to determine which files are newer.
  • File ID (FID): the result of hashing the file with the function hep_rfm.rfm_hash(). It allows one to check the integrity of the file and to determine whether two files that are meant to be the same actually differ. If they do, the time-stamp is used to determine which one is newer.
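
The fields above can be pictured with a small stand-alone sketch. It does not use hep_rfm: the Entry class and the helpers hash_file and make_entry are hypothetical names, and SHA-512 is only an assumption about the digest (hep_rfm.rfm_hash() may work differently).

```python
import hashlib
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for the information held by a table entry."""
    name: str     # user identifier, unique within a table
    path: str     # local or remote path
    pid: str      # protocol ID, e.g. 'local' or 'ssh'
    tmstp: float  # system time-stamp of the file
    fid: str      # hexadecimal digest of the file contents


def hash_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks (SHA-512 is an assumption)."""
    digest = hashlib.sha512()
    with open(path, 'rb') as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def make_entry(name, path):
    """Build an entry for a local file from its path."""
    full = os.path.abspath(path)
    return Entry(name, full, 'local', os.path.getmtime(full), hash_file(full))


# A table maps names to entries, so each name can appear only once
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'file1.txt')
    with open(path, 'w') as stream:
        stream.write('My first table file\n')
    table = {'file1': make_entry('file1', path)}
    entry = table['file1']
    print(entry.name, entry.pid, entry.fid[:20] + '...')
```

Reading the file in chunks keeps the memory use bounded, whatever the size of the file.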

Building a table by hand quickly becomes tedious for the user, even with the functions provided here. A much easier method uses the script hep-rfm-table, supplied together with this package. This script allows one to create, modify and update tables directly from the terminal, without writing custom Python scripts for that purpose. To start, create an empty directory and then type:

$ hep-rfm-table -h

You can now see the different options we have to work with tables. We will start by creating a new table.

$ hep-rfm-table create table.db
$ ls
table.db

We can see the contents of the table at any point by typing

$ hep-rfm-table display table.db
Running "hep-rfm-table" from hep_rfm 0.0.0.dev4
- Table created with hep_rfm 0.0.0.dev4
- Last update: 2019-04-12 12:54:41.593218
- Contents:
No entries found

which tells us that there are no entries in the table. Let’s add some. We will start by creating a new file with some content, and then add it to the table.

$ echo "My first table file" >> file1.txt
$ hep-rfm-table add table.db file1 file1.txt
$ hep-rfm-table display table.db
Running "hep-rfm-table" from hep_rfm 0.0.0.dev4
- Table created with hep_rfm 0.0.0.dev4
- Last update: 2019-04-12 12:54:41.904134
- Contents:
name         path                                                    pid     tmstp                           fid
file1        /home/user/rfm/file1.txt        local   2019-04-12 12:54:41.755092      202444300d48c9387406...

So we have added our first file. To do this we had to give the name of the file and the path to it. You can see that we have added a file named file1, that the path to the file has been automatically expanded to an absolute path, and that the time-stamp and file ID have been extracted. Now let’s create two more files and add them to the table as well.

$ echo "This is the second file" >> file2.txt
$ echo "This is the third file" >> file3.txt
$ hep-rfm-table add-massive table.db file2.txt file3.txt
$ hep-rfm-table display table.db
Running "hep-rfm-table" from hep_rfm 0.0.0.dev4
- Table created with hep_rfm 0.0.0.dev4
- Last update: 2019-04-12 12:54:42.226468
- Contents:
name         path                                                    pid     tmstp                           fid
file1        /home/user/rfm/file1.txt        local   2019-04-12 12:54:41.755092      202444300d48c9387406...
file2        /home/user/rfm/file2.txt        local   2019-04-12 12:54:42.067091      cad9a71a7902d13f44db...
file3        /home/user/rfm/file3.txt        local   2019-04-12 12:54:42.067091      ce53225da9b363e3db71...

With this command one can easily add new files given only their paths; the names are extracted from the file names themselves, without extensions. Frequently one will have a whole system of files stored in different directories and subdirectories. This is easily handled by another mode, add-from-dir, which takes all the files within a directory and adds them to the table. We can also specify a regular expression, so that only the files whose names (including the extension) match it are added.
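
The selection just described can be sketched without hep_rfm. The helper collect_files is hypothetical, and we assume that the regular expression is matched from the start of the file name; the script may apply different matching rules.

```python
import os
import re
import tempfile


def collect_files(directory, regex=None):
    """Map names (file name without extension) to absolute paths.

    Hypothetical helper: walks "directory" recursively and keeps the files
    whose base name (including the extension) matches "regex", if given.
    """
    pattern = re.compile(regex) if regex is not None else None
    found = {}
    for root, _, files in os.walk(directory):
        for fname in sorted(files):
            if pattern is not None and not pattern.match(fname):
                continue
            name, _ext = os.path.splitext(fname)
            found[name] = os.path.abspath(os.path.join(root, fname))
    return found


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'subdir', 'subsubdir'))
    for rel in ('file1.txt', 'subdir/file4.txt', 'subdir/subsubdir/file6.dt'):
        with open(os.path.join(tmp, *rel.split('/')), 'w') as stream:
            stream.write(rel)
    only_txt = collect_files(tmp, r'.*\.txt$')
    everything = collect_files(tmp)
    print(sorted(only_txt))    # ['file1', 'file4']
    print(sorted(everything))  # ['file1', 'file4', 'file6']
```

In this sketch two files with the same name in different directories would overwrite each other, which is one reason why names must be unique within a table.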

$ mkdir -p subdir/subsubdir
$ echo "This is the fourth file" >> subdir/file4.txt
$ echo "This is the fifth file" >> subdir/subsubdir/file5.txt
$ echo "This is the sixth file" >> subdir/subsubdir/file6.dt
$ hep-rfm-table add-from-dir table.db . --regex '.*\.txt'
$ hep-rfm-table display table.db
Running "hep-rfm-table" from hep_rfm 0.0.0.dev4
- Table created with hep_rfm 0.0.0.dev4
- Last update: 2019-04-12 12:54:42.643200
- Contents:
name         path                                                                    pid     tmstp                           fid
file1        /home/user/rfm/file1.txt                        local   2019-04-12 12:54:41.755092      202444300d48c9387406...
file2        /home/user/rfm/file2.txt                        local   2019-04-12 12:54:42.067091      cad9a71a7902d13f44db...
file3        /home/user/rfm/file3.txt                        local   2019-04-12 12:54:42.067091      ce53225da9b363e3db71...
file4        /home/user/rfm/subdir/file4.txt                 local   2019-04-12 12:54:42.487090      8a9dcf90d2a0bcb95d8f...
file5        /home/user/rfm/subdir/subsubdir/file5.txt       local   2019-04-12 12:54:42.487090      86f2f9bb445df1e62818...

You can see that we have included all the files in the current directory except file6.dt, which did not match the given regular expression. If we remove the regular expression requirement, then it is included:

$ hep-rfm-table add-from-dir table.db .
$ hep-rfm-table display table.db
Running "hep-rfm-table" from hep_rfm 0.0.0.dev4
- Table created with hep_rfm 0.0.0.dev4
- Last update: 2019-04-12 12:54:42.970372
- Contents:
name         path                                                                    pid     tmstp                           fid
file1        /home/user/rfm/file1.txt                        local   2019-04-12 12:54:41.755092      202444300d48c9387406...
file2        /home/user/rfm/file2.txt                        local   2019-04-12 12:54:42.067091      cad9a71a7902d13f44db...
file3        /home/user/rfm/file3.txt                        local   2019-04-12 12:54:42.067091      ce53225da9b363e3db71...
file4        /home/user/rfm/subdir/file4.txt                 local   2019-04-12 12:54:42.487090      8a9dcf90d2a0bcb95d8f...
file5        /home/user/rfm/subdir/subsubdir/file5.txt       local   2019-04-12 12:54:42.487090      86f2f9bb445df1e62818...
file6        /home/user/rfm/subdir/subsubdir/file6.dt        local   2019-04-12 12:54:42.487090      5bd329091112ab1c7125...
table        /home/user/rfm/table.db                         local   2019-04-12 12:54:42.639090      e00f13b293897a73ea58...

You can see that we have also included the table itself. This is very dangerous and must be avoided, since it will lead to the table files being replaced when working with hep_rfm.Manager. It is usually preferable to keep the files in a sub-directory and the table file in the parent directory, so that there are no conflicts. To return to a safe state, let’s remove the table entry, move everything into a new directory and add the files again.

$ hep-rfm-table remove table.db --files table
$ hep-rfm-table display table.db
Running "hep-rfm-table" from hep_rfm 0.0.0.dev4
- Table created with hep_rfm 0.0.0.dev4
- Last update: 2019-04-12 12:54:43.297370
- Contents:
name         path                                                                    pid     tmstp                           fid
file1        /home/user/rfm/file1.txt                        local   2019-04-12 12:54:41.755092      202444300d48c9387406...
file2        /home/user/rfm/file2.txt                        local   2019-04-12 12:54:42.067091      cad9a71a7902d13f44db...
file3        /home/user/rfm/file3.txt                        local   2019-04-12 12:54:42.067091      ce53225da9b363e3db71...
file4        /home/user/rfm/subdir/file4.txt                 local   2019-04-12 12:54:42.487090      8a9dcf90d2a0bcb95d8f...
file5        /home/user/rfm/subdir/subsubdir/file5.txt       local   2019-04-12 12:54:42.487090      86f2f9bb445df1e62818...
file6        /home/user/rfm/subdir/subsubdir/file6.dt        local   2019-04-12 12:54:42.487090      5bd329091112ab1c7125...
$ mkdir files
$ mv subdir *.txt files/.
$ hep-rfm-table add-from-dir table.db files
$ hep-rfm-table display table.db
Running "hep-rfm-table" from hep_rfm 0.0.0.dev4
- Table created with hep_rfm 0.0.0.dev4
- Last update: 2019-04-12 12:54:43.628105
- Contents:
name         path                                                                            pid     tmstp                           fid
file1        /home/user/rfm/files/file1.txt                  local   2019-04-12 12:54:41.755092      202444300d48c9387406...
file2        /home/user/rfm/files/file2.txt                  local   2019-04-12 12:54:42.067091      cad9a71a7902d13f44db...
file3        /home/user/rfm/files/file3.txt                  local   2019-04-12 12:54:42.067091      ce53225da9b363e3db71...
file4        /home/user/rfm/files/subdir/file4.txt           local   2019-04-12 12:54:42.487090      8a9dcf90d2a0bcb95d8f...
file5        /home/user/rfm/files/subdir/subsubdir/file5.txt local   2019-04-12 12:54:42.487090      86f2f9bb445df1e62818...
file6        /home/user/rfm/files/subdir/subsubdir/file6.dt  local   2019-04-12 12:54:42.487090      5bd329091112ab1c7125...

So now the table entry has been removed. The idea behind this package is not only to keep our data files organized, but also to keep files synchronized across different hosts. This means that we will have a main location, where we would preferably place the new versions of the files, and from which we would update the other locations. Any modification on a remote host requires authentication, which would otherwise be needed for every single file we want to update. The preferred way to avoid this inconvenience is to use ssh keys, or to make sure that the target host is directly accessible from the current one.

On the remote host we need another table with the paths in it. However, we should specify only the path, not the time-stamp or file ID, since there is no file there yet. This is done by adding the --bare option to add and add-massive. The mode add-from-dir makes no sense here, since there are no files to scan. This means that we could type

$ ssh username@host.com
$ mkdir files
$ hep-rfm-table create table.db
$ hep-rfm-table add table.db file1 files/file1.txt --bare --remote ssh @host.com
$ hep-rfm-table display table.db
Running "hep-rfm-table" from hep_rfm 0.0.0.dev4
- Table created with hep_rfm 0.0.0.dev4
- Last update: 2019-04-12 12:54:43.928512
- Contents:
name         path                                    pid     tmstp   fid
file1        @host.com:/home/user/rfm/files/file1.txt        local   0.0     none

So you can see that the time-stamp and file ID are filled with default values, chosen so that they do not cause conflicts. Note that we have specified the remote with the protocol ID ssh and the prefix string @host.com. This will be needed later to update the files correctly. The other files can be added in one go by typing

$ hep-rfm-table add-massive table.db `for i in {2..6};do echo files/file$i.txt; done` --bare --remote ssh @host.com
$ hep-rfm-table display table.db
Running "hep-rfm-table" from hep_rfm 0.0.0.dev4
- Table created with hep_rfm 0.0.0.dev4
- Last update: 2019-04-12 12:54:44.011354
- Contents:
name         path                                    pid     tmstp   fid
file1        @host.com:/home/user/rfm/files/file1.txt        local   0.0     none
file2        @host.com:/home/user/rfm/files/file2.txt        local   0.0     none
file3        @host.com:/home/user/rfm/files/file3.txt        local   0.0     none
file4        @host.com:/home/user/rfm/files/file4.txt        local   0.0     none
file5        @host.com:/home/user/rfm/files/file5.txt        local   0.0     none
file6        @host.com:/home/user/rfm/files/file6.txt        local   0.0     none

Note that here we have included all the files in the same directory. Files do not need to have similar paths in different hosts.
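
To see why the default values of a bare entry do not cause conflicts, consider the following sketch. The Marks record and the function needs_copy are hypothetical, and the marks of the real file are illustrative values; the actual comparison performed by the package may differ in its details.

```python
from collections import namedtuple

# Hypothetical record holding only the two marks discussed above
Marks = namedtuple('Marks', ['tmstp', 'fid'])

# Default values shown for the bare entries
BARE = Marks(tmstp=0.0, fid='none')


def needs_copy(source, target):
    """Return True if "target" must be refreshed from "source"."""
    return source.fid != target.fid and source.tmstp > target.tmstp


# Illustrative marks for a file that really exists
real = Marks(tmstp=1555066481.755092, fid='202444300d48c9387406')

print(needs_copy(real, BARE))  # True
print(needs_copy(BARE, real))  # False
```

A time-stamp of zero is older than that of any real file, so under these assumptions a bare entry is always the one to be replaced, never the source of a copy.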

Sometimes we might need to replicate a table, that is, to copy the entire structure of one table (the system of directories and file names) into another. This is done through the replicate mode:

$ hep-rfm-table create rtable.db
$ mkdir rfiles
$ hep-rfm-table replicate rtable.db table.db files rfiles
$ hep-rfm-table display rtable.db
Running "hep-rfm-table" from hep_rfm 0.0.0.dev4
- Table created with hep_rfm 0.0.0.dev4
- Last update: 2019-04-12 12:54:44.123548
- Contents:
name         path                                                                            pid     tmstp                   fid
file1        /home/user/rfm/rfiles/file1.txt                         local   1970-01-01 01:00:00     none
file2        /home/user/rfm/rfiles/file2.txt                         local   1970-01-01 01:00:00     none
file3        /home/user/rfm/rfiles/file3.txt                         local   1970-01-01 01:00:00     none
file4        /home/user/rfm/rfiles/subdir/file4.txt                  local   1970-01-01 01:00:00     none
file5        /home/user/rfm/rfiles/subdir/subsubdir/file5.txt        local   1970-01-01 01:00:00     none
file6        /home/user/rfm/rfiles/subdir/subsubdir/file6.dt         local   1970-01-01 01:00:00     none

You can see that we have created a new table rtable.db with the same structure as table.db, but this time pointing to bare files in the directory rfiles. When a file added through the replicate mode already exists in the table, a policy decides what happens: there are options to raise an error, replace the contents or omit the changes.
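
The path mapping behind this mode can be sketched as follows. The function replicate_paths and the policy names 'error', 'replace' and 'omit' are hypothetical; run the replicate mode with -h to see the options that the script really offers.

```python
import posixpath


def replicate_paths(entries, old_root, new_root, existing=(), policy='error'):
    """Map entry paths from "old_root" to "new_root", keeping the layout.

    Hypothetical helper. "entries" maps names to paths under "old_root",
    "existing" holds the names already present in the target table, and
    "policy" is one of 'error', 'replace' or 'omit'.
    """
    if policy not in ('error', 'replace', 'omit'):
        raise ValueError('Unknown policy "{}"'.format(policy))
    result = {}
    for name, path in entries.items():
        if name in existing:
            if policy == 'error':
                raise KeyError(name)
            if policy == 'omit':
                continue
        # Keep the position of the file relative to the old root
        relative = posixpath.relpath(path, old_root)
        result[name] = posixpath.join(new_root, relative)
    return result


entries = {
    'file1': '/home/user/rfm/files/file1.txt',
    'file5': '/home/user/rfm/files/subdir/subsubdir/file5.txt',
}
new = replicate_paths(entries, '/home/user/rfm/files', '/home/user/rfm/rfiles')
print(new['file5'])  # /home/user/rfm/rfiles/subdir/subsubdir/file5.txt
```

Only the root directory changes, while the sub-directories and file names are preserved.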

For more information, you can always run any mode with -h. Once we have done this, we are ready to see how to access the files and keep the data updated on both sites, in the next section: File management.

File management

In section Setting-up the system we saw how to set up two tables on two different hosts, one local and one remote, with different files. Our final goal is to access these files from Python. On the local host we can do so by writing

>>> import hep_rfm
>>> table = hep_rfm.Table.read('/home/user/rfm/table.db')
>>> list(sorted(table.keys()))
['file1', 'file2', 'file3', 'file4', 'file5', 'file6']
>>> table['file1']
FileInfo(name='file1', protocol_path=LocalPath(path='/home/user/rfm/files/file1.txt'), marks=FileMarks(tmstp=1540986365.8954208, fid='202444300d48c9387406493590cc693a81c011f1c112d7c8916b27ff9e5daad32a679e4471974b2169c4d22738a17a92405d8948bf5b1d2d8b69c235a65f991b'))
>>> table['file1'].protocol_path.path
/home/user/rfm/files/file1.txt

so we can access all the information. We could also modify the information in the table, make copies, and then write them using the method hep_rfm.Table.write(). To automatically update the table using the local files, we could simply call hep_rfm.Table.updated(), and we would obtain the updated version of the table.

From now on, we will assume that we are working on the local host. To manage both tables, we will create a hep_rfm.Manager instance and register the two tables:

>>> mgr = hep_rfm.Manager()
>>> mgr.add_table('/home/user/rfm/table.db', 'local')
>>> mgr.add_table('@host.com:/home/user/table.db', 'ssh')

Now we have the two tables registered. Calling the method hep_rfm.Manager.available_table(), we will obtain the table corresponding to the local path. This means that in the local host we will get “/home/user/rfm/table.db”, and in the remote “/home/user/table.db”. With this information we can create the table, and work with the files from there.

>>> table = mgr.available_table()
>>> list(sorted(table.keys()))
['file1', 'file2', 'file3', 'file4', 'file5', 'file6']
>>> hep_rfm.available_working_path(table['file1'])
/home/user/rfm/files/file1.txt
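
One way to picture how a table can be selected depending on the host is the sketch below. The rule it applies is an assumption and is not taken from the package: a path prefixed with the name of the current host wins, and otherwise a path without any host prefix is used. The functions split_remote and pick_available are hypothetical.

```python
import socket


def split_remote(path):
    """Split '@host:/some/path' into (host, path); host is None if local."""
    if path.startswith('@') and ':' in path:
        host, _, local = path[1:].partition(':')
        return host, local
    return None, path


def pick_available(paths, hostname=None):
    """Return the path that is reachable on the current host.

    Assumed rule: prefer a path whose host prefix matches "hostname",
    then fall back to a path without any host prefix.
    """
    hostname = hostname or socket.getfqdn()
    parsed = [split_remote(p) for p in paths]
    for host, local in parsed:
        if host == hostname:
            return local
    for host, local in parsed:
        if host is None:
            return local
    raise LookupError('No table is available on "{}"'.format(hostname))


paths = ['/home/user/rfm/table.db', '@host.com:/home/user/table.db']
print(pick_available(paths, 'laptop'))    # /home/user/rfm/table.db
print(pick_available(paths, 'host.com'))  # /home/user/table.db
```

The host name is passed explicitly here so that both situations can be tried on a single machine.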

The advantage of using the hep_rfm.Manager instance is that we do not need to care whether we are on the local or the remote host. This becomes powerful when the code lives in a git repository that is present on both hosts, since one can work on either host without really caring about the location of the files. Lastly, probably one of the most useful methods of the class is hep_rfm.Manager.update. This method allows one to update all the different tables stored in the manager. It proceeds as follows:

  1. First, the tables are copied to the current host.
  2. Then, the file IDs of the files are compared among the different tables.
  3. If there is a mismatch, the time-stamps are checked to determine the newest version.
  4. Next, the files are copied to each outdated location.
  5. Finally, the tables in the outdated locations are updated.
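
The comparison in steps 2 to 4 can be sketched in plain Python. The data layout (a dictionary of locations, each mapping file names to a time-stamp and a file ID) and the function plan_update are hypothetical; copying the files and rewriting the tables are left out.

```python
def plan_update(tables):
    """Return a list of (name, source, target) copies to perform.

    "tables" maps a location label to a dictionary of name -> (tmstp, fid).
    """
    copies = []
    names = sorted(set().union(*(t.keys() for t in tables.values())))
    for name in names:
        # Locations that know about this file, with their marks
        marks = {loc: t[name] for loc, t in tables.items() if name in t}
        if len({fid for _, fid in marks.values()}) < 2:
            continue  # all file IDs agree: nothing to do
        # File IDs differ, so the time-stamps decide which copy is newest
        newest = max(marks, key=lambda loc: marks[loc][0])
        for loc, (_, fid) in sorted(marks.items()):
            if fid != marks[newest][1]:
                copies.append((name, newest, loc))
    return copies


tables = {
    'local': {'file1': (1555066481.7, 'aaa'), 'file2': (1555066482.0, 'bbb')},
    'remote': {'file1': (0.0, 'none'), 'file2': (1555066482.0, 'bbb')},
}
print(plan_update(tables))  # [('file1', 'local', 'remote')]
```

Here only file1 is scheduled for a copy, since file2 has the same file ID in both locations.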

The general idea is to have a Python file in your module where the tables are defined, so that it serves both as an access point to the data and as a way to update it. An example of such a module would be

import hep_rfm

mgr = hep_rfm.Manager()
mgr.add_table('/home/user/rfm/table.db', 'local')
mgr.add_table('@host.com:/home/user/table.db', 'ssh')

table = mgr.available_table()

if __name__ == '__main__':
    mgr.update(modifiers={'ssh_usernames': {'host.com': 'username'}})

so if we put this in a file called “data.py”, we can access it via import data, and data.table then gives us access to the real paths of the files. The samples, in turn, are updated by executing python data.py. Note that we have specified the keyword argument modifiers. This allows us to specify, in a dictionary, the user name to be used for each ssh host, for example.

Limitations

The biggest limiting factor concerns hashing very large files. The functions provided here are designed to work with large files as well (those that cannot be loaded completely into memory). However, for very large files (> 20 GB), the hashing process might become very slow, depending on the machine.

Another limitation has to do with the xrootd protocol. Often, the xrootd protocol does not allow us to copy (via xrdcp) a file from the local system to the remote one. If your project uses files on a site accessible via the xrootd protocol, you might need to define that site as your main one. Uploading the new files to this site and then running the update script from it would probably be the way to go.