HCP Metadata Query Tool (2.0.11)

Prologue / Intro

There are situations where one needs information about all the objects stored in an Object Storage system, or about what has happened to an object during its lifetime. Sometimes one also needs to find out whether an object was stored and later deleted.

In general, most Object Storage systems on the market offer few ways to get this kind of information, other than ‘walking the tree’.

For Hitachi Content Platform (HCP), things are different. HCP offers a built-in metadata query engine (MQE), which is able to provide the mentioned details.

The tool described in this document uses the MQE API to request information about object-related operations from HCP. Object-related operations are records describing what happened to an object and when: its creation, metadata changes, and deletion (including disposition, prune and purge operations).

Its output is either a sqlite3 database file or a comma-separated-value (csv) file, plain or compressed, holding a list of the requested operations.

Using the query options, it can answer questions like:

  • which objects are in the system?

  • which objects were deleted during a given time period?

  • etc.

Some recipes for using the acquired data can be found in the Recipes chapter.

Installation

First of all, be aware that hcpmqe is a GUI-based tool. It will not run on a system without a GUI (headless Linux, for example).

Second, no binary installers are provided, due to the effort required to build them reliably for all supported platforms. A Python 3.7 (or newer) installation is required.
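
The version requirement can also be checked from within Python. The following is a minimal sketch; the function name python_is_supported is chosen for illustration only and is not part of hcpmqe:

```python
import sys

REQUIRED = (3, 7)  # minimum version stated in this document


def python_is_supported(version_info=sys.version_info, required=REQUIRED):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= required


if __name__ == "__main__":
    print("OK" if python_is_supported() else "Python 3.7 or newer is required")
```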

Installing the Python package

Given that Python 3 is installed, the process of installing hcpmqe is pretty straightforward. It’s highly suggested to use a Python virtual environment, especially if the tool is used as a one-off.

Note

Internet access is required to install the package, as it depends on other packages that are loaded from PyPI (the Python Package Index).

This is how to do it:

  • Check the Python version:

    $ python3 --version
    Python 3.7.4
    

    Note: Python >= 3.7 is required, any higher version should do as well.

  • Create a folder to work in:

    $ mkdir hcpmqe
    $ cd hcpmqe
    
  • Set up a Python virtual environment:

    $ python3 -m venv .venv
    
  • Activate the virtual environment:

    Linux, macOS:

    $ source .venv/bin/activate
    (.venv) $
    

    Windows:

    C:\Users\sm\hcpmqe> .venv\Scripts\activate
    (.venv) C:\Users\sm\hcpmqe>
    

    Noticed the changed prompt? This shows that you have activated the virtual environment.

  • Update the Python setup tools:

    (.venv) $ pip install -U pip setuptools
    [.. a lot of messages shown here ..]
    Successfully installed pip-19.2.2 setuptools-41.2.0
    
  • Install the tool’s Python package:

    (.venv) $ pip install hcpmqe
    Collecting hcpmqe
      Downloading hcpmqe-2.0.2.tar.gz (17 kB)
    Collecting PySimpleGUI==4.30.0
      Using cached PySimpleGUI-4.30.0-py3-none-any.whl (233 kB)
    Collecting httpx==0.16.1
      Using cached httpx-0.16.1-py3-none-any.whl (65 kB)
    Collecting certifi
      Using cached certifi-2020.11.8-py2.py3-none-any.whl (155 kB)
    Collecting httpcore==0.12.*
      Downloading httpcore-0.12.1-py3-none-any.whl (54 kB)
         |--------------------------------| 54 kB 968 kB/s
    Collecting rfc3986[idna2008]<2,>=1.3
      Using cached rfc3986-1.4.0-py2.py3-none-any.whl (31 kB)
    Collecting sniffio
      Using cached sniffio-1.2.0-py3-none-any.whl (10 kB)
    Collecting h11==0.*
      Using cached h11-0.11.0-py2.py3-none-any.whl (54 kB)
    Collecting idna; extra == "idna2008"
      Using cached idna-2.10-py2.py3-none-any.whl (58 kB)
    Using legacy 'setup.py install' for hcpmqe, since package 'wheel' is not installed.
    Installing collected packages: PySimpleGUI, certifi, sniffio, h11, httpcore, idna, rfc3986, httpx, hcpmqe
        Running setup.py install for hcpmqe ... done
    Successfully installed PySimpleGUI-4.30.0 certifi-2020.11.8 h11-0.11.0 hcpmqe-2.0.2 httpcore-0.12.1 httpx-0.16.1 idna-2.10 rfc3986-1.4.0 sniffio-1.2.0
    

Now you can run the tool as described in the following chapters, by just calling hcpmqe.

Note

Please keep in mind that the Python virtual environment must be activated to run the tool. If needed, simply activate it by running:

$ cd hcpmqe
$ source .venv/bin/activate

or

C:\Users\sm> cd hcpmqe
C:\Users\sm\hcpmqe> .venv\Scripts\activate

HCP prerequisites

Warning

Not having the proper permissions and/or the MQE API being disabled will always lead to error 403 when running a query:

_images/HCP_status403.png

MQE API

HCP needs to have the metadata query API enabled to allow the tool to function. The minimal setting needed can be set using the HCP System Console > Services > Search panel:

_images/HCP_search.png

Enable metadata query API is the only setting required in this panel.

System administrator

A system-level administrator must at least have the Search role to access the MQE API:

_images/HCP_mqeuser.png

Such an administrator is able to query Tenants that have granted system-level users permission to manage the Tenant and search its Namespaces in the respective Tenant Console > Overview panel:

_images/HCP_allowadmin.png

As a result of this, a full system-wide list of all operations can only be acquired if all Tenants have granted this privilege.

Using an HCP FQDN starting with “tenantname.” will query just that Tenant. In this case, the data network configured for the Tenant must be reachable by the tool, and its FQDN must be resolvable via DNS.

Using an HCP FQDN starting with “admin.” will query all Tenants that have granted the permission, even if the configured data network of some of the Tenants is not reachable by the tool.
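
The two addressing modes can be summarized in a few lines of Python. This is only an illustration of the rule described above, not code taken from the tool; the function name and the host names used with it are hypothetical:

```python
def query_scope(fqdn: str) -> str:
    """Describe which Tenants a query against this FQDN would cover."""
    # Only the first DNS label decides between system-wide and single Tenant.
    first_label = fqdn.strip().lower().split(".", 1)[0]
    if first_label == "admin":
        return "all Tenants that granted system-level access"
    return f"Tenant '{first_label}' only"
```

For example, query_scope("admin.hcp.example.com") describes a system-wide query, while query_scope("finance.hcp.example.com") describes a query limited to a Tenant named finance.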

Tenant user

A Tenant user must have at least the Search permission for each Namespace to be queried:

_images/HCP_tenantuser.png

Of course, the tool must be able to reach the configured data network of the Tenant, and its FQDN must be resolvable via DNS.

In addition, Search must be enabled for every Namespace that is to be queried:

_images/HCP_Namespace_Search.png

User Interface

Main panel

_images/gui.png

HCP access parameters

_images/access_params.png

Here, the HCP system to query is addressed. Either the entire system (FQDN starting with admin.) or a specific Tenant (FQDN starting with the Tenant’s name) can be addressed.

The query can optionally be refined by specifying one or more Namespaces, separated by commas.

Note

Please note that Namespaces must always be specified as Namespace.Tenant, even in case a specific Tenant is queried!

Further refinement is available by specifying one or more directories (each starting with /, separated by commas). Please note that the directories specified will be used for each and every Namespace addressed by the query.
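
The expected form of these two fields can be sketched as a small validation routine. This is not the tool's own field verification; the function names are hypothetical, and the check for exactly one dot in Namespace.Tenant is an assumption:

```python
def parse_namespaces(field: str) -> list:
    """Split a comma-separated list and check the Namespace.Tenant form."""
    names = [item.strip() for item in field.split(",") if item.strip()]
    for name in names:
        parts = name.split(".")
        # Assumption: exactly one dot separates Namespace and Tenant.
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"expected Namespace.Tenant, got {name!r}")
    return names


def parse_directories(field: str) -> list:
    """Split a comma-separated list and check that each entry starts with /."""
    directories = [item.strip() for item in field.split(",") if item.strip()]
    for directory in directories:
        if not directory.startswith("/"):
            raise ValueError(f"directory must start with /, got {directory!r}")
    return directories
```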

The user specified must be a local HCP user (no AD account) with the proper permissions granted, as described in the prior chapter.

HCP load parameters

_images/load_params.png

MQE queries can produce a huge amount of records to be fetched from HCP, depending on the number of objects addressed by the query. Therefore, paged queries of up to 10,000 records are used to keep the peak load in an acceptable range.

A throttle of up to 60 seconds can be set to reduce the load on HCP even further, at the cost of a longer query run time.

If timeout errors are reported, try a request timeout longer than the default 60 seconds.
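
The interplay of page size and throttle can be sketched as a simple loop. This is a generic illustration and not the tool's implementation: fetch_page is a hypothetical callable standing in for one MQE request, and treating a short page as the end of the result set is an assumption made for this sketch:

```python
import time
from typing import Callable, Iterator, List, Optional


def fetch_all(fetch_page: Callable[[Optional[dict], int], List[dict]],
              page_size: int = 10_000,
              throttle: float = 0.0) -> Iterator[dict]:
    """Yield records page by page, pausing between page requests."""
    last = None  # identity of the last record received so far
    while True:
        page = fetch_page(last, page_size)
        yield from page
        if len(page) < page_size:
            break  # assumption: a short (or empty) page means no more records
        last = page[-1]  # the next request continues after this record
        time.sleep(throttle)  # relax the load on the server
```

A larger page size means fewer requests with a higher peak load per request; a longer throttle spreads the same number of requests over more time.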

Tip

The values (except for timeout) can be changed while a query is running. Use the spin box to change the value, then click the [Set] button. The new value will be picked up with the next page request.

Output parameters

_images/output_params.png

Supported output types are comma-separated-value (csv) files, plain as well as compressed (bz2, gzip, lzma), and sqlite3 database files.

Selecting verbose will request all system metadata values per object from HCP, while not selecting it will request just the bare minimum (4 fields) that clearly identifies each object and the operation that triggered the record.
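
Compressed csv output can be read back with the Python standard library. The sketch below assumes that the file has a header row and that the compression type can be recognized by the file suffixes .gz, .bz2 and .xz; both are assumptions, as are the function and file names:

```python
import bz2
import csv
import gzip
import lzma

# Assumed mapping of file suffix to the matching standard library opener.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def read_csv_records(path):
    """Yield one dict per csv row, decompressing transparently if needed."""
    name = str(path)
    opener = next((o for suffix, o in OPENERS.items() if name.endswith(suffix)),
                  open)  # fall back to a plain, uncompressed file
    with opener(path, "rt", newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)
```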

Query parameters

_images/query_params.png

The operations (transactions) to be queried for:

  • create - list all currently existing objects and their ingested versions

  • delete - list all deleted objects and object versions

  • dispose - list all objects deleted by disposition (automatic deletion when the retention period ends)

  • prune - list all object versions deleted automatically when the configured version life span ended

  • purge - list object versions deleted along with the object’s current version

Note that only those objects / versions are returned for which the respective operation happened during the selected time frame. Also note that, depending on the HCP configuration, records of deleted objects / versions are kept for a limited number of days only.
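
When several operations were selected for one query, a quick per-operation count helps to check the result. The following is a small sketch, assuming sqlite3 output with the ops table and its operation column as shown in the Database Schema chapter; the function name is hypothetical:

```python
import sqlite3


def operations_summary(con: sqlite3.Connection) -> dict:
    """Return the number of records per operation found in the ops table."""
    rows = con.execute(
        "SELECT operation, count(*) FROM ops GROUP BY operation")
    return dict(rows.fetchall())
```

It is called with an open connection, for example operations_summary(sqlite3.connect("result.db")), where result.db stands for the database file written by the tool.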

Time range

_images/time_range.png

Defines the time range for which operations are requested.
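
According to the release history, start and end time are entered in ISO 8601 format and converted to milliseconds for the query. How such a conversion can be done is sketched below; this is not the tool's own code, and the requirement of an explicit UTC offset is a choice made for this example:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_to_epoch_ms(stamp: str) -> int:
    """Convert an ISO 8601 timestamp with UTC offset to epoch milliseconds."""
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        raise ValueError("timestamp needs an explicit UTC offset")
    # Integer division by one millisecond avoids float rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)
```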

Status

_images/status.png

The Status line tells what’s going on; Records found shows how many records (object operations) have been returned so far. The Last record block shows the identity of the last record found. These values can be used to restart an interrupted query, for example. See the following Recipes chapter.

Last Record

_images/last_record.png

This area displays the last record received. It is either the last record within a received page (as long as a query is running), or the final record received during the query.

Note

The configuration file is auto-updated with these values after every successfully received page, so that a query can be continued later from exactly that position.

That means that a query can always be repeated or extended from that position, whether the query finished successfully, was canceled before it finished, or the tool even crashed.
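
The idea behind this behaviour can be sketched as a small checkpoint helper. This is an illustration only: the format of the tool's configuration file is not described here, so the JSON layout, the key name and the function names below are all hypothetical:

```python
import json
import os


def save_checkpoint(path, last_record):
    """Store the identity of the last record after a page was received."""
    tmp = f"{path}.tmp"
    with open(tmp, "w", encoding="utf-8") as handle:
        json.dump({"last_record": last_record}, handle)
    os.replace(tmp, path)  # swap in the new file in a single step


def load_checkpoint(path):
    """Return the stored last record, or None if no checkpoint exists."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)["last_record"]
    except FileNotFoundError:
        return None
```

Writing to a temporary file first means that an interruption during the write leaves the previous checkpoint intact.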

Time bar

_images/time_bar.png

During and after a query, the time bar shows the overall run time, the time spent on page queries as well as the time spent on writing the database (or csv file).

Run queries

Preparation

The Python virtual environment (created during installation) must be activated to run the tool. If needed, simply activate it by running:

Linux, macOS:

$ cd hcpmqe
$ source .venv/bin/activate

or

Windows:

C:\Users\sm> cd hcpmqe
C:\Users\sm\hcpmqe> .venv\Scripts\activate

Start the tool

$ hcpmqe --help
usage: hcpmqe [-h] [--version] [-C]

optional arguments:
  -h, --help         show this help message and exit
  --version          show program's version number and exit
  -C, --log2console  instead of logging to hcpmqe.log, log to console

The tool always logs its activity, either to a file in the current directory (hcpmqe.<pid>.log) or, if the -C argument is used, to the console.

Running the command will open the GUI:

_images/run_start_main.png

Run a query from scratch

Once the form is filled with parameters matching the wanted query, save the configuration, then click the [Run query] button to start the process.

_images/run_start_query.png

All the entry fields will be disabled, except the ones that allow changing page size and throttle. The Status line will show progress information, Records found reports the number of records received so far, the Last record section shows the identity of the last page’s final record, and at the very bottom, some timing information is displayed.

Re-start a query

If a query was canceled or interrupted for whatever reason, it can be restarted. If the tool crashed or was killed somehow, just start it again and load the configuration file. It will show information about the last record that was written to the output file.

Do not change any parameter; just press [Run query] again (changing values will likely leave the query incomplete). You’ll be asked whether you want to continue or start from scratch.

_images/run_start_restart_query.png

Recipes

Migration cross-check

Situation

You’ve migrated a Namespace, a Tenant or an entire system to another HCP using replication, and you need to verify that the data in source and target is exactly the same.

Recipe

This recipe uses a single Tenant as an example.

Acquire a list of existing objects from both HCP systems
  • Use the hcpmqe tool to query both HCP systems for a list of existing objects:

    • Use a Tenant user with Search permission for all Namespaces within the Tenant

    • Select create as the only transaction type

    • Leave Start time at the default, set End Time to when you finished the migration

    • Select sqlite3 as output format

    • Check verbose

    • Run the query for both involved HCP systems

    _images/recipe01_80.png _images/recipe01_85.png
    • You should have two database files, once finished:

      $ ls -lh *.db
      -rw-r--r--  1 tsimons  staff   2.4M Nov 24 17:03 awhdis2_hcp80.db
      -rw-r--r--  1 tsimons  staff   2.4M Nov 24 09:04 awhdis2_hcp85.db
      

Note

For a comparison like this, just a few of the columns in the databases are relevant to clearly identify an object:

  • hash

  • ingesttime

  • namespace

  • objectPath

  • version

Some more are interesting, as well:

  • replicated

  • size

  • Use the sqlite3 command-line tool to run SQL queries to compare the two databases:

    Note

    For a valid result, make sure to limit the set of objects investigated to exactly the same time frame. We’ll use the epoch time stamp (seconds since 1970/1/1 0:00:00 UTC) for that; any epoch converter can be used for the conversion.

    For this example, migration ended 2020/11/23 08:00:00 (local time, UTC+1) -> 1606114800 epoch time.

    • Open the origin HCP database (awhdis2_hcp80.db):

      $ sqlite3 awhdis2_hcp80.db
      
    • Attach the migration target HCP database (awhdis2_hcp85.db):

      sqlite> ATTACH 'awhdis2_hcp85.db' AS replica;
      
    • Check whether the number of records is equal:

      sqlite> SELECT count(*) FROM main.ops
                  WHERE ingestTime <= 1606114800;
      4673
      sqlite> SELECT count(*) FROM replica.ops
                  WHERE ingestTime <= 1606114800;
      4673
      
    • Check if there are any non-replicated objects:

      sqlite> SELECT count(*) FROM main.ops
                  WHERE NOT replicated
                        AND ingestTime <= 1606114800;
      0
      sqlite> SELECT count(*) FROM replica.ops
                  WHERE NOT replicated
                        AND ingestTime <= 1606114800;
      0
      
    • Now, let’s check which records don’t exist in one of the databases:

      • List all records not in the migration target database:

        sqlite> SELECT hash, ingesttime, namespace,
                       objectPath, version FROM main.ops
                    WHERE ingestTime <= 1606114800
                    EXCEPT SELECT hash, ingesttime, namespace,
                                  objectPath, version
                                FROM replica.ops
                                    WHERE ingestTime <= 1606114800;
        [..]
        
      • List all records not in the origin database:

        sqlite> SELECT hash, ingesttime, namespace,
                       objectPath, version FROM replica.ops
                    WHERE ingestTime <= 1606114800
                    EXCEPT SELECT hash, ingesttime, namespace,
                                  objectPath, version
                                FROM main.ops
                                    WHERE ingestTime <= 1606114800;
        [..]
        

      Alternative way to achieve the same result:

      • List all records not in the migration target database:

        sqlite> SELECT DISTINCT hash, ingesttime, namespace,
                                objectPath, version FROM main.ops
                    WHERE ingestTime <= 1606114800
                        AND (hash, ingesttime, namespace, objectPath,
                             version) NOT IN
                        (SELECT DISTINCT hash, ingesttime, namespace,
                                         objectPath, version
                            FROM replica.ops
                                WHERE ingestTime <= 1606114800);
        
      • List all records not in the origin database:

        sqlite> SELECT DISTINCT hash, ingesttime, namespace,
                                objectPath, version FROM replica.ops
                    WHERE ingestTime <= 1606114800
                        AND (hash, ingesttime, namespace, objectPath,
                            version) NOT IN
                        (SELECT DISTINCT hash, ingesttime, namespace,
                                         objectPath, version
                            FROM main.ops
                                WHERE ingestTime <= 1606114800);
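
The same cross-check can be scripted with Python's built-in sqlite3 module. The following sketch mirrors the EXCEPT queries above; the function name is hypothetical, and the cutoff calculation assumes that the migration end time of the example is given in UTC+1:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

KEY_COLUMNS = "hash, ingestTime, namespace, objectPath, version"

# 2020/11/23 08:00:00 at an assumed offset of UTC+1 is 1606114800 epoch seconds.
CUTOFF = int(datetime(2020, 11, 23, 8, 0,
                      tzinfo=timezone(timedelta(hours=1))).timestamp())


def missing_in_target(origin_db, target_db, cutoff=CUTOFF):
    """List records found in origin_db but not in target_db up to cutoff."""
    con = sqlite3.connect(origin_db)
    try:
        con.execute("ATTACH DATABASE ? AS replica", (target_db,))
        sql = (f"SELECT {KEY_COLUMNS} FROM main.ops WHERE ingestTime <= ? "
               f"EXCEPT SELECT {KEY_COLUMNS} FROM replica.ops "
               f"WHERE ingestTime <= ?")
        return con.execute(sql, (cutoff, cutoff)).fetchall()
    finally:
        con.close()
```

Calling missing_in_target("awhdis2_hcp80.db", "awhdis2_hcp85.db") lists the records missing in the migration target; swapping the two arguments lists the records missing in the origin. Two empty lists indicate that both systems hold the same objects for the selected time frame.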
        

Database Schema

The schema of the ops database table, which contains the collected operation records, differs between verbose and non-verbose query mode.

In addition, the database schema is built dynamically from the metadata keys HCP returns, so there might be slight differences between HCP versions. As of now (April 2022), the only such differences have been added keys. Nevertheless, if such a change happens during an HCP version upgrade, an existing database might not be usable with the newer version of HCP and then needs to be created from scratch (just delete the existing database and run a new query).

Here are samples of the ops table’s schema:
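
Because the set of columns is not fixed, it can be useful to inspect what a given database actually contains. The sketch below uses SQLite's table_info pragma; the two function names are hypothetical:

```python
import sqlite3


def ops_columns(con: sqlite3.Connection) -> list:
    """Return the column names of the ops table, in table order."""
    # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in con.execute("PRAGMA table_info(ops)")]


def new_columns(con: sqlite3.Connection, known) -> list:
    """Return the columns present in the database but not in 'known'."""
    known = set(known)
    return [name for name in ops_columns(con) if name not in known]
```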

Non-verbose mode

                  Column | example value
-------------------------+-------------------------------------------------
  changeTimeMilliseconds | 1648129832613.00
               operation | CREATED
                 urlName | https://one.s3.hcp80.archivas.com/rest/hallo.txt
                 version | 105480309271425

Verbose mode

                  Column | example value
-------------------------+-------------------------------------------------
              accessTime | 1648129832
        accessTimeString | 2022-03-24T14:50:32+0100
                     acl | 0
  changeTimeMilliseconds | 1648129832613.00
        changeTimeString | 2022-03-24T14:50:32+0100
          customMetadata | 1
customMetadataAnnotation | .metapairs
                     dpl | 1
                     gid | 0
                    hash | SHA-256 78FC...<cut>...232F
              hashScheme | SHA-256
                    hold | 0
                  _index | 1
              ingestTime | 1648129832
        ingestTimeString | 2022-03-24T14:50:32+0100
               namespace | one.s3
              objectPath | /hallo.txt
               operation | CREATED
                   owner | USER,s3,s3
             permissions | 555
              replicated | 0
    replicationCollision | 0
               retention | 0
          retentionClass |
         retentionString | Deletion Allowed
                   shred | 0
                    size | 6
                    type | object
                     uid | 0
              updateTime | 1648129832
        updateTimeString | 2022-03-24T14:50:32+0100
                 urlName | https://one.s3.hcp80.archivas.com/rest/hallo.txt
                utf8Name | hallo.txt
                 version | 105480309271425
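
The changeTimeMilliseconds column holds milliseconds since the epoch, as the matching changeTimeString in the sample suggests. A minimal sketch for turning it into a readable time stamp follows; the function name is hypothetical, and the UTC offset has to be supplied by the caller:

```python
from datetime import datetime, timedelta, timezone


def change_time(value, utc_offset_hours=0):
    """Convert a changeTimeMilliseconds value to an aware datetime."""
    ms = int(float(value))  # accepts 1648129832613.00 as text or number
    tz = timezone(timedelta(hours=utc_offset_hours))
    # Whole seconds and the millisecond remainder are handled separately
    # to avoid float rounding.
    return (datetime.fromtimestamp(ms // 1000, tz)
            + timedelta(milliseconds=ms % 1000))
```

With the sample value and an offset of one hour, change_time("1648129832613.00", 1).strftime("%Y-%m-%dT%H:%M:%S%z") yields 2022-03-24T14:50:32+0100, which matches the changeTimeString shown in the table.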

Release History

2.0.11 - 2022-05-03

  • Added tool tips to most of the form fields

2.0.10 - 2022-04-28

  • Warning box title corrected

  • some documentation corrections

2.0.9 - 2022-04-28

  • replaced the sliders in the UI with pre-seeded spin boxes

  • simplified the HCP load parameters to a single [Set] button

  • added the database schema to the documentation

  • fixed a bug where the last record values were removed from the configuration file in case a repeated query did not return new records

2.0.8 - 2022-04-26

  • made the request timeout configurable from the UI

2.0.7 - 2021-12-22

  • copyright note fixed

  • added _recipies folder, w/ a script to count objects per folder from a hcpmqe database

2.0.6 - 2021-10-20

  • fixed a bug that causes SSL handshake errors when used with Python 3.10

2.0.5 - 2020-11-23

  • start and end time are now in ISO 8601 format, added field verification

2.0.4 - 2020-11-17

  • db/csv columns are now sorted by name, to make sure they are uniform across multiple runs

  • fixed a bug where columns in sqlite3 databases were incorrectly named, occasionally

  • fixed the start- / end-times (needs to be converted to milliseconds to be accurate)

2.0.3 - 2020-11-11

  • preparation for publishing

  • corrected the URL for help/documentation

2.0.2 - 2020-11-10

  • configuration file now saved/loaded via menu entries

  • configuration file is auto-updated when changes happen

  • logging to file now into hcpmqe.<pid>.log

2.0.1 - 2020-11-08

  • automatically adapts to whatever metadata fields the HCP MQE API delivers

  • allows restarting a canceled or interrupted query

2.0.0 - 2020-11-03

  • complete rewrite using tkinter through PySimpleGUI

  • runs on all major platforms (Linux, Windows, macOS)

1.0.x releases

  • 1.0.11 - 2014-08-21

    [..]

  • 1.0.1 - 2012-09-06

    initial release for Windows only

License / Trademarks

The MIT License (MIT)

Copyright (c) 2012-2022 Thorsten Simons (sw@snomis.eu)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Trademarks and Copyrights of used material

Hitachi Content Platform is a registered trademark of Hitachi Vantara LLC, in the United States and other countries.

All other trademarks, service marks, and company names in this document or web site are properties of their respective owners.