Automatic data-discovery

Introduction

The LIGO Scientific Collaboration uses a service colloquially referred to as ‘datafind’ to index and archive the locations of data files produced at the observatories, including those from other detectors (e.g. GEO600). The datafind service continually updates to provide information with a typical latency of 10 minutes, often much less.

The gwdatafind package can be used to execute queries against the datafind server to discover the URLs of data files (typically .gwf format). See the documentation for that package for full details.
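As a rough illustration, a direct query might look like the sketch below. It assumes the find_urls function of gwdatafind, network access, and a valid credential. The helper names observatory_prefix and find_frame_urls are invented for this example, and the host and dataset name are simply the ones used elsewhere on this page.

```python
def observatory_prefix(channel):
    """Return the one-letter observatory code for a channel name.

    For example, 'L1:GWOSC-4KHZ_R1_STRAIN' gives 'L'.
    """
    ifo, sep, _ = channel.partition(":")
    if not sep or not ifo:
        raise ValueError(f"cannot parse a detector prefix from {channel!r}")
    return ifo[0]


def find_frame_urls(channel, frametype, start, end, host):
    """Ask a datafind server for the URLs of files overlapping the interval."""
    # Imported lazily so the helper above can be used without gwdatafind.
    from gwdatafind import find_urls

    return find_urls(observatory_prefix(channel), frametype, start, end, host=host)


# Example call (hypothetical session; needs network access and a credential):
# urls = find_frame_urls("L1:GWOSC-4KHZ_R1_STRAIN", "L1_GWOSC_O2_4KHZ_R1",
#                        1187008880, 1187008884, host="datafind.ligo.org:443")
```

The query takes the observatory letter rather than the full detector prefix, which is why the sketch derives it from the channel name.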

Users who have access to the LIGO Data Grid – the shared computing infrastructure that supports internal collaboration data analysis – can use the local datafind server to query for local files that can be accessed directly. Other users, including those on the Open Science Grid, can use the server located at datafind.ligo.org:443 to query for files archived under CVMFS.

Auto-discovery of data using TimeSeries.get()

Additional dependencies: LDAStools.frameCPP

To discover and read data automatically, use TimeSeries.get(), and point it at a datafind host:

>>> from gwpy.timeseries import TimeSeries
>>> data = TimeSeries.get("L1:GWOSC-4KHZ_R1_STRAIN", 1187008880, 1187008884,
...                       host="datafind.ligo.org:443")
>>> print(data)
TimeSeries([-5.98844033e-20, -6.34482794e-20, -6.31740522e-20,
            ...,  6.23573197e-20,  5.54748519e-20,
             5.91121781e-20]
           unit: dimensionless,
           t0: 1187008880.0 s,
           dt: 0.000244140625 s,
           name: L1:GWOSC-4KHZ_R1_STRAIN,
           channel: L1:GWOSC-4KHZ_R1_STRAIN)

This will execute the following series of steps:

  • query datafind.ligo.org for the list of datasets it knows about

  • for each dataset, determine whether 'L1:GWOSC-4KHZ_R1_STRAIN' is contained within a representative file, and pick the most appropriate dataset if more than one matches

  • query datafind.ligo.org again for the URLs of the files in the matched dataset

  • read each required file and return the data
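The first three steps can be sketched in plain Python as follows. This is not GWpy's implementation: the function discover_urls and its three callable arguments are hypothetical stand-ins for the server queries and the file inspection, and the rule for picking the "most appropriate" dataset is replaced here by simply taking the first match.

```python
def discover_urls(channel, start, end, list_datasets, has_channel, find_urls):
    """Walk through the discovery steps using caller-supplied callables.

    list_datasets() -> iterable of dataset names             (step 1)
    has_channel(dataset, channel) -> bool                     (step 2)
    find_urls(dataset, start, end) -> list of file URLs       (step 3)
    """
    # Step 1: ask the server which datasets it knows about.
    datasets = list(list_datasets())

    # Step 2: keep the datasets whose representative file holds the channel.
    matches = [name for name in datasets if has_channel(name, channel)]
    if not matches:
        raise LookupError(f"no dataset found containing {channel!r}")

    # Placeholder ranking (an assumption of this sketch): take the first match.
    dataset = matches[0]

    # Step 3: ask the server for the files of that dataset in the interval.
    return dataset, find_urls(dataset, start, end)
```

Reading the returned files (the final step) is left to the caller. Passing the queries in as callables keeps the sketch independent of any particular server or credential.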

Note

At the time of writing, all queries to datafind.ligo.org are restricted to persons with a valid LIGO.ORG RFC 3820 (X509) credential.

If any of the steps listed above fails, TimeSeries.get() automatically falls back to using NDS2 to access the data.
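The fallback order can be pictured with the sketch below. It is an illustration rather than the code inside TimeSeries.get(): read_with_fallback and default_readers are hypothetical helpers, and the sketch assumes that TimeSeries.find() (file-based access via datafind) and TimeSeries.fetch() (NDS2 access) are the two underlying readers.

```python
def read_with_fallback(channel, start, end, readers):
    """Try each (label, reader) pair in order and return the first success."""
    errors = []
    for label, reader in readers:
        try:
            return label, reader(channel, start, end)
        except Exception as exc:  # broad on purpose: any failure moves on
            errors.append(f"{label}: {exc}")
    raise RuntimeError("all data sources failed: " + "; ".join(errors))


def default_readers():
    """Build the datafind-then-NDS2 order from GWpy's own readers."""
    # Imported lazily so read_with_fallback can be used without GWpy.
    from gwpy.timeseries import TimeSeries

    return [("datafind", TimeSeries.find), ("nds2", TimeSeries.fetch)]


# Example call (hypothetical session; needs GWpy, network access and a credential):
# source, data = read_with_fallback("L1:GWOSC-4KHZ_R1_STRAIN",
#                                   1187008880, 1187008884, default_readers())
```

Returning the label alongside the data makes it easy to see which source actually served the request.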

By default, this method searches through every dataset the server knows about to find the correct files to read, which may take a while if the server indexes a large number of datasets. If you know the dataset name – the tag associated with files containing your data – you can pass it via the frametype keyword argument to speed up the search significantly:

>>> data = TimeSeries.get("L1:GWOSC-4KHZ_R1_STRAIN", "17 August 2017 12:42:02",
...                       "17 August 2017 12:42:06", frametype="L1_GWOSC_O2_4KHZ_R1")
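The start and end arguments in this call are UTC date strings rather than GPS seconds. GWpy performs the conversion itself (see gwpy.time.to_gps); the standard-library sketch below only shows the arithmetic. It assumes the date falls on or after 1 January 2017, when the GPS-UTC offset became 18 seconds, and that no leap second has been added since. The helper utc_to_gps is hypothetical and not part of GWpy.

```python
from datetime import datetime, timezone

GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)
LEAP_SECONDS = 18  # GPS-UTC offset from 2017-01-01 onwards (assumed unchanged)


def utc_to_gps(text, fmt="%d %B %Y %H:%M:%S"):
    """Convert a UTC date string such as '17 August 2017 12:42:02' to GPS seconds.

    Month names are parsed with the current locale, so this assumes English.
    """
    utc = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    if utc < datetime(2017, 1, 1, tzinfo=timezone.utc):
        raise ValueError("this sketch only handles dates from 2017 onwards")
    return int((utc - GPS_EPOCH).total_seconds()) + LEAP_SECONDS


print(utc_to_gps("17 August 2017 12:42:02"))  # 1187008940
print(utc_to_gps("17 August 2017 12:42:06"))  # 1187008944
```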

Available datasets

All data recorded by the current generation of detectors are grouped under a dataset tag, which identifies the data contained in a given gwf file. The following table is an incomplete reference to the dataset (frametype) to use for file-based data access:

Datasets available with gwdatafind

Dataset (frametype)     Description
----------------------  ----------------------------------------------------------------
H1_R                    All auxiliary channels, stored at the native sampling rate
H1_T                    Second trends of all channels, including .mean, .min, and .max
H1_M                    Minute trends of all channels, including .mean, .min, and .max
H1_HOFT_C00             Strain h(t) and metadata generated using the real-time calibration pipeline
H1_HOFT_CXY             Strain h(t) and metadata generated using the off-line calibration pipeline at version XY
H1_GWOSC_O2_4KHZ_R1     4 kHz strain h(t) and metadata as released by the Gravitational Wave Open Science Center (GWOSC) for the O2 data release
H1_GWOSC_O2_16KHZ_R1    16 kHz strain h(t) and metadata as released by the Gravitational Wave Open Science Center (GWOSC) for the O2 data release

The above datasets refer to the H1 (LIGO-Hanford) instrument; the same datasets are available for LIGO-Livingston by substituting the L1 prefix.
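The prefix substitution is a plain string operation, as in this small sketch. The helper frametype_for is hypothetical, and whether a given server actually holds the resulting dataset is not checked.

```python
def frametype_for(ifo, frametype):
    """Swap the detector prefix of a dataset name.

    For example, frametype_for('L1', 'H1_HOFT_C00') gives 'L1_HOFT_C00'.
    """
    prefix, sep, rest = frametype.partition("_")
    if not sep or not prefix or not rest:
        raise ValueError(f"unexpected dataset name: {frametype!r}")
    return f"{ifo}_{rest}"


print(frametype_for("L1", "H1_GWOSC_O2_4KHZ_R1"))  # L1_GWOSC_O2_4KHZ_R1
```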

Note

Not all datasets are available from all datafind servers. Each LIGO Lab-operated computing centre has its own datafind server with a subset of the available datasets.
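One way to check what a particular server offers before requesting data is sketched below. It assumes the find_types function of gwdatafind, plus network access and a valid credential for the live query. The helpers list_server_datasets and missing_datasets are invented for this example.

```python
def list_server_datasets(site, host):
    """Ask a datafind server which datasets it holds for one observatory."""
    # Imported lazily so missing_datasets can be used without gwdatafind.
    from gwdatafind import find_types

    return find_types(site, host=host)


def missing_datasets(wanted, available):
    """Return the names in `wanted` that the server did not list, in order."""
    available = set(available)
    return [name for name in wanted if name not in available]


# Example call (hypothetical session; needs network access and a credential):
# available = list_server_datasets("H", host="datafind.ligo.org:443")
# print(missing_datasets(["H1_R", "H1_HOFT_C00"], available))
```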