Skip to main content

Working with Data Connectors

In addition to the Edge internal file store, you can configure Edge to connect to external data sources such as SQL databases and S3 buckets. Data connectors provide an interface to work with remote data.

A data connector is a record in Edge which points to a remote resource (SQL database, S3 bucket), combined with a standard Python API for retrieving data. This design makes it very easy to browse remote data, in place, without having to access information programmatically in Jupyter notebooks.

Managing connectors in the Data App

A good first step toward getting started with data connectors is by browsing the Data App. Even if your administrator has not given you access to create connectors yourself, you can still view the contents of existing connectors. For example, here are the contents of a remote S3 bucket displayed in the Data App:

/img/conn_s3.png

To create a new connector, first navigate to the Data App. Next, click on the "+" button to the right of the Data Sources sidebar. A dialog will prompt you to select a connector type. After choosing a connection type, you'll be prompted to provide information to enable Edge to connect to the external data source. You will find more information on how to connect to specific data types below.

/img/conn_create.png

Working with connectors in Python

The built-in EdgeSession (the edge object in your Jupyter notebook) will allow you to access data connectors, via the edge.sources attribute. Here's how the two connectors (SQL and S3) above show up in Python:

    >>> edge.sources.list_names()
['data-warehouse-sql', 'data-lake-s3']

You'll notice the names are a little different than they appear in the Data App user interface. That's because each connector has a "short name" (data-warehouse-sql or data-lake-s3 in this example), as well as a "title" (Data Warehouse (SQL) and Data Lake (S3)).

To retrieve a connector, use its short name:

    >>> conn = edge.sources.get('data-warehouse-sql')

The exact details of each connector's API vary, but all have some basic attributes in common, including name and title:

    >>> conn.name
'data-warehouse-sql'
>>> conn.title
'Data Warehouse (SQL)'

The SQL Connector

The SQL connector allows you to connect to a remote SQL database. Currently, only PostgreSQL databases are officially supported. Viewing an SQL connector in the Data App will display a list of tables, and allows you to click on each of them to preview the data:

/img/conn_create.png

In Python, you have programmatic access to both the table names and the table contents, loaded as Pandas data frames. Both are accessed through the .tables attribute, which provides a dictionary-like interface:

    >>> list(conn.tables)
['experiments', 'batchresults', 'formulations']
>>> table = conn.tables['experiments']
>>> table.to_dataframe()
index Experiment ID Description
0 0 6fbfee Unprocessed
1 1 9f47e9 Mixture
2 2 8e3ca0 Hot/Cold
3 3 8c6d14 Pressurized
4 4 39c51a High-Temp

The S3 Connector

The S3 connector allows browsing the contents of a remote data store as if it were a file system. The Python interface borrows heavily from that used by the Edge internal file store (Working with Files). In this case, file access begins via the .root attribute, which represents the root of the virtual file system:

    >>> conn = edge.sources.get('data-lake-s3')
>>> conn.root.list()
['Experiment data',
'SEM micrographs',
'sample_medium_0.jpg',
'sample_medium_1.jpg',
'sample_medium_2.jpg',
'sample_medium_3.jpg',
'sample_medium_4.jpg',
'sample_medium_5.jpg',
'sample_medium_6.jpg',
'sample_medium_7.jpg',
'sample_medium_8.jpg',
'sample_medium_9.jpg']

Files, and nested "folders", are retrieved using open():

    >>> subfolder = conn.root.open('Experiment data')
>>> subfolder.list()
['01f23764ad0b4642a374f68349c5504c.png',
'd8d73e96bff44632985ad563eb4397cd.bin',
'dc852dfb0b89442ab918297db9b8d727.png',
'e004927530c74ed7ac8c510744645d55.png',
'e38326ab94494f50ba1050b3ee704eee.bin']

>>> myfile = conn.root.open('sample_medium_0.jpg')
>>> type(myfile)
>>> edge.api.S3File