Set Up Datastore for Processing on Different Machines or Clusters

You can create and save a datastore on one platform and then load and use it seamlessly on a different platform by setting the 'AlternateFileSystemRoots' property of the datastore. Use this property when:

You create a datastore on a local machine, and need to access and process the data on another machine (possibly running a different operating system).
You process your datastore with parallel and distributed computing involving different platforms, cloud or cluster machines.

This example demonstrates the use of the 'AlternateFileSystemRoots' property for TabularTextDatastore. However, you can use the same syntax for any of these datastores: SpreadsheetDatastore, ImageDatastore, ParquetDatastore, FileDatastore, KeyValueDatastore, and TallDatastore. To use the 'AlternateFileSystemRoots' functionality for custom datastores, see matlab.io.datastore.DsFileSet and Develop Custom Datastore.

Save Datastore and Load on Different File System Platform

Create a datastore on one file system that loads and works seamlessly on a different machine (possibly running a different operating system). For example, create a datastore on a Windows® machine, save it, and then load it on a Linux® machine.

Before you create and save the datastore, identify the root paths for your data on the different platforms. The root paths differ based on the machine or file system. For instance, if you have data on your local machine and a copy of the data on a cluster, then get the root paths for accessing the data:

"Z:\DataSet" for your local Windows machine.
"/nfs-bldg001/DataSet" for your Linux cluster.

Then, associate these root paths by using the 'AlternateFileSystemRoots' parameter of the datastore.

altRoots = ["Z:\DataSet","/nfs-bldg001/DataSet"];
ds = tabularTextDatastore('Z:\DataSet','AlternateFileSystemRoots',altRoots);
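If the same setup script may run on either platform, one option is to pick the root that matches the current machine before creating the datastore. The following is a minimal sketch under that assumption; the variable name dataRoot is hypothetical, and the two paths are the example roots used above.

```matlab
% Example roots from the text above: Windows first, Linux second.
altRoots = ["Z:\DataSet","/nfs-bldg001/DataSet"];

% Pick the root that applies to the machine running this script.
% dataRoot is a hypothetical name used only for this sketch.
if ispc
    dataRoot = altRoots(1);
else
    dataRoot = altRoots(2);
end

% Create the datastore only if the chosen root is reachable from here.
if isfolder(dataRoot)
    ds = tabularTextDatastore(dataRoot,'AlternateFileSystemRoots',altRoots);
else
    warning('Data root %s is not reachable from this machine.', dataRoot);
end
```

Either way, the datastore carries both roots, so a saved copy can still be loaded on the other platform.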

Examine the Files property of the datastore. In this instance, the Files property contains the location of your data as accessed by your Windows machine.

ds.Files

ans =

  5×1 cell array

    {'Z:\DataSet\datafile01.csv'}
    {'Z:\DataSet\datafile02.csv'}
    {'Z:\DataSet\datafile03.csv'}
    {'Z:\DataSet\datafile04.csv'}
    {'Z:\DataSet\datafile05.csv'}

Save the datastore.

save ds_saved_on_Windows.mat ds

Load the datastore on a Linux platform and examine the Files property. Because the root path 'Z:\DataSet' is not accessible on the Linux cluster, the datastore automatically updates the root paths at load time, based on the values specified in the 'AlternateFileSystemRoots' parameter. The Files property of the datastore now contains the updated root paths for your data on the Linux cluster.

load ds_saved_on_Windows.mat
ds.Files

ans =

  5×1 cell array

    {'/nfs-bldg001/DataSet/datafile01.csv'}
    {'/nfs-bldg001/DataSet/datafile02.csv'}
    {'/nfs-bldg001/DataSet/datafile03.csv'}
    {'/nfs-bldg001/DataSet/datafile04.csv'}
    {'/nfs-bldg001/DataSet/datafile05.csv'}

You can now process and analyze this datastore on your Linux machine.
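To make the effect of the root update concrete, the sketch below reproduces the path mapping by hand using string functions. It is an illustration only and is not how the datastore performs the update internally; the variable names (winFiles, relPaths, linuxFiles) are hypothetical, and the file names mirror the example output above.

```matlab
% Illustration only: mimic the root substitution with plain string operations.
winRoot   = "Z:\DataSet";
linuxRoot = "/nfs-bldg001/DataSet";

% Rebuild the five example file names from the Windows listing.
winFiles = winRoot + "\datafile0" + (1:5)' + ".csv";

% Drop the Windows root, switch the separators, and prepend the Linux root.
relPaths   = extractAfter(winFiles, strlength(winRoot));
relPaths   = replace(relPaths, "\", "/");
linuxFiles = linuxRoot + relPaths;

disp(linuxFiles)
```

After loading the real datastore on the Linux machine, a quick sanity check such as all(isfile(ds.Files)) can confirm that every updated path resolves to an existing file.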

Process Datastore Using Parallel and Distributed Computing

To process your datastore with parallel and distributed computing that involves different platforms, cloud machines, or cluster machines, you must predefine the 'AlternateFileSystemRoots' parameter. This example demonstrates how to create a datastore on your local machine, analyze a small portion of the data, and then use Parallel Computing Toolbox™ and MATLAB® Parallel Server™ to scale up the analysis to the entire dataset.

Create a datastore and assign a value to the 'AlternateFileSystemRoots' property. To set the value for the 'AlternateFileSystemRoots' property, identify the root paths for your data on the different platforms. The root paths differ based on the machine or file system. For example, identify the root paths for data access from your machine and your cluster:

"Z:\DataSet" from your local Windows machine.
"/nfs-bldg001/DataSet" from the MATLAB Parallel Server Linux cluster.

Then, associate these root paths using the 'AlternateFileSystemRoots' property.

altRoots = ["Z:\DataSet","/nfs-bldg001/DataSet"];
ds = tabularTextDatastore('Z:\DataSet','AlternateFileSystemRoots',altRoots);

Analyze a small portion of the data on your local machine. For instance, get a partitioned subset of the data, clean the data by removing any missing entries, and examine a plot of the variables.

tt = tall(partition(ds,100,1));
summary(tt);
% analyze your data
tt = rmmissing(tt);
plot(tt.MyVar1,tt.MyVar2)

Scale up your analysis to the entire dataset by using a MATLAB Parallel Server cluster (in this example, the Linux cluster). For instance, start a worker pool using the cluster profile, and then perform the analysis on the entire dataset by using parallel and distributed computing capabilities.

parpool('MyMjsProfile')
tt = tall(ds);
summary(tt);
% analyze your data
tt = rmmissing(tt);
plot(tt.MyVar1,tt.MyVar2)
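Once the analysis finishes, you may want to bring a small result back into local memory and release the cluster workers. The following is a minimal sketch, assuming that tt is the tall table from the step above and that the placeholder variable MyVar1 is numeric; the name avgVar1 is hypothetical.

```matlab
% Assumption: tt is the tall table created above and MyVar1 is numeric.
% The guard lets this snippet run harmlessly when tt is not defined.
if exist('tt','var') && istall(tt)
    % gather triggers the deferred tall computation and returns the
    % result as an ordinary in-memory value.
    avgVar1 = gather(mean(tt.MyVar1));
end

% Shut down the current parallel pool, if one is open.
pool = gcp('nocreate');   % returns empty when no pool exists
if ~isempty(pool)
    delete(pool);
end
```

Closing the pool explicitly frees the cluster workers for other jobs instead of leaving them reserved until the pool times out.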