
Analyze Big Data in MATLAB Using Tall Arrays

This example shows how to use tall arrays to work with big data in MATLAB®. You can use tall arrays to perform a variety of calculations on different types of data that does not fit in memory. These include basic calculations, as well as machine learning algorithms within Statistics and Machine Learning Toolbox™.

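This example operates on a small subset of the data on a single computer, and then scales up to analyze the entire data set. The same analysis technique can scale further, to data sets that are too large to be read into memory and to systems such as Apache Spark™.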

Introduction to Tall Arrays

Tall arrays and tall tables are used to work with out-of-memory data that has any number of rows. Instead of writing specialized code that takes into account the huge size of the data, tall arrays and tables let you work with large data sets in a manner similar to in-memory MATLAB® arrays. The difference is that tall arrays typically remain unevaluated until you request that the calculations be performed.

This deferred evaluation enables MATLAB to combine the queued calculations where possible and take the minimum number of passes through the data. Since the number of passes through the data greatly affects execution time, it is recommended that you request output only when necessary.

Create Datastore for Collection of Files

Creating a datastore enables you to access a collection of data. A datastore can process arbitrarily large amounts of data, and the data can even be spread across multiple files in multiple folders. You can create a datastore for a collection of tabular text files (demonstrated here), spreadsheets, images, a SQL database (Database Toolbox™ required), or Hadoop® sequence files.
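As a minimal sketch of a datastore that spans several files, the following writes two small CSV files to a temporary folder and reads them back through a single datastore. The folder, the file names (part1.csv, part2.csv), and the variable names A and B are hypothetical and exist only for this illustration; they are not part of the airline example.

```matlab
% Sketch: one datastore backed by two CSV files (hypothetical file names).
folder = tempname;                      % unique temporary folder name
mkdir(folder);

% Write two small tables with the same variables to separate files.
t1 = table([1;2],[10;20],'VariableNames',{'A','B'});
t2 = table([3;4],[30;40],'VariableNames',{'A','B'});
writetable(t1,fullfile(folder,'part1.csv'));
writetable(t2,fullfile(folder,'part2.csv'));

% A wildcard location makes the datastore pick up every matching file.
dsParts = datastore(fullfile(folder,'*.csv'));
nFiles = numel(dsParts.Files);          % number of files backing the datastore

% readall is acceptable here only because the data is tiny.
allRows = readall(dsParts);
nRows = height(allRows);
```

For data that does not fit in memory, you would avoid readall and instead pass the datastore to tall, as the rest of this example does.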

Create a datastore for a .csv file containing airline data. Treat 'NA' values as missing so that datastore replaces them with NaN values. Select the variables of interest, and specify a categorical data type for the Origin and Dest variables. Preview the contents.

ds = datastore('airlinesmall.csv');
ds.TreatAsMissing = 'NA';
ds.SelectedVariableNames = {'Year','Month','ArrDelay','DepDelay','Origin','Dest'};
ds.SelectedFormats(5:6) = {'%C','%C'};
pre = preview(ds)

pre=8×6 table
    Year    Month    ArrDelay    DepDelay    Origin    Dest
    ____    _____    ________    ________    ______    ____
    1987     10          8          12        LAX      SJC
    1987     10          8           1        SJC      BUR
    1987     10         21          20        SAN      SMF
    1987     10         13          12        BUR      SJC
    1987     10          4          -1        SMF      LAX
    1987     10         59          63        LAX      SJC
    1987     10          3          -2        SAN      SFO
    1987     10         11          -1        SEA      LAX

Create Tall Array

Tall arrays are similar to in-memory MATLAB arrays, except that they can have any number of rows. Tall arrays can contain data that is numeric, logical, datetime, duration, calendarDuration, categorical, or strings. Also, you can convert any in-memory array to a tall array. (The in-memory array A must be one of the supported data types.)
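The conversion from an in-memory array can be sketched as follows. The matrix A and its values are illustrative only, and a small array like this gains nothing from being tall; the point is just to show the round trip.

```matlab
% Sketch: convert a small in-memory matrix to a tall array and back.
A = [1 2; 3 4; 5 6];      % illustrative in-memory numeric array
tA = tall(A);             % tall array backed by the in-memory data

isTallArray = istall(tA); % true for tall arrays

% gather brings the data back into memory as an ordinary array.
B = gather(tA);
```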

The underlying class of a tall array is based on the type of datastore that backs it. For example, if the datastore ds contains tabular data, then tall(ds) returns a tall table containing the data.

tt = tall(ds)
tt = Mx6 tall table
    Year    Month    ArrDelay    DepDelay    Origin    Dest
    ____    _____    ________    ________    ______    ____
     ?        ?         ?           ?          ?        ?
     ?        ?         ?           ?          ?        ?
     ?        ?         ?           ?          ?        ?
     :        :         :           :          :        :
     :        :         :           :          :        :

The display indicates the underlying data type and includes the first several rows of data. The size of the table displays as "Mx6" to indicate that MATLAB does not yet know how many rows of data there are.

Perform Calculations on Tall Arrays

You can work with tall arrays and tall tables in much the same way that you work with in-memory MATLAB arrays and tables.

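One important aspect of tall arrays is that, as you work with them, MATLAB does not perform most operations immediately. These operations appear to execute quickly because the actual computation is deferred until you specifically request output. This deferred evaluation is important because even a simple command like size(X) executed on a tall array with a billion rows is not a quick calculation.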

As you work with tall arrays, MATLAB keeps track of all of the operations to be carried out and optimizes the number of passes through the data. Thus, it is normal to work with unevaluated tall arrays and request output only when you require it. MATLAB does not know the contents or size of unevaluated tall arrays until you request that the array be evaluated and displayed.
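The behavior described above can be sketched with a small tall array built from in-memory data. The vector x and its values are made up for illustration; the same pattern applies to tall arrays backed by a datastore.

```matlab
% Sketch: operations on a tall array are queued, not executed immediately.
x = tall([3; 1; 4; 1; 5]);  % illustrative data

y = 2*x + 1;                % queued element-wise operation
s = sum(y);                 % queued reduction

stillTall = istall(s);      % s is still a tall array at this point

% Evaluation happens only when output is requested.
sValue = gather(s);         % 2*(3+1+4+1+5) + 5 = 33
```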

Calculate the mean departure delay.

mDep = mean(tt.DepDelay,'omitnan')

mDep =
  tall double
    ?

Gather Results into Workspace

The benefit of deferred evaluation is that when the time comes for MATLAB to perform the calculations, it is often possible to combine the operations in such a way that the number of passes through the data is minimized. So, even if you perform many operations, MATLAB only makes extra passes through the data when absolutely necessary.

The gather function forces evaluation of all queued operations and brings the resulting output back into memory. Since gather returns the entire result in MATLAB, you should make sure that the result will fit in memory. For example, use gather on tall arrays that are the result of a function that reduces the size of the tall array, such as min, mean, and so on.
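A minimal sketch of gathering several reduced results in one call is shown below. The data values are illustrative; each result is a scalar, so the gathered output fits in memory regardless of how many rows the tall array has.

```matlab
% Sketch: gather several size-reducing results with a single call.
x = tall([4; 8; 15; 16; 23; 42]);   % illustrative data

% Each of these is an unevaluated tall scalar.
lo  = min(x);
hi  = max(x);
avg = mean(x);

% One gather call evaluates all three results together.
[lo,hi,avg] = gather(lo,hi,avg);
```

Passing all of the results to one gather call, rather than calling gather three times, lets MATLAB combine the queued operations when it evaluates them.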

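Use gather to calculate the mean departure delay and bring the answer into memory. This calculation requires a single pass through the data, but other calculations might require several passes through the data. MATLAB determines the optimal number of passes for the calculation and displays this information at the command line.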

mDep = gather(mDep)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.85 sec
- Pass 2 of 2: Completed in 0.81 sec
Evaluation completed in 2 sec
mDep = 8.1860

Select Subset of Tall Array

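You can extract values from a tall array by subscripting or indexing. You can index the array starting from the top or bottom, or by using a logical index. The functions head and tail are useful alternatives to indexing, enabling you to explore the first and last portions of a tall array. Gather both variables at the same time to avoid extra passes through the data.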

h = head(tt);
tl = tail(tt);
[h,tl] = gather(h,tl)

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 0.79 sec
Evaluation completed in 0.93 sec

h=8×6 table
    Year    Month    ArrDelay    DepDelay    Origin    Dest
    ____    _____    ________    ________    ______    ____
    1987     10          8          12        LAX      SJC
    1987     10          8           1        SJC      BUR
    1987     10         21          20        SAN      SMF
    1987     10         13          12        BUR      SJC
    1987     10          4          -1        SMF      LAX
    1987     10         59          63        LAX      SJC
    1987     10          3          -2        SAN      SFO
    1987     10         11          -1        SEA      LAX

tl=8×6 table
    Year    Month    ArrDelay    DepDelay    Origin    Dest
    ____    _____    ________    ________    ______    ____
    2008     12         14           1        DAB      ATL
    2008     12         -8          -1        ATL      TPA
    2008     12          1           9        ATL      CLT
    2008     12         -8          -4        ATL      CLT
    2008     12         15          -2        BOS      LGA
    2008     12        -15          -1        SFO      ATL
    2008     12        -12           1        DAB      ATL
    2008     12         -1          11        ATL      IAD

Use head to select a subset of 10,000 rows from the data before scaling to the full data set.

ttSubset = head(tt,10000);

Select Data by Condition

You can use typical logical operations on tall arrays, which are useful for selecting relevant data or removing outliers with logical indexing. The logical expression creates a tall logical vector, which then is used to subscript, identifying the rows where the condition is true.

Select only the flights out of Boston by comparing the elements of the categorical variable Origin to the value 'BOS'.

idx = (ttSubset.Origin == 'BOS');
bosflights = ttSubset(idx,:)

bosflights = 207x6 tall table
    Year    Month    ArrDelay    DepDelay    Origin    Dest
    ____    _____    ________    ________    ______    ____
    1987     10         -8           0        BOS      LGA
    1987     10        -13          -1        BOS      LGA
    1987     10         12          11        BOS      BWI
    1987     10         -3           0        BOS      EWR
    1987     10         -5           0        BOS      ORD
    1987     10         31          19        BOS      PHL
    1987     10         -3           0        BOS      CLE
    1987     11          5           5        BOS      STL
     :        :         :           :          :        :
     :        :         :           :          :        :

You can use the same indexing technique to remove rows with missing data or NaN values from the tall array.

idx = any(ismissing(ttSubset),2);
ttSubset(idx,:) = [];

Determine Largest Delays

Due to the nature of big data, sorting all of the data using traditional methods like sort or sortrows is inefficient. However, the topkrows function for tall arrays returns the top k rows in sorted order.

Calculate the top 10 greatest departure delays.

biggestDelays = topkrows(ttSubset,10,'DepDelay');
biggestDelays = gather(biggestDelays)
Evaluating tall expression using the Local MATLAB Session: Evaluation completed in 0.085 sec
biggestDelays=10×6 table
    Year    Month    ArrDelay    DepDelay    Origin    Dest
    ____    _____    ________    ________    ______    ____
    1988      3        772         785        ORD      LEX
    1989      3        453         447        MDT      ORD
    1988     12        397         425        SJU      BWI
    1987     12        339         360        DEN      STL
    1988      3        261         273        PHL      ROC
    1988      7        261         268        BWI      PBI
     …        2        257         253        ORD      BTV
    1988      3        236         240        EWR      FLL
    1989      2        263         227        BNA      MOB
    1989      6        224         224        DFW      JAX

Visualize Data in Tall Arrays

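Plotting every point in a big data set is not feasible. For that reason, visualization of tall arrays involves reducing the number of data points using sampling or binning.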
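Binning can be sketched without a plot by counting values per bin with histcounts, which is assumed here to accept tall inputs in the same way as the plotting functions used in this example. The years vector and the bin edges are illustrative.

```matlab
% Sketch: reduce a tall array to per-bin counts before visualizing.
years = tall([1987; 1987; 1988; 1989; 1989; 1989]);  % illustrative data

% Integer-centered bin edges, one bin per year.
edges = 1986.5:1:1989.5;

% The result has one count per bin, so it is small enough to gather.
counts = gather(histcounts(years,edges));
```

The gathered counts vector has one element per bin regardless of how many rows the tall array contains, which is what makes a histogram practical for big data.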

Visualize the number of flights per year with a histogram. The visualization functions pass through the data and immediately evaluate the solution when you call them, so gather is not required.

histogram(ttSubset.Year,'BinMethod','integers')
Evaluating tall expression using the Local MATLAB Session: Evaluation completed in 0.56 sec
xlabel('Year')
ylabel('Number of Flights')
title('Number of Flights by Year, 1987 - 1989')

Scale to Entire Data Set

Instead of using the smaller data returned from head, you can scale up to perform the calculations on the entire data set by using the results from tall(ds).

tt = tall(ds);
idx = any(ismissing(tt),2);
tt(idx,:) = [];
mnDelay = mean(tt.DepDelay,'omitnan');
biggestDelays = topkrows(tt,10,'DepDelay');
[mnDelay,biggestDelays] = gather(mnDelay,biggestDelays)

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.58 sec
- Pass 2 of 2: Completed in 0.88 sec
Evaluation completed in 1.6 sec

mnDelay = 8.1310

biggestDelays=10×6 table
    Year    Month    ArrDelay    DepDelay    Origin    Dest
    ____    _____    ________    ________    ______    ____
    1991      3         -8         1438       MCO      BWI
    1998     12        -12         1433       CVG      ORF
    1995     11       1014         1014       HNL      LAX
    2007      4        914          924       JFK      DTW
    2001      4        887          884       MCO      DTW
    2008      …          …          855        …        …
    1988      3        772          785       ORD      LEX
    2008      4        710          713       EWR      RDU
    1998     10        679          673       MCI      DFW
    2006      6        603          626       ABQ      PHX

histogram(tt.Year,'BinMethod','integers')

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 1.6 sec
- Pass 2 of 2: Completed in 0.79 sec
Evaluation completed in 2.7 sec

xlabel('Year')
ylabel('Number of Flights')
title('Number of Flights by Year, 1987 - 2008')

Use histogram2 to further break down the number of flights by month for the whole data set. Since the bins for Month and Year are known ahead of time, specify the bin edges to avoid an extra pass through the data.

year_edges = 1986.5:2008.5;
month_edges = 0.5:12.5;
histogram2(tt.Year,tt.Month,year_edges,month_edges,'DisplayStyle','tile')

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 1.3 sec
Evaluation completed in 1.3 sec

colorbar
xlabel('Year')
ylabel('Month')
title('Airline Flights by Month and Year, 1987 - 2008')

Data Analytics and Machine Learning with Tall Arrays

You can perform more sophisticated statistical analysis on tall arrays, including calculating predictive analytics and performing machine learning, using the functions in Statistics and Machine Learning Toolbox™.

For more information, see Tall Array Support, Usage Notes, and Limitations (Statistics and Machine Learning Toolbox).

Scale to Big Data Systems

A key capability of tall arrays in MATLAB is the connectivity to big data platforms, such as computing clusters and Apache Spark™.
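The execution environment for tall arrays is selected with the mapreducer function. The sketch below only selects the local MATLAB session, which needs no additional products; the data is illustrative. Pointing tall arrays at a parallel pool or a Spark cluster follows the same pattern with a different mapreducer configuration, and requires the additional products listed at the end of this section.

```matlab
% Sketch: choose where tall array calculations run.
% mapreducer(0) selects the local MATLAB session.
mapreducer(0);

x = tall((1:10)');          % illustrative data
total = gather(sum(x));     % evaluated in the local session

% With Parallel Computing Toolbox installed, mapreducer(gcp) would
% instead direct tall array evaluation to the current parallel pool.
```

Because the tall array code itself does not change, you can prototype locally and switch the execution environment later.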

This example only scratches the surface of what is possible with tall arrays for big data. See Extend Tall Arrays with Other Products for more information about using:

  • Statistics and Machine Learning Toolbox™

  • Database Toolbox™

  • Parallel Computing Toolbox™

  • MATLAB® Parallel Server™

  • MATLAB Compiler™
