Read and Analyze Hadoop Sequence File
This example shows how to create a datastore for a Sequence file containing key-value data, and then read and process the data one block at a time. Sequence files are outputs of mapreduce operations that use Hadoop®.
Set the appropriate environment variable to the location where Hadoop is installed. In this case, set the MATLAB_HADOOP_INSTALL environment variable.
setenv('MATLAB_HADOOP_INSTALL','/mypath/hadoop-folder')
Here, hadoop-folder is the folder where Hadoop is installed and mypath is the path to that folder.
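Before creating the datastore, it can help to confirm that the variable is visible to the current MATLAB session. The following is a minimal sketch, not part of the original example: it reuses the placeholder path /mypath/hadoop-folder from above (not a real location) and assumes a release that provides isfolder (R2017b or later).

```matlab
% Sketch: confirm the Hadoop location is visible to this MATLAB session.
% '/mypath/hadoop-folder' is the placeholder path used above, not a real location.
setenv('MATLAB_HADOOP_INSTALL','/mypath/hadoop-folder')

hadoopHome = getenv('MATLAB_HADOOP_INSTALL');   % empty if the variable is unset
if isempty(hadoopHome)
    disp('MATLAB_HADOOP_INSTALL is not set.')
elseif ~isfolder(hadoopHome)
    % The variable is set, but it does not point to an existing folder.
    disp(['MATLAB_HADOOP_INSTALL is set, but no folder exists at: ' hadoopHome])
else
    disp(['Hadoop folder found at: ' hadoopHome])
end
```

With the placeholder path, the second branch is the expected one on most machines. Substitute your actual installation folder to reach the last branch.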
Create a datastore from the sample file, mapredout.seq, using the datastore function. The sample file contains unique keys representing airline carrier codes and corresponding values that represent the number of flights operated by that carrier.
ds = datastore('mapredout.seq')
ds =
  KeyValueDatastore with properties:

       Files: {
              ' ...\matlab\toolbox\matlab\demos\mapredout.seq'
              }
    ReadSize: 1 key-value pairs
    FileType: 'seq'
datastore returns a KeyValueDatastore. The datastore function automatically determines the appropriate type of datastore to create.
Set the ReadSize property to six so that each call to read reads at most six key-value pairs.
ds.ReadSize = 6;
Read subsets of the data from ds using the read function in a while loop. For each subset of data, compute the sum of the values. Store the sum for each subset in an array named sums. The while loop executes until hasdata(ds) returns false.
sums = [];
while hasdata(ds)
    T = read(ds);
    T.Value = cell2mat(T.Value);
    sums(end+1) = sum(T.Value);
end
View the last subset of key-value pairs read.
T
T =
      Key       Value
    ________    _____
    'WN'        15931
    'XE'         2357
    'YV'          849
    'ML (1)'       69
    'PA (1)'      318
Compute the total number of flights operated by all carriers.
numflights = sum(sums)
numflights = 123523
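As a sanity check on the block-by-block approach, the same total can be computed by reading the whole file at once with readall. This is a sketch that assumes the sample file fits in memory and that the Value variable is returned as a cell array of numeric scalars, as the cell2mat call in the loop above implies. The names blockTotal, allData, fullTotal, and agree are illustrative and do not appear in the original example.

```matlab
% Sketch: compare a block-by-block total with a single full read.
% Assumes mapredout.seq is on the MATLAB path and fits in memory.
ds = datastore('mapredout.seq');
ds.ReadSize = 6;                          % at most six key-value pairs per read

% Block-by-block: accumulate a running total instead of storing each sum.
blockTotal = 0;
while hasdata(ds)
    T = read(ds);                         % table with Key and Value variables
    blockTotal = blockTotal + sum(cell2mat(T.Value));
end

% Full read: rewind the datastore, then load every pair in one call.
reset(ds);
allData = readall(ds);
fullTotal = sum(cell2mat(allData.Value));

% The two totals should agree.
agree = isequal(blockTotal, fullTotal);
```

For files too large to hold in memory, readall is not an option, which is why the example processes the data one block at a time.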
See Also
datastore | KeyValueDatastore | mapreduce | tall