Read and Analyze Hadoop Sequence File

This example shows how to create a datastore for a Sequence file containing key-value data, and then read and process the data one block at a time. Sequence files are outputs of mapreduce operations that use Hadoop®.

Set the appropriate environment variable to the location where Hadoop is installed. In this case, set the MATLAB_HADOOP_INSTALL environment variable.

setenv('MATLAB_HADOOP_INSTALL','/mypath/hadoop-folder')

hadoop-folder is the folder where Hadoop is installed and mypath is the path to that folder.
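As a quick sanity check before creating the datastore, you can confirm that the variable is set and points to an existing folder. The following is a minimal sketch, not part of the original example. The variable name hadoopHome and the messages are illustrative assumptions.

```matlab
% Sketch: verify that MATLAB_HADOOP_INSTALL is set and points to a folder.
% hadoopHome is a hypothetical variable name used only for this check.
hadoopHome = getenv('MATLAB_HADOOP_INSTALL');   % empty if the variable is not set
if isempty(hadoopHome)
    warning('MATLAB_HADOOP_INSTALL is not set.');
elseif exist(hadoopHome,'dir') ~= 7             % exist returns 7 for folders
    warning('MATLAB_HADOOP_INSTALL does not point to a folder: %s', hadoopHome);
else
    fprintf('Using Hadoop installation at %s\n', hadoopHome);
end
```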

Create a datastore from the sample file, mapredout.seq, using the datastore function. The sample file contains unique keys representing airline carrier codes and corresponding values that represent the number of flights operated by that carrier.

ds = datastore('mapredout.seq')
ds = 

  KeyValueDatastore with properties:

       Files: {
              ' ...\matlab\toolbox\matlab\demos\mapredout.seq'
              }
    ReadSize: 1 key-value pairs
    FileType: 'seq'

datastore returns a KeyValueDatastore. The datastore function automatically determines the appropriate type of datastore to create.
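If you prefer not to rely on automatic detection, you can request the datastore type explicitly and confirm the class of the result. This is a sketch that assumes the 'Type' name-value argument with the value 'keyvalue' is available in your release. The variable names dsExplicit and isKeyValue are illustrative.

```matlab
% Sketch: request a key-value datastore explicitly instead of relying on
% automatic type detection, then confirm the class of the returned object.
dsExplicit = datastore('mapredout.seq','Type','keyvalue');
isKeyValue = isa(dsExplicit,'matlab.io.datastore.KeyValueDatastore');
disp(isKeyValue)
```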

Set the ReadSize property to six so that each call to read reads at most six key-value pairs.

ds.ReadSize = 6;

Read subsets of the data from ds using the read function in a while loop. For each subset of data, compute the sum of the values. Store the sum for each subset in an array named sums. The while loop executes until hasdata(ds) returns false.

sums = [];
while hasdata(ds)
    T = read(ds);
    T.Value = cell2mat(T.Value);
    sums(end+1) = sum(T.Value);
end

View the last subset of key-value pairs read.

T
T = 

      Key       Value
    ________    _____

    'WN'        15931
    'XE'         2357
    'YV'          849
    'ML (1)'       69
    'PA (1)'      318
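Because T still holds the last subset, its values should add up to the last element of sums. The sketch below reproduces that arithmetic using the five values copied from the display above, so it runs without the datastore. The names lastValues and lastSum are illustrative.

```matlab
% Sketch: recompute the sum for the last subset from the displayed values.
% After the while loop in the example, sums(end) should hold the same number.
lastValues = [15931 2357 849 69 318];   % Value column of the last subset
lastSum = sum(lastValues)               % 19524
```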

Compute the total number of flights operated by all carriers.

numflights = sum(sums)
numflights = 123523
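When the data fits in memory, you could cross-check this total by reading everything at once with readall instead of looping over blocks. This is a sketch of an alternative, not the approach the example uses. The variable names dsAll, Tall, and total are illustrative.

```matlab
% Sketch: read all key-value pairs in one call and sum the values.
% Suitable only when the whole file fits in memory.
dsAll = datastore('mapredout.seq');
Tall = readall(dsAll);                 % table with Key and Value variables
total = sum(cell2mat(Tall.Value))      % should match numflights above
```

The block-wise loop remains the better choice for large Sequence files, because it keeps only ReadSize key-value pairs in memory at a time.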
