
Histograms of Tall Arrays

This example shows how to use histogram and histogram2 to analyze and visualize data contained in a tall array.

Create Tall Table

Create a datastore using the airlinesmall.csv data set. Treat 'NA' values as missing data so that they are replaced with NaN values. Select a subset of the variables to work with. Convert the datastore into a tall table.

varnames = {'ArrDelay','DepDelay','Year','Month'};
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA',...
    'SelectedVariableNames',varnames);
T = tall(ds)
T =

  Mx4 tall table

    ArrDelay    DepDelay    Year    Month
    ________    ________    ____    _____

        8          12       1987     10
        8           1       1987     10
       21          20       1987     10
       13          12       1987     10
        4          -1       1987     10
       59          63       1987     10
        3          -2       1987     10
       11          -1       1987     10
        :           :        :        :
        :           :        :        :
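
To see how NaN placeholders behave once data is tall, the following minimal sketch builds a small tall array from in-memory values rather than from the datastore. The values in delays are made up for illustration and are not taken from airlinesmall.csv.

```matlab
% Hypothetical sample data: NaN stands in for an 'NA' entry in the file.
delays = [8; NaN; 21; 13; NaN];
tDelays = tall(delays);            % tall array backed by in-memory data

% Operations on tall arrays are deferred until gather is called.
nMissing  = gather(sum(isnan(tDelays)));       % number of missing delays
meanDelay = gather(mean(tDelays,'omitnan'));   % mean of the non-missing delays
```

With this sample data, nMissing is 2 and meanDelay is 14. The same pattern applies to T.ArrDelay, where the evaluation reads the file in passes instead of from memory.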

Plot Histogram of Arrival Delays

Plot a histogram of the ArrDelay variable to examine the frequency distribution of arrival delays.

h = histogram(T.ArrDelay);
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.98 sec
- Pass 2 of 2: Completed in 0.47 sec
Evaluation completed in 2.2 sec
title('Flight arrival delays, 1987 - 2008')
xlabel('Arrival Delay (minutes)')
ylabel('Frequency')

The arrival delay is most frequently a small number near 0, so these values dominate the plot and make it difficult to see other details.

Adjust Bin Limits of Histogram

Restrict the histogram bin limits to plot only arrival delays between -50 and 150 minutes. After you create a histogram object from a tall array, you cannot change any properties that would require recomputing the bins, including BinWidth and BinLimits. Also, you cannot use morebins or fewerbins to adjust the number of bins. In these cases, use histogram to reconstruct the histogram from the raw data in the tall array.

figure
histogram(T.ArrDelay,'BinLimits',[-50,150])
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.89 sec
- Pass 2 of 2: Completed in 0.48 sec
Evaluation completed in 1.8 sec
title('Flight arrival delays between -50 and 150 minutes, 1987 - 2008')
xlabel('Arrival Delay (minutes)')
ylabel('Frequency')

From this plot, it appears that long delays might be more common than initially expected. To investigate further, find the probability of an arrival delay that is one hour or greater.
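
Restricting the bin limits also drops every observation outside those limits from the counts. The following in-memory sketch illustrates this with histcounts and explicit bin edges. The data in x and the 50-minute bin width are assumptions chosen for illustration, not values used by the example.

```matlab
% Hypothetical delays in minutes; two of them lie outside [-50, 150].
x = [-60 -10 0 5 20 149 150 400];
edges = -50:50:150;                  % four bins spanning the restricted limits

counts = histcounts(x,edges);        % values outside the edges are not counted
nExcluded = numel(x) - sum(counts);  % observations dropped by the bin limits
```

Here counts is [1 3 0 2] and nExcluded is 2. Each bin includes its left edge, and the last bin also includes its right edge, so the value 150 is counted.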

Probability of Delays One Hour or Greater

The original histogram returned an object h that contains the bin values in the Values property and the bin edges in the BinEdges property. You can use these properties to perform in-memory calculations.

Determine which bins contain arrival delays of one hour (60 minutes) or more. Remove the last bin edge from the logical index vector so that it is the same length as the vector of bin values.

idx = h.BinEdges >= 60;
idx(end) = [];

Use idx to retrieve the value associated with each selected bin. Add the bin values together, divide by the total number of samples, and multiply by 100 to determine the overall probability, as a percentage, of a delay greater than or equal to one hour. Because the total number of samples is computed from the original data set, use gather to explicitly evaluate the calculation and return an in-memory scalar.

N = numel(T.ArrDelay);
P = gather(sum(h.Values(idx))*100/N)
P = 4.4809
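
The edge-to-bin indexing above can be traced on a small in-memory case. In this sketch, binEdges and binValues are hypothetical stand-ins for h.BinEdges and h.Values, and N is taken as the sum of the bin counts because every sample is assumed to fall inside the bins.

```matlab
% Hypothetical histogram: four bins with edges at 0, 30, 60, 90, 120.
binEdges  = [0 30 60 90 120];
binValues = [50 30 15 5];       % counts per bin

idx = binEdges >= 60;           % [0 0 1 1 1], one entry per edge
idx(end) = [];                  % [0 0 1 1], one entry per bin (left edges)

N = sum(binValues);             % all samples fall inside the bins here
P = sum(binValues(idx))*100/N   % percentage of samples at 60 or above
```

This gives P = 20. The logical index selects bins whose left edge is 60 or greater, so a bin that straddles 60 would be left out entirely.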

Overall, the probability of an arrival delay of one hour or longer is about 4.5%.

Plot Bivariate Histogram of Delays by Month

Plot a bivariate histogram of the arrival delays that are 60 minutes or longer by month. This plot examines how seasonality affects arrival delay.

figure
h2 = histogram2(T.Month,T.ArrDelay,[12 50],'YBinLimits',[60 1100],...
    'Normalization','probability','FaceColor','flat');
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 0.85 sec
Evaluation completed in 1 sec
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 1.3 sec
Evaluation completed in 1.5 sec
title('Probability of arrival delays 1 hour or greater (by month)')
xlabel('Month (1-12)')
ylabel('Arrival Delay (minutes)')
zlabel('Probability')
xticks(1:12)
view(-126,23)
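
The layout of the bivariate bin values can be checked with a small in-memory sketch. It uses histcounts2 with explicit edges rather than histogram2, and the month and delay data are hypothetical. The point is only that rows of the result follow the X bins and columns follow the Y bins.

```matlab
% Hypothetical data: month numbers and arrival delays (minutes).
month = [1 1 2 2];
delay = [60 70 60 200];

xEdges = [0.5 1.5 2.5];          % one bin per month
yEdges = [60 100 300];           % two delay bins

% Rows of V follow the X (month) bins; columns follow the Y (delay) bins.
V = histcounts2(month,delay,xEdges,yEdges,'Normalization','probability');
rowSums = sum(V,2);              % fraction of all observations in each month's bins
```

Here V is [0.5 0; 0.25 0.25] and rowSums is [0.5; 0.5]. Summing along the second dimension collapses the delay bins and leaves one value per month.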

Delay Statistics by Month

Use the bivariate histogram object to calculate the probability of having an arrival delay one hour or greater in each month, and the mean arrival delay for each month. Put the results in a table with the variable P containing the probability information and the variable MeanByMonth containing the mean arrival delay.

monthNames = {'Jan','Feb','Mar','Apr','May','Jun',...
    'Jul','Aug','Sep','Oct','Nov','Dec'}';
G = findgroups(T.Month);
M = splitapply(@(x) mean(x,'omitnan'),T.ArrDelay,G);
delayByMonth = table(monthNames, sum(h2.Values,2)*100, gather(M),...
    'VariableNames',{'Month','P','MeanByMonth'})
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.58 sec
- Pass 2 of 2: Completed in 1.4 sec
Evaluation completed in 2.8 sec
delayByMonth=12×3 table
     Month       P       MeanByMonth
    _______    ______    ___________

    {'Jan'}    9.6497      8.5954
    {'Feb'}    7.7058      7.3275
    {'Mar'}    9.0543      7.5536
    {'Apr'}    7.2504      6.0081
    {'May'}    7.4256      5.2949
    {'Jun'}     10.35      10.264
    {'Jul'}    10.228      8.7797
    {'Aug'}    8.5989      7.4522
    {'Sep'}    5.4116      3.6308
    {'Oct'}     6.042      4.6059
    {'Nov'}    6.9002      5.2835
    {'Dec'}    11.384      10.571

The results indicate that flights in the holiday month of December have an 11.4% chance of being delayed an hour or more, but are delayed by 10.6 minutes on average. This is closely followed by the summer months of June and July, where there is about a 10% chance of being delayed an hour or more and the average delay is roughly 9 or 10 minutes.
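
The grouped mean used for MeanByMonth can be verified on a small in-memory case. The month and delay vectors below are hypothetical. The calls to findgroups and splitapply mirror the ones applied to the tall table, except that no gather is needed for in-memory data.

```matlab
% Hypothetical data: month numbers and arrival delays with missing values.
month = [1; 1; 2; 2; 2];
delay = [10; NaN; 5; 7; NaN];

G = findgroups(month);                             % group index per row: 1 1 2 2 2
M = splitapply(@(x) mean(x,'omitnan'),delay,G);    % mean delay per month, NaN ignored
```

This yields M = [10; 6]. Without the 'omitnan' flag, both group means would be NaN because each group contains a missing value.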
