Main Content

Grouped Statistics Calculations with Tall Arrays

This example shows how to calculate grouped statistics of a tall timetable containing power outage data. The example uses thegrouptransform,groupsummary, andgroupcountsfunctions to calculate various quantities of interest, such as the most common power outage cause in each region. Even though the raw data in this example only has about 1500 rows, you can use the techniques presented here on much larger data sets because no assumptions are made about the size of the data.

Create Datastore and Tall Timetable

The sample file,outages.csv, contains data representing electric utility outages in the United States. The file contains six columns:Region,OutageTime,Loss,Customers,RestorationTime, andCause.

Create a datastore for theoutages.csvfile. Use the"TextScanFormats"option to specify the kind of data each column contains: categorical ("%C"), floating-point numeric ("%f"), or datetime ("%D").

data_formats = ["%C","%D","%f","%f","%D","%C"]; ds = tabularTextDatastore("outages.csv","TextscanFormats", data_formats);

Create a tall table on top of the datastore, and convert the tall table into a tall timetable. TheOutageTimevariable is used for the row times since it is the first datetime or duration variable in the table.

T = tall(ds); T = table2timetable(T)
T = Mx5 tall timetable OutageTime Region Loss Customers RestorationTime Cause __________ ______ ____ _________ _______________ _____ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? : : : : : : : : : : : :

取代缺失的数据

Some of the rows in theRestorationTimevariable have missing times, represented byNaTvalues. Remove these rows from the table.

T = rmmissing(T,"DataVariables","RestorationTime");

For the numeric variables in the timetable, instead of removing rows with missing values, replace the missing values with the mean value for each region.

T = grouptransform(T,"Region","meanfill",["Loss","Customers"]);

Useismissingto confirm that no pieces of missing data remain in the table.

tf = any(ismissing(T),"all"); gather(tf)
Evaluating tall expression using the Local MATLAB Session: - Pass 1 of 4: Completed in 0.21 sec - Pass 2 of 4: Completed in 0.57 sec - Pass 3 of 4: Completed in 0.43 sec - Pass 4 of 4: Completed in 0.59 sec Evaluation completed in 2.7 sec
ans =logical0

Preview Data

Now that the data does not contain missing values, bring a small number of rows into memory to get an idea of what the data contains.

gather(head(T))
Evaluating tall expression using the Local MATLAB Session: - Pass 1 of 1: Completed in 0.29 sec Evaluation completed in 0.4 sec
ans=8×5 timetableOutageTime地区损失客户RestorationTime Cause ________________ _________ ______ __________ ________________ _______________ 2002-02-01 12:18 SouthWest 458.98 1.8202e+06 2002-02-07 16:50 winter storm 2003-02-07 21:15 SouthEast 289.4 1.4294e+05 2003-02-17 08:14 winter storm 2004-04-06 05:44 West 434.81 3.4037e+05 2004-04-06 06:10 equipment fault 2002-03-16 06:18 MidWest 186.44 2.1275e+05 2002-03-18 23:23 severe storm 2003-06-18 02:49 West 0 0 2003-06-18 10:54 attack 2004-06-20 14:39 West 231.29 1.5354e+05 2004-06-20 19:16 equipment fault 2002-06-06 19:28 West 311.86 1.5354e+05 2002-06-07 00:51 equipment fault 2003-07-16 16:23 NorthEast 239.93 49434 2003-07-17 01:12 fire

Mean Power Outage Duration by Region

Determine the mean power outage duration in each region usinggroupsummary. First, create a new variableOutageDurationin the table that contains the duration of each outage, found by subtracting the outage time from the restoration time. In the call togroupsummary, specify:

  • "Region"as the grouping variable

  • "mean"as the computation method

  • "OutageDuration"as the variable to operate on.

T.OutageDuration = T.RestorationTime - T.OutageTime; times = groupsummary(T,"Region","mean","OutageDuration")
times = Mx3 tall table Region GroupCount mean_OutageDuration ______ __________ ___________________ ? ? ? ? ? ? ? ? ? : : : : : :

Change the display format of the duration results to be in days, and gather the results into memory. The results show the mean outage duration in each region, as well as the number of reported outages in each region.

times.mean_OutageDuration.Format ="d"; times = gather(times)
Evaluating tall expression using the Local MATLAB Session: - Pass 1 of 2: Completed in 1.1 sec - Pass 2 of 2: Completed in 0.49 sec Evaluation completed in 2 sec
times=5×3 tableRegion GroupCount mean_OutageDuration _________ __________ ___________________ MidWest 138 34.135 days NorthEast 548 24.21 days SouthEast 379 1.7013 days SouthWest 25 2.4799 days West 349 28.061 days

Most Common Power Outage Causes by Region

Determine how often each power outage cause occurs in each region. Usegroupcountswith theCauseandRegionvariables as grouping variables. Gather the results into memory.

causes = groupcounts(T,["Cause","Region"]); causes = gather(causes)
Evaluating tall expression using the Local MATLAB Session: - Pass 1 of 2: Completed in 0.26 sec - Pass 2 of 2: Completed in 0.32 sec Evaluation completed in 0.8 sec
causes=43×3 tableCause Region GroupCount ________________ _________ __________ attack MidWest 12 attack NorthEast 135 attack SouthEast 19 attack West 126 earthquake NorthEast 1 earthquake West 1 energy emergency MidWest 19 energy emergency NorthEast 29 energy emergency SouthEast 79 energy emergency SouthWest 7 energy emergency West 46 equipment fault MidWest 9 equipment fault NorthEast 17 equipment fault SouthEast 40 equipment fault SouthWest 2 equipment fault West 85 ⋮

Each cause occurs several times in the table, so even though the table contains the correct data it is not in the proper format to see how often each cause occurs in each region. To improve the presentation of the data, unstack theGroupCountvariable so that each column corresponds to a region and each row corresponds to an outage cause.

RegionCauses = gather(unstack(causes,"GroupCount","Region"))
RegionCauses=10×6 table导致中西部东北东南西南西________________ _______ _________ _________ _________ ____ attack 12 135 19 NaN 126 earthquake NaN 1 NaN NaN 1 energy emergency 19 29 79 7 46 equipment fault 9 17 40 2 85 fire NaN 5 3 NaN 17 severe storm 30 139 132 6 22 thunder storm 31 102 54 6 7 unknown 4 10 3 NaN 4 wind 16 40 13 3 22 winter storm 17 70 36 1 19

Not all combinations of outage causes and regions are represented in the data, so the resulting table contains someNaNs. Fill in theNaNvalues with zeros.

RegionCauses = fillmissing(RegionCauses,"constant",{"",0,0,0,0,0})
RegionCauses=10×6 table导致中西部东北东南西南西________________ _______ _________ _________ _________ ____ attack 12 135 19 0 126 earthquake 0 1 0 0 1 energy emergency 19 29 79 7 46 equipment fault 9 17 40 2 85 fire 0 5 3 0 17 severe storm 30 139 132 6 22 thunder storm 31 102 54 6 7 unknown 4 10 3 0 4 wind 16 40 13 3 22 winter storm 17 70 36 1 19

Worst Power Outages in Each Region

Calculate the broadest customer impact for each power outage in each region.

WorstOutages = groupsummary(T,["Region","Cause"],"max","Customers"); WorstOutages = gather(WorstOutages)
Evaluating tall expression using the Local MATLAB Session: - Pass 1 of 2: Completed in 0.33 sec - Pass 2 of 2: Completed in 0.52 sec Evaluation completed in 1.2 sec
WorstOutages=43×4 tableRegion Cause GroupCount max_Customers _________ ________________ __________ _____________ MidWest attack 12 2.4403e+05 MidWest energy emergency 19 5.0376e+05 MidWest equipment fault 9 2.4403e+05 MidWest severe storm 30 3.972e+06 MidWest thunder storm 31 3.8233e+05 MidWest unknown 4 3.0879e+06 MidWest wind 16 2.8666e+05 MidWest winter storm 17 7.7697e+05 NorthEast attack 135 1.5005e+05 NorthEast earthquake 1 0 NorthEast energy emergency 29 1.5005e+05 NorthEast equipment fault 17 1.667e+05 NorthEast fire 5 4.5139e+05 NorthEast severe storm 139 1.0735e+06 NorthEast thunder storm 102 5.9689e+06 NorthEast unknown 10 2.4983e+06 ⋮

Combine the data in theRegionandCausevariables into a single categorical variable by briefly converting them into strings. Then, create a categorical histogram of the maximum number of affected customers for each cause in each region.

WorstOutages.RegionCause = categorical(string(WorstOutages.Region)+" ("+string(WorstOutages.Cause)+")"); histogram("Categories",WorstOutages.RegionCause,"BinCounts",WorstOutages.max_Customers,..."DisplayOrder","descend") ylabel("Max # Affected Customers")

See Also

||

Related Topics