This example shows how to calculate grouped statistics of a tall timetable containing power outage data. The example uses thegrouptransform
,groupsummary
, andgroupcounts
functions to calculate various quantities of interest, such as the most common power outage cause in each region. Even though the raw data in this example only has about 1500 rows, you can use the techniques presented here on much larger data sets because no assumptions are made about the size of the data.
The sample file,outages.csv
, contains data representing electric utility outages in the United States. The file contains six columns:Region
,OutageTime
,Loss
,Customers
,RestorationTime
, andCause
.
Create a datastore for theoutages.csv
file. Use the"TextScanFormats"
option to specify the kind of data each column contains: categorical ("%C"
), floating-point numeric ("%f"
), or datetime ("%D"
).
data_formats = ["%C","%D","%f","%f","%D","%C"]; ds = tabularTextDatastore("outages.csv","TextscanFormats", data_formats);
Create a tall table on top of the datastore, and convert the tall table into a tall timetable. TheOutageTime
variable is used for the row times since it is the first datetime or duration variable in the table.
T = tall(ds); T = table2timetable(T)
T = Mx5 tall timetable OutageTime Region Loss Customers RestorationTime Cause __________ ______ ____ _________ _______________ _____ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? : : : : : : : : : : : :
Some of the rows in theRestorationTime
variable have missing times, represented byNaT
values. Remove these rows from the table.
T = rmmissing(T,"DataVariables","RestorationTime");
For the numeric variables in the timetable, instead of removing rows with missing values, replace the missing values with the mean value for each region.
T = grouptransform(T,"Region","meanfill",["Loss","Customers"]);
Useismissing
to confirm that no pieces of missing data remain in the table.
tf = any(ismissing(T),"all"); gather(tf)
Evaluating tall expression using the Local MATLAB Session: - Pass 1 of 4: Completed in 0.21 sec - Pass 2 of 4: Completed in 0.57 sec - Pass 3 of 4: Completed in 0.43 sec - Pass 4 of 4: Completed in 0.59 sec Evaluation completed in 2.7 sec
ans =logical0
Now that the data does not contain missing values, bring a small number of rows into memory to get an idea of what the data contains.
gather(head(T))
Evaluating tall expression using the Local MATLAB Session: - Pass 1 of 1: Completed in 0.29 sec Evaluation completed in 0.4 sec
ans=8×5 timetableOutageTime地区损失客户RestorationTime Cause ________________ _________ ______ __________ ________________ _______________ 2002-02-01 12:18 SouthWest 458.98 1.8202e+06 2002-02-07 16:50 winter storm 2003-02-07 21:15 SouthEast 289.4 1.4294e+05 2003-02-17 08:14 winter storm 2004-04-06 05:44 West 434.81 3.4037e+05 2004-04-06 06:10 equipment fault 2002-03-16 06:18 MidWest 186.44 2.1275e+05 2002-03-18 23:23 severe storm 2003-06-18 02:49 West 0 0 2003-06-18 10:54 attack 2004-06-20 14:39 West 231.29 1.5354e+05 2004-06-20 19:16 equipment fault 2002-06-06 19:28 West 311.86 1.5354e+05 2002-06-07 00:51 equipment fault 2003-07-16 16:23 NorthEast 239.93 49434 2003-07-17 01:12 fire
Determine the mean power outage duration in each region usinggroupsummary
. First, create a new variableOutageDuration
in the table that contains the duration of each outage, found by subtracting the outage time from the restoration time. In the call togroupsummary
, specify:
"Region"
as the grouping variable
"mean"
as the computation method
"OutageDuration"
as the variable to operate on.
T.OutageDuration = T.RestorationTime - T.OutageTime; times = groupsummary(T,"Region","mean","OutageDuration")
times = Mx3 tall table Region GroupCount mean_OutageDuration ______ __________ ___________________ ? ? ? ? ? ? ? ? ? : : : : : :
Change the display format of the duration results to be in days, and gather the results into memory. The results show the mean outage duration in each region, as well as the number of reported outages in each region.
times.mean_OutageDuration.Format ="d"; times = gather(times)
Evaluating tall expression using the Local MATLAB Session: - Pass 1 of 2: Completed in 1.1 sec - Pass 2 of 2: Completed in 0.49 sec Evaluation completed in 2 sec
times=5×3 tableRegion GroupCount mean_OutageDuration _________ __________ ___________________ MidWest 138 34.135 days NorthEast 548 24.21 days SouthEast 379 1.7013 days SouthWest 25 2.4799 days West 349 28.061 days
Determine how often each power outage cause occurs in each region. Usegroupcounts
with theCause
andRegion
variables as grouping variables. Gather the results into memory.
causes = groupcounts(T,["Cause","Region"]); causes = gather(causes)
Evaluating tall expression using the Local MATLAB Session: - Pass 1 of 2: Completed in 0.26 sec - Pass 2 of 2: Completed in 0.32 sec Evaluation completed in 0.8 sec
causes=43×3 tableCause Region GroupCount ________________ _________ __________ attack MidWest 12 attack NorthEast 135 attack SouthEast 19 attack West 126 earthquake NorthEast 1 earthquake West 1 energy emergency MidWest 19 energy emergency NorthEast 29 energy emergency SouthEast 79 energy emergency SouthWest 7 energy emergency West 46 equipment fault MidWest 9 equipment fault NorthEast 17 equipment fault SouthEast 40 equipment fault SouthWest 2 equipment fault West 85 ⋮
Each cause occurs several times in the table, so even though the table contains the correct data it is not in the proper format to see how often each cause occurs in each region. To improve the presentation of the data, unstack theGroupCount
variable so that each column corresponds to a region and each row corresponds to an outage cause.
RegionCauses = gather(unstack(causes,"GroupCount","Region"))
RegionCauses=10×6 table导致中西部东北东南西南西________________ _______ _________ _________ _________ ____ attack 12 135 19 NaN 126 earthquake NaN 1 NaN NaN 1 energy emergency 19 29 79 7 46 equipment fault 9 17 40 2 85 fire NaN 5 3 NaN 17 severe storm 30 139 132 6 22 thunder storm 31 102 54 6 7 unknown 4 10 3 NaN 4 wind 16 40 13 3 22 winter storm 17 70 36 1 19
Not all combinations of outage causes and regions are represented in the data, so the resulting table contains someNaN
s. Fill in theNaN
values with zeros.
RegionCauses = fillmissing(RegionCauses,"constant",{"",0,0,0,0,0})
RegionCauses=10×6 table导致中西部东北东南西南西________________ _______ _________ _________ _________ ____ attack 12 135 19 0 126 earthquake 0 1 0 0 1 energy emergency 19 29 79 7 46 equipment fault 9 17 40 2 85 fire 0 5 3 0 17 severe storm 30 139 132 6 22 thunder storm 31 102 54 6 7 unknown 4 10 3 0 4 wind 16 40 13 3 22 winter storm 17 70 36 1 19
Calculate the broadest customer impact for each power outage in each region.
WorstOutages = groupsummary(T,["Region","Cause"],"max","Customers"); WorstOutages = gather(WorstOutages)
Evaluating tall expression using the Local MATLAB Session: - Pass 1 of 2: Completed in 0.33 sec - Pass 2 of 2: Completed in 0.52 sec Evaluation completed in 1.2 sec
WorstOutages=43×4 tableRegion Cause GroupCount max_Customers _________ ________________ __________ _____________ MidWest attack 12 2.4403e+05 MidWest energy emergency 19 5.0376e+05 MidWest equipment fault 9 2.4403e+05 MidWest severe storm 30 3.972e+06 MidWest thunder storm 31 3.8233e+05 MidWest unknown 4 3.0879e+06 MidWest wind 16 2.8666e+05 MidWest winter storm 17 7.7697e+05 NorthEast attack 135 1.5005e+05 NorthEast earthquake 1 0 NorthEast energy emergency 29 1.5005e+05 NorthEast equipment fault 17 1.667e+05 NorthEast fire 5 4.5139e+05 NorthEast severe storm 139 1.0735e+06 NorthEast thunder storm 102 5.9689e+06 NorthEast unknown 10 2.4983e+06 ⋮
Combine the data in theRegion
andCause
variables into a single categorical variable by briefly converting them into strings. Then, create a categorical histogram of the maximum number of affected customers for each cause in each region.
WorstOutages.RegionCause = categorical(string(WorstOutages.Region)+" ("+string(WorstOutages.Cause)+")"); histogram("Categories",WorstOutages.RegionCause,"BinCounts",WorstOutages.max_Customers,..."DisplayOrder","descend") ylabel("Max # Affected Customers")