Logistic Regression with Tall Arrays

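This example shows how to use logistic regression and other techniques to perform data analysis on tall arrays. Tall arrays represent data that is too large to fit into computer memory.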

Define Execution Environment

When you perform calculations on tall arrays, MATLAB® uses either a parallel pool (the default if you have Parallel Computing Toolbox™) or the local MATLAB session. To run the example using the local MATLAB session when you have Parallel Computing Toolbox, change the global execution environment by using the mapreducer function.

mapreducer(0)

Get Data into MATLAB

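Create a datastore that references the folder location with the data. The data can be contained in a single file, a collection of files, or an entire folder. Treat 'NA' values as missing data so that datastore replaces them with NaN values. Select a subset of the variables to work with, and include the name of the airline (UniqueCarrier) as a categorical variable. Create a tall table on top of the datastore.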

ds = datastore('airlinesmall.csv');
ds.TreatAsMissing = 'NA';
ds.SelectedVariableNames = {'DayOfWeek','UniqueCarrier',...
    'ArrDelay','DepDelay','Distance'};
ds.SelectedFormats{2} = '%C';   % read UniqueCarrier as categorical
tt = tall(ds);
tt.DayOfWeek = categorical(tt.DayOfWeek,1:7,...
    {'Sun','Mon','Tues','Wed','Thu','Fri','Sat'},'Ordinal',true)
tt =

  Mx5 tall table

    DayOfWeek    UniqueCarrier    ArrDelay    DepDelay    Distance
    _________    _____________    ________    ________    ________

        ?              ?             ?           ?           ?
        ?              ?             ?           ?           ?
        ?              ?             ?           ?           ?
        :              :             :           :           :
        :              :             :           :           :

Late Flights

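Determine the flights that are late by 20 minutes or more by defining a logical variable that is true for a late flight. Add this variable to the tall table of data, noting that it is not yet evaluated. A preview of this variable includes the first few rows.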

tt.LateFlight = tt.ArrDelay>=20
tt =

  Mx6 tall table

    DayOfWeek    UniqueCarrier    ArrDelay    DepDelay    Distance    LateFlight
    _________    _____________    ________    ________    ________    __________

        ?              ?             ?           ?           ?            ?
        ?              ?             ?           ?           ?            ?
        ?              ?             ?           ?           ?            ?
        :              :             :           :           :            :
        :              :             :           :           :            :

Calculate the mean of LateFlight to determine the overall proportion of late flights. Use gather to trigger evaluation of the tall array and bring the result into memory.

m = mean(tt.LateFlight)
m =

  tall double

    ?

m = gather(m)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 0.82 sec
- Pass 2 of 2: Completed in 0.92 sec
Evaluation completed in 2.2 sec
m = 0.1580
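
The mean of a logical vector equals the fraction of its entries that are true, which is why the mean of LateFlight gives the proportion of late flights. A minimal in-memory sketch of that idea, assuming a small made-up delay vector (the values are illustrative only and are not taken from airlinesmall.csv):

```matlab
% Hypothetical arrival delays in minutes (illustrative values only)
arrDelay = [5; 25; -3; 40; 12; 20; 0; 19];

% Logical flag: true when a flight is late by 20 minutes or more
isLate = arrDelay >= 20;

% The mean of a logical vector is the fraction of true entries
propLate = mean(isLate);   % 3 of the 8 flights are flagged as late
```

The same expression works on a tall variable, except that the result stays unevaluated until gather is called.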

Late Flights by Carrier

Examine whether certain types of flights tend to be late. First, check whether certain carriers are more likely to have late flights.

tt.LateFlight = double(tt.LateFlight);   % convert to numeric so the group mean can be computed
late_by_carrier = gather(grpstats(tt,'UniqueCarrier','mean','DataVar','LateFlight'))
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 2.3 sec
Evaluation completed in 3 sec
late_by_carrier=29×4 table
    GroupLabel    UniqueCarrier    GroupCount    mean_LateFlight
    __________    _____________    __________    _______________

    {'9E'    }       9E                521           0.13436
    {'AA'    }       AA              14930           0.16236
    {'AQ'    }       AQ                154           0.051948
    {'AS'    }       AS               2910           0.16014
    {'B6'    }       B6                806           0.23821
    {'CO'    }       CO               8138           0.16319
    {'DH'    }       DH                696           0.17672
    {'DL'    }       DL              16578           0.15261
    {'EA'    }       EA                920           0.15217
    {'EV'    }       EV               1699           0.21248
    {'F9'    }       F9                335           0.18209
    {'FL'    }       FL               1263           0.19952
    {'HA'    }       HA                273           0.047619
    {'HP'    }       HP               3660           0.13907
    {'ML (1)'}       ML (1)             69           0.043478
    {'MQ'    }       MQ               3962           0.18778
      ⋮

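Carriers B6 and EV have higher proportions of late flights. Carriers AQ, ML (1), and HA have relatively few flights, but lower proportions of them are late.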
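
Each value in the mean_LateFlight column is the mean of the 0/1 LateFlight values within one carrier. A minimal in-memory sketch of that computation, assuming a tiny made-up data set and using findgroups and splitapply in place of grpstats (a substitution made for illustration, not the method this example uses):

```matlab
% Hypothetical carriers and late flags (illustrative values only)
carrier = categorical({'AA'; 'AA'; 'B6'; 'B6'; 'B6'});
late    = [0; 1; 1; 1; 0];

% Assign a group number to each row and list the group names
[g, names] = findgroups(carrier);

% Mean of the 0/1 flags within each group = proportion of late flights
propLate = splitapply(@mean, late, g);

% Collect the per-carrier proportions in a small table
summary = table(names, propLate, 'VariableNames', {'Carrier','PropLate'});
```

With these made-up values, half of the AA flights and two thirds of the B6 flights are flagged as late.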

Late Flights by Day of Week

Next, check whether flights on certain days of the week are more likely to be late.

late_by_day = gather(grpstats(tt,'DayOfWeek','mean','DataVar','LateFlight'))
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 1.4 sec
Evaluation completed in 1.6 sec
late_by_day=7×4 table
    GroupLabel    DayOfWeek    GroupCount    mean_LateFlight
    __________    _________    __________    _______________

    {'Fri' }        Fri          15839           0.12899
    {'Mon' }        Mon          18077           0.14234
    {'Sat' }        Sat          16958           0.15603
    {'Sun' }        Sun          18019           0.15117
    {'Thu' }        Thu          18227           0.18418
    {'Tues'}        Tues         18163           0.15526
    {'Wed' }        Wed          18240           0.18399

Wednesdays and Thursdays have the highest proportions of late flights, while Fridays have the lowest proportion.

Late Flights by Distance

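Check whether longer or shorter flights tend to be late. First, look at the density of the flight distance for flights that are late, and compare it with the density for flights that are on time.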

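ksdensity(tt.Distance(tt.LateFlight==1))   % density of distance for late flights
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 1.1 sec
- Pass 2 of 2: Completed in 1.3 sec
Evaluation completed in 2.9 sec
hold on
ksdensity(tt.Distance(tt.LateFlight==0))   % density of distance for on-time flights
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: Completed in 1 sec
- Pass 2 of 2: Completed in 1.3 sec
Evaluation completed in 2.8 sec
hold off
legend('Late','On time')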

Flight distance does not make a dramatic difference in whether a flight is early or late. However, the density appears to be slightly higher for on-time flights at distances of about 400 miles. The density is also higher for late flights at distances of about 2000 miles. Calculate some simple descriptive statistics for the late and on-time flights.

late_by_distance = gather(grpstats(tt,'LateFlight',{'mean' 'std'},'DataVar','Distance'))
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 1.2 sec
Evaluation completed in 1.6 sec
late_by_distance=2×5 table
    GroupLabel    LateFlight    GroupCount    mean_Distance    std_Distance
    __________    __________    __________    _____________    ____________

      {'0'}           0          1.04e+05        693.14           544.75
      {'1'}           1             19519        750.24           574.12

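Late flights are about 60 miles longer on average, although this difference is only a small fraction of the standard deviation of the distance values.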
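
To put the size of that difference in context, the following sketch divides the gap in mean distance by each group's standard deviation. The numbers are copied from the late_by_distance output; the variable names are introduced here for illustration only.

```matlab
% Group summaries copied from the late_by_distance output
meanOnTime = 693.14;   stdOnTime = 544.75;
meanLate   = 750.24;   stdLate   = 574.12;

% Difference in mean flight distance, in miles
diffMiles = meanLate - meanOnTime;

% Difference expressed as a fraction of each group's standard deviation
fracOfStd = diffMiles ./ [stdOnTime stdLate];
```

The difference is roughly one tenth of a standard deviation in either group, which is consistent with the heavily overlapping densities.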

Logistic Regression Model

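Build a model for the probability of a late flight, using both continuous variables (such as Distance) and categorical variables (such as DayOfWeek) to predict the probabilities. This model can help determine whether the results observed previously for each predictor individually also hold true when you consider the predictors together.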

glm = fitglm(tt,'LateFlight~Distance+DayOfWeek','Distribution','binomial')
Iteration [1]:  0% completed
Iteration [1]:  100% completed
Iteration [2]:  0% completed
Iteration [2]:  100% completed
Iteration [3]:  0% completed
Iteration [3]:  100% completed
Iteration [4]:  0% completed
Iteration [4]:  100% completed
Iteration [5]:  0% completed
Iteration [5]:  100% completed
glm =
Compact generalized linear regression model:
    logit(LateFlight) ~ 1 + DayOfWeek + Distance
    Distribution = Binomial

Estimated Coefficients:
                       Estimate         SE         tStat       pValue
                      __________    __________    _______    __________

    (Intercept)           -1.855      0.023052    -80.469             0
    DayOfWeek_Mon      -0.072603      0.029798    -2.4365       0.01483
    DayOfWeek_Tues      0.026909      0.029239    0.92029       0.35742
    DayOfWeek_Wed         0.2359      0.028276      8.343    7.2452e-17
    DayOfWeek_Thu        0.23569      0.028282     8.3338    7.8286e-17
    DayOfWeek_Fri       -0.19285      0.031583     -6.106    1.0213e-09
    DayOfWeek_Sat       0.033542      0.029702     1.1293       0.25879
    Distance          0.00018373    1.3507e-05     13.602    3.8741e-42

123319 observations, 123311 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 504, p-value = 8.74e-105

The model confirms that the previously observed conclusions hold true here as well:

  • The Wednesday and Thursday coefficients are positive, indicating a higher probability of a late flight on those days. The Friday coefficient is negative, indicating a lower probability.

  • The distance coefficient is positive, indicating that longer flights have a higher probability of being late.

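The coefficients for Wednesday, Thursday, Friday, and distance all have very small p-values. This is common with data sets that have many observations, because large amounts of data make it possible to estimate small effects reliably. In fact, the uncertainty in the model is larger than the uncertainty in the estimates of the model parameters.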
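
In a logit model, exponentiating a coefficient gives the multiplicative change in the odds of the outcome. The sketch below applies that to a few of the estimates reported above. The coefficients are copied from the model display, the variable names are illustrative, and Sunday is taken as the reference day because it is the first category and has no coefficient of its own.

```matlab
% Estimates copied from the fitted model display
bWed      = 0.2359;       % DayOfWeek_Wed
bFri      = -0.19285;     % DayOfWeek_Fri
bDistance = 0.00018373;   % Distance, per mile

% Odds of a late flight relative to a Sunday flight of the same distance
oddsRatioWed = exp(bWed);
oddsRatioFri = exp(bFri);

% Change in odds for each additional 1000 miles, on the same day
oddsRatio1000mi = exp(1000*bDistance);
```

Under these estimates, the odds of a late flight are roughly 27% higher on Wednesday and 18% lower on Friday than on Sunday, and about 20% higher per additional 1000 miles.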

Prediction with Model

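Predict the probability of a late flight for each day of the week, and for distances ranging from 0 to 3000 miles. Create a table to hold the predictor values by indexing the first 100 rows in the original table tt.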

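x = gather(tt(1:100,{'Distance' 'DayOfWeek'}));
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 0.22 sec
Evaluation completed in 0.37 sec
x.Distance = linspace(0,3000)';   % 100 evenly spaced distances
x.DayOfWeek(:) = 'Sun';
plot(x.Distance,predict(glm,x));
days = {'Sun' 'Mon' 'Tues' 'Wed' 'Thu' 'Fri' 'Sat'};
hold on
for j=2:length(days)
    x.DayOfWeek(:) = days{j};     % predict for the next day of the week
    plot(x.Distance,predict(glm,x));
end
legend(days)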

According to this model, a Wednesday or Thursday flight of 500 miles has the same probability of being late, about 18%, as a Friday flight of about 3000 miles.
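
The two probabilities in that comparison can be reproduced by hand from the estimated coefficients with the inverse logit function. This is a sketch that assumes the rounded estimates shown in the model display; the helper name invLogit is introduced here for illustration.

```matlab
% Estimates copied from the fitted model display
b0        = -1.855;       % (Intercept)
bWed      = 0.2359;       % DayOfWeek_Wed
bFri      = -0.19285;     % DayOfWeek_Fri
bDistance = 0.00018373;   % Distance, per mile

% Inverse logit: converts a linear predictor into a probability
invLogit = @(eta) 1 ./ (1 + exp(-eta));

% Wednesday flight of 500 miles
pWed500 = invLogit(b0 + bWed + bDistance*500);

% Friday flight of 3000 miles
pFri3000 = invLogit(b0 + bFri + bDistance*3000);
```

Both values come out close to 0.18, matching the curves produced by predict.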

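Because these probabilities are all much smaller than 50%, the model is unlikely to predict that any given flight will be late using this information. Investigate the model further by focusing on the flights for which the model predicts a probability of 20% or more of being late, and compare those predictions with the actual results.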

C = gather(crosstab(tt.LateFlight,predict(glm,tt)>0.20))
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 1.1 sec
Evaluation completed in 1.2 sec
C = 2×2

       99613        4391
       18394        1125

Among the flights predicted to have a 20% or higher probability of being late, about 20% were actually late: 1125/(1125 + 4391). Among the remainder, less than 16% were late: 18394/(18394 + 99613).
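
The two fractions can be checked directly from the cross-tabulation. In this sketch the matrix is copied from the output of crosstab, with rows for the actual outcome (on time, late) and columns for the prediction (below 20%, 20% or more); the variable names are illustrative.

```matlab
% Cross-tabulation copied from the output above
% rows: actual on time / actual late; columns: predicted < 20% / predicted >= 20%
C = [99613 4391; 18394 1125];

% Fraction actually late among flights predicted at 20% or more
fracLateFlagged = C(2,2) / sum(C(:,2));

% Fraction actually late among the remaining flights
fracLateRemainder = C(2,1) / sum(C(:,1));

% Overall fraction of late flights, for comparison with the mean of LateFlight
fracLateOverall = sum(C(2,:)) / sum(C(:));
```

The overall fraction agrees with the value of 0.1580 computed earlier, so the threshold separates the two groups only modestly from the base rate.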