
splitlabels

Find indices to split labels according to specified proportions

Description

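Use this function when you are working on a machine learning or deep learning classification problem and you want to split a dataset into training, testing, and validation sets that hold the same proportion of label values.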

idxs = splitlabels(lblsrc,p) finds logical indices that split the labels in lblsrc based on the proportions or numbers of labels specified in p.

idxs = splitlabels(lblsrc,p,'randomized') randomly assigns the specified proportion of label values to each index set in idxs.

idxs = splitlabels(___,Name,Value) specifies additional input arguments using name-value pairs. For example, 'UnderlyingDatastoreIndex',3 splits the labels only in the third underlying datastore of a combined datastore.
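As a rough illustration of the first two syntaxes, the sketch below splits a small made-up label vector. The label names, the counts, and the 0.75 proportion are assumptions chosen so that each category divides evenly; they are not taken from the reference examples.

```matlab
% Made-up label source: 8 "cat" labels followed by 20 "dog" labels.
lbls = categorical([repmat("cat",8,1); repmat("dog",20,1)]);

% Two index sets: 75% of each category, then the remainder.
idxs = splitlabels(lbls,0.75);

% Same proportions, but the members of each set are picked at random.
idxsRand = splitlabels(lbls,0.75,'randomized');

% Count the labels that landed in each set of the first split.
trainCounts = countlabels(lbls(idxs{1}));
testCounts  = countlabels(lbls(idxs{2}));
```

Indexing with idxs{k} should work the same way on the label vector and on any data array that has one row per label.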

Examples


Read William Shakespeare's sonnets with the fileread function. Extract all the vowels from the text and convert them to lowercase.

sonnets = fileread("sonnets.txt");
vowels = lower(sonnets(regexp(sonnets,"[AEIOUaeiou]")))';

Count the number of instances of each vowel.

cnts = countlabels(vowels)
cnts=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a      4940     18.368
      e      9028     33.569
      i      4895     18.201
      o      5710     21.232
      u      2321     8.6302

Split the vowels into a training set containing 500 instances of each vowel, a validation set containing 300, and a testing set with the rest. All vowels are represented with equal weights in the first two sets but not in the third.

spltn = splitlabels(vowels,[500 300]);

for kj = 1:length(spltn)
    cntsn{kj} = countlabels(vowels(spltn{kj}));
end
cntsn{:}
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a       500       20
      e       500       20
      i       500       20
      o       500       20
      u       500       20
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a       300       20
      e       300       20
      i       300       20
      o       300       20
      u       300       20
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a      4140     18.083
      e      8228      35.94
      i      4095     17.887
      o      4910     21.447
      u      1521     6.6437

Split the vowels into a training set containing 50% of the instances, a validation set containing another 30%, and a testing set with the rest. All vowels are represented with the same weight across all three sets.

spltp = splitlabels(vowels,[0.5 0.3]);

for kj = 1:length(spltp)
    cntsp{kj} = countlabels(vowels(spltp{kj}));
end
cntsp{:}
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a      2470     18.367
      e      4514     33.566
      i      2448     18.203
      o      2855      21.23
      u      1161     8.6333
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a      1482     18.371
      e      2708     33.569
      i      1468     18.198
      o      1713     21.235
      u       696     8.6277
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a       988     18.368
      e      1806     33.575
      i       979       18.2
      o      1142     21.231
      u       464     8.6261
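In practice the index sets are usually applied to the data that accompanies the labels. The following is a minimal sketch of that step, assuming a hypothetical feature matrix named features with one row per label; the variable names, the random data, and the category sizes are illustrative only.

```matlab
rng(0)                                   % make the made-up data reproducible
labels = categorical([repmat("x",40,1); repmat("y",60,1)]);
features = randn(100,3);                 % hypothetical features, one row per label

% Training, validation, and testing index sets (50%, 25%, and the rest).
idx = splitlabels(labels,[0.5 0.25]);

% Apply the same index sets to the features and to the labels.
XTrain = features(idx{1},:);   YTrain = labels(idx{1});
XVal   = features(idx{2},:);   YVal   = labels(idx{2});
XTest  = features(idx{3},:);   YTest  = labels(idx{3});
```

Because every set is drawn category by category, each of the three label vectors should keep the 40/60 ratio of the full dataset.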

Read William Shakespeare's sonnets with the fileread function. Remove all nonalphabetic characters from the text and convert to lowercase.

sonnets = fileread("sonnets.txt");
letters = lower(sonnets(regexp(sonnets,"[A-Za-z]")))';

Classify the letters as consonants or vowels and create a table with the results. Show the first few rows of the table.

type = repmat("consonant",size(letters));
type(regexp(letters',"[aeiou]")) = "vowel";

T = table(letters,type,'VariableNames',["Letter" "Type"]);

head(T)
ans=8×2 table
    Letter       Type
    ______    ___________
      t       "consonant"
      h       "consonant"
      e       "vowel"
      s       "consonant"
      o       "vowel"
      n       "consonant"
      n       "consonant"
      e       "vowel"

Display the number of instances of each category.

cnt = countlabels(T,'TableVariable',"Type")
cnt=2×3 table
      Type       Count    Percent
    _________    _____    _______
    consonant    46516    63.365
    vowel        26894    36.635

Split the table into two sets, one containing 60% of the consonants and vowels and the other containing 40%. Display the number of instances of each category.

splt = splitlabels(T,0.6,'TableVariable',"Type");

sixty = countlabels(T(splt{1},:),'TableVariable',"Type")
sixty=2×3 table
      Type       Count    Percent
    _________    _____    _______
    consonant    27910    63.366
    vowel        16136    36.634
forty = countlabels(T(splt{2},:),'TableVariable',"Type")
forty=2×3 table
      Type       Count    Percent
    _________    _____    _______
    consonant    18606    63.363
    vowel        10758    36.637

Split the table into two sets, one containing 60% of each particular letter and the other containing 40%. Exclude the letter y, which sometimes acts as a consonant and sometimes as a vowel. Display the number of instances of each category.

splt = splitlabels(T,0.6,'Exclude',"y");

sixti = countlabels(T(splt{1},:),'TableVariable',"Type")
sixti=2×3 table
      Type       Count    Percent
    _________    _____    _______
    consonant    26719    62.346
    vowel        16137    37.654
forti = countlabels(T(splt{2},:),'TableVariable',"Type")
forti=2×3 table
      Type       Count    Percent
    _________    _____    _______
    consonant    17813    62.349
    vowel        10757    37.651

Split the table into two sets of the same size. Include only the letters e and s. Randomize the sets.

halves = splitlabels(T,0.5,'randomized','Include',["e" "s"]);

cnt = countlabels(T(halves{1},:))
cnt=2×3 table
    Letter    Count    Percent
    ______    _____    _______
      e       4514     64.385
      s       2497     35.615

Create a dataset that consists of 100 Gaussian random numbers. Label 40 of the numbers as A, 30 as B, and 30 as C. Store the data in a combined datastore containing two datastores. The first datastore has the data and the second datastore contains the labels.

dsData = arrayDatastore(randn(100,1));
dsLabels = arrayDatastore([repmat("A",40,1); repmat("B",30,1); repmat("C",30,1)]);
dsDataset = combine(dsData,dsLabels);

cnt = countlabels(dsDataset,'UnderlyingDatastoreIndex',2)
cnt=3×3 table
    Label    Count    Percent
    _____    _____    _______
      A       40        40
      B       30        30
      C       30        30

Split the dataset into two sets, one containing 60% of the numbers and the other with the rest.

splitIndices = splitlabels(dsDataset,0.6,'UnderlyingDatastoreIndex',2);

dsDataset1 = subset(dsDataset,splitIndices{1});
cnt1 = countlabels(dsDataset1,'UnderlyingDatastoreIndex',2)
cnt1=3×3 table
    Label    Count    Percent
    _____    _____    _______
      A       24        40
      B       18        30
      C       18        30
dsDataset2 = subset(dsDataset,splitIndices{2});
cnt2 = countlabels(dsDataset2,'UnderlyingDatastoreIndex',2)
cnt2=3×3 table
    Label    Count    Percent
    _____    _____    _______
      A       16        40
      B       12        30
      C       12        30

Input Arguments


lblsrc: Input label source, specified as one of these:

  • A categorical vector.

  • A string vector or a cell array of character vectors.

  • A numeric vector or a cell array of numeric scalars.

  • A logical vector or a cell array of logical scalars.

  • A table with variables containing any of the previous data types.

  • A datastore whose readall function returns any of the previous data types.

  • A CombinedDatastore object containing an underlying datastore whose readall function returns any of the previous data types. In this case, you must specify the index of the underlying datastore that has the label values.

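lblsrc must contain labels that can be converted to a vector with a discrete set of categories.

Example: lblsrc = categorical(["B" "C" "A" "E" "B" "A" "A" "B" "C" "A"],["A" "B" "C" "D"]) creates the label source as a ten-sample categorical vector with four categories: A, B, C, and D.

Example: lblsrc = [0 7 2 5 11 17 15 7 7 11] creates the label source as a ten-sample numeric vector.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical | char | string | table | cell | categorical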

p: Proportions or numbers of labels, specified as an integer scalar, a scalar in the range (0, 1), a vector of integers, or a vector of fractions.

  • If p is a scalar, splitlabels finds two splitting index sets and returns a two-element cell array in idxs.

    • If p is an integer, the first element of idxs contains a vector of indices pointing to the first p values of each label category. The second element of idxs contains indices pointing to the remaining values of each label category.

    • If p is a value in the range (0, 1) and lblsrc has Ki elements in the ith category, the first element of idxs contains a vector of indices pointing to the first p × Ki values of each label category. The second element of idxs contains the indices of the remaining values of each label category.

  • If p is a vector with N elements of the form p1, p2, …, pN, splitlabels finds N + 1 splitting index sets and returns an (N + 1)-element cell array in idxs.

    • If p is a vector of integers, the first element of idxs is a vector of indices pointing to the first p1 values of each label category, the next element of idxs contains the next p2 values of each label category, and so on. The last element of idxs contains the remaining indices of each label category.

    • If p is a vector of fractions and lblsrc has Ki elements in the ith category, the first element of idxs is a vector of indices concatenating the first p1 × Ki values of each category, the next element of idxs contains the next p2 × Ki values of each label category, and so on. The last element of idxs contains the remaining indices of each label category.

Note

  • If p contains fractions, then the sum of its elements must not be greater than one.

  • If p contains numbers of label values, then the sum of its elements must not be greater than the smallest number of labels available for any of the label categories.
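The per-category rule described for a vector of fractions can be sketched with base MATLAB functions. This is only an illustration of the documented behavior, not the toolbox implementation: the variable names are hypothetical, and rounding p × Ki to the nearest integer is an assumption that the text above does not confirm. The category sizes are chosen so that every product is already an integer.

```matlab
% Illustration of the documented splitting rule using base MATLAB only.
% This is NOT the toolbox implementation; the rounding step is an assumption.
lbls = categorical([repmat("a",8,1); repmat("b",4,1)]);   % K = 8 and K = 4
p = [0.5 0.25];                          % N = 2 fractions, so N + 1 = 3 sets

cats = categories(lbls);
sets = cell(1,numel(p) + 1);
for c = 1:numel(cats)
    pos = find(lbls == cats{c});         % positions of category c, in order
    K = numel(pos);
    edges = [0 cumsum(round(p*K)) K];    % set boundaries within this category
    for k = 1:numel(sets)
        sets{k} = [sets{k}; pos(edges(k)+1:edges(k+1))];
    end
end

% Category counts per set: [4 2], [2 1], and [2 1] for a and b.
setCounts = cellfun(@(ix) countcats(lbls(ix))',sets,'UniformOutput',false);
```

Each category contributes its first p1 × Ki positions to the first set, the next p2 × Ki to the second, and whatever is left to the last, so the three sets are disjoint and together cover every label.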

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

Name-Value Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'TableVariable',"AreaCode",'Exclude',["617" "508"] specifies that the function split labels based on telephone area code and exclude numbers from Boston and Natick.

Include: Labels to include in the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in lblsrc. Each category in the vector or cell array must match one of the label categories in lblsrc.

Exclude: Labels to exclude from the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in lblsrc. Each category in the vector or cell array must match one of the label categories in lblsrc.

TableVariable: Table variable to read, specified as a character vector or string scalar. If this argument is not specified, then splitlabels uses the first table variable.

UnderlyingDatastoreIndex: Underlying datastore index, specified as an integer scalar. This argument applies when lblsrc is a CombinedDatastore object. splitlabels counts the labels in the datastore obtained using the UnderlyingDatastores property of lblsrc.
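The name-value arguments can be combined in one call. The sketch below uses a small made-up table (the variable names Id and Genre and the genre labels are hypothetical) to select a label variable other than the first one and to leave one category out of the index sets.

```matlab
% Made-up table; the labels are in the second variable, "Genre".
genre = [repmat("rock",4,1); repmat("jazz",4,1); repmat("folk",4,1)];
T = table((1:12)',genre,'VariableNames',["Id" "Genre"]);

% Split on "Genre" rather than the first variable, and leave out "folk".
idx = splitlabels(T,0.5,'TableVariable',"Genre",'Exclude',"folk");

firstHalf  = T(idx{1},:);    % expected to hold 2 "rock" rows and 2 "jazz" rows
secondHalf = T(idx{2},:);
```

The excluded category should appear in neither index set, so its rows are left out of both subtables.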

Output Arguments


idxs: Splitting indices, returned as a cell array.

Introduced in R2021a