{"title": "Positive-Unlabeled Learning with Non-Negative Risk Estimator", "book": "Advances in Neural Information Processing Systems", "page_first": 1675, "page_last": 1685, "abstract": "From only positive (P) and unlabeled (U) data, a binary classifier could be trained with PU learning, in which the state of the art is unbiased PU learning. However, if its model is very flexible, empirical risks on training data will go negative, and we will suffer from serious overfitting. In this paper, we propose a non-negative risk estimator for PU learning: when getting minimized, it is more robust against overfitting, and thus we are able to use very flexible models (such as deep neural networks) given limited P data. Moreover, we analyze the bias, consistency, and mean-squared-error reduction of the proposed risk estimator, and bound the estimation error of the resulting empirical risk minimizer. Experiments demonstrate that our risk estimator fixes the overfitting problem of its unbiased counterparts.", "full_text": "Positive-Unlabeled Learning with Non-Negative Risk Estimator

Ryuichi Kiryo1,2  Gang Niu1,2  Marthinus C. du Plessis  Masashi Sugiyama2,1
1The University of Tokyo, 7-3-1 Hongo, Tokyo 113-0033, Japan
2RIKEN, 1-4-1 Nihonbashi, Tokyo 103-0027, Japan
{ kiryo@ms., gang@ms., sugi@ }k.u-tokyo.ac.jp

Abstract

From only positive (P) and unlabeled (U) data, a binary classifier could be trained with PU learning, in which the state of the art is unbiased PU learning. However, if its model is very flexible, empirical risks on training data will go negative, and we will suffer from serious overfitting. In this paper, we propose a non-negative risk estimator for PU learning: when getting minimized, it is more robust against overfitting, and thus we are able to use very flexible models (such as deep neural networks) given limited P data.
Moreover, we analyze the bias, consistency, and mean-squared-error reduction of the proposed risk estimator, and bound the estimation error of the resulting empirical risk minimizer. Experiments demonstrate that our risk estimator fixes the overfitting problem of its unbiased counterparts.

1 Introduction

Positive-unlabeled (PU) learning dates back to [1, 2, 3] and has been well studied since then. It mainly focuses on binary classification applied to retrieval and novelty or outlier detection tasks [4, 5, 6, 7], while it also has applications in matrix completion [8] and sequential data [9, 10]. Existing PU methods can be divided into two categories based on how U data is handled. The first category (e.g., [11, 12]) identifies possible negative (N) data in U data and then performs ordinary supervised (PN) learning; the second (e.g., [13, 14]) regards U data as N data with smaller weights. The former relies heavily on heuristics for identifying N data; the latter relies heavily on good choices of the weights of U data, which are computationally expensive to tune.

In order to avoid tuning the weights, unbiased PU learning comes into play as a subcategory of the second category. The milestone is [4], which regards a U data point as weighted P and N data simultaneously. It might lead to unbiased risk estimators, if we unrealistically assume that the class-posterior probability is one for all P data.¹ A breakthrough in this direction is [15], which proposed the first unbiased risk estimator, and a more general estimator was suggested in [16] as a common foundation. The former is unbiased but non-convex for loss functions satisfying a symmetric condition; the latter is always unbiased, and it is further convex for loss functions meeting a linear-odd condition [17, 18]. PU learning based on these unbiased risk estimators is the current state of the art.

However, the unbiased risk estimators will give negative empirical risks if the model being trained is very flexible. For the general estimator in [16], there are three partial risks in the total risk (see Eq. (2) defined later); in particular, it has a negative risk that regards P data as N data, in order to cancel the bias caused by regarding U data as N data. The worst case is that the model can realize any measurable function and the loss function is not upper bounded, so that the empirical risk is not lower bounded. This needs to be fixed, since the original risk, which is the target to be estimated, is non-negative.

¹It implies that the P and N class-conditional densities have disjoint support sets, and then any P and N data (as the test data) can be perfectly separated by a fixed classifier that is sufficiently flexible.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

To this end, we propose a novel non-negative risk estimator that follows and improves on the state-of-the-art unbiased risk estimators mentioned above. This estimator can be used for two purposes: first, given some validation data (which are also PU data), we can use our estimator to evaluate the risk—for this case it is biased yet optimal, and for some symmetric losses, the mean-squared-error reduction is guaranteed; second, given some training data, we can use our estimator to train binary classifiers—for this case its risk minimizer possesses an estimation error bound of the same order as the risk minimizers corresponding to its unbiased counterparts [15, 16, 19].

In addition, we propose a large-scale PU learning algorithm for minimizing the unbiased and non-negative risk estimators. This algorithm accepts any surrogate loss and is based on stochastic optimization, e.g., [20].
Note that [21] is the only existing large-scale PU algorithm, but it accepts only a single surrogate loss from [16] and is based on sequential minimal optimization [22].

The rest of this paper is organized as follows. In Section 2 we review unbiased PU learning, and in Section 3 we propose non-negative PU learning. Theoretical analyses are carried out in Section 4, and experimental results are discussed in Section 5. Conclusions are given in Section 6.

2 Unbiased PU learning

In this section, we review unbiased PU learning [15, 16].

Problem settings  Let $X \in \mathbb{R}^d$ and $Y \in \{\pm 1\}$ ($d \in \mathbb{N}$) be the input and output random variables. Let $p(x, y)$ be the underlying joint density of $(X, Y)$, $p_p(x) = p(x \mid Y = +1)$ and $p_n(x) = p(x \mid Y = -1)$ be the P and N marginals (a.k.a. the P and N class-conditional densities), $p(x)$ be the U marginal, $\pi_p = p(Y = +1)$ be the class-prior probability, and $\pi_n = p(Y = -1) = 1 - \pi_p$. $\pi_p$ is assumed known throughout the paper; it can be estimated from P and U data [23, 24, 25, 26].

Consider the two-sample problem setting of PU learning [5]: two sets of data are sampled independently from $p_p(x)$ and $p(x)$ as $X_p = \{x^p_i\}_{i=1}^{n_p} \sim p_p(x)$ and $X_u = \{x^u_i\}_{i=1}^{n_u} \sim p(x)$, and a classifier needs to be trained from $X_p$ and $X_u$.² If it is PN learning as usual, $X_n = \{x^n_i\}_{i=1}^{n_n} \sim p_n(x)$ rather than $X_u$ would be available, and a classifier could be trained from $X_p$ and $X_n$.

Risk estimators  Unbiased PU learning relies on unbiased risk estimators. Let $g : \mathbb{R}^d \to \mathbb{R}$ be an arbitrary decision function, and $\ell : \mathbb{R} \times \{\pm 1\} \to \mathbb{R}$ be the loss function, such that the value $\ell(t, y)$ is the loss incurred by predicting an output $t$ when the ground truth is $y$. Denote by $R^+_p(g) = \mathbb{E}_p[\ell(g(X), +1)]$ and $R^-_n(g) = \mathbb{E}_n[\ell(g(X), -1)]$, where $\mathbb{E}_p[\cdot] = \mathbb{E}_{X \sim p_p}[\cdot]$ and $\mathbb{E}_n[\cdot] = \mathbb{E}_{X \sim p_n}[\cdot]$. Then, the risk of $g$ is $R(g) = \mathbb{E}_{(X,Y) \sim p(x,y)}[\ell(g(X), Y)] = \pi_p R^+_p(g) + \pi_n R^-_n(g)$. In PN learning, thanks to the availability of $X_p$ and $X_n$, $R(g)$ can be approximated directly by

$\hat{R}_{pn}(g) = \pi_p \hat{R}^+_p(g) + \pi_n \hat{R}^-_n(g)$,  (1)

where $\hat{R}^+_p(g) = (1/n_p)\sum_{i=1}^{n_p} \ell(g(x^p_i), +1)$ and $\hat{R}^-_n(g) = (1/n_n)\sum_{i=1}^{n_n} \ell(g(x^n_i), -1)$. In PU learning, $X_n$ is unavailable, but $R^-_n(g)$ can be approximated indirectly, as shown in [15, 16]. Denote by $R^-_p(g) = \mathbb{E}_p[\ell(g(X), -1)]$ and $R^-_u(g) = \mathbb{E}_{X \sim p(x)}[\ell(g(X), -1)]$. As $\pi_n p_n(x) = p(x) - \pi_p p_p(x)$, we can obtain that $\pi_n R^-_n(g) = R^-_u(g) - \pi_p R^-_p(g)$, and $R(g)$ can be approximated indirectly by

$\hat{R}_{pu}(g) = \pi_p \hat{R}^+_p(g) + \hat{R}^-_u(g) - \pi_p \hat{R}^-_p(g)$,  (2)

where $\hat{R}^-_p(g) = (1/n_p)\sum_{i=1}^{n_p} \ell(g(x^p_i), -1)$ and $\hat{R}^-_u(g) = (1/n_u)\sum_{i=1}^{n_u} \ell(g(x^u_i), -1)$.

The empirical risk estimators in Eqs. (1) and (2) are unbiased and consistent w.r.t. all popular loss functions.³ When they are used for evaluating the risk (e.g., in cross-validation), $\ell$ is by default the zero-one loss, namely $\ell_{01}(t, y) = (1 - \mathrm{sign}(ty))/2$; when used for training, $\ell_{01}$ is replaced with a surrogate loss [27].
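As a concrete sketch of Eqs. (1) and (2)—not code from the paper; the function names are ours, and we plug in the bounded sigmoid loss listed later in Table 1—the two estimators can be computed from arrays of classifier outputs $g(x)$:

```python
import numpy as np

def sigmoid_loss(margin):
    # ell_sig(z) = 1 / (1 + exp(z)) on the margin z = t * y (see Table 1)
    return 1.0 / (1.0 + np.exp(margin))

def risk_pn(g_p, g_n, pi_p, loss=sigmoid_loss):
    # Eq. (1): pi_p * R^+_p(g) + pi_n * R^-_n(g), from P and N outputs
    return pi_p * loss(g_p).mean() + (1.0 - pi_p) * loss(-g_n).mean()

def risk_pu_unbiased(g_p, g_u, pi_p, loss=sigmoid_loss):
    # Eq. (2): pi_p * R^+_p(g) + R^-_u(g) - pi_p * R^-_p(g);
    # the last term treats P data as N data with a NEGATIVE weight
    return (pi_p * loss(g_p).mean()
            + loss(-g_u).mean()
            - pi_p * loss(-g_p).mean())
```

If $X_u$ happened to be exactly the union of $X_p$ and an N sample with $\pi_p = n_p/n_u$, the two estimators would coincide; in general Eq. (2) only matches Eq. (1) in expectation, and its negatively weighted last term is what can drive the empirical value below zero for flexible models.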
In particular, [15] showed that if $\ell$ satisfies a symmetric condition:

$\ell(t, +1) + \ell(t, -1) = 1$,  (3)

we will have

$\hat{R}_{pu}(g) = 2\pi_p \hat{R}^+_p(g) + \hat{R}^-_u(g) - \pi_p$,  (4)

which can be minimized by separating $X_p$ and $X_u$ with ordinary cost-sensitive learning. An issue is that $\hat{R}_{pu}(g)$ in (4) must be non-convex in $g$, since no $\ell(t, y)$ satisfying (3) can be convex in $t$. [16] showed that $\hat{R}_{pu}(g)$ in (2) is convex in $g$ if $\ell(t, y)$ is convex in $t$ and meets a linear-odd condition [17, 18]:

$\ell(t, +1) - \ell(t, -1) = -t$.  (5)

Let $g$ be parameterized by $\theta$; then (5) leads to a convex optimization problem so long as $g$ is linear in $\theta$, for which the globally optimal solution can be obtained. Eq. (5) is not only sufficient but also necessary for the convexity if $\ell$ is unary, i.e., $\ell(t, -1) = \ell(-t, +1)$.

Justification  Thanks to the unbiasedness, we can study estimation error bounds (EEB). Let $\mathcal{G}$ be the function class, and $\hat{g}_{pn}$ and $\hat{g}_{pu}$ be the empirical risk minimizers of $\hat{R}_{pn}(g)$ and $\hat{R}_{pu}(g)$. [19] proved that the EEB of $\hat{g}_{pu}$ is tighter than the EEB of $\hat{g}_{pn}$ when $\pi_p/\sqrt{n_p} + 1/\sqrt{n_u} < \pi_n/\sqrt{n_n}$, if (a) $\ell$ satisfies (3) and is Lipschitz continuous; (b) the Rademacher complexity of $\mathcal{G}$ decays in $O(1/\sqrt{n})$ for data of size $n$ drawn from $p(x)$, $p_p(x)$ or $p_n(x)$.⁴ In other words, under mild conditions, PU learning is likely to outperform PN learning when $\pi_p/\sqrt{n_p} + 1/\sqrt{n_u} < \pi_n/\sqrt{n_n}$.

²$X_p$ is a set of independent data and so is $X_u$, but $X_p \cup X_u$ does not need to be such a set.
³The consistency here means that for fixed $g$, $\hat{R}_{pn}(g) \to R(g)$ and $\hat{R}_{pu}(g) \to R(g)$ as $n_p, n_n, n_u \to \infty$.
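To make the comparison tangible, here is a quick numeric check of the two sides of this inequality under the sample sizes used in Figure 1 ($\pi_p \approx 1/2$; the helper names are ours):

```python
import math

def pu_term(pi_p, n_p, n_u):
    # left-hand side: pi_p / sqrt(n_p) + 1 / sqrt(n_u)
    return pi_p / math.sqrt(n_p) + 1.0 / math.sqrt(n_u)

def pn_term(pi_n, n_n):
    # right-hand side: pi_n / sqrt(n_n)
    return pi_n / math.sqrt(n_n)

# Figure 1 setting: n_p = 100, n_u = 59,900 for PU; n_n = 50 for PN
lhs = pu_term(0.5, 100, 59_900)   # ~ 0.054
rhs = pn_term(0.5, 50)            # ~ 0.071
print(lhs < rhs)                  # True: the bounds favor PU learning here
```

Abundant U data makes the $1/\sqrt{n_u}$ term nearly negligible, so the comparison is essentially $\pi_p/\sqrt{n_p}$ versus $\pi_n/\sqrt{n_n}$.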
This phenomenon has been observed in experiments [19] and is illustrated in Figure 1(a).

3 Non-negative PU learning

In this section, we propose the non-negative risk estimator and the large-scale PU algorithm.

3.1 Motivation

Let us look inside the aforementioned justification of unbiased PU (uPU) learning. Intuitively, the advantage comes from the transformation $\pi_n R^-_n(g) = R^-_u(g) - \pi_p R^-_p(g)$. When we approximate $\pi_n R^-_n(g)$ from N data $\{x^n_i\}_{i=1}^{n_n}$, the convergence rate is $O_p(\pi_n/\sqrt{n_n})$, where $O_p$ denotes the order in probability; when we approximate $R^-_u(g) - \pi_p R^-_p(g)$ from P data $\{x^p_i\}_{i=1}^{n_p}$ and U data $\{x^u_i\}_{i=1}^{n_u}$, the convergence rate becomes $O_p(\pi_p/\sqrt{n_p} + 1/\sqrt{n_u})$. As a result, we might benefit from a tighter uniform deviation bound when $\pi_p/\sqrt{n_p} + 1/\sqrt{n_u} < \pi_n/\sqrt{n_n}$.

However, the critical assumption on the Rademacher complexity is indispensable; otherwise it will be difficult for the EEB of $\hat{g}_{pu}$ to be tighter than the EEB of $\hat{g}_{pn}$. If $\mathcal{G} = \{g \mid \|g\|_\infty \le C_g\}$ where $C_g > 0$ is a constant, i.e., it contains all measurable functions with some bounded norm, then $\mathfrak{R}_{n,q}(\mathcal{G}) = O(1)$ for any $n$ and $q(x)$, and all bounds become trivial; moreover, if $\ell$ is not bounded from above, $\hat{R}_{pu}(g)$ becomes not bounded from below, i.e., it may diverge to $-\infty$. Thus, in order to obtain a high-quality $\hat{g}_{pu}$, $\mathcal{G}$ cannot be too complex, or equivalently the model of $g$ cannot be too flexible.

This argument is supported by an experiment illustrated in Figure 1(b). A multilayer perceptron was trained to separate the even and odd digits of MNIST hand-written digits [29].
This model is so flexible that the number of parameters is 500 times more than the total number of P and N data. From Figure 1(b) we can see:

(A) on training data, the risks of uPU and PN both decrease, and uPU is faster than PN;
(B) on test data, the risk of PN decreases, whereas the risk of uPU does not; the risk of uPU is lower at the beginning but higher at the end than that of PN.

To sum up, the overfitting problem of uPU is serious, which evidences that in order to obtain a high-quality $\hat{g}_{pu}$, the model of $g$ cannot be too flexible.

3.2 Non-negative risk estimator

Nevertheless, we sometimes have no choice: we are interested in using flexible models, while labeling more data is out of our control. Can we alleviate the overfitting problem without changing the model or labeling more data?

⁴Let $\sigma_1, \ldots, \sigma_n$ be $n$ Rademacher variables; the Rademacher complexity of $\mathcal{G}$ for $X$ of size $n$ drawn from $q(x)$ is defined by $\mathfrak{R}_{n,q}(\mathcal{G}) = \mathbb{E}_X \mathbb{E}_{\sigma_1,\ldots,\sigma_n}[\sup_{g \in \mathcal{G}} \frac{1}{n}\sum_{x_i \in X} \sigma_i g(x_i)]$ [28]. For any fixed $\mathcal{G}$ and $q$, $\mathfrak{R}_{n,q}(\mathcal{G})$ still depends on $n$ and should decrease with $n$.

(a) Plain linear model  (b) Multilayer perceptron (MLP)

Figure 1: Illustrative experimental results. The dataset is MNIST; even/odd digits are regarded as the P/N class, and $\pi_p \approx 1/2$; $n_p = 100$ and $n_n = 50$ for PN learning; $n_p = 100$ and $n_u = 59{,}900$ for unbiased PU (uPU) and non-negative PU (nnPU) learning. The model is a plain linear model (784-1) in (a) and an MLP (784-100-1) with ReLU in (b); it was trained by Algorithm 1, where the loss $\ell$ is $\ell_{sig}$, the optimization algorithm $\mathcal{A}$ is [20], with $\beta = 1/2$ for uPU, and $\beta = 0$ and $\gamma = 1$ for nnPU.
Solid curves are $\hat{R}_{pn}(g)$ on test data, where $g \in \{\hat{g}_{pn}, \hat{g}_{pu}, \tilde{g}_{pu}\}$, and dashed curves are $\hat{R}_{pn}(\hat{g}_{pn})$, $\hat{R}_{pu}(\hat{g}_{pu})$ and $\tilde{R}_{pu}(\tilde{g}_{pu})$ on training data. Note that nnPU is identical to uPU in (a).

The answer is affirmative. Note that $\hat{R}_{pu}(\hat{g}_{pu})$ keeps decreasing and goes negative. This should be fixed, since $R(g) \ge 0$ for any $g$. Specifically, it holds that $\pi_n R^-_n(g) = R^-_u(g) - \pi_p R^-_p(g) \ge 0$, but $\hat{R}^-_u(g) - \pi_p \hat{R}^-_p(g) \ge 0$ is not always true, which is a potential reason for uPU to overfit. Based on this key observation, we propose a non-negative risk estimator for PU learning:

$\tilde{R}_{pu}(g) = \pi_p \hat{R}^+_p(g) + \max\{0,\ \hat{R}^-_u(g) - \pi_p \hat{R}^-_p(g)\}$.  (6)

Let $\tilde{g}_{pu} = \arg\min_{g \in \mathcal{G}} \tilde{R}_{pu}(g)$ be the empirical risk minimizer of $\tilde{R}_{pu}(g)$. We refer to the process of obtaining $\tilde{g}_{pu}$ as non-negative PU (nnPU) learning. The implementation of nnPU will be given in Section 3.3, and theoretical analyses of $\tilde{R}_{pu}(g)$ and $\tilde{g}_{pu}$ will be given in Section 4.

Again, from Figure 1(b) we can see:

(A) on training data, the risk of nnPU first decreases and then becomes more and more flat, so that the risk of nnPU is closer to the risk of PN and farther from that of uPU; in short, the risk of nnPU does not go down with uPU after a certain epoch;
(B) on test data, the tendency is similar, but the risk of nnPU does not go up with uPU;
(C) at the end, nnPU achieves the lowest risk on test data.

In summary, nnPU works by explicitly constraining the training risk of uPU to be non-negative.

3.3 Implementation
A list of popular loss functions and their properties is shown in Table 1. Let $g$ be parameterized by $\theta$. If $g$ is linear in $\theta$, the losses satisfying (5) result in convex optimizations. However, if $g$ needs to be flexible, it will be highly nonlinear in $\theta$; then the losses satisfying (5) are not advantageous over others, since the optimizations are anyway non-convex. In [15], the ramp loss was used and $\hat{R}_{pu}(g)$ was minimized by the concave-convex procedure [30]. This solver is fairly sophisticated, and if we replace $\hat{R}_{pu}(g)$ with $\tilde{R}_{pu}(g)$, it will be more difficult to implement. To this end, we propose to use the sigmoid loss $\ell_{sig}(t, y) = 1/(1 + \exp(ty))$: its gradient is everywhere non-zero, and $\tilde{R}_{pu}(g)$ can be minimized by off-the-shelf gradient methods.

In front of big data, we should scale PU learning up by stochastic optimization. Minimizing $\hat{R}_{pu}(g)$ is embarrassingly parallel while minimizing $\tilde{R}_{pu}(g)$ is not, since $\hat{R}_{pu}(g)$ is point-wise but $\tilde{R}_{pu}(g)$ is not due to the max operator. That being said, $\max\{0,\ \hat{R}^-_u(g; X_u) - \pi_p \hat{R}^-_p(g; X_p)\}$ is no greater than $(1/N)\sum_{i=1}^{N} \max\{0,\ \hat{R}^-_u(g; X^i_u) - \pi_p \hat{R}^-_p(g; X^i_p)\}$, where $(X^i_p, X^i_u)$ is the $i$-th mini-batch, and hence the corresponding upper bound of $\tilde{R}_{pu}(g)$ can easily be minimized in parallel.

Table 1: Loss functions for PU learning and their properties.

Name              | Definition                        | (3) | (5) | Bounded | Lipschitz | $\ell'(z) \ne 0$
Zero-one loss     | $(1 - \mathrm{sign}(z))/2$        | ✓   | ×   | ✓       | ×         | $z = 0$
Ramp loss         | $\max\{0, \min\{1, (1-z)/2\}\}$   | ✓   | ×   | ✓       | ✓         | $z \in [-1, +1]$
Squared loss      | $(z - 1)^2/4$                     | ×   | ✓   | ×       | ×         | $z \in \mathbb{R}$
Logistic loss     | $\ln(1 + \exp(-z))$               | ×   | ✓   | ×       | ✓         | $z \in \mathbb{R}$
Hinge loss        | $\max\{0, 1 - z\}$                | ×   | ×   | ×       | ✓         | $z \in (-\infty, +1]$
Double hinge loss | $\max\{0, (1-z)/2, -z\}$          | ×   | ✓   | ×       | ✓         | $z \in (-\infty, +1]$
Sigmoid loss      | $1/(1 + \exp(z))$                 | ✓   | ×   | ✓       | ✓         | $z \in \mathbb{R}$

All loss functions are unary, such that $\ell(t, y) = \ell(z)$ with $z = ty$. The ramp loss comes from [15]; the double hinge loss is from [16], in which the squared, logistic and hinge losses were discussed as well. The ramp and squared losses are scaled to satisfy (3) or (5).
The sigmoid loss is a horizontally mirrored logistic function; the logistic loss is the negative logarithm of the logistic function.

Algorithm 1  Large-scale PU learning based on stochastic optimization
Input: training data $(X_p, X_u)$; hyperparameters $0 \le \beta \le \pi_p \sup_t \max_y \ell(t, y)$ and $0 \le \gamma \le 1$
Output: model parameter $\theta$ for $\hat{g}_{pu}(x; \theta)$ or $\tilde{g}_{pu}(x; \theta)$
1: Let $\mathcal{A}$ be an external SGD-like stochastic optimization algorithm such as [20] or [31]
2: while no stopping criterion has been met:
3:   Shuffle $(X_p, X_u)$ into $N$ mini-batches, and denote by $(X^i_p, X^i_u)$ the $i$-th mini-batch
4:   for $i = 1$ to $N$:
5:     if $\hat{R}^-_u(g; X^i_u) - \pi_p \hat{R}^-_p(g; X^i_p) \ge -\beta$:
6:       Set gradient $\nabla_\theta \hat{R}_{pu}(g; X^i_p, X^i_u)$
7:       Update $\theta$ by $\mathcal{A}$ with its current step size $\eta$
8:     else:
9:       Set gradient $\nabla_\theta (\pi_p \hat{R}^-_p(g; X^i_p) - \hat{R}^-_u(g; X^i_u))$
10:      Update $\theta$ by $\mathcal{A}$ with a discounted step size $\gamma\eta$

The large-scale PU algorithm is described in Algorithm 1. Let $r_i = \hat{R}^-_u(g; X^i_u) - \pi_p \hat{R}^-_p(g; X^i_p)$. In practice, we may tolerate $r_i \ge -\beta$ where $0 \le \beta \le \pi_p \sup_t \max_y \ell(t, y)$, as $r_i$ comes from a single mini-batch. The degree of tolerance is controlled by $\beta$: there is zero tolerance if $\beta = 0$, and we are minimizing $\hat{R}_{pu}(g)$ if $\beta = \pi_p \sup_t \max_y \ell(t, y)$. Otherwise, if $r_i < -\beta$, we go along $-\nabla_\theta r_i$ with a step size discounted by $\gamma$, where $0 \le \gamma \le 1$, to make this mini-batch less overfitted.
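In a deep learning framework the gradients come from automatic differentiation; as a dependency-light sketch of one pass over a mini-batch (our naming, not the authors' code), consider a linear model $g(x) = x^\top w$ with the sigmoid loss and hand-written gradients:

```python
import numpy as np

def sig(z):
    # sigmoid loss on the margin z = t * y
    return 1.0 / (1.0 + np.exp(z))

def grad_sig(z):
    # derivative of the sigmoid loss w.r.t. the margin
    s = sig(z)
    return -s * (1.0 - s)

def nnpu_sgd_step(w, x_p, x_u, pi_p, lr, beta=0.0, gamma=1.0):
    """One mini-batch update in the spirit of lines 5-10 of Algorithm 1,
    for a linear model g(x) = x @ w with the sigmoid loss (a sketch)."""
    z_p, z_u = x_p @ w, x_u @ w
    r_i = sig(-z_u).mean() - pi_p * sig(-z_p).mean()  # R^-_u - pi_p * R^-_p
    # gradients of the three partial risks w.r.t. w
    g_p_pos = (grad_sig(z_p)[:, None] * x_p).mean(axis=0)    # of R^+_p
    g_p_neg = (-grad_sig(-z_p)[:, None] * x_p).mean(axis=0)  # of R^-_p
    g_u_neg = (-grad_sig(-z_u)[:, None] * x_u).mean(axis=0)  # of R^-_u
    if r_i >= -beta:
        grad = pi_p * g_p_pos + g_u_neg - pi_p * g_p_neg  # grad of Eq. (2)
    else:
        grad = pi_p * g_p_neg - g_u_neg  # go along -grad(r_i) ...
        lr = gamma * lr                  # ... with a discounted step size
    return w - lr * grad
```

With $\beta = \pi_p$ (i.e., $\pi_p$ times the supremum of the sigmoid loss) the else branch can never fire and the step reduces to plain uPU; with $\beta = 0$ and $\gamma = 1$ it corresponds to the nnPU setting used in Figure 1(b).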
Algorithm 1 is insensitive to the choice of $\gamma$ if the optimization algorithm $\mathcal{A}$ is adaptive, such as [20] or [31].

4 Theoretical analyses

In this section, we analyze the risk estimator (6) and its minimizer (all proofs are in Appendix B).

4.1 Bias and consistency

For fixed $g$, $\tilde{R}_{pu}(g) \ge \hat{R}_{pu}(g)$ for any $(X_p, X_u)$, but $\hat{R}_{pu}(g)$ is unbiased, which implies that $\tilde{R}_{pu}(g)$ is biased in general. A fundamental question is then whether $\tilde{R}_{pu}(g)$ is consistent; from now on, we prove this consistency. To begin with, partition all possible $(X_p, X_u)$ into $D^+(g) = \{(X_p, X_u) \mid \hat{R}^-_u(g) - \pi_p \hat{R}^-_p(g) \ge 0\}$ and $D^-(g) = \{(X_p, X_u) \mid \hat{R}^-_u(g) - \pi_p \hat{R}^-_p(g) < 0\}$. Assume there are $C_g > 0$ and $C_\ell > 0$ such that $\sup_{g \in \mathcal{G}} \|g\|_\infty \le C_g$ and $\sup_{|t| \le C_g} \max_y \ell(t, y) \le C_\ell$.

Lemma 1. The following three conditions are equivalent: (A) the probability measure of $D^-(g)$ is non-zero; (B) $\tilde{R}_{pu}(g)$ differs from $\hat{R}_{pu}(g)$ with a non-zero probability over repeated sampling of $(X_p, X_u)$; (C) the bias of $\tilde{R}_{pu}(g)$ is positive. In addition, by assuming that there is $\alpha > 0$ such that $R^-_n(g) \ge \alpha$, the probability measure of $D^-(g)$ can be bounded by

$\Pr(D^-(g)) \le \exp(-2(\alpha/C_\ell)^2 / (\pi_p^2/n_p + 1/n_u))$.  (7)

Based on Lemma 1, we can show the exponential decay of the bias and also the consistency. For convenience, denote by $\chi_{n_p,n_u} = 2\pi_p/\sqrt{n_p} + 1/\sqrt{n_u}$.

Theorem 2 (Bias and consistency). Assume that $R^-_n(g) \ge \alpha > 0$ and denote by $\Delta_g$ the right-hand side of Eq. (7). As $n_p, n_u \to \infty$, the bias of $\tilde{R}_{pu}(g)$ decays exponentially:

$0 \le \mathbb{E}_{X_p,X_u}[\tilde{R}_{pu}(g)] - R(g) \le C_\ell \pi_p \Delta_g$.  (8)

Moreover, for any $\delta > 0$, let $C_\delta = C_\ell \sqrt{\ln(2/\delta)/2}$; then we have with probability at least $1 - \delta$,

$|\tilde{R}_{pu}(g) - R(g)| \le C_\delta \cdot \chi_{n_p,n_u} + C_\ell \pi_p \Delta_g$,  (9)

and with probability at least $1 - \delta - \Delta_g$,

$|\tilde{R}_{pu}(g) - R(g)| \le C_\delta \cdot \chi_{n_p,n_u}$.  (10)

Either (9) or (10) in Theorem 2 indicates that for fixed $g$, $\tilde{R}_{pu}(g) \to R(g)$ in $O_p(\pi_p/\sqrt{n_p} + 1/\sqrt{n_u})$. This convergence rate is optimal according to the central limit theorem [32], which means the proposed estimator is a biased yet optimal estimator of the risk.

4.2 Mean squared error

After introducing the bias, $\tilde{R}_{pu}(g)$ tends to overestimate $R(g)$. It is not a shrinkage estimator [33, 34], so its mean squared error (MSE) is not necessarily smaller than that of $\hat{R}_{pu}(g)$. However, we can still characterize this reduction in MSE.

Theorem 3 (MSE reduction). It holds that $\mathrm{MSE}(\tilde{R}_{pu}(g)) < \mathrm{MSE}(\hat{R}_{pu}(g))$,⁵ if and only if

$\int_{(X_p,X_u) \in D^-(g)} (\hat{R}_{pu}(g) + \tilde{R}_{pu}(g) - 2R(g)) (\hat{R}^-_u(g) - \pi_p \hat{R}^-_p(g))\, \mathrm{d}F(X_p, X_u) > 0$,  (11)

where $\mathrm{d}F(X_p, X_u) = \prod_{i=1}^{n_p} p_p(x^p_i)\mathrm{d}x^p_i \cdot \prod_{i=1}^{n_u} p(x^u_i)\mathrm{d}x^u_i$. Eq. (11) is valid if the following conditions are met: (a) $\Pr(D^-(g)) > 0$; (b) $\ell$ satisfies Eq. (3); (c) $R^-_n(g) \ge \alpha > 0$; (d) $n_u \gg n_p$, such that we have $R^-_u(g) - \hat{R}^-_u(g) \le 2\alpha$ almost surely on $D^-(g)$. In fact, given these four conditions, we have for any $0 \le \beta \le C_\ell \pi_p$,

$\mathrm{MSE}(\hat{R}_{pu}(g)) - \mathrm{MSE}(\tilde{R}_{pu}(g)) \ge 3\beta^2 \Pr\{\tilde{R}_{pu}(g) - \hat{R}_{pu}(g) > \beta\}$.  (12)

The assumption (d) in Theorem 3 is explained as follows. Since U data can be much cheaper than P data in practice, it would be natural to assume that $n_u$ is much larger and grows much faster than $n_p$; hence $\Pr\{R^-_u(g) - \hat{R}^-_u(g) \ge \alpha\} / \Pr\{\hat{R}^-_p(g) - R^-_p(g) \ge \alpha/\pi_p\} \propto \exp(n_p - n_u)$ asymptotically.⁶ This means the contribution of $X_u$ is negligible for making $(X_p, X_u) \in D^-(g)$, so that $\Pr(D^-(g))$ exhibits exponential decay mainly in $n_p$. As $\Pr\{R^-_u(g) - \hat{R}^-_u(g) \ge 2\alpha\}$ has stronger exponential decay in $n_u$ than $\Pr\{R^-_u(g) - \hat{R}^-_u(g) \ge \alpha\}$, as well as $n_u \gg n_p$, we made the assumption (d).

⁵Here, MSE($\cdot$) is over repeated sampling of $(X_p, X_u)$.
⁶This can be derived as $n_p, n_u \to \infty$ by applying the central limit theorem to the two differences and then L'Hôpital's rule to the ratio of complementary error functions [32].

4.3 Estimation error

While Theorems 2 and 3 addressed the use of (6) for evaluating the risk, we are likewise interested in its use for training classifiers. In what follows, we analyze the estimation error $R(\tilde{g}_{pu}) - R(g^*)$, where $g^*$ is the true risk minimizer in $\mathcal{G}$, i.e., $g^* = \arg\min_{g \in \mathcal{G}} R(g)$. As a common practice [28], assume that $\ell(t, y)$ is Lipschitz continuous in $t$ for all $|t| \le C_g$ with a Lipschitz constant $L_\ell$.

Theorem 4 (Estimation error bound). Assume that (a) $\inf_{g \in \mathcal{G}} R^-_n(g) \ge \alpha > 0$ and denote by $\Delta$ the right-hand side of Eq. (7); (b) $\mathcal{G}$ is closed under negation, i.e., $g \in \mathcal{G}$ if and only if $-g \in \mathcal{G}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,

$R(\tilde{g}_{pu}) - R(g^*) \le 16 L_\ell \pi_p \mathfrak{R}_{n_p,p_p}(\mathcal{G}) + 8 L_\ell \mathfrak{R}_{n_u,p}(\mathcal{G}) + 2C'_\delta \cdot \chi_{n_p,n_u} + 2C_\ell \pi_p \Delta$,  (13)

where $C'_\delta = C_\ell \sqrt{\ln(1/\delta)/2}$, and $\mathfrak{R}_{n_p,p_p}(\mathcal{G})$ and $\mathfrak{R}_{n_u,p}(\mathcal{G})$ are the Rademacher complexities of $\mathcal{G}$ for the sampling of size $n_p$ from $p_p(x)$ and of size $n_u$ from $p(x)$, respectively.

Theorem 4 ensures that learning with (6) is also consistent: as $n_p, n_u \to \infty$, $R(\tilde{g}_{pu}) \to R(g^*)$, and if $\ell$ satisfies (5), all optimizations are convex and $\tilde{g}_{pu} \to g^*$. For linear-in-parameter models with a bounded norm, $\mathfrak{R}_{n_p,p_p}(\mathcal{G}) = O(1/\sqrt{n_p})$ and $\mathfrak{R}_{n_u,p}(\mathcal{G}) = O(1/\sqrt{n_u})$, and thus $R(\tilde{g}_{pu}) \to R(g^*)$ in $O_p(\pi_p/\sqrt{n_p} + 1/\sqrt{n_u})$.

For comparison, $R(\hat{g}_{pu}) - R(g^*)$ can be bounded using a different proof technique [19]:

$R(\hat{g}_{pu}) - R(g^*) \le 8 L_\ell \pi_p \mathfrak{R}_{n_p,p_p}(\mathcal{G}) + 4 L_\ell \mathfrak{R}_{n_u,p}(\mathcal{G}) + 2C_\delta \cdot \chi_{n_p,n_u}$,  (14)

where $C_\delta = C_\ell \sqrt{\ln(2/\delta)/2}$. The differences between (13) and (14) come entirely from the differences between the corresponding uniform deviation bounds, i.e., the following lemma and Lemma 8 of [19].

Lemma 5. Under the assumptions of Theorem 4, for any $\delta > 0$, with probability at least $1 - \delta$,

$\sup_{g \in \mathcal{G}} |\tilde{R}_{pu}(g) - R(g)| \le 8 L_\ell \pi_p \mathfrak{R}_{n_p,p_p}(\mathcal{G}) + 4 L_\ell \mathfrak{R}_{n_u,p}(\mathcal{G}) + C'_\delta \cdot \chi_{n_p,n_u} + C_\ell \pi_p \Delta$.  (15)

Notice that $\hat{R}_{pu}(g)$ is point-wise while $\tilde{R}_{pu}(g)$ is not due to the maximum, which makes Lemma 5 much more difficult to prove than Lemma 8 of [19]. The key trick is that after symmetrization, we employ $|\max\{0, z\} - \max\{0, z'\}| \le |z - z'|$, making three differences of partial risks point-wise (see (18) in the proof). As a consequence, we have to use a different Rademacher complexity with the absolute value inside the supremum [35, 36], whose contraction doubles the coefficients of (15) compared with Lemma 8 of [19]; moreover, we have to assume $\mathcal{G}$ is closed under negation to change back to the standard Rademacher complexity without the absolute value [28]. Therefore, the differences between (13) and (14) are mainly due to different proof techniques and cannot reflect the intrinsic differences of the empirical risk minimizers.

5 Experiments

In this section, we compare PN, unbiased PU (uPU) and non-negative PU (nnPU) learning experimentally. We focus on training deep neural networks, as uPU learning usually does not overfit if a linear-in-parameter model is used [19] and nothing needs to be fixed.

Table 2 describes the specification of benchmark datasets.
MNIST, 20News and CIFAR-10 have 10, 7 and 10 classes originally, and we constructed the P and N classes from them as follows: MNIST was preprocessed so that 0, 2, 4, 6, 8 constitute the P class, while 1, 3, 5, 7, 9 constitute the N class; for 20News, 'alt.', 'comp.', 'misc.' and 'rec.' make up the P class, and 'sci.', 'soc.' and 'talk.' make up the N class; for CIFAR-10, the P class is formed by 'airplane', 'automobile', 'ship' and 'truck', and the N class is formed by 'bird', 'cat', 'deer', 'dog', 'frog' and 'horse'. The dataset epsilon has 2 classes, so such a construction is unnecessary.

Three learning methods were set up as follows: (A) for PN, $n_p = 1{,}000$ and $n_n = (\pi_n/2\pi_p)^2 n_p$; (B) for uPU, $n_p = 1{,}000$ and $n_u$ is the total number of training data; (C) for nnPU, $n_p$ and $n_u$ are exactly the same as for uPU. For uPU and nnPU, P and U data were dependent, because neither $\hat{R}_{pu}(g)$ in Eq. (2) nor $\tilde{R}_{pu}(g)$ in Eq. (6) requires them to be independent. The choice of $n_n$ was motivated by [19] and may make nnPU potentially better than PN as $n_u \to \infty$ (whether $n_p < \infty$ or $n_p \le n_u$).

The model for MNIST was a 6-layer multilayer perceptron (MLP) with ReLU [40] (more specifically, d-300-300-300-300-1). For epsilon, the model was similar, while the activation was replaced with Softsign [41] for better performance. For 20News, we borrowed the pre-trained word embeddings from GloVe [42], and the model can be written as d-avg_pool(word_emb(d,300))-300-300-1, where word_emb(d,300) retrieves 300-dimensional word embeddings for all words in a document, avg_pool executes average pooling, and the resulting vector is fed to a 4-layer MLP with Softsign. The model for CIFAR-10 was an all convolutional net [43]: (32*32*3)-[C(3*3,96)]*2-C(3*3,96,2)-[C(3*3,192)]*2-C(3*3,192,2)-C(3*3,192)-C(1*1,192)-C(1*1,10)-1000-1000-1, where the input is a 32*32 RGB image, C(3*3,96) means 96 channels of 3*3 convolutions followed by ReLU, [ · ]*2 means there are two such layers, C(3*3,96,2) means a similar layer but with stride 2, etc.; it is one of the best architectures for CIFAR-10. Batch normalization [44] was applied before hidden layers.

Table 2: Specification of benchmark datasets, models, and optimization algorithms.

Name          | # Train | # Test  | # Feature | $\pi_p$ | Model $g(x; \theta)$      | Opt. alg. $\mathcal{A}$
MNIST [29]    | 60,000  | 10,000  | 784       | 0.49    | 6-layer MLP with ReLU     | Adam [20]
epsilon [37]  | 400,000 | 100,000 | 2,000     | 0.50    | 6-layer MLP with Softsign | Adam [20]
20News [38]   | 11,314  | 7,532   | 61,188    | 0.44    | 5-layer MLP with Softsign | AdaGrad [31]
CIFAR-10 [39] | 50,000  | 10,000  | 3,072     | 0.40    | 13-layer CNN with ReLU    | Adam [20]

See http://yann.lecun.com/exdb/mnist/ for MNIST, https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html for epsilon, http://qwone.com/~jason/20Newsgroups/ for 20Newsgroups, and https://www.cs.toronto.edu/~kriz/cifar.html for CIFAR-10.

Furthermore, the sigmoid loss $\ell_{sig}$ was used as the surrogate loss, and an $\ell_2$-regularization was also added. The resulting objectives were minimized by Adam [20] on MNIST, epsilon and CIFAR-10, and by AdaGrad [31] on 20News; we fixed $\beta = 0$ and $\gamma = 1$ for simplicity.

(a) MNIST  (b) epsilon  (c) 20News  (d) CIFAR-10

Figure 2: Experimental results of training deep neural networks.

The experimental results are reported in Figure 2, where means and standard deviations of training and test risks based on the same 10 random samplings are shown. We can see that uPU overfitted training data and nnPU fixed this problem.
Additionally, given limited N data, nnPU outperformed PN on MNIST, epsilon and CIFAR-10 and was comparable to it on 20News. In summary, with the proposed non-negative risk estimator, we are able to use very flexible models given limited P data.

We further tried some cases where πp is misspecified, in order to simulate PU learning in the wild, where we must suffer from errors in estimating πp. More specifically, we tested nnPU learning by replacing πp with π′p ∈ {0.8πp, 0.9πp, . . . , 1.2πp} and giving π′p to the learning method, so that it would regard π′p as πp during the entire training process. The experimental setup was exactly the same as before except for the replacement of πp.

Figure 3: Experimental results given π′p ∈ {0.8πp, 0.9πp, . . . , 1.2πp}. (a) MNIST; (b) epsilon; (c) 20News; (d) CIFAR-10.

The experimental results are reported in Figure 3, where means of test risks of nnPU based on the same 10 random samplings are shown, and the best test risks are identified (horizontal lines are the best mean test risks and vertical lines are the epochs when they were achieved). We can see that on MNIST, the more misspecification there was, the worse nnPU performed, while under-misspecification hurt more than over-misspecification; on epsilon, the cases where π′p equals πp, 1.1πp and 1.2πp were comparable, but the best was π′p = 1.1πp rather than π′p = πp; on 20News, these three cases became different, such that π′p = πp was superior to π′p = 1.2πp but inferior to π′p = 1.1πp; at last, on CIFAR-10, π′p = πp and π′p = 1.1πp were comparable again, and π′p = 1.2πp was the winner.

In all the experiments, we have fixed β = 0, which may explain this phenomenon. Recall that uPU overfitted seriously on all the benchmark datasets, and note that the larger π′p is, the more different nnPU is from uPU. Therefore, the replacement of πp with some π′p > πp introduces additional bias of R̃pu(g) in estimating R(g), but it also pushes R̃pu(g) away from R̂pu(g) and hence pushes nnPU away from uPU. This may result in lower test risks given some π′p slightly larger than πp, as shown in Figure 3. This is also why under-misspecified π′p hurt more than over-misspecified π′p.

All the experiments were done with Chainer [45], and our implementation based on it is available at https://github.com/kiryor/nnPUlearning.

6 Conclusions

We proposed a non-negative risk estimator for PU learning that follows and improves on the state-of-the-art unbiased risk estimators. No matter how flexible the model is, the estimated risk will not go negative, unlike its unbiased counterparts. It is more robust against overfitting when being minimized, and training very flexible models such as deep neural networks given limited P data becomes possible. We also developed a large-scale PU learning algorithm. Extensive theoretical analyses were presented, and the usefulness of our non-negative PU learning was verified by intensive experiments.
A promising future direction is extending the current work to semi-supervised learning along the line of [46].

Acknowledgments

GN and MS were supported by JST CREST JPMJCR1403 and GN was also partially supported by Microsoft Research Asia.

References

[1] F. Denis. PAC learning from positive statistical queries. In ALT, 1998.
[2] F. De Comité, F. Denis, R. Gilleron, and F. Letouzey. Positive and unlabeled examples help learning. In ALT, 1999.
[3] F. Letouzey, F. Denis, and R. Gilleron. Learning from positive and unlabeled examples. In ALT, 2000.
[4] C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In KDD, 2008.
[5] G. Ward, T. Hastie, S. Barry, J. Elith, and J. Leathwick. Presence-only data and the EM algorithm. Biometrics, 65(2):554–563, 2009.
[6] C. Scott and G. Blanchard. Novelty detection: Unlabeled data definitely help. In AISTATS, 2009.
[7] G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. Journal of Machine Learning Research, 11:2973–3009, 2010.
[8] C.-J. Hsieh, N. Natarajan, and I. S. Dhillon. PU learning for matrix completion. In ICML, 2015.
[9] X. Li, P. S. Yu, B. Liu, and S.-K. Ng. Positive unlabeled learning for data stream classification. In SDM, 2009.
[10] M. N. Nguyen, X. Li, and S.-K. Ng. Positive unlabeled learning for time series classification. In IJCAI, 2011.
[11] B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially supervised classification of text documents. In ICML, 2002.
[12] X. Li and B. Liu. Learning to classify texts using positive and unlabeled data. In IJCAI, 2003.
[13] W.
S. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic regression. In ICML, 2003.
[14] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. Building text classifiers using positive and unlabeled examples. In ICDM, 2003.
[15] M. C. du Plessis, G. Niu, and M. Sugiyama. Analysis of learning from positive and unlabeled data. In NIPS, 2014.
[16] M. C. du Plessis, G. Niu, and M. Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
[17] N. Natarajan, I. S. Dhillon, P. Ravikumar, and A. Tewari. Learning with noisy labels. In NIPS, 2013.
[18] G. Patrini, F. Nielsen, R. Nock, and M. Carioni. Loss factorization, weakly supervised learning and label noise robustness. In ICML, 2016.
[19] G. Niu, M. C. du Plessis, T. Sakai, Y. Ma, and M. Sugiyama. Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In NIPS, 2016.
[20] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[21] E. Sansone, F. G. B. De Natale, and Z.-H. Zhou. Efficient training for positive unlabeled learning. arXiv preprint arXiv:1608.06807, 2016.
[22] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods, pages 185–208. MIT Press, 1999.
[23] A. Menon, B. Van Rooyen, C. S. Ong, and B. Williamson. Learning from corrupted binary labels via class-probability estimation. In ICML, 2015.
[24] H. G. Ramaswamy, C. Scott, and A. Tewari. Mixture proportion estimation via kernel embedding of distributions. In ICML, 2016.
[25] S. Jain, M. White, and P. Radivojac. Estimating the class prior and posterior from noisy positives and unlabeled data. In NIPS, 2016.
[26] M. C. du Plessis, G. Niu, and M. Sugiyama.
Class-prior estimation for learning from positive and unlabeled data. Machine Learning, 106(4):463–492, 2017.
[27] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[28] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
[29] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[30] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). In NIPS, 2001.
[31] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[32] K.-L. Chung. A Course in Probability Theory. Academic Press, 1968.
[33] C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proc. 3rd Berkeley Symposium on Mathematical Statistics and Probability, 1956.
[34] W. James and C. Stein. Estimation with quadratic loss. In Proc. 4th Berkeley Symposium on Mathematical Statistics and Probability, 1961.
[35] V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
[36] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
[37] G.-X. Yuan, C.-H. Ho, and C.-J. Lin. An improved GLMNET for l1-regularized logistic regression. Journal of Machine Learning Research, 13:1999–2030, 2012.
[38] K. Lang. Newsweeder: Learning to filter netnews. In ICML, 1995.
[39] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[40] V.
Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[41] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[42] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[43] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In ICLR, 2015.
[44] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[45] S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a next-generation open source framework for deep learning. In Machine Learning Systems Workshop at NIPS, 2015.
[46] T. Sakai, M. C. du Plessis, G. Niu, and M. Sugiyama. Semi-supervised classification based on classification from positive and unlabeled data. In ICML, 2017.
[47] C. McDiarmid. On the method of bounded differences. In J. Siemons, editor, Surveys in Combinatorics, pages 148–188. Cambridge University Press, 1989.
[48] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.
[49] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[50] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.