Journal of Machine Learning Research Papers: Volume 17の論文一覧

Journal of Machine Learning Research Papers Volume 17に記載されている内容を一覧にまとめ、機械翻訳を交えて日本語化し掲載します。
On the properties of variational approximations of Gibbs posteriors
ギブス事後分布の変分近似の性質について

The PAC-Bayesian approach is a powerful set of techniques to derive non-asymptotic risk bounds for random estimators. The corresponding optimal distribution of estimators, usually called the Gibbs posterior, is unfortunately often intractable. One may sample from it using Markov chain Monte Carlo, but this is usually too slow for big datasets. We consider instead variational approximations of the Gibbs posterior, which are fast to compute. We undertake a general study of the properties of such approximations. Our main finding is that such a variational approximation has often the same rate of convergence as the original PAC-Bayesian procedure it approximates. In addition, we show that, when the risk function is convex, a variational approximation can be obtained in polynomial time using a convex solver. We give finite sample oracle inequalities for the corresponding estimator. We specialize our results to several learning tasks (classification, ranking, matrix completion), discuss how to implement a variational approximation in each case, and illustrate the good properties of said approximation on real datasets.

PAC-ベイジアンアプローチは、ランダム推定量の非漸近的リスク境界を導出する強力な手法です。通常ギブス事後分布と呼ばれる推定量の対応する最適分布は、残念ながら扱いにくいことがよくあります。マルコフ連鎖モンテカルロを使用してそこからサンプリングすることもできますが、これは通常、大きなデータセットには時間がかかりすぎます。代わりに、計算が高速なギブス事後分布の変分近似を検討します。このような近似の特性について一般的な研究を行います。主な発見は、このような変分近似は、それが近似する元のPAC-ベイジアン手順と同じ収束率になることが多いということです。さらに、リスク関数が凸の場合、凸ソルバーを使用して変分近似を多項式時間で取得できることを示します。対応する推定量に対して有限サンプルのオラクル不等式を示します。私たちは結果をいくつかの学習タスク（分類、ランキング、行列補完）に特化し、それぞれのケースで変分近似を実装する方法を説明し、実際のデータセットでのその近似の優れた特性を示します。

Distributed Submodular Maximization
分散型サブモジュラの最大化

Many large-scale machine learning problems–clustering, non-parametric learning, kernel machines, etc.–require selecting a small yet representative subset from a large dataset. Such problems can often be reduced to maximizing a submodular set function subject to various constraints. Classical approaches to submodular optimization require centralized access to the full dataset, which is impractical for truly large-scale problems. In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple, two-stage protocol GREEDI, that is easily implemented using MapReduce style computations. We theoretically analyze our approach, and show that under certain natural conditions, performance close to the centralized approach can be achieved. We begin with monotone submodular maximization subject to a cardinality constraint, and then extend this approach to obtain approximation guarantees for (not necessarily monotone) submodular maximization subject to more general constraints including matroid or knapsack constraints. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference and exemplar based clustering on tens of millions of examples using Hadoop.

クラスタリング、ノンパラメトリック学習、カーネルマシンなど、多くの大規模機械学習の問題では、大規模なデータセットから小さいながらも代表的なサブセットを選択する必要があります。このような問題は、多くの場合、さまざまな制約の下でサブモジュラーセット関数を最大化することに帰着します。サブモジュラー最適化の従来のアプローチでは、データセット全体への集中アクセスが必要ですが、これは本当に大規模な問題には非現実的です。この論文では、分散形式でのサブモジュラー関数の最大化の問題を検討します。MapReduceスタイルの計算を使用して簡単に実装できる、シンプルな2段階プロトコルGREEDIを開発します。このアプローチを理論的に分析し、特定の自然条件下では集中化アプローチに近いパフォーマンスを実現できることを示します。まず、カーディナリティ制約の下での単調なサブモジュラー最大化から始め、次にこのアプローチを拡張して、マトロイド制約やナップザック制約などのより一般的な制約の下での(必ずしも単調ではない)サブモジュラー最大化の近似保証を取得します。私たちの広範な実験では、スパースガウス過程推論や、Hadoopを使用した数千万の例に基づく例ベースのクラスタリングなど、いくつかのアプリケーションで私たちのアプローチの有効性を実証しています。
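
The two-stage protocol described above can be sketched roughly as follows. This is a minimal single-process illustration, not the authors' implementation: the coverage objective, the random partitioning, and the names `coverage`, `greedy`, and `greedi` are assumptions chosen for the example, and the "machines" are simulated by plain list partitions rather than MapReduce jobs.

```python
import random


def coverage(selected, sets):
    """Monotone submodular objective (assumed for illustration):
    the number of distinct items covered by the selected sets."""
    covered = set()
    for i in selected:
        covered |= sets[i]
    return len(covered)


def greedy(candidates, sets, k):
    """Standard greedy: repeatedly add the candidate with the largest marginal gain."""
    chosen = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        base = coverage(chosen, sets)
        best = max(remaining, key=lambda i: coverage(chosen + [i], sets) - base)
        chosen.append(best)
        remaining.remove(best)
    return chosen


def greedi(sets, k, num_machines, seed=0):
    """Two-stage distributed greedy under a cardinality constraint k."""
    rng = random.Random(seed)
    indices = list(range(len(sets)))
    rng.shuffle(indices)
    # Stage 1 ("map"): each simulated machine runs greedy on its own partition.
    partitions = [indices[m::num_machines] for m in range(num_machines)]
    local_solutions = [greedy(part, sets, k) for part in partitions]
    # Stage 2 ("reduce"): run greedy again on the union of the local solutions.
    merged = [i for solution in local_solutions for i in solution]
    merged_solution = greedy(merged, sets, k)
    # Keep the best of the merged solution and the local solutions.
    return max(local_solutions + [merged_solution], key=lambda s: coverage(s, sets))


example_sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}, {8}, {2, 8, 9}]
solution = greedi(example_sets, k=2, num_machines=2)
print(solution, coverage(solution, example_sets))
```

Only the selected candidates, not the raw partitions, are sent to the second stage, which is what keeps the communication cost small in a genuinely distributed setting.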

Blending Learning and Inference in Conditional Random Fields
条件付きランダムフィールドにおける学習と推論のブレンディング

Conditional random fields maximize the log-likelihood of training labels given the training data, e.g., objects given images. In many cases the training labels are structures that consist of a set of variables and the computational complexity for estimating their likelihood is exponential in the number of the variables. Learning algorithms relax this computational burden using approximate inference that is nested as a sub-procedure. In this paper we describe the objective function for nested learning and inference in conditional random fields. The devised objective maximizes the log-beliefs: probability distributions over subsets of training variables that agree on their marginal probabilities. This objective is concave and consists of two types of variables that are related to the learning and inference tasks respectively. Importantly, we afterwards show how to blend the learning and inference procedure and effectively get to the identical optimum much faster. The proposed algorithm currently achieves the state-of-the-art in various computer vision applications.

条件付きランダムフィールドは、トレーニングデータが与えられた場合(たとえば、画像が与えられたオブジェクト)、トレーニングラベルの対数尤度を最大化します。多くの場合、トレーニングラベルは一連の変数で構成される構造であり、その尤度を推定するための計算の複雑さは、変数の数に対して指数関数的です。学習アルゴリズムは、サブ手順としてネストされた近似推論を使用して、この計算負荷を軽減します。この論文では、条件付きランダムフィールドでのネストされた学習と推論の目的関数について説明します。考案された目的は、限界確率が一致するトレーニング変数のサブセット上の対数確信度(確率分布)を最大化します。この目的は凹型であり、学習タスクと推論タスクにそれぞれ関連する2種類の変数で構成されます。重要なことは、その後、学習と推論の手順を融合して、同じ最適値にはるかに速く効果的に到達する方法を示すことです。提案されたアルゴリズムは、現在、さまざまなコンピュータービジョンアプリケーションで最先端の技術を実現しています。

An Error Bound for L1-norm Support Vector Machine Coefficients in Ultra-high Dimension
超高次元におけるL1ノルムサポートベクトルマシン係数の誤差範囲

Comparing with the standard $L_2$-norm support vector machine (SVM), the $L_1$-norm SVM enjoys the nice property of simultaneously performing classification and feature selection. In this paper, we investigate the statistical performance of $L_1$-norm SVM in ultra-high dimension, where the number of features $p$ grows at an exponential rate of the sample size $n$. Different from existing theory for SVM which has been mainly focused on the generalization error rates and empirical risk, we study the asymptotic behavior of the coefficients of $L_1$-norm SVM. Our analysis reveals that the $L_1$-norm SVM coefficients achieve near oracle rate, that is, with high probability, the $L_2$ error bound of the estimated $L_1$-norm SVM coefficients is of order $O_p(\sqrt{q\log p/n})$, where $q$ is the number of features with nonzero coefficients. Furthermore, we show that if the $L_1$-norm SVM is used as an initial value for a recently proposed algorithm for solving non-convex penalized SVM (Zhang et al., 2016b), then in two iterative steps it is guaranteed to produce an estimator that possesses the oracle property in ultra-high dimension, which in particular implies that with probability approaching one the zero coefficients are estimated as exactly zero. Simulation studies demonstrate the fine performance of $L_1$-norm SVM as a sparse classifier and its effectiveness to be utilized to solve non-convex penalized SVM problems in high dimension.

標準的な$L_2$ノルムのサポートベクターマシン(SVM)と比較して、$L_1$ノルムSVMは分類と特徴選択を同時に実行できるという優れた特性があります。この論文では、特徴の数$p$がサンプルサイズ$n$の指数関数的に増加する超高次元での$L_1$ノルムSVMの統計的パフォーマンスを調査します。主に一般化エラー率と経験的リスクに焦点を当ててきたSVMの既存の理論とは異なり、$L_1$ノルムSVMの係数の漸近挙動を調べます。分析により、$L_1$ノルムSVM係数はほぼオラクル率を達成することが明らかになりました。つまり、推定された$L_1$ノルムSVM係数の$L_2$誤差境界は、高い確率で$O_p(\sqrt{q\log p/n})$のオーダーになります。ここで、$q$は非ゼロ係数を持つ特徴の数です。さらに、最近提案された非凸ペナルティ付きSVMを解決するためのアルゴリズム(Zhangら, 2016b)の初期値として$L_1$ノルムSVMを使用すると、2つの反復ステップで超高次元でオラクル特性を持つ推定量が確実に生成されることを示します。これは特に、確率が1に近づくと、ゼロ係数が正確にゼロとして推定されることを意味します。シミュレーション研究では、スパース分類器としての$L_1$ノルムSVMの優れたパフォーマンスと、高次元の非凸ペナルティ付きSVM問題を解決するために利用できるその有効性が実証されています。

Integrative Analysis using Coupled Latent Variable Models for Individualizing Prognoses
個別化予後のための結合潜在変数モデルを用いた統合解析

Complex chronic diseases (e.g., autism, lupus, and Parkinson’s) are remarkably heterogeneous across individuals. This heterogeneity makes treatment difficult for caregivers because they cannot accurately predict the way in which the disease will progress in order to guide treatment decisions. Therefore, tools that help to predict the trajectory of these complex chronic diseases can help to improve the quality of health care. To build such tools, we can leverage clinical markers that are collected at baseline when a patient first presents and longitudinally over time during follow-up visits. Because complex chronic diseases are typically systemic, the longitudinal markers often track disease progression in multiple organ systems. In this paper, our goal is to predict a function of time that models the future trajectory of a single target clinical marker tracking a disease process of interest. We want to make these predictions using the histories of many related clinical markers as input. Our proposed solution tackles several key challenges. First, we can easily handle irregularly and sparsely sampled markers, which are standard in clinical data. Second, the number of parameters and the computational complexity of learning our model grows linearly in the number of marker types included in the model. This makes our approach applicable to diseases where many different markers are recorded over time. Finally, our model accounts for latent factors influencing disease expression, whereas standard regression models rely on observed features alone to explain variability. Moreover, our approach can be applied dynamically in continuous-time and updates its predictions as soon as any new data is available. We apply our approach to the problem of predicting lung disease trajectories in scleroderma, a complex autoimmune disease. We show that our model improves over state-of-the-art baselines in predictive accuracy and we provide a qualitative analysis of our model’s output. Finally, the variability of disease presentation in scleroderma makes clinical trial recruitment challenging. We show that a prognostic tool that integrates multiple types of routinely collected longitudinal data can be used to identify individuals at greatest risk of rapid progression and to target trial recruitment.

複雑な慢性疾患(自閉症、狼瘡、パーキンソン病など)は、個人によって著しく異質です。この異質性により、介護者は治療の決定を導くために病気の進行方法を正確に予測することができず、治療が困難になります。したがって、これらの複雑な慢性疾患の軌跡を予測するのに役立つツールは、医療の質の向上に役立ちます。このようなツールを構築するには、患者が最初に来院したときにベースラインで収集され、フォローアップ訪問中に時間の経過とともに縦断的に収集される臨床マーカーを活用できます。複雑な慢性疾患は通常全身性であるため、縦断的マーカーは複数の臓器系における病気の進行を追跡することがよくあります。この論文の目標は、関心のある病気のプロセスを追跡する単一のターゲット臨床マーカーの将来の軌跡をモデル化する時間の関数を予測することです。多くの関連する臨床マーカーの履歴を入力として使用して、これらの予測を行いたいと考えています。提案するソリューションは、いくつかの重要な課題に取り組んでいます。まず、臨床データでは標準である、不規則かつまばらにサンプリングされたマーカーを簡単に処理できます。次に、モデルを学習するためのパラメータ数と計算の複雑さは、モデルに含まれるマーカーの種類の数に比例して増加します。これにより、時間の経過とともに多くの異なるマーカーが記録される疾患にこのアプローチを適用できます。最後に、標準的な回帰モデルが変動を説明するために観察された特徴のみに依存するのに対し、このモデルは疾患の発現に影響を及ぼす潜在的要因を考慮します。さらに、このアプローチは連続時間で動的に適用でき、新しいデータが利用可能になるとすぐに予測を更新します。このアプローチを、複雑な自己免疫疾患である強皮症の肺疾患の軌跡を予測する問題に適用します。このモデルは予測精度において最先端のベースラインよりも向上することを示し、モデルの出力の定性分析を提供します。最後に、強皮症の疾患の症状の変動性により、臨床試験への参加が困難になります。定期的に収集される複数の種類の縦断的データを統合する予後ツールを使用することで、急速な進行のリスクが最も高い個人を特定し、試験への参加を絞り込むことができることを示します。

A Characterization of Linkage-Based Hierarchical Clustering
リンケージに基づく階層クラスタリングの特性評価

The class of linkage-based algorithms is perhaps the most popular class of hierarchical algorithms. We identify two properties of hierarchical algorithms, and prove that linkage-based algorithms are the only ones that satisfy both of these properties. Our characterization clearly delineates the difference between linkage-based algorithms and other hierarchical methods. We formulate an intuitive notion of locality of a hierarchical algorithm that distinguishes between linkage-based and global hierarchical algorithms like bisecting $k$-means, and prove that popular divisive hierarchical algorithms produce clusterings that cannot be produced by any linkage-based algorithm.

リンケージ・ベースのアルゴリズムのクラスは、おそらく最も一般的な階層型アルゴリズムのクラスです。階層型アルゴリズムの2つの特性を特定し、リンケージベースのアルゴリズムのみがこれらの特性の両方を満たすものであることを証明します。私たちの特性評価は、リンケージベースのアルゴリズムと他の階層的手法との違いを明確に示しています。私たちは、リンケージベースのアルゴリズムと二等分$k$-meansのようなグローバルな階層アルゴリズムを区別する階層アルゴリズムの局所性の直感的な概念を定式化し、一般的な分割階層アルゴリズムがリンケージベースのアルゴリズムでは生成できないクラスタリングを生成することを証明します。
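
As a concrete instance of what a linkage-based algorithm looks like, here is a minimal sketch of agglomerative clustering with single linkage on 1-D points. It only illustrates the general class discussed in the abstract: the choice of single linkage, the 1-D distance, and the function name `single_linkage` are assumptions for the example, and the paper's two characterizing properties are not checked here.

```python
def single_linkage(points, num_clusters):
    """Agglomerative clustering of 1-D points.

    Starts from singleton clusters and repeatedly merges the pair of clusters
    with the smallest linkage value until num_clusters clusters remain.
    """
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > num_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return sorted(sorted(c) for c in clusters)


clusters = single_linkage([0, 1, 2, 50, 51, 90], num_clusters=3)
print(clusters)
```

Each merge decision depends only on the two clusters being compared, which is the kind of local behaviour that separates this family from global methods such as bisecting $k$-means.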

Learning Latent Variable Models by Pairwise Cluster Comparison: Part II – Algorithm and Evaluation
ペアワイズクラスター比較による潜在変数モデルの学習:パートII – アルゴリズムと評価

It is important for causal discovery to identify any latent variables that govern a problem and the relationships among them, given measurements in the observed world. In Part I of this paper, we were interested in learning a discrete latent variable model (LVM) and introduced the concept of pairwise cluster comparison (PCC) to identify causal relationships from clusters of data points and an overview of a two-stage algorithm for learning PCC (LPCC). First, LPCC learns exogenous latent variables and latent colliders, as well as their observed descendants, by using pairwise comparisons between data clusters in the measurement space that may explain latent causes. Second, LPCC identifies endogenous latent non-colliders with their observed children. In Part I, we showed that if the true graph has no serial connections, then LPCC returns the true graph, and if the true graph has a serial connection, then LPCC returns a pattern of the true graph. In this paper (Part II), we formally introduce the LPCC algorithm that implements the PCC concept. In addition, we thoroughly evaluate LPCC using simulated and real-world data sets in comparison to state-of-the-art algorithms. Besides using three real-world data sets, which have already been tested in learning an LVM, we also evaluate the algorithms using data sets that represent two original problems. The first problem is identifying young drivers’ involvement in road accidents, and the second is identifying cellular subpopulations of the immune system from mass cytometry. The results of our evaluation show that LPCC improves in accuracy with the sample size, can learn large LVMs, and is accurate in learning compared to state-of-the-art algorithms. The code for the LPCC algorithm and data sets used in the experiments reported here are available online.

因果発見では、観測された世界の測定値に基づいて、問題を支配する潜在変数とそれらの関係を識別することが重要です。この論文のパートIでは、離散潜在変数モデル(LVM)の学習に着目し、データポイントのクラスターから因果関係を識別するためのペアワイズクラスター比較(PCC)の概念と、PCCを学習するための2段階アルゴリズム(LPCC)の概要を紹介しました。まず、LPCCは、潜在的な原因を説明できる測定空間内のデータクラスター間のペアワイズ比較を使用して、外生潜在変数と潜在コライダー、およびそれらの観測された子孫を学習します。次に、LPCCは、内生潜在非コライダーとそれらの観測された子孫を識別します。パートIでは、真のグラフにシリアル接続がない場合、LPCCは真のグラフを返し、真のグラフにシリアル接続がある場合、LPCCは真のグラフのパターンを返すことを示しました。この論文(パートII)では、PCCの概念を実装するLPCCアルゴリズムを正式に紹介します。さらに、シミュレーションと実際のデータセットを使用してLPCCを徹底的に評価し、最先端のアルゴリズムと比較しました。LVMの学習ですでにテストされている3つの実際のデータセットを使用するほかに、2つの本来の問題を表すデータセットを使用してアルゴリズムを評価しました。最初の問題は、若いドライバーの交通事故への関与を特定することであり、2つ目の問題は、質量サイトメトリーから免疫系の細胞サブポピュレーションを特定することです。評価の結果、LPCCはサンプルサイズに応じて精度が向上し、大規模なLVMを学習でき、最先端のアルゴリズムと比較して学習が正確であることがわかりました。ここで報告されている実験で使用されたLPCCアルゴリズムのコードとデータセットは、オンラインで入手できます。

A Practical Scheme and Fast Algorithm to Tune the Lasso With Optimality Guarantees
最適な保証でラッソを調整するための実用的なスキームと高速アルゴリズム

We introduce a novel scheme for choosing the regularization parameter in high-dimensional linear regression with Lasso. This scheme, inspired by Lepski’s method for bandwidth selection in non-parametric regression, is equipped with both optimal finite-sample guarantees and a fast algorithm. In particular, for any design matrix such that the Lasso has low sup-norm error under an “oracle choice” of the regularization parameter, we show that our method matches the oracle performance up to a small constant factor, and show that it can be implemented by performing simple tests along a single Lasso path. By applying the Lasso to simulated and real data, we find that our novel scheme can be faster and more accurate than standard schemes such as Cross-Validation.

私たちは、Lassoを使用した高次元線形回帰の正則化パラメータを選択するための新しいスキームを紹介します。このスキームは、ノンパラメトリック回帰における帯域幅選択のためのレプスキーの方法に触発され、最適な有限サンプル保証と高速アルゴリズムの両方を備えています。特に、正則化パラメーターの”オラクル選択”の下でLassoのsup-norm誤差が低いような設計行列では、この手法がオラクルのパフォーマンスを小さな定数因子まで一致させることを示し、1つのLassoパスに沿って簡単なテストを実行することで実装できることを示します。Lassoをシミュレーションデータと実際のデータに適用することで、新しいスキームは、Cross-Validationなどの標準スキームよりも高速で正確であることがわかりました。

Linear Convergence of Randomized Feasible Descent Methods Under the Weak Strong Convexity Assumption
弱強凸性仮定の下でのランダム化実現可能降下法の線形収束

In this paper we generalize the framework of the Feasible Descent Method (FDM) to a Randomized (R-FDM) and a Randomized Coordinate-wise Feasible Descent Method (RC-FDM) framework. We show that many machine learning algorithms, including the famous SDCA algorithm for optimizing the SVM dual problem, or the stochastic coordinate descent method for the LASSO problem, fit into the framework of RC-FDM. We prove linear convergence for both R-FDM and RC-FDM under the weak strong convexity assumption. Moreover, we show that the duality gap converges linearly for RC-FDM, which implies that the duality gap also converges linearly for SDCA applied to the SVM dual problem.

この論文では、フィージブルディセント法(FDM)のフレームワークを、ランダム化(R-FDM)およびランダム化座標ワイズフィージブルディセント法(RC-FDM)フレームワークに一般化します。SVM双対問題を最適化するための有名なSDCAアルゴリズムやLASSO問題の確率的座標降下法など、多くの機械学習アルゴリズムがRC-FDMのフレームワークに適合することを示します。R-FDMとRC-FDMの両方について、弱強凸性仮定の下で線形収束を証明します。さらに、RC-FDMでは双対性ギャップが線形に収束することを示しており、これはSVM双対問題に適用されたSDCAでも双対性ギャップが線形に収束することを意味します。
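
One of the algorithms the abstract places inside the RC-FDM framework is stochastic coordinate descent for the LASSO. The following is a minimal pure-Python sketch of that method for the objective $\frac{1}{2}\|y - Xw\|_2^2 + \lambda\|w\|_1$. The objective scaling, the uniform coordinate sampling, and all function names are assumptions for illustration; the sketch does not reproduce the paper's analysis or its convergence-rate constants.

```python
import random


def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z towards zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_objective(X, y, w, lam):
    """0.5 * ||y - Xw||^2 + lam * ||w||_1 for list-of-rows X."""
    residuals = [yi - sum(xij * wj for xij, wj in zip(row, w)) for row, yi in zip(X, y)]
    return 0.5 * sum(r * r for r in residuals) + lam * sum(abs(wj) for wj in w)


def randomized_cd_lasso(X, y, lam, num_steps=2000, seed=0):
    """Randomized coordinate descent with exact minimization per coordinate."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # equals y - Xw for the initial w = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(d)]
    for _ in range(num_steps):
        j = rng.randrange(d)  # pick one coordinate uniformly at random
        if col_sq[j] == 0.0:
            continue
        # Correlation of column j with the residual that excludes coordinate j.
        rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n))
        new_wj = soft_threshold(rho, lam) / col_sq[j]
        delta = new_wj - w[j]
        if delta != 0.0:
            for i in range(n):
                residual[i] -= X[i][j] * delta
            w[j] = new_wj
    return w


X_demo = [[1.0, 0.5], [0.2, 1.0], [0.7, 0.3]]
y_demo = [1.0, 2.0, 0.5]
w_demo = randomized_cd_lasso(X_demo, y_demo, lam=0.1)
print(w_demo, lasso_objective(X_demo, y_demo, w_demo, 0.1))
```

Because every step minimizes the objective exactly along one coordinate, the objective value never increases from one iteration to the next.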

Gains and Losses are Fundamentally Different in Regret Minimization: The Sparse Case
後悔の最小化における利益と損失は根本的に異なる:スパースケース

We demonstrate that, in the classical non-stochastic regret minimization problem with $d$ decisions, gains and losses to be respectively maximized or minimized are fundamentally different. Indeed, by considering the additional sparsity assumption (at each stage, at most $s$ decisions incur a nonzero outcome), we derive optimal regret bounds of different orders. Specifically, with gains, we obtain an optimal regret guarantee after $T$ stages of order $\sqrt{T\log s}$, so the classical dependency in the dimension is replaced by the sparsity size. With losses, we provide matching upper and lower bounds of order $\sqrt{Ts\log(d)/d}$, which is decreasing in $d$. Eventually, we also study the bandit setting, and obtain an upper bound of order $\sqrt{Ts\log (d/s)}$ when outcomes are losses. This bound is proven to be optimal up to the logarithmic factor $\sqrt{\log(d/s)}$.

私たちは、$d$個の決定を伴う古典的な非確率的後悔最小化問題では、それぞれ最大化または最小化される利益と損失が根本的に異なることを実証します。実際、追加のスパース性の仮定(各ステージで、最大で$s$個の決定がゼロ以外の結果を引き起こす)を考慮することで、異なるオーダーの最適な後悔境界を導き出します。具体的には、利益の場合、$T$段階後に$\sqrt{T\log s}$のオーダーの最適な後悔保証が得られるため、次元への古典的な依存性はスパース性のサイズに置き換えられます。損失の場合、$\sqrt{Ts\log(d)/d}$のオーダーの一致する上限と下限を示します。これは$d$について減少します。最後に、バンディット設定も研究し、結果が損失である場合に$\sqrt{Ts\log (d/s)}$のオーダーの上限を得ます。この境界は、対数因子$\sqrt{\log(d/s)}$を除いて最適であることが証明されています。
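
For context on the "classical dependency in the dimension" mentioned in the abstract, the sketch below runs the classical exponential weights forecaster on an $s$-sparse loss sequence and compares its regret with the standard $\sqrt{T\log(d)/2}$ guarantee. This is the textbook baseline, not the sparsity-adapted procedure of the paper; the loss generator, the learning rate $\eta = \sqrt{8\log(d)/T}$, and the function names are assumptions for illustration.

```python
import math
import random


def sparse_loss_rounds(T, d, s, seed=0):
    """Hypothetical loss sequence: in each round exactly s of the d decisions incur loss 1."""
    rng = random.Random(seed)
    rounds = []
    for _ in range(T):
        losses = [0.0] * d
        for i in rng.sample(range(d), s):
            losses[i] = 1.0
        rounds.append(losses)
    return rounds


def exponential_weights(loss_rounds, d, eta):
    """Return the regret of the exponential weights forecaster in expectation
    over its own randomization (the loss of the played mixture)."""
    log_weights = [0.0] * d
    cumulative = [0.0] * d
    total_expected_loss = 0.0
    for losses in loss_rounds:
        # Normalize in log space for numerical stability.
        m = max(log_weights)
        weights = [math.exp(lw - m) for lw in log_weights]
        z = sum(weights)
        probs = [w / z for w in weights]
        total_expected_loss += sum(p * l for p, l in zip(probs, losses))
        for i, l in enumerate(losses):
            cumulative[i] += l
            log_weights[i] -= eta * l
    return total_expected_loss - min(cumulative)


T, d, s = 500, 10, 2
rounds = sparse_loss_rounds(T, d, s, seed=1)
eta = math.sqrt(8.0 * math.log(d) / T)
regret = exponential_weights(rounds, d, eta)
classical_bound = math.sqrt(T * math.log(d) / 2.0)
print(regret, classical_bound)
```

The classical bound depends on $d$ only through $\log d$ and ignores $s$ entirely, which is the gap the sparse analysis in the paper addresses.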

Approximate Newton Methods for Policy Search in Markov Decision Processes
マルコフ決定過程における方策探索のための近似ニュートン法

Approximate Newton methods are standard optimization tools which aim to maintain the benefits of Newton’s method, such as a fast rate of convergence, while alleviating its drawbacks, such as computationally expensive calculation or estimation of the inverse Hessian. In this work we investigate approximate Newton methods for policy optimization in Markov decision processes (MDPs). We first analyse the structure of the Hessian of the total expected reward, which is a standard objective function for MDPs. We show that, like the gradient, the Hessian exhibits useful structure in the context of MDPs and we use this analysis to motivate two Gauss-Newton methods for MDPs. Like the Gauss-Newton method for non-linear least squares, these methods drop certain terms in the Hessian. The approximate Hessians possess desirable properties, such as negative definiteness, and we demonstrate several important performance guarantees including guaranteed ascent directions, invariance to affine transformation of the parameter space and convergence guarantees. We finally provide a unifying perspective of key policy search algorithms, demonstrating that our second Gauss-Newton algorithm is closely related to both the EM-algorithm and natural gradient ascent applied to MDPs, but performs significantly better in practice on a range of challenging domains.

近似ニュートン法は、高速収束などのニュートン法の利点を維持しながら、逆ヘッセ行列の計算や推定にコストがかかるなどの欠点を軽減することを目的とした標準的な最適化ツールです。この研究では、マルコフ決定プロセス(MDP)のポリシー最適化のための近似ニュートン法を調査します。まず、MDPの標準的な目的関数である総期待報酬のヘッセ行列の構造を分析します。勾配と同様に、ヘッセ行列はMDPのコンテキストで有用な構造を示すことを示し、この分析を使用してMDP用の2つのガウス-ニュートン法を考案します。非線形最小二乗法のガウス-ニュートン法と同様に、これらの方法ではヘッセ行列の特定の項が削除されます。近似ヘッセ行列は、負の定値性などの望ましい特性を備えており、保証された上昇方向、パラメーター空間のアフィン変換に対する不変性、収束保証など、いくつかの重要なパフォーマンス保証を示します。最後に、主要なポリシー検索アルゴリズムの統一的な観点を提供し、2番目のガウスニュートンアルゴリズムがEMアルゴリズムとMDPに適用される自然勾配上昇法の両方に密接に関連しているが、さまざまな困難なドメインで実際には大幅に優れたパフォーマンスを発揮することを示します。

Scalable Approximate Bayesian Inference for Outlier Detection under Informative Sampling
情報サンプリング下での外れ値検出のためのスケーラブルな近似ベイズ推論

Government surveys of business establishments receive a large volume of submissions where a small subset contain errors. Analysts need a fast-computing algorithm to flag this subset due to a short time window between collection and reporting. We offer a computationally-scalable optimization method based on non-parametric mixtures of hierarchical Dirichlet processes that allows discovery of multiple industry-indexed local partitions linked to a set of global cluster centers. Outliers are nominated as those clusters containing few observations. We extend an existing approach with a new merge step that reduces sensitivity to hyperparameter settings. Survey data are typically acquired under an informative sampling design where the probability of inclusion depends on the surveyed response such that the distribution for the observed sample is different from the population. We extend the derivation of a penalized objective function to use a pseudo-posterior that incorporates sampling weights that undo the informative design. We provide a simulation study to demonstrate that our approach produces unbiased estimation for the outlying cluster under informative sampling. The method is applied for outlier nomination for the Current Employment Statistics survey conducted by the Bureau of Labor Statistics.

政府による企業調査では、小さなサブセットにエラーが含まれる大量の回答が寄せられます。収集から報告までの期間が短いため、アナリストは、このサブセットにフラグを立てるための高速コンピューティングアルゴリズムを必要とします。私たちは、階層的ディリクレ過程のノンパラメトリック混合に基づく、計算的にスケーラブルな最適化手法を提供します。これにより、一連のグローバルクラスターセンターにリンクされた、複数の業界インデックス付きローカルパーティションを検出できます。外れ値は、観測値が少ないクラスターとして指定されます。私たちは、ハイパーパラメータ設定に対する感度を下げる新しいマージステップで既存のアプローチを拡張します。調査データは通常、調査対象の回答に応じて含まれる確率が決まる有益なサンプリング設計で取得されるため、観測サンプルの分布は母集団と異なります。私たちは、有益な設計を元に戻すサンプリング重みを組み込んだ擬似事後分布を使用するように、ペナルティ付き目的関数の導出を拡張します。私たちは、有益なサンプリングで外れ値のクラスターに対して私たちのアプローチが偏りのない推定値を生成することを示すシミュレーション研究を提供します。この方法は、労働統計局が実施する現在の雇用統計調査の外れ値指定に適用されます。

GenSVM: A Generalized Multiclass Support Vector Machine
GenSVM:一般化された多クラスサポートベクターマシン

Traditional extensions of the binary support vector machine (SVM) to multiclass problems are either heuristics or require solving a large dual optimization problem. Here, a generalized multiclass SVM is proposed called GenSVM. In this method classification boundaries for a $K$-class problem are constructed in a $(K-1)$-dimensional space using a simplex encoding. Additionally, several different weightings of the misclassification errors are incorporated in the loss function, such that it generalizes three existing multiclass SVMs through a single optimization problem. An iterative majorization algorithm is derived that solves the optimization problem without the need of a dual formulation. This algorithm has the advantage that it can use warm starts during cross validation and during a grid search, which significantly speeds up the training phase. Rigorous numerical experiments compare linear GenSVM with seven existing multiclass SVMs on both small and large data sets. These comparisons show that the proposed method is competitive with existing methods in both predictive accuracy and training time, and that it significantly outperforms several existing methods on these criteria.

バイナリサポートベクターマシン(SVM)のマルチクラス問題への従来の拡張は、ヒューリスティックであるか、大規模なデュアル最適化問題を解決することを必要とします。ここでは、GenSVMと呼ばれる一般化されたマルチクラスSVMが提案されています。この方法では、$K$クラス問題の分類境界が、単体エンコーディングを使用して$(K-1)$次元空間に構築されます。さらに、誤分類エラーのいくつかの異なる重み付けが損失関数に組み込まれているため、3つの既存のマルチクラスSVMが単一の最適化問題を通じて一般化されます。デュアル定式化を必要とせずに最適化問題を解決する反復メジャー化アルゴリズムが導出されます。このアルゴリズムの利点は、クロス検証中およびグリッド検索中にウォームスタートを使用できることです。これにより、トレーニングフェーズが大幅に高速化されます。厳密な数値実験では、小規模および大規模なデータセットの両方で、線形GenSVMと7つの既存のマルチクラスSVMを比較します。これらの比較により、提案された方法は予測精度とトレーニング時間の両方で既存の方法と競合し、これらの基準でいくつかの既存の方法を大幅に上回っていることがわかります。

Learning Latent Variable Models by Pairwise Cluster Comparison: Part I – Theory and Overview
ペアワイズクラスター比較による潜在変数モデルの学習:パートI – 理論と概要

Identification of latent variables that govern a problem and the relationships among them, given measurements in the observed world, is important for causal discovery. This identification can be accomplished by analyzing the constraints imposed by the latents in the measurements. We introduce the concept of pairwise cluster comparison (PCC) to identify causal relationships from clusters of data points and provide a two-stage algorithm called learning PCC (LPCC) that learns a latent variable model (LVM) using PCC. First, LPCC learns exogenous latents and latent colliders, as well as their observed descendants, by using pairwise comparisons between data clusters in the measurement space that may explain latent causes. Since in this first stage LPCC cannot distinguish endogenous latent non-colliders from their exogenous ancestors, a second stage is needed to extract the former, with their observed children, from the latter. If the true graph has no serial connections, LPCC returns the true graph, and if the true graph has a serial connection, LPCC returns a pattern of the true graph. LPCC’s most important advantage is that it is not limited to linear or latent-tree models and makes only mild assumptions about the distribution. The paper is divided in two parts: Part I (this paper) provides the necessary preliminaries, theoretical foundation to PCC, and an overview of LPCC; Part II formally introduces the LPCC algorithm and experimentally evaluates its merit in different synthetic and real domains. The code for the LPCC algorithm and data sets used in the experiments reported in Part II are available online.

観測された世界の測定値を前提として、問題を支配する潜在変数とそれらの関係を特定することは、因果関係の発見にとって重要です。この特定は、測定値の潜在変数によって課せられる制約を分析することで実現できます。データポイントのクラスターから因果関係を特定するために、ペアワイズクラスター比較(PCC)の概念を導入し、PCCを使用して潜在変数モデル(LVM)を学習する2段階アルゴリズムである学習PCC (LPCC)を提供します。まず、LPCCは、潜在的な原因を説明できる測定空間内のデータクラスター間のペアワイズ比較を使用して、外生潜在変数と潜在コライダー、およびそれらの観測された子孫を学習します。この最初の段階では、LPCCは内生潜在非コライダーを外生祖先と区別できないため、前者をその観測された子とともに後者から抽出する第2段階が必要です。真のグラフにシリアル接続がない場合、LPCCは真のグラフを返し、真のグラフにシリアル接続がある場合、LPCCは真のグラフのパターンを返します。LPCCの最も重要な利点は、線形モデルや潜在木モデルに限定されず、分布について軽度の仮定のみを行うことです。この論文は2つの部分に分かれています。パートI (この論文)では、必要な準備、PCCの理論的基礎、およびLPCCの概要を示します。パートIIでは、LPCCアルゴリズムを正式に紹介し、さまざまな合成領域と実際の領域でそのメリットを実験的に評価します。パートIIで報告された実験で使用されたLPCCアルゴリズムのコードとデータセットは、オンラインで入手できます。

Composite Multiclass Losses
複合マルチクラス損失

We consider loss functions for multiclass prediction problems. We show when a multiclass loss can be expressed as a proper composite loss, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We subsume results on “classification calibration” by relating it to properness. We determine the stationarity condition, Bregman representation, order-sensitivity, and quasi-convexity of multiclass proper losses. We then characterise the existence and uniqueness of the composite representation for multiclass losses. We show how the composite representation is related to other core properties of a loss: mixability, admissibility and (strong) convexity of multiclass losses which we characterise in terms of the Hessian of the Bayes risk. We show that the simple integral representation for binary proper losses cannot be extended to multiclass losses but offer concrete guidance regarding how to design different loss functions. The conclusion drawn from these results is that the proper composite representation is a natural and convenient tool for the design of multiclass loss functions.

私たちは、マルチクラス予測問題に対する損失関数について考察します。マルチクラス損失が、適切な損失とリンク関数の合成である適切な複合損失として表現できる場合を示します。バイナリ損失に関する既存の結果をマルチクラス損失に拡張します。適切性と関連付けることで、「分類キャリブレーション」に関する結果を包含します。マルチクラスの適切な損失の定常性条件、ブレグマン表現、順序感度、および準凸性を決定します。次に、マルチクラス損失の複合表現の存在と一意性を特徴付けます。複合表現が損失の他のコア特性、すなわち、ベイズリスクのヘッシアンの観点から特徴付けられるマルチクラス損失の混合可能性、許容性、および（強い）凸性とどのように関係するかを示します。バイナリの適切な損失の単純な積分表現をマルチクラス損失に拡張することはできないことを示しますが、異なる損失関数の設計方法に関する具体的なガイダンスを提供します。これらの結果から導かれる結論は、適切な複合表現がマルチクラス損失関数の設計のための自然で便利なツールであるということです。

Stability and Generalization in Structured Prediction
構造化予測における安定性と一般化

Structured prediction models have been found to learn effectively from a few large examples, sometimes even just one. Despite empirical evidence, canonical learning theory cannot guarantee generalization in this setting because the error bounds decrease as a function of the number of examples. We therefore propose new PAC-Bayesian generalization bounds for structured prediction that decrease as a function of both the number of examples and the size of each example. Our analysis hinges on the stability of joint inference and the smoothness of the data distribution. We apply our bounds to several common learning scenarios, including max-margin and soft-max training of Markov random fields. Under certain conditions, the resulting error bounds can be far more optimistic than previous results and can even guarantee generalization from a single large example.

構造化された予測モデルは、いくつかの大きな例から効果的に学習することがわかっています—時には1つだけの例からでも学習します。経験的な証拠にもかかわらず、正準学習理論は、例の数の関数として誤差範囲が減少するため、この設定での一般化を保証できません。したがって、例の数と各例のサイズの両方の関数として減少する構造化予測の新しいPAC-ベイズ一般化境界を提案します。私たちの分析は、結合推論の安定性とデータ分布の滑らかさにかかっています。私たちは、マルコフ確率場のmax-marginやsoft-maxの学習など、いくつかの一般的な学習シナリオに境界を適用します。特定の条件下では、結果として得られる誤差の範囲は、以前の結果よりもはるかに楽観的であり、1つの大きな例からの一般化を保証することさえあります。

RLScore: Regularized Least-Squares Learners
RLScore: 正則化最小二乗学習器

RLScore is a Python open source module for kernel based machine learning. The library provides implementations of several regularized least-squares (RLS) type of learners. RLS methods for regression and classification, ranking, greedy feature selection, multi-task and zero-shot learning, and unsupervised classification are included. Matrix algebra based computational short-cuts are used to ensure efficiency of both training and cross-validation. A simple API and extensive tutorials allow for easy use of RLScore.

RLScoreは、カーネルベースの機械学習用のPythonオープンソースモジュールです。このライブラリは、いくつかの正則化最小二乗法(RLS)タイプの学習器の実装を提供します。回帰と分類、ランク付け、貪欲な特徴選択、マルチタスク学習とゼロショット学習、教師なし分類のためのRLS法が含まれています。行列代数ベースの計算ショートカットは、トレーニングと交差検証の両方の効率を確保するために使用されます。シンプルなAPIと広範なチュートリアルにより、RLScoreを簡単に使用できます。
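
The abstract mentions matrix-algebra shortcuts that make cross-validation efficient. The sketch below illustrates one well-known shortcut of this kind, the closed-form leave-one-out residual for regularized least squares, in the simplest possible setting (one feature, no intercept). It deliberately does not use RLScore's API: the function names are hypothetical, and the code only checks the identity against naive refitting.

```python
def ridge_fit(xs, ys, lam):
    """One-feature regularized least squares without intercept:
    w = <x, y> / (<x, x> + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def loo_residuals_naive(xs, ys, lam):
    """Leave-one-out residuals by refitting n times (the slow way)."""
    residuals = []
    for i in range(len(xs)):
        w = ridge_fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)
        residuals.append(ys[i] - w * xs[i])
    return residuals


def loo_residuals_shortcut(xs, ys, lam):
    """Leave-one-out residuals from a single fit.

    Uses the identity e_i^(loo) = e_i / (1 - h_ii), where e_i is the ordinary
    residual and h_ii = x_i^2 / (<x, x> + lam) is the i-th diagonal entry of
    the hat matrix.
    """
    w = ridge_fit(xs, ys, lam)
    denom = sum(x * x for x in xs) + lam
    return [(y - w * x) / (1.0 - x * x / denom) for x, y in zip(xs, ys)]


xs_demo = [1.0, 2.0, 3.0, 4.0]
ys_demo = [1.1, 1.9, 3.2, 3.9]
print(loo_residuals_naive(xs_demo, ys_demo, lam=0.5))
print(loo_residuals_shortcut(xs_demo, ys_demo, lam=0.5))
```

The shortcut needs one fit instead of $n$, and the same idea carries over to the multivariate and kernel cases through the diagonal of the hat matrix.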

Weak Convergence Properties of Constrained Emphatic Temporal-difference Learning with Constant and Slowly Diminishing Stepsize
一定で緩やかに減少するステップサイズを持つ制約付き強調時間差分学習の弱い収束特性

We consider the emphatic temporal-difference (TD) algorithm, ETD($\lambda$), for learning the value functions of stationary policies in a discounted, finite state and action Markov decision process. The ETD($\lambda$) algorithm was recently proposed by Sutton, Mahmood, and White (2016) to solve a long-standing divergence problem of the standard TD algorithm when it is applied to off-policy training, where data from an exploratory policy are used to evaluate other policies of interest. The almost sure convergence of ETD($\lambda$) has been proved in our recent work under general off-policy training conditions, but for a narrow range of diminishing stepsize. In this paper we present convergence results for constrained versions of ETD($\lambda$) with constant stepsize and with diminishing stepsize from a broad range. Our results characterize the asymptotic behavior of the trajectory of iterates produced by those algorithms, and are derived by combining key properties of ETD($\lambda$) with powerful convergence theorems from the weak convergence methods in stochastic approximation theory. For the case of constant stepsize, in addition to analyzing the behavior of the algorithms in the limit as the stepsize parameter approaches zero, we also analyze their behavior for a fixed stepsize and bound the deviations of their averaged iterates from the desired solution. These results are obtained by exploiting the weak Feller property of the Markov chains associated with the algorithms, and by using ergodic theorems for weak Feller Markov chains, in conjunction with the convergence results we get from the weak convergence methods. Besides ETD($\lambda$), our analysis also applies to the off-policy TD($\lambda$) algorithm, when the divergence issue is avoided by setting $\lambda$ sufficiently large. It yields, for that case, new results on the asymptotic convergence properties of constrained off-policy TD($\lambda$) with constant or slowly diminishing stepsize.

私たちは、割引された有限状態および行動マルコフ決定過程における定常ポリシーの価値関数を学習するための強調時間差分(TD)アルゴリズムETD($\lambda$)を検討します。ETD($\lambda$)アルゴリズムは、探索的ポリシーのデータを使用して他の関心のあるポリシーを評価するオフポリシートレーニングに標準TDアルゴリズムを適用した場合の長年の発散問題を解決するために、最近Sutton、Mahmood、およびWhite (2016)によって提案されました。ETD($\lambda$)のほぼ確実な収束は、一般的なオフポリシートレーニング条件下での最近の研究で証明されているが、狭い範囲の減少ステップサイズに対してです。この論文では、一定のステップサイズと広い範囲からの減少ステップサイズでのETD($\lambda$)の制約付きバージョンの収束結果を示す。我々の結果は、それらのアルゴリズムによって生成される反復の軌跡の漸近的動作を特徴付けるものであり、ETD($\lambda$)の主要な特性と、確率近似理論における弱収束法からの強力な収束定理とを組み合わせることによって導き出されたものです。一定のステップサイズの場合、ステップサイズパラメータがゼロに近づくときの極限におけるアルゴリズムの動作を分析することに加えて、固定ステップサイズに対する動作も分析し、平均反復の目的の解からの偏差を制限した。これらの結果は、アルゴリズムに関連付けられたマルコフ連鎖の弱フェラー特性を利用し、弱フェラーマルコフ連鎖のエルゴード定理を弱収束法から得られる収束結果と組み合わせて使用することで得られたものです。ETD($\lambda$)に加えて、我々の分析は、$\lambda$を十分に大きく設定することで発散の問題を回避できる場合、オフポリシーTD($\lambda$)アルゴリズムにも適用できます。この場合、一定または徐々に減少するステップサイズを持つ制約付きオフポリシーTD($\lambda$)の漸近収束特性に関する新しい結果が得られます。

On Bayes Risk Lower Bounds
ベイズリスク下限について

This paper provides a general technique for lower bounding the Bayes risk of statistical estimation, applicable to arbitrary loss functions and arbitrary prior distributions. A lower bound on the Bayes risk not only serves as a lower bound on the minimax risk, but also characterizes the fundamental limit of any estimator given the prior knowledge. Our bounds are based on the notion of $f$-informativity (Csiszár, 1972), which is a function of the underlying class of probability measures and the prior. Application of our bounds requires upper bounds on the $f$-informativity, thus we derive new upper bounds on $f$-informativity which often lead to tight Bayes risk lower bounds. Our technique leads to generalizations of a variety of classical minimax bounds (e.g., generalized Fano’s inequality). Our Bayes risk lower bounds can be directly applied to several concrete estimation problems, including Gaussian location models, generalized linear models, and principal component analysis for spiked covariance models. To further demonstrate the applications of our Bayes risk lower bounds to machine learning problems, we present two new theoretical results: (1) a precise characterization of the minimax risk of learning spherical Gaussian mixture models under the smoothed analysis framework, and (2) lower bounds for the Bayes risk under a natural prior for both the prediction and estimation errors for high-dimensional sparse linear regression under an improper learning setting.

この論文では、統計的推定のベイズリスクの下限を設定するための一般的な手法を提示します。この手法は、任意の損失関数と任意の事前分布に適用できます。ベイズリスクの下限は、ミニマックスリスクの下限として機能するだけでなく、事前知識が与えられた場合の推定量の基本的限界を特徴付けます。この下限は、基礎となる確率測度のクラスと事前分布の関数である$f$-情報性(Csiszár、1972)の概念に基づいています。この下限を適用するには、$f$-情報性の上限が必要であるため、$f$-情報性の新しい上限を導出します。この上限は、多くの場合、タイトなベイズリスク下限につながります。この手法により、さまざまな古典的なミニマックス境界(一般化ファノの不等式など)が一般化されます。ベイズリスクの下限は、ガウス位置モデル、一般化線形モデル、スパイク共分散モデルの主成分分析など、いくつかの具体的な推定問題に直接適用できます。ベイズリスク下限値の機械学習問題への応用をさらに実証するために、2つの新しい理論的結果を提示します。(1)平滑化分析フレームワークの下での球状ガウス混合モデルの学習のミニマックスリスクの正確な特徴付け、(2)不適切な学習設定の下での高次元スパース線形回帰の予測誤差と推定誤差の両方に対する自然な事前分布の下でのベイズリスクの下限値です。

Multi-scale Classification using Localized Spatial Depth
局所的な空間深度を用いたマルチスケール分類

In this article, we develop and investigate a new classifier based on features extracted using spatial depth. Our construction is based on fitting a generalized additive model to posterior probabilities of different competing classes. To cope with possible multi-modal as well as non-elliptic nature of the population distribution, we also develop a localized version of spatial depth and use that with varying degrees of localization to build the classifier. Final classification is done by aggregating several posterior probability estimates, each of which is obtained using this localized spatial depth with a fixed scale of localization. The proposed classifier can be conveniently used even when the dimension of the data is larger than the sample size, and its good discriminatory power for such data has been established using theoretical as well as numerical results.

この記事では、空間深度を使用して抽出された特徴に基づいて、新しい分類器を開発および調査します。私たちの構成は、一般化加法モデルを異なる競合するクラスの事後確率に当てはめることに基づいています。母集団分布のマルチモーダルおよび非楕円的な性質に対処するために、空間深度のローカライズされたバージョンも開発し、それをさまざまな程度のローカリゼーションで使用して分類器を構築します。最終的な分類は、いくつかの事後確率推定値を集約することによって行われ、それぞれがこのローカライズされた空間深度と固定スケールのローカリゼーションを使用して取得されます。提案された分類器は、データの次元がサンプルサイズよりも大きい場合でも便利に使用でき、そのようなデータに対する優れた識別力は、理論的な結果と数値的な結果を使用して確立されています。

Bayesian Decision Process for Cost-Efficient Dynamic Ranking via Crowdsourcing
クラウドソーシングによるコスト効率の高い動的ランキングのためのベイズ決定プロセス

Rank aggregation based on pairwise comparisons over a set of items has a wide range of applications. Although considerable research has been devoted to the development of rank aggregation algorithms, one basic question is how to efficiently collect a large amount of high-quality pairwise comparisons for the ranking purpose. Because of the advent of many crowdsourcing services, a crowd of workers are often hired to conduct pairwise comparisons with a small monetary reward for each pair they compare. Since different workers have different levels of reliability and different pairs have different levels of ambiguity, it is desirable to wisely allocate the limited budget for comparisons among the pairs of items and workers so that the global ranking can be accurately inferred from the comparison results. To this end, we model the active sampling problem in crowdsourced ranking as a Bayesian Markov decision process, which dynamically selects item pairs and workers to improve the ranking accuracy under a budget constraint. We further develop a computationally efficient sampling policy based on knowledge gradient as well as a moment matching technique for posterior approximation. Experimental evaluations on both synthetic and real data show that the proposed policy achieves high ranking accuracy with a lower labeling cost.

アイテムのセットに対するペアワイズ比較に基づくランク集約は、幅広い用途があります。ランク集約アルゴリズムの開発には相当な研究が費やされてきましたが、1つの基本的な問題は、ランキングの目的で大量の高品質のペアワイズ比較を効率的に収集する方法です。多くのクラウドソーシングサービスの出現により、ペアワイズ比較を行うために、比較するペアごとに少額の金銭的報酬で一群の作業者が雇われることがよくあります。作業者によって信頼性のレベルが異なり、ペアによって曖昧さのレベルが異なるため、比較結果からグローバルランキングを正確に推測できるように、アイテムと作業者のペア間の比較の限られた予算を賢く割り当てることが望ましいです。この目的のために、クラウドソーシングランキングにおけるアクティブサンプリング問題をベイジアンマルコフ決定プロセスとしてモデル化します。このプロセスでは、予算制約下でアイテムのペアと作業者を動的に選択してランキングの精度を向上させます。さらに、知識勾配に基づく計算効率の高いサンプリングポリシーと、事後近似のためのモーメントマッチング手法を開発します。合成データと実際のデータの両方に対する実験的評価により、提案されたポリシーは、ラベル付けコストを抑えながら高いランキング精度を実現することが示されました。

Newton-Stein Method: An Optimization Method for GLMs via Stein’s Lemma
ニュートン・スタイン法: スタインの補題による GLM の最適化法

We consider the problem of efficiently computing the maximum likelihood estimator in Generalized Linear Models (GLMs) when the number of observations is much larger than the number of coefficients ($n \gg p \gg 1$). In this regime, optimization algorithms can immensely benefit from approximate second order information. We propose an alternative way of constructing the curvature information by formulating it as an estimation problem and applying a Stein-type lemma, which allows further improvements through sub-sampling and eigenvalue thresholding. Our algorithm enjoys fast convergence rates, resembling that of second order methods, with modest per-iteration cost. We provide its convergence analysis for the general case where the rows of the design matrix are samples from a sub-Gaussian distribution. We show that the convergence has two phases, a quadratic phase followed by a linear phase. Finally, we empirically demonstrate that our algorithm achieves the highest performance compared to various optimization algorithms on several data sets.

私たちは、観測数が係数の数よりはるかに多い場合($n \gg p \gg 1$)、一般化線形モデル(GLM)で最大尤度推定量を効率的に計算する問題について検討します。この状況では、最適化アルゴリズムは近似的な2次情報から大きなメリットを得ることができます。曲率情報を構築する別の方法として、推定問題として定式化し、サブサンプリングと固有値のしきい値設定によってさらに改善できるStein型の補題を適用する方法を提案します。このアルゴリズムは、2次法に似た高速収束率を誇り、反復あたりのコストは控えめです。設計行列の行がサブガウス分布からのサンプルである一般的なケースについて、収束分析を提供します。収束には2つのフェーズ、つまり2次フェーズとそれに続く線形フェーズがあることを示します。最後に、いくつかのデータセットで、さまざまな最適化アルゴリズムと比較して、このアルゴリズムが最高のパフォーマンスを達成することを実証します。

Learning Planar Ising Models
平面イジングモデルの学習

Inference and learning of graphical models are both well-studied problems in statistics and machine learning that have found many applications in science and engineering. However, exact inference is intractable in general graphical models, which suggests the problem of seeking the best approximation to a collection of random variables within some tractable family of graphical models. In this paper, we focus on the class of planar Ising models, for which exact inference is tractable using techniques of statistical physics. Based on these techniques and recent methods for planarity testing and planar embedding, we propose a greedy algorithm for learning the best planar Ising model to approximate an arbitrary collection of binary random variables (possibly from sample data). Given the set of all pairwise correlations among variables, we select a planar graph and optimal planar Ising model defined on this graph to best approximate that set of correlations. We demonstrate our method in simulations and for two applications: modeling senate voting records and identifying geo-chemical depth trends from Mars rover data.

グラフィカルモデルの推論と学習は、どちらも統計学と機械学習でよく研究されている問題であり、科学と工学で多くの用途が見つかっています。ただし、一般的なグラフィカルモデルでは正確な推論は扱いにくく、これは、扱いやすいグラフィカルモデルファミリ内でランダム変数のコレクションに対する最適な近似値を求めるという問題を示唆しています。この論文では、統計物理学の手法を使用して正確な推論が扱いやすい平面イジングモデルのクラスに焦点を当てます。これらの手法と、平面性テストおよび平面埋め込みの最新の方法に基づいて、任意のバイナリランダム変数のコレクション(サンプルデータから取得可能)を近似する最適な平面イジングモデルを学習するための貪欲アルゴリズムを提案します。変数間のすべてのペアワイズ相関のセットが与えられた場合、相関のセットを最適に近似するために、平面グラフとこのグラフ上で定義された最適な平面イジングモデルを選択します。シミュレーションと2つのアプリケーション(上院の投票記録のモデル化と、火星探査機データからの地質化学的深度傾向の特定)でこの方法を示します。

A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares
通常の最小二乗法のランダム化スケッチに関する統計的展望

We consider statistical as well as algorithmic aspects of solving large-scale least-squares (LS) problems using randomized sketching algorithms. For a LS problem with input data $(X, Y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^n$, sketching algorithms use a sketching matrix, $S\in\mathbb{R}^{r \times n}$, where $r \ll n$. Then, rather than solving the LS problem using the full data $(X,Y)$, sketching algorithms solve the LS problem using only the sketched data $(SX, SY)$. Prior work has typically adopted an algorithmic perspective, in that it has made no statistical assumptions on the input $X$ and $Y$, and instead it has been assumed that the data $(X,Y)$ are fixed and worst-case (WC). Prior results show that, when using sketching matrices such as random projections and leverage-score sampling algorithms, with $p \lesssim r \ll n$, the WC error is the same as solving the original problem, up to a small constant. From a statistical perspective, we typically consider the mean-squared error performance of randomized sketching algorithms, when data $(X, Y)$ are generated according to a statistical linear model $Y = X \beta + \epsilon$, where $\epsilon$ is a noise process. In this paper, we provide a rigorous comparison of both perspectives leading to insights on how they differ. To do this, we first develop a framework for assessing, in a unified manner, algorithmic and statistical aspects of randomized sketching methods. We then consider the statistical prediction efficiency (PE) and the statistical residual efficiency (RE) of the sketched LS estimator; and we use our framework to provide upper bounds for several types of random projection and random sampling sketching algorithms. Among other results, we show that the RE can be upper bounded when $p \lesssim r \ll n$ while the PE typically requires the sample size $r$ to be substantially larger. Lower bounds developed in subsequent results show that our upper bounds on PE can not be improved. 
(A preliminary version of this paper appeared as Raskutti and Mahoney (2014, 2015).)

私たちは、ランダム化スケッチアルゴリズムを使用して大規模な最小二乗(LS)問題を解く統計的側面とアルゴリズム的側面について検討します。入力データ$(X, Y) \in \mathbb{R}^{n \times p} \times \mathbb{R}^n$を持つLS問題の場合、スケッチアルゴリズムはスケッチマトリックス$S\in\mathbb{R}^{r \times n}$ (ここで$r \ll n$) を使用します。次に、完全なデータ$(X,Y)$を使用してLS問題を解くのではなく、スケッチデータ$(SX, SY)$のみを使用してLS問題を解きます。これまでの研究では、通常、入力$X$と$Y$に関する統計的仮定を行わず、代わりにデータ$(X,Y)$が固定され、最悪ケース(WC)であると仮定するというアルゴリズム的観点が採用されています。以前の結果では、$p \lesssim r \ll n$のランダム投影やレバレッジスコアサンプリングアルゴリズムなどのスケッチ行列を使用する場合、WCエラーは小さな定数を除いて元の問題を解くのと同じであることが示されています。統計的な観点からは、通常、データ$(X, Y)$が統計線形モデル$Y = X \beta + \epsilon$ ($\epsilon$はノイズプロセス)に従って生成される場合のランダム化スケッチアルゴリズムの平均二乗誤差のパフォーマンスを考慮します。この論文では、両方の観点を厳密に比較し、それらがどのように異なるかについての洞察を提供します。これを行うために、まず、ランダム化スケッチ法のアルゴリズムと統計の側面を統一的に評価するためのフレームワークを開発します。次に、スケッチされたLS推定量の統計的予測効率(PE)と統計的残差効率(RE)を考慮します。また、このフレームワークを使用して、数種類のランダム投影およびランダムサンプリングによるスケッチアルゴリズムの上限を示します。その他の結果では、REは$p \lesssim r \ll n$のときに上限が設定できる一方、PEでは通常、サンプルサイズ$r$を大幅に大きくする必要があることを示しています。その後の結果で開発された下限は、PEの上限を改善できないことを示しています。(この論文の予備バージョンは、RaskuttiとMahoney (2014、2015)として公開されました。)
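The sketch-and-solve procedure described in this abstract can be illustrated with a minimal sketch. It assumes a dense Gaussian random projection for $S$ (one of several sketch types the abstract mentions); the function name `sketched_least_squares`, the problem sizes, and the noise level are illustrative choices, not taken from the paper.

```python
import numpy as np

def sketched_least_squares(X, Y, r, rng):
    """Solve the LS problem on the sketched data (SX, SY) instead of (X, Y).

    S is an r x n Gaussian random projection (an assumed choice of sketch).
    """
    n = X.shape[0]
    S = rng.standard_normal((r, n)) / np.sqrt(r)
    beta, *_ = np.linalg.lstsq(S @ X, S @ Y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n, p, r = 2000, 5, 100                 # p <= r << n, as in the abstract
X = rng.standard_normal((n, p))
beta_true = np.arange(1.0, p + 1.0)
Y = X @ beta_true + 0.1 * rng.standard_normal(n)   # Y = X beta + eps

beta_full, *_ = np.linalg.lstsq(X, Y, rcond=None)  # full-data LS solution
beta_sketch = sketched_least_squares(X, Y, r, rng)
print("full   :", np.round(beta_full, 3))
print("sketch :", np.round(beta_sketch, 3))
```

Comparing the two printed vectors gives a feel for how much accuracy is traded for solving an $r \times p$ problem instead of an $n \times p$ one.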

Neyman-Pearson Classification under High-Dimensional Settings
高次元設定におけるネイマン・ピアソン分類

Most existing binary classification methods target on the optimization of the overall classification risk and may fail to serve some real-world applications such as cancer diagnosis, where users are more concerned with the risk of misclassifying one specific class than the other. Neyman-Pearson (NP) paradigm was introduced in this context as a novel statistical framework for handling asymmetric type I/II error priorities. It seeks classifiers with a minimal type II error and a constrained type I error under a user specified level. This article is the first attempt to construct classifiers with guaranteed theoretical performance under the NP paradigm in high-dimensional settings. Based on the fundamental Neyman-Pearson Lemma, we used a plug-in approach to construct NP-type classifiers for Naive Bayes models. The proposed classifiers satisfy the NP oracle inequalities, which are natural NP paradigm counterparts of the oracle inequalities in classical binary classification. Besides their desirable theoretical properties, we also demonstrated their numerical advantages in prioritized error control via both simulation and real data studies.

既存のバイナリ分類法のほとんどは、全体的な分類リスクの最適化を目標としており、ユーザーが特定のクラスを他のクラスよりも誤分類するリスクをより懸念する癌診断などの実際のアプリケーションには適さない可能性があります。ネイマン-ピアソン(NP)パラダイムは、このコンテキストで、非対称なタイプI/IIエラーの優先順位を処理するための新しい統計フレームワークとして導入されました。これは、ユーザーが指定したレベルの下で、タイプIIエラーが最小でタイプIエラーが制約された分類器を求めます。この記事は、高次元設定でNPパラダイムの下で理論的なパフォーマンスが保証された分類器を構築する最初の試みです。基本的なネイマン-ピアソンの補題に基づいて、プラグインアプローチを使用して、ナイーブベイズモデル用のNP型分類器を構築しました。提案された分類器は、古典的なバイナリ分類のオラクル不等式の自然なNPパラダイム対応物であるNPオラクル不等式を満たします。望ましい理論的特性に加えて、シミュレーションと実際のデータ研究の両方を通じて、優先エラー制御における数値的利点も実証しました。

Measuring Dependence Powerfully and Equitably
依存性を強力かつ公平に測定

Given a high-dimensional data set, we often wish to find the strongest relationships within it. A common strategy is to evaluate a measure of dependence on every variable pair and retain the highest-scoring pairs for follow-up. This strategy works well if the statistic used (a) has good power to detect non-trivial relationships, and (b) is equitable, meaning that for some measure of noise it assigns similar scores to equally noisy relationships regardless of relationship type (e.g., linear, exponential, periodic). In this paper, we define and theoretically characterize two new statistics that together yield an efficient approach for obtaining both power and equitability. To do this, we first introduce a new population measure of dependence and show three equivalent ways that it can be viewed, including as a canonical smoothing of mutual information. We then introduce an efficiently computable consistent estimator of our population measure of dependence, and we empirically establish its equitability on a large class of noisy functional relationships. This new statistic has better bias/variance properties and better runtime complexity than a previous heuristic approach. Next, we derive a second, related statistic whose computation is a trivial side-product of our algorithm and whose goal is powerful independence testing rather than equitability. We prove that this statistic yields a consistent independence test and show in simulations that the test has good power against independence. Taken together, our results suggest that these two statistics are a valuable pair of tools for exploratory data analysis.

高次元データセットが与えられた場合、その中で最も強い関係を見つけたいことがよくあります。一般的な戦略は、すべての変数ペアの依存関係の尺度を評価し、最高スコアのペアをフォローアップ用に保持することです。この戦略は、使用される統計量が(a)自明でない関係を検出する優れた検出力を持ち、(b)公平である場合、つまり、何らかのノイズの尺度に対して、関係の種類(線形、指数関数、周期的など)に関係なく、同等のノイズのある関係に同様のスコアを割り当てる場合にうまく機能します。この論文では、2つの新しい統計量を定義し、理論的に特徴付けます。これらを組み合わせることで、検出力と公平性の両方を効率的に得ることができます。これを行うために、まず、母集団における新しい依存性尺度を導入し、相互情報量の標準的な平滑化を含む、3つの同等な見方を示します。次に、この母集団の依存性尺度に対する、効率的に計算可能な一致推定量を導入し、ノイズの多い関数関係の大規模なクラスでその公平性を経験的に確立します。この新しい統計量は、以前のヒューリスティックなアプローチよりもバイアス/分散特性が優れており、実行時の計算量も少なくなっています。次に、計算がアルゴリズムの些細な副産物であり、公平性ではなく強力な独立性検定を目的とする2番目の関連統計量を導出します。この統計量が一致性のある独立性検定を与えることを証明し、シミュレーションでこの検定が優れた検出力を持つことを示します。総合すると、これらの2つの統計量は探索的データ分析のための貴重なツールのペアであることが示唆されます。

Multi-Objective Markov Decision Processes for Data-Driven Decision Support
データ駆動型意思決定支援のための多目的マルコフ決定過程

We present new methodology based on Multi-Objective Markov Decision Processes for developing sequential decision support systems from data. Our approach uses sequential decision-making data to provide support that is useful to many different decision-makers, each with different, potentially time-varying preference. To accomplish this, we develop an extension of fitted-$Q$ iteration for multiple objectives that computes policies for all scalarization functions, i.e. preference functions, simultaneously from continuous-state, finite-horizon data. We identify and address several conceptual and computational challenges along the way, and we introduce a new solution concept that is appropriate when different actions have similar expected outcomes. Finally, we demonstrate an application of our method using data from the Clinical Antipsychotic Trials of Intervention Effectiveness and show that our approach offers decision-makers increased choice by a larger class of optimal policies.

私たちは、データから逐次意思決定支援システムを開発するための、多目的マルコフ決定過程に基づく新しい方法論を提示します。私たちのアプローチは、シーケンシャルな意思決定データを使用して、それぞれが異なる、潜在的に時間的に変化する好みを持つ多くの異なる意思決定者に役立つサポートを提供します。これを達成するために、連続状態の有限地平線データからすべてのスカラリゼーション関数、つまり選好関数の方策を同時に計算する、複数の目的に対するfitted-$Q$反復の拡張を開発します。その過程で、いくつかの概念的および計算上の課題を特定して対処し、異なるアクションが同様の期待される結果をもたらす場合に適切な新しいソリューション概念を導入します。最後に、Clinical Antipsychotic Trials of Intervention Effectivenessのデータを使用して、この方法の適用を実証し、このアプローチが、より大きな最適方策のクラスを通じて、意思決定者により多くの選択肢を提供することを示します。

Towards More Efficient SPSD Matrix Approximation and CUR Matrix Decomposition
SPSD行列のより効率的な近似とCUR行列分解に向けて

Symmetric positive semi-definite (SPSD) matrix approximation methods have been extensively used to speed up large-scale eigenvalue computation and kernel learning methods. The standard sketch based method, which we call the prototype model, produces relatively accurate approximations, but is inefficient on large square matrices. The Nyström method is highly efficient, but can only achieve low accuracy. In this paper we propose a novel model that we call the fast SPSD matrix approximation model. The fast model is nearly as efficient as the Nyström method and as accurate as the prototype model. We show that the fast model can potentially solve eigenvalue problems and kernel learning problems in linear time with respect to the matrix size $n$ to achieve $1+\epsilon$ relative-error, whereas both the prototype model and the Nyström method cost at least quadratic time to attain comparable error bound. Empirical comparisons among the prototype model, the Nyström method, and our fast model demonstrate the superiority of the fast model. We also contribute new understandings of the Nyström method. The Nyström method is a special instance of our fast model and is an approximation to the prototype model. Our technique can be straightforwardly applied to make the CUR matrix decomposition more efficiently computed without much affecting the accuracy.

対称正半定値(SPSD)行列近似法は、大規模な固有値計算とカーネル学習法の高速化に広く使用されています。プロトタイプモデルと呼ぶ標準的なスケッチベースの方法は、比較的正確な近似値を生成しますが、大きな正方行列では非効率的です。Nyström法は非常に効率的ですが、達成できる精度は低くなります。この論文では、高速SPSD行列近似モデルと呼ぶ新しいモデルを提案します。この高速モデルは、Nyström法とほぼ同程度に効率的で、プロトタイプモデルと同程度に正確です。高速モデルは、行列サイズ$n$に関して線形時間で固有値問題とカーネル学習問題を解いて、相対誤差$1+\epsilon$を達成できる可能性がある一方、プロトタイプモデルとNyström法はどちらも、同等の誤差境界を達成するのに少なくとも2乗の時間がかかります。プロトタイプモデル、Nyström法、および高速モデルを実験的に比較すると、高速モデルの優位性が実証されます。また、Nyström法に関する新たな知見も提供します。Nyström法は、当社の高速モデルの特別なインスタンスであり、プロトタイプモデルの近似値です。当社の手法は、精度にあまり影響を与えずにCUR行列分解をより効率的に計算するために直接適用できます。
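For orientation, the Nyström method mentioned in this abstract can be sketched as follows. This is the standard textbook form $K \approx C W^{+} C^{\top}$ with uniformly sampled columns, not the paper's fast model; the RBF kernel, the sizes, and the pseudo-inverse cutoff `1e-10` are assumptions made for illustration.

```python
import numpy as np

def nystrom(K, cols):
    """Standard Nyström approximation K ~ C W^+ C^T from selected columns."""
    C = K[:, cols]                # n x c block of sampled columns
    W = K[np.ix_(cols, cols)]     # c x c intersection block
    # Second argument is the relative cutoff for small singular values.
    return C @ np.linalg.pinv(W, 1e-10) @ C.T

rng = np.random.default_rng(0)
n, c = 200, 20
Z = rng.standard_normal((n, 3))
sq_dist = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dist)        # RBF kernel matrix, which is SPSD

cols = rng.choice(n, size=c, replace=False)   # uniform column sampling
K_hat = nystrom(K, cols)
rel_err = np.linalg.norm(K - K_hat, "fro") / np.linalg.norm(K, "fro")
print(f"relative Frobenius error with c={c}: {rel_err:.3f}")
```

Only the $n \times c$ block of $K$ is needed to build the approximation, which is why the method is cheap; the accuracy it reaches with a given $c$ depends on the spectrum of $K$.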

Choice of V for V-Fold Cross-Validation in Least-Squares Density Estimation
最小二乗密度推定におけるV分割交差検証のためのVの選択

This paper studies $V$-fold cross-validation for model selection in least-squares density estimation. The goal is to provide theoretical grounds for choosing $V$ in order to minimize the least-squares loss of the selected estimator. We first prove a non-asymptotic oracle inequality for $V$-fold cross-validation and its bias-corrected version ($V$-fold penalization). In particular, this result implies that $V$-fold penalization is asymptotically optimal in the nonparametric case. Then, we compute the variance of $V$-fold cross-validation and related criteria, as well as the variance of key quantities for model selection performance. We show that these variances depend on $V$ like $1+4/(V-1)$, at least in some particular cases, suggesting that the performance increases much from $V=2$ to $V=5$ or $10$, and then is almost constant. Overall, this can explain the common advice to take $V=5$ (at least in our setting and when the computational power is limited), as supported by some simulation experiments. An oracle inequality and exact formulas for the variance are also proved for Monte-Carlo cross-validation, also known as repeated cross-validation, where the parameter $V$ is replaced by the number $B$ of random splits of the data.

この論文では、最小二乗密度推定におけるモデル選択のための$V$倍交差検証について検討します。目標は、選択された推定量の最小二乗損失を最小化するために$V$を選択する理論的根拠を提供することです。まず、$V$倍交差検証とそのバイアス補正バージョン($V$倍ペナルティ)に対する非漸近的なオラクル不等式を証明します。特に、この結果は、$V$倍ペナルティがノンパラメトリックな場合に漸近的に最適であることを意味します。次に、$V$倍交差検証と関連基準の分散、およびモデル選択パフォーマンスの重要な量の分散を計算します。少なくともいくつかの特定のケースでは、これらの分散が$1+4/(V-1)$のように$V$に依存することを示します。これは、パフォーマンスが$V=2$から$V=5$または$10$に大幅に向上し、その後ほぼ一定になることを示唆しています。全体的に、これは、少なくとも私たちの設定と計算能力が限られている場合には、$V=5\,$を取るという一般的なアドバイスを説明できます。これは、いくつかのシミュレーション実験によって裏付けられています。オラクル不等式と分散の正確な式は、モンテカルロ交差検証(繰り返し交差検証とも呼ばれます)でも証明されています。ここで、パラメーター$V$は、データのランダム分割の数$B$に置き換えられます。
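The $V$-dependence $1+4/(V-1)$ quoted in this abstract is easy to tabulate. The short sketch below only evaluates that expression for a few values of $V$; the helper name `variance_factor` is hypothetical, and the abstract itself notes that the formula holds "at least in some particular cases".

```python
def variance_factor(V: int) -> float:
    """Return 1 + 4/(V - 1), the V-dependent factor quoted in the abstract."""
    if V < 2:
        raise ValueError("V-fold cross-validation needs V >= 2")
    return 1.0 + 4.0 / (V - 1)

for V in (2, 5, 10, 20, 100):
    print(f"V={V:3d}  factor={variance_factor(V):.3f}")
```

The factor drops from 5 at $V=2$ to 2 at $V=5$ and about 1.44 at $V=10$, then approaches 1 slowly, which matches the abstract's remark that most of the gain comes from moving to $V=5$ or $10$.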

Modelling Interactions in High-dimensional Data with Backtracking
バックトラッキングによる高次元データでの相互作用のモデル化

We study the problem of high-dimensional regression when there may be interacting variables. Approaches using sparsity-inducing penalty functions such as the Lasso can be useful for producing interpretable models. However, when the number of variables runs into the thousands, and so even two-way interactions number in the millions, these methods may become computationally infeasible. Typically variable screening based on model fits using only main effects must be performed first. One problem with screening is that important variables may be missed if they are only useful for prediction when certain interaction terms are also present in the model. To tackle this issue, we introduce a new method we call Backtracking. It can be incorporated into many existing high-dimensional methods based on penalty functions, and works by building increasing sets of candidate interactions iteratively. Models fitted on the main effects and interactions selected early on in this process guide the selection of future interactions. By also making use of previous fits for computation, as well as performing calculations in parallel, the overall run-time of the algorithm can be greatly reduced. The effectiveness of our method when applied to regression and classification problems is demonstrated on simulated and real data sets. In the case of using Backtracking with the Lasso, we also give some theoretical support for our procedure.

私たちは、相互作用する変数が存在する可能性がある場合の高次元回帰の問題を研究しています。Lassoなどのスパース性を誘発するペナルティ関数を使用するアプローチは、解釈可能なモデルを作成するのに役立ちます。ただし、変数の数が数千に達し、双方向の相互作用でさえ数百万に達すると、これらの方法は計算上実行不可能になる可能性があります。通常、主効果のみを使用したモデル適合に基づく変数スクリーニングを最初に実行する必要があります。スクリーニングの問題の1つは、モデルに特定の相互作用項も存在する場合にのみ予測に役立つ重要な変数が見逃される可能性があることです。この問題に対処するために、バックトラッキングと呼ばれる新しい方法を導入します。これは、ペナルティ関数に基づく多くの既存の高次元方法に組み込むことができ、候補となる相互作用のセットを反復的に構築することで機能します。このプロセスの早い段階で選択された主効果と相互作用に適合したモデルは、将来の相互作用の選択を導きます。計算に以前の適合も使用し、計算を並列に実行することで、アルゴリズムの全体的な実行時間を大幅に短縮できます。回帰および分類問題に適用した場合の当社の方法の有効性は、シミュレーションおよび実際のデータセットで実証されています。Lassoによるバックトラッキングを使用する場合、当社の手順に対する理論的裏付けもいくつか示しています。

ERRATA: On the Estimation of the Gradient Lines of a Density and the Consistency of the Mean-Shift Algorithm
正誤表:密度の勾配線の推定と平均シフトアルゴリズムの一貫性について

ERRATA to the paper On the Estimation of the Gradient Lines of a Density and the Consistency of the Mean-Shift Algorithm.

論文「密度の勾配線の推定と平均シフトアルゴリズムの一貫性について」の正誤表です。

Neural Autoregressive Distribution Estimation
ニューラル自己回帰分布推定

We present Neural Autoregressive Distribution Estimation (NADE) models, which are neural network architectures applied to the problem of unsupervised distribution and density estimation. They leverage the probability product rule and a weight sharing scheme inspired from restricted Boltzmann machines, to yield an estimator that is both tractable and has good generalization performance. We discuss how they achieve competitive performance in modeling both binary and real-valued observations. We also present how deep NADE models can be trained to be agnostic to the ordering of input dimensions used by the autoregressive product rule decomposition. Finally, we also show how to exploit the topological structure of pixels in images using a deep convolutional architecture for NADE.

私たちは、教師なし分布と密度推定の問題に適用されるニューラルネットワークアーキテクチャであるNeural Autoregressive Distribution Estimation(NADE)モデルを紹介します。これらのモデルは、確率の乗法定理と、制限付きボルツマンマシンから着想を得た重み共有スキームを活用して、扱いやすく、優れた汎化性能を持つ推定量を導き出します。二値観測値と実数値観測値の両方のモデリングで競争力のあるパフォーマンスをどのように達成するかについて説明します。また、自己回帰積ルール分解で使用される入力次元の順序に依存しないように、深層NADEモデルを訓練する方法も示します。最後に、NADEの深層畳み込みアーキテクチャを使用して、画像内のピクセルのトポロジカル構造を活用する方法も示します。

Bayesian Graphical Models for Multivariate Functional Data
多変量関数データのためのベイジアングラフィカルモデル

Graphical models express conditional independence relationships among variables. Although methods for vector-valued data are well established, functional data graphical models remain underdeveloped. By functional data, we refer to data that are realizations of random functions varying over a continuum (e.g., images, signals). We introduce a notion of conditional independence between random functions, and construct a framework for Bayesian inference of undirected, decomposable graphs in the multivariate functional data context. This framework is based on extending Markov distributions and hyper Markov laws from random variables to random processes, providing a principled alternative to naive application of multivariate methods to discretized functional data. Markov properties facilitate the composition of likelihoods and priors according to the decomposition of a graph. Our focus is on Gaussian process graphical models using orthogonal basis expansions. We propose a hyper-inverse-Wishart-process prior for the covariance kernels of the infinite coefficient sequences of the basis expansion, and establish its existence and uniqueness. We also prove the strong hyper Markov property and the conjugacy of this prior under a finite rank condition of the prior kernel parameter. Stochastic search Markov chain Monte Carlo algorithms are developed for posterior inference, assessed through simulations, and applied to a study of brain activity and alcoholism.

グラフィカルモデルは、変数間の条件付き独立関係を表します。ベクトル値データの方法は十分に確立されていますが、関数データのグラフィカルモデルはまだ開発が進んでいません。関数データとは、連続体で変化するランダム関数の実現であるデータを指します(画像、信号など)。ランダム関数間の条件付き独立の概念を導入し、多変量関数データのコンテキストで無向で分解可能なグラフのベイズ推論のフレームワークを構築します。このフレームワークは、マルコフ分布とハイパーマルコフ法則をランダム変数からランダムプロセスに拡張することに基づいており、離散化された関数データへの多変量メソッドの単純な適用に代わる原理的な代替手段を提供します。マルコフ特性により、グラフの分解に従って尤度と事前分布を構成することが容易になります。直交基底展開を使用したガウス過程グラフィカルモデルに焦点を当てています。基底展開の無限係数シーケンスの共分散カーネルのハイパー逆ウィシャート過程事前分布を提案し、その存在と一意性を確立します。また、事前カーネルパラメータの有限ランク条件の下で、この事前分布の強いハイパーマルコフ特性と共役性も証明します。確率的探索マルコフ連鎖モンテカルロアルゴリズムは事後推論用に開発され、シミュレーションによって評価され、脳活動とアルコール依存症の研究に適用されます。

Guarding against Spurious Discoveries in High Dimensions
高次元での偽の発見に対する保護

Many data mining and statistical machine learning algorithms have been developed to select a subset of covariates to associate with a response variable. Spurious discoveries can easily arise in high-dimensional data analysis due to enormous possibilities of such selections. How can we know statistically our discoveries better than those by chance? In this paper, we define a measure of goodness of spurious fit, which shows how good a response variable can be fitted by an optimally selected subset of covariates under the null model, and propose a simple and effective LAMM algorithm to compute it. It coincides with the maximum spurious correlation for linear models and can be regarded as a generalized maximum spurious correlation. We derive the asymptotic distribution of such goodness of spurious fit for generalized linear models and $L_1$ regression. Such an asymptotic distribution depends on the sample size, ambient dimension, the number of variables used in the fit, and the covariance information. It can be consistently estimated by multiplier bootstrapping and used as a benchmark to guard against spurious discoveries. It can also be applied to model selection, which considers only candidate models with goodness of fits better than those by spurious fits. The theory and method are convincingly illustrated by simulated examples and an application to the binary outcomes from German Neuroblastoma Trials.

多くのデータマイニングおよび統計的機械学習アルゴリズムは、応答変数に関連付ける共変量のサブセットを選択するために開発されてきました。高次元データ分析では、そのような選択の大きな可能性のために、偽の発見が簡単に発生する可能性があります。偶然の発見よりも統計的に優れた発見を知るにはどうすればよいでしょうか。この論文では、ヌルモデルの下で最適に選択された共変量のサブセットによって応答変数をどの程度うまく適合できるかを示す偽の適合度尺度を定義し、それを計算するためのシンプルで効果的なLAMMアルゴリズムを提案します。これは、線形モデルの最大偽の相関と一致し、一般化最大偽の相関と見なすことができます。一般化線形モデルと$L_1$回帰のこのような偽の適合度の漸近分布を導出します。このような漸近分布は、サンプルサイズ、周囲次元、適合に使用される変数の数、および共分散情報に依存します。これは、乗数ブートストラップによって一貫して推定でき、偽の発見を防ぐためのベンチマークとして使用できます。また、この方法はモデル選択にも適用でき、偽の適合よりも適合度が優れている候補モデルのみを考慮します。この理論と方法は、シミュレーション例とドイツの神経芽腫試験のバイナリ結果への適用によって説得力を持って説明されています。

Nonparametric Network Models for Link Prediction
リンク予測のためのノンパラメトリックネットワークモデル

Many data sets can be represented as a sequence of interactions between entities, for example communications between individuals in a social network, protein-protein interactions or DNA-protein interactions in a biological context, or vehicles’ journeys between cities. In these contexts, there is often interest in making predictions about future interactions, such as who will message whom. A popular approach to network modeling in a Bayesian context is to assume that the observed interactions can be explained in terms of some latent structure. For example, traffic patterns might be explained by the size and importance of cities, and social network interactions might be explained by the social groups and interests of individuals. Unfortunately, while elucidating this structure can be useful, it often does not directly translate into an effective predictive tool. Further, many existing approaches are not appropriate for sparse networks, a class that includes many interesting real-world situations. In this paper, we develop models for sparse networks that combine structure elucidation with predictive performance. We use a Bayesian nonparametric approach, which allows us to predict interactions with entities outside our training set, and allows both the latent dimensionality of the model and the number of nodes in the network to grow in expectation as we see more data. We demonstrate that we can capture latent structure while maintaining predictive power, and discuss possible extensions.

多くのデータセットは、エンティティ間の一連の相互作用として表現できます。たとえば、ソーシャルネットワーク内の個人間の通信、生物学的コンテキストでのタンパク質間相互作用またはDNAタンパク質間相互作用、または都市間の車両の移動などです。これらのコンテキストでは、誰が誰にメッセージを送信するかなど、将来の相互作用を予測することに関心が寄せられることがよくあります。ベイジアンコンテキストでのネットワークモデリングの一般的なアプローチは、観察された相互作用を何らかの潜在的な構造で説明できると仮定することです。たとえば、交通パターンは都市の規模と重要性によって説明でき、ソーシャルネットワーク相互作用は個人の社会的グループと関心によって説明できます。残念ながら、この構造を解明することは有用ですが、効果的な予測ツールに直接変換されないことがよくあります。さらに、既存のアプローチの多くは、多くの興味深い現実世界の状況を含むクラスであるスパースネットワークには適していません。この論文では、構造の解明と予測パフォーマンスを組み合わせたスパースネットワークのモデルを開発します。私たちはベイジアンノンパラメトリックアプローチを使用します。これにより、トレーニングセット外のエンティティとの相互作用を予測でき、より多くのデータを見るにつれて、モデルの潜在的な次元とネットワーク内のノードの数の両方が期待値で増加できます。予測力を維持しながら潜在的な構造を捉えることができることを実証し、可能な拡張について説明します。

Multivariate Spearman’s ρ for Aggregating Ranks Using Copulas
コピュラを使用してランクを集約するための多変量スピアマンのρ

We study the problem of rank aggregation: given a set of ranked lists, we want to form a consensus ranking. Furthermore, we consider the case of extreme lists, i.e., lists in which only the ranks of the best or worst elements are known. We impute missing ranks and generalise Spearman’s $\rho$ to extreme ranks. Our main contribution is the derivation of a non-parametric estimator for rank aggregation based on multivariate extensions of Spearman’s $\rho$, which measures correlation between a set of ranked lists. Multivariate Spearman’s $\rho$ is defined using copulas, and we show that the geometric mean of normalised ranks maximises multivariate correlation. Motivated by this, we propose a weighted geometric mean approach for learning to rank which has a closed form least squares solution. When only the best (top-k) or worst (bottom-k) elements of a ranked list are known, we impute the missing ranks by the average value, allowing us to apply Spearman’s $\rho$. We discuss an optimistic and a pessimistic imputation of missing values, which respectively maximise and minimise correlation, and show their effect on aggregating university rankings. Finally, we demonstrate good performance on the rank aggregation benchmarks MQ2007 and MQ2008.

私たちは、ランク集約の問題を研究します。ランク付けされたリストのセットが与えられた場合、合意に基づくランク付けを形成します。さらに、最良または最悪の要素のランクのみがわかっている極端なリストのケースを考慮します。欠損ランクを補完し、スピアマンの$\rho$を極端なランクに一般化します。私たちの主な貢献は、ランク付けされたリストのセット間の相関を測定するスピアマンの$\rho$の多変量拡張に基づくランク集約のノンパラメトリック推定量の導出です。多変量スピアマンの$\rho$はコピュラを使用して定義され、正規化されたランクの幾何平均が多変量相関を最大化することを示します。これに動機付けられて、閉じた形式の最小二乗解を持つランク付け学習のための加重幾何平均アプローチを提案します。ランク付けされたリストの最高(上位k)または最低(下位k)の要素のみがわかっている場合、平均値によって欠落したランクを補完し、スピアマンの$\rho$を適用できるようにします。相関をそれぞれ最大化および最小化する、欠落値の楽観的補完と悲観的補完について説明し、大学ランキングの集計への影響を示します。最後に、ランク集計ベンチマークMQ2007およびMQ2008で優れたパフォーマンスを示します。
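
As a rough illustration of the geometric-mean idea in the abstract above, the sketch below aggregates complete ranked lists by the unweighted geometric mean of their normalised ranks. The normalisation `rank / (n + 1)` with rank 1 as the best item, and the function name `geometric_mean_aggregate`, are assumptions made for this example. The paper's weighted estimator and its imputation schemes for top-k or bottom-k lists are not reproduced here.

```python
import math


def geometric_mean_aggregate(rank_lists):
    """Aggregate complete rankings by the geometric mean of normalised ranks.

    rank_lists: list of dicts mapping item -> rank (1 = best); every dict
    is assumed to rank the same set of items.
    """
    items = list(rank_lists[0])
    n = len(items)
    scores = {}
    for item in items:
        # Normalise each rank into (0, 1) and average in log space,
        # which is the geometric mean of the normalised ranks.
        logs = [math.log(ranks[item] / (n + 1)) for ranks in rank_lists]
        scores[item] = math.exp(sum(logs) / len(logs))
    # A smaller score means the item sits closer to the top of the consensus.
    return sorted(items, key=lambda it: scores[it])


lists = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 2, "b": 1, "c": 3},
    {"a": 1, "b": 3, "c": 2},
]
consensus = geometric_mean_aggregate(lists)
```

In this toy input, item `a` is ranked first twice and second once, so it ends up at the head of the consensus list.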

Online Trans-dimensional von Mises-Fisher Mixture Models for User Profiles
ユーザープロファイルのためのオンライン次元間フォンミーゼスフィッシャー混合モデル

The proliferation of online communities has attracted much attention to modelling user behaviour in terms of social interaction, language adoption and contribution activity. Nevertheless, when applied to large-scale and cross-platform behavioural data, existing approaches generally suffer from expressiveness, scalability and generality issues. This paper proposes trans-dimensional von Mises-Fisher (TvMF) mixture models for $\mathcal{L}_{2}$ normalised behavioural data, which encapsulate: (1) a Bayesian framework for vMF mixtures that enables prior knowledge and information sharing among clusters, (2) an extended version of the reversible jump MCMC algorithm that allows adaptive changes in the number of clusters for vMF mixtures when the model parameters are updated, and (3) an online TvMF mixture model that accommodates the dynamics of clusters for time-varying user behavioural data. We develop efficient collapsed Gibbs sampling techniques for posterior inference, which facilitate parallelism for parameter updates. Empirical results on simulated and real-world data show that the proposed TvMF mixture models can discover more interpretable and intuitive clusters than other widely-used models, such as k-means, non-negative matrix factorization (NMF), Dirichlet process Gaussian mixture models (DP-GMM), and dynamic topic models (DTM). We further evaluate the performance of the proposed models in real-world applications, such as the churn prediction task, which shows the usefulness of the features generated.

オンラインコミュニティの急増により、社会的相互作用、言語採用、貢献活動の観点からユーザー行動をモデル化することに多くの注目が集まっています。しかし、大規模でクロスプラットフォームの行動データに適用する場合、既存のアプローチは一般に表現力、スケーラビリティ、一般性の問題に悩まされます。この論文では、$\mathcal{L}_{2}$正規化行動データ用の次元を超えたフォンミーゼスフィッシャー(TvMF)混合モデルを提案します。このモデルは、(1)クラスター間での事前知識と情報の共有を可能にするvMF混合のベイズフレームワーク、(2)モデルパラメータが更新されたときにvMF混合のクラスター数を適応的に変更できる可逆ジャンプMCMCアルゴリズムの拡張バージョン、(3)時間とともに変化するユーザー行動データのクラスターのダイナミクスに対応するオンラインTvMF混合モデルをカプセル化します。事後推論用の効率的な縮小ギブスサンプリング手法を開発し、パラメータ更新の並列処理を容易にします。シミュレーションと実世界のデータに関する実験結果から、提案されたTvMF混合モデルは、k-means、非負値行列分解(NMF)、ディリクレ過程ガウス混合モデル(DP-GMM)、動的トピックモデル(DTM)などの他の広く使用されているモデルよりも、解釈しやすく直感的なクラスターを発見できることが示されています。さらに、解約予測タスクなどの実際のアプリケーションで提案モデルのパフォーマンスを評価し、生成された機能の有用性を示します。

Mutual Information Based Matching for Causal Inference with Observational Data
観測データを用いた因果推論のための相互情報量に基づくマッチング

This paper presents an information theory-driven matching methodology for making causal inference from observational data. The paper adopts a “potential outcomes framework” view on evaluating the strength of cause-effect relationships: the population-wide average effects of binary treatments are estimated by comparing two groups of units, the treated and the untreated (control). To reduce the bias in such treatment effect estimation, one has to compose a control group in such a way that across the compared groups of units, treatment is independent of the units’ covariates. This requirement gives rise to a subset selection / matching problem. This paper presents the models and algorithms that solve the matching problem by minimizing the mutual information (MI) between the covariates and the treatment variable. Such a formulation becomes tractable thanks to the derived optimality conditions that tackle the non-linearity of the sample-based MI function. Computational experiments with mixed integer-programming formulations and four matching algorithms demonstrate the utility of MI based matching for causal inference studies. The algorithmic developments culminate in a matching heuristic that allows for balancing the compared groups in polynomial (close to linear) time, thus allowing for treatment effect estimation with large data sets.

この論文では、観察データから因果推論を行うための情報理論に基づくマッチング手法について紹介します。この論文では、因果関係の強さを評価する際に「潜在的アウトカムフレームワーク」の視点を採用しています。つまり、2値治療の集団全体の平均効果は、治療されたユニットと治療されていないユニット(コントロール)の2つのユニットグループを比較することによって推定されます。このような治療効果の推定におけるバイアスを減らすには、比較するユニットグループ全体で治療がユニットの共変量から独立するようにコントロールグループを構成する必要があります。この要件により、サブセット選択/マッチングの問題が生じます。この論文では、共変量と治療変数間の相互情報量(MI)を最小化することでマッチング問題を解決するモデルとアルゴリズムを紹介します。このような定式化は、サンプルベースのMI関数の非線形性に対処する導出された最適性条件のおかげで扱いやすくなります。混合整数計画定式化と4つのマッチングアルゴリズムを使用した計算実験により、因果推論研究におけるMIベースのマッチングの有用性が実証されています。アルゴリズムの開発は、多項式（線形に近い）時間で比較グループのバランスをとることを可能にするマッチングヒューリスティックに集約され、大規模なデータセットでの治療効果の推定が可能になります。
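
To make the matching objective above concrete, here is a minimal sketch of the plug-in (sample-based) mutual information between one discrete covariate and a binary treatment indicator. The function name and the toy data are hypothetical. The paper's optimality conditions, integer-programming formulations and matching heuristics are not shown; the sketch only evaluates the quantity that a matching procedure would try to drive towards zero.

```python
import math
from collections import Counter


def mutual_information(covariate, treatment):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(covariate)
    joint = Counter(zip(covariate, treatment))
    p_x = Counter(covariate)
    p_t = Counter(treatment)
    mi = 0.0
    for (x, t), count in joint.items():
        # p(x, t) * log( p(x, t) / (p(x) * p(t)) ), written with raw counts.
        mi += (count / n) * math.log(count * n / (p_x[x] * p_t[t]))
    return mi


# Covariate evenly spread over treated and control units: MI is zero.
balanced = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
# Covariate fully determines treatment: MI equals log(2).
confounded = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
```

A control group for which this quantity is close to zero is one where, in the sample, treatment assignment carries almost no information about the covariate.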

Wavelet decompositions of Random Forests – smoothness analysis, sparse approximation and applications
ランダムフォレストのウェーブレット分解 – 滑らかさ解析、スパース近似および応用

In this paper we introduce, in the setting of machine learning, a generalization of wavelet analysis, which is a popular approach to low dimensional structured signal analysis. The wavelet decomposition of a Random Forest provides a sparse approximation of any regression or classification high dimensional function at various levels of detail, with a concrete ordering of the Random Forest nodes: from ‘significant’ elements to nodes capturing only ‘insignificant’ noise. Motivated by function space theory, we use the wavelet decomposition to compute numerically a ‘weak-type’ smoothness index that captures the complexity of the underlying function. As we show through extensive experimentation, this sparse representation facilitates a variety of applications such as improved regression for difficult datasets, a novel approach to feature importance, resilience to noisy or irrelevant features, compression of ensembles, etc.

この論文では、機械学習の設定で、低次元構造化信号解析の一般的なアプローチであるウェーブレット解析の一般化を紹介します。ランダムフォレストのウェーブレット分解は、さまざまな詳細レベルでの回帰または分類高次元関数のまばらな近似を提供し、ランダムフォレストノードの具体的な順序付け(「重要な」要素から「重要でない」ノイズのみをキャプチャするノードまで)を提供します。関数空間理論に動機付けられて、ウェーブレット分解を使用して、基になる関数の複雑さを捉える”弱いタイプ”の滑らかさ指数を数値的に計算します。広範な実験を通じて示したように、このスパース表現は、困難なデータセットの回帰の改善、特徴の重要性に対する新しいアプローチ、ノイズの多い特徴や無関係な特徴に対する回復力、アンサンブルの圧縮など、さまざまなアプリケーションを容易にします。

Machine Learning in an Auction Environment
オークション環境での機械学習

We consider a model of repeated online auctions in which an ad with an uncertain click-through rate faces a random distribution of competing bids in each auction and there is discounting of payoffs. We formulate the optimal solution to this explore/exploit problem as a dynamic programming problem and show that efficiency is maximized by making a bid for each advertiser equal to the advertiser’s expected value for the advertising opportunity plus a term proportional to the variance in this value divided by the number of impressions the advertiser has received thus far. We then use this result to illustrate that the value of incorporating active exploration in an auction environment is exceedingly small.

私たちは、クリック率が不確実な広告が、各オークションで競合する入札額がランダムに分布し、ペイオフが割引されるという、繰り返されるオンラインオークションのモデルを考えます。この探索/活用問題に対する最適な解決策を動的計画法の問題として定式化し、各広告主の入札額を、広告機会に対する広告主の期待値に、この値の分散に比例する項を広告主がこれまでに受け取ったインプレッション数で割った値に等しくすることで効率が最大化されることを示します。次に、この結果を使用して、オークション環境にアクティブ探索を組み込む価値が非常に小さいことを示します。
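
The bid rule stated in the abstract above can be written down directly. In the sketch below, the proportionality constant `k` and the function name `exploration_bid` are hypothetical; the abstract only says that the extra term is proportional to the variance divided by the number of impressions, and the actual constant comes out of the paper's dynamic programming analysis.

```python
def exploration_bid(expected_value, value_variance, impressions, k=1.0):
    """Expected value plus an exploration bonus of k * variance / impressions.

    k is a placeholder for the paper's proportionality constant and is an
    assumption of this sketch. impressions must be positive.
    """
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return expected_value + k * value_variance / impressions


# The bonus shrinks as the advertiser accumulates impressions.
early_bid = exploration_bid(0.5, 0.04, impressions=10, k=2.0)
late_bid = exploration_bid(0.5, 0.04, impressions=1000, k=2.0)
```

Because the bonus decays with the number of impressions, the bid quickly approaches the plain expected value, which fits the abstract's conclusion that active exploration adds little value.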

Bayesian group factor analysis with structured sparsity
構造化されたスパース性によるベイズ群因子分析

Latent factor models are the canonical statistical tool for exploratory analyses of low-dimensional linear structure for a matrix of $p$ features across $n$ samples. We develop a structured Bayesian group factor analysis model that extends the factor model to multiple coupled observation matrices; in the case of two observations, this reduces to a Bayesian model of canonical correlation analysis. Here, we carefully define a structured Bayesian prior that encourages both element-wise and column-wise shrinkage and leads to desirable behavior on high-dimensional data. In particular, our model puts a structured prior on the joint factor loading matrix, regularizing at three levels, which enables element-wise sparsity and unsupervised recovery of latent factors corresponding to structured variance across arbitrary subsets of the observations. In addition, our structured prior allows for both dense and sparse latent factors so that covariation among either all features or only a subset of features can be recovered. We use fast parameter-expanded expectation-maximization for parameter estimation in this model. We validate our method on simulated data with substantial structure. We show results of our method applied to three high-dimensional data sets, comparing results against a number of state-of-the-art approaches. These results illustrate useful properties of our model, including i) recovering sparse signal in the presence of dense effects; ii) the ability to scale naturally to large numbers of observations; iii) flexible observation- and factor-specific regularization to recover factors with a wide variety of sparsity levels and percentage of variance explained; and iv) tractable inference that scales to modern genomic and text data sizes.

潜在因子モデルは、$n$個のサンプルにわたる$p$個の特徴のマトリックスの低次元線形構造の探索的分析のための標準的な統計ツールです。私たちは、因子モデルを複数の結合された観測マトリックスに拡張する構造化ベイズグループ因子分析モデルを開発しました。2つの観測の場合、これは標準的な相関分析のベイズモデルに簡略化されます。ここでは、要素ごとおよび列ごとの両方の縮小を促進し、高次元データで望ましい動作につながる構造化ベイズ事前分布を慎重に定義します。特に、私たちのモデルは、結合因子負荷マトリックスに構造化事前分布を配置し、3つのレベルで正規化することで、観測の任意のサブセットにわたる構造化された分散に対応する潜在因子の要素ごとのスパース性と教師なし回復を可能にします。さらに、私たちの構造化事前分布は、密な潜在因子とスパースな潜在因子の両方を許容するため、すべての特徴または特徴のサブセットのみの間の共変動を回復できます。このモデルでは、パラメータ推定に高速パラメータ拡張期待値最大化を使用します。私たちは、実質的な構造を持つシミュレーションデータで我々の手法を検証しました。私たちは、3つの高次元データセットに我々の手法を適用した結果を示し、その結果をいくつかの最先端のアプローチと比較しました。これらの結果は、i)密な効果がある場合でもスパース信号を回復できること、ii)多数の観測値に自然に拡張できること、iii)観測値と因子固有の柔軟な正則化により、さまざまなスパースレベルと説明された分散率を持つ因子を回復できること、iv)最新のゲノムおよびテキストデータのサイズに拡張できる扱いやすい推論など、我々のモデルの有用な特性を示しています。

Bipartite Ranking: a Risk-Theoretic Perspective
二部ランキング:リスク理論の視点

We present a systematic study of the bipartite ranking problem, with the aim of explicating its connections to the class-probability estimation problem. Our study focuses on the properties of the statistical risk for bipartite ranking with general losses, which is closely related to a generalised notion of the area under the ROC curve: we establish alternate representations of this risk, relate the Bayes-optimal risk to a class of probability divergences, and characterise the set of Bayes-optimal scorers for the risk. We further study properties of a generalised class of bipartite risks, based on the $p$-norm push of Rudin (2009). Our analysis is based on the rich framework of proper losses, which are the central tool in the study of class-probability estimation. We show how this analytic tool makes transparent the generalisations of several existing results, such as the equivalence of the minimisers for four seemingly disparate risks from bipartite ranking and class-probability estimation. A novel practical implication of our analysis is the design of new families of losses for scenarios where accuracy at the head of the ranked list is paramount, with comparable empirical performance to the $p$-norm push.

私たちは、二部順位付け問題に関する体系的な研究を提示し、そのクラス確率推定問題との関連を明らかにすることを目指しています。我々の研究は、一般損失を伴う二部順位付けの統計的リスクの特性に焦点を当てており、これはROC曲線の下の領域の一般化された概念と密接に関連しています。我々はこのリスクの代替表現を確立し、ベイズ最適リスクを確率発散のクラスに関連付け、リスクのベイズ最適スコアラーのセットを特徴付ける。我々はさらに、Rudin (2009)の$p$ノルムプッシュに基づいて、一般化された二部リスクのクラスの特性を研究します。我々の分析は、クラス確率推定の研究における中心的なツールである適切な損失の豊富なフレームワークに基づいています。私たちは、この分析ツールが、二部順位付けとクラス確率推定からの4つの一見異なるリスクの最小化者の同等性など、いくつかの既存の結果の一般化をどのように透明化するかを示す。私たちの分析の新しい実用的な意味合いは、ランク付けされたリストの先頭での精度が最も重要であるシナリオに対して、$p$ノルムプッシュに匹敵する実験的パフォーマンスを備えた新しい損失ファミリーを設計することです。
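
The link between bipartite ranking risk and the area under the ROC curve mentioned above can be illustrated with the simplest case, the 0-1 pairwise loss. This is a minimal sketch with hypothetical names and toy data: the empirical AUC is the fraction of (positive, negative) pairs that the scorer orders correctly, and the empirical ranking risk under this loss is one minus that fraction. The general losses and the $p$-norm push studied in the paper are not covered.

```python
def pairwise_auc(scores, labels):
    """Empirical AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    labels are 1 for positives and 0 for negatives.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    correct = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


auc_value = pairwise_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
ranking_risk = 1.0 - auc_value  # empirical bipartite risk under 0-1 loss
```

Here three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75 and the ranking risk is 0.25.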

Optimal Learning Rates for Localized SVMs
ローカライズされたSVMの最適な学習率

One of the limiting factors of using support vector machines (SVMs) in large scale applications is their super-linear computational requirements in terms of the number of training samples. To address this issue, several approaches that train SVMs on many small chunks separately have been proposed in the literature. With the exception of random chunks, which is also known as divide-and-conquer kernel ridge regression, however, these approaches have only been empirically investigated. In this work we investigate a spatially oriented method to generate the chunks. For the resulting localized SVM that uses Gaussian kernels and the least squares loss we derive an oracle inequality, which in turn is used to deduce learning rates that are essentially minimax optimal under some standard smoothness assumptions on the regression function. In addition, we derive local learning rates that are based on the local smoothness of the regression function. We further introduce a data-dependent parameter selection method for our local SVM approach and show that this method achieves the same almost optimal learning rates. Finally, we present a few larger scale experiments for our localized SVM showing that it achieves essentially the same test error as a global SVM for a fraction of the computational requirements. In addition, it turns out that the computational requirements for the local SVMs are similar to those of a vanilla random chunk approach, while the achieved test errors are significantly better.

大規模アプリケーションでサポートベクターマシン(SVM)を使用する際の制限要因の1つは、トレーニングサンプルの数に関して超線形の計算要件があることです。この問題に対処するために、多数の小さなチャンクでSVMを個別にトレーニングするいくつかのアプローチが文献で提案されています。ただし、分割統治カーネルリッジ回帰とも呼ばれるランダムチャンクを除いて、これらのアプローチは経験的にしか調査されていません。この研究では、チャンクを生成するための空間指向の方法を調査します。ガウスカーネルと最小二乗損失を使用する結果として得られるローカライズされたSVMに対して、オラクル不等式を導出します。この不等式は、回帰関数のいくつかの標準的な平滑性の仮定の下で、基本的にミニマックス最適である学習率を推定するために使用されます。さらに、回帰関数のローカル平滑性に基づくローカル学習率を導出します。さらに、ローカルSVMアプローチのデータ依存パラメーター選択方法を導入し、この方法がほぼ最適な同じ学習率を達成することを示します。最後に、ローカライズされたSVMのより大規模な実験をいくつか紹介し、計算要件のほんの一部で、グローバルSVMと本質的に同じテストエラーを達成できることを示しています。さらに、ローカルSVMの計算要件は通常のランダムチャンクアプローチとほぼ同じですが、達成されたテストエラーは大幅に優れていることがわかりました。

Data-driven Rank Breaking for Efficient Rank Aggregation
効率的なランク集計のためのデータ駆動型ランク分割

Rank aggregation systems collect ordinal preferences from individuals to produce a global ranking that represents the social preference. Rank-breaking is a common practice to reduce the computational complexity of learning the global ranking. The individual preferences are broken into pairwise comparisons and applied to efficient algorithms tailored for independent paired comparisons. However, due to the ignored dependencies in the data, naive rank-breaking approaches can result in inconsistent estimates. The key idea to produce accurate and consistent estimates is to treat the pairwise comparisons unequally, depending on the topology of the collected data. In this paper, we provide the optimal rank-breaking estimator, which not only achieves consistency but also achieves the best error bound. This allows us to characterize the fundamental tradeoff between accuracy and complexity. Further, the analysis identifies how the accuracy depends on the spectral gap of a corresponding comparison graph.

ランク集約システムは、個人から順序的な好みを収集し、社会的好みを表すグローバルランキングを作成します。ランク分割は、グローバルランキングの学習の計算の複雑さを軽減するための一般的な方法です。個人の好みは、一対比較に分割され、独立した一対比較に合わせて調整された効率的なアルゴリズムに適用されます。ただし、データ内の依存関係が無視されるため、単純なランク分割アプローチでは、一貫性のない推定値が生成されます。正確で一貫性のある推定値を生成するための重要なアイデアは、収集されたデータのトポロジに応じて、一対比較を不平等に扱うことです。この論文では、一貫性を実現するだけでなく、最良の誤差範囲も実現する最適なランク分割推定量を提供します。これにより、精度と複雑さの基本的なトレードオフを特徴付けることができます。さらに、分析では、精度が対応する比較グラフのスペクトルギャップにどのように依存するかを特定します。
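
For readers unfamiliar with rank-breaking, the sketch below shows the naive "full" variant described in the abstract above: one individual's ordered list is broken into every implied pairwise comparison, all treated equally. The function name is hypothetical. This is exactly the baseline the abstract warns can give inconsistent estimates; the paper's data-driven estimator, which treats pairs unequally, is not reproduced.

```python
from itertools import combinations


def full_rank_breaking(ranking):
    """Break one ranking (best item first) into (winner, loser) pairs.

    combinations() keeps the input order, so the first element of each
    pair is always the item that was ranked higher.
    """
    return list(combinations(ranking, 2))


pairs = full_rank_breaking(["x", "y", "z"])
```

A ranking of k items yields k(k-1)/2 pairs, and the pairs from one list are not independent, which is the dependency that naive approaches ignore.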

On the Influence of Momentum Acceleration on Online Learning
運動量加速がオンライン学習に及ぼす影響について

The article examines in some detail the convergence rate and mean-square-error performance of momentum stochastic gradient methods in the constant step-size and slow adaptation regime. The results establish that momentum methods are equivalent to the standard stochastic gradient method with a re-scaled (larger) step-size value. The size of the re-scaling is determined by the value of the momentum parameter. The equivalence result is established for all time instants and not only in steady-state. The analysis is carried out for general strongly convex and smooth risk functions, and is not limited to quadratic risks. One notable conclusion is that the well-known benefits of momentum constructions for deterministic optimization problems do not necessarily carry over to the adaptive online setting when small constant step-sizes are used to enable continuous adaptation and learning in the presence of persistent gradient noise. From simulations, the equivalence between momentum and standard stochastic gradient methods is also observed for non-differentiable and non-convex problems.

この論文では、一定のステップサイズと遅い適応レジームにおけるモメンタム確率勾配法の収束率と平均二乗誤差のパフォーマンスを詳細に検討しています。結果から、モメンタム法は、再スケーリングされた（より大きな）ステップサイズ値を持つ標準的な確率勾配法と同等であることが立証されています。再スケーリングのサイズは、モメンタムパラメータの値によって決まります。同等の結果は、定常状態だけでなく、すべての時点に対して立証されています。分析は、一般的な強凸および滑らかなリスク関数に対して実行され、2次リスクに限定されません。注目すべき結論の1つは、持続的な勾配ノイズが存在する場合に連続的な適応と学習を可能にするために小さな一定のステップサイズが使用される場合、決定論的最適化問題に対するモメンタム構築のよく知られた利点は、必ずしも適応型オンライン設定に引き継がれるわけではないということです。シミュレーションから、モメンタム法と標準的な確率勾配法の同等性は、微分不可能な問題と非凸問題でも観察されています。
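
The equivalence claimed above can be probed numerically on a toy problem. The sketch below runs a heavy-ball momentum recursion and plain stochastic gradient descent on the risk 0.5 * w**2, feeding both the same noise sequence. The specific rescaling `mu / (1 - beta)` is an assumption of this example (the abstract only says the equivalent step size is larger and depends on the momentum parameter), as are the function name and all numerical settings.

```python
import random


def run(step, beta, n_iter=5000, seed=0, noise_std=1.0):
    """Heavy-ball stochastic gradient on 0.5 * w**2; beta = 0 is plain SGD."""
    rng = random.Random(seed)
    w_prev = w = 5.0
    for _ in range(n_iter):
        # Noisy gradient of 0.5 * w**2 (persistent gradient noise).
        grad = w + rng.gauss(0.0, noise_std)
        w_next = w - step * grad + beta * (w - w_prev)
        w_prev, w = w, w_next
    return w


mu, beta = 0.001, 0.5
w_momentum = run(mu, beta)
# Plain SGD with the assumed rescaled step size mu / (1 - beta).
w_sgd = run(mu / (1.0 - beta), 0.0)
```

With a small constant step size, both iterates end up hovering in a similar small neighbourhood of the minimiser at zero, which is the behaviour the abstract describes for the slow adaptation regime.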

One-class classification of point patterns of extremes
極値の点パターンの 1 クラス分類

Novelty detection or one-class classification starts from a model describing some type of ‘normal behaviour’ and aims to classify deviations from this model as being either novelties or anomalies. In this paper the problem of novelty detection for point patterns $S=\{\mathbf{x}_1,\ldots ,\mathbf{x}_k\}\subset \mathbb{R}^d$ is treated where examples of anomalies are very sparse, or even absent. The latter complicates the tuning of hyperparameters in models commonly used for novelty detection, such as one-class support vector machines and hidden Markov models. To this end, the use of extreme value statistics is introduced to estimate explicitly a model for the abnormal class by means of extrapolation from a statistical model $X$ for the normal class. We show how multiple types of information obtained from any available extreme instances of $S$ can be combined to reduce the high false-alarm rate that is typically encountered when classes are strongly imbalanced, as often occurs in the one-class setting (whereby ‘abnormal’ data are often scarce). The approach is illustrated using simulated data and then a real-life application is used as an exemplar, whereby accelerometry data from epileptic seizures are analysed – these are known to be extreme and rare with respect to normal accelerometer data.

新規性検出または1クラス分類は、ある種の「正常な動作」を記述するモデルから始まり、このモデルからの逸脱を新規性または異常性として分類することを目的としています。この論文では、異常例が非常にまばらであるか、まったく存在しない場合の点パターン$S=\{\mathbf{x}_1,\ldots ,\mathbf{x}_k\}\subset \mathbb{R}^d$の新規性検出の問題を扱います。後者は、1クラスサポートベクターマシンや隠れマルコフモデルなど、新規性検出に一般的に使用されるモデルのハイパーパラメータの調整を複雑にします。この目的のために、極値統計の使用が導入され、正常クラスの統計モデル$X$からの外挿によって異常クラスのモデルを明示的に推定します。私たちは、$S$の任意の利用可能な極端な例から得られる複数の種類の情報を組み合わせて、クラスが著しく不均衡な場合に通常発生する高い誤報率を削減する方法を示します。これは、1クラス設定でよく発生します(この場合、「異常な」データが不足することがよくあります)。このアプローチは、シミュレーションデータを使用して説明され、次に実際のアプリケーションが例として使用され、てんかん発作の加速度データが分析されます。これは、通常の加速度データと比較して極端でまれであることが知られています。

A Variational Approach to Path Estimation and Parameter Inference of Hidden Diffusion Processes
隠れ拡散過程の経路推定とパラメータ推論への変分アプローチ

We consider a hidden Markov model, where the signal process, given by a diffusion, is only indirectly observed through some noisy measurements. The article develops a variational method for approximating the hidden states of the signal process given the full set of observations. This, in particular, leads to systematic approximations of the smoothing densities of the signal process. The paper then demonstrates how an efficient inference scheme, based on this variational approach to the approximation of the hidden states, can be designed to estimate the unknown parameters of stochastic differential equations. Two examples at the end illustrate the efficacy and the accuracy of the presented method.

私たちは、隠れマルコフモデルを考え、拡散によって与えられる信号プロセスは、いくつかのノイズの多い測定を通じて間接的にのみ観察されます。この記事では、観測の完全なセットが与えられた信号プロセスの隠れ状態を近似するための変分法を開発します。これは特に、信号プロセスの平滑化密度の系統的な近似につながります。次に、この変分アプローチに基づく隠れ状態の近似に基づく効率的な推論スキームを設計して、確率微分方程式の未知のパラメータを推定する方法を示します。最後にある2つの例は、提示された方法の有効性と精度を示しています。

Classification of Imbalanced Data with a Geometric Digraph Family
幾何学的ダイグラフファミリーによる不均衡データの分類

We use a geometric digraph family called class cover catch digraphs (CCCDs) to tackle the class imbalance problem in statistical classification. CCCDs provide graph theoretic solutions to the class cover problem and have been employed in classification. We assess the classification performance of CCCD classifiers by extensive Monte Carlo simulations, comparing them with other classifiers commonly used in the literature. In particular, we show that CCCD classifiers perform relatively well when one class is more frequent than the other in a two-class setting, an example of the class imbalance problem. We also point out the relationship between class imbalance and class overlapping problems, and their influence on the performance of CCCD classifiers and other classification methods as well as some state-of-the-art algorithms which are robust to class imbalance by construction. Experiments on both simulated and real data sets indicate that CCCD classifiers are robust to the class imbalance problem. CCCDs substantially undersample from the majority class while preserving the information on the discarded points during the undersampling process. Many state-of-the-art methods, however, keep this information by means of ensemble classifiers, but CCCDs yield only a single classifier with the same property, making it both appealing and fast.

私たちは、クラスカバーキャッチダイグラフ(CCCD)と呼ばれる幾何学的ダイグラフファミリーを使用して、統計的分類におけるクラス不均衡問題に取り組みます。CCCDは、クラスカバー問題に対するグラフ理論的ソリューションを提供し、分類に使用されています。私たちは、文献で一般的に使用されている他の分類器と比較しながら、広範なモンテカルロシミュレーションによってCCCD分類器の分類パフォーマンスを評価します。特に、クラス不均衡問題の一例である2クラス設定で1つのクラスが他のクラスよりも頻繁に発生する場合、CCCD分類器は比較的優れたパフォーマンスを発揮することを示します。また、クラス不均衡とクラス重複問題の関係、およびそれらがCCCD分類器やその他の分類方法のパフォーマンスに与える影響、および構造上クラス不均衡に対して堅牢な最先端のアルゴリズムについても指摘します。シミュレートされたデータセットと実際のデータセットの両方での実験により、CCCD分類器はクラス不均衡問題に対して堅牢であることが示されています。CCCDは、アンダーサンプリングプロセス中に破棄されたポイントに関する情報を保持しながら、多数派クラスから大幅にアンダーサンプリングします。しかし、多くの最先端の方法では、アンサンブル分類器によってこの情報を保持しますが、CCCDは同じ特性を持つ単一の分類器のみを生成するため、魅力的で高速です。

A New Algorithm and Theory for Penalized Regression-based Clustering
ペナルティ付き回帰ベースクラスタリングのための新しいアルゴリズムと理論

Clustering is unsupervised and exploratory in nature. Yet, it can be performed through penalized regression with grouping pursuit, as demonstrated in Pan et al. (2013). In this paper, we develop a more efficient algorithm for scalable computation and a new theory of clustering consistency for the method. This algorithm, called DC-ADMM, combines difference of convex (DC) programming with the alternating direction method of multipliers (ADMM). This algorithm is shown to be more computationally efficient than the quadratic penalty based algorithm of Pan et al. (2013) because of the former’s closed-form updating formulas. Numerically, we compare the DC-ADMM algorithm with the quadratic penalty algorithm to demonstrate its utility and scalability. Theoretically, we establish a finite-sample mis-clustering error bound for penalized regression based clustering with the $L_0$ constrained regularization in a general setting. On this ground, we provide conditions for clustering consistency of the penalized clustering method. As an end product, we make the R package prclust, which implements PRclust with various loss and grouping penalty functions, available on GitHub and CRAN.

クラスタリングは、本質的に教師なしかつ探索的です。しかし、Panら(2013)で実証されているように、グループ化追求によるペナルティ付き回帰を通じて実行できます。この論文では、スケーラブルな計算のためのより効率的なアルゴリズムと、この方法のクラスタリング一貫性の新しい理論を開発します。DC-ADMMと呼ばれるこのアルゴリズムは、凸差分(DC)計画法と交互方向乗数法(ADMM)を組み合わせたものです。このアルゴリズムは、前者の閉じた形式の更新式のため、Panら(2013)の二次ペナルティベースのアルゴリズムよりも計算効率が高いことが示されています。数値的には、DC-ADMMアルゴリズムと二次ペナルティアルゴリズムを比較して、その有用性とスケーラビリティを実証します。理論的には、一般的な設定で$L_0$制約付き正則化によるペナルティ付き回帰ベースのクラスタリングの有限サンプルのミスクラスタリングエラー境界を確立します。この根拠に基づいて、ペナルティ付きクラスタリング法のクラスタリング一貫性の条件を提供します。最終製品として、さまざまな損失およびグループ化ペナルティ関数を備えたPRclustを実装したRパッケージprclustをGitHubおよびCRANで公開しました。

Low-Rank Doubly Stochastic Matrix Decomposition for Cluster Analysis
クラスタ解析のための低ランク二重確率行列分解

Cluster analysis by nonnegative low-rank approximations has experienced remarkable progress in the past decade. However, the majority of such approximation approaches are still restricted to nonnegative matrix factorization (NMF) and suffer from the following two drawbacks: 1) they are unable to produce balanced partitions for large-scale manifold data which are common in real-world clustering tasks; 2) most existing NMF-type clustering methods cannot automatically determine the number of clusters. We propose a new low-rank learning method to address these two problems, which is beyond matrix factorization. Our method approximately decomposes a sparse input similarity in a normalized way and its objective can be used to learn both cluster assignments and the number of clusters. For efficient optimization, we use a relaxed formulation based on Data-Cluster-Data random walk, which is also shown to be equivalent to low-rank factorization of the doubly-stochastically normalized cluster incidence matrix. The probabilistic cluster assignments can thus be learned with a multiplicative majorization-minimization algorithm. Experimental results show that the new method is more accurate both in terms of clustering large-scale manifold data sets and of selecting the number of clusters.

非負低ランク近似によるクラスター分析は、過去10年間で目覚ましい進歩を遂げてきました。しかし、このような近似アプローチの大部分は、依然として非負行列因子分解(NMF)に限定されており、次の2つの欠点があります。1)実際のクラスタリングタスクで一般的な大規模な多様体データのバランスの取れたパーティションを生成できない。2)既存のNMFタイプのクラスタリング方法のほとんどは、クラスターの数を自動的に決定できない。これら2つの問題に対処するために、行列因子分解を超えた新しい低ランク学習方法を提案します。この方法は、スパースな入力類似性を正規化された方法で近似的に分解し、その目的を使用してクラスターの割り当てとクラスターの数の両方を学習できます。効率的な最適化のために、データ-クラスター-データランダムウォークに基づく緩和された定式化を使用します。これは、二重確率的に正規化されたクラスター接続行列の低ランク因子分解と同等であることも示されています。したがって、確率的なクラスター割り当ては、乗法的なメジャー化最小化アルゴリズムを使用して学習できます。実験結果では、新しい方法は、大規模なマニフォールドデータセットのクラスタリングとクラスターの数の選択の両方においてより正確であることが示されています。

Electronic Health Record Analysis via Deep Poisson Factor Models
ディープポアソン因子モデルによる電子カルテ分析

Electronic Health Record (EHR) phenotyping utilizes patient data captured through normal medical practice to identify features that may represent computational medical phenotypes. These features may be used to identify at-risk patients and improve prediction of patient morbidity and mortality. We present a novel deep multi-modality architecture for EHR analysis (applicable to joint analysis of multiple forms of EHR data), based on Poisson Factor Analysis (PFA) modules. Each modality, composed of observed counts, is represented as a Poisson distribution, parameterized in terms of hidden binary units. Information from different modalities is shared via a deep hierarchy of common hidden units. Activation of these binary units occurs with probability characterized as Bernoulli-Poisson link functions, instead of more traditional logistic link functions. In addition, we demonstrate that PFA modules can be adapted to discriminative modalities. To compute model parameters, we derive efficient Markov Chain Monte Carlo (MCMC) inference that scales efficiently, with significant computational gains when compared to related models based on logistic link functions. To explore the utility of these models, we apply them to a subset of patients from the Duke-Durham patient cohort. We identified a cohort of over 16,000 patients with Type 2 Diabetes Mellitus (T2DM) based on diagnosis codes and laboratory tests out of our patient population of over 240,000. Examining the common hidden units uniting the PFA modules, we identify patient features that represent medical concepts. Experiments indicate that our learned features are better able to predict mortality and morbidity than clinical features identified previously in a large-scale clinical trial.

電子健康記録(EHR)表現型解析では、通常の医療行為を通じて収集された患者データを利用して、計算医療表現型を表す可能性のある特徴を特定します。これらの特徴は、リスクのある患者を特定し、患者の罹患率と死亡率の予測を改善するために使用できます。私たちは、ポアソン因子分析(PFA)モジュールに基づく、EHR分析用の新しいディープマルチモダリティアーキテクチャ(複数の形式のEHRデータの共同分析に適用可能)を紹介します。観測されたカウントで構成される各モダリティは、隠れたバイナリユニットでパラメーター化されたポアソン分布として表されます。異なるモダリティからの情報は、共通の隠れユニットの深い階層を介して共有されます。これらのバイナリユニットのアクティブ化は、従来のロジスティックリンク関数ではなく、ベルヌーイポアソンリンク関数として特徴付けられる確率で発生します。さらに、PFAモジュールを識別モダリティに適応できることを実証します。モデルパラメータを計算するために、ロジスティックリンク関数に基づく関連モデルと比較して、計算上の大幅な向上を伴う、効率的に拡張可能なマルコフ連鎖モンテカルロ（MCMC）推論を導出します。これらのモデルの有用性を探るため、デューク・ダーラム患者コホートの患者のサブセットにモデルを適用します。240,000人を超える患者集団から、診断コードと臨床検査に基づいて、2型糖尿病（T2DM）の患者16,000人を超えるコホートを特定しました。PFAモジュールを結合する共通の隠れユニットを調べることで、医療概念を表す患者の特徴を特定します。実験では、学習した特徴は、大規模な臨床試験で以前に特定された臨床特徴よりも死亡率と罹患率をより正確に予測できることが示されています。
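
The abstract contrasts Bernoulli-Poisson links with logistic links without writing either down. As a small sketch, assuming the usual definition in which a binary unit is on exactly when a latent Poisson count is at least one, the activation probability is 1 - exp(-rate). The function names are hypothetical and the deep architecture itself is not modelled here.

```python
import math


def bernoulli_poisson_link(rate):
    """P(b = 1) when b = 1 iff a Poisson(rate) count is >= 1 (assumed form).

    rate must be non-negative.
    """
    return 1.0 - math.exp(-rate)


def logistic_link(x):
    """The more traditional logistic link, shown for comparison."""
    return 1.0 / (1.0 + math.exp(-x))


p_off = bernoulli_poisson_link(0.0)  # zero rate: the unit is never active
p_on = bernoulli_poisson_link(2.0)   # roughly 0.86
p_logistic = logistic_link(0.0)      # the logistic link gives 0.5 at zero
```

One visible difference is at zero input: the Bernoulli-Poisson link switches a unit off with certainty, whereas the logistic link still activates it half of the time.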

The Factorized Self-Controlled Case Series Method: An Approach for Estimating the Effects of Many Drugs on Many Outcomes
因子分解自己対照症例シリーズ法:多くの結果に対する多くの薬物の影響を推定するためのアプローチ

We provide a hierarchical Bayesian model for estimating the effects of transient drug exposures on a collection of health outcomes, where the effects of all drugs on all outcomes are estimated simultaneously. The method possesses properties that allow it to handle important challenges of dealing with large-scale longitudinal observational databases. In particular, this model is a generalization of the self-controlled case series (SCCS) method, meaning that certain patient specific baseline rates never need to be estimated. Further, this model is formulated with layers of latent factors, which substantially reduces the number of parameters and helps with interpretability by illuminating latent classes of drugs and outcomes. We believe our work is the first to consider multivariate SCCS (in the sense of multiple outcomes) and is the first to couple latent factor analysis with SCCS. We demonstrate the approach by estimating the effects of various time-sensitive insulin treatments for diabetes.

私たちは、一過性の薬物曝露が健康アウトカムの集合に与える影響を推定するための階層的ベイズモデルを提供します。このモデルでは、すべての薬物がすべての結果に与える影響が同時に推定されます。この方法は、大規模な縦断的観察データベースを扱う際の重要な課題に対処できる特性を備えています。特に、このモデルは自己対照症例シリーズ(SCCS)法の一般化であり、特定の患者固有のベースライン率を推定する必要がまったくないことを意味します。さらに、このモデルは潜在因子の層で定式化されており、これによりパラメーターの数が大幅に削減され、潜在的な薬物クラスと結果が明らかになり、解釈が容易になります。我々の研究は、多変量SCCS (複数の結果という意味で)を考慮した最初の研究であり、潜在因子分析とSCCSを組み合わせた最初の研究であると考えています。私たちは、糖尿病に対するさまざまな時間に敏感なインスリン治療の効果を推定することで、このアプローチを実証します。

fastFM: A Library for Factorization Machines
fastFM:因数分解機のライブラリ

Factorization Machines (FM) are currently only used in a narrow range of applications and are not yet part of the standard machine learning toolbox, despite their great success in collaborative filtering and click-through rate prediction. However, Factorization Machines are a general model to deal with sparse and high dimensional features. Our Factorization Machine implementation (fastFM) provides easy access to many solvers and supports regression, classification and ranking tasks. Such an implementation simplifies the use of FM for a wide range of applications. Therefore, our implementation has the potential to improve understanding of the FM model and drive new development.

因数分解機(FM)は、現在、狭い範囲のアプリケーションでのみ使用されており、協調フィルタリングとクリックスルー率予測で大きな成功を収めているにもかかわらず、まだ標準の機械学習ツールボックスの一部ではありません。しかし、因数分解機は、疎で高次元の特徴を扱うための一般的なモデルです。当社の因数分解機の実装(fastFM)は、多くのソルバーに簡単にアクセスでき、回帰、分類、およびランク付けタスクをサポートします。このような実装により、幅広いアプリケーションでのFMの使用が簡素化されます。したがって、私たちの実装は、FMモデルの理解を深め、新しい開発を推進する可能性があります。
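The abstract describes FM as a general model for sparse, high-dimensional features. As a minimal sketch of the standard second-order FM scoring rule (this is not fastFM's API; the function names `fm_predict` and `fm_predict_naive` and the toy parameters are assumptions made for illustration), the pairwise term can be computed in time linear in the number of features and compared against the direct double sum:

```python
from itertools import combinations


def fm_predict(x, w0, w, V):
    """Second-order FM score for one feature vector x.

    w0 is the bias, w the list of linear weights, and V a list of
    per-feature factor vectors (one row of length k per feature).
    Uses the identity
        sum_{i<j} <V_i, V_j> x_i x_j
          = 0.5 * sum_f [ (sum_i V_if x_i)^2 - sum_i (V_if x_i)^2 ].
    """
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def fm_predict_naive(x, w0, w, V):
    """Same score computed with the explicit sum over feature pairs."""
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = sum(
        sum(a * b for a, b in zip(V[i], V[j])) * x[i] * x[j]
        for i, j in combinations(range(n), 2)
    )
    return w0 + linear + pairwise


# Toy parameters, chosen only for illustration.
x = [1.0, 0.0, 2.0]
w0 = 0.5
w = [0.1, 0.2, 0.3]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
score = fm_predict(x, w0, w, V)
```

Because zero entries of `x` contribute nothing to either sum, the linear-time form is what makes the model practical on sparse inputs.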

Learning with Differential Privacy: Stability, Learnability and the Sufficiency and Necessity of ERM Principle
差分プライバシーによる学習:安定性、学習可能性、ERM原理の十分性と必要性

While machine learning has proven to be a powerful data-driven solution to many real-life problems, its use in sensitive domains has been limited due to privacy concerns. A popular approach known as differential privacy offers provable privacy guarantees, but it is often observed in practice that it could substantially hamper learning accuracy. In this paper we study the learnability (whether a problem can be learned by any algorithm) under Vapnik’s general learning setting with differential privacy constraint, and reveal some intricate relationships between privacy, stability and learnability. In particular, we show that a problem is privately learnable if and only if there is a private algorithm that asymptotically minimizes the empirical risk (AERM). In contrast, for non-private learning AERM alone is not sufficient for learnability. This result suggests that when searching for private learning algorithms, we can restrict the search to algorithms that are AERM. In light of this, we propose a conceptual procedure that always finds a universally consistent algorithm whenever the problem is learnable under privacy constraint. We also propose a generic and practical algorithm and show that under very general conditions it privately learns a wide class of learning problems. Lastly, we extend some of the results to the more practical $(\epsilon,\delta)$-differential privacy and establish the existence of a phase-transition on the class of problems that are approximately privately learnable with respect to how small $\delta$ needs to be.

機械学習は、多くの現実の問題に対する強力なデータ駆動型ソリューションであることが証明されていますが、プライバシーの懸念から、機密性の高い領域での使用は制限されています。差分プライバシーとして知られる一般的なアプローチは、証明可能なプライバシー保証を提供しますが、実際には学習精度を大幅に妨げる可能性があることがしばしば観察されています。この論文では、差分プライバシー制約のあるVapnikの一般的な学習設定での学習可能性(問題が何らかのアルゴリズムで学習できるかどうか)を調査し、プライバシー、安定性、学習可能性の間の複雑な関係を明らかにします。特に、経験的リスクを漸近的に最小化する(AERM)プライベートアルゴリズムが存在する場合、かつその場合に限り、問題がプライベートに学習可能であることを示します。対照的に、非プライベート学習の場合、AERMだけでは学習可能性には不十分です。この結果は、プライベート学習アルゴリズムを探索するときに、AERMであるアルゴリズムに探索を制限できることを示唆しています。これを考慮して、問題がプライバシー制約下で学習可能である場合は常に普遍的に一貫したアルゴリズムを見つける概念的な手順を提案します。また、汎用的で実用的なアルゴリズムを提案し、非常に一般的な条件下では、幅広いクラスの学習問題をプライベートに学習できることを示します。最後に、結果の一部をより実用的な$(\epsilon,\delta)$差分プライバシーに拡張し、$\delta$がどれだけ小さくなる必要があるかに関して、近似的にプライベートに学習可能な問題のクラスに相転移が存在することを証明します。
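For reference, the $(\epsilon,\delta)$ notion mentioned at the end of the abstract is usually stated as follows. This is the standard textbook definition rather than a quotation from the paper, and the symbols $\mathcal{A}$, $Z$, $Z'$ and $S$ are notation chosen here: a randomized algorithm $\mathcal{A}$ is $(\epsilon,\delta)$-differentially private if, for every pair of data sets $Z$ and $Z'$ that differ in a single record and every measurable set $S$ of outputs,

```latex
\Pr\big[\mathcal{A}(Z) \in S\big] \;\le\; e^{\epsilon}\,\Pr\big[\mathcal{A}(Z') \in S\big] + \delta .
```

Setting $\delta = 0$ recovers pure $\epsilon$-differential privacy, which is why the size of $\delta$ is the natural quantity for the phase transition the abstract refers to.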

Jointly Informative Feature Selection Made Tractable by Gaussian Modeling
ガウスモデリングによる扱いやすい共同情報特徴選択

We address the problem of selecting groups of jointly informative, continuous features in the context of classification and propose several novel criteria for performing this selection. The proposed class of methods is based on combining a Gaussian modeling of the feature responses with derived bounds on and approximations to their mutual information with the class label. Furthermore, specific algorithmic implementations of these criteria are presented which reduce the computational complexity of the proposed feature selection algorithms by up to two orders of magnitude. Consequently, we show that feature selection based on the joint mutual information of features and class label is in fact tractable; this runs contrary to prior works that largely depend on marginal quantities. An empirical evaluation using several types of classifiers on multiple data sets shows that this class of methods outperforms state-of-the-art baselines, both in terms of speed and classification accuracy.

私たちは、分類の文脈で、共同で情報を持つ連続的な特徴のグループを選択する問題に取り組み、この選択を実行するためのいくつかの新しい基準を提案します。提案された手法のクラスは、特徴応答のガウスモデリングと、クラスラベルとの相互情報量に対して導出された境界および近似とを組み合わせることに基づいています。さらに、これらの基準の特定のアルゴリズム実装が示され、提案された特徴選択アルゴリズムの計算複雑さを最大2桁削減します。したがって、特徴とクラスラベルの結合相互情報量に基づく特徴選択は、実際には扱いやすいことを示します。これは、周辺量に大きく依存する先行研究とは対照的です。複数のデータセットで数種類の分類器を使用した経験的評価では、このクラスの方法が、速度と分類精度の両方の点で最先端のベースラインよりも優れていることが示されています。

Consistency of Cheeger and Ratio Graph Cuts
CheegerとRatioのグラフカットの一貫性

This paper establishes the consistency of a family of graph-cut-based algorithms for clustering of data clouds. We consider point clouds obtained as samples of a ground-truth measure. We investigate approaches to clustering based on minimizing objective functionals defined on proximity graphs of the given sample. Our focus is on functionals based on graph cuts like the Cheeger and ratio cuts. We show that minimizers of these cuts converge as the sample size increases to a minimizer of a corresponding continuum cut (which partitions the ground truth measure). Moreover, we obtain sharp conditions on how the connectivity radius can be scaled with respect to the number of sample points for the consistency to hold. We provide results for two-way and for multiway cuts. Furthermore, we provide numerical experiments that illustrate the results and explore the optimality of scaling in dimension two.

この論文では、データクラウドのクラスタリングのためのグラフカットベースのアルゴリズムファミリーの一貫性を確立します。点群は、グラウンドトゥルース測度からのサンプルとして得られるものと考えます。特定のサンプルの近接グラフで定義された目的汎関数の最小化に基づくクラスタリングのアプローチを調査します。私たちは、Cheegerカットやレシオカットのようなグラフカットに基づく汎関数に焦点を当てています。これらのカットの最小化器は、サンプルサイズが増加するにつれて、対応する連続体カット(グラウンドトゥルース測度を分割する)の最小化器に収束することを示します。さらに、一貫性が成り立つために、サンプルポイントの数に対して接続半径をどのようにスケーリングできるかについて、シャープな条件を得ます。2方向カットとマルチウェイカットの結果を提供します。さらに、結果を示す数値実験を提供し、次元2でのスケーリングの最適性を調査します。
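To make the two objectives concrete, here is a minimal sketch of the Cheeger and ratio cut values of a two-way partition of a small graph. It assumes the common normalizations cut/min(|A|, |A^c|) and cut/(|A| |A^c|); the paper works with proximity graphs and its own scaling constants, so the helper names and the toy graph below are illustrative only.

```python
def build_weights(n, edges):
    """Symmetric weight matrix of an undirected graph with unit edge weights."""
    W = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        W[i][j] = W[j][i] = 1.0
    return W


def cut_weight(W, A):
    """Total weight of the edges between vertex set A and its complement."""
    inside = set(A)
    if not inside or len(inside) == len(W):
        raise ValueError("A must be a non-empty proper subset of the vertices")
    return sum(W[i][j] for i in inside for j in range(len(W)) if j not in inside)


def cheeger_cut(W, A):
    """Cut weight divided by the size of the smaller side."""
    a = len(set(A))
    return cut_weight(W, A) / min(a, len(W) - a)


def ratio_cut(W, A):
    """Cut weight divided by the product of the two side sizes."""
    a = len(set(A))
    return cut_weight(W, A) / (a * (len(W) - a))


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
W = build_weights(6, edges)

balanced = [0, 1, 2]   # splits the graph at the bridge
lopsided = [0]         # isolates a single vertex
print(cheeger_cut(W, balanced), ratio_cut(W, balanced))
print(cheeger_cut(W, lopsided), ratio_cut(W, lopsided))
```

Both objectives prefer the balanced split at the bridge over isolating one vertex, which is the behaviour that distinguishes them from the plain minimum cut.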

Characteristic Kernels and Infinitely Divisible Distributions
特性カーネルと無限に分割可能な分布

We connect shift-invariant characteristic kernels to infinitely divisible distributions on $\mathbb{R}^{d}$. Characteristic kernels play an important role in machine learning applications with their kernel means to distinguish any two probability measures. The contribution of this paper is twofold. First, we show, using the Lévy–Khintchine formula, that any shift-invariant kernel given by a bounded, continuous, and symmetric probability density function (pdf) of an infinitely divisible distribution on $\mathbb{R}^d$ is characteristic. We mention some closure properties of such characteristic kernels under addition, pointwise product, and convolution. Second, in developing various kernel mean algorithms, it is fundamental to compute the following values: (i) kernel mean values $m_P(x)$, $x \in \mathcal{X}$, and (ii) kernel mean RKHS inner products ${\left\langle m_P, m_Q \right\rangle _{\mathcal{H}}}$, for probability measures $P, Q$. If $P, Q$, and kernel $k$ are Gaussians, then the computation of (i) and (ii) results in Gaussian pdfs that are tractable. We generalize this Gaussian combination to more general cases in the class of infinitely divisible distributions. We then introduce a conjugate kernel and a convolution trick, so that the above (i) and (ii) have the same pdf form, expecting tractable computation at least in some cases. As specific instances, we explore $\alpha$-stable distributions and a rich class of generalized hyperbolic distributions, where the Laplace, Cauchy, and Student’s $t$ distributions are included.

私たちは、シフト不変特性カーネルを$\mathbb{R}^{d}$上の無限に分割可能な分布に関連付けます。特性カーネルは、カーネル平均によって任意の2つの確率測度を区別できるため、機械学習アプリケーションで重要な役割を果たします。この論文の貢献は2つあります。まず、Lévy–Khintchine公式を使用して、$\mathbb{R}^d$上の無限に分割可能な分布の有界、連続、対称の確率密度関数(pdf)によって与えられるシフト不変カーネルはいずれも特性カーネルであることを示します。このような特性カーネルの加算、点ごとの積、畳み込みに対する閉包特性についてもいくつか説明します。第二に、さまざまなカーネル平均アルゴリズムを開発する際には、確率測度$P, Q$に対して以下の値を計算することが基本となります: (i)カーネル平均値$m_P(x)$、$x \in \mathcal{X}$、および(ii)カーネル平均RKHS内積${\left\langle m_P, m_Q \right\rangle _{\mathcal{H}}}$。$P$、$Q$、およびカーネル$k$がガウス分布である場合、(i)と(ii)の計算により、扱いやすいガウスpdfが生成されます。このガウス分布の組み合わせを、無限に分割可能な分布のクラスのより一般的なケースに一般化します。次に、共役カーネルと畳み込みトリックを導入して、上記の(i)と(ii)が同じpdf形式になるようにし、少なくともいくつかのケースでは扱いやすい計算が期待されるようにします。具体的な例として、$\alpha$安定分布と、ラプラス分布、コーシー分布、スチューデントの$t$分布を含む一般化双曲型分布の豊富なクラスを検討します。
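The Gaussian case mentioned in the abstract can be checked numerically. The sketch below assumes a one-dimensional setting with $P = N(\mu, \sigma^2)$ and an unnormalized Gaussian kernel $k(x, y) = \exp(-(x-y)^2 / (2\gamma^2))$, for which the kernel mean has the closed form $m_P(x) = \frac{\gamma}{\sqrt{\gamma^2+\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2(\gamma^2+\sigma^2)}\right)$. The function names and parameter values are assumptions made for illustration, not notation from the paper.

```python
import math


def gaussian_kernel(x, y, gamma):
    """Unnormalized Gaussian kernel with bandwidth gamma."""
    return math.exp(-((x - y) ** 2) / (2.0 * gamma ** 2))


def kernel_mean_closed_form(x, mu, sigma, gamma):
    """m_P(x) = E_{y ~ N(mu, sigma^2)} k(x, y), using the Gaussian convolution identity."""
    s2 = gamma ** 2 + sigma ** 2
    return gamma / math.sqrt(s2) * math.exp(-((x - mu) ** 2) / (2.0 * s2))


def kernel_mean_numeric(x, mu, sigma, gamma, half_width=12.0, steps=20000):
    """Same expectation by the trapezoidal rule on mu +/- half_width * sigma."""
    lo = mu - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    total = 0.0
    for i in range(steps + 1):
        y = lo + i * h
        density = math.exp(-((y - mu) ** 2) / (2.0 * sigma ** 2)) / (
            sigma * math.sqrt(2.0 * math.pi)
        )
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * gaussian_kernel(x, y, gamma) * density
    return total * h


closed = kernel_mean_closed_form(0.3, mu=1.0, sigma=0.7, gamma=1.5)
numeric = kernel_mean_numeric(0.3, mu=1.0, sigma=0.7, gamma=1.5)
print(closed, numeric)
```

The closed form is again a Gaussian-shaped function of $x$, with the variances of the kernel and of $P$ added together; this is the tractability that the paper extends beyond the Gaussian family.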

On the Consistency of the Likelihood Maximization Vertex Nomination Scheme: Bridging the Gap Between Maximum Likelihood Estimation and Graph Matching
尤度最大化頂点指名スキームの一貫性について:最尤推定とグラフマッチングの間のギャップを埋める

Given a graph in which a few vertices are deemed interesting a priori, the vertex nomination task is to order the remaining vertices into a nomination list such that there is a concentration of interesting vertices at the top of the list. Previous work has yielded several approaches to this problem, with theoretical results in the setting where the graph is drawn from a stochastic block model (SBM), including a vertex nomination analogue of the Bayes optimal classifier. In this paper, we prove that maximum likelihood (ML)-based vertex nomination is consistent, in the sense that the performance of the ML-based scheme asymptotically matches that of the Bayes optimal scheme. We prove theorems of this form both when model parameters are known and unknown. Additionally, we introduce and prove consistency of a related, more scalable restricted-focus ML vertex nomination scheme. Finally, we incorporate vertex and edge features into ML-based vertex nomination and briefly explore the empirical effectiveness of this approach.

いくつかの頂点が事前に興味深いとみなされるグラフが与えられた場合、頂点指名タスクは、興味深い頂点がリストの先頭に集中するように、残りの頂点を指名リストに並べることです。これまでの研究では、この問題に対するいくつかのアプローチが得られており、ベイズ最適分類器の頂点指名類似体を含む、確率的ブロックモデル(SBM)からグラフが描画される設定での理論的結果があります。この論文では、最大尤度(ML)ベースの頂点指名は、MLベースの方式のパフォーマンスがベイズ最適方式のパフォーマンスと漸近的に一致するという意味で一貫していることを証明します。モデルパラメータが既知および未知の場合の両方で、この形式の定理を証明します。さらに、関連するよりスケーラブルな制限された焦点のML頂点指名方式を紹介し、その一貫性を証明します。最後に、頂点とエッジの機能をMLベースの頂点指名に組み込み、このアプローチの実証的有効性を簡単に検討します。

The Asymptotic Performance of Linear Echo State Neural Networks
線形エコー状態ニューラルネットワークの漸近性能

In this article, we study the mean-square error (MSE) performance of linear echo-state neural networks for both training and testing tasks. Considering the realistic setting of noise present at the network nodes, we derive deterministic equivalents for the aforementioned MSE in the limit where the number of input data $T$ and network size $n$ both grow large. Then, specializing the network connectivity matrix to specific random settings, we further obtain simple formulas that provide new insights on the performance of such networks.

この記事では、学習タスクとテストタスクの両方について、線形エコー状態ニューラルネットワークの平均二乗誤差(MSE)性能の研究を行います。ネットワークノードにノイズが存在するという現実的な設定を考慮して、入力データの数$T$とネットワークサイズ$n$の両方が大きくなる極限において、前述のMSEの決定論的等価量を導出します。次に、ネットワーク接続行列を特定のランダム設定に特化することで、そのようなネットワークのパフォーマンスに関する新しい洞察を提供する簡単な式をさらに取得します。

A Note on the Sample Complexity of the Er-SpUD Algorithm by Spielman, Wang and Wright for Exact Recovery of Sparsely Used Dictionaries
スパースに使用される辞書の厳密な復元のための、Spielman、Wang、WrightによるEr-SpUDアルゴリズムのサンプル複雑度に関するメモ

We consider the problem of recovering an invertible $n \times n$ matrix $A$ and a sparse $n \times p$ random matrix $X$ based on the observation of $Y = AX$ (up to a scaling and permutation of columns of $A$ and rows of $X$). Using only elementary tools from the theory of empirical processes we show that a version of the Er-SpUD algorithm by Spielman, Wang and Wright with high probability recovers $A$ and $X$ exactly, provided that $p \ge Cn\log n$, which is optimal up to the constant $C$.

私たちは、$Y = AX$の観測に基づいて、可逆な$n \times n$行列$A$とスパースな$n \times p$ランダム行列$X$を($A$の列と$X$の行のスケーリングと置換を除いて)回復する問題を考えます。経験的過程の理論からの初等的なツールのみを使用して、Spielman、Wang、WrightによるEr-SpUDアルゴリズムのあるバージョンが、$p \ge Cn\log n$である場合に、高い確率で$A$と$X$を正確に回復することを示します。この条件は定数$C$を除いて最適です。

Input Output Kernel Regression: Supervised and Semi-Supervised Structured Output Prediction with Operator-Valued Kernels
入力出力カーネル回帰: 演算子値カーネルを使用した教師ありおよび半教師あり構造化出力予測

In this paper, we introduce a novel approach, called Input Output Kernel Regression (IOKR), for learning mappings between structured inputs and structured outputs. The approach belongs to the family of Output Kernel Regression methods devoted to regression in feature space endowed with some output kernel. In order to take into account structure in input data and benefit from kernels in the input space as well, we use the Reproducing Kernel Hilbert Space theory for vector-valued functions. We first recall the ridge solution for supervised learning and then study the regularized hinge loss-based solution used in Maximum Margin Regression. Both models are also developed in the context of semi-supervised setting. In addition we derive an extension of Generalized Cross Validation for model selection in the case of the least-square model. Finally we show the versatility of the IOKR framework on two different problems: link prediction seen as a structured output problem and multi-task regression seen as a multiple and interdependent output problem. Eventually, we present a set of detailed numerical results that shows the relevance of the method on these two tasks.

この論文では、構造化された入力と構造化された出力の間のマッピングを学習するための、入力出力カーネル回帰(IOKR)と呼ばれる新しいアプローチを紹介します。このアプローチは、出力カーネルを備えた特徴空間での回帰に特化した出力カーネル回帰法のファミリーに属します。入力データの構造を考慮し、入力空間のカーネルの利点も活用するために、ベクトル値関数の再生カーネルヒルベルト空間理論を使用します。最初に、教師あり学習のリッジソリューションを思い出し、次に最大マージン回帰で使用される正規化ヒンジ損失ベースのソリューションを調べます。両方のモデルは、半教師あり設定のコンテキストでも開発されています。さらに、最小二乗モデルの場合のモデル選択のための一般化クロス検証の拡張を導出します。最後に、構造化出力問題として見られるリンク予測と、複数の相互依存出力問題として見られるマルチタスク回帰という2つの異なる問題に対するIOKRフレームワークの汎用性を示します。最後に、これら2つのタスクにおけるこの方法の関連性を示す詳細な数値結果のセットを提示します。

bandicoot: a Python Toolbox for Mobile Phone Metadata
bandicoot: 携帯電話のメタデータのための Python ツールボックス

bandicoot is an open-source Python toolbox to extract more than 1442 features from standard mobile phone metadata. bandicoot makes it easy for machine learning researchers and practitioners to load mobile phone data, to analyze and visualize them, and to extract robust features which can be used for various classification and clustering tasks. Emphasis is put on ease of use, consistency, and documentation. bandicoot has no dependencies and is distributed under the MIT license.

bandicootは、標準の携帯電話メタデータから1442を超える特徴を抽出するためのオープンソースのPythonツールボックスです。bandicootは、機械学習の研究者や実務家が携帯電話のデータを読み込んで分析し、視覚化し、さまざまな分類やクラスタリングタスクに使用できる堅牢な特徴を抽出することを容易にします。使いやすさ、一貫性、および文書化に重点が置かれています。bandicootには依存関係がなく、MITライセンスの下で配布されています。

Efficient Computation of Gaussian Process Regression for Large Spatial Data Sets by Patching Local Gaussian Processes
局所ガウス過程のパッチによる大規模空間データセットのガウス過程回帰の効率的な計算

This paper develops an efficient computational method for solving a Gaussian process (GP) regression for large spatial data sets using a collection of suitably defined local GP regressions. The conventional local GP approach first partitions a domain into multiple non-overlapping local regions, and then fits an independent GP regression for each local region using the training data belonging to the region. Two key issues with the local GP are (1) the prediction around the boundary of a local region is not as accurate as the prediction in the interior of the local region, and (2) two local GP regressions for two neighboring local regions produce different predictions at the boundary of the two regions, creating undesirable discontinuity in the prediction. We address these issues by constraining the predictions of local GP regressions sharing a common boundary to satisfy the same boundary constraints, which in turn are estimated by the data. The boundary constrained local GP regressions are solved by a finite element method. Our approach shows competitive performance when compared with several state-of-the-art methods using two synthetic data sets and three real data sets.

この論文では、適切に定義されたローカルGP回帰のコレクションを使用して、大規模な空間データセットのガウス過程(GP)回帰を解くための効率的な計算方法を開発します。従来のローカルGPアプローチでは、最初にドメインを複数の重複しないローカル領域に分割し、次にその領域に属するトレーニングデータを使用して、各ローカル領域に独立したGP回帰を当てはめます。ローカルGPの2つの重要な問題は、(1)ローカル領域の境界付近の予測がローカル領域の内部での予測ほど正確ではないこと、および(2) 2つの隣接するローカル領域に対する2つのローカルGP回帰が、2つの領域の境界で異なる予測を生成するため、予測に望ましくない不連続性が生じることです。これらの問題に対処するために、共通の境界を共有するローカルGP回帰の予測を、データによって推定される同じ境界制約を満たすように制約します。境界制約付きローカルGP回帰は、有限要素法によって解決されます。私たちのアプローチは、2つの合成データセットと3つの実際のデータセットを使用したいくつかの最先端の方法と比較すると、競争力のあるパフォーマンスを示しています。

Online PCA with Optimal Regret
最適な後悔を伴うオンラインPCA

We investigate the online version of Principal Component Analysis (PCA), where in each trial $t$ the learning algorithm chooses a $k$-dimensional subspace, and upon receiving the next instance vector $\mathbf{x}_t$, suffers the compression loss, which is the squared Euclidean distance between this instance and its projection into the chosen subspace. When viewed in the right parameterization, this compression loss is linear, i.e. it can be rewritten as $\text{tr}(\mathbf{W}_t\mathbf{x}_t\mathbf{x}_t^\top)$, where $\mathbf{W}_t$ is the parameter of the algorithm and the outer product $\mathbf{x}_t\mathbf{x}_t^\top$ (with $\|\mathbf{x}_t\|\le 1$) is the instance matrix. In this paper we generalize PCA to arbitrary positive definite instance matrices $\mathbf{X}_t$ with the linear loss $\text{tr}(\mathbf{W}_t\mathbf{X}_t)$. We evaluate online algorithms in terms of their worst-case regret, which is a bound on the additional total loss of the online algorithm on all instance matrices over the compression loss of the best $k$-dimensional subspace (chosen in hindsight). We focus on two popular online algorithms for generalized PCA: the Gradient Descent (GD) and Matrix Exponentiated Gradient (MEG) algorithms. We show that if the regret is expressed as a function of the number of trials, then both algorithms are optimal to within a constant factor on worst-case sequences of positive definite instance matrices with trace norm at most one (which subsumes the original PCA problem with outer products). This is surprising because MEG is believed to be suboptimal in this case. We also show that when considering regret bounds as a function of a loss budget, MEG remains optimal and strictly outperforms GD when the instance matrices are trace norm bounded. Next, we consider online PCA when the adversary is allowed to present the algorithm with positive semidefinite instance matrices whose largest eigenvalue is bounded (rather than their trace, which is the sum of their eigenvalues). Again we can show that MEG is optimal and strictly better than GD in this setting.

私たちは、主成分分析(PCA)のオンライン版を調査します。このバージョンでは、各試行$t$で学習アルゴリズムが$k$次元のサブスペースを選択し、次のインスタンスベクトル$\mathbf{x}_t$を受け取ると、このインスタンスと選択されたサブスペースへの投影との間のユークリッド距離の2乗である圧縮損失が発生します。適切なパラメーター化で見ると、この圧縮損失は線形です。つまり、$\text{tr}(\mathbf{W}_t\mathbf{x}_t\mathbf{x}_t^\top)$と書き直すことができます。ここで、$\mathbf{W}_t$はアルゴリズムのパラメーターであり、外積$\mathbf{x}_t\mathbf{x}_t^\top$ (ただし$\|\mathbf{x}_t\|\le 1$)はインスタンス行列です。この論文では、線形損失が$\text{tr}(\mathbf{W}_t\mathbf{X}_t)$である任意の正定値インスタンス行列$\mathbf{X}_t$にPCAを一般化します。オンラインアルゴリズムを、最悪ケースの後悔の観点から評価します。後悔とは、すべてのインスタンス行列に対するオンラインアルゴリズムの合計損失が、最良の$k$次元サブスペース(後から選択)の圧縮損失を上回る量の上限です。一般化PCAの2つの一般的なオンラインアルゴリズム、勾配降下法(GD)アルゴリズムと行列指数勾配法(MEG)アルゴリズムに注目します。後悔が試行回数の関数として表現される場合、両方のアルゴリズムが、トレースノルムが最大1である正定値インスタンス行列の最悪ケースシーケンスに対して定数係数以内で最適であることを示します(これは、外積を含む元のPCA問題を包含します)。MEGはこの場合最適ではないと考えられているため、これは驚くべきことです。また、後悔境界を損失予算の関数として考えると、インスタンス行列がトレースノルムで制限されている場合、MEGは最適のままであり、GDより真に優れていることも示しています。次に、最大固有値が制限されている(固有値の合計であるトレースではなく)半正定値インスタンス行列を敵対者がアルゴリズムに提示できる場合のオンラインPCAについて考えます。ここでも、この設定ではMEGが最適であり、GDより真に優れていることがわかります。
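The claim that the compression loss is linear can be verified on a small example. The sketch below assumes the parameter is $\mathbf{W} = \mathbf{I} - \mathbf{P}$, where $\mathbf{P}$ is the orthogonal projection onto the chosen subspace, so that $\|\mathbf{x} - \mathbf{P}\mathbf{x}\|^2 = \text{tr}(\mathbf{W}\mathbf{x}\mathbf{x}^\top)$. The helper names and the toy subspace are illustrative assumptions.

```python
def projection_matrix(basis):
    """P = sum_i u_i u_i^T for an orthonormal basis given as a list of vectors."""
    d = len(basis[0])
    return [[sum(u[r] * u[c] for u in basis) for c in range(d)] for r in range(d)]


def compression_loss_direct(x, P):
    """Squared Euclidean distance between x and its projection P x."""
    d = len(x)
    proj = [sum(P[r][c] * x[c] for c in range(d)) for r in range(d)]
    return sum((x[r] - proj[r]) ** 2 for r in range(d))


def compression_loss_linear(x, P):
    """tr(W x x^T) with W = I - P, written as an explicit double sum."""
    d = len(x)
    return sum(
        ((1.0 if r == c else 0.0) - P[r][c]) * x[c] * x[r]
        for r in range(d)
        for c in range(d)
    )


# A 2-dimensional subspace of R^3 spanned by two orthonormal vectors.
basis = [[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]]
P = projection_matrix(basis)
x = [0.5, 0.5, 0.5]  # squared norm 0.75, so the norm is at most 1

print(compression_loss_direct(x, P), compression_loss_linear(x, P))
```

Both expressions agree because $\mathbf{I} - \mathbf{P}$ is itself a projection, so the loss depends on $\mathbf{x}$ only through the instance matrix $\mathbf{x}\mathbf{x}^\top$. That is what allows the generalization to arbitrary instance matrices $\mathbf{X}_t$.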

Semiparametric Mean Field Variational Bayes: General Principles and Numerical Issues
セミパラメトリック平均場変分ベイズ:一般原則と数値問題

We introduce the term semiparametric mean field variational Bayes to describe the relaxation of mean field variational Bayes in which some density functions in the product density restriction are pre-specified to be members of convenient parametric families. This notion has appeared in various guises in the mean field variational Bayes literature during its history and we endeavor to unify this important topic. We lay down a general framework and explain how previous relevant methodologies fall within this framework. A major contribution is elucidation of numerical issues that impact semiparametric mean field variational Bayes in practice.

私たちは、セミパラメトリック平均場変分ベイズという用語を導入して、積密度制約の一部の密度関数が便利なパラメトリックファミリーのメンバーとして事前に指定されている平均場変分ベイズの緩和を説明します。この概念は、その歴史の中で平均場変分ベイズ文献にさまざまな形で現れており、私たちはこの重要なトピックを統一しようと努力しています。一般的なフレームワークを定め、以前の関連する方法論がこのフレームワークにどのように該当するかを説明します。主な貢献は、実際にセミパラメトリック平均場変分ベイズに影響を与える数値問題の解明です。

Feature-Level Domain Adaptation
特徴レベルのドメイン適応

Domain adaptation is the supervised learning setting in which the training and test data are sampled from different distributions: training data is sampled from a source domain, whilst test data is sampled from a target domain. This paper proposes and studies an approach, called feature-level domain adaptation (FLDA), that models the dependence between the two domains by means of a feature-level transfer model that is trained to describe the transfer from source to target domain. Subsequently, we train a domain-adapted classifier by minimizing the expected loss under the resulting transfer model. For linear classifiers and a large family of loss functions and transfer models, this expected loss can be computed or approximated analytically, and minimized efficiently. Our empirical evaluation of FLDA focuses on problems comprising binary and count data in which the transfer can be naturally modeled via a dropout distribution, which allows the classifier to adapt to differences in the marginal probability of features in the source and the target domain. Our experiments on several real-world problems show that FLDA performs on par with state-of-the-art domain-adaptation techniques.

ドメイン適応とは、トレーニングデータとテストデータが異なる分布からサンプリングされる教師あり学習設定です。トレーニングデータはソースドメインからサンプリングされ、テストデータはターゲットドメインからサンプリングされます。この論文では、特徴レベルドメイン適応(FLDA)と呼ばれるアプローチを提案し、検討します。このアプローチでは、ソースドメインからターゲットドメインへの転送を記述するようにトレーニングされた特徴レベル転送モデルを使用して、2つのドメイン間の依存関係をモデル化します。次に、結果として得られる転送モデルの下で期待される損失を最小化することで、ドメイン適応分類器をトレーニングします。線形分類器と、損失関数および転送モデルの大規模なファミリの場合、この期待される損失は解析的に計算または近似し、効率的に最小化できます。FLDAの実証的評価では、バイナリデータとカウントデータを含む問題に焦点を当てています。これらの問題では、転送をドロップアウト分布を介して自然にモデル化できます。これにより、分類器は、ソースドメインとターゲットドメインの特徴の周辺確率の差に適応できます。いくつかの現実の問題に対する私たちの実験では、FLDAが最先端のドメイン適応技術と同等のパフォーマンスを発揮することが示されています。

mlr: Machine Learning in R
mlr: R での機械学習

The mlr package provides a generic, object-oriented, and extensible framework for classification, regression, survival analysis and clustering for the R language. It provides a unified interface to more than 160 basic learners and includes meta-algorithms and model selection techniques to improve and extend the functionality of basic learners with, e.g., hyperparameter tuning, feature selection, and ensemble construction. Parallel high-performance computing is natively supported. The package targets practitioners who want to quickly apply machine learning algorithms, as well as researchers who want to implement, benchmark, and compare their new methods in a structured environment.

mlrパッケージは、R言語の分類、回帰、生存時間分析、クラスタリングのための汎用的でオブジェクト指向の拡張可能なフレームワークを提供します。160を超える基本学習器に統一されたインターフェースを提供し、ハイパーパラメータ調整、特徴選択、アンサンブル構築など、基本学習器の機能を改善および拡張するためのメタアルゴリズムとモデル選択技術が含まれています。並列ハイパフォーマンスコンピューティングはネイティブにサポートされています。このパッケージは、機械学習アルゴリズムを迅速に適用したい実務家だけでなく、構造化された環境で新しい方法を実装し、ベンチマークし、比較したい研究者も対象としています。

Bounding the Search Space for Global Optimization of Neural Networks Learning Error: An Interval Analysis Approach
ニューラルネットワーク学習誤差のグローバル最適化のための探索空間の境界化:間隔分析アプローチ

Training a multilayer perceptron (MLP) with algorithms employing global search strategies has been an important research direction in the field of neural networks. Despite a number of significant results, an important matter concerning the bounds of the search region (typically defined as a box) where a global optimization method has to search for a potential global minimizer seems to be unresolved. The approach presented in this paper builds on interval analysis and attempts to define guaranteed bounds in the search space prior to applying a global search algorithm for training an MLP. These bounds depend on the machine precision, and the term "guaranteed" denotes that the region defined surely encloses weight sets that are global minimizers of the neural network’s error function. Although the solution set to the bounding problem for an MLP is in general non-convex, the paper presents the theoretical results that help derive a box which is a convex set. This box is an outer approximation of the algebraic solutions to the interval equations resulting from the function implemented by the network nodes. An experimental study using well known benchmarks is presented in accordance with the theoretical results.

グローバル検索戦略を採用したアルゴリズムで多層パーセプトロン(MLP)をトレーニングすることは、ニューラルネットワークの分野で重要な研究方向となっています。多くの重要な結果があるにもかかわらず、グローバル最適化法が潜在的なグローバル最小値を検索する必要がある検索領域(通常はボックスとして定義されます)の境界に関する重要な問題は未解決のようです。この論文で提示されているアプローチは、区間分析に基づいており、MLPをトレーニングするためのグローバル検索アルゴリズムを適用する前に、検索空間で保証された境界を定義しようとします。これらの境界はマシンの精度に依存し、保証という用語は、定義された領域がニューラルネットワークのエラー関数のグローバル最小値である重みセットを確実に囲むことを意味します。MLPの境界問題に対する解セットは一般に非凸ですが、この論文では、凸セットであるボックスを導出するのに役立つ理論的結果を示します。このボックスは、ネットワークノードによって実装された関数から生じる区間方程式の代数解の外部近似です。よく知られているベンチマークを使用した実験的研究が、理論的な結果に従って提示されます。

Stable Graphical Models
安定したグラフィカルモデル

Stable random variables are motivated by the central limit theorem for densities with (potentially) unbounded variance and can be thought of as natural generalizations of the Gaussian distribution to skewed and heavy-tailed phenomena. In this paper, we introduce $\alpha$-stable graphical ($\alpha$-SG) models, a class of multivariate stable densities that can also be represented as Bayesian networks whose edges encode linear dependencies between random variables. One major hurdle to the extensive use of stable distributions is the lack of a closed-form analytical expression for their densities. This makes penalized maximum-likelihood based learning computationally demanding. We establish theoretically that the Bayesian information criterion (BIC) can asymptotically be reduced to the computationally more tractable minimum dispersion criterion (MDC) and develop StabLe, a structure learning algorithm based on MDC. We use simulated datasets for five benchmark network topologies to empirically demonstrate how StabLe improves upon ordinary least squares (OLS) regression. We also apply StabLe to microarray gene expression data for lymphoblastoid cells from 727 individuals belonging to eight global population groups. We establish that StabLe improves test set performance relative to OLS via ten-fold cross-validation. Finally, we develop SGEX, a method for quantifying differential expression of genes between different population groups.

安定ランダム変数は、（潜在的に）非有界な分散を持つ密度に対する中心極限定理に動機付けられ、ガウス分布の、歪みや重い裾を持つ現象への自然な一般化として考えることができます。この論文では、エッジがランダム変数間の線形依存関係をエンコードするベイズネットワークとしても表現できる多変量安定密度のクラスである$\alpha$安定グラフィカル（$\alpha$-SG）モデルを紹介します。安定分布を広範囲に使用する上での大きな障害の1つは、その密度の閉じた形式の解析表現がないことです。これにより、ペナルティ付き最大尤度ベースの学習は計算的に要求が厳しくなります。ベイズ情報量基準（BIC）は、計算がより扱いやすい最小分散基準（MDC）に漸近的に帰着できることを理論的に確立し、MDCに基づく構造学習アルゴリズムであるStabLeを開発します。5つのベンチマークネットワークトポロジのシミュレーションデータセットを使用して、StabLeが通常の最小二乗法(OLS)回帰をどのように改善するかを実証します。また、8つの世界人口グループに属する727人のリンパ芽球細胞のマイクロアレイ遺伝子発現データにStabLeを適用します。10分割交差検証により、StabLeがOLSと比較してテストセットのパフォーマンスを向上させることを確認します。最後に、異なる人口グループ間の遺伝子の差次的発現を定量化する手法であるSGEXを開発します。

Support Vector Hazards Machine: A Counting Process Framework for Learning Risk Scores for Censored Outcomes
サポートベクターハザードマシン: 打ち切り結果のリスクスコアを学習するためのカウントプロセスフレームワーク

Learning risk scores to predict dichotomous or continuous outcomes using machine learning approaches has been studied extensively. However, how to learn risk scores for time-to-event outcomes subject to right censoring has received little attention until recently. Existing approaches rely on inverse probability weighting or rank-based regression, which may be inefficient. In this paper, we develop a new support vector hazards machine (SVHM) approach to predict censored outcomes. Our method is based on predicting the counting process associated with the time-to-event outcomes among subjects at risk via a series of support vector machines. Introducing counting processes to represent time-to-event data leads to a connection between support vector machines in supervised learning and hazards regression in standard survival analysis. To account for different at-risk populations at observed event times, a time-varying offset is used in estimating risk scores. The resulting optimization is a convex quadratic programming problem that can easily incorporate non-linearity using the kernel trick. We demonstrate an interesting link from the profiled empirical risk function of SVHM to the Cox partial likelihood. We then formally show that SVHM is optimal in discriminating covariate-specific hazard function from population average hazard function, and establish the consistency and learning rate of the predicted risk using the estimated risk scores. Simulation studies show improved prediction accuracy of the event times using SVHM compared to existing machine learning methods and standard conventional approaches. Finally, we analyze two real-world biomedical study data sets where we use clinical markers and neuroimaging biomarkers to predict age-at-onset of a disease, and demonstrate superiority of SVHM in distinguishing high-risk versus low-risk subjects.

機械学習アプローチを使用して二値または連続的な結果を予測するためのリスクスコアの学習は、広く研究されてきました。しかし、右打ち切りの対象となるイベントまでの時間の結果のリスクスコアの学習方法は、最近までほとんど注目されていませんでした。既存のアプローチは、逆確率重み付けまたはランクベースの回帰に依存していますが、これは非効率的である可能性があります。この論文では、打ち切り結果を予測するための新しいサポートベクターハザードマシン（SVHM）アプローチを開発します。私たちの方法は、一連のサポートベクターマシンを介して、リスクのある被験者のイベントまでの時間の結果に関連するカウントプロセスを予測することに基づいています。イベントまでの時間データを表すためにカウントプロセスを導入すると、教師あり学習のサポートベクターマシンと標準的な生存分析のハザード回帰が結びつきます。観測されたイベント時間におけるさまざまなリスクのある集団を考慮するために、リスクスコアの推定には時間変動オフセットが使用されます。結果として得られる最適化は、カーネルトリックを使用して非線形性を簡単に組み込むことができる凸二次計画問題です。SVHMのプロファイルされた経験的リスク関数からCox部分尤度への興味深いリンクを示します。次に、SVHMが共変量固有のハザード関数を母集団平均ハザード関数から区別するのに最適であることを正式に示し、推定リスクスコアを使用して予測リスクの一貫性と学習率を確立します。シミュレーション研究では、既存の機械学習方法や標準的な従来のアプローチと比較して、SVHMを使用したイベント時間の予測精度が向上しました。最後に、臨床マーカーと神経画像バイオマーカーを使用して疾患の発症年齢を予測する2つの実際の生物医学研究データを分析し、高リスク対象者と低リスク対象者を区別するSVHMの優位性を示します。

Joint Structural Estimation of Multiple Graphical Models
複数のグラフィカルモデルのジョイント構造推定

Gaussian graphical models capture dependence relationships between random variables through the pattern of nonzero elements in the corresponding inverse covariance matrices. To date, there has been a large body of literature on both computational methods and analytical results on the estimation of a single graphical model. However, in many application domains, one has to estimate several related graphical models, a problem that has also received attention in the literature. The available approaches usually assume that all graphical models are globally related. On the other hand, in many settings different relationships between subsets of the node sets exist between different graphical models. We develop methodology that jointly estimates multiple Gaussian graphical models, assuming that there exists prior information on how they are structurally related. For many applications, such information is available from external data sources. The proposed method consists of first applying neighborhood selection with a group lasso penalty to obtain edge sets of the graphs, and a maximum likelihood refit for estimating the nonzero entries in the inverse covariance matrices. We establish consistency of the proposed method for sparse high-dimensional Gaussian graphical models and examine its performance using simulation experiments. Applications to a climate data set and a breast cancer data set are also discussed.

ガウスグラフィカルモデルは、対応する逆共分散行列の非ゼロ要素のパターンを通じて、ランダム変数間の依存関係を捉えます。現在までに、単一のグラフィカルモデルの推定に関する計算方法と分析結果の両方に関する膨大な文献があります。しかし、多くのアプリケーションドメインでは、複数の関連するグラフィカルモデルを推定する必要があり、この問題も文献で注目されています。利用可能なアプローチでは通常、すべてのグラフィカルモデルが全体的に関連していると想定されています。一方、多くの設定では、異なるグラフィカルモデル間でノードセットのサブセット間に異なる関係が存在します。私たちは、複数のガウスグラフィカルモデルが構造的にどのように関連しているかに関する事前情報が存在すると想定して、複数のガウスグラフィカルモデルを共同で推定する方法論を開発します。多くのアプリケーションでは、このような情報は外部データソースから入手できます。提案された方法は、まずグループラッソペナルティによる近傍選択を適用してグラフのエッジセットを取得し、最大尤度再適合を行って逆共分散行列の非ゼロエントリを推定します。提案手法のスパースな高次元ガウスグラフィカルモデルに対する一貫性を確立し、シミュレーション実験を使用してそのパフォーマンスを検証します。気候データセットと乳がんデータセットへの応用についても説明します。

Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing
Double or Nothing:クラウドソーシングのための乗法インセンティブメカニズム

Crowdsourcing has gained immense popularity in machine learning applications for obtaining large amounts of labeled data. Crowdsourcing is cheap and fast, but suffers from the problem of low-quality data. To address this fundamental challenge in crowdsourcing, we propose a simple payment mechanism to incentivize workers to answer only the questions that they are sure of and skip the rest. We show that surprisingly, under a mild and natural no-free-lunch requirement, this mechanism is the one and only incentive-compatible payment mechanism possible. We also show that among all possible incentive-compatible mechanisms (that may or may not satisfy no-free-lunch), our mechanism makes the smallest possible payment to spammers. We further extend our results to a more general setting in which workers are required to provide a quantized confidence for each question. Interestingly, this unique mechanism takes a multiplicative form. The simplicity of the mechanism is an added benefit. In preliminary experiments involving over 900 worker-task pairs, we observe a significant drop in the error rates under this unique mechanism for the same or lower monetary expenditure.

クラウドソーシングは、大量のラベル付きデータを取得する機械学習アプリケーションで非常に人気が高まっています。クラウドソーシングは安価で高速ですが、データの品質が低いという問題があります。クラウドソーシングのこの基本的な課題に対処するために、私たちは、作業者が確信している質問にのみ回答し、残りはスキップするようにインセンティブを与えるためのシンプルな支払いメカニズムを提案します。驚くべきことに、軽度で自然な無料ランチなしの要件の下では、このメカニズムが唯一のインセンティブ互換の支払いメカニズムであることを示しています。また、すべての可能なインセンティブ互換メカニズム（無料ランチなしを満たすかどうかはわかりません）の中で、私たちのメカニズムはスパマーへの支払いが最小であることも示しています。さらに、作業者が各質問に対して量子化された信頼度を提供することを要求するより一般的な設定に結果を拡張します。興味深いことに、この独自のメカニズムは乗法形式をとります。メカニズムのシンプルさは追加の利点です。900を超える作業者とタスクのペアを含む予備実験では、この独自のメカニズムにより、同じまたはより低い金銭的支出でエラー率が大幅に低下することがわかりました。
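
The abstract only states that the mechanism is multiplicative and rewards skipping over guessing. As a minimal sketch, assuming the rule suggested by the title (the payment doubles for every correct answer on a gold-standard question, is unchanged by a skip, and drops to zero on any wrong answer), the idea could look as follows. The function name, the `base` amount and the `factor` of 2 are illustrative assumptions, not values taken from the abstract.

```python
def multiplicative_payment(responses, base=1.0, factor=2.0):
    """Hypothetical 'double or nothing' style payment.

    responses: iterable of "correct", "wrong" or "skip", one entry per
    gold-standard question. base and factor are illustrative choices.
    """
    payment = base
    for response in responses:
        if response == "wrong":
            # A single wrong answer forfeits the whole payment.
            return 0.0
        if response == "correct":
            # Each correct answer multiplies the payment.
            payment *= factor
        elif response != "skip":
            raise ValueError(f"unknown response: {response!r}")
        # A skipped question leaves the payment unchanged.
    return payment
```

Under this assumed rule a worker who is unsure about a question is better off skipping it than guessing, because a guess risks the entire payment while a skip costs only the missed multiplication.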

Optimal Estimation of Derivatives in Nonparametric Regression
ノンパラメトリック回帰における導関数の最適推定

We propose a simple framework for estimating derivatives without fitting the regression function in nonparametric regression. Unlike most existing methods that use the symmetric difference quotients, our method is constructed as a linear combination of observations. It is hence very flexible and applicable to both interior and boundary points, including most existing methods as special cases of ours. Within this framework, we define the variance-minimizing estimators for any order derivative of the regression function with a fixed bias-reduction level. For the equidistant design, we derive the asymptotic variance and bias of these estimators. We also show that our new method will, for the first time, achieve the asymptotically optimal convergence rate for difference-based estimators. Finally, we provide an effective criterion for selection of tuning parameters and demonstrate the usefulness of the proposed method through extensive simulation studies of the first- and second-order derivative estimators.

私たちは、ノンパラメトリック回帰において回帰関数を当てはめることなく導関数を推定するシンプルなフレームワークを提案します。対称差分商を使用する既存のほとんどの方法とは異なり、我々の方法は観測値の線形結合として構築されます。したがって、この方法は極めて柔軟であり、内部点と境界点の両方に適用可能であり、既存のほとんどの方法を我々の特別なケースとして含む。このフレームワーク内で、固定バイアス削減レベルを持つ回帰関数の任意の次数導関数の分散最小化推定量を定義します。等距離設計の場合、これらの推定量の漸近分散とバイアスを導出します。また、我々の新しい方法が、差分ベースの推定量の漸近的に最適な収束率を初めて達成することを示す。最後に、調整パラメータを選択するための効果的な基準を提供し、1次および2次導関数推定量の広範なシミュレーション研究を通じて提案方法の有用性を実証します。
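
To make "a linear combination of observations" concrete, here is a rough sketch under two simplifying assumptions that are not spelled out in the abstract: the design is equidistant with spacing `h`, and the noise is i.i.d., so that minimizing the variance amounts to minimizing the squared norm of the weights. The weights are constrained by Taylor-moment conditions up to `bias_level`, and the minimum-norm solution of those constraints is taken. The function name and parameters are hypothetical.

```python
import math
import numpy as np

def derivative_weights(offsets, h, order=1, bias_level=2):
    """Minimum-norm weights d with sum_j d_j * y[i + offsets_j] ~ f^(order)(x_i).

    Constraints: sum_j d_j * (offsets_j * h)**p / p! equals 1 for p == order
    and 0 for every other p in 0..bias_level.
    Requires order <= bias_level and len(offsets) >= bias_level + 1.
    """
    offsets = np.asarray(list(offsets), dtype=float)
    # One row of Taylor-moment constraints per power p.
    A = np.vstack([(offsets * h) ** p / math.factorial(p)
                   for p in range(bias_level + 1)])
    target = np.zeros(bias_level + 1)
    target[order] = 1.0
    # For an underdetermined system, lstsq returns the minimum 2-norm solution.
    weights, *_ = np.linalg.lstsq(A, target, rcond=None)
    return weights
```

Because the offsets are arbitrary, the same routine covers interior points (symmetric offsets such as `range(-3, 4)`) and boundary points (one-sided offsets such as `range(0, 5)`).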

Augmentable Gamma Belief Networks
拡張可能なガンマ信念ネットワーク

To infer multilayer deep representations of high-dimensional discrete and nonnegative real vectors, we propose an augmentable gamma belief network (GBN) that factorizes each of its hidden layers into the product of a sparse connection weight matrix and the nonnegative real hidden units of the next layer. The GBN’s hidden layers are jointly trained with an upward-downward Gibbs sampler that solves each layer with the same subroutine. The gamma-negative binomial process combined with a layer-wise training strategy allows inferring the width of each layer given a fixed budget on the width of the first layer. Example results illustrate interesting relationships between the width of the first layer and the inferred network structure, and demonstrate that the GBN can add more layers to improve its performance in both unsupervisedly extracting features and predicting heldout data. For exploratory data analysis, we extract trees and subnetworks from the learned deep network to visualize how the very specific factors discovered at the first hidden layer and the increasingly more general factors discovered at deeper hidden layers are related to each other, and we generate synthetic data by propagating random variables through the deep network from the top hidden layer back to the bottom data layer.

高次元の離散非負実ベクトルの多層深層表現を推論するために、各隠れ層を疎な接続重み行列と次の層の非負実隠れユニットの積に因数分解する拡張可能なガンマビリーフネットワーク(GBN)を提案します。GBNの隠れ層は、各層を同じサブルーチンで解決する上向き下向きギブスサンプラーで共同トレーニングされます。ガンマ負二項プロセスと層ごとのトレーニング戦略を組み合わせることで、最初の層の幅に固定の予算が与えられている場合、各層の幅を推論できます。例の結果は、最初の層の幅と推論されたネットワーク構造との興味深い関係を示し、GBNが層を追加して、教師なしの特徴抽出とホールドアウトデータの予測の両方でパフォーマンスを向上できることを示しています。探索的データ分析では、学習したディープネットワークからツリーとサブネットワークを抽出し、最初の隠し層で発見された非常に具体的な要因と、より深い隠し層で発見されたより一般的な要因がどのように相互に関連しているかを視覚化し、ランダム変数をディープネットワークを通じて最上位の隠し層から最下位のデータ層まで伝播させることで合成データを生成します。

The Teaching Dimension of Linear Learners
線形学習者の教育的側面

Teaching dimension is a learning theoretic quantity that specifies the minimum training set size to teach a target model to a learner. Previous studies on teaching dimension focused on version-space learners which maintain all hypotheses consistent with the training data, and cannot be applied to modern machine learners which select a specific hypothesis via optimization. This paper presents the first known teaching dimension for ridge regression, support vector machines, and logistic regression. We also exhibit optimal training sets that match these teaching dimensions. Our approach generalizes to other linear learners.

ティーチングディメンションは、ターゲットモデルを学習者にティーチングするための最小学習セットサイズを指定する学習理論的な量です。教育次元に関する以前の研究は、すべての仮説を学習データと一致させるバージョン空間学習者に焦点を当てており、最適化によって特定の仮説を選択する現代の機械学習者には適用できません。この論文では、リッジ回帰、サポートベクターマシン、およびロジスティック回帰の最初の既知の教育ディメンションについて説明します。また、これらの教育の側面に合った最適なトレーニングセットも展示しています。このアプローチは、他の線形学習器にも一般化されます。

Optimal Estimation and Completion of Matrices with Biclustering Structures
バイクラスタリング構造を持つ行列の最適推定と完成

Biclustering structures in data matrices were first formalized in a seminal paper by John Hartigan (Hartigan, 1972) where one seeks to cluster cases and variables simultaneously. Such structures are also prevalent in block modeling of networks. In this paper, we develop a theory for the estimation and completion of matrices with biclustering structures, where the data is a partially observed and noise contaminated matrix with a certain underlying biclustering structure. In particular, we show that a constrained least squares estimator achieves minimax rate-optimal performance in several of the most important scenarios. To this end, we derive unified high probability upper bounds for all sub-Gaussian data and also provide matching minimax lower bounds in both Gaussian and binary cases. Due to the close connection of graphon to stochastic block models, an immediate consequence of our general results is a minimax rate-optimal estimator for sparse graphons.

データ行列のバイクラスタリング構造は、John Hartigan(Hartigan、1972年)による独創的な論文で初めて形式化されました。そこでは、ケースと変数を同時にクラスタリングすることを求めています。このような構造は、ネットワークのブロックモデリングにもよく見られます。この論文では、データが部分的に観測され、特定の基礎となるバイクラスタリング構造を持つノイズ汚染された行列である、バイクラスタリング構造を持つ行列の推定と完成のための理論を開発します。特に、制約付き最小二乗推定量が、最も重要なシナリオのいくつかでミニマックスレート最適パフォーマンスを達成することを示します。この目的のために、すべてのサブガウスデータに対して統一された高確率の上限を導き出し、ガウスとバイナリの両方のケースで一致するミニマックス下限も提供します。グラフォンは確率的ブロックモデルに密接に関連しているため、一般的な結果の直接的な結果は、スパースグラフォンのミニマックスレート最適推定量です。

A General Framework for Constrained Bayesian Optimization using Information-based Search
情報ベース検索を用いた制約付きベイズ最適化のための一般フレームワーク

We present an information-theoretic framework for solving global black-box optimization problems that also have black-box constraints. Of particular interest to us is to efficiently solve problems with decoupled constraints, in which subsets of the objective and constraint functions may be evaluated independently, for example when the objective is evaluated on a CPU and the constraints are evaluated independently on a GPU. These problems require an acquisition function that can be separated into the contributions of the individual function evaluations. We develop one such acquisition function and call it Predictive Entropy Search with Constraints (PESC). PESC is an approximation to the expected information gain criterion and it compares favorably to alternative approaches based on improvement in several synthetic and real-world problems. In addition to this, we consider problems with a mix of functions that are fast and slow to evaluate. These problems require balancing the amount of time spent in the meta-computation of PESC and in the actual evaluation of the target objective. We take a bounded rationality approach and develop a partial update for PESC which trades off accuracy against speed. We then propose a method for adaptively switching between the partial and full updates for PESC. This allows us to interpolate between versions of PESC that are efficient in terms of function evaluations and those that are efficient in terms of wall-clock time. Overall, we demonstrate that PESC is an effective algorithm that provides a promising direction towards a unified solution for constrained Bayesian optimization.

私たちは、ブラックボックス制約も持つグローバルブラックボックス最適化問題を解決するための情報理論的フレームワークを提示します。我々が特に関心を持っているのは、目的関数と制約関数のサブセットが独立して評価される可能性がある、分離された制約を持つ問題を効率的に解決することです。たとえば、目的関数がCPUで評価され、制約がGPUで独立して評価される場合です。これらの問題には、個々の関数評価の寄与に分離できる獲得関数が必要です。我々はそのような獲得関数の1つを開発し、制約付き予測エントロピー検索(PESC)と呼んでいます。PESCは、期待される情報ゲイン基準の近似値であり、いくつかの合成問題と現実世界の問題の改善に基づく代替アプローチと比較して優れています。これに加えて、評価が速い関数と遅い関数が混在する問題を検討します。これらの問題では、PESCのメタ計算とターゲット目的の実際の評価に費やされる時間のバランスを取る必要があります。我々は限定合理性アプローチを採用し、精度と速度をトレードオフするPESCの部分更新を開発します。次に、PESCの部分更新と完全更新を適応的に切り替える方法を提案します。これにより、関数評価の点で効率的なPESCバージョンと実時間で効率的なPESCバージョンの間を補間できます。全体として、PESCは制約付きベイズ最適化の統一ソリューションに向けた有望な方向性を提供する効果的なアルゴリズムであることを示しています。

Exploration of the (Non-)Asymptotic Bias and Variance of Stochastic Gradient Langevin Dynamics
確率勾配ランジュバン動力学の（非）漸近的バイアスと分散の探究

Applying standard Markov chain Monte Carlo (MCMC) algorithms to large data sets is computationally infeasible. The recently proposed stochastic gradient Langevin dynamics (SGLD) method circumvents this problem in three ways: it generates proposed moves using only a subset of the data, it skips the Metropolis-Hastings accept-reject step, and it uses sequences of decreasing step sizes. In Teh et al. (2014), we provided the mathematical foundations for the decreasing step size SGLD, including consistency and a central limit theorem. However, in practice the SGLD is run for a relatively small number of iterations, and its step size is not decreased to zero. The present article investigates the behaviour of the SGLD with fixed step size. In particular we characterise the asymptotic bias explicitly, along with its dependence on the step size and the variance of the stochastic gradient. On that basis a modified SGLD which removes the asymptotic bias due to the variance of the stochastic gradients up to first order in the step size is derived. Moreover, we are able to obtain bounds on the finite-time bias, variance and mean squared error (MSE). The theory is illustrated with a Gaussian toy model for which the bias and the MSE for the estimation of moments can be obtained explicitly. For this toy model we study the gain of the SGLD over the standard Euler method in the limit of large data sets.

標準的なマルコフ連鎖モンテカルロ(MCMC)アルゴリズムを大規模なデータセットに適用することは、計算上不可能です。最近提案された確率的勾配ランジュバンダイナミクス(SGLD)法は、3つの方法でこの問題を回避します。データのサブセットのみを使用して提案された動きを生成する、メトロポリス-ヘイスティングスの受け入れ拒否ステップをスキップする、および減少するステップサイズのシーケンスを使用するという方法です。Tehら(2014)では、一貫性や中心極限定理など、減少するステップサイズSGLDの数学的基礎を提供しました。ただし、実際には、SGLDは比較的少数の反復で実行され、ステップサイズはゼロに減少しません。この論文では、固定ステップサイズのSGLDの動作を調査します。特に、漸近バイアスを明示的に特徴付け、そのステップサイズへの依存性と確率的勾配の分散について説明します。これを基に、ステップサイズで1次までの確率的勾配の分散による漸近バイアスを除去する修正SGLDが導出されます。さらに、有限時間バイアス、分散、平均二乗誤差(MSE)の境界を取得できます。理論は、モーメントの推定のバイアスとMSEを明示的に取得できるガウストイモデルで説明されます。このトイモデルでは、大規模なデータセットの制限内で、標準オイラー法に対するSGLDのゲインを調べます。
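
The three ingredients listed in the abstract (minibatch gradients, no accept-reject step, and here a fixed rather than decreasing step size) can be sketched on a Gaussian toy model. This is an illustrative sketch, not the authors' code: it assumes observations x_i ~ N(theta, noise_var) with a N(0, prior_var) prior on theta, and the function name and default values are hypothetical.

```python
import numpy as np

def sgld_gaussian_mean(x, step, batch_size, n_iter,
                       prior_var=1.0, noise_var=1.0, seed=0):
    """Fixed-step SGLD for the mean of a Gaussian (illustrative toy model)."""
    rng = np.random.default_rng(seed)
    n_data = len(x)
    theta = 0.0
    samples = np.empty(n_iter)
    for t in range(n_iter):
        # Use only a random subset of the data for the gradient.
        batch = rng.choice(n_data, size=batch_size, replace=False)
        grad_log_prior = -theta / prior_var
        # Rescale the minibatch sum so it is unbiased for the full-data gradient.
        grad_log_lik = (n_data / batch_size) * np.sum(x[batch] - theta) / noise_var
        # Langevin step with injected Gaussian noise; no Metropolis-Hastings test.
        theta = (theta
                 + 0.5 * step * (grad_log_prior + grad_log_lik)
                 + np.sqrt(step) * rng.standard_normal())
        samples[t] = theta
    return samples
```

With a fixed step the chain does not target the exact posterior: the extra variance of the minibatch gradient inflates the spread of the samples, which is the kind of bias the abstract says it characterises.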

Universal Approximation Results for the Temporal Restricted Boltzmann Machine and the Recurrent Temporal Restricted Boltzmann Machine
時間制限ボルツマンマシンと回帰時間制限ボルツマンマシンのユニバーサル近似結果

The Restricted Boltzmann Machine (RBM) has proved to be a powerful tool in machine learning, both on its own and as the building block for Deep Belief Networks (multi-layer generative graphical models). The RBM and Deep Belief Network have been shown to be universal approximators for probability distributions on binary vectors. In this paper we prove several similar universal approximation results for two variations of the Restricted Boltzmann Machine with time dependence, the Temporal Restricted Boltzmann Machine (TRBM) and the Recurrent Temporal Restricted Boltzmann Machine (RTRBM). We show that the TRBM is a universal approximator for Markov chains and generalize the theorem to sequences with longer time dependence. We then prove that the RTRBM is a universal approximator for stochastic processes with finite time dependence. We conclude with a discussion on efficiency and how the constructions developed could explain some previous experimental results.

制限付きボルツマンマシン(RBM)は、それ自体だけでなく、Deep Belief Networks(多層生成グラフィカルモデル)のビルディングブロックとしても、機械学習の強力なツールであることが証明されています。RBMとDeep Belief Networkは、バイナリベクトル上の確率分布のユニバーサル近似器であることが示されています。この論文では、時間依存性を持つ制限付きボルツマンマシンの2つのバリエーション、時間制限付きボルツマンマシン(TRBM)と反復時間制限ボルツマンマシン(RTRBM)について、いくつかの同様のユニバーサル近似結果を証明します。TRBMがマルコフ連鎖の普遍的な近似器であることを示し、定理をより長い時間依存性を持つシーケンスに一般化します。次に、RTRBMが有限の時間依存性を持つ確率過程の普遍的な近似器であることを証明します。最後に、効率と、開発された構造が以前の実験結果をどのように説明できるかについての議論で締めくくります。

Theoretical Analysis of the Optimal Free Responses of Graph-Based SFA for the Design of Training Graphs
学習グラフ設計のためのグラフベースSFAの最適自由応答の理論解析

Slow feature analysis (SFA) is an unsupervised learning algorithm that extracts slowly varying features from a multi-dimensional time series. Graph-based SFA (GSFA) is an extension to SFA for supervised learning that can be used to successfully solve regression problems if combined with a simple supervised post-processing step on a small number of slow features. The objective function of GSFA minimizes the squared output differences between pairs of samples specified by the edges of a structure called training graph. The edges of current training graphs, however, are derived only from the relative order of the labels. Exploiting the exact numerical value of the labels enables further improvements in label estimation accuracy. In this article, we propose the exact label learning (ELL) method to create a more precise training graph that encodes the desired labels explicitly and allows GSFA to extract a normalized version of them directly (i.e., without supervised post-processing). The ELL method is used for three tasks: (1) We estimate gender from artificial images of human faces (regression) and show the advantage of coding additional labels, particularly skin color. (2) We analyze two existing graphs for regression. (3) We extract compact discriminative features to classify traffic sign images. When the number of output features is limited, such compact features provide a higher classification rate compared to a graph that generates features equivalent to those of nonlinear Fisher discriminant analysis. The method is versatile, directly supports multiple labels, and provides higher accuracy compared to current graphs for the problems considered.

スローフィーチャ分析(SFA)は、多次元時系列からゆっくり変化するフィーチャを抽出する教師なし学習アルゴリズムです。グラフベースSFA (GSFA)は、教師あり学習用のSFAの拡張であり、少数のスローフィーチャに対する単純な教師あり後処理ステップと組み合わせると、回帰問題をうまく解決するために使用できます。GSFAの目的関数は、トレーニンググラフと呼ばれる構造のエッジによって指定されたサンプルのペア間の出力差の二乗を最小化します。ただし、現在のトレーニンググラフのエッジは、ラベルの相対的な順序からのみ導出されます。ラベルの正確な数値を利用すると、ラベル推定の精度をさらに向上できます。この記事では、正確なラベル学習(ELL)方法を提案します。この方法では、必要なラベルを明示的にエンコードし、GSFAがそれらの正規化されたバージョンを直接(つまり、教師あり後処理なしで)抽出できるようにする、より正確なトレーニンググラフを作成します。ELL法は、次の3つのタスクに使用されます。(1)人間の顔の人工画像から性別を推定し(回帰)、特に肌の色などの追加ラベルをコーディングする利点を示します。(2)回帰のために既存の2つのグラフを分析します。(3)交通標識画像を分類するためにコンパクトな識別特徴を抽出します。出力特徴の数が限られている場合、このようなコンパクトな特徴は、非線形フィッシャー判別分析と同等の特徴を生成するグラフと比較して、より高い分類率を提供します。この方法は汎用性があり、複数のラベルを直接サポートし、検討対象の問題に対して現在のグラフと比較して高い精度を提供します。

Minimum Density Hyperplanes
最小密度ハイパープレーン

Associating distinct groups of objects (clusters) with contiguous regions of high probability density (high-density clusters), is central to many statistical and machine learning approaches to the classification of unlabelled data. We propose a novel hyperplane classifier for clustering and semi-supervised classification which is motivated by this objective. The proposed minimum density hyperplane minimises the integral of the empirical probability density function along it, thereby avoiding intersection with high density clusters. We show that the minimum density and the maximum margin hyperplanes are asymptotically equivalent, thus linking this approach to maximum margin clustering and semi-supervised support vector classifiers. We propose a projection pursuit formulation of the associated optimisation problem which allows us to find minimum density hyperplanes efficiently in practice, and evaluate its performance on a range of benchmark data sets. The proposed approach is found to be very competitive with state of the art methods for clustering and semi-supervised classification.

オブジェクトの明確なグループ(クラスター)を高確率密度の連続領域(高密度クラスター)に関連付けることは、ラベルなしデータの分類に対する多くの統計的アプローチおよび機械学習アプローチの中心です。この目的に動機付けられた、クラスタリングおよび半教師あり分類のための新しい超平面分類器を提案します。提案された最小密度超平面は、それに沿った経験的確率密度関数の積分を最小化し、高密度クラスターとの交差を回避します。最小密度超平面と最大マージン超平面は漸近的に等価であることを示し、これによりこのアプローチを最大マージンクラスタリングおよび半教師ありサポートベクター分類器に結び付けます。関連する最適化問題の射影追跡定式化を提案します。これにより、実際に最小密度超平面を効率的に見つけることができ、さまざまなベンチマークデータセットでそのパフォーマンスを評価できます。提案されたアプローチは、クラスタリングおよび半教師あり分類の最先端の方法と非常に競争力があることがわかりました。
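
The quantity being minimised, the integral of the estimated density along a hyperplane, has a simple closed form in one common special case. As a sketch, assuming a Gaussian kernel density estimate with a single isotropic bandwidth (an assumption made here for illustration, not a detail given in the abstract), the integral over the hyperplane {x : v·x = b} with unit normal v reduces to a one-dimensional kernel density estimate of the projected data evaluated at b. The function name is hypothetical.

```python
import numpy as np

def density_on_hyperplane(X, v, b, bandwidth):
    """Integral of an isotropic Gaussian KDE over the hyperplane v.x = b."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)          # use a unit normal vector
    projections = X @ v                # 1-D projection of every data point
    z = (b - projections) / bandwidth
    # 1-D Gaussian KDE of the projections, evaluated at the offset b.
    return np.mean(np.exp(-0.5 * z ** 2)) / (bandwidth * np.sqrt(2.0 * np.pi))
```

Searching over the direction `v` and offset `b` for a small value of this function is one way to read the projection pursuit formulation mentioned in the abstract; the actual optimisation procedure is not reproduced here.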

New Perspectives on k-Support and Cluster Norms
k-サポートとクラスターノルムに関する新しい視点

We study a regularizer which is defined as a parameterized infimum of quadratics, and which we call the box-norm. We show that the $k$-support norm, a regularizer proposed by Argyriou et al. (2012) for sparse vector prediction problems, belongs to this family, and the box-norm can be generated as a perturbation of the former. We derive an improved algorithm to compute the proximity operator of the squared box-norm, and we provide a method to compute the norm. We extend the norms to matrices, introducing the spectral $k$-support norm and spectral box-norm. We note that the spectral box-norm is essentially equivalent to the cluster norm, a multitask learning regularizer introduced by Jacob et al. (2009a), and which in turn can be interpreted as a perturbation of the spectral $k$-support norm. Centering the norm is important for multitask learning and we also provide a method to use centered versions of the norms as regularizers. Numerical experiments indicate that the spectral $k$-support and box-norms and their centered variants provide state of the art performance in matrix completion and multitask learning problems respectively.

私たちは、二次方程式のパラメータ化された最小値として定義され、ボックスノルムと呼ばれる正則化子を研究します。私たちは、Argyriouら(2012)がスパースベクトル予測問題のために提案した正則化子である$k$サポートノルムがこの族に属し、ボックスノルムは前者の摂動として生成できることを示す。私たちは、二乗ボックスノルムの近接演算子を計算するための改良アルゴリズムを導出し、ノルムを計算する方法を提供します。私たちは、スペクトル$k$サポートノルムとスペクトルボックスノルムを導入して、ノルムを行列に拡張します。スペクトルボックスノルムは、Jacobら(2009a)が導入したマルチタスク学習正則化子であるクラスターノルムと本質的に等価であり、スペクトル$k$サポートノルムの摂動として解釈できることに注意します。ノルムを中心化することはマルチタスク学習にとって重要であり、中心化されたバージョンのノルムを正則化子として使用する方法も提供します。数値実験では、スペクトル$k$サポートおよびボックスノルムとその中心化されたバリアントが、それぞれ行列補完およびマルチタスク学習の問題で最先端のパフォーマンスを提供することが示されています。

Importance Weighting Without Importance Weights: An Efficient Algorithm for Combinatorial Semi-Bandits
重要度重み付けなしの重要度重み付け:組み合わせセミバンディットのための効率的なアルゴリズム

We propose a sample-efficient alternative for importance weighting for situations where one only has sample access to the probability distribution that generates the observations. Our new method, called Geometric Resampling (GR), is described and analyzed in the context of online combinatorial optimization under semi-bandit feedback, where a learner sequentially selects its actions from a combinatorial decision set so as to minimize its cumulative loss. In particular, we show that the well-known Follow-the-Perturbed-Leader (FPL) prediction method coupled with Geometric Resampling yields the first computationally efficient reduction from offline to online optimization in this setting. We provide a thorough theoretical analysis for the resulting algorithm, showing that its performance is on par with previous, inefficient solutions. Our main contribution is showing that, despite the relatively large variance induced by the GR procedure, our performance guarantees hold with high probability rather than only in expectation. As a side result, we also improve the best known regret bounds for FPL in online combinatorial optimization with full feedback, closing the perceived performance gap between FPL and exponential weights in this setting. (A preliminary version of this paper was published as Neu and Bartók (2013). Parts of this work were completed while Gergely Neu was with the SequeL team at INRIA Lille - Nord Europe, France and Gábor Bartók was with the Department of Computer Science at ETH Zürich.)

私たちは、観測値を生成する確率分布にサンプルアクセスしかできない状況での重要度重み付けのためのサンプル効率の良い代替案を提案します。幾何再サンプリング(GR)と呼ばれる我々の新しい方法は、セミバンディットフィードバック下でのオンライン組合せ最適化の文脈で説明され、分析されます。この文脈では、学習者は組合せ決定セットからその行動を順次選択し、累積損失を最小化します。特に、私たちは、よく知られているFollow-the-Perturbed-Leader (FPL)予測法と幾何再サンプリングを組み合わせることで、この設定でオフライン最適化からオンライン最適化への計算効率の良い削減が初めて得られることを示す。私たちは、結果として得られるアルゴリズムの徹底的な理論分析を提供し、そのパフォーマンスが以前の非効率的なソリューションと同等であることを示す。我々の主な貢献は、GR手順によって比較的大きな変動が誘発されるにもかかわらず、我々のパフォーマンス保証が期待値だけでなく高い確率で成り立つことを示したことです。副次的な結果として、完全なフィードバックによるオンライン組み合わせ最適化におけるFPLの最もよく知られている後悔境界も改善され、この設定でのFPLと指数重みの間の認識されたパフォーマンスギャップが解消されました。(この論文の予備バージョンはNeu and Bartók (2013)として公開されました。この作業の一部は、Gergely NeuがフランスのINRIA Lille - Nord EuropeのSequeLチームに所属し、Gábor BartókがETH Zürichのコンピューターサイエンス学部に所属していたときに完了しました。)
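
The core trick can be sketched in a few lines. Importance weighting needs 1/q_i, where q_i is the probability that the played action contains component i, but with sample access only this probability is unknown. The sketch below, which is an illustration under stated assumptions rather than the authors' implementation, draws fresh actions until component i reappears and uses the waiting time as a stand-in for 1/q_i, since a geometric waiting time has mean 1/q_i. The function name, the `draw_action` callback and the truncation parameter `cap` are hypothetical names introduced here.

```python
import numpy as np

def geometric_resampling(draw_action, played, cap):
    """Waiting-time estimates of 1/q_i for every component of the played action.

    draw_action: callable returning a fresh 0/1 action vector drawn from the
                 same distribution as the played action (sample access only).
    played:      0/1 vector of the action that was actually played.
    cap:         maximum number of fresh draws (truncation level).
    """
    waiting = np.asarray(played, dtype=bool).copy()
    counts = np.zeros(waiting.shape[0])
    for k in range(1, cap + 1):
        fresh = np.asarray(draw_action(), dtype=bool)
        hit = waiting & fresh          # components seen again for the first time
        counts[hit] = k
        waiting &= ~hit
        if not waiting.any():
            break
    counts[waiting] = cap              # never reappeared within the cap
    return counts

def loss_estimates(counts, played, observed_losses):
    # Semi-bandit feedback: losses are observed only for played components.
    return counts * np.asarray(played) * np.asarray(observed_losses)
```

The truncation keeps the running time bounded at the price of a small bias, and the spread of the waiting times is the "relatively large variance" that the abstract refers to.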

A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights
Nesterovの加速勾配法をモデル化するための微分方程式:理論と洞察

We derive a second-order ordinary differential equation (ODE) which is the limit of Nesterov’s accelerated gradient method. This ODE exhibits approximate equivalence to Nesterov’s scheme and thus can serve as a tool for analysis. We show that the continuous time ODE allows for a better understanding of Nesterov’s scheme. As a byproduct, we obtain a family of schemes with similar convergence rates. The ODE interpretation also suggests restarting Nesterov’s scheme leading to an algorithm, which can be rigorously proven to converge at a linear rate whenever the objective is strongly convex.

私たちは、Nesterovの加速勾配法の限界である2階常微分方程式(ODE)を導出します。このODEは、ネステロフのスキームとほぼ同等であるため、分析のツールとして役立ちます。連続時間ODEにより、ネステロフのスキームをよりよく理解できることを示します。副産物として、同様の収束率を持つスキームのファミリーを取得します。ODEの解釈はまた、ネステロフのスキームを再開してアルゴリズムに導くことを示唆しており、アルゴリズムは、目的が強く凸であるときはいつでも線形速度で収束することが厳密に証明できます。
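
For reference, the scheme and its continuous-time limit can be written out explicitly. The abstract does not state the equations; the block below gives the standard form of Nesterov's scheme with step size $s$ for a smooth convex objective $f$, and the ODE obtained from it under the time scaling $t \approx k\sqrt{s}$ as $s \to 0$.

```latex
% Nesterov's accelerated gradient scheme (step size s, starting from y_0 = x_0)
\begin{align*}
  x_k &= y_{k-1} - s\,\nabla f(y_{k-1}), \\
  y_k &= x_k + \frac{k-1}{k+2}\,(x_k - x_{k-1}).
\end{align*}

% Limiting second-order ODE, with t \approx k\sqrt{s}
\begin{equation*}
  \ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f\bigl(X(t)\bigr) = 0,
  \qquad X(0) = x_0, \quad \dot{X}(0) = 0 .
\end{equation*}
```

The damping coefficient $3/t$ corresponds to the momentum weight $(k-1)/(k+2)$, and resetting $t$ to zero is the continuous-time counterpart of restarting the scheme.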

Learning Theory for Distribution Regression
分布回帰の学習理論

We focus on the distribution regression problem: regressing to vector-valued outputs from probability measures. Many important machine learning and statistical tasks fit into this framework, including multi-instance learning and point estimation problems without analytical solution (such as hyperparameter or entropy estimation). Despite the large number of available heuristics in the literature, the inherent two-stage sampled nature of the problem makes the theoretical analysis quite challenging, since in practice only samples from sampled distributions are observable, and the estimates have to rely on similarities computed between sets of points. To the best of our knowledge, the only existing technique with consistency guarantees for distribution regression requires kernel density estimation as an intermediate step (which often performs poorly in practice), and the domain of the distributions to be compact Euclidean. In this paper, we study a simple, analytically computable, ridge regression-based alternative to distribution regression, where we embed the distributions to a reproducing kernel Hilbert space, and learn the regressor from the embeddings to the outputs. Our main contribution is to prove that this scheme is consistent in the two-stage sampled setup under mild conditions (on separable topological domains enriched with kernels): we present an exact computational-statistical efficiency trade-off analysis showing that our estimator is able to match the one-stage sampled minimax optimal rate (Caponnetto and De Vito, 2007; Steinwart et al., 2009). This result answers a 17-year-old open question, establishing the consistency of the classical set kernel (Haussler, 1999; Gärtner et al., 2002) in regression. We also cover consistency for more recent kernels on distributions, including those due to Christmann and Steinwart (2010).

私たちは、分布回帰問題、すなわち確率測度からベクトル値の出力への回帰に焦点を当てる。このフレームワークには、マルチインスタンス学習や解析解のない点推定問題（ハイパーパラメータ推定やエントロピー推定など）など、多くの重要な機械学習および統計タスクが当てはまる。文献には多数のヒューリスティックが利用可能であるにもかかわらず、この問題の本質的な2段階サンプリングの性質により、理論分析は非常に困難になります。これは、実際にはサンプリングされた分布からのサンプルのみが観測可能であり、推定は点の集合間で計算された類似性に依存する必要があるためです。我々の知る限り、分布回帰の一貫性保証を備えた唯一の既存の手法は、中間ステップとしてカーネル密度推定（実際にはパフォーマンスが悪いことが多い）を必要とし、分布のドメインがコンパクトユークリッドであることを必要とします。この論文では、分布回帰の代わりとなる、単純で解析的に計算可能なリッジ回帰ベースの手法を検討します。この手法では、分布を再生カーネルヒルベルト空間に埋め込み、出力への埋め込みから回帰子を学習します。我々の主な貢献は、この方式が、穏やかな条件（カーネルで強化された分離可能な位相領域上）での2段階サンプリング設定で一貫していることを証明することです。私たちは、正確な計算統計効率のトレードオフ分析を提示し、我々の推定量が1段階サンプリングのミニマックス最適率に一致できることを示しています(CaponnettoおよびDe Vito、2007年、Steinwartら、2009年)。この結果は、17年来の未解決の問題に答え、回帰における古典的なセットカーネル(Haussler、1999年、Gärtnerら、2002年)の一貫性を確立しました。我々はまた、ChristmannおよびSteinwart (2010年)によるものを含む、分布上のより最近のカーネルの一貫性についても説明します。
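
The two-stage pipeline described above (embed each sampled distribution, then run ridge regression on the embeddings) can be sketched with the classical set kernel, which is the inner product of empirical mean embeddings. This is a minimal illustration, assuming a Gaussian base kernel and scalar outputs; the function names, the bandwidth and the ridge parameter are hypothetical choices, not settings from the paper.

```python
import numpy as np

def gaussian_gram(A, B, bandwidth):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def set_kernel(bags_a, bags_b, bandwidth):
    """Inner products of empirical mean embeddings (the set kernel).

    Each bag is an (n_points, dim) array of samples from one distribution.
    """
    return np.array([[gaussian_gram(a, b, bandwidth).mean() for b in bags_b]
                     for a in bags_a])

def fit_predict(train_bags, y, test_bags, bandwidth=1.0, ridge=1e-3):
    """Kernel ridge regression from bags of samples to real-valued labels."""
    n = len(train_bags)
    K = set_kernel(train_bags, train_bags, bandwidth)
    alpha = np.linalg.solve(K + n * ridge * np.eye(n), np.asarray(y, dtype=float))
    return set_kernel(test_bags, train_bags, bandwidth) @ alpha
```

Only samples from each distribution enter the computation, which mirrors the two-stage sampled setting: the bags are a sample of distributions, and each bag is itself a sample from its distribution.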

Conditional Independencies under the Algorithmic Independence of Conditionals
条件文のアルゴリズム独立性の下での条件付き非独立性

In this paper we analyze the relationship between faithfulness and the more recent condition of algorithmic Independence of Conditionals (IC) with respect to the Conditional Independencies (CIs) they allow. Both conditions have been extensively used for causal inference by refuting factorizations for which the condition does not hold. Violation of faithfulness happens when there are CIs that do not follow from the Markov condition. For those CIs, non-trivial constraints among some parameters of the Conditional Probability Distributions (CPDs) must hold. When such a constraint is defined over parameters of different CPDs, we prove that IC is also violated unless the parameters have a simple description. To understand which non-Markovian CIs are permitted we define a new condition closely related to IC: the Independence from Product Constraints (IPC). The condition reflects that CIs might be the result of specific parameterizations of individual CPDs but not from constraints on parameters of different CPDs. In that sense it is more restrictive than IC: parameters may have a simple description. On the other hand, IC also excludes other forms of algorithmic dependencies between CPDs. Finally, we prove that on top of the CIs permitted by the Markov condition (faithfulness), IPC allows non-minimality, deterministic relations and what we called proportional CPDs. These are the only cases in which a CI follows from a specific parameterization of a single CPD.

この論文では、忠実性と、アルゴリズムの条件の独立性(IC)という最近の条件との関係を、それらが許す条件付き独立性(CI)に関して分析します。両方の条件は、条件が成立しない因数分解を反駁することによって、因果推論に広く使用されています。忠実性の違反は、マルコフ条件に従わないCIがある場合に発生します。これらのCIについては、条件付き確率分布(CPD)の一部のパラメータ間の非自明な制約が成立する必要があります。このような制約が異なるCPDのパラメータに対して定義されている場合、パラメータが単純な記述を持たない限り、ICも違反されることを証明します。どの非マルコフCIが許されるかを理解するために、ICに密接に関連する新しい条件、積制約からの独立性(IPC)を定義します。この条件は、CIが個々のCPDの特定のパラメータ化の結果である可能性があり、異なるCPDのパラメータの制約によるものではないことを反映しています。その意味では、ICよりも制限が厳しく、パラメータは簡単に記述できます。一方、ICはCPD間の他の形式のアルゴリズム依存性も排除します。最後に、マルコフ条件(忠実性)によって許可されるCIに加えて、IPCは非最小性、決定論的関係、および比例CPDと呼ばれるものを許可することを証明します。これらは、CIが単一のCPDの特定のパラメータ化から従う唯一のケースです。

A General Framework for Consistency of Principal Component Analysis
主成分分析の一貫性のための一般的なフレームワーク

A general asymptotic framework is developed for studying consistency properties of principal component analysis (PCA). Our framework includes several previously studied domains of asymptotics as special cases and allows one to investigate interesting connections and transitions among the various domains. More importantly, it enables us to investigate asymptotic scenarios that have not been considered before, and gain new insights into the consistency, subspace consistency and strong inconsistency regions of PCA and the boundaries among them. We also establish the corresponding convergence rate within each region. Under general spike covariance models, the dimension (or number of variables) discourages the consistency of PCA, while the sample size and spike information (the relative size of the population eigenvalues) encourage PCA consistency. Our framework nicely illustrates the relationship among these three types of information in terms of dimension, sample size and spike size, and rigorously characterizes how their relationships affect PCA consistency.

主成分分析(PCA)の一貫性特性を研究するための一般的な漸近フレームワークが開発されました。このフレームワークには、以前に研究されたいくつかの漸近ドメインが特殊なケースとして含まれており、さまざまなドメイン間の興味深い接続と遷移を調査できます。さらに重要なことは、これまで考慮されていなかった漸近シナリオを調査し、PCAの一貫性、サブスペース一貫性、および強い不一致領域とそれらの境界に関する新しい洞察を得ることができることです。また、各領域内の対応する収束率も確立します。一般的なスパイク共分散モデルでは、次元(または変数の数)によってPCAの一貫性が損なわれますが、サンプルサイズとスパイク情報(母集団の固有値の相対的なサイズ)によってPCAの一貫性が促進されます。このフレームワークは、次元、サンプルサイズ、スパイクサイズの観点からこれら3種類の情報の関係をわかりやすく示し、それらの関係がPCAの一貫性にどのように影響するかを厳密に特徴付けます。

Kernel Estimation and Model Combination in A Bandit Problem with Covariates
共変量を持つバンディット問題におけるカーネル推定とモデルの組み合わせ

The multi-armed bandit problem is an important optimization game that requires an exploration-exploitation tradeoff to achieve optimal total reward. Motivated by industrial applications such as online advertising and clinical research, we consider a setting where the rewards of bandit machines are associated with covariates, and the accurate estimation of the corresponding mean reward functions plays an important role in the performance of allocation rules. Under a flexible problem setup, we establish asymptotic strong consistency and perform a finite-time regret analysis for a sequential randomized allocation strategy based on kernel estimation. In addition, since many nonparametric and parametric methods in supervised learning may be applied to estimating the mean reward functions but guidance on how to choose among them is generally unavailable, we propose a model combining allocation strategy for adaptive performance. Simulations and a real data evaluation are conducted to illustrate the performance of the proposed allocation strategy.

多腕バンディット問題は、最適な総報酬を得るために探索と活用のトレードオフを必要とする重要な最適化ゲームです。オンライン広告や臨床研究などの産業用途に着想を得て、バンディットマシンの報酬が共変量と関連付けられ、対応する平均報酬関数の正確な推定が割り当てルールのパフォーマンスに重要な役割を果たす設定を検討します。柔軟な問題設定の下で、カーネル推定に基づく順次ランダム割り当て戦略に対して、漸近的な強い一貫性を確立し、有限時間の後悔分析を実行します。さらに、平均報酬関数の推定には、教師あり学習における多くのノンパラメトリックおよびパラメトリック手法を適用できますが、その中からどのように選択するかについてのガイダンスは一般に入手できないため、適応型パフォーマンスのための割り当て戦略を組み合わせたモデルを提案します。提案された割り当て戦略のパフォーマンスを示すために、シミュレーションと実際のデータ評価が行われます。

Megaman: Scalable Manifold Learning in Python
Megaman:Pythonでのスケーラブルな多様体学習

Manifold Learning (ML) is a class of algorithms seeking a low-dimensional non-linear representation of high-dimensional data. Thus, ML algorithms are most applicable to high-dimensional data and require large sample sizes to accurately estimate the manifold. Despite this, most existing manifold learning implementations are not particularly scalable. Here we present a Python package that implements a variety of manifold learning algorithms in a modular and scalable fashion, using fast approximate neighbors searches and fast sparse eigendecompositions. The package incorporates theoretical advances in manifold learning, such as the unbiased Laplacian estimator introduced by Coifman and Lafon (2006) and the estimation of the embedding distortion by the Riemannian metric method introduced by Perrault-Joncas and Meila (2013). In benchmarks, even on a single-core desktop computer, our code embeds millions of data points in minutes, and takes just 200 minutes to embed the main sample of galaxy spectra from the Sloan Digital Sky Survey (0.6 million samples in 3750 dimensions), a task which has not previously been possible.

多様体学習(ML)は、高次元データの低次元非線形表現を求めるアルゴリズムのクラスです。したがって、MLアルゴリズムは高次元データに最も適しており、多様体を正確に推定するためには大きなサンプルサイズが必要です。それにもかかわらず、既存の多様体学習実装のほとんどは特にスケーラブルではありません。ここでは、高速近似近傍検索と高速スパース固有値分解を使用して、さまざまな多様体学習アルゴリズムをモジュール式でスケーラブルに実装するPythonパッケージを紹介します。このパッケージには、CoifmanとLafon (2006)によって導入された不偏ラプラシアン推定量や、Perrault-JoncasとMeila (2013)によって導入されたリーマン計量法による埋め込み歪みの推定など、多様体学習の理論的進歩が組み込まれています。ベンチマークでは、シングルコアのデスクトップコンピューターでも、当社のコードは数分で数百万のデータポイントを埋め込み、スローンデジタルスカイサーベイからの銀河スペクトルの主要サンプル(3750次元の60万サンプルで構成)を埋め込むのにわずか200分しかかかりません。これは、これまでは不可能だったタスクです。

Local Network Community Detection with Continuous Optimization of Conductance and Weighted Kernel K-Means
コンダクタンスと重み付けカーネルK平均法の連続最適化によるローカルネットワークコミュニティ検出

Local network community detection is the task of finding a single community of nodes concentrated around a few given seed nodes in a localized way. Conductance is a popular objective function used in many algorithms for local community detection. This paper studies a continuous relaxation of conductance. We show that continuous optimization of this objective still leads to discrete communities. We investigate the relation of conductance with weighted kernel k-means for a single community, which leads to the introduction of a new objective function, $\sigma$-conductance. Conductance is obtained by setting $\sigma$ to $0$. Two algorithms, EMc and PGDc, are proposed to locally optimize $\sigma$-conductance and automatically tune the parameter $\sigma$. They are based on expectation maximization and projected gradient descent, respectively. We prove locality and give performance guarantees for EMc and PGDc for a class of dense and well separated communities centered around the seeds. Experiments are conducted on networks with ground-truth communities, comparing to state-of-the-art graph diffusion algorithms for conductance optimization. On large graphs, results indicate that EMc and PGDc stay localized and produce communities most similar to the ground truth, while graph diffusion algorithms generate large communities of lower quality. (Source code of the algorithms used in the paper is available online.)

ローカルネットワークコミュニティ検出は、少数の特定のシードノードの周囲に局所的に集中するノードの単一コミュニティを見つけるタスクです。コンダクタンスは、ローカルコミュニティ検出の多くのアルゴリズムで使用される一般的な目的関数です。この論文では、コンダクタンスの連続緩和について検討します。この目的を連続的に最適化しても、依然として離散的なコミュニティにつながることを示します。単一コミュニティのコンダクタンスと重み付きカーネルk平均法の関係を調査し、新しい目的関数$\sigma$-コンダクタンスを導入します。コンダクタンスは、$\sigma$を$0$に設定することで得られます。$\sigma$-コンダクタンスを局所的に最適化し、パラメーター$\sigma$を自動的に調整する2つのアルゴリズム、EMcとPGDcが提案されています。これらは、それぞれ期待値最大化と投影勾配降下法に基づいています。シードを中心とした密で十分に分離されたコミュニティのクラスについて、EMcとPGDcの局所性を証明し、パフォーマンスを保証します。実験は、グラウンドトゥルースコミュニティを持つネットワークで行われ、最先端のグラフ拡散アルゴリズムによるコンダクタンス最適化と比較されます。大きなグラフでは、EMcとPGDcは局所的であり、グラウンドトゥルースに最も類似したコミュニティを生成するのに対し、グラフ拡散アルゴリズムは品質の低い大きなコミュニティを生成することが結果から示されています。(論文で使用されているアルゴリズムのソースコードはオンラインで入手できます。)
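
As a small illustration of the objective discussed in this abstract, the sketch below computes the conductance of a node set using the common textbook definition (cut size divided by the smaller of the two volumes). This is an assumption for illustration only: the paper's own normalisation and its $\sigma$-conductance objective may differ, and the function name, the adjacency-dict representation, and the toy graph are all hypothetical.

```python
def conductance(adj, community):
    """Conductance of `community` in an undirected, unweighted graph.

    adj: dict mapping each node to the set of its neighbours.
    community: iterable of nodes (the candidate community S).
    Uses cut(S, V \\ S) / min(vol(S), vol(V \\ S)); this is the common
    textbook form, assumed here rather than taken from the paper.
    """
    s = set(community)
    # Number of edges with exactly one endpoint inside S.
    cut = sum(1 for u in s for v in adj[u] if v not in s)
    # Volume = sum of node degrees on each side.
    vol_s = sum(len(adj[u]) for u in s)
    vol_rest = sum(len(adj[u]) for u in adj if u not in s)
    denom = min(vol_s, vol_rest)
    if denom == 0:
        raise ValueError("conductance is undefined when one side has zero volume")
    return cut / denom


# Hypothetical toy graph: two triangles joined by a single bridge edge (2-3).
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(conductance(graph, {0, 1, 2}))  # one cut edge, volume 7 on each side
print(conductance(graph, {0, 1}))     # two cut edges, volume 4 inside
```

A low value indicates a set that is well separated from the rest of the graph, which is why one triangle scores lower than an arbitrary pair of nodes.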

Penalized Maximum Likelihood Estimation of Multi-layered Gaussian Graphical Models
多層ガウスグラフモデルのペナルティ付き最尤推定

Analyzing multi-layered graphical models provides insight into understanding the conditional relationships among nodes within layers after adjusting for and quantifying the effects of nodes from other layers. We obtain the penalized maximum likelihood estimator for Gaussian multi-layered graphical models, based on a computational approach involving screening of variables, iterative estimation of the directed edges between layers and undirected edges within layers and a final refitting and stability selection step that provides improved performance in finite sample settings. We establish the consistency of the estimator in a high-dimensional setting. To obtain this result, we develop a strategy that leverages the biconvexity of the likelihood function to ensure convergence of the developed iterative algorithm to a stationary point, as well as careful uniform error control of the estimates over iterations. The performance of the maximum likelihood estimator is illustrated on synthetic data.

多層グラフィカルモデルを分析すると、他の層のノードの影響を調整して定量化した後、層内のノード間の条件付き関係を理解するための洞察が得られます。変数のスクリーニング、層間の有向エッジと層内の無向エッジの反復推定、および有限サンプル設定でパフォーマンスを向上させる最終的な再フィッティングと安定性選択ステップを含む計算アプローチに基づいて、ガウス多層グラフィカルモデルのペナルティ付き最尤推定量を取得します。高次元設定での推定量の一貫性を確立します。この結果を得るために、尤度関数の双凸性を活用して、開発された反復アルゴリズムが定常点に収束すること、および反復全体で推定値の慎重な均一な誤差制御を保証する戦略を開発します。最尤推定量のパフォーマンスは、合成データで示されています。

True Online Temporal-Difference Learning
真のオンライン時間差分学習

The temporal-difference methods TD($\lambda$) and Sarsa($\lambda$) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, new versions of these methods were introduced, called true online TD($\lambda$) and true online Sarsa($\lambda$), respectively (van Seijen & Sutton, 2014). Algorithmically, these true online methods only make two small changes to the update rules of the regular methods, and the extra computational cost is negligible in most cases. However, they follow the ideas underlying the forward view much more closely. In particular, they maintain an exact equivalence with the forward view at all times, whereas the traditional versions only approximate it for small step-sizes. We hypothesize that these true online methods not only have better theoretical properties, but also dominate the regular methods empirically. In this article, we put this hypothesis to the test by performing an extensive empirical comparison. Specifically, we compare the performance of true online TD($\lambda$)/Sarsa($\lambda$) with regular TD($\lambda$)/Sarsa($\lambda$) on random MRPs, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment. We use linear function approximation with tabular, binary, and non-binary features. Our results suggest that the true online methods indeed dominate the regular methods. Across all domains/representations the learning speed of the true online methods is often better, and never worse, than that of the regular methods. An additional advantage is that no choice between traces has to be made for the true online methods. Besides the empirical results, we provide an in-depth analysis of the theory behind true online temporal-difference learning. In addition, we show that new true online temporal-difference methods can be derived by making changes to the online forward view and then rewriting the update equations.

時間差分法TD($\lambda$)とSarsa($\lambda$)は、現代の強化学習の中核をなしています。これらの手法の魅力は、優れたパフォーマンス、低い計算コスト、およびフォワードビューによるシンプルな解釈にあります。最近、これらの手法の新しいバージョンが導入され、それぞれtrue online TD($\lambda$)およびtrue online Sarsa($\lambda$)と呼ばれています(van Seijen & Sutton、2014)。アルゴリズム的には、これらのtrue online手法は通常の手法の更新ルールに2つの小さな変更を加えるだけであり、ほとんどの場合、追加の計算コストは無視できます。ただし、これらはフォワードビューの基礎となるアイデアに非常に忠実に従っています。特に、従来のバージョンは小さなステップサイズでのみそれを近似するのに対し、これらの手法は常にフォワードビューと完全に同等です。これらのtrue online手法は、理論的な特性が優れているだけでなく、経験的に通常の手法よりも優れていると仮定しています。この記事では、広範な実証的比較を行うことで、この仮説を検証します。具体的には、ランダムMRP、現実世界の筋電義手、およびArcade Learning Environmentのドメインで、真のオンラインTD($\lambda$)/Sarsa($\lambda$)のパフォーマンスを通常のTD($\lambda$)/Sarsa($\lambda$)と比較します。表形式、バイナリ、および非バイナリ機能を使用した線形関数近似を使用します。結果は、真のオンラインメソッドが通常のメソッドよりも優れていることを示唆しています。すべてのドメイン/表現にわたって、真のオンラインメソッドの学習速度は通常の方法よりも優れていることがよくありますが、劣ることはありません。もう1つの利点は、真のオンラインメソッドではトレースの選択を行う必要がないことです。実証結果に加えて、真のオンライン時間差学習の背後にある理論の詳細な分析を提供します。さらに、オンラインフォワードビューに変更を加えて更新方程式を書き直すことで、新しい真のオンライン時間差分法を導出できることを示します。
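
To make the "two small changes to the update rules" more concrete, here is a minimal sketch of true online TD($\lambda$) with linear function approximation, written from the commonly published form of the update (a modified eligibility trace plus a correction term based on the previous value estimate). The function names, the episode format, and the three-state toy chain are assumptions for illustration; this is not the authors' code, and the abstract itself does not spell out the equations.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(ai * bi for ai, bi in zip(a, b))


def true_online_td_lambda(episodes, n_features, alpha, gamma, lam):
    """Estimate linear value-function weights with true online TD(lambda).

    episodes: iterable of episodes; each episode is a list of
        (x, reward, x_next) tuples, where x and x_next are feature
        vectors and x_next is all zeros at termination.
    """
    w = [0.0] * n_features
    for episode in episodes:
        z = [0.0] * n_features  # eligibility trace, reset every episode
        v_old = 0.0
        for x, reward, x_next in episode:
            v = dot(w, x)
            v_next = dot(w, x_next)
            delta = reward + gamma * v_next - v
            # Change 1: the trace update has an extra term depending on z.x.
            zx = dot(z, x)
            z = [gamma * lam * zi + (1.0 - alpha * gamma * lam * zx) * xi
                 for zi, xi in zip(z, x)]
            # Change 2: the weight update is corrected using (v - v_old).
            w = [wi + alpha * (delta + v - v_old) * zi - alpha * (v - v_old) * xi
                 for wi, zi, xi in zip(w, z, x)]
            v_old = v_next
    return w


# Hypothetical toy problem: deterministic chain 0 -> 1 -> 2 -> terminal,
# reward 1 per step, one-hot (tabular) features, no discounting.
e = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
terminal = [0.0, 0.0, 0.0]
episode = [(e[0], 1.0, e[1]), (e[1], 1.0, e[2]), (e[2], 1.0, terminal)]
weights = true_online_td_lambda([episode] * 500, n_features=3,
                                alpha=0.1, gamma=1.0, lam=0.5)
print([round(wi, 3) for wi in weights])
```

In this toy chain the true state values are 3, 2 and 1, so the learned weights should approach those numbers; the step size, $\lambda$ and episode count are arbitrary choices for the sketch.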

MOCCA: Mirrored Convex/Concave Optimization for Nonconvex Composite Functions
MOCCA: 非凸複合関数の鏡像化凸/凹最適化

Many optimization problems arising in high-dimensional statistics decompose naturally into a sum of several terms, where the individual terms are relatively simple but the composite objective function can only be optimized with iterative algorithms. In this paper, we are interested in optimization problems of the form $F(Kx) + G(x)$, where $K$ is a fixed linear transformation, while $F$ and $G$ are functions that may be nonconvex and/or nondifferentiable. In particular, if either of the terms is nonconvex, existing alternating minimization techniques may fail to converge; other types of existing approaches may instead be unable to handle nondifferentiability. We propose the MOCCA (mirrored convex/concave) algorithm, a primal/dual optimization approach that takes a local convex approximation to each term at every iteration. Inspired by optimization problems arising in computed tomography (CT) imaging, this algorithm can handle a range of nonconvex composite optimization problems, and offers theoretical guarantees for convergence when the overall problem is approximately convex (that is, any concavity in one term is balanced out by convexity in the other term). Empirical results show fast convergence for several structured signal recovery problems.

高次元統計で生じる多くの最適化問題は、いくつかの項の和に自然に分解されます。個々の項は比較的単純ですが、複合目的関数は反復アルゴリズムでのみ最適化できます。この論文では、形式$F(Kx) + G(x)$の最適化問題を対象としています。ここで、$K$は固定の線形変換であり、$F$と$G$は非凸および/または微分不可能な関数です。特に、いずれかの項が非凸である場合、既存の交互最小化手法は収束しない可能性があります。また、他の種類の既存のアプローチでは、微分不可能を処理できない可能性があります。私たちは、すべての反復で各項にローカル凸近似をとる主/デュアル最適化アプローチであるMOCCA (ミラー凸/凹)アルゴリズムを提案します。コンピュータ断層撮影(CT)イメージングで発生する最適化問題にヒントを得たこのアルゴリズムは、さまざまな非凸複合最適化問題を処理でき、問題全体がほぼ凸である場合(つまり、1つの項の凹みが他の項の凸みによって相殺される)に理論的な収束保証を提供します。実験結果では、いくつかの構造化信号回復問題で高速収束が示されています。

Covariance-based Clustering in Multivariate and Functional Data Analysis
多変量および関数データ分析における共分散ベースのクラスタリング

In this paper we propose a new algorithm to perform clustering of multivariate and functional data. We study the case of two populations different in their covariances, rather than in their means. The algorithm relies on a proper quantification of distance between the estimated covariance operators of the populations, and subdivides data into two groups maximising the distance between their induced covariances. The naive implementation of such an algorithm is computationally prohibitive, so we propose a heuristic formulation with a much lighter complexity and we study its convergence properties, along with its computational cost. We also propose to use an enhanced estimator for the estimation of discrete covariances of functional data, namely a linear shrinkage estimator, in order to improve the precision of the clustering. We establish the effectiveness of our algorithm through applications to both synthetic data and a real data set coming from a biomedical context, showing also how the use of shrinkage estimation may lead to substantially better results.

この論文では、多変量データと関数データのクラスタリングを実行するための新しいアルゴリズムを提案します。平均ではなく共分散が異なる2つの集団のケースを検討します。このアルゴリズムは、集団の推定共分散演算子間の距離を適切に定量化することに依存し、誘導共分散間の距離を最大化するようにデータを2つのグループに分割します。このようなアルゴリズムを単純に実装することは計算上困難であるため、はるかに複雑さの少ないヒューリスティックな定式化を提案し、その収束特性と計算コストを検討します。また、クラスタリング精度を向上させるために、関数データの離散共分散の推定に強化された推定量、つまり線形収縮推定量を使用することを提案します。合成データと生物医学的コンテキストからの実際のデータセットの両方に適用することで、アルゴリズムの有効性を確立し、収縮推定の使用によって大幅に優れた結果が得られる可能性も示します。

Large Scale Visual Recognition through Adaptation using Joint Representation and Multiple Instance Learning
共同表現と複数インスタンス学習を用いた適応による大規模視覚認識

A major barrier towards scaling visual recognition systems is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNNs) trained using 1.2M+ labeled images have emerged as clear winners on object classification benchmarks. Unfortunately, only a small fraction of those labels are available with bounding box localization for training the detection task and even fewer pixel level annotations are available for semantic segmentation. It is much cheaper and easier to collect large quantities of image-level labels from search engines than it is to collect scene-centric images with precisely localized labels. We develop methods for learning large scale recognition models which exploit joint training over both weak (image-level) and strong (bounding box) labels and which transfer learned perceptual representations from strongly-labeled auxiliary tasks. We provide a novel formulation of a joint multiple instance learning method that includes examples from object-centric data with image-level labels when available, and also performs domain transfer learning to improve the underlying detector representation. We then show how to use our large scale detectors to produce pixel level annotations. Using our method, we produce a $>$7.6K category detector and release code and models at lsda.berkeleyvision.org.

視覚認識システムのスケーリングにおける大きな障壁は、多数のカテゴリのラベル付き画像を取得することの難しさです。最近、120万枚以上のラベル付き画像を使用してトレーニングされたディープ畳み込みニューラルネットワーク(CNN)が、オブジェクト分類ベンチマークで明確な勝者として浮上しました。残念ながら、検出タスクのトレーニング用にバウンディングボックスのローカリゼーションが利用できるラベルはごく一部であり、セマンティックセグメンテーションに利用できるピクセルレベルの注釈はさらに少ないです。正確にローカリゼーションされたラベルを持つシーン中心の画像を収集するよりも、検索エンジンから大量の画像レベルのラベルを収集する方がはるかに安価で簡単です。私たちは、弱い(画像レベル)ラベルと強い(バウンディングボックス)ラベルの両方に対する共同トレーニングを活用し、強くラベル付けされた補助タスクから学習した知覚表現を転送する、大規模な認識モデルを学習する方法を開発しました。私たちは、利用可能な場合は画像レベルのラベルを持つオブジェクト中心のデータからの例を含め、ドメイン転送学習も実行して基礎となる検出器表現を改善する、共同マルチインスタンス学習方法の新しい定式化を提供します。次に、大規模な検出器を使用してピクセルレベルの注釈を生成する方法を示します。私たちの方法を使用して、$>$7.6Kカテゴリ検出器を作成し、コードとモデルをlsda.berkeleyvision.orgで公開します。

Sparse PCA via Covariance Thresholding
共分散しきい値処理によるスパース PCA

In sparse principal component analysis we are given noisy observations of a low-rank matrix of dimension $n\times p$ and seek to reconstruct it under additional sparsity assumptions. In particular, we assume here each of the principal components $v_1,\dots,v_r$ has at most $s_0$ non-zero entries. We are particularly interested in the high dimensional regime wherein $p$ is comparable to, or even much larger than $n$. In an influential paper, Johnstone and Lu (2004) introduced a simple algorithm that estimates the support of the principal vectors $v_1,\dots,v_r$ by the largest entries in the diagonal of the empirical covariance. This method can be shown to identify the correct support with high probability if $s_0\le K_1\sqrt{n/\log p}$, and to fail with high probability if $s_0\ge K_2 \sqrt{n/\log p}$ for two constants $0<K_1,K_2<\infty$. Here we analyze a covariance thresholding algorithm that was recently proposed by Krauthgamer, Nadler, Vilenchik, et al. (2015). On the basis of numerical simulations (for the rank-one case), these authors conjectured that covariance thresholding correctly recovers the support with high probability for $s_0\le K\sqrt{n}$ (assuming $n$ of the same order as $p$). We prove this conjecture, and in fact establish a more general guarantee including higher-rank as well as $n$ much smaller than $p$. Recent lower bounds (Berthet and Rigollet, 2013; Ma and Wigderson, 2015) suggest that no polynomial time algorithm can do significantly better. The key technical component of our analysis develops new bounds on the norm of kernel random matrices, in regimes that were not considered before. Using these, we also derive sharp bounds for estimating the population covariance, and the principal component (with $\ell_2$-loss).

疎な主成分分析では、次元$n\times p$の低ランク行列のノイズの多い観測値が与えられ、追加の疎性仮定の下でそれを再構築しようとします。特に、ここでは各主成分$v_1,\dots,v_r$が最大で$s_0$個の非ゼロ要素を持つと仮定します。特に、$p$が$n$と同程度、またはそれよりもはるかに大きい高次元領域に関心があります。影響力のある論文で、JohnstoneとLu (2004)は、経験的共分散の対角線の最大要素によって主ベクトル$v_1,\dots,v_r$のサポートを推定する簡単なアルゴリズムを導入しました。この方法は、2つの定数$0<K_1,K_2<\infty$に対して、$s_0\le K_1\sqrt{n/\log p}$の場合に高い確率で正しいサポートを識別し、$s_0\ge K_2 \sqrt{n/\log p}$の場合は高い確率で失敗することがわかります。ここでは、Krauthgamer、Nadler、Vilenchikら(2015)によって最近提案された共分散しきい値アルゴリズムを分析します。数値シミュレーション(ランク1の場合)に基づいて、これらの著者は、共分散しきい値により、$s_0\le K\sqrt{n}$ ($n$が$p$と同じオーダーであると仮定)の場合に高い確率でサポートが正しく回復されると推測しました。この推測を証明し、実際に、より高いランクや$n$が$p$よりはるかに小さい場合を含む、より一般的な保証を確立します。最近の下限値(BerthetとRigollet、2013年、MaとWigderson、2015年)は、多項式時間アルゴリズムでこれより大幅に優れた結果が得られないことを示唆しています。私たちの分析の重要な技術的要素は、これまで考慮されていなかった領域で、カーネルランダム行列のノルムに新しい境界を設定することです。これらを使用して、母集団共分散と主成分($\ell_2$損失を使用)を推定するためのシャープな境界も導き出します。

Multiscale Adaptive Representation of Signals: I. The Basic Framework
信号のマルチスケール適応表現: I.基本的なフレームワーク

We introduce a framework for designing multi-scale, adaptive, shift-invariant frames and bi-frames for representing signals. The new framework, called AdaFrame, improves over dictionary learning-based techniques in terms of computational efficiency at inference time. It improves over classical multi-scale bases such as wavelet frames in terms of coding efficiency. It provides an attractive alternative to dictionary learning-based techniques for low level signal processing tasks, such as compression and denoising, as well as high level tasks, such as feature extraction for object recognition. Connections with deep convolutional networks are also discussed. In particular, the proposed framework reveals a drawback in the commonly used approach for visualizing the activations of the intermediate layers in convolutional networks, and suggests a natural alternative.

私たちは、信号を表現するためのマルチスケール、適応型、シフト不変フレーム、およびバイフレームを設計するためのフレームワークを紹介します。AdaFrameと呼ばれる新しいフレームワークは、推論時の計算効率の点で、辞書学習ベースの手法よりも優れています。これは、符号化効率の点でウェーブレットフレームなどの古典的なマルチスケール基底を改善します。これは、圧縮やノイズ除去などの低レベルの信号処理タスクや、オブジェクト認識のための特徴抽出などの高レベルのタスクに対して、辞書学習ベースの手法に代わる魅力的な選択肢を提供します。また、深い畳み込みネットワークとの接続についても説明します。特に、提案されたフレームワークは、畳み込みネットワーク内の中間層の活性化を視覚化するために一般的に使用されるアプローチの欠点を明らかにし、自然な代替案を提案します。

Regularized Policy Iteration with Nonparametric Function Spaces
ノンパラメトリック関数空間による正則化ポリシー反復

We study two regularization-based approximate policy iteration algorithms, namely REG-LSPI and REG-BRM, to solve reinforcement learning and planning problems in discounted Markov Decision Processes with large state and finite action spaces. At the core of these algorithms are the regularized extensions of Least-Squares Temporal Difference (LSTD) learning and Bellman Residual Minimization (BRM), which are used in the algorithms’ policy evaluation steps. Regularization provides a convenient way to control the complexity of the function space to which the estimated value function belongs and as a result enables us to work with rich nonparametric function spaces. We derive efficient implementations of our methods when the function space is a reproducing kernel Hilbert space. We analyze the statistical properties of REG-LSPI and provide an upper bound on the policy evaluation error and the performance loss of the policy returned by this method. Our bound shows the dependence of the loss on the number of samples, the capacity of the function space, and some intrinsic properties of the underlying Markov Decision Process. The dependence of the policy evaluation bound on the number of samples is minimax optimal. This is the first work that provides such a strong guarantee for a nonparametric approximate policy iteration algorithm. (This work is an extension of the NIPS 2008 conference paper by Farahmand et al. (2009b).)

私たちは、大きな状態と有限の行動空間を持つ割引マルコフ決定過程における強化学習と計画の問題を解決するために、正則化に基づく2つの近似ポリシー反復アルゴリズム、すなわちREG-LSPIとREG-BRMを研究します。これらのアルゴリズムの中核は、アルゴリズムのポリシー評価ステップで使用される最小二乗時間差(LSTD)学習とベルマン残差最小化(BRM)の正則化された拡張です。正則化は、推定値関数が属する関数空間の複雑さを制御する便利な方法を提供し、その結果、豊富なノンパラメトリック関数空間で作業することを可能にします。私たちは、関数空間が再生カーネルヒルベルト空間である場合の、我々の方法の効率的な実装を導出します。私たちは、REG-LSPIの統計的特性を分析し、この方法によって返されるポリシーの評価誤差とパフォーマンス損失の上限を提供します。我々の上限は、サンプル数、関数空間の容量、および基礎となるマルコフ決定過程のいくつかの固有の特性に対する損失の依存性を示します。ポリシー評価境界のサンプル数への依存性は、ミニマックス最適です。これは、ノンパラメトリック近似ポリシー反復アルゴリズムに対してこのような強力な保証を提供する最初の研究です。(この研究は、Farahmandら(2009b)によるNIPS 2008カンファレンス論文の拡張です。)

CrossCat: A Fully Bayesian Nonparametric Method for Analyzing Heterogeneous, High Dimensional Data
CrossCat:異種高次元データを分析するための完全ベイズノンパラメトリック法

There is a widespread need for statistical methods that can analyze high-dimensional datasets without imposing restrictive or opaque modeling assumptions. This paper describes a domain-general data analysis method called CrossCat. CrossCat infers multiple non-overlapping views of the data, each consisting of a subset of the variables, and uses a separate nonparametric mixture to model each view. CrossCat is based on approximately Bayesian inference in a hierarchical, nonparametric model for data tables. This model consists of a Dirichlet process mixture over the columns of a data table in which each mixture component is itself an independent Dirichlet process mixture over the rows; the inner mixture components are simple parametric models whose form depends on the types of data in the table. CrossCat combines strengths of mixture modeling and Bayesian network structure learning. Like mixture modeling, CrossCat can model a broad class of distributions by positing latent variables, and produces representations that can be efficiently conditioned and sampled from for prediction. Like Bayesian networks, CrossCat represents the dependencies and independencies between variables, and thus remains accurate when there are multiple statistical signals. Inference is done via a scalable Gibbs sampling scheme; this paper shows that it works well in practice. This paper also includes empirical results on heterogeneous tabular data of up to 10 million cells, such as hospital cost and quality measures, voting records, unemployment rates, gene expression measurements, and images of handwritten digits. CrossCat infers structure that is consistent with accepted findings and common-sense knowledge in multiple domains and yields predictive accuracy competitive with generative, discriminative, and model-free alternatives.

制限的または不透明なモデリング仮定を課すことなく、高次元データセットを分析できる統計的手法に対する幅広いニーズがあります。この論文では、CrossCatと呼ばれるドメイン汎用データ分析手法について説明します。CrossCatは、変数のサブセットで構成されるデータの複数の重複しないビューを推論し、個別のノンパラメトリック混合を使用して各ビューをモデル化します。CrossCatは、データテーブルの階層的なノンパラメトリックモデルにおける近似ベイズ推論に基づいています。このモデルは、データテーブルの列にわたるディリクレ過程混合で構成され、各混合コンポーネントは行にわたる独立したディリクレ過程混合です。内部の混合コンポーネントは、テーブル内のデータの種類によって形式が決まる単純なパラメトリックモデルです。CrossCatは、混合モデリングとベイジアンネットワーク構造学習の長所を組み合わせています。混合モデリングと同様に、CrossCatは潜在変数を仮定することで幅広いクラスの分布をモデル化でき、予測のために効率的に条件付けおよびサンプリングできる表現を生成します。ベイジアンネットワークと同様に、CrossCatは変数間の依存関係と独立性を表すため、複数の統計信号がある場合でも正確さを保ちます。推論はスケーラブルなギブスサンプリングスキームによって行われ、この論文ではそれが実際にうまく機能することを示しています。この論文には、病院のコストと品質の尺度、投票記録、失業率、遺伝子発現の測定、手書きの数字の画像など、最大1,000万セルの異種表形式データに関する実証結果も含まれています。CrossCatは、複数のドメインで受け入れられている知見と常識的な知識と一致する構造を推論し、生成的、識別的、およびモデルフリーの代替手段に匹敵する予測精度をもたらします。

Pymanopt: A Python Toolbox for Optimization on Manifolds using Automatic Differentiation
Pymanopt: 自動微分を使用した多様体の最適化のための Python ツールボックス

Optimization on manifolds is a class of methods for optimization of an objective function, subject to constraints which are smooth, in the sense that the set of points which satisfy the constraints admits the structure of a differentiable manifold. While many optimization problems are of the described form, technicalities of differential geometry and the laborious calculation of derivatives pose a significant barrier for experimenting with these methods. We introduce Pymanopt (available at pymanopt.github.io), a toolbox for optimization on manifolds, implemented in Python, that—similarly to the Manopt Matlab toolbox—implements several manifold geometries and optimization algorithms. Moreover, we lower the barriers to users further by using automated differentiation for calculating derivative information, saving users time and saving them from potential calculation and implementation errors.

多様体上の最適化は、目的関数の最適化のための手法のクラスであり、制約を満たす点の集合が微分可能な多様体の構造を持つという意味で、滑らかな制約を対象とします。多くの最適化問題はこの形式に当てはまりますが、微分幾何学の技術的な詳細や導関数の面倒な計算が、これらの方法を試すための大きな障壁となっています。Pythonで実装された多様体上の最適化のためのツールボックスであるPymanopt(pymanopt.github.ioで利用可能)を紹介します。これは、Manopt Matlabツールボックスと同様に、いくつかの多様体ジオメトリと最適化アルゴリズムを実装します。さらに、微分情報の計算に自動微分を使用することで、ユーザーのハードルをさらに下げ、ユーザーの時間を節約し、潜在的な計算エラーや実装エラーからユーザーを守ります。

Synergy of Monotonic Rules
単調なルールの相乗効果

This article describes a method for constructing a special rule (we call it synergy rule) that uses as its input information the outputs (scores) of several monotonic rules which solve the same pattern recognition problem. As an example of scores of such monotonic rules we consider here scores of SVM classifiers. In order to construct the optimal synergy rule, we estimate the conditional probability function based on the direct problem setting, which requires solving a Fredholm integral equation. Generally, solving a Fredholm equation is an ill-posed problem. However, in our model, we look for the solution of the equation in the set of monotonic and bounded functions, which makes the problem well-posed. This allows us to solve the equation accurately even with training data sets of limited size. In order to construct a monotonic solution, we use the set of functions that belong to Reproducing Kernel Hilbert Space (RKHS) associated with the INK-spline kernel (splines with Infinite Numbers of Knots) of degree zero. The paper provides details of the methods for finding multidimensional conditional probability in a set of monotonic functions to obtain the corresponding synergy rules. We demonstrate effectiveness of such rules for 1) solving standard pattern recognition problems, 2) constructing multi-class classification rules, 3) constructing a method for knowledge transfer from multiple intelligent teachers in the LUPI paradigm.

この記事では、同じパターン認識問題を解決する複数の単調ルールの出力(スコア)を入力情報として使用する特別なルール(シナジールールと呼ぶ)を構築する方法について説明します。このような単調ルールのスコアの例として、ここではSVM分類器のスコアを検討します。最適なシナジールールを構築するために、フレドホルム積分方程式を解く必要がある直接的な問題設定に基づいて条件付き確率関数を推定します。一般に、フレドホルム方程式を解くことは不適切問題です。ただし、このモデルでは、単調で有界な関数のセットで方程式の解を探すため、問題は適切になります。これにより、トレーニングデータセットのサイズが限られている場合でも、方程式を正確に解くことができます。単調な解を構築するために、次数0のINKスプラインカーネル(無限数のノットを持つスプライン)に関連付けられた再生カーネルヒルベルト空間(RKHS)に属する関数のセットを使用します。この論文では、一連の単調関数における多次元条件付き確率を見つけて、対応する相乗効果ルールを取得する方法について詳しく説明します。このようなルールの有効性を実証し、1)標準的なパターン認識問題の解決、2)マルチクラス分類ルールの構築、3) LUPIパラダイムにおける複数のインテリジェント教師からの知識転送方法の構築に役立てます。

Refined Error Bounds for Several Learning Algorithms
いくつかの学習アルゴリズムの誤差範囲を改良

This article studies the achievable guarantees on the error rates of certain learning algorithms, with particular focus on refining logarithmic factors. Many of the results are based on a general technique for obtaining bounds on the error rates of sample-consistent classifiers with monotonic error regions, in the realizable case. We prove bounds of this type expressed in terms of either the VC dimension or the sample compression size. This general technique also enables us to derive several new bounds on the error rates of general sample-consistent learning algorithms, as well as refined bounds on the label complexity of the CAL active learning algorithm. Additionally, we establish a simple necessary and sufficient condition for the existence of a distribution-free bound on the error rates of all sample-consistent learning rules, converging at a rate inversely proportional to the sample size. We also study learning in the presence of classification noise, deriving a new excess error rate guarantee for general VC classes under Tsybakov’s noise condition, and establishing a simple and general necessary and sufficient condition for the minimax excess risk under bounded noise to converge at a rate inversely proportional to the sample size.

この記事では、特定の学習アルゴリズムのエラー率に関する達成可能な保証について、特に対数係数の改良に焦点を当てて研究します。結果の多くは、実現可能なケースで単調なエラー領域を持つサンプル一貫性分類器のエラー率の境界を取得するための一般的な手法に基づいています。VC次元またはサンプル圧縮サイズのいずれかで表現されるこのタイプの境界を証明します。この一般的な手法により、一般的なサンプル一貫性学習アルゴリズムのエラー率に関するいくつかの新しい境界や、CALアクティブ学習アルゴリズムのラベル複雑性の改良された境界を導出することもできます。さらに、サンプルサイズに反比例する速度で収束する、すべてのサンプル一貫性学習ルールのエラー率に関する分布フリーの境界が存在するための、単純な必要十分条件を確立します。また、分類ノイズが存在する場合の学習についても研究し、Tsybakovのノイズ条件下での一般的なVCクラスに対する新しい過剰エラー率保証を導出し、制限されたノイズ下でのミニマックス過剰リスクがサンプルサイズに反比例する速度で収束するための単純で一般的な必要十分条件を確立します。

Adjusting for Chance Clustering Comparison Measures
チャンスクラスタリングの比較測度の調整

Adjusted-for-chance measures are widely used to compare partitions/clusterings of the same data set. In particular, the Adjusted Rand Index (ARI) based on pair-counting, and the Adjusted Mutual Information (AMI) based on Shannon information theory are very popular in the clustering community. Nonetheless it is an open problem as to what are the best application scenarios for each measure and guidelines in the literature for their usage are sparse, with the result that users often resort to using both. Generalized Information Theoretic (IT) measures based on the Tsallis entropy have been shown to link pair-counting and Shannon IT measures. In this paper, we aim to bridge the gap between adjustment of measures based on pair-counting and measures based on information theory. We solve the key technical challenge of analytically computing the expected value and variance of generalized IT measures. This allows us to propose adjustments of generalized IT measures, which reduce to well known adjusted clustering comparison measures as special cases. Using the theory of generalized IT measures, we are able to propose the following guidelines for using ARI and AMI as external validation indices: ARI should be used when the reference clustering has large equal sized clusters; AMI should be used when the reference clustering is unbalanced and there exist small clusters.

偶然性を調整した指標は、同じデータセットのパーティション/クラスタリングを比較するために広く使用されています。特に、ペアカウントに基づく調整ランド指数(ARI)と、シャノン情報理論に基づく調整相互情報量(AMI)は、クラスタリングコミュニティで非常に人気があります。ただし、各指標の最適な適用シナリオは何かという問題は未解決であり、その使用に関する文献のガイドラインはまばらであるため、ユーザーは両方を使用することになることがよくあります。Tsallisエントロピーに基づく一般化情報理論(IT)指標は、ペアカウントとシャノンIT指標をリンクすることが示されています。この論文では、ペアカウントに基づく指標の調整と情報理論に基づく指標のギャップを埋めることを目指しています。一般化IT指標の期待値と分散を解析的に計算するという重要な技術的課題を解決します。これにより、一般的なIT指標の調整を提案することができ、特殊なケースとして、よく知られている調整済みクラスタリング比較指標に簡略化されます。一般化されたIT測定の理論を使用して、ARIとAMIを外部検証指標として使用するための次のガイドラインを提案できます。ARIは、参照クラスタリングに同じサイズの大きなクラスターがある場合に使用する必要があります。AMIは、参照クラスタリングが不均衡で小さなクラスターが存在する場合に使用する必要があります。
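
For readers unfamiliar with pair-counting measures, the following is a minimal sketch of the standard Adjusted Rand Index computed from a contingency table of two labelings. It shows only the classical ARI, not the paper's generalized Tsallis-entropy-based adjustments; the function name and the handling of the degenerate case (returning 1.0 when both labelings are trivially identical in structure) are assumptions made for this illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items to count pairs")
    # Pairs of items placed together in both clusterings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each clustering separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # chance-level agreement
    max_index = 0.5 * (sum_a + sum_b)       # upper bound on agreement
    if max_index == expected:
        return 1.0  # assumed convention for the degenerate case
    return (sum_cells - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, relabeled
print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]))  # one cluster split in two
```

The index is invariant to how clusters are named, so the first call reports perfect agreement, while splitting one cluster in the second call lowers the score to 4/7.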

Cross-Corpora Unsupervised Learning of Trajectories in Autism Spectrum Disorders
自閉症スペクトラム障害における軌跡のクロスコーパス教師なし学習

Patients with developmental disorders, such as autism spectrum disorder (ASD), present with symptoms that change with time even if the named diagnosis remains fixed. For example, language impairments may present as delayed speech in a toddler and difficulty reading in a school-age child. Characterizing these trajectories is important for early treatment. However, deriving these trajectories from observational sources is challenging: electronic health records only reflect observations of patients at irregular intervals and only record what factors are clinically relevant at the time of observation. Meanwhile, caretakers discuss daily developments and concerns on social media. In this work, we present a fully unsupervised approach for learning disease trajectories from incomplete medical records and social media posts, including cases in which we have only a single observation of each patient. In particular, we use a dynamic topic model approach which embeds each disease trajectory as a path in $\mathbb{R}^D$. A Polya-gamma augmentation scheme is used to efficiently perform inference as well as incorporate multiple data sources. We learn disease trajectories from the electronic health records of 13,435 patients with ASD and the forum posts of 13,743 caretakers of children with ASD, deriving interesting clinical insights as well as good predictions.

自閉症スペクトラム障害（ASD）などの発達障害の患者は、診断名が固定されていても、症状が時間とともに変化します。たとえば、言語障害は、幼児では発語の遅れ、学齢期の子供では読み書きの困難として現れることがあります。これらの軌跡を特徴付けることは、早期治療にとって重要です。しかし、観察情報源からこれらの軌跡を導き出すことは困難です。電子健康記録は、患者の観察を不定期に反映しているだけであり、観察時に臨床的に関連する要因のみを記録しています。一方、介護者はソーシャルメディアで日々の進展や懸念について話し合っています。この研究では、不完全な医療記録やソーシャルメディアの投稿から疾患の軌跡を学習するための、完全に教師なしのアプローチを提示します。これには、各患者について単一の観察しかない場合も含まれます。特に、各疾患の軌跡を$\mathbb{R}^D$のパスとして埋め込む動的トピックモデルアプローチを使用します。Polya-gamma拡張スキームを使用して、推論を効率的に実行し、複数のデータソースを組み込みます。私たちは、13,435人のASD患者の電子健康記録と13,743人のASD児童の保護者のフォーラム投稿から病気の軌跡を学び、興味深い臨床的洞察と優れた予測を導き出しました。

Extracting PICO Sentences from Clinical Trial Reports using Supervised Distant Supervision
教師あり遠隔監督による臨床試験レポートからの PICO 文の抽出

Systematic reviews underpin Evidence Based Medicine (EBM) by addressing precise clinical questions via comprehensive synthesis of all relevant published evidence. Authors of systematic reviews typically define a Population/Problem, Intervention, Comparator, and Outcome (a PICO criteria) of interest, and then retrieve, appraise and synthesize results from all reports of clinical trials that meet these criteria. Identifying PICO elements in the full-texts of trial reports is thus a critical yet time-consuming step in the systematic review process. We seek to expedite evidence synthesis by developing machine learning models to automatically extract sentences from articles relevant to PICO elements. Collecting a large corpus of training data for this task would be prohibitively expensive. Therefore, we derive distant supervision (DS) with which to train models using previously conducted reviews. DS entails heuristically deriving ‘soft’ labels from an available structured resource. However, we have access only to unstructured, free-text summaries of PICO elements for corresponding articles; we must derive from these the desired sentence-level annotations. To this end, we propose a novel method — supervised distant supervision (SDS) — that uses a small amount of direct supervision to better exploit a large corpus of distantly labeled instances by learning to pseudo-annotate articles using the available DS. We show that this approach tends to outperform existing methods with respect to automated PICO extraction.

システマティックレビューは、公開されている関連するすべてのエビデンスを包括的に統合することにより、正確な臨床上の疑問に取り組むことで、エビデンスに基づく医療（EBM）の基盤となります。システマティックレビューの著者は通常、関心のある集団/問題、介入、比較対象、および結果（PICO基準）を定義し、次にこれらの基準を満たすすべての臨床試験のレポートから結果を取得、評価、統合します。したがって、試験レポートの全文でPICO要素を特定することは、システマティックレビュープロセスにおいて重要でありながら時間のかかるステップです。私たちは、PICO要素に関連する記事から文章を自動的に抽出する機械学習モデルを開発することで、エビデンス統合を迅速化することを目指しています。このタスクのためにトレーニングデータの大規模なコーパスを収集するには、法外な費用がかかります。そのため、以前に実施したレビューを使用してモデルをトレーニングするための遠隔監視（DS）を導き出します。DSでは、利用可能な構造化リソースから「ソフト」ラベルをヒューリスティックに導き出します。ただし、対応する記事のPICO要素の構造化されていないフリーテキストの要約にしかアクセスできません。これらから、必要な文レベルの注釈を導き出さなければなりません。この目的のために、私たちは、利用可能なDSを使用して記事に疑似注釈を付ける方法を学習することで、遠隔的にラベル付けされたインスタンスの大規模なコーパスをより有効に活用するために少量の直接監督を使用する新しい方法、教師あり遠隔監督(SDS)を提案します。このアプローチは、自動化されたPICO抽出に関して既存の方法よりも優れている傾向があることを示しています。

String and Membrane Gaussian Processes
ストリングおよびメンブレンガウスプロセス

In this paper we introduce a novel framework for making exact nonparametric Bayesian inference on latent functions that is particularly suitable for Big Data tasks. Firstly, we introduce a class of stochastic processes we refer to as string Gaussian processes (string GPs, not to be mistaken for Gaussian processes operating on text). We construct string GPs so that their finite-dimensional marginals exhibit suitable local conditional independence structures, which allow for scalable, distributed, and flexible nonparametric Bayesian inference, without resorting to approximations, and while ensuring some mild global regularity constraints. Furthermore, string GP priors naturally cope with heterogeneous input data, and the gradient of the learned latent function is readily available for explanatory analysis. Secondly, we provide some theoretical results relating our approach to the standard GP paradigm. In particular, we prove that some string GPs are Gaussian processes, which provides a complementary global perspective on our framework. Finally, we derive a scalable and distributed MCMC scheme for supervised learning tasks under string GP priors. The proposed MCMC scheme has computational time complexity $\mathcal{O}(N)$ and memory requirement $\mathcal{O}(dN)$, where $N$ is the data size and $d$ the dimension of the input space. We illustrate the efficacy of the proposed approach on several synthetic and real-world data sets, including a data set with $6$ million input points and $8$ attributes.

この論文では、ビッグデータタスクに特に適した、潜在関数に対する正確なノンパラメトリックベイズ推論を行うための新しいフレームワークを紹介します。まず、ストリングガウス過程(ストリングGP。テキストを対象とするガウス過程と混同しないでください)と呼ぶ確率過程のクラスを紹介します。有限次元周辺分布が適切な局所的条件付き独立構造を示すようにストリングGPを構築します。これにより、近似に頼ることなく、ある程度の緩やかな大域的正則性制約を確保しながら、スケーラブルで分散された柔軟なノンパラメトリックベイズ推論が可能になります。さらに、ストリングGPの事前分布は、異種入力データに自然に対処し、学習した潜在関数の勾配は説明的分析にすぐに利用できます。次に、我々のアプローチを標準的なGPパラダイムに関連付けるいくつかの理論的結果を示します。特に、一部のストリングGPがガウス過程であることを証明し、フレームワークに対する補完的な大域的視点を提供します。最後に、ストリングGP事前分布の下での教師あり学習タスク用のスケーラブルで分散されたMCMCスキームを導出します。提案されたMCMCスキームの計算時間計算量は$\mathcal{O}(N)$、メモリ要件は$\mathcal{O}(dN)$です。ここで、$N$はデータサイズ、$d$は入力空間の次元です。600万の入力ポイントと8つの属性を持つデータセットを含む、いくつかの合成データセットと実際のデータセットで提案されたアプローチの有効性を示します。

A Well-Conditioned and Sparse Estimation of Covariance and Inverse Covariance Matrices Using a Joint Penalty
ジョイントペナルティを用いた共分散行列と逆共分散行列の良条件かつスパースな推定

We develop a method for estimating well-conditioned and sparse covariance and inverse covariance matrices from a sample of vectors drawn from a sub-Gaussian distribution in a high-dimensional setting. The proposed estimators are obtained by minimizing the quadratic loss function and joint penalty of $\ell_1$ norm and variance of its eigenvalues. In contrast to some of the existing methods of covariance and inverse covariance matrix estimation, where often the interest is to estimate a sparse matrix, the proposed method is flexible in estimating both a sparse and well-conditioned covariance matrix simultaneously. The proposed estimators are optimal in the sense that they achieve the minimax rate of estimation in operator norm for the underlying class of covariance and inverse covariance matrices. We give a very fast algorithm for computation of these covariance and inverse covariance matrices which is easily scalable to large scale data analysis problems. The simulation study for varying sample sizes and variables shows that the proposed estimators perform better than several other estimators for various choices of structured covariance and inverse covariance matrices. We also use our proposed estimator for tumor tissue classification using gene expression data and compare its performance with some other classification methods.

私たちは、高次元設定のサブガウス分布から抽出されたベクトルのサンプルから、条件が整った疎な共分散行列と逆共分散行列を推定する方法を開発しました。提案された推定量は、2次損失関数と、その固有値の$\ell_1$ノルムと分散の結合ペナルティを最小化することで得られます。疎行列を推定することに関心がある場合が多い共分散行列と逆共分散行列の推定の既存の方法とは対照的に、提案された方法は、疎な共分散行列と条件が整った共分散行列の両方を同時に推定できる柔軟性があります。提案された推定量は、共分散行列と逆共分散行列の基礎となるクラスに対して演算子ノルムの推定の最小最大率を達成するという意味で最適です。これらの共分散行列と逆共分散行列を計算するための非常に高速なアルゴリズムを提供し、これは大規模なデータ分析問題に簡単に拡張できます。さまざまなサンプルサイズと変数に対するシミュレーション研究では、構造化共分散行列と逆共分散行列のさまざまな選択に対して、提案された推定量が他のいくつかの推定量よりも優れたパフォーマンスを発揮することが示されています。また、遺伝子発現データを使用した腫瘍組織の分類に提案された推定量を使用し、そのパフォーマンスを他の分類方法と比較します。

An Online Convex Optimization Approach to Blackwell’s Approachability
Blackwellの接近可能性に対するオンライン凸最適化アプローチ

The problem of approachability in repeated games with vector payoffs was introduced by Blackwell in the 1950s, along with geometric conditions and corresponding approachability strategies that rely on computing a sequence of direction vectors in the payoff space. For convex target sets, these vectors are obtained as projections from the current average payoff vector to the set. A recent paper by Abernethy, Bartlett and Hazan (2011) proposed a class of approachability algorithms that rely on Online Linear Programming for obtaining alternative sequences of direction vectors. This is first implemented for target sets that are convex cones, and then generalized to any convex set by embedding it in a higher-dimensional convex cone. In this paper we present a more direct formulation that relies on general Online Convex Optimization (OCO) algorithms, along with basic properties of the support function of convex sets. This leads to a general class of approachability algorithms, depending on the choice of the OCO algorithm and the used norms. Blackwell’s original algorithm and its convergence are recovered when Follow The Leader (or a regularized version thereof) is used for the OCO algorithm.

ベクトルペイオフを伴う繰り返しゲームにおける接近可能性の問題は、ペイオフ空間における方向ベクトルの列の計算に依存する幾何学的条件および対応する接近可能性戦略とともに、1950年代にBlackwellによって導入されました。凸ターゲット集合の場合、これらのベクトルは現在の平均ペイオフベクトルから集合への射影として得られます。Abernethy、Bartlett、Hazan (2011)による最近の論文では、方向ベクトルの代替的な列を得るためにオンライン線形計画法に依存する接近可能性アルゴリズムのクラスが提案されました。これは、最初に凸錐であるターゲット集合に対して実装され、次に高次元の凸錐に埋め込むことによって任意の凸集合に一般化されました。この論文では、一般的なオンライン凸最適化(OCO)アルゴリズムと、凸集合のサポート関数の基本的な性質に依存する、より直接的な定式化を示します。これにより、OCOアルゴリズムの選択と使用されるノルムに応じた、一般的なクラスの接近可能性アルゴリズムが得られます。Blackwellのオリジナルのアルゴリズムとその収束は、OCOアルゴリズムにFollow The Leader (またはその正則化バージョン)を使用すると再現されます。

Multiple-Instance Learning from Distributions
分布からのマルチインスタンス学習

We propose a new theoretical framework for analyzing the multiple-instance learning (MIL) setting. In MIL, training examples are provided to a learning algorithm in the form of labeled sets, or “bags,” of instances. Applications of MIL include 3-D quantitative structure–activity relationship prediction for drug discovery and content-based image retrieval for web search. The goal of an algorithm is to learn a function that correctly labels new bags or a function that correctly labels new instances. We propose that bags should be treated as latent distributions from which samples are observed. We show that it is possible to learn accurate instance- and bag-labeling functions in this setting as well as functions that correctly rank bags or instances under weak assumptions. Additionally, our theoretical results suggest that it is possible to learn to rank efficiently using traditional, well-studied “supervised” learning approaches. We perform an extensive empirical evaluation that supports the theoretical predictions entailed by the new framework. The proposed theoretical framework leads to a better understanding of the relationship between the MI and standard supervised learning settings, and it provides new methods for learning from MI data that are more accurate, more efficient, and have better understood theoretical properties than existing MI-specific algorithms.

私たちは、マルチインスタンス学習(MIL)設定を分析するための新しい理論的枠組みを提案します。MILでは、学習アルゴリズムに、インスタンスのラベル付きセット、つまり「バッグ」の形式でトレーニング例が提供されます。MILの用途には、新薬発見のための3D定量的構造活性関係予測や、Web検索のためのコンテンツベースの画像検索などがあります。アルゴリズムの目標は、新しいバッグに正しくラベルを付ける関数、または新しいインスタンスに正しくラベルを付ける関数を学習することです。私たちは、バッグをサンプルが観察される潜在分布として扱うことを提案します。この設定では、正確なインスタンスおよびバッグのラベル付け関数、および弱い仮定の下でバッグまたはインスタンスを正しくランク付けする関数を学習できることを示します。さらに、我々の理論的結果は、従来のよく研究された「教師あり」学習アプローチを使用して、効率的にランク付けを学習できることを示唆しています。私たちは、新しい枠組みに伴う理論的予測を裏付ける広範な実証的評価を実施します。提案された理論的枠組みにより、MIと標準的な教師あり学習設定との関係をより深く理解できるようになり、既存のMI固有のアルゴリズムよりも正確で効率的で、理論的特性がよく理解されているMIデータからの学習のための新しい方法が提供されます。

Dual Control for Approximate Bayesian Reinforcement Learning
近似ベイズ強化学習のための双対制御

Control of non-episodic, finite-horizon dynamical systems with uncertain dynamics poses a tough and elementary case of the exploration-exploitation trade-off. Bayesian reinforcement learning, reasoning about the effect of actions and future observations, offers a principled solution, but is intractable. We review, then extend an old approximate approach from control theory—where the problem is known as dual control—in the context of modern regression methods, specifically generalized linear regression. Experiments on simulated systems show that this framework offers a useful approximation to the intractable aspects of Bayesian RL, producing structured exploration strategies that differ from standard RL approaches. We provide simple examples for the use of this framework in (approximate) Gaussian process regression and feedforward neural networks for the control of exploration.

不確実なダイナミクスを持つ非エピソード的な有限ホライズンの力学系の制御は、探索と活用のトレードオフの困難かつ基本的なケースです。ベイズ強化学習は、行動の影響と将来の観測について推論することで原理的な解決策を提供しますが、計算上扱いきれません。私たちは、制御理論(そこではこの問題は双対制御として知られています)における古い近似アプローチをレビューし、それを現代の回帰手法、特に一般化線形回帰の文脈で拡張します。シミュレーションシステムでの実験では、このフレームワークがベイズ強化学習の扱いにくい側面に対する有用な近似を提供し、標準的な強化学習アプローチとは異なる構造化された探索戦略を生み出すことが示されています。探索の制御のために、このフレームワークを(近似)ガウス過程回帰およびフィードフォワードニューラルネットワークで使用する簡単な例を示します。

On Lower and Upper Bounds in Smooth and Strongly Convex Optimization
平滑最適化と強凸最適化の下限と上限について

We develop a novel framework to study smooth and strongly convex optimization algorithms. Focusing on quadratic functions we are able to examine optimization algorithms as a recursive application of linear operators. This, in turn, reveals a powerful connection between a class of optimization algorithms and the analytic theory of polynomials whereby new lower and upper bounds are derived. Whereas existing lower bounds for this setting are only valid when the dimensionality scales with the number of iterations, our lower bound holds in the natural regime where the dimensionality is fixed. Lastly, expressing it as an optimal solution for the corresponding optimization problem over polynomials, as formulated by our framework, we present a novel systematic derivation of Nesterov’s well-known Accelerated Gradient Descent method. This rather natural interpretation of AGD contrasts with earlier ones which lacked a simple, yet solid, motivation.

私たちは、滑らかで強凸な最適化のアルゴリズムを研究するための新しいフレームワークを開発します。二次関数に焦点を当てることで、最適化アルゴリズムを線形演算子の再帰的適用として調べることができます。これにより、最適化アルゴリズムのあるクラスと多項式の解析理論との間に強力な関係があることが明らかになり、そこから新しい下限と上限が導出されます。この設定の既存の下限は、次元が反復回数に応じて増加する場合にのみ有効ですが、我々の下限は次元が固定されている自然な領域でも成り立ちます。最後に、我々のフレームワークによって定式化された多項式上の対応する最適化問題の最適解として表現することで、ネステロフのよく知られた加速勾配降下法(AGD)の新しい系統的導出を提示します。AGDのこのかなり自然な解釈は、単純でありながら確固たる動機を欠いていた以前の解釈とは対照的です。

Bayesian Optimization for Likelihood-Free Inference of Simulator-Based Statistical Models
シミュレータベースの統計モデルの尤度フリー推論のためのベイズ最適化

Our paper deals with inferring simulator-based statistical models given some observed data. A simulator-based model is a parametrized mechanism which specifies how data are generated. It is thus also referred to as a generative model. We assume that only a finite number of parameters are of interest and allow the generative process to be very general; it may be a noisy nonlinear dynamical system with an unrestricted number of hidden variables. This weak assumption is useful for devising realistic models but it renders statistical inference very difficult. The main challenge is the intractability of the likelihood function. Several likelihood-free inference methods have been proposed which share the basic idea of identifying the parameters by finding values for which the discrepancy between simulated and observed data is small. A major obstacle to using these methods is their computational cost. The cost is largely due to the need to repeatedly simulate data sets and the lack of knowledge about how the parameters affect the discrepancy. We propose a strategy which combines probabilistic modeling of the discrepancy with optimization to facilitate likelihood-free inference. The strategy is implemented using Bayesian optimization and is shown to accelerate the inference through a reduction in the number of required simulations by several orders of magnitude.

この論文では、観測データに基づいてシミュレータベースの統計モデルを推論する方法について取り上げます。シミュレータベースのモデルは、データの生成方法を指定するパラメーター化されたメカニズムです。そのため、生成モデルとも呼ばれます。関心のあるパラメーターは有限個のみであると仮定し、生成プロセスは非常に一般的なものになります。つまり、隠れた変数の数が無制限のノイズの多い非線形動的システムである可能性があります。この弱い仮定は現実的なモデルを考案するのに便利ですが、統計的推論を非常に困難にします。主な課題は、尤度関数の扱いにくさです。シミュレーションされたデータと観測されたデータの間の相違が小さい値を見つけることでパラメーターを識別するという基本的な考え方を共有する、尤度フリーの推論方法がいくつか提案されています。これらの方法を使用する上での主な障害は、計算コストです。コストは主に、データセットを繰り返しシミュレートする必要があることと、パラメーターが相違にどのように影響するかについての知識が不足していることに起因します。尤度フリーの推論を容易にするために、相違の確率的モデリングと最適化を組み合わせた戦略を提案します。この戦略はベイズ最適化を使用して実装され、必要なシミュレーションの数を数桁削減することで推論を加速することが示されています。

Bootstrap-Based Regularization for Low-Rank Matrix Estimation
低ランク行列推定のためのブートストラップベースの正則化

We develop a flexible framework for low-rank matrix estimation that allows us to transform noise models into regularization schemes via a simple bootstrap algorithm. Effectively, our procedure seeks an autoencoding basis for the observed matrix that is stable with respect to the specified noise model; we call the resulting procedure a stable autoencoder. In the simplest case, with an isotropic noise model, our method is equivalent to a classical singular value shrinkage estimator. For non-isotropic noise models—e.g., Poisson noise—the method does not reduce to singular value shrinkage, and instead yields new estimators that perform well in experiments. Moreover, by iterating our stable autoencoding scheme, we can automatically generate low-rank estimates without specifying the target rank as a tuning parameter.

私たちは、単純なブートストラップアルゴリズムを介してノイズモデルを正則化スキームに変換できる、低ランク行列推定のための柔軟なフレームワークを開発します。事実上、この手順は、指定されたノイズモデルに対して安定な、観測行列の自己符号化基底を求めます。結果として得られる手順を安定オートエンコーダと呼びます。最も単純なケースである等方性ノイズモデルでは、この方法は古典的な特異値収縮推定量と同等です。非等方性ノイズモデル(ポアソンノイズなど)の場合、この手法は特異値収縮には帰着せず、代わりに実験で優れた性能を発揮する新しい推定量を生成します。さらに、安定自己符号化スキームを反復することで、調整パラメータとしてターゲットランクを指定せずに、低ランクの推定値を自動的に生成できます。

The Constrained Dantzig Selector with Enhanced Consistency
一貫性を強化した制約付き Dantzig セレクタ

The Dantzig selector has received popularity for many applications such as compressed sensing and sparse modeling, thanks to its computational efficiency as a linear programming problem and its nice sampling properties. Existing results show that it can recover sparse signals mimicking the accuracy of the ideal procedure, up to a logarithmic factor of the dimensionality. Such a factor has been shown to hold for many regularization methods. An important question is whether this factor can be reduced to a logarithmic factor of the sample size in ultra-high dimensions under mild regularity conditions. To provide an affirmative answer, in this paper we suggest the constrained Dantzig selector, which has more flexible constraints and parameter space. We prove that the suggested method can achieve convergence rates within a logarithmic factor of the sample size of the oracle rates and improved sparsity, under a fairly weak assumption on the signal strength. Such improvement is significant in ultra-high dimensions. This method can be implemented efficiently through sequential linear programming. Numerical studies confirm that the sample size needed for a certain level of accuracy in these problems can be much reduced.

Dantzigセレクタは、線形計画問題としての計算効率と優れたサンプリング特性により、圧縮センシングやスパースモデリングなどの多くのアプリケーションで人気を博しています。既存の結果では、次元の対数係数まで、理想的な手順の精度を模倣したスパース信号を回復できることが示されています。このような係数は、多くの正則化方法で保持されることが示されています。重要な問題は、この係数が、軽度の正則性条件下で超高次元のサンプルサイズの対数係数にまで削減できるかどうかです。肯定的な答えを提供するために、この論文では、より柔軟な制約とパラメーター空間を持つ制約付きDantzigセレクタを提案します。提案された方法は、信号強度に関するかなり弱い仮定の下で、オラクルレートのサンプルサイズの対数係数内で収束率を達成し、スパース性を改善できることを証明します。このような改善は、超高次元では重要です。この方法は、逐次線形計画法を通じて効率的に実装できます。数値的研究により、これらの問題において一定レベルの精度を達成するために必要なサンプルサイズを大幅に削減できることが確認されています。

Multiple Output Regression with Latent Noise
潜在ノイズを伴う多出力回帰

In high-dimensional data, structured noise caused by observed and unobserved factors affecting multiple target variables simultaneously poses a serious challenge for modeling, by masking the often weak signal. Therefore, (1) explaining away the structured noise in multiple-output regression is of paramount importance. Additionally, (2) assumptions about the correlation structure of the regression weights are needed. We note that both can be formulated in a natural way in a latent variable model, in which both the interesting signal and the noise are mediated through the same latent factors. Under this assumption, the signal model then borrows strength from the noise model by encouraging similar effects on correlated targets. We introduce a hyperparameter for the latent signal-to-noise ratio which turns out to be important for modelling weak signals, and an ordered infinite-dimensional shrinkage prior that resolves the rotational unidentifiability in reduced-rank regression models. Simulations and prediction experiments with metabolite, gene expression, fMRI measurement, and macroeconomic time series data show that our model equals or exceeds the state-of-the-art performance and, in particular, outperforms the standard approach of assuming independent noise and signal models.

高次元データでは、観測された要因と観測されていない要因が同時に複数のターゲット変数に影響を及ぼすことで生じる構造化ノイズが、弱いシグナルを覆い隠すことで、モデリングに重大な課題を課します。したがって、(1)多重出力回帰における構造化ノイズの説明が最も重要です。さらに、(2)回帰重みの相関構造に関する仮定が必要です。潜在変数モデルでは、両方とも自然な方法で定式化できることに注目します。潜在変数モデルでは、関心のあるシグナルとノイズの両方が同じ潜在因子を介して媒介されます。この仮定の下では、シグナルモデルは、相関するターゲットに同様の効果をもたらすことで、ノイズモデルの強みを借ります。潜在的なシグナル対ノイズ比のハイパーパラメータを導入します。これは弱いシグナルをモデリングするために重要であることが判明しており、また、ランクを下げた回帰モデルの回転識別不能性を解決する順序付き無限次元収縮事前分布も導入します。代謝物、遺伝子発現、fMRI測定、マクロ経済時系列データを使用したシミュレーションと予測実験では、当社のモデルが最先端のパフォーマンスと同等かそれ以上であり、特に、独立したノイズモデルと信号モデルを想定する標準的なアプローチよりも優れていることが示されています。

Variational Dependent Multi-output Gaussian Process Dynamical Systems
変分依存多出力ガウス過程力学系

This paper presents a dependent multi-output Gaussian process (GP) for modeling complex dynamical systems. The outputs are dependent in this model, which is largely different from previous GP dynamical systems. We adopt convolved multi-output GPs to model the outputs, which are provided with a flexible multi-output covariance function. We adapt the variational inference method with inducing points for learning the model. Conjugate gradient based optimization is used to solve parameters involved by maximizing the variational lower bound of the marginal likelihood. The proposed model has superiority on modeling dynamical systems under the more reasonable assumption and the fully Bayesian learning framework. Further, it can be flexibly extended to handle regression problems. We evaluate the model on both synthetic and real-world data including motion capture data, traffic flow data and robot inverse dynamics data. Various evaluation methods are taken on the experiments to demonstrate the effectiveness of our model, and encouraging results are observed.

この論文では、複雑な動的システムをモデル化するための従属型マルチ出力ガウス過程(GP)を紹介します。このモデルでは出力が従属しており、以前のGP動的システムとは大きく異なります。出力をモデル化するために、柔軟なマルチ出力共分散関数を備えた畳み込みマルチ出力GPを採用します。モデルを学習するために、誘導点を使用した変分推論法を採用します。共役勾配ベースの最適化を使用して、周辺尤度の変分下限を最大化することにより、関係するパラメータを解決します。提案モデルは、より合理的な仮定と完全なベイズ学習フレームワークの下で動的システムをモデル化する上で優れています。さらに、回帰問題を処理するように柔軟に拡張できます。モーションキャプチャデータ、交通流データ、ロボット逆ダイナミクスデータなど、合成データと実世界のデータの両方でモデルを評価します。モデルの有効性を実証するために、さまざまな評価方法を実験で採用し、有望な結果が得られました。

Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels
シフト不変カーネルのための準モンテカルロ特徴マップ

We consider the problem of improving the efficiency of randomized Fourier feature maps to accelerate training and testing speed of kernel methods on large data sets. These approximate feature maps arise as Monte Carlo approximations to integral representations of shift-invariant kernel functions (e.g., Gaussian kernel). In this paper, we propose to use Quasi-Monte Carlo (QMC) approximations instead, where the relevant integrands are evaluated on a low-discrepancy sequence of points as opposed to random point sets as in the Monte Carlo approach. We derive a new discrepancy measure called box discrepancy based on theoretical characterizations of the integration error with respect to a given sequence. We then propose to learn QMC sequences adapted to our setting based on explicit box discrepancy minimization. Our theoretical analyses are complemented with empirical results that demonstrate the effectiveness of classical and adaptive QMC techniques for this problem.

私たちは、大規模なデータセットに対するカーネルメソッドのトレーニングとテストの速度を加速するために、ランダム化されたフーリエ特徴マップの効率を向上させる問題を検討します。これらの近似特徴マップは、シフト不変カーネル関数(ガウスカーネルなど)の積分表現に対するモンテカルロ近似として発生します。この論文では、モンテカルロ法のようなランダムな点セットとは対照的に、関連する被積分関数が点の低不一致シーケンスで評価される準モンテカルロ(QMC)近似を代わりに使用することを提案します。私たちは、特定のシーケンスに関する積分誤差の理論的特徴に基づいて、ボックス不一致と呼ばれる新しい不一致尺度を導き出します。次に、明示的なボックス不一致の最小化に基づいて、設定に適応したQMCシーケンスを学習することを提案します。私たちの理論的分析は、この問題に対する古典的および適応的なQMC技術の有効性を実証する経験的結果によって補完されます。
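As a rough illustration of the idea in this abstract, the sketch below builds Fourier features for a Gaussian kernel from a Halton low-discrepancy sequence instead of random draws. It is not the paper's code: the function names (`van_der_corput`, `halton`, `qmc_fourier_features`), the choice of the Halton sequence, and the bandwidth parameter `sigma` are assumptions made for illustration, and the adaptive sequences obtained by box discrepancy minimization are not implemented here.

```python
import math
from statistics import NormalDist

import numpy as np


def van_der_corput(n, base):
    """Return the n-th element (n >= 1) of the van der Corput sequence."""
    q, denom = 0.0, 1.0
    while n > 0:
        n, rem = divmod(n, base)
        denom *= base
        q += rem / denom
    return q


def halton(num_points, dim):
    """Return the first num_points Halton points in (0, 1)^dim."""
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
    if dim > len(primes):
        raise ValueError("this sketch supports at most 10 input dimensions")
    return np.array([[van_der_corput(i, p) for p in primes[:dim]]
                     for i in range(1, num_points + 1)])


def qmc_fourier_features(X, num_features, sigma=1.0):
    """Map X (n, d) to features Z with Z @ Z.T ~ exp(-||x - y||^2 / (2 sigma^2))."""
    d = X.shape[1]
    U = halton(num_features, d)
    # Push the low-discrepancy points through the inverse Gaussian CDF so the
    # frequencies follow the spectral density of the Gaussian kernel.
    W = np.vectorize(NormalDist().inv_cdf)(U) / sigma
    proj = X @ W.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / math.sqrt(num_features)
```

Replacing `halton` with uniform random draws recovers the usual Monte Carlo random Fourier features, so the two can be compared on the same data by swapping that single call.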

Volumetric Spanners: An Efficient Exploration Basis for Learning
体積スパナ:学習のための効率的な探索基底

Numerous learning problems that contain exploration, such as experiment design, multi-arm bandits, online routing, search result aggregation and many more, have been studied extensively in isolation. In this paper we consider a generic and efficiently computable method for action space exploration based on convex geometry. We define a novel geometric notion of an exploration mechanism with low variance called volumetric spanners, and give efficient algorithms to construct such spanners. We describe applications of this mechanism to the problem of optimal experiment design and the general framework for decision making under uncertainty of bandit linear optimization. For the latter we give an efficient and near-optimal regret algorithm over general convex sets. Previously such results were known only for specific convex sets, or under special conditions such as the existence of an efficient self-concordant barrier for the underlying set.

実験計画、多腕バンディット、オンラインルーティング、検索結果の集約など、探索を含む数多くの学習問題が、個別に広く研究されてきました。この論文では、凸幾何学に基づく行動空間探索のための汎用的で効率的に計算可能な方法を検討します。私たちは、体積スパナと呼ばれる低分散の探索メカニズムの新しい幾何学的概念を定義し、そのようなスパナを構築するための効率的なアルゴリズムを提供します。このメカニズムの、最適実験計画の問題、および不確実性下での意思決定の一般的なフレームワークであるバンディット線形最適化への応用について説明します。後者については、一般の凸集合上で効率的かつ最適に近いリグレットを達成するアルゴリズムを与えます。以前は、このような結果は特定の凸集合についてのみ、または基礎となる集合に対する効率的な自己整合障壁の存在などの特別な条件下でのみ知られていました。

Improving Structure MCMC for Bayesian Networks through Markov Blanket Resampling
マルコフブランケット再サンプリングによるベイジアンネットワークの構造 MCMC の改善

Algorithms for inferring the structure of Bayesian networks from data have become an increasingly popular method for uncovering the direct and indirect influences among variables in complex systems. A Bayesian approach to structure learning uses posterior probabilities to quantify the strength with which the data and prior knowledge jointly support each possible graph feature. Existing Markov Chain Monte Carlo (MCMC) algorithms for estimating these posterior probabilities are slow in mixing and convergence, especially for large networks. We present a novel Markov blanket resampling (MBR) scheme that intermittently reconstructs the Markov blanket of nodes, thus allowing the sampler to more effectively traverse low-probability regions between local maxima. As we can derive the complementary forward and backward directions of the MBR proposal distribution, the Metropolis-Hastings algorithm can be used to account for any asymmetries in these proposals. Experiments across a range of network sizes show that the MBR scheme outperforms other state-of-the-art algorithms, both in terms of learning performance and convergence rate. In particular, MBR achieves better learning performance than the other algorithms when the number of observations is relatively small and faster convergence when the number of variables in the network is large.

データからベイジアンネットワークの構造を推測するアルゴリズムは、複雑なシステム内の変数間の直接的および間接的な影響を明らかにする方法としてますます人気が高まっています。構造学習に対するベイジアンアプローチでは、事後確率を使用して、データと事前知識が共同で各グラフ機能をサポートする強さを定量化します。これらの事後確率を推定する既存のマルコフ連鎖モンテカルロ(MCMC)アルゴリズムは、特に大規模なネットワークの場合、混合と収束が遅くなります。私たちは、ノードのマルコフブランケットを断続的に再構築する新しいマルコフブランケット再サンプリング(MBR)スキームを紹介します。これにより、サンプラーは局所的最大値間の低確率領域をより効率的にトラバースできます。MBR提案分布の補完的な前方方向と後方方向を導出できるため、メトロポリス-ヘイスティングスアルゴリズムを使用して、これらの提案の非対称性を考慮することができます。さまざまなネットワークサイズでの実験により、MBR方式は学習パフォーマンスと収束率の両方において他の最先端のアルゴリズムよりも優れていることが示されています。特に、MBRは、観測数が比較的少ない場合に他のアルゴリズムよりも優れた学習パフォーマンスを実現し、ネットワーク内の変数の数が多い場合に収束が速くなります。

Revisiting the Nyström Method for Improved Large-scale Machine Learning
大規模機械学習の改善のためのナイストロム法の再検討

We reconsider randomized algorithms for the low-rank approximation of symmetric positive semi-definite (SPSD) matrices such as Laplacian and kernel matrices that arise in data analysis and machine learning applications. Our main results consist of an empirical evaluation of the performance quality and running time of sampling and projection methods on a diverse suite of SPSD matrices. Our results highlight complementary aspects of sampling versus projection methods; they characterize the effects of common data preprocessing steps on the performance of these algorithms; and they point to important differences between uniform sampling and nonuniform sampling methods based on leverage scores. In addition, our empirical results illustrate that existing theory is so weak that it does not provide even a qualitative guide to practice. Thus, we complement our empirical results with a suite of worst-case theoretical bounds for both random sampling and random projection methods. These bounds are qualitatively superior to existing bounds (e.g., improved additive-error bounds for spectral and Frobenius norm error and relative-error bounds for trace norm error), and they point to future directions to make these algorithms useful in even larger-scale machine learning applications.

私たちは、データ分析や機械学習アプリケーションで生じるラプラシアンやカーネル行列などの対称正半定値(SPSD)行列の低ランク近似のためのランダム化アルゴリズムを再検討します。主な結果は、さまざまなSPSD行列に対するサンプリング法と射影法のパフォーマンス品質と実行時間の経験的評価です。結果は、サンプリング法と射影法の相補的な側面を強調し、一般的なデータ前処理手順がこれらのアルゴリズムのパフォーマンスに与える影響を特徴付け、てこ比スコアに基づく均一サンプリング法と非均一サンプリング法の重要な違いを指摘しています。さらに、経験的結果は、既存の理論が非常に弱いため、実践のための定性的なガイドさえ提供していないことを示しています。したがって、ランダムサンプリング法とランダム射影法の両方に対する最悪のケースの理論的境界のセットで経験的結果を補完します。これらの境界は、既存の境界（スペクトルおよびフロベニウスノルム誤差の改良された加法誤差境界やトレースノルム誤差の相対誤差境界など）よりも質的に優れており、これらのアルゴリズムをさらに大規模な機械学習アプリケーションで役立つものにするための将来の方向性を示しています。
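For readers unfamiliar with the Nyström method that this abstract evaluates, the following is a minimal sketch of its uniform-sampling variant, which approximates an SPSD matrix $K$ by $C W^{+} C^{\top}$, where $C$ holds sampled columns of $K$ and $W$ is the corresponding principal submatrix. The function name `nystrom_approximation`, the `rng` argument, and the pseudo-inverse cutoff `1e-10` are assumptions for illustration; the leverage-score sampling and projection methods compared in the paper are not shown.

```python
import numpy as np


def nystrom_approximation(K, num_columns, rng):
    """Uniform-sampling Nystrom approximation of an SPSD matrix K."""
    n = K.shape[0]
    # Sample column indices uniformly at random without replacement.
    idx = rng.choice(n, size=num_columns, replace=False)
    C = K[:, idx]               # sampled columns, shape (n, num_columns)
    W = K[np.ix_(idx, idx)]     # intersection block, shape (num_columns, num_columns)
    # The pseudo-inverse handles a rank-deficient W; the cutoff is an assumed value.
    return C @ np.linalg.pinv(W, rcond=1e-10) @ C.T
```

When $K$ has rank $r$ and the sampled block $W$ also has rank $r$, the approximation reproduces $K$ up to rounding error; for general matrices the error depends on which columns are sampled, which is the question the paper studies.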

A Network That Learns Strassen Multiplication
シュトラッセン乗算を学習するネットワーク

We study neural networks whose only non-linear components are multipliers, to test a new training rule in a context where the precise representation of data is paramount. These networks are challenged to discover the rules of matrix multiplication, given many examples. By limiting the number of multipliers, the network is forced to discover the Strassen multiplication rules. This is the mathematical equivalent of finding low rank decompositions of the $n\times n$ matrix multiplication tensor, $M_n$. We train these networks with the conservative learning rule, which makes minimal changes to the weights so as to give the correct output for each input at the time the input-output pair is received. Conservative learning needs a few thousand examples to find the rank 7 decomposition of $M_2$, and $10^5$ for the rank 23 decomposition of $M_3$ (the lowest known). High precision is critical, especially for $M_3$, to discriminate between true decompositions and “border approximations”.

私たちは、唯一の非線形成分が乗算器であるニューラルネットワークを研究し、データの正確な表現が最優先されるコンテキストで新しい学習ルールをテストします。これらのネットワークは、多くの例を与えられて、行列乗算のルールを発見するという課題に取り組みます。乗算器の数を制限することで、ネットワークはStrassen乗算ルールを発見せざるを得なくなります。これは、$n\times n$行列乗算テンソル$M_n$の低ランク分解を見つけることと数学的に同等です。これらのネットワークは、入力と出力のペアが受信された時点で各入力に対して正しい出力が得られるように、重みの変更を最小限に抑える保守的な学習ルールを使用して学習します。保守的な学習では、$M_2$のランク7の分解を見つけるために数千の例が必要であり、$M_3$のランク23の分解(知られている最低ランク)には$10^5$個の例が必要です。特に$M_3$の場合、真の分解と「境界近似」を区別するために、高い精度が重要です。
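The rank 7 decomposition of $M_2$ mentioned in this abstract corresponds to multiplying two $2\times 2$ matrices with seven scalar multiplications instead of eight. The sketch below writes out the classical Strassen formulas and is only meant to show what such a decomposition looks like. It is not the network or the conservative learning rule from the paper, and the function name `strassen_2x2` is an assumption for illustration.

```python
import numpy as np


def strassen_2x2(A, B):
    """Multiply two 2x2 matrices using Strassen's seven products."""
    a11, a12, a21, a22 = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    b11, b12, b21, b22 = B[0, 0], B[0, 1], B[1, 0], B[1, 1]

    # The seven multiplications; everything else is addition or subtraction.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)

    # Recombine the products into the four entries of A @ B.
    return np.array([
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ])
```

Each product is a linear form in the entries of `A` times a linear form in the entries of `B`, which is the structure a network with seven multiplier units is restricted to.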

Equivalence of Graphical Lasso and Thresholding for Sparse Graphs
スパースグラフに対するグラフィカルラッソと閾値処理の等価性

This paper is concerned with the problem of finding a sparse graph capturing the conditional dependence between the entries of a Gaussian random vector, where the only available information is a sample correlation matrix. A popular approach to address this problem is the graphical lasso technique, which employs a sparsity-promoting regularization term. This paper derives a simple condition under which the computationally expensive graphical lasso behaves the same as the simple heuristic method of thresholding. This condition depends only on the solution of graphical lasso and makes no direct use of the sample correlation matrix or the regularization coefficient. It is proved that this condition is always satisfied if the solution of graphical lasso is close to its first-order Taylor approximation or equivalently the regularization term is relatively large. This condition is tested on several random problems, and it is shown that graphical lasso and the thresholding method lead to highly similar results in the case where a sparse graph is sought. We also conduct two case studies on brain connectivity networks of twenty subjects based on fMRI data and the topology identification of electrical circuits to support the findings of this work on the similarity of graphical lasso and thresholding.

この論文は、ガウスランダムベクトルのエントリ間の条件付き依存関係を捉えるスパースグラフを見つける問題を扱います。ここで利用可能な情報はサンプル相関行列のみです。この問題に対処するための一般的なアプローチは、スパース性を促進する正則化項を使用するグラフィカルラッソ手法です。この論文では、計算コストの高いグラフィカルラッソが、単純なヒューリスティック手法である閾値処理と同じように動作するための単純な条件を導出します。この条件はグラフィカルラッソの解のみに依存し、サンプル相関行列や正則化係数を直接使用しません。グラフィカルラッソの解がその1次テイラー近似に近い場合、または同等に正則化項が比較的大きい場合、この条件は常に満たされることが証明されています。この条件はいくつかのランダム問題でテストされ、スパースグラフが求められる場合にはグラフィカルラッソと閾値処理が非常によく似た結果をもたらすことが示されています。また、グラフィカルラッソと閾値処理の類似性に関する本研究の結果を裏付けるために、fMRIデータに基づく20人の被験者の脳接続ネットワークと、電気回路のトポロジー同定に関する2つのケーススタディを実施します。
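The "simple heuristic method of thresholding" referred to in this abstract can be sketched as follows: compute the sample correlation matrix and keep an edge wherever the absolute correlation exceeds a threshold. This shows only the thresholding side under assumed names (`threshold_graph`, `lam`); the graphical lasso itself and the paper's equivalence condition are not implemented here.

```python
import numpy as np


def threshold_graph(samples, lam):
    """Return a boolean adjacency matrix from thresholded sample correlations.

    samples: array of shape (num_samples, num_variables)
    lam: threshold on the absolute correlation (assumed to play the role
         of the regularization level)
    """
    corr = np.corrcoef(samples, rowvar=False)
    adjacency = np.abs(corr) > lam
    # A variable is not considered connected to itself.
    np.fill_diagonal(adjacency, False)
    return adjacency
```

The result is a symmetric matrix whose `True` entries are the edges of the estimated graph; a larger `lam` gives a sparser graph.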

The LRP Toolbox for Artificial Neural Networks
人工ニューラルネットワークのためのLRPツールボックス

The Layer-wise Relevance Propagation (LRP) algorithm explains a classifier’s prediction specific to a given data point by attributing relevance scores to important components of the input by using the topology of the learned model itself. With the LRP Toolbox we provide platform-agnostic implementations for explaining the predictions of pre-trained state-of-the-art Caffe networks and stand-alone implementations for fully connected Neural Network models. The implementations for Matlab and Python shall serve as a playing field to familiarize oneself with the LRP algorithm and are implemented with readability and transparency in mind. Models and data can be imported and exported using raw text formats, Matlab’s .mat files and the .npy format for numpy or plain text.

レイヤーワイズ関連性伝播(LRP)アルゴリズムは、学習したモデル自体のトポロジを使用して、関連性スコアを入力の重要なコンポーネントに帰属させることで、特定のデータポイントに固有の分類器の予測を説明します。LRP Toolboxでは、事前学習済みの最先端のCaffeネットワークの予測を説明するためのプラットフォームに依存しない実装と、全結合ニューラルネットワークモデルのスタンドアロン実装を提供します。MatlabとPythonの実装は、LRPアルゴリズムに慣れるための練習の場として機能するもので、可読性と透明性を念頭に置いて実装されています。モデルとデータは、生のテキスト形式、Matlabの.matファイル、numpy用の.npy形式、またはプレーンテキストを使用してインポートおよびエクスポートできます。

Fused Lasso Approach in Regression Coefficients Clustering — Learning Parameter Heterogeneity in Data Integration
回帰係数クラスタリングにおける融合Lassoアプローチ : データ統合におけるパラメータの不均一性の学習

As data sets of related studies become more easily accessible, combining data sets of similar studies is often undertaken in practice to achieve a larger sample size and higher power. A major challenge arising from data integration pertains to data heterogeneity in terms of study population, study design, or study coordination. Ignoring such heterogeneity in data analysis may result in biased estimation and misleading inference. Traditional techniques of remedy to data heterogeneity include the use of interactions and random effects, which are inferior to achieving desirable statistical power or providing a meaningful interpretation, especially when a large number of smaller data sets are combined. In this paper, we propose a regularized fusion method that allows us to identify and merge inter-study homogeneous parameter clusters in regression analysis, without the use of hypothesis testing approach. Using the fused lasso, we establish a computationally efficient procedure to deal with large-scale integrated data. Incorporating the estimated parameter ordering in the fused lasso facilitates computing speed with no loss of statistical power. We conduct extensive simulation studies and provide an application example to demonstrate the performance of the new method with a comparison to the conventional methods.

関連研究のデータセットへのアクセスが容易になったため、サンプルサイズを拡大して検出力を高めるために、類似研究のデータセットを組み合わせることが実際に行われることが多くなりました。データ統合から生じる主な課題は、研究対象集団、研究デザイン、または研究調整に関するデータの異質性です。データ分析でこのような異質性を無視すると、偏った推定や誤った推論につながる可能性があります。データの異質性を改善するための従来の手法には、相互作用とランダム効果の使用が含まれますが、これらは、特に多数の小さなデータセットを組み合わせる場合は、望ましい統計的検出力の達成や意味のある解釈の提供には劣ります。この論文では、回帰分析で仮説検定アプローチを使用せずに、研究間の同質のパラメータークラスターを識別してマージできる正規化された融合方法を提案します。融合Lassoを使用して、大規模な統合データを処理する計算効率の高い手順を確立します。推定されたパラメーター順序を融合Lassoに組み込むと、統計的検出力を失うことなく計算速度が向上します。私たちは広範囲にわたるシミュレーション研究を実施し、従来の方法と比較して新しい方法のパフォーマンスを実証するアプリケーション例を提供します。

Decrypting “Cryptogenic” Epilepsy: Semi-supervised Hierarchical Conditional Random Fields For Detecting Cortical Lesions In MRI-Negative Patients
「潜在性」てんかんの解読:MRI陰性患者の皮質病変を検出するための半教師付き階層的条件付きランダムフィールド

Focal cortical dysplasia (FCD) is the most common cause of pediatric epilepsy and the third most common cause in adults with treatment-resistant epilepsy. Surgical resection of the lesion is the most effective treatment to stop seizures. Technical advances in MRI have revolutionized the diagnosis of FCD, leading to high success rates for resective surgery. However, 45% of histologically confirmed FCD patients have normal MRIs (MRI-negative). Without a visible lesion, the success rate of surgery drops from 66% to 29%. In this work, we cast the problem of detecting potential FCD lesions using MRI scans of MRI-negative patients in an image segmentation framework based on hierarchical conditional random fields (HCRF). We use surface based morphometry to model the cortical surface as a two-dimensional surface which is then segmented at multiple scales to extract superpixels of different sizes. Each superpixel is assigned an outlier score by comparing it to a control population. The lesion is detected by fusing the outlier probabilities across multiple scales using a tree-structured HCRF. The proposed method achieves a higher detection rate, with superior recall and precision on a sample of twenty MRI-negative FCD patients as compared to a baseline across four morphological features and their combinations.

局所性皮質異形成（FCD）は小児てんかんの最も一般的な原因であり、治療抵抗性てんかんの成人では3番目に多い原因です。病変の外科的切除は発作を止めるための最も効果的な治療法です。MRIの技術的進歩はFCDの診断に革命をもたらし、切除手術の成功率を高めました。しかし、組織学的に確認されたFCD患者の45％はMRIが正常（MRI陰性）です。目に見える病変がない場合、手術の成功率は66％から29％に低下します。この研究では、MRI陰性患者のMRIスキャンを使用して潜在的なFCD病変を検出する問題を、階層的条件付き確率場（HCRF）に基づく画像セグメンテーションの枠組みで定式化します。表面ベースの形態計測を使用して皮質表面を2次元表面としてモデル化し、それを複数のスケールでセグメント化して、さまざまなサイズのスーパーピクセルを抽出します。各スーパーピクセルには、対照群と比較することで外れ値スコアが割り当てられます。病変は、木構造のHCRFを使用して複数のスケールにわたる外れ値確率を融合することで検出されます。提案手法は、20人のMRI陰性FCD患者のサンプルにおいて、4つの形態学的特徴とその組み合わせのすべてでベースラインよりも高い検出率を達成し、再現率と適合率も優れています。

Minimax Adaptive Estimation of Nonparametric Hidden Markov Models
ノンパラメトリック隠れマルコフモデルのミニマックス適応推定

We consider stationary hidden Markov models with finite state space and nonparametric modeling of the emission distributions. It has remained unknown until very recently that such models are identifiable. In this paper, we propose a new penalized least-squares estimator for the emission distributions which is statistically optimal and practically tractable. We prove a non-asymptotic oracle inequality for our nonparametric estimator of the emission distributions. A consequence is that this new estimator is rate minimax adaptive up to a logarithmic term. Our methodology is based on projections of the emission distributions onto nested subspaces of increasing complexity. The popular spectral estimators are unable to achieve the optimal rate but may be used as initial points in our procedure. Simulations are given that show the improvement obtained when applying the least-squares minimization consecutively to the spectral estimation.

私たちは、有限状態空間を持ち、放出分布をノンパラメトリックにモデル化した定常隠れマルコフモデルを検討します。そのようなモデルが識別可能であることは、ごく最近まで知られていませんでした。この論文では、統計的に最適で実用的に扱いやすい、放出分布に対する新しいペナルティ付き最小二乗推定量を提案します。私たちは、放出分布のノンパラメトリック推定量に対する非漸近的オラクル不等式を証明します。その結果、この新しい推定量は、対数項を除いてレートの意味でミニマックス適応的になります。私たちの方法論は、複雑さが増していく入れ子状の部分空間への放出分布の射影に基づいています。広く使われているスペクトル推定量は最適なレートを達成できませんが、私たちの手順の初期点として使用できます。スペクトル推定に続けて最小二乗最小化を適用したときに得られる改善を示すシミュレーションを示します。

Are Random Forests Truly the Best Classifiers?
ランダムフォレストは本当に最高の分類子ですか?

The JMLR study “Do we need hundreds of classifiers to solve real world classification problems?” benchmarks 179 classifiers in 17 families on 121 data sets from the UCI repository and claims that “the random forest is clearly the best family of classifier”. In this response, we show that the study’s results are biased by the lack of a held-out test set and the exclusion of trials with errors. Further, the study’s own statistical tests indicate that random forests do not have significantly higher percent accuracy than support vector machines and neural networks, calling into question the conclusion that random forests are the best classifiers.

JMLRの研究「現実世界の分類問題を解決するために何百もの分類器が必要か?」は、UCIリポジトリの121のデータセットで17ファミリーの179の分類器をベンチマークし、「ランダムフォレストは明らかに最高の分類器ファミリーである」と主張しています。この回答では、その研究の結果が、ホールドアウトされたテストセットの欠如とエラーのある試行の除外によって偏っていることを示します。さらに、その研究自身の統計的検定では、ランダムフォレストの正解率はサポートベクターマシンやニューラルネットワークよりも有意に高くはないことが示されており、ランダムフォレストが最良の分類器であるという結論に疑問を投げかけています。

Monotonic Calibrated Interpolated Look-Up Tables
単調なキャリブレーション付き補間ルックアップテーブル

Real-world machine learning applications may have requirements beyond accuracy, such as fast evaluation times and interpretability. In particular, guaranteed monotonicity of the learned function with respect to some of the inputs can be critical for user confidence. We propose meeting these goals for low-dimensional machine learning problems by learning flexible, monotonic functions using calibrated interpolated look-up tables. We extend the structural risk minimization framework of lattice regression to monotonic functions by adding linear inequality constraints. In addition, we propose jointly learning interpretable calibrations of each feature to normalize continuous features and handle categorical or missing data, at the cost of making the objective non-convex. We address large-scale learning through parallelization, mini-batching, and random sampling of additive regularizer terms. Case studies on real-world problems with up to sixteen features and up to hundreds of millions of training samples demonstrate the proposed monotonic functions can achieve state-of-the-art accuracy in practice while providing greater transparency to users.

現実世界の機械学習アプリケーションには、高速な評価時間や解釈可能性など、精度以外の要件がある場合があります。特に、一部の入力に関して学習した関数の単調性が保証されていることは、ユーザーの信頼にとって重要になり得ます。低次元の機械学習問題でこれらの目標を達成するために、キャリブレーション付きの補間ルックアップテーブルを使用して柔軟な単調関数を学習することを提案します。線形不等式制約を追加することで、格子回帰の構造的リスク最小化フレームワークを単調関数に拡張します。さらに、目的関数が非凸になるという代償を払って、各特徴量の解釈可能なキャリブレーションを同時に学習し、連続特徴量を正規化するとともに、カテゴリデータや欠損データを処理することを提案します。並列化、ミニバッチ処理、および加法的な正則化項のランダムサンプリングによって、大規模な学習に対処します。最大16個の特徴量と最大数億のトレーニングサンプルを含む現実世界の問題のケーススタディにより、提案する単調関数が実際に最先端の精度を達成しつつ、ユーザーに対してより高い透明性を提供できることが実証されています。
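
The idea of an interpolated look-up table with monotonicity enforced through linear inequality constraints can be illustrated in one dimension. The following is a minimal sketch, not the paper's implementation: the function names `interpolate` and `is_monotonic`, the keypoints, and the table values are all hypothetical, and the paper works with multi-dimensional lattices and learned calibrators rather than this single-feature table.

```python
def interpolate(keypoints, values, x):
    """Piecewise-linear look-up: interpolate between the two keypoints around x.

    Assumes keypoints are strictly increasing and there are at least two of them.
    """
    x = min(max(x, keypoints[0]), keypoints[-1])  # clip to the table's range
    for k in range(len(keypoints) - 1):
        left, right = keypoints[k], keypoints[k + 1]
        if x <= right:
            weight = (x - left) / (right - left)
            return (1.0 - weight) * values[k] + weight * values[k + 1]


def is_monotonic(values):
    """Check the linear inequality constraints values[k] <= values[k + 1]."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical 1-D table: three keypoints and their look-up values.
keypoints = [0.0, 1.0, 2.0]
values = [0.0, 0.2, 1.0]

print(is_monotonic(values))                 # constraints hold for this table
print(interpolate(keypoints, values, 1.5))  # halfway between 0.2 and 1.0
```

In one dimension, requiring each table value to be no larger than the next is enough to make the interpolated function non-decreasing, which is why the constraint can be written as a set of linear inequalities on the parameters.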

Distribution-Matching Embedding for Visual Domain Adaptation
ビジュアルドメインアダプテーションのための分布マッチング埋め込み

Domain-invariant representations are key to addressing the domain shift problem where the training and test examples follow different distributions. Existing techniques that have attempted to match the distributions of the source and target domains typically compare these distributions in the original feature space. This space, however, may not be directly suitable for such a comparison, since some of the features may have been distorted by the domain shift, or may be domain specific. In this paper, we introduce a Distribution-Matching Embedding approach: An unsupervised domain adaptation method that overcomes this issue by mapping the data to a latent space where the distance between the empirical distributions of the source and target examples is minimized. In other words, we seek to extract the information that is invariant across the source and target data. In particular, we study two different distances to compare the source and target distributions: the Maximum Mean Discrepancy and the Hellinger distance. Furthermore, we show that our approach allows us to learn either a linear embedding, or a nonlinear one. We demonstrate the benefits of our approach on the tasks of visual object recognition, text categorization, and WiFi localization.

ドメイン不変表現は、トレーニング例とテスト例が異なる分布に従うドメインシフト問題に対処するための鍵です。ソースドメインとターゲットドメインの分布を一致させようとする既存の手法では、通常、これらの分布を元の特徴空間で比較します。ただし、この空間は、特徴の一部がドメインシフトによって歪んでいるか、ドメイン固有である可能性があるため、このような比較には直接適さない場合があります。この論文では、分布一致埋め込みアプローチを紹介します。これは、ソース例とターゲット例の経験的分布間の距離が最小化される潜在空間にデータをマッピングすることで、この問題を克服する教師なしのドメイン適応方法です。言い換えると、ソースデータとターゲットデータ全体で不変の情報を抽出することです。特に、ソース分布とターゲット分布を比較するために、最大平均不一致とヘリンガー距離という2つの異なる距離を検討します。さらに、このアプローチにより、線形埋め込みと非線形埋め込みのいずれかを学習できることを示します。私たちは、視覚的オブジェクト認識、テキスト分類、WiFi位置特定などのタスクにおける私たちのアプローチの利点を実証します。
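
One of the two distances named in the abstract, the Maximum Mean Discrepancy (MMD), can be estimated directly from samples. The sketch below computes the standard biased (V-statistic) estimate of the squared MMD with a Gaussian kernel. It only illustrates the distance being minimized, not the embedding method itself; the function names, the bandwidth, and the toy samples are assumptions made for illustration.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Hypothetical toy samples: the "target" sample is a shifted copy of the source.
source = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
shifted = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.1]]

print(mmd_squared(source, source))   # identical samples give zero
print(mmd_squared(source, shifted))  # shifted samples give a clearly positive value
```

An embedding that brings the two samples closer together in the latent space would reduce the second value toward the first.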

Interleaved Text/Image Deep Mining on a Large-Scale Radiology Database for Automated Image Interpretation
自動画像解釈のための大規模放射線データベースにおけるインターリーブテキスト/画像深層マイニング

Despite tremendous progress in computer vision, there has not been an attempt to apply machine learning on very large-scale medical image databases. We present an interleaved text/image deep learning system to extract and mine the semantic interactions of radiology images and reports from a national research hospital’s Picture Archiving and Communication System. With natural language processing, we mine a collection of $\sim$216K representative two-dimensional images selected by clinicians for diagnostic reference and match the images with their descriptions in an automated manner. We then employ a weakly supervised approach using all of our available data to build models for generating approximate interpretations of patient images. Finally, we demonstrate a more strictly supervised approach to detect the presence and absence of a number of frequent disease types, providing more specific interpretations of patient scans. A relatively small amount of data is used for this part, due to the challenge in gathering quality labels from large raw text data. Our work shows the feasibility of large-scale learning and prediction in electronic patient records available in most modern clinical institutions. It also demonstrates the trade-offs to consider in designing machine learning systems for analyzing large medical data.

コンピュータビジョンの驚異的な進歩にもかかわらず、非常に大規模な医療画像データベースに機械学習を適用する試みはこれまで行われてきませんでした。この研究では、国立研究病院の画像アーカイブおよび通信システムから放射線画像とレポートの意味的相互作用を抽出してマイニングするための、テキスト/画像のインターリーブ型ディープラーニングシステムを紹介します。自然言語処理を使用して、診断参照用に臨床医が選択した$\sim$216Kの代表的な2次元画像のコレクションをマイニングし、画像とその説明を自動的に一致させます。次に、利用可能なすべてのデータを使用して弱い教師ありアプローチを採用し、患者画像のおおよその解釈を生成するモデルを構築します。最後に、頻繁な疾患の種類の有無を検出するためのより厳密な教師ありアプローチを示し、患者スキャンのより具体的な解釈を提供します。この部分では、大量の生のテキストデータから品質ラベルを収集することが難しいため、比較的少量のデータしか使用されていません。本研究は、ほとんどの現代の臨床施設で利用可能な電子患者記録における大規模な学習と予測の実現可能性を示しています。また、大規模な医療データを分析するための機械学習システムを設計する際に考慮すべきトレードオフについても説明します。

Multi-Task Learning for Straggler Avoiding Predictive Job Scheduling
ストラグラーを回避する予測的ジョブスケジューリングのためのマルチタスク学習

Parallel processing frameworks (Dean and Ghemawat, 2004) accelerate jobs by breaking them into tasks that execute in parallel. However, slow running or straggler tasks can run up to 8 times slower than the median task on a production cluster (Ananthanarayanan et al., 2013), leading to delayed job completion and inefficient use of resources. Existing straggler mitigation techniques wait to detect stragglers and then relaunch them, delaying straggler detection and wasting resources. We built Wrangler (Yadwadkar et al., 2014), a system that predicts when stragglers are going to occur and makes scheduling decisions to avoid such situations. To capture node and workload variability, Wrangler built separate models for every node and workload, requiring the time-consuming collection of substantial training data. In this paper, we propose multi-task learning formulations that share information between the various models, allowing us to use less training data and bring training time down from 4 hours to 40 minutes. Unlike naive multi-task learning formulations, our formulations capture the shared structure in our data, improving generalization performance on limited data. Finally, we extend these formulations using group sparsity inducing norms to automatically discover the similarities between tasks and improve interpretability.

並列処理フレームワーク(DeanおよびGhemawat、2004)は、ジョブを並列に実行されるタスクに分割することでジョブを高速化します。ただし、実行の遅いタスク、すなわちストラグラータスクは、実稼働クラスター上の中央値のタスクよりも最大8倍遅く実行される可能性があり(Ananthanarayanan他、2013)、ジョブの完了が遅れ、リソースが非効率に使用されます。既存のストラグラー軽減手法では、ストラグラーが検出されるまで待機してから再起動するため、ストラグラーの検出が遅れ、リソースが浪費されます。私たちは、ストラグラーが発生するタイミングを予測し、そのような状況を回避するようにスケジューリングを決定するシステムであるWrangler (Yadwadkar他、2014)を構築しました。ノードとワークロードの変動を捉えるために、Wranglerはノードとワークロードごとに個別のモデルを構築しており、大量のトレーニングデータを時間をかけて収集する必要がありました。この論文では、さまざまなモデル間で情報を共有するマルチタスク学習の定式化を提案します。これにより、使用するトレーニングデータが少なくなり、トレーニング時間が4時間から40分に短縮されます。単純なマルチタスク学習の定式化とは異なり、私たちの定式化はデータ内の共有構造を捉え、限られたデータでの汎化性能を向上させます。最後に、グループスパース性を誘導するノルムを使用してこれらの定式化を拡張し、タスク間の類似性を自動的に発見して解釈可能性を向上させます。

Trend Filtering on Graphs
グラフのトレンドフィルタリング

We introduce a family of adaptive estimators on graphs, based on penalizing the $\ell_1$ norm of discrete graph differences. This generalizes the idea of trend filtering (Kim et al., 2009; Tibshirani, 2014), used for univariate nonparametric regression, to graphs. Analogous to the univariate case, graph trend filtering exhibits a level of local adaptivity unmatched by the usual $\ell_2$-based graph smoothers. It is also defined by a convex minimization problem that is readily solved (e.g., by fast ADMM or Newton algorithms). We demonstrate the merits of graph trend filtering through both examples and theory.

私たちは、離散的なグラフ差分の$\ell_1$ノルムにペナルティを課すことに基づく、グラフ上の適応的推定量のファミリーを導入します。これは、単変量ノンパラメトリック回帰に用いられるトレンドフィルタリング(Kimら, 2009; Tibshirani, 2014)の考え方をグラフに一般化したものです。単変量の場合と同様に、グラフトレンドフィルタリングは、通常の$\ell_2$ベースのグラフスムーザーでは及ばないレベルの局所適応性を示します。また、容易に解ける凸最小化問題によって定義されます(たとえば、高速なADMMやニュートン法のアルゴリズムによって)。グラフトレンドフィルタリングの利点を、例と理論の両方を通じて示します。
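
The penalty described above, the $\ell_1$ norm of discrete graph differences, can be computed from an oriented edge incidence matrix $D$. The sketch below evaluates it for the two lowest orders, assuming the usual construction in which the order-0 difference is $D\beta$ and the order-1 difference is $D^\top D\beta$ (the graph Laplacian applied to $\beta$). It only evaluates the penalty and does not solve the convex minimization problem; the helper names and the chain-graph signal are hypothetical.

```python
def incidence_rows(edges, num_nodes):
    """Oriented incidence matrix D: one row per edge, -1 at the tail, +1 at the head."""
    rows = []
    for i, j in edges:
        row = [0.0] * num_nodes
        row[i], row[j] = -1.0, 1.0
        rows.append(row)
    return rows


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def gtf_penalty(edges, beta, order):
    """l1 norm of graph differences: D beta (order 0) or D^T D beta (order 1)."""
    if order not in (0, 1):
        raise ValueError("this sketch only covers orders 0 and 1")
    D = incidence_rows(edges, len(beta))
    diff = matvec(D, beta)
    if order == 1:
        diff = matvec(transpose(D), diff)
    return sum(abs(v) for v in diff)


# Hypothetical example: a chain graph with a piecewise-constant signal.
chain = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
signal = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]

print(gtf_penalty(chain, signal, order=0))  # only the single jump of size 4 contributes
print(gtf_penalty(chain, signal, order=1))
```

Because the order-0 penalty charges only for edges where the signal changes, minimizing it favors piecewise-constant fits that still allow sharp jumps, which is the local adaptivity the abstract contrasts with $\ell_2$-based smoothers.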

e-PAL: An Active Learning Approach to the Multi-Objective Optimization Problem
e-PAL:多目的最適化問題へのアクティブラーニングアプローチ

In many fields one encounters the challenge of identifying out of a pool of possible designs those that simultaneously optimize multiple objectives. In many applications an exhaustive search for the Pareto-optimal set is infeasible. To address this challenge, we propose the $\epsilon$-Pareto Active Learning ($\epsilon$-PAL) algorithm which adaptively samples the design space to predict a set of Pareto-optimal solutions that cover the true Pareto front of the design space with some granularity regulated by a parameter $\epsilon$. Key features of $\epsilon$-PAL include (1) modeling the objectives as draws from a Gaussian process distribution to capture structure and accommodate noisy evaluation; (2) a method to carefully choose the next design to evaluate to maximize progress; and (3) the ability to control prediction accuracy and sampling cost. We provide theoretical bounds on $\epsilon$-PAL’s sampling cost required to achieve a desired accuracy. Further, we perform an experimental evaluation on three real-world data sets that demonstrate $\epsilon$-PAL’s effectiveness; in comparison to the state-of-the-art active learning algorithm PAL, $\epsilon$-PAL reduces the amount of computations and the number of samples from the design space required to meet the user’s desired level of accuracy. In addition, we show that $\epsilon$-PAL improves significantly over a state-of-the-art multi-objective optimization method, saving in most cases 30\% to 70\% of evaluations to achieve the same accuracy.

多くの分野で、多数の可能な設計の中から複数の目的を同時に最適化する設計を特定するという課題に直面します。多くのアプリケーションでは、パレート最適セットを徹底的に検索することは不可能です。この課題に対処するために、我々は$\epsilon$-Pareto Active Learning ($\epsilon$-PAL)アルゴリズムを提案します。このアルゴリズムは、設計空間を適応的にサンプリングして、パラメータ$\epsilon$によって制御される粒度で設計空間の真のパレートフロントをカバーするパレート最適ソリューションのセットを予測します。$\epsilon$-PALの主な機能には、(1)構造を捉えてノイズの多い評価に対応するために、ガウス過程分布から抽出した値として目的をモデル化すること、(2)進捗を最大化するために評価する次の設計を慎重に選択する方法、(3)予測精度とサンプリングコストを制御する機能などがあります。私たちは、望ましい精度を達成するために必要な$\epsilon$-PALのサンプリングコストの理論的な境界を示します。さらに、$\epsilon$-PALの有効性を実証する3つの実際のデータセットで実験的評価を実行します。最先端のアクティブラーニングアルゴリズムPALと比較すると、$\epsilon$-PALは、ユーザーが望む精度レベルを満たすために必要な計算量と設計空間からのサンプル数を削減します。さらに、$\epsilon$-PALは最先端の多目的最適化手法よりも大幅に改善され、ほとんどの場合、同じ精度を達成するために30\%から70\%の評価を節約できることを示します。
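
For reference, the Pareto-optimal set that the abstract refers to can be computed by exhaustive comparison when every design has already been evaluated. The sketch below shows that brute-force baseline for maximization. It is not $\epsilon$-PAL, which avoids evaluating every design; the function names and the toy objective values are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and strictly better in one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(designs):
    """Exhaustively keep the designs that no other design dominates (maximization)."""
    return [d for d in designs
            if not any(dominates(other, d) for other in designs)]


# Hypothetical designs, each scored on two objectives to be maximized.
designs = [(1.0, 5.0), (2.0, 4.0), (3.0, 3.0), (2.0, 2.0), (1.0, 1.0), (4.0, 1.0)]

print(pareto_front(designs))
```

The comparison is quadratic in the number of designs and needs every objective value up front, which is the cost an adaptive sampling scheme tries to avoid.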

Bayesian Leave-One-Out Cross-Validation Approximations for Gaussian Latent Variable Models
ガウス潜在変数モデルのベイズ一つ抜き交差検証近似

The future predictive performance of a Bayesian model can be estimated using Bayesian cross-validation. In this article, we consider Gaussian latent variable models where the integration over the latent values is approximated using the Laplace method or expectation propagation (EP). We study the properties of several Bayesian leave-one-out (LOO) cross-validation approximations that in most cases can be computed with a small additional cost after forming the posterior approximation given the full data. Our main objective is to assess the accuracy of the approximative LOO cross-validation estimators. That is, for each method (Laplace and EP) we compare the approximate fast computation with the exact brute force LOO computation. Secondarily, we evaluate the accuracy of the Laplace and EP approximations themselves against a ground truth established through extensive Markov chain Monte Carlo simulation. Our empirical results show that the approach based upon a Gaussian approximation to the LOO marginal distribution (the so-called cavity distribution) gives the most accurate and reliable results among the fast methods.

ベイジアンモデルの将来予測性能は、ベイジアンクロスバリデーションを使用して推定できます。この記事では、潜在値の積分がラプラス法または期待伝播法(EP)を使用して近似されるガウス潜在変数モデルを検討します。私たちは、ほとんどの場合、完全なデータに基づいて事後近似を形成した後、わずかな追加コストで計算できるいくつかのベイジアンLeave-One-Out (LOO)クロスバリデーション近似の特性を調べます。私たちの主な目的は、近似LOOクロスバリデーション推定値の精度を評価することです。つまり、各方法(ラプラス法とEP)について、近似高速計算と正確なブルートフォースLOO計算を比較します。次に、広範なマルコフ連鎖モンテカルロシミュレーションによって確立されたグラウンドトゥルースに対して、ラプラス近似とEP近似自体の精度を評価します。私たちの実験結果によると、LOO周辺分布(いわゆるキャビティ分布)へのガウス近似に基づくアプローチは、高速な方法の中で最も正確で信頼性の高い結果をもたらすことが示されています。

Spectral Methods Meet EM: A Provably Optimal Algorithm for Crowdsourcing
スペクトル法とEMの出会い:クラウドソーシングのための証明可能な最適アルゴリズム

Crowdsourcing is a popular paradigm for effectively collecting labels at low cost. The Dawid-Skene estimator has been widely used for inferring the true labels from the noisy labels provided by non-expert crowdsourcing workers. However, since the estimator maximizes a non-convex log-likelihood function, it is hard to theoretically justify its performance. In this paper, we propose a two-stage efficient algorithm for multi-class crowd labeling problems. The first stage uses the spectral method to obtain an initial estimate of parameters. Then the second stage refines the estimation by optimizing the objective function of the Dawid-Skene estimator via the EM algorithm. We show that our algorithm achieves the optimal convergence rate up to a logarithmic factor. We conduct extensive experiments on synthetic and real datasets. Experimental results demonstrate that the proposed algorithm is comparable to the most accurate empirical approach, while outperforming several other recently proposed methods.

クラウドソーシングは、低コストでラベルを効率的に収集するための一般的なパラダイムです。Dawid-Skene推定量は、非専門家のクラウドソーシング作業者によって提供されるノイズの多いラベルから真のラベルを推測するために広く使用されています。ただし、推定量は非凸対数尤度関数を最大化するため、そのパフォーマンスを理論的に正当化することは困難です。この論文では、マルチクラスクラウドラベリング問題のための2段階の効率的なアルゴリズムを提案します。最初の段階では、スペクトル法を使用してパラメーターの初期推定値を取得します。次に、2番目の段階では、EMアルゴリズムを介してDawid-Skene推定量の目的関数を最適化することにより、推定値を改良します。このアルゴリズムが対数係数までの最適な収束率を達成することを示します。合成データセットと実際のデータセットで広範な実験を行います。実験結果は、提案されたアルゴリズムが最も正確な経験的アプローチに匹敵し、最近提案された他のいくつかの方法よりも優れていることを示しています。

Structure Learning in Bayesian Networks of a Moderate Size by Efficient Sampling
効率的なサンプリングによる中規模のベイジアンネットワークにおける構造学習

We study the Bayesian model averaging approach to learning Bayesian network structures (DAGs) from data. We develop new algorithms including the first algorithm that is able to efficiently sample DAGs of a moderate size (with up to about 25 variables) according to the exact structure posterior. The DAG samples can then be used to construct estimators for the posterior of any feature. We theoretically prove good properties of our estimators and empirically show that our estimators considerably outperform the estimators from the previous state-of-the-art methods.

私たちは、データからベイジアンネットワーク構造(DAG)を学習するためのベイジアンモデル平均化アプローチを研究します。私たちは、厳密な構造事後分布に従って、中程度のサイズ(最大約25変数)のDAGを効率的にサンプリングできる最初のアルゴリズムを含む、新しいアルゴリズムを開発します。得られたDAGサンプルを使用して、任意の特徴の事後分布の推定量を構築できます。私たちは、推定量の優れた特性を理論的に証明し、これらの推定量が従来の最先端手法の推定量を大幅に上回ることを経験的に示します。

Control Function Instrumental Variable Estimation of Nonlinear Causal Effect Models
非線形因果効果モデルの制御関数操作変数推定

The instrumental variable method consistently estimates the effect of a treatment when there is unmeasured confounding and a valid instrumental variable. A valid instrumental variable is a variable that is independent of unmeasured confounders and affects the treatment but does not have a direct effect on the outcome beyond its effect on the treatment. Two commonly used estimators for using an instrumental variable to estimate a treatment effect are the two stage least squares estimator and the control function estimator. For linear causal effect models, these two estimators are equivalent, but for nonlinear causal effect models, the estimators are different. We provide a systematic comparison of these two estimators for nonlinear causal effect models and develop an approach to combining the two estimators that generally performs better than either one alone. We show that the control function estimator is a two stage least squares estimator with an augmented set of instrumental variables. If these augmented instrumental variables are valid, then the control function estimator can be much more efficient than usual two stage least squares without the augmented instrumental variables, while if the augmented instrumental variables are not valid, then the control function estimator may be inconsistent while the usual two stage least squares remains consistent. We apply the Hausman test to test whether the augmented instrumental variables are valid and construct a pretest estimator based on this test. The pretest estimator is shown to work well in a simulation study. An application to the effect of exposure to violence on time preference is considered.

操作変数法は、測定されていない交絡があり、かつ有効な操作変数がある場合に、治療の効果を一致推定します。有効な操作変数とは、測定されていない交絡因子とは独立しており、治療に影響しますが、治療への影響を超えて結果に直接影響を及ぼさない変数です。操作変数を使用して治療効果を推定するためによく使用される2つの推定量は、2段階最小二乗推定量と制御関数推定量です。線形因果効果モデルの場合、これら2つの推定量は同等ですが、非線形因果効果モデルの場合、推定量は異なります。私たちは、非線形因果効果モデルに対するこれら2つの推定量の体系的な比較を提供し、2つの推定量を組み合わせることで、一般にどちらか一方だけよりも優れた性能を発揮するアプローチを開発します。私たちは、制御関数推定量が、拡張された操作変数のセットを備えた2段階最小二乗推定量であることを示します。これらの拡張された操作変数が有効であれば、制御関数推定量は、拡張された操作変数のない通常の2段階最小二乗法よりもはるかに効率的になり得ます。一方、拡張された操作変数が有効でない場合は、通常の2段階最小二乗法は一致性を保つものの、制御関数推定量は一致性を失う可能性があります。拡張された操作変数が有効かどうかを検定するためにハウスマン検定を適用し、この検定に基づいて事前検定推定量を構築します。事前検定推定量は、シミュレーション研究で適切に機能することが示されています。暴力への曝露が時間選好に与える影響への応用を検討します。
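
The equivalence of the two estimators in the linear case, as stated in the abstract, can be checked numerically. The sketch below assumes the simplest setting: one instrument `z`, one treatment `x`, one outcome `y`, and a linear model with an intercept (handled by centering). The function names and the data are made up for illustration, and the nonlinear models the paper actually studies are not covered.

```python
def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_stage_least_squares(z, x, y):
    """Regress x on z, then regress y on the fitted values of x."""
    z, x, y = center(z), center(x), center(y)
    first_stage = dot(z, x) / dot(z, z)
    x_hat = [first_stage * zi for zi in z]
    return dot(x_hat, y) / dot(x_hat, x_hat)


def control_function(z, x, y):
    """Regress y on x and the first-stage residual; return the coefficient on x."""
    z, x, y = center(z), center(x), center(y)
    first_stage = dot(z, x) / dot(z, z)
    resid = [xi - first_stage * zi for xi, zi in zip(x, z)]
    # Solve the 2x2 normal equations for y ~ a * x + b * resid by Cramer's rule.
    sxx, sxr, srr = dot(x, x), dot(x, resid), dot(resid, resid)
    sxy, sry = dot(x, y), dot(resid, y)
    det = sxx * srr - sxr * sxr
    return (sxy * srr - sxr * sry) / det


# Hypothetical data: instrument, treatment, outcome.
z = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x = [0.5, 1.0, 2.8, 2.9, 4.6, 4.9]
y = [1.2, 1.9, 6.1, 5.5, 9.4, 9.6]

tsls = two_stage_least_squares(z, x, y)
cf = control_function(z, x, y)
print(tsls, cf)  # the two estimates agree up to floating-point error
```

The agreement follows because the fitted treatment and the first-stage residual are orthogonal, so including the residual as a regressor leaves the coefficient on the treatment equal to the second-stage coefficient.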

How to Center Deep Boltzmann Machines
ディープボルツマンマシンをセンタリングする方法

This work analyzes centered Restricted Boltzmann Machines (RBMs) and centered Deep Boltzmann Machines (DBMs), where centering is done by subtracting offset values from visible and hidden variables. We show analytically that (i) centered and normal Boltzmann Machines (BMs) and thus RBMs and DBMs are different parameterizations of the same model class, such that any normal BM/RBM/DBM can be transformed to an equivalent centered BM/RBM/DBM and vice versa, and that this equivalence generalizes to artificial neural networks in general, (ii) the expected performance of centered binary BMs/RBMs/DBMs is invariant under simultaneous flip of data and offsets, for any offset value in the range of zero to one, (iii) centering can be reformulated as a different update rule for normal BMs/RBMs/DBMs, and (iv) using the enhanced gradient is equivalent to setting the offset values to the average over model and data mean. Furthermore, we present numerical simulations suggesting that (i) optimal generative performance is achieved by subtracting mean values from visible as well as hidden variables, (ii) centered binary RBMs/DBMs reach significantly higher log-likelihood values than normal binary RBMs/DBMs, (iii) centering variants whose offsets depend on the model mean, like the enhanced gradient, suffer from severe divergence problems, (iv) learning is stabilized if an exponentially moving average over the batch means is used for the offset values instead of the current batch mean, which also prevents the enhanced gradient from severe divergence, (v) on a similar level of log-likelihood values centered binary RBMs/DBMs have smaller weights and bigger bias parameters than normal binary RBMs/DBMs, (vi) centering leads to an update direction that is closer to the natural gradient, which is extremely efficient for training as we show for small binary RBMs, (vii) centering eliminates the need for greedy layer-wise pre-training of DBMs, which often even deteriorates the results independently of whether centering is used or not, and (viii) centering is also beneficial for auto encoders.

この研究では、中心化制限ボルツマンマシン(RBM)と中心化ディープボルツマンマシン(DBM)を分析します。ここで中心化は、可視変数と隠れ変数からオフセット値を減算することによって行われます。私たちは、(i)中心化ボルツマンマシンと通常のボルツマンマシン(BM)、したがってRBMとDBMは同じモデルクラスの異なるパラメータ化であり、任意の通常のBM/RBM/DBMを同等の中心化BM/RBM/DBMに変換でき、その逆も同様であり、この同等性は人工ニューラルネットワーク全般に一般化できること、(ii)中心化バイナリBM/RBM/DBMの期待性能は、0から1の範囲の任意のオフセット値に対して、データとオフセットの同時反転の下で不変であること、(iii)中心化は、通常のBM/RBM/DBMに対する異なる更新規則として再定式化できること、(iv)強化勾配を使用することは、オフセット値をモデル平均とデータ平均の平均に設定することと同等であることを解析的に示します。さらに、私たちは数値シミュレーションを提示し、(i)可視変数と隠れ変数の両方から平均値を減算することで最適な生成性能が達成されること、(ii)中心化バイナリRBM/DBMは通常のバイナリRBM/DBMよりも大幅に高い対数尤度値に達すること、(iii)オフセットがモデル平均に依存する中心化の変種(強化勾配など)は深刻な発散問題に悩まされること、(iv)オフセット値に現在のバッチ平均ではなくバッチ平均の指数移動平均を使用すると学習が安定し、強化勾配の深刻な発散も防ぐことができること、(v)同程度の対数尤度値では、中心化バイナリRBM/DBMは通常のバイナリRBM/DBMよりも重みが小さく、バイアスパラメータが大きいこと、(vi)中心化により更新方向が自然勾配に近くなり、小さなバイナリRBMで示すように、自然勾配はトレーニングに非常に効率的であること、(vii)中心化により、DBMの貪欲なレイヤーごとの事前トレーニングが不要になること(この事前トレーニングは、中心化の使用の有無にかかわらず、しばしば結果をかえって悪化させます)、(viii)中心化はオートエンコーダにとっても有益であることを示唆します。
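
The stabilization described in numerical finding (iv), using an exponentially moving average over batch means for the offsets instead of the current batch mean, can be sketched as follows. The function names, the `sliding_factor` parameter and its value, and the toy batch are assumptions for illustration; the paper's training procedure involves more than this single update.

```python
def update_offsets(offsets, batch, sliding_factor=0.01):
    """Move each offset toward the current batch mean by an exponentially moving average."""
    batch_means = [sum(column) / len(batch) for column in zip(*batch)]
    return [(1.0 - sliding_factor) * o + sliding_factor * m
            for o, m in zip(offsets, batch_means)]


def center(batch, offsets):
    """Subtract the offsets from every row of the batch."""
    return [[v - o for v, o in zip(row, offsets)] for row in batch]


# Hypothetical binary batch with two visible units.
offsets = [0.5, 0.5]
batch = [[1.0, 0.0], [1.0, 1.0]]

offsets = update_offsets(offsets, batch, sliding_factor=0.1)
print(offsets)                 # offsets moved one tenth of the way toward the batch means
print(center(batch, offsets))  # centered batch
```

A small sliding factor makes the offsets change slowly from batch to batch, which is the smoothing effect the abstract credits with stabilizing learning.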

Learning Taxonomy Adaptation in Large-scale Classification
大規模分類における分類学適応の学習

In this paper, we study flat and hierarchical classification strategies in the context of large-scale taxonomies. Addressing the problem from a learning-theoretic point of view, we first propose a multi-class, hierarchical data dependent bound on the generalization error of classifiers deployed in large-scale taxonomies. This bound provides an explanation to several empirical results reported in the literature, related to the performance of flat and hierarchical classifiers. Based on this bound, we also propose a technique for modifying a given taxonomy through pruning, that leads to a lower value of the upper bound as compared to the original taxonomy. We then present another method for hierarchy pruning by studying approximation error of a family of classifiers, and derive from it features used in a meta-classifier to decide which nodes to prune. We finally illustrate the theoretical developments through several experiments conducted on two widely used taxonomies.

この論文では、大規模な分類法の文脈で、フラットな分類戦略と階層的な分類戦略を研究します。学習理論の観点からこの問題に取り組み、まず、大規模な分類法に展開された分類器の汎化誤差に対する、多クラスかつ階層的でデータ依存の上界を提案します。この上界は、フラット分類器と階層分類器の性能に関連して文献で報告されているいくつかの経験的結果に説明を与えます。この上界に基づいて、枝刈りによって与えられた分類法を修正する手法も提案します。これにより、元の分類法と比較して上界の値が小さくなります。次に、分類器のファミリーの近似誤差を研究することにより、階層の枝刈りのための別の方法を提示し、そこから、どのノードを枝刈りするかを決定するメタ分類器で使用される特徴を導き出します。最後に、広く使用されている2つの分類法で行ったいくつかの実験を通じて、理論的な結果を例証します。

Cells in Multidimensional Recurrent Neural Networks
多次元リカレントニューラルネットワークにおけるセル

The transcription of handwritten text in images is one task in machine learning, and one way to solve it is to use multi-dimensional recurrent neural networks (MDRNN) with connectionist temporal classification (CTC). The RNNs can contain special units, the long short-term memory (LSTM) cells. They are able to learn long-term dependencies, but they become unstable when the dimension is chosen greater than one. We define some useful and necessary properties for the one-dimensional LSTM cell and extend them to the multi-dimensional case. Thereby we introduce several new cells with better stability. We present a method to design cells using the theory of linear shift invariant systems. The new cells are compared to the LSTM cell on the IFN/ENIT and Rimes databases, where they improve the recognition rate compared to the LSTM cell. So each application where LSTM cells in MDRNNs are used could be improved by substituting them with the newly developed cells.

画像中の手書きテキストの書き起こしは機械学習のタスクの1つであり、その解決策の1つは、コネクショニスト時間分類(CTC)を備えた多次元再帰型ニューラルネットワーク(MDRNN)を使用することです。RNNには、長短期記憶(LSTM)セルという特殊なユニットを含めることができます。LSTMセルは長期的な依存関係を学習できますが、次元を1より大きく選ぶと不安定になります。私たちは、1次元LSTMセルに対していくつかの有用かつ必要な特性を定義し、それらを多次元の場合に拡張します。これにより、安定性が向上したいくつかの新しいセルを導入します。線形シフト不変システムの理論を使用してセルを設計する方法を提示します。新しいセルをIFN/ENITおよびRimesデータベース上でLSTMセルと比較したところ、LSTMセルよりも認識率を向上させることができました。したがって、MDRNNでLSTMセルが使用されているあらゆるアプリケーションは、それらを新しく開発されたセルに置き換えることで改善できる可能性があります。

Integrated Common Sense Learning and Planning in POMDPs
POMDPにおける常識学習と計画の統合

We formulate a new variant of the problem of planning in an unknown environment, for which we can provide algorithms with reasonable theoretical guarantees in spite of large state spaces and time horizons, partial observability, and complex dynamics. In this variant, an agent is given a collection of example traces produced by a reference policy, which may, for example, capture the agent’s past behavior. The agent is (only) asked to find policies that are supported by regularities in the dynamics that are observable on these example traces. We describe an efficient algorithm that uses such common sense knowledge reflected in the example traces to construct decision tree policies for goal-oriented factored POMDPs. More precisely, our algorithm (provably) succeeds at finding a policy for a given input goal when (1) there is a CNF that is almost always observed satisfied on the traces of the POMDP, capturing a sufficient approximation of its dynamics and (2) for a decision tree policy of bounded complexity, there exist small-space resolution proofs that the goal is achieved on each branch using the aforementioned CNF capturing the common sense rules. Such a CNF always exists for noisy STRIPS domains, for example. Our results thus essentially establish that the possession of a suitable exploration policy for collecting the necessary examples is the fundamental obstacle to learning to act in such environments.

私たちは、未知の環境での計画問題の新しい変種を定式化します。この変種では、大規模な状態空間と時間範囲、部分的な観測可能性、複雑なダイナミクスにもかかわらず、合理的な理論的保証を備えたアルゴリズムを提供できます。この変種では、エージェントには、参照ポリシーによって生成されたサンプルトレースのコレクションが与えられます。参照ポリシーは、たとえばエージェントの過去の行動を反映したものです。エージェントは、これらのサンプルトレースで観測可能なダイナミクスの規則性によって裏付けられるポリシーを見つけることだけを求められます。私たちは、サンプルトレースに反映されたこのような常識的な知識を使用して、目標指向の因子化POMDPの決定木ポリシーを構築する効率的なアルゴリズムについて説明します。より正確には、私たちのアルゴリズムは、(1) POMDPのトレース上でほぼ常に充足されていると観測され、そのダイナミクスの十分な近似を捉えているCNFが存在し、(2)複雑性が制限された決定木ポリシーについて、常識的なルールを捉えた前述のCNFを使用して各分岐で目標が達成されることを示す小空間の導出(resolution)証明が存在するときに、与えられた入力目標に対するポリシーを見つけることに(証明可能に)成功します。このようなCNFは、たとえばノイズの多いSTRIPSドメインでは常に存在します。したがって、私たちの結果は、必要な例を収集するための適切な探索ポリシーを持っているかどうかが、このような環境で行動することを学習する上での根本的な障害であることを本質的に確立しています。

JCLAL: A Java Framework for Active Learning
JCLAL: アクティブ・ラーニングのためのJavaフレームワーク

Active Learning has become an important area of research owing to the increasing number of real-world problems which contain labelled and unlabelled examples at the same time. JCLAL is a Java Class Library for Active Learning which has an architecture that follows strong principles of object-oriented design. It is easy to use, and it allows the developers to adapt, modify and extend the framework according to their needs. The library offers a variety of active learning methods that have been proposed in the literature. The software is available under the GPL license.

アクティブラーニングは、ラベル付けされた例とラベル付けされていない例が同時に含まれる実世界の問題の数が増えているため、重要な研究分野になっています。JCLALは、オブジェクト指向設計の強力な原則に従うアーキテクチャを持つアクティブラーニング用のJavaクラスライブラリです。使いやすく、開発者はニーズに応じてフレームワークを適応、変更、拡張できます。このライブラリは、文献で提案されているさまざまなアクティブラーニング手法を提供しています。このソフトウェアは、GPLライセンスの下で提供されています。

Convex Regression with Interpretable Sharp Partitions
解釈可能なシャープなパーティションによる凸回帰

We consider the problem of predicting an outcome variable on the basis of a small number of covariates, using an interpretable yet non-additive model. We propose convex regression with interpretable sharp partitions (CRISP) for this task. CRISP partitions the covariate space into blocks in a data-adaptive way, and fits a mean model within each block. Unlike other partitioning methods, CRISP is fit using a non-greedy approach by solving a convex optimization problem, resulting in low-variance fits. We explore the properties of CRISP, and evaluate its performance in a simulation study and on a housing price data set.

私たちは、少数の共変量に基づいて結果変数を予測する問題を、解釈可能でありながら非加法的なモデルを用いて考えます。このタスクに対して、解釈可能なシャープパーティション(CRISP)を使用した凸回帰を提案します。CRISPは、データ適応型方法で共変量空間をブロックに分割し、各ブロック内に平均モデルを適合させます。他の分割方法とは異なり、CRISPは凸最適化問題を解くことにより、貪欲でないアプローチを使用して適合し、低分散適合をもたらします。CRISPの特性を調査し、シミュレーション研究と住宅価格データセットでその性能を評価します。

Hierarchical Relative Entropy Policy Search
階層的相対エントロピー方策検索

Many reinforcement learning (RL) tasks, especially in robotics, consist of multiple sub-tasks that are strongly structured. Such task structures can be exploited by incorporating hierarchical policies that consist of gating networks and sub-policies. However, this concept has only been partially explored for real world settings and complete methods, derived from first principles, are needed. Real world settings are challenging due to large and continuous state-action spaces that are prohibitive for exhaustive sampling methods. We define the problem of learning sub-policies in continuous state action spaces as finding a hierarchical policy that is composed of a high-level gating policy to select the low-level sub-policies for execution by the agent. In order to efficiently share experience with all sub-policies, also called inter-policy learning, we treat these sub-policies as latent variables which allows for distribution of the update information between the sub-policies. We present three different variants of our algorithm, designed to be suitable for a wide variety of real world robot learning tasks and evaluate our algorithms in two real robot learning scenarios as well as several simulations and comparisons.

多くの強化学習(RL)タスク、特にロボット工学のタスクは、強力に構造化された複数のサブタスクで構成されています。このようなタスク構造は、ゲーティングネットワークとサブポリシーで構成される階層的ポリシーを組み込むことで活用できます。ただし、この概念は現実世界の設定では部分的にしか検討されておらず、第一原理から導き出された完全な方法が必要です。現実世界の設定は、網羅的なサンプリング方法には適さない大規模で連続的な状態アクション空間のために困難です。連続状態アクション空間でサブポリシーを学習する問題を、エージェントが実行する低レベルのサブポリシーを選択するための高レベルのゲーティングポリシーで構成される階層的ポリシーを見つけることと定義します。すべてのサブポリシーで経験を効率的に共有するために(ポリシー間学習とも呼ばれます)、これらのサブポリシーを潜在変数として扱い、サブポリシー間で更新情報を配布できるようにします。私たちは、さまざまな現実世界のロボット学習タスクに適するように設計されたアルゴリズムの3つの異なるバリエーションを紹介し、2つの実際のロボット学習シナリオといくつかのシミュレーションおよび比較でアルゴリズムを評価します。

Rate Optimal Denoising of Simultaneously Sparse and Low Rank Matrices
スパースかつ低ランクな行列のレート最適なノイズ除去

We study minimax rates for denoising simultaneously sparse and low rank matrices in high dimensions. We show that an iterative thresholding algorithm achieves (near) optimal rates adaptively under mild conditions for a large class of loss functions. Numerical experiments on synthetic datasets also demonstrate the competitive performance of the proposed method.

私たちは、高次元においてスパースかつ低ランクである行列のノイズ除去に対するミニマックスレートを研究します。反復閾値処理アルゴリズムが、幅広いクラスの損失関数に対して、穏やかな条件下で適応的に(ほぼ)最適なレートを達成することを示します。合成データセットでの数値実験も、提案された方法の競争力のあるパフォーマンスを示しています。

Rounding-based Moves for Semi-Metric Labeling
セミメトリックラベリングのための丸めベースの移動

Semi-metric labeling is a special case of energy minimization for pairwise Markov random fields. The energy function consists of arbitrary unary potentials, and pairwise potentials that are proportional to a given semi-metric distance function over the label set. Popular methods for solving semi-metric labeling include (i) move-making algorithms, which iteratively solve a minimum $st$-cut problem; and (ii) the linear programming (LP) relaxation based approach. In order to convert the fractional solution of the LP relaxation to an integer solution, several randomized rounding procedures have been developed in the literature. We consider a large class of parallel rounding procedures, and design move-making algorithms that closely mimic them. We prove that the multiplicative bound of a move-making algorithm exactly matches the approximation factor of the corresponding rounding procedure for any arbitrary distance function. Our analysis includes all known results for move-making algorithms as special cases.

セミメトリックラベリングは、ペアワイズマルコフランダムフィールドのエネルギー最小化の特殊なケースです。エネルギー関数は、任意の単項ポテンシャルと、ラベルセット上の特定のセミメトリック距離関数に比例するペアワイズポテンシャルで構成されます。セミメトリックラベリングを解決するための一般的な方法には、(i)最小$st$カット問題を反復的に解決する移動作成アルゴリズム、および(ii)線形計画法(LP)緩和に基づくアプローチがあります。LP緩和の分数解を整数解に変換するために、いくつかのランダム化された丸め手順が文献で開発されています。並列丸め手順の大規模なクラスを検討し、それらを厳密に模倣する移動作成アルゴリズムを設計します。移動作成アルゴリズムの乗法境界が、任意の距離関数の対応する丸め手順の近似係数と正確に一致することを証明します。分析には、移動作成アルゴリズムの既知の結果がすべて特殊なケースとして含まれています。

Estimating Diffusion Networks: Recovery Conditions, Sample Complexity and Soft-thresholding Algorithm
拡散ネットワークの推定:回復条件、サンプルの複雑さ、ソフトしきい値処理アルゴリズム

Information spreads across social and technological networks, but often the network structures are hidden from us and we only observe the traces left by the diffusion processes, called cascades. Can we recover the hidden network structures from these observed cascades? What kind of cascades and how many cascades do we need? Are there some network structures which are more difficult than others to recover? Can we design efficient inference algorithms with provable guarantees? Despite the increasing availability of cascade data and methods for inferring networks from these data, a thorough theoretical understanding of the above questions remains largely unexplored in the literature. In this paper, we investigate the network structure inference problem for a general family of continuous-time diffusion models using an $\ell_1$-regularized likelihood maximization framework. We show that, as long as the cascade sampling process satisfies a natural incoherence condition, our framework can recover the correct network structure with high probability if we observe $O(d^3 \log N)$ cascades, where $d$ is the maximum number of parents of a node and $N$ is the total number of nodes. Moreover, we develop a simple and efficient soft-thresholding network inference algorithm which demonstrates the match between our theoretical prediction and empirical results. In practice, this new algorithm also outperforms other alternatives in terms of the accuracy of recovering hidden diffusion networks.

情報は社会的ネットワークや技術的ネットワークを通じて広がりますが、ネットワーク構造は多くの場合私たちには見えず、カスケードと呼ばれる拡散プロセスによって残された痕跡だけが観察されます。これらの観察されたカスケードから隠れたネットワーク構造を復元することはできるでしょうか?どのようなカスケードが必要で、いくつのカスケードが必要でしょうか?他のネットワーク構造よりも復元が難しいネットワーク構造はあるでしょうか?証明可能な保証を備えた効率的な推論アルゴリズムを設計することはできるでしょうか?カスケードデータとこれらのデータからネットワークを推論する方法がますます利用できるようになっているにもかかわらず、上記の質問の徹底的な理論的理解は、文献ではほとんど研究されていません。この論文では、$\ell_1$正則化された尤度最大化フレームワークを使用して、連続時間拡散モデルの一般的なファミリーのネットワーク構造推論問題を調査します。カスケードサンプリングプロセスが自然なインコヒーレンス条件を満たす限り、$O(d^3 \log N)$個のカスケードを観測した場合、私たちのフレームワークは高い確率で正しいネットワーク構造を復元できることを示します。ここで、$d$はノードの親の最大数、$N$はノードの総数です。さらに、私たちはシンプルで効率的なソフトしきい値ネットワーク推論アルゴリズムを開発し、理論予測と実験結果の一致を実証しました。実際には、この新しいアルゴリズムは、隠れた拡散ネットワークを復元する精度の点でも他の代替アルゴリズムを上回っています。
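
The abstract names a soft-thresholding algorithm but does not spell it out. As a minimal illustration of the underlying operator only (not the authors' network inference algorithm), the sketch below implements the standard scalar soft-thresholding map, which is the proximal operator of the $\ell_1$ penalty. The function names, the threshold `tau`, and the example weights are assumptions chosen for illustration.

```python
def soft_threshold(x, tau):
    """Shrink x toward zero by tau; any x with |x| <= tau becomes exactly 0."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def soft_threshold_vector(values, tau):
    """Apply the scalar operator to every entry of a list."""
    return [soft_threshold(v, tau) for v in values]


# Hypothetical edge weights: small entries are zeroed, large ones are shrunk.
weights = [1.0, -0.125, 0.5, -1.5, 0.0625]
sparse_weights = soft_threshold_vector(weights, tau=0.25)
print(sparse_weights)
```

Entries whose magnitude is at most `tau` are set exactly to zero, which is how an $\ell_1$-regularized estimate ends up with a sparse set of inferred edges.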

Sparsity and Error Analysis of Empirical Feature-Based Regularization Schemes
経験的特徴ベース正則化スキームのスパース性と誤差解析

We consider a learning algorithm generated by a regularization scheme with a concave regularizer for the purpose of achieving sparsity and good learning rates in a least squares regression setting. The regularization is induced for linear combinations of empirical features, constructed in the literatures of kernel principal component analysis and kernel projection machines, based on kernels and samples. In addition to the separability of the involved optimization problem caused by the empirical features, we carry out sparsity and error analysis, giving bounds in the norm of the reproducing kernel Hilbert space, based on a priori conditions which do not require assumptions on sparsity in terms of any basis or system. In particular, we show that as the concave exponent $q$ of the concave regularizer increases to $1$, the learning ability of the algorithm improves. Some numerical simulations for both artificial and real MHC-peptide binding data involving the $\ell^q$ regularizer and the SCAD penalty are presented to demonstrate the sparsity and error analysis.

私たちは、最小二乗回帰設定でスパース性と良好な学習率を達成することを目的として、凹正則化子を用いた正則化スキームによって生成される学習アルゴリズムを検討します。正則化は、カーネルとサンプルに基づいて、カーネル主成分分析とカーネル射影マシンの文献で構築された経験的特徴の線形結合に対して誘導されます。経験的特徴によって引き起こされる関連する最適化問題の分離可能性に加えて、私たちは、いかなる基底またはシステムに関してもスパース性の仮定を必要としない事前条件に基づいて、再生カーネルヒルベルト空間のノルムの境界を与えるスパース性と誤差の分析を実行します。特に、凹正則化子の凹指数$q$が$1$に増加すると、アルゴリズムの学習能力が向上することを示す。スパース性と誤差の分析を示すために、$\ell^q$正則化子とSCADペナルティを含む人工および実際のMHCペプチド結合データの両方に対するいくつかの数値シミュレーションが提示されます。

Spectral Ranking using Seriation
セリエーションを使用したスペクトルランキング

We describe a seriation algorithm for ranking a set of items given pairwise comparisons between these items. Intuitively, the algorithm assigns similar rankings to items that compare similarly with all others. It does so by constructing a similarity matrix from pairwise comparisons, using seriation methods to reorder this matrix and construct a ranking. We first show that this spectral seriation algorithm recovers the true ranking when all pairwise comparisons are observed and consistent with a total order. We then show that ranking reconstruction is still exact when some pairwise comparisons are corrupted or missing, and that seriation based spectral ranking is more robust to noise than classical scoring methods. Finally, we bound the ranking error when only a random subset of the comparisons are observed. An additional benefit of the seriation formulation is that it allows us to solve semi-supervised ranking problems. Experiments on both synthetic and real datasets demonstrate that seriation based spectral ranking achieves competitive and in some cases superior performance compared to classical ranking methods.

私たちは、アイテム間の一対比較に基づいてアイテムのセットをランク付けするセリエーションアルゴリズムについて説明します。直感的に、このアルゴリズムは、他のすべてのアイテムと類似して比較されるアイテムに類似したランク付けを割り当てます。これは、一対比較から類似性マトリックスを作成し、セリエーションメソッドを使用してこのマトリックスを並べ替え、ランク付けを構築することによって行われます。まず、すべての一対比較が観察され、全順序と一致している場合、このスペクトルセリエーションアルゴリズムによって真のランク付けが復元されることを示します。次に、一部の一対比較が破損または欠落している場合でもランク付けの再構築が正確であること、およびセリエーションベースのスペクトルランク付けが従来のスコアリング方法よりもノイズに対して堅牢であることを示します。最後に、比較のランダムなサブセットのみが観察される場合のランク付け誤差の上界を示します。セリエーション定式化のもう1つの利点は、半教師ありランク付けの問題を解決できることです。合成データセットと実際のデータセットの両方での実験により、セリエーションベースのスペクトルランク付けは、従来のランク付け方法と比較して競争力があり、場合によっては優れたパフォーマンスを実現することが実証されています。

L1-Regularized Least Squares for Support Recovery of High Dimensional Single Index Models with Gaussian Designs
ガウス計画を持つ高次元単一インデックスモデルのサポート回復のためのL1正則化最小二乗法

It is known that for a certain class of single index models (SIMs) $Y = f(X_{p \times 1}^\top\beta_0, \varepsilon)$, support recovery is impossible when $X \sim \mathcal{N}(0, I_{p \times p})$ and a model complexity adjusted sample size is below a critical threshold. Recently, optimal algorithms based on Sliced Inverse Regression (SIR) were suggested. These algorithms work provably under the assumption that the design $X$ comes from an i.i.d. Gaussian distribution. In the present paper we analyze algorithms based on covariance screening and least squares with $L_1$ penalization (i.e. LASSO) and demonstrate that they can also enjoy optimal (up to a scalar) rescaled sample size in terms of support recovery, albeit under slightly different assumptions on $f$ and $\varepsilon$ compared to the SIR based algorithms. Furthermore, we show more generally, that LASSO succeeds in recovering the signed support of $\beta_0$ if $X \sim \mathcal{N}(0, \Sigma)$, and the covariance $\Sigma$ satisfies the irrepresentable condition. Our work extends existing results on the support recovery of LASSO for the linear model, to a more general class of SIMs.

特定のクラスの単一インデックスモデル(SIM) $Y = f(X_{p \times 1}^\top\beta_0, \varepsilon)$では、$X \sim \mathcal{N}(0, I_{p \times p})$で、モデルの複雑さを調整したサンプルサイズが臨界しきい値を下回る場合、サポート回復は不可能であることが知られています。最近、スライス逆回帰(SIR)に基づく最適なアルゴリズムが提案されました。これらのアルゴリズムは、設計$X$がi.i.d.ガウス分布から得られるという仮定の下で証明可能に機能します。この論文では、共分散スクリーニングと$L_1$ペナルティ付き最小二乗法(つまりLASSO)に基づくアルゴリズムを分析し、SIRベースのアルゴリズムと比較して$f$と$\varepsilon$に関する仮定がわずかに異なるものの、サポート回復の観点から最適な(スカラーまで)再スケールされたサンプルサイズを利用できることを示します。さらに、より一般的には、$X \sim \mathcal{N}(0, \Sigma)$かつ共分散$\Sigma$が表現不可能な条件を満たす場合、LASSOは$\beta_0$の符号付きサポートを回復できることが示されています。私たちの研究は、線形モデルに対するLASSOのサポート回復に関する既存の結果を、より一般的なクラスのSIMに拡張しています。

LIBMF: A Library for Parallel Matrix Factorization in Shared-memory Systems
LIBMF: 共有メモリシステムにおける並列行列分解のためのライブラリ

Matrix factorization (MF) plays a key role in many applications such as recommender systems and computer vision, but MF may take long running time for handling large matrices commonly seen in the big data era. Many parallel techniques have been proposed to reduce the running time, but few parallel MF packages are available. Therefore, we present an open source library, LIBMF, based on recent advances of parallel MF for shared-memory systems. LIBMF includes easy-to-use command-line tools, interfaces to C/C++ languages, and comprehensive documentation. Our experiments demonstrate that LIBMF outperforms state of the art packages. LIBMF is BSD-licensed, so users can freely use, modify, and redistribute the code.

行列分解(MF)は、レコメンダーシステムやコンピュータービジョンなどの多くのアプリケーションで重要な役割を果たしますが、ビッグデータ時代によく見られる大きな行列を処理するには、MFの実行に時間がかかる場合があります。実行時間を短縮するために多くの並列手法が提案されていますが、使用可能な並列MFパッケージはほとんどありません。そこで、共有メモリシステム用の並列MFの最近の進歩に基づくオープンソースライブラリであるLIBMFを紹介します。LIBMFには、使いやすいコマンド・ライン・ツール、C/C++言語へのインターフェース、および包括的な資料が含まれています。私たちの実験では、LIBMFが最先端のパッケージよりも優れていることが実証されています。LIBMFはBSDライセンスであるため、ユーザーはコードを自由に使用、変更、および再配布できます。

Structure-Leveraged Methods in Breast Cancer Risk Prediction
乳がんリスク予測における構造活用法

Predicting breast cancer risk has long been a goal of medical research in the pursuit of precision medicine. The goal of this study is to develop novel penalized methods to improve breast cancer risk prediction by leveraging structure information in electronic health records. We conducted a retrospective case-control study, garnering 49 mammography descriptors and 77 high-frequency/low-penetrance single-nucleotide polymorphisms (SNPs) from an existing personalized medicine data repository. Structured mammography reports and breast imaging features have long been part of a standard electronic health record (EHR), and genetic markers likely will be in the near future. Lasso and its variants are widely used approaches to integrated learning and feature selection, and our methodological contribution is to incorporate the dependence structure among the features into these approaches. More specifically, we propose a new methodology by combining group penalty and $\ell^p$ ($1\leq p\leq2$) fusion penalty to improve breast cancer risk prediction, taking into account structure information in mammography descriptors and SNPs. We demonstrate that our method provides benefits that are both statistically significant and potentially significant to people’s lives.

乳がんリスクの予測は、精密医療を追求する医療研究の長年の目標でした。本研究の目標は、電子医療記録の構造情報を活用して、乳がんリスク予測を改善する新しいペナルティ法を開発することです。私たちは、既存の個別化医療データリポジトリから49個のマンモグラフィ記述子と77個の高頻度/低浸透度の一塩基多型(SNP)を収集し、後ろ向きの症例対照研究を実施しました。構造化されたマンモグラフィレポートと乳房画像の特徴は、長い間標準的な電子医療記録(EHR)の一部であり、遺伝子マーカーも近い将来にそうなるでしょう。Lassoとその変種は、統合学習と特徴選択に広く使用されているアプローチであり、私たちの方法論的貢献は、これらのアプローチに特徴間の依存関係構造を組み込むことです。より具体的には、マンモグラフィ記述子とSNPの構造情報を考慮して、乳がんリスク予測を改善するために、グループペナルティと$\ell^p$ ($1\leq p\leq2$)融合ペナルティを組み合わせた新しい方法論を提案します。私たちの方法は、統計的に有意であり、人々の生活にとって潜在的に重要な利益をもたらすことを実証します。

Lenient Learning in Independent-Learner Stochastic Cooperative Games
独立学習者確率協調ゲームにおける寛大な学習

We introduce the Lenient Multiagent Reinforcement Learning 2 (LMRL2) algorithm for independent-learner stochastic cooperative games. LMRL2 is designed to overcome a pathology called relative overgeneralization, and to do so while still performing well in games with stochastic transitions, stochastic rewards, and miscoordination. We discuss the existing literature, then compare LMRL2 against other algorithms drawn from the literature which can be used for games of this kind: traditional (“Distributed”) Q-learning, Hysteretic Q-learning, WoLF-PHC, SOoN, and (for repeated games only) FMQ. The results show that LMRL2 is very effective in both of our measures (complete and correct policies), and is found in the top rank more often than any other technique. LMRL2 is also easy to tune: though it has many available parameters, almost all of them stay at default settings. Generally the algorithm is optimally tuned with a single parameter, if any. We then examine and discuss a number of side-issues and options for LMRL2.

私たちは、独立学習者による確率的協力ゲームのためのLenient Multiagent Reinforcement Learning 2 (LMRL2)アルゴリズムを紹介します。LMRL2は、相対的過剰一般化と呼ばれる病理を克服し、確率的遷移、確率的報酬、およびミスコーディネーションを伴うゲームで良好なパフォーマンスを発揮するように設計されています。既存の文献について説明し、次にLMRL2を、この種のゲームに使用できる文献から抽出した他のアルゴリズム(従来の(「分散型」) Q学習、ヒステリシスQ学習、WoLF-PHC、SOoN、および(繰り返しゲームのみ) FMQ)と比較します。結果は、LMRL2が両方の尺度(完全かつ正しいポリシー)で非常に効果的であり、他のどの手法よりも頻繁に上位にランクされていることを示しています。LMRL2は調整も簡単です。使用可能なパラメーターは多数ありますが、ほとんどすべてがデフォルト設定のままです。通常、アルゴリズムは、調整が必要な場合でも単一のパラメーターだけで最適に調整できます。次に、LMRL2のいくつかの副次的な問題とオプションを検討し、議論します。

CVXPY: A Python-Embedded Modeling Language for Convex Optimization
CVXPY: 凸型最適化のためのPython組み込みモデリング言語

CVXPY is a domain-specific language for convex optimization embedded in Python. It allows the user to express convex optimization problems in a natural syntax that follows the math, rather than in the restrictive standard form required by solvers. CVXPY makes it easy to combine convex optimization with high-level features of Python such as parallelism and object-oriented design. CVXPY is available at www.cvxpy.org under the GPL license, along with documentation and examples.

CVXPYは、Pythonに組み込まれた凸最適化のためのドメイン固有言語です。これにより、ユーザーは、ソルバーが必要とする制限的な標準形式ではなく、数学に従う自然な構文で凸最適化問題を表現できます。CVXPYは、凸最適化とPythonの高度な機能(並列処理やオブジェクト指向設計など)を簡単に組み合わせることができます。CVXPYはGPLライセンスの下でwww.cvxpy.orgから入手でき、ドキュメントや例も公開されています。

Model-free Variable Selection in Reproducing Kernel Hilbert Space
再生カーネルヒルベルト空間におけるモデルフリー変数選択

Variable selection is popular in high-dimensional data analysis to identify the truly informative variables. Many variable selection methods have been developed under various model assumptions. Whereas success has been widely reported in literature, their performances largely depend on validity of the assumed models, such as the linear or additive models. This article introduces a model-free variable selection method via learning the gradient functions. The idea is based on the equivalence between whether a variable is informative and whether its corresponding gradient function is substantially non-zero. The proposed variable selection method is then formulated in a framework of learning gradients in a flexible reproducing kernel Hilbert space. The key advantage of the proposed method is that it requires no explicit model assumption and allows for general variable effects. Its asymptotic estimation and selection consistencies are studied, which establish the convergence rate of the estimated sparse gradients and assure that the truly informative variables are correctly identified in probability. The effectiveness of the proposed method is also supported by a variety of simulated examples and two real-life examples.

変数選択は、高次元データ分析において、真に有益な変数を識別するためによく使用されます。さまざまなモデル仮定の下で、多くの変数選択方法が開発されてきました。文献では成功が広く報告されていますが、そのパフォーマンスは、線形モデルや加法モデルなどの想定モデルの妥当性に大きく依存します。この記事では、勾配関数の学習によるモデルフリーの変数選択方法を紹介します。このアイデアは、変数が有益であるかどうかと、それに対応する勾配関数が実質的にゼロでないかどうかの同等性に基づいています。提案された変数選択方法は、柔軟な再生カーネルヒルベルト空間で勾配を学習するフレームワークで定式化されます。提案された方法の主な利点は、明示的なモデル仮定を必要とせず、一般的な変数効果を考慮できることです。その漸近推定と選択の一貫性が研究され、推定されたスパース勾配の収束率を確立し、真に有益な変数が確率的に正しく識別されることを保証します。提案された方法の有効性は、さまざまなシミュレーション例と2つの実際の例によっても裏付けられています。

The Benefit of Multitask Representation Learning
マルチタスク表現学習の利点

We discuss a general method to learn data representations from multiple tasks. We provide a justification for this method in both settings of multitask learning and learning-to-learn. The method is illustrated in detail in the special case of linear feature learning. Conditions on the theoretical advantage offered by multitask representation learning over independent task learning are established. In particular, focusing on the important example of half-space learning, we derive the regime in which multitask representation learning is beneficial over independent task learning, as a function of the sample size, the number of tasks and the intrinsic data dimensionality. Other potential applications of our results include multitask feature learning in reproducing kernel Hilbert spaces and multilayer, deep networks.

私たちは、複数のタスクからデータ表現を学習するための一般的な方法について説明します。この方法を、マルチタスク学習と学習の学習(learning-to-learn)の両方の設定で正当化します。この方法は、線形特徴学習の特殊なケースで詳細に説明されています。マルチタスク表現学習が独立したタスク学習よりも理論的に優位となる条件が確立されています。特に、ハーフスペース学習の重要な例に焦点を当てて、サンプルサイズ、タスク数、および固有のデータ次元の関数として、マルチタスク表現学習が独立したタスク学習よりも有益となる領域を導き出します。私たちの結果の他の潜在的なアプリケーションには、再生カーネルヒルベルト空間におけるマルチタスク特徴学習と多層の深層ネットワークが含まれます。

Multiplicative Multitask Feature Learning
乗算型マルチタスク特徴学習

We investigate a general framework of multiplicative multitask feature learning which decomposes individual task’s model parameters into a multiplication of two components. One of the components is used across all tasks and the other component is task-specific. Several previous methods can be proved to be special cases of our framework. We study the theoretical properties of this framework when different regularization conditions are applied to the two decomposed components. We prove that this framework is mathematically equivalent to the widely used multitask feature learning methods that are based on a joint regularization of all model parameters, but with a more general form of regularizers. Further, an analytical formula is derived for the across-task component as related to the task-specific component for all these regularizers, leading to a better understanding of the shrinkage effects of different regularizers. Study of this framework motivates new multitask learning algorithms. We propose two new learning formulations by varying the parameters in the proposed framework. An efficient blockwise coordinate descent algorithm is developed suitable for solving the entire family of formulations with rigorous convergence analysis. Simulation studies have identified the statistical properties of data that would be in favor of the new formulations. Extensive empirical studies on various classification and regression benchmark data sets have revealed the relative advantages of the two new formulations by comparing with the state of the art, which provides instructive insights into the feature learning problem with multiple tasks.

私たちは、個々のタスクのモデルパラメータを2つのコンポーネントの乗算に分解する乗法マルチタスク特徴学習の一般的なフレームワークを調査します。コンポーネントの1つはすべてのタスクで使用され、もう1つはタスク固有のものです。いくつかの以前の方法は、このフレームワークの特殊なケースであることが証明されています。私たちは、2つの分解されたコンポーネントに異なる正則化条件が適用された場合のこのフレームワークの理論的特性を調査します。このフレームワークは、すべてのモデルパラメータの共同正則化に基づく、より一般的な形式の正則化を使用する、広く使用されているマルチタスク特徴学習方法と数学的に同等であることを証明します。さらに、これらすべての正則化について、タスク固有のコンポーネントに関連するタスク間コンポーネントの解析式が導出され、さまざまな正則化の縮小効果をよりよく理解できるようになります。このフレームワークの研究は、新しいマルチタスク学習アルゴリズムの動機となります。私たちは、提案されたフレームワークのパラメータを変更することにより、2つの新しい学習定式化を提案します。厳密な収束分析を使用して定式化のファミリー全体を解くのに適した、効率的なブロック単位の座標降下アルゴリズムが開発されています。シミュレーション研究により、新しい定式化に有利となるデータの統計的特性が特定されました。さまざまな分類および回帰ベンチマークデータセットに関する広範な実証研究により、最先端のものと比較することで2つの新しい定式化の相対的な利点が明らかになり、複数のタスクによる特徴学習の問題に対する有益な洞察がもたらされました。

Patient Risk Stratification with Time-Varying Parameters: A Multitask Learning Approach
時間変動パラメータによる患者リスク層別化:マルチタスク学習アプローチ

The proliferation of electronic health records (EHRs) frames opportunities for using machine learning to build models that help healthcare providers improve patient outcomes. However, building useful risk stratification models presents many technical challenges including the large number of factors (both intrinsic and extrinsic) influencing a patient’s risk of an adverse outcome and the inherent evolution of that risk over time. We address these challenges in the context of learning a risk stratification model for predicting which patients are at risk of acquiring a Clostridium difficile infection (CDI). We take a novel data-centric approach, leveraging the contents of EHRs from nearly 50,000 hospital admissions. We show how, by adapting techniques from multitask learning, we can learn models for patient risk stratification with unprecedented classification performance. Our model, based on thousands of variables, both time-varying and time-invariant, changes over the course of a patient admission. Applied to a held out set of approximately 25,000 patient admissions, we achieve an area under the receiver operating characteristic curve of 0.81 (95% CI 0.78-0.84). The model has been integrated into the health record system at a large hospital in the US, and can be used to produce daily risk estimates for each inpatient. While more complex than traditional risk stratification methods, the widespread development and use of such data-driven models could ultimately enable cost-effective, targeted prevention strategies that lead to better patient outcomes.

電子カルテ(EHR)の普及により、機械学習を使用して、医療提供者が患者の転帰を改善するのに役立つモデルを構築する機会が生まれています。ただし、有用なリスク層別化モデルの構築には、患者の有害転帰リスクに影響を与える多数の要因(内因性および外因性の両方)や、そのリスクの経時的な変化など、多くの技術的課題があります。私たちは、クロストリジウム・ディフィシル感染症(CDI)を発症するリスクがある患者を予測するためのリスク層別化モデルの学習という文脈でこれらの課題に取り組みます。私たちは、約50,000件の入院からのEHRの内容を活用する、新しいデータ中心のアプローチを採用しています。マルチタスク学習の手法を適応させることで、前例のない分類パフォーマンスで患者のリスク層別化モデルを学習する方法を示します。私たちのモデルは、何千もの変数(時間変動および時間不変の両方)に基づいており、患者の入院中に変化します。約25,000件の入院患者のデータに適用したところ、受信者動作特性曲線下面積は0.81 (95% CI 0.78-0.84)となりました。このモデルは米国の大規模病院の医療記録システムに統合されており、入院患者1人あたりのリスク推定値を生成するために使用できます。従来のリスク層別化方法よりも複雑ではありますが、このようなデータ駆動型モデルの広範な開発と使用により、最終的には費用対効果の高い、的を絞った予防戦略が可能になり、患者の転帰が改善される可能性があります。

Latent Space Inference of Internet-Scale Networks
インターネットスケールネットワークの潜在空間推論

The rise of Internet-scale networks, such as web graphs and social media with hundreds of millions to billions of nodes, presents new scientific opportunities, such as overlapping community detection to discover the structure of the Internet, or to analyze trends in online social behavior. However, many existing probabilistic network models are difficult or impossible to deploy at these massive scales. We propose a scalable approach for modeling and inferring latent spaces in Internet-scale networks, with an eye towards overlapping community detection as a key application. By applying a succinct representation of networks as a bag of triangular motifs, developing a parsimonious statistical model, deriving an efficient stochastic variational inference algorithm, and implementing it as a distributed cluster program via the Petuum parameter server system, we demonstrate overlapping community detection on real networks with up to 100 million nodes and 1000 communities on 5 machines in under 40 hours. Compared to other state-of-the-art probabilistic network approaches, our method is several orders of magnitude faster, with competitive or improved accuracy at overlapping community detection.

数億から数十億のノードを持つウェブグラフやソーシャルメディアなどのインターネット規模のネットワークの台頭により、重複コミュニティ検出によるインターネットの構造の解明や、オンラインソーシャル行動の傾向の分析など、新たな科学的機会が生まれています。しかし、既存の確率ネットワークモデルの多くは、このような大規模な規模で展開することが困難または不可能です。私たちは、重複コミュニティ検出を主要なアプリケーションとして見据え、インターネット規模のネットワークの潜在空間をモデル化および推論するためのスケーラブルなアプローチを提案します。ネットワークを三角形のモチーフの袋として簡潔に表現し、簡潔な統計モデルを開発し、効率的な確率変分推論アルゴリズムを導出し、それをPetuumパラメータサーバーシステムを介して分散クラスタープログラムとして実装することで、最大1億のノードと1000のコミュニティを持つ実際のネットワークで、5台のマシンで40時間未満で重複コミュニティ検出を実証します。他の最先端の確率的ネットワーク手法と比較すると、私たちの方法は数桁高速であり、重複コミュニティの検出において同等以上の精度を備えています。

Iterative Regularization for Learning with Convex Loss Functions
凸損失関数を用いた学習のための反復正則化

We consider the problem of supervised learning with convex loss functions and propose a new form of iterative regularization based on the subgradient method. Unlike other regularization approaches, in iterative regularization no constraint or penalization is considered, and generalization is achieved by (early) stopping an empirical iteration. We consider a nonparametric setting, in the framework of reproducing kernel Hilbert spaces, and prove consistency and finite sample bounds on the excess risk under general regularity conditions. Our study provides a new class of efficient regularized learning algorithms and gives insights on the interplay between statistics and optimization in machine learning.

私たちは、凸損失関数を用いた教師あり学習の問題を考え、サブグラジエント法に基づく新しい形の反復正則化を提案します。他の正則化アプローチとは異なり、反復正則化では制約やペナルティは考慮されず、汎化は経験的な反復を(早期に)停止することによって達成されます。再生カーネルヒルベルト空間の枠組みでノンパラメトリック設定を検討し、一般的な規則性条件下での一致性と過剰リスクの有限サンプル境界を証明します。私たちの研究は、新しいクラスの効率的な正則化学習アルゴリズムを提供し、機械学習における統計と最適化の相互作用に関する洞察を提供します。

Scaling-up Empirical Risk Minimization: Optimization of Incomplete $U$-statistics
スケールアップ経験的リスク最小化:不完全な$U$統計の最適化

In a wide range of statistical learning problems such as ranking, clustering or metric learning among others, the risk is accurately estimated by $U$-statistics of degree $d\geq 1$, i.e. functionals of the training data with low variance that take the form of averages over $k$-tuples. From a computational perspective, the calculation of such statistics is highly expensive even for a moderate sample size $n$, as it requires averaging $O(n^d)$ terms. This makes learning procedures relying on the optimization of such data functionals hardly feasible in practice. It is the major goal of this paper to show that, strikingly, such empirical risks can be replaced by drastically computationally simpler Monte-Carlo estimates based on $O(n)$ terms only, usually referred to as incomplete $U$-statistics, without damaging the $O_{\mathbb{P}}(1/\sqrt{n})$ learning rate of Empirical Risk Minimization (ERM) procedures. For this purpose, we establish uniform deviation results describing the error made when approximating a $U$-process by its incomplete version under appropriate complexity assumptions. Extensions to model selection, fast rate situations and various sampling techniques are also considered, as well as an application to stochastic gradient descent for ERM. Finally, numerical examples are displayed in order to provide strong empirical evidence that the approach we promote largely surpasses more naive subsampling techniques.

ランキング、クラスタリング、メトリック学習などの幅広い統計学習問題では、次数$d\geq 1$の$U$統計、つまり、$k$タプルにわたる平均の形をとる低分散のトレーニングデータの汎関数によってリスクが正確に推定されます。計算の観点から見ると、このような統計の計算には$O(n^d)$個の項の平均を取る必要があるため、中程度のサンプルサイズ$n$でも非常にコストがかかります。このため、このようなデータ汎関数の最適化に依存する学習手順は、実際にはほとんど実行できません。この論文の主な目的は、驚くべきことに、そのような経験的リスクは、経験的リスク最小化（ERM）手順の$O_{\mathbb{P}}(1/\sqrt{n})$学習率を損なうことなく、通常不完全$U$統計と呼ばれる、$O(n)$項のみに基づく大幅に計算が簡単なモンテカルロ推定値で置き換えることができることを示すことです。この目的のために、適切な複雑性の仮定の下で、不完全バージョンで$U$プロセスを近似するときに発生する誤差を記述する一様偏差の結果を確立します。モデル選択、高速レート状況、さまざまなサンプリング手法への拡張、およびERMの確率的勾配降下法への適用も検討されます。最後に、私たちが推進するアプローチがより単純なサブサンプリング手法を大幅に上回るという強力な経験的証拠を提供するために、数値例を示します。
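
To make the cost contrast in the abstract concrete, the following sketch compares a complete degree-2 $U$-statistic, which averages over all pairs, with an incomplete one that averages over only $n$ randomly drawn pairs. It is an illustration under stated assumptions, not the paper's experimental setup: the kernel (half the squared difference, whose complete $U$-statistic is the unbiased sample variance), the sample size, the seed, and all function names are chosen here for the example.

```python
import itertools
import random
import statistics


def complete_u_statistic(sample, kernel):
    """Average the kernel over all unordered pairs: O(n^2) terms for degree 2."""
    pairs = list(itertools.combinations(sample, 2))
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)


def incomplete_u_statistic(sample, kernel, num_pairs, rng):
    """Monte Carlo average over num_pairs randomly drawn pairs of distinct indices."""
    n = len(sample)
    total = 0.0
    for _ in range(num_pairs):
        i, j = rng.sample(range(n), 2)  # two distinct indices, redrawn each time
        total += kernel(sample[i], sample[j])
    return total / num_pairs


def half_squared_difference(a, b):
    # Degree-2 kernel whose complete U-statistic equals the unbiased sample variance.
    return 0.5 * (a - b) ** 2


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(400)]

full = complete_u_statistic(sample, half_squared_difference)      # 79,800 terms
approx = incomplete_u_statistic(sample, half_squared_difference,
                                num_pairs=len(sample), rng=rng)   # 400 terms
reference = statistics.variance(sample)
print(full, approx, reference)
```

Here `full` coincides with the sample variance up to rounding, while `approx` uses about 0.5% of the terms and carries additional Monte Carlo noise.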

Distributed Coordinate Descent Method for Learning with Big Data
ビッグデータを用いた学習のための分散座標降下法

In this paper we develop and analyze Hydra: HYbriD cooRdinAte descent method for solving loss minimization problems with big data. We initially partition the coordinates (features) and assign each partition to a different node of a cluster. At every iteration, each node picks a random subset of the coordinates from those it owns, independently from the other computers, and in parallel computes and applies updates to the selected coordinates based on a simple closed-form formula. We give bounds on the number of iterations sufficient to approximately solve the problem with high probability, and show how it depends on the data and on the partitioning. We perform numerical experiments with a LASSO instance described by a 3TB matrix.

この論文では、ビッグデータによる損失最小化問題を解決するためのHydra: HYbriD cooRdinAte降下法を開発・分析します。最初に座標(特徴)を分割し、各パーティションをクラスターの異なるノードに割り当てます。反復ごとに、各ノードは、他のコンピューターとは独立して、所有する座標から座標のランダムなサブセットを選択し、単純な閉形式の式に基づいて選択した座標に更新を並列に計算して適用します。問題を高い確率でほぼ解くのに十分な反復回数に限界を与え、それがデータと分割にどのように依存するかを示します。3TBの行列で記述されたLASSOインスタンスを使用して数値実験を行います。

Learning Algorithms for Second-Price Auctions with Reserve
リザーブを使用したセカンドプライスオークションの学習アルゴリズム

Second-price auctions with reserve play a critical role in the revenue of modern search engine and popular online sites since the revenue of these companies often directly depends on the outcome of such auctions. The choice of the reserve price is the main mechanism through which the auction revenue can be influenced in these electronic markets. We cast the problem of selecting the reserve price to optimize revenue as a learning problem and present a full theoretical analysis dealing with the complex properties of the corresponding loss function. We further give novel algorithms for solving this problem and report the results of several experiments in both synthetic and real-world data demonstrating their effectiveness.

リザーブ付きのセカンドプライスオークションは、これらの企業の収益がオークションの結果に直接依存することが多いため、最新の検索エンジンや人気のあるオンラインサイトの収益に重要な役割を果たします。最低価格の選択は、これらの電子市場でオークション収益に影響を与えることができる主要なメカニズムです。収益を最適化するための最低価格の選択の問題を学習問題としてキャストし、対応する損失関数の複雑な特性を扱う完全な理論的分析を提示します。さらに、この問題を解決するための新しいアルゴリズムを提供し、合成データと実世界データの両方でいくつかの実験の結果を報告し、その有効性を実証します。
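
The objective behind this learning problem can be written down directly from the auction rules. The sketch below computes the seller's revenue for a given reserve price and picks the best reserve from a candidate grid by brute force. It is a minimal illustration, not one of the paper's algorithms; the function names and the three toy auctions are assumptions made for the example.

```python
def revenue(reserve, bids):
    """Seller revenue in a second-price auction with a reserve price."""
    ordered = sorted(bids, reverse=True)
    highest = ordered[0] if ordered else 0.0
    second = ordered[1] if len(ordered) > 1 else 0.0
    if reserve > highest:
        return 0.0  # no bid meets the reserve, so the item goes unsold
    return max(second, reserve)  # winner pays the larger of second bid and reserve


def average_revenue(reserve, auctions):
    """Empirical revenue of one reserve price over a set of observed auctions."""
    return sum(revenue(reserve, bids) for bids in auctions) / len(auctions)


def best_reserve(auctions, candidates):
    """Brute-force search over candidate reserve prices."""
    return max(candidates, key=lambda r: average_revenue(r, auctions))


# Hypothetical bid data: each inner list holds the bids of one auction.
auctions = [[10.0, 4.0], [7.0, 5.0], [3.0, 1.0]]
candidates = sorted({b for bids in auctions for b in bids})
for r in candidates:
    print(r, average_revenue(r, auctions))
print("best reserve:", best_reserve(auctions, candidates))
```

On this toy data the average revenue rises, falls, and rises again as the reserve increases, and it drops to zero as soon as the reserve exceeds the highest bid. An objective that is neither concave nor continuous in the reserve is one reason a plain gradient method does not apply directly.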

An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning
オフポリシー時間差分学習の問題に対する強調的アプローチ

In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD($\lambda$)’s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD($\lambda$), and GQ($\lambda$). Compared to these methods, our emphatic TD($\lambda$) is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states.

この論文では、パラメトリック時間差分(TD)学習アルゴリズムのパフォーマンスを向上させるというアイデアを紹介します。これは、異なる時間ステップでの更新を選択的に強調または強調しないことによって行われます。特に、線形TD($\lambda$)の更新の強調を特定の方法で変化させると、オフポリシートレーニングの下で期待される更新が安定することを示します。関数近似パラメータの数に対して線形のステップごとの計算量でこれを達成した唯一の以前のモデルフリーTD方法は、TDC、GTD($\lambda$)、およびGQ($\lambda$)を含む勾配TDファミリの方法です。これらの方法と比較すると、強調されたTD($\lambda$)はよりシンプルで使いやすいです。学習されるパラメーターベクトルとステップサイズパラメーターがそれぞれ1つだけです。私たちの扱いには、一般的な状態依存の割引関数とブートストラップ関数、およびさまざまな状態を正確に評価するためのさまざまな関心の程度を指定する方法が含まれます。

Challenges in multimodal gesture recognition
マルチモーダルジェスチャー認識の課題

This paper surveys the state of the art on multimodal gesture recognition and introduces the JMLR special topic on gesture recognition 2011-2015. We began right at the start of the Kinect revolution when inexpensive infrared cameras providing image depth recordings became available. We published papers using this technology and other more conventional methods, including regular video cameras, to record data, thus providing a good overview of uses of machine learning and computer vision using multimodal data in this area of application. Notably, we organized a series of challenges and made available several datasets we recorded for that purpose, including tens of thousands of videos, which are available to conduct further research. We also overview recent state of the art works on gesture recognition based on a proposed taxonomy for gesture recognition, discussing challenges and future lines of research.

この論文では、マルチモーダルジェスチャー認識の最先端を調査し、ジェスチャー認識2011-2015に関するJMLRの特別トピックを紹介します。私たちは、画像の深度記録を提供する安価な赤外線カメラが利用可能になった、Kinect革命のまさに始まりの時期に活動を開始しました。私たちは、この技術と、通常のビデオカメラを含む他のより一般的な方法を使用してデータを記録する論文を発表し、このアプリケーション分野でのマルチモーダルデータを使用した機械学習とコンピュータービジョンの使用についての概要を示しています。特に、私たちは一連のチャレンジ(コンペティション)を開催し、その目的のために記録したいくつかのデータセット(数万本のビデオを含む)を公開しました。これらは今後の研究に利用できます。また、ジェスチャー認識の提案された分類法に基づくジェスチャー認識に関する最近の最先端の研究を概観し、課題と将来の研究の方向性について議論します。

Exact Inference on Gaussian Graphical Models of Arbitrary Topology using Path-Sums
パスサムを用いた任意のトポロジーのガウスグラフモデルにおける厳密推論

We present the path-sum formulation for exact statistical inference of marginals on Gaussian graphical models of arbitrary topology. The path-sum formulation gives the covariance between each pair of variables as a branched continued fraction of finite depth and breadth. Our method originates from the closed-form resummation of infinite families of terms of the walk-sum representation of the covariance matrix. We prove that the path-sum formulation always exists for models whose covariance matrix is positive definite: i.e. it is valid for both walk-summable and non-walk-summable graphical models of arbitrary topology. We show that for graphical models on trees the path-sum formulation is equivalent to Gaussian belief propagation. We also recover, as a corollary, an existing result that uses determinants to calculate the covariance matrix. We show that the path-sum formulation is valid for arbitrary partitions of the inverse covariance matrix. We give detailed examples demonstrating our results.

私たちは、任意のトポロジーのガウスグラフィカルモデル上の周辺分布の厳密な統計的推論のためのパスサム定式化を提示します。パスサム定式化は、各変数ペア間の共分散を、有限の深さと幅を持つ分岐連分数として与えます。私たちの方法は、共分散行列のウォークサム表現に現れる項の無限族を閉形式で再総和することに由来します。共分散行列が正定値であるモデルに対してパスサム定式化が常に存在すること、すなわち、任意のトポロジーのウォークサム可能およびウォークサム不可能なグラフィカルモデルの両方に対して有効であることを証明します。木上のグラフィカルモデルに対しては、パスサム定式化がガウス確率伝播法(Gaussian belief propagation)と同等であることを示します。また、系として、行列式を使用して共分散行列を計算する既存の結果を再導出します。さらに、パスサム定式化が逆共分散行列の任意の分割に対して有効であることを示します。最後に、これらの結果を示す詳細な例を挙げます。

On the Characterization of a Class of Fisher-Consistent Loss Functions and its Application to Boosting
フィッシャー整合損失関数のクラスの特性評価とそのブースティングへの応用について

Accurate classification of categorical outcomes is essential in a wide range of applications. Due to computational issues with minimizing the empirical 0/1 loss, Fisher consistent losses have been proposed as viable proxies. However, even with smooth losses, direct minimization remains a daunting task. To approximate such a minimizer, various boosting algorithms have been suggested. For example, with exponential loss, the AdaBoost algorithm (Freund and Schapire, 1995) is widely used for two-class problems and has been extended to the multi-class setting (Zhu et al., 2009). Alternative loss functions, such as the logistic and the hinge losses, and their corresponding boosting algorithms have also been proposed (Zou et al., 2008; Wang, 2012). In this paper we demonstrate that a broad class of losses, including non-convex functions, achieve Fisher consistency, and in addition can be used for explicit estimation of the conditional class probabilities. Furthermore, we provide a generic boosting algorithm that is not loss-specific. Extensive simulation results suggest that the proposed boosting algorithms could outperform existing methods with properly chosen losses and bags of weak learners.

カテゴリ結果の正確な分類は、幅広いアプリケーションで不可欠です。経験的な0/1損失を最小化する際の計算上の問題のため、フィッシャー整合損失が実行可能なプロキシとして提案されてきました。しかし、滑らかな損失であっても、直接最小化は困難な作業のままです。このような最小化を近似するために、さまざまなブースティングアルゴリズムが提案されています。たとえば、指数損失の場合、AdaBoostアルゴリズム(FreundおよびSchapire、1995)は2クラスの問題に広く使用されており、マルチクラス設定に拡張されています(Zhuら、2009)。ロジスティック損失やヒンジ損失などの代替損失関数と、それに対応するブースティングアルゴリズムも提案されています(Zouら、2008; Wang、2012)。この論文では、非凸関数を含む幅広いクラスの損失がフィッシャー整合を実現し、さらに条件付きクラス確率の明示的な推定に使用できることを示します。さらに、損失に特化しない汎用ブースティングアルゴリズムも提供します。広範なシミュレーション結果から、損失と弱学習器の集合(バッグ)を適切に選択すれば、提案されたブースティングアルゴリズムは既存の方法よりも優れたパフォーマンスを発揮し得ることが示唆されています。

Compressed Gaussian Process for Manifold Regression
多様体回帰のための圧縮ガウス過程

Nonparametric regression for large numbers of features ($p$) is an increasingly important problem. If the sample size $n$ is massive, a common strategy is to partition the feature space, and then separately apply simple models to each partition set. This is not ideal when $n$ is modest relative to $p$, and we propose an alternative approach relying on random compression of the feature vector combined with Gaussian process regression. The proposed approach is particularly motivated by the setting in which the response is conditionally independent of the features given the projection to a low dimensional manifold. Conditionally on the random compression matrix and a smoothness parameter, the posterior distribution for the regression surface and posterior predictive distributions are available analytically. Running the analysis in parallel for many random compression matrices and smoothness parameters, model averaging is used to combine the results. The algorithm can be implemented rapidly even in very large $p$ and moderately large $n$ nonparametric regression, has strong theoretical justification, and is found to yield state of the art predictive performance.

多数の特徴($p$)に対するノンパラメトリック回帰は、ますます重要な問題になっています。サンプルサイズ$n$が巨大な場合、一般的な戦略は、特徴空間を分割し、各分割セットに個別に単純なモデルを適用することです。$n$が$p$に対して比較的小さい場合、これは理想的ではありません。そこで、ガウス過程回帰と組み合わせた特徴ベクトルのランダム圧縮に依存する代替アプローチを提案します。提案されたアプローチは、低次元多様体への投影が与えられた場合に、応答が特徴から条件付きで独立しているという設定に特に動機付けられています。ランダム圧縮行列と滑らかさのパラメーターを条件として、回帰面の事後分布と事後予測分布が解析的に利用できます。多くのランダム圧縮行列と滑らかさのパラメーターに対して並列に分析を実行し、モデルの平均化を使用して結果を結合します。このアルゴリズムは、非常に大きな$p$および中程度に大きい$n$のノンパラメトリック回帰でも迅速に実装でき、強力な理論的正当性があり、最先端の予測パフォーマンスが得られることがわかっています。

An Information-Theoretic Analysis of Thompson Sampling
トンプソンサンプリングの情報理論的分析

We provide an information-theoretic analysis of Thompson sampling that applies across a broad range of online optimization problems in which a decision-maker must learn from partial feedback. This analysis inherits the simplicity and elegance of information theory and leads to regret bounds that scale with the entropy of the optimal-action distribution. This strengthens preexisting results and yields new insight into how information improves performance.

私たちは、意思決定者が部分的なフィードバックから学ばなければならない幅広いオンライン最適化問題に適用されるトンプソンサンプリングの情報理論的分析を提供します。この分析は、情報理論の単純さと優雅さを継承し、最適行動分布のエントロピーに応じてスケールするリグレット上界を導きます。これにより、既存の結果が強化され、情報がどのようにパフォーマンスを向上させるかについての新たな洞察が得られます。

Practical Kernel-Based Reinforcement Learning
実用的なカーネルベースの強化学習

Kernel-based reinforcement learning (KBRL) stands out among approximate reinforcement learning algorithms for its strong theoretical guarantees. By casting the learning problem as a local kernel approximation, KBRL provides a way of computing a decision policy which converges to a unique solution and is statistically consistent. Unfortunately, the model constructed by KBRL grows with the number of sample transitions, resulting in a computational cost that precludes its application to large-scale or on-line domains. In this paper we introduce an algorithm that turns KBRL into a practical reinforcement learning tool. Kernel-based stochastic factorization (KBSF) builds on a simple idea: when a transition probability matrix is represented as the product of two stochastic matrices, one can swap the factors of the multiplication to obtain another transition matrix, potentially much smaller than the original, which retains some fundamental properties of its precursor. KBSF exploits such an insight to compress the information contained in KBRL’s model into an approximator of fixed size. This makes it possible to build an approximation considering both the difficulty of the problem and the associated computational cost. KBSF’s computational complexity is linear in the number of sample transitions, which is the best one can do without discarding data. Moreover, the algorithm’s simple mechanics allow for a fully incremental implementation that makes the amount of memory used independent of the number of sample transitions. The result is a kernel-based reinforcement learning algorithm that can be applied to large-scale problems in both off-line and on-line regimes. We derive upper bounds for the distance between the value functions computed by KBRL and KBSF using the same data. 
We also prove that it is possible to control the magnitude of the variables appearing in our bounds, which means that, given enough computational resources, we can make KBSF’s value function as close as desired to the value function that would be computed by KBRL using the same set of sample transitions. The potential of our algorithm is demonstrated in an extensive empirical study in which KBSF is applied to difficult tasks based on real-world data. Not only does KBSF solve problems that had never been solved before, but it also significantly outperforms other state-of-the-art reinforcement learning algorithms on the tasks studied.

カーネルベースの強化学習(KBRL)は、強力な理論的保証により、近似強化学習アルゴリズムの中でも際立っています。学習問題をローカルカーネル近似としてキャストすることにより、KBRLは、一意のソリューションに収束し、統計的に一貫性のある決定ポリシーを計算する方法を提供します。残念ながら、KBRLによって構築されたモデルはサンプル遷移の数とともに大きくなり、計算コストが大きくなり、大規模またはオンラインドメインへの適用が妨げられます。この論文では、KBRLを実用的な強化学習ツールに変えるアルゴリズムを紹介します。カーネルベースの確率的因数分解(KBSF)は、遷移確率行列が2つの確率的行列の積として表される場合、乗算の因数を交換することで、元の行列よりもはるかに小さい可能性のある別の遷移行列を取得できるという単純なアイデアに基づいていますが、この行列は、先行行列のいくつかの基本的な特性を保持しています。KBSFは、このような洞察を利用して、KBRLモデルに含まれる情報を固定サイズの近似値に圧縮します。これにより、問題の難しさと関連する計算コストの両方を考慮した近似値を構築できます。KBSFの計算複雑度はサンプル遷移の数に比例しており、これはデータを破棄せずに実行できる最良の方法です。さらに、アルゴリズムのシンプルなメカニズムにより、使用されるメモリの量がサンプル遷移の数に依存しない完全な増分実装が可能になります。その結果、オフラインとオンラインの両方の環境で大規模な問題に適用できるカーネルベースの強化学習アルゴリズムが実現します。同じデータを使用してKBRLとKBSFによって計算された値関数間の距離の上限を導出します。また、境界内に現れる変数の大きさを制御できることも証明します。つまり、十分な計算リソースがあれば、KBSFの値関数を、同じサンプル遷移セットを使用してKBRLによって計算される値関数に望みどおりに近づけることができます。私たちのアルゴリズムの可能性は、KBSFを実際のデータに基づく困難なタスクに適用する広範な実証研究で実証されています。KBSFは、これまで解決できなかった問題を解決するだけでなく、研究対象のタスクにおいて他の最先端の強化学習アルゴリズムを大幅に上回るパフォーマンスを発揮します。
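The factor-swapping idea behind KBSF can be checked on a toy example. The pure-Python sketch below is only an illustration, not the KBSF algorithm itself: the matrices `D` and `K` are made-up values, and `matmul` and `is_stochastic` are hypothetical helpers. It shows that if a 4×4 transition matrix factors as `D @ K` with `D` (4×2) and `K` (2×4) both row-stochastic, then the swapped product `K @ D` is a 2×2 row-stochastic matrix.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def is_stochastic(m, tol=1e-12):
    """True if every row is non-negative and sums to one."""
    return all(min(row) >= 0.0 and abs(sum(row) - 1.0) < tol for row in m)


# Illustrative factors (assumed values): D is 4x2, K is 2x4, both row-stochastic.
D = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.25, 0.75]]
K = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.25, 0.25, 0.5]]

P = matmul(D, K)      # original 4x4 transition matrix
P_bar = matmul(K, D)  # swapped factors: a smaller 2x2 transition matrix
```

The swapped matrix has one row per column of `D`, so its size is set by the inner dimension of the factorization rather than by the number of sample transitions.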

Bayesian Policy Gradient and Actor-Critic Algorithms
ベイズ方策勾配とアクター・クリティック・アルゴリズム

Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Many conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. The policy is improved by adjusting the parameters in the direction of the gradient estimate. Since Monte-Carlo methods tend to have high variance, a large number of samples is required to attain accurate estimates, resulting in slow convergence. In this paper, we first propose a Bayesian framework for policy gradient, based on modeling the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates, namely, the gradient covariance, are provided at little extra cost. Since the proposed Bayesian framework considers system trajectories as its basic observable unit, it does not require the dynamics within trajectories to be of any particular form, and thus, can be easily extended to partially observable problems. On the downside, it cannot take advantage of the Markov property when the system is Markovian. To address this issue, we proceed to supplement our Bayesian policy gradient framework with a new actor-critic learning model in which a Bayesian class of non-parametric critics, based on Gaussian process temporal difference learning, is used. Such critics model the action-value function as a Gaussian process, allowing Bayes’ rule to be used in computing the posterior distribution over action-value functions, conditioned on the observed data. Appropriate choices of the policy parameterization and of the prior covariance (kernel) between action-values allow us to obtain closed-form expressions for the posterior distribution of the gradient of the expected return with respect to the policy parameters.
We perform detailed experimental comparisons of the proposed Bayesian policy gradient and actor-critic algorithms with classic Monte-Carlo based policy gradient methods, as well as with each other, on a number of reinforcement learning problems.

ポリシー勾配法は、パフォーマンス勾配推定に従ってパラメータ化されたポリシーを適応させる強化学習アルゴリズムです。従来のポリシー勾配法の多くは、モンテカルロ法を使用してこの勾配を推定します。ポリシーは、勾配推定の方向にパラメータを調整することで改善されます。モンテカルロ法は分散が大きい傾向があるため、正確な推定値を得るためには多数のサンプルが必要となり、収束が遅くなります。この論文では、まず、ポリシー勾配をガウス過程としてモデル化することに基づく、ポリシー勾配のベイジアンフレームワークを提案します。これにより、正確な勾配推定値を得るために必要なサンプル数が削減されます。さらに、自然勾配の推定値と、勾配推定値の不確実性の尺度、つまり勾配共分散が、ほとんど追加コストなしで提供されます。提案されたベイジアンフレームワークは、システムトラジェクトリをその基本的な観測可能単位と見なすため、トラジェクトリ内のダイナミクスが特定の形式である必要はなく、したがって、部分的に観測可能な問題に簡単に拡張できます。欠点としては、システムがマルコフの場合、マルコフ特性を利用できないことです。この問題に対処するために、ベイジアンポリシー勾配フレームワークを、ガウス過程の時間差分学習に基づく非パラメトリッククリティックのベイジアンクラスを使用する新しいアクタークリティック学習モデルで補完します。このようなクリティックは、アクション値関数をガウス過程としてモデル化し、観測データに基づいて、アクション値関数の事後分布を計算する際にベイズの規則を使用できるようにします。ポリシーパラメータ化とアクション値間の事前共分散(カーネル)を適切に選択すると、ポリシーパラメータに関する期待収益の勾配の事後分布の閉じた形式の表現を取得できます。提案されたベイジアンポリシー勾配およびアクタークリティックアルゴリズムと、従来のモンテカルロベースのポリシー勾配法、および互いの詳細な実験比較を、いくつかの強化学習問題で実行します。

Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches
画像パッチを比較するための畳み込みニューラルネットワークの学習によるステレオマッチング

We present a method for extracting depth information from a rectified image pair. Our approach focuses on the first stage of many stereo algorithms: the matching cost computation. We approach the problem by learning a similarity measure on small image patches using a convolutional neural network. Training is carried out in a supervised manner by constructing a binary classification data set with examples of similar and dissimilar pairs of patches. We examine two network architectures for this task: one tuned for speed, the other for accuracy. The output of the convolutional neural network is used to initialize the stereo matching cost. A series of post-processing steps follow: cross-based cost aggregation, semiglobal matching, a left-right consistency check, subpixel enhancement, a median filter, and a bilateral filter. We evaluate our method on the KITTI 2012, KITTI 2015, and Middlebury stereo data sets and show that it outperforms other approaches on all three data sets.

私たちは、平行化(rectify)された画像ペアから深度情報を抽出する方法を紹介します。私たちのアプローチは、多くのステレオアルゴリズムの最初の段階であるマッチングコスト計算に焦点を当てています。畳み込みニューラルネットワークを使用して、小さな画像パッチの類似性尺度を学習することにより、この問題にアプローチします。トレーニングは、類似したパッチのペアと類似していないパッチのペアの例からなる二値分類データセットを構築することにより、教師ありの方法で実行されます。このタスクでは、速度を重視して調整されたものと精度を重視して調整されたものの2つのネットワークアーキテクチャを検討します。畳み込みニューラルネットワークの出力は、ステレオマッチングコストを初期化するために使用されます。クロスベースのコスト集計、セミグローバルマッチング、左右の一貫性チェック、サブピクセルエンハンスメント、メディアンフィルター、バイラテラルフィルターの一連の後処理手順が続きます。KITTI 2012、KITTI 2015、およびMiddleburyステレオデータセットでこの方法を評価し、3つのデータセットすべてで他のアプローチよりも優れていることを示しています。

StructED: Risk Minimization in Structured Prediction
StructED:構造化予測におけるリスク最小化

Structured tasks are distinctive: each task has its own measure of performance, such as the word error rate in speech recognition, the BLEU score in machine translation, the NDCG score in information retrieval, or the intersection-over-union score in visual object segmentation. This paper presents StructED, a software package for learning structured prediction models with training methods that aim at optimizing the task measure of performance. The package was written in Java and released under the MIT license. It can be downloaded from adiyoss.github.io/StructED.

構造化されたタスクには特徴があり、音声認識の単語エラー率、機械翻訳のBLEUスコア、情報検索のNDCGスコア、視覚オブジェクトセグメンテーションのIoU(intersection-over-union)スコアなど、各タスクには独自のパフォーマンス測定値があります。この論文では、タスクのパフォーマンス測定値の最適化を目的としたトレーニング方法を使用して構造化予測モデルを学習するためのソフトウェアパッケージであるStructEDを紹介します。パッケージはJavaで書かれ、MITライセンスの下でリリースされました。adiyoss.github.io/StructEDからダウンロードできます。

Convergence of an Alternating Maximization Procedure
交互最大化手順の収束

We derive two convergence results for a sequential alternating maximization procedure to approximate the maximizer of random functionals such as the realized log likelihood in MLE estimation. We manage to show that the sequence attains the same deviation properties as shown for the profile M-estimator by Andresen and Spokoiny (2013), that is, a finite sample Wilks and Fisher theorem. Further, under slightly stronger smoothness constraints on the random functional, we can show nearly linear convergence to the global maximizer if the starting point for the procedure is well chosen.

私たちは、MLE推定で実現された対数尤度などのランダム汎関数の最大化器を近似するための逐次交互最大化手順の2つの収束結果を導き出します。Andresen and Spokoiny (2013)によるプロファイルM-推定量で示されたのと同じ偏差特性、つまり有限サンプルのWilksとFisherの定理をシーケンスが達成することを示すことができました。さらに、ランダム汎関数のわずかに強い滑らかさ制約の下で、手続きの開始点が適切に選択されている場合、グローバル最大化器へのほぼ線形の収束を示すことができます。

The Statistical Performance of Collaborative Inference
協調推論の統計的性能

The statistical analysis of massive and complex data sets will require the development of algorithms that depend on distributed computing and collaborative inference. Inspired by this, we propose a collaborative framework that aims to estimate the unknown mean $\theta$ of a random variable $X$. In the model we present, a certain number of calculation units, distributed across a communication network represented by a graph, participate in the estimation of $\theta$ by sequentially receiving independent data from $X$ while exchanging messages via a stochastic matrix $A$ defined over the graph. We give precise conditions on the matrix $A$ under which the statistical precision of the individual units is comparable to that of a (gold standard) virtual centralized estimate, even though each unit does not have access to all of the data. We show in particular the fundamental role played by both the non-trivial eigenvalues of $A$ and the Ramanujan class of expander graphs, which provide remarkable performance for moderate algorithmic cost.

大規模で複雑なデータセットの統計分析には、分散コンピューティングと協調推論に依存するアルゴリズムの開発が必要です。これに着想を得て、確率変数$X$の未知の平均$\theta$を推定することを目的とした協調フレームワークを提案します。私たちが提示するモデルでは、グラフで表される通信ネットワーク全体に分散された一定数の計算ユニットが、グラフ上に定義された確率行列$A$を介してメッセージを交換しながら、$X$から独立したデータを順番に受信することにより、$\theta$の推定に参加します。各ユニットがすべてのデータにアクセスできない場合でも、個々のユニットの統計精度が(ゴールドスタンダードの)仮想集中推定値に匹敵するための、行列$A$に関する正確な条件を示します。特に、$A$の非自明な固有値と、中程度のアルゴリズムコストで優れたパフォーマンスを提供するRamanujanクラスのエクスパンダーグラフの両方が果たす基本的な役割を示します。

DSA: Decentralized Double Stochastic Averaging Gradient Algorithm
DSA:分散型二重確率平均化勾配アルゴリズム

This paper considers optimization problems where nodes of a network have access to summands of a global objective. Each of these local objectives is further assumed to be an average of a finite set of functions. The motivation for this setup is to solve large scale machine learning problems where elements of the training set are distributed to multiple computational elements. The decentralized double stochastic averaging gradient (DSA) algorithm is proposed as a solution alternative that relies on: (i) The use of local stochastic averaging gradients. (ii) Determination of descent steps as differences of consecutive stochastic averaging gradients. Strong convexity of local functions and Lipschitz continuity of local gradients are shown to guarantee linear convergence of the sequence generated by DSA in expectation. Local iterates are further shown to approach the optimal argument for almost all realizations. The expected linear convergence of DSA is in contrast to the sublinear rate characteristic of existing methods for decentralized stochastic optimization. Numerical experiments on a logistic regression problem illustrate reductions in convergence time and number of feature vectors processed until convergence relative to these other alternatives.

この論文では、ネットワークのノードがグローバル目的関数の各項にアクセスできる最適化問題について検討します。これらのローカル目的関数はそれぞれ、有限個の関数の平均であるとさらに仮定されます。この設定の目的は、トレーニングセットの要素が複数の計算要素に分散されている大規模な機械学習問題を解決することです。分散型二重確率平均勾配(DSA)アルゴリズムは、次に依存する代替解法として提案されています。(i)ローカル確率平均勾配の使用。(ii)連続する確率平均勾配の差としての降下ステップの決定。ローカル関数の強凸性とローカル勾配のLipschitz連続性により、DSAによって生成される系列が期待値の意味で線形収束することが保証されることを示します。さらに、ローカル反復は、ほとんどすべての実現に対して最適解に近づくことが示されています。期待値の意味でのDSAの線形収束は、分散型確率最適化の既存の方法に特徴的な劣線形の収束率とは対照的です。ロジスティック回帰問題に関する数値実験では、他の代替案と比較して、収束時間と収束までに処理される特徴ベクトルの数が削減されることがわかります。

Probabilistic Low-Rank Matrix Completion from Quantized Measurements
量子化測定からの確率的低ランク行列補完

We consider the recovery of a low rank real-valued matrix $M$ given a subset of noisy discrete (or quantized) measurements. Such problems arise in several applications such as collaborative filtering, learning and content analytics, and sensor network localization. We consider constrained maximum likelihood estimation of $M$, under a constraint on the entry-wise infinity-norm of $M$ and an exact rank constraint. We provide upper bounds on the Frobenius norm of matrix estimation error under this model. Previous theoretical investigations have focused on binary (1-bit) quantizers, and have been based on convex relaxation of the rank. Compared to the existing binary results, our performance upper bound has a faster convergence rate with matrix dimensions when the fraction of revealed observations is fixed. We also propose a globally convergent optimization algorithm based on low rank factorization of $M$ and validate the method on synthetic and real data, with improved performance over previous methods.

私たちは、ノイズの多い離散（または量子化）測定のサブセットが与えられた場合の、低ランクの実数値行列$M$の復元について検討します。このような問題は、協調フィルタリング、学習およびコンテンツ分析、センサーネットワークの位置特定など、いくつかのアプリケーションで発生します。$M$のエントリごとの無限大ノルムの制約と正確なランク制約の下で、$M$の制約付き最大尤度推定を検討します。このモデルにおける行列推定誤差のフロベニウスノルムの上限を示します。これまでの理論的調査はバイナリ（1ビット）量子化器に焦点を当てており、ランクの凸緩和に基づいています。既存のバイナリ結果と比較して、明らかにされた観測の割合が固定されている場合、私たちのパフォーマンスの上限は行列次元でより速い収束率を持っています。また、$M$の低ランク因数分解に基づくグローバルに収束する最適化アルゴリズムを提案し、合成データと実際のデータでこの方法を検証し、以前の方法よりもパフォーマンスが向上しました。

Domain-Adversarial Training of Neural Networks
ニューラルネットワークのドメイン敵対的学習

We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As the training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.

私たちは、トレーニング時とテスト時のデータが類似しているが異なる分布から取得されるドメイン適応のための新しい表現学習アプローチを紹介します。我々のアプローチは、効果的なドメイン転送を実現するには、トレーニング(ソース)ドメインとテスト(ターゲット)ドメインを区別できない特徴に基づいて予測を行う必要があることを示唆するドメイン適応の理論から直接ヒントを得ています。このアプローチは、ソースドメインのラベル付きデータとターゲットドメインのラベルなしデータ(ラベル付きターゲットドメインデータは不要)でトレーニングされるニューラルネットワークアーキテクチャのコンテキストでこのアイデアを実装します。トレーニングが進むにつれて、このアプローチは、(i)ソースドメインの主な学習タスクに対して識別的であり、(ii)ドメイン間のシフトに関しては無差別である特徴の出現を促進します。私たちは、この適応動作が、いくつかの標準レイヤーと新しい勾配反転レイヤーで拡張することで、ほぼすべてのフィードフォワードモデルで実現できることを示します。結果として得られる拡張アーキテクチャは、標準的なバックプロパゲーションと確率的勾配降下法を使用してトレーニングできるため、ディープラーニングパッケージのいずれかを使用してほとんど手間をかけずに実装できます。2つの異なる分類問題(ドキュメント感情分析と画像分類)に対するアプローチの成功を実証し、標準ベンチマークで最先端のドメイン適応パフォーマンスを実現しました。また、人物再識別アプリケーションのコンテキストでの記述子学習タスクに対するアプローチも検証しました。
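The gradient reversal layer named in the abstract is commonly described as a layer that acts as the identity in the forward pass and multiplies the incoming gradient by a negative constant in the backward pass. The framework-free Python sketch below only illustrates that behaviour; the class name `GradientReversal`, the parameter `lam`, and the explicit `forward`/`backward` methods are assumptions made for illustration, not the API of any deep learning package.

```python
class GradientReversal:
    """Toy gradient reversal layer on plain Python lists.

    forward:  returns the features unchanged (identity).
    backward: returns the upstream gradient scaled by -lam, so layers
              before it are pushed to make the domain classifier worse
              while the domain classifier itself still trains normally.
    """

    def __init__(self, lam=1.0):
        self.lam = lam  # illustrative scaling factor (assumed name)

    def forward(self, features):
        return list(features)

    def backward(self, grad_output):
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=0.5)
features_out = layer.forward([1.0, -2.0])  # identity in the forward pass
grad_in = layer.backward([2.0, -4.0])      # sign-flipped, scaled gradient
```

In an actual framework the same effect would be obtained with a custom backward rule, so that standard backpropagation and stochastic gradient descent can be used without further changes.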

Estimation from Pairwise Comparisons: Sharp Minimax Bounds with Topology Dependence
ペアワイズ比較からの推定: トポロジ依存性を持つシャープなミニマックス境界

Data in the form of pairwise comparisons arises in many domains, including preference elicitation, sporting competitions, and peer grading among others. We consider parametric ordinal models for such pairwise comparison data involving a latent vector $w^* \in \mathbb{R}^d$ that represents the “qualities” of the $d$ items being compared; this class of models includes the two most widely used parametric models—the Bradley-Terry-Luce (BTL) and the Thurstone models. Working within a standard minimax framework, we provide tight upper and lower bounds on the optimal error in estimating the quality score vector $w^*$ under this class of models. The bounds depend on the topology of the comparison graph induced by the subset of pairs being compared, via the spectrum of the Laplacian of the comparison graph. Thus, in settings where the subset of pairs may be chosen, our results provide principled guidelines for making this choice. Finally, we compare these error rates to those under cardinal measurement models and show that the error rates in the ordinal and cardinal settings have identical scalings apart from constant pre-factors.

ペア比較の形式のデータは、選好の引き出し、スポーツ競技、ピア採点など、多くの領域で発生します。私たちは、比較される$d$個のアイテムの「品質」を表す潜在ベクトル$w^* \in \mathbb{R}^d$を含む、このようなペア比較データに対するパラメトリック順序モデルを検討します。このクラスのモデルには、最も広く使用されている2つのパラメトリックモデル、Bradley-Terry-Luce (BTL)モデルとThurstoneモデルが含まれます。標準のミニマックスフレームワーク内で作業することで、このクラスのモデルで品質スコアベクトル$w^*$を推定する際の最適誤差の上限と下限を厳密に示します。この上限と下限は、比較グラフのラプラシアンのスペクトルを介して、比較されるペアのサブセットによって誘導される比較グラフのトポロジーに依存します。したがって、ペアのサブセットを選択できる設定では、私たちの結果は、この選択を行うための原則的なガイドラインを提供します。最後に、これらのエラー率を基数測定モデルの場合と比較し、順序設定と基数設定のエラー率は、一定の前因子を除けば同一のスケーリングを持つことを示します。
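For readers unfamiliar with the two models named in the abstract, the following Python sketch gives the usual textbook form of the comparison probabilities: in the BTL model the probability that item i beats item j is the logistic function of the quality difference, and in the Thurstone model it is the standard normal CDF of that difference. The function names are hypothetical, and a unit noise scale is assumed; the paper's exact parameterization may differ.

```python
import math


def btl_win_probability(w_i, w_j):
    """Bradley-Terry-Luce: P(i beats j) = logistic(w_i - w_j)."""
    return 1.0 / (1.0 + math.exp(-(w_i - w_j)))


def thurstone_win_probability(w_i, w_j):
    """Thurstone: P(i beats j) = Phi(w_i - w_j), Phi = standard normal CDF.

    Uses the identity Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    """
    return 0.5 * (1.0 + math.erf((w_i - w_j) / math.sqrt(2.0)))
```

Both probabilities depend on the qualities only through the difference `w_i - w_j`, so the quality vector is identifiable only up to a common shift.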

Structure Discovery in Bayesian Networks by Sampling Partial Orders
部分順序のサンプリングによるベイジアンネットワークにおける構造発見

We present methods based on Metropolis-coupled Markov chain Monte Carlo (MC3) and annealed importance sampling (AIS) for estimating the posterior distribution of Bayesian networks. The methods draw samples from an appropriate distribution of partial orders on the nodes, continued by sampling directed acyclic graphs (DAGs) conditionally on the sampled partial orders. We show that the computations needed for the sampling algorithms are feasible as long as the encountered partial orders have relatively few down-sets. While the algorithms assume suitable modularity properties of the priors, arbitrary priors can be handled by dividing the importance weight of each sampled DAG by the number of topological sorts it has—we give a practical dynamic programming algorithm to compute these numbers. Our empirical results demonstrate that the presented partial-order-based samplers are superior to previous Markov chain Monte Carlo methods, which sample DAGs either directly or via linear orders on the nodes. The results also suggest that the convergence rate of the estimators based on AIS is competitive with that of MC3. Thus AIS is the preferred method, as it enables easier large-scale parallelization and, in addition, supplies good probabilistic lower bound guarantees for the marginal likelihood of the model.

私たちは、ベイジアンネットワークの事後分布を推定するための、メトロポリス結合マルコフ連鎖モンテカルロ(MC3)とアニール重要度サンプリング(AIS)に基づく方法を紹介します。この方法では、ノード上の適切な部分順序の分布からサンプルを抽出し、続いて、サンプリングされた部分順序に基づいて有向非巡回グラフ(DAG)を条件付きでサンプリングします。サンプリングアルゴリズムに必要な計算は、遭遇する部分順序のダウンセットが比較的少ない限り実行可能であることを示します。アルゴリズムは事前分布の適切なモジュール性を前提としていますが、任意の事前分布は、サンプリングされた各DAGの重要度重みを、そのDAGが持つトポロジカルソートの数で割ることによって処理できます。これらの数値を計算するための実用的な動的計画法アルゴリズムを示します。実験結果から、提示された部分順序ベースのサンプラーは、DAGを直接、またはノード上の線形順序を介してサンプリングする以前のマルコフ連鎖モンテカルロ法よりも優れていることがわかります。結果はまた、AISに基づく推定量の収束率がMC3の推定量の収束率に匹敵することを示唆しています。したがって、AISは、大規模な並列化を容易にし、さらにモデルの周辺尤度に対して良好な確率的下限保証を提供するため、推奨される方法です。
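The abstract mentions a dynamic programming algorithm for counting the topological sorts of a DAG without describing it. As a rough illustration of how such a count can be computed, here is a generic textbook dynamic program over subsets of already-placed nodes. It is a sketch under that assumption, not the paper's algorithm, and the function name `count_topological_sorts` is hypothetical. Its cost grows as 2^n, so it only suits small graphs.

```python
def count_topological_sorts(n, edges):
    """Count topological sorts of a DAG on nodes 0..n-1.

    edges is a list of (u, v) pairs meaning u must come before v.
    ways[s] is the number of orderings of the node set s (a bitmask)
    in which every node appears after all of its parents.
    Returns 0 if the graph contains a cycle.
    """
    parents = [0] * n
    for u, v in edges:
        parents[v] |= 1 << u

    ways = [0] * (1 << n)
    ways[0] = 1
    for placed in range(1 << n):
        if ways[placed] == 0:
            continue  # this subset is not closed under parents
        for v in range(n):
            already_placed = (placed >> v) & 1
            parents_placed = (parents[v] & placed) == parents[v]
            if not already_placed and parents_placed:
                ways[placed | (1 << v)] += ways[placed]
    return ways[(1 << n) - 1]
```

Only subsets that are closed under taking parents ever receive a non-zero count, which matches the abstract's remark that the computations stay feasible when there are relatively few down-sets.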

Causal Inference through a Witness Protection Program
証人保護プログラムによる因果推論

One of the most fundamental problems in causal inference is the estimation of a causal effect when treatment and outcome are confounded. This is difficult in an observational study, because one has no direct evidence that all confounders have been adjusted for. We introduce a novel approach for estimating causal effects that exploits observational conditional independencies to suggest “weak” paths in an unknown causal graph. The widely used faithfulness condition of Spirtes et al. is relaxed to allow for varying degrees of “path cancellations” that imply conditional independencies but do not rule out the existence of confounding causal paths. The output is a posterior distribution over bounds on the average causal effect via a linear programming approach and Bayesian inference. We claim this approach should be used in regular practice as a complement to other tools in observational studies.

因果推論の最も基本的な問題の1つは、処置と結果が交絡している場合の因果効果の推定です。これは観察研究では困難です。すべての交絡因子が調整されたという直接的な証拠がないためです。私たちは、観測された条件付き独立性を利用して未知の因果グラフにおける「弱い」パスを示唆する、因果効果推定のための新しいアプローチを導入します。Spirtesらの広く使用されている忠実性条件を緩和し、条件付き独立性を含意するものの交絡する因果経路の存在を排除しない、さまざまな程度の「経路の打ち消し」を許容します。出力は、線形計画法アプローチとベイズ推論によって得られる、平均因果効果の境界に関する事後分布です。このアプローチは、観察研究における他のツールを補完するものとして、通常の実践で使用されるべきであると私たちは主張します。

Adaptive Lasso and group-Lasso for functional Poisson regression
関数ポアソン回帰のための適応的Lassoとgroup-Lasso

High dimensional Poisson regression has become a standard framework for the analysis of massive counts datasets. In this work we estimate the intensity function of the Poisson regression model by using a dictionary approach, which generalizes the classical basis approach, combined with a Lasso or a group-Lasso procedure. Selection depends on penalty weights that need to be calibrated. Standard methodologies developed in the Gaussian framework cannot be directly applied to Poisson models due to heteroscedasticity. Here we provide data-driven weights for the Lasso and the group-Lasso derived from concentration inequalities adapted to the Poisson case. We show that the associated Lasso and group-Lasso procedures satisfy fast and slow oracle inequalities. Simulations are used to assess the empirical performance of our procedure, and an original application to the analysis of Next Generation Sequencing data is provided.

高次元ポアソン回帰は、膨大なカウントデータセットの分析の標準的なフレームワークになっています。この研究では、古典的な基底アプローチを一般化する辞書アプローチと、Lassoまたはgroup-Lasso手順を組み合わせて、ポアソン回帰モデルの強度関数を推定します。変数選択は、較正が必要なペナルティの重みに依存します。ガウスフレームワークで開発された標準的な方法論は、不均一分散性のためにポアソンモデルに直接適用することはできません。ここでは、ポアソンの場合に適応させた集中不等式から導出された、Lassoとgroup-Lassoのデータ駆動型の重みを提供します。対応するLasso手順とgroup-Lasso手順が、高速および低速のオラクル不等式を満たすことを示します。シミュレーションを用いて私たちの手順の経験的パフォーマンスを評価し、次世代シーケンシングデータの分析への独自の応用を示します。

Estimating Causal Structure Using Conditional DAG Models
条件付き DAG モデルを使用した因果構造の推定

This paper considers inference of causal structure in a class of graphical models called conditional DAGs. These are directed acyclic graph (DAG) models with two kinds of variables, primary and secondary. The secondary variables are used to aid in the estimation of the structure of causal relationships between the primary variables. We prove that, under certain assumptions, such causal structure is identifiable from the joint observational distribution of the primary and secondary variables. We give causal semantics for the model class, put forward a score-based approach for estimation and establish consistency results. Empirical results demonstrate gains compared with formulations that treat all variables on an equal footing, or that ignore secondary variables. The methodology is motivated by applications in biology that involve multiple data types and is illustrated here using simulated data and in an analysis of molecular data from the Cancer Genome Atlas.

この論文では、条件付きDAGと呼ばれるグラフィカルモデルのクラスにおける因果構造の推論について考察します。これらは、1次変数と2次変数の2種類の有向非巡回グラフ(DAG)モデルです。二次変数は、一次変数間の因果関係の構造の推定を支援するために使用されます。私たちは、ある仮定の下で、そのような因果構造が一次変数と二次変数の同時観測分布から識別可能であることを証明します。モデルクラスの因果関係セマンティクスを提供し、推定のためのスコアベースのアプローチを提案し、一貫性のある結果を確立します。経験的な結果は、すべての変数を平等に扱う定式化、または二次変数を無視する定式化と比較して、利益を示しています。この方法論は、複数のデータタイプを含む生物学のアプリケーションによって動機付けられており、ここでは、シミュレートされたデータとCancer Genome Atlasからの分子データの分析で示されています。

Iterative Hessian Sketch: Fast and Accurate Solution Approximation for Constrained Least-Squares
反復ヘッセスケッチ:制約付き最小二乗法の高速かつ正確な解近似

We study randomized sketching methods for approximately solving least-squares problems with a general convex constraint. The quality of a least-squares approximation can be assessed in different ways: either in terms of the value of the quadratic objective function (cost approximation), or in terms of some distance measure between the approximate minimizer and the true minimizer (solution approximation). Focusing on the latter criterion, our first main result provides a general lower bound on any randomized method that sketches both the data matrix and vector in a least-squares problem; as a surprising consequence, the most widely used least-squares sketch is sub-optimal for solution approximation. We then present a new method known as the iterative Hessian sketch, and show that it can be used to obtain approximations to the original least-squares problem using a projection dimension proportional to the statistical complexity of the least-squares minimizer, and a logarithmic number of iterations. We illustrate our general theory with simulations for both unconstrained and constrained versions of least-squares, including $\ell_1$-regularization and nuclear norm constraints. We also numerically demonstrate the practicality of our approach in a real face expression classification experiment.

私たちは、一般的な凸制約を持つ最小二乗問題を近似的に解くためのランダム化スケッチ法を研究します。最小二乗近似の品質は、2次目的関数の値（コスト近似）または近似最小化器と真の最小化器の間の距離の尺度（解近似）のいずれかの方法で評価することができます。後者の基準に焦点を当てて、最初の主要な結果は、最小二乗問題でデータ行列とベクトルの両方をスケッチするランダム化方法の一般的な下限を提供します。驚くべき結果として、最も広く使用されている最小二乗スケッチは、解近似には最適ではない。次に、反復ヘッセスケッチと呼ばれる新しい方法を紹介し、最小二乗最小化器の統計的複雑さに比例する投影次元と対数反復回数を使用して、元の最小二乗問題の近似値を取得するために使用できることを示す。私たちは、$\ell_1$正則化と核ノルム制約を含む、最小二乗法の制約なしバージョンと制約付きバージョンの両方のシミュレーションで一般理論を説明します。また、実際の顔の表情分類実験で、我々のアプローチの実用性を数値的に実証します。
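
The iterative Hessian sketch described above can be illustrated for the unconstrained case. The following is a minimal sketch, not the authors' implementation: it assumes a fresh Gaussian sketch matrix at every iteration and the closed-form unconstrained update x ← x + (AᵀSᵀSA)⁻¹Aᵀ(y − Ax). The function name, the sketch dimension, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def iterative_hessian_sketch(A, y, sketch_dim, n_iter, rng):
    """Unconstrained least squares via repeated sketching of the Hessian only."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_iter):
        # Fresh Gaussian sketch, scaled so that E[S.T @ S] is the identity.
        S = rng.standard_normal((sketch_dim, n)) / np.sqrt(sketch_dim)
        SA = S @ A
        # The gradient term uses the full (unsketched) data.
        grad = A.T @ (y - A @ x)
        # Newton-like step with the sketched Hessian (SA).T @ SA.
        x = x + np.linalg.solve(SA.T @ SA, grad)
    return x

# Synthetic problem (assumed for illustration only).
rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 10))
x_true = rng.standard_normal(10)
y = A @ x_true + 0.1 * rng.standard_normal(2000)

x_ls = np.linalg.lstsq(A, y, rcond=None)[0]
x_ihs = iterative_hessian_sketch(A, y, sketch_dim=400, n_iter=20, rng=rng)
ihs_error = np.linalg.norm(x_ihs - x_ls)
```

Only the Hessian is sketched here, while the gradient is computed exactly; this is what lets the iterates approach the true least-squares minimizer rather than the minimizer of a single sketched problem.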

Subspace Learning with Partial Information
部分情報による部分空間学習

The goal of subspace learning is to find a $k$-dimensional subspace of $\mathbb{R}^d$, such that the expected squared distance between instance vectors and the subspace is as small as possible. In this paper we study subspace learning in a partial information setting, in which the learner can only observe $r \le d$ attributes from each instance vector. We propose several efficient algorithms for this task, and analyze their sample complexity.

部分空間学習の目標は、インスタンスベクトルと部分空間の間の予想される二乗距離ができるだけ小さくなるように、$\mathbb{R}^d$の$k$次元部分空間を見つけることです。この論文では、学習者が各インスタンスベクトルから$r \le d$属性のみを観察できる部分情報設定での部分空間学習を研究します。このタスクに対していくつかの効率的なアルゴリズムを提案し、それらのサンプルの複雑さを分析します。

Differentially Private Data Releasing for Smooth Queries
スムーズなクエリのための差分プライベートデータリリース

In the past few years, differential privacy has become a standard concept in the area of privacy. One of the most important problems in this field is to answer queries while preserving differential privacy. In spite of extensive studies, most existing work on differentially private query answering assumes the data are discrete (i.e., in $\{0,1\}^d$) and focuses on queries induced by Boolean functions. In real applications however, continuous data are at least as common as binary data. Thus, in this work we explore a less studied topic, namely, differentially private query answering for continuous data with continuous functions. As a first step towards the continuous case, we study a natural class of linear queries on continuous data which we refer to as smooth queries. A linear query is said to be $K$-smooth if it is specified by a function defined on $[-1,1]^d$ whose partial derivatives up to order $K$ are all bounded. We develop two $\epsilon$-differentially private mechanisms which are able to answer all smooth queries. The first mechanism outputs a summary of the database and can then give answers to the queries. The second mechanism is an improvement of the first one and it outputs a synthetic database. The two mechanisms both achieve an accuracy of $O (n^{-\frac{K}{2d+K}}/\epsilon )$. Here we assume that the dimension $d$ is a constant. It turns out that even in this parameter setting (which is almost trivial in the discrete case), using existing discrete mechanisms to answer the smooth queries is difficult and requires more noise. Our mechanisms are based on $L_{\infty}$-approximation of (transformed) smooth functions by low-degree even trigonometric polynomials with uniformly bounded coefficients. We also develop practically efficient variants of the mechanisms with promising experimental results.

過去数年で、差分プライバシーはプライバシーの分野で標準的な概念になりました。この分野で最も重要な問題の1つは、差分プライバシーを維持しながらクエリに回答することです。広範な研究にもかかわらず、差分プライバシークエリ回答に関する既存の研究のほとんどは、データが離散的(つまり、$\{0,1\}^d$内)であると想定し、\emph{ブール}関数によって誘導されるクエリに焦点を当てています。ただし、実際のアプリケーションでは、連続データはバイナリデータと同程度以上一般的です。したがって、この研究では、あまり研究されていないトピック、つまり連続関数を持つ連続データに対する差分プライバシークエリ回答について検討します。連続ケースへの第一歩として、連続データに対する線形クエリの自然なクラスを研究します。これをスムーズクエリと呼びます。線形クエリは、$[-1,1]^d$で定義され、$K$次までの偏微分がすべて制限されている関数によって指定される場合、$K$スムーズであると言われます。私たちは、すべての滑らかなクエリに答えることができる2つの$\epsilon$差分プライバシーメカニズムを開発しました。最初のメカニズムは、データベースの要約を出力し、クエリに答えることができます。2番目のメカニズムは最初のメカニズムを改良したもので、合成データベースを出力します。2つのメカニズムは両方とも$O (n^{-\frac{K}{2d+K}}/\epsilon )$の精度を達成します。ここでは、次元$d$は定数であると仮定します。このパラメーター設定(離散ケースではほとんど自明)でも、既存の離散メカニズムを使用して滑らかなクエリに答えることは困難であり、より多くのノイズが必要になることがわかりました。我々のメカニズムは、一様に境界付けられた係数を持つ低次の偶三角多項式による(変換された)滑らかな関数の$L_{\infty}$近似に基づいています。また、有望な実験結果を持つ、実際に効率的なメカニズムの変種も開発しています。

Combinatorial Multi-Armed Bandit and Its Extension to Probabilistically Triggered Arms
組み合わせ型多腕バンディットとその確率的トリガーアームへの拡張

We define a general framework for a large class of combinatorial multi-armed bandit (CMAB) problems, where subsets of base arms with unknown distributions form super arms. In each round, a super arm is played and the base arms contained in the super arm are played and their outcomes are observed. We further consider the extension in which more base arms could be probabilistically triggered based on the outcomes of already triggered arms. The reward of the super arm depends on the outcomes of all played arms, and it only needs to satisfy two mild assumptions, which allow a large class of nonlinear reward instances. We assume the availability of an offline $(\alpha,\beta)$-approximation oracle that takes the means of the outcome distributions of arms and outputs a super arm that with probability $\beta$ generates an $\alpha$ fraction of the optimal expected reward. The objective of an online learning algorithm for CMAB is to minimize $(\alpha,\beta)$-approximation regret, which is the difference in total expected reward between the $\alpha\beta$ fraction of expected reward when always playing the optimal super arm, and the expected reward of playing super arms according to the algorithm. We provide the CUCB algorithm, which achieves $O(\log n)$ distribution-dependent regret, where $n$ is the number of rounds played, and we further provide distribution-independent bounds for a large class of reward functions. Our regret analysis is tight in that it matches the bound of the UCB1 algorithm (up to a constant factor) for the classical MAB problem, and it significantly improves the regret bound in an earlier paper on combinatorial bandits with linear rewards. We apply our CMAB framework to two new applications, probabilistic maximum coverage (PMC) for online advertising and social influence maximization for viral marketing, both having nonlinear reward structures. In particular, application to social influence maximization requires our extension on probabilistically triggered arms.

私たちは、未知の分布を持つベースアームのサブセットがスーパーアームを形成する、大規模な組み合わせ多腕バンディット(CMAB)問題の一般的なフレームワークを定義します。各ラウンドで、スーパーアームがプレイされ、スーパーアームに含まれるベースアームがプレイされ、その結果が観察されます。さらに、すでにトリガーされたアームの結果に基づいて、より多くのベースアームを確率的にトリガーできる拡張を検討します。スーパーアームの報酬は、プレイされたすべてのアームの結果に依存し、2つの緩やかな仮定を満たすだけで、大規模な非線形報酬インスタンスが可能になります。アームの結果分布の平均を取り、確率$\beta$で最適期待報酬の$\alpha$分数を生成するスーパーアームを出力するオフライン$(\alpha,\beta)$近似オラクルが利用可能であると仮定します。CMABのオンライン学習アルゴリズムの目的は、$(\alpha,\beta)$近似後悔を最小化することです。これは、常に最適なスーパーアームをプレイする場合の期待報酬の$\alpha\beta$割合と、アルゴリズムに従ってスーパーアームをプレイする場合の期待報酬との間の総期待報酬の差です。私たちは、$O(\log n)$分布依存後悔を実現するCUCBアルゴリズムを提供します。ここで、$n$はプレイされたラウンドの数です。また、報酬関数の大規模なクラスに対して分布に依存しない境界を提供します。私たちの後悔分析は、古典的なMAB問題に対するUCB1アルゴリズムの境界(定数係数まで)と一致するという点で厳密であり、線形報酬の組み合わせバンディットに関する以前の論文の後悔境界を大幅に改善します。私たちは、CMABフレームワークを、オンライン広告の確率的最大カバレッジ(PMC)とバイラルマーケティングの社会的影響最大化という、どちらも非線形報酬構造を持つ2つの新しいアプリケーションに適用します。特に、社会的影響最大化への適用には、確率的にトリガーされるアームへの拡張が必要です。
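
The loop described above (keep per-arm estimates, hand optimistic estimates to an offline oracle, play the returned super arm, observe every played base arm) can be sketched as follows. This is a hedged illustration, not the paper's code. The abstract does not state the confidence bonus, so the form sqrt(3 ln t / (2 T_i)) used here is an assumption, as are the top-k oracle (exact for a sum-of-arms reward, so α = β = 1), the Bernoulli outcomes, and all function names.

```python
import numpy as np

def ucb_indices(means, counts, t):
    """Optimistic per-arm estimates; arms never played get +inf so they are tried first."""
    # Assumed bonus form; the abstract itself does not specify it.
    bonus = np.sqrt(3.0 * np.log(t) / (2.0 * np.maximum(counts, 1)))
    return np.where(counts > 0, means + bonus, np.inf)

def top_k_oracle(indices, k):
    """Hypothetical offline oracle: the k arms with the largest indices."""
    return np.argsort(indices)[-k:]

def run_cucb(true_means, k, n_rounds, rng):
    m = len(true_means)
    means = np.zeros(m)
    counts = np.zeros(m)
    for t in range(1, n_rounds + 1):
        super_arm = top_k_oracle(ucb_indices(means, counts, t), k)
        # Semi-bandit feedback: one Bernoulli outcome per played base arm.
        outcomes = (rng.random(k) < true_means[super_arm]).astype(float)
        counts[super_arm] += 1
        # Incremental update of the empirical means.
        means[super_arm] += (outcomes - means[super_arm]) / counts[super_arm]
    return means, counts

true_means = np.array([0.9, 0.8, 0.3, 0.2, 0.1])
est_means, play_counts = run_cucb(true_means, k=2, n_rounds=2000,
                                  rng=np.random.default_rng(0))
```

Swapping `top_k_oracle` for an approximation oracle of a nonlinear reward is where the (α, β) parameters of the abstract would enter; the bookkeeping around it stays the same.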

SPSD Matrix Approximation vis Column Selection: Theories, Algorithms, and Extensions
SPSD 行列近似と列選択の比較: 理論、アルゴリズム、拡張

Symmetric positive semidefinite (SPSD) matrix approximation is an important problem with applications in kernel methods. However, existing SPSD matrix approximation methods such as the Nyström method only have weak error bounds. In this paper we conduct in-depth studies of an SPSD matrix approximation model and establish strong relative-error bounds. We call it the prototype model for it has more efficient and effective extensions, and some of its extensions have high scalability. Though the prototype model itself is not suitable for large-scale data, it is still useful to study its properties, on which the analysis of its extensions relies. This paper offers novel theoretical analysis, efficient algorithms, and a highly accurate extension. First, we establish a lower error bound for the prototype model and improve the error bound of an existing column selection algorithm to match the lower bound. In this way, we obtain the first optimal column selection algorithm for the prototype model. We also prove that the prototype model is exact under certain conditions. Second, we develop a simple column selection algorithm with a provable error bound. Third, we propose a so-called spectral shifting model to make the approximation more accurate when the eigenvalues of the matrix decay slowly, and the improvement is theoretically quantified. The spectral shifting method can also be applied to improve other SPSD matrix approximation models.

対称正定値半正定値(SPSD)行列近似は、カーネル法のアプリケーションにおける重要な問題です。しかし、ニストローム法などの既存のSPSD行列近似法には、弱い誤差境界しかありません。この論文では、SPSD行列近似モデルを詳細に研究し、強い相対誤差境界を確立します。このモデルは、より効率的で効果的な拡張があり、拡張の一部は高いスケーラビリティを備えているため、プロトタイプモデルと呼びます。プロトタイプモデル自体は大規模データには適していませんが、拡張の分析の根拠となる特性を研究することは有用です。この論文では、新しい理論的分析、効率的なアルゴリズム、および高精度の拡張を提供します。まず、プロトタイプモデルの下限誤差境界を確立し、既存の列選択アルゴリズムの誤差境界を下限に一致するように改善します。このようにして、プロトタイプモデルの最初の最適な列選択アルゴリズムを取得します。また、プロトタイプモデルが特定の条件下で正確であることを証明します。次に、証明可能な誤差境界を持つ単純な列選択アルゴリズムを開発します。3番目に、行列の固有値がゆっくり減衰する場合に近似をより正確にするための、いわゆるスペクトルシフトモデルを提案し、その改善は理論的に定量化されています。スペクトルシフト法は、他のSPSD行列近似モデルの改善にも適用できます。

Kernel Mean Shrinkage Estimators
カーネル平均収縮推定器

A mean function in a reproducing kernel Hilbert space (RKHS), or a kernel mean, is central to kernel methods in that it is used by many classical algorithms such as kernel principal component analysis, and it also forms the core inference step of modern kernel methods that rely on embedding probability distributions in RKHSs. Given a finite sample, an empirical average has been used commonly as a standard estimator of the true kernel mean. Despite a widespread use of this estimator, we show that it can be improved thanks to the well-known Stein phenomenon. We propose a new family of estimators called kernel mean shrinkage estimators (KMSEs), which benefit from both theoretical justifications and good empirical performance. The results demonstrate that the proposed estimators outperform the standard one, especially in a “large $d$, small $n$” paradigm.

再現カーネルヒルベルト空間(RKHS)の平均関数、またはカーネル平均は、カーネル主成分解析などの多くの古典的なアルゴリズムで使用されるという点でカーネルメソッドの中心であり、RKHSへの確率分布の埋め込みに依存する最新のカーネルメソッドの中核推論ステップも形成します。有限のサンプルが与えられた場合、経験的平均は、真のカーネル平均の標準推定量として一般的に使用されています。この推定量が広く使用されているにもかかわらず、よく知られているスタイン現象のおかげで改善できることを示しています。私たちは、カーネル平均収縮推定量(KMSE)と呼ばれる新しい推定量ファミリーを提案します。これは、理論的な正当化と優れた経験的パフォーマンスの両方から恩恵を受けます。この結果は、提案された推定量が、特に「大きな$d$、小さな$n$」パラダイムにおいて、標準的な推定量を上回っていることを示しています。

Large Scale Online Kernel Learning
大規模なオンラインカーネル学習

In this paper, we present a new framework for large scale online kernel learning, making kernel methods efficient and scalable for large-scale online learning applications. Unlike the regular budget online kernel learning scheme that usually uses some budget maintenance strategies to bound the number of support vectors, our framework explores a completely different approach of kernel functional approximation techniques to make the subsequent online learning task efficient and scalable. Specifically, we present two different online kernel machine learning algorithms: (i) Fourier Online Gradient Descent (FOGD) algorithm that applies the random Fourier features for approximating kernel functions; and (ii) Nyström Online Gradient Descent (NOGD) algorithm that applies the Nyström method to approximate large kernel matrices. We explore these two approaches to tackle three online learning tasks: binary classification, multi-class classification, and regression. The encouraging results of our experiments on large-scale datasets validate the effectiveness and efficiency of the proposed algorithms, making them potentially more practical than the family of existing budget online kernel learning approaches.

この論文では、大規模オンラインカーネル学習のための新しいフレームワークを提示し、大規模オンライン学習アプリケーション向けにカーネル法を効率的かつスケーラブルにします。通常、サポートベクターの数を制限するために何らかの予算維持戦略を使用する通常の予算オンラインカーネル学習スキームとは異なり、私たちのフレームワークは、カーネル関数近似手法のまったく異なるアプローチを探求し、その後のオンライン学習タスクを効率的かつスケーラブルにします。具体的には、2つの異なるオンラインカーネル機械学習アルゴリズムを紹介します。(i)ランダムフーリエ特徴を適用してカーネル関数を近似するフーリエオンライン勾配降下法(FOGD)アルゴリズム、および(ii)ナイストロム法を適用して大規模カーネル行列を近似するナイストロムオンライン勾配降下法(NOGD)アルゴリズムです。これら2つのアプローチを探求して、バイナリ分類、マルチクラス分類、および回帰という3つのオンライン学習タスクに取り組みます。大規模データセットでの実験の有望な結果は、提案されたアルゴリズムの有効性と効率性を検証し、既存の予算オンラインカーネル学習アプローチのファミリーよりも潜在的に実用的であることを示しています。
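
The FOGD idea described above (approximate the kernel with random Fourier features, then run ordinary online gradient descent on a linear model in that feature space) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes a Gaussian kernel with bandwidth `sigma`, the cosine-with-random-phase feature map, hinge loss for binary classification, and a toy XOR-style data stream. The class name and all parameter values are hypothetical.

```python
import numpy as np

class FOGD:
    """Online gradient descent on random Fourier features (binary hinge loss)."""

    def __init__(self, dim, n_features, sigma, eta, rng):
        # Frequencies drawn from the Fourier transform of the Gaussian kernel.
        self.W = rng.standard_normal((n_features, dim)) / sigma
        self.b = rng.uniform(0.0, 2.0 * np.pi, n_features)
        self.scale = np.sqrt(2.0 / n_features)
        self.w = np.zeros(n_features)
        self.eta = eta

    def features(self, x):
        # features(x) @ features(x') approximates exp(-||x - x'||^2 / (2 sigma^2)).
        return self.scale * np.cos(self.W @ x + self.b)

    def step(self, x, y):
        """Predict the label of x, then update on the revealed label y in {-1, +1}."""
        z = self.features(x)
        score = self.w @ z
        if y * score < 1.0:  # hinge loss is active, take a gradient step
            self.w += self.eta * y * z
        return 1.0 if score >= 0 else -1.0

# Toy stream (assumed): label is the sign of x1 * x2, not linearly separable.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (2000, 2))
Y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)

model = FOGD(dim=2, n_features=200, sigma=0.5, eta=0.5, rng=rng)
mistakes = sum(model.step(x, y) != y for x, y in zip(X, Y))
mistake_rate = mistakes / len(Y)
```

The model size is fixed by `n_features` and never grows with the number of examples seen, which is the contrast with budget-maintenance schemes that the abstract draws.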

Addressing Environment Non-Stationarity by Repeating Q-learning Updates
Q学習の更新を繰り返すことによる環境の非定常性への対処

Q-learning (QL) is a popular reinforcement learning algorithm that is guaranteed to converge to optimal policies in Markov decision processes. However, QL exhibits an artifact: in expectation, the effective rate of updating the value of an action depends on the probability of choosing that action. In other words, there is a tight coupling between the learning dynamics and underlying execution policy. This coupling can cause performance degradation in noisy non-stationary environments. Here, we introduce Repeated Update Q-learning (RUQL), a learning algorithm that resolves the undesirable artifact of Q-learning while maintaining simplicity. We analyze the similarities and differences between RUQL, QL, and the closest state-of-the-art algorithms theoretically. Our analysis shows that RUQL maintains the convergence guarantee of QL in stationary environments, while relaxing the coupling between the execution policy and the learning dynamics. Experimental results confirm the theoretical insights and show how RUQL outperforms both QL and the closest state-of-the-art algorithms in noisy non-stationary environments.

Q学習(QL)は、マルコフ決定プロセスで最適なポリシーに収束することが保証されている、人気の高い強化学習アルゴリズムです。ただし、QLにはアーティファクトがあります。つまり、アクションの値を更新する実効レートは、そのアクションを選択する確率に依存します。言い換えると、学習ダイナミクスと基礎となる実行ポリシーの間には密接な結合があります。この結合により、ノイズの多い非定常環境ではパフォーマンスが低下する可能性があります。ここでは、Q学習の望ましくないアーティファクトをシンプルさを維持しながら解決する学習アルゴリズムである繰り返し更新Q学習(RUQL)を紹介します。RUQL、QL、および最も近い最先端のアルゴリズムの類似点と相違点を理論的に分析します。分析により、RUQLは定常環境でのQLの収束保証を維持しながら、実行ポリシーと学習ダイナミクスの結合を緩和することが示されています。実験結果は理論的洞察を裏付け、ノイズの多い非定常環境においてRUQLがQLおよび最も近い最先端のアルゴリズムの両方よりも優れていることを示しています。
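
The coupling described above, and one way of removing it, can be made concrete with a small sketch. The abstract does not give the RUQL update rule, so the following assumes a reading suggested by the algorithm's name: the standard Q-learning update is repeated 1/π(a|s) times against a fixed target r + γ max Q(s', ·), so that rarely chosen actions receive proportionally more updates. The function names and the numbers in the usage lines are illustrative assumptions.

```python
def q_update(q, target, alpha):
    """One standard Q-learning update toward a fixed target."""
    return (1.0 - alpha) * q + alpha * target

def repeated_update(q, target, alpha, repeats):
    """Apply the same update `repeats` times (integer number of repeats)."""
    for _ in range(repeats):
        q = q_update(q, target, alpha)
    return q

def repeated_update_closed_form(q, target, alpha, repeats):
    """Equivalent closed form; also valid for non-integer repeats such as 1 / pi."""
    keep = (1.0 - alpha) ** repeats
    return keep * q + (1.0 - keep) * target

# An action chosen with probability 0.25 is updated as if it had been seen 4 times.
policy_prob = 0.25
looped = repeated_update(0.0, 1.0, alpha=0.1, repeats=4)
closed = repeated_update_closed_form(0.0, 1.0, alpha=0.1, repeats=1.0 / policy_prob)
```

The closed form follows because each repeat multiplies the distance to the target by (1 − α); it avoids an explicit loop and handles probabilities whose reciprocal is not an integer.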

A Unified View on Multi-class Support Vector Classification
多クラスサポートベクトル分類に関する統一的なビュー

A unified view on multi-class support vector machines (SVMs) is presented, covering most prominent variants including the one-vs-all approach and the algorithms proposed by Weston & Watkins, Crammer & Singer, Lee, Lin, & Wahba, and Liu & Yuan. The unification leads to a template for the quadratic training problems and new multi-class SVM formulations. Within our framework, we provide a comparative analysis of the various notions of multi-class margin and margin-based loss. In particular, we demonstrate limitations of the loss function considered, for instance, in the Crammer & Singer machine. We analyze Fisher consistency of multi-class loss functions and universal consistency of the various machines. On the one hand, we give examples of SVMs that are, in a particular hyperparameter regime, universally consistent without being based on a Fisher consistent loss. These include the canonical extension of SVMs to multiple classes as proposed by Weston & Watkins and Vapnik as well as the one-vs-all approach. On the other hand, it is demonstrated that machines based on Fisher consistent loss functions can fail to identify proper decision boundaries in low-dimensional feature spaces. We compared the performance of nine different multi-class SVMs in a thorough empirical study. Our results suggest using the Weston & Watkins SVM, which can be trained comparatively fast and gives good accuracies on benchmark functions. If training time is a major concern, the one-vs-all approach is the method of choice.

マルチクラスサポートベクターマシン(SVM)に関する統一的な見解が提示され、1対全アプローチや、Weston & Watkins、Crammer & Singer、Lee、Lin、Wahba、Liu & Yuanが提案したアルゴリズムなど、最もよく知られたバリエーションが網羅されています。この統一により、2次トレーニング問題と新しいマルチクラスSVM定式化のテンプレートが生まれます。このフレームワークでは、マルチクラスマージンとマージンベースの損失のさまざまな概念の比較分析を提供します。特に、Crammer & Singerマシンなどで検討されている損失関数の限界を示します。マルチクラス損失関数のフィッシャー一貫性とさまざまなマシンの普遍的一貫性を分析します。一方では、特定のハイパーパラメータレジームで、フィッシャー一貫性損失に基づかずに普遍的に一貫性のあるSVMの例を示します。これには、Weston & WatkinsとVapnikが提案したSVMの複数クラスへの標準拡張や、1対全アプローチが含まれます。一方、フィッシャーの一貫した損失関数に基づくマシンは、低次元の特徴空間で適切な決定境界を識別できない可能性があることが実証されています。徹底的な実証研究で、9つの異なるマルチクラスSVMのパフォーマンスを比較しました。結果から、比較的速くトレーニングでき、ベンチマーク関数で優れた精度を発揮するWeston & Watkins SVMの使用が推奨されます。トレーニング時間が大きな懸念事項である場合は、1対すべてのアプローチが最適な方法です。

Scalable Learning of Bayesian Network Classifiers
ベイジアンネットワーク分類器のスケーラブルな学習

Ever increasing data quantity makes ever more urgent the need for highly scalable learners that have good classification performance. Therefore, an out-of-core learner with excellent time and space complexity, along with high expressivity (that is, capacity to learn very complex multivariate probability distributions) is extremely desirable. This paper presents such a learner. We propose an extension to the $k$-dependence Bayesian classifier (KDB) that discriminatively selects a sub-model of a full KDB classifier. It requires only one additional pass through the training data, making it a three-pass learner. Our extensive experimental evaluation on $16$ large data sets reveals that this out-of-core algorithm achieves competitive classification performance, and substantially better training and classification time than state-of-the-art in-core learners such as random forest and linear and non-linear logistic regression.

データ量が増え続けると、優れた分類パフォーマンスを備えた高度にスケーラブルな学習器の必要性がますます急務になっています。したがって、時間と空間の複雑さが優れ、表現力が高い(つまり、非常に複雑な多変量確率分布を学習する能力)アウトオブコア学習器が極めて望ましいです。この論文では、そのような学習者を紹介します。私たちは、完全なKDB分類器のサブモデルを判別的に選択する$k$依存性ベイズ分類器(KDB)の拡張を提案します。学習データに追加のパスを1回だけ必要とするため、3パス学習器になります。$16$の大規模なデータセットに対する広範な実験的評価により、このアウトオブコアアルゴリズムは、ランダムフォレストや線形および非線形ロジスティック回帰などの最先端のインコア学習器よりも競争力のある分類パフォーマンスを達成し、学習と分類時間が大幅に短縮されることが明らかになりました。

On the Estimation of the Gradient Lines of a Density and the Consistency of the Mean-Shift Algorithm
密度の勾配線の推定と平均シフトアルゴリズムの一貫性について

We consider the problem of estimating the gradient lines of a density, which can be used to cluster points sampled from that density, for example via the mean-shift algorithm of Fukunaga and Hostetler (1975). We prove general convergence bounds that we then specialize to kernel density estimation.

私たちは、密度の勾配線を推定する問題を検討し、たとえばFukunaga and Hostetler(1975)の平均シフトアルゴリズムを介して、その密度からサンプリングされたポイントをクラスタリングするために使用できます。一般的な収束限界を証明し、その後、カーネル密度の推定に特化します。
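
The mean-shift procedure mentioned above follows an estimated gradient line of the density uphill until it reaches a mode, and points that reach the same mode form a cluster. The following is a minimal sketch with a Gaussian kernel density estimate; the bandwidth, the stopping tolerance, the function name, and the two-blob data set are assumptions made for illustration.

```python
import numpy as np

def mean_shift_mode(x, data, bandwidth, n_iter=500, tol=1e-8):
    """Follow mean-shift steps from x until the iterate stops moving."""
    for _ in range(n_iter):
        # Gaussian kernel weights of every sample relative to the current point.
        sq_dist = np.sum((data - x) ** 2, axis=1)
        weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
        # The mean-shift step moves x to the weighted average of the samples.
        x_new = weights @ data / weights.sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Two well-separated Gaussian blobs (assumed toy data).
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(0.0, 0.5, (100, 2)),
    rng.normal(5.0, 0.5, (100, 2)),
])

mode_a = mean_shift_mode(np.array([0.5, 0.5]), data, bandwidth=1.0)
mode_b = mean_shift_mode(np.array([4.5, 4.5]), data, bandwidth=1.0)
```

Each step is a rescaled gradient-ascent step on the kernel density estimate, which is why the sequence of iterates can be read as a discretized gradient line.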

Variational Inference for Latent Variables and Uncertain Inputs in Gaussian Processes
ガウス過程における潜在変数と不確定入力に対する変分推論

The Gaussian process latent variable model (GP-LVM) provides a flexible approach for non-linear dimensionality reduction that has been widely applied. However, the current approach for training GP-LVMs is based on maximum likelihood, where the latent projection variables are maximised over rather than integrated out. In this paper we present a Bayesian method for training GP-LVMs by introducing a non-standard variational inference framework that allows us to approximately integrate out the latent variables and subsequently train a GP-LVM by maximising an analytic lower bound on the exact marginal likelihood. We apply this method for learning a GP-LVM from i.i.d. observations and for learning non-linear dynamical systems where the observations are temporally correlated. We show that a benefit of the variational Bayesian procedure is its robustness to overfitting and its ability to automatically select the dimensionality of the non-linear latent space. The resulting framework is generic, flexible and easy to extend for other purposes, such as Gaussian process regression with uncertain or partially missing inputs. We demonstrate our method on synthetic data and standard machine learning benchmarks, as well as challenging real world datasets, including high resolution video data.

ガウス過程潜在変数モデル(GP-LVM)は、広く適用されている非線形次元削減のための柔軟なアプローチを提供します。ただし、GP-LVMをトレーニングするための現在のアプローチは最大尤度に基づいており、潜在投影変数は積分されるのではなく最大化されます。この論文では、潜在変数を近似的に積分し、その後、正確な周辺尤度の解析的下限を最大化することでGP-LVMをトレーニングできる非標準の変分推論フレームワークを導入することで、GP-LVMをトレーニングするためのベイズ法を提示します。この方法は、i.i.d.観測値からGP-LVMを学習するため、および観測値が時間的に相関している非線形動的システムを学習するために適用します。変分ベイズ手順の利点は、過剰適合に対する堅牢性と、非線形潜在空間の次元を自動的に選択できることであることを示します。結果として得られるフレームワークは汎用性があり、柔軟性が高く、不確実な入力や部分的に欠損した入力によるガウス過程回帰など、他の目的に簡単に拡張できます。合成データと標準的な機械学習ベンチマーク、および高解像度のビデオデータを含む難しい現実世界のデータセットでこの手法を実証します。

BayesPy: Variational Bayesian Inference in Python
BayesPy: Python での変分ベイズ推論

BayesPy is an open-source Python software package for performing variational Bayesian inference. It is based on the variational message passing framework and supports conjugate exponential family models. By removing the tedious task of implementing the variational Bayesian update equations, the user can construct models faster and in a less error-prone way. Simple syntax, flexible model construction and efficient inference make BayesPy suitable for both average and expert Bayesian users. It also supports some advanced methods such as stochastic and collapsed variational inference.

BayesPyは、変分ベイズ推論を実行するためのオープンソースのPythonソフトウェアパッケージです。これは、変分メッセージパッシングフレームワークに基づいており、共役指数ファミリーモデルをサポートしています。変分ベイズ更新方程式を実装するという面倒な作業を取り除くことで、ユーザーはモデルをより迅速に、エラーが発生しにくい方法で構築できます。シンプルな構文、柔軟なモデル構築、効率的な推論により、BayesPyは平均的なベイジアンユーザーとエキスパートユーザーの両方に適しています。また、確率的推論や崩壊変分推論など、いくつかの高度な手法もサポートしています。

On Quantile Regression in Reproducing Kernel Hilbert Spaces with the Data Sparsity Constraint
データスパース制約条件を用いたカーネルヒルベルト空間の再現における分位点回帰について

For spline regressions, it is well known that the choice of knots is crucial for the performance of the estimator. As a general learning framework covering the smoothing splines, learning in a Reproducing Kernel Hilbert Space (RKHS) has a similar issue. However, the selection of training data points for kernel functions in the RKHS representation has not been carefully studied in the literature. In this paper we study quantile regression as an example of learning in a RKHS. In this case, the regular squared norm penalty does not perform training data selection. We propose a data sparsity constraint that imposes thresholding on the kernel function coefficients to achieve a sparse kernel function representation. We demonstrate that the proposed data sparsity method can have competitive prediction performance for certain situations, and have comparable performance in other cases compared to that of the traditional squared norm penalty. Therefore, the data sparsity method can serve as a competitive alternative to the squared norm penalty method. Some theoretical properties of our proposed method using the data sparsity constraint are obtained. Both simulated and real data sets are used to demonstrate the usefulness of our data sparsity constraint.

スプライン回帰の場合、ノットの選択が推定器のパフォーマンスに重要であることはよく知られています。平滑化スプラインをカバーする一般的な学習フレームワークとして、再生カーネルヒルベルト空間(RKHS)での学習にも同様の問題があります。ただし、RKHS表現におけるカーネル関数のトレーニングデータポイントの選択は、文献では十分に研究されていません。この論文では、RKHSでの学習の例として、分位回帰について検討します。この場合、通常の二乗ノルムペナルティではトレーニングデータの選択は実行されません。スパースカーネル関数表現を実現するために、カーネル関数係数にしきい値を課すデータスパース制約を提案します。提案されたデータスパース法は、特定の状況では競争力のある予測パフォーマンスを発揮し、他の場合には従来の二乗ノルムペナルティと比較して同等のパフォーマンスを発揮できることを実証します。したがって、データスパース法は、二乗ノルムペナルティ法の競争力のある代替手段として機能します。データスパース制約を使用する提案方法のいくつかの理論的特性が得られます。シミュレーションされたデータセットと実際のデータセットの両方を使用して、データスパース制約の有用性を実証します。

End-to-End Training of Deep Visuomotor Policies
深部視覚運動政策のエンドツーエンドトレーニング

Policy search methods can allow robots to learn control policies for a wide range of tasks, but practical applications of policy search often require hand-engineered components for perception, state estimation, and low-level control. In this paper, we aim to answer the following question: does training the perception and control systems jointly end-to-end provide better performance than training each component separately? To this end, we develop a method that can be used to learn policies that map raw image observations directly to torques at the robot’s motors. The policies are represented by deep convolutional neural networks (CNNs) with 92,000 parameters, and are trained using a guided policy search method, which transforms policy search into supervised learning, with supervision provided by a simple trajectory-centric reinforcement learning method. We evaluate our method on a range of real-world manipulation tasks that require close coordination between vision and control, such as screwing a cap onto a bottle, and present simulated comparisons to a range of prior policy search methods.

ポリシー検索法を使用すると、ロボットは幅広いタスクの制御ポリシーを学習できますが、ポリシー検索の実際のアプリケーションでは、多くの場合、知覚、状態推定、および低レベル制御用の手作業で設計されたコンポーネントが必要です。この論文では、次の質問に答えることを目指しています。知覚システムと制御システムをエンドツーエンドで共同でトレーニングすると、各コンポーネントを個別にトレーニングするよりもパフォーマンスが向上するのでしょうか。この目的のために、生の画像観測をロボットのモーターのトルクに直接マッピングするポリシーを学習するために使用できる方法を開発します。ポリシーは、92,000個のパラメーターを持つディープ畳み込みニューラルネットワーク(CNN)で表され、ポリシー検索を教師あり学習に変換するガイド付きポリシー検索法を使用してトレーニングされ、単純な軌道中心の強化学習法によって監督が提供されます。ボトルにキャップをねじ込むなど、視覚と制御の緊密な調整を必要とするさまざまな実際の操作タスクでこの方法を評価し、さまざまな以前のポリシー検索方法とのシミュレーション比較を示します。

The Optimal Sample Complexity of PAC Learning
PAC学習の最適なサンプル複雑性

This work establishes a new upper bound on the number of samples sufficient for PAC learning in the realizable case. The bound matches known lower bounds up to numerical constant factors. This solves a long-standing open problem on the sample complexity of PAC learning. The technique and analysis build on a recent breakthrough by Hans Simon.

この研究では、実現可能なケースでPAC学習に十分なサンプル数の新たな上限を確立します。この範囲は、数値定数係数までの既知の下限と一致します。これにより、PAC学習のサンプルの複雑さに関する長年の未解決の問題が解決されます。この手法と分析は、Hans Simonによる最近のブレークスルーに基づいています。

Hybrid Orthogonal Projection and Estimation (HOPE): A New Framework to Learn Neural Networks
Hybrid Orthogonal Projection and Estimation (HOPE): ニューラルネットワークを学習するための新しいフレームワーク

In this paper, we propose a novel model for high-dimensional data, called the Hybrid Orthogonal Projection and Estimation (HOPE) model, which combines a linear orthogonal projection and a finite mixture model under a unified generative modeling framework. The HOPE model itself can be learned unsupervised from unlabelled data based on the maximum likelihood estimation as well as discriminatively from labelled data. More interestingly, we have shown the proposed HOPE models are closely related to neural networks (NNs) in a sense that each hidden layer can be reformulated as a HOPE model. As a result, the HOPE framework can be used as a novel tool to probe why and how NNs work, more importantly, to learn NNs in either supervised or unsupervised ways. In this work, we have investigated the HOPE framework to learn NNs for several standard tasks, including image recognition on MNIST and speech recognition on TIMIT. Experimental results have shown that the HOPE framework yields significant performance gains over the current state-of-the-art methods in various types of NN learning problems, including unsupervised feature learning, supervised or semi-supervised learning.

この論文では、統一された生成モデリングフレームワークの下で線形直交射影と有限混合モデルを組み合わせた、ハイブリッド直交射影および推定(HOPE)モデルと呼ばれる高次元データの新しいモデルを提案します。HOPEモデル自体は、最大尤度推定に基づいてラベルなしデータから教師なしで学習することも、ラベル付きデータから識別的に学習することもできます。さらに興味深いことに、提案されたHOPEモデルは、各隠れ層をHOPEモデルとして再定式化できるという意味で、ニューラルネットワーク(NN)と密接に関連していることを示しました。その結果、HOPEフレームワークは、NNが機能する理由と方法、さらに重要なことに、教師ありまたは教師なしのいずれかの方法でNNを学習するための新しいツールとして使用できます。この研究では、MNISTでの画像認識やTIMITでの音声認識など、いくつかの標準的なタスクのNNを学習するためのHOPEフレームワークを調査しました。実験結果によると、HOPEフレームワークは、教師なし特徴学習、教師あり学習、半教師あり学習など、さまざまな種類のNN学習問題において、現在の最先端の方法よりも大幅なパフォーマンス向上をもたらすことが示されています。

A Bounded p-norm Approximation of Max-Convolution for Sub-Quadratic Bayesian Inference on Additive Factors
加法因子上の準二次ベイズ推論のための最大畳み込みの有界pノルム近似

Max-convolution is an important problem closely resembling standard convolution; as such, max-convolution occurs frequently across many fields. Here we extend the method with fastest known worst-case runtime, which can be applied to nonnegative vectors by numerically approximating the Chebyshev norm $\| \cdot \|_\infty$, and use this approach to derive two numerically stable methods based on the idea of computing $p$-norms via fast convolution: The first method proposed, with runtime in $O( k \log(k) \log(\log(k)) )$ (which is less than $18 k \log(k)$ for any vectors that can be practically realized), uses the $p$-norm as a direct approximation of the Chebyshev norm. The second approach proposed, with runtime in $O( k \log(k) )$ (although in practice both perform similarly), uses a novel null space projection method, which extracts information from a sequence of $p$-norms to estimate the maximum value in the vector (this is equivalent to querying a small number of moments from a distribution of bounded support in order to estimate the maximum). The $p$-norm approaches are compared to one another and are shown to compute an approximation of the Viterbi path in a hidden Markov model where the transition matrix is a Toeplitz matrix; the runtime of approximating the Viterbi path is thus reduced from $O( n k^2 )$ steps to $O( n k \log(k))$ steps in practice, and is demonstrated by inferring the U.S. unemployment rate from the S&P 500 stock index.

最大畳み込みは、標準的な畳み込みに非常によく似た重要な問題です。そのため、最大畳み込みは多くの分野で頻繁に発生します。ここでは、チェビシェフノルム$\| \cdot \|_\infty$を数値的に近似することで非負ベクトルに適用できる、最速の既知の最悪ケース実行時間を持つ方法を拡張し、このアプローチを使用して、高速畳み込みによる$p$ノルムの計算という考えに基づく2つの数値的に安定した方法を導出します。提案される最初の方法は、実行時間が$O( k \log(k) \log(\log(k)) )$ (これは、実際に実現可能な任意のベクトルに対して$18 k \log(k)$未満)で、$p$ノルムをチェビシェフノルムの直接近似として使用します。提案された2番目のアプローチは、実行時間が$O( k \log(k) )$ですが(実際には両方とも同様に実行されます)、一連の$p$ノルムから情報を抽出してベクトルの最大値を推定する新しいヌル空間投影法を使用します(これは、最大値を推定するために、制限されたサポートの分布から少数のモーメントを照会することと同等です)。$p$ノルムアプローチは互いに比較され、遷移行列がテプリッツ行列である隠れマルコフモデルでビタビパスの近似値を計算することが示されています。したがって、ビタビパスの近似値を計算する実行時間は、実際には$O( n k^2 )$ステップから$O( n k \log(k))$ステップに短縮され、S&P 500株価指数から米国の失業率を推定することで実証されています。
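
The basic idea behind the first method above, replacing the maximum by a $p$-norm so that an ordinary fast convolution can be used, can be sketched as follows. This is only the plain single-$p$ approximation under stated assumptions (nonnegative inputs scaled to at most 1, a fixed moderate $p$, FFT-based convolution from NumPy); it is not the paper's numerically stabilized method or its null space projection method. Function names and test vectors are illustrative.

```python
import numpy as np

def max_convolve_exact(x, y):
    """Quadratic-time reference: out[m] = max over i of x[i] * y[m - i]."""
    out = np.zeros(len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] = max(out[i + j], xi * yj)
    return out

def max_convolve_pnorm(x, y, p):
    """Approximate max-convolution: p-th root of the convolution of p-th powers."""
    n = len(x) + len(y) - 1
    fx = np.fft.rfft(np.asarray(x, dtype=float) ** p, n)
    fy = np.fft.rfft(np.asarray(y, dtype=float) ** p, n)
    conv = np.fft.irfft(fx * fy, n)
    # Round-off can produce tiny negative values; clip before taking the root.
    return np.clip(conv, 0.0, None) ** (1.0 / p)

x = np.array([0.5, 0.9, 0.6, 0.8])
y = np.array([0.7, 0.4, 1.0, 0.6])
exact = max_convolve_exact(x, y)
approx = max_convolve_pnorm(x, y, p=8)
```

Since a $p$-norm of at most $k$ nonnegative terms lies between their maximum and $k^{1/p}$ times their maximum, the approximation overestimates by a bounded factor. Raising $p$ tightens that factor but pushes small entries toward the floating-point round-off of the FFT, which is the stability problem the abstract refers to.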

OLPS: A Toolbox for On-Line Portfolio Selection
OLPS:オンラインポートフォリオ選択のためのツールボックス

On-line portfolio selection is a practical financial engineering problem, which aims to sequentially allocate capital among a set of assets in order to maximize long-term return. In recent years, a variety of machine learning algorithms have been proposed to address this challenging problem, but no comprehensive open-source toolbox has been released for various reasons. This article presents the first open-source toolbox for “On-Line Portfolio Selection” (OLPS), which implements a collection of classical and state-of-the-art strategies powered by machine learning algorithms. We hope that OLPS can facilitate the development of new learning methods and enable the performance benchmarking and comparisons of different strategies. OLPS is an open-source project released under Apache License (version 2.0), which is available at github.com/OLPS/OLPS or OLPS.stevenhoi.org.

オンライン・ポートフォリオ選択は、長期的なリターンを最大化するために、一連の資産間で資本を順次配分することを目的とした、実用的な金融工学の問題です。近年、この困難な問題に対処するためにさまざまな機械学習アルゴリズムが提案されていますが、さまざまな理由から包括的なオープンソースツールボックスはリリースされていません。この記事では、機械学習アルゴリズムを活用した従来の戦略と最先端の戦略のコレクションを実装する「オンラインポートフォリオ選択」(OLPS)の最初のオープンソースツールボックスを紹介します。OLPSが新しい学習方法の開発を促進し、パフォーマンスのベンチマークとさまざまな戦略の比較を可能にすることを願っています。OLPSは、Apache License (バージョン2.0)の下でリリースされたオープンソースプロジェクトであり、github.com/OLPS/OLPSまたはOLPS.stevenhoi.orgで利用できます。

MLlib: Machine Learning in Apache Spark
MLlib: Apache Spark での機械学習

Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark’s open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark’s rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced a rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.

Apache Sparkは、大規模なデータ処理のための一般的なオープンソースプラットフォームであり、反復的な機械学習タスクに適しています。この論文では、Sparkのオープンソース分散機械学習ライブラリであるMLlibを紹介します。MLlibは、さまざまな学習設定に対して効率的な機能を提供し、いくつかの基礎となる統計、最適化、および線形代数のプリミティブが含まれています。Sparkに付属しているMLlibは、複数の言語をサポートし、Sparkの豊富なエコシステムを活用してエンドツーエンドの機械学習パイプラインの開発を簡素化する高レベルのAPIを提供します。MLlibは、140人を超えるコントリビューターからなる活気に満ちたオープンソースコミュニティにより急速な成長を遂げており、さらなる成長をサポートし、ユーザーがすぐに理解できるように広範なドキュメントが含まれています。

Multi-task Sparse Structure Learning with Gaussian Copula Models
ガウス・コピュラ・モデルによるマルチタスク疎構造学習

Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously. While sometimes the underlying task relationship structure is known, often the structure needs to be estimated from data at hand. In this paper, we present a novel family of models for MTL, applicable to regression and classification problems, capable of learning the structure of tasks relationship. In particular, we consider a joint estimation problem of the tasks relationship structure and the individual task parameters, which is solved using alternating minimization. The task relationship revealed by structure learning is founded on recent advances in Gaussian graphical models endowed with sparse estimators of the precision (inverse covariance) matrix. An extension to include flexible Gaussian copula models that relaxes the Gaussian marginal assumption is also proposed. We illustrate the effectiveness of the proposed model on a variety of synthetic and benchmark data sets for regression and classification. We also consider the problem of combining Earth System Model (ESM) outputs for better projections of future climate, with focus on projections of temperature by combining ESMs in South and North America, and show that the proposed model outperforms several existing methods for the problem.

マルチタスク学習(MTL)は、複数の関連タスクを同時に学習することで、一般化のパフォーマンスを向上させることを目的としています。基礎となるタスク関係構造がわかっている場合もありますが、多くの場合、構造は手元のデータから推定する必要があります。この論文では、回帰および分類問題に適用でき、タスク関係の構造を学習できる、MTLの新しいモデルファミリを紹介します。特に、交互最小化を使用して解決される、タスク関係構造と個々のタスクパラメーターの結合推定問題を検討します。構造学習によって明らかにされるタスク関係は、精度(逆共分散)行列のスパース推定量を備えたガウスグラフィカルモデルの最近の進歩に基づいています。ガウス周辺仮定を緩和する柔軟なガウスコピュラモデルを含める拡張も提案されています。回帰および分類用のさまざまな合成データセットとベンチマークデータセットで、提案モデルの有効性を示します。また、将来の気候をより正確に予測するために地球システムモデル(ESM)の出力を組み合わせる問題も検討し、南北アメリカのESMを組み合わせることによる気温の予測に焦点を当て、提案モデルがこの問題に対する既存のいくつかの方法よりも優れていることを示します。

Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks
観測データを用いた原因と結果の区別:方法とベンチマーク

The discovery of causal relationships from purely observational data is a fundamental problem in science. The most elementary form of such a causal discovery problem is to decide whether $X$ causes $Y$ or, alternatively, $Y$ causes $X$, given joint observations of two variables $X,Y$. An example is to decide whether altitude causes temperature, or vice versa, given only joint measurements of both variables. Even under the simplifying assumptions of no confounding, no feedback loops, and no selection bias, such bivariate causal discovery problems are challenging. Nevertheless, several approaches for addressing those problems have been proposed in recent years. We review two families of such methods: methods based on Additive Noise Models (ANMs) and Information Geometric Causal Inference (IGCI). We present the benchmark CauseEffectPairs that consists of data for 100 different cause-effect pairs selected from 37 data sets from various domains (e.g., meteorology, biology, medicine, engineering, economy, etc.) and motivate our decisions regarding the ground truth causal directions of all pairs. We evaluate the performance of several bivariate causal discovery methods on these real-world benchmark data and in addition on artificially simulated data. Our empirical results on real-world data indicate that certain methods are indeed able to distinguish cause from effect using only purely observational data, although more benchmark data would be needed to obtain statistically significant conclusions. One of the best performing methods overall is the method based on Additive Noise Models that has originally been proposed by Hoyer et al. (2009), which obtains an accuracy of 63 $\pm$ 10 % and an AUC of 0.74 $\pm$ 0.05 on the real-world benchmark. As the main theoretical contribution of this work we prove the consistency of that method.

純粋に観察されたデータから因果関係を発見することは、科学における基本的な問題です。このような因果関係発見問題の最も基本的な形は、2つの変数$X,Y$の同時観測が与えられた場合に、$X$が$Y$を引き起こすのか、あるいは$Y$が$X$を引き起こすのかを判断することです。例としては、両方の変数の同時測定値のみが与えられた場合に、高度が温度を引き起こすのか、あるいはその逆なのかを判断することが挙げられます。交絡、フィードバックループ、選択バイアスがないという単純化された仮定の下でも、このような2変量の因果関係発見問題は困難です。それでも、近年、これらの問題に対処するためのいくつかのアプローチが提案されています。ここでは、このような方法の2つのファミリー、つまり加法性ノイズモデル(ANM)と情報幾何学的因果推論(IGCI)に基づく方法をレビューします。私たちは、さまざまな分野(気象学、生物学、医学、工学、経済など)の37のデータセットから選択された100の異なる原因と結果のペアのデータで構成されるベンチマークCauseEffectPairsを提示し、すべてのペアの真の因果方向に関する決定の根拠を示します。これらの実際のベンチマークデータと人工的にシミュレートされたデータに対して、いくつかの二変量因果発見方法のパフォーマンスを評価します。実際のデータに関する実験結果によると、統計的に有意な結論を得るには、より多くのベンチマークデータが必要になりますが、特定の方法では純粋な観察データのみを使用して原因と結果を区別できることが示されています。全体的に最も優れた方法の1つは、もともとHoyerら(2009)によって提案された加法ノイズモデルに基づく方法で、実際のベンチマークで63 $\pm$ 10 %の精度と0.74 $\pm$ 0.05のAUCが得られます。この研究の主な理論的貢献として、私たちはその方法の一貫性を証明します。

Dimension-free Concentration Bounds on Hankel Matrices for Spectral Learning
スペクトル学習のためのハンケル行列に対する次元フリーな集中境界

Learning probabilistic models over strings is an important issue for many applications. Spectral methods propose elegant solutions to the problem of inferring weighted automata from finite samples of variable-length strings drawn from an unknown target distribution $p$. These methods rely on a singular value decomposition of a matrix $\v{H}_S$, called the empirical Hankel matrix, that records the frequencies of (some of) the observed strings $S$. The accuracy of the learned distribution depends both on the quantity of information embedded in $\v{H}_S$ and on the distance between $\v{H}_S$ and its mean $\v{H}_p$. Existing concentration bounds seem to indicate that the concentration over $\v{H}_p$ gets looser with its dimensions, suggesting that it might be necessary to bound the dimensions of $\v{H}_S$ for learning. We prove new dimension-free concentration bounds for classical Hankel matrices and several variants, based on prefixes or factors of strings, that are useful for learning. Experiments demonstrate that these bounds are tight and that they significantly improve existing (dimension-dependent) bounds. One consequence of these results is that the spectral learning approach remains consistent even if all the observations are recorded within the empirical matrix.

文字列に対する確率モデルの学習は、多くのアプリケーションにとって重要な問題です。スペクトル法は、未知のターゲット分布$p$から抽出された可変長文字列の有限サンプルから重み付きオートマトンを推論する問題に対する洗練されたソリューションを提案します。これらの方法は、観測された文字列$S$の（一部の）頻度を記録する経験的ハンケル行列と呼ばれる行列$\v{H}_S$の特異値分解に依存しています。学習された分布の精度は、$\v{H}_S$に埋め込まれた情報量と、$\v{H}_S$とその平均$\v{H}_p$の間の距離の両方に依存します。既存の集中境界は、$\v{H}_p$のまわりの集中がその次元とともに緩くなることを示しているようで、学習のために$\v{H}_S$の次元を制限する必要がある可能性を示唆しています。私たちは、古典的なハンケル行列と、文字列の接頭辞または因子に基づく学習に有用ないくつかの変種に対して、新しい次元フリーの集中境界を証明します。実験により、これらの境界はタイトであり、既存の（次元依存の）境界を大幅に改善することが実証されています。これらの結果の1つの帰結は、すべての観測が経験的行列内に記録されている場合でも、スペクトル学習アプローチが一致性を保つことです。

A Gibbs Sampler for Learning DAGs
DAGを学習するためのギブスサンプラー

We propose a Gibbs sampler for structure learning in directed acyclic graph (DAG) models. The standard Markov chain Monte Carlo algorithms used for learning DAGs are random-walk Metropolis-Hastings samplers. These samplers are guaranteed to converge asymptotically but often mix slowly when exploring the large graph spaces that arise in structure learning. In each step, the sampler we propose draws entire sets of parents for multiple nodes from the appropriate conditional distribution. This provides an efficient way to make large moves in graph space, permitting faster mixing whilst retaining asymptotic guarantees of convergence. The conditional distribution is related to variable selection with candidate parents playing the role of covariates or inputs. We empirically examine the performance of the sampler using several simulated and real data examples. The proposed method gives robust results in diverse settings, outperforming several existing Bayesian and frequentist methods. In addition, our empirical results shed some light on the relative merits of Bayesian and constraint-based methods for structure learning.

私たちは、有向非巡回グラフ(DAG)モデルにおける構造学習のためのギブスサンプラーを提案します。DAGの学習に使用される標準的なマルコフ連鎖モンテカルロアルゴリズムは、ランダムウォークメトロポリス-ヘイスティングスサンプラーです。これらのサンプラーは漸近的に収束することが保証されているが、構造学習で生じる大規模なグラフ空間を探索する際には、しばしばゆっくりと混合します。各ステップで、我々が提案するサンプラーは、適切な条件付き分布から複数のノードの親のセット全体を抽出します。これにより、グラフ空間で大きな動きを効率的に行うことができ、漸近的な収束の保証を維持しながら、より高速な混合が可能になります。条件付き分布は、候補となる親が共変量または入力の役割を果たす変数選択に関連しています。私たちは、いくつかのシミュレーションデータ例と実際のデータ例を使用して、サンプラーのパフォーマンスを経験的に検証します。提案された方法は、さまざまな設定で堅牢な結果をもたらし、既存のベイズ法や頻度論法のいくつかよりも優れています。さらに、我々の経験的結果は、構造学習におけるベイズ法と制約ベースの方法の相対的なメリットにいくらか光を当てています。

Consistent Distribution-Free K-Sample and Independence Tests for Univariate Random Variables
単変量ランダム変数に対する一貫した分布フリーKサンプルと独立性検定

A popular approach for testing if two univariate random variables are statistically independent consists of partitioning the sample space into bins, and evaluating a test statistic on the binned data. The partition size matters, and the optimal partition size is data dependent. While for detecting simple relationships coarse partitions may be best, for detecting complex relationships a great gain in power can be achieved by considering finer partitions. We suggest novel consistent distribution-free tests that are based on summation or maximization aggregation of scores over all partitions of a fixed size. We show that our test statistics based on summation can serve as good estimators of the mutual information. Moreover, we suggest regularized tests that aggregate over all partition sizes, and prove those are consistent too. We provide polynomial-time algorithms, which are critical for computing the suggested test statistics efficiently. We show that the power of the regularized tests is excellent compared to existing tests, and almost as powerful as the tests based on the optimal (yet unknown in practice) partition size, in simulations as well as on a real data example.

2つの単変量ランダム変数が統計的に独立しているかどうかをテストするための一般的なアプローチは、サンプル空間をビンに分割し、ビン化されたデータに対してテスト統計を評価することです。パーティションサイズは重要であり、最適なパーティションサイズはデータに依存します。単純な関係を検出するには粗いパーティションが最適かもしれませんが、複雑な関係を検出するには、より細かいパーティションを検討することで、検出力を大幅に向上させることができます。私たちは、固定サイズのすべてのパーティションにわたるスコアの合計または最大化集計に基づく、一貫性のある分布フリーの新しいテストを提案します。私たちは、合計に基づくテスト統計が相互情報量の優れた推定値として機能できることを示します。さらに、すべてのパーティションサイズにわたって集計する正規化テストを提案し、それらも一貫していることを証明します。私たちは、提案されたテスト統計を効率的に計算するために不可欠な多項式時間アルゴリズムを提供します。私たちは、シミュレーションと実際のデータ例において、正規化テストの検出力が既存のテストと比較して優れており、最適な(実際には不明)パーティションサイズに基づくテストとほぼ同等であることを示します。

Non-linear Causal Inference using Gaussianity Measures
ガウス性測度を用いた非線形因果推論

We provide theoretical and empirical evidence for a type of asymmetry between causes and effects that is present when these are related via linear models contaminated with additive non-Gaussian noise. Assuming that the causes and the effects have the same distribution, we show that the distribution of the residuals of a linear fit in the anti-causal direction is closer to a Gaussian than the distribution of the residuals in the causal direction. This Gaussianization effect is characterized by reduction of the magnitude of the high-order cumulants and by an increment of the differential entropy of the residuals. The problem of non-linear causal inference is addressed by performing an embedding in an expanded feature space, in which the relation between causes and effects can be assumed to be linear. The effectiveness of a method to discriminate between causes and effects based on this type of asymmetry is illustrated in a variety of experiments using different measures of Gaussianity. The proposed method is shown to be competitive with state-of-the-art techniques for causal inference.

私たちは、原因と結果が加法的な非ガウスノイズで汚染された線形モデルを介して関連付けられているときに存在する、原因と結果の間の一種の非対称性について、理論的かつ経験的な証拠を提供します。原因と結果が同じ分布を持つと仮定すると、反因果方向の線形近似の残差の分布は、因果方向の残差の分布よりもガウスに近いことを示す。このガウス化効果は、高次キュムラントの大きさの減少と、残差の微分エントロピーの増加によって特徴付けられます。非線形因果推論の問題は、原因と結果の関係が線形であると仮定できる拡張された特徴空間に埋め込みを実行することによって対処されます。このタイプの非対称性に基づいて原因と結果を区別する方法の有効性は、ガウス性のさまざまな尺度を使用したさまざまな実験で説明されます。提案された方法は、因果推論の最先端技術と競合できることが示されています。

Statistical-Computational Tradeoffs in Planted Problems and Submatrix Localization with a Growing Number of Clusters and Submatrices
クラスターとサブマトリックスの数が増加する場合の、植え付け問題とサブマトリックスのローカリゼーションにおける統計的・計算的トレードオフ

We consider two closely related problems: planted clustering and submatrix localization. In the planted clustering problem, a random graph is generated based on an underlying cluster structure of the nodes; the task is to recover these clusters given the graph. The submatrix localization problem concerns locating hidden submatrices with elevated means inside a large real-valued random matrix. Of particular interest is the setting where the number of clusters/submatrices is allowed to grow unbounded with the problem size. These formulations cover several classical models such as planted clique, planted densest subgraph, planted partition, planted coloring, and the stochastic block model, which are widely used for studying community detection, graph clustering and bi-clustering. For both problems, we show that the space of the model parameters (cluster/submatrix size, edge probabilities and the mean of the submatrices) can be partitioned into four disjoint regions corresponding to decreasing statistical and computational complexities: (1) the impossible regime, where all algorithms fail; (2) the hard regime, where the computationally expensive Maximum Likelihood Estimator (MLE) succeeds; (3) the easy regime, where the polynomial-time convexified MLE succeeds; (4) the simple regime, where a local counting/thresholding procedure succeeds. Moreover, we show that each of these algorithms provably fails in the harder regimes. Our results establish the minimax recovery limits, which are tight up to universal constants and hold even with a growing number of clusters/submatrices, and provide order-wise stronger performance guarantees for polynomial-time algorithms than previously known. Our study demonstrates the tradeoffs between statistical and computational considerations, and suggests that the minimax limits may not be achievable by polynomial-time algorithms.

私たちは、密接に関連した2つの問題、すなわち、植え付けられたクラスタリングとサブマトリックスのローカリゼーションについて検討します。植え付けられたクラスタリングの問題では、ノードの基になるクラスター構造に基づいてランダムグラフが生成されます。タスクは、グラフが与えられた場合にこれらのクラスターを回復することです。サブマトリックスのローカリゼーションの問題は、大きな実数値のランダムマトリックス内で平均値が高い隠れたサブマトリックスを見つけることです。特に興味深いのは、クラスター/サブマトリックスの数が問題のサイズとともに無制限に増加することが許可されている設定です。これらの定式化は、植え付けられたクリーク、植え付けられた最密サブグラフ、植え付けられたパーティション、植え付けられたカラーリング、確率的ブロックモデルなど、コミュニティ検出、グラフクラスタリング、およびバイクラスタリングの研究に広く使用されているいくつかの古典的なモデルをカバーしています。両方の問題に対して、モデルパラメーター(クラスター/サブマトリックスのサイズ、エッジ確率、およびサブマトリックスの平均)の空間を、統計的および計算的複雑さが順に減少する4つの互いに素な領域に分割できることを示します。(1)不可能な領域では、すべてのアルゴリズムが失敗します。(2)困難な領域では、計算コストの高い最尤推定量(MLE)が成功します。(3)容易な領域では、多項式時間の凸緩和されたMLEが成功します。(4)単純な領域では、ローカルなカウント/しきい値処理手順が成功します。さらに、これらのアルゴリズムのそれぞれが、より難しい領域では失敗することを証明します。私たちの結果は、普遍定数を除いてタイトで、クラスター/サブマトリックスの数が増えても成り立つミニマックス回復限界を確立し、多項式時間アルゴリズムに対して、これまでに知られていたものよりオーダーの意味で強いパフォーマンス保証を提供します。私たちの研究は、統計的考慮事項と計算的考慮事項の間のトレードオフを示し、ミニマックス限界は多項式時間アルゴリズムでは達成できない可能性があることを示唆しています。

Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests
信頼区間と仮説検定によるランダムフォレストの不確実性の定量化

This work develops formal statistical inference procedures for predictions generated by supervised learning ensembles. Ensemble methods based on bootstrapping, such as bagging and random forests, have improved the predictive accuracy of individual trees, but fail to provide a framework in which distributional results can be easily determined. Instead of aggregating full bootstrap samples, we consider predicting by averaging over trees built on subsamples of the training set and demonstrate that the resulting estimator takes the form of a U-statistic. As such, predictions for individual feature vectors are asymptotically normal, allowing for confidence intervals to accompany predictions. In practice, a subset of subsamples is used for computational speed; here our estimators take the form of incomplete U-statistics and equivalent results are derived. We further demonstrate that this setup provides a framework for testing the significance of features. Moreover, the internal estimation method we develop allows us to estimate the variance parameters and perform these inference procedures at no additional computational cost. Simulations and illustrations on a real data set are provided.

この研究では、教師あり学習アンサンブルによって生成された予測のための正式な統計的推論手順を開発します。バギングやランダムフォレストなどのブートストラップに基づくアンサンブル手法は、個々のツリーの予測精度を向上させましたが、分布結果を簡単に決定できるフレームワークを提供できませんでした。完全なブートストラップサンプルを集約する代わりに、トレーニングセットのサブサンプルに基づいて構築されたツリーを平均化して予測することを検討し、結果として得られる推定量がU統計の形をとることを実証します。したがって、個々の特徴ベクトルの予測は漸近的に正規であり、予測に伴う信頼区間が可能になります。実際には、計算速度を上げるためにサブサンプルのサブセットが使用されます。ここでは、推定量は不完全なU統計の形をとり、同等の結果が得られました。さらに、この設定により、特徴の重要性をテストするためのフレームワークが提供されることも実証します。さらに、開発した内部推定方法により、分散パラメーターを推定し、追加の計算コストをかけずにこれらの推論手順を実行できます。実際のデータセットのシミュレーションと図解が提供されます。

A Unifying Framework in Vector-valued Reproducing Kernel Hilbert Spaces for Manifold Regularization and Co-Regularized Multi-view Learning
多様体正則化と共正則化多視点学習のためのベクトル値再現カーネルヒルベルト空間における統一フレームワーク

This paper presents a general vector-valued reproducing kernel Hilbert spaces (RKHS) framework for the problem of learning an unknown functional dependency between a structured input space and a structured output space. Our formulation encompasses both Vector-valued Manifold Regularization and Co-regularized Multi-view Learning, providing in particular a unifying framework linking these two important learning approaches. In the case of the least square loss function, we provide a closed form solution, which is obtained by solving a system of linear equations. In the case of Support Vector Machine (SVM) classification, our formulation generalizes in particular both the binary Laplacian SVM to the multi-class, multi-view settings and the multi-class Simplex Cone SVM to the semi-supervised, multi-view settings. The solution is obtained by solving a single quadratic optimization problem, as in standard SVM, via the Sequential Minimal Optimization (SMO) approach. Empirical results obtained on the task of object recognition, using several challenging data sets, demonstrate the competitiveness of our algorithms compared with other state-of-the-art methods.

この論文では、構造化された入力空間と構造化された出力空間の間の未知の関数依存関係を学習する問題に対する一般的なベクトル値再生カーネルヒルベルト空間(RKHS)フレームワークを紹介します。私たちの定式化は、ベクトル値マニフォールド正則化と共正則化マルチビュー学習の両方を包含し、特にこれら2つの重要な学習アプローチをリンクする統一フレームワークを提供します。最小二乗損失関数の場合、線形方程式系を解くことで得られる閉じた形式のソリューションを提供します。サポートベクターマシン(SVM)分類の場合、私たちの定式化は、特にバイナリラプラシアンSVMをマルチクラス、マルチビュー設定に、マルチクラスシンプレックスコーンSVMを半教師あり、マルチビュー設定に一般化します。ソリューションは、標準SVMと同様に、逐次最小最適化(SMO)アプローチを介して単一の二次最適化問題を解くことで得られます。いくつかの難しいデータセットを使用した物体認識タスクで得られた実験結果は、他の最先端の方法と比較した当社のアルゴリズムの競争力を実証しています。

Learning Using Anti-Training with Sacrificial Data
犠牲データを用いたアンチトレーニングを用いた学習

Traditionally the machine-learning community has viewed the No Free Lunch (NFL) theorems for search and optimization as a limitation. We review, analyze, and unify the NFL theorem with the perspectives of “blind” search and meta-learning to arrive at necessary conditions for improving black-box optimization. We survey meta-learning literature to determine when and how meta-learning can benefit machine learning. Then, we generalize meta-learning in the context of the NFL theorems, to arrive at a novel technique called anti-training with sacrificial data (ATSD). Our technique applies at the meta level to arrive at domain specific algorithms. We also show how to generate sacrificial data. An extensive case study is presented along with simulated annealing results to demonstrate the efficacy of the ATSD method.

従来、機械学習コミュニティは、検索と最適化のためのNFL(No Free Lunch)定理を制限と見なしてきました。NFLの定理を「ブラインド」サーチとメタラーニングの視点でレビュー、分析、統一し、ブラックボックス最適化を改善するための必要な条件に到達します。メタラーニングの文献を調査して、メタラーニングが機械学習にどのようなメリットをもたらすかを判断します。次に、NFLの定理の文脈でメタ学習を一般化し、犠牲データによるアンチトレーニング(ATSD)と呼ばれる新しい手法に到達します。私たちの手法は、メタレベルで適用して、ドメイン固有のアルゴリズムに到達します。また、犠牲データを生成する方法も示します。広範なケーススタディと、ATSD法の有効性を実証するためのシミュレーテッドアニーリング結果が紹介されています。

A Closer Look at Adaptive Regret
適応型後悔を詳しく見る

For the prediction with expert advice setting, we consider methods to construct algorithms that have low adaptive regret. The adaptive regret of an algorithm on a time interval $[t_1,t_2]$ is the loss of the algorithm minus the loss of the best expert over that interval. Adaptive regret measures how well the algorithm approximates the best expert locally, and so is different from, although closely related to, both the classical regret, measured over an initial time interval $[1,t]$, and the tracking regret, where the algorithm is compared to a good sequence of experts over $[1,t]$. We investigate two existing intuitive methods for deriving algorithms with low adaptive regret, one based on specialist experts and the other based on restarts. Quite surprisingly, we show that both methods lead to the same algorithm, namely Fixed Share, which is known for its tracking regret. We provide a thorough analysis of the adaptive regret of Fixed Share. We obtain the exact worst-case adaptive regret for Fixed Share, from which the classical tracking bounds follow. We prove that Fixed Share is optimal for adaptive regret: the worst-case adaptive regret of any algorithm is at least that of an instance of Fixed Share.

専門家のアドバイスによる予測の設定において、適応的後悔が低いアルゴリズムを構築する方法を検討します。時間区間$[t_1,t_2]$でのアルゴリズムの適応的後悔は、その区間でのアルゴリズムの損失から最良の専門家の損失を引いたものです。適応的後悔は、アルゴリズムが最良の専門家を局所的にどれだけよく近似するかを測定するため、初期時間区間$[1,t]$で測定される従来の後悔や、アルゴリズムが$[1,t]$で優れた専門家の系列と比較される追跡後悔とは、密接に関連してはいるものの異なります。適応的後悔が低いアルゴリズムを導出するための既存の2つの直感的な方法、すなわちスペシャリスト型の専門家に基づく方法と再起動に基づく方法を調査します。驚くべきことに、両方の方法から同じアルゴリズム、つまり追跡後悔で知られるFixed Shareが導かれることを示します。Fixed Shareの適応的後悔を徹底的に分析します。Fixed Shareの最悪ケースの適応的後悔を正確に求め、そこから従来の追跡境界が導かれます。Fixed Shareが適応的後悔に関して最適であることを証明します。つまり、どのアルゴリズムの最悪ケースの適応的後悔も、Fixed Shareのあるインスタンスのそれ以上になります。
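To make the notion of adaptive regret on an interval $[t_1,t_2]$ concrete, the sketch below runs a common textbook variant of Fixed Share and measures its regret on a chosen interval. Several things here are assumptions rather than the paper's exact setup: the sharing step mixes toward the uniform distribution, the algorithm's loss is taken as the weighted average of the expert losses, and the names `fixed_share`, `adaptive_regret`, `eta` and `alpha` are illustrative only.

```python
import math


def fixed_share(losses, eta=1.0, alpha=0.1):
    """Return the weight vector used at each round.

    losses[t][i] is the loss of expert i at round t (0-indexed rounds).
    """
    n = len(losses[0])
    w = [1.0 / n] * n
    history = []
    for round_losses in losses:
        history.append(list(w))
        # Exponential-weights (loss) update, then renormalise.
        v = [wi * math.exp(-eta * li) for wi, li in zip(w, round_losses)]
        total = sum(v)
        v = [vi / total for vi in v]
        # Sharing step: keep a fraction alpha spread uniformly over all experts.
        w = [(1.0 - alpha) * vi + alpha / n for vi in v]
    return history


def adaptive_regret(losses, history, t1, t2):
    """Algorithm loss minus best single expert's loss on rounds t1..t2 inclusive."""
    rounds = range(t1, t2 + 1)
    alg = sum(sum(w * l for w, l in zip(history[t], losses[t])) for t in rounds)
    best = min(sum(losses[t][i] for t in rounds) for i in range(len(losses[0])))
    return alg - best


# Expert 0 is perfect for 20 rounds, then expert 1 is perfect for 20 rounds.
losses = [[0.0, 1.0]] * 20 + [[1.0, 0.0]] * 20
history = fixed_share(losses)
regret_second_half = adaptive_regret(losses, history, 20, 39)
```

Because the sharing step keeps every expert's weight bounded away from zero, the weights recover quickly after the switch at round 20, which is why the regret measured on the second interval alone stays small compared with its length.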

Gradients Weights improve Regression and Classification
勾配の重み付けにより、回帰と分類が改善

In regression problems over $\mathbb{R}^d$, the unknown function $f$ often varies more in some coordinates than in others. We show that weighting each coordinate $i$ according to an estimate of the variation of $f$ along coordinate $i$ (e.g. the $L_1$ norm of the $i$th-directional derivative of $f$) is an efficient way to significantly improve the performance of distance-based regressors such as kernel and $k$-NN regressors. The approach, termed Gradient Weighting (GW), consists of a first-pass regression estimate $f_n$ which serves to evaluate the directional derivatives of $f$, and a second-pass regression estimate on the re-weighted data. The GW approach can be instantiated for both regression and classification, and is grounded in strong theoretical principles having to do with the way regression bias and variance are affected by a generic feature-weighting scheme. These theoretical principles provide further technical foundation for some existing feature-weighting heuristics that have proved successful in practice. We propose a simple estimator of these derivative norms and prove its consistency. The proposed estimator computes efficiently and easily extends to run online. We then derive a classification version of the GW approach which evaluates on real-world datasets with as much success as its regression counterpart.

$\mathbb{R}^d$上の回帰問題では、未知の関数$f$は、ある座標では他の座標よりも大きく変化することがよくあります。座標$i$に沿った$f$の変化の推定値(たとえば、$f$の$i$番目の方向微分の$L_1$ノルム)に従って各座標$i$に重み付けすると、カーネル回帰や$k$-NN回帰などの距離ベースの回帰のパフォーマンスを大幅に向上できる効率的な方法であることを示します。勾配重み付け(GW)と呼ばれるこのアプローチは、$f$の方向微分を評価するための1回目の回帰推定値$f_n$と、再重み付けされたデータに対する2回目の回帰推定値で構成されます。GWアプローチは、回帰と分類の両方にインスタンス化でき、回帰バイアスと分散が一般的な特徴重み付けスキームによってどのように影響を受けるかに関する強力な理論的原理に基づいています。これらの理論的原理は、実際に成功していることが証明されている既存の特徴重み付けヒューリスティックのさらなる技術的基礎を提供します。これらの微分ノルムの単純な推定器を提案し、その一貫性を証明します。提案された推定器は効率的に計算し、オンラインで実行できるように簡単に拡張できます。次に、実際のデータセットで回帰版と同等の成功率で評価するGWアプローチの分類バージョンを導出します。
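The two-pass structure described above (a first-pass estimate $f_n$ used to gauge how much $f$ varies along each coordinate, then a second pass on re-weighted data) might be sketched as follows. This is a hedged illustration, not the paper's estimator: it assumes a plain $k$-NN regressor for both passes, approximates the derivative norm by averaged absolute finite differences of $f_n$ with a hypothetical step `t`, and uses made-up names (`knn_predict`, `gradient_weights`).

```python
import random


def knn_predict(train_x, train_y, query, k, weights=None):
    """Average the targets of the k nearest points under a coordinate-weighted distance."""
    d = len(query)
    w = weights or [1.0] * d
    dists = sorted(
        (sum((w[i] * (x[i] - query[i])) ** 2 for i in range(d)), y)
        for x, y in zip(train_x, train_y)
    )
    return sum(y for _, y in dists[:k]) / k


def gradient_weights(train_x, train_y, k=10, t=0.2):
    """First pass: per-coordinate average of |f_n(x + t e_i) - f_n(x - t e_i)| / (2t)."""
    d = len(train_x[0])
    weights = []
    for i in range(d):
        total = 0.0
        for x in train_x:
            up = list(x)
            up[i] += t
            down = list(x)
            down[i] -= t
            total += abs(
                knn_predict(train_x, train_y, up, k)
                - knn_predict(train_x, train_y, down, k)
            )
        weights.append(total / (2.0 * t * len(train_x)))
    return weights


# Toy data: the target depends only on coordinate 0; coordinate 1 is irrelevant.
rng = random.Random(0)
xs = [[rng.random(), rng.random()] for _ in range(200)]
ys = [5.0 * x[0] for x in xs]

w = gradient_weights(xs, ys)
# Second pass: k-NN with distances re-weighted by the estimated variation.
prediction = knn_predict(xs, ys, [0.5, 0.5], k=10, weights=w)
```

On this toy problem the weight for the informative coordinate comes out much larger than the weight for the irrelevant one, so the second-pass neighbourhoods are stretched along the irrelevant axis and narrowed along the informative one.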

MEKA: A Multi-label/Multi-target Extension to WEKA
MEKA:WEKAのマルチラベル/マルチターゲット拡張

Multi-label classification has rapidly attracted interest in the machine learning literature, and there are now a large number and considerable variety of methods for this type of learning. We present MEKA: an open-source Java framework based on the well-known WEKA library. MEKA provides interfaces to facilitate practical application, and a wealth of multi-label classifiers, evaluation metrics, and tools for multi-label experiments and development. It supports multi-label and multi-target data, including in incremental and semi-supervised contexts.

マルチラベル分類は、機械学習の文献で急速に関心を集めており、現在、このタイプの学習には多数の、かなり多様な手法が存在します。私たちは、よく知られたWEKAライブラリに基づくオープンソースのJavaフレームワークであるMEKAを紹介します。MEKAは、実用的な適用を容易にするためのインターフェースと、豊富なマルチラベル分類器、評価指標、およびマルチラベルの実験と開発のためのツールを提供します。マルチラベルおよびマルチターゲットのデータをサポートし、インクリメンタル学習や半教師あり学習の文脈にも対応します。

Operator-valued Kernels for Learning from Functional Response Data
機能応答データからの学習のための演算子値カーネル

In this paper (a combined and expanded version of the earlier conference papers Kadri et al., 2010, 2011c) we consider the problems of supervised classification and regression in the case where attributes and labels are functions: each data point is represented by a set of functions, and its label is also a function. We focus on the use of reproducing kernel Hilbert space theory to learn from such functional data. Basic concepts and properties of kernel-based learning are extended to include the estimation of function-valued functions. In this setting, the representer theorem is restated, a set of rigorously defined infinite-dimensional operator-valued kernels that can be valuably applied when the data are functions is described, and a learning algorithm for nonlinear functional data analysis is introduced. The methodology is illustrated through speech and audio signal processing experiments.

この論文(以前の会議論文Kadriら, 2010, 2011cを統合し拡張したものです)では、属性とラベルが関数である場合の教師あり分類と回帰の問題を検討します。すなわち、各データは関数の集合で表現され、ラベルもまた関数です。私たちは、このような関数データから学習するために、再生核ヒルベルト空間の理論を使用することに焦点を当てています。カーネルベースの学習の基本的な概念と特性は、関数値関数の推定を含むように拡張されています。この設定では、表現定理が改めて定式化され、データが関数である場合に有効に適用できる、厳密に定義された無限次元演算子値カーネルのセットが説明され、非線形関数データ分析の学習アルゴリズムが導入されます。この方法論は、音声およびオーディオ信号処理の実験を通じて説明されています。

Analysis of Classification-based Policy Iteration Algorithms
分類ベースのポリシー反復アルゴリズムの分析

We introduce a variant of the classification-based approach to policy iteration which uses a cost-sensitive loss function weighting each classification mistake by its actual regret, that is, the difference between the action-value of the greedy action and of the action chosen by the classifier. For this algorithm, we provide a full finite-sample analysis. Our results state a performance bound in terms of the number of policy improvement steps, the number of rollouts used in each iteration, the capacity of the considered policy space (classifier), and a capacity measure which indicates how well the policy space can approximate policies that are greedy with respect to any of its members. The analysis reveals a tradeoff between the estimation and approximation errors in this classification-based policy iteration setting. Furthermore it confirms the intuition that classification-based policy iteration algorithms could be favorably compared to value-based approaches when the policies can be approximated more easily than their corresponding value functions. We also study the consistency of the algorithm when there exists a sequence of policy spaces with increasing capacity.

私たちは、ポリシー反復に対する分類ベースのアプローチの変形を導入します。これは、コストに敏感な損失関数を使用して、各分類ミスを実際の後悔、つまり貪欲なアクションのアクション値と分類器によって選択されたアクションのアクション値の差で重み付けします。このアルゴリズムに対して、完全な有限サンプル分析を提供します。結果は、ポリシー改善ステップの数、各反復で使用されるロールアウトの数、検討対象のポリシー空間(分類器)の容量、およびポリシー空間がそのメンバーのいずれかに関して貪欲なポリシーをどれだけ正確に近似できるかを示す容量測定の観点から、パフォーマンスの境界を示します。分析により、この分類ベースのポリシー反復設定での推定エラーと近似エラーのトレードオフが明らかになりました。さらに、ポリシーが対応する価値関数よりも簡単に近似できる場合、分類ベースのポリシー反復アルゴリズムは価値ベースのアプローチよりも有利に比較できるという直感を裏付けます。また、容量が増加するポリシー空間のシーケンスが存在する場合のアルゴリズムの一貫性も調査します。

Loss Minimization and Parameter Estimation with Heavy Tails
ヘビーテールによる損失最小化とパラメータ推定

This work studies applications and generalizations of a simple estimation technique that provides exponential concentration under heavy-tailed distributions, assuming only bounded low-order moments. We show that the technique can be used for approximate minimization of smooth and strongly convex losses, and specifically for least squares linear regression. For instance, our $d$-dimensional estimator requires just $\tilde{O}(d\log(1/\delta))$ random samples to obtain a constant factor approximation to the optimal least squares loss with probability $1-\delta$, without requiring the covariates or noise to be bounded or subgaussian. We provide further applications to sparse linear regression and low-rank covariance matrix estimation with similar allowances on the noise and covariate distributions. The core technique is a generalization of the median-of-means estimator to arbitrary metric spaces.

この研究では、有界の低次モーメントのみを仮定して、ヘビーテール分布の下で指数関数的な集中を提供する単純な推定手法の応用と一般化を研究します。この手法は、滑らかで強凸な損失の近似最小化、特に最小二乗線形回帰に使用できることを示します。たとえば、私たちの$d$次元推定量は、共変量やノイズが有界またはサブガウスであることを要求せずに、確率$1-\delta$で最適な最小二乗損失の定数倍近似を得るために、$\tilde{O}(d\log(1/\delta))$個のランダムサンプルのみを必要とします。ノイズと共変量の分布に同様の許容を持たせたうえで、スパース線形回帰と低ランク共分散行列推定へのさらなる応用を提供します。コアとなる手法は、平均の中央値(median-of-means)推定量を任意の距離空間に一般化したものです。
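For readers unfamiliar with the median-of-means estimator that the abstract generalizes, the classical scalar version can be sketched as below. This is only the textbook one-dimensional case, not the paper's generalization to arbitrary metric spaces; the function name `median_of_means` and the choice of contiguous, equally sized groups are assumptions made for illustration.

```python
import statistics


def median_of_means(samples, num_groups):
    """Split samples into num_groups contiguous blocks, average each, take the median."""
    samples = list(samples)
    size = len(samples) // num_groups
    if size == 0:
        raise ValueError("need at least one sample per group")
    # Any leftover samples beyond num_groups * size are ignored in this sketch.
    means = [
        sum(samples[g * size:(g + 1) * size]) / size
        for g in range(num_groups)
    ]
    return statistics.median(means)


# 40 observations equal to 1.0, with a single extreme outlier.
data = [1.0] * 40
data[3] = 1e6

plain_mean = sum(data) / len(data)      # dominated by the outlier
robust = median_of_means(data, 5)       # only one of five group means is corrupted
```

A single extreme value can ruin at most one group mean, and the median ignores it as long as most groups are clean, which is the intuition behind the exponential concentration under heavy tails mentioned above.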

Extremal Mechanisms for Local Differential Privacy
局所差分プライバシーのための極限メカニズム

Local differential privacy has recently surfaced as a strong measure of privacy in contexts where personal information remains private even from data analysts. Working in a setting where both the data providers and data analysts want to maximize the utility of statistical analyses performed on the released data, we study the fundamental trade-off between local differential privacy and utility. This trade-off is formulated as a constrained optimization problem: maximize utility subject to local differential privacy constraints. We introduce a combinatorial family of extremal privatization mechanisms, which we call staircase mechanisms, and show that it contains the optimal privatization mechanisms for a broad class of information theoretic utilities such as mutual information and $f$-divergences. We further prove that for any utility function and any privacy level, solving the privacy-utility maximization problem is equivalent to solving a finite-dimensional linear program, the outcome of which is the optimal staircase mechanism. However, solving this linear program can be computationally expensive since it has a number of variables that is exponential in the size of the alphabet the data lives in. To account for this, we show that two simple privatization mechanisms, the binary and randomized response mechanisms, are universally optimal in the low and high privacy regimes, and well approximate the intermediate regime.

ローカル差分プライバシーは、個人情報がデータアナリストに対しても非公開のままである状況において、プライバシーの強力な尺度として最近浮上しました。データプロバイダーとデータアナリストの両方が、公開されたデータに対して実行される統計分析の効用を最大化したいという設定で作業し、ローカル差分プライバシーと効用との間の基本的なトレードオフを研究します。このトレードオフは、ローカル差分プライバシー制約の下で効用を最大化するという制約付き最適化問題として定式化されます。私たちは、階段メカニズムと呼ぶ極値的なプライベート化メカニズムの組合せ的なファミリーを導入し、相互情報量や$f$ダイバージェンスなどの情報理論的効用の広範なクラスに対して、最適なプライベート化メカニズムがこのファミリーに含まれることを示します。さらに、任意の効用関数と任意のプライバシーレベルに対して、プライバシーと効用の最大化問題を解くことは、有限次元の線形計画を解くことと同等であり、その解が最適な階段メカニズムになることを証明します。ただし、この線形計画は、データが属するアルファベットのサイズに対して指数関数的な数の変数を持つため、解くための計算コストが高くなる可能性があります。これに対処するため、2つの単純なプライベート化メカニズムであるバイナリメカニズムとランダム化応答メカニズムが、低プライバシー領域と高プライバシー領域で普遍的に最適であり、中間の領域でもよい近似となることを示します。
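As a concrete picture of what a local privatization mechanism looks like, the sketch below implements the classical randomized response for a single bit, which satisfies $\varepsilon$-local differential privacy because the ratio of the two reporting probabilities is exactly $e^{\varepsilon}$. It is an assumption-laden illustration: the paper's binary and randomized response mechanisms are defined for general alphabets and may differ in detail, and the names `keep_probability`, `randomized_response` and `estimate_mean` are hypothetical.

```python
import math
import random


def keep_probability(epsilon):
    """Probability of reporting the true bit: e^eps / (1 + e^eps)."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(bit, epsilon, rng):
    """Report the true bit with probability p, the flipped bit otherwise."""
    return bit if rng.random() < keep_probability(epsilon) else 1 - bit


def estimate_mean(reports, epsilon):
    """Debias the reported frequency: E[report] = mu * (2p - 1) + (1 - p)."""
    p = keep_probability(epsilon)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(0)
epsilon = 1.0
true_bits = [1] * 6000 + [0] * 14000          # true mean is 0.3
reports = [randomized_response(b, epsilon, rng) for b in true_bits]
estimate = estimate_mean(reports, epsilon)
```

The analyst never sees an individual's true bit, yet the population mean can still be recovered approximately; smaller values of `epsilon` give stronger privacy and a noisier estimate, which is the privacy-utility trade-off the abstract formalizes.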

A Consistent Information Criterion for Support Vector Machines in Diverging Model Spaces
発散モデル空間におけるサポートベクトルマシンの一貫した情報量基準

Information criteria have been popularly used in model selection and proved to possess nice theoretical properties. For classification, Claeskens et al. (2008) proposed support vector machine information criterion for feature selection and provided encouraging numerical evidence. Yet no theoretical justification was given there. This work aims to fill the gap and to provide some theoretical justifications for support vector machine information criterion in both fixed and diverging model spaces. We first derive a uniform convergence rate for the support vector machine solution and then show that a modification of the support vector machine information criterion achieves model selection consistency even when the number of features diverges at an exponential rate of the sample size. This consistency result can be further applied to selecting the optimal tuning parameter for various penalized support vector machine methods. Finite-sample performance of the proposed information criterion is investigated using Monte Carlo studies and one real-world gene selection problem.

情報基準はモデル選択で広く使用されており、優れた理論的特性を持つことが証明されています。分類については、Claeskensら(2008)が特徴選択のためのサポートベクターマシン情報基準を提案し、有望な数値的証拠を示しました。しかし、そこには理論的な正当性は示されていません。本研究は、そのギャップを埋め、固定モデル空間と発散モデル空間の両方でサポートベクターマシン情報基準の理論的正当性を示すことを目的としています。まず、サポートベクターマシンソリューションの均一な収束率を導出し、次にサポートベクターマシン情報基準の修正により、特徴の数がサンプルサイズの指数関数的に発散する場合でもモデル選択の一貫性が達成されることを示します。この一貫性の結果は、さまざまなペナルティ付きサポートベクターマシン法の最適なチューニングパラメータの選択にさらに適用できます。提案された情報基準の有限サンプルパフォーマンスは、モンテカルロ研究と1つの実際の遺伝子選択問題を使用して調査されます。

LLORMA: Local Low-Rank Matrix Approximation
LLORMA: ローカル低ランク行列近似

Matrix approximation is a common tool in recommendation systems, text mining, and computer vision. A prevalent assumption in constructing matrix approximations is that the partially observed matrix is low-rank. In this paper, we propose, analyze, and experiment with two procedures, one parallel and the other global, for constructing local matrix approximations. The two approaches approximate the observed matrix as a weighted sum of low-rank matrices. These matrices are limited to a local region of the observed matrix. We analyze the accuracy of the proposed local low-rank modeling. Our experiments show improvements in prediction accuracy over classical approaches for recommendation tasks.

行列近似は、レコメンデーションシステム、テキストマイニング、およびコンピュータービジョンの一般的なツールです。行列近似を構築する際の一般的な仮定は、部分的に観測された行列が低ランクであるというものです。この論文では、ローカル行列近似を構築するための2つの手順(1つは並列、もう1つはグローバル)を提案、分析、および実験します。この2つの手法では、観測された行列を低ランク行列の加重和として近似します。これらの行列は、観測された行列の局所領域に限定されます。提案されたローカル低ランクモデリングの精度を分析します。私たちの実験では、レコメンデーションタスクの従来のアプローチよりも予測精度が向上していることが示されています。

Convex Calibration Dimension for Multiclass Loss Matrices
マルチクラス損失行列の凸型キャリブレーション次元

We study consistency properties of surrogate loss functions for general multiclass learning problems, defined by a general multiclass loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be calibrated with respect to a loss matrix in this setting. We then introduce the notion of convex calibration dimension of a multiclass loss matrix, which measures the smallest “size” of a prediction space in which it is possible to design a convex surrogate that is calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, we apply our framework to study various subset ranking losses, and use the convex calibration dimension as a tool to show both the existence and non-existence of various types of convex calibrated surrogates for these losses. Our results strengthen recent results of Duchi et al. (2010) and Calauzènes et al. (2012) on the non-existence of certain types of convex calibrated surrogates in subset ranking. We anticipate the convex calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems.

私たちは、一般的なマルチクラス損失行列によって定義される一般的なマルチクラス学習問題に対する代理損失関数の一貫性特性を研究します。私たちは、バイナリおよびマルチクラスの0-1分類問題(および他の特定の学習問題)について研究されてきた分類キャリブレーションの概念を一般的なマルチクラス設定に拡張し、この設定で代理損失が損失行列に対してキャリブレーションされるための必要十分条件を導出します。次に、マルチクラス損失行列の凸キャリブレーション次元の概念を導入します。これは、損失行列に対してキャリブレーションされた凸代理を設計できる予測空間の最小の「サイズ」を測定します。我々はこの量の上限と下限の両方を導出し、これらの結果を使用してさまざまな損失行列を分析します。特に、我々はフレームワークを適用してさまざまなサブセットランキング損失を研究し、凸キャリブレーション次元を、これらの損失に対するさまざまな種類の凸キャリブレーションされた代理の存在と非存在の両方を示すツールとして使用します。我々の結果は、サブセットランキングにおける特定の種類の凸キャリブレーションされた代理が存在しないことに関するDuchiら(2010)およびCalauzènesら(2012)の最近の結果を強化するものです。凸キャリブレーション次元は、一般的なマルチクラス学習問題における代理損失の研究と設計に役立つツールとなることが期待されます。

Learning the Variance of the Reward-To-Go
Reward-To-Goの分散の学習

In Markov decision processes (MDPs), the variance of the reward-to-go is a natural measure of uncertainty about the long term performance of a policy, and is important in domains such as finance, resource allocation, and process control. Currently however, there is no tractable procedure for calculating it in large scale MDPs. This is in contrast to the case of the expected reward-to-go, also known as the value function, for which effective simulation-based algorithms are known, and have been used successfully in various domains. In this paper we extend temporal difference (TD) learning algorithms to estimating the variance of the reward-to-go for a fixed policy. We propose variants of both TD(0) and LSTD($\lambda$) with linear function approximation, prove their convergence, and demonstrate their utility in an option pricing problem. Our results show a dramatic improvement in terms of sample efficiency over standard Monte-Carlo methods, which are currently the state-of-the-art.

マルコフ決定プロセス(MDP)では、reward-to-goの分散は、ポリシーの長期的なパフォーマンスに関する不確実性の自然な尺度であり、金融、リソース配分、プロセス制御などの領域で重要です。しかし、現在、大規模なMDPでそれを計算するための扱いやすい手順はありません。これは、価値関数とも呼ばれる期待reward-to-goの場合とは対照的であり、こちらは効果的なシミュレーションベースのアルゴリズムが知られており、さまざまなドメインで成功裏に使用されています。この論文では、時間差(TD)学習アルゴリズムを、固定ポリシーのreward-to-goの分散を推定するように拡張します。線形関数近似を持つTD(0)とLSTD($\lambda$)の両方のバリアントを提案し、それらの収束を証明し、オプション価格設定問題での有用性を実証します。私たちの結果は、現在最先端である標準的なモンテカルロ法と比較して、サンプル効率の点で劇的に改善されていることを示しています。
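The idea of estimating the variance with TD learning can be sketched in a tabular, undiscounted, episodic setting, which is a simplification of the linear function approximation used in the paper. The sketch tracks the first moment $J$ and the second moment $M$ of the reward-to-go using the identity $B^2 = r^2 + 2 r B' + B'^2$, then reports $M - J^2$. The toy chain, the function names, and the $1/n$ step size are assumptions made for illustration.

```python
import random


def td0_variance(sample_episode, n_states, n_episodes, seed=0):
    """Tabular TD(0) estimates of the variance of the reward-to-go.

    sample_episode(rng) must return a list of (state, reward, next_state)
    transitions; index n_states denotes the terminal state.
    """
    rng = random.Random(seed)
    J = [0.0] * (n_states + 1)   # first moment; terminal entry stays 0
    M = [0.0] * (n_states + 1)   # second moment; terminal entry stays 0
    visits = [0] * (n_states + 1)
    for _ in range(n_episodes):
        for s, r, s_next in sample_episode(rng):
            visits[s] += 1
            alpha = 1.0 / visits[s]  # assumed step-size schedule
            j_target = r + J[s_next]
            m_target = r * r + 2.0 * r * J[s_next] + M[s_next]
            J[s] += alpha * (j_target - J[s])
            M[s] += alpha * (m_target - M[s])
    return [M[s] - J[s] ** 2 for s in range(n_states)]


def two_step_chain(rng):
    """Hypothetical toy MDP: 0 -> 1 -> terminal, each reward is +1 or -1."""
    transitions = []
    for s in (0, 1):
        r = 1.0 if rng.random() < 0.5 else -1.0
        transitions.append((s, r, s + 1))
    return transitions
```

In this toy chain the reward-to-go from state 0 is a sum of two independent $\pm 1$ rewards, so its true variance is 2, and from state 1 it is 1.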

Noisy Sparse Subspace Clustering
ノイズの多いスパース部分空間クラスタリング

This paper considers the problem of subspace clustering under noise. Specifically, we study the behavior of Sparse Subspace Clustering (SSC) when either adversarial or random noise is added to the unlabeled input data points, which are assumed to be in a union of low-dimensional subspaces. We show that a modified version of SSC is provably effective in correctly identifying the underlying subspaces, even with noisy data. This extends theoretical guarantee of this algorithm to more practical settings and provides justification to the success of SSC in a class of real applications.

この論文では、ノイズ下での部分空間クラスタリングの問題について考察します。具体的には、低次元部分空間の和集合にあると仮定されるラベル付けされていない入力データポイントに敵対的ノイズまたはランダムノイズが追加された場合のスパース部分空間クラスタリング(SSC)の動作を研究します。SSCの修正版は、ノイズの多いデータであっても、基礎となる部分空間を正しく識別するのに効果的であることが証明されていることを示します。これにより、このアルゴリズムの理論的な保証がより実用的な設定に拡張され、実際のアプリケーションのクラスでのSSCの成功を正当化できます。

Complexity of Representation and Inference in Compositional Models with Part Sharing
部分共有による構成モデルにおける表現と推論の複雑さ

This paper performs a complexity analysis of a class of serial and parallel compositional models of multiple objects and shows that they enable efficient representation and rapid inference. Compositional models are generative and represent objects in a hierarchically distributed manner in terms of parts and subparts, which are constructed recursively by part-subpart compositions. Parts are represented more coarsely at higher levels of the hierarchy, so that the upper levels give coarse summary descriptions (e.g., there is a horse in the image) while the lower levels represent the details (e.g., the positions of the legs of the horse). This hierarchically distributed representation obeys the executive summary principle, meaning that a high level executive only requires a coarse summary description and can, if necessary, get more details by consulting lower level executives. The parts and subparts are organized in terms of hierarchical dictionaries which enables part sharing between different objects allowing efficient representation of many objects. The first main contribution of this paper is to show that compositional models can be mapped onto a parallel visual architecture similar to that used by bio-inspired visual models such as deep convolutional networks but more explicit in terms of representation, hence enabling part detection as well as object detection, and suitable for complexity analysis. Inference algorithms can be run on this architecture to exploit the gains caused by part sharing and executive summary. Effectively, this compositional architecture enables us to perform exact inference simultaneously over a large class of generative models of objects. The second contribution is an analysis of the complexity of compositional models in terms of computation time (for serial computers) and numbers of nodes (e.g., “neurons”) for parallel computers.
In particular, we compute the complexity gains by part sharing and executive summary and their dependence on how the dictionary scales with the level of the hierarchy. We explore three regimes of scaling behavior where the dictionary size (i) increases exponentially with the level of the hierarchy, (ii) is determined by an unsupervised compositional learning algorithm applied to real data, (iii) decreases exponentially with scale. This analysis shows that in some regimes the use of shared parts enables algorithms which can perform inference in time linear in the number of levels for an exponential number of objects. In other regimes part sharing has little advantage for serial computers but can enable linear processing on parallel computers.

この論文では、複数のオブジェクトのシリアルおよび並列構成モデルのクラスの複雑性分析を実行し、これらが効率的な表現と迅速な推論を可能にすることを示します。構成モデルは生成的であり、オブジェクトをパーツとサブパーツの観点から階層的に分散された方法で表現します。パーツとサブパーツは、パーツとサブパーツの構成によって再帰的に構築されます。パーツは階層の上位レベルでより粗く表現されるため、上位レベルは粗い概要説明(画像には馬がいるなど)を提供し、下位レベルは詳細(馬の脚の位置など)を表します。この階層的に分散された表現は、エグゼクティブサマリーの原則に従います。つまり、上位レベルのエグゼクティブは粗い概要説明のみを必要とし、必要に応じて下位レベルのエグゼクティブを参照して詳細を取得できます。パーツとサブパーツは階層的な辞書に基づいて編成されるため、異なるオブジェクト間でパーツを共有でき、多くのオブジェクトを効率的に表現できます。この論文の最初の主な貢献は、合成モデルを、深層畳み込みネットワークなどの生物にヒントを得た視覚モデルで使用されるものと類似した並列視覚アーキテクチャにマッピングできるが、表現の点ではより明示的であるため、オブジェクト検出だけでなくパーツ検出も可能で、複雑性分析に適していることを示すことです。このアーキテクチャで推論アルゴリズムを実行して、パーツ共有とエグゼクティブサマリーによってもたらされる利点を活用できます。実質的に、この合成アーキテクチャにより、オブジェクトの生成モデルの大規模なクラスに対して同時に正確な推論を実行できます。2番目の貢献は、計算時間(シリアルコンピューターの場合)と並列コンピューターのノード数(「ニューロン」など)の観点から見た合成モデルの複雑性の分析です。特に、パーツ共有とエグゼクティブサマリーによる複雑性の利点と、それらが階層レベルに応じて辞書を拡張する方法に依存しているかどうかを計算します。私たちは、辞書のサイズが(i)階層レベルに応じて指数関数的に増加する、(ii)実際のデータに適用された教師なし構成学習アルゴリズムによって決定される、(iii)スケールに応じて指数関数的に減少する、という3つのスケーリング動作の領域を調査します。この分析により、一部の領域では、共有パーツの使用により、指数関数的な数のオブジェクトに対してレベル数に比例した時間で推論を実行できるアルゴリズムが実現できることがわかります。他の領域では、パーツ共有はシリアルコンピューターにはあまりメリットがありませんが、並列コンピューターでは線形処理が可能になります。

Herded Gibbs Sampling
群れギブスサンプリング

The Gibbs sampler is one of the most popular algorithms for inference in statistical models. In this paper, we introduce a herding variant of this algorithm, called herded Gibbs, that is entirely deterministic. We prove that herded Gibbs has an $O(1/T)$ convergence rate for models with independent variables and for fully connected probabilistic graphical models. Herded Gibbs is shown to outperform Gibbs in the tasks of image denoising with MRFs and named entity recognition with CRFs. However, the convergence for herded Gibbs for sparsely connected probabilistic graphical models is still an open problem.

Gibbsサンプラーは、統計モデルでの推論のための最も一般的なアルゴリズムの1つです。この論文では、このアルゴリズムの群れの変種である群れギブス(完全に決定論的)を紹介します。群れギブスが、独立変数を持つモデルと完全に接続された確率的グラフィカルモデルに対して、収束率が$O(1/T)$であることを証明します。Herded Gibbsは、MRFによる画像ノイズ除去とCRFによる名前付きエンティティ認識のタスクでGibbsよりも優れていることが示されています。しかし、群れのギブスがまばらに接続された確率的グラフィカルモデルに収束するかどうかは、まだ未解決の問題です。
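The deterministic update behind herded Gibbs can be sketched for binary variables as follows. This is a minimal illustration under assumptions: one herding weight is kept per variable and per configuration of its neighbors, weights start at zero, and the threshold rule shown is one common way to write the herding step. The function and argument names are hypothetical and are not taken from the paper.

```python
def herded_gibbs(cond_prob_one, neighbors, x0, n_sweeps):
    """Deterministic Gibbs-style sweeps over binary variables.

    cond_prob_one(i, x) returns P(x_i = 1 | rest of x);
    neighbors[i] lists the indices that variable i is conditioned on.
    """
    x = list(x0)
    weights = {}   # one herding weight per (variable, neighbor configuration)
    samples = []
    for _ in range(n_sweeps):
        for i in range(len(x)):
            key = (i, tuple(x[j] for j in neighbors[i]))
            w = weights.get(key, 0.0) + cond_prob_one(i, x)
            if w > 1.0:
                x[i] = 1       # emit a one and pay for it from the weight
                w -= 1.0
            else:
                x[i] = 0
            weights[key] = w
        samples.append(tuple(x))
    return samples
```

For a single variable with no neighbors and $P(x=1)=p$, the number of ones after $T$ sweeps stays within 1 of $Tp$, so the empirical frequency is within $1/T$ of $p$, which illustrates the $O(1/T)$ rate in the simplest case.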

Harry: A Tool for Measuring String Similarity
Harry:文字列の類似性を測定するためのツール

Comparing strings and assessing their similarity is a basic operation in many application domains of machine learning, such as in information retrieval, natural language processing and bioinformatics. The practitioner can choose from a large variety of available similarity measures for this task, each emphasizing different aspects of the string data. In this article, we present Harry, a small tool specifically designed for measuring the similarity of strings. Harry implements over 20 similarity measures, including common string distances and string kernels, such as the Levenshtein distance and the Subsequence kernel. The tool has been designed with efficiency in mind and allows for multi-threaded as well as distributed computing, enabling the analysis of large data sets of strings. Harry supports common data formats and thus can interface with analysis environments, such as Matlab, Pylab and Weka.

文字列を比較し、その類似性を評価することは、情報検索、自然言語処理、バイオインフォマティクスなど、機械学習の多くのアプリケーション領域における基本的な操作です。プラクティショナーは、このタスクに使用可能なさまざまな類似性測度から選択でき、それぞれが文字列データの異なる側面を強調します。この記事では、文字列の類似性を測定するために特別に設計された小さなツールであるHarryを紹介します。Harryは、一般的な文字列距離や、Levenshtein距離や部分シーケンスカーネルなどの文字列カーネルを含む20を超える類似度尺度を実装しています。このツールは効率性を念頭に置いて設計されており、マルチスレッドコンピューティングと分散コンピューティングを可能にし、文字列の大規模なデータセットの分析を可能にします。Harryは一般的なデータ形式をサポートしているため、Matlab、Pylab、Wekaなどの分析環境とインターフェースできます。
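As an illustration of one of the measures named in the abstract, the Levenshtein distance can be computed with the textbook dynamic program below. This is a plain sketch for readers and is not Harry's implementation or interface; the function name is chosen here for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]                    # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (ca != cb)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]
```

The function keeps only two rows of the table, so memory grows with the length of `b` rather than with the product of both lengths.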

Knowledge Matters: Importance of Prior Information for Optimization
ナレッジマターズ:最適化のための事前情報の重要性

We explored the effect of introducing prior knowledge into the intermediate level of deep supervised neural networks on two tasks. On a task we designed, all the black-box state-of-the-art machine learning algorithms we tested failed to generalize well. We motivate our work from the hypothesis that there is a training barrier involved in the nature of such tasks, and that humans learn useful intermediate concepts from other individuals by using a form of supervision or guidance using a curriculum. Our results provide positive evidence in favor of this hypothesis. In our experiments, we trained a two-tiered MLP architecture on a dataset for which each input image contains three sprites, and the binary target class is $1$ if all three shapes belong to the same category and otherwise the class is $0$. In terms of generalization, black-box machine learning algorithms could not perform better than chance on this task. Standard deep supervised neural networks also failed to generalize. However, using a particular structure and guiding the learner by providing intermediate targets in the form of intermediate concepts (the presence of each object) allowed us to solve the task efficiently. We obtained much better than chance but imperfect results by exploring different architectures and optimization variants. This observation might be an indication of optimization difficulty when the neural network is trained without hints on this task. We hypothesize that the learning difficulty is due to the composition of two highly non-linear tasks. Our findings are also consistent with the hypotheses on cultural learning inspired by the observations of training of neural networks sometimes getting stuck, even though good solutions exist, both in terms of training and generalization error.

私たちは、2つのタスクで、事前知識を深層教師ありニューラルネットワークの中間レベルに導入することの効果を調査しました。我々が設計したタスクでは、テストしたすべてのブラックボックスの最先端の機械学習アルゴリズムが、一般化に失敗しました。私たちは、このようなタスクの性質上、トレーニングの障壁があり、人間はカリキュラムを使用した監督または指導の形式を使用することで、他の個人から有用な中間概念を学ぶという仮説から、この研究を進めています。我々の結果は、この仮説を支持する肯定的な証拠を提供します。実験では、各入力画像に3つのスプライトが含まれ、3つの形状がすべて同じカテゴリに属する場合はバイナリターゲットクラスが$1$、それ以外の場合はクラスが$0$であるデータセットで、2層MLPアーキテクチャをトレーニングしました。一般化の点では、ブラックボックスの機械学習アルゴリズムは、このタスクで偶然よりも優れたパフォーマンスを発揮できませんでした。標準的な深層教師ありニューラルネットワークも一般化に失敗しました。しかし、特定の構造を使用し、中間概念(各オブジェクトの存在)の形で中間ターゲットを提供することで学習者を誘導することで、タスクを効率的に解決することができました。さまざまなアーキテクチャと最適化のバリエーションを探索することで、偶然よりもはるかに優れた、しかし不完全な結果が得られました。この観察結果は、ニューラルネットワークがこのタスクでヒントなしでトレーニングされた場合の最適化の難しさを示している可能性があります。学習の難しさは、2つの非常に非線形なタスクの構成によるものであると仮定しています。私たちの調査結果は、トレーニングと一般化エラーの両方の点で優れたソリューションが存在するにもかかわらず、ニューラルネットワークのトレーニングが時々行き詰まるという観察から着想を得た文化的学習の仮説とも一致しています。

Consistency and Fluctuations For Stochastic Gradient Langevin Dynamics
確率的勾配ランジュバン動力学における一貫性と変動

Applying standard Markov chain Monte Carlo (MCMC) algorithms to large data sets is computationally expensive. Both the calculation of the acceptance probability and the creation of informed proposals usually require an iteration through the whole data set. The recently proposed stochastic gradient Langevin dynamics (SGLD) method circumvents this problem by generating proposals which are only based on a subset of the data, by skipping the accept-reject step and by using a decreasing step-size sequence $(\delta_m)_{m \geq 0}$. We provide in this article a rigorous mathematical framework for analysing this algorithm. We prove that, under verifiable assumptions, the algorithm is consistent, satisfies a central limit theorem (CLT) and its asymptotic bias-variance decomposition can be characterized by an explicit functional of the step-size sequence $(\delta_m)_{m \geq 0}$. We leverage this analysis to give practical recommendations for the notoriously difficult tuning of this algorithm: it is asymptotically optimal to use a step-size sequence of the type $\delta_m \asymp m^{-1/3}$, leading to an algorithm whose mean squared error (MSE) decreases at rate $\mathcal{O}(m^{-1/3})$.

標準的なマルコフ連鎖モンテカルロ(MCMC)アルゴリズムを大規模なデータセットに適用すると、計算コストが高くなります。承認確率の計算と情報に基づいた提案の作成の両方で、通常、データセット全体の反復が必要になります。最近提案された確率的勾配ランジュバンダイナミクス(SGLD)法は、データのサブセットのみに基づく提案を生成し、承認-拒否ステップをスキップし、減少するステップサイズシーケンス$(\delta_m)_{m \geq 0}$を使用することで、この問題を回避します。この記事では、このアルゴリズムを分析するための厳密な数学的フレームワークを提供します。検証可能な仮定の下で、アルゴリズムが一貫しており、中心極限定理(CLT)を満たし、その漸近的バイアス分散分解がステップサイズシーケンス$(\delta_m)_{m \geq 0}$の明示的な関数によって特徴付けられることを証明します。この分析を活用して、このアルゴリズムの非常に難しい調整に関する実用的な推奨事項を示します。つまり、$\delta_m \asymp m^{-1/3}$タイプのステップサイズシーケンスを使用するのが漸近的に最適であり、平均二乗誤差(MSE)が$\mathcal{O}(m^{-1/3})$の速度で減少するアルゴリズムにつながります。
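The SGLD update with the recommended step-size decay $\delta_m \propto m^{-1/3}$ can be sketched on a toy problem: the posterior mean of a Gaussian with unit observation variance and a Gaussian prior. The model, the constant `a`, the batch size, and the use of a step-size-weighted average as the estimator are assumptions made for this illustration, not settings taken from the paper.

```python
import math
import random


def step_size(m, a=0.01):
    """Decreasing step size delta_m = a * m^(-1/3); the constant a is assumed."""
    return a * m ** (-1.0 / 3.0)


def sgld_gaussian_mean(data, n_steps, batch_size=10, prior_var=100.0, seed=0):
    """Estimate the posterior mean of theta for x_i ~ N(theta, 1), theta ~ N(0, prior_var)."""
    rng = random.Random(seed)
    n_data = len(data)
    theta = 0.0
    weighted_sum = 0.0
    weight_total = 0.0
    for m in range(1, n_steps + 1):
        delta = step_size(m)
        batch = rng.sample(data, batch_size)
        grad_prior = -theta / prior_var
        # Minibatch gradient of the log-likelihood, rescaled to the full data set.
        grad_lik = (n_data / batch_size) * sum(x - theta for x in batch)
        noise = math.sqrt(delta) * rng.gauss(0.0, 1.0)
        theta += 0.5 * delta * (grad_prior + grad_lik) + noise  # no accept-reject step
        weighted_sum += delta * theta
        weight_total += delta
    return weighted_sum / weight_total
```

For this conjugate model the exact posterior mean is `sum(data) / (len(data) + 1 / prior_var)`, which gives a reference value to compare the estimate against.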

Minimax Rates in Permutation Estimation for Feature Matching
特徴マッチングの順列推定におけるミニマックス率

The problem of matching two sets of features appears in various tasks of computer vision and can be often formalized as a problem of permutation estimation. We address this problem from a statistical point of view and provide a theoretical analysis of the accuracy of several natural estimators. To this end, the minimax rate of separation is investigated and its expression is obtained as a function of the sample size, noise level and dimension of the features. We consider the cases of homoscedastic and heteroscedastic noise and establish, in each case, tight upper bounds on the separation distance of several estimators. These upper bounds are shown to be unimprovable both in the homoscedastic and heteroscedastic settings. Interestingly, these bounds demonstrate that a phase transition occurs when the dimension $d$ of the features is of the order of the logarithm of the number of features $n$. For $d=O(\log n)$, the rate is dimension free and equals $\sigma (\log n)^{1/2}$, where $\sigma$ is the noise level. In contrast, when $d$ is larger than $c\log n$ for some constant $c>0$, the minimax rate increases with $d$ and is of the order of $\sigma(d\log n)^{1/4}$. We also discuss the computational aspects of the estimators and provide empirical evidence of their consistency on synthetic data. Finally, we show that our results extend to more general matching criteria.

2つの特徴セットをマッチングする問題は、コンピュータービジョンのさまざまなタスクに現れ、多くの場合、順列推定の問題として形式化できます。統計的観点からこの問題に対処し、いくつかの自然な推定量の精度の理論的分析を提供します。この目的のために、分離のミニマックスレートを調査し、その表現をサンプルサイズ、ノイズレベル、および特徴の次元の関数として取得します。等分散および異分散ノイズの場合を考慮し、それぞれの場合で、いくつかの推定量の分離距離に厳密な上限を確立します。これらの上限は、等分散および異分散の設定の両方で改善できないことが示されています。興味深いことに、これらの上限は、特徴の次元$d$が特徴の数$n$の対数のオーダーである場合に位相遷移が発生することを示しています。$d=O(\log n)$の場合、レートは次元フリーで$\sigma (\log n)^{1/2}$に等しくなります。ここで、$\sigma$はノイズレベルです。対照的に、$d$が$c\log n$より大きい場合(ある定数$c>0$)、ミニマックスレートは$d$とともに増加し、$\sigma(d\log n)^{1/4}$のオーダーになります。また、推定量の計算面についても説明し、合成データでのそれらの一貫性について実証的な証拠を示します。最後に、結果がより一般的なマッチング基準に拡張されることを示します。

Should We Really Use Post-Hoc Tests Based on Mean-Ranks?
私たちは本当に平均ランクに基づく事後検定を使うべきでしょうか?

The statistical comparison of multiple algorithms over multiple data sets is fundamental in machine learning. This is typically carried out by the Friedman test. When the Friedman test rejects the null hypothesis, multiple comparisons are carried out to establish which are the significant differences among algorithms. The multiple comparisons are usually performed using the mean-ranks test. The aim of this technical note is to discuss the inconsistencies of the mean-ranks post-hoc test with the goal of discouraging its use in machine learning as well as in medicine, psychology, etc. We show that the outcome of the mean-ranks test depends on the pool of algorithms originally included in the experiment. In other words, the outcome of the comparison between algorithms $A$ and $B$ depends also on the performance of the other algorithms included in the original experiment. This can lead to paradoxical situations. For instance, the difference between $A$ and $B$ could be declared significant if the pool comprises algorithms $C,D,E$ and not significant if the pool comprises algorithms $F,G,H$. To overcome these issues, we suggest instead to perform the multiple comparison using a test whose outcome only depends on the two algorithms being compared, such as the sign-test or the Wilcoxon signed-rank test.

複数のデータセットにわたる複数のアルゴリズムの統計的比較は、機械学習の基本です。これは通常、フリードマン検定によって実行されます。フリードマン検定で帰無仮説が棄却された場合、アルゴリズム間の有意差を特定するために多重比較が実行されます。多重比較は通常、平均ランク検定を使用して実行されます。この技術ノートの目的は、機械学習だけでなく医学、心理学などでも平均ランク事後検定の使用を控えることを目的として、平均ランク事後検定の矛盾点について議論することです。平均ランク検定の結果は、実験に最初に含まれていたアルゴリズムのプールに依存することを示します。言い換えると、アルゴリズム$A$と$B$の比較の結果は、元の実験に含まれていた他のアルゴリズムのパフォーマンスにも依存します。これにより、逆説的な状況が発生する可能性があります。たとえば、プールがアルゴリズム$C、D、E$で構成されている場合は$A$と$B$の差が有意であると宣言され、プールがアルゴリズム$F、G、H$で構成されている場合は有意ではないと宣言される可能性があります。これらの問題を克服するには、符号検定やウィルコクソンの符号順位検定など、比較する2つのアルゴリズムのみに結果が依存する検定を使用して多重比較を実行することをお勧めします。
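The pool dependence described in the abstract can be seen in a small sketch. With made-up scores (an assumption for illustration only), the gap between the mean ranks of $A$ and $B$ changes when the third algorithm in the pool changes, while a two-sided sign test uses only the paired results of $A$ and $B$. The function names are hypothetical.

```python
from math import comb


def mean_ranks(scores):
    """Mean rank of each algorithm (rank 1 = best score, ties get average ranks)."""
    names = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {name: 0.0 for name in names}
    for d in range(n_datasets):
        for name in names:
            better = sum(scores[o][d] > scores[name][d] for o in names)
            tied = sum(scores[o][d] == scores[name][d] for o in names) - 1
            totals[name] += 1 + better + 0.5 * tied
    return {name: totals[name] / n_datasets for name in names}


def sign_test_p_value(a, b):
    """Two-sided sign test on paired scores; ties are discarded."""
    wins = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up accuracies on four data sets.
A = [0.80, 0.80, 0.80, 0.80]
B = [0.70, 0.70, 0.70, 0.70]
pool_with_c = mean_ranks({"A": A, "B": B, "C": [0.75] * 4})  # C falls between A and B
pool_with_f = mean_ranks({"A": A, "B": B, "F": [0.90] * 4})  # F beats both
```

Here the mean-rank gap between $B$ and $A$ is 2 in the first pool and 1 in the second, although the scores of $A$ and $B$ are identical in both; `sign_test_p_value(A, B)` is the same regardless of the pool.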

Random Rotation Ensembles
ランダム回転アンサンブル

In machine learning, ensemble methods combine the predictions of multiple base learners to construct more accurate aggregate predictions. Established supervised learning algorithms inject randomness into the construction of the individual base learners in an effort to promote diversity within the resulting ensembles. An undesirable side effect of this approach is that it generally also reduces the accuracy of the base learners. In this paper, we introduce a method that is simple to implement yet general and effective in improving ensemble diversity with only modest impact on the accuracy of the individual base learners. By randomly rotating the feature space prior to inducing the base learners, we achieve favorable aggregate predictions on standard data sets compared to state of the art ensemble methods, most notably for tree-based ensembles, which are particularly sensitive to rotation.

機械学習では、アンサンブル法は複数のベース学習器の予測を組み合わせて、より正確な集合予測を構築します。確立された教師あり学習アルゴリズムは、結果として得られるアンサンブル内の多様性を促進するために、個々の基本学習器の構築にランダム性を注入します。このアプローチの望ましくない副作用は、一般に、基本学習器の精度も低下させることです。この論文では、実装が簡単でありながら一般的で、個々の基本学習者の精度にわずかな影響を与えるだけでアンサンブルの多様性を向上させるのに効果的である方法を紹介します。ベース学習器を誘導する前に特徴空間をランダムに回転させることにより、最先端のアンサンブル法、特に回転に敏感なツリーベースのアンサンブル法と比較して、標準データセットで好ましい集計予測を達成します。

Consistent Algorithms for Clustering Time Series
時系列のクラスタリングのための一貫性のあるアルゴリズム

The problem of clustering is considered for the case where every point is a time series. The time series are either given in one batch (offline setting), or they are allowed to grow with time and new time series can be added along the way (online setting). We propose a natural notion of consistency for this problem, and show that there are simple, computationally efficient algorithms that are asymptotically consistent under extremely weak assumptions on the distributions that generate the data. The notion of consistency is as follows. A clustering algorithm is called consistent if it places two time series into the same cluster if and only if the distribution that generates them is the same. In the considered framework the time series are allowed to be highly dependent, and the dependence can have arbitrary form. If the number of clusters is known, the only assumption we make is that the (marginal) distribution of each time series is stationary ergodic. No parametric, memory or mixing assumptions are made. When the number of clusters is unknown, stronger assumptions are provably necessary, but it is still possible to devise nonparametric algorithms that are consistent under very general conditions. The theoretical findings of this work are illustrated with experiments on both synthetic and real data.

クラスタリングの問題は、すべてのポイントが時系列である場合について検討します。時系列は、1つのバッチで提供されるか(オフライン設定)、または時間の経過とともに増加し、途中で新しい時系列を追加できます(オンライン設定)。この問題に対して、一貫性の自然な概念を提案し、データを生成する分布に関する非常に弱い仮定の下で漸近的に一貫性のある、単純で計算効率の高いアルゴリズムがあることを示します。一貫性の概念は次のとおりです。クラスタリングアルゴリズムは、2つの時系列を生成する分布が同じ場合に限り、それらを同じクラスターに配置する場合、一貫性があると呼ばれます。検討対象のフレームワークでは、時系列は高度に依存していてもよく、依存関係は任意の形式にすることができます。クラスターの数がわかっている場合、各時系列の(限界)分布が定常エルゴードであるという仮定のみを立てます。パラメトリック、メモリ、または混合の仮定は立てません。クラスターの数が不明な場合は、より強力な仮定が必要であることは明らかですが、それでも非常に一般的な条件下で一貫性のあるノンパラメトリックアルゴリズムを考案することは可能です。この研究の理論的発見は、合成データと実際のデータの両方での実験によって説明されています。

Multiscale Dictionary Learning: Non-Asymptotic Bounds and Robustness
マルチスケール辞書学習:非漸近境界とロバスト性

High-dimensional datasets are well-approximated by low-dimensional structures. Over the past decade, this empirical observation motivated the investigation of detection, measurement, and modeling techniques to exploit these low-dimensional intrinsic structures, yielding numerous implications for high-dimensional statistics, machine learning, and signal processing. Manifold learning (where the low-dimensional structure is a manifold) and dictionary learning (where the low-dimensional structure is the set of sparse linear combinations of vectors from a finite dictionary) are two prominent theoretical and computational frameworks in this area. Despite their ostensible distinction, the recently-introduced Geometric Multi-Resolution Analysis (GMRA) provides a robust, computationally efficient, multiscale procedure for simultaneously learning manifolds and dictionaries. In this work, we prove non-asymptotic probabilistic bounds on the approximation error of GMRA for a rich class of data-generating statistical models that includes “noisy” manifolds, thereby establishing the theoretical robustness of the procedure and confirming empirical observations. In particular, if a dataset aggregates near a low-dimensional manifold, our results show that the approximation error of the GMRA is completely independent of the ambient dimension. Our work therefore establishes GMRA as a provably fast algorithm for dictionary learning with approximation and sparsity guarantees. We include several numerical experiments confirming these theoretical results, and our theoretical framework provides new tools for assessing the behavior of manifold learning and dictionary learning procedures on a large class of interesting models.

高次元データセットは、低次元構造によって十分に近似されます。過去10年間、この経験的観察が、これらの低次元の固有構造を利用するための検出、測定、およびモデリング手法の調査を促し、高次元統計、機械学習、および信号処理に多くの意味をもたらしました。多様体学習(低次元構造は多様体)と辞書学習(低次元構造は有限辞書のベクトルのスパース線形結合のセット)は、この分野における2つの主要な理論的および計算的フレームワークです。表面上は区別されていますが、最近導入された幾何学的マルチ解像度解析(GMRA)は、多様体と辞書を同時に学習するための堅牢で計算効率の高いマルチスケール手順を提供します。この研究では、ノイズの多い多様体を含むデータ生成統計モデルの豊富なクラスについて、GMRAの近似誤差の非漸近的確率的境界を証明し、それによって手順の理論的堅牢性を確立し、経験的観察を確認します。特に、データセットが低次元多様体の近くで集約される場合、結果はGMRAの近似誤差が周囲の次元とはまったく無関係であることを示しています。したがって、この研究では、近似とスパース性が保証された辞書学習のための証明可能な高速アルゴリズムとしてGMRAを確立しました。これらの理論的結果を確認するいくつかの数値実験を含め、理論的フレームワークは、興味深いモデルの大規模なクラスで多様体学習と辞書学習手順の動作を評価するための新しいツールを提供します。

On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models
マルチアームバンディットモデルにおけるベストアーム同定の複雑さについて

The stochastic multi-armed bandit model is a simple abstraction that has proven useful in many different contexts in statistics and machine learning. Whereas the achievable limit in terms of regret minimization is now well known, our aim is to contribute to a better understanding of the performance in terms of identifying the $m$ best arms. We introduce generic notions of complexity for the two dominant frameworks considered in the literature: fixed-budget and fixed-confidence settings. In the fixed-confidence setting, we provide the first known distribution-dependent lower bound on the complexity that involves information-theoretic quantities and holds when $m\geq 1$ under general assumptions. In the specific case of two-armed bandits, we derive refined lower bounds in both the fixed-confidence and fixed-budget settings, along with matching algorithms for Gaussian and Bernoulli bandit models. These results show in particular that the complexity of the fixed-budget setting may be smaller than the complexity of the fixed-confidence setting, contradicting the familiar behavior observed when testing fully specified alternatives. In addition, we also provide improved sequential stopping rules that have guaranteed error probabilities and shorter average running times. The proofs rely on two technical results that are of independent interest: a deviation lemma for self-normalized sums (Lemma 7) and a novel change of measure inequality for bandit models (Lemma 1).

確率的多腕バンディットモデルは、統計学や機械学習のさまざまなコンテキストで有用であることが証明されている単純な抽象化です。後悔の最小化に関して達成可能な限界は現在よく知られていますが、私たちの目的は、$m$個の最良の腕を特定するという観点からパフォーマンスをより深く理解することに貢献することです。文献で検討されている2つの主要なフレームワーク、固定予算設定と固定信頼度設定について、複雑さの一般的な概念を導入します。固定信頼度設定では、情報理論的な量を含み、一般的な仮定の下で$m\geq 1$の場合に保持される、複雑さに関する分布に依存する下限を初めて提供します。2腕バンディットの特定のケースでは、固定信頼度設定と固定予算設定の両方で洗練された下限を導出するとともに、ガウスバンディットモデルとベルヌーイバンディットモデルに対してそれに一致するアルゴリズムも導出します。これらの結果は特に、固定予算設定の複雑さは固定信頼度設定の複雑さよりも小さい可能性があることを示しており、完全に指定された代替案をテストするときに観察される一般的な動作と矛盾しています。さらに、エラー確率が保証され、平均実行時間が短くなる、改善された順次停止規則も提供します。証明は、独立した関心のある2つの技術的結果、つまり自己正規化和の偏差補題(補題7)とバンディットモデルに対する新しい測度変換不等式(補題1)に依存しています。
