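Journal of Machine Learning Research Papers: List of Papers in Volume 18

This page lists the papers published in Journal of Machine Learning Research Volume 18 and presents each abstract alongside a Japanese rendering produced with the help of machine translation.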
Learning Certifiably Optimal Rule Lists for Categorical Data
カテゴリカルデータの認定可能な最適ルールリストの学習

We present the design and implementation of a custom discrete optimization technique for building rule lists over a categorical feature space. Our algorithm produces rule lists with optimal training performance, according to the regularized empirical risk, with a certificate of optimality. By leveraging algorithmic bounds, efficient data structures, and computational reuse, we achieve several orders of magnitude speedup in time and a massive reduction of memory consumption. We demonstrate that our approach produces optimal rule lists on practical problems in seconds. Our results indicate that it is possible to construct optimal sparse rule lists that are approximately as accurate as the COMPAS proprietary risk prediction tool on data from Broward County, Florida, but that are completely interpretable. This framework is a novel alternative to CART and other decision tree methods for interpretable modeling.

私たちは、カテゴリカル特徴空間上のルールリストを構築するためのカスタム離散最適化手法の設計と実装を紹介します。私たちのアルゴリズムは、正則化された経験的リスクに関して最適なトレーニングパフォーマンスを持つルールリストを、最適性の証明書とともに生成します。アルゴリズム上の境界(バウンド)、効率的なデータ構造、計算の再利用を活用することで、数桁の時間短縮とメモリ消費の大幅な削減を実現します。私たちのアプローチでは、実際の問題に関する最適なルールリストが数秒で生成されることを実証します。私たちの結果は、フロリダ州ブロワード郡のデータに対して、プロプライエタリなリスク予測ツールであるCOMPASとほぼ同じ精度で、かつ完全に解釈可能な最適なスパースルールリストを構築できることを示しています。このフレームワークは、解釈可能なモデリングのためのCARTやその他の決定木法に代わる新しい方法です。
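
To make the objective concrete, the following minimal sketch evaluates a rule list on toy categorical data. It assumes the regularized empirical risk takes the form "misclassification rate plus a penalty `lam` per rule"; the abstract does not spell out the exact form, so treat this as an illustration. The data, feature names, and function names are hypothetical, and the sketch only scores a given rule list. It does not perform the certified search described in the paper.

```python
# Hypothetical toy data: (categorical features, binary label) pairs.
samples = [
    ({"age": "18-25", "priors": ">3"}, 1),
    ({"age": "18-25", "priors": "0"}, 0),
    ({"age": "26-45", "priors": ">3"}, 1),
    ({"age": "26-45", "priors": "0"}, 0),
    ({"age": ">45", "priors": "0"}, 0),
    ({"age": ">45", "priors": ">3"}, 0),
]

# A rule list is an ordered sequence of (antecedent, prediction) pairs
# followed by a default prediction used when no rule matches.
rule_list = [
    ({"priors": "0"}, 0),
    ({"age": ">45"}, 0),
]
default_prediction = 1


def predict(features, rules, default):
    """Return the prediction of the first rule whose antecedent matches."""
    for antecedent, prediction in rules:
        if all(features.get(key) == value for key, value in antecedent.items()):
            return prediction
    return default


def regularized_risk(rules, default, data, lam):
    """Misclassification rate plus lam times the number of rules (assumed form)."""
    errors = sum(predict(x, rules, default) != y for x, y in data)
    return errors / len(data) + lam * len(rules)


risk = regularized_risk(rule_list, default_prediction, samples, lam=0.01)
print(risk)
```

A shorter list pays a smaller penalty but may misclassify more samples; the optimization problem in the paper is to find the list that minimizes this kind of trade-off.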

Characteristic and Universal Tensor Product Kernels
特性およびユニバーサルテンソルプロダクトカーネル

Maximum mean discrepancy (MMD), also called energy distance or N-distance in statistics and Hilbert-Schmidt independence criterion (HSIC), specifically distance covariance in statistics, are among the most popular and successful approaches to quantify the difference and independence of random variables, respectively. Thanks to their kernel-based foundations, MMD and HSIC are applicable on a wide variety of domains. Despite their tremendous success, quite little is known about when HSIC characterizes independence and when MMD with tensor product kernel can discriminate probability distributions. In this paper, we answer these questions by studying various notions of characteristic property of the tensor product kernel.

統計学ではエネルギー距離またはN距離とも呼ばれる最大平均不一致(MMD)と、統計学では特に距離共分散として知られるヒルベルト・シュミット独立性基準(HSIC)は、それぞれ確率変数の差異と独立性を定量化するための、最も一般的で成功したアプローチに数えられます。カーネルベースの基盤のおかげで、MMDとHSICはさまざまなドメインに適用できます。その大きな成功にもかかわらず、HSICがいつ独立性を特徴付けるのか、そしてテンソル積カーネルを持つMMDがいつ確率分布を識別できるのかについては、ほとんど知られていません。この論文では、テンソル積カーネルの特性的(characteristic)性質に関するさまざまな概念を研究することにより、これらの質問に答えます。
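
As an illustration of the quantity discussed above, here is a minimal sketch of the biased (V-statistic) estimate of the squared MMD with a tensor product kernel. The choice of Gaussian component kernels, the bandwidth, the sample sizes, and all function names are assumptions made for this example; they are not taken from the paper.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on real numbers."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def tensor_product_kernel(x, y):
    """Product of Gaussian kernels, one per coordinate of the tuples x and y."""
    return math.prod(gaussian_kernel(a, b) for a, b in zip(x, y))


def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs (x, y)."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(xs, ys, kernel=tensor_product_kernel):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (mean_kernel(xs, xs, kernel)
            + mean_kernel(ys, ys, kernel)
            - 2.0 * mean_kernel(xs, ys, kernel))


rng = random.Random(0)
# Two-dimensional samples: standard normal vs. the same distribution shifted by 2.
sample_p = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
sample_q = [(rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)) for _ in range(200)]

same = mmd_squared(sample_p, sample_p)       # a sample compared with itself
different = mmd_squared(sample_p, sample_q)  # samples from shifted distributions
print(same, different)
```

The estimate is zero when a sample is compared with itself and clearly positive for the shifted sample. Whether the population MMD separates all pairs of distinct distributions depends on the kernel being characteristic, which is the question the paper studies for tensor product kernels.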

A Cluster Elastic Net for Multivariate Regression
多変量回帰のためのクラスタ弾性ネット

We propose a method for simultaneously estimating regression coefficients and clustering response variables in a multivariate regression model, to increase prediction accuracy and give insights into the relationship between response variables. The estimates of the regression coefficients and clusters are found by using a penalized likelihood estimator, which includes a cluster fusion penalty, to shrink the difference in fitted values from responses in the same cluster, and an $L_1$ penalty for simultaneous variable selection and estimation. We propose a two-step algorithm, that iterates between k-means clustering and solving the penalized likelihood function assuming the clusters are known, which has desirable parallel computational properties obtained by using the cluster fusion penalty. If the response variable clusters are known a priori then the algorithm reduces to just solving the penalized likelihood problem. Theoretical results are presented for the penalized least squares case, including asymptotic results allowing for $p \gg n$. We extend our method to the setting where the responses are binomial variables. We propose a coordinate descent algorithm for the normal likelihood and a proximal gradient descent algorithm for the binomial likelihood, which can easily be extended to other generalized linear model (GLM) settings. Simulations and data examples from business operations and genomics are presented to show the merits of both the least squares and binomial methods.

私たちは、多変量回帰モデルで回帰係数の推定と応答変数のクラスタリングを同時に行う方法を提案します。これにより、予測精度が向上し、応答変数間の関係についての洞察が得られます。回帰係数とクラスターの推定値は、同じクラスター内の応答からの適合値の差を縮小するためのクラスター融合ペナルティと、同時変数選択および推定のための$L_1$ペナルティを含むペナルティ付き尤度推定量を使用して求められます。クラスターが既知であると仮定して、k平均法クラスタリングとペナルティ付き尤度関数の解決を繰り返す2段階アルゴリズムを提案します。このアルゴリズムは、クラスター融合ペナルティを使用することで、望ましい並列計算特性が得られます。応答変数クラスターが事前にわかっている場合、アルゴリズムはペナルティ付き尤度問題を解くだけに簡略化されます。ペナルティ付き最小二乗法の場合の理論的な結果が提示され、$p \gg n$を許容する漸近結果も含まれています。応答が二項変数である設定にこの方法を拡張します。正規尤度には座標降下法アルゴリズム、二項尤度には近似勾配降下法アルゴリズムを提案します。これらは他の一般化線形モデル(GLM)設定に簡単に拡張できます。最小二乗法と二項法の両方の利点を示すために、ビジネスオペレーションとゲノミクスからのシミュレーションとデータ例を示します。

Concentration inequalities for empirical processes of linear time series
線形時系列の経験過程に対する集中不等式

The paper considers suprema of empirical processes for linear time series indexed by functional classes. We derive an upper bound for the tail probability of the suprema under conditions on the size of the function class, the sample size, temporal dependence and the moment conditions of the underlying time series. Due to the dependence and heavy-tailness, our tail probability bound is substantially different from those classical exponential bounds obtained under the independence assumption in that it involves an extra polynomial decaying term. We allow both short- and long-range dependent processes. For empirical processes indexed by half intervals, our tail probability inequality is sharp up to a multiplicative constant.

この論文では、関数クラスによって添字付けされた線形時系列の経験過程の上限(supremum)について考察します。私たちは、関数クラスのサイズ、サンプルサイズ、時間依存性、および基礎となる時系列のモーメント条件に関する条件の下で、上限の裾確率の上界を導出します。依存性と裾の重さのため、私たちの裾確率の上界は、余分な多項式減衰項を含むという点で、独立性の仮定の下で得られる古典的な指数型の上界とは大きく異なります。短期依存過程と長期依存過程の両方を許容します。半区間で添字付けされた経験過程の場合、私たちの裾確率不等式は乗法定数を除いてシャープです。

CoCoA: A General Framework for Communication-Efficient Distributed Optimization
CoCoA:通信効率の高い分散最適化のための一般的なフレームワーク

The scale of modern datasets necessitates the development of efficient distributed optimization methods for machine learning. We present a general-purpose framework for distributed computing environments, CoCoA, that has an efficient communication scheme and is applicable to a wide variety of problems in machine learning and signal processing. We extend the framework to cover general non-strongly-convex regularizers, including L1-regularized problems like lasso, sparse logistic regression, and elastic net regularization, and show how earlier work can be derived as a special case. We provide convergence guarantees for the class of convex regularized loss minimization objectives, leveraging a novel approach in handling non-strongly-convex regularizers and non-smooth loss functions. The resulting framework has markedly improved performance over state-of-the-art methods, as we illustrate with an extensive set of experiments on real distributed datasets.

最新のデータセットの規模により、機械学習のための効率的な分散最適化手法の開発が必要です。この論文では、効率的な通信方式を持ち、機械学習や信号処理の幅広い問題に適用可能な分散コンピューティング環境のための汎用フレームワークであるCoCoAを紹介します。このフレームワークを拡張して、Lasso、スパースロジスティック回帰、弾性ネット正則化などのL1正則化問題を含む一般的な非強凸正則化項をカバーし、先行研究を特殊なケースとして導出する方法を示します。私たちは、非強凸正則化項と非平滑損失関数の取り扱いに新しいアプローチを活用して、凸正則化損失最小化目標のクラスに収束保証を提供します。結果として得られるフレームワークは、実際の分散データセットでの広範な一連の実験で示されているように、最先端の方法よりもパフォーマンスが大幅に向上しています。

Interactive Algorithms: Pool, Stream and Precognitive Stream
インタラクティブアルゴリズム:プール、ストリーム、プリコグニティブストリーム

We consider interactive algorithms in the pool-based setting, and in the stream-based setting. Interactive algorithms observe suggested elements (representing actions or queries), and interactively select some of them and receive responses. Pool-based algorithms can select elements at any order, while stream-based algorithms observe elements in sequence, and can only select elements immediately after observing them. We further consider an intermediate setting, which we term precognitive stream, in which the algorithm knows in advance the identity of all the elements in the sequence, but can select them only in the order of their appearance. For all settings, we assume that the suggested elements are generated independently from some source distribution, and ask what is the stream size required for emulating a pool algorithm with a given pool size, in the stream-based setting and in the precognitive stream setting. We provide algorithms and matching lower bounds for general pool algorithms, and for utility-based pool algorithms. We further derive nearly matching upper and lower bounds on the gap between the two settings for the special case of active learning for binary classification.

私たちは、プールベースの設定とストリームベースの設定で、対話型アルゴリズムを検討します。対話型アルゴリズムは、提案された要素(アクションまたはクエリを表す)を観察し、対話的にその一部を選択して応答を受け取ります。プールベースのアルゴリズムは任意の順序で要素を選択できますが、ストリームベースのアルゴリズムは要素を順番に観察し、観察した直後にしか要素を選択できません。さらに、アルゴリズムがシーケンス内のすべての要素の内容を事前に把握しているが、出現順にしか選択できない中間設定(予知ストリームと呼ぶ)を検討します。すべての設定について、提案された要素は何らかのソース分布から独立に生成されると仮定し、ストリームベースの設定と予知ストリーム設定で、特定のプールサイズのプールアルゴリズムをエミュレートするために必要なストリームサイズはどれくらいかを調べます。一般的なプールアルゴリズムとユーティリティベースのプールアルゴリズムについて、アルゴリズムとそれに一致する下限を示します。さらに、バイナリ分類のアクティブラーニングという特殊なケースについて、2つの設定間のギャップに関するほぼ一致する上限と下限を導出します。

A Theory of Learning with Corrupted Labels
壊れたラベルを持つ学習理論

It is usual in machine learning theory to assume that the training and testing sets consist of draws from the same distribution. This is rarely, if ever, true and one must admit the presence of corruption. There are many different types of corruption that can arise and as of yet there is no general means to compare the relative ease of learning in these settings. Such results are necessary if we are to make informed economic decisions regarding the acquisition of data. Here we begin to develop an abstract framework for tackling these problems. We present a generic method for learning from a fixed, known, reconstructible corruption, along with an analysis of its statistical properties. We demonstrate the utility of our framework via concrete novel results in solving supervised learning problems wherein the labels are corrupted, such as learning with noisy labels, semi-supervised learning and learning with partial labels.

機械学習理論では、トレーニングセットとテストセットが同じ分布からのサンプルで構成されていると想定するのが一般的です。これが真であることは、あったとしても稀であり、破損の存在を認めざるを得ません。破損にはさまざまな種類があり、現時点では、これらの設定での学習の相対的な容易さを比較する一般的な手段はありません。このような結果は、データの取得に関して十分な情報に基づいた経済的決定を下すために必要です。ここでは、これらの問題に取り組むための抽象的なフレームワークの開発を開始します。私たちは、固定された、既知の、再構築可能な破損から学習するための一般的な方法と、その統計的特性の分析を提示します。私たちは、ノイズの多いラベルによる学習、半教師あり学習、部分的なラベルによる学習など、ラベルが破損する教師あり学習の問題を解決するための具体的な新しい結果を通じて、フレームワークの有用性を実証します。

Rate of Convergence of K-Nearest-Neighbor Classification Rule
K近傍分類ルールの収束率

A binary classification problem is considered. The excess error probability of the $k$-nearest-neighbor classification rule according to the error probability of the Bayes decision is revisited by a decomposition of the excess error probability into approximation and estimation errors. Under a weak margin condition and under a modified Lipschitz condition or a local Lipschitz condition, tight upper bounds are presented such that one avoids the condition that the feature vector is bounded. The concept of modified Lipschitz condition is applied for discrete distributions, too. As a consequence of both concepts, we present the rate of convergence of $L_2$ error for the corresponding nearest neighbor regression estimate.

二項分類の問題が考慮されます。ベイズ決定の誤差確率に従った$k$-nearest-neighbor分類ルールの過剰誤差確率は、過剰誤差確率を近似誤差と推定誤差に分解することによって再検討されます。弱いマージン条件、および修正リプシッツ条件またはローカルリプシッツ条件の下では、特徴ベクトルが有界であるという条件を回避するように、狭い上限が提示されます。修正リプシッツ条件の概念は、離散分布にも適用されます。両方の概念の結果として、対応する最近傍回帰推定の$L_2$誤差の収束率を示します。
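
For readers unfamiliar with the rule being analyzed, the following is a minimal sketch of $k$-nearest-neighbor classification by majority vote under Euclidean distance. The toy points, labels, and the function name `knn_classify` are hypothetical; the paper's contribution is the convergence analysis, not this procedure.

```python
import math
from collections import Counter


def knn_classify(query, points, labels, k):
    """Majority vote among the k training points closest to query (Euclidean distance)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set with two well-separated classes.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_classify((0.5, 0.5), points, labels, k=3))
print(knn_classify((5.5, 5.5), points, labels, k=3))
```

Using an odd `k` with binary labels avoids ties in the vote.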

Statistical Inference on Random Dot Product Graphs: a Survey
ランダム内積グラフにおける統計的推論:調査

The random dot product graph (RDPG) is an independent-edge random graph that is analytically tractable and, simultaneously, either encompasses or can successfully approximate a wide range of random graphs, from relatively simple stochastic block models to complex latent position graphs. In this survey paper, we describe a comprehensive paradigm for statistical inference on random dot product graphs, a paradigm centered on spectral embeddings of adjacency and Laplacian matrices. We examine the graph-inferential analogues of several canonical tenets of classical Euclidean inference. In particular, we summarize a body of existing results on the consistency and asymptotic normality of the adjacency and Laplacian spectral embeddings, and the role these spectral embeddings can play in the construction of single- and multi-sample hypothesis tests for graph data. We investigate several real-world applications, including community detection and classification in large social networks and the determination of functional and biologically relevant network properties from an exploratory data analysis of the Drosophila connectome. We outline requisite background and current open problems in spectral graph inference.

ランダムドット積グラフ(RDPG)は、解析的に扱いやすく、同時に、比較的単純な確率ブロックモデルから複雑な潜在位置グラフまで、広範囲のランダムグラフを包含するか、またはうまく近似できる、独立エッジのランダムグラフです。この調査論文では、ランダムドット積グラフの統計的推論の包括的なパラダイム、つまり隣接行列とラプラシアン行列のスペクトル埋め込みを中心としたパラダイムについて説明します。古典的なユークリッド推論のいくつかの標準的な原則のグラフ推論類似物を調べます。特に、隣接スペクトル埋め込みとラプラシアンスペクトル埋め込みの一貫性と漸近正規性に関する既存の結果と、これらのスペクトル埋め込みがグラフデータの単一サンプルおよび複数サンプルの仮説検定の構築で果たす役割についてまとめます。大規模なソーシャルネットワークにおけるコミュニティの検出と分類、ショウジョウバエのコネクトームの探索的データ分析による機能的および生物学的に関連するネットワーク特性の決定など、いくつかの実際のアプリケーションを調査します。スペクトルグラフ推論に必要な背景と現在の未解決の問題について概説します。

Improved spectral community detection in large heterogeneous networks
大規模な異種ネットワークにおけるスペクトルコミュニティ検出の改善

In this article, we propose and study the performance of spectral community detection for a family of “$\alpha$-normalized” adjacency matrices $\bf A$, of the type $ {\bf D}^{-\alpha}{\bf A}{\bf D}^{-\alpha}$ with $\bf D$ the degree matrix, in heterogeneous dense graph models. We show that the previously used normalization methods based on ${\bf A}$ or $ {\bf D}^{-1}{\bf A}{\bf D}^{-1} $ are in general suboptimal in terms of correct recovery rates and, relying on advanced random matrix methods, we prove instead the existence of an optimal value $ \alpha_{\rm opt} $ of the parameter $ \alpha $ in our generic model; we further provide an online estimation of $ \alpha_{\rm opt} $ only based on the node degrees in the graph. Numerical simulations show that the proposed method outperforms state-of-the-art spectral approaches on moderately dense to dense heterogeneous graphs.

この記事では、不均一な高密度グラフモデルにおいて、$\bf D$を次数行列として$ {\bf D}^{-\alpha}{\bf A}{\bf D}^{-\alpha}$の形をとる、隣接行列$\bf A$の「$\alpha$正規化」のファミリーに対するスペクトルコミュニティ検出を提案し、そのパフォーマンスを研究します。私たちは、以前に使用された${\bf A}$または$ {\bf D}^{-1}{\bf A}{\bf D}^{-1} $に基づく正規化方法が、正しい回復率の点で一般的に最適ではないことを示し、高度なランダム行列法に依拠して、代わりに、私たちの一般的なモデルにおけるパラメータ$ \alpha $の最適値$ \alpha_{\rm opt} $の存在を証明します。さらに、グラフのノード次数のみに基づいて、$ \alpha_{\rm opt} $のオンライン推定を提供します。数値シミュレーションは、提案された方法が、中程度の密度から高密度の不均一グラフで最先端のスペクトルアプローチよりも優れていることを示しています。
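
The normalization itself is simple to write down. The sketch below computes ${\bf D}^{-\alpha}{\bf A}{\bf D}^{-\alpha}$ entrywise for a small hypothetical graph, assuming every node has positive degree. The value of `alpha` used here is arbitrary; the paper's estimator of $\alpha_{\rm opt}$ and the subsequent eigenvector step are not reproduced.

```python
def alpha_normalize(adjacency, alpha):
    """Return D^{-alpha} A D^{-alpha}, where D is the diagonal degree matrix of A.

    Assumes every node has positive degree (no isolated nodes).
    """
    degrees = [sum(row) for row in adjacency]
    n = len(adjacency)
    return [
        [adjacency[i][j] / (degrees[i] ** alpha * degrees[j] ** alpha) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical undirected graph on 4 nodes with degrees 2, 3, 2, 1.
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]

normalized = alpha_normalize(adjacency, 0.5)
for row in normalized:
    print([round(value, 3) for value in row])
```

Setting `alpha` to 0 recovers the raw adjacency matrix and `alpha` to 1 gives ${\bf D}^{-1}{\bf A}{\bf D}^{-1}$, the two previously used choices mentioned in the abstract.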

Learning Quadratic Variance Function (QVF) DAG Models via OverDispersion Scoring (ODS)
過剰分散スコアリング (ODS) による二次分散関数 (QVF) DAG モデルの学習

Learning DAG or Bayesian network models is an important problem in multi-variate causal inference. However, a number of challenges arises in learning large-scale DAG models including model identifiability and computational complexity since the space of directed graphs is huge. In this paper, we address these issues in a number of steps for a broad class of DAG models where the noise or variance is signal-dependent. Firstly we introduce a new class of identifiable DAG models, where each node has a distribution where the variance is a quadratic function of the mean (QVF DAG models). Our QVF DAG models include many interesting classes of distributions such as Poisson, Binomial, Geometric, Exponential, Gamma and many other distributions in which the noise variance depends on the mean. We prove that this class of QVF DAG models is identifiable, and introduce a new algorithm, the OverDispersion Scoring (ODS) algorithm, for learning large-scale QVF DAG models. Our algorithm is based on firstly learning the moralized or undirected graphical model representation of the DAG to reduce the DAG search-space, and then exploiting the quadratic variance property to learn the ordering. We show through theoretical results and simulations that our algorithm is statistically consistent in the high-dimensional $p>n$ setting provided that the degree of the moralized graph is bounded and performs well compared to state-of-the-art DAG-learning algorithms. We also demonstrate through a real data example involving multi-variate count data, that our ODS algorithm is well-suited to estimating DAG models for count data in comparison to other methods used for discrete data.

DAGまたはベイジアンネットワークモデルの学習は、多変量因果推論における重要な問題です。ただし、有向グラフの空間は巨大であるため、大規模なDAGモデルの学習では、モデルの識別可能性や計算の複雑さなど、多くの課題が生じます。この論文では、ノイズまたは分散が信号に依存する広範なクラスのDAGモデルについて、いくつかの手順でこれらの問題に対処します。まず、各ノードが分散が平均の2次関数である分布を持つ、識別可能な新しいクラスのDAGモデルを紹介します(QVF DAGモデル)。QVF DAGモデルには、ポアソン分布、二項分布、幾何分布、指数分布、ガンマ分布、およびノイズ分散が平均に依存するその他の多くの分布など、多くの興味深い分布のクラスが含まれています。このクラスのQVF DAGモデルが識別可能であることを証明し、大規模なQVF DAGモデルを学習するための新しいアルゴリズムであるOverDispersion Scoring (ODS)アルゴリズムを紹介します。私たちのアルゴリズムは、まずDAGの道徳的または無向のグラフィカルモデル表現を学習してDAG検索空間を縮小し、次に2次分散特性を利用して順序付けを学習することに基づいています。理論的結果とシミュレーションにより、道徳的グラフの次数が制限され、最先端のDAG学習アルゴリズムと比較して優れたパフォーマンスを発揮することを条件に、高次元$p>n$設定で私たちのアルゴリズムが統計的に一貫していることを示しています。また、多変量カウントデータを含む実際のデータ例により、離散データに使用される他の方法と比較して、私たちのODSアルゴリズムがカウントデータのDAGモデルを推定するのに適していることも示しています。

Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification
最小二乗回帰のための確率的勾配降下法の並列化: ミニバッチ処理、平均化、およびモデルの誤指定

This work characterizes the benefits of averaging techniques widely used in conjunction with stochastic gradient descent (SGD). In particular, this work presents a sharp analysis of: (1) mini-batching, a method of averaging many samples of a stochastic gradient to both reduce the variance of a stochastic gradient estimate and for parallelizing SGD and (2) tail-averaging, a method involving averaging the final few iterates of SGD in order to decrease the variance in SGD’s final iterate. This work presents sharp finite sample generalization error bounds for these schemes for the stochastic approximation problem of least squares regression. Furthermore, this work establishes a precise problem-dependent extent to which mini-batching can be used to yield provable near-linear parallelization speedups over SGD with batch size one. This characterization is used to understand the relationship between learning rate versus batch size when considering the excess risk of the final iterate of an SGD procedure. Next, this mini-batching characterization is utilized in providing a highly parallelizable SGD method that achieves the minimax risk with nearly the same number of serial updates as batch gradient descent, improving significantly over existing SGD-style methods. Following this, a non-asymptotic excess risk bound for model averaging (which is a communication efficient parallelization scheme) is provided. Finally, this work sheds light on fundamental differences in SGD’s behavior when dealing with mis-specified models in the non-realizable least squares problem. This paper shows that maximal stepsizes ensuring minimax risk for the mis-specified case must depend on the noise properties. The analysis tools used by this paper generalize the operator view of averaged SGD (Défossez and Bach, 2015) followed by developing a novel analysis in bounding these operators to characterize the generalization error. These techniques are of broader interest in analyzing various computational aspects of stochastic approximation.

この研究では、確率的勾配降下法(SGD)と組み合わせて広く使用されている平均化手法の利点を特徴づけています。特に、この研究では、(1)ミニバッチング、確率的勾配の多数のサンプルを平均化して確率的勾配推定値の分散を減らし、SGDを並列化する手法、および(2)テール平均化、SGDの最終反復の分散を減らすためにSGDの最後の数回の反復を平均化する手法について鋭い分析を示しています。この研究では、最小二乗回帰の確率的近似問題に対するこれらの方式の鋭い有限サンプル一般化誤差境界を示しています。さらに、この研究では、バッチサイズ1のSGDに対して証明可能なほぼ線形の並列化高速化を実現するためにミニバッチングを使用できる問題依存の正確な範囲を確立しています。この特徴づけは、SGD手順の最終反復の過剰リスクを考慮する際に、学習率とバッチサイズの関係を理解するために使用されます。次に、このミニバッチングの特性を利用して、バッチ勾配降下法とほぼ同じ数のシリアル更新でミニマックスリスクを達成する高度に並列化されたSGD法を提供し、既存のSGDスタイルの方法を大幅に改善します。これに続いて、モデル平均化(通信効率の高い並列化スキーム)の非漸近的過剰リスク境界が提供されます。最後に、この研究は、実現不可能な最小二乗問題で誤って指定されたモデルを処理する場合のSGDの動作の基本的な違いを明らかにします。この論文では、誤って指定されたケースのミニマックスリスクを保証する最大ステップサイズは、ノイズ特性に依存する必要があることを示しています。この論文で使用されている分析ツールは、平均SGDの演算子ビュー(DéfossezおよびBach、2015)を一般化し、次にこれらの演算子を境界付けて一般化誤差を特徴付ける新しい分析を開発します。これらの手法は、確率的近似のさまざまな計算側面を分析する上で広く関心を集めています。
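
The two averaging schemes named above can be sketched in a few lines. The example below runs mini-batch SGD on a streaming least squares problem and returns the tail average of the iterates. The data model, step size, batch size, and iteration counts are assumptions chosen for illustration, and the function names are hypothetical; none of these values come from the paper.

```python
import random


def draw_sample(rng, w_true, noise_std):
    """Draw one (x, y) pair from a hypothetical linear model with Gaussian noise."""
    x = [rng.uniform(-1.0, 1.0) for _ in w_true]
    y = sum(wi * xi for wi, xi in zip(w_true, x)) + rng.gauss(0.0, noise_std)
    return x, y


def minibatch_sgd_tail_average(sampler, dim, step_size, batch_size, num_steps, tail_start):
    """Mini-batch SGD for least squares; returns the average of iterates from tail_start on."""
    w = [0.0] * dim
    tail_sum = [0.0] * dim
    for step in range(num_steps):
        # Mini-batching: average the gradients of 0.5 * (w.x - y)^2 over fresh samples.
        grad = [0.0] * dim
        for _ in range(batch_size):
            x, y = sampler()
            residual = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(dim):
                grad[j] += residual * x[j] / batch_size
        w = [wi - step_size * gi for wi, gi in zip(w, grad)]
        # Tail-averaging: accumulate only the later iterates.
        if step >= tail_start:
            tail_sum = [s + wi for s, wi in zip(tail_sum, w)]
    return [s / (num_steps - tail_start) for s in tail_sum]


rng = random.Random(0)
w_true = [2.0, -1.0]
w_avg = minibatch_sgd_tail_average(
    sampler=lambda: draw_sample(rng, w_true, noise_std=0.1),
    dim=2, step_size=0.5, batch_size=8, num_steps=400, tail_start=200,
)
print(w_avg)
```

The gradient computations inside one mini-batch are independent of each other, which is what makes this step parallelizable; the sketch runs them serially for simplicity.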

Average Stability is Invariant to Data Preconditioning. Implications to Exp-concave Empirical Risk Minimization
平均安定性はデータの前処理に対して不変である。指数凹関数の経験的リスク最小化への影響

We show that the average stability notion introduced by Kearns and Ron (1999); Bousquet and Elisseeff (2002) is invariant to data preconditioning, for a wide class of generalized linear models that includes most of the known exp-concave losses. In other words, when analyzing the stability rate of a given algorithm, we may assume the optimal preconditioning of the data. This implies that, at least from a statistical perspective, explicit regularization is not required in order to compensate for ill-conditioned data, which stands in contrast to a widely common approach that includes a regularization for analyzing the sample complexity of generalized linear models. Several important implications of our findings include: a) We demonstrate that the excess risk of empirical risk minimization (ERM) is controlled by the preconditioned stability rate. This immediately yields a relatively short and elegant proof for the fast rates attained by ERM in our context. b) We complement the recent bounds of Hardt et al. (2015) on the stability rate of the Stochastic Gradient Descent algorithm.

私たちは、KearnsとRon (1999)、BousquetとElisseeff (2002)によって導入された平均安定性の概念は、既知のexp-concave損失のほとんどを含む幅広い一般化線型モデルに対して、データの前処理に対して不変であることを示します。言い換えると、特定のアルゴリズムの安定率を分析する場合、データの最適な前処理を想定できます。これは、少なくとも統計的な観点からは、悪条件のデータを補正するために明示的な正則化は必要ないことを意味します。これは、一般化線型モデルのサンプルの複雑さを分析するための正則化を含む、広く普及しているアプローチとは対照的です。私たちの研究結果の重要な意味合いには、次のものがあります。a)経験的リスク最小化(ERM)の過剰リスクは、前処理された安定率によって制御されることを示します。これにより、私たちのコンテキストでERMによって達成される高速率の比較的短く簡潔な証明が直ちに得られます。b)確率的勾配降下法アルゴリズムの安定率に関するHardtら(2015)の最近の境界を補完します。

Katyusha: The First Direct Acceleration of Stochastic Gradient Methods
カチューシャ:確率的勾配法の最初の直接加速

Nesterov’s momentum trick is famously known for accelerating gradient descent, and has been proven useful in building fast iterative algorithms. However, in the stochastic setting, counterexamples exist and prevent Nesterov’s momentum from providing similar acceleration, even if the underlying problem is convex and finite-sum. We introduce $\mathtt{Katyusha}$, a direct, primal-only stochastic gradient method to fix this issue. In convex finite-sum stochastic optimization, $\mathtt{Katyusha}$ has an optimal accelerated convergence rate, and enjoys an optimal parallel linear speedup in the mini-batch setting. The main ingredient is $\textit{Katyusha momentum}$, a novel “negative momentum” on top of Nesterov’s momentum. It can be incorporated into a variance-reduction based algorithm and speed it up, both in terms of $\textit{sequential and parallel}$ performance. Since variance reduction has been successfully applied to a growing list of practical problems, our paper suggests that in each of such cases, one could potentially try to give Katyusha a hug.

ネステロフの運動量トリックは、勾配降下法を加速することでよく知られており、高速反復アルゴリズムの構築に有効であることが証明されています。しかし、確率的設定では反例が存在し、基礎となる問題が凸有限和であっても、ネステロフの運動量では同様の加速を提供できません。この問題を解決するために、直接的なプライマルのみの確率的勾配法である$\mathtt{Katyusha}$を導入します。凸有限和確率的最適化では、$\mathtt{Katyusha}$は最適な加速収束率を持ち、ミニバッチ設定で最適な並列線形高速化を実現します。主な成分は、ネステロフの運動量の上に構築された新しい「負の運動量」である$\textit{Katyusha運動量}$です。これを分散削減ベースのアルゴリズムに組み込むことで、$\textit{逐次および並列}$パフォーマンスの両方の点で高速化できます。分散削減は、増え続ける実用的な問題にうまく適用されているため、私たちの論文では、そのようなケースのそれぞれで、カチューシャを抱きしめることを試みることができる可能性があることを示唆しています。

Complete Graphical Characterization and Construction of Adjustment Sets in Markov Equivalence Classes of Ancestral Graphs
祖先グラフのマルコフ同値クラスにおける調整セットの完全なグラフィカルな特性評価と構築

We present a graphical criterion for covariate adjustment that is sound and complete for four different classes of causal graphical models: directed acyclic graphs (DAGs), maximal ancestral graphs (MAGs), completed partially directed acyclic graphs (CPDAGs), and partial ancestral graphs (PAGs). Our criterion unifies covariate adjustment for a large set of graph classes. Moreover, we define an explicit set that satisfies our criterion, if there is any set that satisfies our criterion. We also give efficient algorithms for constructing all sets that fulfill our criterion, implemented in the R package dagitty. Finally, we discuss the relationship between our criterion and other criteria for adjustment, and we provide new soundness and completeness proofs for the adjustment criterion for DAGs.

私たちは、因果グラフモデルの4つの異なるクラス、すなわち有向非巡回グラフ(DAG)、最大祖先グラフ(MAG)、完了部分有向非巡回グラフ(CPDAG)、および部分祖先グラフ(PAG)について、健全で完全な共変量調整のグラフィカル基準を示します。この基準は、グラフクラスの大規模なセットの共変量調整を統合します。さらに、基準を満たす集合がある場合、基準を満たす明示的な集合を定義します。また、Rパッケージdagittyに実装されている、基準を満たすすべてのセットを構築するための効率的なアルゴリズムも提供します。最後に、私たちの基準と他の調整基準との関係について説明し、DAGの調整基準の新しい健全性と完全性の証明を提供します。

Compact Convex Projections
コンパクト凸投影

We study the usefulness of conditional gradient like methods for determining projections onto convex sets, in particular, projections onto naturally arising convex sets in reproducing kernel Hilbert spaces. Our work is motivated by the recently introduced kernel herding algorithm which is closely related to the Conditional Gradient Method (CGM). It is known that the herding algorithm converges with a rate of $1/t$, where $t$ counts the number of iterations, when a point in the interior of a convex set is approximated. We generalize this result and we provide a necessary and sufficient condition for the algorithm to approximate projections with a rate of $1/t$. The CGM, which is in general vastly superior to the herding algorithm, achieves only an inferior rate of $1/\sqrt{t}$ in this setting. We study the usefulness of such projection algorithms further by exploring ways to use these for solving concrete machine learning problems. In particular, we derive non-parametric regression algorithms which use at their core a slightly modified kernel herding algorithm to determine projections. We derive bounds to control approximation errors of these methods and we demonstrate via experiments that the developed regressors are en-par with state-of-the-art regression algorithms for large scale problems.

私たちは、凸集合への射影、特に再生カーネルヒルベルト空間における自然に生じる凸集合への射影を決定するための条件付き勾配法のような方法の有用性を研究します。私たちの研究は、条件付き勾配法(CGM)と密接に関連する、最近導入されたカーネルハーディングアルゴリズムに触発されています。ハーディングアルゴリズムは、凸集合の内部の点が近似されるときに、反復回数を$t$で表すと、$1/t$の速度で収束することが知られています。私たちはこの結果を一般化し、アルゴリズムが射影を$1/t$の速度で近似するための必要十分条件を与えます。一般にハーディングアルゴリズムよりはるかに優れているCGMは、この設定では$1/\sqrt{t}$という劣った速度しか達成しません。私たちは、このような射影アルゴリズムを具体的な機械学習の問題の解決に使用する方法を探求することで、これらのアルゴリズムの有用性をさらに研究します。特に、私たちは、わずかに修正されたカーネルハーディングアルゴリズムを中核として用いて射影を決定するノンパラメトリック回帰アルゴリズムを導出します。私たちは、これらの方法の近似誤差を制御するための境界を導出し、開発された回帰器が大規模問題に対する最先端の回帰アルゴリズムと同等であることを実験によって実証します。

Sketched Ridge Regression: Optimization Perspective, Statistical Perspective, and Model Averaging
スケッチリッジ回帰: 最適化の観点、統計の観点、モデルの平均化

We address the statistical and optimization impacts of the classical sketch and Hessian sketch used to approximately solve the Matrix Ridge Regression (MRR) problem. Prior research has quantified the effects of classical sketch on the strictly simpler least squares regression (LSR) problem. We establish that classical sketch has a similar effect upon the optimization properties of MRR as it does on those of LSR: namely, it recovers nearly optimal solutions. By contrast, Hessian sketch does not have this guarantee; instead, the approximation error is governed by a subtle interplay between the “mass” in the responses and the optimal objective value. For both types of approximation, the regularization in the sketched MRR problem results in significantly different statistical properties from those of the sketched LSR problem. In particular, there is a bias-variance trade-off in sketched MRR that is not present in sketched LSR. We provide upper and lower bounds on the bias and variance of sketched MRR; these bounds show that classical sketch significantly increases the variance, while Hessian sketch significantly increases the bias. Empirically, sketched MRR solutions can have risks that are higher by an order-of-magnitude than those of the optimal MRR solutions. We establish theoretically and empirically that model averaging greatly decreases the gap between the risks of the true and sketched solutions to the MRR problem. Thus, in parallel or distributed settings, sketching combined with model averaging is a powerful technique that quickly obtains near-optimal solutions to the MRR problem while greatly mitigating the increased statistical risk incurred by sketching.

私たちは、マトリックスリッジ回帰(MRR)問題を近似的に解くために使用される古典的スケッチとヘッセスケッチの統計的および最適化への影響について説明します。以前の研究では、古典的スケッチが厳密に単純な最小二乗回帰(LSR)問題に与える影響を定量化しています。古典的スケッチは、LSRの最適化特性と同様にMRRの最適化特性にも影響を与えることが確認されています。つまり、ほぼ最適なソリューションを回復します。対照的に、ヘッセスケッチにはこの保証はありません。代わりに、近似誤差は、応答の「質量」と最適な目的値の間の微妙な相互作用によって決まります。両方のタイプの近似について、スケッチされたMRR問題の正則化により、スケッチされたLSR問題とは統計的特性が大きく異なります。特に、スケッチされたMRRには、スケッチされたLSRにはないバイアスと分散のトレードオフがあります。スケッチされたMRRのバイアスと分散の上限と下限を示します。これらの境界は、古典的なスケッチでは分散が大幅に増加し、ヘッセスケッチではバイアスが大幅に増加することを示しています。経験的に、スケッチされたMRRソリューションは、最適なMRRソリューションよりも1桁高いリスクを持つ可能性があります。モデル平均化により、MRR問題に対する真のソリューションとスケッチされたソリューションのリスクのギャップが大幅に減少することが理論的および経験的に確立されています。したがって、並列または分散設定では、モデル平均化と組み合わせたスケッチは、スケッチによって発生する統計的リスクの増加を大幅に軽減しながら、MRR問題に対するほぼ最適なソリューションを迅速に取得する強力な手法です。

Simultaneous Clustering and Estimation of Heterogeneous Graphical Models
異種グラフィカルモデルの同時クラスタリングと推定

We consider joint estimation of multiple graphical models arising from heterogeneous and high-dimensional observations. Unlike most previous approaches which assume that the cluster structure is given in advance, an appealing feature of our method is to learn cluster structure while estimating heterogeneous graphical models. This is achieved via a high dimensional version of Expectation Conditional Maximization (ECM) algorithm (Meng and Rubin, 1993). A joint graphical lasso penalty is imposed on the conditional maximization step to extract both homogeneity and heterogeneity components across all clusters. Our algorithm is computationally efficient due to fast sparse learning routines and can be implemented without unsupervised learning knowledge. The superior performance of our method is demonstrated by extensive experiments and its application to a Glioblastoma cancer dataset reveals some new insights in understanding the Glioblastoma cancer. In theory, a non-asymptotic error bound is established for the output directly from our high dimensional ECM algorithm, and it consists of two quantities: statistical error (statistical accuracy) and optimization error (computational complexity). Such a result gives a theoretical guideline in terminating our ECM iterations.

私たちは、異種および高次元の観測から生じる複数のグラフィカルモデルの共同推定について検討します。クラスター構造が事前に与えられていると想定する従来のほとんどのアプローチとは異なり、私たちの方法の魅力的な特徴は、異種のグラフィカルモデルを推定しながらクラスター構造を学習することです。これは、期待値条件付き最大化(ECM)アルゴリズム(MengおよびRubin、1993)の高次元バージョンによって実現されます。すべてのクラスターにわたって同質性と異質性の両方のコンポーネントを抽出するために、条件付き最大化ステップに共同グラフィカルLassoペナルティが課されます。私たちのアルゴリズムは、高速スパース学習ルーチンにより計算効率が高く、教師なし学習の知識がなくても実装できます。私たちの方法の優れたパフォーマンスは、広範な実験によって実証されており、膠芽腫(グリオブラストーマ)のデータセットへの適用により、膠芽腫を理解するためのいくつかの新しい洞察が明らかになりました。理論的には、高次元ECMアルゴリズムの出力そのものに対する非漸近的な誤差境界が確立され、これは統計誤差(統計精度)と最適化誤差(計算の複雑さ)の2つの量で構成されます。このような結果は、ECM反復を終了するための理論的なガイドラインとなります。

Reward Maximization Under Uncertainty: Leveraging Side-Observations on Networks
不確実性下での報酬最大化:ネットワーク上でのサイドオブザベーションの活用

We study the stochastic multi-armed bandit (MAB) problem in the presence of side-observations across actions that occur as a result of an underlying network structure. In our model, a bipartite graph captures the relationship between actions and a common set of unknowns such that choosing an action reveals observations for the unknowns that it is connected to. This models a common scenario in online social networks where users respond to their friends’ activity, thus providing side information about each other’s preferences. Our contributions are as follows: 1) We derive an asymptotic lower bound (with respect to time) as a function of the bi-partite network structure on the regret of any uniformly good policy that achieves the maximum long-term average reward. 2) We propose two policies – a randomized policy; and a policy based on the well-known upper confidence bound (UCB) policies – both of which explore each action at a rate that is a function of its network position. We show, under mild assumptions, that these policies achieve the asymptotic lower bound on the regret up to a multiplicative factor, independent of the network structure. Finally, we use numerical examples on a real-world social network and a routing example network to demonstrate the benefits obtained by our policies over other existing policies.

私たちは、基礎となるネットワーク構造の結果として生じる行動全体にわたる副次的観測が存在する場合の確率的多腕バンディット(MAB)問題を研究します。我々のモデルでは、二部グラフが行動と共通の未知数セットとの関係を捉え、行動を選択すると、それに関連する未知数の観測が明らかになります。これは、ユーザーが友人の活動に反応し、互いの好みに関する副次的情報を提供するという、オンラインソーシャルネットワークの一般的なシナリオをモデル化します。我々の貢献は以下のとおりです。1)二部ネットワーク構造の関数として、最大長期平均報酬を達成する均一に良いポリシーの後悔に関する漸近的下限(時間に関して)を導出します。2)ランダム化ポリシーと、よく知られている上側信頼限界(UCB)ポリシーに基づくポリシーの2つのポリシーを提案します。どちらも、ネットワーク位置の関数である速度で各行動を探索します。私たちは、軽度の仮定の下で、これらのポリシーがネットワーク構造に依存せず、乗数までの後悔の漸近的下限を達成することを示します。最後に、現実世界のソーシャルネットワークとルーティング例のネットワークでの数値例を使用して、他の既存のポリシーよりも我々のポリシーによって得られる利点を示します。

SGDLibrary: A MATLAB library for stochastic optimization algorithms
SGDLibrary: 確率的最適化アルゴリズムのための MATLAB ライブラリ

We consider the problem of finding the minimizer of a function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ of the finite-sum form $\min f(w) = 1/n\sum_{i}^n f_i(w)$. This problem has been studied intensively in recent years in the field of machine learning (ML). One promising approach for large-scale data is to use a stochastic optimization algorithm to solve the problem. SGDLibrary is a readable, flexible and extensible pure-MATLAB library of a collection of stochastic optimization algorithms. The purpose of the library is to provide researchers and implementers a comprehensive evaluation environment for the use of these algorithms on various ML problems.

私たちは、関数$f: \mathbb{R}^d \rightarrow \mathbb{R}$の有限和形式$\min f(w) = 1/n\sum_{i}^n f_i(w)$のミニマイザーを求める問題を考えます。この問題は、近年、機械学習(ML)の分野で集中的に研究されています。大規模データに対する有望なアプローチの1つは、確率的最適化アルゴリズムを使用して問題を解決することです。SGDLibraryは、確率的最適化アルゴリズムのコレクションの、可読性、柔軟性、拡張性に優れた純粋なMATLABライブラリです。このライブラリの目的は、研究者と実装者に、さまざまなML問題でこれらのアルゴリズムを使用するための包括的な評価環境を提供することです。
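As a rough illustration of the finite-sum problem described above, the following is a minimal plain-SGD sketch in Python. It is not SGDLibrary's MATLAB API. The least-squares components $f_i(w) = \frac{1}{2}(x_i^\top w - y_i)^2$, the function name `sgd_least_squares`, the step size, and the toy data are all assumptions chosen for illustration.

```python
import random


def sgd_least_squares(X, y, step=0.1, epochs=200, seed=0):
    """Plain SGD on f(w) = (1/n) * sum_i 0.5 * (x_i . w - y_i)^2.

    Hypothetical helper for illustration only; not part of SGDLibrary.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        # Visit every component f_i once per epoch, in random order.
        for i in rng.sample(range(n), n):
            residual = sum(xj * wj for xj, wj in zip(X[i], w)) - y[i]
            # The gradient of f_i at w is residual * x_i.
            w = [wj - step * residual * xj for wj, xj in zip(w, X[i])]
    return w


# Toy data generated without noise from w* = (2, -1) (assumed example).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, -1.0, 1.0, 3.0]
w_hat = sgd_least_squares(X, y)
```

Each update touches a single component $f_i$, so the per-iteration cost does not depend on $n$. That is the property that makes stochastic methods attractive for large-scale data.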

tick: a Python Library for Statistical Learning, with an emphasis on Hawkes Processes and Time-Dependent Models
tick: 統計的学習のためのPythonライブラリで、ホークス過程と時間依存モデルに重点を置いています

This paper introduces tick, a statistical learning library for Python 3, with a particular emphasis on time-dependent models, such as point processes, and tools for generalized linear models and survival analysis. The core of the library provides model computational classes, solvers and proximal operators for regularization. It relies on a C++ implementation and state-of-the-art optimization algorithms to provide very fast computations in a single node multi-core setting. Source code and documentation can be downloaded from https://github.com/X-DataInitiative/tick.

この論文では、Python 3の統計学習ライブラリであるtickを紹介し、ポイントプロセス、一般化線形モデル用のツール、生存分析などの時間依存モデルに特に重点を置いています。ライブラリの中核となるのは、モデルの計算クラス、ソルバー、正則化のための近位演算子です。C++の実装と最先端の最適化アルゴリズムに依存して、シングルノードのマルチコア設定で非常に高速な計算を提供します。ソースコードとドキュメントはhttps://github.com/X-DataInitiative/tickからダウンロードできます。

Gaussian Lower Bound for the Information Bottleneck Limit
情報ボトルネックの限界に対するガウス下限

The Information Bottleneck (IB) is a conceptual method for extracting the most compact, yet informative, representation of a set of variables, with respect to the target. It generalizes the notion of minimal sufficient statistics from classical parametric statistics to a broader information-theoretic sense. The IB curve defines the optimal trade-off between representation complexity and its predictive power. Specifically, it is achieved by minimizing the level of mutual information (MI) between the representation and the original variables, subject to a minimal level of MI between the representation and the target. This problem is shown to be in general NP hard. One important exception is the multivariate Gaussian case, for which the Gaussian IB (GIB) is known to obtain an analytical closed form solution, similar to Canonical Correlation Analysis (CCA). In this work we introduce a Gaussian lower bound to the IB curve; we find an embedding of the data which maximizes its “Gaussian part”, on which we apply the GIB. This embedding provides an efficient (and practical) representation of any arbitrary data-set (in the IB sense), which in addition holds the favorable properties of a Gaussian distribution. Importantly, we show that the optimal Gaussian embedding is bounded from above by non-linear CCA. This allows a fundamental limit for our ability to Gaussianize arbitrary data-sets and solve complex problems by linear methods.

情報ボトルネック(IB)は、ターゲットに関して、変数セットの最もコンパクトでありながら有益な表現を抽出する概念的な方法です。これは、古典的なパラメトリック統計からより広い情報理論的意味に、最小十分統計の概念を一般化します。IB曲線は、表現の複雑さとその予測力の間の最適なトレードオフを定義します。具体的には、表現とターゲット間の相互情報量(MI)のレベルが最小であることを条件として、表現と元の変数間の相互情報量(MI)のレベルを最小化することで実現されます。この問題は一般にNP困難であることが示されています。重要な例外の1つは多変量ガウスの場合で、この場合はガウスIB (GIB)が正準相関分析(CCA)に似た解析的な閉形式ソリューションを取得することが知られています。この研究では、IB曲線のガウス下限を導入します。私たちは、データの「ガウス部分」を最大化する埋め込みを見つけ、これにGIBを適用します。この埋め込みは、任意のデータセット（IBの意味で）の効率的（かつ実用的）な表現を提供し、さらにガウス分布の好ましい特性を保持します。重要なことは、最適なガウス埋め込みが非線形CCAによって上方から制限されることを示しています。これにより、任意のデータセットをガウス化し、複雑な問題を線形方法で解決する能力に基本的な制限が課せられます。

Catalyst Acceleration for First-order Convex Optimization: from Theory to Practice
1次凸最適化のための触媒加速:理論から実践まで

We introduce a generic scheme for accelerating gradient-based optimization methods in the sense of Nesterov. The approach, called Catalyst, builds upon the inexact accelerated proximal point algorithm for minimizing a convex objective function, and consists of approximately solving a sequence of well-chosen auxiliary problems, leading to faster convergence. One of the keys to achieve acceleration in theory and in practice is to solve these sub-problems with appropriate accuracy by using the right stopping criterion and the right warm-start strategy. We give practical guidelines to use Catalyst and present a comprehensive analysis of its global complexity. We show that Catalyst applies to a large class of algorithms, including gradient descent, block coordinate descent, incremental algorithms such as SAG, SAGA, SDCA, SVRG, MISO/Finito, and their proximal variants. For all of these methods, we establish faster rates using the Catalyst acceleration, for strongly convex and non-strongly convex objectives. We conclude with extensive experiments showing that acceleration is useful in practice, especially for ill-conditioned problems.

私たちは、ネステロフの意味で勾配ベースの最適化手法を加速するための一般的なスキームを紹介します。Catalystと呼ばれるこのアプローチは、凸目的関数を最小化する不正確な加速近接点アルゴリズムに基づいており、適切に選択された一連の補助問題を近似的に解くことで、より速い収束につながります。理論上および実践的に加速を達成するための鍵の1つは、適切な停止基準と適切なウォームスタート戦略を使用して、これらのサブ問題を適切な精度で解くことです。Catalystを使用するための実用的なガイドラインを示し、その全体的な複雑性の包括的な分析を示します。Catalystは、勾配降下法、ブロック座標降下法、SAG、SAGA、SDCA、SVRG、MISO/Finitoなどの増分アルゴリズム、およびそれらの近接(proximal)バリアントを含む、大規模なアルゴリズムクラスに適用できることを示します。これらのすべての方法について、強凸目的と非強凸目的に対して、Catalyst加速を使用してより速い収束率を確立します。最後に、広範囲にわたる実験により、加速が実際に、特に悪条件の問題に対して有用であることを示します。

Weighted SGD for l_p Regression with Randomized Preconditioning
ランダム化前処理による l_p 回帰の加重 SGD

In recent years, stochastic gradient descent (SGD) methods and randomized linear algebra (RLA) algorithms have been applied to many large-scale problems in machine learning and data analysis. SGD methods are easy to implement and applicable to a wide range of convex optimization problems. In contrast, RLA algorithms provide much stronger performance guarantees but are applicable to a narrower class of problems. We aim to bridge the gap between these two methods in solving constrained overdetermined linear regression problems—e.g., $\ell_2$ and $\ell_1$ regression problems.

近年、確率的勾配降下法(SGD)やランダム化線形代数(RLA)アルゴリズムが、機械学習やデータ解析における多くの大規模問題に適用されています。SGD法は実装が容易で、幅広い凸最適化問題に適用できます。対照的に、RLAアルゴリズムは、はるかに強力なパフォーマンス保証を提供しますが、より狭いクラスの問題にしか適用できません。制約付き過決定線形回帰問題(例えば、$\ell_2$および$\ell_1$回帰問題)を解く際に、これら2つの方法の間のギャップを埋めることを目指しています。

Sparse Exchangeable Graphs and Their Limits via Graphon Processes
グラフオンプロセスによる疎な交換可能グラフとその極限

In a recent paper, Caron and Fox suggest a probabilistic model for sparse graphs which are exchangeable when associating each vertex with a time parameter in $\mathbb{R}_+$. Here we show that by generalizing the classical definition of graphons as functions over probability spaces to functions over $\sigma$-finite measure spaces, we can model a large family of exchangeable graphs, including the Caron-Fox graphs and the traditional exchangeable dense graphs as special cases. Explicitly, modelling the underlying space of features by a $\sigma$-finite measure space $(S,\mathcal{S},\mu)$ and the connection probabilities by an integrable function $W\colon S\times S\to [0,1]$, we construct a random family $(G_t)_{t\geq 0}$ of growing graphs such that the vertices of $G_t$ are given by a Poisson point process on $S$ with intensity $t\mu$, with two points $x,y$ of the point process connected with probability $W(x,y)$. We call such a random family a graphon process. We prove that a graphon process has convergent subgraph frequencies (with possibly infinite limits) and that, in the natural extension of the cut metric to our setting, the sequence converges to the generating graphon. We also show that the underlying graphon is identifiable only as an equivalence class over graphons with cut distance zero. More generally, we study metric convergence for arbitrary (not necessarily random) sequences of graphs, and show that a sequence of graphs has a convergent subsequence if and only if it has a subsequence satisfying a property we call uniform regularity of tails. Finally, we prove that every graphon is equivalent to a graphon on $\mathbb{R}_+$ equipped with Lebesgue measure.

最近の論文で、CaronとFoxは、各頂点を$\mathbb{R}_+$内の時間パラメータに関連付けると交換可能な疎グラフの確率モデルを提案しています。ここでは、確率空間上の関数としてのグラフオンの古典的な定義を$\sigma$有限測度空間上の関数に一般化することで、Caron-Foxグラフや従来の交換可能な密グラフを特殊なケースとして含む、交換可能なグラフの大規模なファミリーをモデル化できることを示します。明示的に、$\sigma$有限測度空間$(S,\mathcal{S},\mu)$によって特徴の基底空間をモデル化し、接続確率を積分可能な関数$W\colon S\times S\to [0,1]$によってモデル化することで、$G_t$の頂点が強度$t\mu$を持つ$S$上のポアソン点過程で与えられ、点過程の2つの点$x,y$が確率$W(x,y)$で接続されるような成長グラフのランダムファミリ$(G_t)_{t\geq 0}$を構築します。このようなランダムファミリをグラフオンプロセスと呼びます。グラフオンプロセスには収束するサブグラフ頻度(極限は無限大の場合もある)があり、カットメトリックを私たちの設定に自然に拡張すると、シーケンスが生成グラフオンに収束することを証明します。また、基底グラフオンは、カット距離が0のグラフオン上の同値類としてのみ識別可能であることも示します。より一般的には、任意の（必ずしもランダムではない）グラフのシーケンスの計量収束を研究し、グラフのシーケンスが収束するサブシーケンスを持つのは、それが裾の均一な正則性と呼ばれる特性を満たすサブシーケンスを持つ場合のみであることを示します。最後に、すべてのグラフオンは、ルベーグ測度を備えた$\mathbb{R}_+$上のグラフオンと同等であることを証明します。

Estimation of Graphical Models through Structured Norm Minimization
構造化ノルム最小化によるグラフィカルモデルの推定

Estimation of Markov Random Field and covariance models from high-dimensional data represents a canonical problem that has received a lot of attention in the literature. A key assumption, widely employed, is that of sparsity of the underlying model. In this paper, we study the problem of estimating such models exhibiting a more intricate structure comprising simultaneously of sparse, structured sparse and dense components. Such structures naturally arise in several scientific fields, including molecular biology, finance and political science. We introduce a general framework based on a novel structured norm that enables us to estimate such complex structures from high-dimensional data. The resulting optimization problem is convex and we introduce a linearized multi-block alternating direction method of multipliers (ADMM) algorithm to solve it efficiently. We illustrate the superior performance of the proposed framework on a number of synthetic data sets generated from both random and structured networks. Further, we apply the method to a number of real data sets and discuss the results.

高次元データからのマルコフランダムフィールドと共分散モデルの推定は、文献で多くの注目を集めている標準的な問題です。広く採用されている重要な仮定は、基礎となるモデルのスパース性です。この論文では、スパース、構造化されたスパース、および密なコンポーネントを同時に含む、より複雑な構造を示すモデルを推定する問題について検討します。このような構造は、分子生物学、金融、政治科学など、いくつかの科学分野で自然に発生します。高次元データからこのような複雑な構造を推定できるようにする、新しい構造化ノルムに基づく一般的なフレームワークを紹介します。結果として生じる最適化問題は凸型であり、これを効率的に解決するために、線形化されたマルチブロック交互方向乗数法(ADMM)アルゴリズムを紹介します。ランダムネットワークと構造化ネットワークの両方から生成された多数の合成データセットで、提案されたフレームワークの優れたパフォーマンスを示します。さらに、この方法を多数の実際のデータセットに適用し、結果について説明します。

A Tight Bound of Hard Thresholding
ハードしきい値のタイトバウンド

This paper is concerned with the hard thresholding operator which sets all but the $k$ largest absolute elements of a vector to zero. We establish a tight bound to quantitatively characterize the deviation of the thresholded solution from a given signal. Our theoretical result is universal in the sense that it holds for all choices of parameters, and the underlying analysis depends only on fundamental arguments in mathematical optimization. We discuss the implications for two domains: Compressed Sensing. On account of the crucial estimate, we bridge the connection between the restricted isometry property (RIP) and the sparsity parameter for a vast volume of hard thresholding based algorithms, which renders an improvement on the RIP condition especially when the true sparsity is unknown. This suggests that in essence, many more kinds of sensing matrices or fewer measurements are admissible for the data acquisition procedure. Machine Learning. In terms of large-scale machine learning, a significant yet challenging problem is learning accurate sparse models in an efficient manner. In stark contrast to prior work that attempted the $\ell_1$-relaxation for promoting sparsity, we present a novel stochastic algorithm which performs hard thresholding in each iteration, hence ensuring such parsimonious solutions. Equipped with the developed bound, we prove the global linear convergence for a number of prevalent statistical models under mild assumptions, even though the problem turns out to be non-convex.

この論文では、ベクトルの絶対値で最大$k$個を除くすべての要素をゼロに設定するハードしきい値演算子について検討します。特定の信号からのしきい値ソリューションの偏差を定量的に特徴付けるための厳密な境界を確立します。理論的結果は、すべてのパラメーター選択に当てはまるという意味で普遍的であり、基礎となる分析は数学的最適化の基本的な議論のみに依存します。次の2つのドメインへの影響について説明します。圧縮センシング。重要な推定値に基づいて、制限付き等長特性(RIP)とスパースパラメーターの関係を、膨大な量のハードしきい値ベースのアルゴリズムに関連付けます。これにより、特に真のスパース性が不明な場合に、RIP条件が改善されます。これは、本質的に、データ取得手順では、より多くの種類のセンシングマトリックスまたはより少ない測定値が許容されることを示唆しています。機械学習。大規模な機械学習に関して、重要な、かつ困難な問題は、正確なスパースモデルを効率的に学習することです。スパース性を促進するために$\ell_1$緩和を試みた以前の研究とはまったく対照的に、我々は各反復でハードしきい値処理を実行し、それによってこのような節約ソリューションを保証する新しい確率的アルゴリズムを提示します。開発された境界を備え、問題が非凸であることが判明した場合でも、穏やかな仮定の下で、多くの一般的な統計モデルのグローバル線形収束を証明します。
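The hard thresholding operator defined at the start of this abstract can be sketched in a few lines of Python. The function name `hard_threshold` and the tie-breaking rule (earlier positions win among entries of equal magnitude) are assumptions made for illustration; the abstract does not specify how ties are handled.

```python
def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and set the rest to zero.

    Ties are broken by position (an illustrative choice, not from the paper).
    """
    if k <= 0:
        return [0.0] * len(v)
    # Indices sorted by decreasing absolute value; the sort is stable.
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]


thresholded = hard_threshold([0.5, -3.0, 1.2, 0.1], 2)
# thresholded == [0.0, -3.0, 1.2, 0.0]
```

The output always has at most $k$ nonzero entries. Applying such an operator in each iteration is how the stochastic algorithm mentioned in the abstract ensures sparse iterates.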

An l_∞ Eigenvector Perturbation Bound and Its Application
l_∞ 固有ベクトル摂動限界とその応用

In statistics and machine learning, we are interested in the eigenvectors (or singular vectors) of certain matrices (e.g. covariance matrices, data matrices, etc). However, those matrices are usually perturbed by noises or statistical errors, either from random sampling or structural patterns. The Davis-Kahan $\sin \theta$ theorem is often used to bound the difference between the eigenvectors of a matrix $A$ and those of a perturbed matrix $\widetilde{A} = A + E$, in terms of $\ell_2$ norm. In this paper, we prove that when $A$ is a low-rank and incoherent matrix, the $\ell_{\infty}$ norm perturbation bound of singular vectors (or eigenvectors in the symmetric case) is smaller by a factor of $\sqrt{d_1}$ or $\sqrt{d_2}$ for left and right vectors, where $d_1$ and $d_2$ are the matrix dimensions. The power of this new perturbation result is shown in robust covariance estimation, particularly when random variables have heavy tails. There, we propose new robust covariance estimators and establish their asymptotic properties using the newly developed perturbation bound. Our theoretical results are verified through extensive numerical experiments.

統計学と機械学習では、特定の行列（共分散行列、データ行列など）の固有ベクトル（または特異ベクトル）に注目します。ただし、これらの行列は通常、ランダムサンプリングまたは構造パターンによるノイズや統計誤差によって摂動を受けます。Davis-Kahan $\sin \theta$定理は、行列$A$の固有ベクトルと摂動を受けた行列$\widetilde{A} = A + E$の固有ベクトルの差を$\ell_2$ノルムで制限するためによく使用されます。この論文では、$A$が低ランクでインコヒーレントな行列である場合、特異ベクトル（対称の場合は固有ベクトル）の$\ell_{\infty}$ノルム摂動境界が、左ベクトルと右ベクトルに対して$\sqrt{d_1}$または$\sqrt{d_2}$の係数だけ小さくなることを証明します。ここで、$d_1$と$d_2$は行列の次元です。この新しい摂動結果の威力は、特にランダム変数が重い裾を持つ場合に、ロバスト共分散推定で示されます。そこで、新しいロバスト共分散推定量を提案し、新しく開発された摂動境界を使用してその漸近特性を確立します。理論的な結果は、広範な数値実験によって検証されます。

The Search Problem in Mixture Models
混合モデルにおける探索問題

We consider the task of learning the parameters of a single component of a mixture model, for the case when we are given side information about that component; we call this the “search problem” in mixture models. We would like to solve this with computational and sample complexity lower than solving the overall original problem, where one learns parameters of all components. Our main contributions are the development of a simple but general model for the notion of side information, and a corresponding simple matrix-based algorithm for solving the search problem in this general setting. We then specialize this model and algorithm to four common scenarios: Gaussian mixture models, LDA topic models, subspace clustering, and mixed linear regression. For each one of these we show that if (and only if) the side information is informative, we obtain parameter estimates with greater accuracy, and also improved computation complexity than existing moment based mixture model algorithms (e.g. tensor methods). We also illustrate several natural ways one can obtain such side information, for specific problem instances. Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms showing significant improvement in runtime and accuracy.

私たちは、混合モデルの単一コンポーネントのパラメータを学習するタスクを、そのコンポーネントに関する副次情報が与えられた場合について検討します。これを混合モデルにおける「検索問題」と呼びます。私たちは、すべてのコンポーネントのパラメータを学習して全体の元の問題を解決するよりも計算量とサンプルの複雑度を低くして、この問題を解決したいと考えています。私たちの主な貢献は、サイド情報の概念のためのシンプルだが一般的なモデルの開発と、この一般的な設定で検索問題を解決するための対応するシンプルな行列ベースのアルゴリズムです。次に、このモデルとアルゴリズムを、ガウス混合モデル、LDAトピックモデル、サブスペースクラスタリング、混合線形回帰という4つの一般的なシナリオに特化します。これらのそれぞれについて、サイド情報が有益である場合(そしてその場合に限り)、既存のモーメントベースの混合モデルアルゴリズム(テンソル法など)よりも精度が高く、計算量も少ないパラメータ推定値を取得できることを示します。また、特定の問題インスタンスについて、このようなサイド情報を取得できるいくつかの自然な方法も示します。実際のデータセット(NY Times、Yelp、BSDS500)での実験では、実行時間と精度が大幅に向上し、アルゴリズムの実用性がさらに実証されています。

Steering Social Activity: A Stochastic Optimal Control Point Of View
社会活動の舵取り:確率的最適制御視点

User engagement in online social networking depends critically on the level of social activity in the corresponding platform—the number of online actions, such as posts, shares or replies, taken by their users. Can we design data-driven algorithms to increase social activity? At a user level, such algorithms may increase activity by helping users decide when to take an action to be more likely to be noticed by their peers. At a network level, they may increase activity by incentivizing a few influential users to take more actions, which in turn will trigger additional actions by other users. In this paper, we model social activity using the framework of marked temporal point processes, derive an alternate representation of these processes using stochastic differential equations (SDEs) with jumps and, exploiting this alternate representation, develop two efficient online algorithms with provable guarantees to steer social activity both at a user and at a network level. In doing so, we establish a previously unexplored connection between optimal control of jump SDEs and doubly stochastic marked temporal point processes, which is of independent interest. Finally, we experiment both with synthetic and real data gathered from Twitter and show that our algorithms consistently steer social activity more effectively than the state of the art.

オンラインソーシャルネットワーキングにおけるユーザーエンゲージメントは、対応するプラットフォームでのソーシャルアクティビティのレベル、つまりユーザーによる投稿、共有、返信などのオンラインアクションの数に大きく依存します。ソーシャルアクティビティを増やすためのデータ駆動型アルゴリズムを設計することは可能でしょうか。ユーザーレベルでは、このようなアルゴリズムは、ユーザーが仲間に注目されやすいようにいつアクションを取るかを決めるのを支援することで、アクティビティを増やすことができます。ネットワークレベルでは、影響力のある少数のユーザーにより多くのアクションを取るようインセンティブを与えることでアクティビティを増やし、それが他のユーザーによる追加アクションのトリガーとなる可能性があります。この論文では、マークされた時間的ポイントプロセスのフレームワークを使用してソーシャルアクティビティをモデル化し、ジャンプを含む確率微分方程式(SDE)を使用してこれらのプロセスの代替表現を導出し、この代替表現を利用して、ユーザーレベルとネットワークレベルの両方でソーシャルアクティビティを制御するための証明可能な保証を備えた2つの効率的なオンラインアルゴリズムを開発します。そうすることで、ジャンプSDEの最適制御と二重確率マーク付き時間点プロセスとの間の、これまで未踏であった関係を確立します。これは独立した関心事です。最後に、Twitterから収集した合成データと実際のデータの両方で実験し、私たちのアルゴリズムが常に最先端のものよりも効果的にソーシャルアクティビティを誘導することを示します。

Permuted and Augmented Stick-Breaking Bayesian Multinomial Regression
順列および拡張スティック破断ベイズ多項回帰

To model categorical response variables given their covariates, we propose a permuted and augmented stick-breaking (paSB) construction that one-to-one maps the observed categories to randomly permuted latent sticks. This new construction transforms multinomial regression into regression analysis of stick-specific binary random variables that are mutually independent given their covariate-dependent stick success probabilities, which are parameterized by the regression coefficients of their corresponding categories. The paSB construction allows transforming an arbitrary cross-entropy-loss binary classifier into a Bayesian multinomial one. Specifically, we parameterize the negative logarithms of the stick failure probabilities with a family of covariate-dependent softplus functions to construct nonparametric Bayesian multinomial softplus regression, and transform Bayesian support vector machine (SVM) into Bayesian multinomial SVM. These Bayesian multinomial regression models are not only capable of providing probability estimates, quantifying uncertainty, increasing robustness, and producing nonlinear classification decision boundaries, but also amenable to posterior simulation. Example results demonstrate their attractive properties and performance.

共変量を与えられたカテゴリ応答変数をモデル化するために、観測されたカテゴリをランダムに並べ替えられた潜在的なスティックに1対1でマッピングする並べ替えおよび拡張されたスティック破壊(paSB)構成を提案します。この新しい構成は、多項回帰を、共変量に依存するスティック成功確率(対応するカテゴリの回帰係数によってパラメーター化される)を与えられた相互に独立したスティック固有のバイナリランダム変数の回帰分析に変換します。paSB構成により、任意のクロスエントロピー損失バイナリ分類器をベイズ多項分類器に変換できます。具体的には、スティック失敗確率の負の対数を共変量依存のソフトプラス関数のファミリーでパラメーター化して、ノンパラメトリックベイズ多項ソフトプラス回帰を構築し、ベイズサポートベクターマシン(SVM)をベイズ多項SVMに変換します。これらのベイズ多項回帰モデルは、確率推定値の提供、不確実性の定量化、堅牢性の向上、非線形分類決定境界の生成が可能なだけでなく、事後シミュレーションにも対応できます。サンプル結果は、その魅力的な特性とパフォーマンスを実証しています。

Post-Regularization Inference for Time-Varying Nonparanormal Graphical Models
時変ノンパラノーマルグラフィカルモデルのための正則化後推論

We propose a novel class of time-varying nonparanormal graphical models, which allows us to model high dimensional heavy-tailed systems and the evolution of their latent network structures. Under this model we develop statistical tests for presence of edges both locally at a fixed index value and globally over a range of values. The tests are developed for a high-dimensional regime, are robust to model selection mistakes and do not require commonly assumed minimum signal strength. The testing procedures are based on a high dimensional, debiasing-free moment estimator, which uses a novel kernel smoothed Kendall’s tau correlation matrix as an input statistic. The estimator consistently estimates the latent inverse Pearson correlation matrix uniformly in both the index variable and kernel bandwidth. Its rate of convergence is shown to be minimax optimal. Our method is supported by thorough numerical simulations and an application to a neural imaging data set.

私たちは、高次元のヘビーテールシステムとそれらの潜在的なネットワーク構造の進化をモデル化することを可能にする、時変ノンパラノーマルグラフィカルモデルの新しいクラスを提案します。このモデルでは、固定インデックス値でのローカルなエッジと、ある範囲の値に対するグローバルなエッジの存在に関する統計的検定を開発します。この検定は高次元領域向けに開発されており、モデル選択の誤りに対して堅牢であり、一般的に想定される最小信号強度を必要としません。検定手順は、新しいカーネル平滑化ケンドールのタウ相関行列を入力統計量として使用する、高次元のバイアス除去フリーモーメント推定器に基づいています。推定器は、潜在逆ピアソン相関行列をインデックス変数とカーネル帯域幅の両方で一様に一貫して推定します。その収束速度は、ミニマックス最適であることが示されています。私たちの方法は、徹底的な数値シミュレーションとニューラルイメージングデータセットへの応用によって支えられています。

Sparse Concordance-assisted Learning for Optimal Treatment Decision
最適な治療決定のためのスパース一致支援学習

To find the optimal decision rule, Fan et al. (2016) proposed an innovative concordance-assisted learning algorithm which is based on the maximum rank correlation estimator. It makes better use of the available information through pairwise comparison. However, the objective function is discontinuous and computationally hard to optimize. In this paper, we consider a convex surrogate loss function to solve this problem. In addition, our algorithm ensures sparsity of the decision rule and renders easy interpretation. We derive the $L_2$ error bound of the estimated coefficients under ultra-high dimension. Simulation results of various settings and application to STAR*D both illustrate that the proposed method can still estimate the optimal treatment regime successfully when the number of covariates is large.

最適な決定ルールを見つけるために、Fanら(2016)は、最大順位相関推定量に基づく革新的な一致支援学習アルゴリズムを提案しました。ペアワイズ比較を通じて、利用可能な情報をより有効に活用します。ただし、目的関数は不連続であり、最適化するのは計算が困難です。この論文では、この問題を解決するために凸型代理損失関数について考えます。さらに、私たちのアルゴリズムは、決定ルールのスパース性を確保し、解釈を容易にします。超高次元での推定係数の$L_2$誤差範囲を導出します。さまざまな設定のシミュレーション結果とSTAR*Dへの適用は、共変量の数が多い場合でも、提案手法が最適な治療体制を成功裏に推定できることを示しています。

Exact Learning of Lightweight Description Logic Ontologies
軽量記述ロジックオントロジーの正確な学習

We study the problem of learning description logic (DL) ontologies in Angluin et al.’s framework of exact learning via queries. We admit membership queries (“is a given subsumption entailed by the target ontology?”) and equivalence queries (“is a given ontology equivalent to the target ontology?”). We present three main results: (1) ontologies formulated in (two relevant versions of) the description logic DL-Lite can be learned with polynomially many queries of polynomial size; (2) this is not the case for ontologies formulated in the description logic $\mathcal{E}\mathcal{L}$, even when only acyclic ontologies are admitted; and (3) ontologies formulated in a fragment of $\mathcal{E}\mathcal{L}$ related to the web ontology language OWL 2 RL can be learned in polynomial time. We also show that neither membership nor equivalence queries alone are sufficient in cases (1) and (3).

私たちは、Angluinらのクエリによる正確な学習のフレームワークで、記述論理(DL)オントロジーの学習の問題を研究します。メンバーシップクエリ(「与えられた包摂関係はターゲットオントロジーから帰結されるか?」)と同等性クエリ(「与えられたオントロジーはターゲットオントロジーと同等か?」)を許容します。3つの主要な結果を示します。(1)記述論理DL-Lite(の2つの関連バージョン)で定式化されたオントロジーは、多項式サイズの多項式個のクエリで学習できます。(2)これは、非巡回オントロジーのみが認められる場合でも、記述論理$\mathcal{E}\mathcal{L}$で定式化されたオントロジーには当てはまりません。(3)ウェブオントロジー言語OWL 2 RLに関連する$\mathcal{E}\mathcal{L}$のフラグメントで定式化されたオントロジーは、多項式時間で学習できます。また、(1)と(3)の場合、メンバーシップクエリのみ、あるいは同等性クエリのみでは十分ではないことも示します。

Surprising properties of dropout in deep networks
深層ネットワークにおけるドロップアウトの驚くべき特性

We analyze dropout in deep networks with rectified linear units and the quadratic loss. Our results expose surprising differences between the behavior of dropout and more traditional regularizers like weight decay. For example, on some simple data sets dropout training produces negative weights even though the output is the sum of the inputs. This provides a counterpoint to the suggestion that dropout discourages co-adaptation of weights. We also show that the dropout penalty can grow exponentially in the depth of the network while the weight-decay penalty remains essentially linear, and that dropout is insensitive to various re-scalings of the input features, outputs, and network weights. This last insensitivity implies that there are no isolated local minima of the dropout training criterion. Our work uncovers new properties of dropout, extends our understanding of why dropout succeeds, and lays the foundation for further progress.

私たちは、整流線形ユニット(ReLU)と二次損失を持つ深層ネットワークのドロップアウトを解析します。私たちの結果は、ドロップアウトの振る舞いと、重み減衰のような従来の正則化手法との間に驚くべき違いがあることを明らかにしています。たとえば、一部の単純なデータセットでは、出力が入力の合計である場合でも、ドロップアウトトレーニングによって負の重みが生成されます。これは、ドロップアウトが重みの共適応を阻害するという提案に対する反論を提供します。また、ドロップアウトペナルティはネットワークの深さに対して指数関数的に増加する可能性がある一方で、重み減衰ペナルティは基本的に線形のままであり、ドロップアウトは入力特徴、出力、およびネットワークの重みのさまざまな再スケーリングの影響を受けないことを示します。この最後の不感性は、ドロップアウト学習基準の孤立した局所的な最小値が存在しないことを意味します。私たちの研究は、ドロップアウトの新たな特性を明らかにし、ドロップアウトが成功する理由についての理解を深め、さらなる進歩の基礎を築きます。

Simple, Robust and Optimal Ranking from Pairwise Comparisons
ペアワイズ比較からのシンプルで堅牢で最適なランキング

We consider data in the form of pairwise comparisons of $n$ items, with the goal of identifying the top $k$ items for some value of $k < n$, or alternatively, recovering a ranking of all the items. We analyze the Borda counting algorithm that ranks the items in order of the number of pairwise comparisons won, and show it has three attractive features: (a) it is an optimal method achieving the information-theoretic limits up to constant factors; (b) it is robust in that its optimality holds without imposing conditions on the underlying matrix of pairwise-comparison probabilities, in contrast to some prior work that applies only to the BTL parametric model; and (c) its computational efficiency leads to speed-ups of several orders of magnitude. We address the problem of exact recovery, and for the top-$k$ recovery problem we also extend our results to obtain sharp guarantees for approximate recovery under the Hamming distortion metric, and more generally, to any arbitrary error requirement that satisfies a simple and natural monotonicity condition. In doing so, we introduce a general framework that allows us to treat a variety of problems in the literature in a unified manner.

私たちは、$n$個のアイテムのペア比較の形式でデータを検討し、$k < n$の値で上位$k$個のアイテムを特定するか、あるいはすべてのアイテムのランキングを復元することを目標とします。私たちは、ペア比較の勝者数順にアイテムをランク付けするBordaカウントアルゴリズムを分析し、次の3つの魅力的な特徴があることを示します。(a)定数因子までの情報理論的限界を達成する最適な方法であること、(b) BTLパラメトリックモデルにのみ適用される以前の研究とは対照的に、ペア比較確率の基礎となる行列に条件を課すことなく最適性が維持されるという点で堅牢であること、(c)計算効率により数桁の高速化が実現されること。我々は正確な復元の問題に取り組み、上位$k$個の復元問題に対しては、ハミング歪みメトリックの下で近似復元の明確な保証、およびより一般的には、単純で自然な単調性条件を満たす任意のエラー要件に対する近似復元の明確な保証を得るために結果を拡張します。そうすることで、文献のさまざまな問題を統一的に扱うことができる一般的な枠組みを導入します。
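The Borda counting rule described in this abstract (rank items by the number of pairwise comparisons won) can be sketched as follows. The input format, a list of `(winner, loser)` index pairs, the function names, and the tie-breaking by item index are assumptions made for this illustration and may differ from the paper's exact formulation.

```python
from collections import Counter


def borda_ranking(n_items, comparisons):
    """Rank items 0..n_items-1 by how many pairwise comparisons they won.

    `comparisons` is a list of (winner, loser) pairs (assumed format).
    Ties are broken by item index, an arbitrary illustrative choice.
    """
    wins = Counter()
    for winner, _loser in comparisons:
        wins[winner] += 1
    # Counter returns 0 for items that never won a comparison.
    return sorted(range(n_items), key=lambda i: (-wins[i], i))


def top_k(n_items, comparisons, k):
    """Return the k items with the most wins."""
    return borda_ranking(n_items, comparisons)[:k]


comparisons = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]
ranking = borda_ranking(4, comparisons)  # wins: item 0 -> 3, 1 -> 2, 2 -> 1, 3 -> 0
best_two = top_k(4, comparisons, 2)
```

Counting wins needs one pass over the comparisons followed by a sort, which is consistent with the computational efficiency the abstract emphasizes.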

Nonasymptotic convergence of stochastic proximal point methods for constrained convex optimization
制約付き凸最適化のための確率的近位点法の非漸近収束

A popular approach for solving stochastic optimization problems is the stochastic gradient descent (SGD) method. Although the SGD iteration is computationally cheap and its practical performance may be satisfactory under certain circumstances, there is recent evidence of its convergence difficulties and instability for inappropriate choice of parameters. To avoid some of the drawbacks of SGD, stochastic proximal point (SPP) algorithms have been recently considered. We introduce a new variant of the SPP method for solving stochastic convex problems subject to (in)finite intersection of constraints satisfying a linear regularity condition. For the newly introduced SPP scheme we prove new nonasymptotic convergence results. In particular, for convex Lipschitz continuous objective functions, we prove nonasymptotic convergence rates in terms of the expected value function gap of order $\mathcal{O}\left(\frac{1}{k^{1/2}}\right)$, where $k$ is the iteration counter. We also derive better nonasymptotic convergence rates in terms of expected quadratic distance from the iterates to the optimal solution for smooth strongly convex objective functions, which in the best case is of order $\mathcal{O}\left(\frac{1}{k}\right)$. Since these convergence rates can be attained by our SPP algorithm only under some natural restrictions on the stepsize, we also introduce a restarting variant of SPP that overcomes these difficulties and derive the corresponding nonasymptotic convergence rates. Numerical evidence supports the effectiveness of our methods in real problems.

確率的最適化問題を解くための一般的なアプローチは、確率的勾配降下法(SGD)です。SGD反復は計算コストが低く、特定の状況下では実用的なパフォーマンスが満足できる場合もありますが、パラメータの選択が不適切な場合に収束が困難で不安定になるという最近の証拠があります。SGDの欠点のいくつかを回避するために、確率的近位点(SPP)アルゴリズムが最近検討されています。線形正則条件を満たす有限個(または無限個)の制約の共通部分に従う確率的凸問題を解くためのSPP法の新しい変種を紹介します。新しく導入されたSPPスキームについて、新しい非漸近収束結果を証明します。特に、凸Lipschitz連続目的関数について、期待値関数ギャップに関してオーダー$\mathcal{O}\left(\frac{1}{k^{1/2}}\right)$の非漸近収束率を証明します。ここで$k$は反復カウンターです。また、滑らかな強凸目的関数について、反復点から最適解までの期待二次距離に関して、より良い非漸近収束率を導出します。これは最良の場合、$\mathcal{O}\left(\frac{1}{k}\right)$のオーダーになります。これらの収束率は、ステップサイズに対するいくつかの自然な制限の下でのみSPPアルゴリズムによって達成できるため、これらの困難を克服するSPPの再開バリアントも導入し、対応する非漸近収束率を導出します。数値的証拠は、実際の問題における我々の方法の有効性を裏付けています。
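To make the contrast between SGD and a stochastic proximal point (SPP) step concrete, here is a small unconstrained sketch in Python. It is not the constrained method of the paper. It assumes a single least-squares component $f_i(w) = \frac{1}{2}(x^\top w - y)^2$, for which the proximal step $\arg\min_w f_i(w) + \frac{1}{2\mu}\|w - w_k\|^2$ has the closed form $w_k - \frac{\mu\,(x^\top w_k - y)}{1 + \mu\|x\|^2}\,x$. The function names and toy numbers are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def sgd_step(w, x, y, step):
    """One SGD step on f(w) = 0.5 * (x . w - y)^2."""
    residual = dot(x, w) - y
    return [wj - step * residual * xj for wj, xj in zip(w, x)]


def spp_step(w, x, y, step):
    """One proximal point step on the same f, using its closed form."""
    residual = dot(x, w) - y
    scale = step / (1.0 + step * dot(x, x))
    return [wj - scale * residual * xj for wj, xj in zip(w, x)]


# Deliberately oversized step size (assumed toy values).
w0, x, y, step = [0.0, 0.0], [1.0, 2.0], 1.0, 10.0
res_start = abs(dot(x, w0) - y)                     # 1.0
res_sgd = abs(dot(x, sgd_step(w0, x, y, step)) - y)  # 49.0: the step overshoots
res_spp = abs(dot(x, spp_step(w0, x, y, step)) - y)  # 1/51: the residual shrinks
```

In this example the proximal step scales the residual by $1/(1 + \mu\|x\|^2)$, so it cannot overshoot for any positive step size. The SGD step scales it by $1 - \mu\|x\|^2$, which grows in magnitude once $\mu\|x\|^2 > 2$. This is one simple way to see the instability under a poorly chosen step size that the abstract refers to.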

Saturating Splines and Feature Selection
飽和スプラインとフィーチャー選択

We extend the adaptive regression spline model by incorporating saturation, the natural requirement that a function extend as a constant outside a certain range. We fit saturating splines to data via a convex optimization problem over a space of measures, which we solve using an efficient algorithm based on the conditional gradient method. Unlike many existing approaches, our algorithm solves the original infinite-dimensional (for splines of degree at least two) optimization problem without pre-specified knot locations. We then adapt our algorithm to fit generalized additive models with saturating splines as coordinate functions and show that the saturation requirement allows our model to simultaneously perform feature selection and nonlinear function fitting. Finally, we briefly sketch how the method can be extended to higher order splines and to different requirements on the extension outside the data range.

私たちは、関数が特定の範囲外では定数として延長されるという自然な要件である飽和を組み込むことで、適応回帰スプラインモデルを拡張します。飽和スプラインは、測度の空間上の凸最適化問題を介してデータに当てはめ、この問題を条件付き勾配法に基づく効率的なアルゴリズムで解きます。多くの既存のアプローチとは異なり、私たちのアルゴリズムは、事前に指定されたノット位置なしで元の無限次元(少なくとも2次のスプラインの場合)の最適化問題を解きます。次に、飽和スプラインを座標関数として持つ一般化加法モデルを適合するようにアルゴリズムを適応させ、飽和要件により、モデルが特徴選択と非線形関数近似を同時に実行できることを示します。最後に、この手法を高次スプラインに拡張する方法と、データ範囲外の拡張に関するさまざまな要件に拡張する方法を簡単に説明します。

From Predictive Methods to Missing Data Imputation: An Optimization Approach
予測手法から欠損データ補完まで:最適化アプローチ

Missing data is a common problem in real-world settings and for this reason has attracted significant attention in the statistical literature. We propose a flexible framework based on formal optimization to impute missing data with mixed continuous and categorical variables. This framework can readily incorporate various predictive models including $K$-nearest neighbors, support vector machines, and decision tree based methods, and can be adapted for multiple imputation. We derive fast first-order methods that obtain high quality solutions in seconds following a general imputation algorithm opt.impute presented in this paper. We demonstrate that our proposed method improves out-of-sample accuracy in large-scale computational experiments across a sample of 84 data sets taken from the UCI Machine Learning Repository. In all scenarios of missing at random mechanisms and various missing percentages, opt.impute produces the best overall imputation in most data sets benchmarked against five other methods: mean impute, $K$-nearest neighbors, iterative knn, Bayesian PCA, and predictive-mean matching, with an average reduction in mean absolute error of 8.3$\%$ against the best cross-validated benchmark method. Moreover, opt.impute leads to improved out-of-sample performance of learning algorithms trained using the imputed data, demonstrated by computational experiments on 10 downstream tasks. For models trained using opt.impute single imputations with 50$\%$ data missing, the average out-of-sample $R^2$ is 0.339 in the regression tasks and the average out-of-sample accuracy is 86.1$\%$ in the classification tasks, compared to 0.315 and 84.4$\%$ for the best cross-validated benchmark method. In the multiple imputation setting, downstream models trained using opt.impute obtain a statistically significant improvement over models trained using multivariate imputation by chained equations (mice) in 8/10 missing data scenarios considered.

欠損データは実世界の設定で一般的な問題であり、このため統計文献で大きな注目を集めています。私たちは、連続変数とカテゴリ変数が混在する欠損データを補完するための形式最適化に基づく柔軟なフレームワークを提案します。このフレームワークは、K近傍法、サポートベクターマシン、決定木ベースの方法など、さまざまな予測モデルを容易に組み込むことができ、多重補完に適応できます。本論文で紹介する一般的な補完アルゴリズムopt.imputeに従って、数秒で高品質のソリューションを取得する高速な一次手法を導出します。UCI機械学習リポジトリから取得した84のデータセットのサンプル全体にわたる大規模な計算実験で、提案手法によってサンプル外精度が向上することを実証します。ランダム欠損メカニズムおよびさまざまな欠損率のすべてのシナリオにおいて、opt.imputeは、平均補完、$K$最近傍法、反復knn、ベイズPCA、および予測平均マッチングという5つの他の方法と比較したほとんどのデータセットで全体的に最良の補完を生成し、最良のクロス検証ベンチマーク方法に対して平均絶対誤差の平均減少は8.3$\%$でした。さらに、opt.imputeは、補完データを使用してトレーニングされた学習アルゴリズムのサンプル外パフォーマンスの向上につながります。これは、10のダウンストリームタスクでの計算実験によって実証されています。50$\%$のデータが欠損しているopt.impute単一補完を使用してトレーニングされたモデルの場合、回帰タスクの平均サンプル外$R^2$は0.339、分類タスクの平均サンプル外精度は86.1$\%$です。これに対し、最良のクロス検証ベンチマーク方法では、それぞれ0.315および84.4$\%$でした。多重補完設定では、opt.imputeを使用してトレーニングされた下流モデルは、検討された10個の欠損データシナリオのうち8個で、連鎖方程式による多変量補完(mice)を使用してトレーニングされたモデルよりも統計的に有意な改善を得ています。

Active Nearest-Neighbor Learning in Metric Spaces
メトリック空間でのアクティブ最近傍学習

We propose a pool-based non-parametric active learning algorithm for general metric spaces, called MArgin Regularized Metric Active Nearest Neighbor (MARMANN), which outputs a nearest-neighbor classifier. We give prediction error guarantees that depend on the noisy-margin properties of the input sample, and are competitive with those obtained by previously proposed passive learners. We prove that the label complexity of MARMANN is significantly lower than that of any passive learner with similar error guarantees. MARMANN is based on a generalized sample compression scheme, and a new label-efficient active model-selection procedure.

私たちは、最近傍分類器を出力するMArgin Regularized Metric Active Nearest Neighbor (MARMANN)と呼ばれる、一般的なメトリック空間に対するプールベースのノンパラメトリックアクティブラーニングアルゴリズムを提案します。ここでは、入力サンプルのノイズマージン特性に依存し、以前に提案された受動的学習器によって得られた予測誤差と競合する予測誤差を保証します。MARMANNのラベルの複雑さは、同様のエラー保証を持つ受動的な学習器のラベルの複雑さよりも大幅に低いことを証明します。MARMANNは、一般化されたサンプル圧縮スキームと、ラベル効率の高い新しいアクティブモデル選択手順に基づいています。

Enhancing Identification of Causal Effects by Pruning
剪定による因果関係の特定の強化

Causal models communicate our assumptions about causes and effects in real-world phenomena. Often the interest lies in the identification of the effect of an action which means deriving an expression from the observed probability distribution for the interventional distribution resulting from the action. In many cases an identifiability algorithm may return a complicated expression that contains variables that are in fact unnecessary. In practice this can lead to additional computational burden and increased bias or inefficiency of estimates when dealing with measurement error or missing data. We present graphical criteria to detect variables which are redundant in identifying causal effects. We also provide an improved version of a well-known identifiability algorithm that implements these criteria.

因果モデルは、現実世界の現象の原因と結果に関する仮定を伝えます。多くの場合、関心はアクションの効果の特定にあり、アクションから生じる介入分布の観察された確率分布から式を導き出すことを意味します。多くの場合、識別可能性アルゴリズムは、実際には不要な変数を含む複雑な式を返すことがあります。実際には、これは、測定誤差や欠損データを処理する際に、追加の計算負荷や推定値のバイアスや非効率性の増加につながる可能性があります。因果効果を特定する際に冗長な変数を検出するためのグラフィカルな基準を提示します。また、これらの基準を実装するよく知られた識別可能性アルゴリズムの改良版も提供しています。

Making Better Use of the Crowd: How Crowdsourcing Can Advance Machine Learning Research
群衆をより有効に活用する:クラウドソーシングが機械学習研究をどのように進めるか

This survey provides a comprehensive overview of the landscape of crowdsourcing research, targeted at the machine learning community. We begin with an overview of the ways in which crowdsourcing can be used to advance machine learning research, focusing on four application areas: 1) data generation, 2) evaluation and debugging of models, 3) hybrid intelligence systems that leverage the complementary strengths of humans and machines to expand the capabilities of AI, and 4) crowdsourced behavioral experiments that improve our understanding of how humans interact with machine learning systems and technology more broadly. We next review the extensive literature on the behavior of crowdworkers themselves. This research, which explores the prevalence of dishonesty among crowdworkers, how workers respond to both monetary incentives and intrinsic forms of motivation, and how crowdworkers interact with each other, has immediate implications that we distill into best practices that researchers should follow when using crowdsourcing in their own research. We conclude with a discussion of additional tips and best practices that are crucial to the success of any project that uses crowdsourcing, but rarely mentioned in the literature.

この調査では、機械学習コミュニティを対象としたクラウドソーシング研究の状況を包括的に概観します。まず、クラウドソーシングを使用して機械学習研究を前進させる方法の概要から始め、4つの応用分野に焦点を当てます。1)データ生成、2)モデルの評価とデバッグ、3)人間と機械の相補的な強みを活用してAIの機能を拡張するハイブリッドインテリジェンスシステム、4)人間が機械学習システムやテクノロジーとより広範にやり取りする方法についての理解を深めるクラウドソーシングによる行動実験です。次に、クラウドワーカー自身の行動に関する広範な文献を確認します。クラウドワーカーの不正行為の蔓延、労働者が金銭的インセンティブと内発的動機の両方にどのように反応するか、クラウドワーカーが互いにどのようにやり取りするかを調査したこの調査は、研究者が自分の研究でクラウドソーシングを使用する際に従うべきベストプラクティスに要約された直接的な影響を持っています。最後に、クラウドソーシングを使用するプロジェクトの成功に不可欠でありながら、文献ではほとんど言及されていない追加のヒントとベストプラクティスについて説明します。

Uncovering Causality from Multivariate Hawkes Integrated Cumulants
多変量ホークス積分キュムラントから因果関係を明らかにする

We design a new nonparametric method that allows one to estimate the matrix of integrated kernels of a multivariate Hawkes process. This matrix not only encodes the mutual influences of each node of the process, but also disentangles the causality relationships between them. Our approach is the first that leads to an estimation of this matrix without any parametric modeling and estimation of the kernels themselves. As a consequence, it can give an estimation of causality relationships between nodes (or users), based on their activity timestamps (on a social network for instance), without knowing or estimating the shape of the activities lifetime. For that purpose, we introduce a moment matching method that fits the second-order and the third-order integrated cumulants of the process. A theoretical analysis allows us to prove that this new estimation technique is consistent. Moreover, we show, on numerical experiments, that our approach is indeed very robust with respect to the shape of the kernels and gives appealing results on the MemeTracker database and on financial order book data.

私たちは、多変量ホークス過程の統合カーネルの行列を推定できる新しいノンパラメトリック手法を設計しました。この行列は、プロセスの各ノードの相互影響をエンコードするだけでなく、それらの間の因果関係を解きほぐします。我々のアプローチは、カーネル自体のパラメトリックモデリングや推定をせずにこの行列を推定できる初めてのアプローチです。その結果、アクティビティの存続期間の形状を知ったり推定したりすることなく、アクティビティのタイムスタンプ（たとえばソーシャルネットワーク上）に基づいて、ノード（またはユーザー）間の因果関係を推定できます。そのために、プロセスの2次および3次の統合キュムラントに適合するモーメントマッチング法を導入します。理論分析により、この新しい推定手法が一貫していることを証明できます。さらに、数値実験により、我々のアプローチはカーネルの形状に関して非常に堅牢であり、MemeTrackerデータベースと金融注文書データで魅力的な結果が得られることを示しています。

KELP: a Kernel-based Learning Platform
KELP: カーネルベースの学習プラットフォーム

KELP is a Java framework that enables fast and easy implementation of kernel functions over discrete data, such as strings, trees or graphs and their combination with standard vectorial kernels. Additionally, it provides several kernel-based algorithms, e.g., online and batch kernel machines for classification, regression and clustering, and a Java environment for easy implementation of new algorithms. KELP is a versatile toolkit, very appealing both to experts and practitioners of machine learning and Java language programming, who can find extensive documentation, tutorials and examples of increasing complexity on the accompanying website. Interestingly, KELP can be also used without any knowledge of Java programming through command line tools and JSON/XML interfaces enabling the declaration and instantiation of articulated learning models using simple templates. Finally, the extensive use of modularity and interfaces in KELP enables developers to easily extend it with their own kernels and algorithms.

KELPは、文字列、ツリー、グラフなどの離散データに対するカーネル関数の高速かつ簡単な実装や、標準のベクトルカーネルとの組み合わせを可能にするJavaフレームワークです。さらに、分類、回帰、クラスタリング用のオンラインおよびバッチカーネルマシンなどのカーネルベースのアルゴリズムや、新しいアルゴリズムを簡単に実装できるJava環境もいくつか提供しています。KELPは多目的ツールキットで、機械学習とJava言語プログラミングの専門家と実践者の両方にとって非常に魅力的です。付属のWebサイトには、広範なドキュメント、チュートリアル、複雑さが増す例が掲載されています。興味深いことに、KELPは、コマンドラインツールとJSON/XMLインターフェイスを介してJavaプログラミングの知識がなくても使用でき、簡単なテンプレートを使用して明確な学習モデルを宣言およびインスタンス化できます。最後に、KELPではモジュール性とインターフェイスが広範に使用されているため、開発者は独自のカーネルとアルゴリズムで簡単に拡張できます。

Pycobra: A Python Toolbox for Ensemble Learning and Visualisation
Pycobra: アンサンブル学習と可視化のための Python ツールボックス

We introduce pycobra, a Python library devoted to ensemble learning (regression and classification) and visualisation. Its main assets are the implementation of several ensemble learning algorithms, a flexible and generic interface to compare and blend any existing machine learning algorithm available in Python libraries (as long as a predict method is given), and visualisation tools such as Voronoi tessellations. pycobra is fully scikit-learn compatible and is released under the MIT open-source license. pycobra can be downloaded from the Python Package Index (PyPi) and Machine Learning Open Source Software (MLOSS). The current version (along with Jupyter notebooks, extensive documentation, and continuous integration tests) is available at https://github.com/bhargavvader/pycobra and official documentation website is https://modal.lille.inria.fr/pycobra.

私たちは、アンサンブル学習(回帰と分類)と視覚化に特化したPythonライブラリであるpycobraを紹介します。その主な資産は、いくつかのアンサンブル学習アルゴリズムの実装、Pythonライブラリで利用可能な既存の機械学習アルゴリズムを比較およびブレンドするための柔軟で汎用的なインターフェース(予測方法が指定されている限り)、およびボロノイテッセレーションなどの視覚化ツールです。pycobraはscikit-learnと完全に互換性があり、MITオープンソースライセンスの下でリリースされています。pycobraは、Python Package Index (PyPi)とMachine Learning Open Source Software (MLOSS)からダウンロードできます。現在のバージョン(Jupyter Notebook、広範なドキュメント、継続的インテグレーションテストと共に)はhttps://github.com/bhargavvader/pycobraで入手でき、公式ドキュメントWebサイトはhttps://modal.lille.inria.fr/pycobraです。

Kernel Method for Persistence Diagrams via Kernel Embedding and Weight Factor
カーネル埋め込みと重み係数による永続性図のカーネル手法

Topological data analysis (TDA) is an emerging mathematical concept for characterizing shapes in complicated data. In TDA, persistence diagrams are widely recognized as a useful descriptor of data, distinguishing robust and noisy topological properties. This paper introduces a kernel method for persistence diagrams to develop a statistical framework in TDA. The proposed kernel is stable under perturbation of data, enables one to explicitly control the effect of persistence by a weight function, and allows an efficient and accurate approximate computation. The method is applied to practical data on granular systems, oxide glasses and proteins, showing advantages of our method compared to other relevant methods for persistence diagrams.

トポロジカルデータ解析(TDA)は、複雑なデータの形状を特徴付けるための新しい数学的概念です。TDAでは、永続性図はデータの有用な記述子として広く認識されており、堅牢なトポロジカルプロパティとノイズの多いトポロジカルプロパティを区別します。この論文では、TDAで統計フレームワークを開発するための永続性図のカーネル手法を紹介します。提案されたカーネルは、データの摂動下で安定しており、重み関数によって永続性の影響を明示的に制御でき、効率的で正確な近似計算を可能にします。この手法は、粒状系、酸化物ガラス、タンパク質の実データに適用され、永続性図に対する他の関連手法と比較した本手法の利点を示しています。

Significance-based community detection in weighted networks
重み付きネットワークにおける有意性に基づくコミュニティ検出

Community detection is the process of grouping strongly connected nodes in a network. Many community detection methods for un-weighted networks have a theoretical basis in a null model. Communities discovered by these methods therefore have interpretations in terms of statistical significance. In this paper, we introduce a null for weighted networks called the continuous configuration model. First, we propose a community extraction algorithm for weighted networks which incorporates iterative hypothesis testing under the null. We prove a central limit theorem for edge-weight sums and asymptotic consistency of the algorithm under a weighted stochastic block model. We then incorporate the algorithm in a community detection method called CCME. To benchmark the method, we provide a simulation framework involving the null to plant “background” nodes in weighted networks with communities. We show that the empirical performance of CCME on these simulations is competitive with existing methods, particularly when overlapping communities and background nodes are present. To further validate the method, we present two real-world networks with potential background nodes and analyze them with CCME, yielding results that reveal macro-features of the corresponding systems.

コミュニティ検出は、ネットワーク内の強く接続されたノードをグループ化するプロセスです。重み付けされていないネットワークのコミュニティ検出方法の多くは、ヌルモデルに理論的根拠があります。したがって、これらの方法で検出されたコミュニティは、統計的有意性の観点から解釈できます。この論文では、連続構成モデルと呼ばれる重み付けネットワークのヌルを紹介します。まず、ヌルの下での反復仮説検定を組み込んだ重み付けネットワークのコミュニティ抽出アルゴリズムを提案します。エッジ重みの合計の中心極限定理と、重み付けされた確率的ブロックモデルの下でのアルゴリズムの漸近的一貫性を証明します。次に、CCMEと呼ばれるコミュニティ検出方法にアルゴリズムを組み込みます。この方法をベンチマークするために、重み付けされたネットワークにコミュニティを含む「背景」ノードを配置するためのヌルを含むシミュレーションフレームワークを提供します。これらのシミュレーションでのCCMEの実証的パフォーマンスは、特に重複するコミュニティと背景ノードが存在する場合に、既存の方法と競合することを示します。この方法をさらに検証するために、潜在的な背景ノードを含む2つの実際のネットワークを提示し、CCMEで分析して、対応するシステムのマクロ特性を明らかにする結果を得ます。

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
量子化ニューラルネットワーク:低精度の重みと活性化によるニューラルネットワークの学習

We introduce a method to train Quantized Neural Networks (QNNs) — neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At train-time the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves $51\%$ top-1 accuracy. Moreover, we quantize the parameter gradients to 6-bits as well which enables gradients computation using only bit-wise operation. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved comparable accuracy as their 32-bit counterparts using only 4-bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.

私たちは、量子化ニューラルネットワーク(QNN)、すなわち実行時に極めて低精度(例: 1ビット)の重みとアクティベーションを持つニューラルネットワークをトレーニングする方法を紹介します。トレーニング時には、量子化された重みとアクティベーションを使用してパラメーターの勾配を計算します。フォワードパス中、QNNはメモリサイズとアクセスを大幅に削減し、ほとんどの算術演算をビット単位の演算に置き換えます。その結果、消費電力が大幅に削減されると期待されます。MNIST、CIFAR-10、SVHN、およびImageNetデータセットでQNNをトレーニングしました。結果として得られたQNNは、32ビットの対応モデルに匹敵する予測精度を実現します。たとえば、1ビットの重みと2ビットのアクティベーションを持つAlexNetの量子化バージョンは、$51\%$のトップ1精度を実現します。さらに、パラメーターの勾配も6ビットに量子化することで、ビット単位の演算のみを使用して勾配を計算できるようになります。量子化リカレントニューラルネットワークは、Penn Treebankデータセットでテストされ、わずか4ビットを使用して32ビットのネットワークと同等の精度を達成しました。最後に、バイナリマトリックス乗算GPUカーネルをプログラムしました。これにより、分類精度を低下させることなく、最適化されていないGPUカーネルよりも7倍高速にMNIST QNNを実行できます。QNNコードはオンラインで入手できます。
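The claim that 1-bit weights and activations let bit-wise operations replace arithmetic can be made concrete with a small example. The sketch below is an illustration under stated assumptions, not the authors' GPU kernel: it binarizes two vectors to +1/-1 with the sign function, packs the signs into integers, and recovers their dot product from an XNOR followed by a population count. The helper names `binarize`, `pack_bits` and `binary_dot` are hypothetical.

```python
def binarize(values):
    """Deterministic sign binarization: map each value to +1 or -1."""
    return [1 if v >= 0 else -1 for v in values]

def pack_bits(signs):
    """Pack a +1/-1 vector into an integer; bit i is set when signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word

def binary_dot(word_a, word_b, n):
    """Dot product of two packed +1/-1 vectors of length n.

    XNOR marks positions where the signs agree. Each agreement contributes +1
    and each disagreement -1, so dot = agreements - (n - agreements).
    """
    mask = (1 << n) - 1
    agreements = bin(~(word_a ^ word_b) & mask).count("1")
    return 2 * agreements - n
```

For two binarized vectors `a` and `b` of length `n`, `binary_dot(pack_bits(a), pack_bits(b), n)` equals `sum(x * y for x, y in zip(a, b))`, with the multiply-accumulate loop replaced by one XNOR and one popcount per machine word.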

Submatrix localization via message passing
メッセージパッシングによるサブマトリックスのローカライゼーション

The principal submatrix localization problem deals with recovering a $K\times K$ principal submatrix of elevated mean $\mu$ in a large $n\times n$ symmetric matrix subject to additive standard Gaussian noise, or more generally, mean zero, variance one, subgaussian noise. This problem serves as a prototypical example for community detection, in which the community corresponds to the support of the submatrix. The main result of this paper is that in the regime $\Omega(\sqrt{n}) \leq K \leq o(n)$, the support of the submatrix can be weakly recovered (with $o(K)$ misclassification errors on average) by an optimized message passing algorithm if $\lambda = \mu^2K^2/n$, the signal-to-noise ratio, exceeds $1/e$. This extends a result by Deshpande and Montanari previously obtained for $K=\Theta(\sqrt{n})$ and $\mu=\Theta(1).$ In addition, the algorithm can be combined with a voting procedure to achieve the information-theoretic limit of exact recovery with sharp constants for all $K \geq \frac{n}{\log n} (\frac{1}{8e} + o(1))$. The total running time of the algorithm is $O(n^2\log n)$. Another version of the submatrix localization problem, known as noisy biclustering, aims to recover a $K_1\times K_2$ submatrix of elevated mean $\mu$ in a large $n_1\times n_2$ Gaussian matrix. The optimized message passing algorithm and its analysis are adapted to the bicluster problem assuming $\Omega(\sqrt{n_i}) \leq K_i \leq o(n_i)$ and $K_1\asymp K_2.$ A sharp information-theoretic condition for the weak recovery of both clusters is also identified.

主サブマトリックスの位置特定問題は、加法標準ガウスノイズ、またはより一般的には平均0、分散1、サブガウスノイズの影響を受ける大規模な$n\times n$対称マトリックス内で、平均$\mu$が上昇した$K\times K$主サブマトリックスを復元する問題を扱います。この問題は、コミュニティ検出の典型的な例であり、コミュニティはサブマトリックスのサポートに対応します。この論文の主な結果は、$\Omega(\sqrt{n}) \leq K \leq o(n)$の領域で、信号対雑音比$\lambda = \mu^2K^2/n$が$1/e$を超える場合、最適化されたメッセージパッシングアルゴリズムによってサブマトリックスのサポートを弱く復元できる(平均で$o(K)$の誤分類エラーを使用)ことです。これは、DeshpandeとMontanariが以前に$K=\Theta(\sqrt{n})$および$\mu=\Theta(1)$に対して得た結果を拡張したものです。さらに、このアルゴリズムを投票手順と組み合わせることで、すべての$K \geq \frac{n}{\log n} (\frac{1}{8e} + o(1))$に対して、鋭い定数を使用した正確な回復の情報理論的限界を達成できます。アルゴリズムの合計実行時間は$O(n^2\log n)$です。ノイズバイクラスタリングと呼ばれるサブマトリックスローカリゼーション問題の別のバージョンは、大きな$n_1\times n_2$ガウス行列内の上昇した平均$\mu$の$K_1\times K_2$サブマトリックスを回復することを目的としています。最適化されたメッセージパッシングアルゴリズムとその分析は、$\Omega(\sqrt{n_i}) \leq K_i \leq o(n_i)$および$K_1\asymp K_2$を仮定して、バイクラスター問題に適応されます。両方のクラスターの弱い回復のための明確な情報理論的条件も特定されます。

Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization
ハイパーバンド:ハイパーパラメータ最適化への新しいバンディットベースのアプローチ

Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian optimization to adaptively select configurations, we focus on speeding up random search through adaptive resource allocation and early-stopping. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem where a predefined resource like iterations, data samples, or features is allocated to randomly sampled configurations. We introduce a novel algorithm, Hyperband, for this framework and analyze its theoretical properties, providing several desirable guarantees. Furthermore, we compare Hyperband with popular Bayesian optimization methods on a suite of hyperparameter optimization problems. We observe that Hyperband can provide over an order-of-magnitude speedup over our competitor set on a variety of deep-learning and kernel-based learning problems.

機械学習アルゴリズムのパフォーマンスは、適切なハイパーパラメータのセットを特定することに大きく依存します。最近のアプローチでは、ベイズ最適化を使用して構成を適応的に選択しますが、私たちは適応型リソース割り当てと早期停止によるランダム検索の高速化に重点を置いています。ハイパーパラメータ最適化は、反復、データサンプル、特徴などの事前定義されたリソースがランダムにサンプリングされた構成に割り当てられる、純粋探索型の非確率的無限腕バンディット問題として定式化します。このフレームワークに新しいアルゴリズムHyperbandを導入し、その理論的特性を分析して、いくつかの望ましい保証を提供します。さらに、一連のハイパーパラメータ最適化問題で、一般的なベイズ最適化手法とHyperbandを比較します。Hyperbandは、さまざまな深層学習およびカーネルベースの学習問題において、比較対象の手法群に対して1桁以上の高速化を実現できることが確認されました。
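The combination of random sampling, adaptive resource allocation and early stopping described above can be sketched in a few lines. This is a simplified illustration, not the authors' code: it follows the commonly presented bracket structure, in which each bracket runs successive halving with a different trade-off between the number of configurations and the resource per configuration. The names `hyperband`, `toy_sample` and `toy_loss`, the defaults `max_resource=81` and `eta=3`, and the toy objective are assumptions.

```python
import math
import random

def hyperband(sample_config, evaluate, max_resource=81, eta=3, seed=0):
    """Return (best_loss, best_config) over all brackets of successive halving."""
    rng = random.Random(seed)
    # Largest s with eta**s <= max_resource, computed with integers to avoid log rounding.
    s_max = 0
    while eta ** (s_max + 1) <= max_resource:
        s_max += 1
    best = (math.inf, None)
    for s in range(s_max, -1, -1):
        # Configurations to sample in this bracket (integer ceiling) and their starting resource.
        n = -(-(s_max + 1) * eta ** s // (s + 1))
        r = max_resource / eta ** s
        configs = [sample_config(rng) for _ in range(n)]
        for i in range(s + 1):
            n_i = n // eta ** i
            r_i = r * eta ** i
            scored = sorted(((evaluate(c, r_i), c) for c in configs),
                            key=lambda pair: pair[0])
            best = min(best, scored[0], key=lambda pair: pair[0])
            # Early stopping: keep only the best 1/eta fraction for the next round.
            configs = [c for _, c in scored[: n_i // eta]]
    return best

def toy_sample(rng):
    """Hypothetical one-dimensional configuration, e.g. a learning rate in [0, 1)."""
    return rng.random()

def toy_loss(config, resource):
    """Hypothetical loss: best at config = 0.3, improving as more resource is spent."""
    return (config - 0.3) ** 2 + 1.0 / resource
```

Calling `hyperband(toy_sample, toy_loss)` evaluates many configurations cheaply and spends the full resource only on the few that survive each round, which is where the speedup over plain random search comes from.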

On Faster Convergence of Cyclic Block Coordinate Descent-type Methods for Strongly Convex Minimization
強凸最小化のための巡回ブロック座標降下型法の高速収束について

The cyclic block coordinate descent-type (CBCD-type) methods, which perform iterative updates for a few coordinates (a block) simultaneously throughout the procedure, have shown remarkable computational performance for solving strongly convex minimization problems. Typical applications include many popular statistical machine learning methods such as elastic-net regression, ridge penalized logistic regression, and sparse additive regression. Existing optimization literature has shown that for strongly convex minimization, the CBCD-type methods attain iteration complexity of $\mathcal{O}(p\log(1/\epsilon))$, where $\epsilon$ is a pre-specified accuracy of the objective value, and $p$ is the number of blocks. However, such iteration complexity explicitly depends on $p$, and therefore is at least $p$ times worse than the complexity $\mathcal{O}(\log(1/\epsilon))$ of gradient descent (GD) methods. To bridge this theoretical gap, we propose an improved convergence analysis for the CBCD-type methods. In particular, we first show that for a family of quadratic minimization problems, the iteration complexity $\mathcal{O}(\log^2(p)\cdot\log(1/\epsilon))$ of the CBCD-type methods matches that of the GD methods in terms of dependency on $p$, up to a $\log^2 p$ factor. Thus our complexity bounds are sharper than the existing bounds by at least a factor of $p/\log^2(p)$. We also provide a lower bound to confirm that our improved complexity bounds are tight (up to a $\log^2 (p)$ factor), under the assumption that the largest and smallest eigenvalues of the Hessian matrix do not scale with $p$. Finally, we generalize our analysis to other strongly convex minimization problems beyond quadratic ones.

巡回ブロック座標降下型(CBCD型)法は、手順全体を通じて同時にいくつかの座標(ブロック)の反復更新を実行し、強凸最小化問題を解く際に優れた計算パフォーマンスを示しています。一般的なアプリケーションには、弾性ネット回帰、リッジペナルティ付きロジスティック回帰、スパース加法回帰など、多くの一般的な統計的機械学習法が含まれます。既存の最適化の文献によると、強凸最小化の場合、CBCD型法は反復複雑度が$\mathcal{O}(p\log(1/\epsilon))$に達します。ここで、$\epsilon$は目的値の事前指定された精度、$p$はブロック数です。ただし、このような反復複雑度は$p$に明示的に依存するため、勾配降下法(GD)の複雑度$\mathcal{O}(\log(1/\epsilon))$よりも少なくとも$p$倍悪くなります。この理論的なギャップを埋めるために、CBCD型法の収束解析を改良して提案します。特に、まず、一連の二次最小化問題に対して、CBCD型法の反復計算量$\mathcal{O}(\log^2(p)\cdot\log(1/\epsilon))$は、$p$への依存性という点で、$\log^2 p$係数までGD法の計算量と一致することを示す。したがって、我々の計算量境界は、既存の境界よりも少なくとも$p/\log^2(p)$倍だけ厳しいものとなります。また、ヘッセ行列の最大および最小の固有値が$p$に比例しないという仮定の下で、改良された計算量境界が厳密である($\log^2 (p)$係数まで)ことを確認するための下限も提供します。最後に、二次問題以外の強凸最小化問題に解析を一般化します。
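For the quadratic case discussed above, one cyclic sweep is easy to write down. The sketch below is a simplification with blocks of size one: it minimizes 0.5 * x^T A x - b^T x for a symmetric positive definite matrix A by exactly minimizing over one coordinate at a time in a fixed cyclic order. The function name `cyclic_coordinate_descent` and the fixed number of sweeps are assumptions for illustration.

```python
def cyclic_coordinate_descent(A, b, sweeps=100):
    """Minimize 0.5 * x^T A x - b^T x by cyclic exact coordinate minimization.

    A is a symmetric positive definite matrix given as a list of rows.
    Setting the partial derivative in x_j to zero gives
        x_j = (b_j - sum_{k != j} A[j][k] * x_k) / A[j][j].
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        for j in range(n):  # fixed cyclic order over the coordinates
            rest = sum(A[j][k] * x[k] for k in range(n) if k != j)
            x[j] = (b[j] - rest) / A[j][j]
    return x
```

Each coordinate update reuses the most recent values of the other coordinates, so one sweep costs about as much as one full gradient evaluation. The iteration complexity bounds in the abstract count such sweeps.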

Beyond the Hazard Rate: More Perturbation Algorithms for Adversarial Multi-armed Bandits
ハザード率を超えて:敵対的多腕バンディットのためのさらなる摂動アルゴリズム

Recent work on follow the perturbed leader (FTPL) algorithms for the adversarial multi-armed bandit problem has highlighted the role of the hazard rate of the distribution generating the perturbations. Assuming that the hazard rate is bounded, it is possible to provide regret analyses for a variety of FTPL algorithms for the multi-armed bandit problem. This paper pushes the inquiry into regret bounds for FTPL algorithms beyond the bounded hazard rate condition. There are good reasons to do so: natural distributions such as the uniform and Gaussian violate the condition. We give regret bounds for both bounded support and unbounded support distributions without assuming the hazard rate condition. We also disprove a conjecture that the Gaussian distribution cannot lead to a low-regret algorithm. In fact, it turns out that it leads to near optimal regret, up to logarithmic factors. A key ingredient in our approach is the introduction of a new notion called the generalized hazard rate.

敵対的多腕バンディット問題に対するFollow the perturbed Leader(FTPL)アルゴリズムに関する最近の研究は、摂動を生成する分布のハザード率の役割を強調しています。ハザード率が限定的であると仮定すると、マルチアームバンディット問題に対するさまざまなFTPLアルゴリズムの後悔分析を提供することができます。この論文では、FTPLアルゴリズムの後悔限界に関する調査を、制限付きハザード率条件を超えて推進します。そうする正当な理由があります:一様分布やガウス分布などの自然分布が条件に違反します。限定サポート分布と無制限サポート分布の両方について、ハザード率条件を想定せずに後悔の範囲を示します。また、ガウス分布が低後悔アルゴリズムにつながることはないという推測も反証します。実際、それは対数係数まで、ほぼ最適な後悔につながることが判明しました。私たちのアプローチの重要な要素は、一般化ハザード率と呼ばれる新しい概念の導入です。

Divide-and-Conquer for Debiased l_1-norm Support Vector Machine in Ultra-high Dimensions
超高次元における偏りのない l_1 ノルムサポートベクターマシンの分割統治法

$1$-norm support vector machine (SVM) generally has competitive performance compared to standard $2$-norm support vector machine in classification problems, with the advantage of automatically selecting relevant features. We propose a divide-and-conquer approach in the large sample size and high-dimensional setting by splitting the data set across multiple machines, and then averaging the debiased estimators. Extension of existing theoretical studies to SVM is challenging in estimation of the inverse Hessian matrix that requires approximating the Dirac delta function via smoothing. We show that under appropriate conditions the aggregated estimator can obtain the same convergence rate as the central estimator utilizing all observations.

$1$-normサポートベクターマシン(SVM)は、分類問題において、標準の$2$-normサポートベクターマシンと比較して、一般的に競争力のあるパフォーマンスを発揮し、関連する特徴を自動的に選択する利点があります。大規模なサンプルサイズと高次元の設定で、データセットを複数のマシンに分割し、バイアス補正済みの推定量を平均化する分割統治法を提案します。既存の理論研究をSVMに拡張することは、平滑化によってディラックのデルタ関数を近似する必要がある逆ヘッセ行列の推定において困難です。適切な条件下では、集約推定量は、すべての観測値を利用する中央推定量と同じ収束率を達成できることを示します。

To Tune or Not to Tune the Number of Trees in Random Forest
ランダムフォレストのツリーの数を調整するか、調整しないか

The number of trees $T$ in the random forest (RF) algorithm for supervised learning has to be set by the user. It is unclear whether $T$ should simply be set to the largest computationally manageable value or whether a smaller $T$ may be sufficient or in some cases even better. While the principle underlying bagging is that more trees are better, in practice the classification error rate sometimes reaches a minimum before increasing again for increasing number of trees. The goal of this paper is four-fold: (i) providing theoretical results showing that the expected error rate may be a non-monotonous function of the number of trees and explaining under which circumstances this happens; (ii) providing theoretical results showing that such non-monotonous patterns cannot be observed for other performance measures such as the Brier score and the logarithmic loss (for classification) and the mean squared error (for regression); (iii) illustrating the extent of the problem through an application to a large number (n = 306) of datasets from the public database OpenML; (iv) finally arguing in favor of setting $T$ to a computationally feasible large number as long as classical error measures based on average loss are considered.

教師あり学習のランダムフォレスト(RF)アルゴリズムにおけるツリーの数$T$は、ユーザーが設定する必要があります。$T$を単に計算上処理可能な最大の値に設定すればよいのか、それともより小さな$T$で十分なのか、あるいは場合によってはよりよいのかは不明です。バギングの根底にある原則はツリーの数が多いほどよいというものですが、実際には分類エラー率はツリーの数が増えるにつれて再び増加する前に最小値に達することがあります。この論文の目標は4つあります。(i)期待エラー率はツリーの数の非単調な関数である可能性があることを示す理論的結果を提供し、どのような状況でこれが起こるかを説明すること。(ii)このような非単調なパターンは、ブライアースコアや対数損失(分類の場合)、平均二乗誤差(回帰の場合)などの他のパフォーマンス指標では観察できないことを示す理論的結果を提供すること。(iii)パブリックデータベースOpenMLの多数の(n = 306)データセットへの適用を通じて問題の範囲を示すこと。(iv)最後に、平均損失に基づく古典的な誤差測定が考慮される限り、$T$を計算上実行可能な大きな数に設定することを支持します。
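A small calculation shows how the expected error rate can first fall and then rise as trees are added. The sketch below assumes that, for a fixed observation, each tree votes for the correct class independently with probability p, so the number of correct votes among T trees is binomial, and that ties are broken at random. The function names and the two example probabilities (0.8 and 0.48) are illustrative assumptions, not values from the paper.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability mass, computed in log space to avoid underflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)

def vote_error(num_trees, p_correct):
    """Probability that the majority vote is wrong, with ties broken at random."""
    error = 0.0
    for k in range(num_trees + 1):
        if 2 * k < num_trees:
            error += binom_pmf(k, num_trees, p_correct)
        elif 2 * k == num_trees:
            error += 0.5 * binom_pmf(k, num_trees, p_correct)
    return error

def expected_error(num_trees, p_values):
    """Average majority-vote error over observations with the given p values."""
    return sum(vote_error(num_trees, p) for p in p_values) / len(p_values)
```

With `p_values = [0.8, 0.48]`, the observation with p = 0.8 is classified correctly almost surely after a few trees, while the error on the observation with p = 0.48 creeps slowly toward 1. Comparing `expected_error(T, p_values)` for T = 1, 11 and 1001 therefore gives a curve that dips and then rises, which is the kind of non-monotonic pattern described in (i).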

Mode-Seeking Clustering and Density Ridge Estimation via Direct Estimation of Density-Derivative-Ratios
密度-微分比の直接推定によるモードシーキングクラスタリングと密度リッジ推定

Modes and ridges of the probability density function behind observed data are useful geometric features. Mode-seeking clustering assigns cluster labels by associating data samples with the nearest modes, and estimation of density ridges enables us to find lower-dimensional structures hidden in data. A key technical challenge both in mode-seeking clustering and density ridge estimation is accurate estimation of the ratios of the first- and second-order density derivatives to the density. A naive approach takes a three-step approach of first estimating the data density, then computing its derivatives, and finally taking their ratios. However, this three-step approach can be unreliable because a good density estimator does not necessarily mean a good density derivative estimator, and division by the estimated density could significantly magnify the estimation error. To cope with these problems, we propose a novel estimator for the density-derivative-ratios. The proposed estimator does not involve density estimation, but rather directly approximates the ratios of density derivatives of any order. Moreover, we establish a convergence rate of the proposed estimator. Based on the proposed estimator, novel methods both for mode-seeking clustering and density ridge estimation are developed, and the respective convergence rates to the mode and ridge of the underlying density are also established. Finally, we experimentally demonstrate that the developed methods significantly outperform existing methods, particularly for relatively high-dimensional data.

観測データの背後にある確率密度関数のモードとリッジは、有用な幾何学的特徴です。モード探索クラスタリングは、データサンプルを最も近いモードに関連付けることでクラスターラベルを割り当て、密度リッジの推定により、データに隠された低次元構造を見つけることができます。モード探索クラスタリングと密度リッジ推定の両方における重要な技術的課題は、1次および2次密度導関数の密度に対する比率の正確な推定です。単純なアプローチでは、最初にデータ密度を推定し、次にその導関数を計算し、最後にそれらの比率を取得するという3段階のアプローチを採用します。ただし、この3段階のアプローチは信頼性が低い場合があります。これは、優れた密度推定量が必ずしも優れた密度導関数推定量を意味するわけではなく、推定された密度で割ると推定誤差が大幅に拡大する可能性があるためです。これらの問題に対処するために、密度導関数比の新しい推定量を提案します。提案された推定量は密度推定を伴わず、任意の次数の密度導関数の比率を直接近似します。さらに、提案された推定量の収束率を確立します。提案された推定量に基づいて、モード探索クラスタリングと密度リッジ推定の両方の新しい方法が開発され、基礎となる密度のモードとリッジへのそれぞれの収束率も確立されています。最後に、開発された方法が、特に比較的高次元のデータに対して、既存の方法を大幅に上回ることを実験的に実証します。

Efficient Learning with a Family of Nonconvex Regularizers by Redistributing Nonconvexity
非凸性再分配による非凸正則化器の族による効率的な学習

The use of convex regularizers allows for easy optimization, though they often produce biased estimation and inferior prediction performance. Recently, nonconvex regularizers have attracted a lot of attention and outperformed convex ones. However, the resultant optimization problem is much harder. In this paper, a popular subclass of $\ell_1$-based nonconvex sparsity-inducing and low-rank regularizers is considered. This includes nonconvex variants of lasso, sparse group lasso, tree-structured lasso, nuclear norm and total variation regularizers. We propose to move the nonconvexity from the regularizer to the loss. The nonconvex regularizer is then transformed to a familiar convex one, while the resultant loss function can still be guaranteed to be smooth. Learning with the convexified regularizer can be performed by existing efficient algorithms originally designed for convex regularizers (such as the proximal algorithm, Frank-Wolfe algorithm, alternating direction method of multipliers and stochastic gradient descent). This is further extended to consider cases where the convexified regularizer does not have a closed-form proximal step, and when the loss function is nonconvex nonsmooth. Extensive experiments on a variety of machine learning application scenarios show that optimizing the transformed problem is much faster than running the state-of-the-art on the original problem.

凸正則化を使用すると最適化が容易になりますが、多くの場合、偏った推定と劣った予測性能が生成されます。最近、非凸正則化が多くの注目を集め、凸正則化を上回りました。ただし、結果として得られる最適化問題ははるかに困難です。この論文では、人気のあるサブクラスの$\ell_1$ベースの非凸スパース誘導および低ランク正則化を検討します。これには、Lasso、スパースグループLasso、ツリー構造Lasso、核ノルム、および全変動正則化の非凸バリアントが含まれます。私たちは、非凸性を正則化から損失に移動することを提案します。次に、非凸正則化は、結果として得られる損失関数が滑らかであることを保証しながら、一般的な凸正則化に変換されます。凸化された正則化による学習は、もともと凸正則化用に設計された既存の効率的なアルゴリズム(近似アルゴリズム、Frank-Wolfeアルゴリズム、交互方向乗数法、確率的勾配降下法など)によって実行できます。これはさらに拡張され、凸型正則化子に閉じた形式の近似ステップがない場合や、損失関数が非凸で滑らかでない場合を考慮することができます。さまざまな機械学習アプリケーションシナリオでの広範な実験により、変換された問題を最適化すると、元の問題で最先端の方法を実行するよりもはるかに高速であることが示されています。
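The idea of moving the nonconvexity from the regularizer to the loss can be written out as a short worked decomposition. This is only a sketch under assumptions that the abstract does not state: the regularizer is taken to have the separable form $\sum_i \kappa(|x_i|)$ with $\kappa$ concave and non-decreasing, $\kappa(0)=0$, and a finite slope $\kappa_0 = \kappa'(0)$; the symbols $f$, $\lambda$, $\kappa$, $\kappa_0$, $\beta$ and $\theta$ are illustrative names.

```latex
% Assumed form: r(x) = \sum_i \kappa(|x_i|), \kappa concave, \kappa(0) = 0,
% \kappa_0 := \kappa'(0) finite, and f a smooth loss.
\min_x \; f(x) + \lambda \sum_i \kappa(|x_i|)
  \;=\; \min_x \;
  \underbrace{f(x) + \lambda \sum_i \bigl[\kappa(|x_i|) - \kappa_0 |x_i|\bigr]}_{\text{augmented loss (smooth, possibly nonconvex)}}
  \;+\;
  \underbrace{\lambda\,\kappa_0\,\|x\|_1}_{\text{convex regularizer}}

% Example (log-sum penalty): \kappa(\alpha) = \beta \log(1 + \alpha/\theta),
% so \kappa_0 = \beta/\theta, and the term moved into the loss is
\kappa(|x_i|) - \kappa_0 |x_i|
  = \beta \log\!\Bigl(1 + \frac{|x_i|}{\theta}\Bigr) - \frac{\beta}{\theta}\,|x_i| ,
\qquad
\frac{d}{d\alpha}\Bigl[\beta \log\bigl(1 + \tfrac{\alpha}{\theta}\bigr) - \tfrac{\beta}{\theta}\,\alpha\Bigr]_{\alpha = 0} = 0 .
```

Because the subtracted linear term cancels the slope of $\kappa$ at the origin, the kink of $|x_i|$ at zero disappears from the term that joins the loss, and what remains as a regularizer is the ordinary $\ell_1$ norm, for which standard proximal solvers apply.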

On b-bit Min-wise Hashing for Large-scale Regression and Classification with Sparse Data
スパースデータを用いた大規模回帰・分類のためのbビットmin-wiseハッシュについて

Large-scale regression problems where both the number of variables, $p$, and the number of observations, $n$, may be large and in the order of millions or more, are becoming increasingly common. Typically the data are sparse: only a fraction of a percent of the entries in the design matrix are non-zero. Nevertheless, often the only computationally feasible approach is to perform dimension reduction to obtain a new design matrix with far fewer columns and then work with this compressed data. $b$-bit min-wise hashing (Li and König, 2011; Li et al., 2011) is a promising dimension reduction scheme for sparse matrices which produces a set of random features such that regression on the resulting design matrix approximates a kernel regression with the resemblance kernel. In this work, we derive bounds on the prediction error of such regressions. For both linear and logistic models, we show that the average prediction error vanishes asymptotically as long as $q \|\boldsymbol{\beta}^*\|_2^2 /n \rightarrow 0$, where $q$ is the average number of non-zero entries in each row of the design matrix and $\boldsymbol{\beta}^*$ is the coefficient of the linear predictor. We also show that ordinary least squares or ridge regression applied to the reduced data can in fact allow us to fit more flexible models. We obtain non-asymptotic prediction error bounds for interaction models and for models where an unknown row normalisation must be applied in order for the signal to be linear in the predictors.

変数の数$p$と観測値の数$n$の両方が大きく、数百万以上のオーダーになる可能性がある大規模な回帰問題が、ますます一般的になっています。通常、データはスパースです。つまり、計画行列のエントリのうち非ゼロなのは1パーセントにも満たないごく一部だけです。それでも、多くの場合、計算上実行可能な唯一のアプローチは、次元削減を実行して列のはるかに少ない新しい計画行列を取得し、この圧縮されたデータで作業することです。$b$ビットmin-wiseハッシュ(Li and König、2011年、Liら、2011年)は、スパース行列の有望な次元削減スキームであり、ランダムな特徴のセットを生成するため、結果として得られる計画行列の回帰は、類似カーネルを持つカーネル回帰に近似します。この研究では、このような回帰の予測誤差の上界を導出します。線形モデルとロジスティックモデルの両方について、平均予測誤差は$q \|\boldsymbol{\beta}^*\|_2^2 /n \rightarrow 0$である限り漸近的にゼロになることを示します。ここで、$q$は計画行列の各行の非ゼロ項目の平均数、$\boldsymbol{\beta}^*$は線形予測子の係数です。また、削減されたデータに通常の最小二乗法またはリッジ回帰を適用することで、より柔軟なモデルを適合できることも示します。相互作用モデルと、信号が予測子で線形になるために未知の行正規化を適用する必要があるモデルについて、非漸近的な予測誤差の上界を取得します。
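The dimension reduction step described above can be illustrated with a minimal sketch of one-hot encoded $b$-bit min-wise hashing. It assumes each row is given as the set of its non-zero column indices and uses explicit random permutations, which is only practical for small $p$; the function name `bbit_minwise_features` and its parameters are hypothetical, and the paper itself may analyse a different variant of the scheme.

```python
import random

def bbit_minwise_features(rows, p, num_perm=4, b=2, seed=0):
    """Map each sparse row (a set of non-zero column indices in range(p))
    to a 0/1 feature vector of length num_perm * 2**b."""
    rng = random.Random(seed)
    # One explicit random permutation of the columns per hash (demo only;
    # this is an assumption made for readability, not an efficient choice).
    perms = []
    for _ in range(num_perm):
        perm = list(range(p))
        rng.shuffle(perm)
        perms.append(perm)
    width = 2 ** b
    features = []
    for row in rows:
        vec = [0] * (num_perm * width)
        for j, perm in enumerate(perms):
            if not row:
                continue  # an all-zero row keeps an all-zero block
            min_value = min(perm[i] for i in row)   # min-wise hash
            low_bits = min_value & (width - 1)      # keep the lowest b bits
            vec[j * width + low_bits] = 1           # one-hot encode them
        features.append(vec)
    return features

rows = [{0, 3, 7}, {0, 3, 8}, {1, 2}]
X = bbit_minwise_features(rows, p=10)
```

Each non-empty row ends up with exactly `num_perm` ones, so a linear or ridge regression on `X` works with a matrix whose width no longer depends on `p`.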

Community Detection and Stochastic Block Models: Recent Developments
コミュニティ検出と確率的ブロックモデル:最近の開発

The stochastic block model (SBM) is a random graph model with planted clusters. It is widely employed as a canonical model to study clustering and community detection, and provides generally a fertile ground to study the statistical and computational tradeoffs that arise in network and data sciences. This note surveys the recent developments that establish the fundamental limits for community detection in the SBM, both with respect to information-theoretic and computational thresholds, and for various recovery requirements such as exact, partial and weak recovery (a.k.a., detection). The main results discussed are the phase transitions for exact recovery at the Chernoff-Hellinger threshold, the phase transition for weak recovery at the Kesten-Stigum threshold, the optimal distortion-SNR tradeoff for partial recovery, the learning of the SBM parameters and the gap between information-theoretic and computational thresholds. The note also covers some of the algorithms developed in the quest of achieving the limits, in particular two-round algorithms via graph-splitting, semi-definite programming, linearized belief propagation, classical and nonbacktracking spectral methods. A few open problems are also discussed.

確率的ブロックモデル(SBM)は、クラスターが埋め込まれたランダムグラフモデルです。これは、クラスタリングとコミュニティ検出を研究するための標準モデルとして広く採用されており、ネットワークおよびデータサイエンスで生じる統計的および計算的トレードオフを研究するための肥沃な土壌を提供します。このノートでは、情報理論的および計算的しきい値の両方に関して、また、厳密回復、部分回復、弱回復(検出とも呼ばれる)などのさまざまな回復要件について、SBMでのコミュニティ検出の基本的な限界を確立する最近の進展を概説します。議論される主な結果は、Chernoff-Hellingerしきい値での厳密回復の相転移、Kesten-Stigumしきい値での弱回復の相転移、部分回復の最適な歪みとSNRのトレードオフ、SBMパラメーターの学習、および情報理論的しきい値と計算的しきい値のギャップです。このノートでは、限界を達成するために開発されたアルゴリズム、特にグラフ分割による2ラウンドアルゴリズム、半正定値計画法、線形化ビリーフプロパゲーション、古典的および非バックトラッキングスペクトル法についても説明します。また、いくつかの未解決の問題についても説明します。

The DFS Fused Lasso: Linear-Time Denoising over General Graphs
DFS Fused Lasso: 一般的なグラフ上の線形時間ノイズ除去

The fused lasso, also known as (anisotropic) total variation denoising, is widely used for piecewise constant signal estimation with respect to a given undirected graph. The fused lasso estimate is highly nontrivial to compute when the underlying graph is large and has an arbitrary structure. But for a special graph structure, namely, the chain graph, the fused lasso (or simply, the 1d fused lasso) can be computed in linear time. In this paper, we revisit a result recently established in the online classification literature (Herbster et al., 2009; Cesa-Bianchi et al., 2013) and show that it has important implications for signal denoising on graphs. The result can be translated to our setting as follows. Given a general graph, if we run the standard depth-first search (DFS) traversal algorithm, then the total variation of any signal over the chain graph induced by DFS is no more than twice its total variation over the original graph. This result leads to several interesting theoretical and computational conclusions. Letting $m$ and $n$ denote the number of edges and nodes, respectively, of the graph in consideration, it implies that for an underlying signal with total variation $t$ over the graph, the fused lasso (properly tuned) achieves a mean squared error rate of $t^{2/3} n^{-2/3}$. Moreover, precisely the same mean squared error rate is achieved by running the 1d fused lasso on the DFS-induced chain graph. Importantly, the latter estimator is simple and computationally cheap, requiring $O(m)$ operations to construct the DFS-induced chain and $O(n)$ operations to compute the 1d fused lasso solution over this chain. Further, for trees that have bounded maximum degree, the error rate of $t^{2/3} n^{-2/3}$ cannot be improved, in the sense that it is the minimax rate for signals that have total variation $t$ over the tree. Finally, several related results also hold; for example, the analogous result holds for a roughness measure defined by the $\ell_0$ norm of differences across edges in place of the total variation metric.

融合Lassoは(異方性)総変動ノイズ除去とも呼ばれ、与えられた無向グラフに関する区分的に一定の信号推定に広く使用されています。基礎となるグラフが大きく、任意の構造を持つ場合、融合Lasso推定の計算は非常に困難です。しかし、特殊なグラフ構造、つまりチェーングラフの場合、融合Lasso (または単に1d融合Lasso)は線形時間で計算できます。この論文では、オンライン分類の文献(Herbster他、2009年、Cesa-Bianchi他、2013年)で最近確立された結果を再検討し、それがグラフ上の信号ノイズ除去に重要な意味を持つことを示します。結果は、次のように私たちの設定に翻訳できます。一般的なグラフが与えられた場合、標準的な深さ優先探索(DFS)トラバーサルアルゴリズムを実行すると、DFSによって誘導されるチェーングラフ上の任意の信号の総変動は、元のグラフ上の総変動の2倍以下になります。この結果から、いくつかの興味深い理論的および計算上の結論が導かれます。$m$と$n$をそれぞれ対象グラフのエッジとノードの数とすると、グラフ全体の総変動$t$を持つ基になる信号に対して、融合Lasso (適切に調整)は平均二乗誤差率$t^{2/3} n^{-2/3}$を達成することを意味します。さらに、1次元融合LassoをDFS誘導チェーングラフで実行することによっても、まったく同じ平均二乗誤差率が達成されます。重要なのは、後者の推定器は単純で計算コストが低く、DFS誘導チェーンの構築に$O(m)$操作、このチェーン上の1次元融合Lassoソリューションの計算に$O(n)$操作しか必要としないことです。さらに、最大次数が制限されているツリーの場合、ツリー全体の総変動$t$を持つ信号のミニマックス率という意味で、エラー率$t^{2/3} n^{-2/3}$は改善できません。最後に、いくつかの関連する結果も成り立ちます。たとえば、総変動メトリックの代わりに、エッジ間の差の$\ell_0$ノルムによって定義される粗さの尺度に対しても同様の結果が成り立ちます。
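The key inequality in this abstract (total variation over the DFS-induced chain is at most twice the total variation over the original graph) can be checked numerically on a toy example. The sketch below assumes a connected undirected graph given as an adjacency dictionary; the helper names `dfs_order` and `total_variation`, the example graph, and the signal are all made up for illustration, and the 1d fused lasso solver itself is not shown.

```python
def dfs_order(adj, root=0):
    """Return the nodes in the order in which depth-first search first visits them."""
    seen, order, stack = set(), [], [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push neighbours in reverse so lower-numbered ones are explored first.
        for nb in sorted(adj[node], reverse=True):
            if nb not in seen:
                stack.append(nb)
    return order

def total_variation(signal, edges):
    """Sum of absolute differences of the signal across the given edges."""
    return sum(abs(signal[u] - signal[v]) for u, v in edges)

# Small connected graph with one cycle (1-3-4).
edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (3, 4)]
adj = {i: set() for i in range(6)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

order = dfs_order(adj)                      # DFS visiting order
chain_edges = list(zip(order, order[1:]))   # chain induced by that order
signal = [0.0, 1.0, 0.0, 3.0, 1.0, 2.0]
tv_graph = total_variation(signal, edges)
tv_chain = total_variation(signal, chain_edges)
```

Building the chain costs one traversal of the graph, after which any linear-time 1d fused lasso routine can be run on the signal reordered by `order`.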

Maximum Likelihood Estimation for Mixtures of Spherical Gaussians is NP-hard
球面ガウス分布の混合の最尤推定はNP困難である

This paper presents NP-hardness and hardness of approximation results for maximum likelihood estimation of mixtures of spherical Gaussians.

この論文では、球面ガウス分布の混合モデルの最尤推定について、NP困難性および近似困難性に関する結果を示します。

On the Stability of Feature Selection Algorithms
特徴選択アルゴリズムの安定性について

Feature Selection is central to modern data science, from exploratory data analysis to predictive model-building. The “stability” of a feature selection algorithm refers to the robustness of its feature preferences, with respect to data sampling and to its stochastic nature. An algorithm is “unstable” if a small change in data leads to large changes in the chosen feature subset. Whilst the idea is simple, quantifying this has proven more challenging: we note numerous proposals in the literature, each with different motivation and justification. We present a rigorous statistical treatment for this issue. In particular, with this work we consolidate the literature and provide (1) a deeper understanding of existing work based on a small set of properties, and (2) a clearly justified statistical approach with several novel benefits. This approach serves to identify a stability measure obeying all desirable properties, and (for the first time in the literature) allowing confidence intervals and hypothesis tests on the stability, enabling rigorous experimental comparison of feature selection algorithms.

特徴選択は、探索的データ分析から予測モデル構築まで、現代のデータサイエンスの中心です。特徴選択アルゴリズムの「安定性」とは、データサンプリングとその確率的性質に関する特徴の好みの堅牢性を指します。データの小さな変化が選択された特徴サブセットの大きな変化につながる場合、アルゴリズムは「不安定」です。アイデアは単純ですが、これを定量化することはより困難であることが判明しています。文献には、それぞれ異なる動機と正当性を持つ多数の提案があります。私たちはこの問題に対する厳密な統計的処理を提示します。特に、この作業では、文献を統合し、(1)少数の特性に基づく既存の作業のより深い理解と、(2)いくつかの新しい利点を持つ明確に正当化された統計的アプローチを提供します。このアプローチは、すべての望ましい特性に従う安定性の尺度を特定し、(文献で初めて)安定性の信頼区間と仮説検定を可能にして、特徴選択アルゴリズムの厳密な実験的比較を可能にします。

auDeep: Unsupervised Learning of Representations from Audio with Deep Recurrent Neural Networks
auDeep: ディープリカレントニューラルネットワークによるオーディオからの表現の教師なし学習

auDeep is a Python toolkit for deep unsupervised representation learning from acoustic data. It is based on a recurrent sequence-to-sequence autoencoder approach which can learn representations of time series data by taking into account their temporal dynamics. We provide an extensive command line interface in addition to a Python API for users and developers, both of which are comprehensively documented and publicly available at https://github.com/auDeep/auDeep. Experimental results indicate that auDeep features are competitive with state-of-the-art audio classification.

auDeepは、音響データから深層の教師なし表現学習を行うためのPythonツールキットです。これは、時系列データの時間的ダイナミクスを考慮して表現を学習できる、リカレント・シーケンス・ツー・シーケンス・オートエンコーダ・アプローチに基づいています。ユーザーと開発者向けのPython APIに加えて、広範なコマンドラインインターフェイスを提供しており、どちらも包括的に文書化されており、https://github.com/auDeep/auDeepで公開されています。実験結果から、auDeepの特徴量は最先端のオーディオ分類手法に匹敵することが示されています。

Convergence Analysis of Distributed Inference with Vector-Valued Gaussian Belief Propagation
ベクトル値ガウス信念伝搬を用いた分散推論の収束解析

This paper considers inference over distributed linear Gaussian models using factor graphs and Gaussian belief propagation (BP). The distributed inference algorithm involves only local computation of the information matrix and of the mean vector, and message passing between neighbors. Under broad conditions, it is shown that the message information matrix converges to a unique positive definite limit matrix for arbitrary positive semidefinite initialization, and it approaches an arbitrarily small neighborhood of this limit matrix at an exponential rate. A necessary and sufficient convergence condition for the belief mean vector to converge to the optimal centralized estimator is provided under the assumption that the message information matrix is initialized as a positive semidefinite matrix. Further, it is shown that Gaussian BP always converges when the underlying factor graph is given by the union of a forest and a single loop. The proposed convergence condition in the setup of distributed linear Gaussian models is shown to be strictly weaker than other existing convergence conditions and requirements, including the Gaussian Markov random field based walk-summability condition, and applicable to a large class of scenarios.

この論文では、因子グラフとガウスの確信伝播(BP)を使用した分散線形ガウスモデルの推論について検討します。分散推論アルゴリズムには、情報行列と平均ベクトルのローカル計算、および近傍間のメッセージパッシングのみが含まれます。広範な条件下では、メッセージ情報行列は、任意の正半定値初期化に対して一意の正定値限界行列に収束し、この限界行列の任意の小さな近傍に指数関数的に近づくことが示されています。メッセージ情報行列が正半定値行列として初期化されているという仮定の下で、確信平均ベクトルが最適な集中推定量に収束するための必要かつ十分な収束条件が提供されます。さらに、基礎となる因子グラフがフォレストと単一のループの結合によって与えられる場合、ガウスBPは常に収束することが示されています。分散線形ガウスモデルのセットアップで提案された収束条件は、ガウスマルコフランダムフィールドベースのウォーク合計条件を含む他の既存の収束条件および要件よりも厳密に弱いことが示され、大規模なシナリオクラスに適用できます。

Convergence of Unregularized Online Learning Algorithms
非正則化オンライン学習アルゴリズムの収束

In this paper we study the convergence of online gradient descent algorithms in reproducing kernel Hilbert spaces (RKHSs) without regularization. We establish a sufficient condition and a necessary condition for the convergence of excess generalization errors in expectation. A sufficient condition for the almost sure convergence is also given. With high probability, we provide explicit convergence rates of the excess generalization errors for both averaged iterates and the last iterate, which in turn also imply convergence rates with probability one. To the best of our knowledge, this is the first high-probability convergence rate for the last iterate of online gradient descent algorithms in the general convex setting. Without any boundedness assumptions on iterates, our results are derived by a novel use of two measures of the algorithm’s one-step progress, respectively by generalization errors and by distances in RKHSs, where the variances of the involved martingales are cancelled out by the descent property of the algorithm.

この論文では、正則化なしの再生核ヒルベルト空間(RKHS)におけるオンライン勾配降下アルゴリズムの収束について検討します。期待値における過剰汎化誤差の収束の十分条件と必要条件を確立します。ほぼ確実な収束の十分条件も示します。高い確率で成り立つ、平均反復と最後の反復の両方についての過剰汎化誤差の明示的な収束率を提供します。これは、確率1の収束率も意味します。私たちが知る限り、これは一般的な凸設定におけるオンライン勾配降下アルゴリズムの最後の反復に対する初めての高確率収束率です。反復に対する有界性の仮定なしに、私たちの結果は、アルゴリズムの1ステップの進行の2つの尺度、それぞれ汎化誤差とRKHS内の距離の新しい使用によって導き出されます。ここで、関係するマルチンゲールの分散は、アルゴリズムの降下特性によって相殺されます。

On the Behavior of Intrinsically High-Dimensional Spaces: Distances, Direct and Reverse Nearest Neighbors, and Hubness
本質的に高次元空間の振る舞いについて:距離、直接近傍と逆最近傍、およびハブ性

Over the years, different characterizations of the curse of dimensionality have been provided, usually stating the conditions under which, in the limit of the infinite dimensionality, distances become indistinguishable. However, these characterizations almost never address the form of associated distributions in the finite, although high-dimensional, case. This work aims to contribute in this respect by investigating the distribution of distances, and of direct and reverse nearest neighbors, in intrinsically high-dimensional spaces. Indeed, we derive a closed form for the distribution of distances from a given point, for the expected distance from a given point to its $k$th nearest neighbor, and for the expected size of the approximate set of neighbors of a given point in finite high-dimensional spaces. Additionally, the hubness problem is considered, which is related to the form of the function $N_k$ representing the number of points that have a given point as one of their $k$ nearest neighbors, which is also called the number of $k$-occurrences. Despite the extensive use of this function, the precise characterization of its form is a longstanding problem. We derive a closed form for the number of $k$-occurrences associated with a given point in finite high-dimensional spaces, together with the associated limiting probability distribution. By investigating the relationships with the hubness phenomenon emerging in network science, we find that the distribution of node (in-)degrees of some real-life, large-scale networks has connections with the distribution of $k$-occurrences described herein.

長年にわたり、次元の呪いのさまざまな特徴付けが示されてきましたが、通常は、無限次元の極限で距離が区別できなくなる条件を述べています。しかし、これらの特徴付けは、高次元ではあっても有限の場合の関連分布の形式にはほとんど触れていません。本研究は、本質的に高次元の空間における距離の分布、および直接および逆の最近傍の分布を調査することにより、この点に貢献することを目指しています。実際、有限の高次元空間における、特定の点からの距離の分布、特定の点からその$k$番目の最近傍までの期待距離、および特定の点のおおよその近傍セットの期待サイズについて、閉じた形式を導出します。さらに、ハブネス問題も検討します。これは、特定の点を$k$近傍の1つとして持つ点の数を表す関数$N_k$の形式($k$出現数とも呼ばれます)に関連しています。この関数は広く使用されているにもかかわらず、その形式を正確に特徴付けることは長年の課題です。私たちは、有限の高次元空間内の特定の点に関連付けられた$k$発生数の閉じた形式と、それに関連付けられた限界確率分布を導出します。ネットワーク科学で出現しているハブネス現象との関係を調査することで、いくつかの実際の大規模ネットワークのノード(入)次数の分布が、ここで説明した$k$発生の分布と関係があることがわかりました。
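The function $N_k$ discussed above can be computed empirically by brute force, which makes the definition concrete. The following is a minimal sketch, assuming Euclidean distance and i.i.d. standard normal data; the function name `k_occurrences` and the sample sizes are illustrative choices, and the sketch does not reproduce the closed forms derived in the paper.

```python
import random

def k_occurrences(points, k):
    """N_k[i] = number of points that have point i among their k nearest neighbours."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Squared Euclidean distances from point i to every other point.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            counts[j] += 1
    return counts

rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(200)]
n_k = k_occurrences(points, k=5)
```

Every point casts exactly `k` votes, so the values of `n_k` always average to `k`; hubness concerns how unevenly those votes are spread, which can be inspected through `max(n_k)` or a histogram of `n_k`.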

In Search of Coherence and Consensus: Measuring the Interpretability of Statistical Topics
一貫性とコンセンサスを求めて:統計的トピックの解釈可能性の測定

Topic modeling is an important tool in natural language processing. Topic models provide two forms of output. The first is a predictive model. This type of model has the ability to predict unseen documents (e.g., their categories). When topic models are used in this way, there are ample measures to assess their performance. The second output of these models is the topics themselves. Topics are lists of keywords that describe the top words pertaining to each topic. Often, these lists of keywords are presented to a human subject who then assesses the meaning of the topic, which is ultimately subjective. One of the fundamental problems of topic models lies in assessing the quality of the topics from the perspective of human interpretability. Naturally, human subjects need to be employed to evaluate interpretability of a topic. Lately, crowdsourcing approaches are widely used to serve the role of human subjects in evaluation. In this work we study measures of interpretability and propose to measure topic interpretability from two perspectives: topic coherence and topic consensus. We start with an existing measure for topic coherence, model precision. It evaluates coherence of a topic by introducing an intruded word and measuring how well a human subject or a crowdsourcing approach could identify the intruded word: if it is easy to identify, the topic is coherent. We then investigate how we can measure coherence comprehensively by examining dimensions of topic coherence. For the second perspective of topic interpretability, we suggest topic consensus, which measures how well the results of a crowdsourcing approach match the given categories of topics. Good topics should lead to good categories, thus, high topic consensus. Therefore, if there is low topic consensus in terms of categories, topics could be of low interpretability. We then further discuss how topic coherence and topic consensus assess different aspects of topic interpretability and hope that this work can pave the way for comprehensive measures of topic interpretability.

トピックモデリングは、自然言語処理の重要なツールです。トピックモデルは、2つの形式の出力を提供します。1つ目は予測モデルです。このタイプのモデルには、未知のドキュメント(たとえば、そのカテゴリ)を予測する機能があります。トピックモデルをこのように使用すると、そのパフォーマンスを評価するための十分な尺度があります。これらのモデルの2つ目の出力はトピック自体です。トピックは、各トピックに関連する上位の単語を説明するキーワードのリストです。多くの場合、これらのキーワードのリストは人間の被験者に提示され、被験者はトピックの意味を評価しますが、これは最終的には主観的です。トピックモデルの基本的な問題の1つは、人間の解釈可能性の観点からトピックの品質を評価することにあります。当然、トピックの解釈可能性を評価するには、人間の被験者を雇う必要があります。最近、クラウドソーシングアプローチが、評価における人間の被験者の役割を果たすために広く使用されています。この研究では、解釈可能性の尺度を研究し、トピックの一貫性とトピックのコンセンサスの2つの観点からトピックの解釈可能性を測定することを提案します。トピックの一貫性の既存の尺度であるモデル精度から始めます。これは、侵入語を導入し、被験者またはクラウドソーシングアプローチが侵入語をどれだけ正確に識別できるかを測定することによってトピックの一貫性を評価します。識別が容易であれば、トピックは一貫しています。次に、トピックの一貫性の次元を調べることで、一貫性を包括的に測定する方法を調査します。トピックの解釈可能性の2番目の観点では、クラウドソーシングアプローチの結果がトピックの特定のカテゴリにどれだけ一致するかを測定するトピックコンセンサスを提案します。優れたトピックは優れたカテゴリにつながるため、トピックコンセンサスは高くなります。したがって、カテゴリに関してトピックコンセンサスが低い場合、トピックの解釈可能性は低い可能性があります。次に、トピックの一貫性とトピックコンセンサスがトピックの解釈可能性のさまざまな側面をどのように評価するかについてさらに説明し、この研究がトピックの解釈可能性の包括的な測定への道を開くことを願っています。

Local Identifiability of l_1-minimization Dictionary Learning: a Sufficient and Almost Necessary Condition
l_1-最小化辞書学習の局所識別可能性：十分かつほぼ必要な条件

We study the theoretical properties of learning a dictionary from $N$ signals $\mathbf{x}_i\in \mathbb R^K$ for $i=1,\ldots,N$ via $\ell_1$-minimization. We assume that the $\mathbf{x}_i$ are i.i.d. random linear combinations of the $K$ columns from a complete (i.e., square and invertible) reference dictionary $\mathbf{D}_0 \in \mathbb R^{K\times K}$. Here, the random linear coefficients are generated from either the $s$-sparse Gaussian model or the Bernoulli-Gaussian model. First, for the population case, we establish a sufficient and almost necessary condition for the reference dictionary $\mathbf{D}_0$ to be locally identifiable, i.e., a strict local minimum of the expected $\ell_1$-norm objective function. Our condition covers both sparse and dense cases of the random linear coefficients and significantly improves the sufficient condition by Gribonval and Schnass (2010). In addition, we show that for a complete $\mu$-coherent reference dictionary, i.e., a dictionary with absolute pairwise column inner-product at most $\mu\in[0,1)$, local identifiability holds even when the random linear coefficient vector has up to $O(\mu^{-2})$ nonzero entries. Moreover, our local identifiability results also translate to the finite sample case with high probability provided that the number of signals $N$ scales as $O(K\log K)$.

私たちは、$\ell_1$-最小化を介して、$i=1,\ldots,N$の$N$個の信号$\mathbf{x}_i\in \mathbb R^K$から辞書を学習する理論的特性について研究します。$\mathbf{x}_i$は、完全な（つまり、正方で可逆な）参照辞書$\mathbf{D}_0 \in \mathbb R^{K\times K}$のK列の$i.i.d.$ランダム線形結合であると仮定します。ここで、ランダム線形係数は、$s$-スパースガウスモデルまたはベルヌーイガウスモデルのいずれかから生成されます。まず、母集団の場合、参照辞書$\mathbf{D}_0$が局所的に識別可能であるための十分かつほぼ必要な条件、つまり期待される$\ell_1$ノルム目的関数の厳密な局所最小値を確立します。この条件は、ランダム線形係数のスパースおよび密なケースの両方をカバーし、GribonvalおよびSchnass (2010)による十分条件を大幅に改善します。さらに、完全な$\mu$コヒーレント参照辞書、つまり絶対ペアワイズ列内積が最大で$\mu\in[0,1)$の辞書の場合、ランダム線形係数ベクトルに最大$O(\mu^{-2})$の非ゼロエントリがある場合でも、局所的識別可能性が維持されることを示します。さらに、信号数$N$が$O(K\log K)$に比例する場合、局所的識別可能性の結果は、高い確率で有限サンプルの場合にも適用できます。

Risk-Constrained Reinforcement Learning with Percentile Risk Criteria
百分位数リスク基準によるリスク制約付き強化学習

In many sequential decision-making problems one is interested in minimizing an expected cumulative cost while taking into account risk, i.e., increased awareness of events of small probability and high consequences. Accordingly, the objective of this paper is to present efficient reinforcement learning algorithms for risk-constrained Markov decision processes (MDPs), where risk is represented via a chance constraint or a constraint on the conditional value-at-risk (CVaR) of the cumulative cost. We collectively refer to such problems as percentile risk-constrained MDPs. Specifically, we first derive a formula for computing the gradient of the Lagrangian function for percentile risk-constrained MDPs. Then, we devise policy gradient and actor-critic algorithms that (1) estimate such gradient, (2) update the policy in the descent direction, and (3) update the Lagrange multiplier in the ascent direction. For these algorithms we prove convergence to locally optimal policies. Finally, we demonstrate the effectiveness of our algorithms in an optimal stopping problem and an online marketing application.

多くの逐次的意思決定問題では、リスクを考慮しながら、予想される累積コストを最小化すること、つまり、確率は低く、結果が大きいイベントに対する認識を高めることに関心があります。したがって、この論文の目的は、リスク制約付きマルコフ決定プロセス(MDP)の効率的な強化学習アルゴリズムを提示することです。リスクは、確率制約または累積コストの条件付きリスク値(CVaR)の制約によって表されます。このような問題をまとめて、パーセンタイルリスク制約付きMDPと呼びます。具体的には、まず、パーセンタイルリスク制約付きMDPのラグランジュ関数の勾配を計算するための式を導出します。次に、(1)このような勾配を推定し、(2)下降方向にポリシーを更新し、(3)上昇方向にラグランジュ乗数を更新するポリシー勾配アルゴリズムとアクタークリティックアルゴリズムを考案します。これらのアルゴリズムでは、局所的に最適なポリシーへの収束が証明されます。最後に、最適停止問題とオンラインマーケティングアプリケーションにおけるアルゴリズムの有効性を示します。
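The conditional value-at-risk used as a constraint above can be made concrete with a small empirical computation. This sketch assumes the standard Rockafellar-Uryasev representation $\mathrm{CVaR}_\alpha(C) = \min_\nu \{\nu + \mathbb{E}[(C-\nu)^+]/(1-\alpha)\}$ applied to a finite sample of cumulative costs; the function name `cvar` and the sample values are illustrative, and no policy-gradient or actor-critic update from the paper is shown.

```python
def cvar(costs, alpha):
    """Empirical CVaR at level alpha (0 <= alpha < 1) of a list of costs,
    computed as min over nu of nu + mean((c - nu)^+) / (1 - alpha)."""
    n = len(costs)

    def objective(nu):
        excess = sum(max(c - nu, 0.0) for c in costs)
        return nu + excess / (n * (1.0 - alpha))

    # The objective is convex and piecewise linear with kinks at the sample
    # values, so searching over the samples is enough to find the minimum.
    return min(objective(nu) for nu in costs)

costs = [1.0, 2.0, 3.0, 4.0, 10.0]
tail_risk = cvar(costs, alpha=0.8)   # average of the worst 20% of outcomes
mean_cost = cvar(costs, alpha=0.0)   # alpha = 0 recovers the plain mean
```

In a risk-constrained formulation, a quantity like `tail_risk` would be required to stay below a chosen threshold while the expected cost is minimized, with a Lagrange multiplier weighting the violation.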

Gradient Hard Thresholding Pursuit
勾配ハードしきい値追跡法

Hard Thresholding Pursuit (HTP) is an iterative greedy selection procedure for finding sparse solutions of underdetermined linear systems. This method has been shown to have strong theoretical guarantee and impressive numerical performance. In this article, we generalize HTP from compressed sensing to a generic problem setup of sparsity-constrained convex optimization. The proposed algorithm iterates between a standard gradient descent step and a hard-thresholding step with or without debiasing. We analyze the parameter estimation and sparsity recovery performance of the proposed method. Extensive numerical results confirm our theoretical predictions and demonstrate the superiority of our method to the state-of-the-art greedy selection methods in sparse linear regression, sparse logistic regression and sparse precision matrix estimation problems. (A conference version of this work appeared in ICML 2014.)

ハードしきい値追跡法(HTP)は、劣決定線形系のスパース解を見つけるための反復的な貪欲選択手順です。この方法は、強力な理論的保証と印象的な数値性能を持つことが示されています。この記事では、HTPを圧縮センシングから、スパース性制約のある凸最適化の一般的な問題設定に一般化します。提案されたアルゴリズムは、標準の勾配降下ステップと、バイアス除去の有無にかかわらずハードしきい値化ステップとを交互に繰り返します。提案手法のパラメータ推定とスパース性回復性能を解析します。広範な数値結果は、私たちの理論的予測を確認し、スパース線形回帰、スパースロジスティック回帰、スパース精度行列推定問題における最先端の貪欲選択法に対する私たちの方法の優位性を示しています。(この研究の会議版はICML 2014に掲載されました。)
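The alternation between a gradient step and a hard-thresholding step can be sketched in a few lines. The version below is the variant without debiasing, written for a generic differentiable objective supplied through its gradient; the function names, the fixed step size, the iteration count, and the toy quadratic objective are all assumptions made for illustration rather than the authors' implementation.

```python
def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and set the rest to zero."""
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    out = [0.0] * len(x)
    for i in keep:
        out[i] = x[i]
    return out

def gradient_hard_thresholding(grad, dim, k, step, iters=200):
    """Iterate x <- H_k(x - step * grad(x)) starting from zero (no debiasing)."""
    x = [0.0] * dim
    for _ in range(iters):
        g = grad(x)
        x = hard_threshold([xi - step * gi for xi, gi in zip(x, g)], k)
    return x

# Toy objective f(x) = 0.5 * ||x - y||^2, whose best 2-sparse minimizer
# keeps the two largest-magnitude entries of y.
y = [3.0, -0.1, 0.2, -2.0, 0.05]
x_hat = gradient_hard_thresholding(
    grad=lambda x: [xi - yi for xi, yi in zip(x, y)],
    dim=len(y), k=2, step=0.5,
)
```

A debiased variant would add one more step per iteration: after the support is chosen by `hard_threshold`, the objective is re-minimized over only the selected coordinates.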

Maximum Principle Based Algorithms for Deep Learning
深層学習のための最大原理ベースアルゴリズム

The continuous dynamical system approach to deep learning is explored in order to devise alternative frameworks for training algorithms. Training is recast as a control problem and this allows us to formulate necessary optimality conditions in continuous time using Pontryagin’s maximum principle (PMP). A modification of the method of successive approximations is then used to solve the PMP, giving rise to an alternative training algorithm for deep learning. This approach has the advantage that rigorous error estimates and convergence results can be established. We also show that it may avoid some pitfalls of gradient-based methods, such as slow convergence on flat landscapes near saddle points. Furthermore, we demonstrate that it obtains favorable initial convergence rate per-iteration, provided Hamiltonian maximization can be efficiently carried out – a step which is still in need of improvement. Overall, the approach opens up new avenues to attack problems associated with deep learning, such as trapping in slow manifolds and inapplicability of gradient-based methods for discrete trainable variables.

トレーニングアルゴリズムの代替フレームワークを考案するために、ディープラーニングに対する連続力学系アプローチが検討されています。トレーニングは制御問題として再構成され、これにより、ポントリャーギンの最大原理(PMP)を使用して連続時間で最適性の必要条件を定式化できます。次に、逐次近似法の修正を使用してPMPを解き、ディープラーニングの代替トレーニングアルゴリズムを生み出します。このアプローチの利点は、厳密な誤差評価と収束結果を確立できることです。また、鞍点付近の平坦な地形での収束が遅いなど、勾配ベースの方法のいくつかの落とし穴を回避できる可能性があることも示しています。さらに、ハミルトニアンの最大化を効率的に実行できれば、反復ごとに好ましい初期収束率が得られることを実証しています。これはまだ改善が必要なステップです。全体として、このアプローチは、遅い多様体でのトラッピングや離散的なトレーニング可能な変数に対する勾配ベースの方法の非適用性など、ディープラーニングに関連する問題に取り組むための新しい道を開きます。

pomegranate: Fast and Flexible Probabilistic Modeling in Python
pomegranate:Pythonでの高速で柔軟な確率モデリング

We present pomegranate, an open source machine learning package for probabilistic modeling in Python. Probabilistic modeling encompasses a wide range of methods that explicitly describe uncertainty using probability distributions. Three widely used probabilistic models implemented in pomegranate are general mixture models, hidden Markov models, and Bayesian networks. A primary focus of pomegranate is to abstract away the complexities of training models from their definition. This allows users to focus on specifying the correct model for their application instead of being limited by their understanding of the underlying algorithms. An aspect of this focus involves the collection of additive sufficient statistics from data sets as a strategy for training models. This approach trivially enables many useful learning strategies, such as out-of-core learning, minibatch learning, and semi-supervised learning, without requiring the user to consider how to partition data or modify the algorithms to handle these tasks themselves. pomegranate is written in Cython to speed up calculations and releases the global interpreter lock to allow for built-in multithreaded parallelism, allowing it to match or outperform other implementations of similar algorithms. This paper presents an overview of the design choices in pomegranate, and how they have enabled complex features to be supported by simple code. The code is available at https://github.com/jmschrei/pomegranate.

私たちは、Pythonで確率モデリングを行うためのオープンソースの機械学習パッケージpomegranateを紹介します。確率モデリングには、確率分布を使用して不確実性を明示的に記述するさまざまな方法が含まれます。pomegranateで実装されている3つの広く使用されている確率モデルは、一般混合モデル、隠れマルコフモデル、ベイジアンネットワークです。pomegranateの主な焦点は、モデルの学習に伴う複雑さをモデルの定義から切り離して抽象化することです。これにより、ユーザーは、基礎となるアルゴリズムの理解に制限されることなく、アプリケーションに適したモデルを指定することに集中できます。この焦点の1つの側面は、モデルを学習するための戦略として、データセットから加法的な十分統計量を収集することです。このアプローチにより、アウトオブコア学習、ミニバッチ学習、半教師あり学習など、多くの有用な学習戦略が簡単に実現され、ユーザーはデータの分割方法を検討したり、これらのタスクを処理するためにアルゴリズムを変更したりする必要がありません。pomegranateはCythonで書かれており、計算を高速化し、グローバルインタプリタロックを解放して組み込みのマルチスレッド並列処理を可能にすることで、同様のアルゴリズムの他の実装と同等またはそれ以上の性能を発揮します。この論文では、pomegranateの設計上の選択の概要と、それによって複雑な機能を単純なコードでサポートできるようになった方法について説明します。コードはhttps://github.com/jmschrei/pomegranateで入手できます。

Deep Learning the Ising Model Near Criticality
深層学習: 臨界近傍イジングモデル

It is well established that neural networks with deep architectures perform better than shallow networks for many tasks in machine learning. In statistical physics, while there has been recent interest in representing physical data with generative modelling, the focus has been on shallow neural networks. A natural question to ask is whether deep neural networks hold any advantage over shallow networks in representing such data. We investigate this question by using unsupervised, generative graphical models to learn the probability distribution of a two-dimensional Ising system. Deep Boltzmann machines, deep belief networks, and deep restricted Boltzmann networks are trained on thermal spin configurations from this system, and compared to the shallow architecture of the restricted Boltzmann machine. We benchmark the models, focussing on the accuracy of generating energetic observables near the phase transition, where these quantities are most difficult to approximate. Interestingly, after training the generative networks, we observe that the accuracy essentially depends only on the number of neurons in the first hidden layer of the network, and not on other model details such as network depth or model type. This is evidence that shallow networks are more efficient than deep networks at representing physical probability distributions associated with Ising systems near criticality.

機械学習の多くのタスクでは、深いアーキテクチャを持つニューラルネットワークの方が浅いネットワークよりもパフォーマンスが優れていることはよく知られています。統計物理学では、生成モデルを使用して物理データを表現することに最近関心が寄せられていますが、焦点は浅いニューラルネットワークにありました。当然の疑問は、そのようなデータを表現する上で、深層ニューラルネットワークが浅いネットワークよりも優れているかどうかです。私たちは、教師なしの生成グラフィカルモデルを使用して2次元イジングシステムの確率分布を学習することで、この疑問を調査します。深層ボルツマンマシン、深層ビリーフネットワーク、深層制限ボルツマンネットワークは、このシステムの熱スピン構成でトレーニングされ、制限ボルツマンマシンの浅いアーキテクチャと比較されます。私たちは、これらの量を近似するのが最も難しい相転移付近でのエネルギー観測量を生成する精度に焦点を当てて、モデルのベンチマークを行います。興味深いことに、生成ネットワークをトレーニングした後、精度は基本的にネットワークの最初の非表示層のニューロン数のみに依存し、ネットワークの深さやモデルの種類などの他のモデルの詳細には依存しないことがわかります。これは、臨界に近いイジングシステムに関連する物理的な確率分布を表現する場合、浅いネットワークの方が深いネットワークよりも効率的であることを示す証拠です。

Principled Selection of Hyperparameters in the Latent Dirichlet Allocation Model
潜在ディリクレ配分モデルにおけるハイパーパラメータの原則的選択

Latent Dirichlet Allocation (LDA) is a well known topic model that is often used to make inference regarding the properties of collections of text documents. LDA is a hierarchical Bayesian model, and involves a prior distribution on a set of latent topic variables. The prior is indexed by certain hyperparameters, and even though these have a large impact on inference, they are usually chosen either in an ad-hoc manner, or by applying an algorithm whose theoretical basis has not been firmly established. We present a method, based on a combination of Markov chain Monte Carlo and importance sampling, for estimating the maximum likelihood estimate of the hyperparameters. The method may be viewed as a computational scheme for implementation of an empirical Bayes analysis. It comes with theoretical guarantees, and a key feature of our approach is that we provide theoretically-valid error margins for our estimates. Experiments on both synthetic and real data show good performance of our methodology.

潜在ディリクレ配分法(LDA)は、テキストドキュメントのコレクションの特性に関する推論を行うためによく使用される、よく知られたトピックモデルです。LDAは階層型ベイズモデルであり、一連の潜在トピック変数の事前分布を使用します。事前分布は特定のハイパーパラメータによってインデックス付けされます。これらのハイパーパラメータは推論に大きな影響を与えますが、通常はアドホックに選択されるか、理論的根拠が十分に確立されていないアルゴリズムを適用して選択されます。ここでは、マルコフ連鎖モンテカルロと重要度サンプリングの組み合わせに基づいて、ハイパーパラメータの最大尤度推定値を推定する方法を紹介します。この方法は、経験的ベイズ分析を実装するための計算スキームと見なすことができます。この方法には理論的な保証が付いており、このアプローチの重要な特徴は、推定値に対して理論的に有効な誤差範囲を提供することです。合成データと実際のデータの両方での実験では、この方法論の優れたパフォーマンスが示されています。

Gradient Estimation with Simultaneous Perturbation and Compressive Sensing
同時摂動と圧縮センシングによる勾配推定

We propose a scheme for finding a “good” estimator for the gradient of a function on a high-dimensional space with few function evaluations, for applications where function evaluations are expensive and the function under consideration is not sensitive in all coordinates locally, making its gradient almost sparse. Exploiting the latter aspect, our method combines ideas from Spall’s Simultaneous Perturbation Stochastic Approximation with compressive sensing. We theoretically justify its computational advantages and illustrate them empirically by numerical experiments. In particular, applications to estimating gradient outer product matrix as well as standard optimization problems are illustrated via simulations.

私たちは、関数評価が高価で、検討中の関数が局所的にすべての座標で敏感ではなく、勾配がほとんどまばらになるアプリケーション向けに、関数評価がほとんどない高次元空間上の関数の勾配について「良い」推定量を見つけるためのスキームを提案します。後者の側面を利用して、私たちの方法は、Spallの同時摂動確率近似のアイデアと圧縮センシングを組み合わせたものです。私たちは理論的にその計算上の利点を正当化し、数値実験によってそれらを経験的に説明します。特に、勾配外部積行列の推定と標準的な最適化問題への応用をシミュレーションで説明します。
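
The simultaneous-perturbation half of the idea above can be sketched as follows. This is a minimal illustration of an SPSA-style gradient estimate, not the paper's method: the compressive-sensing recovery step is omitted, and the function name `spsa_gradient` and the parameters `c` and `num_samples` are hypothetical names chosen for this example.

```python
import numpy as np


def spsa_gradient(f, x, c=1e-3, num_samples=1, rng=None):
    """Average `num_samples` simultaneous-perturbation estimates of grad f(x).

    Each estimate costs two evaluations of f, whatever the dimension of x.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    estimate = np.zeros_like(x)
    for _ in range(num_samples):
        # Rademacher (+1/-1) perturbation applied to every coordinate at once.
        delta = rng.choice([-1.0, 1.0], size=x.shape)
        # Central difference along the random direction delta.
        diff = f(x + c * delta) - f(x - c * delta)
        # delta_i is +1 or -1, so dividing by delta_i equals multiplying by it.
        estimate += diff / (2.0 * c) * delta
    return estimate / num_samples
```

A single estimate is noisy in more than one dimension; averaging over several perturbations reduces the variance, which is the cost the abstract proposes to cut by exploiting near-sparsity of the gradient.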

Training Gaussian Mixture Models at Scale via Coresets
コアセットによる大規模な混合ガウスモデルの学習

How can we train a statistical mixture model on a massive data set? In this work we show how to construct coresets for mixtures of Gaussians. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size polynomial in dimension and the number of mixture components, while being independent of the data set size. Hence, one can harness computationally intensive algorithms to compute a good approximation on a significantly smaller data set. More importantly, such coresets can be efficiently constructed both in distributed and streaming settings and do not impose restrictions on the data generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and new combinatorial complexity results for mixtures of Gaussians. Empirical evaluation on several real-world data sets suggests that our coreset-based approach enables significant reduction in training time with negligible approximation error.

大規模なデータセットで統計的混合モデルをトレーニングするにはどうすればよいでしょうか。この研究では、ガウス分布の混合に対してコアセットを構築する方法を示します。コアセットはデータの重み付きサブセットであり、コアセットに適合するモデルが元のデータセットにも適合することを保証します。驚くかもしれませんが、ガウス分布の混合では、データセットのサイズとは独立でありながら、次元と混合成分の数の多項式サイズのコアセットを許容することを示します。したがって、計算集約型アルゴリズムを利用して、大幅に小さいデータセットで良好な近似値を計算できます。さらに重要なことは、このようなコアセットは分散設定とストリーミング設定の両方で効率的に構築でき、データ生成プロセスに制限を課さないことです。私たちの結果は、統計的推定を計算幾何学の問題に新たに還元することと、ガウス分布の混合に対する新しい組み合わせの複雑さの結果に依存しています。いくつかの実際のデータセットでの経験的評価により、コアセットベースのアプローチにより、近似誤差が無視できる程度でトレーニング時間を大幅に短縮できることが示唆されました。

Robust Topological Inference: Distance To a Measure and Kernel Distance
ロバストなトポロジカル推論: 測度までの距離とカーネル距離

Let $P$ be a distribution with support $S$. The salient features of $S$ can be quantified with persistent homology, which summarizes topological features of the sublevel sets of the distance function (the distance of any point $x$ to $S$). Given a sample from $P$ we can infer the persistent homology using an empirical version of the distance function. However, the empirical distance function is highly non-robust to noise and outliers. Even one outlier is deadly. The distance-to-a-measure (DTM), introduced by \cite{chazal2011geometric}, and the kernel distance, introduced by \cite{phillips2014goemetric}, are smooth functions that provide useful topological information but are robust to noise and outliers. \cite{massart2014} derived concentration bounds for DTM. Building on these results, we derive limiting distributions and confidence sets, and we propose a method for choosing tuning parameters.

$P$を台$S$を持つ分布とします。$S$の顕著な特徴は、距離関数(任意の点$x$から$S$までの距離)の劣位集合のトポロジカルな特徴を要約するパーシステントホモロジーによって定量化できます。$P$からのサンプルが与えられると、距離関数の経験的バージョンを使用してパーシステントホモロジーを推測できます。ただし、経験的距離関数は、ノイズや外れ値に対して非常に非ロバストです。外れ値が1つでも致命的です。\cite{chazal2011geometric}によって導入されたdistance-to-a-measure (DTM)と\cite{phillips2014goemetric}によって導入されたカーネル距離は、有用なトポロジ情報を提供しつつ、ノイズや外れ値に対してロバストな滑らかな関数です。\cite{massart2014}はDTMの集中限界を導き出しました。これらの結果に基づいて、極限分布と信頼集合を導き出し、調整パラメーターを選択する方法を提案します。

Probabilistic preference learning with the Mallows rank model
マローズランクモデルによる確率的選好学習

Ranking and comparing items is crucial for collecting information about preferences in many areas, from marketing to politics. The Mallows rank model is among the most successful approaches to analyse rank data, but its computational complexity has limited its use to a particular form based on Kendall distance. We develop new computationally tractable methods for Bayesian inference in Mallows models that work with any right-invariant distance. Our method performs inference on the consensus ranking of the items, also when based on partial rankings, such as top-$k$ items or pairwise comparisons. We prove that items that none of the assessors has ranked do not influence the maximum a posteriori consensus ranking, and can therefore be ignored. When assessors are many or heterogeneous, we propose a mixture model for clustering them in homogeneous subgroups, with cluster-specific consensus rankings. We develop approximate stochastic algorithms that allow a fully probabilistic analysis, leading to coherent quantifications of uncertainties. We make probabilistic predictions on the class membership of assessors based on their ranking of just some items, and predict missing individual preferences, as needed in recommendation systems. We test our approach using several experimental and benchmark datasets.

アイテムのランク付けと比較は、マーケティングから政治まで、多くの分野で嗜好に関する情報を収集するために不可欠です。Mallowsランクモデルはランク付けデータを分析するための最も成功したアプローチの1つですが、計算が複雑なため、その利用はKendall距離に基づく特定の形式に限られてきました。私たちは、任意の右不変距離で機能するMallowsモデルでのベイズ推論のための、計算的に扱いやすい新しい方法を開発します。私たちの方法は、上位$k$アイテムや一対比較などの部分的なランキングに基づく場合でも、アイテムのコンセンサスランキングの推論を実行します。評価者の誰もランク付けしていないアイテムは、最大事後確率(MAP)コンセンサスランキングに影響を与えないため、無視できることを証明します。評価者が多数または異質である場合、クラスター固有のコンセンサスランキングを使用して、評価者を均質なサブグループにクラスタリングするための混合モデルを提案します。私たちは、完全に確率的な分析を可能にし、不確実性の一貫した定量化につながる近似確率アルゴリズムを開発します。私たちは、いくつかのアイテムのランク付けに基づいて評価者のクラスメンバーシップを確率的に予測し、推奨システムで必要な場合に、欠落している個人の好みを予測します。私たちは、いくつかの実験データセットとベンチマークデータセットを使用して、このアプローチをテストします。

A Study of the Classification of Low-Dimensional Data with Supervised Manifold Learning
教師あり多様体学習による低次元データの分類に関する一考察

Supervised manifold learning methods learn data representations by preserving the geometric structure of data while enhancing the separation between data samples from different classes. In this work, we propose a theoretical study of supervised manifold learning for classification. We consider nonlinear dimensionality reduction algorithms that yield linearly separable embeddings of training data and present generalization bounds for this type of algorithms. A necessary condition for satisfactory generalization performance is that the embedding allow the construction of a sufficiently regular interpolation function in relation with the separation margin of the embedding. We show that for supervised embeddings satisfying this condition, the classification error decays at an exponential rate with the number of training samples. Finally, we examine the separability of supervised nonlinear embeddings that aim to preserve the low-dimensional geometric structure of data based on graph representations. The proposed analysis is supported by experiments on several real data sets.

教師あり多様体学習法は、データの幾何学的構造を保持しながら、異なるクラスのデータサンプル間の分離を強化することで、データ表現を学習します。この研究では、分類のための教師あり多様体学習の理論的研究を提案します。トレーニングデータの線形分離可能な埋め込みを生成する非線形次元削減アルゴリズムを検討し、このタイプのアルゴリズムの一般化境界を示します。満足のいく一般化パフォーマンスの必要条件は、埋め込みが、埋め込みの分離マージンに関連して十分に規則的な補間関数の構築を可能にすることです。この条件を満たす教師あり埋め込みの場合、分類エラーはトレーニングサンプルの数とともに指数関数的に減少することを示します。最後に、グラフ表現に基づいてデータの低次元幾何学的構造を保持することを目的とした教師あり非線形埋め込みの分離可能性を調べます。提案された分析は、いくつかの実際のデータセットでの実験によってサポートされています。

Provably Correct Algorithms for Matrix Column Subset Selection with Selectively Sampled Data
選択的にサンプリングされたデータを使用した行列列サブセット選択のための証明可能な正しいアルゴリズム

We consider the problem of matrix column subset selection, which selects a subset of columns from an input matrix such that the input can be well approximated by the span of the selected columns. Column subset selection has been applied to numerous real-world data applications such as population genetics summarization, electronic circuits testing and recommendation systems. In many applications the complete data matrix is unavailable and one needs to select representative columns by inspecting only a small portion of the input matrix. In this paper we propose the first provably correct column subset selection algorithms for partially observed data matrices. Our proposed algorithms exhibit different merits and limitations in terms of statistical accuracy, computational efficiency, sample complexity and sampling schemes, which provides a nice exploration of the tradeoff between these desired properties for column subset selection. The proposed methods employ the idea of feedback driven sampling and are inspired by several sampling schemes previously introduced for low-rank matrix approximation tasks (Drineas et al., 2008; Frieze et al., 2004; Deshpande and Vempala, 2006; Krishnamurthy and Singh, 2014). Our analysis shows that, under the assumption that the input data matrix has incoherent rows but possibly coherent columns, all algorithms provably converge to the best low-rank approximation of the original data as number of selected columns increases. Furthermore, two of the proposed algorithms enjoy a relative error bound, which is preferred for column subset selection and matrix approximation purposes. We also demonstrate through both theoretical and empirical analysis the power of feedback driven sampling compared to uniform random sampling on input matrices with highly correlated columns.

私たちは、入力行列から列のサブセットを選択し、選択された列の範囲で入力をうまく近似できるようにする行列列サブセット選択の問題を検討します。列サブセット選択は、集団遺伝学の要約、電子回路のテスト、推奨システムなど、数多くの実際のデータアプリケーションに適用されてきました。多くのアプリケーションでは、完全なデータマトリックスは利用できず、入力行列のごく一部だけを検査して代表的な列を選択する必要があります。この論文では、部分的に観測されたデータマトリックスに対して初めて正しいことが証明された列サブセット選択アルゴリズムを提案します。提案されたアルゴリズムは、統計的精度、計算効率、サンプルの複雑さ、サンプリングスキームの点でさまざまなメリットと制限を示しており、列サブセット選択のこれらの望ましい特性間のトレードオフをうまく検討できます。提案された方法は、フィードバック駆動型サンプリングの考え方を採用しており、低ランク行列近似タスク用に以前に導入されたいくつかのサンプリング方式(Drineas他、2008年、Frieze他、2004年、DeshpandeとVempala、2006年、KrishnamurthyとSingh、2014年)にヒントを得ています。分析によると、入力データ行列の行は一貫性がなく、列は一貫性がある可能性があるという仮定の下では、選択された列の数が増えるにつれて、すべてのアルゴリズムが元のデータの最良の低ランク近似に収束することが証明されています。さらに、提案されたアルゴリズムのうち2つは、列のサブセット選択と行列近似の目的に適した相対誤差境界を備えています。また、相関の高い列を持つ入力行列に対する均一ランダムサンプリングと比較したフィードバック駆動型サンプリングの威力を、理論的および実証的な分析の両方を通じて実証しています。

Cost-Sensitive Learning with Noisy Labels
ノイズの多いラベルによるコスト重視の学習

We study binary classification in the presence of class-conditional random noise, where the learner gets to see labels that are flipped independently with some probability, and where the flip probability depends on the class. Our goal is to devise learning algorithms that are efficient and statistically consistent with respect to commonly used utility measures. In particular, we look at a family of measures motivated by their application in domains where cost-sensitive learning is necessary (for example, when there is class imbalance). In contrast to most of the existing literature on consistent classification, which is limited to the classical 0-1 loss, our analysis includes more general utility measures such as the AM measure (arithmetic mean of True Positive Rate and True Negative Rate). For this problem of cost-sensitive learning under class-conditional random noise, we develop two approaches that are based on suitably modifying surrogate losses. First, we provide a simple unbiased estimator of any loss, and obtain performance bounds for empirical utility maximization in the presence of i.i.d. data with noisy labels. If the loss function satisfies a simple symmetry condition, we show that using the unbiased estimator leads to an efficient algorithm for empirical maximization. Second, by leveraging a reduction of risk minimization under noisy labels to classification with weighted 0-1 loss, we suggest the use of a simple weighted surrogate loss, for which we are able to obtain strong utility bounds. This approach implies that methods already used in practice, such as biased SVM and weighted logistic regression, are provably noise-tolerant. For two practically important measures in our family, we show that the proposed methods are competitive with respect to recently proposed methods for dealing with label noise in several benchmark data sets.

私たちは、クラス条件付きランダムノイズの存在下でのバイナリ分類を研究します。この設定では、学習者は、ある確率で独立して反転されたラベルを見ることになり、反転の確率はクラスに依存します。我々の目標は、一般的に使用される効用尺度に関して、効率的で統計的に一貫性のある学習アルゴリズムを考案することです。特に、コストに敏感な学習が必要な領域(たとえば、クラスの不均衡がある場合)への適用を動機とした尺度のファミリーを検討します。一貫性のある分類に関する既存の文献のほとんどが古典的な0-1損失に限定されているのとは対照的に、我々の分析には、AM尺度(真陽性率と真陰性率の算術平均)などのより一般的な効用尺度が含まれます。クラス条件付きランダムノイズ下でのコストに敏感な学習のこの問題に対して、私たちは、代理損失を適切に修正することに基づく2つのアプローチを開発します。まず、あらゆる損失の単純な不偏推定量を提供し、ノイズの多いラベルを持つi.i.d.データの存在下での経験的効用最大化のパフォーマンス境界を取得します。損失関数が単純な対称条件を満たす場合、不偏推定量を使用すると、経験的最大化のための効率的なアルゴリズムにつながることを示します。次に、ノイズの多いラベルでのリスク最小化を重み付き0-1損失による分類に帰着させることで、強力な効用境界を取得できる単純な重み付き代理損失の使用を提案します。このアプローチは、バイアス付きSVMや重み付きロジスティック回帰など、実際にすでに使用されている方法が証明可能な形でノイズ耐性を持つことを意味します。私たちのファミリーの2つの実際的に重要な尺度について、提案された方法が、いくつかのベンチマークデータセットでラベルノイズを処理するために最近提案された方法と比較して競争力があることを示します。
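
The "unbiased estimator of any loss" mentioned above can be illustrated with one standard construction for class-conditional label noise. The sketch below is an assumption-laden example, not necessarily the exact form used in the paper: labels are in {+1, -1}, the flip rates `rho_pos` and `rho_neg` are assumed known with `rho_pos + rho_neg < 1`, and all function names are hypothetical.

```python
import math


def logistic_loss(t, y):
    """Logistic loss of score t against a label y in {+1, -1}."""
    return math.log1p(math.exp(-y * t))


def unbiased_loss(loss, t, noisy_y, rho_pos, rho_neg):
    """Noise-corrected loss evaluated on an observed (possibly flipped) label.

    rho_pos / rho_neg: probability that a true +1 / -1 label is flipped.
    """
    if rho_pos + rho_neg >= 1.0:
        raise ValueError("requires rho_pos + rho_neg < 1")
    rho_same = rho_pos if noisy_y == 1 else rho_neg   # flip rate of class noisy_y
    rho_other = rho_neg if noisy_y == 1 else rho_pos  # flip rate of class -noisy_y
    numerator = (1.0 - rho_other) * loss(t, noisy_y) - rho_same * loss(t, -noisy_y)
    return numerator / (1.0 - rho_pos - rho_neg)


def expected_corrected_loss(loss, t, true_y, rho_pos, rho_neg):
    """Expectation of the corrected loss over the random flip of true_y."""
    flip = rho_pos if true_y == 1 else rho_neg
    kept = unbiased_loss(loss, t, true_y, rho_pos, rho_neg)
    flipped = unbiased_loss(loss, t, -true_y, rho_pos, rho_neg)
    return (1.0 - flip) * kept + flip * flipped
```

Averaging the corrected loss over the label noise recovers the clean loss, so `expected_corrected_loss(logistic_loss, t, y, ...)` equals `logistic_loss(t, y)` up to rounding.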

Normal Bandits of Unknown Means and Variances
未知の平均と分散の正規バンディット

Consider the problem of sampling sequentially from a finite number of $N \geq 2$ populations, specified by random variables $X^i_k$, $ i = 1,\ldots , N,$ and $k = 1, 2, \ldots$; where $X^i_k$ denotes the outcome from population $i$ the $k^{th}$ time it is sampled. It is assumed that for each fixed $i$, $\{ X^i_k \}_{k \geq 1}$ is a sequence of i.i.d. normal random variables, with unknown mean $\mu_i$ and unknown variance $\sigma_i^2$. The objective is to have a policy $\pi$ for deciding from which of the $N$ populations to sample at any time $t=1,2,\ldots$ so as to maximize the expected sum of outcomes of $n$ total samples or equivalently to minimize the regret due to lack of information on the parameters $\mu_i$ and $\sigma_i^2$. In this paper, we present a simple inflated sample mean (ISM) index policy that is asymptotically optimal in the sense of Theorem 4 below. This resolves a standing open problem from \cite{bkmab96}. Additionally, finite horizon regret bounds are given.

確率変数$X^i_k$, $i = 1,\ldots , N,$ $k = 1, 2, \ldots$で指定された有限個($N \geq 2$)の母集団から逐次的にサンプリングする問題を考えます。ここで、$X^i_k$は、母集団$i$が$k$回目にサンプリングされたときの結果を示します。各固定$i$について、$\{ X^i_k \}_{k \geq 1}$は、未知の平均$\mu_i$と未知の分散$\sigma_i^2$を持つi.i.d.正規確率変数の列であると仮定します。目的は、合計$n$回のサンプルの結果の期待和を最大化するように、あるいは同等に、パラメータ$\mu_i$と$\sigma_i^2$に関する情報の不足によるリグレットを最小化するように、任意の時点$t=1,2,\ldots$で$N$個の母集団のどれからサンプリングするかを決定するポリシー$\pi$を得ることです。この論文では、以下の定理4の意味で漸近的に最適な、単純な膨張サンプル平均(ISM)インデックスポリシーを示します。これにより、\cite{bkmab96}以来の未解決問題が解決されます。さらに、有限ホライズンのリグレット限界も与えられます。

Automatic Differentiation in Machine Learning: a Survey
機械学習における自動微分:調査

Derivatives, mostly in the form of gradients and Hessians, are ubiquitous in machine learning. Automatic differentiation (AD), also called algorithmic differentiation or simply “autodiff”, is a family of techniques similar to but more general than backpropagation for efficiently and accurately evaluating derivatives of numeric functions expressed as computer programs. AD is a small but established field with applications in areas including computational fluid dynamics, atmospheric sciences, and engineering design optimization. Until very recently, the fields of machine learning and AD have largely been unaware of each other and, in some cases, have independently discovered each other’s results. Despite its relevance, general-purpose AD has been missing from the machine learning toolbox, a situation slowly changing with its ongoing adoption under the names “dynamic computational graphs” and “differentiable programming”. We survey the intersection of AD and machine learning, cover applications where AD has direct relevance, and address the main implementation techniques. By precisely defining the main differentiation techniques and their interrelationships, we aim to bring clarity to the usage of the terms “autodiff”, “automatic differentiation”, and “symbolic differentiation” as these are encountered more and more in machine learning settings.

導関数は、主に勾配やヘッシアンの形で、機械学習のいたるところで使われています。自動微分(AD)は、アルゴリズム微分または単に「オートディフ」とも呼ばれ、バックプロパゲーションに似ていますが、より一般的な手法の1つで、コンピュータプログラムとして表現された数値関数の導関数を効率的かつ正確に評価します。ADは小規模ですが確立された分野であり、計算流体力学、大気科学、エンジニアリング設計最適化などの分野で応用されています。ごく最近まで、機械学習とADの分野は、お互いにほとんど認識されておらず、場合によっては、お互いの結果を独自に発見していました。その関連性にもかかわらず、汎用ADは機械学習ツールボックスに含まれていませんでしたが、「動的計算グラフ」や「微分可能プログラミング」という名前で継続的に採用されるようになり、状況は徐々に変化しています。ADと機械学習の交差点を調査し、ADが直接関連するアプリケーションを取り上げ、主要な実装手法について説明します。主な微分化手法とそれらの相互関係を正確に定義することにより、機械学習の設定でますます多く見られる「autodiff」、「自動微分化」、および「記号微分化」という用語の使用を明確にすることを目指しています。
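
To make the notion of evaluating derivatives of "numeric functions expressed as computer programs" concrete, here is a minimal sketch of forward-mode AD using dual numbers. It is an illustration only, covering addition, multiplication and sine; the class `Dual` and the helpers `sin` and `derivative` are hypothetical names for this example and do not come from any library discussed in the survey.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative with respect to one input."""
    value: float
    deriv: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative zero)."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def sin(x):
    """Sine with the chain rule applied to the derivative part."""
    x = _lift(x)
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


def derivative(f, x):
    """Derivative of a scalar program f at x, exact up to rounding."""
    return f(Dual(float(x), 1.0)).deriv
```

For example, `derivative(lambda x: 3 * x * x + sin(x), 2.0)` evaluates `6 * 2 + cos(2)` without symbolic manipulation or finite differences, which is the distinction the survey sets out to clarify.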

HyperTools: a Python Toolbox for Gaining Geometric Insights into High-Dimensional Data
HyperTools: 高次元データに幾何学的な洞察を得るための Python ツールボックス

Dimensionality reduction algorithms have played a foundational role in facilitating the deep understanding of complex high-dimensional data. One particularly useful application of dimensionality reduction techniques is in data visualization. Low-dimensional visualizations can help practitioners understand where machine learning algorithms might leverage the geometric properties of a dataset to improve performance. Another challenge is to generalize insights across datasets [e.g. data from multiple modalities describing the same system (Haxby et al., 2011), artwork or photographs of similar content in different styles (Zhu et al., 2017), etc.]. Several recently developed techniques (e.g. Haxby et al., 2011; Chen et al., 2015) use the procrustean transformation (Schonemann, 1966) to align the geometries of two or more spaces so that data with different axes may be plotted in a common space. We propose that each of these techniques (dimensionality reduction, alignment, and visualization) applied in sequence should be cast as a single conceptual hyperplot operation for gaining geometric insights into high-dimensional data. Our Python toolbox enables this operation in a single (highly flexible) function call.

次元削減アルゴリズムは、複雑な高次元データの深い理解を促進する上で基礎的な役割を果たしてきました。次元削減技術の特に有用な応用の1つは、データの視覚化です。低次元の視覚化は、機械学習アルゴリズムがデータセットの幾何学的特性を利用してパフォーマンスを向上させる可能性のある場所を実践者が理解するのに役立ちます。もう1つの課題は、データセット全体にわたって洞察を一般化することです[例:同じシステムを記述する複数のモダリティからのデータ(Haxbyら、2011年)、異なるスタイルの類似コンテンツのアートワークまたは写真(Zhuら、2017年)など]。最近開発されたいくつかの技術(例: Haxbyら、2011年、Chenら、2015年)では、プロクルステス変換(Schonemann、1966年)を使用して2つ以上の空間の幾何学を揃え、異なる軸を持つデータを共通の空間にプロットできるようにします。私たちは、これらの各手法（次元削減、アライメント、視覚化）を順番に適用して、高次元データに対する幾何学的洞察を得るための単一の概念的なハイパープロット操作としてキャストすることを提案します。私たちのPythonツールボックスは、この操作を単一の（非常に柔軟な）関数呼び出しで実行できるようにします。

Variational Fourier Features for Gaussian Processes
ガウス過程の変分フーリエ特徴

This work brings together two powerful concepts in Gaussian processes: the variational approach to sparse approximation and the spectral representation of Gaussian processes. This gives rise to an approximation that inherits the benefits of the variational approach but with the representational power and computational scalability of spectral representations. The work hinges on a key result that there exist spectral features related to a finite domain of the Gaussian process which exhibit almost-independent covariances. We derive these expressions for Matérn kernels in one dimension, and generalize to more dimensions using kernels with specific structures. Under the assumption of additive Gaussian noise, our method requires only a single pass through the data set, making for very fast and accurate computation. We fit a model to 4 million training points in just a few minutes on a standard laptop. With non-conjugate likelihoods, our MCMC scheme reduces the cost of computation from $\mathcal{O}(NM^2)$ (for a sparse Gaussian process) to $\mathcal{O}(NM)$ per iteration, where $N$ is the number of data and $M$ is the number of features.

この研究では、ガウス過程における2つの強力な概念、すなわちスパース近似への変分アプローチとガウス過程のスペクトル表現を統合します。これにより、変分アプローチの利点を継承しながらも、スペクトル表現の表現力と計算スケーラビリティを備えた近似が生まれます。この研究では、ガウス過程の有限領域に関連するスペクトル特性が存在し、ほぼ独立した共分散を示すという重要な結果に依存しています。私たちは、1次元のMatérnカーネルに対してこれらの式を導出し、特定の構造を持つカーネルを使用してより多くの次元に一般化します。加法的なガウスノイズを前提とすると、私たちの方法ではデータセットを1回通過するだけで済むため、非常に高速で正確な計算が可能になります。標準的なラップトップで、わずか数分で400万のトレーニングポイントにモデルを適合させました。非共役尤度を使用すると、MCMC方式では計算コストが反復ごとに$\mathcal{O}(NM^2)$（スパースガウス過程の場合）から$\mathcal{O}(NM)$に削減されます。ここで、$N$はデータ数、$M$は特徴数です。

On Binary Embedding using Circulant Matrices
循環行列を用いたバイナリ埋め込みについて

Binary embeddings provide efficient and powerful ways to perform operations on large scale data. However, binary embedding typically requires long codes in order to preserve the discriminative power of the input space. Thus binary coding methods traditionally suffer from high computation and storage costs in such a scenario. To address this problem, we propose Circulant Binary Embedding (CBE), which generates binary codes by projecting the data with a circulant matrix. The circulant structure allows us to use Fast Fourier Transform algorithms to speed up the computation. For obtaining $k$-bit binary codes from $d$-dimensional data, our method improves the time complexity from $\mathcal{O}(dk)$ to $\mathcal{O}(d\log{d})$, and the space complexity from $\mathcal{O}(dk)$ to $\mathcal{O}(d)$. We study two settings, which differ in the way we choose the parameters of the circulant matrix. In the first, the parameters are chosen randomly and in the second, the parameters are learned using the data. For randomized CBE, we give a theoretical analysis comparing it with binary embedding using an unstructured random projection matrix. The challenge here is to show that the dependencies in the entries of the circulant matrix do not lead to a loss in performance. In the second setting, we design a novel time-frequency alternating optimization to learn data-dependent circulant projections, which alternately minimizes the objective in the original and Fourier domains. In both settings, we show by extensive experiments that the CBE approach gives much better performance than the state-of-the-art approaches if we fix a running time, and provides much faster computation with negligible performance degradation if we fix the number of bits in the embedding.

バイナリ埋め込みは、大規模データに対して操作を実行するための効率的で強力な方法を提供します。ただし、バイナリ埋め込みでは通常、入力空間の識別力を維持するために長いコードが必要です。したがって、バイナリコーディング方法は、このようなシナリオでは伝統的に高い計算コストとストレージコストに悩まされています。この問題に対処するために、巡回バイナリ埋め込み(CBE)を提案します。これは、データを巡回行列で投影することでバイナリコードを生成します。巡回構造により、高速フーリエ変換アルゴリズムを使用して計算を高速化できます。$d$次元データから$k$ビットのバイナリコードを取得するために、私たちの方法は、時間計算量を$\mathcal{O}(dk)$から$\mathcal{O}(d\log{d})$に、空間計算量を$\mathcal{O}(dk)$から$\mathcal{O}(d)$に改善します。巡回行列のパラメータを選択する方法が異なる2つの設定を検討します。最初の設定では、パラメータはランダムに選択され、2番目の設定では、データを使用してパラメータが学習されます。ランダム化CBEについては、非構造化ランダム射影行列を使用したバイナリ埋め込みと比較する理論的分析を行います。ここでの課題は、巡回行列のエントリの依存関係がパフォーマンスの低下につながらないことを示すことです。2番目の設定では、データ依存の巡回射影を学習するための新しい時間周波数交互最適化を設計します。これにより、元の領域とフーリエ領域で目的が交互に最小化されます。両方の設定において、実行時間を固定するとCBEアプローチは最先端のアプローチよりもはるかに優れたパフォーマンスを提供し、埋め込みのビット数を固定するとパフォーマンスの低下が無視できるほどの高速計算を提供することを、広範な実験によって示しています。
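
The complexity claim above rests on the fact that multiplying by a circulant matrix is a circular convolution, which the FFT computes in $\mathcal{O}(d\log{d})$ time while storing only the $d$ defining parameters. The sketch below shows that step and a sign-based binarization; it is a simplified illustration, assuming a circulant matrix whose first column is `r`, and it leaves out any preprocessing or learning of `r` described in the paper. The function names are hypothetical.

```python
import numpy as np


def circulant_project(x, r):
    """Compute circ(r) @ x via the FFT, where circ(r)[i, j] = r[(i - j) % d]."""
    x = np.asarray(x, dtype=float)
    r = np.asarray(r, dtype=float)
    if x.ndim != 1 or x.shape != r.shape:
        raise ValueError("x and r must be 1-D arrays of equal length")
    # Circular convolution becomes an elementwise product in the Fourier domain.
    return np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)).real


def circulant_binary_code(x, r, k):
    """k-bit code: signs of the first k entries of the circulant projection."""
    projection = circulant_project(x, r)
    if not 1 <= k <= projection.shape[0]:
        raise ValueError("k must satisfy 1 <= k <= d")
    return (projection[:k] >= 0).astype(np.uint8)


def dense_circulant(r):
    """Explicit d x d circulant matrix, used only to check the FFT route."""
    r = np.asarray(r, dtype=float)
    d = r.shape[0]
    idx = (np.arange(d)[:, None] - np.arange(d)[None, :]) % d
    return r[idx]
```

`dense_circulant` costs $\mathcal{O}(d^2)$ memory and exists only for verification; the FFT path never materializes the matrix.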

Community Extraction in Multilayer Networks with Heterogeneous Community Structure
異種コミュニティ構造を持つ多層ネットワークにおけるコミュニティ抽出

Multilayer networks are a useful way to capture and model multiple, binary or weighted relationships among a fixed group of objects. While community detection has proven to be a useful exploratory technique for the analysis of single-layer networks, the development of community detection methods for multilayer networks is still in its infancy. We propose and investigate a procedure, called Multilayer Extraction, that identifies densely connected vertex-layer sets in multilayer networks. Multilayer Extraction makes use of a significance based score that quantifies the connectivity of an observed vertex-layer set through comparison with a fixed degree random graph model. Multilayer Extraction directly handles networks with heterogeneous layers where community structure may be different from layer to layer. The procedure can capture overlapping communities, as well as background vertex-layer pairs that do not belong to any community. We establish consistency of the vertex-layer set optimizer of our proposed multilayer score under the multilayer stochastic block model. We investigate the performance of Multilayer Extraction on three applications and a test bed of simulations. Our theoretical and numerical evaluations suggest that Multilayer Extraction is an effective exploratory tool for analyzing complex multilayer networks. Code is publicly available at github.com/jdwilson4/MultilayerExtraction.

多層ネットワークは、固定されたオブジェクトグループ間の複数の、2値または重み付けされた関係をキャプチャしてモデル化する便利な方法です。コミュニティ検出は単層ネットワークの分析に便利な探索的手法であることが証明されていますが、多層ネットワークのコミュニティ検出方法の開発はまだ初期段階にあります。私たちは、多層ネットワーク内の密に接続された頂点層セットを識別する多層抽出と呼ばれる手順を提案し、調査します。多層抽出では、固定次数ランダムグラフモデルとの比較を通じて、観測された頂点層セットの接続性を定量化する、有意性ベースのスコアを使用します。多層抽出は、コミュニティ構造が層ごとに異なる可能性がある異種層を持つネットワークを直接処理します。この手順では、重複するコミュニティだけでなく、どのコミュニティにも属さない背景の頂点層ペアもキャプチャできます。私たちは、多層確率ブロックモデルの下で、提案された多層スコアの頂点層セットオプティマイザーの一貫性を確立します。3つのアプリケーションとシミュレーションのテストベッドで多層抽出のパフォーマンスを調査します。私たちの理論的および数値的評価は、Multilayer Extractionが複雑な多層ネットワークを分析するための効果的な探索ツールであることを示唆しています。公開されているコードは、github.com/jdwilson4/MultilayerExtractionで入手できます。

Faithfulness of Probability Distributions and Graphs
確率分布とグラフの忠実度

A main question in graphical models and causal inference is whether, given a probability distribution $P$ (which is usually an underlying distribution of data), there is a graph (or graphs) to which $P$ is faithful. The main goal of this paper is to provide a theoretical answer to this problem. We work with general independence models, which contain probabilistic independence models as a special case. We exploit a generalization of ordering, called preordering, of the nodes of (mixed) graphs. This allows us to provide sufficient conditions for a given independence model to be Markov to a graph with the minimum possible number of edges, and more importantly, necessary and sufficient conditions for a given probability distribution to be faithful to a graph. We present our results for the general case of mixed graphs, but specialize the definitions and results to the better-known subclasses of undirected (concentration) and bidirected (covariance) graphs as well as directed acyclic graphs.

グラフィカルモデルと因果推論における主な疑問は、確率分布$P$ (通常は基礎となるデータ分布)が与えられた場合、$P$が忠実なグラフ(または複数のグラフ)が存在するかどうかです。この論文の主な目的は、この問題に対する理論的な答えを提供することです。私たちは、確率的独立モデルを特殊なケースとして含む一般的な独立モデルを扱います。私たちは、(混合)グラフのノードの順序付けの一般化(事前順序付けと呼ばれる)を活用します。これにより、与えられた独立モデルが可能な限り最小のエッジ数を持つグラフに対してマルコフとなるための十分な条件、さらに重要なことに、与えられた確率分布がグラフに忠実となるための必要かつ十分な条件を提供できます。私たちは、混合グラフの一般的なケースの結果を示しますが、定義と結果は、無向(集中)グラフと双向(共分散)グラフ、および有向非巡回グラフのよく知られたサブクラスに特化しています。

Matrix Completion with Noisy Entries and Outliers
ノイズの多いエントリと外れ値による行列補完

This paper considers the problem of matrix completion when the observed entries are noisy and contain outliers. It begins by introducing a new optimization criterion for which the recovered matrix is defined as its solution. This criterion uses the celebrated Huber function from the robust statistics literature to downweight the effects of outliers. A practical algorithm is developed to solve the optimization involved. This algorithm is fast, straightforward to implement, and monotonically convergent. Furthermore, the proposed methodology is theoretically shown to be stable in a well-defined sense. Its promising empirical performance is demonstrated via a sequence of simulation experiments, including image inpainting.

この論文では、観測されたエントリがノイズが多く、外れ値が含まれている場合の行列補完の問題について考察します。まず、復元された行列を解として定義する新しい最適化基準を導入します。この基準では、ロバスト統計の文献から有名なフーバー関数を使用して、外れ値の影響を小さく評価します。関連する最適化を解決するための実用的なアルゴリズムが開発されています。このアルゴリズムは、高速で実装が簡単で、単調収束性です。さらに、提案された方法論は、明確に定義された意味で安定していることが理論的に示されています。その有望な経験的性能は、イメージインペインティングを含む一連のシミュレーション実験によって実証されています。
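
For readers unfamiliar with the Huber function mentioned above, the sketch below shows its usual textbook form and the weight it implicitly assigns to each residual. The scaling (a factor of 1/2 on the quadratic part) and the default threshold `c=1.345` are common conventions assumed here, not values taken from the paper, and the function names are hypothetical.

```python
import numpy as np


def huber(residual, c=1.345):
    """Huber function: quadratic for |r| <= c, linear beyond (requires c > 0)."""
    r = np.abs(np.asarray(residual, dtype=float))
    return np.where(r <= c, 0.5 * r**2, c * r - 0.5 * c**2)


def huber_weight(residual, c=1.345):
    """Relative weight of a residual: 1 inside [-c, c], c / |r| outside."""
    r = np.abs(np.asarray(residual, dtype=float))
    # np.maximum(r, c) equals c inside the threshold, giving weight exactly 1.
    return c / np.maximum(r, c)
```

Residuals within the threshold are treated as in least squares, while larger ones grow only linearly and receive weights below 1, which is how outlying entries are downweighted.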

Regularization and the small-ball method II: complexity dependent error rates
正則化とスモールボール法II:複雑性依存のエラー率

We study estimation properties of regularized procedures of the form $\hat f \in\arg\min_{f\in F}\Big(\frac{1}{N}\sum_{i=1}^N\big(Y_i-f(X_i)\big)^2+\lambda \Psi(f)\Big)$ for a convex class of functions $F$, regularization function $\Psi(\cdot)$ and some well chosen regularization parameter $\lambda$, where the given data is an independent sample $(X_i, Y_i)_{i=1}^N$. We obtain bounds on the $L_2$ estimation error rate that depend on the complexity of the true model $F^*:=\{f\in F: \Psi(f)\leq\Psi(f^*)\}$, where $f^*\in\arg\min_{f\in F}\mathbb{E}(Y-f(X))^2$ and the $(X_i,Y_i)$’s are independent and distributed as $(X,Y)$. Our estimate holds under weak stochastic assumptions (one of which is a small-ball condition satisfied by $F$) and for rather flexible choices of regularization functions $\Psi(\cdot)$. Moreover, the result holds in the learning theory framework: we do not assume any a-priori connection between the output $Y$ and the input $X$. As a proof of concept, we apply our general estimation bound to various choices of $\Psi$, for example, the $\ell_p$ and $S_p$-norms (for $p\geq1$), weak-$\ell_p$, atomic norms, max-norm and SLOPE. In many cases, the estimation rate almost coincides with the minimax rate in the class $F^*$.

私たちは、関数の凸クラス$F$、正規化関数$\Psi(\cdot)$、適切に選択された正規化パラメータ$\lambda$に対して、形式$\hat f \in\arg\min_{f\in F}\Big(\frac{1}{N}\sum_{i=1}^N\big(Y_i-f(X_i)\big)^2+\lambda \Psi(f)\Big)$の正規化手順の推定特性を研究します。ここで、与えられたデータは、独立したサンプル$(X_i, Y_i)_{i=1}^N$です。私たちは、真のモデル$F^*:=\{f\in F: \Psi(f)\leq\Psi(f^*)\}$の複雑さに依存する$L_2$推定誤差率の境界を得ます。ここで、$f^*\in\arg\min_{f\in F}\mathbb{E}(Y-f(X))^2$であり、$(X_i,Y_i)$は独立しており、$(X,Y)$として分布します。我々の推定は、弱い確率的仮定(その1つは$F$によって満たされるスモールボール条件)と、かなり柔軟な正則化関数$\Psi(\cdot)$の選択の下で成立します。さらに、結果は学習理論のフレームワークで成立します。つまり、出力$Y$と入力$X$の間に事前接続を仮定しません。概念実証として、我々は一般的な推定境界を、$\Psi$のさまざまな選択、例えば、$\ell_p$および$S_p$ノルム（$p\geq1$の場合）、弱$\ell_p$、原子ノルム、最大ノルム、SLOPEに適用します。多くの場合、推定率はクラス$F^*$のミニマックス率とほぼ一致します。

Following the Leader and Fast Rates in Online Linear Prediction: Curved Constraint Sets and Other Regularities
オンライン線形予測におけるリーダーと高速レートの追跡: 曲線制約セットとその他の規則性

Follow the leader (FTL) is a simple online learning algorithm that is known to perform well when the loss functions are convex and positively curved. In this paper we ask whether there are other settings when FTL achieves low regret. In particular, we study the fundamental problem of linear prediction over a convex, compact domain with non-empty interior. Amongst other results, we prove that the curvature of the boundary of the domain can act as if the losses were curved: In this case, we prove that as long as the mean of the loss vectors has positive length bounded away from zero, FTL enjoys logarithmic regret, while for polytope domains and stochastic data it enjoys finite expected regret. The former result is also extended to strongly convex domains by establishing an equivalence between the strong convexity of sets and the minimum curvature of their boundary, which may be of independent interest. Building on a previously known meta-algorithm, we also get an algorithm that simultaneously enjoys the worst-case guarantees and the smaller regret of FTL when the data is ‘easy’. Finally, we show that such guarantees are achievable directly (e.g., by the follow the regularized leader algorithm or by a shrinkage-based variant of FTL) when the constraint set is an ellipsoid.

リーダーに従う(FTL)は、損失関数が凸で正に曲がっている場合にパフォーマンスが良好であることが知られている、単純なオンライン学習アルゴリズムです。この論文では、FTLが低い後悔を達成する他の設定があるかどうかを問います。特に、内部が空でない凸でコンパクトなドメイン上の線形予測の基本問題を研究します。他の結果の中でも、ドメインの境界の曲率は、損失が曲がっているかのように機能することを証明します。この場合、損失ベクトルの平均がゼロから離れた正の長さである限り、FTLは対数後悔を享受しますが、多面体ドメインと確率的データの場合は有限の期待後悔を享受することを証明します。前者の結果は、セットの強い凸性と境界の最小曲率との同等性を確立することにより、強い凸ドメインにも拡張されます。これは、独立した関心事である可能性があります。以前に知られているメタアルゴリズムに基づいて、データが「簡単」な場合に、最悪のケースの保証とFTLのより小さな後悔を同時に享受するアルゴリズムも得られます。最後に、制約セットが楕円体である場合、そのような保証は直接達成可能であることを示します(たとえば、正規化されたリーダーに従うアルゴリズムまたはFTLの収縮ベースのバリアントによって)。

Generalized Conditional Gradient for Sparse Estimation
スパース推定のための一般化条件付き勾配

Sparsity is an important modeling tool that expands the applicability of convex formulations for data analysis; however, it also creates significant challenges for efficient algorithm design. In this paper we investigate the generalized conditional gradient (GCG) algorithm for solving sparse optimization problems, demonstrating that, with some enhancements, it can provide a more efficient alternative to current state-of-the-art approaches. After studying the convergence properties of GCG for general convex composite problems, we develop efficient methods for evaluating polar operators, a subroutine that is required in each GCG iteration. In particular, we show how the polar operator can be efficiently evaluated in learning low-rank matrices, instantiated with detailed examples on matrix completion and dictionary learning. A further improvement is achieved by interleaving GCG with fixed-rank local subspace optimization. A series of experiments on matrix completion, multi-class classification, and multi-view dictionary learning shows that the proposed method can significantly reduce the training cost of current alternatives.

スパース性は、データ分析における凸定式化の適用範囲を広げる重要なモデリングツールですが、効率的なアルゴリズム設計に大きな課題も生じます。この論文では、スパース最適化問題を解決するための一般化条件付き勾配(GCG)アルゴリズムを調査し、いくつかの機能強化により、現在の最先端のアプローチよりも効率的な代替手段を提供できることを実証します。一般的な凸複合問題に対するGCGの収束特性を研究した後、GCGの各反復で必要なサブルーチンである極演算子を評価するための効率的な方法を開発します。特に、極演算子を低ランク行列の学習で効率的に評価する方法を示し、行列補完と辞書学習の詳細な例を示します。GCGを固定ランクのローカルサブスペース最適化とインターリーブすることで、さらに改善されます。行列補完、マルチクラス分類、およびマルチビュー辞書学習に関する一連の実験により、提案された方法が現在の代替手段のトレーニングコストを大幅に削減できることが示されています。
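
To make the "polar operator" subroutine concrete, here is a minimal sketch for the nuclear-norm (trace-norm) case, where the polar is the spectral norm and the maximizing atom is the top singular-vector pair. The function name `nuclear_norm_polar` is an assumption for this example, and a full SVD is used only for clarity; the efficient evaluation schemes developed in the paper are not reproduced.

```python
import numpy as np

def nuclear_norm_polar(G):
    """Polar of the nuclear norm at G: max of <S, G> over nuclear-norm-one S.

    Returns the optimal value (the spectral norm of G) and a rank-one atom
    u1 v1^T attaining it.
    """
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    atom = np.outer(U[:, 0], Vt[0])
    return s[0], atom

# Hypothetical "gradient" matrix for illustration.
rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))
value, atom = nuclear_norm_polar(G)
print(value, np.sum(atom * G))  # the two numbers should agree
```

In a conditional-gradient iteration the atom returned here would play the role of the new search direction, which is why each iteration adds at most one rank-one component.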

On Computationally Tractable Selection of Experiments in Measurement-Constrained Regression Models
測定制約付き回帰モデルにおける計算的に扱いやすい実験の選択について

We derive computationally tractable methods to select a small subset of experiment settings from a large pool of given design points. The primary focus is on linear regression models, while the technique extends to generalized linear models and Delta’s method (estimating functions of linear regression models) as well. The algorithms are based on a continuous relaxation of an otherwise intractable combinatorial optimization problem, with sampling or greedy procedures as post-processing steps. Formal approximation guarantees are established for both algorithms, and numerical results on both synthetic and real-world data confirm the effectiveness of the proposed methods.

私たちは、与えられた設計ポイントの大きなプールから実験設定の小さなサブセットを選択するために、計算的に扱いやすい方法を導き出します。主な焦点は線形回帰モデルにありますが、この手法は一般化線形モデルやデルタ法(線形回帰モデルの関数を推定する)にも拡張されています。このアルゴリズムは、他の方法では手に負えない組み合わせ最適化問題の継続的な緩和に基づいており、後処理ステップとしてサンプリングまたは貪欲な手順を使用します。どちらのアルゴリズムについても形式近似の保証が確立されており、合成データと実世界データの両方に対する数値結果により、提案された手法の有効性が確認されています。

Consistency, Breakdown Robustness, and Algorithms for Robust Improper Maximum Likelihood Clustering
一貫性、ブレークダウンのロバスト性、およびロバストな不適切な最尤クラスタリングのアルゴリズム

The robust improper maximum likelihood estimator (RIMLE) is a new method for robust multivariate clustering finding approximately Gaussian clusters. It maximizes a pseudo-likelihood defined by adding a component with improper constant density for accommodating outliers to a Gaussian mixture. A special case of the RIMLE is MLE for multivariate finite Gaussian mixture models. In this paper we treat existence, consistency, and breakdown theory for the RIMLE comprehensively. RIMLE’s existence is proved under non-smooth covariance matrix constraints. It is shown that these can be implemented via a computationally feasible Expectation-Conditional Maximization algorithm.

ロバストな不適切な最尤推定量(RIMLE)は、ロバストな多変量クラスタリングで近似的なガウスクラスターを見つけるための新しい方法です。これは、外れ値をガウス混合に収容するために不適切な一定密度の成分を追加することによって定義される擬尤度を最大化します。RIMLEの特殊なケースは、多変量有限ガウス混合モデルのMLEです。この論文では、RIMLEの存在理論、一貫性理論、ブレークダウン理論を包括的に扱う。RIMLEの存在は、非平滑共分散行列制約の下で証明されます。これらは、計算上実現可能な期待値-条件付き最大化アルゴリズムを介して実装できることが示されています。

A Nonconvex Approach for Phase Retrieval: Reshaped Wirtinger Flow and Incremental Algorithms
位相回復のための非凸アプローチ: 再形成された Wirtinger フローと増分アルゴリズム

We study the problem of solving a quadratic system of equations, i.e., recovering a vector signal $\boldsymbol{x}\in \mathbb{R}^n$ from its magnitude measurements $y_i=|\langle \boldsymbol{a}_i, \boldsymbol{x}\rangle|, i=1,\ldots, m$. We develop a gradient descent algorithm (referred to as RWF for reshaped Wirtinger flow) by minimizing the quadratic loss of the magnitude measurements. Compared with Wirtinger flow (WF) (Candes et al., 2015), the loss function of RWF is nonconvex and nonsmooth, but better resembles the least-squares loss when the phase information is also available. We show that for random Gaussian measurements, RWF enjoys linear convergence to the true signal as long as the number of measurements is $\mathcal{O}(n)$. This improves the sample complexity of WF ($\mathcal{O}(n\log n)$), and achieves the same sample complexity as truncated Wirtinger flow (TWF) (Chen and Candes, 2015), but without any sophisticated truncation in the gradient loop. Furthermore, RWF costs less computationally than WF, and runs faster numerically than both WF and TWF. We further develop an incremental (stochastic) version of RWF (IRWF) and connect it with the randomized Kaczmarz method for phase retrieval. We demonstrate with experiments that IRWF outperforms existing incremental as well as batch algorithms.

私たちは、2次方程式系を解く問題、すなわち、ベクトル信号$\boldsymbol{x}\in \mathbb{R}^n$をその振幅測定値$y_i=|\langle \boldsymbol{a}_i, \boldsymbol{x}\rangle|, i=1,…, m$から回復する問題を研究します。私たちは、振幅測定値の2次損失を最小化する勾配降下アルゴリズム(再構成Wirtingerフローの略称RWF)を開発します。Wirtingerフロー(WF) (Candesら, 2015)と比較すると、RWFの損失関数は非凸かつ非滑らかですが、位相情報も利用できる場合は最小二乗損失によく似ています。ランダムガウス測定値の場合、測定値の数が$\mathcal{O}(n)$である限り、RWFは真の信号に線形収束することを示します。これにより、WFのサンプル複雑度($\mathcal{O}(n\log n)$)が改善され、勾配ループでの高度な切り捨てなしで、切り捨てWirtingerフロー(TWF) (Chen and Candes, 2015)と同じサンプル複雑度が達成されます。さらに、RWFはWFよりも計算コストが低く、数値的にはWFとTWFの両方よりも高速に実行されます。さらに、RWFの増分(確率的)バージョン(IRWF)を開発し、位相回復のためのランダム化Kaczmarz法と接続します。実験により、IRWFが既存の増分アルゴリズムおよびバッチアルゴリズムよりも優れていることを実証します。
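
The gradient loop described in this abstract can be sketched for the real-valued case as follows. This is a minimal illustration under stated assumptions: the loss is $\frac{1}{2m}\sum_i (|\langle a_i, z\rangle| - y_i)^2$, the initializer is a plain spectral method (the paper's own initialization differs), and the step size 0.8, the problem sizes, and the function name are choices made for this example.

```python
import numpy as np

def reshaped_wirtinger_flow(A, y, step=0.8, iters=500):
    """Gradient descent on (1/2m) * sum_i (|a_i^T z| - y_i)^2 for real signals."""
    m, n = A.shape
    # Assumed initializer: leading eigenvector of (1/m) sum_i y_i^2 a_i a_i^T,
    # rescaled by the norm estimate sqrt(mean(y^2)).
    Y = (A * (y ** 2)[:, None]).T @ A / m
    _, eigvecs = np.linalg.eigh(Y)
    z = eigvecs[:, -1] * np.sqrt(np.mean(y ** 2))
    for _ in range(iters):
        Az = A @ z
        # Gradient of the magnitude loss wherever a_i^T z != 0.
        grad = A.T @ (Az - y * np.sign(Az)) / m
        z = z - step * grad
    return z

# Hypothetical test problem: Gaussian measurements with m = 10 n.
rng = np.random.default_rng(0)
n, m = 32, 320
x_true = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = np.abs(A @ x_true)

z_hat = reshaped_wirtinger_flow(A, y)
# The signal is only identifiable up to a global sign.
rel_err = min(np.linalg.norm(z_hat - x_true),
              np.linalg.norm(z_hat + x_true)) / np.linalg.norm(x_true)
print("relative error:", rel_err)
```

Note that no truncation of individual measurements appears in the loop, which is the contrast with TWF that the abstract draws.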

Adaptive Randomized Dimension Reduction on Massive Data
大量データに対する適応型ランダム化次元削減

The scalability of statistical estimators is of increasing importance in modern applications. One approach to implementing scalable algorithms is to compress data into a low dimensional latent space using dimension reduction methods. In this paper, we develop an approach for dimension reduction that exploits the assumption of low rank structure in high dimensional data to gain both computational and statistical advantages. We adapt recent randomized low-rank approximation algorithms to provide an efficient solution to principal component analysis (PCA), and we use this efficient solver to improve estimation in large-scale linear mixed models (LMM) for association mapping in statistical genomics. A key observation in this paper is that randomization serves a dual role, improving both computational and statistical performance by implicitly regularizing the covariance matrix estimate of the random effect in an LMM. These statistical and computational advantages are highlighted in our experiments on simulated data and large-scale genomic studies.

統計的推定量のスケーラビリティは、現代のアプリケーションにおいてますます重要になっています。スケーラブルなアルゴリズムを実装する1つの方法は、次元削減法を使用してデータを低次元の潜在空間に圧縮することです。この論文では、高次元データの低ランク構造の仮定を利用して計算と統計の両方の利点を得る次元削減の方法を開発します。最近のランダム化低ランク近似アルゴリズムを適応させて主成分分析(PCA)の効率的なソリューションを提供し、この効率的なソルバーを使用して、統計ゲノミクスの関連マッピングのための大規模線形混合モデル(LMM)の推定を改善します。この論文での主要な観察は、ランダム化が2つの役割を果たし、LMMのランダム効果の共分散行列推定を暗黙的に正規化することで計算と統計の両方のパフォーマンスを向上させることです。これらの統計的および計算上の利点は、シミュレーションデータと大規模ゲノム研究の実験で強調されています。
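
For readers unfamiliar with randomized low-rank approximation, the following sketch shows a generic randomized range finder followed by a small SVD, applied to PCA. It is not the adaptive algorithm of the paper; the function name and the `oversample` and `power_iters` parameters are assumptions chosen for this example.

```python
import numpy as np

def randomized_pca(X, k, oversample=10, power_iters=2, seed=0):
    """Approximate top-k singular values and principal axes of centered X."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    # Random test matrix: project onto k + oversample random directions.
    Omega = rng.standard_normal((Xc.shape[1], k + oversample))
    Q, _ = np.linalg.qr(Xc @ Omega)
    # Optional power iterations sharpen the captured subspace.
    for _ in range(power_iters):
        Q, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Q)
    # Small SVD in the compressed space.
    B = Q.T @ Xc
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    return s[:k], Vt[:k]

# Hypothetical data with exact rank 3.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 50))
s_approx, axes = randomized_pca(X, k=3)
s_exact = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)[:3]
print(s_approx, s_exact)
```

The expensive factorization is only ever applied to matrices with `k + oversample` columns or rows, which is where the computational saving comes from.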

Bayesian Inference for Spatio-temporal Spike-and-Slab Priors
時空間スパイクアンドスラブ事前分布のベイズ推論

In this work, we address the problem of solving a series of underdetermined linear inverse problems subject to a sparsity constraint. We generalize the spike-and-slab prior distribution to encode a priori correlation of the support of the solution in both space and time by imposing a transformed Gaussian process on the spike-and-slab probabilities. An expectation propagation (EP) algorithm for posterior inference under the proposed model is derived. For large scale problems, the standard EP algorithm can be prohibitively slow. We therefore introduce three different approximation schemes to reduce the computational complexity. Finally, we demonstrate the proposed model using numerical experiments based on both synthetic and real data sets.

この研究では、スパース性制約のもとで一連の劣決定線形逆問題を解く問題に取り組みます。スパイクとスラブの事前分布を一般化して、スパイクとスラブの確率に変換されたガウス過程を課すことにより、空間と時間の両方での解のサポートの先験的な相関をエンコードします。提案されたモデルの下での事後推論のための期待伝播(EP)アルゴリズムが導出されます。大規模な問題の場合、標準のEPアルゴリズムは法外に遅くなる可能性があります。したがって、計算の複雑さを軽減するために、3つの異なる近似スキームを導入します。最後に、合成データセットと実データの両方に基づく数値実験を使用して、提案されたモデルを示します。

Dimension Estimation Using Random Connection Models
ランダム接続モデルを使用した次元推定

Information about intrinsic dimension is crucial to perform dimensionality reduction, compress information, design efficient algorithms, and do statistical adaptation. In this paper we propose an estimator for the intrinsic dimension of a data set. The estimator is based on binary neighbourhood information about the observations in the form of two adjacency matrices, and does not require any explicit distance information. The underlying graph is modelled according to a subset of a specific random connection model, sometimes referred to as the Poisson blob model. Computationally the estimator scales like $n\log n$, and we specify its asymptotic distribution and rate of convergence. A simulation study on both real and simulated data shows that our approach compares favourably with some competing methods from the literature, including approaches that rely on distance information.

固有次元に関する情報は、次元削減、情報の圧縮、効率的なアルゴリズムの設計、および統計的適応を行うために重要です。この論文では、データセットの固有次元の推定量を提案します。推定量は、2つの隣接行列の形式で観測値に関するバイナリ近傍情報に基づいており、明示的な距離情報は必要ありません。基になるグラフは、特定のランダム接続モデル(ポアソンブロブモデルとも呼ばれます)のサブセットに従ってモデル化されます。計算上、推定量は$n\log n$のようにスケーリングされ、その漸近分布と収束速度を指定します。実際のデータとシミュレーションデータの両方を対象としたシミュレーション研究では、私たちのアプローチは、距離情報に依存するアプローチなど、文献の競合するいくつかの方法と遜色ないことが示されています。

Generalized SURE for optimal shrinkage of singular values in low-rank matrix denoising
低ランク行列のノイズ除去における特異値の最適な収縮のための一般化SURE

We consider the problem of estimating a low-rank signal matrix from noisy measurements under the assumption that the distribution of the data matrix belongs to an exponential family. In this setting, we derive generalized Stein’s unbiased risk estimation (SURE) formulas that hold for any spectral estimator that shrinks or thresholds the singular values of the data matrix. This leads to new data-driven spectral estimators, whose optimality is discussed using tools from random matrix theory and through numerical experiments. Under the spiked population model and in the asymptotic setting where the dimensions of the data matrix tend to infinity, some theoretical properties of our approach are compared to recent results on asymptotically optimal shrinking rules for Gaussian noise. It also leads to new procedures for singular value shrinkage in finite-dimensional matrix denoising for Gamma-distributed and Poisson-distributed measurements.

私たちは、ノイズの多い測定値から低ランクの信号行列を推定する問題を、データ行列の分布が指数族に属すると仮定して考えます。この設定では、データ行列の特異値を縮小またはしきい値にするスペクトル推定器に対して成り立つ、一般化されたスタインの偏りのないリスク推定(SURE)式を導出します。これにより、新しいデータ駆動型のスペクトル推定器が生まれ、その最適性はランダム行列理論のツールと数値実験を通じて議論されます。スパイク人口モデルと、データ行列の次元が無限大になる漸近設定では、私たちのアプローチのいくつかの理論的特性が、ガウスノイズの漸近的に最適な縮小ルールに関する最近の結果と比較されます。また、ガンマ分布およびポアソン分布測定の有限次元行列ノイズ除去における特異値収縮の新しい手順にもつながります。

A Survey of Preference-Based Reinforcement Learning Methods
選好に基づく強化学習法の検討

Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a suitably chosen reward function. However, designing such a reward function often requires a lot of task-specific prior knowledge. The designer needs to consider different objectives that influence not only the learned behavior but also the learning progress. To alleviate these issues, preference-based reinforcement learning algorithms (PbRL) have been proposed that can directly learn from an expert’s preferences instead of a hand-designed numeric reward. PbRL has gained traction in recent years due to its ability to resolve the reward shaping problem, its ability to learn from non-numeric rewards and the possibility to reduce the dependence on expert knowledge. We provide a unified framework for PbRL that describes the task formally and points out the different design principles that affect the evaluation task for the human as well as the computational complexity. The design principles include the type of feedback that is assumed, the representation that is learned to capture the preferences, the optimization problem that has to be solved as well as how the exploration/exploitation problem is tackled. Furthermore, we point out shortcomings of current algorithms, propose open research questions and briefly survey practical tasks that have been solved using PbRL.

強化学習(RL)技術は、適切に選択された報酬関数の蓄積された長期報酬を最適化します。ただし、このような報酬関数を設計するには、多くの場合、タスク固有の事前知識が大量に必要になります。設計者は、学習した動作だけでなく学習の進行にも影響を与えるさまざまな目的を考慮する必要があります。これらの問題を軽減するために、手動で設計された数値報酬ではなく、専門家の好みから直接学習できる、好みに基づく強化学習アルゴリズム(PbRL)が提案されています。PbRLは、報酬形成の問題を解決できること、数値以外の報酬から学習できること、専門家の知識への依存を減らす可能性があることから、近年注目を集めています。私たちは、タスクを正式に記述し、人間の評価タスクと計算の複雑さに影響を与えるさまざまな設計原則を指摘する、PbRLの統一フレームワークを提供します。設計原則には、想定されるフィードバックの種類、好みを捉えるために学習される表現、解決する必要がある最適化問題、および探索/活用問題への取り組み方法が含まれます。さらに、現在のアルゴリズムの欠点を指摘し、未解決の研究課題を提案し、PbRLを使用して解決された実用的なタスクを簡単に調査します。

STORE: Sparse Tensor Response Regression and Neuroimaging Analysis
STORE:スパーステンソル応答回帰とニューロイメージング分析

Motivated by applications in neuroimaging analysis, we propose a new regression model, Sparse TensOr REsponse regression (STORE), with a tensor response and a vector predictor. STORE embeds two key sparse structures: element-wise sparsity and low-rankness. It can handle both a non-symmetric and a symmetric tensor response, and thus is applicable to both structural and functional neuroimaging data. We formulate the parameter estimation as a non-convex optimization problem, and develop an efficient alternating updating algorithm. We establish a non-asymptotic estimation error bound for the actual estimator obtained from the proposed algorithm. This error bound reveals an interesting interaction between the computational efficiency and the statistical rate of convergence. When the distribution of the error tensor is Gaussian, we further obtain a fast estimation error rate which allows the tensor dimension to grow exponentially with the sample size. We illustrate the efficacy of our model through intensive simulations and an analysis of the Autism spectrum disorder neuroimaging data.

神経画像解析への応用を動機として、テンソル応答とベクトル予測子を持つ新しい回帰モデル、スパースTensOr REsponse回帰(STORE)を提案します。STOREには、要素ごとのスパース性と低ランク性という2つの重要なスパース構造が組み込まれています。非対称テンソル応答と対称テンソル応答の両方を処理できるため、構造的および機能的神経画像データの両方に適用できます。パラメーター推定を非凸最適化問題として定式化し、効率的な交互更新アルゴリズムを開発します。提案アルゴリズムから得られる実際の推定量に対して、非漸近的な推定誤差境界を確立します。この誤差境界は、計算効率と統計的収束率の興味深い相互作用を明らかにします。誤差テンソルの分布がガウス分布である場合、さらに、サンプルサイズとともにテンソル次元が指数関数的に増加することを可能にする高速推定誤差率が得られます。私たちは、徹底的なシミュレーションと自閉症スペクトラム障害の神経画像データの分析を通じて、私たちのモデルの有効性を示します。

Stochastic Gradient Descent as Approximate Bayesian Inference
近似ベイズ推論としての確率的勾配降下法

Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results. (1) We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions. (2) We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models. (3) We also show how to tune SGD with momentum for approximate sampling. (4) We analyze stochastic-gradient MCMC algorithms. For Stochastic-Gradient Langevin Dynamics and Stochastic-Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates. Finally (5), we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler.

定数学習率による確率的勾配降下法(定数SGD)は、定常分布を持つマルコフ連鎖をシミュレートします。この観点から、いくつかの新しい結果を導き出しました。(1)定数SGDは、近似ベイズ事後推論アルゴリズムとして使用できることを示します。具体的には、定数SGDのチューニングパラメータを調整して、定常分布を事後分布に最適に一致させ、これら2つの分布間のKullback-Leiblerダイバージェンスを最小限に抑える方法を示します。(2)定数SGDにより、複雑な確率モデルのハイパーパラメータを最適化する新しい変分EMアルゴリズムが生成されることを示します。(3)また、近似サンプリングのためにモーメンタムを使用してSGDを調整する方法も示します。(4)確率的勾配MCMCアルゴリズムを分析します。確率的勾配ランジュバンダイナミクスと確率的勾配フィッシャースコアリングでは、有限学習率による近似誤差を定量化します。最後に（5）では、確率過程の観点から、なぜポリアック平均化が最適であるかを簡単に証明します。この考えに基づいて、スケーラブルな近似MCMCアルゴリズムである平均確率勾配サンプラーを提案します。

A Bayesian Mixed-Effects Model to Learn Trajectories of Changes from Repeated Manifold-Valued Observations
反復多様体値観測から変化の軌跡を学習するためのBayes混合効果モデル

We propose a generic Bayesian mixed-effects model to estimate the temporal progression of a biological phenomenon from observations obtained at multiple time points for a group of individuals. The progression is modeled by continuous trajectories in the space of measurements. Individual trajectories of progression result from spatiotemporal transformations of an average trajectory. These transformations allow for the quantification of changes in direction and pace at which the trajectories are followed. The framework of Riemannian geometry allows the model to be used with any kind of measurements with smooth constraints. A stochastic version of the Expectation-Maximization algorithm is used to produce maximum a posteriori estimates of the parameters. We evaluated our method using a series of neuropsychological test scores from patients with mild cognitive impairments, later diagnosed with Alzheimer’s disease, and simulated evolutions of symmetric positive definite matrices. The data-driven model of impairment of cognitive functions illustrated the variability in the ordering and timing of the decline of these functions in the population. We showed that the estimated spatiotemporal transformations effectively put into correspondence significant events in the progression of individuals.

私たちは、個体群の複数の時点で得られた観察から生物学的現象の時間的進行を推定するための、汎用ベイジアン混合効果モデルを提案します。進行は、測定空間における連続的な軌跡によってモデル化されます。個々の進行軌跡は、平均軌跡の時空間変換によって生じる。これらの変換により、軌跡がたどる方向とペースの変化を定量化できます。リーマン幾何学の枠組みにより、滑らかな制約を持つあらゆる種類の測定でこのモデルを使用できます。期待値最大化アルゴリズムの確率的バージョンを使用して、パラメータの最大事後推定値を生成します。私たちは、後にアルツハイマー病と診断された軽度認知障害患者の一連の神経心理学的検査スコアと、対称正定値行列のシミュレートされた進化を使用して、この方法を評価しました。認知機能障害のデータ駆動型モデルは、集団におけるこれらの機能の低下の順序とタイミングの変動性を示しました。推定された時空間変換が、個体の進行における重要なイベントを効果的に対応付けることを示しました。

Active-set Methods for Submodular Minimization Problems
サブモジュラ最小化問題に対するアクティブセット法

We consider the submodular function minimization (SFM) and the quadratic minimization problems regularized by the Lovász extension of the submodular function. These optimization problems are intimately related; for example, in min-cut problems and total variation denoising problems, the cut function is submodular and its Lovász extension is given by the associated total variation. When a quadratic loss is regularized by the total variation of a cut function, it thus becomes a total variation denoising problem and we use the same terminology in this paper for general submodular functions. We propose a new active-set algorithm for total variation denoising with the assumption of an oracle that solves the corresponding SFM problem. This can be seen as a local descent algorithm over ordered partitions with explicit convergence guarantees. It is more flexible than the existing algorithms, with the ability to warm-restart using the solution of a closely related problem. Further, we also consider the case when a submodular function can be decomposed into the sum of two submodular functions $F_1$ and $F_2$ and assume SFM oracles for these two functions. We propose a new active-set algorithm for total variation denoising (and hence SFM by thresholding the solution at zero). This algorithm also performs local descent over ordered partitions and its ability to warm start considerably improves the performance of the algorithm. In the experiments, we compare the performance of the proposed algorithms with state-of-the-art algorithms, showing that they reduce the calls to SFM oracles.

私たちは、サブモジュラ関数最小化(SFM)と、サブモジュラ関数のLovasz拡張によって正規化された2次最小化問題を考察します。これらの最適化問題は密接に関連しています。たとえば、最小カット問題と全変動ノイズ除去問題では、カット関数はサブモジュラであり、そのLovasz拡張は関連する全変動によって与えられます。2次損失がカット関数の全変動によって正規化されると、それは全変動ノイズ除去問題となり、本稿では一般的なサブモジュラ関数に同じ用語を使用します。私たちは、対応するSFM問題を解決するオラクルがあると仮定して、全変動ノイズ除去のための新しいアクティブセットアルゴリズムを提案します。これは、明示的な収束保証を備えた順序付きパーティション上のローカル降下アルゴリズムと見なすことができます。これは、密接に関連する問題の解を使用してウォームリスタートを行う機能を備え、既存のアルゴリズムよりも柔軟です。さらに、サブモジュラ関数を2つのサブモジュラ関数$F_1$と$F_2$の和に分解できる場合も考慮し、これら2つの関数に対してSFMオラクルがあると仮定します。全変動ノイズ除去(したがって、ソリューションをゼロでしきい値化することによるSFM)用の新しいアクティブセットアルゴリズムを提案します。このアルゴリズムは、順序付けられたパーティションに対してローカル降下も実行し、ウォームスタートの機能により、アルゴリズムのパフォーマンスが大幅に向上します。実験では、提案されたアルゴリズムのパフォーマンスを最先端のアルゴリズムと比較し、SFMオラクルの呼び出しが削減されることを示します。
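
The link between a cut function and total variation mentioned in this abstract can be checked numerically. The sketch below evaluates the Lovász extension with the standard sorting (greedy) formula and applies it to the cut function of a three-node chain graph; the helper names `lovasz_extension` and `chain_cut` and the example vector are assumptions for illustration.

```python
import numpy as np

def lovasz_extension(F, w):
    """Evaluate the Lovász extension of set function F (with F(empty) = 0) at w."""
    w = np.asarray(w, dtype=float)
    order = np.argsort(-w)  # indices sorted by decreasing weight
    value, prev, chosen = 0.0, F(frozenset()), set()
    for idx in order:
        chosen.add(int(idx))
        cur = F(frozenset(chosen))
        # Each coordinate is weighted by the marginal gain of adding it.
        value += w[idx] * (cur - prev)
        prev = cur
    return value

def chain_cut(S, n=3):
    """Cut function of the chain graph 0 - 1 - ... - (n-1) with unit edge weights."""
    return sum((i in S) != (i + 1 in S) for i in range(n - 1))

w = [3.0, 1.0, 2.0]
total_variation = sum(abs(w[i] - w[i + 1]) for i in range(len(w) - 1))
print(lovasz_extension(chain_cut, w), total_variation)
```

Both printed values coincide, which is the sense in which regularizing by the Lovász extension of a cut function is total variation denoising.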

Stabilized Sparse Online Learning for Sparse Data
スパースデータに対する安定化スパースオンライン学習

Stochastic gradient descent (SGD) is commonly used for optimization in large-scale machine learning problems. Langford et al. (2009) introduce a sparse online learning method to induce sparsity via truncated gradient. With high-dimensional sparse data, however, this method suffers from slow convergence and high variance due to heterogeneity in feature sparsity. To mitigate this issue, we introduce a stabilized truncated stochastic gradient descent algorithm. We employ a soft-thresholding scheme on the weight vector where the imposed shrinkage is adaptive to the amount of information available in each feature. The variability in the resulting sparse weight vector is further controlled by stability selection integrated with the informative truncation. To facilitate better convergence, we adopt an annealing strategy on the truncation rate, which leads to a balanced trade-off between exploration and exploitation in learning a sparse weight vector. Numerical experiments show that our algorithm compares favorably with the original truncated gradient SGD in terms of prediction accuracy, achieving both better sparsity and stability.

確率的勾配降下法(SGD)は、大規模な機械学習の問題の最適化によく使用されます。Langfordら(2009)は、切り捨て勾配を介してスパース性を誘導するスパースオンライン学習法を導入しました。ただし、高次元のスパースデータでは、特徴のスパース性の不均一性により、この方法は収束が遅く、変動が大きいという問題があります。この問題を軽減するために、安定化された切り捨て確率的勾配降下アルゴリズムを導入しました。重みベクトルにソフトしきい値スキームを採用し、課せられた縮小は各特徴で利用可能な情報量に適応します。結果として得られるスパース重みベクトルの変動性は、情報切り捨てと統合された安定性選択によってさらに制御されます。より良い収束を促進するために、切り捨て率にアニーリング戦略を採用し、スパース重みベクトルの学習における探索と活用のバランスの取れたトレードオフを実現します。数値実験では、予測精度の点で当社のアルゴリズムが元の切り捨て勾配SGDと比較して優れており、より優れたスパース性と安定性の両方を実現していることが示されています。
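
The soft-thresholding idea in this abstract can be illustrated with a minimal sketch: one squared-loss SGD step followed by coordinate-wise shrinkage. The per-feature `shrinkage` vector here is a hypothetical stand-in; the paper's rule for adapting it to the information in each feature, its stability selection, and its annealing schedule are not reproduced.

```python
import numpy as np

def soft_threshold(w, shrinkage):
    """Shrink each weight toward zero by `shrinkage` (scalar or per-feature array)."""
    return np.sign(w) * np.maximum(np.abs(w) - shrinkage, 0.0)

def truncated_sgd_step(w, x, y, lr, shrinkage):
    """One squared-loss SGD step on example (x, y), then soft-thresholding."""
    grad = (w @ x - y) * x
    return soft_threshold(w - lr * grad, lr * shrinkage)

# Hypothetical sparse example: only the first feature is active.
w = np.zeros(3)
x = np.array([1.0, 0.0, 0.0])
w = truncated_sgd_step(w, x, y=1.0, lr=0.5, shrinkage=np.array([0.1, 0.1, 0.1]))
print(w)
```

Weights whose magnitude falls below the shrinkage amount are set exactly to zero, which is how the update produces a sparse weight vector.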

Knowledge Graph Completion via Complex Tensor Factorization
複素テンソル因数分解によるナレッジグラフの補完

In statistical relational learning, knowledge graph completion deals with automatically understanding the structure of large knowledge graphs (labeled directed graphs) and predicting missing relationships (labeled edges). State-of-the-art embedding models propose different trade-offs between modeling expressiveness, and time and space complexity. We reconcile both expressiveness and complexity through the use of complex-valued embeddings and explore the link between such complex-valued embeddings and unitary diagonalization. We corroborate our approach theoretically and show that all real square matrices (thus all possible relation/adjacency matrices) are the real part of some unitarily diagonalizable matrix. This result opens the door to a lot of other applications of square matrix factorization. Our approach based on complex embeddings is arguably simple, as it only involves a Hermitian dot product, the complex counterpart of the standard dot product between real vectors, whereas other methods resort to more and more complicated composition functions to increase their expressiveness. The proposed complex embeddings are scalable to large data sets as they remain linear in both space and time, while consistently outperforming alternative approaches on standard link prediction benchmarks.

統計的関係学習において、知識グラフ補完は、大規模な知識グラフ(ラベル付き有向グラフ)の構造を自動的に理解し、欠落している関係(ラベル付きエッジ)を予測する処理を行います。最先端の埋め込みモデルは、モデリングの表現力と時間および空間の複雑さとの間でさまざまなトレードオフを提案しています。私たちは、複素数値の埋め込みを使用することで表現力と複雑さの両方を調和させ、そのような複素数値の埋め込みとユニタリ対角化の関係を探ります。私たちは、このアプローチを理論的に裏付け、すべての実正方行列(つまり、すべての可能な関係/隣接行列)が、ユニタリ対角化可能な行列の実数部であることを示します。この結果は、正方行列因数分解の他の多くの用途への扉を開きます。複素埋め込みに基づく私たちのアプローチは、実ベクトル間の標準ドット積の複素版であるエルミートドット積のみを使用するため、ほぼ間違いなくシンプルです。一方、他の方法では、表現力を高めるために、ますます複雑な合成関数に頼っています。提案された複素埋め込みは、空間と時間の両方で線形であり、標準的なリンク予測ベンチマークで他のアプローチよりも一貫して優れているため、大規模なデータセットに拡張可能です。
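
The Hermitian-product scoring described in this abstract can be sketched in a few lines. The function name `complex_score` and the random embeddings are assumptions for illustration; training, loss functions, and negative sampling are omitted.

```python
import numpy as np

def complex_score(e_s, w_r, e_o):
    """Score of triple (subject, relation, object): Re(<e_s, w_r, conj(e_o)>)."""
    return float(np.real(np.sum(e_s * w_r * np.conj(e_o))))

# Hypothetical 4-dimensional complex embeddings.
rng = np.random.default_rng(0)
e_s = rng.standard_normal(4) + 1j * rng.standard_normal(4)
e_o = rng.standard_normal(4) + 1j * rng.standard_normal(4)
v = rng.standard_normal(4)

# A purely real relation vector gives a symmetric relation ...
print(complex_score(e_s, v, e_o), complex_score(e_o, v, e_s))
# ... while a purely imaginary one gives an antisymmetric relation.
print(complex_score(e_s, 1j * v, e_o), complex_score(e_o, 1j * v, e_s))
```

The cost of one score is linear in the embedding dimension, and the same entity embedding is used in subject and object position, yet the imaginary part of the relation vector still allows asymmetric relations to be represented.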

Minimax Filter: Learning to Preserve Privacy from Inference Attacks
ミニマックス・フィルタ: 推論攻撃からプライバシーを保護する方法を学ぶ

Preserving privacy of continuous and/or high-dimensional data such as images, videos and audio can be challenging with syntactic anonymization methods, which are designed for discrete attributes. Differential privacy, which uses a more rigorous definition of privacy loss, has shown more success in sanitizing continuous data. However, both syntactic and differential privacy are susceptible to inference attacks, i.e., an adversary can accurately infer sensitive attributes from sanitized data. The paper proposes a novel filter-based mechanism which preserves privacy of continuous and high-dimensional attributes against inference attacks. Finding the optimal utility-privacy tradeoff is formulated as a min-diff-max optimization problem. The paper provides an ERM-like analysis of the generalization error and also a practical algorithm to perform minimax optimization. In addition, the paper proposes a noisy minimax filter which combines the minimax filter and a differentially private mechanism. Advantages of the method over purely noisy mechanisms are explained and demonstrated with examples. Experiments with several real-world tasks including facial expression classification, speech emotion classification, and activity classification from motion, show that the minimax filter can simultaneously achieve similar or higher target task accuracy and lower inference accuracy, often significantly lower than previous methods.

画像、動画、音声などの連続データや高次元データのプライバシーを保護することは、離散属性向けに設計された構文的匿名化方法では困難な場合があります。プライバシー損失のより厳密な定義を使用する差分プライバシーは、連続データのサニタイズにおいてより成功しています。ただし、構文プライバシーと差分プライバシーはどちらも推論攻撃の影響を受けやすく、つまり、攻撃者はサニタイズされたデータから機密属性を正確に推測できます。この論文では、推論攻撃に対して連続属性と高次元属性のプライバシーを保護する新しいフィルターベースのメカニズムを提案します。最適なユーティリティとプライバシーのトレードオフを見つけることは、最小-差分-最大最適化問題として定式化されます。この論文では、一般化エラーのERMのような分析と、ミニマックス最適化を実行するための実用的なアルゴリズムを提供します。さらに、この論文では、ミニマックスフィルターと差分プライバシーメカニズムを組み合わせたノイズミニマックスフィルターを提案します。この方法が純粋にノイズの多いメカニズムよりも優れている点について説明し、例を挙げて示します。顔の表情分類、音声感情分類、動きからのアクティビティ分類など、いくつかの実際のタスクを使用した実験では、ミニマックスフィルターは、ターゲットタスクと同等かそれ以上の精度と、以前の方法よりも大幅に低い推論精度を同時に達成できることが示されています。

Gap Safe Screening Rules for Sparsity Enforcing Penalties
スパース性を強制するペナルティのためのギャップセーフスクリーニングルール

In high dimensional regression settings, sparsity enforcing penalties have proved useful to regularize the data-fitting term. A recently introduced technique called screening rules proposes to ignore some variables in the optimization, leveraging the expected sparsity of the solutions and consequently leading to faster solvers. When the procedure is guaranteed not to discard variables wrongly, the rules are said to be safe. In this work, we propose a unifying framework for generalized linear models regularized with standard sparsity enforcing penalties such as $\ell_1$ or $\ell_1/\ell_2$ norms. Our technique allows more variables to be safely discarded than previously considered safe rules, particularly for low regularization parameters. Our proposed Gap Safe rules (so called because they rely on duality gap computation) can cope with any iterative solver but are particularly well suited to (block) coordinate descent methods. Applied to many standard learning tasks, Lasso, Sparse Group Lasso, multi-task Lasso, binary and multinomial logistic regression, etc., we report significant speed-ups compared to previously proposed safe rules on all tested data sets.

高次元回帰設定では、スパース性強制ペナルティがデータフィッティング項を正規化するために有効であることが証明されています。スクリーニングルールと呼ばれる最近導入された手法では、ソリューションの予想されるスパース性を活用して最適化でいくつかの変数を無視し、結果としてソルバーの高速化につながることを提案しています。手順で変数が誤って破棄されないことが保証されている場合、ルールは安全であると言われています。この研究では、$\ell_1$または$\ell_1/\ell_2$ノルムなどの標準的なスパース性強制ペナルティで正規化された一般化線形モデルの統一フレームワークを提案します。私たちの手法により、特に低い正規化パラメーターの場合、これまで安全と考えられていたルールよりも多くの変数を安全に破棄できます。私たちが提案するギャップセーフルール(双対ギャップ計算に依存するためこのように呼ばれています)は、任意の反復ソルバーに対応できますが、特に(ブロック)座標降下法に適しています。多くの標準的な学習タスク、Lasso、Sparse Group Lasso、マルチタスクLasso、バイナリおよび多項ロジスティック回帰などに適用すると、テストされたすべてのデータセットで、以前に提案された安全なルールと比較して大幅な速度向上が報告されます。
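
For the Lasso special case, a duality-gap-based safe test can be sketched as follows. The sketch assumes the formulation $\min_\beta \tfrac12\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ with dual objective $\tfrac12\|y\|_2^2 - \tfrac{\lambda^2}{2}\|\theta - y/\lambda\|_2^2$; the function name `gap_safe_screen` and the toy data are assumptions, and the general framework of the paper (groups, logistic losses) is not covered.

```python
import numpy as np

def gap_safe_screen(X, y, beta, lam):
    """Boolean mask of features that can be safely discarded for the Lasso."""
    residual = y - X @ beta
    # Rescale the residual so that the dual point satisfies ||X^T theta||_inf <= 1.
    theta = residual / max(lam, np.max(np.abs(X.T @ residual)))
    primal = 0.5 * residual @ residual + lam * np.sum(np.abs(beta))
    dual = 0.5 * y @ y - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
    gap = max(primal - dual, 0.0)  # guard against tiny negative round-off
    # The dual optimum lies in a ball around theta whose radius shrinks with the gap.
    radius = np.sqrt(2.0 * gap) / lam
    scores = np.abs(X.T @ theta) + radius * np.linalg.norm(X, axis=0)
    return scores < 1.0

# Toy problem (hypothetical): identity design, evaluated at beta = 0.
X = np.eye(2)
y = np.array([2.0, 1.0])
print(gap_safe_screen(X, y, beta=np.zeros(2), lam=2.0))
```

Because the radius depends on the duality gap of the current iterate, the test can be re-run as a solver converges and typically discards more features over time.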

Poisson Random Fields for Dynamic Feature Models
動的特徴モデルのポアソン確率場

We present the Wright-Fisher Indian buffet process (WF-IBP), a probabilistic model for time-dependent data assumed to have been generated by an unknown number of latent features. This model is suitable as a prior in Bayesian nonparametric feature allocation models in which the features underlying the observed data exhibit a dependency structure over time. More specifically, we establish a new framework for generating dependent Indian buffet processes, where the Poisson random field model from population genetics is used as a way of constructing dependent beta processes. Inference in the model is complex, and we describe a sophisticated Markov Chain Monte Carlo algorithm for exact posterior simulation. We apply our construction to develop a nonparametric focused topic model for collections of time-stamped text documents and test it on the full corpus of NIPS papers published from 1987 to 2015.

私たちは、未知の数の潜在特徴によって生成されたと仮定される時間依存データの確率モデルであるWright-Fisher Indian buffet process(WF-IBP)を紹介します。このモデルは、観測データの基礎となる特徴が時間の経過とともに依存関係構造を示すベイズノンパラメトリック特徴割り当てモデルの事前分布として適しています。具体的には、従属的なベータプロセスを構築する方法として、集団遺伝学のポアソンランダムフィールドモデルを使用する、従属的なインディアンビュッフェプロセスを生成するための新しいフレームワークを確立します。モデル内の推論は複雑であり、正確な事後シミュレーションのための高度なマルコフ連鎖モンテカルロアルゴリズムについて説明します。私たちは、この構造を適用して、タイムスタンプ付きテキストドキュメントのコレクションに対するノンパラメトリックに焦点を当てたトピックモデルを開発し、1987年から2015年に発表されたNIPS論文の全コーパスでテストします。

Online but Accurate Inference for Latent Variable Models with Local Gibbs Sampling
ローカルギブスサンプリングによる潜在変数モデルのためのオンラインだが正確な推論

We study parameter inference in large-scale latent variable models. We first propose a unified treatment of online inference for latent variable models from a non-canonical exponential family, and draw explicit links between several previously proposed frequentist or Bayesian methods. We then propose a novel inference method for the frequentist estimation of parameters, that adapts MCMC methods to online inference of latent variable models with the proper use of local Gibbs sampling. Then, for latent Dirichlet allocation, we provide an extensive set of experiments and comparisons with existing work, where our new approach outperforms all previously proposed methods. In particular, using Gibbs sampling for latent variable inference is superior to variational inference in terms of test log-likelihoods. Moreover, Bayesian inference through variational methods performs poorly, sometimes leading to worse fits with latent variables of higher dimensionality.

私たちは、大規模潜在変数モデルにおけるパラメータ推論について研究しています。まず、非正準指数族からの潜在変数モデルに対するオンライン推論の統一的な取り扱いを提案し、以前に提案されたいくつかの頻度論的またはベイズ的方法の間に明示的なリンクを描画します。次に、パラメータの頻度論的推定のための新しい推論方法を提案し、MCMC法をローカルギブスサンプリングを適切に使用して潜在変数モデルのオンライン推論に適応させます。次に、潜在的なディリクレの割り当てについて、広範な一連の実験と既存の研究との比較を提供し、新しいアプローチが以前に提案されたすべての方法よりも優れています。特に、潜在変数推論にギブスサンプリングを使用することは、テスト対数尤度の点で変分推論よりも優れています。さらに、変分法によるベイズ推論はパフォーマンスが低く、高次元の潜在変数との適合度が低下することがあります。

Statistical and Computational Guarantees for the Baum-Welch Algorithm
Baum-Welchアルゴリズムの統計的および計算的保証

The Hidden Markov Model (HMM) is one of the mainstays of statistical modeling of discrete time series, with applications including speech recognition, computational biology, computer vision and econometrics. Estimating an HMM from its observation process is often addressed via the Baum-Welch algorithm, which is known to be susceptible to local optima. In this paper, we first give a general characterization of the basin of attraction associated with any global optimum of the population likelihood. By exploiting this characterization, we provide non-asymptotic finite sample guarantees on the Baum-Welch updates and show geometric convergence to a small ball of radius on the order of the minimax rate around a global optimum. As a concrete example, we prove a linear rate of convergence for a hidden Markov mixture of two isotropic Gaussians given a suitable mean separation and an initialization within a ball of large radius around (one of) the true parameters. To our knowledge, these are the first rigorous local convergence guarantees to global optima for the Baum-Welch algorithm in a setting where the likelihood function is nonconvex. We complement our theoretical results with thorough numerical simulations studying the convergence of the Baum-Welch algorithm and illustrating the accuracy of our predictions.

隠れマルコフモデル(HMM)は離散時系列の統計モデリングの主力の1つであり、音声認識、計算生物学、コンピュータービジョン、計量経済学などの用途があります。観測プロセスからHMMを推定することは、局所最適値の影響を受けやすいことで知られるBaum-Welchアルゴリズムを介して行われることがよくあります。この論文では、まず、母集団尤度のグローバル最適値に関連付けられた吸引域の一般的な特徴を示します。この特徴を利用して、Baum-Welch更新の非漸近的な有限サンプル保証を提供し、グローバル最適値の周りのミニマックスレートのオーダーで半径の小さな球への幾何学的収束を示します。具体的な例として、適切な平均分離と、真のパラメーター(の1つ)の周りの大きな半径の球内での初期化を与えられた2つの等方性ガウス分布の隠れマルコフ混合の線形収束率を証明します。私たちの知る限り、尤度関数が非凸である設定において、Baum-Welchアルゴリズムのグローバル最適解への厳密なローカル収束が保証されたのはこれが初めてです。私たちは、Baum-Welchアルゴリズムの収束を研究し、予測の精度を示す徹底的な数値シミュレーションで理論結果を補完します。

Robust and Scalable Bayes via a Median of Subset Posterior Measures
サブセット事後測度の中央値によるロバストでスケーラブルなベイズ

We propose a novel approach to Bayesian analysis that is provably robust to outliers in the data and often has computational advantages over standard methods. Our technique is based on splitting the data into non-overlapping subgroups, evaluating the posterior distribution given each independent subgroup, and then combining the resulting measures. The main novelty of our approach is the proposed aggregation step, which is based on the evaluation of a median in the space of probability measures equipped with a suitable collection of distances that can be quickly and efficiently evaluated in practice. We present both theoretical and numerical evidence illustrating the improvements achieved by our method.

私たちは、データ内の外れ値に対してロバストであることが証明可能で、多くの場合、標準的な方法よりも計算上の利点を持つベイズ分析への新しいアプローチを提案します。私たちの手法は、データを重複しないサブグループに分割し、独立した各サブグループを与えたときの事後分布を評価し、得られた測度を統合することに基づいています。私たちのアプローチの主な新規性は、提案する集約ステップにあります。これは、実際に迅速かつ効率的に評価できる適切な距離の族を備えた確率測度の空間における中央値の評価に基づいています。私たちは、私たちの方法によって達成された改善を示す理論的証拠と数値的証拠の両方を提示します。

Kernel Partial Least Squares for Stationary Data
定常データのためのカーネル偏最小二乗法

We consider the kernel partial least squares algorithm for nonparametric regression with stationary dependent data. Probabilistic convergence rates of the kernel partial least squares estimator to the true regression function are established under a source and an effective dimensionality condition. It is shown both theoretically and in simulations that long range dependence results in slower convergence rates. A protein dynamics example shows high predictive power of kernel partial least squares.

私たちは、定常な従属データに対するノンパラメトリック回帰のためのカーネル偏最小二乗アルゴリズムを検討します。カーネル偏最小二乗推定量の真の回帰関数への確率的収束率は、ソース条件と有効次元条件の下で確立されます。理論的にもシミュレーションでも、長期依存性があると収束率が遅くなることが示されます。タンパク質ダイナミクスの例は、カーネル偏最小二乗法の高い予測力を示しています。

Distributed Stochastic Variance Reduced Gradient Methods by Sampling Extra Data with Replacement
復元抽出による追加データのサンプリングを用いた分散確率的分散縮小勾配法

We study the round complexity of minimizing the average of convex functions under a new setting of distributed optimization where each machine can receive two subsets of functions. The first subset is from a random partition and the second subset is randomly sampled with replacement. Under this setting, we define a broad class of distributed algorithms whose local computation can utilize both subsets and design a distributed stochastic variance reduced gradient method belonging to this class. When the condition number of the problem is small, our method achieves the optimal parallel runtime, amount of communication and rounds of communication among all distributed first-order methods up to constant factors. When the condition number is relatively large, a lower bound is provided for the number of rounds of communication needed by any algorithm in this class. Then, we present an accelerated version of our method whose number of rounds of communication matches the lower bound up to logarithmic terms, which establishes that this accelerated algorithm has the lowest round complexity among all algorithms in our class under this new setting.

私たちは、各マシンが関数の2つのサブセットを受け取ることができる分散最適化の新しい設定の下で、凸関数の平均を最小化する際のラウンド複雑度を調べます。最初のサブセットはランダム分割から得られ、2番目のサブセットは復元抽出によってランダムにサンプリングされます。この設定の下で、ローカル計算で両方のサブセットを利用できる分散アルゴリズムの広範なクラスを定義し、このクラスに属する分散確率的分散縮小勾配法を設計します。問題の条件数が小さい場合、この方法は、すべての分散一次法の中で、定数倍を除いて最適な並列実行時間、通信量、および通信ラウンド数を達成します。条件数が比較的大きい場合には、このクラスの任意のアルゴリズムが必要とする通信ラウンド数の下限を与えます。次に、通信ラウンド数が対数項を除いて下限と一致する、この方法の加速版を提示します。これにより、この加速アルゴリズムは、この新しい設定の下で、クラス内のすべてのアルゴリズムの中でラウンド複雑度が最も低いことが示されます。

Classification of Time Sequences using Graphs of Temporal Constraints
時間的制約のグラフを用いた時系列の分類

We introduce two algorithms that learn to classify Symbolic and Scalar Time Sequences (SSTS), an extension of multivariate time series. An SSTS is a set of events and a set of scalars. An event is defined by a symbol and a time-stamp. A scalar is defined by a symbol and a function mapping a number for each possible time stamp of the data. The proposed algorithms rely on temporal patterns called Graph of Temporal Constraints (GTC). A GTC is a directed graph in which vertices express occurrences of specific events, and edges express temporal constraints between occurrences of pairs of events. Additionally, each vertex of a GTC can be augmented with numeric constraints on scalar values. We allow GTCs to be cyclic and/or disconnected. The first of the introduced algorithms extracts sets of co-dependent GTCs to be used in a voting mechanism. The second algorithm builds decision-forest-like representations where each node is a GTC. In both algorithms, extraction of GTCs and model building are interleaved. Both algorithms are closely related to each other and they exhibit complementary properties including complexity, performance, and interpretability. The main novelties of this work reside in direct building of the model and efficient learning of GTC structures. We explain the proposed algorithms and evaluate their performance against a diverse collection of 59 benchmark data sets. In these experiments, our algorithms come across as highly competitive and in most cases closely match or outperform state-of-the-art alternatives in terms of the computational speed while dominating in terms of the accuracy of classification of time sequences.

私たちは、多変量時系列の拡張であるシンボリックおよびスカラー時間シーケンス(SSTS)を分類する方法を学習する2つのアルゴリズムを紹介します。SSTSは、イベントのセットとスカラーのセットです。イベントは、シンボルとタイムスタンプによって定義されます。スカラーは、シンボルと、データの可能な各タイムスタンプに数値を対応させる関数によって定義されます。提案されたアルゴリズムは、時間制約のグラフ(GTC)と呼ばれる時間パターンに依存します。GTCは、頂点が特定のイベントの発生を表し、エッジがイベントのペアの発生間の時間制約を表す有向グラフです。さらに、GTCの各頂点には、スカラー値に対する数値制約を追加できます。GTCは、巡回的および/または非連結であってもかまいません。紹介されたアルゴリズムの最初のものは、投票メカニズムで使用される共依存GTCのセットを抽出します。2番目のアルゴリズムは、各ノードがGTCである決定フォレストのような表現を構築します。両方のアルゴリズムで、GTCの抽出とモデル構築が交互に行われます。両方のアルゴリズムは互いに密接に関連しており、複雑性、パフォーマンス、解釈可能性などの補完的な特性を示します。この研究の主な新規性は、モデルの直接構築とGTC構造の効率的な学習にあります。提案されたアルゴリズムについて説明し、59のベンチマークデータセットの多様なコレクションに対してそのパフォーマンスを評価します。これらの実験では、私たちのアルゴリズムは非常に競争力があり、ほとんどの場合、計算速度の点で最先端の代替手段に匹敵するか、それを上回り、時間シーケンスの分類の精度の点で優位に立っています。

Learning Instrumental Variables with Structural and Non-Gaussianity Assumptions
構造的および非ガウス性の仮定による操作変数の学習

Learning a causal effect from observational data requires strong assumptions. One possible method is to use instrumental variables, which are typically justified by background knowledge. It is possible, under further assumptions, to discover whether a variable is structurally instrumental to a target causal effect $X \rightarrow Y$. However, the few existing approaches are lacking on how general these assumptions can be, and how to express possible equivalence classes of solutions. We present instrumental variable discovery methods that systematically characterize which set of causal effects can and cannot be discovered under local graphical criteria that define instrumental variables, without reconstructing full causal graphs. We also introduce the first methods to exploit non-Gaussianity assumptions, highlighting identifiability problems and solutions. Due to the difficulty of estimating such models from finite data, we investigate how to strengthen assumptions in order to make the statistical problem more manageable.

観測データから因果効果を学習するには、強力な仮定が必要です。1つの方法は、通常背景知識によって正当化される操作変数を使用することです。さらなる仮定の下で、変数がターゲット因果効果$X \rightarrow Y$に対して構造的に操作的であるかどうかを発見することは可能です。しかし、いくつかの既存のアプローチでは、これらの仮定をどの程度一般化できるか、およびソリューションの可能な同値クラスをどのように表現するかについて欠けています。私たちは、完全な因果グラフを再構築することなく、操作変数を定義するローカルなグラフィカル基準の下でどの因果効果のセットが発見できるか、できないかを体系的に特徴付ける操作変数発見方法を紹介します。また、非ガウス性仮定を利用する最初の方法を紹介し、識別可能性の問題と解決策を強調します。有限データからこのようなモデルを推定することは難しいため、統計的問題をより扱いやすくするために、どのように仮定を強化するかを調査します。

Probabilistic Line Searches for Stochastic Optimization
確率的最適化のための確率的ライン探索

In deterministic optimization, line searches are a standard tool ensuring stability and efficiency. Where only stochastic gradients are available, no direct equivalent has so far been formulated, because uncertain gradients do not allow for a strict sequence of decisions collapsing the search space. We construct a probabilistic line search by combining the structure of existing deterministic methods with notions from Bayesian optimization. Our method retains a Gaussian process surrogate of the univariate optimization objective, and uses a probabilistic belief over the Wolfe conditions to monitor the descent. The algorithm has very low computational cost, and no user-controlled parameters. Experiments show that it effectively removes the need to define a learning rate for stochastic gradient descent.

決定論的最適化では、ライン探索は安定性と効率性を確保する標準的なツールです。確率的勾配のみが利用可能な場合、不確実な勾配では探索空間を絞り込む厳密な一連の決定ができないため、直接対応する手法はこれまで定式化されていません。既存の決定論的手法の構造とベイズ最適化の概念を組み合わせることにより、確率的ライン探索を構築します。私たちの方法は、単変量最適化目的関数のガウス過程サロゲートを保持し、ウルフ条件に対する確率的な信念を使用して降下を監視します。このアルゴリズムの計算コストは非常に低く、ユーザーが制御するパラメーターはありません。実験によると、確率的勾配降下法の学習率を定義する必要性が事実上なくなります。

Learning Theory of Distributed Regression with Bias Corrected Regularization Kernel Network
バイアス補正正則化カーネルネットワークを用いた分散回帰の学習理論

Distributed learning is an effective way to analyze big data. In distributed regression, a typical approach is to divide the big data into multiple blocks, apply a base regression algorithm on each of them, and then simply average the output functions learnt from these blocks. Since the average process will decrease the variance, not the bias, bias correction is expected to improve the learning performance if the base regression algorithm is a biased one. Regularization kernel network is an effective and widely used method for nonlinear regression analysis. In this paper we will investigate a bias corrected version of regularization kernel network. We derive the error bounds when it is applied to a single data set and when it is applied as a base algorithm in distributed regression. We show that, under certain appropriate conditions, the optimal learning rates can be reached in both situations.

分散学習は、ビッグデータを分析するための効果的な方法です。分散回帰では、一般的なアプローチは、ビッグデータを複数のブロックに分割し、各ブロックにベース回帰アルゴリズムを適用し、これらのブロックから学習した出力関数を単純に平均化することです。平均化によって減少するのは分散であってバイアスではないため、ベース回帰アルゴリズムにバイアスがある場合、バイアス補正によって学習性能が向上することが期待されます。正則化カーネルネットワークは、非線形回帰分析に効果的で広く使用されている方法です。この論文では、正則化カーネルネットワークのバイアス補正版を調査します。単一のデータセットに適用した場合と、分散回帰のベースアルゴリズムとして適用した場合の誤差限界を導出します。私たちは、特定の適切な条件下では、両方の状況で最適な学習率に到達できることを示します。
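
The divide-and-average scheme described in this abstract can be illustrated with a minimal sketch. It assumes an RBF kernel and plain kernel ridge regression as the base algorithm, and it does not implement the paper's bias correction. The function names (`rbf_kernel`, `fit_krr`, `distributed_krr`) and the parameters `lam` and `gamma` are hypothetical choices made for this illustration only.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of a and the rows of b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def fit_krr(x, y, lam=1e-2, gamma=1.0):
    # Kernel ridge regression on one block: solve (K + lam * n * I) alpha = y.
    n = len(x)
    k = rbf_kernel(x, x, gamma)
    alpha = np.linalg.solve(k + lam * n * np.eye(n), y)
    # Return a predictor that evaluates the fitted function at query points q.
    return lambda q: rbf_kernel(q, x, gamma) @ alpha

def distributed_krr(x, y, n_blocks, lam=1e-2, gamma=1.0, seed=0):
    # Randomly split the data into blocks, fit one base regressor per block,
    # then average the block predictors (no bias correction in this sketch).
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(len(x)), n_blocks)
    models = [fit_krr(x[b], y[b], lam, gamma) for b in blocks]
    return lambda q: np.mean([m(q) for m in models], axis=0)

# Usage: noise-free samples of sin(x), split across 4 blocks.
x = np.linspace(-3, 3, 200)[:, None]
y = np.sin(x[:, 0])
predict = distributed_krr(x, y, n_blocks=4, lam=1e-3)
print(predict(np.array([[0.5]])))
```

Averaging reduces the variance of the block estimators but leaves their regularization bias unchanged, which is the motivation the abstract gives for a bias-corrected base algorithm.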

Regularized Estimation and Testing for High-Dimensional Multi-Block Vector-Autoregressive Models
高次元マルチブロックベクトル自己回帰モデルのための正則化推定と検定

Dynamical systems comprising multiple components that can be partitioned into distinct blocks originate in many scientific areas. A pertinent example is the interactions between financial assets and selected macroeconomic indicators, which have been studied extensively at the aggregate level (e.g., a stock index and an employment index) in the macroeconomics literature. A key shortcoming of this approach is that it ignores potential influences from other related components (e.g., Gross Domestic Product) that may impact the system’s dynamics and structure and thus produces incorrect results. To mitigate this issue, we consider a multi-block linear dynamical system with Granger-causal ordering between blocks, wherein the blocks’ temporal dynamics are described by vector autoregressive processes and are influenced by blocks higher in the system hierarchy. We derive the maximum likelihood estimator for the posited model for Gaussian data in the high-dimensional setting based on appropriate regularization schemes for the parameters of the block components. To optimize the underlying non-convex likelihood function, we develop an iterative algorithm with convergence guarantees. We establish theoretical properties of the maximum likelihood estimates, leveraging the decomposability of the regularizers and a careful analysis of the iterates. Finally, we develop testing procedures for the null hypothesis of whether a block Granger-causes another block of variables. The performance of the model and the testing procedures is evaluated on synthetic data, and illustrated on a data set involving log-returns of the US S&P100 component stocks and key macroeconomic variables for the 2001–16 period.

複数のコンポーネントから成り、それらを個別のブロックに分割できる動的システムは、多くの科学分野で現れます。適切な例としては、金融資産と選択されたマクロ経済指標の相互作用が挙げられます。これは、マクロ経済学の文献で、株価指数や雇用指数などの集計レベルで広範に研究されてきました。このアプローチの主な欠点は、システムのダイナミクスと構造に影響を与える可能性のある他の関連コンポーネント(国内総生産など)からの潜在的な影響を無視するため、誤った結果を生成することです。この問題を軽減するために、ブロック間にグレンジャー因果の順序を持つマルチブロック線形動的システムを検討します。このシステムでは、ブロックの時間的ダイナミクスはベクトル自己回帰過程によって記述され、システム階層の上位のブロックの影響を受けます。ブロックコンポーネントのパラメーターに対する適切な正則化スキームに基づいて、高次元設定のガウスデータに対して、想定したモデルの最尤推定量を導出します。基礎となる非凸尤度関数を最適化するために、収束が保証された反復アルゴリズムを開発します。正則化項の分解可能性と反復の慎重な分析を活用して、最尤推定値の理論的特性を確立します。最後に、あるブロックが別の変数ブロックに対してグレンジャー因果性を持つかどうかという帰無仮説の検定手順を開発します。モデルと検定手順の性能は合成データで評価され、2001年から2016年までの米国S&P100構成銘柄の対数リターンと主要なマクロ経済変数を含むデータセットで例示されます。

Second-Order Stochastic Optimization for Machine Learning in Linear Time
線形時間における機械学習のための二次確率的最適化

First-order stochastic methods are the state-of-the-art in large-scale machine learning optimization owing to efficient per-iteration complexity. Second-order methods, while able to provide faster convergence, have been much less explored due to the high cost of computing the second-order information. In this paper we develop second-order stochastic methods for optimization problems in machine learning that match the per-iteration cost of gradient based methods, and in certain settings improve upon the overall running time over popular first-order methods. Furthermore, our algorithm has the desirable property of being implementable in time linear in the sparsity of the input data.

一次確率的手法は、反復ごとの計算量が小さいため、大規模な機械学習最適化における最先端技術です。二次法は、より速い収束を実現できる一方で、二次情報の計算コストが高いため、あまり検討されてきませんでした。この論文では、機械学習の最適化問題に対して、勾配ベースの手法の反復ごとのコストに匹敵し、特定の設定では一般的な一次手法よりも全体的な実行時間を改善する二次確率的手法を開発します。さらに、私たちのアルゴリズムは、入力データのスパース性に対して線形の時間で実装できるという望ましい特性を持っています。

A General Distributed Dual Coordinate Optimization Framework for Regularized Loss Minimization
正則化損失最小化のための汎用分散双対座標最適化フレームワーク

In modern large-scale machine learning applications, the training data are often partitioned and stored on multiple machines. It is customary to employ the data parallelism approach, where the aggregated training loss is minimized without moving data across machines. In this paper, we introduce a novel distributed dual formulation for regularized loss minimization problems that can directly handle data parallelism in the distributed setting. This formulation allows us to systematically derive dual coordinate optimization procedures, which we refer to as Distributed Alternating Dual Maximization (DADM). The framework extends earlier studies described in (Boyd et al., 2011; Ma et al., 2017; Jaggi et al., 2014; Yang, 2013) and has rigorous theoretical analyses. Moreover, with the help of the new formulation, we develop the accelerated version of DADM (Acc-DADM) by generalizing the acceleration technique from (Shalev-Shwartz and Zhang, 2014) to the distributed setting. We also provide theoretical results for the proposed accelerated version, and the new result improves previous ones (Yang, 2013; Ma et al., 2017) whose iteration complexities grow linearly on the condition number. Our empirical studies validate our theory and show that our accelerated approach significantly improves the previous state-of-the-art distributed dual coordinate optimization algorithms.

現代の大規模機械学習アプリケーションでは、トレーニングデータは複数のマシンに分割されて保存されることがよくあります。データ並列処理アプローチを採用するのが一般的で、このアプローチでは、データをマシン間で移動することなく、集約されたトレーニング損失が最小化されます。この論文では、分散設定でデータ並列処理を直接扱える、正則化損失最小化問題に対する新しい分散双対定式化を紹介します。この定式化により、分散交互双対最大化(DADM)と呼ぶ双対座標最適化手順を体系的に導出できます。このフレームワークは、(Boyd他、2011年、Ma他、2017年、Jaggi他、2014年、Yang、2013年)で説明されている以前の研究を拡張したものであり、厳密な理論的分析を備えています。さらに、新しい定式化の助けを借りて、(Shalev-ShwartzおよびZhang、2014年)の加速手法を分散設定に一般化することで、DADMの加速版(Acc-DADM)を開発します。また、提案する加速版の理論的な結果も提供します。新しい結果は、反復計算量が条件数に対して線形に増加する以前の結果(Yang、2013年、Ma他、2017年)を改善しています。私たちの実証研究は理論を検証し、私たちの加速アプローチが以前の最先端の分散双対座標最適化アルゴリズムを大幅に改善することを示しています。

Target Curricula via Selection of Minimum Feature Sets: a Case Study in Boolean Networks
最小特徴セットの選択によるターゲットカリキュラム:ブールネットワークのケーススタディ

We consider the effect of introducing a curriculum of targets when training Boolean models on supervised Multi Label Classification (MLC) problems. In particular, we consider how to order targets in the absence of prior knowledge, and how such a curriculum may be enforced when using meta-heuristics to train discrete non-linear models. We show that hierarchical dependencies between targets can be exploited by enforcing an appropriate curriculum using hierarchical loss functions. On several multi-output circuit-inference problems with known target difficulties, Feedforward Boolean Networks (FBNs) trained with such a loss function achieve significantly lower out-of-sample error, up to $10\%$ in some cases. This improvement increases as the loss places more emphasis on target order and is strongly correlated with an easy-to-hard curriculum. We also demonstrate the same improvements on three real-world models and two Gene Regulatory Network (GRN) inference problems. We posit a simple a-priori method for identifying an appropriate target order and estimating the strength of target relationships in Boolean MLCs. These methods use intrinsic dimension as a proxy for target difficulty, which is estimated using optimal solutions to a combinatorial optimisation problem known as the Minimum-Feature-Set (minFS) problem. We also demonstrate that the same generalisation gains can be achieved without providing any knowledge of target difficulty.

私たちは、教師ありマルチラベル分類(MLC)問題でブールモデルをトレーニングする際に、ターゲットのカリキュラムを導入する効果について検討します。特に、事前知識がない場合にターゲットを順序付ける方法と、メタヒューリスティックを使用して離散非線形モデルをトレーニングする際にそのようなカリキュラムを実施する方法について検討します。階層的損失関数を使用して適切なカリキュラムを実施することで、ターゲット間の階層的依存関係を利用できることを示します。既知のターゲットの難易度を持ついくつかのマルチ出力回路推論問題では、そのような損失関数でトレーニングされたフィードフォワードブールネットワーク(FBN)は、サンプル外エラーが大幅に低下し、場合によっては最大$10\%$になります。この改善は、損失がターゲットの順序に重点を置くほど大きくなり、簡単から難しいカリキュラムと強く相関しています。また、3つの実際のモデルと2つの遺伝子制御ネットワーク(GRN)推論問題でも同じ改善が見られることを実証します。私たちは、ブールMLCにおける適切なターゲット順序を識別し、ターゲット関係の強さを推定するための、単純な事前手法を提案します。これらの手法では、ターゲット難易度の代理として、最小特徴セット(minFS)問題として知られる組み合わせ最適化問題に対する最適解を使用して推定される、固有の次元を使用します。また、ターゲット難易度に関する知識を一切提供しなくても、同様の一般化ゲインを達成できることも実証します。

Document Neural Autoregressive Distribution Estimation
文書ニューラル自己回帰分布推定

We present an approach based on feed-forward neural networks for learning the distribution over textual documents. This approach is inspired by the Neural Autoregressive Distribution Estimator (NADE) model which has been shown to be a good estimator of the distribution over discrete-valued high-dimensional vectors. In this paper, we present how NADE can successfully be adapted to textual data, retaining the property that sampling or computing the probability of an observation can be done exactly and efficiently. The approach can also be used to learn deep representations of documents that are competitive to those learned by alternative topic modeling approaches. Finally, we describe how the approach can be combined with a regular neural network N-gram model and substantially improve its performance, by making its learned representation sensitive to the larger, document-level context.

私たちは、テキスト文書上の分布を学習するための、フィードフォワードニューラルネットワークに基づくアプローチを提示します。このアプローチは、離散値を持つ高次元ベクトル上の分布の優れた推定器であることが示されているNeural Autoregressive Distribution Estimator (NADE)モデルに触発されています。この論文では、観測値のサンプリングや確率の計算を正確かつ効率的に実行できるという特性を保持したまま、NADEをテキストデータにうまく適応させる方法を示します。このアプローチは、他のトピックモデリング手法によって学習された表現に匹敵する、文書の深い表現を学習するためにも使用できます。最後に、このアプローチを通常のニューラルネットワークN-gramモデルと組み合わせ、学習した表現をより大きな文書レベルのコンテキストに敏感にすることで、その性能を大幅に向上させる方法について説明します。

Efficient Sampling from Time-Varying Log-Concave Distributions
時変対数凹分布からの効率的なサンプリング

We propose a computationally efficient random walk on a convex body which rapidly mixes with respect to a fixed log-concave distribution and closely tracks a time-varying log-concave distribution. We develop general theoretical guarantees on the required number of steps; this number can be calculated on the fly according to the distance from and the shape of the next distribution. We then illustrate the technique on several examples. Within the context of exponential families, the proposed method produces samples from a posterior distribution which is updated as data arrive in a streaming fashion. The sampling technique can be used to track time-varying truncated distributions, as well as to obtain samples from a changing mixture model, fitted in a streaming fashion to data. In the setting of linear optimization, the proposed method has oracle complexity with best known dependence on the dimension for certain geometries. In the context of online learning and repeated games, the algorithm is an efficient method for implementing no-regret mixture forecasting strategies. Remarkably, in some of these examples, only one step of the random walk is needed to track the next distribution.

私たちは、固定された対数凹分布に関して急速に混合し、時間とともに変化する対数凹分布を厳密に追跡する凸体上の計算効率の高いランダムウォークを提案します。私たちは、必要なステップ数に関する一般的な理論的保証を開発します。この数は、次の分布までの距離と次の分布の形状に応じて、オンザフライで計算することができます。次に、いくつかの例でこの手法を説明します。指数型分布族のコンテキストでは、提案された方法は、データがストリーミング形式で到着するたびに更新される事後分布からサンプルを生成します。このサンプリング手法は、時間とともに変化する切断分布を追跡するために、また、データにストリーミング形式で適合された変化する混合モデルからサンプルを取得するために使用できます。線形最適化の設定では、提案された方法は、特定の幾何構造に対して、次元への依存性が既知の中で最良であるオラクル複雑度を持ちます。オンライン学習と繰り返しゲームのコンテキストでは、このアルゴリズムは、ノーリグレット(no-regret)の混合予測戦略を実装するための効率的な方法です。注目すべきことに、これらの例のいくつかでは、次の分布を追跡するためにランダムウォークの1ステップのみが必要です。

Approximation Vector Machines for Large-scale Online Learning
大規模オンライン学習のための近似ベクトルマシン

One of the most challenging problems in kernel online learning is to bound the model size and to promote model sparsity. Sparse models not only improve computation and memory usage, but also enhance the generalization capacity — a principle that concurs with the law of parsimony. However, inappropriate sparsity modeling may also significantly degrade the performance. In this paper, we propose Approximation Vector Machine (AVM), a model that can simultaneously encourage sparsity and safeguard its risk in compromising the performance. In an online setting context, when an incoming instance arrives, we approximate this instance by one of its neighbors whose distance to it is less than a predefined threshold. Our key intuition is that since the newly seen instance is expressed by its nearby neighbor the optimal performance can be analytically formulated and maintained. We develop theoretical foundations to support this intuition and further establish an analysis for the common loss functions including Hinge, smooth Hinge, and Logistic (i.e., for the classification task) and $\ell_{1}$, $\ell_{2}$, and $\varepsilon$-insensitive (i.e., for the regression task) to characterize the gap between the approximation and optimal solutions. This gap crucially depends on two key factors including the frequency of approximation (i.e., how frequent the approximation operation takes place) and the predefined threshold. We conducted extensive experiments for classification and regression tasks in batch and online modes using several benchmark datasets. The quantitative results show that our proposed AVM obtained comparable predictive performances with current state-of-the-art methods while simultaneously achieving significant computational speed-up due to the ability of the proposed AVM in maintaining the model size.

カーネルオンライン学習における最も困難な問題の1つは、モデルサイズを制限し、モデルのスパース性を促進することです。スパースモデルは、計算とメモリの使用量を改善するだけでなく、一般化能力も強化します。これは、節約の法則に一致する原則です。ただし、不適切なスパースモデリングは、パフォーマンスを大幅に低下させる可能性もあります。この論文では、スパース性を促進し、パフォーマンスを低下させるリスクを回避できるモデルである近似ベクトルマシン(AVM)を提案します。オンライン設定のコンテキストでは、着信インスタンスが到着すると、そのインスタンスとの距離が定義済みのしきい値未満の近隣インスタンスの1つでこのインスタンスを近似します。重要な直感は、新しく確認されたインスタンスは近隣インスタンスによって表現されるため、最適なパフォーマンスを解析的に定式化して維持できるということです。我々はこの直感を裏付ける理論的基礎を開発し、さらにヒンジ、スムーズヒンジ、ロジスティック（分類タスク用）および$\ell_{1}$、$\ell_{2}$、$\varepsilon$非感受性（回帰タスク用）などの一般的な損失関数の分析を確立して、近似解と最適解のギャップを特徴付けます。このギャップは、近似の頻度（近似演算が行われる頻度）と事前定義されたしきい値を含む2つの主要な要因に大きく依存します。私たちは、いくつかのベンチマークデータセットを使用して、バッチモードとオンラインモードで分類タスクと回帰タスクの広範な実験を実施しました。定量的な結果から、提案されたAVMは、現在の最先端の方法と同等の予測性能を獲得すると同時に、提案されたAVMがモデルサイズを維持できるため、大幅な計算速度の向上を実現したことがわかります。
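
The approximation step described in this abstract (replacing an incoming instance by a stored neighbor that lies within a predefined threshold) can be sketched as follows. This is an illustrative sketch only, not the AVM algorithm itself: the class name `CoreSet`, the use of Euclidean distance, the nearest-neighbor choice, and the rule of storing an unmatched instance as a new core point are all assumptions made for the example.

```python
import numpy as np

class CoreSet:
    """Stores core points; approximates new instances by a nearby core point."""

    def __init__(self, delta):
        self.delta = delta   # predefined distance threshold (assumed Euclidean)
        self.cores = []      # list of stored core points

    def approximate(self, x):
        # Return (point, is_new): the nearest core point if it lies within
        # delta of x, otherwise x itself after storing it as a new core point.
        x = np.asarray(x, dtype=float)
        if self.cores:
            dists = np.linalg.norm(np.asarray(self.cores) - x, axis=1)
            j = int(np.argmin(dists))
            if dists[j] < self.delta:
                return self.cores[j], False
        self.cores.append(x)
        return x, True

# Usage: the second instance is close to the first and is approximated by it,
# so the stored set grows only when a genuinely new region is visited.
cs = CoreSet(delta=0.5)
for point in ([0.0, 0.0], [0.1, 0.1], [2.0, 2.0]):
    approx, is_new = cs.approximate(point)
    print(point, approx, is_new)
```

Under these assumptions, the number of stored points depends on the threshold and on how spread out the instances are, rather than growing with every instance seen. A larger threshold approximates more often, which corresponds to the two factors the abstract names as governing the gap to the optimal solution.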

Clustering with Hidden Markov Model on Variable Blocks
変数ブロック上の隠れマルコフモデルによるクラスタリング

Large-scale data containing multiple important rare clusters, even at moderately high dimensions, pose challenges for existing clustering methods. To address this issue, we propose a new mixture model called Hidden Markov Model on Variable Blocks (HMM-VB) and a new mode search algorithm called Modal Baum-Welch (MBW) for mode-association clustering. HMM-VB leverages prior information about chain-like dependence among groups of variables to achieve the effect of dimension reduction. In case such a dependence structure is unknown or assumed merely for the sake of parsimonious modeling, we develop a recursive search algorithm based on BIC to optimize the formation of ordered variable blocks. The MBW algorithm ensures the feasibility of clustering via mode association, achieving linear complexity in terms of the number of variable blocks despite the exponentially growing number of possible state sequences in HMM-VB. In addition, we provide theoretical investigations about the identifiability of HMM-VB as well as the consistency of our approach to search for the block partition of variables in a special case. Experiments on simulated and real data show that our proposed method outperforms other widely used methods.

大規模データに、中程度に高い次元であっても、複数の重要な希少クラスターが含まれている場合、既存のクラスタリング手法では対処が困難です。この問題に対処するために、私たちは、モード関連クラスタリングのために、変数ブロック上の隠れマルコフモデル(HMM-VB)と呼ばれる新しい混合モデルと、モーダルBaum-Welch (MBW)と呼ばれる新しいモード探索アルゴリズムを提案します。HMM-VBは、変数のグループ間のチェーンのような依存関係に関する事前情報を活用して、次元削減の効果を実現します。このような依存関係構造が不明であるか、単に簡素なモデリングのために仮定されている場合のために、順序付けられた変数ブロックの形成を最適化する、BICに基づく再帰的探索アルゴリズムを開発します。MBWアルゴリズムは、モード関連によるクラスタリングの実現可能性を保証し、HMM-VBで可能な状態シーケンスの数が指数関数的に増加するにもかかわらず、変数ブロックの数に関して線形の計算量を実現します。さらに、HMM-VBの識別可能性と、特殊なケースでの変数のブロック分割を探索するアプローチの一致性に関する理論的調査を提供します。シミュレーションデータと実データでの実験により、提案された方法が他の広く使用されている方法よりも優れていることが示されます。

Hinge-Loss Markov Random Fields and Probabilistic Soft Logic
ヒンジ損失マルコフランダム場と確率的ソフトロジック

A fundamental challenge in developing high-impact machine learning technologies is balancing the need to model rich, structured domains with the ability to scale to big data. Many important problem areas are both richly structured and large scale, from social and biological networks, to knowledge graphs and the Web, to images, video, and natural language. In this paper, we introduce two new formalisms for modeling structured data, and show that they can both capture rich structure and scale to big data. The first, hinge-loss Markov random fields (HL-MRFs), is a new kind of probabilistic graphical model that generalizes different approaches to convex inference. We unite three approaches from the randomized algorithms, probabilistic graphical models, and fuzzy logic communities, showing that all three lead to the same inference objective. We then define HL-MRFs by generalizing this unified objective. The second new formalism, probabilistic soft logic (PSL), is a probabilistic programming language that makes HL-MRFs easy to define using a syntax based on first-order logic. We introduce an algorithm for inferring most-probable variable assignments (MAP inference) that is much more scalable than general-purpose convex optimization methods, because it uses message passing to take advantage of sparse dependency structures. We then show how to learn the parameters of HL-MRFs. The learned HL-MRFs are as accurate as analogous discrete models, but much more scalable. Together, these algorithms enable HL-MRFs and PSL to model rich, structured data at scales not previously possible.

インパクトの大きい機械学習テクノロジーを開発する上での基本的な課題は、リッチで構造化されたドメインをモデル化する必要性とビッグデータへの拡張性のバランスを取ることです。ソーシャルネットワークや生物学的ネットワークからナレッジグラフやWeb、画像、ビデオ、自然言語に至るまで、多くの重要な問題領域はリッチに構造化され、大規模です。この論文では、構造化データをモデル化するための2つの新しい形式を紹介し、これらがリッチな構造を捉えながらビッグデータに拡張できることを示します。1つ目のヒンジ損失マルコフランダムフィールド(HL-MRF)は、凸推論に対するさまざまなアプローチを一般化する新しい種類の確率的グラフィカルモデルです。ランダム化アルゴリズム、確率的グラフィカルモデル、ファジーロジックコミュニティからの3つのアプローチを統合し、3つすべてが同じ推論目的につながることを示します。次に、この統合された目的を一般化してHL-MRFを定義します。2つ目の新しい形式である確率的ソフトロジック(PSL)は、一階述語論理に基づく構文を使用してHL-MRFを簡単に定義できる確率的プログラミング言語です。ここでは、最も確率の高い変数割り当てを推論するアルゴリズム(MAP推論)を紹介します。このアルゴリズムは、メッセージパッシングを使用して疎な依存関係構造を利用するため、汎用の凸最適化方法よりもはるかにスケーラブルです。次に、HL-MRFのパラメーターを学習する方法を示します。学習されたHL-MRFは、類似の離散モデルと同程度の精度を持ちながら、はるかにスケーラブルです。これらのアルゴリズムを組み合わせることで、HL-MRFとPSLは、これまで不可能だった規模で、豊富な構造化データをモデル化できます。

Computational Limits of A Distributed Algorithm for Smoothing Spline
平滑化スプラインのための分散アルゴリズムの計算限界

In this paper, we explore statistical versus computational trade-off to address a basic question in the application of a distributed algorithm: what is the minimal computational cost in obtaining statistical optimality? In smoothing spline setup, we observe a phase transition phenomenon for the number of deployed machines that ends up being a simple proxy for computing cost. Specifically, a sharp upper bound for the number of machines is established: when the number is below this bound, statistical optimality (in terms of nonparametric estimation or testing) is achievable; otherwise, statistical optimality becomes impossible. These sharp bounds partly capture intrinsic computational limits of the distributed algorithm considered in this paper, and turn out to be fully determined by the smoothness of the regression function. We name the asymptotic analysis on such split-and-aggregation estimation/inference as splitotic theory. As a side remark, we argue that sample splitting may be viewed as an alternative form of regularization, playing a similar role as smoothing parameter.

この論文では、分散アルゴリズムの適用における基本的な疑問である「統計的最適性を得るための最小の計算コストはいくらか」に対処するために、統計と計算のトレードオフについて検討します。平滑化スプラインの設定では、配備されたマシンの数に関する相転移現象が観察されます。このマシン数は、計算コストの単純な代理指標になります。具体的には、マシンの数に対するシャープな上限が確立されます。マシンの数がこの上限を下回ると、統計的最適性(ノンパラメトリック推定または検定の意味で)が達成可能になります。そうでない場合、統計的最適性は不可能になります。これらのシャープな上限は、本稿で検討する分散アルゴリズムの固有の計算限界を部分的に捉えており、回帰関数の滑らかさによって完全に決定されることがわかります。このような分割と集約による推定/推論に関する漸近解析をsplitotic理論と名付けます。補足として、サンプル分割は、平滑化パラメーターと同様の役割を果たす、正則化の代替形式と見なすことができると主張します。

Optimal Dictionary for Least Squares Representation
最小二乗表現の最適辞書

Dictionaries are collections of vectors used for the representation of a class of vectors in Euclidean spaces. Recent research on optimal dictionaries is focused on constructing dictionaries that offer sparse representations, i.e., $\ell_0$-optimal representations. Here we consider the problem of finding optimal dictionaries with which representations of a given class of vectors are optimal in an $\ell_2$-sense: optimality of representation is defined as attaining the minimal average $\ell_2$-norm of the coefficients used to represent the vectors in the given class. With the help of recent results on rank-1 decompositions of symmetric positive semidefinite matrices, we provide an explicit description of $\ell_2$-optimal dictionaries as well as their algorithmic constructions in polynomial time.

辞書は、ユークリッド空間内のベクトルのクラスを表現するために使用されるベクトルの集まりです。最適な辞書に関する最近の研究は、スパースな表現、つまり$\ell_0$最適な表現を提供する辞書の構築に焦点を当てています。ここでは、与えられたクラスのベクトルの表現が$\ell_2$の意味で最適となる辞書を見つける問題を考えます。表現の最適性は、与えられたクラスのベクトルを表すために使用される係数の平均$\ell_2$ノルムが最小になることと定義されます。対称半正定値行列のランク1分解に関する最近の結果を利用して、$\ell_2$最適な辞書の明示的な記述と、多項式時間でのアルゴリズム的構成を与えます。

Distributed Bayesian Learning with Stochastic Natural Gradient Expectation Propagation and the Posterior Server
確率的自然勾配期待伝播と事後サーバーによる分散ベイズ学習

This paper makes two contributions to Bayesian machine learning algorithms. Firstly, we propose stochastic natural gradient expectation propagation (SNEP), a novel alternative to expectation propagation (EP), a popular variational inference algorithm. SNEP is a black box variational algorithm, in that it does not require any simplifying assumptions on the distribution of interest, beyond the existence of some Monte Carlo sampler for estimating the moments of the EP tilted distributions. Further, as opposed to EP which has no guarantee of convergence, SNEP can be shown to be convergent, even when using Monte Carlo moment estimates. Secondly, we propose a novel architecture for distributed Bayesian learning which we call the posterior server. The posterior server allows scalable and robust Bayesian learning in cases where a data set is stored in a distributed manner across a cluster, with each compute node containing a disjoint subset of data. An independent Monte Carlo sampler is run on each compute node, with direct access only to the local data subset, but which targets an approximation to the global posterior distribution given all data across the whole cluster. This is achieved by using a distributed asynchronous implementation of SNEP to pass messages across the cluster. We demonstrate SNEP and the posterior server on distributed Bayesian learning of logistic regression and neural networks.

この論文では、ベイジアン機械学習アルゴリズムに2つの貢献をしています。まず、確率的自然勾配期待値伝播(SNEP)を提案します。これは、人気のある変分推論アルゴリズムである期待値伝播(EP)の新しい代替手段です。SNEPはブラックボックス変分アルゴリズムであり、EP傾斜分布のモーメントを推定するためのモンテカルロサンプラーの存在以外に、関心のある分布に関する単純化の仮定を必要としません。さらに、収束の保証がないEPとは対照的に、SNEPはモンテカルロモーメント推定を使用する場合でも収束することが示されています。次に、分散ベイジアン学習の新しいアーキテクチャを提案します。これを事後サーバーと呼びます。事後サーバーは、データセットがクラスター全体に分散して保存され、各コンピューティングノードにデータの分離したサブセットが含まれる場合に、スケーラブルで堅牢なベイジアン学習を可能にします。各コンピューティングノードでは、独立したモンテカルロサンプラーが実行され、ローカルデータサブセットにのみ直接アクセスできますが、クラスター全体のすべてのデータに基づいて、グローバル事後分布の近似値を目指します。これは、SNEPの分散非同期実装を使用してクラスター全体にメッセージを渡すことで実現されます。ロジスティック回帰とニューラルネットワークの分散ベイズ学習でSNEPと事後サーバーを実証します。

Accelerating Stochastic Composition Optimization
確率的組成最適化の加速

We consider the stochastic nested composition optimization problem where the objective is a composition of two expected-value functions. We propose a new stochastic first-order method, namely the accelerated stochastic compositional proximal gradient (ASC-PG) method. This algorithm updates the solution based on noisy gradient queries using a two-timescale iteration. The ASC-PG is the first proximal gradient method for the stochastic composition problem that can deal with nonsmooth regularization penalty. We show that the ASC-PG exhibits faster convergence than the best known algorithms, and that it achieves the optimal sample-error complexity in several important special cases. We demonstrate the application of ASC-PG to reinforcement learning and conduct numerical experiments.

私たちは、確率的入れ子型合成最適化問題を考えます。ここでは、目的が2つの期待値関数の合成です。私たちは、新しい確率的1次法、すなわち加速確率的組成近位勾配(ASC-PG)法を提案します。このアルゴリズムは、2タイムスケールの反復を使用して、ノイズの多い勾配クエリに基づいて解を更新します。ASC-PGは、非平滑正則化ペナルティを処理できる確率的組成問題の最初の近位勾配法です。ASC-PGは、最もよく知られているアルゴリズムよりも高速な収束を示し、いくつかの重要な特殊なケースで最適なサンプル誤差の複雑さを達成することを示します。ASC-PGの強化学習への応用を実証し、数値実験を行います。

A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation
Power Expectation Propagationを用いたガウス過程擬似点近似のための統一フレームワーク

Gaussian processes (GPs) are flexible distributions over functions that enable high-level assumptions about unknown functions to be encoded in a parsimonious, flexible and general way. Although elegant, the application of GPs is limited by computational and analytical intractabilities that arise when data are sufficiently numerous or when employing non-Gaussian models. Consequently, a wealth of GP approximation schemes have been developed over the last 15 years to address these key limitations. Many of these schemes employ a small set of pseudo data points to summarise the actual data. In this paper we develop a new pseudo-point approximation framework using Power Expectation Propagation (Power EP) that unifies a large number of these pseudo-point approximations. Unlike much of the previous venerable work in this area, the new framework is built on standard methods for approximate inference (variational free-energy, EP and Power EP methods) rather than employing approximations to the probabilistic generative model itself. In this way all of the approximation is performed at ‘inference time’ rather than at ‘modelling time’, resolving awkward philosophical and empirical questions that trouble previous approaches. Crucially, we demonstrate that the new framework includes new pseudo-point approximation methods that outperform current approaches on regression and classification tasks.

ガウス過程(GP)は関数上の柔軟な分布であり、未知の関数に関する高レベルの仮定を簡潔で柔軟かつ一般的な方法でエンコードできます。GPは洗練されていますが、データが十分に多い場合や非ガウスモデルを使用する場合に発生する計算および分析の扱いにくさによって、その応用は制限されます。その結果、これらの主要な制限に対処するために、過去15年間にわたって豊富なGP近似スキームが開発されました。これらのスキームの多くは、実際のデータを要約するために、少数の疑似データポイントを使用します。この論文では、多数の疑似ポイント近似を統合するPower Expectation Propagation (Power EP)を使用して、新しい疑似ポイント近似フレームワークを開発します。この分野でのこれまでの多くの優れた研究とは異なり、新しいフレームワークは、確率的生成モデル自体の近似を使用するのではなく、近似推論の標準的な方法(変分自由エネルギー、EP、およびPower EP法)に基づいて構築されています。この方法では、すべての近似が「モデリング時」ではなく「推論時」に実行されるため、従来のアプローチを悩ませていた厄介な哲学的および経験的な問題が解決されます。重要なことは、新しいフレームワークには、回帰および分類タスクで現在のアプローチよりも優れた新しい擬似点近似方法が含まれていることを実証することです。

Online Learning to Rank with Top-k Feedback
Top-kフィードバックでランク付けするためのオンライン学習

We consider two settings of online learning to rank where feedback is restricted to top ranked items. The problem is cast as an online game between a learner and a sequence of users, over $T$ rounds. In both settings, the learner’s objective is to present a ranked list of items to the users. The learner’s performance is judged on the entire ranked list and true relevances of the items. However, the learner receives highly restricted feedback at the end of each round, in the form of relevances of only the top $k$ ranked items, where $k \ll m$. The first setting is non-contextual, where the list of items to be ranked is fixed. The second setting is contextual, where lists of items vary, in the form of traditional query-document lists. No stochastic assumption is made on the generation process of relevances of items and contexts. We provide efficient ranking strategies for both settings. The strategies achieve $O(T^{2/3})$ regret, where regret is based on popular ranking measures in the first setting and ranking surrogates in the second setting. We also provide impossibility results for certain ranking measures and a certain class of surrogates, when feedback is restricted to the top ranked item, i.e. $k=1$. We empirically demonstrate the performance of our algorithms on simulated and real world data sets.

私たちは、フィードバックが上位ランクのアイテムに制限される、オンラインランキング学習の2つの設定を検討します。問題は、学習者と一連のユーザーの間で$T$ラウンドにわたって行われるオンラインゲームとして表されます。両方の設定で、学習者の目的は、ランク付けされたアイテムのリストをユーザーに提示することです。学習者のパフォーマンスは、ランク付けされたリスト全体とアイテムの真の関連性に基づいて判断されます。ただし、学習者は各ラウンドの終了時に、上位$k$ランクのアイテムのみの関連性の形式で、非常に制限されたフィードバックを受け取ります(ここで、$k \ll m$)。最初の設定は非コンテキストで、ランク付けされるアイテムのリストは固定されています。2番目の設定はコンテキストで、アイテムのリストは、従来のクエリドキュメントリストの形式で変化します。アイテムとコンテキストの関連性の生成プロセスについては、確率論的な仮定は行われません。両方の設定に対して効率的なランキング戦略を提供します。戦略は、最初の設定では人気のランキング尺度に基づき、2番目の設定ではランキングサロゲートに基づき、$O(T^{2/3})$の後悔を実現します。また、フィードバックが最上位の項目に制限されている場合、つまり$k=1$の場合、特定のランキング尺度と特定のクラスの代理に対して不可能な結果も提供します。シミュレーションおよび実際のデータセットでアルゴリズムのパフォーマンスを実証します。

Confidence Sets with Expected Sizes for Multiclass Classification
多クラス分類の期待サイズを持つ信頼度セット

Multiclass classification problems such as image annotation can involve a large number of classes. In this context, confusion between classes can occur, and single label classification may be misleading. We provide in the present paper a general device that, given an unlabeled dataset and a score function defined as the minimizer of some empirical and convex risk, outputs a set of class labels, instead of a single one. Interestingly, this procedure does not require that the unlabeled dataset explores the whole classes. Even more, the method is calibrated to control the expected size of the output set while minimizing the classification risk. We show the statistical optimality of the procedure and establish rates of convergence under the Tsybakov margin condition. It turns out that these rates are linear on the number of labels. We apply our methodology to convex aggregation of confidence sets based on the $V$-fold cross validation principle also known as the superlearning principle (van der Laan et al., 2007). We illustrate the numerical performance of the procedure on real data and demonstrate in particular that with moderate expected size, w.r.t. the number of labels, the procedure provides significant improvement of the classification risk.

画像注釈などのマルチクラス分類問題には、多数のクラスが関係することがあります。この状況では、クラス間の混乱が生じる可能性があり、単一ラベル分類は誤解を招く可能性があります。この論文では、ラベルなしデータセットと、経験的かつ凸型リスクの最小化として定義されたスコア関数が与えられた場合に、単一ではなくクラスラベルのセットを出力する一般的なデバイスを提供します。興味深いことに、この手順では、ラベルなしデータセットがクラス全体を探索する必要はありません。さらに、この方法は、分類リスクを最小限に抑えながら、出力セットの予想サイズを制御するように調整されています。この手順の統計的最適性を示し、Tsybakovマージン条件下での収束率を確立します。これらの率はラベルの数に対して線形であることがわかりました。この方法論を、スーパーラーニング原理としても知られる$V$分割交差検証原理に基づく信頼セットの凸集約に適用します(van der Laan他、2007)。実際のデータに対する手順の数値的パフォーマンスを示し、特に、ラベルの数に関して中程度の予想サイズの場合、この手順によって分類リスクが大幅に改善されることを実証します。

Harder, Better, Faster, Stronger Convergence Rates for Least-Squares Regression
最小二乗回帰のためのより困難、より良く、より速く、より強い収束率

We consider the optimization of a quadratic objective function whose gradients are only accessible through a stochastic oracle that returns the gradient at any given point plus a zero-mean finite variance random error. We present the first algorithm that achieves jointly the optimal prediction error rates for least-squares regression, both in terms of forgetting the initial conditions in $O(1/n^2)$, and in terms of dependence on the noise and dimension $d$ of the problem, as $O(d/n)$. Our new algorithm is based on averaged accelerated regularized gradient descent, and may also be analyzed through finer assumptions on initial conditions and the Hessian matrix, leading to dimension-free quantities that may still be small in some distances while the “optimal” terms above are large. In order to characterize the tightness of these new bounds, we consider an application to non-parametric regression and use the known lower bounds on the statistical performance (without computational limits), which happen to match our bounds obtained from a single pass on the data and thus show optimality of our algorithm in a wide variety of particular trade-offs between bias and variance.

私たちは、任意の点における勾配とゼロ平均有限分散ランダム誤差を返す確率的オラクルを通じてのみ勾配にアクセスできる二次目的関数の最適化について検討します。私たちは、最小二乗回帰の最適予測誤差率を、初期条件を忘れるという点では$O(1/n^2)$、問題のノイズと次元$d$への依存性という点では$O(d/n)$の両方で同時に達成する最初のアルゴリズムを提示します。我々の新しいアルゴリズムは、平均加速正規化勾配降下法に基づいており、初期条件とヘッセ行列に関するより細かい仮定を通じて分析することもできます。その結果、次元フリーの量が、上記の「最適な」項が大きい一方で、いくつかの距離では依然として小さい可能性があります。これらの新しい境界の厳しさを特徴付けるために、ノンパラメトリック回帰への応用を検討し、統計的パフォーマンスの既知の下限値（計算上の制限なし）を使用します。この下限値は、データの単一パスから得られた境界値と一致し、バイアスと分散の間のさまざまな特定のトレードオフにおいてアルゴリズムが最適であることを示します。

Stability of Controllers for Gaussian Process Dynamics
ガウスプロセスダイナミクスのためのコントローラの安定性

Learning control has become an appealing alternative to the derivation of control laws based on classic control theory. However, a major shortcoming of learning control is the lack of performance guarantees which prevents its application in many real-world scenarios. As a step towards widespread deployment of learning control, we provide stability analysis tools for controllers acting on dynamics represented by Gaussian processes (GPs). We consider differentiable Markovian control policies and system dynamics given as (i) the mean of a GP, and (ii) the full GP distribution. For both cases, we analyze finite and infinite time horizons. Furthermore, we study the effect of disturbances on the stability results. Empirical evaluations on simulated benchmark problems support our theoretical results.

制御の学習は、古典的な制御理論に基づく制御法則の導出に代わる魅力的な選択肢となっています。ただし、学習制御の大きな欠点は、パフォーマンスの保証がないため、多くの実際のシナリオでの適用が妨げられていることです。学習制御の普及に向けた一歩として、ガウス過程(GP)に代表されるダイナミクスに作用するコントローラに対する安定性解析ツールを提供しています。微分可能なマルコフ制御方策とシステムダイナミクスを、(i)GPの平均、および(ii)完全なGP分布として与えられると考えます。どちらの場合も、有限と無限の時間軸を分析します。さらに、安定性結果に対する外乱の影響についても検討しています。シミュレートされたベンチマーク問題に関する実証的評価は、私たちの理論的結果を裏付けています。

Bayesian Network Learning via Topological Order
トポロジカル順序によるベイジアンネットワーク学習

We propose a mixed integer programming (MIP) model and iterative algorithms based on topological orders to solve optimization problems with acyclic constraints on a directed graph. The proposed MIP model has a significantly lower number of constraints compared to popular MIP models based on cycle elimination constraints and triangular inequalities. The proposed iterative algorithms use gradient descent and iterative reordering approaches, respectively, for searching topological orders. A computational experiment is presented for the Gaussian Bayesian network learning problem, an optimization problem minimizing the sum of squared errors of regression models with L1 penalty over a feature network with application of gene network inference in bioinformatics.

私たちは、有向グラフ上の非巡回制約を持つ最適化問題を解くために、トポロジカル次数に基づく混合整数計画法(MIP)モデルと反復アルゴリズムを提案します。提案されたMIPモデルは、サイクル除去制約と三角不等式に基づく一般的なMIPモデルと比較して、制約の数が大幅に少なくなっています。提案された反復アルゴリズムは、トポロジカル順序の検索に、勾配降下法と反復再順序付けアプローチをそれぞれ使用します。ガウスベイジアンネットワーク学習問題、つまり、バイオインフォマティクスにおける遺伝子ネットワーク推論の適用により、特徴ネットワークに対するL1ペナルティを伴う回帰モデルの二乗誤差の合計を最小化する最適化問題について、計算実験が提示されます。

Rank Determination for Low-Rank Data Completion
低ランクデータ完了のランク決定

Recently, fundamental conditions on the sampling patterns have been obtained for finite completability of low-rank matrices or tensors given the corresponding ranks. In this paper, we consider the scenario where the rank is not given and we aim to approximate the unknown rank based on the location of sampled entries and some given completion. We consider a number of data models, including single-view matrix, multi-view matrix, CP tensor, tensor-train tensor and Tucker tensor. For each of these data models, we provide an upper bound on the rank when an arbitrary low-rank completion is given. We characterize these bounds both deterministically, i.e., with probability one given that the sampling pattern satisfies certain combinatorial properties, and probabilistically, i.e., with high probability given that the sampling probability is above some threshold. Moreover, for both single-view matrix and CP tensor, we are able to show that the obtained upper bound is exactly equal to the unknown rank if the lowest-rank completion is given. Furthermore, we provide numerical experiments for the case of single-view matrix, where we use nuclear norm minimization to find a low-rank completion of the sampled data and we observe that in most of the cases the proposed upper bound on the rank is equal to the true rank.

最近、対応するランクが与えられた場合の低ランク行列またはテンソルの有限完成可能性に関するサンプリングパターンの基本条件が得られました。この論文では、ランクが与えられていないシナリオを検討し、サンプリングされたエントリの位置といくつかの与えられた完成に基づいて未知のランクを近似することを目的とします。シングルビューマトリックス、マルチビューマトリックス、CPテンソル、テンソルトレインテンソル、タッカーテンソルなど、いくつかのデータモデルを検討します。これらのデータモデルのそれぞれについて、任意の低ランク完成が与えられた場合のランクの上限を提供します。これらの上限は、決定論的(つまり、サンプリングパターンが特定の組み合わせ特性を満たす場合の確率1)と確率論的(つまり、サンプリング確率が特定のしきい値を超える場合の確率が高い)の両方で特徴付けられます。さらに、シングルビューマトリックスとCPテンソルの両方について、最低ランク完成が与えられた場合、得られた上限が未知のランクと正確に等しいことを示すことができます。さらに、単一ビュー行列の場合の数値実験を提供します。この実験では、核ノルム最小化を使用して、サンプルデータの低ランク完成を見つけ、ほとんどの場合、提案されたランクの上限が真のランクに等しいことを確認します。

Optimal Rates for Multi-pass Stochastic Gradient Methods
マルチパス確率勾配法の最適レート

We analyze the learning properties of the stochastic gradient method when multiple passes over the data and mini-batches are allowed. We study how regularization properties are controlled by the step-size, the number of passes and the mini-batch size. In particular, we consider the square loss and show that for a universal step-size choice, the number of passes acts as a regularization parameter, and optimal finite sample bounds can be achieved by early-stopping. Moreover, we show that larger step-sizes are allowed when considering mini-batches. Our analysis is based on a unifying approach, encompassing both batch and stochastic gradient methods as special cases. As a byproduct, we derive optimal convergence results for batch gradient methods (even in the non-attainable cases).

私たちは、データ上を複数回通過し、ミニバッチが許可されている場合の確率的勾配法の学習特性を分析します。正則化プロパティがステップサイズ、パス数、およびミニバッチサイズによってどのように制御されるかを研究します。特に、二乗損失を考慮し、ユニバーサルステップサイズの選択では、パスの数が正則化パラメーターとして機能し、早期停止によって最適な有限サンプル境界を達成できることを示します。さらに、ミニバッチを検討する際には、より大きなステップサイズが許容されることを示しています。私たちの分析は、特殊なケースとしてバッチ勾配法と確率的勾配法の両方を含む統一的なアプローチに基づいています。副産物として、バッチグラジエント法の最適な収束結果を導き出します(達成不可能な場合でも)。
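As a rough illustration of the setting in this abstract, the sketch below runs mini-batch stochastic gradient descent on a one-dimensional least-squares problem for several passes and keeps the iterate with the lowest hold-out error, so that the number of passes plays the role of a tuning parameter. It is a minimal sketch under stated assumptions, not the paper's algorithm or analysis: the function names, the hold-out selection rule, the toy data, and the step size are all hypothetical choices made here.

```python
import random


def mean_squared_error(w, data):
    """Average squared error of the linear model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def multipass_minibatch_sgd(train, holdout, step, batch_size, max_passes, seed=0):
    """Run several passes of mini-batch SGD and keep the best hold-out iterate.

    Hypothetical helper: selecting the pass by hold-out error is an
    assumption of this sketch, used to mimic early stopping.
    """
    rng = random.Random(seed)
    w = 0.0
    best_w, best_err = w, mean_squared_error(w, holdout)
    errors = [best_err]  # hold-out error before training and after each pass
    for _ in range(max_passes):
        order = list(range(len(train)))
        rng.shuffle(order)  # visit the data in a fresh random order each pass
        for start in range(0, len(order), batch_size):
            batch = [train[i] for i in order[start:start + batch_size]]
            # Gradient of 0.5 * (w * x - y)^2 averaged over the mini-batch.
            grad = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= step * grad
        err = mean_squared_error(w, holdout)
        errors.append(err)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, errors


# Toy noise-free data generated from y = 2 * x (an assumption for illustration).
train = [(i / 10, 2.0 * i / 10) for i in range(1, 11)]
holdout = [(0.25, 0.5), (0.75, 1.5)]
best_w, errors = multipass_minibatch_sgd(
    train, holdout, step=0.5, batch_size=2, max_passes=50
)
```

On noisy, high-dimensional data the hold-out error would typically stop improving after some number of passes, which is where stopping early matters; the toy data here is noise-free, so the error simply keeps shrinking.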

openXBOW — Introducing the Passau Open-Source Crossmodal Bag-of-Words Toolkit
openXBOW — Passau オープンソースのクロスモーダル Bag-of-Words ツールキットの紹介

We introduce openXBOW, an open-source toolkit for the generation of bag-of-words (BoW) representations from multimodal input. In the BoW principle, word histograms were first used as features in document classification, but the idea was and can easily be adapted to, e.g., acoustic or visual descriptors, introducing a prior step of vector quantisation. The openXBOW toolkit supports arbitrary numeric input features and text input and concatenates computed sub-bags to a final bag. It provides a variety of extensions and options. To our knowledge, openXBOW is the first publicly available toolkit for the generation of crossmodal bags-of-words. The capabilities of the tool have been exemplified in different scenarios: sentiment analysis in tweets, classification of snore sounds, and time-dependent emotion recognition based on acoustic, linguistic, and visual information, where improved results over other feature representations were observed.

私たちは、マルチモーダル入力からBag-of-Words(BoW)表現を生成するためのオープンソースツールキットであるopenXBOWを紹介します。BoWの原則では、ワードヒストグラムは最初にドキュメント分類の特徴として使用されましたが、そのアイデアは、例えば音響記述子や視覚記述子などに簡単に適応でき、ベクトル量子化の先行ステップを導入しました。openXBOWツールキットは、任意の数値入力機能とテキスト入力をサポートし、計算されたサブバッグを最終的なバッグに連結します。さまざまな拡張機能とオプションを提供します。私たちの知る限り、openXBOWは、クロスモーダルな単語の袋を生成するための最初の公開ツールキットです。このツールの機能は、ツイートの感情分析、いびき音の分類、音響情報、言語情報、視覚情報に基づく時間依存の感情認識など、さまざまなシナリオで実証されており、他の特徴表現よりも改善された結果が観察されました。

Fisher Consistency for Prior Probability Shift
事前確率シフトのフィッシャー一貫性

We introduce Fisher consistency in the sense of unbiasedness as a desirable property for estimators of class prior probabilities. Lack of Fisher consistency could be used as a criterion to dismiss estimators that are unlikely to deliver precise estimates in test data sets under prior probability and more general data set shift. The usefulness of this unbiasedness concept is demonstrated with three examples of classifiers used for quantification: Adjusted Count, EM-algorithm and CDE-Iterate. We find that Adjusted Count and EM-algorithm are Fisher consistent. A counter-example shows that CDE-Iterate is not Fisher consistent and, therefore, cannot be trusted to deliver reliable estimates of class probabilities.

私たちは、不偏性という意味でのフィッシャー一貫性を、クラスの事前確率の推定器にとって望ましい特性として紹介します。フィッシャーの一貫性の欠如は、事前確率およびより一般的なデータセットシフトの下でテストデータセットで正確な推定を提供する可能性が低い推定量を却下する基準として使用できます。この不偏性の概念の有用性は、定量化に使用される分類器の3つの例(調整済みカウント、EMアルゴリズム、CDE-反復)で実証されています。調整済みカウントとEMアルゴリズムはフィッシャーと一致していることがわかります。反例は、CDE-IterateがFisherの一貫性がないため、クラス確率の信頼できる推定値を提供すると信頼できないことを示しています。
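The idea of unbiasedness under prior probability shift can be made concrete with a small sketch of the Adjusted Count correction in its commonly used binary form. The abstract does not state the formula, so the form used here, the function names, and the example rates are assumptions for illustration. It relies on the standard premise that the classifier's true and false positive rates stay the same between training and test data while the class prior changes.

```python
def predicted_positive_rate(prior, tpr, fpr):
    """Fraction of test items a classifier labels positive for a given prior.

    Under prior probability shift the class-conditional rates (tpr, fpr)
    are assumed unchanged, so the rate is a mixture of the two.
    """
    return prior * tpr + (1.0 - prior) * fpr


def adjusted_count(positive_rate, tpr, fpr):
    """Invert the mixture above to estimate the positive-class prior.

    Hypothetical helper name; the estimate is clipped to [0, 1].
    """
    if tpr == fpr:
        raise ValueError("tpr and fpr must differ for the correction to exist")
    estimate = (positive_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))


# Example: a classifier with tpr = 0.8 and fpr = 0.1 applied to test data
# whose true positive prior is 0.3 (all three values are made up).
observed = predicted_positive_rate(prior=0.3, tpr=0.8, fpr=0.1)
estimated_prior = adjusted_count(observed, tpr=0.8, fpr=0.1)
```

With exact population rates the estimate recovers the true prior, which is the sense of Fisher consistency the abstract refers to; the raw fraction of positive predictions does not have this property unless the classifier is perfect.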

An Easy-to-hard Learning Paradigm for Multiple Classes and Multiple Labels
複数のクラスと複数のラベルのための「易から難へ」の学習パラダイム

Many applications, such as human action recognition and object detection, can be formulated as a multiclass classification problem. One-vs-rest (OVR) is one of the most widely used approaches for multiclass classification due to its simplicity and excellent performance. However, many confusing classes in such applications will degrade its results. For example, hand clap and boxing are two confusing actions. Hand clap is easily misclassified as boxing, and vice versa. Therefore, precisely classifying confusing classes remains a challenging task. To obtain better performance for multiclass classifications that have confusing classes, we first develop a classifier chain model for multiclass classification (CCMC) to transfer class information between classifiers. Then, based on an analysis of our proposed model, we propose an easy-to-hard learning paradigm for multiclass classification to automatically identify easy and hard classes and then use the predictions from simpler classes to help solve harder classes. Similar to CCMC, the classifier chain (CC) model is also proposed by Read et al. (2009) to capture the label dependency for multi-label classification. However, CC does not consider the order of difficulty of the labels and achieves degenerated performance when there are many confusing labels. Therefore, it is non-trivial to learn the appropriate label order for CC. Motivated by our analysis for CCMC, we also propose the easy-to-hard learning paradigm for multi-label classification to automatically identify easy and hard labels, and then use the predictions from simpler labels to help solve harder labels. We also demonstrate that our proposed strategy can be successfully applied to a wide range of applications, such as ordinal classification and relationship prediction. Extensive empirical studies validate our analysis and the effectiveness of our proposed easy-to-hard learning strategies.

人間の動作認識や物体検出など、多くのアプリケーションは、マルチクラス分類問題として定式化できます。One-vs-rest (OVR)は、そのシンプルさと優れたパフォーマンスのため、マルチクラス分類に最も広く使用されているアプローチの1つです。ただし、このようなアプリケーションでは、紛らわしいクラスが多く、結果が低下します。たとえば、手拍子とボクシングは、紛らわしい2つの動作です。手拍子はボクシングと誤分類されやすく、その逆も同様です。したがって、紛らわしいクラスを正確に分類することは、依然として困難なタスクです。紛らわしいクラスがあるマルチクラス分類のパフォーマンスを向上させるために、まず、分類器間でクラス情報を転送するマルチクラス分類用の分類器チェーンモデル(CCMC)を開発します。次に、提案モデルの分析に基づいて、マルチクラス分類用の簡単から難しいへの学習パラダイムを提案し、簡単なクラスと難しいクラスを自動的に識別し、簡単なクラスからの予測を使用して難しいクラスを解決します。CCMCと同様に、分類器チェーン(CC)モデルも、マルチラベル分類のラベル依存性を捉えるためにReadら(2009)によって提案されています。しかし、CCはラベルの難易度の順序を考慮せず、紛らわしいラベルが多い場合はパフォーマンスが低下します。したがって、CCの適切なラベル順序を学習することは簡単ではありません。CCMCの分析に触発されて、マルチラベル分類の簡単から難しいへの学習パラダイムも提案します。これは、簡単なラベルと難しいラベルを自動的に識別し、より単純なラベルからの予測を使用して難しいラベルを解決します。また、提案された戦略は、順序分類や関係予測など、幅広いアプリケーションにうまく適用できることも実証しています。広範な実証研究により、分析と提案された簡単から難しいへの学習戦略の有効性が検証されています。

Identifying Unreliable and Adversarial Workers in Crowdsourced Labeling Tasks
クラウドソーシングのラベリングタスクにおける信頼性の低いワーカーと敵対的なワーカーの特定

We study the problem of identifying unreliable and adversarial workers in crowdsourcing systems where workers (or users) provide labels for tasks (or items). Most existing studies assume that worker responses follow specific probabilistic models; however, recent evidence shows the presence of workers adopting non-random or even malicious strategies. To account for such workers, we suppose that workers comprise a mixture of honest and adversarial workers. Honest workers may be reliable or unreliable, and they provide labels according to an unknown but explicit probabilistic model. Adversaries adopt labeling strategies different from those of honest workers, whether probabilistic or not. We propose two reputation algorithms to identify unreliable honest workers and adversarial workers from only their responses. Our algorithms assume that honest workers are in the majority, and they classify workers with outlier label patterns as adversaries. Theoretically, we show that our algorithms successfully identify unreliable honest workers, workers adopting deterministic strategies, and worst-case sophisticated adversaries who can adopt arbitrary labeling strategies to degrade the accuracy of the inferred task labels. Empirically, we show that filtering out outliers using our algorithms can significantly improve the accuracy of several state-of-the-art label aggregation algorithms in real-world crowdsourcing datasets.

私たちは、ワーカー（またはユーザー）がタスク（またはアイテム）にラベルを付けるクラウドソーシングシステムにおいて、信頼できないワーカーと敵対的なワーカーを識別する問題を研究しています。既存の研究のほとんどは、ワーカーの応答が特定の確率モデルに従うと想定しています。しかし、最近の証拠は、非ランダムな、あるいは悪意のある戦略さえも採用するワーカーの存在を示しています。そのようなワーカーを説明するために、ワーカーには正直なワーカーと敵対的なワーカーが混在していると仮定します。正直なワーカーは信頼できる場合も信頼できない場合もあり、未知だが明示的な確率モデルに従ってラベルを付けます。敵対者は、確率的であるかどうかにかかわらず、正直なワーカーとは異なるラベル付け戦略を採用します。私たちは、信頼できない正直なワーカーと敵対的なワーカーをその応答のみから識別するための2つの評判アルゴリズムを提案します。我々のアルゴリズムは、正直なワーカーが大多数であると想定し、外れ値のラベルパターンを持つワーカーを敵対者として分類します。理論的には、私たちのアルゴリズムは、信頼できない正直な作業者、決定論的な戦略を採用している作業者、そして最悪の場合、任意のラベル付け戦略を採用して推定されたタスクラベルの精度を低下させることができる洗練された敵を正常に識別できることを示しています。経験的には、私たちのアルゴリズムを使用して外れ値を除外すると、実際のクラウドソーシングデータセットにおけるいくつかの最先端のラベル集約アルゴリズムの精度が大幅に向上することを示しています。

Distributed Learning with Regularized Least Squares
正則化最小二乗法による分散学習

We study distributed learning with the least squares regularization scheme in a reproducing kernel Hilbert space (RKHS). By a divide-and-conquer approach, the algorithm partitions a data set into disjoint data subsets, applies the least squares regularization scheme to each data subset to produce an output function, and then takes an average of the individual output functions as a final global estimator or predictor. We show with error bounds and learning rates in expectation in both the $L^2$-metric and RKHS-metric that the global output function of this distributed learning is a good approximation to the algorithm processing the whole data in one single machine. Our derived learning rates in expectation are optimal and stated in a general setting without any eigenfunction assumption. The analysis is achieved by a novel second order decomposition of operator differences in our integral operator approach. Even for the classical least squares regularization scheme in the RKHS associated with a general kernel, we give the best learning rate in expectation in the literature.

私たちは、再生カーネルヒルベルト空間(RKHS)における最小二乗正則化スキームを用いた分散学習を研究します。分割統治法によって、アルゴリズムはデータセットを互いに素なデータサブセットに分割し、各データサブセットに最小二乗正則化スキームを適用して出力関数を生成し、個々の出力関数の平均を最終的なグローバル推定値または予測値とします。私たちは、$L^2$メトリックとRKHSメトリックの両方における誤差境界と期待学習率を用いて、この分散学習のグローバル出力関数が、1台のマシンでデータ全体を処理するアルゴリズムの良好な近似値であることを示す。我々が導出した期待学習率は最適であり、固有関数の仮定なしに一般的な設定で述べられています。この分析は、我々の積分演算子アプローチにおける演算子差の新しい2次分解によって達成されます。一般的なカーネルに関連付けられたRKHSの古典的な最小二乗正則化スキームに対しても、我々は文献中で最高の期待学習率を与える。
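The divide-and-conquer scheme described in this abstract can be sketched in a few lines: fit a regularized least-squares estimator on each disjoint subset, then average the resulting functions. The code below is a minimal sketch and not the authors' implementation. The Gaussian kernel and its width, the value of the regularization parameter, the interleaved partition into equal-sized subsets, the toy data, and all function names are assumptions chosen for illustration. A small pure-Python linear solver is included only to keep the example self-contained.

```python
from math import exp, sin


def gaussian_kernel(x, z, width=1.0):
    """Gaussian kernel on the real line (kernel choice is an assumption)."""
    return exp(-((x - z) ** 2) / (2.0 * width ** 2))


def solve(matrix, rhs):
    """Solve a dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (a[r][n] - tail) / a[r][r]
    return x


def fit_regularized_least_squares(xs, ys, lam):
    """Minimize (1/n) * sum (f(x_i) - y_i)^2 + lam * ||f||^2 over the RKHS.

    The minimizer is f(x) = sum_i c_i K(x, x_i) with (K + lam * n * I) c = y.
    """
    n = len(xs)
    gram = [
        [gaussian_kernel(xi, xj) + (lam * n if i == j else 0.0)
         for j, xj in enumerate(xs)]
        for i, xi in enumerate(xs)
    ]
    coeffs = solve(gram, ys)
    return lambda x: sum(c * gaussian_kernel(x, xi) for c, xi in zip(coeffs, xs))


def fit_distributed(xs, ys, lam, num_subsets):
    """Fit one estimator per disjoint subset and average the output functions."""
    local = [
        fit_regularized_least_squares(xs[j::num_subsets], ys[j::num_subsets], lam)
        for j in range(num_subsets)
    ]
    return lambda x: sum(f(x) for f in local) / len(local)


# Toy data: 30 noise-free samples of sin(x) on [0, 5.8].
xs = [0.2 * i for i in range(30)]
ys = [sin(x) for x in xs]
global_fit = fit_regularized_least_squares(xs, ys, lam=1e-3)
averaged_fit = fit_distributed(xs, ys, lam=1e-3, num_subsets=3)
```

Comparing `averaged_fit(x)` with `global_fit(x)` at a few points gives a rough feel for the claim that the averaged estimator approximates the one trained on all the data. With subsets of unequal size, a weighted average proportional to subset size would be the natural variant.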

A distributed block coordinate descent method for training l1 regularized linear classifiers
l1正則化線形分類器を訓練するための分散ブロック座標降下法

Distributed training of $l_1$ regularized classifiers has received great attention recently. Most existing methods approach this problem by taking steps obtained from approximating the objective by a quadratic approximation that is decoupled at the individual variable level. These methods are designed for multicore systems where communication costs are low. They are inefficient on systems such as Hadoop running on a cluster of commodity machines where communication costs are substantial. In this paper we design a distributed algorithm for $l_1$ regularization that is much better suited for such systems than existing algorithms. A careful cost analysis is used to support these points and motivate our method. The main idea of our algorithm is to do block optimization of many variables on the actual objective function within each computing node; this increases the computational cost per step that is matched with the communication cost, and decreases the number of outer iterations, thus yielding a faster overall method. Distributed Gauss-Seidel and Gauss-Southwell greedy schemes are used for choosing variables to update in each step. We establish global convergence theory for our algorithm, including Q-linear rate of convergence. Experiments on two benchmark problems show our method to be much faster than existing methods.

$l_1$正規化分類器の分散トレーニングは、最近大きな注目を集めています。既存の方法のほとんどは、個々の変数レベルで分離された2次近似によって目的を近似することで得られるステップを実行することでこの問題に取り組んでいます。これらの方法は、通信コストが低いマルチコアシステム向けに設計されています。通信コストがかなりかかる市販マシンのクラスターで実行されるHadoopなどのシステムでは非効率的です。この論文では、既存のアルゴリズムよりもそのようなシステムに適した$l_1$正規化の分散アルゴリズムを設計します。これらの点をサポートし、この方法の動機付けを行うために、慎重なコスト分析が使用されています。このアルゴリズムの主なアイデアは、各コンピューティングノード内で実際の目的関数に対して多くの変数のブロック最適化を行うことです。これにより、通信コストと一致するステップあたりの計算コストが増加し、外部反復の数が減り、全体的な方法が高速化されます。分散Gauss-SeidelおよびGauss-Southwell貪欲法は、各ステップで更新する変数を選択するために使用されます。私たちは、Q線形収束率を含む、アルゴリズムのグローバル収束理論を確立しました。2つのベンチマーク問題での実験により、私たちの方法が既存の方法よりもはるかに高速であることが示されました。

A Survey of Algorithms and Analysis for Adaptive Online Learning
適応型オンライン学習のためのアルゴリズムと分析に関する調査

We present tools for the analysis of Follow-The-Regularized-Leader (FTRL), Dual Averaging, and Mirror Descent algorithms when the regularizer (equivalently, prox-function or learning rate schedule) is chosen adaptively based on the data. Adaptivity can be used to prove regret bounds that hold on every round, and also allows for data-dependent regret bounds as in AdaGrad-style algorithms (e.g., Online Gradient Descent with adaptive per-coordinate learning rates). We present results from a large number of prior works in a unified manner, using a modular and tight analysis that isolates the key arguments in easily re-usable lemmas. This approach strengthens previously known FTRL analysis techniques to produce bounds as tight as those achieved by potential functions or primal-dual analysis. Further, we prove a general and exact equivalence between adaptive Mirror Descent algorithms and a corresponding FTRL update, which allows us to analyze Mirror Descent algorithms in the same framework. The key to bridging the gap between Dual Averaging and Mirror Descent algorithms lies in an analysis of the FTRL-Proximal algorithm family. Our regret bounds are proved in the most general form, holding for arbitrary norms and non-smooth regularizers with time-varying weight.

私たちは、正規化子(同等に、近接関数または学習率スケジュール)がデータに基づいて適応的に選択される場合のFollow-The-Regularized-Leader (FTRL)、Dual Averaging、およびMirror Descentアルゴリズムの分析用ツールを紹介します。適応性は、すべてのラウンドで保持される後悔境界を証明するために使用でき、AdaGradスタイルのアルゴリズム(たとえば、適応的な座標ごとの学習率を備えたオンライン勾配降下法)のようにデータ依存の後悔境界も可能にします。私たちは、簡単に再利用可能な補題で主要な引数を分離するモジュール式の厳密な分析を使用して、多数の以前の研究からの結果を統一された方法で提示します。このアプローチは、以前に知られているFTRL分析手法を強化し、ポテンシャル関数または主双対分析によって達成されるものと同じくらい厳密な境界を生成します。さらに、適応型ミラー降下アルゴリズムと対応するFTRL更新の間の一般的かつ正確な同等性を証明し、同じフレームワークでミラー降下アルゴリズムを分析できるようにします。Dual AveragingアルゴリズムとMirror Descentアルゴリズムの間のギャップを埋める鍵は、FTRL-Proximalアルゴリズムファミリの分析にあります。私たちの後悔境界は、任意のノルムと時間によって変化する重みを持つ非滑らかな正則化子に当てはまる最も一般的な形式で証明されています。
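The AdaGrad-style example mentioned in this abstract, Online Gradient Descent with adaptive per-coordinate learning rates, can be sketched as follows. This is a minimal illustration of the diagonal AdaGrad update, not the survey's general FTRL framework. The function names, the small constant `eps`, and the badly scaled quadratic used as a test problem are assumptions introduced here.

```python
from math import sqrt


def adagrad_online_gradient_descent(gradient_fn, dim, rounds, alpha=1.0, eps=1e-12):
    """Online gradient descent with per-coordinate adaptive learning rates.

    Coordinate i uses the rate alpha / sqrt(sum of its squared gradients so far),
    so coordinates with large gradients automatically take smaller steps.
    """
    w = [0.0] * dim
    sum_sq = [0.0] * dim  # running sum of squared gradients per coordinate
    for t in range(rounds):
        g = gradient_fn(t, w)
        for i in range(dim):
            sum_sq[i] += g[i] * g[i]
            # eps only guards against division by zero when a gradient is 0.
            w[i] -= alpha * g[i] / (sqrt(sum_sq[i]) + eps)
    return w


# Hypothetical test problem: f(w) = 0.5 * sum_i c_i * (w_i - a_i)^2 with
# curvatures that differ by four orders of magnitude.
curvatures = [100.0, 0.01]
targets = [3.0, -2.0]


def quadratic_gradient(t, w):
    """Gradient of the fixed quadratic loss; the round index t is unused."""
    return [c * (wi - a) for c, wi, a in zip(curvatures, w, targets)]


w_final = adagrad_online_gradient_descent(quadratic_gradient, dim=2, rounds=200)
```

Because each coordinate's step is normalized by its own gradient history, both coordinates make progress at a similar pace despite the very different curvatures, which a single global learning rate would struggle to achieve.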

The MADP Toolbox: An Open Source Library for Planning and Learning in (Multi-)Agent Systems
MADPツールボックス: (マルチ)エージェントシステムでの計画と学習のためのオープンソースライブラリ

This article describes the Multiagent Decision Process (MADP) Toolbox, a software library to support planning and learning for intelligent agents and multiagent systems in uncertain environments. Key features are that it supports partially observable environments and stochastic transition models; has unified support for single- and multiagent systems; provides a large number of models for decision-theoretic decision making, including one-shot and sequential decision making under various assumptions of observability and cooperation, such as Dec-POMDPs and POSGs; provides tools and parsers to quickly prototype new problems; provides an extensive range of planning and learning algorithms for single- and multiagent systems; is released under the GNU GPL v3 license; and is written in C++ and designed to be extensible via the object-oriented paradigm.

この記事では、不確実な環境でのインテリジェントエージェントとマルチエージェントシステムの計画と学習をサポートするソフトウェアライブラリであるMultiagent Decision Process (MADP) Toolboxについて説明します。主な特徴は、部分的に観測可能な環境と確率的遷移モデルをサポートしていることです。シングルエージェントシステムとマルチエージェントシステムの統一サポートがあります。Dec-POMDPやPOSGなど、可観測性と協力のさまざまな仮定の下でのワンショットおよびシーケンシャルな意思決定を含む、意思決定理論的な意思決定のための多数のモデルを提供します。新しい問題を迅速にプロトタイプ化するためのツールとパーサーを提供します。シングルエージェントおよびマルチエージェントシステム用の広範な計画および学習アルゴリズムを提供します。GNU GPL v3ライセンスの下でリリースされています。また、C++で記述され、オブジェクト指向パラダイムを介して拡張できるように設計されています。

Hierarchical Clustering via Spreading Metrics
拡散メトリックによる階層的クラスタリング

We study the cost function for hierarchical clusterings introduced by (Dasgupta, 2016) where hierarchies are treated as first-class objects rather than deriving their cost from projections into flat clusters. It was also shown in (Dasgupta, 2016) that a top-down algorithm based on the uniform Sparsest Cut problem returns a hierarchical clustering of cost at most $O\left(\alpha_n \log n\right)$ times the cost of the optimal hierarchical clustering, where $\alpha_n$ is the approximation ratio of the Sparsest Cut subroutine used. Thus using the best known approximation algorithm for Sparsest Cut due to Arora-Rao-Vazirani, the top-down algorithm returns a hierarchical clustering of cost at most $O\left(\log^{3/2} n\right)$ times the cost of the optimal solution. We improve this by giving an $O(\log{n})$-approximation algorithm for this problem. Our main technical ingredients are a combinatorial characterization of ultrametrics induced by this cost function, deriving an Integer Linear Programming (ILP) formulation for this family of ultrametrics, and showing how to iteratively round an LP relaxation of this formulation by using the idea of sphere growing which has been extensively used in the context of graph partitioning. We also prove that our algorithm returns an $O(\log{n})$-approximate hierarchical clustering for a generalization of this cost function also studied in (Dasgupta, 2016). Experiments show that the hierarchies found by using the ILP formulation as well as our rounding algorithm often have better projections into flat clusters than the standard linkage based algorithms. We conclude with constant factor inapproximability results for this problem: 1) no polynomial size LP or SDP can achieve a constant factor approximation for this problem and 2) no polynomial time algorithm can achieve a constant factor approximation under the Small Set Expansion hypothesis.

私たちは、(Dasgupta、2016)で導入された階層的クラスタリングのコスト関数を研究します。この関数では、階層は、フラットなクラスターへの投影からコストを導出するのではなく、ファーストクラスのオブジェクトとして扱われます。また、(Dasgupta、2016)では、均一なスパースカット問題に基づくトップダウンアルゴリズムが、コストが最大で最適な階層的クラスタリングのコストの$O\left(\alpha_n \log n\right)$倍の階層的クラスタリングを返すことも示されました。ここで、$\alpha_n$は、使用されるスパースカットサブルーチンの近似比です。したがって、Arora-Rao-Vaziraniによるスパースカットの最もよく知られている近似アルゴリズムを使用すると、トップダウンアルゴリズムは、コストが最大で最適解のコストの$O\left(\log^{3/2} n\right)$倍の階層的クラスタリングを返します。この問題に対して$O(\log{n})$近似アルゴリズムを与えることで、この問題を改善します。主な技術的要素は、このコスト関数によって誘導される超計量の組み合わせ特性評価、この超計量ファミリーの整数線形計画法(ILP)定式化の導出、グラフ分割のコンテキストで広く使用されている球面成長の考え方を使用して、この定式化のLP緩和を反復的に丸める方法の提示です。また、このアルゴリズムが、(Dasgupta、2016)でも研究されているこのコスト関数の一般化に対して$O(\log{n})$近似の階層的クラスタリングを返すことも証明しています。実験では、ILP定式化と丸めアルゴリズムを使用して見つかった階層は、標準的なリンクベースのアルゴリズムよりもフラットなクラスターへの投影が優れていることが多いことが示されています。この問題に対する定数因数近似不可能性の結果を結論として挙げます。1)この問題に対して定数因数近似を達成できる多項式サイズのLPまたはSDPはなく、2)小集合拡張仮説の下では、定数因数近似を達成できる多項式時間アルゴリズムはありません。
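For readers unfamiliar with the cost function of (Dasgupta, 2016) that this abstract builds on, the sketch below evaluates it for a given hierarchy: each pair of points contributes its similarity weight multiplied by the number of leaves under the pair's lowest common ancestor. The nested-tuple tree encoding, the weight dictionary keyed by `frozenset`, the function names, and the four-point example are assumptions made for illustration. The code only evaluates the cost and does not implement the paper's LP rounding algorithm.

```python
def leaves(tree):
    """Return the leaf labels under a node; a non-tuple node is itself a leaf."""
    if isinstance(tree, tuple):
        result = []
        for child in tree:
            result.extend(leaves(child))
        return result
    return [tree]


def dasgupta_cost(tree, weights):
    """Sum over pairs {i, j} of weights[{i, j}] * (leaves under their LCA).

    `weights` maps frozenset({i, j}) to a similarity; missing pairs count as 0.
    """
    if not isinstance(tree, tuple):
        return 0.0
    child_leaves = [leaves(child) for child in tree]
    size = sum(len(group) for group in child_leaves)
    cost = 0.0
    # Pairs split between two different children have this node as their LCA.
    for a in range(len(child_leaves)):
        for b in range(a + 1, len(child_leaves)):
            for i in child_leaves[a]:
                for j in child_leaves[b]:
                    cost += weights.get(frozenset((i, j)), 0.0) * size
    # Pairs inside a single child are charged deeper in the tree.
    for child in tree:
        cost += dasgupta_cost(child, weights)
    return cost


# Four points with two tight pairs {0, 1} and {2, 3} and one weak link {1, 2}.
weights = {
    frozenset((0, 1)): 1.0,
    frozenset((2, 3)): 1.0,
    frozenset((1, 2)): 0.1,
}
good_tree = ((0, 1), (2, 3))  # keeps similar points together until late
bad_tree = ((0, 2), (1, 3))   # separates similar points at the root
good_cost = dasgupta_cost(good_tree, weights)
bad_cost = dasgupta_cost(bad_tree, weights)
```

The hierarchy that cuts only the weak link at the root gets the lower cost, which matches the intuition that similar points should be separated as low in the tree as possible.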

The Impact of Random Models on Clustering Similarity
クラスタリングの類似性に対するランダムモデルの影響

Clustering is a central approach for unsupervised learning. After clustering is applied, the most fundamental analysis is to quantitatively compare clusterings. Such comparisons are crucial for the evaluation of clustering methods as well as other tasks such as consensus clustering. It is often argued that, in order to establish a baseline, clustering similarity should be assessed in the context of a random ensemble of clusterings. The prevailing assumption for the random clustering ensemble is the permutation model in which the number and sizes of clusters are fixed. However, this assumption does not necessarily hold in practice; for example, multiple runs of K-means clustering return clusterings with a fixed number of clusters, while the cluster size distribution varies greatly. Here, we derive corrected variants of two clustering similarity measures (the Rand index and Mutual Information) in the context of two random clustering ensembles in which the number and sizes of clusters vary. In addition, we study the impact of one-sided comparisons in the scenario with a reference clustering. The consequences of different random models are illustrated using synthetic examples, handwriting recognition, and gene expression data. We demonstrate that the choice of random model can have a drastic impact on the ranking of similar clustering pairs, and the evaluation of a clustering method with respect to a random baseline; thus, the choice of random clustering model should be carefully justified.

クラスタリングは、教師なし学習の中心的なアプローチです。クラスタリングを適用した後、最も基本的な分析は、クラスタリングを定量的に比較することです。このような比較は、クラスタリング方法の評価だけでなく、コンセンサスクラスタリングなどの他のタスクにも重要です。ベースラインを確立するために、クラスタリングの類似性をクラスタリングのランダムアンサンブルのコンテキストで評価する必要があるとよく言われます。ランダムクラスタリングアンサンブルの一般的な仮定は、クラスターの数とサイズが固定されている順列モデルです。ただし、この仮定は実際には必ずしも当てはまりません。たとえば、K平均法クラスタリングを複数回実行すると、クラスターの数が固定されたクラスタリングが返されますが、クラスターのサイズ分布は大きく異なります。ここでは、クラスターの数とサイズが異なる2つのランダムクラスタリングアンサンブルのコンテキストで、2つのクラスタリング類似性尺度(Randインデックスと相互情報量)の修正されたバリアントを導出します。さらに、参照クラスタリングのシナリオにおける片側比較の影響について調査します。合成例、手書き認識、遺伝子発現データを使用して、さまざまなランダムモデルの結果を示します。ランダムモデルの選択は、類似のクラスタリングペアのランキングや、ランダムベースラインに対するクラスタリング方法の評価に大きな影響を与える可能性があることを実証します。したがって、ランダムクラスタリングモデルの選択は慎重に正当化する必要があります。
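As a point of reference for the baseline discussed above, here is a minimal sketch of the Rand index and its chance-corrected form under the permutation model (the classical adjusted Rand index). It does not implement the paper's new random models, and the function names are illustrative assumptions.

```python
from collections import Counter
from math import comb


def pair_counts(a, b):
    """Count co-clustered pairs in both clusterings, in a, in b, and all pairs."""
    same_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    same_a = sum(comb(c, 2) for c in Counter(a).values())
    same_b = sum(comb(c, 2) for c in Counter(b).values())
    return same_both, same_a, same_b, comb(len(a), 2)


def rand_index(a, b):
    """Fraction of pairs on which the two clusterings agree."""
    same_both, same_a, same_b, total = pair_counts(a, b)
    # Agreements = pairs together in both + pairs apart in both.
    return (total + 2 * same_both - same_a - same_b) / total


def adjusted_rand_index(a, b):
    """Rand index corrected by its expectation under the permutation model.

    Undefined (division by zero) when both clusterings are trivial.
    """
    same_both, same_a, same_b, total = pair_counts(a, b)
    expected = same_a * same_b / total  # cluster sizes held fixed
    maximum = 0.5 * (same_a + same_b)
    return (same_both - expected) / (maximum - expected)


a = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]  # same partition, different label names
crossed = [0, 1, 0, 1]     # splits every cluster of a

print(rand_index(a, crossed), adjusted_rand_index(a, crossed))
```

Swapping the `expected` term for the expectation under a different random ensemble is where the choice of random model enters; the abstract's point is that this choice can change the resulting rankings.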

Minimax Estimation of Kernel Mean Embeddings
カーネル平均埋め込みのミニマックス推定

In this paper, we study the minimax estimation of the Bochner integral \[ \mu_k(P) := \int_\mathcal{X} k(\cdot,x)\, dP(x), \] also called the kernel mean embedding, based on random samples drawn i.i.d. from $P$, where $k:\mathcal{X}\times\mathcal{X}\rightarrow \mathbb{R}$ is a positive definite kernel. Various estimators $\hat{\theta}_n$ of $\mu_k(P)$ (including the empirical estimator) are studied in the literature, all of which satisfy $\|\hat{\theta}_n-\mu_k(P)\|_{\mathcal{H}_k}=O_P(n^{-1/2})$ with $\mathcal{H}_k$ being the reproducing kernel Hilbert space induced by $k$. The main contribution of the paper is in showing that the above mentioned rate of $n^{-1/2}$ is minimax in $\|\cdot\|_{\mathcal{H}_k}$ and $\|\cdot\|_{L^2(\mathbb{R}^d)}$-norms over the class of discrete measures and the class of measures that have an infinitely differentiable density, with $k$ being a continuous translation-invariant kernel on $\mathbb{R}^d$. The interesting aspect of this result is that the minimax rate is independent of the smoothness of the kernel and the density of $P$ (if it exists).

この論文では、$P$から独立分布で抽出されたランダムサンプルに基づく、カーネル平均埋め込みとも呼ばれるボッホナー積分\[ \mu_k(P) := \int_\mathcal{X} k(\cdot,x)\, dP(x), \]のミニマックス推定について検討します。ここで、$k:\mathcal{X}\times\mathcal{X}\rightarrow \mathbb{R}$は正定値カーネルです。文献では、様々な推定量（経験的推定量を含む）$\mu_k(P)$の$\hat{\theta}_n$が研究されており、それらはすべて$\|\hat{\theta}_n-\mu_k(P)\|_{\mathcal{H}_k}=O_P(n^{-1/2})$を満たし、$\mathcal{H}_k$は$k$によって誘導される再生核ヒルベルト空間です。この論文の主な貢献は、上記の$n^{-1/2}$のレートが、離散測度のクラスと無限に微分可能な密度を持つ測度のクラス（$k$は$\mathbb{R}^d$上の連続的な並進不変カーネル）上で$\|\cdot\|_{\mathcal{H}_k}$および$\|\cdot\|_{L^2(\mathbb{R}^d)}$ノルムにおいてミニマックスであることを示したことです。この結果の興味深い点は、ミニマックスレートがカーネルの滑らかさや$P$の密度（存在する場合）に依存しないことです。
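The empirical estimator mentioned above replaces $P$ with the sample, giving $\hat{\mu}(\cdot) = \frac{1}{n}\sum_i k(\cdot, x_i)$. The sketch below is a minimal illustration in plain Python; the Gaussian kernel, its bandwidth, and all function names are assumptions chosen for the example, not choices made in the paper.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """A translation-invariant kernel on R^d (assumed for illustration)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def empirical_embedding(sample, kernel=gaussian_kernel):
    """Return the function t -> (1/n) * sum_i k(t, x_i)."""
    n = len(sample)

    def mu_hat(t):
        return sum(kernel(t, x) for x in sample) / n

    return mu_hat


def rkhs_distance(sample_a, sample_b, kernel=gaussian_kernel):
    """RKHS norm of the difference between two empirical embeddings."""

    def mean_gram(u, v):
        return sum(kernel(x, y) for x in u for y in v) / (len(u) * len(v))

    squared = (mean_gram(sample_a, sample_a) + mean_gram(sample_b, sample_b)
               - 2.0 * mean_gram(sample_a, sample_b))
    return math.sqrt(max(squared, 0.0))  # guard against rounding below zero


sample = [(0.0,), (1.0,), (2.0,)]
mu_hat = empirical_embedding(sample)
print(mu_hat((1.0,)), rkhs_distance(sample, [(5.0,)]))
```

The RKHS distance only needs kernel evaluations because $\|\hat{\mu}_a - \hat{\mu}_b\|^2_{\mathcal{H}_k}$ expands into averages of Gram-matrix entries.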

Angle-based Multicategory Distance-weighted SVM
角度ベースのマルチカテゴリ距離加重 SVM

Classification is an important supervised learning technique with numerous applications. We develop an angle-based multicategory distance-weighted support vector machine (MDWSVM) classification method that is motivated by the binary distance-weighted support vector machine (DWSVM) classification method. The new method has the merits of both support vector machine (SVM) and distance-weighted discrimination (DWD) and also alleviates both the data piling issue of SVM and the imbalanced data issue of DWD. Theoretical and numerical studies demonstrate the advantages of the MDWSVM method over existing angle-based methods.

分類は、多くのアプリケーションを持つ重要な教師あり学習手法です。私たちは、バイナリ距離加重サポートベクターマシン(DWSVM)分類法から動機付けられた角度ベースのマルチカテゴリ距離加重サポートベクターマシン(MDWSVM)分類法を開発します。この新しい手法は、サポートベクターマシン(SVM)と距離加重識別(DWD)の両方のメリットを持ちながら、SVMのデータパイリングの問題とDWDの不均衡なデータの問題の両方を軽減します。理論的および数値的研究は、既存の角度ベースの方法に対するMDWSVM法の利点を示しています。

Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization
正則化経験的リスク最小化のための確率的主双対座標法

We consider a generic convex optimization problem associated with regularized empirical risk minimization of linear predictors. The problem structure allows us to reformulate it as a convex-concave saddle point problem. We propose a stochastic primal-dual coordinate (SPDC) method, which alternates between maximizing over a randomly chosen dual variable and minimizing over the primal variables. An extrapolation step on the primal variables is performed to obtain an accelerated convergence rate. We also develop a mini-batch version of the SPDC method which facilitates parallel computing, and an extension with weighted sampling probabilities on the dual variables, which has a better complexity than uniform sampling on unnormalized data. Both theoretically and empirically, we show that the SPDC method has comparable or better performance than several state-of-the-art optimization methods.

私たちは、線形予測子の正則化された経験的リスク最小化に関連する一般的な凸最適化問題を検討します。問題構造により、それを凸凹サドルポイント問題として再定式化できます。ランダムに選択された双対変数の最大化と主変数の最小化を交互に行う確率的主双対座標(SPDC)法を提案します。主変数の外挿ステップが実行され、収束率が加速されます。また、並列計算を容易にするSPDC法のミニバッチバージョンと、双対変数に重み付きサンプリング確率を用いる拡張も開発しています。後者は、正規化されていないデータに対して一様サンプリングよりも優れた計算量を持ちます。理論的にも経験的にも、SPDC法はいくつかの最先端の最適化法と同等またはそれ以上の性能を発揮することを示しています。
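As a sketch of the saddle-point reformulation mentioned above, the regularized ERM problem and its convex-concave form can be written as follows. The symbols are assumptions introduced for this note rather than notation from the abstract: $a_i$ are the feature vectors, $\phi_i$ are closed convex losses with convex conjugates $\phi_i^*$, and $g$ is the regularizer.

\[ \min_{x} \; \frac{1}{n}\sum_{i=1}^{n} \phi_i(a_i^\top x) + g(x) \;=\; \min_{x}\,\max_{y} \; \frac{1}{n}\sum_{i=1}^{n} \bigl( y_i\, a_i^\top x - \phi_i^*(y_i) \bigr) + g(x). \]

The equality uses the conjugate representation $\phi_i(u) = \max_{y_i} \{ y_i u - \phi_i^*(y_i) \}$. Each dual variable $y_i$ is tied to a single example, which is what makes it natural to update one randomly chosen dual coordinate per iteration.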

Convolutional Neural Networks Analyzed via Convolutional Sparse Coding
畳み込みスパース符号化による畳み込みニューラルネットワークの解析

Convolutional neural networks (CNN) have led to many state-of-the-art results spanning various fields. However, a clear and profound theoretical understanding of the forward pass, the core algorithm of CNN, is still lacking. In parallel, within the wide field of sparse approximation, Convolutional Sparse Coding (CSC) has gained increasing attention in recent years. A theoretical study of this model was recently conducted, establishing it as a reliable and stable alternative to the commonly practiced patch-based processing. Herein, we propose a novel multi-layer model, ML-CSC, in which signals are assumed to emerge from a cascade of CSC layers. This is shown to be tightly connected to CNN, so much so that the forward pass of the CNN is in fact the thresholding pursuit serving the ML-CSC model. This connection brings a fresh view to CNN, as we are able to attribute to this architecture theoretical claims such as uniqueness of the representations throughout the network, and their stable estimation, all guaranteed under simple local sparsity conditions. Lastly, identifying the weaknesses in the above pursuit scheme, we propose an alternative to the forward pass, which is connected to deconvolutional and recurrent networks, and also has better theoretical guarantees.

畳み込みニューラルネットワーク(CNN)は、さまざまな分野にまたがる最先端の成果を数多く生み出してきました。しかし、CNNのコアアルゴリズムであるフォワードパスの明確で深い理論的理解は、まだ不足しています。同時に、スパース近似の幅広い分野において、近年、畳み込みスパースコーディング(CSC)がますます注目を集めています。このモデルの理論的研究が最近実施され、一般的に実施されているパッチベースの処理に代わる信頼性が高く安定したモデルとして確立されました。ここでは、信号がCSC層のカスケードから発生すると想定される新しい多層モデルML-CSCを提案します。これはCNNと密接に関連していることが示されており、CNNのフォワードパスは実際にはML-CSCモデルに役立つしきい値処理です。この関連により、CNNに新たな視点がもたらされます。ネットワーク全体の表現の一意性や、単純なローカルスパース条件の下で保証されるそれらの安定した推定などの理論的主張をこのアーキテクチャに帰することができるためです。最後に、上記の追跡方式の弱点を特定し、逆畳み込みネットワークと再帰ネットワークに接続され、より優れた理論的保証も備えたフォワードパスの代替案を提案します。
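The claimed link between the forward pass and thresholding pursuit can be illustrated with a small sketch. It assumes dense dictionaries stored as lists of rows instead of convolutional ones, and uses the identity that a ReLU with bias $b$ equals nonnegative soft thresholding with threshold $-b$; the function names and the toy numbers are illustrative assumptions.

```python
def correlate(dictionary, signal):
    """Compute D^T x for a dictionary stored as a list of rows."""
    atoms = len(dictionary[0])
    return [sum(row[c] * s for row, s in zip(dictionary, signal))
            for c in range(atoms)]


def soft_threshold_nonneg(values, thresholds):
    """Nonnegative soft thresholding: max(v - beta, 0) per entry."""
    return [max(v - t, 0.0) for v, t in zip(values, thresholds)]


def layered_thresholding(signal, dictionaries, thresholds):
    """Estimate one sparse representation per layer, layer by layer."""
    current, estimates = signal, []
    for dictionary, beta in zip(dictionaries, thresholds):
        current = soft_threshold_nonneg(correlate(dictionary, current), beta)
        estimates.append(current)
    return estimates


def forward_pass(signal, weights, biases):
    """A plain network forward pass: ReLU(W^T x + b) at every layer."""
    current, activations = signal, []
    for w, b in zip(weights, biases):
        pre_activation = correlate(w, current)
        current = [max(p + bias, 0.0) for p, bias in zip(pre_activation, b)]
        activations.append(current)
    return activations


# Toy two-layer example with hypothetical values.
dictionaries = [[[1.0, -1.0], [0.5, 2.0]], [[1.0], [1.0]]]
thresholds = [[0.5, 4.0], [1.0]]
biases = [[-t for t in layer] for layer in thresholds]
signal = [1.0, 2.0]

print(layered_thresholding(signal, dictionaries, thresholds))
```

With the biases set to the negated thresholds, both functions return the same per-layer outputs, which is the sense in which the forward pass acts as a pursuit algorithm for the multi-layer model.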

Learning Scalable Deep Kernels with Recurrent Structure
リカレント構造によるスケーラブルなディープカーネルの学習

Many applications in speech, robotics, finance, and biology deal with sequential data, where ordering matters and recurrent structures are common. However, this structure cannot be easily captured by standard kernel functions. To model such structure, we propose expressive closed-form kernel functions for Gaussian processes. The resulting model, GP-LSTM, fully encapsulates the inductive biases of long short-term memory (LSTM) recurrent networks, while retaining the non-parametric probabilistic advantages of Gaussian processes. We learn the properties of the proposed kernels by optimizing the Gaussian process marginal likelihood using a new provably convergent semi-stochastic gradient procedure, and exploit the structure of these kernels for scalable training and prediction. This approach provides a practical representation for Bayesian LSTMs. We demonstrate state-of-the-art performance on several benchmarks, and thoroughly investigate a consequential autonomous driving application, where the predictive uncertainties provided by GP-LSTM are uniquely valuable.

音声、ロボット工学、金融、生物学の多くのアプリケーションは、順序が重要で再帰構造が一般的であるシーケンシャルデータを扱います。ただし、この構造は標準のカーネル関数では簡単には捉えられません。このような構造をモデル化するために、ガウス過程の表現力豊かな閉形式カーネル関数を提案します。結果として得られるモデルGP-LSTMは、ガウス過程の非パラメトリックな確率的利点を維持しながら、長短期記憶(LSTM)再帰ネットワークの帰納的バイアスを完全にカプセル化します。新しい証明可能収束半確率的勾配手順を使用してガウス過程の周辺尤度を最適化することで提案カーネルの特性を学習し、これらのカーネルの構造をスケーラブルなトレーニングと予測に利用します。このアプローチは、ベイジアンLSTMの実用的な表現を提供します。いくつかのベンチマークで最先端のパフォーマンスを実証し、GP-LSTMによって提供される予測の不確実性が特に価値がある、結果的な自動運転アプリケーションを徹底的に調査します。

Making Decision Trees Feasible in Ultrahigh Feature and Label Dimensions
決定木を超高特徴およびラベル次元で実現可能にする

Owing to their non-linear but highly interpretable representations, decision tree (DT) models have attracted considerable attention from researchers. However, it is difficult to understand and interpret DT models in ultrahigh dimensions, and DT models usually suffer from the curse of dimensionality and achieve degenerated performance when there are many noisy features. To address these issues, this paper first presents a novel data-dependent generalization error bound for the perceptron decision tree (PDT), which provides the theoretical justification to learn a sparse linear hyperplane in each decision node and to prune the tree. Following our analysis, we introduce the notion of budget-aware classifier (BAC) with a budget constraint on the weight coefficients, and propose a supervised budgeted tree (SBT) algorithm to achieve non-linear prediction performance. To avoid generating an unstable and complicated decision tree and improve the generalization of the SBT, we present a pruning strategy by learning classifiers to minimize cross-validation errors on each BAC. To deal with ultrahigh label dimensions, based on three important phenomena of real-world data sets from a variety of application domains, we develop a sparse coding tree framework for multi-label annotation problems and provide the theoretical analysis. Extensive empirical studies verify that 1) SBT is easy to understand and interpret in ultrahigh dimensions and is more resilient to noisy features, and 2) compared with state-of-the-art algorithms, our proposed sparse coding tree framework is more efficient, yet accurate in ultrahigh label and feature dimensions.

非線形でありながら解釈性の高い表現のため、決定木(DT)モデルは研究者から多くの注目を集めています。しかし、超高次元のDTモデルを理解して解釈することは難しく、DTモデルは通常、次元の呪いに悩まされ、ノイズの多い特徴が多い場合にパフォーマンスが低下します。これらの問題に対処するために、本稿ではまず、パーセプトロン決定木(PDT)の新しいデータ依存の一般化誤差境界を提示します。これは、各決定ノードでスパース線形超平面を学習し、ツリーを剪定するための理論的根拠を提供します。分析に続いて、重み係数に予算制約を持つ予算認識分類器(BAC)の概念を導入し、非線形予測パフォーマンスを実現するための教師あり予算木(SBT)アルゴリズムを提案します。不安定で複雑な決定木の生成を回避し、SBTの一般化を改善するために、各BACのクロス検証エラーを最小化する分類器を学習する剪定戦略を提示します。超高ラベル次元に対処するために、さまざまなアプリケーションドメインの実際のデータセットの3つの重要な現象に基づいて、マルチラベル注釈問題用のスパースコーディングツリーフレームワークを開発し、理論分析を提供します。広範な実証研究により、1) SBTは超高次元で理解および解釈しやすく、ノイズの多い特徴に対してより耐性があることが検証されています。2)最先端のアルゴリズムと比較して、提案されたスパースコーディングツリーフレームワークは、超高ラベルおよび特徴次元でより効率的でありながら正確です。

Robust Discriminative Clustering with Sparse Regularizers
スパース正則化器によるロバストな判別クラスタリング

Clustering high-dimensional data often requires some form of dimensionality reduction, where clustered variables are separated from noise-looking variables. We cast this problem as finding a low-dimensional projection of the data which is well-clustered. This yields a one-dimensional projection in the simplest situation with two clusters, and extends naturally to a multi-label scenario for more than two clusters. In this paper, (a) we first show that this joint clustering and dimension reduction formulation is equivalent to previously proposed discriminative clustering frameworks, thus leading to convex relaxations of the problem; (b) we propose a novel sparse extension, which is still cast as a convex relaxation and allows estimation in higher dimensions; (c) we propose a natural extension for the multi-label scenario; (d) we provide a new theoretical analysis of the performance of these formulations with a simple probabilistic model, leading to scalings of the form $d=O(\sqrt{n})$ for the affine invariant case and $d=O(n)$ for the sparse case, where $n$ is the number of examples and $d$ the ambient dimension; and finally, (e) we propose an efficient iterative algorithm with running-time complexity proportional to $O(nd^2)$, improving on earlier algorithms for discriminative clustering with the square loss, which had quadratic complexity in the number of examples.

高次元データのクラスタリングでは、多くの場合、何らかの形の次元削減が必要となり、クラスタリングされた変数がノイズのように見える変数から分離されます。我々はこの問題を、適切にクラスタリングされたデータの低次元投影を見つけることとします。これにより、2つのクラスターを持つ最も単純な状況では1次元投影が得られ、3つ以上のクラスターのマルチラベルシナリオに自然に拡張されます。この論文では、(a)このクラスタリングと次元削減を組み合わせた定式化が、以前に提案された識別的クラスタリングフレームワークと同等であり、問題の凸緩和につながることを最初に示します。(b)依然として凸緩和として扱われ、高次元での推定を可能にする新しいスパース拡張を提案します。(c)マルチラベルシナリオの自然な拡張を提案します。(d)私たちは、単純な確率モデルを用いてこれらの定式化の性能に関する新たな理論的分析を提供し、アフィン不変の場合の形式$d=O(\sqrt{n})$およびスパースの場合の形式$d=O(n)$にわたるスケーリングを導きます。ここで、$n$は例の数、$d$は周囲の次元です。最後に、(e)実行時間の複雑さが$O(nd^2)$に比例する効率的な反復アルゴリズムを提案し、例の数の2乗の複雑さを持つ二乗損失による識別的クラスタリングの以前のアルゴリズムを改良します。

Bayesian Tensor Regression
ベイジアンテンソル回帰

We propose a Bayesian approach to regression with a scalar response on vector and tensor covariates. Vectorization of the tensor prior to analysis fails to exploit the structure, often leading to poor estimation and predictive performance. We introduce a novel class of multiway shrinkage priors for tensor coefficients in the regression setting and present posterior consistency results under mild conditions. A computationally efficient Markov chain Monte Carlo algorithm is developed for posterior computation. Simulation studies illustrate substantial gains over existing tensor regression methods in terms of estimation and parameter inference. Our approach is further illustrated in a neuroimaging application.

私たちは、ベクトル共変量とテンソル共変量上のスカラー応答による回帰へのベイズアプローチを提案します。解析前のテンソルのベクトル化は、構造を活用できず、多くの場合、推定と予測のパフォーマンスの低下につながります。回帰設定でテンソル係数の多方向収縮事前分布の新しいクラスを導入し、温和な条件下での事後一貫性の結果を示します。事後計算のために、計算効率の高いマルコフ連鎖モンテカルロアルゴリズムが開発されています。シミュレーション研究は、推定とパラメータ推論の点で、既存のテンソル回帰法よりも大幅な利点を示しています。私たちのアプローチは、ニューロイメージングアプリケーションでさらに説明されています。

Relational Reinforcement Learning for Planning with Exogenous Effects
外生的効果を持つ計画のための関係強化学習

Probabilistic planners have improved recently to the point that they can solve difficult tasks with complex and expressive models. In contrast, learners cannot yet tackle the expressive models that planners do, which forces complex models to be mostly handcrafted. We propose a new learning approach that can learn relational probabilistic models with both action effects and exogenous effects. The proposed learning approach combines a multi-valued variant of inductive logic programming for the generation of candidate models, with an optimization method to select the best set of planning operators to model a problem. We also show how to combine this learner with reinforcement learning algorithms to solve complete problems. Finally, experimental validation is provided that shows improvements over previous work in both simulation and a robotic task. The robotic task involves a dynamic scenario with several agents where a manipulator robot has to clear the tableware on a table. We show that the exogenous effects learned by our approach allowed the robot to clear the table in a more efficient way.

確率プランナーは最近、複雑で表現力豊かなモデルで難しいタスクを解決できるレベルまで改善されました。対照的に、学習者はプランナーが行う表現力豊かなモデルにまだ取り組むことができず、複雑なモデルは主に手作業で作成せざるを得ません。私たちは、アクション効果と外生効果の両方を持つ関係確率モデルを学習できる新しい学習アプローチを提案します。提案された学習アプローチは、候補モデルの生成のための帰納的論理プログラミングの多値バリアントと、問題をモデル化するための最適な計画演算子のセットを選択する最適化手法を組み合わせたものです。また、この学習者を強化学習アルゴリズムと組み合わせて完全な問題を解決する方法も示します。最後に、シミュレーションとロボットタスクの両方で以前の作業よりも改善されていることを示す実験検証が提供されます。ロボットタスクには、マニピュレーターロボットがテーブル上の食器を片付ける必要がある、複数のエージェントによる動的シナリオが含まれます。私たちのアプローチによって学習された外生効果により、ロボットがより効率的にテーブルを片付けることができることを示します。

Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis
変化の時: ベイズ分析による複数の分類器の比較チュートリアル

The machine learning community adopted the use of null hypothesis significance testing (NHST) in order to ensure the statistical validity of results. Many scientific fields, however, have realized the shortcomings of frequentist reasoning and in the most radical cases even banned its use in publications. We should do the same: just as we have embraced the Bayesian paradigm in the development of new machine learning methods, so we should also use it in the analysis of our own results. We argue for abandonment of NHST by exposing its fallacies and, more importantly, offer better (more sound and useful) alternatives to it.

機械学習コミュニティは、結果の統計的妥当性を確保するために、帰無仮説有意性検定(NHST)の使用を採用してきました。しかし、多くの科学分野は、頻度論的推論の欠点を認識し、最も急進的なケースでは、出版物での使用を禁止することさえありました。私たちも同じことをすべきです。新しい機械学習手法の開発にベイズパラダイムを取り入れたのと同じように、私たち自身の結果の分析にもそれを使用すべきです。私たちは、NHSTの誤謬を明らかにし、さらに重要なことに、より健全で有用な優れた代替案を提示することで、NHSTの放棄を主張します。

Quantifying the Informativeness of Similarity Measurements
類似性測定の情報量の定量化

In this paper, we describe an unsupervised measure for quantifying the ‘informativeness’ of correlation matrices formed from the pairwise similarities or relationships among data instances. The measure quantifies the heterogeneity of the correlations and is defined as the distance between a correlation matrix and the nearest correlation matrix with constant off-diagonal entries. This non-parametric notion generalizes existing test statistics for equality of correlation coefficients by allowing for alternative distance metrics, such as the Bures and other distances from quantum information theory. For several distance and dissimilarity metrics, we derive closed-form expressions of informativeness, which can be applied as objective functions for machine learning applications. Empirically, we demonstrate that informativeness is a useful criterion for selecting kernel parameters, choosing the dimension for kernel-based nonlinear dimensionality reduction, and identifying structured graphs. We also consider the problem of finding a maximally informative correlation matrix around a target matrix, and explore parameterizing the optimization in terms of the coordinates of the sample or through a lower-dimensional embedding. In the latter case, we find that maximizing the Bures-based informativeness measure, which is maximal for centered rank-1 correlation matrices, is equivalent to minimizing a specific matrix norm, and present an algorithm to solve the minimization problem using the norm’s proximal operator. The proposed correlation denoising algorithm consistently improves spectral clustering. Overall, we find informativeness to be a novel and useful criterion for identifying non-trivial correlation structure.

この論文では、データインスタンス間のペアワイズ類似性または関係から形成される相関行列の「情報性」を定量化する教師なしの尺度について説明します。この尺度は相関の異質性を定量化し、相関行列と、一定の非対角エントリを持つ最も近い相関行列との距離として定義されます。このノンパラメトリックな概念は、Buresや量子情報理論からのその他の距離などの代替距離メトリックを許可することで、相関係数の等価性に関する既存の検定統計を一般化します。いくつかの距離および非類似性メトリックについて、機械学習アプリケーションの目的関数として適用できる情報性の閉形式の表現を導出します。経験的に、情報性はカーネルパラメータの選択、カーネルベースの非線形次元削減の次元の選択、および構造化グラフの識別に役立つ基準であることを示します。また、ターゲット行列の周囲で最大限に有益な相関行列を見つける問題も検討し、サンプルの座標または低次元の埋め込みによる最適化のパラメータ化を検討します。後者の場合、中心化されたランク1相関行列に対して最大となるBuresベースの有益性尺度を最大化することは、特定の行列ノルムを最小化することと同等であることがわかり、ノルムの近似演算子を使用して最小化問題を解決するアルゴリズムを提示します。提案された相関ノイズ除去アルゴリズムは、スペクトルクラスタリングを一貫して改善します。全体として、有益性は、非自明な相関構造を識別するための新しい有用な基準であることがわかりました。
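The definition above can be made concrete for one particular distance. The sketch below assumes the Frobenius (Euclidean) distance, for which the nearest matrix with unit diagonal and constant off-diagonal entries is obtained by averaging the off-diagonal entries; the Bures-based and other closed forms derived in the paper are not reproduced here, and the function names are illustrative assumptions.

```python
import math


def nearest_constant_correlation(corr):
    """Frobenius-nearest matrix with unit diagonal and constant off-diagonal."""
    n = len(corr)
    off_diagonal = [corr[i][j] for i in range(n) for j in range(n) if i != j]
    mean_off = sum(off_diagonal) / len(off_diagonal)
    return [[1.0 if i == j else mean_off for j in range(n)] for i in range(n)]


def frobenius_informativeness(corr):
    """Distance from corr to its nearest constant-off-diagonal matrix."""
    target = nearest_constant_correlation(corr)
    n = len(corr)
    return math.sqrt(sum((corr[i][j] - target[i][j]) ** 2
                         for i in range(n) for j in range(n)))


# Hypothetical examples: no structure versus one correlated pair.
uniform = [[1.0, 0.3, 0.3], [0.3, 1.0, 0.3], [0.3, 0.3, 1.0]]
structured = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(frobenius_informativeness(uniform), frobenius_informativeness(structured))
```

For a positive semidefinite input, the averaged off-diagonal value is at least $-1/(n-1)$, so the averaged matrix is itself a valid correlation matrix. A matrix whose correlations are all equal scores zero, while heterogeneous correlations score higher.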

Recovering PCA and Sparse PCA via Hybrid-(l1,l2) Sparse Sampling of Data Elements
データ要素のハイブリッド (l1,l2) スパースサンプリングによる PCA とスパース PCA の回復

This paper addresses how well we can recover a data matrix when only given a few of its elements. We present a randomized algorithm that element-wise sparsifies the data, retaining only a few of its entries. Our new algorithm independently samples the data using probabilities that depend on both squares ($\ell_2$ sampling) and absolute values ($\ell_1$ sampling) of the entries. We prove that this hybrid algorithm ($i$) achieves a near-PCA reconstruction of the data, and ($ii$) recovers sparse principal components of the data, from a sketch formed by a sublinear sample size. Hybrid-($\ell_1,\ell_2$) inherits the $\ell_2$-ability to sample the important elements, as well as the regularization properties of $\ell_1$ sampling, and maintains strictly better quality than either $\ell_1$ or $\ell_2$ on their own. Extensive experimental results on synthetic, image, text, biological, and financial data show that not only are we able to recover PCA and sparse PCA from incomplete data, but we can speed up such computations significantly using our sparse sketch.

この論文では、データマトリックスの要素がいくつか与えられた場合に、そのデータマトリックスをどの程度正確に復元できるかについて説明します。要素ごとにデータをスパース化し、そのエントリをいくつかだけ保持するランダム化アルゴリズムを紹介します。新しいアルゴリズムは、エントリの平方($\ell_2$サンプリング)と絶対値($\ell_1$サンプリング)の両方に依存する確率を使用して、データを個別にサンプリングします。このハイブリッドアルゴリズム($i$)により、データのPCAに近い再構成が実現され、($ii$)サブ線形サンプルサイズによって形成されたスケッチから、データのスパースな主成分が復元されることを証明します。ハイブリッド($\ell_1,\ell_2$)は、重要な要素をサンプリングする$\ell_2$機能と、$\ell_1$サンプリングの正則化プロパティを継承し、$\ell_1$または$\ell_2$のいずれか単独よりも確実に優れた品質を維持します。合成データ、画像データ、テキストデータ、生物学データ、金融データに関する広範な実験結果から、不完全なデータからPCAとスパースPCAを回復できるだけでなく、スパーススケッチを使用してそのような計算を大幅に高速化できることが示されています。
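The element-wise sampling idea can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact algorithm: the hybrid distribution is taken to be a convex combination of the $\ell_1$ and $\ell_2$ sampling probabilities with a mixing parameter `alpha`, entries are drawn with replacement, and each draw is rescaled so that the sketch is an unbiased estimate of the input. The function names and the default `alpha` are hypothetical.

```python
import random


def hybrid_probabilities(matrix, alpha):
    """Mix l1 (absolute value) and l2 (squared value) sampling probabilities."""
    l1_total = sum(abs(v) for row in matrix for v in row)
    l2_total = sum(v * v for row in matrix for v in row)
    return [[alpha * abs(v) / l1_total + (1.0 - alpha) * v * v / l2_total
             for v in row] for row in matrix]


def sparsify(matrix, num_samples, alpha=0.5, seed=0):
    """Draw entries with replacement and rescale them to keep the sketch unbiased."""
    rng = random.Random(seed)
    probs = hybrid_probabilities(matrix, alpha)
    cells = [(i, j) for i, row in enumerate(matrix)
             for j, v in enumerate(row) if v != 0.0]
    weights = [probs[i][j] for i, j in cells]
    sketch = [[0.0] * len(row) for row in matrix]
    for i, j in rng.choices(cells, weights=weights, k=num_samples):
        # Each draw adds A_ij / (s * p_ij), so the expected sketch equals A.
        sketch[i][j] += matrix[i][j] / (num_samples * probs[i][j])
    return sketch


matrix = [[3.0, 0.0], [-1.0, 2.0]]
probs = hybrid_probabilities(matrix, alpha=0.5)
sketch = sparsify(matrix, num_samples=50)
print(sketch)
```

Setting `alpha` to 0 or 1 recovers pure $\ell_2$ or pure $\ell_1$ sampling, so the sketch makes it easy to compare the three schemes on the same data.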

Tests of Mutual or Serial Independence of Random Vectors with Applications
ランダムベクトルの相互独立性または系列独立性の検定とその応用

The problem of testing mutual independence between many random vectors is addressed. The closely related problem of testing serial independence of a multivariate stationary sequence is also considered. The Möbius transformation of characteristic functions is used to characterize independence. A generalization to $p$ vectors of distance covariance and Hilbert-Schmidt independence criterion ($HSIC$) tests with the translation invariant kernel of a stable probability distribution is proposed. Both test statistics can be expressed in a simple form as a sum over all elements of a componentwise product of $p$ doubly-centered matrices. It is shown that an $HSIC$ statistic with sufficiently small scale parameters is equivalent to a distance covariance statistic. Consistency and weak convergence of both types of statistics are established. Approximation of $p$-values is made by randomization tests without recomputing interpoint distances for each randomized sample. The dependogram is adapted to the proposed tests for the graphical identification of sources of dependencies. Empirical rejection rates obtained through extensive simulations confirm both the applicability of the testing procedures in small samples and the high level of competitiveness in terms of power. Applications to meteorological and financial data provide some interesting interpretations of dependencies revealed by dependograms.

多数のランダムベクトル間の相互独立性をテストする問題が取り上げられています。多変量定常シーケンスの連続独立性をテストするという密接に関連する問題も検討されています。特性関数のメビウス変換を使用して独立性を特徴付けます。安定した確率分布の変換不変カーネルを使用した距離共分散のpベクトルおよびヒルベルトシュミット独立基準(HSIC)テストへの一般化が提案されています。両方のテスト統計は、p個の二重中心行列の成分ごとの積のすべての要素の合計として簡単に表現できます。十分に小さなスケールパラメータを持つHSIC統計は、距離共分散統計と同等であることが示されています。両方のタイプの統計の一貫性と弱い収束が確立されています。p値の近似値は、ランダム化された各サンプルの点間距離を再計算せずにランダム化テストによって作成されます。依存関係のソースをグラフィカルに識別するために、デペンデントグラムが提案されたテストに適合されています。広範囲のシミュレーションを通じて得られた経験的拒否率は、小規模サンプルでのテスト手順の適用性と、検出力の面での高い競争力の両方を裏付けています。気象データや金融データへの応用により、ディペンデントグラムによって明らかになる依存関係の興味深い解釈が得られます。
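The "sum over all elements of a componentwise product of doubly-centered matrices" can be sketched directly. The code below assumes Euclidean distance matrices and a $1/n^2$ normalization, which for $p=2$ gives the familiar squared sample distance covariance; the exact normalization and sign conventions of the paper's $p$-vector statistics are not stated in the abstract, so those details and the function names are assumptions.

```python
import math


def distance_matrix(points):
    """Pairwise Euclidean distances between the observations of one vector."""
    return [[math.dist(a, b) for b in points] for a in points]


def double_center(m):
    """Subtract row and column means and add back the grand mean."""
    n = len(m)
    row_mean = [sum(row) / n for row in m]
    col_mean = [sum(m[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[m[i][j] - row_mean[i] - col_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def product_statistic(samples):
    """Average over (i, j) of the product of the p doubly-centered entries."""
    centered = [double_center(distance_matrix(s)) for s in samples]
    n = len(centered[0])
    total = 0.0
    for i in range(n):
        for j in range(n):
            term = 1.0
            for matrix in centered:
                term *= matrix[i][j]
            total += term
    return total / n ** 2


x = [(0.0,), (1.0,), (2.0,), (4.0,)]
y = [(0.0, 1.0), (1.0, 1.0), (2.0, 0.0), (5.0, 3.0)]
constant = [(7.0,)] * 4  # carries no information about x

print(product_statistic([x, y]), product_statistic([x, constant]))
```

Because the centered matrices depend only on within-vector distances, a randomization test can permute the rows and columns of the already computed matrices, which matches the abstract's remark that interpoint distances need not be recomputed for each randomized sample.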

Non-parametric Policy Search with Limited Information Loss
情報損失が限定的なノンパラメトリックポリシー検索

Learning complex control policies from non-linear and redundant sensory input is an important challenge for reinforcement learning algorithms. Non-parametric methods that approximate value functions or transition models can address this problem by adapting to the complexity of the data set. Yet, many current non-parametric approaches rely on unstable greedy maximization of approximate value functions, which might lead to poor convergence or oscillations in the policy update. A more robust policy update can be obtained by limiting the information loss between successive state-action distributions. In this paper, we develop a policy search algorithm with policy updates that are both robust and non-parametric. Our method can learn non-parametric control policies for infinite horizon continuous Markov decision processes with non-linear and redundant sensory representations. We investigate how we can use approximations of the kernel function to reduce the time requirements of the demanding non-parametric computations. In our experiments, we show the strong performance of the proposed method, and how it can be approximated efficiently. Finally, we show that our algorithm can learn a real-robot under-powered swing-up task directly from image data.

非線形かつ冗長な感覚入力から複雑な制御ポリシーを学習することは、強化学習アルゴリズムにとって重要な課題です。値関数または遷移モデルを近似するノンパラメトリック法は、データセットの複雑さに適応することで、この問題に対処できます。しかし、現在の多くのノンパラメトリック手法は、近似値関数の不安定な貪欲最大化に依存しており、ポリシー更新の収束が悪くなったり、振動したりする可能性があります。連続する状態アクション分布間の情報損失を制限することで、より堅牢なポリシー更新を実現できます。この論文では、堅牢かつノンパラメトリックなポリシー更新を備えたポリシー検索アルゴリズムを開発します。この方法は、非線形かつ冗長な感覚表現を持つ無限時間連続マルコフ決定プロセスのノンパラメトリック制御ポリシーを学習できます。カーネル関数の近似を使用して、要求の厳しいノンパラメトリック計算の時間要件を削減する方法を調査します。実験では、提案方法の優れたパフォーマンスと、それを効率的に近似する方法を示します。最後に、私たちのアルゴリズムは、実際のロボットのパワー不足のスイングアップタスクを画像データから直接学習できることを示します。

Multiscale Strategies for Computing Optimal Transport
最適な輸送を計算するためのマルチスケール戦略

This paper presents a multiscale approach to efficiently compute approximate optimal transport plans between point sets. It is particularly well-suited for point sets that are in high dimensions, but are close to being intrinsically low-dimensional. The approach is based on an adaptive multiscale decomposition of the point sets. The multiscale decomposition yields a sequence of optimal transport problems that are solved in a top-to-bottom fashion from the coarsest to the finest scale. We provide numerical evidence that this multiscale approach scales approximately linearly, in time and memory, in the number of nodes, instead of quadratically or worse for a direct solution. Empirically, the multiscale approach results in less than one percent relative error in the objective function. Furthermore, the multiscale plans constructed are of interest by themselves as they may be used to introduce novel features and notions of distances between point sets. An analysis of sets of brain MRI based on optimal transport distances illustrates the effectiveness of the proposed method on a real world data set. The application demonstrates that multiscale optimal transport distances have the potential to improve on state-of-the-art metrics currently used in computational anatomy.

この論文では、ポイントセット間の近似最適トランスポートプランを効率的に計算するためのマルチスケールアプローチを紹介します。このアプローチは、高次元であるが本質的に低次元に近いポイントセットに特に適しています。このアプローチは、ポイントセットの適応型マルチスケール分解に基づいています。マルチスケール分解により、一連の最適トランスポート問題が生成され、最も粗いスケールから最も細かいスケールまで上から下に向かって解決されます。このマルチスケールアプローチは、直接解法の場合のように2乗またはそれ以上ではなく、ノード数に対して時間とメモリの両面でほぼ線形にスケーリングするという数値的証拠を示します。経験的に、マルチスケールアプローチでは、目的関数の相対誤差は1パーセント未満になります。さらに、構築されたマルチスケールプランは、ポイントセット間の距離の新しい機能や概念を導入するために使用できるため、それ自体が興味深いものです。最適なトランスポート距離に基づく脳MRIセットの分析により、提案された方法が実際のデータセットで有効であることがわかります。このアプリケーションは、マルチスケールの最適輸送距離が、計算解剖学で現在使用されている最先端の指標を改善する可能性があることを実証しています。

A Robust-Equitable Measure for Feature Ranking and Selection
特徴のランク付けと選択のための堅牢で公平な尺度

In many applications, not all the features used to represent data samples are important. Often only a few features are relevant for the prediction task. The choice of dependence measures often affects the final result of many feature selection methods. To select features that have complex nonlinear relationships with the response variable, the dependence measure should be equitable, a concept proposed by Reshef et al. (2011); that is, the dependence measure treats linear and nonlinear relationships equally. Recently, Kinney and Atwal (2014) gave a mathematical definition of self-equitability. In this paper, we introduce a new concept of robust-equitability and identify a robust-equitable copula dependence measure, the robust copula dependence (RCD) measure. RCD is based on the $L_1$-distance of the copula density from uniform and we show that it is equitable under both equitability definitions. We also prove theoretically that RCD is much easier to estimate than mutual information. Because of these theoretical properties, the RCD measure has the following advantages compared to existing dependence measures: it is robust to different relationship forms and robust to unequal sample sizes of different features. Experiments on both synthetic and real-world data sets confirm the theoretical analysis, and illustrate the advantage of using the dependence measure RCD for feature selection.

多くのアプリケーションでは、データサンプルを表すために使用されるすべての特徴が重要であるわけではありません。予測タスクに関連する特徴はごくわずかであることがよくあります。依存性の尺度の選択は、多くの特徴選択方法の最終結果に影響を与えることがよくあります。応答変数と複雑な非線形関係を持つ特徴を選択するには、依存性の尺度が公平である必要があります。これは、Reshefら(2011)によって提案された概念です。つまり、依存性の尺度は線形関係と非線形関係を平等に扱います。最近、KinneyとAtwal (2014)は、自己公平性の数学的定義を示しました。この論文では、ロバスト公平性の新しい概念を紹介し、ロバスト公平なコピュラ依存性の尺度であるロバストコピュラ依存性(RCD)尺度を特定します。RCDは、コピュラ密度の均一性からの$L_1$距離に基づいており、両方の公平性定義の下で公平であることを示します。また、RCDは相互情報量よりもはるかに簡単に推定できることを理論的に証明します。これらの理論的特性により、RCD測定は既存の依存性測定と比較して、さまざまな関係形式に対して堅牢であり、さまざまな特徴の不均等なサンプルサイズに対して堅牢であるという利点があります。合成データセットと実際のデータセットの両方での実験により、理論的分析が確認され、依存性測定RCDを特徴選択に使用する利点が示されました。

A Bayesian Framework for Learning Rule Sets for Interpretable Classification
解釈可能な分類のためのルールセット学習のベイズフレームワーク

We present a machine learning algorithm for building classifiers that are comprised of a small number of short rules. These are restricted disjunctive normal form models. An example of a classifier of this form is as follows: If $X$ satisfies (condition $A$ AND condition $B$) OR (condition $C$) OR $\cdots$, then $Y=1$. Models of this form have the advantage of being interpretable to human experts since they produce a set of rules that concisely describe a specific class. We present two probabilistic models with prior parameters that the user can set to encourage the model to have a desired size and shape, to conform with a domain-specific definition of interpretability. We provide a scalable MAP inference approach and develop theoretical bounds to reduce computation by iteratively pruning the search space. We apply our method (Bayesian Rule Sets — BRS) to characterize and predict user behavior with respect to in-vehicle context-aware personalized recommender systems. Our method has a major advantage over classical associative classification methods and decision trees in that it does not greedily grow the model.

私たちは、少数の短いルールで構成される分類器を構築するための機械学習アルゴリズムを紹介します。これらは、制限付き選言正規形モデルです。この形式の分類器の例は次のとおりです。$X$が(条件$A$ AND条件$B$) OR (条件$C$) OR $\cdots$を満たす場合、$Y=1$となります。この形式のモデルには、特定のクラスを簡潔に説明する一連のルールを生成するため、人間の専門家が解釈できるという利点があります。ユーザーが設定できる事前パラメーターを持つ2つの確率モデルを紹介します。これらのパラメーターにより、モデルが目的のサイズと形状になり、ドメイン固有の解釈可能性の定義に準拠します。スケーラブルなMAP推論アプローチを提供し、検索空間を反復的に整理することで計算を削減する理論的境界を開発します。私たちの方法(ベイジアンルールセット– BRS)を適用して、車載コンテキスト認識のパーソナライズされた推奨システムに関するユーザー行動を特徴付け、予測します。私たちの方法は、モデルを貪欲に拡大しないという点で、従来の連想分類法や決定木に比べて大きな利点があります。
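The model form described above, "if $X$ satisfies (condition $A$ AND condition $B$) OR (condition $C$), then $Y=1$", can be evaluated with a few lines of code. The sketch below covers prediction only, not the Bayesian learning procedure, and the feature names and values are hypothetical examples loosely themed on the in-vehicle setting.

```python
def satisfies(record, rule):
    """A rule is a conjunction: every listed feature must take the given value."""
    return all(record.get(feature) == value for feature, value in rule.items())


def predict(record, rule_set):
    """A rule set is a disjunction: predict 1 if any rule is satisfied."""
    return int(any(satisfies(record, rule) for rule in rule_set))


# Hypothetical rule set: (weather = rainy AND passengers = none) OR (destination = home)
rule_set = [
    {"weather": "rainy", "passengers": "none"},
    {"destination": "home"},
]

driver_a = {"weather": "rainy", "passengers": "none", "destination": "work"}
driver_b = {"weather": "sunny", "passengers": "kids", "destination": "home"}
driver_c = {"weather": "rainy", "passengers": "kids", "destination": "work"}

print([predict(d, rule_set) for d in (driver_a, driver_b, driver_c)])
```

The number of rules and the number of conditions per rule are the "size and shape" quantities that the abstract says the prior parameters are meant to control.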

Variational Particle Approximations
変分粒子近似

Approximate inference in high-dimensional, discrete probabilistic models is a central problem in computational statistics and machine learning. This paper describes discrete particle variational inference (DPVI), a new approach that combines key strengths of Monte Carlo, variational and search-based techniques. DPVI is based on a novel family of particle-based variational approximations that can be fit using simple, fast, deterministic search techniques. Like Monte Carlo, DPVI can handle multiple modes, and yields exact results in a well-defined limit. Like unstructured mean-field, DPVI is based on optimizing a lower bound on the partition function; when this quantity is not of intrinsic interest, it facilitates convergence assessment and debugging. Like both Monte Carlo and combinatorial search, DPVI can take advantage of factorization, sequential structure, and custom search operators. This paper defines the DPVI particle-based approximation family and partition function lower bounds, along with the sequential DPVI and local DPVI algorithm templates for optimizing them. DPVI is illustrated and evaluated via experiments on lattice Markov Random Fields, nonparametric Bayesian mixtures and block-models, and parametric as well as non-parametric hidden Markov models. Results include applications to real-world spike-sorting and relational modeling problems, and show that DPVI can offer appealing time/accuracy trade-offs as compared to multiple alternatives.

高次元の離散確率モデルにおける近似推論は、計算統計学と機械学習における中心的な問題です。この論文では、モンテカルロ、変分法、および検索ベースの手法の主要な長所を組み合わせた新しいアプローチである離散粒子変分推論(DPVI)について説明します。DPVIは、シンプルで高速な決定論的検索手法を使用して適合できる、粒子ベースの変分近似の新しいファミリに基づいています。モンテカルロと同様に、DPVIは複数のモードを処理でき、明確に定義された制限内で正確な結果を生成します。非構造化平均場と同様に、DPVIはパーティション関数の下限の最適化に基づいています。この量が本質的に重要でない場合は、収束評価とデバッグが容易になります。モンテカルロと組み合わせ検索の両方と同様に、DPVIは因数分解、シーケンシャル構造、およびカスタム検索演算子を利用できます。この論文では、DPVI粒子ベース近似ファミリーとパーティション関数の下限値、およびそれらを最適化するためのシーケンシャルDPVIとローカルDPVIアルゴリズムテンプレートを定義します。DPVIは、格子マルコフランダムフィールド、ノンパラメトリックベイズ混合とブロックモデル、およびパラメトリックとノンパラメトリックの隠れマルコフモデルでの実験によって説明および評価されます。結果には、現実世界のスパイクソートとリレーショナルモデリングの問題への応用が含まれ、DPVIが複数の代替案と比較して魅力的な時間と精度のトレードオフを提供できることを示しています。

Soft Margin Support Vector Classification as Buffered Probability Minimization
バッファ付き確率最小化としてのソフトマージンサポートベクトル分類

In this paper, we show that the popular C-SVM, soft-margin support vector classifier is equivalent to minimization of Buffered Probability of Exceedance (bPOE), a recently introduced characterization of uncertainty. To show this, we introduce a new SVM formulation, called the EC-SVM, which is derived from a simple bPOE minimization problem that is easy to interpret with a meaningful free parameter, optimal objective value, and probabilistic derivation. Over the range of its free parameter, the EC-SVM has both a convex and non-convex case which we connect to existing SVM formulations. We first show that the C-SVM, formulated with any regularization norm, is equivalent to the convex EC-SVM. Similarly, we show that the E$\nu$-SVM is equivalent to the EC-SVM over its entire parameter range, which includes both the convex and non-convex case. These equivalences, coupled with the interpretability of the EC-SVM, allow us to gain surprising new insights into the C-SVM and fully connect soft margin support vector classification with superquantile and bPOE concepts. We also show that the EC-SVM can easily be cast as a robust optimization problem, where bPOE is minimized with data lying in a fixed uncertainty set. This reformulation allows us to clearly differentiate between the convex and non-convex case, with convexity associated with pessimistic views of uncertainty and non-convexity associated with optimistic views of uncertainty. Finally, we address some practical considerations. First, we show that these new insights can assist in making parameter selection more efficient. Second, we discuss optimization approaches for solving the EC-SVM. Third, we address the issue of generalization, providing generalization bounds for both bPOE and misclassification rate.

この論文では、一般的なC-SVMソフトマージンサポートベクター分類器が、最近導入された不確実性の特性であるBuffered Probability of Exceedance (bPOE)の最小化と同等であることを示します。これを示すために、意味のある自由パラメータ、最適な目的値、および確率的導出で簡単に解釈できる単純なbPOE最小化問題から導出されたEC-SVMと呼ばれる新しいSVM定式化を紹介します。自由パラメータの範囲にわたって、EC-SVMには凸の場合と非凸の場合の両方があり、これを既存のSVM定式化に結び付けます。まず、任意の正則化ノルムで定式化されたC-SVMが凸EC-SVMと同等であることを示します。同様に、E$\nu$-SVMは、凸の場合と非凸の場合の両方を含むパラメータ範囲全体でEC-SVMと同等であることを示します。これらの同等性とEC-SVMの解釈可能性を組み合わせることで、C-SVMに関する驚くべき新しい洞察が得られ、ソフトマージンサポートベクター分類をスーパークォンタイルおよびbPOEの概念に完全に関連付けることができます。また、EC-SVMは、データが固定された不確実性セット内にある状態でbPOEが最小化されるロバスト最適化問題として簡単に表現できることも示します。この再定式化により、凸の場合と非凸の場合を明確に区別できます。凸性は不確実性の悲観的な見方に関連し、非凸性は不確実性の楽観的な見方に関連します。最後に、いくつかの実際的な考慮事項について説明します。まず、これらの新しい洞察がパラメーター選択の効率化に役立つことを示します。次に、EC-SVMを解決するための最適化アプローチについて説明します。最後に、一般化の問題に対処し、bPOEと誤分類率の両方の一般化境界を提供します。

Sharp Oracle Inequalities for Square Root Regularization
平方根正則化の鋭いオラクル不等式

We study a set of regularization methods for high-dimensional linear regression models. These penalized estimators have the square root of the residual sum of squared errors as loss function, and any weakly decomposable norm as penalty function. This fit measure is chosen because of its property that the estimator does not depend on the unknown standard deviation of the noise. On the other hand, a generalized weakly decomposable norm penalty is very useful in being able to deal with different underlying sparsity structures. We can choose a different sparsity inducing norm depending on how we want to interpret the unknown parameter vector $\beta$. Structured sparsity norms, as defined in Micchelli et al. (2010), are special cases of weakly decomposable norms, therefore we also include the square root LASSO (Belloni et al., 2011), the group square root LASSO (Bunea et al., 2014) and a new method called the square root SLOPE (in a similar fashion to the SLOPE from Bogdan et al. 2015). For this collection of estimators our results provide sharp oracle inequalities with the Karush-Kuhn-Tucker conditions. We discuss some examples of estimators. Based on a simulation we illustrate some advantages of the square root SLOPE.

私たちは、高次元線形回帰モデルの一連の正則化手法を研究します。これらのペナルティ付き推定量は、損失関数として残差二乗和の平方根を持ち、ペナルティ関数として任意の弱分解可能なノルムを持ちます。この適合尺度は、推定量がノイズの未知の標準偏差に依存しないという特性のために選択されます。一方、一般化された弱分解可能なノルムペナルティは、さまざまな基礎スパース構造を処理できるため非常に便利です。未知のパラメータベクトル$\beta$をどのように解釈するかに応じて、異なるスパース誘導ノルムを選択できます。Micchelliら(2010)で定義された構造化スパースノルムは弱分解可能ノルムの特殊なケースであるため、平方根LASSO (Belloniら、2011)、グループ平方根LASSO (Buneaら、2014)、および平方根SLOPEと呼ばれる新しい方法(Bogdanら、2015のSLOPEと同様の方式)も対象に含まれます。この推定量のコレクションに対して、結果はKarush-Kuhn-Tucker条件を持つ鋭いオラクル不等式を提供します。推定量の例をいくつか説明します。シミュレーションに基づいて、平方根SLOPEのいくつかの利点を示します。

Hierarchically Compositional Kernels for Scalable Nonparametric Learning
スケーラブルなノンパラメトリック学習のための階層的構成カーネル

We propose a novel class of kernels to alleviate the high computational cost of large-scale nonparametric learning with kernel methods. The proposed kernel is defined based on a hierarchical partitioning of the underlying data domain, where the Nyström method (a globally low-rank approximation) is married with a locally lossless approximation in a hierarchical fashion. The kernel maintains (strict) positive-definiteness. The corresponding kernel matrix admits a recursively off-diagonal low-rank structure, which allows for fast linear algebra computations. Suppressing the factor of data dimension, the memory and arithmetic complexities for training a regression or a classifier are reduced from $O(n^2)$ and $O(n^3)$ to $O(nr)$ and $O(nr^2)$, respectively, where $n$ is the number of training examples and $r$ is the rank on each level of the hierarchy. Although other randomized approximate kernels entail a similar complexity, empirical results show that the proposed kernel achieves a matching performance with a smaller $r$. We demonstrate comprehensive experiments to show the effective use of the proposed kernel on data sizes up to the order of millions.

私たちは、カーネル法を用いた大規模ノンパラメトリック学習の高計算コストを軽減するために、新しいクラスのカーネルを提案します。提案されたカーネルは、基礎となるデータ領域の階層的分割に基づいて定義され、階層的にニストローム法（グローバル低ランク近似）とローカルロスレス近似が結合されます。カーネルは（厳密な）正定値性を維持します。対応するカーネル行列は、再帰的に非対角の低ランク構造を許容し、高速な線形代数計算を可能にします。データ次元の要因を抑制することで、回帰または分類器のトレーニングのメモリと算術の複雑さは、それぞれ$O(n^2)$と$O(n^3)$から$O(nr)$と$O(nr^2)$に削減されます。ここで、$n$はトレーニング例の数、$r$は階層の各レベルのランクです。他のランダム近似カーネルも同様の複雑さを伴いますが、実験結果では、提案されたカーネルはより小さな$r$でマッチングパフォーマンスを達成することが示されています。提案されたカーネルが数百万オーダーまでのデータサイズで効果的に使用できることを示す包括的な実験を示します。

Learning Partial Policies to Speedup MDP Tree Search via Reduction to I.I.D. Learning
独立同分布学習への還元によるMDP木探索の高速化のための学習部分方策

A popular approach for online decision-making in large MDPs is time-bounded tree search. The effectiveness of tree search, however, is largely influenced by the action branching factor, which limits the search depth given a time bound. An obvious way to reduce action branching is to consider only a subset of potentially good actions at each state as specified by a provided partial policy. In this work, we consider offline learning of such partial policies with the goal of speeding up search without significantly reducing decision-making quality. Our first contribution consists of reducing the learning problem to set learning. We give a reduction-style analysis of three such algorithms, each making different assumptions, which relates the set learning objectives to the sub-optimality of search using the learned partial policies. Our second contribution is to describe concrete implementations of the algorithms within the popular framework of Monte-Carlo tree search. Finally, the third contribution is to evaluate the learning algorithms on two challenging MDPs with large action branching factors. The results show that the learned partial policies can significantly improve the anytime performance of Monte-Carlo tree search.

大規模MDPでのオンライン意思決定の一般的なアプローチは、時間制限付きツリー検索です。ただし、ツリー検索の有効性は、時間制限が与えられた場合に検索の深さを制限するアクション分岐係数に大きく影響されます。アクション分岐を減らす明白な方法は、提供された部分ポリシーによって指定された各状態で潜在的に良いアクションのサブセットのみを考慮することです。この研究では、意思決定の品質を大幅に低下させることなく検索を高速化することを目的として、このような部分ポリシーのオフライン学習を検討します。最初の貢献は、学習問題を集合学習に帰着させることです。それぞれ異なる仮定を立てる3つのアルゴリズムについて帰着スタイルの分析を行い、集合学習の目的関数を、学習された部分ポリシーを使用した検索の準最適性と関連付けます。2番目の貢献は、モンテカルロツリー検索の一般的なフレームワーク内でのアルゴリズムの具体的な実装を説明することです。最後に、3番目の貢献は、大きなアクション分岐係数を持つ2つの困難なMDPで学習アルゴリズムを評価することです。結果は、学習された部分ポリシーがモンテカルロツリー検索のエニータイム性能を大幅に向上できることを示しています。

Parallel Symmetric Class Expression Learning
並列対称クラス式学習

In machine learning, one often encounters data sets where a general pattern is violated by a relatively small number of exceptions (for example, a rule that says that all birds can fly is violated by examples such as penguins). This complicates the concept learning process and may lead to the rejection of some simple and expressive rules that cover many cases. In this paper we present an approach to this problem in description logic learning by computing partial descriptions (which are not necessarily entirely complete) of both positive and negative examples and combining them. Our Symmetric Parallel Class Expression Learning approach enables the generation of general rules with exception patterns included. We demonstrate that this algorithm provides significantly better results (in terms of metrics such as accuracy, search space covered, and learning time) than standard approaches on some typical data sets. Further, the approach has the added benefit that it can be parallelised relatively simply, leading to much faster exploration of the search tree on modern computers.

機械学習では、比較的少数の例外によって一般的なパターンが破られているデータセットによく遭遇します(たとえば、すべての鳥は飛べるというルールは、ペンギンなどの例によって破られています)。これにより概念学習プロセスが複雑になり、多くのケースをカバーする単純で表現力豊かなルールの一部が拒否される可能性があります。この論文では、肯定例と否定例の両方の部分的な記述(必ずしも完全に完全である必要はありません)を計算し、それらを組み合わせることで、記述論理学習におけるこの問題へのアプローチを紹介します。対称並列クラス表現学習アプローチにより、例外パターンが含まれた一般的なルールの生成が可能になります。このアルゴリズムにより、いくつかの典型的なデータセットで標準的なアプローチよりも大幅に優れた結果が得られる(精度、カバーされる検索空間、学習時間などの指標に関して)ことを実証します。さらに、このアプローチには、比較的簡単に並列化できるという利点もあり、最新のコンピューターでの検索ツリーの探索が大幅に高速化されます。

Fundamental Conditions for Low-CP-Rank Tensor Completion
低CPランクテンソル補完の基本条件

We consider the problem of low canonical polyadic (CP) rank tensor completion. A completion is a tensor whose entries agree with the observed entries and whose rank matches the given CP rank. We analyze the manifold structure corresponding to the tensors with the given rank and define a set of polynomials based on the sampling pattern and CP decomposition. Then, we show that finite completability of the sampled tensor is equivalent to having a certain number of algebraically independent polynomials among the defined polynomials. Our proposed approach results in characterizing the maximum number of algebraically independent polynomials in terms of a simple geometric structure of the sampling pattern, and therefore we obtain the deterministic necessary and sufficient condition on the sampling pattern for finite completability of the sampled tensor. Moreover, assuming that the entries of the tensor are sampled independently with probability $p$ and using the mentioned deterministic analysis, we propose a combinatorial method to derive a lower bound on the sampling probability $p$, or equivalently, the number of sampled entries that guarantees finite completability with high probability. We also show that the existing result for the matrix completion problem can be used to obtain a loose lower bound on the sampling probability $p$. In addition, we obtain deterministic and probabilistic conditions for unique completability. It is seen that the number of samples required for finite or unique completability obtained by the proposed analysis on the CP manifold is orders of magnitude lower than that obtained by the existing analysis on the Grassmannian manifold.

私たちは、低カノニカルポリアディック(CP)ランクテンソル補完の問題を検討します。補完とは、その要素が観測された要素と一致し、そのランクが指定されたCPランクと一致するテンソルのことです。指定されたランクのテンソルに対応する多様体構造を分析し、サンプリングパターンとCP分解に基づいて多項式の集合を定義します。次に、サンプリングされたテンソルの有限補完可能性は、定義された多項式の中に代数的に独立した多項式が一定数あることと同等であることを示します。提案されたアプローチにより、サンプリングパターンの単純な幾何学的構造の観点から代数的に独立した多項式の最大数を特徴付けることができ、したがって、サンプリングされたテンソルの有限補完可能性に対する、サンプリングパターンについての決定論的な必要十分条件が得られます。さらに、テンソルの要素が確率$p$で独立にサンプリングされると仮定し、前述の決定論的解析を使用して、サンプリング確率$p$の下限、または同等に、高い確率で有限補完可能性を保証するサンプリングされた要素の数を導出するための組み合わせ法を提案します。また、行列補完問題に対する既存の結果を使用して、サンプリング確率$p$の緩い下限を取得できることも示します。さらに、一意補完可能性に対する決定論的および確率的条件を取得します。CP多様体に対する提案された解析によって得られる有限または一意の補完可能性に必要なサンプル数は、グラスマン多様体に対する既存の解析によって得られる数よりも桁違いに少ないことがわかります。

Dense Distributions from Sparse Samples: Improved Gibbs Sampling Parameter Estimators for LDA
スパースサンプルからの密分布:LDAのGibbsサンプリングパラメータ推定器の改善

We introduce a novel approach for estimating Latent Dirichlet Allocation (LDA) parameters from collapsed Gibbs samples (CGS), by leveraging the full conditional distributions over the latent variable assignments to efficiently average over multiple samples, for little more computational cost than drawing a single additional collapsed Gibbs sample. Our approach can be understood as adapting the soft clustering methodology of Collapsed Variational Bayes (CVB0) to CGS parameter estimation, in order to get the best of both techniques. Our estimators can straightforwardly be applied to the output of any existing implementation of CGS, including modern accelerated variants. We perform extensive empirical comparisons of our estimators with those of standard collapsed inference algorithms on real-world data for both unsupervised LDA and Prior-LDA, a supervised variant of LDA for multi-label classification. Our results show a consistent advantage of our approach over traditional CGS under all experimental conditions, and over CVB0 inference in the majority of conditions. More broadly, our results highlight the importance of averaging over multiple samples in LDA parameter estimation, and the use of efficient computational techniques to do so.

私たちは、潜在変数割り当ての完全な条件付き分布を利用して複数のサンプルを効率的に平均化することにより、折りたたまれたギブスサンプル(CGS)から潜在ディリクレ配分(LDA)パラメータを推定する新しいアプローチを紹介します。これは、折りたたまれたギブスサンプルを1つ追加で抽出するよりもわずかに計算コストがかかるだけです。我々のアプローチは、両方の手法の長所を引き出すために、折りたたまれた変分ベイズ(CVB0)のソフトクラスタリング手法をCGSパラメータ推定に適応させたものと理解できます。我々の推定量は、最新の高速化されたバリアントを含む、CGSの既存の実装の出力に直接適用できます。私たちは、教師なしLDAと、マルチラベル分類用のLDAの教師ありバリアントであるPrior-LDAの両方について、実際のデータで我々の推定量と標準的な折りたたまれた推論アルゴリズムの推定量との広範な実験的比較を行いました。結果は、すべての実験条件下で従来のCGSよりも、また大部分の条件でCVB0推論よりも、我々のアプローチが一貫して優れていることを示しています。より広い意味では、私たちの結果は、LDAパラメータ推定において複数のサンプルを平均化すること、そしてそれを実行するための効率的な計算手法を使用することの重要性を強調しています。

On the Propagation of Low-Rate Measurement Error to Subgraph Counts in Large Networks
大規模ネットワークにおける低レート測定誤差のサブグラフカウントへの伝播について

Our work in this paper is inspired by a statistical observation that is both elementary and broadly relevant to network analysis in practice: that the uncertainty in approximating some true graph $G=(V,E)$ by some estimated graph $\hat{G}=(V,\hat{E})$ manifests as errors in our knowledge of the presence/absence of edges between vertex pairs, which must necessarily propagate to any estimates of network summaries $\eta(G)$ we seek. Motivated by the common practice of using plug-in estimates $\eta(\hat{G})$ as proxies for $\eta(G)$, our focus is on the problem of characterizing the distribution of the discrepancy $D=\eta(\hat{G}) - \eta(G)$, in the case where $\eta(\cdot)$ is a subgraph count. Specifically, we study the fundamental case where the statistic of interest is $|E|$, the number of edges in $G$. Our primary contribution in this paper is to show that in the empirically relevant setting of large graphs with low-rate measurement errors, the distribution of $D_E=|\hat{E}| - |E|$ is well-characterized by a Skellam distribution, when the errors are independent or weakly dependent. Under an assumption of independent errors, we are able to further show conditions under which this characterization is strictly better than that of an appropriate normal distribution. These results derive from our formulation of a general result, quantifying the accuracy with which the difference of two sums of dependent Bernoulli random variables may be approximated by the difference of two independent Poisson random variables, i.e., by a Skellam distribution. This general result is developed through the use of Stein’s method, and may be of some general interest. We finish with a discussion of possible extension of our work to subgraph counts $\eta(G)$ of higher order.

この論文の私たちの研究は、ネットワーク分析の実践に基礎的かつ広く関連する統計的観察に触発されたものです。つまり、ある真のグラフ$G=(V,E)$をある推定グラフ$\hat{G}=(V,\hat{E})$で近似する際の不確実性は、頂点ペア間のエッジの有無に関する知識のエラーとして現れ、それが必然的に、私たちが求めるネットワーク要約$\eta(G)$の推定値に伝播するということです。プラグイン推定値$\eta(\hat{G})$を$\eta(G)$のプロキシとして使用する一般的な方法に動機付けられて、私たちは、$\eta(\cdot)$がサブグラフ数である場合に、不一致$D=\eta(\hat{G}) - \eta(G)$の分布を特徴付ける問題に焦点を当てています。具体的には、関心のある統計量が$|E|$、つまり$G$の辺の数である基本的なケースを研究します。この論文の主な貢献は、低率の測定誤差を伴う大規模グラフの経験的に関連する設定において、誤差が独立または弱く従属している場合、$D_E=|\hat{E}| - |E|$の分布はSkellam分布によってよく特徴付けられることを示すことです。誤差が独立しているという仮定の下で、この特徴付けが適切な正規分布の特徴付けよりも厳密に優れている条件をさらに示すことができます。これらの結果は、従属ベルヌーイ確率変数の2つの和の差が2つの独立したポアソン確率変数の差、つまりSkellam分布によって近似される精度を定量化する一般的な結果の定式化から得られます。この一般的な結果はSteinの方法を使用して開発されており、一般的に興味深いものになる可能性があります。最後に、我々の研究を高次のサブグラフカウント$\eta(G)$に拡張する可能性について議論します。
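The Skellam approximation described in this abstract can be checked numerically in the simplest case of independent errors. The sketch below is only an illustration, not the paper's analysis: the graph size, the false-positive rate `alpha`, the false-negative rate `beta`, and the truncation point `K` are all assumed values. It compares the exact distribution of $D_E$ (a difference of two binomials) with a Skellam distribution (a difference of two independent Poissons with matching means).

```python
import math

def poisson_pmf(k, mu):
    # Evaluated in log space so that k! never overflows.
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def difference_pmf(pmf_a, pmf_b):
    """PMF of A - B for independent A and B, each given as a list over 0..K."""
    out = {}
    for a, pa in enumerate(pmf_a):
        for b, pb in enumerate(pmf_b):
            out[a - b] = out.get(a - b, 0.0) + pa * pb
    return out

# Hypothetical setting: 100 vertices (4950 vertex pairs), 450 true edges.
n_edges, n_non_edges = 450, 4500
alpha = 0.0005  # assumed probability that a non-edge is reported as an edge
beta = 0.005    # assumed probability that a true edge is missed
K = 60          # truncation point; the probability mass beyond it is negligible here

mu_plus = alpha * n_non_edges   # expected number of spurious edges
mu_minus = beta * n_edges       # expected number of missed edges

# Exact law of D_E = (#spurious edges) - (#missed edges) under independent errors.
exact = difference_pmf(
    [binom_pmf(k, n_non_edges, alpha) for k in range(K + 1)],
    [binom_pmf(k, n_edges, beta) for k in range(K + 1)],
)
# Skellam law: difference of two independent Poisson variables.
skellam = difference_pmf(
    [poisson_pmf(k, mu_plus) for k in range(K + 1)],
    [poisson_pmf(k, mu_minus) for k in range(K + 1)],
)

tv_distance = 0.5 * sum(abs(exact[d] - skellam[d]) for d in exact)
print(f"total variation distance: {tv_distance:.5f}")
```

With error rates this low, the two distributions should be close in total variation. Raising `alpha` or `beta` shows how the approximation degrades as the "low-rate" assumption is relaxed.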

Achieving Optimal Misclassification Proportion in Stochastic Block Models
確率的ブロックモデルにおける最適な誤分類比率の達成

Community detection is a fundamental statistical problem in network data analysis. In this paper, we present a polynomial time two-stage method that provably achieves optimal statistical performance in misclassification proportion for stochastic block model under weak regularity conditions. Our two-stage procedure consists of a refinement stage motivated by penalized local maximum likelihood estimation. This stage can take a wide range of weakly consistent community detection procedures as its initializer, to which it applies and outputs a community assignment that achieves optimal misclassification proportion with high probability. The theoretical property is confirmed by simulated examples.

コミュニティ検出は、ネットワークデータ分析における基本的な統計問題です。この論文では、弱い規則性条件下で確率的ブロックモデルの誤分類比率で最適な統計的パフォーマンスを証明可能に達成する多項式時間2段階法を紹介します。私たちの2段階の手順は、ペナルティを受けた局所最尤推定によって動機付けられた改良段階で構成されています。このステージでは、さまざまな弱一貫性コミュニティ検出手順を初期化子として取り、高い確率で最適な誤分類比率を達成するコミュニティ割り当てを適用して出力します。理論的な特性は、シミュレートされた例によって確認されます。

Joint Label Inference in Networks
ネットワークにおける共同ラベル推論

We consider the problem of inferring node labels in a partially labeled graph where each node in the graph has multiple label types and each label type has a large number of possible labels. Our primary example, and the focus of this paper, is the joint inference of label types such as hometown, current city, and employers for people connected by a social network; by predicting these user profile fields, the network can provide a better experience to its users. Existing approaches such as Label Propagation (Zhu et al., 2003) fail to consider interactions between the label types. Our proposed method, called EDGEEXPLAIN, explicitly models these interactions, while still allowing scalable inference under a distributed message-passing architecture. On a large subset of the Facebook social network, collected in a previous study (Chakrabarti et al., 2014), EDGEEXPLAIN outperforms label propagation for several label types, with lifts of up to $120\%$ for recall@1 and $60\%$ for recall@3.

私たちは、グラフ内の各ノードに複数のラベルタイプがあり、各ラベルタイプに多数の可能なラベルがある、部分的にラベル付けされたグラフでノードラベルを推論する問題を考えます。私たちの主な例であり、この論文の焦点は、ソーシャルネットワークで接続された人々の出身地、現在の都市、雇用主などのラベルタイプの共同推論です。これらのユーザープロファイルフィールドを予測することで、ネットワークはユーザーにより良い体験を提供することができます。Label Propagation (Zhuら, 2003)などの既存のアプローチでは、ラベルタイプ間の相互作用が考慮されていません。EDGEEXPLAINと呼ばれる提案された方法は、これらの相互作用を明示的にモデル化しながら、分散型メッセージパッシングアーキテクチャの下でスケーラブルな推論を可能にします。以前の研究(Chakrabartiら, 2014)で収集されたFacebookソーシャルネットワークの大規模なサブセットでは、EDGEEXPLAINはいくつかのラベルタイプでラベル伝播を上回り、recall@1で最大$120\%$、recall@3で最大$60\%$のリフトを上げています。

Lens Depth Function and k-Relative Neighborhood Graph: Versatile Tools for Ordinal Data Analysis
レンズ深度関数とk相対近傍グラフ:順序データ分析のための汎用ツール

In recent years it has become popular to study machine learning problems in a setting of ordinal distance information rather than numerical distance measurements. By ordinal distance information we refer to binary answers to distance comparisons such as $d(A,B)<d(C,D)$. For many problems in machine learning and statistics it is unclear how to solve them in such a scenario. Up to now, the main approach is to explicitly construct an ordinal embedding of the data points in the Euclidean space, an approach that has a number of drawbacks. In this paper, we propose algorithms for the problems of medoid estimation, outlier identification, classification, and clustering when given only ordinal data. They are based on estimating the lens depth function and the $k$-relative neighborhood graph on a data set. Our algorithms are simple, are much faster than an ordinal embedding approach and avoid some of its drawbacks, and can easily be parallelized.

近年、機械学習の問題を数値的な距離測定ではなく、順序距離情報の設定で研究することが一般的になっています。順序距離情報とは、$d(A,B)<d(C,D)$などの距離比較に対するバイナリ回答を参照します。機械学習や統計学の多くの問題では、このようなシナリオでそれらをどのように解決するかは不明です。これまでの主なアプローチは、ユークリッド空間内のデータポイントの順序埋め込みを明示的に構築することでしたが、このアプローチには多くの欠点があります。この論文では、順序データのみが与えられた場合のmedoid推定、外れ値の特定、分類、およびクラスタリングの問題に対するアルゴリズムを提案します。これらは、データセット上のレンズ深度関数と$k$-relative近傍グラフの推定に基づいています。当社のアルゴリズムはシンプルで、序数埋め込みアプローチよりもはるかに高速で、その欠点の一部を回避し、簡単に並列化できます。
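To make the ordinal setting concrete, here is a minimal sketch of a lens depth computation that uses only yes/no answers to comparisons of the form $d(A,B)<d(C,D)$. It assumes the common definition of lens depth (the fraction of data pairs $(x_i, x_j)$ for which $x$ is closer to both $x_i$ and $x_j$ than they are to each other). The toy oracle, the hidden points, and all function names are hypothetical and only serve the illustration; this is not the authors' implementation.

```python
from itertools import combinations

def make_oracle(points):
    """Toy oracle answering 'is d(a, b) < d(c, d)?' from hidden 1-D points.

    In a real ordinal setting only these yes/no answers would be observed;
    the coordinates themselves would be unavailable.
    """
    def closer(a, b, c, d):
        return abs(points[a] - points[b]) < abs(points[c] - points[d])
    return closer

def lens_depth(x, others, closer):
    """Fraction of pairs (i, j) whose 'lens' contains x, using comparisons only."""
    pairs = list(combinations(others, 2))
    inside = sum(
        1 for i, j in pairs
        if closer(x, i, i, j) and closer(x, j, i, j)  # max(d(x,i), d(x,j)) < d(i,j)
    )
    return inside / len(pairs)

hidden = [0.0, 1.0, 2.0, 3.0, 10.0]   # never read directly by lens_depth
closer = make_oracle(hidden)
indices = range(len(hidden))

depths = [lens_depth(x, [i for i in indices if i != x], closer) for x in indices]
medoid_estimate = max(indices, key=lambda i: depths[i])   # deepest point
print(depths, medoid_estimate)
```

Points with high depth are central, which is why the deepest point can serve as a medoid estimate, while points with depth near zero (here the two extremes) are natural outlier candidates.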

Density Estimation in Infinite Dimensional Exponential Families
無限次元指数族における密度推定

In this paper, we consider an infinite dimensional exponential family $\mathcal{P}$ of probability densities, which are parametrized by functions in a reproducing kernel Hilbert space $\mathcal{H}$, and show it to be quite rich in the sense that a broad class of densities on $\mathbb{R}^d$ can be approximated arbitrarily well in Kullback-Leibler (KL) divergence by elements in $\mathcal{P}$. Motivated by this approximation property, the paper addresses the question of estimating an unknown density $p_0$ through an element in $\mathcal{P}$. Standard techniques like maximum likelihood estimation (MLE) or pseudo MLE (based on the method of sieves), which are based on minimizing the KL divergence between $p_0$ and $\mathcal{P}$, do not yield practically useful estimators because of their inability to efficiently handle the log-partition function. We propose an estimator $\hat{p}_n$ based on minimizing the Fisher divergence, $J(p_0\Vert p)$ between $p_0$ and $p\in \mathcal{P}$, which involves solving a simple finite-dimensional linear system. When $p_0\in\mathcal{P}$, we show that the proposed estimator is consistent, and provide a convergence rate of $n^{-\min\left\{\frac{2}{3},\frac{2\beta+1}{2\beta+2}\right\}}$ in Fisher divergence under the smoothness assumption that $\log p_0\in\mathcal{R}(C^\beta)$ for some $\beta\ge 0$, where $C$ is a certain Hilbert-Schmidt operator on $\mathcal{H}$ and $\mathcal{R}(C^\beta)$ denotes the image of $C^\beta$. We also investigate the misspecified case of $p_0\notin\mathcal{P}$ and show that $J(p_0\Vert\hat{p}_n)\rightarrow \inf_{p\in\mathcal{P}}J(p_0\Vert p)$ as $n\rightarrow \infty$, and provide a rate for this convergence under a similar smoothness condition as above. Through numerical simulations we demonstrate that the proposed estimator outperforms the non-parametric kernel density estimator, and that the advantage of the proposed estimator grows as $d$ increases.

この論文では、再生核ヒルベルト空間$\mathcal{H}$内の関数によってパラメータ化された確率密度の無限次元指数族$\mathcal{P}$を考察し、$\mathbb{R}^d$上の幅広いクラスの密度が、Kullback-Leibler (KL)ダイバージェンスで$\mathcal{P}$内の要素によって任意の精度で近似できるという意味で、この族が非常に豊富であることを示します。この近似特性に基づいて、この論文では、$\mathcal{P}$内の要素を通じて未知の密度$p_0$を推定する問題を取り上げます。最大尤度推定(MLE)や疑似MLE (ふるい法に基づく)などの標準的な手法は、$p_0$と$\mathcal{P}$間のKLダイバージェンスを最小化することに基づいていますが、対数分割関数を効率的に処理できないため、実用的に有用な推定量は得られません。私たちは、$p_0$と$p\in \mathcal{P}$間のフィッシャーダイバージェンス$J(p_0\Vert p)$を最小化することに基づく推定量$\hat{p}_n$を提案します。これには、単純な有限次元線形システムを解くことが含まれます。$p_0\in\mathcal{P}$のとき、提案された推定量が一致性を持つことを示します。また、ある$\beta\ge 0$に対して$\log p_0\in\mathcal{R}(C^\beta)$が成り立つという平滑性仮定の下で、フィッシャーダイバージェンスの収束率が$n^{-\min\left\{\frac{2}{3},\frac{2\beta+1}{2\beta+2}\right\}}$であることを示します。ここで、$C$は$\mathcal{H}$上の特定のヒルベルト・シュミット作用素であり、$\mathcal{R}(C^\beta)$は$C^\beta$の像を表します。また、$p_0\notin\mathcal{P}$の誤指定ケースも調査し、$n\rightarrow \infty$のとき$J(p_0\Vert\hat{p}_n)\rightarrow \inf_{p\in\mathcal{P}}J(p_0\Vert p)$であることを示し、上記と同様の平滑性条件下でのこの収束率を示します。数値シミュレーションにより、提案された推定量がノンパラメトリックカーネル密度推定量よりも優れていること、および提案された推定量の利点は$d$が増加するにつれて大きくなることを実証します。

Statistical Inference with Unnormalized Discrete Models and Localized Homogeneous Divergences
正規化されていない離散モデルと局所的な均質な発散による統計的推論

In this paper, we focus on parameter estimation of probabilistic models in discrete space. A naive calculation of the normalization constant of the probabilistic model on discrete space is often infeasible and statistical inference based on such probabilistic models has difficulty. In this paper, we propose a novel estimator for probabilistic models on discrete space, which is derived from an empirically localized homogeneous divergence. The idea of the empirical localization makes it possible to ignore an unobserved domain on sample space, and the homogeneous divergence is a discrepancy measure between two positive measures and has a weak coincidence axiom. The proposed estimator can be constructed without calculating the normalization constant and is asymptotically consistent and Fisher efficient. We investigate statistical properties of the proposed estimator and reveal a relationship between the empirically localized homogeneous divergence and a mixture of the $\alpha$-divergence. The $\alpha$-divergence is a non-homogeneous discrepancy measure that is frequently discussed in the context of information geometry. Using the relationship, we also propose an asymptotically consistent estimator of the normalization constant. Experiments showed that the proposed estimator performs comparably to the maximum likelihood estimator but with drastically lower computational cost.

この論文では、離散空間における確率モデルのパラメータ推定に焦点を当てる。離散空間上の確率モデルの正規化定数の単純な計算は実行不可能な場合が多く、そのような確率モデルに基づく統計的推論は困難を伴う。この論文では、経験的に局所化された同次ダイバージェンスから導かれる、離散空間上の確率モデルの新しい推定量を提案します。経験的局所化の考え方により、標本空間上の観測されていない領域を無視することが可能になり、同次ダイバージェンスは2つの正の測度間の乖離測度であり、弱い一致公理を持つ。提案された推定量は、正規化定数を計算せずに構築でき、漸近的に整合しており、フィッシャー効率的です。提案された推定量の統計的特性を調査し、経験的に局所化された同次ダイバージェンスと$\alpha$-ダイバージェンスの混合との関係を明らかにします。$\alpha$ダイバージェンスは、情報幾何学の文脈で頻繁に議論される非同次不一致尺度です。この関係を使用して、正規化定数の漸近的に一貫した推定量も提案します。実験では、提案された推定量は最大尤度推定量と同等のパフォーマンスを発揮しますが、計算コストは大幅に低いことが示されました。

On the Consistency of Ordinal Regression Methods
順序回帰法の一貫性について

Many of the ordinal regression models that have been proposed in the literature can be seen as methods that minimize a convex surrogate of the zero-one, absolute, or squared loss functions. A key property that allows one to study the statistical implications of such approximations is that of Fisher consistency. Fisher consistency is a desirable property for surrogate loss functions and implies that in the population setting, i.e., if the probability distribution that generates the data were available, then optimization of the surrogate would yield the best possible model. In this paper we will characterize the Fisher consistency of a rich family of surrogate loss functions used in the context of ordinal regression, including support vector ordinal regression, ORBoosting and least absolute deviation. We will see that, for a family of surrogate loss functions that subsumes support vector ordinal regression and ORBoosting, consistency can be fully characterized by the derivative of a real-valued function at zero, as happens for convex margin-based surrogates in binary classification. We also derive excess risk bounds for a surrogate of the absolute error that generalize existing risk bounds for binary classification. Finally, our analysis suggests a novel surrogate of the squared error loss. We compare this novel surrogate with competing approaches on 9 different datasets. Our method proves to be highly competitive in practice, outperforming the least squares loss on 7 out of 9 datasets.

文献で提案されている順序回帰モデルの多くは、ゼロ-1、絶対、または二乗損失関数の凸サロゲートを最小化する手法と見なすことができます。このような近似の統計的意味を調べることができる重要な特性は、フィッシャー一貫性です。フィッシャー一貫性は、サロゲート損失関数に望ましい特性であり、母集団設定、つまりデータを生成する確率分布が利用できる場合、サロゲートの最適化によって可能な限り最良のモデルが得られることを意味します。この論文では、サポートベクター順序回帰、ORBoosting、最小絶対偏差など、順序回帰のコンテキストで使用される豊富なサロゲート損失関数のファミリーのフィッシャー一貫性を特徴付けます。サポートベクター順序回帰とORBoostingを包含するサロゲート損失関数のファミリーでは、バイナリ分類の凸マージンベースのサロゲートの場合と同様に、一貫性は実数値関数のゼロでの導関数によって完全に特徴付けることができることがわかります。また、バイナリ分類の既存のリスク境界を一般化する絶対誤差の代替の過剰リスク境界も導出します。最後に、分析により、二乗誤差損失の新しい代替が提案されます。この新しい代替を、9つの異なるデータセットで競合アプローチと比較します。私たちの方法は、実際には非常に競争力があり、9つのデータセットのうち7つで最小二乗損失よりも優れています。

Two New Approaches to Compressed Sensing Exhibiting Both Robust Sparse Recovery and the Grouping Effect
ロバストなスパース回復とグループ化効果の両方を示す圧縮センシングへの2つの新しいアプローチ

In this paper we introduce a new optimization formulation for sparse regression and compressed sensing, called CLOT (Combined L-One and Two), wherein the regularizer is a convex combination of the $\ell_1$- and $\ell_2$-norms. This formulation differs from the Elastic Net (EN) formulation, in which the regularizer is a convex combination of the $\ell_1$- and $\ell_2$-norm squared. It is shown that, in the context of compressed sensing, the EN formulation does not achieve robust recovery of sparse vectors, whereas the new CLOT formulation achieves robust recovery. Also, like EN but unlike LASSO, the CLOT formulation achieves the grouping effect, wherein coefficients of highly correlated columns of the measurement (or design) matrix are assigned roughly comparable values. It is already known LASSO does not have the grouping effect. Therefore the CLOT formulation combines the best features of both LASSO (robust sparse recovery) and EN (grouping effect). The CLOT formulation is a special case of another one called SGL (Sparse Group LASSO) which was introduced into the literature previously, but without any analysis of either the grouping effect or robust sparse recovery. It is shown here that SGL achieves robust sparse recovery, and also achieves a version of the grouping effect in that coefficients of highly correlated columns belonging to the same group of the measurement (or design) matrix are assigned roughly comparable values.

この論文では、CLOT (Combined L-One and Two)と呼ばれるスパース回帰と圧縮センシングの新しい最適化定式化を紹介します。この定式化では、正則化子は$\ell_1$ノルムと$\ell_2$ノルムの凸結合です。この定式化は、正則化子が$\ell_1$ノルムと$\ell_2$ノルムの2乗の凸結合であるElastic Net (EN)定式化とは異なります。圧縮センシングのコンテキストでは、EN定式化ではスパースベクトルの堅牢な回復が達成されないのに対し、新しいCLOT定式化では堅牢な回復が達成されることが示されています。また、ENと同様ですがLASSOとは異なり、CLOT定式化ではグループ化効果が得られ、測定(または設計)行列の相関の高い列の係数にはほぼ同等の値が割り当てられます。LASSOにはグループ化効果がないことは既に知られています。したがって、CLOT定式化では、LASSO (堅牢なスパース回復)とEN (グループ化効果)の両方の優れた機能が組み合わされています。CLOT定式化は、以前に文献に紹介されたSGL (Sparse Group LASSO)と呼ばれる別の定式化の特殊なケースですが、グループ化効果や堅牢なスパース回復の分析は行われていません。ここでは、SGLが堅牢なスパース回復を実現すること、また、測定(または設計)マトリックスの同じグループに属する相関の高い列の係数にほぼ同等の値が割り当てられるという点で、グループ化効果のバージョンも実現することが示されています。
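The difference between the two regularizers described above is easy to see numerically. The sketch below assumes the mixing weight is written as $(1-\mu)\lVert x\rVert_1 + \mu\lVert x\rVert_2$ for CLOT and $(1-\mu)\lVert x\rVert_1 + \mu\lVert x\rVert_2^2$ for EN; this parameterization and the function names are illustrative choices, not taken from the paper.

```python
import math

def l1_norm(x):
    return sum(abs(v) for v in x)

def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))

def clot_penalty(x, mu):
    """CLOT-style regularizer: convex combination of the l1 and l2 norms."""
    return (1.0 - mu) * l1_norm(x) + mu * l2_norm(x)

def en_penalty(x, mu):
    """Elastic-Net-style regularizer: l1 norm combined with the SQUARED l2 norm."""
    return (1.0 - mu) * l1_norm(x) + mu * l2_norm(x) ** 2

x = [3.0, 4.0]
doubled = [2.0 * v for v in x]
mu = 0.5  # assumed mixing weight

print(clot_penalty(x, mu), clot_penalty(doubled, mu))  # second value is twice the first
print(en_penalty(x, mu), en_penalty(doubled, mu))      # second value is more than twice the first
```

The CLOT penalty is itself a norm, so it scales linearly with the vector, whereas the squared term in EN grows quadratically. This is the structural difference between the two formulations that the abstract contrasts.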

Perishability of Data: Dynamic Pricing under Varying-Coefficient Models
データの腐りやすさ:変動係数モデルの下での動的価格設定

We consider a firm that sells a large number of products to its customers in an online fashion. Each product is described by a high dimensional feature vector, and the market value of a product is assumed to be linear in the values of its features. Parameters of the valuation model are unknown and can change over time. The firm sequentially observes a product’s features and can use the historical sales data (binary sale/no sale feedbacks) to set the price of current product, with the objective of maximizing the collected revenue. We measure the performance of a dynamic pricing policy via regret, which is the expected revenue loss compared to a clairvoyant that knows the sequence of model parameters in advance. We propose a pricing policy based on projected stochastic gradient descent (PSGD) and characterize its regret in terms of time $T$, features dimension $d$, and the temporal variability in the model parameters, $\delta_t$. We consider two settings. In the first one, feature vectors are chosen antagonistically by nature and we prove that the regret of PSGD pricing policy is of order $O(\sqrt{T} + \sum_{t=1}^T \sqrt{t}\delta_t)$. In the second setting (referred to as stochastic features model), the feature vectors are drawn independently from an unknown distribution. We show that in this case, the regret of PSGD pricing policy is of order $O(d^2 \log T + \sum_{t=1}^T t\delta_t/d)$.

私たちは、多数の製品をオンライン方式で(逐次的に)顧客に販売する企業について考えます。各製品は高次元の特徴ベクトルで記述され、製品の市場価値はその特徴量の値に対して線形であると仮定します。評価モデルのパラメータは未知であり、時間の経過とともに変化する可能性があります。企業は製品の特徴を逐次的に観測し、過去の販売データ(販売/非販売の二値フィードバック)を使用して現在の製品の価格を設定し、得られる収益を最大化することを目標とします。動的価格設定ポリシーの性能はリグレットによって測定します。リグレットとは、モデルパラメータの系列を事前に知っている全知の存在(clairvoyant)と比較した期待収益損失です。射影確率的勾配降下法(PSGD)に基づく価格設定ポリシーを提案し、そのリグレットを時間$T$、特徴次元$d$、およびモデルパラメータの時間的変動$\delta_t$で特徴付けます。2つの設定を検討します。最初の設定では、特徴ベクトルは自然によって敵対的に選択され、PSGD価格設定ポリシーのリグレットは$O(\sqrt{T} + \sum_{t=1}^T \sqrt{t}\delta_t)$のオーダーであることを証明します。2番目の設定(確率的特徴モデルと呼ばれる)では、特徴ベクトルは未知の分布から独立に抽出されます。この場合、PSGD価格設定ポリシーのリグレットは$O(d^2 \log T + \sum_{t=1}^T t\delta_t/d)$のオーダーであることを示します。

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback
2点フィードバックによるバンディットおよびゼロ次凸最適化のための最適アルゴリズム

We consider the closely related problems of bandit convex optimization with two-point feedback, and zero-order stochastic convex optimization with two function evaluations per round. We provide a simple algorithm and analysis which is optimal for convex Lipschitz functions. This improves on Duchi et al. (2015), which only provides an optimal result for smooth functions; Moreover, the algorithm and analysis are simpler, and readily extend to non-Euclidean problems. The algorithm is based on a small but surprisingly powerful modification of the gradient estimator.

私たちは、2点フィードバックによるバンディット凸最適化と、ラウンドごとに2回の関数評価を行うゼロ次確率的凸最適化という、密接に関連する問題を検討します。凸リプシッツ関数に対して最適な、単純なアルゴリズムと解析を提供します。これは、滑らかな関数に対してのみ最適な結果を与えるDuchiら(2015)を改善するものです。さらに、アルゴリズムと解析はより単純で、非ユークリッド問題に容易に拡張できます。このアルゴリズムは、勾配推定器に対する小さいながらも驚くほど強力な修正に基づいています。
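
To make "two function evaluations per round" concrete, here is a minimal sketch of a standard two-point gradient estimator with a direction drawn uniformly from the unit sphere. Whether this matches the paper's exact modified estimator is an assumption; the function names, the smoothing radius `delta`, and the plain gradient step are illustrative choices, not the authors' algorithm.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        if norm > 0.0:
            return [v / norm for v in g]


def two_point_gradient(f, x, delta, u):
    """Estimate the gradient as (d / (2*delta)) * (f(x + delta*u) - f(x - delta*u)) * u."""
    d = len(x)
    x_plus = [xi + delta * ui for xi, ui in zip(x, u)]
    x_minus = [xi - delta * ui for xi, ui in zip(x, u)]
    scale = d * (f(x_plus) - f(x_minus)) / (2.0 * delta)
    return [scale * ui for ui in u]


def zero_order_descent(f, x0, steps, step_size, delta, seed=0):
    """Gradient descent that only queries f, twice per round (hypothetical driver)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        u = random_unit_vector(len(x), rng)
        g = two_point_gradient(f, x, delta, u)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x
```

For a linear function $f(x) = \langle a, x \rangle$ the estimate equals $d \langle a, u \rangle u$ exactly. Its expectation over a uniform direction $u$ is $a$, which is why the factor $d$ appears.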

Reconstructing Undirected Graphs from Eigenspaces
固有空間からの無向グラフの再構成

We aim at recovering the weighted adjacency matrix $\mathsf{W}$ of an undirected graph from a perturbed version of its eigenspaces. This situation arises for instance when working with stationary signals on graphs or Markov chains observed at random times. Our approach relies on minimizing a cost function based on the Frobenius norm of the commutator $\mathsf{A} \mathsf{B}-\mathsf{B} \mathsf{A}$ between symmetric matrices $\mathsf{A}$ and $\mathsf{B}$. We describe a particular framework in which we have access to an estimation of the eigenspaces and provide support selection procedures from theoretical and practical points of view. In the Erdős-Rényi model on $N$ vertices with no self-loops, we show that identifiability (i.e., the ability to reconstruct $\mathsf{W}$ from the knowledge of its eigenspaces) follows a sharp phase transition on the expected number of edges with threshold function $N\log N/2$. Simulated and real life numerical experiments assert our methodology.

私たちは、無向グラフの重み付き隣接行列$\mathsf{W}$を、その固有空間の摂動バージョンから復元することを目指しています。この状況は、たとえば、グラフ上の定常信号やランダムな時刻に観測されたマルコフ連鎖を扱うときに発生します。このアプローチは、対称行列$\mathsf{A}$と$\mathsf{B}$の間の交換子$\mathsf{A} \mathsf{B}-\mathsf{B} \mathsf{A}$のフロベニウスノルムに基づくコスト関数の最小化に依存しています。固有空間の推定値が得られる特定のフレームワークについて説明し、理論的および実践的観点からサポート選択手順を提供します。自己ループのない$N$頂点のErdős-Rényiモデルでは、識別可能性(つまり、固有空間の知識から$\mathsf{W}$を再構築できること)が、期待エッジ数に関して閾値関数$N\log N/2$を持つ鋭い相転移に従うことを示します。シミュレーションおよび実データによる数値実験が、私たちの方法論を裏付けています。
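
The commutator cost mentioned in the abstract can be illustrated on a toy example. This is a sketch under stated assumptions: the 3-vertex matrices and the helper names are hypothetical, and it only shows why a small value of $\|\mathsf{A}\mathsf{B}-\mathsf{B}\mathsf{A}\|_F$ signals shared eigenspaces. It is not the paper's estimation procedure.

```python
import math


def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def commutator_frobenius(a, b):
    """Frobenius norm of the commutator AB - BA."""
    ab, ba = matmul(a, b), matmul(b, a)
    n = len(a)
    return math.sqrt(sum((ab[i][j] - ba[i][j]) ** 2
                         for i in range(n) for j in range(n)))


# Weighted adjacency matrix of a 3-vertex path graph (hypothetical example).
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 2.0],
     [0.0, 2.0, 0.0]]

# W @ W has the same eigenvectors as W, so the two matrices commute.
same_eigenspaces = matmul(W, W)

# A different graph on the same vertices generally does not commute with W.
other = [[0.0, 0.0, 1.0],
         [0.0, 0.0, 0.0],
         [1.0, 0.0, 0.0]]

print(commutator_frobenius(W, same_eigenspaces))  # zero: eigenspaces agree
print(commutator_frobenius(W, other))             # positive: eigenspaces differ
```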

Uniform Hypergraph Partitioning: Provable Tensor Methods and Sampling Techniques
一様ハイパーグラフ分割:証明可能なテンソル法とサンプリング手法

In a series of recent works, we have generalised the consistency results in the stochastic block model literature to the case of uniform and non-uniform hypergraphs. The present paper continues the same line of study, where we focus on partitioning weighted uniform hypergraphs—a problem often encountered in computer vision. This work is motivated by two issues that arise when a hypergraph partitioning approach is used to tackle computer vision problems: (i) The uniform hypergraphs constructed for higher-order learning contain all edges, but most have negligible weights. Thus, the adjacency tensor is nearly sparse, and yet, not binary. (ii) A more serious concern is that standard partitioning algorithms need to compute all edge weights, which is computationally expensive for hypergraphs. This is usually resolved in practice by merging the clustering algorithm with a tensor sampling strategy—an approach that is yet to be analysed rigorously. We build on our earlier work on partitioning dense unweighted uniform hypergraphs (Ghoshdastidar and Dukkipati, ICML, 2015), and address the aforementioned issues by proposing provable and efficient partitioning algorithms. Our analysis justifies the empirical success of practical sampling techniques. We also complement our theoretical findings by elaborate empirical comparison of various hypergraph partitioning schemes.

最近の一連の研究で、我々は確率的ブロックモデルの文献における一致性の結果を、一様および非一様ハイパーグラフの場合に一般化しました。本論文では同じ研究の流れを継続し、重み付き一様ハイパーグラフの分割に焦点を当てています。これは、コンピュータービジョンでよく遭遇する問題です。この研究は、ハイパーグラフ分割アプローチを使用してコンピュータービジョンの問題に取り組むときに発生する2つの問題に動機付けられています。(i)高次学習用に構築された一様ハイパーグラフにはすべてのエッジが含まれますが、そのほとんどは重みが無視できるほど小さいです。したがって、隣接テンソルはほぼスパースですが、バイナリではありません。(ii)さらに深刻な懸念は、標準的な分割アルゴリズムではすべてのエッジの重みを計算する必要があることです。これは、ハイパーグラフの場合、計算コストが高くなります。これは通常、実際にはクラスタリングアルゴリズムとテンソルサンプリング戦略を統合することによって解決されますが、このアプローチはまだ厳密に分析されていません。私たちは、密な重みなし一様ハイパーグラフの分割に関する以前の研究(GhoshdastidarとDukkipati、ICML、2015)を基に、証明可能で効率的な分割アルゴリズムを提案することで、前述の問題に対処します。私たちの分析は、実用的なサンプリング手法の実証的な成功を正当化します。また、さまざまなハイパーグラフ分割スキームの詳細な実証的比較によって、理論的発見を補完します。

Clustering from General Pairwise Observations with Applications to Time-varying Graphs
一般的なペアワイズ観測からのクラスタリングと時変グラフへの応用

We present a general framework for graph clustering and bi-clustering where we are given a general observation (called a label) between each pair of nodes. This framework allows a rich encoding of various types of pairwise interactions between nodes. We propose a new tractable and robust approach to this problem based on convex optimization and maximum likelihood estimators. We analyze our algorithms under a general statistical model extending the planted partition and stochastic block models. Both sufficient and necessary conditions are provided for successful recovery of the underlying clusters. Our theoretical results subsume many existing graph clustering results for a wide range of settings, including planted partition, weighted clustering, submatrix localization and partially observed graphs. Furthermore, our results are applicable to novel settings including time-varying graphs, providing new insights to solutions of these problems. We provide empirical results on both synthetic and real data that corroborate with our theoretical findings.

私たちは、各ノードペア間の一般的な観測値（ラベルと呼ばれる）が与えられるグラフクラスタリングおよびバイクラスタリングのための一般的なフレームワークを提示します。このフレームワークにより、ノード間のさまざまな種類のペアワイズ相互作用を豊富にエンコードすることができます。私たちは、この問題に対する、凸最適化および最大尤度推定量に基づく、扱いやすく堅牢な新しいアプローチを提案します。私たちは、プランテッドパーティションおよび確率的ブロックモデルを拡張した一般的な統計モデルの下で、アルゴリズムを分析します。基礎となるクラスターを正常に回復するための十分条件と必要条件の両方が提供されます。我々の理論的結果は、プランテッドパーティション、重み付きクラスタリング、サブマトリックスローカリゼーション、および部分的に観測されたグラフを含む、広範囲の設定に対する多くの既存のグラフクラスタリング結果を包含します。さらに、我々の結果は、時間変動グラフを含む新しい設定に適用可能であり、これらの問題の解決に新たな洞察を提供します。私たちは、理論的発見を裏付ける合成データおよび実データの両方に関する経験的結果を提供します。

Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers
補間分類器としての AdaBoost とランダムフォレストの成功を説明する

There is a large literature explaining why AdaBoost is a successful classifier. The literature on AdaBoost focuses on classifier margins and boosting’s interpretation as the optimization of an exponential likelihood function. These existing explanations, however, have been pointed out to be incomplete. A random forest is another popular ensemble method for which there is substantially less explanation in the literature. We introduce a novel perspective on AdaBoost and random forests that proposes that the two algorithms work for similar reasons. While both classifiers achieve similar predictive accuracy, random forests cannot be conceived as a direct optimization procedure. Rather, random forests is a self-averaging, interpolating algorithm which creates what we denote as a spiked-smooth classifier, and we view AdaBoost in the same light. We conjecture that both AdaBoost and random forests succeed because of this mechanism. We provide a number of examples to support this explanation. In the process, we question the conventional wisdom that suggests that boosting algorithms for classification require regularization or early stopping and should be limited to low complexity classes of learners, such as decision stumps. We conclude that boosting should be used like random forests: with large decision trees, without regularization or early stopping.

AdaBoostがなぜ成功した分類器であるかを説明する文献は多数あります。AdaBoostに関する文献は、分類器のマージンと、指数尤度関数の最適化としてのブースティングの解釈に焦点を当てています。しかし、これらの既存の説明は不完全であると指摘されています。ランダムフォレストは、文献での説明がかなり少ない、もう1つの一般的なアンサンブル手法です。ここでは、AdaBoostとランダムフォレストに関する新しい視点を紹介します。この視点では、2つのアルゴリズムが同様の理由で機能すると提案しています。両方の分類器は同様の予測精度を実現しますが、ランダムフォレストを直接の最適化手順として考えることはできません。むしろ、ランダムフォレストは、私たちがスパイクスムース(spiked-smooth)分類器と呼ぶものを作成する自己平均型の補間アルゴリズムであり、AdaBoostも同じ観点から捉えます。AdaBoostとランダムフォレストの両方が成功しているのは、このメカニズムのためだと推測しています。この説明を裏付けるいくつかの例を示します。その過程で、分類のためのブースティングアルゴリズムには正則化または早期停止が必要であり、決定株(decision stumps)などの複雑度の低い学習器のクラスに限定すべきであるという従来の考え方に疑問を投げかけます。ブースティングはランダムフォレストのように、つまり、正則化や早期停止なしで、大きな決定木とともに使用すべきであるという結論に達しました。

On Markov chain Monte Carlo methods for tall data
縦長(tall)データに対するマルコフ連鎖モンテカルロ法について

Markov chain Monte Carlo methods are often deemed too computationally intensive to be of any practical use for big data applications, and in particular for inference on datasets containing a large number $n$ of individual data points, also known as tall datasets. In scenarios where data are assumed independent, various approaches to scale up the Metropolis-Hastings algorithm in a Bayesian inference context have been recently proposed in machine learning and computational statistics. These approaches can be grouped into two categories: divide-and-conquer approaches and subsampling-based algorithms. The aims of this article are as follows. First, we present a comprehensive review of the existing literature, commenting on the underlying assumptions and theoretical guarantees of each method. Second, by leveraging our understanding of these limitations, we propose an original subsampling-based approach relying on a control variate method which samples under regularity conditions from a distribution provably close to the posterior distribution of interest, yet can require less than $O(n)$ data point likelihood evaluations at each iteration for certain statistical models in favourable scenarios. Finally, we emphasize that we have only been able so far to propose subsampling-based methods which display good performance in scenarios where the Bernstein-von Mises approximation of the target posterior distribution is excellent. It remains an open challenge to develop such methods in scenarios where the Bernstein-von Mises approximation is poor.

マルコフ連鎖モンテカルロ法は、多くの場合、計算負荷が大きすぎるため、ビッグデータアプリケーション、特にtallデータセットとも呼ばれる多数$n$個の個別データポイントを含むデータセットの推論には実用的ではないと考えられています。データが独立していると想定されるシナリオでは、ベイズ推論のコンテキストでメトロポリス-ヘイスティングスアルゴリズムをスケールアップするさまざまなアプローチが、機械学習と計算統計の分野で最近提案されています。これらのアプローチは、分割統治法とサブサンプリングベースのアルゴリズムの2つのカテゴリに分類できます。この記事の目的は次のとおりです。まず、既存の文献を包括的にレビューし、各方法の根底にある仮定と理論的保証について論評します。第二に、これらの制限に関する理解を活用して、制御変量法に依存する独自のサブサンプリングベースのアプローチを提案します。この方法は、正則条件の下で、関心のある事後分布に証明可能に近い分布からサンプリングしますが、好ましいシナリオでは、特定の統計モデルに対して各反復で$O(n)$未満のデータポイント尤度評価で済む場合があります。最後に、これまでのところ、対象事後分布のBernstein-von Mises近似が優れているシナリオでのみ優れたパフォーマンスを発揮するサブサンプリングベースの方法しか提案できていないことを強調します。Bernstein-von Mises近似が不十分なシナリオでこのような方法を開発することは、未解決の課題のままです。

Distributed Semi-supervised Learning with Kernel Ridge Regression
カーネルリッジ回帰による分散半教師あり学習

This paper provides error analysis for distributed semi-supervised learning with kernel ridge regression (DSKRR) based on a divide-and-conquer strategy. DSKRR applies kernel ridge regression (KRR) to data subsets that are distributively stored on multiple servers to produce individual output functions, and then takes a weighted average of the individual output functions as a final estimator. Using a novel error decomposition which divides the generalization error of DSKRR into the approximation error, sample error and distributed error, we find that the sample error and distributed error reflect the power and limitation of DSKRR, compared with KRR processing the whole data. Thus a small distributed error provides a large range of the number of data subsets to guarantee a small generalization error. Our results show that unlabeled data play important roles in reducing the distributed error and enlarging the number of data subsets in DSKRR. Our analysis also applies to the case when the regression function is out of the reproducing kernel Hilbert space. Numerical experiments including toy simulations and a music-prediction task are employed to demonstrate our theoretical statements and show the power of unlabeled data in distributed learning.

この論文では、分割統治戦略に基づくカーネルリッジ回帰(DSKRR)による分散半教師あり学習の誤差分析を示します。DSKRRは、複数のサーバーに分散して保存されているデータサブセットにカーネルリッジ回帰(KRR)を適用して個別の出力関数を生成し、個々の出力関数の加重平均を最終推定値として使用します。DSKRRの一般化誤差を近似誤差、サンプル誤差、分散誤差に分割する新しい誤差分解を使用して、サンプル誤差と分散誤差が、データ全体を処理するKRRと比較したDSKRRの能力と限界を反映していることがわかりました。したがって、分散誤差が小さいと、データサブセットの数の範囲が広くなり、一般化誤差が小さくなります。結果は、ラベルなしデータが分散誤差を減らし、DSKRRのデータサブセットの数を増やす上で重要な役割を果たすことを示しています。また、回帰関数が再生カーネルヒルベルト空間外にある場合にも、分析が適用されます。おもちゃのシミュレーションや音楽予測タスクなどの数値実験を使用して、私たちの理論的主張を実証し、分散学習におけるラベルなしデータの威力を示します。

Asymptotic behavior of Support Vector Machine for spiked population model
スパイク母集団モデルに対するサポートベクターマシンの漸近挙動

For spiked population model, we investigate the large dimension $N$ and large sample size $M$ asymptotic behavior of the Support Vector Machine (SVM) classification method in the limit of $N,M\rightarrow\infty$ at fixed $\alpha=M/N$. We focus on the generalization performance by analytically evaluating the angle between the normal direction vectors of SVM separating hyperplane and corresponding Bayes optimal separating hyperplane. This is an analogous result to the one shown in Paul (2007) and Nadler (2008) for the angle between the sample eigenvector and the population eigenvector in random matrix theorem. We provide not just bound, but sharp prediction of the asymptotic behavior of SVM that can be determined by a set of nonlinear equations. Based on the analytical results, we propose a new method of selecting tuning parameter which significantly reduces the computational cost. A surprising finding is that SVM achieves its best performance at small value of the tuning parameter under spiked population model. These results are confirmed to be correct by comparing with those of numerical simulations on finite-size systems. We also apply our formulas to an actual dataset of breast cancer and find agreement between analytical derivations and numerical computations based on cross validation.

スパイク母集団モデルに対して、$\alpha=M/N$を固定した$N,M\rightarrow\infty$の極限における、サポートベクターマシン(SVM)分類法の大次元$N$および大サンプルサイズ$M$での漸近挙動を調査します。SVM分離超平面の法線方向ベクトルと、対応するベイズ最適分離超平面の法線方向ベクトルの間の角度を解析的に評価することにより、汎化性能に焦点を当てます。これは、ランダム行列定理におけるサンプル固有ベクトルと母集団固有ベクトルの間の角度についてPaul (2007)およびNadler (2008)で示された結果に類似しています。単なる上界・下界ではなく、一連の非線形方程式によって決定できるSVMの漸近挙動の鋭い予測を提供します。解析結果に基づいて、計算コストを大幅に削減する新しいチューニングパラメータの選択方法を提案します。驚くべき発見は、スパイク母集団モデルでは、SVMがチューニングパラメータの値が小さいときに最高の性能を達成することです。これらの結果は、有限サイズシステムでの数値シミュレーションの結果と比較することで正しいことが確認されています。また、実際の乳がんのデータセットに私たちの公式を適用し、解析的導出と交差検証に基づく数値計算が一致することを確認しました。

Time-Accuracy Tradeoffs in Kernel Prediction: Controlling Prediction Quality
カーネル予測における時間精度のトレードオフ: 予測品質の制御

Kernel regression or classification (also referred to as weighted $\epsilon$-NN methods in Machine Learning) are appealing for their simplicity and therefore ubiquitous in data analysis. However, practical implementations of kernel regression or classification consist of quantizing or sub-sampling data for improving time efficiency, often at the cost of prediction quality. While such tradeoffs are necessary in practice, their statistical implications are generally not well understood, hence practical implementations come with few performance guarantees. In particular, it is unclear whether it is possible to maintain the statistical accuracy of kernel prediction—crucial in some applications—while improving prediction time. The present work provides guiding principles for combining kernel prediction with data-quantization so as to guarantee good tradeoffs between prediction time and accuracy, and in particular so as to approximately maintain the good accuracy of vanilla kernel prediction. Furthermore, our tradeoff guarantees are worked out explicitly in terms of a tuning parameter which acts as a knob that favors either time or accuracy depending on practical needs. On one end of the knob, prediction time is of the same order as that of single-nearest-neighbor prediction (which is statistically inconsistent) while maintaining consistency; on the other end of the knob, the prediction risk is nearly minimax-optimal (in terms of the original data size) while still reducing time complexity. The analysis thus reveals the interaction between the data-quantization approach and the kernel prediction method, and most importantly gives explicit control of the tradeoff to the practitioner rather than fixing the tradeoff in advance or leaving it opaque. The theoretical results are validated on data from a range of real-world application domains; in particular we demonstrate that the theoretical knob performs as expected.

カーネル回帰または分類(機械学習では重み付き$\epsilon$-NN法とも呼ばれる)は、そのシンプルさが魅力的で、データ分析では広く使用されています。しかし、カーネル回帰または分類の実際の実装では、時間効率を向上させるためにデータを量子化またはサブサンプリングしますが、予測品質が犠牲になることがよくあります。このようなトレードオフは実際には必要ですが、その統計的意味は一般に十分に理解されていないため、実際の実装ではパフォーマンスの保証はほとんどありません。特に、一部のアプリケーションで重要なカーネル予測の統計的精度を維持しながら予測時間を改善できるかどうかは不明です。この研究では、予測時間と精度の適切なトレードオフを保証し、特にバニラカーネル予測の優れた精度をほぼ維持できるように、カーネル予測とデータ量子化を組み合わせるための指針を提供します。さらに、トレードオフの保証は、実際のニーズに応じて時間または精度のいずれかを優先するノブとして機能するチューニングパラメータの観点から明示的に解決されます。ノブの一方の端では、予測時間は、一貫性を維持しながら、単一の最近傍予測（統計的に一貫性がない）と同程度です。ノブのもう一方の端では、予測リスクは、時間の複雑さを軽減しながら、ほぼ最小最大最適（元のデータサイズの観点から）です。このように、この分析により、データ量子化アプローチとカーネル予測方法の相互作用が明らかになり、最も重要なことは、トレードオフを事前に固定したり不透明なままにしたりするのではなく、実践者にトレードオフの明示的な制御を提供することです。理論的な結果は、さまざまな実際のアプリケーションドメインのデータで検証されています。特に、理論的なノブが期待どおりに機能することを実証しています。

Bayesian Learning of Dynamic Multilayer Networks
動的多層ネットワークのベイジアン学習

A plethora of networks is being collected in a growing number of fields, including disease transmission, international relations, social interactions, and others. As data streams continue to grow, the complexity associated with these highly multidimensional connectivity data presents novel challenges. In this paper, we focus on the time-varying interconnections among a set of actors in multiple contexts, called layers. Current literature lacks flexible statistical models for dynamic multilayer networks, which can enhance quality in inference and prediction by efficiently borrowing information within each network, across time, and between layers. Motivated by this gap, we develop a Bayesian nonparametric model leveraging latent space representations. Our formulation characterizes the edge probabilities as a function of shared and layer-specific actors positions in a latent space, with these positions changing in time via Gaussian processes. This representation facilitates dimensionality reduction and incorporates different sources of information in the observed data. In addition, we obtain tractable procedures for posterior computation, inference, and prediction. We provide theoretical results on the flexibility of our model. Our methods are tested on simulations and infection studies monitoring dynamic face-to-face contacts among individuals in multiple days, where we perform better than current methods in inference and prediction.

病気の伝染、国際関係、社会的交流など、ますます多くの分野で膨大なネットワークが収集されています。データストリームが拡大し続けるにつれて、これらの高度に多次元化された接続データに関連する複雑さが新たな課題を生み出しています。この論文では、レイヤーと呼ばれる複数のコンテキストにおける一連のアクター間の時間変動する相互接続に焦点を当てます。現在の文献には、各ネットワーク内、時間全体、およびレイヤー間で情報を効率的に借用することで推論と予測の品質を向上させることができる、動的な多層ネットワークの柔軟な統計モデルが欠けています。このギャップに動機付けられて、潜在空間表現を活用したベイジアンノンパラメトリックモデルを開発します。私たちの定式化は、潜在空間内の共有およびレイヤー固有のアクターの位置の関数としてエッジ確率を特徴付け、これらの位置はガウス過程によって時間とともに変化します。この表現は次元削減を容易にし、観測データ内のさまざまな情報源を組み込みます。さらに、事後計算、推論、予測のための扱いやすい手順が得られます。私たちは、モデルの柔軟性に関する理論的結果を提供します。私たちの方法は、複数日にわたる個人間の動的な対面接触を監視するシミュレーションと感染研究でテストされており、推論と予測において現在の方法よりも優れたパフォーマンスを発揮します。

Learning Local Dependence In Ordered Data
順序付けされたデータにおける局所依存性の学習

In many applications, data come with a natural ordering. This ordering can often induce local dependence among nearby variables. However, in complex data, the width of this dependence may vary, making simple assumptions such as a constant neighborhood size unrealistic. We propose a framework for learning this local dependence based on estimating the inverse of the Cholesky factor of the covariance matrix. Penalized maximum likelihood estimation of this matrix yields a simple regression interpretation for local dependence in which variables are predicted by their neighbors. Our proposed method involves solving a convex, penalized Gaussian likelihood problem with a hierarchical group lasso penalty. The problem decomposes into independent subproblems which can be solved efficiently in parallel using first-order methods. Our method yields a sparse, symmetric, positive definite estimator of the precision matrix, encoding a Gaussian graphical model. We derive theoretical results not found in existing methods attaining this structure. In particular, our conditions for signed support recovery and estimation consistency rates in multiple norms are as mild as those in a regression problem. Empirical results show our method performing favorably compared to existing methods. We apply our method to genomic data to flexibly model linkage disequilibrium. Our method is also applied to improve the performance of discriminant analysis in sound recording classification.

多くのアプリケーションでは、データには自然な順序が伴います。この順序は、多くの場合、近くの変数間の局所的な依存関係を誘発します。ただし、複雑なデータでは、この依存関係の幅は変化する可能性があり、一定の近傍サイズなどの単純な仮定は非現実的になります。共分散行列のコレスキー因子の逆行列を推定することに基づいて、この局所的な依存関係を学習するためのフレームワークを提案します。この行列のペナルティ付き最尤推定により、変数がその近傍によって予測されるという、局所依存性の単純な回帰解釈が得られます。提案された方法では、階層的なグループLassoペナルティを持つ、凸のペナルティ付きガウス尤度問題を解きます。この問題は、1次法を使用して効率的に並列に解くことができる独立した部分問題に分解されます。この方法では、ガウスグラフィカルモデルをエンコードする、精度行列のスパースで対称な正定値推定量が得られます。この構造を実現する既存の方法にはない理論的結果を導き出します。特に、符号付きサポート回復と複数のノルムにおける推定の一致性レートのための条件は、回帰問題の場合と同じくらい緩やかです。実験結果では、既存の方法と比較して、この方法が優れていることが示されています。この方法をゲノムデータに適用して、連鎖不平衡を柔軟にモデル化します。また、この方法は、録音分類における判別分析の性能を向上させるためにも適用されます。
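
The "regression interpretation" of the inverse Cholesky factor can be spelled out in a few lines. This is a sketch under assumed notation: $C$, $L$, $\Omega$, $\varepsilon$ and the row-specific bandwidth $K_r$ are symbols introduced here for illustration and may differ from the paper's.

```latex
\begin{align*}
\Sigma &= C C^{\top}, \qquad L = C^{-1}, \qquad \Omega = \Sigma^{-1} = L^{\top} L,\\
L x &= \varepsilon,\quad \varepsilon \sim \mathcal{N}(0, I)
\;\Longrightarrow\;
x_r = -\frac{1}{L_{rr}} \sum_{k < r} L_{rk}\, x_k + \frac{\varepsilon_r}{L_{rr}},\\
L_{rk} &= 0 \quad \text{for } k < r - K_r
\qquad (\text{local dependence with bandwidth } K_r).
\end{align*}
```

Each row of $L$ therefore encodes a regression of $x_r$ on the variables that precede it in the ordering. Zeroing the entries far from the diagonal restricts that regression to nearby predecessors, and the bandwidth $K_r$ is allowed to vary from row to row.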

COEVOLVE: A Joint Point Process Model for Information Diffusion and Network Evolution
COEVOLVE:情報拡散とネットワーク進化のためのジョイントポイントプロセスモデル

Information diffusion in online social networks is affected by the underlying network topology, but it also has the power to change it. Online users are constantly creating new links when exposed to new information sources, and in turn these links are alternating the way information spreads. However, these two highly intertwined stochastic processes, information diffusion and network evolution, have been predominantly studied separately, ignoring their co-evolutionary dynamics. We propose a temporal point process model, Coevolve, for such joint dynamics, allowing the intensity of one process to be modulated by that of the other. This model allows us to efficiently simulate interleaved diffusion and network events, and generate traces obeying common diffusion and network patterns observed in real-world networks such as Twitter. Furthermore, we also develop a convex optimization framework to learn the parameters of the model from historical diffusion and network evolution traces. We experimented with both synthetic data and data gathered from Twitter, and show that our model provides a good fit to the data as well as more accurate predictions than alternatives.

オンラインソーシャルネットワークにおける情報拡散は、基盤となるネットワークトポロジーの影響を受けますが、それを変える力も持っています。オンラインユーザーは、新しい情報源に触れると常に新しいリンクを作成し、その結果、これらのリンクによって情報の拡散方法が変わります。しかし、これら2つの高度に絡み合った確率過程、つまり情報拡散とネットワーク進化は、これまで主に別々に研究され、共進化のダイナミクスは無視されてきました。私たちは、このような共同ダイナミクスに対して、時間点過程モデルCoevolveを提案します。このモデルでは、一方のプロセスの強度をもう一方のプロセスの強度で調整できます。このモデルにより、インターリーブされた拡散とネットワークイベントを効率的にシミュレートし、Twitterなどの現実世界のネットワークで観察される一般的な拡散とネットワークパターンに従ったトレースを生成できます。さらに、私たちは、過去の拡散とネットワーク進化のトレースからモデルのパラメーターを学習するための凸最適化フレームワークも開発しました。合成データとTwitterから収集したデータの両方で実験を行い、私たちのモデルがデータによく適合し、他のモデルよりも正確な予測を提供することを示しました。

GPflow: A Gaussian Process Library using TensorFlow
GPflow:TensorFlowを使用したガウス過程ライブラリ

GPflow is a Gaussian process library that uses TensorFlow for its core computations and Python for its front end. The distinguishing features of GPflow are that it uses variational inference as the primary approximation method, provides concise code through the use of automatic differentiation, has been engineered with a particular emphasis on software testing and is able to exploit GPU hardware.

GPflowは、コア計算にTensorFlowを使用し、フロントエンドにPythonを使用するガウス過程ライブラリです。GPflowの際立った特徴は、主要な近似方法として変分推論を使用し、自動微分を使用して簡潔なコードを提供し、ソフトウェアテストに特に重点を置いて設計されており、GPUハードウェアを活用できることです。

GFA: Exploratory Analysis of Multiple Data Sources with Group Factor Analysis
GFA:グループ因子分析による複数のデータソースの探索的分析

The R package GFA provides a full pipeline for factor analysis of multiple data sources that are represented as matrices with co-occurring samples. It allows learning dependencies between subsets of the data sources, decomposed into latent factors. The package also implements sparse priors for the factorization, providing interpretable biclusters of the multi-source data.

RパッケージGFAは、サンプルが共通する行列として表される複数のデータソースの因子分析のための完全なパイプラインを提供します。これにより、潜在因子に分解された、データソースのサブセット間の依存関係を学習できます。また、このパッケージは因子分解のためのスパース事前分布も実装し、マルチソースデータの解釈可能なバイクラスターを提供します。

Bridging Supervised Learning and Test-Based Co-optimization
教師あり学習とテストベースの協調最適化の橋渡し

This paper takes a close look at the important commonalities and subtle differences between the well-established field of supervised learning and the much younger one of co-optimization. It explains the relationships between the problems, algorithms and views on cost and performance of the two fields, all throughout providing a two-way dictionary for the respective terminologies used to describe these concepts. The intent is to facilitate advancement of both fields through transfer and cross-pollination of ideas, techniques and results. As a proof of concept, a theoretical study is presented on the connection between existence / lack of free lunch in the two fields, showcasing a few ideas for improving computational complexity of certain supervised learning approaches.

この論文では、確立された分野である教師あり学習と、はるかに新しい分野である協調最適化との間の重要な共通点と微妙な違いを詳しく見ていきます。2つの分野における問題、アルゴリズム、およびコストと性能に関する見方の関係を説明し、全体を通じて、これらの概念を説明するために使用されるそれぞれの用語の双方向辞書を提供します。その意図は、アイデア、技術、結果の移転と相互交流を通じて、両方の分野の進歩を促進することです。概念実証として、2つの分野におけるフリーランチの存在/欠如の関係に関する理論的研究が提示され、特定の教師あり学習アプローチの計算複雑性を改善するためのいくつかのアイデアが示されています。

Nearly optimal classification for semimetrics
セミメトリックのほぼ最適な分類

We initiate the rigorous study of classification in semimetric spaces, which are point sets with a distance function that is non-negative and symmetric, but need not satisfy the triangle inequality. We define the density dimension dens and discover that it plays a central role in the statistical and algorithmic feasibility of learning in semimetric spaces. We compute this quantity for several widely used semimetrics and present nearly optimal sample compression algorithms, which are then used to obtain generalization guarantees, including fast rates. Our claim of near-optimality holds in both computational and statistical senses. When the sample has radius $R$ and margin $\gamma$, we show that it can be compressed down to roughly $d=(R/\gamma)^{\text{dens}}$ points, and further that finding a significantly better compression is algorithmically intractable unless P=NP. This compression implies generalization via standard Occam-type arguments, to which we provide a nearly matching lower bound.

私たちは、半距離空間における分類の厳密な研究を開始します。半距離空間とは、非負で対称な距離関数を持つ点集合ですが、三角不等式を満たす必要はありません。私たちは密度次元densを定義し、それが半距離空間における学習の統計的およびアルゴリズム的実現可能性において中心的な役割を果たすことを発見しました。この量を、広く使用されているいくつかの半距離について計算し、ほぼ最適なサンプル圧縮アルゴリズムを提示します。これは高速レートを含む汎化保証を得るために使用されます。私たちのほぼ最適という主張は、計算と統計の両方の意味で成り立ちます。サンプルが半径$R$とマージン$\gamma$を持つ場合、およそ$d=(R/\gamma)^{\text{dens}}$点に圧縮できることを示し、さらにP=NPでない限り、大幅に優れた圧縮を見つけることはアルゴリズム的に困難であることを示します。この圧縮は標準的なオッカム型の議論による汎化を意味し、私たちはこれにほぼ一致する下界を提供します。
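
A small example may help fix the definition of a semimetric used above: non-negative and symmetric, but free to violate the triangle inequality. The sketch below uses squared Euclidean distance, a familiar function with exactly that behaviour. The helper names and the three sample points are hypothetical, and nothing here reproduces the paper's compression algorithm.

```python
from itertools import permutations


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def is_semimetric(points, dist):
    """Check non-negativity and symmetry of dist on every pair of points."""
    return all(dist(p, q) >= 0 and dist(p, q) == dist(q, p)
               for p in points for q in points)


def triangle_violations(points, dist):
    """Return ordered triples (p, q, r) with dist(p, r) > dist(p, q) + dist(q, r)."""
    return [(p, q, r) for p, q, r in permutations(points, 3)
            if dist(p, r) > dist(p, q) + dist(q, r)]


points = [(0.0,), (1.0,), (2.0,)]
print(is_semimetric(points, squared_distance))
print(triangle_violations(points, squared_distance))
```

On these points the end-to-end distance is 4 while the two hops sum to 2, so the triangle inequality fails even though the function is still a valid semimetric.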

Simplifying Probabilistic Expressions in Causal Inference
因果推論における確率的表現の単純化

Obtaining a non-parametric expression for an interventional distribution is one of the most fundamental tasks in causal inference. Such an expression can be obtained for an identifiable causal effect by an algorithm or by manual application of do-calculus. Often we are left with a complicated expression which can lead to biased or inefficient estimates when missing data or measurement errors are involved. We present an automatic simplification algorithm that seeks to eliminate symbolically unnecessary variables from these expressions by taking advantage of the structure of the underlying graphical model. Our method is applicable to all causal effect formulas and is readily available in the R package causaleffect.

介入分布のノンパラメトリック表現を取得することは、因果推論における最も基本的なタスクの1つです。このような式は、識別可能な因果効果について、アルゴリズムによって、またはdo-calculusを手作業で適用することによって取得できます。多くの場合、複雑な式が残され、欠測データや測定誤差が関係している場合、推定値が偏ったり非効率的になったりする可能性があります。ここでは、基礎となるグラフィカルモデルの構造を利用して、これらの式から記号的に不要な変数を排除しようとする自動単純化アルゴリズムを紹介します。私たちの方法は、すべての因果効果の式に適用でき、Rパッケージcausaleffectですぐに利用できます。
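
As a toy illustration of what "eliminating symbolically unnecessary variables" can mean, consider an expression in which a variable $W$ appears only through a factor that can be summed out. The variables and the expression are hypothetical, chosen for illustration only. They are not output of the causaleffect package, and the paper's algorithm additionally exploits the graph structure.

```latex
\begin{align*}
\sum_{z,\,w} P(y \mid x, z)\, P(z \mid w)\, P(w)
  &= \sum_{z} P(y \mid x, z) \sum_{w} P(z \mid w)\, P(w) \\
  &= \sum_{z} P(y \mid x, z)\, P(z).
\end{align*}
```

The second step is the law of total probability. The simplified formula no longer requires $W$ to be measured, which matters when $W$ is subject to missing data or measurement error.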

A Spectral Algorithm for Inference in Hidden semi-Markov Models
隠れセミマルコフモデルにおける推論のためのスペクトルアルゴリズム

Hidden semi-Markov models (HSMMs) are latent variable models which allow latent state persistence and can be viewed as a generalization of the popular hidden Markov models (HMMs). In this paper, we introduce a novel spectral algorithm to perform inference in HSMMs. Unlike expectation maximization (EM), our approach correctly estimates the probability of given observation sequence based on a set of training sequences. Our approach is based on estimating moments from the sample, whose number of dimensions depends only logarithmically on the maximum length of the hidden state persistence. Moreover, the algorithm requires only a few matrix inversions and is therefore computationally efficient. Empirical evaluations on synthetic and real data demonstrate the advantage of the algorithm over EM in terms of speed and accuracy, especially for large data sets.

隠れセミマルコフモデル(HSMM)は、潜在状態の持続を許容する潜在変数モデルであり、広く使われている隠れマルコフモデル(HMM)の一般化と見なすことができます。この論文では、HSMMで推論を実行するための新しいスペクトルアルゴリズムを紹介します。期待値最大化(EM)とは異なり、このアプローチでは、一連の学習シーケンスに基づいて、与えられた観測シーケンスの確率を正しく推定します。私たちのアプローチは、サンプルからのモーメントの推定に基づいており、その次元数は隠れ状態の持続の最大長に対数的にのみ依存します。さらに、このアルゴリズムは少数の逆行列計算しか必要としないため、計算効率が高くなります。合成データと実データの経験的評価では、特に大規模なデータセットの場合、速度と精度の点で、このアルゴリズムがEMよりも優れていることが示されています。

Asymptotic Analysis of Objectives Based on Fisher Information in Active Learning
アクティブラーニングにおけるフィッシャー情報に基づく目的の漸近解析

Obtaining labels can be costly and time-consuming. Active learning allows a learning algorithm to intelligently query samples to be labeled for a more efficient learning. Fisher information ratio (FIR) has been used as an objective for selecting queries. However, little is known about the theory behind the use of FIR for active learning. There is a gap between the underlying theory and the motivation of its usage in practice. In this paper, we attempt to fill this gap and provide a rigorous framework for analyzing existing FIR-based active learning methods. In particular, we show that FIR can be asymptotically viewed as an upper bound of the expected variance of the log-likelihood ratio. Additionally, our analysis suggests a unifying framework that not only enables us to make theoretical comparisons among the existing querying methods based on FIR, but also allows us to give insight into the development of new active learning approaches based on this objective.

ラベルの取得には、コストと時間がかかる場合があります。アクティブラーニングにより、学習アルゴリズムは、より効率的な学習のためにラベル付けされるサンプルをインテリジェントにクエリできます。フィッシャー情報比(FIR)は、クエリを選択するための目標として使用されています。しかし、アクティブラーニングにFIRを使用する理論については、ほとんど知られていません。基礎となる理論と、実際にその使用の動機との間にはギャップがあります。この論文では、このギャップを埋め、既存のFIRベースのアクティブラーニング手法を分析するための厳密なフレームワークを提供しようとします。特に、FIRは、対数尤度比の期待分散の上限として漸近的に見ることができることを示します。さらに、私たちの分析は、FIRに基づく既存のクエリ方法間の理論的比較を可能にするだけでなく、この目的に基づく新しいアクティブラーニングアプローチの開発に洞察を与えることを可能にする統一的なフレームワークを示唆しています。

Online Bayesian Passive-Aggressive Learning
オンラインベイジアンパッシブアグレッシブ学習

We present online Bayesian Passive-Aggressive (BayesPA) learning, a generic online learning framework for hierarchical Bayesian models with max-margin posterior regularization. We show that BayesPA subsumes the standard online Passive-Aggressive (PA) learning and extends naturally to incorporate latent variables for both parametric and nonparametric Bayesian inference, therefore providing great flexibility for explorative analysis. As an important example, we apply BayesPA to topic modeling and derive efficient online learning algorithms for max-margin topic models. We further develop nonparametric BayesPA topic models to infer the unknown number of topics in an online manner. Experimental results on 20newsgroups and a large Wikipedia multi-label dataset (with 1.1 million training documents and 0.9 million unique terms in the vocabulary) show that our approaches significantly improve time efficiency while achieving comparable accuracy with the corresponding batch algorithms.

私たちは、最大マージンの事後正則化を備えた階層ベイジアンモデルの一般的なオンライン学習フレームワークであるオンラインベイジアンパッシブアグレッシブ(BayesPA)学習を紹介します。BayesPAは、標準的なオンラインパッシブアグレッシブ(PA)学習を包含し、パラメトリックベイズ推論とノンパラメトリックベイズ推論の両方に潜在変数を組み込むように自然に拡張するため、探索的分析に大きな柔軟性を提供することを示しています。重要な例として、BayesPAをトピックモデリングに適用し、最大マージントピックモデルの効率的なオンライン学習アルゴリズムを導き出します。さらに、ノンパラメトリックなBayesPAトピックモデルを開発し、未知のトピック数をオンラインで推論します。20newsgroupsと大規模なウィキペディアのマルチラベルデータセット(110万のトレーニングドキュメントと語彙に90万のユニークな用語を含む)での実験結果は、私たちのアプローチが時間効率を大幅に向上させ、対応するバッチアルゴリズムと同等の精度を達成することを示しています。

Nonparametric Risk Bounds for Time-Series Forecasting
時系列予測のノンパラメトリックなリスク範囲

We derive generalization error bounds for traditional time-series forecasting models. Our results hold for many standard forecasting tools including autoregressive models, moving average models, and, more generally, linear state-space models. These non-asymptotic bounds need only weak assumptions on the data-generating process, yet allow forecasters to select among competing models and to guarantee, with high probability, that their chosen model will perform well. We motivate our techniques with and apply them to standard economic and financial forecasting tools: a GARCH model for predicting equity volatility and a dynamic stochastic general equilibrium model (DSGE), the standard tool in macroeconomic forecasting. We demonstrate in particular how our techniques can aid forecasters and policy makers in choosing models which behave well under uncertainty and mis-specification.

私たちは、従来の時系列予測モデルに対する一般化誤差の範囲を導き出します。私たちの結果は、自己回帰モデル、移動平均モデル、より一般的には線形状態空間モデルなど、多くの標準的な予測ツールに当てはまります。これらの非漸近的な境界は、データ生成プロセスに関する弱い仮定のみを必要としますが、予測者は競合するモデルの中から選択し、選択したモデルが適切に機能することを高い確率で保証できます。私たちは、株式のボラティリティを予測するためのGARCHモデルや、マクロ経済予測の標準的なツールである動的確率的一般均衡モデル(DSGE)といった標準的な経済・金融予測ツールを動機付けの例として取り上げ、それらに私たちの手法を適用します。特に、私たちの技術が、不確実性や誤った仕様の下で適切に動作するモデルを選択する際に、予測者や政策立案者にどのように役立つかを示します。

Preference-based Teaching
選好に基づく教授

We introduce a new model of teaching named preference-based teaching and a corresponding complexity parameter, the preference-based teaching dimension (PBTD), representing the worst-case number of examples needed to teach any concept in a given concept class. Although the PBTD coincides with the well-known recursive teaching dimension (RTD) on finite classes, it is radically different on infinite ones: the RTD becomes infinite already for trivial infinite classes (such as half-intervals) whereas the PBTD evaluates to reasonably small values for a wide collection of infinite classes including classes consisting of so-called closed sets w.r.t. a given closure operator, including various classes related to linear sets over $\mathbb{N}_0$ (whose RTD had been studied quite recently) and including the class of Euclidean half-spaces. On top of presenting these concrete results, we provide the reader with a theoretical framework (of a combinatorial flavor) which helps to derive bounds on the PBTD.

私たちは、選好に基づく教授という新しい教授モデルと、それに対応する複雑性パラメータである選好に基づく教授次元(PBTD)を導入します。これは、与えられた概念クラスの任意の概念を教えるのに必要な最悪の場合の例の数を表します。PBTDは、有限クラスではよく知られている再帰教授次元(RTD)と一致しますが、無限クラスでは根本的に異なります。RTDは、自明な無限クラス(半区間など)に対して既に無限になるのに対し、PBTDは、与えられた閉包演算子に関していわゆる閉集合からなるクラス、$\mathbb{N}_0$上の線形集合に関連するさまざまなクラス(そのRTDはごく最近研究された)、ユークリッド半空間のクラスを含む、幅広い無限クラスのコレクションに対して、かなり小さな値に評価されます。これらの具体的な結果を示すことに加えて、私たちは読者に、PBTDの境界を導くのに役立つ(組み合わせ論的な)理論的枠組みを提供します。

Group Sparse Optimization via $\ell_{p,q}$ Regularization
$\ell_{p,q}$正則化によるグループスパース最適化

In this paper, we investigate a group sparse optimization problem via $\ell_{p,q}$ regularization in three aspects: theory, algorithm and application. In the theoretical aspect, by introducing a notion of group restricted eigenvalue condition, we establish an oracle property and a global recovery bound of order $\mathcal{O}(\lambda^\frac{2}{2-q})$ for any point in a level set of the $\ell_{p,q}$ regularization problem, and by virtue of modern variational analysis techniques, we also provide a local analysis of recovery bound of order $\mathcal{O}(\lambda^2)$ for a path of local minima. In the algorithmic aspect, we apply the well-known proximal gradient method to solve the $\ell_{p,q}$ regularization problems, either by analytically solving some specific $\ell_{p,q}$ regularization subproblems, or by using the Newton method to solve general $\ell_{p,q}$ regularization subproblems. In particular, we establish a local linear convergence rate of the proximal gradient method for solving the $\ell_{1,q}$ regularization problem under some mild conditions and by first proving a second-order growth condition. As a consequence, the local linear convergence rate of proximal gradient method for solving the usual $\ell_{q}$ regularization problem ($0<q<1$) is obtained. Finally in the aspect of application, we present some numerical results on both the simulated data and the real data in gene transcriptional regulation.

この論文では、$\ell_{p,q}$正則化によるグループスパース最適化問題を理論、アルゴリズム、応用の3つの側面から調査します。理論面では、グループ制限固有値条件の概念を導入することで、$\ell_{p,q}$正則化問題のレベルセット内の任意の点に対して、オラクルプロパティと$\mathcal{O}(\lambda^\frac{2}{2-q})$の順序のグローバル回復境界を確立し、最新の変分解析手法により、局所最小値のパスに対する$\mathcal{O}(\lambda^2)$の順序の回復境界のローカル解析も提供します。アルゴリズムの面では、よく知られている近位勾配法を、いくつかの特定の$\ell_{p,q}$正則化サブ問題を解析的に解くか、ニュートン法を使用して一般的な$\ell_{p,q}$正則化サブ問題を解くことによって、$\ell_{p,q}$正則化問題を解決するために適用します。特に、いくつかの穏やかな条件下で、最初に2次成長条件を証明することによって、$\ell_{1,q}$正則化問題を解決するために近位勾配法の局所線形収束率を確立します。その結果、通常の$\ell_{q}$正則化問題($0<q<1$)を解決するための近位勾配法の局所線形収束率が得られます。最後に、応用の面では、遺伝子転写制御におけるシミュレーションデータと実際のデータの両方に関する数値結果を示します。
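
The proximal gradient iteration mentioned in the abstract above can be illustrated on the simplest convex member of the family. The following is only a minimal sketch, assuming $p = 2$ and $q = 1$ (the group lasso penalty), for which the proximal subproblem has the well-known closed form of block soft-thresholding; it is not the paper's treatment of the nonconvex case $0 < q < 1$. The function names `group_soft_threshold`, `prox_group_l21`, and `proximal_gradient_step` are hypothetical and chosen for illustration.

```python
import math


def group_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2 for a single group (block soft-thresholding)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= t:
        # The whole group is shrunk to zero, which is what produces group sparsity.
        return [0.0] * len(v)
    scale = 1.0 - t / norm
    return [scale * x for x in v]


def prox_group_l21(x, groups, t):
    """Apply block soft-thresholding to every index group of x (assumed p=2, q=1)."""
    out = list(x)
    for idx in groups:
        shrunk = group_soft_threshold([x[i] for i in idx], t)
        for i, s in zip(idx, shrunk):
            out[i] = s
    return out


def proximal_gradient_step(x, grad, step, lam, groups):
    """One iteration: a gradient step on the smooth loss, then the group prox."""
    moved = [xi - step * gi for xi, gi in zip(x, grad)]
    return prox_group_l21(moved, groups, step * lam)
```

For example, `prox_group_l21([3.0, 4.0, 0.1, 0.1], [[0, 1], [2, 3]], 1.0)` shrinks the first group by the factor 0.8 and sets the second group to zero, because its Euclidean norm is below the threshold.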

Certifiably Optimal Low Rank Factor Analysis
認定可能な最適低ランク因子分析

Factor Analysis (FA) is a technique of fundamental importance that is widely used in classical and modern multivariate statistics, psychometrics, and econometrics. In this paper, we revisit the classical rank-constrained FA problem which seeks to approximate an observed covariance matrix ($\boldsymbol{\Sigma}$) by the sum of a Positive Semidefinite (PSD) low-rank component ($\boldsymbol{\Theta}$) and a diagonal matrix ($\boldsymbol{\Phi}$) (with nonnegative entries) subject to $\boldsymbol{\Sigma} - \boldsymbol{\Phi}$ being PSD. We propose a flexible family of rank-constrained, nonlinear Semidefinite Optimization based formulations for this task. We introduce a reformulation of the problem as a smooth optimization problem with convex, compact constraints and propose a unified algorithmic framework, utilizing state of the art techniques in nonlinear optimization to obtain high-quality feasible solutions for our proposed formulation. At the same time, by using a variety of techniques from discrete and global optimization, we show that these solutions are certifiably optimal in many cases, even for problems with thousands of variables. Our techniques are general and make no assumption on the underlying problem data. The estimator proposed herein aids statistical interpretability and provides computational scalability and significantly improved accuracy when compared to current, publicly available popular methods for rank-constrained FA. We demonstrate the effectiveness of our proposal on an array of synthetic and real-life datasets. To our knowledge, this is the first paper that demonstrates how a previously intractable rank-constrained optimization problem can be solved to provable optimality by coupling developments in convex analysis and in global and discrete optimization.

因子分析(FA)は、古典的および現代的な多変量統計、心理測定学、計量経済学で広く使用されている、基礎的に重要な手法です。この論文では、観測された共分散行列($\boldsymbol{\Sigma}$)を、$\boldsymbol{\Sigma} - \boldsymbol{\Phi}$がPSDである条件で、半正定値(PSD)の低ランク成分($\boldsymbol{\Theta}$)と対角行列($\boldsymbol{\Phi}$) (非負のエントリを含む)の合計で近似しようとする、古典的なランク制約FA問題を再検討します。このタスクに対して、ランク制約付きの非線形半正定値最適化に基づく柔軟な定式化のファミリを提案します。凸型でコンパクトな制約を持つ滑らかな最適化問題として問題を再定式化し、非線形最適化の最先端の技術を使用して、提案された定式化の高品質の実行可能ソリューションを取得する、統合されたアルゴリズムフレームワークを提案します。同時に、離散最適化とグローバル最適化のさまざまな手法を使用することで、これらのソリューションは、数千の変数を持つ問題であっても、多くの場合、証明可能に最適であることを示しています。私たちの手法は一般的なものであり、根本的な問題データについて仮定していません。ここで提案する推定量は、統計的な解釈可能性を高め、ランク制約FAの現在公開されている一般的な方法と比較して、計算のスケーラビリティと大幅に改善された精度を提供します。私たちは、さまざまな合成データセットと実際のデータセットで私たちの提案の有効性を実証しています。私たちの知る限り、これは、凸解析とグローバルおよび離散最適化の発展を組み合わせることで、以前は解決不可能だったランク制約最適化問題を証明可能な最適性まで解決できる方法を示した最初の論文です。

Particle Gibbs Split-Merge Sampling for Bayesian Inference in Mixture Models
混合モデルにおけるベイズ推論のための粒子ギブス分割マージサンプリング

This paper presents an original Markov chain Monte Carlo method to sample from the posterior distribution of conjugate mixture models. This algorithm relies on a flexible split-merge procedure built using the particle Gibbs sampler introduced in Andrieu et al. (2009, 2010). The resulting so-called Particle Gibbs Split-Merge sampler does not require the computation of a complex acceptance ratio and can be implemented using existing sequential Monte Carlo libraries. We investigate its performance experimentally on synthetic problems as well as on geolocation data. Our results show that for a given computational budget, the Particle Gibbs Split-Merge sampler empirically outperforms existing split-merge methods. The code and instructions for reproducing the experiments are available at github.com/aroth85/pgsm.

この論文では、共役混合モデルの事後分布からサンプリングするための独自のマルコフ連鎖モンテカルロ法を紹介します。このアルゴリズムは、Andrieuら(2009, 2010)で紹介された粒子Gibbsサンプラーを使用して構築された柔軟な分割マージ手順に依存しています。その結果得られる、いわゆるParticle Gibbs Split-Mergeサンプラーは、複雑な採択率の計算を必要とせず、既存の逐次モンテカルロライブラリを使用して実装できます。その性能を合成問題とジオロケーションデータで実験的に調査します。私たちの結果は、特定の計算バジェットに対して、Particle Gibbs Split-Mergeサンプラーが既存の分割マージ法を経験的に上回ることを示しています。実験を再現するためのコードと手順は、github.com/aroth85/pgsmで入手できます。

Generalized Polya Urn for Time-Varying Pitman-Yor Processes
時間変動ピットマン・ヨール過程のための一般化ポリアの壺

This article introduces a class of first-order stationary time-varying Pitman-Yor processes. Subsuming our construction of time-varying Dirichlet processes presented in (Caron et al., 2007), these models can be used for time-dynamic density estimation and clustering. Our intuitive and simple construction relies on a generalized Pólya urn scheme. Significantly, this construction yields marginal distributions at each time point that can be explicitly characterized and easily controlled. Inference is performed using Markov chain Monte Carlo and sequential Monte Carlo methods. We demonstrate our models and algorithms on epidemiological and video tracking data.

この記事では、1次定常時間変動Pitman-Yorプロセスのクラスを紹介します。(Caronら, 2007)で提示された時間変動ディリクレ過程の構築を包含するこれらのモデルは、時間動的密度推定とクラスタリングに使用できます。直感的でシンプルな構造は、一般化されたPólyaの壺スキームに依存しています。重要なことに、この構造は、明示的に特徴付けて簡単に制御できる周辺分布を各時点でもたらします。推論は、マルコフ連鎖モンテカルロ法と逐次モンテカルロ法を使用して実行されます。疫学データとビデオ追跡データに関するモデルとアルゴリズムを実演します。
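
As background for the urn scheme named in the abstract above, here is a minimal sketch of the standard (static) two-parameter Pólya urn that generates partitions from a Pitman-Yor process. It is not the paper's generalized, time-varying construction. The function name `pitman_yor_urn` and its parameters are hypothetical; the sketch assumes a discount in $[0, 1)$ and a concentration greater than the negative discount.

```python
import random


def pitman_yor_urn(n, discount, concentration, seed=0):
    """Sample cluster sizes for n sequential draws from the two-parameter urn scheme."""
    if not 0.0 <= discount < 1.0 or concentration <= -discount:
        raise ValueError("need 0 <= discount < 1 and concentration > -discount")
    rng = random.Random(seed)
    counts = []  # counts[j] is the current size of cluster j
    for i in range(n):
        k = len(counts)
        if k == 0:
            counts.append(1)  # the first draw always opens a new cluster
            continue
        # After i draws, a new cluster opens with probability
        # (concentration + discount * k) / (i + concentration).
        p_new = (concentration + discount * k) / (i + concentration)
        if rng.random() < p_new:
            counts.append(1)
        else:
            # An existing cluster j is chosen proportionally to counts[j] - discount.
            weights = [c - discount for c in counts]
            j = rng.choices(range(k), weights=weights)[0]
            counts[j] += 1
    return counts
```

Setting `discount=0.0` reduces the scheme to the Blackwell-MacQueen urn of the Dirichlet process, which is the static counterpart of the special case the abstract says is subsumed.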

POMDPs.jl: A Framework for Sequential Decision Making under Uncertainty
POMDPs.jl: 不確実性下での逐次的意思決定のためのフレームワーク

POMDPs.jl is an open-source framework for solving Markov decision processes (MDPs) and partially observable MDPs (POMDPs). POMDPs.jl allows users to specify sequential decision making problems with minimal effort without sacrificing the expressive nature of POMDPs, making this framework viable for both educational and research purposes. It is written in the Julia language to allow flexible prototyping and large-scale computation that leverages the high-performance nature of the language. The associated JuliaPOMDP community also provides a number of state-of-the-art MDP and POMDP solvers and a rich library of support tools to help with implementing new solvers and evaluating the solution results. The most recent version of POMDPs.jl, the related packages, and documentation can be found at github.com/JuliaPOMDP/POMDPs.jl.

POMDPs.jlは、マルコフ決定過程(MDP)と部分観測可能なMDP (POMDP)を解くためのオープンソースのフレームワークです。POMDPs.jlは、POMDPの表現力豊かな性質を犠牲にすることなく、最小限の労力で逐次的な意思決定問題を指定することを可能にし、このフレームワークを教育目的と研究目的の両方で実行可能にします。Julia言語で記述されているため、柔軟なプロトタイピングと、言語の高性能な性質を活用した大規模な計算が可能になります。関連するJuliaPOMDPコミュニティは、新しいソルバーの実装と解答結果の評価に役立つ、最先端のMDPおよびPOMDPソルバーと豊富なサポートツールのライブラリも提供しています。POMDPs.jlの最新バージョン、関連パッケージ、およびドキュメンテーションは、github.com/JuliaPOMDP/POMDPs.jlにあります。

Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA
Auto-WEKA 2.0: WEKAでの自動モデル選択とハイパーパラメータ最適化

WEKA is a widely used, open-source machine learning platform. Due to its intuitive interface, it is particularly popular with novice users. However, such users often find it hard to identify the best approach for their particular dataset among the many available. We describe the new version of Auto-WEKA, a system designed to help such users by automatically searching through the joint space of WEKA’s learning algorithms and their respective hyperparameter settings to maximize performance, using a state-of-the-art Bayesian optimization method. Our new package is tightly integrated with WEKA, making it just as accessible to end users as any other learning algorithm.

WEKAは、広く使用されているオープンソースの機械学習プラットフォームです。直感的なインターフェースにより、初心者ユーザーに特に人気があります。しかし、そのようなユーザーは、利用可能な多くの手法の中から、自分の特定のデータセットに最適なアプローチを特定するのが難しいと感じることがよくあります。この論文では、最先端のベイズ最適化手法を用いて、WEKAの学習アルゴリズムとそれぞれのハイパーパラメータ設定の結合空間を自動的に探索し、パフォーマンスを最大化することで、このようなユーザーを支援するように設計されたAuto-WEKAの新バージョンについて説明します。私たちの新しいパッケージはWEKAと緊密に統合されているため、他の学習アルゴリズムと同様にエンドユーザーが利用できます。

Identifying a Minimal Class of Models for High-dimensional Data
高次元データの最小モデル・クラスの識別

Model selection consistency in the high-dimensional regression setting can be achieved only if strong assumptions are fulfilled. We therefore suggest pursuing a different goal, which we call a minimal class of models. The minimal class of models includes models that are similar in their prediction accuracy but not necessarily in their elements. We suggest a random search algorithm to reveal candidate models. The algorithm implements simulated annealing while using a score for each predictor that we suggest deriving using a combination of the lasso and the elastic net. The utility of using a minimal class of models is demonstrated in the analysis of two data sets.

高次元回帰設定でのモデル選択の一貫性は、強い仮定が満たされている場合にのみ達成できます。したがって、私たちは、モデルの最小クラスと呼ばれる別の目標を追求することを提案します。モデルの最小クラスには、予測精度は類似しているが、必ずしも要素が類似しているわけではないモデルが含まれます。候補モデルを明らかにするためのランダム探索アルゴリズムを提案します。このアルゴリズムは、lassoとelastic netの組み合わせを使用して導出することを提案する各予測子のスコアを使用しながら、シミュレーテッドアニーリングを実装します。最小クラスのモデルを使用することの有用性は、2つのデータセットの分析で実証されています。

JSAT: Java Statistical Analysis Tool, a Library for Machine Learning
JSAT: Java 統計解析ツール、機械学習用ライブラリ

Java Statistical Analysis Tool (JSAT) is a Machine Learning library written in pure Java. It works to fill a void in the Java ecosystem for a general purpose library that is relatively high performance and flexible, which is not adequately fulfilled by Weka (Hall et al., 2009) and Java-ML (Abeel et al., 2009). Almost all of the algorithms are independently implemented using an Object-Oriented framework. JSAT is made available under the GNU GPL license here: github.com/EdwardRaff/JSAT.

Java Statistical Analysis Tool (JSAT)は、純粋なJavaで記述された機械学習ライブラリです。比較的高性能で柔軟性のある汎用ライブラリというJavaエコシステムの空白を埋めることを目指しており、この空白はWeka(Hallら, 2009)やJava-ML(Abeelら, 2009)では十分に満たされていません。ほとんどすべてのアルゴリズムは、オブジェクト指向フレームワークを使用して独立して実装されています。JSATは、GNU GPLライセンスの下で、github.com/EdwardRaff/JSATで入手できます。

Analyzing Tensor Power Method Dynamics in Overcomplete Regime
オーバーコンプリート領域におけるテンソルべき乗法ダイナミクスの解析

We present a novel analysis of the dynamics of tensor power iterations in the overcomplete regime where the tensor CP rank is larger than the input dimension. Finding the CP decomposition of an overcomplete tensor is NP-hard in general. We consider the case where the tensor components are randomly drawn, and show that the simple power iteration recovers the components with bounded error under mild initialization conditions. We apply our analysis to unsupervised learning of latent variable models, such as multi-view mixture models and spherical Gaussian mixtures. Given the third order moment tensor, we learn the parameters using tensor power iterations. We prove it can correctly learn the model parameters when the number of hidden components $k$ is much larger than the data dimension $d$, up to $k = o(d^{1.5})$. We initialize the power iterations with data samples and prove its success under mild conditions on the signal-to-noise ratio of the samples. Our analysis significantly expands the class of latent variable models where spectral methods are applicable. Our analysis also deals with noise in the input tensor, leading to a sample complexity result in the application to learning latent variable models.

私たちは、テンソルのCPランクが入力次元より大きい過剰完備領域におけるテンソルべき乗反復のダイナミクスに関する新しい分析を紹介します。過剰完備テンソルのCP分解を見つけることは一般にNP困難です。テンソル成分がランダムに抽出される場合を考慮し、単純なべき乗反復により、穏やかな初期化条件下では、誤差が制限された状態で成分が回復されることを示します。この分析を、マルチビュー混合モデルや球面ガウス混合などの潜在変数モデルの教師なし学習に適用します。3次モーメントテンソルが与えられた場合、テンソルべき乗反復を使用してパラメーターを学習します。隠れ成分の数$k$がデータ次元$d$よりもはるかに大きい場合(最大$k = o(d^{1.5})$)、モデルパラメーターを正しく学習できることを証明します。べき乗反復をデータサンプルで初期化し、サンプルの信号対雑音比に関する穏やかな条件下での成功を証明します。私たちの分析は、スペクトル法が適用可能な潜在変数モデルのクラスを大幅に拡張します。また、潜在変数モデルの学習への応用においてサンプルの複雑性につながる入力テンソルのノイズも処理します。
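
The tensor power update analyzed in the abstract above is $u \leftarrow T(I, u, u) / \lVert T(I, u, u) \rVert$ for a third-order tensor $T = \sum_j w_j \, a_j \otimes a_j \otimes a_j$. The following is a minimal sketch of that update, assuming the components and weights are known so that $T$ never has to be formed explicitly. The example uses an orthogonal, undercomplete tensor where convergence is easy to see; it does not reproduce the overcomplete setting or the guarantees of the paper. All function names are hypothetical.

```python
import math


def apply_tensor(components, weights, u):
    """Compute T(I, u, u) for T = sum_j w_j * a_j (x) a_j (x) a_j without forming T."""
    out = [0.0] * len(u)
    for w, a in zip(weights, components):
        inner = sum(ai * ui for ai, ui in zip(a, u))
        coeff = w * inner * inner
        for i, ai in enumerate(a):
            out[i] += coeff * ai
    return out


def normalize(v):
    """Rescale v to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def tensor_power_iteration(components, weights, u0, n_iter=50):
    """Repeat the update u <- T(I, u, u) / ||T(I, u, u)|| starting from u0."""
    u = normalize(u0)
    for _ in range(n_iter):
        u = normalize(apply_tensor(components, weights, u))
    return u


# Orthogonal toy tensor: the components are the standard basis of R^3.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
recovered = tensor_power_iteration(basis, [1.0, 1.0, 1.0], [0.9, 0.3, 0.3])
```

In this toy case each coordinate is squared at every step, so the iterate converges to the component closest to the initial point, here the first basis vector.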

On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions
カーネル求積則とランダム特徴展開の等価性について

We show that kernel-based quadrature rules for computing integrals can be seen as a special case of random feature expansions for positive definite kernels, for a particular decomposition that always exists for such kernels. We provide a theoretical analysis of the number of required samples for a given approximation error, leading to both upper and lower bounds that are based solely on the eigenvalues of the associated integral operator and match up to logarithmic terms. In particular, we show that the upper bound may be obtained from independent and identically distributed samples from a specific non-uniform distribution, while the lower bound is valid for any set of points. While our results are fairly general, applying them to kernel-based quadrature recovers known upper and lower bounds for the special cases of Sobolev spaces. Moreover, our results extend to the more general problem of full function approximations (beyond simply computing an integral), with results in $L_2$- and $L_\infty$-norm that match known results for special cases. Applying our results to random features, we show an improvement of the number of random features needed to preserve the generalization guarantees for learning with Lipschitz-continuous losses.

私たちは、積分を計算するためのカーネルベースの求積法規則は、正定値カーネルのランダム特徴展開の特殊なケースとして見ることができることを示します。これは、そのようなカーネルに常に存在する特定の分解についてです。与えられた近似誤差に必要なサンプル数の理論的分析を提供し、関連する積分演算子の固有値のみに基づいて対数項に一致する上限と下限の両方につながります。特に、上限は特定の非一様分布からの独立した同一分布のサンプルから取得でき、下限は任意の点の集合に対して有効であることを示します。結果をカーネルベースの求積法に適用すると、結果はかなり一般的ですが、ソボレフ空間の特殊なケースの既知の上限と下限を回復します。さらに、結果はより一般的な完全関数近似の問題（単に積分を計算することを超えて）にまで拡張され、$L_2$および$L_\infty$ノルムの結果は特殊なケースの既知の結果と一致します。私たちの結果をランダム特徴に適用すると、リプシッツ連続損失による学習の一般化保証を維持するために必要なランダム特徴の数が改善されることがわかります。
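
To make the notion of a random feature expansion concrete, here is a minimal sketch of the classical random Fourier feature construction for the Gaussian kernel $k(x, y) = \exp(-\lVert x - y \rVert^2 / (2\sigma^2))$. It samples frequencies from the standard spectral distribution, not from the specific non-uniform distribution studied in the abstract above, and the function names and the `bandwidth` parameter are hypothetical.

```python
import math
import random


def sample_gaussian_rff(dim, n_features, bandwidth, seed=0):
    """Draw frequencies from N(0, bandwidth^-2 I) and phases uniformly from [0, 2*pi]."""
    rng = random.Random(seed)
    omegas = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
              for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    return omegas, offsets


def random_fourier_features(x, omegas, offsets):
    """Map x to features whose inner products approximate the Gaussian kernel."""
    scale = math.sqrt(2.0 / len(omegas))
    return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(omegas, offsets)]


def gaussian_kernel(x, y, bandwidth):
    """Exact Gaussian kernel value, used here only for comparison."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


omegas, offsets = sample_gaussian_rff(dim=2, n_features=2000, bandwidth=1.0)
x, y = [0.2, -0.1], [0.5, 0.4]
phi_x = random_fourier_features(x, omegas, offsets)
phi_y = random_fourier_features(y, omegas, offsets)
approx = sum(a * b for a, b in zip(phi_x, phi_y))
exact = gaussian_kernel(x, y, 1.0)
```

The inner product `approx` is a Monte Carlo estimate of `exact`, with an error that shrinks as the number of features grows.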

Memory Efficient Kernel Approximation
メモリ効率の良いカーネル近似

Scaling kernel machines to massive data sets is a major challenge due to storage and computation issues in handling large kernel matrices, which are usually dense. Recently, many papers have suggested tackling this problem by using a low-rank approximation of the kernel matrix. In this paper, we first make the observation that the structure of shift-invariant kernels changes from low-rank to block-diagonal (without any low-rank structure) when varying the scale parameter. Based on this observation, we propose a new kernel approximation framework, Memory Efficient Kernel Approximation (MEKA), which considers both low-rank and clustering structure of the kernel matrix. We show that the resulting algorithm outperforms state-of-the-art low-rank kernel approximation methods in terms of speed, approximation error, and memory usage. As an example, on the covtype dataset with half a million samples, MEKA takes around 70 seconds and uses less than 80 MB memory on a single machine to achieve 10% relative approximation error, while standard Nyström approximation is about 6 times slower and uses more than 400 MB memory to achieve similar approximation. We also present extensive experiments on applying MEKA to speed up kernel ridge regression.

カーネルマシンを大規模なデータセットにスケーリングすることは、通常は密な大きなカーネルマトリックスを処理する際のストレージと計算の問題により、大きな課題です。最近、多くの論文で、カーネルマトリックスの低ランク近似を使用してこの問題に取り組むことが提案されています。この論文では、まず、スケールパラメーターを変更すると、シフト不変カーネルの構造が低ランクからブロック対角(低ランク構造なし)に変化するという観察を行います。この観察に基づいて、カーネルマトリックスの低ランクとクラスタリング構造の両方を考慮した新しいカーネル近似フレームワーク、メモリ効率の高いカーネル近似(MEKA)を提案します。結果として得られるアルゴリズムは、速度、近似誤差、メモリ使用量の点で最先端の低ランクカーネル近似方法よりも優れていることを示します。たとえば、50万サンプルのcovtypeデータセットでは、MEKAは1台のマシンで約70秒かかり、80 MB未満のメモリを使用して10%の相対近似誤差を達成します。一方、標準的なNyström近似では、同様の近似を達成するのに約6倍遅く、400 MBを超えるメモリを使用します。また、カーネルリッジ回帰を高速化するためにMEKAを適用する広範な実験も紹介します。

Breaking the Curse of Dimensionality with Convex Neural Networks
凸型ニューラルネットワークによる次元の呪縛を解く

We consider neural networks with a single hidden layer and non-decreasing positively homogeneous activation functions like the rectified linear units. By letting the number of hidden units grow unbounded and using classical non-Euclidean regularization tools on the output weights, they lead to a convex optimization problem and we provide a detailed theoretical analysis of their generalization performance, with a study of both the approximation and the estimation errors. We show in particular that they are adaptive to unknown underlying linear structures, such as the dependence on the projection of the input variables onto a low-dimensional subspace. Moreover, when using sparsity-inducing norms on the input weights, we show that high-dimensional non-linear variable selection may be achieved, without any strong assumption regarding the data and with a total number of variables potentially exponential in the number of observations. However, solving this convex optimization problem in infinite dimensions is only possible if the non-convex subproblem of addition of a new unit can be solved efficiently. We provide a simple geometric interpretation for our choice of activation functions and describe simple conditions for convex relaxations of the finite-dimensional non-convex subproblem to achieve the same generalization error bounds, even when constant-factor approximations cannot be found. We were not able to find strong enough convex relaxations to obtain provably polynomial-time algorithms and leave open the existence or non-existence of such tractable algorithms with non-exponential sample complexities.

私たちは、単一の隠れ層と、正規化線形ユニットのような非減少の正に同次な活性化関数を持つニューラルネットワークについて検討します。隠れユニットの数を無制限に増やし、出力の重みに古典的な非ユークリッド正則化ツールを使用すると、凸最適化問題につながります。そこで、近似誤差と推定誤差の両方を研究し、その一般化パフォーマンスの詳細な理論的分析を提供します。特に、入力変数の低次元部分空間への投影への依存性など、未知の基礎線形構造に適応できることを示します。さらに、入力の重みにスパース性を誘導するノルムを使用すると、データに関する強い仮定を必要とせず、変数の総数が観測数の指数関数になる可能性のある場合でも、高次元の非線形変数選択を実現できることを示します。ただし、この凸最適化問題を無限次元で解くには、新しいユニットの追加という非凸部分問題を効率的に解く必要があります。我々は活性化関数の選択について単純な幾何学的解釈を提供し、定数係数近似が見つからない場合でも、同じ一般化誤差境界を達成するための有限次元非凸サブ問題の凸緩和の単純な条件を説明します。証明可能な多項式時間アルゴリズムを取得するのに十分な強力な凸緩和を見つけることはできず、非指数サンプル複雑度を持つそのような扱いやすいアルゴリズムの存在または非存在については未解決のままです。

Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles
情報幾何最適化アルゴリズム:不変性原理による統一像

We present a canonical way to turn any smooth parametric family of probability distributions on an arbitrary search space $X$ into a continuous-time black-box optimization method on $X$, the information-geometric optimization (IGO) method. Invariance as a major design principle keeps the number of arbitrary choices to a minimum. The resulting IGO flow is the flow of an ordinary differential equation conducting the natural gradient ascent of an adaptive, time-dependent transformation of the objective function. It makes no particular assumptions on the objective function to be optimized. The IGO method produces explicit IGO algorithms through time discretization. It naturally recovers versions of known algorithms and offers a systematic way to derive new ones. In continuous search spaces, IGO algorithms take a form related to natural evolution strategies (NES). The cross-entropy method is recovered in a particular case with a large time step, and can be extended into a smoothed, parametrization-independent maximum likelihood update (IGO-ML). When applied to the family of Gaussian distributions on $\mathbb{R}^d$, the IGO framework recovers a version of the well-known CMA-ES algorithm and of xNES. For the family of Bernoulli distributions on $\{0,1\}^d$, we recover the seminal PBIL algorithm and cGA. For the distributions of restricted Boltzmann machines, we naturally obtain a novel algorithm for discrete optimization on $\{0,1\}^d$. All these algorithms are natural instances of, and unified under, the single information-geometric optimization framework. The IGO method achieves, thanks to its intrinsic formulation, maximal invariance properties: invariance under reparametrization of the search space $X$, under a change of parameters of the probability distribution, and under increasing transformation of the function to be optimized. The latter is achieved through an adaptive, quantile-based formulation of the objective.
Theoretical considerations strongly suggest that IGO algorithms are essentially characterized by a minimal change of the distribution over time. Therefore they have minimal loss in diversity through the course of optimization, provided the initial diversity is high. First experiments using restricted Boltzmann machines confirm this insight. As a simple consequence, IGO seems to provide, from information theory, an elegant way to simultaneously explore several valleys of a fitness landscape in a single run.

私たちは、任意の探索空間$X$上の確率分布の滑らかなパラメトリック族を$X$上の連続時間ブラックボックス最適化法、すなわち情報幾何最適化(IGO)法に変換する標準的な方法を提示します。主要な設計原則としての不変性により、恣意的な選択の数を最小限に抑えることができます。結果として得られるIGOフローは、目的関数の適応型時間依存変換の自然勾配上昇を実行する常微分方程式のフローです。最適化される目的関数については、特別な仮定を置きません。IGO法は、時間の離散化を通じて明示的なIGOアルゴリズムを生成します。既知のアルゴリズムのバージョンを自然に復元し、新しいアルゴリズムを導出する体系的な方法を提供します。連続探索空間では、IGOアルゴリズムは自然進化戦略(NES)に関連する形式をとります。クロスエントロピー法は、大きな時間ステップを持つ特定のケースで復元され、滑らかな、パラメータ化に依存しない最大尤度更新(IGO-ML)に拡張できます。IGOフレームワークを$\mathbb{R}^d$上のガウス分布の族に適用すると、よく知られているCMA-ESアルゴリズムとxNESのバージョンが復元されます。$\{0,1\}^d$上のベルヌーイ分布の族については、独創的なPBILアルゴリズムとcGAが復元されます。制限付きボルツマンマシンの分布については、$\{0,1\}^d$上の離散最適化の新しいアルゴリズムが自然に得られます。これらのアルゴリズムはすべて、単一の情報幾何最適化フレームワークの自然な例であり、そのフレームワークの下で統合されています。IGO法は、その固有の定式化により、最大の不変性特性(探索空間$X$の再パラメータ化、確率分布のパラメータの変更、最適化する関数の単調増加変換に対する不変性)を実現します。最後のものは、目的関数の適応的な分位数ベースの定式化によって実現されます。理論的な考察から、IGOアルゴリズムは基本的に、時間の経過に伴う分布の変化が最小限であることが特徴であることが強く示唆されます。したがって、初期の多様性が高ければ、最適化の過程で多様性の損失は最小限に抑えられます。制限付きボルツマンマシンを使用した最初の実験で、この洞察が確認されました。単純な結果として、情報理論から見ると、IGOは1回の実行で適応度地形の複数の谷を同時に探索する優れた方法を提供するようです。
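
The abstract above states that the IGO framework recovers PBIL for Bernoulli distributions on $\{0,1\}^d$. The following is a minimal sketch of a basic PBIL variant on the OneMax objective (the number of ones in a bit string), written directly rather than derived from the IGO flow. The function name `pbil_onemax`, its parameters, and the choice of updating toward the single best sample are assumptions made for illustration.

```python
import random


def pbil_onemax(dim=10, pop_size=20, learning_rate=0.1, n_iter=200, seed=0):
    """Run a basic PBIL loop that maximizes the number of ones in a bit string."""
    rng = random.Random(seed)
    probs = [0.5] * dim  # one Bernoulli parameter per bit
    for _ in range(n_iter):
        # Sample a population from the current product-Bernoulli distribution.
        population = [[1 if rng.random() < p else 0 for p in probs]
                      for _ in range(pop_size)]
        # OneMax fitness is the sum of the bits; keep the best sample.
        best = max(population, key=sum)
        # Move every parameter a small step toward the best sample.
        probs = [(1.0 - learning_rate) * p + learning_rate * b
                 for p, b in zip(probs, best)]
    return probs


final_probs = pbil_onemax()
```

Because the update only uses which sample ranks best, replacing the fitness by any increasing transformation of it leaves the run unchanged, which is a small-scale instance of the invariance property described in the abstract.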

Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning
不均衡学習:機械学習における不均衡なデータセットの呪いに取り組むためのPythonツールボックス

imbalanced-learn is an open-source Python toolbox aiming at providing a wide range of methods to cope with the problem of imbalanced datasets frequently encountered in machine learning and pattern recognition. The implemented state-of-the-art methods can be categorized into 4 groups: (i) under-sampling, (ii) over-sampling, (iii) combination of over- and under-sampling, and (iv) ensemble learning methods. The proposed toolbox depends only on numpy, scipy, and scikit-learn and is distributed under MIT license. Furthermore, it is fully compatible with scikit-learn and is part of the scikit-learn-contrib supported project. Documentation, unit tests as well as integration tests are provided to ease usage and contribution. Source code, binaries, and documentation can be downloaded from github.com/scikit-learn-contrib/imbalanced-learn.

imbalanced-learnは、機械学習やパターン認識で頻繁に遭遇する不均衡なデータセットの問題に対処するための幅広い方法を提供することを目的としたオープンソースのPythonツールボックスです。実装された最先端の方法は、(i)アンダーサンプリング、(ii)オーバーサンプリング、(iii)オーバーサンプリングとアンダーサンプリングの組み合わせ、および(iv)アンサンブル学習方法の4つのグループに分類できます。提案されたツールボックスは、numpy、scipy、およびscikit-learnのみに依存しており、MITライセンスの下で配布されています。さらに、scikit-learnと完全に互換性があり、scikit-learn-contribサポートプロジェクトの一部です。ドキュメント、ユニットテスト、および統合テストは、使用と貢献を容易にするために提供されています。ソースコード、バイナリ、およびドキュメントは、github.com/scikit-learn-contrib/imbalanced-learnからダウンロードできます。
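
To illustrate category (ii), over-sampling, here is a minimal standard-library sketch of random over-sampling, which duplicates randomly chosen minority-class samples until every class matches the size of the largest one. It deliberately does not call imbalanced-learn and does not reproduce its API; the function name `random_oversample` is hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate random samples of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # All original samples of this class form the pool to draw duplicates from.
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
```

After the call, both classes contain four samples, and the added minority samples are copies of the single original minority point.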

A Unified Formulation and Fast Accelerated Proximal Gradient Method for Classification
分類のための統一された定式化と高速加速近位勾配法

Binary classification is the problem of predicting the class a given sample belongs to. To achieve a good prediction performance, it is important to find a suitable model for a given dataset. However, it is often time consuming and impractical for practitioners to try various classification models because each model employs a different formulation and algorithm. The difficulty can be mitigated if we have a unified formulation and an efficient universal algorithmic framework for various classification models to expedite the comparison of performance of different models for a given dataset. In this paper, we present a unified formulation of various classification models (including $C$-SVM, $\ell_2$-SVM, $\nu$-SVM, MM-FDA, MM-MPM, logistic regression, distance weighted discrimination) and develop a general optimization algorithm based on an accelerated proximal gradient (APG) method for the formulation. We design various techniques such as backtracking line search and adaptive restarting strategy in order to speed up the practical convergence of our method. We also give a theoretical convergence guarantee for the proposed fast APG algorithm. Numerical experiments show that our algorithm is stable and highly competitive to specialized algorithms designed for specific models (e.g., sequential minimal optimization (SMO) for SVM).

バイナリ分類は、与えられたサンプルが属するクラスを予測する問題です。優れた予測性能を達成するには、与えられたデータセットに適したモデルを見つけることが重要です。しかし、各モデルが異なる定式化とアルゴリズムを採用しているため、実務者がさまざまな分類モデルを試すのは時間がかかり、非現実的であることがよくあります。さまざまな分類モデルに統一された定式化と効率的な汎用アルゴリズムフレームワークがあれば、与えられたデータセットに対するさまざまなモデルのパフォーマンスの比較を迅速化できるため、困難を軽減できます。この論文では、さまざまな分類モデル（$C$-SVM、$\ell_2$-SVM、$\nu$-SVM、MM-FDA、MM-MPM、ロジスティック回帰、距離加重判別を含む）の統一された定式化を提示し、その定式化のための加速近位勾配（APG）法に基づく一般的な最適化アルゴリズムを開発します。バックトラッキングラインサーチや適応型リスタート戦略などのさまざまな手法を設計して、方法の実際の収束を高速化します。また、提案された高速APGアルゴリズムの理論的な収束保証も提供します。数値実験により、私たちのアルゴリズムは安定しており、特定のモデル向けに設計された特殊なアルゴリズム（SVMの逐次最小最適化(SMO)など）に対して非常に競争力があることが示されました。

Empirical Evaluation of Resampling Procedures for Optimising SVM Hyperparameters
SVMハイパーパラメータを最適化するためのリサンプリング手順の実証的評価

Tuning the regularisation and kernel hyperparameters is a vital step in optimising the generalisation performance of kernel methods, such as the support vector machine (SVM). This is most often performed by minimising a resampling/cross-validation based model selection criterion; however, there seems to be little practical guidance on the most suitable form of resampling. This paper presents the results of an extensive empirical evaluation of resampling procedures for SVM hyperparameter selection, designed to address this gap in the machine learning literature. We tested 15 different resampling procedures on 121 binary classification data sets in order to select the best SVM hyperparameters. We used three very different statistical procedures to analyse the results: the standard multi-classifier/multi-data set procedure proposed by Demšar, the confidence intervals on the excess loss of each procedure in relation to 5-fold cross validation, and the Bayes factor analysis proposed by Barber. We conclude that a 2-fold procedure is appropriate to select the hyperparameters of an SVM for data sets with 1000 or more datapoints, while a 3-fold procedure is appropriate for smaller data sets.

正則化とカーネルハイパーパラメータの調整は、サポートベクターマシン(SVM)などのカーネル法の一般化パフォーマンスを最適化するための重要なステップです。これは、再サンプリング/クロスバリデーションに基づくモデル選択基準を最小化することによって最もよく実行されますが、再サンプリングの最適な形式に関する実用的なガイダンスはほとんどないようです。この論文では、機械学習の文献におけるこのギャップを埋めるために設計された、SVMハイパーパラメータ選択のための再サンプリング手順の広範な実証的評価の結果を示します。最適なSVMハイパーパラメータを選択するために、121のバイナリ分類データセットで15の異なる再サンプリング手順をテストしました。結果を分析するために、Demšarが提案した標準的なマルチ分類器/マルチデータセット手順、5分割クロスバリデーションに関する各手順の過剰損失の信頼区間、およびBarberが提案したベイズ因子分析という3つの非常に異なる統計手順を使用しました。1000個以上のデータポイントのデータセットの場合、SVMのハイパーパラメータを選択するには2分割の手順が適切であり、より小さなデータセットの場合は3分割の手順が適切であると結論付けました。
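
The k-fold procedures compared in the abstract above all rest on the same splitting step. Here is a minimal sketch of how k-fold train/validation index splits can be generated; it is not the paper's experimental code, and the function name `kfold_indices` and its `seed` parameter are hypothetical.

```python
import random


def kfold_indices(n_samples, n_folds, seed=0):
    """Return a list of (train_indices, validation_indices) pairs, one per fold."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds groups of nearly equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i in range(n_folds):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


two_fold = kfold_indices(10, 2)
three_fold = kfold_indices(10, 3)
```

In a hyperparameter search, each candidate setting would be trained on `train` and scored on `validation` for every split, and the setting with the best average validation score would be kept.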

Automatic Differentiation Variational Inference
自動微分変分推論

Probabilistic modeling is iterative. A scientist posits a simple model, fits it to her data, refines it according to her analysis, and repeats. However, fitting complex models to large data is a bottleneck in this process. Deriving algorithms for new models can be both mathematically and computationally challenging, which makes it difficult to efficiently cycle through the steps. To this end, we develop ADVI. Using our method, the scientist only provides a probabilistic model and a dataset, nothing else. ADVI automatically derives an efficient variational inference algorithm, freeing the scientist to refine and explore many models. ADVI supports a broad class of models; no conjugacy assumptions are required. We study ADVI across ten modern probabilistic models and apply it to a dataset with millions of observations. We deploy ADVI as part of Stan, a probabilistic programming system.

確率的モデリングは反復的です。科学者は単純なモデルを仮定し、それを自分のデータに当てはめ、分析に従ってそれを改良し、繰り返します。ただし、このプロセスでは、複雑なモデルを大規模なデータに適合させることがボトルネックになります。新しいモデルのアルゴリズムを導出することは、数学的にも計算的にも困難な場合があり、ステップを効率的に循環させることが難しくなります。そのために、ADVIを開発しています。私たちの方法を使用すると、科学者は確率モデルとデータセットのみを提供し、それ以外は何も提供しません。ADVIは、効率的な変分推論アルゴリズムを自動的に導き出すため、科学者は多くのモデルを改良して探索することができます。ADVIは、幅広いモデルのクラスをサポートしています—共役の仮定は必要ありません。私たちは、10の現代確率モデルにわたってADVIを研究し、それを数百万の観測値を持つデータセットに適用します。ADVIは、確率的プログラミングシステムであるStanの一部としてデプロイします。

Using Conceptors to Manage Neural Long-Term Memories for Temporal Patterns
コンセプターを使用した時間パターンの神経長期記憶の管理

Biological brains can learn, recognize, organize, and regenerate large repertoires of temporal patterns. Here I propose a mechanism of neurodynamical pattern learning and representation, called conceptors, which offers an integrated account of a number of such phenomena and functionalities. It becomes possible to store a large number of temporal patterns in a single recurrent neural network. In the recall process, stored patterns can be morphed and focussed. Parametric families of patterns can be learnt from a very small number of examples. Stored temporal patterns can be content-addressed in ways that are analogous to recalling static patterns in Hopfield networks.

生物学的な脳は、時間パターンの大きなレパートリーを学習し、認識し、組織化し、再生成することができます。ここでは、コンセプターと呼ばれる神経力学的なパターン学習と表現のメカニズムを提案し、そのような現象と機能のいくつかの統合された説明を提供します。これにより、単一の再帰型ニューラルネットワークに多数の時間パターンを格納することが可能になります。リコールプロセスでは、保存されたパターンをモーフィングしてフォーカスすることができます。パターンのパラメトリックファミリは、非常に少数の例から学習できます。保存された時間的パターンは、Hopfieldネットワーク内の静的パターンを呼び戻すのと類似した方法でコンテンツアドレス指定できます。

Refinery: An Open Source Topic Modeling Web Platform
Refinery: オープンソースのトピックモデリング Web プラットフォーム

We introduce Refinery, an open source platform for exploring large text document collections with topic models. Refinery is a standalone web application driven by a graphical interface, so it is usable by those without machine learning or programming expertise. Users can interactively organize articles by topic and also refine this organization with phrase-level analysis. Under the hood, we train Bayesian nonparametric topic models that can adapt model complexity to the provided data with scalable learning algorithms. The project website contains Python code and further documentation.

私たちは、トピックモデルを使用して大規模なテキストドキュメントコレクションを探索するためのオープンソースプラットフォームであるRefineryを紹介します。Refineryは、グラフィカルインターフェイスによって駆動されるスタンドアロンのWebアプリケーションであるため、機械学習やプログラミングの専門知識がない人でも使用できます。ユーザーは、トピックごとに記事をインタラクティブに整理し、フレーズレベルの分析でこの整理を洗練することもできます。内部的には、スケーラブルな学習アルゴリズムを使用して、モデルの複雑さを提供されたデータに適応できるベイズノンパラメトリックトピックモデルをトレーニングします。プロジェクトのWebサイトには、Pythonコードとその他のドキュメントが含まれています。

Differential Privacy for Bayesian Inference through Posterior Sampling
事後サンプリングによるベイズ推論の差分プライバシー

Differential privacy formalises privacy-preserving mechanisms that provide access to a database. Can Bayesian inference be used directly to provide private access to data? The answer is yes: under certain conditions on the prior, sampling from the posterior distribution can lead to a desired level of privacy and utility. For a uniform treatment, we define differential privacy over arbitrary data set metrics, outcome spaces and distribution families. This allows us to also deal with non-i.i.d. or non-tabular data sets. We then prove bounds on the sensitivity of the posterior to the data, which delivers a measure of robustness. We also show how to use posterior sampling to provide differentially private responses to queries, within a decision-theoretic framework. Finally, we provide bounds on the utility of answers to queries and on the ability of an adversary to distinguish between data sets. The latter are complemented by a novel use of Le Cam’s method to obtain lower bounds on distinguishability. Our results hold for arbitrary metrics, including those for the common definition of differential privacy. For specific choices of the metric, we give a number of examples satisfying our assumptions.

差分プライバシーは、データベースへのアクセスを提供するプライバシー保護メカニズムを形式化します。ベイズ推論を直接使用して、データへのプライベートアクセスを提供できますか?答えはイエスです。事前分布の特定の条件下では、事後分布からのサンプリングにより、望ましいレベルのプライバシーと有用性を実現できます。均一な処理のために、任意のデータセットメトリック、結果空間、分布ファミリに対して差分プライバシーを定義します。これにより、非i.i.dまたは非表形式のデータセットも処理できます。次に、データに対する事後分布の感度の境界を証明し、堅牢性の尺度を提供します。また、事後サンプリングを使用して、決定理論のフレームワーク内でクエリに対して差分プライバシー応答を提供する方法を示します。最後に、クエリに対する回答の有用性と、敵対者がデータセットを区別する能力の境界を示します。後者は、区別可能性の下限を取得するためにLe Cam法を新しく使用することで補完されます。結果は、差分プライバシーの一般的な定義を含む任意のメトリックに当てはまります。メトリックの具体的な選択については、私たちの仮定を満たすいくつかの例を示します。

On Perturbed Proximal Gradient Algorithms
摂動された近位勾配アルゴリズムについて

We study a version of the proximal gradient algorithm for which the gradient is intractable and is approximated by Monte Carlo methods (and in particular Markov Chain Monte Carlo). We derive conditions on the step size and the Monte Carlo batch size under which convergence is guaranteed: both increasing batch size and constant batch size are considered. We also derive non-asymptotic bounds for an averaged version. Our results cover both the cases of biased and unbiased Monte Carlo approximation. To support our findings, we discuss the inference of a sparse generalized linear model with random effect and the problem of learning the edge structure and parameters of sparse undirected graphical models.

私たちは、勾配が扱いにくく、モンテカルロ法(特にマルコフ連鎖モンテカルロ法)によって近似される近位勾配アルゴリズムのバージョンを研究します。収束が保証されるステップサイズとモンテカルロバッチサイズに関する条件を導き出します:バッチサイズの増加と一定のバッチサイズの両方が考慮されます。また、平均化されたバージョンの非漸近境界も導き出します。私たちの結果は、偏ったモンテカルロ近似と不偏なモンテカルロ近似の両方のケースをカバーしています。私たちの発見を裏付けるために、ランダム効果を持つスパース一般化線形モデルの推論と、スパース無向グラフィカルモデルのエッジ構造とパラメータを学習する問題について説明します。
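A small numerical sketch may help picture a proximal gradient iteration whose gradient is replaced by a Monte Carlo estimate. Everything here is an illustrative assumption rather than the paper's setting: the one-dimensional objective `E[0.5 * (x - Z)^2] + lam * |x|` with `Z ~ N(a, 1)`, plain i.i.d. sampling instead of MCMC, the step size `gamma`, and the linearly increasing batch schedule `10 * k`.

```python
import random


def soft_threshold(v, t):
    """Proximal operator of t * |v|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def perturbed_prox_grad(a, lam, steps=200, gamma=0.5, seed=0):
    """Proximal gradient with a noisy (Monte Carlo) gradient estimate.

    The exact gradient of the smooth part is x - a; it is approximated
    by x - mean(Z_1..Z_m) using a batch whose size grows with k.
    """
    rng = random.Random(seed)
    x = 0.0
    for k in range(1, steps + 1):
        batch = 10 * k  # increasing batch size (assumed schedule)
        z_mean = sum(rng.gauss(a, 1.0) for _ in range(batch)) / batch
        grad_est = x - z_mean                      # unbiased estimate
        x = soft_threshold(x - gamma * grad_est, gamma * lam)
    return x
```

With an exact gradient the iteration would converge to `soft_threshold(a, lam)`; the growing batch shrinks the Monte Carlo error so the noisy iterates settle near the same point.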

Spectral Clustering Based on Local PCA
局所PCAに基づくスペクトルクラスタリング

We propose a spectral clustering method based on local principal components analysis (PCA). After performing local PCA in selected neighborhoods, the algorithm builds a nearest neighbor graph weighted according to a discrepancy between the principal subspaces in the neighborhoods, and then applies spectral clustering. As opposed to standard spectral methods based solely on pairwise distances between points, our algorithm is able to resolve intersections. We establish theoretical guarantees for simpler variants within a prototypical mathematical framework for multi-manifold clustering, and evaluate our algorithm on various simulated data sets.

私たちは、局所主成分分析(PCA)に基づくスペクトルクラスタリング法を提案します。選択した近傍でローカルPCAを実行した後、アルゴリズムは、近傍内の主部分空間間の不一致に従って重み付けされた最近傍グラフを作成し、スペクトルクラスタリングを適用します。ポイント間のペアワイズ距離のみに基づく標準的なスペクトル方法とは対照的に、私たちのアルゴリズムは交差を解決できます。私たちは、多多様体クラスタリングのプロトタイプ数学的フレームワーク内で、より単純なバリアントの理論的保証を確立し、さまざまなシミュレーションデータセットでアルゴリズムを評価します。

Persistence Images: A Stable Vector Representation of Persistent Homology
パーシステンスイメージ:パーシステントホモロジーの安定ベクトル表現

Many data sets can be viewed as a noisy sampling of an underlying space, and tools from topological data analysis can characterize this structure for the purpose of knowledge discovery. One such tool is persistent homology, which provides a multiscale description of the homological features within a data set. A useful representation of this homological information is a persistence diagram (PD). Efforts have been made to map PDs into spaces with additional structure valuable to machine learning tasks. We convert a PD to a finite-dimensional vector representation which we call a persistence image (PI), and prove the stability of this transformation with respect to small perturbations in the inputs. The discriminatory power of PIs is compared against existing methods, showing significant performance gains. We explore the use of PIs with vector-based machine learning tools, such as linear sparse support vector machines, which identify features containing discriminating topological information. Finally, high accuracy inference of parameter values from the dynamic output of a discrete dynamical system (the linked twist map) and a partial differential equation (the anisotropic Kuramoto-Sivashinsky equation) provide a novel application of the discriminatory power of PIs.

多くのデータセットは、基礎となる空間のノイズの多いサンプリングと見なすことができ、トポロジカルデータ解析のツールは、知識発見を目的としてこの構造を特徴付けることができます。そのようなツールの1つがパーシステントホモロジーです。これは、データセット内のホモロジー特性のマルチスケール記述を提供します。このホモロジー情報の便利な表現は、パーシステンスダイアグラム(PD)です。機械学習タスクに役立つ追加の構造を持つ空間にPDをマッピングする取り組みが行われてきました。PDをパーシステンスイメージ(PI)と呼ぶ有限次元ベクトル表現に変換し、入力の小さな摂動に対するこの変換の安定性を証明します。PIの識別力を既存の方法と比較すると、大幅なパフォーマンスの向上が示されています。識別的なトポロジ情報を含む特性を識別する線形スパースサポートベクトルマシンなどのベクトルベースの機械学習ツールでのPIの使用を検討します。最後に、離散動的システム(リンクされたツイストマップ)の動的出力と偏微分方程式(異方性Kuramoto-Sivashinsky方程式)からのパラメーター値の高精度な推論により、PIの識別力の新しい応用が可能になります。
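The conversion from a persistence diagram to a fixed-length vector can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: points are mapped to birth-persistence coordinates, each point contributes a Gaussian bump weighted linearly by its persistence (so points near the diagonal count less), and each pixel value is approximated by the surface value at the pixel center times the pixel area rather than an exact integral. The function name and the parameters `resolution`, `sigma`, `lo`, and `hi` are hypothetical.

```python
import math


def persistence_image(diagram, resolution=10, sigma=0.1, lo=0.0, hi=1.0):
    """Return a flat vector of length resolution**2 for a persistence diagram.

    diagram: non-empty list of (birth, death) pairs with death > birth.
    Rows index persistence, columns index birth.
    """
    pts = [(b, d - b) for b, d in diagram]      # (birth, persistence)
    max_pers = max(p for _, p in pts)
    step = (hi - lo) / resolution
    centers = [lo + (i + 0.5) * step for i in range(resolution)]
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)   # 2-D Gaussian normalizer

    image = []
    for py in centers:                          # persistence axis
        for bx in centers:                      # birth axis
            val = 0.0
            for b, p in pts:
                weight = p / max_pers           # assumed linear weighting
                dist2 = (bx - b) ** 2 + (py - p) ** 2
                val += weight * norm * math.exp(-dist2 / (2.0 * sigma ** 2))
            image.append(val * step * step)     # center value times pixel area
    return image
```

The resulting list can be passed to any vector-based learner, which is the property the abstract highlights.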

Distributed Sequence Memory of Multidimensional Inputs in Recurrent Networks
リカレントネットワークにおける多次元入力の分散シーケンスメモリ

Recurrent neural networks (RNNs) have drawn interest from machine learning researchers because of their effectiveness at preserving past inputs for time-varying data processing tasks. To understand the success and limitations of RNNs, it is critical that we advance our analysis of their fundamental memory properties. We focus on echo state networks (ESNs), which are RNNs with simple memoryless nodes and random connectivity. In most existing analyses, the short-term memory (STM) capacity results conclude that the ESN network size must scale linearly with the input size for unstructured inputs. The main contribution of this paper is to provide general results characterizing the STM capacity for linear ESNs with multidimensional input streams when the inputs have common low-dimensional structure: sparsity in a basis or significant statistical dependence between inputs. In both cases, we show that the number of nodes in the network must scale linearly with the information rate and poly-logarithmically with the input dimension. The analysis relies on advanced applications of random matrix theory and results in explicit non-asymptotic bounds on the recovery error. Taken together, this analysis provides a significant step forward in our understanding of the STM properties in RNNs.

リカレントニューラルネットワーク(RNN)は、時変データ処理タスクの過去の入力の保存に効果的であることから、機械学習研究者の関心を集めています。RNNの成功と限界を理解するには、RNNの基本的なメモリ特性の分析を進めることが重要です。私たちは、単純なメモリレスノードとランダム接続を持つRNNであるエコー状態ネットワーク(ESN)に焦点を当てます。既存のほとんどの分析では、短期記憶(STM)容量の結果から、非構造化入力の場合、ESNネットワークサイズは入力サイズに比例して拡大する必要があるという結論が出ています。この論文の主な貢献は、入力が共通の低次元構造(基底のスパース性または入力間の有意な統計的依存性)を持つ場合、多次元入力ストリームを持つ線形ESNのSTM容量を特徴付ける一般的な結果を提供することです。どちらの場合も、ネットワーク内のノード数は情報レートに対して線形に、入力次元に対して多重対数的に拡大する必要があることを示しています。この分析は、ランダム行列理論の高度な応用に依存しており、回復エラーの明示的な非漸近境界をもたらします。総合すると、この分析は、RNNのSTMプロパティの理解に大きな前進をもたらします。

Improving Variational Methods via Pairwise Linear Response Identities
ペアワイズ線形応答恒等式による変分法の改善

Inference methods are often formulated as variational approximations: these approximations allow easy evaluation of statistics by marginalization or linear response, but these estimates can be inconsistent. We show that by introducing constraints on covariance, one can ensure consistency of linear response with the variational parameters, and in so doing inference of marginal probability distributions is improved. For the Bethe approximation and its generalizations, improvements are achieved with simple choices of the constraints. The approximations are presented as variational frameworks; iterative procedures related to message passing are provided for finding the minima.

推論法はしばしば変分近似として定式化されます:これらの近似は、周辺化または線形応答による統計の容易な評価を可能にしますが、これらの推定は一貫性がない可能性があります。共分散に制約を導入することにより、線形応答と変分パラメータとの一貫性を確保でき、そうすることで周辺確率分布の推論が改善されることを示します。Bethe近似とその一般化では、制約の単純な選択で改善が達成されます。近似は変分フレームワークとして提示されます。最小値を見つけるために、メッセージ受け渡しに関連する反復手順が提供されています。

Communication-efficient Sparse Regression
通信効率の高いスパース回帰

We devise a communication-efficient approach to distributed sparse regression in the high-dimensional setting. The key idea is to average debiased or desparsified lasso estimators. We show the approach converges at the same rate as the lasso as long as the dataset is not split across too many machines, and consistently estimates the support under weaker conditions than the lasso. On the computational side, we propose a new parallel and computationally-efficient algorithm to compute the approximate inverse covariance required in the debiasing approach, when the dataset is split across samples. We further extend the approach to generalized linear models.

私たちは、高次元の設定での分散スパース回帰に対する通信効率の高いアプローチを考案します。重要なアイデアは、バイアス除去(debiased)またはデスパース化(desparsified)されたlasso推定量を平均化することです。データセットがあまりにも多くのマシンに分割されない限り、このアプローチはlassoと同じ速度で収束し、lassoよりも弱い条件下でサポートを一貫して推定することを示します。計算面では、データセットがサンプル間で分割されている場合に、バイアス除去アプローチで必要な近似逆共分散を計算するための、新しい並列で計算効率の高いアルゴリズムを提案します。さらに、このアプローチを一般化線形モデルへ拡張します。

SnapVX: A Network-Based Convex Optimization Solver
SnapVX:ネットワークベースの凸最適化ソルバー

SnapVX is a high-performance solver for convex optimization problems defined on networks. For problems of this form, SnapVX provides a fast and scalable solution with guaranteed global convergence. It combines the capabilities of two open source software packages: Snap.py and CVXPY. Snap.py is a large scale graph processing library, and CVXPY provides a general modeling framework for small-scale subproblems. SnapVX offers a customizable yet easy-to-use Python interface with out-of-the-box functionality. Based on the Alternating Direction Method of Multipliers (ADMM), it is able to efficiently store, analyze, parallelize, and solve large optimization problems from a variety of different applications. Documentation, examples, and more can be found on the SnapVX website at snap.stanford.edu/snapvx.

SnapVXは、ネットワーク上で定義された凸最適化問題に対する高性能ソルバーです。このような問題に対して、SnapVXは、グローバルコンバージェンスが保証された高速でスケーラブルなソリューションを提供します。これは、Snap.pyとCVXPYという2つのオープンソースソフトウェアパッケージの機能を組み合わせたものです。Snap.pyは大規模なグラフ処理ライブラリであり、CVXPYは小規模なサブ問題に対する一般的なモデリングフレームワークを提供します。SnapVXは、カスタマイズ可能でありながら使いやすいPythonインターフェースを提供し、すぐに使える機能を備えています。交互方向乗算器法(ADMM)に基づいて、さまざまなアプリケーションからの大規模な最適化問題を効率的に保存、解析、並列化、および解決できます。ドキュメント、例などは、SnapVXのWebサイト(snap.stanford.edu/snapvx)でご覧いただけます。

Local algorithms for interactive clustering
対話型クラスタリングのためのローカルアルゴリズム

We study the design of interactive clustering algorithms. The user supervision that we consider is in the form of cluster split/merge requests; such feedback is easy for users to provide because it only requires a high-level understanding of the clusters. Our algorithms start with any initial clustering and only make local changes in each step; both are desirable properties in many applications. Local changes are desirable because in practice edits of other parts of the clustering are considered churn – changes that are perceived as quality-neutral or quality-negative. We show that in this framework we can still design provably correct algorithms given that our data satisfies natural separability properties. We also show that our framework works well in practice.

私たちは、対話型クラスタリングアルゴリズムの設計を研究します。私たちが考慮するユーザーからの教師情報は、クラスターの分割/マージ要求の形式です。このようなフィードバックは、クラスターの大まかな(高レベルの)理解のみを必要とするため、ユーザーが簡単に提供できます。私たちのアルゴリズムは、任意の初期クラスタリングから始まり、各ステップでローカルな変更のみを行います。どちらも多くのアプリケーションで望ましい特性です。ローカルな変更が望ましいのは、実際には、クラスタリングの他の部分の編集がチャーン、つまり品質に対して中立または悪影響と受け取られる変更と見なされるためです。このフレームワークでも、データが自然な分離可能性特性を満たしていれば、正しさを証明可能なアルゴリズムを設計できることを示します。また、私たちのフレームワークが実際にうまく機能することも示しています。

Scalable Influence Maximization for Multiple Products in Continuous-Time Diffusion Networks
連続時間拡散ネットワークにおける複数製品に対するスケーラブルな影響最大化

A typical viral marketing model identifies influential users in a social network to maximize a single product adoption assuming unlimited user attention, campaign budgets, and time. In reality, multiple products need campaigns, users have limited attention, convincing users incurs costs, and advertisers have limited budgets and expect the adoptions to be maximized soon. Facing these user, monetary, and timing constraints, we formulate the problem as a submodular maximization task in a continuous-time diffusion model under the intersection of one matroid and multiple knapsack constraints. We propose a randomized algorithm estimating the user influence in a network ($|\mathcal{V}|$ nodes, $|\mathcal{E}|$ edges) to an accuracy of $\epsilon$ with $n=\mathcal{O}(1/\epsilon^2)$ randomizations and $\tilde{\mathcal{O}}(n|\mathcal{E}|+n|\mathcal{V}|)$ computations. By exploiting the influence estimation algorithm as a subroutine, we develop an adaptive threshold greedy algorithm achieving an approximation factor $k_a/(2+2 k)$ of the optimal when $k_a$ out of the $k$ knapsack constraints are active. Extensive experiments on networks of millions of nodes demonstrate that the proposed algorithms achieve the state-of-the-art in terms of effectiveness and scalability. (Partial results in the paper on influence estimation have been published in a conference paper: Nan Du, Le Song, Manuel Gomez-Rodriguez, and Hongyuan Zha. Scalable influence estimation in continuous time diffusion networks. In Advances in Neural Information Processing Systems 26, 2013.)

典型的なバイラルマーケティングモデルは、ソーシャルネットワークで影響力のあるユーザーを特定し、ユーザーの関心、キャンペーン予算、時間が無制限であると仮定して、単一製品の採用を最大化します。実際には、複数の製品にキャンペーンが必要で、ユーザーの関心は限られており、ユーザーを説得するにはコストがかかり、広告主の予算は限られているため、すぐに採用が最大化されることを期待しています。これらのユーザー、金銭、およびタイミングの制約に直面して、1つのマトロイドと複数のナップサック制約の交差の下で、連続時間拡散モデルにおけるサブモジュラー最大化タスクとして問題を定式化します。私たちは、ネットワーク($|\mathcal{V}|$ノード、$|\mathcal{E}|$エッジ)におけるユーザーの影響を、$n=\mathcal{O}(1/\epsilon^2)$ランダム化と$\tilde{\mathcal{O}}(n|\mathcal{E}|+n|\mathcal{V}|)$計算で$\epsilon$の精度で推定するランダム化アルゴリズムを提案します(影響推定に関する論文の部分的な結果は、会議論文で発表されています: Nan Du、Le Song、Manuel Gomez-Rodriguez、およびHongyuan Zha。連続時間拡散ネットワークにおけるスケーラブルな影響推定。Advances in Neural Information Processing Systems 26、2013。)影響推定アルゴリズムをサブルーチンとして利用することで、$k$個のナップサック制約のうち$k_a$がアクティブな場合に、最適値の近似係数$k_a/(2+2 k)$を達成する適応しきい値貪欲アルゴリズムを開発します。数百万のノードのネットワークでの広範な実験により、提案されたアルゴリズムが、有効性とスケーラビリティの点で最先端のものであることが実証されています。
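The flavor of greedy submodular maximization under a budget can be shown with a much simpler stand-in. The sketch below is a plain cost-benefit greedy over a set-coverage objective with a single knapsack constraint; it is not the paper's adaptive threshold greedy algorithm, it ignores the matroid constraint, and it replaces continuous-time influence estimation with a fixed, hypothetical `reach` table mapping each candidate seed to the users it would reach.

```python
def greedy_budgeted_coverage(reach, cost, budget):
    """Pick seeds by marginal coverage per unit cost until the budget is spent.

    reach: dict mapping seed -> set of users reached (assumed to be given)
    cost:  dict mapping seed -> positive cost
    """
    chosen, covered, spent = [], set(), 0.0
    remaining = set(reach)
    while remaining:
        best, best_ratio = None, 0.0
        for u in sorted(remaining):             # sorted for determinism
            if spent + cost[u] > budget:
                continue                        # would exceed the budget
            gain = len(reach[u] - covered)      # marginal coverage gain
            ratio = gain / cost[u]
            if ratio > best_ratio:
                best, best_ratio = u, ratio
        if best is None:                        # nothing affordable helps
            break
        chosen.append(best)
        covered |= reach[best]
        spent += cost[best]
        remaining.discard(best)
    return chosen, covered
```

Coverage is a standard example of a monotone submodular function, which is why a marginal-gain greedy is a natural baseline here.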

Averaged Collapsed Variational Bayes Inference
平均崩壊変分ベイズ推論

This paper presents the Averaged CVB (ACVB) inference and offers convergence-guaranteed and practically useful fast Collapsed Variational Bayes (CVB) inferences. CVB inferences yield more precise inferences of Bayesian probabilistic models than Variational Bayes (VB) inferences. However, their convergence aspect is fairly unknown and has not been scrutinized. To make CVB more useful, we study their convergence behaviors in an empirical and practical approach. We develop a convergence-guaranteed algorithm for any CVB-based inference called ACVB, which enables automatic convergence detection and frees non-expert practitioners from the difficult and costly manual monitoring of inference processes. In experiments, ACVB inferences are comparable to or better than those of existing inference methods and are deterministic, fast, and provide easier convergence detection. These features are especially convenient for practitioners who want precise Bayesian inference with assured convergence.

この論文では、平均化CVB (ACVB)推論を紹介し、収束が保証され、実用的に有用な高速Collapsed Variational Bayes (CVB)推論を提供します。CVB推論は、Variational Bayes (VB)推論よりも正確なベイズ確率モデルの推論を生成します。しかし、その収束性はかなり不明であり、精査されていません。CVBをより有用にするために、私たちは経験的かつ実践的なアプローチでそれらの収束挙動を研究します。ACVBと呼ばれるCVBベースの推論に対して収束保証アルゴリズムを開発し、自動収束検出を可能にし、専門家でない実務家を推論プロセスの困難でコストのかかる手動監視から解放します。実験では、ACVB推論は既存の推論方法と同等かそれよりも優れており、決定論的で高速であり、収束検出が容易です。これらの機能は、確実な収束を伴う正確なベイズ推論を求める実務家にとって特に便利です。
