Journal of Machine Learning Research Papers: Volume 19の論文一覧

Journal of Machine Learning Research Papers Volume 19に記載されている内容を一覧にまとめ、機械翻訳を交えて日本語化し掲載します。
DALEX: Explainers for Complex Predictive Models in R
DALEX: R での複雑な予測モデルの説明

Predictive modeling is invaded by elastic, yet complex methods such as neural networks or ensembles (model stacking, boosting or bagging). Such methods are usually described by a large number of parameters or hyper parameters – a price that one needs to pay for elasticity. The very number of parameters makes models hard to understand. This paper describes a consistent collection of explainers for predictive models, a.k.a. black boxes. Each explainer is a technique for exploration of a black box model. Presented approaches are model-agnostic, what means that they extract useful information from any predictive method irrespective of its internal structure. Each explainer is linked with a specific aspect of a model. Some are useful in decomposing predictions, some serve better in understanding performance, while others are useful in understanding importance and conditional responses of a particular variable. Every explainer presented here works for a single model or for a collection of models. In the latter case, models can be compared against each other. Such comparison helps to find strengths and weaknesses of different models and gives additional tools for model validation. Presented explainers are implemented in the DALEX package for R. They are based on a uniform standardized grammar of model exploration which may be easily extended.

予測モデリングには、ニューラルネットワークやアンサンブル(モデルスタッキング、ブースティング、バギング)などの、弾力性がありながらも複雑な手法が浸透しています。このような手法は通常、多数のパラメーターまたはハイパーパラメーターで記述されます。これは、弾力性のために支払う必要がある代償です。パラメーターの数が多いため、モデルが理解しにくくなります。この論文では、予測モデル(別名ブラックボックス)の説明者の一貫したコレクションについて説明します。各説明者は、ブラックボックスモデルを探索するための手法です。提示されたアプローチはモデルに依存しません。つまり、内部構造に関係なく、あらゆる予測方法から有用な情報を抽出します。各説明者は、モデルの特定の側面にリンクされています。予測を分解するのに役立つものもあれば、パフォーマンスを理解するのに役立つものもあり、特定の変数の重要性と条件付き応答を理解するのに役立つものもあります。ここで提示されたすべての説明者は、単一のモデルまたはモデルのコレクションで機能します。後者の場合、モデルを相互に比較できます。このような比較は、さまざまなモデルの長所と短所を見つけるのに役立ち、モデル検証のための追加のツールを提供します。提示された説明は、RのDALEXパッケージに実装されています。これらは、簡単に拡張できるモデル探索の統一された標準化された文法に基づいています。

Seglearn: A Python Package for Learning Sequences and Time Series
Seglearn:シーケンスと時系列を学習するためのPythonパッケージ

seglearn is an open-source Python package for performing machine learning on time series or sequences. The implementation provides a flexible pipeline for tackling classification, regression, and forecasting problems with multivariate sequence and contextual data. Sequences and series may be learned directly with deep learning models or via feature representation with classical machine learning estimators. This package is compatible with scikit-learn and is listed under scikit-learn Related Projects. The package depends on numpy, scipy, and scikit-learn. seglearn is distributed under the BSD 3-Clause License. Documentation includes a detailed API description, user guide, and examples. Unit tests provide a high degree of code coverage. Source code and documentation can be downloaded from https://github.com/dmbee/seglearn.

seglearnは、時系列またはシーケンスで機械学習を実行するためのオープンソースのPythonパッケージです。この実装は、多変量シーケンスとコンテキストデータを使用した分類、回帰、および予測の問題に取り組むための柔軟なパイプラインを提供します。シーケンスとシリーズは、ディープラーニングモデルで直接学習することも、従来の機械学習推定器による特徴表現を介して学習することもできます。このパッケージはscikit-learnと互換性があり、scikit-learn関連プロジェクトの下にリストされています。パッケージは、numpy、scipy、scikit-learnに依存します。seglearnはBSD 3-Clause Licenseの下で配布されています。ドキュメントには、詳細なAPIの説明、ユーザーガイド、および例が含まれています。単体テストでは、高度なコードカバレッジが提供されます。ソースコードとドキュメントはhttps://github.com/dmbee/seglearnからダウンロードできます。

Clustering is semidefinitely not that hard: Nonnegative SDP for manifold disentangling
クラスタリングは半無限にそれほど難しくない: 多様体のもつれ解消のための非負の SDP

In solving hard computational problems, semidefinite program (SDP) relaxations often play an important role because they come with a guarantee of optimality. Here, we focus on a popular semidefinite relaxation of K-means clustering which yields the same solution as the non-convex original formulation for well segregated datasets. We report an unexpected finding: when data contains (greater than zero-dimensional) manifolds, the SDP solution captures such geometrical structures. Unlike traditional manifold embedding techniques, our approach does not rely on manually defining a kernel but rather enforces locality via a nonnegativity constraint. We thus call our approach NOnnegative MAnifold Disentangling, or NOMAD. To build an intuitive understanding of its manifold learning capabilities, we develop a theoretical analysis of NOMAD on idealized datasets. While NOMAD is convex and the globally optimal solution can be found by generic SDP solvers with polynomial time complexity, they are too slow for modern datasets. To address this problem, we analyze a non-convex heuristic and present a new, convex and yet efficient, algorithm, based on the conditional gradient method. Our results render NOMAD a versatile, understandable, and powerful tool for manifold learning.

難しい計算問題を解決する場合、最適性が保証される半正定値計画(SDP)緩和法が重要な役割を果たすことがよくあります。ここでは、よく分離されたデータセットに対して非凸の元の定式化と同じソリューションを生成する、K平均法クラスタリングの一般的な半正定値緩和法に焦点を当てます。予期せぬ発見を報告します。データに(0次元より大きい)多様体が含まれている場合、SDPソリューションはそのような幾何学的構造を捉えます。従来の多様体埋め込み手法とは異なり、私たちのアプローチはカーネルを手動で定義するのではなく、非負制約によって局所性を強制します。したがって、このアプローチをNOnnegative MAnifold Disentangling、またはNOMADと呼びます。その多様体学習機能の直感的な理解を深めるために、理想化されたデータセットでのNOMADの理論的分析を開発します。NOMADは凸であり、多項式時間計算量を持つ汎用SDPソルバーでグローバル最適ソリューションを見つけることができますが、最新のデータセットには遅すぎます。この問題に対処するために、非凸ヒューリスティックを分析し、条件付き勾配法に基づく新しい凸型でありながら効率的なアルゴリズムを提示します。私たちの結果により、NOMADは多様体学習のための多様で理解しやすく強力なツールになります。

Improved Asynchronous Parallel Optimization Analysis for Stochastic Incremental Methods
確率的インクリメンタル法のための非同期並列最適化解析の改善

As data sets continue to increase in size and multi-core computer architectures are developed, asynchronous parallel optimization algorithms become more and more essential to the field of Machine Learning. Unfortunately, conducting the theoretical analysis asynchronous methods is difficult, notably due to the introduction of delay and inconsistency in inherently sequential algorithms. Handling these issues often requires resorting to simplifying but unrealistic assumptions. Through a novel perspective, we revisit and clarify a subtle but important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms, and propose a simplification of the recently introduced “perturbed iterate” framework that resolves it. We demonstrate the usefulness of our new framework by analyzing three distinct asynchronous parallel incremental optimization algorithms: Hogwild (asynchronous SGD), KROMAGNON (asynchronous SVRG) and ASAGA, a novel asynchronous parallel version of the incremental gradient algorithm SAGA that enjoys fast linear convergence rates. We are able to both remove problematic assumptions and obtain better theoretical results. Notably, we prove that ASAGA and KROMAGNON can obtain a theoretical linear speedup on multi-core systems even without sparsity assumptions. We present results of an implementation on a 40-core architecture illustrating the practical speedups as well as the hardware overhead. Finally, we investigate the overlap constant, an ill-understood but central quantity for the theoretical analysis of asynchronous parallel algorithms. We find that it encompasses much more complexity than suggested in previous work, and often is order-of-magnitude bigger than traditionally thought.

データセットのサイズが拡大し続け、マルチコアコンピュータアーキテクチャが開発されるにつれて、非同期並列最適化アルゴリズムは機械学習の分野でますます重要になっています。残念ながら、非同期メソッドの理論分析を行うことは困難です。これは、本質的にシーケンシャルなアルゴリズムに遅延と不整合が導入されるためです。これらの問題に対処するには、多くの場合、単純化されているが非現実的な仮定に頼る必要があります。新しい視点から、非同期並列最適化アルゴリズムの最近の収束率証明の大部分に存在する微妙だが重要な技術的問題を再検討して明確にし、最近導入された「摂動反復」フレームワークの単純化を提案して、この問題を解決します。新しいフレームワークの有用性を示すために、3つの異なる非同期並列増分最適化アルゴリズム、Hogwild (非同期SGD)、KROMAGNON (非同期SVRG)、およびASAGA (高速線形収束率を誇る増分勾配アルゴリズムSAGAの新しい非同期並列バージョン)を分析します。問題のある仮定を排除し、より優れた理論的結果を得ることができます。特に、ASAGAとKROMAGNONは、スパース性を仮定しなくても、マルチコアシステムで理論的な線形高速化を実現できることを証明します。40コアアーキテクチャでの実装結果を示し、実際の高速化とハードウェアオーバーヘッドを示します。最後に、非同期並列アルゴリズムの理論的分析において、あまり理解されていないが中心的な量であるオーバーラップ定数を調べます。この定数は、以前の研究で示唆されているよりもはるかに複雑であり、従来考えられていたよりも桁違いに大きいことがよくあります。

Robust PCA by Manifold Optimization
多様体最適化によるロバストPCA

Robust PCA is a widely used statistical procedure to recover an underlying low-rank matrix with grossly corrupted observations. This work considers the problem of robust PCA as a nonconvex optimization problem on the manifold of low-rank matrices and proposes two algorithms based on manifold optimization. It is shown that, with a properly designed initialization, the proposed algorithms are guaranteed to converge to the underlying low-rank matrix linearly. Compared with a previous work based on the factorization of low-rank matrices Yi et al. (2016), the proposed algorithms reduce the dependence on the condition number of the underlying low-rank matrix theoretically. Simulations and real data examples confirm the competitive performance of our method.

ロバストPCAは、大幅に破損した観測値を持つ基礎となる低ランク行列を回復するために広く使用されている統計手順です。この作業では、ロバストPCAの問題を低ランク行列の多様体上の非凸最適化問題と見なし、多様体最適化に基づく2つのアルゴリズムを提案します。適切に設計された初期化により、提案されたアルゴリズムは、基になる低ランク行列に線形に収束することが保証されます。低ランク行列の因数分解に基づく以前の研究Yiら(2016)と比較して、提案されたアルゴリズムは、理論的には基礎となる低ランク行列の条件数への依存性を減らします。シミュレーションと実際のデータの例により、当社の方法の競争力のあるパフォーマンスが確認されています。

A Random Matrix Analysis and Improvement of Semi-Supervised Learning for Large Dimensional Data
大次元データに対する半教師あり学習のランダム行列解析と改良

This article provides an original understanding of the behavior of a class of graph-oriented semi-supervised learning algorithms in the limit of large and numerous data. It is demonstrated that the intuition at the root of these methods collapses in this limit and that, as a result, most of them become inconsistent. Corrective measures and a new data-driven parametrization scheme are proposed along with a theoretical analysis of the asymptotic performances of the resulting approach. A surprisingly close behavior between theoretical performances on Gaussian mixture models and on real data sets is also illustrated throughout the article, thereby suggesting the importance of the proposed analysis for dealing with practical data. As a result, significant performance gains are observed on practical data classification using the proposed parametrization.

この記事では、大規模で多数のデータの限界におけるグラフ指向の半教師あり学習アルゴリズムのクラスの動作について、独自の理解を提供します。これらの方法の根底にある直感は、この限界では崩壊し、その結果、それらのほとんどが一貫性を失うことが実証されています。是正措置と新しいデータ駆動型パラメータ化スキームが提案され、結果として得られるアプローチの漸近性能の理論的分析が提案されます。ガウス混合モデルと実際のデータセットでの理論的パフォーマンスの間の驚くほど近い動作も、記事全体で示されており、したがって、実際のデータを扱うための提案された分析の重要性を示唆しています。その結果、提案されたパラメータ化を使用した実際のデータ分類で、大幅なパフォーマンスの向上が観察されます。

Online Bootstrap Confidence Intervals for the Stochastic Gradient Descent Estimator
確率的勾配降下推定量のオンラインブートストラップ信頼区間

In many applications involving large dataset or online learning, stochastic gradient descent (SGD) is a scalable algorithm to compute parameter estimates and has gained increasing popularity due to its numerical convenience and memory efficiency. While the asymptotic properties of SGD-based estimators have been well established, statistical inference such as interval estimation remains much unexplored. The classical bootstrap is not directly applicable if the data are not stored in memory. The plug-in method is not applicable when there is no explicit formula for the covariance matrix of the estimator. In this paper, we propose an online bootstrap procedure for the estimation of confidence intervals, which, upon the arrival of each observation, updates the SGD estimate as well as a number of randomly perturbed SGD estimates. The proposed method is easy to implement in practice. We establish its theoretical properties for a general class of models that includes linear regressions, generalized linear models, M-estimators and quantile regressions as special cases. The finite-sample performance and numerical utility is evaluated by simulation studies and real data applications.

大規模なデータセットやオンライン学習を伴う多くのアプリケーションでは、確率的勾配降下法(SGD)は、パラメータ推定値を計算するためのスケーラブルなアルゴリズムであり、数値的な利便性とメモリ効率の良さから人気が高まっています。SGDベースの推定量の漸近特性は十分に確立されていますが、区間推定などの統計的推論は未だにほとんど研究されていません。データがメモリに保存されていない場合、従来のブートストラップは直接適用できません。推定量の共分散行列の明示的な式がない場合、プラグイン法は適用できません。この論文では、信頼区間の推定のためのオンラインブートストラップ手順を提案します。この手順では、各観測値の到着時に、SGD推定値と、ランダムに摂動されたいくつかのSGD推定値を更新します。提案された方法は、実際に実装するのが簡単です。特殊なケースとして、線形回帰、一般化線形モデル、M推定量、および分位回帰を含む一般的なモデルのクラスについて、その理論的特性を確立します。有限サンプルのパフォーマンスと数値的有用性は、シミュレーション研究と実際のデータアプリケーションによって評価されます。

A Note on Quickly Sampling a Sparse Matrix with Low Rank Expectation
低ランク期待値を持つスパース行列の高速サンプリングに関する注意

Given matrices $X,Y \in R^{n \times K}$ and $S \in R^{K \times K}$ with positive elements, this paper proposes an algorithm fastRG to sample a sparse matrix $A$ with low rank expectation $E(A) = XSY^T$ and independent Poisson elements. This allows for quickly sampling from a broad class of stochastic blockmodel graphs (degree-corrected, mixed membership, overlapping) all of which are specific parameterizations of the generalized random product graph model defined in Section 2.2. The basic idea of fastRG is to first sample the number of edges $m$ and then sample each edge. The key insight is that because of the the low rank expectation, it is easy to sample individual edges. The naive “element-wise” algorithm requires $O(n^2)$ operations to generate the $n\times n$ adjacency matrix $A$. In sparse graphs, where $m = O(n)$, ignoring log terms, fastRG runs in time $O(n)$. An implementation in R is available on github. A computational experiment in Section 2.4 simulates graphs up to $n=10,000,000$ nodes with $m = 100,000,000$ edges. For example, on a graph with $n=500,000$ and $m = 5,000,000$, fastRG runs in less than one second on a 3.5 GHz Intel i5.

この論文では、正の要素を持つ行列$X,Y \in R^{n \times K}$と$S \in R^{K \times K}$が与えられた場合、低ランク期待値$E(A) = XSY^T$と独立したポアソン要素を持つ疎行列$A$をサンプリングするアルゴリズムfastRGを提案します。これにより、セクション2.2で定義した一般化ランダム積グラフモデルの特定のパラメーター化である、幅広いクラスの確率的ブロックモデルグラフ(次数補正、混合メンバーシップ、オーバーラップ)から迅速にサンプリングできます。fastRGの基本的な考え方は、最初にエッジの数$m$をサンプリングし、次に各エッジをサンプリングすることです。重要な洞察は、ランク期待値が低いため、個々のエッジをサンプリングするのは簡単だということです。単純な「要素ごとの」アルゴリズムでは、$n\times n$隣接行列$A$を生成するために$O(n^2)$操作が必要です。疎グラフでは、対数項を無視すると、fastRGは時間$O(n)$で実行されます。Rでの実装はgithubで入手できます。セクション2.4の計算実験では、最大$n=10,000,000$ノード、$m = 100,000,000$エッジのグラフをシミュレートします。たとえば、$n=500,000$および$m = 5,000,000$のグラフでは、fastRGは3.5 GHz Intel i5で1秒未満で実行されます。

Using Side Information to Reliably Learn Low-Rank Matrices from Missing and Corrupted Observations
サイド情報を使用して、欠落した観測値や破損した観測値から低ランク行列を確実に学習

Learning a low-rank matrix from missing and corrupted observations is a fundamental problem in many machine learning applications. However, the role of side information in low-rank matrix learning has received little attention, and most current approaches are either ad-hoc or only applicable in certain restrictive cases. In this paper, we propose a general model that exploits side information to better learn low-rank matrices from missing and corrupted observations, and show that the proposed model can be further applied to several popular scenarios such as matrix completion and robust PCA. Furthermore, we study the effect of side information on sample complexity and show that by using our model, the efficiency for learning can be improved given sufficiently informative side information. This result thus provides theoretical insight into the usefulness of side information in our model. Finally, we conduct comprehensive experiments in three real-world applications—relationship prediction, semi-supervised clustering and noisy image classification, showing that our proposed model is able to properly exploit side information for more effective learning both in theory and practice.

欠損や破損した観測から低ランク行列を学習することは、多くの機械学習アプリケーションにおける基本的な問題です。しかし、低ランク行列学習におけるサイド情報の役割はほとんど注目されておらず、現在のアプローチのほとんどはアドホックであるか、特定の制限されたケースにのみ適用可能です。この論文では、サイド情報を利用して欠損や破損した観測から低ランク行列をより適切に学習する一般的なモデルを提案し、提案モデルが行列補完やロバストPCAなどのいくつかの一般的なシナリオにさらに適用できることを示します。さらに、サンプルの複雑さに対するサイド情報の影響を調査し、十分な情報量のサイド情報があれば、モデルを使用することで学習の効率を向上できることを示します。したがって、この結果は、モデルにおけるサイド情報の有用性に関する理論的な洞察を提供します。最後に、関係予測、半教師ありクラスタリング、ノイズの多い画像分類という3つの実際のアプリケーションで包括的な実験を行い、提案モデルが理論と実践の両方でより効果的な学習のためにサイド情報を適切に利用できることを示します。

Sparse Estimation in Ising Model via Penalized Monte Carlo Methods
ペナルティモンテカルロ法によるイジングモデルにおけるスパース推定

We consider a model selection problem in high-dimensional binary Markov random fields. The usefulness of the Ising model in studying systems of complex interactions has been confirmed in many papers. The main drawback of this model is the intractable norming constant that makes estimation of parameters very challenging. In the paper we propose a Lasso penalized version of the Monte Carlo maximum likelihood method. We prove that our algorithm, under mild regularity conditions, recognizes the true dependence structure of the graph with high probability. The efficiency of the proposed method is also investigated via numerical studies.

私たちは、高次元バイナリマルコフ確率場でのモデル選択問題を考えます。複雑な相互作用のシステムの研究におけるイジングモデルの有用性は、多くの論文で確認されています。このモデルの主な欠点は、パラメータの推定を非常に困難にする難解なノルム定数です。この論文では、モンテカルロ最尤法のLassoペナルティ付きバージョンを提案します。私たちは、我々のアルゴリズムが、穏やかな規則性条件下で、グラフの真の依存構造を高い確率で認識することを証明します。提案された方法の効率は、数値研究によっても調査されます。

An efficient distributed learning algorithm based on effective local functional approximations
効果的な局所関数近似に基づく効率的な分散学習アルゴリズム

Scalable machine learning over big data is an important problem that is receiving a lot of attention in recent years. On popular distributed environments such as Hadoop running on a cluster of commodity machines, communication costs are substantial and algorithms need to be designed suitably considering those costs. In this paper we give a novel approach to the distributed training of linear classifiers (involving smooth losses and $L_2$ regularization) that is designed to reduce the total communication costs. At each iteration, the nodes minimize locally formed approximate objective functions; then the resulting minimizers are combined to form a descent direction to move. Our approach gives a lot of freedom in the formation of the approximate objective function as well as in the choice of methods to solve them. The method is shown to have $O(\log(1/\epsilon))$ time convergence. The method can be viewed as an iterative parameter mixing method. A special instantiation yields a parallel stochastic gradient descent method with strong convergence. When communication times between nodes are large, our method is much faster than the Terascale method (Agarwal et al., 2011), which is a state of the art distributed solver based on the statistical query model (Chu et al., 2006) that computes function and gradient values in a distributed fashion. We also evaluate against other recent distributed methods and demonstrate superior performance of our method.

ビッグデータに対するスケーラブルな機械学習は、近年多くの注目を集めている重要な問題です。コモディティマシンのクラスターで実行されるHadoopなどの一般的な分散環境では、通信コストがかなりかかるため、アルゴリズムはそれらのコストを考慮して適切に設計する必要があります。この論文では、総通信コストを削減するように設計された線形分類器の分散トレーニング(滑らかな損失と$L_2$正則化を含む)に対する新しいアプローチを示します。各反復で、ノードはローカルに形成された近似目的関数を最小化し、次に結果として得られた最小化器を組み合わせて、移動する降下方向を形成します。私たちのアプローチでは、近似目的関数の形成とそれらを解決するための方法の選択に多くの自由が与えられます。この方法は、$O(\log(1/\epsilon))$時間の収束を示すことが示されています。この方法は、反復パラメータ混合法と見なすことができます。特別なインスタンス化により、強力な収束を持つ並列確率的勾配降下法が得られます。ノード間の通信時間が長い場合、私たちの方法は、分散方式で関数と勾配値を計算する統計クエリモデル(Chuら、2006年)に基づく最先端の分散ソルバーであるTerascale法(Agarwalら、2011年)よりもはるかに高速です。また、他の最近の分散方法と比較して評価し、私たちの方法の優れたパフォーマンスを実証しました。

Optimal Bounds for Johnson-Lindenstrauss Transformations
ジョンソン・リンデンシュトラウス変換の最適境界

In 1984, Johnson and Lindenstrauss proved that any finite set of data in a high-dimensional space can be projected to a lower-dimensional space while preserving the pairwise Euclidean distances between points up to a bounded relative error. If the desired dimension of the image is too small, however, Kane, Meka, and Nelson (2011) and Jayram and Woodruff (2013) proved that such a projection does not exist. In this paper, we provide a precise asymptotic threshold for the dimension of the image, above which, there exists a projection preserving the Euclidean distance, but, below which, there does not exist such a projection.

1984年、ジョンソンとリンデンシュトラウスは、高次元空間の任意の有限データセットを低次元空間に投影できる一方で、ポイント間のペアワイズユークリッド距離を有界相対誤差まで保持できることを証明しました。しかし、画像の望ましい寸法が小さすぎる場合、Kane, Meka, and Nelson (2011)とJayram and Woodruff (2013)は、そのような投影法が存在しないことを証明しました。この論文では、画像の次元の正確な漸近閾値を提供し、それを超えると、ユークリッド距離を保持する投影が存在しますが、その下にそのような投影は存在しません。

Scikit-Multiflow: A Multi-output Streaming Framework
scikit-multiflow:マルチ出力ストリーミングフレームワーク

scikit-multiflow is a framework for learning from data streams and multi-output learning in Python. Conceived to serve as a platform to encourage the democratization of stream learning research, it provides multiple state-of-the-art learning methods, data generators and evaluators for different stream learning problems, including single-output, multi-output and multi-label. scikit-multiflow builds upon popular open source frameworks including scikit-learn, MOA and MEKA. Development follows the FOSS principles. Quality is enforced by complying with PEP8 guidelines, using continuous integration and functional testing.

scikit-multiflowは、Pythonでのデータストリームとマルチ出力学習から学習するためのフレームワークです。ストリーム学習研究の民主化を促進するプラットフォームとして機能するように考案され、単一出力、マルチ出力、マルチラベルなど、さまざまなストリーム学習問題に対して複数の最先端の学習方法、データジェネレーター、評価器を提供します。scikit-multiflowは、scikit-learn、MOA、MEKAなどの一般的なオープンソースフレームワークに基づいて構築されています。開発はFOSSの原則に従います。品質は、継続的インテグレーションと機能テストを使用して、PEP8ガイドラインに準拠することによって強化されます。

Optimal Quantum Sample Complexity of Learning Algorithms
学習アルゴリズムの最適な量子サンプルの複雑さ

In learning theory, the VC dimension of a concept class $C$ is the most common way to measure its “richness.” A fundamental result says that the number of examples needed to learn an unknown target concept $c\in C$ under an unknown distribution $D$, is tightly determined by the VC dimension $d$ of the concept class $C$. Specifically, in the PAC model $$ \Theta\Big(\frac{d}{\epsilon} + \frac{\log(1/\delta)}{\epsilon}\Big) $$ examples are necessary and sufficient for a learner to output, with probability $1-\delta$, a hypothesis $h$ that is $\epsilon$-close to the target concept $c$ (measured under $D$). In the related agnostic model, where the samples need not come from a $c\in C$, we know that $$ \Theta\Big(\frac{d}{\epsilon^2} + \frac{\log(1/\delta)}{\epsilon^2}\Big) $$ examples are necessary and sufficient to output an hypothesis $h\in C$ whose error is at most $\epsilon$ worse than the error of the best concept in $C$. Here we analyze quantum sample complexity, where each example is a coherent quantum state. This model was introduced by Bshouty and Jackson (1999), who showed that quantum examples are more powerful than classical examples in some fixed-distribution settings. However, Atıcı and Servedio (2005), improved by Zhang (2010), showed that in the PAC setting (where the learner has to succeed for every distribution), quantum examples cannot be much more powerful: the required number of quantum examples is $$ \Omega\Big(\frac{d^{1-\eta}}{\epsilon} + d + \frac{\log(1/\delta)}{\epsilon}\Big)\mbox{ for arbitrarily small constant }\eta>0. $$ Our main result is that quantum and classical sample complexity are in fact equal up to constant factors in both the PAC and agnostic models. We give two proof approaches. The first is a fairly simple information-theoretic argument that yields the above two classical bounds and yields the same bounds for quantum sample complexity up to a $\log(d/\epsilon)$ factor. We then give a second approach that avoids the log-factor loss, based on analyzing the behavior of the “Pretty Good Measurement” on the quantum state-identification problems that correspond to learning. This shows classical and quantum sample complexity are equal up to constant factors for every concept class $C$.

学習理論では、概念クラス$C$のVC次元は、その「豊かさ」を測る最も一般的な方法です。基本的な結果によると、未知の分布$D$の下で未知の対象概念$c\in C$を学習するために必要な例の数は、概念クラス$C$のVC次元$d$によって厳密に決まります。具体的には、PACモデル$$ \Theta\Big(\frac{d}{\epsilon} + \frac{\log(1/\delta)}{\epsilon}\Big) $$の例は、学習者が確率$1-\delta$で対象概念$c$に$\epsilon$近い仮説$h$を出力するのに必要かつ十分です($D$の下で測定)。関連する不可知論的モデルでは、サンプルは$c\in C$から取得される必要はなく、$$ \Theta\Big(\frac{d}{\epsilon^2} + \frac{\log(1/\delta)}{\epsilon^2}\Big) $$個のサンプルが、誤差が$C$内の最良概念の誤差より最大でも$\epsilon$悪い仮説$h\in C$を出力するために必要かつ十分であることがわかっています。ここでは、各サンプルがコヒーレントな量子状態である量子サンプル複雑性を解析します。このモデルは、BshoutyとJackson (1999)によって導入され、固定分布の設定によっては量子サンプルが古典的なサンプルよりも強力であることを示しました。しかし、Atıcı とServedio (2005)は、Zhang (2010)によって改良され、PAC設定(学習者がすべての分布で成功する必要がある)では、量子例をそれほど強力にすることはできないことを示しました。必要な量子例の数は、任意の小さな定数}\eta>0に対して$$ \Omega\Big(\frac{d^{1-\eta}}{\epsilon} + d + \frac{\log(1/\delta)}{\epsilon}\Big)\mbox{です。$$私たちの主な結果は、量子サンプル複雑度と古典サンプル複雑度は、PACモデルと不可知論モデルの両方で、定数倍まで実際には等しいということです。私たちは2つの証明アプローチを示します。1つ目は、上記の2つの古典的境界を導き、$\log(d/\epsilon)$倍までの量子サンプル複雑度に対して同じ境界を導き出す、かなり単純な情報理論的議論です。次に、学習に対応する量子状態識別問題における「かなり良い測定」の挙動を分析することに基づいて、対数係数損失を回避する2番目のアプローチを示します。これにより、古典的サンプルの複雑さと量子サンプルの複雑さは、すべての概念クラス$C$に対して定数係数まで等しいことがわかります。

The Implicit Bias of Gradient Descent on Separable Data
分離可能なデータに対する勾配降下法の暗黙的なバイアス

We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution. The result also generalizes to other monotone decreasing loss functions with an infimum at infinity, to multi-class problems, and to training a weight layer in a deep network in a certain restricted setting. Furthermore, we show this convergence is very slow, and only logarithmic in the convergence of the loss itself. This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and, as we show, even if the validation loss increases. Our methodology can also aid in understanding implicit regularization in more complex models and with other optimization methods.

私たちは、不規則化されたロジスティック回帰問題での勾配降下法を、線形分離可能なデータセット上の均質な線形予測子を使用して調べます。予測子が最大マージン(ハードマージンSVM)ソリューションの方向に収束することを示します。この結果は、無限大を無限大とする他の単調な減少損失関数、マルチクラス問題、および特定の制限された設定の深層ネットワークでの重み層の学習にも一般化されます。さらに、この収束は非常に遅く、損失自体の収束において対数的であることを示しています。これは、学習誤差がゼロで学習損失が非常に小さい後でも、また、私たちが示しているように、検証損失が増加しても、ロジスティック損失またはクロスエントロピー損失を最適化し続けることの利点を説明するのに役立ちます。私たちの方法論は、より複雑なモデルや他の最適化手法での陰的正則化を理解するのにも役立ちます。

Inverse Reinforcement Learning via Nonparametric Spatio-Temporal Subgoal Modeling
ノンパラメトリック時空間サブゴールモデリングによる逆強化学習

Advances in the field of inverse reinforcement learning (IRL) have led to sophisticated inference frameworks that relax the original modeling assumption of observing an agent behavior that reflects only a single intention. Instead of learning a global behavioral model, recent IRL methods divide the demonstration data into parts, to account for the fact that different trajectories may correspond to different intentions, e.g., because they were generated by different domain experts. In this work, we go one step further: using the intuitive concept of subgoals, we build upon the premise that even a single trajectory can be explained more efficiently locally within a certain context than globally, enabling a more compact representation of the observed behavior. Based on this assumption, we build an implicit intentional model of the agent’s goals to forecast its behavior in unobserved situations. The result is an integrated Bayesian prediction framework that significantly outperforms existing IRL solutions and provides smooth policy estimates consistent with the expert’s plan. Most notably, our framework naturally handles situations where the intentions of the agent change over time and classical IRL algorithms fail. In addition, due to its probabilistic nature, the model can be straightforwardly applied in active learning scenarios to guide the demonstration process of the expert.

逆強化学習(IRL)の分野における進歩により、単一の意図のみを反映するエージェントの行動を観察するという元のモデリングの仮定を緩和する洗練された推論フレームワークが生まれました。グローバルな行動モデルを学習する代わりに、最近のIRL手法では、異なる軌跡が、たとえば異なるドメインの専門家によって生成されたために、異なる意図に対応する可能性があるという事実を考慮するために、デモンストレーションデータを複数の部分に分割します。この研究では、さらに一歩進んで、サブゴールの直感的な概念を使用して、単一の軌跡であっても、特定のコンテキスト内でグローバルよりもローカルに説明できるという前提を構築し、観察された行動をよりコンパクトに表現できるようにします。この仮定に基づいて、エージェントの目標の暗黙的な意図モデルを構築し、観察されていない状況でのエージェントの行動を予測します。その結果、既存のIRLソリューションを大幅に上回り、専門家の計画と一致するスムーズなポリシー推定を提供する統合ベイズ予測フレームワークが生まれました。最も注目すべきは、私たちのフレームワークは、エージェントの意図が時間の経過とともに変化し、従来のIRLアルゴリズムが失敗する状況を自然に処理することです。さらに、その確率的な性質により、このモデルはアクティブラーニングのシナリオに直接適用でき、専門家のデモンストレーションプロセスをガイドすることができます。

Multivariate Bayesian Structural Time Series Model
多変量ベイズ構造時系列モデル

This paper deals with inference and prediction for multiple correlated time series, where one also has the choice of using a candidate pool of contemporaneous predictors for each target series. Starting with a structural model for time series, we use Bayesian tools for model fitting, prediction and feature selection, thus extending some recent works along these lines for the univariate case. The Bayesian paradigm in this multivariate setting helps the model avoid overfitting, as well as captures correlations among multiple target time series with various state components. The model provides needed flexibility in selecting a different set of components and available predictors for each target series. The cyclical component in the model can handle large variations in the short term, which may be caused by external shocks. Extensive simulations were run to investigate properties such as estimation accuracy and performance in forecasting. This was followed by an empirical study with one-step-ahead prediction on the max log return of a portfolio of stocks that involve four leading financial institutions. Both the simulation studies and the extensive empirical study confirm that this multivariate model outperforms three other benchmark models, viz. a model that treats each target series as independent, the autoregressive integrated moving average model with regression (ARIMAX), and the multivariate ARIMAX (MARIMAX) model.

この論文では、相関のある複数の時系列の推論と予測について取り上げます。各ターゲット系列に対して、同時予測子の候補プールを使用することもできます。時系列の構造モデルから始めて、モデルのフィッティング、予測、および特徴選択にベイジアンツールを使用します。これにより、単変量の場合にこの方向の最近の研究を拡張します。この多変量設定でのベイジアンパラダイムは、モデルが過剰適合するのを防ぎ、さまざまな状態コンポーネントを持つ複数のターゲット時系列間の相関関係を捉えるのに役立ちます。このモデルは、各ターゲット系列に対して異なるコンポーネントセットと利用可能な予測子を選択する際に必要な柔軟性を提供します。モデルの周期的コンポーネントは、外部ショックによって引き起こされる可能性のある短期的な大きな変動を処理できます。推定精度や予測のパフォーマンスなどの特性を調査するために、広範なシミュレーションが実行されました。その後、4つの大手金融機関を含む株式ポートフォリオの最大対数リターンを1ステップ先予測する実証研究が行われました。シミュレーション研究と広範な実証研究の両方により、この多変量モデルが、各ターゲットシリーズを独立したものとして扱うモデル、回帰付き自己回帰積分移動平均モデル(ARIMAX)、および多変量ARIMAX (MARIMAX)モデルという他の3つのベンチマークモデルよりも優れていることが確認されました。

Efficient Bayesian Inference of Sigmoidal Gaussian Cox Processes
シグモイド・ガウス・コックス過程の効率的なベイズ推定

We present an approximate Bayesian inference approach for estimating the intensity of a inhomogeneous Poisson process, where the intensity function is modelled using a Gaussian process (GP) prior via a sigmoid link function. Augmenting the model using a latent marked Poisson process and Polya–Gamma random variables we obtain a representation of the likelihood which is conjugate to the GP prior. We estimate the posterior using a variational free–form mean field optimisation together with the framework of sparse GPs. Furthermore, as alternative approximation we suggest a sparse Laplace’s method for the posterior, for which an efficient expectation–maximisation algorithm is derived to find the posterior’s mode. Both algorithms compare well against exact inference obtained by a Markov Chain Monte Carlo sampler and standard variational Gauss approach solving the same model, while being one order of magnitude faster. Furthermore, the performance and speed of our method is competitive with that of another recently proposed Poisson process model based on a quadratic link function, while not being limited to GPs with squared exponential kernels and rectangular domains.

私たちは、非同次ポアソン過程の強度を推定するための近似ベイズ推論アプローチを提示します。このアプローチでは、強度関数は、シグモイドリンク関数を介してガウス過程(GP)事前分布を使用してモデル化されます。潜在マーク付きポアソン過程とポリアガンマランダム変数を使用してモデルを拡張すると、GP事前分布と共役な尤度の表現が得られます。私たちは、スパースGPのフレームワークとともに変分自由形式平均場最適化を使用して事後分布を推定します。さらに、代替近似として、事後分布のスパースラプラス法を提案します。これにより、事後分布のモードを見つけるための効率的な期待値最大化アルゴリズムが導出されます。両方のアルゴリズムは、同じモデルを解くマルコフ連鎖モンテカルロサンプラーと標準変分ガウスアプローチによって得られる正確な推論に匹敵し、1桁高速です。さらに、私たちの方法のパフォーマンスと速度は、2乗指数カーネルと長方形ドメインを持つGPに限定されず、最近提案された二次リンク関数に基づく別のポアソン過程モデルと競合します。

Inference via Low-Dimensional Couplings
低次元結合による推論

We investigate the low-dimensional structure of deterministic transformations between random variables, i.e., transport maps between probability measures. In the context of statistics and machine learning, these transformations can be used to couple a tractable “reference” measure (e.g., a standard Gaussian) with a target measure of interest. Direct simulation from the desired measure can then be achieved by pushing forward reference samples through the map. Yet characterizing such a map—e.g., representing and evaluating it—grows challenging in high dimensions. The central contribution of this paper is to establish a link between the Markov properties of the target measure and the existence of low-dimensional couplings, induced by transport maps that are sparse and/or decomposable. Our analysis not only facilitates the construction of transformations in high-dimensional settings, but also suggests new inference methodologies for continuous non-Gaussian graphical models. For instance, in the context of nonlinear state-space models, we describe new variational algorithms for filtering, smoothing, and sequential parameter inference. These algorithms can be understood as the natural generalization—to the non-Gaussian case—of the square-root Rauch–Tung–Striebel Gaussian smoother.

私たちは、ランダム変数間の決定論的変換、すなわち確率測度間の輸送マップの低次元構造を調査します。統計学と機械学習の文脈では、これらの変換は、扱いやすい「参照」測度（例えば、標準ガウス）を関心のあるターゲット測度と結合するために使用できます。次に、マップを介して参照サンプルをプッシュすることで、目的の測度からの直接シミュレーションを実現できます。しかし、そのようなマップの特性評価（例えば、表現および評価）は、高次元では困難になります。この論文の中心的な貢献は、ターゲット測度のマルコフ特性と、スパースおよび/または分解可能な輸送マップによって誘発される低次元結合の存在との間のリンクを確立することです。私たちの分析は、高次元設定での変換の構築を容易にするだけでなく、連続非ガウスグラフィカルモデルのための新しい推論方法論も示唆します。たとえば、非線形状態空間モデルの文脈では、フィルタリング、スムージング、およびシーケンシャルパラメータ推論のための新しい変分アルゴリズムについて説明します。これらのアルゴリズムは、平方根Rauch-Tung-Striebelガウス平滑化器の非ガウスケースへの自然な一般化として理解できます。

Extrapolating Expected Accuracies for Large Multi-Class Problems
大規模なマルチクラス問題に対する期待精度の推定

The difficulty of multi-class classification generally increases with the number of classes. Using data for a small set of the classes, can we predict how well the classifier scales as the number of classes increases? We propose a framework for studying this question, assuming that classes in both sets are sampled from the same population and that the classifier is based on independently learned scoring functions. Under this framework, we can express the classification accuracy on a set of $k$ classes as the $(k – 1)$st moment of a discriminability function; the discriminability function itself does not depend on $k$. We leverage this result to develop a non-parametric regression estimator for the discriminability function, which can extrapolate accuracy results to larger unobserved sets. We also formalize an alternative approach that extrapolates accuracy separately for each class, and identify tradeoffs between the two methods. We show that both methods can accurately predict classifier performance on label sets up to ten times the size of the original set, both in simulations as well as in realistic face recognition or character recognition tasks.

マルチクラス分類の難しさは、一般にクラスの数が増えるにつれて増大します。クラスの小さなセットのデータを使用して、クラス数の増加に応じて分類器がどの程度拡張されるかを予測できますか?両方のセットのクラスが同じ母集団からサンプリングされ、分類器が独立して学習されたスコアリング関数に基づいていると仮定して、この問題を研究するためのフレームワークを提案します。このフレームワークでは、k個のクラスのセットの分類精度を、識別可能性関数の(k – 1)番目のモーメントとして表すことができます。識別可能性関数自体はkに依存しません。この結果を利用して、識別可能性関数のノンパラメトリック回帰推定量を開発します。これにより、精度の結果をより大きな未観測セットに外挿できます。また、クラスごとに精度を個別に外挿する代替アプローチを形式化し、2つの方法間のトレードオフを特定します。シミュレーションと実際の顔認識または文字認識タスクの両方で、両方の方法が元のセットの最大10倍のサイズのラベルセットで分類器のパフォーマンスを正確に予測できることを示します。

Scaling up Data Augmentation MCMC via Calibration
キャリブレーションによるデータ拡張MCMCのスケールアップ

There has been considerable interest in making Bayesian inference more scalable. In big data settings, most of the focus has been on reducing the computing time per iteration rather than reducing the number of iterations needed in Markov chain Monte Carlo (MCMC). This article considers data augmentation MCMC (DA-MCMC), a widely used technique. DA-MCMC samples tend to become highly autocorrelated in large samples, due to a mis-calibration problem in which conditional posterior distributions given augmented data are too concentrated. This makes it necessary to collect very long MCMC paths to obtain acceptably low MC error. To combat this inefficiency, we propose a family of calibrated data augmentation algorithms, which appropriately adjust the variance of conditional posterior distributions. A Metropolis-Hastings step is used to eliminate bias in the stationary distribution of the resulting sampler. Compared to existing alternatives, this approach can dramatically reduce MC error by reducing autocorrelation and increasing the effective number of DA-MCMC samples per unit of computing time. The approach is simple and applicable to a broad variety of existing data augmentation algorithms. We focus on three popular generalized linear models: probit, logistic and Poisson log-linear. Dramatic gains in computational efficiency are shown in applications.

ベイズ推論をよりスケーラブルにすることに大きな関心が寄せられています。ビッグデータの設定では、マルコフ連鎖モンテカルロ(MCMC)で必要な反復回数を減らすことよりも、反復あたりの計算時間を減らすことに重点が置かれてきました。この記事では、広く使用されている手法であるデータ拡張MCMC (DA-MCMC)について検討します。DA-MCMCサンプルは、拡張データが与えられた条件付き事後分布が集中しすぎるという誤った較正問題により、大規模なサンプルでは自己相関が高くなる傾向があります。このため、許容できるほど低いMCエラーを得るには、非常に長いMCMCパスを収集する必要があります。この非効率性に対処するために、条件付き事後分布の分散を適切に調整する較正済みデータ拡張アルゴリズムのファミリを提案します。メトロポリス-ヘイスティングスステップを使用して、結果として得られるサンプラーの定常分布の偏りを排除します。既存の代替方法と比較すると、このアプローチは自己相関を減らし、計算時間単位あたりのDA-MCMCサンプルの有効数を増やすことで、MCエラーを大幅に削減できます。このアプローチはシンプルで、既存のさまざまなデータ拡張アルゴリズムに適用できます。ここでは、プロビット、ロジスティック、ポアソン対数線形という3つの一般的な一般化線形モデルに焦点を当てます。アプリケーションでは、計算効率の大幅な向上が示されています。

Short-term Sparse Portfolio Optimization Based on Alternating Direction Method of Multipliers
乗算器の交互方向法に基づく短期スパースポートフォリオ最適化

We propose a short-term sparse portfolio optimization (SSPO) system based on alternating direction method of multipliers (ADMM). Although some existing strategies have also exploited sparsity, they either constrain the quantity of the portfolio change or aim at the long-term portfolio optimization. Very few of them are dedicated to constructing sparse portfolios for the short-term portfolio optimization, which will be complemented by the proposed SSPO. SSPO concentrates wealth on a small proportion of assets that have good increasing potential according to some empirical financial principles, so as to maximize the cumulative wealth for the whole investment. We also propose a solving algorithm based on ADMM to handle the $\ell^1$-regularization term and the self-financing constraint simultaneously. As a significant improvement in the proposed ADMM, we have proven that its augmented Lagrangian has a saddle point, which is the foundation of the iterative formulae of ADMM but is seldom addressed by other sparsity strategies. Extensive experiments on $5$ benchmark data sets from real-world stock markets show that SSPO outperforms other state-of-the-art systems in thorough evaluations, withstands reasonable transaction costs and runs fast. Thus it is suitable for real-world financial environments.

私たちは、交互方向乗数法(ADMM)に基づく短期スパースポートフォリオ最適化(SSPO)システムを提案します。既存の戦略でもスパース性を活用していますが、それらはポートフォリオ変更の量を制限するか、長期ポートフォリオ最適化を目指しています。短期ポートフォリオ最適化のためのスパースポートフォリオの構築に特化したものはほとんどありませんが、提案するSSPOによって補完されます。SSPOは、いくつかの経験的金融原則に従って、増加の可能性が高い資産の小さな部分に富を集中させ、投資全体の累積富を最大化します。また、$\ell^1$正則化項と自己資金調達制約を同時に処理するためのADMMに基づく解決アルゴリズムも提案します。提案されたADMMの重要な改善点として、その拡張ラグランジアンには鞍点があることを証明しました。これはADMMの反復式の基礎ですが、他のスパース戦略ではほとんど扱われていません。現実世界の株式市場からの5つのベンチマークデータセットに対する広範な実験により、SSPOは徹底的な評価において他の最先端システムよりも優れており、妥当な取引コストに耐え、高速に実行されることが示されています。したがって、SSPOは現実世界の金融環境に適しています。

Hinge-Minimax Learner for the Ensemble of Hyperplanes
超平面の集合に対するヒンジ・ミニマックス学習器

In this work we consider non-linear classifiers that comprise intersections of hyperplanes. We learn these classifiers by minimizing the “minimax” bound over the negative training examples and the hinge type loss of the positive training examples. These classifiers fit typical real-life datasets that consist of a small number of positive data points and a large number of negative data points. Such an approach is computationally appealing since the majority of training examples (belonging to the negative class) are represented by the statistics of their distribution, which is used in a single constraint on the empirical risk, as opposed to SVM, in which the number of variables is equal to the size of the training set. We first focus on intersection of $K$ hyperplanes, for which we provide empirical risk bounds. We show that these bounds are dimensionally independent and decay as $K/\sqrt{m}$ for $m$ samples. We then extend the K-hyperplane mixed risk to the latent mixed risk for training a union of $C$ $K$-hyperplane models, which can form an arbitrary complex, piecewise linear boundaries. We propose efficient algorithms for training the proposed models. Finally, we show how to combine hinge-minimax training with deep architectures and extend it to multi-class settings using transfer learning. The empirical evaluation of the proposed models shows their advantage over the existing methods in a small training labeled data regime.

この研究では、、超平面の交差で構成される非線形分類器を検討します。これらの分類器は、負のトレーニング例の「ミニマックス」境界と正のトレーニング例のヒンジ型損失を最小化することで学習します。これらの分類器は、少数の正のデータポイントと多数の負のデータポイントで構成される典型的な現実のデータセットに適合します。このようなアプローチは、トレーニング例の大部分(負のクラスに属する)が分布の統計によって表されるため、計算上魅力的です。これは、変数の数がトレーニングセットのサイズに等しいSVMとは対照的に、経験的リスクに対する単一の制約で使用されます。まず、経験的リスクの境界を提供する$K$超平面の交差に焦点を当てます。これらの境界は次元的に独立しており、$m$個のサンプルに対して$K/\sqrt{m}$として減衰することを示します。次に、K超平面混合リスクを、任意の複雑な区分線形境界を形成できる$C$ $K$超平面モデルの結合をトレーニングするための潜在的混合リスクに拡張します。提案モデルをトレーニングするための効率的なアルゴリズムを提案します。最後に、ヒンジミニマックストレーニングをディープアーキテクチャと組み合わせ、転移学習を使用してマルチクラス設定に拡張する方法を示します。提案モデルの実証的評価により、小規模なトレーニングラベル付きデータレジームにおいて、既存の方法よりも優れていることが示されました。

Simple Classification Using Binary Data
バイナリデータを使用した単純な分類

Binary, or one-bit, representations of data arise naturally in many applications, and are appealing in both hardware implementations and algorithm design. In this work, we study the problem of data classification from binary data obtained from the sign pattern of low-dimensional projections and propose a framework with low computation and resource costs. We illustrate the utility of the proposed approach through stylized and realistic numerical experiments, and provide a theoretical analysis for a simple case. We hope that our framework and analysis will serve as a foundation for studying similar types of approaches.

データのバイナリ表現(1ビット)表現は、多くのアプリケーションで自然に発生し、ハードウェア実装とアルゴリズム設計の両方で魅力的です。この研究では、低次元投影の符号パターンから得られるバイナリデータからのデータ分類の問題を検討し、計算コストとリソースコストが低いフレームワークを提案します。提案されたアプローチの有用性を、定型化された現実的な数値実験を通じて説明し、単純なケースの理論的分析を提供します。私たちのフレームワークと分析が、同様のタイプのアプローチを研究するための基盤となることを願っています。

A New and Flexible Approach to the Analysis of Paired Comparison Data
対応比較データの分析に対する新しい柔軟なアプローチ

We consider the situation where $I$ items are ranked by paired comparisons. It is usually assumed that the probability that item $i$ is preferred over item $j$ is $p_{ij}=F(\mu_i-\mu_j)$ where $F$ is a symmetric distribution function, which we refer to as the comparison function, and $\mu_i$ and $\mu_j$ are the merits or scores of the compared items. This modelling framework, which is ubiquitous in the paired comparison literature, strongly depends on the assumption that the comparison function $F$ is known. In practice, however, this assumption is often unrealistic and may result in poor fit and erroneous inferences. This limitation has motivated us to relax the assumption that $F$ is fully known and simultaneously estimate the merits of the objects and the underlying comparison function. Our formulation yields a flexible semi-definite programming problem that we use as a refinement step for estimating the paired comparison probability matrix. We provide a detailed sensitivity analysis and, as a result, we establish the consistency of the resulting estimators and provide bounds on the estimation and approximation errors. Some statistical properties of the resulting estimators as well as model selection criteria are investigated. Finally, using a large data-set of computer chess matches, we estimate the comparison function and find that the model used by the International Chess Federation does not seem to apply to computer chess.

私たちは、一対比較によって$I$個のアイテムがランク付けされる状況について考えます。通常、アイテム$i$がアイテム$j$よりも好まれる確率は$p_{ij}=F(\mu_i-\mu_j)$であると想定されます。ここで、$F$は対称分布関数(比較関数と呼びます)、$\mu_i$と$\mu_j$は比較されるアイテムのメリットまたはスコアです。一対比較の文献で広く使用されているこのモデリングフレームワークは、比較関数$F$が既知であるという仮定に大きく依存しています。ただし、実際には、この仮定は非現実的であることが多く、適合性が悪く、推論が誤っている可能性があります。この制限により、$F$が完全に既知であるという仮定を緩和し、オブジェクトのメリットと基礎となる比較関数を同時に推定するようになりました。この定式化により、柔軟な半正定値計画問題が生成され、これを一対比較確率行列の推定の改良ステップとして使用します。詳細な感度分析を行い、その結果、得られた推定値の一貫性を確立し、推定および近似誤差の境界を示します。得られた推定値のいくつかの統計的特性とモデル選択基準を調査します。最後に、コンピューターチェスの試合の大規模なデータセットを使用して比較関数を推定し、国際チェス連盟が使用するモデルはコンピューターチェスには適用できないことがわかりました。

Maximum Selection and Sorting with Adversarial Comparators
敵対的コンパレータによる最大選択とソート

We study maximum selection and sorting of $n$ numbers using imperfect pairwise comparators. The imperfect comparator returns the larger of the two inputs if the inputs are more than a given threshold apart and an adversarially-chosen input otherwise. We consider two adversarial models: a non-adaptive adversary that decides on the outcomes in advance and an adaptive adversary that decides on the outcome of each comparison depending on the previous comparisons and outcomes. Against the non-adaptive adversary, we derive a maximum-selection algorithm that uses at most $2n$ comparisons in expectation and a sorting algorithm that uses at most $2n\ln n$ comparisons in expectation. In the presence of the adaptive adversary, the proposed maximum-selection algorithm uses $\Theta(n\log (1/{\epsilon}))$ comparisons to output a correct answer with probability at least $1-\epsilon$, resolving an open problem in Ajtai et al. (2015). Our study is motivated by a density-estimation problem. Given samples from an unknown distribution, we would like to find a distribution among a known class of $n$ candidate distributions that is close to the underlying distribution in $\ell_1$ distance. Scheffe’s algorithm, for example, in Devroye and Lugosi (2001) outputs a distribution at an $\ell_1$ distance at most 9 times the minimum and runs in time $\Theta(n^2\log n)$. Using our algorithm, the runtime reduces to $\Theta(n\log n)$.

私たちは、不完全なペアワイズ比較器を使用して、$n$個の数値の最大選択とソートを研究します。不完全な比較器は、入力が所定のしきい値以上離れている場合は2つの入力のうち大きい方を返し、そうでない場合は敵対的に選択された入力を返す。我々は2つの敵対モデルを検討します。1つは事前に結果を決定する非適応型敵対者、もう1つは以前の比較と結果に応じて各比較の結果を決定する適応型敵対者です。非適応型敵対者に対して、最大で$2n$回の比較を期待して使用する最大選択アルゴリズム、もう1つは最大で$2n\ln n$回の比較を期待して使用するソートアルゴリズムを導出します。適応型敵対者が存在する場合、提案された最大選択アルゴリズムは$\Theta(n\log (1/{\epsilon}))$回の比較を使用して、少なくとも$1-\epsilon$の確率で正しい答えを出力し、Ajtaiらの未解決の問題を解決します。(2015)。私たちの研究は密度推定問題に端を発しています。未知の分布からのサンプルが与えられた場合、既知のクラスの$n$個の候補分布の中から、$\ell_1$距離で基礎分布に近い分布を見つけたいと考えます。たとえば、DevroyeとLugosi (2001)のScheffeのアルゴリズムは、最大で最小値の9倍の$\ell_1$距離の分布を出力し、実行時間は$\Theta(n^2\log n)$です。私たちのアルゴリズムを使用すると、実行時間は$\Theta(n\log n)$に短縮されます。

Theoretical Analysis of Cross-Validation for Estimating the Risk of the K-Nearest Neighbor Classifier
K-最近傍分類器のリスク推定のための交差検証の理論的解析

The present work aims at deriving theoretical guaranties on the behavior of some cross-validation procedures applied to the $k$-nearest neighbors ($k$NN) rule in the context of binary classification. Here we focus on the leave-$p$-out cross-validation (L$p$O) used to assess the performance of the $k$NN classifier. Remarkably this L$p$O estimator can be efficiently computed in this context using closed-form formulas derived by Celisse and Mary-Huard (2011). We describe a general strategy to derive moment and exponential concentration inequalities for the L$p$O estimator applied to the $k$NN classifier. Such results are obtained first by exploiting the connection between the L$p$O estimator and U-statistics, and second by making an intensive use of the generalized Efron-Stein inequality applied to the L$1$O estimator. One other important contribution is made by deriving new quantifications of the discrepancy between the L$p$O estimator and the classification error/risk of the $k$NN classifier. The optimality of these bounds is discussed by means of several lower bounds as well as simulation experiments.

本研究の目的は、バイナリ分類のコンテキストで$k$-近傍法($k$NN)に適用されるいくつかのクロスバリデーション手順の動作に関する理論的保証を導出することです。ここでは、$k$NN分類器のパフォーマンスを評価するために使用される、leave-p$-outクロスバリデーション(L$p$O)に焦点を当てます。注目すべきことに、このL$p$O推定量は、CelisseとMary-Huard (2011)によって導出された閉形式の式を使用して、このコンテキストで効率的に計算できます。$k$NN分類器に適用されるL$p$O推定量のモーメントおよび指数濃度不等式を導出する一般的な戦略について説明します。このような結果は、まずL$p$O推定量とU統計量との関係を利用し、次にL$1$O推定量に適用される一般化Efron-Stein不等式を集中的に使用することで得られます。もう1つの重要な貢献は、L$p$O推定値と$k$NN分類器の分類エラー/リスクとの間の相違の新しい定量化を導出することです。これらの境界の最適性は、いくつかの下限値とシミュレーション実験によって議論されます。

On Semiparametric Exponential Family Graphical Models
セミパラメトリック指数族グラフィカルモデルについて

We propose a new class of semiparametric exponential family graphical models for the analysis of high dimensional mixed data. Different from the existing mixed graphical models, we allow the nodewise conditional distributions to be semiparametric generalized linear models with unspecified base measure functions. Thus, one advantage of our method is that it is unnecessary to specify the type of each node and the method is more convenient to apply in practice. Under the proposed model, we consider both problems of parameter estimation and hypothesis testing in high dimensions. In particular, we propose a symmetric pairwise score test for the presence of a single edge in the graph. Compared to the existing methods for hypothesis tests, our approach takes into account of the symmetry of the parameters, such that the inferential results are invariant with respect to the different parametrizations of the same edge. Thorough numerical simulations and a real data example are provided to back up our theoretical results.

私たちは、高次元混合データの分析用に、新しいクラスのセミパラメトリック指数族グラフィカルモデルを提案します。既存の混合グラフィカルモデルとは異なり、ノードごとの条件付き分布を、基本測定関数が指定されていないセミパラメトリック一般化線形モデルにすることができます。したがって、この方法の利点の1つは、各ノードのタイプを指定する必要がなく、実際に適用するのがより便利なことです。提案されたモデルでは、高次元でのパラメーター推定と仮説検定の両方の問題を考慮します。特に、グラフに単一のエッジが存在するかどうかの対称ペアワイズスコア検定を提案します。仮説検定の既存の方法と比較して、私たちのアプローチではパラメーターの対称性を考慮しているため、推論結果は同じエッジの異なるパラメーター化に関して不変です。理論的な結果を裏付けるために、徹底した数値シミュレーションと実際のデータ例が提供されています。

Modular Proximal Optimization for Multidimensional Total-Variation Regularization
多次元全変動正則化のためのモジュール式近位最適化

We study TV regularization, a widely used technique for eliciting structured sparsity. In particular, we propose efficient algorithms for computing prox-operators for $\ell_p$-norm TV. The most important among these is $\ell_1$-norm TV, for whose prox-operator we present a new geometric analysis which unveils a hitherto unknown connection to taut-string methods. This connection turns out to be remarkably useful as it shows how our geometry guided implementation results in efficient weighted and unweighted 1D-TV solvers, surpassing state-of-the-art methods. Our 1D-TV solvers provide the backbone for building more complex (two or higher-dimensional) TV solvers within a modular proximal optimization approach. We review the literature for an array of methods exploiting this strategy, and illustrate the benefits of our modular design through extensive suite of experiments on (i) image denoising, (ii) image deconvolution, (iii) four variants of fused-lasso, and (iv) video denoising. To underscore our claims and permit easy reproducibility, we provide all the reviewed and our new TV solvers in an easy to use multi-threaded C++, Matlab and Python library.

私たちは、構造化されたスパース性を引き出すために広く使用されている手法であるTV正則化を研究しています。特に、$\ell_p$ノルムTVの近接演算子を計算するための効率的なアルゴリズムを提案しています。これらの中で最も重要なのは$\ell_1$ノルムTVです。この近接演算子に対して、これまで知られていなかったタウトストリング法との関連を明らかにする新しい幾何学的分析を提示します。この関連は、幾何学に基づく実装が最先端の方法を上回る効率的な重み付きおよび重みなしの1D-TVソルバーをもたらす方法を示しているため、非常に有用であることがわかりました。我々の1D-TVソルバーは、モジュラー近接最適化アプローチ内でより複雑な(2次元以上の) TVソルバーを構築するためのバックボーンを提供します。私たちは、この戦略を活用したさまざまな方法についての文献をレビューし、(i)画像ノイズ除去、(ii)画像デコンボリューション、(iii)融合ラッソの4つのバリエーション、(iv)ビデオノイズ除去に関する広範な一連の実験を通じて、モジュール設計の利点を説明します。私たちの主張を強調し、簡単に再現できるようにするために、レビューしたすべてのTVソルバーと新しいTVソルバーを、使いやすいマルチスレッドのC++、Matlab、Pythonライブラリで提供します。

Fast MCMC Sampling Algorithms on Polytopes
ポリトープ上の高速MCMCサンプリングアルゴリズム

We propose and analyze two new MCMC sampling algorithms, the Vaidya walk and the John walk, for generating samples from the uniform distribution over a polytope. Both random walks are sampling algorithms derived from interior point methods. The former is based on volumetric-logarithmic barrier introduced by Vaidya whereas the latter uses John’s ellipsoids. We show that the Vaidya walk mixes in significantly fewer steps than the logarithmic-barrier based Dikin walk studied in past work. For a polytope in $\mathbb{R}^d$ defined by $n > d$ linear constraints, we show that the mixing time from a warm start is bounded as $\mathcal{O}(n^{0.5}d^{1.5})$, compared to the $\mathcal{O}(nd)$ mixing time bound for the Dikin walk. The cost of each step of the Vaidya walk is of the same order as the Dikin walk, and at most twice as large in terms of constant pre-factors. For the John walk, we prove an $\mathcal{O}(d^{2.5}\cdot\log^4(n/d))$ bound on its mixing time and conjecture that an improved variant of it could achieve a mixing time of $\mathcal{O}(d^{2}\cdot\text{poly-log}(n/d))$. Additionally, we propose variants of the Vaidya and John walks that mix in polynomial time from a deterministic starting point. The speed-up of the Vaidya walk over the Dikin walk are illustrated in numerical examples.

私たちは、多面体上の一様分布からサンプルを生成するための2つの新しいMCMCサンプリングアルゴリズム、VaidyaウォークとJohnウォークを提案し、分析します。両方のランダムウォークは、内点法から派生したサンプリングアルゴリズムです。前者はVaidyaによって導入された体積対数障壁に基づいており、後者はJohnの楕円体を使用しています。私たちは、Vaidyaウォークが、過去の研究で研究された対数障壁ベースのDikinウォークよりも大幅に少ないステップで混合することを示す。$n > d$線形制約によって定義された$\mathbb{R}^d$内の多面体の場合、ウォームスタートからの混合時間は、Dikinウォークの混合時間境界$\mathcal{O}(nd)$と比較して、$\mathcal{O}(n^{0.5}d^{1.5})$に制限されることを示す。Vaidyaウォークの各ステップのコストはDikinウォークと同じオーダーであり、定数前因子に関しては最大で2倍の大きさです。Johnウォークについては、混合時間に対する$\mathcal{O}(d^{2.5}\cdot\log^4(n/d))$の境界を証明し、その改良版では混合時間が$\mathcal{O}(d^{2}\cdot\text{poly-log}(n/d))$になる可能性があると推測します。さらに、決定論的な開始点から多項式時間で混合するVaidyaウォークとJohnウォークの変種を提案します。Dikinウォークに対するVaidyaウォークの高速化は、数値例で示されます。

How Deep Are Deep Gaussian Processes?
ディープガウス過程の深さはどれくらいですか?

Recent research has shown the potential utility of deep Gaussian processes. These deep structures are probability distributions, designed through hierarchical construction, which are conditionally Gaussian. In this paper, the current published body of work is placed in a common framework and, through recursion, several classes of deep Gaussian processes are defined. The resulting samples generated from a deep Gaussian process have a Markovian structure with respect to the depth parameter, and the effective depth of the resulting process is interpreted in terms of the ergodicity, or non-ergodicity, of the resulting Markov chain. For the classes of deep Gaussian processes introduced, we provide results concerning their ergodicity and hence their effective depth. We also demonstrate how these processes may be used for inference; in particular we show how a Metropolis-within-Gibbs construction across the levels of the hierarchy can be used to derive sampling tools which are robust to the level of resolution used to represent the functions on a computer. For illustration, we consider the effect of ergodicity in some simple numerical examples.

最近の研究で、深層ガウス過程の潜在的有用性が明らかになりました。これらの深層構造は、階層構造によって設計された条件付きガウス分布の確率分布です。この論文では、現在公開されている一連の研究を共通のフレームワークに置き、再帰によって深層ガウス過程のいくつかのクラスを定義します。深層ガウス過程から生成された結果のサンプルは、深度パラメータに関してマルコフ構造を持ち、結果として得られる過程の有効深度は、結果として得られるマルコフ連鎖のエルゴード性、または非エルゴード性の観点から解釈されます。紹介した深層ガウス過程のクラスについて、そのエルゴード性、したがって有効深度に関する結果を提供します。また、これらの過程を推論に使用する方法を示します。特に、階層レベル全体にわたるギブス内メトロポリス構成を使用して、コンピューター上で関数を表すために使用される解像度のレベルに対して堅牢なサンプリングツールを導出する方法を示します。説明のために、いくつかの簡単な数値例でエルゴード性の影響を検討します。

Profile-Based Bandit with Unknown Profiles
プロファイルベースのバンディットと不明なプロファイル

Stochastic bandits have been widely studied since decades. A very large panel of settings have been introduced, some of them for the inclusion of some structure between actions. If actions are associated with feature vectors that underlie their usefulness, the discovery of a mapping parameter between such profiles and rewards can help the exploration process of the bandit strategies. This is the setting studied in this paper, but in our case the action profiles (constant feature vectors) are unknown beforehand. Instead, the agent is only given sample vectors, with mean centered on the true profiles, for a subset of actions at each step of the process. In this new bandit instance, policies have thus to deal with a doubled uncertainty, both on the profile estimators and the reward mapping parameters learned so far. We propose a new algorithm, called \textit{SampLinUCB}, specifically designed for this case. Theoretical convergence guarantees are given for this strategy, according to various profile samples delivery scenarios. Finally, experiments are conducted on both artificial data and a task of focused data capture from online social networks. Obtained results demonstrate the relevance of the approach in various settings.

確率的バンディットは数十年にわたって広く研究されてきました。非常に大規模な設定パネルが導入されており、その一部はアクション間の構造を組み込むためのものです。アクションがその有用性の根底にある特徴ベクトルに関連付けられている場合、そのようなプロファイルと報酬の間のマッピングパラメータの発見は、バンディット戦略の探索プロセスに役立ちます。これがこの論文で研究されている設定ですが、私たちのケースでは、アクションプロファイル（定数特徴ベクトル）は事前に不明です。代わりに、エージェントには、プロセスの各ステップでのアクションのサブセットについて、平均が真のプロファイルを中心とするサンプルベクトルのみが与えられます。この新しいバンディットインスタンスでは、ポリシーは、プロファイル推定値とこれまでに学習された報酬マッピングパラメータの両方で、2倍の不確実性に対処する必要があります。私たちは、このケースのために特別に設計された\textit{SampLinUCB}と呼ばれる新しいアルゴリズムを提案します。さまざまなプロファイルサンプル配信シナリオに応じて、この戦略の理論的な収束保証が与えられます。最後に、人工データとオンラインソーシャルネットワークからの集中的なデータキャプチャタスクの両方で実験が行われます。得られた結果は、さまざまな設定におけるアプローチの関連性を実証しています。

Accelerating Cross-Validation in Multinomial Logistic Regression with l_1-Regularization
l_1-正則化による多項ロジスティック回帰のクロス検証の高速化

We develop an approximate formula for evaluating a cross-validation estimator of predictive likelihood for multinomial logistic regression regularized by an $\ell_1$-norm. This allows us to avoid repeated optimizations required for literally conducting cross-validation; hence, the computational time can be significantly reduced. The formula is derived through a perturbative approach employing the largeness of the data size and the model dimensionality. An extension to the elastic net regularization is also addressed. The usefulness of the approximate formula is demonstrated on simulated data and the ISOLET dataset from the UCI machine learning repository. MATLAB and python codes implementing the approximate formula are distributed in (Obuchi, 2017; Takahashi and Obuchi, 2017).

私たちは、$ell_1$-normによって正則化された多項ロジスティック回帰の予測尤度の交差検証推定量を評価するための近似式を開発します。これにより、文字通りクロスバリデーションを行うために必要な最適化の繰り返しを回避できます。したがって、計算時間を大幅に短縮できます。この式は、データサイズの大きさとモデルの次元性を利用した摂動論的アプローチによって導出されます。エラスティックネット正則化の拡張も取り上げられます。近似式の有用性は、シミュレートされたデータとUCI機械学習リポジトリのISOLETデータセットで実証されています。近似式を実装したMATLABコードとPythonコードは、(Obuchi, 2017;Takahashi and Obuchi, 2017)。

Covariances, Robustness, and Variational Bayes
共分散、ロバスト性、変分ベイズ

Mean-field Variational Bayes (MFVB) is an approximate Bayesian posterior inference technique that is increasingly popular due to its fast runtimes on large-scale data sets. However, even when MFVB provides accurate posterior means for certain parameters, it often mis-estimates variances and covariances. Furthermore, prior robustness measures have remained undeveloped for MFVB. By deriving a simple formula for the effect of infinitesimal model perturbations on MFVB posterior means, we provide both improved covariance estimates and local robustness measures for MFVB, thus greatly expanding the practical usefulness of MFVB posterior approximations. The estimates for MFVB posterior covariances rely on a result from the classical Bayesian robustness literature that relates derivatives of posterior expectations to posterior covariances and includes the Laplace approximation as a special case. Our key condition is that the MFVB approximation provides good estimates of a select subset of posterior means—an assumption that has been shown to hold in many practical settings. In our experiments, we demonstrate that our methods are simple, general, and fast, providing accurate posterior uncertainty estimates and robustness measures with runtimes that can be an order of magnitude faster than MCMC.

平均場変分ベイズ(MFVB)は、大規模なデータセットでの実行時間が短いため、ますます人気が高まっている近似ベイズ事後推論手法です。ただし、MFVBが特定のパラメーターの事後平均を正確に提供しても、分散と共分散の推定が誤っていることがよくあります。さらに、MFVBの事前堅牢性尺度は未開発のままです。MFVB事後平均に対する微小モデル変動の影響に関する簡単な式を導出することで、MFVBの共分散推定値とローカル堅牢性尺度の両方が改善され、MFVB事後近似の実用性が大幅に向上します。MFVB事後共分散の推定値は、事後期待値の導関数を事後共分散に関連付け、ラプラス近似を特別なケースとして含める、古典的なベイズ堅牢性文献の結果に依存しています。重要な条件は、MFVB近似が事後平均の選択されたサブセットの良好な推定値を提供することです。これは、多くの実際の設定で当てはまることが示されている仮定です。実験では、私たちの方法がシンプルで汎用的かつ高速であり、MCMCよりも桁違いに高速な実行時間で正確な事後不確実性推定値と堅牢性測定を提供できることを実証しています。

Emergence of Invariance and Disentanglement in Deep Representations
深層表現における不変性と解絡みの顕在化

Using established principles from Statistics and Information Theory, we show that invariance to nuisance factors in a deep neural network is equivalent to information minimality of the learned representation, and that stacking layers and injecting noise during training naturally bias the network towards learning invariant representations. We then decompose the cross-entropy loss used during training and highlight the presence of an inherent overfitting term. We propose regularizing the loss by bounding such a term in two equivalent ways: One with a Kullbach-Leibler term, which relates to a PAC-Bayes perspective; the other using the information in the weights as a measure of complexity of a learned model, yielding a novel Information Bottleneck for the weights. Finally, we show that invariance and independence of the components of the representation learned by the network are bounded above and below by the information in the weights, and therefore are implicitly optimized during training. The theory enables us to quantify and predict sharp phase transitions between underfitting and overfitting of random labels when using our regularized loss, which we verify in experiments, and sheds light on the relation between the geometry of the loss function, invariance properties of the learned representation, and generalization error.

統計と情報理論の確立された原理を使用して、ディープニューラルネットワークの妨害要因に対する不変性は、学習した表現の情報最小性と同等であり、トレーニング中にレイヤーを積み重ねてノイズを注入すると、ネットワークが不変表現を学習する方向に自然に偏向することを示します。次に、トレーニング中に使用されるクロスエントロピー損失を分解し、固有のオーバーフィッティング項の存在を強調します。そのような項を2つの同等の方法で制限することにより、損失を正規化することを提案します。1つは、PACベイズの観点に関連するKullbach-Leibler項を使用する方法です。もう1つは、重みの情報を学習したモデルの複雑さの尺度として使用し、重みの新しい情報ボトルネックを生成する方法です。最後に、ネットワークによって学習された表現のコンポーネントの不変性と独立性は、重みの情報によって上下に制限されるため、トレーニング中に暗黙的に最適化されることを示します。この理論により、正規化された損失を使用した場合のランダムラベルのアンダーフィッティングとオーバーフィッティング間の急激な位相遷移を定量化して予測することができ、実験で検証されています。また、損失関数の幾何学、学習された表現の不変性、および一般化エラーの関係が明らかになります。

Design and Analysis of the NIPS 2016 Review Process
NIPS 2016レビュープロセスの設計と分析

Neural Information Processing Systems (NIPS) is a top-tier annual conference in machine learning. The 2016 edition of the conference comprised more than 2,400 paper submissions, 3,000 reviewers, and 8,000 attendees. This represents a growth of nearly 40% in terms of submissions, 96% in terms of reviewers, and over 100% in terms of attendees as compared to the previous year. The massive scale as well as rapid growth of the conference calls for a thorough quality assessment of the peer-review process and novel means of improvement. In this paper, we analyze several aspects of the data collected during the review process, including an experiment investigating the efficacy of collecting ordinal rankings from reviewers. We make a number of key observations, provide suggestions that may be useful for subsequent conferences, and discuss open problems towards the goal of improving peer review.

Neural Information Processing Systems (NIPS)は、機械学習におけるトップクラスの年次カンファレンスです。2016年版の会議は、2,400を超える論文の提出、3,000人の査読者、8,000人の参加者で構成されていました。これは、前年と比較して、投稿数で約40%、レビュアー数で96%、出席者数で100%以上の増加に相当します。この会議の規模と急速な成長は、ピアレビュープロセスの徹底的な品質評価と新たな改善手段を求めています。この論文では、レビュープロセス中に収集されたデータのいくつかの側面を分析します。これには、レビュアーから序数ランキングを収集することの有効性を調査する実験が含まれます。私たちは、いくつかの重要な観察を行い、その後の会議に役立つ可能性のある提案を提供し、査読を改善するという目標に向けて未解決の問題を議論します。

On Generalized Bellman Equations and Temporal-Difference Learning
一般化ベルマン方程式と時間差分学習について

We consider off-policy temporal-difference (TD) learning in discounted Markov decision processes, where the goal is to evaluate a policy in a model-free way by using observations of a state process generated without executing the policy. To curb the high variance issue in off-policy TD learning, we propose a new scheme of setting the $\lambda$-parameters of TD, based on generalized Bellman equations. Our scheme is to set $\lambda$ according to the eligibility trace iterates calculated in TD, thereby easily keeping these traces in a desired bounded range. Compared with prior work, this scheme is more direct and flexible, and allows much larger $\lambda$ values for off-policy TD learning with bounded traces. As to its soundness, using Markov chain theory, we prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both the evolution of $\lambda$ and the unique invariant probability measure of the state-trace process. These results not only lead immediately to a characterization of the convergence behavior of least-squares based implementation of our scheme, but also prepare the ground for further analysis of gradient-based implementations.

私たちは、割引マルコフ決定プロセスにおけるオフポリシー時間差分(TD)学習を検討します。ここでの目標は、ポリシーを実行せずに生成された状態プロセスの観測値を使用して、モデルフリーの方法でポリシーを評価することです。オフポリシーTD学習における高分散の問題を抑制するために、一般化ベルマン方程式に基づいてTDの$\lambda$パラメータを設定する新しいスキームを提案します。我々のスキームでは、TDで計算された適格性トレース反復に従って$\lambda$を設定し、それによってこれらのトレースを目的の制限範囲内に簡単に維持します。以前の研究と比較して、このスキームはより直接的で柔軟性があり、制限されたトレースを持つオフポリシーTD学習に対してはるかに大きな$\lambda$値を許可します。その健全性については、マルコフ連鎖理論を使用して、非制限条件下での結合状態トレースプロセスのエルゴード性を証明し、この方式に関連付けられているのは、$\lambda$の進化と状態トレースプロセスの一意の不変確率測度の両方に依存する一般化ベルマン方程式(評価対象のポリシー用)であることを示します。これらの結果は、最小二乗法に基づく方式の実装の収束動作の特性評価に直接つながるだけでなく、勾配ベースの実装のさらなる分析の基礎も整えます。

Harmonic Mean Iteratively Reweighted Least Squares for Low-Rank Matrix Recovery
調和平均による低ランク行列回復のための反復再重み付け最小二乗法

We propose a new iteratively reweighted least squares (IRLS) algorithm for the recovery of a matrix $X \in \mathbb{C}^{d_1\times d_2}$ of rank $r \ll\min(d_1,d_2)$ from incomplete linear observations, solving a sequence of low complexity linear problems. The easily implementable algorithm, which we call harmonic mean iteratively reweighted least squares (HM-IRLS), optimizes a non-convex Schatten-$p$ quasi-norm penalization to promote low-rankness and carries three major strengths, in particular for the matrix completion setting. First, we observe a remarkable {global convergence behavior} of the algorithm’s iterates to the low-rank matrix for relevant, interesting cases, for which any other state-of-the-art optimization approach fails the recovery. Secondly, HM-IRLS exhibits an empirical recovery probability close to $1$ even for a number of measurements very close to the theoretical lower bound $r (d_1 +d_2 -r)$, i.e., already for significantly fewer linear observations than any other tractable approach in the literature. Thirdly, HM-IRLS exhibits a locally superlinear rate of convergence (of order $2-p$) if the linear observations fulfill a suitable null space property. While for the first two properties we have so far only strong empirical evidence, we prove the third property as our main theoretical result.

私たちは、不完全な線形観測からランク$r \ll\min(d_1,d_2)$の行列$X \in \mathbb{C}^{d_1\times d_2}$を復元し、一連の低複雑度の線形問題を解くための新しい反復再重み付け最小二乗法(IRLS)アルゴリズムを提案します。調和平均反復再重み付け最小二乗法(HM-IRLS)と呼ぶ、簡単に実装できるアルゴリズムは、非凸Schatten-$p$準ノルムペナルティを最適化して低ランク性を促進し、特に行列補完設定に関して3つの大きな強みを持っています。まず、関連性のある興味深いケースで、他の最先端の最適化アプローチでは復元に失敗する低ランク行列へのアルゴリズムの反復の顕著な{グローバル収束動作}が観察されます。第二に、HM-IRLSは、理論上の下限$r (d_1 +d_2 -r)$に非常に近い測定回数に対しても、経験的回復確率が$1$に近いことを示します。つまり、文献に記載されている他の扱いやすいアプローチよりも大幅に少ない線形観測に対しても、その確率は$1$に近いことを示します。第三に、線形観測が適切なヌル空間特性を満たす場合、HM-IRLSは局所的に超線形な収束率(オーダー$2-p$)を示します。最初の2つの特性については、これまでのところ強力な経験的証拠しかありませんが、3番目の特性を主な理論的結果として証明します。

On Tight Bounds for the Lasso
投げ縄のタイトな境界で

We present upper and lower bounds for the prediction error of the Lasso. For the case of random Gaussian design, we show that under mild conditions the prediction error of the Lasso is up to smaller order terms dominated by the prediction error of its noiseless counterpart. We then provide exact expressions for the prediction error of the latter, in terms of compatibility constants. Here, we assume the active components of the underlying regression function satisfy some “betamin” condition. For the case of fixed design, we provide upper and lower bounds, again in terms of compatibility constants. As an example, we give an up to a logarithmic term tight bound for the least squares estimator with total variation penalty.

私たちは、Lassoの予測誤差の上限と下限を示します。ランダムガウス設計の場合、穏やかな条件下では、Lassoの予測誤差は、ノイズのない対応物の予測誤差によって支配される小さな次数項まで上昇することを示します。次に、後者の予測誤差の正確な式を、互換性定数の観点から提供します。ここでは、基礎となる回帰関数のアクティブコンポーネントが何らかの「ベータミン」条件を満たすと仮定します。固定設計の場合、互換性定数の観点から、上限と下限を提供します。例として、最小二乗推定量に対して対数項のタイトバウンドを与え、全変動ペナルティを与えます。

Random Forests, Decision Trees, and Categorical Predictors: The “Absent Levels” Problem
ランダムフォレスト、決定木、およびカテゴリ予測子: “不在レベル” 問題

One advantage of decision tree based methods like random forests is their ability to natively handle categorical predictors without having to first transform them (e.g., by using feature engineering techniques). However, in this paper, we show how this capability can lead to an inherent “absent levels” problem for decision tree based methods that has never been thoroughly discussed, and whose consequences have never been carefully explored. This problem occurs whenever there is an indeterminacy over how to handle an observation that has reached a categorical split which was determined when the observation in question’s level was absent during training. Although these incidents may appear to be innocuous, by using Leo Breiman and Adele Cutler’s random forests FORTRAN code and the randomForest R package (Liaw and Wiener, 2002) as motivating case studies, we examine how overlooking the absent levels problem can systematically bias a model. Furthermore, by using three real data examples, we illustrate how absent levels can dramatically alter a model’s performance in practice, and we empirically demonstrate how some simple heuristics can be used to help mitigate the effects of the absent levels problem until a more robust theoretical solution is found.

ランダムフォレストのような決定木ベースの方法の利点の1つは、カテゴリ予測子を最初に変換する(たとえば、特徴エンジニアリング手法を使用する)ことなくネイティブに処理できることです。ただし、この論文では、この機能が、これまで十分に議論されたことがなく、その結果が慎重に調査されたこともない、決定木ベースの方法に固有の「不在レベル」問題につながる可能性があることを示します。この問題は、トレーニング中に問題の観測レベルが不在であったために決定されたカテゴリ分割に達した観測の処理方法が不確定な場合に常に発生します。これらのインシデントは無害に思えるかもしれませんが、Leo BreimanとAdele CutlerのランダムフォレストFORTRANコードとrandomForest Rパッケージ(LiawとWiener、2002)を動機付けのケーススタディとして使用することで、不在レベル問題を見落とすとモデルに体系的なバイアスがかかる可能性があることを調べます。さらに、3つの実際のデータ例を使用して、欠落レベルが実際にモデルのパフォーマンスを劇的に変える可能性があることを示し、より堅牢な理論的解決策が見つかるまで、いくつかの単純なヒューリスティックを使用して欠落レベル問題の影響を軽減する方法を実証します。

Kernel Distribution Embeddings: Universal Kernels, Characteristic Kernels and Kernel Metrics on Distributions
カーネルディストリビューションの埋め込み: ユニバーサルカーネル、特性カーネル、ディストリビューションのカーネルメトリクス

Kernel mean embeddings have become a popular tool in machine learning. They map probability measures to functions in a reproducing kernel Hilbert space. The distance between two mapped measures defines a semi-distance over the probability measures known as the maximum mean discrepancy (MMD). Its properties depend on the underlying kernel and have been linked to three fundamental concepts of the kernel literature: universal, characteristic and strictly positive definite kernels. The contributions of this paper are three-fold. First, by slightly extending the usual definitions of universal, characteristic and strictly positive definite kernels, we show that these three concepts are essentially equivalent. Second, we give the first complete characterization of those kernels whose associated MMD-distance metrizes the weak convergence of probability measures. Third, we show that kernel mean embeddings can be extended from probability measures to generalized measures called Schwartz-distributions and analyze a few properties of these distribution embeddings.

カーネル平均埋め込みは、機械学習で人気のツールとなっています。カーネル平均埋め込みは、確率測度を再生カーネルヒルベルト空間内の関数にマッピングします。2つのマッピングされた測度間の距離は、最大平均差異(MMD)として知られる確率測度上の半距離を定義します。その特性は基礎となるカーネルに依存し、カーネル文献の3つの基本概念、すなわちユニバーサルカーネル、特性カーネル、および厳密な正定値カーネルに関連付けられています。この論文の貢献は3つあります。まず、ユニバーサルカーネル、特性カーネル、および厳密な正定値カーネルの通常の定義をわずかに拡張することにより、これら3つの概念が本質的に同等であることを示す。次に、関連するMMD距離が確率測度の弱収束を測定できるカーネルの完全な特性を初めて示す。最後に、カーネル平均埋め込みを確率測度からシュワルツ分布と呼ばれる一般化測度に拡張できることを示し、これらの分布埋め込みのいくつかの特性を分析します。

Markov Blanket and Markov Boundary of Multiple Variables
多重変数のマルコフブランケットとマルコフ境界

Markov blanket (Mb) and Markov boundary (MB) are two key concepts in Bayesian networks (BNs). In this paper, we study the problem of Mb and MB for multiple variables. First, we show that Mb possesses the additivity property under the local intersection assumption, that is, an Mb of multiple targets can be constructed by simply taking the union of Mbs of the individual targets and removing the targets themselves. MB is also proven to have additivity under the local intersection assumption. Second, we analyze the cases of violating additivity of Mb and MB and then put forward the notions of Markov blanket supplementary (MbS) and Markov boundary supplementary (MBS). The properties of MbS and MBS are studied in detail. Third, we build two MB discovery algorithms and prove their correctness under the local composition assumption. We also discuss the ways of practically doing conditional independence tests and analyze the complexities of the algorithms. Finally, we make a benchmarking study based on six synthetic BNs and then apply MB discovery to multi-class prediction based on a real data set. The experimental results reveal our algorithms have higher accuracies and lower complexities than existing algorithms.

マルコフブランケット(Mb)とマルコフ境界(MB)は、ベイジアンネットワーク(BN)の2つの重要な概念です。この論文では、複数の変数に対するMbとMBの問題を検討します。まず、局所交差仮定の下でMbが加法性プロパティを備えていることを示します。つまり、複数のターゲットのMbは、個々のターゲットのMbの和集合を取り、ターゲット自体を削除するだけで構築できます。MBは、局所交差仮定の下で加法性を持つことも証明されています。次に、MbとMBの加法性に違反するケースを分析し、マルコフブランケット補足(MbS)とマルコフ境界補足(MBS)の概念を提唱します。MbSとMBSの特性を詳細に検討します。最後に、2つのMB検出アルゴリズムを構築し、局所構成仮定の下での正しさを証明します。また、条件付き独立性テストを実際に行う方法について説明し、アルゴリズムの複雑さを分析します。最後に、6つの合成BNに基づいてベンチマークスタディを行い、実際のデータセットに基づくマルチクラス予測にMB検出を適用します。実験結果から、当社のアルゴリズムは既存のアルゴリズムよりも精度が高く、複雑性が低いことが明らかになりました。

An Efficient and Effective Generic Agglomerative Hierarchical Clustering Approach
効率的で効果的な汎用凝集階層クラスタリングアプローチ

We introduce an agglomerative hierarchical clustering (AHC) framework which is generic, efficient and effective. Our approach embeds a sub-family of Lance-Williams (LW) clusterings and relies on inner-products instead of squared Euclidean distances. We carry out a constrained bottom-up merging procedure on a sparsified normalized inner-product matrix. Our method is named SNK-AHC for Sparsified Normalized Kernel matrix based AHC. SNK-AHC is more scalable than the classic dissimilarity matrix based AHC. It can also produce better results when clusters have arbitrary shapes. Artificial and real-world benchmarks are used to exemplify these points. From a theoretical standpoint, SNK-AHC provides another interpretation of the classic techniques which relies on the concept of weighted penalized similarities. The differences between group average, Mcquitty, centroid, median and Ward, can be explained by their distinct averaging strategies for aggregating clusters inter-similarities and intra-similarities. Other features of SNK-AHC are examined. We provide sufficient conditions in order to have monotonic dendrograms, we elaborate a stored data matrix approach for centroid and median, we underline the diagonal translation invariance property of group average, Mcquitty and Ward and we show to what extent SNK-AHC can determine the number of clusters.

私たちは、汎用的かつ効率的で効果的な凝集型階層的クラスタリング(AHC)フレームワークを紹介します。我々のアプローチは、ランス・ウィリアムズ(LW)クラスタリングのサブファミリーを組み込み、ユークリッド距離の二乗ではなく内積に依存します。私たちは、スパース化された正規化された内積行列に対して制約付きのボトムアップマージ手順を実行します。我々の方法は、スパース化された正規化されたカーネル行列ベースのAHCの略でSNK-AHCと名付けられています。SNK-AHCは、従来の非類似度行列ベースのAHCよりもスケーラブルです。また、クラスターが任意の形状の場合に、より良い結果を生み出すことができます。人工ベンチマークと現実世界のベンチマークを使用して、これらの点を例示します。理論的な観点から、SNK-AHCは、重み付けされたペナルティ付き類似度の概念に依存する従来の手法の別の解釈を提供します。グループ平均、Mcquitty、重心、中央値、およびWardの違いは、クラスターの相互類似度と内類似度を集約するための異なる平均化戦略によって説明できます。SNK-AHCのその他の機能も検討します。単調な樹形図を作成するための十分な条件を示し、重心と中央値に対する保存データマトリックスアプローチを詳しく説明し、グループ平均、Mcquitty、およびWardの対角変換不変性を強調し、SNK-AHCがクラスターの数をどの程度決定できるかを示します。

Connections with Robust PCA and the Role of Emergent Sparsity in Variational Autoencoder Models
ロバストPCAとの接続と変分自己符号化器モデルにおける創発的スパース性の役割

Variational autoencoders (VAE) represent a popular, flexible form of deep generative model that can be stochastically fit to samples from a given random process using an information-theoretic variational bound on the true underlying distribution. Once so-obtained, the model can be putatively used to generate new samples from this distribution, or to provide a low-dimensional latent representation of existing samples. While quite effective in numerous application domains, certain important mechanisms which govern the behavior of the VAE are obfuscated by the intractable integrals and resulting stochastic approximations involved. Moreover, as a highly non-convex model, it remains unclear exactly how minima of the underlying energy relate to original design purposes. We attempt to better quantify these issues by analyzing a series of tractable special cases of increasing complexity. In doing so, we unveil interesting connections with more traditional dimensionality reduction models, as well as an intrinsic yet underappreciated propensity for robustly dismissing sparse outliers when estimating latent manifolds. With respect to the latter, we demonstrate that the VAE can be viewed as the natural evolution of recent robust PCA models, capable of learning nonlinear manifolds of unknown dimension obscured by gross corruptions.

変分オートエンコーダ(VAE)は、真の基礎分布の情報理論的変分境界を使用して、特定のランダムプロセスからのサンプルに確率的に適合できる、人気の高い柔軟な形式の深層生成モデルです。このようにして得られたモデルは、この分布から新しいサンプルを生成したり、既存のサンプルの低次元潜在表現を提供したりするために使用できると考えられます。VAEは多くのアプリケーションドメインで非常に効果的ですが、扱いにくい積分と、その結果生じる確率的近似によって、VAEの動作を制御する重要なメカニズムがわかりにくくなっています。さらに、非常に非凸なモデルであるため、基礎エネルギーの最小値が元の設計目的とどのように関係しているかは正確には不明です。私たちは、複雑さが増す一連の扱いやすい特殊なケースを分析することで、これらの問題をより適切に定量化しようとしています。そうすることで、より伝統的な次元削減モデルとの興味深い関係性、および潜在的多様体を推定する際にスパースな外れ値を堅牢に排除するという本質的でありながら過小評価されている傾向が明らかになります。後者に関しては、VAEは、大きな破損によって不明瞭になった未知の次元の非線形多様体を学習できる、最近の堅牢なPCAモデルの自然な進化と見なすことができることを実証します。

Learning from Comparisons and Choices
比較と選択から学ぶ

When tracking user-specific online activities, each user’s preference is revealed in the form of choices and comparisons. For example, a user’s purchase history is a record of her choices, i.e. which item was chosen among a subset of offerings. A user’s preferences can be observed either explicitly as in movie ratings or implicitly as in viewing times of news articles. Given such individualized ordinal data in the form of comparisons and choices, we address the problem of collaboratively learning representations of the users and the items. The learned features can be used to predict a user’s preference of an unseen item to be used in recommendation systems. This also allows one to compute similarities among users and items to be used for categorization and search. Motivated by the empirical successes of the MultiNomial Logit (MNL) model in marketing and transportation, and also more recent successes in word embedding and crowdsourced image embedding, we pose this problem as learning the MNL model parameters that best explain the data. We propose a convex relaxation for learning the MNL model, and show that it is minimax optimal up to a logarithmic factor by comparing its performance to a fundamental lower bound. This characterizes the minimax sample complexity of the problem, and proves that the proposed estimator cannot be improved upon other than by a logarithmic factor. Further, the analysis identifies how the accuracy depends on the topology of sampling via the spectrum of the sampling graph. This provides a guideline for designing surveys when one can choose which items are to be compared. This is accompanied by numerical simulations on synthetic and real data sets, confirming our theoretical predictions.

ユーザー固有のオンラインアクティビティを追跡すると、各ユーザーの好みが選択と比較の形で明らかになります。たとえば、ユーザーの購入履歴は、提供されているアイテムのサブセットからどのアイテムが選択されたかという選択の記録です。ユーザーの好みは、映画の評価のように明示的に、またはニュース記事の視聴時間のように暗黙的に観察できます。比較と選択の形でこのような個別の順序データが与えられた場合、ユーザーとアイテムの表現を共同で学習するという問題に取り組みます。学習した特徴は、推奨システムで使用するために、ユーザーの見たことのないアイテムの好みを予測するために使用できます。これにより、分類と検索に使用するユーザーとアイテム間の類似性を計算することもできます。マーケティングと輸送における多項式ロジット(MNL)モデルの実証的な成功、および単語埋め込みとクラウドソーシングによる画像埋め込みの最近の成功に動機付けられて、データを最もよく説明するMNLモデルパラメーターを学習するという問題を提起します。MNLモデルの学習に凸緩和法を提案し、そのパフォーマンスを基本的な下限と比較することで、対数係数までミニマックス最適であることを示します。これは、問題のミニマックスサンプル複雑性を特徴づけ、提案された推定量は対数係数以外では改善できないことを証明します。さらに、この分析では、サンプリンググラフのスペクトルを介して、精度がサンプリングのトポロジーにどのように依存するかを特定します。これは、比較する項目を選択できる場合の調査設計のガイドラインを提供します。これには、合成データセットと実際のデータセットでの数値シミュレーションが伴い、理論的予測が確認されます。

State-by-state Minimax Adaptive Estimation for Nonparametric Hidden Markov Models
ノンパラメトリック隠れマルコフモデルの状態ごとのミニマックス適応推定

In this paper, we introduce a new estimator for the emission densities of a nonparametric hidden Markov model. It is adaptive and minimax with respect to each state’s regularity–as opposed to globally minimax estimators, which adapt to the worst regularity among the emission densities. Our method is based on Goldenshluger and Lepski’s methodology. It is computationally efficient and only requires a family of preliminary estimators, without any restriction on the type of estimators considered. We present two such estimators that allow to reach minimax rates up to a logarithmic term: a spectral estimator and a least squares estimator. We show how to calibrate it in practice and assess its performance on simulations and on real data.

この論文では、ノンパラメトリック隠れマルコフモデルの放出密度の新しい推定量を紹介します。これは、排出密度の中で最悪の規則性に適応するグローバルミニマックス推定量とは対照的に、各州の規則性に関して適応性とミニマックスです。私たちの方法は、GoldenshlugerとLepskiの方法論に基づいています。これは計算効率が高く、考慮される推定量のタイプに制限なく、予備的な推定量のファミリーのみが必要です。対数項までのミニマックスレートに到達できる2つのそのような推定量、スペクトル推定量と最小二乗推定量を示します。実際にキャリブレーションを行い、シミュレーションと実際のデータでそのパフォーマンスを評価する方法を示します。

Local Rademacher Complexity-based Learning Guarantees for Multi-Task Learning
マルチタスク学習のためのローカル Rademacher 複雑性ベース学習保証

We show a Talagrand-type concentration inequality for Multi-Task Learning (MTL), with which we establish sharp excess risk bounds for MTL in terms of the Local Rademacher Complexity (LRC). We also give a new bound on the (LRC) for any norm regularized hypothesis classes, which applies not only to MTL, but also to the standard Single-Task Learning (STL) setting. By combining both results, one can easily derive fast-rate bounds on the excess risk for many prominent MTL methods, including–as we demonstrate–Schatten norm, group norm, and graph regularized MTL. The derived bounds reflect a relationship akin to a conservation law of asymptotic convergence rates. When compared to the rates obtained via a traditional, global Rademacher analysis, this very relationship allows for trading off slower rates with respect to the number of tasks for faster rates with respect to the number of available samples per task.

私たちは、マルチタスク学習(MTL)のタラグランド型濃度不等式を示し、これにより、局所レーデマッハー複雑性(LRC)の観点からMTLの急激な過剰リスク境界を確立します。また、任意のノルム正則化仮説クラスに対して(LRC)に新しい境界を与え、これはMTLだけでなく、標準のシングルタスク学習(STL)設定にも適用されます。両方の結果を組み合わせることにより、多くの著名なMTL手法(ここで示すように、Schatten norm、Group norm、およびグラフ正則化MTLを含む)の過剰リスクの高速レート境界を簡単に導き出すことができます。導出された境界は、漸近収束率の保存則に似た関係を反映しています。従来のグローバルなRademacher分析によって得られたレートと比較すると、この関係により、タスク数に対して低速なレートと、タスクごとに使用可能なサンプル数に対して高速なレートをトレードオフすることができます。

The xyz algorithm for fast interaction search in high-dimensional data
高次元データにおける高速インタラクション探索のためのxyzアルゴリズム

When performing regression on a data set with $p$ variables, it is often of interest to go beyond using main linear effects and include interactions as products between individual variables. For small-scale problems, these interactions can be computed explicitly but this leads to a computational complexity of at least $\mathcal{O}(p^2)$ if done naively. This cost can be prohibitive if $p$ is very large. We introduce a new randomised algorithm that is able to discover interactions with high probability and under mild conditions has a runtime that is subquadratic in $p$. We show that strong interactions can be discovered in almost linear time, whilst finding weaker interactions requires $\mathcal{O}(p^\alpha)$ operations for $1<\alpha<2$ depending on their strength. The underlying idea is to transform interaction search into a closest pair problem which can be solved efficiently in subquadratic time. The algorithm is called $xyz$ and is implemented in the language R. We demonstrate its efficiency for application to genome-wide association studies, where more than $10^{11}$ interactions can be screened in under $280$ seconds with a single-core $1.2$ GHz CPU.

$p$個の変数を含むデータセットで回帰を実行する場合、主な線形効果の使用を超えて、個々の変数間の積として相互作用を含めることがしばしば重要です。小規模な問題では、これらの相互作用を明示的に計算できますが、単純に行うと少なくとも$\mathcal{O}(p^2)$の計算複雑性につながります。$p$が非常に大きい場合、このコストは法外になる可能性があります。高い確率で相互作用を発見でき、穏やかな条件下では実行時間が$p$の二次以下の新しいランダム化アルゴリズムを紹介します。強い相互作用はほぼ線形時間で発見できる一方、弱い相互作用を見つけるには、その強さに応じて$1<\alpha<2$に対して$\mathcal{O}(p^\alpha)$の操作が必要であることを示します。基本的な考え方は、相互作用検索を、二次以下の時間で効率的に解決できる最近接ペア問題に変換することです。このアルゴリズムは$xyz$と呼ばれ、R言語で実装されています。ゲノムワイド関連研究への応用における効率性を実証し、シングルコアの$1.2$ GHz CPUを使用して$280$秒未満で$10^{11}$以上の相互作用をスクリーニングできることを示しました。

Invariant Models for Causal Transfer Learning
因果伝達学習のための不変モデル

Methods of transfer learning try to combine knowledge from several related tasks (or domains) to improve performance on a test task. Inspired by causal methodology, we relax the usual covariate shift assumption and assume that it holds true for a subset of predictor variables: the conditional distribution of the target variable given this subset of predictors is invariant over all tasks. We show how this assumption can be motivated from ideas in the field of causality. We focus on the problem of Domain Generalization, in which no examples from the test task are observed. We prove that in an adversarial setting using this subset for prediction is optimal in Domain Generalization; we further provide examples, in which the tasks are sufficiently diverse and the estimator therefore outperforms pooling the data, even on average. If examples from the test task are available, we also provide a method to transfer knowledge from the training tasks and exploit all available features for prediction. However, we provide no guarantees for this method. We introduce a practical method which allows for automatic inference of the above subset and provide corresponding code. We present results on synthetic data sets and a gene deletion data set.

転移学習の方法は、複数の関連するタスク(またはドメイン)からの知識を組み合わせて、テストタスクのパフォーマンスを向上させようとします。因果的方法論にヒントを得て、通常の共変量シフトの仮定を緩和し、予測変数のサブセットに当てはまると仮定します。つまり、この予測変数のサブセットを与えられたターゲット変数の条件付き分布は、すべてのタスクにわたって不変です。この仮定が因果関係の分野のアイデアからどのように動機付けられるかを示します。テストタスクの例が観察されないドメイン一般化の問題に焦点を当てます。敵対的な設定では、このサブセットを予測に使用することがドメイン一般化で最適であることを証明します。さらに、タスクが十分に多様であるため、推定器が平均でもデータのプーリングよりも優れている例を示します。テストタスクの例が利用できる場合は、トレーニングタスクから知識を転送し、予測に利用可能なすべての機能を活用する方法も提供します。ただし、この方法を保証するものではありません。上記のサブセットの自動推論を可能にする実用的な方法を紹介し、対応するコードを提供します。合成データセットと遺伝子欠失データセットの結果を紹介します。

Kernel Density Estimation for Dynamical Systems
動的システムのカーネル密度推定

We study the density estimation problem with observations generated by certain dynamical systems that admit a unique underlying invariant Lebesgue density. Observations drawn from dynamical systems are not independent and moreover, usual mixing concepts may not be appropriate for measuring the dependence among these observations. By employing the $\mathcal{C}$-mixing concept to measure the dependence, we conduct statistical analysis on the consistency and convergence of the kernel density estimator. Our main results are as follows: First, we show that with properly chosen bandwidth, the kernel density estimator is universally consistent under $L_1$-norm; Second, we establish convergence rates for the estimator with respect to several classes of dynamical systems under $L_1$-norm. In the analysis, the density function $f$ is only assumed to be H\”{o}lder continuous or pointwise H\”{o}lder controllable which is a weak assumption in the literature of nonparametric density estimation and also more realistic in the dynamical system context. Last but not least, we prove that the same convergence rates of the estimator under $L_\infty$-norm and $L_1$-norm can be achieved when the density function is H\”{o}lder continuous, compactly supported, and bounded. The bandwidth selection problem of the kernel density estimator for dynamical system is also discussed in our study via numerical simulations.

私たちは、固有の不変ルベーグ密度を基礎とする特定の動的システムによって生成される観測値を用いて密度推定問題を研究します。動的システムから得られる観測値は独立しておらず、さらに、通常の混合概念はこれらの観測値間の依存性を測定するのに適さない可能性があります。依存性を測定するために$\mathcal{C}$混合概念を採用することにより、カーネル密度推定値の一貫性と収束に関する統計分析を行う。主な結果は以下のとおりです。まず、適切に選択された帯域幅では、カーネル密度推定値は$L_1$ノルムの下で普遍的に一貫性があることを示す。次に、$L_1$ノルムの下でのいくつかのクラスの動的システムに関して推定値の収束率を確立します。分析では、密度関数$f$はH\”{o}older連続または点ごとのH\”{o}older制御可能であるとのみ仮定されるが、これはノンパラメトリック密度推定の文献では弱い仮定であり、動的システムのコンテキストではより現実的でもあります。最後に、密度関数がHolder連続、コンパクトサポート、有界である場合、$L_\infty$ノルムと$L_1$ノルムの下での推定量の同じ収束率が達成できることを証明します。動的システムのカーネル密度推定量の帯域幅選択問題も、数値シミュレーションによる研究で議論されています。

A Spectral Approach for the Design of Experiments: Design, Analysis and Algorithms
実験計画法のためのスペクトルアプローチ:設計、分析、アルゴリズム

This paper proposes a new approach to construct high quality space-filling sample designs. First, we propose a novel technique to quantify the space-filling property and optimally trade-off uniformity and randomness in sample designs in arbitrary dimensions. Second, we connect the proposed metric (defined in the spatial domain) to the quality metric of the design performance (defined in the spectral domain). This connection serves as an analytic framework for evaluating the qualitative properties of space-filling designs in general. Using the theoretical insights provided by this spatial-spectral analysis, we derive the notion of optimal space-filling designs, which we refer to as space-filling spectral designs. Third, we propose an efficient estimator to evaluate the space-filling properties of sample designs in arbitrary dimensions and use it to develop an optimization framework for generating high quality space-filling designs. Finally, we carry out a detailed performance comparison on two different applications in varying dimensions: a) image reconstruction and b) surrogate modeling for several benchmark optimization functions and a physics simulation code for inertial confinement fusion (ICF). Our results clearly evidence the superiority of the proposed space-filling designs over existing approaches, particularly in high dimensions.

この論文では、高品質の空間充填サンプルデザインを構築するための新しいアプローチを提案します。まず、任意の次元のサンプルデザインにおける空間充填特性を定量化し、均一性とランダム性を最適にトレードオフする新しい手法を提案します。次に、提案されたメトリック(空間領域で定義)を、デザインパフォーマンスの品質メトリック(スペクトル領域で定義)に接続します。この接続は、一般に空間充填デザインの定性的特性を評価するための分析フレームワークとして機能します。この空間スペクトル分析によって提供される理論的洞察を使用して、最適な空間充填デザインの概念を導き出し、これを空間充填スペクトルデザインと呼びます。最後に、任意の次元のサンプルデザインの空間充填特性を評価するための効率的な推定量を提案し、それを使用して高品質の空間充填デザインを生成するための最適化フレームワークを開発します。最後に、さまざまな次元における2つの異なるアプリケーション(a)画像再構成、b)いくつかのベンチマーク最適化関数のサロゲートモデリング、および慣性閉じ込め核融合(ICF)の物理シミュレーションコードについて詳細なパフォーマンス比較を実施します。結果は、特に高次元において、提案された空間充填設計が既存のアプローチよりも優れていることを明確に証明しています。

Goodness-of-Fit Tests for Random Partitions via Symmetric Polynomials
対称多項式によるランダム分割の適合度検定

We consider goodness-of-fit tests with i.i.d. samples generated from a categorical distribution $(p_1,…,p_k)$. For a given $(q_1,…,q_k)$, we test the null hypothesis whether $p_j=q_{\pi(j)}$ for some label permutation $\pi$. The uncertainty of label permutation implies that the null hypothesis is composite instead of being singular. In this paper, we construct a testing procedure using statistics that are defined as indefinite integrals of some symmetric polynomials. This method is aimed directly at the invariance of the problem, and avoids the need of matching the unknown labels. The asymptotic distribution of the testing statistic is shown to be chi-squared, and its power is proved to be nearly optimal under a local alternative hypothesis. Various degenerate structures of the null hypothesis are carefully analyzed in the paper. A two-sample version of the test is also studied.

私たちは、カテゴリ分布$(p_1,…,p_k)$から生成されたi.i.d.サンプルを使用した適合度検定を検討します。与えられた$(q_1,…,q_k)$について、あるラベル順列$pi$に対して帰無仮説$p_j=q_{pi(j)}$であるかどうかを検定します。ラベル順列の不確実性は、帰無仮説が単数ではなく複合仮説であることを意味します。この論文では、いくつかの対称多項式の不定積分として定義される統計を使用して、検定手順を構築します。この方法は、問題の不変性を直接対象としており、未知のラベルを一致させる必要がなくなります。検定統計量の漸近分布はカイ2乗であることが示され、その検出力は局所対立仮説の下でほぼ最適であることが証明されています。帰無仮説のさまざまな縮退構造が論文で慎重に分析されています。テストの2サンプルバージョンも研究されています。

Distribution-Specific Hardness of Learning Neural Networks
学習ニューラルネットワークの分布特異的な難易度

Although neural networks are routinely and successfully trained in practice using simple gradient-based methods, most existing theoretical results are negative, showing that learning such networks is difficult, in a worst-case sense over all data distributions. In this paper, we take a more nuanced view, and consider whether specific assumptions on the “niceness” of the input distribution, or “niceness” of the target function (e.g. in terms of smoothness, non-degeneracy, incoherence, random choice of parameters etc.), are sufficient to guarantee learnability using gradient-based methods. We provide evidence that neither class of assumptions alone is sufficient: On the one hand, for any member of a class of “nice” target functions, there are difficult input distributions. On the other hand, we identify a family of simple target functions, which are difficult to learn even if the input distribution is “nice”. To prove our results, we develop some tools which may be of independent interest, such as extending Fourier-based hardness techniques developed in the context of statistical queries (Blum et al., 1994), from the Boolean cube to Euclidean space and to more general classes of functions.

ニューラルネットワークは実際には単純な勾配ベースの方法を使用して日常的に、そして首尾よくトレーニングされていますが、既存の理論的結果のほとんどは否定的で、最悪の場合、すべてのデータ分布にわたってそのようなネットワークを学習することは困難であることを示しています。この論文では、より微妙な視点を取り、入力分布の「良好さ」またはターゲット関数の「良好さ」(たとえば、滑らかさ、非退化、非一貫性、パラメーターのランダム選択など)に関する特定の仮定が、勾配ベースの方法を使用した学習可能性を保証するのに十分であるかどうかを検討します。どちらのクラスの仮定も単独では十分ではないという証拠を示します。一方では、「良好」なターゲット関数のクラスのどのメンバーに対しても、困難な入力分布が存在します。他方では、入力分布が「良好」であっても学習が困難な単純なターゲット関数のファミリーを特定します。私たちの結果を証明するために、統計クエリのコンテキストで開発されたフーリエベースの困難性手法(Blumら、1994)をブールキューブからユークリッド空間、さらにより一般的な関数のクラスに拡張するなど、独立した関心事となる可能性のあるいくつかのツールを開発します。

A Direct Approach for Sparse Quadratic Discriminant Analysis
スパース二次判別解析のための直接的アプローチ

Quadratic discriminant analysis (QDA) is a standard tool for classification due to its simplicity and flexibility. Because the number of its parameters scales quadratically with the number of the variables, QDA is not practical, however, when the dimensionality is relatively large. To address this, we propose a novel procedure named DA-QDA for QDA in analyzing high-dimensional data. Formulated in a simple and coherent framework, DA-QDA aims to directly estimate the key quantities in the Bayes discriminant function including quadratic interactions and a linear index of the variables for classification. Under appropriate sparsity assumptions, we establish consistency results for estimating the interactions and the linear index, and further demonstrate that the misclassification rate of our procedure converges to the optimal Bayes risk, even when the dimensionality is exponentially high with respect to the sample size. An efficient algorithm based on the alternating direction method of multipliers (ADMM) is developed for finding interactions, which is much faster than its competitor in the literature. The promising performance of DA-QDA is illustrated via extensive simulation studies and the analysis of four real datasets.

二次判別分析(QDA)は、そのシンプルさと柔軟性から、分類の標準的なツールです。ただし、パラメータの数は変数の数に比例して増加するため、次元が比較的大きい場合、QDAは実用的ではありません。これに対処するために、高次元データの分析におけるQDA用のDA-QDAという新しい手順を提案します。シンプルで一貫したフレームワークで定式化されたDA-QDAは、分類のための変数の二次相互作用と線形インデックスを含むベイズ判別関数の主要な量を直接推定することを目的としています。適切なスパース性の仮定の下で、相互作用と線形インデックスを推定するための一貫性のある結果を確立し、さらに、次元がサンプルサイズに対して指数的に高い場合でも、手順の誤分類率が最適なベイズリスクに収束することを実証します。交互方向乗数法(ADMM)に基づく効率的なアルゴリズムが相互作用を見つけるために開発されており、これは文献の競合よりもはるかに高速です。DA-QDAの有望なパフォーマンスは、広範なシミュレーション研究と4つの実際のデータセットの分析を通じて実証されています。

Parallelizing Spectrally Regularized Kernel Algorithms
スペクトル正則化カーネルアルゴリズムの並列化

We consider a distributed learning approach in supervised learning for a large class of spectral regularization methods in an reproducing kernel Hilbert space (RKHS) framework. The data set of size $n$ is partitioned into $m=O(n^\alpha)$, $\alpha < \frac{1}{2}$, disjoint subsamples. On each subsample, some spectral regularization method (belonging to a large class, including in particular Kernel Ridge Regression, $L^2$-boosting and spectral cut-off) is applied. The regression function $f$ is then estimated via simple averaging, leading to a substantial reduction in computation time. We show that minimax optimal rates of convergence are preserved if $m$ grows sufficiently slowly (corresponding to an upper bound for $\alpha$) as $n \to \infty$, depending on the smoothness assumptions on $f$ and the intrinsic dimensionality. In spirit, the analysis relies on a classical bias/stochastic error analysis.

私たちは、再生カーネルヒルベルト空間(RKHS)フレームワークのスペクトル正則化方法の大規模なクラスについて、教師あり学習における分散学習アプローチを検討します。サイズ$n$のデータセットは、$m=O(n^alpha)$、$alpha、< frac{1}{2}$、ばらばらのサブサンプルに分割されます。各サブサンプルには、スペクトル正則化法(特にカーネルリッジ回帰、$L^2$-ブースティング、スペクトルカットオフなど、大きなクラスに属する)が適用されます。その後、回帰関数$f$は単純な平均化によって推定されるため、計算時間が大幅に短縮されます。$m$がからinfty$まで十分にゆっくりと成長($alpha$の上限に対応) $n、$f$の滑らかさの仮定と固有の次元に応じて、収束のミニマックス最適速度が維持されることを示します。精神的には、分析は古典的なバイアス/確率的エラー分析に依存しています。

Gradient Descent Learns Linear Dynamical Systems
勾配降下法による線形力学系の学習

We prove that stochastic gradient descent efficiently converges to the global optimizer of the maximum likelihood objective of an unknown linear time-invariant dynamical system from a sequence of noisy observations generated by the system. Even though the objective function is non-convex, we provide polynomial running time and sample complexity bounds under strong but natural assumptions. Linear systems identification has been studied for many decades, yet, to the best of our knowledge, these are the first polynomial guarantees for the problem we consider.

私たちは、確率的勾配降下法が、システムによって生成された一連のノイズの多い観測から、未知の線形時間不変力学系の最尤目的のグローバルオプティマイザーに効率的に収束することを証明します。目的関数が非凸型であっても、強力だが自然な仮定の下で、多項式の実行時間とサンプルの複雑さの境界を提供します。線形システムの同定は何十年にもわたって研究されてきましたが、私たちの知る限りでは、これらは私たちが考える問題に対する最初の多項式保証です。

Generalized Rank-Breaking: Computational and Statistical Tradeoffs
一般化ランクブレーク:計算と統計のトレードオフ

For massive and heterogeneous modern datasets, it is of fundamental interest to provide guarantees on the accuracy of estimation when computational resources are limited. In the application of rank aggregation, for the Plackett-Luce model, we provide a hierarchy of rank-breaking mechanisms ordered by the complexity in thus generated sketch of the data. This allows the number of data points collected to be gracefully traded off against computational resources available, while guaranteeing the desired level of accuracy. Theoretical guarantees on the proposed generalized rank-breaking implicitly provide such trade-offs, which can be explicitly characterized under certain canonical scenarios on the structure of the data. Further, the proposed generalized rank-breaking algorithm involves set-wise comparisons as opposed to traditional pairwise comparisons. The maximum likelihood estimate of pairwise comparisons is computed efficiently using the celebrated minorization maximization algorithm (Hunter, 2004). To compute the pseudo-maximum likelihood estimate of the set-wise comparisons, we provide a generalization of the minorization maximization algorithm and give guarantees on its convergence.

大規模で異質な現代のデータセットでは、計算リソースが限られている場合に推定の精度を保証することが根本的に重要です。ランク集約のアプリケーションでは、Plackett-Luceモデルに対して、生成されたデータのスケッチの複雑さによって順序付けられたランク分割メカニズムの階層を提供します。これにより、収集されたデータポイントの数と利用可能な計算リソースを適切にトレードオフしながら、必要なレベルの精度を保証できます。提案された一般化ランク分割の理論的な保証は、暗黙的にそのようなトレードオフを提供し、データの構造に関する特定の標準的なシナリオの下で明示的に特徴付けることができます。さらに、提案された一般化ランク分割アルゴリズムには、従来のペアワイズ比較ではなく、セットワイズ比較が含まれます。ペアワイズ比較の最大尤度推定は、有名なマイナライゼーション最大化アルゴリズム(Hunter、2004)を使用して効率的に計算されます。セットごとの比較の疑似最大尤度推定を計算するために、マイナー化最大化アルゴリズムの一般化を提供し、その収束を保証します。

Importance Sampling for Minibatches
ミニバッチの重要度サンプリング

Minibatching is a very well studied and highly popular technique in supervised learning, used by practitioners due to its ability to accelerate training through better utilization of parallel processing power and reduction of stochastic variance. Another popular technique is importance sampling–a strategy for preferential sampling of more important examples also capable of accelerating the training process. However, despite considerable effort by the community in these areas, and due to the inherent technical difficulty of the problem, there is virtually no existing work combining the power of importance sampling with the strength of minibatching. In this paper we propose the first practical importance sampling for minibatches and give simple and rigorous complexity analysis of its performance. We illustrate on synthetic problems that for training data of certain properties, our sampling can lead to several orders of magnitude improvement in training time. We then test the new sampling on several popular data sets, and show that the improvement can reach an order of magnitude.

ミニバッチングは、教師あり学習において非常によく研究され、非常に人気のある手法であり、並列処理能力の有効活用と確率的分散の低減によってトレーニングを加速できることから、実践者に使用されています。もう1つの人気のある手法は、重要度サンプリングです。これは、より重要な例を優先的にサンプリングする戦略であり、トレーニングプロセスを加速することもできます。ただし、これらの分野でのコミュニティの多大な努力にもかかわらず、問題に固有の技術的な難しさのため、重要度サンプリングの威力とミニバッチングの長所を組み合わせた既存の研究は事実上存在しません。この論文では、ミニバッチ用の初めての実用的な重要度サンプリングを提案し、そのパフォーマンスの単純かつ厳密な複雑性分析を示します。合成問題において、特定の特性を持つトレーニングデータの場合、このサンプリングによってトレーニング時間が数桁改善されることを示します。次に、いくつかの一般的なデータセットで新しいサンプリングをテストし、改善が1桁に達する可能性があることを示します。

OpenEnsembles: A Python Resource for Ensemble Clustering
OpenEnsembles: アンサンブルクラスタリングのための Python リソース

In this paper we introduce OpenEnsembles, a Python toolkit for performing and analyzing ensemble clustering. Ensemble clustering is the process of creating many clustering solutions for a given dataset and utilizing the relationships observed across the ensemble to identify final solutions, which are more robust, stable or better than the individual solutions within the ensemble. The OpenEnsembles library provides a unified interface for applying transformations to data, clustering data, visualizing individual clustering solutions, visualizing and finishing the ensemble, and calculating validation metrics for a clustering solution for any given partitioning of the data. We have documented examples of using OpenEnsembles to create, analyze, and visualize a number of different types of ensemble approaches on toy and example datasets. OpenEnsembles is released under the GNU General Public License version 3, can be installed via Conda or the Python Package Index (pip), and is available at https://github.com/NaegleLab/OpenEnsembles.

この論文では、アンサンブルクラスタリングを実行および分析するためのPythonツールキットであるOpenEnsemblesを紹介します。アンサンブルクラスタリングとは、特定のデータセットに対して多数のクラスタリングソリューションを作成し、アンサンブル全体で観察された関係を利用して、アンサンブル内の個々のソリューションよりも堅牢で安定している、または優れた最終ソリューションを特定するプロセスです。OpenEnsemblesライブラリは、データへの変換の適用、データのクラスタリング、個々のクラスタリングソリューションの視覚化、アンサンブルの視覚化と終了、およびデータの任意のパーティションに対するクラスタリングソリューションの検証メトリックの計算を行うための統合インターフェイスを提供します。OpenEnsemblesを使用して、トイデータセットとサンプルデータセットでさまざまな種類のアンサンブルアプローチを作成、分析、視覚化する例を文書化しました。OpenEnsemblesは、GNU General Public Licenseバージョン3に基づいてリリースされ、CondaまたはPython Package Index (pip)を介してインストールでき、https://github.com/NaegleLab/OpenEnsemblesで入手できます。

Deep Hidden Physics Models: Deep Learning of Nonlinear Partial Differential Equations
深層物理モデル: 非線形偏微分方程式の深層学習

We put forth a deep learning approach for discovering nonlinear partial differential equations from scattered and potentially noisy observations in space and time. Specifically, we approximate the unknown solution as well as the nonlinear dynamics by two deep neural networks. The first network acts as a prior on the unknown solution and essentially enables us to avoid numerical differentiations which are inherently ill-conditioned and unstable. The second network represents the nonlinear dynamics and helps us distill the mechanisms that govern the evolution of a given spatiotemporal data-set. We test the effectiveness of our approach for several benchmark problems spanning a number of scientific domains and demonstrate how the proposed framework can help us accurately learn the underlying dynamics and forecast future states of the system. In particular, we study the Burgers’, Korteweg-de Vries (KdV), Kuramoto-Sivashinsky, nonlinear Schr\”{o}dinger, and Navier-Stokes equations.

私たちは、空間と時間の散在し、ノイズの多い可能性のある観測から非線形偏微分方程式を発見するための深層学習アプローチを提唱しました。具体的には、未知の解と非線形ダイナミクスを2つの深層ニューラルネットワークによって近似します。最初のネットワークは、未知の解の事前分布として機能し、本質的に条件が悪く不安定な数値微分を本質的に回避することを可能にします。2番目のネットワークは非線形ダイナミクスを表し、特定の時空間データセットの進化を支配するメカニズムを抽出するのに役立ちます。私たちは、いくつかの科学領域にまたがるいくつかのベンチマーク問題に対するアプローチの有効性をテストし、提案されたフレームワークがシステムの根本的なダイナミクスを正確に学習し、将来の状態を予測するのにどのように役立つかを示します。特に、Burgers方程式、Korteweg-de Vries方程式(KdV)、Kuramoto-Sivashinsky方程式、非線形Schr”{o}dinger方程式、Navier-Stokes方程式を研究します。

Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems
非均質な状態アフィンシステムを用いた確率的入力と線形読み出しを備えたユニバーサル離散時間貯留層コンピュータ

A new class of non-homogeneous state-affine systems is introduced for use in reservoir computing. Sufficient conditions are identified that guarantee first, that the associated reservoir computers with linear readouts are causal, time-invariant, and satisfy the fading memory property and second, that a subset of this class is universal in the category of fading memory filters with stochastic almost surely uniformly bounded inputs. This means that any discrete-time filter that satisfies the fading memory property with random inputs of that type can be uniformly approximated by elements in the non-homogeneous state-affine family.

リザーバーコンピューティングで使用するために、新しいクラスの非均質な状態アフィンシステムが導入されました。第1に、線形読み出しを持つ関連するリザーバコンピュータが因果的で、時間不変であり、フェージングメモリ特性を満たすこと、第2に、このクラスのサブセットが確率的にほぼ確実に一様に有界入力を持つフェージングメモリフィルタのカテゴリで普遍的であることを保証する十分な条件が特定されます。これは、そのタイプのランダム入力でフェージングメモリプロパティを満たす離散時間フィルターは、非同次状態アフィンファミリーの要素によって一様に近似できることを意味します。

Reverse Iterative Volume Sampling for Linear Regression
線形回帰のための逆反復ボリュームサンプリング

We study the following basic machine learning task: Given a fixed set of input points in $\mathbb{R}^d$ for a linear regression problem, we wish to predict a hidden response value for each of the points. We can only afford to attain the responses for a small subset of the points that are then used to construct linear predictions for all points in the dataset. The performance of the predictions is evaluated by the total square loss on all responses (the attained as well as the remaining hidden ones). We show that a good approximate solution to this least squares problem can be obtained from just dimension $d$ many responses by using a joint sampling technique called volume sampling. Moreover, the least squares solution obtained for the volume sampled subproblem is an unbiased estimator of optimal solution based on all $n$ responses. This unbiasedness is a desirable property that is not shared by other common subset selection techniques. Motivated by these basic properties, we develop a theoretical framework for studying volume sampling, resulting in a number of new matrix expectation equalities and statistical guarantees which are of importance not only to least squares regression but also to numerical linear algebra in general. Our methods also lead to a regularized variant of volume sampling, and we propose the first efficient algorithm for volume sampling which makes this technique a practical tool in the machine learning toolbox. Finally, we provide experimental evidence which confirms our theoretical findings.

私たちは、次のような基本的な機械学習タスクを研究します。線形回帰問題に対する$\mathbb{R}^d$内の固定された入力ポイントセットが与えられた場合、各ポイントの隠れた応答値を予測します。データセット内のすべてのポイントの線形予測を構築するために使用されるポイントの小さなサブセットの応答のみを取得できます。予測のパフォーマンスは、すべての応答(取得された応答と残りの隠れた応答)の合計二乗損失によって評価されます。ボリュームサンプリングと呼ばれるジョイントサンプリング手法を使用することで、この最小二乗問題に対する適切な近似解が、次元$d$の多くの応答から得られることを示します。さらに、ボリュームサンプリングされたサブ問題に対して取得された最小二乗解は、すべての$n$応答に基づく最適解の不偏推定値です。この不偏性は、他の一般的なサブセット選択手法にはない望ましい特性です。これらの基本特性に着目して、ボリュームサンプリングを研究するための理論的枠組みを開発し、その結果、最小二乗回帰だけでなく数値線形代数全般にとって重要な、いくつかの新しい行列期待値等式と統計的保証が生まれました。また、私たちの方法はボリュームサンプリングの正規化された変種にもつながり、ボリュームサンプリングの効率的な最初のアルゴリズムを提案し、この手法を機械学習ツールボックスの実用的なツールにします。最後に、理論的発見を確認する実験的証拠を示します。

Robust Synthetic Control
堅牢な合成制御

We present a robust generalization of the synthetic control method for comparative case studies. Like the classical method cf. \cite{abadie3}, we present an algorithm to estimate the unobservable counterfactual of a treatment unit. A distinguishing feature of our algorithm is that of de-noising the data matrix via singular value thresholding, which renders our approach robust in multiple facets: it automatically identifies a good subset of donors for the synthetic control, overcomes the challenges of missing data, and continues to work well in settings where covariate information may not be provided. We posit that the setting can be viewed as an instance of the Latent Variable Model and provide the first finite sample analysis (coupled with asymptotic results) for the estimation of the counterfactual. Our algorithm accurately imputes missing entries and filters corrupted observations in producing a consistent estimator of the underlying signal matrix, provided $p = \Omega( T^{-1 + \zeta})$ for some $\zeta > 0$; here, $p$ is the fraction of observed data and $T$ is the time interval of interest. Under the same proportion of observations, we demonstrate that the mean-squared error in our counterfactual estimation scales as $\mathcal{O}(\sigma^2/p + 1/\sqrt{T})$, where $\sigma^2$ is the variance of the inherent noise. Additionally, we introduce a Bayesian framework to quantify the estimation uncertainty. Our experiments, using both synthetic and real-world datasets, demonstrate that our robust generalization yields an improvement over the classical synthetic control method.

私たちは、比較ケーススタディのための合成制御法の堅牢な一般化を紹介します。古典的な方法cf. \cite{abadie3}と同様に、治療単位の観測不可能な反事実を推定するアルゴリズムを紹介します。私たちのアルゴリズムの際立った特徴は、特異値しきい値化によってデータマトリックスのノイズを除去することです。これにより、私たちのアプローチはさまざまな面で堅牢になります。合成制御に適したドナーのサブセットを自動的に識別し、欠損データの課題を克服し、共変量情報が提供されない可能性がある設定でも引き続きうまく機能します。私たちは、この設定を潜在変数モデルのインスタンスと見なすことができると仮定し、反事実の推定のための最初の有限サンプル分析(漸近結果と組み合わせた)を提供します。私たちのアルゴリズムは、$p = \Omega( T^{-1 + \zeta})$で$\zeta > 0$である場合に、欠落したエントリを正確に補完し、破損した観測をフィルターして、基礎となる信号マトリックスの一貫した推定値を生成します。ここで、$p$は観測データの割合、$T$は関心のある時間間隔です。同じ観測割合の下で、反事実推定における平均二乗誤差は$\mathcal{O}(\sigma^2/p + 1/\sqrt{T})$に比例することを示します。ここで、$\sigma^2$は固有ノイズの分散です。さらに、推定の不確実性を定量化するためにベイズフレームワークを導入します。合成データセットと実世界のデータセットの両方を使用した実験では、堅牢な一般化により、従来の合成制御方法よりも改善されることが実証されています。

ThunderSVM: A Fast SVM Library on GPUs and CPUs
ThunderSVM:GPUとCPU上の高速SVMライブラリ

Support Vector Machines (SVMs) are classic supervised learning models for classification, regression and distribution estimation. A survey conducted by Kaggle in 2017 shows that 26% of the data mining and machine learning practitioners are users of SVMs. However, SVM training and prediction are very expensive computationally for large and complex problems. This paper presents an efficient and open source SVM software toolkit called ThunderSVM which exploits the high-performance of Graphics Processing Units (GPUs) and multi-core CPUs. ThunderSVM supports all the functionalities–including classification (SVC), regression (SVR) and one-class SVMs–of LibSVM and uses identical command line options, such that existing LibSVM users can easily apply our toolkit. ThunderSVM can be used through multiple language interfaces including C/C++, Python, R and MATLAB. Our experimental results show that ThunderSVM is generally an order of magnitude faster than LibSVM while producing identical SVMs. In addition to the high efficiency, we design our convex optimization solver in a general way such that SVC, SVR, and one-class SVMs share the same solver for the ease of maintenance. Documentation, examples, and more about ThunderSVM are available at https://github.com/zeyiwen/thundersvm

サポートベクターマシン(SVM)は、分類、回帰、分布推定のための古典的な教師あり学習モデルです。2017年にKaggleが実施した調査によると、データマイニングと機械学習の実践者の26%がSVMを使用しています。ただし、SVMのトレーニングと予測は、大規模で複雑な問題の場合、計算コストが非常に高くなります。この論文では、グラフィックスプロセッシングユニット(GPU)とマルチコアCPUの高性能を活用する、効率的なオープンソースSVMソフトウェアツールキットThunderSVMを紹介します。ThunderSVMは、分類(SVC)、回帰(SVR)、1クラスSVMなど、LibSVMのすべての機能をサポートし、同じコマンドラインオプションを使用するため、既存のLibSVMユーザーは簡単にツールキットを適用できます。ThunderSVMは、C/C++、Python、R、MATLABなどの複数の言語インターフェイスで使用できます。実験結果によると、ThunderSVMは一般にLibSVMよりも1桁高速でありながら、同一のSVMを生成します。高い効率性に加えて、SVC、SVR、および1クラスSVMが同じソルバーを共有するように、凸最適化ソルバーを一般的な方法で設計して、メンテナンスを容易にしています。ThunderSVMに関するドキュメント、例、その他の情報は、https://github.com/zeyiwen/thundersvmで入手できます。

Refining the Confidence Level for Optimistic Bandit Strategies
楽観的バンディット戦略の信頼レベルの改良

This paper introduces the first strategy for stochastic bandits with unit variance Gaussian noise that is simultaneously minimax optimal up to constant factors, asymptotically optimal, and never worse than the classical upper confidence bound strategy up to universal constant factors. Preliminary empirical evidence is also promising. Besides this, a conjecture on the optimal form of the regret is shown to be false and a finite-time lower bound on the regret of any strategy is presented that very nearly matches the finite-time upper bound of the newly proposed strategy.

この論文では、単位分散ガウスノイズを持つ確率的バンディットの最初の戦略を紹介します。これは、定数因子までは最小最大最適、漸近最適、普遍定数因子までは古典的な信頼限界戦略よりも悪くはありません。予備的な経験的証拠も有望です。これに加えて、後悔の最適な形式に関する推測は誤りであることが示され、任意の戦略の後悔の有限時間下限が、新しく提案された戦略の有限時間上限とほぼ一致することが提示されます。

Distributed Proximal Gradient Algorithm for Partially Asynchronous Computer Clusters
部分非同期コンピュータクラスタのための分布近位勾配アルゴリズム

With ever growing data volume and model size, an error-tolerant, communication efficient, yet versatile distributed algorithm has become vital for the success of many large-scale machine learning applications. In this work we propose m-PAPG, an implementation of the flexible proximal gradient algorithm in model parallel systems equipped with the partially asynchronous communication protocol. The worker machines communicate asynchronously with a controlled staleness bound $s$ and operate at different frequencies. We characterize various convergence properties of m-PAPG: 1) Under a general non-smooth and non-convex setting, we prove that every limit point of the sequence generated by m-PAPG is a critical point of the objective function; 2) Under an error bound condition of convex objective functions, we prove that the optimality gap decays linearly for every $s$ steps; 3) Under the Kurdyka-{\L}ojasiewicz inequality and a sufficient decrease assumption, we prove that the sequences generated by m-PAPG converge to the same critical point, provided that a proximal Lipschitz condition is satisfied.

データ量とモデルサイズが増大し続ける中、エラー耐性があり、通信効率が高く、汎用性の高い分散アルゴリズムが、多くの大規模機械学習アプリケーションの成功に不可欠になっています。この研究では、部分非同期通信プロトコルを備えたモデル並列システムにおける柔軟な近似勾配アルゴリズムの実装であるm-PAPGを提案します。ワーカーマシンは、制御された古さの境界$s$で非同期に通信し、異なる周波数で動作します。m-PAPGのさまざまな収束特性を特徴付けます。1)一般的な非滑らかで非凸の設定では、m-PAPGによって生成されるシーケンスのすべての極限点が目的関数の臨界点であることを証明します。2)凸目的関数のエラー境界条件下では、最適性ギャップが$s$ステップごとに線形に減少することを証明します。3) Kurdyka-{\L}ojasiewicz不等式と十分な減少仮定の下で、近似Lipschitz条件が満たされている場合、m-PAPGによって生成されるシーケンスが同じ臨界点に収束することを証明します。

Dual Principal Component Pursuit
デュアルプリンシパルコンポーネントの追求

We consider the problem of learning a linear subspace from data corrupted by outliers. Classical approaches are typically designed for the case in which the subspace dimension is small relative to the ambient dimension. Our approach works with a dual representation of the subspace and hence aims to find its orthogonal complement; as such, it is particularly suitable for subspaces whose dimension is close to the ambient dimension (subspaces of high relative dimension). We pose the problem of computing normal vectors to the inlier subspace as a non-convex $\ell_1$ minimization problem on the sphere, which we call Dual Principal Component Pursuit (DPCP) problem. We provide theoretical guarantees under which every global solution to DPCP is a vector in the orthogonal complement of the inlier subspace. Moreover, we relax the non-convex DPCP problem to a recursion of linear programs whose solutions are shown to converge in a finite number of steps to a vector orthogonal to the subspace. In particular, when the inlier subspace is a hyperplane, the solutions to the recursion of linear programs converge to the global minimum of the non-convex DPCP problem in a finite number of steps. We also propose algorithms based on alternating minimization and iteratively re-weighted least squares, which are suitable for dealing with large-scale data. Experiments on synthetic data show that the proposed methods are able to handle more outliers and higher relative dimensions than current state-of-the-art methods, while experiments in the context of the three-view geometry problem in computer vision suggest that the proposed methods can be a useful or even superior alternative to traditional RANSAC-based approaches for computer vision and other applications.

私たちは、外れ値によって破損したデータから線形部分空間を学習する問題について考えます。従来のアプローチは、通常、部分空間の次元が周囲の次元に比べて小さい場合を想定して設計されています。私たちのアプローチは、部分空間のデュアル表現で動作し、したがってその直交補集合を見つけることを目指します。そのため、このアプローチは、周囲の次元に近い次元を持つ部分空間(相対的に高い次元の部分空間)に特に適しています。私たちは、インライア部分空間への法線ベクトルを計算する問題を、球面上の非凸$\ell_1$最小化問題として提起し、これをDual Principal Component Pursuit (DPCP)問題と呼んでいます。私たちは、DPCPのすべてのグローバル解がインライア部分空間の直交補集合内のベクトルであるという理論的保証を提供します。さらに、非凸DPCP問題を、解が部分空間に直交するベクトルに有限数のステップで収束することが示されている線形計画の再帰に緩和します。特に、インライア部分空間が超平面である場合、線形計画法の再帰の解は、有限数のステップで非凸DPCP問題のグローバル最小値に収束します。また、大規模データの処理に適した、交互最小化と反復的に再重み付けされた最小二乗に基づくアルゴリズムも提案します。合成データでの実験では、提案された方法が現在の最先端の方法よりも多くの外れ値とより高い相対次元を処理できることが示され、コンピュータービジョンの3ビュー幾何学問題のコンテキストでの実験では、提案された方法がコンピュータービジョンやその他のアプリケーションに対する従来のRANSACベースのアプローチに代わる有用な、あるいは優れた方法になり得ることが示唆されています。

Streaming kernel regression with provably adaptive mean, variance, and regularization
証明可能な適応平均、分散、正則化によるストリーミングカーネル回帰

We consider the problem of streaming kernel regression, when the observations arrive sequentially and the goal is to recover the underlying mean function, assumed to belong to an RKHS. The variance of the noise is not assumed to be known. In this context, we tackle the problem of tuning the regularization parameter adaptively at each time step, while maintaining tight confidence bounds estimates on the value of the mean function at each point. To this end, we first generalize existing results for finite-dimensional linear regression with fixed regularization and known variance to the kernel setup with a regularization parameter allowed to be a measurable function of past observations. Then, using appropriate self-normalized inequalities we build upper and lower bound estimates for the variance, leading to Bernstein-like concentration bounds. The latter is used in order to define the adaptive regularization. The bounds resulting from our technique are valid uniformly over all observation points and all time steps, and are compared against the literature with numerical experiments. Finally, the potential of these tools is illustrated by an application to kernelized bandits, where we revisit the Kernel UCB and Kernel Thompson Sampling procedures, and show the benefits of the novel adaptive kernel tuning strategy.

私たちは、ストリーミングカーネル回帰の問題を検討します。これは、観測が順次到着し、RKHSに属すると想定される基礎となる平均関数を回復することを目標としています。ノイズの分散は既知であるとは想定されていません。このコンテキストでは、各ポイントでの平均関数の値に対する厳密な信頼限界推定値を維持しながら、各タイムステップで正則化パラメーターを適応的に調整する問題に取り組みます。この目的のために、まず、固定正則化と既知の分散を持つ有限次元線形回帰の既存の結果を、正則化パラメーターが過去の観測の測定可能な関数であることが許可されているカーネル設定に一般化します。次に、適切な自己正規化不等式を使用して、分散の上限と下限の推定値を構築し、Bernsteinのような集中限界を導きます。後者は、適応正則化を定義するために使用されます。私たちの手法から得られる限界は、すべての観測ポイントとすべてのタイムステップにわたって一様に有効であり、数値実験で文献と比較されます。最後に、これらのツールの可能性は、カーネル化されたバンディットへの適用によって示され、カーネルUCBおよびカーネルトンプソンサンプリング手順を再検討し、新しい適応型カーネルチューニング戦略の利点を示します。

ELFI: Engine for Likelihood-Free Inference
ELFI:尤度フリー推論のためのエンジン

Engine for Likelihood-Free Inference (ELFI) is a Python software library for performing likelihood-free inference (LFI). ELFI provides a convenient syntax for arranging components in LFI, such as priors, simulators, summaries or distances, to a network called ELFI graph. The components can be implemented in a wide variety of languages. The stand-alone ELFI graph can be used with any of the available inference methods without modifications. A central method implemented in ELFI is Bayesian Optimization for Likelihood-Free Inference (BOLFI), which has recently been shown to accelerate likelihood-free inference up to several orders of magnitude by surrogate-modelling the distance. ELFI also has an inbuilt support for output data storing for reuse and analysis, and supports parallelization of computation from multiple cores up to a cluster environment. ELFI is designed to be extensible and provides interfaces for widening its functionality. This makes the adding of new inference methods to ELFI straightforward and automatically compatible with the inbuilt features.

ELFI (尤度自由推論エンジン)は、尤度自由推論(LFI)を実行するためのPythonソフトウェアライブラリです。ELFIは、事前確率、シミュレータ、要約、距離などのLFI内のコンポーネントをELFIグラフと呼ばれるネットワークに配置するための便利な構文を提供します。コンポーネントは、さまざまな言語で実装できます。スタンドアロンのELFIグラフは、変更せずに、利用可能な推論方法のいずれでも使用できます。ELFIに実装されている中心的な方法は、尤度自由推論のベイジアン最適化(BOLFI)です。これは、距離を代理モデル化することで尤度自由推論を最大数桁高速化することが最近明らかになりました。ELFIには、再利用と分析のための出力データの保存に対する組み込みサポートもあり、複数のコアからクラスター環境までの計算の並列化をサポートしています。ELFIは拡張可能に設計されており、機能を拡張するためのインターフェイスを提供します。これにより、ELFIへの新しい推論方法の追加が簡単になり、組み込み機能と自動的に互換性が保たれます。

Regularized Optimal Transport and the Rot Mover’s Distance
正則化された最適輸送とロットムーバーの距離

This paper presents a unified framework for smooth convex regularization of discrete optimal transport problems. In this context, the regularized optimal transport turns out to be equivalent to a matrix nearness problem with respect to Bregman divergences. Our framework thus naturally generalizes a previously proposed regularization based on the Boltzmann-Shannon entropy related to the Kullback-Leibler divergence, and solved with the Sinkhorn-Knopp algorithm. We call the regularized optimal transport distance the rot mover’s distance in reference to the classical earth mover’s distance. By exploiting alternate Bregman projections, we develop the alternate scaling algorithm and non-negative alternate scaling algorithm, to compute efficiently the regularized optimal plans depending on whether the domain of the regularizer lies within the non-negative orthant or not. We further enhance the separable case with a sparse extension to deal with high data dimensions. We also instantiate our framework and discuss the inherent specificities for well-known regularizers and statistical divergences in the machine learning and information geometry communities. Finally, we demonstrate the merits of our methods with experiments using synthetic data to illustrate the effect of different regularizers, penalties and dimensions, as well as real-world data for a pattern recognition application to audio scene classification.

この論文では、離散最適輸送問題の滑らかな凸正規化のための統一フレームワークを提示します。この文脈では、正規化された最適輸送は、ブレグマンダイバージェンスに関する行列近接問題と同等であることが判明します。したがって、我々のフレームワークは、カルバックライブラーダイバージェンスに関連するボルツマンシャノンエントロピーに基づいて以前に提案され、シンクホーンノップアルゴリズムで解決された正規化を自然に一般化します。私たちは、正規化された最適輸送距離を、古典的なアースムーバーの距離を参照して、ロートムーバーの距離と呼ぶ。代替ブレグマン射影を利用することで、我々は代替スケーリングアルゴリズムと非負代替スケーリングアルゴリズムを開発し、正規化子のドメインが非負の直交座標内にあるかどうかに応じて、正規化された最適計画を効率的に計算します。私たちは、高データ次元を処理するために、スパース拡張によって分離可能なケースをさらに強化します。また、フレームワークをインスタンス化し、機械学習や情報幾何学のコミュニティでよく知られている正則化子と統計的相違の固有の特異性についても説明します。最後に、合成データを使用した実験で、さまざまな正則化子、ペナルティ、次元の効果、およびオーディオシーン分類へのパターン認識アプリケーションの実際のデータを示し、私たちの方法のメリットを示します。

Model-Free Trajectory-based Policy Optimization with Monotonic Improvement
単調な改善によるモデルフリーの軌跡ベースのポリシー最適化

Many of the recent trajectory optimization algorithms alternate between linear approximation of the system dynamics around the mean trajectory and conservative policy update. One way of constraining the policy change is by bounding the Kullback-Leibler (KL) divergence between successive policies. These approaches already demonstrated great experimental success in challenging problems such as end-to-end control of physical systems. However, the linear approximation of the system dynamics can introduce a bias in the policy update and prevent convergence to the optimal policy. In this article, we propose a new model-free trajectory-based policy optimization algorithm with guaranteed monotonic improvement. The algorithm backpropagates a local, quadratic and time-dependent \qfunc learned from trajectory data instead of a model of the system dynamics. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics. We experimentally demonstrate on highly non-linear control tasks the improvement in performance of our algorithm in comparison to approaches linearizing the system dynamics. In order to show the monotonic improvement of our algorithm, we additionally conduct a theoretical analysis of our policy update scheme to derive a lower bound of the change in policy return between successive iterations.

最近の軌道最適化アルゴリズムの多くは、平均軌道の周りのシステムダイナミクスの線形近似と保守的なポリシー更新を交互に繰り返します。ポリシー変更を制限する方法の1つは、連続するポリシー間のKullback-Leibler (KL)ダイバージェンスを制限することです。これらのアプローチは、物理システムのエンドツーエンド制御などの困難な問題ですでに大きな実験的成功を収めています。ただし、システムダイナミクスの線形近似により、ポリシー更新にバイアスが生じ、最適なポリシーへの収束が妨げられる可能性があります。この記事では、単調な改善が保証された新しいモデルフリーの軌道ベースのポリシー最適化アルゴリズムを提案します。このアルゴリズムは、システムダイナミクスのモデルではなく、軌道データから学習したローカルで2次かつ時間依存の\qfuncをバックプロパゲーションします。このポリシー更新により、システムダイナミクスの仮定を単純化することなく、KL制約の厳密な充足が保証されます。高度に非線形な制御タスクで、システムダイナミクスを線形化するアプローチと比較して、このアルゴリズムのパフォーマンスが向上することを実験的に実証します。私たちのアルゴリズムの単調な改善を示すために、私たちはさらにポリシー更新スキームの理論的分析を行い、連続する反復間のポリシーリターンの変化の下限を導き出します。

A Robust Learning Approach for Regression Models Based on Distributionally Robust Optimization
分布ロバスト最適化に基づく回帰モデルのためのロバスト学習アプローチ

We present a Distributionally Robust Optimization (DRO) approach to estimate a robustified regression plane in a linear regression setting, when the observed samples are potentially contaminated with adversarially corrupted outliers. Our approach mitigates the impact of outliers by hedging against a family of probability distributions on the observed data, some of which assign very low probabilities to the outliers. The set of distributions under consideration are close to the empirical distribution in the sense of the Wasserstein metric. We show that this DRO formulation can be relaxed to a convex optimization problem which encompasses a class of models. By selecting proper norm spaces for the Wasserstein metric, we are able to recover several commonly used regularized regression models. We provide new insights into the regularization term and give guidance on the selection of the regularization coefficient from the standpoint of a confidence region. We establish two types of performance guarantees for the solution to our formulation under mild conditions. One is related to its out-of-sample behavior (prediction bias), and the other concerns the discrepancy between the estimated and true regression planes (estimation bias). Extensive numerical results demonstrate the superiority of our approach to a host of regression models, in terms of the prediction and estimation accuracies. We also consider the application of our robust learning procedure to outlier detection, and show that our approach achieves a much higher AUC (Area Under the ROC Curve) than M-estimation (Huber, 1964, 1973).

私たちは、観測サンプルが敵対的に破損した外れ値で汚染されている可能性がある場合、線形回帰設定でロバスト化された回帰平面を推定するための分布ロバスト最適化(DRO)アプローチを紹介します。このアプローチは、観測データに対する一連の確率分布(外れ値に非常に低い確率を割り当てるものもあります)をヘッジすることで外れ値の影響を軽減します。検討中の分布のセットは、ワッサーシュタインメトリックの意味での経験分布に近いものです。このDRO定式化は、モデルのクラスを包含する凸最適化問題に緩和できることを示します。ワッサーシュタインメトリックに適切なノルム空間を選択することで、一般的に使用されるいくつかの正則化回帰モデルを復元できます。正則化項に関する新しい洞察を提供し、信頼領域の観点から正則化係数の選択に関するガイダンスを提供します。穏やかな条件下での定式化のソリューションに対して、2種類のパフォーマンス保証を確立します。1つはサンプル外の挙動(予測バイアス)に関連し、もう1つは推定された回帰平面と実際の回帰平面の不一致(推定バイアス)に関係します。広範な数値結果により、予測および推定精度の点で、多数の回帰モデルに対する当社のアプローチの優位性が実証されています。また、外れ値検出への堅牢な学習手順の適用についても検討し、当社のアプローチがM推定(Huber、1964、1973)よりもはるかに高いAUC (ROC曲線下面積)を達成することを示しています。

Statistical Analysis and Parameter Selection for Mapper
マッパーの統計分析とパラメータ選択

In this article, we study the question of the statistical convergence of the 1-dimensional Mapper to its continuous analogue, the Reeb graph. We show that the Mapper is an optimal estimator of the Reeb graph, which gives, as a byproduct, a method to automatically tune its parameters and compute confidence regions on its topological features, such as its loops and flares. This allows to circumvent the issue of testing a large grid of parameters and keeping the most stable ones in the brute-force setting, which is widely used in visualization, clustering and feature selection with the Mapper.

この記事では、1次元マッパーとその連続アナログであるReebグラフへの統計的収束の問題を研究します。マッパーがリーブグラフの最適な推定量であり、副産物として、そのパラメータを自動的に調整し、ループやフレアなどのトポロジカル特徴の信頼領域を計算する方法を提供することを示します。これにより、パラメータの大きなグリッドをテストし、最も安定したパラメータをブルートフォース設定で保持するという問題を回避できます。これは、Mapperによる視覚化、クラスタリング、および機能選択で広く使用されています。

Change-Point Computation for Large Graphical Models: A Scalable Algorithm for Gaussian Graphical Models with Change-Points
大規模グラフィカルモデルに対する変化点計算:変化点を持つガウスグラフィカルモデルのためのスケーラブルなアルゴリズム

Graphical models with change-points are computationally challenging to fit, particularly in cases where the number of observation points and the number of nodes in the graph are large. Focusing on Gaussian graphical models, we introduce an approximate majorize-minimize (MM) algorithm that can be useful for computing change-points in large graphical models. The proposed algorithm is an order of magnitude faster than a brute force search. Under some regularity conditions on the data generating process, we show that with high probability, the algorithm converges to a value that is within statistical error of the true change-point. A fast implementation of the algorithm using Markov Chain Monte Carlo is also introduced. The performances of the proposed algorithms are evaluated on synthetic data sets and the algorithm is also used to analyze structural changes in the S{\&}P 500 over the period 2000-2016.

変化点を持つグラフィカルモデルは、特にグラフ内の観測点の数とノードの数が多い場合に、適合させるのが計算的に困難です。ガウスグラフィカルモデルに焦点を当てて、大規模なグラフィカルモデルの変化点を計算するのに役立つ近似マジョライズ最小化(MM)アルゴリズムを紹介します。提案されたアルゴリズムは、ブルートフォース検索よりも桁違いに高速です。データ生成プロセスのいくつかの規則性条件下では、アルゴリズムが真の変化点の統計的誤差内にある値に高い確率で収束することを示します。マルコフ連鎖モンテカルロ法を使用したアルゴリズムの高速実装も紹介されています。提案されたアルゴリズムの性能は、合成データセットで評価され、このアルゴリズムは、2000年から2016年の期間にわたるS{&}P 500の構造変化を分析するためにも使用されます。

A Constructive Approach to L_0 Penalized Regression
L_0ペナルティ回帰への建設的アプローチ

We propose a constructive approach to estimating sparse, high-dimensional linear regression models. The approach is a computational algorithm motivated from the KKT conditions for the $\ell_0$-penalized least squares solutions. It generates a sequence of solutions iteratively, based on support detection using primal and dual information and root finding. We refer to the algorithm as SDAR for brevity. Under a sparse Riesz condition on the design matrix and certain other conditions, we show that with high probability, the $\ell_2$ estimation error of the solution sequence decays exponentially to the minimax error bound in $O(\log(R\sqrt{J}))$ iterations, where $J$ is the number of important predictors and $R$ is the relative magnitude of the nonzero target coefficients; and under a mutual coherence condition and certain other conditions, the $\ell_{\infty}$ estimation error decays to the optimal error bound in $O(\log(R))$ iterations. Moreover the SDAR solution recovers the oracle least squares estimator within a finite number of iterations with high probability if the sparsity level is known. Computational complexity analysis shows that the cost of SDAR is $O(np)$ per iteration. We also consider an adaptive version of SDAR for use in practical applications where the true sparsity level is unknown. Simulation studies demonstrate that SDAR outperforms Lasso, MCP and two greedy methods in accuracy and efficiency.

私たちは、スパースで高次元の線形回帰モデルを推定するための建設的なアプローチを提案します。このアプローチは、$\ell_0$ペナルティ付き最小二乗解のKKT条件に着想を得た計算アルゴリズムです。これは、主情報と双対情報を使用したサポート検出と根の探索に基づいて、反復的に解のシーケンスを生成します。簡潔にするために、このアルゴリズムをSDARと呼ぶ。計画行列に対するスパースRiesz条件と他の特定の条件下では、高い確率で、解のシーケンスの$\ell_2$推定誤差が、$O(\log(R\sqrt{J}))$回の反復でミニマックス誤差境界まで指数関数的に減少することを示す。ここで、$J$は重要な予測子の数、$R$は非ゼロのターゲット係数の相対的な大きさです。相互コヒーレンス条件および他の特定の条件下では、$\ell_{\infty}$推定誤差は$O(\log(R))$回の反復で最適誤差境界まで減少します。さらに、SDARソリューションは、スパースレベルがわかっている場合、有限回数の反復で高い確率でオラクル最小二乗推定値を回復します。計算複雑性分析により、SDARのコストは反復あたり$O(np)$であることが示されています。また、真のスパースレベルが不明な実際のアプリケーションで使用するためのSDARの適応バージョンも検討します。シミュレーション研究では、SDARは精度と効率の点でLasso、MCP、および2つの貪欲法よりも優れていることが実証されています。

Experience Selection in Deep Reinforcement Learning for Control
制御のための深層強化学習における経験選択

Experience replay is a technique that allows off-policy reinforcement-learning methods to reuse past experiences. The stability and speed of convergence of reinforcement learning, as well as the eventual performance of the learned policy, are strongly dependent on the experiences being replayed. Which experiences are replayed depends on two important choices. The first is which and how many experiences to retain in the experience replay buffer. The second choice is how to sample the experiences that are to be replayed from that buffer. We propose new methods for the combined problem of experience retention and experience sampling. We refer to the combination as experience selection. We focus our investigation specifically on the control of physical systems, such as robots, where exploration is costly. To determine which experiences to keep and which to replay, we investigate different proxies for their immediate and long-term utility. These proxies include age, temporal difference error and the strength of the applied exploration noise. Since no currently available method works in all situations, we propose guidelines for using prior knowledge about the characteristics of the control problem at hand to choose the appropriate experience replay strategy.

経験の再生は、オフポリシー強化学習法で過去の経験を再利用できるようにするための手法です。強化学習の安定性と収束速度、および学習したポリシーの最終的なパフォーマンスは、再生される経験に大きく依存します。どの経験が再生されるかは、2つの重要な選択によって決まります。1つ目は、経験再生バッファーに保持する経験とその数です。2つ目は、そのバッファーから再生される経験をどのようにサンプリングするかです。経験の保持と経験のサンプリングを組み合わせた問題に対する新しい方法を提案します。この組み合わせを経験選択と呼びます。特に、探索にコストがかかるロボットなどの物理システムの制御に焦点を当てて調査します。どの経験を保持し、どの経験を再生するかを決定するために、その即時的および長期的な有用性に関するさまざまなプロキシを調査します。これらのプロキシには、年齢、時間差エラー、適用された探索ノイズの強度が含まれます。現在利用可能な方法はすべての状況で機能しないため、適切な経験再生戦略を選択するために、手元の制御問題の特性に関する事前知識を使用するためのガイドラインを提案します。

Scalable Bayes via Barycenter in Wasserstein Space
ワッサーシュタイン空間の重心によるスケーラブルベイズ

Divide-and-conquer based methods for Bayesian inference provide a general approach for tractable posterior inference when the sample size is large. These methods divide the data into smaller subsets, sample from the posterior distribution of parameters in parallel on all the subsets, and combine posterior samples from all the subsets to approximate the full data posterior distribution. The smaller size of any subset compared to the full data implies that posterior sampling on any subset is computationally more efficient than sampling from the true posterior distribution. Since the combination step takes negligible time relative to sampling, posterior computations can be scaled to massive data by dividing the full data into sufficiently large number of data subsets. One such approach relies on the geometry of posterior distributions estimated across different subsets and combines them through their barycenter in a Wasserstein space of probability measures. We provide theoretical guarantees on the accuracy of approximation that are valid in many applications. We show that the geometric method approximates the full data posterior distribution better than its competitors across diverse simulations and reproduces known results when applied to a movie ratings database.

ベイズ推定の分割統治法に基づく方法は、サンプルサイズが大きい場合に扱いやすい事後推定の一般的なアプローチを提供します。これらの方法では、データを小さなサブセットに分割し、すべてのサブセットでパラメータの事後分布から並行してサンプリングし、すべてのサブセットからの事後サンプルを組み合わせて、完全なデータ事後分布を近似します。完全なデータと比較してサブセットのサイズが小さいということは、任意のサブセットの事後サンプリングが、真の事後分布からサンプリングするよりも計算上効率的であることを意味します。結合ステップはサンプリングに比べてごくわずかな時間しかかからないため、完全なデータを十分な数のデータサブセットに分割することにより、事後計算を大規模なデータに拡張できます。このようなアプローチの1つは、異なるサブセット間で推定された事後分布の幾何学に依存し、確率測度のワッサーシュタイン空間内の重心を介してそれらを結合します。多くのアプリケーションで有効な近似の精度について理論的な保証を提供します。幾何学的手法は、さまざまなシミュレーションにおいて競合手法よりも完全なデータ事後分布をより正確に近似し、映画評価データベースに適用した場合に既知の結果を再現することを示します。

Patchwork Kriging for Large-scale Gaussian Process Regression
大規模ガウス過程回帰のためのパッチワーククリギング

This paper presents a new approach for Gaussian process (GP) regression for large datasets. The approach involves partitioning the regression input domain into multiple local regions with a different local GP model fitted in each region. Unlike existing local partitioned GP approaches, we introduce a technique for patching together the local GP models nearly seamlessly to ensure that the local GP models for two neighboring regions produce nearly the same response prediction and prediction error variance on the boundary between the two regions. This largely mitigates the well-known discontinuity problem that degrades the prediction accuracy of existing local partitioned GP methods over regional boundaries. Our main innovation is to represent the continuity conditions as additional pseudo-observations that the differences between neighboring GP responses are identically zero at an appropriately chosen set of boundary input locations. To predict the response at any input location, we simply augment the actual response observations with the pseudo-observations and apply standard GP prediction methods to the augmented data. In contrast to heuristic continuity adjustments, this has an advantage of working within a formal GP framework, so that the GP-based predictive uncertainty quantification remains valid. Our approach also inherits a sparse block-like structure for the sample covariance matrix, which results in computationally efficient closed-form expressions for the predictive mean and variance. In addition, we provide a new spatial partitioning scheme based on a recursive space partitioning along local principal component directions, which makes the proposed approach applicable for regression domains having more than two dimensions. Using three spatial datasets and three higher dimensional datasets, we investigate the numerical performance of the approach and compare it to several state-of-the-art approaches.

この論文では、大規模なデータセットのガウス過程(GP)回帰の新しいアプローチを紹介します。このアプローチでは、回帰入力ドメインを複数のローカル領域に分割し、各領域に異なるローカルGPモデルを適合させます。既存のローカル分割GPアプローチとは異なり、ローカルGPモデルをほぼシームレスにパッチして、2つの隣接する領域のローカルGPモデルが、2つの領域間の境界でほぼ同じ応答予測と予測誤差分散を生成するようにする手法を紹介します。これにより、領域境界を越えた既存のローカル分割GP方法の予測精度を低下させる、よく知られている不連続性の問題が大幅に軽減されます。私たちの主な革新は、連続性条件を、適切に選択された境界入力位置のセットで隣接するGP応答間の差が同一にゼロであるという追加の疑似観測として表すことです。任意の入力位置での応答を予測するには、実際の応答観測を疑似観測で拡張し、拡張されたデータに標準のGP予測方法を適用するだけです。ヒューリスティックな連続性調整とは対照的に、これは正式なGPフレームワーク内で機能するという利点があり、そのためGPベースの予測不確実性の定量化は有効なままです。私たちのアプローチは、サンプル共分散行列のスパースブロックのような構造も継承しており、予測平均と分散の計算効率の高い閉じた形式の表現が得られます。さらに、ローカル主成分方向に沿った再帰的な空間分割に基づく新しい空間分割スキームを提供します。これにより、提案されたアプローチは2次元を超える回帰ドメインに適用できます。3つの空間データセットと3つの高次元データセットを使用して、このアプローチの数値パフォーマンスを調査し、いくつかの最先端のアプローチと比較します。

RSG: Beating Subgradient Method without Smoothness and Strong Convexity
RSG:滑らかさと強い凸性のないサブグラディエント法を打ち負かす

In this paper, we study the efficiency of a Restarted SubGradient (RSG) method that periodically restarts the standard subgradient method (SG). We show that, when applied to a broad class of convex optimization problems, RSG method can find an $\epsilon$-optimal solution with a lower complexity than the SG method. In particular, we first show that RSG can reduce the dependence of SG’s iteration complexity on the distance between the initial solution and the optimal set to that between the $\epsilon$-level set and the optimal set {multiplied by a logarithmic factor}. Moreover, we show the advantages of RSG over SG in solving a broad family of problems that satisfy a local error bound condition, and also demonstrate its advantages for three specific families of convex optimization problems with different power constants in the local error bound condition. (a) For the problems whose epigraph is a polyhedron, RSG is shown to converge linearly. (b) For the problems with local quadratic growth property in the $\epsilon$-sublevel set, RSG has an $O(\frac{1}{\epsilon}\log(\frac{1}{\epsilon}))$ iteration complexity. (c) For the problems that admit a local Kurdyka-{\L}ojasiewicz property with a power constant of $\beta\in[0,1)$, RSG has an $O(\frac{1}{\epsilon^{2\beta}}\log(\frac{1}{\epsilon}))$ iteration complexity. The novelty of our analysis lies at exploiting the lower bound of the first-order optimality residual at the $\epsilon$-level set. It is this novelty that allows us to explore the local properties of functions (e.g., local quadratic growth property, local Kurdyka-\L ojasiewicz property, more generally local error bound conditions) to develop the improved convergence of RSG. We also develop a practical variant of RSG enjoying faster convergence than the SG method, which can be run without knowing the involved parameters in the local error bound condition. We demonstrate the effectiveness of the proposed algorithms on several machine learning tasks including regression, classification and matrix completion.

この論文では、標準サブグラディエント法(SG)を定期的に再開するRestarted SubGradient (RSG)法の効率性について検討します。広範な凸最適化問題に適用した場合、RSG法はSG法よりも低い複雑度で$\epsilon$最適解を見つけることができることを示します。特に、まずRSGは、SGの反復複雑度が初期解と最適集合の距離に依存することを、$\epsilon$レベル集合と最適集合{対数係数で乗じたもの}の距離に依存することまで減らすことができることを示します。さらに、局所誤差境界条件を満たす広範な問題群を解く上でSGよりもRSGの方が優れていることを示し、局所誤差境界条件のべき乗定数が異なる3つの特定の凸最適化問題群についてRSGの利点を示します。(a)エピグラフが多面体である問題の場合、RSGは線形収束することが示されています。(b) $\epsilon$サブレベルセットのローカル二次成長特性を持つ問題の場合、RSGの反復計算量は$O(\frac{1}{\epsilon}\log(\frac{1}{\epsilon}))$です。(c)べき定数$\beta\in[0,1)$を持つローカルKurdyka-{\L}ojasiewicz特性を許容する問題の場合、RSGの反復計算量は$O(\frac{1}{\epsilon^{2\beta}}\log(\frac{1}{\epsilon}))$です。この分析の新規性は、$\epsilon$レベルセットでの1次最適性残差の下限値を利用することにあります。この新規性により、関数のローカル特性(ローカル二次成長特性、ローカルKurdyka-\L ojasiewicz特性、より一般的にはローカル誤差境界条件など)を調査して、RSGの収束性を向上させることができます。また、SG法よりも収束が速く、ローカル誤差境界条件の関連パラメータを知らなくても実行できるRSGの実用的な変種も開発しました。回帰、分類、行列補完などのいくつかの機械学習タスクで提案アルゴリズムの有効性を実証しました。

Can We Trust the Bootstrap in High-dimensions? The Case of Linear Models
ブートストラップを高次元で信頼できますか?線形モデルの場合

We consider the performance of the bootstrap in high-dimensions for the setting of linear regression, where $p\lt n$ but $p/n$ is not close to zero. We consider ordinary least-squares as well as robust regression methods and adopt a minimalist performance requirement: can the bootstrap give us good confidence intervals for a single coordinate of $\beta$ (where $\beta$ is the true regression vector)? We show through a mix of numerical and theoretical work that the bootstrap is fraught with problems. Both of the most commonly used methods of bootstrapping for regression–residual bootstrap and pairs bootstrap–give very poor inference on $\beta$ as the ratio $p/n$ grows. We find that the residual bootstrap tend to give anti-conservative estimates (inflated Type I error), while the pairs bootstrap gives very conservative estimates (severe loss of power) as the ratio $p/n$ grows. We also show that the jackknife resampling technique for estimating the variance of $\hat{\beta}$ severely overestimates the variance in high dimensions. We contribute alternative procedures based on our theoretical results that result in dimensionality adaptive and robust bootstrap methods.

私たちは、線形回帰の設定について、高次元でのブートストラップのパフォーマンスを検討します。この設定では、$p\lt n$ですが、$p/n$はゼロに近くありません。通常の最小二乗法とロバスト回帰法を検討し、最小限のパフォーマンス要件を採用します。つまり、ブートストラップは、$\beta$ (ここで、$\beta$は真の回帰ベクトル)の単一の座標に対して適切な信頼区間を与えることができるかということです。数値的および理論的研究を組み合わせて、ブートストラップには多くの問題があることを示します。回帰のブートストラップで最も一般的に使用される2つの方法(残差ブートストラップとペアブートストラップ)では、比$p/n$が大きくなるにつれて、$\beta$に関する推論が非常に不十分になります。残差ブートストラップは反保守的な推定値(過大なタイプIの誤差)を与える傾向があり、ペアブートストラップは比$p/n$が大きくなるにつれて、非常に保守的な推定値(検出力の大幅な低下)を与えることがわかりました。また、$\hat{\beta}$の分散を推定するためのジャックナイフ再サンプリング手法は、高次元での分散を大幅に過大評価することを示しています。私たちは、次元適応型で堅牢なブートストラップ法をもたらす理論的結果に基づいた代替手順を提供します。

A Hidden Absorbing Semi-Markov Model for Informatively Censored Temporal Data: Learning and Inference
情報打ち切り時間データのための隠れ吸収セミマルコフモデル:学習と推論

Modeling continuous-time physiological processes that manifest a patient’s evolving clinical states is a key step in approaching many problems in healthcare. In this paper, we develop the Hidden Absorbing Semi-Markov Model (HASMM): a versatile probabilistic model that is capable of capturing the modern electronic health record (EHR) data. Unlike existing models, the HASMM accommodates irregularly sampled, temporally correlated, and informatively censored physiological data, and can describe non-stationary clinical state transitions. Learning the HASMM parameters from the EHR data is achieved via a novel forward-filtering backward-sampling Monte-Carlo EM algorithm that exploits the knowledge of the end-point clinical outcomes (informative censoring) in the EHR data, and implements the E-step by sequentially sampling the patients’ clinical states in the reverse-time direction while conditioning on the future states. Real-time inferences are drawn via a forward-filtering algorithm that operates on a virtually constructed discrete-time embedded Markov chain that mirrors the patient’s continuous-time state trajectory. We demonstrate the prognostic utility of the HASMM in a critical care prognosis setting using a real-world dataset for patients admitted to the Ronald Reagan UCLA Medical Center. In particular, we show that using HASMMs, a patient’s clinical deterioration can be predicted 8-9 hours prior to intensive care unit admission, with a 22$\%$ AUC gain compared to the Rothman index, which is the state-of-the-art critical care risk scoring technology.

患者の変化する臨床状態を表す連続時間生理学的プロセスをモデル化することは、医療における多くの問題に取り組む上で重要なステップです。この論文では、現代の電子医療記録(EHR)データを捕捉できる多目的確率モデルである隠れ吸収半マルコフモデル(HASMM)を開発します。既存のモデルとは異なり、HASMMは不規則にサンプリングされ、時間的に相関し、情報的に検閲された生理学的データに対応し、非定常の臨床状態遷移を記述できます。EHRデータからのHASMMパラメータの学習は、EHRデータのエンドポイント臨床結果(情報的検閲)の知識を活用し、将来の状態を条件付けしながら患者の臨床状態を逆時間方向に順次サンプリングすることでEステップを実装する、新しい順方向フィルタリング逆方向サンプリングモンテカルロEMアルゴリズムによって実現されます。リアルタイムの推論は、患者の連続時間状態軌跡を反映する仮想的に構築された離散時間埋め込みマルコフ連鎖で動作するフォワードフィルタリングアルゴリズムによって行われます。ロナルドレーガンUCLA医療センターに入院した患者の実際のデータセットを使用して、集中治療予後設定におけるHASMMの予後有用性を実証します。特に、HASMMを使用すると、集中治療室入院の8～9時間前に患者の臨床的悪化を予測でき、最先端の集中治療リスクスコアリング技術であるRothmanインデックスと比較して22%のAUCゲインが得られることを示しています。

Approximate Submodularity and its Applications: Subset Selection, Sparse Approximation and Dictionary Selection
近似部分モジュラ性とその応用:サブセット選択、スパース近似、辞書選択

We introduce the submodularity ratio as a measure of how “close” to submodular a set function $f$ is. We show that when $f$ has submodularity ratio $\gamma$, the greedy algorithm for maximizing $f$ provides a $(1-e^{-\gamma})$-approximation. Furthermore, when $\gamma$ is bounded away from 0, the greedy algorithm for minimum submodular cover also provides essentially an $O(\log n)$ approximation for a universe of $n$ elements. As a main application of this framework, we study the problem of selecting a subset of $k$ random variables from a large set, in order to obtain the best linear prediction of another variable of interest. We analyze the performance of widely used greedy heuristics; in particular, by showing that the submodularity ratio is lower-bounded by the smallest $2k$-sparse eigenvalue of the covariance matrix, we obtain the strongest known approximation guarantees for the Forward Regression and Orthogonal Matching Pursuit algorithms. As a second application, we analyze greedy algorithms for the dictionary selection problem, and significantly improve the previously known guarantees. Our theoretical analysis is complemented by experiments on real-world and synthetic data sets; in particular, we focus on an analysis of how tight various spectral parameters and the submodularity ratio are in terms of predicting the performance of the greedy algorithms.

私たちは、集合関数$f$がサブモジュラにどれだけ「近い」かの尺度として、サブモジュラリティ比を導入します。$f$がサブモジュラリティ比$\gamma$を持つ場合、$f$を最大化する貪欲アルゴリズムは$(1-e^{-\gamma})$近似を提供することを示す。さらに、$\gamma$が0から離れる方向に制限されている場合、サブモジュラ被覆を最小化する貪欲アルゴリズムは、$n$要素のユニバースに対して本質的に$O(\log n)$近似も提供します。このフレームワークの主な応用として、別の関心のある変数の最良の線形予測を得るために、大きなセットから$k$個のランダム変数のサブセットを選択する問題を研究します。広く使用されている貪欲ヒューリスティックのパフォーマンスを分析します。特に、サブモジュラリティ比が共分散行列の最小の$2k$疎固有値によって下限が定められていることを示すことにより、順方向回帰アルゴリズムと直交マッチング追跡アルゴリズムに対する最も強力な既知の近似保証が得られます。2番目の応用として、辞書選択問題に対する貪欲アルゴリズムを分析し、従来の保証を大幅に改善します。理論分析は、実世界および合成データセットでの実験によって補完されます。特に、貪欲アルゴリズムのパフォーマンスを予測する上で、さまざまなスペクトルパラメーターとサブモジュラリティ比がどの程度厳密であるかの分析に焦点を当てています。

A Two-Stage Penalized Least Squares Method for Constructing Large Systems of Structural Equations
構造方程式の大規模システムを構築するための2段階ペナルティ付き最小二乗法

We propose a two-stage penalized least squares method to build large systems of structural equations based on the instrumental variables view of the classical two-stage least squares method. We show that, with large numbers of endogenous and exogenous variables, the system can be constructed via consistent estimation of a set of conditional expectations at the first stage, and consistent selection of regulatory effects at the second stage. While the consistent estimation at the first stage can be obtained via the ridge regression, the adaptive lasso is employed at the second stage to achieve the consistent selection. This method is computationally fast and allows for parallel implementation. We demonstrate its effectiveness via simulation studies and real data analysis.

私たちは、古典的な2段階最小二乗法の道具変数ビューに基づいて、構造方程式の大規模なシステムを構築するための2段階ペナルティ付き最小二乗法を提案します。私たちは、多数の内生変数と外生変数を用いて、第1段階では条件付きの期待値を一貫して推定し、第2段階では調節効果を一貫して選択することでシステムを構築できることを示す。第1段階での一貫した推定はリッジ回帰によって取得できますが、第2段階では適応型なげなわを使用して一貫した選択を実現します。この方法は計算が高速であり、並列実装が可能です。シミュレーション研究と実データ解析により、その有効性を実証します。

Numerical Analysis near Singularities in RBF Networks
RBFネットワークにおける特異点近傍数値解析

The existence of singularities often affects the learning dynamics in feedforward neural networks. In this paper, based on theoretical analysis results, we numerically analyze the learning dynamics of radial basis function (RBF) networks near singularities to understand to what extent singularities influence the learning dynamics. First, we show the explicit expression of the Fisher information matrix for RBF networks. Second, we demonstrate through numerical simulations that the singularities have a significant impact on the learning dynamics of RBF networks. Our results show that overlap singularities mainly have influence on the low dimensional RBF networks and elimination singularities have a more significant impact to the learning processes than overlap singularities in both low and high dimensional RBF networks, whereas the plateau phenomena are mainly caused by the elimination singularities. The results can also be the foundation to investigate the singular learning dynamics in deep feedforward neural networks.

フィードフォワードニューラルネットワークでは、特異点の存在が学習ダイナミクスに影響を与えることがよくあります。この論文では、理論分析の結果に基づいて、特異点付近のラジアル基底関数(RBF)ネットワークの学習ダイナミクスを数値的に分析し、特異点が学習ダイナミクスにどの程度影響するかを理解します。まず、RBFネットワークのフィッシャー情報行列の明示的な表現を示します。次に、数値シミュレーションによって、特異点がRBFネットワークの学習ダイナミクスに大きな影響を与えることを実証します。結果は、重なり特異点は主に低次元RBFネットワークに影響を与え、低次元および高次元RBFネットワークの両方で、消去特異点は重なり特異点よりも学習プロセスに大きな影響を与える一方で、プラトー現象は主に消去特異点によって引き起こされることを示しています。この結果は、ディープフィードフォワードニューラルネットワークの特異学習ダイナミクスを調査するための基礎にもなります。

Journal of Machine Learning Research Papers: Volume 19の論文一覧

こちらもおすすめ

Journal of Machine Learning Research Papers: Volume 2の論文一覧

Journal of Machine Learning Research Papers: Volume 4の論文一覧

Journal of Machine Learning Research Papers: Volume 18の論文一覧