Journal of Machine Learning Research Papers: List of Volume 20 Papers

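This page lists the papers published in Journal of Machine Learning Research Papers Volume 20, with each title and abstract followed by a Japanese rendering produced with the help of machine translation.

Why do deep convolutional networks generalize so poorly to small image transformations?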
なぜ深い畳み込みネットワークは、小さな画像変換に対してうまく汎化できないのでしょうか?

Convolutional Neural Networks (CNNs) are commonly assumed to be invariant to small image transformations: either because of the convolutional architecture or because they were trained using data augmentation. Recently, several authors have shown that this is not the case: small translations or rescalings of the input image can drastically change the network’s prediction. In this paper, we quantify this phenomenon and ask why neither the convolutional architecture nor data augmentation is sufficient to achieve the desired invariance. Specifically, we show that the convolutional architecture does not give invariance since architectures ignore the classical sampling theorem, and data augmentation does not give invariance because the CNNs learn to be invariant to transformations only for images that are very similar to typical images from the training set. We discuss two possible solutions to this problem: (1) antialiasing the intermediate representations and (2) increasing data augmentation and show that they provide only a partial solution at best. Taken together, our results indicate that the problem of ensuring invariance to small image transformations in neural networks while preserving high accuracy remains unsolved.

畳み込みニューラルネットワーク(CNN)は、畳み込みアーキテクチャのため、またはデータ拡張を使用してトレーニングされているため、一般に小さな画像変換に対して不変であると想定されています。最近、複数の著者が、これは当てはまらないことを示しました。入力画像の小さな平行移動または再スケーリングにより、ネットワークの予測が大幅に変わる可能性があります。この論文では、この現象を定量化し、畳み込みアーキテクチャもデータ拡張も、望ましい不変性を達成するのに十分でない理由を問います。具体的には、畳み込みアーキテクチャは古典的なサンプリング定理を無視するため不変性が得られず、データ拡張は、CNNがトレーニングセットの一般的な画像に非常に類似した画像に対してのみ変換に対して不変であることを学習するため不変性が得られないことを示します。この問題の2つの可能な解決策、(1)中間表現のアンチエイリアシングと(2)データ拡張の増加について説明し、これらがせいぜい部分的な解決策しか提供しないことを示します。まとめると、私たちの結果は、ニューラルネットワークにおける小さな画像変換に対する不変性を確保しながら高い精度を維持するという問題が未解決のままであることを示しています。
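
The sampling-theorem argument in this abstract can be illustrated with a toy 1-D signal. The sketch below is not taken from the paper: the signal, the `[1/4, 1/2, 1/4]` blur kernel, and the helper names are all assumptions chosen for illustration. It shows that stride-2 subsampling followed by max pooling changes its output under a one-sample shift, and that blurring before subsampling narrows the gap without closing it.

```python
def shift(signal, k):
    """Circularly shift a 1-D signal to the right by k samples."""
    n = len(signal)
    return [signal[(i - k) % n] for i in range(n)]


def blur(signal):
    """Circular low-pass filter with the (assumed) kernel [1/4, 1/2, 1/4]."""
    n = len(signal)
    return [0.25 * signal[(i - 1) % n] + 0.5 * signal[i] + 0.25 * signal[(i + 1) % n]
            for i in range(n)]


def downsample_then_max(signal, antialias=False):
    """Stride-2 subsampling followed by global max pooling."""
    if antialias:
        signal = blur(signal)
    return max(signal[::2])


x = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # a single spike at an odd index
x_shifted = shift(x, 1)                        # the same spike moved by one sample

plain = (downsample_then_max(x), downsample_then_max(x_shifted))
smoothed = (downsample_then_max(x, True), downsample_then_max(x_shifted, True))
print(plain)     # (0.0, 1.0): the spike is missed entirely before the shift
print(smoothed)  # (0.25, 0.5): blurring helps, but the outputs still differ
```

In this toy setting, antialiasing reduces the discrepancy from 1.0 to 0.25, which is in the spirit of the abstract's claim that such fixes are only a partial solution.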

Log-concave sampling: Metropolis-Hastings algorithms are fast
対数凹型サンプリング:メトロポリス・ヘイスティングス・アルゴリズムは高速

We study the problem of sampling from a strongly log-concave density supported on $\mathbb{R}^d$, and prove a non-asymptotic upper bound on the mixing time of the Metropolis-adjusted Langevin algorithm (MALA). The method draws samples by simulating a Markov chain obtained from the discretization of an appropriate Langevin diffusion, combined with an accept-reject step. Relative to known guarantees for the unadjusted Langevin algorithm (ULA), our bounds show that the use of an accept-reject step in MALA leads to an exponentially improved dependence on the error-tolerance. Concretely, in order to obtain samples with TV error at most $\delta$ for a density with condition number $\kappa$, we show that MALA requires $\mathcal{O} (\kappa d \log(1/\delta) )$ steps from a warm start, as compared to the $\mathcal{O} (\kappa^2 d/\delta^2 )$ steps established in past work on ULA. We also demonstrate the gains of a modified version of MALA over ULA for weakly log-concave densities. Furthermore, we derive mixing time bounds for the Metropolized random walk (MRW) and obtain $\mathcal{O}(\kappa)$ mixing time slower than MALA. We provide numerical examples that support our theoretical findings, and demonstrate the benefits of Metropolis-Hastings adjustment for Langevin-type sampling algorithms.

私たちは、$\mathbb{R}^d$でサポートされる強い対数凹密度からのサンプリングの問題を研究し、メトロポリス調整ランジュバンアルゴリズム(MALA)の混合時間の非漸近的上限を証明します。この方法は、適切なランジュバン拡散の離散化から得られるマルコフ連鎖をシミュレートし、受け入れ拒否ステップと組み合わせてサンプルを抽出します。調整されていないランジュバンアルゴリズム(ULA)の既知の保証と比較して、我々の上限は、MALAで受け入れ拒否ステップを使用すると、誤差許容度への依存性が指数関数的に改善されることを示しています。具体的には、条件数$\kappa$の密度に対してTV誤差が最大で$\delta$のサンプルを取得するために、MALAでは、ULAに関する過去の研究で確立された$\mathcal{O} (\kappa^2 d/\delta^2 )$ステップと比較して、ウォームスタートから$\mathcal{O} (\kappa d \log(1/\delta) )$ステップが必要であることを示します。また、弱対数凹密度に対して、MALAの修正バージョンがULAよりも優れていることも示します。さらに、メトロポリス化ランダムウォーク(MRW)の混合時間境界を導出し、MALAよりも遅い$\mathcal{O}(\kappa)$の混合時間を得ます。理論的発見を裏付ける数値例を示し、ランジュバン型サンプリングアルゴリズムに対するメトロポリス-ヘイスティングス調整の利点を示します。
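
To make the accept-reject step concrete, here is a minimal 1-D MALA sketch in plain Python. It is an illustration under stated assumptions, not the authors' code: the target (a standard Gaussian, with $f(x) = x^2/2$), the step size, the chain length, and all function names are chosen here for the example. Dropping the accept-reject test from the loop would give the unadjusted algorithm (ULA).

```python
import math
import random


def mala(f, grad_f, x0, step, n_steps, rng):
    """Metropolis-adjusted Langevin algorithm for a 1-D density proportional to exp(-f)."""
    def log_q(to, frm):
        # Log density (up to a constant) of the Langevin proposal frm -> to.
        mean = frm - step * grad_f(frm)
        return -((to - mean) ** 2) / (4.0 * step)

    x, samples = x0, []
    for _ in range(n_steps):
        # Discretized Langevin proposal.
        y = x - step * grad_f(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        # Metropolis-Hastings log acceptance ratio.
        log_alpha = (-f(y) + log_q(x, y)) - (-f(x) + log_q(y, x))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = y  # accept; otherwise the chain stays at x
        samples.append(x)
    return samples


# Assumed toy target: standard Gaussian, which is strongly log-concave.
samples = mala(f=lambda x: 0.5 * x * x, grad_f=lambda x: x,
               x0=0.0, step=0.5, n_steps=20000, rng=random.Random(0))
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(var, 2))
```

Because of the correction step, the chain targets the exact density for any step size, so the empirical mean and variance should approach 0 and 1 as the chain grows.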

Model Selection in Bayesian Neural Networks via Horseshoe Priors
馬蹄形事前分布によるベイジアンニューラルネットワークにおけるモデル選択

The promise of augmenting accurate predictions provided by modern neural networks with well-calibrated predictive uncertainties has reinvigorated interest in Bayesian neural networks. However, model selection—even choosing the number of nodes—remains an open question. Poor choices can severely affect the quality of the produced uncertainties. In this paper, we explore continuous shrinkage priors, the horseshoe, and the regularized horseshoe distributions, for model selection in Bayesian neural networks. When placed over node pre-activations and coupled with appropriate variational approximations, we find that the strong shrinkage provided by the horseshoe is effective at turning off nodes that do not help explain the data. We demonstrate that our approach finds compact network structures even when the number of nodes required is grossly over-estimated. Moreover, the model selection over the number of nodes does not come at the expense of predictive or computational performance; in fact, we learn smaller networks with comparable predictive performance to current approaches. These effects are particularly apparent in sample-limited settings, such as small data sets and reinforcement learning.

最新のニューラルネットワークが提供する正確な予測を、適切に調整された予測不確実性で強化できるという期待から、ベイジアンニューラルネットワークへの関心が再燃しています。しかし、モデルの選択、さらにはノード数の選択は未解決の問題です。不適切な選択は、生成される不確実性の品質に重大な影響を与える可能性があります。この論文では、ベイジアンニューラルネットワークのモデル選択のための連続収縮事前分布、馬蹄形分布、および正規化された馬蹄形分布について説明します。ノードの事前活性化に適用し、適切な変分近似と組み合わせると、馬蹄形分布によって提供される強力な収縮が、データの説明に役立たないノードをオフにするのに効果的であることがわかります。必要なノード数が大幅に過大評価されている場合でも、このアプローチによってコンパクトなネットワーク構造が見つかることを実証します。さらに、ノード数に関するモデル選択によって予測または計算パフォーマンスが犠牲になることはありません。実際、現在のアプローチと同等の予測パフォーマンスを持つより小さなネットワークを学習します。これらの効果は、小規模なデータセットや強化学習など、サンプルが限られた設定で特に顕著になります。

Neural Empirical Bayes
ニューラル経験的ベイズ

We unify kernel density estimation and empirical Bayes and address a set of problems in unsupervised machine learning with a geometric interpretation of those methods, rooted in the concentration of measure phenomenon. Kernel density is viewed symbolically as $X\rightharpoonup Y$ where the random variable $X$ is smoothed to $Y= X+N(0,\sigma^2 I_d)$, and empirical Bayes is the machinery to denoise in a least-squares sense, which we express as $X \leftharpoondown Y$. A learning objective is derived by combining these two, symbolically captured by $X \rightleftharpoons Y$. Crucially, instead of using the original nonparametric estimators, we parametrize the energy function with a neural network denoted by $\phi$; at optimality, $\nabla \phi \approx -\nabla \log f$ where $f$ is the density of $Y$. The optimization problem is abstracted as interactions of high-dimensional spheres which emerge due to the concentration of isotropic Gaussians. We introduce two algorithmic frameworks based on this machinery: (i) a “walk-jump” sampling scheme that combines Langevin MCMC (walks) and empirical Bayes (jumps), and (ii) a probabilistic framework for associative memory, called NEBULA, defined a la Hopfield by the gradient flow of the learned energy to a set of attractors. We finish the paper by reporting the emergence of very rich “creative memories” as attractors of NEBULA for highly-overlapping spheres.

私たちは、カーネル密度推定と経験的ベイズを統合し、測度集中現象に根ざしたこれらの方法の幾何学的解釈により、教師なし機械学習における一連の問題に取り組みます。カーネル密度は、ランダム変数$X$が$Y= X+N(0,\sigma^2 I_d)$に平滑化される$X\rightharpoonup Y$として記号的に表され、経験的ベイズは最小二乗の意味でノイズを除去する機構であり、$X \leftharpoondown Y$と表されます。学習目標は、これら2つを組み合わせることで導出され、記号的には$X \rightleftharpoons Y$で表されます。重要なのは、元のノンパラメトリック推定量を使用する代わりに、$\phi$で表されるニューラルネットワークでエネルギー関数をパラメーター化する点です。最適状態では、$\nabla \phi \approx -\nabla \log f$となり、$f$は$Y$の密度です。最適化問題は、等方性ガウス分布の集中により出現する高次元球の相互作用として抽象化されます。この仕組みに基づく2つのアルゴリズムフレームワークを紹介します。(i)ランジュバンMCMC (ウォーク)と経験的ベイズ(ジャンプ)を組み合わせた「ウォークジャンプ」サンプリングスキーム、および(ii)ホップフィールド流に、学習されたエネルギーのアトラクター集合への勾配フローによって定義される、NEBULAと呼ばれる連想記憶の確率的フレームワークです。最後に、高度に重なり合う球のNEBULAアトラクターとして非常に豊富な「創造的記憶」が出現することを報告します。
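
The least-squares denoising step $X \leftharpoondown Y$ can be sketched with the classical empirical Bayes estimator $\hat{x}(y) = y + \sigma^2 \nabla \log f(y)$, which becomes $y - \sigma^2 \nabla \phi(y)$ once $\nabla \phi \approx -\nabla \log f$. Reading the abstract's denoiser as this estimator is an assumption made here. The example replaces the neural network with a closed-form energy for a Gaussian toy case (also an assumption), so the result can be checked against the known posterior mean.

```python
def denoise(y, sigma, grad_energy):
    """Least-squares estimate of X from Y = X + N(0, sigma^2): y - sigma^2 * grad(phi)(y)."""
    return y - sigma ** 2 * grad_energy(y)


# Assumed toy case: X ~ N(0, s^2), so Y ~ N(0, s^2 + sigma^2) and, up to a constant,
# phi(y) = y^2 / (2 * (s^2 + sigma^2)).
s, sigma = 2.0, 1.0
grad_phi = lambda y: y / (s ** 2 + sigma ** 2)

y = 5.0
x_hat = denoise(y, sigma, grad_phi)
print(x_hat)  # 4.0, which matches the posterior mean y * s^2 / (s^2 + sigma^2)
```

In the paper's setting the hand-written `grad_phi` would be replaced by the gradient of the learned energy network.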

DPPy: DPP Sampling with Python
DPPy: Python による DPP サンプリング

Determinantal point processes (DPPs) are specific probability distributions over clouds of points that are used as models and computational tools across physics, probability, statistics, and more recently machine learning. Sampling from DPPs is a challenge and therefore we present DPPy, a Python toolbox that gathers known exact and approximate sampling algorithms for both finite and continuous DPPs. The project is hosted on GitHub and comes with extensive documentation.

行列式点過程(DPP)は、物理学、確率論、統計学、そして最近では機械学習にわたって、モデルや計算ツールとして使用される、点群上の特定の確率分布です。DPPからのサンプリングは課題であるため、有限DPPと連続DPPの両方の既知の厳密および近似サンプリングアルゴリズムを収集するPythonツールボックスであるDPPyを紹介します。このプロジェクトはGitHubでホストされており、広範なドキュメントが用意されています。

Differentiable reservoir computing
微分可能なリザーバコンピューティング

Numerous results in learning and approximation theory have evidenced the importance of differentiability at the time of countering the curse of dimensionality. In the context of reservoir computing, much effort has been devoted in the last two decades to characterize the situations in which systems of this type exhibit the so-called echo state (ESP) and fading memory (FMP) properties. These important features amount, in mathematical terms, to the existence and continuity of global reservoir system solutions. That research is complemented in this paper with the characterization of the differentiability of reservoir filters for very general classes of discrete-time deterministic inputs. This constitutes a novel strong contribution to the long line of research on the ESP and the FMP and, in particular, links to existing research on the input-dependence of the ESP. Differentiability has been shown in the literature to be a key feature in the learning of attractors of chaotic dynamical systems. A Volterra-type series representation for reservoir filters with semi-infinite discrete-time inputs is constructed in the analytic case using Taylor’s theorem and corresponding approximation bounds are provided. Finally, it is shown as a corollary of these results that any fading memory filter can be uniformly approximated by a finite Volterra series with finite memory.

学習と近似理論における数多くの結果から、次元の呪いに対抗する際には微分可能性が重要であることが証明されています。リザーバコンピューティングの分野では、このタイプのシステムがいわゆるエコー状態(ESP)とフェーディングメモリ(FMP)の特性を示す状況を特徴付けるために、過去20年間に多くの努力が払われてきました。これらの重要な特徴は、数学的に言えば、グローバルリザーバシステムソリューションの存在と連続性に相当します。この論文では、その研究を補完するために、離散時間決定論的入力の非常に一般的なクラスに対するリザーバフィルタの微分可能性の特徴付けを行っています。これは、ESPとFMPに関する長い研究の流れに新しく強力な貢献をしており、特に、ESPの入力依存性に関する既存の研究にリンクしています。文献では、微分可能性は、カオス的動的システムのアトラクターの学習における重要な特徴であることが示されています。テイラーの定理を使用して、半無限離散時間入力を持つリザーバフィルタのVolterra型級数表現を解析的に構築し、対応する近似境界を提供します。最後に、これらの結果の帰結として、任意のフェーディングメモリフィルタは有限メモリを持つ有限Volterra級数によって均一に近似できることが示されます。

Morpho-MNIST: Quantitative Assessment and Diagnostics for Representation Learning
Morpho-MNIST:表現学習のための定量的評価と診断

Revealing latent structure in data is an active field of research, having introduced exciting technologies such as variational autoencoders and adversarial networks, and is essential to push machine learning towards unsupervised knowledge discovery. However, a major challenge is the lack of suitable benchmarks for an objective and quantitative evaluation of learned representations. To address this issue we introduce Morpho-MNIST, a framework that aims to answer: “to what extent has my model learned to represent specific factors of variation in the data?” We extend the popular MNIST dataset by adding a morphometric analysis enabling quantitative comparison of trained models, identification of the roles of latent variables, and characterisation of sample diversity. We further propose a set of quantifiable perturbations to assess the performance of unsupervised and supervised methods on challenging tasks such as outlier detection and domain adaptation. Data and code are available at https://github.com/dccastro/Morpho-MNIST.

データの潜在構造を明らかにすることは活発な研究分野であり、変分オートエンコーダや敵対的ネットワークなどの刺激的な技術が導入されており、機械学習を教師なしの知識発見に向けて推進するために不可欠です。ただし、大きな課題は、学習した表現を客観的かつ定量的に評価するための適切なベンチマークがないことです。この問題に対処するために、Morpho-MNISTを紹介します。これは、「モデルは、データ内の特定の変動要因を表現することをどの程度学習したか」に答えることを目的としたフレームワークです。トレーニングされたモデルの定量的な比較、潜在変数の役割の特定、サンプルの多様性の特性評価を可能にする形態測定分析を追加することで、一般的なMNISTデータセットを拡張します。さらに、外れ値検出やドメイン適応などの困難なタスクに対する教師なしおよび教師ありの方法のパフォーマンスを評価するための定量化可能な一連の摂動を提案します。データとコードはhttps://github.com/dccastro/Morpho-MNISTで入手できます。

All Models are Wrong, but Many are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously
すべてのモデルは間違っているが、多くは有用である: 予測モデルのクラス全体を同時に研究することで変数の重要度を学ぶ

Variable importance (VI) tools describe how much covariates contribute to a prediction model’s accuracy. However, important variables for one well-performing model (for example, a linear model $f(\mathbf{x})=\mathbf{x}^{T}\beta$ with a fixed coefficient vector $\beta$) may be unimportant for another model. In this paper, we propose model class reliance (MCR) as the range of VI values across all well-performing models in a prespecified class. Thus, MCR gives a more comprehensive description of importance by accounting for the fact that many prediction models, possibly of different parametric forms, may fit the data well. In the process of deriving MCR, we show several informative results for permutation-based VI estimates, based on the VI measures used in Random Forests. Specifically, we derive connections between permutation importance estimates for a single prediction model, U-statistics, conditional variable importance, conditional causal effects, and linear model coefficients. We then give probabilistic bounds for MCR, using a novel, generalizable technique. We apply MCR to a public data set of Broward County criminal records to study the reliance of recidivism prediction models on sex and race. In this application, MCR can be used to help inform VI for unknown, proprietary models.

変数重要度(VI)ツールは、共変量が予測モデルの精度にどの程度寄与しているかを説明します。ただし、あるパフォーマンスの高いモデル(たとえば、固定係数ベクトル$\beta$を持つ線形モデル$f(\mathbf{x})=\mathbf{x}^{T}\beta$)にとって重要な変数は、別のモデルにとっては重要でない場合があります。この論文では、モデルクラス依存度(MCR)を、事前に指定されたクラス内のすべてのパフォーマンスの高いモデルにわたるVI値の範囲として提案します。したがって、MCRは、パラメトリック形式が異なる可能性のある多くの予測モデルがデータに適合する可能性があるという事実を考慮することで、より包括的な重要度の説明を提供します。MCRを導出するプロセスでは、ランダムフォレストで使用されるVI測定に基づいて、順列ベースのVI推定値に関するいくつかの有益な結果を示します。具体的には、単一の予測モデルの順列重要度推定値、U統計量、条件付き変数重要度、条件付き因果効果、および線形モデル係数間の関係を導出します。次に、新しい一般化可能な手法を使用して、MCRの確率的境界を示します。ブロワード郡の犯罪記録の公開データセットにMCRを適用し、再犯予測モデルの性別と人種への依存度を調査します。このアプリケーションでは、MCRは、未知の独自モデルのVIについての情報を得るのに役立ちます。
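
As a rough illustration of permutation-based variable importance for a single fixed model, the sketch below shuffles one column and reports the ratio of the permuted loss to the original loss. The ratio form, the synthetic data, and the helper names are assumptions made for this example. It covers only one model, whereas MCR is the range of such values over all well-performing models in a class.

```python
import random


def mse(model, X, y):
    """Mean squared error of a prediction function on rows X with targets y."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def permutation_reliance(model, X, y, j, rng):
    """Loss with column j shuffled, divided by the original loss."""
    column = [row[j] for row in X]
    rng.shuffle(column)
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
    return mse(model, X_perm, y) / mse(model, X, y)


rng = random.Random(0)
# Assumed synthetic data: y depends on the first covariate only.
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(2000)]
y = [2.0 * x1 + rng.gauss(0, 0.5) for x1, _ in X]

# A linear model with fixed coefficient vector beta = (2, 0).
model = lambda row: 2.0 * row[0] + 0.0 * row[1]
reliance = [permutation_reliance(model, X, y, j, rng) for j in range(2)]
print(reliance)  # large ratio for column 0, ratio of 1 for the unused column 1
```

A ratio near 1 means this particular model does not rely on the covariate; a different well-performing model could still rely on it, which is the gap MCR is meant to describe.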

New Convergence Aspects of Stochastic Gradient Algorithms
確率的勾配アルゴリズムの新たな収束の側面

The classical convergence analysis of SGD is carried out under the assumption that the norm of the stochastic gradient is uniformly bounded. While this might hold for some loss functions, it is violated for cases where the objective function is strongly convex. In Bottou et al. (2018), a new analysis of convergence of SGD is performed under the assumption that stochastic gradients are bounded with respect to the true gradient norm. We show that for stochastic problems arising in machine learning such a bound always holds; and we also propose an alternative convergence analysis of SGD with a diminishing learning rate regime. We then move on to the asynchronous parallel setting, and prove convergence of the Hogwild! algorithm in the same regime in the case of a diminishing learning rate. It is well-known that SGD converges if a sequence of learning rates $\{\eta_t\}$ satisfies $\sum_{t=0}^\infty \eta_t \rightarrow \infty$ and $\sum_{t=0}^\infty \eta^2_t < \infty$. We show the convergence of SGD for a strongly convex objective function without using the bounded gradient assumption when $\{\eta_t\}$ is a diminishing sequence and $\sum_{t=0}^\infty \eta_t \rightarrow \infty$. In other words, we extend the current state-of-the-art class of learning rates satisfying the convergence of SGD.

SGDの古典的な収束解析は、確率的勾配のノルムが一様に有界であるという仮定の下で実行されます。これは一部の損失関数には当てはまるかもしれませんが、目的関数が強く凸である場合には破綻します。Bottouら(2018)では、確率的勾配が真の勾配ノルムに関して有界であるという仮定の下で、SGDの収束の新しい解析が行われます。機械学習で生じる確率的問題では、このような制限が常に当てはまることを示します。また、学習率減少型SGDの別の収束解析も提案します。次に、非同期並列設定に移り、学習率減少型の場合に同じ型でHogwild!アルゴリズムが収束することを証明します。学習率のシーケンス$\{\eta_t\}$が$\sum_{t=0}^\infty \eta_t \rightarrow \infty$および$\sum_{t=0}^\infty \eta^2_t < \infty$を満たす場合、SGDが収束することはよく知られています。$\{\eta_t\}$が減少シーケンスであり、$\sum_{t=0}^\infty \eta_t \rightarrow \infty$の場合、有界勾配仮定を使用せずに、強凸目的関数に対するSGDの収束を示します。言い換えると、SGDの収束を満たす現在の最先端の学習率のクラスを拡張します。
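
The learning-rate conditions above can be tried on a small strongly convex problem whose stochastic gradient is not uniformly bounded. The objective, the schedule $\eta_t = 1/(t+1)$, and the function names below are assumptions picked for illustration, not the paper's experiments. This schedule satisfies both classical conditions: $\sum_t \eta_t$ diverges and $\sum_t \eta_t^2$ is finite.

```python
import random


def sgd(grad_sample, w0, learning_rate, n_steps, rng):
    """Plain SGD: w <- w - eta_t * g_t, with g_t a stochastic gradient at w."""
    w = w0
    for t in range(n_steps):
        w -= learning_rate(t) * grad_sample(w, rng)
    return w


# Assumed toy objective: F(w) = E[(w - z)^2] / 2 with z ~ N(3, 1), minimized at w* = 3.
# F is strongly convex, and the stochastic gradient w - z is unbounded because z is Gaussian.
grad_sample = lambda w, rng: w - rng.gauss(3.0, 1.0)
eta = lambda t: 1.0 / (t + 1)  # diminishing step size

w_final = sgd(grad_sample, w0=10.0, learning_rate=eta, n_steps=5000, rng=random.Random(0))
print(round(w_final, 3))  # close to the minimizer 3
```

With this schedule the iterate is exactly the running average of the samples seen so far, which is why it settles near 3 even though individual gradients can be arbitrarily large.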

DataWig: Missing Value Imputation for Tables
DataWig: テーブルの欠損値補完

With the growing importance of machine learning (ML) algorithms for practical applications, reducing data quality problems in ML pipelines has become a major focus of research. In many cases missing values can break data pipelines which makes completeness one of the most impactful data quality challenges. Current missing value imputation methods are focusing on numerical or categorical data and can be difficult to scale to datasets with millions of rows. We release DataWig, a robust and scalable approach for missing value imputation that can be applied to tables with heterogeneous data types, including unstructured text. DataWig combines deep learning feature extractors with automatic hyperparameter tuning. This enables users without a machine learning background, such as data engineers, to impute missing values with minimal effort in tables with more heterogeneous data types than supported in existing libraries, while requiring less glue code for feature engineering and offering more flexible modelling options. We demonstrate that DataWig compares favourably to existing imputation packages. Source code, documentation, and unit tests for this package are available at: https://github.com/awslabs/datawig

機械学習(ML)アルゴリズムの実用化への重要性が高まるにつれ、MLパイプラインにおけるデータ品質の問題の軽減が研究の主な焦点となっています。多くの場合、欠損値はデータパイプラインを中断させる可能性があるため、完全性は最も影響の大きいデータ品質の課題の1つとなっています。現在の欠損値補完方法は数値データまたはカテゴリデータに焦点を当てており、数百万行のデータセットに拡張することが困難な場合があります。私たちは、欠損値補完のための堅牢でスケーラブルなアプローチであるDataWigをリリースしました。これは、非構造化テキストを含む異種データタイプを持つテーブルに適用できます。DataWigは、ディープラーニング機能抽出機能と自動ハイパーパラメータチューニングを組み合わせたものです。これにより、データエンジニアなどの機械学習のバックグラウンドを持たないユーザーは、既存のライブラリでサポートされているよりも異種データタイプを持つテーブルで、最小限の労力で欠損値を補完できると同時に、機能エンジニアリングに必要なグルーコードが少なくなり、より柔軟なモデリングオプションが提供されます。私たちは、DataWigが既存の補完パッケージと比較して優れていることを実証しました。このパッケージのソースコード、ドキュメント、ユニットテストは、https://github.com/awslabs/datawigで入手できます。

Learning Overcomplete, Low Coherence Dictionaries with Linear Inference
線形推論による過剰完全で低コヒーレンス辞書の学習

Finding overcomplete latent representations of data has applications in data analysis, signal processing, machine learning, theoretical neuroscience and many other fields. In an overcomplete representation, the number of latent features exceeds the data dimensionality, which is useful when the data is undersampled by the measurements (compressed sensing or information bottlenecks in neural systems) or composed from multiple complete sets of linear features, each spanning the data space. Independent Components Analysis (ICA) is a linear technique for learning sparse latent representations, which typically has a lower computational cost than sparse coding, a linear generative model which requires an iterative, nonlinear inference step. While well suited for finding complete representations, we show that overcompleteness poses a challenge to existing ICA algorithms. Specifically, the coherence control used in existing ICA and other dictionary learning algorithms, necessary to prevent the formation of duplicate dictionary features, is ill-suited in the overcomplete case. We show that in the overcomplete case, several existing ICA algorithms have undesirable global minima that maximize coherence. We provide a theoretical explanation of these failures and, based on the theory, propose improved coherence control costs for overcomplete ICA algorithms. Further, by comparing ICA algorithms to the computationally more expensive sparse coding on synthetic data, we show that the limited applicability of overcomplete, linear inference can be extended with the proposed cost functions. Finally, when trained on natural images, we show that the coherence control biases the exploration of the data manifold, sometimes yielding suboptimal, coherent solutions. All told, this study contributes new insights into and methods for coherence control for linear ICA, some of which are applicable to many other nonlinear models.

データの過剰完全な潜在表現を見つけることは、データ分析、信号処理、機械学習、理論神経科学、その他多くの分野で応用されています。過剰完全な表現では、潜在的特徴の数がデータの次元を超えており、データが測定によってアンダーサンプリングされている場合（圧縮センシングまたは神経システムの情報ボトルネック）、またはそれぞれがデータ空間を張る複数の完全な線形特徴セットから構成されている場合に役立ちます。独立成分分析（ICA）は、スパース潜在表現を学習するための線形手法であり、通常、反復的な非線形推論ステップを必要とする線形生成モデルであるスパースコーディングよりも計算コストが低くなります。完全表現を見つけるのに適していますが、過剰完全性は既存のICAアルゴリズムに課題をもたらすことを示しています。具体的には、既存のICAおよびその他の辞書学習アルゴリズムで使用される、重複する辞書特徴の形成を防ぐために必要なコヒーレンス制御は、過剰完全な場合には適していません。過剰完全の場合、既存のICAアルゴリズムのいくつかには、コヒーレンスを最大化する望ましくないグローバル最小値があることを示します。これらの失敗の理論的説明を提供し、理論に基づいて、過剰完全ICAアルゴリズムのコヒーレンス制御コストの改善を提案します。さらに、ICAアルゴリズムを、合成データに対する計算コストの高いスパースコーディングと比較することにより、過剰完全な線形推論の限られた適用範囲が、提案されたコスト関数によって拡張できることを示します。最後に、自然画像でトレーニングした場合、コヒーレンス制御によってデータマニホールドの探索が偏り、最適ではないコヒーレントなソリューションが生成される場合があることを示します。全体として、この研究では、線形ICAのコヒーレンス制御に関する新しい洞察と方法を提供し、その一部は他の多くの非線形モデルにも適用できます。

Fast Automatic Smoothing for Generalized Additive Models
一般化加法モデルの高速自動平滑化

Generalized additive models (GAMs) are regression models wherein parameters of probability distributions depend on input variables through a sum of smooth functions, whose degrees of smoothness are selected by $L_2$ regularization. Such models have become the de-facto standard nonlinear regression models when interpretability and flexibility are required, but reliable and fast methods for automatic smoothing in large data sets are still lacking. We develop a general methodology for automatically learning the optimal degree of $L_2$ regularization for GAMs using an empirical Bayes approach. The smooth functions are penalized by hyper-parameters that are learned simultaneously by maximization of a marginal likelihood using an approximate expectation-maximization algorithm. The latter involves a double Laplace approximation at the E-step, and leads to an efficient M-step. Empirical analysis shows that the resulting algorithm is numerically stable, faster than the best existing methods and achieves state-of-the-art accuracy. For illustration, we apply it to an important and challenging problem in the analysis of extremal data.

一般化加法モデル(GAM)は、確率分布のパラメータが滑らかな関数の合計を通じて入力変数に依存する回帰モデルであり、滑らかな関数の滑らかさの度合いは$L_2$正則化によって選択されます。このようなモデルは、解釈可能性と柔軟性が求められる場合の非線形回帰モデルの事実上の標準となっていますが、大規模なデータセットで自動的に平滑化する信頼性が高く高速な方法はまだありません。経験的ベイズアプローチを使用して、GAMの最適な$L_2$正則化の度合いを自動的に学習する一般的な方法論を開発します。滑らかな関数は、近似期待値最大化アルゴリズムを使用して周辺尤度を最大化することによって同時に学習されるハイパーパラメータによってペナルティが課されます。後者は、Eステップで二重ラプラス近似を伴い、効率的なMステップにつながります。経験的分析により、結果として得られるアルゴリズムは数値的に安定しており、既存の最良の方法よりも高速で、最先端の精度を達成することが示されています。説明のために、極値データの分析における重要かつ困難な問題にこれを適用します。

Optimization with Non-Differentiable Constraints with Applications to Fairness, Recall, Churn, and Other Goals
公平性、再現率、チャーン、およびその他の目標への適用による微分不可能な制約による最適化

We show that many machine learning goals can be expressed as “rate constraints” on a model’s predictions. We study the problem of training non-convex models subject to these rate constraints (or other non-convex or non-differentiable constraints). In the non-convex setting, the standard approach of Lagrange multipliers may fail. Furthermore, if the constraints are non-differentiable, then one cannot optimize the Lagrangian with gradient-based methods. To solve these issues, we introduce a new “proxy-Lagrangian” formulation. This leads to an algorithm that, assuming access to an optimization oracle, produces a stochastic classifier by playing a two-player non-zero-sum game solving for what we call a semi-coarse correlated equilibrium, which in turn corresponds to an approximately optimal and feasible solution to the constrained optimization problem. We then give a procedure that shrinks the randomized solution down to a mixture of at most $m+1$ deterministic solutions, given $m$ constraints. This culminates in a procedure that can solve non-convex constrained optimization problems with possibly non-differentiable and non-convex constraints, and enjoys theoretical guarantees. We provide extensive experimental results covering a broad range of policy goals, including various fairness metrics, accuracy, coverage, recall, and churn.

私たちは、多くの機械学習の目標は、モデルの予測に対する「レート制約」として表現できることを示します。これらのレート制約(またはその他の非凸制約または微分不可能制約)に従う非凸モデルのトレーニングの問題を研究します。非凸設定では、ラグランジュ乗数の標準的なアプローチが失敗する可能性があります。さらに、制約が微分不可能な場合、勾配ベースの方法でラグランジアンを最適化することはできません。これらの問題を解決するために、新しい「プロキシラグランジアン」定式化を導入します。これにより、最適化オラクルへのアクセスを前提として、2人のプレーヤーの非ゼロ和ゲームをプレイして、いわゆる半粗相関均衡を解くことで確率的分類器を生成するアルゴリズムが生まれます。これは、制約付き最適化問題に対するほぼ最適かつ実行可能なソリューションに相当します。次に、$m$個の制約が与えられた場合に、ランダム化されたソリューションを最大$m+1$個の決定論的ソリューションの混合に縮小する手順を示します。これにより、微分不可能な制約と非凸制約を持つ可能性のある非凸制約最適化問題を解決できる手順が実現し、理論的な保証が得られます。さまざまな公平性メトリック、精度、カバレッジ、リコール、チャーンなど、幅広いポリシー目標をカバーする広範な実験結果を提供します。

Shared Subspace Models for Multi-Group Covariance Estimation
マルチグループ共分散推定のための共有部分空間モデル

We develop a model-based method for evaluating heterogeneity among several $p\times p$ covariance matrices in the large $p$, small $n$ setting. This is done by assuming a spiked covariance model for each group and sharing information about the space spanned by the group-level eigenvectors. We use an empirical Bayes method to identify a low-dimensional subspace which explains variation across all groups and use an MCMC algorithm to estimate the posterior uncertainty of eigenvectors and eigenvalues on this subspace. The implementation and utility of our model are illustrated with analyses of high-dimensional multivariate gene expression.

私たちは、大きな$p$、小さな$n$の設定で、いくつかの$p\times p$共分散行列間の不均一性を評価するためのモデルベースの方法を開発します。これは、各グループのスパイク共分散モデルを仮定し、グループレベルの固有ベクトルが張る空間に関する情報を共有することによって行われます。経験的ベイズ法を使用して、すべてのグループ間の変動を説明する低次元部分空間を特定し、MCMCアルゴリズムを使用して、この部分空間上の固有ベクトルと固有値の事後不確実性を推定します。私たちのモデルの実装と有用性は、高次元多変量遺伝子発現の解析で示されています。

DBSCAN: Optimal Rates For Density-Based Cluster Estimation
DBSCAN: 密度ベースのクラスター推定の最適レート

We study the problem of optimal estimation of the density cluster tree under various smoothness assumptions on the underlying density. Inspired by the seminal work of Chaudhuri et al. (2014), we formulate a new notion of clustering consistency which is better suited to smooth densities, and derive minimax rates for cluster tree estimation under Hölder smooth densities of arbitrary degree. We present a computationally efficient, rate optimal cluster tree estimator based on simple extensions of the popular DBSCAN algorithm of Ester et al. (1996). Our procedure relies on kernel density estimators and returns a sequence of nested random geometric graphs whose connected components form a hierarchy of clusters. The resulting optimal rates for cluster tree estimation depend on the degree of smoothness of the underlying density and, interestingly, match the minimax rates for density estimation under the sup-norm loss. Our results complement and extend the analysis of the DBSCAN algorithm in Sriperumbudur and Steinwart (2012). Finally, we consider level set estimation and cluster consistency for densities with jump discontinuities. We demonstrate that the DBSCAN algorithm attains the minimax rate in terms of the jump size and sample size in this setting as well.

私たちは、基礎密度に対するさまざまな平滑度の仮定の下での密度クラスターツリーの最適推定の問題を研究します。Chaudhuriら(2014)の独創的な研究に触発され、滑らかな密度に適したクラスタリング一貫性という新しい概念を定式化し、任意次数のHölder滑らかな密度の下でのクラスターツリー推定のミニマックス率を導出します。Esterら(1996)の一般的なDBSCANアルゴリズムの単純な拡張に基づく、計算効率が高く、レートが最適なクラスターツリー推定量を提示します。この手順はカーネル密度推定量に依存し、連結成分がクラスターの階層を形成するネストされたランダム幾何学グラフのシーケンスを返します。結果として得られるクラスターツリー推定の最適率は、基礎密度の平滑度の度合いに依存し、興味深いことに、supノルム損失の下での密度推定のミニマックス率と一致します。私たちの結果は、SriperumbudurとSteinwart (2012)のDBSCANアルゴリズムの分析を補完し、拡張するものです。最後に、ジャンプ不連続性を持つ密度に対するレベルセット推定とクラスター一貫性について検討します。この設定でも、DBSCANアルゴリズムがジャンプサイズとサンプルサイズに関してミニマックスレートを達成することを示します。
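
For readers unfamiliar with the base algorithm, the following is a compact sketch of classical DBSCAN, in which a point is a core point when at least `min_pts` points lie within distance `eps` of it. It is not the paper's cluster tree estimator, which extends DBSCAN and relies on kernel density estimators. The function name, parameters, and test points are assumptions for illustration, and the neighbor search is a brute-force scan.

```python
def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    def neighbors(i):
        # Indices of all points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise; may later become a border point
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is also a core point, so keep expanding
                queue.extend(nbrs)
    return labels


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0)]
labels = dbscan(points, eps=0.3, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Running the procedure over a decreasing sequence of density levels, rather than a single `eps` and `min_pts`, is what produces a hierarchy of nested clusters of the kind the abstract describes.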

Embarrassingly Parallel Inference for Gaussian Processes
ガウス過程のための自明並列(embarrassingly parallel)推論

Training Gaussian process-based models typically involves an $O(N^3)$ computational bottleneck due to inverting the covariance matrix. Popular methods for overcoming this matrix inversion problem cannot adequately model all types of latent functions, and are often not parallelizable. However, judicious choice of model structure can ameliorate this problem. A mixture-of-experts model that uses a mixture of $K$ Gaussian processes offers modeling flexibility and opportunities for scalable inference. Our embarrassingly parallel algorithm combines low-dimensional matrix inversions with importance sampling to yield a flexible, scalable mixture-of-experts model that offers comparable performance to Gaussian process regression at a much lower computational cost.

ガウス過程ベースのモデルのトレーニングには、通常、共分散行列の逆行列計算による$O(N^3)$の計算ボトルネックが伴います。この逆行列計算の問題を克服するための一般的な方法は、すべてのタイプの潜在関数を適切にモデル化できるわけではなく、多くの場合、並列化可能ではありません。ただし、モデル構造を賢明に選択することで、この問題を改善できます。$K$個のガウス過程の混合を使用する専門家混合モデルは、モデリングの柔軟性とスケーラブルな推論の機会を提供します。私たちの自明並列(embarrassingly parallel)アルゴリズムは、低次元の逆行列計算と重要度サンプリングを組み合わせて、ガウス過程回帰に匹敵するパフォーマンスをはるかに低い計算コストで提供する、柔軟でスケーラブルな専門家混合モデルを生成します。

Determinantal Point Processes for Coresets
コアセットのための行列式点過程

When faced with a data set too large to be processed all at once, an obvious solution is to retain only part of it. In practice this takes a wide variety of different forms, and among them “coresets” are especially appealing. A coreset is a (small) weighted sample of the original data that comes with the following guarantee: a cost function can be evaluated on the smaller set instead of the larger one, with low relative error. For some classes of problems, and via a careful choice of sampling distribution (based on the so-called “sensitivity” metric), iid random sampling has turned out to be one of the most successful methods for building coresets efficiently. However, independent samples are sometimes overly redundant, and one could hope that enforcing diversity would lead to better performance. The difficulty lies in proving coreset properties in non-iid samples. We show that the coreset property holds for samples formed with determinantal point processes (DPP). DPPs are interesting because they are a rare example of repulsive point processes with tractable theoretical properties, enabling us to prove general coreset theorems. We apply our results to both the $k$-means and the linear regression problems, and give extensive empirical evidence that the small additional computational cost of DPP sampling comes with superior performance over its iid counterpart. Of independent interest, we also provide analytical formulas for the sensitivity in the linear regression and $1$-means cases.

一度に処理するには大きすぎるデータセットに直面した場合、明らかな解決策は、その一部だけを保持することです。実際には、これはさまざまな形式を取りますが、その中でも「コアセット」は特に魅力的です。コアセットは、元のデータの（小さな）重み付きサンプルであり、次の保証が付属しています。コスト関数は、大きいセットではなく小さいセットで、相対誤差が低く評価できます。一部の問題クラスでは、サンプリング分布を慎重に選択することで（いわゆる「感度」メトリックに基づいて）、iidランダムサンプリングが、コアセットを効率的に構築するための最も成功した方法の1つになりました。ただし、独立したサンプルは冗長すぎる場合があり、多様性を強制することでパフォーマンスが向上することを期待できます。難しいのは、非iidサンプルでコアセットプロパティを証明することです。コアセットプロパティは、行列式点過程(DPP)で形成されたサンプルに当てはまることを示します。DPPは、扱いやすい理論的特性を持つ反発点過程のまれな例であり、一般的なコアセット定理を証明できるため興味深いものです。私たちは結果を$k$平均法と線形回帰問題の両方に適用し、DPPサンプリングのわずかな追加計算コストがiidサンプリングよりも優れたパフォーマンスをもたらすという広範な経験的証拠を示します。独立した興味深い点として、線形回帰と$1$平均法の場合の感度に関する解析式も提供します。
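
The coreset guarantee can be pictured with the iid baseline that the abstract compares against: draw $m$ indices with probabilities $p_i$ and weight each draw by $1/(m p_i)$, so that the weighted cost is an unbiased estimate of the full cost. The sketch below is not DPP sampling and does not use the paper's sensitivity formulas. The uniform `probs` is a placeholder assumption where a sensitivity-based distribution would go, and the data and function names are also illustrative.

```python
import random


def importance_coreset(n, probs, m, rng):
    """Draw m indices iid with probabilities probs; weight each draw by 1 / (m * p_i)."""
    idx = rng.choices(range(n), weights=probs, k=m)
    return [(i, 1.0 / (m * probs[i])) for i in idx]


def one_means_cost(points, center, weights=None):
    """Weighted sum of squared distances to a single center (the 1-means cost)."""
    weights = weights or [1.0] * len(points)
    return sum(w * (x - center) ** 2 for x, w in zip(points, weights))


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]
probs = [1.0 / len(data)] * len(data)  # uniform placeholder, not a sensitivity score

coreset = importance_coreset(len(data), probs, m=500, rng=rng)
sample = [data[i] for i, _ in coreset]
weights = [w for _, w in coreset]

full = one_means_cost(data, center=0.5)
approx = one_means_cost(sample, center=0.5, weights=weights)
rel_error = abs(approx - full) / full
print(round(rel_error, 3))  # relative error of the 500-point weighted estimate
```

A coreset guarantee asks for this relative error to be small for every candidate center at once, not only for the single center evaluated here.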

Stochastic Canonical Correlation Analysis
確率的正準相関分析

We study the sample complexity of canonical correlation analysis (CCA), i.e., the number of samples needed to estimate the population canonical correlation and directions up to arbitrarily small error. With mild assumptions on the data distribution, we show that in order to achieve $\epsilon$-suboptimality in a properly defined measure of alignment between the estimated canonical directions and the population solution, we can solve the empirical objective exactly with $N(\epsilon, \Delta, \gamma)$ samples, where $\Delta$ is the singular value gap of the whitened cross-covariance matrix and $1/\gamma$ is an upper bound of the condition number of auto-covariance matrices. Moreover, we can achieve the same learning accuracy by drawing the same level of samples and solving the empirical objective approximately with a stochastic optimization algorithm; this algorithm is based on the shift-and-invert power iterations and only needs to process the dataset for $\mathcal{O} \left(\log \frac{1}{\epsilon} \right)$ passes. Finally, we show that, given an estimate of the canonical correlation, the streaming version of the shift-and-invert power iterations achieves the same learning accuracy with the same level of sample complexity, by processing the data only once.

私たちは、正準相関分析(CCA)のサンプル複雑度、すなわち、任意の小さな誤差まで母集団の正準相関と方向を推定するために必要なサンプル数を研究します。データ分布に関する軽い仮定のもとで、推定された正準方向と母集団解との間の整合の適切に定義された尺度において$\epsilon$準最適性を達成するために、経験的目的を$N(\epsilon, \Delta, \gamma)$サンプルで正確に解けることを示します。ここで、$\Delta$は白色化相互共分散行列の特異値ギャップであり、$1/\gamma$は自己共分散行列の条件数の上限です。さらに、同じレベルのサンプルを抽出し、確率的最適化アルゴリズムで経験的目的を近似的に解くことで、同じ学習精度を達成できます。このアルゴリズムは、シフトアンドインバートのべき乗反復に基づいており、データセットを$\mathcal{O} \left(\log \frac{1}{\epsilon} \right)$パスで処理するだけで済みます。最後に、正準相関の推定値が与えられた場合、シフトアンドインバートのべき乗反復のストリーミングバージョンでは、データを1回だけ処理することで、同じレベルのサンプル複雑度で同じ学習精度が達成されることを示します。
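
To make the quantities in this abstract concrete, here is a minimal sketch that solves the empirical CCA objective exactly with a dense SVD of the whitened cross-covariance matrix. It is not the shift-and-invert stochastic algorithm from the paper. The helper names, the small ridge term `reg`, and the synthetic data are assumptions made for illustration; the returned `gap` is the empirical counterpart of the singular value gap $\Delta$.

```python
import numpy as np

def inv_sqrt(mat):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def top_canonical_pair(X, Y, reg=1e-8):
    """Top canonical correlation and directions (needs >= 2 columns per view)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])   # auto-covariance of X
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])   # auto-covariance of Y
    Sxy = Xc.T @ Yc / n                              # cross-covariance
    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    T = Wx @ Sxy @ Wy                                # whitened cross-covariance
    U, s, Vt = np.linalg.svd(T)
    u = Wx @ U[:, 0]                                 # canonical direction for X
    v = Wy @ Vt[0]                                   # canonical direction for Y
    gap = s[0] - s[1]                                # empirical singular value gap
    return s[0], u, v, gap

# Synthetic data: one shared latent signal, population correlation 0.8.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
X = np.column_stack([z + 0.5 * rng.normal(size=n),
                     rng.normal(size=n),
                     rng.normal(size=n)])
Y = np.column_stack([z + 0.5 * rng.normal(size=n),
                     rng.normal(size=n)])

rho, u, v, gap = top_canonical_pair(X, Y)
print(f"estimated canonical correlation: {rho:.3f}, gap: {gap:.3f}")
```

The dense SVD costs a full pass over the data plus a cubic-time factorization, which is the cost the stochastic and streaming variants in the abstract are designed to avoid.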

Unsupervised Evaluation and Weighted Aggregation of Ranked Classification Predictions
ランク付けされた分類予測の教師なし評価と重み付け集計

Ensemble methods that aggregate predictions from a set of diverse base learners consistently outperform individual classifiers. Many such popular strategies have been developed in a supervised setting, where the sample labels have been provided to the ensemble algorithm. However, with the rising interest in unsupervised algorithms for machine learning and growing amounts of uncurated data, the reliance on labeled data precludes the application of ensemble algorithms to many real world problems. To this end we develop a new theoretical framework for ensemble learning, the Strategy for Unsupervised Multiple Method Aggregation (SUMMA), that estimates the performances of base classifiers and uses these estimates to form an ensemble classifier. SUMMA also generates an ensemble ranking of samples based on the confidence score it assigns to each sample. We illustrate the performance of SUMMA using a synthetic example as well as two real world problems.

多様な基本学習器のセットから予測を集計するアンサンブル法は、個々の分類器よりも一貫して優れたパフォーマンスを発揮します。このような一般的な戦略の多くは、サンプルラベルがアンサンブルアルゴリズムに提供されている教師あり環境で開発されています。しかし、機械学習のための教師なしアルゴリズムへの関心が高まり、キュレーションされていないデータの量が増える中、ラベル付きデータへの依存は、アンサンブルアルゴリズムを多くの現実世界の問題に適用することを妨げています。この目的のために、アンサンブル学習のための新しい理論的フレームワークである、基本分類器のパフォーマンスを推定し、これらの推定値を使用してアンサンブル分類器を形成する、教師なし複数メソッド集約戦略(SUMMA)を開発します。また、SUMMAは、各サンプルに割り当てた信頼度スコアに基づいて、サンプルのアンサンブルランキングを生成します。SUMMAのパフォーマンスを、合成例と2つの現実世界の問題を使用して説明します。

On the Convergence of Gaussian Belief Propagation with Nodes of Arbitrary Size
任意のサイズのノードを持つガウスの信念伝播の収束について

This paper is concerned with a multivariate extension of Gaussian message passing applied to pairwise Markov graphs (MGs). Gaussian message passing applied to pairwise MGs is often labeled Gaussian belief propagation (GaBP) and can be used to approximate the marginal of each variable contained in the pairwise MG. We propose a multivariate extension of GaBP (we label this GaBP-m) that can be used to estimate higher-dimensional marginals. Beyond the ability to estimate higher-dimensional marginals, GaBP-m exhibits better convergence behavior than GaBP, and can also provide more accurate univariate marginals. The theoretical results of this paper are based on an extension of the computation tree analysis conducted on univariate nodes to the multivariate case. The main contribution of this paper is the development of a convergence condition for GaBP-m that moves beyond the walk-summability of the precision matrix. Based on this convergence condition, we derived an upper bound for the number of iterations required for convergence of the GaBP-m algorithm. An upper bound on the dissimilarity between the approximate and exact marginal covariance matrices was established. We argue that GaBP-m is robust towards a certain change in variables, a property not shared by iterative solvers of linear systems, such as the conjugate gradient (CG) and preconditioned conjugate gradient (PCG) methods. The advantages of using GaBP-m over GaBP are also illustrated empirically.

この論文は、ペアワイズマルコフグラフ(MG)に適用されるガウスメッセージパッシングの多変量拡張を扱います。ペアワイズMGに適用されるガウスメッセージパッシングは、しばしばガウスビリーフプロパゲーション(GaBP)と呼ばれ、ペアワイズMGに含まれる各変数の周辺を近似するために使用できます。私たちは、高次元周辺を推定するために使用できるGaBPの多変量拡張(これをGaBP-mと名付けます)を提案します。高次元周辺を推定する機能以外に、GaBP-mはGaBPよりも優れた収束動作を示し、より正確な単変量周辺も提供できます。この論文の理論的結果は、単変量ノードで実行される計算ツリー分析を多変量の場合に拡張したものに基づいています。この論文の主な貢献は、精度行列のウォーク合計可能性を超えるGaBP-mの収束条件の開発です。この収束条件に基づいて、GaBP-mアルゴリズムの収束に必要な反復回数の上限を導きました。近似および正確な周辺共分散行列間の相違の上限が確立されました。GaBP-mは変数の特定の変更に対して堅牢であり、これは共役勾配(CG)法や前処理付き共役勾配(PCG)法などの線形システムの反復ソルバーでは共有されない特性であると主張します。GaBPよりもGaBP-mを使用する利点も経験的に示されています。

The Reduced PC-Algorithm: Improved Causal Structure Learning in Large Random Networks
縮小されたPCアルゴリズム:大規模ランダムネットワークにおける因果構造学習の改善

We consider the task of estimating a high-dimensional directed acyclic graph, given observations from a linear structural equation model with arbitrary noise distribution. By exploiting properties of common random graphs, we develop a new algorithm that requires conditioning only on small sets of variables. The proposed algorithm, which is essentially a modified version of the PC-Algorithm, offers significant gains in both computational complexity and estimation accuracy. In particular, it results in more efficient and accurate estimation in large networks containing hub nodes, which are common in biological systems. We prove the consistency of the proposed algorithm, and show that it also requires a less stringent faithfulness assumption than the PC-Algorithm. Simulations in low and high-dimensional settings are used to illustrate these findings. An application to gene expression data suggests that the proposed algorithm can identify a greater number of clinically relevant genes than current methods.

私たちは、任意のノイズ分布を持つ線形構造方程式モデルからの観測値が与えられた場合、高次元の有向非巡回グラフを推定するタスクについて検討します。一般的なランダムグラフの特性を利用して、少数の変数セットのみを条件とする新しいアルゴリズムを開発します。提案されたアルゴリズムは、本質的にはPCアルゴリズムの修正版であり、計算の複雑さと推定精度の両方で大きなメリットがあります。特に、生物系で一般的なハブノードを含む大規模ネットワークでは、より効率的で正確な推定が可能になります。提案されたアルゴリズムの一貫性を証明し、PCアルゴリズムよりも厳密でない忠実性仮定が必要であることも示します。低次元および高次元の設定でのシミュレーションを使用して、これらの発見を説明します。遺伝子発現データへの適用により、提案されたアルゴリズムは現在の方法よりも多くの臨床的に関連する遺伝子を識別できることが示唆されます。

Two-Layer Feature Reduction for Sparse-Group Lasso via Decomposition of Convex Sets
凸集合の分解による疎群Lassoの2層特徴削減

Sparse-Group Lasso (SGL) has been shown to be a powerful regression technique for simultaneously discovering group and within-group sparse patterns by using a combination of the $\ell_1$ and $\ell_2$ norms. However, in large-scale applications, the complexity of the regularizers entails great computational challenges. In this paper, we propose a novel two-layer feature reduction method (TLFre) for SGL via a decomposition of its dual feasible set. The two-layer reduction is able to quickly identify the inactive groups and the inactive features, respectively, which are guaranteed to be absent from the sparse representation and can be removed from the optimization. Existing feature reduction methods are only applicable to sparse models with one sparsity-inducing regularizer. To the best of our knowledge, TLFre is the first method capable of handling multiple sparsity-inducing regularizers. Moreover, TLFre has a very low computational cost and can be integrated with any existing solver. We also develop a screening method, called DPC (decomposition of convex set), for nonnegative Lasso. Experiments on both synthetic and real data sets show that TLFre and DPC improve the efficiency of SGL and nonnegative Lasso by several orders of magnitude.

スパースグループLasso (SGL)は、$\ell_1$ノルムと$\ell_2$ノルムの組み合わせを使用して、グループとグループ内のスパースパターンを同時に検出するための強力な回帰手法であることが示されています。ただし、大規模なアプリケーションでは、正則化の複雑さにより、計算上の大きな課題が生じます。この論文では、SGLのデュアル実行可能セットを分解することにより、新しい2層特徴削減法(TLFre)を提案します。2層削減により、スパース表現には存在しないことが保証され、最適化から削除できる非アクティブなグループと非アクティブな機能をそれぞれすばやく識別できます。既存の特徴削減方法は、スパース性を誘発する正則化が1つあるスパースモデルにのみ適用できます。私たちが知る限り、TLFreは、複数のスパース性を誘発する正則化を処理できる最初の方法です。さらに、TLFreは計算コストが非常に低く、既存のソルバーと統合できます。また、非負値Lasso用のスクリーニング方法(DPC (凸集合の分解)と呼ばれる)も開発しました。合成データセットと実際のデータセットの両方での実験により、TLFreとDPCによってSGLと非負値Lassoの効率が数桁向上することが示されました。

A Kernel Multiple Change-point Algorithm via Model Selection
モデル選択によるカーネル複数変更点アルゴリズム

We consider a general formulation of the multiple change-point problem, in which the data is assumed to belong to a set equipped with a positive semidefinite kernel. We propose a model-selection penalty that allows selecting the number of change points in Harchaoui and Cappé’s kernel-based change-point detection method. The model-selection penalty generalizes non-asymptotic model-selection penalties for the change-in-mean problem with univariate data. We prove a non-asymptotic oracle inequality for the resulting kernel-based change-point detection method, whatever the unknown number of change points, thanks to a concentration result for Hilbert-space valued random variables which may be of independent interest. Experiments on synthetic and real data illustrate the proposed method, demonstrating its ability to detect subtle changes in the distribution of data.

私たちは、多重変化点問題の一般的な定式化を検討し、データが正の半定値カーネルを備えたセットに属していると仮定します。HarchaouiとCappé のカーネルベースの変更点検出方法で変更点の数を選択できるモデル選択ペナルティを提案します。モデル選択ペナルティは、単変量データの平均変化問題に対する非漸近モデル選択ペナルティを一般化します。結果として得られるカーネルベースの変更点検出方法では、未知の数の変化点に関係なく、ヒルベルト空間値確率変数の集中結果のおかげで、非漸近オラクルの不等式を証明します。合成データと実データを用いた実験は、提案された方法を示しており、データの分布の微妙な変化を検出する能力を示しています。

Sparse Kernel Regression with Coefficient-based l_q-regularization
係数ベースの l_q 正則化によるスパースカーネル回帰

In this paper, we consider $\ell_q$-regularized kernel regression with $0 < q \leq 1$. In form, the algorithm minimizes a least-squares loss functional plus a coefficient-based $\ell_q$-penalty term over a linear span of features generated by a kernel function. We study the asymptotic behavior of the algorithm under the framework of learning theory. The contribution of this paper is two-fold. First, we derive a tight bound on the $\ell_2$-empirical covering numbers of the related function space involved in the error analysis. Based on this result, we obtain convergence rates for $\ell_1$-regularized kernel regression that are the best so far. Second, for the case $0 < q < 1$, we show that the regularization parameter acts as a trade-off between sparsity and convergence rates. Under some mild conditions, the fraction of non-zero coefficients in a local minimizer of the algorithm will tend to $0$ at a polynomial decay rate when the sample size $m$ becomes large. As the algorithm concerned is non-convex, we also discuss how to generate a minimizing sequence iteratively, which can help us search for a local minimizer around any initial point.

この論文では、$0 < q \leq 1$の$\ell_q$正則化カーネル回帰について検討します。形式的には、このアルゴリズムは、カーネル関数によって生成された特徴の線形スパンにわたって係数ベースの$\ell_q$ペナルティ項を追加した最小二乗損失関数を最小化します。学習理論の枠組みの下で、アルゴリズムの漸近的動作を研究します。本論文の貢献は2つあります。まず、エラー分析に含まれる関連関数空間の$\ell_2$経験的被覆数の厳密な境界を導出します。この結果に基づいて、これまでで最高の$\ell_1$正則化カーネル回帰の収束率を得ます。次に、$0 < q < 1$の場合、正則化パラメーターがスパース性と収束率のトレードオフとしての役割を果たすことを示します。いくつかの穏やかな条件下では、サンプルサイズ$m$が大きくなると、アルゴリズムの局所最小解における非ゼロ係数の割合は多項式減衰率で$0$に近づきます。関係するアルゴリズムは非凸であるため、任意の初期点の周囲で局所最小解を探索するのに役立つ、最小化シーケンスを反復的に生成する方法についても説明します。

Learning by Unsupervised Nonlinear Diffusion
教師なし非線形拡散による学習

This paper proposes and analyzes a novel clustering algorithm, called learning by unsupervised nonlinear diffusion (LUND), that combines graph-based diffusion geometry with techniques based on density and mode estimation. LUND is suitable for data generated from mixtures of distributions with densities that are both multimodal and supported near nonlinear sets. A crucial aspect of this algorithm is the use of time of a data-adapted diffusion process, and associated diffusion distances, as a scale parameter that is different from the local spatial scale parameter used in many clustering algorithms. We prove estimates for the behavior of diffusion distances with respect to this time parameter under a flexible nonparametric data model, identifying a range of times in which the mesoscopic equilibria of the underlying process are revealed, corresponding to a gap between within-cluster and between-cluster diffusion distances. These structures may be missed by the top eigenvectors of the graph Laplacian, commonly used in spectral clustering. This analysis is leveraged to prove sufficient conditions guaranteeing the accuracy of LUND. We implement LUND and confirm its theoretical properties on illustrative data sets, demonstrating its theoretical and empirical advantages over both spectral and density-based clustering.

この論文では、グラフベースの拡散幾何学と密度およびモード推定に基づく手法を組み合わせた、教師なし非線形拡散による学習(LUND)と呼ばれる新しいクラスタリングアルゴリズムを提案し、分析します。LUNDは、マルチモーダルで非線形集合に近い密度を持つ分布の混合から生成されたデータに適しています。このアルゴリズムの重要な側面は、データ適応拡散プロセスの時間と関連する拡散距離を、多くのクラスタリングアルゴリズムで使用されるローカル空間スケールパラメーターとは異なるスケールパラメーターとして使用することです。柔軟なノンパラメトリックデータモデルの下で、この時間パラメーターに関する拡散距離の動作の推定値を証明し、クラスター内拡散距離とクラスター間拡散距離のギャップに対応する、基礎となるプロセスのメソスコピック平衡が明らかになる時間の範囲を特定します。これらの構造は、スペクトルクラスタリングで一般的に使用されるグラフラプラシアンの上位の固有ベクトルでは見逃される可能性があります。この分析は、LUNDの精度を保証する十分な条件を証明するために活用されます。LUNDを実装し、例示的なデータセットでその理論的特性を確認し、スペクトルベースと密度ベースの両方のクラスタリングに対する理論的および経験的利点を実証します。

Optimal Convergence Rates for Convex Distributed Optimization in Networks
ネットワークにおける凸型分散最適化の最適収束率

This work proposes a theoretical analysis of distributed optimization of convex functions using a network of computing units. We investigate this problem under two communication schemes (centralized and decentralized) and four classical regularity assumptions: Lipschitz continuity, strong convexity, smoothness, and a combination of strong convexity and smoothness. Under the decentralized communication scheme, we provide matching upper and lower bounds of complexity along with algorithms achieving this rate up to logarithmic constants. For non-smooth objective functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly convex objective functions. Such a convergence rate is achieved by the novel multi-step primal-dual (MSPD) algorithm. Under the centralized communication scheme, we show that the naive distribution of standard optimization algorithms is optimal for smooth objective functions, and provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function for non-smooth functions. We then show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.

この研究では、コンピューティングユニットのネットワークを使用した凸関数の分散最適化の理論的分析を提案します。この問題を、2つの通信方式(集中型と分散型)と、リプシッツ連続性、強い凸性、滑らかさ、強い凸性と滑らかさの組み合わせという4つの古典的な正則性仮定の下で調査します。分散型通信方式では、複雑さの上限と下限を一致させ、この速度を対数定数まで達成するアルゴリズムを提供します。滑らかでない目的関数の場合、誤差の主な項は$O(1/\sqrt{t})$ですが、通信ネットワークの構造は$O(1/t)$の2次項にのみ影響します($t$は時間)。言い換えると、通信リソースの制限による誤差は、強く凸でない目的関数の場合でも急速に減少します。このような収束率は、新しいマルチステッププライマルデュアル(MSPD)アルゴリズムによって実現されます。集中型通信方式では、標準的な最適化アルゴリズムの単純な分散が滑らかな目的関数に最適であることを示し、滑らかでない関数の目的関数のローカル平滑化に基づく分散ランダム平滑化(DRS)と呼ばれるシンプルでありながら効率的なアルゴリズムを提供します。次に、DRSが最適収束率の$d^{1/4}$乗数以内であることを示します。ここで、$d$は基礎となる次元です。

GraSPy: Graph Statistics in Python
GraSPy:Pythonのグラフ統計

We introduce graspy, a Python library devoted to statistical inference, machine learning, and visualization of random graphs and graph populations. This package provides flexible and easy-to-use algorithms for analyzing and understanding graphs with a sklearn-compliant API. graspy can be downloaded from the Python Package Index (PyPI), and is released under the Apache 2.0 open-source license. The documentation and all releases are available at https://neurodata.io/graspy.

私たちは、統計的推論、機械学習、ランダムグラフやグラフ母集団の可視化に特化したPythonライブラリであるgraspyをご紹介します。このパッケージは、sklearn準拠のAPIを使用してグラフを分析および理解するための柔軟で使いやすいアルゴリズムを提供します。graspyはPython Package Index (PyPI)からダウンロードでき、Apache 2.0オープンソースライセンスの下でリリースされています。ドキュメントとすべてのリリースは、https://neurodata.io/graspyで入手できます。

Simultaneous Phase Retrieval and Blind Deconvolution via Convex Programming
凸型プログラミングによる位相回収とブラインドデコンボリューションの同時実行

We consider the task of recovering two real or complex $m$-vectors from phaseless Fourier measurements of their circular convolution. Our method is a novel convex relaxation that is based on a lifted matrix recovery formulation that allows a non-trivial convex relaxation of the bilinear measurements from convolution. We prove that if the two signals belong to known random subspaces of dimensions $k$ and $n$, then they can be recovered up to the inherent scaling ambiguity with $m \gg (k+n) \log^2 m$ phaseless measurements. Our method provides the first theoretical recovery guarantee for this problem by a computationally efficient algorithm and does not require a solution estimate to be computed for initialization. Our proof is based on Rademacher complexity estimates. Additionally, we provide an alternating direction method of multipliers (ADMM) implementation and provide numerical experiments that verify the theory.

私たちは、2つの実数または複素数の$m$-ベクトルを、それらの円形畳み込みの位相レスフーリエ測定から回復するタスクを考えます。私たちの方法は、畳み込みからの双線形測定の非自明な凸緩和を可能にするリフトマトリックス回復定式化に基づく新しい凸緩和です。2つの信号が次元$k$と$n$の既知のランダムな部分空間に属している場合、$m \gg (k+n) \log^2 m$の無相測定により、固有のスケーリングの曖昧さまで回復できることを証明します。この手法は、計算効率の高いアルゴリズムによって、この問題に対する最初の理論的な回復保証を提供し、初期化のために解の推定を計算する必要はありません。私たちの証明は、Rademacherの複雑さの推定に基づいています。さらに、ADMM(Alternating Direction of Multipliers)の実装を提供し、理論を検証する数値実験を提供します。

SimpleDet: A Simple and Versatile Distributed Framework for Object Detection and Instance Recognition
SimpleDet: オブジェクト検出とインスタンス認識のためのシンプルで汎用性の高い分散フレームワーク

Object detection and instance recognition play a central role in many AI applications such as autonomous driving, video surveillance, and medical image analysis. However, training object detection models on large-scale datasets remains computationally expensive and time-consuming. This paper presents an efficient, open-source object detection framework called SimpleDet, which enables the training of state-of-the-art detection models on consumer-grade hardware at large scale. SimpleDet covers a wide range of models, including both high-performance and high-speed ones. SimpleDet is well optimized for both low-precision training and distributed training, and achieves 70% higher throughput for the Mask R-CNN detector compared with existing frameworks. Code, examples, and documentation for SimpleDet can be found at https://github.com/tusimple/simpledet.

物体検出とインスタンス認識は、自動運転、ビデオ監視、医用画像分析など、多くのAIアプリケーションで中心的な役割を果たしています。ただし、大規模なデータセットでオブジェクト検出モデルをトレーニングするには、依然として計算コストと時間がかかります。この論文では、SimpleDetと呼ばれる効率的なオープンソースのオブジェクト検出フレームワークを紹介し、消費者向けグレードのハードウェアで最先端の検出モデルを大規模にトレーニングできるようにします。SimpleDetは、高性能モデルと高速モデルの両方を含む幅広いモデルをカバーしています。SimpleDetは、低精度の学習と分散学習の両方に最適化されており、既存のフレームワークと比較して、Mask R-CNN検出器のスループットが70%向上しています。SimpleDetのコード、例、およびドキュメントはhttps://github.com/tusimple/simpledetにあります。

Quantifying Uncertainty in Online Regression Forests
オンライン回帰フォレストにおける不確実性の定量化

Accurately quantifying uncertainty in predictions is essential for the deployment of machine learning algorithms in critical applications where mistakes are costly. Most approaches to quantifying prediction uncertainty have focused on settings where the data is static, or bounded. In this paper, we investigate methods that quantify the prediction uncertainty in a streaming setting, where the data is potentially unbounded. We propose two meta-algorithms that produce prediction intervals for online regression forests of arbitrary tree models; one based on conformal prediction, and the other based on quantile regression. We show that the approaches are able to maintain specified error rates, with constant computational cost per example and bounded memory usage. We provide empirical evidence that the methods outperform the state-of-the-art in terms of maintaining error guarantees, while being an order of magnitude faster. We also investigate how the algorithms are able to recover from concept drift.

予測の不確実性を正確に定量化することは、間違いがコストのかかる重要なアプリケーションで機械学習アルゴリズムを展開するために不可欠です。予測の不確実性を定量化するほとんどのアプローチは、データが静的または制限されている設定に焦点を当ててきました。この論文では、データが潜在的に制限されていないストリーミング設定で予測の不確実性を定量化する方法について調査します。任意のツリーモデルのオンライン回帰フォレストの予測区間を生成する2つのメタアルゴリズムを提案します。1つは等角予測に基づき、もう1つは分位回帰に基づきます。これらのアプローチでは、例あたりの計算コストが一定でメモリ使用量が制限されている場合、指定されたエラー率を維持できることを示します。これらの方法は、エラー保証の維持という点で最先端の方法よりも優れており、桁違いに高速であるという実証的証拠を示します。また、アルゴリズムがコンセプトドリフトから回復する方法についても調査します。
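
As a rough illustration of the conformal-prediction route mentioned above, the following sketch keeps a bounded window of recent absolute residuals and turns any point prediction into an interval. It is a simplified stand-in and not the meta-algorithm from the paper: the class name `StreamingConformal`, the window size, and the running-mean predictor (used here in place of an online regression forest) are all assumptions made for illustration.

```python
import math
from collections import deque

import numpy as np

class StreamingConformal:
    """Prediction intervals from a bounded buffer of recent residuals."""

    def __init__(self, alpha=0.1, window=500):
        self.alpha = alpha                       # target miscoverage rate
        self.residuals = deque(maxlen=window)    # bounded memory

    def interval(self, prediction):
        n = len(self.residuals)
        # Index of the conformal quantile among the sorted residuals.
        k = math.ceil((n + 1) * (1 - self.alpha)) - 1
        if n == 0 or k >= n:
            return -math.inf, math.inf           # too little data: trivial interval
        q = sorted(self.residuals)[k]
        return prediction - q, prediction + q

    def update(self, prediction, target):
        self.residuals.append(abs(target - prediction))

# Simulated stream; the running mean stands in for an online forest.
rng = np.random.default_rng(1)
cp = StreamingConformal(alpha=0.1, window=500)
total, count, covered = 0.0, 0, 0
n_steps = 3000
for _ in range(n_steps):
    y = rng.normal(loc=2.0, scale=1.0)
    pred = total / count if count else 0.0
    lo, hi = cp.interval(pred)
    covered += int(lo <= y <= hi)
    cp.update(pred, y)                           # learn only after predicting
    total += y
    count += 1

coverage = covered / n_steps
print(f"empirical coverage: {coverage:.3f}")
```

Because the buffer has a fixed maximum length, both memory use and the per-example cost stay bounded as the stream grows, and old residuals are eventually forgotten, which is one simple way to adapt after concept drift.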

Convergence Guarantees for a Class of Non-convex and Non-smooth Optimization Problems
非凸最適化問題と非平滑最適化問題のクラスに対する収束保証

We consider the problem of finding critical points of functions that are non-convex and non-smooth. Studying a fairly broad class of such problems, we analyze the behavior of three gradient-based methods (gradient descent, proximal update, and Frank-Wolfe update). For each of these methods, we establish rates of convergence for general problems, and also prove faster rates for continuous sub-analytic functions. We also show that our algorithms can escape strict saddle points for a class of non-smooth functions, thereby generalizing known results for smooth functions. Our analysis leads to a simplification of the popular CCCP algorithm, used for optimizing functions that can be written as a difference of two convex functions. Our simplified algorithm retains all the convergence properties of CCCP, along with a significantly lower cost per iteration. We illustrate our methods and theory via applications to the problems of best subset selection, robust estimation, mixture density estimation, and shape-from-shading reconstruction.

私たちは、非凸かつ非滑らかな関数の臨界点を見つける問題を考察します。そのような問題のかなり広いクラスを研究し、3つの勾配ベースの方法(勾配降下法、近似更新法、およびFrank-Wolfe更新法)の動作を解析します。これらの各方法について、一般的な問題に対する収束率を確立し、連続的なサブ解析関数に対するより速い収束率も証明します。また、我々のアルゴリズムが非滑らかな関数のクラスに対して厳密な鞍点を回避できることを示し、それによって滑らかな関数に対する既知の結果を一般化します。我々の解析は、2つの凸関数の差として記述できる関数を最適化するために使用される、一般的なCCCPアルゴリズムの簡略化につながる。我々の簡略化されたアルゴリズムは、CCCPのすべての収束特性を維持しながら、反復あたりのコストを大幅に削減します。私たちは、最良のサブセット選択、ロバスト推定、混合密度推定、およびシェーディングからの形状再構成の問題への応用を通じて、我々の方法と理論を説明します。

Approximation Algorithms for Stochastic Clustering
確率的クラスタリングのための近似アルゴリズム

We consider stochastic settings for clustering, and develop provably-good approximation algorithms for a number of these notions. These algorithms yield better approximation ratios compared to the usual deterministic clustering setting. Additionally, they offer a number of advantages including clustering which is fairer and has better long-term behavior for each user. In particular, they ensure that every user is guaranteed to get good service (on average). We also complement some of these with impossibility results.

私たちは、クラスタリングの確率的設定を考慮し、これらの概念の多くに対して証明可能な良好な近似アルゴリズムを開発します。これらのアルゴリズムは、通常の決定論的クラスタリング設定と比較して、より優れた近似比をもたらします。さらに、より公平で、各ユーザーにとっての長期的な挙動が優れたクラスタリングが得られるなど、多くの利点があります。特に、すべてのユーザーが(平均して)良いサービスを受けられることを保証します。また、これらの結果の一部を不可能性の結果によって補完します。

High-dimensional Varying Index Coefficient Models via Stein’s Identity
スタインの恒等式による高次元可変指数係数モデル

We study the parameter estimation problem for a varying index coefficient model in high dimensions. Unlike most existing works, which iteratively estimate the parameters and link functions, we build on the generalized Stein’s identity and propose computationally efficient estimators for the high-dimensional parameters without estimating the link functions. We consider two different setups where we either estimate each sparse parameter vector individually or estimate the parameters simultaneously as a sparse or low-rank matrix. For all these cases, our estimators are shown to achieve optimal statistical rates of convergence (up to logarithmic terms in the low-rank setting). Moreover, throughout our analysis, we only require the covariate to satisfy certain moment conditions, which is significantly weaker than the Gaussian or elliptically symmetric assumptions that are commonly made in the existing literature. Finally, we conduct extensive numerical experiments to corroborate the theoretical results.

私たちは、高次元の可変インデックス係数モデルのパラメータ推定問題を研究します。パラメータとリンク関数を反復的に推定する既存のほとんどの研究とは異なり、一般化Stein恒等式に基づいて、リンク関数を推定せずに高次元パラメータの計算効率の高い推定量を提案します。各スパースパラメータベクトルを個別に推定するか、パラメータをスパースまたは低ランク行列として同時に推定するかの2つの異なる設定を検討します。これらすべての場合において、推定量は最適な統計的収束率(低ランク設定での対数項まで)を達成することが示されています。さらに、分析全体を通じて、共変量が特定のモーメント条件を満たすことのみを要求します。これは、既存の文献で一般的に行われているガウスまたは楕円対称の仮定よりも大幅に弱いものです。最後に、理論的な結果を裏付けるために広範な数値実験を実施します。

Nonparametric Estimation of Probability Density Functions of Random Persistence Diagrams
ランダム永続性図の確率密度関数のノンパラメトリック推定

Topological data analysis refers to a broad set of techniques that are used to make inferences about the shape of data. A popular topological summary is the persistence diagram. Through the language of random sets, we describe a notion of global probability density function for persistence diagrams that fully characterizes their behavior and in part provides a noise likelihood model. Our approach encapsulates the number of topological features and considers the appearance or disappearance of those near the diagonal in a stable fashion. In particular, the structure of our kernel individually tracks long persistence features, while considering those near the diagonal as a collective unit. The choice to describe short persistence features as a group reduces computation time while simultaneously retaining accuracy. Indeed, we prove that the associated kernel density estimate converges to the true distribution as the number of persistence diagrams increases and the bandwidth shrinks accordingly. We also establish the convergence of the mean absolute deviation estimate, defined according to the bottleneck metric. Lastly, examples of kernel density estimation are presented for typical underlying datasets as well as for virtual electroencephalographic data related to cognition.

トポロジカルデータ分析とは、データの形状について推論を行うために使用される幅広い一連の手法を指します。一般的なトポロジカルサマリーは、パーシスタンスダイアグラムです。ランダムセットの言語を使用して、パーシスタンスダイアグラムのグローバル確率密度関数の概念を説明します。この関数は、パーシスタンスダイアグラムの挙動を完全に特徴付け、ノイズ尤度モデルを部分的に提供します。このアプローチでは、トポロジカルフィーチャの数をカプセル化し、対角線に近いフィーチャの出現または消失を安定した方法で考慮します。特に、カーネルの構造は、長いパーシスタンスフィーチャを個別に追跡し、対角線に近いフィーチャを集合的な単位として考慮します。短いパーシスタンスフィーチャをグループとして記述するという選択により、計算時間が短縮され、同時に精度が維持されます。実際、パーシスタンスダイアグラムの数が増え、帯域幅がそれに応じて縮小すると、関連するカーネル密度推定が真の分布に収束することを証明します。また、ボトルネックメトリックに従って定義された平均絶対偏差推定の収束も確立します。最後に、一般的な基礎データセットと、認知に関連する仮想脳波データに対するカーネル密度推定の例を示します。

Learning Optimized Risk Scores
最適化されたリスクスコアの学習

Risk scores are simple classification models that let users make quick risk predictions by adding and subtracting a few small numbers. These models are widely used in medicine and criminal justice, but are difficult to learn from data because they need to be calibrated, sparse, use small integer coefficients, and obey application-specific constraints. In this paper, we introduce a machine learning method to learn risk scores. We formulate the risk score problem as a mixed integer nonlinear program, and present a cutting plane algorithm to recover its optimal solution. We improve our algorithm with specialized techniques that generate feasible solutions, narrow the optimality gap, and reduce data-related computation. Our algorithm can train risk scores in a way that scales linearly in the number of samples in a dataset, and that allows practitioners to address application-specific constraints without parameter tuning or post-processing. We benchmark the performance of different methods to learn risk scores on publicly available datasets, comparing risk scores produced by our method to risk scores built using methods that are used in practice. We also discuss the practical benefits of our method through a real-world application where we build a customized risk score for ICU seizure prediction in collaboration with the Massachusetts General Hospital.

リスクスコアは、ユーザーがいくつかの小さな数字を加算および減算することで、迅速にリスク予測を行えるシンプルな分類モデルです。これらのモデルは、医療や刑事司法で広く使用されていますが、キャリブレーション、スパース、小さな整数係数の使用、アプリケーション固有の制約に従う必要があるため、データから学習するのは困難です。この論文では、リスクスコアを学習する機械学習手法を紹介します。リスクスコア問題を混合整数非線形計画として定式化し、その最適解を回復するための切断面アルゴリズムを提示します。実行可能な解を生成し、最適性のギャップを狭め、データ関連の計算を削減する特殊な手法を使用してアルゴリズムを改善します。このアルゴリズムは、データセット内のサンプル数に応じて線形にスケーリングし、パラメータ調整や後処理なしでアプリケーション固有の制約に対処できる方法でリスクスコアをトレーニングできます。公開されているデータセットでリスクスコアを学習するさまざまな方法のパフォーマンスをベンチマークし、私たちの方法で生成されたリスクスコアを、実際に使用されている方法で構築されたリスクスコアと比較します。また、マサチューセッツ総合病院と共同で、ICU発作予測のためのカスタマイズされたリスクスコアを構築する実際のアプリケーションを通じて、この方法の実際的な利点についても説明します。
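
To show what kind of model the abstract is describing, here is a minimal sketch of how a risk score is applied once it has been learned. The feature names, point values, and intercept below are hypothetical and are not taken from the paper or from any clinical tool; the sketch covers only the scoring step and does not implement the mixed integer nonlinear program or the cutting plane algorithm.

```python
import math

# Hypothetical small-integer point values (illustrative only).
POINTS = {
    "age_over_60": 2,
    "prior_event": 3,
    "normal_scan": -1,
}
INTERCEPT = -4  # hypothetical integer offset

def risk_score(features):
    """Add up the points of every feature that is present."""
    return sum(POINTS[name] for name, present in features.items() if present)

def predicted_risk(score):
    """Map the integer score to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-(INTERCEPT + score)))

patient = {"age_over_60": True, "prior_event": True, "normal_scan": False}
score = risk_score(patient)
print(score, round(predicted_risk(score), 3))
```

Restricting the coefficients to a few small integers is what lets a user compute the score by hand, and it is also what turns the learning problem into a discrete optimization problem rather than an ordinary logistic regression.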

On Asymptotic and Finite-Time Optimality of Bayesian Predictors
ベイズ予測子の漸近的および有限時間最適性について

The problem is that of sequential probability forecasting for finite-valued time series. The data is generated by an unknown probability distribution over the space of all one-way infinite sequences. Two settings are considered: the realizable and the non-realizable one. Assume first that the probability measure generating the sequence belongs to a given set $C$ (realizable case), but the latter is completely arbitrary (uncountably infinite, without any structure given). It is shown that the minimax asymptotic average loss, which may be positive, is always attainable, and it is attained by a Bayesian predictor whose prior is discrete and concentrated on $C$. Moreover, the finite-time loss of the Bayesian predictor is also optimal up to an additive $\log n$ term (where $n$ is the time step). This upper bound is complemented by a lower bound that goes to infinity but may do so arbitrarily slowly. Passing to the non-realizable setting, let the probability measure generating the data be arbitrary, and consider the given set $C$ as a set of experts to compete with. The goal is to minimize the regret with respect to the experts. It is shown that in this setting it is possible that all Bayesian strategies are strictly suboptimal even asymptotically. In other words, a sublinear regret may be attainable but the regret of every Bayesian predictor is linear. A very general recommendation for choosing a model can be made based on these results: it is better to take a model large enough to make sure it includes the process that generates the data, even if it entails positive asymptotic average loss, for otherwise any combination of predictors in the model class may be useless.

問題は、有限値の時系列の連続確率予測です。データは、すべての一方向無限シーケンスの空間上の未知の確率分布によって生成されます。実現可能な設定と実現不可能な設定の2つが考えられます。まず、シーケンスを生成する確率測度が特定の集合$C$に属する(実現可能な場合)が、後者は完全に任意(構造が指定されていない非可算無限)であると仮定します。最小最大漸近平均損失(正の場合もあります)は常に達成可能であり、事前分布が離散的で$C$に集中しているベイズ予測子によって達成されることが示されています。さらに、ベイズ予測子の有限時間損失も、加法的な$\log n$項(ここで$n$は時間ステップ)まで最適です。この上限は、無限大になる下限によって補完されますが、その下限は任意に遅くなる場合があります。実現不可能な設定に移り、データを生成する確率測度を任意とし、与えられた集合$C$を競争する専門家の集合とみなします。目標は、専門家に関する後悔を最小化することです。この設定では、すべてのベイズ戦略が漸近的にも厳密に最適ではない可能性があることが示されています。言い換えると、亜線形の後悔は達成可能かもしれませんが、すべてのベイズ予測子の後悔は線形です。これらの結果に基づいて、モデルを選択するための非常に一般的な推奨事項を作成できます。モデルは、正の漸近平均損失を伴う場合でも、データを生成するプロセスを確実に含むように十分に大きいモデルを選択することをお勧めします。そうしないと、モデルクラス内の予測子の任意の組み合わせが役に立たなくなる可能性があります。

Collective Matrix Completion
集合行列補完

Matrix completion aims to reconstruct a data matrix based on observations of a small number of its entries. Usually in matrix completion a single matrix is considered, which can be, for example, a rating matrix in a recommendation system. However, in practical situations, data is often obtained from multiple sources, which results in a collection of matrices rather than a single one. In this work, we consider the problem of collective matrix completion with multiple and heterogeneous matrices, which can be count, binary, continuous, etc. We first investigate the setting where, for each source, the matrix entries are sampled from an exponential family distribution. Then, we relax the assumption of exponential family distribution for the noise. In this setting, we do not assume any specific model for the observations. The estimation procedures are based on minimizing the sum of a goodness-of-fit term and the nuclear norm penalization of the whole collective matrix. We prove that the proposed estimators achieve fast rates of convergence under the two considered settings and we corroborate our results with numerical experiments.

行列補完は、少数のエントリの観測に基づいてデータ行列を再構築することを目的としています。通常、行列補完では単一の行列が考慮されます。これは、たとえば、推奨システムの評価行列です。ただし、実際の状況では、データは複数のソースから取得されることが多く、単一の行列ではなく行列のコレクションになります。この研究では、カウント、バイナリ、連続などの複数の異種行列を使用した集合行列補完の問題を検討します。最初に、各ソースの行列エントリが指数族分布からサンプリングされる設定を調査します。次に、ノイズに対して指数族分布の仮定を緩和します。この設定では、観測に対して特定のモデルを想定しません。推定手順は、集合行列全体の適合度項と核ノルムペナルティの合計を最小化することに基づいています。提案された推定器が、検討した2つの設定で高速収束率を達成することを証明し、数値実験で結果を裏付けます。

Robustifying Independent Component Analysis by Adjusting for Group-Wise Stationary Noise
グループごとの定常ノイズの調整による独立成分解析のロバスト化

We introduce coroICA, confounding-robust independent component analysis, a novel ICA algorithm which decomposes linearly mixed multivariate observations into independent components that are corrupted (and rendered dependent) by hidden group-wise stationary confounding. It extends the ordinary ICA model in a theoretically sound and explicit way to incorporate group-wise (or environment-wise) confounding. We show that our proposed general noise model makes it possible to perform ICA in settings where other noisy ICA procedures fail. Additionally, it can be used for applications with grouped data by adjusting for different stationary noise within each group. Our proposed noise model has a natural relation to causality and we explain how it can be applied in the context of causal inference. In addition to our theoretical framework, we provide an efficient estimation procedure and prove identifiability of the unmixing matrix under mild assumptions. Finally, we illustrate the performance and robustness of our method on simulated data, provide audible and visual examples, and demonstrate the applicability to real-world scenarios by experiments on publicly available Antarctic ice core data as well as two EEG data sets. We provide a scikit-learn compatible pip-installable Python package coroICA as well as R and Matlab implementations accompanied by documentation at https://sweichwald.de/coroICA/

私たちは、交絡に頑健な独立成分分析であるcoroICAを紹介します。これは、線形混合多変量観測値を、隠れたグループ全体の定常交絡によって破損(および従属化)された独立成分に分解する新しいICAアルゴリズムです。これは、グループ全体(または環境全体)の交絡を組み込むために、理論的に健全かつ明示的な方法で通常のICAモデルを拡張します。提案する一般的なノイズモデルにより、他のノイズの多いICA手順が失敗する設定でもICAを実行できることを示します。さらに、各グループ内の異なる定常ノイズを調整することにより、グループ化されたデータを使用するアプリケーションにも使用できます。提案するノイズモデルは因果関係と自然な関係があり、因果推論のコンテキストでどのように適用できるかを説明します。理論的枠組みに加えて、効率的な推定手順を提供し、軽度の仮定の下で分離行列の識別可能性を証明します。最後に、シミュレーションデータでのこの方法のパフォーマンスと堅牢性を示し、音声と視覚の例を示し、公開されている南極の氷床コアデータと2つのEEGデータセットでの実験によって、現実世界のシナリオへの適用可能性を示します。https://sweichwald.de/coroICA/で、scikit-learnと互換性のあるpipでインストール可能なPythonパッケージcoroICAと、ドキュメント付きのRおよびMatlab実装を提供します。

Characterizing the Sample Complexity of Pure Private Learners
純粋プライベート学習者のサンプルの複雑さの特性評価

Kasiviswanathan et al. (FOCS 2008) defined private learning as a combination of PAC learning and differential privacy. Informally, a private learner is applied to a collection of labeled individual information and outputs a hypothesis while preserving the privacy of each individual. Kasiviswanathan et al. left open the question of characterizing the sample complexity of private learners. We give a combinatorial characterization of the sample size sufficient and necessary to learn a class of concepts under pure differential privacy. This characterization is analogous to the well known characterization of the sample complexity of non-private learning in terms of the VC dimension of the concept class. We introduce the notion of probabilistic representation of a concept class, and our new complexity measure $RepDim$ corresponds to the size of the smallest probabilistic representation of the concept class. We show that any private learning algorithm for a concept class $C$ with sample complexity $m$ implies $RepDim(C)=O(m)$, and that there exists a private learning algorithm with sample complexity $m=O(RepDim(C))$. We further demonstrate that a similar characterization holds for the database size needed for computing a large class of optimization problems under pure differential privacy, and also for the well studied problem of private data release.

Kasiviswanathanら(FOCS 2008)は、プライベート学習をPAC学習と差分プライバシーの組み合わせとして定義しました。非公式には、プライベート学習者はラベル付けされた個人情報のコレクションに適用され、各個人のプライバシーを維持しながら仮説を出力します。Kasiviswanathanらは、プライベート学習者のサンプル複雑性の特徴付けの問題を未解決のまま残しました。私たちは、純粋な差分プライバシーの下で概念のクラスを学習するのに十分かつ必要なサンプルサイズの組み合わせ特徴付けを示します。この特徴付けは、概念クラスのVC次元に関する非プライベート学習のサンプル複雑性のよく知られた特徴付けに類似しています。私たちは概念クラスの確率的表現の概念を導入し、新しい複雑性尺度$RepDim$は概念クラスの最小の確率的表現のサイズに対応します。サンプル複雑度がmである概念クラスCのプライベート学習アルゴリズムはどれもRepDim(C)=O(m)を意味し、サンプル複雑度がm=O(RepDim(C))であるプライベート学習アルゴリズムが存在することを示します。さらに、純粋な差分プライバシーの下で大規模な最適化問題を計算するために必要なデータベースサイズや、よく研究されているプライベートデータ公開の問題にも同様の特性が当てはまることを示します。

Bayesian Optimization for Policy Search via Online-Offline Experimentation
オンライン-オフライン実験による方策探索のためのベイズ最適化

Online field experiments are the gold-standard way of evaluating changes to real-world interactive machine learning systems. Yet our ability to explore complex, multi-dimensional policy spaces—such as those found in recommendation and ranking problems—is often constrained by the limited number of experiments that can be run simultaneously. To alleviate these constraints, we augment online experiments with an offline simulator and apply multi-task Bayesian optimization to tune live machine learning systems. We describe practical issues that arise in these types of applications, including biases that arise from using a simulator and assumptions for the multi-task kernel. We measure empirical learning curves which show substantial gains from including data from biased offline experiments, and show how these learning curves are consistent with theoretical results for multi-task Gaussian process generalization. We find that improved kernel inference is a significant driver of multi-task generalization. Finally, we show several examples of Bayesian optimization efficiently tuning a live machine learning system by combining offline and online experiments.

オンラインフィールド実験は、現実世界のインタラクティブな機械学習システムへの変更を評価するための標準的な方法です。しかし、推奨やランキングの問題に見られるような複雑で多次元のポリシー空間を探索する能力は、同時に実行できる実験の数が限られているために制約されることがよくあります。これらの制約を軽減するために、オフラインシミュレーターを使用してオンライン実験を拡張し、マルチタスクベイズ最適化を適用してライブ機械学習システムを調整します。シミュレーターの使用から生じるバイアスやマルチタスクカーネルの仮定など、これらのタイプのアプリケーションで発生する実用的な問題について説明します。バイアスのあるオフライン実験のデータを含めることで大幅な向上を示す経験的学習曲線を測定し、これらの学習曲線がマルチタスクガウス過程の一般化の理論的結果とどのように一致するかを示します。カーネル推論の改善がマルチタスク一般化の重要な推進力であることがわかりました。最後に、オフライン実験とオンライン実験を組み合わせることでライブ機械学習システムを効率的に調整するベイズ最適化の例をいくつか示します。

Convergence of Gaussian Belief Propagation Under General Pairwise Factorization: Connecting Gaussian MRF with Pairwise Linear Gaussian Model
一般ペアワイズ因数分解におけるガウス信念伝搬の収束:ガウスMRFとペアワイズ線形ガウスモデルの接続

Gaussian belief propagation (BP) is a low-complexity and distributed method for computing the marginal distributions of a high-dimensional joint Gaussian distribution. However, Gaussian BP is only guaranteed to converge in singly connected graphs and may fail to converge in loopy graphs. Therefore, convergence analysis is a core topic in Gaussian BP. Existing conditions for verifying the convergence of Gaussian BP are all tailored for one particular pairwise factorization of the distribution in Gaussian Markov random field (MRF) and may not be valid for another pairwise factorization. On the other hand, convergence conditions of Gaussian BP in pairwise linear Gaussian model are developed independently from those in Gaussian MRF, making the convergence results highly scattered with diverse settings. In this paper, the convergence condition of Gaussian BP is investigated under a general pairwise factorization, which includes Gaussian MRF and pairwise linear Gaussian model as special cases. Upon this, existing convergence conditions in Gaussian MRF are extended to any pairwise factorization. Moreover, the newly established link between Gaussian MRF and pairwise linear Gaussian model reveals an easily verifiable sufficient convergence condition in pairwise linear Gaussian model, which provides a unified criterion for assessing the convergence of Gaussian BP in multiple applications. Numerical examples are presented to corroborate the theoretical results of this paper.

ガウス確信伝播法(BP)は、高次元結合ガウス分布の周辺分布を計算するための複雑性が低く分散化された方法です。しかし、ガウスBPは単連結グラフでのみ収束することが保証されており、ループグラフでは収束しない可能性があります。そのため、収束解析はガウスBPの中心的なトピックです。ガウスBPの収束を検証するための既存の条件はすべて、ガウスマルコフランダムフィールド(MRF)の分布の特定のペアワイズ分解に合わせて調整されており、別のペアワイズ分解には有効でない可能性があります。一方、ペアワイズ線形ガウスモデルにおけるガウスBPの収束条件は、ガウスMRFの条件とは独立して開発されているため、設定が多様になると収束結果が大きくばらつきます。この論文では、ガウスMRFとペアワイズ線形ガウスモデルを特殊なケースとして含む一般的なペアワイズ分解の下で、ガウスBPの収束条件を調査します。これにより、ガウスMRFの既存の収束条件が任意のペアワイズ因数分解に拡張されます。さらに、ガウスMRFとペアワイズ線形ガウスモデル間の新しく確立されたリンクにより、ペアワイズ線形ガウスモデルで簡単に検証できる十分な収束条件が明らかになり、複数のアプリケーションでガウスBPの収束を評価するための統一された基準が提供されます。この論文の理論的結果を裏付けるために、数値例が提示されています。
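
As a point of reference for the abstract above, the following is a minimal sketch of the textbook scalar Gaussian BP updates on a Gaussian MRF with density proportional to exp(-x'Jx/2 + h'x). It is not the paper's general pairwise factorization or its convergence test; the function name `gaussian_bp`, the synchronous update schedule, and the three-node chain are assumptions made for illustration. The chain is a tree, so the fixed point reproduces the exact marginals.

```python
def gaussian_bp(J, h, iters=50):
    """Synchronous scalar Gaussian BP for p(x) ∝ exp(-x'Jx/2 + h'x).

    J is a symmetric precision matrix given as a list of lists, h a list.
    Returns the BP estimates of the marginal means and variances.
    """
    n = len(h)
    nbrs = [[j for j in range(n) if j != i and J[i][j] != 0.0] for i in range(n)]
    # P[(i, j)] and m[(i, j)]: precision and potential carried by message i -> j.
    P = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}
    m = {edge: 0.0 for edge in P}
    for _ in range(iters):
        new_P, new_m = {}, {}
        for (i, j) in P:
            # Combine the local term with all incoming messages except the one from j.
            prec = J[i][i] + sum(P[(k, i)] for k in nbrs[i] if k != j)
            pot = h[i] + sum(m[(k, i)] for k in nbrs[i] if k != j)
            new_P[(i, j)] = -J[i][j] ** 2 / prec
            new_m[(i, j)] = -J[i][j] * pot / prec
        P, m = new_P, new_m
    means, variances = [], []
    for i in range(n):
        prec = J[i][i] + sum(P[(k, i)] for k in nbrs[i])
        pot = h[i] + sum(m[(k, i)] for k in nbrs[i])
        means.append(pot / prec)
        variances.append(1.0 / prec)
    return means, variances


# Three-node chain: singly connected, so BP converges to the exact marginals.
J = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
h = [1.0, 0.0, 1.0]
means, variances = gaussian_bp(J, h)
print(means, variances)
```

On a loopy graph the same updates may diverge or settle on incorrect variances, which is the situation the convergence conditions discussed in the abstract are meant to address.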

Minimal Sample Subspace Learning: Theory and Algorithms
最小サンプル部分空間学習:理論とアルゴリズム

Subspace segmentation, or subspace learning, is a challenging and complicated task in machine learning. This paper builds a basic framework and a solid theoretical basis for the minimal subspace segmentation (MSS) of finite samples. The existence and conditional uniqueness of MSS are discussed with conditions generally satisfied in applications. Utilizing weak prior information of MSS, the minimality inspection of segments is further simplified to the prior detection of partitions. The MSS problem is then modeled as a computable optimization problem via the self-expressiveness of samples. A closed form of the representation matrices is first given for the self-expressiveness, and the connection of diagonal blocks is addressed. The MSS model uses a rank restriction on the sum of segment ranks. Theoretically, it can retrieve the minimal sample subspaces that could be heavily intersected. The optimization problem is solved via a basic manifold conjugate gradient algorithm, alternative optimization and hybrid optimization, therein considering solutions to both the primal MSS problem and its pseudo-dual problem. The MSS model is further modified for handling noisy data and solved by an ADMM algorithm. The reported experiments show the strong ability of the MSS method to retrieve minimal sample subspaces that are heavily intersected.

サブスペースセグメンテーション、またはサブスペース学習は、機械学習における困難で複雑なタスクです。この論文では、有限サンプルの最小サブスペースセグメンテーション(MSS)の基本フレームと堅固な理論的根拠を構築します。MSSの存在と条件付き一意性は、アプリケーションで一般的に満たされる条件とともに説明されます。MSSの弱い事前情報を利用することで、セグメントの最小性検査は、パーティションの事前検出にさらに簡略化されます。次に、MSS問題は、サンプルの自己表現性を介して計算可能な最適化問題としてモデル化されます。自己表現性のために、まず表現行列の閉じた形式が与えられ、対角ブロックの接続が対処されます。MSSモデルは、セグメントランクの合計にランク制限を使用します。理論的には、交差が激しい可能性のある最小サンプルサブスペースを取得できます。最適化問題は、基本的な多様体共役勾配アルゴリズム、代替最適化、ハイブリッド最適化によって解決され、その中で、主MSS問題とその擬似デュアル問題の両方のソリューションが考慮されています。MSSモデルは、ノイズの多いデータを処理するためにさらに修正され、ADMMアルゴリズムによって解決されます。報告された実験は、交差が激しい最小サンプルサブスペースを取得するMSS法の優れた能力を示しています。

Model-free Nonconvex Matrix Completion: Local Minima Analysis and Applications in Memory-efficient Kernel PCA
モデルフリーの非凸行列補完:メモリ効率の良いカーネルPCAにおける局所最小値解析と応用

This work studies low-rank approximation of a positive semidefinite matrix from partial entries via nonconvex optimization. We characterize how well local-minimum based low-rank factorization approximates a fixed positive semidefinite matrix without any assumptions on the rank-matching, the condition number or eigenspace incoherence parameter. Furthermore, under certain assumptions on rank-matching and well-boundedness of condition numbers and eigenspace incoherence parameters, a corollary of our main theorem improves the state-of-the-art sampling rate results for nonconvex matrix completion with no spurious local minima in Ge et al. (2016, 2017). In addition, we investigate when the proposed nonconvex optimization results in accurate low-rank approximations even in the presence of large condition numbers, large incoherence parameters, or rank mismatching. We also propose to apply the nonconvex optimization to memory-efficient kernel PCA. Compared to the well-known Nyström methods, numerical experiments indicate that the proposed nonconvex optimization approach yields more stable results in both low-rank approximation and clustering.

この研究では、非凸最適化による部分エントリからの半正定値行列の低ランク近似について研究しています。ランクマッチング、条件数、固有空間非コヒーレンスパラメータに関する仮定なしに、局所最小値ベースの低ランク因数分解が固定半正定値行列をどの程度近似するかを特徴付けました。さらに、ランクマッチングと条件数および固有空間非コヒーレンスパラメータの十分な有界性に関する特定の仮定の下で、私たちの主定理の帰結として、Geら(2016、2017)における偽の局所最小値のない非凸行列補完の最先端のサンプリングレート結果が改善されます。さらに、提案された非凸最適化により、大きな条件数、大きなインコヒーレンスパラメータ、またはランクの不一致が存在する場合でも、正確な低ランク近似が得られるかどうかを調べました。また、メモリ効率の高いカーネルPCAに非凸最適化を適用することも提案しています。よく知られているNyström法と比較して、数値実験では、提案された非凸最適化アプローチにより、低ランク近似とクラスタリングの両方でより安定した結果が得られることがわかっています。

Provably Accurate Double-Sparse Coding
証明可能な精度のダブルスパース符号化

Sparse coding is a crucial subroutine in algorithms for various signal processing, deep learning, and other machine learning applications. The central goal is to learn an overcomplete dictionary that can sparsely represent a given input dataset. However, a key challenge is that the cost of storing, transmitting, and processing the learned dictionary can be untenably high if the data dimension is high. In this paper, we consider the double-sparsity model introduced by Rubinstein et al. (2010b) where the dictionary itself is the product of a fixed, known basis and a data-adaptive sparse component. First, we introduce a simple algorithm for double-sparse coding that can be amenable to efficient implementation via neural architectures. Second, we theoretically analyze its performance and demonstrate asymptotic sample complexity and running time benefits over existing (provable) approaches for sparse coding. To our knowledge, our work introduces the first computationally efficient algorithm for double-sparse coding that enjoys rigorous statistical guarantees. Finally, we corroborate our theory with several numerical experiments on simulated data, suggesting that our method may be useful for problem sizes encountered in practice.

スパースコーディングは、さまざまな信号処理、ディープラーニング、その他の機械学習アプリケーションのアルゴリズムで重要なサブルーチンです。主な目標は、与えられた入力データセットをスパースに表現できるオーバーコンプリート辞書を学習することです。しかし、重要な課題は、データ次元が大きい場合、学習した辞書の保存、転送、および処理が耐えられないほど高くなる可能性があることです。この論文では、辞書自体が固定された既知の基底とデータ適応型スパースコンポーネントの積である、Rubinsteinら(2010b)によって導入された二重スパースモデルを検討します。まず、ニューラルアーキテクチャを介して効率的に実装できる二重スパースコーディングのシンプルなアルゴリズムを紹介します。次に、そのパフォーマンスを理論的に分析し、スパースコーディングの既存の(証明可能な)アプローチよりも漸近的なサンプル複雑性と実行時間の利点を示します。私たちの知る限り、私たちの研究は、厳密な統計的保証を享受する二重スパースコーディングの計算効率の高い最初のアルゴリズムを紹介します。最後に、シミュレーションデータに対するいくつかの数値実験で理論を裏付け、実際に遭遇する問題のサイズに対して私たちの方法が有用である可能性があることを示唆します。

Nonparametric Bayesian Aggregation for Massive Data
大規模データのためのノンパラメトリックベイズ集約

We develop a set of scalable Bayesian inference procedures for a general class of nonparametric regression models. Specifically, nonparametric Bayesian inferences are separately performed on each subset randomly split from a massive dataset, and then the obtained local results are aggregated into global counterparts. This aggregation step is explicit without involving any additional computation cost. By a careful partition, we show that our aggregated inference results obtain an oracle rule in the sense that they are equivalent to those obtained directly from the entire data (which are computationally prohibitive). For example, an aggregated credible ball achieves desirable credibility level and also frequentist coverage while possessing the same radius as the oracle ball.

私たちは、ノンパラメトリック回帰モデルの一般的なクラスのためのスケーラブルなベイズ推論手順のセットを開発します。具体的には、ノンパラメトリックなベイズ推論は、大規模なデータセットからランダムに分割された各サブセットに対して個別に実行され、得られたローカルな結果はグローバルな対応物に集約されます。この集計ステップは、追加の計算コストを伴わずに明示的です。慎重な分割により、集計された推論結果は、データ全体から直接取得されたものと同等であるという意味でオラクルルールを取得することを示します(これは計算的に禁止されています)。例えば、集約された信頼性のあるボールは、オラクルボールと同じ半径を持ちながら、望ましい信頼性レベルを達成し、また頻度論的なカバレッジを達成します。

Decentralized Dictionary Learning Over Time-Varying Digraphs
時間変動ダイグラフ上の分散辞書学習

This paper studies Dictionary Learning problems wherein the learning task is distributed over a multi-agent network, modeled as a time-varying directed graph. This formulation is relevant, for instance, in Big Data scenarios where massive amounts of data are collected/stored in different locations (e.g., sensors, clouds) and aggregating and/or processing all data in a fusion center might be inefficient or unfeasible, due to resource limitations, communication overheads or privacy issues. We develop a unified decentralized algorithmic framework for this class of nonconvex problems, which is proved to converge to stationary solutions at a sublinear rate. The new method hinges on Successive Convex Approximation techniques, coupled with a decentralized tracking mechanism aiming at locally estimating the gradient of the smooth part of the sum-utility. To the best of our knowledge, this is the first provably convergent decentralized algorithm for Dictionary Learning and, more generally, bi-convex problems over (time-varying) (di)graphs.

この論文では、学習タスクが時間変動有向グラフとしてモデル化されたマルチエージェントネットワークに分散される辞書学習の問題を取り上げます。この定式化は、たとえば、膨大な量のデータがさまざまな場所(センサー、クラウドなど)で収集/保存され、リソースの制限、通信オーバーヘッド、プライバシーの問題により、フュージョンセンターですべてのデータを集約および/または処理することが非効率的または実行不可能な可能性があるビッグデータシナリオに関係します。このクラスの非凸問題に対して、線形以下の速度で定常解に収束することが証明されている、統一された分散アルゴリズムフレームワークを開発します。この新しい方法は、逐次凸近似手法と、合計効用の滑らかな部分の勾配を局所的に推定することを目的とした分散追跡メカニズムを組み合わせています。私たちの知る限り、これは辞書学習、およびより一般的には(時間変動) (有)グラフ上の双凸問題に対する、初めて証明可能な収束分散アルゴリズムです。

Generalized Maximum Entropy Estimation
一般化最大エントロピー推定

We consider the problem of estimating a probability distribution that maximizes the entropy while satisfying a finite number of moment constraints, possibly corrupted by noise. Based on duality of convex programming, we present a novel approximation scheme using a smoothed fast gradient method that is equipped with explicit bounds on the approximation error. We further demonstrate how the presented scheme can be used for approximating the chemical master equation through the zero-information moment closure method, and for an approximate dynamic programming approach in the context of constrained Markov decision processes with uncountable state and action spaces.

私たちは、ノイズによって破損する可能性のある有限数のモーメント制約を満たしながらエントロピーを最大化する確率分布を推定する問題を考えます。凸プログラミングの双対性に基づいて、近似誤差の明示的な境界を備えた平滑化高速勾配法を使用した新しい近似スキームを提示します。さらに、提示されたスキームを使用して、ゼロ情報モーメント閉包法による化学マスター方程式の近似、および不可算な状態空間と行動空間を持つ制約のあるマルコフ決定プロセスのコンテキストでの近似動的プログラミングアプローチに使用できる方法を示します。
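
To make the duality argument in the abstract above concrete, here is a small sketch of maximum entropy estimation under a single moment constraint on a finite support. It assumes a noise-free mean constraint and uses plain gradient descent on the convex dual, not the smoothed fast gradient method with error bounds that the paper develops; the names `maxent_probs` and `fit_maxent_mean`, the step size, and the dice example are illustrative choices.

```python
import math


def maxent_probs(support, lam):
    """Exponential-family form p_i ∝ exp(lam * x_i) of the maximum entropy solution."""
    scores = [lam * x for x in support]
    top = max(scores)  # shift by the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def fit_maxent_mean(support, target_mean, step=0.1, iters=2000):
    """Minimize the dual log Z(lam) - lam * target_mean by gradient descent."""
    lam = 0.0
    for _ in range(iters):
        p = maxent_probs(support, lam)
        mean = sum(pi * x for pi, x in zip(p, support))
        # The dual gradient is the gap between the model mean and the target mean.
        lam -= step * (mean - target_mean)
    return lam, maxent_probs(support, lam)


# Illustrative example: a six-sided die constrained to have mean 4.5.
faces = [1, 2, 3, 4, 5, 6]
lam, probs = fit_maxent_mean(faces, 4.5)
fitted_mean = sum(p * x for p, x in zip(probs, faces))
print(lam, probs, fitted_mean)
```

With several moment constraints the multiplier becomes a vector and the same gradient applies component by component; uncountable supports, as in the applications mentioned in the abstract, require the approximation scheme of the paper rather than this finite sum.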

Multiclass Boosting: Margins, Codewords, Losses, and Algorithms
多クラスブースティング: マージン、コードワード、損失、アルゴリズム

The problem of multiclass boosting is considered. A new formulation is presented, combining multi-dimensional predictors, multi-dimensional real-valued codewords, and proper multiclass margin loss functions. This leads to a number of contributions, such as maximum capacity codeword sets, a family of proper and margin enforcing losses, denoted as $\gamma-\phi$ losses, and two new multiclass boosting algorithms. These are descent procedures on the functional space spanned by a set of weak learners. The first, CD-MCBoost, is a coordinate descent procedure that updates one predictor component at a time. The second, GD-MCBoost, is a gradient descent procedure that updates all components jointly. Both MCBoost algorithms are defined with respect to a $\gamma-\phi$ loss and can reduce to classical boosting procedures (such as AdaBoost and LogitBoost) for binary problems. Beyond the algorithms themselves, the proposed formulation enables a unified treatment of many previous multiclass boosting algorithms. This is used to show that the latter implement different combinations of optimization strategy, codewords, weak learners, and loss function, highlighting some of their deficiencies. It is shown that no previous method matches the support of MCBoost for real codewords of maximum capacity, a proper margin-enforcing loss function, and any family of multidimensional predictors and weak learners. Experimental results confirm the superiority of MCBoost, showing that the two proposed MCBoost algorithms outperform comparable prior methods on a number of datasets.

Keywords: Boosting, Multiclass Boosting, Multiclass Classification, Margin Maximization, Loss Function.

マルチクラスブースティングの問題を検討します。多次元予測子、多次元実数値コードワード、および適切なマルチクラスマージン損失関数を組み合わせた新しい定式化が提示されます。これにより、最大容量コードワードセット、$\gamma-\phi$損失として示される適切なマージン強制損失のファミリ、および2つの新しいマルチクラスブースティングアルゴリズムなど、多くの貢献がもたらされます。これらは、一連の弱学習器によって広がる関数空間上の降下手順です。1つ目のCD-MCBoostは、一度に1つの予測子コンポーネントを更新する座標降下手順です。2つ目のGD-MCBoostは、すべてのコンポーネントを共同で更新する勾配降下手順です。両方のMCBoostアルゴリズムは、$\gamma-\phi$損失に関して定義され、バイナリ問題に対する従来のブースティング手順(AdaBoostやLogitBoostなど)に還元できます。アルゴリズム自体を超えて、提案された定式化により、以前の多くのマルチクラスブースティングアルゴリズムを統一的に処理できます。これは、後者が最適化戦略、コードワード、弱学習者、損失関数のさまざまな組み合わせを実装し、それらの欠陥のいくつかを強調していることを示すために使用されます。最大容量の実際のコードワード、適切なマージン強制損失関数、および多次元予測子と弱学習者の任意のファミリに対するMCBoostのサポートに匹敵する以前の方法はないことが示されています。実験結果により、MCBoostの優位性が確認され、提案された2つのMCBoostアルゴリズムが、多数のデータセットで同等の以前の方法よりも優れていることが示されています。

キーワード:ブースティング、マルチクラスブースティング、マルチクラス分類、マージン最大化、損失関数。

Local Regularization of Noisy Point Clouds: Improved Global Geometric Estimates and Data Analysis
ノイズの多い点群の局所正則化:改善されたグローバル幾何学的推定とデータ分析

Several data analysis techniques employ similarity relationships between data points to uncover the intrinsic dimension and geometric structure of the underlying data-generating mechanism. In this paper we work under the model assumption that the data is made of random perturbations of feature vectors lying on a low-dimensional manifold. We study two questions: how to define the similarity relationships over noisy data points, and what is the resulting impact of the choice of similarity in the extraction of global geometric information from the underlying manifold. We provide concrete mathematical evidence that using a local regularization of the noisy data to define the similarity improves the approximation of the hidden Euclidean distance between unperturbed points. Furthermore, graph-based objects constructed with the locally regularized similarity function satisfy better error bounds in their recovery of global geometric ones. Our theory is supported by numerical experiments that demonstrate that the gain in geometric understanding facilitated by local regularization translates into a gain in classification accuracy in simulated and real data.

いくつかのデータ分析手法では、データポイント間の類似関係を利用して、基礎となるデータ生成メカニズムの固有の次元と幾何学的構造を明らかにします。この論文では、データは低次元多様体上にある特徴ベクトルのランダムな摂動で構成されているというモデル仮定の下で作業を行います。2つの質問を検討します。ノイズの多いデータポイント間の類似関係をどのように定義するか、および基礎となる多様体からグローバルな幾何学的情報を抽出する際に類似性を選択した結果どのような影響があるかです。ノイズの多いデータのローカル正規化を使用して類似性を定義すると、摂動のないポイント間の隠れたユークリッド距離の近似値が向上するという具体的な数学的証拠を示します。さらに、ローカル正規化された類似性関数を使用して構築されたグラフベースのオブジェクトは、グローバルな幾何学的関数の回復においてより優れた誤差範囲を満たします。私たちの理論は、ローカル正規化によって促進される幾何学的理解の向上が、シミュレートされたデータと実際のデータでの分類精度の向上につながることを示す数値実験によってサポートされています。

Gaussian Processes with Linear Operator Inequality Constraints
線形演算子の不等式制約を持つガウス過程

This paper presents an approach for constrained Gaussian Process (GP) regression where we assume that a set of linear transformations of the process are bounded. It is motivated by machine learning applications for high-consequence engineering systems, where this kind of information is often made available from phenomenological knowledge. We consider a GP $f$ over functions on $\mathcal{X} \subset \mathbb{R}^{n}$ taking values in $\mathbb{R}$, where the process $\mathcal{L} f$ is still Gaussian when $\mathcal{L}$ is a linear operator. Our goal is to model $f$ under the constraint that realizations of $\mathcal{L} f$ are confined to a convex set of functions. In particular, we require that $a \leq \mathcal{L} f \leq b$, given two functions $a$ and $b$ where $a < b$ pointwise. This formulation provides a consistent way of encoding multiple linear constraints, such as shape-constraints based on e.g. boundedness, monotonicity or convexity. We adopt the approach of using a sufficiently dense set of virtual observation locations where the constraint is required to hold, and derive the exact posterior for a conjugate likelihood. The results needed for stable numerical implementation are derived, together with an efficient sampling scheme for estimating the posterior process.

この論文では、制約付きガウス過程(GP)回帰のアプローチを紹介します。このアプローチでは、過程の線形変換のセットが制限されていると仮定します。これは、この種の情報が現象論的知識から利用可能になることが多い、影響の大きいエンジニアリングシステムへの機械学習の応用に動機付けられています。$\mathbb{R}$に値を取る$\mathcal{X} \subset \mathbb{R}^{n}$上の関数に対するGP $f$を検討します。ここで、$\mathcal{L}$が線形演算子である場合、過程$\mathcal{L} f$は依然としてガウスです。私たちの目標は、$\mathcal{L} f$の実現が関数の凸集合に限定されるという制約の下で$f$をモデル化することです。特に、2つの関数$a$と$b$ (ここで各点で$a < b$)が与えられた場合、$a \leq \mathcal{L} f \leq b$である必要があります。この定式化は、有界性、単調性、凸性などに基づく形状制約などの複数の線形制約をエンコードする一貫した方法を提供します。制約が保持される必要がある仮想観測位置の十分に密なセットを使用するアプローチを採用し、共役尤度の正確な事後分布を導出します。安定した数値実装に必要な結果が、事後プロセスを推定するための効率的なサンプリング方式とともに導出されます。

Stochastic Variance-Reduced Cubic Regularization Methods
確率的分散低減 3 次正則化法

We propose a stochastic variance-reduced cubic regularized Newton method (SVRC) for non-convex optimization. At the core of SVRC is a novel semi-stochastic gradient along with a semi-stochastic Hessian, which are specifically designed for cubic regularization method. For a nonconvex function with $n$ component functions, we show that our algorithm is guaranteed to converge to an $(\epsilon,\sqrt{\epsilon})$-approximate local minimum within $\tilde{O}(n^{4/5}/\epsilon^{3/2})$ second-order oracle calls (here $\tilde{O}$ hides poly-logarithmic factors), which outperforms the state-of-the-art cubic regularization algorithms including subsampled cubic regularization. To further reduce the sample complexity of Hessian matrix computation in cubic regularization based methods, we also propose a sample efficient stochastic variance-reduced cubic regularization (Lite-SVRC) algorithm for finding the local minimum more efficiently. Lite-SVRC converges to an $(\epsilon,\sqrt{\epsilon})$-approximate local minimum within $\tilde{O}(n+n^{2/3}/\epsilon^{3/2})$ Hessian sample complexity, which is faster than all existing cubic regularization based methods. Numerical experiments with different nonconvex optimization problems conducted on real datasets validate our theoretical results for both SVRC and Lite-SVRC.

私たちは、非凸最適化のための確率的分散低減3次正則化ニュートン法(SVRC)を提案します。SVRCの核となるのは、3次正則化法のために特別に設計された、新しい半確率的勾配と半確率的ヘッセ行列です。$n$個のコンポーネント関数を持つ非凸関数の場合、このアルゴリズムは、$\tilde{O}(n^{4/5}/\epsilon^{3/2})$回(ここで、$\tilde{O}$は多重対数因子を隠します)の2次オラクル呼び出し以内で$(\epsilon,\sqrt{\epsilon})$近似局所最小値に収束することが保証されていることを示します。これはサブサンプル3次正則化を含む最先端の3次正則化アルゴリズムよりも優れています。3次正則化ベースの方法でヘッセ行列計算のサンプル複雑度をさらに削減するために、局所最小値をより効率的に見つけるためのサンプル効率の高い確率的分散低減3次正則化(Lite-SVRC)アルゴリズムも提案します。Lite-SVRCは、$\tilde{O}(n+n^{2/3}/\epsilon^{3/2})$ヘッセ行列サンプル複雑度内で$(\epsilon,\sqrt{\epsilon})$近似局所最小値に収束し、これは既存のすべての3次正則化ベースの方法よりも高速です。実際のデータセットで実行されたさまざまな非凸最適化問題による数値実験により、SVRCとLite-SVRCの両方に対する理論結果が検証されます。

Spurious Valleys in One-hidden-layer Neural Network Optimization Landscapes
1隠れ層ニューラルネットワーク最適化ランドスケープにおけるスプリアスバレー

Neural networks provide a rich class of high-dimensional, non-convex optimization problems. Despite their non-convexity, gradient-descent methods often successfully optimize these models. This has motivated a recent surge of research attempting to characterize properties of their loss surface that may explain such success. In this paper, we address this phenomenon by studying a key topological property of the loss: the presence or absence of spurious valleys, defined as connected components of sub-level sets that do not include a global minimum. Focusing on a class of one-hidden-layer neural networks defined by smooth (but generally non-linear) activation functions, we identify a notion of intrinsic dimension and show that it provides necessary and sufficient conditions for the absence of spurious valleys. More concretely, finite intrinsic dimension guarantees that for sufficiently overparametrised models no spurious valleys exist, independently of the data distribution. Conversely, infinite intrinsic dimension implies that spurious valleys do exist for certain data distributions, independently of model overparametrisation. Besides these positive and negative results, we show that, although spurious valleys may exist in general, they are confined to low risk levels and avoided with high probability on overparametrised models.

ニューラルネットワークは、高次元の非凸最適化問題の豊富なクラスを提供します。非凸性にもかかわらず、勾配降下法はこれらのモデルをうまく最適化することがよくあります。これが、そのような成功を説明する可能性のある損失面の特性を特徴付けようとする最近の研究の刺激となっています。この論文では、損失の主要な位相特性、つまりグローバル最小値を含まないサブレベルセットの接続コンポーネントとして定義される偽の谷の有無を調べることで、この現象に対処します。滑らかな(ただし、一般的には非線形)活性化関数によって定義される1つの隠れ層ニューラルネットワークのクラスに焦点を当て、固有次元の概念を特定し、それが偽の谷が存在しないために必要な十分な条件を提供することを示します。より具体的には、有限の固有次元は、十分にオーバーパラメータ化されたモデルでは、データ分布とは無関係に、偽の谷が存在しないことを保証します。逆に、無限の固有次元は、モデルのオーバーパラメータ化とは無関係に、特定のデータ分布では偽の谷が存在することを意味します。これらの肯定的および否定的な結果に加えて、偽の谷は一般には存在する可能性があるものの、それらは低リスクレベルに限定され、過剰パラメータ化されたモデルでは高い確率で回避されることを示しています。

More Efficient Estimation for Logistic Regression with Optimal Subsamples
最適なサブサンプルを使用したロジスティック回帰のより効率的な推定

In this paper, we propose an improved estimation method for logistic regression based on subsamples taken according to the optimal subsampling probabilities developed in Wang et al. (2018). Both asymptotic results and numerical results show that the new estimator has a higher estimation efficiency. We also develop a new algorithm based on Poisson subsampling, which does not require approximating the optimal subsampling probabilities all at once. This is computationally advantageous when available random-access memory is not enough to hold the full data. Interestingly, asymptotic distributions also show that Poisson subsampling produces a more efficient estimator if the sampling ratio, the ratio of the subsample size to the full data sample size, does not converge to zero. We also obtain the unconditional asymptotic distribution for the estimator based on Poisson subsampling. Pilot estimators are required to calculate subsampling probabilities and to correct biases in un-weighted estimators; interestingly, even if pilot estimators are inconsistent, the proposed method still produces consistent and asymptotically normal estimators.

この論文では、Wangら(2018)で開発された最適サブサンプリング確率に従って取得されたサブサンプルに基づく、ロジスティック回帰の推定方法の改良を提案します。漸近結果と数値結果の両方が、新しい推定量の推定効率が高いことを示しています。また、最適なサブサンプリング確率を一度に近似する必要のない、ポアソンサブサンプリングに基づく新しいアルゴリズムも開発しています。これは、使用可能なランダムアクセスメモリが完全なデータを保持するのに十分でない場合に計算上有利です。興味深いことに、漸近分布は、サンプリング比、つまりサブサンプルサイズと完全なデータサンプルサイズの比がゼロに収束しない場合、ポアソンサブサンプリングによりより効率的な推定量が生成されることを示しています。また、ポアソンサブサンプリングに基づく推定量の無条件漸近分布も取得します。パイロット推定量は、サブサンプリング確率を計算し、重み付けされていない推定量のバイアスを修正するために必要です。興味深いことに、パイロット推定量が一致性を持たない場合でも、提案された方法は依然として一致性があり漸近的に正規な推定量を生成します。
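
The memory advantage of Poisson subsampling described in the abstract above can be sketched as follows: each record is kept or dropped independently in a single pass, so the inclusion probabilities never have to be held for the full data at once. This sketch makes several assumptions that are not taken from the paper: the inclusion probability `0.1 + 0.1 * |x|` is a hypothetical placeholder rather than the optimal probabilities of Wang et al. (2018), the data are simulated, and the fit is a plain inverse-probability-weighted logistic regression, not the more efficient unweighted, bias-corrected estimator the paper proposes.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def poisson_subsample(stream, inclusion_prob, rng):
    """Single pass: keep each record independently with its own probability."""
    kept = []
    for x, y in stream:
        p = inclusion_prob(x, y)
        if rng.random() < p:
            kept.append((x, y, 1.0 / p))  # store the inverse-probability weight
    return kept


def weighted_logistic_fit(sample, iters=25):
    """Newton's method for a weighted logistic regression with intercept and one slope."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in sample:
            mu = sigmoid(b0 + b1 * x)
            r = w * (y - mu)         # weighted score contribution
            v = w * mu * (1.0 - mu)  # weighted curvature contribution
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Simulated full data with true coefficients (0.5, 1.0); purely illustrative.
rng = random.Random(0)
data = []
for _ in range(20000):
    x = rng.gauss(0.0, 1.0)
    y = 1 if rng.random() < sigmoid(0.5 + 1.0 * x) else 0
    data.append((x, y))

# Hypothetical inclusion probabilities, not the optimal ones from the paper.
sample = poisson_subsample(data, lambda x, y: min(1.0, 0.1 + 0.1 * abs(x)), rng)
b0_hat, b1_hat = weighted_logistic_fit(sample)
print(len(sample), b0_hat, b1_hat)
```

Because inclusion is decided record by record, `stream` can equally be a generator reading from disk, which is the situation where the full data do not fit in memory.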

Decoupling Sparsity and Smoothness in the Dirichlet Variational Autoencoder Topic Model
ディリクレ変分オートエンコーダトピックモデルにおけるスパース性と滑らかさの分離

Recent work on variational autoencoders (VAEs) has enabled the development of generative topic models using neural networks. Topic models based on latent Dirichlet allocation (LDA) successfully use the Dirichlet distribution as a prior for the topic and word distributions to enforce sparseness. However, there is a trade-off between sparsity and smoothness in Dirichlet distributions. Sparsity is important for a low reconstruction error during training of the autoencoder, whereas smoothness enables generalization and leads to a better log-likelihood of the test data. Both of these properties are encoded in the Dirichlet parameter vector. By rewriting this parameter vector into a product of a sparse binary vector and a smoothness vector, we decouple the two properties, leading to a model that features both a competitive topic coherence and a high log-likelihood. Efficient training is enabled using rejection sampling variational inference for the reparameterization of the Dirichlet distribution. Our experiments show that our method is competitive with other recent VAE topic models.

変分オートエンコーダ(VAE)に関する最近の研究により、ニューラルネットワークを使用した生成トピックモデルの開発が可能になりました。潜在ディリクレ配分(LDA)に基づくトピックモデルは、トピックと単語の分布の事前分布としてディリクレ分布をうまく使用して、スパース性を強化します。ただし、ディリクレ分布では、スパース性と滑らかさの間にトレードオフがあります。スパースは、オートエンコーダのトレーニング中に再構築エラーを低く抑えるために重要ですが、滑らかさは一般化を可能にし、テストデータの対数尤度を向上させます。これらの特性は両方とも、ディリクレパラメーターベクトルにエンコードされています。このパラメーターベクトルをスパースバイナリベクトルと滑らかさベクトルの積に書き換えることで、2つの特性を切り離し、競争力のあるトピックの一貫性と高い対数尤度の両方を備えたモデルを実現します。ディリクレ分布の再パラメーター化に拒否サンプリング変分推論を使用すると、効率的なトレーニングが可能になります。実験では、この方法が他の最近のVAEトピックモデルと競合することがわかりました。

Logical Explanations for Deep Relational Machines Using Relevance Information
関連性情報を使用した深層リレーショナルマシンの論理的説明

Our interest in this paper is in the construction of symbolic explanations for predictions made by a deep neural network. We will focus attention on deep relational machines (DRMs: a term introduced in Lodhi (2013)). A DRM is a deep network in which the input layer consists of Boolean-valued functions (features) that are defined in terms of relations provided as domain, or background, knowledge. Our DRMs differ from those in Lodhi (2013), which uses an Inductive Logic Programming (ILP) engine to first select features (we use random selections from a space of features that satisfies some approximate constraints on logical relevance and non-redundancy). But why do the DRMs predict what they do? One way of answering this was provided in recent work by Ribeiro et al. (2016), by constructing readable proxies for a black-box predictor. The proxies are intended only to model the predictions of the black-box in local regions of the instance-space. But readability alone may not be enough: to be understandable, the local models must use relevant concepts in a meaningful manner. We investigate the use of a Bayes-like approach to identify logical proxies for local predictions of a DRM. As a preliminary step, we show that DRMs with our randomised propositionalization method achieve predictive performance that is comparable to the best reports in the ILP literature. Our principal results on logical explanations show: (a) Models in first-order logic can approximate the DRM’s prediction closely in a small local region; and (b) Expert-provided relevance information can play the role of a prior to distinguish between logical explanations that perform equivalently on prediction alone.

この論文の目的は、ディープニューラルネットワークによる予測の記号的説明の構築です。ここでは、ディープリレーショナルマシン(DRM:Lodhi (2013)で導入された用語)に注目します。DRMは、入力層がドメインまたは背景知識として提供される関係で定義されるブール値関数(機能)で構成されるディープネットワークです。私たちのDRMは、Lodhi (2013)のDRMとは異なります。Lodhi (2013)では、帰納的論理プログラミング(ILP)エンジンを使用して最初に機能を選択します(論理的関連性と非冗長性に関するいくつかの近似制約を満たす機能の空間からランダムに選択します)。しかし、なぜDRMは予測を行うのでしょうか。この疑問に答える1つの方法は、最近のRibeiroら(2016)の研究で、ブラックボックス予測子の読み取り可能なプロキシを構築するというものでした。プロキシは、インスタンス空間のローカル領域でブラックボックスの予測をモデル化することのみを目的としています。しかし、読みやすさだけでは十分ではないかもしれません。理解できるようにするには、ローカルモデルは関連する概念を意味のある方法で使用する必要があります。DRMのローカル予測の論理プロキシを識別するために、ベイズのようなアプローチの使用を調査します。予備的なステップとして、ランダム化命題化方法を使用したDRMが、ILP文献の最高のレポートに匹敵する予測パフォーマンスを達成することを示します。論理的説明に関する主な結果は、(a)一次論理のモデルは、小さなローカル領域でDRMの予測を近似できること、および(b)専門家が提供する関連性情報は、予測のみで同等のパフォーマンスを発揮する論理的説明を区別するための事前の役割を果たすことができることを示しています。

Time-to-Event Prediction with Neural Networks and Cox Regression
ニューラルネットワークとCox回帰によるイベントまでの時間予測

New methods for time-to-event prediction are proposed by extending the Cox proportional hazards model with neural networks. Building on methodology from nested case-control studies, we propose a loss function that scales well to large data sets and enables fitting of both proportional and non-proportional extensions of the Cox model. Through simulation studies, the proposed loss function is verified to be a good approximation for the Cox partial log-likelihood. The proposed methodology is compared to existing methodologies on real-world data sets and is found to be highly competitive, typically yielding the best performance in terms of Brier score and binomial log-likelihood. A python package for the proposed methods is available at https://github.com/havakv/pycox.

イベントまでの時間を予測するための新しい方法は、ニューラルネットワークを使用してCox比例ハザードモデルを拡張することによって提案されます。ネストされたケースコントロール研究の方法論に基づいて、大規模なデータセットに適切にスケーリングし、Coxモデルの比例拡張と非比例拡張の両方のフィッティングを可能にする損失関数を提案します。シミュレーション研究を通じて、提案された損失関数は、Coxの部分対数尤度の適切な近似であることが確認されています。提案された方法論は、実世界のデータセットに関する既存の方法論と比較され、非常に競争力があり、通常、Brierスコアと二項対数尤度の点で最高のパフォーマンスを発揮することがわかっています。提案された方法のpythonパッケージはhttps://github.com/havakv/pycoxで入手できます。
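The relationship between the Cox partial log-likelihood and a case-control style approximation can be sketched in a few lines. This is a minimal illustration, not the pycox implementation: the function names, the `controls` argument, and the exact form of the sampled loss are assumptions made here. The idea shown is that each risk set is replaced by the case itself plus a few controls sampled from that risk set.

```python
import math

def cox_partial_nll(g, times, events):
    """Mean negative Cox partial log-likelihood over observed events (no tie correction)."""
    total, n_events = 0.0, 0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue
        # Risk set: everyone still under observation at time t_i (includes i itself).
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        total += math.log(sum(math.exp(g[j] - g[i]) for j in risk))
        n_events += 1
    return total / n_events

def sampled_control_loss(g, events, controls):
    """Approximation: each risk set is replaced by case i plus the given controls.

    controls[i] holds indices drawn from the risk set of case i, excluding i
    (hypothetical argument; how controls are sampled is left to the caller).
    """
    total, n_events = 0.0, 0
    for i, d_i in enumerate(events):
        if not d_i:
            continue
        # The leading 1.0 is exp(g[i] - g[i]), the case's own contribution.
        total += math.log(1.0 + sum(math.exp(g[j] - g[i]) for j in controls[i]))
        n_events += 1
    return total / n_events
```

When `controls[i]` contains the whole risk set except `i`, the two functions agree. With fewer controls, each term only needs a handful of network outputs `g[j]`, which is what makes mini-batch training practical.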

Unsupervised Basis Function Adaptation for Reinforcement Learning
強化学習のための教師なし基底関数適応

When using reinforcement learning (RL) algorithms it is common, given a large state space, to introduce some form of approximation architecture for the value function (VF). The exact form of this architecture can have a significant effect on an agent’s performance, however, and determining a suitable approximation architecture can often be a highly complex task. Consequently there is currently interest among researchers in the potential for allowing RL algorithms to adaptively generate (i.e. to learn) approximation architectures. One relatively unexplored method of adapting approximation architectures involves using feedback regarding the frequency with which an agent has visited certain states to guide which areas of the state space to approximate with greater detail. In this article we will: (a) informally discuss the potential advantages offered by such methods; (b) introduce a new algorithm based on such methods which adapts a state aggregation approximation architecture on-line and is designed for use in conjunction with SARSA; (c) provide theoretical results, in a policy evaluation setting, regarding this particular algorithm’s complexity, convergence properties and potential to reduce VF error; and finally (d) test experimentally the extent to which this algorithm can improve performance given a number of different test problems. Taken together our results suggest that our algorithm (and potentially such methods more generally) can provide a versatile and computationally lightweight means of significantly boosting RL performance given suitable conditions which are commonly encountered in practice.

強化学習(RL)アルゴリズムを使用する場合、状態空間が大きいと、価値関数(VF)の何らかの近似アーキテクチャを導入するのが一般的です。ただし、このアーキテクチャの正確な形式はエージェントのパフォーマンスに大きな影響を与える可能性があり、適切な近似アーキテクチャを決定することは非常に複雑な作業になることがよくあります。そのため、現在研究者の間では、RLアルゴリズムが近似アーキテクチャを適応的に生成(つまり学習)できるようにする可能性に興味が寄せられています。近似アーキテクチャを適応させる比較的未開拓の方法の1つは、エージェントが特定の状態を訪れた頻度に関するフィードバックを使用して、状態空間のどの領域をより詳細に近似するかをガイドすることです。この記事では、(a)このような方法によって提供される潜在的な利点について非公式に説明します。(b)このような方法に基づいて、状態集約近似アーキテクチャをオンラインで適応させ、SARSAと組み合わせて使用するように設計された新しいアルゴリズムを紹介します。(c)ポリシー評価設定において、この特定のアルゴリズムの複雑さ、収束特性、およびVFエラーを削減する可能性に関する理論的な結果を提供します。最後に(d)さまざまなテスト問題が与えられた場合、このアルゴリズムがパフォーマンスをどの程度向上できるかを実験的にテストします。総合すると、私たちのアルゴリズム(およびより一般的にはそのような方法)は、実際に一般的に遭遇する適切な条件が与えられた場合に、RLパフォーマンスを大幅に向上させる多用途で計算量が少ない手段を提供できることが示唆されます。
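The idea of using visit frequencies to decide where a state aggregation should be refined can be illustrated with a toy data structure. This sketch is not the algorithm from the paper: the class name, the contiguous-cell layout, and the "split the most-visited cell in half" rule are all assumptions chosen for brevity.

```python
class AdaptiveAggregation:
    """Partition of states 0..n_states-1 into contiguous cells, with visit counts per cell."""

    def __init__(self, n_states, n_cells):
        # Cell k covers states [bounds[k], bounds[k + 1]); assumes n_cells <= n_states.
        self.bounds = [k * n_states // n_cells for k in range(n_cells + 1)]
        self.visits = [0] * n_cells

    def cell_of(self, state):
        for k in range(len(self.visits)):
            if self.bounds[k] <= state < self.bounds[k + 1]:
                return k
        raise ValueError("state out of range")

    def record_visit(self, state):
        self.visits[self.cell_of(state)] += 1

    def split_most_visited(self):
        """Halve the most-visited cell that holds more than one state; reset counts."""
        candidates = [k for k in range(len(self.visits))
                      if self.bounds[k + 1] - self.bounds[k] > 1]
        if not candidates:
            return False
        k = max(candidates, key=lambda c: self.visits[c])
        mid = (self.bounds[k] + self.bounds[k + 1]) // 2
        self.bounds.insert(k + 1, mid)
        self.visits = [0] * (len(self.bounds) - 1)
        return True
```

A SARSA learner built on top of this would keep one value estimate per (cell, action) pair and would need a rule for initialising the two new cells after a split, for example by copying the parent's estimates.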

Causal Learning via Manifold Regularization
多様体正則化による因果学習

This paper frames causal structure estimation as a machine learning task. The idea is to treat indicators of causal relationships between variables as `labels’ and to exploit available data on the variables of interest to provide features for the labelling task. Background scientific knowledge or any available interventional data provide labels on some causal relationships and the remainder are treated as unlabelled. To illustrate the key ideas, we develop a distance-based approach (based on bivariate histograms) within a manifold regularization framework. We present empirical results on three different biological data sets (including examples where causal effects can be verified by experimental intervention), that together demonstrate the efficacy and general nature of the approach as well as its simplicity from a user’s point of view.

この論文では、因果構造推定を機械学習タスクとして組み立てています。この考え方は、変数間の因果関係の指標を「ラベル」として扱い、関心のある変数に関する利用可能なデータを活用して、ラベリングタスクの特徴を提供することです。背景となる科学的知識や利用可能な介入データは、一部の因果関係にラベルを提供し、残りはラベル付けされていないものとして扱われます。主要なアイデアを説明するために、多様体正則化フレームワーク内で距離ベースのアプローチ(二変量ヒストグラムに基づく)を開発します。私たちは、3つの異なる生物学的データセット(実験的介入によって因果効果を検証できる例を含む)で経験的な結果を提示し、これらを組み合わせることで、アプローチの有効性と一般的な性質、およびユーザーの視点からのその単純さを実証します。

Learning Representations of Persistence Barcodes
永続性バーコードの表現の学習

We consider the problem of supervised learning with summary representations of topological features in data. In particular, we focus on persistent homology, the prevalent tool used in topological data analysis. As the summary representations, referred to as barcodes or persistence diagrams, come in the unusual format of multisets, equipped with computationally expensive metrics, they cannot readily be processed with conventional learning techniques. While different approaches to address this problem have been proposed, either in the context of kernel-based learning, or via carefully designed vectorization techniques, it remains an open problem how to leverage advances in representation learning via deep neural networks. Appropriately handling topological summaries as input to neural networks would address the disadvantage of previous strategies which handle this type of data in a task-agnostic manner. In particular, we propose an approach that is designed to learn a task-specific representation of barcodes. In other words, we aim to learn a representation that adapts to the learning problem while, at the same time, preserving theoretical properties (such as stability). This is done by projecting barcodes into a finite dimensional vector space using a collection of parametrized functionals, so-called structure elements, for which we provide a generic construction scheme. A theoretical analysis of this approach reveals sufficient conditions to preserve stability, and also shows that different choices of structure elements lead to great differences with respect to their suitability for numerical optimization. When implemented as a neural network input layer, our approach demonstrates compelling performance on various types of problems, including graph classification, eigenvalue prediction, the classification of 2D/3D object shapes, and the recognition of activities from EEG signals.

私たちは、データ内の位相的特徴の要約表現を用いた教師あり学習の問題を検討します。特に、位相的データ解析で広く使用されているツールであるパーシステントホモロジーに焦点を当てる。バーコードまたはパーシステンスダイアグラムと呼ばれる要約表現は、計算コストの高いメトリックを備えたマルチセットの珍しい形式で提供されるため、従来の学習手法では容易に処理できない。この問題に対処するために、カーネルベースの学習のコンテキストで、または慎重に設計されたベクトル化手法を介して、さまざまなアプローチが提案されているが、ディープニューラルネットワークによる表現学習の進歩をどのように活用するかは未解決の問題です。ニューラルネットワークへの入力として位相的要約を適切に処理することで、タスクに依存しない方法でこのタイプのデータを処理する以前の戦略の欠点を解決できます。特に、私たちは、バーコードのタスク固有の表現を学習するように設計されたアプローチを提案します。言い換えれば、学習問題に適応すると同時に、理論的特性（安定性など）を維持する表現を学習することを目指しています。これは、パラメーター化された関数のコレクション、いわゆる構造要素を使用してバーコードを有限次元ベクトル空間に投影することによって行われます。構造要素には、汎用的な構築スキームが用意されています。このアプローチの理論的分析により、安定性を維持するための十分な条件が明らかになり、また、構造要素の選択によって数値最適化への適合性が大きく異なることも示されます。ニューラルネットワークの入力層として実装すると、グラフ分類と固有値予測、2D/3Dオブジェクトの形状の分類、EEG信号からのアクティビティの認識など、さまざまなタイプの問題で優れたパフォーマンスを発揮します。
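The projection of a barcode into a fixed-size vector can be sketched as follows. The functional used here (a lifetime-weighted Gaussian bump in birth/lifetime coordinates) is a hypothetical stand-in, not the structure elements constructed in the paper. In a learned input layer, `centers` and `sharpness` would be trainable parameters.

```python
import math

def structure_element(point, center, sharpness):
    """Hypothetical functional evaluated on one (birth, death) point."""
    birth, death = point
    lifetime = death - birth
    # Work in (birth, lifetime) coordinates; weighting by lifetime makes
    # points on the diagonal (lifetime == 0) contribute exactly zero.
    d2 = (birth - center[0]) ** 2 + (lifetime - center[1]) ** 2
    return lifetime * math.exp(-sharpness * d2)

def embed_barcode(barcode, centers, sharpness=1.0):
    """Map a multiset of (birth, death) pairs to a vector with len(centers) entries."""
    return [sum(structure_element(p, c, sharpness) for p in barcode) for c in centers]
```

Because each coordinate is a sum over the points, the output does not depend on the order in which the bars are listed, which is what a multiset input requires.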

ORCA: A Matlab/Octave Toolbox for Ordinal Regression
ORCA:順序回帰のためのMatlab/Octaveツールボックス

Ordinal regression, also named ordinal classification, studies classification problems where there exists a natural order between class labels. This structured order of the labels is crucial in all steps of the learning process in order to take full advantage of the data. ORCA (Ordinal Regression and Classification Algorithms) is a Matlab/Octave framework that implements and integrates different ordinal classification algorithms and specifically designed performance metrics. The framework simplifies the task of experimental comparison to a great extent, allowing the user to: (i) describe experiments by simple configuration files; (ii) automatically run different data partitions; (iii) parallelize the executions; (iv) generate a variety of performance reports and (v) include new algorithms by using its intuitive interface. Source code, binaries, documentation, descriptions and links to data sets and tutorials (including examples for educational purposes) are available at https://github.com/ayrna/orca.

順序回帰は順序分類とも呼ばれ、クラスラベル間に自然な順序が存在する分類問題を研究します。この構造化されたラベルの順序は、データを最大限に活用するために、学習プロセスのすべてのステップで重要です。ORCA (順序回帰および分類アルゴリズム)は、さまざまな順序分類アルゴリズムと特別に設計されたパフォーマンスメトリックを実装および統合するMatlab/Octaveフレームワークです。このフレームワークにより、実験の比較タスクが大幅に簡素化され、ユーザーは次の操作を実行できます。(i)シンプルな構成ファイルで実験を記述します。(ii)さまざまなデータパーティションを自動的に実行します。(iii)実行を並列化します。(iv)さまざまなパフォーマンスレポートを生成します。(v)直感的なインターフェイスを使用して新しいアルゴリズムを含める。ソースコード、バイナリ、ドキュメント、説明、データセットへのリンク、チュートリアル(教育目的の例を含む)は、https://github.com/ayrna/orcaで入手できます。

Deep Exploration via Randomized Value Functions
ランダム化価値関数による深い探索

We study the use of randomized value functions to guide deep exploration in reinforcement learning. This offers an elegant means for synthesizing statistically and computationally efficient exploration with common practical approaches to value function learning. We present several reinforcement learning algorithms that leverage randomized value functions and demonstrate their efficacy through computational studies. We also prove a regret bound that establishes statistical efficiency with a tabular representation.

私たちは、強化学習の深い探求を導くために、ランダム化された値関数の使用を研究しています。これは、統計的および計算的に効率的な探索を、価値関数学習への一般的な実践的なアプローチで合成するためのエレガントな手段を提供します。ランダム化された値関数を活用し、計算研究を通じてその有効性を実証するいくつかの強化学習アルゴリズムを紹介します。また、表形式の表現で統計的効率を確立する後悔の限界も証明します。

ADMMBO: Bayesian Optimization with Unknown Constraints using ADMM
ADMMBO: ADMM を使用した未知制約によるベイズ最適化

There exist many problems in science and engineering that involve optimization of an unknown or partially unknown objective function. Recently, Bayesian Optimization (BO) has emerged as a powerful tool for solving optimization problems whose objective functions are only available as a black box and are expensive to evaluate. Many practical problems, however, involve optimization of an unknown objective function subject to unknown constraints. This is an important yet challenging problem for which, unlike optimizing an unknown function, existing methods face several limitations. In this paper, we present a novel constrained Bayesian optimization framework to optimize an unknown objective function subject to unknown constraints. We introduce an equivalent optimization by augmenting the objective function with constraints, introducing auxiliary variables for each constraint, and forcing the new variables to be equal to the main variable. Building on the Alternating Direction Method of Multipliers (ADMM) algorithm, we propose ADMM-Bayesian Optimization (ADMMBO) to solve the problem in an iterative fashion. Our framework leads to multiple unconstrained subproblems with unknown objective functions, which we then solve via BO. Our method resolves several challenges of state-of-the-art techniques: it can start from infeasible points, is insensitive to initialization, can efficiently handle `decoupled problems’ and has a concrete stopping criterion. Extensive experiments on a number of challenging BO benchmark problems show that our proposed approach outperforms the state-of-the-art methods in terms of the speed of obtaining a feasible solution and convergence to the global optimum as well as minimizing the number of total evaluations of unknown objective and constraints functions.

科学と工学には、未知または部分的に未知の目的関数の最適化を伴う問題が数多く存在します。最近、ベイズ最適化(BO)は、目的関数がブラックボックスとしてのみ利用可能で、評価にコストがかかる最適化問題を解決するための強力なツールとして登場しました。しかし、多くの実用的な問題には、未知の制約を受ける未知の目的関数の最適化が含まれます。これは重要でありながら困難な問題であり、未知の関数の最適化とは異なり、既存の方法はいくつかの制限に直面しています。この論文では、未知の制約を受ける未知の目的関数を最適化するための新しい制約付きベイズ最適化フレームワークを紹介します。目的関数に制約を追加し、各制約に補助変数を導入し、新しい変数をメイン変数と等しくすることで、同等の最適化を導入します。交互方向乗数法(ADMM)アルゴリズムに基づいて、問題を反復的に解決するためのADMMベイズ最適化(ADMMBO)を提案します。このフレームワークは、未知の目的関数を持つ複数の制約のないサブ問題につながり、それをBOで解決します。私たちの方法は、最先端の技術のいくつかの課題を解決します。実行不可能な点から開始でき、初期化の影響を受けず、「分離された問題」を効率的に処理でき、具体的な停止基準があります。いくつかの困難なBOベンチマーク問題に対する広範な実験により、提案されたアプローチは、実行可能なソリューションの取得速度とグローバル最適値への収束、および未知の目的関数と制約関数の総評価回数の最小化に関して、最先端の方法よりも優れていることが示されました。
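The reformulation described in the abstract can be written out as follows. The notation ($f$ for the objective, $c_i$ for the $N$ constraints, $z_i$ for the auxiliary variables, and an indicator that is $0$ when its argument holds and $+\infty$ otherwise) is assumed here for illustration and may differ from the paper's. The original problem

$$\min_{x} \; f(x) \quad \text{subject to} \quad c_i(x) \le 0, \quad i = 1, \dots, N$$

is replaced by the equivalent problem

$$\min_{x,\, z_1, \dots, z_N} \; f(x) + \sum_{i=1}^{N} \mathbb{I}\left[c_i(z_i) \le 0\right] \quad \text{subject to} \quad z_i = x, \quad i = 1, \dots, N.$$

ADMM then alternates between updating $x$, updating each $z_i$, and updating the dual variables attached to the constraints $z_i = x$, so that each update involves only one unknown function at a time.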

Approximate Profile Maximum Likelihood
近似プロファイル最尤法

We propose an efficient algorithm for approximate computation of the profile maximum likelihood (PML), a variant of maximum likelihood maximizing the probability of observing a sufficient statistic rather than the empirical sample. The PML has appealing theoretical properties, but is difficult to compute exactly. Inspired by observations gleaned from exactly solvable cases, we look for an approximate PML solution, which, intuitively, clumps comparably frequent symbols into one symbol. This amounts to lower-bounding a certain matrix permanent by summing over a subgroup of the symmetric group rather than the whole group during the computation. We extensively experiment with the approximate solution, and the empirical performance of our approach is competitive and sometimes significantly better than state-of-the-art performances for various estimation problems.

私たちは、経験的サンプルではなく十分統計量を観測する確率を最大化する最尤法の変形であるプロファイル最尤法(PML)の近似計算のための効率的なアルゴリズムを提案します。PMLは魅力的な理論的特性を持っていますが、正確に計算することは困難です。正確に解けるケースから得られた観察結果に触発されて、私たちは近似的なPML解を探します。これは、直感的には、同程度の頻度のシンボルを1つのシンボルにまとめるものです。これは、計算中に対称群全体ではなくその部分群にわたって和をとることにより、ある行列のパーマネントを下から評価することに相当します。私たちは近似解を広範囲に実験し、私たちのアプローチの経験的性能は競争力があり、さまざまな推定問題に対する最先端の性能よりも大幅に優れている場合があります。
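In the PML literature the sufficient statistic in question is the profile of the sample: the multiset of symbol multiplicities, with the symbol labels discarded. A minimal sketch of computing it (the function name is an assumption):

```python
from collections import Counter

def profile(sample):
    """Return {multiplicity: number of distinct symbols with that multiplicity}."""
    multiplicities = Counter(sample)               # symbol -> count
    return dict(Counter(multiplicities.values()))  # count -> how many symbols have it
```

For example, `profile("abracadabra")` records that one symbol appears five times, two symbols appear twice, and two symbols appear once. Any relabelling of the symbols gives the same profile.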

Scaling Up Sparse Support Vector Machines by Simultaneous Feature and Sample Reduction
特徴量とサンプルの同時削減によるスパースサポートベクターマシンのスケールアップ

Sparse support vector machine (SVM) is a popular classification technique that can simultaneously learn a small set of the most interpretable features and identify the support vectors. It has achieved great successes in many real-world applications. However, for large-scale problems involving a huge number of samples and ultra-high dimensional features, solving sparse SVMs remains challenging. By noting that sparse SVMs induce sparsities in both feature and sample spaces, we propose a novel approach, which is based on accurate estimations of the primal and dual optima of sparse SVMs, to simultaneously identify the inactive features and samples that are guaranteed to be irrelevant to the outputs. Thus, we can remove the identified inactive samples and features from the training phase, leading to substantial savings in the computational cost without sacrificing the accuracy. Moreover, we show that our method can be extended to multi-class sparse support vector machines. To the best of our knowledge, the proposed method is the first static feature and sample reduction method for sparse SVMs and multi-class sparse SVMs. Experiments on both synthetic and real data sets demonstrate that our approach significantly outperforms state-of-the-art methods and the speedup gained by our approach can be orders of magnitude.

スパースサポートベクターマシン（SVM）は、最も解釈しやすい特徴の小さなセットを同時に学習し、サポートベクターを識別できる一般的な分類手法です。多くの実際のアプリケーションで大きな成功を収めています。ただし、膨大な数のサンプルと超高次元の特徴を含む大規模な問題の場合、スパースSVMを解決することは依然として困難です。スパースSVMは特徴空間とサンプル空間の両方でスパース性を誘発することに注目して、スパースSVMの主最適値と二重最適値の正確な推定に基づく新しいアプローチを提案し、出力に無関係であることが保証されている非アクティブな特徴とサンプルを同時に識別します。したがって、識別された非アクティブなサンプルと特徴をトレーニングフェーズから削除できるため、精度を犠牲にすることなく計算コストを大幅に節約できます。さらに、この方法をマルチクラススパースサポートベクターマシンに拡張できることを示しています。私たちの知る限り、提案された方法は、スパースSVMおよびマルチクラススパースSVM向けの最初の静的特徴およびサンプル削減方法です。合成データセットと実際のデータセットの両方での実験により、私たちのアプローチが最先端の方法を大幅に上回り、私たちのアプローチによって得られるスピードアップは桁違いになることが実証されています。

Ivanov-Regularised Least-Squares Estimators over Large RKHSs and Their Interpolation Spaces
大規模 RKHS とその補間空間上のイワノフ正規化最小二乗推定量

We study kernel least-squares estimation under a norm constraint. This form of regularisation is known as Ivanov regularisation and it provides better control of the norm of the estimator than the well-established Tikhonov regularisation. Ivanov regularisation can be studied under minimal assumptions. In particular, we assume only that the RKHS is separable with a bounded and measurable kernel. We provide rates of convergence for the expected squared $L^2$ error of our estimator under the weak assumption that the variance of the response variables is bounded and the unknown regression function lies in an interpolation space between $L^2$ and the RKHS. We then obtain faster rates of convergence when the regression function is bounded by clipping the estimator. In fact, we attain the optimal rate of convergence. Furthermore, we provide a high-probability bound under the stronger assumption that the response variables have subgaussian errors and that the regression function lies in an interpolation space between $L^\infty$ and the RKHS. Finally, we derive adaptive results for the settings in which the regression function is bounded.

私たちは、ノルム制約の下でカーネル最小二乗推定を研究します。この形式の正則化はイワノフ正則化として知られ、よく確立されたティホノフ正則化よりも推定量のノルムをより適切に制御できます。イワノフ正則化は、最小限の仮定の下で研究できます。特に、RKHSは有界で測定可能なカーネルで分離可能であると仮定します。応答変数の分散が有界であり、未知の回帰関数が$L^2$とRKHSの間の補間空間にあるという弱い仮定の下で、推定量の期待二乗$L^2$誤差の収束率を提供します。次に、回帰関数が有界である場合に推定量をクリッピングすることでより速い収束率が得られます。実際、最適な収束率を達成しています。さらに、応答変数がサブガウス誤差を持ち、回帰関数が$L^\infty$とRKHSの間の補間空間にあるというより強い仮定の下で、高確率の境界を提供します。最後に、回帰関数が境界付けられた設定に対して適応的な結果を導出します。
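The two forms of regularisation contrasted in the abstract can be written side by side. The symbols ($H$ for the RKHS, $(X_i, Y_i)$ for the $n$ observations, $r > 0$ for the norm bound, and $\lambda > 0$ for the penalty weight) are assumed notation. Ivanov regularisation constrains the norm directly:

$$\hat{f}_r \in \operatorname*{arg\,min}_{f \in H,\ \|f\|_H \le r} \; \frac{1}{n} \sum_{i=1}^{n} \left(f(X_i) - Y_i\right)^2,$$

whereas Tikhonov regularisation penalises it:

$$\hat{f}_\lambda \in \operatorname*{arg\,min}_{f \in H} \; \frac{1}{n} \sum_{i=1}^{n} \left(f(X_i) - Y_i\right)^2 + \lambda \|f\|_H^2.$$

In the Ivanov form the bound $\|\hat{f}_r\|_H \le r$ holds by construction, which is the sense in which it gives better control of the estimator's norm.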

Layer-Wise Learning Strategy for Nonparametric Tensor Product Smoothing Spline Regression and Graphical Models
ノンパラメトリックなテンソル積平滑化スプライン回帰とグラフィカルモデルのための層単位学習戦略

Nonparametric estimation of multivariate functions is an important problem in statistical machine learning with many applications, ranging from nonparametric regression to nonparametric graphical models. Several authors have proposed to estimate multivariate functions under the smoothing spline analysis of variance (SSANOVA) framework, which assumes that the multivariate function can be decomposed into the summation of main effects, two-way interaction effects, and higher order interaction effects. However, existing methods are not scalable to the dimension of the random variables and the order of interactions. We propose a LAyer-wiSE leaRning strategy (LASER) to estimate multivariate functions under the SSANOVA framework. The main idea is to approximate the multivariate function sequentially starting from a model with only the main effects. Conditioned on the support of the estimated main effects, we estimate the two-way interaction effects only when the corresponding main effects are estimated to be non-zero. This process is continued until no more higher order interaction effects are identified. The proposed strategy provides a data-driven approach for estimating multivariate functions under the SSANOVA framework. Our proposal yields a sequence of estimators. To establish the theoretical properties of the sequence of estimators, we establish the notion of post-selection persistency. Extensive numerical studies are performed to evaluate the performance of our algorithm.

多変量関数のノンパラメトリック推定は、統計的機械学習における重要な問題であり、ノンパラメトリック回帰からノンパラメトリックグラフィカルモデルまで、多くのアプリケーションがあります。何人かの著者は、多変量関数を主効果、双方向相互作用効果、および高次相互作用効果の合計に分解できると仮定する平滑化スプライン分散分析(SSANOVA)フレームワークの下で多変量関数を推定することを提案しています。ただし、既存の方法は、ランダム変数の次元と相互作用の順序にスケーラブルではありません。私たちは、SSANOVAフレームワークの下で多変量関数を推定するためのLAyer-wiSE学習戦略(LASER)を提案します。主なアイデアは、主効果のみを含むモデルから始めて、多変量関数を順次近似することです。推定された主効果のサポートを条件として、対応する主効果がゼロ以外であると推定される場合にのみ、双方向相互作用効果を推定します。このプロセスは、それ以上の高次相互作用効果が特定されなくなるまで続けられます。提案された戦略は、SSANOVAフレームワークの下で多変量関数を推定するためのデータ駆動型アプローチを提供します。私たちの提案は、推定値のシーケンスを生成します。推定値のシーケンスの理論的特性を確立するために、選択後の持続性の概念を確立します。私たちのアルゴリズムのパフォーマンスを評価するために、広範な数値研究が行われます。
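The layer-wise candidate rule can be sketched as a small helper. The abstract states the rule only for two-way interactions (both main effects must be estimated as non-zero). The extension to higher orders used below, where every lower-order sub-effect must have been selected, is an assumption of this sketch, as is the function name.

```python
from itertools import combinations

def next_layer_candidates(selected, order):
    """Candidate interactions of the given order whose lower-order parents were all selected.

    selected: set of frozensets, the effects of size order - 1 estimated to be non-zero.
    """
    variables = sorted(set().union(*selected)) if selected else []
    return [frozenset(c) for c in combinations(variables, order)
            if all(frozenset(s) in selected for s in combinations(c, order - 1))]
```

With main effects for variables 1, 3 and 4 selected out of many, only the three pairs among them are fitted in the second layer, rather than all pairs of variables. The procedure stops when a layer returns no candidates or none of its candidates is selected.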

Binarsity: a penalization for one-hot encoded features in linear supervised learning
二項性:線形教師あり学習におけるワンホット符号化特徴に対するペナルティ

This paper deals with the problem of large-scale linear supervised learning in settings where a large number of continuous features are available. We propose to combine the well-known trick of one-hot encoding of continuous features with a new penalization called binarsity. In each group of binary features coming from the one-hot encoding of a single raw continuous feature, this penalization uses total-variation regularization together with an extra linear constraint. This induces two interesting properties on the model weights of the one-hot encoded features: they are piecewise constant, and are eventually block sparse. Non-asymptotic oracle inequalities for generalized linear models are proposed. Moreover, under a sparse additive model assumption, we prove that our procedure matches the state-of-the-art in this setting. Numerical experiments illustrate the good performances of our approach on several datasets. It is also noteworthy that our method has a numerical complexity comparable to standard $\ell_1$ penalization.

この論文では、多数の連続特徴が利用可能な設定における大規模な線形教師あり学習の問題を取り上げます。連続特徴のワンホットエンコーディングというよく知られたトリックと、バイナーシティと呼ばれる新しいペナルティを組み合わせることを提案します。単一の生の連続特徴のワンホットエンコーディングから得られるバイナリ特徴の各グループでは、このペナルティは、追加の線形制約とともに全変動正則化を使用します。これにより、ワンホットエンコードされた特徴のモデル重みに2つの興味深い特性が誘導されます。つまり、それらは区分的に一定であり、最終的にはブロックスパースになります。一般化線形モデルの非漸近オラクル不等式が提案されています。さらに、スパース加法モデル仮定の下で、この手順がこの設定における最先端のものと一致することを証明します。数値実験は、いくつかのデータセットでのアプローチの優れたパフォーマンスを示しています。また、この方法が標準的な$\ell_1$ペナルティに匹敵する数値計算量を持つことも注目に値します。
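The two ingredients named in the abstract, one-hot encoding of a continuous feature and a total-variation penalty on the resulting block of weights, can be sketched as below. The bin edges are supplied by the caller. The abstract does not spell out the extra linear constraint, so the plain sum-to-zero check here is an assumption, as are all function names.

```python
from bisect import bisect_right

def one_hot_bins(value, edges):
    """One-hot encode a continuous value into len(edges) + 1 interval indicators."""
    code = [0] * (len(edges) + 1)
    code[bisect_right(edges, value)] = 1
    return code

def total_variation(weights):
    """Sum of absolute jumps between consecutive weights of one block."""
    return sum(abs(b - a) for a, b in zip(weights, weights[1:]))

def satisfies_sum_to_zero(weights, tol=1e-12):
    """Assumed form of the linear constraint: the block's weights sum to zero."""
    return abs(sum(weights)) <= tol
```

A block of weights such as `[1.0, 1.0, -1.0, -1.0]` has a single jump, so its total variation is small relative to a block that changes at every bin. Penalising total variation therefore favours piecewise-constant weights, and a block with no jumps that also sums to zero is identically zero, which gives block sparsity.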

Generic Inference in Latent Gaussian Process Models
潜在ガウス過程モデルにおける一般推論

We develop an automated variational method for inference in models with Gaussian process (GP) priors and general likelihoods. The method supports multiple outputs and multiple latent functions and does not require detailed knowledge of the conditional likelihood, only needing its evaluation as a black-box function. Using a mixture of Gaussians as the variational distribution, we show that the evidence lower bound and its gradients can be estimated efficiently using samples from univariate Gaussian distributions. Furthermore, the method is scalable to large datasets which is achieved by using an augmented prior via the inducing-variable approach underpinning most sparse GP approximations, along with parallel computation and stochastic optimization. We evaluate our approach quantitatively and qualitatively with experiments on small datasets, medium-scale datasets and large datasets, showing its competitiveness under different likelihood models and sparsity levels. On the large-scale experiments involving prediction of airline delays and classification of handwritten digits, we show that our method is on par with the state-of-the-art hard-coded approaches for scalable GP regression and classification.

私たちは、ガウス過程(GP)事前分布と一般尤度を持つモデルにおける推論のための自動変分法を開発しました。この方法は、複数の出力と複数の潜在関数をサポートし、条件付き尤度の詳細な知識を必要とせず、ブラックボックス関数としての評価のみを必要とします。変分分布としてガウス分布の混合を使用して、単変量ガウス分布のサンプルを使用して、証拠の下限とその勾配を効率的に推定できることを示します。さらに、この方法は、並列計算と確率的最適化とともに、ほとんどのスパースGP近似の基礎となる誘導変数アプローチによる拡張事前分布を使用することで実現され、大規模なデータセットに拡張可能です。小規模データセット、中規模データセット、大規模データセットでの実験により、このアプローチを定量的および定性的に評価し、さまざまな尤度モデルとスパース性レベルの下での競争力を示します。航空便の遅延予測や手書き数字の分類を含む大規模な実験では、私たちの方法がスケーラブルなGP回帰と分類の最先端のハードコードされたアプローチと同等であることが示されています。

Graph Reduction with Spectral and Cut Guarantees
スペクトルとカットの保証によるグラフ削減

Can one reduce the size of a graph without significantly altering its basic properties? The graph reduction problem is hereby approached from the perspective of restricted spectral approximation, a modification of the spectral similarity measure used for graph sparsification. This choice is motivated by the observation that restricted approximation carries strong spectral and cut guarantees, and that it implies approximation results for unsupervised learning problems relying on spectral embeddings. The article then focuses on coarsening – the most common type of graph reduction. Sufficient conditions are derived for a small graph to approximate a larger one in the sense of restricted approximation. These findings give rise to algorithms that, compared to both standard and advanced graph reduction methods, find coarse graphs of improved quality, often by a large margin, without sacrificing speed.

グラフの基本的なプロパティを大幅に変更せずに、グラフのサイズを縮小することはできますか?グラフ削減問題は、グラフのスパース化に使用されるスペクトル類似度尺度の修正である制限付きスペクトル近似の観点からアプローチされます。この選択は、制限付き近似が強力なスペクトルとカットの保証をもたらすこと、およびスペクトル埋め込みに依存する教師なし学習問題の近似結果を意味するという観察によって動機付けられています。次に、この記事では、グラフ削減の最も一般的なタイプである粗大化に焦点を当てます。小さなグラフが制限された近似の意味で大きなグラフを近似するための十分な条件が導き出されます。これらの発見は、標準的なグラフ削減方法と高度なグラフ削減方法の両方と比較して、速度を犠牲にすることなく、多くの場合、大幅に改善された品質の粗いグラフを見つけるアルゴリズムを生み出します。

Learning Attribute Patterns in High-Dimensional Structured Latent Attribute Models
高次元構造化潜在属性モデルにおける属性パターンの学習

Structured latent attribute models (SLAMs) are a special family of discrete latent variable models widely used in social and biological sciences. This paper considers the problem of learning significant attribute patterns from a SLAM with potentially high-dimensional configurations of the latent attributes. We address the theoretical identifiability issue, propose a penalized likelihood method for the selection of the attribute patterns, and further establish the selection consistency in such an overfitted SLAM with a diverging number of latent patterns. The good performance of the proposed methodology is illustrated by simulation studies and two real datasets in educational assessments.

構造化潜在属性モデル(SLAM)は、社会科学や生物科学で広く使用されている離散潜在変数モデルの特別なファミリーです。この論文では、潜在属性の構成が高次元になりうるSLAMから重要な属性パターンを学習する問題について考察します。理論的な識別可能性の問題に取り組み、属性パターンの選択に対するペナルティ付き尤度法を提案し、さらに、潜在パターンの数が発散するこのようなオーバーフィットSLAMにおける選択の一致性を確立します。提案された方法論の優れたパフォーマンスは、シミュレーション研究と教育評価における2つの実際のデータセットによって示されています。

Sharp Restricted Isometry Bounds for the Inexistence of Spurious Local Minima in Nonconvex Matrix Recovery
非凸行列回復におけるスプリアス局所最小値の非存在に対するシャープ制限アイソメトリ境界

Nonconvex matrix recovery is known to contain no spurious local minima under a restricted isometry property (RIP) with a sufficiently small RIP constant $\delta$. If $\delta$ is too large, however, then counterexamples containing spurious local minima are known to exist. In this paper, we introduce a proof technique that is capable of establishing sharp thresholds on $\delta$ to guarantee the inexistence of spurious local minima. Using the technique, we prove that in the case of a rank-1 ground truth, an RIP constant of $\delta<1/2$ is both necessary and sufficient for exact recovery from any arbitrary initial point (such as a random point). We also prove a local recovery result: given an initial point $x_{0}$ satisfying $f(x_{0})\le(1-\delta)^{2}f(0)$, any descent algorithm that converges to second-order optimality guarantees exact recovery.

非凸行列の回復は、RIP定数$\delta$が十分に小さい制限付きアイソメトリ特性(RIP)の下では、スプリアス局所最小値を含まないことが知られています。ただし、$\delta$が大きすぎると、偽の局所的最小値を含む反例が存在することが知られています。この論文では、スプリアス局所極小値が存在しないことを保証するために、$\delta$に鋭い閾値を設定できる証明手法を紹介します。この手法を使用して、ランク1のグラウンドトゥルースの場合、任意の初期点(ランダムポイントなど)からの正確な回復には$\delta<1/2$のRIP定数が必要かつ十分であることを証明します。また、局所的な回復結果も証明します:初期点$x_{0}$が$f(x_{0})\le(1-\delta)^{2}f(0)$を満たす場合、2次最適性に収束する任意の降下アルゴリズムは正確な回復を保証します。

Distributed Inference for Linear Support Vector Machine
線形サポートベクトルマシンの分散推論

The growing size of modern data brings many new challenges to existing statistical inference methodologies and theories, and calls for the development of distributed inferential approaches. This paper studies distributed inference for linear support vector machine (SVM) for the binary classification task. Despite a vast literature on SVM, much less is known about the inferential properties of SVM, especially in a distributed setting. In this paper, we propose a multi-round distributed linear-type (MDL) estimator for conducting inference for linear SVM. The proposed estimator is computationally efficient. In particular, it only requires an initial SVM estimator and then successively refines the estimator by solving a simple weighted least squares problem. Theoretically, we establish the Bahadur representation of the estimator. Based on the representation, the asymptotic normality is further derived, which shows that the MDL estimator achieves the optimal statistical efficiency, i.e., the same efficiency as the classical linear SVM applied to the entire data set in a single machine setup. Moreover, our asymptotic result avoids the condition on the number of machines or data batches, which is commonly assumed in distributed estimation literature, and allows the case of diverging dimension. We provide simulation studies to demonstrate the performance of the proposed MDL estimator.

現代のデータの増大は、既存の統計的推論方法論や理論に多くの新たな課題をもたらし、分散推論アプローチの開発を求めています。この論文では、バイナリ分類タスク用の線形サポートベクターマシン(SVM)の分散推論について検討します。SVMに関する膨大な文献があるにもかかわらず、特に分散設定におけるSVMの推論特性についてはほとんど知られていません。この論文では、線形SVMの推論を実行するためのマルチラウンド分散線形型(MDL)推定器を提案します。提案された推定器は計算効率に優れています。特に、初期のSVM推定器のみを必要とし、その後、単純な重み付き最小二乗問題を解くことで推定器を継続的に改良します。理論的には、推定器のBahadur表現を確立します。この表現に基づいて、漸近正規性がさらに導出され、MDL推定器が最適な統計効率、つまり単一のマシン設定でデータセット全体に適用された従来の線形SVMと同じ効率を達成することが示されます。さらに、私たちの漸近結果は、分散推定の文献で一般的に想定されているマシン数またはデータバッチ数に関する条件を回避し、次元が発散するケースを許可します。提案されたMDL推定器のパフォーマンスを実証するためのシミュレーション研究を提供します。

Measuring the Effects of Data Parallelism on Neural Network Training
ニューラルネットワークの学習に対するデータ並列処理の影響の測定

Recent hardware developments have dramatically increased the scale of data parallelism available for neural network training. Among the simplest ways to harness next-generation hardware is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured by the number of steps necessary to reach a goal out-of-sample error. We study how this relationship varies with the training algorithm, model, and data set, and find extremely large variation between workloads. Along the way, we show that disagreements in the literature on how batch size affects model quality can largely be explained by differences in metaparameter tuning and compute budgets at different batch sizes. We find no evidence that larger batch sizes degrade out-of-sample performance. Finally, we discuss the implications of our results on efforts to train neural networks much faster in the future. Our experimental data is publicly available as a database of 71,638,836 loss measurements taken over the course of training for 168,160 individual models across 35 workloads.

最近のハードウェア開発により、ニューラルネットワークのトレーニングに利用できるデータ並列処理の規模が劇的に拡大しました。次世代ハードウェアを活用する最も簡単な方法の1つは、標準のミニバッチニューラルネットワークトレーニングアルゴリズムでバッチサイズを増やすことです。この研究では、バッチサイズを増やすことによるトレーニング時間への影響を実験的に特徴付けることを目的としています。トレーニング時間は、目標のサンプル外エラーに到達するために必要なステップ数で測定されます。この関係がトレーニングアルゴリズム、モデル、データセットによってどのように変化するかを調べ、ワークロード間で非常に大きなばらつきがあることがわかりました。その過程で、バッチサイズがモデルの品質にどのように影響するかに関する文献の意見の相違は、主に、異なるバッチサイズでのメタパラメータの調整と計算予算の違いによって説明できることを示します。バッチサイズが大きいほどサンプル外パフォーマンスが低下するという証拠は見つかりませんでした。最後に、将来ニューラルネットワークをより高速にトレーニングするための取り組みに対する結果の意味について説明します。私たちの実験データは、35のワークロードにわたる168,160個の個別モデルのトレーニング中に取得された71,638,836件の損失測定値のデータベースとして公開されています。

An Efficient Two Step Algorithm for High Dimensional Change Point Regression Models Without Grid Search
グリッド探索を使用しない高次元変化点回帰モデルのための効率的な2ステップアルゴリズム

We propose a two step algorithm based on $\ell_1/\ell_0$ regularization for the detection and estimation of parameters of a high dimensional change point regression model and provide the corresponding rates of convergence for the change point as well as the regression parameter estimates. Importantly, the computational cost of our estimator is only $2\cdotp$Lasso$(n,p)$, where Lasso$(n,p)$ represents the computational burden of one Lasso optimization in a model of size $(n,p)$. In comparison, existing grid search based approaches to this problem require a computational cost of at least $n\cdotp$Lasso$(n,p)$. Additionally, the proposed method is shown to be able to consistently detect the case of ‘no change’, i.e., where no finite change point exists in the model. We allow the true change point parameter $\tau_0$ to possibly move to the boundaries of its parametric space, and the jump size $\|b_0-g_0\|_2$ to possibly diverge as $n$ increases. We then characterize the corresponding effects on the rates of convergence of the change point and regression estimates. In particular, we show that, while an increasing jump size may have a beneficial effect on the change point estimate, the optimal rate of the regression parameter estimates is preserved only up to a certain rate of increase in the jump size. This behavior in the rate of regression parameter estimates is unique to high dimensional change point regression models. Simulations are performed to empirically evaluate the performance of the proposed estimators. The methodology is applied to community level socio-economic data of the U.S., collected from the 1990 U.S. census and other sources.

私たちは、高次元の変化点回帰モデルのパラメータの検出と推定のための$\ell_1/\ell_0$正則化に基づく2段階アルゴリズムを提案し、変化点と回帰パラメータ推定値の対応する収束率を示します。重要なのは、推定器の計算コストがわずか$2\cdotp$Lasso$(n,p)$であることです。ここで、Lasso$(n,p)$は、サイズ$(n,p)$のモデルにおける1つのLasso最適化の計算負荷を表します。これと比較して、この問題に対する既存のグリッド検索ベースのアプローチでは、少なくとも$n\cdot Lasso(n,p)$回の最適化の計算コストが必要です。さらに、提案された方法は、モデルに有限の変化点が存在しない「変化なし」のケースを一貫して検出できることが示されています。真の変化点パラメータ$\tau_0$がパラメトリック空間の境界まで移動すること、およびジャンプサイズ$\|b_0-g_0\|_2$が$n$の増加に伴って発散することを許容します。次に、変化点推定値と回帰推定値の収束率に対する対応する影響を特徴付けます。特に、ジャンプサイズの増加は変化点推定値に有益な影響を与える可能性がありますが、回帰パラメータ推定値の最適な速度は、ジャンプサイズの増加率が一定値までしか維持されないことを示します。回帰パラメータ推定値のこの動作は、高次元の変化点回帰モデルにのみ固有のものです。提案された推定値のパフォーマンスを経験的に評価するために、シミュレーションが実行されます。この方法論は、1990年の米国国勢調査およびその他のソースから収集された米国のコミュニティレベルの社会経済データに適用されます。

A Representer Theorem for Deep Neural Networks
深層ニューラルネットワークのための表現定理

We propose to optimize the activation functions of a deep neural network by adding a corresponding functional regularization to the cost function. We justify the use of a second-order total-variation criterion. This allows us to derive a general representer theorem for deep neural networks that makes a direct connection with splines and sparsity. Specifically, we show that the optimal network configuration can be achieved with activation functions that are nonuniform linear splines with adaptive knots. The bottom line is that the action of each neuron is encoded by a spline whose parameters (including the number of knots) are optimized during the training procedure. The scheme results in a computational structure that is compatible with existing deep-ReLU, parametric ReLU, APL (adaptive piecewise-linear) and MaxOut architectures. It also suggests novel optimization challenges and makes an explicit link with $\ell_1$ minimization and sparsity-promoting techniques.

私たちは、コスト関数に対応する関数正則化を追加することで、深層ニューラルネットワークの活性化関数を最適化することを提案します。私たちは、2階の全変動基準の使用を正当化します。これにより、スプラインおよびスパース性と直接結び付く、深層ニューラルネットワークの一般的な表現定理を導き出すことができます。具体的には、適応ノットを持つ不均一な線形スプラインである活性化関数を使用して、最適なネットワーク構成を実現できることを示します。要するに、各ニューロンの作用は、学習手順中にパラメータ(ノット数を含む)が最適化されるスプラインによってコード化されるということです。このスキームにより、既存のdeep-ReLU、パラメトリックReLU、APL(適応区分線形)、およびMaxOutアーキテクチャと互換性のある計算構造が得られます。また、新しい最適化の課題を示唆し、$\ell_1$最小化およびスパース性促進手法との明示的なつながりを与えます。
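
The abstract above describes activations that are linear splines with adaptive knots, regularized by a second-order total variation. A minimal Python sketch of that idea follows. The parameterization $\sigma(x) = b_1 + b_2 x + \sum_k a_k (x - \tau_k)_+$ and the function names are assumptions made for illustration, not the authors' code. Under this parameterization the second-order total variation reduces to the $\ell_1$ norm of the slope jumps $a_k$, which is where the link to sparsity comes from.

```python
from typing import Sequence


def spline_activation(x: float, b1: float, b2: float,
                      knots: Sequence[float], jumps: Sequence[float]) -> float:
    """Evaluate sigma(x) = b1 + b2*x + sum_k a_k * max(x - tau_k, 0).

    knots holds the knot locations tau_k, jumps holds the slope jumps a_k
    (both names are hypothetical, chosen for this sketch).
    """
    return b1 + b2 * x + sum(a * max(x - tau, 0.0) for tau, a in zip(knots, jumps))


def second_order_tv(jumps: Sequence[float]) -> float:
    """For a linear spline, the second-order total variation is sum_k |a_k|."""
    return sum(abs(a) for a in jumps)


def relu(x: float) -> float:
    """ReLU as a one-knot spline: knot at 0 with unit slope jump."""
    return spline_activation(x, 0.0, 0.0, [0.0], [1.0])


def leaky_relu(x: float) -> float:
    """Slope 0.1 on the negative side: linear term 0.1*x plus a jump of 0.9 at 0."""
    return spline_activation(x, 0.0, 0.1, [0.0], [0.9])
```

In this sketch a penalty on `second_order_tv` pushes slope jumps to zero, so fewer knots survive training; ReLU and leaky ReLU appear as one-knot special cases.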

Learning Unfaithful K-separable Gaussian Graphical Models
不忠実な K-分離可能なガウスグラフィカルモデルの学習

The global Markov property for Gaussian graphical models ensures graph separation implies conditional independence. Specifically if a node set $S$ graph separates nodes $u$ and $v$ then $X_u$ is conditionally independent of $X_v$ given $X_S$. The opposite direction need not be true, that is, $X_u \perp X_v \mid X_S$ need not imply $S$ is a node separator of $u$ and $v$. When it does, the relation $X_u \perp X_v \mid X_S$ is called faithful. In this paper we provide a characterization of faithful relations and then provide an algorithm to test faithfulness based only on knowledge of other conditional relations of the form $X_i \perp X_j \mid X_S$. We study two classes of separable Gaussian graphical models, namely, weakly $K$-separable and strongly $K$-separable Gaussian graphical models. Using the above test for faithfulness, we introduce algorithms to learn the topologies of weakly $K$-separable and strongly $K$-separable Gaussian graphical models with $\Omega(K\log p)$ sample complexity. For strongly $K$-separable Gaussian graphical models, we additionally provide a method with error bounds for learning the off-diagonal precision matrix entries.

ガウスグラフィカルモデルのグローバルマルコフ性により、グラフの分離は条件付き独立性を意味します。具体的には、ノード集合$S$がグラフ上でノード$u$と$v$を分離する場合、$X_S$が与えられれば$X_u$は$X_v$と条件付き独立です。逆方向は必ずしも成り立たず、つまり、$X_u \perp X_v \mid X_S$は$S$が$u$と$v$のノードセパレータであることを意味するとは限りません。意味する場合、関係$X_u \perp X_v \mid X_S$は忠実であると呼ばれます。この論文では、忠実な関係の特徴付けを与え、次に$X_i \perp X_j \mid X_S$の形式の他の条件付き関係の知識のみに基づいて忠実性をテストするアルゴリズムを示します。分離可能なガウスグラフィカルモデルの2つのクラス、つまり弱$K$分離可能なガウスグラフィカルモデルと強$K$分離可能なガウスグラフィカルモデルについて検討します。上記の忠実性テストを使用して、弱$K$分離可能および強$K$分離可能なガウスグラフィカルモデルのトポロジーを、サンプル複雑度$\Omega(K\log p)$で学習するアルゴリズムを紹介します。強$K$分離可能なガウスグラフィカルモデルの場合、精度行列の非対角エントリを学習するための誤差境界付きの方法も追加で提供します。

Maximum Likelihood for Gaussian Process Classification and Generalized Linear Mixed Models under Case-Control Sampling
ケースコントロールサンプリング下におけるガウス過程分類と一般化線形混合モデルの最尤法

Modern data sets in various domains often include units that were sampled non-randomly from the population and have a latent correlation structure. Here we investigate a common form of this setting, where every unit is associated with a latent variable, all latent variables are correlated, and the probability of sampling a unit depends on its response. Such settings often arise in case-control studies, where the sampled units are correlated due to spatial proximity, family relations, or other sources of relatedness. Maximum likelihood estimation in such settings is challenging from both a computational and statistical perspective, necessitating approximations that take the sampling scheme into account. We propose a family of approximate likelihood approaches which combine composite likelihood and expectation propagation. We demonstrate the efficacy of our solutions via extensive simulations. We utilize them to investigate the genetic architecture of several complex disorders collected in case-control genetic association studies, where hundreds of thousands of genetic variants are measured for every individual, and the underlying disease liabilities of individuals are correlated due to genetic similarity. Our work is the first to provide a tractable likelihood-based solution for case-control data with complex dependency structures.

さまざまなドメインの最新のデータセットには、集団から非ランダムに抽出されたユニットが含まれ、潜在的な相関構造を持つことがよくあります。ここでは、すべてのユニットが潜在変数に関連付けられ、すべての潜在変数が相関し、ユニットをサンプリングする確率がその応答に依存するという、この設定の一般的な形式を調査します。このような設定は、空間的な近接性、家族関係、またはその他の関連性のソースにより、サンプリングされたユニットが相関しているケースコントロール研究でよく発生します。このような設定での最尤推定は、計算と統計の両方の観点から困難であり、サンプリングスキームを考慮した近似が必要です。私たちは、複合尤度と期待値伝播を組み合わせた近似尤度アプローチのファミリーを提案します。私たちは、広範なシミュレーションによってソリューションの有効性を実証します。私たちは、ケースコントロール遺伝関連研究で収集されたいくつかの複雑な疾患の遺伝的構造を調査するためにそれらを使用します。ケースコントロール遺伝関連研究では、すべての個人について数十万の遺伝子変異が測定され、個人の潜在的な疾患易罹患性は遺伝的類似性により相関しています。私たちの研究は、複雑な依存関係構造を持つケースコントロールデータに対して、扱いやすい尤度ベースのソリューションを提供する初めての研究です。

Scalable Interpretable Multi-Response Regression via SEED
SEEDによるスケーラブルで解釈可能な多重応答回帰

Sparse reduced-rank regression is an important tool for uncovering meaningful dependence structure between large numbers of predictors and responses in many big data applications such as genome-wide association studies and social media analysis. Despite the recent theoretical and algorithmic advances, scalable estimation of sparse reduced-rank regression remains largely unexplored. In this paper, we suggest a scalable procedure called sequential estimation with eigen-decomposition (SEED) which needs only a single top-$r$ sparse singular value decomposition from a generalized eigenvalue problem to find the optimal low-rank and sparse matrix estimate. Our suggested method is not only scalable but also performs simultaneous dimensionality reduction and variable selection. Under some mild regularity conditions, we show that SEED enjoys nice sampling properties including consistency in estimation, rank selection, prediction, and model selection. Moreover, SEED employs only basic matrix operations that can be efficiently parallelized in high performance computing devices. Numerical studies on synthetic and real data sets show that SEED outperforms the state-of-the-art approaches for large-scale matrix estimation problems.

スパース縮退ランク回帰は、ゲノムワイド関連研究やソーシャルメディア分析などの多くのビッグデータアプリケーションで、多数の予測子と応答の間の意味のある依存関係構造を明らかにするための重要なツールです。最近の理論的およびアルゴリズム的進歩にもかかわらず、スパース縮退ランク回帰のスケーラブルな推定はまだほとんど研究されていません。この論文では、一般化固有値問題から単一のトップ$r$スパース特異値分解のみを必要とし、最適な低ランクおよびスパース行列推定値を見つける、スケーラブルな手順である固有分解による逐次推定(SEED)を提案します。提案された方法はスケーラブルであるだけでなく、次元削減と変数選択を同時に実行します。いくつかの軽度の規則性条件下では、SEEDは推定、ランク選択、予測、およびモデル選択の一貫性を含む優れたサンプリング特性を備えていることがわかります。さらに、SEEDは、高性能コンピューティングデバイスで効率的に並列化できる基本的な行列演算のみを使用します。合成データセットと実際のデータセットに関する数値的研究により、SEEDは大規模な行列推定問題に対する最先端のアプローチよりも優れていることが示されています。

Solving the OSCAR and SLOPE Models Using a Semismooth Newton-Based Augmented Lagrangian Method
半平滑ニュートンベースの拡張ラグランジュ法を使用したOSCARモデルとSLOPEモデルの解法

The octagonal shrinkage and clustering algorithm for regression (OSCAR), equipped with the $\ell_1$-norm and a pair-wise $\ell_{\infty}$-norm regularizer, is a useful tool for feature selection and grouping in high-dimensional data analysis. The computational challenge posed by OSCAR, for high dimensional and/or large sample size data, has not yet been well resolved due to the non-smoothness and non-separability of the regularizer involved. In this paper, we successfully resolve this numerical challenge by proposing a sparse semismooth Newton-based augmented Lagrangian method to solve the more general SLOPE (the sorted L-one penalized estimation) model. By appropriately exploiting the inherent sparse and low-rank property of the generalized Jacobian of the semismooth Newton system in the augmented Lagrangian subproblem, we show how the computational complexity can be substantially reduced. Our algorithm offers a notable computational advantage in the high-dimensional statistical regression settings. Numerical experiments are conducted on real data sets, and the results demonstrate that our algorithm is far superior, in both speed and robustness, to the existing state-of-the-art algorithms based on first-order iterative schemes, including the widely used accelerated proximal gradient (APG) method and the alternating direction method of multipliers (ADMM).

$\ell_1$ノルムとペアワイズ$\ell_{\infty}$ノルム正則化子を備えた回帰のための八角形縮小およびクラスタリングアルゴリズム(OSCAR)は、高次元データ分析における特徴選択およびグループ化に便利なツールです。高次元および/またはサンプルサイズの大きいデータに対してOSCARによってもたらされる計算上の課題は、関係する正則化子の非滑らかさと非分離性のため、まだ十分に解決されていません。この論文では、より一般的なSLOPE (ソートされたL-1ペナルティ付き推定)モデルを解くためのスパース半平滑ニュートンベースの拡張ラグランジアン法を提案することで、この数値的課題をうまく解決します。拡張ラグランジアン部分問題における半平滑ニュートンシステムの一般化ヤコビアンの固有のスパースおよび低ランク特性を適切に利用することで、計算の複雑さを大幅に削減できることを示します。私たちのアルゴリズムは、高次元統計回帰設定において顕著な計算上の利点を提供します。実際のデータセットで数値実験が行われ、その結果、私たちのアルゴリズムは、広く使用されている加速近位勾配(APG)法や交互方向乗数法(ADMM)などの1次反復スキームに基づく既存の最先端のアルゴリズムよりも、速度と堅牢性の両面ではるかに優れていることが実証されました。
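
To make the relation between the two penalties in the abstract above concrete, here is a small Python sketch. It assumes the usual definitions: the SLOPE penalty is $\sum_i \lambda_i |x|_{(i)}$ with magnitudes sorted in decreasing order, and the OSCAR penalty is $w_1\|x\|_1 + w_2\sum_{i<j}\max(|x_i|,|x_j|)$. The function names and the weights `w1`, `w2` are hypothetical. The sketch covers only the regularizers, not the semismooth Newton augmented Lagrangian solver the paper proposes.

```python
from itertools import combinations
from typing import List, Sequence


def slope_penalty(x: Sequence[float], lambdas: Sequence[float]) -> float:
    """Sorted-L1 penalty: sum_i lambda_i * |x|_(i), magnitudes in decreasing order.

    lambdas is assumed to be non-increasing and non-negative.
    """
    magnitudes = sorted((abs(v) for v in x), reverse=True)
    return sum(lam * m for lam, m in zip(lambdas, magnitudes))


def oscar_penalty(x: Sequence[float], w1: float, w2: float) -> float:
    """OSCAR penalty: w1 * ||x||_1 + w2 * sum over pairs of max(|x_i|, |x_j|)."""
    l1 = sum(abs(v) for v in x)
    pairwise = sum(max(abs(a), abs(b)) for a, b in combinations(x, 2))
    return w1 * l1 + w2 * pairwise


def oscar_as_slope_weights(p: int, w1: float, w2: float) -> List[float]:
    """The i-th largest magnitude is the max in (p - i) pairs, so lambda_i = w1 + w2*(p - i)."""
    return [w1 + w2 * (p - i) for i in range(1, p + 1)]
```

For example, with `x = [3.0, -1.0, 2.0]`, `w1 = 0.5` and `w2 = 0.25`, both `oscar_penalty(x, w1, w2)` and `slope_penalty(x, oscar_as_slope_weights(3, w1, w2))` evaluate to 5.0, which illustrates why a solver for SLOPE also covers OSCAR.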

Optimal Transport: Fast Probabilistic Approximation with Exact Solvers
最適輸送: 厳密ソルバーによる高速な確率的近似

We propose a simple subsampling scheme for fast randomized approximate computation of optimal transport distances on finite spaces. This scheme operates on a random subset of the full data and can use any exact algorithm as a black-box back-end, including state-of-the-art solvers and entropically penalized versions. It is based on averaging the exact distances between empirical measures generated from independent samples from the original measures and can easily be tuned towards higher accuracy or shorter computation times. To this end, we give non-asymptotic deviation bounds for its accuracy in the case of discrete optimal transport problems. In particular, we show that in many important instances, including images (2D-histograms), the approximation error is independent of the size of the full problem. We present numerical experiments that demonstrate that a very good approximation in typical applications can be obtained in a computation time that is several orders of magnitude smaller than what is required for exact computation of the full problem.

私たちは、有限空間上の最適輸送距離の高速なランダム化近似計算のための単純なサブサンプリング方式を提案します。この方式は、全データのランダムなサブセットに対して動作し、最先端のソルバーやエントロピーペナルティ付きのバージョンを含む任意の厳密アルゴリズムをブラックボックスのバックエンドとして使用できます。これは、元の測度からの独立したサンプルから生成された経験測度間の厳密な距離を平均化することに基づいており、より高い精度やより短い計算時間に向けて簡単に調整できます。この目的のために、離散最適輸送問題の場合の精度について、非漸近的な偏差限界を与えます。特に、画像（2Dヒストグラム）を含む多くの重要なインスタンスにおいて、近似誤差は問題全体のサイズに依存しないことを示します。私たちは、典型的なアプリケーションで非常に良好な近似を、問題全体の厳密な計算に必要な時間よりも数桁短い計算時間で得られることを示す数値実験を提示します。
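
The subsampling scheme described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the authors' implementation: the measures live on points of the real line, the "exact solver" is the closed-form 1D Wasserstein-1 distance between two equal-size empirical measures (sort both samples and average the absolute differences), and the names `sample_size` and `repeats` are hypothetical.

```python
import random
from typing import Sequence


def w1_empirical_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Exact Wasserstein-1 distance between two equal-size empirical measures on the line."""
    if len(xs) != len(ys):
        raise ValueError("samples must have equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def subsampled_w1(points: Sequence[float], mu: Sequence[float], nu: Sequence[float],
                  sample_size: int, repeats: int, rng: random.Random) -> float:
    """Average the exact distance between empirical measures over independent subsamples.

    mu and nu are weight vectors over the shared support `points`.
    """
    total = 0.0
    for _ in range(repeats):
        xs = rng.choices(points, weights=mu, k=sample_size)  # i.i.d. draws from mu
        ys = rng.choices(points, weights=nu, k=sample_size)  # i.i.d. draws from nu
        total += w1_empirical_1d(xs, ys)
    return total / repeats
```

Raising `sample_size` or `repeats` trades computation time for accuracy. The cost of each exact solve depends on `sample_size` rather than on the size of the full support, which is the source of the speed-up.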

Complete Search for Feature Selection in Decision Trees
決定木における特徴選択のための完全探索

The search space for the feature selection problem in decision tree learning is the lattice of subsets of the available features. We design an exact enumeration procedure of the subsets of features that lead to all and only the distinct decision trees built by a greedy top-down decision tree induction algorithm. The procedure stores, in the worst case, a number of trees linear in the number of features. By exploiting a further pruning of the search space, we design a complete procedure for finding $\delta$-acceptable feature subsets, which depart by at most $\delta$ from the best estimated error over any feature subset. Feature subsets with the best estimated error are called best feature subsets. Our results apply to any error estimator function, but experiments are mainly conducted under the wrapper model, in which the misclassification error over a search set is used as an estimator. The approach is also adapted to the design of a computational optimization of the sequential backward elimination heuristic, extending its applicability to large dimensional datasets. The procedures of this paper are implemented in a multi-core data parallel C++ system. We investigate experimentally the properties and limitations of the procedures on a collection of 20 benchmark datasets, showing that oversearching increases both overfitting and instability.

決定木学習における特徴選択問題の検索空間は、利用可能な特徴のサブセットの格子です。私たちは、貪欲なトップダウン決定木誘導アルゴリズムによって構築されたすべての異なる決定木のみにつながる特徴のサブセットの正確な列挙手順を設計します。この手順は、最悪の場合、特徴の数に線形な数の木を格納します。検索空間のさらなる刈り込みを利用することで、任意の特徴サブセットで最良の推定誤差から最大で$\delta$だけ離れる$\delta$許容可能な特徴サブセットを見つけるための完全な手順を設計します。最良の推定誤差を持つ特徴サブセットは、最良の特徴サブセットと呼ばれます。我々の結果は任意の誤差推定関数に適用されますが、実験は主にラッパーモデルで行われ、検索セットの誤分類誤差が推定値として使用されます。このアプローチは、シーケンシャルバックワード消去ヒューリスティックの計算最適化の設計にも適応され、大規模次元データセットへの適用範囲が拡張されます。この論文の手順は、マルチコアデータ並列C++システムに実装されています。20個のベンチマークデータセットのコレクションで手順の特性と制限を実験的に調査し、過剰検索によって過剰適合と不安定性の両方が増加することを示しています。

Regularization via Mass Transportation
質量輸送による正則化

The goal of regression and classification methods in supervised learning is to minimize the empirical risk, that is, the expectation of some loss function quantifying the prediction error under the empirical distribution. When facing scarce training data, overfitting is typically mitigated by adding regularization terms to the objective that penalize hypothesis complexity. In this paper we introduce new regularization techniques using ideas from distributionally robust optimization, and we give new probabilistic interpretations to existing techniques. Specifically, we propose to minimize the worst-case expected loss, where the worst case is taken over the ball of all (continuous or discrete) distributions that have a bounded transportation distance from the (discrete) empirical distribution. By choosing the radius of this ball judiciously, we can guarantee that the worst-case expected loss provides an upper confidence bound on the loss on test data, thus offering new generalization bounds. We prove that the resulting regularized learning problems are tractable and can be tractably kernelized for many popular loss functions. The proposed approach to regularization is also extended to neural networks. We validate our theoretical out-of-sample guarantees through simulated and empirical experiments.

教師あり学習における回帰法と分類法の目標は、経験的リスク、つまり経験的分布の下で予測誤差を定量化する損失関数の期待値を最小化することです。トレーニングデータが不足している場合、通常、仮説の複雑さをペナルティする正則化項を目的関数に追加することで、過剰適合が緩和されます。この論文では、分布的にロバストな最適化のアイデアを使用して新しい正則化手法を紹介し、既存の手法に新しい確率的解釈を与えます。具体的には、最悪のケースの期待損失を最小化することを提案します。この場合、最悪のケースは、(離散)経験的分布からの輸送距離が有界であるすべての(連続または離散)分布の球にわたって取られます。この球の半径を慎重に選択することで、最悪のケースの期待損失がテストデータでの損失の信頼上限を与え、新しい汎化境界を提供することを保証できます。結果として得られる正則化学習問題は扱いやすく、多くの一般的な損失関数に対して扱いやすくカーネル化できることを証明します。提案された正則化のアプローチは、ニューラルネットワークにも拡張されます。シミュレーションと実データによる実験を通じて、理論的なサンプル外保証を検証します。

Non-Convex Matrix Completion and Related Problems via Strong Duality
強双対性による非凸行列補完と関連問題

This work studies the strong duality of non-convex matrix factorization problems: we show that under certain dual conditions, these problems and the dual have the same optimum. This has been well understood for convex optimization, but little was known for non-convex problems. We propose a novel analytical framework and prove that under certain dual conditions, the optimal solution of the matrix factorization program is the same as that of its bi-dual and thus the global optimality of the non-convex program can be achieved by solving its bi-dual which is convex. These dual conditions are satisfied by a wide class of matrix factorization problems, although matrix factorization is hard to solve in full generality. This analytical framework may be of independent interest to non-convex optimization more broadly. We apply our framework to two prototypical matrix factorization problems: matrix completion and robust Principal Component Analysis. These are examples of efficiently recovering a hidden matrix given limited reliable observations. Our framework shows that exact recoverability and strong duality hold with nearly-optimal sample complexity for the two problems.

この研究では、非凸行列分解問題の強双対性について研究しています。特定の双対条件下では、これらの問題と双対問題には同じ最適値があることを示しています。これは凸最適化ではよく理解されていますが、非凸問題ではほとんど知られていませんでした。私たちは新しい分析フレームワークを提案し、特定の双対条件下では、行列分解プログラムの最適解はその双双対(bi-dual)問題の最適解と同じであり、したがって非凸プログラムのグローバル最適性は凸である双双対問題を解くことによって達成できることを証明します。これらの双対条件は、幅広いクラスの行列分解問題によって満たされますが、行列分解を完全に一般的な形で解くことは困難です。この分析フレームワークは、より広い意味での非凸最適化にとっても独立した関心の対象となる可能性があります。私たちは、このフレームワークを、行列補完とロバスト主成分分析という2つの典型的な行列分解問題に適用します。これらは、限られた信頼できる観測値がある場合に、隠れた行列を効率的に復元する例です。私たちのフレームワークは、2つの問題に対して、正確な復元可能性と強双対性がほぼ最適なサンプル複雑度で成り立つことを示しています。

Low Permutation-rank Matrices: Structural Properties and Noisy Completion
低順列ランク行列:構造特性とノイズ補完

We consider the problem of noisy matrix completion, in which the goal is to reconstruct a structured matrix whose entries are partially observed in noise. Standard approaches to this underdetermined inverse problem are based on assuming that the underlying matrix has low rank, or is well-approximated by a low rank matrix. In this paper, we propose a richer model based on what we term the permutation-rank of a matrix. We first describe how the classical non-negative rank model enforces restrictions that may be undesirable in practice, and how these restrictions can be avoided by using the richer permutation-rank model. Second, we establish the minimax rates of estimation under the new permutation-based model, and prove that surprisingly, the minimax rates are equivalent up to logarithmic factors to those for estimation under the typical low rank model. Third, we analyze a computationally efficient singular-value-thresholding algorithm, known to be optimal for the low-rank setting, and show that it also simultaneously yields a consistent estimator for the low-permutation rank setting. Finally, we present various structural results characterizing the uniqueness of the permutation-rank decomposition, and characterizing convex approximations of the permutation-rank polytope.

私たちは、ノイズのある行列補完の問題を考察します。この問題の目的は、エントリがノイズを伴って部分的に観測される構造化行列を再構築することです。この劣決定の逆問題に対する標準的なアプローチは、基礎となる行列のランクが低いか、または低ランク行列によって十分に近似されているという仮定に基づいています。この論文では、行列の順列ランクと呼ばれるものに基づく、より豊富なモデルを提案します。まず、古典的な非負ランクモデルが、実際には望ましくない可能性のある制限をどのように強制するか、そしてより豊富な順列ランクモデルを使用することでこれらの制限をどのように回避できるかについて説明します。次に、新しい順列ベースのモデルでの推定のミニマックス率を確立し、驚くべきことに、ミニマックス率は、対数因子を除いて、典型的な低ランクモデルでの推定のものと同等であることを証明します。第三に、低ランク設定に最適であることが知られている、計算効率の高い特異値しきい値アルゴリズムを分析し、同時に、低順列ランク設定の一致推定量も与えることを示します。最後に、順列ランク分解の一意性を特徴付けるさまざまな構造的結果と、順列ランク多面体の凸近似を特徴付けるさまざまな構造的結果を示します。

Hamiltonian Monte Carlo with Energy Conserving Subsampling
エネルギー保存サブサンプリングを用いたハミルトニアン・モンテカルロ

Hamiltonian Monte Carlo (HMC) samples efficiently from high-dimensional posterior distributions with proposed parameter draws obtained by iterating on a discretized version of the Hamiltonian dynamics. The iterations make HMC computationally costly, especially in problems with large data sets, since it is necessary to compute posterior densities and their derivatives with respect to the parameters. Naively computing the Hamiltonian dynamics on a subset of the data causes HMC to lose its key ability to generate distant parameter proposals with high acceptance probability. The key insight in our article is that efficient subsampling HMC for the parameters is possible if both the dynamics and the acceptance probability are computed from the same data subsample in each complete HMC iteration. We show that this is possible to do in a principled way in a HMC-within-Gibbs framework where the subsample is updated using a pseudo marginal MH step and the parameters are then updated using an HMC step, based on the current subsample. We show that our subsampling methods are fast and compare favorably to two popular sampling algorithms that use gradient estimates from data subsampling. We also explore the current limitations of subsampling HMC algorithms by varying the quality of the variance reducing control variates used in the estimators of the posterior density and its gradients.

ハミルトンモンテカルロ(HMC)は、離散化されたハミルトンダイナミクスを反復処理して得られた提案パラメータドローを使用して、高次元事後分布から効率的にサンプリングします。反復処理により、特に大規模なデータセットの問題では、事後密度とパラメータに関するその導関数を計算する必要があるため、HMCの計算コストが高くなります。データのサブセットでハミルトンダイナミクスを単純に計算すると、HMCは、高い受け入れ確率で離れたパラメータ提案を生成するという重要な能力を失います。この記事の重要な洞察は、ダイナミクスと受け入れ確率の両方が、各完全なHMC反復で同じデータサブサンプルから計算される場合、パラメータのHMCの効率的なサブサンプリングが可能になるということです。これは、サブサンプルが疑似限界MHステップを使用して更新され、次に現在のサブサンプルに基づいてHMCステップを使用してパラメータが更新される、ギブス内のHMCフレームワークで原理的に実行できることを示します。私たちのサブサンプリング法は高速であり、データサブサンプリングからの勾配推定値を使用する2つの一般的なサンプリングアルゴリズムと比較して優れていることを示します。また、事後密度とその勾配の推定値で使用される分散低減制御変量の品質を変えることで、サブサンプリングHMCアルゴリズムの現在の制限についても調査します。

Change Surfaces for Expressive Multidimensional Changepoints and Counterfactual Prediction
表現力豊かな多次元変化点と反事実予測のための変化面

Identifying changes in model parameters is fundamental in machine learning and statistics. However, standard changepoint models are limited in expressiveness, often addressing unidimensional problems and assuming instantaneous changes. We introduce change surfaces as a multidimensional and highly expressive generalization of changepoints. We provide a model-agnostic formalization of change surfaces, illustrating how they can provide variable, heterogeneous, and non-monotonic rates of change across multiple dimensions. Additionally, we show how change surfaces can be used for counterfactual prediction. As a concrete instantiation of the change surface framework, we develop Gaussian Process Change Surfaces (GPCS). We demonstrate counterfactual prediction with Bayesian posterior mean and credible sets, as well as massive scalability by introducing novel methods for additive non-separable kernels. Using two large spatio-temporal datasets we employ GPCS to discover and characterize complex changes that can provide scientific and policy relevant insights. Specifically, we analyze twentieth century measles incidence across the United States and discover previously unknown heterogeneous changes after the introduction of the measles vaccine. Additionally, we apply the model to requests for lead testing kits in New York City, discovering distinct spatial and demographic patterns.

モデルパラメータの変化を識別することは、機械学習と統計の基本です。しかし、標準的な変化点モデルは表現力が限られており、多くの場合、一次元の問題に対処し、瞬間的な変化を前提としています。私たちは、変化点の多次元で表現力の高い一般化として変化面を導入します。私たちは、変化面のモデルに依存しない形式化を提供し、複数の次元にわたって可変で異質で非単調な変化率を提供する方法を示します。さらに、変化面を反事実予測に使用する方法を示します。変化面フレームワークの具体的なインスタンス化として、ガウス過程変化面(GPCS)を開発します。ベイズ事後平均と信頼できるセットによる反事実予測、および加法非分離カーネルの新しい方法を導入することで大規模なスケーラビリティを実証します。2つの大規模な時空間データセットを使用して、GPCSを採用し、科学的および政策関連の洞察を提供できる複雑な変化を発見して特徴付けます。具体的には、20世紀の米国全土における麻疹の発生率を分析し、麻疹ワクチンの導入後にこれまで知られていなかった異質な変化を発見しました。さらに、このモデルをニューヨーク市での鉛検査キットの要求に適用し、明確な空間的および人口統計的パターンを発見しました。

Adaptive Geometric Multiscale Approximations for Intrinsically Low-dimensional Data
本質的に低次元データのための適応幾何学的マルチスケール近似

We consider the problem of efficiently approximating and encoding high-dimensional data sampled from a probability distribution $\rho$ in $\mathbb{R}^D$, that is nearly supported on a $d$-dimensional set $\mathcal{M}$ – for example supported on a $d$-dimensional manifold. Geometric Multi-Resolution Analysis (GMRA) provides a robust and computationally efficient procedure to construct low-dimensional geometric approximations of $\mathcal{M}$ at varying resolutions. We introduce GMRA approximations that adapt to the unknown regularity of $\mathcal{M}$, by introducing a thresholding algorithm on the geometric wavelet coefficients. We show that these data-driven, empirical geometric approximations perform well, when the threshold is chosen as a suitable universal function of the number of samples $n$, on a large class of measures $\rho$, that are allowed to exhibit different regularity at different scales and locations, thereby efficiently encoding data from more complex measures than those supported on manifolds. These GMRA approximations are associated to a dictionary, together with a fast transform mapping data to $d$-dimensional coefficients, and an inverse of such a map, all of which are data-driven. The algorithms for both the dictionary construction and the transforms have complexity $C D n \log n$ with the constant $C$ exponential in $d$. Our work therefore establishes Adaptive GMRA as a fast dictionary learning algorithm, with approximation guarantees, for intrinsically low-dimensional data. We include several numerical experiments on both synthetic and real data, confirming our theoretical results and demonstrating the effectiveness of Adaptive GMRA.

私たちは、$d$次元集合$\mathcal{M}$でほぼサポートされている（例えば、$d$次元多様体でサポートされている）$\mathbb{R}^D$内の確率分布$\rho$からサンプリングされた高次元データを効率的に近似してエンコードする問題について検討します。幾何学的マルチ解像度解析（GMRA）は、さまざまな解像度で$\mathcal{M}$の低次元幾何学的近似を構築するための堅牢で計算効率の高い手順を提供します。私たちは、幾何学的ウェーブレット係数にしきい値アルゴリズムを導入することで、$\mathcal{M}$の未知の規則性に適応するGMRA近似を導入します。私たちは、閾値がサンプル数$n$の適切な普遍関数として選択された場合、異なるスケールと場所で異なる規則性を示すことが許される大規模なクラスの測度$\rho$に対して、これらのデータ駆動型の経験的幾何学的近似がうまく機能することを示します。これにより、多様体でサポートされているものよりも複雑な測度からデータを効率的にエンコードできます。これらのGMRA近似は、データを$d$次元係数にマッピングする高速変換、およびそのようなマップの逆とともに辞書に関連付けられており、これらはすべてデータ駆動型です。辞書構築と変換の両方のアルゴリズムは、複雑度が$C D n \log n$で、定数$C$は$d$の指数です。したがって、我々の研究により、本質的に低次元のデータに対して、近似保証付きの高速辞書学習アルゴリズムとしてAdaptive GMRAが確立されます。私たちは、合成データと実際のデータの両方でいくつかの数値実験を含め、理論的結果を確認し、Adaptive GMRAの有効性を実証します。

Relative Error Bound Analysis for Nuclear Norm Regularized Matrix Completion
核ノルム正則化行列補完のための相対誤差限界解析

In this paper, we develop a relative error bound for nuclear norm regularized matrix completion, with the focus on the completion of full-rank matrices. Under the assumption that the top eigenspaces of the target matrix are incoherent, we derive a relative upper bound for recovering the best low-rank approximation of the unknown matrix. Although multiple works have been devoted to analyzing the recovery error of full-rank matrix completion, their error bounds are usually additive, making it impossible to obtain the perfect recovery case and more generally difficult to leverage the skewed distribution of eigenvalues. Our analysis is built upon the optimality condition of the regularized formulation and existing guarantees for low-rank matrix completion. To the best of our knowledge, this is the first relative bound that has been proved for the regularized formulation of matrix completion.

この論文では、フルランク行列の補完に焦点を当てて、核ノルム正則化行列補完に対する相対誤差限界を導出します。ターゲット行列の上位固有空間がインコヒーレントであるという仮定の下で、未知の行列の最適な低ランク近似を回復するための相対的な上限を導き出します。フルランク行列補完の回復誤差の分析には複数の研究が捧げられてきましたが、通常、その誤差限界は加法的であるため、完全な回復のケースを得ることは不可能であり、より一般的には固有値の歪んだ分布を活用することが困難です。私たちの分析は、正則化された定式化の最適性条件と、低ランク行列補完に対する既存の保証に基づいて構築されています。私たちの知る限り、これは行列補完の正則化された定式化で証明された最初の相対的限界です。

PyOD: A Python Toolbox for Scalable Outlier Detection
PyOD: スケーラブルな外れ値検出のための Python ツールボックス

PyOD is an open-source Python toolbox for performing scalable outlier detection on multivariate data. Uniquely, it provides access to a wide range of outlier detection algorithms, including established outlier ensembles and more recent neural network-based approaches, under a single, well-documented API designed for use by both practitioners and researchers. With robustness and scalability in mind, best practices such as unit testing, continuous integration, code coverage, maintainability checks, interactive examples and parallelization are emphasized as core components in the toolbox’s development. PyOD is compatible with both Python 2 and 3 and can be installed through Python Package Index (PyPI) or https://github.com/yzhao062/pyod.

PyODは、多変量データに対してスケーラブルな外れ値検出を実行するためのオープンソースのPythonツールボックスです。独自の点として、確立された外れ値アンサンブルや最新のニューラルネットワークベースのアプローチなど、幅広い外れ値検出アルゴリズムを、実務家と研究者の両方が使用できるように設計された十分に文書化された単一のAPIの下で提供します。堅牢性とスケーラビリティを念頭に置いて、ユニットテスト、継続的インテグレーション、コードカバレッジ、保守性チェック、対話型サンプル、並列化などのベストプラクティスは、ツールボックスの開発のコアコンポーネントとして強調されています。PyODはPython 2と3の両方と互換性があり、Python Package Index(PyPI)またはhttps://github.com/yzhao062/pyodを介してインストールできます。

High-Dimensional Poisson Structural Equation Model Learning via l_1-Regularized Regression
l_1-正則化回帰による高次元ポアソン構造方程式モデルの学習

In this paper, we develop a new approach to learning high-dimensional Poisson structural equation models from only observational data without strong assumptions such as faithfulness and a sparse moralized graph. A key component of our method is to decouple the ordering estimation or parent search where the problems can be efficiently addressed using $\ell_1$-regularized regression and the moments relation. We show that sample size $n = \Omega( d^{2} \log^{9} p)$ is sufficient for our polynomial time Moments Ratio Scoring (MRS) algorithm to recover the true directed graph, where $p$ is the number of nodes and $d$ is the maximum indegree. We verify through simulations that our algorithm is statistically consistent in the high-dimensional $p>n$ setting, and performs well compared to state-of-the-art ODS, GES, and MMHC algorithms. We also demonstrate through multivariate real count data that our MRS algorithm is well-suited to estimating DAG models for multivariate count data in comparison to other methods used for discrete data.

この論文では、忠実性やスパースな道徳的グラフなどの強い仮定なしに、観測データのみから高次元ポアソン構造方程式モデルを学習する新しいアプローチを開発します。この方法の重要な要素は、順序推定または親検索を切り離すことです。この場合、問題は$\ell_1$正則化回帰とモーメント関係を使用して効率的に解決できます。サンプルサイズ$n = \Omega( d^{2} \log^{9} p)$は、多項式時間モーメント比スコアリング(MRS)アルゴリズムが真の有向グラフを復元するのに十分であることを示します。ここで、$p$はノードの数、$d$は最大入次数です。シミュレーションにより、このアルゴリズムが高次元$p>n$設定で統計的に一貫しており、最先端のODS、GES、およびMMHCアルゴリズムと比較して優れたパフォーマンスを発揮することを確認します。また、多変量実数データを通じて、当社のMRSアルゴリズムが、離散データに使用される他の方法と比較して、多変量カウントデータのDAGモデルの推定に適していることも実証します。

Simultaneous Private Learning of Multiple Concepts
複数の概念の同時プライベート学習

We investigate the direct-sum problem in the context of differentially private PAC learning: What is the sample complexity of solving $k$ learning tasks simultaneously under differential privacy, and how does this cost compare to that of solving $k$ learning tasks without privacy? In our setting, an individual example consists of a domain element $x$ labeled by $k$ unknown concepts $(c_1,\ldots,c_k)$. The goal of a multi-learner is to output $k$ hypotheses $(h_1,\ldots,h_k)$ that generalize the input examples. Without concern for privacy, the sample complexity needed to simultaneously learn $k$ concepts is essentially the same as needed for learning a single concept. Under differential privacy, the basic strategy of learning each hypothesis independently yields sample complexity that grows polynomially with $k$. For some concept classes, we give multi-learners that require fewer samples than the basic strategy. Unfortunately, however, we also give lower bounds showing that even for very simple concept classes, the sample cost of private multi-learning must grow polynomially in $k$.

私たちは、差分プライバシーPAC学習のコンテキストにおける直和問題を調査します。差分プライバシー下で$k$個の学習タスクを同時に解く場合のサンプル複雑度はどれくらいですか。また、このコストはプライバシーなしで$k$個の学習タスクを解く場合のコストと比べてどうでしょうか。この設定では、個々の例は、$k$個の未知の概念$(c_1,\ldots,c_k)$でラベル付けされたドメイン要素$x$で構成されます。マルチ学習者の目標は、入力例を一般化する$k$個の仮説$(h_1,\ldots,h_k)$を出力することです。プライバシーを考慮しない場合、$k$個の概念を同時に学習するために必要なサンプル複雑度は、単一の概念を学習する場合と基本的に同じです。差分プライバシー下では、各仮説を個別に学習するという基本戦略により、$k$とともに多項式的に増加するサンプル複雑度が得られます。一部の概念クラスについては、基本戦略よりも少ないサンプルで済むマルチ学習者を提供します。しかし残念なことに、非常に単純な概念クラスの場合でも、プライベートマルチ学習のサンプルコストは$k$の多項式で増加する必要があることを示す下限も示しています。

iNNvestigate Neural Networks!
iNNvestigateニューラルネットワーク!

In recent years, deep neural networks have revolutionized many application domains of machine learning and are key components of many critical decision or predictive processes. Therefore, it is crucial that domain specialists can understand and analyze actions and predictions, even of the most complex neural network architectures. Despite these arguments, neural networks are often treated as black boxes. In the attempt to alleviate this shortcoming many analysis methods were proposed, yet the lack of reference implementations often makes a systematic comparison between the methods a major effort. The presented library innvestigate addresses this by providing a common interface and out-of-the-box implementation for many analysis methods, including the reference implementation for PatternNet and PatternAttribution as well as for LRP-methods. To demonstrate the versatility of innvestigate, we provide an analysis of image classifications for a variety of state-of-the-art neural network architectures.

近年、ディープニューラルネットワークは機械学習の多くのアプリケーションドメインに革命をもたらし、多くの重要な意思決定や予測プロセスの重要なコンポーネントとなっています。したがって、ドメインスペシャリストが、最も複雑なニューラルネットワークアーキテクチャであっても、アクションと予測を理解して分析できることが非常に重要です。これらの議論にもかかわらず、ニューラルネットワークはブラックボックスとして扱われることがよくあります。この欠点を軽減するために、多くの分析方法が提案されましたが、参照実装が不足しているため、方法間の体系的な比較が大きな労力を要することがよくあります。提示されたライブラリinnvestigateは、PatternNetとPatternAttributionおよびLRP方法の参照実装を含む、多くの分析方法に共通のインターフェイスとすぐに使用できる実装を提供することで、この問題に対処します。innvestigateの汎用性を示すために、さまざまな最先端のニューラルネットワークアーキテクチャの画像分類の分析を提供します。

AffectiveTweets: a Weka Package for Analyzing Affect in Tweets
AffectiveTweets: ツイート内の感情を分析するための Weka パッケージ

AffectiveTweets is a set of programs for analyzing emotion and sentiment of social media messages such as tweets. It is implemented as a package for the Weka machine learning workbench and provides methods for calculating state-of-the-art affect analysis features from tweets that can be fed into machine learning algorithms implemented in Weka. It also implements methods for building affective lexicons and distant supervision methods for training affective models from unlabeled tweets. The package was used by several teams in the shared tasks: EmoInt 2017 and Affect in Tweets SemEval 2018 Task 1.

AffectiveTweetsは、ツイートなどのソーシャルメディアメッセージの感情やセンチメントを分析するための一連のプログラムです。これは、Weka機械学習ワークベンチのパッケージとして実装され、ツイートから最先端の感情分析用特徴量を計算する方法を提供し、それをWekaに実装された機械学習アルゴリズムに入力することができます。また、感情語彙(レキシコン)を構築する方法と、ラベルのないツイートから感情モデルを訓練するための遠距離教師あり学習(distant supervision)の方法も実装しています。このパッケージは、共有タスクであるEmoInt 2017とAffect in Tweets SemEval 2018 Task 1で、いくつかのチームによって使用されました。

Best Arm Identification for Contaminated Bandits
汚染バンディットにおける最良腕識別

This paper studies active learning in the context of robust statistics. Specifically, we propose a variant of the Best Arm Identification problem for contaminated bandits, where each arm pull has probability epsilon of generating a sample from an arbitrary contamination distribution instead of the true underlying distribution. The goal is to identify the best (or approximately best) true distribution with high probability, with a secondary goal of providing guarantees on the quality of this distribution. The primary challenge of the contaminated bandit setting is that the true distributions are only partially identifiable, even with infinite samples. To address this, we develop tight, non-asymptotic sample complexity bounds for high-probability estimation of the first two robust moments (median and median absolute deviation) from contaminated samples. These concentration inequalities are the main technical contributions of the paper and may be of independent interest. Using these results, we adapt several classical Best Arm Identification algorithms to the contaminated bandit setting and derive sample complexity upper bounds for our problem. Finally, we provide matching information-theoretic lower bounds on the sample complexity (up to a small logarithmic factor).

この論文では、ロバスト統計のコンテキストにおけるアクティブラーニングについて検討します。具体的には、汚染されたバンディットに対するベストアーム識別問題のバリエーションを提案します。このバリエーションでは、各アームプルが、真の基礎分布ではなく任意の汚染分布からサンプルを生成する確率がepsilonです。目標は、最良(またはほぼ最良)の真の分布を高い確率で識別することであり、二次的な目標はこの分布の品質を保証することです。汚染されたバンディット設定の主な課題は、真の分布は、無限のサンプルであっても部分的にしか識別できないことです。これに対処するために、汚染されたサンプルからの最初の2つのロバストモーメント(中央値と中央絶対偏差)を高い確率で推定するための、厳密で非漸近的なサンプル複雑性境界を開発します。これらの集中不等式は、この論文の主な技術的貢献であり、それ自体としても興味深い可能性があります。これらの結果を使用して、汚染されたバンディット設定にいくつかの古典的なベストアーム識別アルゴリズムを適応させ、問題のサンプル複雑性の上限を導き出します。最後に、サンプル複雑性に対して(小さな対数因子を除いて)一致する情報理論的下限を示します。
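
As a rough illustration of why the median and the median absolute deviation (MAD) are natural statistics in this setting, the following Python sketch draws contaminated samples and picks the arm with the largest empirical median. It is not the paper's algorithm: the uniform allocation of pulls, the Gaussian arms, the fixed contamination value, and all function names are assumptions made for illustration only.

```python
import random
import statistics


def median_and_mad(samples):
    """Return (median, median absolute deviation) of a list of numbers."""
    med = statistics.median(samples)
    mad = statistics.median([abs(x - med) for x in samples])
    return med, mad


def pull(rng, true_mean, eps, contamination_value):
    """One arm pull: with probability eps the sample comes from the contamination."""
    if rng.random() < eps:
        return contamination_value  # assumed adversarial outlier
    return rng.gauss(true_mean, 1.0)  # assumed unit-variance Gaussian arm


def pick_best_arm(true_means, eps=0.1, pulls_per_arm=2000, seed=0):
    """Pull every arm equally often and return the index with the largest median."""
    rng = random.Random(seed)
    medians = []
    for mu in true_means:
        samples = [pull(rng, mu, eps, -50.0) for _ in range(pulls_per_arm)]
        med, _ = median_and_mad(samples)
        medians.append(med)
    return max(range(len(true_means)), key=lambda i: medians[i])
```

In this toy setup an empirical mean would be dragged far below the true value by the outliers, whereas the median shifts only by an amount controlled by epsilon, which is the partial identifiability the abstract refers to.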

A Particle-Based Variational Approach to Bayesian Non-negative Matrix Factorization
ベイズ非負行列因数分解への粒子ベース変分法

Bayesian Non-negative Matrix Factorization (BNMF) is a promising approach for understanding uncertainty and structure in matrix data. However, a large volume of applied work optimizes traditional non-Bayesian NMF objectives that fail to provide a principled understanding of the non-identifiability inherent in NMF—an issue ideally addressed by a Bayesian approach. Despite their suitability, current BNMF approaches have failed to gain popularity in an applied setting; they sacrifice flexibility in modeling for tractable computation, tend to get stuck in local modes, and can require many thousands of samples for meaningful uncertainty estimates. We address these issues through a particle-based variational approach to BNMF that only requires the joint likelihood to be differentiable for computational tractability, uses a novel transfer-based initialization technique to identify multiple modes in the posterior, and thus allows domain experts to inspect a small set of factorizations that faithfully represent the posterior. On several real datasets, we obtain better particle approximations to the BNMF posterior in less time than baselines and demonstrate the significant role that multimodality plays in NMF-related tasks.

ベイジアン非負値行列因子分解(BNMF)は、行列データの不確実性と構造を理解するための有望なアプローチです。しかし、大量の応用研究は、従来の非ベイジアンNMF目的を最適化していますが、NMFに固有の非識別性の原理的な理解を提供できていません。この問題は、ベイジアンアプローチによって理想的に解決されます。現在のBNMFアプローチは適切であるにもかかわらず、応用設定で人気を得ることができていません。扱いやすい計算のためにモデリングの柔軟性を犠牲にし、ローカルモードに陥りがちで、意味のある不確実性の推定に何千ものサンプルが必要になる場合があります。私たちは、BNMFへの粒子ベースの変分アプローチを通じてこれらの問題に対処します。このアプローチでは、計算の扱いやすさのために結合尤度が微分可能であることのみを必要とし、事後分布の複数のモードを識別するために新しい転送ベースの初期化手法を使用し、ドメインエキスパートが事後分布を忠実に表す因子分解の小さなセットを検査できるようにします。いくつかの実際のデータセットでは、ベースラインよりも短い時間でBNMF事後分布のより優れた粒子近似が得られ、NMF関連タスクでマルチモダリティが果たす重要な役割を実証しています。

Dependent relevance determination for smooth and structured sparse regression
平滑回帰と構造化スパース回帰の従属関連性の決定

In many problem settings, parameter vectors are not merely sparse but dependent in such a way that non-zero coefficients tend to cluster together. We refer to this form of dependency as “region sparsity.” Classical sparse regression methods, such as the lasso and automatic relevance determination (ARD), model parameters as independent a priori and therefore do not exploit such dependencies. Here we introduce a hierarchical model for smooth, region-sparse weight vectors and tensors in a linear regression setting. Our approach represents a hierarchical extension of the relevance determination framework, where we add a transformed Gaussian process to model the dependencies between the prior variances of regression weights. We combine this with a structured model of the prior variances of Fourier coefficients, which eliminates unnecessary high frequencies. The resulting prior encourages weights to be region-sparse in two different bases simultaneously. We develop a Laplace approximation and Markov chain Monte Carlo (MCMC) sampling to provide efficient inference for the posterior. Furthermore, a two-stage convex relaxation of the Laplace approximation approach is also provided to relax the inevitable non-convexity during the optimization. We finally show substantial improvements over comparable methods for both simulated and real datasets from brain imaging.

多くの問題設定では、パラメータベクトルは単にスパースであるだけでなく、非ゼロの係数が一緒に集まる傾向があるような依存関係があります。この形式の依存関係を「領域スパース性」と呼びます。Lassoや自動関連性決定(ARD)などの古典的なスパース回帰法では、パラメータを事前に独立しているとモデル化するため、このような依存関係は利用されません。ここでは、線形回帰設定で滑らかな領域スパースの重みベクトルとテンソルの階層モデルを紹介します。私たちのアプローチは、関連性決定フレームワークの階層的な拡張を表しており、変換されたガウス過程を追加して、回帰重みの事前分散間の依存関係をモデル化します。これを、不要な高周波を排除するフーリエ係数の事前分散の構造化モデルと組み合わせます。結果として得られる事前分布は、2つの異なる基底で同時に重みが領域スパースになるように促します。事後分布の効率的な推論を行うために、ラプラス近似とマルコフ連鎖モンテカルロ(MCMC)サンプリングを開発しました。さらに、最適化中に避けられない非凸性を緩和するために、ラプラス近似アプローチの2段階凸緩和も提供されています。最終的に、脳画像のシミュレーションデータと実際のデータの両方で、同等の方法に比べて大幅な改善が見られることが示されました。

Model Selection via the VC Dimension
VCディメンションによるモデル選択

We derive an objective function that can be optimized to give an estimator for the Vapnik-Chervonenkis dimension for use in model selection in regression problems. We verify our estimator is consistent. Then, we verify it performs well compared to seven other model selection techniques. We do this for a variety of types of data sets.

私たちは、回帰問題でのモデル選択に使用するためのVapnik-Chervonenkis次元の推定量を与えるために最適化できる目的関数を導出します。この推定量が一致性を持つことを確認します。次に、他の7つのモデル選択手法と比較して、優れたパフォーマンスを発揮することを確認します。これは、さまざまなタイプのデータセットに対して行われます。

An asymptotic analysis of distributed nonparametric methods
分散ノンパラメトリック法の漸近解析

We investigate and compare the fundamental performance of several distributed learning methods that have been proposed recently. We do this in the context of a distributed version of the classical signal-in-Gaussian-white-noise model, which serves as a benchmark model for studying performance in this setting. The results show how the design and tuning of a distributed method can have great impact on convergence rates and validity of uncertainty quantification. Moreover, we highlight the difficulty of designing nonparametric distributed procedures that automatically adapt to smoothness.

私たちは、最近提案されたいくつかの分散学習方法の基本的なパフォーマンスを調査し、比較します。これは、古典的なガウス内シグナルホワイトノイズモデルの分散バージョンのコンテキストで行われ、この設定でのパフォーマンスを研究するためのベンチマークモデルとして機能します。この結果は、分散法の設計と調整が、不確実性の定量化の収束率と妥当性にどのように大きな影響を与えるかを示しています。さらに、滑らかさに自動的に適応するノンパラメトリック分散プロシージャを設計することの難しさを強調します。

Streaming Principal Component Analysis From Incomplete Data
不完全データからの主成分分析のストリーミング

Linear subspace models are pervasive in computational sciences and particularly used for large datasets which are often incomplete due to privacy issues or sampling constraints. Therefore, a critical problem is developing an algorithm for detecting low-dimensional linear structure from incomplete data efficiently, in terms of both computational complexity and storage. In this paper we propose a streaming subspace estimation algorithm called Subspace Navigation via Interpolation from Partial Entries (SNIPE) that efficiently processes blocks of incomplete data to estimate the underlying subspace model. In every iteration, SNIPE finds the subspace that best fits the new data block but remains close to the previous estimate. We show that SNIPE is a streaming solver for the underlying nonconvex matrix completion problem, that it converges globally to a stationary point of this program regardless of initialization, and that the convergence is locally linear with high probability. We also find that SNIPE shows state-of-the-art performance in our numerical simulations.

線形サブスペースモデルは計算科学で広く利用されており、プライバシーの問題やサンプリング制約により不完全となることが多い大規模なデータセットに特に使用されています。したがって、計算の複雑さとストレージの両方の観点から、不完全なデータから低次元の線形構造を効率的に検出するアルゴリズムを開発することが重要な問題です。この論文では、不完全なデータのブロックを効率的に処理して基礎となるサブスペースモデルを推定する、Subspace Navigation via Interpolation from Partial Entries (SNIPE)と呼ばれるストリーミングサブスペース推定アルゴリズムを提案します。すべての反復で、SNIPEは新しいデータブロックに最も適合するが、以前の推定に近いサブスペースを見つけます。SNIPEは基礎となる非凸行列補完問題に対するストリーミングソルバーであり、初期化に関係なくグローバルにこのプログラムの定常点に収束し、収束は高い確率でローカルに線形であることを示します。また、数値シミュレーションではSNIPEが最先端のパフォーマンスを示すこともわかりました。

Bayesian Space-Time Partitioning by Sampling and Pruning Spanning Trees
スパニングツリーのサンプリングと枝刈りによるベイジアン時空間分割

A typical problem in spatial data analysis is regionalization or spatially constrained clustering, which consists of aggregating small geographical areas into larger regions. A major challenge when partitioning a map is the huge number of possible partitions that compose the search space. This is compounded if we are partitioning spatio-temporal data rather than purely spatial data. We introduce a spatio-temporal product partition model that deals with the regionalization problem in a probabilistic way. Random spanning trees are used as a tool to tackle the problem of searching the space of possible partitions, making this exploration feasible. Based on this framework, we propose an efficient Gibbs sampler algorithm to sample from the posterior distribution of the parameters, especially the random partition. The proposed Gibbs sampler scheme carries out a random walk on the space of the spanning trees and the partitions induced by deleting tree edges. In the purely spatial situation, we compare our proposed model with other state-of-the-art regionalization techniques to partition maps using simulated and real social and health data. To illustrate how the temporal component is handled by the algorithm and to show how the spatial clusters vary over time, we present an application using human development index data. The analysis shows that our proposed model is better than state-of-the-art alternatives. Another appealing feature of the method is that the prior distribution for the partition is interpretable through a trivial coin-flipping mechanism, allowing its easy elicitation.

空間データ分析における典型的な問題は、地域化または空間的に制約されたクラスタリングであり、これは小さな地理的領域をより大きな地域に集約することから成ります。マップを分割する際の主な課題は、検索空間を構成する可能性のあるパーティションの数が膨大であることです。これは、純粋に空間的なデータではなく時空間的なデータを分割する場合に複雑になります。私たちは、確率的な方法で地域化の問題に対処する時空間積分割モデルを紹介します。ランダムスパニングツリーは、この探索を可能にする可能性のあるパーティションの空間を検索する問題に取り組むツールとして使用されます。このフレームワークに基づいて、パラメーターの事後分布、特にランダムパーティションからサンプリングする効率的なギブスサンプラーアルゴリズムを提案します。提案されたギブスサンプラースキームは、スパニングツリーの空間と、ツリーエッジを削除することによって誘導されるパーティション上でランダムウォークを実行します。純粋に空間的な状況では、シミュレートされたデータと実際の社会データおよび健康データを使用してマップを分割する、提案モデルと他の最先端の地域化手法を比較します。時間的要素がアルゴリズムによってどのように処理されるか、また空間クラスターが時間とともにどのように変化するかを示すために、人間開発指数データを使用したアプリケーションを紹介しました。分析により、提案モデルが最先端の代替モデルよりも優れていることが示されました。この方法のもう1つの魅力的な特徴は、パーティションの事前分布が、簡単なコイン投げメカニズムで解釈可能であり、簡単に導き出せることです。

Differentiable Game Mechanics
微分可能ゲームの力学

Deep learning is built on the foundational guarantee that gradient descent on an objective function converges to local minima. Unfortunately, this guarantee fails in settings, such as generative adversarial nets, that exhibit multiple interacting losses. The behavior of gradient-based methods in games is not well understood — and is becoming increasingly important as adversarial and multi-objective architectures proliferate. In this paper, we develop new tools to understand and control the dynamics in $n$-player differentiable games. The key result is to decompose the game Jacobian into two components. The first, symmetric component, is related to potential games, which reduce to gradient descent on an implicit function. The second, antisymmetric component, relates to Hamiltonian games, a new class of games that obey a conservation law akin to conservation laws in classical mechanical systems. The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in differentiable games. Basic experiments show SGA is competitive with recently proposed algorithms for finding stable fixed points in GANs — while at the same time being applicable to, and having guarantees in, much more general cases.

ディープラーニングは、目的関数の勾配降下法が局所最小値に収束するという基本的な保証に基づいて構築されています。残念ながら、この保証は、複数の相互作用する損失を示す生成的敵対的ネットなどの設定では機能しません。ゲームにおける勾配ベースの方法の動作は十分に理解されていませんが、敵対的および多目的アーキテクチャが急増するにつれて、ますます重要になっています。この論文では、n人プレイヤーの微分可能ゲームのダイナミクスを理解して制御するための新しいツールを開発します。重要な結果は、ゲームヤコビアンを2つのコンポーネントに分解することです。最初の対称コンポーネントは、暗黙の関数の勾配降下法に還元されるポテンシャルゲームに関連します。2番目の反対称コンポーネントは、ハミルトンゲームに関連します。これは、古典的な機械システムの保存則に似た保存則に従う新しいクラスのゲームです。この分解により、微分可能ゲームで安定した固定点を見つけるための新しいアルゴリズムであるシンプレクティック勾配調整(SGA)が生まれました。基本的な実験では、SGAはGANで安定した固定点を見つけるために最近提案されたアルゴリズムと競合する一方で、より一般的なケースにも適用可能であり、保証も備えていることが示されています。
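
To make the Jacobian decomposition more concrete, here is a minimal Python sketch on a two-player bilinear game with losses l1(x, y) = x*y and l2(x, y) = -x*y. This toy game, the step size, and the adjustment weight `lam` are assumptions chosen for illustration, and the update below is a simplified reading of the adjustment idea (add `lam * A^T xi` to the simultaneous gradient `xi`), not the authors' reference implementation.

```python
import math


def xi(x, y):
    """Simultaneous gradient for losses l1 = x*y (player 1) and l2 = -x*y (player 2)."""
    return (y, -x)


# Jacobian of xi for this toy game. It is constant and already antisymmetric,
# so its symmetric part is zero and its antisymmetric part equals J itself.
J = ((0.0, 1.0), (-1.0, 0.0))


def antisymmetric_part(jac):
    """A = (J - J^T) / 2 for a 2x2 matrix given as nested tuples."""
    return tuple(
        tuple(0.5 * (jac[i][j] - jac[j][i]) for j in range(2)) for i in range(2)
    )


def step(x, y, lr, lam):
    """One update along xi + lam * A^T xi (lam = 0 gives plain simultaneous descent)."""
    g = xi(x, y)
    a = antisymmetric_part(J)
    # (A^T g)_i = sum_j A[j][i] * g[j]
    adj = tuple(sum(a[j][i] * g[j] for j in range(2)) for i in range(2))
    return x - lr * (g[0] + lam * adj[0]), y - lr * (g[1] + lam * adj[1])


def run(lam, n_steps=200, lr=0.1):
    """Iterate from (1, 1) and return the final distance to the fixed point (0, 0)."""
    x, y = 1.0, 1.0
    for _ in range(n_steps):
        x, y = step(x, y, lr, lam)
    return math.hypot(x, y)
```

With `lam = 0` the iterates spiral away from the origin, while with `lam = 1` they contract toward it, which is the qualitative behavior the decomposition is meant to explain in purely antisymmetric games.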

On the optimality of the Hedge algorithm in the stochastic regime
確率論的領域におけるヘッジアルゴリズムの最適性について

In this paper, we study the behavior of the Hedge algorithm in the online stochastic setting. We prove that anytime Hedge with decreasing learning rate, which is one of the simplest algorithms for the problem of prediction with expert advice, is remarkably both worst-case optimal and adaptive to the easier stochastic and adversarial-with-a-gap problems. This shows that, in spite of its small, non-adaptive learning rate, Hedge possesses the same optimal regret guarantee in the stochastic case as recently introduced adaptive algorithms. Moreover, our analysis exhibits qualitative differences with other versions of the Hedge algorithm, such as the fixed-horizon variant (with constant learning rate) and the one based on the so-called “doubling trick”, both of which fail to adapt to the easier stochastic setting. Finally, we determine the intrinsic limitations of anytime Hedge in the stochastic case, and discuss the improvements provided by more adaptive algorithms.

この論文では、オンライン確率論的設定におけるヘッジアルゴリズムの動作を研究します。私たちは、専門家のアドバイスによる予測の問題に対する最も単純なアルゴリズムの1つである学習率の低下を伴うヘッジが、最悪の場合に最適であり、ギャップ問題でより簡単な確率論的および敵対的問題に適応することを証明します。これは、その小さな非適応学習率にもかかわらず、Hedgeは、最近導入された適応アルゴリズムと同様に、確率的ケースで最適な後悔保証を持っていることを示しています。さらに、私たちの分析では、固定地平線バリアント(学習率が一定)やいわゆる「ダブリングトリック」に基づくものなど、他のバージョンのヘッジアルゴリズムとの質的な違いが見られ、どちらもより簡単な確率的設定に適応できません。最後に、確率論的ケースにおけるanytime Hedgeの本質的な制限を決定し、より適応性の高いアルゴリズムによって提供される改善点について説明します。
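
For readers unfamiliar with Hedge, the following Python sketch shows the exponential-weights update with a learning rate that decreases over time. The specific schedule `eta_t = c * sqrt(log(N) / t)` and the constant `c = 2.0` are assumptions for illustration; the abstract only states that the learning rate is decreasing, and the function names are hypothetical.

```python
import math


def hedge_weights(cum_losses, eta):
    """Exponential weights over experts given their cumulative losses."""
    shift = min(cum_losses)  # subtract the minimum for numerical stability
    raw = [math.exp(-eta * (loss - shift)) for loss in cum_losses]
    total = sum(raw)
    return [r / total for r in raw]


def decreasing_hedge(loss_rounds, c=2.0):
    """Run Hedge with eta_t = c * sqrt(log(N) / t).

    loss_rounds is a list of per-round loss vectors (one loss per expert).
    Returns the learner's total expected loss and the weights used in the last round.
    """
    n = len(loss_rounds[0])
    cum = [0.0] * n
    total = 0.0
    weights = [1.0 / n] * n
    for t, losses in enumerate(loss_rounds, start=1):
        eta = c * math.sqrt(math.log(n) / t)
        weights = hedge_weights(cum, eta)
        total += sum(w * l for w, l in zip(weights, losses))
        cum = [a + b for a, b in zip(cum, losses)]
    return total, weights
```

On an easy instance where one expert is always better by a fixed gap, the learner's cumulative loss stays bounded rather than growing with the horizon, which is the kind of adaptivity to the stochastic regime that the abstract describes.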

SMART: An Open Source Data Labeling Platform for Supervised Learning
SMART:教師あり学習のためのオープンソースのデータラベリングプラットフォーム

SMART is an open source web application designed to help data scientists and research teams efficiently build labeled training data sets for supervised machine learning tasks. SMART provides users with an intuitive interface for creating labeled data sets, supports active learning to help reduce the required amount of labeled data, and incorporates inter-rater reliability statistics to provide insight into label quality. SMART is designed to be platform agnostic and easily deployable to meet the needs of as many different research teams as possible. The project website https://rtiinternational.github.io/SMART/ contains links to the code repository and extensive user documentation.

SMARTは、データサイエンティストや研究チームが教師あり機械学習タスク用のラベル付きトレーニングデータセットを効率的に構築できるように設計されたオープンソースのWebアプリケーションです。SMARTは、ラベル付きデータセットを作成するための直感的なインターフェースをユーザーに提供し、ラベル付きデータに必要な量を減らすためのアクティブラーニングをサポートし、ラベル品質に関する洞察を提供するための評価者間信頼性統計を組み込んでいます。SMARTは、プラットフォームに依存せず、できるだけ多くの異なる研究チームのニーズを満たすために簡単に展開できるように設計されています。プロジェクトのWebサイトhttps://rtiinternational.github.io/SMART/には、コードリポジトリへのリンクと広範なユーザードキュメントが含まれています。

Tight Lower Bounds on the VC-dimension of Geometric Set Systems
幾何学的集合システムのVC次元の厳密な下限

The VC-dimension of a set system is a way to capture its complexity and has been a key parameter studied extensively in machine learning and geometry communities. In this paper, we resolve two longstanding open problems on bounding the VC-dimension of two fundamental set systems: $k$-fold unions/intersections of half-spaces and the simplices set system. Among other implications, it settles an open question in machine learning that was first studied in the foundational paper of Blumer et al. (1989) as well as by Eisenstat and Angluin (2007) and Johnson (2008).

集合システムのVC次元は、その複雑さを捉える方法であり、機械学習や幾何学のコミュニティで広く研究されてきた重要なパラメータです。この論文では、2つの基本的な集合システム、すなわち半空間の$k$重の和集合/共通部分と単体の集合システムについて、VC次元の評価に関する2つの長年の未解決問題を解決します。とりわけ、Blumerら(1989)の基礎論文で最初に研究され、Eisenstat and Angluin (2007)やJohnson (2008)でも研究された機械学習の未解決問題に決着をつけます。

Learning to Match via Inverse Optimal Transport
逆最適輸送によるマッチングの学習

We propose a unified data-driven framework based on inverse optimal transport that can learn an adaptive, nonlinear interaction cost function from a noisy and incomplete empirical matching matrix and predict new matchings in various matching contexts. We emphasize that discrete optimal transport plays the role of a variational principle which gives rise to an optimization-based framework for modeling the observed empirical matching data. Our formulation leads to a non-convex optimization problem which can be solved efficiently by an alternating optimization method. A key novel aspect of our formulation is the incorporation of marginal relaxation via regularized Wasserstein distance, significantly improving the robustness of the method in the face of noisy or missing empirical matching data. Our model falls into the category of prescriptive models, which not only predict potential future matchings but are also able to explain what leads to empirical matching and quantify the impact of changes in matching factors. The proposed approach has wide applicability, including predicting matching in online dating, labor markets, college applications and crowdsourcing. We back up our claims with numerical experiments on both synthetic data and real-world data sets.

私たちは、ノイズが多く不完全な経験的マッチング行列から適応型の非線形相互作用コスト関数を学習し、さまざまなマッチングコンテキストで新しいマッチングを予測できる、逆最適輸送に基づく統合データ駆動フレームワークを提案します。離散最適輸送は変分原理の役割を果たしており、観測された経験的マッチングデータをモデル化するための最適化ベースのフレームワークを生み出すことを強調します。我々の定式化は、交互最適化法によって効率的に解決できる非凸最適化問題につながります。我々の定式化の重要な新しい側面は、正規化されたワッサーシュタイン距離による限界緩和の組み込みであり、ノイズの多い、または経験的マッチングデータが欠落している場合でも、方法の堅牢性が大幅に向上します。我々のモデルは、規範モデルのカテゴリに分類され、潜在的な将来のマッチングを予測するだけでなく、経験的マッチングにつながるものを説明し、マッチング要因の変化の影響を定量化することもできます。提案されたアプローチは、オンラインデート、労働市場、大学出願、クラウドソーシングでのマッチング予測など、幅広い適用性があります。私たちは、合成データと現実世界のデータセットの両方での数値実験によって主張を裏付けています。

Quantification Under Prior Probability Shift: the Ratio Estimator and its Extensions
事前確率シフト下での定量化:比推定量とその拡張

The quantification problem consists of determining the prevalence of a given label in a target population. However, one often has access to the labels in a sample from the training population but not in the target population. A common assumption in this situation is that of prior probability shift, that is, once the labels are known, the distribution of the features is the same in the training and target populations. In this paper, we derive a new lower bound for the risk of the quantification problem under the prior shift assumption. Complementing this lower bound, we present a new approximately minimax class of estimators, ratio estimators, which generalize several previous proposals in the literature. Using a weaker version of the prior shift assumption, which can be tested, we show that ratio estimators can be used to build confidence intervals for the quantification problem. We also extend the ratio estimator so that it can: (i) incorporate labels from the target population, when they are available and (ii) estimate how the prevalence of positive labels varies according to a function of certain covariates.

定量化の問題は、対象集団における特定のラベルの普及率を決定することです。しかし、多くの場合、トレーニング集団のサンプルのラベルにはアクセスできますが、対象集団のラベルにはアクセスできません。この状況でよく使われる仮定は、事前確率シフトです。つまり、ラベルがわかれば、特徴の分布はトレーニング集団と対象集団で同じになります。この論文では、事前確率シフトの仮定の下での定量化問題のリスクの新しい下限を導きます。この下限を補完して、新しい近似ミニマックスクラスの推定量、比率推定量を提示します。これは、文献で以前に提案されたいくつかのものを一般化したものです。事前確率シフトの仮定の弱いバージョンを使用して、比率推定量を使用して定量化問題の信頼区間を構築できることを示します。また、比率推定量を拡張して、(i)対象集団のラベルが利用可能な場合はそれを組み込み、(ii)特定の共変量の関数に従って正のラベルの普及率がどのように変化するかを推定できるようにします。
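
Under prior probability shift, the target-population mean of any function g of the features is a mixture of its class-conditional training means, E_target[g] = θ·E[g | Y=1] + (1 − θ)·E[g | Y=0], and solving for θ gives a ratio-type estimate of the prevalence. The Python sketch below illustrates that identity for binary labels. The function name, the clipping to [0, 1], and the choice of g as a hard classifier output in the usage are assumptions for illustration, not the exact estimator class analyzed in the paper.

```python
def ratio_estimator(g_train, y_train, g_target):
    """Estimate the prevalence of label 1 in the target population.

    g_train:  values g(x) on labeled training examples
    y_train:  binary labels (0 or 1) of the training examples
    g_target: values g(x) on unlabeled target examples
    """
    def mean(values):
        return sum(values) / len(values)

    g_pos = [g for g, y in zip(g_train, y_train) if y == 1]
    g_neg = [g for g, y in zip(g_train, y_train) if y == 0]
    m1, m0 = mean(g_pos), mean(g_neg)
    if m1 == m0:
        raise ValueError("g does not separate the classes; prevalence is not identified")
    theta = (mean(g_target) - m0) / (m1 - m0)
    # Clip to the valid range (an illustrative choice, not part of the identity).
    return min(1.0, max(0.0, theta))
```

For example, if g is a classifier that outputs 1 on 75% of training positives and on 25% of training negatives, and it outputs 1 on 60% of the target sample, the estimate is (0.60 − 0.25) / (0.75 − 0.25) = 0.7.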

Prediction Risk for the Horseshoe Regression
ホースシュー回帰の予測リスク

We show that prediction performance for global-local shrinkage regression can overcome two major difficulties of global shrinkage regression: (i) the amount of relative shrinkage is monotone in the singular values of the design matrix and (ii) the shrinkage is determined by a single tuning parameter. Specifically, we show that the horseshoe regression, with heavy-tailed component-specific local shrinkage parameters, in conjunction with a global parameter providing shrinkage towards zero, alleviates both these difficulties and consequently, results in an improved risk for prediction. Numerical demonstrations of improved prediction over competing approaches in simulations and in a pharmacogenomics data set confirm our theoretical findings.

私たちは、グローバル-ローカル収縮回帰の予測性能が、グローバル収縮回帰の2つの主要な困難を克服できることを示します:(i)相対的な収縮量は、計画行列の特異値で単調であり、(ii)収縮は単一の調整パラメータによって決定されます。具体的には、ヘビーテールのコンポーネント固有の局所収縮パラメータと、ゼロへの収縮を提供するグローバルパラメータを組み合わせたホースシュー回帰が、これらの困難の両方を軽減し、その結果、予測のリスクが向上することを示しています。シミュレーションや薬理ゲノミクスデータセットにおける競合するアプローチよりも予測が改善されたことを数値で実証したことは、私たちの理論的知見を裏付けています。

Nonuniformity of P-values Can Occur Early in Diverging Dimensions
p値の不均一性は、次元の発散の早い段階で発生する可能性があります

Evaluating the joint significance of covariates is of fundamental importance in a wide range of applications. To this end, p-values are frequently employed and produced by algorithms that are powered by classical large-sample asymptotic theory. It is well known that the conventional p-values in Gaussian linear model are valid even when the dimensionality is a non-vanishing fraction of the sample size, but can break down when the design matrix becomes singular in higher dimensions or when the error distribution deviates from Gaussianity. A natural question is when the conventional p-values in generalized linear models become invalid in diverging dimensions. We establish that such a breakdown can occur early in nonlinear models. Our theoretical characterizations are confirmed by simulation studies.

共変量の同時有意性の評価は、幅広いアプリケーションで基本的に重要です。この目的のために、p値は、古典的な大サンプルの漸近理論を動力源とするアルゴリズムによって頻繁に使用され、生成されます。ガウス線形モデルの従来のp値は、次元がサンプルサイズの消失しない部分である場合でも有効ですが、設計行列が高次元で特異になる場合や、誤差分布がガウス分布から逸脱すると破綻する可能性があることはよく知られています。当然の問題は、一般化線形モデルの従来のp値が発散次元で無効になる場合です。このようなブレークダウンは、非線形モデルで早期に発生する可能性があることを立証します。私たちの理論的な特性評価は、シミュレーション研究によって確認されています。

Generalized Score Matching for Non-Negative Data
非負データのための一般化スコアマッチング

A common challenge in estimating parameters of probability density functions is the intractability of the normalizing constant. While in such cases maximum likelihood estimation may be implemented using numerical integration, the approach becomes computationally intensive. The score matching method of Hyvärinen (2005) avoids direct calculation of the normalizing constant and yields closed-form estimates for exponential families of continuous distributions over $\mathbb{R}^m$. Hyvärinen (2007) extended the approach to distributions supported on the non-negative orthant, $\mathbb{R}_+^m$. In this paper, we give a generalized form of score matching for non-negative data that improves estimation efficiency. As an example, we consider a general class of pairwise interaction models. Addressing an overlooked inexistence problem, we generalize the regularized score matching method of Lin et al. (2016) and improve its theoretical guarantees for non-negative Gaussian graphical models.

確率密度関数のパラメータを推定する際の一般的な課題は、正規化定数が計算困難であることです。このような場合、最尤推定は数値積分を使用して実装できますが、このアプローチは計算集約的になります。Hyvärinen (2005)のスコアマッチング法は、正規化定数の直接計算を回避し、$\mathbb{R}^m$上の連続分布の指数型分布族に対して閉形式の推定値を与えます。Hyvärinen (2007)は、このアプローチを非負象限$\mathbb{R}_+^m$上に台を持つ分布へ拡張しました。この論文では、推定効率を向上させる非負のデータに対するスコアマッチングの一般化された形式を示します。例として、ペアワイズ相互作用モデルの一般的なクラスを考えます。見落とされていた非存在問題に対処するために、Linら(2016)の正則化スコアマッチング法を一般化し、非負のガウスグラフィカルモデルに対する理論的保証を改善します。

Fairness Constraints: A Flexible Approach for Fair Classification
公平性の制約: 公正な分類のための柔軟なアプローチ

Algorithmic decision making is employed in an increasing number of real-world applications to aid human decision making. While it has shown considerable promise in terms of improved decision accuracy, in some scenarios, its outcomes have also been shown to impose an unfair (dis)advantage on people from certain social groups (e.g., women, blacks). In this context, there is a need for computational techniques to limit unfairness in algorithmic decision making. In this work, we take a step forward to fulfill that need and introduce a flexible constraint-based framework to enable the design of fair margin-based classifiers. The main technical innovation of our framework is a general and intuitive measure of decision boundary unfairness, which serves as a tractable proxy to several of the most popular computational definitions of unfairness from the literature. Leveraging our measure, we can reduce the design of fair margin-based classifiers to adding tractable constraints on their decision boundaries. Experiments on multiple synthetic and real-world datasets show that our framework is able to successfully limit unfairness, often at a small cost in terms of accuracy.

アルゴリズムによる意思決定は、人間の意思決定を支援するために、ますます多くの実世界のアプリケーションで採用されています。意思決定の精度が向上するという点で大きな期待が寄せられていますが、一部のシナリオでは、その結果が特定の社会集団(女性、黒人など)の人々に不公平な(不)優位性を課すことも示されています。この文脈では、アルゴリズムによる意思決定における不公平を制限するための計算技術が必要です。この研究では、そのニーズを満たすために一歩前進し、公平なマージンベースの分類器の設計を可能にする柔軟な制約ベースのフレームワークを導入します。私たちのフレームワークの主な技術革新は、決定境界の不公平性の一般的で直感的な尺度であり、これは文献にある不公平性の最も一般的な計算定義のいくつかに対する扱いやすいプロキシとして機能します。私たちの尺度を活用することで、公平なマージンベースの分類器の設計を、決定境界に扱いやすい制約を追加することにまで減らすことができます。複数の合成データセットと現実世界のデータセットでの実験により、私たちのフレームワークは、多くの場合、精度の面でわずかなコストで、不公平をうまく制限できることが示されています。

Deep Optimal Stopping
深層最適停止

In this paper we develop a deep learning method for optimal stopping problems which directly learns the optimal stopping rule from Monte Carlo samples. As such, it is broadly applicable in situations where the underlying randomness can efficiently be simulated. We test the approach on three problems: the pricing of a Bermudan max-call option, the pricing of a callable multi barrier reverse convertible and the problem of optimally stopping a fractional Brownian motion. In all three cases it produces very accurate results in high-dimensional situations with short computing times.

この論文では、モンテカルロサンプルから最適停止ルールを直接学習する、最適停止問題のための深層学習法を開発します。そのため、基礎となるランダム性を効率的にシミュレートできる状況に広く適用できます。バミューダン型マックスコールオプションの価格設定、コール可能なマルチバリアリバースコンバーチブルの価格設定、非整数ブラウン運動を最適に停止する問題の3つの問題でアプローチをテストします。3つのケースすべてにおいて、高次元の状況で非常に正確な結果を短い計算時間で生成します。

Analysis of Langevin Monte Carlo via Convex Optimization
凸最適化によるランジュバン・モンテカルロの解析

In this paper, we provide new insights on the Unadjusted Langevin Algorithm. We show that this method can be formulated as the first order optimization algorithm for an objective functional defined on the Wasserstein space of order $2$. Using this interpretation and techniques borrowed from convex optimization, we give a non-asymptotic analysis of this method to sample from a log-concave smooth target distribution on $\mathbb{R}^d$. Based on this interpretation, we propose two new methods for sampling from a non-smooth target distribution. These new algorithms are natural extensions of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm, which is a popular extension of the Unadjusted Langevin Algorithm for large-scale Bayesian inference. Using the optimization perspective, we provide non-asymptotic convergence analysis for the newly proposed methods.

この論文では、未調整のランジュバンアルゴリズムに関する新しい洞察を提供します。この手法は、次数$2$のWasserstein空間で定義された目的関数の1次最適化アルゴリズムとして定式化できることを示します。この解釈と凸最適化から借用した手法を使用して、この方法の非漸近解析を行い、$\mathbb{R}^d$上の対数凹平滑ターゲット分布からサンプリングします。この解釈に基づいて、平滑でないターゲット分布からサンプリングするための2つの新しい方法を提案します。これらの新しいアルゴリズムは、大規模なベイズ推論のための未調整ランジュバンアルゴリズムの一般的な拡張である確率的勾配ランジュバンダイナミクス(SGLD)アルゴリズムの自然な拡張です。最適化の視点を使用して、新たに提案された方法の非漸近収束解析を提供します。
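
The Unadjusted Langevin Algorithm itself is short enough to sketch. Given a target density proportional to exp(−U(x)), each step takes a gradient step on U and adds Gaussian noise: x_{k+1} = x_k − γ∇U(x_k) + sqrt(2γ)·Z_k. The Python code below is a one-dimensional illustration; the standard Gaussian target (U(x) = x²/2) used in the usage note, the step size, and the function name are assumptions chosen for simplicity.

```python
import math
import random


def ula_samples(grad_U, x0, step, n_steps, seed=0):
    """Run the Unadjusted Langevin Algorithm in one dimension.

    grad_U: gradient of the potential U, where the target is proportional to exp(-U)
    step:   step size gamma (no Metropolis correction is applied)
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step)
    x = x0
    samples = []
    for _ in range(n_steps):
        x = x - step * grad_U(x) + noise_scale * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples
```

Calling `ula_samples(lambda x: x, 0.0, 0.05, 50000)` targets a standard normal distribution. Because no accept/reject correction is applied, the chain's stationary variance is slightly larger than 1 for any positive step size, and bias of this kind is what a non-asymptotic analysis has to quantify.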

Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning
分散最適化と学習におけるストラグラー緩和のための冗長性技術

Performance of distributed optimization and learning systems is bottlenecked by “straggler” nodes and slow communication links, which significantly delay computation. We propose a distributed optimization framework where the dataset is “encoded” to have an over-complete representation with built-in redundancy, and the straggling nodes in the system are dynamically treated as missing, or as “erasures” at every iteration, whose loss is compensated by the embedded redundancy. For quadratic loss functions, we show that under a simple encoding scheme, many optimization algorithms (gradient descent, L-BFGS, and proximal gradient) operating under data parallelism converge to an approximate solution even when stragglers are ignored. Furthermore, we show a similar result for a wider class of convex loss functions when operating under model parallelism. The applicable classes of objectives cover several popular learning problems such as linear regression, LASSO, support vector machine, collaborative filtering, and generalized linear models including logistic regression. These convergence results are deterministic, i.e., they establish sample path convergence for arbitrary sequences of delay patterns or distributions on the nodes, and are independent of the tail behavior of the delay distribution. We demonstrate that equiangular tight frames have desirable properties as encoding matrices, and propose efficient mechanisms for encoding large-scale data. We implement the proposed technique on Amazon EC2 clusters, and demonstrate its performance over several learning problems, including matrix factorization, LASSO, ridge regression and logistic regression, and compare the proposed method with uncoded, asynchronous, and data replication strategies.

分散最適化および学習システムのパフォーマンスは、「はぐれ者」ノードと低速の通信リンクによってボトルネックとなり、計算が大幅に遅れます。私たちは、データセットが冗長性が組み込まれた過剰完全な表現を持つように「エンコード」され、システム内のはぐれ者ノードが動的に欠損として、または反復ごとに「消失」として扱われ、その損失が組み込まれた冗長性によって補償される分散最適化フレームワークを提案します。2次損失関数の場合、単純なエンコード方式では、はぐれ者を無視した場合でも、データ並列処理で動作する多くの最適化アルゴリズム(勾配降下法、L-BFGS、および近似勾配法)が近似解に収束することを示します。さらに、モデル並列処理で動作する場合、より広いクラスの凸損失関数で同様の結果を示します。適用可能な目的のクラスは、線形回帰、LASSO、サポートベクターマシン、協調フィルタリング、ロジスティック回帰を含む一般化線形モデルなど、いくつかの一般的な学習問題をカバーします。これらの収束結果は決定論的です。つまり、ノード上の任意の遅延パターンまたは分布のシーケンスのサンプルパス収束を確立し、遅延分布の末尾の動作とは無関係です。等角タイトフレームはエンコード行列として望ましい特性を持っていることを実証し、大規模データをエンコードするための効率的なメカニズムを提案します。提案された手法をAmazon EC2クラスターに実装し、行列分解、LASSO、リッジ回帰、ロジスティック回帰などのいくつかの学習問題でそのパフォーマンスを実証し、提案された方法をコード化されていない非同期のデータ複製戦略と比較します。

Lazifying Conditional Gradient Algorithms
条件付き勾配アルゴリズムの遅延化

Conditional gradient algorithms (also often called Frank-Wolfe algorithms) are popular due to their simplicity of only requiring a linear optimization oracle, and more recently they have also gained significant traction for online learning. While simple in principle, in many cases the actual implementation of the linear optimization oracle is costly. We show a general method to lazify various conditional gradient algorithms, which in actual computations leads to several orders of magnitude of speedup in wall-clock time. This is achieved by using a faster separation oracle instead of a linear optimization oracle, relying on only a few linear optimization oracle calls.

条件付き勾配アルゴリズム(Frank-Wolfeアルゴリズムとも呼ばれる)は、線形最適化オラクルのみを必要とするという単純さから人気があり、最近ではオンライン学習でも大きな牽引力を得ています。原理的には単純ですが、多くの場合、線形最適化オラクルの実際の実装にはコストがかかります。さまざまな条件付き勾配アルゴリズムを遅延化する一般的な方法を示し、実際の計算では、ウォールクロック時間を数桁高速化します。これは、線形最適化オラクルの代わりに、少数の線形最適化オラクル呼び出しのみに依存する、より高速な分離オラクルを使用することで実現されます。
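
One way to picture lazification is a vertex cache that is consulted before the linear optimization oracle. The sketch below is a simplified illustration and not the authors' exact algorithm: it assumes the feasible region is the probability simplex, uses a short-step rule with a known Lipschitz constant, and the names `lp_oracle`, `lazy_frank_wolfe`, `phi`, and `K` are chosen here for illustration.

```python
import numpy as np

def lp_oracle(c):
    """Exact linear minimization over the probability simplex (the costly call)."""
    v = np.zeros_like(c, dtype=float)
    v[np.argmin(c)] = 1.0
    return v

def lazy_frank_wolfe(grad, x0, lipschitz, K=2.0, iters=2000, tol=1e-9):
    """Simplified lazy conditional gradient with a vertex cache.

    A cached vertex v is accepted whenever <grad, x - v> >= phi / K.
    The LP oracle is called only if no cached vertex qualifies; if the
    oracle's vertex also fails the test, the target phi is halved.
    """
    x = np.asarray(x0, dtype=float).copy()
    g = grad(x)
    v0 = lp_oracle(g)
    phi = g @ (x - v0)                 # initial Frank-Wolfe gap
    cache = [v0]
    oracle_calls = 1
    for _ in range(iters):
        if phi <= tol:
            break
        g = grad(x)
        v = next((u for u in cache if g @ (x - u) >= phi / K), None)
        if v is None:
            oracle_calls += 1
            u = lp_oracle(g)
            if g @ (x - u) >= phi / K:
                v = u
                cache.append(u)
            else:
                phi /= 2.0             # no vertex improves enough: lower the target
                continue
        d = v - x
        gamma = min(1.0, (g @ (x - v)) / (lipschitz * (d @ d)))  # short step
        x = x + gamma * d
    return x, oracle_calls
```

For example, with `grad = lambda x: x - b` (the gradient of `0.5 * ||x - b||^2`) and a starting vertex of the simplex, the returned `oracle_calls` can be compared with the number of iterations to see how often the cache answered instead of the oracle.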

Semi-Analytic Resampling in Lasso
Lassoにおける半解析的リサンプリング

An approximate method for conducting resampling in Lasso, the $\ell_1$ penalized linear regression, in a semi-analytic manner is developed, whereby the average over the resampled datasets is directly computed without repeated numerical sampling, thus enabling an inference free of the statistical fluctuations due to sampling finiteness, as well as a significant reduction of computational time. The proposed method is based on a message passing type algorithm, and its fast convergence is guaranteed by the state evolution analysis, when covariates are provided as zero-mean independently and identically distributed Gaussian random variables. It is employed to implement bootstrapped Lasso (Bolasso) and stability selection, both of which are variable selection methods using resampling in conjunction with Lasso, and resolves their disadvantage regarding computational cost. To examine approximation accuracy and efficiency, numerical experiments were carried out using simulated datasets. Moreover, an application to a real-world dataset, the wine quality dataset, is presented. To process such real-world datasets, an objective criterion for determining the relevance of selected variables is also introduced by the addition of noise variables and resampling. MATLAB codes implementing the proposed method are distributed in (Obuchi, 2018).

$\ell_1$ペナルティ付き線形回帰であるLassoでリサンプリングを半解析的に行う近似法が開発されました。この方法では、数値サンプリングを繰り返すことなく、リサンプリングされたデータセットの平均を直接計算するため、サンプリングの有限性による統計的変動のない推論が可能になり、計算時間も大幅に短縮されます。提案された方法はメッセージパッシング型アルゴリズムに基づいており、共変量が平均ゼロで独立かつ同一に分布するガウス確率変数として与えられた場合、状態進化解析によって高速収束が保証されます。この方法は、Lassoと組み合わせてリサンプリングを使用する変数選択法であるブートストラップLasso (Bolasso)と安定性選択を実装するために使用され、計算コストに関する欠点を解決します。近似の精度と効率を調べるために、シミュレーションデータセットを使用して数値実験を行いました。さらに、実際のデータセットであるワインの品質データセットへの適用を示します。このような現実世界のデータセットを処理するために、ノイズ変数の追加と再サンプリングによって、選択された変数の関連性を決定するための客観的な基準も導入されます。提案された方法を実装するMATLABコードは、(Obuchi、2018)で配布されています。

On Consistent Vertex Nomination Schemes
一貫性のある頂点指名スキームについて

Given a vertex of interest in a network $G_1$, the vertex nomination problem seeks to find the corresponding vertex of interest (if it exists) in a second network $G_2$. A vertex nomination scheme produces a list of the vertices in $G_2$, ranked according to how likely they are judged to be the corresponding vertex of interest in $G_2$. The vertex nomination problem and related information retrieval tasks have attracted much attention in the machine learning literature, with numerous applications to social and biological networks. However, the current framework has often been confined to a comparatively small class of network models, and the concept of statistically consistent vertex nomination schemes has been only shallowly explored. In this paper, we extend the vertex nomination problem to a very general statistical model of graphs. Further, drawing inspiration from the long-established classification framework in the pattern recognition literature, we provide definitions for the key notions of Bayes optimality and consistency in our extended vertex nomination framework, including a derivation of the Bayes optimal vertex nomination scheme. In addition, we prove that no universally consistent vertex nomination schemes exist. Illustrative examples are provided throughout.

ネットワーク$G_1$内の関心頂点が与えられた場合、頂点指名問題は、2番目のネットワーク$G_2$内の対応する関心頂点(存在する場合)を見つけようとします。頂点指名スキームは、$G_2$内の対応する関心頂点であると判断される可能性に従ってランク付けされた、$G_2$内の頂点のリストを生成します。頂点指名問題と関連する情報検索タスクは、機械学習の文献で大きな注目を集めており、社会ネットワークや生物ネットワークに多数の応用があります。ただし、現在のフレームワークは、比較的小さなクラスのネットワークモデルに限定されていることが多く、統計的に一貫性のある頂点指名スキームの概念は浅くしか研究されていません。この論文では、頂点指名問題をグラフの非常に一般的な統計モデルに拡張します。さらに、パターン認識の文献で長年確立されている分類フレームワークからインスピレーションを得て、ベイズ最適頂点指定スキームの導出を含む、拡張頂点指定フレームワークにおけるベイズ最適性と一貫性の主要概念の定義を提供します。さらに、普遍的に一貫性のある頂点指定スキームは存在しないことを証明します。全体を通して、説明例が提供されます。

Variance-based Regularization with Convex Objectives
凸目的関数による分散ベースの正則化

We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen’s empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.

私たちは、分散の凸型代理を提供するリスク最小化と確率的最適化へのアプローチを開発し、近似誤差と推定誤差の間の最適で計算効率の高い取引を可能にします。私たちのアプローチは、分布的にロバストな最適化とOwenの経験的尤度の手法に基づいており、推定量の理論的性能を特徴付ける多数の有限サンプルおよび漸近的な結果を提供します。特に、私たちの手順には最適性の証明書が付属しており、バイアスと分散を自動的にバランスさせることにより、経験的なリスク最小化よりも(一部のシナリオでは)速い収束率を達成します。実際には、推定量が実際にトレーニングサンプルの分散と絶対パフォーマンスの間で取引を行い、多くの分類問題に対する標準的な経験的リスクの最小化よりもサンプル外(テスト)パフォーマンスが向上することを示す、裏付けとなる経験的証拠を提供します。

Learnability of Solutions to Conjunctive Queries
接続クエリに対する解の学習可能性

The problem of learning the solution space of an unknown formula has been studied in multiple embodiments in computational learning theory. In this article, we study a family of such learning problems; this family contains, for each relational structure, the problem of learning the solution space of an unknown conjunctive query evaluated on the structure. A progression of results aimed to classify the learnability of each of the problems in this family, and thus far a culmination thereof was a positive learnability result generalizing all previous ones. This article completes the classification program towards which this progression of results strived, by presenting a negative learnability result that complements the mentioned positive learnability result. In addition, a further negative learnability result is exhibited, which indicates a dichotomy within the problems to which the first negative result applies. In order to obtain our negative results, we make use of universal-algebraic concepts.

未知の式の解空間を学習する問題は、計算学習理論の複数の実施形態で研究されてきました。この記事では、そのような学習問題のファミリーを研究します。このファミリーには、各関係構造について、構造で評価される未知の結合クエリの解空間を学習する問題が含まれます。一連の結果は、このファミリーの各問題の学習可能性を分類することを目的としており、これまでのところ、その集大成は、以前のすべての結果を一般化する肯定的な学習可能性の結果でした。この記事では、前述の肯定的な学習可能性の結果を補完する否定的な学習可能性の結果を提示することにより、この一連の結果が目指した分類プログラムを完了します。さらに、最初の否定的な結果が適用される問題内の二分法を示す、さらなる否定的な学習可能性の結果が示されています。否定的な結果を得るために、普遍的な代数の概念を使用します。

Proximal Distance Algorithms: Theory and Practice
近位距離アルゴリズム:理論と実践

Proximal distance algorithms combine the classical penalty method of constrained minimization with distance majorization. If $f(x)$ is the loss function, and $C$ is the constraint set in a constrained minimization problem, then the proximal distance principle mandates minimizing the penalized loss $f(x)+\frac{\rho}{2}dist(x,C)^2$ and following the solution $x_{\rho}$ to its limit as $\rho$ tends to $\infty$. At each iteration the squared Euclidean distance $dist(x,C)^2$ is majorized by the spherical quadratic $\|x-P_C(x_k)\|^2$, where $P_C(x_k)$ denotes the projection of the current iterate $x_k$ onto $C$. The minimum of the surrogate function $f(x)+\frac{\rho}{2}\|x-P_C(x_k)\|^2$ is given by the proximal map $prox_{\rho^{-1}f}[P_C(x_k)]$. The next iterate $x_{k+1}$ automatically decreases the original penalized loss for fixed $\rho$. Since many explicit projections and proximal maps are known, it is straightforward to derive and implement novel optimization algorithms in this setting. These algorithms can take hundreds if not thousands of iterations to converge, but the simple nature of each iteration makes proximal distance algorithms competitive with traditional algorithms. For convex problems, proximal distance algorithms reduce to proximal gradient algorithms and therefore enjoy well understood convergence properties. For nonconvex problems, one can attack convergence by invoking Zangwill’s theorem. Our numerical examples demonstrate the utility of proximal distance algorithms in various high-dimensional settings, including a) linear programming, b) constrained least squares, c) projection to the closest kinship matrix, d) projection onto a second-order cone constraint, e) calculation of Horn’s copositive matrix index, f) linear complementarity programming, and g) sparse principal components analysis. The proximal distance algorithm in each case is competitive or superior in speed to traditional methods such as the interior point method and the alternating direction method of multipliers (ADMM). Source code for the numerical examples can be found at https://github.com/klkeys/proxdist.

近接距離アルゴリズムは、制約付き最小化の古典的なペナルティ法と距離のメジャー化を組み合わせたものです。$f(x)$が損失関数で、$C$が制約付き最小化問題の制約セットである場合、近接距離原理は、ペナルティ付き損失$f(x)+\frac{\rho}{2}dist(x,C)^2$を最小化し、$\rho$が$\infty$に近づくにつれて解$x_{\rho}$をその極限まで追従することを義務付けます。各反復で、2乗ユークリッド距離$dist(x,C)^2$は球面二次方程式$\|x-P_C(x_k)\|^2$によってメジャー化されます。ここで、$P_C(x_k)$は現在の反復$x_k$の$C$への射影を表します。代理関数$f(x)+\frac{\rho}{2}\|x-P_C(x_k)\|^2$の最小値は、近似写像$prox_{\rho^{-1}f}[P_C(x_k)]$によって与えられます。次の反復$x_{k+1}$は、固定$\rho$に対して元のペナルティ損失を自動的に減少させます。多くの明示的な投影と近似写像が知られているため、この設定で新しい最適化アルゴリズムを導出して実装するのは簡単です。これらのアルゴリズムは収束するのに数百、場合によっては数千の反復が必要になることがありますが、各反復の単純な性質により、近似距離アルゴリズムは従来のアルゴリズムと競合します。凸問題の場合、近似距離アルゴリズムは近似勾配アルゴリズムに簡略化されるため、よく理解された収束特性を享受できます。非凸問題の場合、ザングウィルの定理を呼び出すことで収束を攻撃できます。私たちの数値例は、さまざまな高次元設定における近似距離アルゴリズムの有用性を示しています。これには、a)線形計画法、b)制約付き最小二乗法、c)最も近い血縁関係行列への射影、d) 2次円錐制約への射影、e)ホーンの共正行列指数の計算、f)線形相補性計画法、g)スパース主成分分析が含まれます。いずれの場合も、近似距離アルゴリズムは、内点法や交互方向乗数法(ADMM)などの従来の方法と同等か、それよりも速度が優れています。数値例のソースコードは、https://github.com/klkeys/proxdistにあります。
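
The iteration $x_{k+1} = prox_{\rho^{-1}f}[P_C(x_k)]$ described above can be sketched in a few lines. The toy problem is an assumption made for illustration: $f(x) = \frac{1}{2}\|x-y\|^2$ with $C$ the nonnegative orthant, so both the proximal map and the projection have closed forms. The geometric schedule for $\rho$ (`rho0`, `growth`) is also an illustrative choice, not one taken from the paper.

```python
import numpy as np

def proximal_distance(prox_f, project, x0, rho0=1.0, growth=1.05, iters=300):
    """Iterate x_{k+1} = prox_{f/rho}(P_C(x_k)) while slowly increasing rho.

    prox_f(p, rho) must return argmin_x f(x) + (rho / 2) * ||x - p||^2.
    """
    x, rho = np.asarray(x0, dtype=float), rho0
    for _ in range(iters):
        x = prox_f(project(x), rho)   # minimize the surrogate built at x_k
        rho *= growth                 # follow the solution path as rho grows
    return x

# Toy instance: f(x) = 0.5 * ||x - y||^2 and C = nonnegative orthant.
y = np.array([1.5, -2.0, 0.3, -0.1])
prox_f = lambda p, rho: (y + rho * p) / (1.0 + rho)   # closed-form proximal map
project = lambda x: np.maximum(x, 0.0)                # projection onto C
x_hat = proximal_distance(prox_f, project, np.zeros_like(y))
```

In this toy case the constrained minimizer is simply `np.maximum(y, 0)`, which makes it easy to check that `x_hat` approaches it as `rho` grows.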

Active Learning for Cost-Sensitive Classification
コストに敏感な分類のためのアクティブラーニング

We design an active learning algorithm for cost-sensitive multiclass classification: problems where different errors have different costs. Our algorithm, COAL, makes predictions by regressing to each label’s cost and predicting the smallest. On a new example, it uses a set of regressors that perform well on past data to estimate possible costs for each label. It queries only the labels that could be the best, ignoring the sure losers. We prove COAL can be efficiently implemented for any regression family that admits squared loss optimization; it also enjoys strong guarantees with respect to predictive performance and labeling effort. We empirically compare COAL to passive learning and several active learning baselines, showing significant improvements in labeling effort and test cost on real-world datasets.

私たちは、コストに敏感な多クラス分類(エラーが異なればコストも異なる問題)のためのアクティブラーニングアルゴリズムを設計します。当社のアルゴリズムであるCOALは、各ラベルのコストに回帰し、最小のラベルを予測することで予測を行います。新しい例では、過去のデータで優れたパフォーマンスを発揮する一連のリグレッサーを使用して、各ラベルの可能なコストを見積もります。最良である可能性のあるラベルのみをクエリし、確実な敗者を無視します。COALは、二乗損失の最適化を認める任意の回帰ファミリーに対して効率的に実装できることを証明します。また、予測パフォーマンスとラベリング作業に関しても強力な保証を享受しています。私たちは、COALを受動的学習およびいくつかの能動的学習ベースラインと経験的に比較し、実世界のデータセットでのラベリング作業とテストコストの大幅な改善を示しています。
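
The two rules stated in the abstract, predicting the label with the smallest regressed cost and querying only labels that could still be the best, can be written down directly. This is a simplified sketch: the cost ranges `cost_low` and `cost_high` are taken as given inputs here, whereas in the paper they would come from the set of regressors that perform well on past data, and the full COAL algorithm may apply further conditions before querying. The function names are hypothetical.

```python
import numpy as np

def predict_label(cost_estimates):
    """Predict the label whose estimated cost is smallest."""
    return int(np.argmin(cost_estimates))

def labels_to_query(cost_low, cost_high):
    """Return the labels that could still be the cheapest given cost ranges.

    A label is a sure loser if its lowest plausible cost exceeds the
    highest plausible cost of some other label; such labels are skipped.
    """
    cost_low = np.asarray(cost_low, dtype=float)
    cost_high = np.asarray(cost_high, dtype=float)
    best_upper = cost_high.min()
    return [int(i) for i in np.flatnonzero(cost_low <= best_upper)]
```

With ranges `[0.1, 0.3]`, `[0.5, 0.9]`, and `[0.25, 0.6]` for three labels, the second label can never be the cheapest, so only the first and third would be queried.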

A Representer Theorem for Deep Kernel Learning
深層カーネル学習のための表現定理

In this paper we provide a finite-sample and an infinite-sample representer theorem for the concatenation of (linear combinations of) kernel functions of reproducing kernel Hilbert spaces. These results serve as mathematical foundation for the analysis of machine learning algorithms based on compositions of functions. As a direct consequence in the finite-sample case, the corresponding infinite-dimensional minimization problems can be recast into (nonlinear) finite-dimensional minimization problems, which can be tackled with nonlinear optimization algorithms. Moreover, we show how concatenated machine learning problems can be reformulated as neural networks and how our representer theorem applies to a broad class of state-of-the-art deep learning methods.

この論文では、再生カーネルヒルベルト空間のカーネル関数の連結(線形結合)のための有限サンプルと無限サンプルの表現定理を提供します。これらの結果は、関数の構成に基づく機械学習アルゴリズムの分析のための数学的基礎として機能します。有限サンプルの場合の直接的な結果として、対応する無限次元の最小化問題を(非線形)有限次元の最小化問題に作り直すことができ、非線形最適化アルゴリズムで取り組むことができます。さらに、連結された機械学習の問題をニューラルネットワークとして再定式化する方法と、表現定理が最先端のディープラーニング手法の幅広いクラスにどのように適用されるかを示します。

Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks
区分線形ニューラルネットワークのためのほぼタイトなVC次元と擬似次元の範囲

We prove new upper and lower bounds on the VC-dimension of deep neural networks with the ReLU activation function. These bounds are tight for almost the entire range of parameters. Letting $W$ be the number of weights and $L$ be the number of layers, we prove that the VC-dimension is $O(W L \log(W))$, and provide examples with VC-dimension $\Omega( W L \log(W/L) )$. This improves both the previously known upper bounds and lower bounds. In terms of the number $U$ of non-linear units, we prove a tight bound $\Theta(W U)$ on the VC-dimension. All of these bounds generalize to arbitrary piecewise linear activation functions, and also hold for the pseudodimensions of these function classes. Combined with previous results, this gives an intriguing range of dependencies of the VC-dimension on depth for networks with different non-linearities: there is no dependence for piecewise-constant, linear dependence for piecewise-linear, and no more than quadratic dependence for general piecewise-polynomial.

私たちは、ReLU活性化関数を用いたディープニューラルネットワークのVC次元の新しい上限と下限を証明します。これらの上限と下限は、ほぼすべてのパラメータ範囲で厳密です。重みの数を$W$、層の数を$L$とすると、VC次元が$O(W L \log(W))$であることを証明し、VC次元$\Omega( W L \log(W/L) )$の例を示します。これにより、既知の上限と下限の両方が改善されます。非線形ユニットの数$U$に関して、VC次元の厳密な上限$\Theta(W U)$を証明します。これらの上限はすべて、任意の区分線形活性化関数に一般化され、これらの関数クラスの疑似次元にも当てはまります。以前の結果と組み合わせると、異なる非線形性を持つネットワークの深さに対するVC次元の依存性の興味深い範囲が示されます。区分定数には依存性がなく、区分線形には線形依存性があり、一般的な区分多項式には2次依存性しかありません。

Multi-scale Online Learning: Theory and Applications to Online Auctions and Pricing
マルチスケール・オンライン学習:理論とオンラインオークション・価格設定への応用

We consider revenue maximization in online auction/pricing problems. A seller sells an identical item in each period to a new buyer, or a new set of buyers. For the online pricing problem, both when the arriving buyer bids or only responds to the posted price, we design algorithms whose regret bounds scale with the best fixed price in-hindsight, rather than the range of the values. Under the bidding model, we further show our algorithms achieve a revenue convergence rate that matches the offline sample complexity of the single-item single-buyer auction. We also show regret bounds that are scale free, and match the offline sample complexity, when comparing to a benchmark that requires a lower bound on the market share. We further expand our results beyond pricing to multi-buyer auctions, and obtain online learning algorithms for auctions, with convergence rates matching the known sample complexity upper bound of online single-item multi-buyer auctions. These results are obtained by generalizing the classical learning from experts and multi-armed bandit problems to their multi-scale versions. In this version, the reward of each action is in a different range, and the regret with respect to a given action scales with its own range, rather than the maximum range. We obtain almost optimal multi-scale regret bounds by introducing a new Online Mirror Descent (OMD) algorithm whose mirror map is the multi-scale version of the negative entropy function. We further generalize to the bandit setting by introducing the stochastic variant of this OMD algorithm.

私たちは、オンラインオークション/価格設定問題における収益最大化について考察します。売り手は各期間に同一の商品を新しい買い手、または新しい買い手グループに販売します。オンライン価格設定問題では、到着した買い手が入札する場合も、掲載価格にのみ応答する場合も、値の範囲ではなく、事後的に最適な固定価格に後悔境界がスケールするアルゴリズムを設計します。入札モデルでは、さらに、アルゴリズムが単一アイテム単一買い手オークションのオフラインサンプル複雑度に一致する収益収束率を達成することを示します。また、市場シェアの下限を必要とするベンチマークと比較した場合、スケールフリーでオフラインサンプル複雑度に一致する後悔境界を示します。さらに、結果を価格設定を超えて複数買い手オークションに拡張し、オンライン単一アイテム複数買い手オークションの既知のサンプル複雑度上限に一致する収束率を持つオークションのオンライン学習アルゴリズムを取得します。これらの結果は、専門家からの古典的な学習と多腕バンディット問題をそれらのマルチスケールバージョンに一般化することによって得られます。このバージョンでは、各アクションの報酬は異なる範囲にあり、特定のアクションに関する後悔は最大範囲ではなく、そのアクション自体の範囲でスケールします。ミラーマップが負のエントロピー関数のマルチスケールバージョンである新しいオンラインミラーディセント(OMD)アルゴリズムを導入することで、ほぼ最適なマルチスケールの後悔境界が得られます。このOMDアルゴリズムの確率的変種を導入することで、バンディット設定にさらに一般化します。

The Sup-norm Perturbation of HOSVD and Low Rank Tensor Denoising
HOSVDのsupノルム摂動と低ランクテンソルノイズ除去

The higher order singular value decomposition (HOSVD) of tensors is a generalization of matrix SVD. The perturbation analysis of HOSVD under random noise is more delicate than its matrix counterpart. Recently, polynomial time algorithms have been proposed where statistically optimal estimates of the singular subspaces and the low rank tensors are attainable in the Euclidean norm. In this article, we analyze the sup-norm perturbation bounds of HOSVD and introduce estimators of the singular subspaces with sharp deviation bounds in the sup-norm. We also investigate a low rank tensor denoising estimator and demonstrate its fast convergence rate with respect to the entry-wise errors. The sup-norm perturbation bounds reveal unconventional phase transitions for statistical learning applications such as the exact clustering in high dimensional Gaussian mixture model and the exact support recovery in sub-tensor localizations. In addition, the bounds established for HOSVD also elaborate the one-sided sup-norm perturbation bounds for the singular subspaces of unbalanced (or fat) matrices.

テンソルの高次特異値分解(HOSVD)は、行列SVDの一般化です。ランダムノイズ下でのHOSVDの摂動解析は、行列SVDよりも繊細です。最近、ユークリッドノルムで統計的に最適な特異部分空間と低ランクテンソルの推定値が得られる多項式時間アルゴリズムが提案されました。この記事では、HOSVDのsupノルム摂動境界を分析し、supノルムでシャープな偏差境界を持つ特異部分空間の推定量を紹介します。また、低ランクテンソルのノイズ除去推定量を調査し、エントリごとの誤差に対する収束速度が速いことを実証します。supノルム摂動境界は、高次元ガウス混合モデルでの正確なクラスタリングやサブテンソルローカリゼーションでの正確なサポート回復など、統計学習アプリケーションに対する型破りな相転移を明らかにします。さらに、HOSVDに対して確立された境界は、不均衡(または太い)行列の特異部分空間の片側supノルム摂動境界も詳しく説明します。
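
For readers unfamiliar with HOSVD, the following sketch shows the standard construction: unfold the tensor along each mode, take the leading left singular vectors, and project. It illustrates plain truncated HOSVD used as a simple denoiser; the estimators with sup-norm guarantees analyzed in the paper are not reproduced here, and the helper names are chosen for illustration.

```python
import numpy as np

def unfold(T, mode):
    """Mode-`mode` matricization: rows index the chosen mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    out = np.tensordot(M, T, axes=(1, mode))   # contracted mode becomes axis 0
    return np.moveaxis(out, 0, mode)

def truncated_hosvd(T, ranks):
    """Return (core, factors); factors[k] holds the top ranks[k] left
    singular vectors of the mode-k unfolding."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)   # project onto each subspace
    return core, factors

def reconstruct(core, factors):
    """Rebuild the (low-rank) tensor from its core and factor matrices."""
    T = core
    for mode, U in enumerate(factors):
        T = mode_product(T, U, mode)
    return T
```

Applying `reconstruct(*truncated_hosvd(Y, ranks))` to a noisy observation `Y` of a low-rank tensor gives a basic denoised estimate; for an exactly low-rank input with matching `ranks`, the reconstruction recovers the input up to floating-point error.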

Robust Estimation of Derivatives Using Locally Weighted Least Absolute Deviation Regression
局所加重最小絶対偏差回帰を用いた導関数のロバスト推定

In nonparametric regression, the derivative estimation has attracted much attention in recent years due to its wide applications. In this paper, we propose a new method for the derivative estimation using the locally weighted least absolute deviation regression. Different from the local polynomial regression, the proposed method does not require a finite variance for the error term and so is robust to the presence of heavy-tailed errors. Meanwhile, it does not require a zero median or a positive density at zero for the error term in comparison with the local median regression. We further show that the proposed estimator with random difference is asymptotically equivalent to the (infinitely) composite quantile regression estimator. In other words, running one regression is equivalent to combining infinitely many quantile regressions. In addition, the proposed method is also extended to estimate the derivatives at the boundaries and to estimate higher-order derivatives. For the equidistant design, we derive theoretical results for the proposed estimators, including the asymptotic bias and variance, consistency, and asymptotic normality. Finally, we conduct simulation studies to demonstrate that the proposed method has better performance than the existing methods in the presence of outliers and heavy-tailed errors, and analyze the Chinese house price data for the past ten years to illustrate the usefulness of the proposed method.

ノンパラメトリック回帰において、導関数推定は、その幅広い応用により近年注目を集めています。この論文では、局所的に重み付けされた最小絶対偏差回帰を用いた導関数推定の新しい方法を提案します。局所多項式回帰とは異なり、提案された方法は、誤差項に有限の分散を必要としないため、裾の重い誤差の存在に対して堅牢です。一方、局所中央値回帰と比較して、誤差項にゼロ中央値またはゼロでの正の密度を必要としません。さらに、ランダム差を持つ提案された推定量は、（無限に）複合分位回帰推定量と漸近的に同等であることを示します。言い換えると、1つの回帰を実行することは、無限に多くの分位回帰を組み合わせることと同等です。さらに、提案された方法は、境界での導関数の推定と高次の導関数の推定にも拡張されています。等距離設計については、漸近的バイアスと分散、一貫性、漸近的正規性など、提案された推定量の理論的結果を導き出します。最後に、外れ値と裾の重い誤差がある場合に、提案された方法が既存の方法よりも優れたパフォーマンスを発揮することを実証するためのシミュレーション研究を実施し、提案された方法の有用性を示すために過去10年間の中国の住宅価格データを分析します。

Kernel Approximation Methods for Speech Recognition
音声認識のためのカーネル近似法

We study the performance of kernel methods on the acoustic modeling task for automatic speech recognition, and compare their performance to deep neural networks (DNNs). To scale the kernel methods to large data sets, we use the random Fourier feature method of Rahimi and Recht (2007). We propose two novel techniques for improving the performance of kernel acoustic models. First, we propose a simple but effective feature selection method which reduces the number of random features required to attain a fixed level of performance. Second, we present a number of metrics which correlate strongly with speech recognition performance when computed on the heldout set; we attain improved performance by using these metrics to decide when to stop training. Additionally, we show that the linear bottleneck method of Sainath et al. (2013a) improves the performance of our kernel models significantly, in addition to speeding up training and making the models more compact. Leveraging these three methods, the kernel methods attain token error rates between $0.5\%$ better and $0.1\%$ worse than fully-connected DNNs across four speech recognition data sets, including the TIMIT and Broadcast News benchmark tasks.

私たちは、自動音声認識のための音響モデリングタスクにおけるカーネル法のパフォーマンスを調査し、そのパフォーマンスをディープニューラルネットワーク(DNN)と比較します。カーネル法を大規模なデータセットに拡張するために、RahimiとRecht (2007)のランダムフーリエ特徴法を使用します。カーネル音響モデルのパフォーマンスを向上させる2つの新しい手法を提案します。まず、一定レベルのパフォーマンスを達成するために必要なランダム特徴の数を減らす、シンプルだが効果的な特徴選択方法を提案します。次に、ホールドアウトセットで計算された音声認識パフォーマンスと強く相関するいくつかのメトリックを提示します。これらのメトリックを使用してトレーニングを停止するタイミングを決定することで、パフォーマンスが向上します。さらに、Sainathら(2013a)の線形ボトルネック法によって、トレーニングが高速化され、モデルがよりコンパクトになるだけでなく、カーネルモデルのパフォーマンスが大幅に向上することを示します。これら3つの方法を活用することで、カーネルメソッドは、TIMITおよびBroadcast Newsベンチマークタスクを含む4つの音声認識データセット全体で、完全接続DNNよりも$0.5\%$良好から$0.1\%$不良までのトークンエラー率を達成します。
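
The random Fourier feature method of Rahimi and Recht (2007) that the abstract relies on can be sketched as follows for the Gaussian kernel. The kernel choice, the bandwidth parameter `sigma`, and the function name are assumptions made for this illustration; the feature selection and early-stopping techniques proposed in the paper are not shown.

```python
import numpy as np

def random_fourier_features(X, n_features, sigma, seed=0):
    """Map X (n x d) to features whose inner products approximate the
    Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, n_features)) / sigma     # frequencies ~ N(0, I / sigma^2)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)   # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```

A linear model trained on `random_fourier_features(X, D, sigma)` then behaves approximately like a kernel machine, at a cost that grows linearly with the number of training examples rather than quadratically.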

The Common-directions Method for Regularized Empirical Risk Minimization
正則化経験的リスク最小化のための共通方向法

State-of-the-art first- and second-order optimization methods are able to achieve either fast global linear convergence rates or quadratic convergence, but not both of them. In this work, we propose an interpolation between first- and second-order methods for regularized empirical risk minimization that exploits the problem structure to efficiently combine multiple update directions. Our method attains both optimal global linear convergence rate for first-order methods, and local quadratic convergence. Experimental results show that our method outperforms state-of-the-art first- and second-order optimization methods in terms of the number of data accesses, while remaining competitive in training time.

最先端の1次最適化手法と2次最適化手法は、高速なグローバル線形収束率または2次収束を達成できますが、両方を達成することはできません。この研究では、問題構造を利用して複数の更新方向を効率的に組み合わせる、正則化された経験的リスク最小化のための一次法と二次法の補間を提案します。私たちの方法は、1次法の最適なグローバル線形収束率と局所的な二次収束の両方を達成します。実験結果によると、本手法はデータアクセス数において最先端の1次および2次最適化手法を凌駕し、学習時間においても競争力があることが分かりました。

Multi-class Heterogeneous Domain Adaptation
マルチクラス異種ドメイン適応

A crucial issue in heterogeneous domain adaptation (HDA) is the ability to learn a feature mapping between different types of features across domains. Inspired by language translation, where a word in one language corresponds to only a few words in another language, we present an efficient method named Sparse Heterogeneous Feature Representation (SHFR) for multi-class HDA that learns a sparse feature transformation between domains with multiple classes. Specifically, we formulate the problem of learning the feature transformation as a compressed sensing problem by building multiple binary classifiers in the target domain as various measurement sensors, which are decomposed from the target multi-class classification problem. We show that the estimation error of the learned transformation decreases with the increasing number of binary classifiers. In other words, for adaptation across heterogeneous domains to be successful, it is necessary to construct a sufficient number of incoherent binary classifiers from the original multi-class classification problem. To achieve this, we propose to apply the error-correcting output codes (ECOC) scheme to generate incoherent classifiers. To speed up the learning of the feature transformation across domains, we apply an efficient batch-mode algorithm to solve the resultant nonnegative sparse recovery problem. Theoretically, we present a generalization error bound of our proposed HDA method under a multi-class setting. Lastly, we conduct extensive experiments on both synthetic and real-world datasets to demonstrate the superiority of our proposed method over existing state-of-the-art HDA methods in terms of prediction accuracy and training efficiency.

異種ドメイン適応(HDA)における重要な問題は、ドメイン間で異なるタイプの機能間の機能マッピングを学習する能力です。言語翻訳にヒントを得て、ある言語から翻訳された単語は、別の言語ではほんの数語に対応します。この論文では、マルチクラスHDAが複数のクラスを持つドメイン間のスパース機能変換を学習するための、スパース異種機能表現(SHFR)という効率的な方法を紹介します。具体的には、ターゲットドメインに複数のバイナリ分類器をさまざまな測定センサーとして構築し、ターゲットマルチクラス分類問題から分解することにより、機能変換を学習する問題を圧縮センシング問題として定式化します。学習した変換の推定誤差は、バイナリ分類器の数が増えるにつれて減少することを示します。言い換えると、異種ドメイン間の適応を成功させるには、元のマルチクラス分類問題から十分な数のインコヒーレントなバイナリ分類器を構築する必要があります。これを実現するために、エラー訂正出力訂正(ECOC)スキームを適用してインコヒーレントな分類器を生成することを提案します。ドメイン間での特徴変換の学習を高速化するために、結果として生じる非負スパース回復問題を解決するために、効率的なバッチモードアルゴリズムを適用します。理論的には、マルチクラス設定での提案HDA方法の一般化誤差境界を示します。最後に、合成データセットと実世界のデータセットの両方で広範な実験を行い、予測精度とトレーニング効率の点で、提案方法が既存の最先端のHDA方法よりも優れていることを実証します。

Near Optimal Frequent Directions for Sketching Dense and Sparse Matrices
密集した行列とまばらな行列をスケッチするためのほぼ最適な頻繁な方向

Given a large matrix $A\in\mathbb{R}^{n\times d}$, we consider the problem of computing a sketch matrix $B\in\mathbb{R}^{\ell\times d}$ which is significantly smaller than but still well approximates $A$. We consider the problems in the streaming model, where the algorithm can only make one pass over the input with limited working space, and we are interested in minimizing the covariance error $\|A^TA-B^TB\|_2$. The popular Frequent Directions algorithm of Liberty (2013) and its variants achieve optimal space-error tradeoffs. However, whether the running time can be improved remains an unanswered question. In this paper, we almost settle the question by proving that the time complexity of this problem is equivalent to that of matrix multiplication up to lower order terms. Specifically, we provide new space-optimal algorithms with faster running times and also show that the running times of our algorithms can be improved if and only if the state-of-the-art running time of matrix multiplication can be improved significantly.

大きな行列$A\in\mathbb{R}^{n\times d}$が与えられたとき、$A$より大幅に小さいが、それでも$A$をよく近似するスケッチ行列$B\in\mathbb{R}^{\ell\times d}$を計算する問題について考えます。ストリーミングモデルの問題を検討します。このモデルでは、アルゴリズムは限られた作業スペースで入力を1回しか通過できず、共分散誤差$\|A^TA-B^TB\|_2$を最小化することに関心があります。Liberty (2013)の一般的なFrequent Directionsアルゴリズムとそのバリエーションは、最適なスペースと誤差のトレードオフを実現します。ただし、実行時間を改善できるかどうかは未解決の問題です。この論文では、この問題の時間計算量が低次の項までの行列乗算の時間計算量と同等であることを証明することで、この問題をほぼ解決します。具体的には、実行時間が短縮された新しい空間最適化アルゴリズムを提供し、また、最先端の行列乗算の実行時間を大幅に改善できる場合にのみ、アルゴリズムの実行時間を改善できることも示しています。
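
As background for the abstract above, here is a minimal sketch of the basic Frequent Directions procedure. It is the simple textbook variant that performs one SVD per incoming row and assumes $\ell \le d$; it is not one of the faster algorithms the paper proposes. For this variant the covariance error satisfies $\|A^TA-B^TB\|_2 \le \|A\|_F^2/\ell$.

```python
import numpy as np

def frequent_directions(A, ell):
    """One-pass sketch B (ell x d) with ||A.T A - B.T B||_2 <= ||A||_F^2 / ell.

    Basic variant: one SVD per row, requires ell <= d.
    """
    n, d = A.shape
    if ell > d:
        raise ValueError("this simple variant requires ell <= d")
    B = np.zeros((ell, d))
    for row in A:                       # single pass over the stream of rows
        B[-1] = row                     # the last row of B is zero at this point
        _, s, Vt = np.linalg.svd(B, full_matrices=False)
        shrunk = np.sqrt(np.maximum(s**2 - s[-1]**2, 0.0))
        B = shrunk[:, None] * Vt        # the smallest direction is zeroed out
    return B
```

Each step subtracts the smallest squared singular value from all directions, which frees the last row of `B` for the next incoming row while losing only a bounded amount of covariance mass.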

Neural Architecture Search: A Survey
ニューラルアーキテクチャ検索:調査

Deep Learning has enabled remarkable progress over the last years on a variety of tasks, such as image recognition, speech recognition, and machine translation. One crucial aspect of this progress is novel neural architectures. Currently employed architectures have mostly been developed manually by human experts, which is a time-consuming and error-prone process. Because of this, there is growing interest in automated neural architecture search methods. We provide an overview of existing work in this field of research and categorize it according to three dimensions: search space, search strategy, and performance estimation strategy.

ディープラーニングは、画像認識、音声認識、機械翻訳など、さまざまなタスクにおいて、ここ数年で目覚ましい進歩を遂げてきました。この進歩のための重要な側面の1つは、新しいニューラルアーキテクチャです。現在採用されているアーキテクチャは、ほとんどが人間の専門家によって手動で開発されており、時間がかかり、エラーが発生しやすいプロセスです。このため、自動化されたニューラルアーキテクチャ探索(neural architecture search)手法への関心が高まっています。この研究分野における既存の研究の概要を提供し、検索空間、検索戦略、およびパフォーマンス推定戦略の3つの次元に従ってそれらを分類します。

Deep Reinforcement Learning for Swarm Systems
群集システムのための深層強化学習

Recently, deep reinforcement learning (RL) methods have been applied successfully to multi-agent scenarios. Typically, the observation vector for decentralized decision making is represented by a concatenation of the (local) information an agent gathers about other agents. However, concatenation scales poorly to swarm systems with a large number of homogeneous agents as it does not exploit the fundamental properties inherent to these systems: (i) the agents in the swarm are interchangeable and (ii) the exact number of agents in the swarm is irrelevant. Therefore, we propose a new state representation for deep multi-agent RL based on mean embeddings of distributions, where we treat the agents as samples and use the empirical mean embedding as input for a decentralized policy. We define different feature spaces of the mean embedding using histograms, radial basis functions and neural networks trained end-to-end. We evaluate the representation on two well-known problems from the swarm literature in a globally and locally observable setup. For the local setup we furthermore introduce simple communication protocols. Of all approaches, the mean embedding representation using neural network features enables the richest information exchange between neighboring agents, facilitating the development of complex collective strategies.

最近、深層強化学習(RL)手法がマルチエージェントシナリオにうまく適用されています。通常、分散型意思決定の観測ベクトルは、エージェントが他のエージェントについて収集する(ローカル)情報の連結によって表されます。ただし、連結は、多数の同質エージェントを含む群集システムにはうまく対応できません。これは、これらのシステムに固有の基本特性((i)群集内のエージェントは交換可能、(ii)群集内のエージェントの正確な数は無関係)を活用していないためです。したがって、分布の平均埋め込みに基づく深層マルチエージェントRLの新しい状態表現を提案します。エージェントをサンプルとして扱い、経験的平均埋め込みを分散ポリシーの入力として使用します。ヒストグラム、ラジアル基底関数、エンドツーエンドでトレーニングされたニューラルネットワークを使用して、平均埋め込みのさまざまな特徴空間を定義します。グローバルおよびローカルに観測可能なセットアップで、群集の文献からの2つのよく知られた問題で表現を評価します。さらに、ローカル設定では、シンプルな通信プロトコルを導入します。すべてのアプローチの中で、ニューラルネットワーク機能を使用した平均埋め込み表現により、近隣のエージェント間で最も豊富な情報交換が可能になり、複雑な集団戦略の開発が容易になります。
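
The mean-embedding state representation can be illustrated with the radial basis function variant mentioned in the abstract. In this sketch the RBF centers and bandwidth are assumed to be fixed design choices, and the function name is hypothetical; the histogram and end-to-end neural network feature spaces are not shown.

```python
import numpy as np

def rbf_mean_embedding(neighbors, centers, bandwidth=1.0):
    """Average RBF features over a set of neighbor observations.

    neighbors: (n_agents, obs_dim) array, one row per observed agent.
    centers:   (n_centers, obs_dim) array of RBF centers.
    Returns a vector of length n_centers regardless of n_agents.
    """
    diffs = neighbors[:, None, :] - centers[None, :, :]
    feats = np.exp(-np.sum(diffs**2, axis=-1) / (2.0 * bandwidth**2))
    return feats.mean(axis=0)           # averaging removes order and count
```

Because the features are averaged, the output does not change when the observed agents are reordered, and its length does not depend on how many agents are observed. These are the two swarm properties, (i) and (ii), that plain concatenation fails to exploit.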

Tunability: Importance of Hyperparameters of Machine Learning Algorithms
調整可能性:機械学習アルゴリズムのハイパーパラメータの重要性

Modern supervised machine learning algorithms involve hyperparameters that have to be set before running them. Options for setting hyperparameters are default values from the software package, manual configuration by the user or configuring them for optimal predictive performance by a tuning procedure. The goal of this paper is two-fold. Firstly, we formalize the problem of tuning from a statistical point of view, define data-based defaults and suggest general measures quantifying the tunability of hyperparameters of algorithms. Secondly, we conduct a large-scale benchmarking study based on 38 datasets from the OpenML platform and six common machine learning algorithms. We apply our measures to assess the tunability of their parameters. Our results yield default values for hyperparameters and enable users to decide whether it is worth conducting a possibly time consuming tuning strategy, to focus on the most important hyperparameters and to choose adequate hyperparameter spaces for tuning.

最新の教師あり機械学習アルゴリズムには、実行前に設定する必要があるハイパーパラメータが含まれます。ハイパーパラメータの設定オプションは、ソフトウェアパッケージのデフォルト値、ユーザーによる手動構成、またはチューニング手順によって最適な予測パフォーマンスを得るための構成です。この論文の目標は2つあります。まず、チューニングの問題を統計的観点から形式化し、データに基づくデフォルトを定義し、アルゴリズムのハイパーパラメータのチューニング可能性を定量化する一般的な尺度を提案します。次に、OpenMLプラットフォームの38のデータセットと6つの一般的な機械学習アルゴリズムに基づいて大規模なベンチマーク調査を実施します。その尺度を適用して、パラメータのチューニング可能性を評価します。結果からハイパーパラメータのデフォルト値が得られ、ユーザーは、時間のかかるチューニング戦略を実行する価値があるかどうか、最も重要なハイパーパラメータに重点を置くかどうか、チューニングに適切なハイパーパラメータ空間を選択するかどうかを判断できます。
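
A data-based default and a per-dataset tunability score can be computed from a table of risks. The definitions in this sketch are an assumption about how the abstract's measures could be instantiated: the default is taken to be the configuration with the lowest mean risk across datasets, and tunability on a dataset is the gap between the default's risk and the best risk found on that dataset. The paper's exact definitions and aggregation may differ.

```python
import numpy as np

def default_and_tunability(risk):
    """risk[j, c] = estimated risk of configuration c on dataset j.

    Returns the index of the data-based default (lowest mean risk across
    datasets) and, per dataset, the gain from tuning over that default.
    """
    risk = np.asarray(risk, dtype=float)
    default = int(np.argmin(risk.mean(axis=0)))
    tunability = risk[:, default] - risk.min(axis=1)
    return default, tunability
```

A dataset with tunability close to zero is one where the default is already near the best configuration, so a time-consuming tuning run would bring little benefit there.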

Thompson Sampling Guided Stochastic Searching on the Line for Deceptive Environments with Applications to Root-Finding Problems
欺瞞的な環境におけるThompson Samplingに導かれた直線上の確率的探索と求根問題への応用

The multi-armed bandit problem forms the foundation for solving a wide range of online stochastic optimization problems through a simple, yet effective mechanism. One simply casts the problem as a gambler who repeatedly pulls one out of N slot machine arms, eliciting random rewards. Learning of reward probabilities is then combined with reward maximization, by carefully balancing reward exploration against reward exploitation. In this paper, we address a particularly intriguing variant of the multi-armed bandit problem, referred to as the Stochastic Point Location (SPL) problem. The gambler is here only told whether the optimal arm (point) lies to the “left” or to the “right” of the arm pulled, with the feedback being erroneous with probability $1-\pi$. This formulation thus targets optimization in continuous action spaces with both informative and deceptive feedback. To tackle this class of problems, we formulate a compact and scalable Bayesian representation of the solution space that simultaneously captures both the location of the optimal arm as well as the probability of receiving correct feedback. We further introduce the accompanying Thompson Sampling guided Stochastic Point Location (TS-SPL) scheme for balancing exploration against exploitation. By learning $\pi$, TS-SPL also supports deceptive environments that are lying about the direction of the optimal arm. This, in turn, allows us to address the fundamental Stochastic Root Finding (SRF) problem. Empirical results demonstrate that our scheme deals with both deceptive and informative environments, significantly outperforming competing algorithms both for SRF and SPL.

多腕バンディット問題は、シンプルでありながら効果的なメカニズムを通じて、さまざまなオンライン確率最適化問題を解決するための基礎を形成します。この問題は、N本のスロットマシンのアームから1本を繰り返し引いてランダムな報酬を引き出すギャンブラーとして簡単に考えることができます。次に、報酬の探索と報酬の活用を慎重にバランスさせることで、報酬確率の学習と報酬の最大化が組み合わされます。この論文では、確率的ポイントロケーション(SPL)問題と呼ばれる、多腕バンディット問題の特に興味深いバリエーションを取り上げます。ギャンブラーには、最適なアーム(ポイント)が引いたアームの「左」にあるか「右」にあるかのみが伝えられ、フィードバックは確率$1-\pi$で誤りとなります。したがって、この定式化は、有益なフィードバックと欺瞞的なフィードバックの両方がある連続アクション空間での最適化を対象としています。このクラスの問題に取り組むために、最適なアームの位置と正しいフィードバックを受け取る確率の両方を同時に取得する、ソリューション空間のコンパクトでスケーラブルなベイジアン表現を定式化します。さらに、探索と活用のバランスをとるために、付随するThompson Samplingガイド付きStochastic Point Location (TS-SPL)スキームを紹介します。$\pi$を学習することにより、TS-SPLは最適なアームの方向について嘘をついている欺瞞的な環境もサポートします。これにより、基本的なStochastic Root Finding (SRF)問題に対処できるようになります。実験結果から、このスキームは欺瞞的な環境と有益な環境の両方に対処し、SRFとSPLの両方で競合アルゴリズムを大幅に上回っていることがわかります。
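
The exploration/exploitation balance described above can be illustrated with plain Thompson Sampling on a finite Bernoulli bandit. This is only a minimal sketch of the textbook Beta-Bernoulli scheme, not the TS-SPL scheme of the paper: the function name `thompson_sampling`, the uniform Beta(1, 1) prior, and the example reward probabilities are all assumptions made for illustration.

```python
import random


def thompson_sampling(true_probs, rounds, seed=0):
    """Beta-Bernoulli Thompson Sampling on a finite set of arms.

    true_probs: hidden success probability of each arm (illustrative values).
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    wins = [1] * n_arms    # Beta posterior parameter alpha (prior: 1)
    losses = [1] * n_arms  # Beta posterior parameter beta (prior: 1)
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Draw one plausible reward probability per arm from its posterior.
        samples = [rng.betavariate(wins[a], losses[a]) for a in range(n_arms)]
        # Pull the arm whose sampled probability is largest.
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Update the posterior of the pulled arm only.
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


pulls = thompson_sampling([0.2, 0.5, 0.8], rounds=2000)
print(pulls)
```

Arms with uncertain posteriors still get sampled occasionally, which is the exploration part; as evidence accumulates, the draws concentrate on the arm with the highest estimated reward.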

Bayesian Combination of Probabilistic Classifiers using Multivariate Normal Mixtures
多変量正規混合を使用した確率的分類器のベイジアン組み合わせ

Ensemble methods are a powerful tool, often outperforming individual prediction models. Existing Bayesian ensembles either do not model the correlations between sources, or they are only capable of combining non-probabilistic predictions. We propose a new model, which overcomes these disadvantages. Transforming the probabilistic predictions with the inverse additive logistic transformation allows us to model the correlations with multivariate normal mixtures. We derive an efficient Gibbs sampler for the proposed model and implement a regularization method to make it more robust. We compare our method to related work and the classical linear opinion pool. Empirical evaluation on several toy and real-world data sets, including a case study on air-pollution forecasting, shows that the method outperforms other methods, while being robust and easy to use.

アンサンブル法は強力なツールであり、多くの場合、個々の予測モデルよりも優れたパフォーマンスを発揮します。既存のベイジアンアンサンブルは、ソース間の相関をモデル化しないか、非確率的予測を組み合わせることしかできません。これらの欠点を克服する新しいモデルを提案します。逆加法ロジスティック変換を使用して確率的予測を変換すると、多変量正規混合によって相関をモデル化できます。提案モデルに対して効率的なGibbsサンプラーを導出し、それをより堅牢にするための正則化法を実装します。私たちの方法を関連研究や古典的な線形意見プールと比較します。大気汚染予測のケーススタディを含む、いくつかのトイデータセットおよび実世界のデータセットに対する実証的評価は、この方法が堅牢で使いやすく、かつ他の方法よりも優れていることを示しています。
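
The inverse additive logistic transformation mentioned above maps a probability vector from the simplex to an unconstrained real vector, where a multivariate normal mixture can be fitted. The sketch below shows the transformation pair only, not the paper's model. Using the last class as the reference component is an assumed convention, and the function names are hypothetical.

```python
import math


def inverse_additive_logistic(p):
    """Map a K-class probability vector (all entries > 0) to R^(K-1).

    Each coordinate is the log-ratio of a class probability to the
    probability of the last class (assumed reference component).
    """
    ref = p[-1]
    return [math.log(p_i / ref) for p_i in p[:-1]]


def additive_logistic(y):
    """Map a vector in R^(K-1) back to a K-class probability vector."""
    exps = [math.exp(v) for v in y]
    total = 1.0 + sum(exps)
    return [e / total for e in exps] + [1.0 / total]


p = [0.7, 0.2, 0.1]
y = inverse_additive_logistic(p)   # unconstrained representation
p_back = additive_logistic(y)      # round trip back to the simplex
print(y, p_back)
```

Because the transformed vector has no sum-to-one or positivity constraint, correlations between the classifiers' outputs can be expressed through an ordinary covariance matrix.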

No-Regret Bayesian Optimization with Unknown Hyperparameters
未知のハイパーパラメータを伴うノーリグレット・ベイズ最適化

Bayesian optimization (BO) based on Gaussian process models is a powerful paradigm to optimize black-box functions that are expensive to evaluate. While several BO algorithms provably converge to the global optimum of the unknown function, they assume that the hyperparameters of the kernel are known in advance. This is not the case in practice and misspecification often causes these algorithms to converge to poor local optima. In this paper, we present the first BO algorithm that is provably no-regret and converges to the optimum without knowledge of the hyperparameters. During optimization we slowly adapt the hyperparameters of stationary kernels and thereby expand the associated function class over time, so that the BO algorithm considers more complex function candidates. Based on the theoretical insights, we propose several practical algorithms that achieve the empirical sample efficiency of BO with online hyperparameter estimation, but retain theoretical convergence guarantees. We evaluate our method on several benchmark problems.

ガウス過程モデルに基づくベイズ最適化(BO)は、評価にコストがかかるブラックボックス関数を最適化する強力なパラダイムです。いくつかのBOアルゴリズムは未知の関数のグローバル最適値に収束することが証明されていますが、カーネルのハイパーパラメータが事前にわかっていることを前提としています。これは実際には当てはまらず、誤った指定によりこれらのアルゴリズムは多くの場合、質の低いローカル最適値に収束します。この論文では、ハイパーパラメータを知らなくても、ノーリグレットであり最適値に収束することが証明されている最初のBOアルゴリズムを紹介します。最適化中は、定常カーネルのハイパーパラメータをゆっくりと適応させ、関連する関数クラスを時間の経過とともに拡張して、BOアルゴリズムがより複雑な関数候補を考慮するようにします。理論的な洞察に基づいて、オンラインハイパーパラメータ推定を行うBOの経験的なサンプル効率を達成しながら、理論的な収束保証を維持するいくつかの実用的なアルゴリズムを提案します。いくつかのベンチマーク問題でこの方法を評価します。

Using Simulation to Improve Sample-Efficiency of Bayesian Optimization for Bipedal Robots
シミュレーションを使用した二足歩行ロボットのベイズ最適化のサンプル効率の向上

Learning for control can acquire controllers for novel robotic tasks, paving the path for autonomous agents. Such controllers can be expert-designed policies, which typically require tuning of parameters for each task scenario. In this context, Bayesian optimization (BO) has emerged as a promising approach for automatically tuning controllers. However, sample-efficiency can still be an issue for high-dimensional policies on hardware. Here, we develop an approach that utilizes simulation to learn structured feature transforms that map the original parameter space into a domain-informed space. During BO, similarity between controllers is now calculated in this transformed space. Experiments on the ATRIAS robot hardware and simulation show that our approach succeeds at sample-efficiently learning controllers for multiple robots. Another question arises: What if the simulation significantly differs from hardware? To answer this, we create increasingly approximate simulators and study the effect of increasing simulation-hardware mismatch on the performance of Bayesian optimization. We also compare our approach to other approaches from literature, and find it to be more reliable, especially in cases of high mismatch. Our experiments show that our approach succeeds across different controller types, bipedal robot models and simulator fidelity levels, making it applicable to a wide range of bipedal locomotion problems.

制御学習により、新しいロボットタスク用のコントローラーを獲得し、自律エージェントへの道を切り開くことができます。このようなコントローラーは、専門家が設計したポリシーである可能性があり、通常は各タスクシナリオのパラメーターの調整が必要です。このコンテキストでは、ベイズ最適化(BO)がコントローラーを自動的に調整するための有望なアプローチとして浮上しています。ただし、ハードウェア上の高次元ポリシーでは、サンプル効率が依然として問題になる可能性があります。ここでは、シミュレーションを利用して、元のパラメーター空間をドメイン情報に基づいた空間にマッピングする構造化された特徴変換を学習するアプローチを開発します。BO中に、この変換された空間でコントローラー間の類似性が計算されます。ATRIASロボットハードウェアとシミュレーションでの実験では、このアプローチが複数のロボットのコントローラーをサンプル効率的に学習することに成功していることがわかりました。別の疑問が生じます。シミュレーションがハードウェアと大幅に異なる場合はどうなるでしょうか。これに答えるために、ますます近似したシミュレータを作成し、シミュレーションとハードウェアの不一致の増加がベイズ最適化のパフォーマンスに与える影響を調べます。また、文献にある他のアプローチとこのアプローチを比較し、特に不一致が大きい場合に信頼性が高いことがわかりました。私たちの実験では、私たちのアプローチがさまざまなコントローラータイプ、二足歩行ロボットモデル、シミュレーターの忠実度レベルにわたって成功し、幅広い二足歩行運動の問題に適用できることが示されています。

Efficient augmentation and relaxation learning for individualized treatment rules using observational data
観察データを用いた個別化治療ルールのための効率的な拡張・緩和学習

Individualized treatment rules aim to identify if, when, which, and to whom treatment should be applied. A globally aging population, rising healthcare costs, and increased access to patient-level data have created an urgent need for high-quality estimators of individualized treatment rules that can be applied to observational data. A recent and promising line of research for estimating individualized treatment rules recasts the problem of estimating an optimal treatment rule as a weighted classification problem. We consider a class of estimators for optimal treatment rules that are analogous to convex large-margin classifiers. The proposed class applies to observational data and is doubly-robust in the sense that correct specification of either a propensity or outcome model leads to consistent estimation of the optimal individualized treatment rule. Using techniques from semiparametric efficiency theory, we derive rates of convergence for the proposed estimators and use these rates to characterize the bias-variance trade-off for estimating individualized treatment rules with classification-based methods. Simulation experiments informed by these results demonstrate that it is possible to construct new estimators within the proposed framework that significantly outperform existing ones. We illustrate the proposed methods using data from a labor training program and a study of inflammatory bowel syndrome.

個別治療ルールは、治療を適用するかどうか、いつ、どの治療を、誰に適用するかを特定することを目的としています。世界的な人口の高齢化、医療費の上昇、患者レベルのデータへのアクセスの増加により、観察データに適用できる個別治療ルールの高品質の推定量が緊急に必要とされています。個別治療ルールを推定するための最近の有望な研究では、最適な治療ルールを推定する問題を重み付き分類問題として捉え直しています。凸型大マージン分類器に類似した最適な治療ルールの推定量のクラスを検討します。提案されたクラスは観察データに適用でき、傾向モデルまたは結果モデルのいずれかを正しく指定すれば最適な個別治療ルールの一致推定が得られるという意味で二重に頑健です。セミパラメトリック効率理論の手法を使用して、提案された推定量の収束率を導出し、これらの率を使用して、分類ベースの方法で個別治療ルールを推定する際のバイアスと分散のトレードオフを特徴付けます。これらの結果に基づくシミュレーション実験では、提案されたフレームワーク内で、既存のものより大幅に優れた新しい推定量を構築できることが実証されています。労働訓練プログラムと炎症性腸症候群の研究からのデータを使用して、提案された方法を説明します。

Analysis of spectral clustering algorithms for community detection: the general bipartite setting
コミュニティ検出のためのスペクトルクラスタリングアルゴリズムの解析:一般的な二部設定

We consider spectral clustering algorithms for community detection under a general bipartite stochastic block model (SBM). A modern spectral clustering algorithm consists of three steps: (1) regularization of an appropriate adjacency or Laplacian matrix (2) a form of spectral truncation and (3) a kmeans type algorithm in the reduced spectral domain. We focus on the adjacency-based spectral clustering and for the first step, propose a new data-driven regularization that can restore the concentration of the adjacency matrix even for the sparse networks. This result is based on recent work on regularization of random binary matrices, but avoids using unknown population level parameters, and instead estimates the necessary quantities from the data. We also propose and study a novel variation of the spectral truncation step and show how this variation changes the nature of the misclassification rate in a general SBM. We then show how the consistency results can be extended to models beyond SBMs, such as inhomogeneous random graph models with approximate clusters, including a graphon clustering problem, as well as general sub-Gaussian biclustering. A theme of the paper is providing a better understanding of the analysis of spectral methods for community detection and establishing consistency results, under fairly general clustering models and for a wide regime of degree growths, including sparse cases where the average expected degree grows arbitrarily slowly.

私たちは、一般的な二部確率ブロックモデル(SBM)の下でのコミュニティ検出のためのスペクトルクラスタリングアルゴリズムを検討します。最新のスペクトルクラスタリングアルゴリズムは、(1)適切な隣接行列またはラプラシアン行列の正則化、(2)ある種のスペクトル切断、(3)縮小されたスペクトル領域でのk-means型アルゴリズムの3つのステップで構成されます。私たちは隣接行列ベースのスペクトルクラスタリングに焦点を当て、最初のステップとして、スパースなネットワークでも隣接行列の集中を回復できる新しいデータ駆動型の正則化を提案します。この結果は、ランダムバイナリ行列の正則化に関する最近の研究に基づいていますが、未知の母集団レベルのパラメータを使用することを避け、代わりにデータから必要な量を推定します。また、スペクトル切断ステップの新しいバリエーションを提案して研究し、このバリエーションが一般的なSBMでの誤分類率の性質をどのように変えるかを示します。次に、グラフォンクラスタリング問題を含む近似クラスターを持つ不均質ランダムグラフモデルや、一般的なサブガウスバイクラスタリングなど、SBM以外のモデルに一致性の結果をどのように拡張できるかを示します。この論文のテーマは、かなり一般的なクラスタリングモデルと、平均期待次数が任意にゆっくりと増加するスパースケースを含む、次数増加の幅広いレジームにおいて、コミュニティ検出のためのスペクトル法の分析をより深く理解し、一致性の結果を確立することです。

Boosted Kernel Ridge Regression: Optimal Learning Rates and Early Stopping
ブーストカーネルリッジ回帰: 最適な学習率と早期停止

In this paper, we introduce a learning algorithm, boosted kernel ridge regression (BKRR), that combines $L_2$-Boosting with the kernel ridge regression (KRR). We analyze the learning performance of this algorithm in the framework of learning theory. We show that BKRR provides a new bias-variance trade-off via tuning the number of boosting iterations, which is different from KRR via adjusting the regularization parameter. A (semi-)exponential bias-variance trade-off is derived for BKRR, exhibiting a stable relationship between the generalization error and the number of iterations. Furthermore, an adaptive stopping rule is proposed, with which BKRR achieves the optimal learning rate without saturation.

この論文では、$L_2$-Boostingとカーネルリッジ回帰(KRR)を組み合わせた学習アルゴリズム、ブーストカーネルリッジ回帰(BKRR)を紹介します。このアルゴリズムの学習性能を学習理論の枠組みで分析します。BKRRは、ブースティングの反復回数を調整することで新しいバイアスと分散のトレードオフを提供し、これは正則化パラメータの調整によるKRRのトレードオフとは異なることを示します。BKRRについて(半)指数関数的なバイアスと分散のトレードオフが導出され、汎化誤差と反復回数との間に安定した関係があることが示されます。さらに、適応的な停止ルールを提案し、これによりBKRRは飽和することなく最適な学習率を達成します。

Robust Frequent Directions with Application in Online Learning
ロバストなFrequent Directionsとオンライン学習への応用

The frequent directions (FD) technique is a deterministic approach for online sketching that has many applications in machine learning. The conventional FD is a heuristic procedure that often outputs rank deficient matrices. To overcome the rank deficiency problem, we propose a new sketching strategy called robust frequent directions (RFD) by introducing a regularization term. RFD can be derived from an optimization problem. It updates the sketch matrix and the regularization term adaptively and jointly. RFD reduces the approximation error of FD without increasing the computational cost. We also apply RFD to online learning and propose an effective hyperparameter-free online Newton algorithm. We derive a regret bound for our online Newton algorithm based on RFD, which guarantees the robustness of the algorithm. The experimental studies demonstrate that the proposed method outperforms state-of-the-art second order online learning algorithms.

Frequent Directions(FD)手法は、機械学習に多くの応用があるオンラインスケッチの決定論的アプローチです。従来のFDは、ランク落ちした行列を出力することが多いヒューリスティックな手順です。ランク落ちの問題を克服するために、正則化項を導入することにより、Robust Frequent Directions(RFD)と呼ばれる新しいスケッチ戦略を提案します。RFDは最適化問題から導出できます。スケッチ行列と正則化項を適応的かつ同時に更新します。RFDは、計算コストを増やすことなくFDの近似誤差を低減します。また、RFDをオンライン学習に適用し、効果的なハイパーパラメータフリーのオンラインニュートンアルゴリズムを提案します。RFDに基づくこのオンラインニュートンアルゴリズムについてリグレット上界を導出し、これによりアルゴリズムの堅牢性が保証されます。実験的研究は、提案された方法が最先端の二次オンライン学習アルゴリズムよりも優れていることを示しています。

Picasso: A Sparse Learning Library for High Dimensional Data Analysis in R and Python
Picasso: R と Python での高次元データ分析のためのスパース学習ライブラリ

We describe a new library named picasso, which implements a unified framework of pathwise coordinate optimization for a variety of sparse learning problems (e.g., sparse linear regression, sparse logistic regression, sparse Poisson regression and scaled sparse linear regression) combined with efficient active set selection strategies. Besides, the library allows users to choose different sparsity-inducing regularizers, including the convex $\ell_1$, nonconvex MCP and SCAD regularizers. The library is coded in C++ and has user-friendly R and Python wrappers. Numerical experiments demonstrate that picasso can scale up to large problems efficiently.

私たちは、さまざまなスパース学習問題(スパース線形回帰、スパースロジスティック回帰、スパースポアソン回帰、スケーリングされたスパース線形回帰など)に対するパスワイズ座標最適化の統一フレームワークを、効率的なアクティブセット選択戦略と組み合わせて実装する、picassoという名前の新しいライブラリについて説明します。さらに、このライブラリでは、凸の$\ell_1$、非凸のMCPおよびSCAD正則化項など、さまざまなスパース性誘導正則化項をユーザーが選択できます。ライブラリはC++でコーディングされており、ユーザーフレンドリーなRとPythonのラッパーがあります。数値実験は、picassoが大きな問題に効率的にスケールアップできることを示しています。

DSCOVR: Randomized Primal-Dual Block Coordinate Algorithms for Asynchronous Distributed Optimization
DSCOVR: 非同期分散最適化のためのランダム化主双対ブロック座標アルゴリズム

Machine learning with big data often involves large optimization models. For distributed optimization over a cluster of machines, frequent communication and synchronization of all model parameters (optimization variables) can be very costly. A promising solution is to use parameter servers to store different subsets of the model parameters, and update them asynchronously at different machines using local datasets. In this paper, we focus on distributed optimization of large linear models with convex loss functions, and propose a family of randomized primal-dual block coordinate algorithms that are especially suitable for asynchronous distributed implementation with parameter servers. In particular, we work with the saddle-point formulation of such problems which allows simultaneous data and model partitioning, and exploit its structure by doubly stochastic coordinate optimization with variance reduction (DSCOVR). Compared with other first-order distributed algorithms, we show that DSCOVR may require less amount of overall computation and communication, and less or no synchronization. We discuss the implementation details of the DSCOVR algorithms, and present numerical experiments on an industrial distributed computing system.

ビッグデータによる機械学習には、大規模な最適化モデルが関係することがよくあります。マシンのクラスターにわたる分散最適化では、すべてのモデルパラメータ(最適化変数)の頻繁な通信と同期には非常にコストがかかります。有望な解決策は、パラメータサーバーを使用してモデルパラメータのさまざまなサブセットを保存し、ローカルデータセットを使用してさまざまなマシンで非同期的に更新することです。この論文では、凸損失関数を持つ大規模な線形モデルの分散最適化に焦点を当て、パラメータサーバーを使用した非同期分散実装に特に適したランダム化された主デュアルブロック座標アルゴリズムのファミリを提案します。特に、データとモデルの同時分割を可能にするこのような問題の鞍点定式化に取り組み、分散削減による二重確率座標最適化(DSCOVR)によってその構造を活用します。他の一次分散アルゴリズムと比較して、DSCOVRでは全体的な計算量と通信量が少なく、同期が少なくなるか、同期がまったく必要なくなる可能性があることを示します。DSCOVRアルゴリズムの実装の詳細について説明し、産業用分散コンピューティングシステムでの数値実験を紹介します。

Utilizing Second Order Information in Minibatch Stochastic Variance Reduced Proximal Iterations
ミニバッチ確率的分散縮小近接反復法における2次情報の利用

We present a novel minibatch stochastic optimization method for empirical risk minimization of linear predictors. The method efficiently leverages both sub-sampled first-order and higher-order information, by incorporating variance-reduction and acceleration techniques. We prove improved iteration complexity over state-of-the-art methods under suitable conditions. In particular, the approach enjoys global fast convergence for quadratic convex objectives and local fast convergence for general convex objectives. Experiments are provided to demonstrate the empirical advantage of the proposed method over existing approaches in the literature.

私たちは、線形予測子の経験的リスク最小化のための新しいミニバッチ確率的最適化手法を提示します。この手法は、分散縮小と加速の手法を組み込むことにより、サブサンプリングされた1次情報と高次情報の両方を効率的に活用します。適切な条件下で、最先端の方法よりも反復計算量が改善されることを証明します。特に、このアプローチは、2次の凸目的関数に対しては大域的な高速収束を、一般の凸目的関数に対しては局所的な高速収束を示します。提案された方法が文献の既存のアプローチよりも経験的に優れていることを実証するために、実験結果を示します。

Decontamination of Mutual Contamination Models
相互汚染モデルの除染

Many machine learning problems can be characterized by mutual contamination models. In these problems, one observes several random samples from different convex combinations of a set of unknown base distributions and the goal is to infer these base distributions. This paper considers the general setting where the base distributions are defined on arbitrary probability spaces. We examine three popular machine learning problems that arise in this general setting: multiclass classification with label noise, demixing of mixed membership models, and classification with partial labels. In each case, we give sufficient conditions for identifiability and present algorithms for the infinite and finite sample settings, with associated performance guarantees.

多くの機械学習の問題は、相互汚染モデル(mutual contamination models)によって特徴付けることができます。これらの問題では、未知のベース分布の集合の異なる凸結合からいくつかのランダムサンプルを観測し、これらのベース分布を推定することを目標とします。この論文では、ベース分布が任意の確率空間上で定義される一般的な設定について考察します。この一般的な設定で発生する3つの一般的な機械学習の問題(ラベルノイズを伴う多クラス分類、混合メンバーシップモデルのデミキシング、および部分ラベルによる分類)について検討します。いずれの場合も、識別可能性のための十分条件を与え、無限サンプル設定と有限サンプル設定のアルゴリズムを、対応するパフォーマンス保証とともに提示します。

Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations
確率的修正方程式と確率的勾配アルゴリズムのダイナミクス I: 数学的基礎

We develop the mathematical foundations of the stochastic modified equations (SME) framework for analyzing the dynamics of stochastic gradient algorithms, where the latter is approximated by a class of stochastic differential equations with small noise parameters. We prove that this approximation can be understood mathematically as a weak approximation, which leads to a number of precise and useful results on the approximations of stochastic gradient descent (SGD), momentum SGD and stochastic Nesterov’s accelerated gradient method in the general setting of stochastic objectives. We also demonstrate through explicit calculations that this continuous-time approach can uncover important analytical insights into the stochastic gradient algorithms under consideration that may not be easy to obtain in a purely discrete-time setting.

私たちは、確率的勾配アルゴリズムのダイナミクスを解析するための確率的修正方程式(SME)フレームワークの数学的基礎を構築します。このフレームワークでは、確率的勾配アルゴリズムを、ノイズパラメータが小さい確率微分方程式のクラスによって近似します。この近似が数学的には弱近似として理解できることを証明し、これにより、確率的目的関数の一般的な設定における確率的勾配降下法(SGD)、モーメンタムSGD、および確率的Nesterov加速勾配法の近似に関する、多くの正確で有用な結果が得られます。また、この連続時間アプローチにより、純粋な離散時間の設定では簡単には得られない可能性のある、検討中の確率的勾配アルゴリズムに関する重要な分析的洞察を発見できることを明示的な計算を通じて示します。

A Bootstrap Method for Error Estimation in Randomized Matrix Multiplication
ランダム化行列乗算における誤差推定のためのブートストラップ法

In recent years, randomized methods for numerical linear algebra have received growing interest as a general approach to large-scale problems. Typically, the essential ingredient of these methods is some form of randomized dimension reduction, which accelerates computations, but also creates random approximation error. In this way, the dimension reduction step encodes a tradeoff between cost and accuracy. However, the exact numerical relationship between cost and accuracy is typically unknown, and consequently, it may be difficult for the user to precisely know (1) how accurate a given solution is, or (2) how much computation is needed to achieve a given level of accuracy. In the current paper, we study randomized matrix multiplication (sketching) as a prototype setting for addressing these general problems. As a solution, we develop a bootstrap method for directly estimating the accuracy as a function of the reduced dimension (as opposed to deriving worst-case bounds on the accuracy in terms of the reduced dimension). From a computational standpoint, the proposed method does not substantially increase the cost of standard sketching methods, and this is made possible by an “extrapolation” technique. In addition, we provide both theoretical and empirical results to demonstrate the effectiveness of the proposed method.

近年、数値線形代数のランダム化法は、大規模問題への一般的なアプローチとして関心が高まっています。通常、これらの方法の重要な要素は、計算を高速化するがランダムな近似誤差も生み出す、何らかの形のランダム化次元削減です。このように、次元削減ステップは、コストと精度のトレードオフをエンコードします。ただし、コストと精度の正確な数値関係は通常不明であるため、ユーザーが(1)特定のソリューションの精度がどの程度か、または(2)特定のレベルの精度を達成するためにどの程度の計算が必要かを正確に知ることは難しい場合があります。現在の論文では、これらの一般的な問題に対処するためのプロトタイプ設定として、ランダム化行列乗算(スケッチ)を検討します。解決策として、削減された次元の関数として精度を直接推定するブートストラップ法を開発します(削減された次元に関して精度の最悪ケースの境界を導出するのではなく)。計算の観点から見ると、提案された方法は標準的なスケッチ方法のコストを大幅に増加させるものではなく、これは「外挿」技術によって可能になります。さらに、提案された方法の有効性を示すために、理論的および実験的結果の両方を提供します。
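
To make the cost/accuracy trade-off of randomized matrix multiplication concrete, here is a minimal sketch that approximates $A^\top B$ from a random subset of rows. It assumes the simplest sketching scheme (uniform row sampling with replacement, rescaled by $n/t$) and does not implement the paper's bootstrap error estimator. The helper names `exact_product`, `sketched_product`, and `max_abs_error`, and the test matrices, are illustrative assumptions.

```python
import random


def exact_product(A, B):
    """Compute A^T B for A (n x d) and B (n x k), both given as lists of rows."""
    d, k = len(A[0]), len(B[0])
    out = [[0.0] * k for _ in range(d)]
    for a_row, b_row in zip(A, B):
        for i in range(d):
            for j in range(k):
                out[i][j] += a_row[i] * b_row[j]
    return out


def sketched_product(A, B, t, seed=0):
    """Estimate A^T B from t rows sampled uniformly with replacement."""
    rng = random.Random(seed)
    n = len(A)
    idx = [rng.randrange(n) for _ in range(t)]
    partial = exact_product([A[i] for i in idx], [B[i] for i in idx])
    scale = n / t  # rescaling makes the estimate unbiased
    return [[scale * v for v in row] for row in partial]


def max_abs_error(P, Q):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))


data_rng = random.Random(1)
n = 2000
A = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
B = [[data_rng.gauss(0, 1) for _ in range(2)] for _ in range(n)]
exact = exact_product(A, B)
for t in (50, 200, 800):
    # A larger reduced dimension t costs more but typically lowers the error.
    print(t, max_abs_error(sketched_product(A, B, t), exact))
```

In practice the exact product is unavailable, which is why the abstract's question of estimating the error as a function of the reduced dimension `t` matters; the loop above can compute the error only because the toy problem is small.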

Approximation Hardness for A Class of Sparse Optimization Problems
あるクラスのスパース最適化問題に対する近似困難性

In this paper, we consider three typical optimization problems with a convex loss function and a nonconvex sparse penalty or constraint. For the sparse penalized problem, we prove that finding an $\mathcal{O}(n^{c_1}d^{c_2})$-optimal solution to an $n\times d$ problem is strongly NP-hard for any $c_1, c_2\in [0,1)$ such that $c_1+c_2<1$. For two constrained versions of the sparse optimization problem, we show that it is intractable to approximately compute a solution path associated with increasing values of some tuning parameter. The hardness results apply to a broad class of loss functions and sparse penalties. They suggest that one cannot even approximately solve these three problems in polynomial time, unless P $=$ NP.

この論文では、凸損失関数と非凸スパースペナルティまたは制約を持つ3つの典型的な最適化問題について考察します。スパースなペナルティ付き問題では、$n\times d$問題に対する$\mathcal{O}(n^{c_1}d^{c_2})$-最適解を見つけることは、$c_1+c_2<1$となるような任意の$c_1, c_2\in [0,1)$に対して強NP困難であることを証明します。スパース最適化問題の2つの制約付きバージョンについて、ある調整パラメータの値の増加に対応する解パスを近似的に計算することは計算困難であることを示します。これらの困難性の結果は、幅広いクラスの損失関数とスパースペナルティに適用されます。これらの結果は、P $=$ NPでない限り、これら3つの問題を多項式時間で近似的に解くことさえできないことを示唆しています。

A Well-Tempered Landscape for Non-convex Robust Subspace Recovery
非凸型のロバストな部分空間回復のための整えられたランドスケープ

We present a mathematical analysis of a non-convex energy landscape for robust subspace recovery. We prove that an underlying subspace is the only stationary point and local minimizer in a specified neighborhood under a deterministic condition on a dataset. If the deterministic condition is satisfied, we further show that a geodesic gradient descent method over the Grassmannian manifold can exactly recover the underlying subspace when the method is properly initialized. Proper initialization by principal component analysis is guaranteed with a simple deterministic condition. Under slightly stronger assumptions, the gradient descent method with a piecewise constant step-size scheme achieves linear convergence. The practicality of the deterministic condition is demonstrated on some statistical models of data, and the method achieves almost state-of-the-art recovery guarantees on the Haystack Model for different regimes of sample size and ambient dimension. In particular, when the ambient dimension is fixed and the sample size is large enough, we show that our gradient method can exactly recover the underlying subspace for any fixed fraction of outliers (less than 1).

私たちは、ロバストな部分空間回復のための非凸エネルギーランドスケープの数学的解析を提示します。データセットに関する決定論的条件の下で、基礎となる部分空間が、指定された近傍における唯一の停留点かつ局所的最小点であることを証明します。決定論的条件が満たされる場合、さらに、グラスマン多様体上の測地線勾配降下法が、適切に初期化されていれば、基礎となる部分空間を正確に回復できることを示します。主成分分析による適切な初期化は、単純な決定論的条件で保証されます。わずかに強い仮定の下で、区分的に一定のステップサイズスキームによる勾配降下法は、線形収束を達成します。決定論的条件の実用性は、いくつかのデータ統計モデルで実証されており、この方法は、サンプルサイズとアンビエント次元のさまざまなレジームに対して、Haystackモデルでほぼ最先端の回復保証を達成します。特に、アンビエント次元が固定され、サンプルサイズが十分に大きい場合、私たちの勾配法は、外れ値の任意の固定割合(1未満)に対して、基礎となる部分空間を正確に回復できることを示します。

A New Approach to Laplacian Solvers and Flow Problems
ラプラシアンソルバーと流れ問題への新しいアプローチ

This paper investigates the behavior of the Min-Sum message passing scheme to solve systems of linear equations in the Laplacian matrices of graphs and to compute electric flows. Voltage and flow problems involve the minimization of quadratic functions and are fundamental primitives that arise in several domains. Algorithms that have been proposed are typically centralized and involve multiple graph-theoretic constructions or sampling mechanisms that make them difficult to implement and analyze. On the other hand, message passing routines are distributed, simple, and easy to implement. In this paper we establish a framework to analyze Min-Sum to solve voltage and flow problems. We characterize the error committed by the algorithm on general weighted graphs in terms of hitting times of random walks defined on the computation trees that support the operations of the algorithms with time. For $d$-regular graphs with equal weights, we show that the convergence of the algorithms is controlled by the total variation distance between the distributions of non-backtracking random walks defined on the original graph that start from neighboring nodes. The framework that we introduce extends the analysis of Min-Sum to settings where the contraction arguments previously considered in the literature (based on the assumption of walk summability or scaled diagonal dominance) can not be used, possibly in the presence of constraints.

この論文では、グラフのラプラシアン行列に関する連立一次方程式を解き、電気の流れを計算するためのMin-Sumメッセージパッシングスキームの動作を調査します。電圧と流れの問題は、2次関数の最小化を伴い、いくつかのドメインで発生する基本的なプリミティブです。これまでに提案されているアルゴリズムは通常、集中型であり、複数のグラフ理論的構成またはサンプリングメカニズムを伴うため、実装と分析が困難です。一方、メッセージパッシングルーチンは分散型で、シンプルで、実装が簡単です。この論文では、電圧と流れの問題を解くMin-Sumを分析するためのフレームワークを確立します。一般的な重み付きグラフでアルゴリズムが生じる誤差を、時間とともにアルゴリズムの動作を支える計算木上で定義されたランダムウォークの到達時間の観点から特徴付けます。等しい重みを持つ$d$-正則グラフの場合、アルゴリズムの収束は、元のグラフ上で定義され隣接するノードから出発する非バックトラッキングランダムウォークの分布間の全変動距離によって制御されることを示します。私たちが導入するフレームワークは、Min-Sumの分析を、文献で以前に検討された縮小性の議論(ウォーク総和可能性またはスケーリングされた対角優位性の仮定に基づく)が使用できない設定、場合によっては制約が存在する設定にまで拡張します。

Optimal Policies for Observing Time Series and Related Restless Bandit Problems
時系列と関連するレストレスバンディットの問題を監視するための最適方策

The trade-off between the cost of acquiring and processing data, and uncertainty due to a lack of data is fundamental in machine learning. A basic instance of this trade-off is the problem of deciding when to make noisy and costly observations of a discrete-time Gaussian random walk, so as to minimise the posterior variance plus observation costs. We present the first proof that a simple policy, which observes when the posterior variance exceeds a threshold, is optimal for this problem. The proof generalises to a wide range of cost functions other than the posterior variance. It is based on a new verification theorem by Nino-Mora that guarantees threshold structure for Markov decision processes, and on the relation between binary sequences known as Christoffel words and the dynamics of discontinuous nonlinear maps, which frequently arise in physics, control and biology. This result implies that optimal policies for linear-quadratic-Gaussian control with costly observations have a threshold structure. It also implies that the restless bandit problem of observing multiple such time series, has a well-defined Whittle index policy. We discuss computation of that index, give closed-form formulae for it, and compare the performance of the associated index policy with heuristic policies.

データの取得と処理にかかるコストと、データ不足による不確実性との間のトレードオフは、機械学習において基本的なものです。このトレードオフの基本的な例は、事後分散と観測コストの和を最小化するために、離散時間ガウスランダムウォークのノイズが多くコストのかかる観測をいつ行うかを決定する問題です。事後分散がしきい値を超えたときに観測する単純なポリシーがこの問題に最適であるという最初の証明を示します。この証明は、事後分散以外の幅広いコスト関数に一般化されます。これは、マルコフ決定過程のしきい値構造を保証するNino-Moraによる新しい検証定理と、Christoffel語として知られるバイナリシーケンスと、物理学、制御、生物学で頻繁に現れる不連続な非線形写像のダイナミクスとの関係に基づいています。この結果は、コストのかかる観測を伴う線形二次ガウス制御の最適なポリシーがしきい値構造を持つことを意味します。また、そのような時系列を複数観測するレストレスバンディット問題に、明確に定義されたWhittleインデックスポリシーが存在することも意味します。このインデックスの計算について説明し、その閉形式の式を示し、対応するインデックスポリシーのパフォーマンスをヒューリスティックポリシーと比較します。
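
The threshold policy described above can be sketched with the scalar Kalman-filter variance recursion for a Gaussian random walk. This is an illustrative simulation under assumed conventions, not the paper's exact formulation: the step variance `q`, observation-noise variance `r`, per-observation cost `obs_cost`, and the choice to charge the variance after the (optional) update are all assumptions, and `run_threshold_policy` is a hypothetical helper name.

```python
def run_threshold_policy(threshold, steps, q=1.0, r=0.5, obs_cost=2.0, v0=0.0):
    """Track posterior variance under an 'observe when variance > threshold' rule.

    q: variance of each random-walk step (assumed value).
    r: variance of the observation noise (assumed value).
    Returns (average cost per step, number of observations made).
    """
    v = v0
    total_cost = 0.0
    n_obs = 0
    for _ in range(steps):
        v = v + q                # prediction: the walk takes one more step
        if v > threshold:        # observe only when uncertainty is high
            v = v * r / (v + r)  # scalar Kalman update of the variance
            total_cost += obs_cost
            n_obs += 1
        total_cost += v          # pay the current posterior variance
    return total_cost / steps, n_obs


for threshold in (0.5, 2.0, 4.0, 8.0):
    avg_cost, n_obs = run_threshold_policy(threshold, steps=10_000)
    print(threshold, round(avg_cost, 3), n_obs)
```

A very low threshold pays for an observation at every step, while a very high one lets the variance grow unchecked; sweeping the threshold exposes the trade-off that the optimal policy resolves.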

Matched Bipartite Block Model with Covariates
共変量を持つ一致した二部構成ブロックモデル

Community detection or clustering is a fundamental task in the analysis of network data. Many real networks have a bipartite structure which makes community detection challenging. In this paper, we consider a model which allows for matched communities in the bipartite setting, in addition to node covariates with information about the matching. We derive a simple fast algorithm for fitting the model based on variational inference ideas and show its effectiveness on both simulated and real data. A variation of the model to allow for degree-correction is also considered, in addition to a novel approach to fitting such degree-corrected models.

コミュニティ検出またはクラスタリングは、ネットワークデータの分析における基本的なタスクです。多くの実際のネットワークは二部構造を持つため、コミュニティの検出は困難です。この論文では、二部設定において対応付けられたコミュニティを許容し、さらに対応付けに関する情報を持つノード共変量も扱えるモデルを検討します。変分推論のアイデアに基づいてモデルを適合させるためのシンプルで高速なアルゴリズムを導き出し、シミュレーションデータと実際のデータの両方でその有効性を示します。また、次数補正を可能にするモデルのバリエーションや、そのような次数補正モデルを適合させるための新しいアプローチについても検討します。

The Relationship Between Agnostic Selective Classification, Active Learning and the Disagreement Coefficient
不可知論的選択的分類、アクティブラーニング、不一致係数の関係

A selective classifier $(f,g)$ comprises a classification function $f$ and a binary selection function $g$, which determines if the classifier abstains from prediction, or uses $f$ to predict. The classifier is called pointwise-competitive if it classifies each point identically to the best classifier in hindsight (from the same class), whenever it does not abstain. The quality of such a classifier is quantified by its rejection mass, defined to be the probability mass of the points it rejects. A “fast” rejection rate is achieved if the rejection mass is bounded from above by $\tilde{O}(1/m)$ where $m$ is the number of labeled examples used to train the classifier (and $\tilde{O}$ hides logarithmic factors). Pointwise-competitive selective (PCS) classifiers are intimately related to disagreement-based active learning and it is known that in the realizable case, a fast rejection rate of a known PCS algorithm (called Consistent Selective Strategy) is equivalent to an exponential speedup of the well-known CAL active algorithm. We focus on the agnostic setting, for which there is a known algorithm called LESS that learns a PCS classifier and achieves a fast rejection rate (depending on Hanneke’s disagreement coefficient) under strong assumptions. We present an improved PCS learning algorithm called ILESS for which we show a fast rate (depending on Hanneke’s disagreement coefficient) without any assumptions. Our rejection bound smoothly interpolates the realizable and agnostic settings. The main result of this paper is an equivalence between the following three entities: (i) the existence of a fast rejection rate for any PCS learning algorithm (such as ILESS); (ii) a poly-logarithmic bound for Hanneke’s disagreement coefficient; and (iii) an exponential speedup for a new disagreement-based active learner called ActiveiLESS.

選択的分類器$(f,g)$は分類関数$f$とバイナリ選択関数$g$で構成され、分類器が予測を控えるか、予測に$f$を使用するかを決定します。分類器が予測を控えない場合、各ポイントを後から見て(同じクラスの)最良の分類器とまったく同じように分類する場合、その分類器はポイントワイズ競合型と呼ばれます。このような分類器の品質は、拒否質量によって定量化されます。拒否質量は、拒否するポイントの確率質量として定義されます。拒否質量が$\tilde{O}(1/m)$で上限付けられている場合、「高速」拒否率が達成されます。ここで、$m$は分類器のトレーニングに使用されるラベル付きサンプルの数です($\tilde{O}$は対数係数を隠します)。ポイントワイズ競合選択(PCS)分類器は、不一致ベースのアクティブラーニングと密接に関連しており、実現可能なケースでは、既知のPCSアルゴリズム(一貫性選択戦略と呼ばれる)の高速拒否率は、よく知られているCALアクティブアルゴリズムの指数関数的な高速化と同等であることが知られています。私たちは、PCS分類器を学習し、強い仮定の下で高速拒否率(Hannekeの不一致係数に依存)を達成するLESSと呼ばれる既知のアルゴリズムがある、不可知論的設定に焦点を当てます。私たちは、仮定なしで高速率(Hannekeの不一致係数に依存)を示すILESSと呼ばれる改良型PCS学習アルゴリズムを紹介します。私たちの拒否境界は、実現可能な設定と不可知論的設定をスムーズに補間します。この論文の主な結果は、次の3つのエンティティ間の同等性です。(i)すべてのPCS学習アルゴリズム(ILESSなど)の高速拒否率の存在。(ii)Hannekeの不一致係数の多対数限界。(iii)ActiveiLESSと呼ばれる新しい不一致ベースのアクティブ学習器の指数関数的な高速化。
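
As a minimal illustration of the definitions above (not the LESS or ILESS algorithm from the paper), the following sketch builds a pointwise-competitive selective classifier for one-dimensional threshold classifiers in the realizable case. It abstains exactly on the region where thresholds consistent with the training data disagree. The hypothesis class, the uniform distribution on $[0,1]$, and all function names are assumptions made for this example.

```python
import random

NEG_INF, POS_INF = float("-inf"), float("inf")


def fit_css_threshold(samples):
    """Return (lo, hi): the largest negative and the smallest positive training point.

    Hypothesis class (assumed for illustration): h_t(x) = 1 if x >= t else 0.
    Every threshold t with lo < t <= hi is consistent with the sample.
    """
    lo = max((x for x, y in samples if y == 0), default=NEG_INF)
    hi = min((x for x, y in samples if y == 1), default=POS_INF)
    return lo, hi


def predict(lo, hi, x):
    """Return 0 or 1 when all consistent thresholds agree on x, else None (abstain)."""
    if x <= lo:
        return 0
    if x >= hi:
        return 1
    return None


def rejection_mass(lo, hi):
    """Probability of abstaining when x is drawn uniformly from [0, 1]."""
    return max(0.0, min(hi, 1.0) - max(lo, 0.0))


if __name__ == "__main__":
    rng = random.Random(0)
    t_star = 0.37  # hypothetical true threshold
    for m in (10, 100, 1000):
        data = [(x, int(x >= t_star)) for x in (rng.random() for _ in range(m))]
        lo, hi = fit_css_threshold(data)
        print(m, rejection_mass(lo, hi))
```

In this toy setting the classifier never errs when it predicts, and the printed rejection mass shrinks as the number of labeled examples $m$ grows, which is the quantity the abstract's rejection-rate bounds control.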

NetSDM: Semantic Data Mining with Network Analysis
NetSDM: ネットワーク分析によるセマンティックデータマイニング

Semantic data mining (SDM) is a form of relational data mining that uses annotated data together with complex semantic background knowledge to learn rules that can be easily interpreted. The drawback of SDM is a high computational complexity of existing SDM algorithms, resulting in long run times even when applied to relatively small data sets. This paper proposes an effective SDM approach, named NetSDM, which first transforms the available semantic background knowledge into a network format, followed by network analysis based node ranking and pruning to significantly reduce the size of the original background knowledge. The experimental evaluation of the NetSDM methodology on acute lymphoblastic leukemia and breast cancer data demonstrates that NetSDM achieves radical time efficiency improvements and that learned rules are comparable or better than the rules obtained by the original SDM algorithms.

セマンティックデータマイニング(SDM)は、リレーショナルデータマイニングの一種であり、注釈付きデータと複雑なセマンティック背景知識を使用して、簡単に解釈できるルールを学習します。SDMの欠点は、既存のSDMアルゴリズムの計算が非常に複雑であるため、比較的小さなデータセットに適用しても実行時間が長くなることです。この論文では、NetSDMと呼ばれる効果的なSDMアプローチを提案し、最初に利用可能なセマンティック背景知識をネットワーク形式に変換し、次にネットワーク分析ベースのノードランキングとプルーニングを行って、元の背景知識のサイズを大幅に縮小します。急性リンパ芽球性白血病および乳がんのデータに対するNetSDM方法論の実験的評価は、NetSDMが根本的な時間効率の改善を達成し、学習されたルールが元のSDMアルゴリズムによって得られたルールと同等またはそれ以上であることを示しています。

Kernels for Sequentially Ordered Data
順次順序付けされたデータのカーネル

We present a novel framework for learning with sequential data of any kind, such as multivariate time series, strings, or sequences of graphs. The main result is a “sequentialization” that transforms any kernel on a given domain into a kernel for sequences in that domain. This procedure preserves properties such as positive definiteness, the associated kernel feature map is an ordered variant of sample (cross-)moments, and this sequentialized kernel is consistent in the sense that it converges to a kernel for paths if sequences converge to paths (by discretization). Further, classical kernels for sequences arise as special cases of this method. We use dynamic programming and low-rank techniques for tensors to provide efficient algorithms to compute this sequentialized kernel.

私たちは、多変量時系列、文字列、グラフのシーケンスなど、あらゆる種類の逐次データを使用して学習するための新しいフレームワークを提示します。主な結果は、特定のドメイン上の任意のカーネルをそのドメイン内のシーケンスのカーネルに変換する「シーケンシャル化」です。この手順では、正の定値性、関連するカーネル特徴マップがサンプル(クロス)モーメントの順序付けされたバリアントであること、このシーケンシャル化されたカーネルは、シーケンスがパスに収束する(離散化によって)パスのカーネルに収束するという意味で一貫性があります。さらに、シーケンスの古典的なカーネルは、この方法の特殊なケースとして発生します。テンソルの動的計画法と低ランクの手法を使用して、この逐次化されたカーネルを計算するための効率的なアルゴリズムを提供します。

Exact Clustering of Weighted Graphs via Semidefinite Programming
半定値計画法による重み付きグラフの厳密クラスタリング

As a model problem for clustering, we consider the densest $k$-disjoint-clique problem of partitioning a weighted complete graph into $k$ disjoint subgraphs such that the sum of the densities of these subgraphs is maximized. We establish that such subgraphs can be recovered from the solution of a particular semidefinite relaxation with high probability if the input graph is sampled from a distribution of clusterable graphs. Specifically, the semidefinite relaxation is exact if the graph consists of $k$ large disjoint subgraphs, corresponding to clusters, with weight concentrated within these subgraphs, plus a moderate number of nodes not belonging to any cluster. Further, we establish that if noise is weakly obscuring these clusters, i.e., the between-cluster edges are assigned very small weights, then we can recover significantly smaller clusters. For example, we show that in approximately sparse graphs, where the between-cluster weights tend to zero as the size $n$ of the graph tends to infinity, we can recover clusters of size polylogarithmic in $n$ under certain conditions on the distribution of edge weights. Empirical evidence from numerical simulations is also provided to support these theoretical phase transitions to perfect recovery of the cluster structure.

クラスタリングのモデル問題として、重み付き完全グラフを$k$個の互いに素なサブグラフに分割し、これらのサブグラフの密度の合計が最大になるようにする、最も密度の高い$k$互いに素なクリーク問題を検討します。入力グラフがクラスタリング可能なグラフの分布からサンプリングされている場合、特定の半正定値緩和の解からこのようなサブグラフを高い確率で復元できることを確立します。具体的には、グラフが$k$個の大きな互いに素なサブグラフ(クラスターに対応し、重みがこれらのサブグラフに集中し、どのクラスターにも属さないノードが適度な数含まれている)で構成されている場合、半正定値緩和は正確です。さらに、ノイズによってこれらのクラスターが弱く隠されている場合、つまりクラスター間のエッジに非常に小さな重みが割り当てられている場合、大幅に小さいクラスターを復元できることを確立します。たとえば、グラフのサイズ$n$が無限大に近づくにつれてクラスター間の重みがゼロに近づく、ほぼスパースなグラフでは、エッジの重みの分布に関する特定の条件下で、サイズが$n$の多重対数であるクラスターを回復できることを示します。数値シミュレーションからの経験的証拠も提供され、クラスター構造の完全な回復へのこれらの理論的相転移を裏付けています。
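
To make the objective concrete, here is a small sketch that evaluates the sum of subgraph densities for a candidate partition. It does not solve the semidefinite relaxation. The abstract does not define density, so this example assumes the density of a subgraph is its total internal edge weight divided by its number of nodes. The function names and the toy weight matrix are hypothetical.

```python
from itertools import combinations


def density(weights, nodes):
    """Total internal edge weight divided by node count (assumed definition)."""
    nodes = list(nodes)
    if not nodes:
        return 0.0
    total = sum(weights[i][j] for i, j in combinations(nodes, 2))
    return total / len(nodes)


def objective(weights, clusters):
    """Sum of the densities of the disjoint subgraphs in a candidate partition."""
    return sum(density(weights, cluster) for cluster in clusters)


# Toy complete graph: heavy edges inside {0, 1} and {2, 3}, light edges between them.
W = [
    [0.0, 1.0, 0.1, 0.1],
    [1.0, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 1.0],
    [0.1, 0.1, 1.0, 0.0],
]

planted = objective(W, [[0, 1], [2, 3]])
mixed = objective(W, [[0, 2], [1, 3]])
```

On this toy graph the planted partition scores higher than the mixed one, which is the behavior the densest $k$-disjoint-clique objective is meant to reward.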

Iterated Learning in Dynamic Social Networks
動的ソーシャルネットワークにおける反復学習

A classic finding by Kalish et al. (2007) shows that no language can be learned iteratively by rational agents in a self-sustained manner. In other words, if $A$ teaches a foreign language to $B$, who then teaches what she learned to $C$, and so on, the language will quickly get lost and agents will wind up teaching their own common native language. If so, how can linguistic novelty ever be sustained? We address this apparent paradox by considering the case of iterated learning in a social network: we show that by varying the lengths of the learning sessions over time or by keeping the networks dynamic, it is possible for iterated learning to endure forever with arbitrarily small loss.

(Kalishら, 2007)による古典的な発見は、合理的なエージェントが自立的に反復的に学習する言語はないことを示しています。言い換えれば、$A$が$B$に外国語を教え、次に$B$が自分の学んだことを$C$に教える、というように、言語はすぐに失われ、エージェントは自分自身の共通の母国語を教えることになります。もしそうなら、言語の新しさはどうやって持続できるのでしょうか?この明らかなパラドックスに対処するには、ソーシャルネットワークでの反復学習の場合を考えます:学習セッションの長さを時間の経過とともに変化させたり、ネットワークを動的に保つことで、反復学習が任意の小さな損失で永遠に持続する可能性があることを示します。

Pyro: Deep Universal Probabilistic Programming
Pyro:ディープユニバーサル確率プログラミング

Pyro is a probabilistic programming language built on Python as a platform for developing advanced probabilistic models in AI research. To scale to large data sets and high-dimensional models, Pyro uses stochastic variational inference algorithms and probability distributions built on top of PyTorch, a modern GPU-accelerated deep learning framework. To accommodate complex or model-specific algorithmic behavior, Pyro leverages Poutine, a library of composable building blocks for modifying the behavior of probabilistic programs.

Pyroは、AI研究における高度な確率モデルを開発するためのプラットフォームとして、Python上に構築された確率的プログラミング言語です。大規模なデータセットや高次元モデルにスケーリングするために、Pyroは、最新のGPUで高速化されたディープラーニングフレームワークであるPyTorchの上に構築された確率的変分推論アルゴリズムと確率分布を使用しています。複雑なアルゴリズムの動作やモデル固有の動作に対応するために、Pyroは確率的プログラムの動作を変更するためのコンポーザブルなビルディングブロックのライブラリであるPoutineを活用しています。

Monotone Learning with Rectified Wire Networks
整流ワイヤネットワークによる単調学習

We introduce a new neural network model, together with a tractable and monotone online learning algorithm. Our model describes feed-forward networks for classification, with one output node for each class. The only nonlinear operation is rectification using a ReLU function with a bias. However, there is a rectifier on every edge rather than at the nodes of the network. There are also weights, but these are positive, static, and associated with the nodes. Our rectified wire networks are able to represent arbitrary Boolean functions. Only the bias parameters, on the edges of the network, are learned. Another departure in our approach, from standard neural networks, is that the loss function is replaced by a constraint. This constraint is simply that the value of the output node associated with the correct class should be zero. Our model has the property that the exact norm-minimizing parameter update, required to correctly classify a training item, is the solution to a quadratic program that can be computed with a few passes through the network. We demonstrate a training algorithm using this update, called sequential deactivation (SDA), on MNIST and some synthetic datasets. Upon adopting a natural choice for the nodal weights, SDA has no hyperparameters other than those describing the network structure. Our experiments explore behavior with respect to network size and depth in a family of sparse expander networks.

私たちは、扱いやすく単調なオンライン学習アルゴリズムとともに、新しいニューラルネットワークモデルを紹介します。我々のモデルは、各クラスに1つの出力ノードを持つ分類用のフィードフォワードネットワークを記述します。唯一の非線形操作は、バイアス付きのReLU関数を使用した整流です。ただし、整流器はネットワークのノードではなく、すべてのエッジにあります。重みもありますが、これらは正で静的であり、ノードに関連付けられています。整流ワイヤネットワークは、任意のブール関数を表すことができます。ネットワークのエッジにあるバイアスパラメーターのみが学習されます。我々のアプローチが標準的なニューラルネットワークから逸脱しているもう1つの点は、損失関数が制約に置き換えられていることです。この制約は、正しいクラスに関連付けられた出力ノードの値がゼロである必要があるという単純なものです。我々のモデルには、トレーニング項目を正しく分類するために必要な正確なノルム最小化パラメーター更新が、ネットワークを数回通過して計算できる2次計画の解であるという特性があります。私たちは、この更新を使用したトレーニングアルゴリズム(SDA (順次非アクティブ化)と呼ばれる)をMNISTといくつかの合成データセットで実証しました。ノードの重みに自然な選択を採用すると、SDAにはネットワーク構造を記述するハイパーパラメータ以外のハイパーパラメータはありません。私たちの実験では、スパースエクスパンダーネットワークファミリのネットワークサイズと深さに関する動作を調査します。

TensorLy: Tensor Learning in Python
TensorLy: Python でのテンソル学習

Tensors are higher-order extensions of matrices. While matrix methods form the cornerstone of traditional machine learning and data analysis, tensor methods have been gaining increasing traction. However, software support for tensor operations is not on the same footing. In order to bridge this gap, we have developed TensorLy, a Python library that provides a high-level API for tensor methods and deep tensorized neural networks. TensorLy aims to follow the same standards adopted by the main projects of the Python scientific community, and to seamlessly integrate with them. Its BSD license makes it suitable for both academic and commercial applications. TensorLy’s backend system allows users to perform computations with several libraries such as NumPy or PyTorch, to name but a few. These computations can be scaled across multiple CPU or GPU machines. In addition, using deep-learning frameworks as the backend makes it easy to design and train deep tensorized neural networks. TensorLy is available at https://github.com/tensorly/tensorly

テンソルは、行列の高次拡張です。行列法は従来の機械学習とデータ分析の基礎を形成していますが、テンソル法はますます注目を集めています。しかし、テンソル演算のソフトウェアサポートは同じレベルではありません。このギャップを埋めるために、私たちはテンソル法とディープテンソル化ニューラルネットワーク用の高レベルAPIを提供するPythonライブラリであるTensorLyを開発しました。TensorLyは、Python科学コミュニティの主要プロジェクトで採用されているのと同じ標準に従い、それらとシームレスに統合することを目指しています。BSDライセンスにより、学術アプリケーションと商用アプリケーションの両方に適しています。TensorLyのバックエンドシステムを使用すると、NumPyやPyTorchなど、いくつかのライブラリを使用して計算を実行できます。複数のCPUまたはGPUマシンでスケーリングできます。さらに、ディープラーニングフレームワークをバックエンドとして使用することで、ディープテンソル化ニューラルネットワークを簡単に設計およびトレーニングできます。TensorLyはhttps://github.com/tensorly/tensorlyから入手できます。

Group Invariance, Stability to Deformations, and Complexity of Deep Convolutional Representations
群不変性、変形に対する安定性、および深層畳み込み表現の複雑さ

The success of deep convolutional architectures is often attributed in part to their ability to learn multiscale and invariant representations of natural signals. However, a precise study of these properties and how they affect learning guarantees is still missing. In this paper, we consider deep convolutional representations of signals; we study their invariance to translations and to more general groups of transformations, their stability to the action of diffeomorphisms, and their ability to preserve signal information. This analysis is carried out by introducing a multilayer kernel based on convolutional kernel networks and by studying the geometry induced by the kernel mapping. We then characterize the corresponding reproducing kernel Hilbert space (RKHS), showing that it contains a large class of convolutional neural networks with homogeneous activation functions. This analysis allows us to separate data representation from learning, and to provide a canonical measure of model complexity, the RKHS norm, which controls both stability and generalization of any learned model. In addition to models in the constructed RKHS, our stability analysis also applies to convolutional networks with generic activations such as rectified linear units, and we discuss its relationship with recent generalization bounds based on spectral norms.

深層畳み込みアーキテクチャの成功は、自然信号のマルチスケールかつ不変な表現を学習する能力に一部起因することが多い。しかし、これらの特性とそれが学習保証にどのように影響するかについての正確な研究はまだ行われていない。この論文では、信号の深層畳み込み表現について検討し、並進およびより一般的な変換群に対する不変性、微分同相写像の作用に対する安定性、および信号情報を保持する能力を研究します。この分析は、畳み込みカーネルネットワークに基づく多層カーネルを導入し、カーネルマッピングによって誘導されるジオメトリを研究することによって実行されます。次に、対応する再生カーネルヒルベルト空間(RKHS)を特徴付け、同次活性化関数を持つ畳み込みニューラルネットワークの大規模なクラスが含まれていることを示す。この分析により、データ表現を学習から分離し、学習したモデルの安定性と一般化の両方を制御するモデルの複雑さの標準的な尺度であるRKHSノルムを提供することができます。構築されたRKHSのモデルに加えて、私たちの安定性分析は、正規化線形ユニットなどの一般的な活性化を持つ畳み込みネットワークにも適用され、スペクトルノルムに基づく最近の一般化境界との関係について説明します。

Joint PLDA for Simultaneous Modeling of Two Factors
2つの因子の同時モデリングのためのジョイントPLDA

Probabilistic linear discriminant analysis (PLDA) is a method used for biometric problems like speaker or face recognition that models the variability of the samples using two latent variables, one that depends on the class of the sample and another one that is assumed independent across samples and models the within-class variability. In this work, we propose a generalization of PLDA that enables joint modeling of two sample-dependent factors: the class of interest and a nuisance condition. The approach does not change the basic form of PLDA but rather modifies the training procedure to consider the dependency across samples of the latent variable that models within-class variability. While the identity of the nuisance condition is needed during training, it is not needed during testing since we propose a scoring procedure that marginalizes over the corresponding latent variable. We show results on a multilingual speaker-verification task, where the language spoken is considered a nuisance condition. The proposed joint PLDA approach leads to significant performance gains in this task for two different data sets, in particular when the training data contains mostly or only monolingual speakers.

確率的線形判別分析(PLDA)は、話者や顔の認識などの生体認証の問題に使用される手法で、2つの潜在変数を使用してサンプルの変動性をモデル化します。1つはサンプルのクラスに依存し、もう1つはサンプル間で独立していると想定され、クラス内変動性をモデル化します。この研究では、サンプルに依存する2つの要因(対象クラスと迷惑条件)の共同モデル化を可能にするPLDAの一般化を提案します。このアプローチでは、PLDAの基本形式は変更されませんが、クラス内変動性をモデル化する潜在変数のサンプル間の依存関係を考慮するようにトレーニング手順が変更されます。迷惑条件のIDはトレーニング中に必要ですが、対応する潜在変数を周辺化するスコアリング手順を提案しているため、テスト中は必要ありません。話されている言語が迷惑条件と見なされる多言語話者検証タスクの結果を示します。提案された共同PLDAアプローチは、特にトレーニングデータに単一言語話者のみまたはほとんどが含まれている場合に、2つの異なるデータセットのこのタスクで大幅なパフォーマンス向上をもたらします。

Determining the Number of Latent Factors in Statistical Multi-Relational Learning
統計的多重関係学習における潜在因子数の決定

Statistical relational learning is primarily concerned with learning and inferring relationships between entities in large-scale knowledge graphs. Nickel et al. (2011) proposed a RESCAL tensor factorization model for statistical relational learning, which achieves better or at least comparable results on common benchmark data sets when compared to other state-of-the-art methods. Given a positive integer $s$, RESCAL computes an $s$-dimensional latent vector for each entity. The latent factors can be further used for solving relational learning tasks, such as collective classification, collective entity resolution and link-based clustering. The focus of this paper is to determine the number of latent factors in the RESCAL model. Due to the structure of the RESCAL model, its log-likelihood function is not concave. As a result, the corresponding maximum likelihood estimators (MLEs) may not be consistent. Nonetheless, we design a specific pseudometric, prove the consistency of the MLEs under this pseudometric and establish its rate of convergence. Based on these results, we propose a general class of information criteria and prove their model selection consistencies when the number of relations is either bounded or diverges at a proper rate of the number of entities. Simulations and real data examples show that our proposed information criteria have good finite sample properties.

統計的関係学習は、主に大規模な知識グラフ内のエンティティ間の関係を学習および推論することに関係しています。Nickelら(2011)は、統計的関係学習のためのRESCALテンソル分解モデルを提案しました。このモデルは、一般的なベンチマークデータセットで他の最先端の方法と比較して優れた、または少なくとも同等の結果を達成します。正の整数$s$が与えられると、RESCALは各エンティティの$s$次元の潜在ベクトルを計算します。潜在因子は、集合分類、集合エンティティ解決、リンクベースのクラスタリングなどの関係学習タスクを解決するためにさらに使用できます。この論文の焦点は、RESCALモデルの潜在因子の数を決定することです。RESCALモデルの構造により、その対数尤度関数は凹ではありません。その結果、対応する最大尤度推定量(MLE)は一貫していない可能性があります。それでも、特定の疑似測定法を設計し、この疑似測定法の下でのMLEの一貫性を証明し、その収束率を確立します。これらの結果に基づいて、情報基準の一般的なクラスを提案し、関係の数が制限されているか、エンティティの数の適切な割合で発散する場合のモデル選択の一貫性を証明します。シミュレーションと実際のデータの例は、提案された情報基準が優れた有限サンプル特性を持っていることを示しています。
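
For orientation, the factorization underlying RESCAL as proposed by Nickel et al. (2011) can be written as follows. The symbols $n$ (number of entities), $K$ (number of relations), $A$, and $R_k$ are notation chosen for this sketch. The abstract does not state the exact likelihood used in the paper, so only the bilinear structure is shown.

```latex
% X_k: n x n adjacency slice for relation k, with k = 1, ..., K
% A:   n x s matrix whose i-th row a_i is the s-dimensional latent vector of entity i
% R_k: s x s interaction matrix for relation k
\begin{align*}
  X_k &\approx A R_k A^{\top}, \qquad k = 1, \dots, K, \\
  (X_k)_{ij} &\approx a_i^{\top} R_k \, a_j .
\end{align*}
```

The number of latent factors $s$ fixes the shapes of both $A$ and every $R_k$, which is why selecting it is a model selection problem.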

Random Feature-based Online Multi-kernel Learning in Environments with Unknown Dynamics
未知のダイナミクスを持つ環境におけるランダム特徴量ベースのオンラインマルチカーネル学習

Kernel-based methods exhibit well-documented performance in various nonlinear learning tasks. Most of them rely on a preselected kernel, whose prudent choice presumes task-specific prior information. Especially when the latter is not available, multi-kernel learning has gained popularity thanks to its flexibility in choosing kernels from a prescribed kernel dictionary. Leveraging the random feature approximation and its recent orthogonality-promoting variant, the present contribution develops a scalable multi-kernel learning scheme (termed Raker) to obtain the sought nonlinear learning function “on the fly,” first for static environments. To further boost performance in dynamic environments, an adaptive multi-kernel learning scheme (termed AdaRaker) is developed. AdaRaker accounts not only for data-driven learning of kernel combination, but also for the unknown dynamics. Performance is analyzed in terms of both static and dynamic regrets. AdaRaker is uniquely capable of tracking nonlinear learning functions in environments with unknown dynamics, with analytic performance guarantees. Tests with synthetic and real datasets are carried out to showcase the effectiveness of the novel algorithms.

カーネルベースの方法は、さまざまな非線形学習タスクで十分に実証されたパフォーマンスを発揮します。そのほとんどは、事前に選択されたカーネルに依存しており、その慎重な選択にはタスク固有の事前情報が前提となります。特に後者が利用できない場合は、規定のカーネル辞書からカーネルを選択できる柔軟性により、マルチカーネル学習が普及しています。ランダム特徴近似とその最近の直交性促進バリアントを活用して、本貢献では、まず静的環境に対して、求められている非線形学習関数を「オンザフライ」で取得するためのスケーラブルなマルチカーネル学習スキーム(Rakerと名付けました)を開発します。動的環境でのパフォーマンスをさらに向上させるために、適応型マルチカーネル学習スキーム(AdaRakerと名付けました)が開発されました。AdaRakerは、カーネルの組み合わせのデータ駆動型学習だけでなく、未知のダイナミクスも考慮します。パフォーマンスは、静的および動的リグレットの両方の観点から分析されます。AdaRakerは、未知のダイナミクスを持つ環境で非線形学習関数を追跡する独自の機能を備えており、分析パフォーマンスが保証されています。合成データセットと実際のデータセットを使用したテストを実行して、新しいアルゴリズムの有効性を示します。

Spectrum Estimation from a Few Entries
いくつかのエントリからのスペクトル推定

Singular values of data in matrix form provide insights into the structure of the data, the effective dimensionality, and the choice of hyper-parameters in higher-level data analysis tools. However, in many practical applications such as collaborative filtering and network analysis, we only get a partial observation. Under such scenarios, we consider the fundamental problem of recovering spectral properties of the underlying matrix from a sampling of its entries. In this paper, we address the problem of directly recovering the spectrum, which is the set of singular values, as well as sample-efficient approaches for recovering a spectral sum function, which is an aggregate sum of a fixed function applied to each of the singular values. Our approach is to first estimate the Schatten $k$-norms of a matrix for a small set of values of $k$, and then apply Chebyshev approximation when estimating a spectral sum function or apply moment matching in Wasserstein distance when estimating the singular values directly. The main technical challenge is in accurately estimating the Schatten norms from a sampling of a matrix. We introduce a novel unbiased estimator based on counting small structures called network motifs in a graph and provide guarantees that match its empirical performance. Our theoretical analysis shows that Schatten norms can be recovered accurately from a strictly smaller number of samples compared to what is needed to recover the underlying low-rank matrix. Numerical experiments suggest that we significantly improve upon a competing approach of using matrix completion methods, below the matrix completion threshold, above which matrix completion algorithms recover the underlying low-rank matrix exactly.

行列形式のデータの特異値は、データの構造、有効次元数、および高レベルデータ分析ツールでのハイパーパラメータの選択に関する洞察を提供します。ただし、協調フィルタリングやネットワーク分析などの多くの実用的なアプリケーションでは、部分的な観察しか得られません。このようなシナリオでは、基礎となる行列のエントリのサンプリングからそのスペクトル特性を回復するという基本的な問題を検討します。この論文では、特異値の集合であるスペクトルを直接回復する問題と、各特異値に適用された固定関数の総和であるスペクトル和関数を回復するためのサンプル効率の高いアプローチについて取り上げます。私たちのアプローチは、まず行列のシャッテン$k$ノルムを小さなkの値の集合に対して推定し、次にスペクトル和関数を推定するときにチェビシェフ近似を適用するか、または特異値を直接推定するときにワッサーシュタイン距離でモーメントマッチングを適用することです。主な技術的課題は、行列のサンプリングからシャッテンノルムを正確に推定することです。グラフ内のネットワークモチーフと呼ばれる小さな構造をカウントすることに基づく新しい不偏推定量を導入し、その経験的パフォーマンスに匹敵する保証を提供します。理論分析により、シャッテンノルムは、基になる低ランク行列を復元するために必要な数よりも厳密に少ない数のサンプルから正確に復元できることが示されています。数値実験により、行列補完しきい値を下回る行列補完法を使用する競合アプローチを大幅に改善できることが示唆されています。このしきい値を上回ると、行列補完アルゴリズムによって基になる低ランク行列が正確に復元されます。
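
The quantities named in the abstract have standard definitions, restated here for reference. The notation $\sigma_i(M)$ for the singular values and $F(M; f)$ for the spectral sum function is chosen for this sketch and is not taken from the paper.

```latex
% Schatten k-norm of a matrix M with singular values \sigma_1(M) \ge \sigma_2(M) \ge \dots \ge 0
\begin{align*}
  \|M\|_k &= \Bigl( \sum_i \sigma_i(M)^k \Bigr)^{1/k}, \\
% spectral sum function for a fixed scalar function f
  F(M; f) &= \sum_i f\bigl(\sigma_i(M)\bigr), \\
% for even k the k-th power of the Schatten norm is a trace of a matrix power
  \|M\|_k^k &= \operatorname{tr}\bigl( (M^{\top} M)^{k/2} \bigr) \quad \text{for even } k .
\end{align*}
```

For $k = 2$ the last identity reduces to the squared Frobenius norm, $\|M\|_2^2 = \sum_{i,j} M_{ij}^2$, which depends on the entries only through a simple sum. This gives some intuition for why such quantities can be estimated from a sample of entries.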

Accelerated Alternating Projections for Robust Principal Component Analysis
ロバストな主成分分析のための加速交互投影

We study robust PCA for the fully observed setting, which is about separating a low rank matrix $\mathbf{L}$ and a sparse matrix $\mathbf{S}$ from their sum $\mathbf{D}=\mathbf{L}+\mathbf{S}$. In this paper, a new algorithm, dubbed accelerated alternating projections, is introduced for robust PCA which significantly improves the computational efficiency of the existing alternating projections proposed in (Netrapalli et al., 2014) when updating the low rank factor. The acceleration is achieved by first projecting a matrix onto some low dimensional subspace before obtaining a new estimate of the low rank matrix via truncated SVD. Exact recovery guarantee has been established which shows linear convergence of the proposed algorithm. Empirical performance evaluations establish the advantage of our algorithm over other state-of-the-art algorithms for robust PCA.

私たちは、低ランク行列$\mathbf{L}$とスパース行列$\mathbf{S}$をそれらの合計$\mathbf{D}=\mathbf{L}+\mathbf{S}$から分離する、完全に観測された設定のロバストなPCAを研究します。この論文では、ロバストPCAのために、加速交互投影と呼ばれる新しいアルゴリズムが導入され、低ランク係数を更新するときに(Netrapalliら, 2014)で提案された既存の交互投影の計算効率が大幅に向上します。この加速は、まず行列を低次元部分空間に射影してから、切り捨てられたSVDを介して低ランク行列の新しい推定値を取得することによって達成されます。提案されたアルゴリズムの線形収束を示す正確な回復保証が確立されています。経験的な性能評価により、ロバストなPCAのための他の最先端のアルゴリズムに対する当社のアルゴリズムの利点が確立されます。
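
As a rough sketch of the baseline scheme being accelerated, a plain alternating-projections iteration for robust PCA can be written as below. This is a simplified rendering, not the accelerated algorithm of the paper. The iteration index $t$, the target rank $r$, and the threshold $\zeta_t$ are assumed notation, and the threshold schedule is left unspecified.

```latex
% P_r:        projection onto rank-r matrices via truncated SVD
% T_{\zeta}:  entrywise hard thresholding, keeping entries with magnitude above \zeta
\begin{align*}
  \mathbf{L}_{t+1} &= \mathcal{P}_r\bigl(\mathbf{D} - \mathbf{S}_t\bigr), \\
  \mathbf{S}_{t+1} &= \mathcal{T}_{\zeta_{t+1}}\bigl(\mathbf{D} - \mathbf{L}_{t+1}\bigr).
\end{align*}
```

According to the abstract, the accelerated variant changes the first step: the matrix is projected onto a low-dimensional subspace before the truncated SVD is taken, which makes the low-rank update cheaper.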

spark-crowd: A Spark Package for Learning from Crowdsourced Big Data
spark-crowd: クラウドソーシングされたビッグデータから学ぶための Spark パッケージ

As data sets increase in size, the process of manually labeling data becomes infeasible for small groups of experts. Thus, it is common to rely on crowdsourcing platforms which provide inexpensive, but noisy, labels. Although implementations of algorithms to tackle this problem exist, none of them focus on scalability, limiting the area of application to relatively small data sets. In this paper, we present spark-crowd, an Apache Spark package for learning from crowdsourced data with scalability in mind.

データセットのサイズが大きくなると、データに手動でラベルを付けるプロセスは、専門家の小さなグループでは実行不可能になります。したがって、安価でありながら騒々しいラベルを提供するクラウドソーシングプラットフォームに頼るのが一般的です。この問題に対処するためのアルゴリズムの実装は存在しますが、スケーラビリティに重点を置いているものはなく、適用領域は比較的小さなデータセットに限定されています。この論文では、スケーラビリティを念頭に置いてクラウドソーシングデータから学習するためのApache Sparkパッケージであるspark-crowdを紹介します。

Multiplicative local linear hazard estimation and best one-sided cross-validation
乗法的局所線形ハザード推定と最良の片側交差検証

This paper develops detailed mathematical statistical theory of a new class of cross-validation techniques of local linear kernel hazards and their multiplicative bias corrections. The new class of cross-validation combines principles of local information and recent advances in indirect cross-validation. A few applications of cross-validating multiplicative kernel hazard estimation do exist in the literature. However, detailed mathematical statistical theory and small sample performance are introduced via this paper and further upgraded to our new class of best one-sided cross-validation. Best one-sided cross-validation turns out to have excellent performance in its practical illustrations, in its small sample performance and in its mathematical statistical theoretical performance.

この論文では、局所線形カーネルハザードの新しいクラスの交差検証手法とその乗法バイアス補正の詳細な数学的統計理論を開発します。新しいクラスのクロス検証は、ローカル情報の原則と、間接的なクロス検証の最近の進歩を組み合わせたものです。乗法カーネルハザード推定の交差検証のいくつかのアプリケーションは、文献に存在します。しかし、この論文では、詳細な数学的統計理論と小さなサンプル性能が紹介され、さらに新しいクラスの最高の片側交差検証にアップグレードされます。最良の片側交差検証は、その実用的な図解、小さなサンプル性能、および数学的統計理論性能において優れた性能を発揮することが証明されています。

Delay and Cooperation in Nonstochastic Bandits
非確率的バンディットにおける遅延と協力

We study networks of communicating learning agents that cooperate to solve a common nonstochastic bandit problem. Agents use an underlying communication network to get messages about actions selected by other agents, and drop messages that took more than $d$ hops to arrive, where $d$ is a delay parameter. We introduce Exp3-Coop, a cooperative version of the Exp3 algorithm and prove that with $K$ actions and $N$ agents the average per-agent regret after $T$ rounds is at most of order $\sqrt{\bigl(d+1 + \tfrac{K}{N}\alpha_{\le d}\bigr)(T\ln K)}$, where $\alpha_{\le d}$ is the independence number of the $d$-th power of the communication graph $G$. We then show that for any connected graph, for $d=\sqrt{K}$ the regret bound is $K^{1/4}\sqrt{T}$, strictly better than the minimax regret $\sqrt{KT}$ for noncooperating agents. More informed choices of $d$ lead to bounds which are arbitrarily close to the full information minimax regret $\sqrt{T\ln K}$ when $G$ is dense. When $G$ has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality in $G$, strictly improves the regret. Finally, as a by-product of our analysis, we provide the first characterization of the minimax regret for bandit learning with delay.

私たちは、一般的な非確率的バンディット問題を解決するために協力する通信学習エージェントのネットワークを研究します。エージェントは、基礎となる通信ネットワークを使用して、他のエージェントによって選択されたアクションに関するメッセージを取得し、到着までに$d$ホップ以上かかったメッセージをドロップします。ここで、$d$は遅延パラメータです。私たちは、Exp3アルゴリズムの協力バージョンであるExp3-Coopを導入し、$K$アクションと$N$エージェントで、$T$ラウンド後のエージェントあたりの平均後悔が最大で$\sqrt{\bigl(d+1 + \tfrac{K}{N}\alpha_{\le d}\bigr)(T\ln K)}$のオーダーであることを証明します。ここで、$\alpha_{\le d}$は、通信グラフ$G$の$d$乗の独立数です。次に、任意の連結グラフについて、$d=\sqrt{K}$に対して、後悔の境界は$K^{1/4}\sqrt{T}$であり、非協力エージェントのミニマックス後悔$\sqrt{KT}$よりも確実に優れていることを示します。$d$をより情報に基づいて選択すると、$G$が密な場合、完全情報ミニマックス後悔$\sqrt{T\ln K}$に任意に近い境界が得られます。$G$にスパースなコンポーネントがある場合、エージェントが$G$での中心性に応じてパラメータを選択できるExp3-Coopの変形により、後悔が確実に改善されることを示します。最後に、分析の副産物として、遅延のあるバンディット学習のミニマックス後悔の最初の特性評価を提供します。
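
For readers unfamiliar with the base algorithm, the following is a minimal single-agent Exp3 sketch using importance-weighted loss estimates. It is not Exp3-Coop: there is no communication graph, no delay parameter $d$, and no sharing of observations between agents. The function signature, the loss function, and the learning-rate choice are assumptions made for this example.

```python
import math
import random


def exp3(loss_fn, n_actions, n_rounds, eta, rng):
    """Single-agent Exp3 with losses in [0, 1].

    loss_fn(t, a) returns the loss of action a at round t; only the loss of
    the action actually played is observed (bandit feedback).
    Returns the total incurred loss and the last sampling distribution used.
    """
    cum_est = [0.0] * n_actions            # cumulative importance-weighted loss estimates
    probs = [1.0 / n_actions] * n_actions
    total_loss = 0.0
    for t in range(n_rounds):
        shift = min(cum_est)               # subtract the minimum for numerical stability
        weights = [math.exp(-eta * (est - shift)) for est in cum_est]
        norm = sum(weights)
        probs = [w / norm for w in weights]
        action = rng.choices(range(n_actions), weights=probs)[0]
        loss = loss_fn(t, action)
        total_loss += loss
        cum_est[action] += loss / probs[action]   # unbiased estimate of the played action's loss
    return total_loss, probs


if __name__ == "__main__":
    K, T = 3, 5000
    eta = math.sqrt(2.0 * math.log(K) / (T * K))  # a common tuning, assumed here
    losses = [0.1, 0.9, 0.9]                      # hypothetical fixed per-action losses
    total, final_probs = exp3(lambda t, a: losses[a], K, T, eta, random.Random(0))
    print(total, final_probs)
```

In the cooperative setting of the paper, each agent would additionally use loss observations relayed by agents within $d$ hops, which is what improves the dependence on $K$ in the regret bound.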

Smooth neighborhood recommender systems
スムーズなネイバーフッドレコメンダーシステム

Recommender systems predict users’ preferences over a large number of items by pooling similar information from other users and/or items in the presence of sparse observations. One major challenge is how to utilize user-item specific covariates and networks describing user-item interactions in a high-dimensional situation, for accurate personalized prediction. In this article, we propose a smooth neighborhood recommender in the framework of the latent factor models. A similarity kernel is utilized to borrow neighborhood information from continuous covariates over a user-item specific network, such as a user’s social network, where the grouping information defined by discrete covariates is also integrated through the network. Consequently, user-item specific information is built into the recommender to battle the “cold-start” issue in the absence of observations in collaborative and content-based filtering. Moreover, we utilize a “divide-and-conquer” version of the alternating least squares algorithm to achieve scalable computation, and establish asymptotic results for the proposed method, demonstrating that it achieves superior prediction accuracy. Finally, we illustrate that the proposed method improves substantially over its competitors in simulated examples and real benchmark data (Last.fm music data).

レコメンデーションシステムは、スパースな観測値が存在する場合に、他のユーザーやアイテムからの類似情報をプールすることで、多数のアイテムに対するユーザーの好みを予測します。1つの大きな課題は、高次元の状況でユーザーとアイテムの相互作用を記述するユーザーとアイテム固有の共変量とネットワークをどのように利用して、正確なパーソナライズされた予測を行うかということです。この記事では、潜在因子モデルのフレームワークでスムーズな近隣レコメンデーションを提案します。類似性カーネルは、ユーザーのソーシャルネットワークなどのユーザーとアイテム固有のネットワーク上の連続共変量から近隣情報を借用するために使用され、離散共変量によって定義されたグループ化情報もネットワークを通じて統合されます。その結果、ユーザーとアイテム固有の情報がレコメンデーションに組み込まれ、協調フィルタリングとコンテンツベースのフィルタリングで観測値がない場合の「コールドスタート」問題に対処します。さらに、スケーラブルな計算を実現するために、交互最小二乗アルゴリズムの「分割統治」バージョンを使用し、提案された方法の漸近結果を確立して、優れた予測精度を実現することを実証します。最後に、提案された方法が、シミュレートされた例と実際のベンチマークデータ(Last.fmの音楽データ)において競合方法よりも大幅に改善されることを示します。

Automated Scalable Bayesian Inference via Hilbert Coresets
ヒルベルトコアセットによる自動化されたスケーラブルなベイズ推論

The automation of posterior inference in Bayesian data analysis has enabled experts and nonexperts alike to use more sophisticated models, engage in faster exploratory modeling and analysis, and ensure experimental reproducibility. However, standard automated posterior inference algorithms are not tractable at the scale of massive modern data sets, and modifications to make them so are typically model-specific, require expert tuning, and can break theoretical guarantees on inferential quality. Building on the Bayesian coresets framework, this work instead takes advantage of data redundancy to shrink the data set itself as a preprocessing step, providing fully-automated, scalable Bayesian inference with theoretical guarantees. We begin with an intuitive reformulation of Bayesian coreset construction as sparse vector sum approximation, and demonstrate that its automation and performance-based shortcomings arise from the use of the supremum norm. To address these shortcomings we develop Hilbert coresets, i.e., Bayesian coresets constructed under a norm induced by an inner-product on the log-likelihood function space. We propose two Hilbert coreset construction algorithms—one based on importance sampling, and one based on the Frank-Wolfe algorithm—along with theoretical guarantees on approximation quality as a function of coreset size. Since the exact computation of the proposed inner-products is model-specific, we automate the construction with a random finite-dimensional projection of the log-likelihood functions. The resulting automated coreset construction algorithm is simple to implement, and experiments on a variety of models with real and synthetic data sets show that it provides high-quality posterior approximations and a significant reduction in the computational cost of inference.

ベイズデータ分析における事後推論の自動化により、専門家も非専門家も同様に、より洗練されたモデルを使用し、より迅速な探索的モデリングと分析に従事し、実験の再現性を確保できるようになりました。ただし、標準的な自動事後推論アルゴリズムは、大規模な最新のデータセットの規模では扱いにくく、そのための修正は通常モデル固有であり、専門家の調整が必要で、推論品質の理論的な保証に反する可能性があります。ベイズコアセットフレームワークを基にしたこの研究では、代わりにデータの冗長性を利用して、前処理ステップとしてデータセット自体を縮小し、理論的な保証を備えた完全に自動化されたスケーラブルなベイズ推論を提供します。ベイズコアセット構築をスパースベクトル和近似として直感的に再定式化することから始め、その自動化とパフォーマンスベースの欠点が上限ノルムの使用から生じることを示します。これらの欠点に対処するために、我々はヒルベルトコアセット、すなわち対数尤度関数空間上の内積によって誘導されるノルムの下で構築されるベイジアンコアセットを開発します。私たちは、コアセットサイズの関数としての近似品質の理論的保証とともに、2つのヒルベルトコアセット構築アルゴリズム(1つは重要度サンプリングに基づくもの、もう1つはFrank-Wolfeアルゴリズムに基づくもの)を提案します。提案された内積の正確な計算はモデル固有であるため、我々は対数尤度関数のランダムな有限次元投影を使用して構築を自動化します。結果として得られる自動化されたコアセット構築アルゴリズムは実装が簡単で、実際のデータセットと合成データセットを使用したさまざまなモデルの実験により、高品質の事後近似と推論の計算コストの大幅な削減が得られることがわかった。

Approximations of the Restless Bandit Problem
レストレス・バンディット問題の近似

The multi-armed restless bandit problem is studied in the case where the pay-off distributions are stationary $\varphi$-mixing. This version of the problem provides a more realistic model for most real-world applications, but cannot be optimally solved in practice, since it is known to be PSPACE-hard. The objective of this paper is to characterize a sub-class of the problem where good approximate solutions can be found using tractable approaches. Specifically, it is shown that under some conditions on the $\varphi$-mixing coefficients, a modified version of UCB can prove effective. The main challenge is that, unlike in the i.i.d. setting, the distributions of the sampled pay-offs may not have the same characteristics as those of the original bandit arms. In particular, the $\varphi$-mixing property does not necessarily carry over. This is overcome by carefully controlling the effect of a sampling policy on the pay-off distributions. Some of the proof techniques developed in this paper can be more generally used in the context of online sampling under dependence. The proposed algorithms are accompanied by corresponding regret analyses.

マルチアームの落ち着きのないバンディット問題は、ペイオフ分布が定常$\varphi$混合である場合に研究されます。この問題のバージョンは、ほとんどの実際のアプリケーションに対してより現実的なモデルを提供しますが、PSPACE困難であることが知られているため、実際には最適に解決することはできません。この論文の目的は、扱いやすいアプローチを使用して適切な近似解を見つけることができる問題のサブクラスを特徴付けることです。具体的には、$\varphi$混合係数に関するいくつかの条件下では、UCBの修正バージョンが有効であることが証明されます。主な課題は、i.i.d.設定とは異なり、サンプリングされたペイオフの分布が元のバンディットアームの分布と同じ特性を持たない可能性があることです。特に、$\varphi$混合特性は必ずしも引き継がれません。これは、サンプリングポリシーがペイオフ分布に与える影響を慎重に制御することで克服されます。この論文で開発された証明手法のいくつかは、依存関係のあるオンラインサンプリングのコンテキストでより一般的に使用できます。提案されたアルゴリズムには、対応する後悔分析が伴います。

Train and Test Tightness of LP Relaxations in Structured Prediction
構造化予測におけるLP緩和の学習とテストの厳密性

Structured prediction is used in areas including computer vision and natural language processing to predict structured outputs such as segmentations or parse trees. In these settings, prediction is performed by MAP inference or, equivalently, by solving an integer linear program. Because of the complex scoring functions required to obtain accurate predictions, both learning and inference typically require the use of approximate solvers. We propose a theoretical explanation for the striking observation that approximations based on linear programming (LP) relaxations are often tight (exact) on real-world instances. In particular, we show that learning with LP relaxed inference encourages integrality of training instances, and that this training tightness generalizes to test data.

構造化予測は、コンピュータービジョンや自然言語処理などの分野で使用され、セグメンテーションや解析ツリーなどの構造化された出力を予測します。これらの設定では、予測はMAP推論によって、または同等の方法で整数線形プログラムを解くことによって実行されます。正確な予測を得るためには複雑なスコアリング関数が必要なため、学習と推論の両方で通常、近似ソルバーを使用する必要があります。線形計画法(LP)緩和に基づく近似は、実世界のインスタンスではしばしば厳密(正確)であるという驚くべき観察の理論的説明を提案します。特に、LP緩和推論による学習が訓練インスタンスの積分性を促進し、この訓練の厳密性がテストデータに一般化することを示します。

Scalable Kernel K-Means Clustering with Nystrom Approximation: Relative-Error Bounds
ニストローム近似によるスケーラブルカーネルK平均クラスタリング：相対誤差境界

Kernel $k$-means clustering can correctly identify and extract a far more varied collection of cluster structures than the linear $k$-means clustering algorithm. However, kernel $k$-means clustering is computationally expensive when the non-linear feature map is high-dimensional and there are many input points. Kernel approximation, e.g., the Nystrom method, has been applied in previous works to approximately solve kernel learning problems when both of the above conditions are present. This work analyzes the application of this paradigm to kernel $k$-means clustering, and shows that applying the linear $k$-means clustering algorithm to $\frac{k}{\epsilon} (1 + o(1))$ features constructed using a so-called rank-restricted Nystrom approximation results in cluster assignments that satisfy a $1 + \epsilon$ approximation ratio in terms of the kernel $k$-means cost function, relative to the guarantee provided by the same algorithm without the use of the Nystrom method. As part of the analysis, this work establishes a novel $1 + \epsilon$ relative-error trace norm guarantee for low-rank approximation using the rank-restricted Nystrom approximation. Empirical evaluations on the $8.1$ million instance MNIST8M dataset demonstrate the scalability and usefulness of kernel $k$-means clustering with Nystrom approximation. This work argues that spectral clustering using Nystrom approximation—a popular and computationally efficient, but theoretically unsound approach to non-linear clustering—should be replaced with the efficient and theoretically sound combination of kernel $k$-means clustering with Nystrom approximation. The superior performance of the latter approach is empirically verified.

カーネル$k$-meansクラスタリングは、線形$k$-meansクラスタリングアルゴリズムよりもはるかに多様なクラスター構造のコレクションを正しく識別して抽出できます。ただし、非線形特徴マップが高次元で入力ポイントが多い場合、カーネル$k$-meansクラスタリングは計算コストが高くなります。上記の両方の条件が存在する場合、カーネル学習問題を近似的に解決するために、以前の研究でカーネル近似(たとえばNystrom法)が適用されてきました。この研究では、このパラダイムのカーネル$k$-meansクラスタリングへの適用を分析し、いわゆるランク制限Nystrom近似を使用して構築された$\frac{k}{\epsilon} (1 + o(1))$特徴に線形$k$-meansクラスタリングアルゴリズムを適用すると、カーネル$k$-meansコスト関数に関して$1 + \epsilon$近似比を満たすクラスター割り当てが得られることを示しています。これは、Nystrom法を使用しない同じアルゴリズムによって提供される保証と比較して高いものです。分析の一環として、この研究では、ランク制限Nystrom近似を使用した低ランク近似に対する新しい$1 + \epsilon$相対誤差トレースノルム保証を確立します。810万インスタンスのMNIST8Mデータセットでの実証的評価により、Nystrom近似を使用したカーネル$k$-meansクラスタリングのスケーラビリティと有用性が実証されています。この研究では、ニストロム近似を使用したスペクトルクラスタリング(非線形クラスタリングに対する一般的で計算効率は高いが、理論的には不健全なアプローチ)を、カーネル$k$平均クラスタリングとニストロム近似の効率的で理論的に健全な組み合わせに置き換える必要があると主張しています。後者のアプローチの優れたパフォーマンスは経験的に検証されています。

An Approach to One-Bit Compressed Sensing Based on Probably Approximately Correct Learning Theory
確率的近似的正確(PAC)学習理論に基づく1ビット圧縮センシングへのアプローチ

In this paper, the problem of one-bit compressed sensing (OBCS) is formulated as a problem in probably approximately correct (PAC) learning. It is shown that the Vapnik-Chervonenkis (VC-) dimension of the set of half-spaces in $\mathbb{R}^n$ generated by $k$-sparse vectors is bounded below by $k ( \lfloor\lg (n/k) \rfloor +1 )$ and above by $\lfloor 2k \lg (en) \rfloor$. By coupling this estimate with well-established results in PAC learning theory, we show that a consistent algorithm can recover a $k$-sparse vector with $O(k \lg n)$ measurements, given only the signs of the measurement vector. This result holds for all probability measures on $\mathbb{R}^n$. The theory is also applicable to the case of noisy labels, where the signs of the measurements are flipped with some unknown probability.

この論文では、1ビット圧縮センシング(OBCS)の問題を、確率的近似的正確(PAC)学習の問題として定式化します。$k$-スパースベクトルによって生成される$\mathbb{R}^n$の半空間の集合のVapnik-Chervonenkis (VC-)次元は、下から$k ( \lfloor\lg (n/k) \rfloor +1 )$、上から$\lfloor 2k \lg (en) \rfloor$で抑えられることが示されます。この推定値をPAC学習理論の確立された結果と組み合わせることで、無矛盾なアルゴリズムが、測定ベクトルの符号のみが与えられた状態で、$O(k \lg n)$回の測定から$k$-スパースベクトルを復元できることを示します。この結果は、$\mathbb{R}^n$上のすべての確率測度に対して成り立ちます。この理論は、測定値の符号が未知の確率で反転する、ノイズの多いラベルの場合にも適用できます。
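
To get a feel for the size of the two VC-dimension bounds quoted in the abstract, they can be evaluated numerically. The sketch below assumes that $\lg$ denotes the base-2 logarithm; the function name `vc_dimension_bounds` is hypothetical.

```python
import math
from typing import Tuple


def vc_dimension_bounds(n: int, k: int) -> Tuple[int, int]:
    """Evaluate the lower and upper bounds from the abstract.

    lower = k * (floor(lg(n / k)) + 1)
    upper = floor(2 * k * lg(e * n))
    Assumes lg is the base-2 logarithm.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    lower = k * (math.floor(math.log2(n / k)) + 1)
    upper = math.floor(2 * k * math.log2(math.e * n))
    return lower, upper


print(vc_dimension_bounds(1024, 4))  # prints (36, 91)
```

For $n = 1024$ and $k = 4$ both bounds are far below $n$, in line with the $O(k \lg n)$ measurement count stated in the abstract.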

Graphical Lasso and Thresholding: Equivalence and Closed-form Solutions
グラフィカルラッソとしきい値処理: 等価性と閉形式解

Graphical Lasso (GL) is a popular method for learning the structure of an undirected graphical model, which is based on an $l_1$ regularization technique. The objective of this paper is to compare the computationally-heavy GL technique with a numerically-cheap heuristic method that is based on simply thresholding the sample covariance matrix. To this end, two notions of sign-consistent and inverse-consistent matrices are developed, and then it is shown that the thresholding and GL methods are equivalent if: (i) the thresholded sample covariance matrix is both sign-consistent and inverse-consistent, and (ii) the gap between the largest thresholded and the smallest un-thresholded entries of the sample covariance matrix is not too small. By building upon this result, it is proved that the GL method, as a conic optimization problem, has an explicit closed-form solution if the thresholded sample covariance matrix has an acyclic structure. This result is then generalized to arbitrary sparse support graphs, where a formula is found to obtain an approximate solution of GL. Furthermore, it is shown that the approximation error of the derived explicit formula decreases exponentially fast with respect to the length of the minimum-length cycle of the sparsity graph. The developed results are demonstrated on synthetic data, functional MRI data, traffic flows for transportation networks, and massive randomly generated data sets. We show that the proposed method can obtain an accurate approximation of the GL for instances with sizes as large as $80,000\times 80,000$ (more than 3.2 billion variables) in less than 30 minutes on a standard laptop computer running MATLAB, while other state-of-the-art methods do not converge within 4 hours.

グラフィカルラッソ(GL)は、$l_1$正則化手法に基づく無向グラフィカルモデルの構造を学習するための一般的な方法です。この論文の目的は、計算量の多いGL手法と、単純にサンプル共分散行列をしきい値化するだけの数値的に安価なヒューリスティック手法を比較することです。このために、符号整合行列と逆整合行列という2つの概念が開発され、次に、(i)しきい値化されたサンプル共分散行列が符号整合かつ逆整合であり、(ii)サンプル共分散行列の最大しきい値化エントリと最小しきい値化されていないエントリ間のギャップが小さすぎない場合に、しきい値化手法とGL手法が同等であることが示されます。この結果に基づいて、しきい値化されたサンプル共分散行列が非巡回構造である場合、円錐最適化問題としてのGL手法には明示的な閉形式の解があることが証明されます。この結果は任意のスパースサポートグラフに一般化され、GLの近似解を得るための公式が見つかります。さらに、導出された明示的な公式の近似誤差は、スパースグラフの最小長サイクルの長さに対して指数関数的に急速に減少することが示されています。開発された結果は、合成データ、機能的MRIデータ、輸送ネットワークの交通流、およびランダムに生成された大規模なデータセットで実証されています。提案された方法は、MATLABを実行する標準的なラップトップコンピューターで30分以内に、最大$80,000\times 80,000$ (32億を超える変数)のサイズのインスタンスのGLの正確な近似を取得できることを示しています。一方、他の最先端の方法は4時間以内に収束しません。
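
The thresholding heuristic and the acyclicity condition described in the abstract can be sketched in a few lines. The example below keeps the off-diagonal entries of a sample covariance matrix whose magnitude exceeds a threshold and then checks whether the resulting support graph is acyclic. The function names, the hard-threshold rule, and the toy matrix are illustrative assumptions; the paper's closed-form solution itself is not reproduced here.

```python
def threshold_support(S, lam):
    """Edges (i, j), i < j, whose sample covariance magnitude exceeds lam."""
    n = len(S)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if abs(S[i][j]) > lam]


def is_acyclic(n, edges):
    """Union-find check that an undirected graph on n nodes has no cycle."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            return False  # i and j were already connected, so this edge closes a cycle
        parent[ri] = rj
    return True


# Toy 4x4 sample covariance matrix (made-up values).
S = [
    [1.0, 0.6, 0.1, 0.05],
    [0.6, 1.0, 0.5, 0.1],
    [0.1, 0.5, 1.0, 0.4],
    [0.05, 0.1, 0.4, 1.0],
]
chain = threshold_support(S, lam=0.3)    # [(0, 1), (1, 2), (2, 3)]
dense = threshold_support(S, lam=0.08)   # also keeps the 0.1 entries
print(is_acyclic(4, chain), is_acyclic(4, dense))  # prints True False
```

With the larger threshold the support graph is a chain, the acyclic case for which the abstract states that a closed-form solution exists; with the smaller threshold the graph contains cycles and only the approximate formula would apply.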

Dynamic Pricing in High-dimensions
高次元でのダイナミックプライシング

We study the pricing problem faced by a firm that sells a large number of products, described via a wide range of features, to customers that arrive over time. Customers independently make purchasing decisions according to a general choice model that includes product features and customers’ characteristics, encoded as $d$-dimensional numerical vectors, as well as the price offered. The parameters of the choice model are a priori unknown to the firm, but can be learned as the (binary-valued) sales data accrues over time. The firm’s objective is to maximize its revenue. We benchmark the performance using the classic regret minimization framework where the regret is defined as the expected revenue loss against a clairvoyant policy that knows the parameters of the choice model in advance, and always offers the revenue-maximizing price. This setting is motivated in part by the prevalence of online marketplaces that allow for real-time pricing. We assume a structured choice model, parameters of which depend on $s_0$ out of the $d$ product features. Assuming that the market noise distribution is known, we propose a dynamic policy, called Regularized Maximum Likelihood Pricing (RMLP), that leverages the (sparsity) structure of the high-dimensional model and obtains a logarithmic regret in $T$. More specifically, the regret of our algorithm is of $O(s_0 \log d \cdot \log T)$. Furthermore, we show that no policy can obtain regret better than $O(s_0 (\log d + \log T))$. In addition, we propose a generalization of our policy to a setting in which the market noise distribution is unknown but belongs to a parametrized family of distributions. This policy obtains regret of $O(\sqrt{(\log d)T})$. We further show that no policy can obtain regret better than $\Omega(\sqrt{T})$ in such environments.

私たちは、さまざまな機能で記述される多数の製品を、時間の経過とともに訪れる顧客に販売する企業が直面する価格設定の問題を研究します。顧客は、製品の機能と顧客の特性($d$次元の数値ベクトルとしてエンコード)、および提示価格を含む一般的な選択モデルに従って、個別に購入を決定します。選択モデルのパラメータは企業には事前には不明ですが、時間の経過とともに(バイナリ値の)販売データが蓄積されるにつれて学習できます。企業の目的は、収益を最大化することです。私たちは、後悔を、選択モデルのパラメータを事前に知っていて、常に収益を最大化する価格を提示する千里眼のポリシーに対する予想される収益損失として定義する、古典的な後悔最小化フレームワークを使用してパフォーマンスをベンチマークします。この設定は、リアルタイムの価格設定を可能にするオンラインマーケットプレイスの普及に部分的に動機付けられています。私たちは、パラメータが$d$個の製品機能のうち$s_0$に依存する構造化選択モデルを想定します。市場ノイズ分布が既知であると仮定して、高次元モデルの（スパース）構造を活用し、$T$で対数的な後悔を得るRegularized Maximum Likelihood Pricing（RMLP）と呼ばれる動的ポリシーを提案します。より具体的には、私たちのアルゴリズムの後悔は$O(s_0 \log d \cdot \log T)$です。さらに、$O(s_0 (\log d + \log T))$よりも良い後悔を得るポリシーはないことを示しています。さらに、市場ノイズ分布は未知だが、パラメーター化された分布族に属する設定へのポリシーの一般化を提案します。このポリシーは、$O(\sqrt{(\log d)T})$の後悔を得ます。さらに、そのような環境では、$\Omega(\sqrt{T})$よりも良い後悔を得るポリシーはないことを示しています。

Forward-Backward Selection with Early Dropping
アーリードロップによるフォワード/バックワード選択

Forward-backward selection is one of the most basic and commonly-used feature selection algorithms available. It is also general and conceptually applicable to many different types of data. In this paper, we propose a heuristic that significantly improves its running time, while preserving predictive performance. The idea is to temporarily discard the variables that are conditionally independent with the outcome given the selected variable set. Depending on how those variables are reconsidered and reintroduced, this heuristic gives rise to a family of algorithms with increasingly stronger theoretical guarantees. In distributions that can be faithfully represented by Bayesian networks or maximal ancestral graphs, members of this algorithmic family are able to correctly identify the Markov blanket in the sample limit. In experiments we show that the proposed heuristic increases computational efficiency by about 1-2 orders of magnitude, while selecting fewer or the same number of variables and retaining predictive performance. Furthermore, we show that the proposed algorithm and feature selection with LASSO perform similarly when restricted to select the same number of variables, making the proposed algorithm an attractive alternative for problems where no (efficient) algorithm for LASSO exists.

前方後方選択は、利用可能な最も基本的で一般的に使用されている特徴選択アルゴリズムの1つです。また、これは汎用的で、概念的には多くの異なるタイプのデータに適用できます。この論文では、予測性能を維持しながら実行時間を大幅に改善するヒューリスティックを提案します。アイデアは、選択された変数セットが与えられた結果と条件付きで独立している変数を一時的に破棄することです。これらの変数がどのように再検討され、再導入されるかに応じて、このヒューリスティックは、ますます強力な理論的保証を持つアルゴリズムのファミリーを生み出します。ベイジアンネットワークまたは最大祖先グラフによって忠実に表現できる分布では、このアルゴリズムファミリーのメンバーは、サンプル制限内のマルコフブランケットを正しく識別できます。実験では、提案されたヒューリスティックにより、より少ないまたは同じ数の変数を選択し、予測性能を維持しながら、計算効率が約1～2桁向上することを示しています。さらに、提案されたアルゴリズムとLASSOによる特徴選択は、同じ数の変数を選択するように制限されている場合に同様に機能することを示します。これにより、提案されたアルゴリズムは、LASSOの(効率的な)アルゴリズムが存在しない問題に対する魅力的な代替手段になります。
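
The early-dropping idea can be sketched as follows, under the assumption that a conditional-independence test is available as a callable returning a p-value. The function `fbed0`, the `p_value(variable, selected)` signature, and the toy oracle are all hypothetical; this is a simplified single-run variant written for illustration, not the authors' reference implementation, and it omits the later runs in which dropped variables are reconsidered.

```python
def fbed0(variables, p_value, alpha=0.05):
    """Forward selection with early dropping, followed by backward elimination.

    p_value(v, selected) is assumed to return the p-value of a test of
    independence between v and the outcome given the selected variables.
    """
    selected = []
    remaining = list(variables)
    # Forward phase: candidates that look conditionally independent of the
    # outcome given the current selection are dropped and never re-tested.
    while remaining:
        scored = [(p_value(v, selected), v) for v in remaining]
        scored = [(p, v) for p, v in scored if p <= alpha]  # early dropping
        if not scored:
            break
        _, best = min(scored, key=lambda item: item[0])
        selected.append(best)
        remaining = [v for _, v in scored if v != best]

    # Backward phase: remove selected variables that became redundant.
    def p_without(v):
        return p_value(v, [u for u in selected if u != v])

    while selected:
        worst = max(selected, key=p_without)
        if p_without(worst) <= alpha:
            break
        selected.remove(worst)
    return selected


def toy_p_value(v, selected):
    """Made-up oracle: a and b are relevant, c is redundant given a, d is noise."""
    if v in ("a", "b"):
        return 0.001
    if v == "c":
        return 0.9 if "a" in selected else 0.01
    return 0.8


print(fbed0(["a", "b", "c", "d"], toy_p_value))  # prints ['a', 'b']
```

In the toy run, `d` is dropped in the first iteration and `c` in the second, so neither is tested again; this is where the running-time saving described in the abstract would come from.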

Scalable Approximations for Generalized Linear Problems
一般化線形問題に対するスケーラブルな近似

In stochastic optimization, the population risk is generally approximated by the empirical risk which is in turn minimized by an iterative algorithm. However, in the large-scale setting, empirical risk minimization may be computationally restrictive. In this paper, we design an efficient algorithm to approximate the population risk minimizer in generalized linear problems such as binary classification with surrogate losses and generalized linear regression models. We focus on large-scale problems where the iterative minimization of the empirical risk is computationally intractable, i.e., the number of observations $n$ is much larger than the dimension of the parameter $p$ ($n \gg p \gg 1$). We show that under random sub-Gaussian design, the true minimizer of the population risk is approximately proportional to the corresponding ordinary least squares (OLS) estimator. Using this relation, we design an algorithm that achieves the same accuracy as the empirical risk minimizer through iterations that attain up to a quadratic convergence rate, and that are computationally cheaper than any batch optimization algorithm by at least a factor of $\mathcal{O}(p)$. We provide theoretical guarantees for our algorithm, and analyze the convergence behavior in terms of data dimensions. Finally, we demonstrate the performance of our algorithm on well-known classification and regression problems, through extensive numerical studies on large-scale datasets, and show that it achieves the highest performance compared to several other widely used optimization algorithms.

確率的最適化では、一般に集団リスクは経験的リスクによって近似され、経験的リスクは反復アルゴリズムによって最小化されます。しかし、大規模な設定では、経験的リスク最小化は計算上の制約がある場合があります。この論文では、代理損失によるバイナリ分類や一般化線形回帰モデルなどの一般化線形問題における集団リスク最小化を近似する効率的なアルゴリズムを設計します。経験的リスクの反復最小化が計算上手に負えない、つまり観測数$n$がパラメーター$p$の次元($n \gg p \gg 1$)よりもはるかに大きい大規模問題に焦点を当てます。ランダムサブガウス設計では、集団リスクの真の最小化は、対応する通常最小二乗(OLS)推定量にほぼ比例することを示します。この関係を使用して、2次収束率に達する反復を通じて経験的リスク最小化器と同じ精度を達成し、バッチ最適化アルゴリズムよりも少なくとも$\mathcal{O}(p)$倍計算コストが低いアルゴリズムを設計します。アルゴリズムの理論的な保証を提供し、データ次元の観点から収束動作を分析します。最後に、大規模なデータセットでの広範な数値研究を通じて、よく知られている分類および回帰問題に対するアルゴリズムのパフォーマンスを実証し、広く使用されている他のいくつかの最適化アルゴリズムと比較して最高のパフォーマンスを達成することを示します。

scikit-multilearn: A Python library for Multi-Label Classification
scikit-multilearn: マルチラベル分類のためのPythonライブラリ

scikit-multilearn is a Python library for performing multi-label classification. It is compatible with the scikit-learn and scipy ecosystems and uses sparse matrices for all internal operations. It provides native Python implementations of popular multi-label classification methods alongside a novel framework for label space partitioning and division, and includes modern algorithm adaptation methods, network-based label space division approaches that extract label dependency information, and multi-label embedding classifiers. The library provides Python-wrapped access to the extensive multi-label method stack from Java libraries and makes it possible to extend deep learning single-label methods for multi-label tasks. The library allows multi-label stratification and data set management. The implementation is more efficient in problem transformation than other established libraries, has good test coverage and follows PEP8. Source code and documentation can be downloaded from http://scikit.ml and also via pip. The project is BSD-licensed.

scikit-multilearnは、マルチラベル分類を実行するためのPythonライブラリです。scikit-learnおよびscipyエコシステムと互換性があり、すべての内部操作にスパースマトリックスを使用します。一般的なマルチラベル分類方法のネイティブPython実装と、ラベル空間の分割および分割のための新しいフレームワークを提供し、最新のアルゴリズム適応方法、ラベル依存情報を抽出するネットワークベースのラベル空間分割アプローチ、およびマルチラベル埋め込み分類子が含まれています。ライブラリは、Javaライブラリからの広範なマルチラベルメソッドスタックへのPythonラップアクセスを提供し、ディープラーニングのシングルラベルメソッドをマルチラベルタスクに拡張できるようにします。ライブラリを使用すると、マルチラベルの階層化とデータセット管理が可能になります。実装は、他の確立されたライブラリよりも問題の変換が効率的で、テストカバレッジが良好で、PEP8に準拠しています。ソースコードとドキュメントは、http://scikit.mlから、またはpip経由でダウンロードできます。プロジェクトはBSDライセンスです。

Non-Convex Projected Gradient Descent for Generalized Low-Rank Tensor Regression
一般化低ランクテンソル回帰のための非凸投影勾配降下法

In this paper, we consider the problem of learning high-dimensional tensor regression problems with low-rank structure. One of the core challenges associated with learning high-dimensional models is computation since the underlying optimization problems are often non-convex. While convex relaxations could lead to polynomial-time algorithms they are often slow in practice. On the other hand, limited theoretical guarantees exist for non-convex methods. In this paper we provide a general framework that provides theoretical guarantees for learning high-dimensional tensor regression models under different low-rank structural assumptions using the projected gradient descent algorithm applied to a potentially non-convex constraint set $\Theta$ in terms of its localized Gaussian width (due to Gaussian design). We juxtapose our theoretical results for non-convex projected gradient descent algorithms with previous results on regularized convex approaches. The two main differences between the convex and non-convex approach are: (i) from a computational perspective whether the non-convex projection operator is computable and whether the projection has desirable contraction properties and (ii) from a statistical error bound perspective, the non-convex approach has a superior rate for a number of examples. We provide three concrete examples of low-dimensional structure which address these issues and explain the pros and cons for the non-convex and convex approaches. We supplement our theoretical results with simulations which show that, under several common settings of generalized low rank tensor regression, the projected gradient descent approach is superior both in terms of statistical error and run-time provided the step-sizes of the projected descent algorithm are suitably chosen.

この論文では、低ランク構造を持つ高次元テンソル回帰問題の学習問題について検討します。高次元モデルの学習に関連する主要な課題の1つは計算です。これは、基礎となる最適化問題が非凸であることが多いためです。凸緩和は多項式時間アルゴリズムにつながる可能性がありますが、実際には遅くなることがよくあります。一方、非凸手法には限られた理論的保証があります。この論文では、潜在的に非凸の制約セット$\Theta$に適用される投影勾配降下アルゴリズムを使用して、さまざまな低ランク構造仮定の下で高次元テンソル回帰モデルを学習するための理論的保証を提供する一般的なフレームワークを提供します。これは、ガウス設計による局所的なガウス幅に関してです。非凸投影勾配降下アルゴリズムの理論的結果を、正規化された凸アプローチに関する以前の結果と並置します。凸アプローチと非凸アプローチの主な違いは2つあります。(i)計算の観点から、非凸射影演算子が計算可能かどうか、および射影が望ましい収縮特性を持つかどうか、(ii)統計的誤差境界の観点から、非凸アプローチは多くの例で優れたレートを持ちます。これらの問題に対処する低次元構造の具体的な例を3つ示し、非凸アプローチと凸アプローチの長所と短所を説明します。理論結果をシミュレーションで補足すると、一般化低ランクテンソル回帰のいくつかの一般的な設定では、射影勾配降下法のアルゴリズムのステップサイズが適切に選択されていれば、統計的誤差と実行時間の両方の点で射影勾配降下法が優れていることがわかります。

Convergence Rate of a Simulated Annealing Algorithm with Noisy Observations
ノイズの多い観測値を持つシミュレーテッドアニーリングアルゴリズムの収束率

In this paper we propose a modified version of the simulated annealing algorithm for solving a stochastic global optimization problem. More precisely, we address the problem of finding a global minimizer of a function with noisy evaluations. We provide a rate of convergence and its optimized parametrization to ensure a minimal number of evaluations for a given accuracy and a confidence level close to 1. The work is complemented by a set of numerical experiments that assess the practical performance both on benchmark test cases and on real-world examples.

この論文では、確率的大域最適化問題を解くためのシミュレーテッドアニーリングアルゴリズムの修正版を提案します。より正確には、ノイズの多い評価を持つ関数の大域最小化器を見つける問題に対処します。収束率とその最適化されたパラメータ化を提供して、特定の精度と1に近い信頼水準に対して最小限の評価を確保します。この作業は、一連の数値実験で完了し、ベンチマークテストケースと実際の例の両方で実際のパフォーマンスを評価します。

Parsimonious Online Learning with Kernels via Sparse Projections in Function Space
関数空間におけるスパース投影によるカーネルを用いた倹約的なオンライン学習

Despite their attractiveness, popular perception is that techniques for nonparametric function approximation do not scale to streaming data due to an intractable growth in the amount of storage they require. To solve this problem in a memory-affordable way, we propose an online technique based on functional stochastic gradient descent in tandem with supervised sparsification based on greedy function subspace projections. The method, called parsimonious online learning with kernels (POLK), provides a controllable tradeoff between its solution accuracy and the amount of memory it requires. We derive conditions under which the generated function sequence converges almost surely to the optimal function, and we establish that the memory requirement remains finite. We evaluate POLK for kernel multi-class logistic regression and kernel hinge-loss classification on three canonical data sets: a synthetic Gaussian mixture model, the MNIST hand-written digits, and the Brodatz texture database. On all three tasks, we observe a favorable trade-off of objective function evaluation, classification performance, and complexity of the nonparametric regressor extracted by the proposed method.

魅力的であるにもかかわらず、ノンパラメトリック関数近似の手法は、必要なストレージ容量が手に負えないほど増大するため、ストリーミングデータには対応できないというのが一般的な認識です。この問題をメモリに余裕のある方法で解決するために、私たちは、貪欲関数サブスペース射影に基づく教師ありスパース化と連携した関数的確率的勾配降下法に基づくオンライン手法を提案します。カーネルによる節約型オンライン学習(POLK)と呼ばれるこの手法は、ソリューションの精度と必要なメモリ量との間で制御可能なトレードオフを提供します。生成された関数シーケンスがほぼ確実に最適な関数に収束する条件を導出し、メモリ要件が有限のままであることを確認します。私たちは、合成ガウス混合モデル、MNIST手書き数字、およびBrodatzテクスチャデータベースの3つの標準データセットで、カーネルマルチクラスロジスティック回帰とカーネルヒンジ損失分類に対してPOLKを評価します。3つのタスクすべてにおいて、目的関数の評価、分類パフォーマンス、および提案された方法によって抽出されたノンパラメトリック回帰器の複雑さの好ましいトレードオフが観察されます。

Transport Analysis of Infinitely Deep Neural Network
無限深層ニューラルネットワークのトランスポート解析

We investigated the feature map inside deep neural networks (DNNs) by tracking the transport map. We are interested in the role of depth (why do DNNs perform better than shallow models?) and the interpretation of DNNs (what do intermediate layers do?). Despite the rapid development in their application, DNNs remain analytically unexplained because the hidden layers are nested and the parameters are not faithful. Inspired by the integral representation of shallow NNs, which is the continuum limit of the width, or the hidden unit number, we developed the flow representation and transport analysis of DNNs. The flow representation is the continuum limit of the depth, or the hidden layer number, and it is specified by an ordinary differential equation (ODE) with a vector field. We interpret an ordinary DNN as a transport map or an Euler broken line approximation of the flow. Technically speaking, a dynamical system is a natural model for the nested feature maps. In addition, it opens a new way to the coordinate-free treatment of DNNs by avoiding the redundant parametrization of DNNs. Following Wasserstein geometry, we analyze a flow in three aspects: dynamical system, continuity equation, and Wasserstein gradient flow. A key finding is that we specified a series of transport maps of the denoising autoencoder (DAE), which is a cornerstone for the development of deep learning. Starting from the shallow DAE, this paper develops three topics: the transport map of the deep DAE, the equivalence between the stacked DAE and the composition of DAEs, and the development of the double continuum limit or the integral representation of the flow representation. As partial answers to the research questions, we found that deeper DAEs converge faster and the extracted features are better; in addition, a deep Gaussian DAE transports mass to decrease the Shannon entropy of the data distribution. We expect that further investigations on these questions will lead to the development of interpretable and principled alternatives to DNNs.

私たちは、トランスポートマップを追跡することで、ディープニューラルネットワーク（DNN）内の特徴マップを調査しました。私たちは、深さの役割（なぜDNNは浅いモデルよりもパフォーマンスが優れているのか）とDNNの解釈（中間層は何をするのか）に興味を持っています。その応用は急速に発展しているにもかかわらず、隠れ層がネストされており、パラメータが忠実ではないため、DNNは分析的に説明できないままです。浅いNNの積分表現、つまり幅の連続体限界、または隠れユニット数に着想を得て、我々はDNNのフロー表現とトランスポート解析を開発しました。フロー表現は、深さの連続体限界、または隠れ層数であり、ベクトル場を持つ常微分方程式（ODE）によって指定されます。私たちは、通常のDNNを、フローのトランスポートマップまたはオイラー折れ線近似として解釈します。技術的に言えば、動的システムは、ネストされた特徴マップの自然なモデルです。さらに、DNNの冗長なパラメータ化を回避することで、DNNを座標なしで処理する新しい方法も開きます。ワッサーシュタイン幾何学に従って、フローを3つの側面、つまり動的システム、連続方程式、ワッサーシュタイン勾配フローで分析します。重要な発見は、ディープラーニングの開発の基礎となるノイズ除去オートエンコーダ(DAE)の一連のトランスポートマップを指定したことです。浅いDAEから始めて、この論文では、深いDAEのトランスポートマップ、スタックされたDAEとDAEの合成の同等性、フロー表現の二重連続体極限または積分表現の開発という3つのトピックを展開します。研究の質問に対する部分的な回答として、深いDAEの方が収束が速く、抽出された特徴がより優れていることがわかりました。さらに、深いガウスDAEは質量をトランスポートして、データ分布のシャノンエントロピーを減少させます。これらの質問に関するさらなる調査により、DNNに代わる解釈可能で原理的な代替手段が開発されることを期待しています。
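
The reading of a deep network as an Euler broken-line approximation of a flow can be illustrated with a short sketch: each "layer" moves a point by one step of size $h$ along a vector field, so a network of depth $L$ approximates the solution of the ODE at time $L h$. The function `euler_flow` and the contracting vector field $v(x) = -x$ are illustrative assumptions and are not taken from the paper.

```python
import math


def euler_flow(x0, vector_field, depth, t_end=1.0):
    """Transport x0 through `depth` Euler steps of dx/dt = vector_field(x, t)."""
    h = t_end / depth
    x = list(x0)
    for layer in range(depth):
        v = vector_field(x, layer * h)
        # One layer: identity plus a small displacement along the field.
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def contract(x, t):
    """Toy vector field v(x) = -x, whose exact flow is x(t) = x0 * exp(-t)."""
    return [-xi for xi in x]


shallow = euler_flow([1.0, -2.0], contract, depth=4)
deep = euler_flow([1.0, -2.0], contract, depth=1000)
exact = [1.0 * math.exp(-1.0), -2.0 * math.exp(-1.0)]
```

As the depth grows with $t_{\text{end}}$ fixed, the broken line approaches the exact flow, which is the continuum limit of the depth referred to in the abstract.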

Adaptation Based on Generalized Discrepancy
一般化不一致に基づく適応

We present a new algorithm for domain adaptation improving upon the discrepancy minimization (DM) algorithm, which was previously shown to outperform a number of algorithms for this problem. Unlike many previously proposed solutions for domain adaptation, our algorithm does not consist of a fixed reweighting of the losses over the training sample. Instead, the reweighting depends on the hypothesis sought. The algorithm is derived from a notion of discrepancy, called generalized discrepancy, that is less conservative than the one used by the DM algorithm. We present a detailed description of our algorithm and show that it can be formulated as a convex optimization problem. We also give a detailed theoretical analysis of its learning guarantees which helps us select its parameters. Finally, we report the results of experiments demonstrating that it improves upon discrepancy minimization.

私たちは、この問題に対して多くのアルゴリズムよりも優れていることが以前に示された不一致最小化アルゴリズム(DM)を改良したドメイン適応のための新しいアルゴリズムを紹介します。ドメイン適応のために以前に提案された多くのソリューションとは異なり、私たちのアルゴリズムは、トレーニングサンプル上の損失の固定的な再重み付けで構成されていません。代わりに、再重み付けは求める仮説に依存します。このアルゴリズムは、一般化不一致と呼ばれる、DMアルゴリズムよりも保守的でない不一致の概念から導出されます。アルゴリズムの詳細な説明を提示し、それが凸最適化問題として定式化できることを示します。また、そのパラメータを選択するのに役立つ学習保証の詳細な理論的分析も提供します。最後に、このアルゴリズムが不一致最小化を上回ることを実証する実験結果を報告します。
