Journal of Machine Learning Research Papers: List of Papers in Volume 22

This page lists the papers in Journal of Machine Learning Research Papers Volume 22, each with a Japanese translation produced with the aid of machine translation.
On Universal Approximation and Error Bounds for Fourier Neural Operators
フーリエニューラル演算子の普遍近似と誤差限界について

Fourier neural operators (FNOs) have recently been proposed as an effective framework for learning operators that map between infinite-dimensional spaces. We prove that FNOs are universal, in the sense that they can approximate any continuous operator to desired accuracy. Moreover, we suggest a mechanism by which FNOs can approximate operators associated with PDEs efficiently. Explicit error bounds are derived to show that the size of the FNO, approximating operators associated with a Darcy type elliptic PDE and with the incompressible Navier-Stokes equations of fluid dynamics, only increases sub (log)-linearly in terms of the reciprocal of the error. Thus, FNOs are shown to efficiently approximate operators arising in a large class of PDEs.

フーリエニューラル演算子(FNO)は、無限次元空間の間の写像である演算子を学習するための効果的なフレームワークとして最近提案されました。本論文では、FNOが普遍的であること、すなわち任意の連続演算子を所望の精度で近似できることを証明します。さらに、FNOがPDEに関連する演算子を効率的に近似できるメカニズムを提案します。明示的な誤差限界を導出し、ダルシー型楕円PDEおよび流体力学の非圧縮性ナビエ・ストークス方程式に関連する演算子を近似するFNOのサイズが、誤差の逆数に対して劣(対数)線形にしか増加しないことを示します。したがって、FNOは幅広いクラスのPDEに現れる演算子を効率的に近似できることが示されます。

VariBAD: Variational Bayes-Adaptive Deep RL via Meta-Learning
VariBAD: メタ学習による変分ベイズ適応型深層 RL

Trading off exploration and exploitation in an unknown environment is key to maximising expected online return during learning. A Bayes-optimal policy, which does so optimally, conditions its actions not only on the environment state but also on the agent’s uncertainty about the environment. Computing a Bayes-optimal policy is however intractable for all but the smallest tasks. In this paper, we introduce variational Bayes-Adaptive Deep RL (variBAD), a way to meta-learn approximately Bayes-optimal policies for complex tasks. VariBAD simultaneously meta-learns a variational auto-encoder to perform approximate inference, and a policy that incorporates task uncertainty directly during action selection by conditioning on both the environment state and the approximate belief. In two toy domains, we illustrate how variBAD performs structured online exploration as a function of task uncertainty. We further evaluate variBAD on MuJoCo tasks widely used in meta-RL and show that it achieves higher online return than existing methods. On the recently proposed Meta-World ML1 benchmark, variBAD achieves state of the art results by a large margin, fully solving two out of the three ML1 tasks for the first time.

未知の環境での探索と活用のトレードオフは、学習中に期待されるオンラインリターンを最大化するための鍵です。これを最適に行うベイズ最適ポリシーは、環境状態だけでなく、環境に関するエージェントの不確実性にもアクションを条件付けます。ただし、ベイズ最適ポリシーの計算は、最小のタスクを除いてすべて困難です。この論文では、複雑なタスクのベイズ最適ポリシーを近似的にメタ学習する方法である変分ベイズ適応型ディープRL (variBAD)を紹介します。VariBADは、近似推論を実行する変分オートエンコーダと、環境状態と近似信念の両方を条件付けすることにより、アクション選択中にタスクの不確実性を直接組み込むポリシーを同時にメタ学習します。2つのおもちゃのドメインで、variBADがタスクの不確実性の関数として構造化されたオンライン探索を実行する方法を示します。さらに、メタRLで広く使用されているMuJoCoタスクでvariBADを評価し、既存の方法よりも高いオンラインリターンを達成することを示します。最近提案されたMeta-World ML1ベンチマークでは、variBADは大幅な差で最先端の結果を達成し、3つのML1タスクのうち2つを初めて完全に解決しました。

A Theory of the Risk for Optimization with Relaxation and its Application to Support Vector Machines
緩和を伴う最適化のリスク理論とサポートベクターマシンへの応用

In this paper we consider optimization with relaxation, an ample paradigm to make data-driven designs. This approach was previously considered by the same authors of this work in Garatti and Campi (2019), a study that revealed a deep-seated connection between two concepts: risk (probability of not satisfying a new, out-of-sample, constraint) and complexity (according to a definition introduced in paper Garatti and Campi, 2019). This connection was shown to have profound implications in applications because it implied that the risk can be estimated from the complexity, a quantity that can be measured from the data without any knowledge of the data-generation mechanism. In the present work we establish new results. First, we expand the scope of Garatti and Campi (2019) so as to embrace a more general setup that covers various algorithms in machine learning. Then, we study classical support vector methods – including SVM (Support Vector Machine), SVR (Support Vector Regression) and SVDD (Support Vector Data Description) – and derive new results for the ability of these methods to generalize. All results are valid for any finite size of the data set. When the sample size tends to infinity, we establish the unprecedented result that the risk approaches the ratio between the complexity and the cardinality of the data sample, regardless of the value of the complexity.

この論文では、データ駆動型設計を行うための十分なパラダイムである、緩和を伴う最適化について検討します。このアプローチは、本研究の同じ著者によってGarattiとCampi (2019)で以前に検討されており、その研究では、リスク(新しいサンプル外の制約を満たさない確率)と複雑性(GarattiとCampi、2019の論文で紹介された定義による)という2つの概念の間に深いつながりがあることが明らかになりました。このつながりは、データ生成メカニズムを知らなくてもデータから測定できる量である複雑性からリスクを推定できることを示唆しているため、アプリケーションに大きな影響を及ぼすことが示されました。この研究では、新しい結果を確立します。まず、GarattiとCampi (2019)の範囲を拡張して、機械学習のさまざまなアルゴリズムをカバーするより一般的な設定を採用します。次に、SVM (サポートベクターマシン)、SVR (サポートベクター回帰)、SVDD (サポートベクターデータ記述)などの従来のサポートベクターメソッドを研究し、これらのメソッドの一般化能力に関する新しい結果を導き出します。すべての結果は、データセットの任意の有限サイズに対して有効です。サンプルサイズが無限大に近づくと、複雑さの値に関係なく、リスクはデータサンプルの複雑さと基数の比率に近づくという前例のない結果が確立されます。

V-statistics and Variance Estimation
V統計量と分散推定

As machine learning procedures become an increasingly popular modeling option among applied researchers, there has been a corresponding interest in developing valid tools for understanding their statistical properties and uncertainty. Tree-based ensembles like random forests remain one such popular option for which several important theoretical advances have been made in recent years by drawing upon a connection between their natural subsampled structure and the classical theory of $U$-statistics. Unfortunately, the procedures for estimating predictive variance resulting from these studies are plagued by severe bias and extreme computational overhead. Here, we argue that the root of these problems lies in the use of subsampling without replacement and that with-replacement subsamples, resulting in $V$-statistics, substantially alleviates these problems. We develop a general framework for analyzing the asymptotic behavior of $V$-statistics, demonstrating asymptotic normality under precise regularity conditions and establishing previously unreported connections to $U$-statistics. Importantly, these findings allow us to produce a natural and efficient means of estimating the variance of a conditional expectation, a problem of wide interest across multiple scientific domains that also lies at the heart of uncertainty quantification for supervised learning ensembles.

機械学習の手順が応用研究者の間でますます人気の高いモデリングオプションになるにつれて、その統計的特性と不確実性を理解するための有効なツールの開発に対する関心が高まっています。ランダムフォレストなどのツリーベースのアンサンブルは、そのような人気のオプションの1つであり、その自然なサブサンプリング構造と古典的な$U$統計理論との関連を利用して、近年いくつかの重要な理論的進歩が遂げられています。残念ながら、これらの研究から得られる予測分散を推定する手順は、深刻なバイアスと極端な計算オーバーヘッドに悩まされています。ここでは、これらの問題の根本は、非復元サブサンプリングの使用にあり、$V$統計をもたらす復元サブサンプリングによってこれらの問題が大幅に軽減されると主張します。$V$統計の漸近的動作を分析するための一般的なフレームワークを開発し、正確な規則性条件下での漸近正規性を実証し、これまで報告されていなかった$U$統計との関連を確立します。重要なのは、これらの発見により、条件付き期待値の分散を推定する自然で効率的な手段を生み出すことができるようになることです。条件付き期待値は、複数の科学領域で広く関心を集めている問題であり、教師あり学習アンサンブルの不確実性定量化の中心でもあります。

An Online Sequential Test for Qualitative Treatment Effects
定性的処理効果のためのオンライン逐次検定

Tech companies (e.g., Google or Facebook) often use randomized online experiments and/or A/B testing primarily based on the average treatment effects to compare their new product with an old one. However, it is also critically important to detect qualitative treatment effects such that the new one may significantly outperform the existing one only under some specific circumstances. The aim of this paper is to develop a powerful testing procedure to efficiently detect such qualitative treatment effects. We propose a scalable online updating algorithm to implement our test procedure. It has three novelties including adaptive randomization, sequential monitoring, and online updating with guaranteed type-I error control. We also thoroughly examine the theoretical properties of our testing procedure including the limiting distribution of test statistics and the justification of an efficient bootstrap method. Extensive empirical studies are conducted to examine the finite sample performance of our test procedure.

ハイテク企業（GoogleやFacebookなど）は、新製品と旧製品を比較するために、主に平均処理効果に基づくランダム化オンライン実験やA/Bテストをよく使用します。しかし、特定の状況下でのみ新製品が既存製品を大幅に上回る可能性があるような定性的な処理効果を検出することも非常に重要です。本論文の目的は、そのような定性的な処理効果を効率的に検出する強力なテスト手順を開発することです。私たちは、テスト手順を実装するためのスケーラブルなオンライン更新アルゴリズムを提案します。このアルゴリズムには、適応型ランダム化、シーケンシャルモニタリング、保証されたタイプIエラー制御によるオンライン更新という3つの新機能があります。また、テスト統計量の極限分布や効率的なブートストラップ法の正当性など、テスト手順の理論的特性も徹底的に調べます。テスト手順の有限サンプルパフォーマンスを調べるために、広範な実証研究が行われています。

Double Generative Adversarial Networks for Conditional Independence Testing
条件付き独立性検定のための二重敵対的生成ネットワーク

In this article, we study the problem of high-dimensional conditional independence testing, a key building block in statistics and machine learning. We propose an inferential procedure based on double generative adversarial networks (GANs). Specifically, we first introduce a double GANs framework to learn two generators of the conditional distributions. We then integrate the two generators to construct a test statistic, which takes the form of the maximum of generalized covariance measures of multiple transformation functions. We also employ data-splitting and cross-fitting to minimize the conditions on the generators to achieve the desired asymptotic properties, and employ multiplier bootstrap to obtain the corresponding p-value. We show that the constructed test statistic is doubly robust, and the resulting test both controls type-I error and has the power approaching one asymptotically. Also notably, we establish those theoretical guarantees under much weaker and practically more feasible conditions compared to the existing tests, and our proposal gives a concrete example of how to utilize some state-of-the-art deep learning tools, such as GANs, to help address a classical but challenging statistical problem. We demonstrate the efficacy of our test through both simulations and an application to an anti-cancer drug dataset. A Python implementation of the proposed procedure is available at https://github.com/tianlinxu312/dgcit.

この記事では、統計学と機械学習の重要な構成要素である高次元条件付き独立性検定の問題を研究します。二重生成敵対的ネットワーク(GAN)に基づく推論手順を提案します。具体的には、まず条件付き分布の2つの生成器を学習するための二重GANフレームワークを導入します。次に、2つの生成器を統合して、複数の変換関数の一般化共分散測定の最大値という形をとる検定統計量を構築します。また、データ分割とクロスフィッティングを使用して生成器の条件を最小化し、目的の漸近特性を実現し、乗数ブートストラップを使用して対応するp値を取得します。構築された検定統計量は二重に堅牢であり、結果として得られる検定はタイプIの誤りを制御し、漸近的に1に近づく検出力を持つことを示します。また注目すべきは、既存のテストと比較してはるかに弱く、実際的により実現可能な条件下でこれらの理論的な保証を確立し、私たちの提案は、GANなどの最先端のディープラーニングツールを活用して、古典的でありながら困難な統計的問題に対処する方法の具体的な例を示しています。シミュレーションと抗がん剤データセットへのアプリケーションの両方を通じて、テストの有効性を実証します。提案された手順のPython実装は、https://github.com/tianlinxu312/dgcitで入手できます。

Linear Bandits on Uniformly Convex Sets
一様凸集合上の線形バンディット

Linear bandit algorithms yield $\tilde{\mathcal{O}}(n\sqrt{T})$ pseudo-regret bounds on compact convex action sets $\mathcal{K}\subset\mathbb{R}^n$ and two types of structural assumptions lead to better pseudo-regret bounds. When $\mathcal{K}$ is the simplex or an $\ell_p$ ball with $p\in]1,2]$, there exist bandit algorithms with $\tilde{\mathcal{O}}(\sqrt{nT})$ pseudo-regret bounds. Here, we derive bandit algorithms for some strongly convex sets beyond $\ell_p$ balls that enjoy pseudo-regret bounds of $\tilde{\mathcal{O}}(\sqrt{nT})$. This result provides new elements for the open question in Bubeck and Cesa-Bianchi, 2012. When the action set is $q$-uniformly convex but not necessarily strongly convex ($q >2$), we obtain pseudo-regret bounds $\tilde{\mathcal{O}}(n^{1/q}T^{1/p})$ with $p$ s.t. $1/p + 1/q=1$. These pseudo-regret bounds are competitive with the general $\tilde{\mathcal{O}}(n\sqrt{T})$ for a time horizon range that depends on the degree $q>2$ of the set’s uniform convexity and the dimension $n$ of the problem.

線形バンディットアルゴリズムは、コンパクトな凸アクションセット$\mathcal{K}\subset\mathbb{R}^n$に対して$\tilde{\mathcal{O}}(n\sqrt{T})$擬似後悔境界を生成し、2種類の構造仮定によってより優れた擬似後悔境界が導かれます。$\mathcal{K}$が単体または$p\in]1,2]$の$\ell_p$ボールである場合、$\tilde{\mathcal{O}}(\sqrt{nT})$擬似後悔境界を持つバンディットアルゴリズムが存在します。ここでは、$\ell_p$ボールを超えるいくつかの強凸セットに対して、擬似後悔境界が$\tilde{\mathcal{O}}(\sqrt{nT})$であるバンディットアルゴリズムを導出します。この結果は、BubeckとCesa-Bianchi (2012)の未解決問題に新たな要素を提供します。アクションセットが$q$一様凸だが必ずしも強く凸ではない($q >2$)場合、$p$ s.t. $1/p + 1/q=1$で擬似後悔境界$\tilde{\mathcal{O}}(n^{1/q}T^{1/p})$が得られます。これらの擬似後悔境界は、集合の一様凸性の次数$q>2$と問題の次元$n$に依存する時間範囲で、一般的な$\tilde{\mathcal{O}}(n\sqrt{T})$と競合します。

Non-linear, Sparse Dimensionality Reduction via Path Lasso Penalized Autoencoders
パス Lasso ペナルティ付きオートエンコーダによる非線形スパース次元削減

High-dimensional data sets are often analyzed and explored via the construction of a latent low-dimensional space which enables convenient visualization and efficient predictive modeling or clustering. For complex data structures, linear dimensionality reduction techniques like PCA may not be sufficiently flexible to enable low-dimensional representation. Non-linear dimension reduction techniques, like kernel PCA and autoencoders, suffer from loss of interpretability since each latent variable is dependent of all input dimensions. To address this limitation, we here present path lasso penalized autoencoders. This structured regularization enhances interpretability by penalizing each path through the encoder from an input to a latent variable, thus restricting how many input variables are represented in each latent dimension. Our algorithm uses a group lasso penalty and non-negative matrix factorization to construct a sparse, non-linear latent representation. We compare the path lasso regularized autoencoder to PCA, sparse PCA, autoencoders and sparse autoencoders on real and simulated data sets. We show that the algorithm exhibits much lower reconstruction errors than sparse PCA and parameter-wise lasso regularized autoencoders for low-dimensional representations. Moreover, path lasso representations provide a more accurate reconstruction match, i.e. preserved relative distance between objects in the original and reconstructed spaces.

高次元データセットは、多くの場合、便利な視覚化と効率的な予測モデリングまたはクラスタリングを可能にする潜在低次元空間の構築を介して分析および調査されます。複雑なデータ構造の場合、PCAなどの線形次元削減手法では、低次元表現を可能にするのに十分な柔軟性がない可能性があります。カーネルPCAやオートエンコーダーなどの非線形次元削減手法では、各潜在変数がすべての入力次元に依存しているため、解釈可能性が失われます。この制限に対処するために、ここではパスLassoペナルティ付きオートエンコーダーを紹介します。この構造化正則化は、エンコーダーを通じて入力から潜在変数に至る各パスにペナルティを課すことで解釈可能性を高め、各潜在次元で表現される入力変数の数を制限します。このアルゴリズムでは、グループLassoペナルティと非負行列分解を使用して、スパースで非線形の潜在表現を構築します。実際のデータセットとシミュレートされたデータセットで、パスLasso正則化オートエンコーダーをPCA、スパースPCA、オートエンコーダー、スパースオートエンコーダーと比較します。このアルゴリズムは、低次元表現に対して、スパースPCAやパラメータごとのLasso正規化オートエンコーダよりもはるかに低い再構築エラーを示すことを示しています。さらに、パスLasso表現は、より正確な再構築マッチ、つまり元の空間と再構築された空間内のオブジェクト間の相対距離の保持を提供します。

LDLE: Low Distortion Local Eigenmaps
LDLE: 低歪みローカル固有マップ

We present Low Distortion Local Eigenmaps (LDLE), a manifold learning technique which constructs a set of low distortion local views of a data set in lower dimension and registers them to obtain a global embedding. The local views are constructed using the global eigenvectors of the graph Laplacian and are registered using Procrustes analysis. The choice of these eigenvectors may vary across the regions. In contrast to existing techniques, LDLE can embed closed and non-orientable manifolds into their intrinsic dimension by tearing them apart. It also provides gluing instruction on the boundary of the torn embedding to help identify the topology of the original manifold. Our experimental results will show that LDLE largely preserved distances up to a constant scale while other techniques produced higher distortion. We also demonstrate that LDLE produces high quality embeddings even when the data is noisy or sparse.

本論文では、低歪みローカル固有マップ(LDLE)を紹介します。これは、データセットの低歪みな局所ビューの集合を低次元で構築し、それらを位置合わせして大域的な埋め込みを得る多様体学習手法です。局所ビューはグラフラプラシアンの大域的な固有ベクトルを用いて構築され、プロクラステス解析を用いて位置合わせされます。これらの固有ベクトルの選択は、領域によって異なる場合があります。既存の手法とは対照的に、LDLEは閉多様体や向き付け不可能な多様体を引き裂くことで、その固有次元に埋め込むことができます。また、引き裂かれた埋め込みの境界に対する貼り合わせの指示を提供し、元の多様体のトポロジーの特定に役立てます。実験結果は、LDLEが定数倍のスケールを除いて距離をおおむね保持したのに対し、他の手法ではより大きな歪みが生じたことを示しています。また、データにノイズが多い場合や疎な場合でも、LDLEが高品質な埋め込みを生成することを示します。

Contrastive Estimation Reveals Topic Posterior Information to Linear Models
対照推定は、線形モデルへのトピック事後情報を明らかにする

Contrastive learning is an approach to representation learning that utilizes naturally occurring similar and dissimilar pairs of data points to find useful embeddings of data. In the context of document classification under topic modeling assumptions, we prove that contrastive learning is capable of recovering a representation of documents that reveals their underlying topic posterior information to linear models. We apply this procedure in a semi-supervised setup and demonstrate empirically that linear classifiers trained on these representations perform well in document classification tasks with very few training examples.

コントラスティブ学習は、自然に発生する類似したデータポイントと異なるデータポイントのペアを利用して、データの有用な埋め込みを見つける表現学習のアプローチです。トピックモデリングの仮定の下でのドキュメント分類のコンテキストでは、コントラスティブ学習が、線形モデルに対する根本的なトピックの事後情報を明らかにするドキュメントの表現を回復できることを証明します。この手順を半教師ありセットアップに適用し、これらの表現でトレーニングされた線形分類器が、トレーニング例がほとんどないドキュメント分類タスクで優れたパフォーマンスを発揮することを経験的に示しています。

Graph Matching with Partially-Correct Seeds
部分的に正しいシードを使用したグラフマッチング

Graph matching aims to find the latent vertex correspondence between two edge-correlated graphs and has found numerous applications across different fields. In this paper, we study a seeded graph matching problem, which assumes that a set of seeds, i.e., pre-mapped vertex-pairs, is given in advance. While most previous work requires all seeds to be correct, we focus on the setting where the seeds are partially correct. Specifically, consider two correlated graphs whose edges are sampled independently from a parent Erdos-Renyi graph $\mathcal{G}(n,p)$. A mapping between the vertices of the two graphs is provided as seeds, of which an unknown $\beta$ fraction is correct. We first analyze a simple algorithm that matches vertices based on the number of common seeds in the $1$-hop neighborhoods, and then further propose a new algorithm that uses seeds in the $2$-hop neighborhoods. We establish non-asymptotic performance guarantees of perfect matching for both $1$-hop and $2$-hop algorithms, showing that our new $2$-hop algorithm requires substantially fewer correct seeds than the $1$-hop algorithm when graphs are sparse. Moreover, by combining our new performance guarantees for the $1$-hop and $2$-hop algorithms, we attain the best-known results (in terms of the required fraction of correct seeds) across the entire range of graph sparsity and significantly improve the previous results when $p\ge n^{-5/6}$. For instance, when $p$ is a constant or $p=n^{-3/4}$, we show that only $\Omega(\sqrt{n\log n})$ correct seeds suffice for perfect matching, while the previously best-known results demand $\Omega(n)$ and $\Omega(n^{3/4}\log n)$ correct seeds, respectively. Numerical experiments corroborate our theoretical findings, demonstrating the superiority of our $2$-hop algorithm on a variety of synthetic and real graphs.

グラフマッチングは、2つの辺が相関するグラフ間の潜在的な頂点の対応関係を見つけることを目的とし、さまざまな分野で数多くの応用が見出されています。この論文では、シードのセット、つまり事前にマッピングされた頂点のペアが事前に与えられていると仮定するシード付きグラフマッチング問題を研究します。これまでのほとんどの研究ではすべてのシードが正しいことが求められていましたが、本研究ではシードが部分的に正しい設定に焦点を当てます。具体的には、親のエルデシュ・レニグラフ$\mathcal{G}(n,p)$から辺が独立してサンプリングされた2つの相関グラフを考えます。2つのグラフの頂点間のマッピングがシードとして提供され、その中の未知の$\beta$割合が正しいです。まず、$1$ホップ近傍の共通シードの数に基づいて頂点をマッチングする単純なアルゴリズムを分析し、さらに$2$ホップ近傍のシードを使用する新しいアルゴリズムを提案します。私たちは、1ホップアルゴリズムと2ホップアルゴリズムの両方について、完全マッチングの非漸近的パフォーマンス保証を確立し、グラフがスパースな場合、新しい2ホップアルゴリズムでは、1ホップアルゴリズムよりも大幅に少ない正しいシードしか必要としないことを示しています。さらに、1ホップアルゴリズムと2ホップアルゴリズムの新しいパフォーマンス保証を組み合わせることで、グラフのスパース性の全範囲にわたって(必要な正しいシードの割合に関して)最もよく知られている結果を達成し、$p\ge n^{-5/6}$の場合に以前の結果を大幅に改善します。たとえば、$p$が定数または$p=n^{-3/4}$の場合、完全マッチングには$\Omega(\sqrt{n\log n})$正しいシードのみで十分であることを示します。一方、以前の最もよく知られた結果では、それぞれ$\Omega(n)$と$\Omega(n^{3/4}\log n)$正しいシードが必要です。数値実験は理論的発見を裏付け、さまざまな合成グラフと実際のグラフ上での2ホップアルゴリズムの優位性を実証しています。
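
The 1-hop idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `one_hop_match`, the dictionary-based graph representation, and the greedy selection rule with ties broken by vertex order are all assumptions made for this sketch. For every candidate pair (u, v) it counts the seed pairs that sit in the 1-hop neighbourhood of u in the first graph and of v in the second, then matches pairs in order of decreasing count.

```python
from itertools import product


def one_hop_match(adj1, adj2, seeds):
    """Match vertices of two graphs by counting 1-hop seed witnesses.

    adj1, adj2: dict mapping vertex -> set of neighbours.
    seeds: dict mapping a vertex of graph 1 to its (possibly wrong)
           seeded partner in graph 2.
    """
    # score[(u, v)] = number of seed pairs (s, seeds[s]) such that s is a
    # neighbour of u in graph 1 and seeds[s] is a neighbour of v in graph 2.
    score = {}
    for u, v in product(adj1, adj2):
        score[(u, v)] = sum(
            1 for s in adj1[u] if s in seeds and seeds[s] in adj2[v]
        )

    # Greedy selection (an assumption of this sketch): take pairs in order
    # of decreasing score, skipping vertices that are already matched.
    matching, used = {}, set()
    for (u, v), _ in sorted(score.items(), key=lambda kv: (-kv[1], kv[0])):
        if u not in matching and v not in used:
            matching[u] = v
            used.add(v)
    return matching


edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 4), (2, 5), (3, 4), (4, 5)]
adj = {v: set() for v in range(6)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# Both graphs are identical here, so the true matching is the identity.
# Seeds 3 and 5 are deliberately swapped (wrong); the other four are correct.
noisy_seeds = {0: 0, 1: 1, 2: 2, 3: 5, 4: 4, 5: 3}
result = one_hop_match(adj, adj, noisy_seeds)
```

On this toy instance the greedy rule returns the identity matching even though two seeds are wrong. Several scores are tied and resolved by vertex order, so the example only illustrates the mechanics and says nothing about the guarantees proved in the paper.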

Fast Learning for Renewal Optimization in Online Task Scheduling
オンラインタスクスケジューリングにおける更新最適化のための高速学習

This paper considers online optimization of a renewal-reward system. A controller performs a sequence of tasks back-to-back. Each task has a random vector of parameters, called the task type vector, that affects the task processing options and also affects the resulting reward and time duration of the task. The probability distribution for the task type vector is unknown and the controller must learn to make efficient decisions so that time-average reward converges to optimality. Prior work on such renewal optimization problems leaves open the question of optimal convergence time. This paper develops an algorithm with an optimality gap that decays like $O(1/\sqrt{k})$, where $k$ is the number of tasks processed. The same algorithm is shown to have faster $O(\log(k)/k)$ performance when the system satisfies a strong concavity property. The proposed algorithm uses an auxiliary variable that is updated according to a classic Robbins-Monro iteration. It makes online scheduling decisions at the start of each renewal frame based on this variable and the observed task type. A matching converse is obtained for the strongly concave case by constructing an example system for which all algorithms have performance at best $\Omega(\log(k)/k)$. A matching $\Omega(1/\sqrt{k})$ converse is also shown for the general case without strong concavity.

この論文では、更新報酬システムのオンライン最適化について検討します。コントローラは、一連のタスクを連続して実行します。各タスクには、タスクタイプベクトルと呼ばれるランダムなパラメータベクトルがあり、タスクの処理オプションに影響するとともに、結果として得られる報酬とタスクの所要時間にも影響します。タスクタイプベクトルの確率分布は未知であり、コントローラは、時間平均報酬が最適値に収束するように効率的な決定を行うことを学習する必要があります。このような更新最適化問題に関するこれまでの研究では、最適な収束時間の問題が未解決のまま残されています。この論文では、最適性ギャップが$O(1/\sqrt{k})$のオーダーで減衰するアルゴリズムを開発します。ここで、$k$は処理されたタスクの数です。同じアルゴリズムは、システムが強凹性を満たす場合、より高速な$O(\log(k)/k)$の性能を持つことが示されます。提案アルゴリズムは、古典的なRobbins-Monro反復に従って更新される補助変数を使用し、この変数と観測されたタスクタイプに基づいて、各更新フレームの開始時にオンラインでスケジューリングを決定します。強凹の場合については、どのアルゴリズムの性能もせいぜい$\Omega(\log(k)/k)$であるようなシステムの例を構成することにより、これと一致する逆(下界)を得ます。強凹性を仮定しない一般の場合についても、一致する$\Omega(1/\sqrt{k})$の逆が示されます。
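
To make the role of the auxiliary variable concrete, here is a small sketch of a ratio-tracking scheduler. It is written under assumptions and is not taken from the paper: the decision rule (maximise reward minus `theta` times duration), the step size `1 / (k + 1)`, the projection interval `[0, theta_max]`, and the name `renewal_schedule` are all hypothetical choices. The variable `theta` plays the part of a running estimate of the achievable reward per unit time and is updated by a Robbins-Monro style step after each task.

```python
def renewal_schedule(task_stream, theta0=0.0, theta_max=10.0):
    """Run a ratio-tracking scheduler over a non-empty stream of tasks.

    task_stream: iterable of option lists; each option is a
                 (duration, reward) pair with duration > 0.
    Returns the final auxiliary variable and the empirical reward rate.
    """
    theta = theta0
    total_reward = total_time = 0.0
    for k, options in enumerate(task_stream, start=1):
        # Decision rule (assumed): pick the option with the best
        # reward-minus-priced-time score under the current theta.
        duration, reward = max(options, key=lambda o: o[1] - theta * o[0])
        total_reward += reward
        total_time += duration
        # Robbins-Monro step with a decaying step size, then projection
        # onto [0, theta_max] to keep the variable bounded.
        step = 1.0 / (k + 1)
        theta += step * (reward - theta * duration)
        theta = min(max(theta, 0.0), theta_max)
    return theta, total_reward / total_time


# Every task offers the same two options, given as (duration, reward).
stream = [[(1.0, 1.0), (2.0, 3.0)] for _ in range(50)]
theta, rate = renewal_schedule(stream)
```

In this deterministic toy stream the second option has the better reward-to-duration ratio (3/2 against 1), and both `theta` and the empirical rate settle at 1.5. With random task types the same loop applies, with `options` depending on the observed type.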

Multilevel Monte Carlo Variational Inference
マルチレベルモンテカルロ変分推論

We propose a variance reduction framework for variational inference using the Multilevel Monte Carlo (MLMC) method. Our framework is built on reparameterized gradient estimators and “recycles” parameters obtained from past update history in optimization. In addition, our framework provides a new optimization algorithm based on stochastic gradient descent (SGD) that adaptively estimates the sample size used for gradient estimation according to the ratio of the gradient variance. We theoretically show that, with our method, the variance of the gradient estimator decreases as optimization proceeds and that a learning rate scheduler function helps improve the convergence. We also show that, in terms of the signal-to-noise ratio, our method can improve the quality of gradient estimation by the learning rate scheduler function without increasing the initial sample size. Finally, we confirm that our method achieves faster convergence and reduces the variance of the gradient estimator compared with other methods through experimental comparisons with baseline methods using several benchmark datasets.

私たちは、マルチレベルモンテカルロ（MLMC）法を用いた変分推論の分散削減フレームワークを提案します。このフレームワークは、再パラメータ化された勾配推定器に基づいて構築されており、最適化における過去の更新履歴から得られたパラメータを「リサイクル」します。さらに、このフレームワークは、勾配分散の比率に応じて勾配推定に使用されるサンプルサイズを適応的に推定する、確率的勾配降下法（SGD）に基づく新しい最適化アルゴリズムを提供します。この方法では、最適化が進むにつれて勾配推定器の分散が減少し、学習率スケジューラ関数が収束の改善に役立つことを理論的に示す。また、信号対雑音比の観点から、この方法は、初期サンプルサイズを増やすことなく、学習率スケジューラ関数によって勾配推定の品質を改善できることも示す。最後に、いくつかのベンチマークデータセットを使用したベースライン方法との実験的比較を通じて、他の方法と比較して、この方法がより速い収束を達成し、勾配推定器の分散を削減することを確認します。

Gaussian Approximation for Bias Reduction in Q-Learning
Q学習におけるバイアス低減のためのガウス近似

Temporal-Difference off-policy algorithms are among the building blocks of reinforcement learning (RL). Within this family, Q-Learning is arguably the most famous one, which has been widely studied and extended. The update rule of Q-learning involves the use of the maximum operator to estimate the maximum expected value of the return. However, this estimate is positively biased, and may hinder the learning process, especially in stochastic environments and when function approximation is used. We introduce the Weighted Estimator as an effective solution to mitigate the negative effects of overestimation in Q-Learning. The Weighted Estimator estimates the maximum expected value as a weighted sum of the action values, with the weights being the probabilities that each action value is the maximum. In this work, we study the problem from the statistical perspective of estimating the maximum expected value of a set of random variables and provide bounds to the bias and the variance of the Weighted Estimator, showing its advantages over other estimators present in literature. Then, we derive algorithms to enable the use of the Weighted Estimator, in place of the Maximum Estimator, in online and batch RL, and we introduce a novel algorithm for deep RL. Finally, we empirically evaluate our algorithms in a large set of heterogeneous problems, encompassing discrete and continuous, low and high dimensional, deterministic and stochastic environments. Experimental results show the effectiveness of the Weighted Estimator in controlling the bias of the estimate, resulting in better performance than representative baselines and robust learning w.r.t. a large set of diverse environments.

時間差分オフポリシーアルゴリズムは、強化学習(RL)の構成要素の1つです。このファミリーの中で、Q学習はおそらく最も有名なものであり、広く研究され、拡張されてきました。Q学習の更新規則では、最大演算子を使用してリターンの最大期待値を推定します。ただし、この推定は正に偏っており、特に確率的環境や関数近似が使用されている場合は、学習プロセスを妨げる可能性があります。Q学習における過大評価の悪影響を軽減するための効果的なソリューションとして、重み付き推定器を導入します。重み付き推定器は、最大期待値をアクション値の加重合計として推定します。重みは、各アクション値が最大になる確率です。この研究では、ランダム変数セットの最大期待値を推定するという統計的観点から問題を研究し、重み付き推定器のバイアスと分散の境界を示し、文献にある他の推定器よりも優れている点を示します。次に、オンラインおよびバッチRLで最大推定量の代わりに重み付け推定量を使用できるようにするアルゴリズムを導出し、ディープRLの新しいアルゴリズムを紹介します。最後に、離散および連続、低次元および高次元、決定論的および確率的環境を含む、多数の異種問題でアルゴリズムを実験的に評価します。実験結果では、重み付け推定量が推定値のバイアス制御に有効であり、代表的なベースラインよりも優れたパフォーマンスと、多数の多様な環境に対する堅牢な学習が得られることが示されています。
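
The weighting scheme in the abstract can be illustrated numerically. The sketch below assumes each action-value estimate is summarised by a sample mean and a positive standard error and is approximated by a Gaussian. The weight of action i is then the integral of its density times the cumulative distribution functions of all other actions, which is the probability that it is the largest. The function name `weighted_estimator`, the trapezoidal grid, and the 8-standard-error integration range are assumptions of this sketch rather than details from the paper.

```python
from statistics import NormalDist


def weighted_estimator(means, std_errs, grid=2001):
    """Estimate max_i E[X_i] as sum_i w_i * mean_i.

    w_i approximates P(action i has the largest value) under independent
    Gaussian approximations N(mean_i, std_err_i^2); std_errs must be > 0.
    """
    dists = [NormalDist(m, s) for m, s in zip(means, std_errs)]
    lo = min(m - 8.0 * s for m, s in zip(means, std_errs))
    hi = max(m + 8.0 * s for m, s in zip(means, std_errs))
    step = (hi - lo) / (grid - 1)

    weights = []
    for i, d_i in enumerate(dists):
        total = 0.0
        for k in range(grid):
            x = lo + k * step
            # Density of action i at x times P(every other action <= x).
            val = d_i.pdf(x)
            for j, d_j in enumerate(dists):
                if j != i:
                    val *= d_j.cdf(x)
            # Trapezoidal rule: half weight at the two end points.
            total += (0.5 if k in (0, grid - 1) else 1.0) * val * step
        weights.append(total)

    # Normalise so that the weights sum to one despite integration error.
    z = sum(weights)
    weights = [w / z for w in weights]
    estimate = sum(w * m for w, m in zip(weights, means))
    return estimate, weights


tie_estimate, tie_weights = weighted_estimator([1.0, 1.0], [0.5, 0.5])
clear_estimate, clear_weights = weighted_estimator([0.0, 10.0], [0.1, 0.1])
```

When two actions have the same mean, the weights are equal and the estimate stays at that mean, whereas taking the maximum of two noisy sample means would tend to overshoot it. When one action is clearly better, almost all weight goes to it and the estimate behaves like the maximum.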

Estimating the Lasso’s Effective Noise
Lassoの有効ノイズの推定

Much of the theory for the lasso in the linear model $Y = \boldsymbol{X} \beta^* + \varepsilon$ hinges on the quantity $2\| \boldsymbol{X}^\top \varepsilon \|_\infty / n$, which we call the lasso’s effective noise. Among other things, the effective noise plays an important role in finite-sample bounds for the lasso, the calibration of the lasso’s tuning parameter, and inference on the parameter vector $\beta^*$. In this paper, we develop a bootstrap-based estimator of the quantiles of the effective noise. The estimator is fully data-driven, that is, does not require any additional tuning parameters. We equip our estimator with finite-sample guarantees and apply it to tuning parameter calibration for the lasso and to high-dimensional inference on the parameter vector $\beta^*$.

線形モデル$Y = \boldsymbol{X} \beta^* + \varepsilon$におけるLassoの理論の大部分は、量$2\| \boldsymbol{X}^\top \varepsilon \|_\infty / n$に依存しており、本論文ではこれをLassoの有効ノイズと呼びます。とりわけ有効ノイズは、Lassoの有限サンプル境界、Lassoの調整パラメータのキャリブレーション、およびパラメータベクトル$\beta^*$に関する推論において重要な役割を果たします。本論文では、有効ノイズの分位数に対するブートストラップベースの推定量を開発します。この推定量は完全にデータ駆動型であり、追加の調整パラメータを必要としません。推定量に有限サンプル保証を与え、Lassoの調整パラメータのキャリブレーションと、パラメータベクトル$\beta^*$に関する高次元推論に適用します。
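
For readers who want to see the quantity itself, the following sketch evaluates $2\| \boldsymbol{X}^\top \varepsilon \|_\infty / n$ directly from its definition. In practice the noise vector is unobserved, which is why the paper estimates quantiles of this quantity. The snippet therefore only applies to simulated or otherwise known noise, and the function name `effective_noise` and the list-of-rows layout for the design matrix are assumptions of this illustration.

```python
def effective_noise(X, eps):
    """Compute 2 * ||X^T eps||_inf / n for a design matrix and noise vector.

    X: list of n rows, each a list of p numbers.
    eps: list of n noise values (known only in simulations).
    """
    n = len(X)
    p = len(X[0])
    # j-th entry of X^T eps is the inner product of column j with eps.
    xt_eps = [sum(X[i][j] * eps[i] for i in range(n)) for j in range(p)]
    return 2.0 * max(abs(v) for v in xt_eps) / n


X = [[1.0, 0.0],
     [0.0, 2.0]]
eps = [1.0, -1.0]
value = effective_noise(X, eps)  # X^T eps = [1, -2], so 2 * 2 / 2 = 2.0
```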

Partial Policy Iteration for L1-Robust Markov Decision Processes
L1ロバストマルコフ決定過程の部分的な方策反復

Robust Markov decision processes (MDPs) compute reliable solutions for dynamic decision problems with partially-known transition probabilities. Unfortunately, accounting for uncertainty in the transition probabilities significantly increases the computational complexity of solving robust MDPs, which limits their scalability. This paper describes new, efficient algorithms for solving the common class of robust MDPs with s- and sa-rectangular ambiguity sets defined by weighted L1 norms. We propose partial policy iteration, a new, efficient, flexible, and general policy iteration scheme for robust MDPs. We also propose fast methods for computing the robust Bellman operator in quasi-linear time, nearly matching the ordinary Bellman operator’s linear complexity. Our experimental results indicate that the proposed methods are many orders of magnitude faster than the state-of-the-art approach, which uses linear programming solvers combined with a robust value iteration.

ロバストなマルコフ決定過程(MDP)は、部分的に既知の遷移確率を持つ動的意思決定問題に対する信頼性の高い解を計算します。残念ながら、遷移確率の不確実性を考慮すると、ロバストなMDPを解く際の計算の複雑さが大幅に増し、スケーラビリティが制限されます。この論文では、重み付けされたL1ノルムによって定義されるs矩形およびsa矩形のあいまいさセットを持つロバストMDPの共通クラスを解くための新しい効率的なアルゴリズムについて説明します。私たちは、堅牢なMDPのための新しい、効率的で、柔軟で、一般的なポリシー反復スキームである、部分的なポリシーの反復スキームを提案します。また、ロバストなベルマン演算子を準線形時間で計算するための高速な方法を提案し、通常のベルマン演算子の線形複雑さにほぼ一致します。私たちの実験結果は、提案された方法が、線形計画法ソルバーとロバストな値の反復を組み合わせた最先端のアプローチよりも何桁も高速であることを示しています。
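
The inner problem behind a robust Bellman update can be sketched for the simplest case. The code below handles a single state-action pair with an unweighted L1 ball around a nominal transition distribution and uses the well-known sort-based procedure for that special case. It is not the paper's algorithm for weighted norms or s-rectangular sets, and the name `worst_case_l1` and its arguments are assumptions of this sketch. It returns the smallest expected value an adversary can induce by moving probability mass within the budget `kappa`.

```python
def worst_case_l1(z, p_bar, kappa):
    """Return min of p.z over distributions p with ||p - p_bar||_1 <= kappa.

    z: values of the successor states.
    p_bar: nominal transition probabilities (non-negative, summing to 1).
    kappa: size of the unweighted L1 ambiguity set, kappa >= 0.
    """
    p = list(p_bar)
    i_min = min(range(len(z)), key=lambda i: z[i])
    # Moving mass m changes the L1 distance by 2m, so at most kappa/2 can
    # move, and no more than fits on the lowest-value state.
    budget = min(kappa / 2.0, 1.0 - p[i_min])
    p[i_min] += budget
    # Take that mass away from the states with the largest values first.
    for i in sorted(range(len(z)), key=lambda i: z[i], reverse=True):
        if budget <= 0.0:
            break
        if i == i_min:
            continue
        moved = min(budget, p[i])
        p[i] -= moved
        budget -= moved
    return sum(pi * zi for pi, zi in zip(p, z))


values = [1.0, 2.0, 3.0]
nominal = [0.2, 0.3, 0.5]
robust_value = worst_case_l1(values, nominal, kappa=0.4)
```

The dominant cost here is the sort, which is consistent with the quasi-linear running time mentioned in the abstract for this kind of computation. With `kappa=0.4` the adversary shifts 0.2 of probability from the highest-value state to the lowest-value one, lowering the expected value from 2.3 to 1.9.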

Simultaneous Change Point Inference and Structure Recovery for High Dimensional Gaussian Graphical Models
高次元ガウスグラフモデルのための変化点推論と構造回復の同時実行

In this article, we investigate the problem of simultaneous change point inference and structure recovery in the context of high dimensional Gaussian graphical models with possible abrupt changes. In particular, motivated by neighborhood selection, we incorporate a threshold variable and an unknown threshold parameter into a joint sparse regression model which combines p l1-regularized node-wise regression problems together. The change point estimator and the corresponding estimated coefficients of precision matrices are obtained together. Based on that, a classifier is introduced to distinguish whether a change point exists. To recover the graphical structure correctly, a data-driven thresholding procedure is proposed. In theory, under some sparsity conditions and regularity assumptions, our method can correctly choose a homogeneous or heterogeneous model with high accuracy. Furthermore, in the latter case with a change point, we establish estimation consistency of the change point estimator, by allowing the number of nodes being much larger than the sample size. Moreover, it is shown that, in terms of structure recovery of Gaussian graphical models, the proposed thresholding procedure achieves model selection consistency and controls the number of false positives. The validity of our proposed method is justified via extensive numerical studies. Finally, we apply our proposed method to the S&P 500 dataset to show its empirical usefulness.

この記事では、突然の変化が起こり得る高次元ガウスグラフィカルモデルのコンテキストで、変化点の推定と構造回復を同時に行う問題を調査します。特に、近傍選択を動機として、閾値変数と未知の閾値パラメータを、p l1正規化ノード単位の回帰問題を組み合わせた結合スパース回帰モデルに組み込みます。変化点推定量と、それに対応する精度行列の推定係数が一緒に得られます。それに基づいて、変化点が存在するかどうかを区別するための分類器が導入されます。グラフィカル構造を正しく回復するために、データ駆動型の閾値設定手順が提案されています。理論的には、いくつかのスパース条件と規則性の仮定の下で、私たちの方法は、高精度で均質モデルまたは異質モデルを正しく選択できます。さらに、変化点がある後者の場合、サンプルサイズよりもはるかに大きいノード数を許容することで、変化点推定量の推定の一貫性を確立します。さらに、ガウスグラフィカルモデルの構造回復に関して、提案された閾値設定手順はモデル選択の一貫性を実現し、偽陽性の数を制御することが示されています。提案された方法の有効性は、広範な数値研究によって正当化されています。最後に、提案された方法をS&P 500データセットに適用して、その実証的な有用性を示します。

On the Hardness of Robust Classification
ロバスト分類の困難性について

It is becoming increasingly important to understand the vulnerability of machine learning models to adversarial attacks. In this paper we study the feasibility of adversarially robust learning from the perspective of computational learning theory, considering both sample and computational complexity. In particular, our definition of robust learnability requires polynomial sample complexity. We start with two negative results. We show that no non-trivial concept class can be robustly learned in the distribution-free setting against an adversary who can perturb just a single input bit. We show, moreover, that the class of monotone conjunctions cannot be robustly learned under the uniform distribution against an adversary who can perturb $\omega(\log n)$ input bits. However, we also show that if the adversary is restricted to perturbing $O(\log n)$ bits, then one can robustly learn the class of $1$-decision lists (which subsumes monotone conjunctions) with respect to the class of log-Lipschitz distributions. We then extend this result to show learnability of 2-decision lists and monotone $k$-decision lists in the same distributional and adversarial setting. Finally, we provide a simple proof of the computational hardness of robust learning on the boolean hypercube. Unlike previous results of this nature, our result does not rely on a more restricted model of learning, such as the statistical query model, nor on any hardness assumption other than the existence of an (average-case) hard learning problem in the PAC framework; this allows us to have a clean proof of the reduction, and the assumption is no stronger than assumptions that are used to build cryptographic primitives.

機械学習モデルの敵対的攻撃に対する脆弱性を理解することがますます重要になっています。この論文では、サンプル複雑性と計算複雑性の両方を考慮した計算学習理論の観点から、敵対的に堅牢な学習の実現可能性を検討します。特に、私たちの堅牢な学習可能性の定義は、多項式のサンプル複雑性を要求します。まず、2つの否定的な結果から始めます。分布に依存しない設定では、入力ビットを1つだけ摂動できる敵に対して、いかなる非自明な概念クラスも堅牢に学習できないことを示します。さらに、一様分布の下では、$\omega(\log n)$個の入力ビットを摂動できる敵に対して、単調連言のクラスを堅牢に学習できないことを示します。ただし、敵が摂動できるビット数が$O(\log n)$に制限されている場合は、対数リプシッツ分布のクラスに関して、$1$決定リストのクラス(単調連言を包含する)を堅牢に学習できることも示します。次に、この結果を拡張して、同じ分布および敵対的設定で2決定リストと単調な$k$決定リストの学習可能性を示します。最後に、ブール超立方体上での堅牢な学習の計算困難性の簡単な証明を提供します。この種の以前の結果とは異なり、私たちの結果は、統計クエリモデルなどのより制限された学習モデルにも、PACフレームワークにおける(平均ケースで)困難な学習問題の存在以外の困難性の仮定にも依存していません。これにより、帰着の簡潔な証明が可能になり、この仮定は暗号プリミティブの構築に使用される仮定よりも強くありません。
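The adversarial setting in this abstract can be made concrete with a small brute-force check. The sketch below assumes one common notion of robustness, namely that the hypothesis must agree with the target concept on every point within a given Hamming distance of the input. The function names, the two conjunctions, and the sample point are illustrative choices, not taken from the paper.

```python
from itertools import combinations


def monotone_conjunction(relevant):
    """Return h(x) = AND of the bits of x at the given indices."""
    def h(x):
        return all(x[i] == 1 for i in relevant)
    return h


def hamming_ball(x, budget):
    """Yield every point within Hamming distance `budget` of x."""
    n = len(x)
    for k in range(budget + 1):
        for idx in combinations(range(n), k):
            z = list(x)
            for i in idx:
                z[i] = 1 - z[i]  # flip the chosen bit
            yield tuple(z)


def robustly_correct(h, c, x, budget):
    """True if h agrees with the target c on every perturbation of x within budget."""
    return all(h(z) == c(z) for z in hamming_ball(x, budget))


# Hypothetical target and hypothesis, differing in one relevant variable.
target = monotone_conjunction([0, 1])         # c(x) = x0 AND x1
hypothesis = monotone_conjunction([0, 1, 2])  # h(x) = x0 AND x1 AND x2
x = (1, 1, 1, 0)

print(robustly_correct(hypothesis, target, x, budget=0))
print(robustly_correct(hypothesis, target, x, budget=1))
```

The hypothesis labels `x` correctly, yet flipping a single bit is enough to make it disagree with the target. The enumeration is exponential in the budget, so it is only practical for tiny examples.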

Transferability of Spectral Graph Convolutional Neural Networks
スペクトルグラフ畳み込みニューラルネットワークの伝達性

This paper focuses on spectral graph convolutional neural networks (ConvNets), where filters are defined as elementwise multiplication in the frequency domain of a graph. In machine learning settings where the data set consists of signals defined on many different graphs, the trained ConvNet should generalize to signals on graphs unseen in the training set. It is thus important to transfer ConvNets between graphs. Transferability, which is a certain type of generalization capability, can be loosely defined as follows: if two graphs describe the same phenomenon, then a single filter or ConvNet should have similar repercussions on both graphs. This paper aims at debunking the common misconception that spectral filters are not transferable. We show that if two graphs discretize the same “continuous” space, then a spectral filter or ConvNet has approximately the same repercussion on both graphs. Our analysis is more permissive than the standard analysis. Transferability is typically described as the robustness of the filter to small graph perturbations and re-indexing of the vertices. Our analysis accounts also for large graph perturbations. We prove transferability between graphs that can have completely different dimensions and topologies, only requiring that both graphs discretize the same underlying space in some generic sense.

この論文では、スペクトルグラフ畳み込みニューラルネットワーク(ConvNet)に焦点を当てています。このネットワークでは、フィルターはグラフの周波数領域における要素ごとの乗算として定義されます。データセットが多くの異なるグラフで定義された信号で構成される機械学習の設定では、トレーニングされたConvNetはトレーニングセットにはないグラフ上の信号に一般化する必要があります。したがって、グラフ間でConvNetを転送することが重要です。ある種の一般化機能である転送可能性は、次のように大まかに定義できます。2つのグラフが同じ現象を記述する場合、単一のフィルターまたはConvNetは両方のグラフに同様の影響を与える必要があります。この論文の目的は、スペクトルフィルターは転送できないという一般的な誤解を解くことです。2つのグラフが同じ「連続」空間を離散化する場合、スペクトルフィルターまたはConvNetは両方のグラフにほぼ同じ影響を与えることを示します。私たちの分析は、標準的な分析よりも寛容です。転送可能性は、通常、小さなグラフの変動や頂点の再インデックスに対するフィルタの堅牢性として説明されます。私たちの分析では、大きなグラフの変動も考慮に入れています。私たちは、両方のグラフが何らかの一般的な意味で同じ基礎空間を離散化することのみを要求し、完全に異なる次元とトポロジを持つことができるグラフ間の転送可能性を証明します。

Nonparametric Continuous Sensor Registration
ノンパラメトリック連続センサー登録

This paper develops a new mathematical framework that enables nonparametric joint semantic and geometric representation of continuous functions using data. The joint embedding is modeled by representing the processes in a reproducing kernel Hilbert space. The functions can be defined on arbitrary smooth manifolds where the action of a Lie group aligns them. The continuous functions allow the registration to be independent of a specific signal resolution. The framework is fully analytical with a closed-form derivation of the Riemannian gradient and Hessian. We study a more specialized but widely used case where the Lie group acts on functions isometrically. We solve the problem by maximizing the inner product between two functions defined over data, while the continuous action of the rigid body motion Lie group is captured through the integration of the flow in the corresponding Lie algebra. Low-dimensional cases are derived with numerical examples to show the generality of the proposed framework. The high-dimensional derivation for the special Euclidean group acting on the Euclidean space showcases the point cloud registration and bird’s-eye view map registration abilities. An implementation of this framework for RGB-D cameras outperforms the state-of-the-art robust visual odometry and performs well in texture and structure-scarce environments.

この論文では、データを使用して連続関数の非パラメトリックな結合セマンティックおよびジオメトリ表現を可能にする新しい数学的フレームワークを開発します。結合埋め込みは、再生カーネルヒルベルト空間でプロセスを表現することによってモデル化されます。関数は、リー群の作用が整列する任意の滑らかな多様体上で定義できます。連続関数により、登録は特定の信号解像度に依存しません。フレームワークは、リーマン勾配とヘッセ行列の閉形式の導出を備えた完全に解析的です。リー群が関数に等尺的に作用する、より特殊ですが広く使用されているケースを検討します。データ上で定義された2つの関数間の内積を最大化することで問題を解決し、剛体運動リー群の連続作用は、対応するリー代数の流れの積分を通じて捕捉されます。提案されたフレームワークの一般性を示すために、数値例を使用して低次元のケースを導出します。ユークリッド空間に作用する特殊ユークリッド群の高次元導出は、ポイントクラウド登録と鳥瞰図マップ登録機能を示しています。このフレームワークをRGB-Dカメラに実装すると、最先端の堅牢な視覚オドメトリよりも優れたパフォーマンスを発揮し、テクスチャや構造が乏しい環境でも優れたパフォーマンスを発揮します。

Further results on latent discourse models and word embeddings
潜在談話モデルと単語埋め込みに関するさらなる結果

We discuss some properties of generative models for word embeddings. Namely, Arora et al. (2016) proposed a latent discourse model implying the concentration of the partition function of the word vectors. This concentration phenomenon led to an asymptotic linear relation between the pointwise mutual information (PMI) of pairs of words and the scalar product of their vectors. Here, we first revisit this concentration phenomenon and prove it under slightly weaker assumptions, for a set of random vectors symmetrically distributed around the origin. Second, we empirically evaluate the relation between PMI and scalar products of word vectors satisfying the concentration property. Our empirical results indicate that, in practice, this relation does not hold with arbitrarily small error. This observation is further supported by two theoretical results: (i) the error cannot be exactly zero because the corresponding shifted PMI matrix cannot be positive semidefinite; (ii) under mild assumptions, there exist pairs of words for which the error cannot be close to zero. We deduce that either natural language does not follow the assumptions of the considered generative model, or the current word vector generation methods do not allow the construction of the hypothesized word embeddings.

私たちは、単語埋め込みの生成モデルのいくつかの特性について説明します。具体的には、Aroraら(2016)は、単語ベクトルの分配関数の集中を含意する潜在談話モデルを提案しました。この集中現象は、単語のペアの点ごとの相互情報量(PMI)とそれらのベクトルのスカラー積の間に漸近的な線形関係をもたらしました。ここでは、最初にこの集中現象を再検討し、原点の周りに対称的に分布するランダムベクトルのセットに対して、わずかに弱い仮定の下でそれを証明します。次に、集中特性を満たす単語ベクトルのスカラー積とPMIの関係を経験的に評価します。私たちの経験的結果は、実際にはこの関係が任意に小さい誤差では成り立たないことを示しています。この観察は、2つの理論的結果によってさらに裏付けられています。(i)対応するシフトされたPMI行列は半正定値になり得ないため、誤差は正確にゼロにはなり得ません。(ii)緩やかな仮定の下では、誤差がゼロに近くなり得ない単語のペアが存在します。このことから、自然言語が検討中の生成モデルの仮定に従っていないか、現在の単語ベクトル生成方法では仮定された単語埋め込みを構築できないかのいずれかであると結論づけます。
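As a rough illustration of the quantities compared in this abstract, the sketch below estimates PMI from co-occurrence counts and measures how far it sits from a linear function of a scalar product. The toy pairs, the vectors, and the `scale` and `offset` parameters are hypothetical. The exact form of the linear relation in the paper is not reproduced here.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """PMI(w, c) = log(p(w, c) / (p(w) * p(c))) from observed (word, context) pairs."""
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    ctx_counts = Counter(c for _, c in pairs)
    total = len(pairs)
    return {
        (w, c): math.log(n * total / (word_counts[w] * ctx_counts[c]))
        for (w, c), n in pair_counts.items()
    }


def linear_relation_residual(pmi_value, u, v, scale, offset=0.0):
    """Gap between an observed PMI and a linear function of the scalar product <u, v>."""
    dot = sum(a * b for a, b in zip(u, v))
    return pmi_value - (dot / scale + offset)


# Toy co-occurrence data (illustrative only).
pairs = [("cat", "pet"), ("cat", "pet"), ("dog", "pet"), ("car", "road")]
pmi = pmi_table(pairs)

# Hypothetical 2-dimensional vectors for the pair ("car", "road").
residual = linear_relation_residual(pmi[("car", "road")], [1.0, 2.0], [2.0, 1.0], scale=2.0)
print(pmi[("car", "road")], residual)
```

A residual that stays away from zero across many word pairs is the kind of error the abstract refers to.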

CAT: Compression-Aware Training for bandwidth reduction
CAT:帯域幅削減のための圧縮対応トレーニング

One major obstacle hindering the ubiquitous use of CNNs for inference is their relatively high memory bandwidth requirements, which can be the primary energy consumer and throughput bottleneck in hardware accelerators. Inspired by quantization-aware training approaches, we propose a compression-aware training (CAT) method that involves training the model to allow better compression of weights and feature maps during neural network deployment. Our method trains the model to achieve low-entropy feature maps, enabling efficient compression at inference time using classical transform coding methods. CAT significantly improves the state-of-the-art results reported for quantization evaluated on various vision and NLP tasks, such as image classification (ImageNet), image detection (Pascal VOC), sentiment analysis (CoLa), and textual entailment (MNLI). For example, on ResNet-18, we achieve near baseline ImageNet accuracy with an average representation of only 1.5 bits per value with 5-bit quantization. Moreover, we show that entropy reduction of weights and activations can be applied together, further improving bandwidth reduction. Reference implementation is available.

推論のためのCNNのユビキタスな使用を妨げる大きな障害の1つは、比較的高いメモリ帯域幅要件です。これは、ハードウェアアクセラレータの主なエネルギー消費とスループットのボトルネックになる可能性があります。量子化を考慮したトレーニングアプローチにヒントを得て、ニューラルネットワークの展開中に重みと特徴マップをより適切に圧縮できるようにモデルをトレーニングする、圧縮を考慮したトレーニング(CAT)方法を提案します。この方法では、低エントロピーの特徴マップを実現するようにモデルをトレーニングし、従来の変換コーディング方法を使用して推論時に効率的な圧縮を可能にします。CATは、画像分類(ImageNet)、画像検出(Pascal VOC)、感情分析(CoLa)、テキスト含意(MNLI)など、さまざまなビジョンおよびNLPタスクで評価された量子化について報告された最先端の結果を大幅に改善します。たとえば、ResNet-18では、5ビットの量子化で値あたりわずか1.5ビットの平均表現で、ベースラインImageNet精度に近い精度を達成しています。さらに、重みとアクティベーションのエントロピー削減を一緒に適用することで、帯域幅の削減をさらに改善できることを示しています。参照実装が利用可能です。
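The "bits per value" figure quoted in this abstract relates to the entropy of quantized feature maps. The sketch below computes the empirical Shannon entropy of uniformly quantized values, which lower-bounds the average code length of an ideal lossless entropy coder. It is not the CAT training procedure, and the quantizer, value range, and sample activations are assumptions made for illustration.

```python
import math
from collections import Counter


def quantize(values, num_bits, lo, hi):
    """Uniformly quantize values clipped to [lo, hi] into 2**num_bits integer levels."""
    levels = 2 ** num_bits - 1
    out = []
    for v in values:
        clipped = min(max(v, lo), hi)
        out.append(round((clipped - lo) / (hi - lo) * levels))
    return out


def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Sparse, ReLU-like activations (illustrative): most values are exactly zero.
activations = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 3.1]
q = quantize(activations, num_bits=5, lo=0.0, hi=3.1)
print(q, entropy_bits(q))
```

Although each value occupies 5 bits after quantization, the skewed distribution has much lower entropy, which is why a low-entropy feature map compresses well.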

Stable-Baselines3: Reliable Reinforcement Learning Implementations
安定ベースライン3:信頼性の高い強化学習の実装

Stable-Baselines3 provides open-source implementations of deep reinforcement learning (RL) algorithms in Python. The implementations have been benchmarked against reference codebases, and automated unit tests cover 95% of the code. The algorithms follow a consistent interface and are accompanied by extensive documentation, making it simple to train and compare different RL algorithms. Our documentation, examples, and source-code are available at https://github.com/DLR-RM/stable-baselines3.

Stable-Baselines3は、Pythonでの深層強化学習(RL)アルゴリズムのオープンソース実装を提供します。実装は参照コードベースに対してベンチマークされており、自動化された単体テストはコードの95%をカバーしています。アルゴリズムは一貫したインターフェースに従い、広範なドキュメントが付属しているため、さまざまなRLアルゴリズムのトレーニングと比較が簡単になります。ドキュメント、例、ソースコードはhttps://github.com/DLR-RM/stable-baselines3から入手できます。

Reproducing kernel Hilbert C*-module and kernel mean embeddings
再生核ヒルベルトC*加群とカーネル平均埋め込み

Kernel methods have been among the most popular techniques in machine learning, where learning tasks are solved using the properties of reproducing kernel Hilbert spaces (RKHS). In this paper, we propose a novel data analysis framework with reproducing kernel Hilbert $C^*$-modules (RKHM) and kernel mean embedding (KME) in RKHM. Since RKHM contains richer information than RKHS or vector-valued RKHS (vvRKHS), analysis with RKHM enables us to capture and extract structural properties in data such as functional data. We develop a body of theory for applying RKHM to data analysis, including the representer theorem and the injectivity and universality of the proposed KME. We also show that RKHM generalizes RKHS and vvRKHS. Then, we provide concrete procedures for applying RKHM and the proposed KME to data analysis.

カーネル法は、機械学習で最も一般的な手法の1つであり、学習タスクは再生核ヒルベルト空間(RKHS)の性質を使用して解決されます。この論文では、再生核ヒルベルト$C^*$加群(RKHM)と、RKHMにおけるカーネル平均埋め込み(KME)を用いた新しいデータ解析フレームワークを提案します。RKHMはRKHSやベクトル値RKHS(vvRKHS)よりも豊富な情報を含んでいるため、RKHMによる解析では、関数データなどの構造特性を捉えて抽出することができます。表現定理、提案されたKMEの単射性と普遍性など、RKHMをデータ分析に適用するための一連の理論を示します。また、RKHMがRKHSとvvRKHSを一般化することも示します。次に、RKHMと提案されたKMEをデータ分析に採用するための具体的な手順を提供します。

Learning Bayesian Networks from Ordinal Data
順序データからのベイジアンネットワークの学習

Bayesian networks are a powerful framework for studying the dependency structure of variables in a complex system. The problem of learning Bayesian networks is tightly associated with the given data type. Ordinal data, such as stages of cancer, rating scale survey questions, and letter grades for exams, are ubiquitous in applied research. However, existing solutions are mainly for continuous and nominal data. In this work, we propose an iterative score-and-search method – called the Ordinal Structural EM (OSEM) algorithm – for learning Bayesian networks from ordinal data. Unlike traditional approaches designed for nominal data, we explicitly respect the ordering amongst the categories. More precisely, we assume that the ordinal variables originate from marginally discretizing a set of Gaussian variables, whose structural dependence in the latent space follows a directed acyclic graph. Then, we adopt the Structural EM algorithm and derive closed-form scoring functions for efficient graph searching. Through simulation studies, we illustrate the superior performance of the OSEM algorithm compared to the alternatives and analyze various factors that may influence the learning accuracy. Finally, we demonstrate the practicality of our method with a real-world application on psychological survey data from 408 patients with co-morbid symptoms of obsessive-compulsive disorder and depression.

ベイジアンネットワークは、複雑なシステムにおける変数の依存関係構造を研究するための強力なフレームワークです。ベイジアンネットワークの学習問題は、特定のデータタイプと密接に関連しています。癌のステージ、評価尺度調査の質問、試験の成績などの順序データは、応用研究で広く使用されています。ただし、既存のソリューションは主に連続データと名目データ用です。この研究では、順序データからベイジアンネットワークを学習するための、順序構造EM（OSEM）アルゴリズムと呼ばれる反復スコアおよび検索方法を提案します。名目データ用に設計された従来のアプローチとは異なり、カテゴリ間の順序を明示的に尊重します。より正確には、順序変数は、潜在空間における構造的依存関係が有向非巡回グラフに従うガウス変数のセットを限界的に離散化することに由来すると仮定します。次に、構造EMアルゴリズムを採用し、効率的なグラフ検索のための閉形式のスコアリング関数を導出します。シミュレーション研究を通じて、OSEMアルゴリズムが他のアルゴリズムに比べて優れたパフォーマンスを発揮することを示し、学習精度に影響を与える可能性のあるさまざまな要因を分析します。最後に、強迫性障害とうつ病の併存症状を持つ408人の患者の心理調査データに実際のアプリケーションを適用して、この方法の実用性を実証します。
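The modeling assumption in this abstract, ordinal variables obtained by marginally discretizing latent Gaussian variables whose dependence follows a DAG, can be sketched as a small data generator. This is only the generative side, not the OSEM learning algorithm. The two-node graph, edge weight, and thresholds are hypothetical.

```python
import random
from bisect import bisect_right


def discretize(latent_value, thresholds):
    """Map a continuous latent value to an ordinal level in 0..len(thresholds)."""
    return bisect_right(thresholds, latent_value)


def sample_ordinal_pair(rng, weight, thresholds_x, thresholds_y):
    """Sample latent Gaussians along the DAG X -> Y, then discretize each margin."""
    x = rng.gauss(0.0, 1.0)
    y = weight * x + rng.gauss(0.0, 1.0)
    return discretize(x, thresholds_x), discretize(y, thresholds_y)


rng = random.Random(0)  # fixed seed for reproducibility
data = [
    sample_ordinal_pair(rng, 0.8, [-0.5, 0.5], [-1.0, 0.0, 1.0])
    for _ in range(1000)
]
print(data[:5])
```

Here X takes three ordered levels and Y takes four. Only these levels would be observed, while the latent values and the graph would have to be inferred.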

Exact Asymptotics for Linear Quadratic Adaptive Control
線形二次適応制御の厳密漸近

Recent progress in reinforcement learning has led to remarkable performance in a range of applications, but its deployment in high-stakes settings remains quite rare. One reason is a limited understanding of the behavior of reinforcement algorithms, both in terms of their regret and their ability to learn the underlying system dynamics—existing work is focused almost exclusively on characterizing rates, with little attention paid to the constants multiplying those rates that can be critically important in practice. To start to address this challenge, we study perhaps the simplest non-bandit reinforcement learning problem: linear quadratic adaptive control (LQAC). By carefully combining recent finite-sample performance bounds for the LQAC problem with a particular (less-recent) martingale central limit theorem, we are able to derive asymptotically exact expressions for the regret, estimation error, and prediction error of a rate-optimal stepwise-updating LQAC algorithm. In simulations on both stable and unstable systems, we find that our asymptotic theory also describes the algorithm’s finite-sample behavior remarkably well.

強化学習における最近の進歩は、さまざまなアプリケーションで目覚ましいパフォーマンスをもたらしていますが、ハイステークスの設定での展開は依然として非常にまれです。理由の1つは、強化学習アルゴリズムの動作に対する理解が限られていることです。これは、そのリグレットと、基礎となるシステムダイナミクスを学習する能力の両方の点で当てはまります。既存の研究は、レートの特性評価にほぼ専念しており、実際には非常に重要になる可能性がある、レートに掛かる定数にはほとんど注意が払われていません。この課題に取り組むために、私たちはおそらく最も単純な非バンディット強化学習問題である線形二次適応制御(LQAC)を研究します。LQAC問題に対する最近の有限サンプル性能限界と特定の(それほど新しくはない)マルチンゲール中心極限定理を慎重に組み合わせることで、レート最適な逐次更新型LQACアルゴリズムのリグレット、推定誤差、予測誤差の漸近的に正確な表現を導き出すことができます。安定システムと不安定システムの両方でのシミュレーションでは、漸近理論がアルゴリズムの有限サンプルの動作を非常によく説明していることがわかりました。

Regularized spectral methods for clustering signed networks
符号付きネットワークのクラスタリングのための正則化スペクトル法

We study the problem of k-way clustering in signed graphs. Considerable attention in recent years has been devoted to analyzing and modeling signed graphs, where the affinity measure between nodes takes either positive or negative values. Recently, Cucuringu et al. (2019) proposed a spectral method, namely SPONGE (Signed Positive over Negative Generalized Eigenproblem), which casts the clustering task as a generalized eigenvalue problem optimizing a suitably defined objective function. This approach is motivated by social balance theory, where the clustering task aims to decompose a given network into disjoint groups, such that individuals within the same group are connected by as many positive edges as possible, while individuals from different groups are mainly connected by negative edges. Through extensive numerical experiments, SPONGE was shown to achieve state-of-the-art empirical performance. On the theoretical front, Cucuringu et al. (2019) analyzed SPONGE, as well as the popular Signed Laplacian based spectral method under the setting of a Signed Stochastic Block Model, for k=2 equal-sized clusters, in the regime where the graph is moderately dense. In this work, we build on the results in Cucuringu et al. (2019) on two fronts for the normalized versions of SPONGE and the Signed Laplacian. Firstly, for both algorithms, we extend the theoretical analysis in Cucuringu et al. (2019) to the general setting of k >= 2 unequal-sized clusters in the moderately dense regime. Secondly, we introduce regularized versions of both methods to handle sparse graphs — a regime where standard spectral methods are known to underperform — and provide theoretical guarantees under the same setting of a Signed Stochastic Block Model. To the best of our knowledge, regularized spectral methods have so far not been considered in the setting of clustering signed graphs. 
We complement our theoretical results with an extensive set of numerical experiments on synthetic data, and three real world data sets standard in the signed networks literature.

私たちは、符号付きグラフにおけるk方向クラスタリングの問題を研究します。近年、ノード間の類似性尺度が正または負の値をとる符号付きグラフの解析とモデル化にかなりの注目が集まっています。最近、Cucuringuら(2019)は、クラスタリングタスクを、適切に定義された目的関数を最適化する一般化固有値問題として捉えるスペクトル法、SPONGE (Signed Positive over Negative Generalized Eigenproblem)を提案しました。このアプローチは、社会バランス理論に動機付けられており、クラスタリングタスクは、与えられたネットワークを互いに素なグループに分解し、同じグループ内の個人が可能な限り多くの正のエッジで接続され、異なるグループの個人が主に負のエッジで接続されるようにすることを目的とします。広範な数値実験により、SPONGEは最先端の実証的パフォーマンスを達成することが示されました。理論面では、Cucuringuら(2019)は、グラフが中程度の密度である状態で、k = 2の等サイズクラスターについて、符号付き確率ブロックモデルの設定下でSPONGEと一般的な符号付きラプラシアンベースのスペクトル法を分析しました。この研究では、SPONGEと符号付きラプラシアンの正規化バージョンについて、2つの面でCucuringuら(2019)の結果を発展させます。まず、両方のアルゴリズムについて、Cucuringuら(2019)の理論的分析を、中程度の密度の状態でk >= 2の不等サイズクラスターの一般的な設定に拡張します。次に、標準的なスペクトル法のパフォーマンスが低いことが知られているスパースグラフを処理するために、両方の方法の正則化バージョンを導入し、符号付き確率ブロックモデルの同じ設定で理論的な保証を提供します。私たちの知る限りでは、正則化スペクトル法は、これまで、符号付きグラフのクラスタリングの設定では考慮されていません。私たちは、合成データに対する広範な数値実験と、符号付きネットワークの文献で標準となっている3つの実世界のデータセットによって、理論的結果を補完します。
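The social-balance intuition in this abstract (positive edges inside clusters, negative edges between them) can be sketched as a score for a candidate partition. This is not SPONGE itself, which solves a generalized eigenvalue problem. The function name, the edge list, and the two partitions are illustrative assumptions.

```python
def balance_score(edges, labels):
    """Fraction of edges consistent with the partition:
    positive edges within a cluster, negative edges across clusters."""
    agree = 0
    for u, v, sign in edges:
        same = labels[u] == labels[v]
        if (sign > 0 and same) or (sign < 0 and not same):
            agree += 1
    return agree / len(edges)


# Toy signed graph as (node, node, sign) triples (illustrative only).
edges = [
    (0, 1, +1), (1, 2, +1), (0, 2, +1), (3, 4, +1),
    (2, 3, -1), (0, 4, -1), (1, 3, +1),
]
good = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
bad = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0}

print(balance_score(edges, good), balance_score(edges, bad))
```

The first partition is consistent with six of the seven edges, while the second is consistent with only three. A spectral method searches for a high-scoring partition without enumerating candidates.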

On the Stability Properties and the Optimization Landscape of Training Problems with Squared Loss for Neural Networks and General Nonlinear Conic Approximation Schemes
ニューラルネットワークと一般非線形円錐近似スキームの二乗損失を伴う学習問題の安定性特性と最適化ランドスケープについて

We study the optimization landscape and the stability properties of training problems with squared loss for neural networks and general nonlinear conic approximation schemes in a deterministic setting. It is demonstrated that, if a nonlinear conic approximation scheme is considered that is (in an appropriately defined sense) more expressive than a classical linear approximation approach and if there exist unrealizable label vectors, then a training problem with squared loss is necessarily unstable in the sense that its solution set depends discontinuously on the label vector in the training data. We further prove that the same effects that are responsible for these instability properties are also the reason for the emergence of saddle points and spurious local minima, which may be arbitrarily far away from global solutions, and that neither the instability of the training problem nor the existence of spurious local minima can, in general, be overcome by adding a regularization term to the objective function that penalizes the size of the parameters in the approximation scheme. The latter results are shown to be true regardless of whether the assumption of realizability is satisfied or not. It is further established that there exists a direct and quantifiable relationship between the analyzed instability properties and the expressiveness of the considered approximation instrument and that the set of training label vectors and, in the regularized case, Tikhonov regularization parameters that give rise to spurious local minima has a nonempty interior. We demonstrate that our analysis in particular applies to training problems for free-knot interpolation schemes and deep and shallow neural networks with variable widths that involve an arbitrary mixture of various activation functions (e.g., binary, sigmoid, tanh, arctan, soft-sign, ISRU, soft-clip, SQNL, ReLU, leaky ReLU, soft-plus, bent identity, SILU, ISRLU, and ELU). 
In summary, the findings of this paper illustrate that the improved approximation properties of neural networks and general nonlinear conic approximation instruments come at a price and are linked in a direct and quantifiable way to undesirable properties of the optimization problems that have to be solved in order to train them.

私たちは、ニューラルネットワークと一般的な非線形円錐近似スキームの二乗損失を伴うトレーニング問題の最適化ランドスケープと安定性特性を決定論的設定で研究します。非線形円錐近似スキームが(適切に定義された意味で)従来の線形近似アプローチよりも表現力に富み、実現不可能なラベルベクトルが存在する場合、二乗損失を伴うトレーニング問題は、その解セットがトレーニングデータ内のラベルベクトルに不連続に依存するという意味で必然的に不安定になることが実証されています。さらに、これらの不安定性特性の原因となる同じ効果が、グローバルソリューションから任意に離れている可能性がある鞍点と偽の局所最小値の出現の原因でもあること、およびトレーニング問題の不安定性も偽の局所最小値の存在も、一般に、近似スキームのパラメーターのサイズにペナルティを課す目的関数に正則化項を追加することによって克服できないことを証明します。後者の結果は、実現可能性の仮定が満たされているかどうかに関係なく、真であることが示されています。さらに、解析された不安定性特性と、検討中の近似器の表現力との間には直接的かつ定量化可能な関係があり、トレーニングラベルベクトルのセットと、正規化されたケースでは、偽の局所最小値を生成するTikhonov正規化パラメーターの内部が空でないことが確立されています。私たちの分析は、特に、さまざまな活性化関数(バイナリ、シグモイド、tanh、arctan、soft-sign、ISRU、soft-clip、SQNL、ReLU、leaky ReLU、soft-plus、bent identity、SILU、ISRLU、ELUなど)の任意の組み合わせを含む、フリーノット補間スキームと可変幅の深層および浅層ニューラルネットワークのトレーニング問題に適用されることを示しています。要約すると、この論文の調査結果は、ニューラルネットワークと一般的な非線形円錐近似機器の近似特性の改善には代償が伴い、それらをトレーニングするために解決しなければならない最適化問題の望ましくない特性に直接的かつ定量化可能な方法で結びついていることを示しています。

Revisiting Model-Agnostic Private Learning: Faster Rates and Active Learning
モデルにとらわれないプライベート学習の再考:より速い速度とアクティブラーニング

The Private Aggregation of Teacher Ensembles (PATE) framework is one of the most promising recent approaches in differentially private learning. Existing theoretical analysis shows that PATE consistently learns any VC-classes in the realizable setting, but falls short in explaining its success in more general cases where the error rate of the optimal classifier is bounded away from zero. We fill in this gap by introducing the Tsybakov Noise Condition (TNC) and establish stronger and more interpretable learning bounds. These bounds provide new insights into when PATE works and improve over existing results even in the narrower realizable setting. We also investigate the compelling idea of using active learning for saving privacy budget, and empirical studies show the effectiveness of this new idea. The novel components in the proofs include a more refined analysis of the majority voting classifier — which could be of independent interest — and an observation that the synthetic “student” learning problem is nearly realizable by construction under the Tsybakov noise condition.

教師アンサンブルのプライベート集約(PATE)フレームワークは、差分プライバシー学習における最近の最も有望なアプローチの1つです。既存の理論分析によると、PATEは実現可能な設定でVCクラスを一貫して学習しますが、最適な分類器のエラー率がゼロから制限されているより一般的なケースでは、その成功を説明するには不十分です。このギャップを埋めるために、Tsybakovノイズ条件(TNC)を導入し、より強力で解釈しやすい学習境界を確立します。これらの境界は、PATEが機能するタイミングに関する新しい洞察を提供し、より狭い実現可能な設定でも既存の結果を改善します。また、プライバシーバジェットを節約するためにアクティブラーニングを使用するという魅力的なアイデアを調査し、実証研究によりこの新しいアイデアの有効性が示されています。証明の新しいコンポーネントには、多数決分類器のより洗練された分析(これは独立した関心事である可能性があります)と、合成「学生」学習問題がTsybakovノイズ条件の下で構築することでほぼ実現可能であるという観察が含まれます。

Domain adaptation under structural causal models
構造的因果モデルの下でのドメイン適応

Domain adaptation (DA) arises as an important problem in statistical machine learning when the source data used to train a model is different from the target data used to test the model. Recent advances in DA have mainly been application-driven and have largely relied on the idea of a common subspace for source and target data. To understand the empirical successes and failures of DA methods, we propose a theoretical framework via structural causal models that enables analysis and comparison of the prediction performance of DA methods. This framework also allows us to itemize the assumptions needed for the DA methods to have a low target error. Additionally, with insights from our theory, we propose a new DA method called CIRM that outperforms existing DA methods when both the covariates and label distributions are perturbed in the target data. We complement the theoretical analysis with extensive simulations to show the necessity of the devised assumptions. Reproducible synthetic and real data experiments are also provided to illustrate the strengths and weaknesses of DA methods when parts of the assumptions in our theory are violated.

ドメイン適応(DA)は、モデルのトレーニングに使用されるソースデータが、モデルのテストに使用されるターゲットデータと異なる場合に、統計的機械学習における重要な問題として発生します。DAの最近の進歩は主にアプリケーション主導であり、ソースデータとターゲットデータの共通サブスペースというアイデアに大きく依存しています。DA方法の実証的な成功と失敗を理解するために、構造因果モデルを介して、DA方法の予測パフォーマンスの分析と比較を可能にする理論的フレームワークを提案します。このフレームワークにより、DA方法のターゲットエラーを低くするために必要な仮定を項目別にまとめることもできます。さらに、理論からの洞察を使用して、共変量とラベル分布の両方がターゲットデータで摂動されている場合に既存のDA方法よりも優れたパフォーマンスを発揮するCIRMと呼ばれる新しいDA方法を提案します。考案された仮定の必要性を示すために、理論分析を広範なシミュレーションで補完します。再現可能な合成データと実際のデータ実験も提供され、理論の仮定の一部が違反された場合のDA方法の長所と短所を示します。

Learning Strategies in Decentralized Matching Markets under Uncertain Preferences
不確実な選好下における分散型マッチング市場における学習戦略

We study the problem of decision-making in the setting of a scarcity of shared resources when the preferences of agents are unknown a priori and must be learned from data. Taking the two-sided matching market as a running example, we focus on the decentralized setting, where agents do not share their learned preferences with a central authority. Our approach is based on the representation of preferences in a reproducing kernel Hilbert space, and a learning algorithm for preferences that accounts for uncertainty due to the competition among the agents in the market. Under regularity conditions, we show that our estimator of preferences converges at a minimax optimal rate. Given this result, we derive optimal strategies that maximize agents’ expected payoffs and we calibrate the uncertain state by taking opportunity costs into account. We also derive an incentive-compatibility property and show that the outcome from the learned strategies has a stability property. Finally, we prove a fairness property that asserts that there exists no justified envy according to the learned strategies.

私たちは、共有リソースが不足している状況で、エージェントの好みが事前に不明で、データから学習しなければならない意思決定の問題を研究します。双方向マッチング市場を例として、エージェントが学習した好みを中央機関と共有しない分散設定に焦点を当てる。我々のアプローチは、再生カーネルヒルベルト空間での好みの表現と、市場におけるエージェント間の競争による不確実性を考慮した好みの学習アルゴリズムに基づく。正則性条件下では、好みの推定値がミニマックス最適速度で収束することを示す。この結果に基づいて、エージェントの期待される報酬を最大化する最適戦略を導出し、機会費用を考慮して不確実な状態を調整します。また、インセンティブ適合性特性を導出し、学習した戦略からの結果には安定性特性があることを示す。最後に、学習した戦略によれば正当な嫉妬は存在しないと主張する公平性特性を証明します。

ROOTS: Object-Centric Representation and Rendering of 3D Scenes
ROOTS:3Dシーンのオブジェクト中心の表現とレンダリング

A crucial ability of human intelligence is to build up models of individual 3D objects from partial scene observations. Recent works either achieve object-centric generation but without the ability to infer the representation, or achieve 3D scene representation learning but without object-centric compositionality. Therefore, learning to both represent and render 3D scenes with object-centric compositionality remains elusive. In this paper, we propose a probabilistic generative model for learning to build modular and compositional 3D object models from partial observations of a multi-object scene. The proposed model can (i) infer the 3D object representations by learning to search and group object areas, and also (ii) render from an arbitrary viewpoint not only individual objects but also the full scene by compositing the objects. The entire learning process is unsupervised and end-to-end. In experiments, in addition to generation quality, we also demonstrate that the learned representation permits object-wise manipulation and novel scene generation, and generalizes to various settings. Results can be found on our project website: https://sites.google.com/view/roots3d

人間の知能の重要な能力は、部分的なシーンの観察から個々の3Dオブジェクトのモデルを構築することです。最近の研究では、オブジェクト中心の生成は達成されていますが、表現を推測する能力がないか、3Dシーンの表現学習は達成されていますが、オブジェクト中心の構成性がありません。そのため、オブジェクト中心の構成性を備えた3Dシーンの表現とレンダリングの両方を学習することは、依然として困難です。この論文では、複数のオブジェクトシーンの部分的な観察からモジュール式で構成的な3Dオブジェクトモデルを構築することを学習するための確率的生成モデルを提案します。提案されたモデルは、(i)オブジェクト領域の検索とグループ化を学習することで3Dオブジェクト表現を推測し、(ii)オブジェクトを合成することで、個々のオブジェクトだけでなくシーン全体を任意の視点からレンダリングできます。学習プロセス全体は、教師なしのエンドツーエンドです。実験では、生成品質に加えて、学習した表現によってオブジェクト単位の操作と新しいシーン生成が可能になり、さまざまな設定に一般化できることも実証しています。結果は当プロジェクトのウェブサイトでご覧いただけます: https://sites.google.com/view/roots3d

Optimized Score Transformation for Consistent Fair Classification
一貫性のある公正な分類のための最適化されたスコア変換

This paper considers fair probabilistic binary classification where the outputs of primary interest are predicted probabilities, commonly referred to as scores. We formulate the problem of transforming scores to satisfy fairness constraints that are linear in conditional means of scores while minimizing a cross-entropy objective. The formulation can be applied directly to post-process classifier outputs and we also explore a pre-processing extension, thus allowing maximum freedom in selecting a classification algorithm. We derive a closed-form expression for the optimal transformed scores and a convex optimization problem for the transformation parameters. In the population limit, the transformed score function is the fairness-constrained minimizer of cross-entropy with respect to the true conditional probability of the outcome. In the finite sample setting, we propose a method called FairScoreTransformer to approach this solution using a combination of standard probabilistic classifiers and ADMM. We provide several consistency and finite-sample guarantees for FairScoreTransformer, relating to the transformation parameters and transformed score function that it obtains. Comprehensive experiments comparing to 10 existing methods show that FairScoreTransformer has advantages for score-based metrics such as Brier score and AUC while remaining competitive for binary label-based metrics such as accuracy.

この論文では、公平な確率的バイナリ分類について検討します。ここでは、主な関心の出力は予測確率(一般にスコアと呼ばれます)です。スコアの条件付き平均が線形である公平性制約を満たしながら、クロスエントロピー目標を最小化するようにスコアを変換する問題を定式化します。この定式化は、後処理分類器の出力に直接適用できます。また、前処理の拡張も検討し、分類アルゴリズムの選択の自由度を最大限に高めます。最適な変換スコアの閉形式の式と、変換パラメータの凸最適化問題を導出します。母集団の制限では、変換されたスコア関数は、結果の真の条件付き確率に関する公平性制約付きクロスエントロピーの最小化です。有限サンプル設定では、標準の確率的分類器とADMMを組み合わせてこのソリューションにアプローチするFairScoreTransformerと呼ばれる方法を提案します。FairScoreTransformerには、変換パラメータとそれが取得する変換されたスコア関数に関連して、一貫性と有限サンプルの保証がいくつか用意されています。既存の10種類の方法と比較した包括的な実験により、FairScoreTransformerはBrierスコアやAUCなどのスコアベースのメトリックに対して優位性を持ち、精度などのバイナリラベルベースのメトリックに対しても競争力を維持することが示されました。

Estimating Uncertainty Intervals from Collaborating Networks
協調ネットワークからの不確実性区間の推定

Effective decision making requires understanding the uncertainty inherent in a prediction. In regression, this uncertainty can be estimated by a variety of methods; however, many of these methods are laborious to tune, generate overconfident uncertainty intervals, or lack sharpness (give imprecise intervals). We address these challenges by proposing a novel method to capture predictive distributions in regression by defining two neural networks with two distinct loss functions. Specifically, one network approximates the cumulative distribution function, and the second network approximates its inverse. We refer to this method as Collaborating Networks (CN). Theoretical analysis demonstrates that a fixed point of the optimization is at the idealized solution, and that the method is asymptotically consistent to the ground truth distribution. Empirically, learning is straightforward and robust. We benchmark CN against several common approaches on two synthetic and six real-world datasets, including forecasting A1c values in diabetic patients from electronic health records, where uncertainty is critical. In the synthetic data, the proposed approach essentially matches ground truth. In the real-world datasets, CN improves results on many performance metrics, including log-likelihood estimates, mean absolute errors, coverage estimates, and prediction interval widths.

効果的な意思決定には、予測に内在する不確実性を理解する必要があります。回帰分析では、この不確実性はさまざまな方法で推定できますが、これらの方法の多くは調整が面倒で、自信過剰な不確実性区間を生成したり、鮮明さに欠けたりします(不正確な区間を与えます)。私たちは、2つの異なる損失関数を持つ2つのニューラルネットワークを定義することで、回帰分析で予測分布を捕捉する新しい方法を提案することで、これらの課題に対処します。具体的には、1つのネットワークは累積分布関数を近似し、2つ目のネットワークはその逆関数を近似します。私たちはこの方法をCollaborating Networks (CN)と呼んでいます。理論分析では、最適化の固定点は理想化されたソリューションにあり、この方法はグラウンドトゥルース分布と漸近的に一致していることが実証されています。経験的には、学習は簡単で堅牢です。私たちは、不確実性が重要な、電子健康記録からの糖尿病患者のA1c値の予測を含む、2つの合成データセットと6つの実際のデータセットで、いくつかの一般的なアプローチに対してCNをベンチマークしました。合成データでは、提案されたアプローチは基本的にグラウンドトゥルースと一致します。実際のデータセットでは、CNは、対数尤度推定値、平均絶対誤差、カバレッジ推定値、予測区間幅など、多くのパフォーマンスメトリックの結果を改善します。
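To illustrate how a CDF and its inverse yield uncertainty intervals, and how coverage and interval width are measured, the sketch below uses a standard normal distribution from Python's `statistics` module as a stand-in for the two learned networks. The helper names and the sample targets are hypothetical, and nothing here reproduces the CN training losses.

```python
from statistics import NormalDist


def central_interval(inverse_cdf, level):
    """Central prediction interval at the given level, read off an inverse CDF."""
    tail = (1.0 - level) / 2.0
    return inverse_cdf(tail), inverse_cdf(1.0 - tail)


def coverage_and_width(intervals, targets):
    """Empirical coverage and mean width of a list of (low, high) intervals."""
    covered = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, targets))
    mean_width = sum(hi - lo for lo, hi in intervals) / len(intervals)
    return covered / len(targets), mean_width


# Stand-in for the two networks: an exact CDF and its exact inverse.
dist = NormalDist(mu=0.0, sigma=1.0)
low, high = central_interval(dist.inv_cdf, level=0.95)

# The two functions should invert each other, as the two networks are meant to.
round_trip = dist.cdf(dist.inv_cdf(0.3))

targets = [-0.2, 0.4, 1.1, 2.5]  # illustrative observations
coverage, width = coverage_and_width([(low, high)] * len(targets), targets)
print(low, high, round_trip, coverage, width)
```

Overconfident intervals show up as coverage below the nominal level, while a lack of sharpness shows up as a large mean width.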

Model Linkage Selection for Cooperative Learning
協同学習のためのモデルリンケージ選択

We consider the distributed learning setting where each agent or learner holds a specific parametric model and a data source. The goal is to integrate information across a set of learners and data sources to enhance the prediction accuracy of a given learner. A natural way to integrate information is to build a joint model across a group of learners that shares common parameters of interest. However, the underlying parameter sharing patterns across a set of learners may not be known a priori. Misspecifying the parameter sharing patterns or the parametric model for each learner often yields a biased estimator that degrades the prediction accuracy. We propose a general method to integrate information across a set of learners that is robust against misspecification of both models and parameter sharing patterns. The main crux of our proposed method is to sequentially incorporate additional learners that can enhance the prediction accuracy of an existing joint model based on user-specified parameter sharing patterns across a set of learners. Theoretically, we show that the proposed method can data-adaptively select a parameter sharing pattern that enhances the predictive performance of a given learner. Extensive numerical studies are conducted to assess the performance of the proposed method.

私たちは、各エージェントまたは学習者が特定のパラメトリックモデルとデータソースを保持する分散学習設定を検討します。目標は、一連の学習者とデータソース間で情報を統合して、特定の学習者の予測精度を高めることです。情報を統合する自然な方法は、共通の関心パラメータを共有する学習者のグループ間で結合モデルを構築することです。ただし、一連の学習者間の基本的なパラメータ共有パターンは事前にわからない場合があります。各学習者のパラメータ共有パターンまたはパラメトリックモデルを誤って指定すると、予測精度を低下させる偏った推定値が生成されます。私たちは、モデルとパラメータ共有パターンの両方の誤った指定に対して堅牢な、一連の学習者間で情報を統合する一般的な方法を提案します。提案方法の主な要点は、一連の学習者間でユーザーが指定したパラメータ共有パターンに基づいて、既存の結合モデルの予測精度を高めることができる追加の学習者を順次組み込むことです。理論的には、提案方法は、特定の学習者の予測パフォーマンスを高めるパラメータ共有パターンをデータ適応的に選択できることを示しています。提案された方法の性能を評価するために、広範な数値研究が実施されています。

Adversarial Monte Carlo Meta-Learning of Optimal Prediction Procedures
最適予測手順の敵対的モンテカルロメタ学習

We frame the meta-learning of prediction procedures as a search for an optimal strategy in a two-player game. In this game, Nature selects a prior over distributions that generate labeled data consisting of features and an associated outcome, and the Predictor observes data sampled from a distribution drawn from this prior. The Predictor’s objective is to learn a function that maps from a new feature to an estimate of the associated outcome. We establish that, under reasonable conditions, the Predictor has an optimal strategy that is equivariant to shifts and rescalings of the outcome and is invariant to permutations of the observations and to shifts, rescalings, and permutations of the features. We introduce a neural network architecture that satisfies these properties. The proposed strategy performs favorably compared to standard practice in both parametric and nonparametric experiments.

私たちは、予測手順のメタ学習を、2人用ゲームにおける最適な戦略の探索として定式化します。このゲームでは、Natureは、特徴と関連する結果で構成されるラベル付きデータを生成する分布上の事前分布を選択し、Predictorは、この事前分布から引き出された分布からサンプリングされたデータを観察します。Predictorの目的は、新しい特徴から関連する結果の推定値への写像となる関数を学習することです。合理的な条件下では、Predictorは、結果のシフトと再スケーリングに対して同変であり、観測値の順列と特徴のシフト、再スケーリング、および順列に対して不変である最適な戦略を持つことを立証します。これらの特性を満たすニューラルネットワークアーキテクチャを紹介します。提案された戦略は、パラメトリック実験とノンパラメトリック実験の両方で、標準的な手法と比較して良好な結果を示します。

Inference for the Case Probability in High-dimensional Logistic Regression
高次元ロジスティック回帰におけるケース確率の推論

Labeling patients in electronic health records with respect to their statuses of having a disease or condition, i.e. case or control statuses, has increasingly relied on prediction models using high-dimensional variables derived from structured and unstructured electronic health record data. A major hurdle currently is a lack of valid statistical inference methods for the case probability. In this paper, considering high-dimensional sparse logistic regression models for prediction, we propose a novel bias-corrected estimator for the case probability through the development of linearization and variance enhancement techniques. We establish asymptotic normality of the proposed estimator for any loading vector in high dimensions. We construct a confidence interval for the case probability and propose a hypothesis testing procedure for patient case-control labelling. We demonstrate the proposed method via extensive simulation studies and application to real-world electronic health record data.

電子医療記録において、患者の疾患または状態、すなわち症例または対照のステータスに関するラベル付けは、構造化および非構造化電子医療記録データから得られる高次元変数を使用した予測モデルにますます依存するようになっています。現在、大きな障害となっているのは、症例確率に対する有効な統計的推論方法がないことです。この論文では、予測のための高次元スパースロジスティック回帰モデルを考慮し、線形化および分散強化技術の開発を通じて、症例確率の新しいバイアス補正推定量を提案します。高次元の任意のローディングベクトルに対して、提案された推定量の漸近正規性を確立します。症例確率の信頼区間を構築し、患者の症例対照ラベル付けのための仮説検定手順を提案します。広範なシミュレーション研究と実際の電子医療記録データへの適用を通じて、提案された方法を実証します。

Bifurcation Spiking Neural Network
分岐スパイキングニューラルネットワーク

Spiking neural networks (SNNs) have attracted much attention due to their great potential for modeling time-dependent signals. The performance of SNNs depends not only on picking an apposite architecture and searching for optimal connection weights, as with conventional deep neural networks, but also on the careful tuning of many hyper-parameters within fundamental spiking neural models. However, so far, there has been less systematic work on analyzing SNNs’ dynamical characteristics, especially those related to these internal hyper-parameters, so whether SNNs are adequate for modeling actual data is left to chance. In this work, we provide a theoretical framework for investigating spiking neural models from the perspective of dynamical systems. As a result, we point out that the LIF model with control rate hyper-parameters is a bifurcation dynamical system. This point explains why the performance of SNNs is so sensitive to the setting of control rate hyper-parameters, leading to a recommendation that diverse and adaptive eigenvalues are beneficial to improve the performance of SNNs. Inspired by this insight, we develop the Bifurcation Spiking Neural Network (BSNN) with supervised implementation, and theoretically show that BSNN is an adaptive dynamical system. Experiments validate the effectiveness of BSNN on several benchmark data sets, showing that BSNN achieves superior performance to existing SNNs and is robust to the setting of control rates.

スパイキングニューラルネットワーク（SNN）は、時間依存信号をモデル化する大きな可能性を秘めているため、多くの注目を集めています。SNNのパフォーマンスは、従来のディープニューラルネットワークと同様に適切なアーキテクチャを選択して最適な接続重みを探すだけでなく、基本的なスパイキングニューラルモデル内の多くのハイパーパラメータを注意深く調整することにも依存します。しかし、これまでのところ、SNNの動的特性、特にこれらの内部ハイパーパラメータに関連する特性を分析する体系的な研究はあまり行われておらず、SNNが実際のデータをモデル化するのに適しているかどうかは運次第ということになります。この研究では、動的システムの観点からスパイキングニューラルモデルを調査するための理論的枠組みを提供します。その結果、制御レートハイパーパラメータを備えたLIFモデルは分岐動的システムであることを指摘します。この点は、SNNのパフォーマンスが制御レートのハイパーパラメータの設定に非常に敏感である理由を説明しており、多様で適応的な固有値がSNNのパフォーマンスを向上させるのに有益であるという推奨につながります。この洞察に触発されて、私たちは教師あり実装を備えた分岐スパイキングニューラルネットワーク(BSNN)を開発し、BSNNが適応型動的システムであることを理論的に示しています。実験では、いくつかのベンチマークデータセットでBSNNの有効性を検証し、BSNNが既存のSNNよりも優れたパフォーマンスを実現し、制御レートの設定に対して堅牢であることを示しています。

Batch greedy maximization of non-submodular functions: Guarantees and applications to experimental design
非サブモジュラ関数のバッチ貪欲最大化:実験計画への保証と応用

We propose and analyze batch greedy heuristics for cardinality constrained maximization of non-submodular non-decreasing set functions. We consider the standard greedy paradigm, along with its distributed greedy and stochastic greedy variants. Our theoretical guarantees are characterized by the combination of submodularity and supermodularity ratios. We argue how these parameters define tight modular bounds based on incremental gains, and provide a novel reinterpretation of the classical greedy algorithm using the minorize-maximize (MM) principle. Based on that analogy, we propose a new class of methods exploiting any plausible modular bound. In the context of optimal experimental design for linear Bayesian inverse problems, we bound the submodularity and supermodularity ratios when the underlying objective is based on mutual information. We also develop novel modular bounds for the mutual information in this setting, and describe certain connections to polyhedral combinatorics. We discuss how algorithms using these modular bounds relate to established statistical notions such as leverage scores and to more recent efforts such as volume sampling. We demonstrate our theoretical findings on synthetic problems and on a real-world climate monitoring example.

私たちは、非サブモジュラ非減少集合関数の基数制約最大化のためのバッチ貪欲ヒューリスティックを提案し、分析します。標準的な貪欲パラダイム、およびその分散貪欲および確率的貪欲の変種を検討します。我々の理論的保証は、サブモジュラリティとスーパーモジュラリティの比率の組み合わせによって特徴付けられます。私たちは、これらのパラメータが増分ゲインに基づいて厳密なモジュラ境界をどのように定義するかを説明し、マイナー化最大化(MM)原理を使用して古典的な貪欲アルゴリズムの新しい再解釈を提供します。その類推に基づいて、私たちは、あらゆる妥当なモジュラ境界を活用する新しいクラスの方法を提案します。線形ベイズ逆問題の最適実験設計のコンテキストでは、基礎となる目的が相互情報量に基づいている場合に、サブモジュラリティとスーパーモジュラリティの比率に対する境界を与えます。また、この設定での相互情報量の新しいモジュラ境界を開発し、多面体組合せ論との特定の関係について説明します。これらのモジュラ境界を使用するアルゴリズムが、レバレッジスコアなどの確立された統計概念や、ボリュームサンプリングなどの最近の取り組みとどのように関係するかについて説明します。合成問題と実際の気候監視の例で、理論的発見を示します。

Tractable Approximate Gaussian Inference for Bayesian Neural Networks
ベイジアンニューラルネットワークのための扱いやすい近似ガウス推論

In this paper, we propose an analytical method for performing tractable approximate Gaussian inference (TAGI) in Bayesian neural networks. The method enables the analytical Gaussian inference of the posterior mean vector and diagonal covariance matrix for weights and biases. The method proposed has a computational complexity of $\mathcal{O}(n)$ with respect to the number of parameters $n$, and the tests performed on regression and classification benchmarks confirm that, for the same network architecture, it matches the performance of existing methods relying on gradient backpropagation.

この論文では、ベイジアンニューラルネットワークにおいて扱いやすい近似ガウス推論(TAGI)を実行するための解析手法を提案します。この方法により、重みとバイアスの事後平均ベクトルと対角共分散行列の解析的ガウス推論が可能になります。提案された手法は、パラメーターの数$n$に対して$\mathcal{O}(n)$の計算複雑性を持ち、回帰ベンチマークと分類ベンチマークで実行されるテストでは、同じネットワークアーキテクチャに対して、勾配バックプロパゲーションに依存する既存の手法のパフォーマンスと一致することが確認されています。

Bayesian time-aligned factor analysis of paired multivariate time series
対応多変量時系列のベイズ時間整列因子分析

Many modern data sets require inference methods that can estimate the shared and individual-specific components of variability in collections of matrices that change over time. Promising methods have been developed to analyze these types of data in static cases, but only a few approaches are available for dynamic settings. To address this gap, we consider novel models and inference methods for pairs of matrices in which the columns correspond to multivariate observations at different time points. In order to characterize common and individual features, we propose a Bayesian dynamic factor modeling framework called Time Aligned Common and Individual Factor Analysis (TACIFA) that includes uncertainty in time alignment through an unknown warping function. We provide theoretical support for the proposed model, showing identifiability and posterior concentration. The structure enables efficient computation through a Hamiltonian Monte Carlo (HMC) algorithm. We show excellent performance in simulations, and illustrate the method through application to a social mimicry experiment.

現代の多くのデータセットでは、時間とともに変化するマトリックスのコレクションにおける変動性の共通および個別固有の要素を推定できる推論方法が必要です。静的なケースでこれらのタイプのデータを分析するための有望な方法が開発されていますが、動的な設定に使用できるアプローチはごくわずかです。このギャップを埋めるために、列が異なる時点での多変量観測値に対応するマトリックスのペアに対する新しいモデルと推論方法を検討します。共通および個別の特徴を特徴付けるために、未知のワーピング関数による時間調整の不確実性を含む、時間調整された共通および個別因子分析(TACIFA)と呼ばれるベイズ動的因子モデリングフレームワークを提案します。提案モデルを理論的にサポートし、識別可能性と事後集中を示します。この構造により、ハミルトンモンテカルロ(HMC)アルゴリズムによる効率的な計算が可能になります。シミュレーションで優れたパフォーマンスを示し、社会的模倣実験への適用を通じて方法を説明します。

On the Riemannian Search for Eigenvector Computation
固有ベクトル計算のためのリーマン探索について

Eigenvector computation is central to numerical algebra and often critical to many data analysis tasks nowadays. Most research on this problem has been focusing on projection methods like power iterations, such that this category of algorithms can achieve both optimal convergence rates and cheap per-iteration costs. In contrast, search methods belonging to another main category are less understood in this respect. In this work, we consider the leading eigenvector computation as a non-convex optimization problem on the (generalized) Stiefel manifold and cover the cases for both standard and generalized eigenvectors. It is shown that the inexact Riemannian gradient method induced by the shift-and-invert preconditioning is guaranteed to converge to one of the ground-truth eigenvectors at an optimal rate, e.g., $O(\sqrt{\kappa_{\mathbf{B}}\frac{\lambda_{1}}{\lambda_{1}-\lambda_{p+1}}}\log\frac{1}{\epsilon})$ for a pair of real symmetric matrices $(\mathbf{A},\mathbf{B})$ with $\mathbf{B}$ being positive definite, where $\lambda_{i}$ represents the $i$-th largest generalized eigenvalue of the matrix pair, $p$ is the multiplicity of $\lambda_{1}$, and $\kappa_{\mathbf{B}}$ stands for the condition number of $\mathbf{B}$. The standard eigenvector computation is recovered by setting $\mathbf{B}$ to an identity matrix. Our analysis reduces the dependence on the eigengap, making it the first Riemannian eigensolver that achieves the optimal rate. Experiments demonstrate that the proposed search method is able to deliver significantly better performance than projection methods by taking advantage of step-size schemes.

固有ベクトルの計算は数値代数の中心であり、今日では多くのデータ分析タスクにとって重要です。この問題に関する研究のほとんどは、べき乗反復などの投影法に焦点を当てており、このカテゴリのアルゴリズムは最適な収束率と反復あたりの低コストの両方を実現できます。対照的に、別の主要なカテゴリに属する探索法は、この点ではあまり理解されていません。この研究では、主要な固有ベクトルの計算を(一般化)シュティーフェル多様体上の非凸最適化問題として考え、標準および一般化固有ベクトルの両方のケースをカバーします。シフト・アンド・インバート前処理によって誘導される不正確なリーマン勾配法は、最適な速度で真の固有ベクトルの1つに収束することが保証されていることが示されています。例えば、実対称行列のペア$(\mathbf{A},\mathbf{B})$（$\mathbf{B}$は正定値）に対して、$O(\sqrt{\kappa_{\mathbf{B}}\frac{\lambda_{1}}{\lambda_{1}-\lambda_{p+1}}}\log\frac{1}{\epsilon})$です。ここで、$\lambda_{i}$は行列ペアの$i$番目に大きい一般化固有値を表し、$p$は$\lambda_{1}$の重複度、$\kappa_{\mathbf{B}}$は$\mathbf{B}$の条件数を表します。標準の固有ベクトル計算は、$\mathbf{B}$を単位行列に設定することで復元されます。私たちの分析は、固有ギャップへの依存を減らし、最適なレートを実現する最初のリーマン固有値ソルバーを実現します。実験では、提案された探索法が、ステップサイズスキームの利点を活用することで、投影法よりも大幅に優れたパフォーマンスを発揮できることが実証されています。
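
As a rough illustration of the "search method" category mentioned in the abstract above, the following sketch runs plain Riemannian gradient ascent on the unit sphere for the Rayleigh quotient of a small symmetric matrix. It is not the shift-and-invert preconditioned, inexact method analyzed in the paper. The matrix `A`, the fixed step size, the iteration count, and all function names are assumptions chosen only for this example.

```python
import math


def matvec(A, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def normalize(x):
    norm = math.sqrt(dot(x, x))
    return [xi / norm for xi in x]


def riemannian_gradient_ascent(A, x0, step=0.2, iters=200):
    """Maximize x^T A x over the unit sphere (illustrative fixed-step variant)."""
    x = normalize(x0)
    for _ in range(iters):
        Ax = matvec(A, x)
        rayleigh = dot(x, Ax)
        # Tangent-space projection of the Euclidean gradient direction A x
        # (the constant factor 2 is absorbed into the step size).
        grad = [axi - rayleigh * xi for axi, xi in zip(Ax, x)]
        # Retraction: take the step, then renormalize back onto the sphere.
        x = normalize([xi + step * gi for xi, gi in zip(x, grad)])
    return x, dot(x, matvec(A, x))


# Hypothetical test matrix with eigenvalues 3 and 1.
A = [[2.0, 1.0], [1.0, 2.0]]
x, lam = riemannian_gradient_ascent(A, [1.0, 0.0])
print(x, lam)
```

For this matrix the iterate approaches the direction (1, 1)/√2 and the Rayleigh quotient approaches 3. The fixed step is only a placeholder for the step-size schemes the abstract refers to.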

Statistically and Computationally Efficient Change Point Localization in Regression Settings
回帰設定における統計的および計算効率の高い変化ポイントのローカリゼーション

Detecting when the underlying distribution changes for the observed time series is a fundamental problem arising in a broad spectrum of applications. In this paper, we study multiple change-point localization in the high-dimensional regression setting, which is particularly challenging as no direct observations of the parameter of interest are available. Specifically, we assume we observe $\{ x_t, y_t\}_{t=1}^n$ where $ \{ x_t\}_{t=1}^n $ are $p$-dimensional covariates, $\{y_t\}_{t=1}^n$ are the univariate responses satisfying $\mathbb{E}(y_t) = x_t^\top \beta_t^* \text{ for } 1\le t \le n $ and $\{\beta_t^*\}_{t=1}^n $ are the unobserved regression coefficients that change over time in a piecewise constant manner. We propose a novel projection-based algorithm, Variance Projected Wild Binary Segmentation (VPWBS), which transforms the original (difficult) problem of change-point detection in $p$-dimensional regression to a simpler problem of change-point detection in mean of a one-dimensional time series. VPWBS is shown to achieve sharp localization rate $O_p(1/n)$ up to a log factor, a significant improvement from the best rate $O_p(1/\sqrt{n})$ known in the existing literature for multiple change-point localization in high-dimensional regression. Extensive numerical experiments are conducted to demonstrate the robust and favorable performance of VPWBS over two state-of-the-art algorithms, especially when the size of change in the regression coefficients $\{\beta_t^*\}_{t=1}^n $ is small.

観測された時系列の基礎となる分布がいつ変化するかを検出することは、幅広いアプリケーションで生じる基本的な問題です。この論文では、高次元回帰設定における複数の変化点の特定について検討します。これは、関心のあるパラメータを直接観測できないため、特に困難です。具体的には、$\{ x_t, y_t\}_{t=1}^n$を観測すると仮定します。ここで、$ \{ x_t\}_{t=1}^n $は$p$次元の共変量、$\{y_t\}_{t=1}^n$は$\mathbb{E}(y_t) = x_t^\top \beta_t^* \text{ for } 1\le t \le n $を満たす単変量応答、$\{\beta_t^*\}_{t=1}^n $は、時間の経過とともに区分的に一定に変化する観測されていない回帰係数です。私たちは、新しい投影ベースのアルゴリズム、分散投影ワイルドバイナリセグメンテーション(VPWBS)を提案します。これは、p次元回帰における変化点検出という元々の(困難な)問題を、1次元時系列の平均における変化点検出というより単純な問題に変換します。VPWBSは、対数係数までの鋭い局所化率$O_p(1/n)$を達成することが示されており、これは高次元回帰における複数の変化点の局所化に関する既存の文献で知られている最高率$O_p(1/\sqrt{n})$から大幅に改善されています。特に回帰係数$\{\beta_t^*\}_{t=1}^n $の変化のサイズが小さい場合に、2つの最先端アルゴリズムよりもVPWBSが堅牢で好ましい性能を示すことを実証するために、広範な数値実験が実施されています。
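
The abstract above reduces the regression problem to change-point detection in the mean of a one-dimensional series. As a minimal sketch of that simpler target problem only, the code below locates a single mean change with the classical CUSUM statistic. This is not VPWBS itself: the projection step and the wild binary segmentation are omitted, and the function name and toy series are assumptions made for illustration.

```python
import math


def cusum_change_point(y):
    """Return (t, stat) maximizing the CUSUM statistic for one mean change.

    A split at t means y[:t] and y[t:] are treated as the two segments.
    """
    n = len(y)
    total = sum(y)
    best_t, best_stat = None, -1.0
    left_sum = 0.0
    for t in range(1, n):
        left_sum += y[t - 1]
        left_mean = left_sum / t
        right_mean = (total - left_sum) / (n - t)
        # Scaled difference of segment means.
        stat = math.sqrt(t * (n - t) / n) * abs(left_mean - right_mean)
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t, best_stat


# Hypothetical noiseless series whose mean jumps from 0 to 1 after 5 points.
series = [0.0] * 5 + [1.0] * 5
t_hat, stat = cusum_change_point(series)
print(t_hat, stat)
```

On this toy series the statistic is maximized at the split `t_hat = 5`, the true change location.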

Statistical Guarantees for Local Spectral Clustering on Random Neighborhood Graphs
ランダム近傍グラフ上の局所スペクトルクラスタリングの統計的保証

We study the Personalized PageRank (PPR) algorithm, a local spectral method for clustering, which extracts clusters using locally-biased random walks around a given seed node. In contrast to previous work, we adopt a classical statistical learning setup, where we obtain samples from an unknown nonparametric distribution, and aim to identify sufficiently salient clusters. We introduce a trio of population-level functionals—the normalized cut, conductance, and local spread, analogous to graph-based functionals of the same name—and prove that PPR, run on a neighborhood graph, recovers clusters with small population normalized cut and large conductance and local spread. We apply our general theory to establish that PPR identifies connected regions of high density (density clusters) that satisfy a set of natural geometric conditions. We also show a converse result, that PPR can fail to recover geometrically poorly-conditioned density clusters, even asymptotically. Finally, we provide empirical support for our theory.

私たちは、クラスタリングのためのローカルスペクトル法であるPersonalized PageRank (PPR)アルゴリズムを研究します。このアルゴリズムは、特定のシードノードの周りをローカルに偏ったランダムウォークを使用してクラスターを抽出します。以前の研究とは対照的に、未知のノンパラメトリック分布からサンプルを取得し、十分に顕著なクラスターを特定することを目指す、古典的な統計学習設定を採用します。私たちは、同じ名前のグラフベースの関数に類似した、正規化カット、コンダクタンス、およびローカルスプレッドの3つの集団レベルの関数を導入し、近傍グラフで実行されるPPRが、小さな集団正規化カットと大きなコンダクタンスおよびローカルスプレッドを持つクラスターを回復することを証明します。私たちは、一般理論を適用して、PPRが一連の自然な幾何学的条件を満たす高密度の接続領域(密度クラスター)を識別することを確立します。また、PPRは、漸近的であっても、幾何学的に条件の悪い密度クラスターを回復できない可能性があるという逆の結果も示します。最後に、我々の理論を実証的に裏付けます。
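
To make the PPR procedure in the abstract above concrete, here is a small sketch that computes a Personalized PageRank vector by fixed-point iteration and ranks nodes by degree-normalized score, as a sweep cut would. The graph (two triangles joined by one edge), the teleportation parameter `alpha`, and the function names are assumptions for illustration. The paper itself works with neighborhood graphs built from samples, which is not reproduced here.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=200):
    """Iterate p <- alpha * e_seed + (1 - alpha) * p W, with W the random-walk matrix."""
    nodes = list(adj)
    p = {v: 0.0 for v in nodes}
    p[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        nxt[seed] += alpha  # teleport back to the seed node
        for u in nodes:
            share = (1.0 - alpha) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share  # spread the remaining mass along edges
        p = nxt
    return p


def sweep_order(adj, p):
    """Order nodes by degree-normalized PPR score, highest first."""
    return sorted(adj, key=lambda v: p[v] / len(adj[v]), reverse=True)


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
adj = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4, 5],
    4: [3, 5],
    5: [3, 4],
}
p = personalized_pagerank(adj, seed=0)
order = sweep_order(adj, p)
print(order[:3])
```

With the seed at node 0, the three highest-ranked nodes are those of the seed's own triangle, which is the locally biased behavior the abstract describes.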

Mixture Martingales Revisited with Applications to Sequential Tests and Confidence Intervals
逐次検定と信頼区間への応用による混合マーチンゲールの再検討

This paper presents new deviation inequalities that are valid uniformly in time under adaptive sampling in a multi-armed bandit model. The deviations are measured using the Kullback-Leibler divergence in a given one-dimensional exponential family, and take into account multiple arms at a time. They are obtained by constructing for each arm a mixture martingale based on a hierarchical prior, and by multiplying those martingales. Our deviation inequalities allow us to analyze stopping rules based on generalized likelihood ratios for a large class of sequential identification problems. We establish asymptotic optimality of sequential tests generalising the track-and-stop method to problems beyond best arm identification. We further derive sharper stopping thresholds, where the number of arms is replaced by the newly introduced pure exploration problem rank. We construct tight confidence intervals for linear functions and minima/maxima of the vector of arm means.

この論文では、マルチアームバンディットモデルの適応サンプリングの下で時間的に一様に有効な新しい偏差不等式を示します。偏差は、特定の1次元指数族のKullback-Leibler発散を使用して測定され、一度に複数のアームが考慮されます。これらは、各アームに対して階層的な事前分布に基づいて混合物マーチンゲールを構築し、それらのマーチンゲールを乗算することによって得られます。この偏差の不等式により、大規模なクラスの逐次同定問題に対する一般化尤度比に基づいて停止ルールを分析できます。逐次テストの漸近最適性を確立し、トラックアンドストップ法を最適なアームの特定を超えた問題に一般化します。さらに、アームの数が新しく導入された純粋な探査問題ランクに置き換えられる、より鋭い停止しきい値を導き出します。線形関数とアーム平均のベクトルの最小値/最大値の厳密な信頼区間を構築します。

On lp-hyperparameter Learning via Bilevel Nonsmooth Optimization
バイレベル非平滑最適化による lp ハイパーパラメータ学習について

We propose a bilevel optimization strategy for selecting the best hyperparameter value for the nonsmooth $\ell_p$ regularizer with $0

私たちは、$0を持つ非平滑な$\ell_p$正則子に最適なハイパーパラメータ値を選択するためのバイレベル最適化戦略を提案します

Consistency of Gaussian Process Regression in Metric Spaces
距離空間におけるガウス過程回帰の一貫性

Gaussian process (GP) regressors are used in a wide variety of regression tasks, and many recent applications feature domains that are non-Euclidean manifolds or other metric spaces. In this paper, we examine formal consistency of GP regression on general metric spaces. Specifically, we consider a GP prior on an unknown real-valued function with a metric domain space and examine consistency of the resulting posterior distribution. If the kernel is continuous and the sequence of sampling points lies sufficiently dense, then the variance of the posterior GP is shown to converge to zero almost surely monotonically and in $L^p$ for all $p > 1$, uniformly on compact sets. Moreover, we prove that if the difference between the observed function and the mean function of the prior lies in the reproducing kernel Hilbert space of the prior’s kernel, then the posterior mean converges pointwise in $L^2$ to the unknown function, and, under an additional assumption on the kernel, uniformly on compacts in $L^1$. This paper provides an important step towards the theoretical legitimization of GP regression on manifolds and other non-Euclidean metric spaces.

ガウス過程(GP)回帰器は、さまざまな回帰タスクで使用されており、最近の多くのアプリケーションでは、非ユークリッド多様体または他の距離空間であるドメインが使用されています。この論文では、一般的な距離空間でのGP回帰の形式的な一貫性を調べます。具体的には、距離空間を定義域とする未知の実数値関数のGP事前分布を検討し、結果として得られる事後分布の一貫性を調べます。カーネルが連続しており、サンプリングポイントの列が十分に密である場合、事後GPの分散は、ほぼ確実に単調に、かつすべての$p > 1$に対して$L^p$で、コンパクト集合上で一様に0に収束することが示されます。さらに、観測された関数と事前分布の平均関数の差が事前分布のカーネルの再生カーネルヒルベルト空間にある場合、事後平均は$L^2$で未知の関数に点ごとに収束し、カーネルに関する追加の仮定の下では、$L^1$でコンパクト集合上一様に収束することを証明します。この論文では、多様体やその他の非ユークリッド距離空間におけるGP回帰の理論的正当化に向けた重要な一歩を提供します。

Quasi-Monte Carlo Quasi-Newton in Variational Bayes
変分ベイズにおける準モンテカルロ準ニュートン

Many machine learning problems optimize an objective that must be measured with noise. The primary method is a first order stochastic gradient descent using one or more Monte Carlo (MC) samples at each step. There are settings where ill-conditioning makes second order methods such as limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) more effective. We study the use of randomized quasi-Monte Carlo (RQMC) sampling for such problems. When MC sampling has a root mean squared error (RMSE) of $O(n^{-1/2})$ then RQMC has an RMSE of $o(n^{-1/2})$ that can be close to $O(n^{-3/2})$ in favorable settings. We prove that improved sampling accuracy translates directly to improved optimization. In our empirical investigations for variational Bayes, using RQMC with stochastic quasi-Newton method greatly speeds up the optimization, and sometimes finds a better parameter value than MC does.

多くの機械学習の問題は、ノイズで測定する必要がある目標を最適化します。主な方法は、各ステップで1つ以上のモンテカルロ(MC)サンプルを使用した1次確率的勾配降下法です。条件付けが悪いと、メモリが限られているBroyden-Fletcher-Goldfarb-Shanno(L-BFGS)などの二次法がより効果的になる設定があります。このような問題に対するランダム化準モンテカルロ(RQMC)サンプリングの使用について研究しています。MCサンプリングの二乗平均平方根誤差(RMSE)が$O(n^{-1/2})$の場合、RQMCのRMSEは$o(n^{-1/2})$で、好ましい設定では$O(n^{-3/2})$に近づくことができます。サンプリング精度の向上が最適化の改善に直結することを証明しています。変分ベイズに関する実証的調査では、RQMCと確率的準ニュートン法を使用すると、最適化が大幅に高速化され、MCよりも優れたパラメータ値が見つかることもあります。
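
The accuracy gap between MC and RQMC sampling described in the abstract above can be seen on a one-dimensional toy integral. The sketch below compares plain Monte Carlo with a randomly shifted van der Corput sequence (a Cranley-Patterson rotation). The integrand, sample size, number of replications, and function names are assumptions. This simple construction stands in for the scrambled point sets normally used for RQMC and says nothing about the quasi-Newton part of the paper.

```python
import math
import random


def van_der_corput(i, base=2):
    """i-th point of the base-b van der Corput low-discrepancy sequence."""
    q, denom = 0.0, 1.0
    while i > 0:
        denom *= base
        i, digit = divmod(i, base)
        q += digit / denom
    return q


def mc_estimate(f, n, rng):
    """Plain Monte Carlo average over n i.i.d. uniform points."""
    return sum(f(rng.random()) for _ in range(n)) / n


def rqmc_estimate(f, n, rng):
    """Average over a van der Corput point set with one shared random shift."""
    shift = rng.random()
    return sum(f((van_der_corput(i) + shift) % 1.0) for i in range(n)) / n


def rmse(estimator, f, n, truth, reps, rng):
    """Root mean squared error of an estimator over independent replications."""
    sq = sum((estimator(f, n, rng) - truth) ** 2 for _ in range(reps))
    return math.sqrt(sq / reps)


rng = random.Random(0)
truth = math.e - 1.0  # integral of exp(x) over [0, 1]
rmse_mc = rmse(mc_estimate, math.exp, 64, truth, 200, rng)
rmse_rqmc = rmse(rqmc_estimate, math.exp, 64, truth, 200, rng)
print(rmse_mc, rmse_rqmc)
```

Both estimators are unbiased here, so the RMSE difference reflects variance alone. On this smooth integrand the shifted low-discrepancy points give a visibly smaller error than i.i.d. sampling at the same `n`.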

Wasserstein distance estimates for the distributions of numerical approximations to ergodic stochastic differential equations
エルゴード確率微分方程式の数値近似分布に対するワッサーシュタイン距離推定

We present a framework that allows for the non-asymptotic study of the $2$-Wasserstein distance between the invariant distribution of an ergodic stochastic differential equation and the distribution of its numerical approximation in the strongly log-concave case. This allows us to study in a unified way a number of different integrators proposed in the literature for the overdamped and underdamped Langevin dynamics. In addition, we analyze a novel splitting method for the underdamped Langevin dynamics which only requires one gradient evaluation per time step. Under an additional smoothness assumption on a $d$–dimensional strongly log-concave distribution with condition number $\kappa$, the algorithm is shown to produce with an $\mathcal{O}\big(\kappa^{5/4} d^{1/4}\epsilon^{-1/2} \big)$ complexity samples from a distribution that, in Wasserstein distance, is at most $\epsilon>0$ away from the target distribution.

私たちは、強対数凹の場合に、エルゴード確率微分方程式の不変分布と、その数値近似の分布との間の$2$-Wasserstein距離の非漸近的研究を可能にするフレームワークを提示します。これにより、オーバーダンピングとアンダーダンピングのランジュバンダイナミクスについて文献で提案されている多くの異なる積分器を統一的な方法で研究することができます。さらに、時間ステップごとに1つの勾配評価のみを必要とするアンダーダンピングランジュバンダイナミクスの新しい分割法を解析します。条件数$\kappa$の$d$次元の強対数凹分布に対する追加の滑らかさの仮定の下で、アルゴリズムは、Wasserstein距離でターゲット分布から最大$\epsilon>0$離れている分布から、$\mathcal{O}\big(\kappa^{5/4} d^{1/4}\epsilon^{-1/2} \big)$の計算量でサンプルを生成することが示されています。

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks
深層学習のスパース性:ニューラルネットワークにおける効率的な推論と学習のための刈り込みと成長

The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well as, and sometimes even better than, the original dense networks. Sparsity promises to reduce the memory footprint of regular networks to fit mobile devices, as well as shorten training time for ever growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial of sparsification for both inference and training. We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice. Our work distills ideas from more than 300 research papers and provides guidance to practitioners who wish to utilize sparsity today, as well as to researchers whose goal is to push the frontier forward. We include the necessary background on mathematical methods in sparsification, describe phenomena such as early structure adaptation, the intricate relations between sparsity and the training process, and show techniques for achieving acceleration on real hardware. We also define a metric of pruned parameter efficiency that could serve as a baseline for comparison of different sparse networks. We close by speculating on how sparsity can improve future workloads and outline major open problems in the field.

ディープラーニングのエネルギーとパフォーマンスのコストが増大していることから、コミュニティは選択的にコンポーネントを削減することでニューラルネットワークのサイズを縮小するようになりました。生物学的な対応物と同様に、スパースネットワークは元の密なネットワークと同等、場合によってはそれ以上に一般化します。スパース性は、通常のネットワークのメモリフットプリントをモバイルデバイスに合わせて削減し、成長し続けるネットワークのトレーニング時間を短縮することを約束します。この論文では、ディープラーニングにおけるスパース性に関する以前の研究を調査し、推論とトレーニングの両方のためのスパース化の広範なチュートリアルを提供します。ニューラルネットワークの要素を削除および追加する方法、モデルのスパース性を実現するためのさまざまなトレーニング戦略、および実際にスパース性を活用するメカニズムについて説明します。私たちの研究は、300を超える研究論文からアイデアを抽出し、今日スパース性を利用したいと考えている実践者と、最先端を押し進めることを目標とする研究者にガイダンスを提供します。スパース化における数学的手法に関する必要な背景情報を含め、早期構造適応などの現象、スパース性とトレーニングプロセスとの複雑な関係について説明し、実際のハードウェアで高速化を実現する手法を示します。また、さまざまなスパースネットワークを比較するための基準として使用できる、プルーニングされたパラメーターの効率性の測定基準も定義します。最後に、スパース性が将来のワークロードをどのように改善できるかについて推測し、この分野における主要な未解決の問題を概説します。
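
One simple way to "remove elements" of a network, in the sense of the abstract above, is magnitude pruning: zero out the weights with the smallest absolute values. The sketch below applies it to a flat list of weights. The choice of magnitude as the pruning criterion, the function name, and the sample weights are assumptions for illustration, not a summary of the survey's recommendations.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    k = int(len(weights) * sparsity)  # number of weights to remove
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


# Hypothetical layer weights, pruned to 50% sparsity.
weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.4]
sparse = magnitude_prune(weights, 0.5)
print(sparse)  # [0.8, 0.0, 0.0, -1.2, 0.0, 0.4]
```

Zeroing values in a dense list saves nothing by itself. Actual memory or speed gains need a sparse storage format and hardware or kernels that exploit it, which is the part of the problem the abstract calls exploiting sparsity in practice.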

DIG: A Turnkey Library for Diving into Graph Deep Learning Research
DIG:グラフディープラーニング研究に飛び込むためのターンキーライブラリ

Although there exist several libraries for deep learning on graphs, they aim to implement basic operations for graph deep learning. In the research community, implementing and benchmarking various advanced tasks is still painful and time-consuming with existing libraries. To facilitate graph deep learning research, we introduce DIG: Dive into Graphs, a turnkey library that provides a unified testbed for higher level, research-oriented graph deep learning tasks. Currently, we consider graph generation, self-supervised learning on graphs, explainability of graph neural networks, and deep learning on 3D graphs. For each direction, we provide unified implementations of data interfaces, common algorithms, and evaluation metrics. Altogether, DIG is an extensible, open-source, and turnkey library for researchers to develop new methods and effortlessly compare with common baselines using widely used datasets and evaluation metrics. Source code is available at https://github.com/divelab/DIG.

グラフ上のディープラーニング用のライブラリはいくつか存在しますが、それらはグラフディープラーニングの基本的な操作を実装することを目的としています。研究コミュニティでは、既存のライブラリを使用してさまざまな高度なタスクを実装およびベンチマークすることは、依然として困難で時間がかかります。グラフディープラーニングの研究を促進するために、より高レベルの研究指向のグラフディープラーニングタスクのための統一されたテストベッドを提供するターンキーライブラリであるDIG: Dive into Graphsを紹介します。現在、グラフ生成、グラフ上の自己教師あり学習、グラフニューラルネットワークの説明可能性、および3Dグラフ上のディープラーニングを検討しています。それぞれの方向について、データインターフェイス、共通アルゴリズム、および評価メトリックの統一された実装を提供します。全体として、DIGは研究者が新しい方法を開発し、広く使用されているデータセットと評価メトリックを使用して一般的なベースラインと簡単に比較するための、拡張可能なオープンソースのターンキーライブラリです。ソースコードはhttps://github.com/divelab/DIGで入手できます。

Decentralized Stochastic Gradient Langevin Dynamics and Hamiltonian Monte Carlo
分散型確率勾配ランジュバンダイナミクスとハミルトンモンテカルロ

Stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Hamiltonian Monte Carlo (SGHMC) are two popular Markov Chain Monte Carlo (MCMC) algorithms for Bayesian inference that can scale to large datasets, making it possible to sample from the posterior distribution of the parameters of a statistical model given the input data and the prior distribution over the model parameters. However, these algorithms do not apply to the decentralized learning setting, when a network of agents is working collaboratively to learn the parameters of a statistical model without sharing their individual data due to privacy reasons or communication constraints. We study two algorithms: Decentralized SGLD (DE-SGLD) and Decentralized SGHMC (DE-SGHMC) which are adaptations of SGLD and SGHMC methods that allow scalable Bayesian inference in the decentralized setting for large datasets. We show that when the posterior distribution is strongly log-concave and smooth, the iterates of these algorithms converge linearly to a neighborhood of the target distribution in the 2-Wasserstein distance if their parameters are selected appropriately. We illustrate the efficiency of our algorithms on decentralized Bayesian linear regression and Bayesian logistic regression problems.

確率的勾配ランジュバン動力学(SGLD)と確率的勾配ハミルトンモンテカルロ(SGHMC)は、ベイズ推論用の2つの一般的なマルコフ連鎖モンテカルロ(MCMC)アルゴリズムであり、大規模なデータセットに拡張でき、入力データとモデルパラメータの事前分布が与えられた統計モデルのパラメータの事後分布からサンプリングできます。ただし、これらのアルゴリズムは、エージェントのネットワークがプライバシー上の理由や通信の制約により個々のデータを共有せずに統計モデルのパラメータを共同で学習する場合、分散学習の設定には適用されません。私たちは、大規模なデータセットの分散設定でスケーラブルなベイズ推論を可能にするSGLDおよびSGHMC法の適応である、分散SGLD (DE-SGLD)と分散SGHMC (DE-SGHMC)の2つのアルゴリズムを研究します。事後分布が強い対数凹で滑らかな場合、これらのアルゴリズムの反復は、パラメータが適切に選択されていれば、2-ワッサースタイン距離でターゲット分布の近傍に線形に収束することを示します。分散型ベイジアン線形回帰とベイジアンロジスティック回帰の問題におけるアルゴリズムの効率性を示します。

DeEPCA: Decentralized Exact PCA with Linear Convergence Rate
DeEPCA:線形収束率による分散型精密PCA

Due to the rapid growth of smart agents such as weakly connected computational nodes and sensors, developing decentralized algorithms that can perform computations on local agents becomes a major research direction. This paper considers the problem of decentralized principal components analysis (PCA), which is a statistical method widely used for data analysis. We introduce a technique called subspace tracking to reduce the communication cost, and apply it to power iterations. This leads to a decentralized PCA algorithm called DeEPCA, which has a convergence rate similar to that of the centralized PCA, while achieving the best communication complexity among existing decentralized PCA algorithms. DeEPCA is the first decentralized PCA algorithm with the number of communication rounds for each power iteration independent of target precision. Compared to existing algorithms, the proposed method is easier to tune in practice, with an improved overall communication cost. Our experiments validate the advantages of DeEPCA empirically.

弱く接続された計算ノードやセンサーなどのスマートエージェントの急速な成長により、ローカルエージェントで計算を実行できる分散アルゴリズムの開発が主要な研究方向になっています。この論文では、データ分析に広く使用されている統計的手法である分散主成分分析(PCA)の問題について検討します。通信コストを削減するためにサブスペーストラッキングと呼ばれる手法を導入し、それをべき乗反復に適用します。これにより、DeEPCAと呼ばれる分散PCAアルゴリズムが生まれ、集中型PCAと同様の収束率を持ちながら、既存の分散PCAアルゴリズムの中で最良の通信複雑度を達成しています。DeEPCAは、べき乗反復ごとの通信ラウンド数が目標精度に依存しない最初の分散PCAアルゴリズムです。既存のアルゴリズムと比較して、提案された方法は実際に調整しやすく、全体的な通信コストが改善されています。私たちの実験は、DeEPCAの利点を経験的に検証しています。

Consensus-Based Optimization on the Sphere: Convergence to Global Minimizers and Machine Learning
球面でのコンセンサスベースの最適化:グローバル最小化器への収束と機械学習

We investigate the implementation of a new stochastic Kuramoto-Vicsek-type model for global optimization of nonconvex functions on the sphere. This model belongs to the class of Consensus-Based Optimization. In fact, particles move on the sphere driven by a drift towards an instantaneous consensus point, which is computed as a convex combination of particle locations, weighted by the cost function according to Laplace’s principle, and it represents an approximation to a global minimizer. The dynamics is further perturbed by a random vector field to favor exploration, whose variance is a function of the distance of the particles to the consensus point. In particular, as soon as the consensus is reached the stochastic component vanishes. The main result of this paper is a proof that the numerical scheme converges to global minimizers, provided the initial datum is well prepared. The proof combines previous results of mean-field limit with a novel asymptotic analysis, and classical convergence results of numerical methods for SDE. We present several numerical experiments, which show that the algorithm proposed in the present paper scales well with the dimension and is extremely versatile. To quantify the performance of the new approach, we show that the algorithm is able to perform essentially as well as ad hoc state-of-the-art methods in challenging problems in signal processing and machine learning, namely the phase retrieval problem and the robust subspace detection.

私たちは、球面上の非凸関数のグローバル最適化のための新しい確率的Kuramoto-Vicsek型モデルの実装を調査します。このモデルは、コンセンサスベースの最適化のクラスに属します。実際、粒子は、ラプラスの原理に従ってコスト関数によって重み付けされた粒子位置の凸組み合わせとして計算される瞬間的なコンセンサスポイントに向かうドリフトによって球面上を移動し、グローバルミニマイザーの近似を表します。ダイナミクスは、探索を優先するためにランダムベクトルフィールドによってさらに摂動され、その分散は粒子からコンセンサスポイントまでの距離の関数です。特に、コンセンサスに達するとすぐに、確率的コンポーネントは消えます。この論文の主な結果は、初期データが適切に準備されているという条件が与えられた場合、数値スキームがグローバルミニマイザーに収束することの証明に関するものです。証明は、平均場限界の以前の結果と新しい漸近解析、およびSDEの数値手法の古典的な収束結果を組み合わせています。本論文で提案するアルゴリズムは次元に応じて適切にスケーリングされ、非常に汎用性が高いことを示す数値実験をいくつか紹介します。新しいアプローチのパフォーマンスを定量化するために、このアルゴリズムは、信号処理と機械学習における困難な問題、つまり位相回復問題と堅牢なサブスペース検出において、アドホックな最先端の方法と本質的に同等のパフォーマンスを発揮できることを示します。
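
As a rough illustration of the consensus point described in the abstract above, the following Python sketch computes a cost-weighted convex combination of particle positions and applies one simplified update. The function names, the Gibbs weights `exp(-alpha * f)`, the parameters `alpha`, `dt` and `sigma`, and the renormalization back onto the sphere are assumptions made for illustration only; this is not the numerical scheme analyzed in the paper.

```python
import numpy as np

def consensus_point(X, f, alpha):
    """Convex combination of particle positions with weights exp(-alpha * f(x))."""
    costs = np.array([f(x) for x in X])
    # Shift by the minimum cost before exponentiating for numerical stability;
    # the shift cancels in the normalized weights.
    w = np.exp(-alpha * (costs - costs.min()))
    return (w[:, None] * X).sum(axis=0) / w.sum()

def simplified_step(X, f, alpha, dt, sigma, rng):
    """One illustrative update (hypothetical, not the paper's scheme):
    drift toward the consensus point, add noise whose scale is the distance
    to the consensus point, then renormalize each particle to unit length."""
    v = consensus_point(X, f, alpha)
    diff = X - v
    dist = np.linalg.norm(diff, axis=1, keepdims=True)
    noise = rng.standard_normal(X.shape)
    X_new = X - dt * diff + sigma * np.sqrt(dt) * dist * noise
    return X_new / np.linalg.norm(X_new, axis=1, keepdims=True)
```

Because the noise is multiplied by the distance to the consensus point, the random term vanishes once all particles coincide, which mirrors the behaviour stated in the abstract.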

Expanding Boundaries of Gap Safe Screening
ギャップセーフスクリーニングの領域拡大

Sparse optimization problems are ubiquitous in many fields such as statistics, signal/image processing and machine learning. This has led to the birth of many iterative algorithms to solve them. A powerful strategy to boost the performance of these algorithms is known as safe screening: it allows the early identification of zero coordinates in the solution, which can then be eliminated to reduce the problem’s size and accelerate convergence. In this work, we extend the existing Gap Safe screening framework by relaxing the global strong-concavity assumption on the dual cost function. Instead, we exploit local regularity properties, that is, strong concavity on well-chosen subsets of the domain. The non-negativity constraint is also integrated into the existing framework. Besides making safe screening possible for a broader class of functions that includes $\beta$-divergences (e.g., the Kullback-Leibler divergence), the proposed approach also improves upon the existing Gap Safe screening rules on previously applicable cases (e.g., logistic regression). The proposed general framework is exemplified by some notable particular cases: logistic function, $\beta=1.5$ and Kullback-Leibler divergences. Finally, we showcase the effectiveness of the proposed screening rules with different solvers (coordinate descent, multiplicative-update and proximal gradient algorithms) and different datasets (binary classification, hyperspectral and count data).

スパース最適化問題は、統計、信号/画像処理、機械学習など、多くの分野で広く見られます。このため、これらの問題を解くための反復アルゴリズムが数多く生まれました。これらのアルゴリズムのパフォーマンスを向上させる強力な戦略は、セーフスクリーニングとして知られています。セーフスクリーニングでは、ソリューション内のゼロ座標を早期に特定し、それを排除することで問題のサイズを縮小し、収束を加速することができます。この研究では、既存のGap Safeスクリーニングフレームワークを拡張し、デュアルコスト関数のグローバルな強い凹面の仮定を緩和します。代わりに、ローカルな規則性プロパティ、つまり、ドメインの適切に選択されたサブセットの強い凹面を利用します。非負性制約も既存のフレームワークに統合されています。$\beta$ダイバージェンス(Kullback-Leiblerダイバージェンスなど)を含むより広範な関数クラスに対してセーフスクリーニングを可能にするほか、提案されたアプローチでは、以前に適用可能だったケース(ロジスティック回帰など)に対する既存のGap Safeスクリーニングルールも改善されます。提案された一般的なフレームワークは、ロジスティック関数、$\beta=1.5$、Kullback-Leiblerダイバージェンスなどのいくつかの注目すべき特定のケースによって例示されています。最後に、さまざまなソルバー(座標降下法、乗法更新法、近似勾配アルゴリズム)とさまざまなデータセット(バイナリ分類、ハイパースペクトル、カウントデータ)を使用して、提案されたスクリーニングルールの有効性を示します。

GIBBON: General-purpose Information-Based Bayesian Optimisation
GIBBON:汎用情報ベースのベイズ最適化

This paper describes a general-purpose extension of max-value entropy search, a popular approach for Bayesian Optimisation (BO). A novel approximation is proposed for the information gain — an information-theoretic quantity central to solving a range of BO problems, including noisy, multi-fidelity and batch optimisations across both continuous and highly-structured discrete spaces. Previously, these problems have been tackled separately within information-theoretic BO, each requiring a different sophisticated approximation scheme, except for batch BO, for which no computationally-lightweight information-theoretic approach has previously been proposed. GIBBON (General-purpose Information-Based Bayesian OptimisatioN) provides a single principled framework suitable for all the above, out-performing existing approaches whilst incurring substantially lower computational overheads. In addition, GIBBON does not require the problem’s search space to be Euclidean and so is the first high-performance yet computationally light-weight acquisition function that supports batch BO over general highly structured input spaces like molecular search and gene design. Moreover, our principled derivation of GIBBON yields a natural interpretation of a popular batch BO heuristic based on determinantal point processes. Finally, we analyse GIBBON across a suite of synthetic benchmark tasks, a molecular search loop, and as part of a challenging batch multi-fidelity framework for problems with controllable experimental noise.

この論文では、ベイズ最適化(BO)の一般的なアプローチである最大値エントロピー検索の汎用拡張について説明します。情報ゲインの新しい近似が提案されています。情報ゲインは、連続空間と高度に構造化された離散空間の両方にわたるノイズの多い、マルチフィデリティ、バッチ最適化など、さまざまなBO問題を解決する上で中心的な情報理論的量です。これまで、これらの問題は情報理論的BO内で個別に取り組まれており、それぞれに異なる高度な近似スキームが必要でしたが、バッチBOについては、これまで計算量が少ない情報理論的アプローチは提案されていませんでした。GIBBON (汎用情報ベースのベイズ最適化)は、上記のすべてに適した単一の原理的フレームワークを提供し、既存のアプローチよりもパフォーマンスが高く、計算オーバーヘッドが大幅に少なくなります。さらに、GIBBONは問題の検索空間がユークリッドである必要がないため、分子検索や遺伝子設計などの一般的な高度に構造化された入力空間でバッチBOをサポートする、高性能でありながら計算量が少ない最初の獲得関数です。さらに、GIBBONの原理的な導出により、行列式点過程に基づく一般的なバッチBOヒューリスティックの自然な解釈が得られます。最後に、一連の合成ベンチマークタスク、分子検索ループ、および制御可能な実験ノイズの問題に対する困難なバッチマルチフィデリティフレームワークの一部として、GIBBONを分析します。

A general linear-time inference method for Gaussian Processes on one dimension
一次元上のガウス過程の一般的な線形時間推論法

Gaussian Processes (GPs) provide powerful probabilistic frameworks for interpolation, forecasting, and smoothing, but have been hampered by computational scaling issues. Here we investigate data sampled on one dimension (e.g., a scalar or vector time series sampled at arbitrarily-spaced intervals), for which state-space models are popular due to their linearly-scaling computational costs. It has long been conjectured that state-space models are general, able to approximate any one-dimensional GP. We provide the first general proof of this conjecture, showing that any stationary GP on one dimension with vector-valued observations governed by a Lebesgue-integrable continuous kernel can be approximated to any desired precision using a specifically-chosen state-space model: the Latent Exponentially Generated (LEG) family. This new family offers several advantages compared to the general state-space model: it is always stable (no unbounded growth), the covariance can be computed in closed form, and its parameter space is unconstrained (allowing straightforward estimation via gradient descent). The theorem’s proof also draws connections to Spectral Mixture Kernels, providing insight about this popular family of kernels. We develop parallelized algorithms for performing inference and learning in the LEG model, test the algorithm on real and synthetic data, and demonstrate scaling to datasets with billions of samples.

ガウス過程(GP)は、補間、予測、平滑化のための強力な確率的フレームワークを提供しますが、計算スケーリングの問題によって妨げられてきました。ここでは、1次元でサンプリングされたデータ(任意の間隔でサンプリングされたスカラーまたはベクトルの時系列など)を調査します。これらのデータでは、線形スケーリングの計算コストのため、状態空間モデルが人気があります。状態空間モデルは汎用性があり、任意の1次元GPを近似できると長い間推測されてきました。私たちはこの推測の最初の一般的な証明を提供し、ルベーグ積分可能な連続カーネルによって制御されるベクトル値の観測値を持つ1次元の任意の定常GPは、特別に選択された状態空間モデル、潜在指数生成(LEG)ファミリーを使用して、任意の精度で近似できることを示します。この新しいファミリーは、一般的な状態空間モデルと比較して、いくつかの利点があります。常に安定しており(無制限の成長がない)、共分散は閉じた形式で計算でき、パラメーター空間は制約がありません(勾配降下法による簡単な推定が可能)。定理の証明はスペクトル混合カーネルとの関連も示しており、この人気のあるカーネルファミリーについての洞察を提供しています。LEGモデルで推論と学習を実行するための並列化アルゴリズムを開発し、実際のデータと合成データでアルゴリズムをテストし、数十億のサンプルを持つデータセットへのスケーリングを実証します。

A Generalised Linear Model Framework for β-Variational Autoencoders based on Exponential Dispersion Families
指数分散族に基づくβ変分自己符号化器のための一般化線形モデルフレームワーク

Although variational autoencoders (VAE) are successfully used to obtain meaningful low-dimensional representations for high-dimensional data, the characterization of critical points of the loss function for general observation models is not fully understood. We introduce a theoretical framework that is based on a connection between β-VAE and generalized linear models (GLM). The equality between the activation function of a β-VAE and the inverse of the link function of a GLM enables us to provide a systematic generalization of the loss analysis for β-VAE based on the assumption that the observation model distribution belongs to an exponential dispersion family (EDF). As a result, we can initialize β-VAE nets by maximum likelihood estimates (MLE) that enhance the training performance on both synthetic and real world data sets. As a further consequence, we analytically describe the auto-pruning property inherent in the β-VAE objective and reason for posterior collapse.

変分オートエンコーダ(VAE)は、高次元データの意味のある低次元表現を取得するためにうまく使用されていますが、一般的な観測モデルの損失関数の臨界点の特性評価は完全には理解されていません。この論文では、β-VAEと一般化線形モデル(GLM)との接続に基づく理論的枠組みを紹介します。β-VAEの活性化関数とGLMのリンク関数の逆関数とが等しいことから、観測モデルの分布が指数分散ファミリー(EDF)に属すると仮定したβ-VAEの損失解析を系統的に一般化することができます。その結果、β-VAEネットを最尤推定値(MLE)で初期化し、合成データセットと実世界データセットの両方で学習パフォーマンスを向上させることができます。さらなる結果として、β-VAEの目的関数に固有の自動剪定特性と事後崩壊の理由を分析的に説明します。

Probabilistic Iterative Methods for Linear Systems
線形システムのための確率的反復法

This paper presents a probabilistic perspective on iterative methods for approximating the solution $\mathbf{x} \in \mathbb{R}^d$ of a nonsingular linear system $\mathbf{A} \mathbf{x} = \mathbf{b}$. Classically, an iterative method produces a sequence $\mathbf{x}_m$ of approximations that converge to $\mathbf{x}$ in $\mathbb{R}^d$. Our approach, instead, lifts a standard iterative method to act on the set of probability distributions, $\mathcal{P}(\mathbb{R}^d)$, outputting a sequence of probability distributions $\mu_m \in \mathcal{P}(\mathbb{R}^d)$. The output of a probabilistic iterative method can provide both a “best guess” for $\mathbf{x}$, for example by taking the mean of $\mu_m$, and also probabilistic uncertainty quantification for the value of $\mathbf{x}$ when it has not been exactly determined. A comprehensive theoretical treatment is presented in the case of a stationary linear iterative method, where we characterise both the rate of contraction of $\mu_m$ to an atomic measure on $\mathbf{x}$ and the nature of the uncertainty quantification being provided. We conclude with an empirical illustration that highlights the potential for probabilistic iterative methods to provide insight into solution uncertainty.

この論文では、非特異線形システム$\mathbf{A} \mathbf{x} = \mathbf{b}$の解$\mathbf{x} \in \mathbb{R}^d$を近似するための反復法に関する確率論的観点を提示します。従来、反復法は、$\mathbb{R}^d$で$\mathbf{x}$に収束する近似値のシーケンス$\mathbf{x}_m$を生成します。代わりに、私たちのアプローチでは、標準的な反復法を確率分布のセット$\mathcal{P}(\mathbb{R}^d)$に作用させて、確率分布のシーケンス$\mu_m \in \mathcal{P}(\mathbb{R}^d)$を出力します。確率的反復法の出力は、たとえば$\mu_m$の平均を取ることによって$\mathbf{x}$の「最善の推測」と、$\mathbf{x}$の値が正確に決定されていない場合の確率的不確実性の定量化の両方を提供できます。定常線形反復法の場合の包括的な理論的処理が提示され、$\mu_m$から$\mathbf{x}$上の原子測度への縮小率と、提供される不確実性定量化の性質の両方を特徴付けます。最後に、確率的反復法が解の不確実性に関する洞察を提供する可能性を強調する実証的な例を示します。
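
To make the idea of lifting a stationary linear iterative method to probability distributions more concrete, here is a minimal Python sketch. It assumes a Gaussian initial distribution $\mu_0 = \mathcal{N}(m_0, \Sigma_0)$ and uses Richardson iteration $\mathbf{x}_{m+1} = G \mathbf{x}_m + f$ with $G = I - \omega \mathbf{A}$ and $f = \omega \mathbf{b}$; both choices, and the names `lift_richardson` and `omega`, are illustrative assumptions rather than the paper's setup. Pushing a Gaussian through an affine map gives another Gaussian, so only the mean and covariance need to be tracked.

```python
import numpy as np

def lift_richardson(A, b, mean0, cov0, omega, n_iter):
    """Propagate N(mean0, cov0) through x -> G x + f, with G = I - omega*A, f = omega*b.

    Returns the mean and covariance of the distribution after n_iter steps.
    """
    d = A.shape[0]
    G = np.eye(d) - omega * A
    f = omega * b
    mean, cov = mean0.copy(), cov0.copy()
    for _ in range(n_iter):
        mean = G @ mean + f      # mean follows the classical iteration
        cov = G @ cov @ G.T      # covariance contracts when the spectral radius of G is < 1
    return mean, cov
```

When the spectral radius of $G$ is below one, the mean converges to the solution and the covariance shrinks to zero, which corresponds to the contraction of $\mu_m$ to an atomic measure on $\mathbf{x}$ described in the abstract.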

sklvq: Scikit Learning Vector Quantization
sklvq:Scikit学習ベクトル量子化

The sklvq package is an open-source Python implementation of a set of learning vector quantization (LVQ) algorithms. In addition to providing the core functionality for the GLVQ, GMLVQ, and LGMLVQ algorithms, sklvq is distinctive in its emphasis on a modular and customizable design. This not only results in a feature-rich implementation for users but also enables easy extensions of the algorithms for researchers. The theory behind this design is described in this paper. To facilitate adoption and inspire future contributions, sklvq is publicly available on GitHub (under the BSD license) and can be installed through the Python package index (PyPI). In addition to being well covered by automated testing to ensure code quality, it is accompanied by detailed online documentation. The documentation covers usage examples and provides an in-depth API reference including theory and scientific references.

sklvqパッケージは、一連の学習ベクトル量子化(LVQ)アルゴリズムのオープンソースのPython実装です。sklvqは、GLVQ、GMLVQ、およびLGMLVQアルゴリズムのコア機能を提供するだけでなく、モジュール式でカスタマイズ可能な設計に重点を置いていることが特徴です。その結果、ユーザーにとって機能豊富な実装が実現するだけでなく、研究者にとってもアルゴリズムの容易な拡張が可能になります。この設計の背後にある理論については、本論文で説明します。採用を促進し、将来の貢献を促すために、sklvqはGitHubで公開されており(BSDライセンスの下で)、Pythonパッケージインデックス(PyPI)を通じてインストールできます。コードの品質を確保するための自動テストで十分にカバーされているだけでなく、詳細なオンラインドキュメントが付属しています。このドキュメントでは、使用例を取り上げ、理論や科学的な参考文献を含む詳細なAPIリファレンスを提供します。

Learning with semi-definite programming: statistical bounds based on fixed point analysis and excess risk curvature
半定値計画法による学習:不動点解析と過剰リスク曲率に基づく統計的限界

Many statistical learning problems have recently been shown to be amenable to Semi-Definite Programming (SDP), with community detection and clustering in Gaussian mixture models as the most striking instances (Javanmard et al., 2016). Given the growing range of applications of SDP-based techniques to machine learning problems, and the rapid progress in the design of efficient algorithms for solving SDPs, an intriguing question is to understand how the recent advances from empirical process theory and Statistical Learning Theory can be leveraged for providing a precise statistical analysis of SDP estimators. In the present paper, we borrow cutting edge techniques and concepts from the Learning Theory literature, such as fixed point equations and excess risk curvature arguments, which yield general estimation and prediction results for a wide class of SDP estimators. From this perspective, we revisit some classical results in community detection from Guédon and Vershynin (2016) and Fei and Chen (2019), and we obtain statistical guarantees for SDP estimators used in signed clustering, angular group synchronization (for both multiplicative and additive models) and MAX-CUT. Our theoretical findings are complemented by numerical experiments for each of the three problems considered, showcasing the competitiveness of the SDP estimators.

最近、多くの統計学習問題が半正定値計画法(SDP)に適応可能であることが示されており、最も顕著な例としては、ガウス混合モデルにおけるコミュニティ検出とクラスタリングが挙げられます(Javanmardら、2016年)。機械学習問題へのSDPベースの手法の応用範囲が拡大し、SDPを解決するための効率的なアルゴリズムの設計が急速に進歩していることを考えると、経験的プロセス理論と統計学習理論の最近の進歩をどのように活用してSDP推定量の正確な統計分析を提供できるかを理解することは興味深い問題です。この論文では、不動点方程式や過剰リスク曲率の議論など、学習理論の文献から最先端の手法と概念を借用し、幅広いクラスのSDP推定量に対して一般的な推定と予測の結果を生み出します。この観点から、GuédonとVershynin (2016)およびFeiとChen (2019)によるコミュニティ検出の古典的な結果を再検討し、符号付きクラスタリング、角度グループ同期(乗法モデルと加法モデルの両方)、およびMAX-CUTで使用されるSDP推定量の統計的保証を取得します。私たちの理論的発見は、検討した3つの問題それぞれに対する数値実験によって補完され、SDP推定量の競争力を示しています。

Convolutional Neural Networks Are Not Invariant to Translation, but They Can Learn to Be
畳み込みニューラルネットワークは平行移動に対して不変ではありませんが、不変になることを学習できます

When seeing a new object, humans can immediately recognize it across different retinal locations: the internal object representation is invariant to translation. It is commonly believed that Convolutional Neural Networks (CNNs) are architecturally invariant to translation thanks to the convolution and/or pooling operations they are endowed with. In fact, several studies have found that these networks systematically fail to recognise new objects on untrained locations. In this work, we test a wide variety of CNN architectures showing how, apart from DenseNet-121, none of the models tested was architecturally invariant to translation. Nevertheless, all of them could learn to be invariant to translation. We show how this can be achieved by pretraining on ImageNet, and it is sometimes possible with much simpler data sets when all the items are fully translated across the input canvas. At the same time, this invariance can be disrupted by further training due to catastrophic forgetting/interference. These experiments show how pretraining a network on an environment with the right ‘latent’ characteristics (a more naturalistic environment) can result in the network learning deep perceptual rules which would dramatically improve subsequent generalization.

新しい物体を見ると、人間は網膜上の異なる場所からでもそれを即座に認識できます。つまり、物体の内部表現は平行移動に対して不変なのです。畳み込みニューラルネットワーク(CNN)は、畳み込みやプーリング操作が備わっているため、平行移動に対して構造的に不変であると一般に考えられています。実際、いくつかの研究では、これらのネットワークは訓練されていない場所にある新しい物体を体系的に認識できないことがわかっています。この研究では、さまざまなCNNアーキテクチャをテストし、DenseNet-121を除いて、テストしたモデルのどれも平行移動に対して構造的に不変ではなかったことを示しています。それでも、それらはすべて平行移動に対して不変であることを学習できました。ImageNetでの事前トレーニングによってこれを実現する方法を示します。また、すべてのアイテムが入力キャンバス上で完全に平行移動されている場合、はるかに単純なデータセットでこれが可能な場合もあります。同時に、この不変性は、壊滅的な忘却/干渉により、さらなるトレーニングによって破壊される可能性があります。これらの実験は、適切な「潜在的な」特性を持つ環境（より自然な環境）でネットワークを事前トレーニングすると、ネットワークが深い知覚ルールを学習し、その後の一般化が劇的に改善されることを示しています。

How Well Generative Adversarial Networks Learn Distributions
敵対的生成ネットワークが分布をどの程度学習するか

This paper studies the rates of convergence for learning distributions implicitly with the adversarial framework and Generative Adversarial Networks (GANs), which subsume Wasserstein, Sobolev, MMD GAN, and Generalized/Simulated Method of Moments (GMM/SMM) as special cases. We study a wide range of parametric and nonparametric target distributions under a host of objective evaluation metrics. We investigate how to obtain valid statistical guarantees for GANs through the lens of regularization. On the nonparametric end, we derive the optimal minimax rates for distribution estimation under the adversarial framework. On the parametric end, we establish a theory for general neural network classes (including deep leaky ReLU networks) that characterizes the interplay on the choice of generator and discriminator pair. We discover and isolate a new notion of regularization, called the generator-discriminator-pair regularization, that sheds light on the advantage of GANs compared to classical parametric and nonparametric approaches for explicit distribution estimation. We develop novel oracle inequalities as the main technical tools for analyzing GANs, which are of independent interest.

この論文では、ワッサーシュタイン、ソボレフ、MMD GAN、一般化/シミュレートモーメント法(GMM/SMM)を特別なケースとして包含する敵対的フレームワークと生成的敵対的ネットワーク(GAN)を使用して、分布を暗黙的に学習する場合の収束率について検討します。さまざまな客観的評価基準に基づいて、パラメトリックおよびノンパラメトリックのターゲット分布を幅広く検討します。正則化の観点から、GANの有効な統計的保証を得る方法を調べます。ノンパラメトリック側では、敵対的フレームワーク下での分布推定の最適なミニマックス率を導出します。パラメトリック側では、ジェネレータとディスクリミネータのペアの選択による相互作用を特徴付ける一般的なニューラルネットワーククラス(ディープリーキーReLUネットワークを含む)の理論を確立します。私たちは、生成器-識別器-ペア正則化と呼ばれる正則化の新しい概念を発見し、分離しました。これは、明示的な分布推定に対する従来のパラメトリックおよびノンパラメトリック手法と比較したGANの利点を明らかにします。私たちは、GANを分析するための主要な技術ツールとして、それ自体でも興味深い新しいオラクル不等式を開発しました。

Tighter Risk Certificates for Neural Networks
ニューラルネットワークのためのよりタイトなリスク証明書

This paper presents an empirical study regarding training probabilistic neural networks using training objectives derived from PAC-Bayes bounds. In the context of probabilistic neural networks, the output of training is a probability distribution over network weights. We present two training objectives, used here for the first time in connection with training neural networks. These two training objectives are derived from tight PAC-Bayes bounds. We also re-implement a previously used training objective based on a classical PAC-Bayes bound, to compare the properties of the predictors learned using the different training objectives. We compute risk certificates for the learnt predictors, based on part of the data used to learn the predictors. We further experiment with different types of priors on the weights (both data-free and data-dependent priors) and neural network architectures. Our experiments on MNIST and CIFAR-10 show that our training methods produce competitive test set errors and non-vacuous risk bounds with much tighter values than previous results in the literature, showing promise not only to guide the learning algorithm through bounding the risk but also for model selection. These observations suggest that the methods studied here might be good candidates for self-certified learning, in the sense of using the whole data set for learning a predictor and certifying its risk on any unseen data (from the same distribution as the training data) potentially without the need for holding out test data.

この論文では、PAC-Bayes境界から導出されたトレーニング目標を使用した確率的ニューラルネットワークのトレーニングに関する実証的研究を紹介します。確率的ニューラルネットワークのコンテキストでは、トレーニングの出力はネットワークの重みに対する確率分布です。ここでは、ニューラルネットワークのトレーニングに関連して初めて使用される2つのトレーニング目標を示します。これらの2つのトレーニング目標は、厳密なPAC-Bayes境界から導出されます。また、従来のPAC-Bayes境界に基づいて以前に使用したトレーニング目標を再実装し、異なるトレーニング目標を使用して学習された予測子の特性を比較します。予測子の学習に使用されたデータの一部に基づいて、学習された予測子のリスク証明書を計算します。さらに、重みに対するさまざまな種類の事前確率(データフリー事前確率とデータ依存事前確率の両方)とニューラルネットワークアーキテクチャを試します。MNISTとCIFAR-10での実験では、私たちのトレーニング方法が、文献の以前の結果よりもはるかに厳しい値で競争力のあるテストセットエラーと空虚でないリスク境界を生成することが示されており、リスクの境界設定を通じて学習アルゴリズムを導くだけでなく、モデル選択にも有望であることが示されています。これらの観察は、ここで研究された方法が、予測子を学習し、テストデータを保留する必要なしに、潜在的にすべての未確認データ(トレーニングデータと同じ分布から)に対するリスクを認証するためにデータセット全体を使用するという意味で、自己認証学習の優れた候補になる可能性があることを示唆しています。

FATE: An Industrial Grade Platform for Collaborative Learning With Data Protection
FATE:データ保護を備えた共同学習のための産業グレードのプラットフォーム

Collaborative and federated learning has become an emerging solution to many industrial applications where data values from different sites are exploited jointly with privacy protection. We introduce FATE, an industrial-grade project that supports enterprises and institutions to build machine learning models collaboratively at large-scale in a distributed manner. FATE supports a variety of secure computation protocols and machine learning algorithms, and features out-of-box usability with end-to-end building modules and visualization tools. Documentation is available at https://github.com/FederatedAI/FATE. Case studies and other information are available at https://www.fedai.org.

協調学習と連合学習は、さまざまなサイトからのデータ値がプライバシー保護とともに活用される多くの産業用アプリケーションにとって新たなソリューションとなっています。企業や機関が大規模に分散して機械学習モデルを共同で構築することを支援する産業グレードのプロジェクトであるFATEを紹介します。FATEは、さまざまな安全な計算プロトコルと機械学習アルゴリズムをサポートし、エンドツーエンドのビルディングモジュールと視覚化ツールですぐに使用できる機能を備えています。ドキュメントはhttps://github.com/FederatedAI/FATEで入手できます。ケーススタディやその他の情報は、https://www.fedai.orgで入手できます。

Representer Theorems in Banach Spaces: Minimum Norm Interpolation, Regularized Learning and Semi-Discrete Inverse Problems
バナッハ空間における表現定理:最小ノルム補間,正則化学習および半離散逆問題

Learning a function from a finite number of sampled data points (measurements) is a fundamental problem in science and engineering. This is often formulated as a minimum norm interpolation (MNI) problem, a regularized learning problem or, in general, a semi-discrete inverse problem (SDIP), in either Hilbert spaces or Banach spaces. The goal of this paper is to systematically study solutions of these problems in Banach spaces. We aim at obtaining explicit representer theorems for their solutions, on which convenient solution methods can then be developed. For the MNI problem, the explicit representer theorems enable us to express the infimum in terms of the norm of the linear combination of the interpolation functionals. For the purpose of developing efficient computational algorithms, we establish the fixed-point equation formulation of solutions of these problems. We reveal that unlike in a Hilbert space, in general, solutions of these problems in a Banach space may not be able to be reduced to truly finite dimensional problems (with certain infinite dimensional components hidden). We demonstrate how this obstacle can be removed, reducing the original problem to a truly finite dimensional one, in the special case when the Banach space is $\ell_1(\mathbb{N})$.

有限個のサンプリングされたデータポイント(測定)から関数を学習することは、科学と工学における基本的な問題です。これは、ヒルベルト空間またはバナッハ空間のいずれかで、最小ノルム補間(MNI)問題、正則化学習問題、または一般に半離散逆問題(SDIP)として定式化されることがよくあります。この論文の目的は、バナッハ空間におけるこれらの問題の解を体系的に研究することです。私たちは、それらの解に対する明示的な表現定理を取得し、それに基づいて便利な解法を開発することを目指しています。MNI問題の場合、明示的な表現定理により、補間汎関数の線形結合のノルムで下限を表現できます。効率的な計算アルゴリズムを開発するために、これらの問題の解の不動点方程式による定式化を確立します。ヒルベルト空間とは異なり、一般に、バナッハ空間におけるこれらの問題の解は、真に有限次元の問題(特定の無限次元コンポーネントが隠されている)に縮小できない可能性があることを明らかにします。バナッハ空間が$\ell_1(\mathbb{N})$である特別なケースにおいて、この障害を取り除き、元の問題を真に有限次元の問題に縮小する方法を示します。

Bayesian Distance Clustering
ベイジアン距離クラスタリング

Model-based clustering is widely used in a variety of application areas. However, fundamental concerns remain about robustness. In particular, results can be sensitive to the choice of kernel representing the within-cluster data density. Leveraging properties of pairwise differences between data points, we propose a class of Bayesian distance clustering methods, which rely on modeling the likelihood of the pairwise distances in place of the original data. Although some information in the data is discarded, we gain substantial robustness to modeling assumptions. The proposed approach represents an appealing middle ground between distance- and model-based clustering, drawing advantages from each of these canonical approaches. We illustrate dramatic gains in the ability to infer clusters that are not well represented by the usual choices of kernel. A simulation study is included to assess performance relative to competitors, and we apply the approach to clustering of brain genome expression data.

モデルベースのクラスタリングは、さまざまなアプリケーション分野で広く使用されています。ただし、堅牢性に関する基本的な懸念が残っています。特に、結果は、クラスター内のデータ密度を表すカーネルの選択に左右される可能性があります。データポイント間のペアワイズ差の特性を活用して、元のデータの代わりにペアワイズ距離の尤度をモデル化することに依存するベイズ距離クラスタリング手法のクラスを提案します。データ内の一部の情報は破棄されますが、モデリングの仮定に対する堅牢性が大幅に向上します。提案されたアプローチは、距離ベースクラスタリングとモデルベースクラスタリングの魅力的な中間点を表し、これらの標準的なアプローチのそれぞれの利点を活用しています。通常のカーネルの選択では十分に表現されないクラスターを推測する能力が大幅に向上することを示します。競合手法と比較してパフォーマンスを評価するためのシミュレーション研究が含まれており、このアプローチを脳ゲノム発現データのクラスタリングに適用します。

Stochastic Online Optimization using Kalman Recursion
カルマン再帰を用いた確率的オンライン最適化

We study the Extended Kalman Filter in constant dynamics, offering a Bayesian perspective of stochastic optimization. For generalized linear models, we obtain high probability bounds on the cumulative excess risk in an unconstrained setting, under the assumption that the algorithm reaches a local phase. In order to avoid any projection step we propose a two-phase analysis. First, for linear and logistic regressions, we prove that the algorithm enters a local phase where the estimate stays in a small region around the optimum. We provide explicit bounds with high probability on this convergence time, slightly modifying the Extended Kalman Filter in the logistic setting. Second, for generalized linear regressions, we provide a martingale analysis of the excess risk in the local phase, improving existing ones in bounded stochastic optimization. The algorithm appears as a parameter-free online procedure that optimally solves some unconstrained optimization problems.

私たちは、拡張カルマンフィルターを一定のダイナミクスで研究し、確率的最適化のベイズ的視点を提供します。一般化線形モデルの場合、アルゴリズムがローカルフェーズに到達すると仮定して、制約のない設定で累積超過リスクの高確率範囲を取得します。射影ステップを回避するために、2フェーズ分析を提案します。まず、線形回帰とロジスティック回帰の場合、アルゴリズムが推定値が最適値の周りの小さな領域にとどまる局所フェーズに入ることを証明します。この収束時間に高い確率で明示的な境界を提供し、ロジスティック設定の拡張カルマンフィルターをわずかに変更します。次に、一般化線形回帰の場合、局所フェーズの過剰リスクのマーチンゲール分析を提供し、有界確率最適化における既存の解析を改善します。このアルゴリズムは、一部の制約のない最適化問題を最適に解くパラメーターフリーのオンライン手続きとして位置づけられます。

Classification vs regression in overparameterized regimes: Does the loss function matter?
過剰パラメータ化されたレジームにおける分類と回帰:損失関数は重要か?

We compare classification and regression tasks in an overparameterized linear model with Gaussian features. On the one hand, we show that with sufficient overparameterization all training points are support vectors: solutions obtained by least-squares minimum-norm interpolation, typically used for regression, are identical to those produced by the hard-margin support vector machine (SVM) that minimizes the hinge loss, typically used for training classifiers. On the other hand, we show that there exist regimes where these interpolating solutions generalize well when evaluated by the 0-1 test loss function, but do not generalize if evaluated by the square loss function, i.e. they approach the null risk. Our results demonstrate the very different roles and properties of loss functions used at the training phase (optimization) and the testing phase (generalization).

私たちは、過剰パラメータ化された線形モデルとガウスの特徴で分類タスクと回帰タスクを比較します。一方では、十分なオーバーパラメータ化により、すべての学習ポイントがサポートベクトルであることを示します:最小二乗最小ノルム補間によって得られた解は、通常は回帰に使用され、ヒンジ損失を最小化するハードマージンサポートベクトルマシン(SVM)によって生成されたものと同一であり、通常は分類器の訓練に使用されます。一方、これらの補間解は、0-1テスト損失関数で評価するとよく一般化されるが、二乗損失関数で評価されると一般化しない、つまりヌルリスクに近づくという領域が存在することを示します。私たちの結果は、トレーニングフェーズ(最適化)とテストフェーズ(一般化)で使用される損失関数の役割と特性が非常に異なることを示しています。
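
The claim that, with enough overparameterization, the least-squares minimum-norm interpolator coincides with the hard-margin SVM can be checked numerically on a small example. The sketch below is an illustration under stated assumptions (isotropic Gaussian features, random ±1 labels, and the hypothetical helper names `min_norm_interpolator` and `all_support_vectors`). It relies on a standard KKT argument: writing the interpolator as $w = X^\top \beta$ with $\beta = (X X^\top)^{-1} y$, it solves the hard-margin SVM problem when $y_i \beta_i > 0$ for every $i$, in which case every training point is a support vector.

```python
import numpy as np

def min_norm_interpolator(X, y):
    """Minimum-norm solution of X w = y for a wide matrix X (n < d)."""
    beta = np.linalg.solve(X @ X.T, y)   # dual coefficients
    w = X.T @ beta                       # primal weights in the row space of X
    return w, beta

def all_support_vectors(beta, y):
    """KKT check: interpolator equals hard-margin SVM if every y_i * beta_i > 0."""
    return bool(np.all(y * beta > 0))

# Illustrative setting (assumed values): n = 20 points, d = 5000 Gaussian features.
rng = np.random.default_rng(0)
n, d = 20, 5000
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
w, beta = min_norm_interpolator(X, y)
margins = y * (X @ w)                    # all equal to 1 because w interpolates the labels
```

Reducing `d` toward `n` in this sketch is a simple way to see the condition start to fail, which matches the abstract's requirement of sufficient overparameterization.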

A Bayes-Optimal View on Adversarial Examples
敵対的例に対するベイズ最適観点

Since the discovery of adversarial examples (the ability to fool modern CNN classifiers with tiny perturbations of the input), there has been much discussion about whether they are a “bug” that is specific to current neural architectures and training methods or an inevitable “feature” of high dimensional geometry. In this paper, we argue for examining adversarial examples from the perspective of Bayes-Optimal classification. We construct realistic image datasets for which the Bayes-Optimal classifier can be efficiently computed and derive analytic conditions on the distributions under which these classifiers are provably robust against any adversarial attack even in high dimensions. Our results show that even when these “gold standard” optimal classifiers are robust, CNNs trained on the same datasets consistently learn a vulnerable classifier, indicating that adversarial examples are often an avoidable “bug”. We further show that RBF SVMs trained on the same data consistently learn a robust classifier. The same trend is observed in experiments with real images in different datasets.

敵対的サンプル(入力のわずかな変化で最新のCNN分類器を騙す能力)の発見以来、これが現在のニューラルアーキテクチャとトレーニング方法に固有の「バグ」なのか、高次元ジオメトリの避けられない「機能」なのかについて多くの議論がなされてきました。この論文では、ベイズ最適分類の観点から敵対的サンプルを検証することを主張します。ベイズ最適分類器を効率的に計算できる現実的な画像データセットを構築し、これらの分類器が高次元でもあらゆる敵対的攻撃に対して証明可能な堅牢性を持つ分布の分析条件を導き出します。私たちの結果は、これらの「ゴールドスタンダード」の最適分類器が堅牢であっても、同じデータセットでトレーニングされたCNNは一貫して脆弱な分類器を学習することを示しており、敵対的サンプルは多くの場合回避可能な「バグ」であることを示しています。さらに、同じデータでトレーニングされたRBF SVMは一貫して堅牢な分類器を学習することを示しています。異なるデータセットの実際の画像を使用した実験でも同じ傾向が見られます。

Shape-Enforcing Operators for Generic Point and Interval Estimators of Functions
関数の汎用点推定器と間隔推定器の形状強制演算子

A common problem in econometrics, statistics, and machine learning is to estimate and make inference on functions that satisfy shape restrictions. For example, distribution functions are nondecreasing and range between zero and one, height growth charts are nondecreasing in age, and production functions are nondecreasing and quasi-concave in input quantities. We propose a method to enforce these restrictions ex post on generic unconstrained point and interval estimates of the target function by applying functional operators. The interval estimates could be either frequentist confidence bands or Bayesian credible regions. If an operator has reshaping, invariance, order-preserving, and distance-reducing properties, the shape-enforced point estimates are closer to the target function than the original point estimates and the shape-enforced interval estimates have greater coverage and shorter length than the original interval estimates. We show that these properties hold for six different operators that cover commonly used shape restrictions in practice: range, convexity, monotonicity, monotone convexity, quasi-convexity, and monotone quasi-convexity, with the latter two restrictions being of paramount importance. The main attractive property of the post-processing approach is that it works in conjunction with any generic initial point or interval estimate, obtained using any of parametric, semi-parametric or nonparametric learning methods, including recent methods that are able to exploit either smoothness, sparsity, or other forms of structured parsimony of target functions. The post-processed point and interval estimates automatically inherit and provably improve these properties in finite samples, while also enforcing qualitative shape restrictions brought by scientific reasoning. We illustrate the results with two empirical applications to the estimation of a height growth chart for infants in India and a production function for chemical firms in China.

計量経済学、統計学、機械学習における共通の問題は、形状制約を満たす関数を推定し推論することです。たとえば、分布関数は非減少で0から1の範囲であり、身長成長曲線は年齢に対して非減少であり、生産関数は非減少で入力量に対して準凹です。関数演算子を適用することにより、対象関数の一般的な制約のない点推定値と区間推定値に事後的にこれらの制約を適用する方法を提案します。区間推定値は、頻度主義的信頼帯またはベイズ的信用領域のいずれかです。演算子が再形成、不変性、順序保存、距離短縮の特性を持つ場合、形状を適用した点推定値は元の点推定値よりも対象関数に近くなり、形状を適用した区間推定値は元の区間推定値よりも範囲が広く、長さが短くなります。私たちは、これらの特性が、実際によく使用される形状制約をカバーする6つの異なる演算子(範囲、凸性、単調性、単調凸性、準凸性、単調準凸性)に当てはまることを示します。最後の2つの制約が最も重要です。後処理アプローチの主な魅力的な特性は、滑らかさ、スパース性、またはターゲット関数の他の形式の構造化された節約を利用できる最近の方法を含む、パラメトリック、セミパラメトリック、またはノンパラメトリック学習方法のいずれかを使用して取得された一般的な初期点または区間推定値と組み合わせて機能することです。後処理された点と区間の推定値は、有限サンプルでこれらの特性を自動的に継承し、証明できるように改善すると同時に、科学的推論によってもたらされる定性的な形状制約も適用します。インドの乳児の身長成長チャートと中国の化学会社の生産関数の推定に対する2つの実証的アプリケーションで結果を示します。
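
A small Python sketch can illustrate what enforcing shape restrictions ex post means for a point estimate evaluated on a grid. It uses two simple operators chosen here for illustration: clipping for the range restriction and sorting (monotone rearrangement) for monotonicity. Whether these match the exact operators studied in the paper is an assumption, and the function names are hypothetical. Both operators leave an estimate that already satisfies the restriction unchanged, and neither moves the estimate further from a target that satisfies the restriction.

```python
import numpy as np

def enforce_range(f_hat, lo=0.0, hi=1.0):
    """Range operator: clip the estimate into [lo, hi]."""
    return np.clip(f_hat, lo, hi)

def enforce_monotone(f_hat):
    """Monotonicity operator: sort the values on the grid (monotone rearrangement)."""
    return np.sort(f_hat)

# Illustrative example: a noisy estimate of a distribution-like function on a grid.
rng = np.random.default_rng(0)
target = np.linspace(0.0, 1.0, 50)                   # nondecreasing, within [0, 1]
f_hat = target + 0.1 * rng.standard_normal(50)       # unconstrained point estimate
shaped = enforce_monotone(enforce_range(f_hat))      # shape-enforced estimate
```

The same operators could be applied to the lower and upper ends of an interval estimate; since both are order-preserving, the reshaped lower band stays below the reshaped upper band.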

Soft Tensor Regression
ソフトテンソル回帰

Statistical methods relating tensor predictors to scalar outcomes in a regression model generally vectorize the tensor predictor and estimate the coefficients of its entries employing some form of regularization, use summaries of the tensor covariate, or use a low dimensional approximation of the coefficient tensor. However, low rank approximations of the coefficient tensor can suffer if the true rank is not small. We propose a tensor regression framework which assumes a soft version of the parallel factors (PARAFAC) approximation. In contrast to classic PARAFAC where each entry of the coefficient tensor is the sum of products of row-specific contributions across the tensor modes, the soft tensor regression (Softer) framework allows the row-specific contributions to vary around an overall mean. We follow a Bayesian approach to inference, and show that softening the PARAFAC increases model flexibility, leads to improved estimation of coefficient tensors, more accurate identification of important predictor entries, and more precise predictions, even for a low approximation rank. From a theoretical perspective, we show that employing Softer leads to a weakly consistent posterior distribution of the coefficient tensor, irrespective of the true or approximation tensor rank, a result that is not true when employing the classic PARAFAC for tensor regression. In the context of our motivating application, we adapt Softer to symmetric and semi-symmetric tensor predictors and analyze the relationship between brain network characteristics and human traits.

テンソル予測子を回帰モデルのスカラー結果に関連付ける統計的手法では、通常、テンソル予測子をベクトル化し、何らかの形式の正則化を使用してそのエントリの係数を推定するか、テンソル共変量の要約を使用するか、係数テンソルの低次元近似を使用します。ただし、係数テンソルの低ランク近似は、真のランクが小さくない場合は精度が低下する可能性があります。私たちは、パラレルファクター(PARAFAC)近似のソフトバージョンを想定するテンソル回帰フレームワークを提案します。係数テンソルの各エントリがテンソルモード全体の行固有の寄与の積の合計である従来のPARAFACとは対照的に、ソフトテンソル回帰(Softer)フレームワークでは、行固有の寄与が全体の平均を中心に変化することを許容します。私たちはベイジアンアプローチに従って推論を行い、PARAFACをソフト化することでモデルの柔軟性が高まり、係数テンソルの推定が改善され、重要な予測子エントリがより正確に識別され、低い近似ランクでもより正確な予測が可能になることを示します。理論的な観点から、Softerを使用すると、真のテンソルランクや近似テンソルランクに関係なく、係数テンソルの事後分布が弱一致性を持つことを示します。これは、テンソル回帰に従来のPARAFACを使用する場合には成り立たない結果です。本研究の動機となった応用では、Softerを対称および半対称テンソル予測子に適応させ、脳ネットワーク特性と人間の特性の関係を分析します。
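The contrast between classic PARAFAC and its softened form can be sketched for a two-mode (matrix) coefficient. This is only an illustration: the Gaussian perturbation with a single scale `sigma` is an assumption standing in for the paper's Bayesian prior, and the function names are hypothetical.

```python
import random


def parafac_matrix(gamma1, gamma2):
    """Hard PARAFAC: entry (i, j) = sum_d gamma1[d][i] * gamma2[d][j]."""
    rank = len(gamma1)
    rows, cols = len(gamma1[0]), len(gamma2[0])
    return [[sum(gamma1[d][i] * gamma2[d][j] for d in range(rank))
             for j in range(cols)] for i in range(rows)]


def soft_parafac_matrix(gamma1, gamma2, sigma, rng):
    """Soft variant: each entry draws its own contributions around the means."""
    rank = len(gamma1)
    rows, cols = len(gamma1[0]), len(gamma2[0])
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            total = 0.0
            for d in range(rank):
                # Entry-specific contributions vary around the row-specific means.
                b1 = gamma1[d][i] + sigma * rng.gauss(0.0, 1.0)
                b2 = gamma2[d][j] + sigma * rng.gauss(0.0, 1.0)
                total += b1 * b2
            row.append(total)
        out.append(row)
    return out
```

With `sigma = 0` the soft construction collapses to the hard rank-restricted matrix; with `sigma > 0` the entries are no longer tied to an exact low-rank structure.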

Thompson Sampling Algorithms for Cascading Bandits
カスケードバンディットのためのトンプソンサンプリングアルゴリズム

Motivated by the important and urgent need for efficient optimization in online recommender systems, we revisit the cascading bandit model proposed by Kveton et al. (2015a). While Thompson sampling (TS) algorithms have been shown to be empirically superior to Upper Confidence Bound (UCB) algorithms for cascading bandits, theoretical guarantees are only known for the latter. In this paper, we first provide a problem-dependent upper bound on the regret of a TS algorithm with Beta-Bernoulli updates; this upper bound is tighter than a recent derivation under a more general setting by Huyuk and Tekin (2019). Next, we design and analyze another TS algorithm with Gaussian updates, TS-Cascade. TS-Cascade achieves the state-of-the-art problem-independent regret bound for cascading bandits. Complementarily, we consider a linear generalization of the cascading bandit model, which allows efficient learning in large-scale cascading bandit problem instances. We introduce and analyze a TS algorithm, which enjoys a regret bound that depends on the dimension of the linear model but not the number of items. Finally, by using information-theoretic techniques and a judicious construction of cascading bandit instances, we derive a nearly-matching lower bound on the expected regret for the standard model. Our paper establishes the first theoretical guarantees on TS algorithms for a stochastic combinatorial bandit problem model with partial feedback. Numerical experiments demonstrate the superiority of the proposed TS algorithms compared to existing UCB-based ones.

オンラインレコメンデーションシステムにおける効率的な最適化の重要かつ緊急の必要性に動機づけられ、私たちはKvetonら(2015a)が提案したカスケードバンディットモデルを再検討します。カスケードバンディットではトンプソンサンプリング(TS)アルゴリズムが上側信頼限界(UCB)アルゴリズムよりも経験的に優れていることが示されていますが、理論的な保証は後者についてのみ知られています。この論文では、まずベータ・ベルヌーイ更新を用いたTSアルゴリズムのリグレットに対する問題依存の上界を示します。この上界は、HuyukとTekin(2019)によるより一般的な設定での最近の導出よりもタイトです。次に、ガウス更新を用いた別のTSアルゴリズムであるTS-Cascadeを設計し、分析します。TS-Cascadeは、カスケードバンディットに対する最先端の問題非依存リグレット上界を達成します。補完的に、大規模なカスケードバンディット問題インスタンスで効率的な学習を可能にする、カスケードバンディットモデルの線形一般化を検討します。アイテム数ではなく線形モデルの次元に依存するリグレット上界を持つTSアルゴリズムを紹介し、分析します。最後に、情報理論的手法とカスケードバンディットインスタンスの適切な構築を使用して、標準モデルの期待リグレットに対するほぼ一致する下界を導出します。私たちの論文は、部分フィードバックを持つ確率的組み合わせバンディット問題モデルに対するTSアルゴリズムの初めての理論的保証を確立します。数値実験は、既存のUCBベースのアルゴリズムと比較して、提案されたTSアルゴリズムの優位性を実証します。
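A minimal sketch of Thompson sampling with Beta-Bernoulli updates in a simulated cascade model may help fix ideas. It is not the paper's TS-Cascade (which uses Gaussian updates); the simulator, the function name, and the click probabilities are assumptions made for illustration.

```python
import random


def beta_bernoulli_ts(click_probs, list_size, rounds, seed=0):
    """Thompson sampling with Beta-Bernoulli updates in a simulated cascade model."""
    rng = random.Random(seed)
    n_items = len(click_probs)
    alpha = [1] * n_items  # 1 + observed clicks per item
    beta = [1] * n_items   # 1 + observed non-clicks per item
    total_clicks = 0
    for _ in range(rounds):
        # Sample a plausible attraction probability per item from its posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_items)]
        ranked = sorted(range(n_items), key=lambda i: samples[i], reverse=True)
        shown = ranked[:list_size]
        # The simulated user scans the list top-down and stops at the first click.
        for item in shown:
            if rng.random() < click_probs[item]:
                alpha[item] += 1
                total_clicks += 1
                break  # items below the click are never examined (partial feedback)
            beta[item] += 1
    return alpha, beta, total_clicks
```

The partial feedback is visible in the update loop: only items at or above the clicked position change their posterior counts, so each round yields between one and `list_size` observations.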

A Unified Framework for Spectral Clustering in Sparse Graphs
スパースグラフにおけるスペクトルクラスタリングのための統一フレームワーク

This article considers spectral community detection in the regime of sparse networks with heterogeneous degree distributions, for which we devise an algorithm to efficiently retrieve communities. Specifically, we demonstrate that a well parametrized form of regularized Laplacian matrices can be used to perform spectral clustering in sparse networks without suffering from its degree heterogeneity. Besides, we exhibit important connections between this proposed matrix and the now popular non-backtracking matrix, the Bethe-Hessian matrix, as well as the standard Laplacian matrix. Interestingly, as opposed to competitive methods, our proposed improved parametrization inherently accounts for the hardness of the classification problem. These findings are summarized under the form of an algorithm capable of both estimating the number of communities and achieving high-quality community reconstruction.

この記事では、不均一な次数分布を持つスパースネットワークの領域におけるスペクトルコミュニティ検出について考察し、コミュニティを効率的に復元するアルゴリズムを考案します。具体的には、適切にパラメータ化された正則化ラプラシアン行列を使用すれば、次数の不均一性に悩まされることなく、スパースネットワークでスペクトルクラスタリングを実行できることを示します。さらに、この提案された行列と、現在広く使われている非バックトラッキング行列、ベーテ・ヘッセ行列、および標準のラプラシアン行列との間の重要な関係を示します。興味深いことに、競合手法とは対照的に、私たちが提案する改善されたパラメータ化は、分類問題の難しさを本質的に考慮に入れています。これらの知見は、コミュニティの数を推定し、質の高いコミュニティ再構築を実現できるアルゴリズムという形でまとめられています。

Context-dependent Networks in Multivariate Time Series: Models, Methods, and Risk Bounds in High Dimensions
多変量時系列におけるコンテキスト依存ネットワーク:高次元のモデル、手法、およびリスク境界

High-dimensional autoregressive generalized linear models arise naturally for capturing how current events trigger or inhibit future events, such as activity by one member of a social network can affect the future activities of his or her neighbors. While past work has focused on estimating the underlying network structure based solely on the times at which events occur on each node of the network, this paper examines the more nuanced problem of estimating context-dependent networks that reflect how features associated with an event (such as the content of a social media post) modulate the strength of influences among nodes. Specifically, we leverage ideas from compositional time series and regularization methods in machine learning to conduct context-dependent network estimation for high-dimensional autoregressive time series of annotated event data. Two models and corresponding estimators are considered in detail: an autoregressive multinomial model suited to categorical features and a logistic-normal model suited to features with mixed membership in different categories. Importantly, the logistic-normal model leads to a convex negative log-likelihood objective and captures dependence across categories. We provide theoretical guarantees for both estimators that are supported by simulations. We further validate our methods and demonstrate the advantages and disadvantages of both approaches through two real data examples and a synthetic data-generating model. Finally, a mixture approach enjoying both approaches’ merits is proposed and illustrated on synthetic and real data examples.

高次元自己回帰一般化線形モデルは、ソーシャルネットワークの1人のメンバーの活動が近隣メンバーの将来の活動に影響を与えるなど、現在のイベントが将来のイベントをどのように引き起こすか、または抑制するかを捉える際に自然に現れます。過去の研究では、ネットワークの各ノードでイベントが発生する時刻のみに基づいて、基礎となるネットワーク構造を推定することに焦点を当ててきましたが、この論文では、イベントに関連付けられた特徴(ソーシャルメディアの投稿の内容など)がノード間の影響の強さをどのように調整するかを反映するコンテキスト依存ネットワークを推定するという、より微妙な問題を検討します。具体的には、組成時系列と機械学習における正則化手法のアイデアを活用して、注釈付きイベントデータの高次元自己回帰時系列に対するコンテキスト依存ネットワーク推定を実行します。カテゴリ特徴に適した自己回帰多項モデルと、異なるカテゴリに混合メンバーシップを持つ特徴に適したロジスティック正規モデルの2つのモデルと、対応する推定量について詳細に検討します。重要なことは、ロジスティック正規モデルが凸な負の対数尤度目的関数につながり、カテゴリ間の依存関係を捉えることです。両方の推定量に対して、シミュレーションによって裏付けられた理論的な保証を提供します。さらに、2つの実データ例と合成データ生成モデルを使用して手法を検証し、両方のアプローチの長所と短所を実証します。最後に、両方のアプローチの利点を併せ持つ混合アプローチを提案し、合成データと実データの例で説明します。

TensorHive: Management of Exclusive GPU Access for Distributed Machine Learning Workloads
TensorHive:分散機械学習ワークロードのための排他的なGPUアクセスの管理

TensorHive is a tool for organizing work of research and engineering teams that use servers with GPUs for machine learning workloads. In a comprehensive web interface, it supports reservation of GPUs for exclusive usage, hardware monitoring, as well as configuring, executing and queuing distributed computational jobs. Focusing on easy installation and simple configuration, the tool automatically detects the available computing resources and monitors their utilization. Reservations granted on the basis of flexible access control settings are protected by pluggable violation hooks. The job execution module includes auto-configuration templates for distributed neural network training jobs in frameworks such as TensorFlow and PyTorch. Documentation, source code, usage examples and issue tracking are available at the project page: https://github.com/roscisz/TensorHive/

TensorHiveは、機械学習ワークロードにGPUを搭載したサーバーを使用する研究チームやエンジニアリングチームの作業を整理するためのツールです。包括的なWebインターフェイスでは、GPUの専用使用の予約、ハードウェアの監視、分散計算ジョブの構成、実行、キューイングをサポートします。簡単なインストールと簡単な構成に重点を置いたこのツールは、使用可能なコンピューティングリソースを自動的に検出し、それらの使用率を監視します。柔軟なアクセス制御設定に基づいて付与された予約は、プラグ可能な違反フックによって保護されます。ジョブ実行モジュールには、TensorFlowやPyTorchなどのフレームワークの分散ニューラルネットワークトレーニングジョブの自動構成テンプレートが含まれています。ドキュメント、ソースコード、使用例、問題の追跡は、プロジェクトページで入手できます: https://github.com/roscisz/TensorHive/

dalex: Responsible Machine Learning with Interactive Explainability and Fairness in Python
dalex: Pythonにおける対話的な説明可能性と公平性を備えた責任ある機械学習

In modern machine learning, we observe the phenomenon of opaqueness debt, which manifests itself by an increased risk of discrimination, lack of reproducibility, and deflated performance due to data drift. An increasing amount of available data and computing power results in the growing complexity of black-box predictive models. To manage these issues, good MLOps practice asks for better validation of model performance and fairness, higher explainability, and continuous monitoring. The necessity for deeper model transparency comes from both scientific and social domains and is also caused by emerging laws and regulations on artificial intelligence. To facilitate the responsible development of machine learning models, we introduce dalex, a Python package which implements a model-agnostic interface for interactive explainability and fairness. It adopts the design crafted through the development of various tools for explainable machine learning; thus, it aims at the unification of existing solutions. This library’s source code and documentation are available under open license at https://python.drwhy.ai.

現代の機械学習では、不透明負債という現象が見られます。これは、差別リスクの増加、再現性の欠如、データドリフトによるパフォーマンスの低下として現れます。利用可能なデータ量とコンピューティングパワーの増加により、ブラックボックス予測モデルの複雑さが増しています。これらの問題を管理するため、優れたMLOpsプラクティスでは、モデルのパフォーマンスと公平性の検証の改善、説明可能性の向上、継続的な監視が求められます。より深いモデルの透明性の必要性は、科学的領域と社会的領域の両方から生じており、人工知能に関する新たな法律や規制によっても引き起こされています。機械学習モデルの責任ある開発を促進するために、インタラクティブな説明可能性と公平性のためにモデルに依存しないインターフェイスを実装するPythonパッケージであるdalexを紹介します。これは、説明可能な機械学習のためのさまざまなツールの開発を通じて作成された設計を採用しているため、既存のソリューションの統合を目指しています。このライブラリのソースコードとドキュメントは、https://python.drwhy.aiでオープンライセンスで入手できます。

Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms
協調型SGD:ローカル更新SGDアルゴリズムの設計と解析のための統合フレームワーク

When training machine learning models using stochastic gradient descent (SGD) with a large number of nodes or massive edge devices, the communication cost of synchronizing gradients at every iteration is a key bottleneck that limits the scalability of the system and hinders the benefit of parallel computation. Local-update SGD algorithms, where worker nodes perform local iterations of SGD and periodically synchronize their local models, can effectively reduce the communication frequency and save the communication delay. In this paper, we propose a powerful framework, named Cooperative SGD, that subsumes a variety of local-update SGD algorithms (such as local SGD, elastic averaging SGD, and decentralized parallel SGD) and provides a unified convergence analysis. Notably, special cases of the unified convergence analysis provided by the cooperative SGD framework yield 1) the first convergence analysis of elastic averaging SGD for general non-convex objectives, and 2) improvements upon previous analyses of local SGD and decentralized parallel SGD. Moreover, we design new algorithms such as elastic averaging SGD with overlapped computation and communication, and decentralized periodic averaging which are shown to be 4x or more faster than the baseline in reaching the same training loss.

多数のノードまたは大規模なエッジデバイスで確率的勾配降下法(SGD)を使用して機械学習モデルをトレーニングする場合、反復ごとに勾配を同期するための通信コストが重要なボトルネックとなり、システムのスケーラビリティが制限され、並列計算の利点が損なわれます。ワーカーノードがSGDのローカル反復を実行し、ローカルモデルを定期的に同期するローカル更新SGDアルゴリズムは、通信頻度を効果的に削減し、通信遅延を抑えることができます。この論文では、さまざまなローカル更新SGDアルゴリズム(ローカルSGD、弾性平均化SGD、分散並列SGDなど)を包含し、統一された収束解析を提供する、Cooperative SGDという強力なフレームワークを提案します。特に、Cooperative SGDフレームワークによる統一された収束解析の特殊なケースとして、1)一般的な非凸目的関数に対する弾性平均化SGDの最初の収束解析と、2)ローカルSGDおよび分散並列SGDに関する以前の解析の改善が得られます。さらに、計算と通信をオーバーラップさせた弾性平均化SGDや分散周期平均化などの新しいアルゴリズムを設計し、これらが同じトレーニング損失に到達するまでの速度でベースラインより4倍以上高速であることを示します。
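The simplest member of the local-update family, local SGD with periodic full averaging, can be sketched on a toy problem. This is an illustrative assumption, not the Cooperative SGD framework itself: each worker holds a scalar model and a quadratic objective, and the function name and parameters are hypothetical.

```python
import random


def local_sgd(centers, local_steps, rounds, lr=0.1, noise=0.1, seed=0):
    """Local-update SGD on f_i(x) = 0.5 * (x - c_i)^2 with periodic averaging."""
    rng = random.Random(seed)
    models = [0.0 for _ in centers]  # one scalar model per worker
    for _ in range(rounds):
        for _ in range(local_steps):
            for i, c in enumerate(centers):
                # Noisy gradient of the worker's own objective, no communication.
                grad = (models[i] - c) + noise * rng.gauss(0.0, 1.0)
                models[i] -= lr * grad
        # Communication happens only here: workers average their local models.
        average = sum(models) / len(models)
        models = [average for _ in models]
    return models
```

With `local_steps=5`, the workers synchronize once per five gradient steps instead of after every step, which is the reduction in communication frequency described above; in this toy setup the averaged model still settles near the minimizer of the average objective, the mean of `centers`.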

Convex Geometry and Duality of Over-parameterized Neural Networks
過剰パラメータ化されたニューラルネットワークの凸幾何学と双対性

We develop a convex analytic approach to analyze finite width two-layer ReLU networks. We first prove that an optimal solution to the regularized training problem can be characterized as extreme points of a convex set, where simple solutions are encouraged via its convex geometrical properties. We then leverage this characterization to show that an optimal set of parameters yield linear spline interpolation for regression problems involving one dimensional or rank-one data. We also characterize the classification decision regions in terms of a kernel matrix and minimum $\ell_1$-norm solutions. This is in contrast to Neural Tangent Kernel which is unable to explain predictions of finite width networks. Our convex geometric characterization also provides intuitive explanations of hidden neurons as auto-encoders. In higher dimensions, we show that the training problem can be cast as a finite dimensional convex problem with infinitely many constraints. Then, we apply certain convex relaxations and introduce a cutting-plane algorithm to globally optimize the network. We further analyze the exactness of the relaxations to provide conditions for the convergence to a global optimum. Our analysis also shows that optimal network parameters can be also characterized as interpretable closed-form formulas in some practically relevant special cases.

私たちは、有限幅の2層ReLUネットワークを解析するための凸解析アプローチを開発します。まず、正則化されたトレーニング問題の最適解が凸集合の端点として特徴付けられることを証明します。この凸集合の幾何学的特性によって、単純な解が促されます。次に、この特徴付けを利用して、最適なパラメータセットが1次元またはランク1データを含む回帰問題に対して線形スプライン補間をもたらすことを示します。また、分類の決定領域をカーネル行列と最小$\ell_1$ノルム解の観点から特徴付けます。これは、有限幅ネットワークの予測を説明できないニューラルタンジェントカーネルとは対照的です。凸幾何学的特徴付けは、隠れニューロンをオートエンコーダーとして捉える直感的な説明も与えます。高次元では、トレーニング問題を無限個の制約を持つ有限次元凸問題として定式化できることを示します。次に、特定の凸緩和を適用し、ネットワークを大域的に最適化するための切除平面アルゴリズムを導入します。さらに、緩和の厳密性を分析して、大域的最適解への収束条件を与えます。私たちの分析では、実用上重要ないくつかの特殊なケースにおいて、最適なネットワークパラメータが解釈可能な閉形式の式として特徴付けられることも示されます。

Bandit Learning in Decentralized Matching Markets
分散型マッチング市場におけるバンディット学習

We study two-sided matching markets in which one side of the market (the players) does not have a priori knowledge about its preferences for the other side (the arms) and is required to learn its preferences from experience. Also, we assume the players have no direct means of communication. This model extends the standard stochastic multi-armed bandit framework to a decentralized multiple player setting with competition. We introduce a new algorithm for this setting that, over a time horizon $T$, attains $\mathcal{O}(\log(T))$ stable regret when preferences of the arms over players are shared, and $\mathcal{O}(\log(T)^2)$ regret when there are no assumptions on the preferences on either side. Moreover, in the setting where a single player may deviate, we show that the algorithm is incentive-compatible whenever the arms’ preferences are shared, but not necessarily so when preferences are fully general.

私たちは、市場の一方の側(プレイヤー)がもう一方の側(アーム)に対する自らの選好について事前知識を持たず、経験からその選好を学習する必要がある両面マッチング市場を研究します。また、プレイヤーには直接的なコミュニケーション手段がないことを前提とします。このモデルは、標準的な確率的マルチアームバンディットフレームワークを、競争を伴う分散型のマルチプレイヤー設定に拡張します。この設定に対する新しいアルゴリズムを導入します。このアルゴリズムは、時間ホライズン$T$において、プレイヤーに対するアームの選好が共有されている場合は$\mathcal{O}(\log(T))$の安定リグレットを達成し、どちらの側の選好にも仮定を置かない場合は$\mathcal{O}(\log(T)^2)$のリグレットを達成します。さらに、1人のプレイヤーが逸脱する可能性のある設定では、アームの選好が共有されている場合は常にアルゴリズムがインセンティブ両立的であるものの、選好が完全に一般的な場合は必ずしもそうではないことを示します。

Policy Teaching in Reinforcement Learning via Environment Poisoning Attacks
環境ポイズニング攻撃による強化学習における方策指導

We study a security threat to reinforcement learning where an attacker poisons the learning environment to force the agent into executing a target policy chosen by the attacker. As a victim, we consider RL agents whose objective is to find a policy that maximizes reward in infinite-horizon problem settings. The attacker can manipulate the rewards and the transition dynamics in the learning environment at training-time, and is interested in doing so in a stealthy manner. We propose an optimization framework for finding an optimal stealthy attack for different measures of attack cost. We provide lower/upper bounds on the attack cost, and instantiate our attacks in two settings: (i) an offline setting where the agent is doing planning in the poisoned environment, and (ii) an online setting where the agent is learning a policy with poisoned feedback. Our results show that the attacker can easily succeed in teaching any target policy to the victim under mild conditions and highlight a significant security threat to reinforcement learning agents in practice.

私たちは、攻撃者が学習環境を汚染し、攻撃者が選択したターゲットポリシーをエージェントに実行させるという、強化学習に対するセキュリティ上の脅威を研究します。被害者として、無限時間の問題設定で報酬を最大化するポリシーを見つけることを目的とするRLエージェントを検討します。攻撃者は、トレーニング時に学習環境の報酬と遷移ダイナミクスを操作でき、ステルス的な方法でそれを行うことに関心があります。攻撃コストのさまざまな尺度に対して最適なステルス攻撃を見つけるための最適化フレームワークを提案します。攻撃コストの下限/上限を示し、(i)エージェントが汚染された環境で計画を行っているオフライン設定と、(ii)エージェントが汚染されたフィードバックを使用してポリシーを学習しているオンライン設定の2つの設定で攻撃をインスタンス化します。結果は、攻撃者が穏やかな条件下では被害者に任意のターゲットポリシーを簡単に教えることができることを示し、実際には強化学習エージェントに対する重大なセキュリティ上の脅威を浮き彫りにしています。

Explaining by Removing: A Unified Framework for Model Explanation
削除による説明:モデル説明のための統一されたフレームワーク

Researchers have proposed a wide variety of model explanation approaches, but it remains unclear how most methods are related or when one method is preferable to another. We describe a new unified class of methods, removal-based explanations, that are based on the principle of simulating feature removal to quantify each feature’s influence. These methods vary in several respects, so we develop a framework that characterizes each method along three dimensions: 1) how the method removes features, 2) what model behavior the method explains, and 3) how the method summarizes each feature’s influence. Our framework unifies 26 existing methods, including several of the most widely used approaches: SHAP, LIME, Meaningful Perturbations, and permutation tests. This newly understood class of explanation methods has rich connections that we examine using tools that have been largely overlooked by the explainability literature. To anchor removal-based explanations in cognitive psychology, we show that feature removal is a simple application of subtractive counterfactual reasoning. Ideas from cooperative game theory shed light on the relationships and trade-offs among different methods, and we derive conditions under which all removal-based explanations have information-theoretic interpretations. Through this analysis, we develop a unified framework that helps practitioners better understand model explanation tools, and that offers a strong theoretical foundation upon which future explainability research can build.

研究者らはさまざまなモデル説明アプローチを提案してきましたが、ほとんどの方法がどのように関連しているか、またはある方法が他の方法よりも優れている場合については依然として不明です。私たちは、特徴の削除をシミュレートして各特徴の影響を定量化するという原則に基づく、新しい統合されたクラスの方法、除去ベースの説明について説明します。これらの方法はいくつかの点で異なるため、1)方法が特徴を削除する方法、2)方法が説明するモデルの行動、3)方法が各特徴の影響を要約する方法の3つの側面に沿って各方法を特徴付けるフレームワークを開発しました。私たちのフレームワークは、SHAP、LIME、Meaningful Perturbations、および順列テストという最も広く使用されているアプローチを含む26の既存の方法を統合します。この新たに理解された説明方法のクラスには豊富なつながりがあり、説明可能性の文献ではほとんど見過ごされてきたツールを使用してそれらを調査します。認知心理学における除去ベースの説明を固定するために、特徴の削除は減算的な反事実的推論の単純な適用であることを示します。協力ゲーム理論のアイデアは、さまざまな方法間の関係性とトレードオフを明らかにし、すべての除去ベースの説明が情報理論的解釈を持つ条件を導き出します。この分析を通じて、実践者がモデル説明ツールをよりよく理解するのに役立つ統一されたフレームワークを開発し、将来の説明可能性研究を構築するための強力な理論的基礎を提供します。
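The three dimensions of the framework can be made concrete with a small sketch of one possible removal-based explanation. It makes one choice per dimension: features are removed by substituting a baseline value, the behavior explained is a single prediction, and each feature's influence is summarized by the change in output when that feature alone is removed. The model and all names are hypothetical.

```python
def remove_features(x, baseline, removed):
    """Dimension 1: 'remove' features by replacing them with baseline values."""
    return [baseline[i] if i in removed else x[i] for i in range(len(x))]


def leave_one_out(model, x, baseline):
    """Dimensions 2 and 3: explain one prediction; summarize by output change."""
    full = model(x)
    return [full - model(remove_features(x, baseline, {i})) for i in range(len(x))]


def toy_model(x):
    # Hypothetical linear model used only for illustration.
    return 2.0 * x[0] + 0.0 * x[1] - 1.0 * x[2]


scores = leave_one_out(toy_model, [1.0, 5.0, 3.0], [0.0, 0.0, 0.0])
print(scores)  # [2.0, 0.0, -3.0]
```

Swapping any one of the three choices (for example, marginalizing over a data distribution instead of using a fixed baseline, or averaging over feature subsets instead of leaving one out) gives a different method within the same class.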

Oblivious Data for Fairness with Kernels
カーネルを用いた公平性のための忘却データ

We investigate the problem of algorithmic fairness in the case where sensitive and non-sensitive features are available and one aims to generate new, ‘oblivious’, features that closely approximate the non-sensitive features, and are only minimally dependent on the sensitive ones. We study this question in the context of kernel methods. We analyze a relaxed version of the Maximum Mean Discrepancy criterion which does not guarantee full independence but makes the optimization problem tractable. We derive a closed-form solution for this relaxed optimization problem and complement the result with a study of the dependencies between the newly generated features and the sensitive ones. Our key ingredient for generating such oblivious features is a Hilbert-space-valued conditional expectation, which needs to be estimated from data. We propose a plug-in approach and demonstrate how the estimation errors can be controlled. While our techniques help reduce the bias, we would like to point out that no post-processing of any dataset could possibly serve as an alternative to well-designed experiments.

私たちは、センシティブな特徴と非センシティブな特徴が利用可能で、非センシティブな特徴に非常に近似し、センシティブな特徴に最小限しか依存しない新しい「無意識の」特徴を生成することを目的とする場合のアルゴリズムの公平性の問題を調査します。カーネル法のコンテキストでこの問題を調査します。最大平均不一致基準の緩和バージョンを分析します。これは完全な独立性を保証するものではありませんが、最適化問題を扱いやすくします。この緩和された最適化問題の閉形式のソリューションを導出し、新しく生成された特徴とセンシティブな特徴の間の依存関係の研究で結果を補完します。このような無意識の特徴を生成するための重要な要素は、データから推定する必要があるヒルベルト空間値の条件付き期待値です。プラグインアプローチを提案し、推定エラーを制御する方法を示します。私たちの技術はバイアスの削減に役立ちますが、データセットの後処理は、適切に設計された実験の代替にはなり得ないことを指摘しておきます。

A Unified Convergence Analysis for Shuffling-Type Gradient Methods
シャッフル型勾配法のための統一収束解析

In this paper, we propose a unified convergence analysis for a class of generic shuffling-type gradient methods for solving finite-sum optimization problems. Our analysis works with any sampling without replacement strategy and covers many known variants such as randomized reshuffling, deterministic or randomized single permutation, and cyclic and incremental gradient schemes. We focus on two different settings: strongly convex and nonconvex problems, but also discuss the non-strongly convex case. Our main contribution consists of new non-asymptotic and asymptotic convergence rates for a wide class of shuffling-type gradient methods in both nonconvex and convex settings. We also study uniformly randomized shuffling variants with different learning rates and model assumptions. While our rate in the nonconvex case is new and significantly improved over existing works under standard assumptions, the rate on the strongly convex one matches the existing best-known rates prior to this paper up to a constant factor without imposing a bounded gradient condition. Finally, we empirically illustrate our theoretical results via two numerical examples: nonconvex logistic regression and neural network training examples. As byproducts, our results suggest some appropriate choices for diminishing learning rates in certain shuffling variants.

この論文では、有限和最適化問題を解くための一般的なシャッフル型勾配法のクラスの統一された収束解析を提案します。本解析は、任意の非置換サンプリング戦略で機能し、ランダム化再シャッフル、決定論的またはランダム化単一置換、巡回勾配法および増分勾配法などの多くの既知の変種をカバーしています。この論文では、強く凸な問題と非凸な問題という2つの異なる設定に焦点を当てるが、強く凸でないケースについても説明します。本稿の主な貢献は、非凸と凸の両方の設定における、幅広いシャッフル型勾配法のクラスに対する新しい非漸近的および漸近的収束率です。また、異なる学習率とモデル仮定を持つ一様ランダム化シャッフル変種についても研究します。非凸の場合の収束率は新しく、標準的な仮定の下での既存の研究よりも大幅に改善されているが、強く凸の場合の収束率は、有界勾配条件を課すことなく、定数倍まで本稿以前の既存の最もよく知られた収束率と一致しています。最後に、非凸ロジスティック回帰とニューラルネットワークトレーニングの例という2つの数値例を通じて、理論的結果を実証的に示します。副産物として、私たちの結果は、特定のシャッフルバリアントで学習率を減少させるための適切な選択肢をいくつか示唆しています。
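A shuffling-type gradient method can be sketched on a toy finite-sum problem. The sketch uses randomized reshuffling with a step size that is constant within an epoch and diminishes across epochs; the objective, the `base_lr / epoch` schedule, and the function name are illustrative assumptions rather than the paper's recommended settings.

```python
import random


def shuffling_gradient(data, epochs, base_lr=0.5, seed=0):
    """Random reshuffling on f(x) = (1/n) * sum_i 0.5 * (x - a_i)^2."""
    rng = random.Random(seed)
    x = 0.0
    order = list(range(len(data)))
    for epoch in range(1, epochs + 1):
        rng.shuffle(order)        # fresh permutation: sampling without replacement
        lr = base_lr / epoch      # diminishing step size, constant within an epoch
        for i in order:           # every component gradient is used exactly once
            x -= lr * (x - data[i])
    return x
```

Removing the `rng.shuffle(order)` call turns this into a cyclic (incremental gradient) scheme, and shuffling once before the loop gives the single-permutation variant, which is why one analysis can cover all of them.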

Hamilton-Jacobi Deep Q-Learning for Deterministic Continuous-Time Systems with Lipschitz Continuous Controls
リプシッツ連続制御による決定論的連続時間システムのためのハミルトン・ヤコビ深層Q学習

In this paper, we propose Q-learning algorithms for continuous-time deterministic optimal control problems with Lipschitz continuous controls. A new class of Hamilton-Jacobi-Bellman (HJB) equations is derived from applying the dynamic programming principle to continuous-time Q-functions. Our method is based on a novel semi-discrete version of the HJB equation, which is proposed to design a Q-learning algorithm that uses data collected in discrete time without discretizing or approximating the system dynamics. We identify the conditions under which the Q-function estimated by this algorithm converges to the optimal Q-function. For practical implementation, we propose the Hamilton-Jacobi DQN, which extends the idea of deep Q-networks (DQN) to our continuous control setting. This approach does not require actor networks or numerical solutions to optimization problems for greedy actions since the HJB equation provides a simple characterization of optimal controls via ordinary differential equations. We empirically demonstrate the performance of our method through benchmark tasks and high-dimensional linear-quadratic problems.

この論文では、リプシッツ連続制御による連続時間決定論的最適制御問題のためのQ学習アルゴリズムを提案します。連続時間Q関数に動的計画法原理を適用することで、新しいクラスのハミルトン-ヤコビ-ベルマン(HJB)方程式が導出されます。本手法は、HJB方程式の新しい半離散バージョンに基づいており、システムダイナミクスを離散化または近似することなく、離散時間で収集されたデータを使用するQ学習アルゴリズムを設計するために提案されています。本アルゴリズムによって推定されたQ関数が最適なQ関数に収束する条件を特定します。実際の実装では、ディープQネットワーク(DQN)のアイデアを連続制御設定に拡張したハミルトン-ヤコビDQNを提案します。このアプローチでは、HJB方程式が常微分方程式を介して最適制御の簡単な特性評価を提供するため、アクターネットワークや貪欲アクションの最適化問題の数値解法は不要です。ベンチマークタスクと高次元線形二次問題を通じて、私たちの方法のパフォーマンスを実証します。

Langevin Monte Carlo: random coordinate descent and variance reduction
ランジュバンモンテカルロ:ランダム座標降下と分散縮小

Langevin Monte Carlo (LMC) is a popular Bayesian sampling method. For the log-concave distribution function, the method converges exponentially fast, up to a controllable discretization error. However, the method requires the evaluation of a full gradient in each iteration, and for a problem on $\mathbb{R}^d$, this amounts to $d$ times partial derivative evaluations per iteration. The cost is high when $d\gg1$. In this paper, we investigate how to enhance computational efficiency through the application of RCD (random coordinate descent) on LMC. There are two sides of the theory: 1. By blindly applying RCD to LMC, one surrogates the full gradient by a randomly selected directional derivative per iteration. Although the cost is reduced per iteration, the total number of iteration is increased to achieve a preset error tolerance. Ultimately there is no computational gain; 2. We then incorporate variance reduction techniques, such as SAGA (stochastic average gradient) and SVRG (stochastic variance reduced gradient), into RCD-LMC. It will be proved that the cost is reduced compared with the classical LMC, and in the underdamped case, convergence is achieved with the same number of iterations, while each iteration requires merely one-directional derivative. This means we obtain the best possible computational cost in the underdamped-LMC framework.

ランジュバン・モンテカルロ(LMC)は、一般的なベイズサンプリング法です。対数凹分布関数の場合、この方法は、制御可能な離散化誤差を除いて指数関数的に速く収束します。ただし、この方法では、各反復で完全な勾配を評価する必要があり、$\mathbb{R}^d$上の問題の場合、反復ごとに$d$回の偏微分評価が必要になります。$d\gg1$の場合、コストが高くなります。この論文では、LMCにRCD(ランダム座標降下法)を適用して計算効率を高める方法について調査します。この理論には2つの側面があります。1. LMCにRCDを単純に適用すると、反復ごとに完全な勾配がランダムに選択された1つの方向微分で代用されます。反復あたりのコストは削減されますが、事前に設定された誤差許容値を達成するために必要な反復の合計回数が増加します。最終的には計算上の利点はありません。2. 次に、SAGA(確率的平均勾配)やSVRG(確率的分散低減勾配)などの分散低減手法をRCD-LMCに組み込みます。従来のLMCと比較してコストが削減されること、またアンダーダンプ(underdamped)の場合には、各反復で必要なのが1つの方向微分だけでありながら、同じ反復回数で収束が達成されることを証明します。これは、アンダーダンプLMCフレームワークで可能な限り最良の計算コストが得られることを意味します。
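The first side of the theory can be sketched by comparing a plain (overdamped) LMC step with a step that replaces the full gradient by one randomly chosen partial derivative, rescaled by $d$ so that the surrogate stays unbiased. The standard Gaussian target, the step size, and the function names are assumptions for illustration; the variance-reduced (SAGA/SVRG) and underdamped variants from the paper are not shown.

```python
import math
import random


def grad_f(x):
    """Gradient of f(x) = 0.5 * |x|^2, i.e. a standard Gaussian target."""
    return list(x)


def lmc_step(x, h, rng):
    """Overdamped LMC: full gradient, d partial derivatives per step."""
    g = grad_f(x)
    return [xi - h * gi + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0)
            for xi, gi in zip(x, g)]


def rcd_lmc_step(x, h, rng):
    """RCD-LMC: one random partial derivative, rescaled by d to stay unbiased."""
    d = len(x)
    r = rng.randrange(d)
    partial = x[r]  # the r-th partial derivative of f at x for this toy target
    new_x = []
    for i, xi in enumerate(x):
        drift = h * d * partial if i == r else 0.0
        new_x.append(xi - drift + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0))
    return new_x


def second_moment(step, dim=2, h=0.05, n_steps=20000, seed=0):
    """Run a chain and average x_0^2, which equals 1 under the exact target."""
    rng = random.Random(seed)
    x = [0.0] * dim
    total = 0.0
    for _ in range(n_steps):
        x = step(x, h, rng)
        total += x[0] ** 2
    return total / n_steps
```

For this toy target the stationary second moment works out to $1/(1 - h/2)$ for the full-gradient step and $1/(1 - hd/2)$ for the random-coordinate step, so at equal step size the cheaper iteration carries a larger discretization bias. That is consistent with the observation that blindly applying RCD trades per-iteration cost for more iterations.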

Failures of Model-dependent Generalization Bounds for Least-norm Interpolation
最小ノルム内挿におけるモデル依存汎化境界の失敗

We consider bounds on the generalization performance of the least-norm linear regressor, in the over-parameterized regime where it can interpolate the data. We describe a sense in which any generalization bound of a type that is commonly proved in statistical learning theory must sometimes be very loose when applied to analyze the least-norm interpolant. In particular, for a variety of natural joint distributions on training examples, any valid generalization bound that depends only on the output of the learning algorithm, the number of training examples, and the confidence parameter, and that satisfies a mild condition (substantially weaker than monotonicity in sample size), must sometimes be very loose – it can be bounded below by a constant when the true excess risk goes to zero.

私たちは、データを補間できる過剰パラメータ化領域における、最小ノルム線形回帰器の汎化性能の上界を考察します。統計的学習理論で一般的に証明されるタイプの汎化上界はいずれも、最小ノルム補間器の分析に適用すると非常に緩くならざるを得ない場合がある、ということの意味を明確に述べます。特に、訓練例に関するさまざまな自然な同時分布に対して、学習アルゴリズムの出力、訓練例の数、および信頼度パラメータのみに依存し、緩やかな条件(サンプルサイズに関する単調性よりも大幅に弱い条件)を満たす有効な汎化上界は、非常に緩くならざるを得ない場合があります。すなわち、真の過剰リスクがゼロに近づく場合でも、その上界は定数によって下から抑えられることがあります。
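For readers unfamiliar with the object being analyzed, the least-norm interpolant in the over-parameterized regime can be computed as $\theta = X^\top (X X^\top)^{-1} y$. The sketch below is a small standard-library illustration under the assumption that the rows of $X$ are linearly independent; the helper names and the toy data are hypothetical.

```python
def solve(a, b):
    """Solve a small linear system a @ z = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def least_norm_interpolant(xs, ys):
    """theta = X^T (X X^T)^{-1} y, assuming the rows of X are independent."""
    gram = [[sum(p * q for p, q in zip(u, v)) for v in xs] for u in xs]
    coef = solve(gram, ys)
    dim = len(xs[0])
    return [sum(coef[k] * xs[k][j] for k in range(len(xs))) for j in range(dim)]


xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # n = 2 examples, d = 3 parameters
ys = [1.0, 2.0]
theta = least_norm_interpolant(xs, ys)
print(theta)  # [1.0, 1.0, 1.0]
```

Other parameter vectors, such as `[1.0, 2.0, 0.0]`, also fit both training examples exactly, but they have a larger Euclidean norm than the solution returned here.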

Learning partial correlation graphs and graphical models by covariance queries
共分散クエリによる偏相関グラフとグラフィカルモデルの学習

We study the problem of recovering the structure underlying large Gaussian graphical models or, more generally, partial correlation graphs. In high-dimensional problems it is often too costly to store the entire sample covariance matrix. We propose a new input model in which one can query single entries of the covariance matrix. We prove that it is possible to recover the support of the inverse covariance matrix with low query and computational complexity. Our algorithms work in a regime when this support is represented by tree-like graphs and, more generally, for graphs of small treewidth. Our results demonstrate that for large classes of graphs, the structure of the corresponding partial correlation graphs can be determined much faster than even computing the empirical covariance matrix.

私たちは、大規模なガウスグラフィカルモデル、またはより一般的には偏相関グラフの基礎となる構造を回復する問題を研究しています。高次元の問題では、サンプルの共分散行列全体を保存するのはコストがかかりすぎることがよくあります。共分散行列の単一のエントリをクエリできる新しい入力モデルを提案します。逆共分散行列のサポートを、クエリと計算の複雑さを抑えて回復できることを証明します。私たちのアルゴリズムは、このサポートがツリーのようなグラフで表される場合、およびより一般的には、ツリー幅が小さいグラフで表される場合、レジームで機能します。私たちの結果は、グラフの大きなクラスの場合、対応する偏相関グラフの構造は、経験的共分散行列を計算するよりもはるかに速く決定できることを示しています。

Interpretable Deep Generative Recommendation Models
解釈可能な深層生成推薦モデル

User preference modeling in recommendation system aims to improve customer experience through discovering users’ intrinsic preference based on prior user behavior data. This is a challenging issue because user preferences usually have complicated structure, such as inter-user preference similarity and intra-user preference diversity. Among them, inter-user similarity indicates different users may share similar preference, while intra-user diversity indicates one user may have several preferences. In literatures, deep generative models have been successfully applied in recommendation systems due to its flexibility on statistical distributions and strong ability for non-linear representation learning. However, they suffer from the simple generative process when handling complex user preferences. Meanwhile, the latent representations learned by deep generative models are usually entangled, and may range from observed-level ones that dominate the complex correlations between users, to latent-level ones that characterize a user’s preference, which makes the deep model hard to explain and unfriendly for recommendation. Thus, in this paper, we propose an Interpretable Deep Generative Recommendation Model (InDGRM) to characterize inter-user preference similarity and intra-user preference diversity, which will simultaneously disentangle the learned representation from observed-level and latent-level. In InDGRM, the observed-level disentanglement on users is achieved by modeling the user-cluster structure (i.e., inter-user preference similarity) in a rich multimodal space, so that users with similar preferences are assigned into the same cluster. The observed-level disentanglement on items is achieved by modeling the intra-user preference diversity in a prototype learning strategy, where different user intentions are captured by item groups (one group refers to one intention). To promote disentangled latent representations, InDGRM adopts structure and sparsity-inducing penalty and integrates them into the generative procedure, which has ability to enforce each latent factor focus on a limited subset of items (e.g., one item group) and benefit latent-level disentanglement. Meanwhile, it can be efficiently inferred by minimizing its penalized upper bound with the aid of local variational optimization technique. Theoretically, we analyze the generalization error bound of InDGRM to guarantee its performance. A series of experimental results on four widely-used benchmark datasets demonstrates the superiority of InDGRM on recommendation performance and interpretability.

推薦システムにおけるユーザー嗜好モデリングは、以前のユーザー行動データに基づいてユーザーの本質的な嗜好を発見することで顧客体験を向上させることを目的としています。ユーザー嗜好は通常、ユーザー間の嗜好の類似性やユーザー内の嗜好の多様性など、複雑な構造を持っているため、これは難しい問題です。その中で、ユーザー間の類似性は、異なるユーザーが同様の嗜好を共有する可能性があることを示し、ユーザー内の多様性は、1人のユーザーが複数の嗜好を持つ可能性があることを示しています。文献では、統計分布に対する柔軟性と非線形表現学習の強力な能力により、深層生成モデルが推薦システムにうまく適用されています。ただし、複雑なユーザー嗜好を処理する場合、生成プロセスが単純すぎるという問題があります。一方、深層生成モデルによって学習される潜在表現は通常は絡み合っており、ユーザー間の複雑な相関関係を支配する観測レベルの表現から、ユーザーの嗜好を特徴付ける潜在レベルの表現までの範囲にわたる可能性があるため、深層モデルは説明が難しく、推薦に適していません。そこでこの論文では、ユーザー間の嗜好の類似性とユーザー内の嗜好の多様性を特徴付ける解釈可能な深層生成推奨モデル(InDGRM)を提案し、これにより学習した表現を観測レベルと潜在レベルから同時に分離します。InDGRMでは、ユーザーに関する観測レベルの分離は、豊富なマルチモーダル空間でユーザークラスター構造(つまり、ユーザー間の嗜好の類似性)をモデル化することで実現され、類似した嗜好を持つユーザーは同じクラスターに割り当てられます。アイテムに関する観測レベルの分離は、プロトタイプ学習戦略でユーザー内の嗜好の多様性をモデル化することで実現され、異なるユーザーの意図はアイテムグループによって捕捉されます(1つのグループは1つの意図を参照します)。分離した潜在表現を促進するために、InDGRMは構造とスパース性を誘発するペナルティを採用し、それらを生成手順に統合します。これにより、各潜在要因をアイテムの限られたサブセット(たとえば、1つのアイテムグループ)に集中させ、潜在レベルの分離に役立てることができます。一方、局所変分最適化技術の助けを借りて、ペナルティ上限を最小化することで、効率的に推論できます。理論的には、InDGRMの一般化誤差境界を分析して、そのパフォーマンスを保証します。広く使用されている4つのベンチマークデータセットでの一連の実験結果は、推奨パフォーマンスと解釈可能性におけるInDGRMの優位性を実証しています。

Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization
次元削減ツールの仕組みを理解する:データ可視化のためのt-SNE、UMAP、TriMap、PaCMAPの解読に対する経験的アプローチ

Dimension reduction (DR) techniques such as t-SNE, UMAP, and TriMap have demonstrated impressive visualization performance on many real-world datasets. One tension that has always faced these methods is the trade-off between preservation of global structure and preservation of local structure: these methods can either handle one or the other, but not both. In this work, our main goal is to understand what aspects of DR methods are important for preserving both local and global structure: it is difficult to design a better method without a true understanding of the choices we make in our algorithms and their empirical impact on the low-dimensional embeddings they produce. Towards the goal of local structure preservation, we provide several useful design principles for DR loss functions based on our new understanding of the mechanisms behind successful DR methods. Towards the goal of global structure preservation, our analysis illuminates that the choice of which components to preserve is important. We leverage these insights to design a new algorithm for DR, called Pairwise Controlled Manifold Approximation Projection (PaCMAP), which preserves both local and global structure. Our work provides several unexpected insights into what design choices both to make and avoid when constructing DR algorithms.

t-SNE、UMAP、TriMapなどの次元削減(DR)技術は、多くの実際のデータセットで優れた視覚化パフォーマンスを示しています。これらの方法が常に直面している1つの緊張関係は、グローバル構造の保存とローカル構造の保存の間のトレードオフです。これらの方法は、どちらか一方を処理できますが、両方を処理することはできません。この研究の主な目的は、ローカル構造とグローバル構造の両方を保存するためにDR方法のどの側面が重要であるかを理解することです。アルゴリズムで行う選択と、それらが生成する低次元埋め込みへの経験的影響について真の理解がなければ、より優れた方法を設計することは困難です。ローカル構造の保存という目標に向けて、成功したDR方法の背後にあるメカニズムに関する新しい理解に基づいて、DR損失関数のいくつかの有用な設計原則を提供します。グローバル構造の保存という目標に向けて、どのコンポーネントを保存するかの選択が重要であることが分析によって明らかにされています。私たちはこれらの洞察を活用して、ローカル構造とグローバル構造の両方を保持する、ペアワイズ制御マニホールド近似投影(PaCMAP)と呼ばれるDR用の新しいアルゴリズムを設計しました。私たちの研究は、DRアルゴリズムを構築するときに行うべき設計上の選択と避けるべき設計上の選択について、いくつかの予期しない洞察を提供します。

Refined approachability algorithms and application to regret minimization with global costs
洗練された接近可能性アルゴリズムと、グローバルコストを伴うリグレット最小化への応用

Blackwell’s approachability is a framework where two players, the Decision Maker and the Environment, play a repeated game with vector-valued payoffs. The goal of the Decision Maker is to make the average payoff converge to a given set called the target. When this is indeed possible, simple algorithms which guarantee the convergence are known. This abstract tool was successfully used for the construction of optimal strategies in various repeated games, and has also found several applications in online learning. By extending an approach proposed by Abernethy et al. (2011), we construct and analyze a class of Follow the Regularized Leader (FTRL) algorithms for Blackwell’s approachability which are able to minimize not only the Euclidean distance to the target set (as is often the case in the context of Blackwell’s approachability) but a wide range of distance-like quantities. This flexibility enables us to apply these algorithms to closely minimize the quantity of interest in various online learning problems. In particular, for regret minimization with ℓp global costs, we obtain the first bounds with explicit dependence on p and the dimension d.

ブラックウェルの接近可能性は、意思決定者と環境の2人のプレーヤーがベクトル値の報酬を伴う繰り返しゲームをプレイするフレームワークです。意思決定者の目標は、平均報酬をターゲットと呼ばれる特定のセットに収束させることです。これが実際に可能である場合、収束を保証する単純なアルゴリズムが知られています。この抽象的なツールは、さまざまな繰り返しゲームで最適な戦略の構築に効果的に使用されましたが、オンライン学習にもいくつかの用途があります。(Abernethyら、2011)によって提案されたアプローチを拡張することにより、ブラックウェルの接近可能性に対するFollow the Regularized Leaderアルゴリズム(FTRL)のクラスを構築して分析します。このアルゴリズムは、ターゲットセットへのユークリッド距離(ブラックウェルの接近可能性のコンテキストではよくあることです)だけでなく、さまざまな距離のような量を最小化できます。この柔軟性により、これらのアルゴリズムを適用して、さまざまなオンライン学習の問題で関心のある量を厳密に最小化できます。特に、ℓpグローバルコストによる後悔最小化の場合、pと次元dに明示的に依存する最初の境界が得られます。

On ADMM in Deep Learning: Convergence and Saturation-Avoidance
深層学習におけるADMMについて:収束と飽和回避

In this paper, we develop an alternating direction method of multipliers (ADMM) for deep neural networks training with sigmoid-type activation functions (called sigmoid-ADMM pair), mainly motivated by the gradient-free nature of ADMM in avoiding the saturation of sigmoid-type activations and the advantages of deep neural networks with sigmoid-type activations (called deep sigmoid nets) over their rectified linear unit (ReLU) counterparts (called deep ReLU nets) in terms of approximation. In particular, we prove that the approximation capability of deep sigmoid nets is not worse than that of deep ReLU nets by showing that the ReLU activation function can be well approximated by deep sigmoid nets with two hidden layers and finitely many free parameters, but not vice versa. We also establish the global convergence of the proposed ADMM for the nonlinearly constrained formulation of the deep sigmoid nets training from arbitrary initial points to a Karush-Kuhn-Tucker (KKT) point at a rate of order O(1/k). Besides sigmoid activation, such a convergence theorem holds for a general class of smooth activations. Compared with the widely used stochastic gradient descent (SGD) algorithm for the deep ReLU nets training (called ReLU-SGD pair), the proposed sigmoid-ADMM pair is practically stable with respect to the algorithmic hyperparameters including the learning rate, initial schemes and the pre-processing of the input data. Moreover, we find that to approximate and learn simple but important functions the proposed sigmoid-ADMM pair numerically outperforms the ReLU-SGD pair.

この論文では、シグモイド型活性化関数（シグモイドADMMペアと呼ばれる）を使用したディープニューラルネットワークのトレーニング用に、交互方向乗数法（ADMM）を開発します。これは主に、ADMMの勾配フリーの性質によりシグモイド型活性化の飽和を回避できることと、近似の点でシグモイド型活性化を持つディープニューラルネットワーク（ディープシグモイドネットと呼ばれる）がReLU（整流線形ユニット）の対応するもの（ディープReLUネットと呼ばれる）よりも優れていることに動機付けられています。特に、ReLU活性化関数は、2つの隠し層と有限個の自由パラメータを持つディープシグモイドネットによって十分に近似できますが、その逆は成り立たないことを示すことにより、ディープシグモイドネットの近似能力がディープReLUネットの近似能力よりも劣っていないことを証明します。また、任意の初期点からKarush-Kuhn-Tucker (KKT)点までの深層シグモイドネットトレーニングの非線形制約定式化に対する提案ADMMのグローバル収束を、O(1/k)のオーダーの速度で確立します。シグモイド活性化に加えて、このような収束定理は、滑らかな活性化の一般的なクラスに当てはまります。深層ReLUネットトレーニングに広く使用されている確率的勾配降下法(SGD)アルゴリズム(ReLU-SGDペアと呼ばれる)と比較して、提案されたシグモイド-ADMMペアは、学習率、初期スキーム、入力データの前処理などのアルゴリズムのハイパーパラメータに関して実質的に安定しています。さらに、単純だが重要な関数を近似して学習する場合、提案されたシグモイド-ADMMペアは数値的にReLU-SGDペアよりも優れていることがわかりました。

Integrated Principal Components Analysis
統合主成分分析

Data integration, or the strategic analysis of multiple sources of data simultaneously, can often lead to discoveries that may be hidden in individualistic analyses of a single data source. We develop a new unsupervised data integration method named Integrated Principal Components Analysis (iPCA), which is a model-based generalization of PCA and serves as a practical tool to find and visualize common patterns that occur in multiple data sets. The key idea driving iPCA is the matrix-variate normal model, whose Kronecker product covariance structure captures both individual patterns within each data set and joint patterns shared by multiple data sets. Building upon this model, we develop several penalized (sparse and non-sparse) covariance estimators for iPCA, and using geodesic convexity, we prove that our non-sparse iPCA estimator converges to the global solution of a non-convex problem. We also demonstrate the practical advantages of iPCA through extensive simulations and a case study application to integrative genomics for Alzheimer’s disease. In particular, we show that the joint patterns extracted via iPCA are highly predictive of a patient’s cognition and Alzheimer’s diagnosis.

データ統合、つまり複数のデータソースを同時に戦略的に分析すると、単一のデータソースの個別分析では隠れている可能性のある発見につながることがよくあります。私たちは、統合主成分分析(iPCA)という新しい教師なしデータ統合手法を開発しました。これはPCAのモデルベースの一般化であり、複数のデータセットに発生する共通パターンを見つけて視覚化するための実用的なツールとして機能します。iPCAを推進する主要なアイデアは、行列変量正規モデルです。このモデルのクロネッカー積共分散構造は、各データセット内の個別のパターンと、複数のデータセットで共有される結合パターンの両方を捉えます。このモデルに基づいて、iPCA用のペナルティ付き(スパースおよび非スパース)共分散推定量をいくつか開発し、測地線凸性を使用して、非スパースiPCA推定量が非凸問題のグローバルソリューションに収束することを証明します。また、広範なシミュレーションとアルツハイマー病の統合ゲノミクスへのケーススタディアプリケーションを通じて、iPCAの実用的な利点を示します。特に、iPCAによって抽出された結合パターンは、患者の認知機能とアルツハイマー病の診断を非常に正確に予測できることを示しています。

Particle-Gibbs Sampling for Bayesian Feature Allocation Models
ベイズ特徴配分モデルのための粒子ギブスサンプリング

Bayesian feature allocation models are a popular tool for modelling data with a combinatorial latent structure. Exact inference in these models is generally intractable and so practitioners typically apply Markov Chain Monte Carlo (MCMC) methods for posterior inference. The most widely used MCMC strategies rely on a single variable Gibbs update of the feature allocation matrix. These updates can be inefficient as features are typically strongly correlated. To overcome this problem we have developed a block sampler that can update an entire row of the feature allocation matrix in a single move. In the context of feature allocation models, naive block Gibbs sampling is impractical for models with a large number of features as the computational complexity scales exponentially in the number of features. We develop a Particle Gibbs (PG) sampler that targets the same distribution as the row wise Gibbs updates, but has computational complexity that only grows linearly in the number of features. We compare the performance of our proposed methods to the standard Gibbs sampler using synthetic and real data from a range of feature allocation models. Our results suggest that row wise updates using the PG methodology can significantly improve the performance of samplers for feature allocation models.

ベイジアン特徴割り当てモデルは、組み合わせ潜在構造を持つデータをモデル化するための一般的なツールです。これらのモデルでの正確な推論は一般に扱いにくいため、専門家は通常、事後推論にマルコフ連鎖モンテカルロ(MCMC)法を適用します。最も広く使用されているMCMC戦略は、特徴割り当て行列の単一変数ギブス更新に依存しています。特徴は通常強く相関しているため、これらの更新は非効率的です。この問題を克服するために、私たちは1回の移動で特徴割り当て行列の行全体を更新できるブロックサンプラーを開発しました。特徴割り当てモデルのコンテキストでは、計算の複雑さが特徴の数に応じて指数関数的に増加するため、多数の特徴を持つモデルでは単純なブロックギブスサンプリングは実用的ではありません。私たちは、行単位のギブス更新と同じ分布をターゲットとする粒子ギブス(PG)サンプラーを開発しましたが、計算の複雑さは特徴の数に応じて線形にしか増加しません。さまざまな特徴割り当てモデルからの合成データと実際のデータを使用して、提案された方法のパフォーマンスを標準ギブスサンプラーと比較します。私たちの結果は、PG方法論を使用した行単位の更新により、特徴割り当てモデルのサンプラーのパフォーマンスが大幅に向上することを示唆しています。

COKE: Communication-Censored Decentralized Kernel Learning
COKE: 通信検閲分散カーネル学習

This paper studies the decentralized optimization and learning problem where multiple interconnected agents aim to learn an optimal decision function defined over a reproducing kernel Hilbert space by jointly minimizing a global objective function, with access to their own locally observed dataset. As a non-parametric approach, kernel learning faces a major challenge in distributed implementation: the decision variables of local objective functions are data-dependent and thus cannot be optimized under the decentralized consensus framework without any raw data exchange among agents. To circumvent this major challenge, we leverage the random feature (RF) approximation approach to enable consensus on the function modeled in the RF space by data-independent parameters across different agents. We then design an iterative algorithm, termed DKLA, for fast-convergent implementation via ADMM. Based on DKLA, we further develop a communication-censored kernel learning (COKE) algorithm that reduces the communication load of DKLA by preventing an agent from transmitting at every iteration unless its local updates are deemed informative. Theoretical results in terms of linear convergence guarantee and generalization performance analysis of DKLA and COKE are provided. Comprehensive tests on both synthetic and real datasets are conducted to verify the communication efficiency and learning effectiveness of COKE.

この論文では、複数の相互接続されたエージェントが、自身のローカルに観測されたデータセットにアクセスして、グローバル目的関数を共同で最小化することにより、再生カーネルヒルベルト空間上で定義された最適な決定関数を学習することを目指す、分散最適化および学習問題について検討します。非パラメトリックアプローチであるカーネル学習は、分散実装において大きな課題に直面しています。ローカル目的関数の決定変数はデータに依存するため、エージェント間で生データを交換せずに分散コンセンサスフレームワークで最適化することはできません。この大きな課題を回避するために、ランダム特徴(RF)近似アプローチを活用して、異なるエージェント間でデータに依存しないパラメーターによってRF空間でモデル化された関数に関するコンセンサスを可能にします。次に、ADMMによる高速収束実装のために、DKLAと呼ばれる反復アルゴリズムを設計します。DKLAに基づいて、通信検閲カーネル学習(COKE)アルゴリズムをさらに開発します。このアルゴリズムは、ローカル更新が有益であると判断されない限り、エージェントが反復ごとに送信しないようにすることで、DKLAの通信負荷を軽減します。DKLAとCOKEの線形収束保証と一般化パフォーマンス分析に関する理論的結果が提供されます。合成データセットと実際のデータセットの両方で包括的なテストを実施して、COKEの通信効率と学習効果を検証します。
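
The random feature (RF) idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes a Gaussian kernel and the standard random Fourier feature construction; it is not the DKLA or COKE algorithm itself, and the names `make_random_features`, `feature_map`, and `gaussian_kernel` are hypothetical. The point is that the pairs `(w, b)` are drawn without looking at any data, so every agent could hold the same feature map and seek consensus on a finite weight vector instead of exchanging raw samples.

```python
import math
import random

def make_random_features(dim, num_features, bandwidth, seed=0):
    """Draw data-independent parameters (w, b) of a random Fourier feature map."""
    rng = random.Random(seed)
    params = []
    for _ in range(num_features):
        # For the Gaussian kernel exp(-||x - y||^2 / (2 * bandwidth^2)),
        # frequencies are sampled from N(0, bandwidth^-2 * I).
        w = [rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
        b = rng.uniform(0.0, 2.0 * math.pi)
        params.append((w, b))
    return params

def feature_map(x, params):
    """Map x to z(x) so that z(x) . z(y) approximates the kernel value k(x, y)."""
    scale = math.sqrt(2.0 / len(params))
    return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in params]

def gaussian_kernel(x, y, bandwidth):
    """Exact Gaussian kernel, used here only as a reference value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

# Shared, data-independent feature parameters (same seed on every agent).
params = make_random_features(dim=2, num_features=4000, bandwidth=1.0, seed=0)
x, y = [0.3, -0.2], [0.1, 0.4]
approx = sum(a * b for a, b in zip(feature_map(x, params), feature_map(y, params)))
exact = gaussian_kernel(x, y, 1.0)
print(f"random-feature estimate: {approx:.3f}, exact kernel: {exact:.3f}")
```

With more features the estimate concentrates around the exact kernel value, at the cost of a longer parameter vector per agent.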

Learning Laplacian Matrix from Graph Signals with Sparse Spectral Representation
スパーススペクトル表現を持つグラフ信号からのラプラシアン行列の学習

In this paper, we consider the problem of learning a graph structure from multivariate signals, known as graph signals. Such signals are multivariate observations carrying measurements corresponding to the nodes of an unknown graph, which we desire to infer. They are assumed to enjoy a sparse representation in the graph spectral domain, a feature which is known to carry information related to the cluster structure of a graph. The signals are also assumed to behave smoothly with respect to the underlying graph structure. For the graph learning problem, we propose a new optimization program to learn the Laplacian of this graph and provide two algorithms to solve it, called IGL-3SR and FGL-3SR. Based on a 3-step alternating procedure, both algorithms rely on standard minimization methods (such as manifold gradient descent or linear programming) and have lower complexity compared to state-of-the-art algorithms. While IGL-3SR ensures convergence, FGL-3SR acts as a relaxation and is significantly faster since its alternating process relies on multiple closed-form solutions. Both algorithms are evaluated on synthetic and real data. They are shown to perform as well as or better than their competitors in terms of both numerical performance and scalability. Finally, we present a probabilistic interpretation of the proposed optimization program as a Factor Analysis Model.

この論文では、グラフ信号と呼ばれる多変量信号からグラフ構造を学習する問題について検討します。このような信号は、推測したい未知のグラフのノードに対応する測定値を運ぶ多変量観測です。これらはグラフのスペクトル領域でスパース表現されると想定されます。これは、グラフのクラスター構造に関連する情報を運ぶことが知られている特徴です。信号はまた、基礎となるグラフ構造に対してスムーズに動作すると想定されます。グラフ学習問題に対して、このグラフのラプラシアンを学習する新しい最適化プログラムを提案し、IGL-3SRとFGL-3SRと呼ばれる2つのアルゴリズムでこれを解決します。3段階の交互手順に基づく両方のアルゴリズムは、標準的な最小化方法(多様体勾配降下法や線形計画法など)に依存しており、最先端のアルゴリズムと比較して複雑性が低くなっています。IGL-3SRは収束を保証しますが、FGL-3SRは緩和として機能し、その交互プロセスが複数の閉形式のソリューションに依存するため、大幅に高速です。両方のアルゴリズムは合成データと実際のデータで評価され、数値パフォーマンスとスケーラビリティの両方の点で競合手法と同等かそれ以上のパフォーマンスを発揮することが示されています。最後に、提案された最適化プログラムを因子分析モデルとして確率的に解釈します。
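
The smoothness assumption in the abstract is often quantified by the Laplacian quadratic form, which equals the weighted sum of squared differences across edges. The sketch below assumes that common definition (the abstract does not state the exact formula used in the paper), and the helper names are hypothetical. It is not IGL-3SR or FGL-3SR; it only shows why a signal that varies little along edges scores low.

```python
def laplacian(weights, n):
    """Build the combinatorial Laplacian L = D - W from undirected edges {(i, j): w_ij}."""
    L = [[0.0] * n for _ in range(n)]
    for (i, j), w in weights.items():
        L[i][j] -= w
        L[j][i] -= w
        L[i][i] += w
        L[j][j] += w
    return L

def quadratic_form(L, x):
    """Compute x^T L x."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

def edge_smoothness(weights, x):
    """Compute the sum over edges of w_ij * (x_i - x_j)^2."""
    return sum(w * (x[i] - x[j]) ** 2 for (i, j), w in weights.items())

# A weighted path graph on four nodes (illustrative values).
edges = {(0, 1): 1.0, (1, 2): 2.0, (2, 3): 0.5}
L = laplacian(edges, 4)

smooth_signal = [1.0, 1.1, 1.2, 1.3]   # changes slowly along the path
rough_signal = [1.0, -1.0, 1.0, -1.0]  # flips sign across every edge

for name, x in [("smooth", smooth_signal), ("rough", rough_signal)]:
    print(name, quadratic_form(L, x), edge_smoothness(edges, x))
```

Both functions return the same value for any signal, so a small quadratic form means neighbouring nodes carry similar measurements.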

Limit theorems for out-of-sample extensions of the adjacency and Laplacian spectral embeddings
隣接性およびラプラシアンスペクトル埋め込みのサンプル外拡張の極限定理

Graph embeddings, a class of dimensionality reduction techniques designed for relational data, have proven useful in exploring and modeling network structure. Most dimensionality reduction methods allow out-of-sample extensions, by which an embedding can be applied to observations not present in the training set. Applied to graphs, the out-of-sample extension problem concerns how to compute the embedding of a vertex that is added to the graph after an embedding has already been computed. In this paper, we consider the out-of-sample extension problem for two graph embedding procedures: the adjacency spectral embedding and the Laplacian spectral embedding. In both cases, we prove that when the underlying graph is generated according to a latent space model called the random dot product graph, which includes the popular stochastic block model as a special case, an out-of-sample extension based on a least-squares objective obeys a central limit theorem. In addition, we prove a concentration inequality for the out-of-sample extension of the adjacency spectral embedding based on a maximum-likelihood objective. Our results also yield a convenient framework in which to analyze trade-offs between estimation accuracy and computational expenses, which we explore briefly. Finally, we explore the performance of these out-of-sample extensions as applied to both simulated and real-world data. We observe significant computational savings with minimal losses to the quality of the learned embeddings, in keeping with our theoretical results.

グラフ埋め込みは、リレーショナルデータ用に設計された次元削減手法の一種であり、ネットワーク構造の調査とモデル化に役立つことが証明されています。ほとんどの次元削減方法では、サンプル外拡張が可能で、これにより、トレーニングセットに存在しない観測に埋め込みを適用できます。グラフに適用した場合、サンプル外拡張の問題は、埋め込みがすでに計算された後にグラフに追加された頂点の埋め込みを計算する方法に関係します。この論文では、隣接スペクトル埋め込みとラプラシアンスペクトル埋め込みという2つのグラフ埋め込み手順のサンプル外拡張問題について検討します。どちらの場合も、基礎となるグラフが、一般的な確率的ブロックモデルを特別なケースとして含むランダムドット積グラフと呼ばれる潜在空間モデルに従って生成される場合、最小二乗目的関数に基づくサンプル外拡張が中心極限定理に従うことを証明します。さらに、最大尤度目的に基づく隣接スペクトル埋め込みのサンプル外拡張の集中不等式を証明します。また、この結果は、推定精度と計算コストのトレードオフを分析するための便利なフレームワークも生み出します。これについては簡単に説明します。最後に、シミュレーションデータと実世界のデータの両方に適用した場合のこれらのサンプル外拡張のパフォーマンスを調べます。理論上の結果と一致して、学習した埋め込みの品質の低下を最小限に抑えながら、大幅な計算コストの節約が実現しました。
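
A least-squares out-of-sample extension can be sketched as follows: given the latent positions already estimated for the in-sample vertices and the new vertex's edge indicators, choose the position whose inner products with the existing positions best match those edges. This is a minimal illustration under assumptions: it fixes the embedding dimension at 2, the function name `out_of_sample_ls` is hypothetical, and the demo feeds in noiseless edge probabilities instead of 0/1 edges so that the recovery is exact.

```python
def out_of_sample_ls(embedding, adjacency):
    """Least-squares position of a new vertex for 2-dimensional latent positions.

    embedding: list of [x1, x2] rows, one per in-sample vertex
    adjacency: edge indicators (or edge probabilities) between the new vertex
               and each in-sample vertex
    """
    # Normal equations (X^T X) w = X^T a, solved in closed form for d = 2.
    g11 = sum(r[0] * r[0] for r in embedding)
    g12 = sum(r[0] * r[1] for r in embedding)
    g22 = sum(r[1] * r[1] for r in embedding)
    b1 = sum(r[0] * a for r, a in zip(embedding, adjacency))
    b2 = sum(r[1] * a for r, a in zip(embedding, adjacency))
    det = g11 * g22 - g12 * g12
    if abs(det) < 1e-12:
        raise ValueError("embedding columns are (numerically) collinear")
    return [(g22 * b1 - g12 * b2) / det, (g11 * b2 - g12 * b1) / det]

# In-sample latent positions (illustrative values) and a "true" new position.
embedding = [[0.8, 0.1], [0.2, 0.7], [0.5, 0.5], [0.9, 0.3]]
w_true = [0.6, 0.2]

# In a random dot product graph, the edge probability is the inner product.
edge_probs = [r[0] * w_true[0] + r[1] * w_true[1] for r in embedding]

w_hat = out_of_sample_ls(embedding, edge_probs)
print("recovered position:", w_hat)
```

With observed 0/1 edges in place of probabilities the estimate is noisy, and that noise is what the central limit theorem in the abstract describes. The solve involves only a small system in the embedding dimension, which is the computational saving over re-embedding the whole graph.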

Sparse Popularity Adjusted Stochastic Block Model
スパース人気調整確率的ブロックモデル

In the present paper we study a sparse stochastic network equipped with a block structure. The popular Stochastic Block Model (SBM) and the Degree Corrected Block Model (DCBM) address sparsity by placing an upper bound on the maximum probability of connections between any pair of nodes. As a result, sparsity describes only the behavior of the network as a whole, without distinguishing between the block-dependent sparsity patterns. To the best of our knowledge, the recently introduced Popularity Adjusted Block Model (PABM) is the only block model that allows one to introduce a structural sparsity where some probabilities of connections are identically equal to zero while the rest of them remain above a certain threshold. The latter presents a more nuanced view of the network.

この論文では、ブロック構造を備えたスパース確率ネットワークについて研究します。一般的な確率的ブロックモデル(SBM)と次数補正ブロックモデル(DCBM)は、任意のノードペア間の接続の最大確率に上限を設定することで、スパース性に対処します。その結果、スパース性はネットワーク全体の動作のみを表し、ブロック依存のスパース性パターンを区別しません。私たちの知る限りでは、最近導入された人気調整ブロックモデル(PABM)は、接続の一部の確率が恒等的にゼロに等しく、残りの確率が特定のしきい値を超えている構造的なスパース性を導入できる唯一のブロックモデルです。後者は、ネットワークについてよりきめ細かな見方を示しています。

Method of Contraction-Expansion (MOCE) for Simultaneous Inference in Linear Models
線形モデルにおける同時推論のための収縮拡大法(MOCE)

Simultaneous inference after model selection is of critical importance to address scientific hypotheses involving a set of parameters. In this paper, we consider a high-dimensional linear regression model in which a regularization procedure such as LASSO is applied to yield a sparse model. To establish a simultaneous post-model selection inference, we propose a method of contraction and expansion (MOCE) along the line of debiasing estimation in that we investigate a desirable trade-off between model selection variability and sample variability by the means of forward screening. We establish key theoretical results for the inference from the proposed MOCE procedure. Once the expanded model is properly selected, the theoretical guarantees and simultaneous confidence regions can be constructed by the joint asymptotic normal distribution. In comparison with existing methods, our proposed method exhibits stable and reliable coverage at a nominal significance level and enjoys substantially less computational burden. Thus, our MOCE approach is trustworthy in solving real-world problems.

モデル選択後の同時推論は、一連のパラメータを含む科学的仮説に対処するために極めて重要です。この論文では、LASSOなどの正則化手順を適用してスパースモデルを生成する高次元線形回帰モデルを検討します。モデル選択後の同時推論を確立するために、モデル選択の変動性とサンプルの変動性の間の望ましいトレードオフをフォワードスクリーニングによって調査するという点で、バイアス除去推定の方向に沿った収縮と拡張(MOCE)の方法を提案します。提案されたMOCE手順から、推論の主要な理論的結果を確立します。拡張モデルが適切に選択されると、理論的保証と同時信頼領域は、結合漸近正規分布によって構築できます。既存の方法と比較して、提案された方法は、名目上の有意水準で安定した信頼性の高いカバレッジを示し、計算負荷が大幅に少なくなります。したがって、MOCEアプローチは、現実世界の問題を解決する上で信頼できます。

On the Estimation of Network Complexity: Dimension of Graphons
ネットワーク複雑性の推定について:グラフォンの次元

Network complexity has been studied for over half a century and has found a wide range of applications. Many methods have been developed to characterize and estimate the complexity of networks. However, there has been little research with statistical guarantees. In this paper, we develop a statistical theory of graph complexity in a general model of random graphs, the so-called graphon model. Given a graphon, we endow the latent space of the nodes with the neighborhood distance. Our complexity index is then based on the covering number and the Minkowski dimension of this metric space. Although the latent space is not identifiable, these indices turn out to be identifiable. This notion of complexity has simple interpretations on popular examples: it matches the number of communities in stochastic block models; the dimension of the Euclidean space in random geometric graphs; the regularity of the link function in Hölder graphons. From a single observation of the graph, we construct an estimator of the neighborhood-distance and show universal non-asymptotic bounds for its risk, matching minimax lower bounds. Based on this estimated distance, we compute the corresponding covering number and Minkowski dimension and we provide optimal non-asymptotic error bounds for these two plug-in estimators.

ネットワークの複雑性は半世紀以上にわたって研究され、幅広い用途に使用されています。ネットワークの複雑性を特徴付け、推定するための多くの方法が開発されています。しかし、統計的な保証のある研究はほとんどありませんでした。この論文では、ランダムグラフの一般的なモデル、いわゆるグラフォンモデルにおけるグラフ複雑性の統計理論を展開します。グラフォンが与えられた場合、ノードの潜在空間に近傍距離を与えます。複雑性指標は、この距離空間の被覆数とミンコフスキー次元に基づいています。潜在空間は識別できませんが、これらの指標は識別可能であることがわかります。この複雑性の概念は、一般的な例に対して単純な解釈が可能です。つまり、確率的ブロックモデルではコミュニティの数、ランダム幾何学的グラフではユークリッド空間の次元、ヘルダー(Hölder)グラフォンではリンク関数の正則性と一致します。グラフの単一の観測から、近傍距離の推定量を構築し、そのリスクに対して、ミニマックス下限と一致する普遍的な非漸近的上界を示します。この推定距離に基づいて、対応する被覆数とミンコフスキー次元を計算し、これら2つのプラグイン推定量に最適な非漸近誤差境界を提供します。

Collusion Detection and Ground Truth Inference in Crowdsourcing for Labeling Tasks
ラベリングタスクのためのクラウドソーシングにおける共謀検出とグラウンドトゥルース推論

Crowdsourcing has been a prompt and cost-effective way of obtaining labels in many machine learning applications. In the literature, a number of algorithms have been developed to infer the ground truth based on the collected labels. However, most existing studies assume workers to be independent and are vulnerable to worker collusion. This paper aims at detecting the collusive behaviors of workers in labeling tasks. Specifically, we consider collusion in a pairwise manner and propose a penalized pairwise profile likelihood method based on the adaptive LASSO penalty for collusion detection. Many models that describe the behavior of independent workers can be incorporated into our proposed framework as the baseline model. We further investigate the theoretical properties of the proposed method that guarantee the asymptotic performance. An algorithm based on the expectation-maximization algorithm and coordinate descent is proposed to numerically maximize the penalized pairwise profile likelihood function for parameter estimation. To the best of our knowledge, this is the first statistical model that simultaneously detects collusion, learns workers’ capabilities, and infers the ground-truth labels. Numerical studies using synthetic and real data sets are also conducted to verify the performance of the method.

クラウドソーシングは、多くの機械学習アプリケーションでラベルを取得するための迅速で費用対効果の高い方法となっています。文献では、収集されたラベルに基づいてグラウンドトゥルースを推測するためのアルゴリズムが数多く開発されています。しかし、既存の研究のほとんどは、作業員が独立していると仮定しており、作業員の共謀に対して脆弱です。この論文では、ラベル付けタスクにおける作業員の共謀行動を検出することを目的としています。具体的には、共謀をペアワイズで考慮し、共謀検出のための適応型LASSOペナルティに基づくペナルティ付きペアワイズプロファイル尤度法を提案します。独立した作業員の行動を記述する多くのモデルを、ベースラインモデルとして提案フレームワークに組み込むことができます。さらに、漸近的なパフォーマンスを保証する提案方法の理論的特性を調査します。期待最大化アルゴリズムと座標降下法に基づくアルゴリズムを提案し、パラメータ推定のためのペナルティ付きペアワイズプロファイル尤度関数を数値的に最大化します。私たちの知る限り、これは共謀の検出、作業員の能力の学習、グラウンドトゥルースラベルの推論を同時に行う最初の統計モデルです。この方法の性能を検証するために、合成データセットと実際のデータセットを使用した数値研究も実施されています。

One-Shot Federated Learning: Theoretical Limits and Algorithms to Achieve Them
ワンショット連合学習:理論的限界とそれを達成するためのアルゴリズム

We consider distributed statistical optimization in one-shot setting, where there are $m$ machines each observing $n$ i.i.d. samples. Based on its observed samples, each machine sends a $B$-bit-long message to a server. The server then collects messages from all machines, and estimates a parameter that minimizes an expected convex loss function. We investigate the impact of communication constraint, $B$, on the expected error and derive a tight lower bound on the error achievable by any algorithm. We then propose an estimator, which we call Multi-Resolution Estimator (MRE), whose expected error (when $B\ge d\log mn$ where $d$ is the dimension of parameter) meets the aforementioned lower bound up to a poly-logarithmic factor in $mn$. The expected error of MRE, unlike existing algorithms, tends to zero as the number of machines ($m$) goes to infinity, even when the number of samples per machine ($n$) remains upper bounded by a constant. We also address the problem of learning under tiny communication budget, and present lower and upper error bounds for the case that the budget $B$ is a constant.

私たちは、$m$台のマシンがあり、それぞれが$n$個のi.i.d.サンプルを観測するワンショット設定での分散統計最適化について考察します。観測されたサンプルに基づいて、各マシンは$B$ビット長のメッセージをサーバーに送信します。次に、サーバーはすべてのマシンからメッセージを収集し、期待される凸損失関数を最小化するパラメータを推定します。通信制約$B$が期待誤差に与える影響を調査し、任意のアルゴリズムで達成可能な誤差の厳密な下限を導出します。次に、Multi-Resolution Estimator (MRE)と呼ぶ推定量を提案します。この推定量の期待誤差($B\ge d\log mn$の場合、ここで$d$はパラメータの次元)は、$mn$の多重対数因子を除いて前述の下限に一致します。MREの期待誤差は、既存のアルゴリズムとは異なり、マシンあたりのサンプル数($n$)の上限が定数のままであっても、マシン数($m$)が無限大になるにつれてゼロに近づきます。また、通信予算が非常に小さい場合の学習の問題にも取り組み、予算$B$が定数である場合の下限および上限の誤差境界を示します。

Differentially Private Regression and Classification with Sparse Gaussian Processes
スパースガウス過程による差分プライベート回帰と分類

A continuing challenge for machine learning is providing methods to perform computation on data while ensuring the data remains private. In this paper we build on the provable privacy guarantees of differential privacy which has been combined with Gaussian processes through the previously published cloaking method, an approach that tackles the problem of providing privacy for the outputs of a training set. In this paper we solve several shortcomings of this method, starting with the problem of predictions in regions with low data density. We experiment with the use of inducing points to provide a sparse approximation and show that these can provide robust differential privacy in outlier areas and at higher dimensions. We then look at classification, and modify the Laplace approximation approach to provide differentially private predictions. We then combine this with the sparse approximation and demonstrate the capability to perform classification in high dimensions. We finally explore the issue of hyperparameter selection and develop a method for their private selection. This paper and associated libraries provide a robust toolkit for combining differential privacy and Gaussian processes in a practical manner.

機械学習の継続的な課題は、データのプライバシーを確保しながらデータに対して計算を実行する方法を提供することです。この論文では、以前に公開されたクローキング法を通じてガウス過程と組み合わせた差分プライバシーの証明可能なプライバシー保証を基に構築します。クローキング法は、トレーニングセットの出力にプライバシーを提供するという問題に取り組むアプローチです。この論文では、データ密度の低い領域での予測の問題から始めて、この方法のいくつかの欠点を解決します。スパース近似を提供する誘導ポイントの使用を実験し、外れ値領域と高次元で堅牢な差分プライバシーを提供できることを示します。次に、分類について検討し、ラプラス近似アプローチを変更して差分プライバシー予測を提供します。次に、これをスパース近似と組み合わせて、高次元で分類を実行する機能を実証します。最後に、ハイパーパラメータ選択の問題を調査し、それらのプライベート選択の方法を開発します。この論文と関連ライブラリは、差分プライバシーとガウス過程を実用的な方法で組み合わせるための堅牢なツールキットを提供します。

Matrix Product States for Inference in Discrete Probabilistic Models
離散確率モデルにおける推論のための行列積状態

When faced with problems involving inference in discrete domains, solutions often involve appeals to conditional independence structure or mean-field approximations. We argue that this is insufficient for a number of interesting Bayesian problems, including mixture assignment posteriors and probabilistic relational models (e.g. the stochastic block model). These posteriors exhibit no conditional independence structure, precluding the use of graphical model methods, yet exhibit dependency between every single element of the posterior, making mean-field methods a poor fit. We propose using an expressive yet tractable approximation inspired by tensor factorization methods, alternately known as the tensor train or the matrix product state, which can be construed as a direct extension of the mean-field approximation to higher-order dependencies. We give a comprehensive introduction to the application of matrix product state in probabilistic inference, and illustrate how to efficiently perform marginalization, conditioning, sampling, normalization, some expectations, and approximate variational inference in our proposed model.

離散領域での推論を伴う問題に直面した場合、解決策として条件付き独立構造または平均場近似を利用することがよくあります。混合割り当て事後分布や確率的関係モデル(例:確率的ブロックモデル)を含む多くの興味深いベイズ問題には、これでは不十分であると考えられます。これらの事後分布は条件付き独立構造を示さないため、グラフィカルモデル法は使用できませんが、事後分布のすべての要素間に依存関係があるため、平均場法は適していません。テンソル因数分解法にヒントを得た、表現力豊かで扱いやすい近似値(テンソルトレインまたは行列積状態とも呼ばれます)を使用することを提案します。これは、平均場近似値を高次の依存関係に直接拡張したものと解釈できます。確率的推論における行列積状態の適用について包括的に紹介し、提案モデルで周辺化、条件付け、サンプリング、正規化、いくつかの期待値、および近似変分推論を効率的に実行する方法を示します。
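
Why normalization is tractable for a matrix product state can be seen in a small sketch. It assumes a simplified parameterization chosen for illustration, not taken from the paper: one non-negative matrix ("core") per symbol, shared across all positions, with boundary vectors `left` and `right`. The unnormalized weight of a sequence is a chain of matrix products, and summing over all sequences collapses to repeated multiplication by the sum of the cores.

```python
from itertools import product

def matvec_left(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def weight(seq, cores, left, right):
    """Unnormalized weight left^T A[x_1] ... A[x_n] right of one sequence."""
    v = left
    for x in seq:
        v = matvec_left(v, cores[x])
    return sum(vi * ri for vi, ri in zip(v, right))

def normalizer(n, cores, left, right):
    """Sum of weights over all len(cores)**n sequences in n matrix-vector products."""
    r = len(left)
    summed = [[sum(core[i][j] for core in cores) for j in range(r)] for i in range(r)]
    v = left
    for _ in range(n):
        v = matvec_left(v, summed)
    return sum(vi * ri for vi, ri in zip(v, right))

# Two symbols, bond dimension 2 (illustrative non-negative values).
cores = [
    [[0.9, 0.1], [0.2, 0.3]],  # core for symbol 0
    [[0.1, 0.4], [0.5, 0.6]],  # core for symbol 1
]
left, right = [1.0, 0.5], [1.0, 1.0]
n = 6

Z = normalizer(n, cores, left, right)
brute = sum(weight(seq, cores, left, right) for seq in product(range(2), repeat=n))
print("contracted:", Z, "brute force:", brute)
print("P(000000) =", weight((0,) * n, cores, left, right) / Z)
```

The brute-force sum visits 2^n sequences, while the contraction costs n matrix-vector products. Marginalizing a single position follows the same pattern, with the summed core used at that position only.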

As You Like It: Localization via Paired Comparisons
お気に召すまま:ペア比較によるローカリゼーション

Suppose that we wish to estimate a vector $\mathbf{x}$ from a set of binary paired comparisons of the form “$\mathbf{x}$ is closer to $\mathbf{p}$ than to $\mathbf{q}$” for various choices of vectors $\mathbf{p}$ and $\mathbf{q}$. The problem of estimating $\mathbf{x}$ from this type of observation arises in a variety of contexts, including nonmetric multidimensional scaling, “unfolding,” and ranking problems, often because it provides a powerful and flexible model of preference. We describe theoretical bounds for how well we can expect to estimate $\mathbf{x}$ under a randomized model for $\mathbf{p}$ and $\mathbf{q}$. We also present results for the case where the comparisons are noisy and subject to some degree of error. Additionally, we show that under a randomized model for $\mathbf{p}$ and $\mathbf{q}$, a suitable number of binary paired comparisons yield a stable embedding of the space of target vectors. Finally, we also show that we can achieve significant gains by adaptively changing the distribution for choosing $\mathbf{p}$ and $\mathbf{q}$.

さまざまなベクトル$\mathbf{p}$と$\mathbf{q}$の選択に対して、「$\mathbf{x}$は$\mathbf{q}$よりも$\mathbf{p}$に近い」という形式のバイナリ一対比較のセットからベクトル$\mathbf{x}$を推定するとします。このタイプの観察から$\mathbf{x}$を推定する問題は、非計量多次元尺度法、「展開」、ランキング問題など、さまざまな状況で発生します。これは、多くの場合、この方法が強力で柔軟な選好モデルを提供するためです。$\mathbf{p}$と$\mathbf{q}$のランダム化モデルで$\mathbf{x}$をどの程度正確に推定できるかについて、理論的な限界を説明します。また、比較にノイズが多く、ある程度の誤差が生じる場合の結果も示します。さらに、$\mathbf{p}$と$\mathbf{q}$のランダム化モデルでは、適切な数のバイナリペア比較によって、ターゲットベクトルの空間の安定した埋め込みが得られることを示します。最後に、$\mathbf{p}$と$\mathbf{q}$を選択するための分布を適応的に変更することで、大きな利益が得られることも示します。
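
Each paired comparison restricts $\mathbf{x}$ to a half-space: expanding $\|\mathbf{x}-\mathbf{p}\|^2 < \|\mathbf{x}-\mathbf{q}\|^2$ gives $2(\mathbf{q}-\mathbf{p})^\top\mathbf{x} < \|\mathbf{q}\|^2 - \|\mathbf{p}\|^2$. The sketch below checks that equivalence numerically. The function names are hypothetical, and this is only the geometric observation behind such models, not the estimator analyzed in the paper.

```python
def comparison_to_halfspace(p, q):
    """Return (a, b) such that 'x is closer to p than to q' holds iff a . x < b."""
    a = [2.0 * (qi - pi) for pi, qi in zip(p, q)]
    b = sum(qi * qi for qi in q) - sum(pi * pi for pi in p)
    return a, b

def closer_to_p(x, p, q):
    """Direct evaluation of the comparison via squared Euclidean distances."""
    dist_p = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
    dist_q = sum((xi - qi) ** 2 for xi, qi in zip(x, q))
    return dist_p < dist_q

def in_halfspace(x, a, b):
    return sum(ai * xi for ai, xi in zip(a, x)) < b

x = [0.2, 0.7]
pairs = [([0.0, 0.0], [1.0, 1.0]), ([1.0, 0.0], [0.0, 1.0])]
for p, q in pairs:
    a, b = comparison_to_halfspace(p, q)
    print(p, q, closer_to_p(x, p, q), in_halfspace(x, a, b))
```

A set of comparisons therefore confines $\mathbf{x}$ to an intersection of half-spaces, and each additional comparison can only shrink that region.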

Mode-wise Tensor Decompositions: Multi-dimensional Generalizations of CUR Decompositions
モードワイズテンソル分解:CUR分解の多次元一般化

Low rank tensor approximation is a fundamental tool in modern machine learning and data science. In this paper, we study the characterization, perturbation analysis, and an efficient sampling strategy for two primary tensor CUR approximations, namely Chidori and Fiber CUR. We characterize exact tensor CUR decompositions for low multilinear rank tensors. We also present theoretical error bounds of the tensor CUR approximations when (adversarial or Gaussian) noise appears. Moreover, we show that low cost uniform sampling is sufficient for tensor CUR approximations if the tensor has an incoherent structure. Empirical performance evaluations, with both synthetic and real-world datasets, establish the speed advantage of the tensor CUR approximations over other state-of-the-art low multilinear rank tensor approximations.

低ランクテンソル近似は、現代の機械学習とデータサイエンスにおける基本的なツールです。この論文では、2つの主要なテンソルCUR近似、つまりChidoriとFiber CURの特性評価、摂動解析、および効率的なサンプリング戦略について研究します。低多重線形ランクテンソルの正確なテンソルCUR分解を特徴付けます。また、(敵対的またはガウス的)ノイズが現れたときのテンソルCUR近似の理論的な誤差範囲も示します。さらに、テンソルがインコヒーレントな構造を持つ場合、テンソルCUR近似には低コストの均一サンプリングで十分であることを示します。合成データセットと実世界のデータセットの両方を使用した経験的性能評価により、他の最先端の低多重線形ランクテンソル近似に対するテンソルCUR近似の速度面での優位性が確立されます。
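
The matrix CUR decomposition that these tensor methods generalize can be shown in its simplest form. The sketch assumes a rank-one matrix and a single sampled row and column, so the middle factor reduces to the reciprocal of their intersection entry. The name `rank_one_cur` is hypothetical, and this is not the Chidori or Fiber CUR algorithm.

```python
def rank_one_cur(A, i, j):
    """Reconstruct a rank-one matrix from column j, row i, and their intersection.

    CUR form: A = C * U * R with C = A[:, j], R = A[i, :], U = 1 / A[i][j].
    """
    pivot = A[i][j]
    if pivot == 0:
        raise ValueError("the sampled row and column must intersect at a nonzero entry")
    column = [row[j] for row in A]
    sampled_row = list(A[i])
    return [[c * r / pivot for r in sampled_row] for c in column]

# Rank-one matrix: outer product of [1, 2, 3] and [4, 5, 6].
A = [[u * v for v in (4.0, 5.0, 6.0)] for u in (1.0, 2.0, 3.0)]

A_cur = rank_one_cur(A, i=1, j=2)
max_err = max(abs(A[r][c] - A_cur[r][c]) for r in range(3) for c in range(3))
print("max reconstruction error:", max_err)
```

Only one row and one column of the matrix are read, which is the source of the speed advantage when the low-rank assumption holds. For higher ranks the middle factor becomes the pseudoinverse of the intersection submatrix.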

mlr3pipelines – Flexible Machine Learning Pipelines in R
mlr3pipelines – R の柔軟な機械学習パイプライン

Recent years have seen a proliferation of ML frameworks. Such systems make ML accessible to non-experts, especially when combined with powerful parameter tuning and AutoML techniques. Modern, applied ML extends beyond direct learning on clean data, however, and needs an expressive language for the construction of complex ML workflows beyond simple pre- and post-processing. We present mlr3pipelines, an R framework which can be used to define linear and complex non-linear ML workflows as directed acyclic graphs. The framework is part of the mlr3 ecosystem, leveraging convenient resampling, benchmarking, and tuning components.

近年、MLフレームワークが急増しています。このようなシステムにより、特に強力なパラメーターチューニングやAutoML手法と組み合わせると、専門家でなくてもMLにアクセスできるようになります。しかし、最新の応用MLは、クリーンなデータに対する直接的な学習にとどまらず、単純な前処理と後処理を超えた複雑なMLワークフローを構築するための表現力豊かな言語が必要です。線形および複雑な非線形MLワークフローを有向非巡回グラフとして定義するために使用できるRフレームワークであるmlr3pipelinesを紹介します。このフレームワークはmlr3エコシステムの一部であり、便利なリサンプリング、ベンチマーク、チューニングコンポーネントを活用しています。

Benchmarking Unsupervised Object Representations for Video Sequences
ビデオシーケンスの教師なしオブジェクト表現のベンチマーク

Perceiving the world in terms of objects and tracking them through time is a crucial prerequisite for reasoning and scene understanding. Recently, several methods have been proposed for unsupervised learning of object-centric representations. However, since these models were evaluated on different downstream tasks, it remains unclear how they compare in terms of basic perceptual abilities such as detection, figure-ground segmentation and tracking of objects. To close this gap, we design a benchmark with four data sets of varying complexity and seven additional test sets featuring challenging tracking scenarios relevant for natural videos. Using this benchmark, we compare the perceptual abilities of four object-centric approaches: ViMON, a video-extension of MONet, based on recurrent spatial attention, OP3, which exploits clustering via spatial mixture models, as well as TBA and SCALOR, which use explicit factorization via spatial transformers. Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking than the spatial transformer based architectures. We also observe that none of the methods are able to gracefully handle the most challenging tracking scenarios despite their synthetic nature, suggesting that our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.

物体の観点から世界を認識し、それらを時間とともに追跡することは、推論とシーン理解の重要な前提条件です。最近、物体中心の表現の教師なし学習のためのいくつかの方法が提案されています。ただし、これらのモデルは異なる下流タスクで評価されたため、物体の検出、図と地のセグメンテーション、追跡などの基本的な知覚能力の点でどのように比較されるかは不明です。このギャップを埋めるために、さまざまな複雑さの4つのデータセットと、自然なビデオに関連する困難な追跡シナリオを特徴とする7つの追加テストセットを含むベンチマークを設計します。このベンチマークを使用して、4つの物体中心のアプローチの知覚能力を比較します。ViMON (再帰空間注意に基づくMONetのビデオ拡張)、空間混合モデルによるクラスタリングを活用するOP3、および空間トランスフォーマーによる明示的な因数分解を使用するTBAとSCALOR。私たちの結果は、制約のない潜在表現を持つアーキテクチャが、物体の検出、セグメンテーション、追跡に関して、空間トランスフォーマーベースのアーキテクチャよりも強力な表現を学習することを示唆しています。また、合成的な性質にもかかわらず、どの方法も最も困難な追跡シナリオを適切に処理できないこともわかりました。これは、私たちのベンチマークがより堅牢なオブジェクト中心のビデオ表現を学習するための有益なガイダンスを提供できる可能性があることを示唆しています。

A Probabilistic Interpretation of Self-Paced Learning with Applications to Reinforcement Learning
自己ペース学習の確率的解釈と強化学習への応用

Across machine learning, the use of curricula has shown strong empirical potential to improve learning from data by avoiding local optima of training objectives. For reinforcement learning (RL), curricula are especially interesting, as the underlying optimization has a strong tendency to get stuck in local optima due to the exploration-exploitation trade-off. Recently, a number of approaches for an automatic generation of curricula for RL have been shown to increase performance while requiring less expert knowledge compared to manually designed curricula. However, these approaches are seldom investigated from a theoretical perspective, preventing a deeper understanding of their mechanics. In this paper, we present an approach for automated curriculum generation in RL with a clear theoretical underpinning. More precisely, we formalize the well-known self-paced learning paradigm as inducing a distribution over training tasks, which trades off between task complexity and the objective to match a desired task distribution. Experiments show that training on this induced distribution helps to avoid poor local optima across RL algorithms in different tasks with uninformative rewards and challenging exploration requirements.

機械学習全体において、カリキュラムの使用は、トレーニング目標の局所最適を回避することでデータからの学習を改善するという実証的な可能性を強く示しています。強化学習(RL)の場合、基礎となる最適化は探索と活用のトレードオフにより局所最適に陥る傾向が強いため、カリキュラムは特に興味深いものです。最近、RLのカリキュラムを自動生成するいくつかのアプローチは、手動で設計されたカリキュラムと比較して専門知識をあまり必要とせずにパフォーマンスを向上させることが示されています。ただし、これらのアプローチは理論的な観点からはほとんど調査されておらず、そのメカニズムの深い理解を妨げています。この論文では、明確な理論的根拠を持つRLでのカリキュラムの自動生成のアプローチを紹介します。より正確には、よく知られている自己ペース学習パラダイムを、トレーニングタスクに分布を誘導するものとして形式化し、タスクの複雑さと目的の間でトレードオフを行い、望ましいタスク分布に一致させます。実験では、この誘導された分布でのトレーニングは、情報のない報酬と困難な探索要件を持つさまざまなタスクで、RLアルゴリズム全体の不十分な局所最適を回避するのに役立つことが示されています。

Alibi Explain: Algorithms for Explaining Machine Learning Models
Alibi Explain:機械学習モデルを説明するためのアルゴリズム

We introduce Alibi Explain, an open-source Python library for explaining predictions of machine learning models (https://github.com/SeldonIO/alibi). The library features state-of-the-art explainability algorithms for classification and regression models. The algorithms cover both the model-agnostic (black-box) and model-specific (white-box) setting, cater for multiple data types (tabular, text, images) and explanation scope (local and global explanations). The library exposes a unified API enabling users to work with explanations in a consistent way. Alibi adheres to best development practices featuring extensive testing of code correctness and algorithm convergence in a continuous integration environment. The library comes with extensive documentation of both usage and theoretical background of methods, and a suite of worked end-to-end use cases. Alibi aims to be a production-ready toolkit with integrations into machine learning deployment platforms such as Seldon Core and KFServing, and distributed explanation capabilities using Ray.

私たちは、機械学習モデルの予測を説明するオープンソースのPythonライブラリであるAlibi Explainを紹介します(https://github.com/SeldonIO/alibi)。このライブラリには、分類モデルと回帰モデル用の最先端の説明可能性アルゴリズムが搭載されています。このアルゴリズムは、モデルに依存しない(ブラックボックス)設定とモデル固有の(ホワイトボックス)設定の両方をカバーし、複数のデータタイプ(表形式、テキスト、画像)と説明範囲(ローカル説明とグローバル説明)に対応します。このライブラリは、ユーザーが一貫した方法で説明を操作できるように、統合APIを公開しています。Alibiは、継続的インテグレーション環境でのコードの正確性とアルゴリズムの収束の広範なテストを特徴とするベストプラクティスに準拠しています。このライブラリには、メソッドの使用方法と理論的背景の両方に関する広範なドキュメントと、エンドツーエンドのユースケースのスイートが付属しています。Alibiは、Seldon CoreやKFServingなどの機械学習展開プラットフォームへの統合と、Rayを使用した分散説明機能を備えた、本番環境対応のツールキットを目指しています。

Improved Shrinkage Prediction under a Spiked Covariance Structure
スパイク共分散構造下での収縮予測の改善

We develop a novel shrinkage rule for prediction in a high-dimensional non-exchangeable hierarchical Gaussian model with an unknown spiked covariance structure. We propose a family of priors for the mean parameter, governed by a power hyper-parameter, which encompasses independent to highly dependent scenarios. Corresponding to popular loss functions such as quadratic, generalized absolute, and Linex losses, these prior models induce a wide class of shrinkage predictors that involve quadratic forms of smooth functions of the unknown covariance. By using uniformly consistent estimators of these quadratic forms, we propose an efficient procedure for evaluating these predictors which outperforms factor model based direct plug-in approaches. We further improve our predictors by considering possible reduction in their variability through a novel coordinate-wise shrinkage policy that only uses covariance level information and can be adaptively tuned using the sample eigen structure. Finally, we extend our disaggregate model based methodology to prediction in aggregate models. We propose an easy-to-implement functional substitution method for predicting linearly aggregated targets and establish asymptotic optimality of our proposed procedure. We present simulation experiments as well as real data examples illustrating the efficacy of the proposed method.

私たちは、未知のスパイク共分散構造を持つ高次元の交換不可能な階層的ガウスモデルにおける予測のための新しい収縮ルールを開発しました。独立シナリオから高度に依存するシナリオまでを網羅する、べき乗ハイパーパラメータによって制御される平均パラメータの事前分布ファミリーを提案します。2次損失、一般化絶対損失、Linex損失などの一般的な損失関数に対応して、これらの事前モデルは、未知の共分散の滑らかな関数の2次形式を含む、幅広いクラスの収縮予測子を誘導します。これらの2次形式の均一に一貫した推定値を使用することで、因子モデルに基づく直接プラグインアプローチよりも優れた、これらの予測子を評価するための効率的な手順を提案します。共分散レベルの情報のみを使用し、サンプルの固有構造を使用して適応的に調整できる新しい座標単位の収縮ポリシーを通じて、変動性を低減することを検討することで、予測子をさらに改善します。最後に、非集計モデルに基づく方法論を集計モデルでの予測に拡張します。線形集約ターゲットを予測するための実装が容易な機能置換法を提案し、提案手順の漸近最適性を確立します。提案方法の有効性を示すシミュレーション実験と実際のデータ例を紹介します。

A Sharp Blockwise Tensor Perturbation Bound for Orthogonal Iteration
直交反復のための鋭いブロックワイズテンソル摂動限界

In this paper, we develop novel perturbation bounds for the higher-order orthogonal iteration (HOOI). Under mild regularity conditions, we establish blockwise tensor perturbation bounds for HOOI with guarantees for both tensor reconstruction in Hilbert-Schmidt norm $\|\widehat{\mathcal{T}} - \mathcal{T} \|_{\rm HS}$ and mode-$k$ singular subspace estimation in Schatten-$q$ norm $\| \sin \Theta (\widehat{U}_k, U_k) \|_q$ for any $q \geq 1$. We show the upper bounds of mode-$k$ singular subspace estimation are unilateral and converge linearly to a quantity characterized by blockwise errors of the perturbation and signal strength. For the tensor reconstruction error bound, we express the bound through a simple quantity $\xi$, which depends only on perturbation and the multilinear rank of the underlying signal. Rate matching deterministic lower bound for tensor reconstruction, which demonstrates the optimality of HOOI, is also provided. Furthermore, we prove that one-step HOOI (i.e., HOOI with only a single iteration) is also optimal in terms of tensor reconstruction and can be used to lower the computational cost. The perturbation results are also extended to the case that only partial modes of $\mathcal{T}$ have low-rank structure. We support our theoretical results by extensive numerical studies. Finally, we apply the novel perturbation bounds of HOOI on two applications, tensor denoising and tensor co-clustering, from machine learning and statistics, which demonstrates the superiority of the new perturbation results.

この論文では、高次直交反復法(HOOI)の新しい摂動境界を開発します。軽度の正則性条件下で、ヒルベルト・シュミットノルム$\|\widehat{\mathcal{T}} - \mathcal{T} \|_{\rm HS}$でのテンソル再構成と、任意の$q \geq 1$に対するSchatten-$q$ノルム$\| \sin \Theta (\widehat{U}_k, U_k) \|_q$でのモード$k$特異部分空間推定の両方を保証するHOOIのブロック単位のテンソル摂動境界を確立します。モード$k$特異部分空間推定の上限は片側性であり、摂動と信号強度のブロック単位の誤差によって特徴付けられる量に線形に収束することを示します。テンソル再構成の誤差境界については、摂動と基礎信号の多重線形ランクのみに依存する単純な量$\xi$で境界を表現します。また、HOOIの最適性を証明する、テンソル再構成のレートマッチング決定論的下限も提供されます。さらに、1ステップHOOI (つまり、1回の反復のみのHOOI)もテンソル再構成の点で最適であり、計算コストを下げるために使用できることを証明します。摂動の結果は、$\mathcal{T}$の部分モードのみが低ランク構造を持つ場合にも拡張されます。広範な数値研究によって理論的結果を裏付けます。最後に、機械学習と統計からの2つのアプリケーション、テンソルノイズ除去とテンソルコクラスタリングにHOOIの新しい摂動境界を適用し、新しい摂動結果の優位性を示します。

Conditional independences and causal relations implied by sets of equations
方程式のセットによって暗示される条件付き独立性と因果関係

Real-world complex systems are often modelled by sets of equations with endogenous and exogenous variables. What can we say about the causal and probabilistic aspects of variables that appear in these equations without explicitly solving the equations? We make use of Simon’s causal ordering algorithm (Simon, 1953) to construct a causal ordering graph and prove that it expresses the effects of soft and perfect interventions on the equations under certain unique solvability assumptions. We further construct a Markov ordering graph and prove that it encodes conditional independences in the distribution implied by the equations with independent random exogenous variables, under a similar unique solvability assumption. We discuss how this approach reveals and addresses some of the limitations of existing causal modelling frameworks, such as causal Bayesian networks and structural causal models.

実世界の複雑系は、多くの場合、内生変数と外生変数を持つ方程式のセットによってモデル化されます。これらの方程式に現れる変数の因果的および確率的側面について、方程式を明示的に解かずに何を言うことができるでしょうか?サイモンの因果順序付けアルゴリズム(Simon、1953)を使用して因果順序グラフを作成し、それが特定の一意可解性の仮定の下で、方程式に対するソフト介入および完全介入の効果を表していることを証明します。さらに、マルコフ順序グラフを構築し、同様の一意可解性の仮定の下で、独立したランダムな外生変数を持つ方程式によって暗示される分布の条件付き独立性をエンコードすることを証明します。このアプローチが、因果ベイジアンネットワークや構造的因果モデルなど、既存の因果モデリングフレームワークのいくつかの限界を明らかにし、対処する方法について説明します。

Prediction Under Latent Factor Regression: Adaptive PCR, Interpolating Predictors and Beyond
潜在因子回帰下での予測:適応型PCR、補間予測子、そしてその先へ

This work is devoted to the finite sample prediction risk analysis of a class of linear predictors of a response $Y\in \mathbb{R}$ from a high-dimensional random vector $X\in \mathbb{R}^p$ when $(X,Y)$ follows a latent factor regression model generated by an unobservable latent vector $Z$ of dimension less than $p$. Our primary contribution is in establishing finite sample risk bounds for prediction with the ubiquitous Principal Component Regression (PCR) method, under the factor regression model, with the number of principal components adaptively selected from the data, a form of theoretical guarantee that is surprisingly lacking from the PCR literature. To accomplish this, we prove a master theorem that establishes a risk bound for a large class of predictors, including the PCR predictor as a special case. This approach has the benefit of providing a unified framework for the analysis of a wide range of linear prediction methods, under the factor regression setting. In particular, we use our main theorem to recover known risk bounds for the minimum-norm interpolating predictor, which has received renewed attention in the past two years, and a prediction method tailored to a subclass of factor regression models with identifiable parameters. This model-tailored method can be interpreted as prediction via clusters with latent centers. To address the problem of selecting among a set of candidate predictors, we analyze a simple model selection procedure based on data-splitting, providing an oracle inequality under the factor model to prove that the performance of the selected predictor is close to the optimal candidate. We conclude with a detailed simulation study to support and complement our theoretical results.

この研究では、高次元ランダムベクトル$X\in \mathbb{R}^p$からの応答$Y\in \mathbb{R}$の線形予測子のクラスの有限サンプル予測リスク分析に焦点を当てています。これは、$(X,Y)$が次元$p$未満の観測不可能な潜在ベクトル$Z$によって生成された潜在因子回帰モデルに従う場合です。私たちの主な貢献は、データから適応的に選択される主成分の数を持つ因子回帰モデルの下で、広く普及している主成分回帰(PCR)法による予測の有限サンプルリスク境界を確立したことです。これは、PCRの文献には驚くほど欠けている理論的保証の形式です。これを実現するために、PCR予測子を特別なケースとして含む、大規模な予測子のクラスに対するリスク境界を確立するマスター定理を証明します。このアプローチには、因子回帰設定の下で、幅広い線形予測方法の分析のための統一されたフレームワークを提供するという利点があります。特に、過去2年間で新たな注目を集めている最小ノルム補間予測子の既知のリスク境界と、識別可能なパラメーターを持つ因子回帰モデルのサブクラスに合わせた予測方法の回復に主定理を使用します。このモデルに合わせた方法は、潜在中心を持つクラスターによる予測として解釈できます。候補予測子のセットから選択する問題に対処するために、データ分割に基づく単純なモデル選択手順を分析し、因子モデルの下でオラクル不等式を提供して、選択された予測子のパフォーマンスが最適な候補に近いことを証明します。最後に、理論的結果を裏付け、補完する詳細なシミュレーション研究を行います。

Locally Private k-Means Clustering
ローカルプライベートk-meansクラスタリング

We design a new algorithm for the Euclidean $k$-means problem that operates in the local model of differential privacy. Unlike in the non-private literature, differentially private algorithms for the $k$-means objective incur both additive and multiplicative errors. Our algorithm significantly reduces the additive error while keeping the multiplicative error the same as in previous state-of-the-art results. Specifically, on a database of size $n$, our algorithm guarantees $O(1)$ multiplicative error and $\approx n^{1/2+a}$ additive error for an arbitrarily small constant $a>0$. All previous algorithms in the local model had additive error $\approx n^{2/3+a}$. Our techniques extend to $k$-median clustering. We show that the additive error we obtain is almost optimal in terms of its dependency on the database size $n$. Specifically, we give a simple lower bound showing that every locally-private algorithm for the $k$-means objective must have additive error at least $\approx\sqrt{n}$.

私たちは、差分プライバシーのローカルモデルで動作するユークリッド$k$-means問題のための新しいアルゴリズムを設計します。非プライベートな文献とは異なり、$k$-means目的関数に対する差分プライベートアルゴリズムは、加法誤差と乗法誤差の両方を伴います。私たちのアルゴリズムは、乗法誤差を以前の最先端の結果と同じに保ちながら、加法誤差を大幅に低減します。具体的には、サイズ$n$のデータベースでは、アルゴリズムは、任意の小さな定数$a>0$に対して$O(1)$の乗法誤差と$\approx n^{1/2+a}$の加法誤差を保証します。ローカルモデルにおける以前のすべてのアルゴリズムの加法誤差は$\approx n^{2/3+a}$でした。私たちの手法は、$k$-medianクラスタリングにも拡張できます。得られる加法誤差は、データベースサイズ$n$への依存性という点でほぼ最適であることを示します。具体的には、$k$-means目的関数に対するすべてのローカルプライベートアルゴリズムが少なくとも$\approx\sqrt{n}$の加法誤差を持つ必要があることを示す単純な下限を与えます。

Doubly infinite residual neural networks: a diffusion process approach
二重無限残差ニューラルネットワーク:拡散過程アプローチ

Modern neural networks featuring a large number of layers (depth) and units per layer (width) have achieved a remarkable performance across many domains. While there exists a vast literature on the interplay between infinitely wide neural networks and Gaussian processes, little is known about analogous interplays with respect to infinitely deep neural networks. Neural networks with independent and identically distributed (i.i.d.) initializations exhibit undesirable forward and backward propagation properties as the number of layers increases, e.g., vanishing dependency on the input, and perfectly correlated outputs for any two inputs. To overcome these drawbacks, Peluchetti and Favaro (2020) considered fully-connected residual networks (ResNets) with network’s parameters initialized by means of distributions that shrink as the number of layers increases, thus establishing an interplay between infinitely deep ResNets and solutions to stochastic differential equations, i.e. diffusion processes, and showing that infinitely deep ResNets do not suffer from undesirable forward-propagation properties. In this paper, we review the results of Peluchetti and Favaro (2020), extending them to convolutional ResNets, and we establish analogous backward-propagation results, which directly relate to the problem of training fully-connected deep ResNets. Then, we investigate the more general setting of doubly infinite neural networks, where both network’s width and network’s depth grow unboundedly. We focus on doubly infinite fully-connected ResNets, for which we consider i.i.d. initializations. Under this setting, we show that the dynamics of quantities of interest converge, at initialization, to deterministic limits. This allows us to provide analytical expressions for inference, both in the case of weakly trained and fully trained ResNets. Our results highlight a limited expressive power of doubly infinite ResNets when the unscaled network’s parameters are i.i.d. and the residual blocks are shallow.

多数の層(深さ)と層あたりのユニット(幅)を特徴とする最新のニューラルネットワークは、多くの領域で優れたパフォーマンスを実現しています。無限に広いニューラルネットワークとガウス過程の相互作用に関する膨大な文献が存在する一方で、無限に深いニューラルネットワークに関する類似の相互作用についてはほとんど知られていません。独立かつ同一に分布する(i.i.d.)初期化を持つニューラルネットワークは、層の数が増えるにつれて、入力への依存性が消えたり、任意の2つの入力に対して完全に相関した出力が出たりするなど、望ましくない順方向および逆方向の伝播特性を示します。これらの欠点を克服するために、PeluchettiとFavaro (2020)は、層の数が増えるにつれて縮小する分布によってネットワークパラメーターが初期化される完全接続残差ネットワーク(ResNet)を検討しました。これにより、無限深ResNetと確率微分方程式の解(拡散プロセス)との相互作用が確立され、無限深ResNetは望ましくない順方向伝播特性の影響を受けないことが示されました。この論文では、PeluchettiとFavaro (2020)の結果をレビューし、畳み込みResNetに拡張して、完全接続深層ResNetのトレーニングの問題に直接関連する類似の逆方向伝播結果を確立します。次に、ネットワークの幅とネットワークの深さの両方が無制限に増加する、より一般的な二重無限ニューラルネットワークの設定を調査します。二重無限完全接続ResNetに焦点を当て、i.i.d.初期化を検討します。この設定では、関心のある量のダイナミクスが初期化時に決定論的限界に収束することを示しています。これにより、弱くトレーニングされたResNetと完全にトレーニングされたResNetの両方のケースで、推論のための解析式を提供できます。私たちの結果は、スケーリングされていないネットワークのパラメーターがi.i.d.で、残差ブロックが浅い場合、二重無限ResNetの表現力が限られていることを強調しています。

Achieving Fairness in the Stochastic Multi-Armed Bandit Problem
確率的多腕バンディット問題における公平性の達成

We study an interesting variant of the stochastic multi-armed bandit problem, which we call the Fair-MAB problem, where, in addition to the objective of maximizing the sum of expected rewards, the algorithm also needs to ensure that at any time, each arm is pulled at least a pre-specified fraction of times. We investigate the interplay between learning and fairness in terms of a pre-specified vector denoting the fractions of guaranteed pulls. We define a fairness-aware regret, which we call $r$-Regret, that takes into account the above fairness constraints and extends the conventional notion of regret in a natural way. Our primary contribution is to obtain a complete characterization of a class of Fair-MAB algorithms via two parameters: the unfairness tolerance and the learning algorithm used as a black-box. For this class of algorithms, we provide a fairness guarantee that holds uniformly over time, irrespective of the chosen learning algorithm. Further, when the learning algorithm is UCB1, we show that our algorithm achieves constant $r$-Regret for a large enough time horizon. Finally, we analyze the cost of fairness in terms of the conventional notion of regret. We conclude by experimentally validating our theoretical results.

私たちは、確率的多腕バンディット問題の興味深い変種であるFair-MAB問題を検討します。この問題では、期待される報酬の合計を最大化する目的に加えて、アルゴリズムは、各アームが常に少なくとも事前に指定された割合の回数引っ張られるようにする必要があります。私たちは、保証された引きの割合を示す事前に指定されたベクトルの観点から、学習と公平性の相互作用を調査します。私たちは、上記の公平性制約を考慮し、自然な方法で従来の後悔の概念を拡張する、公平性を考慮した後悔($r$-Regretと呼ぶ)を定義します。我々の主な貢献は、不公平性許容度とブラックボックスとして使用される学習アルゴリズムという2つのパラメーターを介して、Fair-MABアルゴリズムのクラスの完全な特性評価を取得することです。このクラスのアルゴリズムでは、選択された学習アルゴリズムに関係なく、時間の経過とともに均一に保持される公平性保証を提供します。さらに、学習アルゴリズムがUCB1の場合、十分に長い時間範囲でアルゴリズムが一定の$r$-Regretを達成することを示します。最後に、従来の後悔の概念に基づいて公平性のコストを分析します。最後に、理論的結果を実験的に検証して結論を出します。
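
As a rough illustration of the setting described above, the following Python sketch wraps UCB1 with a fairness check. Before each round it computes, for every arm, the shortfall between its guaranteed number of pulls and its actual pulls, and it forces a pull of the most under-served arm whenever that shortfall exceeds the unfairness tolerance. The names `fair_ucb1` and `ucb_index`, the Bernoulli rewards, and the exact shortfall rule are assumptions made for this sketch; the abstract does not spell out the algorithm, so this should not be read as the paper's procedure.

```python
import math
import random


def ucb_index(reward_sum, pulls, t):
    """UCB1 index; an arm that was never pulled gets an infinite index."""
    if pulls == 0:
        return float("inf")
    return reward_sum / pulls + math.sqrt(2.0 * math.log(t) / pulls)


def fair_ucb1(means, fractions, alpha, horizon, seed=0):
    """Hypothetical fairness wrapper around UCB1 (illustrative only).

    means     -- Bernoulli success probability of each arm (simulation input)
    fractions -- guaranteed pull fraction r_i per arm, assumed to sum to <= 1
    alpha     -- unfairness tolerance
    """
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    reward_sums = [0.0] * k
    for t in range(1, horizon + 1):
        # Shortfall of each arm after the first t - 1 rounds.
        shortfall = [fractions[i] * (t - 1) - pulls[i] for i in range(k)]
        worst = max(range(k), key=lambda i: shortfall[i])
        if shortfall[worst] > alpha:
            arm = worst  # fairness step: serve the most under-pulled arm
        else:
            # learning step: ordinary UCB1 choice
            arm = max(range(k), key=lambda i: ucb_index(reward_sums[i], pulls[i], t))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        reward_sums[arm] += reward
    return pulls
```

For example, `fair_ucb1([0.9, 0.5, 0.2], [0.1, 0.2, 0.3], alpha=1.0, horizon=2000)` returns the pull counts per arm. In this sketch the weakest arm still receives roughly its guaranteed 30% share, while the remaining rounds are left to UCB1.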

Replica Exchange for Non-Convex Optimization
非凸最適化のためのレプリカ交換

Gradient descent (GD) is known to converge quickly for convex objective functions, but it can be trapped at local minima. On the other hand, Langevin dynamics (LD) can explore the state space and find global minima, but in order to give accurate estimates, LD needs to run with a small discretization step size and weak stochastic force, which in general slow down its convergence. This paper shows that these two algorithms can “collaborate” through a simple exchange mechanism, in which they swap their current positions if LD yields a lower objective function. This idea can be seen as the singular limit of the replica-exchange technique from the sampling literature. We show that this new algorithm converges to the global minimum linearly with high probability, assuming the objective function is strongly convex in a neighborhood of the unique global minimum. By replacing gradients with stochastic gradients, and adding a proper threshold to the exchange mechanism, our algorithm can also be used in online settings. We also study non-swapping variants of the algorithm, which achieve similar performance. We further verify our theoretical results through some numerical experiments and observe superior performances of the proposed algorithm over running GD or LD alone.

勾配降下法(GD)は凸目的関数に対して急速に収束することが知られていますが、局所的最小値に陥ることがあります。一方、ランジュバン動力学(LD)は状態空間を探索して大域的最小値を見つけることができますが、正確な推定値を得るためには、LDは小さな離散化ステップサイズと弱い確率的力で実行する必要があり、一般的に収束が遅くなります。この論文では、これら2つのアルゴリズムが、LDの目的関数が低い場合に現在の位置を交換するという単純な交換メカニズムを通じて「連携」できることを示しています。この考え方は、サンプリング文献のレプリカ交換手法の特異極限として考えることができます。目的関数が唯一のグローバル最小値の近傍で強く凸であると仮定すると、この新しいアルゴリズムは高い確率でグローバル最小値に線形収束することを示します。勾配を確率的勾配に置き換え、交換メカニズムに適切なしきい値を追加することで、このアルゴリズムをオンライン設定で使用することもできます。また、同様のパフォーマンスを実現する、アルゴリズムの非スワッピングバリアントについても検討します。さらに、いくつかの数値実験を通じて理論結果を検証し、GDまたはLDのみを実行する場合よりも提案アルゴリズムの優れたパフォーマンスを確認しました。
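
The exchange mechanism described above can be sketched in a few lines of Python. The one-dimensional double-well objective, the step size, the temperature, and the function names are all assumptions chosen for this illustration, not settings taken from the paper. One copy follows plain gradient descent, the other follows discretized Langevin dynamics, and the two swap positions whenever the Langevin copy sits at a lower objective value.

```python
import math
import random


def objective(x):
    """Tilted double well: local minimum near x = 0.96, global minimum near x = -1.04."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def gradient(x):
    return 4.0 * x * (x * x - 1.0) + 0.3


def gradient_descent(x0, steps=20000, step_size=0.01):
    """Plain GD; started in the right-hand well it stays there."""
    x = x0
    for _ in range(steps):
        x -= step_size * gradient(x)
    return x


def replica_exchange(x0, steps=20000, step_size=0.01, temperature=0.5, seed=0):
    """GD copy x and Langevin copy y that swap when y has the lower objective."""
    rng = random.Random(seed)
    x = x0  # gradient descent copy
    y = x0  # Langevin dynamics copy
    noise_scale = math.sqrt(2.0 * step_size * temperature)
    for _ in range(steps):
        x -= step_size * gradient(x)
        y += -step_size * gradient(y) + noise_scale * rng.gauss(0.0, 1.0)
        if objective(y) < objective(x):
            x, y = y, x  # exchange positions
    return x
```

Starting both routines from `x0 = 1.0`, `gradient_descent` settles in the nearby local minimum. `replica_exchange` is expected to end in the deeper left-hand well, because the noisy copy crosses the barrier and hands its position to the descent copy.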

Unlinked Monotone Regression
リンクされていない単調回帰

We consider so-called univariate unlinked (sometimes “decoupled,” or “shuffled”) regression when the unknown regression curve is monotone. In standard monotone regression, one observes a pair $(X,Y)$ where a response $Y$ is linked to a covariate $X$ through the model $Y= m_0(X) + \epsilon$, with $m_0$ the (unknown) monotone regression function and $\epsilon$ the unobserved error (assumed to be independent of $X$). In the unlinked regression setting one gets only to observe a vector of realizations from both the response $Y$ and from the covariate $X$ where now $Y \stackrel{d}{=} m_0(X) + \epsilon$. There is no (observed) pairing of $X$ and $Y$. Despite this, it is actually still possible to derive a consistent non-parametric estimator of $m_0$ under the assumption of monotonicity of $m_0$ and knowledge of the distribution of the noise $\epsilon$. In this paper, we establish an upper bound on the rate of convergence of such an estimator under minimal assumption on the distribution of the covariate $X$. We discuss extensions to the case in which the distribution of the noise is unknown. We develop a second order algorithm for its computation, and we demonstrate its use on synthetic data. Finally, we apply our method (in a fully data driven way, without knowledge of the error distribution) on longitudinal data from the US Consumer Expenditure Survey.

私たちは、未知の回帰曲線が単調である場合、いわゆる単変量非連結（「分離」または「シャッフル」と呼ばれることもある）回帰を検討します。標準的な単調回帰では、応答$Y$がモデル$Y= m_0(X) + \epsilon$を通じて共変量$X$に連結されているペア$(X,Y)$が観測されます。ここで、$m_0$は（未知の）単調回帰関数、$\epsilon$は観測されていない誤差（$X$とは独立していると想定されます）です。非連結回帰設定では、応答$Y$と共変量$X$の両方からの実現のベクトルのみが観測され、$Y \stackrel{d}{=} m_0(X) + \epsilon$となります。$X$と$Y$の（観測された）ペアリングはありません。それにもかかわらず、実際には、$m_0$の単調性とノイズ$\epsilon$の分布に関する知識を前提として、$m_0$の一貫したノンパラメトリック推定値を導出することは可能です。この論文では、共変量$X$の分布に関する最小限の仮定の下で、このような推定値の収束率の上限を確立します。ノイズの分布が不明な場合への拡張について説明します。その計算用の2次アルゴリズムを開発し、合成データでの使用法を示します。最後に、米国消費者支出調査の縦断データにこのメソッドを適用します(完全にデータ駆動型で、誤差分布を知らなくてもよい)。

Optimal Rates of Distributed Regression with Imperfect Kernels
不完全カーネルによる分散回帰の最適レート

Distributed machine learning systems have been receiving increasing attention for their efficiency in processing large scale data. Many distributed frameworks have been proposed for different machine learning tasks. In this paper, we study the distributed kernel regression via the divide and conquer approach. The learning process consists of three stages. Firstly, the data is partitioned into multiple subsets. Then a base kernel regression algorithm is applied to each subset to learn a local regression model. Finally the local models are averaged to generate the final regression model for the purpose of predictive analytics or statistical inference. This approach has been proved asymptotically minimax optimal if the kernel is perfectly selected so that the true regression function lies in the associated reproducing kernel Hilbert space. However, this is usually, if not always, impractical because kernels that can only be selected via prior knowledge or a tuning process are hardly perfect. Instead it is more common that the kernel is good enough but imperfect in the sense that the true regression can be well approximated by but does not lie exactly in the kernel space. We show distributed kernel regression can still achieve capacity independent optimal rate in this case. To this end, we first establish a general framework that allows one to analyze distributed regression with response weighted base algorithms by bounding the error of such algorithms on a single data set, provided that the error bounds have factored the impact of unexplained variance of the response variable. Then we perform a leave one out analysis of the kernel ridge regression and bias corrected kernel ridge regression, which in combination with the aforementioned framework allows us to derive sharp error bounds and capacity independent optimal rates for the associated distributed kernel regression algorithms. As a byproduct of the thorough analysis, we also prove the kernel ridge regression can achieve rates faster than $O(N^{-1})$ (where $N$ is the sample size) in the noise free setting which, to our best knowledge, are first observed and novel in regression learning.

分散機械学習システムは、大規模データの処理効率の高さから、ますます注目を集めています。さまざまな機械学習タスク向けに、多くの分散フレームワークが提案されています。この論文では、分割統治アプローチによる分散カーネル回帰について検討します。学習プロセスは3つの段階で構成されています。まず、データが複数のサブセットに分割されます。次に、基本カーネル回帰アルゴリズムが各サブセットに適用され、ローカル回帰モデルが学習されます。最後に、ローカルモデルが平均化され、予測分析または統計的推論の目的で最終的な回帰モデルが生成されます。このアプローチは、カーネルが完全に選択され、真の回帰関数が関連する再生カーネルヒルベルト空間内にある場合、漸近的にミニマックス最適であることが証明されています。ただし、事前の知識またはチューニングプロセスによってのみ選択できるカーネルは完璧とは言い難いため、これは通常、常にではないにしても非実用的です。代わりに、カーネルは十分に優れていますが、真の回帰はカーネル空間で十分に近似できるが、カーネル空間内に正確には存在しないという意味で不完全であることがより一般的です。私たちは、分散カーネル回帰がこの場合も容量に依存しない最適レートを達成できることを示します。このために、まず、応答重み付けベースアルゴリズムを使用して分散回帰を分析できる一般的なフレームワークを確立します。このフレームワークでは、誤差範囲に応答変数の説明できない分散の影響が考慮されていることを条件として、そのようなアルゴリズムの誤差を単一のデータセットに制限します。次に、カーネルリッジ回帰とバイアス補正カーネルリッジ回帰のleave one out分析を実行します。これを前述のフレームワークと組み合わせることで、関連する分散カーネル回帰アルゴリズムの明確な誤差範囲と容量に依存しない最適レートを導出できます。徹底的な分析の副産物として、カーネルリッジ回帰がノイズのない設定で$O(N^{-1})$（ここで$N$はサンプルサイズ）よりも高速なレートを達成できることも証明しました。これは、我々の知る限り、回帰学習では初めて観察され、新しいものです。
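
The three stages described above (partition, local fit, average) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: a Gaussian kernel, ordinary kernel ridge regression as the base algorithm, an interleaved partition, and hypothetical names such as `distributed_krr`. The bias corrected and response weighted variants analyzed in the paper are not reproduced here.

```python
import math


def gaussian_kernel(a, b, bandwidth):
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_krr(xs, ys, bandwidth, ridge):
    """Kernel ridge regression on one subset: solve (K + n * ridge * I) c = y."""
    n = len(xs)
    gram = [
        [gaussian_kernel(xs[i], xs[j], bandwidth) + (n * ridge if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    coef = solve_linear(gram, ys)
    return lambda x: sum(c * gaussian_kernel(x, xi, bandwidth) for c, xi in zip(coef, xs))


def distributed_krr(xs, ys, num_subsets, bandwidth=0.2, ridge=1e-3):
    """Stage 1: partition. Stage 2: fit a local model per subset. Stage 3: average."""
    local_models = [
        fit_krr(xs[s::num_subsets], ys[s::num_subsets], bandwidth, ridge)
        for s in range(num_subsets)
    ]
    return lambda x: sum(model(x) for model in local_models) / len(local_models)
```

Each local solve works on a matrix of size n/m instead of n, which is where the computational saving of the divide and conquer approach comes from. The averaged model can be queried like any function, e.g. `distributed_krr(xs, ys, 3)(0.25)`.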

Black-Box Reductions for Zeroth-Order Gradient Algorithms to Achieve Lower Query Complexity
クエリ計算量を低減するためのゼロ次勾配アルゴリズムのブラックボックス帰着

Zeroth-order (ZO) optimization has been the key technique for various machine learning applications especially for black-box adversarial attack, where models need to be learned in a gradient-free manner. Although many ZO algorithms have been proposed, the high function query complexities hinder their applications seriously. To address this challenging problem, we propose two stagewise black-box reduction frameworks for ZO algorithms under convex and non-convex settings respectively, which lower down the function query complexities of ZO algorithms. Moreover, our frameworks can directly derive the convergence results of ZO algorithms under convex and non-convex settings without extra analyses, as long as convergence results under strongly convex setting are given. To illustrate the advantages, we further study ZO-SVRG, ZO-SAGA and ZO-Varag under strongly-convex setting and use our frameworks to directly derive the convergence results under convex and non-convex settings. The function query complexities of these algorithms derived by our frameworks are lower than that of their vanilla counterparts without frameworks, or even lower than that of state-of-the-art algorithms. Finally we conduct numerical experiments to illustrate the superiority of our frameworks.

ゼロ次（ZO）最適化は、さまざまな機械学習アプリケーション、特にブラックボックス敵対的攻撃のための重要な手法であり、モデルを勾配のない方法で学習する必要があります。多くのZOアルゴリズムが提案されていますが、関数クエリの複雑さが高いため、アプリケーションが著しく妨げられています。この困難な問題に対処するために、凸設定と非凸設定のそれぞれにおけるZOアルゴリズムの2つの段階的なブラックボックス削減フレームワークを提案します。これにより、ZOアルゴリズムの関数クエリの複雑さが軽減されます。さらに、私たちのフレームワークは、強い凸設定での収束結果が与えられている限り、追加の分析なしで凸設定と非凸設定でのZOアルゴリズムの収束結果を直接導き出すことができます。利点を示すために、さらに強い凸設定でのZO-SVRG、ZO-SAGA、ZO-Varagを研究し、私たちのフレームワークを使用して凸設定と非凸設定での収束結果を直接導き出します。私たちのフレームワークによって導出されたこれらのアルゴリズムの関数クエリの複雑さは、フレームワークなしのバニラアルゴリズムよりも低く、最先端のアルゴリズムよりも低くなっています。最後に、私たちのフレームワークの優位性を示すために数値実験を実施します。
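
To make the notion of function query complexity concrete, the sketch below runs gradient descent using only function evaluations. It uses a coordinate-wise central-difference gradient estimate, which is one common zeroth-order estimator and is chosen here as an assumption. It is not ZO-SVRG, ZO-SAGA, or ZO-Varag, and it does not implement the reduction frameworks of the paper. The `QueryCounter` wrapper and all function names are hypothetical.

```python
class QueryCounter:
    """Wraps a black-box function and counts how often it is evaluated."""

    def __init__(self, func):
        self.func = func
        self.count = 0

    def __call__(self, x):
        self.count += 1
        return self.func(x)


def zo_gradient(func, x, mu=1e-3):
    """Central-difference gradient estimate; costs 2 * len(x) function queries."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += mu
        backward[i] -= mu
        grad.append((func(forward) - func(backward)) / (2.0 * mu))
    return grad


def zo_gradient_descent(func, x0, steps=50, learning_rate=0.25):
    """Gradient descent that never touches an analytic gradient."""
    x = list(x0)
    for _ in range(steps):
        grad = zo_gradient(func, x)
        x = [xi - learning_rate * gi for xi, gi in zip(x, grad)]
    return x
```

With a three-dimensional objective and 50 steps, this routine spends 50 × 2 × 3 = 300 function queries. That per-iteration cost, which grows with the dimension, is the quantity that query-efficient ZO methods aim to reduce.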

First-order Convergence Theory for Weakly-Convex-Weakly-Concave Min-max Problems
弱凸-弱凹最小-最大問題に対する1次収束理論

In this paper, we consider first-order convergence theory and algorithms for solving a class of non-convex non-concave min-max saddle-point problems, whose objective function is weakly convex in the variables of minimization and weakly concave in the variables of maximization. It has many important applications in machine learning including training Generative Adversarial Nets (GANs). We propose an algorithmic framework motivated by the inexact proximal point method, where the weakly monotone variational inequality (VI) corresponding to the original min-max problem is solved through approximately solving a sequence of strongly monotone VIs constructed by adding a strongly monotone mapping to the original gradient mapping. We prove first-order convergence to a nearly stationary solution of the original min-max problem of the generic algorithmic framework and establish different rates by employing different algorithms for solving each strongly monotone VI. Experiments verify the convergence theory and also demonstrate the effectiveness of the proposed methods on training GANs.

この論文では、目的関数が最小化変数に対して弱凸で、最大化変数に対して弱凹である非凸非凹最小最大鞍点問題のクラスを解くための一次収束理論とアルゴリズムについて検討します。これは、生成的敵対的ネット(GAN)のトレーニングを含む機械学習において多くの重要な用途があります。私たちは、不正確な近点法に着想を得たアルゴリズムフレームワークを提案します。このフレームワークでは、元の最小最大問題に対応する弱単調変分不等式(VI)が、元の勾配マッピングに強単調マッピングを追加することによって構築された強単調VIのシーケンスを近似的に解くことによって解決されます。私たちは、汎用アルゴリズムフレームワークの元の最小最大問題のほぼ定常解への一次収束を証明し、各強単調VIを解決するために異なるアルゴリズムを使用することで異なる速度を確立します。実験により収束理論を検証し、提案された方法がGANのトレーニングに有効であることも実証します。

Asymptotic Normality, Concentration, and Coverage of Generalized Posteriors
一般化事後分布の漸近正規性、集中性、およびカバレッジ

Generalized likelihoods are commonly used to obtain consistent estimators with attractive computational and robustness properties. Formally, any generalized likelihood can be used to define a generalized posterior distribution, but an arbitrarily defined “posterior” cannot be expected to appropriately quantify uncertainty in any meaningful sense. In this article, we provide sufficient conditions under which generalized posteriors exhibit concentration, asymptotic normality (Bernstein-von Mises), an asymptotically correct Laplace approximation, and asymptotically correct frequentist coverage. We apply our results in detail to generalized posteriors for a wide array of generalized likelihoods, including pseudolikelihoods in general, the Gaussian Markov random field pseudolikelihood, the fully observed Boltzmann machine pseudolikelihood, the Ising model pseudolikelihood, the Cox proportional hazards partial likelihood, and a median-based likelihood for robust inference of location. Further, we show how our results can be used to easily establish the asymptotics of standard posteriors for exponential families and generalized linear models. We make no assumption of model correctness so that our results apply with or without misspecification.

一般化尤度は、魅力的な計算特性と堅牢性特性を持つ一貫性のある推定値を得るためによく使用されます。形式的には、一般化尤度は一般化事後分布を定義するために使用できますが、任意に定義された「事後」では、意味のある意味で不確実性を適切に定量化することは期待できません。この記事では、一般化事後分布が集中、漸近正規性(Bernstein-von Mises)、漸近的に正しいラプラス近似、および漸近的に正しい頻度論的カバレッジを示すための十分な条件を示します。私たちは、一般的な疑似尤度、ガウスマルコフランダムフィールド疑似尤度、完全観測ボルツマンマシン疑似尤度、イジングモデル疑似尤度、Cox比例ハザード部分尤度、および位置の堅牢な推定のための中央値ベースの尤度を含む、さまざまな一般化尤度の一般化事後分布に結果を詳細に適用します。さらに、私たちの結果を使用して、指数族と一般化線形モデルの標準事後分布の漸近線型分布を簡単に確立する方法を示します。モデルの正確さについては仮定しないため、私たちの結果は、誤指定の有無にかかわらず適用されます。

Estimation and Optimization of Composite Outcomes
複合結果の推定と最適化

There is tremendous interest in precision medicine as a means to improve patient outcomes by tailoring treatment to individual characteristics. An individualized treatment rule formalizes precision medicine as a map from patient information to a recommended treatment. A treatment rule is defined to be optimal if it maximizes the mean of a scalar outcome in a population of interest, e.g., symptom reduction. However, clinical and intervention scientists often seek to balance multiple and possibly competing outcomes, e.g., symptom reduction and the risk of an adverse event. One approach to precision medicine in this setting is to elicit a composite outcome which balances all competing outcomes; unfortunately, eliciting a composite outcome directly from patients is difficult without a high-quality instrument, and an expert-derived composite outcome may not account for heterogeneity in patient preferences. We propose a new paradigm for the study of precision medicine using observational data that relies solely on the assumption that clinicians are approximately (i.e., imperfectly) making decisions to maximize individual patient utility. Estimated composite outcomes are subsequently used to construct an estimator of an individualized treatment rule which maximizes the mean of patient-specific composite outcomes. The estimated composite outcomes and estimated optimal individualized treatment rule provide new insights into patient preference heterogeneity, clinician behavior, and the value of precision medicine in a given domain. We derive inference procedures for the proposed estimators under mild conditions and demonstrate their finite sample performance through a suite of simulation experiments and an illustrative application to data from a study of bipolar depression.

個人の特性に合わせて治療を調整することで患者の転帰を改善する手段として、精密医療に大きな関心が寄せられています。個別化された治療ルールは、患者情報から推奨される治療へのマップとして精密医療を形式化します。治療ルールは、症状の軽減など、対象集団におけるスカラー結果の平均を最大化する場合、最適であると定義されます。しかし、臨床および介入科学者は、症状の軽減と有害事象のリスクなど、複数の、場合によっては競合する結果のバランスを取ろうとすることがよくあります。この設定での精密医療への1つのアプローチは、すべての競合する結果のバランスをとる複合結果を引き出すことです。残念ながら、患者から直接複合結果を引き出すことは、高品質の機器がなければ困難であり、専門家が導き出した複合結果では、患者の好みの異質性が考慮されない可能性があります。私たちは、臨床医が個々の患者の効用を最大化するために近似的に(つまり不完全に)決定を下しているという仮定のみに依存する観察データを使用した精密医療の研究のための新しいパラダイムを提案します。推定された複合アウトカムは、その後、患者固有の複合アウトカムの平均を最大化する個別治療ルールの推定量を構築するために使用されます。推定された複合アウトカムと推定された最適な個別治療ルールは、患者の嗜好の異質性、臨床医の行動、および特定の領域における精密医療の価値に関する新しい洞察を提供します。私たちは、軽度の条件下での提案された推定量の推論手順を導き出し、一連のシミュレーション実験と双極性うつ病の研究データへの例示的な適用を通じて、有限サンプルのパフォーマンスを実証します。

The ensmallen library for flexible numerical optimization
柔軟な数値最適化のための ensmallen ライブラリ

We overview the ensmallen numerical optimization library, which provides a flexible C++ framework for mathematical optimization of user-supplied objective functions. Many types of objective functions are supported, including general, differentiable, separable, constrained, and categorical. A diverse set of pre-built optimizers is provided, including Quasi-Newton optimizers and many variants of Stochastic Gradient Descent. The underlying framework facilitates the implementation of new optimizers. Optimization of an objective function typically requires supplying only one or two C++ functions. Custom behavior can be easily specified via callback functions. Empirical comparisons show that ensmallen outperforms other frameworks while providing more functionality. The library is available at https://ensmallen.org and is distributed under the permissive BSD license.

私たちは、ユーザー指定の目的関数の数学的最適化のための柔軟なC++フレームワークを提供するensmallen数値最適化ライブラリの概要を説明します。一般関数、微分可能関数、分離可能関数、制約関数、カテゴリカル関数など、多くのタイプがサポートされています。擬似ニュートンオプティマイザーやストキャスティクス勾配降下法の多くのバリアントなど、さまざまな事前構築済みオプティマイザーのセットが用意されています。基盤となるフレームワークは、新しいオプティマイザの実装を容易にします。目的関数の最適化では、通常、1つまたは2つのC++関数のみを指定する必要があります。カスタム動作は、コールバック関数を使用して簡単に指定できます。経験的な比較により、ensmallenは他のフレームワークよりも優れた性能を発揮し、より多くの機能を提供することがわかっています。このライブラリはhttps://ensmallen.orgから入手でき、寛容なBSDライセンスの下で配布されています。

Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
深層ニューラルネットワークにおける陰的自己正則化:ランダム行列理論からの証拠と学習への影響

Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models such as AlexNet and Inception, and smaller models trained from scratch, such as LeNet5 and a miniature-AlexNet. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of Self-Regularization, implicitly sculpting a more regularized energy or penalty landscape. In particular, the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of explicit regularization, such as Dropout or Weight Norm constraints. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, and applying them to these empirical results, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of Implicit Self-Regularization. These phases can be observed during the training process as well as in the final learned DNNs. For smaller and/or older DNNs, this Implicit Self-Regularization is like traditional Tikhonov regularization, in that there is a “size scale” separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of Heavy-Tailed Self-Regularization, similar to the self-organization seen in the statistical physics of disordered systems (such as classical models of actual neural activity). This results from correlations arising at all size scales, which for DNNs arises implicitly due to the training process itself. This implicit Self-Regularization can depend strongly on the many knobs of the training process. In particular, we demonstrate that we can cause a small model to exhibit all 5+1 phases of training simply by changing the batch size. 
Our results suggest that large, well-trained DNN architectures should exhibit Heavy-Tailed Self-Regularization, and we discuss the theoretical and practical implications of this.

ランダムマトリックス理論(RMT)は、AlexNetやInceptionなどの製品品質の事前トレーニング済みモデルと、LeNet5やminiature-AlexNetなどの最初からトレーニングされた小規模モデルの両方を含むディープニューラルネットワーク(DNN)の重みマトリックスを分析するために適用されます。経験的および理論的な結果は、DNNトレーニングプロセス自体が暗黙的に自己正規化の形式を実装し、より正規化されたエネルギーまたはペナルティランドスケープを暗黙的に形成することを明確に示しています。特に、DNNレイヤーマトリックスの経験的スペクトル密度(ESD)は、ドロップアウト制約や重みノルム制約などの従来の明示的な正規化形式を外生的に指定しなくても、従来の正規化された統計モデルのシグネチャを示します。RMTの比較的最近の結果、特にHeavy-Tailed行列の普遍性クラスへの拡張を基に、これらの実験結果に適用することで、増加する暗黙的な自己正則化に対応する5+1段階のトレーニングを識別する理論を開発しました。これらの段階は、トレーニングプロセス中だけでなく、最終的に学習されたDNNでも確認できます。より小規模なDNNや古いDNNの場合、この暗黙的な自己正則化は、信号とノイズを分離する「サイズスケール」がある点で、従来のTikhonov正則化に似ています。ただし、最先端のDNNの場合、無秩序システムの統計物理学(実際の神経活動の古典的なモデルなど)で見られる自己組織化に似た、新しい形式のHeavy-Tailed自己正則化を特定しました。これは、すべてのサイズスケールで発生する相関関係の結果であり、DNNの場合はトレーニングプロセス自体によって暗黙的に発生します。この暗黙的な自己正則化は、トレーニングプロセスの多くのノブに大きく依存する可能性があります。特に、バッチサイズを変更するだけで、小規模なモデルでトレーニングの5+1フェーズすべてを実行できることを実証します。結果は、大規模で十分にトレーニングされたDNNアーキテクチャがHeavy-Tailed Self-Regularizationを示すはずであることを示唆しており、これの理論的および実用的な意味について説明します。

Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)
機械学習研究における再現性の向上(NeurIPS 2019 Reproducibility Programのレポート)

One of the challenges in machine learning research is to ensure that presented and published results are sound and reliable. Reproducibility, that is obtaining similar results as presented in a paper or talk, using the same code and data (when available), is a necessary step to verify the reliability of research findings. Reproducibility is also an important step to promote open and accessible research, thereby allowing the scientific community to quickly integrate new findings and convert ideas to practice. Reproducibility also promotes the use of robust experimental workflows, which potentially reduce unintentional errors. In 2019, the Neural Information Processing Systems (NeurIPS) conference, the premier international conference for research in machine learning, introduced a reproducibility program, designed to improve the standards across the community for how we conduct, communicate, and evaluate machine learning research. The program contained three components: a code submission policy, a community-wide reproducibility challenge, and the inclusion of the Machine Learning Reproducibility checklist as part of the paper submission process. In this paper, we describe each of these components, how it was deployed, as well as what we were able to learn from this initiative.

機械学習研究における課題の1つは、発表および公開された結果が健全で信頼できるものであることを保証することです。再現性、つまり、同じコードとデータ(利用可能な場合)を使用して、論文や講演で発表されたものと同様の結果を得ることは、研究結果の信頼性を検証するために必要なステップです。再現性は、オープンでアクセス可能な研究を促進するための重要なステップでもあり、それによって科学コミュニティは新しい発見を迅速に統合し、アイデアを実践に移すことができます。再現性は、堅牢な実験ワークフローの使用も促進し、意図しないエラーを減らす可能性があります。2019年、機械学習研究の最高の国際会議であるNeural Information Processing Systems (NeurIPS)会議は、機械学習研究の実施、伝達、評価方法に関するコミュニティ全体の基準を改善するために設計された再現性プログラムを導入しました。プログラムには、コード提出ポリシー、コミュニティ全体の再現性チャレンジ、論文提出プロセスの一部としての機械学習再現性チェックリストの組み込みという3つの要素が含まれていました。この論文では、これらの各コンポーネントとその展開方法、そしてこの取り組みから学んだことについて説明します。

PeerReview4All: Fair and Accurate Reviewer Assignment in Peer Review
PeerReview4All: 査読における公正で正確な査読者の任命

We consider the problem of automated assignment of papers to reviewers in conference peer review, with a focus on fairness and statistical accuracy. Our fairness objective is to maximize the review quality of the most disadvantaged paper, in contrast to the commonly used objective of maximizing the total quality over all papers. We design an assignment algorithm based on an incremental max-flow procedure that we prove is near-optimally fair. Our statistical accuracy objective is to ensure correct recovery of the papers that should be accepted. We provide a sharp minimax analysis of the accuracy of the peer-review process for a popular objective-score model as well as for a novel subjective-score model that we propose in the paper. Our analysis proves that our proposed assignment algorithm also leads to a near-optimal statistical accuracy. Finally, we design a novel experiment that allows for an objective comparison of various assignment algorithms, and overcomes the inherent difficulty posed by the absence of a ground truth in experiments on peer-review. The results of this experiment as well as of other experiments on synthetic and real data corroborate the theoretical guarantees of our algorithm.

私たちは、公平性と統計的正確性に焦点を当て、会議のピアレビューにおける論文の査読者への自動割り当ての問題を検討します。我々の公平性の目標は、すべての論文の全体的な品質を最大化するという一般的な目標とは対照的に、最も不利な論文の査読品質を最大化することです。私たちは、ほぼ最適に公平であることが証明された増分最大フロー手順に基づく割り当てアルゴリズムを設計します。我々の統計的正確性の目標は、受理されるべき論文の正しい回収を確実にすることです。私たちは、一般的な客観スコアモデルと、論文で提案する新しい主観スコアモデルのピアレビュープロセスの正確性の鋭いミニマックス分析を提供します。我々の分析は、我々が提案する割り当てアルゴリズムがほぼ最適な統計的正確性にもつながることを証明します。最後に、私たちは、さまざまな割り当てアルゴリズムの客観的な比較を可能にし、ピアレビューの実験におけるグラウンドトゥルースの欠如によってもたらされる固有の困難を克服する新しい実験を設計します。この実験の結果、および合成データと実際のデータに関する他の実験の結果は、我々のアルゴリズムの理論的保証を裏付けるものです。

Counterfactual Mean Embeddings
反事実平均の埋め込み

Counterfactual inference has become a ubiquitous tool in online advertisement, recommendation systems, medical diagnosis, and econometrics. Accurate modelling of outcome distributions associated with different interventions—known as counterfactual distributions—is crucial for the success of these applications. In this work, we propose to model counterfactual distributions using a novel Hilbert space representation called counterfactual mean embedding (CME). The CME embeds the associated counterfactual distribution into a reproducing kernel Hilbert space (RKHS) endowed with a positive definite kernel, which allows us to perform causal inference over the entire landscape of the counterfactual distribution. Based on this representation, we propose a distributional treatment effect (DTE) which can quantify the causal effect over entire outcome distributions. Our approach is nonparametric as the CME can be estimated under the unconfoundedness assumption from observational data without requiring any parametric assumption about the underlying distributions. We also establish a rate of convergence of the proposed estimator which depends on the smoothness of the conditional mean and the Radon-Nikodym derivative of the underlying marginal distributions. Furthermore, our framework allows for more complex outcomes such as images, sequences, and graphs. Our experimental results on synthetic data and off-policy evaluation tasks demonstrate the advantages of the proposed estimator.

反事実的推論は、オンライン広告、推奨システム、医療診断、計量経済学のあらゆる分野で利用されるツールとなっています。さまざまな介入に関連する結果分布（反事実的分布として知られる）の正確なモデリングは、これらのアプリケーションの成功に不可欠です。この研究では、反事実的平均埋め込み（CME）と呼ばれる新しいヒルベルト空間表現を使用して、反事実的分布をモデル化することを提案します。CMEは、関連する反事実的分布を、正定値カーネルを備えた再生カーネルヒルベルト空間（RKHS）に埋め込み、反事実的分布のランドスケープ全体にわたって因果推論を実行できるようにします。この表現に基づいて、結果分布全体にわたって因果効果を定量化できる分布治療効果（DTE）を提案します。私たちのアプローチはノンパラメトリックであり、基礎となる分布についていかなるパラメトリック仮定も必要とせずに、観測データから非交絡性仮定の下でCMEを推定できます。また、提案された推定量の収束率も確立しました。これは、条件付き平均の滑らかさと、基礎となる周辺分布のラドン・ニコディム導関数に依存します。さらに、私たちのフレームワークでは、画像、シーケンス、グラフなどのより複雑な結果も考慮できます。合成データとオフポリシー評価タスクに関する実験結果は、提案された推定量の利点を実証しています。

MetaGrad: Adaptation using Multiple Learning Rates in Online Learning
MetaGrad:オンライン学習における複数の学習率を使用した適応

We provide a new adaptive method for online convex optimization, MetaGrad, that is robust to general convex losses but achieves faster rates for a broad class of special functions, including exp-concave and strongly convex functions, but also various types of stochastic and non-stochastic functions without any curvature. We prove this by drawing a connection to the Bernstein condition, which is known to imply fast rates in offline statistical learning. MetaGrad further adapts automatically to the size of the gradients. Its main feature is that it simultaneously considers multiple learning rates, which are weighted directly proportional to their empirical performance on the data using a new meta-algorithm. We provide three versions of MetaGrad. The full matrix version maintains a full covariance matrix and is applicable to learning tasks for which we can afford update time quadratic in the dimension. The other two versions provide speed-ups for high-dimensional learning tasks with an update time that is linear in the dimension: one is based on sketching, the other on running a separate copy of the basic algorithm per coordinate. We evaluate all versions of MetaGrad on benchmark online classification and regression tasks, on which they consistently outperform both online gradient descent and AdaGrad.

私たちは、オンライン凸最適化のための新しい適応型手法であるMetaGradを提供します。この手法は、一般的な凸損失に対して堅牢ですが、指数凹関数や強い凸関数、さらには曲率のないさまざまなタイプの確率的および非確率的関数を含む幅広いクラスの特殊関数に対してより高速な速度を実現します。私たちは、オフライン統計学習において高速な速度を意味することが知られているBernstein条件との関連を描くことによってこれを証明します。MetaGradは、勾配のサイズに自動的に適応します。その主な特徴は、新しいメタアルゴリズムを使用して、データに対する経験的パフォーマンスに直接比例して重み付けされた複数の学習率を同時に考慮することです。私たちは、3つのバージョンのMetaGradを提供します。完全なマトリックスバージョンは、完全な共分散マトリックスを維持し、次元の2乗の更新時間を許容できる学習タスクに適用できます。他の2つのバージョンは、次元に対して線形の更新時間で高次元学習タスクを高速化します。1つはスケッチに基づき、もう1つは座標ごとに基本アルゴリズムの別のコピーを実行します。私たちは、ベンチマークのオンライン分類および回帰タスクでMetaGradのすべてのバージョンを評価しました。その結果、MetaGradはオンライン勾配降下法とAdaGradの両方を一貫して上回りました。

Are We Forgetting about Compositional Optimisers in Bayesian Optimisation?
ベイズ最適化における合成最適化子を忘れていませんか?

Bayesian optimisation presents a sample-efficient methodology for global optimisation. Within this framework, a crucial performance-determining subroutine is the maximisation of the acquisition function, a task complicated by the fact that acquisition functions tend to be non-convex and thus nontrivial to optimise. In this paper, we undertake a comprehensive empirical study of approaches to maximise the acquisition function. Additionally, by deriving novel, yet mathematically equivalent, compositional forms for popular acquisition functions, we recast the maximisation task as a compositional optimisation problem, allowing us to benefit from the extensive literature in this field. We highlight the empirical advantages of the compositional approach to acquisition function maximisation across 3958 individual experiments comprising synthetic optimisation tasks as well as tasks from Bayesmark. Given the generality of the acquisition function maximisation subroutine, we posit that the adoption of compositional optimisers has the potential to yield performance improvements across all domains in which Bayesian optimisation is currently being applied. An open-source implementation is made available at https://github.com/huawei-noah/noah-research/tree/CompBO/BO/HEBO/CompBO.

ベイズ最適化は、サンプル効率の高いグローバル最適化手法を提供します。このフレームワークでは、パフォーマンスを決定する重要なサブルーチンは獲得関数の最大化です。獲得関数は非凸である傾向があり、最適化が簡単ではないため、このタスクは複雑になります。この論文では、獲得関数を最大化するアプローチの包括的な実証研究を行います。さらに、一般的な獲得関数の新しい、しかし数学的に同等な構成形式を導出することにより、最大化タスクを構成最適化問題として作り直し、この分野の広範な文献の恩恵を受けることができます。合成最適化タスクとベイズマークのタスクを含む3958の個別の実験で、獲得関数の最大化に対する構成アプローチの実証的な利点を強調します。獲得関数最大化サブルーチンの汎用性を考えると、構成最適化を採用すると、ベイズ最適化が現在適用されているすべてのドメインでパフォーマンスが向上する可能性があると仮定します。オープンソース実装は、https://github.com/huawei-noah/noah-research/tree/CompBO/BO/HEBO/CompBOで公開されています。

When Does Gradient Descent with Logistic Loss Find Interpolating Two-Layer Networks?
ロジスティック損失を伴う勾配降下法は、いつ2層ネットワークの内挿を見つけるのか?

We study the training of finite-width two-layer smoothed ReLU networks for binary classification using the logistic loss. We show that gradient descent drives the training loss to zero if the initial loss is small enough. When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.

私たちは、ロジスティック損失を用いた二項分類のための有限幅2層平滑化ReLUネットワークの学習について研究します。初期損失が十分に小さい場合、勾配降下法によって学習損失がゼロになることを示します。データが特定のクラスター条件と分離条件を満たし、ネットワークの幅が十分に広い場合、勾配降下の1ステップで損失が十分に減少し、最初の結果が適用されることを示します。

Information criteria for non-normalized models
非正規化モデルの情報基準

Many statistical models are given in the form of non-normalized densities with an intractable normalization constant. Since maximum likelihood estimation is computationally intensive for these models, several estimation methods have been developed which do not require explicit computation of the normalization constant, such as noise contrastive estimation (NCE) and score matching. However, model selection methods for general nonnormalized models have not been proposed so far. In this study, we develop information criteria for non-normalized models estimated by NCE or score matching. They are approximately unbiased estimators of discrepancy measures for non-normalized models. Simulation results and applications to real data demonstrate that the proposed criteria enable selection of the appropriate non-normalized model in a data-driven manner.

多くの統計モデルは、扱いにくい正規化定数を持つ非正規化密度の形で与えられます。これらのモデルでは最尤推定が計算量が多いため、ノイズコントラスト推定(NCE)やスコアマッチングなど、正規化定数の明示的な計算を必要としないいくつかの推定方法が開発されています。しかし、一般的な非正規化モデルのモデル選択方法は、これまで提案されていませんでした。この研究では、NCEまたはスコアマッチングによって推定された非正規化モデルの情報基準を開発します。これらは、正規化されていないモデルの不一致測度のほぼ不偏な推定量です。シミュレーション結果と実際のデータへの適用は、提案された基準により、データ駆動型の方法で適切な非正規化モデルを選択できることを示しています。

The Ridgelet Prior: A Covariance Function Approach to Prior Specification for Bayesian Neural Networks
リッジレット事前確率:ベイジアンニューラルネットワークの事前仕様への共分散関数アプローチ

Bayesian neural networks attempt to combine the strong predictive performance of neural networks with formal quantification of uncertainty associated with the predictive output in the Bayesian framework. However, it remains unclear how to endow the parameters of the network with a prior distribution that is meaningful when lifted into the output space of the network. A possible solution is proposed that enables the user to posit an appropriate Gaussian process covariance function for the task at hand. Our approach constructs a prior distribution for the parameters of the network, called a ridgelet prior, that approximates the posited Gaussian process in the output space of the network. In contrast to existing work on the connection between neural networks and Gaussian processes, our analysis is non-asymptotic, with finite sample-size error bounds provided. This establishes the universality property that a Bayesian neural network can approximate any Gaussian process whose covariance function is sufficiently regular. Our experimental assessment is limited to a proof-of-concept, where we demonstrate that the ridgelet prior can out-perform an unstructured prior on regression problems for which a suitable Gaussian process prior can be provided.

ベイジアンニューラルネットワークは、ニューラルネットワークの強力な予測性能と、ベイジアンフレームワークにおける予測出力に関連する不確実性の正式な定量化を組み合わせようとします。ただし、ネットワークの出力空間に持ち上げられたときに意味のある事前分布をネットワークのパラメーターに付与する方法は不明のままです。ユーザーが手元のタスクに対して適切なガウス過程共分散関数を仮定できるようにする、可能な解決策が提案されています。私たちのアプローチでは、ネットワークの出力空間で仮定されたガウス過程を近似する、リッジレット事前分布と呼ばれるネットワークのパラメーターの事前分布を構築します。ニューラルネットワークとガウス過程の関係に関する既存の研究とは対照的に、私たちの分析は非漸近的であり、有限のサンプルサイズ誤差境界が提供されます。これにより、ベイジアンニューラルネットワークは、共分散関数が十分に規則的な任意のガウス過程を近似できるという普遍性プロパティが確立されます。私たちの実験的評価は概念実証に限定されており、適切なガウス過程事前分布を提供できる回帰問題では、リッジレット事前分布が非構造化事前分布よりも優れていることを実証しています。

A Greedy Algorithm for Quantizing Neural Networks
ニューラルネットワークを量子化するための貪欲アルゴリズム

We propose a new computationally efficient method for quantizing the weights of pre-trained neural networks that is general enough to handle both multi-layer perceptrons and convolutional neural networks. Our method deterministically quantizes layers in an iterative fashion with no complicated re-training required. Specifically, we quantize each neuron, or hidden unit, using a greedy path-following algorithm. This simple algorithm is equivalent to running a dynamical system, which we prove is stable for quantizing a single-layer neural network (or, alternatively, for quantizing the first layer of a multi-layer network) when the training data are Gaussian. We show that under these assumptions, the quantization error decays with the width of the layer, i.e., its level of over-parametrization. We provide numerical experiments, on multi-layer networks, to illustrate the performance of our methods on MNIST and CIFAR10 data, as well as for quantizing the VGG16 network using ImageNet data.

私たちは、多層パーセプトロンと畳み込みニューラルネットワークの両方を処理するのに十分な一般的な、事前学習済みニューラルネットワークの重みを量子化するための新しい計算効率の高い方法を提案します。私たちの方法は、複雑な再トレーニングを必要とせずに、反復的な方法でレイヤーを決定論的に量子化します。具体的には、各ニューロン、つまり隠れユニットを、貪欲なパスフォローアルゴリズムを使用して量子化します。この単純なアルゴリズムは、動的システムを実行するのと同等であり、トレーニングデータがガウスの場合に、単一層ニューラルネットワークの量子化(または、多層ネットワークの最初の層の量子化)で安定していることが証明されています。これらの仮定の下では、量子化誤差は層の幅、つまりその過剰パラメータ化のレベルとともに減衰することを示します。私たちは、MNISTおよびCIFAR10データに対する私たちの方法の性能を示すために、またImageNetデータを使用してVGG16ネットワークを量子化するための、多層ネットワークでの数値実験を提供します。
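
One plausible reading of the greedy path-following step for a single neuron is sketched below. This is an assumption-laden illustration rather than the paper's code: the ternary alphabet, the data layout (rows of `X` are samples, columns are input features), and the name `greedy_quantize` are all hypothetical. At each step the quantized weight is chosen so that the running residual between the original and quantized pre-activations stays as small as possible.

```python
import random


def greedy_quantize(w, X, alphabet=(-1.0, 0.0, 1.0)):
    """Quantize the weights w of one neuron given data X (rows = samples).

    At step t, pick q_t from the alphabet to minimize the Euclidean norm of
        u_t = u_{t-1} + (w_t - q_t) * X[:, t],
    where u_t is the running residual of the pre-activations.
    """
    m, n = len(X), len(w)
    u = [0.0] * m
    q = []
    for t in range(n):
        col = [X[i][t] for i in range(m)]
        best_p, best_norm = None, float("inf")
        for p in alphabet:
            cand = sum((u[i] + (w[t] - p) * col[i]) ** 2 for i in range(m))
            if cand < best_norm:
                best_p, best_norm = p, cand
        q.append(best_p)
        u = [u[i] + (w[t] - best_p) * col[i] for i in range(m)]
    return q, u


random.seed(0)
m, n = 20, 50  # 20 Gaussian samples, 50 weights (illustrative sizes)
X = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
w = [random.uniform(-1.0, 1.0) for _ in range(n)]
q, u = greedy_quantize(w, X)
print(q[:10])
```

By construction the final residual `u` equals `X w - X q`, so its norm is the quantization error of this neuron on the given data.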

What Causes the Test Error? Going Beyond Bias-Variance via ANOVA
テストエラーの原因は何ですか?分散分析によるバイアス分散の超え方

Modern machine learning methods are often overparametrized, allowing adaptation to the data at a fine level. This can seem puzzling; in the worst case, such models do not need to generalize. This puzzle inspired a great amount of work, arguing when overparametrization reduces test error, in a phenomenon called “double descent”. Recent work aimed to understand in greater depth why overparametrization is helpful for generalization. This led to discovering the unimodality of variance as a function of the level of parametrization, and to decomposing the variance into that arising from label noise, initialization, and randomness in the training data to understand the sources of the error. In this work we develop a deeper understanding of this area. Specifically, we propose using the analysis of variance (ANOVA) to decompose the variance in the test error in a symmetric way, for studying the generalization performance of certain two-layer linear and non-linear networks. The advantage of the analysis of variance is that it reveals the effects of initialization, label noise, and training data more clearly than prior approaches. Moreover, we also study the monotonicity and unimodality of the variance components. While prior work studied the unimodality of the overall variance, we study the properties of each term in the variance decomposition. One of our key insights is that often, the interaction between training samples and initialization can dominate the variance, surprisingly being larger than their marginal effect. Also, we characterize “phase transitions” where the variance changes from unimodal to monotone. On a technical level, we leverage advanced deterministic equivalent techniques for Haar random matrices, which, to our knowledge, have not yet been used in the area. We verify our results in numerical simulations and on empirical data examples.

現代の機械学習手法は、多くの場合、過剰パラメータ化されており、細かいレベルでデータに適応できます。これは不可解に思えるかもしれません。最悪の場合、そのようなモデルは一般化する必要がありません。この謎は、過剰パラメータ化がテストエラーを減らすのはいつかを議論する「二重降下」と呼ばれる現象に関する膨大な研究に影響を与えました。最近の研究は、過剰パラメータ化が一般化に役立つ理由をより深く理解することを目指しました。これにより、パラメータ化レベルの関数として分散の単峰性を発見し、ラベルノイズ、初期化、トレーニングデータのランダム性から生じる分散に分散を分解して、エラーの原因を理解することになりました。この研究では、この分野についてより深く理解します。具体的には、特定の2層線形および非線形ネットワークの一般化パフォーマンスを研究するために、分散分析(ANOVA)を使用してテストエラーの分散を対称的に分解することを提案します。分散分析の利点は、初期化、ラベルノイズ、トレーニングデータの影響を従来のアプローチよりも明確に明らかにすることです。さらに、分散コンポーネントの単調性と単峰性も調べます。以前の研究では全体的な分散の単峰性を調べましたが、私たちは分散分解の各項の特性を調べます。重要な洞察の1つは、トレーニングサンプルと初期化の相互作用が分散を支配することがよくあることです。驚くべきことに、その影響は限界効果よりも大きくなります。また、分散が単峰性から単調性に変化する「相転移」を特徴付けます。技術的なレベルでは、Haarランダムマトリックスの高度な決定論的等価技術を活用しますが、これは私たちの知る限りこの分野ではまだ使用されていません。数値シミュレーションと経験的データ例で結果を検証します。

Kernel Smoothing, Mean Shift, and Their Learning Theory with Directional Data
カーネル平滑化、平均シフト、および指向性データによるそれらの学習理論

Directional data consist of observations distributed on a (hyper)sphere, and appear in many applied fields, such as astronomy, ecology, and environmental science. This paper studies both statistical and computational problems of kernel smoothing for directional data. We generalize the classical mean shift algorithm to directional data, which allows us to identify local modes of the directional kernel density estimator (KDE). The statistical convergence rates of the directional KDE and its derivatives are derived, and the problem of mode estimation is examined. We also prove the ascending property of the directional mean shift algorithm and investigate a general problem of gradient ascent on the unit hypersphere. To demonstrate the applicability of the algorithm, we evaluate it as a mode clustering method on both simulated and real-world data sets.

指向性データは、(超)球上に分布した観測値で構成され、天文学、生態学、環境科学など、多くの応用分野に現れます。この論文では、方向性データのカーネル平滑化の統計的問題と計算問題の両方を研究します。古典的な平均シフトアルゴリズムを指向性データに一般化することで、指向性カーネル密度推定量(KDE)の局所モードを特定できます。指向性KDEとその導関数の統計的収束率を導き出し、モード推定の問題を検討します。また、方向平均シフトアルゴリズムの昇順特性を証明し、ユニット超球面上の勾配上昇の一般的な問題を調査します。このアルゴリズムの適用性を実証するために、シミュレーションされたデータセットと実世界のデータセットの両方でモードクラスタリング手法として評価します。
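
The directional mean shift idea can be sketched as a fixed-point iteration on the unit sphere. The sketch below assumes the von Mises kernel exp(-(1 - <x, X_i>) / h^2), for which each step reduces to normalizing a kernel-weighted mean of the observations; the bandwidth `h`, the synthetic data, and the function names are hypothetical choices made for illustration.

```python
import math
import random


def normalize(v):
    """Project a nonzero vector onto the unit sphere."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def directional_mean_shift(x, data, h=0.3, iters=100):
    """Iterate toward a local mode of a directional KDE (von Mises kernel assumed)."""
    x = normalize(x)
    for _ in range(iters):
        # Kernel weight of each observation, based on its inner product with x.
        weights = [
            math.exp(-(1.0 - sum(a * b for a, b in zip(x, p))) / h ** 2)
            for p in data
        ]
        # Weighted mean of the observations, pulled back onto the sphere.
        mean = [sum(wt * p[j] for wt, p in zip(weights, data)) for j in range(len(x))]
        x = normalize(mean)
    return x


random.seed(0)
pole = [0.0, 0.0, 1.0]
# Synthetic observations clustered around the north pole of the 2-sphere.
data = [
    normalize([pole[j] + 0.1 * random.gauss(0.0, 1.0) for j in range(3)])
    for _ in range(200)
]
mode = directional_mean_shift([0.3, 0.3, 0.9], data)
print(mode)
```

Running the iteration from every observation and grouping points by the mode they reach gives the mode clustering use mentioned in the abstract.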

Factorization Machines with Regularization for Sparse Feature Interactions
スパースな特徴相互作用のための正則化を備えた因子分解マシン

Factorization machines (FMs) are machine learning predictive models based on second-order feature interactions and FMs with sparse regularization are called sparse FMs. Such regularizations enable feature selection, which selects the most relevant features for accurate prediction, and therefore they can contribute to the improvement of the model accuracy and interpretability. However, because FMs use second-order feature interactions, the selection of features often causes the loss of many relevant feature interactions in the resultant models. In such cases, FMs with regularization specially designed for feature interaction selection trying to achieve interaction-level sparsity may be preferred instead of those just for feature selection trying to achieve feature-level sparsity. In this paper, we present a new regularization scheme for feature interaction selection in FMs. For feature interaction selection, our proposed regularizer makes the feature interaction matrix sparse without a restriction on sparsity patterns imposed by the existing methods. We also describe efficient proximal algorithms for the proposed FMs and how our ideas can be applied or extended to feature selection and other related models such as higher-order FMs and the all-subsets model. The analysis and experimental results on synthetic and real-world datasets show the effectiveness of the proposed methods.

因子分解マシン(FM)は、2次特徴相互作用に基づく機械学習予測モデルであり、スパース正則化を備えたFMはスパースFMと呼ばれます。このような正則化により、正確な予測に最も関連性の高い特徴を選択する特徴選択が可能になり、モデルの精度と解釈可能性の向上に貢献できます。ただし、FMは2次特徴相互作用を使用するため、特徴の選択によって、結果のモデルで多くの関連する特徴相互作用が失われることがよくあります。このような場合、特徴レベルのスパース性を実現しようとする特徴選択専用のFMではなく、相互作用レベルのスパース性を実現しようとする特徴相互作用選択用に特別に設計された正則化を備えたFMが好まれる場合があります。この論文では、FMでの特徴相互作用選択のための新しい正則化スキームを紹介します。特徴相互作用選択では、提案する正則化子により、既存の方法によって課せられるスパース性パターンに制限されることなく、特徴相互作用行列がスパースになります。また、提案されたFMの効率的な近似アルゴリズムと、そのアイデアを特徴選択や高次FMや全サブセットモデルなどの他の関連モデルに適用または拡張する方法についても説明します。合成データセットと実世界のデータセットでの分析と実験結果は、提案された方法の有効性を示しています。
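
For readers unfamiliar with FMs, the standard second-order FM score can be written in a few lines. This sketch shows only the usual prediction rule, not the regularizer proposed in the paper; the names `fm_predict`, `w0`, `w`, and `V` are illustrative. The weight of the interaction between features i and j is the inner product of their factor vectors, so the interaction matrix is V V^T, which is the object an interaction-level sparsity penalty would act on.

```python
def fm_predict(x, w0, w, V):
    """Second-order FM score, computed in O(d * k) time.

    V[i] is the k-dimensional factor of feature i; the pairwise term equals
    the sum over i < j of <V[i], V[j]> * x[i] * x[j].
    """
    d, k = len(x), len(V[0])
    linear = w0 + sum(w[i] * x[i] for i in range(d))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(d))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(d))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


def fm_predict_naive(x, w0, w, V):
    """Same score via an explicit double loop over feature pairs (O(d^2 * k))."""
    d = len(x)
    score = w0 + sum(w[i] * x[i] for i in range(d))
    for i in range(d):
        for j in range(i + 1, d):
            score += sum(a * b for a, b in zip(V[i], V[j])) * x[i] * x[j]
    return score


x = [1.0, 0.0, 2.0]
w0, w = 0.5, [0.1, -0.2, 0.3]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(fm_predict(x, w0, w, V), fm_predict_naive(x, w0, w, V))
```

Both functions compute the same score; the first is the form usually used in practice because it avoids the explicit loop over pairs.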

Hardness of Identity Testing for Restricted Boltzmann Machines and Potts models
制限付きボルツマンマシンとポッツモデルの同一性試験の難易度

We study the identity testing problem for restricted Boltzmann machines (RBMs), and more generally, for undirected graphical models. In this problem, given sample access to the Gibbs distribution corresponding to an unknown or hidden model $M^*$ and given an explicit model $M$, the goal is to distinguish if either $M = M^*$ or if the models are (statistically) far apart. We establish the computational hardness of identity testing for RBMs (i.e., mixed Ising models on bipartite graphs), even when there are no latent variables or an external field. Specifically, we show that unless $RP=NP$, there is no polynomial-time identity testing algorithm for RBMs when $\beta d=\omega(\log{n})$, where $d$ is the maximum degree of the visible graph and $\beta$ is the largest edge weight (in absolute value); when $\beta d =O(\log{n})$ there is an efficient identity testing algorithm that utilizes the structure learning algorithm of Klivans and Meka (2017). We prove similar lower bounds for purely ferromagnetic RBMs with inconsistent external fields and for the ferromagnetic Potts model. To prove our results, we introduce a novel methodology to reduce the corresponding approximate counting problem to testing utilizing the phase transition exhibited by these models.

私たちは、制限付きボルツマンマシン(RBM)、およびより一般的には無向グラフィカルモデルの同一性検定問題を研究します。この問題では、未知または隠れたモデル$M^*$に対応するギブス分布へのサンプルアクセスと明示的なモデル$M$が与えられた場合、目標は$M = M^*$であるか、モデルが(統計的に)離れているかを区別することです。私たちは、潜在変数や外部フィールドがない場合でも、RBM (つまり、二部グラフ上の混合イジングモデル)の同一性検定の計算困難性を確立します。具体的には、$RP=NP$でない限り、$\beta d=\omega(\log{n})$のときにRBMの多項式時間同一性検定アルゴリズムが存在しないことを示す。ここで、$d$は可視グラフの最大次数、$\beta$は最大エッジ重み(絶対値)です。$\beta d =O(\log{n})$のとき、KlivansとMeka (2017)の構造学習アルゴリズムを利用する効率的な同一性検定アルゴリズムがあります。矛盾した外部場を持つ純粋な強磁性RBMと強磁性Pottsモデルに対して同様の下限を証明します。結果を証明するために、これらのモデルが示す相転移を利用した検定に対応する近似計数問題を簡略化する新しい方法論を紹介します。

Universal consistency and rates of convergence of multiclass prototype algorithms in metric spaces
メトリック空間における多クラスプロトタイプアルゴリズムの普遍的な一貫性と収束率

We study universal consistency and convergence rates of simple nearest-neighbor prototype rules for the problem of multiclass classification in metric spaces. We first show that a novel data-dependent partitioning rule, named Proto-NN, is universally consistent in any metric space that admits a universally consistent rule. Proto-NN is a significant simplification of OptiNet, a recently proposed compression-based algorithm that, to date, was the only algorithm known to be universally consistent in such a general setting. Practically, Proto-NN is simpler to implement and enjoys reduced computational complexity. We then proceed to study convergence rates of the excess error probability. We first obtain rates for the standard $k$-NN rule under a margin condition and a new generalized-Lipschitz condition. The latter is an extension of a recently proposed modified-Lipschitz condition from $\mathbb R^d$ to metric spaces. Similarly to the modified-Lipschitz condition, the new condition avoids any boundedness assumptions on the data distribution. While obtaining rates for Proto-NN is left open, we show that a second prototype rule that hybridizes between $k$-NN and Proto-NN achieves the same rates as $k$-NN while enjoying similar computational advantages as Proto-NN. However, like $k$-NN, this hybrid rule is not consistent in general.

私たちは、距離空間における多クラス分類の問題に対する単純な最近傍プロトタイプ規則の普遍的一貫性と収束率を研究します。まず、Proto-NNという新しいデータ依存分割規則が、普遍的に一貫性のある規則を許容する任意の距離空間において普遍的に一貫性があることを示します。Proto-NNは、最近提案された圧縮ベースのアルゴリズムであるOptiNetを大幅に簡略化したものです。OptiNetは、現在まで、このような一般的な設定において普遍的に一貫性があることが知られている唯一のアルゴリズムでした。実際には、Proto-NNは実装が簡単で、計算の複雑さが軽減されています。次に、過剰エラー確率の収束率の研究に進みます。まず、マージン条件と新しい一般化Lipschitz条件の下での標準$k$-NN規則の収束率を取得します。後者は、最近提案された修正Lipschitz条件を$\mathbb R^d$から距離空間に拡張したものです。修正Lipschitz条件と同様に、新しい条件はデータ分布に対する有界性の仮定を回避しています。Proto-NNのレートを得ることは未解決のままですが、$k$-NNとProto-NNをハイブリッド化した2番目のプロトタイプ規則は、Proto-NNと同様の計算上の利点を享受しながら、$k$-NNと同じレートを達成することを示します。ただし、$k$-NNと同様に、このハイブリッド規則は一般には一貫性がありません。
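
As a rough illustration of the standard $k$-NN rule that the abstract analyzes, the sketch below classifies points in a metric space given only a distance function. It is not Proto-NN or OptiNet; the names `hamming` and `knn_predict` and the toy data are assumptions made for this example.

```python
from collections import Counter

def hamming(a, b):
    """Hamming distance between two equal-length strings (a metric on that set)."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query, k, dist):
    """Plurality vote among the k training points closest to `query` under `dist`."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy labeled sample: strings near "aaaaaa" are class A, strings near "bbbbbb" are class B.
train = [
    ("aaaaaa", "A"), ("aaaaab", "A"), ("aaaaba", "A"),
    ("bbbbbb", "B"), ("bbbbba", "B"), ("bbbbab", "B"),
]

print(knn_predict(train, "aaabaa", k=3, dist=hamming))
print(knn_predict(train, "bbabbb", k=3, dist=hamming))
```

The rule only ever calls `dist`, so any metric (edit distance, a graph distance, and so on) could be swapped in without touching `knn_predict`.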

Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent
スケーリング勾配降下法による悪条件低ランク行列推定の加速

Low-rank matrix estimation is a canonical problem that finds numerous applications in signal processing, machine learning and imaging science. A popular approach in practice is to factorize the matrix into two compact low-rank factors, and then optimize these factors directly via simple iterative methods such as gradient descent and alternating minimization. Despite nonconvexity, recent literature has shown that these simple heuristics in fact achieve linear convergence when initialized properly for a growing number of problems of interest. However, upon closer examination, existing approaches can still be computationally expensive especially for ill-conditioned matrices: the convergence rate of gradient descent depends linearly on the condition number of the low-rank matrix, while the per-iteration cost of alternating minimization is often prohibitive for large matrices. The goal of this paper is to set forth a competitive algorithmic approach dubbed Scaled Gradient Descent (ScaledGD) which can be viewed as preconditioned or diagonally-scaled gradient descent, where the preconditioners are adaptive and iteration-varying with a minimal computational overhead. With tailored variants for low-rank matrix sensing, robust principal component analysis and matrix completion, we theoretically show that ScaledGD achieves the best of both worlds: it converges linearly at a rate independent of the condition number of the low-rank matrix similar to alternating minimization, while maintaining the low per-iteration cost of gradient descent. Our analysis is also applicable to general loss functions that are restricted strongly convex and smooth over low-rank matrices. To the best of our knowledge, ScaledGD is the first algorithm that provably has such properties over a wide range of low-rank matrix estimation tasks. At the core of our analysis is the introduction of a new distance function that takes account of the preconditioners when measuring the distance between the iterates and the ground truth. Finally, numerical examples are provided to demonstrate the effectiveness of ScaledGD in accelerating the convergence rate of ill-conditioned low-rank matrix estimation in a wide range of applications.

低ランク行列推定は、信号処理、機械学習、画像科学で数多くの用途がある標準的な問題です。実際によく使われるアプローチは、行列を2つのコンパクトな低ランク因子に因数分解し、勾配降下法や交互最小化法などの単純な反復法でこれらの因子を直接最適化することです。非凸性にもかかわらず、最近の文献では、これらの単純なヒューリスティックは、適切に初期化すると、関心のある問題の数が増えていく中で実際に線形収束を達成することが示されています。しかし、詳しく調べてみると、既存のアプローチは、特に条件の悪い行列の場合、依然として計算コストが高くなる可能性があります。勾配降下法の収束率は、低ランク行列の条件数に線形に依存しますが、交互最小化の反復コストは、大規模な行列ではしばしば法外に高くなります。この論文の目的は、Scaled Gradient Descent (ScaledGD)と呼ばれる競争力のあるアルゴリズム手法を提示することです。これは、前処理付きまたは対角スケールの勾配降下法と見なすことができ、前処理は適応型で反復によって変化し、計算オーバーヘッドは最小限です。低ランク行列センシング、堅牢な主成分分析、行列補完のカスタマイズされたバリエーションを使用して、ScaledGDが両方の長所を実現することを理論的に示します。つまり、交互最小化と同様に、低ランク行列の条件数に依存しない速度で線形に収束しながら、勾配降下法の反復あたりのコストを低く抑えます。私たちの分析は、低ランク行列に対して強く凸で滑らかな制限のある一般的な損失関数にも適用できます。私たちの知る限り、ScaledGDは、幅広い低ランク行列推定タスクでこのような特性を証明できる最初のアルゴリズムです。私たちの分析の核心は、反復と実際の値の間の距離を測定する際に前処理を考慮する新しい距離関数の導入です。最後に、さまざまなアプリケーションで条件の悪い低ランク行列推定の収束速度を加速するScaledGDの有効性を示す数値例を示します。
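
To make the idea of iteration-varying preconditioners concrete, here is a minimal sketch for the fully observed factorization loss $f(L,R)=\tfrac12\|LR^\top-M\|_F^2$, where each factor's gradient is right-multiplied by the inverse Gram matrix of the other factor. The fully observed setting, the perturbed initialization, the step sizes and the helper names (`scaled_gd`, `plain_gd`, `rel_err`) are assumptions for illustration, not the paper's experimental setup.

```python
import numpy as np

def scaled_gd(M, L, R, eta=0.5, iters=60):
    """Preconditioned updates for f(L, R) = 0.5 * ||L R^T - M||_F^2."""
    for _ in range(iters):
        residual = L @ R.T - M
        # Each gradient is scaled by an r x r inverse, which is cheap when r is small.
        L_next = L - eta * (residual @ R) @ np.linalg.inv(R.T @ R)
        R_next = R - eta * (residual.T @ L) @ np.linalg.inv(L.T @ L)
        L, R = L_next, R_next
    return L, R

def plain_gd(M, L, R, eta, iters=60):
    """Unpreconditioned gradient descent on the same loss, for comparison."""
    for _ in range(iters):
        residual = L @ R.T - M
        L, R = L - eta * residual @ R, R - eta * residual.T @ L
    return L, R

# Rank-3 ground truth with condition number 100 (assumed toy setup).
rng = np.random.default_rng(0)
n, r = 30, 3
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
sigma = np.array([100.0, 10.0, 1.0])
M = (U * sigma) @ V.T

# Start near the balanced factors, mimicking a good initialization.
L0 = U * np.sqrt(sigma) + 1e-3 * rng.standard_normal((n, r))
R0 = V * np.sqrt(sigma) + 1e-3 * rng.standard_normal((n, r))

def rel_err(L, R):
    return np.linalg.norm(L @ R.T - M) / np.linalg.norm(M)

err_scaled = rel_err(*scaled_gd(M, L0, R0))
err_gd = rel_err(*plain_gd(M, L0, R0, eta=0.5 / sigma[0]))
print(err_scaled, err_gd)
```

In this toy problem the plain gradient step has to be scaled down by the largest singular value to stay stable, so the direction associated with the smallest singular value moves slowly; the preconditioned step does not have that restriction.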

Hyperparameter Optimization via Sequential Uniform Designs
逐次均一設計によるハイパーパラメータ最適化

Hyperparameter optimization (HPO) plays a central role in automated machine learning (AutoML). It is a challenging task as the response surfaces of hyperparameters are generally unknown, hence essentially a global optimization problem. This paper reformulates HPO as a computer experiment and proposes a novel sequential uniform design (SeqUD) strategy with three-fold advantages: a) the hyperparameter space is adaptively explored with evenly spread design points, without the need for expensive meta-modeling and acquisition optimization; b) the batch-by-batch design points are sequentially generated with parallel processing support; c) a new augmented uniform design algorithm is developed for the efficient real-time generation of follow-up design points. Extensive experiments are conducted on both global optimization tasks and HPO applications. The numerical results show that the proposed SeqUD strategy outperforms benchmark HPO methods, and it can therefore be a promising and competitive alternative to existing AutoML tools.

ハイパーパラメータ最適化(HPO)は、自動機械学習(AutoML)において中心的な役割を果たします。ハイパーパラメータの応答面は一般に未知であるため、本質的にグローバル最適化問題であり、これは困難なタスクです。この論文では、HPOをコンピューター実験として再定式化し、3つの利点を持つ新しいシーケンシャルユニフォームデザイン(SeqUD)戦略を提案します。a)ハイパーパラメータ空間は、高価なメタモデリングや取得最適化を必要とせずに、均等に分散された設計ポイントで適応的に探索されます。b)バッチごとの設計ポイントは、並列処理サポートを使用して順次生成されます。c)フォローアップ設計ポイントを効率的にリアルタイムで生成するための新しい拡張ユニフォームデザインアルゴリズムが開発されます。グローバル最適化タスクとHPOアプリケーションの両方で広範な実験が行われます。数値結果は、提案されたSeqUD戦略がベンチマークHPO方法よりも優れていることを示しています。そのため、既存のAutoMLツールに代わる有望で競争力のある代替手段になる可能性があります。

Statistical guarantees for local graph clustering
ローカルグラフクラスタリングの統計的保証

Local graph clustering methods aim to find small clusters in very large graphs. These methods take as input a graph and a seed node, and they return as output a good cluster in a running time that depends on the size of the output cluster but that is independent of the size of the input graph. In this paper, we adopt a statistical perspective on local graph clustering, and we analyze the performance of the $\ell_1$-regularized PageRank method (Fountoulakis et al., 2019) for the recovery of a single target cluster, given a seed node inside the cluster. Assuming the target cluster has been generated by a random model, we present two results. In the first, we show that the optimal support of $\ell_1$-regularized PageRank recovers the full target cluster, with bounded false positives. In the second, we show that if the seed node is connected solely to the target cluster then the optimal support of $\ell_1$-regularized PageRank recovers exactly the target cluster. We also show empirically that $\ell_1$-regularized PageRank has a state-of-the-art performance on many real graphs, demonstrating the superiority of the method. From a computational perspective, we show that the solution path of $\ell_1$-regularized PageRank is monotonic. This allows for the application of the forward stagewise algorithm, which approximates the entire solution path in running time that does not depend on the size of the whole graph. Finally, we show that $\ell_1$-regularized PageRank and approximate personalized PageRank (APPR) (Andersen et al., 2006), another very popular method for local graph clustering, are equivalent in the sense that we can lower and upper bound the output of one with the output of the other. Based on this relation, we establish for APPR similar results to those we establish for $\ell_1$-regularized PageRank.

ローカルグラフクラスタリング法は、非常に大きなグラフ内の小さなクラスターを見つけることを目的としています。これらの方法は、グラフとシードノードを入力として受け取り、出力クラスターのサイズに依存する実行時間で適切なクラスターを出力として返しますが、その実行時間は入力グラフのサイズとは無関係です。この論文では、ローカルグラフクラスタリングに統計的観点を採用し、クラスター内のシードノードが与えられた場合の単一のターゲットクラスターの回復に対する$\ell_1$正則化PageRank法(Fountoulakisら、2019年)のパフォーマンスを分析します。ターゲットクラスターがランダムモデルによって生成されたと仮定して、2つの結果を示します。最初の結果では、$\ell_1$正則化PageRankの最適なサポートにより、偽陽性が制限された状態でターゲットクラスター全体が回復されることを示します。2番目の結果では、シードノードがターゲットクラスターにのみ接続されている場合は、$\ell_1$正則化PageRankの最適なサポートによりターゲットクラスターが正確に回復されることを示します。また、$\ell_1$正則化PageRankは多くの実際のグラフで最先端のパフォーマンスを発揮することを経験的に示し、この方法の優位性を実証しています。計算の観点からは、$\ell_1$正則化PageRankの解パスは単調であることを示しています。これにより、グラフ全体のサイズに依存しない実行時間で解パス全体を近似する、前向き段階的アルゴリズムを適用できます。最後に、$\ell_1$正則化PageRankと、ローカルグラフクラスタリングのもう1つの非常に一般的な方法である近似パーソナライズPageRank (APPR) (Andersenら、2006年)は、一方の出力をもう一方の出力で下から、および上から抑えることができるという意味で同等であることを示します。この関係に基づいて、APPRについても、$\ell_1$正則化PageRankで確立したものと同様の結果を確立します。
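
For readers unfamiliar with APPR, the following is a minimal sketch of the push procedure usually attributed to Andersen et al. (2006), in its lazy-random-walk form. It only touches nodes near the seed, which is the sense in which the running time is independent of the graph size. The function name, the toy graph and the parameter values are assumptions for illustration; the $\ell_1$-regularized PageRank method itself is not implemented here.

```python
from collections import deque

def approximate_ppr(adj, seed, alpha=0.15, eps=1e-4):
    """Push-style approximate personalized PageRank.

    adj maps each node to a collection of neighbors (undirected graph).
    Returns (p, r): the approximate PageRank vector and the residual vector.
    """
    p = {}
    r = {seed: 1.0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        deg_u = len(adj[u])
        ru = r.get(u, 0.0)
        if ru < eps * deg_u:
            continue  # residual already below the push threshold
        # Push: settle an alpha fraction at u, keep half of the rest, spread the other half.
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = (1 - alpha) * ru / 2
        share = (1 - alpha) * ru / (2 * deg_u)
        for v in adj[u]:
            r[v] = r.get(v, 0.0) + share
            if r[v] >= eps * len(adj[v]):
                queue.append(v)
        if r[u] >= eps * deg_u:
            queue.append(u)
    return p, r

# Toy graph: two 4-cliques joined by the single edge (3, 4).
adj = {u: set() for u in range(8)}
for group in ([0, 1, 2, 3], [4, 5, 6, 7]):
    for u in group:
        for v in group:
            if u != v:
                adj[u].add(v)
adj[3].add(4)
adj[4].add(3)

p, r = approximate_ppr(adj, seed=0)
mass_a = sum(p.get(u, 0.0) for u in range(4))
mass_b = sum(p.get(u, 0.0) for u in range(4, 8))
# Rank nodes by degree-normalized score, as is commonly done before a sweep cut.
ranked = sorted(adj, key=lambda u: p.get(u, 0.0) / len(adj[u]), reverse=True)
print(ranked[:4], mass_a, mass_b)
```

Each push conserves total mass, so `sum(p.values()) + sum(r.values())` stays equal to 1 throughout.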

Optimal Minimax Variable Selection for Large-Scale Matrix Linear Regression Model
大規模行列線形回帰モデルのための最適ミニマックス変数選択

Large-scale matrix linear regression models with high-dimensional responses and high-dimensional variables have been widely employed in various large-scale biomedical studies. In this article, we propose an optimal minimax variable selection approach for the matrix linear regression model when the dimensions of both the response matrix and predictors diverge at the exponential rate of the sample size. We develop an iterative hard-thresholding algorithm for fast computation and establish an optimal minimax theory for the parameter estimates. The finite sample performance of the method is examined via extensive simulation studies and a real data application from the Alzheimer’s Disease Neuroimaging Initiative study is provided.

高次元応答と高次元変数を持つ大規模行列線形回帰モデルは、さまざまな大規模な生物医学研究で広く採用されています。この記事では、応答行列と予測子の両方の次元がサンプルサイズの指数関数的な速度で発散する場合の行列線形回帰モデルに最適なミニマックス変数選択アプローチを提案します。高速計算のための反復ハードしきい値化アルゴリズムを開発し、パラメータ推定の最適なミニマックス理論を確立します。この手法の有限サンプル性能は、広範なシミュレーション研究を通じて検討され、Alzheimer's Disease Neuroimaging Initiative研究からの実際のデータアプリケーションが提供されます。
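
The abstract does not spell out its iterative hard-thresholding algorithm, so the sketch below shows only the generic vector form of the technique: take a gradient step on the least-squares loss, then keep the $s$ largest-magnitude coefficients. The sparsity level, step size, noiseless data and function names are all assumptions for illustration; the paper's matrix-response variant may differ.

```python
import numpy as np

def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def iht(X, y, s, iters=200):
    """Generic iterative hard thresholding for 0.5 * ||y - X beta||^2 with at most s nonzeros."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # inverse of the largest eigenvalue of X^T X
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        beta = hard_threshold(beta + step * X.T @ (y - X @ beta), s)
    return beta

# Noiseless toy problem with three active predictors (assumed setup).
rng = np.random.default_rng(0)
n, p, s = 200, 20, 3
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = [3.0, -3.0, 3.0]
y = X @ beta_true

beta_hat = iht(X, y, s)
print(np.flatnonzero(beta_hat))
```

Each iteration costs two matrix-vector products plus a partial sort, which is why this family of methods is attractive for fast computation.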

Nonparametric Modeling of Higher-Order Interactions via Hypergraphons
ハイパーグラフォンによる高次相互作用のノンパラメトリックモデリング

We study statistical and algorithmic aspects of using hypergraphons, which are limits of large hypergraphs, for modeling higher-order interactions. Although hypergraphons are extremely powerful from a modeling perspective, we consider a restricted class of Simple Lipschitz Hypergraphons (SLH), which are amenable to practically efficient estimation. We also provide rates of convergence for our estimator that are optimal for the class of SLH. Simulation results are provided to corroborate the theory.

私たちは、高次の相互作用をモデル化するために、大規模なハイパーグラフの極限であるハイパーグラフォンを使用する際の統計的およびアルゴリズム的側面を研究しています。ハイパーグラフォンはモデリングの観点から非常に強力ですが、実用上効率的な推定に適したSimple Lipschitzハイパーグラフォン(SLH)という制限されたクラスを検討します。また、SLHのクラスに対して最適な、推定量の収束率も提供します。シミュレーション結果は、理論を裏付けるために提供されています。

On efficient multilevel Clustering via Wasserstein distances
ワッサーシュタイン距離による効率的なマルチレベルクラスタリングについて

We propose a novel approach to the problem of multilevel clustering, which aims to simultaneously partition data in each group and discover grouping patterns among groups in a potentially large hierarchically structured corpus of data. Our method involves a joint optimization formulation over several spaces of discrete probability measures, which are endowed with Wasserstein distance metrics. We propose several variants of this problem, which admit fast optimization algorithms, by exploiting the connection to the problem of finding Wasserstein barycenters. Consistency properties are established for the estimates of both local and global clusters. Finally, experimental results with both synthetic and real data are presented to demonstrate the flexibility and scalability of the proposed approach.

私たちは、マルチレベルクラスタリングの問題に対する新しいアプローチを提案します。この問題は、潜在的に大規模な階層構造のデータコーパスにおいて、各グループ内のデータを分割すると同時に、グループ間のグループ化パターンを発見することを目的としています。私たちの方法には、Wasserstein距離を備えた離散確率測度のいくつかの空間にわたる同時最適化の定式化が含まれます。Wasserstein重心を求める問題との関連を利用することで、高速な最適化アルゴリズムを適用できる、この問題のいくつかの変種を提案します。ローカルクラスターとグローバルクラスターの両方の推定値に対して一致性が確立されます。最後に、合成データと実データの両方を使用した実験結果を提示して、提案されたアプローチの柔軟性とスケーラビリティを実証します。

Individual Fairness in Hindsight
後知恵における個人の公平性

The pervasive prevalence of algorithmic decision-making in societal domains necessitates that these algorithms satisfy reasonable notions of fairness. One compelling notion is that of individual fairness (IF), which advocates that similar individuals should be treated similarly. In this paper, we extend the notion of IF to online contextual decision-making in settings where there exists a common notion of conduciveness of decisions as perceived by the affected individuals. We introduce two definitions: (i) fairness-across-time (FT) and (ii) fairness-in-hindsight (FH). FT requires the treatment of individuals to be individually fair relative to the past as well as future, while FH only requires individual fairness of a decision at the time of the decision. We show that these two definitions can have drastically different implications when the principal needs to learn the utility model. Linear regret relative to optimal individually fair decisions is generally unavoidable under FT. On the other hand, we design a new algorithm: Cautious Fair Exploration (CaFE), which satisfies FH and achieves order-optimal sublinear regret guarantees for a broad range of settings.

社会の領域でアルゴリズムによる意思決定が広く普及しているため、これらのアルゴリズムが公平性に関する妥当な概念を満たすことが必要です。説得力のある概念の1つは、類似した個人は同様に扱われるべきであると主張する個人の公平性(IF)です。この論文では、影響を受ける個人が認識する決定の助長性に関する共通の概念が存在する環境でのオンライン状況に基づく意思決定にIFの概念を拡張します。2つの定義、(i)時間を超えた公平性(FT)と(ii)後知恵による公平性(FH)を導入します。FTでは、個人の扱いが過去だけでなく未来に対しても個別に公平であることが求められるが、FHでは決定の時点での個々の公平性のみが求められます。プリンシパルが効用モデルを学習する必要がある場合、これら2つの定義は大幅に異なる意味合いを持つ可能性があることを示す。FTでは、最適で個別に公平な決定に対する線形後悔は一般に避けられない。一方、私たちは、FHを満たし、幅広い設定に対して順序最適なサブ線形リグレット保証を実現する新しいアルゴリズム、Cautious Fair Exploration (CaFE)を設計しました。

Non-attracting Regions of Local Minima in Deep and Wide Neural Networks
深部および広域ニューラルネットワークにおける局所極小値の非誘引領域

Understanding the loss surface of neural networks is essential for the design of models with predictable performance and their success in applications. Experimental results suggest that sufficiently deep and wide neural networks are not negatively impacted by suboptimal local minima. Despite recent progress, the reason for this outcome is not fully understood. Could deep networks have very few, if any, suboptimal local optima? Or could all of them be equally good? We provide a construction to show that suboptimal local minima (i.e., non-global ones), even though degenerate, exist for fully connected neural networks with sigmoid activation functions. The local minima obtained by our construction belong to a connected set of local solutions that can be escaped from via a non-increasing path on the loss curve. For extremely wide neural networks of decreasing width after the wide layer, we prove that every suboptimal local minimum belongs to such a connected set. This provides a partial explanation for the successful application of deep neural networks. In addition, we also characterize under what conditions the same construction leads to saddle points instead of local minima for deep neural networks.

ニューラルネットワークの損失面を理解することは、予測可能なパフォーマンスを持つモデルを設計し、そのモデルをアプリケーションで成功させるために不可欠です。実験結果によると、十分に深く広いニューラルネットワークは、最適でない局所的最小値による悪影響を受けないようです。最近の進歩にもかかわらず、この結果の理由は完全には理解されていません。ディープネットワークには、最適でない局所的最小値はほとんどないのでしょうか。あるいは、すべてが同じように優れているのでしょうか。シグモイド活性化関数を持つ完全接続ニューラルネットワークの場合、最適でない局所的最小値(つまり、非グローバルな最小値)が退化しているにもかかわらず存在することを示す構成を提供します。この構成によって得られる局所的最小値は、損失曲線上の非増加パスを介して回避できる、接続されたローカルソリューションのセットに属します。ワイドレイヤーの後で幅が減少する非常に広いニューラルネットワークの場合、すべての最適でない局所的最小値がこのような接続されたセットに属することを証明します。これは、ディープニューラルネットワークの成功したアプリケーションの部分的な説明となります。さらに、どのような条件下で同じ構成がディープニューラルネットワークの局所最小値ではなく鞍点につながるかについても説明します。

Inference for Multiple Heterogeneous Networks with a Common Invariant Subspace
共通の不変部分空間を持つ複数の異種ネットワークの推論

The development of models and methodology for the analysis of data from multiple heterogeneous networks is of importance both in statistical network theory and across a wide spectrum of application domains. Although single-graph analysis is well-studied, multiple graph inference is largely unexplored, in part because of the challenges inherent in appropriately modeling graph differences and yet retaining sufficient model simplicity to render estimation feasible. This paper addresses exactly this gap, by introducing a new model, the common subspace independent-edge multiple random graph model, which describes a heterogeneous collection of networks with a shared latent structure on the vertices but potentially different connectivity patterns for each graph. The model encompasses many popular network representations, including the stochastic blockmodel. The model is both flexible enough to meaningfully account for important graph differences, and tractable enough to allow for accurate inference in multiple networks. In particular, a joint spectral embedding of adjacency matrices—the multiple adjacency spectral embedding—leads to simultaneous consistent estimation of underlying parameters for each graph. Under mild additional assumptions, the estimates satisfy asymptotic normality and yield improvements for graph eigenvalue estimation. In both simulated and real data, the model and the embedding can be deployed for a number of subsequent network inference tasks, including dimensionality reduction, classification, hypothesis testing, and community detection. Specifically, when the embedding is applied to a data set of connectomes constructed through diffusion magnetic resonance imaging, the result is an accurate classification of brain scans by human subject and a meaningful determination of heterogeneity across scans of different individuals.

複数の異種ネットワークからのデータを分析するためのモデルと方法論の開発は、統計ネットワーク理論と幅広い応用分野の両方で重要です。単一グラフ分析は十分に研究されていますが、複数グラフの推論はほとんど研究されていません。その理由の1つは、グラフの違いを適切にモデル化しながらも、推定を実行可能にするのに十分なモデルの単純さを維持するという固有の課題があるためです。この論文では、新しいモデルである共通サブスペース独立エッジ複数ランダムグラフモデルを導入することで、まさにこのギャップに対処します。このモデルは、頂点に共通の潜在的構造を持ちながら、各グラフの接続パターンが異なる可能性がある、異種ネットワークの集合を表します。このモデルには、確率的ブロックモデルを含む、多くの一般的なネットワーク表現が含まれています。このモデルは、重要なグラフの違いを意味のある形で説明できるほど柔軟であり、複数のネットワークで正確な推論を可能にするほど扱いやすいものです。特に、隣接行列の共同スペクトル埋め込み(多重隣接スペクトル埋め込み)により、各グラフの基礎となるパラメータを同時に一貫して推定できます。軽度の追加の仮定の下では、推定値は漸近正規性を満たし、グラフの固有値推定の改善をもたらします。シミュレーションと実際のデータの両方で、モデルと埋め込みは、次元削減、分類、仮説検定、コミュニティ検出など、その後のネットワーク推論タスクに展開できます。具体的には、拡散磁気共鳴画像法によって構築されたコネクトームのデータセットに埋め込みを適用すると、人間の被験者による脳スキャンの正確な分類と、異なる個人のスキャン間の異質性の有意義な判定が得られます。

Pseudo-Marginal Hamiltonian Monte Carlo
擬似マージナルハミルトニアンモンテカルロ

Bayesian inference in the presence of an intractable likelihood function is computationally challenging. When following a Markov chain Monte Carlo (MCMC) approach to approximate the posterior distribution in this context, one typically either uses MCMC schemes which target the joint posterior of the parameters and some auxiliary latent variables, or pseudo-marginal Metropolis-Hastings (MH) schemes. The latter mimic a MH algorithm targeting the marginal posterior of the parameters by approximating unbiasedly the intractable likelihood. However, in scenarios where the parameters and auxiliary variables are strongly correlated under the posterior and/or this posterior is multimodal, Gibbs sampling or Hamiltonian Monte Carlo (HMC) will perform poorly and the pseudo-marginal MH algorithm, as any other MH scheme, will be inefficient for high-dimensional parameters. We propose here an original MCMC algorithm, termed pseudo-marginal HMC, which combines the advantages of both HMC and pseudo-marginal schemes. Specifically, the PM-HMC method is controlled by a precision parameter $N$, controlling the approximation of the likelihood and, for any $N$, it samples the marginal posterior of the parameters. Additionally, as $N$ tends to infinity, its sample trajectories and acceptance probability converge to those of an ideal, but intractable, HMC algorithm which would have access to the intractable likelihood and its gradient. We demonstrate through experiments that PM-HMC can outperform significantly both standard HMC and pseudo-marginal MH schemes.

扱いにくい尤度関数がある場合のベイズ推定は、計算上困難です。このコンテキストで事後分布を近似するためにマルコフ連鎖モンテカルロ(MCMC)アプローチに従う場合、通常、パラメータといくつかの補助潜在変数の結合事後分布をターゲットとするMCMCスキーム、または擬似周辺メトロポリス-ヘイスティングス(MH)スキームを使用します。後者は、扱いにくい尤度を偏りなく近似することにより、パラメータの周辺事後分布をターゲットとするMHアルゴリズムを模倣します。ただし、パラメータと補助変数が事後分布の下で強く相関している、および/またはこの事後分布が多峰性であるシナリオでは、ギブスサンプリングまたはハミルトンモンテカルロ(HMC)のパフォーマンスが低下し、擬似周辺MHアルゴリズムは、他のMHスキームと同様に、高次元パラメータに対して非効率的になります。ここでは、HMCと擬似周辺スキームの両方の利点を組み合わせた、擬似周辺HMCと呼ばれる独自のMCMCアルゴリズムを提案します。具体的には、PM-HMC法は精度パラメータ$N$によって制御され、尤度の近似値を制御し、任意の$N$に対してパラメータの周辺事後値をサンプリングします。さらに、$N$が無限大に近づくにつれて、そのサンプル軌道と受け入れ確率は、扱いにくい尤度とその勾配にアクセスできる理想的だが扱いにくいHMCアルゴリズムの軌道と受け入れ確率に収束します。実験により、PM-HMCが標準HMCと疑似周辺MH方式の両方を大幅に上回ることができることを実証します。

Generalization Properties of hyper-RKHS and its Applications
ハイパーRKHSの一般化特性とその応用

This paper generalizes regularized regression problems in a hyper-reproducing kernel Hilbert space (hyper-RKHS), illustrates its utility for kernel learning and out-of-sample extensions, and proves asymptotic convergence results for the introduced regression models in an approximation theory view. Algorithmically, we consider two regularized regression models with bivariate forms in this space, including kernel ridge regression (KRR) and support vector regression (SVR) endowed with hyper-RKHS, and further combine divide-and-conquer with Nyström approximation for scalability in large sample cases. This framework is general: the underlying kernel is learned from a broad class, and can be positive definite or not, which adapts to various requirements in kernel learning. Theoretically, we study the convergence behavior of regularized regression algorithms in hyper-RKHS and derive the learning rates, which goes beyond the classical analysis on RKHS due to the non-trivial independence of pairwise samples and the characterisation of hyper-RKHS. Experimentally, results on several benchmarks suggest that the employed framework is able to learn a general kernel function from an arbitrary similarity matrix, and thus achieves a satisfactory performance on classification tasks.

この論文では、超再生カーネルヒルベルト空間(ハイパーRKHS)における正則化回帰問題を一般化し、カーネル学習とサンプル外拡張に対するその有用性を示し、近似理論の観点から導入された回帰モデルの漸近収束結果を証明します。アルゴリズム的には、この空間で二変量形式を持つ2つの正則化回帰モデル(ハイパーRKHSを備えたカーネルリッジ回帰(KRR)とサポートベクター回帰(SVR))を検討し、さらに分割統治法とNyström近似を組み合わせて、大規模サンプルの場合のスケーラビリティを実現します。このフレームワークは汎用的です。基礎となるカーネルは幅広いクラスから学習され、正定値であってもそうでなくてもかまいません。これにより、カーネル学習のさまざまな要件に適応します。理論的には、ハイパーRKHSでの正則化回帰アルゴリズムの収束動作を研究し、学習率を導出します。これは、ペアワイズサンプルの非自明な独立性とハイパーRKHSの特性により、RKHSの従来の分析を超えています。実験的には、いくつかのベンチマークの結果から、採用したフレームワークが任意の類似性行列から一般的なカーネル関数を学習できることが示唆されており、分類タスクで満足のいくパフォーマンスを達成しています。

Hoeffding’s Inequality for General Markov Chains and Its Applications to Statistical Learning
一般マルコフ連鎖に対するヘフディングの不等式と統計学習への応用

This paper establishes Hoeffding’s lemma and inequality for bounded functions of general-state-space and not necessarily reversible Markov chains. The sharpness of these results is characterized by the optimality of the ratio between variance proxies in the Markov-dependent and independent settings. The boundedness of functions is shown necessary for such results to hold in general. To showcase the usefulness of the new results, we apply them for non-asymptotic analyses of MCMC estimation, respondent-driven sampling and high-dimensional covariance matrix estimation on time series data with a Markovian nature. In addition to statistical problems, we also apply them to study the time-discounted rewards in econometric models and the multi-armed bandit problem with Markovian rewards arising from the field of machine learning.

この論文では、一般状態空間の有界関数と必ずしも可逆的ではないマルコフ連鎖についてのヘフディングの補題と不等式を確立します。これらの結果の鮮明さは、マルコフ依存設定と独立設定における分散プロキシ間の比率の最適性によって特徴付けられます。関数の有界性は、そのような結果が一般に保持されるために必要であることが示されています。新しい結果の有用性を示すために、MCMC推定の非漸近分析、応答者駆動サンプリング、およびマルコフの性質を持つ時系列データに対する高次元共分散行列推定に適用します。統計問題に加えて、計量経済学モデルにおける時間割引報酬の研究や、機械学習の分野から生じるマルコフ報酬を伴う多腕バンディット問題の研究にも適用します。

An algorithmic view of L2 regularization and some path-following algorithms
L2正則化といくつかのパス追従アルゴリズムのアルゴリズム的見解

We establish an equivalence between the $\ell_2$-regularized solution path for a convex loss function, and the solution of an ordinary differential equation (ODE). Importantly, this equivalence reveals that the solution path can be viewed as the flow of a hybrid of gradient descent and Newton method applied to the empirical loss, which is similar to a widely used optimization technique called trust region method. This provides an interesting algorithmic view of $\ell_2$ regularization, and is in contrast to the conventional view that the $\ell_2$ regularization solution path is similar to the gradient flow of the empirical loss. New path-following algorithms based on homotopy methods and numerical ODE solvers are proposed to numerically approximate the solution path. In particular, we consider respectively Newton method and gradient descent method as the basis algorithm for the homotopy method, and establish their approximation error rates over the solution path. Importantly, our theory suggests novel schemes to choose grid points that guarantee an arbitrarily small suboptimality for the solution path. In terms of computational cost, we prove that in order to achieve an $\epsilon$-suboptimality for the entire solution path, the number of Newton steps required for the Newton method is $\mathcal O(\epsilon^{-1/2})$, while the number of gradient steps required for the gradient descent method is $\mathcal O\left(\epsilon^{-1} \ln(\epsilon^{-1})\right)$. Finally, we use $\ell_2$-regularized logistic regression as an illustrating example to demonstrate the effectiveness of the proposed path-following algorithms.

私たちは、凸損失関数の$\ell_2$正則化解経路と、常微分方程式(ODE)の解との間に同等性を確立します。重要なことは、この同等性により、解経路は、経験的損失に適用される勾配降下法とニュートン法のハイブリッドの流れとして見ることができるということです。これは、信頼領域法と呼ばれる広く使用されている最適化手法に似ています。これは、$\ell_2$正則化の興味深いアルゴリズム的見方を提供し、$\ell_2$正則化解経路が経験的損失の勾配フローに似ているという従来の見方とは対照的です。ホモトピー法と数値ODEソルバーに基づく新しい経路追跡アルゴリズムが、解経路を数値的に近似するために提案されています。特に、ニュートン法と勾配降下法をそれぞれホモトピー法の基礎アルゴリズムとして検討し、解経路にわたる近似誤差率を確立します。重要なのは、我々の理論が、解の経路に対して任意に小さい準最適性を保証するグリッドポイントを選択するための新しいスキームを提案していることです。計算コストの観点から、解の経路全体に対して$\epsilon$準最適性を達成するために、ニュートン法に必要なニュートンステップの数は$\mathcal O(\epsilon^{-1/2})$であるのに対し、勾配降下法に必要な勾配ステップの数は$\mathcal O\left(\epsilon^{-1} \ln(\epsilon^{-1})\right)$であることを証明します。最後に、提案された経路追跡アルゴリズムの有効性を示す例として、$\ell_2$正則化ロジスティック回帰を使用します。
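
The link between a regularization path and an ODE can be checked numerically in the simplest case. For the quadratic loss $\tfrac12\beta^\top H\beta-g^\top\beta$ with penalty $\tfrac{\lambda}{2}\|\beta\|^2$, differentiating the stationarity condition $H\beta-g+\lambda\beta=0$ in $\lambda$ gives $(H+\lambda I)\,\mathrm{d}\beta/\mathrm{d}\lambda=-\beta$. The sketch below integrates that ODE with a classical Runge-Kutta scheme and compares the result against the closed-form ridge solution. The parameterization by $\lambda$, the quadratic loss and all function names are assumptions for illustration; the paper's own ODE and time parameterization may differ.

```python
import numpy as np

def ridge_closed_form(H, g, lam):
    """Minimizer of 0.5 * b^T H b - g^T b + 0.5 * lam * ||b||^2."""
    return np.linalg.solve(H + lam * np.eye(len(g)), g)

def path_velocity(H, beta, lam):
    """d(beta)/d(lam), from differentiating H beta - g + lam * beta = 0 in lam."""
    return -np.linalg.solve(H + lam * np.eye(len(beta)), beta)

def follow_path(H, g, lam_start, lam_end, steps=1000):
    """Trace the path from lam_start to lam_end with classical RK4 steps."""
    beta = ridge_closed_form(H, g, lam_start)
    h = (lam_end - lam_start) / steps
    lam = lam_start
    for _ in range(steps):
        k1 = path_velocity(H, beta, lam)
        k2 = path_velocity(H, beta + 0.5 * h * k1, lam + 0.5 * h)
        k3 = path_velocity(H, beta + 0.5 * h * k2, lam + 0.5 * h)
        k4 = path_velocity(H, beta + h * k3, lam + h)
        beta = beta + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        lam += h
    return beta

# Least-squares toy data (assumed setup): H and g are the normal-equation quantities.
rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
H = X.T @ X / n
g = X.T @ y / n

beta_ode = follow_path(H, g, lam_start=10.0, lam_end=1.0)
beta_exact = ridge_closed_form(H, g, 1.0)
print(np.max(np.abs(beta_ode - beta_exact)))
```

Note that the velocity involves the inverse of $H+\lambda I$ applied to $\beta$: it behaves like a Newton direction when $\lambda$ is small and like a scaled gradient-type direction when $\lambda$ is large, which is one way to read the hybrid interpretation in the abstract.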

Hybrid Predictive Models: When an Interpretable Model Collaborates with a Black-box Model
ハイブリッド予測モデル: 解釈可能なモデルがブラックボックスモデルと連携する場合

Interpretable machine learning has become a strong competitor for black-box models. However, the possible loss of the predictive performance for gaining understandability is often inevitable, especially when it needs to satisfy users with diverse backgrounds or high standards for what is considered interpretable. This tension puts practitioners in a dilemma of choosing between high accuracy (black-box models) and interpretability (interpretable models). In this work, we propose a novel framework for building a Hybrid Predictive Model that integrates an interpretable model with any pre-trained black-box model to combine their strengths. The interpretable model substitutes the black-box model on a subset of data where the interpretable model is most competent, gaining transparency at a low cost of the predictive accuracy. We design a principled objective function that considers predictive accuracy, model interpretability, and model transparency (defined as the percentage of data processed by the interpretable substitute). Under this framework, we propose two hybrid models, one substituting with association rules and the other with linear models, and design customized training algorithms for both models. We test the hybrid models on structured data and text data where interpretable models collaborate with various state-of-the-art black-box models. Results show that hybrid models obtain an efficient trade-off between transparency and predictive performance, characterized by Pareto frontiers. Finally, we apply the proposed model on a real-world patients dataset for predicting cardiovascular disease and propose multi-model Pareto frontiers to assist model selection in real applications.

解釈可能な機械学習は、ブラックボックスモデルの強力な競争相手となっています。しかし、理解可能性を得るための予測性能の低下は、特に多様な背景を持つユーザーや、解釈可能と見なされるものに対する高い基準を持つユーザーを満足させる必要がある場合には、避けられないことがよくあります。この緊張により、実践者は、高精度(ブラックボックスモデル)と解釈可能性(解釈可能なモデル)のどちらかを選択するというジレンマに陥ります。この研究では、解釈可能なモデルと任意の事前トレーニング済みブラックボックスモデルを統合してそれぞれの長所を組み合わせたハイブリッド予測モデルを構築するための新しいフレームワークを提案します。解釈可能なモデルは、解釈可能なモデルが最も有能なデータのサブセットでブラックボックスモデルを置き換え、予測精度をわずかに犠牲にするだけで透明性を獲得します。予測精度、モデルの解釈可能性、およびモデルの透明性(解釈可能な代替モデルによって処理されるデータの割合として定義)を考慮した原則的な目的関数を設計します。このフレームワークでは、一方は相関ルールで、もう一方は線形モデルで置き換える2つのハイブリッドモデルを提案し、両方のモデル用にカスタマイズされたトレーニングアルゴリズムを設計します。構造化データとテキストデータでハイブリッドモデルをテストし、解釈可能なモデルをさまざまな最先端のブラックボックスモデルと連携させます。結果は、ハイブリッドモデルがパレートフロンティアによって特徴付けられる透明性と予測パフォーマンスの間で効率的なトレードオフを実現することを示しています。最後に、提案モデルを実際の患者のデータセットに適用して心血管疾患を予測し、実際のアプリケーションでのモデル選択を支援するマルチモデルパレートフロンティアを提案します。

Implicit Langevin Algorithms for Sampling From Log-concave Densities
対数凹密度からのサンプリングのための暗黙のランジュバンアルゴリズム

For sampling from a log-concave density, we study implicit integrators resulting from $\theta$-method discretization of the overdamped Langevin diffusion stochastic differential equation. Theoretical and algorithmic properties of the resulting sampling methods for $ \theta \in [0,1] $ and a range of step sizes are established. Our results generalize and extend prior works in several directions. In particular, for $\theta\ge 1/2$, we prove geometric ergodicity and stability of the resulting methods for all step sizes. We show that obtaining subsequent samples amounts to solving a strongly-convex optimization problem, which is readily achievable using one of numerous existing methods. Numerical examples supporting our theoretical analysis are also presented.

対数凹密度からのサンプリングについて、過減衰ランジュバン拡散確率微分方程式の$\theta$法離散化から生じる陰的積分法を研究します。$\theta \in [0,1]$およびさまざまなステップサイズについて、得られるサンプリング手法の理論的およびアルゴリズム的特性を確立します。私たちの結果は、先行研究をいくつかの方向に一般化し、拡張しています。特に、$\theta\ge 1/2$については、すべてのステップサイズについて、得られる手法の幾何学的エルゴード性と安定性を証明します。後続のサンプルを取得することは、多数の既存の方法の1つを使用して容易に達成できる強凸最適化問題を解くことに相当することを示します。また、理論解析を裏付ける数値例も紹介します。
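
A scalar Gaussian target makes the $\theta$-method easy to inspect, because the implicit update can be solved in closed form. With potential $f(x)=\tfrac{a}{2}x^2$, the update $x_{k+1}=x_k-h\,[\theta f'(x_{k+1})+(1-\theta)f'(x_k)]+\sqrt{2h}\,\xi_k$ is linear in $x_{k+1}$. The sketch below is an illustration under that assumption only; the function names and parameter values are made up for the example, and for a general log-concave density the step would instead require a numerical solve.

```python
import numpy as np

def contraction_factor(theta, h, a):
    """Multiplier applied to x_k in one theta-method step for f(x) = a * x^2 / 2."""
    return (1 - h * (1 - theta) * a) / (1 + h * theta * a)

def theta_step(x, theta, h, a, xi):
    """Solve x_new = x - h*a*(theta*x_new + (1-theta)*x) + sqrt(2h)*xi for x_new."""
    return ((1 - h * (1 - theta) * a) * x + np.sqrt(2 * h) * xi) / (1 + h * theta * a)

def stationary_variance(theta, h, a):
    """Variance of the chain's stationary law; meaningful only when |contraction_factor| < 1."""
    rho = contraction_factor(theta, h, a)
    noise_var = 2 * h / (1 + h * theta * a) ** 2
    return noise_var / (1 - rho ** 2)

# Short run with theta = 1/2 on a target of variance 1/a (assumed toy values).
rng = np.random.default_rng(0)
a, h, theta = 2.0, 0.7, 0.5
x, samples = 0.0, []
for _ in range(5000):
    x = theta_step(x, theta, h, a, rng.standard_normal())
    samples.append(x)
print(np.var(samples), stationary_variance(theta, h, a))
```

In this scalar case the factor has magnitude below 1 for every step size once $\theta\ge 1/2$, while the explicit scheme ($\theta=0$) diverges as soon as $h\,a>2$. For $\theta=1/2$ the stationary variance works out to exactly $1/a$ here, which is a property of this Gaussian example rather than a general claim.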

Learning Sparse Classifiers: Continuous and Mixed Integer Optimization Perspectives
スパース分類器の学習: 連続最適化と混合整数最適化の観点

We consider a discrete optimization formulation for learning sparse classifiers, where the outcome depends upon a linear combination of a small subset of features. Recent work has shown that mixed integer programming (MIP) can be used to solve (to optimality) $\ell_0$-regularized regression problems at scales much larger than what was conventionally considered possible. Despite their usefulness, MIP-based global optimization approaches are significantly slower than the relatively mature algorithms for $\ell_1$-regularization and heuristics for nonconvex regularized problems. We aim to bridge this gap in computation times by developing new MIP-based algorithms for $\ell_0$-regularized classification. We propose two classes of scalable algorithms: an exact algorithm that can handle $p\approx 50,000$ features in a few minutes, and approximate algorithms that can address instances with $p\approx 10^6$ in times comparable to the fast $\ell_1$-based algorithms. Our exact algorithm is based on the novel idea of integrality generation, which solves the original problem (with $p$ binary variables) via a sequence of mixed integer programs that involve a small number of binary variables. Our approximate algorithms are based on coordinate descent and local combinatorial search. In addition, we present new estimation error bounds for a class of $\ell_0$-regularized estimators. Experiments on real and synthetic data demonstrate that our approach leads to models with considerably improved statistical performance (especially variable selection) compared to competing methods.

私たちは、スパース分類器を学習するための離散最適化定式化について検討します。この定式化では、結果は特徴の小さなサブセットの線形結合に依存します。最近の研究では、混合整数計画法(MIP)を使用して、従来可能と考えられていたよりもはるかに大きなスケールで$\ell_0$正則化回帰問題を(最適に)解決できることが示されています。その有用性にもかかわらず、MIPベースのグローバル最適化アプローチは、比較的成熟した$\ell_1$正則化アルゴリズムや非凸正則化問題のヒューリスティックよりも大幅に低速です。私たちは、$\ell_0$正則化分類用の新しいMIPベースのアルゴリズムを開発することで、この計算時間のギャップを埋めることを目指しています。私たちは、数分で$p\approx 50,000$の特徴を処理できる厳密なアルゴリズムと、高速な$\ell_1$ベースのアルゴリズムと同等の時間で$p\approx 10^6$のインスタンスを処理できる近似アルゴリズムの2つのクラスのスケーラブルなアルゴリズムを提案します。私たちの厳密なアルゴリズムは、「整数性生成」(integrality generation)という新しいアイデアに基づいています。これは、少数のバイナリ変数を含む一連の混合整数計画を介して、元の問題($p$個のバイナリ変数を持つ)を解決します。私たちの近似アルゴリズムは、座標降下法とローカル組み合わせ検索に基づいています。さらに、$\ell_0$正則化推定量のクラスに対する新しい推定誤差境界を提示します。実際のデータと合成データでの実験により、私たちのアプローチにより、競合方法と比較して統計的パフォーマンス(特に変数選択)が大幅に向上したモデルが得られることが実証されています。

An Inertial Newton Algorithm for Deep Learning
深層学習のための慣性ニュートンアルゴリズム

We introduce a new second-order inertial optimization method for machine learning called INNA. It exploits the geometry of the loss function while only requiring stochastic approximations of the function values and the generalized gradients. This makes INNA fully implementable and adapted to large-scale optimization problems such as the training of deep neural networks. The algorithm combines both gradient-descent and Newton-like behaviors as well as inertia. We prove the convergence of INNA for most deep learning problems. To do so, we provide a well-suited framework to analyze deep learning loss functions involving tame optimization in which we study a continuous dynamical system together with its discrete stochastic approximations. We prove sublinear convergence for the continuous-time differential inclusion which underlies our algorithm. Additionally, we also show how standard optimization mini-batch methods applied to non-smooth non-convex problems can yield a certain type of spurious stationary points never discussed before. We address this issue by providing a theoretical framework around the new idea of $D$-criticality; we then give a simple asymptotic analysis of INNA. Our algorithm allows for using an aggressive learning rate of $o(1/\log k)$. From an empirical viewpoint, we show that INNA returns competitive results with respect to state of the art (stochastic gradient descent, ADAGRAD, ADAM) on popular deep learning benchmark problems.

私たちは、機械学習のための新しい2次慣性最適化手法INNAを紹介します。これは損失関数の幾何学を利用しますが、関数値と一般化勾配の確率的近似のみを必要とします。これにより、INNAは完全に実装可能になり、ディープニューラルネットワークのトレーニングなどの大規模な最適化問題に適応できます。このアルゴリズムは、勾配降下法とニュートンのような動作の両方と慣性を組み合わせています。私たちは、ほとんどのディープラーニング問題に対するINNAの収束を証明します。そのために、私たちは、連続動的システムとその離散確率的近似を研究する、tame最適化を含むディープラーニング損失関数を分析するための適切なフレームワークを提供します。私たちは、我々のアルゴリズムの基礎となる連続時間微分包含の亜線形収束を証明します。さらに、私たちは、標準的な最適化ミニバッチ法を非滑らかな非凸問題に適用すると、これまで議論されたことのない特定の種類の疑似定常点が生成される可能性があることも示します。私たちは、$D$臨界性という新しいアイデアに関する理論的フレームワークを提供することでこの問題に対処します。次に、INNAの簡単な漸近解析を行います。私たちのアルゴリズムでは、積極的な学習率$o(1/\log k)$を使用できます。経験的な観点から、INNAは、一般的なディープラーニングのベンチマーク問題において、最先端の技術(確率的勾配降下法、ADAGRAD、ADAM)と比較して競争力のある結果を返すことを示しています。

A Contextual Bandit Bake-off
コンテクスト・バンディット・ベイクオフ

Contextual bandit algorithms are essential for solving many real-world interactive machine learning problems. Despite multiple recent successes on statistically optimal and computationally efficient methods, the practical behavior of these algorithms is still poorly understood. We leverage the availability of large numbers of supervised learning datasets to empirically evaluate contextual bandit algorithms, focusing on practical methods that learn by relying on optimization oracles from supervised learning. We find that a recent method (Foster et al., 2018) using optimism under uncertainty works the best overall. A surprisingly close second is a simple greedy baseline that only explores implicitly through the diversity of contexts, followed by a variant of Online Cover (Agarwal et al., 2014) which tends to be more conservative but robust to problem specification by design. Along the way, we also evaluate various components of contextual bandit algorithm design such as loss estimators. Overall, this is a thorough study and review of contextual bandit methodology.

コンテキストバンディットアルゴリズムは、多くの現実世界のインタラクティブな機械学習問題を解決するために不可欠です。統計的に最適で計算効率の高い方法で最近複数の成功例があるにもかかわらず、これらのアルゴリズムの実際の動作はまだ十分に理解されていません。私たちは、教師あり学習の最適化オラクルを利用して学習する実用的な方法に焦点を当て、コンテキストバンディットアルゴリズムを経験的に評価するために、多数の教師あり学習データセットの利用可能性を活用しました。不確実性の下での楽観主義を使用する最近の方法(Fosterら、2018年)が全体的に最も効果的であることがわかりました。驚くほど僅差で2番目に優れているのは、コンテキストの多様性を暗黙的にのみ探索する単純な貪欲ベースラインで、次にオンラインカバーの変種(Agarwalら、2014年)が続きます。これは、より保守的である傾向がありますが、設計上、問題の仕様に対して堅牢です。その過程で、損失推定器などのコンテキストバンディットアルゴリズム設計のさまざまなコンポーネントも評価します。全体として、これはコンテキストバンディット方法論の徹底的な研究とレビューです。

Locally Differentially-Private Randomized Response for Discrete Distribution Learning
離散分布学習のための局所微分プライベートランダム化応答

We consider a setup in which confidential i.i.d. samples $X_1,\dotsc,X_n$ from an unknown finite-support distribution $\boldsymbol{p}$ are passed through $n$ copies of a discrete privatization channel (a.k.a. mechanism) producing outputs $Y_1,\dotsc,Y_n$. The channel law guarantees a local differential privacy of $\epsilon$. Subject to a prescribed privacy level $\epsilon$, the optimal channel should be designed such that an estimate of the source distribution based on the channel outputs $Y_1,\dotsc,Y_n$ converges as fast as possible to the exact value $\boldsymbol{p}$. For this purpose we study the convergence to zero of three distribution distance metrics: $f$-divergence, mean-squared error and total variation. We derive the respective normalized first-order terms of convergence (as $n \to \infty$), which for a given target privacy $\epsilon$ represent a rule-of-thumb factor by which the sample size must be augmented so as to achieve the same estimation accuracy as that of a non-randomizing channel. We formulate the privacy-fidelity trade-off problem as being that of minimizing said first-order term under a privacy constraint $\epsilon$. We further identify a scalar quantity that captures the essence of this trade-off, and prove bounds and data-processing inequalities on this quantity. For some specific instances of the privacy-fidelity trade-off problem, we derive inner and outer bounds on the optimal trade-off curve.

私たちは、未知の有限サポート分布$\boldsymbol{p}$からの機密i.i.d.サンプル$X_1,\dotsc,X_n$が、出力$Y_1,\dotsc,Y_n$を生成する離散プライベート化チャネル(別名メカニズム)の$n$個のコピーを通過する設定を検討します。チャネル法則は、$\epsilon$のローカル差分プライバシーを保証します。規定のプライバシーレベル$\epsilon$に従って、最適なチャネルは、チャネル出力$Y_1,\dotsc,Y_n$に基づくソース分布の推定値が正確な値$\boldsymbol{p}$に可能な限り速く収束するように設計する必要があります。この目的のために、3つの分布距離メトリック($f$ダイバージェンス、平均二乗誤差、および総変動)のゼロへの収束を調べます。それぞれの正規化された収束の一次項（$n \to \infty$として）を導出します。これは、与えられたターゲットプライバシー$\epsilon$に対して、非ランダム化チャネルと同じ推定精度を達成するためにサンプルサイズを増やす必要があるという経験則の係数を表します。プライバシーと忠実度のトレードオフ問題を、プライバシー制約$\epsilon$の下で前述の一次項を最小化する問題として定式化します。さらに、このトレードオフの本質を捉えるスカラー量を特定し、この量の境界とデータ処理の不等式を証明します。プライバシーと忠実度のトレードオフ問題の特定のインスタンスについては、最適なトレードオフ曲線の内側境界と外側境界を導出します。
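
As a concrete illustration of a discrete privatization channel with $\epsilon$-local differential privacy, the following minimal sketch implements the standard $k$-ary randomized response mechanism and a plug-in estimate of $\boldsymbol{p}$ obtained by inverting the channel. This is an assumption chosen for illustration, not necessarily the optimal channel studied in the paper; the function names `krr_channel`, `privatize`, and `estimate_p` are hypothetical.

```python
import numpy as np

def krr_channel(k, eps):
    """k x k row-stochastic channel: keep the true symbol with prob. a, else report another one uniformly."""
    a = np.exp(eps) / (np.exp(eps) + k - 1)   # P(Y = x | X = x)
    b = 1.0 / (np.exp(eps) + k - 1)           # P(Y = y | X = x) for y != x
    return np.full((k, k), b) + (a - b) * np.eye(k)

def privatize(x, k, eps, rng):
    """Pass each sample in x (integers in 0..k-1) through its own copy of the channel."""
    Q = krr_channel(k, eps)
    return np.array([rng.choice(k, p=Q[xi]) for xi in x])

def estimate_p(y, k, eps):
    """The output law is q = Q^T p, so estimate p by solving Q^T p_hat = q_hat."""
    Q = krr_channel(k, eps)
    q_hat = np.bincount(y, minlength=k) / len(y)
    return np.linalg.solve(Q.T, q_hat)
```

The ratio of any two entries in a column of the channel matrix is at most $e^{\epsilon}$, which is the local privacy guarantee. Smaller $\epsilon$ makes the matrix closer to uniform, so more samples are needed for the same accuracy of `estimate_p`.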

MushroomRL: Simplifying Reinforcement Learning Research
MushroomRL:強化学習研究の簡素化

MushroomRL is an open-source Python library developed to simplify the process of implementing and running Reinforcement Learning (RL) experiments. Compared to other available libraries, MushroomRL has been created with the purpose of providing a comprehensive and flexible framework to minimize the effort in implementing and testing novel RL methodologies. The architecture of MushroomRL is built in such a way that every component of a typical RL experiment is already provided, and most of the time users can only focus on the implementation of their own algorithms. MushroomRL is accompanied by a benchmarking suite collecting experimental results of state-of-the-art deep RL algorithms, and allowing to benchmark new ones. The result is a library from which RL researchers can significantly benefit in the critical phase of the empirical analysis of their works. MushroomRL stable code, tutorials, and documentation can be found at https://github.com/MushroomRL/mushroom-rl.

MushroomRLは、強化学習(RL)実験の実装と実行のプロセスを簡素化するために開発されたオープンソースのPythonライブラリです。他の利用可能なライブラリと比較して、MushroomRLは、新しいRL方法論の実装とテストの労力を最小限に抑えるための包括的で柔軟なフレームワークを提供することを目的として作成されました。MushroomRLのアーキテクチャは、一般的なRL実験のすべてのコンポーネントがすでに提供されており、ほとんどの場合、ユーザーは自分のアルゴリズムの実装にのみ集中できるように構築されています。MushroomRLには、最先端のディープRLアルゴリズムの実験結果を収集し、新しいアルゴリズムをベンチマークできるベンチマークスイートが付属しています。その結果、RLの研究者が自分の研究の実証的分析の重要な段階で大きな利益を得ることができるライブラリが生まれました。MushroomRLの安定版コード、チュートリアル、ドキュメントはhttps://github.com/MushroomRL/mushroom-rlにあります。

Learning Whenever Learning is Possible: Universal Learning under General Stochastic Processes
学習が可能なときはいつでも学ぶ:一般的な確率過程の下での普遍的な学習

This work initiates a general study of learning and generalization without the i.i.d. assumption, starting from first principles. While the traditional approach to statistical learning theory typically relies on standard assumptions from probability theory (e.g., i.i.d. or stationary ergodic), in this work we are interested in developing a theory of learning based only on the most fundamental and necessary assumptions implicit in the requirements of the learning problem itself. We specifically study universally consistent function learning, where the objective is to obtain low long-run average loss for any target function, when the data follow a given stochastic process. We are then interested in the question of whether there exist learning rules guaranteed to be universally consistent given only the assumption that universally consistent learning is possible for the given data process. The reasoning that motivates this criterion emanates from a kind of optimist’s decision theory, and so we refer to such learning rules as being optimistically universal. We study this question in three natural learning settings: inductive, self-adaptive, and online. Remarkably, as our strongest positive result, we find that optimistically universal learning rules do indeed exist in the self-adaptive learning setting. Establishing this fact requires us to develop new approaches to the design of learning algorithms. Along the way, we also identify concise characterizations of the family of processes under which universally consistent learning is possible in the inductive and self-adaptive settings. We additionally pose a number of enticing open problems, particularly for the online learning setting.

この研究では、第一原理から始めて、i.i.d.仮定のない学習と一般化の一般的な研究を開始します。統計学習理論に対する従来のアプローチは、通常、確率論の標準的な仮定(i.i.d.または定常エルゴードなど)に依存していますが、この研究では、学習問題自体の要件に暗黙的に含まれる最も基本的で必要な仮定のみに基づいた学習理論の開発に関心があります。具体的には、データが特定の確率過程に従う場合に、任意のターゲット関数の長期平均損失を低く抑えることを目的とする、普遍的に一貫した関数学習を研究します。次に、特定のデータプロセスに対して普遍的に一貫した学習が可能であるという仮定のみを前提として、普遍的に一貫していることが保証された学習ルールが存在するかどうかという問題に関心があります。この基準の動機となる推論は、一種の楽観主義者の意思決定理論から生じるため、このような学習ルールを楽観的に普遍的であると呼んでいます。この問題を、帰納的、自己適応的、オンラインの3つの自然な学習設定で研究します。驚くべきことに、私たちの最も強力な肯定的な結果として、自己適応学習環境には楽観的に普遍的な学習ルールが実際に存在することがわかりました。この事実を確立するには、学習アルゴリズムの設計に対する新しいアプローチを開発する必要があります。その過程で、帰納的および自己適応的な環境で普遍的に一貫した学習が可能な一連のプロセスの簡潔な特徴も特定します。さらに、特にオンライン学習環境に関して、魅力的な未解決の問題をいくつか提起します。

Finite-sample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime
過剰パラメータ化レジームにおける補間線形分類器の有限サンプル解析

We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification. For linearly separable training data, the maximum margin algorithm has been shown in previous work to be equivalent to a limit of training with logistic loss using gradient descent, as the training error is driven to zero. We analyze this algorithm applied to random data including misclassification noise. Our assumptions on the clean data include the case in which the class-conditional distributions are standard normal distributions. The misclassification noise may be chosen by an adversary, subject to a limit on the fraction of corrupted labels. Our bounds show that, with sufficient over-parameterization, the maximum margin algorithm trained on noisy data can achieve nearly optimal population risk.

私たちは、2クラス線形分類の最大マージンアルゴリズムの母集団リスクの上界を証明します。線形分離可能な学習データの場合、最大マージンアルゴリズムは、学習誤差がゼロに近づくにつれて、勾配降下法を使用したロジスティック損失による学習の極限と等価であることが以前の研究で示されています。このアルゴリズムを誤分類ノイズを含むランダムデータに適用して分析します。クリーンなデータに関する仮定には、クラス条件付き分布が標準正規分布である場合が含まれます。誤分類ノイズは、破損したラベルの割合に上限があるという条件の下で、敵対者によって選択される可能性があります。私たちの上界は、十分な過剰パラメーター化により、ノイズの多いデータで学習された最大マージンアルゴリズムがほぼ最適な母集団リスクを達成できることを示しています。

Optimal Bounds between f-Divergences and Integral Probability Metrics
f ダイバージェンスと積分確率メトリクス間の最適境界

The families of $f$-divergences (e.g. the Kullback-Leibler divergence) and Integral Probability Metrics (e.g. total variation distance or maximum mean discrepancies) are widely used to quantify the similarity between probability distributions. In this work, we systematically study the relationship between these two families from the perspective of convex duality. Starting from a tight variational representation of the $f$-divergence, we derive a generalization of the moment-generating function, which we show exactly characterizes the best lower bound of the $f$-divergence as a function of a given IPM. Using this characterization, we obtain new bounds while also recovering in a unified manner well-known results, such as Hoeffding’s lemma, Pinsker’s inequality and its extension to subgaussian functions, and the Hammersley-Chapman-Robbins bound. This characterization also allows us to prove new results on topological properties of the divergence which may be of independent interest.

$f$-divergences(例:Kullback-Leibler発散)とIntegral Probability Metrics(例:Total Variation DistancesまたはMaximum Mean Discrepancies)のファミリーは、確率分布間の類似性を定量化するために広く使用されています。この研究では、これら2つのファミリーの関係性を凸双対性の観点から体系的に研究します。$f$-発散のタイトな変分表現から始めて、モーメント生成関数の一般化を導き出し、与えられたIPMの関数として$f$-divergenceの最適な下限を正確に特徴付けることを示します。この特性評価を使用して、新しい境界を取得すると同時に、Hoeffdingの補題、Pinskerの不等式とそのサブガウス関数への拡張、Hammersley-Chapman-Robbins境界などのよく知られた結果を統一的な方法で回復します。この特性評価により、発散のトポロジカル特性に関する新しい結果を証明することもでき、これは独立した関心事となる可能性があります。
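
Pinsker's inequality, one of the classical results the abstract says is recovered, bounds an IPM (total variation) by an $f$-divergence (Kullback-Leibler): $\mathrm{TV}(p,q) \le \sqrt{\mathrm{KL}(p\|q)/2}$. The sketch below checks it numerically for discrete distributions. It only illustrates this one known special case, not the paper's general characterization, and the helper names are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions with q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def total_variation(p, q):
    """TV(p, q) = half the L1 distance between the two probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 0.5 * float(np.sum(np.abs(p - q)))

def pinsker_gap(p, q):
    """sqrt(KL/2) - TV, which Pinsker's inequality says is non-negative."""
    kl = max(kl_divergence(p, q), 0.0)  # guard against tiny negative round-off
    return np.sqrt(kl / 2.0) - total_variation(p, q)
```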

LassoNet: A Neural Network with Feature Sparsity
LassoNet: 特徴のスパース性を持つニューラルネットワーク

Much work has been done recently to make neural networks more interpretable, and one approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or $\ell_1$-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However the Lasso only applies to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach achieves feature sparsity by adding a skip (residual) layer and allowing a feature to participate in any hidden layer only if its skip-layer representative is active. Unlike other approaches to feature selection for neural nets, our method uses a modified objective function with constraints, and so integrates feature selection with the parameter learning directly. As a result, it delivers an entire regularization path of solutions with a range of feature sparsity. We apply LassoNet to a number of real-data problems and find that it significantly outperforms state-of-the-art methods for feature selection and regression. LassoNet uses projected proximal gradient descent, and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network.

ニューラルネットワークをより解釈しやすくするために、最近多くの研究が行われてきました。その1つのアプローチは、ネットワークが利用可能な特徴のサブセットのみを使用するようにすることです。線形モデルでは、Lasso (または$\ell_1$正則化)回帰は、最も無関係または冗長な特徴に0の重みを割り当て、データサイエンスで広く使用されています。ただし、Lassoは線形モデルにのみ適用されます。ここでは、グローバルな特徴選択を備えたニューラルネットワークフレームワークであるLassoNetを紹介します。私たちのアプローチは、スキップ(残差)層を追加し、スキップ層の代表がアクティブな場合にのみ特徴が任意の隠れ層に参加できるようにすることで、特徴のスパース性を実現します。ニューラルネットの特徴選択に対する他のアプローチとは異なり、私たちの方法は、制約付きの修正された目的関数を使用するため、特徴選択とパラメーター学習が直接統合されます。その結果、さまざまな特徴スパース性を備えた解の正則化パス全体が提供されます。私たちはLassoNetをいくつかの実データ問題に適用し、それが特徴選択と回帰の最先端の方法を大幅に上回ることを確認しました。LassoNetは射影近接勾配降下法を使用し、ディープネットワークに直接一般化します。標準的なニューラルネットワークに数行のコードを追加するだけで実装できます。
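
The rule that a feature may enter the hidden layers only if its skip-layer weight is active can be sketched as follows. The sketch assumes the constraint takes the form $\|W^{(1)}_{\cdot j}\|_\infty \le M|\theta_j|$ for a skip weight $\theta_j$ and first-layer column $W^{(1)}_{\cdot j}$, and it enforces the constraint by naive clipping. Both the constraint form and the clipping step are illustrative assumptions, not the projected proximal update used by LassoNet, and the function names and the value of `M` are hypothetical.

```python
import numpy as np

def enforce_hierarchy(theta, W1, M=10.0):
    """Clip column j of W1 (shape hidden x features) to [-M|theta_j|, M|theta_j|]."""
    bound = M * np.abs(theta)          # one bound per input feature
    return np.clip(W1, -bound, bound)  # broadcasts the bound over hidden units

def forward(X, theta, W1, b1, w2):
    """Skip (linear) path plus a one-hidden-layer ReLU network."""
    hidden = np.maximum(X @ W1.T + b1, 0.0)
    return X @ theta + hidden @ w2

def selected_features(theta):
    """Indices of features whose skip-layer weight is non-zero."""
    return np.flatnonzero(theta)
```

With this constraint, setting $\theta_j = 0$ zeroes the whole column of first-layer weights for feature $j$, so the network output no longer depends on that feature.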

Integrative High Dimensional Multiple Testing with Heterogeneity under Data Sharing Constraints
データ共有制約下における異種性による統合高次元多重試験

Identifying informative predictors in a high-dimensional regression model is a critical step for association analysis and predictive modeling. Signal detection in the high dimensional setting often fails due to the limited sample size. One approach to improving power is through meta-analyzing multiple studies which address the same scientific question. However, integrative analysis of high dimensional data from multiple studies is challenging in the presence of between-study heterogeneity. The challenge is even more pronounced with additional data sharing constraints under which only summary data can be shared across different sites. In this paper, we propose a novel data shielding integrative large–scale testing (DSILT) approach to signal detection allowing between-study heterogeneity and not requiring the sharing of individual-level data. Assuming the underlying high dimensional regression models of the data differ across studies yet share similar support, the proposed method incorporates proper integrative estimation and debiasing procedures to construct test statistics for the overall effects of specific covariates. We also develop a multiple testing procedure to identify significant effects while controlling the false discovery rate (FDR) and false discovery proportion (FDP). Theoretical comparisons of the new testing procedure with the ideal individual-level meta-analysis (ILMA) approach and other distributed inference methods are investigated. Simulation studies demonstrate that the proposed testing procedure performs well in both controlling false discovery and attaining power. The new method is applied to a real example detecting interaction effects of the genetic variants for statins and obesity on the risk for type II diabetes.

高次元回帰モデルで有益な予測子を特定することは、関連分析および予測モデリングにとって重要なステップです。高次元設定でのシグナル検出は、サンプルサイズが限られているため失敗することがよくあります。検出力を向上させる1つの方法は、同じ科学的疑問を取り扱う複数の研究をメタ分析することです。ただし、複数の研究からの高次元データの統合分析は、研究間の異質性がある場合には困難です。異なるサイト間で要約データのみを共有できるという追加のデータ共有制約がある場合、この課題はさらに顕著になります。この論文では、研究間の異質性を許容し、個人レベルのデータの共有を必要としないシグナル検出に対する新しいデータシールド統合大規模テスト(DSILT)アプローチを提案します。データの基盤となる高次元回帰モデルは研究間で異なるが、同様のサポートを共有していると仮定すると、提案された方法は、特定の共変量の全体的な影響のテスト統計を構築するために適切な統合推定およびバイアス除去手順を組み込んでいます。また、偽発見率(FDR)と偽発見比率(FDP)を制御しながら有意な効果を識別するための多重検定手順も開発しました。新しい検定手順と理想的な個人レベルメタ分析(ILMA)アプローチおよびその他の分散推論方法との理論的比較が調査されました。シミュレーション研究により、提案された検定手順は偽発見の制御と検出力の達成の両方で優れたパフォーマンスを発揮することが実証されました。新しい方法は、スタチンと肥満の遺伝子変異がII型糖尿病のリスクに与える影響を検出する実際の例に適用されます。

Bandit Convex Optimization in Non-stationary Environments
非定常環境でのバンディット凸最適化

Bandit Convex Optimization (BCO) is a fundamental framework for modeling sequential decision-making with partial information, where the only feedback available to the player is the one-point or two-point function values. In this paper, we investigate BCO in non-stationary environments and choose the dynamic regret as the performance measure, which is defined as the difference between the cumulative loss incurred by the algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path-length of the comparator sequence that reflects the non-stationarity of environments. We propose a novel algorithm that achieves $O(T^{3/4}(1+P_T)^{1/2})$ and $O(T^{1/2}(1+P_T)^{1/2})$ dynamic regret respectively for the one-point and two-point feedback models. The latter result is optimal, matching the $\Omega(T^{1/2}(1+P_T)^{1/2})$ lower bound established in this paper. Notably, our algorithm is adaptive to the non-stationary environments since it does not require prior knowledge of the path-length $P_T$ ahead of time, which is generally unknown. We further extend the algorithm to an anytime version that does not require to know the time horizon $T$ in advance. Moreover, we study the adaptive regret, another widely used performance measure for online learning in non-stationary environments, and design an algorithm that provably enjoys the adaptive regret guarantees for BCO problems. Finally, we present empirical studies to validate the effectiveness of the proposed approach.

バンディット凸最適化(BCO)は、部分的な情報による順次意思決定をモデル化するための基本的なフレームワークであり、プレーヤーが利用できる唯一のフィードバックは1点または2点の関数値です。この論文では、非定常環境でのBCOを調査し、パフォーマンス指標として動的後悔を選択します。動的後悔は、アルゴリズムによって発生した累積損失と任意の実行可能なコンパレータシーケンスの累積損失との差として定義されます。$T$を時間範囲、$P_T$を環境の非定常性を反映するコンパレータシーケンスのパス長とします。1点および2点のフィードバックモデルでそれぞれ$O(T^{3/4}(1+P_T)^{1/2})$および$O(T^{1/2}(1+P_T)^{1/2})$の動的後悔を実現する新しいアルゴリズムを提案します。後者の結果は最適であり、本論文で確立された$\Omega(T^{1/2}(1+P_T)^{1/2})$の下限と一致しています。特に、私たちのアルゴリズムは、通常は未知であるパス長$P_T$を事前に知る必要がないため、非定常環境に適応しています。さらに、時間範囲$T$を事前に知る必要のないいつでも実行できるバージョンにアルゴリズムを拡張します。さらに、非定常環境でのオンライン学習で広く使用されているもう1つのパフォーマンス指標である適応型後悔を研究し、BCO問題に対して適応型後悔保証を証明できるアルゴリズムを設計します。最後に、提案されたアプローチの有効性を検証するための実証的研究を示します。
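
In the bandit setting the player sees only function values, so gradients have to be estimated from one or two queries per round. The sketch below shows the classical randomized one-point and two-point estimators that are commonly used in BCO. They are given as background under that assumption and are not the paper's full algorithm; the smoothing radius `delta` and the function names are hypothetical.

```python
import numpy as np

def random_unit_vector(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    u = rng.standard_normal(d)
    return u / np.linalg.norm(u)

def one_point_gradient(f, x, delta, rng):
    """Estimate the gradient of a smoothed version of f from a single query."""
    u = random_unit_vector(x.size, rng)
    return (x.size / delta) * f(x + delta * u) * u

def two_point_gradient(f, x, delta, rng):
    """Estimate the gradient from two symmetric queries; much lower variance."""
    u = random_unit_vector(x.size, rng)
    diff = f(x + delta * u) - f(x - delta * u)
    return (x.size / (2.0 * delta)) * diff * u
```

The lower variance of the two-point estimator is one intuition for the gap between the $T^{3/4}$ and $T^{1/2}$ rates quoted for the two feedback models.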

A flexible model-free prediction-based framework for feature ranking
特徴ランキングのための柔軟なモデルフリー予測ベースのフレームワーク

Despite the availability of numerous statistical and machine learning tools for joint feature modeling, many scientists investigate features marginally, i.e., one feature at a time. This is partly due to training and convention but also roots in scientists’ strong interests in simple visualization and interpretability. As such, marginal feature ranking for some predictive tasks, e.g., prediction of cancer driver genes, is widely practiced in the process of scientific discoveries. In this work, we focus on marginal ranking for binary classification, one of the most common predictive tasks. We argue that the most widely used marginal ranking criteria, including the Pearson correlation, the two-sample $t$ test, and two-sample Wilcoxon rank-sum test, do not fully take feature distributions and prediction objectives into account. To address this gap in practice, we propose two ranking criteria corresponding to two prediction objectives: the classical criterion (CC) and the Neyman-Pearson criterion (NPC), both of which use model-free nonparametric implementations to accommodate diverse feature distributions. Theoretically, we show that under regularity conditions, both criteria achieve sample-level ranking that is consistent with their population-level counterpart with high probability. Moreover, NPC is robust to sampling bias when the two class proportions in a sample deviate from those in the population. This property endows NPC good potential in biomedical research where sampling biases are ubiquitous. We demonstrate the use and relative advantages of CC and NPC in simulation and real data studies. Our model-free objective-based ranking idea is extendable to ranking feature subsets and generalizable to other prediction tasks and learning objectives.

共同特徴モデリングのための統計ツールや機械学習ツールが数多く利用可能であるにもかかわらず、多くの科学者は特徴を限界的に、つまり一度に1つの特徴だけを調査しています。これは、部分的には訓練と慣習によるものですが、シンプルな視覚化と解釈可能性に対する科学者の強い関心にも根ざしています。そのため、がんドライバー遺伝子の予測など、一部の予測タスクの限界特徴ランキングは、科学的発見のプロセスで広く実践されています。この研究では、最も一般的な予測タスクの1つであるバイナリ分類の限界ランキングに焦点を当てます。ピアソン相関、2サンプル$t$検定、2サンプルウィルコクソン順位和検定など、最も広く使用されている限界ランキング基準は、特徴分布と予測目的を十分に考慮していないと主張します。この実際のギャップに対処するために、2つの予測目的に対応する2つのランキング基準を提案します。古典的基準(CC)とネイマンピアソン基準(NPC)です。どちらも、多様な特徴分布に対応するためにモデルフリーのノンパラメトリック実装を使用します。理論的には、規則性条件下では、両方の基準が、高い確率で集団レベルの対応と一致するサンプルレベルのランキングを達成することを示しています。さらに、NPCは、サンプル内の2つのクラスの割合が集団内の割合から逸脱している場合でも、サンプリングバイアスに対して堅牢です。この特性により、サンプリングバイアスが遍在する生物医学研究において、NPCは優れた可能性を秘めています。シミュレーションと実際のデータ研究におけるCCとNPCの使用と相対的な利点を示します。モデルフリーの目的ベースのランキングのアイデアは、ランキング機能サブセットに拡張可能であり、他の予測タスクや学習目標に一般化できます。

Convergence Guarantees for Gaussian Process Means With Misspecified Likelihoods and Smoothness
誤って指定された尤度と滑らかさを持つガウス過程平均の収束保証

Gaussian processes are ubiquitous in machine learning, statistics, and applied mathematics. They provide a flexible modelling framework for approximating functions, whilst simultaneously quantifying uncertainty. However, this is only true when the model is well-specified, which is often not the case in practice. In this paper, we study the properties of Gaussian process means when the smoothness of the model and the likelihood function are misspecified. In this setting, an important theoretical question of practical relevance is how accurate the Gaussian process approximations will be given the chosen model and the extent of the misspecification. The answer to this problem is particularly useful since it can inform our choice of model and experimental design. In particular, we describe how the experimental design and choice of kernel and kernel hyperparameters can be adapted to alleviate model misspecification.

ガウス過程は、機械学習、統計学、応用数学のいたるところに存在しています。これらは、関数を近似するための柔軟なモデリングフレームワークを提供すると同時に、不確実性を定量化します。ただし、これはモデルが適切に指定されている場合にのみ当てはまり、実際にはそうではないことがよくあります。この論文では、モデルの滑らかさと尤度関数が誤って指定されている場合のガウス過程平均の特性を研究します。この設定では、選択したモデルと誤指定の程度が与えられた場合のガウス過程近似がどの程度正確になるかという、実際的な関連性を持つ重要な理論的問題です。この問題に対する答えは、モデルと実験計画の選択に情報を提供できるため、特に有用です。特に、カーネルとカーネルハイパーパラメータの実験計画と選択を適応させて、モデルの誤指定を軽減する方法について説明します。

Sparse Convex Optimization via Adaptively Regularized Hard Thresholding
適応的に正則化されたハードしきい値処理によるスパース凸最適化

The goal of Sparse Convex Optimization is to optimize a convex function f under a sparsity constraint s <= s* γ, where s* is the target number of non-zero entries in a feasible solution (sparsity) and γ >= 1 is an approximation factor. There has been a lot of work to analyze the sparsity guarantees of various algorithms (LASSO, Orthogonal Matching Pursuit (OMP), Iterative Hard Thresholding (IHT)) in terms of the Restricted Condition Number κ. The best known algorithms guarantee to find an approximate solution of value f(x*)+ε with the sparsity bound of γ = O(κ min{log ((f(x0)-f(x*)) / ε), κ}), where x* is the target solution. We present a new Adaptively Regularized Hard Thresholding (ARHT) algorithm that makes significant progress on this problem by bringing the bound down to γ=O(κ), which has been shown to be tight for a general class of algorithms including LASSO, OMP, and IHT. This is achieved without significant sacrifice in the runtime efficiency compared to the fastest known algorithms. We also provide a new analysis of OMP with Replacement (OMPR) for general f, under the condition s > s* κ^2 / 4, which yields compressed sensing bounds under the Restricted Isometry Property (RIP). When compared to other compressed sensing approaches, it has the advantage of providing a strong tradeoff between the RIP condition and the solution sparsity, while working for any general function f that meets the RIP condition.

スパース凸最適化の目標は、スパース制約s <= s*γ の下で凸関数fを最適化することです。ここで、s*は実行可能解内の非ゼロ項目の目標数(スパース性)、γ>= 1は近似係数です。さまざまなアルゴリズム(LASSO、直交マッチング追跡(OMP)、反復ハードしきい値化(IHT))のスパース性保証を、制限条件数 κ の観点から分析する作業が数多く行われてきました。最もよく知られているアルゴリズムは、スパース性境界 γ= O(κmin{log ((f(x0)-f(x*)) /ε),κ})で値f(x*)+ε の近似解を見つけることを保証します。ここで、x*は目標解です。私たちは、LASSO、OMP、IHTなどの一般的なアルゴリズムのクラスに対して厳しいことが示されている γ=O(κ)まで境界を下げることでこの問題を大幅に改善する、新しいAdaptively Regularized Hard Thresholding (ARHT)アルゴリズムを紹介します。これは、既知の最速アルゴリズムと比較して、実行時間効率を大幅に犠牲にすることなく達成されます。また、s > s*κ^2 / 4の条件の下で、一般的なfに対するOMP with Replacement (OMPR)の新しい分析も提供し、制限付き等長性特性(RIP)の下で圧縮センシング境界を生成します。他の圧縮センシング手法と比較すると、RIP条件とソリューションのスパース性の間に強力なトレードオフを提供しながら、RIP条件を満たす任意の一般的な関数fに対して機能するという利点があります。
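
For orientation, plain Iterative Hard Thresholding (IHT), one of the baseline algorithms named in the abstract, can be sketched as follows for the least-squares objective $f(x) = \frac{1}{2n}\|Ax-b\|_2^2$. This is the textbook method rather than the adaptively regularized ARHT variant proposed in the paper; the objective, the $1/L$ step size, and the function names are assumptions made for illustration.

```python
import numpy as np

def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and zero out the rest."""
    out = np.zeros_like(x)
    if s > 0:
        keep = np.argsort(np.abs(x))[-s:]
        out[keep] = x[keep]
    return out

def iht_least_squares(A, b, s, n_iters=200):
    """Gradient step on f(x) = ||Ax - b||^2 / (2n), then project onto s-sparse vectors."""
    n, d = A.shape
    L = np.linalg.norm(A, 2) ** 2 / n   # smoothness constant of f
    x = np.zeros(d)
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b) / n
        x = hard_threshold(x - grad / L, s)
    return x
```

In the notation of the abstract, `s` plays the role of the allowed sparsity $s^* \gamma$: running with a larger `s` than the target sparsity is what the approximation factor $\gamma$ measures.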

Langevin Dynamics for Adaptive Inverse Reinforcement Learning of Stochastic Gradient Algorithms
確率的勾配アルゴリズムの適応型逆強化学習のためのランジュバンダイナミクス

Inverse reinforcement learning (IRL) aims to estimate the reward function of optimizing agents by observing their response (estimates or actions). This paper considers IRL when noisy estimates of the gradient of a reward function generated by multiple stochastic gradient agents are observed. We present a generalized Langevin dynamics algorithm to estimate the reward function $R(\theta)$; specifically, the resulting Langevin algorithm asymptotically generates samples from the distribution proportional to $\exp(R(\theta))$. The proposed adaptive IRL algorithms use kernel-based passive learning schemes. We also construct multi-kernel passive Langevin algorithms for IRL which are suitable for high dimensional data. The performance of the proposed IRL algorithms is illustrated on examples in adaptive Bayesian learning, logistic regression (high dimensional problem) and constrained Markov decision processes. We prove weak convergence of the proposed IRL algorithms using martingale averaging methods. We also analyze the tracking performance of the IRL algorithms in non-stationary environments where the utility function $R(\theta)$ has a hyper-parameter that jump-changes over time as a slow Markov chain which is not known to the inverse learner. In this case, martingale averaging yields a Markov switched diffusion limit as the asymptotic behavior of the IRL algorithm.

逆強化学習(IRL)は、最適化エージェントの応答(推定値またはアクション)を観察することによって、そのエージェントの報酬関数を推定することを目的としています。この論文では、複数の確率的勾配エージェントによって生成された報酬関数の勾配のノイズの多い推定値が観察される場合のIRLについて検討します。報酬関数$R(\theta)$を推定するための一般化ランジュバンダイナミクスアルゴリズムを紹介します。具体的には、結果として得られるランジュバンアルゴリズムは、$\exp(R(\theta))$に比例する分布から漸近的にサンプルを生成します。提案された適応型IRLアルゴリズムは、カーネルベースの受動学習スキームを使用します。また、高次元データに適したIRL用のマルチカーネル受動型ランジュバンアルゴリズムを構築します。提案されたIRLアルゴリズムのパフォーマンスは、適応型ベイズ学習、ロジスティック回帰(高次元問題)、制約付きマルコフ決定プロセスの例で説明されています。マルチンゲール平均化法を使用して、提案されたIRLアルゴリズムの弱収束を証明します。また、ユーティリティ関数$R(\theta)$が、逆学習者には知られていない遅いマルコフ連鎖として時間の経過とともに変化するハイパーパラメータを持つ非定常環境でのIRLアルゴリズムの追跡パフォーマンスも分析します。この場合、マルチンゲール平均化により、IRLアルゴリズムの漸近動作としてマルコフスイッチ拡散限界が生成されます。
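
The statement that a Langevin algorithm asymptotically samples from the density proportional to $\exp(R(\theta))$ can be illustrated with the basic unadjusted Langevin iteration $\theta_{k+1} = \theta_k + \eta \nabla R(\theta_k) + \sqrt{2\eta}\,\xi_k$ with standard Gaussian $\xi_k$. The sketch assumes direct access to $\nabla R$, which the paper's passive, kernel-based IRL algorithm does not have, so it shows only the underlying sampling principle. The step size and function name are hypothetical.

```python
import numpy as np

def langevin_samples(grad_R, theta0, step, n_steps, rng):
    """Run independent unadjusted Langevin chains, one per entry of theta0."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        # Drift up the reward gradient plus injected Gaussian noise.
        theta = theta + step * grad_R(theta) + np.sqrt(2.0 * step) * noise
    return theta
```

For example, with $R(\theta) = -\theta^2/2$ the target is a standard normal distribution, so the returned values should have mean close to 0 and variance close to 1, up to a small discretization bias that shrinks with the step size.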

Empirical Bayes Matrix Factorization
経験的ベイズ行列の因数分解

Matrix factorization methods, which include Factor analysis (FA) and Principal Components Analysis (PCA), are widely used for inferring and summarizing structure in multivariate data. Many such methods use a penalty or prior distribution to achieve sparse representations (“Sparse FA/PCA”), and a key question is how much sparsity to induce. Here we introduce a general Empirical Bayes approach to matrix factorization (EBMF), whose key feature is that it estimates the appropriate amount of sparsity by estimating prior distributions from the observed data. The approach is very flexible: it allows for a wide range of different prior families and allows that each component of the matrix factorization may exhibit a different amount of sparsity. The key to this flexibility is the use of a variational approximation, which we show effectively reduces fitting the EBMF model to solving a simpler problem, the so-called “normal means” problem. We demonstrate the benefits of EBMF with sparse priors through both numerical comparisons with competing methods and through analysis of data from the GTEx (Genotype Tissue Expression) project on genetic associations across 44 human tissues. In numerical comparisons EBMF often provides more accurate inferences than other methods. In the GTEx data, EBMF identifies interpretable structure that agrees with known relationships among human tissues. Software implementing our approach is available at https://github.com/stephenslab/flashr.

因子分析(FA)や主成分分析(PCA)などの行列因子分解法は、多変量データの構造を推測および要約するために広く使用されています。このような方法の多くは、ペナルティまたは事前分布を使用してスパース表現(「スパースFA/PCA」)を実現しており、どの程度のスパース性を誘導するかが重要な問題です。ここでは、行列因子分解(EBMF)に対する一般的な経験的ベイズ手法を紹介します。この手法の主な特徴は、観測データから事前分布を推定することで適切な量のスパース性を推定することです。この手法は非常に柔軟で、さまざまな事前ファミリーを幅広く許容し、行列因子分解の各コンポーネントが異なる量のスパース性を示すことを許容します。この柔軟性の鍵となるのは変分近似の使用です。変分近似により、EBMFモデルの適合が、より単純な問題、いわゆる「正規平均」問題の解決に効果的に軽減されることを示します。競合方法との数値比較と、44のヒト組織にわたる遺伝的関連性に関するGTEx (Genotype Tissue Expression)プロジェクトのデータの分析の両方を通じて、スパース事前確率を使用したEBMFの利点を実証します。数値比較では、EBMFは他の方法よりも正確な推論を提供することがよくあります。GTExデータでは、EBMFはヒト組織間の既知の関係に一致する解釈可能な構造を識別します。私たちのアプローチを実装するソフトウェアは、https://github.com/stephenslab/flashrで入手できます。

Some Theoretical Insights into Wasserstein GANs
Wasserstein GANsに関する理論的洞察

Generative Adversarial Networks (GANs) have been successful in producing outstanding results in areas as diverse as image, video, and text generation. Building on these successes, a large number of empirical studies have validated the benefits of the cousin approach called Wasserstein GANs (WGANs), which brings stabilization in the training process. In the present paper, we add a new stone to the edifice by proposing some theoretical advances in the properties of WGANs. First, we properly define the architecture of WGANs in the context of integral probability metrics parameterized by neural networks and highlight some of their basic mathematical features. We stress in particular interesting optimization properties arising from the use of a parametric 1-Lipschitz discriminator. Then, in a statistically-driven approach, we study the convergence of empirical WGANs as the sample size tends to infinity, and clarify the adversarial effects of the generator and the discriminator by underlining some trade-off properties. These features are finally illustrated with experiments using both synthetic and real-world datasets.

敵対的生成ネットワーク(GAN)は、画像、ビデオ、テキスト生成など、さまざまな分野で優れた結果を生み出すことに成功しています。これらの成功を基に、多数の実証研究により、トレーニングプロセスの安定化をもたらすWasserstein GAN (WGAN)と呼ばれる類似のアプローチの利点が検証されています。この論文では、WGANの特性に関する理論的進歩を提案することで、この建物に新たな石を追加します。まず、ニューラルネットワークによってパラメーター化された積分確率メトリックのコンテキストでWGANのアーキテクチャを適切に定義し、その基本的な数学的特徴のいくつかを強調します。特に、パラメトリック1-Lipschitz識別器の使用から生じる興味深い最適化特性を強調します。次に、統計主導のアプローチで、サンプルサイズが無限大に近づくにつれて経験的WGANが収束する様子を調べ、いくつかのトレードオフ特性を強調することで、生成器と識別器の敵対的効果を明らかにします。これらの特徴は、最終的に合成データセットと現実世界のデータセットの両方を使用した実験によって説明されます。

A General Framework for Adversarial Label Learning
敵対的ラベル学習のための一般的なフレームワーク

We consider the task of training classifiers without fully labeled data. We propose a weakly supervised method—adversarial label learning—that trains classifiers to perform well when noisy and possibly correlated labels are provided. Our framework allows users to provide different weak labels and multiple constraints on these labels. Our model then attempts to learn parameters for the data by solving a zero-sum game for the binary problems and a non-zero sum game optimization for multi-class problems. The game is between an adversary that chooses labels for the data and a model that minimizes the error made by the adversarial labels. The weak supervision constrains what labels the adversary can choose. The method therefore minimizes an upper bound of the classifier’s error rate using projected primal-dual subgradient descent. Minimizing this bound protects against bias and dependencies in the weak supervision. We first show the performance of our framework on binary classification tasks then we extend our algorithm to show its performance on multiclass datasets. Our experiments show that our method can train without labels and outperforms other approaches for weakly supervised learning.

私たちは、完全にラベル付けされたデータなしで分類器をトレーニングするタスクについて検討します。私たちは、ノイズが多く相関している可能性のあるラベルが与えられた場合に、分類器がうまく機能するようにトレーニングする、弱教師あり手法、つまり敵対的ラベル学習を提案します。我々のフレームワークでは、ユーザーがさまざまな弱ラベルと、これらのラベルに対する複数の制約を提供できます。次に、我々のモデルは、バイナリ問題の場合はゼロサムゲームを、マルチクラス問題の場合は非ゼロサムゲーム最適化を解くことで、データのパラメータを学習しようとします。ゲームは、データのラベルを選択する敵対者と、敵対的ラベルによるエラーを最小化するモデルとの間で行われます。弱教師は、敵対者が選択できるラベルを制約します。したがって、この手法では、投影されたプライマルデュアル劣勾配降下法を使用して、分類器のエラー率の上限を最小化します。この上限を最小化することで、弱教師におけるバイアスと依存性から保護されます。まず、バイナリ分類タスクでのフレームワークのパフォーマンスを示し、次にアルゴリズムを拡張して、マルチクラスデータセットでのパフォーマンスを示します。私たちの実験では、私たちの方法はラベルなしでトレーニングでき、弱教師あり学習の他のアプローチよりも優れていることが示されています。

Strong Consistency, Graph Laplacians, and the Stochastic Block Model
強一貫性、グラフラプラシアン、および確率的ブロックモデル

Spectral clustering has become one of the most popular algorithms in data clustering and community detection. We study the performance of classical two-step spectral clustering via the graph Laplacian to learn the stochastic block model. Our aim is to answer the following question: when is spectral clustering via the graph Laplacian able to achieve strong consistency, i.e., the exact recovery of the underlying hidden communities? Our work provides an entrywise analysis (an $\ell_{\infty}$-norm perturbation bound) of the Fiedler eigenvector of both the unnormalized and the normalized Laplacian associated with the adjacency matrix sampled from the stochastic block model. We prove that spectral clustering is able to achieve exact recovery of the planted community structure under conditions that match the information-theoretic limits.

スペクトルクラスタリングは、データクラスタリングとコミュニティ検出で最も人気のあるアルゴリズムの1つになっています。私たちは、確率的ブロックモデルを学習するために、グラフラプラシアンを介した古典的な2ステップスペクトルクラスタリングのパフォーマンスを研究します。私たちの目的は、次の質問に答えることです:グラフラプラシアンを介したスペクトルクラスタリングは、いつ強一貫性、つまり、根底にある隠れたコミュニティの正確な回復を達成できるのでしょうか?本研究は、確率的ブロックモデルからサンプリングされた隣接行列に関連付けられた非正規化ラプラシアンと正規化ラプラシアンの両方のフィードラー固有ベクトルの成分ごとの解析($\ell_{\infty}$ノルム摂動限界)を提供します。スペクトルクラスタリングが、情報理論的限界に一致する条件下で、埋め込まれたコミュニティ構造の正確な回復を達成できることを証明します。

An Importance Weighted Feature Selection Stability Measure
重要度加重特徴選択の安定性測度

Current feature selection methods, especially applied to high dimensional data, tend to suffer from instability since marginal modifications in the data may result in largely distinct selected feature sets. Such instability strongly limits a sound interpretation of the selected variables by domain experts. Defining an adequate stability measure is also a research question. In this work, we propose to incorporate into the stability measure the importances of the selected features in predictive models. Such feature importances are directly proportional to feature weights in a linear model. We also consider the generalization to a non-linear setting. We illustrate, theoretically and experimentally, that current stability measures are subject to undesirable behaviors, for example, when they are jointly optimized with predictive accuracy. Results on micro-array and mass-spectrometric data show that our novel stability measure corrects for overly optimistic stability estimates in such a bi-objective context, which leads to improved decision-making. It is also shown to be less prone to the under- or over-estimation of the stability value in feature spaces with groups of highly correlated variables.

特に高次元データに適用される現在の特徴選択方法は、データのわずかな変更によって、選択された特徴セットが大きく異なる可能性があるため、不安定になる傾向があります。このような不安定性は、ドメインの専門家による選択された変数の適切な解釈を強く制限します。適切な安定性尺度を定義することも研究課題です。この研究では、予測モデルで選択された特徴の重要性を安定性尺度に組み込むことを提案します。このような特徴の重要性は、線形モデルの特徴の重みに直接比例します。また、非線形設定への一般化も検討します。現在の安定性尺度は、たとえば予測精度と共同で最適化されている場合に、望ましくない動作の影響を受けることを理論的および実験的に示します。マイクロアレイおよび質量分析データの結果は、私たちの新しい安定性尺度が、このような二目的コンテキストでの過度に楽観的な安定性推定を修正し、意思決定の改善につながることを示しています。また、相関の高い変数のグループを含む特徴空間では、安定性値の過小評価または過大評価が発生しにくいことも示されています。

Stochastic Proximal Methods for Non-Smooth Non-Convex Constrained Sparse Optimization
非平滑非凸制約付きスパース最適化のための確率的近位法

This paper focuses on stochastic proximal gradient methods for optimizing a smooth non-convex loss function with a non-smooth non-convex regularizer and convex constraints. To the best of our knowledge we present the first non-asymptotic convergence bounds for this class of problem. We present two simple stochastic proximal gradient algorithms, for general stochastic and finite-sum optimization problems. In a numerical experiment we compare our algorithms with the current state-of-the-art deterministic algorithm and find our algorithms to exhibit superior convergence.

この論文では、非平滑非凸正則化器と凸制約を使用して滑らかな非凸損失関数を最適化するための確率的近位勾配法に焦点を当てています。私たちの知る限りでは、このクラスの問題に対する最初の非漸近収束限界を提示します。一般的な確率論的および有限和最適化問題に対する2つの単純な確率的近位勾配アルゴリズムを紹介します。数値実験では、アルゴリズムを現在の最先端の決定論的アルゴリズムと比較し、アルゴリズムが優れた収束を示すことを見つけます。

NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization
NUQSGD:不均一量子化による証明可能な通信効率のデータ並列SGD

As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training. One popular communication-compression method for data-parallel SGD is QSGD (Alistarh et al., 2017), which quantizes and encodes gradients to reduce communication costs. The baseline variant of QSGD provides strong theoretical guarantees, however, for practical purposes, the authors proposed a heuristic variant which we call QSGDinf, which demonstrated impressive empirical gains for distributed training of large neural networks. In this paper, we build on this work to propose a new gradient quantization scheme, and show that it has both stronger theoretical guarantees than QSGD, and matches and exceeds the empirical performance of the QSGDinf heuristic and of other compression methods.

モデルとデータセットのサイズと複雑さが増すにつれて、並列モデルトレーニングを実行するためにデプロイできる確率的勾配降下法の通信効率の高いバリアントの必要性も高まります。データ並列SGDの一般的な通信圧縮方法の1つは、QSGD (Alistarhら, 2017)であり、勾配を量子化してエンコードし、通信コストを削減します。QSGDのベースラインバリアントは強力な理論的保証を提供しますが、実用的な目的のために、著者らはQSGDinfと呼ばれるヒューリスティックバリアントを提案しました。これにより、大規模なニューラルネットワークの分散トレーニングで印象的な経験的利益が示されました。この論文では、この研究に基づいて新しい勾配量子化スキームを提案し、QSGDよりも強力な理論的保証を持ち、QSGDinfヒューリスティックや他の圧縮方法の経験的パフォーマンスに匹敵し、それを超えることを示します。
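
As a rough illustration of nonuniform gradient quantization, the sketch below normalizes a gradient by its Euclidean norm and stochastically rounds each magnitude to one of the levels 0, 2^-s, ..., 2^-1, 1, so that the result is unbiased. The exponentially spaced level grid, the function names, and the omission of the encoding step are assumptions made for this example; the abstract itself does not specify them.

```python
import math
import random


def nonuniform_levels(s):
    """Levels 0, 2^-s, ..., 2^-1, 1 (an assumed exponentially spaced grid)."""
    return [0.0] + [2.0 ** (-j) for j in range(s, -1, -1)]


def quantize(vec, s, rng):
    """Stochastically round each |v_i| / ||v|| to a neighbouring level, without bias."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    levels = nonuniform_levels(s)
    out = []
    for v in vec:
        r = abs(v) / norm  # normalized magnitude in [0, 1]
        # Index of the largest level that does not exceed r.
        k = max(i for i, lev in enumerate(levels) if lev <= r)
        lo = levels[k]
        hi = levels[min(k + 1, len(levels) - 1)]
        # Round up with probability equal to the relative position inside [lo, hi].
        p_up = 0.0 if hi == lo else (r - lo) / (hi - lo)
        q = hi if rng.random() < p_up else lo
        out.append(math.copysign(q, v) * norm)
    return out


rng = random.Random(0)
print(quantize([0.3, -0.4, 1.2], 3, rng))
```

Only the norm, the signs, and the level indices would need to be transmitted, which is where the communication saving comes from.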

A Lyapunov Analysis of Accelerated Methods in Optimization
最適化における加速法のLyapunov解析

Accelerated optimization methods, such as Nesterov’s accelerated gradient method, play a significant role in optimization. Several accelerated methods are provably optimal under standard oracle models. Such optimality results are obtained using a technique known as “estimate sequences,” which yields upper bounds on convergence properties. The technique of estimate sequences has long been considered difficult to understand and deploy, leading many researchers to generate alternative, more intuitive methods and analyses. We show there is an equivalence between the technique of estimate sequences and a family of Lyapunov functions in both continuous and discrete time. This connection allows us to develop a unified analysis of many existing accelerated algorithms, introduce new algorithms, and strengthen the connection between accelerated algorithms and continuous-time dynamical systems.

ネステロフの加速勾配法などの加速最適化手法は、最適化において重要な役割を果たします。いくつかの加速法は、標準的なオラクルモデルの下で最適であることが証明されています。このような最適性の結果は、収束特性の上界を与える「推定シーケンス」と呼ばれる手法を使用して得られます。推定シーケンスの手法は、理解して適用するのが難しいと長い間考えられてきたため、多くの研究者が、より直感的な代替の方法と解析を考案するようになりました。私たちは、推定シーケンスの手法とリアプノフ関数の族との間に、連続時間と離散時間の両方で等価性があることを示します。この関係により、多くの既存の加速アルゴリズムの統一的な解析を展開し、新しいアルゴリズムを導入し、加速アルゴリズムと連続時間力学系との結び付きを強めることができます。

L-SVRG and L-Katyusha with Arbitrary Sampling
任意サンプリングによるL-SVRGとL-Katyusha

We develop and analyze a new family of nonaccelerated and accelerated loopless variance-reduced methods for finite-sum optimization problems. Our convergence analysis relies on a novel expected smoothness condition which upper bounds the variance of the stochastic gradient estimation by a constant times a distance-like function. This allows us to handle with ease arbitrary sampling schemes as well as the nonconvex case. We perform an in-depth estimation of these expected smoothness parameters and propose new importance samplings which allow linear speedup when the expected minibatch size is in a certain range. Furthermore, a connection between these expected smoothness parameters and expected separable overapproximation (ESO) is established, which allows us to exploit data sparsity as well. Our general methods and results recover as special cases the loopless SVRG and loopless Katyusha methods.

私たちは、有限和最適化問題のための非加速および加速ループレス分散低減法の新しいファミリを開発および解析します。私たちの収束解析は、確率的勾配推定の分散を距離のような関数の定数倍で上から抑える、新しい期待平滑性条件に依存しています。これにより、任意のサンプリングスキームと非凸の場合を容易に扱うことができます。これらの期待平滑性パラメータの詳細な推定を行い、期待ミニバッチサイズが特定の範囲内にある場合に線形の高速化を可能にする新しい重要度サンプリングを提案します。さらに、これらの期待平滑性パラメータと期待分離可能過大近似(ESO)との関係が確立されるため、データのスパース性も活用できます。私たちの一般的な方法と結果は、ループレスSVRG法とループレスKatyusha法を特別な場合として含みます。
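
The "loopless" idea can be sketched on a toy problem. The example below assumes uniform single-sample selection (the paper covers arbitrary sampling, which is not reproduced here) and a one-dimensional least-squares objective. The function name, step size, and refresh probability are illustrative choices, not values from the paper.

```python
import random


def l_svrg(a, b, step, p, iters, rng):
    """Loopless SVRG on f(x) = (1/n) * sum_i 0.5 * (b[i] * x - a[i])**2 with uniform sampling."""
    n = len(a)

    def grad_i(i, x):
        return b[i] * (b[i] * x - a[i])

    def full_grad(x):
        return sum(grad_i(i, x) for i in range(n)) / n

    x = 0.0
    w = x                  # reference point
    gw = full_grad(w)      # full gradient stored at the reference point
    for _ in range(iters):
        i = rng.randrange(n)
        g = grad_i(i, x) - grad_i(i, w) + gw   # variance-reduced gradient estimate
        x -= step * g
        if rng.random() < p:                   # a coin flip replaces SVRG's outer loop
            w = x
            gw = full_grad(w)
    return x


a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 0.5, 2.0, 1.5]
x_hat = l_svrg(a, b, step=0.04, p=0.25, iters=5000, rng=random.Random(0))
x_star = sum(bi * ai for ai, bi in zip(a, b)) / sum(bi * bi for bi in b)
print(x_hat, x_star)
```

Because the estimate's variance vanishes at the minimizer, a constant step size is enough for the iterate to approach the closed-form solution `x_star`.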

Non-parametric Quantile Regression via the K-NN Fused Lasso
K-NN Fused Lassoによるノンパラメトリック分位点回帰

Quantile regression is a statistical method for estimating conditional quantiles of a response variable. In addition, for mean estimation, it is well known that quantile regression is more robust to outliers than $l_2$-based methods. By using the fused lasso penalty over a $K$-nearest neighbors graph, we propose an adaptive quantile estimator in a non-parametric setup. We show that the estimator attains optimal rate of $n^{-1/d}$ up to a logarithmic factor, under mild assumptions on the data generation mechanism of the $d$-dimensional data. We develop algorithms to compute the estimator and discuss methodology for model selection. Numerical experiments on simulated and real data demonstrate clear advantages of the proposed estimator over state of the art methods.

分位点回帰は、応答変数の条件付き分位点を推定するための統計的手法です。さらに、平均推定については、分位点回帰が$l_2$ベースの方法よりも外れ値に対して頑健であることがよく知られています。$K$近傍グラフ上のfused lassoペナルティを使用することで、ノンパラメトリック設定での適応的な分位点推定量を提案します。この推定量は、$d$次元データのデータ生成メカニズムに関する穏やかな仮定の下で、対数因子を除いて$n^{-1/d}$の最適レートを達成することを示します。推定量を計算するためのアルゴリズムを開発し、モデル選択の方法論について議論します。シミュレーションデータと実データでの数値実験は、提案された推定量が最先端の方法よりも明らかに優れていることを示しています。
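
The robustness claim can be illustrated with the check (pinball) loss that underlies quantile regression. This sketch covers only the loss term on a small hand-made sample; the fused lasso penalty over a $K$-nearest neighbors graph from the abstract is not implemented, and the helper names are hypothetical.

```python
def pinball_loss(y, q, tau):
    """Check (pinball) loss of predicting the value q for observation y at level tau."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def empirical_risk(ys, q, tau):
    """Average pinball loss of the constant prediction q over the sample ys."""
    return sum(pinball_loss(y, q, tau) for y in ys) / len(ys)


ys = [1.0, 2.0, 3.0, 4.0, 100.0]   # one large outlier
# For a constant prediction, a minimizer can always be found among the data points.
best_median = min(ys, key=lambda q: empirical_risk(ys, q, 0.5))
mean = sum(ys) / len(ys)
print(best_median, mean)
```

The minimizer at level 0.5 stays at the middle of the bulk of the data, while the sample mean is dragged toward the outlier.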

River: machine learning for streaming data in Python
River: Pythonによるストリーミングデータのための機械学習

River is a machine learning library for dynamic data streams and continual learning. It provides multiple state-of-the-art learning methods, data generators/transformers, performance metrics and evaluators for different stream learning problems. It is the result from the merger of two popular packages for stream learning in Python: Creme and scikit-multiflow. River introduces a revamped architecture based on the lessons learnt from the seminal packages. River’s ambition is to be the go-to library for doing machine learning on streaming data. Additionally, this open source package brings under the same umbrella a large community of practitioners and researchers. The source code is available at https://github.com/online-ml/river.

Riverは、動的データストリームと継続的な学習のための機械学習ライブラリです。さまざまなストリーム学習問題に対応する複数の最先端の学習方法、データジェネレーター/トランスフォーマー、パフォーマンスメトリクス、および評価器を提供します。これは、Pythonでのストリーム学習のための2つの一般的なパッケージ、Cremeとscikit-multiflowの統合の結果です。Riverは、独創的なパッケージから学んだ教訓に基づいて、刷新されたアーキテクチャを導入しています。Riverの野望は、ストリーミングデータで機械学習を行うための頼りになるライブラリになることです。さらに、このオープンソースパッケージは、実務家と研究者の大規模なコミュニティを同じ傘下に収めています。ソースコードはhttps://github.com/online-ml/riverで入手できます。
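
The stream-learning workflow described above (predict on each arriving sample, then update the model with it) can be sketched without any dependency. The class below is a hypothetical stand-in written for this example and does not import or reproduce River; only the per-sample method names `learn_one` and `predict_one` are modeled on River's convention.

```python
class OnlineLinearRegressor:
    """Hypothetical streaming model for illustration; not part of River."""

    def __init__(self, lr=0.01):
        self.lr = lr
        self.weights = {}
        self.bias = 0.0

    def predict_one(self, x):
        return self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())

    def learn_one(self, x, y):
        # One stochastic gradient step on the squared error of this single sample.
        err = self.predict_one(x) - y
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * err * v
        self.bias -= self.lr * err
        return self


def progressive_mae(model, stream):
    """Predict first, then learn, for each (features, target) pair in arrival order."""
    total, n = 0.0, 0
    for x, y in stream:
        total += abs(model.predict_one(x) - y)
        n += 1
        model.learn_one(x, y)
    return total / n


# Synthetic stream: y = 2 * t + 1 with t cycling through 0.0, 0.1, ..., 0.9.
stream = [({"t": (i % 10) / 10.0}, 2.0 * ((i % 10) / 10.0) + 1.0) for i in range(2000)]
model = OnlineLinearRegressor(lr=0.1)
print(progressive_mae(model, stream))
```

Features are passed as dictionaries so that new keys can appear mid-stream without resizing any array, which is one common design choice for per-sample learners.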

mvlearn: Multiview Machine Learning in Python
mvlearn: Python でのマルチビュー機械学習

As data are generated more and more from multiple disparate sources, multiview data sets, where each sample has features in distinct views, have grown in recent years. However, no comprehensive package exists that enables non-specialists to use these methods easily. mvlearn is a Python library which implements the leading multiview machine learning methods. Its simple API closely follows that of scikit-learn for increased ease-of-use. The package can be installed from Python Package Index (PyPI) and the conda package manager and is released under the MIT open-source license. The documentation, detailed examples, and all releases are available at https://mvlearn.github.io/.

近年、データが複数の異なるソースから生成されることが増えるにつれ、各サンプルが個別のビューに特徴を持つマルチビューデータセットが増加しています。ただし、専門家でなくてもこれらの方法を簡単に使用できる包括的なパッケージは存在しません。mvlearnは、主要なマルチビュー機械学習手法を実装するPythonライブラリです。そのシンプルなAPIは、scikit-learnのAPIと密接に連携しており、使いやすさが向上しています。このパッケージは、Python Package Index(PyPI)およびcondaパッケージマネージャーからインストールでき、MITオープンソースライセンスの下でリリースされています。ドキュメント、詳細な例、およびすべてのリリースは、https://mvlearn.github.io/で入手できます。

Towards a Unified Analysis of Random Fourier Features
ランダムフーリエ特徴の統合解析に向けて

Random Fourier features is a widely used, simple, and effective technique for scaling up kernel methods. The existing theoretical analysis of the approach, however, remains focused on specific learning tasks and typically gives pessimistic bounds which are at odds with the empirical results. We tackle these problems and provide the first unified risk analysis of learning with random Fourier features using the squared error and Lipschitz continuous loss functions. In our bounds, the trade-off between the computational cost and the learning risk convergence rate is problem specific and expressed in terms of the regularization parameter and the number of effective degrees of freedom. We study both the standard random Fourier features method for which we improve the existing bounds on the number of features required to guarantee the corresponding minimax risk convergence rate of kernel ridge regression, as well as a data-dependent modification which samples features proportional to ridge leverage scores and further reduces the required number of features. As ridge leverage scores are expensive to compute, we devise a simple approximation scheme which provably reduces the computational cost without loss of statistical efficiency. Our empirical results illustrate the effectiveness of the proposed scheme relative to the standard random Fourier features method.

ランダムフーリエ特徴は、カーネル法のスケールアップに広く使用されている、シンプルで効果的な手法です。ただし、このアプローチの既存の理論分析は、特定の学習タスクに焦点が当てられており、通常は経験的結果と矛盾する悲観的な境界を示します。私たちはこれらの問題に取り組み、二乗誤差とリプシッツ連続損失関数を使用して、ランダムフーリエ特徴による学習の統合リスク分析を初めて提供します。私たちの境界では、計算コストと学習リスク収束率のトレードオフは問題に固有であり、正則化パラメーターと有効自由度の数で表現されます。私たちは、カーネルリッジ回帰の対応するミニマックスリスク収束率を保証するために必要な特徴の数に関する既存の境界を改善する標準的なランダムフーリエ特徴法と、リッジレバレッジスコアに比例する特徴をサンプリングして必要な特徴の数をさらに減らすデータ依存の変更の両方を研究します。リッジレバレッジスコアの計算にはコストがかかるため、統計的効率を損なうことなく計算コストを確実に削減できる単純な近似スキームを考案します。私たちの実験結果は、標準的なランダムフーリエ特徴法と比較した提案方式の有効性を示しています。
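
For readers unfamiliar with the standard random Fourier features construction, the sketch below approximates a Gaussian kernel with frequencies drawn from its spectral density. It is the plain, data-independent variant only; the leverage-score sampling scheme from the abstract is not implemented, and the function names and the length-scale parameterization are assumptions of this example.

```python
import math
import random


def make_rff(dim, num_features, lengthscale, rng):
    """Sample frequencies and phases, and return a feature map for a Gaussian kernel."""
    omegas = [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(dim)]
              for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x):
        return [scale * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
                for w, b in zip(omegas, phases)]

    return features


def rbf_kernel(x, y, lengthscale):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 * lengthscale^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * lengthscale ** 2))


phi = make_rff(dim=2, num_features=2000, lengthscale=1.0, rng=random.Random(0))
x, y = [0.2, -0.1], [0.5, 0.4]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
print(approx, rbf_kernel(x, y, 1.0))
```

The inner product of the feature vectors is an unbiased estimate of the kernel value, with error shrinking roughly like one over the square root of the number of features.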

Beyond English-Centric Multilingual Machine Translation
英語中心の多言語機械翻訳を超えて

Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric, training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open-source a training data set that covers thousands of language directions with parallel data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems from the Workshop on Machine Translation (WMT). We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.

翻訳における既存の研究では、任意の言語ペア間で翻訳できる単一のモデルをトレーニングすることで、大規模多言語機械翻訳の可能性を実証しました。ただし、この研究の多くは英語中心であり、英語からまたは英語に翻訳されたデータのみでトレーニングされています。これは大規模なトレーニングデータソースによってサポートされていますが、世界中の翻訳ニーズを反映していません。この研究では、100言語の任意のペア間で直接翻訳できる真の多対多多言語翻訳モデルを作成します。大規模なマイニングによって作成された並列データを使用して、何千もの言語方向をカバーするトレーニングデータセットを構築し、オープンソース化します。次に、高密度スケーリングと言語固有のスパースパラメーターを組み合わせてモデル容量を効果的に増加させ、高品質のモデルを作成する方法を探ります。非英語中心モデルに焦点を当てることで、非英語方向間で直接翻訳するときに10 BLEU以上の向上が得られ、機械翻訳ワークショップ(WMT)の最高の単一システムと競合するパフォーマンスが得られます。私たちはスクリプトをオープンソース化して、他の人がデータ、評価、最終的なM2M-100モデルを再現できるようにしています。

Online stochastic gradient descent on non-convex losses from high-dimensional inference
高次元推論からの非凸損失に対するオンライン確率的勾配降下法

Stochastic gradient descent (SGD) is a popular algorithm for optimization problems arising in high-dimensional inference tasks. Here one produces an estimator of an unknown parameter from independent samples of data by iteratively optimizing a loss function. This loss function is random and often non-convex. We study the performance of the simplest version of SGD, namely online SGD, from a random start in the setting where the parameter space is high-dimensional. We develop nearly sharp thresholds for the number of samples needed for consistent estimation as one varies the dimension. Our thresholds depend only on an intrinsic property of the population loss which we call the information exponent. In particular, our results do not assume uniform control on the loss itself, such as convexity or uniform derivative bounds. The thresholds we obtain are polynomial in the dimension and the precise exponent depends explicitly on the information exponent. As a consequence of our results, we find that except for the simplest tasks, almost all of the data is used simply in the initial search phase to obtain non-trivial correlation with the ground truth. Upon attaining non-trivial correlation, the descent is rapid and exhibits law of large numbers type behavior. We illustrate our approach by applying it to a wide set of inference tasks such as phase retrieval, and parameter estimation for generalized linear models, online PCA, and spiked tensor models, as well as to supervised learning for single-layer networks with general activation functions.

確率的勾配降下法(SGD)は、高次元推論タスクで生じる最適化問題に対する一般的なアルゴリズムです。ここでは、損失関数を反復的に最適化することにより、データの独立したサンプルから未知のパラメータの推定値を生成します。この損失関数はランダムで、多くの場合非凸です。パラメータ空間が高次元である設定で、ランダム開始からのSGDの最も単純なバージョン、つまりオンラインSGDのパフォーマンスを調査します。次元を変えても一貫した推定に必要なサンプル数について、ほぼ明確なしきい値を開発します。しきい値は、情報指数と呼ぶ集団損失の固有の特性のみに依存します。特に、結果では、凸性や一様導関数境界など、損失自体に対する一様制御を想定していません。取得したしきい値は次元の多項式であり、正確な指数は情報指数に明示的に依存します。結果から、最も単純なタスクを除いて、ほぼすべてのデータが、グラウンドトゥルースとの非自明な相関関係を取得するために、初期検索フェーズでのみ使用されていることがわかりました。非自明な相関関係が達成されると、降下は急速になり、大数の法則のような動作を示します。このアプローチを、位相検索、一般化線形モデル、オンラインPCA、スパイクテンソルモデルのパラメーター推定などの幅広い推論タスク、および一般的な活性化関数を持つ単層ネットワークの教師あり学習に適用することで、その方法を説明します。

Pathwise Conditioning of Gaussian Processes
ガウス過程のパスワイズコンディショニング

As Gaussian processes are used to answer increasingly complex questions, analytic solutions become scarcer and scarcer. Monte Carlo methods act as a convenient bridge for connecting intractable mathematical expressions with actionable estimates via sampling. Conventional approaches for simulating Gaussian process posteriors view samples as draws from marginal distributions of process values at finite sets of input locations. This distribution-centric characterization leads to generative strategies that scale cubically in the size of the desired random vector. These methods are prohibitively expensive in cases where we would, ideally, like to draw high-dimensional vectors or even continuous sample paths. In this work, we investigate a different line of reasoning: rather than focusing on distributions, we articulate Gaussian conditionals at the level of random variables. We show how this pathwise interpretation of conditioning gives rise to a general family of approximations that lend themselves to efficiently sampling Gaussian process posteriors. Starting from first principles, we derive these methods and analyze the approximation errors they introduce. We, then, ground these results by exploring the practical implications of pathwise conditioning in various applied settings, such as global optimization and reinforcement learning.

ガウス過程はますます複雑な質問に答えるために使われるため、解析解はますます少なくなっています。モンテカルロ法は、扱いにくい数式をサンプリングによって実用的な推定値と結び付ける便利な橋渡しとして機能します。ガウス過程事後確率をシミュレートする従来のアプローチでは、サンプルを有限の入力位置セットにおけるプロセス値の周辺分布からの抽出と見なします。この分布中心の特性評価は、目的のランダムベクトルのサイズの3乗でスケーリングする生成戦略につながります。これらの方法は、理想的には高次元ベクトルや連続サンプルパスを抽出したい場合には、非常に高価です。この研究では、異なる推論方法を検討します。分布に焦点を当てるのではなく、ランダム変数のレベルでガウス条件を明確にします。この条件付けのパスワイズ解釈によって、ガウス過程事後確率を効率的にサンプリングするのに役立つ一般的な近似ファミリーがどのように生じるかを示します。第一原理から始めて、これらの方法を導出し、それらがもたらす近似誤差を分析します。次に、グローバル最適化や強化学習など、さまざまな応用設定におけるパスワイズ条件付けの実際的な意味を調査することで、これらの結果を根拠付けます。
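
Conditioning "at the level of random variables" is commonly expressed through Matheron's update rule. The statement below is given from general knowledge of Gaussian conditioning rather than quoted from the paper, and the notation ($k$, $K_{XX}$, $\sigma^2$, $\varepsilon$) is an assumption of this note.

```latex
% Jointly Gaussian random variables a and b, conditioned on the event b = \beta:
(a \mid b = \beta) \;\overset{d}{=}\; a + \operatorname{Cov}(a, b)\,\operatorname{Cov}(b, b)^{-1}\,(\beta - b)

% Applied to a Gaussian process prior f with noisy observations
% y = f(X) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I):
(f \mid y)(\cdot) \;\overset{d}{=}\; f(\cdot) + k(\cdot, X)\,\bigl(K_{XX} + \sigma^2 I\bigr)^{-1}\bigl(y - f(X) - \varepsilon\bigr)
```

Read this way, a posterior sample is a prior sample plus a data-dependent correction, so any cheap approximation of the prior sample path yields an approximate posterior path without factorizing a covariance matrix at the test locations.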

Explaining Explanations: Axiomatic Feature Interactions for Deep Networks
説明の説明: 深層ネットワークに対する公理的特徴の相互作用

Recent work has shown great promise in explaining neural network behavior. In particular, feature attribution methods explain the features that are important to a model’s prediction on a given input. However, for many tasks, simply identifying significant features may be insufficient for understanding model behavior. The interactions between features within the model may better explain not only the model, but why certain features outrank others in importance. In this work, we present Integrated Hessians, an extension of Integrated Gradients that explains pairwise feature interactions in neural networks. Integrated Hessians overcomes several theoretical limitations of previous methods, and unlike them, is not limited to a specific architecture or class of neural network. Additionally, we find that our method is faster than existing methods when the number of features is large, and outperforms previous methods on existing quantitative benchmarks.

最近の研究は、ニューラルネットワークの動作を説明する上で大きな期待を示しています。特に、特徴帰属法は、特定の入力に対するモデルの予測にとって重要な特徴を説明します。ただし、多くのタスクでは、単に重要な特徴を特定するだけでは、モデルの動作を理解するには不十分な場合があります。モデル内の特徴間の相互作用は、モデルだけでなく、特定の特徴が他の特徴よりも重要度で上位になる理由も、よりよく説明できる場合があります。この研究では、ニューラルネットワークにおけるペアワイズの特徴相互作用を説明する、Integrated Gradientsの拡張であるIntegrated Hessiansを紹介します。Integrated Hessiansは、従来の方法のいくつかの理論的制限を克服し、それらとは異なり、ニューラルネットワークの特定のアーキテクチャやクラスに限定されません。さらに、特徴の数が多い場合、私たちの方法は既存の方法よりも高速であり、既存の定量的ベンチマークで以前の方法よりも優れていることがわかりました。
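
Since Integrated Hessians is described as an extension of Integrated Gradients, a small sketch of the base method may help. This example implements only Integrated Gradients, using a midpoint-rule approximation on a toy function with a hand-written gradient; the pairwise interaction computation of Integrated Hessians is not reproduced, and all names are illustrative.

```python
def integrated_gradients(grad_f, x, baseline, steps=200):
    """Midpoint-rule approximation of Integrated Gradients attributions."""
    n = len(x)
    totals = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(n):
            totals[i] += g[i]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


# Toy model with an explicit interaction term: f(x) = x0 * x1 + 2 * x0.
def f(x):
    return x[0] * x[1] + 2.0 * x[0]


def grad_f(x):
    return [x[1] + 2.0, x[0]]


x, baseline = [1.0, 3.0], [0.0, 0.0]
attr = integrated_gradients(grad_f, x, baseline)
print(attr, sum(attr), f(x) - f(baseline))
```

The attributions sum to the change in the model output, but the contribution of the product term `x0 * x1` is silently divided between the two features. Separating such joint effects is the kind of question pairwise interaction methods address.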

A Unified Analysis of First-Order Methods for Smooth Games via Integral Quadratic Constraints
積分二次制約によるスムーズなゲームのための1次法の統一解析

The theory of integral quadratic constraints (IQCs) allows the certification of exponential convergence of interconnected systems containing nonlinear or uncertain elements. In this work, we adapt the IQC theory to study first-order methods for smooth and strongly-monotone games and show how to design tailored quadratic constraints to get tight upper bounds of convergence rates. Using this framework, we recover the existing bound for the gradient method (GD), derive sharper bounds for the proximal point method (PPM) and optimistic gradient method (OG), and provide for the first time a global convergence rate for the negative momentum method (NM) with an iteration complexity $\mathcal{O}(\kappa^{1.5})$, which matches its known lower bound. In addition, for time-varying systems, we prove that the gradient method with optimal step size achieves the fastest provable worst-case convergence rate with quadratic Lyapunov functions. Finally, we further extend our analysis to stochastic games and study the impact of multiplicative noise on different algorithms. We show that it is impossible for an algorithm with one step of memory to achieve acceleration if it only queries the gradient once per batch (in contrast with the stochastic strongly-convex optimization setting, where such acceleration has been demonstrated). However, we exhibit an algorithm which achieves acceleration with two gradient queries per batch.

積分二次制約(IQC)の理論により、非線形または不確実な要素を含む相互接続システムの指数収束の証明が可能になります。この研究では、IQC理論を適応させて、滑らかで強単調なゲームに対する一次手法を研究し、収束率の上限を厳密に設定するための二次制約の設計方法を示します。このフレームワークを使用して、勾配法(GD)の既存の境界を回復し、近点法(PPM)と楽観的勾配法(OG)のより厳しい境界を導出し、反復複雑度が既知の下限と一致する$\mathcal{O}(\kappa^{1.5})$である負の運動量法(NM)のグローバル収束率を初めて提供します。さらに、時間変動システムの場合、最適なステップサイズの勾配法が、二次リアプノフ関数で最速の証明可能な最悪ケース収束率を達成することを証明します。最後に、分析を確率ゲームにまで拡張し、乗法ノイズがさまざまなアルゴリズムに与える影響を研究します。メモリが1ステップのアルゴリズムでは、バッチごとに勾配を1回しかクエリしない場合、加速を達成することは不可能であることを示します(このような加速が実証されている確率的強凸最適化設定とは対照的です)。ただし、バッチごとに2回の勾配クエリで加速を達成するアルゴリズムを示します。

Learning a High-dimensional Linear Structural Equation Model via l1-Regularized Regression
l1正則化回帰による高次元線形構造方程式モデルの学習

This paper develops a new approach to learning high-dimensional linear structural equation models (SEMs) without the commonly assumed faithfulness, Gaussian error distribution, and equal error distribution conditions. A key component of the algorithm is component-wise ordering and parent estimations, where both problems can be efficiently addressed using $\ell_1$-regularized regression. This paper proves that sample sizes $n = \Omega(d^2 \log p)$ and $n = \Omega(d^2 p^{2/m})$ are sufficient for the proposed algorithm to recover linear SEMs with sub-Gaussian and $(4m)$-th bounded-moment error distributions, respectively, where $p$ is the number of nodes and $d$ is the maximum degree of the moralized graph. Further shown is the worst-case computational complexity $O(n(p^3 + p^2 d^2))$, and hence, the proposed algorithm is statistically consistent and computationally feasible for learning a high-dimensional linear SEM when its moralized graph is sparse. Through simulations, we verify that the proposed algorithm is statistically consistent and computationally feasible, and it performs well compared to the state-of-the-art US, GDS, LISTEN and TD algorithms with our settings. We also demonstrate through real COVID-19 data that the proposed algorithm is well-suited to estimating a virus-spread map in China.

この論文では、一般的に想定されている忠実性、ガウス誤差分布、および等誤差分布の条件なしに、高次元線形構造方程式モデル(SEM)を学習する新しいアプローチを開発します。アルゴリズムの主要な構成要素は、成分ごとの順序付けと親の推定であり、どちらの問題も$\ell_1$正則化回帰を使用して効率的に対処できます。この論文では、サンプルサイズ$n = \Omega(d^2 \log p)$および$n = \Omega(d^2 p^{2/m})$が、提案アルゴリズムがそれぞれサブガウスおよび$(4m)$次の有界モーメント誤差分布を持つ線形SEMを回復するのに十分であることを証明します。ここで、$p$はノード数、$d$はモラルグラフの最大次数です。さらに、最悪の場合の計算複雑度が$O(n(p^3 + p^2 d^2))$であることが示されており、したがって、提案アルゴリズムは、モラルグラフがスパースである場合に高次元線形SEMを学習する上で、統計的に一致性があり、計算上実行可能です。シミュレーションを通じて、提案アルゴリズムが統計的に一致性があり、計算上実行可能であること、そして私たちの設定では最先端のUS、GDS、LISTEN、TDアルゴリズムと比較して良好な性能を示すことを検証しました。また、実際のCOVID-19データを通じて、提案アルゴリズムが中国でのウイルス拡散マップの推定に適していることを実証しました。

LocalGAN: Modeling Local Distributions for Adversarial Response Generation
LocalGAN:敵対的応答生成のための局所分布のモデル化

This paper presents a new methodology for modeling the local semantic distribution of responses to a given query in the human-conversation corpus, and on this basis, explores a specified adversarial learning mechanism for training Neural Response Generation (NRG) models to build conversational agents. Our investigation begins with the thorough discussions upon the objective function of general Generative Adversarial Nets (GAN) architectures, and the training instability problem is proved to be highly relative with the special local distributions of conversational corpora. Consequently, an energy function is employed to estimate the status of a local area restricted by the query and its responses in the semantic space, and the mathematical approximation of this energy-based distribution is finally found. Building on this foundation, a local distribution oriented objective is proposed and combined with the original objective, working as a hybrid loss for the adversarial training of response generation models, named as LocalGAN. Our experimental results demonstrate that the reasonable local distribution modeling of the query-response corpus is of great importance to adversarial NRG, and our proposed LocalGAN is promising for improving both the training stability and the quality of generated results.

この論文では、人間の会話コーパス内の特定のクエリに対する応答のローカルな意味分布をモデル化する新しい方法論を提示し、これに基づいて、会話エージェントを構築するためのニューラル応答生成(NRG)モデルをトレーニングするための特定の敵対的学習メカニズムを検討します。私たちの調査は、一般的な敵対的生成ネット(GAN)アーキテクチャの目的関数に関する徹底的な議論から始まり、トレーニングの不安定性の問題は、会話コーパスの特殊なローカル分布と非常に関連していることが証明されています。その結果、エネルギー関数を使用して、意味空間内のクエリとその応答によって制限されるローカル領域の状態を推定し、このエネルギーベースの分布の数学的近似が最終的に見つかりました。この基礎に基づいて、ローカル分布指向の目的が提案され、元の目的と組み合わされて、応答生成モデルの敵対的トレーニングのハイブリッド損失として機能するLocalGANと名付けられました。私たちの実験結果は、クエリ応答コーパスの合理的なローカル分布モデリングが敵対的NRGにとって非常に重要であり、提案されたLocalGANはトレーニングの安定性と生成された結果の品質の両方を改善するのに有望であることを示しています。

OpenML-Python: an extensible Python API for OpenML
OpenML-Python: OpenML 用の拡張可能な Python API

OpenML is an online platform for open science collaboration in machine learning, used to share datasets and results of machine learning experiments. In this paper, we introduce OpenML-Python, a client API for Python, which opens up the OpenML platform for a wide range of Python-based machine learning tools. It provides easy access to all datasets, tasks and experiments on OpenML from within Python. It also provides functionality to conduct machine learning experiments, upload the results to OpenML, and reproduce results which are stored on OpenML. Furthermore, it comes with a scikit-learn extension and an extension mechanism to easily integrate other machine learning libraries written in Python into the OpenML ecosystem. Source code and documentation are available at https://github.com/openml/openml-python/.

OpenMLは、機械学習におけるオープンサイエンスコラボレーションのためのオンラインプラットフォームであり、機械学習実験のデータセットと結果を共有するために使用されます。この論文では、Pythonベースの機械学習ツールにOpenMLプラットフォームを開放するPython用のクライアントAPIであるOpenML-Pythonを紹介します。Python内からOpenML上のすべてのデータセット、タスク、実験に簡単にアクセスできます。また、機械学習の実験を行い、結果をOpenMLにアップロードし、OpenMLに保存されている結果を再現する機能も提供します。さらに、scikit-learn拡張機能と、Pythonで記述された他の機械学習ライブラリをOpenMLエコシステムに簡単に統合するための拡張メカニズムが付属しています。ソースコードとドキュメントは、https://github.com/openml/openml-python/で入手できます。

Adaptive estimation of nonparametric functionals
ノンパラメトリック汎関数の適応推定

We provide general adaptive upper bounds for estimating nonparametric functionals based on second-order U-statistics arising from finite-dimensional approximation of the infinite-dimensional models. We then provide examples of functionals for which the theory produces rate optimally matching adaptive upper and lower bounds. Our results are automatically adaptive in both parametric and nonparametric regimes of estimation and are automatically adaptive and semiparametric efficient in the regime of parametric convergence rate.

私たちは、無限次元モデルの有限次元近似から生じる2次U統計に基づいてノンパラメトリック汎関数を推定するための一般的な適応上限を提供します。次に、理論が適応的な上限と下限に最適に一致するレートを生成する汎関数の例を提供します。私たちの結果は、パラメトリックとノンパラメトリックの両方の推定領域で自動的に適応性があり、パラメトリック収束率の領域で自動的に適応性とセミパラメトリック効率を示します。

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift
ポリシー勾配法の理論について:最適性、近似、分布シフト

Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: “tabular” policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and where we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average case — which avoid explicit worst-case dependencies on the size of state space — by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).

ポリシー勾配法は、大規模な状態空間やアクション空間を持つ強化学習の問題に取り組む際に最も効果的な方法の1つです。しかし、その最も基本的な理論的収束特性についてさえほとんどわかっていません。たとえば、グローバル最適解に収束するかどうか、収束する場合はどのくらいの速さか、または制限されたクラスのパラメトリックポリシーを使用することで生じる近似誤差にどのように対処するかなどです。この研究では、割引マルコフ決定プロセス(MDP)のコンテキストにおけるポリシー勾配法の計算、近似、およびサンプルサイズの特性について、証明可能な特性を提供します。「表形式」ポリシーパラメータ化(クラスには最適ポリシーが含まれており、最適ポリシーへのグローバル収束を示します)とパラメトリックポリシークラス(対数線形ポリシークラスとニューラルポリシークラスの両方を考慮) (最適ポリシーが含まれていない可能性があり、不可知論的学習結果を提供します)の両方に焦点を当てます。この研究の中心的な貢献の1つは、分布シフトの下での教師あり学習との正式な接続を確立することで、平均ケースの近似保証(状態空間のサイズに対する明示的な最悪ケースの依存関係を回避する)を提供することです。この特性は、推定誤差、近似誤差、および探索（正確に定義された条件数によって特性化される）間の重要な相互作用を示しています。
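As a small illustration of the "tabular" softmax setting mentioned in the abstract, the following sketch runs exact policy gradient ascent on a one-state problem (a bandit). It is a toy under stated assumptions, not the paper's algorithm or analysis: the reward vector, step size, and iteration count are hypothetical, and the gradient is computed exactly rather than sampled.

```python
import math

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def policy_gradient_bandit(rewards, steps=2000, lr=0.5):
    """Exact softmax policy gradient ascent on a one-state problem.

    rewards, steps and lr are illustrative choices, not values from the paper.
    """
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        pi = softmax(theta)
        value = sum(p * r for p, r in zip(pi, rewards))
        # For a softmax policy: d value / d theta_a = pi_a * (r_a - value)
        grad = [p * (r - value) for p, r in zip(pi, rewards)]
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return softmax(theta)

probs = policy_gradient_bandit([1.0, 0.5, 0.0])
print(probs)  # most of the probability mass should sit on the first action
```

In this toy the policy class contains the optimal policy, so the iterates concentrate on the best action; the restricted (log-linear or neural) classes discussed in the abstract are not covered by this sketch.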

Safe Policy Iteration: A Monotonically Improving Approximate Policy Iteration Approach
安全なポリシーの反復:単調に改善する近似的なポリシー反復アプローチ

This paper presents a study of the policy improvement step that can be usefully exploited by approximate policy-iteration algorithms. When either the policy evaluation step or the policy improvement step returns an approximated result, the sequence of policies produced by policy iteration may not be monotonically increasing, and oscillations may occur. To address this issue, we consider safe policy improvements, i.e., at each iteration, we search for a policy that maximizes a lower bound to the policy improvement w.r.t. the current policy, until no improving policy can be found. We propose three safe policy-iteration schemas that differ in the way the next policy is chosen w.r.t. the estimated greedy policy. Besides being theoretically derived and discussed, the proposed algorithms are empirically evaluated and compared on some chain-walk domains, the prison domain, and on the Blackjack card game.

この論文では、近似的なポリシー反復アルゴリズムによって有用に利用できるポリシー改善ステップの研究を紹介します。ポリシー評価ステップまたはポリシー改善ステップのいずれかが近似結果を返す場合、ポリシーの反復によって生成されるポリシーのシーケンスが単調に増加していない可能性があり、振動が発生する可能性があります。この問題に対処するために、安全なポリシーの改善を検討します、つまり、各反復で、改善ポリシーが見つからないまで、現在のポリシーに対するポリシー改善の下限を最大化するポリシーを検索します。私たちは、次のポリシーが選択される方法が異なる3つの安全なポリシー反復スキーマを提案します。これらは、推定された貪欲なポリシーに対して異なります。理論的に導き出され議論されるだけでなく、提案されたアルゴリズムは、いくつかのチェーンウォークドメイン、刑務所ドメイン、およびブラックジャックカードゲームで経験的に評価および比較されます。

Guided Visual Exploration of Relations in Data Sets
データセット内の関係のガイド付き視覚的探索

Efficient explorative data analysis systems must take into account both what a user knows and wants to know. This paper proposes a principled framework for interactive visual exploration of relations in data, through views most informative given the user’s current knowledge and objectives. The user can input pre-existing knowledge of relations in the data and also formulate specific exploration interests, which are then taken into account in the exploration. The idea is to steer the exploration process towards the interests of the user, instead of showing uninteresting or already known relations. The user’s knowledge is modelled by a distribution over data sets parametrised by subsets of rows and columns of data, called tile constraints. We provide a computationally efficient implementation of this concept based on constrained randomisation. Furthermore, we describe a novel dimensionality reduction method for finding the views most informative to the user, which at the limit of no background knowledge and with generic objectives reduces to PCA. We show that the method is suitable for interactive use and is robust to noise, outperforms standard projection pursuit visualisation methods, and gives understandable and useful results in analysis of real-world data. We provide an open-source implementation of the framework.

効率的な探索的データ分析システムは、ユーザーが知っていることと知りたいことの両方を考慮する必要があります。この論文では、ユーザーの現在の知識と目的を考慮して最も有益なビューを通じて、データ内の関係をインタラクティブに視覚的に探索するための原則的なフレームワークを提案します。ユーザーは、データ内の関係に関する既存の知識を入力し、特定の探索の関心を定式化することもできます。これは、探索で考慮されます。アイデアは、興味のない関係や既知の関係を示すのではなく、探索プロセスをユーザーの関心に向けることです。ユーザーの知識は、タイル制約と呼ばれるデータの行と列のサブセットによってパラメーター化されたデータセットの分布によってモデル化されます。制約付きランダム化に基づくこの概念の計算効率の高い実装を提供します。さらに、ユーザーにとって最も有益なビューを見つけるための新しい次元削減方法について説明します。これは、背景知識がなく、一般的な目的の場合の限界では、PCAに削減されます。この方法はインタラクティブな使用に適しており、ノイズに対して堅牢で、標準的な投影追跡視覚化方法よりも優れており、現実世界のデータの分析において理解しやすく有用な結果をもたらすことを示しています。フレームワークのオープンソース実装を提供します。

Histogram Transform Ensembles for Large-scale Regression
大規模回帰のためのヒストグラム変換アンサンブル

In this paper, we propose a novel algorithm for large-scale regression problems named Histogram Transform Ensembles (HTE), composed of random rotations, stretchings, and translations. Our HTE method first implements a histogram transformed partition to the random affine mapped data, then adaptively leverages constant functions or SVMs to obtain the individual regression estimates, and eventually builds the ensemble predictor through an average strategy. First of all, in this paper, we investigate the theoretical properties of HTE when the regression function lies in the Hölder space $C^{k,\alpha}$, $k \in \mathbb{N}_0$, $\alpha \in (0,1]$. In the case that $k=0, 1$, we adopt the constant regressors and develop the naïve histogram transforms (NHT). Within the space $C^{0,\alpha}$, although almost optimal convergence rates can be derived for both single and ensemble NHT, we fail to show the benefits of ensembles over single estimators theoretically. In contrast, in the subspace $C^{1,\alpha}$, we prove that if $d \geq 2(1+\alpha)/\alpha$, the lower bound of the convergence rates for single NHT turns out to be worse than the upper bound of the convergence rates for ensemble NHT. In the other case when $k \geq 2$, the NHT may no longer be appropriate in predicting smoother regression functions. Instead, we circumvent this issue by applying kernel histogram transforms (KHT) equipped with smoother regressors, such as support vector machines (SVMs). Accordingly, it turns out that both single and ensemble KHT enjoy almost optimal convergence rates. Then, we validate the above theoretical results with extensive numerical experiments. On the one hand, simulations are conducted to elucidate that ensemble NHT outperforms single NHT. On the other hand, the effects of bin sizes on the accuracy of both NHT and KHT are also in accord with the theoretical analysis. Last but not least, in the real-data experiments, comparisons between the ensemble KHT, equipped with adaptive histogram transforms, and other state-of-the-art large-scale regression estimators verify the effectiveness and precision of the proposed algorithm.

この論文では、ランダムな回転、伸縮、および平行移動で構成されるヒストグラム変換アンサンブル(HTE)という大規模な回帰問題向けの新しいアルゴリズムを提案します。私たちのHTE法では、まずランダムなアフィン写像を適用したデータにヒストグラム変換による分割を行い、次に定数関数またはSVMを適応的に活用して個々の回帰推定値を取得し、最終的に平均化によってアンサンブル予測子を構築します。まず、この論文では、回帰関数がヘルダー(Hölder)空間$C^{k,\alpha}$、$k \in \mathbb{N}_0$、$\alpha \in (0,1]$にある場合のHTEの理論的特性を調査します。$k=0, 1$の場合、定数回帰器を採用し、素朴ヒストグラム変換(NHT)を開発します。空間$C^{0,\alpha}$内では、単一およびアンサンブルNHTの両方でほぼ最適な収束率を導出できますが、アンサンブルが単一推定量よりも優れていることを理論的に示すことはできません。対照的に、部分空間$C^{1,\alpha}$では、$d \geq 2(1+\alpha)/\alpha$の場合、単一NHTの収束率の下限がアンサンブルNHTの収束率の上限よりも悪くなることを証明します。$k \geq 2$のもう1つのケースでは、NHTはより滑らかな回帰関数を予測するのに適切ではなくなる可能性があります。代わりに、サポートベクターマシン(SVM)などのより滑らかな回帰器を備えたカーネルヒストグラム変換(KHT)を適用することで、この問題を回避します。したがって、単一およびアンサンブルKHTの両方がほぼ最適な収束率を享受することがわかります。次に、上記の理論的結果を広範な数値実験で検証します。一方では、アンサンブルNHTが単一NHTよりも優れていることを明らかにするためにシミュレーションを実行します。他方では、ビンサイズがNHTとKHTの両方の精度に与える影響も、理論分析と一致しています。最後に、実際のデータ実験では、適応型ヒストグラム変換を備えたアンサンブルKHTと他の最先端の大規模回帰推定器との比較により、提案されたアルゴリズムの有効性と精度が検証されます。

Consistent Semi-Supervised Graph Regularization for High Dimensional Data
高次元データに対する一貫性のある半教師ありグラフ正則化

Semi-supervised Laplacian regularization, a standard graph-based approach for learning from both labelled and unlabelled data, was recently demonstrated to have an insignificant high dimensional learning efficiency with respect to unlabelled data, causing it to be outperformed by its unsupervised counterpart, spectral clustering, given sufficient unlabelled data. Following a detailed discussion on the origin of this inconsistency problem, a novel regularization approach involving centering operation is proposed as solution, supported by both theoretical analysis and empirical results.

ラベル付きデータとラベルなしデータの両方から学習するための標準的なグラフベースのアプローチである半教師ありラプラシアン正則化は、高次元においてラベルなしデータからの学習効率がごくわずかであることが最近実証されました。その結果、ラベルなしデータが十分にある場合、教師なしの対応物であるスペクトルクラスタリングに性能で劣ります。この不整合問題の起源についての詳細な議論に続いて、理論的分析と経験的結果の両方によって裏付けられた、センタリング操作を含む新しい正則化アプローチが解決策として提案されます。

Flexible Signal Denoising via Flexible Empirical Bayes Shrinkage
柔軟な経験的ベイズ縮小による柔軟な信号ノイズ除去

Signal denoising—also known as non-parametric regression—is often performed through shrinkage estimation in a transformed (e.g., wavelet) domain; shrinkage in the transformed domain corresponds to smoothing in the original domain. A key question in such applications is how much to shrink, or, equivalently, how much to smooth. Empirical Bayes shrinkage methods provide an attractive solution to this problem; they use the data to estimate a distribution of underlying “effects,” hence automatically select an appropriate amount of shrinkage. However, most existing implementations of empirical Bayes shrinkage are less flexible than they could be—both in their assumptions on the underlying distribution of effects, and in their ability to handle heteroskedasticity—which limits their signal denoising applications. Here we address this by adopting a particularly flexible, stable and computationally convenient empirical Bayes shrinkage method and applying it to several signal denoising problems. These applications include smoothing of Poisson data and heteroskedastic Gaussian data. We show through empirical comparisons that the results are competitive with other methods, including both simple thresholding rules and purpose-built empirical Bayes procedures. Our methods are implemented in the R package smashr, “SMoothing by Adaptive SHrinkage in R,” available at https://www.github.com/stephenslab/smashr.

信号ノイズ除去(ノンパラメトリック回帰とも呼ばれる)は、多くの場合、変換された(ウェーブレットなど)ドメインでの収縮推定を通じて実行されます。変換されたドメインでの収縮は、元のドメインでの平滑化に対応します。このようなアプリケーションで重要な問題は、どの程度収縮するか、または同等に、どの程度平滑化するかです。経験的ベイズ収縮法は、この問題に対する魅力的なソリューションを提供します。経験的ベイズ収縮法は、データを使用して基礎となる「効果」の分布を推定し、適切な量の収縮を自動的に選択します。ただし、経験的ベイズ収縮の既存の実装のほとんどは、基礎となる効果の分布に関する仮定と不均一分散の処理能力の両方において、本来の柔軟性に欠けており、信号ノイズ除去アプリケーションが制限されています。ここでは、特に柔軟で安定しており、計算が簡単な経験的ベイズ収縮法を採用し、それをいくつかの信号ノイズ除去問題に適用することで、この問題に対処します。これらのアプリケーションには、ポアソンデータと不均一分散ガウスデータの平滑化が含まれます。経験的な比較を通じて、その結果が、単純なしきい値ルールと専用の経験的ベイズ手順の両方を含む他の方法と競合できることを示しています。私たちの方法は、Rパッケージsmashr、「RでのAdaptive SHrinkageによるSMoothing」に実装されており、https://www.github.com/stephenslab/smashrで入手できます。

NEU: A Meta-Algorithm for Universal UAP-Invariant Feature Representation
NEU:ユニバーサルUAP不変特徴表現のためのメタアルゴリズム

Effective feature representation is key to the predictive performance of any algorithm. This paper introduces a meta-procedure, called Non-Euclidean Upgrading (NEU), which learns feature maps that are expressive enough to embed the universal approximation property (UAP) into most model classes while only outputting feature maps that preserve any model class’s UAP. We show that NEU can learn any feature map with these two properties if that feature map is asymptotically deformable into the identity. We also find that the feature-representations learned by NEU are always submanifolds of the feature space. NEU’s properties are derived from a new deep neural model that is universal amongst all orientation-preserving homeomorphisms on the input space. We derive qualitative and quantitative approximation guarantees for this architecture. We quantify the number of parameters required for this new architecture to memorize any set of input-output pairs while simultaneously fixing every point of the input space lying outside some compact set, and we quantify the size of this set as a function of our model’s depth. Moreover, we show that deep feedforward networks with most commonly used activation functions typically do not have all these properties. NEU’s performance is evaluated against competing machine learning methods on various regression and dimension reduction tasks both with financial and simulated data.

効果的な特徴表現は、あらゆるアルゴリズムの予測性能の鍵となります。この論文では、非ユークリッドアップグレード(NEU)と呼ばれるメタ手順を紹介します。これは、ほとんどのモデルクラスに普遍近似特性(UAP)を埋め込むのに十分な表現力を持つ特徴マップを学習し、任意のモデルクラスのUAPを保持する特徴マップのみを出力します。特徴マップが恒等写像に漸近的に変形可能である場合、NEUはこれら2つの特性を持つ任意の特徴マップを学習できることを示します。また、NEUによって学習された特徴表現は常に特徴空間のサブ多様体であることもわかりました。NEUの特性は、入力空間のすべての方向保持同相写像の間で普遍的な新しいディープニューラルモデルから派生しています。このアーキテクチャの定性的および定量的な近似保証を導き出します。この新しいアーキテクチャが、入力と出力のペアの任意のセットを記憶し、同時に、あるコンパクトなセットの外側にある入力空間のすべてのポイントを固定するために必要なパラメータの数を定量化し、このセットのサイズをモデルの深さの関数として定量化します。さらに、最も一般的に使用される活性化関数を備えたディープフィードフォワードネットワークは、通常、これらすべての特性を備えているわけではないことを示します。NEUのパフォーマンスは、金融データとシミュレーションデータの両方を使用して、さまざまな回帰および次元削減タスクで競合する機械学習方法と比較して評価されます。

Analysis of high-dimensional Continuous Time Markov Chains using the Local Bouncy Particle Sampler
局所弾力粒子サンプラーを用いた高次元連続時間マルコフ連鎖の解析

Sampling the parameters of high-dimensional Continuous Time Markov Chains (CTMC) is a challenging problem with important applications in many fields of applied statistics. In this work a recently proposed type of non-reversible rejection-free Markov Chain Monte Carlo (MCMC) sampler, the Bouncy Particle Sampler (BPS), is brought to bear on this problem. BPS has demonstrated favourable computational efficiency compared with state-of-the-art MCMC algorithms; however, to date, applications to real-data scenarios have been scarce. An important aspect of practical implementation of BPS is the simulation of event times. Default implementations use conservative thinning bounds. Such bounds can slow down the algorithm and limit the computational performance. Our paper develops an algorithm with an exact analytical solution to the random event times in the context of CTMCs. Our local version of the BPS algorithm takes advantage of the sparse structure in the target factor graph, and we also provide a graph-theoretic tool for assessing the computational complexity of local BPS algorithms.

高次元連続時間マルコフ連鎖(CTMC)のパラメータのサンプリングは、応用統計の多くの分野で重要な応用がある難しい問題です。この研究では、最近提案されたタイプの非可逆拒否フリーマルコフ連鎖モンテカルロ(MCMC)サンプラーであるBouncy Particle Sampler (BPS)をこの問題に利用します。BPSは、最先端のMCMCアルゴリズムと比較して優れた計算効率を実証していますが、これまで実際のデータシナリオへの応用はほとんどありませんでした。BPSの実際の実装の重要な側面は、イベント時間のシミュレーションです。デフォルトの実装では、保守的な間引き境界が使用されます。このような境界は、アルゴリズムを遅くし、計算パフォーマンスを制限する可能性があります。私たちの論文では、CTMCのコンテキストでランダムイベント時間に対する正確な解析ソリューションを備えたアルゴリズムを開発します。BPSアルゴリズムのローカルバージョンは、ターゲットファクターグラフのスパース構造を活用し、ローカルBPSアルゴリズムの計算複雑性を評価するためのグラフ理論ツールも提供します。

Risk Bounds for Unsupervised Cross-Domain Mapping with IPMs
IPMを使用した教師なしクロスドメインマッピングのリスクバウンド

The recent empirical success of unsupervised cross-domain mapping algorithms, in mapping between two domains that share common characteristics, is not well-supported by theoretical justifications. This lacuna is especially troubling, given the clear ambiguity in such mappings. We work with adversarial training methods based on integral probability metrics (IPMs) and derive a novel risk bound, which upper bounds the risk between the learned mapping $h$ and the target mapping $y$, by a sum of three terms: (i) the risk between $h$ and the most distant alternative mapping that was learned by the same cross-domain mapping algorithm, (ii) the minimal discrepancy between the target domain and the domain obtained by applying a hypothesis $h^*$ on the samples of the source domain, where $h^*$ is a hypothesis selectable by the same algorithm, and (iii) an approximation error term that decreases as the capacity of the class of discriminators increases and is empirically shown to be small. The bound is directly related to Occam’s razor and encourages the selection of the minimal architecture that supports a small mapping discrepancy. The bound leads to multiple algorithmic consequences, including a method for hyperparameter selection and early stopping in cross-domain mapping.

共通の特性を持つ2つのドメイン間のマッピングにおける教師なしクロスドメインマッピングアルゴリズムの最近の実証的な成功は、理論的な正当性によって十分に裏付けられていません。このようなマッピングには明らかな曖昧さがあるため、この欠陥は特に厄介です。私たちは積分確率メトリック(IPM)に基づく敵対的トレーニング方法を使用して、学習済みマッピング$h$とターゲットマッピング$y$間のリスクの上限を次の3つの項の合計で定める新しいリスク境界を導きました。(i) $h$と、同じクロスドメインマッピングアルゴリズムによって学習された最も遠い代替マッピング間のリスク、(ii)ターゲットドメインと、ソースドメインのサンプルに仮説$h^*$を適用して得られたドメイン間の最小の相違($h^*$は同じアルゴリズムによって選択可能な仮説)、(iii)識別器のクラスの容量が増加するにつれて減少し、経験的に小さいことが示されている近似誤差項。この境界はオッカムの剃刀に直接関連しており、小さなマッピングの矛盾をサポートする最小限のアーキテクチャの選択を推奨します。この境界は、ハイパーパラメータの選択方法やクロスドメインマッピングの早期停止など、複数のアルゴリズムの結果をもたらします。

Bayesian Text Classification and Summarization via A Class-Specified Topic Model
クラス指定トピックモデルによるベイジアンテキストの分類と要約

We propose the class-specified topic model (CSTM) to deal with the tasks of text classification and class-specific text summarization. The model assumes that in addition to a set of latent topics that are shared across classes, there is a set of class-specific latent topics for each class. Each document is a probabilistic mixture of the class-specific topics associated with its class and the shared topics. Each class-specific or shared topic has its own probability distribution over a given dictionary. We develop a Bayesian inference of CSTM in the semisupervised scenario, with the supervised scenario as a special case. We analyze in detail the 20 Newsgroups dataset, a benchmark dataset for text classification, and demonstrate that CSTM has better performance than a two stage approach based on latent Dirichlet allocation (LDA), several existing supervised extensions of LDA, and an $L^1$ penalized logistic regression. The favorable performance of CSTM is also demonstrated through Monte Carlo simulations and an analysis of the Reuters dataset.

私たちは、テキスト分類とクラス固有のテキスト要約のタスクを処理するために、クラス指定トピックモデル(CSTM)を提案します。このモデルでは、クラス間で共有される潜在トピックのセットに加えて、各クラスにクラス固有の潜在トピックのセットがあると想定しています。各ドキュメントは、そのクラスに関連付けられたクラス固有のトピックと共有トピックの確率的混合です。各クラス固有または共有トピックは、特定の辞書に対して独自の確率分布を持ちます。教師ありシナリオを特別なケースとして、半教師ありシナリオでCSTMのベイズ推論を開発します。テキスト分類のベンチマークデータセットである20 Newsgroupsデータセットを詳細に分析し、CSTMが、潜在ディリクレ割り当て(LDA)に基づく2段階アプローチ、LDAのいくつかの既存の教師あり拡張、および$L^1$ペナルティ付きロジスティック回帰よりも優れたパフォーマンスを発揮することを実証します。CSTMの優れたパフォーマンスは、モンテカルロシミュレーションとReutersデータセットの分析によっても実証されています。

Edge Sampling Using Local Network Information
ローカルネットワーク情報を使用したエッジサンプリング

Edge sampling is an important topic in network analysis. It provides a natural way to reduce network size while retaining desired features of the original network. Sampling methods that only use local information are common in practice as they do not require access to the entire network and can be easily parallelized. Despite promising empirical performances, most of these methods are derived from heuristic considerations and lack theoretical justification. In this paper, we study a simple and efficient edge sampling method that uses local network information. We show that when the local connectivity is sufficiently strong, the sampled network satisfies a strong spectral property. We measure the strength of local connectivity by a global parameter and relate it to more common network statistics such as the clustering coefficient and network curvature. Based on this result, we also provide sufficient conditions under which random networks and hypergraphs can be efficiently sampled.

エッジサンプリングは、ネットワーク解析における重要なトピックです。これは、元のネットワークの望ましい機能を維持しながら、ネットワークサイズを縮小する自然な方法を提供します。ローカル情報のみを使用するサンプリング方法は、ネットワーク全体にアクセスする必要がなく、簡単に並列化できるため、実際には一般的です。有望な経験的パフォーマンスにもかかわらず、これらの方法のほとんどはヒューリスティックな考察から導き出されており、理論的な正当性を欠いています。この論文では、ローカルネットワーク情報を用いたシンプルで効率的なエッジサンプリング手法について検討します。ローカル接続が十分に強い場合、サンプリングされたネットワークは強いスペクトル特性を満たすことを示します。グローバルパラメーターによってローカル接続性の強度を測定し、クラスタリング係数やネットワークの曲率など、より一般的なネットワーク統計に関連付けます。この結果に基づいて、ランダムネットワークとハイパーグラフを効率的にサンプリングできる十分な条件も提供します。

On Solving Probabilistic Linear Diophantine Equations
確率的線形ディオファントス方程式の解法について

Multiple methods exist for computing marginals involving a linear Diophantine constraint on random variables. Each of these extant methods has some limitation on the dimension and support or on the type of marginal computed (e.g., sum-product inference, max-product inference, maximum a posteriori, etc.). Here, we introduce the “trimmed $p$-convolution tree,” an approach that generalizes the applicability of the existing methods and achieves a runtime within a $\log$-factor or better compared to the best existing methods. A second form of trimming, which we call underflow/overflow trimming, is introduced; it aggregates events that land outside the support of a random variable into the nearest support. Trimmed $p$-convolution trees with and without underflow/overflow trimming are used in different protein inference models. Then two different methods of approximating max-convolution using Cartesian product trees are introduced.

確率変数に対する線形ディオファントス制約を含む周辺分布を計算するための方法は複数存在します。これらの既存の各方法には、次元とサポート、または計算される周辺の種類(たとえば、和積推論、最大積推論、最大事後確率など)に関して何らかの制限があります。ここでは、既存の方法の適用範囲を一般化し、既存の最良の方法と比較して$\log$倍以内、またはそれより優れた実行時間を達成するアプローチである「トリミングされた$p$-畳み込みツリー」を紹介します。アンダーフロー/オーバーフロートリミングと呼ぶ2つ目のトリミング形式も導入します。これは、確率変数のサポートの外側に落ちるイベントを最も近いサポートに集約するものです。アンダーフロー/オーバーフロートリミングの有無にかかわらず、トリミングされた$p$-畳み込みツリーは、さまざまなタンパク質推論モデルで使用されます。次に、デカルト積木を使用して最大畳み込みを近似する2つの異なる方法を紹介します。

Multi-view Learning as a Nonparametric Nonlinear Inter-Battery Factor Analysis
ノンパラメトリック非線形電池間因子解析としてのマルチビュー学習

Factor analysis aims to determine latent factors, or traits, which summarize a given data set. Inter-battery factor analysis extends this notion to multiple views of the data. In this paper we show how a nonlinear, nonparametric version of these models can be recovered through the Gaussian process latent variable model. This gives us a flexible formalism for multi-view learning where the latent variables can be used both for exploratory purposes and for learning representations that enable efficient inference for ambiguous estimation tasks. Learning is performed in a Bayesian manner through the formulation of a variational compression scheme which gives a rigorous lower bound on the log likelihood. Our Bayesian framework provides strong regularization during training, allowing the structure of the latent space to be determined efficiently and automatically. We demonstrate this by producing the first (to our knowledge) published results of learning from dozens of views, even when data is scarce. We further show experimental results on several different types of multi-view data sets and for different kinds of tasks, including exploratory data analysis, generation, ambiguity modelling through latent priors and classification.

因子分析の目的は、特定のデータセットを要約する潜在因子または特性を決定することです。インターバッテリー因子分析では、この概念をデータの複数のビューに拡張します。この論文では、ガウス過程潜在変数モデルを使用して、これらのモデルの非線形でノンパラメトリックなバージョンを復元する方法を示します。これにより、潜在変数を探索目的と、あいまいな推定タスクの効率的な推論を可能にする表現の学習の両方に使用できる、マルチビュー学習の柔軟な形式が得られます。学習は、対数尤度の厳密な下限を与える変分圧縮スキームの定式化を通じてベイジアン方式で実行されます。ベイジアンフレームワークは、トレーニング中に強力な正則化を提供し、潜在空間の構造を効率的かつ自動的に決定できるようにします。データが不足している場合でも、数十のビューから学習した結果を初めて(私たちの知る限り)公開することで、これを実証します。さらに、探索的データ分析、生成、潜在的事前確率による曖昧性モデリング、分類など、さまざまな種類のマルチビューデータセットとさまざまな種類のタスクに関する実験結果を示します。

Gradient Methods Never Overfit On Separable Data
分離可能なデータに過剰適合しない勾配法

A line of recent works established that when training linear predictors over separable data, using gradient methods and exponentially-tailed losses, the predictors asymptotically converge in direction to the max-margin predictor. As a consequence, the predictors asymptotically do not overfit. However, this does not address the question of whether overfitting might occur non-asymptotically, after some bounded number of iterations. In this paper, we formally show that standard gradient methods (in particular, gradient flow, gradient descent and stochastic gradient descent) *never* overfit on separable data: If we run these methods for $T$ iterations on a dataset of size $m$, both the empirical risk and the generalization error decrease at an essentially optimal rate of $\tilde{\mathcal{O}}(1/\gamma^2 T)$ up till $T\approx m$, at which point the generalization error remains fixed at an essentially optimal level of $\tilde{\mathcal{O}}(1/\gamma^2 m)$ regardless of how large $T$ is. Along the way, we present non-asymptotic bounds on the number of margin violations over the dataset, and prove their tightness.

最近の一連の研究により、勾配法と指数的テール損失を使用して分離可能なデータに対して線形予測子をトレーニングすると、予測子は最大マージン予測子の方向に漸近的に収束することが確立されました。結果として、予測子は漸近的に過剰適合しません。ただし、これは、ある制限された回数の反復後に過剰適合が非漸近的に発生するかどうかという問題には対処していません。この論文では、標準的な勾配法（特に、勾配フロー、勾配降下法、確率的勾配降下法）が分離可能なデータにオーバーフィットすることは*決して*ないことを正式に示します。これらの方法をサイズ$m$のデータセットで$T$回の反復で実行すると、経験的リスクと一般化誤差の両方が$T\approx m$まで本質的に最適な速度$\tilde{\mathcal{O}}(1/\gamma^2 T)$で減少し、その時点では、$T$の大きさに関係なく、一般化誤差は本質的に最適なレベル$\tilde{\mathcal{O}}(1/\gamma^2 m)$で固定されたままになります。その過程で、データセット上のマージン違反の数の非漸近的な境界を示し、その厳しさを証明します。
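The setting in this abstract (a linear predictor, the exponentially-tailed logistic loss, plain gradient descent, separable data) can be tried on a toy problem. The sketch below is only an illustration under stated assumptions: the six 2-D points, the step size, and the iteration count are hypothetical, and it reports training-set quantities only, so it says nothing about the paper's generalization bounds.

```python
import math

# Hypothetical linearly separable toy data (not from the paper).
X = [(2.0, 1.0), (1.0, 2.0), (3.0, 3.0), (-1.0, -2.0), (-2.0, -1.0), (-3.0, -2.0)]
Y = [1, 1, 1, -1, -1, -1]

def margins(w):
    """Unnormalized margins y_i * <w, x_i> for every training point."""
    return [y * (w[0] * x[0] + w[1] * x[1]) for x, y in zip(X, Y)]

def logistic_loss(w):
    """Mean logistic loss log(1 + exp(-margin))."""
    return sum(math.log1p(math.exp(-m)) for m in margins(w)) / len(X)

def gradient_descent(steps=500, lr=0.1):
    """Plain gradient descent on the mean logistic loss, started at w = 0."""
    w = [0.0, 0.0]
    history = []
    for _ in range(steps):
        g = [0.0, 0.0]
        for x, y, m in zip(X, Y, margins(w)):
            s = 1.0 / (1.0 + math.exp(m))  # sigmoid(-margin)
            g[0] -= s * y * x[0] / len(X)
            g[1] -= s * y * x[1] / len(X)
        w = [w[0] - lr * g[0], w[1] - lr * g[1]]
        history.append(logistic_loss(w))
    return w, history

w, history = gradient_descent()
violations = sum(1 for m in margins(w) if m <= 0)  # margin violations on the data
normalized_margin = min(margins(w)) / math.hypot(w[0], w[1])
print(violations, normalized_margin, history[-1])
```

On this data every point ends up with a positive margin, and the loss keeps decreasing as the weight norm grows, which is the qualitative behaviour the abstract builds on.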

Variance Reduced Median-of-Means Estimator for Byzantine-Robust Distributed Inference
ビザンチンロバスト分散推論のための分散低減平均中央値推定量

This paper develops an efficient distributed inference algorithm, which is robust against a moderate fraction of Byzantine nodes, namely arbitrary and possibly adversarial machines in a distributed learning system. In robust statistics, the median-of-means (MOM) has been a popular approach to hedge against Byzantine failures due to its ease of implementation and computational efficiency. However, the MOM estimator has the shortcoming in terms of statistical efficiency. The first main contribution of the paper is to propose a variance reduced median-of-means (VRMOM) estimator, which improves the statistical efficiency over the vanilla MOM estimator and is computationally as efficient as the MOM. Based on the proposed VRMOM estimator, we develop a general distributed inference algorithm that is robust against Byzantine failures. Theoretically, our distributed algorithm achieves a fast convergence rate with only a constant number of rounds of communications. We also provide the asymptotic normality result for the purpose of statistical inference. To the best of our knowledge, this is the first normality result in the setting of Byzantine-robust distributed learning. The simulation results are also presented to illustrate the effectiveness of our method.

この論文では、分散学習システム内の任意の、場合によっては敵対的なマシンなど、適度な割合のビザンチンノードに対して堅牢な、効率的な分散推論アルゴリズムを開発します。堅牢な統計では、平均の中央値(MOM)は、実装の容易さと計算効率のため、ビザンチン障害に対する一般的なヘッジ手法となっています。しかし、MOM推定器には統計効率の点で欠点があります。この論文の最初の主な貢献は、分散低減平均の中央値(VRMOM)推定器を提案することです。この推定器は、バニラMOM推定器よりも統計効率が向上し、計算効率はMOMと同等です。提案されたVRMOM推定器に基づいて、ビザンチン障害に対して堅牢な一般的な分散推論アルゴリズムを開発します。理論的には、この分散アルゴリズムは、一定数の通信ラウンドのみで高速収束率を実現します。統計推論の目的で漸近正規性の結果も提供します。私たちの知る限り、これはビザンチン堅牢分散学習の設定における最初の正規性結果です。私たちの方法の有効性を示すために、シミュレーション結果も提示されています。
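For orientation, the vanilla median-of-means (MOM) baseline that the abstract starts from can be sketched as follows. This is not the VRMOM estimator proposed in the paper; the number of machines, the sample sizes, and the value sent by the Byzantine machines are hypothetical choices for the illustration.

```python
import random
import statistics

def median_of_means(machine_samples):
    """Vanilla MOM: each machine reports its local mean, the center takes the median."""
    local_means = [statistics.fmean(samples) for samples in machine_samples]
    return statistics.median(local_means)

rng = random.Random(0)
# 9 honest machines draw from a distribution with true mean 1.0 (hypothetical setup).
honest = [[rng.gauss(1.0, 1.0) for _ in range(200)] for _ in range(9)]
# 2 Byzantine machines send arbitrary, adversarially large values.
byzantine = [[1e6] * 200 for _ in range(2)]

all_machines = honest + byzantine
estimate = median_of_means(all_machines)
naive = statistics.fmean([x for samples in all_machines for x in samples])
print(estimate, naive)
```

The median ignores the two corrupted local means, while the pooled average is dominated by them. The price is statistical efficiency, since the median of local means is noisier than the full-sample mean, which is the shortcoming the abstract says VRMOM addresses.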

Statistical Query Lower Bounds for Tensor PCA
Tensor PCA の統計クエリの下限

In the Tensor PCA problem introduced by Richard and Montanari (2014), one is given a dataset consisting of $n$ samples $\mathbf{T}_{1:n}$ of i.i.d. Gaussian tensors of order $k$ with the promise that $\mathbb{E}\mathbf{T}_1$ is a rank-1 tensor and $\|\mathbb{E} \mathbf{T}_1\| = 1$. The goal is to estimate $\mathbb{E} \mathbf{T}_1$. This problem exhibits a large conjectured hard phase when $k>2$: When $d \lesssim n \ll d^{\frac{k}{2}}$ it is information theoretically possible to estimate $\mathbb{E} \mathbf{T}_1$, but no polynomial time estimator is known. We provide a sharp analysis of the optimal sample complexity in the Statistical Query (SQ) model and show that SQ algorithms with polynomial query complexity not only fail to solve Tensor PCA in the conjectured hard phase, but also have a strictly sub-optimal sample complexity compared to some polynomial time estimators such as the Richard-Montanari spectral estimator. Our analysis reveals that the optimal sample complexity in the SQ model depends on whether $\mathbb{E} \mathbf{T}_1$ is symmetric or not. For symmetric, even order tensors, we also isolate a sample size regime in which it is possible to test if $\mathbb{E} \mathbf{T}_1 = \mathbf{0}$ or $\mathbb{E}\mathbf{T}_1 \neq \mathbf{0}$ with polynomially many queries but not estimate $\mathbb{E}\mathbf{T}_1$. Our proofs rely on the Fourier analytic approach of Feldman, Perkins and Vempala (2018) to prove sharp SQ lower bounds.

RichardとMontanari (2014)によって導入されたテンソルPCA問題では、$\mathbb{E}\mathbf{T}_1$がランク1テンソルであり、$\|\mathbb{E} \mathbf{T}_1\| = 1$であるという条件で、$k$次数のi.i.d.ガウステンソルの$n$個のサンプル$\mathbf{T}_{1:n}$で構成されるデータセットが与えられます。目標は、$\mathbb{E} \mathbf{T}_1$を推定することです。この問題は、$k>2$の場合に、大きな想定されるハードフェーズを示します。$d \lesssim n \ll d^{\frac{k}{2}}$の場合、$\mathbb{E} \mathbf{T}_1$を推定することは理論的には可能ですが、多項式時間の推定量はわかっていません。我々は統計クエリ(SQ)モデルにおける最適なサンプル複雑度の鋭い分析を提供し、多項式クエリ複雑度を持つSQアルゴリズムは、想定されるハードフェーズでTensor PCAを解決できないだけでなく、Richard-Montanariスペクトル推定器などの一部の多項式時間推定器と比較して厳密に最適ではないサンプル複雑度を持つことを示します。我々の分析は、SQモデルにおける最適なサンプル複雑度は、$\mathbb{E} \mathbf{T}_1$が対称であるかどうかに依存することを明らかにしています。対称で偶数次のテンソルについては、多項式的に多くのクエリを使用して$\mathbb{E} \mathbf{T}_1 = \mathbf{0}$または$\mathbb{E}\mathbf{T}_1 \neq \mathbf{0}$かどうかをテストできるサンプルサイズ領域も分離しますが、$\mathbb{E}\mathbf{T}_1$を推定することはできません。私たちの証明は、Feldman、Perkins、Vempala (2018)のフーリエ解析アプローチに依存して、明確なSQ下限を証明しています。

PyKEEN 1.0: A Python Library for Training and Evaluating Knowledge Graph Embeddings
PyKEEN 1.0: ナレッジグラフの埋め込みをトレーニングおよび評価するためのPythonライブラリ

Recently, knowledge graph embeddings (KGEs) have received significant attention, and several software libraries have been developed for training and evaluation. While each of them addresses specific needs, we report on a community effort toward a re-design and re-implementation of PyKEEN, one of the early KGE libraries. PyKEEN 1.0 enables users to compose knowledge graph embedding models based on a wide range of interaction models, training approaches, and loss functions, and permits the explicit modeling of inverse relations. It allows users to measure each component’s influence individually on the model’s performance. Besides, an automatic memory optimization has been realized in order to optimally exploit the provided hardware. Through the integration of Optuna, extensive hyper-parameter optimization (HPO) functionalities are provided.

近年、ナレッジグラフエンベディング(KGE)が大きな注目を集めており、トレーニングや評価のためにいくつかのソフトウェアライブラリが開発されています。それぞれが特定のニーズに対応していますが、初期のKGEライブラリの1つであるPyKEENの再設計と再実装に対するコミュニティの取り組みについて報告します。PyKEEN 1.0は、幅広いインタラクションモデル、トレーニングアプローチ、損失関数に基づいてナレッジグラフ埋め込みモデルを構成することを可能にし、逆関係の明示的なモデリングを可能にします。これにより、ユーザーは各コンポーネントがモデルのパフォーマンスに与える影響を個別に測定できます。さらに、提供されたハードウェアを最適に活用するために、自動メモリ最適化が実現されています。Optunaの統合により、広範なハイパーパラメータ最適化(HPO)機能が提供されます。

Knowing what You Know: valid and validated confidence sets in multiclass and multilabel prediction
知っていることを知る: 多クラスおよび多ラベル予測における有効で検証済みの信頼度セット

We develop conformal prediction methods for constructing valid predictive confidence sets in multiclass and multilabel problems without assumptions on the data generating distribution. A challenge here is that typical conformal prediction methods—which give marginal validity (coverage) guarantees—provide uneven coverage, in that they address easy examples at the expense of essentially ignoring difficult examples. By leveraging ideas from quantile regression, we build methods that always guarantee correct coverage but additionally provide (asymptotically consistent) conditional coverage for both multiclass and multilabel prediction problems. To address the potential challenge of exponentially large confidence sets in multilabel prediction, we build tree-structured classifiers that efficiently account for interactions between labels. Our methods can be bolted on top of any classification model—neural network, random forest, boosted tree—to guarantee its validity. We also provide an empirical evaluation, simultaneously providing new validation methods, that suggests the more robust coverage of our confidence sets.

私たちは、データ生成分布に関する仮定なしに、マルチクラスおよびマルチラベル問題における有効な予測信頼セットを構築するための共形予測法を開発します。ここでの課題は、限界的な妥当性（カバレッジ）保証を与える典型的な共形予測法は、本質的に難しい例を無視する代償として簡単な例に対処するという点で、不均一なカバレッジを提供することです。私たちは、分位回帰のアイデアを活用して、常に正しいカバレッジを保証するだけでなく、マルチクラスおよびマルチラベル予測問題の両方に対して（漸近的に一貫した）条件付きカバレッジも提供する方法を構築します。マルチラベル予測における指数関数的に大きな信頼セットの潜在的な課題に対処するために、ラベル間の相互作用を効率的に考慮するツリー構造の分類器を構築します。我々の方法は、ニューラルネットワーク、ランダムフォレスト、ブーストツリーなどのあらゆる分類モデルの上にボルトで固定して、その妥当性を保証できます。我々はまた、我々の信頼セットのより堅牢なカバレッジを示唆する、新しい検証方法も提供する経験的評価も提供します。
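The "typical conformal prediction methods" with marginal coverage that the abstract contrasts against can be illustrated with standard split conformal prediction for a multiclass classifier. The sketch below is that baseline only, not the quantile-regression-based method of the paper; the calibration probabilities, labels, and miscoverage level `alpha` are hypothetical.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split conformal threshold on the score 1 - p(true label)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return 1.0  # too few calibration points: include every label
    return scores[k - 1]

def prediction_set(probs, threshold):
    """All labels whose score 1 - p(label) does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]

# Hypothetical calibration data: predicted class probabilities and true labels.
cal_probs = [
    [0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5], [0.5, 0.4, 0.1], [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7], [0.8, 0.1, 0.1], [0.25, 0.5, 0.25],
]
cal_labels = [0, 0, 1, 2, 1, 0, 2, 0, 1]

q = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
ambiguous = prediction_set([0.5, 0.45, 0.05], q)   # two plausible labels
confident = prediction_set([0.9, 0.05, 0.05], q)   # one dominant label
print(q, ambiguous, confident)
```

The set grows when the classifier is unsure and shrinks when it is confident, but the guarantee is only on average over examples, which is the uneven-coverage issue the abstract describes.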

Communication-Efficient Distributed Covariance Sketch, with Application to Distributed PCA
通信効率の良い分散共分散スケッチと分散PCAへの応用

A sketch of a large data set captures vital properties of the original data while typically occupying much less space. In this paper, we consider the problem of computing a sketch of a massive data matrix $A\in\mathbb{R}^{n\times d}$ that is distributed across $s$ machines. Our goal is to output a matrix $B\in\mathbb{R}^{\ell\times d}$ which is significantly smaller than $A$ but still approximates it well in terms of covariance error, i.e., $\|{A^TA-B^TB}\|_2$. Such a matrix $B$ is called a covariance sketch of $A$. We are mainly focused on minimizing the communication cost, which is arguably the most valuable resource in distributed computations. We show that there is a nontrivial gap between deterministic and randomized communication complexity for computing a covariance sketch. More specifically, we first prove an almost tight deterministic communication lower bound, then provide a new randomized algorithm with communication cost smaller than the deterministic lower bound. Based on a well-known connection between covariance sketch and approximate principal component analysis, we obtain better communication bounds for the distributed PCA problem. Moreover, we also give an improved distributed PCA algorithm for sparse input matrices, which uses our distributed sketching algorithm as a key building block.

大規模なデータセットのスケッチは、通常、占有するスペースがはるかに少ないまま、元のデータの重要な特性を捉えます。この論文では、$s$台のマシンに分散されている大規模なデータ行列$A\in\mathbb{R}^{n\times d}$のスケッチを計算する問題について検討します。目標は、{共分散誤差}、つまり$\|{A^TA-B^TB}\|_2$に関して$A$よりも大幅に小さいが、それでも$A$をよく近似する行列$B\in\mathbb{R}^{\ell\times d}$を出力することです。このような行列$B$は、$A$の共分散スケッチと呼ばれます。私たちは主に、分散計算においておそらく最も価値のあるリソースである通信コストを最小化することに焦点を当てています。共分散スケッチを計算するための決定論的通信複雑さとランダム化通信複雑さの間には、重要なギャップがあることを示します。具体的には、まずほぼ厳密な決定論的通信下限を証明し、次に通信コストが決定論的下限よりも小さい新しいランダム化アルゴリズムを提供します。共分散スケッチと近似主成分分析のよく知られた関係に基づいて、分散PCA問題に対するより優れた通信上限を取得します。さらに、分散スケッチアルゴリズムを主要な構成要素として使用する、スパース入力行列用の改良された分散PCAアルゴリズムも提供します。
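
To make the notion of a covariance sketch concrete, here is a minimal sketch using Frequent Directions, a well-known deterministic sketching algorithm whose sketches can be merged. It is used here only as a simple baseline and is not the randomized algorithm of the paper; the function names, the per-row SVD (slow but easy to read), and the assumption $\ell \le d$ are illustrative choices.

```python
import numpy as np

def frequent_directions(rows, ell):
    """Sketch the given rows into an (ell x d) matrix B; assumes ell <= d."""
    d = rows.shape[1]
    B = np.zeros((ell, d))
    for a in rows:
        B[-1] = a                                   # the last row is zero before this line
        _, s, Vt = np.linalg.svd(B, full_matrices=False)
        delta = s[-1] ** 2                          # smallest squared singular value
        s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
        B = s_shrunk[:, None] * Vt                  # the last row becomes zero again
    return B

def distributed_sketch(parts, ell):
    """Each machine sketches its own rows; a coordinator merges the sketches."""
    local = [frequent_directions(P, ell) for P in parts]
    return frequent_directions(np.vstack(local), ell)

def covariance_error(A, B):
    """Spectral norm of A^T A - B^T B."""
    return np.linalg.norm(A.T @ A - B.T @ B, 2)

rng = np.random.default_rng(0)
A = rng.normal(size=(300, 20))
parts = np.array_split(A, 3)                        # rows held by s = 3 machines
B = distributed_sketch(parts, ell=10)
print(covariance_error(A, B), np.linalg.norm(A, "fro") ** 2 / 10)
```

In this scheme each machine sends only an $\ell\times d$ matrix to the coordinator instead of its raw rows, and the merged sketch satisfies the classical Frequent Directions bound $\|A^TA-B^TB\|_2 \le \|A\|_F^2/\ell$.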

Is SGD a Bayesian sampler? Well, almost
SGDはベイジアンサンプラーですか?まあ、ほとんど

Deep neural networks (DNNs) generalise remarkably well in the overparameterised regime, suggesting a strong inductive bias towards functions with low generalisation error. We empirically investigate this bias by calculating, for a range of architectures and datasets, the probability $P_{SGD}(f\mid S)$ that an overparameterised DNN, trained with stochastic gradient descent (SGD) or one of its variants, converges on a function $f$ consistent with a training set $S$. We also use Gaussian processes to estimate the Bayesian posterior probability $P_{B}(f\mid S)$ that the DNN expresses $f$ upon random sampling of its parameters, conditioned on $S$. Our main findings are that $P_{SGD}(f\mid S)$ correlates remarkably well with $P_{B}(f\mid S)$ and that $P_{B}(f\mid S)$ is strongly biased towards low-error and low complexity functions. These results imply that strong inductive bias in the parameter-function map (which determines $P_{B}(f\mid S)$), rather than a special property of SGD, is the primary explanation for why DNNs generalise so well in the overparameterised regime. While our results suggest that the Bayesian posterior $P_{B}(f\mid S)$ is the first order determinant of $P_{SGD}(f\mid S)$, there remain second order differences that are sensitive to hyperparameter tuning. A function probability picture, based on $P_{SGD}(f\mid S)$ and/or $P_{B}(f\mid S)$, can shed light on the way that variations in architecture or hyperparameter settings such as batch size, learning rate, and optimiser choice, affect DNN performance.

ディープニューラルネットワーク(DNN)は、過剰パラメータ化された領域で驚くほどよく一般化します。これは、一般化エラーが低い関数への強い帰納的バイアスを示唆しています。私たちは、さまざまなアーキテクチャとデータセットについて、確率的勾配降下法(SGD)またはそのバリアントの1つでトレーニングされた過剰パラメータ化されたDNNが、トレーニングセット$S$と一致する関数$f$に収束する確率$P_{SGD}(f\mid S)$を計算することで、このバイアスを経験的に調査します。また、ガウス過程を使用して、$S$を条件として、パラメータのランダムサンプリング時にDNNが$f$を表現するベイズ事後確率$P_{B}(f\mid S)$を推定します。私たちの主な発見は、$P_{SGD}(f\mid S)$が$P_{B}(f\mid S)$と非常によく相関していることと、$P_{B}(f\mid S)$が低エラーおよび低複雑性関数に強く偏っていることです。これらの結果は、SGDの特殊な特性ではなく、パラメータ関数マップ($P_{B}(f\mid S)$を決定する)の強い帰納的偏りが、DNNが過剰パラメータ化された領域で非常によく一般化される主な説明であることを示唆しています。私たちの結果は、ベイズ事後分布$P_{B}(f\mid S)$が$P_{SGD}(f\mid S)$の一次行列式であることを示唆していますが、ハイパーパラメータの調整に敏感な二次差が残っています。$P_{SGD}(f\mid S)$および/または$P_{B}(f\mid S)$に基づく関数確率図は、バッチサイズ、学習率、オプティマイザーの選択などのアーキテクチャやハイパーパラメータ設定の変化がDNNのパフォーマンスにどのように影響するかを明らかにすることができます。

POT: Python Optimal Transport
POT:Python最適輸送

Optimal transport has recently been reintroduced to the machine learning community thanks in part to novel efficient optimization procedures allowing for medium to large scale applications. We propose a Python toolbox that implements several key optimal transport ideas for the machine learning community. The toolbox contains implementations of a number of founding works of OT for machine learning such as Sinkhorn algorithm and Wasserstein barycenters, but also provides generic solvers that can be used for conducting novel fundamental research. This toolbox, named POT for Python Optimal Transport, is open source with an MIT license.

Optimal Transportは、中規模から大規模のアプリケーションを可能にする斬新で効率的な最適化手順のおかげで、最近機械学習コミュニティに再導入されました。機械学習コミュニティ向けのいくつかの主要な最適なトランスポートのアイデアを実装するPythonツールボックスを提案します。このツールボックスには、SinkhornアルゴリズムやWasserstein重心など、機械学習のためのOTの多くの創設作品の実装が含まれていますが、新しい基礎研究の実施に使用できる汎用ソルバーも用意されています。このツールボックスは、POT for Python Optimal Transportと呼ばれ、MITライセンスのオープンソースです。
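
For readers unfamiliar with the Sinkhorn algorithm mentioned above, the following is a minimal plain-NumPy sketch of entropic-regularized optimal transport between two discrete distributions. It deliberately does not call the POT package; the function name `sinkhorn_plan`, the fixed iteration count, and the toy data are assumptions for illustration.

```python
import numpy as np

def sinkhorn_plan(a, b, M, reg, n_iter=500):
    """Approximate transport plan between histograms a and b for cost matrix M."""
    K = np.exp(-M / reg)               # Gibbs kernel of the cost
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)              # rescale to match the column marginals
        u = a / (K @ v)                # rescale to match the row marginals
    return u[:, None] * K * v[None, :]

# Two uniform histograms supported on points of the unit interval.
x = np.linspace(0.0, 1.0, 5)
y = np.linspace(0.0, 1.0, 6)
M = (x[:, None] - y[None, :]) ** 2     # squared-distance cost
a = np.full(5, 1 / 5)
b = np.full(6, 1 / 6)

P = sinkhorn_plan(a, b, M, reg=0.5)
print("regularized transport cost:", float((P * M).sum()))
```

Smaller values of `reg` bring the plan closer to the unregularized optimal transport solution but make `K` numerically harder to handle, which is one reason production libraries add stabilized variants.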

ChainerRL: A Deep Reinforcement Learning Library
ChainerRL:深層強化学習ライブラリ

In this paper, we introduce ChainerRL, an open-source deep reinforcement learning (DRL) library built using Python and the Chainer deep learning framework. ChainerRL implements a comprehensive set of DRL algorithms and techniques drawn from state-of-the-art research in the field. To foster reproducible research, and for instructional purposes, ChainerRL provides scripts that closely replicate the original papers’ experimental settings and reproduce published benchmark results for several algorithms. Lastly, ChainerRL offers a visualization tool that enables the qualitative inspection of trained agents. The ChainerRL source code can be found on GitHub: https://github.com/chainer/chainerrl.

この論文では、PythonとChainer深層学習フレームワークを用いて構築されたオープンソースの深層強化学習(DRL)ライブラリであるChainerRLについて紹介します。ChainerRLは、この分野の最先端の研究から引き出された包括的なDRLアルゴリズムと技術のセットを実装しています。再現性のある研究を促進するため、また教育目的で、ChainerRLは、元の論文の実験設定を厳密に再現し、いくつかのアルゴリズムの公開ベンチマーク結果を再現するスクリプトを提供しています。最後に、ChainerRLは、訓練を受けたエージェントの定性検査を可能にする可視化ツールを提供しています。ChainerRLのソースコードはGitHub: https://github.com/chainer/chainerrlにあります。

Analyzing the discrepancy principle for kernelized spectral filter learning algorithms
カーネル化されたスペクトルフィルター学習アルゴリズムの不一致原理の解析

We investigate the construction of early stopping rules in the nonparametric regression problem where iterative learning algorithms are used and the optimal iteration number is unknown. More precisely, we study the discrepancy principle, as well as modifications based on smoothed residuals, for kernelized spectral filter learning algorithms including Tikhonov regularization and gradient descent. Our main theoretical bounds are oracle inequalities established for the empirical estimation error (fixed design), and for the prediction error (random design). From these finite-sample bounds it follows that the classical discrepancy principle is statistically adaptive for slow rates occurring in the hard learning scenario, while the smoothed discrepancy principles are adaptive over ranges of faster rates (resp. higher smoothness parameters). Our approach relies on deviation inequalities for the stopping rules in the fixed design setting, combined with change-of-norm arguments to deal with the random design setting.

私たちは、反復学習アルゴリズムが使用され、最適な反復回数が不明なノンパラメトリック回帰問題における早期停止規則の構築について調査します。より正確には、カーネル化スペクトルフィルタ学習アルゴリズム(Tikhonov正則化および勾配降下法を含む)について、不一致原理と平滑化残差に基づく修正を検討します。主な理論的境界は、経験的推定誤差(固定設計)および予測誤差(ランダム設計)に対して確立されたオラクル不等式です。これらの有限サンプル境界から、古典的な不一致原理はハード学習シナリオで発生する低速速度に対して統計的に適応的であるのに対し、平滑化不一致原理は高速速度の範囲(それぞれ、より高い平滑性パラメータ)に対して適応的であることがわかる。我々のアプローチは、固定設計設定における停止規則の偏差不等式と、ランダム設計設定に対処するためのノルム変更の議論との組み合わせに依存しています。
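
The idea of the discrepancy principle can be illustrated with kernel gradient descent that stops as soon as the mean squared residual falls to the noise level. This is a simplified sketch under stated assumptions: the noise variance `sigma2` is taken as known, the threshold is exactly `sigma2`, and the step size is `1 / lambda_max(K / n)`; the paper's precise rules and its smoothed variants may differ.

```python
import numpy as np

def gd_with_discrepancy_stop(K, y, sigma2, max_iter=10_000):
    """Kernel gradient descent on (1/2n)||y - K alpha||^2, stopped by the residual."""
    n = len(y)
    lam_max = np.linalg.eigvalsh(K)[-1]        # largest eigenvalue of the Gram matrix
    alpha = np.zeros(n)
    for t in range(max_iter):
        residual = y - K @ alpha
        if residual @ residual / n <= sigma2:  # discrepancy check
            return alpha, t
        alpha = alpha + residual / lam_max     # functional gradient step in the RKHS
    return alpha, max_iter

# Toy regression problem with a Gaussian kernel (all values are illustrative).
rng = np.random.default_rng(0)
x = np.sort(rng.random(100))
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=100)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.2 ** 2))

alpha, stop_iter = gd_with_discrepancy_stop(K, y, sigma2=1.5 * 0.3 ** 2)
print("stopped after", stop_iter, "iterations")
```

Stopping early in this way acts as regularization: iterating further would keep shrinking the residual by fitting noise rather than signal.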

Attention is Turing-Complete
Attentionはチューリング完全

Alternatives to recurrent neural networks, in particular, architectures based on self-attention, are gaining momentum for processing input sequences. In spite of their relevance, the computational properties of such networks have not yet been fully explored. We study the computational power of the Transformer, one of the most paradigmatic architectures exemplifying self-attention. We show that the Transformer with hard-attention is Turing complete exclusively based on its capacity to compute and access internal dense representations of the data. Our study also reveals some minimal sets of elements needed to obtain this completeness result.

リカレントニューラルネットワークの代替手段、特に自己注意に基づくアーキテクチャは、入力シーケンスの処理に勢いを増しています。その関連性にもかかわらず、そのようなネットワークの計算特性はまだ完全には調査されていません。私たちは、セルフアテンションを例示する最もパラダイム的なアーキテクチャの1つであるTransformerの計算能力を研究しています。私たちは、ハードアテンションを用いたTransformerが、データの内部の高密度表現を計算しアクセスする能力のみに基づいて、チューリング完全であることを示します。私たちの研究では、この完全性の結果を得るために必要ないくつかの最小限の要素のセットも明らかになっています。
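
The difference between the usual soft attention and hard attention can be sketched as follows. Here hard attention is taken to mean averaging the values over the positions with maximal score, which is a common definition; the dot-product scoring function and all names are assumptions and may not match the exact construction used in the paper.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted for numerical stability."""
    z = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def hardmax(scores):
    """Uniform weights over the positions that attain the row maximum."""
    is_max = np.isclose(scores, scores.max(axis=-1, keepdims=True))
    return is_max / is_max.sum(axis=-1, keepdims=True)

def attention(Q, K, V, hard=False):
    """Attend from queries Q to key/value pairs (K, V) with dot-product scores."""
    scores = Q @ K.T
    weights = hardmax(scores) if hard else softmax(scores)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 3))
print(attention(Q, K, V, hard=True))
```

With hard attention each query reads an exact average of the selected value vectors instead of a smooth mixture of all of them, which is the kind of precise access to internal representations the abstract refers to.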

Kernel Operations on the GPU, with Autodiff, without Memory Overflows
GPU でのカーネル操作 (Autodiff あり、メモリオーバーフローなし)

The KeOps library provides a fast and memory-efficient GPU support for tensors whose entries are given by a mathematical formula, such as kernel and distance matrices. KeOps alleviates the main bottleneck of tensor-centric libraries for kernel and geometric applications: memory consumption. It also supports automatic differentiation and outperforms standard GPU baselines, including PyTorch CUDA tensors or the Halide and TVM libraries. KeOps combines optimized C++/CUDA schemes with binders for high-level languages: Python (Numpy and PyTorch), Matlab and GNU R. As a result, high-level “quadratic” codes can now scale up to large data sets with millions of samples processed in seconds. KeOps brings graphics-like performances for kernel methods and is freely available on standard repositories (PyPi, CRAN). To showcase its versatility, we provide tutorials in a wide range of settings online at www.kernel-operations.io.

KeOpsライブラリは、カーネル行列や距離行列などの数式によってエントリが与えられるテンソルに対して、高速でメモリ効率の高いGPUサポートを提供します。KeOpsは、カーネルおよびジオメトリアプリケーション用のテンソル中心ライブラリの主なボトルネックであるメモリ消費を軽減します。また、自動微分をサポートし、PyTorch CUDAテンソルやHalideライブラリ、TVMライブラリなどの標準的なGPUベースラインを凌駕します。KeOpsは、最適化されたC++/CUDAスキームと、Python(NumpyおよびPyTorch)、Matlab、GNU Rなどの高級言語用のバインダーを組み合わせています。その結果、高レベルの「二次」符号は、数百万のサンプルを数秒で処理する大規模なデータセットにスケールアップできるようになりました。KeOpsは、カーネルメソッドにグラフィックのようなパフォーマンスをもたらし、標準のリポジトリ(PyPi、CRAN)で無料で利用できます。その汎用性を示すために、www.kernel-operations.ioではオンラインでさまざまな設定のチュートリアルを提供しています。
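
The memory bottleneck described above comes from materializing the full kernel matrix. The following CPU-only NumPy sketch shows the underlying idea of reducing over blocks so that the full matrix never exists in memory. It is not the KeOps API and makes no claim about how KeOps is implemented on the GPU; the function name and the chunk size are assumptions.

```python
import numpy as np

def gaussian_kernel_matvec(x, y, b, sigma=1.0, chunk=1024):
    """Compute a_i = sum_j exp(-|x_i - y_j|^2 / (2 sigma^2)) b_j in row blocks."""
    out = np.empty((x.shape[0],) + b.shape[1:])
    for start in range(0, x.shape[0], chunk):
        xs = x[start:start + chunk]
        # Only a (chunk x N) block of the kernel matrix is ever held in memory.
        sq = ((xs[:, None, :] - y[None, :, :]) ** 2).sum(-1)
        out[start:start + chunk] = np.exp(-sq / (2 * sigma ** 2)) @ b
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
y = rng.normal(size=(40, 3))
b = rng.normal(size=40)
a = gaussian_kernel_matvec(x, y, b, chunk=7)
print(a.shape)
```

Peak memory then scales with `chunk * N` rather than `M * N`, at the price of a Python-level loop that a compiled GPU kernel would avoid.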

Optimization with Momentum: Dynamical, Control-Theoretic, and Symplectic Perspectives
運動量による最適化: 動的、制御理論、シンプレクティックのパースペクティブ

We analyze the convergence rate of various momentum-based optimization algorithms from a dynamical systems point of view. Our analysis exploits fundamental topological properties, such as the continuous dependence of iterates on their initial conditions, to provide a simple characterization of convergence rates. In many cases, closed-form expressions are obtained that relate algorithm parameters to the convergence rate. The analysis encompasses discrete time and continuous time, as well as time-invariant and time-variant formulations, and is not limited to a convex or Euclidean setting. In addition, the article rigorously establishes why symplectic discretization schemes are important for momentum-based optimization algorithms, and provides a characterization of algorithms that exhibit accelerated convergence.

私たちは、さまざまな運動量ベースの最適化アルゴリズムの収束速度を力学系の観点から解析します。私たちの分析では、反復が初期条件に連続的に依存するなどの基本的なトポロジカル特性を利用して、収束率の簡単な特性評価を提供します。多くの場合、アルゴリズムパラメーターを収束率に関連付ける閉形式の式が取得されます。この解析には、離散時間と連続時間、時間不変と時間変動の定式化が含まれ、凸設定やユークリッド設定に限定されません。さらに、この記事では、運動量ベースの最適化アルゴリズムにシンプレクティック離散化スキームが重要である理由を厳密に確立し、収束の加速を示すアルゴリズムの特性評価を提供します。
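
As a small illustration of a momentum method written as a discretized dynamical system, the sketch below uses a semi-implicit update order: the momentum is updated from the current position, and the position is then updated with the new momentum. Without damping (`beta = 1`) this ordering is the symplectic Euler scheme, and with `beta < 1` it coincides with the heavy-ball iteration with step size `h**2`. The parameter values and names are assumptions, not taken from the paper.

```python
import numpy as np

def momentum_descent(grad, x0, h, beta, n_iter):
    """Heavy-ball style iteration written as a semi-implicit discretization."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_iter):
        v = beta * v - h * grad(x)   # momentum update uses the current position
        x = x + h * v                # position update uses the new momentum
    return x

# Ill-conditioned quadratic f(x) = 0.5 * x^T H x with minimizer at the origin.
H = np.diag([1.0, 10.0])
x_final = momentum_descent(lambda x: H @ x, [1.0, 1.0], h=0.3, beta=0.5, n_iter=500)
print(np.linalg.norm(x_final))
```

Eliminating `v` gives `x_next = x - h**2 * grad(x) + beta * (x - x_prev)`, which makes the link between the discretization step `h` and the usual learning rate explicit.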

Prediction against a limited adversary
限られた敵対者に対する予測

We study the problem of prediction with expert advice with adversarial corruption where the adversary can at most corrupt one expert. Using tools from viscosity theory, we characterize the long-time behavior of the value function of the game between the forecaster and the adversary. We provide lower and upper bounds for the growth rate of regret without relying on a comparison result. We show that depending on the description of regret, the limiting behavior of the game can significantly differ.

私たちは、敵対者がせいぜい一人の専門家を腐敗させることができる敵対的腐敗について、専門家のアドバイスで予測の問題を研究します。粘性理論のツールを使用して、予測者と敵対者との間のゲームの価値関数の長期的な動作を特徴付けます。後悔の成長率の下限と上限は、比較結果に頼らずに提供しています。後悔の説明に応じて、ゲームの制限動作は大きく異なる可能性があることを示します。

Phase Diagram for Two-layer ReLU Neural Networks at Infinite-width Limit
無限幅極限における2層ReLUニューラルネットワークの相図

How a neural network behaves during training under different choices of hyperparameters is an important question in the study of neural networks. In this work, inspired by the phase diagram in statistical mechanics, we draw the phase diagram for the two-layer ReLU neural network at the infinite-width limit for a complete characterization of its dynamical regimes and their dependence on hyperparameters related to initialization. Through both experimental and theoretical approaches, we identify three regimes in the phase diagram, i.e., linear regime, critical regime and condensed regime, based on the relative change of input weights as the width approaches infinity, which tends to $0$, $O(1)$ and $+\infty$, respectively. In the linear regime, NN training dynamics is approximately linear, similar to a random feature model with an exponential loss decay. In the condensed regime, we demonstrate through experiments that active neurons are condensed at several discrete orientations. The critical regime serves as the boundary between the above two regimes, and exhibits an intermediate nonlinear behavior with the mean-field model as a typical example. Overall, our phase diagram for the two-layer ReLU NN serves as a map for future studies and is a first step towards a more systematic investigation of the training behavior and the implicit regularization of NNs of different structures.

ニューラルネットワークがさまざまなハイパーパラメータの選択でトレーニング中にどのように動作するかは、ニューラルネットワークの研究における重要な問題です。この研究では、統計力学の位相図にヒントを得て、無限幅の限界における2層ReLUニューラルネットワークの位相図を描き、その動的レジームと初期化に関連するハイパーパラメータへの依存性を完全に特徴付けます。実験的アプローチと理論的アプローチの両方により、位相図で3つのレジーム、つまり線形レジーム、臨界レジーム、凝縮レジームを特定します。これは、幅が無限大に近づくにつれて入力重みが相対的に変化し、それぞれ$0$、$O(1)$、$+\infty$に近づくことに基づいています。線形レジームでは、NNトレーニングダイナミクスは、指数関数的な損失減衰を伴うランダムフィーチャモデルと同様にほぼ線形です。凝縮レジームでは、実験により、アクティブニューロンがいくつかの離散的な方向に凝縮されていることを示します。臨界状態は、上記2つの状態間の境界として機能し、平均場モデルを典型的な例として、中間の非線形動作を示します。全体として、2層ReLU NNのフェーズダイアグラムは、将来の研究のマップとして機能し、異なる構造のNNのトレーニング動作と暗黙的な正規化のより体系的な調査に向けた第一歩となります。

Testing Conditional Independence via Quantile Regression Based Partial Copulas
分位点回帰に基づく部分コピュラによる条件付き独立性の検定

The partial copula provides a method for describing the dependence between two random variables $X$ and $Y$ conditional on a third random vector $Z$ in terms of nonparametric residuals $U_1$ and $U_2$. This paper develops a nonparametric test for conditional independence by combining the partial copula with a quantile regression based method for estimating the nonparametric residuals. We consider a test statistic based on generalized correlation between $U_1$ and $U_2$ and derive its large sample properties under consistency assumptions on the quantile regression procedure. We demonstrate through a simulation study that the resulting test is sound under complicated data generating distributions. Moreover, in the examples considered the test is competitive to other state-of-the-art conditional independence tests in terms of level and power, and it has superior power in cases with conditional variance heterogeneity of $X$ and $Y$ given $Z$.

部分コピュラは、ノンパラメトリック残差$U_1$と$U_2$に関して、3番目のランダムベクトル$Z$を条件とする2つの確率変数$X$と$Y$の間の依存性を記述する方法を提供します。この論文では、部分コピュラと分位点回帰に基づく方法を組み合わせてノンパラメトリック残差を推定することにより、条件付き独立性のノンパラメトリック検定を開発します。$U_1$と$U_2$の間の一般化された相関に基づく検定統計量を検討し、分位点回帰手順の一貫性の仮定の下でその大きなサンプルプロパティを導き出します。シミュレーション研究を通じて、結果として得られるテストが複雑なデータ生成分布の下で健全であることを示しています。さらに、考慮された例では、テストはレベルと検出力の点で他の最先端の条件付き独立性テストと競合し、$Z$が与えられた場合の条件付き分散の不均一性が$X$と$Y$の場合に優れた検出力を持っています。

Determining the Number of Communities in Degree-corrected Stochastic Block Models
次数補正確率的ブロックモデルにおけるコミュニティ数の決定

We propose to estimate the number of communities in degree-corrected stochastic block models based on a pseudo likelihood ratio statistic. To this end, we introduce a method that combines spectral clustering with binary segmentation. This approach guarantees an upper bound for the pseudo likelihood ratio statistic when the model is over-fitted. We also derive its limiting distribution when the model is under-fitted. Based on these properties, we establish the consistency of our estimator for the true number of communities. Developing these theoretical properties requires a mild condition on the average degrees: they must grow at a rate no slower than log(n), where n is the number of nodes. Our proposed method is further illustrated by simulation studies and analysis of real-world networks. The numerical results show that our approach has satisfactory performance when the network is semi-dense.

私たちは、次数補正確率ブロックモデルにおけるコミュニティの数を、疑似尤度比統計に基づいて推定することを提案します。この目的のために、スペクトルクラスタリングとバイナリセグメンテーションを組み合わせた方法を紹介します。このアプローチにより、モデルが過剰適合した場合の疑似尤度比統計量の上限が保証されます。また、モデルが適合不足の場合にも、その制限分布を導き出します。これらの特性に基づいて、コミュニティの真の数に対する推定量の一貫性を確立します。これらの理論的特性を開発するには、平均度数で穏やかな条件が必要です- log(n)よりも遅くない速度で成長します(nはノードの数)。私たちの提案手法は、実世界のネットワークのシミュレーション研究と解析によってさらに説明されています。数値結果は、ネットワークが半密な場合に、このアプローチが満足のいくパフォーマンスを発揮することを示しています。

Path Length Bounds for Gradient Descent and Flow
勾配降下法とグラデーションフローのパス長の境界

We derive bounds on the path length $\zeta$ of gradient descent (GD) and gradient flow (GF) curves for various classes of smooth convex and nonconvex functions. Among other results, we prove that: (a) if the iterates are linearly convergent with factor $(1-c)$, then $\zeta$ is at most $\mathcal{O}(1/c)$; (b) under the Polyak-Kurdyka-Łojasiewicz (PKL) condition, $\zeta$ is at most $\mathcal{O}(\sqrt{\kappa})$, where $\kappa$ is the condition number, and at least $\widetilde\Omega(\sqrt{d} \wedge \kappa^{1/4})$; (c) for quadratics, $\zeta$ is $\Theta(\min\{\sqrt{d},\sqrt{\log \kappa}\})$ and in some cases can be independent of $\kappa$; (d) assuming just convexity, $\zeta$ can be at most $2^{4d\log d}$; (e) for separable quasiconvex functions, $\zeta$ is ${\Theta}(\sqrt{d})$. Thus, we advance current understanding of the properties of GD and GF curves beyond rates of convergence. We expect our techniques to facilitate future studies for other algorithms.

私たちは、滑らかな凸関数と非凸関数のさまざまなクラスについて、勾配降下法(GD)曲線と勾配流れ(GF)曲線の経路長$\zeta$の境界を導き出します。他の結果の中で、我々は以下を証明します: (a)反復が因子$(1-c)$と線形に収束している場合、$\zeta$はせいぜい$\mathcal{O}(1/c)$です。(b) Polyak-Kurdyka-Łojasiewicz (PKL)条件の下では、$\zeta$はせいぜい$\mathcal{O}(\sqrt{\kappa})$であり、ここで$\kappa$は条件番号であり、少なくとも$\widetilde\Omega(\sqrt{d} \wedge \kappa^{1/4})$;(c)二次関数の場合、$\zeta$は$\Theta(\min\{\sqrt{d},\sqrt{\log \kappa}\})$であり、場合によっては$\kappa$から独立している場合があります。(d)凸性だけを仮定すると、$\zeta$は最大で$2^{4d\log d}$になります。(e)分離可能な準凸関数の場合、$\zeta$は${\Theta}(\sqrt{d})$です。したがって、GD曲線とGF曲線の特性に関する現在の理解を、収束率を超えて進めます。私たちの技術が、他のアルゴリズムの将来の研究を促進することを期待しています。
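
The path length of a gradient descent run is simply the sum of the lengths of its steps, which is easy to measure numerically. The sketch below does this on a toy quadratic and compares it with the elementary bound $2\|x_0 - x^*\|/c$ that follows from the triangle inequality when the iterates contract with factor $(1-c)$. The test function, step size, and names are assumptions; the constants in the paper's results may differ.

```python
import numpy as np

def gd_path_length(grad, x0, step, n_iter):
    """Run gradient descent and return (total path length, final iterate)."""
    x = np.asarray(x0, dtype=float)
    total = 0.0
    for _ in range(n_iter):
        x_next = x - step * grad(x)
        total += np.linalg.norm(x_next - x)   # length of this step
        x = x_next
    return total, x

kappa = 10.0
H = np.diag([1.0, kappa])                     # quadratic with condition number kappa
x0 = np.array([1.0, 1.0])
c = 1.0 / kappa                               # contraction factor is (1 - c) for step 1/kappa

zeta, x_final = gd_path_length(lambda x: H @ x, x0, step=1.0 / kappa, n_iter=2000)
print("path length:", zeta, " straight-line distance:", np.linalg.norm(x0))
```

The path length can never be shorter than the straight-line distance from the start to the limit point, so the ratio of the two numbers printed above measures how much the trajectory bends.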

A General Framework for Empirical Bayes Estimation in Discrete Linear Exponential Family
離散線形指数族における経験的ベイズ推定のための一般的枠組み

We develop a Nonparametric Empirical Bayes (NEB) framework for compound estimation in the discrete linear exponential family, which includes a wide class of discrete distributions frequently arising from modern big data applications. We propose to directly estimate the Bayes shrinkage factor in the generalized Robbins’ formula via solving a convex program, which is carefully developed based on a RKHS representation of the Stein’s discrepancy measure. The new NEB estimation framework is flexible for incorporating various structural constraints into the data driven rule, and provides a unified approach to compound estimation with both regular and scaled squared error losses. We develop theory to show that the class of NEB estimators enjoys strong asymptotic properties. Comprehensive simulation studies as well as analyses of real data examples are carried out to demonstrate the superiority of the NEB estimator over competing methods.

私たちは、現代のビッグデータアプリケーションから頻繁に生じる離散分布の幅広いクラスを含む、離散線形指数ファミリーでの複合推定のためのノンパラメトリック経験的ベイズ(NEB)フレームワークを開発します。一般化されたロビンズの公式のベイズ収縮係数を、スタインの不一致測度のRKHS表現に基づいて慎重に開発された凸計画法を解くことによって直接推定することを提案します。新しいNEB推定フレームワークは、さまざまな構造的制約をデータ駆動型ルールに組み込む柔軟性があり、通常の二乗誤差損失とスケーリングされた二乗誤差損失の両方を使用して複合推定に統一されたアプローチを提供します。NEB推定量のクラスが強い漸近特性を享受することを示す理論を開発します。包括的なシミュレーション研究と実際のデータ例の分析が行われ、競合する方法に対するNEB推定量の優位性が実証されます。
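
For context, the classical Robbins formula for Poisson data estimates the posterior mean as $(x+1)\,\hat p(x+1)/\hat p(x)$, where $\hat p$ is the empirical frequency of each count. The sketch below implements only this classical plug-in version, not the convex-program estimator of the shrinkage factor proposed in the paper; the function name and the simulated data are assumptions.

```python
import numpy as np

def robbins_estimate(x):
    """Classical Robbins estimator of E[theta | x] for Poisson counts x."""
    x = np.asarray(x)
    counts = np.bincount(x, minlength=x.max() + 2)   # counts[k] = #{i : x_i = k}
    # counts[x] >= 1 for every observed value, so the division is safe.
    return (x + 1) * counts[x + 1] / counts[x]

# Simulated compound problem: theta_i ~ Gamma, x_i | theta_i ~ Poisson(theta_i).
rng = np.random.default_rng(0)
theta = rng.gamma(shape=2.0, scale=1.0, size=5000)
x = rng.poisson(theta)
est = robbins_estimate(x)
print("mean squared error of Robbins:", np.mean((est - theta) ** 2))
print("mean squared error of raw counts:", np.mean((x - theta) ** 2))
```

The plug-in ratio is noisy in the tails, where `counts[x + 1]` is small or zero, which is one motivation for estimating the shrinkage factor in a more structured way.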

Approximate Newton Methods
近似ニュートン法

Many machine learning models involve solving optimization problems. Thus, it is important to address large-scale optimization problems in big data applications. Recently, subsampled Newton methods have attracted much attention because each iteration is cheap: they rectify a weakness of the ordinary Newton method, which enjoys a high convergence rate but incurs a high cost per iteration. Other efficient stochastic second order methods have also been proposed. However, the convergence properties of these methods are still not well understood. There are also several important gaps between the current convergence theory and the empirical performance in real applications. In this paper, we aim to fill these gaps. We propose a unifying framework to analyze both local and global convergence properties of second order methods. Accordingly, we present our theoretical results, which match the empirical performance in real applications well.

多くの機械学習モデルには、最適化問題の解法が含まれます。したがって、ビッグデータアプリケーションにおける大規模な最適化問題に取り組むことが重要です。近年、サブサンプリングされたニュートン法は、各反復での効率のために多くの注目を集めるために浮上しており、高い収束率を指揮しながら各反復で高いコストに苦しむという通常のニュートン法の弱点を修正しました。他の効率的な確率的二次法も提案されています。しかし、これらの手法の収束特性はまだ十分に理解されていません。また、現在の収束理論と実際のアプリケーションでの経験的性能との間には、いくつかの重要なギャップがあります。この論文では、これらのギャップを埋めることを目指しています。私たちは、二次法のローカル収束特性とグローバル収束特性の両方を解析するための統一的なフレームワークを提案します。したがって、実際のアプリケーションでの経験的性能によく一致する理論的結果を提示します。
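
A subsampled Newton iteration can be sketched on ridge regression, where the gradient is computed on all data but the Hessian is estimated from a random subset of rows. This is a minimal sketch under stated assumptions (unit step size, uniform sampling without replacement, a fresh sample per iteration); it is not the unifying framework analyzed in the paper.

```python
import numpy as np

def subsampled_newton_ridge(X, y, lam, sample_size, n_iter, rng):
    """Minimize (1/2n)||Xw - y||^2 + (lam/2)||w||^2 with a subsampled Hessian."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n + lam * w                # full gradient
        idx = rng.choice(n, size=sample_size, replace=False)  # rows used for the Hessian
        H_sub = X[idx].T @ X[idx] / sample_size + lam * np.eye(d)
        w = w - np.linalg.solve(H_sub, grad)                  # approximate Newton step
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=2000)
w = subsampled_newton_ridge(X, y, lam=0.1, sample_size=400, n_iter=30, rng=rng)
print(w)
```

Each iteration forms the Hessian from 400 rows instead of 2000. The price is that the exact Newton method would solve this quadratic problem in one step, whereas the approximate version converges linearly at a rate governed by how well the subsampled Hessian matches the full one.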

Dynamic Tensor Recommender Systems
ダイナミックテンソルレコメンダーシステム

Recommender systems have been extensively used by the entertainment industry, business marketing and the biomedical industry. In addition to their capacity to provide preference-based recommendations as an unsupervised learning methodology, they have also proven useful in sales forecasting, product introduction and other production related businesses. Since some consumers and companies need a recommendation or prediction for future budget, labor and supply chain coordination, dynamic recommender systems for precise forecasting have become extremely necessary. In this article, we propose a new recommendation method, namely the dynamic tensor recommender system (DTRS), which aims particularly at forecasting future recommendations. The proposed method utilizes a tensor-valued function of time to integrate time and contextual information, and creates a time-varying coefficient model for temporal tensor factorization through a polynomial spline approximation. Major advantages of the proposed method include competitive future recommendation predictions and effective prediction interval estimations. In theory, we establish the convergence rate of the proposed tensor factorization and asymptotic normality of the spline coefficient estimator. The proposed method is applied to simulations, IRI marketing data and Last.fm data. Numerical studies demonstrate that the proposed method outperforms existing methods in terms of future time forecasting.

推薦システムは、エンターテインメント業界、ビジネスマーケティング、バイオメディカル業界で広く使用されています。教師なし学習手法として嗜好に基づく推薦を提供できるだけでなく、売上予測、製品導入、その他の生産関連ビジネスでも有用であることが証明されています。一部の消費者や企業は、将来の予算、労働力、サプライチェーンの調整に関する推薦や予測を必要としているため、正確な予測を行う動的推薦システムが非常に必要になっています。この記事では、特に将来の推薦を予測することを目的とした新しい推薦方法、つまり動的テンソル推薦システム(DTRS)を提案します。提案された方法は、時間のテンソル値関数を使用して時間とコンテキスト情報を統合し、多項式スプライン近似によって時間テンソル分解の時間変動係数モデルを作成します。提案された方法の主な利点には、競争力のある将来の推薦予測と効果的な予測区間の推定が含まれます。理論的には、提案されたテンソル分解の収束率とスプライン係数推定量の漸近正規性を確立します。提案された方法は、シミュレーション、IRIマーケティングデータ、およびLast.fmデータに適用されます。数値的研究により、提案された方法が将来の時間予測に関して既存の方法よりも優れていることが実証されています。

Sparse Tensor Additive Regression
スパーステンソル加法回帰

Tensors are becoming prevalent in modern applications such as medical imaging and digital marketing. In this paper, we propose a sparse tensor additive regression (STAR) that models a scalar response as a flexible nonparametric function of tensor covariates. The proposed model effectively exploits the sparse and low-rank structures in the tensor additive regression. We formulate the parameter estimation as a non-convex optimization problem, and propose an efficient penalized alternating minimization algorithm. We establish a non-asymptotic error bound for the estimator obtained from each iteration of the proposed algorithm, which reveals an interplay between the optimization error and the statistical rate of convergence. We demonstrate the efficacy of STAR through extensive comparative simulation studies, and an application to the click-through-rate prediction in online advertising.

テンソルは、医用画像やデジタルマーケティングなどの最新のアプリケーションで普及しつつあります。この論文では、テンソル共変量の柔軟なノンパラメトリック関数としてスカラー応答をモデル化するスパーステンソル加法回帰(STAR)を提案します。提案されたモデルは、テンソル加法回帰のスパース構造と低ランク構造を効果的に活用します。パラメータ推定を非凸最適化問題として定式化し、効率的なペナルティ付き交互最小化アルゴリズムを提案します。提案されたアルゴリズムの各反復から得られる推定量に対して非漸近誤差の範囲を確立し、最適化誤差と統計的収束率との間の相互作用を明らかにします。私たちは、広範な比較シミュレーション研究と、オンライン広告のクリックスルー率予測への応用を通じて、STARの有効性を実証しています。

Geometric structure of graph Laplacian embeddings
グラフラプラシアンエンベッディングの幾何学的構造

We analyze the spectral clustering procedure for identifying coarse structure in a data set $\mathbf{x}_1, \dots, \mathbf{x}_n$, and in particular study the geometry of graph Laplacian embeddings which form the basis for spectral clustering algorithms. More precisely, we assume that the data are sampled from a mixture model supported on a manifold $\mathcal{M}$ embedded in $\mathbb{R}^d$, and pick a connectivity length-scale $\varepsilon>0$ to construct a kernelized graph Laplacian. We introduce a notion of a well-separated mixture model which only depends on the model itself, and prove that when the model is well separated, with high probability the embedded data set concentrates on cones that are centered around orthogonal vectors. Our results are meaningful in the regime where $\varepsilon = \varepsilon(n)$ is allowed to decay to zero at a slow enough rate as the number of data points grows. This rate depends on the intrinsic dimension of the manifold on which the data is supported.

私たちは、データセット$\mathbf{x}_1, \dots, \mathbf{x}_n$の粗い構造を特定するためのスペクトルクラスタリング手順を分析し、特にスペクトルクラスタリングアルゴリズムの基礎を形成するグラフラプラシアン埋め込みの幾何学を研究します。より正確には、データは$\mathbb{R}^d$に埋め込まれた多様体$\mathcal{M}$でサポートされている混合モデルからサンプリングされ、接続性の長さスケール$\varepsilon>0$を選択してカーネル化されたグラフLaplacianを構築すると仮定します。モデル自体にのみ依存する十分に分離された混合モデルの概念を導入し、モデルが十分に分離されている場合、埋め込まれたデータセットは高い確率で直交ベクトルを中心とした錐に集中することを証明します。私たちの結果は、データポイントの数が増えるにつれて、$\varepsilon = \varepsilon(n)$が十分に遅い速度でゼロに崩壊することが許される領域で意味があります。このレートは、データがサポートされている多様体の本質的な次元によって異なります。
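
The geometric statement above can be observed numerically on a toy example. The sketch below builds a Gaussian-kernel graph with length scale `eps`, computes the bottom eigenvectors of the symmetric normalized Laplacian, and checks the directions of the embedded points for two well-separated clusters. The kernel, the normalization, and all names are assumptions; the paper's exact graph Laplacian and separation conditions may differ.

```python
import numpy as np

def laplacian_embedding(X, eps, k):
    """Embed points with the k bottom eigenvectors of the normalized graph Laplacian."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * eps ** 2))          # kernel weights at length scale eps
    np.fill_diagonal(W, 0.0)                  # no self-loops
    deg = W.sum(axis=1)
    L_sym = np.eye(n) - W / np.sqrt(np.outer(deg, deg))
    _, vecs = np.linalg.eigh(L_sym)           # eigenvalues in ascending order
    return vecs[:, :k]

rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 0.3, size=(30, 2))
cluster_b = rng.normal(0.0, 0.3, size=(30, 2)) + np.array([10.0, 0.0])
X = np.vstack([cluster_a, cluster_b])

embedding = laplacian_embedding(X, eps=1.0, k=2)
directions = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
gram = directions @ directions.T              # cosines between embedded points
print(gram[0, 1], gram[0, 31])
```

In this idealized case the embedded points of each cluster line up along a single direction and the two directions are orthogonal; overlapping mixture components would spread each line out into a cone.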

How to Gain on Power: Novel Conditional Independence Tests Based on Short Expansion of Conditional Mutual Information
検出力を高める方法:条件付き相互情報量の短縮展開に基づく新しい条件付き独立性検定

Conditional independence tests play a crucial role in many machine learning procedures such as feature selection, causal discovery, and structure learning of dependence networks. They are used in most of the existing algorithms for Markov Blanket discovery such as Grow-Shrink or Incremental Association Markov Blanket. One of the most frequently used tests for categorical variables is based on the conditional mutual information ($CMI$) and its asymptotic distribution. However, it is known that the power of such test dramatically decreases when the size of the conditioning set grows, i.e. the test fails to detect true significant variables, when the set of already selected variables is large. To overcome this drawback for discrete data, we propose to replace the conditional mutual information by Short Expansion of Conditional Mutual Information (called $SECMI$), obtained by truncating the Möbius representation of $CMI$. We prove that the distribution of $SECMI$ converges to either a normal distribution or to a distribution of some quadratic form in normal random variables. This property is crucial for the construction of a novel test of conditional independence which uses one of these distributions, chosen in a data dependent way, as a reference under the null hypothesis. The proposed methods have significantly larger power for discrete data than the standard asymptotic tests of conditional independence based on $CMI$ while retaining control of the probability of type I error.

条件付き独立性検定は、特徴選択、因果発見、依存性ネットワークの構造学習など、多くの機械学習手順で重要な役割を果たします。これらは、Grow-Shrinkや増分関連マルコフブランケットなど、マルコフブランケット発見の既存のアルゴリズムのほとんどで使用されています。カテゴリ変数の最も頻繁に使用される検定の1つは、条件付き相互情報量($CMI$)とその漸近分布に基づいています。ただし、条件セットのサイズが大きくなると、このような検定の検出力が劇的に低下することが知られています。つまり、すでに選択された変数のセットが大きい場合、検定では真に重要な変数を検出できません。離散データに対するこの欠点を克服するために、条件付き相互情報量を、$CMI$のメビウス表現を切り捨てることによって得られる条件付き相互情報量の短縮展開($SECMI$と呼ばれる)に置き換えることを提案します。$SECMI$の分布が、正規分布または正規ランダム変数の二次形式の分布のいずれかに収束することを証明します。この特性は、データ依存的に選択されたこれらの分布の1つを帰無仮説の基準として使用する条件付き独立性の新しい検定を構築する上で重要です。提案された方法は、タイプIの誤りの確率を制御しながら、CMIに基づく条件付き独立性の標準的な漸近検定よりも離散データに対して大幅に大きな検出力を持っています。

Stochastic Proximal AUC Maximization
確率的近位AUC最大化

In this paper we consider the problem of maximizing the Area under the ROC curve (AUC), which is a widely used performance metric in imbalanced classification and anomaly detection. Due to the pairwise nonlinearity of the objective function, classical SGD algorithms do not apply to the task of AUC maximization. We propose a novel stochastic proximal algorithm for AUC maximization which is scalable to large scale streaming data. Our algorithm can accommodate general penalty terms and is easy to implement with favorable $O(d)$ space and per-iteration time complexities. We establish a high-probability convergence rate $O(1/\sqrt{T})$ for the general convex setting, and improve it to a fast convergence rate $O(1/T)$ for the cases of strongly convex regularizers and no regularization term (without strong convexity). Our proof does not need the uniform boundedness assumption on the loss function or the iterates, which makes it more faithful to practice. Finally, we perform extensive experiments over various benchmark data sets from real-world application domains which show the superior performance of our algorithm over the existing AUC maximization algorithms.

この論文では、不均衡分類および異常検出で広く使用されているパフォーマンス指標であるROC曲線の下の面積（AUC）を最大化する問題について検討します。目的関数のペアワイズ非線形性のため、従来のSGDアルゴリズムはAUC最大化のタスクには適用できません。大規模なストリーミングデータにスケーラブルな、AUC最大化のための新しい確率的近似アルゴリズムを提案します。このアルゴリズムは一般的なペナルティ項に対応でき、好ましい$O(d)$空間および反復あたりの時間計算量で簡単に実装できます。一般的な凸設定に対して高確率収束率$O(1/\sqrt{T})$を確立し、強い凸正則化子と正則化項なし（強い凸性なし）の場合には高速収束率$O(1/T)$に改善します。私たちの証明では、損失関数または反復に対する均一な有界性仮定は不要で、これは実践により忠実です。最後に、実際のアプリケーションドメインからのさまざまなベンチマークデータセットに対して広範な実験を実行し、既存のAUC最大化アルゴリズムよりも当社のアルゴリズムの優れたパフォーマンスを示します。

A Distributed Method for Fitting Laplacian Regularized Stratified Models
ラプラシアン正則化層化モデルを当てはめるための分散手法

Stratified models are models that depend in an arbitrary way on a set of selected categorical features, and depend linearly on the other features. In a basic and traditional formulation a separate model is fit for each value of the categorical feature, using only the data that has the specific categorical value. To this formulation we add Laplacian regularization, which encourages the model parameters for neighboring categorical values to be similar. Laplacian regularization allows us to specify one or more weighted graphs on the stratification feature values. For example, stratifying over the days of the week, we can specify that the Sunday model parameter should be close to the Saturday and Monday model parameters. The regularization improves the performance of the model over the traditional stratified model, since the model for each value of the categorical feature 'borrows strength' from its neighbors. In particular, it produces a model even for categorical values that did not appear in the training data set. We propose an efficient distributed method for fitting stratified models, based on the alternating direction method of multipliers (ADMM). When the fitting loss functions are convex, the stratified model fitting problem is convex, and our method computes the global minimizer of the loss plus regularization; in other cases it computes a local minimizer. The method is very efficient, and naturally scales to large data sets or numbers of stratified feature values. We illustrate our method with a variety of examples.

層別モデルは、選択されたカテゴリ特性のセットに任意の方法で依存し、他の特性に線形に依存するモデルです。基本的な従来の定式化では、特定のカテゴリ値を持つデータのみを使用して、カテゴリ特性の各値に個別のモデルが適合されます。この定式化にラプラシアン正則化を追加します。これにより、隣接するカテゴリ値のモデルパラメータが類似するようになります。ラプラシアン正則化を使用すると、層別特性値に1つ以上の重み付きグラフを指定できます。たとえば、曜日で層別化する場合、日曜日のモデルパラメータが土曜日と月曜日のモデルパラメータに近くなるように指定できます。この正則化により、モデルは従来の層別モデルよりもパフォーマンスが向上します。これは、カテゴリの各値のモデルが隣接する値から「強度を借りる」ためです。特に、トレーニングデータセットに出現しなかったカテゴリ値に対してもモデルが生成されます。交互方向乗数法(ADMM)に基づいて、層別モデルをフィッティングするための効率的な分散手法を提案します。フィッティング損失関数が凸の場合、層別モデルフィッティング問題は凸であり、この手法では損失と正則化のグローバル最小化を計算します。それ以外の場合は、ローカル最小化を計算します。この手法は非常に効率的で、大規模なデータセットや層別特徴値の数に自然に適応します。さまざまな例でこの手法を説明します。
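
The day-of-week example can be worked out in the simplest possible setting, where each stratum has a single scalar parameter fitted with squared loss. In that case the Laplacian-regularized problem reduces to one linear system, $(N + \lambda L)\theta = s$, with $N$ the diagonal matrix of per-stratum sample counts, $L$ the graph Laplacian, and $s$ the per-stratum sums of observations. The sketch below solves this system directly and does not implement the distributed ADMM method of the paper; all names and numbers are illustrative assumptions.

```python
import numpy as np

def fit_stratified_means(sums, counts, edges, lam):
    """Minimize sum_k sum_i (y_i - theta_k)^2 + lam * sum_edges w * (theta_i - theta_j)^2."""
    k = len(counts)
    L = np.zeros((k, k))                      # weighted graph Laplacian
    for i, j, w in edges:
        L[i, i] += w
        L[j, j] += w
        L[i, j] -= w
        L[j, i] -= w
    return np.linalg.solve(np.diag(counts) + lam * L, sums)

# Strata 0..6 stand for Sunday..Saturday, connected as a cycle.
edges = [(d, (d + 1) % 7, 1.0) for d in range(7)]
counts = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0])   # data only on Monday and Saturday
sums = np.array([0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 4.0])     # observed values 2.0 and 4.0

theta = fit_stratified_means(sums, counts, edges, lam=1.0)
print(theta)
```

Sunday has no observations, yet it still receives an estimate because the regularizer ties it to its neighbors Monday and Saturday, which is the "borrowing strength" effect described in the abstract.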

Predictive Learning on Hidden Tree-Structured Ising Models
隠れ木構造イジングモデルにおける予測学習

We provide high-probability sample complexity guarantees for exact structure recovery and accurate predictive learning using noise-corrupted samples from an acyclic (tree-shaped) graphical model. The hidden variables follow a tree-structured Ising model distribution, whereas the observable variables are generated by a binary symmetric channel taking the hidden variables as its input (flipping each bit independently with some constant probability $q\in [0,1/2)$). In the absence of noise, predictive learning on Ising models was recently studied by Bresler and Karzand (2020); this paper quantifies how noise in the hidden model impacts the tasks of structure recovery and marginal distribution estimation by proving upper and lower bounds on the sample complexity. Our results generalize state-of-the-art bounds reported in prior work, and they exactly recover the noiseless case ($q=0$). In fact, for any tree with $p$ vertices and probability of incorrect recovery $\delta>0$, the sufficient number of samples remains logarithmic as in the noiseless case, i.e., $\mathcal{O}(\log(p/\delta))$, while the dependence on $q$ is $\mathcal{O}\big( 1/(1-2q)^{4} \big)$, for both aforementioned tasks. We also present a new equivalent of Isserlis’ Theorem for sign-valued tree-structured distributions, yielding a new low-complexity algorithm for higher-order moment estimation.

私たちは、非巡回（ツリー型）グラフィカルモデルからのノイズで破損したサンプルを使用して、正確な構造回復と正確な予測学習のための高確率サンプル複雑性保証を提供します。隠れ変数はツリー構造のイジングモデル分布に従いますが、観測変数は、隠れ変数を入力として受け取るバイナリ対称チャネルによって生成されます（各ビットを一定の確率$q\in [0,1/2)$で独立に反転します）。ノイズがない場合、イジングモデルの予測学習は最近、BreslerとKarzand（2020）によって研究されました。この論文では、サンプル複雑性の上限と下限を証明することにより、隠れモデルのノイズが構造回復と周辺分布推定のタスクにどのように影響するかを定量化しています。私たちの結果は、以前の研究で報告された最先端の境界を一般化し、ノイズのないケース（$q = 0$）を正確に回復します。実際、$p$個の頂点を持ち、誤った回復の確率が$\delta>0$である任意のツリーについて、十分なサンプル数はノイズがない場合と同様に対数、つまり$\mathcal{O}(\log(p/\delta))$のままですが、前述の両方のタスクについて、$q$への依存性は$\mathcal{O}\big( 1/(1-2q)^{4} \big)$です。また、符号値ツリー構造分布に対するIsserlisの定理に相当する新しい定理も提示し、高次モーメント推定のための新しい低複雑性アルゴリズムを生み出します。
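The role of the factor $(1-2q)$ can be seen from a short calculation. The notation is introduced here for illustration and is not taken from the paper: let $X_i \in \{-1,+1\}$ be the hidden variables and $Y_i = N_i X_i$ the observations, where the $N_i$ are independent of each other and of the $X_i$, with $N_i = -1$ with probability $q$ and $N_i = +1$ otherwise.

```latex
\begin{align*}
\mathbb{E}[N_i] &= (1-q) - q = 1 - 2q, \\
\mathbb{E}[Y_i Y_j] &= \mathbb{E}[N_i]\,\mathbb{E}[N_j]\,\mathbb{E}[X_i X_j]
                     = (1-2q)^2\,\mathbb{E}[X_i X_j], \qquad i \neq j.
\end{align*}
```

Every pairwise correlation is therefore shrunk by $(1-2q)^2$. As a heuristic reading only, not the paper's proof: resolving differences between correlations that have been shrunk by this factor requires on the order of $1/(1-2q)^4$ times as many samples, which is consistent with the dependence on $q$ stated in the abstract.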

Estimation and Inference for High Dimensional Generalized Linear Models: A Splitting and Smoothing Approach
高次元一般化線形モデルの推定と推論:分割と平滑化のアプローチ

The focus of modern biomedical studies has gradually shifted to explanation and estimation of joint effects of high dimensional predictors on disease risks. Quantifying uncertainty in these estimates may provide valuable insight into prevention strategies or treatment decisions for both patients and physicians. High dimensional inference, including confidence intervals and hypothesis testing, has sparked much interest. While much work has been done in the linear regression setting, there is lack of literature on inference for high dimensional generalized linear models. We propose a novel and computationally feasible method, which accommodates a variety of outcome types, including normal, binomial, and Poisson data. We use a “splitting and smoothing” approach, which splits samples into two parts, performs variable selection using one part and conducts partial regression with the other part. Averaging the estimates over multiple random splits, we obtain the smoothed estimates, which are numerically stable. We show that the estimates are consistent, asymptotically normal, and construct confidence intervals with proper coverage probabilities for all predictors. We examine the finite sample performance of our method by comparing it with the existing methods and applying it to analyze a lung cancer cohort study.

現代の生物医学研究の焦点は、高次元予測因子の疾患リスクに対する共同効果の説明と推定に徐々に移行しています。これらの推定値の不確実性を定量化することで、患者と医師の両方にとって予防戦略や治療決定に関する貴重な洞察が得られる可能性があります。信頼区間や仮説検定などの高次元推論は、大きな関心を集めています。線形回帰の設定では多くの研究が行われてきましたが、高次元一般化線形モデルの推論に関する文献は不足しています。私たちは、正規分布、二項分布、ポアソン分布などのさまざまな結果タイプに対応する、新しい計算可能な方法を提案します。私たちは「分割と平滑化」アプローチを使用します。これは、サンプルを2つの部分に分割し、一方を使用して変数選択を実行し、もう一方を使用して部分回帰を実行します。複数のランダム分割で推定値を平均すると、数値的に安定した平滑化された推定値が得られます。推定値は一貫しており、漸近的に正規であり、すべての予測因子に対して適切なカバレッジ確率で信頼区間を構成することを示します。私たちは、既存の方法と比較し、肺がんコホート研究の分析に適用することで、我々の方法の有限サンプル性能を検証します。
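The "splitting and smoothing" idea can be sketched in a few lines for the simplest (normal outcome) case. This is a rough illustration under stated assumptions, not the authors' procedure: the selection step here is a simple screening by absolute marginal covariance (the paper's selection method is not specified in the abstract), the estimation step is ordinary least squares rather than a general GLM fit, and the function name `split_and_smooth`, its parameters, and the simulated data are all hypothetical.

```python
import numpy as np

def split_and_smooth(X, y, j, n_splits=50, n_keep=5, seed=0):
    """Average, over random splits, the OLS coefficient of predictor j.

    Selection (an assumption of this sketch): on the first half, keep the
    n_keep predictors with the largest absolute marginal covariance with y.
    Estimation: on the second half, regress y on predictor j together with
    the selected predictors by ordinary least squares.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    estimates = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        sel_idx, fit_idx = perm[: n // 2], perm[n // 2:]
        # Variable selection on the first half.
        Xs, ys = X[sel_idx], y[sel_idx]
        cov = np.abs((Xs - Xs.mean(axis=0)).T @ (ys - ys.mean()))
        selected = set(np.argsort(cov)[-n_keep:].tolist())
        # Partial regression on the second half; predictor j is always included.
        cols = [j] + sorted(selected - {j})
        design = np.column_stack([np.ones(len(fit_idx)), X[fit_idx][:, cols]])
        beta, *_ = np.linalg.lstsq(design, y[fit_idx], rcond=None)
        estimates.append(beta[1])  # coefficient of predictor j
    # Smoothing: average the per-split estimates.
    return float(np.mean(estimates))

# Simulated data: 3 active predictors out of 50 (values are made up).
rng = np.random.default_rng(1)
n, p = 400, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.normal(size=n)

est_signal = split_and_smooth(X, y, j=0)   # true coefficient is 2.0
est_null = split_and_smooth(X, y, j=10)    # true coefficient is 0.0
```

Because selection and estimation use disjoint halves of the data, the fit in each split is not contaminated by the selection step. Averaging over many random splits then reduces the extra variability caused by any single split.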

Normalizing Flows for Probabilistic Modeling and Inference
確率的モデリングと推論のための正規化フロー

Normalizing flows provide a general mechanism for defining expressive probability distributions, only requiring the specification of a (usually simple) base distribution and a series of bijective transformations. There has been much recent work on normalizing flows, ranging from improving their expressive power to expanding their application. We believe the field has now matured and is in need of a unified perspective. In this review, we attempt to provide such a perspective by describing flows through the lens of probabilistic modeling and inference. We place special emphasis on the fundamental principles of flow design, and discuss foundational topics such as expressive power and computational trade-offs. We also broaden the conceptual framing of flows by relating them to more general probability transformations. Lastly, we summarize the use of flows for tasks such as generative modeling, approximate inference, and supervised learning.

正規化フローは、表現力豊かな確率分布を定義するための一般的なメカニズムを提供し、(通常は単純な)ベース分布と一連の全単射変換の指定のみを必要とします。最近では、正規化フローに関する多くの研究が行われており、その表現力の向上からアプリケーションの拡大まで多岐にわたります。私たちは、この分野が成熟した今、統一された視点が必要だと考えています。このレビューでは、確率的モデリングと推論のレンズを通してフローを記述することにより、そのような視点を提供しようとします。フロー設計の基本原則に特に重点を置いており、表現力や計算上のトレードオフなどの基本的なトピックについて説明します。また、フローをより一般的な確率変換に関連付けることにより、フローの概念的なフレーミングを広げます。最後に、生成モデリング、近似推論、教師あり学習などのタスクでのフローの使用についてまとめます。
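The mechanism summarized above (a simple base distribution plus a series of bijective transformations) rests on the change-of-variables formula. The following is a minimal one-dimensional sketch, assuming a standard normal base and two affine bijections; the class `AffineFlow` and the function names are hypothetical and chosen only for illustration. The log-density of a point is obtained by inverting the transformations in reverse order and accumulating the log absolute Jacobian determinant of each inverse step.

```python
import numpy as np

def base_log_prob(z):
    """Log-density of a standard normal base distribution."""
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))

class AffineFlow:
    """Bijection x = a * z + b with a != 0 (a toy one-dimensional flow)."""

    def __init__(self, a, b):
        self.a, self.b = float(a), float(b)

    def forward(self, z):
        return self.a * z + self.b

    def inverse(self, x):
        return (x - self.b) / self.a

    def log_abs_det_inverse(self, x):
        # d(inverse)/dx = 1 / a, so log|det J| = -log|a| at every point.
        return -np.log(abs(self.a)) * np.ones_like(x)

def flow_log_prob(x, flows):
    """log p(x) = log p_base(z) + sum of log|det J| over the inverse steps."""
    log_det = np.zeros_like(x)
    for flow in reversed(flows):  # undo the last transformation first
        log_det += flow.log_abs_det_inverse(x)
        x = flow.inverse(x)
    return base_log_prob(x) + log_det

# Composite map: x = -0.5 * (2 z + 1) + 3 = -z + 2.5, so x ~ N(2.5, 1).
flows = [AffineFlow(2.0, 1.0), AffineFlow(-0.5, 3.0)]
x = np.linspace(-2.0, 6.0, 5)
log_p = flow_log_prob(x, flows)
expected = -0.5 * ((x - 2.5) ** 2 + np.log(2.0 * np.pi))
```

Here the composite map is itself affine, so the result can be checked against a closed-form Gaussian density. Practical flows replace `AffineFlow` with richer bijections whose inverse and Jacobian determinant remain cheap to compute.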

Incorporating Unlabeled Data into Distributionally Robust Learning
ラベルなしデータの分布ロバスト学習への組み込み

We study a robust alternative to empirical risk minimization called distributionally robust learning (DRL), in which one learns to perform against an adversary who can choose the data distribution from a specified set of distributions. We illustrate a problem with current DRL formulations, which rely on an overly broad definition of allowed distributions for the adversary, leading to learned classifiers that are unable to predict with any confidence. We propose a solution that incorporates unlabeled data into the DRL problem to further constrain the adversary. We show that this new formulation is tractable for stochastic gradient-based optimization and yields a computable guarantee on the future performance of the learned classifier, analogous to — but tighter than — guarantees from conventional DRL. We examine the performance of this new formulation on 14 real data sets and find that it often yields effective classifiers with nontrivial performance guarantees in situations where conventional DRL produces neither. Inspired by these results, we extend our DRL formulation to active learning with a novel, distributionally-robust version of the standard model-change heuristic. Our active learning algorithm often achieves superior learning performance to the original heuristic on real data sets.

私たちは、分布的に堅牢な学習(DRL)と呼ばれる経験的リスク最小化の堅牢な代替手段を研究します。これは、指定された分布セットからデータ分布を選択できる敵に対してパフォーマンスを発揮することを学習するものです。現在のDRL定式化の問題を示します。これは、敵に対して許容される分布の定義が広すぎるため、学習された分類器が自信を持って予測できないという問題があります。私たちは、敵をさらに制約するために、ラベルなしデータをDRL問題に組み込むソリューションを提案します。この新しい定式化は、確率的勾配ベースの最適化に適しており、学習された分類器の将来のパフォーマンスについて計算可能な保証をもたらすことを示します。これは、従来のDRLの保証に類似していますが、より厳密です。14の実際のデータセットでこの新しい定式化のパフォーマンスを調べ、従来のDRLではどちらも生成されない状況で、この新しい定式化により、重要なパフォーマンス保証を持つ効果的な分類器が生成されることがよくあります。これらの結果にヒントを得て、私たちは、標準モデル変更ヒューリスティックの分布的に堅牢な新しいバージョンを使用して、DRL定式化をアクティブラーニングに拡張しました。私たちのアクティブラーニングアルゴリズムは、実際のデータセットで元のヒューリスティックよりも優れた学習パフォーマンスを達成することがよくあります。

Integrative Generalized Convex Clustering Optimization and Feature Selection for Mixed Multi-View Data
混合マルチビューデータのための統合一般化凸クラスタリング最適化と特徴選択

In mixed multi-view data, multiple sets of diverse features are measured on the same set of samples. By integrating all available data sources, we seek to discover common group structure among the samples that may be hidden in individualistic cluster analyses of a single data view. While several techniques for such integrative clustering have been explored, we propose and develop a convex formalization that enjoys strong empirical performance and inherits the mathematical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers. Our so-called iGecco+ approach selects features from each data view that are best for determining the groups, often leading to improved integrative clustering. To solve our problem, we develop a new type of generalized multi-block ADMM algorithm using sub-problem approximations that more efficiently fits our model for big data sets. Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data.

混合マルチビューデータでは、同じサンプルセットで複数の多様な特徴セットが測定されます。利用可能なすべてのデータソースを統合することで、単一のデータビューの個別クラスター分析では隠れている可能性のあるサンプル間の共通グループ構造を発見しようとします。このような統合クラスタリングの手法はいくつか検討されていますが、私たちは強力な経験的パフォーマンスを享受し、ますます普及している凸クラスタリングメソッドの数学的特性を継承する凸形式化を提案し、開発します。具体的には、Integrative Generalized Convex Clustering Optimization (iGecco)メソッドは、異なるデータビューごとに異なる凸距離、損失、または発散を使用し、共通グループにつながる共同凸融合ペナルティを使用します。さらに、各データソースが高次元の場合、混合マルチビューデータの統合は困難になることがよくあります。このようなシナリオで特徴選択を実行するために、損失固有の中心に向かって特徴を縮小することで特徴を選択する適応シフトグループラッソペナルティを開発します。いわゆるiGecco+アプローチは、各データビューからグループを決定するのに最適な特徴を選択し、多くの場合、統合クラスタリングの改善につながります。この問題を解決するために、私たちは、ビッグデータセットのモデルにもっと効率的に適合するサブ問題近似を使用する新しいタイプの一般化マルチブロックADMMアルゴリズムを開発しました。テキストマイニングとゲノミクスに関する一連の数値実験と実際のデータ例を通じて、iGecco+が高次元の混合マルチビューデータに対して優れた経験的パフォーマンスを実現することを示しています。

GemBag: Group Estimation of Multiple Bayesian Graphical Models
GemBag: 複数のベイジアングラフィカルモデルのグループ推定

In this paper, we propose a novel hierarchical Bayesian model and an efficient estimation method for the problem of joint estimation of multiple graphical models, which have similar but different sparsity structures and signal strength. Our proposed hierarchical Bayesian model is well suited for sharing of sparsity structures, and our procedure, called GemBag, is shown to enjoy optimal theoretical properties in terms of sup-norm estimation accuracy and correct recovery of the graphical structure even when some of the signals are weak. Although optimization of the posterior distribution required for obtaining our proposed estimator is a non-convex optimization problem, we show that it turns out to be convex in a large constrained space facilitating the use of computationally efficient algorithms. Through extensive simulation studies and an application to a bike sharing data set, we demonstrate that the proposed GemBag procedure has strong empirical performance in comparison with alternative methods.

この論文では、類似しているが異なるスパース構造と信号強度を持つ複数のグラフィカルモデルの共同推定の問題に対する新しい階層ベイズモデルと効率的な推定方法を提案します。提案する階層ベイズモデルはスパース構造の共有に適しており、GemBagと呼ばれる手順は、一部の信号が弱い場合でも、supノルムでの推定精度とグラフィカル構造の正しい回復に関して最適な理論的特性を持つことが示されています。提案する推定量を取得するために必要な事後分布の最適化は非凸最適化問題ですが、大きな制約空間では凸になり、計算効率の高いアルゴリズムの使用が容易になることがわかります。広範なシミュレーション研究と自転車シェアリングデータセットへの適用を通じて、提案するGemBag手順が他の方法と比較して優れた経験的パフォーマンスを持つことを実証します。

Subspace Clustering through Sub-Clusters
サブクラスタによる部分空間クラスタリング

The problem of dimension reduction is of increasing importance in modern data analysis. In this paper, we consider modeling the collection of points in a high dimensional space as a union of low dimensional subspaces. In particular we propose a highly scalable sampling based algorithm that clusters the entire data via first spectral clustering of a small random sample followed by classifying or labeling the remaining out-of-sample points. The key idea is that this random subset borrows information across the entire dataset and that the problem of clustering points can be replaced with the more efficient problem of “clustering sub-clusters”. We provide theoretical guarantees for our procedure. The numerical results indicate that for large datasets the proposed algorithm outperforms other state-of-the-art subspace clustering algorithms with respect to accuracy and speed.

次元削減の問題は、現代のデータ分析においてますます重要になっています。この論文では、高次元空間内の点の集合を低次元部分空間の和集合としてモデル化することを検討します。特に、最初に小さなランダムサンプルのスペクトルクラスタリングを介してデータをクラスタリングし、次に残りのサンプル外ポイントを分類またはラベル付けすることで、データをクラスター化する、高度にスケーラブルなサンプリングベースのアルゴリズムを提案します。重要な考え方は、このランダムなサブセットがデータセット全体の情報を借用し、ポイントのクラスタリングの問題を「サブクラスターのクラスタリング」というより効率的な問題に置き換えることができるということです。私たちは、私たちの手順に理論的な保証を提供します。数値結果は、大規模なデータセットの場合、提案されたアルゴリズムが精度と速度の点で他の最先端の部分空間クラスタリングアルゴリズムよりも優れていることを示しています。

Sparse and Smooth Signal Estimation: Convexification of L0-Formulations
スパースかつ平滑な信号の推定:L0定式化の凸化

Signal estimation problems with smoothness and sparsity priors can be naturally modeled as quadratic optimization with $\ell_0$-“norm” constraints. Since such problems are non-convex and hard-to-solve, the standard approach is, instead, to tackle their convex surrogates based on $\ell_1$-norm relaxations. In this paper, we propose new iterative (convex) conic quadratic relaxations that exploit not only the $\ell_0$-“norm” terms, but also the fitness and smoothness functions. The iterative convexification approach substantially closes the gap between the $\ell_0$-“norm” and its $\ell_1$ surrogate. These stronger relaxations lead to significantly better estimators than $\ell_1$-norm approaches and also allow one to utilize affine sparsity priors. In addition, the parameters of the model and the resulting estimators are easily interpretable. Experiments with a tailored Lagrangian decomposition method indicate that the proposed iterative convex relaxations yield solutions within 1\% of the exact $\ell_0$-approach, and can tackle instances with up to 100,000 variables under one minute.

平滑性とスパース性の事前分布を伴う信号推定問題は、自然に$\ell_0$-“ノルム” 制約を伴う二次最適化としてモデル化できます。このような問題は非凸で解決が難しいため、標準的なアプローチでは、代わりに$\ell_1$-ノルム緩和に基づいて凸代替物に取り組みます。この論文では、$\ell_0$-“ノルム” 項だけでなく、適合度関数と平滑性関数も利用する新しい反復(凸)円錐二次緩和を提案します。反復凸化アプローチは、$\ell_0$-“ノルム” とその$\ell_1$代替物の間のギャップを大幅に埋めます。これらの強力な緩和により、$\ell_1$-ノルムアプローチよりも大幅に優れた推定量が得られ、アフィンスパース性事前分布も利用できるようになります。さらに、モデルのパラメーターと結果として得られる推定量は簡単に解釈できます。カスタマイズされたラグランジュ分解法を用いた実験では、提案された反復凸緩和により、正確な$\ell_0$アプローチの1\%以内の解が得られ、1分以内に最大100,000個の変数を持つインスタンスを処理できることが示されています。

Projection-free Decentralized Online Learning for Submodular Maximization over Time-Varying Networks
時間変動ネットワーク上のサブモジュラー最大化のための投影フリー分散型オンライン学習

This paper considers a decentralized online submodular maximization problem over time-varying networks, where each agent only utilizes its own information and the received information from its neighbors. To address the problem, we propose a decentralized Meta-Frank-Wolfe online learning method in the adversarial online setting by using local communication and local computation. Moreover, we show that an expected regret bound of $O(\sqrt{T})$ is achieved with $(1-1/e)$ approximation guarantee, where $T$ is a time horizon. In addition, we also propose a decentralized one-shot Frank-Wolfe online learning method in the stochastic online setting. Furthermore, we also show that an expected regret bound $O(T^{2/3})$ is obtained with $(1-1/e)$ approximation guarantee. Finally, we confirm the theoretical results via various experiments on different datasets.

この論文では、各エージェントが自身の情報と隣接するエージェントから受信した情報のみを利用する、時間変動ネットワーク上の分散型オンラインサブモジュラ最大化問題について考察します。この問題に対処するために、ローカル通信とローカル計算を使用して、敵対的なオンライン設定における分散型Meta-Frank-Wolfeオンライン学習方法を提案します。さらに、$(1-1/e)$の近似保証のもとで、期待リグレット上界$O(\sqrt{T})$が達成されることを示します(ここで$T$は時間ホライズンです)。加えて、確率的なオンライン設定における分散型ワンショットFrank-Wolfeオンライン学習方法も提案し、$(1-1/e)$の近似保証のもとで期待リグレット上界$O(T^{2/3})$が得られることを示します。最後に、さまざまなデータセットでの実験を通じて理論結果を確認します。

Structure Learning of Undirected Graphical Models for Count Data
カウントデータに対する無向グラフィカルモデルの構造学習

Mainly motivated by the problem of modelling biological processes underlying the basic functions of a cell (which typically involve complex interactions between genes), we present a new algorithm, called PC-LPGM, for learning the structure of undirected graphical models over discrete variables. We prove theoretical consistency of PC-LPGM in the limit of infinite observations and discuss its robustness to model misspecification. To evaluate the performance of PC-LPGM in recovering the true structure of the graphs in situations where relatively moderate sample sizes are available, extensive simulation studies are conducted, which also allow us to compare our proposal with its main competitors. A biological validation of the algorithm is presented through the analysis of two real data sets.

細胞の基本的な機能の根底にある生物学的プロセス(通常は遺伝子間の複雑な相互作用を伴います)をモデル化するという問題を主な動機として、離散変数に対する無向グラフィカルモデルの構造を学習するためのPC-LPGMと呼ばれる新しいアルゴリズムを紹介します。無限観測の極限におけるPC-LPGMの理論的一致性を証明し、モデルの誤指定に対するロバスト性について議論します。比較的中程度のサンプルサイズしか利用できない状況でグラフの真の構造を回復する際のPC-LPGMの性能を評価するために、広範なシミュレーション研究を行い、提案手法を主要な競合手法と比較します。アルゴリズムの生物学的検証は、2つの実際のデータセットの分析を通じて提示されます。

From Low Probability to High Confidence in Stochastic Convex Optimization
確率的凸最適化における低確率から高信頼度へ

Standard results in stochastic convex optimization bound the number of samples that an algorithm needs to generate a point with small function value in expectation. More nuanced high probability guarantees are rare, and typically either rely on light-tail noise assumptions or exhibit worse sample complexity. In this work, we show that a wide class of stochastic optimization algorithms for strongly convex problems can be augmented with high confidence bounds at an overhead cost that is only logarithmic in the confidence level and polylogarithmic in the condition number. The procedure we propose, called proxBoost, is elementary and builds on two well-known ingredients: robust distance estimation and the proximal point method. We discuss consequences for both streaming (online) algorithms and offline algorithms based on empirical risk minimization.

確率的凸最適化の標準的な結果は、期待値の意味で関数値が小さい点を生成するためにアルゴリズムが必要とするサンプル数を上から抑えます。より精緻な高確率の保証はまれであり、通常は裾の軽いノイズの仮定に依存するか、サンプル複雑度が悪化します。この研究では、強凸問題に対する幅広いクラスの確率的最適化アルゴリズムに、信頼水準に関して対数的、条件数に関して多重対数的なオーバーヘッドコストのみで、高信頼度の境界を付与できることを示します。私たちが提案するproxBoostと呼ばれる手順は初等的なもので、ロバストな距離推定と近接点法という2つのよく知られた要素に基づいています。ストリーミング(オンライン)アルゴリズムと、経験的リスク最小化に基づくオフラインアルゴリズムの両方に対する帰結について議論します。

Optimal Feedback Law Recovery by Gradient-Augmented Sparse Polynomial Regression
勾配増強スパース多項式回帰による最適フィードバック則回復

A sparse regression approach for the computation of high-dimensional optimal feedback laws arising in deterministic nonlinear control is proposed. The approach exploits the control-theoretical link between Hamilton-Jacobi-Bellman PDEs characterizing the value function of the optimal control problems, and first-order optimality conditions via Pontryagin’s Maximum Principle. The latter is used as a representation formula to recover the value function and its gradient at arbitrary points in the space-time domain through the solution of a two-point boundary value problem. After generating a dataset consisting of different state-value pairs, a hyperbolic cross polynomial model for the value function is fitted using a LASSO regression. An extended set of low and high-dimensional numerical tests in nonlinear optimal control reveal that enriching the dataset with gradient information reduces the number of training samples, and that the sparse polynomial regression consistently yields a feedback law of lower complexity.

決定論的非線形制御で生じる高次元最適フィードバック法則を計算するためのスパース回帰アプローチが提案されています。このアプローチは、最適制御問題の価値関数を特徴付けるハミルトン-ヤコビ-ベルマン偏微分方程式と、ポンチャギンの最大原理による一次最適条件との間の制御理論的なつながりを利用します。後者は、2点境界値問題の解決を通じて、空間時間領域内の任意の点で価値関数とその勾配を回復するための表現式として使用されます。異なる状態値ペアからなるデータセットを生成した後、LASSO回帰を使用して価値関数の双曲交差多項式モデルを適合させる。非線形最適制御における低次元および高次元の数値テストの拡張セットにより、勾配情報でデータセットを充実させることでトレーニングサンプルの数が減り、スパース多項式回帰によって一貫して複雑度の低いフィードバック法則が得られることが明らかになった。

Understanding Recurrent Neural Networks Using Nonequilibrium Response Theory
非平衡応答理論を用いたリカレントニューラルネットワークの理解

Recurrent neural networks (RNNs) are brain-inspired models widely used in machine learning for analyzing sequential data. The present work is a contribution towards a deeper understanding of how RNNs process input signals using the response theory from nonequilibrium statistical mechanics. For a class of continuous-time stochastic RNNs (SRNNs) driven by an input signal, we derive a Volterra type series representation for their output. This representation is interpretable and disentangles the input signal from the SRNN architecture. The kernels of the series are certain recursively defined correlation functions with respect to the unperturbed dynamics that completely determine the output. Exploiting connections of this representation and its implications to rough paths theory, we identify a universal feature, the response feature, which turns out to be the signature of tensor product of the input signal and a natural support basis. In particular, we show that SRNNs, with only the weights in the readout layer optimized and the weights in the hidden layer kept fixed and not optimized, can be viewed as kernel machines operating on a reproducing kernel Hilbert space associated with the response feature.

リカレントニューラルネットワーク(RNN)は、機械学習でシーケンシャルデータを分析するために広く使用されている脳にヒントを得たモデルです。本研究は、非平衡統計力学の応答理論を使用して、RNNが入力信号を処理する方法の理解を深めることに貢献します。入力信号によって駆動される連続時間確率RNN (SRNN)のクラスについて、その出力のVolterra型級数表現を導出します。この表現は解釈可能であり、SRNNアーキテクチャから入力信号を分離します。級数のカーネルは、出力を完全に決定する非摂動ダイナミクスに関して再帰的に定義された特定の相関関数です。この表現とラフパス理論へのその影響とのつながりを利用して、応答特徴という普遍的な特徴を特定します。これは、入力信号と自然なサポート基底のテンソル積のシグネチャであることが判明しています。特に、読み出し層の重みのみが最適化され、隠れ層の重みは固定され最適化されていないSRNNは、応答特徴に関連付けられた再生カーネルヒルベルト空間上で動作するカーネルマシンとして見ることができることを示します。

Optimal Structured Principal Subspace Estimation: Metric Entropy and Minimax Rates
最適構造化主部分空間推定:メトリックエントロピーとミニマックスレート

Driven by a wide range of applications, several principal subspace estimation problems have been studied individually under different structural constraints. This paper presents a unified framework for the statistical analysis of a general structured principal subspace estimation problem which includes as special cases sparse PCA/SVD, non-negative PCA/SVD, subspace constrained PCA/SVD, and spectral clustering. General minimax lower and upper bounds are established to characterize the interplay between the information-geometric complexity of the constraint set for the principal subspaces, the signal-to-noise ratio (SNR), and the dimensionality. The results yield interesting phase transition phenomena concerning the rates of convergence as a function of the SNRs and the fundamental limit for consistent estimation. Applying the general results to the specific settings yields the minimax rates of convergence for those problems, including the previous unknown optimal rates for sparse SVD, non-negative PCA/SVD and subspace constrained PCA/SVD.

幅広い応用分野に牽引されて、いくつかの主サブスペース推定問題が、異なる構造的制約の下で個別に研究されてきました。この論文では、一般的な構造化主サブスペース推定問題の統計分析のための統一フレームワークを提示します。この中には、特殊なケースとして、スパースPCA/SVD、非負PCA/SVD、サブスペース制約PCA/SVD、およびスペクトルクラスタリングが含まれます。一般的なミニマックス下限と上限は、主サブスペースの制約セットの情報幾何学的複雑さ、信号対雑音比(SNR)、および次元間の相互作用を特徴付けるために確立されています。その結果、SNRの関数としての収束率と一貫性のある推定の基本限界に関する興味深い相転移現象が得られます。一般的な結果を特定の設定に適用すると、これまで未知であったスパースSVD、非負PCA/SVD、およびサブスペース制約PCA/SVDの最適率を含む、それらの問題のミニマックス収束率が得られます。

RaSE: Random Subspace Ensemble Classification
RaSE: ランダム部分空間アンサンブル分類

We propose a flexible ensemble classification framework, Random Subspace Ensemble (RaSE), for sparse classification. In the RaSE algorithm, we aggregate many weak learners, where each weak learner is a base classifier trained in a subspace optimally selected from a collection of random subspaces. To conduct subspace selection, we propose a new criterion, ratio information criterion (RIC), based on weighted Kullback-Leibler divergence. The theoretical analysis includes the risk and Monte-Carlo variance of the RaSE classifier, establishing the screening consistency and weak consistency of RIC, and providing an upper bound for the misclassification rate of the RaSE classifier. In addition, we show that in a high-dimensional framework, the number of random subspaces needs to be very large to guarantee that a subspace covering signals is selected. Therefore, we propose an iterative version of the RaSE algorithm and prove that under some specific conditions, a smaller number of generated random subspaces are needed to find a desirable subspace through iteration. An array of simulations under various models and real-data applications demonstrate the effectiveness and robustness of the RaSE classifier and its iterative version in terms of low misclassification rate and accurate feature ranking. The RaSE algorithm is implemented in the R package RaSEn on CRAN.

私たちは、スパース分類のための柔軟なアンサンブル分類フレームワーク、ランダムサブスペースアンサンブル(RaSE)を提案します。RaSEアルゴリズムでは、多数の弱学習器を集約します。ここで、各弱学習器は、ランダムサブスペースのコレクションから最適に選択されたサブスペースでトレーニングされた基本分類器です。サブスペース選択を行うために、重み付きKullback-Leiblerダイバージェンスに基づく新しい基準、比率情報基準(RIC)を提案します。理論分析には、RaSE分類器のリスクとモンテカルロ分散が含まれ、RICのスクリーニング一貫性と弱い一貫性を確立し、RaSE分類器の誤分類率の上限を提供します。さらに、高次元フレームワークでは、信号をカバーするサブスペースが選択されることを保証するために、ランダムサブスペースの数を非常に大きくする必要があることを示す。したがって、RaSEアルゴリズムの反復バージョンを提案し、特定の条件下では、反復を通じて望ましいサブスペースを見つけるために、生成されるランダムサブスペースの数が少なくて済むことを証明します。さまざまなモデルと実際のデータアプリケーションでの一連のシミュレーションにより、RaSE分類器とその反復バージョンが、誤分類率の低さと特徴の正確なランク付けという点で、有効性と堅牢性を実証しています。RaSEアルゴリズムは、CRANのRパッケージRaSEnに実装されています。

Wasserstein barycenters can be computed in polynomial time in fixed dimension
ワッサーシュタイン重心は、固定次元の多項式時間で計算できます

Computing Wasserstein barycenters is a fundamental geometric problem with widespread applications in machine learning, statistics, and computer graphics. However, it is unknown whether Wasserstein barycenters can be computed in polynomial time, either exactly or to high precision (i.e., with $\textrm{polylog}(1/\varepsilon)$ runtime dependence). This paper answers these questions in the affirmative for any fixed dimension. Our approach is to solve an exponential-size linear programming formulation by efficiently implementing the corresponding separation oracle using techniques from computational geometry.

Wasserstein重心の計算は、機械学習、統計学、およびコンピューターグラフィックスに広く応用されている基本的な幾何学的問題です。しかし、Wasserstein重心を多項式時間で、厳密に、あるいは高精度に(つまり、実行時間の依存性が$\textrm{polylog}(1/\varepsilon)$で)計算できるかどうかは不明です。この論文では、これらの問いに任意の固定次元に対して肯定的に答えます。私たちのアプローチは、計算幾何学の手法を使用して対応する分離オラクルを効率的に実装することにより、指数サイズの線形計画法の定式化を解くことです。

Banach Space Representer Theorems for Neural Networks and Ridge Splines
ニューラルネットワークとリッジスプラインのためのBanach空間表現定理

We develop a variational framework to understand the properties of the functions learned by neural networks fit to data. We propose and study a family of continuous-domain linear inverse problems with total variation-like regularization in the Radon domain subject to data fitting constraints. We derive a representer theorem showing that finite-width, single-hidden layer neural networks are solutions to these inverse problems. We draw on many techniques from variational spline theory and so we propose the notion of polynomial ridge splines, which correspond to single-hidden layer neural networks with truncated power functions as the activation function. The representer theorem is reminiscent of the classical reproducing kernel Hilbert space representer theorem, but we show that the neural network problem is posed over a non-Hilbertian Banach space. While the learning problems are posed in the continuous-domain, similar to kernel methods, the problems can be recast as finite-dimensional neural network training problems. These neural network training problems have regularizers which are related to the well-known weight decay and path-norm regularizers. Thus, our result gives insight into functional characteristics of trained neural networks and also into the design of neural network regularizers. We also show that these regularizers promote neural network solutions with desirable generalization properties.

私たちは、データに適合させたニューラルネットワークが学習する関数の特性を理解するために、変分フレームワークを開発します。私たちは、データ適合制約の下で、ラドン領域における全変動に類似した正則化を伴う連続領域線形逆問題のファミリーを提案し、研究します。私たちは、有限幅の単一隠れ層ニューラルネットワークがこれらの逆問題の解であることを示す表現定理を導出します。変分スプライン理論の多くの技術を利用することから、活性化関数として切断べき関数を用いる単一隠れ層ニューラルネットワークに対応する、多項式リッジスプラインの概念を提案します。この表現定理は、古典的な再生カーネルヒルベルト空間の表現定理を彷彿とさせますが、ニューラルネットワーク問題は非ヒルベルト的なBanach空間上で定式化されることを示します。学習問題はカーネル法と同様に連続領域で定式化されますが、有限次元のニューラルネットワークトレーニング問題として書き直すことができます。これらのニューラルネットワークトレーニング問題には、よく知られている重み減衰およびパスノルム正則化に関連する正則化項があります。したがって、私たちの結果は、トレーニングされたニューラルネットワークの関数的特性と、ニューラルネットワーク正則化の設計についての洞察を提供します。また、これらの正則化により、望ましい汎化特性を持つニューラルネットワーク解が促進されることも示しています。

High-Order Langevin Diffusion Yields an Accelerated MCMC Algorithm
高次ランジュバン拡散により高速化された MCMC アルゴリズムが実現

We propose a Markov chain Monte Carlo (MCMC) algorithm based on third-order Langevin dynamics for sampling from distributions with smooth, log-concave densities. The higher-order dynamics allow for more flexible discretization schemes, and we develop a specific method that combines splitting with more accurate integration. For a broad class of $d$-dimensional distributions arising from generalized linear models, we prove that the resulting third-order algorithm produces samples from a distribution that is at most $\varepsilon > 0$ in Wasserstein distance from the target distribution in $O\left(\frac{d^{1/4}}{ \varepsilon^{1/2}} \right)$ steps. This result requires only Lipschitz conditions on the gradient. For general strongly convex potentials with $\alpha$-th order smoothness, we prove that the mixing time scales as $O \left( \frac{d^{1/4}}{\varepsilon^{1/2}} + \frac{d^{1/2}}{ \varepsilon^{1/(\alpha - 1)}} \right)$.

私たちは、滑らかな対数凹密度の分布からサンプリングするために、3次ランジュバンダイナミクスに基づくマルコフ連鎖モンテカルロ(MCMC)アルゴリズムを提案します。高次ダイナミクスにより、より柔軟な離散化スキームが可能になり、分割とより正確な積分を組み合わせた特定の方法を開発します。一般化線形モデルから生じる$d$次元分布の広範なクラスについて、結果として得られる3次アルゴリズムが、ターゲット分布からWasserstein距離で高々$\varepsilon > 0$の分布からのサンプルを、$O\left(\frac{d^{1/4}}{\varepsilon^{1/2}}\right)$ステップで生成することを証明します。この結果に必要なのは、勾配に対するLipschitz条件のみです。$\alpha$次の平滑性を持つ一般的な強凸ポテンシャルについては、混合時間が$O\left(\frac{d^{1/4}}{\varepsilon^{1/2}} + \frac{d^{1/2}}{\varepsilon^{1/(\alpha - 1)}}\right)$のようにスケールすることを証明します。

From Fourier to Koopman: Spectral Methods for Long-term Time Series Prediction
フーリエからクープマンへ:長期時系列予測のためのスペクトル法

We propose spectral methods for long-term forecasting of temporal signals stemming from linear and nonlinear quasi-periodic dynamical systems. For linear signals, we introduce an algorithm with similarities to the Fourier transform but which does not rely on periodicity assumptions, allowing for forecasting given potentially arbitrary sampling intervals. We then extend this algorithm to handle nonlinearities by leveraging Koopman theory. The resulting algorithm performs a spectral decomposition in a nonlinear, data-dependent basis. The optimization objective for both algorithms is highly non-convex. However, expressing the objective in the frequency domain allows us to compute global optima of the error surface in a scalable and efficient manner, partially by exploiting the computational properties of the Fast Fourier Transform. Because of their close relation to Bayesian Spectral Analysis, uncertainty quantification metrics are a natural byproduct of the spectral forecasting methods. We extensively benchmark these algorithms against other leading forecasting methods on a range of synthetic experiments as well as in the context of real-world power systems and fluid flows.

私たちは、線形および非線形の準周期的動的システムから生じる時間信号の長期予測のためのスペクトル法を提案します。線形信号については、フーリエ変換に類似しているが周期性の仮定に依存せず、潜在的に任意のサンプリング間隔を与えられた予測を可能にするアルゴリズムを導入します。次に、クープマン理論を活用してこのアルゴリズムを拡張し、非線形性を処理します。結果として得られるアルゴリズムは、非線形でデータに依存する基底でスペクトル分解を実行します。両方のアルゴリズムの最適化目標は、高度に非凸です。ただし、目標を周波数領域で表現すると、高速フーリエ変換の計算特性を部分的に利用することで、スケーラブルかつ効率的な方法で誤差面のグローバル最適値を計算できます。ベイズスペクトル解析と密接な関係があるため、不確実性定量化メトリックはスペクトル予測法の自然な副産物です。私たちは、さまざまな合成実験や実際の電力システムや流体の流れのコンテキストにおいて、これらのアルゴリズムを他の主要な予測方法と徹底的にベンチマークします。

Residual Energy-Based Models for Text
テキストのための残差エネルギーベースモデル

Current large-scale auto-regressive language models display impressive fluency and can generate convincing text. In this work we start by asking the question: Can the generations of these models be reliably distinguished from real text by statistical discriminators? We find experimentally that the answer is affirmative when we have access to the training data for the model, and guardedly affirmative even if we do not. This suggests that the auto-regressive models can be improved by incorporating the (globally normalized) discriminators into the generative process. We give a formalism for this using the Energy-Based Model framework, and show that it indeed improves the results of the generative models, measured both in terms of perplexity and in terms of human evaluation.

現在の大規模な自己回帰言語モデルは、印象的な流暢さを示し、説得力のあるテキストを生成できます。本研究では、まず、これらのモデルが生成したテキストを統計的識別器によって実際のテキストと確実に区別できるのか、という疑問を投げかけます。実験的に、モデルのトレーニングデータにアクセスできる場合は答えが肯定的であり、アクセスできない場合でも慎重ながら肯定的であることがわかりました。これは、(グローバルに正規化された)識別器を生成プロセスに組み込むことにより、自己回帰モデルを改善できることを示唆しています。私たちは、エネルギーベースモデルのフレームワークを使用してこれを定式化し、パープレキシティと人間による評価の両方の観点で、生成モデルの結果が実際に改善されることを示します。

giotto-tda: A Topological Data Analysis Toolkit for Machine Learning and Data Exploration
giotto-tda: 機械学習とデータ探索のためのトポロジカルデータ解析ツールキット

We introduce giotto-tda, a Python library that integrates high-performance topological data analysis with machine learning via a scikit-learn-compatible API and state-of-the-art C++ implementations. The library’s ability to handle various types of data is rooted in a wide range of preprocessing techniques, and its strong focus on data exploration and interpretability is aided by an intuitive plotting API. Source code, binaries, examples, and documentation can be found at https://github.com/giotto-ai/giotto-tda.

私たちは、scikit-learn互換のAPIと最先端のC++実装により、高性能なトポロジカルデータ解析と機械学習を統合したPythonライブラリ、giotto-tdaをご紹介します。さまざまなタイプのデータを処理するライブラリの能力は、幅広い前処理手法に根ざしており、データの探索と解釈可能性に重点を置いているのは、直感的なプロットAPIによって支えられています。ソースコード、バイナリ、例、およびドキュメントは、https://github.com/giotto-ai/giotto-tdaにあります。

Risk-Averse Learning by Temporal Difference Methods with Markov Risk Measures
マルコフリスク尺度を用いた時間差分法によるリスク回避学習

We propose a novel reinforcement learning methodology where the system performance is evaluated by a Markov coherent dynamic risk measure with the use of linear value function approximations. We construct projected risk-averse dynamic programming equations and study their properties. We propose new risk-averse counterparts of the basic and multi-step methods of temporal differences and we prove their convergence with probability one. We also perform an empirical study on a complex transportation problem.

私たちは、線形価値関数近似を用い、システム性能をマルコフコヒーレント動的リスク尺度によって評価する新しい強化学習方法論を提案します。射影されたリスク回避的動的計画法の方程式を構築し、その性質を研究します。私たちは、基本的および多段階の時間差分法に対応する新しいリスク回避的手法を提案し、それらが確率1で収束することを証明します。また、複雑な輸送問題に関する実証研究も行っています。

A Bayesian Contiguous Partitioning Method for Learning Clustered Latent Variables
クラスタ化潜在変数を学習するためのBayes連続分割法

This article develops a Bayesian partitioning prior model from spanning trees of a graph, by first assigning priors on spanning trees, and then the number and the positions of removed edges given a spanning tree. The proposed method guarantees contiguity in clustering and allows to detect clusters with arbitrary shapes and sizes, whereas most existing partition models such as binary trees and Voronoi tessellations do not possess such properties. We embed this partition model within a hierarchical modeling framework to detect a clustered pattern in latent variables. We focus on illustrating the method through a clustered regression coefficient model for spatial data and propose extensions to other hierarchical models. We prove Bayesian posterior concentration results under an asymptotic framework with random graphs. We design an efficient collapsed Reversible Jump Markov chain Monte Carlo (RJ-MCMC) algorithm to estimate the clustered coefficient values and their uncertainty measures. Finally, we illustrate the performance of the model with simulation studies and a real data analysis of detecting the temperature-salinity relationship from water masses in the Atlantic Ocean.

この記事では、まずスパニングツリーに事前分布を割り当て、次にスパニングツリーが与えられた場合に削除されるエッジの数と位置を割り当てることにより、グラフのスパニングツリーからベイジアンパーティショニング事前分布モデルを開発します。提案された方法は、クラスタリングにおける連続性を保証し、任意の形状とサイズのクラスタを検出できますが、バイナリツリーやボロノイ分割などのほとんどの既存のパーティションモデルは、このような特性を持っていません。このパーティションモデルを階層モデリングフレームワークに埋め込み、潜在変数のクラスタリングパターンを検出します。空間データのクラスタリングされた回帰係数モデルを通じてこの方法を説明することに焦点を当て、他の階層モデルへの拡張を提案します。ランダムグラフを使用した漸近フレームワークの下で、ベイジアン事後集中の結果を証明します。クラスタリングされた係数値とその不確実性尺度を推定するために、効率的な折りたたみ可逆ジャンプマルコフ連鎖モンテカルロ(RJ-MCMC)アルゴリズムを設計します。最後に、シミュレーション研究と大西洋の水塊から温度と塩分の関係を検出する実際のデータ分析により、モデルの性能を示します。

Multi-class Gaussian Process Classification with Noisy Inputs
ノイズの多い入力による多クラスガウス過程分類

It is a common practice in the machine learning community to assume that the observed data are noise-free in the input attributes. Nevertheless, scenarios with input noise are common in real problems, as measurements are never perfectly accurate. If this input noise is not taken into account, a supervised machine learning method is expected to perform sub-optimally. In this paper, we focus on multi-class classification problems and use Gaussian processes (GPs) as the underlying classifier. Motivated by a data set coming from the astrophysics domain, we hypothesize that the observed data may contain noise in the inputs. Therefore, we devise several multi-class GP classifiers that can account for input noise. Such classifiers can be efficiently trained using variational inference to approximate the posterior distribution of the latent variables of the model. Moreover, in some situations, the amount of noise can be known before-hand. If this is the case, it can be readily introduced in the proposed methods. This prior information is expected to lead to better performance results. We have evaluated the proposed methods by carrying out several experiments, involving synthetic and real data. These include several data sets from the UCI repository, the MNIST data set and a data set coming from astrophysics. The results obtained show that, although the classification error is similar across methods, the predictive distribution of the proposed methods is better, in terms of the test log-likelihood, than the predictive distribution of a classifier based on GPs that ignores input noise.

機械学習コミュニティでは、観測データの入力属性にノイズがないと想定するのが一般的です。しかし、実際の問題では、測定が完璧に正確になることは決してないため、入力ノイズのあるシナリオは一般的です。この入力ノイズを考慮しないと、教師あり機械学習法は最適に機能しないことが予想されます。この論文では、マルチクラス分類問題に焦点を当て、基礎となる分類器としてガウス過程(GP)を使用します。天体物理学の分野からのデータセットに触発されて、観測データの入力にノイズが含まれている可能性があると仮定します。そのため、入力ノイズを考慮できるマルチクラスGP分類器をいくつか考案します。このような分類器は、変分推論を使用してモデルの潜在変数の事後分布を近似することで、効率的にトレーニングできます。さらに、状況によっては、ノイズの量を事前に知ることができます。この場合、提案された方法で簡単に導入できます。この事前情報により、パフォーマンスの結果が向上することが期待されます。私たちは、合成データと実際のデータを含むいくつかの実験を実施して、提案された方法を評価しました。これらには、UCIリポジトリからのいくつかのデータセット、MNISTデータセット、天体物理学からのデータセットが含まれます。得られた結果は、分類エラーは方法間で類似しているものの、提案された方法の予測分布は、テストログ尤度の観点から、入力ノイズを無視するGPに基づく分類器の予測分布よりも優れていることを示しています。

Learning and Planning for Time-Varying MDPs Using Maximum Likelihood Estimation
最尤推定を使用した時間変動MDPの学習と計画

This paper proposes a formal approach to online learning and planning for agents operating in a priori unknown, time-varying environments. The proposed method computes the maximally likely model of the environment, given the observations about the environment made by an agent earlier in the system run and assuming knowledge of a bound on the maximal rate of change of system dynamics. Such an approach generalizes the estimation method commonly used in learning algorithms for unknown Markov decision processes with time-invariant transition probabilities, but is also able to quickly and correctly identify the system dynamics following a change. Based on the proposed method, we generalize the exploration bonuses used in learning for time-invariant Markov decision processes by introducing a notion of uncertainty in a learned time-varying model, and develop a control policy for time-varying Markov decision processes based on the exploitation and exploration trade-off. We demonstrate the proposed methods on four numerical examples: a patrolling task with a change in system dynamics, a two-state MDP with periodically changing outcomes of actions, a wind flow estimation task, and a multi-armed bandit problem with periodically changing probabilities of different rewards.

この論文では、事前に未知の時間変動環境で動作するエージェントのオンライン学習および計画に対する正式なアプローチを提案します。提案された方法は、システム実行の以前の段階でエージェントが行った環境に関する観察を前提とし、システムダイナミクスの最大変化率の上限に関する知識を前提として、環境の最大尤度モデルを計算します。このようなアプローチは、時間不変遷移確率を持つ未知のマルコフ決定プロセスの学習アルゴリズムで一般的に使用される推定方法を一般化しますが、変化後のシステムダイナミクスを迅速かつ正確に識別することもできます。提案された方法に基づいて、学習された時間変動モデルに不確実性の概念を導入することで、時間不変マルコフ決定プロセスの学習で使用される探索ボーナスを一般化し、活用と探索のトレードオフに基づいて時間変動マルコフ決定プロセスの制御ポリシーを開発します。提案手法を、システムダイナミクスの変化を伴う巡回タスク、アクションの結果が定期的に変化する2状態MDP、風の流れの推定タスク、異なる報酬の確率が定期的に変化する多腕バンディット問題という4つの数値例で実証します。

Neighborhood Structure Assisted Non-negative Matrix Factorization and Its Application in Unsupervised Point-wise Anomaly Detection
近傍構造支援非負値行列因子分解とその教師なし点単位異常検出への応用

Dimensionality reduction is considered an important step for ensuring competitive performance in unsupervised learning such as anomaly detection. Non-negative matrix factorization (NMF) is a widely used method to accomplish this goal. But NMF does not have the provision to include the neighborhood structure information and, as a result, may fail to provide satisfactory performance in the presence of nonlinear manifold structure. To address this shortcoming, we propose to consider the neighborhood structural similarity information within the NMF framework and do so by modeling the data through a minimum spanning tree. We label the resulting method as the neighborhood structure-assisted NMF. We further develop both offline and online algorithms for implementing the proposed method. Empirical comparisons using twenty benchmark data sets as well as an industrial data set extracted from a hydropower plant demonstrate the superiority of the neighborhood structure-assisted NMF. Looking closer into the formulation and properties of the proposed NMF method and comparing it with several NMF variants reveal that inclusion of the MST-based neighborhood structure plays a key role in attaining the enhanced performance in anomaly detection.

次元削減は、異常検出などの教師なし学習で競争力のあるパフォーマンスを確保するための重要なステップと見なされています。非負値行列因子分解(NMF)は、この目標を達成するために広く使用されている方法です。しかし、NMFには近傍構造情報を含める規定がないため、非線形多様体構造が存在する場合に満足のいくパフォーマンスを提供できない可能性があります。この欠点に対処するために、NMFフレームワーク内で近傍構造の類似性情報を考慮し、最小全域木を介してデータをモデル化することを提案します。結果として得られる方法を近傍構造支援NMFと呼びます。さらに、提案された方法を実装するためのオフラインとオンラインの両方のアルゴリズムを開発します。20のベンチマークデータセットと水力発電所から抽出された産業データセットを使用した実験的比較により、近傍構造支援NMFの優位性が実証されています。提案されたNMF法の定式化と特性を詳しく調べ、それをいくつかのNMFバリアントと比較すると、MSTベースの近傍構造を組み込むことが、異常検出におけるパフォーマンスの向上を達成する上で重要な役割を果たしていることがわかります。

Asynchronous Online Testing of Multiple Hypotheses
複数の仮説の非同期オンラインテスト

We consider the problem of asynchronous online testing, aimed at providing control of the false discovery rate (FDR) during a continual stream of data collection and testing, where each test may be a sequential test that can start and stop at arbitrary times. This setting increasingly characterizes real-world applications in science and industry, where teams of researchers across large organizations may conduct tests of hypotheses in a decentralized manner. The overlap in time and space also tends to induce dependencies among test statistics, a challenge for classical methodology, which either assumes (overly optimistically) independence or (overly pessimistically) arbitrary dependence between test statistics. We present a general framework that addresses both of these issues via a unified computational abstraction that we refer to as “conflict sets.” We show how this framework yields algorithms with formal FDR guarantees under a more intermediate, local notion of dependence. We illustrate our algorithms in simulations by comparing to existing algorithms for online FDR control.

私たちは、データ収集とテストの継続的なストリーム中に偽発見率(FDR)を制御することを目的とした非同期オンラインテストの問題を検討します。各テストは、任意の時間に開始および停止できる連続テストである可能性があります。この設定は、大規模な組織の研究者チームが分散的に仮説のテストを実施する可能性がある科学および産業の実際のアプリケーションの特徴になりつつあります。時間と空間の重複は、テスト統計量間の依存関係も誘発する傾向があります。これは、テスト統計量間の(過度に楽観的な)独立性または(過度に悲観的な)任意の依存関係を前提とする従来の方法論の課題です。私たちは、「競合セット」と呼ぶ統一された計算抽象化を介して、これらの問題の両方に対処する一般的なフレームワークを提示します。このフレームワークにより、より中間的でローカルな依存関係の概念の下で、正式なFDR保証を備えたアルゴリズムがどのように生成されるかを示します。オンラインFDR制御の既存のアルゴリズムと比較することにより、シミュレーションでアルゴリズムを示します。

Learning interaction kernels in heterogeneous systems of agents from multiple trajectories
エージェントの異種系における相互作用カーネルを複数の軌道から学習

Systems of interacting particles, or agents, have wide applications in many disciplines, including Physics, Chemistry, Biology and Economics. These systems are governed by interaction laws, which are often unknown: estimating them from observation data is a fundamental task that can provide meaningful insights and accurate predictions of the behaviour of the agents. In this paper, we consider the inverse problem of learning interaction laws given data from multiple trajectories, in a nonparametric fashion, when the interaction kernels depend on pairwise distances. We establish a condition for learnability of interaction kernels, and construct an estimator based on the minimization of a suitably regularized least squares functional, that is guaranteed to converge, in a suitable $L^2$ space, at the optimal min-max rate for 1-dimensional nonparametric regression. We propose an efficient learning algorithm to construct such estimator, which can be implemented in parallel for multiple trajectories and is therefore well-suited for the high dimensional, big data regime. Numerical simulations on a variety of examples, including opinion dynamics, predator-prey and swarm dynamics and heterogeneous particle dynamics, suggest that the learnability condition is satisfied in models used in practice, and the rate of convergence of our estimator is consistent with the theory. These simulations also suggest that our estimators are robust to noise in the observations, and can produce accurate predictions of trajectories in large time intervals, even when they are learned from observations in short time intervals.

相互作用する粒子またはエージェントのシステムは、物理学、化学、生物学、経済学など、多くの分野で幅広く応用されています。これらのシステムは、多くの場合未知である相互作用法則によって制御されています。観測データから相互作用法則を推定することは、エージェントの行動に関する有意義な洞察と正確な予測を提供できる基本的なタスクです。この論文では、相互作用カーネルがペアワイズ距離に依存する場合、複数の軌跡からのデータからノンパラメトリックに相互作用法則を学習するという逆問題を検討します。相互作用カーネルの学習可能性の条件を確立し、適切に正規化された最小二乗関数の最小化に基づく推定量を構築します。この推定量は、適切な$L^2$空間で、1次元ノンパラメトリック回帰の最適な最小最大速度で収束することが保証されています。このような推定量を構築するための効率的な学習アルゴリズムを提案します。このアルゴリズムは、複数の軌跡に対して並列に実装できるため、高次元のビッグデータ環境に適しています。意見ダイナミクス、捕食者と被捕食者および群集ダイナミクス、異種粒子ダイナミクスなど、さまざまな例での数値シミュレーションは、実際に使用されているモデルで学習可能性条件が満たされていること、および推定量の収束率が理論と一致していることを示唆しています。これらのシミュレーションは、推定量が観測のノイズに対して堅牢であり、短い時間間隔の観測から学習した場合でも、長い時間間隔での軌道の正確な予測を生成できることも示唆しています。

FLAME: A Fast Large-scale Almost Matching Exactly Approach to Causal Inference
FLAME:因果推論のための高速・大規模な「ほぼ完全一致」マッチング手法

A classical problem in causal inference is that of matching, where treatment units need to be matched to control units based on covariate information. In this work, we propose a method that computes high quality almost-exact matches for high-dimensional categorical datasets. This method, called FLAME (Fast Large-scale Almost Matching Exactly), learns a distance metric for matching using a hold-out training data set. In order to perform matching efficiently for large datasets, FLAME leverages techniques that are natural for query processing in the area of database management, and two implementations of FLAME are provided: the first uses SQL queries and the second uses bit-vector techniques. The algorithm starts by constructing matches of the highest quality (exact matches on all covariates), and successively eliminates variables in order to match exactly on as many variables as possible, while still maintaining interpretable high-quality matches and balance between treatment and control groups. We leverage these high quality matches to estimate conditional average treatment effects (CATEs). Our experiments show that FLAME scales to huge datasets with millions of observations where existing state-of-the-art methods fail, and that it achieves significantly better performance than other matching methods.

因果推論における古典的な問題はマッチングであり、共変量情報に基づいて治療ユニットを対照ユニットとマッチングさせる必要があります。この研究では、高次元のカテゴリデータセットに対して高品質のほぼ正確な一致を計算する方法を提案します。FLAME (Fast Large-scale Almost Matching Exactly)と呼ばれるこの方法は、ホールドアウトトレーニングデータセットを使用してマッチングの距離メトリックを学習します。大規模なデータセットに対して効率的にマッチングを実行するために、FLAMEはデータベース管理の分野でクエリ処理に自然な手法を活用し、2つのFLAME実装が提供されています。1つ目はSQLクエリを使用し、2つ目はビットベクター手法を使用します。アルゴリズムは、最高品質の一致(すべての共変量で完全に一致)を構築することから開始し、解釈可能な高品質の一致と治療グループと対照グループのバランスを維持しながら、できるだけ多くの変数で正確に一致するように変数を順次排除します。これらの高品質の一致を利用して、条件付き平均治療効果(CATE)を推定します。私たちの実験では、FLAMEは、既存の最先端の方法では対応できない数百万の観測値を含む巨大なデータセットにも対応し、他のマッチング方法よりも大幅に優れたパフォーマンスを実現することが示されています。
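
The matching loop described in the abstract (exact matches on all covariates first, then successively dropping covariates) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `flame_sketch` and `group_cate`, the unit layout (`x`, `t`, `y`), and above all the fixed `drop_order` are assumptions made for illustration. FLAME itself decides which covariate to drop using a hold-out training set, which is omitted here.

```python
from collections import defaultdict


def flame_sketch(units, drop_order):
    """Match exactly on all covariates, then drop covariates one at a time.

    units: list of dicts with keys 'x' (tuple of categorical covariates),
           't' (1 = treated, 0 = control) and 'y' (outcome).
    drop_order: covariate indices to drop, in order. ASSUMPTION: fixed in
           advance here; FLAME learns this order from hold-out data.
    Returns (groups, unmatched); each group is (covariates_used, member_indices).
    """
    active = list(range(len(units[0]["x"])))  # covariates still matched on
    unmatched = list(range(len(units)))       # units not yet in any group
    to_drop = list(drop_order)
    groups = []
    while unmatched and active:
        # Bucket the remaining units by their values on the active covariates.
        buckets = defaultdict(list)
        for i in unmatched:
            buckets[tuple(units[i]["x"][j] for j in active)].append(i)
        still_unmatched = []
        for members in buckets.values():
            has_treated = any(units[i]["t"] == 1 for i in members)
            has_control = any(units[i]["t"] == 0 for i in members)
            if has_treated and has_control:
                groups.append((tuple(active), members))  # valid matched group
            else:
                still_unmatched.extend(members)
        unmatched = still_unmatched
        if not to_drop:
            break
        active.remove(to_drop.pop(0))  # relax the match by one covariate
    return groups, unmatched


def group_cate(units, members):
    """Difference of mean outcomes (treated minus control) within one group."""
    treated = [units[i]["y"] for i in members if units[i]["t"] == 1]
    control = [units[i]["y"] for i in members if units[i]["t"] == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)
```

Each matched group records which covariates it was matched on, so a group found early (more covariates) can be read as a higher-quality match than one found after several covariates were dropped.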

A Review of Robot Learning for Manipulation: Challenges, Representations, and Algorithms
操作のためのロボット学習のレビュー:課題、表現、およびアルゴリズム

A key challenge in intelligent robotics is creating robots that are capable of directly interacting with the world around them to achieve their goals. The last decade has seen substantial growth in research on the problem of robot manipulation, which aims to exploit the increasing availability of affordable robot arms and grippers to create robots capable of directly interacting with the world to achieve their goals. Learning will be central to such autonomous systems, as the real world contains too much variation for a robot to expect to have an accurate model of its environment, the objects in it, or the skills required to manipulate them, in advance. We aim to survey a representative subset of that research which uses machine learning for manipulation. We describe a formalization of the robot manipulation learning problem that synthesizes existing research into a single coherent framework and highlight the many remaining research opportunities and challenges.

知能ロボティクスにおける重要な課題は、目標を達成するために周囲の世界と直接対話できるロボットを作成することです。過去10年間で、ロボット操作の問題に関する研究が大幅に増加しており、手頃な価格のロボットアームとグリッパーの入手可能性の増加を利用して、世界と直接対話して目標を達成できるロボットを作成することを目的としています。現実世界にはバリエーションが多すぎて、ロボットがその環境、その中のオブジェクト、またはそれらを操作するために必要なスキルの正確なモデルを事前に期待できないため、学習はこのような自律システムの中心となります。私たちは、機械学習を操作に使用する研究の代表的なサブセットを調査することを目指しています。既存の研究を単一の首尾一貫したフレームワークに統合し、多くの残された研究の機会と課題を強調するロボット操作学習問題の形式化について説明します。

Single and Multiple Change-Point Detection with Differential Privacy
差分プライバシーを備えた単一および複数の変化点検出

The change-point detection problem seeks to identify distributional changes at an unknown change-point $k^*$ in a stream of data. This problem appears in many important practical settings involving personal data, including biosurveillance, fault detection, finance, signal detection, and security systems. The field of differential privacy offers data analysis tools that provide powerful worst-case privacy guarantees. We study the statistical problem of change-point detection through the lens of differential privacy. We give private algorithms for both online and offline change-point detection, analyze these algorithms theoretically, and provide empirical validation of our results.

変化点検出問題は、データストリーム内の未知の変化点$k^*$での分布変化を特定しようとします。この問題は、バイオサーベイランス、障害検出、財務、信号検出、セキュリティシステムなど、個人データが関与する多くの重要な実務環境で発生します。差分プライバシーの分野は、強力な最悪の場合のプライバシー保証を提供するデータ分析ツールを提供します。私たちは、差分プライバシーのレンズを通して変化点検出の統計的問題を研究しています。オンラインとオフラインの両方で変化点を検出するためのプライベートアルゴリズムを提供し、これらのアルゴリズムを理論的に分析し、結果を経験的に検証します。
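
To make the idea of a private offline change-point estimate concrete, here is a minimal sketch for Bernoulli data with known pre-change and post-change parameters. It scores every candidate change-point by a suffix log-likelihood ratio, perturbs each score with Laplace noise, and reports the noisy maximizer. The function name `private_changepoint`, the Bernoulli model, and the noise scale `2 * A / epsilon` are assumptions for illustration; this is not a verified privacy calibration and is not taken from the paper's algorithms.

```python
import math
import random


def private_changepoint(xs, p0, p1, epsilon, rng=None):
    """Noisy-argmax estimate of the first post-change index in a 0/1 sequence.

    xs: list of 0/1 observations; p0, p1: Bernoulli parameters before and
    after the change (assumed known); epsilon: privacy parameter.
    """
    rng = rng or random.Random(0)

    def llr(x):
        # Log-likelihood ratio of one observation: post-change vs pre-change.
        return math.log(p1 / p0) if x == 1 else math.log((1 - p1) / (1 - p0))

    # A bounds |llr|, so changing one record moves any suffix sum by at most 2A.
    a = max(abs(llr(0)), abs(llr(1)))
    scale = 2 * a / epsilon  # ASSUMPTION: illustrative Laplace scale

    best_k, best_score = None, -math.inf
    suffix = 0.0
    for k in range(len(xs) - 1, -1, -1):
        suffix += llr(xs[k])  # score of "change happens at index k"
        # Difference of two Exp(1) draws is a standard Laplace sample.
        noise = scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
        if suffix + noise > best_score:
            best_k, best_score = k, suffix + noise
    return best_k
```

A smaller `epsilon` gives larger noise and therefore a less accurate estimate, which is the privacy-accuracy trade-off the abstract refers to.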

Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits
Tsallis-INF:確率的および敵対的バンディットに最適なアルゴリズム

We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime and time horizon. The algorithm is based on online mirror descent (OMD) with Tsallis entropy regularization with power $\alpha=1/2$ and reduced-variance loss estimators. More generally, we define an adversarial regime with a self-bounding constraint, which includes stochastic regime, stochastically constrained adversarial regime, and stochastic regime with adversarial corruptions as special cases, and show that the algorithm achieves logarithmic regret guarantee in this regime and all of its special cases simultaneously with the optimal regret guarantee in the adversarial regime. The algorithm also achieves adversarial and stochastic optimality in the utility-based dueling bandit setting. We provide empirical evaluation of the algorithm demonstrating that it significantly outperforms UCB1 and EXP3 in stochastic environments. We also provide examples of adversarial environments, where UCB1 and Thompson Sampling exhibit almost linear regret, whereas our algorithm suffers only logarithmic regret. To the best of our knowledge, this is the first example demonstrating vulnerability of Thompson Sampling in adversarial environments. Last but not least, we present a general stochastic analysis and a general adversarial analysis of OMD algorithms with Tsallis entropy regularization for $\alpha\in[0,1]$ and explain the reason why $\alpha=1/2$ works best.

私たちは、レジームと時間範囲に関する事前の知識なしに、敵対的および確率的多腕バンディットの両方で(定数倍の範囲で)最適な疑似リグレットを達成するアルゴリズムを導出します。このアルゴリズムは、べき指数$\alpha=1/2$のTsallisエントロピー正則化と分散低減型の損失推定量を使用したオンラインミラー降下法(OMD)に基づいています。より一般的には、自己制限制約を持つ敵対的レジーム(確率的レジーム、確率的に制約された敵対的レジーム、敵対的破損を伴う確率的レジームを特殊なケースとして含む)を定義し、アルゴリズムがこのレジームとそのすべての特殊なケースで対数リグレット保証を達成すると同時に、敵対的レジームで最適なリグレット保証を達成することを示します。このアルゴリズムは、効用ベースの決闘バンディット設定でも敵対的および確率的最適性を達成します。私たちは、このアルゴリズムが確率的環境においてUCB1およびEXP3を大幅に上回ることを示す実証的評価を提供します。また、UCB1およびThompson Samplingがほぼ線形のリグレットを示すのに対し、私たちのアルゴリズムは対数的なリグレットしか被らない敵対的環境の例も提供します。私たちの知る限り、これは敵対的環境におけるThompson Samplingの脆弱性を示す最初の例です。最後に、$\alpha\in[0,1]$のTsallisエントロピー正則化を使用したOMDアルゴリズムの一般的な確率的分析と一般的な敵対的分析を示し、$\alpha=1/2$が最もよく機能する理由を説明します。
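
As an illustration of what mirror descent with $\alpha=1/2$ Tsallis entropy looks like in code, the sketch below computes the sampling distribution $w_i = 4\,(\eta\,(\hat L_i - x))^{-2}$, where the normalizer $x$ is chosen so that the weights sum to one. Several parts are assumptions: the regularizer parameterization that yields this closed form, the bisection search for $x$, the learning rate $\eta_t = 2/\sqrt{t}$, and the plain importance-weighted loss estimator (the abstract's reduced-variance estimator is not reproduced). The names `tsallis_inf_probs` and `run_tsallis_inf` are hypothetical.

```python
import math
import random


def tsallis_inf_probs(cum_losses, eta, iters=60):
    """Arm probabilities w_i = 4 / (eta * (L_i - x))**2 with sum(w) = 1.

    The normalizer x lies below min(L); the sum of weights increases with x,
    so bisection on a bracketing interval finds it.
    """
    k = len(cum_losses)
    l_min = min(cum_losses)
    lo = l_min - 2.0 * math.sqrt(k) / eta  # every term <= 1/k, so sum <= 1
    hi = l_min - 2.0 / eta                 # smallest-loss term == 1, so sum >= 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = sum(4.0 / (eta * (l - mid)) ** 2 for l in cum_losses)
        if total > 1.0:
            hi = mid
        else:
            lo = mid
    x = 0.5 * (lo + hi)
    w = [4.0 / (eta * (l - x)) ** 2 for l in cum_losses]
    s = sum(w)
    return [wi / s for wi in w]  # renormalize away residual bisection error


def run_tsallis_inf(means, horizon, seed=0):
    """Play a Bernoulli-loss bandit; returns the number of pulls per arm."""
    rng = random.Random(seed)
    k = len(means)
    cum = [0.0] * k   # cumulative importance-weighted loss estimates
    pulls = [0] * k
    for t in range(1, horizon + 1):
        eta = 2.0 / math.sqrt(t)  # ASSUMPTION: rate proportional to 1/sqrt(t)
        probs = tsallis_inf_probs(cum, eta)
        arm = rng.choices(range(k), weights=probs)[0]
        loss = 1.0 if rng.random() < means[arm] else 0.0
        cum[arm] += loss / probs[arm]  # unbiased estimate of the arm's loss
        pulls[arm] += 1
    return pulls
```

Unlike the exponential weights used by EXP3, these probabilities decay only polynomially in the estimated loss gap, so the same update can be analyzed in both the stochastic and the adversarial setting.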

Inference In High-dimensional Single-Index Models Under Symmetric Designs
対称設計下における高次元単一インデックスモデルにおける推論

The problem of statistical inference for regression coefficients in a high-dimensional single-index model is considered. Under elliptical symmetry, the single index model can be reformulated as a proxy linear model whose regression parameter is identifiable. We construct estimates of the regression coefficients of interest that are similar to the debiased lasso estimates in the standard linear model and exhibit similar properties: $\sqrt{n}$-consistency and asymptotic normality. The procedure completely bypasses the estimation of the unknown link function, which can be extremely challenging depending on the underlying structure of the problem. Furthermore, under Gaussianity, we propose more efficient estimates of the coefficients by expanding the link function in the Hermite polynomial basis. Finally, we illustrate our approach via carefully designed simulation experiments.

高次元の単一インデックスモデルにおける回帰係数の統計的推論の問題について考察します。楕円対称性の下では、単一インデックスモデルは、回帰パラメーターが識別可能なプロキシ線形モデルとして再定式化できます。私たちは、関心のある回帰係数について、標準線形モデルにおけるバイアス除去(debiased)lasso推定値と類似し、同様の特性、すなわち$\sqrt{n}$-一致性と漸近正規性を示す推定値を構築します。この手順では、問題の根底にある構造によっては非常に困難な場合がある未知のリンク関数の推定を完全にバイパスします。さらに、ガウス性の下では、リンク関数をエルミート多項式基底で展開することにより、係数のより効率的な推定値を提案します。最後に、慎重に設計されたシミュレーション実験を通じて、私たちのアプローチを説明します。

Finite Time LTI System Identification
有限時間 LTI システムの同定

We address the problem of learning the parameters of a stable linear time invariant (LTI) system with unknown latent space dimension, or order, from a single time–series of noisy input-output data. We focus on learning the best lower order approximation allowed by finite data. Motivated by subspace algorithms in systems theory, where the doubly infinite system Hankel matrix captures both order and good lower order approximations, we construct a Hankel-like matrix from noisy finite data using ordinary least squares. This circumvents the non-convexities that arise in system identification, and allows accurate estimation of the underlying LTI system. Our results rely on careful analysis of self-normalized martingale difference terms that helps bound identification error up to logarithmic factors of the lower bound. We provide a data-dependent scheme for order selection and find an accurate realization of system parameters, corresponding to that order, by an approach that is closely related to the Ho-Kalman subspace algorithm. We demonstrate that the proposed model order selection procedure is not overly conservative, i.e., for the given data length it is not possible to estimate higher order models or find higher order approximations with reasonable accuracy.

私たちは、潜在空間の次元(次数)が不明な安定した線形時不変(LTI)システムのパラメータを、ノイズの多い入出力データの単一の時系列から学習する問題に取り組みます。私たちは、有限データで可能な最良の低次近似を学習することに焦点を当てます。二重無限のシステムハンケル行列が次数と良好な低次近似の両方を捉えるシステム理論の部分空間アルゴリズムに着想を得て、通常の最小二乗法を使用してノイズの多い有限データからハンケル型の行列を構築します。これにより、システム同定で生じる非凸性が回避され、基礎となるLTIシステムを正確に推定できます。私たちの結果は、自己正規化マルチンゲール差分項の慎重な分析に依存しており、これにより、同定誤差を下限の対数因子倍以内に抑えることができます。私たちは、次数選択のためのデータ依存スキームを提供し、Ho-Kalman部分空間アルゴリズムに密接に関連するアプローチによって、その次数に対応するシステムパラメータの正確な実現を見つけます。提案されたモデル次数選択手順は過度に保守的ではないこと、つまり、与えられたデータ長では、より高次のモデルを推定したり、妥当な精度でより高次の近似を見つけたりすることは不可能であることを示します。

Generalization Performance of Multi-pass Stochastic Gradient Descent with Convex Loss Functions
凸損失関数を用いたマルチパス確率的勾配降下法の汎化性能

Stochastic gradient descent (SGD) has become the method of choice to tackle large-scale datasets due to its low computational cost and good practical performance. Learning rate analysis, either capacity-independent or capacity-dependent, provides a unifying viewpoint to study the computational and statistical properties of SGD, as well as the implicit regularization by tuning the number of passes. Existing capacity-independent learning rates require a nontrivial bounded subgradient assumption and a smoothness assumption to be optimal. Furthermore, existing capacity-dependent learning rates are only established for the specific least squares loss with a special structure. In this paper, we provide both optimal capacity-independent and capacity-dependent learning rates for SGD with general convex loss functions. Our results require neither bounded subgradient assumptions nor smoothness assumptions, and are stated with high probability. We achieve this improvement by a refined estimate on the norm of SGD iterates based on a careful martingale analysis and concentration inequalities on empirical processes.

確率的勾配降下法(SGD)は、計算コストが低く、実用的なパフォーマンスが優れているため、大規模なデータセットに取り組むための選択肢となっています。容量に依存しない、または容量に依存する学習率分析は、SGDの計算特性と統計特性、およびパス数の調整による暗黙の正則化を研究するための統一的な視点を提供します。既存の容量に依存しない学習率を最適にするには、自明でない有界劣勾配仮定と平滑性仮定が必要です。さらに、既存の容量に依存する学習率は、特別な構造を持つ特定の最小二乗損失に対してのみ確立されています。この論文では、一般的な凸損失関数を持つSGDの最適な容量に依存しない学習率と容量に依存する学習率の両方を提供します。私たちの結果は、有界劣勾配仮定も平滑性仮定も必要とせず、高い確率で述べられています。私たちは、慎重なマルチンゲール分析と経験的プロセスの集中不等式に基づいて、SGD反復のノルムに関する洗練された推定値によってこの改善を達成しました。

Entangled Kernels – Beyond Separability
絡み合ったカーネル – 分離可能性を超えて

We consider the problem of operator-valued kernel learning and investigate the possibility of going beyond the well-known separable kernels. Borrowing tools and concepts from the field of quantum computing, such as partial trace and entanglement, we propose a new view on operator-valued kernels and define a general family of kernels that encompasses previously known operator-valued kernels, including separable and transformable kernels. Within this framework, we introduce another novel class of operator-valued kernels called entangled kernels that are not separable. We propose an efficient two-step algorithm for this framework, where the entangled kernel is learned based on a novel extension of kernel alignment to operator-valued kernels. We illustrate our algorithm with an application to supervised dimensionality reduction, and demonstrate its effectiveness with both artificial and real data for multi-output regression.

私たちは、演算子値によるカーネル学習の問題を検討し、よく知られている分離可能なカーネルを超える可能性を調査します。パーシャルトレースやエンタングルメントなど、量子コンピューティングの分野からツールや概念を借りて、演算子値カーネルに関する新しい見方を提案し、分離可能カーネルや変換可能カーネルなど、以前に知られていた演算子値カーネルを包含するカーネルの一般的なファミリーを定義します。このフレームワーク内で、分離できないエンタングルドカーネルと呼ばれる演算子値カーネルの別の新しいクラスを紹介します。このフレームワークには、演算子値カーネルへのカーネルアライメントの新しい拡張に基づいて、絡み合ったカーネルを学習する効率的な2ステップアルゴリズムを提案します。教師あり次元削減への適用を使用してアルゴリズムを説明し、多出力回帰のための人工データと実データの両方でその有効性を実証します。

A Two-Level Decomposition Framework Exploiting First and Second Order Information for SVM Training Problems
SVM学習問題に対する1次情報と2次情報を利用する2レベル分解フレームワーク

In this work we present a novel way to solve the sub-problems that originate when using decomposition algorithms to train Support Vector Machines (SVMs). State-of-the-art Sequential Minimization Optimization (SMO) solvers reduce the original problem to a sequence of sub-problems of two variables for which the solution is analytical. Although considering more than two variables at a time usually results in a lower number of iterations needed to train an SVM model, solving the sub-problem becomes much harder and the overall computational gains are limited, if any. We propose to apply the two-variables decomposition method to solve the sub-problems themselves and experimentally show that it is a viable and efficient way to deal with sub-problems of up to 50 variables. As a second contribution we explore different ways to select the working set and its size, combining first-order and second-order working set selection rules together with a strategy for exploiting cached elements of the Hessian matrix. An extensive numerical comparison shows that the method performs considerably better than state-of-the-art software.

この研究では、分解アルゴリズムを使用してサポートベクターマシン(SVM)をトレーニングする際に生じるサブ問題を解決する新しい方法を紹介します。最先端の逐次最小化最適化(SMO)ソルバーは、元の問題を、解析的に解決できる2変数のサブ問題のシーケンスに縮小します。一度に3つ以上の変数を考慮すると、通常、SVMモデルのトレーニングに必要な反復回数は少なくなりますが、サブ問題の解決ははるかに困難になり、全体的な計算ゲインは限られます(ある場合)。2変数分解法を適用してサブ問題自体を解決することを提案し、これが最大50変数のサブ問題に対処する実行可能で効率的な方法であることを実験的に示します。2つ目の貢献として、ワーキングセットとそのサイズを選択するさまざまな方法を検討し、1次および2次ワーキングセット選択ルールと、ヘッセ行列のキャッシュされた要素を活用する戦略を組み合わせます。広範囲にわたる数値比較により、この方法は最先端のソフトウェアよりも大幅に優れたパフォーマンスを発揮することが示されています。

When random initializations help: a study of variational inference for community detection
ランダム初期化が役立つ場合:コミュニティ検出のための変分推論の研究

Variational approximation has been widely used in large-scale Bayesian inference recently, the simplest kind of which involves imposing a mean field assumption to approximate complicated latent structures. Despite the computational scalability of mean field, theoretical studies of its loss function surface and the convergence behavior of iterative updates for optimizing the loss are far from complete. In this paper, we focus on the problem of community detection for a simple two-class Stochastic Blockmodel (SBM) with equal class sizes. Using batch co-ordinate ascent (BCAVI) for updates, we show different convergence behavior with respect to different initializations. When the parameters are known or estimated within a reasonable range and held fixed, we characterize conditions under which an initialization can converge to the ground truth. On the other hand, when the parameters need to be estimated iteratively, a random initialization will converge to an uninformative local optimum.

変分近似は、近年、大規模なベイズ推論で広く使用されており、最も単純な種類は、複雑な潜在構造を近似するために平均場の仮定を課すことを含みます。平均場の計算スケーラビリティにもかかわらず、その損失関数表面の理論的研究と、損失を最適化するための反復更新の収束挙動は完全ではありません。この論文では、同じクラスサイズを持つ単純な2クラス確率ブロックモデル(SBM)のコミュニティ検出の問題に焦点を当てます。更新にBatch Coordinate Ascent (BCAVI)を使用すると、初期化ごとに異なる収束動作を示します。パラメータが既知であるか、妥当な範囲内で推定され、固定されている場合、初期化がグラウンドトゥルースに収束する条件を特徴付けます。一方、パラメーターを反復的に推定する必要がある場合、ランダムな初期化は情報量の少ない局所的な最適値に収束します。

A Fast Globally Linearly Convergent Algorithm for the Computation of Wasserstein Barycenters
ワッサーシュタイン重心の計算のための高速大域線形収束アルゴリズム

We consider the problem of computing a Wasserstein barycenter for a set of discrete probability distributions with finite supports, which finds many applications in areas such as statistics, machine learning and image processing. When the support points of the barycenter are pre-specified, this problem can be modeled as a linear programming (LP) problem whose size can be extremely large. To handle this large-scale LP, we analyse the structure of its dual problem, which is conceivably more tractable and can be reformulated as a well-structured convex problem with 3 kinds of block variables and a coupling linear equality constraint. We then adapt a symmetric Gauss-Seidel based alternating direction method of multipliers (sGS-ADMM) to solve the resulting dual problem and establish its global convergence and global linear convergence rate. As a critical component for efficient computation, we also show how all the subproblems involved can be solved exactly and efficiently. This makes our method suitable for computing a Wasserstein barycenter on a large-scale data set, without introducing an entropy regularization term as is commonly practiced. In addition, our sGS-ADMM can be used as a subroutine in an alternating minimization method to compute a barycenter when its support points are not pre-specified. Numerical results on synthetic data sets and image data sets demonstrate that our method is highly competitive for solving large-scale Wasserstein barycenter problems, in comparison to two existing representative methods and the commercial software Gurobi.

私たちは、統計、機械学習、画像処理などの分野で多くの応用が見出される、有限のサポートを持つ離散確率分布のセットのワッサーシュタイン重心を計算する問題を考察します。重心のサポートポイントが事前に指定されている場合、この問題は、サイズが非常に大きくなる可能性がある線形計画(LP)問題としてモデル化できます。この大規模なLPを処理するために、その双対問題の構造を分析します。これはおそらく扱いやすく、3種類のブロック変数と結合線形等式制約を持つ適切に構造化された凸問題として再定式化できます。次に、対称ガウスザイデルベースの交互方向乗数法(sGS-ADMM)を適用して、結果として生じる双対問題を解決し、そのグローバル収束とグローバル線形収束率を確立します。効率的な計算のための重要な要素として、関連するすべてのサブ問題を正確かつ効率的に解決する方法も示します。これにより、私たちの方法は、一般的に行われているエントロピー正規化項を導入することなく、大規模データセットでワッサーシュタイン重心を計算するのに適しています。さらに、私たちのsGS-ADMMは、サポートポイントが事前に指定されていない場合に重心を計算するための交互最小化法のサブルーチンとして使用できます。合成データセットと画像データセットの数値結果は、既存の2つの代表的な方法と商用ソフトウェアGurobiと比較して、私たちの方法が大規模なワッサーシュタイン重心問題を解決するのに非常に優れていることを示しています。

Aggregated Hold-Out
アグリゲート・ホールドアウト

Aggregated hold-out (agghoo) is a method which averages learning rules selected by hold-out (that is, cross-validation with a single split). We provide the first theoretical guarantees on agghoo, ensuring that it can be used safely: Agghoo performs at worst like the hold-out when the risk is convex. The same holds true in classification with the 0–1 risk, with an additional constant factor. For the hold-out, oracle inequalities are known for bounded losses, as in binary classification. We show that similar results can be proved, under appropriate assumptions, for other risk-minimization problems. In particular, we obtain an oracle inequality for regularized kernel regression with a Lipschitz loss, without requiring that the $Y$ variable or the regressors be bounded. Numerical experiments show that aggregation brings a significant improvement over the hold-out and that agghoo is competitive with cross-validation.

アグリゲート・ホールドアウト(agghoo)は、ホールドアウト(つまり、1回の分割による交差検証)によって選択された学習ルールを平均化する手法です。私たちは、アグフーについて最初の理論的な保証を提供し、それが安全に使用できることを保証します:アグフーは、リスクが凸状のときにホールドアウトのように最悪のパフォーマンスを発揮します。同じことが、リスクが0–1で分類され、追加の定数係数がある場合にも当てはまります。ホールドアウトの場合、オラクル不等式は、二項分類のように、限定損失で知られています。適切な仮定の下で、他のリスク最小化問題についても同様の結果が証明できることを示します。特に、Lipschitz損失を伴う正則化カーネル回帰のオラクル不等式を取得しますが、$Y$変数またはリグレッサーを有界化する必要はありません。数値実験では、アグリゲーションがホールドアウトよりも大幅に改善され、アグフーが交差検証と競合することが示されています。
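The aggregation idea in this abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes ridge regression as the family of learning rules, squared error as the hold-out risk, and random 80/20 splits. The names `ridge_fit` and `agghoo_ridge` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression coefficients (illustrative learning rule)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def agghoo_ridge(X, y, lambdas, n_splits=10, train_frac=0.8, seed=0):
    """Average the predictors selected by hold-out over several random splits."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_train = int(train_frac * n)
    selected = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        tr, va = perm[:n_train], perm[n_train:]
        # Hold-out: fit every candidate on the training part, score on the validation part.
        fits = [ridge_fit(X[tr], y[tr], lam) for lam in lambdas]
        errs = [np.mean((X[va] @ w - y[va]) ** 2) for w in fits]
        selected.append(fits[int(np.argmin(errs))])
    # Aggregation: for linear predictors, averaging predictions equals averaging coefficients.
    return np.mean(selected, axis=0)

# Synthetic data (an assumption made only for this demonstration).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=200)
w_hat = agghoo_ridge(X, y, lambdas=[0.01, 0.1, 1.0, 10.0])
```

Each split may select a different regularization level. Averaging the selected predictors is what distinguishes agghoo from plain hold-out, which keeps only one of them.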

Ranking and synchronization from pairwise measurements via SVD
SVDによるペアワイズ測定からのランキングと同期

Given a measurement graph $G = (V,E)$ and an unknown signal $r \in \mathbb{R}^n$, we investigate algorithms for recovering $r$ from pairwise measurements of the form $r_i - r_j$, $\{i,j\} \in E$. This problem arises in a variety of applications, such as ranking teams in sports data and time synchronization of distributed networks. Framed in the context of ranking, the task is to recover the ranking of $n$ teams (induced by $r$) given a small subset of noisy pairwise rank offsets. We propose a simple SVD-based algorithmic pipeline for both the problem of time synchronization and ranking. We provide a detailed theoretical analysis in terms of robustness against both sampling sparsity and noise perturbations with outliers, using results from matrix perturbation and random matrix theory. Our theoretical findings are complemented by a detailed set of numerical experiments on both synthetic and real data, showcasing the competitiveness of our proposed algorithms with other state-of-the-art methods.

測定グラフ$G= (V,E)$と未知の信号$r \in \mathbb{R}^n$が与えられた場合、形式$r_i – r_j$; $\{i,j\} \in E$のペアワイズ測定から$r$を回復するアルゴリズムを調査します。この問題は、スポーツデータでのチームのランキングや分散ネットワークの時間同期など、さまざまなアプリケーションで発生します。ランキングのコンテキストでフレーム化されると、タスクは、ノイズの多いペアワイズランクオフセットの小さなサブセットが与えられた場合に、$n$チームのランキング($r$によって誘導)を回復することです。時間同期とランキングの両方の問題に対して、単純なSVDベースのアルゴリズムパイプラインを提案します。行列摂動法とランダム行列理論の結果を使用して、サンプリングスパース性と外れ値によるノイズ摂動の両方に対する堅牢性に関して詳細な理論的分析を提供します。私たちの理論的発見は、合成データと実際のデータの両方に対する詳細な数値実験によって補完され、私たちが提案するアルゴリズムが他の最先端の方法と競争力があることを示しています。
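To make the SVD idea concrete, here is a simplified sketch under strong assumptions: all pairs are measured and there is no noise, so the offset matrix $H$ with $H_{ij} = r_i - r_j$ has rank two and its column space is spanned by $r$ and the all-ones vector. The function name `svd_rank_scores` and the sign-fixing rule are choices made for this illustration and may differ from the pipeline analysed in the paper.

```python
import numpy as np

def svd_rank_scores(H):
    """Estimate centred, unit-norm scores from a skew-symmetric matrix H[i, j] ~ r_i - r_j."""
    n = H.shape[0]
    U, _, _ = np.linalg.svd(H)
    U2 = U[:, :2]                      # without noise, these span {r, all-ones}
    e = np.ones(n) / np.sqrt(n)
    c = U2.T @ e                       # coordinates of the all-ones direction in that plane
    w = U2 @ np.array([-c[1], c[0]])   # direction in the plane orthogonal to all-ones
    # Fix the global sign so that the implied offsets agree with the measurements.
    if np.sum(H * (w[:, None] - w[None, :])) < 0:
        w = -w
    return w

r = np.array([3.0, -1.0, 0.5, 2.0, -2.5])   # hypothetical ground-truth strengths
H = r[:, None] - r[None, :]                 # complete, noiseless pairwise offsets
scores = svd_rank_scores(H)
ranking = np.argsort(-scores)               # indices ordered from strongest to weakest
```

The recovered scores are determined only up to a shift and a positive scaling, which is enough to induce the ranking. With missing or noisy entries the top-2 singular subspace is only approximately the span of $r$ and the all-ones vector, which is where the perturbation analysis mentioned in the abstract comes in.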

A Unified Sample Selection Framework for Output Noise Filtering: An Error-Bound Perspective
出力ノイズフィルタリングのための統一サンプル選択フレームワーク:エラー制約の視点

Output noise makes supervised learning harder. Noise filtering, which aims to detect and remove polluted samples, is one of the main ways to deal with noise on outputs. However, most filters are heuristic and cannot explain how filtering affects the generalization error (GE) bound. The hyper-parameters of various filters are specified manually or empirically, and they usually cannot adapt to the data environment. A filter with an improper hyper-parameter may overclean, leading to weak generalization ability. This paper proposes a unified framework of optimal sample selection (OSS) for output noise filtering from the perspective of the error bound. The covering distance filter (CDF) under the framework is presented to deal with noisy outputs in regression and ordinal classification problems. First, two necessary and sufficient conditions for a fixed goodness of fit in regression are deduced from the perspective of the GE bound. They provide the unified theoretical framework for determining the filtering effectiveness and optimizing the number of removed samples. The optimal sample size adapts to changes in the sample size, the noise ratio, and the noise variance. It offers a way of tuning the hyper-parameter and can prevent filters from overcleaning. Meanwhile, the OSS framework can be integrated with any noise estimator to produce a new filter. Then the covering interval is proposed to separate low-noise and high-noise samples, and its effectiveness is proved in regression. The covering distance is introduced as an unbiased estimator of high noise. Further, the CDF algorithm is designed by integrating the covering distance with the OSS framework. Finally, it is verified that the CDF not only recognizes noisy labels correctly but also reduces the prediction errors on a real apparent-age data set. Experimental results on benchmark regression and ordinal classification data sets demonstrate that the CDF outperforms the state-of-the-art filters in terms of prediction ability, noise recognition, and efficiency.

出力ノイズの存在は、教師あり学習に困難をもたらします。汚染されたサンプルを検出して除去することを目的としたノイズフィルタリングは、出力のノイズに対処する主な方法の1つです。しかし、ほとんどのフィルタはヒューリスティックであり、フィルタリングが一般化誤差（GE）境界に与える影響を説明できませんでした。さまざまなフィルタのハイパーパラメータは手動または経験的に指定され、通常はデータ環境に適応できません。不適切なハイパーパラメータを持つフィルタは過剰にクリーンアップし、一般化能力が弱くなる可能性があります。この論文では、誤差境界の観点から、出力ノイズフィルタリングのための最適サンプル選択（OSS）の統一フレームワークを提案します。このフレームワークの下での被覆距離フィルタ（CDF）は、回帰および順序分類問題におけるノイズの多い出力に対処するために提示されます。まず、回帰における適合度が固定されるための2つの必要かつ十分な条件が、GE境界の観点から推定されます。これらは、フィルタリングの有効性を決定し、除去されるサンプルのサイズを最適化するための統一された理論的フレームワークを提供します。最適なサンプルサイズは、サンプルサイズ、ノイズ比、ノイズ分散の環境変化に適応できます。ハイパーパラメータを調整する選択肢を提供し、フィルターの過剰洗浄を防ぐことができます。一方、OSSフレームワークは任意のノイズ推定器と統合して新しいフィルターを生成できます。次に、低ノイズサンプルと高ノイズサンプルを分離するためのカバー間隔が提案され、回帰で有効性が証明されています。カバー距離は、高ノイズの不偏推定器として導入されています。さらに、CDFアルゴリズムは、カバー距離をOSSフレームワークと統合して設計されています。最後に、CDFはノイズラベルを正しく認識するだけでなく、実際の見かけの年齢データセットの予測誤差を減らすことが検証されています。ベンチマーク回帰および順序分類データセットでの実験結果は、CDFが予測能力、ノイズ認識、および効率の点で最先端のフィルターよりも優れていることを示しています。

Continuous Time Analysis of Momentum Methods
モメンタム法の連続時間解析

Gradient descent-based optimization methods underpin the parameter training of neural networks, and hence comprise a significant component in the impressive test results found in a number of applications. Introducing stochasticity is key to their success in practical problems, and there is some understanding of the role of stochastic gradient descent in this context. Momentum modifications of gradient descent, such as Polyak’s Heavy Ball method (HB) and Nesterov’s method of accelerated gradients (NAG), are also widely adopted. In this work our focus is on understanding the role of momentum in the training of neural networks, concentrating on the common situation in which the momentum contribution is fixed at each step of the algorithm. To expose the ideas simply we work in the deterministic setting. Our approach is to derive continuous time approximations of the discrete algorithms; these continuous time approximations provide insights into the mechanisms at play within the discrete algorithms. We prove three such approximations. First, we show that standard implementations of fixed momentum methods approximate a time-rescaled gradient descent flow, asymptotically as the learning rate shrinks to zero; this result does not distinguish momentum methods from pure gradient descent, in the limit of vanishing learning rate. We then proceed to prove two results aimed at understanding the observed practical advantages of fixed momentum methods over gradient descent, when implemented in the non-asymptotic regime with fixed small, but non-zero, learning rate. We achieve this by proving approximations to continuous time limits in which the small but fixed learning rate appears as a parameter; this is known as the method of modified equations in the numerical analysis literature, recently rediscovered as the high resolution ODE approximation in the machine learning context. In our second result we show that the momentum method is approximated by a continuous time gradient flow, with an additional momentum-dependent second order time-derivative correction, proportional to the learning rate; this may be used to explain the stabilizing effect of momentum algorithms in their transient phase. Furthermore, in a third result we show that the momentum methods admit an exponentially attractive invariant manifold on which the dynamics reduces, approximately, to a gradient flow with respect to a modified loss function, equal to the original loss function plus a small perturbation proportional to the learning rate; this small correction provides convexification of the loss function and encodes additional robustness present in momentum methods, beyond the transient phase.

勾配降下法に基づく最適化手法は、ニューラルネットワークのパラメータトレーニングの基盤となるため、多くのアプリケーションで見られる印象的なテスト結果の重要な要素を構成しています。確率性を導入することが、実際の問題で成功するための鍵であり、このコンテキストでの確率的勾配降下法の役割についてはある程度理解されています。PolyakのHeavy Ball法(HB)やNesterovの加速勾配法(NAG)などの勾配降下法の運動量修正も広く採用されています。この研究では、ニューラルネットワークのトレーニングにおける運動量の役割を理解することに焦点を当て、アルゴリズムの各ステップで運動量の寄与が固定されている一般的な状況に集中しています。アイデアを簡単に説明するために、決定論的な設定で作業します。私たちのアプローチは、離散アルゴリズムの連続時間近似を導出することです。これらの連続時間近似は、離散アルゴリズム内で機能しているメカニズムに関する洞察を提供します。私たちは、そのような近似を3つ証明します。まず、固定モーメンタム法の標準的な実装は、学習率がゼロに近づくにつれて漸近的に時間再スケールされた勾配降下フローに近似することを示します。この結果は、学習率がゼロになる極限では、モーメンタム法と純粋な勾配降下法を区別するものではありません。次に、固定された小さいがゼロではない学習率を伴う非漸近的領域で実装された場合、固定モーメンタム法が勾配降下法よりも実用的に優れていることを理解することを目的とした2つの結果を証明します。これは、小さいが固定された学習率がパラメーターとして現れる連続時間制限への近似を証明することによって達成されます。これは、数値解析の文献では修正方程式法として知られており、機械学習のコンテキストで高解像度ODE近似として最近再発見されました。2番目の結果では、モーメンタム法が、学習率に比例する追加のモーメンタム依存の2次時間微分補正を伴う連続時間勾配フローによって近似されることを示します。これは、過渡期におけるモーメンタムアルゴリズムの安定化効果を説明するために使用できます。さらに、3番目の結果では、運動量法が指数的に魅力的な不変多様体を許容し、その上でダイナミクスが、元の損失関数に学習率に比例する小さな摂動を加えたものに等しい修正損失関数に関する勾配フローに近似的に減少することを示します。この小さな修正により、損失関数が凸状になり、過渡段階を超えて運動量法に存在する追加の堅牢性がエンコードされます。
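The first result (fixed-momentum methods behave like a time-rescaled gradient flow when the learning rate is small) can be checked numerically on a toy problem. The sketch below assumes the common Heavy Ball form $v_{n+1} = \lambda v_n - h \nabla f(x_n)$, $x_{n+1} = x_n + v_{n+1}$, the quadratic loss $f(x) = x^2/2$, and the usual rescaling factor $1/(1-\lambda)$. These choices are illustrative assumptions, not details taken from the paper.

```python
import math

def heavy_ball(grad, x0, h, lam, n_steps):
    """Heavy Ball iteration with fixed momentum lam and learning rate h."""
    x, v = float(x0), 0.0
    for _ in range(n_steps):
        v = lam * v - h * grad(x)
        x = x + v
    return x

def gradient_descent(grad, x0, h, n_steps):
    """Plain gradient descent with learning rate h."""
    x = float(x0)
    for _ in range(n_steps):
        x = x - h * grad(x)
    return x

def grad(x):
    return x  # gradient of f(x) = x**2 / 2

h, lam, n_steps = 1e-4, 0.9, 1000
x_hb = heavy_ball(grad, 1.0, h, lam, n_steps)
# Gradient descent with the rescaled learning rate h / (1 - lam).
x_gd = gradient_descent(grad, 1.0, h / (1 - lam), n_steps)
# Exact gradient flow x(t) = exp(-t) evaluated at the rescaled time t = n * h / (1 - lam).
x_flow = math.exp(-n_steps * h / (1 - lam))
```

For these settings the three values nearly coincide, which illustrates why the vanishing-learning-rate limit alone cannot separate momentum from plain gradient descent. The differences the paper studies appear at the next order in the learning rate.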

Pykg2vec: A Python Library for Knowledge Graph Embedding
Pykg2vec: ナレッジグラフの埋め込みのためのPythonライブラリ

Pykg2vec is a Python library for learning the representations of the entities and relations in knowledge graphs. Pykg2vec’s flexible and modular software architecture currently implements 25 state-of-the-art knowledge graph embedding algorithms, and is designed to easily incorporate new algorithms. The goal of pykg2vec is to provide a practical and educational platform to accelerate research in knowledge graph representation learning. Pykg2vec is built on top of PyTorch and Python’s multiprocessing framework and provides modules for batch generation, Bayesian hyperparameter optimization, evaluation of KGE tasks, embedding, and result visualization. Pykg2vec is released under the MIT License and is also available in the Python Package Index (PyPI). The source code of pykg2vec is available at https://github.com/Sujit-O/pykg2vec.

Pykg2vecは、ナレッジグラフのエンティティとリレーションの表現を学習するためのPythonライブラリです。Pykg2vecの柔軟でモジュール式のソフトウェアアーキテクチャは、現在25の最先端のナレッジグラフ埋め込みアルゴリズムを実装しており、新しいアルゴリズムを簡単に組み込めるように設計されています。pykg2vecの目標は、知識グラフ表現学習の研究を加速するための実用的で教育的なプラットフォームを提供することです。Pykg2vecは、PyTorchとPythonのマルチプロセッシングフレームワークの上に構築されており、バッチ生成、ベイジアンハイパーパラメータ最適化、KGEタスクの評価、埋め込み、および結果の視覚化のためのモジュールを提供します。Pykg2vecはMITライセンスの下でリリースされており、Python Package Index (PyPI)でも利用可能です。pykg2vecのソースコードはhttps://github.com/Sujit-O/pykg2vecから入手できます。

Simple and Fast Algorithms for Interactive Machine Learning with Random Counter-examples
ランダム反例を用いた対話型機械学習のためのシンプルで高速なアルゴリズム

This work describes simple and efficient algorithms for interactively learning non-binary concepts in the learning from random counter-examples (LRC) model. Here, learning takes place from random counter-examples that the learner receives in response to their proper equivalence queries, and the learning time is the number of counter-examples needed by the learner to identify the target concept. Such learning is particularly suited for online ranking, classification, clustering, etc., where machine learning models must be used before they are fully trained. We provide two simple LRC algorithms, deterministic and randomized, for exactly learning concepts from any concept class $H$. We show that both these algorithms have an $\mathcal{O}(\log |H|)$ asymptotically optimal average learning time. This solves an open problem on the existence of an efficient LRC randomized algorithm while also simplifying previous results and improving their computational efficiency. We also show that the expected learning time of any arbitrary LRC algorithm can be upper bounded by $\mathcal{O}(\frac{1}{\epsilon}\log{\frac{|H|}{\delta}})$, where $\epsilon$ and $\delta$ are the allowed learning error and failure probability respectively. This shows that LRC interactive learning is at least as efficient as non-interactive Probably Approximately Correct (PAC) learning. Our simulations also show that these algorithms outperform their theoretical bounds.

この研究では、ランダム反例からの学習(LRC)モデルで非バイナリ概念を対話的に学習するためのシンプルで効率的なアルゴリズムについて説明します。ここで、学習は、学習者が適切な同値クエリへの応答として受け取るランダム反例から行われ、学習時間は学習者が対象概念を識別するために必要な反例の数です。このような学習は、機械学習モデルを完全にトレーニングする前に使用する必要があるオンラインランキング、分類、クラスタリングなどに特に適しています。任意の概念クラス$H$から概念を正確に学習するための、決定論的およびランダム化の2つのシンプルなLRCアルゴリズムを提供します。これらのアルゴリズムの両方が、漸近的に最適な平均学習時間$\mathcal{O}(\log{}|H|)$を持つことを示します。これにより、効率的なLRCランダム化アルゴリズムの存在に関する未解決の問題が解決され、以前の結果が簡素化され、計算効率が向上します。また、任意のLRCアルゴリズムの予想学習時間は、$\mathcal{O}(\frac{1}{\epsilon}\log{\frac{|H|}{\delta}})$で上限が定められることも示しています。ここで、$\epsilon$と$\delta$はそれぞれ許容される学習誤差と失敗確率です。これは、LRC対話型学習が、少なくとも非対話型のおそらくほぼ正しい(PAC)学習と同程度に効率的であることを示しています。また、シミュレーションでは、これらのアルゴリズムが理論上の限界を上回ることも示されています。

On Multi-Armed Bandit Designs for Dose-Finding Trials
用量設定試験のための多腕バンディット設計について

We study the problem of finding the optimal dosage in early stage clinical trials through the multi-armed bandit lens. We advocate the use of the Thompson Sampling principle, a flexible algorithm that can accommodate different types of monotonicity assumptions on the toxicity and efficacy of the doses. For the simplest version of Thompson Sampling, based on a uniform prior distribution for each dose, we provide finite-time upper bounds on the number of sub-optimal dose selections, which is unprecedented for dose-finding algorithms. Through a large simulation study, we then show that variants of Thompson Sampling based on more sophisticated prior distributions outperform state-of-the-art dose identification algorithms in different types of dose-finding studies that occur in phase I or phase I/II trials.

私たちは、マルチアームバンディットレンズを通じて、初期段階の臨床試験で最適な投与量を見つけるという問題を研究しています。私たちは、用量の毒性と有効性に関するさまざまなタイプの単調性の仮定に対応できる柔軟なアルゴリズムであるトンプソンサンプリング原理の使用を提唱しています。トンプソンサンプリングの最も単純なバージョンでは、各用量の均一な事前分布に基づいて、最適でない用量選択の数に有限時間の上限を提供します。これは、用量検出アルゴリズムでは前例のないことです。次に、大規模なシミュレーション研究を通じて、より洗練された事前分布に基づくThompson Samplingのバリアントが、第I相または第I/II相試験で発生するさまざまなタイプの用量決定研究において、最先端の線量同定アルゴリズムよりも優れていることを示しています。
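As an illustration of the simplest variant mentioned in the abstract (independent uniform priors on each dose), the following sketch runs Thompson Sampling on simulated binary toxicity outcomes. It assumes the goal is to find the dose whose toxicity probability is closest to a target level. The function name `thompson_mtd`, the toxicity probabilities, and the target are hypothetical values chosen for this example.

```python
import numpy as np

def thompson_mtd(true_tox, target, horizon, seed=0):
    """Thompson Sampling with Beta(1, 1) priors for a toxicity-targeting dose search."""
    rng = np.random.default_rng(seed)
    k = len(true_tox)
    tox = np.zeros(k)    # number of toxic outcomes observed per dose
    safe = np.zeros(k)   # number of non-toxic outcomes observed per dose
    for _ in range(horizon):
        # A uniform prior is Beta(1, 1); the posterior is Beta(1 + toxic, 1 + non-toxic).
        sample = rng.beta(1 + tox, 1 + safe)
        # Give the next patient the dose whose sampled toxicity is closest to the target.
        dose = int(np.argmin(np.abs(sample - target)))
        outcome = float(rng.random() < true_tox[dose])  # simulated patient response
        tox[dose] += outcome
        safe[dose] += 1.0 - outcome
    return tox, safe

tox, safe = thompson_mtd(true_tox=[0.05, 0.15, 0.30, 0.50], target=0.30, horizon=500)
allocations = tox + safe                      # patients assigned to each dose
posterior_mean = (1 + tox) / (2 + tox + safe) # posterior mean toxicity per dose
```

The more sophisticated variants discussed in the abstract would replace the independent Beta posteriors with priors that encode monotonicity of toxicity or efficacy across doses. That step is not shown here.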

Homogeneity Structure Learning in Large-scale Panel Data with Heavy-tailed Errors
ヘビーテール誤差を伴う大規模パネルデータにおける均質構造学習

Large-scale panel data is ubiquitous in many modern data science applications. Conventional panel data analysis methods fail to address the new challenges, like individual impacts of covariates, endogeneity, embedded low-dimensional structure, and heavy-tailed errors, arising from the innovation of data collection platforms on which applications operate. In response to these challenges, this paper studies large-scale panel data with an interactive effects model. This model takes into account the individual impacts of covariates on each spatial node and removes the exogenous condition by allowing latent factors to affect both covariates and errors. Besides, we waive the sub-Gaussian assumption and allow the errors to be heavy-tailed. Further, we propose a data-driven procedure to learn a parsimonious yet flexible homogeneity structure embedded in high-dimensional individual impacts of covariates. The homogeneity structure assumes that there exists a partition of regression coefficients where the coefficients are the same within each group but different between the groups. The homogeneity structure is flexible as it contains many widely assumed low-dimensional structures (sparsity, global impact, etc.) as its special cases. Non-asymptotic properties are established to justify the proposed learning procedure. Extensive numerical experiments demonstrate the advantage of the proposed learning procedure over conventional methods especially when the data are generated from heavy-tailed distributions.

大規模パネルデータは、多くの現代のデータサイエンスアプリケーションで広く使用されています。従来のパネルデータ分析方法では、アプリケーションが動作するデータ収集プラットフォームの革新から生じる共変量の個別影響、内生性、埋め込まれた低次元構造、およびヘビーテールエラーなどの新しい課題に対処できません。これらの課題に対応するために、この論文では、インタラクティブ効果モデルを使用して大規模パネルデータを研究します。このモデルは、各空間ノードに対する共変量の個別影響を考慮し、潜在因子が共変量とエラーの両方に影響を与えることを許可することで外生条件を排除します。さらに、サブガウス仮定を放棄し、エラーがヘビーテールになることを許可します。さらに、共変量の高次元の個別影響に埋め込まれた簡潔でありながら柔軟な均質性構造を学習するためのデータ駆動型手順を提案します。均質性構造は、各グループ内で係数が同じであるがグループ間で異なる回帰係数の分割が存在することを前提としています。同次構造は、広く想定されている低次元構造（スパース性、グローバル影響など）を特殊なケースとして多く含むため、柔軟性があります。提案された学習手順を正当化するために、非漸近特性が確立されています。広範囲にわたる数値実験により、特にデータがヘビーテール分布から生成される場合、従来の方法よりも提案された学習手順が優れていることが実証されています。

Global and Quadratic Convergence of Newton Hard-Thresholding Pursuit
ニュートン硬閾値追跡の全体的および二次収束

Algorithms based on the hard thresholding principle have been well studied, with sound theoretical guarantees, in compressed sensing and more general sparsity-constrained optimization. It is widely observed in existing empirical studies that when a restricted Newton step is used (as the debiasing step), hard-thresholding algorithms tend to meet halting conditions in a significantly lower number of iterations and are very efficient. Hence, the Newton hard-thresholding algorithms obtained in this way call for stronger theoretical guarantees than their simple hard-thresholding counterparts. This paper provides a theoretical justification for the use of the restricted Newton step. We build our theory and algorithm, Newton Hard-Thresholding Pursuit (NHTP), for sparsity-constrained optimization. Our main result shows that NHTP is quadratically convergent under the standard assumption of restricted strong convexity and smoothness. We also establish its global convergence to a stationary point under a weaker assumption. In the special case of compressed sensing, NHTP effectively reduces to some of the existing hard-thresholding algorithms with a Newton step. Consequently, our fast convergence result justifies why those algorithms perform better than without the Newton step. The efficiency of NHTP was demonstrated on both synthetic and real data in compressed sensing and sparse logistic regression.

ハードしきい値原理に基づくアルゴリズムは、圧縮センシングやより一般的なスパース制約最適化において、十分な理論的保証を伴って十分に研究されてきました。既存の実証研究では、制限付きニュートンステップが（バイアス除去ステップとして）使用された場合、ハードしきい値アルゴリズムは、非常に少ない反復回数で停止条件を満たす傾向があり、非常に効率的であることが広く観察されています。したがって、このようにして得られたニュートンハードしきい値アルゴリズムは、単純なハードしきい値アルゴリズムよりも強力な理論的保証を必要とします。この論文では、制限付きニュートンステップの使用の理論的正当性を示します。私たちは、スパース制約最適化のために、理論とアルゴリズムであるニュートンハードしきい値追跡（NHTP）を構築しました。主な結果は、制限された強い凸性と滑らかさの標準的な仮定の下で、NHTPが2次収束することを示しています。また、より弱い仮定の下で、NHTPが定常点に大域的に収束することを確立しました。圧縮センシングの特殊なケースでは、NHTPはニュートンステップを使用した既存のハードしきい値アルゴリズムの一部に効果的に縮小されます。その結果、高速収束の結果は、これらのアルゴリズムがニュートンステップを使用しない場合よりもパフォーマンスが優れている理由を正当化します。NHTPの効率は、圧縮センシングとスパースロジスティック回帰の合成データと実際のデータの両方で実証されました。
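The interplay between hard thresholding and a restricted Newton (debiasing) step can be sketched for the least-squares case. This is a simplified illustration in the spirit of hard-thresholding pursuit, not the NHTP algorithm of the paper: the step size, the fixed iteration count, and the function name `hard_threshold_newton` are assumptions made for the example.

```python
import numpy as np

def hard_threshold_newton(A, b, s, n_iter=20):
    """Sparse least squares: threshold a gradient step, then debias on the support."""
    n = A.shape[1]
    x = np.zeros(n)
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # conservative gradient step size
    for _ in range(n_iter):
        g = A.T @ (A @ x - b)                # gradient of 0.5 * ||Ax - b||^2
        # Hard thresholding: keep the indices of the s largest-magnitude entries.
        support = np.argsort(np.abs(x - step * g))[-s:]
        # Restricted Newton step: for a quadratic loss, one Newton step on the
        # support is an exact least-squares solve restricted to those columns.
        x = np.zeros(n)
        x[support] = np.linalg.lstsq(A[:, support], b, rcond=None)[0]
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 100))
x_true = np.zeros(100)
x_true[[3, 40, 77]] = [2.0, -1.5, 1.0]       # hypothetical 3-sparse signal
b = A @ x_true
x_hat = hard_threshold_newton(A, b, s=3)
```

For general losses such as sparse logistic regression the restricted Newton step is no longer an exact solve, and the restricted Hessian enters explicitly.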

Unfolding-Model-Based Visualization: Theory, Method and Applications
アンフォールディングモデルベースの視覚化:理論、方法、応用

Multidimensional unfolding methods are widely used for visualizing item response data. Such methods project respondents and items simultaneously onto a low-dimensional Euclidean space, in which respondents and items are represented by ideal points, with person-person, item-item, and person-item similarities being captured by the Euclidean distances between the points. In this paper, we study the visualization of multidimensional unfolding from a statistical perspective. We cast multidimensional unfolding into an estimation problem, where the respondent and item ideal points are treated as parameters to be estimated. An estimator is then proposed for the simultaneous estimation of these parameters. Asymptotic theory is provided for the recovery of the ideal points, shedding light on the validity of model-based visualization. An alternating projected gradient descent algorithm is proposed for the parameter estimation. We provide two illustrative examples, one on users’ movie ratings and the other on senate roll call voting.

多次元展開法は、項目応答データの視覚化に広く使用されています。このような方法では、回答者と項目を低次元ユークリッド空間に同時に投影します。この空間では、回答者と項目が理想点で表され、人対人、項目対項目、および人対項目の類似性は、点間のユークリッド距離によって表されます。この論文では、統計的観点から多次元展開の視覚化を検討します。多次元展開を推定問題に当てはめ、回答者と項目の理想点を推定対象のパラメーターとして扱います。次に、これらのパラメーターを同時に推定するための推定量を提案します。理想点の回復のための漸近理論が提供され、モデルベースの視覚化の有効性が明らかになります。パラメーター推定には、交互投影勾配降下アルゴリズムが提案されています。ここでは、ユーザーの映画評価と上院の点呼投票の2つの例を示します。

Mixing Time of Metropolis-Hastings for Bayesian Community Detection
ベイジアンコミュニティ検出のためのメトロポリス-ヘイスティングスの混合時間

We study the computational complexity of a Metropolis-Hastings algorithm for Bayesian community detection. We first establish a posterior strong consistency result for a natural prior distribution on stochastic block models under the optimal signal-to-noise ratio condition in the literature. We then give a set of conditions that guarantee rapid mixing of a simple Metropolis-Hastings algorithm. The mixing time analysis is based on a careful study of posterior ratios and a canonical path argument to control the spectral gap of the Markov chain.

私たちは、ベイジアンコミュニティ検出のためのメトロポリス-ヘイスティングスアルゴリズムの計算量を研究します。まず、文献の最適な信号対雑音比条件下で、確率的ブロックモデル上の自然な事前分布の事後強一貫性結果を確立します。次に、単純なMetropolis-Hastingsアルゴリズムの迅速な混合を保証する一連の条件を提供します。ミキシング時間解析は、事後比の慎重な研究と、マルコフ鎖のスペクトルギャップを制御するための正準パス引数に基づいています。

Convex Clustering: Model, Theoretical Guarantee and Efficient Algorithm
凸クラスタリング:モデル、理論的保証、効率的なアルゴリズム

Clustering is a fundamental problem in unsupervised learning. Popular methods like K-means may suffer from poor performance as they are prone to getting stuck in local minima. Recently, the sum-of-norms (SON) model (also known as the convex clustering model) has been proposed by Pelckmans et al. (2005), Lindsten et al. (2011) and Hocking et al. (2011). The perfect recovery properties of the convex clustering model with uniformly weighted all-pairwise-differences regularization have been proved by Zhu et al. (2014) and Panahi et al. (2017). However, no theoretical guarantee has been established for the general weighted convex clustering model, where better empirical results have been observed. On the numerical optimization side, although algorithms like the alternating direction method of multipliers (ADMM) and the alternating minimization algorithm (AMA) have been proposed to solve the convex clustering model (Chi and Lange, 2015), it remains very challenging to solve large-scale problems. In this paper, we establish sufficient conditions for the perfect recovery guarantee of the general weighted convex clustering model, which include and improve the existing theoretical results of Zhu et al. (2014) and Panahi et al. (2017) as special cases. In addition, we develop a semismooth Newton based augmented Lagrangian method for solving large-scale convex clustering problems. Extensive numerical experiments on both simulated and real data demonstrate that our algorithm is highly efficient and robust for solving large-scale problems. Moreover, the numerical results also show the superior performance and scalability of our algorithm compared to the existing first-order methods. In particular, our algorithm is able to solve a convex clustering problem with 200,000 points in $\mathbb{R}^3$ in about 6 minutes.

クラスタリングは、教師なし学習における基本的な問題です。K平均法などの一般的な方法は、局所的最小値に陥りやすいため、パフォーマンスが低下する可能性があります。最近、ノルムの合計(SON)モデル(凸クラスタリングモデルとも呼ばれる)が、Pelckmansら(2005)、Lindstenら(2011)、Hockingら(2011)によって提案されました。均一に重み付けされたすべてのペアワイズ差の正則化による凸クラスタリングモデルの完全な回復特性は、Zhuら(2014)とPanahiら(2017)によって証明されています。ただし、より優れた実験結果が観察されている一般的な重み付け凸クラスタリングモデルについては、理論的な保証は確立されていません。数値最適化の面では、交互方向乗数法(ADMM)や交互最小化アルゴリズム(AMA)などのアルゴリズムが凸クラスタリングモデルを解くために提案されていますが(Chi and Lange、2015)、大規模な問題を解決することは依然として非常に困難です。この論文では、一般的な重み付き凸クラスタリングモデルの完全な回復保証のための十分条件を確立し、(Zhuら、2014; Panahiら、2017)の既存の理論的結果を特殊なケースとして含めて改善します。さらに、大規模な凸クラスタリング問題を解決するための半滑らかなニュートンベースの拡張ラグランジュ法を開発します。シミュレーションデータと実際のデータの両方に対する広範な数値実験により、このアルゴリズムは大規模な問題を解決するのに非常に効率的で堅牢であることが実証されています。さらに、数値結果は、既存の一次方法と比較して、このアルゴリズムの優れたパフォーマンスとスケーラビリティも示しています。特に、私たちのアルゴリズムは、$\mathbb{R}^3$内の200,000ポイントの凸クラスタリング問題を約6分で解くことができます。
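For readers unfamiliar with the sum-of-norms model, the sketch below evaluates its objective in the standard weighted form $\frac{1}{2}\sum_i \|x_i - a_i\|^2 + \gamma \sum_{i<j} w_{ij}\,\|a_i - a_j\|$, where each data point $x_i$ gets its own centroid $a_i$ and points whose centroids coincide form one cluster. It only evaluates the objective and does not solve the problem. The function name `son_objective` and the toy data are assumptions made for this example.

```python
import numpy as np

def son_objective(X, A, gamma, weights=None):
    """Weighted sum-of-norms (convex clustering) objective for centroids A."""
    n = X.shape[0]
    if weights is None:
        weights = np.ones((n, n))          # uniform weights on all pairs
    fidelity = 0.5 * np.sum((X - A) ** 2)  # keeps each centroid near its data point
    penalty = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # The Euclidean norm (not squared) encourages centroids to merge exactly.
            penalty += weights[i, j] * np.linalg.norm(A[i] - A[j])
    return fidelity + gamma * penalty

X = np.array([[0.0, 0.0], [2.0, 0.0]])
A_partial = np.array([[0.5, 0.0], [1.5, 0.0]])   # centroids pulled toward each other
A_merged = np.array([[1.0, 0.0], [1.0, 0.0]])    # both points share one centroid
value_partial = son_objective(X, A_partial, gamma=1.0)  # 0.25 + 1.0 = 1.25
value_merged = son_objective(X, A_merged, gamma=1.0)    # 1.0 + 0.0 = 1.0
```

In this toy case the merged centroids achieve the lower objective at $\gamma = 1$, which shows how a large enough penalty fuses nearby points into a single cluster. The general weighted model studied in the paper corresponds to passing a non-uniform `weights` matrix.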

A Unified Framework for Random Forest Prediction Error Estimation
ランダムフォレスト予測誤差推定のための統一フレームワーク

We introduce a unified framework for random forest prediction error estimation based on a novel estimator of the conditional prediction error distribution function. Our framework enables simple plug-in estimation of key prediction uncertainty metrics, including conditional mean squared prediction errors, conditional biases, and conditional quantiles, for random forests and many variants. Our approach is especially well-adapted for prediction interval estimation; we show via simulations that our proposed prediction intervals are competitive with, and in some settings outperform, existing methods. To establish theoretical grounding for our framework, we prove pointwise uniform consistency of a more stringent version of our estimator of the conditional prediction error distribution function. The estimators introduced here are implemented in the R package forestError.

私たちは、この論文では、条件付き予測誤差分布関数の新規推定量に基づくランダムフォレスト予測誤差推定のための統一フレームワークを紹介します。当社のフレームワークは、ランダムフォレストや多くのバリアントについて、条件付き平均二乗予測誤差、条件付きバイアス、条件付き分位数など、主要な予測不確実性メトリクスのシンプルなプラグイン推定を可能にします。私たちのアプローチは、予測間隔推定に特に適しています。シミュレーションを通じて、提案された予測区間が既存の方法と競合し、一部の設定では既存の方法よりも優れていることを示します。フレームワークの理論的根拠を確立するために、条件付き予測誤差分布関数の推定量のより厳密なバージョンの点ごとの一様一貫性を証明します。ここで紹介する推定器は、RパッケージのforestErrorに実装されています。

Preference-based Online Learning with Dueling Bandits: A Survey
Dueling Bandits による嗜好ベースのオンライン学習: 調査

In machine learning, the notion of multi-armed bandits refers to a class of online learning problems, in which an agent is supposed to simultaneously explore and exploit a given set of choice alternatives in the course of a sequential decision process. In the standard setting, the agent learns from stochastic feedback in the form of real-valued rewards. In many applications, however, numerical reward signals are not readily available—instead, only weaker information is provided, in particular relative preferences in the form of qualitative comparisons between pairs of alternatives. This observation has motivated the study of variants of the multi-armed bandit problem, in which more general representations are used both for the type of feedback to learn from and the target of prediction. The aim of this paper is to provide a survey of the state of the art in this field, referred to as preference-based multi-armed bandits or dueling bandits. To this end, we provide an overview of problems that have been considered in the literature as well as methods for tackling them. Our taxonomy is mainly based on the assumptions made by these methods about the data-generating process and, related to this, the properties of the preference-based feedback.

機械学習において、マルチアームバンディットの概念は、エージェントが一連の選択肢を同時に探索し、搾取することが求められるオンライン学習問題のクラスを指します。標準的な設定では、エージェントは実数値の報酬という形で確率的フィードバックから学習します。しかし、多くのアプリケーションでは、数値的な報酬信号は容易に利用できず、代わりにペアの選択肢間の質的比較という形で相対的な好みのみが提供されます。この観察は、学習するフィードバックのタイプと予測のターゲットの両方に対して、より一般的な表現が使用されるマルチアームバンディット問題の変種の研究を促しました。本論文の目的は、好みに基づくマルチアームバンディットまたはデュエリングバンディットと呼ばれるこの分野の最先端を概観することです。この目的のために、文献で考慮されてきた問題の概要と、それに対処するための方法を提供します。我々の分類は、主にこれらの方法がデータ生成プロセスについて行った仮定と、これに関連する好みに基づくフィードバックの特性に基づいています。

Consistent estimation of small masses in feature sampling
特徴サンプリングにおける小さな質量の一貫した推定

Consider an (observable) random sample of size $n$ from an infinite population of individuals, each individual being endowed with a finite set of features from a collection of features $(F_{j})_{j\geq1}$ with unknown probabilities $(p_{j})_{j \geq 1}$, i.e., $p_{j}$ is the probability that an individual displays feature $F_{j}$. Under this feature sampling framework, in recent years there has been a growing interest in estimating the sum of the probability masses $p_{j}$ of the features observed with frequency $r\geq0$ in the sample, here denoted by $M_{n,r}$. This is the natural feature sampling counterpart of the classical problem of estimating small probabilities in the species sampling framework, where each individual is endowed with only one feature (or “species”). In this paper we study the problem of consistent estimation of the small mass $M_{n,r}$. We first show that there do not exist universally consistent estimators, in the multiplicative sense, of the missing mass $M_{n,0}$. Then, we introduce an estimator of $M_{n,r}$ and identify sufficient conditions under which the estimator is consistent. In particular, we propose a nonparametric estimator $\hat{M}_{n,r}$ of $M_{n,r}$ which has the same analytic form as the celebrated Good–Turing estimator for small probabilities, with the sole difference that the two estimators have different ranges (supports). Then, we show that $\hat{M}_{n,r}$ is strongly consistent, in the multiplicative sense, under the assumption that $(p_{j})_{j\geq1}$ has regularly varying heavy tails.

無限の個体集団からのサイズ$n$の（観測可能な）ランダムサンプルを考えます。各個体は、未知の確率$(p_{j})_{j \geq 1}$を持つ特徴のコレクション$(F_{j})_{j\geq1}$からの有限の特徴セットを持っています。すなわち、$p_{j}$は、個体が特徴$F_{j}$を示す確率です。この特徴サンプリングフレームワークの下で、近年、サンプル内で頻度$r\geq0$で観察された特徴の確率質量$p_{j}$の合計を推定することへの関心が高まっています。これは、各個体が1つの特徴（または「種」）のみを持つ古典的な種サンプリングフレームワークにおける小さな確率を推定する問題の自然な特徴サンプリングの対応物です。この論文では、小さな質量$M_{n,r}$の一貫した推定の問題を研究します。まず、欠落した質量$M_{n,0}$の普遍的に一貫した推定量は存在しないことを示します。次に、$M_{n,r}$の推定量を導入し、その推定量が一貫しているための十分条件を特定します。特に、著名な小さな確率のためのグッド–チューリング推定量と同じ解析形式を持つ$M_{n,r}$の非パラメトリック推定量$\hat{M}_{n,r}$を提案しますが、2つの推定量は異なる範囲（サポート）を持つという唯一の違いがあります。次に、$(p_{j})_{j\geq1}$が定期的に変動する重い尾を持つという仮定の下で、$\hat{M}_{n,r}$が乗法的な意味で強く一貫していることを示します。
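The abstract says the proposed estimator has the same analytic form as the Good–Turing estimator. Assuming that form is the classical one, $\hat{M}_{n,r} = \frac{r+1}{n} K_{n,r+1}$, with $K_{n,r+1}$ the number of features displayed by exactly $r+1$ individuals in the sample, a minimal sketch looks as follows. The formula is this example's assumption, not a statement checked against the paper. The name `good_turing_feature_mass` and the binary matrix `Z` are hypothetical.

```python
import numpy as np

def good_turing_feature_mass(Z, r):
    """Good-Turing-type estimate of the total mass of features seen exactly r times.

    Z is an (n individuals) x (observed features) binary matrix.
    """
    Z = np.asarray(Z)
    n = Z.shape[0]
    counts = Z.sum(axis=0)              # number of individuals displaying each feature
    k_next = np.sum(counts == r + 1)    # features observed exactly r + 1 times
    return (r + 1) * k_next / n

# Four individuals, three observed features (toy data).
Z = [[1, 0, 1],
     [1, 0, 0],
     [1, 1, 0],
     [0, 0, 0]]
missing_mass = good_turing_feature_mass(Z, r=0)   # two features seen once: 1 * 2 / 4
mass_r2 = good_turing_feature_mass(Z, r=2)        # one feature seen three times: 3 * 1 / 4
```

Because an individual can display several features at once, the target $M_{n,r}$ is a sum of probabilities that need not be bounded by one. This is consistent with the remark in the abstract that the feature-sampling estimator and its species-sampling counterpart have different ranges.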

The Decoupled Extended Kalman Filter for Dynamic Exponential-Family Factorization Models
動的指数族因数分解モデルのための分離拡張カルマンフィルタ

Motivated by the needs of online large-scale recommender systems, we specialize the decoupled extended Kalman filter to factorization models, including factorization machines, matrix and tensor factorization, and illustrate the effectiveness of the approach through numerical experiments on synthetic and real-world data. Online learning of model parameters through the decoupled extended Kalman filter makes factorization models more broadly useful by (i) allowing for more flexible observations through the entire exponential family, (ii) modeling parameter drift, and (iii) producing parameter uncertainty estimates that can enable explore/exploit and other applications. We use different parameter dynamics than the standard decoupled extended Kalman filter, allowing parameter drift while encouraging reasonable values. We also present an alternate derivation of the extended Kalman filter and decoupled extended Kalman filter that highlights the role of the Fisher information matrix in the extended Kalman filter.

オンラインの大規模レコメンダーシステムのニーズに動機づけられ、我々は因子化モデル、因子化マシン、行列およびテンソル因子化に対してデカップル拡張カルマンフィルターを専門化し、合成データおよび実世界データに関する数値実験を通じてアプローチの有効性を示します。デカップル拡張カルマンフィルターを通じたモデルパラメータのオンライン学習は、因子化モデルをより広く有用にし、(i)指数族全体を通じてより柔軟な観察を可能にし、(ii)パラメータのドリフトをモデル化し、(iii)探索/搾取やその他のアプリケーションを可能にするパラメータ不確実性の推定を生成します。私たちは、標準のデカップル拡張カルマンフィルターとは異なるパラメータダイナミクスを使用し、合理的な値を促進しながらパラメータのドリフトを許可します。また、拡張カルマンフィルターとデカップル拡張カルマンフィルターの代替導出を提示し、拡張カルマンフィルターにおけるフィッシャー情報行列の役割を強調します。

An Empirical Study of Bayesian Optimization: Acquisition Versus Partition
ベイズ最適化の実証的研究:取得と分割

Bayesian optimization (BO) is a popular framework for black-box optimization. Two classes of BO approaches have shown promising empirical performance while providing strong theoretical guarantees. The first class optimizes an acquisition function to select points, which is typically computationally expensive and can only be done approximately. The second class of algorithms uses systematic space partitioning, which is much cheaper computationally, but the selection is typically less informed. This points to a potential trade-off between the computational complexity and empirical performance of these algorithms. The current literature, however, only provides a sparse sampling of empirical comparison points, giving little insight into this trade-off. The primary contribution of this work is to conduct a comprehensive, repeatable evaluation within a common software framework, which we provide as an open-source package. Our results give strong evidence about the relative performance of these methods and reveal a consistent top performer, even when accounting for overall computation time.

ベイズ最適化（BO）は、ブラックボックス最適化のための人気のあるフレームワークです。2つのクラスのBOアプローチは、強力な理論的保証を提供しながら、有望な経験的パフォーマンスを示しています。最初のクラスは、ポイントを選択するために取得関数を最適化しますが、これは通常計算コストが高く、近似的にしか行えません。2番目のクラスのアルゴリズムは、体系的な空間分割を使用しますが、計算コストははるかに安価ですが、選択は通常あまり情報に基づいていません。これは、これらのアルゴリズムの計算複雑性と経験的パフォーマンスの間の潜在的なトレードオフを示唆しています。しかし、現在の文献は、経験的比較ポイントのスパースなサンプリングしか提供しておらず、このトレードオフについての洞察はほとんどありません。この研究の主な貢献は、共通のソフトウェアフレームワーク内で包括的で再現可能な評価を実施することであり、これをオープンソースパッケージとして提供します。我々の結果は、これらの方法の相対的なパフォーマンスについて強力な証拠を提供し、全体の計算時間を考慮に入れても一貫したトップパフォーマーを明らかにします。

Regulating Greed Over Time in Multi-Armed Bandits
多腕バンディットにおける貪欲さの経時的な調整

In retail, there are predictable yet dramatic time-dependent patterns in customer behavior, such as periodic changes in the number of visitors, or increases in customers just before major holidays. The current paradigm of multi-armed bandit analysis does not take these known patterns into account. This means that for applications in retail, where prices are fixed for periods of time, current bandit algorithms will not suffice. This work provides a remedy that takes the time-dependent patterns into account, and we show how this remedy is implemented for the UCB, $\varepsilon$-greedy, and UCB-L algorithms, and also through a new policy called the variable arm pool algorithm. In the corrected methods, exploitation (greed) is regulated over time, so that more exploitation occurs during higher reward periods, and more exploration occurs in periods of low reward. In order to understand why regret is reduced with the corrected methods, we present a set of bounds that provide insight into why we would want to exploit during periods of high reward, and discuss the impact on regret. Our proposed methods perform well in experiments, and were inspired by a high-scoring entry in the Exploration and Exploitation 3 contest using data from Yahoo! Front Page. That entry heavily used time-series methods to regulate greed over time, which was substantially more effective than other contextual bandit methods.

小売業では、来店者数の周期的な変化や主要な祝日の直前の顧客増加など、顧客行動に予測可能でありながら劇的な時間依存パターンがあります。現在の多腕バンディット分析のパラダイムは、これらの既知のパターンを考慮に入れていません。これは、価格が一定期間固定される小売業のアプリケーションにおいて、現在のバンディットアルゴリズムでは不十分であることを意味します。本研究では、時間依存パターンを考慮に入れた改善策を提供し、この改善策がUCB、$\varepsilon$-greedy、UCB-Lの各アルゴリズムにどのように実装されるか、さらに可変アームプールアルゴリズムと呼ばれる新しい方策を通じてどのように実現されるかを示します。修正された手法では、活用(貪欲さ)が時間の経過に応じて調整され、報酬の高い期間にはより多くの活用が行われ、報酬の低い期間にはより多くの探索が行われます。修正された手法でリグレットが減少する理由を理解するために、報酬の高い期間に活用を行うべき理由についての洞察を与える一連の上界を提示し、リグレットへの影響について議論します。提案手法は実験で良好な結果を示しており、Yahoo! Front Pageのデータを使用したExploration and Exploitation 3コンテストの高得点エントリーから着想を得ています。そのエントリーは、貪欲さを時間的に調整するために時系列手法を多用しており、他の文脈付きバンディット手法よりもはるかに効果的でした。
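
The idea of regulating greed over time can be illustrated with a minimal Python sketch. This is not the paper's algorithm: it assumes a known, strictly positive reward multiplier (the hypothetical `seasonal_multiplier`) and simply divides the base exploration rate of an epsilon-greedy policy by that multiplier, so that exploration shrinks in high-reward periods and grows in low-reward ones. All function names and parameter values are illustrative assumptions.

```python
import random


def exploration_rate(t, base_epsilon, multiplier):
    """Exploration probability at step t.

    Assumption: multiplier(t) > 0 is a known seasonal reward scale.
    A high multiplier lowers exploration; a low one raises it.
    """
    return min(1.0, base_epsilon / multiplier(t))


def run_regulated_epsilon_greedy(arm_means, multiplier, num_steps,
                                 base_epsilon=0.1, seed=0):
    """Epsilon-greedy on Bernoulli arms with a time-regulated epsilon."""
    rng = random.Random(seed)
    num_arms = len(arm_means)
    pulls = [0] * num_arms   # number of times each arm was played
    wins = [0.0] * num_arms  # unscaled successes observed per arm
    total_reward = 0.0
    for t in range(num_steps):
        untried = [a for a in range(num_arms) if pulls[a] == 0]
        if untried:
            arm = untried[0]  # play every arm once before estimating
        elif rng.random() < exploration_rate(t, base_epsilon, multiplier):
            arm = rng.randrange(num_arms)  # explore
        else:
            # exploit the arm with the best empirical success rate
            arm = max(range(num_arms), key=lambda a: wins[a] / pulls[a])
        success = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        wins[arm] += success
        total_reward += multiplier(t) * success  # reward scaled by the season
    return pulls, total_reward


def seasonal_multiplier(t):
    """Hypothetical pattern: five high-reward steps, then five low-reward steps."""
    return 2.0 if t % 10 < 5 else 0.5


pulls, total_reward = run_regulated_epsilon_greedy(
    [0.2, 0.8], seasonal_multiplier, num_steps=200)
```

Storing the unscaled successes keeps the per-arm estimates comparable across seasons, while the accumulated reward still reflects the seasonal scale.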

Domain Generalization by Marginal Transfer Learning
周辺転移学習によるドメイン汎化

In the problem of domain generalization (DG), there are labeled training data sets from several related prediction problems, and the goal is to make accurate predictions on future unlabeled data sets that are not known to the learner. This problem arises in several applications where data distributions fluctuate because of environmental, technical, or other sources of variation. We introduce a formal framework for DG, and argue that it can be viewed as a kind of supervised learning problem by augmenting the original feature space with the marginal distribution of feature vectors. While our framework has several connections to conventional analysis of supervised learning algorithms, several unique aspects of DG require new methods of analysis. This work lays the learning theoretic foundations of domain generalization, building on our earlier conference paper where the problem of DG was introduced. We present two formal models of data generation, corresponding notions of risk, and distribution-free generalization error analysis. By focusing our attention on kernel methods, we also provide more quantitative results and a universally consistent algorithm. An efficient implementation is provided for this algorithm, which is experimentally compared to a pooling strategy on one synthetic and three real-world data sets.

ドメイン汎化（DG）の問題では、いくつかの関連する予測問題から得られたラベル付き訓練データセットがあり、目標は、学習者にとって未知の将来のラベルなしデータセットに対して正確な予測を行うことです。この問題は、環境的、技術的、またはその他の変動要因によってデータ分布が変動するいくつかのアプリケーションで発生します。我々はDGのための形式的なフレームワークを導入し、元の特徴空間に特徴ベクトルの周辺分布を加えて拡張することで、DGを一種の教師あり学習問題とみなせることを論じます。このフレームワークは、教師あり学習アルゴリズムの従来の解析といくつかの点で関連していますが、DGに固有のいくつかの側面は新しい解析手法を必要とします。本研究は、DGの問題を導入した我々の以前の会議論文に基づき、ドメイン汎化の学習理論的基盤を築きます。我々は、データ生成の2つの形式的モデル、それに対応するリスクの概念、および分布に依存しない汎化誤差解析を提示します。カーネル法に焦点を絞ることで、より定量的な結果と普遍一致性を持つアルゴリズムも提供します。このアルゴリズムの効率的な実装を提供し、1つの合成データセットと3つの実世界データセットでプーリング戦略と実験的に比較します。
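
Augmenting the feature space with the marginal distribution can be sketched as a kernel on pairs (sample, point). The following minimal Python sketch is an illustration under stated assumptions, not the paper's implementation: each marginal is represented by its empirical kernel mean embedding, and a Gaussian kernel on the embedding distance is multiplied by a Gaussian kernel on the points. The bandwidths and all function names are hypothetical.

```python
import math


def rbf(x, y, bandwidth=1.0):
    """Gaussian kernel between two feature vectors given as sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_embedding_inner(sample_a, sample_b, bandwidth=1.0):
    """Inner product of the empirical kernel mean embeddings of two samples."""
    total = sum(rbf(a, b, bandwidth) for a in sample_a for b in sample_b)
    return total / (len(sample_a) * len(sample_b))


def marginal_kernel(sample_a, sample_b, bandwidth=1.0, outer_bandwidth=1.0):
    """Gaussian kernel on the squared distance between mean embeddings."""
    sq_dist = (mean_embedding_inner(sample_a, sample_a, bandwidth)
               + mean_embedding_inner(sample_b, sample_b, bandwidth)
               - 2.0 * mean_embedding_inner(sample_a, sample_b, bandwidth))
    # Clamp tiny negative values caused by floating-point rounding.
    return math.exp(-max(sq_dist, 0.0) / (2.0 * outer_bandwidth ** 2))


def augmented_kernel(sample_a, x, sample_b, y):
    """Product kernel on augmented inputs (marginal sample, feature vector)."""
    return marginal_kernel(sample_a, sample_b) * rbf(x, y)


# Two toy unlabeled samples standing in for the marginals of two domains.
domain_1 = [(0.0,), (1.0,)]
domain_2 = [(5.0,), (6.0,)]
same_domain = augmented_kernel(domain_1, (0.5,), domain_1, (0.5,))
cross_domain = augmented_kernel(domain_1, (0.5,), domain_2, (0.5,))
```

With such a kernel, identical points drawn from different domains are treated as less similar than identical points from the same domain, which is how the marginal distribution enters an otherwise standard kernel method.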

On the Optimality of Kernel-Embedding Based Goodness-of-Fit Tests
カーネル埋め込みに基づく適合度検定の最適性について

The reproducing kernel Hilbert space (RKHS) embedding of distributions offers a general and flexible framework for testing problems in arbitrary domains and has attracted a considerable amount of attention in recent years. To gain insights into their operating characteristics, we study here the statistical performance of such approaches within a minimax framework. Focusing on the case of goodness-of-fit tests, our analyses show that a vanilla version of the kernel embedding based test could be minimax suboptimal when considering $\chi^2$ distance as the separation metric. Hence we suggest a simple remedy by moderating the embedding. We prove that the moderated approach provides optimal tests for a wide range of deviations from the null and can also be made adaptive over a large collection of interpolation spaces. Numerical experiments are presented to further demonstrate the merits of our approach.

分布の再生核ヒルベルト空間(RKHS)埋め込みは、任意の領域における検定問題のための一般的で柔軟なフレームワークを提供し、近年かなりの注目を集めています。その動作特性についての洞察を得るために、ここではミニマックスの枠組みの中でこのようなアプローチの統計的性能を研究します。適合度検定の場合に焦点を当てると、我々の解析は、分離の尺度として$\chi^2$距離を考える場合、カーネル埋め込みに基づく検定の素朴なバージョンがミニマックスの意味で準最適になり得ることを示しています。そこで、埋め込みを調整するという簡単な改善策を提案します。調整されたアプローチは、帰無仮説からの幅広い乖離に対して最適な検定を与え、補間空間の大規模な族に対して適応的にできることを証明します。我々のアプローチの利点をさらに示すために、数値実験を提示します。
