Journal of Machine Learning Research Papers: Volume 11の論文一覧

Journal of Machine Learning Research Papers Volume 11に記載されている内容を一覧にまとめ、機械翻訳を交えて日本語化し掲載します。
Learning Non-Stationary Dynamic Bayesian Networks
非定常動的ベイジアンネットワークの学習

Learning dynamic Bayesian network structures provides a principled mechanism for identifying conditional dependencies in time-series data. An important assumption of traditional DBN structure learning is that the data are generated by a stationary process, an assumption that is not true in many important settings. In this paper, we introduce a new class of graphical model called a non-stationary dynamic Bayesian network, in which the conditional dependence structure of the underlying data-generation process is permitted to change over time. Non-stationary dynamic Bayesian networks represent a new framework for studying problems in which the structure of a network is evolving over time. Some examples of evolving networks are transcriptional regulatory networks during an organism’s development, neural pathways during learning, and traffic patterns during the day. We define the non-stationary DBN model, present an MCMC sampling algorithm for learning the structure of the model from time-series data under different assumptions, and demonstrate the effectiveness of the algorithm on both simulated and biological data.

動的ベイジアンネットワーク構造を学習すると、時系列データ内の条件付き依存関係を識別するための原理的なメカニズムが提供されます。従来のDBN構造学習の重要な前提は、データが定常プロセスによって生成されることですが、この前提は多くの重要な設定では当てはまりません。この論文では、基礎となるデータ生成プロセスの条件付き依存関係構造が時間の経過とともに変化することを許容する、非定常動的ベイジアンネットワークと呼ばれる新しいクラスのグラフィカルモデルを紹介します。非定常動的ベイジアンネットワークは、ネットワークの構造が時間の経過とともに進化する問題を研究するための新しいフレームワークです。進化するネットワークの例としては、生物の発達中の転写制御ネットワーク、学習中の神経経路、および1日の交通パターンなどがあります。非定常DBNモデルを定義し、さまざまな前提の下で時系列データからモデルの構造を学習するためのMCMCサンプリングアルゴリズムを提示し、シミュレーションデータと生物学的データの両方でアルゴリズムの有効性を示します。

PAC-Bayesian Analysis of Co-clustering and Beyond
PACベイズによる共クラスタリングとその先の分析

We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering. We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved state-of-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization.The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of tree-shaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels.We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily extends to this problem and suggests that graph clustering should optimize the trade-off between empirical data fit and the mutual information that clusters preserve on graph nodes.

私たちは、共クラスタリング、行列三因子分解、グラフィカルモデル、グラフクラスタリング、ペアワイズクラスタリングなどのクラスタリングに基づく教師あり学習モデルと教師なし学習モデルのPACベイジアン一般化境界を導出します。まず、データ行列の分析に広く使用されているアプローチである共クラスタリングの分析から始めます。行列データ分析には、データ行列の欠落エントリの識別予測と、共起行列の行変数と列変数の結合確率分布の推定という2つのタスクがあります。これら2つのタスクに対する共クラスタリングベースのソリューションの予想されるサンプル外パフォーマンスのPACベイジアン一般化境界を導出します。この分析により、共クラスタリングの以前の定式化にはなかった正則化項が得られます。境界は、共クラスタリングの予想されるパフォーマンスは、その経験的パフォーマンスと、行IDと列IDのクラスター変数によって保持される相互情報量とのトレードオフによって決まることを示唆しています。私たちは、識別予測タスクのこのトレードオフの局所最適値を見つけるための反復射影アルゴリズムを導出します。このアルゴリズムは、MovieLens協調フィルタリングタスクで最先端のパフォーマンスを達成した。我々の共クラスタリングモデルは行列三因子分解とも見ることができ、その結果は、この形式の行列因子分解の一般化境界、正則化項、および新しいアルゴリズムを提供します。共クラスタリングの分析は、高次元テンソルの分析に使用できるツリー型グラフィカルモデルに拡張されます。境界によると、ツリー型グラフィカルモデルの一般化能力は、経験的データ適合とツリーレベルに伝播される相互情報量との間のトレードオフに依存します。また、重み付きグラフクラスタリングを予測問題として定式化します。つまり、エッジ重みのサブセットが与えられた場合に、残りのエッジ重みを予測するグラフクラスタリングの能力を分析します。共クラスタリングの分析はこの問題に簡単に拡張でき、グラフクラスタリングでは経験的データの適合とクラスターがグラフノード上に保持する相互情報量との間のトレードオフを最適化する必要があることを示唆しています。

Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory
特異学習理論におけるベイズ交差検証と広く適用可能な情報量基準の漸近等価性

In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to the Akaike information criterion. However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown. In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss. In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems. First, the Bayes cross-validation loss is asymptotically equivalent to the widely applicable information criterion as a random variable. Therefore, model selection and hyperparameter optimization using these two values are asymptotically equivalent. Second, the sum of the Bayes generalization error and the Bayes cross-validation error is asymptotically equal to 2λ/n, where λ is the real log canonical threshold and n is the number of training samples. Therefore the relation between the cross-validation error and the generalization error is determined by the algebraic geometrical structure of a learning machine. We also clarify that the deviance information criteria are different from the Bayes cross-validation and the widely applicable information criterion.

通常の統計モデルでは、leave-one-out交差検証は赤池情報量基準と漸近的に等価です。しかし、多くの学習マシンは特異統計モデルであるため、交差検証の漸近的な動作は不明のままです。以前の研究では、特異学習理論を確立し、期待値が平均ベイズ汎化損失に漸近的に等しい、広く適用可能な情報量基準を提案しました。この論文では、ベイズ交差検証損失と広く適用可能な情報量基準を理論的に比較し、2つの定理を証明します。まず、ベイズ交差検証損失は、ランダム変数として広く適用可能な情報量基準と漸近的に等価です。したがって、これら2つの値を使用したモデル選択とハイパーパラメータ最適化は漸近的に等価です。次に、ベイズ汎化誤差とベイズ交差検証誤差の合計は、漸近的に2λ/nに等しくなります。ここで、λ は実数対数標準閾値、nはトレーニングサンプル数です。したがって、クロスバリデーション誤差と一般化誤差の関係は、学習機械の代数幾何学的構造によって決定されます。また、逸脱情報基準は、ベイズクロスバリデーションや広く適用可能な情報基準とは異なることも明らかにします。

Incremental Sigmoid Belief Networks for Grammar Learning
文法学習のためのインクリメンタルシグモイド信念ネットワーク

We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning.

私たちは、ベイジアンネットワークのモデル構造が予測される出力構造の関数である構造化予測問題に適したベイジアンネットワークのクラスを提案します。これらの増分シグモイドビリーフネットワーク(ISBN)は、有向エッジと増分指定モデル構造により、部分出力構造による推論で無限に多くの互換性のあるモデル構造の合計を必要としないため、デコードを可能にします。ISBNは、ドメインの複雑な統計的依存関係を学習するために多数の潜在変数が役立つ、自然言語解析などの困難な構造化予測問題に特にターゲットを絞っています。多数の潜在変数を持つISBNでの正確な推論は扱いにくいが、我々は2つの効率的な近似を提案します。まず、以前のニューラルネットワーク解析モデルは、ISBNによる推論の粗い平均場近似と見なすことができることを実証します。次に、より正確でありながら扱いやすい変分近似を導出し、人工実験で有効であることが証明されます。これらのモデルの有効性をベンチマーク自然言語解析タスクで比較すると、最先端のモデルに匹敵する精度が達成されています。ISBNに近似したモデルの方が解析精度が高く、ISBNが自然言語文法学習の適切な抽象モデルであることを示しています。

Rate Minimaxity of the Lasso and Dantzig Selector for the lq Loss in lr Balls
ラッソと lq の損失のための Dantzig セレクターのレート最小化 lla ボール

We consider the estimation of regression coefficients in a high-dimensional linear model. For regression coefficients in lr balls, we provide lower bounds for the minimax lq risk and minimax quantiles of the lq loss for all design matrices. Under an l0 sparsity condition on a target coefficient vector, we sharpen and unify existing oracle inequalities for the Lasso and Dantzig selector. We derive oracle inequalities for target coefficient vectors with many small elements and smaller threshold levels than the universal threshold. These oracle inequalities provide sufficient conditions on the design matrix for the rate minimaxity of the Lasso and Dantzig selector for the lq risk and loss in lr balls, 0≤ r≤ 1≤ q≤ ∞. By allowing q=∞, our risk bounds imply the variable selection consistency of threshold Lasso and Dantzig selectors.

私たちは、回帰係数の推定を高次元線形モデルで考えます。lrボールの回帰係数については、すべての設計行列について、最小限界リスクと最小限界損失の限界を提供します。ターゲット係数ベクトルのl0スパース性条件下で、LassoセレクターとDantzigセレクターの既存のオラクル不等式をシャープにし、統一します。多くの小さな要素とユニバーサルしきい値よりも小さいしきい値レベルを持つターゲット係数ベクトルのオラクル不等式を導き出します。これらのオラクル不等式は、LRボールのlqリスクと損失0≤r≤1≤q≤ ∞ に対するLassoセレクターとDantzigセレクターのレート最小化の設計行列に十分な条件を提供します。q=∞ を許可することにより、リスク境界は、しきい値LassoセレクターとDantzigセレクターの変数選択の一貫性を意味します。

An Exponential Model for Infinite Rankings
無限ランキングの指数モデル

This paper presents a statistical model for expressing preferences through rankings, when the number of alternatives (items to rank) is large. A human ranker will then typically rank only the most preferred items, and may not even examine the whole set of items, or know how many they are. Similarly, a user presented with the ranked output of a search engine, will only consider the highest ranked items. We model such situations by introducing a stagewise ranking model that operates with finite ordered lists called top-t orderings over an infinite space of items. We give algorithms to estimate this model from data, and demonstrate that it has sufficient statistics, being thus an exponential family model with continuous and discrete parameters. We describe its conjugate prior and other statistical properties. Then, we extend the estimation problem to multimodal data by introducing an Exponential-Blurring-Mean-Shift nonparametric clustering algorithm. The experiments highlight the properties of our model and demonstrate that infinite models over permutations can be simple, elegant and practical.

この論文では、選択肢(ランク付けする項目)の数が多い場合に、順位付けによって好みを表現する統計モデルを紹介します。人間のランク付け者は通常、最も好まれる項目のみをランク付けし、項目セット全体を調べたり、項目がいくつあるかを把握したりしない場合があります。同様に、検索エンジンのランク付けされた出力を提示されたユーザーは、最高ランクの項目のみを検討します。私たちは、このような状況を、無限の項目空間にわたってtop-t順序付けと呼ばれる有限順序リストで動作する段階的ランキングモデルを導入することでモデル化します。私たちは、データからこのモデルを推定するアルゴリズムを示し、それが十分な統計量を持つこと、つまり連続および離散パラメータを持つ指数族モデルであることを示します。その共役事前分布およびその他の統計特性について説明します。次に、指数的ぼやけ平均シフトノンパラメトリッククラスタリングアルゴリズムを導入することで、推定問題をマルチモーダルデータに拡張します。実験では、私たちのモデルの特性を強調し、順列に対する無限モデルがシンプルで洗練され、実用的であることを示します。

Efficient Algorithms for Conditional Independence Inference
条件付き独立性推論のための効率的なアルゴリズム

The topic of the paper is computer testing of (probabilistic) conditional independence (CI) implications by an algebraic method of structural imsets. The basic idea is to transform (sets of) CI statements into certain integral vectors and to verify by a computer the corresponding algebraic relation between the vectors, called the independence implication. We interpret the previous methods for computer testing of this implication from the point of view of polyhedral geometry. However, the main contribution of the paper is a new method, based on linear programming (LP). The new method overcomes the limitation of former methods to the number of involved variables. We recall/describe the theoretical basis for all four methods involved in our computational experiments, whose aim was to compare the efficiency of the algorithms. The experiments show that the LP method is clearly the fastest one. As an example of possible application of such algorithms we show that testing inclusion of Bayesian network structures or whether a CI statement is encoded in an acyclic directed graph can be done by the algebraic method.

この論文のテーマは、構造的イムセットの代数的方法による（確率的）条件付き独立性（CI）含意のコンピュータテストです。基本的な考え方は、CIステートメント（のセット）を特定の積分ベクトルに変換し、独立含意と呼ばれるベクトル間の対応する代数関係をコンピュータで検証することです。この含意のコンピュータテストの以前の方法を多面体幾何学の観点から解釈します。ただし、この論文の主な貢献は、線形計画法（LP）に基づく新しい方法です。新しい方法は、以前の方法の制限である関与する変数の数を克服します。アルゴリズムの効率を比較することを目的とした計算実験に関係する4つの方法すべての理論的根拠を振り返り、説明します。実験では、LP方法が明らかに最も高速であることがわかりました。このようなアルゴリズムの可能な適用例として、ベイジアンネットワーク構造の包含またはCIステートメントが非巡回有向グラフにエンコードされているかどうかを代数的方法でテストできることを示します。

Lp-Nested Symmetric Distributions
lp でネストされた対称分布

In this paper, we introduce a new family of probability densities called Lp-nested symmetric distributions. The common property, shared by all members of the new class, is the same functional form ρ(x) = ~ρ(f(x)), where f is a nested cascade of Lp-norms ||x||p = (∑ |xi|p)1/p. Lp-nested symmetric distributions thereby are a special case of ν-spherical distributions for which f is only required to be positively homogeneous of degree one. While both, ν-spherical and Lp-nested symmetric distributions, contain many widely used families of probability models such as the Gaussian, spherically and elliptically symmetric distributions, Lp-spherically symmetric distributions, and certain types of independent component analysis (ICA) and independent subspace analysis (ISA) models, ν-spherical distributions are usually computationally intractable. Here we demonstrate that Lp-nested symmetric distributions are still computationally feasible by deriving an analytic expression for its normalization constant, gradients for maximum likelihood estimation, analytic expressions for certain types of marginals, as well as an exact and efficient sampling algorithm. We discuss the tight links of Lp-nested symmetric distributions to well known machine learning methods such as ICA, ISA and mixed norm regularizers, and introduce the nested radial factorization algorithm (NRF), which is a form of non-linear ICA that transforms any linearly mixed, non-factorial Lp-nested symmetric source into statistically independent signals. As a corollary, we also introduce the uniform distribution on the Lp-nested unit sphere.

この論文では、Lpネスト対称分布と呼ばれる新しい確率密度のファミリーを紹介します。新しいクラスのすべてのメンバーに共通する特性は、同じ関数形式 ρ(x) = ~ρ(f(x))です。ここで、fはLpノルムのネストされたカスケードです||x||p = (∑|xi|p)1/p。したがって、Lpネスト対称分布は、fが次数1の正同次であることのみが要求される ν 球状分布の特殊なケースです。ν 球状分布とLpネスト対称分布の両方に、ガウス分布、球状および楕円対称分布、Lp球状対称分布、特定の種類の独立成分分析(ICA)モデルや独立部分空間分析(ISA)モデルなど、広く使用されている確率モデルファミリーが多数含まれていますが、ν 球状分布は通常、計算が困難です。ここでは、正規化定数の解析式、最大尤度推定の勾配、特定の種類の周辺分布の解析式、および正確で効率的なサンプリングアルゴリズムを導出することにより、Lpネスト対称分布が計算上依然として実行可能であることを示します。ICA、ISA、混合ノルム正則化などのよく知られた機械学習手法とLpネスト対称分布の密接な関係について説明し、ネストラジアル因数分解アルゴリズム(NRF)を紹介します。これは、線形混合の非因数Lpネスト対称ソースを統計的に独立した信号に変換する非線形ICAの一種です。結果として、Lpネスト単位球面上の均一分布も紹介します。

Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion
Stacked Denoising Autoencoders:局所的なノイズ除去基準を持つ深層ネットワークにおける有用な表現の学習

We explore an original strategy for building deep networks, based on stacking layers of denoising autoencoders which are trained locally to denoise corrupted versions of their inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield significantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpassing it. Higher level representations learnt in this purely unsupervised fashion also help boost the performance of subsequent SVM classifiers. Qualitative experiments show that, contrary to ordinary autoencoders, denoising autoencoders are able to learn Gabor-like edge detectors from natural image patches and larger stroke detectors from digit images. This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations.

私たちは、入力の破損バージョンをノイズ除去するようにローカルにトレーニングされたノイズ除去オートエンコーダの層を積み重ねることに基づいて、ディープネットワークを構築するための独自の戦略を探求します。結果として得られるアルゴリズムは、通常のオートエンコーダの積み重ねの単純なバリエーションです。ただし、分類問題のベンチマークでは分類エラーが大幅に低下することが示されており、ディープビリーフネットワーク(DBN)とのパフォーマンスギャップが埋められ、場合によってはそれを上回ります。この完全に教師なしの方法で学習された高レベルの表現は、後続のSVM分類器のパフォーマンスの向上にも役立ちます。定性実験では、通常のオートエンコーダとは異なり、ノイズ除去オートエンコーダは、自然な画像パッチからガボールのようなエッジ検出器を学習し、数字画像からより大きなストローク検出器を学習できることが示されています。この研究では、有用な高レベル表現の学習を導く扱いやすい教師なし目標としてノイズ除去基準を使用することの価値を明確に確立しています。

Learning Instance-Specific Predictive Models
インスタンス固有の予測モデルの学習

This paper introduces a Bayesian algorithm for constructing predictive models from data that are optimized to predict a target variable well for a particular instance. This algorithm learns Markov blanket models, carries out Bayesian model averaging over a set of models to predict a target variable of the instance at hand, and employs an instance-specific heuristic to locate a set of suitable models to average over. We call this method the instance-specific Markov blanket (ISMB) algorithm. The ISMB algorithm was evaluated on 21 UCI data sets using five different performance measures and its performance was compared to that of several commonly used predictive algorithms, including naive Bayes, C4.5 decision tree, logistic regression, neural networks, k-Nearest Neighbor, Lazy Bayesian Rules, and AdaBoost. Over all the data sets, the ISMB algorithm performed better on average on all performance measures against all the comparison algorithms.

この論文では、特定のインスタンスでターゲット変数を適切に予測するように最適化されたデータから予測モデルを構築するためのベイジアンアルゴリズムを紹介します。このアルゴリズムは、マルコフブランケットモデルを学習し、一連のモデルに対してベイズモデルの平均化を実行して、手元のインスタンスのターゲット変数を予測し、インスタンス固有のヒューリスティックを使用して、平均化に適したモデルのセットを見つけます。このメソッドをインスタンス固有のマルコフブランケット(ISMB)アルゴリズムと呼びます。ISMBアルゴリズムは、5つの異なるパフォーマンス指標を使用して21のUCIデータセットで評価され、そのパフォーマンスは、単純ベイズ、C4.5決定木、ロジスティック回帰、ニューラルネットワーク、k-最近傍、遅延ベイジアンルール、AdaBoostなど、一般的に使用されるいくつかの予測アルゴリズムのパフォーマンスと比較されました。すべてのデータセットで、ISMBアルゴリズムは、すべての比較アルゴリズムに対して、すべてのパフォーマンス指標で平均して優れたパフォーマンスを発揮しました。

Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
コスト重視学習における最尤法: モデルの仕様、近似、および上限

The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique.

パターン分類の分野では、誤分類コストやクラス出現率に非対称性が存在することはよくあります。コストに敏感な学習手法の研究には多くの関心が寄せられていますが、コストに敏感な学習とパラメトリック推定フレームワークにおけるモデルセットの指定との関係は、いまだに不明瞭です。このため、真の事後分布を含むモデルの場合と、モデルが誤って指定されている場合を区別します。前者の場合、最大尤度(ML)推定値の閾値設定が、リスク最小化問題に対する漸近的最適解であることが示されます。一方、モデルの誤指定では、閾値設定MLは最適ではなく、リスク最小化解は誤分類コスト比によって変わることが実証されています。さらに、負の重み付き対数尤度(Elkan、2001)は、経験的損失のタイトな凸型上限であることを解析的に示します。いくつかの実際のデータセットでの経験的結果と合わせて、重み付けMLがコストに敏感な手法として推奨されると主張します。

Classification with Incomplete Data Using Dirichlet Process Priors
ディリクレ工程事前分布を使用した不完全データによる分類

A non-parametric hierarchical Bayesian framework is developed for designing a classifier, based on a mixture of simple (linear) classifiers. Each simple classifier is termed a local “expert”, and the number of experts and their construction are manifested via a Dirichlet process formulation. The simple form of the “experts” allows analytical handling of incomplete data. The model is extended to allow simultaneous design of classifiers on multiple data sets, termed multi-task learning, with this also performed non-parametrically via the Dirichlet process. Fast inference is performed using variational Bayesian (VB) analysis, and example results are presented for several data sets. We also perform inference via Gibbs sampling, to which we compare the VB results.

ノンパラメトリック階層ベイジアンフレームワークは、単純な(線形)分類子の組み合わせに基づいて分類子を設計するために開発されました。各単純な分類器はローカルの「専門家」と呼ばれ、専門家の数とその構成は、ディリクレプロセスの定式化によって明らかになります。「専門家」のシンプルな形式により、不完全なデータの分析的な取り扱いが可能になります。このモデルは、マルチタスク学習と呼ばれる複数のデータセットに対する分類器の同時設計を可能にするように拡張されており、これもディリクレプロセスを介してノンパラメトリックに実行されます。高速推論は変分ベイジアン(VB)解析を使用して実行され、いくつかのデータセットに対する結果の例が示されています。また、ギブスサンプリングによる推論も行い、VBの結果を比較します。

Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes
固定形式変分ベイズに対する近似リーマン共役勾配学習

Variational Bayesian (VB) methods are typically only applied to models in the conjugate-exponential family using the variational Bayesian expectation maximisation (VB EM) algorithm or one of its variants. In this paper we present an efficient algorithm for applying VB to more general models. The method is based on specifying the functional form of the approximation, such as multivariate Gaussian. The parameters of the approximation are optimised using a conjugate gradient algorithm that utilises the Riemannian geometry of the space of the approximations. This leads to a very efficient algorithm for suitably structured approximations. It is shown empirically that the proposed method is comparable or superior in efficiency to the VB EM in a case where both are applicable. We also apply the algorithm to learning a nonlinear state-space model and a nonlinear factor analysis model for which the VB EM is not applicable. For these models, the proposed algorithm outperforms alternative gradient-based methods by a significant margin.

変分ベイズ(VB)法は、通常、変分ベイズ期待値最大化(VB EM)アルゴリズムまたはその変形の1つを使用して、共役指数族のモデルにのみ適用されます。この論文では、より一般的なモデルにVBを適用するための効率的なアルゴリズムを紹介します。この方法は、多変量ガウスなどの近似の関数形式を指定することに基づいています。近似のパラメーターは、近似の空間のリーマン幾何学を利用する共役勾配アルゴリズムを使用して最適化されます。これにより、適切に構造化された近似のための非常に効率的なアルゴリズムが実現します。提案された方法は、両方が適用可能な場合に、VB EMと同等またはそれ以上の効率性があることが経験的に示されています。また、VB EMが適用できない非線形状態空間モデルと非線形因子分析モデルの学習にアルゴリズムを適用します。これらのモデルでは、提案されたアルゴリズムは、代替の勾配ベースの方法を大幅に上回ります。

A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
大規模L1正則化線形分類のための最適化手法とソフトウェアの比較

Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection; however, its non-differentiability causes more difficulties in training. Although various optimization methods have been proposed in recent years, these have not yet been compared suitably. In this paper, we first broadly review existing methods. Then, we discuss state-of-the-art software packages in detail and propose two efficient implementations. Extensive comparisons indicate that carefully implemented coordinate descent methods are very suitable for training large document data.

大規模な線形分類は、多くの分野で広く使用されています。L1正則化形式は、機能選択に適用できます。ただし、その非微分性は、トレーニングをより困難にします。近年、様々な最適化手法が提案されていますが、これらはまだ適切に比較されていません。この論文では、まず既存の手法を広くレビューします。次に、最先端のソフトウェアパッケージについて詳細に議論し、2つの効率的な実装を提案します。広範な比較により、慎重に実装された座標降下法は、大規模なドキュメントデータのトレーニングに非常に適していることが示されています。

A Generalized Path Integral Control Approach to Reinforcement Learning
強化学習への一般化経路積分制御アプローチ

With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests to use the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi-Bellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI2) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs.

強化学習(RL)は、効率が高くオープンパラメータが少ない、よりスケーラブルなアルゴリズムを生成するという目標を掲げ、最近、最適制御と動的計画法の古典的な手法と統計的推定理論の最新の学習手法を組み合わせる方向に進んでいます。この観点から、この論文では、経路積分による確率最適制御のフレームワークを使用して、パラメータ化されたポリシーによるRLへの新しいアプローチを導き出すことを提案します。確率的ハミルトンヤコビベルマン(HJB)方程式に基づく価値関数推定と最適制御にしっかりと基づいている一方で、ポリシーの改善は、探索ノイズ以外のオープンアルゴリズムパラメータを持たない経路積分の近似問題に変換できます。結果として得られるアルゴリズムは、学習問題の構造に応じて、モデルベース、半モデルベース、またはモデルフリーとして考えることができます。更新方程式は、行列反転も勾配学習率も必要ないため、数値不安定性の危険はありません。私たちの新しいアルゴリズムは、確率マッチングの枠組みにおける以前の強化学習研究との興味深い類似点を示しており、わずかにヒューリスティックな動機による確率マッチングアプローチが実際に優れたパフォーマンスを発揮できる理由を直感的に示しています。実証的評価では、勾配ベースのポリシー学習よりもパフォーマンスが大幅に向上し、高次元制御問題への拡張性があることが実証されています。最後に、シミュレートされた12自由度のロボット犬での学習実験により、複雑なロボット学習シナリオにおける私たちのアルゴリズムの機能性が実証されています。私たちは、パス積分によるポリシー改善(PI2)が、軌道ロールアウトに基づく強化学習のための、現時点で最も効率的で数値的に堅牢で実装が簡単なアルゴリズムの1つであると考えています。

Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
対称クリークポテンシャルと組み合わせた抽出MRFの集団推論

Many structured information extraction tasks employ collective graphical models that capture inter-instance associativity by coupling them with various clique potentials. We propose tractable families of such potentials that are invariant under permutations of their arguments, and call them symmetric clique potentials. We present three families of symmetric potentials−MAX, SUM, and MAJORITY.We propose cluster message passing for collective inference with symmetric clique potentials, and present message computation algorithms tailored to such potentials. Our first message computation algorithm, called α-pass, is sub-quadratic in the clique size, outputs exact messages for MAX, and computes 13/15-approximate messages for Potts, a popular member of the SUM family. Empirically, it is upto two orders of magnitude faster than existing algorithms based on graph-cuts or belief propagation. Our second algorithm, based on Lagrangian relaxation, operates on MAJORITY potentials and provides close to exact solutions while being two orders of magnitude faster. We show that the cluster message passing framework is more principled, accurate and converges faster than competing approaches.We extend our collective inference framework to exploit associativity of more general intra-domain properties of instance labelings, which opens up interesting applications in domain adaptation. Our approach leads to significant error reduction on unseen domains without incurring any overhead of model retraining.

多くの構造化情報抽出タスクでは、さまざまなクリークポテンシャルと組み合わせることでインスタンス間の関連性を捉える集団グラフィカルモデルが採用されています。引数の順列に対して不変である扱いやすいポテンシャルファミリを提案し、これを対称クリークポテンシャルと呼びます。対称ポテンシャルの3つのファミリ(MAX、SUM、MAJORITY)を紹介します。対称クリークポテンシャルを使用した集団推論のためのクラスターメッセージパッシングを提案し、そのようなポテンシャルに合わせたメッセージ計算アルゴリズムを紹介します。最初のメッセージ計算アルゴリズムは α-passと呼ばれ、クリークサイズが2乗以下で、MAXの場合は正確なメッセージを出力し、SUMファミリの一般的なメンバーであるPottsの場合は13/15近似メッセージを計算します。経験的に、グラフカットやビリーフプロパゲーションに基づく既存のアルゴリズムよりも最大2桁高速です。ラグランジュ緩和に基づく2番目のアルゴリズムはMAJORITYポテンシャルで動作し、2桁高速でありながら正確なソリューションに近いものを提供します。クラスターメッセージパッシングフレームワークは、競合するアプローチよりも原理的で正確であり、収束が速いことを示しています。集合推論フレームワークを拡張して、インスタンスラベリングのより一般的なドメイン内プロパティの結合性を活用し、ドメイン適応における興味深いアプリケーションを実現します。私たちのアプローチは、モデルの再トレーニングのオーバーヘッドを発生させることなく、未知のドメインでのエラーを大幅に削減します。

Inducing Tree-Substitution Grammars
木置換文法の誘導

Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results.

テキストから文法を誘導することは、数十年にわたる研究にもかかわらず、非常に困難な学習タスクであることが証明されています。その困難さの主な理由は、妥当な文法を誘導するためには、基礎となるモデルが言語の複雑さを表現でき、かつデータから容易に学習できることも保証する必要があることです。文法誘導に関する既存の研究の大部分は、文脈自由文法と一階従属文法を使用することで、表現能力よりもモデルの単純さ(したがって学習可能性)を優先してきましたが、これらは多くの一般的な言語構造をモデル化するのに十分な表現力を持っていません。私たちは、確率的木置換文法を推論することで新しい妥協案を提案します。これは、任意の大きさの木の断片を許容し、複雑な言語構造をより適切に表現する形式です。モデルの複雑さを制限するために、浅い生成を持つスパース文法にモデルを偏らせるベイジアン非パラメトリック事前分布を使用します。このモデルの有効性を、トレーニングツリーバンクの潜在的なセグメンテーションを誘導する教師あり句構造解析と、教師なし依存文法誘導で実証しました。どちらの場合も、このモデルは興味深い潜在的な言語構造を発見し、競争力のある結果を生み出します。

Covariance in Unsupervised Learning of Probabilistic Grammars
確率的文法の教師なし学習における共分散

Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, general-purpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data.

確率文法は、自然言語テキストのような離散的かつ連続的なデータをモデル化する上で大きな柔軟性を提供します。その記号的要素は人間による検査に適しており、確率的要素は曖昧さの解決に役立ちます。また、十分に理解されている汎用学習アルゴリズムの使用も可能になります。ベイズ設定で確率文法を使用することへの関心が高まっています。現在まで、ほとんどの文献はディリクレ事前分布の使用に焦点を当ててきました。ディリクレ事前分布には、確率文法のパラメータ間の共分散を直接モデル化できないなど、いくつかの制限があります。しかし、さまざまな文法パラメータは、それらが表す言語の要素が言語特性を共有しているため、相関していると予想されます。この論文では、ディリクレ事前分布の代替として、ロジスティック正規分布のファミリーを提案します。私たちはこの分布群の推論アルゴリズムを導き出し、依存文法誘導のタスクを実験し、異なる自然言語の6つのツリーバンクのセットで事前確率によるパフォーマンスの向上を実証しました。私たちの共分散フレームワークは、異なる言語のテキストの文法内および文法間でのソフトパラメータ結合を可能にし、バイリンガルの非並列データを使用した新しい学習設定で実証的な利益を示します。

Gaussian Processes for Machine Learning (GPML) Toolbox
機械学習のためのガウス過程 (GPML) ツールボックス

The GPML toolbox provides a wide range of functionality for Gaussian process (GP) inference and prediction. GPs are specified by mean and covariance functions; we offer a library of simple mean and covariance functions and mechanisms to compose more complex ones. Several likelihood functions are supported including Gaussian and heavy-tailed for regression as well as others suitable for classification. Finally, a range of inference methods is provided, including exact and variational inference, Expectation Propagation, and Laplace’s method dealing with non-Gaussian likelihoods and FITC for dealing with large regression tasks.

GPMLツールボックスには、ガウス過程(GP)の推論と予測のための幅広い機能が用意されています。GPは、平均関数と共分散関数によって指定されます。単純な平均関数と共分散関数のライブラリと、より複雑な関数を構成するためのメカニズムを提供します。回帰用のガウス関数とヘビーテール関数、および分類に適した他の関数など、いくつかの尤度関数がサポートされています。最後に、厳密推論と変分推論、期待伝播、非ガウス尤度を扱うラプラス法、大規模な回帰タスクを処理するためのFITCなど、さまざまな推論方法が提供されています。

Semi-Supervised Novelty Detection
セミ教師ありノベルティ検出

A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classification. Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study.We also show that SSND provides distribution-free, learning-theoretic solutions to two well known problems in hypothesis testing. First, our results provide a general solution to the general two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard p-value approach to multiple testing under the so-called random effects model. Unlike standard rejection regions based on thresholded p-values, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions.

新規性検出の一般的な設定では、名目クラスのラベル付き例は利用できるが、新規性のラベル付き例は利用できないことを前提としています。標準的な(帰納的)アプローチは、名目密度が低い場合に新規性を宣言することであり、これにより問題は密度レベルセット推定に簡略化されます。この論文では、学習時にラベルなしの、汚染されている可能性のあるサンプルも利用できる設定を検討します。この半教師あり設定での新規性検出は、バイナリ分類問題への一般的な簡略化によって自然に解決されると主張します。特に、望ましい偽陽性率を持つ検出器は、ネイマン-ピアソン分類への簡略化によって実現できます。帰納的アプローチとは異なり、半教師あり新規性検出(SSND)では、新規性の分布に関係なく、最適な(統計的に一貫性があるなど)検出器が生成されます。したがって、新規性検出では、ラベルなしデータが決定ルールの理論的特性に大きな影響を与えます。私たちは、広範な実験研究によりSSNDの実用性を検証しました。また、SSNDが仮説検定における2つのよく知られた問題に対して、分布に依存しない学習理論的なソリューションを提供することも示しました。まず、我々の結果は、一般的な2サンプル問題、つまり2つのランダムサンプルが同じ分布から生じたかどうかを判断する問題に対する一般的なソリューションを提供します。次に、SSNDの特殊化は、いわゆるランダム効果モデルにおける多重検定の標準的なp値アプローチと一致します。閾値p値に基づく標準的な拒否領域とは異なり、一般的なSSNDフレームワークでは、複数の次元における任意の代替分布への適応が可能です。

Tree Decomposition for Large-Scale SVM Problems
大規模 SVM 問題に対するツリー分解

To handle problems created by large data sets, we propose a method that uses a decision tree to decompose a given data space and train SVMs on the decomposed regions. Although there are other means of decomposing a data space, we show that the decision tree has several merits for large-scale SVM training. First, it can classify some data points by its own means, thereby reducing the cost of SVM training for the remaining data points. Second, it is efficient in determining the parameter values that maximize the validation accuracy, which helps maintain good test accuracy. Third, the tree decomposition method can derive a generalization error bound for the classifier. For data sets whose size can be handled by current non-linear, or kernel-based, SVM training techniques, the proposed method can speed up the training by a factor of thousands, and still achieve comparable test accuracy.

大規模なデータセットによって生じる問題を処理するために、デシジョンツリーを使用して特定のデータ空間を分解し、分解された領域でSVMを訓練する方法を提案します。データ空間を分解する他の方法もありますが、デシジョンツリーには大規模なSVMトレーニングにいくつかのメリットがあることを示しています。まず、一部のデータポイントを独自の方法で分類できるため、残りのデータポイントのSVMトレーニングのコストを削減できます。次に、検証精度を最大化するパラメータ値を決定するのに効率的であり、良好なテスト精度を維持するのに役立ちます。第3に、ツリー分解法は、分類器の汎化誤差範囲を導出できます。現在の非線形またはカーネルベースのSVM学習手法でサイズを処理できるデータセットの場合、提案された方法は、学習を数千倍高速化し、それでも同等のテスト精度を達成できます。

Linear Algorithms for Online Multitask Classification
オンラインマルチタスク分類のための線形アルゴリズム

We introduce new Perceptron-based algorithms for the online multitask binary classification problem. Under suitable regularity conditions, our algorithms are shown to improve on their baselines by a factor proportional to the number of tasks. We achieve these improvements using various types of regularization that bias our algorithms towards specific notions of task relatedness. More specifically, similarity among tasks is either measured in terms of the geometric closeness of the task reference vectors or as a function of the dimension of their spanned subspace. In addition to adapting to the online setting a mix of known techniques, such as the multitask kernels of Evgeniou et al., our analysis also introduces a matrix-based multitask extension of the p-norm Perceptron, which is used to implement spectral co-regularization. Experiments on real-world data sets complement and support our theoretical findings.

私たちは、オンラインマルチタスク二項分類問題のための新しいパーセプトロンベースのアルゴリズムを紹介します。適切な規則性条件下では、アルゴリズムはタスクの数に比例する係数でベースラインを改善することが示されています。これらの改善は、アルゴリズムをタスクの関連性の特定の概念に偏らせるさまざまなタイプの正則化を使用して達成します。具体的には、タスク間の類似性は、タスク参照ベクトルの幾何学的な近さの観点から測定されるか、スパンされた部分空間の次元の関数として測定されます。Evgeniouらのマルチタスクカーネルのような既知の手法の組み合わせをオンライン設定に適応させることに加えて、私たちの分析では、スペクトル共正則化を実装するために使用されるpノルムパーセプトロンの行列ベースのマルチタスク拡張も導入しています。実世界のデータセットでの実験は、私たちの理論的発見を補完し、サポートします。

Expectation Truncation and the Benefits of Preselection In Training Generative Models
期待値の切り捨てと生成モデルの学習における事前選択の利点

We show how a preselection of hidden variables can be used to efficiently train generative models with binary hidden variables. The approach is based on Expectation Maximization (EM) and uses an efficiently computable approximation to the sufficient statistics of a given model. The computational cost to compute the sufficient statistics is strongly reduced by selecting, for each data point, the relevant hidden causes. The approximation is applicable to a wide range of generative models and provides an interpretation of the benefits of preselection in terms of a variational EM approximation. To empirically show that the method maximizes the data likelihood, it is applied to different types of generative models including: a version of non-negative matrix factorization (NMF), a model for non-linear component extraction (MCA), and a linear generative model similar to sparse coding. The derived algorithms are applied to both artificial and realistic data, and are compared to other models in the literature. We find that the training scheme can reduce computational costs by orders of magnitude and allows for a reliable extraction of hidden causes.

私たちは、隠れ変数の事前選択を使用して、バイナリ隠れ変数を持つ生成モデルを効率的にトレーニングする方法を示します。このアプローチは期待値最大化(EM)に基づいており、特定のモデルの十分な統計量に対する効率的に計算可能な近似を使用します。十分な統計量を計算するための計算コストは、各データポイントに対して関連する隠れた原因を選択することで大幅に削減されます。この近似は、さまざまな生成モデルに適用でき、変分EM近似の観点から事前選択の利点を解釈できます。この方法がデータの尤度を最大化することを経験的に示すために、非負値行列因子分解(NMF)のバージョン、非線形成分抽出(MCA)のモデル、スパースコーディングに似た線形生成モデルなど、さまざまなタイプの生成モデルに適用されます。導出されたアルゴリズムは、人工データと現実のデータの両方に適用され、文献の他のモデルと比較されます。このトレーニングスキームにより、計算コストを桁違いに削減でき、隠れた原因を確実に抽出できることがわかります。

Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance
クラスタリング比較のための情報理論的測度:バリアント、プロパティ、正規化、および偶然の補正

Information theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. In this paper, we perform an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information theoretic measures for chance, especially when the data size is small compared to the number of clusters present therein. Of the available information theoretic based measures, we advocate the normalized information distance (NID) as a general measure of choice, for it possesses concurrently several important properties, such as being both a metric and a normalized measure, admitting an exact analytical adjusted-for-chance form, and using the nominal [0,1] range better than other normalized variants.

情報理論的尺度は、クラスタリングを比較するための尺度の基本的なクラスを形成し、最近ますます関心を集めています。しかし、それらの特性と相互関係に関する多くの疑問は未解決のままです。この論文では、文献で既によく知られているいくつかの尺度と、新たに提案されたいくつかの尺度を含む、クラスタリング比較のための情報理論的尺度の体系的な研究を行います。メトリック特性や正規化特性などの重要な特性について説明し、証明します。次に、クラスタリングコミュニティに対して、特にデータサイズがそこに含まれるクラスタの数に比べて小さい場合に、情報理論的尺度を偶然性について補正することの重要性を強調します。利用可能な情報理論ベースの尺度のうち、正規化情報距離(NID)を一般的な尺度として推奨します。これは、メトリックと正規化尺度の両方であること、偶然性を調整した正確な分析形式を許容すること、および他の正規化バリアントよりも名目上の[0,1]範囲をより適切に使用できることなど、いくつかの重要な特性を同時に備えているためです。

Regret Bounds and Minimax Policies under Partial Monitoring
パーシャルモニタリングにおける後悔バウンドとミニマックスポリシー

This work deals with four classical prediction settings, namely full information, bandit, label efficient and bandit label efficient as well as four different notions of regret: pseudo-regret, expected regret, high probability regret and tracking the best expert regret. We introduce a new forecaster, INF (Implicitly Normalized Forecaster) based on an arbitrary function ψ for which we propose a unified analysis of its pseudo-regret in the four games we consider. In particular, for ψ(x)=exp(η x) + γ/K, INF reduces to the classical exponentially weighted average forecaster and our analysis of the pseudo-regret recovers known results while for the expected regret we slightly tighten the bounds. On the other hand with ψ(x)=(η/-x)q + γ/K, which defines a new forecaster, we are able to remove the extraneous logarithmic factor in the pseudo-regret bounds for bandits games, and thus fill in a long open gap in the characterization of the minimax rate for the pseudo-regret in the bandit game. We also provide high probability bounds depending on the cumulative reward of the optimal action. Finally, we consider the stochastic bandit game, and prove that an appropriate modification of the upper confidence bound policy UCB1 (Auer et al., 2002a) achieves the distribution-free optimal rate while still having a distribution-dependent rate logarithmic in the number of plays.

この研究では、完全情報、バンディット、ラベル効率、バンディットラベル効率という4つの古典的な予測設定と、疑似後悔、期待後悔、高確率後悔、最良の専門家の後悔の追跡という4つの異なる後悔の概念を取り上げます。任意の関数 ψ に基づく新しい予測器INF (暗黙的に正規化された予測器)を導入し、検討する4つのゲームにおける疑似後悔の統一的な分析を提案します。特に、ψ(x)=exp(ηx) +γ/Kの場合、INFは古典的な指数加重平均予測器に簡略化され、疑似後悔の分析では既知の結果が回復されますが、期待後悔については境界がわずかに狭くなります。一方、新しい予測子を定義する ψ(x)=(η/-x)q +γ/Kを使用すると、バンディットゲームの疑似後悔境界における余分な対数係数を取り除くことができ、バンディットゲームの疑似後悔のミニマックスレートの特性評価における長い空白を埋めることができます。また、最適アクションの累積報酬に応じて、高確率境界も提供します。最後に、確率的バンディットゲームを検討し、上側信頼境界ポリシーUCB1 (Auerら、2002a)を適切に修正すると、プレイ回数に対して対数的な分布依存レートを維持しながら、分布フリーの最適レートが達成されることを証明します。

Mean Field Variational Approximation for Continuous-Time Bayesian Networks
連続時間ベイジアンネットワークの平均場変分近似

Continuous-time Bayesian networks is a natural structured representation language for multi-component stochastic processes that evolve continuously over time. Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories. This variational approach leads to a globally consistent distribution, which can be efficiently queried. Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning tasks. Here we describe the theoretical foundations for the approximation, an efficient implementation that exploits the wide range of highly optimized ordinary differential equations (ODE) solvers, experimentally explore characterizations of processes for which this approximation is suitable, and show applications to a large-scale real-world inference problem.

連続時間ベイジアンネットワークは、時間の経過に伴って連続的に進化する多成分確率過程のための自然な構造化表現言語です。この言語が提供するコンパクトな表現にもかかわらず、このようなモデルでの推論は、比較的単純な構造化ネットワークであっても扱いにくいものです。私たちは、不均質なマルコフ過程の積を使用して軌跡上の結合分布を近似する平均場変分近似を導入します。この変分アプローチにより、効率的にクエリできるグローバルに一貫した分布が得られます。さらに、観測確率の下限値も提供されるため、学習タスクに適しています。ここでは、近似の理論的基礎、高度に最適化された常微分方程式(ODE)ソルバーを幅広く活用する効率的な実装について説明し、この近似が適しているプロセスの特性を実験的に調査し、大規模な現実世界の推論問題への応用を示します。

Using Contextual Representations to Efficiently Learn Context-Free Languages
文脈表現を使用して文脈自由言語を効率的に学習する

We present a polynomial update time algorithm for the inductive inference of a large class of context-free languages using the paradigm of positive data and a membership oracle. We achieve this result by moving to a novel representation, called Contextual Binary Feature Grammars (CBFGs), which are capable of representing richly structured context-free languages as well as some context sensitive languages. These representations explicitly model the lattice structure of the distribution of a set of substrings and can be inferred using a generalisation of distributional learning. This formalism is an attempt to bridge the gap between simple learnable classes and the sorts of highly expressive representations necessary for linguistic representation: it allows the learnability of a large class of context-free languages, that includes all regular languages and those context-free languages that satisfy two simple constraints. The formalism and the algorithm seem well suited to natural language and in particular to the modeling of first language acquisition. Preliminary experimental results confirm the effectiveness of this approach.

私たちは、ポジティブデータとメンバーシップオラクルのパラダイムを使用して、大規模な文脈自由言語の帰納的推論のための多項式更新時間アルゴリズムを提示します。私たちは、文脈依存言語だけでなく、構造が豊富な文脈自由言語も表現できる、文脈バイナリ特徴文法(CBFG)と呼ばれる新しい表現に移行することでこの結果を達成した。これらの表現は、部分文字列セットの分布の格子構造を明示的にモデル化し、分布学習の一般化を使用して推論することができます。この形式主義は、単純な学習可能なクラスと言語表現に必要な非常に表現力の高い表現の種類との間のギャップを埋める試みです。この形式主義は、すべての正規言語と2つの単純な制約を満たす文脈自由言語を含む、大規模な文脈自由言語の学習可能性を可能にします。この形式主義とアルゴリズムは、自然言語、特に第一言語習得のモデル化に適していると思われます。予備実験結果により、このアプローチの有効性が確認されています。

Topology Selection in Graphical Models of Autoregressive Processes
自己回帰過程のグラフィカルモデルにおけるトポロジー選択

An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. The method proposed in the paper is based on an l1-type nonsmooth regularization of the conditional maximum likelihood estimation problem. We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. Results of experiments with randomly generated and real data sets are also included.

自己回帰ガウス時系列のグラフィカルモデルでのトポロジー選択のためのアルゴリズムが提示されます。モデルのグラフトポロジは、時系列の逆スペクトルのスパース性パターンを表し、変数間の条件付き独立関係を特徴付けます。この論文で提案されている方法は、条件付き最尤推定問題のl1型非平滑正則化に基づいています。これが凸最適化問題に還元されることを示し、勾配射影法を介して双対問題を解く大規模なアルゴリズムについて説明します。ランダムに生成されたデータセットと実際のデータセットを使用した実験結果も含まれます。

Learnability, Stability and Uniform Convergence
学習可能性、安定性、均一収束

The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases. We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails, and yet they are learnable using alternative mechanisms. Instead of uniform convergence, we identify stability as the key necessary and sufficient condition for learnability. Moreover, we show that the conditions for learnability in the general setting are significantly more complex than in supervised classification and regression.

学習可能性を特徴付ける問題は、統計学習理論の最も基本的な問題です。少なくとも教師あり分類および回帰の場合、基本的で長年にわたる答えは、学習可能性は経験的リスクの母集団リスクへの一様収束に相当し、問題が学習可能である場合、経験的リスク最小化によって学習可能であるということです。この論文では、ほとんどの統計学習問題を特殊なケースとして含む一般学習設定(Vapnikによって導入)を検討します。この設定では、一様収束が成立せず、経験的リスク最小化が失敗するが、代替メカニズムを使用して学習可能な非自明な学習問題が存在することを示します。一様収束の代わりに、安定性が学習可能性の主要な必要十分条件であることを確認します。さらに、一般的な設定での学習可能性の条件は、教師あり分類および回帰の場合よりも大幅に複雑であることを示します。

Stochastic Composite Likelihood
確率的複合尤度

Maximum likelihood estimators are often of limited practical use due to the intensive computation they require. We propose a family of alternative estimators that maximize a stochastic variation of the composite likelihood function. Each of the estimators resolve the computation-accuracy tradeoff differently, and taken together they span a continuous spectrum of computation-accuracy tradeoff resolutions. We prove the consistency of the estimators, provide formulas for their asymptotic variance, statistical robustness, and computational complexity. We discuss experimental results in the context of Boltzmann machines and conditional random fields. The theoretical and experimental studies demonstrate the effectiveness of the estimators when the computational resources are insufficient. They also demonstrate that in some cases reduced computational complexity is associated with robustness thereby increasing statistical accuracy.

最尤推定器は、集中的な計算が必要なため、多くの場合、実用化が限られています。私たちは、複合尤度関数の確率的変動を最大化する代替推定量のファミリーを提案します。各推定器は、計算精度のトレードオフを異なる方法で解決し、それらをまとめると、計算精度のトレードオフの解像度の連続スペクトルにまたがります。推定量の一貫性を証明し、漸近分散、統計的ロバスト性、および計算の複雑さの公式を提供します。私たちは、ボルツマンマシンと条件付きランダム場の文脈で実験結果について議論します。理論的および実験的研究は、計算リソースが不足している場合の推定量の有効性を実証しています。また、計算の複雑さが軽減されるとロバスト性が関連し、統計的精度が向上する場合もあることも示しています。

Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
正則化確率学習とオンライン最適化のための双対平均化法

We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as l1-norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of l1-regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method.

私たちは、正規化された確率学習とオンライン最適化の問題を検討します。ここで、目的関数は2つの凸項の合計です。1つは学習タスクの損失関数で、もう1つはスパース性を促進するためのl1ノルムなどの単純な正規化項です。私たちは、オンライン設定で正規化構造を利用できるNesterovの二重平均法の拡張を開発します。これらの方法の各反復で、学習変数は、損失関数の過去のすべての部分勾配の移動平均と、部分勾配だけでなく正規化項全体を含む単純な最小化問題を解くことによって調整されます。l1正規化の場合、我々の方法はスパース解を得るのに特に効果的です。私たちは、これらの方法が、確率的およびオンライン凸最適化に関する文献で標準となっている最適な収束率または後悔境界を達成することを示します。損失関数がLipschitz連続勾配を持つ確率学習問題の場合、我々は二重平均法の高速化バージョンも提示します。

WEKA−Experiences with a Java Open-Source Project
WEKA−Javaオープンソースプロジェクトの経験

WEKA is a popular machine learning workbench with a development lifeof nearly two decades. This article provides an overview of the factorsthat we believe to be important to its success. Rather than focussing on the software’s functionality, we review aspects ofproject management and historical development decisions that likelyhad an impact on the uptake of the project.

WEKAは、開発寿命が約20年ある人気の機械学習ワークベンチです。この記事では、その成功に重要であると私たちが信じている要因の概要を説明します。ソフトウェアの機能に焦点を当てるのではなく、プロジェクトの普及に影響を与えた可能性のあるプロジェクト管理と過去の開発決定の側面をレビューします。

Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
宇宙のハブ: 高次元データで人気のある最近傍

Different aspects of the curse of dimensionality are known to present seriouschallenges to various machine-learning methods and tasks. This paper exploresa new aspect of the dimensionality curse, referred to as hubness, thataffects the distribution of k-occurrences: the number of times a pointappears among the k nearest neighbors of other points in a data set. Throughtheoretical and empirical analysis involving synthetic and real data setswe show that under commonly used assumptions this distribution becomesconsiderably skewed as dimensionality increases, causing the emergence ofhubs, that is, points with very high k-occurrences whicheffectively represent “popular” nearest neighbors. We examine the origins ofthis phenomenon, showing that it is an inherent property of data distributionsin high-dimensional vector space, discuss its interaction with dimensionalityreduction, and explore its influence on a wide range of machine-learning tasksdirectly or indirectly based on measuring distances, belonging to supervised,semi-supervised, and unsupervised learning families.

次元の呪いのさまざまな側面は、さまざまな機械学習の方法やタスクに深刻な課題をもたらすことが知られています。この論文では、ハブネスと呼ばれる次元の呪いの新しい側面について検討します。これは、k出現の分布に影響します。k出現とは、データセット内の他のポイントのk最近傍点の中にポイントが出現する回数です。合成データセットと実際のデータセットを含む理論的および実証的分析を通じて、一般的に使用される仮定の下では、この分布は次元が増加するにつれてかなり歪んで、ハブ、つまりk出現が非常に高いポイントが出現し、事実上「人気のある」最近傍点を表すことを示します。我々はこの現象の起源を調べ、それが高次元ベクトル空間におけるデータ分布の固有の特性であることを示し、次元削減との相互作用について議論し、教師あり学習、半教師あり学習、教師なし学習ファミリーに属する、距離の測定に基づいて直接的または間接的に幅広い機械学習タスクに与える影響を調査します。

Rademacher Complexities and Bounding the Excess Risk in Active Learning
Rademacherの複雑さとアクティブラーニングにおける過剰リスクの境界化

Sequential algorithms of active learning based on the estimation of thelevel sets of the empirical risk are discussed in the paper. LocalizedRademacher complexities are used in the algorithms to estimatethe sample sizesneeded to achieve the required accuracy of learning in an adaptive way.Probabilisticbounds on the number of active examples have been proved and severalapplications to binary classification problems are considered.

経験的リスクのレベルセットの推定に基づくアクティブラーニングの逐次アルゴリズムについては、この論文で説明しています。ローカライズされたRademacherの複雑さは、適応的な方法で学習の必要な精度を達成するために必要なサンプルサイズを推定するためのアルゴリズムで使用されます。アクティブな例の数の確率的境界が証明されており、二項分類問題へのいくつかの適用が検討されています。

Sparse Semi-supervised Learning Using Conjugate Functions
共役関数を用いたスパース半教師あり学習

In this paper, we propose a general framework for sparsesemi-supervised learning, which concerns using a small portion ofunlabeled data and a few labeled data to represent target functionsand thus has the merit of accelerating function evaluations whenpredicting the output of a new example. This framework makes use ofFenchel-Legendre conjugates to rewrite a convex insensitive lossinvolving a regularization with unlabeled data, and is applicable toa family of semi-supervised learning methods such as multi-viewco-regularized least squares and single-view Laplacian supportvector machines (SVMs). As an instantiation of this framework, wepropose sparse multi-view SVMs which use a squaredε-insensitive loss. The resultant optimization is aninf-sup problem and the optimal solutions have arguablysaddle-point properties. We present a globally optimal iterativealgorithm to optimize the problem. We give the margin bound on thegeneralization error of the sparse multi-view SVMs, and derive theempirical Rademacher complexity for the induced function class.Experiments on artificial and real-world data show theireffectiveness. We further give a sequential training approach toshow their possibility and potential for uses in large-scaleproblems and provide encouraging experimental results indicating theefficacy of the margin bound and empirical Rademacher complexity oncharacterizing the roles of unlabeled data for semi-supervisedlearning.

この論文では、スパース半教師あり学習の一般的なフレームワークを提案します。これは、少量のラベルなしデータと少数のラベル付きデータを使用してターゲット関数を表すことに関するもので、新しい例の出力を予測するときに関数評価を高速化するという利点があります。このフレームワークは、フェンシェル-ルジャンドル共役を使用して、ラベルなしデータによる正則化を含む凸不感性損失を書き換え、マルチビュー共正則化最小二乗法やシングルビューラプラシアンサポートベクターマシン(SVM)などの一連の半教師あり学習方法に適用できます。このフレームワークのインスタンス化として、平方 ε 不感性損失を使用するスパースマルチビューSVMを提案します。結果として得られる最適化はinf-sup問題であり、最適解はおそらく鞍点特性を持ちます。この問題を最適化するためのグローバルに最適な反復アルゴリズムを提示します。スパースマルチビューSVMの一般化誤差のマージン境界を示し、誘導関数クラスの経験的Rademacher複雑度を導出します。人工データと実世界のデータでの実験により、その有効性が示されます。さらに、大規模な問題での使用の可能性と潜在性を示すために、順次トレーニングアプローチを示し、マージン境界と経験的Rademacher複雑度が半教師あり学習のラベルなしデータの役割を特徴付ける上で有効であることを示す有望な実験結果を示します。

Composite Binary Losses
複合バイナリ損失

We study losses for binary classification and class probability estimationand extend the understanding of them from margin losses to generalcomposite losses which are the composition of a proper loss with a linkfunction. We characterise when margin losses can be proper compositelosses, explicitly show how to determine a symmetric loss in full from halfof one of its partial losses, introduce an intrinsic parametrisation ofcomposite binary losses and give a complete characterisation of therelationship between proper losses and “classification calibrated”losses. We also consider the question of the “best” surrogate binaryloss. We introduce a precise notion of “best” and show there existsituations where two convex surrogate losses are incommensurable. Weprovide a complete explicit characterisation of the convexity of compositebinary losses in terms of the link function and the weight functionassociated with the proper loss which make up the composite loss. Thischaracterisation suggests new ways of “surrogate tuning” as well asproviding an explicit characterisation of when Bregman divergences on theunit interval are convex in their second argument. Finally, in anappendix we present some new algorithm-independent results on therelationship between properness, convexity and robustness tomisclassification noise for binary losses and show that all convex properlosses are non-robust to misclassification noise.

私たちは、バイナリ分類とクラス確率推定の損失を研究し、それらの理解をマージン損失から、リンク関数を持つ適切な損失の合成である一般的な複合損失にまで広げます。マージン損失が適切な複合損失になる場合を特徴付け、対称損失全体をその部分損失の半分から決定する方法を明示的に示し、複合バイナリ損失の本質的なパラメータ化を導入し、適切な損失と「分類調整済み」損失の関係を完全に特徴付けます。また、「最良」の代理バイナリ損失の問題も検討します。「最良」の正確な概念を導入し、2つの凸代理損失が通約不可能な状況が存在することを示します。複合損失を構成する適切な損失に関連付けられたリンク関数と重み関数に関して、複合バイナリ損失の凸性の完全な明示的な特徴付けを提供します。この特徴付けは、「代理調整」の新しい方法を示唆するとともに、単位区間のBregmanダイバージェンスが第2引数で凸である場合の明示的な特徴付けを提供します。最後に、付録では、バイナリ損失の適切性、凸性、誤分類ノイズに対する堅牢性の関係に関するアルゴリズムに依存しない新しい結果をいくつか示し、すべての凸適切損失が誤分類ノイズに対して堅牢ではないことを示します。

High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency
スパースランダム射影による高次元変数選択:測定スパース性と統計的効率

We consider the problem of high-dimensional variable selection: givenn noisy observations of a k-sparse vector β* ∈ Rp,estimate the subset of non-zero entries of β*.A significant body of work has studied behavior ofl1-relaxations when applied to random measurement matrices thatare dense (e.g., Gaussian, Bernoulli). In this paper, we analyzesparsified measurement ensembles, and consider the trade-offbetween measurement sparsity, as measured by the fraction γ ofnon-zero entries, and the statistical efficiency, as measured by theminimal number of observations n required for correct variableselection with probability converging to one. Our main result is toprove that it is possible to let the fraction on non-zero entriesγ → 0 at some rate, yielding measurement matriceswith a vanishing fraction of non-zeros per row, while retaining thesame statistical efficiency as dense ensembles. A variety ofsimulation results confirm the sharpness of our theoreticalpredictions.

私たちは、高次元変数選択の問題を考えます: kスパースベクトル(k)のノイズの多い観測値 β*∈Rpを前提として、β*の非ゼロエントリのサブセットを推定します。かなりの研究が、密度の高いランダム測定行列(ガウス、ベルヌーイなど)に適用した場合のl1-緩和の挙動を研究しています。この論文では、パース化された測定アンサンブルを分析し、非ゼロエントリの分数γによって測定される測定スパース性と、確率が1に収束する正しい変数選択に必要な観測nの最小数によって測定される統計効率との間のトレードオフを検討します。私たちの主な結果は、非ゼロエントリの分数→γをあるレートで0にすることができ、行ごとに非ゼロの消失する分数を持つ測定行列を生成することが可能であることを証明します。ただし、密なアンサンブルと同じ統計効率を維持します。さまざまなシミュレーション結果が、私たちの理論予測の鋭さを裏付けています。

Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
ベイジアンネットワーク分類器の判別構造学習のための効率的なヒューリスティック

We introduce a simple order-based greedy heuristic for learningdiscriminative structure within generative Bayesian networkclassifiers. We propose two methods for establishing an order of Nfeatures. They are based on the conditional mutual information andclassification rate (i.e., risk), respectively. Given an ordering, wecan find a discriminative structure withO(Nk+1) score evaluations (where constant kis the tree-width of the sub-graph over the attributes). We present results on 25data sets from the UCI repository, for phonetic classification usingthe TIMIT database, for a visual surface inspection task, and for twohandwritten digit recognition tasks. We provide classificationperformance for both discriminative andgenerative parameter learning on both discriminativelyand generatively structured networks. The discriminativestructure found by our new procedures significantly outperformsgeneratively produced structures, and achieves a classificationaccuracy on par with the best discriminative (greedy) Bayesiannetwork learning approach, but does so with a factor of ~10-40speedup. We also show that the advantages of generativediscriminatively structured Bayesian network classifiers still hold inthe case of missing features, a case where generative classifiers havean advantage over discriminative classifiers.

私たちは、生成ベイジアンネットワーク分類器内で識別構造を学習するための、単純な順序ベースの貪欲ヒューリスティックを紹介します。N個の特徴の順序を確立するための2つの方法を提案します。これらは、それぞれ条件付き相互情報量と分類率(つまりリスク)に基づいています。順序付けが与えられれば、O(Nk+1)スコア評価で識別構造を見つけることができます(定数kは属性上のサブグラフのツリー幅です)。UCIリポジトリの25のデータセット、TIMITデータベースを使用した音声分類、視覚表面検査タスク、および2つの手書き数字認識タスクの結果を示します。識別的および生成的に構造化されたネットワークの両方で、識別的および生成的パラメータ学習の両方の分類パフォーマンスを示します。新しい手順で発見された識別構造は、生成的に生成された構造を大幅に上回り、最良の識別的(貪欲)ベイジアンネットワーク学習アプローチと同等の分類精度を達成しますが、速度は10～40倍向上します。また、生成的識別構造のベイジアンネットワーク分類器の利点は、特徴が欠落している場合でも維持され、生成的分類器が識別的分類器よりも優れていることも示しています。

Spectral Regularization Algorithms for Learning Large Incomplete Matrices
大規模不完全行列を学習するためのスペクトル正則化アルゴリズム

We use convex relaxation techniques to provide a sequence ofregularized low-rank solutions for large-scale matrix completionproblems. Using the nuclear norm as a regularizer, we provide asimple and very efficient convex algorithm for minimizing thereconstruction error subject to a bound on the nuclear norm. Ouralgorithm SOFT-IMPUTE iteratively replaces the missingelements with those obtained from a soft-thresholded SVD. With warmstarts this allows us to efficiently compute an entireregularization path of solutions on a grid of values of theregularization parameter. The computationally intensive part of ouralgorithm is in computing a low-rank SVD of a dense matrix.Exploiting the problem structure, we show that the task can beperformed with a complexity of order linear in the matrix dimensions. Oursemidefinite-programming algorithm is readily scalable to largematrices; for example SOFT-IMPUTE takes a few hours to compute low-rank approximationsof a 106 X 106 incomplete matrix with 107 observed entries, and fits a rank-95 approximation to thefull Netflix training set in 3.3 hours.Our methods achieve good training and test errors and exhibit superior timings when compared to other competitive state-of-the-arttechniques.

私たちは、凸緩和法を用いて、大規模な行列完成問題に正規化された低ランク解のシーケンスを提供します。核ノルムを正規化子として使用し、核ノルムの境界に従って再構成誤差を最小化する、単純で非常に効率的な凸アルゴリズムを提供します。我々のアルゴリズムSOFT-IMPUTEは、欠落している要素をソフトしきい値SVDから取得した要素で繰り返し置き換える。ウォームスタートにより、これにより、正規化パラメータの値のグリッド上の解の正規化パス全体を効率的に計算できます。我々のアルゴリズムの計算集約的な部分は、密行列の低ランクSVDを計算する部分です。問題構造を利用して、タスクが行列次元の線形オーダーの複雑さで実行できることを示す。我々の半正定値計画アルゴリズムは、大規模な行列に容易に拡張可能であり、たとえば、SOFT-IMPUTEでは、107個の観測エントリを持つ106 x 106の不完全行列の低ランク近似を計算するのに数時間かかりますが、ランク95の近似をNetflixトレーニングセット全体に当てはめるのに3.3時間かかります。当社の方法は、優れたトレーニングエラーとテストエラーを実現し、他の競合する最先端技術と比較して優れた時間を示しています。

High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
線形計画法による高次元逆共分散行列推定

This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem.

この論文では、「スパース」行列によって十分に近似できる高次元逆共分散行列を推定する問題について考察します。多変量線形回帰と逆共分散行列のエントリとの間の関係を利用して、このような「スパース性」を効果的に利用できる推定手順を提案します。提案手法は線形計画法を用いて計算することができるため、非常に高次元の問題に使用できる可能性があります。オラクル不等式は、いくつかの演算子ノルムの観点から推定誤差に対して確立されており、この手法が問題のさまざまなタイプのスパース性に適応していることを示しています。

Restricted Eigenvalue Properties for Correlated Gaussian Designs
相関ガウス計画の制限付き固有値プロパティ

Methods based on l1-relaxation, such as basis pursuit and theLasso, are very popular for sparse regression in high dimensions. Theconditions for success of these methods are now well-understood: (1)exact recovery in the noiseless setting is possible if and only if thedesign matrix X satisfies the restricted nullspace property, and (2)the squared l2-error of a Lasso estimate decays at the minimaxoptimal rate k log p / n, where k is thesparsity of the p-dimensional regression problem with additiveGaussian noise, whenever the design satisfies a restricted eigenvaluecondition. The key issue is thus to determine when the design matrixX satisfies these desirable properties. Thus far, there have beennumerous results showing that the restricted isometry property, whichimplies both the restricted nullspace and eigenvalue conditions, issatisfied when all entries of X are independent and identicallydistributed (i.i.d.), or the rows are unitary. This paper provesdirectly that the restricted nullspace and eigenvalue conditions holdwith high probability for quite general classes of Gaussian matricesfor which the predictors may be highly dependent, and hence restrictedisometry conditions can be violated with high probability. In thisway, our results extend the attractive theoretical guarantees onl1-relaxations to a much broader class of problems than the caseof completely independent or unitary designs.

基底追求法やLasso法などのl1緩和法に基づく方法は、高次元のスパース回帰で非常によく使用されます。これらの方法が成功するための条件は、現在では十分に理解されています。(1)ノイズのない設定で正確な回復が可能なのは、設計行列Xが制限付きヌルスペース特性を満たす場合のみであり、(2) Lasso推定値のl2誤差の二乗は、設計が制限付き固有値条件を満たす場合は常に、最小最大最適率k log p / nで減少します。ここで、kは、加法ガウスノイズを伴うp次元回帰問題のスパース性です。したがって、重要な問題は、設計行列Xがこれらの望ましい特性をいつ満たすかを判断することです。これまで、Xのすべての要素が独立かつ同一分布(i.i.d.)であるか、行がユニタリである場合、制限されたヌル空間と固有値条件の両方を意味する制限された等長特性が満たされることを示す結果が多数ありました。この論文では、制限されたヌル空間と固有値条件が、予測子が高度に依存している可能性のある非常に一般的なクラスのガウス行列に対して高い確率で保持され、したがって制限された等長条件が高確率で違反される可能性があることを直接証明しています。このようにして、私たちの結果は、l1緩和に関する魅力的な理論的保証を、完全に独立した設計またはユニタリ設計の場合よりもはるかに広範なクラスの問題に拡張します。

Erratum: SGDQN is Less Careful than Expected
正誤表: SGDQN は予想よりも注意力が低い

The SGD-QN algorithm described in Bordes et al. (2009)contains a subtle flaw that prevents it from reaching its design goals.Yet the flawed SGD-QN algorithm has worked well enough to be a winner of the firstPascal Large Scale Learning Challenge (Sonnenburg et al., 2008).This document clarifies the situation, proposes a corrected algorithm,and evaluates its performance.

Bordesら(2009)で説明されているSGD-QNアルゴリズムには、設計目標の達成を妨げる微妙な欠陥が含まれています。しかし、欠陥のあるSGD-QNアルゴリズムは、最初のPascal Large Scale Learning Challenge(Sonnenburgら, 2008)の勝者になるのに十分なほどうまく機能しました。このドキュメントでは、状況を明確にし、修正されたアルゴリズムを提案し、そのパフォーマンスを評価します。

Regularized Discriminant Analysis, Ridge Regression and Beyond
正則化判別分析、リッジ回帰など

Fisher linear discriminant analysis (FDA) and its kernelextension−kernel discriminant analysis (KDA)−are well knownmethods that consider dimensionality reduction and classificationjointly. While widely deployed in practical problems, there arestill unresolved issues surrounding their efficient implementationand their relationship with least mean squares procedures. Inthis paper we address these issues within the framework ofregularized estimation. Our approach leads to a flexible andefficient implementation of FDA as well as KDA. We also uncover ageneral relationship between regularized discriminant analysis andridge regression. This relationship yields variations onconventional FDA based on the pseudoinverse and a direct equivalenceto an ordinary least squares estimator.

フィッシャー線形判別分析(FDA)とそのカーネル拡張であるカーネル判別分析(KDA)は、次元削減と分類を共同で考慮するよく知られた方法です。実際的な問題に広く展開されている一方で、その効率的な実装と最小平均二乗法との関係を取り巻く未解決の問題がまだあります。この論文では、正則化推定の枠組みの中でこれらの問題に取り組みます。私たちのアプローチは、FDAとKDAの柔軟で効率的な実装につながります。また、正則化判別分析とリッジ回帰の間の一般的な関係も明らかにします。この関係は、擬逆法と通常の最小二乗推定量への直接等価性に基づく従来のFDAのバリエーションを生み出します。

Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence
勾配の学習:幾何学と統計的依存性を推論する予測モデル

The problems of dimension reduction and inference of statisticaldependence are addressed by the modeling framework of learninggradients. The models we propose hold for Euclidean spaces as wellas the manifold setting. The central quantity in this approach is anestimate of the gradient of the regression or classificationfunction. Two quadratic forms are constructed from gradientestimates: the gradient outer product and gradient based diffusionmaps. The first quantity can be used for supervised dimensionreduction on manifolds as well as inference of a graphical modelencoding dependencies that are predictive of a response variable.The second quantity can be used for nonlinear projections thatincorporate both the geometric structure of the manifold as well asvariation of the response variable on the manifold. We relate thegradient outer product to standard statistical quantities such ascovariances and provide a simple and precise comparison of a varietyof supervised dimensionality reduction methods. We provide rates ofconvergence for both inference of informative directions as well asinference of a graphical model of variable dependencies.

次元削減と統計的依存性の推論の問題は、勾配学習のモデリングフレームワークによって対処されます。私たちが提案するモデルは、ユークリッド空間と多様体設定の両方に当てはまります。このアプローチの中心となる量は、回帰関数または分類関数の勾配の推定値です。勾配推定値から、勾配外積と勾配ベースの拡散マップという2つの二次形式が構築されます。最初の量は、多様体での教師あり次元削減と、応答変数を予測する依存性をエンコードするグラフィカルモデルの推論に使用できます。2番目の量は、多様体の幾何学的構造と多様体での応答変数の変化の両方を組み込んだ非線形投影に使用できます。勾配外積を共分散などの標準的な統計量に関連付け、さまざまな教師あり次元削減方法の簡単かつ正確な比較を提供します。有益な方向の推論と変数依存性のグラフィカルモデルの推論の両方の収束率を提供します。

libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models
libDAI: グラフィカルモデルでの離散近似推論のための無料のオープンソース C++ ライブラリ

This paper describes the software package libDAI, a free & open sourceC++ library that provides implementations of various exact and approximateinference methods for graphical models with discrete-valued variables. libDAIsupports directed graphical models (Bayesian networks) as well as undirectedones (Markov random fields and factor graphs). It offers variousapproximations of the partition sum, marginal probability distributions andmaximum probability states. Parameter learning is also supported. Afeature comparison with other open source software packages for approximateinference is given. libDAI is licensed under the GPL v2+ license and isavailable at http://www.libdai.org.

この論文では、離散値変数を持つグラフィカルモデルのためのさまざまな厳密推論および近似推論方法の実装を提供する無料のオープンソースC++ライブラリであるソフトウェアパッケージlibDAIについて説明します。libDAIは、有向グラフィカルモデル(ベイジアンネットワーク)と無向モデル(マルコフ確率場と因子グラフ)をサポートします。これは、分割和、周辺確率分布、および最大確率状態のさまざまな近似を提供します。パラメータ学習もサポートされています。A近似推論のための他のオープンソースソフトウェアパッケージとの機能比較が示されています。libDAIはGPL v2+ライセンスの下でライセンスされており、http://www.libdai.orgで入手できます。

Matched Gene Selection and Committee Classifier for Molecular Classification of Heterogeneous Diseases
不均一疾患の分子分類のための一致遺伝子選択と委員会分類器

Microarray gene expressions provide new opportunities for molecular classification of heterogeneous diseases. Although various reported classification schemes show impressive performance, most existing gene selection methods are suboptimal and are not well-matched to the unique characteristics of the multicategory classification problem. Matched design of the gene selection method and a committee classifier is needed for identifying a small set of gene markers that achieve accurate multicategory classification while being both statistically reproducible and biologically plausible. We report a simpler and yet more accurate strategy than previous works for multicategory classification of heterogeneous diseases. Our method selects the union of one-versus-everyone (OVE) phenotypic up-regulated genes (PUGs) and matches this gene selection with a one-versus-rest support vector machine (OVRSVM). Our approach provides even-handed gene resources for discriminating both neighboring and well-separated classes. Consistent with the OVRSVM structure, we evaluated the fold changes of OVE gene expressions and found that only a small number of high-ranked genes were required to achieve superior accuracy for multicategory classification. We tested the proposed PUG-OVRSVM method on six real microarray gene expression data sets (five public benchmarks and one in-house data set) and two simulation data sets, observing significantly improved performance with lower error rates, fewer marker genes, and higher performance sustainability, as compared to several widely-adopted gene selection and classification methods. The MATLAB toolbox, experiment data and supplement files are available at http://www.cbil.ece.vt.edu/software.htm.

マイクロアレイ遺伝子発現は、異種疾患の分子分類に新たな機会を提供します。報告されているさまざまな分類スキームは優れたパフォーマンスを示していますが、既存の遺伝子選択方法のほとんどは最適ではなく、マルチカテゴリ分類問題の固有の特性にうまく適合していません。遺伝子選択方法と委員会分類器の適合設計は、統計的に再現可能で生物学的に妥当でありながら正確なマルチカテゴリ分類を達成する遺伝子マーカーの小さなセットを識別するために必要です。私たちは、異種疾患のマルチカテゴリ分類のための以前の研究よりも単純でありながらより正確な戦略を報告します。私たちの方法は、1対全員(OVE)表現型アップレギュレーション遺伝子(PUG)の和集合を選択し、この遺伝子選択を1対残りサポートベクターマシン(OVRSVM)と適合させます。私たちのアプローチは、隣接するクラスと十分に分離されたクラスの両方を区別するための公平な遺伝子リソースを提供します。OVRSVM構造と一致して、OVE遺伝子発現の倍数変化を評価し、マルチカテゴリ分類の優れた精度を達成するには、少数の高ランク遺伝子のみが必要であることがわかりました。提案されたPUG-OVRSVM法を、6つの実際のマイクロアレイ遺伝子発現データセット(5つの公開ベンチマークと1つの社内データセット)と2つのシミュレーションデータセットでテストしたところ、広く採用されているいくつかの遺伝子選択および分類方法と比較して、エラー率が低く、マーカー遺伝子が少なく、パフォーマンスの持続性が高く、パフォーマンスが大幅に向上しました。MATLABツールボックス、実験データ、および補足ファイルは、http://www.cbil.ece.vt.edu/software.htmで入手できます。

Importance Sampling for Continuous Time Bayesian Networks
連続時間ベイジアンネットワークの重要度サンプリング

A continuous time Bayesian network (CTBN) uses a structured representationto describe a dynamic system with a finite number of states which evolvesin continuous time. Exact inference in a CTBN is often intractableas the state space of the dynamic system grows exponentially with thenumber of variables. In this paper, we first present an approximateinference algorithm based on importance sampling. We then extend it tocontinuous-time particle filtering and smoothing algorithms. These threealgorithms can estimate the expectation of any function of a trajectory,conditioned on any evidence set constraining the values of subsets of thevariables over subsets of the time line. We present experimental resultson both synthetic networks and a network learned from a real data set onpeople’s life history events. We show the accuracy as well as the timeefficiency of our algorithms, and compare them to other approximatealgorithms: expectation propagation and Gibbs sampling.

連続時間ベイジアンネットワーク(CTBN)は、構造化表現を使用して、連続時間で進化する有限数の状態を持つ動的システムを記述します。CTBNでの正確な推論は、動的システムの状態空間が変数の数とともに指数関数的に増加するため、しばしば難解です。この論文では、最初に重要度サンプリングに基づく近似推論アルゴリズムを示します。次に、それを連続時間粒子フィルタリングおよび平滑化アルゴリズムに拡張します。これらの3つのアルゴリズムは、タイムラインのサブセットに対して変数のサブセットの値を制約する任意の証拠セットに基づいて、軌跡の任意の関数の期待値を推定できます。私たちは、合成ネットワークと、人々の生活史イベントに関する実際のデータセットから学習したネットワークの両方で実験結果を提示します。アルゴリズムの精度と時間効率を示し、それらを他の近似アルゴリズム(期待伝播とギブスサンプリング)と比較します。

Model-based Boosting 2.0
モデルベースブースティング 2.0

We describe version 2.0 of the R add-on package mboost.The package implements boosting for optimizing general risk functions usingcomponent-wise (penalized) least squares estimates or regressiontrees as base-learners for fitting generalized linear, additiveand interaction models to potentially high-dimensional data.

私たちは、Rアドオンパッケージmboostのバージョン2.0について説明します。このパッケージは、一般化線形モデル、加法モデル、および交互作用モデルを潜在的に高次元データに適合させるための基本学習器として、成分ごとの(ペナルティ付き)最小二乗推定値または回帰木を使用して、一般的なリスク関数を最適化するためのブースティングを実装します。

On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
モデル選択における過剰適合と性能評価におけるその後の選択バイアスについて

Model selection strategies for machine learning algorithms typically involvethe numerical optimisation of an appropriate model selection criterion, oftenbased on an estimator of generalisation performance, such as k-fold cross-validation. The error of such an estimator can be broken down into bias and variance components. While unbiasedness is often cited as a beneficial quality of a model selection criterion, we demonstrate that a low variance is at least as important, as a non-negligible variance introduces the potential for over-fitting in model selection as well as in training the model. While this observation is in hindsight perhaps rather obvious, the degradation in performance due to over-fitting the model selection criterion can be surprisingly large, an observation that appears to have received little attention in the machine learning literature to date. In this paper, we show that the effects of this form of over-fitting are often of comparable magnitude to differences in performance between learning algorithms, and thus cannot be ignored in empirical evaluation. Furthermore, we show that some common performance evaluation practices are susceptible to a form of selection bias as a result of this form of over-fitting and hence are unreliable. We discuss methods to avoid over-fitting in model selection and subsequent selection bias in performance evaluation, which we hope will be incorporated into best practice. While this study concentrates on cross-validation based model selection, the findings are quite general and apply to any model selection practice involving the optimisation of a model selection criterion evaluated over a finite sample of data, including maximisation of the Bayesian evidence and optimisation of performance bounds.

機械学習アルゴリズムのモデル選択戦略では、通常、適切なモデル選択基準の数値最適化が行われます。これは、多くの場合、k分割交差検証などの一般化パフォーマンスの推定値に基づいています。このような推定値の誤差は、バイアスと分散の要素に分解できます。モデル選択基準の有益な特性として、バイアスがないことがしばしば挙げられますが、無視できない分散があると、モデル選択だけでなくモデルのトレーニングでも過剰適合が生じる可能性があるため、低い分散も少なくとも同じくらい重要であることを実証します。この観察結果は、後から考えればかなり明白かもしれませんが、モデル選択基準の過剰適合によるパフォーマンスの低下は驚くほど大きくなる可能性があります。この観察結果は、これまで機械学習の文献ではほとんど注目されてこなかったようです。この論文では、この形式の過剰適合の影響は、学習アルゴリズム間のパフォーマンスの違いと同程度の大きさであることが多く、したがって経験的評価では無視できないことを示します。さらに、いくつかの一般的なパフォーマンス評価手法は、この形式の過剰適合の結果として、ある種の選択バイアスの影響を受けやすく、したがって信頼性が低いことを示しています。モデル選択における過剰適合と、それに続くパフォーマンス評価における選択バイアスを回避する方法について説明し、それがベストプラクティスに組み込まれることを期待しています。この研究では、クロスバリデーションに基づくモデル選択に焦点を当てていますが、調査結果は非常に一般的であり、ベイズ証拠の最大化やパフォーマンス境界の最適化など、有限のデータサンプルで評価されるモデル選択基準の最適化を伴うあらゆるモデル選択手法に当てはまります。

Matrix Completion from Noisy Entries
ノイズの多いエントリからの行列補完

Given a matrix M of low-rank, we consider the problem ofreconstructing it from noisy observations of a small,random subset of its entries. The problem arises in a varietyof applications, from collaborative filtering (the ‘Netflix problem’)to structure-from-motion and positioning. We study a low complexity algorithmintroduced by Keshavan, Montanari, and Oh (2010), based on a combinationof spectral techniques and manifold optimization, that we call here OPTSPACE. We prove performance guarantees that are order-optimal in a number of circumstances.

行列Mが低ランクの場合、そのエントリの小さなランダムなサブセットのノイズの多い観測値からそれを再構築する問題を検討します。この問題は、協調フィルタリング(「Netflix問題」)から、動きからの構造や位置決めまで、さまざまなアプリケーションで発生します。Keshavan、Montanari、Oh (2010)によって導入された低複雑性アルゴリズムを、スペクトル手法と多様体最適化の組み合わせに基づいて研究します。私たちは、さまざまな状況で最適なパフォーマンス保証を証明します。

A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design
コンピューターベース設計のための代理モデリングおよび適応サンプリングツールボックス

An exceedingly large number of scientific and engineering fields areconfronted with the need for computer simulations to study complex,real world phenomena or solve challenging design problems. However,due to the computational cost of these high fidelity simulations,the use of neural networks, kernel methods, and other surrogate modelingtechniques have become indispensable. Surrogate models are compactand cheap to evaluate, and have proven very useful for tasks suchas optimization, design space exploration, prototyping, and sensitivityanalysis. Consequently, in many fields there is great interest intools and techniques that facilitate the construction of such regressionmodels, while minimizing the computational cost and maximizing modelaccuracy. This paper presents a mature, flexible, and adaptive machinelearning toolkit for regression modeling and active learning to tacklethese issues. The toolkit brings together algorithms for data fitting,model selection, sample selection (active learning), hyperparameteroptimization, and distributed computing in order to empower a domainexpert to efficiently generate an accurate model for the problem ordata at hand.

非常に多くの科学および工学分野では、複雑な現実世界の現象を研究したり、困難な設計上の問題を解決したりするために、コンピュータシミュレーションの必要性に直面しています。しかし、これらの高忠実度シミュレーションの計算コストのため、ニューラルネットワーク、カーネル法、およびその他の代理モデリング技術の使用が不可欠になっています。代理モデルはコンパクトで評価コストが安く、最適化、設計空間の探索、プロトタイピング、感度分析などのタスクに非常に役立つことが証明されています。その結果、多くの分野で、計算コストを最小限に抑え、モデルの精度を最大化しながら、このような回帰モデルの構築を容易にするツールと技術に大きな関心が寄せられています。この論文では、これらの問題に取り組むための回帰モデリングとアクティブラーニング用の成熟した、柔軟で適応性のある機械学習ツールキットを紹介します。このツールキットは、データフィッティング、モデル選択、サンプル選択(アクティブラーニング)、ハイパーパラメータ最適化、分散コンピューティングのアルゴリズムを統合し、ドメインエキスパートが問題や手元のデータに対して正確なモデルを効率的に生成できるようにします。

Posterior Regularization for Structured Latent Variable Models
構造化潜在変数モデルのための事後正則化

We present posterior regularization, a probabilistic framework forstructured, weakly supervised learning. Our framework efficientlyincorporates indirect supervision via constraints on posteriordistributions of probabilistic models with latent variables. Posteriorregularization separates model complexity from the complexityof structural constraints it is desired to satisfy. By directlyimposing decomposable regularization on the posterior moments oflatent variables during learning, we retain the computationalefficiency of the unconstrained model while ensuring desiredconstraints hold in expectation. We present an efficient algorithm forlearning with posterior regularization and illustrate its versatilityon a diverse set of structural constraints such as bijectivity,symmetry and group sparsity in several large scale experiments,including multi-view learning, cross-lingual dependency grammarinduction, unsupervised part-of-speech induction, and bitext wordalignment.

私たちは、構造化された弱教師あり学習のための確率的フレームワークである事後正則化を提示します。私たちのフレームワークは、潜在変数を持つ確率モデルの事後分布に対する制約を通じて、間接的な監督を効率的に組み込みます。事後正則化は、モデルの複雑さを、満たすことが望ましい構造的制約の複雑さから分離します。学習中に事後モーメントの平坦変数に分解可能な正則化を直接課すことにより、制約のないモデルの計算効率を維持しながら、望ましい制約が期待値に保持されるようにします。事後正則化による学習のための効率的なアルゴリズムを提示し、マルチビュー学習、言語間依存関係文法帰納法、教師なし品詞帰納法、バイテキストワードアライメントなど、いくつかの大規模実験で、全単射性、対称性、群スパース性などの多様な構造的制約に対するその多様性を示します。

Practical Approaches to Principal Component Analysis in the Presence of Missing Values
欠損値の存在下における主成分分析の実践的アプローチ

Principal component analysis (PCA) is a classical data analysis technique that finds linear transformations of data that retain the maximal amount of variance. We study a case where some of the data values are missing, and show that this problem has many features which are usually associated with nonlinear models, such as overfitting and bad locally optimal solutions. A probabilistic formulation of PCA provides a good foundation for handling missing values, and we provide formulas for doing that. In case of high dimensional and very sparse data, overfitting becomes a severe problem and traditional algorithms for PCA are very slow. We introduce a novel fast algorithm and extend it to variational Bayesian learning. Different versions of PCA are compared in artificial experiments, demonstrating the effects of regularization and modeling of posterior variance. The scalability of the proposed algorithm is demonstrated by applying it to the Netflix problem.

主成分分析(PCA)は、分散の最大量を保持するデータの線形変換を見つける古典的なデータ分析手法です。データ値の一部が欠落しているケースを調査し、この問題には、過適合や不適切な局所最適解など、通常は非線形モデルに関連する多くの特徴があることを示します。PCAの確率論的定式化は、欠損値を処理するための優れた基盤を提供し、そのための式を提供します。高次元で非常にスパースなデータの場合、オーバーフィッティングが深刻な問題になり、PCAの従来のアルゴリズムは非常に遅くなります。新しい高速アルゴリズムを導入し、それを変分ベイズ学習に拡張します。PCAのさまざまなバージョンを人工実験で比較し、正則化と事後分散のモデリングの効果を実証しています。提案されたアルゴリズムのスケーラビリティは、それをNetflixの問題に適用することで実証されます。

Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
非IIDデータに対する色PACベイズ境界:ランキングおよび定常β混合プロセスへの応用

PAC-Bayes bounds are among the most accurate generalization bounds for classifiers learned from independently and identically distributed (IID) data, and it is particularly so for margin classifiers: there have been recent contributions showing how practical these bounds can be either to perform model selection (Ambroladze et al., 2007) or even to directly guide the learning of linear classifiers (Germain et al., 2009). However, there are many practical situations where the training data show some dependencies and where the traditional IID assumption does not hold. Stating generalization bounds for such frameworks is therefore of the utmost interest, both from theoretical and practical standpoints. In this work, we propose the first−to the best of our knowledge−PAC-Bayes generalization bounds for classifiers trained on data exhibiting interdependencies. The approach undertaken to establish our results is based on the decomposition of a so-called dependency graph that encodes the dependencies within the data, in sets of independent data, thanks to graph fractional covers. Our bounds are very general, since being able to find an upper bound on the fractional chromatic number of the dependency graph is sufficient to get new PAC-Bayes bounds for specific settings. We show how our results can be used to derive bounds for ranking statistics (such as AUC) and classifiers trained on data distributed according to a stationary β-mixing process. In the way, we show how our approach seamlessly allows us to deal with U-processes. As a side note, we also provide a PAC-Bayes generalization bound for classifiers learned on data from stationary φ-mixing distributions.

PAC-Bayes境界は、独立同一分布(IID)データから学習された分類器の最も正確な一般化境界の1つであり、特にマージン分類器ではその精度が高くなります。最近では、モデル選択を実行する場合(Ambroladze他、2007年)、または線形分類器の学習を直接ガイドする場合(Germain他、2009年)、これらの境界がいかに実用的であるかを示す研究があります。ただし、トレーニングデータに何らかの依存関係が見られ、従来のIID仮定が当てはまらない実際の状況も数多くあります。そのため、理論的観点からも実用的観点からも、このようなフレームワークの一般化境界を示すことは非常に重要です。この研究では、相互依存性を示すデータでトレーニングされた分類器のPAC-Bayes一般化境界を、私たちの知る限り初めて提案します。結果を確立するために採用されたアプローチは、グラフ分数カバーのおかげで、データ内の依存関係を独立したデータのセットにエンコードする、いわゆる依存関係グラフの分解に基づいています。依存関係グラフの分数彩色の数の上限を見つけることができれば、特定の設定に対する新しいPAC-Bayesの境界を得るのに十分であるため、私たちの境界は非常に一般的です。私たちは、私たちの結果を使用して、ランキング統計(AUCなど)と、定常 β 混合プロセスに従って分散されたデータでトレーニングされた分類器の境界を導き出す方法を示します。その際に、私たちのアプローチによってUプロセスをシームレスに処理できる方法を示します。補足として、定常 φ 混合分布のデータで学習した分類器のPAC-Bayes一般化境界も提供します。

Fast and Scalable Local Kernel Machines
高速でスケーラブルなローカルカーネルマシン

A computationally efficient approach to local learning with kernel methods ispresented. The Fast Local Kernel SupportVector Machine (FaLK-SVM) trains a set of local SVMs onredundant neighbourhoods in the training set and an appropriate model for eachquery point is selected at testing time according to a proximity strategy.Supported by a recent result by Zakai and Ritov (2009) relating consistency andlocalizability, our approach achieves high classification accuracies by dividingthe separation function in local optimisation problems that can be handled veryefficiently from the computational viewpoint. The introduction of a fast localmodel selection further speeds-upthe learning process. Learning and complexity bounds are derived for FaLK-SVM,and the empirical evaluation of the approach (with data sets up to 3 millionpoints) showed that it is much faster and more accurate and scalable thanstate-of-the-art accurate and approximated SVM solvers at least for nonhigh-dimensional data sets. More generally, we show that locality can be animportant factor to sensibly speed-up learning approaches and kernel methods,differently from other recent techniques that tend to dismiss local informationin order to improve scalability.

カーネル法によるローカル学習に対する計算効率の高いアプローチを紹介します。Fast Local Kernel Support Vector Machine (FaLK-SVM)は、トレーニングセット内の冗長な近傍でローカルSVMのセットをトレーニングし、テスト時に近接戦略に従って各クエリポイントの適切なモデルを選択します。一貫性とローカリゼーション可能性に関するZakaiとRitov (2009)による最近の結果に裏付けられ、計算の観点から非常に効率的に処理できるローカル最適化問題で分離関数を分割することにより、高い分類精度を実現します。高速ローカルモデル選択の導入により、学習プロセスがさらに高速化されます。FaLK-SVMの学習と複雑性の境界が導出され、このアプローチの実証的評価(データセットは最大300万ポイント)では、少なくとも高次元でないデータセットでは、最先端の正確で近似的なSVMソルバーよりもはるかに高速で正確かつスケーラブルであることが示されました。より一般的には、スケーラビリティを向上させるためにローカル情報を無視する傾向がある他の最近の技術とは異なり、ローカル性は学習アプローチとカーネルメソッドを合理的に高速化するための重要な要素になり得ることを示します。

Sparse Spectrum Gaussian Process Regression
スパーススペクトルガウス過程回帰

We present a new sparse Gaussian Process (GP) model forregression. The key novel idea is to sparsify the spectral representation of the GP. This leads to a simple, practicalalgorithm for regression tasks. We compare the achievable trade-offsbetween predictive accuracy and computational requirements, and showthat these are typically superior to existing state-of-the-artsparse approximations. We discuss both the weight space and functionspace representations, and note that the new construction impliespriors over functions which are always stationary, and canapproximate any covariance function in this class.

私たちは、回帰のための新しいスパースガウス過程(GP)モデルを紹介します。重要な斬新なアイデアは、GPのスペクトル表現をスパース化することです。これは、回帰タスクのためのシンプルで実用的なアルゴリズムにつながります。予測精度と計算要件との間の達成可能なトレードオフを比較し、これらが通常、既存の最先端のスパース近似よりも優れていることを示します。重み空間表現と関数空間表現の両方について説明し、新しい構成は常に静止している関数の事前分布を意味し、このクラスの任意の共分散関数を近似できることに注意してください。

Permutation Tests for Studying Classifier Performance
分類器の性能を研究するための順列検定

We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assess whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data.

私たちは、分類器の性能を評価するための順列ベースのp値のフレームワークを検討します。この論文では、2つの簡単な順列テストを検討します。最初のテストでは、分類器がデータ内に実際のクラス構造を見つけたかどうかを評価します。対応するヌル分布は、データ内のラベルを順列させることによって推定されます。このテストは、計算生物学の分類問題で広く使用されています。2番目のテストでは、分類器が分類で特徴間の依存関係を利用しているかどうかを調査します。対応するヌル分布は、統計で従来使用されている制限付きランダム化手法にヒントを得て、クラス内の特徴を順列させることによって推定されます。この新しいテストは、分類器の性能を向上させる上で貴重な情報となる記述的特徴を識別するのに役立ちます。これらのテストの特性を調査し、実際のデータと合成データに関する広範な実証的評価を示します。分析により、順列テストによる分類器の性能の調査は効果的であることがわかりました。特に、制限付き順列テストでは、分類器がデータ内の特徴間の相互依存性を利用しているかどうかが明確にわかります。

How to Explain Individual Classification Decisions
個々の分類決定をどのように説明するか

After building a classifier with modern tools of machine learning wetypically have a black box at hand that is able to predict well forunseen data. Thus, we get an answer to the question what isthe most likely label of a given unseen data point. However, most methods will provide no answer why the model predicted a particular label for a single instance and what features were most influential for that particular instance. The only method that is currently able to provide such explanations are decision trees. This paper proposes a procedure which (based on a set of assumptions) allows to explain the decisions of any classification method.

機械学習の最新のツールを使用して分類器を構築した後、通常、目に見えないデータを十分に予測できるブラックボックスが手元にありません。したがって、特定の目に見えないデータポイントの最も可能性の高いラベルは何かという質問に対する答えが得られます。ただし、ほとんどの方法では、モデルが1つのインスタンスに対して特定のラベルを予測した理由と、その特定のインスタンスに最も影響を与えた特徴は何かという答えは得られません。現在、このような説明を提供できる唯一の方法は、デシジョンツリーです。この論文では、(一連の仮定に基づいて)任意の分類方法の決定を説明できる手順を提案します。

The SHOGUN Machine Learning Toolbox
SHOGUN機械学習ツールボックス

We have developed a machine learning toolbox, called SHOGUN, which isdesigned for unified large-scale learning for a broad range of featuretypes and learning settings. It offers a considerable number of machinelearning models such as support vector machines, hidden Markov models, multiple kernel learning, lineardiscriminant analysis, andmore. Most of the specific algorithms are able to deal withseveral different data classes. Wehave used this toolbox in several applications from computationalbiology, some of them coming with no less than 50 million trainingexamples and others with 7 billion test examples. With more than athousand installations worldwide, SHOGUN is already widely adopted inthe machine learning community and beyond.SHOGUN is implemented in C++ and interfaces to MATLABTM, R, Octave, Python, andhas a stand-alone command line interface. The source code is freely available under the GNU General Public License, Version 3 at http://www.shogun-toolbox.org.

私たちは、幅広い特徴タイプと学習設定に対して統一的な大規模学習を実現するように設計された機械学習ツールボックス「SHOGUN」を開発しました。サポートベクターマシン、隠れマルコフモデル、多重カーネル学習、線形判別分析など、かなりの数の機械学習モデルを提供しています。特定のアルゴリズムのほとんどは、いくつかの異なるデータクラスを処理できます。このツールボックスは、Computationalbiologyのいくつかのアプリケーションで使用されており、その中には5,000万件以上のトレーニング例が付属しているものもあれば、70億個のテスト例が付属しているものもあります。世界中で1,000を超えるインストールを持つSHOGUNは、すでに機械学習コミュニティ内外で広く採用されています。SHOGUNはC++で実装されており、MATLABTM、R、Octave、Pythonとのインターフェースを持ち、スタンドアロンのコマンドラインインターフェースを備えています。ソースコードは、GNU General Public License、バージョン3、http://www.shogun-toolbox.orgの下で無料で入手できます。

Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
変分平均場アニーリングによるスパースグラフィカル因子モデルにおけるベイズ学習

We describe a class of sparse latent factor models, called graphical factor models (GFMs), and relevant sparse learning algorithms for posterior mode estimation. Linear, Gaussian GFMs have sparse, orthogonal factor loadings matrices, that, in addition to sparsity of the implied covariance matrices, alsoinduce conditional independence structures via zeros in the implied precision matrices. We describe the modelsand their use for robust estimation of sparse latent factor structure and data/signal reconstruction. We develop computationalalgorithms for model exploration and posterior mode search, addressing the hard combinatorial optimization involved in the search over a huge space of potential sparse configurations. A mean-field variational technique coupled with annealing isdeveloped to successively generate “artificial” posterior distributions that, at the limiting temperaturein the annealing schedule, define required posterior modes in the GFM parameter space. Several detailed empirical studies and comparisons to related approaches are discussed, includinganalyses of handwritten digit image and cancer gene expression data.

私たちは、グラフィカル因子モデル(GFM)と呼ばれるスパース潜在因子モデルのクラスと、事後モード推定のための関連するスパース学習アルゴリズムについて説明します。線形ガウスGFMにはスパース直交因子負荷行列があり、暗黙の共分散行列のスパース性に加えて、暗黙の精度行列のゼロを介して条件付き独立構造も誘発します。このモデルと、スパース潜在因子構造とデータ/信号再構成の堅牢な推定に使用する方法について説明します。モデル探索と事後モード検索の計算アルゴリズムを開発し、潜在的なスパース構成の巨大な空間の検索に伴う困難な組み合わせ最適化に対処します。平均場変分法とアニーリングを組み合わせた手法を開発し、アニーリングスケジュールの限界温度でGFMパラメータ空間で必要な事後モードを定義する「人工」事後分布を連続的に生成します。手書き数字画像や癌遺伝子発現データの分析など、いくつかの詳細な実証研究と関連アプローチとの比較について説明します。

Evolving Static Representations for Task Transfer
タスク転送のための静的表現の進化

An important goal for machine learning is to transfer knowledge between tasks. For example, learning to play RoboCup Keepaway should contribute to learning the full game of RoboCup soccer. Previous approaches to transfer in Keepaway have focused on transforming the original representation to fit the new task. In contrast, this paper explores the idea that transfer is most effective if the representation is designed to be the same even across different tasks. To demonstrate this point, a bird’s eye view (BEV) representation is introduced that can represent different tasks on the same two-dimensional map. For example, both the 3 vs. 2 and 4 vs. 3 Keepaway tasks can be represented on the same BEV. Yet the problem is that a raw two-dimensional map is high-dimensional and unstructured. This paper shows how this problem is addressed naturally by an idea from evolutionary computation called indirect encoding, which compresses the representation by exploiting its geometry. The result is that the BEV learns a Keepaway policy that transfers without further learning or manipulation. It also facilitates transferring knowledge learned in a different domain, Knight Joust, into Keepaway. Finally, the indirect encoding of the BEV means that its geometry can be changed without altering the solution. Thus static representations facilitate several kinds of transfer.

機械学習の重要な目標は、タスク間で知識を転送することです。たとえば、RoboCup Keepawayのプレイ方法を学習することは、RoboCupサッカーの完全なゲームを学習することに貢献するはずです。Keepawayでの転送に対するこれまでのアプローチは、元の表現を新しいタスクに合わせて変換することに重点を置いていました。対照的に、この論文では、異なるタスク間でも同じ表現になるように設計されている場合に転送が最も効果的であるという考えを検討します。この点を証明するために、同じ2次元マップ上で異なるタスクを表すことができる鳥瞰図(BEV)表現が導入されています。たとえば、3対2と4対3のKeepawayタスクはどちらも同じBEVで表すことができます。しかし、問題は、生の2次元マップが高次元で構造化されていないことです。この論文では、この問題が、間接エンコーディングと呼ばれる進化計算のアイデアによって自然に解決される方法を示しています。間接エンコーディングは、ジオメトリを活用して表現を圧縮します。その結果、BEVは、それ以上の学習や操作なしで転送するKeepawayポリシーを学習します。また、別のドメインであるKnight Joustで学習した知識をKeepawayに転送することも容易になります。最後に、BEVの間接的なエンコードは、ソリューションを変更することなくそのジオメトリを変更できることを意味します。このように、静的表現はさまざまな種類の転送を容易にします。

FastInf: An Efficient Approximate Inference Library
FastInf: 効率的な近似推論ライブラリ

The FastInf C++ library is designed to perform memory and time efficient approximate inference in large-scale discrete undirected graphical models. The focus of the library is propagation based approximate inference methods, ranging from the basic loopy belief propagation algorithm to propagation based on convex free energies. Various message scheduling schemes that improve on the standard synchronous or asynchronous approaches are included. Also implemented are a clique tree based exact inference, Gibbs sampling, and the mean field algorithm. In addition to inference, FastInf provides parameter estimation capabilities as well as representation and learning of shared parameters. It offers a rich interface that facilitates extension of the basic classes to other inference and learning methods.

FastInf C++ライブラリは、大規模な離散無向グラフィカルモデルでメモリと時間効率の良い近似推論を実行するように設計されています。このライブラリの焦点は、基本的なループ信念伝搬アルゴリズムから凸自由エネルギーに基づく伝搬まで、伝搬ベースの近似推論法です。標準の同期または非同期アプローチを改善するさまざまなメッセージスケジューリングスキームが含まれています。また、クリークツリーベースの正確な推論、ギブスサンプリング、および平均場アルゴリズムも実装されています。推論に加えて、FastInfはパラメータ推定機能だけでなく、共有パラメータの表現と学習も提供します。これは、基本クラスを他の推論および学習方法に拡張することを容易にする豊富なインターフェイスを提供します。

Estimation of a Structural Vector Autoregression Model Using Non-Gaussianity
非ガウス性を用いた構造ベクトル自己回帰モデルの推定

Analysis of causal effects between continuous-valued variables typically uses either autoregressive models or structural equation models with instantaneous effects. Estimation of Gaussian, linear structural equation models poses serious identifiability problems, which is why it was recently proposed to use non-Gaussian models. Here, we show how to combine the non-Gaussian instantaneous model with autoregressive models. This is effectively what is called a structural vector autoregression (SVAR) model, and thus our work contributes to the long-standing problem of how to estimate SVAR’s. We show that such a non-Gaussian model is identifiable without prior knowledge of network structure. We propose computationally efficient methods for estimating the model, as well as methods to assess the significance of the causal influences. The model is successfully applied on financial and brain imaging data.

連続値変数間の因果効果の解析では、通常、自己回帰モデルまたは瞬間的な効果を持つ構造方程式モデルを使用します。ガウス、線形構造方程式モデルの推定は深刻な識別可能性の問題を引き起こすため、最近、非ガウスモデルの使用が提案されました。ここでは、非ガウス瞬時モデルと自己回帰モデルを組み合わせる方法を示します。これは事実上、構造ベクトル自己回帰(SVAR)モデルと呼ばれるものであり、したがって、私たちの研究は、SVARをどのように推定するかという長年の問題に貢献しています。このような非ガウスモデルは、ネットワーク構造の予備知識がなくても識別可能であることを示します。モデルを推定するための計算効率の高い方法と、因果関係の影響の重要性を評価する方法を提案します。このモデルは、財務および脳の画像データにうまく適用されています。

Consensus-Based Distributed Support Vector Machines
コンセンサスベースの分散サポートベクターマシン

This paper develops algorithms to train support vector machines whentraining data are distributed across different nodes,and their communication to a centralized processing unit isprohibited due to, for example, communication complexity, scalability, orprivacy reasons. To accomplish this goal, the centralized linear SVMproblem is cast as a set of decentralized convex optimizationsub-problems (one per node) with consensus constraints on the wantedclassifier parameters. Using the alternating direction method ofmultipliers, fully distributed training algorithms are obtainedwithout exchanging training data among nodes. Different fromexisting incremental approaches, the overhead associated withinter-node communications is fixed and solely dependent on thenetwork topology rather than the size of the training sets availableper node. Important generalizations to train nonlinear SVMs in adistributed fashion are also developed along with sequential variantscapable of online processing. Simulated tests illustrate theperformance of the novelalgorithms.

この論文では、トレーニングデータが異なるノードに分散され、通信の複雑さ、スケーラビリティ、プライバシーなどの理由で集中処理ユニットへの通信が禁止されている場合に、サポートベクターマシンをトレーニングするアルゴリズムを開発します。この目標を達成するために、集中型線形SVMの問題は、必要な分類器パラメータに対するコンセンサス制約を持つ分散型凸最適化サブ問題(ノードごとに1つ)のセットとして表現されます。乗数の交互方向法を使用すると、ノード間でトレーニングデータを交換することなく、完全に分散されたトレーニングアルゴリズムが得られます。既存の増分アプローチとは異なり、ノード間通信に関連するオーバーヘッドは固定されており、ノードごとに使用できるトレーニングセットのサイズではなく、ネットワークトポロジにのみ依存します。分散方式で非線形SVMをトレーニングするための重要な一般化も、オンライン処理が可能なシーケンシャルバリアントとともに開発されています。シミュレーションテストは、新しいアルゴリズムのパフォーマンスを示しています。

Introduction to Causal Inference
因果推論の概要

The goal of many sciences is to understand the mechanisms by whichvariables came to take on the values they have (that is, to find agenerative model), and to predict what the values of those variableswould be if the naturally occurring mechanisms were subject to outsidemanipulations. The past 30 years has seen a number of conceptualdevelopments that are partial solutions to the problem of causalinference from observational sample data or a mixture of observationalsample and experimental data, particularly in the area of graphicalcausal modeling. However, in many domains, problems such as the largenumbers of variables, small samples sizes, and possible presence ofunmeasured causes, remain serious impediments to practical applicationsof these developments. The articles in the Special Topic on Causalityaddress these and other problems in applying graphical causal modelingalgorithms. This introduction to the Special Topic on Causalityprovides a brief introduction to graphical causal modeling, places thearticles in a broader context, and describes the differences betweencausal inference and ordinary machine learning classification and prediction problems.

多くの科学の目標は、変数が現在の値をとるようになったメカニズムを理解すること(つまり、生成モデルを見つけること)、および自然に発生するメカニズムが外部から操作された場合にそれらの変数の値がどうなるかを予測することです。過去30年間、特にグラフィカル因果モデリングの分野で、観測サンプルデータまたは観測サンプルと実験データの混合からの因果推論の問題に対する部分的な解決策となる概念的開発が数多く見られました。ただし、多くの領域で、変数の数が多い、サンプルサイズが小さい、測定されていない原因が存在する可能性があるなどの問題が、これらの開発の実際の適用に対する重大な障害となっています。因果関係に関する特別トピックの記事では、グラフィカル因果モデリングアルゴリズムの適用におけるこれらの問題やその他の問題を取り上げています。因果関係に関する特別トピックのこの紹介では、グラフィカル因果モデリングについて簡単に紹介し、記事をより広い文脈に位置付け、因果推論と通常の機械学習の分類および予測問題の違いについて説明します。

On the Foundations of Noise-free Selective Classification
ノイズフリー選択的分類の基礎について

We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifiercoverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off.Our main objective is to characterize this trade-off and to construct algorithms that can optimally or nearoptimally achieve the best possible trade-offs in a controlled manner.For noise-free models we present in this papera thorough analysis of selective classification including characterizations of RC trade-offsin various interesting settings.

私たちは、「リジェクトオプションによる分類」を指すために採用している用語である選択的分類を検討しています。選択的分類の本質は、分類子のカバレッジをトレードオフして精度を高めることです。このトレードオフをリスク・カバレッジ(RC)トレードオフと呼んでいます。私たちの主な目的は、このトレードオフを特徴付け、制御された方法で可能な限り最高のトレードオフを最適またはほぼ達成できるアルゴリズムを構築することです。ノイズフリーモデルについては、この論文で、さまざまな興味深い設定でのRCトレードオフの特性評価を含む選択的分類の徹底的な分析を示します。

MOA: Massive Online Analysis
MOA:大規模なオンライン分析

Massive Online Analysis (MOA)is a software environment for implementing algorithms and running experimentsfor online learning from evolving data streams.MOA includes a collection of offline and online methods as well as tools for evaluation. In particular, it implements boosting, bagging, and Hoeffding Trees, all with and without Naïve Bayes classifiers at the leaves.MOA supports bi-directional interaction with WEKA, the WaikatoEnvironment for Knowledge Analysis,and is released under the GNU GPL license.

Massive Online Analysis(MOA)は、進化するデータストリームからオンライン学習するためのアルゴリズムを実装し、実験を実行するためのソフトウェア環境です。MOAには、オフラインとオンラインのメソッドのコレクションと、評価用のツールが含まれています。特に、ブースティング、バギング、およびヘフディングツリーを実装しますが、これらはすべて、葉のNaïve Bayes分類子の有無にかかわらず行われます。MOAは、WEKA(WaikatoEnvironment for Knowledge Analysis)との双方向の相互作用をサポートしており、GNU GPLライセンスの下でリリースされています。

Near-optimal Regret Bounds for Reinforcement Learning
強化学習のための最適後悔限界

For undiscounted reinforcement learning in Markov decisionprocesses (MDPs) we consider the total regret ofa learning algorithm with respect to an optimal policy.In order to describe the transition structure of an MDP we propose a new parameter:An MDP has diameter D if for any pair of states s,s’ there isa policy which moves from s to s’ in at most D steps (on average).We present a reinforcement learning algorithm with total regretÕ(DS√AT) after T steps for any unknown MDPwith S states, A actions per state, and diameter D.A corresponding lower bound of Ω(√DSAT) on thetotal regret of any learning algorithm is given as well.These results are complemented by a sample complexity bound on thenumber of suboptimal steps taken by our algorithm. This bound can beused to achieve a (gap-dependent) regret bound that is logarithmic in T.Finally, we also consider a setting where the MDP is allowed to changea fixed number of l times. We present a modification of our algorithmthat is able to deal with this setting and show a regret bound ofÕ(l1/3T2/3DS√A).

マルコフ決定プロセス(MDP)における割引なし強化学習について、最適ポリシーに関する学習アルゴリズムの総後悔を考慮します。MDPの遷移構造を記述するために、新しいパラメーターを提案します。MDPの直径はDです。これは、状態s、s’の任意のペアに対して、sからs’に最大Dステップ(平均)で移動するポリシーが存在する場合です。S状態、状態あたりAアクション、直径Dを持つ任意の未知のMDPに対して、Tステップ後に総後悔(DS√AT)を持つ強化学習アルゴリズムを示します。任意の学習アルゴリズムの総後悔に対する対応する下限 Ω(√DSAT)も示します。これらの結果は、アルゴリズムによって実行される最適でないステップの数のサンプル複雑性境界によって補完されます。この境界を使用して、Tで対数となる(ギャップに依存する)後悔境界を実現できます。最後に、MDPがl回固定回数変更できる設定も検討します。私たちは、この設定に対処できるアルゴリズムの修正を提示し、Õ(l1/3T2/3DS√A)の後悔境界を示します。

Hilbert Space Embeddings and Metrics on Probability Measures
ヒルベルト空間埋め込みと確率測度に関するメトリクス

A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk, indexed by the kernel function k that defines the inner product in the RKHS.We present three theoretical properties of γk. First, we consider the question of determining the conditions on the kernel kfor which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributionscoincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on ℜd, then it is characteristic if and only if the support of its Fourier transform is the entire ℜd. Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the nature of the topology induced by γk, we relate γk to other popular metrics on probability measures, and present conditions on the kernel k under which γk metrizes the weak topology.

確率測度のヒルベルト空間埋め込みが最近提案され、次元削減、同次性テスト、独立性テストなどの用途があります。この埋め込みは、任意の確率測度を再生カーネルヒルベルト空間(RKHS)の平均要素として表します。確率測度の空間上の擬似計量は、分布埋め込み間の距離として定義できます。これを γkと表記し、RKHSの内積を定義するカーネル関数kでインデックス付けします。γkの3つの理論的特性を示します。まず、γkが計量となるカーネルkの条件を決定する問題について検討します。このようなkは特性カーネルと呼ばれます。擬似計量とは異なり、計量は2つの分布が一致する場合にのみゼロになるため、RKHS埋め込みがすべての分布を一意にマッピングします(つまり、埋め込みが単射である)。以前に公開された条件は、限られた状況(コンパクトなドメインなど)にのみ適用され、確認が困難ですが、私たちの条件は単純で直感的です。積分的に厳密な正定値カーネルは特性カーネルです。あるいは、有界連続カーネルが ℜd上で並進不変である場合、そのフーリエ変換のサポートが ℜd全体である場合に限り、カーネルは特性カーネルになります。次に、分布の違いが高頻度で発生する場合、分布は埋め込み空間内で近いことを実証することにより、γkの下での分布間の距離がカーネルと分布の特性間の相互作用から生じることを示します。3番目に、γkによって誘導されるトポロジーの性質を理解するために、γkを確率測度に関する他の一般的な測定基準に関連付け、γkが弱いトポロジーを測定できるカーネルkの条件を示します。

Quadratic Programming Feature Selection
二次計画法の特徴選択

Identifying a subset of features that preserves classification accuracyis a problem of growing importance,because of the increasing size and dimensionality of real-world data sets. We propose a new feature selection method,named Quadratic Programming Feature Selection (QPFS),that reduces the task to a quadratic optimization problem.In order to limit the computational complexity of solving the optimization problem,QPFS uses the Nyström method for approximate matrix diagonalization.QPFS is thus capable of dealing with very large data sets,for which the use of other methods is computationally expensive. In experiments with small and medium data sets,the QPFS method leads to classification accuracy similar to that of other successful techniques. For large data sets, QPFS is superior in terms of computational efficiency.

分類精度を保持するフィーチャのサブセットを特定することは、実世界のデータセットのサイズと次元が増加するにつれて、重要性を増す問題です。私たちは、タスクを二次最適化問題に縮小するQuadratic Programming Feature Selection(QPFS)という名前の新しい特徴選択方法を提案します。最適化問題を解く際の計算の複雑さを制限するために、QPFSは近似行列の対角化にナイストロム法を使用します。したがって、QPFSは、他の方法を使用すると計算コストがかかる非常に大きなデータセットを処理できます。中小規模のデータセットを用いた実験では、QPFS法は他の成功した手法と同様の分類精度をもたらします。大規模なデータセットの場合、QPFSは計算効率の点で優れています。

Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
線形SVMによる低次多項式データマッピングの学習とテスト

Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space,but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues.The approach enjoys fast training and testing,but may sometimes achieve accuracy close to that ofusing highly nonlinear kernels.Empirical experiments show that the proposed method is useful for certain large-scale data sets.We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements.

SVMでは、データを高次元空間に変換することで線形的に分離できない問題を処理するために、カーネル手法が長い間使用されてきましたが、大規模なデータセットのトレーニングとテストには時間がかかることがよくあります。対照的に、カーネルなしで線形SVMを使用すると、はるかに大きなデータセットを効率的にトレーニングおよびテストできます。この作業では、多項式マッピングデータの明示的な形式に高速線形SVM法を適用し、実装の問題を調査します。このアプローチは、高速なトレーニングとテストを享受できますが、高度に非線形なカーネルを使用した場合に近い精度を達成できる場合があります。実証実験により、提案手法は特定の大規模データセットに有用であることが示されています。私たちは、提案された方法を自然言語処理(NLP)アプリケーションに適用し、一部のトレーニング/テスト速度要件の下でテスト精度を向上させることに成功しました。

Characterization, Stability and Convergence of Hierarchical Clustering Methods
階層的クラスタリング手法の特性評価、安定性、収束

We study hierarchical clustering schemes under an axiomatic view. We show that within this framework, one can prove a theorem analogous to one of Kleinberg (2002), in which one obtains an existence and uniqueness theorem instead of a non-existence result. We explore further properties of this unique scheme: stability and convergence are established. We represent dendrograms as ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods.

私たちは、公理的な視点の下で階層的クラスタリングスキームを研究します。このフレームワーク内で、Kleinberg (2002)の定理に類似した定理を証明できることを示す。この定理では、非存在結果の代わりに存在と一意性の定理が得られます。このユニークなスキームのさらなる特性を探求します:安定性と収束が確立されます。樹木図を超メートル空間として表し、メートル幾何学のツール、つまりグロモフ・ハウスドルフ距離を使用して、入力メートル空間の摂動が階層的方法の結果に影響を与える程度を定量化します。

Consistent Nonparametric Tests of Independence
独立性の一貫したノンパラメトリック検定

Three simple and explicit procedures for testing the independenceof two multi-dimensional random variables are described. Two of the associated test statistics (L1,log-likelihood) are defined when the empiricaldistribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure.Two kinds of tests are provided. Distribution-free strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these testsmake almost surely no Type I or Type II errorafter a random sample size.Asymptoticallyα-level tests are obtained from the limiting distribution of the test statistics.For the latter tests, the Type I error convergesto a fixed non-zero value α, and the Type II error drops to zero, for increasing sample size.All tests reject the null hypothesis of independence if the teststatistics become large. The performance of the tests is evaluated experimentally onbenchmark data.

2つの多次元確率変数の独立性をテストするための3つの単純で明示的な手順について説明します。関連する検定統計量のうち2つ(L1,log-likelihood)は、変数の経験分布が有限分割に制限されている場合に定義されます。3番目のテスト統計は、カーネルベースの独立性尺度として定義されます。2種類のテストが用意されています。分布のない強い一貫性のある検定は、検定統計の大きな偏差限界に基づいて導出されます:これらの検定では、ランダムなサンプルサイズの後にほぼ確実にタイプIまたはタイプIIのエラーが発生しません。漸近αレベルの検定は、検定統計量の限定分布から得られます。後者のテストでは、タイプI誤差は固定のゼロ以外の値αに収束し、タイプII誤差はゼロに低下してサンプルサイズが増加します。すべての検定は、検定統計量が大きくなると、独立性の帰無仮説を棄却します。テストのパフォーマンスは、ベンチマークデータで実験的に評価されます。

Learning Translation Invariant Kernels for Classification
分類のための翻訳不変カーネルの学習

Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show that how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It will be proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the number of cosine kernels, instead of support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method.

カーネル関数はアルゴリズムの特徴空間を暗黙的に定義し、カーネル法の成功にはその適切な選択が重要な役割を果たします。この論文では、バイナリ分類のタスクに対する変換不変カーネルのクラスでカーネル関数を最適化する問題について検討します。このクラスの学習能力は、特徴の回転とスケーリングに対して不変であり、ラジアルカーネルのセットを包含します。変換不変カーネル関数をサブクラスのネストされたセットに埋め込む方法を示し、これらのサブクラスの1つでのカーネル学習問題について検討します。これにより、手元の問題に基づいて適切なサブクラスを選択できます。Lanckrietら(2004)が提案した基準を使用して、問題の関数定式化を取得します。最適なカーネルはコサイン関数の有限混合であることが証明されます。次に、カーネル学習問題は、一連の二次制約二次計画法(QCQP)サブ問題によって解決される半無限計画法(SIP)問題として定式化されます。コサインカーネルがランク2であるという事実を使用して、カーネルマトリックスをメモリにロードする必要がないQCQPサブ問題の定式化を提案し、この方法を大規模な問題に適用できるようにします。また、学習プロセスに個別のカーネルや等方性ガウスカーネルなどの他のクラスのカーネルを含める問題にも対処します。提案された方法のもう1つの興味深い特徴は、最適な分類器がサポートベクトルではなくコサインカーネルの数に関して拡張され、実行時に顕著な高速化をもたらすことです。副産物として、カーネルトリックを複素数値カーネル関数に一般化することもできます。USPSおよびMNIST数字認識データセットを含む人工および実際のベンチマークデータセットでの実験により、提案された方法の有用性が示されました。

Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
教師なし教師あり学習 I: ラベルなしの分類誤差と回帰誤差の推定

Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data.

分類器または回帰モデルのエラー率を推定することは、機械学習の基本的なタスクであり、これまで教師あり学習手法のみを使用して研究されてきました。私たちは、ラベルのないデータと穏やかな仮定のみを使用してこれらのエラー率を推定するための新しい教師なしフレームワークを提案します。このフレームワークの一貫性結果を証明し、合成データと実世界データの両方での実用的な適用性を実証します。

Learning From Crowds
群衆から学ぶ

For many supervised learning tasks it may be infeasible (or veryexpensive) to obtain objective and reliable labels. Instead, we can collect subjective (possibly noisy)labels from multiple experts or annotators. In practice, there is asubstantial amount of disagreement among the annotators, and hence it isof great practical interest to address conventional supervised learning problems inthis scenario. In this paper we describe a probabilistic approach for supervised learning when we have multiple annotators providing (possibly noisy) labels but no absolute gold standard. The proposed algorithm evaluates the different experts and also gives an estimate of the actual hidden labels. Experimental results indicate that the proposed method is superior to the commonly used majority voting baseline.

多くの教師あり学習タスクでは、客観的で信頼性の高いラベルを取得することは実行不可能(または非常に高価)な場合があります。代わりに、複数の専門家やアノテーターから主観的な(おそらくノイズの多い)ラベルを収集できます。実際には、アノテーター間でかなりの量の意見の相違があり、したがって、このシナリオで従来の教師あり学習の問題に対処することは、非常に実用的な関心事です。この論文では、複数のアノテーターが(おそらくノイズの多い)ラベルを提供するが、絶対的なゴールドスタンダードがない場合の教師あり学習の確率的アプローチについて説明します。提案されたアルゴリズムは、さまざまな専門家を評価し、実際の非表示ラベルの推定値も示します。実験結果は、提案された方法が一般的に使用される多数決ベースラインよりも優れていることを示しています。

Approximate Inference on Planar Graphs using Loop Calculus and Belief Propagation
ループ計算と信念伝播を使用した平面グラフ上の近似推論

We introduce novel results for approximate inference on planar graphical modelsusing the loop calculus framework. The loop calculus (Chertkov and Chernyak, 2006a)allows to express the exact partition function of a graphical model as a finitesum of terms that can be evaluated once the belief propagation (BP) solution isknown. In general, full summation over all correction terms is intractable.We develop an algorithm for the approach presented in Chertkov et al. (2008) whichrepresents an efficient truncation scheme on planar graphs and a newrepresentation of the series in terms of Pfaffians of matrices. We analyze theperformance of the algorithm for models with binary variables and pairwiseinteractions on grids and other planar graphs. We study in detail both theloop series and the equivalent Pfaffian series and show that the first term ofthe Pfaffian series for the general, intractable planar model, can provide veryaccurate approximations. The algorithm outperforms previous truncation schemesof the loop series and is competitive with other state of the art methods forapproximate inference.

私たちは、ループ計算フレームワークを使用して、平面グラフィカルモデルでの近似推論に関する新しい結果を紹介します。ループ計算(ChertkovおよびChernyak、2006a)を使用すると、グラフィカルモデルの正確なパーティション関数を、ビリーフプロパゲーション(BP)ソリューションがわかれば評価できる項の有限和として表現できます。一般に、すべての補正項の完全な合計は扱いにくいものです。Chertkovら(2008)で提示されたアプローチのアルゴリズムを開発します。このアルゴリズムは、平面グラフでの効率的な切り捨てスキームと、行列のPfaffianによる級数の新しい表現を表します。グリッドやその他の平面グラフ上でバイナリ変数とペアワイズ相互作用を持つモデルに対するアルゴリズムのパフォーマンスを分析します。ループ級数と同等のPfaffian級数の両方を詳細に研究し、一般的な扱いにくい平面モデルのPfaffian級数の最初の項が非常に正確な近似を提供できることを示します。このアルゴリズムは、ループシリーズの以前の切り捨て方式よりも優れており、近似推論の他の最先端の方法と競合します。

Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
ベイズ推定における制限付きボルツマンマシンの確率的複雑性と一般化誤差

In this paper, we consider the asymptotic form of the generalization error for the restricted Boltzmann machine in Bayesian estimation.It has been shown that obtaining the maximum pole of zeta functions is related to the asymptotic form of the generalization error for hierarchical learning models (Watanabe, 2001a,b).Thezeta functionis defined by using a Kullback function.We use two methods to obtain the maximum pole:a new eigenvalue analysis method and a recursive blowing up process.We show that these methods are effective for obtaining the asymptotic form of the generalization errorof hierarchical learning models.

この論文では、ベイズ推定における制限付きボルツマンマシンの一般化誤差の漸近形式について考察します。ゼータ関数の最大極を求めることは、階層学習モデルの一般化誤差の漸近形式と関連していることが示されている(Watanabe, 2001a,b)。ゼータ関数は、Kullback関数を使用して定義されます。最大極を求めるために、新しい固有値解析法と再帰的ブローアップ法の2つの方法を使用します。これらの方法が、階層学習モデルの一般化誤差の漸近形式を取得するのに有効であることを示します。

Graph Kernels
グラフカーネル

We present a unified framework to study graph kernels, special casesof which include the random walk (Gärtner et al., 2003; Borgwardt et al., 2005)and marginalized (Kashima et al., 2003, 2004; Mahét al., 2004) graph kernels. Through reduction to a Sylvester equation we improve thetime complexity of kernel computation between unlabeled graphs with n vertices from O(n6) to O(n3). We find a spectraldecomposition approach even more efficient when computing entire kernelmatrices. For labeled graphs we develop conjugate gradient and fixed-pointmethods that take O(dn3) time per iteration, where d is the size of thelabel set. By extending the necessary linear algebra to Reproducing KernelHilbert Spaces (RKHS) we obtain the same result for d-dimensional edgekernels, and O(n4) in the infinite-dimensional case; on sparse graphsthese algorithms only take O(n2) time per iteration in all cases. Experimentson graphs from bioinformatics and other application domains show thatthese techniques can speed up computation of the kernel by an order ofmagnitude or more. We also show that certain rational kernels(Cortes et al., 2002, 2003, 2004) when specialized to graphsreduce to our random walk graph kernel. Finally, we relate ourframework to R-convolution kernels (Haussler, 1999) and provide akernel that is close to the optimal assignment kernel ofkernel of Fröhlich et al. (2006) yet provably positive semi-definite.

私たちは、グラフカーネルを研究するための統一されたフレームワークを提示します。その特殊なケースには、ランダムウォーク(Gärtner他、2003年、Borgwardt他、2005年)および周辺化(Kashima他、2003年、2004年、Mahét他、2004年)グラフカーネルが含まれます。シルベスター方程式への還元により、n頂点のラベルなしグラフ間のカーネル計算の時間計算量をO(n6)からO(n3)に改善します。カーネル行列全体を計算する場合は、スペクトル分解アプローチがさらに効率的であることがわかりました。ラベル付きグラフの場合、反復ごとにO(dn3)時間かかる共役勾配法と固定小数点法を開発します。ここで、dはラベルセットのサイズです。必要な線形代数を再生カーネルヒルベルト空間(RKHS)に拡張することで、d次元エッジカーネルで同じ結果が得られ、無限次元の場合はO(n4)になります。スパースグラフでは、これらのアルゴリズムはすべてのケースで反復ごとにO(n2)時間しかかかりません。バイオインフォマティクスやその他のアプリケーションドメインのグラフでの実験では、これらの手法によりカーネルの計算が1桁以上高速化されることが示されています。また、特定の有理カーネル(Cortes他、2002、2003、2004)をグラフに特化すると、ランダムウォークグラフカーネルに縮小されることも示されています。最後に、フレームワークをR畳み込みカーネル(Haussler、1999)に関連付け、Fröhlich他(2006)のカーネルの最適割り当てカーネルに近いカーネルを提供しますが、半正定であることが証明されています。

A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
機械学習における非平滑凸最適化問題に対する準ニュートン法

We extend the well-known BFGS quasi-Newton method and itsmemory-limited variant LBFGS to the optimization of nonsmooth convexobjectives. This is done in a rigorous fashion by generalizing threecomponents of BFGS to subdifferentials: the local quadratic model,the identification of a descent direction, and the Wolfe line searchconditions. We prove that under some technical conditions, theresulting subBFGS algorithm is globally convergent in objectivefunction value. We apply its memory-limited variant (subLBFGS)to L2-regularized risk minimization with the binary hingeloss. To extend our algorithm to the multiclass and multilabelsettings, we develop a new, efficient, exact line searchalgorithm. We prove its worst-case time complexity bounds, and showthat our line search can also be used to extend a recently developedbundle method to the multiclass and multilabel settings.We also apply the direction-finding component of our algorithm toL1-regularized risk minimization with logistic loss. In all thesecontexts our methods perform comparable to or better thanspecialized state-of-the-art solvers on a number of publiclyavailable data sets. An open source implementation of ouralgorithms is freely available.

私たちは、よく知られているBFGS準ニュートン法とそのメモリ制限バリアントLBFGSを、滑らかでない凸目的関数の最適化に拡張します。これは、BFGSの3つのコンポーネント(ローカル二次モデル、降下方向の識別、およびWolfeライン検索条件)をサブ微分に一般化することで厳密に行われます。いくつかの技術的条件下では、結果として得られるサブBFGSアルゴリズムが目的関数値でグローバルに収束することを証明します。そのメモリ制限バリアント(サブLBFGS)を、バイナリヒンジ損失によるL2正規化リスク最小化に適用します。このアルゴリズムをマルチクラスおよびマルチラベル設定に拡張するために、新しい効率的な正確なライン検索アルゴリズムを開発します。最悪の場合の時間計算量境界を証明し、直線探索を使用して、最近開発されたバンドル法をマルチクラスおよびマルチラベル設定に拡張できることを示します。また、アルゴリズムの方向検出コンポーネントを、ロジスティック損失を伴うL1正規化リスク最小化に適用します。これらすべてのコンテキストで、私たちの方法は、公開されている多数のデータセットに対して、最先端の専用ソルバーと同等かそれ以上のパフォーマンスを発揮します。私たちのアルゴリズムのオープンソース実装は、無料で入手できます。

SFO: A Toolbox for Submodular Function Optimization
SFO: サブモジュラ関数最適化のためのツールボックス

In recent years, a fundamental problem structure has emerged as very useful in a variety of machine learning applications: Submodularity is an intuitive diminishing returns property, stating that adding an element to a smaller set helps more than adding it to a larger set. Similarly to convexity, submodularity allows one to efficiently find provably (near-) optimal solutions for large problems.We present SFO, a toolbox for use in MATLAB or Octave that implements algorithms for minimization and maximization of submodular functions. A tutorial script illustrates the application of submodularity to machine learning and AI problems such as feature selection, clustering, inference and optimized information gathering.

近年、基本的な問題構造がさまざまな機械学習アプリケーションで非常に役立つものとして浮上しています:サブモジュール性は直感的な収穫逓減プロパティであり、小さなセットに要素を追加すると、大きなセットに追加するよりも役立つと述べています。凸性と同様に、部分モジュール性により、大きな問題に対して証明可能な(ほぼ)最適な解決策を効率的に見つけることができます。ここでは、サブモジュラ関数の最小化と最大化のためのアルゴリズムを実装する、MATLABまたはOctaveで使用するツールボックスであるSFOを紹介します。チュートリアルスクリプトは、特徴選択、クラスタリング、推論、最適化された情報収集など、機械学習とAIの問題に対するサブモジュラリティの適用を示しています。

Continuous Time Bayesian Network Reasoning and Learning Engine
連続時間ベイジアンネットワーク推論および学習エンジン

We present a continuous time Bayesian network reasoning and learningengine (CTBN-RLE). A continuous time Bayesian network (CTBN) providesa compact (factored) description of a continuous-time Markov process.This software provides libraries and programs for most of the algorithmsdeveloped for CTBNs. For learning, CTBN-RLE implements structure andparameter learning for both complete and partial data. For inference,it implements exact inference and Gibbs and importance samplingapproximate inference for any type of evidence pattern. Additionally,the library supplies visualization methods for graphically displaying CTBNsor trajectories of evidence.

私たちは、連続時間ベイジアンネットワーク推論および学習エンジン(CTBN-RLE)を紹介します。連続時間ベイジアンネットワーク(CTBN)は、連続時間マルコフ過程のコンパクトな(因数分解された)記述を提供します。このソフトウェアは、CTBN用に開発されたほとんどのアルゴリズムのライブラリとプログラムを提供します。学習のために、CTBN-RLEは完全データと部分データの両方に対して構造とパラメータの学習を実装します。推論については、任意のタイプの証拠パターンに対して、正確な推論とギブスと重要度のサンプリング近似推論を実装します。さらに、このライブラリは、証拠のCTBNまたは軌跡をグラフィカルに表示するための視覚化方法を提供します。

Large Scale Online Learning of Image Similarity Through Ranking
ランキングによる画像の類似性の大規模オンライン学習

Learning a measure of similarity between pairs of objects is animportant generic problem in machine learning. It is particularlyuseful in large scale applications like searching for an image thatis similar to a given image or finding videos that are relevant to agiven video. In these tasks, users look for objects that are notonly visually similar but also semantically related to a givenobject. Unfortunately, the approaches that exist today for learningsuch semantic similarity do not scale to large data sets. This isboth because typically their CPU and storage requirements growquadratically with the sample size, and because many methods imposecomplex positivity constraints on the space of learned similarityfunctions.The current paper presents OASIS, an Online Algorithm forScalable Image Similarity learning that learns a bilinearsimilarity measure over sparse representations. OASIS is an onlinedual approach using the passive-aggressive family of learningalgorithms with a large margin criterion and an efficient hinge losscost. Our experiments show that OASIS is both fast and accurate at awide range of scales: for a data set with thousands of images, itachieves better results than existing state-of-the-art methods,while being an order of magnitude faster. For large, web scale,data sets, OASIS can be trained on more than two million images from150K text queries within 3 days on a single CPU. On this largescale data set, human evaluations showed that 35% of the ten nearestneighbors of a given test image, as found by OASIS, weresemantically relevant to that image. This suggests that queryindependent similarity could be accurately learned even for largescale data sets that could not be handled before.

オブジェクトのペア間の類似度の尺度を学習することは、機械学習における重要な一般的な問題です。これは、特定の画像に類似する画像を検索したり、特定のビデオに関連するビデオを見つけたりするような大規模なアプリケーションで特に役立ちます。これらのタスクでは、ユーザーは、特定のオブジェクトと視覚的に類似しているだけでなく、意味的にも関連しているオブジェクトを探します。残念ながら、このような意味的類似性を学習するための現在のアプローチは、大規模なデータセットには対応していません。これは、通常、CPUとストレージの要件がサンプルサイズの2乗で増加し、多くの方法では学習された類似度関数の空間に複雑な正値制約が課されるためです。現在の論文では、スパース表現に対して双線形類似度尺度を学習するスケーラブルな画像類似度学習のオンラインアルゴリズムであるOASISを紹介します。OASISは、大きなマージン基準と効率的なヒンジ損失コストを備えたパッシブアグレッシブファミリの学習アルゴリズムを使用するオンラインデュアルアプローチです。私たちの実験では、OASISが幅広いスケールで高速かつ正確であることが示されています。数千枚の画像を含むデータセットの場合、OASISは既存の最先端の方法よりも優れた結果を達成しながら、桁違いに高速です。大規模なWeb規模のデータセットの場合、OASISは1つのCPUで3日以内に150Kのテキストクエリから200万枚以上の画像でトレーニングできます。この大規模なデータセットでは、人間による評価により、OASISによって検出された特定のテスト画像の10個の最近傍のうち35%がその画像に意味的に関連していることが示されました。これは、クエリに依存しない類似性を、これまで処理できなかった大規模なデータセットでも正確に学習できることを示唆しています。

Analysis of Multi-stage Convex Relaxation for Sparse Regularization
スパース正則化のための多段凸緩和の解析

We consider learning formulations with non-convex objective functions that often occur in practical applications. There are two approaches to this problem:

私たちは、実際のアプリケーションでよく発生する非凸目的関数を持つ定式化の学習を考えます。この問題には、次の2つの方法があります。

Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes
グラフ構造線形プログラムのためのメッセージパッシング:近位法と丸めスキーム

The problem of computing a maximum a posteriori (MAP) configuration isa central computational challenge associated with Markov randomfields. There has been some focus on “tree-based” linearprogramming (LP) relaxations for the MAP problem. This paper developsa family of super-linearly convergent algorithms for solving theseLPs, based on proximal minimization schemes using Bregman divergences.As with standard message-passing on graphs, the algorithms aredistributed and exploit the underlying graphical structure, and soscale well to large problems. Our algorithms have a double-loopcharacter, with the outer loop corresponding to the proximal sequence,and an inner loop of cyclic Bregman projections used to compute eachproximal update. We establish convergence guarantees for our algorithms, and illustrate theirperformance via some simulations. We also develop two classes ofrounding schemes, deterministic and randomized, for obtaining integral configurations from the LP solutions.Our deterministic rounding schemes use a “re-parameterization” propertyof our algorithms so that when the LP solution is integral, the MAPsolution can be obtained even before the LP-solver converges to theoptimum. We also propose graph-structured randomized rounding schemes applicable to iterative LP-solving algorithms in general.We analyze the performance of and report simulations comparing these rounding schemes.

事後最大(MAP)構成を計算する問題は、マルコフランダムフィールドに関連する中心的な計算課題です。MAP問題に対する「ツリーベース」の線形計画法(LP)緩和法に焦点が当てられてきました。この論文では、ブレグマンダイバージェンスを使用した近位最小化スキームに基づいて、これらのLPを解決するための超線形収束アルゴリズムのファミリを開発します。グラフ上の標準的なメッセージパッシングと同様に、アルゴリズムは分散されており、基礎となるグラフィカル構造を活用しているため、大規模な問題にも適応できます。このアルゴリズムは、近位シーケンスに対応する外側のループと、各近位更新を計算するために使用される循環ブレグマン投影の内側のループという二重ループ特性を備えています。このアルゴリズムの収束保証を確立し、いくつかのシミュレーションでそのパフォーマンスを示します。また、LPソリューションから整数構成を取得するための2つのクラスの丸め方式(決定論的およびランダム化)も開発しました。決定論的丸め方式では、アルゴリズムの「再パラメータ化」プロパティを使用するため、LPソリューションが整数の場合、LPソルバーが最適に収束する前でもMAPソリューションを取得できます。また、一般的な反復LP解決アルゴリズムに適用できるグラフ構造のランダム化丸め方式も提案しています。これらの丸め方式のパフォーマンスを分析し、比較するシミュレーションを報告します。

Kronecker Graphs: An Approach to Modeling Networks
クロネッカーグラフ:ネットワークモデリングへのアプローチ

How can we generate realistic networks? In addition, how can we do so witha mathematically tractable model that allows for rigorous analysis ofnetwork properties? Real networks exhibit a long list of surprisingproperties: Heavy tails for the in- and out-degree distribution, heavytails for the eigenvalues and eigenvectors, small diameters, anddensification and shrinking diameters over time.Current network models and generators either fail to match several of the aboveproperties, are complicated to analyze mathematically, or both. Herewe propose a generative model for networks that is bothmathematically tractable and can generate networks that have all the abovementioned structural properties. Our main idea here is to use anon-standard matrix operation, the Kronecker product, to generategraphs which we refer to as “Kronecker graphs”.First, we show that Kronecker graphs naturally obey common networkproperties. In fact, we rigorously prove that they do so. We alsoprovide empirical evidence showing that Kronecker graphs can effectivelymodel the structure of real networks.We then present KRONFIT, a fast and scalable algorithm for fitting theKronecker graph generation model to large real networks. A naive approachto fitting would take super-exponential time. In contrast, KRONFIT takeslinear time, by exploiting the structure of Kronecker matrixmultiplication and by using statistical simulation techniques.Experiments on a wide range of large real and synthetic networks show that KRONFIT finds accurate parameters that very well mimic the properties of targetnetworks. In fact, using just four parameters we can accuratelymodel several aspects of global network structure.Once fitted, the model parameters can be used to gain insightsabout the network structure, and the resulting synthetic graphs can beused for null-models, anonymization, extrapolations, and graphsummarization.

どうすれば現実的なネットワークを生成できるでしょうか。さらに、ネットワーク特性の厳密な分析を可能にする数学的に扱いやすいモデルで、どうすれば実現できるでしょうか。実際のネットワークは、驚くべき特性の長いリストを示します。入次数分布と出次数分布の重い裾、固有値と固有ベクトルの重い裾、小さな直径、および時間の経過に伴う密度化と直径の縮小。現在のネットワークモデルとジェネレーターは、上記の特性のいくつかに一致しないか、数学的に分析するのが複雑であるか、またはその両方です。ここでは、数学的に扱いやすく、上記のすべての構造特性を持つネットワークを生成できるネットワークの生成モデルを提案します。ここでの主なアイデアは、非標準の行列演算であるクロネッカー積を使用して、「クロネッカーグラフ」と呼ばれるグラフを生成することです。まず、クロネッカーグラフが一般的なネットワーク特性に自然に従うことを示します。実際、それが厳密に証明されています。また、クロネッカーグラフが実際のネットワークの構造を効果的にモデル化できることを示す経験的証拠も提供します。次に、クロネッカーグラフ生成モデルを大規模な実際のネットワークに適合させる高速でスケーラブルなアルゴリズムであるKRONFITを紹介します。単純な適合方法では、超指数時間がかかります。対照的に、KRONFITは、クロネッカー行列乗算の構造を利用し、統計シミュレーション手法を使用することで、線形時間で済みます。さまざまな大規模な実際のネットワークと合成ネットワークでの実験により、KRONFITはターゲットネットワークの特性を非常によく模倣する正確なパラメーターを見つけることがわかりました。実際、4つのパラメーターのみを使用して、グローバルネットワーク構造のいくつかの側面を正確にモデル化できます。適合すると、モデルパラメーターを使用してネットワーク構造に関する洞察を得ることができ、結果として得られる合成グラフは、ヌルモデル、匿名化、外挿、およびグラフ要約に使用できます。

Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
弱ラベルデータを用いた半教師あり学習のための一般化期待基準

In this paper, we present an overview of generalized expectation criteria (GE), a simple, robust, scalable method for semi-supervised training using weakly-labeled data. GE fits model parameters by favoring models that match certain expectation constraints, such as marginal label distributions, on the unlabeled data. This paper shows how to apply generalized expectation criteria to two classes of parametric models: maximum entropy models and conditional random fields. Experimental results demonstrate accuracy improvements over supervised training and a number of other state-of-the-art semi-supervised learning methods for these models.

この論文では、弱ラベル化されたデータを使用した半教師ありトレーニングのためのシンプルで堅牢でスケーラブルな方法である一般化期待基準(GE)の概要を示します。GEは、ラベルのないデータに対して、周辺ラベル分布などの特定の期待制約に一致するモデルを優先して、モデルパラメーターを適合させます。この論文では、一般化期待基準をパラメトリックモデルの2つのクラス(最大エントロピーモデルと条件付きランダムフィールド)に適用する方法を示します。実験結果は、これらのモデルについて、教師ありトレーニングや他の多くの最先端の半教師あり学習方法よりも精度が向上することを示しています。

On Spectral Learning
スペクトル学習について

In this paper, we study the problem of learning a matrixW from a set of linear measurements. Our formulation consists insolving an optimization problem which involves regularization with aspectral penalty term. That is, the penalty term is a function of thespectrum of the covariance of W. Instances of this problem inmachine learning include multi-task learning, collaborative filteringand multi-view learning, among others. Our goal is to elucidate theform of the optimal solution of spectral learning. The theory ofspectral learning relies on the von Neumann characterization oforthogonally invariant norms and their association with symmetricgauge functions. Using this tool we formulate a representer theoremfor spectral regularization and specify it to several useful example,such as Schatten p-norms, trace norm and spectral norm, which shouldproved useful in applications.

この論文では、一連の線形測定値から行列Wを学習する問題を研究します。私たちの定式化は、非スペクトルペナルティ項による正則化を含む最適化問題を解くことで構成されます。つまり、ペナルティ項は、この問題のインスタンス機械学習におけるWの共分散のスペクトルには、マルチタスク学習、協調フィルタリング、マルチビュー学習などがあります。スペクトル学習の最適解の形を解明することを目標としています。スペクトル学習の理論は、直交不変ノルムのフォン・ノイマン特性と、それらが対称ゲージ関数と関連していることに依存しています。このツールを使用して、スペクトル正則化の表現定理を定式化し、それをSchatten pノルム、トレースノルム、スペクトルノルムなど、アプリケーションで役立つことが証明されているいくつかの有用な例に指定します。

On Learning with Integral Operators
積分演算子による学習について

A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations.The contribution of our paper is two-fold:1. We use a technique based on a concentration inequality for Hilbert spaces to provide new much simplified proofs for a number of results in spectral approximation.2. Using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from von Luxburg et al. (2008).

スペクトルクラスタリング、カーネル主成分分析、多くの多様体法など、多くの学習アルゴリズムは、経験的データに基づいて、類似関数またはカーネルによって定義される演算子の固有値と固有関数の推定に基づいています。したがって、アルゴリズムの解析では、そのような近似の品質を評価できることが重要な問題です。私たちの論文の貢献は2つあります:1。ヒルベルト空間の濃度不等式に基づく手法を使用して、スペクトル近似の多くの結果に対して新しい非常に単純化された証明を提供します2。これらの方法を使用して、グラフのスペクトル特性を推定するためのいくつかの新しい結果を提供します。ラプラシアン演算子は、von Luxburgら(2008)から結果を拡張および強化します。

Image Denoising with Kernels Based on Natural Image Relations
自然なイメージ関係に基づくカーネルによるイメージのノイズ除去

A successful class of image denoising methods is based on Bayesian approaches working inwavelet representations. The performance of these methods improves when relations among the local frequency coefficients are explicitly included. However, in these techniques, analyticalestimates can be obtained only for particular combinations of analytical modelsof signal and noise, thus precluding its straightforward extension to deal with other arbitrary noise sources.In this paper,we propose an alternative non-explicit way to take into account the relationsamong natural image wavelet coefficients for denoising: we use support vector regression(SVR) in the wavelet domain to enforce these relations in the estimated signal.Since relationsamong the coefficients are specific to the signal, theregularization property of SVR is exploited to remove the noise, which does notshare this feature. The specific signal relationsare encoded in an anisotropic kernel obtained from mutual information measures computed on arepresentative image database. In the proposed scheme,training considers minimizing the Kullback-Leibler divergence (KLD) between the estimatedand actual probability functions (or histograms) of signal and noise in orderto enforce similarity up to the higher (computationally estimable) order.Due to its non-parametric nature, the method can eventually copewith different noise sources without the need of an explicit re-formulation, as it isstrictly necessary under parametric Bayesian formalisms.Results under several noise levels and noise sources show that:(1) the proposed method outperforms conventional waveletmethods that assume coefficient independence, (2) it is similar to state-of-the-art methodsthat do explicitly include these relations when the noise source is Gaussian, and (3)it gives better numerical and visual performance when more complex, realisticnoise sources are considered. Therefore, theproposed machine learning approach can be seen as a moreflexible (model-free) alternative to the explicit descriptionof wavelet coefficient relations for image denoising.

画像のノイズ除去方法の成功例の1つは、ウェーブレット表現で動作するベイズ法です。これらの方法のパフォーマンスは、ローカル周波数係数間の関係が明示的に含まれると向上します。ただし、これらの手法では、信号とノイズの解析モデルの特定の組み合わせに対してのみ解析推定値を取得できるため、他の任意のノイズ源に対処するための直接的な拡張は不可能です。この論文では、ノイズ除去のために自然画像のウェーブレット係数間の関係を考慮するための、明示的でない代替方法を提案します。ウェーブレット領域でサポートベクター回帰(SVR)を使用して、推定信号でこれらの関係を適用します。係数間の関係は信号に固有のものであるため、この特性を共有しないノイズを除去するために、SVRの正規化特性を利用します。特定の信号関係は、代表的な画像データベースで計算された相互情報量から取得された異方性カーネルにエンコードされます。提案された方式では、トレーニングでは、信号とノイズの推定確率関数(またはヒストグラム)と実際の確率関数(またはヒストグラム)の間のKullback-Leiblerダイバージェンス(KLD)を最小化して、より高い(計算により推定可能な)次数までの類似性を強化することが考慮されています。この方法は、非パラメトリックであるため、パラメトリックベイズ形式の下では厳密に必要となる明示的な再定式化を必要とせずに、最終的にはさまざまなノイズ源に対処できます。いくつかのノイズレベルとノイズ源の下での結果は、(1)提案された方法は、係数の独立性を前提とする従来のウェーブレットメソッドよりも優れている、(2)ノイズソースがガウスである場合にこれらの関係を明示的に含める最先端の方法に似ている、(3)より複雑で現実的なノイズソースを考慮すると、数値的および視覚的なパフォーマンスが向上することを示しています。したがって、提案された機械学習アプローチは、画像ノイズ除去のためのウェーブレット係数関係の明示的な記述に代わる、より柔軟な(モデルフリーの)代替手段と見なすことができます。

A Streaming Parallel Decision Tree Algorithm
ストリーミング並列決定木アルゴリズム

We propose a new algorithm for building decision tree classifiers. The algorithm is executed in a distributed environment and is especially designed for classifying large data sets and streaming data. It is empirically shown to be as accurate as a standard decision tree classifier, while being scalable for processing of streaming data on multiple processors. These findings are supported by a rigorous analysis of the algorithm’s accuracy.The essence of the algorithm is to quickly construct histograms at the processors, which compress the data to a fixed amount of memory. A master processor uses this information to find near-optimal split points to terminal tree nodes. Our analysis shows that guarantees on the local accuracy of split points imply guarantees on the overall tree accuracy.

私たちは、決定木分類器を構築するための新しいアルゴリズムを提案します。このアルゴリズムは分散環境で実行され、特に大規模なデータセットとストリーミングデータの分類用に設計されています。これは、標準的なデシジョンツリー分類器と同じくらい正確でありながら、複数のプロセッサでのストリーミングデータの処理にスケーラブルであることが経験的に示されています。これらの知見は、アルゴリズムの精度の厳密な分析によって裏付けられています。このアルゴリズムの本質は、プロセッサでヒストグラムをすばやく構築し、データを固定量のメモリに圧縮することです。マスタープロセッサは、この情報を使用して、ターミナルツリーノードへの最適に近い分割ポイントを見つけます。私たちの分析では、分割ポイントのローカル精度の保証は、ツリー全体の精度の保証を意味することを示しています。

Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
最大エントロピーモデルのための反復スケーリングと座標降下法

Maximum entropy (Maxent) is useful in natural language processing andmany other areas. Iterative scaling (IS) methods are one of the mostpopular approaches to solve Maxent. With many variants of ISmethods, it is difficult to understand them and see the differences.In this paper, we create a general and unified framework for iterativescaling methods. This framework also connects iterative scaling andcoordinate descent methods. We prove general convergence results forIS methods and analyze their computational complexity. Based on theproposed framework, we extend a coordinate descent method for linearSVM to Maxent. Results show that it is faster than existing iterativescaling methods.

最大エントロピー(Maxent)は、自然言語処理やその他多くの分野で役立ちます。反復スケーリング(IS)法は、Maxentを解くための最も一般的なアプローチの1つです。ISメソッドには多くのバリエーションがあるため、それらを理解して違いを確認するのは困難です。この論文では、反復スケーリング手法の一般的で統一されたフレームワークを作成します。このフレームワークは、反復スケーリングと座標降下法も接続します。IS法の一般的な収束結果を証明し、その計算量を分析します。提案されたフレームワークに基づいて、linearSVMの座標降下法をMaxentに拡張します。結果は、既存の反復スケーリング方法よりも高速であることを示しています。

Stability Bounds for Stationary φ-mixing and β-mixing Processes
定常φ混合およびβ混合プロセスの安定限界

Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary φ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary φ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our φ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios.

学習理論における一般化境界のほとんどは、アルゴリズムとは無関係に、使用される仮説クラスの複雑さの何らかの尺度に基づいています。対照的に、アルゴリズムの安定性の概念は、特定の学習アルゴリズムの特定の特性を利用して、そのアルゴリズムに合わせた厳密な一般化境界を導き出すために使用できます。ただし、学習理論の多くと同様に、既存の安定性分析と境界は、サンプルが独立して同一に分布しているシナリオにのみ適用されます。ただし、多くの機械学習アプリケーションでは、この仮定は当てはまりません。学習アルゴリズムによって受信される観測には、多くの場合、何らかの固有の時間依存性があります。この論文では、観測が定常 φ 混合または β 混合シーケンスから抽出されるシナリオを検討します。これは、非i.i.d.プロセスの研究では広く採用されている仮定であり、時間の経過とともに観測間の依存性が弱まることを意味します。定常 φ 混合および β 混合シーケンスに対する、新しい独自の安定性ベースの一般化境界を証明します。これらの境界は、i.i.d.で指定された境界を厳密に一般化します。このケースはすべての安定した学習アルゴリズムに適用され、それによって安定性境界の使用を非i.i.d.シナリオに拡張します。また、サポートベクター回帰、カーネルリッジ回帰、サポートベクターマシン、および他の多くのカーネル正則化ベースおよび相対エントロピーベースの正則化アルゴリズムを含む一般的なクラスの学習アルゴリズムへの φ 混合一般化境界の適用についても説明します。したがって、これらの新しい境界は、これらのアルゴリズムを非i.i.d.シナリオで使用するための最初の理論的根拠と見なすことができます。

Maximum Relative Margin and Data-Dependent Regularization
最大相対マージンとデータ依存正則化

Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an l2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages.

サポートベクターマシン(SVM)やその類似品などの主要な分類方法は、データクラス間の分離マージンを最大化することで、強力な一般化パフォーマンスを実現します。最大マージンアプローチは有望なパフォーマンスを達成していますが、この記事では、データのアフィン変換とデータ拡散が大きい方向に対する感度について説明しています。最大マージンソリューションは、データの拡散によって誤解され、拡散が大きい方向に沿ってクラスを優先的に分離する可能性があります。この記事では、絶対的な意味でマージンを測定するのではなく、任意の投影方向のデータの拡散に対してのみ相対的にマージンを測定することで、これらの弱点を修正します。最大相対マージンは分類関数のデータ依存の正則化に対応し、最大絶対マージンは分類関数のl2ノルム制約に対応します。興味深いことに、提案された改善には、既存の最大マージン定式化への単純な拡張のみが必要であり、SVMの計算効率が維持されます。相対マージンを最大化することで、数字、テキスト分類などの実際の問題や、その他のいくつかのベンチマークデータセットで、驚くべきパフォーマンスの向上が実現します。さらに、Rademacher平均に基づく新しい定式化のリスク境界が導出されます。

PyBrain
パイブレイン

PyBrain is a versatile machine learning library for Python. Its goal is to provideflexible, easy-to-use yet still powerful algorithms for machine learning tasks, includinga variety of predefined environments and benchmarks to test and compare algorithms. Implemented algorithms include Long Short-Term Memory (LSTM), policy gradient methods, (multidimensional) recurrent neural networks and deep belief networks.

PyBrainは、Python用の汎用性の高い機械学習ライブラリです。その目標は、アルゴリズムをテストおよび比較するためのさまざまな事前定義された環境やベンチマークなど、機械学習タスクのための柔軟で使いやすく、かつ強力なアルゴリズムを提供することです。実装されているアルゴリズムには、長短期記憶(LSTM)、方策勾配法、(多次元)再帰型ニューラルネットワーク、ディープビリーフネットワークなどがあります。

A Fast Hybrid Algorithm for Large-Scale l1-Regularized Logistic Regression
大規模l1正則化ロジスティック回帰のための高速ハイブリッドアルゴリズム

l1-regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of l1 regularization attributes attractive properties to the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning. When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable l1-norm in the objective function. Motivated by recent work (Koh et al., 2007; Hale et al., 2008), we propose a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate. Called hybrid iterative shrinkage (HIS), the resulting algorithm is comprised of a fixed point continuation phase and an interior point phase. The first phase is based completely on memory efficient operations such as matrix-vector multiplications, while the second phase is based on a truncated Newton’s method. Furthermore, we show that various optimization techniques, including line search and continuation, can significantly accelerate convergence. The algorithm has global convergence at a geometric rate (a Q-linear rate in optimization terminology). We present a numerical comparison with several existing algorithms, including an analysis using benchmark data from the UCI machine learning repository, and show our algorithm is the most computationally efficient without loss of accuracy.

l1正則化ロジスティック回帰は、スパースロジスティック回帰とも呼ばれ、機械学習、コンピュータービジョン、データマイニング、バイオインフォマティクス、神経信号処理で広く使用されています。l1正則化を使用すると、分類器に、特徴選択、ノイズに対する堅牢性、その結果、教師あり学習のコンテキストでの分類器の一般性などの魅力的な特性が付与されます。スパースロジスティック回帰問題に高次元の大規模データがある場合、目的関数の微分不可能なl1ノルムを最小化するには計算コストがかかります。最近の研究(Koh他、2007年、Hale他、2008年)を参考に、2種類の最適化反復を組み合わせた新しいハイブリッドアルゴリズムを提案します。1つは非常に高速でメモリを消費しないもので、もう1つは低速ですが精度が高いものです。ハイブリッド反復縮小(HIS)と呼ばれるこのアルゴリズムは、固定点継続フェーズと内部点フェーズで構成されます。最初のフェーズは、行列ベクトル乗算などのメモリ効率の高い演算に完全に基づいており、2番目のフェーズは打ち切りニュートン法に基づいています。さらに、直線探索や継続などのさまざまな最適化手法によって、収束を大幅に加速できることを示しています。このアルゴリズムは、幾何級数的な速度(最適化用語ではQ線形速度)でグローバル収束します。UCI機械学習リポジトリのベンチマークデータを使用した分析を含む、いくつかの既存のアルゴリズムとの数値比較を示し、精度を損なうことなく、当社のアルゴリズムが最も計算効率が高いことを示します。

On the Rate of Convergence of the Bagged Nearest Neighbor Estimate
Bagged Nearest Estimate の収束率について

Bagging is a simple way to combine estimates in order to improve their performance. This method, suggested by Breiman in 1996, proceeds by resampling from the original data set, constructing a predictor from each subsample, and decide by combining. By bagging an n-sample, the crude nearest neighbor regression estimate is turned into a consistent weighted nearest neighbor regression estimate, which is amenable to statistical analysis. Letting the resampling size kn grows appropriately with n, it is shown that this estimate may achieve optimal rate of convergence, independently from the fact that resampling is done with or without replacement. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, adaptation results by data-splitting are presented.

バギングは、パフォーマンスを向上させるために見積もりを組み合わせる簡単な方法です。1996年にBreimanによって提案されたこの方法は、元のデータセットからリサンプリングを行い、各サブサンプルから予測子を構築し、組み合わせて決定します。nサンプルをバギングすることにより、粗の最近傍回帰推定値は、統計分析に適した一貫性のある重み付けされた最近傍回帰推定値に変換されます。リサンプリングサイズknがnと共に適切に大きくなるとすると、この推定値は、リサンプリングが置換の有無にかかわらず行われるという事実とは無関係に、最適な収束率を達成できることが示されています。最適な収束率の推定値は観測値の未知の分布に依存するため、データ分割による適応結果を示します。

Second-Order Bilinear Discriminant Analysis
2次双線形判別解析

Traditional analysis methods for single-trial classification ofelectro-encephalography (EEG) focus on two types of paradigms: phase-lockedmethods, in which the amplitude of the signal is used as thefeature for classification, that is, event related potentials; andsecond-order methods, in which the feature of interest is the powerof the signal, that is, event related (de)synchronization. The process ofdeciding which paradigm to use is ad hoc and is driven byassumptions regarding the underlying neural generators. Here wepropose a method that provides an unified framework for the analysisof EEG, combining first and second-order spatial and temporalfeatures based on a bilinear model. Evaluation of the proposedmethod on simulated data shows that the technique outperformsstate-of-the art techniques for single-trial classification for abroad range of signal-to-noise ratios. Evaluations on human EEG−includingone benchmark data set from the Brain Computer Interface(BCI) competition−show statistically significant gains inclassification accuracy, with a reduction in overall classificationerror from 26%-28% to 19%.

脳波(EEG)の単一試行分類の従来の分析方法は、2種類のパラダイムに焦点を当てています。1つは位相ロック法で、信号の振幅が分類の特徴、つまりイベント関連電位として使用されます。もう1つは2次法で、関心のある特徴は信号のパワー、つまりイベント関連(脱)同期です。どちらのパラダイムを使用するかを決定するプロセスはアドホックであり、基礎となる神経ジェネレータに関する仮定によって決まります。ここでは、双線形モデルに基づいて1次と2次の空間的および時間的特徴を組み合わせた、EEG分析の統一フレームワークを提供する方法を提案します。シミュレートされたデータで提案された方法を評価した結果、この手法は、幅広い範囲の信号対雑音比で単一試行分類の最先端の手法よりも優れていることがわかりました。人間の脳波の評価（Brain Computer Interface (BCI)コンペティションからの1つのベンチマークデータセットを含む）では、分類精度が統計的に有意に向上し、全体的な分類エラーが26%～28%から19%に減少したことが示されています。

Error-Correcting Output Codes Library
エラー訂正出力コードライブラリ

In this paper, we present an open source Error-Correcting OutputCodes (ECOC) library. The ECOC framework is a powerful tool todeal with multi-class categorization problems. This librarycontains both state-of-the-art coding (one-versus-one,one-versus-all, dense random, sparse random, DECOC, forest-ECOC,and ECOC-ONE) and decoding designs (hamming, euclidean, inversehamming, laplacian, β-density, attenuated, loss-based,probabilistic kernel-based, and loss-weighted) with the parametersdefined by the authors, as well as the option to include your owncoding, decoding, and base classifier.

この論文では、オープンソースのエラー訂正出力コード(ECOC)ライブラリを紹介します。ECOCフレームワークは、多クラス分類の問題に対処するための強力なツールです。このライブラリには、最先端のコーディング(one-versus-one、one-versus-all、Dense Random、sparse random、DECOC、forest-ECOC、およびECOC-ONE)とデコードデザイン(ハミング、ユークリッド、インバースハミング、ラプラシアン、β密度、減衰、損失ベース、確率的カーネルベース、損失加重)の両方が含まれており、著者が定義したパラメーターと、独自のコーディング、デコード、およびベース分類器を含めるオプションが含まれています。

Why Does Unsupervised Pre-training Help Deep Learning?
教師なし事前学習がディープラーニングに役立つのはなぜですか?

Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deepmodels, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this questionsis important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pre-training with respect toarchitecture depth, model capacity, and number of training examples.The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training.

最近の多くの研究は、ディープビリーフネットワークやオートエンコーダーバリアントのスタックなどのディープアーキテクチャの学習アルゴリズムに向けられており、主に視覚と言語のデータセットを中心に、いくつかの分野で印象的な結果が得られています。教師あり学習タスクで得られる最良の結果には、通常、教師なし事前トレーニングフェーズでの教師なし学習コンポーネントが含まれます。これらの新しいアルゴリズムによってディープモデルのトレーニングが可能になったにもかかわらず、この困難な学習問題の本質については多くの疑問が残っています。ここで調査する主な疑問は、教師なし事前トレーニングはどのように機能するかということです。ディープアーキテクチャでの学習をさらに改善するには、この疑問に答えることが重要です。私たちはいくつかの説明仮説を提案し、大規模なシミュレーションによってそれらをテストします。アーキテクチャの深さ、モデル容量、およびトレーニング例の数に関して、事前トレーニングの影響を経験的に示します。実験により、教師なし事前トレーニングの利点が確認され、明らかになります。結果は、教師なしの事前トレーニングが、トレーニングデータセットからのより良い一般化をサポートする最小値の吸引域に向けて学習を導くことを示唆しています。これらの結果からの証拠は、事前トレーニングの効果に対する正規化の説明をサポートしています。

A Rotation Test to Verify Latent Structure
潜在構造を検証するための回転試験

In multivariate regression models we have the opportunity to look for hiddenstructure unrelated to the observed predictors. However, when one fits a modelinvolving such latent variables it is important to be able to tell if thestructure is real, or just an artifact of correlation in the regression errors.We develop a new statistical test based on random rotations for verifying the existence of latent variables. The rotations are carefully constructed torotate orthogonally to the column space of the regression model. We find that only non-Gaussianlatent variables are detectable, a finding that parallels a well knownphenomenon in independent components analysis. We base our test on a measureof non-Gaussianity in the histogramof the principal eigenvector components instead of on the eigenvalue.The method finds andverifies some latent dichotomies in the microarray data from theAGEMAP consortium.

多変量回帰モデルでは、観測された予測変数とは無関係の隠れた構造を探す機会があります。しかし、このような潜在変数を含むモデルを当てはめる場合、その構造が実在するのか、それとも回帰誤差の相関のアーティファクトに過ぎないのかを見分けることができることが重要です。私たちは、潜在変数の存在を検証するために、ランダム回転に基づく新しい統計的検定を開発します。回転は、回帰モデルの列空間に直交するように慎重に構成されています。非ガウス潜伏変数のみが検出可能であることがわかりました。これは、独立成分分析におけるよく知られた現象と類似する発見です。このテストは、固有値ではなく、主要な固有ベクトル成分のヒストグラムの非ガウス性の尺度に基づいています。この手法は、AGEMAPコンソーシアムのマイクロアレイデータに潜む二分法を発見し、検証するものです。

On Finding Predictors for Arbitrary Families of Processes
プロセスの任意のファミリーの予測子を見つけることについて

The problem is sequence prediction in the following setting. A sequence x1,…,xn,… of discrete-valued observations is generated according to some unknown probabilistic law (measure) μ. After observing each outcome, it is required to give the conditional probabilities of the next observation.The measure μ belongs to an arbitrary but known class C of stochastic process measures.We are interested in predictors ρ whose conditional probabilities converge (in some sense) to the”true” μ-conditional probabilities, if any μ∈C is chosen to generate the sequence.The contribution of this work is in characterizing the families C for which such predictors exist, and in providing a specific and simple form in which to look for a solution. We show that if any predictor works, then there exists a Bayesian predictor, whose prior is discrete, and which works too. We also find several sufficient and necessary conditionsfor the existence of a predictor, in terms of topological characterizations of the family C, as well as in termsof local behaviour of the measures in C, which in some cases lead to procedures for constructing such predictors.It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to any parametric or countable family.

問題は、次の設定でのシーケンス予測です。離散値の観測のシーケンスx1,…,xn,…は、未知の確率法則(測度)μ に従って生成されます。各結果を観測した後、次の観測の条件付き確率を与える必要があります。測度 μ は、確率過程測度の任意だが既知のクラスCに属します。シーケンスを生成するために任意の μ∈Cが選択された場合、条件付き確率が(ある意味で)「真の」μ 条件付き確率に収束する予測子 ρ に関心があります。この研究の貢献は、そのような予測子が存在するファミリーCを特徴付け、ソリューションを探すための具体的かつ単純な形式を提供することです。任意の予測子が機能する場合、事前確率が離散的であり、これも機能するベイズ予測子が存在することを示します。また、C族の位相的特徴付けとC内の測度の局所的動作の観点から、予測子の存在に対する十分条件と必要条件もいくつか見つかり、場合によっては、そのような予測子を構築するための手順につながります。フレームワークは完全に一般的なものであることを強調する必要があります。つまり、検討対象の確率過程は、i.i.d.や定常である必要はなく、パラメトリック族や可算族に属する必要もありません。

Approximate Tree Kernels
おおよそのツリーカーネル

Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space.

ツリーの畳み込みカーネルは、ツリー構造データを使用した学習のシンプルな手段を提供します。ツリーカーネルの計算時間は、すべてのノードのペアを比較する必要があるため、ツリーのサイズの2乗です。したがって、HTMLドキュメントまたは構造化ネットワークデータから取得された大規模な構文解析ツリーでは、畳み込みカーネルは適用できません。この記事では、構文解析ツリーカーネルの効果的な近似手法を提案します。近似ツリーカーネル(ATK)は、カーネルの計算を関連するサブツリーのスパースサブセットに制限し、冗長な構造を破棄するため、カーネルベースの学習方法のトレーニングとテストが大幅に高速化されます。教師あり学習タスクと教師なし学習タスクのそれぞれで、このようなサブセットを識別するための線形計画法アプローチを考案しました。経験的に、近似ツリーカーネルは、通常のツリーカーネルの予測精度を維持しながら、実行時間を最大3桁改善します。教師なしタスクの場合、近似ツリーカーネルは、特徴空間内の関連する次元を識別することで、より正確な予測につながります。

Generalized Power Method for Sparse Principal Component Analysis
スパース主成分分析のための一般化べき乗法

In this paper we develop a new approach to sparse principal component analysis (sparse PCA). We propose two single-unit and two block optimization formulations of the sparse PCA problem, aimed at extracting a single sparse dominant principal component of a data matrix, or more components at once, respectively. While the initial formulations involve nonconvex functions, and are therefore computationally intractable, we rewrite them into the form of an optimization program involving maximization of a convex function on a compact set. The dimension of the search space is decreased enormously if the data matrix has many more columns (variables) than rows. We then propose and analyze a simple gradient method suited for the task. It appears that our algorithm has best convergence properties in the case when either the objective function or the feasible set are strongly convex, which is the case with our single-unit formulations and can be enforced in the block case. Finally, we demonstrate numerically on a set of random and gene expression test problems that our approach outperforms existing algorithms both in quality of the obtained solution and in computational speed.

この論文では、スパース主成分分析(スパースPCA)への新しいアプローチを開発します。スパースPCA問題の2つの単一ユニット最適化定式化と2つのブロック最適化定式化を提案します。それぞれ、データマトリックスの単一のスパース優位主成分、または複数の成分を一度に抽出することを目的としています。最初の定式化は非凸関数を含むため計算上扱いにくいのですが、コンパクトセット上の凸関数の最大化を含む最適化プログラムの形式に書き直します。データマトリックスの列(変数)が行よりはるかに多い場合、検索空間の次元は大幅に減少します。次に、このタスクに適した単純な勾配法を提案し、分析します。目的関数または実行可能セットのいずれかが強く凸である場合、アルゴリズムは最も収束特性が高いようです。これは、単一ユニット定式化の場合に当てはまり、ブロックの場合にも適用できます。最後に、一連のランダムおよび遺伝子発現テスト問題で数値的に、得られたソリューションの品質と計算速度の両方で、このアプローチが既存のアルゴリズムよりも優れていることを示します。

Classification Using Geometric Level Sets
ジオメトリレベルセットを使用した分類

A variational level set method is developed for the supervised classification problem. Nonlinear classifier decision boundaries are obtained by minimizing an energy functional that is composed of an empirical risk term with a margin-based loss and a geometric regularization term new to machine learning: the surface area of the decision boundary. This geometric level set classifier is analyzed in terms of consistency and complexity through the calculation of its ε-entropy. For multicategory classification, an efficient scheme is developed using a logarithmic number of decision functions in the number of classes rather than the typical linear number of decision functions. Geometric level set classification yields performance results on benchmark data sets that are competitive with well-established methods.

教師付き分類問題に対して、変分水準セット法が開発されています。非線形分類器の決定境界は、マージンベースの損失を持つ経験的リスク項と、機械学習に新しく追加された幾何学的正則化項(決定境界の表面積)で構成されるエネルギー関数を最小化することによって得られます。この幾何学的レベルセット分類器は、そのεエントロピーの計算を通じて、一貫性と複雑さの観点から分析されます。マルチカテゴリ分類では、一般的な線形決定関数数ではなく、クラス数の決定関数の対数数を使用して効率的なスキームが開発されます。幾何学的水準セットの分類は、確立された方法と競争力のあるベンチマークデータセットでパフォーマンス結果をもたらします。

Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization
データ可視化のための非線形次元削減への情報検索の視点

Nonlinear dimensionality reduction methods are often used to visualizehigh-dimensional data, although the existing methods have beendesigned for other related tasks such as manifold learning. It hasbeen difficult to assess the quality of visualizations since the taskhas not been well-defined. We give a rigorous definition for aspecific visualization task, resulting in quantifiable goodnessmeasures and new visualization methods. The task is informationretrieval given the visualization: to find similar data based on thesimilarities shown on the display. The fundamental tradeoff betweenprecision and recall of information retrieval can then be quantifiedin visualizations as well. The user needs to give the relative cost ofmissing similar points vs. retrieving dissimilar points, after whichthe total cost can be measured. We then introduce a new method NeRV(neighbor retrieval visualizer) which produces an optimal visualization by minimizing the cost. We further derive a variant for supervisedvisualization; class information is taken rigorously into account whencomputing the similarity relationships. We show empirically that theunsupervised version outperforms existing unsupervised dimensionalityreduction methods in the visualization task, and the supervisedversion outperforms existing supervised methods.

非線形次元削減法は高次元データを視覚化するためによく使用されますが、既存の方法は多様体学習などの他の関連タスク用に設計されています。タスクが明確に定義されていないため、視覚化の品質を評価することは困難でした。特定の視覚化タスクの厳密な定義を提供し、定量化可能な良さの尺度と新しい視覚化方法をもたらします。タスクは、視覚化が与えられた場合の情報検索、つまりディスプレイに表示される類似性に基づいて類似データを見つけることです。情報検索の精度と再現率の基本的なトレードオフは、視覚化でも定量化できます。ユーザーは、類似点を見逃すコストと非類似点を検索するコストの相対的な値を指定する必要があります。その後、総コストを測定できます。次に、コストを最小化して最適な視覚化を生成する新しい方法NeRV (近傍検索視覚化)を紹介します。さらに、教師あり視覚化のバリアントを導出します。類似関係を計算する際には、クラス情報が厳密に考慮されます。視覚化タスクでは、教師なしバージョンが既存の教師なし次元削減方法よりも優れており、教師ありバージョンが既存の教師あり方法よりも優れていることが経験的に示されています。

Dimensionality Estimation, Manifold Learning and Function Approximation using Tensor Voting
テンソル投票を用いた次元推定、多様体学習、関数近似

We address instance-based learning from a perceptual organizationstandpoint and present methods for dimensionality estimation,manifold learning and function approximation. Under our approach, manifolds in high-dimensional spacesare inferred by estimating geometric relationships among the inputinstances. Unlike conventional manifold learning, we do not perform dimensionality reduction, butinstead perform all operations in the original input space. For thispurpose we employ a novel formulation of tensor voting, which allows an N-Dimplementation. Tensor voting is a perceptual organizationframework that has mostly been applied to computer vision problems.Analyzing the estimated local structure at the inputs, we are ableto obtain reliable dimensionality estimates at each instance,instead of a global estimate for the entire data set. Moreover, theselocal dimensionality and structure estimates enable us to measuregeodesic distances and perform nonlinear interpolation for data setswith varying density, outliers, perturbation and intersections, thatcannot be handled by state-of-the-art methods. Quantitativeresults on the estimation of local manifold structure using ground truth data are presented. In addition, we compare ourapproach with several leading methods for manifold learning at thetask of measuring geodesic distances. Finally, we show competitivefunction approximation results on real data.

私たちは、インスタンスベースの学習を知覚的組織化の観点から取り上げ、次元推定、多様体学習、および関数近似の方法を提示します。我々のアプローチでは、高次元空間の多様体は、入力インスタンス間の幾何学的関係を推定することによって推論されます。従来の多様体学習とは異なり、次元削減は実行せず、代わりに元の入力空間ですべての操作を実行します。この目的のために、N実装を可能にするテンソル投票の新しい定式化を採用します。テンソル投票は、主にコンピュータービジョンの問題に適用されている知覚的組織化フレームワークです。入力で推定されたローカル構造を分析することで、データセット全体のグローバル推定ではなく、各インスタンスで信頼性の高い次元推定を得ることができます。さらに、これらのローカル次元と構造の推定により、測地線距離を測定し、最先端の方法では処理できないさまざまな密度、外れ値、摂動、交差を持つデータセットの非線形補間を実行できます。グラウンドトゥルースデータを使用したローカル多様体構造の推定に関する定量的な結果を示します。さらに、測地線距離の測定タスクにおける多様体学習のいくつかの主要な方法と私たちのアプローチを比較します。最後に、実際のデータでの競合関数近似結果を示します。

A Convergent Online Single Time Scale Actor Critic Algorithm
収束オンライン単一時間スケール俳優批評アルゴリズム

Actor-Critic based approaches were among the first to address reinforcementlearning in a general setting. Recently, these algorithms have gainedrenewed interest due to their generality, good convergence properties,and possible biological relevance. In this paper, we introduce anonline temporal difference based actor-critic algorithm which is provedto converge to a neighborhood of a local maximum of the average reward.Linear function approximation is used by the critic in order estimatethe value function, and the temporal difference signal, which is passedfrom the critic to the actor. The main distinguishing feature of thepresent convergence proof is that both the actor and the critic operateon a similar time scale, while in most current convergence proofsthey are required to have very different time scales in order to converge.Moreover, the same temporal difference signal is used to update theparameters of both the actor and the critic. A limitation of the proposedapproach, compared to results available for two time scale convergence,is that convergence is guaranteed only to a neighborhood of an optimalvalue, rather to an optimal value itself. The single time scale andidentical temporal difference signal used by the actor and the critic,may provide a step towards constructing more biologically realisticmodels of reinforcement learning in the brain.

アクター・クリティックに基づくアプローチは、一般的な設定で強化学習に取り組んだ最初のアプローチの1つです。最近、これらのアルゴリズムは、その汎用性、優れた収束特性、および生物学的関連性の可能性により、新たな関心を集めています。この論文では、平均報酬の局所的最大値の近傍に収束することが証明されている、オンラインの時間差に基づくアクター・クリティックアルゴリズムを紹介します。クリティックは、価値関数と、クリティックからアクターに渡される時間差信号を推定するために、線形関数近似を使用します。現在の収束証明の主な際立った特徴は、アクターとクリティックの両方が同様の時間スケールで動作することです。一方、現在のほとんどの収束証明では、収束するために、両者に非常に異なる時間スケールが必要です。さらに、同じ時間差信号を使用して、アクターとクリティックの両方のパラメータを更新します。提案されたアプローチの限界は、2つの時間スケールの収束で得られる結果と比較すると、収束が最適値自体ではなく、最適値の近傍にのみ保証されることです。アクターと批評家によって使用される単一の時間スケールと同一の時間差信号は、脳内で強化学習のより生物学的に現実的なモデルを構築する第一歩となる可能性があります。

Bundle Methods for Regularized Risk Minimization
レギュラライズドリスク最小化のためのバンドル手法

A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log (1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available data sets.

さまざまな機械学習の問題は、さまざまなアルゴリズムがさまざまなリスクの概念とさまざまな正則化器を使用して、正則化されたリスク関数を最小限に抑えるものとして説明できます。例としては、線形サポートベクターマシン(SVM)、ガウス過程、ロジスティック回帰、条件付きランダムフィールド(CRF)、なげなわなどがあります。この論文では、これらすべての推定問題を解決するスケーラブルでモジュール式の凸型ソルバーの理論と実装について説明します。ワークステーションのクラスターで並列化でき、データの局所性を可能にし、L1やL2のペナルティなどの正則化処理を処理できます。統一されたフレームワークに加えて、アルゴリズムが一般的な凸問題ではO(1/ε)ステップでε精度に収束し、連続微分可能な問題ではO(log (1/ε))ステップで収束することを示す、厳密な収束境界を示します。一般公開されているさまざまなデータセットで、汎用ソルバーの性能を実証しています。

Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure
ベイジアンネットワーク構造の学習のためのクラスタ化構造制約の最適探索

We study the problem of learning an optimal Bayesian network in a constrained search space; skeletons are compelled to be subgraphs of a given undirected graph called the super-structure.The previously derived constrained optimal search (COS) remains limited even for sparse super-structures.To extend its feasibility, we propose to divide the super-structure into several clusters and perform an optimal search on each of them.Further, to ensure acyclicity, we introduce the concept of ancestral constraints (ACs) and derive an optimal algorithm satisfying a given set of ACs.Finally, we theoretically derive the necessary and sufficient sets of ACs to be considered for finding an optimal constrained graph.Empirical evaluations demonstrate that our algorithm can learn optimal Bayesian networks for some graphs containing several hundreds of vertices, and even for super-structures having a high average degree (up to four), which is a drastic improvement in feasibility over the previous optimal algorithm.Learnt networks are shown to largely outperform state-of-the-art heuristic algorithms both in terms of score and structural hamming distance.

私たちは、制約のある探索空間で最適なベイジアンネットワークを学習する問題を研究します。スケルトンは、スーパー構造と呼ばれる特定の無向グラフのサブグラフにならざるを得ません。以前に導出された制約付き最適探索(COS)は、スパースなスーパー構造に対しても依然として限界があります。その実現可能性を拡張するために、スーパー構造をいくつかのクラスターに分割し、それぞれに対して最適探索を実行することを提案します。さらに、非循環性を保証するために、祖先制約(AC)の概念を導入し、特定のACセットを満たす最適アルゴリズムを導出します。最後に、最適な制約付きグラフを見つけるために考慮すべき必要かつ十分なACセットを理論的に導出します。実証的評価により、このアルゴリズムは、数百の頂点を含む一部のグラフ、さらには平均次数が高い(最大4)スーパー構造に対しても最適なベイジアンネットワークを学習できることが実証されており、これは以前の最適アルゴリズムに比べて実現可能性が大幅に向上しています。学習されたネットワークは、スコアと構造ハミング距離の両方において、最先端のヒューリスティックアルゴリズムを大幅に上回る性能を示すことが示されています。

Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part II: Analysis and Extensions
分類のための因果発見と特徴選択のための局所因果的およびマルコフブランケット誘導パートII:分析と拡張

In part I of this work we introduced and evaluated the GeneralizedLocal Learning (GLL) framework for producing local causal and Markovblanket induction algorithms. In the present second part we analyzethe behavior of GLL algorithms and provide extensions to the coremethods. Specifically, we investigate the empirical convergence ofGLL to the true local neighborhood as a function of sample size.Moreover, we study how predictivity improves with increasing samplesize. Then we investigate how sensitive are the algorithms tomultiple statistical testing, especially in the presence of manyirrelevant features. Next we discuss the role of the algorithmparameters and also show that Markov blanket and causal graphconcepts can be used to understand deviations from optimality ofstate-of-the-art non-causal algorithms. The present paper alsointroduces the following extensions to the core GLL framework:parallel and distributed versions of GLL algorithms, versions withfalse discovery rate control, strategies for constructing novelheuristics for specific domains, and divide-and-conquerlocal-to-global learning (LGL) strategies. We test thegenerality of the LGL approach by deriving a novel LGL-basedalgorithm that compares favorably to the state-of-the-art globallearning algorithms. In addition, we investigate the use ofnon-causal feature selection methods to facilitate global learning.Open problems and future research paths related to local andlocal-to-global causal learning are discussed.

この研究のパートIでは、局所因果アルゴリズムとマルコフブランケット誘導アルゴリズムを生成するための一般化局所学習(GLL)フレームワークを紹介し、評価しました。このパート2では、GLLアルゴリズムの動作を分析し、コアメソッドの拡張を提供します。具体的には、サンプルサイズの関数として、GLLが真の局所近傍に経験的に収束するかどうかを調べます。さらに、サンプルサイズが大きくなると予測性がどのように向上するかを調べます。次に、特に多くの無関係な機能が存在する場合に、アルゴリズムが複数の統計テストに対してどの程度敏感であるかを調べます。次に、アルゴリズムパラメーターの役割について説明し、マルコフブランケットと因果グラフの概念を使用して、最先端の非因果アルゴリズムの最適性からの逸脱を理解できることも示します。この論文では、コアGLLフレームワークへの以下の拡張機能も紹介します。GLLアルゴリズムの並列および分散バージョン、誤検出率制御付きバージョン、特定のドメインに対する新しいヒューリスティックを構築する戦略、および分割統治ローカルからグローバルへの学習(LGL)戦略。最先端のグローバル学習アルゴリズムに匹敵する新しいLGLベースのアルゴリズムを導出することで、LGLアプローチの一般性をテストします。さらに、グローバル学習を容易にするための非因果的特徴選択方法の使用を調査します。ローカルおよびローカルからグローバルへの因果学習に関連する未解決の問題と将来の研究の方向性について説明します。

Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation
分類のための因果発見と特徴選択のための局所因果とマルコフブランケット帰納法パートI:アルゴリズムと経験的評価

We present an algorithmic framework for learning local causal structure around target variables of interestin the form of direct causes/effects and Markov blankets applicable to very large data sets with relatively small samples.The selected feature sets can be used for causal discovery and classification. The framework (Generalized Local Learning, or GLL)can be instantiated in numerous ways, giving rise to both existing state-of-the-art as well as novel algorithms.The resulting algorithms are sound under well-defined sufficient conditions. In a first set of experiments we evaluate several algorithms derived from this framework in terms of predictivity and feature set parsimony and compare to other local causal discovery methods and to state-of-the-art non-causal feature selection methods using real data. A second set of experimental evaluations compares the algorithms in terms of ability to induce local causal neighborhoods using simulated and resimulated data and examines the relation of predictivity with causal induction performance.Our experiments demonstrate, consistently with causal feature selection theory, that local causal feature selection methods (under broad assumptions encompassing appropriate family of distributions, types of classifiers, and loss functions) exhibit strong feature set parsimony, high predictivity and local causal interpretability. Although non-causal feature selection methods are often used in practice to shed light on causal relationships, we find that they cannot be interpreted causally even when they achieve excellent predictivity. Therefore we conclude that only local causal techniques should be used when insight into causal structure is sought. In a companion paper we examine in depth the behavior of GLL algorithms, provide extensions, and show how local techniques can be used for scalable and accurate global causal graph learning.

私たちは、比較的小さなサンプルを持つ非常に大きなデータセットに適用可能な、直接的な原因/結果とマルコフブランケットの形で、関心のあるターゲット変数の周囲のローカル因果構造を学習するためのアルゴリズムフレームワークを提示します。選択された特徴セットは、因果の発見と分類に使用できます。フレームワーク(一般化ローカル学習、またはGLL)は、さまざまな方法でインスタンス化できるため、既存の最先端のアルゴリズムと新しいアルゴリズムの両方を生み出すことができます。結果として得られるアルゴリズムは、明確に定義された十分な条件下では健全です。最初の一連の実験では、このフレームワークから派生したいくつかのアルゴリズムを予測性と特徴セットの簡素性の観点から評価し、実際のデータを使用して、他のローカル因果発見方法および最先端の非因果的特徴選択方法と比較します。2番目の一連の実験評価では、シミュレーションおよび再シミュレーションされたデータを使用してローカル因果近傍を誘導する能力に関してアルゴリズムを比較し、予測性と因果誘導パフォーマンスの関係を調べます。私たちの実験は、因果特徴選択理論と一致して、ローカル因果特徴選択方法(適切な分布ファミリー、分類子の種類、および損失関数を含む広範な仮定の下)が強力な特徴セットの節約、高い予測性、およびローカル因果解釈可能性を示すことを示しています。非因果特徴選択方法は、因果関係を明らかにするために実際にはよく使用されますが、優れた予測性を達成した場合でも因果的に解釈できないことがわかりました。したがって、因果構造への洞察が求められる場合は、ローカル因果手法のみを使用すべきであると結論付けています。コンパニオンペーパーでは、GLLアルゴリズムの動作を詳細に調査し、拡張機能を提供し、スケーラブルで正確なグローバル因果グラフ学習にローカル手法を使用する方法を示します。

An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data
バイナリ応答データに適用される分類木の欠損データ法の調査

There are many different methods used by classification treealgorithms when missing data occur in the predictors, but fewstudies have been done comparing their appropriateness andperformance. This paper provides both analytic and Monte Carloevidence regarding the effectiveness of six popular missing datamethods for classification trees applied to binary response data. Weshow that in the context of classification trees, the relationshipbetween the missingness and the dependent variable, as well as theexistence or non-existence of missing values in the testing data,are the most helpful criteria to distinguish different missing datamethods. In particular, separate class is clearly the bestmethod to use when the testing set has missing values and themissingness is related to the response variable. A real data setrelated to modeling bankruptcy of a firm is then analyzed. The paperconcludes with discussion of adaptation of these results to logisticregression, and other potential generalizations.

予測変数に欠損データがある場合、分類ツリーアルゴリズムではさまざまな方法が使用されますが、それらの適切性とパフォーマンスを比較した研究はほとんどありません。この論文では、バイナリ応答データに適用された分類ツリーの6つの一般的な欠損データ方法の有効性に関する分析的証拠とモンテカルロの両方を示します。分類ツリーのコンテキストでは、欠損と従属変数の関係、およびテストデータ内の欠損値の有無が、さまざまな欠損データ方法を区別するための最も役立つ基準であることを示します。特に、テストセットに欠損値があり、欠損が応答変数に関連している場合、個別のクラスを使用するのが明らかに最適な方法です。次に、企業の倒産のモデル化に関連する実際のデータセットを分析します。この論文では、これらの結果をロジスティック回帰に適応させること、およびその他の潜在的な一般化について議論して終わります。

Classification Methods with Reject Option Based on Convex Risk Minimization
凸型リスク最小化に基づくリジェクトオプション付き分類法

In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions.

この論文では、誤分類よりも低いコストでオブザベーションを分類する決定を差し控えることができるリジェクトオプションを使用して、二項分類の問題を調査します。自然損失関数は非凸型であり、経験的なリスクの最小化が容易に不可能になるため、この論文では、代理凸損失関数に基づいて凸リスクを最小化することを提案しています。無限のサンプル一貫性(両方のリスクが同じ最小化装置を共有する)のための必要十分条件が提供されます。さらに、過剰リスクは、適切な条件下で過剰代理リスクを通じて制限できることを示します。これらの境界は、一般化されたマージン条件によって狭くすることができます。この結果の影響は、一般的に使用されるいくつかのサロゲート損失関数に示されています。

On-Line Sequential Bin Packing
オンラインシーケンシャルビンパッキング

We consider a sequential version of the classical bin packing problemin which items are received one by one. Before the size of the nextitem is revealed, the decision maker needs to decide whetherthe next item is packed in the currently open bin or the bin isclosed and a new bin is opened. If the new item does not fit, it islost. If a bin is closed, the remaining free space in the bin accountsfor a loss. The goal of the decision maker is to minimize theloss accumulated over n periods. We presentan algorithm that has a cumulative loss not much larger than anystrategy in a finite class of reference strategiesfor any sequence of items.Special attention is payed to reference strategiesthat use a fixed threshold at each step to decide whether a new binis opened. Some positive and negative results are presented for this case.

私たちは、アイテムが1つずつ受け取る古典的なビンパッキング問題の逐次バージョンを考えます。次の品目のサイズが明らかになる前に、意思決定者は、次の品目が現在開いているビンに梱包されているか、ビンが閉じられて新しいビンが開かれているかを決定する必要があります。新しいアイテムが収まらない場合、それは失われます。ビンが閉じられた場合、ビンの残りの空きスペースは損失となります。意思決定者の目標は、n期間にわたって蓄積される損失を最小限に抑えることです。私たちは、任意のアイテムのシーケンスの参照戦略の有限クラスで、任意の戦略よりもそれほど大きくない累積損失を持つアルゴリズムを提示します。新しいビンが開かれるかどうかを判断するために、各ステップで固定のしきい値を使用する参照戦略には特に注意が払われます。このケースでは、いくつかの肯定的な結果と否定的な結果が提示されます。

Model Selection: Beyond the Bayesian/Frequentist Divide
モデル選択:ベイズ/頻度論的分水嶺を超えて

The principle of parsimony also known as “Ockham’s razor” has inspired manytheories of model selection. Yet such theories, all making arguments in favorof parsimony, are based on very different premises and have developed distinctmethodologies to derive algorithms. We have organized challenges andedited a special issue of JMLR and several conference proceedings around thetheme of model selection. In this editorial, we revisit the problem ofavoiding overfitting in light of the latestresults. We note the remarkable convergence of theories as different asBayesian theory, Minimum Description Length, bias/variance tradeoff,Structural Risk Minimization, and regularization, in someapproaches. We also present new andinteresting examples of the complementarity of theories leading tohybrid algorithms, neither frequentist, nor Bayesian, or perhapsboth frequentist and Bayesian!

「オッカムの剃刀」としても知られる倹約の原理は、モデル選択の多くの理論に影響を与えました。しかし、そのような理論は、すべて倹約を支持する議論をしており、非常に異なる前提に基づいており、アルゴリズムを導出するための明確な方法論を発展させてきた。私たちは、モデル選択をテーマにしたJMLRの特集号といくつかの会議録を整理し、編集しました。この社説では、最新の結果に照らして、過剰適合を避ける問題を再検討します。私たちは、いくつかのアプローチにおいて、ベイズ理論、最小記述長、バイアス/分散トレードオフ、構造リスクの最小化、および正則化として、理論の顕著な収束に注目します。また、ハイブリッドアルゴリズムにつながる理論の相補性の新しく興味深い例も提示します。ハイブリッドアルゴリズム、頻度論者でもベイジアンでもなく、おそらく頻度論者とベイジアンの両方でもありません。

Online Learning for Matrix Factorization and Sparse Coding
行列因数分解とスパース符号化のためのオンライン学習

Sparse coding−that is, modelling data vectors as sparse linearcombinations of basis elements−is widely used in machine learning,neuroscience, signal processing, and statistics. This paper focuses on thelarge-scale matrix factorization problem that consists of learningthe basis set in order to adapt it to specific data. Variations of thisproblem include dictionary learning in signal processing, non-negativematrix factorization and sparse principal component analysis. In thispaper, we propose to address these tasks with a new online optimizationalgorithm, based on stochastic approximations, which scales up gracefullyto large data sets with millions of training samples, and extends naturallyto various matrix factorization formulations, making it suitable for awide range of learning problems. A proof of convergence is presented,along with experiments with natural images and genomic data demonstratingthat it leads to state-of-the-art performance in terms of speed andoptimization for both small and large data sets.

スパースコーディング、つまりデータベクトルを基底要素のスパース線形結合としてモデル化する手法は、機械学習、神経科学、信号処理、統計学で広く使用されています。この論文では、基底セットを特定のデータに適応させるために学習する大規模な行列分解問題に焦点を当てています。この問題のバリエーションには、信号処理における辞書学習、非負行列分解、スパース主成分分析などがあります。この論文では、確率的近似に基づく新しいオンライン最適化アルゴリズムを使用してこれらのタスクに対処することを提案します。このアルゴリズムは、数百万のトレーニングサンプルを含む大規模なデータセットに適切に拡張され、さまざまな行列分解定式化に自然に拡張されるため、幅広い学習問題に適しています。収束の証明と、小規模および大規模なデータセットの両方で速度と最適化の点で最先端のパフォーマンスにつながることを示す自然画像とゲノムデータを使用した実験が提示されています。

An Efficient Explanation of Individual Classifications using Game Theory
ゲーム理論を用いた個別分類の効率的な説明

We present a general method for explaining individual predictionsof classification models. The method is based on fundamentalconcepts from coalitional game theory and predictions areexplained with contributions of individual feature values. Weovercome the method’s initial exponential time complexity with asampling-based approximation. In the experimental part of thepaper we use the developed method on models generated by severalwell-known machine learning algorithms on both synthetic andreal-world data sets. The results demonstrate that the method isefficient and that the explanations are intuitive and useful.

私たちは、分類モデルの個々の予測を説明するための一般的な方法を提示します。この方法は、連合ゲーム理論の基本概念に基づいており、予測は個々の特徴値の寄与とともに説明されます。この手法の初期の指数関数的な時間的複雑さを、サンプリングベースの近似で克服します。論文の実験部分では、合成データセットと実世界データセットの両方で、いくつかのよく知られた機械学習アルゴリズムによって生成されたモデルで開発された方法を使用します。その結果、この手法が効率的であり、説明が直感的で有用であることが示されています。

Journal of Machine Learning Research Papers: Volume 11の論文一覧

こちらもおすすめ

Journal of Machine Learning Research Papers: Volume 6の論文一覧

Journal of Machine Learning Research Papers: Volume 18の論文一覧

Journal of Machine Learning Research Papers: Volume 4の論文一覧