Journal of Machine Learning Research Papers: Volume 15の論文一覧

Journal of Machine Learning Research Papers Volume 15に記載されている内容を一覧にまとめ、機械翻訳を交えて日本語化し掲載します。
Active Imitation Learning: Formal and Practical Reductions to I.I.D. Learning
能動的模倣学習: 独立同分布学習の形式的かつ実践的な還元

In standard passive imitation learning, the goal is to learn a policy that performs as well as a target policy by passively observing full execution trajectories of it. Unfortunately, generating such trajectories can require substantial expert effort and be impractical in some cases. In this paper, we consider active imitation learning with the goal of reducing this effort by querying the expert about the desired action at individual states, which are selected based on answers to past queries and the learner’s interactions with an environment simulator. We introduce a new approach based on reducing active imitation learning to active i.i.d. learning, which can leverage progress in the i.i.d. setting. Our first contribution is to analyze reductions for both non-stationary and stationary policies, showing for the first time that the label complexity (number of queries) of active imitation learning can be less than that of passive learning. Our second contribution is to introduce a practical algorithm inspired by the reductions, which is shown to be highly effective in five test domains compared to a number of alternatives.

標準的な受動模倣学習では、目標は、ターゲットポリシーの完全な実行軌跡を受動的に観察することによって、ターゲットポリシーと同様に機能するポリシーを学習することです。残念ながら、このような軌跡を生成するには、専門家の多大な労力が必要になり、場合によっては非現実的です。この論文では、過去のクエリへの回答と学習者の環境シミュレーターとのやり取りに基づいて選択される個々の状態での望ましいアクションについて専門家に問い合わせることで、この労力を削減することを目標とした能動模倣学習について検討します。能動模倣学習を能動i.i.d.学習に縮小することに基づく新しいアプローチを紹介します。これにより、i.i.d.設定での進歩を活用できます。最初の貢献は、非定常ポリシーと定常ポリシーの両方の縮小を分析し、能動模倣学習のラベル複雑度(クエリ数)が受動学習のラベル複雑度よりも低くなる可能性があることを初めて示したことです。2番目の貢献は、縮小にヒントを得た実用的なアルゴリズムを紹介することです。このアルゴリズムは、5つのテストドメインで、いくつかの代替手段と比較して非常に効果的であることが示されています。

Parallelizing Exploration-Exploitation Tradeoffs in Gaussian Process Bandit Optimization
ガウス過程バンディット最適化における探索と活用のトレードオフの並列化

How can we take advantage of opportunities for experimental parallelization in exploration-exploitation tradeoffs? In many experimental scenarios, it is often desirable to execute experiments simultaneously or in batches, rather than only performing one at a time. Additionally, observations may be both noisy and expensive. We introduce Gaussian Process Batch Upper Confidence Bound (GP-BUCB), an upper confidence bound-based algorithm, which models the reward function as a sample from a Gaussian process and which can select batches of experiments to run in parallel. We prove a general regret bound for GP-BUCB, as well as the surprising result that for some common kernels, the asymptotic average regret can be made independent of the batch size. The GP-BUCB algorithm is also applicable in the related case of a delay between initiation of an experiment and observation of its results, for which the same regret bounds hold. We also introduce Gaussian Process Adaptive Upper Confidence Bound (GP-AUCB), a variant of GP-BUCB which can exploit parallelism in an adaptive manner. We evaluate GP-BUCB and GP-AUCB on several simulated and real data sets. These experiments show that GP-BUCB and GP-AUCB are competitive with state-of-the-art heuristics. (A previous version of this work appeared in the Proceedings of the 29th International Conference on Machine Learning, 2012.)

探索と活用のトレードオフにおいて、実験の並列化の機会をどのように活用できるでしょうか。多くの実験シナリオでは、実験を1つずつ実行するのではなく、同時にまたはバッチで実行することが望ましいことがよくあります。さらに、観測にはノイズが多く、コストがかかる場合があります。ガウス過程バッチ上側信頼限界(GP-BUCB)を紹介します。これは、報酬関数をガウス過程のサンプルとしてモデル化し、並列に実行する実験のバッチを選択できる、上側信頼限界ベースのアルゴリズムです。GP-BUCBの一般的な後悔限界と、いくつかの一般的なカーネルでは、漸近平均後悔をバッチサイズとは無関係にできるという驚くべき結果を証明します。GP-BUCBアルゴリズムは、実験の開始とその結果の観測の間に遅延がある場合にも適用でき、同じ後悔限界が当てはまります。また、並列処理を適応的に活用できるGP-BUCBの変種であるガウス過程適応型信頼度上限境界(GP-AUCB)も紹介します。GP-BUCBとGP-AUCBを、いくつかのシミュレーションデータセットと実際のデータセットで評価します。これらの実験により、GP-BUCBとGP-AUCBが最先端のヒューリスティックに匹敵することが示されました。(この研究の以前のバージョンは、2012年の第29回国際機械学習会議の議事録に掲載されました。)

Robust Hierarchical Clustering
ロバストな階層クラスタリング

One of the most widely used techniques for data clustering is agglomerative clustering. Such algorithms have been long used across many different fields ranging from computational biology to social sciences to computer vision in part because their output is easy to interpret. Unfortunately, it is well known, however, that many of the classic agglomerative clustering algorithms are not robust to noise. In this paper we propose and analyze a new robust algorithm for bottom-up agglomerative clustering. We show that our algorithm can be used to cluster accurately in cases where the data satisfies a number of natural properties and where the traditional agglomerative algorithms fail. We also show how to adapt our algorithm to the inductive setting where our given data is only a small random sample of the entire data set. Experimental evaluations on synthetic and real world data sets show that our algorithm achieves better performance than other hierarchical algorithms in the presence of noise.

データクラスタリングで最も広く使用されている手法の1つが、凝集型クラスタリングです。このようなアルゴリズムは、出力の解釈が容易なこともあり、計算生物学から社会科学、コンピュータービジョンに至るまで、さまざまな分野で長年使用されてきました。しかし、残念ながら、従来の凝集型クラスタリングアルゴリズムの多くはノイズに対して堅牢ではないことがよく知られています。この論文では、ボトムアップ凝集型クラスタリングの新しい堅牢なアルゴリズムを提案し、分析します。データがいくつかの自然な特性を満たし、従来の凝集型アルゴリズムが失敗する場合に、このアルゴリズムを使用して正確にクラスタリングできることを示します。また、与えられたデータがデータセット全体の小さなランダムサンプルにすぎない帰納的設定にアルゴリズムを適応させる方法も示します。合成データセットと実世界のデータセットでの実験的評価では、ノイズが存在する場合に、このアルゴリズムが他の階層型アルゴリズムよりも優れたパフォーマンスを発揮することが示されています。

Effective Sampling and Learning for Mallows Models with Pairwise-Preference Data
ペアワイズ選好データを用いたマローズモデルの効果的なサンプリングと学習

Learning preference distributions is a critical problem in many areas (e.g., recommender systems, IR, social choice). However, many existing learning and inference methods impose restrictive assumptions on the form of user preferences that can be admitted as evidence. We relax these restrictions by considering as data arbitrary pairwise comparisons of alternatives, which represent the fundamental building blocks of ordinal rankings. We develop the first algorithms for learning Mallows models (and mixtures thereof) from pairwise comparison data. At the heart of our technique is a new algorithm, the generalized repeated insertion model (GRIM), which allows sampling from arbitrary ranking distributions, and conditional Mallows models in particular. While we show that sampling from a Mallows model with pairwise evidence is computationally difficult in general, we develop approximate samplers that are exact for many important special cases–and have provable bounds with pairwise evidence–and derive algorithms for evaluating log-likelihood, learning Mallows mixtures, and non-parametric estimation. Experiments on real-world data sets demonstrate the effectiveness of our approach. (Some parts of this paper appeared in: T. Lu and C. Boutilier, Learning Mallows Models with Pairwise Preferences, Proceedings of the Twenty- Eighth International Conference on Machine Learning (ICML 2011), pp.145-152, Bellevue, WA (2011).)

嗜好分布の学習は、多くの分野(レコメンデーションシステム、IR、社会的選択など)で重要な問題です。しかし、既存の学習および推論方法の多くは、証拠として認められるユーザー嗜好の形式に制限的な仮定を課しています。私たちは、順序付けランキングの基本的な構成要素を表す、任意の代替案のペアワイズ比較をデータとして考慮することで、これらの制限を緩和します。私たちは、ペアワイズ比較データからMallowsモデル(およびその混合)を学習するための最初のアルゴリズムを開発します。私たちの手法の中核となるのは、新しいアルゴリズムである一般化反復挿入モデル(GRIM)です。これにより、任意のランキング分布、特に条件付きMallowsモデルからのサンプリングが可能になります。ペアワイズ証拠によるMallowsモデルからのサンプリングは一般に計算上困難であることを示していますが、多くの重要な特殊なケースに対して正確で、ペアワイズ証拠による証明可能な境界を持つ近似サンプラーを開発し、対数尤度、学習Mallows混合、およびノンパラメトリック推定を評価するためのアルゴリズムを導き出します。実際のデータセットでの実験により、私たちのアプローチの有効性が実証されています。(この論文の一部は、T. LuとC. Boutilierの「Learning Mallows Models with Pairwise Preferreds」、Proceedings of the Twenty- Eighth International Conference on Machine Learning (ICML 2011)、pp.145-152、Bellevue、WA (2011)に掲載されています。)

Order-Independent Constraint-Based Causal Structure Learning
順序に依存しない制約に基づく因果構造学習

We consider constraint-based methods for causal structure learning, such as the PC-, FCI-, RFCI- and CCD- algorithms (Spirtes et al., 1993, 2000; Richardson, 1996; Colombo et al., 2012; Claassen et al., 2013). The first step of all these algorithms consists of the adjacency search of the PC-algorithm. The PC-algorithm is known to be order-dependent, in the sense that the output can depend on the order in which the variables are given. This order-dependence is a minor issue in low- dimensional settings. We show, however, that it can be very pronounced in high-dimensional settings, where it can lead to highly variable results. We propose several modifications of the PC-algorithm (and hence also of the other algorithms) that remove part or all of this order-dependence. All proposed modifications are consistent in high-dimensional settings under the same conditions as their original counterparts. We compare the PC-, FCI-, and RFCI-algorithms and their modifications in simulation studies and on a yeast gene expression data set. We show that our modifications yield similar performance in low- dimensional settings and improved performance in high- dimensional settings. All software is implemented in the R-package pcalg.

私たちは、PC、FCI、RFCI、CCDアルゴリズム(Spirtes他、1993、2000年; Richardson、1996年; Colombo他、2012年; Claassen他、2013年)などの因果構造学習のための制約ベースの手法を検討します。これらすべてのアルゴリズムの最初のステップは、PCアルゴリズムの隣接検索で構成されます。PCアルゴリズムは、出力が変数が与えられる順序に依存するという意味で、順序に依存することが知られています。この順序依存性は、低次元の設定では小さな問題です。しかし、私たちは、高次元の設定では非常に顕著になる可能性があり、その結果の変動が大きくなる可能性があることを示します。私たちは、この順序依存性の一部またはすべてを除去するPCアルゴリズム(および他のアルゴリズムも)のいくつかの修正を提案します。提案されたすべての修正は、元の修正と同じ条件下で高次元設定で一貫しています。シミュレーション研究と酵母遺伝子発現データセットで、PC、FCI、RFCIアルゴリズムとその修正を比較します。修正により、低次元設定では同様のパフォーマンスが得られ、高次元設定ではパフォーマンスが向上することが示されています。すべてのソフトウェアは、Rパッケージpcalgに実装されています。

BayesOpt: A Bayesian Optimization Library for Nonlinear Optimization, Experimental Design and Bandits
BayesOpt: 非線形最適化、実験計画法、バンディットのためのベイズ最適化ライブラリ

BayesOpt is a library with state-of-the-art Bayesian optimization methods to solve nonlinear optimization, stochastic bandits or sequential experimental design problems. Bayesian optimization characterized for being sample efficient as it builds a posterior distribution to capture the evidence and prior knowledge of the target function. Built in standard C++, the library is extremely efficient while being portable and flexible. It includes a common interface for C, C++, Python, Matlab and Octave.

BayesOptは、非線形最適化、確率的バンディット、または逐次実験設計問題を解くための最先端のベイズ最適化手法を備えたライブラリです。ベイズ最適化は、ターゲット関数の証拠と事前知識を捕捉するための事後分布を構築するため、サンプル効率が高いことを特徴としています。標準のC++で構築されたこのライブラリは、移植性と柔軟性を備えながら、非常に効率的です。これには、C、C ++、Python、Matlab、Octaveの共通インターフェースが含まれています。

Semi-Supervised Eigenvectors for Large-Scale Locally-Biased Learning
大規模局所偏学習のための半教師付き固有ベクトル

In many applications, one has side information, e.g., labels that are provided in a semi-supervised manner, about a specific target region of a large data set, and one wants to perform machine learning and data analysis tasks “nearby” that prespecified target region. For example, one might be interested in the clustering structure of a data graph near a prespecified “seed set” of nodes, or one might be interested in finding partitions in an image that are near a prespecified “ground truth” set of pixels. Locally-biased problems of this sort are particularly challenging for popular eigenvector-based machine learning and data analysis tools. At root, the reason is that eigenvectors are inherently global quantities, thus limiting the applicability of eigenvector-based methods in situations where one is interested in very local properties of the data. In this paper, we address this issue by providing a methodology to construct semi-supervised eigenvectors of a graph Laplacian, and we illustrate how these locally-biased eigenvectors can be used to perform locally-biased machine learning. These semi-supervised eigenvectors capture successively-orthogonalized directions of maximum variance, conditioned on being well-correlated with an input seed set of nodes that is assumed to be provided in a semi-supervised manner. We show that these semi-supervised eigenvectors can be computed quickly as the solution to a system of linear equations; and we also describe several variants of our basic method that have improved scaling properties. We provide several empirical examples demonstrating how these semi-supervised eigenvectors can be used to perform locally-biased learning; and we discuss the relationship between our results and recent machine learning algorithms that use global eigenvectors of the graph Laplacian.

多くのアプリケーションでは、大規模なデータセットの特定のターゲット領域に関する副次情報(半教師あり方式で提供されるラベルなど)があり、その事前に指定されたターゲット領域の「近く」で機械学習やデータ分析タスクを実行したい場合があります。たとえば、事前に指定されたノードの「シードセット」付近のデータグラフのクラスタリング構造に関心がある場合や、事前に指定されたピクセルの「グラウンドトゥルース」セット付近にあるイメージ内のパーティションを見つけることに関心がある場合があります。この種のローカルに偏った問題は、一般的な固有ベクトルベースの機械学習およびデータ分析ツールにとって特に困難です。根本的な理由は、固有ベクトルが本質的にグローバルな量であるため、データの非常にローカルな特性に関心がある状況では、固有ベクトルベースの方法の適用性が制限されるからです。この論文では、グラフラプラシアンの半教師あり固有ベクトルを構築する方法論を提供することでこの問題に対処し、これらの局所的に偏った固有ベクトルを使用して局所的に偏った機械学習を実行する方法を示します。これらの半教師あり固有ベクトルは、半教師あり方式で提供されると想定されるノードの入力シードセットとよく相関していることを条件として、最大分散の連続直交方向を捉えます。これらの半教師あり固有ベクトルは、線形方程式系の解として迅速に計算できることを示します。また、スケーリング特性が改善された基本手法のいくつかのバリエーションについても説明します。これらの半教師あり固有ベクトルを使用して局所的に偏った学習を実行する方法を示すいくつかの実証例を示し、グラフラプラシアンのグローバル固有ベクトルを使用する最近の機械学習アルゴリズムと私たちの結果との関係について説明します。

Transfer Learning Decision Forests for Gesture Recognition
ジェスチャー認識のための転移学習決定フォレスト

Decision forests are an increasingly popular tool in computer vision problems. Their advantages include high computational efficiency, state-of-the-art accuracy and multi-class support. In this paper, we present a novel method for transfer learning which uses decision forests, and we apply it to recognize gestures and characters. We introduce two mechanisms into the decision forest framework in order to transfer knowledge from the source tasks to a given target task. The first one is mixed information gain, which is a data-based regularizer. The second one is label propagation, which infers the manifold structure of the feature space. We show that both of them are important to achieve higher accuracy. Our experiments demonstrate improvements over traditional decision forests in the ChaLearn Gesture Challenge and MNIST data set. They also compare favorably against other state-of-the-art classifiers.

デシジョンフォレストは、コンピュータービジョンの問題でますます人気のあるツールです。その利点には、高い計算効率、最先端の精度、マルチクラスサポートなどがあります。この論文では、デシジョンフォレストを用いた転移学習の新たな手法を提示し、それをジェスチャーや文字の認識に応用します。デシジョンフォレストフレームワークに2つのメカニズムを導入して、ソースタスクから特定のターゲットタスクに知識を転送します。1つ目は、データベースの正則化器である混合情報利得です。2つ目はラベル伝播で、特徴空間の多様体構造を推測します。より高い精度を達成するためには、両方が重要であることを示します。私たちの実験では、ChaLearn Gesture ChallengeおよびMNISTデータセットで従来の決定林よりも改善が実証されています。また、他の最先端の分類子と比較しても優れています。

New Results for Random Walk Learning
ランダムウォーク学習の新しい結果

In a very strong positive result for passive learning algorithms, Bshouty et al. showed that DNF expressions are efficiently learnable in the uniform random walk model. It is natural to ask whether the more expressive class of thresholds of parities (TOP) can also be learned efficiently in this model, since both DNF and TOP are efficiently uniform-learnable from queries. However, the time bounds of the algorithms of Bshouty et al. are exponential for TOP. We present a new approach to weak parity learning that leads to quasi-efficient uniform random walk learnability of TOP. We also introduce a more general random walk model and give two positive results in this new model: DNF is efficiently learnable and juntas are efficiently agnostically learnable.

受動的学習アルゴリズムの非常に強力な肯定的な結果として、Bshoutyらは、DNF表現が一様ランダムウォークモデルで効率的に学習できることを示しました。DNFとTOPの両方がクエリから効率的に均一に学習できるため、このモデルでは、より表現力豊かなパリティのしきい値(TOP)も効率的に学習できるかどうかを問うのは自然なことです。ただし、Bshoutyらのアルゴリズムの時間境界は、TOPでは指数関数的です。私たちは、TOPの準効率的な一様ランダムウォーク学習可能性につながる弱パリティ学習への新しいアプローチを提示します。また、より一般的なランダムウォークモデルを導入し、この新しいモデルでは、DNFが効率的に学習可能であることと、juntaが効率的に非依存的に学習できることという2つの肯定的な結果が得られます。

Revisiting Bayesian Blind Deconvolution
ベイジアンブラインドデコンボリューションの再検討

Blind deconvolution involves the estimation of a sharp signal or image given only a blurry observation. Because this problem is fundamentally ill-posed, strong priors on both the sharp image and blur kernel are required to regularize the solution space. While this naturally leads to a standard MAP estimation framework, performance is compromised by unknown trade-off parameter settings, optimization heuristics, and convergence issues stemming from non-convexity and/or poor prior selections. To mitigate some of these problems, a number of authors have recently proposed substituting a variational Bayesian (VB) strategy that marginalizes over the high-dimensional image space leading to better estimates of the blur kernel. However, the underlying cost function now involves both integrals with no closed-form solution and complex, function-valued arguments, thus losing the transparency of MAP. Beyond standard Bayesian- inspired intuitions, it thus remains unclear by exactly what mechanism these methods are able to operate, rendering understanding, improvements and extensions more difficult. To elucidate these issues, we demonstrate that the VB methodology can be recast as an unconventional MAP problem with a very particular penalty/prior that conjoins the image, blur kernel, and noise level in a principled way. This unique penalty has a number of useful characteristics pertaining to relative concavity, local minima avoidance, normalization, and scale- invariance that allow us to rigorously explain the success of VB including its existing implementational heuristics and approximations. It also provides strict criteria for learning the noise level and choosing the optimal image prior that, perhaps counter-intuitively, need not reflect the statistics of natural scenes. In so doing we challenge the prevailing notion of why VB is successful for blind deconvolution while providing a transparent platform for introducing enhancements and extensions. Moreover, the underlying insights carry over to a wide variety of other bilinear models common in the machine learning literature such as independent component analysis, dictionary learning/sparse coding, and non-negative matrix factorization.

ブラインドデコンボリューションでは、ぼやけた観測値のみに基づいて鮮明な信号または画像を推定します。この問題は根本的に不適切であるため、ソリューション空間を正規化するには、鮮明な画像とぼやけたカーネルの両方に強力な事前分布が必要です。これは当然、標準的なMAP推定フレームワークにつながりますが、未知のトレードオフパラメータ設定、最適化ヒューリスティック、および非凸性や事前分布の選択の悪さに起因する収束の問題によってパフォーマンスが低下します。これらの問題の一部を軽減するために、最近、多くの著者が、高次元画像空間を周辺化してぼやけたカーネルの推定値を向上させる変分ベイズ(VB)戦略を代用することを提案しました。ただし、基礎となるコスト関数には、閉形式の解のない積分と複雑な関数値の引数の両方が含まれるようになったため、MAPの透明性が失われています。したがって、標準的なベイズに触発された直感を超えて、これらの方法が正確にどのようなメカニズムで機能するのかは不明のままであり、理解、改善、拡張がより困難になっています。これらの問題を明らかにするために、VB手法は、画像、ぼかしカーネル、ノイズレベルを原理的に結合した非常に特殊なペナルティ/事前確率を持つ、従来とは異なるMAP問題として作り直すことができることを実証します。この独自のペナルティには、相対的な凹面、局所的最小値の回避、正規化、スケール不変性に関連する多くの有用な特性があり、既存の実装ヒューリスティックと近似を含むVBの成功を厳密に説明できます。また、ノイズレベルを学習し、最適な画像事前確率を選択するための厳格な基準も提供します。これは、直感に反するかもしれませんが、自然なシーンの統計を反映する必要はありません。そうすることで、VBがブラインドデコンボリューションに成功している理由についての一般的な概念に異議を唱えながら、拡張機能と拡張を導入するための透明なプラットフォームを提供します。さらに、根底にある洞察は、独立成分分析、辞書学習/スパースコーディング、非負行列因数分解など、機械学習の文献で一般的なさまざまな他の双線形モデルにも引き継がれます。

What Regularized Auto-Encoders Learn from the Data-Generating Distribution
正規化オートエンコーダがデータ生成分布から学ぶこと

What do auto-encoders learn about the underlying data-generating distribution? Recent work suggests that some auto-encoder variants do a good job of capturing the local manifold structure of data. This paper clarifies some of these previous observations by showing that minimizing a particular form of regularized reconstruction error yields a reconstruction function that locally characterizes the shape of the data- generating density. We show that the auto-encoder captures the score (derivative of the log-density with respect to the input). It contradicts previous interpretations of reconstruction error as an energy function. Unlike previous results, the theorems provided here are completely generic and do not depend on the parameterization of the auto-encoder: they show what the auto- encoder would tend to if given enough capacity and examples. These results are for a contractive training criterion we show to be similar to the denoising auto-encoder training criterion with small corruption noise, but with contraction applied on the whole reconstruction function rather than just encoder. Similarly to score matching, one can consider the proposed training criterion as a convenient alternative to maximum likelihood because it does not involve a partition function. Finally, we show how an approximate Metropolis-Hastings MCMC can be setup to recover samples from the estimated distribution, and this is confirmed in sampling experiments.

オートエンコーダは、基礎となるデータ生成分布について何を学ぶのでしょうか。最近の研究では、一部のオートエンコーダのバリアントがデータのローカルな多様体構造をうまく捉えていることが示唆されています。この論文では、特定の形式の正規化された再構成誤差を最小化すると、データ生成密度の形状をローカルに特徴付ける再構成関数が得られることを示すことで、これらの以前の観察のいくつかを明らかにします。オートエンコーダがスコア(入力に対する対数密度の導関数)を捉えることを示します。これは、再構成誤差をエネルギー関数として解釈する以前の解釈と矛盾します。以前の結果とは異なり、ここで提供される定理は完全に汎用的であり、オートエンコーダのパラメーター化に依存しません。十分な容量と例が与えられた場合にオートエンコーダがどのような傾向を示すかを示します。これらの結果は、小さな破損ノイズを伴うノイズ除去オートエンコーダのトレーニング基準に類似していることを示す収縮トレーニング基準ですが、収縮はエンコーダだけでなく再構成関数全体に適用されます。スコアマッチングと同様に、提案されたトレーニング基準はパーティション関数を必要としないため、最大尤度の便利な代替手段と見なすことができます。最後に、推定分布からサンプルを回復するために近似メトロポリス-ヘイスティングスMCMCを設定する方法を示し、これはサンプリング実験で確認されています。

Asymptotic Accuracy of Distribution-Based Estimation of Latent Variables
潜在変数の分布ベース推定の漸近精度

Hierarchical statistical models are widely employed in information science and data engineering. The models consist of two types of variables: observable variables that represent the given data and latent variables for the unobservable labels. An asymptotic analysis of the models plays an important role in evaluating the learning process; the result of the analysis is applied not only to theoretical but also to practical situations, such as optimal model selection and active learning. There are many studies of generalization errors, which measure the prediction accuracy of the observable variables. However, the accuracy of estimating the latent variables has not yet been elucidated. For a quantitative evaluation of this, the present paper formulates distribution-based functions for the errors in the estimation of the latent variables. The asymptotic behavior is analyzed for both the maximum likelihood and the Bayes methods.

階層型統計モデルは、情報科学やデータエンジニアリングの分野で広く採用されています。モデルは、指定されたデータを表す観測可能な変数と、観測不可能なラベルの潜在変数の2種類の変数で構成されます。モデルの漸近解析は、学習プロセスの評価において重要な役割を果たします。解析結果は、理論的なものだけでなく、最適なモデル選択やアクティブラーニングなど、実践的な場面にも応用されます。観測可能な変数の予測精度を測定する一般化誤差については、多くの研究があります。しかし、潜在変数の推定精度は未だ解明されていません。これを定量的に評価するために、この論文では、潜在変数の推定誤差に対する分布ベースの関数を定式化します。漸近的な振る舞いは、最尤法とベイズ法の両方で解析されます。

Seeded Graph Matching for Correlated Erdos-Renyi Graphs
相関エルデシュ・レニグラフのシードグラフマッチング

Graph matching is an important problem in machine learning and pattern recognition. Herein, we present theoretical and practical results on the consistency of graph matching for estimating a latent alignment function between the vertex sets of two graphs, as well as subsequent algorithmic implications when the latent alignment is partially observed. In the correlated ErdÅs-RÃ©nyi graph setting, we prove that graph matching provides a strongly consistent estimate of the latent alignment in the presence of even modest correlation. We then investigate a tractable, restricted-focus version of graph matching, which is only concerned with adjacency involving vertices in a partial observation of the latent alignment; we prove that a logarithmic number of vertices whose alignment is known is sufficient for this restricted-focus version of graph matching to yield a strongly consistent estimate of the latent alignment of the remaining vertices. We show how Frank-Wolfe methodology for approximate graph matching, when there is a partially observed latent alignment, inherently incorporates this restricted-focus graph matching. Lastly, we illustrate the relationship between seeded graph matching and restricted-focus graph matching by means of an illuminating example from human connectomics.

グラフマッチングは、機械学習とパターン認識における重要な問題です。ここでは、2つのグラフの頂点セット間の潜在的アラインメント関数を推定するためのグラフマッチングの一貫性に関する理論的および実際的な結果と、潜在的アラインメントが部分的に観測される場合のその後のアルゴリズム的意味を示します。相関のあるErdÅs-RÃ©nyiグラフ設定では、グラフマッチングは、たとえわずかな相関があっても、潜在的アラインメントの強い一貫性のある推定値を提供することを証明します。次に、潜在的アラインメントの部分的な観測における頂点の隣接性のみに関係する、扱いやすい焦点制限バージョンのグラフマッチングを調査します。この焦点制限バージョンのグラフマッチングでは、アラインメントがわかっている頂点の対数数が、残りの頂点の潜在的アラインメントの強い一貫性のある推定値を生成するのに十分であることを証明します。部分的に観測される潜在的アラインメントがある場合の近似グラフマッチングのFrank-Wolfe方法論が、この焦点制限グラフマッチングを本質的に組み込む方法を示します。最後に、人間のコネクトミクスからのわかりやすい例を用いて、シードグラフマッチングと制限焦点グラフマッチングの関係を説明します。

Multi-Objective Reinforcement Learning using Sets of Pareto Dominating Policies
パレート支配方策のセットを使用した多目的強化学習

Many real-world problems involve the optimization of multiple, possibly conflicting objectives. Multi-objective reinforcement learning (MORL) is a generalization of standard reinforcement learning where the scalar reward signal is extended to multiple feedback signals, in essence, one for each objective. MORL is the process of learning policies that optimize multiple criteria simultaneously. In this paper, we present a novel temporal difference learning algorithm that integrates the Pareto dominance relation into a reinforcement learning approach. This algorithm is a multi-policy algorithm that learns a set of Pareto dominating policies in a single run. We name this algorithm Pareto Q-learning and it is applicable in episodic environments with deterministic as well as stochastic transition functions. A crucial aspect of Pareto $Q$-learning is the updating mechanism that bootstraps sets of $Q$-vectors. One of our main contributions in this paper is a mechanism that separates the expected immediate reward vector from the set of expected future discounted reward vectors. This decomposition allows us to update the sets and to exploit the learned policies consistently throughout the state space. To balance exploration and exploitation during learning, we also propose three set evaluation mechanisms. These three mechanisms evaluate the sets of vectors to accommodate for standard action selection strategies, such as $\epsilon$-greedy. More precisely, these mechanisms use multi-objective evaluation principles such as the hypervolume measure, the cardinality indicator and the Pareto dominance relation to select the most promising actions. We experimentally validate the algorithm on multiple environments with two and three objectives and we demonstrate that Pareto $Q$-learning outperforms current state-of-the-art MORL algorithms with respect to the hypervolume of the obtained policies. We note that (1) Pareto $Q$-learning is able to learn the entire Pareto front under the usual assumption that each state-action pair is sufficiently sampled, while (2) not being biased by the shape of the Pareto front. Furthermore, (3) the set evaluation mechanisms provide indicative measures for local action selection and (4) the learned policies can be retrieved throughout the state and action space.

現実世界の問題の多くは、複数の、場合によっては矛盾する目的の最適化を伴います。多目的強化学習(MORL)は、スカラー報酬信号が複数のフィードバック信号(本質的には各目的に1つ)に拡張される、標準的な強化学習の一般化です。MORLは、複数の基準を同時に最適化するポリシーを学習するプロセスです。この論文では、パレート優位関係を強化学習アプローチに統合する新しい時間差分学習アルゴリズムを紹介します。このアルゴリズムは、1回の実行でパレート優位ポリシーのセットを学習するマルチポリシーアルゴリズムです。このアルゴリズムをパレートQ学習と名付け、決定論的および確率的遷移関数を持つエピソード環境に適用できます。パレート$Q$学習の重要な側面は、$Q$ベクトルのセットをブートストラップする更新メカニズムです。この論文での主な貢献の1つは、予想される即時報酬ベクトルを予想される将来の割引報酬ベクトルのセットから分離するメカニズムです。この分解により、セットを更新し、学習したポリシーを状態空間全体で一貫して活用することができます。学習中の探索と活用のバランスをとるために、3つのセット評価メカニズムも提案します。これらの3つのメカニズムは、ベクトルのセットを評価して、$\epsilon$-greedyなどの標準的なアクション選択戦略に対応します。より正確には、これらのメカニズムは、ハイパーボリューム測定、カーディナリティ指標、パレート優位関係などの多目的評価原理を使用して、最も有望なアクションを選択します。2つおよび3つの目的を持つ複数の環境でアルゴリズムを実験的に検証し、得られたポリシーのハイパーボリュームに関して、パレート$Q$学習が現在の最先端のMORLアルゴリズムよりも優れていることを実証しました。(1)パレート$Q$学習は、各状態アクションペアが十分にサンプリングされているという通常の仮定の下で、パレートフロント全体を学習できること、(2)パレートフロントの形状によってバイアスされないことに注目します。さらに、（３）設定された評価メカニズムは、局所的な行動選択のための指標を提供し、（４）学習されたポリシーは、状態および行動空間全体で取得することができます。

Revisiting Stein’s Paradox: Multi-Task Averaging
スタインのパラドックス再考:マルチタスク平均化

We present a multi-task learning approach to jointly estimate the means of multiple independent distributions from samples. The proposed multi-task averaging (MTA) algorithm results in a convex combination of the individual task’s sample averages. We derive the optimal amount of regularization for the two task case for the minimum risk estimator and a minimax estimator, and show that the optimal amount of regularization can be practically estimated without cross-validation. We extend the practical estimators to an arbitrary number of tasks. Simulations and real data experiments demonstrate the advantage of the proposed MTA estimators over standard averaging and James-Stein estimation.

私たちは、サンプルから複数の独立した分布の平均を共同で推定するマルチタスク学習アプローチを提示します。提案されたマルチタスク平均化(MTA)アルゴリズムは、個々のタスクのサンプル平均の凸型の組み合わせになります。最小リスク推定量と最小最大推定量の2つのタスクケースの正則化の最適量を導出し、最適な正則化量を交差検証なしで実質的に推定できることを示します。実用的な見積もりを任意の数のタスクに拡張します。シミュレーションと実際のデータ実験は、提案されたMTA推定量が標準平均化およびJames-Stein推定よりも優れていることを示しています。

Efficient Learning and Planning with Compressed Predictive States
圧縮された予測状態による効率的な学習と計画

Predictive state representations (PSRs) offer an expressive framework for modelling partially observable systems. By compactly representing systems as functions of observable quantities, the PSR learning approach avoids using local-minima prone expectation-maximization and instead employs a globally optimal moment-based algorithm. Moreover, since PSRs do not require a predetermined latent state structure as an input, they offer an attractive framework for model-based reinforcement learning when agents must plan without a priori access to a system model. Unfortunately, the expressiveness of PSRs comes with significant computational cost, and this cost is a major factor inhibiting the use of PSRs in applications. In order to alleviate this shortcoming, we introduce the notion of compressed PSRs (CPSRs). The CPSR learning approach combines recent advancements in dimensionality reduction, incremental matrix decomposition, and compressed sensing. We show how this approach provides a principled avenue for learning accurate approximations of PSRs, drastically reducing the computational costs associated with learning while also providing effective regularization. Going further, we propose a planning framework which exploits these learned models. And we show that this approach facilitates model-learning and planning in large complex partially observable domains, a task that is infeasible without the principled use of compression.

予測状態表現(PSR)は、部分的に観測可能なシステムをモデル化するための表現力豊かなフレームワークを提供します。システムを観測可能な量の関数としてコンパクトに表現することにより、PSR学習アプローチは、局所的最小値傾向の期待値最大化の使用を回避し、代わりにグローバルに最適なモーメントベースのアルゴリズムを採用します。さらに、PSRは入力として事前に決定された潜在状態構造を必要としないため、エージェントがシステムモデルに事前にアクセスせずに計画を立てる必要がある場合に、モデルベースの強化学習の魅力的なフレームワークを提供します。残念ながら、PSRの表現力には大きな計算コストが伴い、このコストがアプリケーションでのPSRの使用を妨げる主な要因となっています。この欠点を軽減するために、圧縮PSR (CPSR)の概念を導入します。CPSR学習アプローチは、次元削減、増分行列分解、および圧縮センシングにおける最近の進歩を組み合わせたものです。このアプローチが、PSRの正確な近似値を学習するための原理的な手段を提供し、学習に関連する計算コストを大幅に削減しながら、効果的な正規化も提供する方法を示します。さらに、学習したモデルを活用する計画フレームワークを提案します。また、このアプローチにより、圧縮を原理的に使用しなければ実行不可能な、大規模で複雑な部分的に観測可能なドメインでのモデル学習と計画が容易になることを示します。

SPMF: A Java Open-Source Pattern Mining Library
SPMF: Java オープンソースのパターンマイニングライブラリ

We present SPMF, an open-source data mining library offering implementations of more than 55 data mining algorithms. SPMF is a cross-platform library implemented in Java, specialized for discovering patterns in transaction and sequence databases such as frequent itemsets, association rules and sequential patterns. The source code can be integrated in other Java programs. Moreover, SPMF offers a command line interface and a simple graphical interface for quick testing. The source code is available under the GNU General Public License, version 3. The website of the project offers several resources such as documentation with examples of how to run each algorithm, a developer’s guide, performance comparisons of algorithms, data sets, an active forum, a FAQ and a mailing list.

私たちは、55を超えるデータマイニングアルゴリズムの実装を提供するオープンソースのデータマイニングライブラリであるSPMFを紹介します。SPMFは、Javaで実装されたクロスプラットフォームライブラリであり、頻繁なアイテムセット、関連付けルール、シーケンシャルパターンなどのトランザクションおよびシーケンスデータベースのパターンを検出することに特化したものです。ソース・コードは、他のJavaプログラムに統合できます。さらに、SPMFは、迅速なテストのためのコマンドラインインターフェイスとシンプルなグラフィカルインターフェイスを提供します。ソースコードは、GNU General Public License、バージョン3の下で入手できます。プロジェクトのWebサイトでは、各アルゴリズムの実行方法の例を示すドキュメント、開発者向けガイド、アルゴリズムのパフォーマンス比較、データセット、アクティブなフォーラム、FAQ、メーリングリストなど、いくつかのリソースを提供しています。

On the Bayes-Optimality of F-Measure Maximizers
Fメジャーマキシマイザーのベイズ最適性について

The F-measure, which has originally been introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure is a statistically and computationally challenging problem, since no closed-form solution exists. Adopting a decision-theoretic perspective, this article provides a formal and experimental analysis of different approaches for maximizing the F-measure. We start with a Bayes-risk analysis of related loss functions, such as Hamming loss and subset zero-one loss, showing that optimizing such losses as a surrogate of the F-measure leads to a high worst-case regret. Subsequently, we perform a similar type of analysis for F-measure maximizing algorithms, showing that such algorithms are approximate, while relying on additional assumptions regarding the statistical distribution of the binary response variables. Furthermore, we present a new algorithm which is not only computationally efficient but also Bayes-optimal, regardless of the underlying distribution. To this end, the algorithm requires only a quadratic (with respect to the number of binary responses) number of parameters of the joint distribution. We illustrate the practical performance of all analyzed methods by means of experiments with multi-label classification problems.

F値はもともと情報検索で導入されたものですが、現在ではバイナリ分類、マルチラベル分類、構造化出力予測などの問題のパフォーマンスメトリックとして日常的に使用されています。この尺度の最適化は、閉じた形式のソリューションが存在しないため、統計的にも計算的にも困難な問題です。意思決定理論の観点から、この記事では、F値を最大化するためのさまざまなアプローチの形式的かつ実験的な分析を提供します。まず、ハミング損失やサブセットゼロワン損失などの関連損失関数のベイズリスク分析から始め、このような損失をF値の代わりに最適化すると、最悪の場合の後悔が高くなることを示します。次に、F値最大化アルゴリズムに対して同様のタイプの分析を実行し、バイナリ応答変数の統計的分布に関する追加の仮定に依存しながら、このようなアルゴリズムが近似していることを示します。さらに、基礎となる分布に関係なく、計算効率が高いだけでなくベイズ最適な新しいアルゴリズムを紹介します。このため、アルゴリズムでは、結合分布のパラメータの数が（バイナリ応答の数に対して）2乗だけ必要になります。マルチラベル分類問題を使った実験によって、分析したすべての方法の実際のパフォーマンスを示します。

Convolutional Nets and Watershed Cuts for Real-Time Semantic Labeling of RGBD Videos
RGBDビデオのリアルタイムセマンティックラベリングのための畳み込みネットと流域カット

This work addresses multi-class segmentation of indoor scenes with RGB-D inputs. While this area of research has gained much attention recently, most works still rely on hand-crafted features. In contrast, we apply a multiscale convolutional network to learn features directly from the images and the depth information. Using a frame by frame labeling, we obtain nearly state-of-the-art performance on the NYU-v2 depth data set with an accuracy of 64.5%. We then show that the labeling can be further improved by exploiting the temporal consistency in the video sequence of the scene. To that goal, we present a method producing temporally consistent superpixels from a streaming video. Among the different methods producing superpixel segmentations of an image, the graph-based approach of Felzenszwalb and Huttenlocher is broadly employed. One of its interesting properties is that the regions are computed in a greedy manner in quasi-linear time by using a minimum spanning tree. In a framework exploiting minimum spanning trees all along, we propose an efficient video segmentation approach that computes temporally consistent pixels in a causal manner, filling the need for causal and real-time applications. We illustrate the labeling of indoor scenes in video sequences that could be processed in real-time using appropriate hardware such as an FPGA.

この研究では、RGB-D入力による屋内シーンのマルチクラスセグメンテーションを扱っています。この研究分野は最近注目を集めていますが、ほとんどの研究は依然として手作業で作成された特徴に依存しています。対照的に、私たちはマルチスケール畳み込みネットワークを適用して、画像と深度情報から直接特徴を学習します。フレームごとのラベル付けを使用して、NYU-v2深度データセットで64.5%の精度でほぼ最先端のパフォーマンスを実現しました。次に、シーンのビデオシーケンスの時間的一貫性を利用することで、ラベル付けをさらに改善できることを示します。そのために、ストリーミングビデオから時間的に一貫性のあるスーパーピクセルを生成する方法を紹介します。画像のスーパーピクセルセグメンテーションを生成するさまざまな方法の中で、FelzenszwalbとHuttenlocherのグラフベースのアプローチが広く採用されています。その興味深い特性の1つは、最小全域木を使用して、領域が準線形時間で貪欲に計算されることです。最小全域木を常に活用するフレームワークで、時間的に一貫性のあるピクセルを因果的に計算し、因果的かつリアルタイムのアプリケーションのニーズを満たす効率的なビデオセグメンテーションアプローチを提案します。FPGAなどの適切なハードウェアを使用してリアルタイムで処理できるビデオシーケンス内の屋内シーンのラベル付けを示します。

The Gesture Recognition Toolkit
ジェスチャ認識ツールキット

The Gesture Recognition Toolkit is a cross-platform open-source C++ library designed to make real-time machine learning and gesture recognition more accessible for non-specialists. Emphasis is placed on ease of use, with a consistent, minimalist design that promotes accessibility while supporting flexibility and customization for advanced users. The toolkit features a broad range of classification and regression algorithms and has extensive support for building real-time systems. This includes algorithms for signal processing, feature extraction and automatic gesture spotting.

ジェスチャ認識ツールキットは、専門家以外の人でもリアルタイムの機械学習とジェスチャ認識をより利用しやすくするために設計された、クロスプラットフォームのオープンソースC++ライブラリです。使いやすさに重点が置かれており、一貫性のあるミニマリストデザインにより、アクセシビリティを促進しながら、上級ユーザーの柔軟性とカスタマイズをサポートしています。このツールキットは、幅広い分類アルゴリズムと回帰アルゴリズムを備えており、リアルタイムシステムの構築を幅広くサポートしています。これには、信号処理、特徴抽出、自動ジェスチャスポッティングのアルゴリズムが含まれます。

Alternating Linearization for Structured Regularization Problems
構造化正則化問題に対する交互線形化

We adapt the alternating linearization method for proximal decomposition to structured regularization problems, in particular, to the generalized lasso problems. The method is related to two well-known operator splitting methods, the Douglas–Rachford and the Peaceman–Rachford method, but it has descent properties with respect to the objective function. This is achieved by employing a special update test, which decides whether it is beneficial to make a Peaceman–Rachford step, any of the two possible Douglas–Rachford steps, or none. The convergence mechanism of the method is related to that of bundle methods of nonsmooth optimization. We also discuss implementation for very large problems, with the use of specialized algorithms and sparse data structures. Finally, we present numerical results for several synthetic and real-world examples, including a three-dimensional fused lasso problem, which illustrate the scalability, efficacy, and accuracy of the method.

私たちは、近位分解の交番線形化法を構造化正則化問題、特に一般化なげなわ問題に適応させます。この方法は、2つのよく知られた演算子分割法、ダグラス–ラックフォード法とピースマン–ラックフォード法に関連していますが、目的関数に関して降下特性があります。これは、ピースマン–ラックフォードステップを作成するのが有益か、2つの可能なダグラス-ラックフォードステップのいずれかを作成するか、またはまったく行わないかを決定する特別な更新テストを採用することによって達成されます。この手法の収束メカニズムは、非平滑最適化のバンドル法の収束メカニズムに関連しています。また、特殊なアルゴリズムとスパースデータ構造を使用した非常に大きな問題の実装についても説明します。最後に、3次元融合投げ縄問題を含む、いくつかの合成および実世界の例の数値結果を示し、この手法のスケーラビリティ、有効性、および精度を示します。

Statistical Analysis of Metric Graph Reconstruction
メトリックグラフ再構成の統計解析

A metric graph is a 1-dimensional stratified metric space consisting of vertices and edges or loops glued together. Metric graphs can be naturally used to represent and model data that take the form of noisy filamentary structures, such as street maps, neurons, networks of rivers and galaxies. We consider the statistical problem of reconstructing the topology of a metric graph embedded in $\mathbb{R}^D$ from a random sample. We derive lower and upper bounds on the minimax risk for the noiseless case and tubular noise case. The upper bound is based on the reconstruction algorithm given in Aanjaneya et al. (2012).

メトリックグラフは、頂点とエッジまたはループが接着された1次元の層別メトリック空間です。メートル法グラフは、道路地図、ニューロン、川や銀河のネットワークなど、ノイズの多いフィラメント構造の形をとるデータを表現およびモデル化するために自然に使用できます。私たちは、ランダムサンプルから$mathbb{R}^D$に埋め込まれたメトリックグラフのトポロジーを再構築する統計問題を考えます。ノイズレスの場合と管状ノイズの場合のミニマックスリスクの下限と上限を導き出します。上限は、Aanjaneyaら(2012)で示された再構成アルゴリズムに基づいています。

Matrix Completion with the Trace Norm: Learning, Bounding, and Transducing
トレースノルムによる行列補完:学習、バウンディング、および形質導入

Trace-norm regularization is a widely-used and successful approach for collaborative filtering and matrix completion. However, previous learning guarantees require strong assumptions, such as a uniform distribution over the matrix entries. In this paper, we bridge this gap by providing such guarantees, under much milder assumptions which correspond to matrix completion as performed in practice. In fact, we claim that previous difficulties partially stemmed from a mismatch between the standard learning-theoretic modeling of matrix completion, and its practical application. Our results also shed some light on the issue of matrix completion with bounded models, which enforce predictions to lie within a certain range. In particular, we provide experimental and theoretical evidence that such models lead to a modest yet significant improvement.

トレースノルム正則化は、協調フィルタリングと行列補完のために広く使用されている成功したアプローチです。ただし、以前の学習保証では、行列エントリ上の一様分布など、強力な仮定が必要です。この論文では、実際に実行される行列の完成に対応するはるかに穏やかな仮定の下で、このような保証を提供することにより、このギャップを埋めます。実際、以前の困難は、行列完成の標準的な学習理論モデリングとその実用化との間の不一致から部分的に生じていると主張しています。また、私たちの結果は、予測が特定の範囲内にあることを強制する有界モデルによる行列補完の問題にも光を当てています。特に、そのようなモデルが控えめではあるが大幅な改善につながるという実験的および理論的な証拠を提供します。

Active Contextual Policy Search
アクティブなコンテキストポリシー検索

We consider the problem of learning skills that are versatilely applicable. One popular approach for learning such skills is contextual policy search in which the individual tasks are represented as context vectors. We are interested in settings in which the agent is able to actively select the tasks that it examines during the learning process. We argue that there is a better way than selecting each task equally often because some tasks might be easier to learn at the beginning and the knowledge that the agent can extract from these tasks can be transferred to similar but more difficult tasks. The methods that we propose for addressing the task-selection problem model the learning process as a non-stationary multi-armed bandit problem with custom intrinsic reward heuristics so that the estimated learning progress will be maximized. This approach does neither make any assumptions about the underlying contextual policy search algorithm nor about the policy representation. We present empirical results on an artificial benchmark problem and a ball throwing problem with a simulated Mitsubishi PA-10 robot arm which show that active context selection can improve the learning of skills considerably.

私たちは、多目的に適用できるスキルを学習するという問題について検討します。このようなスキルを学習するための一般的なアプローチの1つは、個々のタスクがコンテキストベクトルとして表現されるコンテキストポリシー検索です。私たちは、エージェントが学習プロセス中に調べるタスクを能動的に選択できる設定に関心があります。一部のタスクは最初は学習しやすい可能性があり、エージェントがこれらのタスクから抽出できる知識は、類似しているがより難しいタスクに転用できるため、各タスクを同じ頻度で選択するよりも良い方法があると主張します。タスク選択問題に対処するために提案する方法は、学習プロセスを、カスタムの内在的報酬ヒューリスティックを備えた非定常多腕バンディット問題としてモデル化し、推定学習の進行が最大化されるようにします。このアプローチでは、基礎となるコンテキストポリシー検索アルゴリズムやポリシー表現について何の仮定も行いません。私たちは、人工ベンチマーク問題と、シミュレートされたMitsubishi PA-10ロボットアームを使用したボール投げ問題に関する実証結果を提示し、能動的なコンテキスト選択によってスキルの学習が大幅に改善されることを示しています。

Inconsistency of Pitman-Yor Process Mixtures for the Number of Components
ピットマン・ヨール法の混合の成分数に対する不一致

In many applications, a finite mixture is a natural model, but it can be difficult to choose an appropriate number of components. To circumvent this choice, investigators are increasingly turning to Dirichlet process mixtures (DPMs), and Pitman–Yor process mixtures (PYMs), more generally. While these models may be well-suited for Bayesian density estimation, many investigators are using them for inferences about the number of components, by considering the posterior on the number of components represented in the observed data. We show that this posterior is not consistent—that is, on data from a finite mixture, it does not concentrate at the true number of components. This result applies to a large class of nonparametric mixtures, including DPMs and PYMs, over a wide variety of families of component distributions, including essentially all discrete families, as well as continuous exponential families satisfying mild regularity conditions (such as multivariate Gaussians).

多くのアプリケーションでは、有限の混合物が自然なモデルですが、適切な数の成分を選択するのは難しい場合があります。この選択を回避するために、研究者はますますディリクレプロセス混合物(DPM)や、より一般的にはピットマン-ヨルプロセス混合物(PYM)に目を向けています。これらのモデルはベイズ密度推定に適しているかもしれませんが、多くの研究者は、観測データで表される成分の数の事後を考慮することにより、成分の数に関する推論にそれらを使用しています。この事後部は一貫性がない—つまり、有限の混合物からのデータでは、真の成分数に集中していないことを示します。この結果は、DPMやPYMなどの大規模なノンパラメトリック混合物に、基本的にすべての離散ファミリーや、軽度の規則性条件を満たす連続指数ファミリー(多変量ガウスなど)を含む、さまざまな成分分布ファミリーに適用されます。

Learning Graphical Models With Hubs
ハブを使用したグラフィカルモデルの学習

We consider the problem of learning a high-dimensional graphical model in which there are a few hub nodes that are densely-connected to many other nodes. Many authors have studied the use of an $\ell_1$ penalty in order to learn a sparse graph in the high-dimensional setting. However, the $\ell_1$ penalty implicitly assumes that each edge is equally likely and independent of all other edges. We propose a general framework to accommodate more realistic networks with hub nodes, using a convex formulation that involves a row-column overlap norm penalty. We apply this general framework to three widely- used probabilistic graphical models: the Gaussian graphical model, the covariance graph model, and the binary Ising model. An alternating direction method of multipliers algorithm is used to solve the corresponding convex optimization problems. On synthetic data, we demonstrate that our proposed framework outperforms competitors that do not explicitly model hub nodes. We illustrate our proposal on a webpage data set and a gene expression data set.

私たちは、他の多くのノードと密に接続された少数のハブノードが存在する高次元グラフィカルモデルの学習問題を考察します。多くの著者が、高次元設定でスパースグラフを学習するために$\ell_1$ペナルティの使用を研究してきた。しかし、$\ell_1$ペナルティは、各エッジが等しくありそうで、他のすべてのエッジから独立していると暗黙的に想定しています。私たちは、行と列のオーバーラップノルムペナルティを含む凸定式化を使用して、ハブノードを持つより現実的なネットワークに対応する一般的なフレームワークを提案します。私たちは、この一般的なフレームワークを、広く使用されている3つの確率的グラフィカルモデル、すなわちガウスグラフィカルモデル、共分散グラフモデル、およびバイナリイジングモデルに適用します。対応する凸最適化問題を解決するために、交互方向乗数法アルゴリズムを使用します。合成データでは、提案されたフレームワークが、ハブノードを明示的にモデル化しない競合製品よりも優れていることを実証します。私たちは、Webページデータセットと遺伝子発現データセットで提案を説明します。

Set-Valued Approachability and Online Learning with Partial Monitoring
セット価値の親しみやすさとパーシャルモニタリングによるオンライン学習

Approachability has become a standard tool in analyzing learning algorithms in the adversarial online learning setup. We develop a variant of approachability for games where there is ambiguity in the obtained reward: it belongs to a set rather than being a single vector. Using this variant we tackle the problem of approachability in games with partial monitoring and develop a simple and generally efficient strategy (i.e., with constant per-step complexity) for this setup. As an important example, we instantiate our general strategy to the case when external regret or internal regret is to be minimized under partial monitoring.

親しみやすさは、敵対的なオンライン学習セットアップで学習アルゴリズムを分析するための標準的なツールになっています。私たちは、得られる報酬に曖昧さがあるゲームのために、親しみやすさの変種を開発します:それは単一のベクトルではなく、セットに属します。このバリアントを使用して、パーシャルモニタリングを伴うゲームにおける親しみやすさの問題に取り組み、このセットアップのためのシンプルで一般的に効率的な戦略(つまり、ステップごとの複雑さが一定)を開発します。重要な例として、私たちは一般的な戦略を、外部の後悔または内部の後悔が部分的な監視の下で最小限に抑えられる場合に具体化します。

Accelerating t-SNE using Tree-Based Algorithms
ツリーベースアルゴリズムを用いたt-SNEの高速化

The paper investigates the acceleration of t-SNE–an embedding technique that is commonly used for the visualization of high- dimensional data in scatter plots–using two tree-based algorithms. In particular, the paper develops variants of the Barnes-Hut algorithm and of the dual-tree algorithm that approximate the gradient used for learning t-SNE embeddings in $\mathcal{O}(N \log N)$. Our experiments show that the resulting algorithms substantially accelerate t-SNE, and that they make it possible to learn embeddings of data sets with millions of objects. Somewhat counterintuitively, the Barnes-Hut variant of t-SNE appears to outperform the dual-tree variant.

この論文では、2つのツリーベースのアルゴリズムを使用して、散布図の高次元データの視覚化に一般的に使用される埋め込み技術であるt-SNEの加速を調査しています。特に、この論文では、$mathcal{O}(N log N)$のt-SNE埋め込みの学習に使用される勾配を近似するBarnes-Hutアルゴリズムと双対木アルゴリズムのバリアントを開発しています。私たちの実験は、結果として得られるアルゴリズムがt-SNEを大幅に高速化し、何百万ものオブジェクトを含むデータセットの埋め込みを学習することを可能にすることを示しています。やや直感に反しますが、t-SNEのBarnes-Hutバリアントは、デュアルツリーバリアントをアウトパフォームしているように見えます。

Robust Online Gesture Recognition with Crowdsourced Annotations
クラウドソーシングアノテーションによる堅牢なオンラインジェスチャー認識

Crowdsourcing is a promising way to reduce the effort of collecting annotations for training gesture recognition systems. Crowdsourced annotations suffer from “noise” such as mislabeling, or inaccurate identification of start and end time of gesture instances. In this paper we present SegmentedLCSS and WarpingLCSS, two template-matching methods offering robustness when trained with noisy crowdsourced annotations to spot gestures from wearable motion sensors. The methods quantize signals into strings of characters and then apply variations of the longest common subsequence algorithm (LCSS) to spot gestures. We compare the noise robustness of our methods against baselines which use dynamic time warping (DTW) and support vector machines (SVM). The experiments are performed on data sets with various gesture classes (10-17 classes) recorded from accelerometers on arms, with both real and synthetic crowdsourced annotations. WarpingLCSS has similar or better performance than baselines in absence of noisy annotations. In presence of 60% mislabeled instances, WarpingLCSS outperformed SVM by 22% F1-score and outperformed DTW-based methods by 36% F1-score on average. SegmentedLCSS yields similar performance as WarpingLCSS, however it performs one order of magnitude slower. Additionally, we show to use our methods to filter out the noise in the crowdsourced annotation before training a traditional classifier. The filtering increases the performance of SVM by 20% F1-score and of DTW-based methods by 8% F1-score on average in the noisy real crowdsourced annotations.

クラウドソーシングは、ジェスチャ認識システムをトレーニングするための注釈収集の労力を削減する有望な方法です。クラウドソーシングされた注釈は、誤ったラベル付けや、ジェスチャインスタンスの開始時間と終了時間の不正確な識別などの「ノイズ」の影響を受けます。この論文では、ノイズの多いクラウドソーシングされた注釈でトレーニングしてウェアラブルモーションセンサーからジェスチャを見つける際に堅牢性を発揮する2つのテンプレートマッチング手法、SegmentedLCSSとWarpingLCSSを紹介します。これらの手法は、信号を文字列に量子化し、最長共通部分列アルゴリズム(LCSS)のバリエーションを適用してジェスチャを見つけます。この手法のノイズ堅牢性を、動的時間ワーピング(DTW)とサポートベクターマシン(SVM)を使用するベースラインと比較します。実験は、腕の加速度計から記録されたさまざまなジェスチャクラス(10～17クラス)のデータセットで、実際のクラウドソーシングされた注釈と合成された注釈の両方を使用して実行されます。WarpingLCSSは、ノイズの多い注釈がない場合、ベースラインと同等以上のパフォーマンスを発揮します。60%の誤ラベル付けされたインスタンスがある場合、WarpingLCSSはSVMよりF1スコア22%優れ、DTWベースの方法よりF1スコア36%優れていました。SegmentedLCSSはWarpingLCSSと同様のパフォーマンスを発揮しますが、パフォーマンスは1桁遅くなります。さらに、従来の分類器をトレーニングする前に、クラウドソーシングされた注釈のノイズをフィルター処理する方法を示します。このフィルター処理により、ノイズの多い実際のクラウドソーシングされた注釈で、SVMのパフォーマンスがF1スコア20%、DTWベースの方法のパフォーマンスがF1スコア8%向上します。

ooDACE Toolbox: A Flexible Object-Oriented Kriging Implementation
ooDACEツールボックス: 柔軟なオブジェクト指向クリギングの実装

When analyzing data from computationally expensive simulation codes, surrogate modeling methods are firmly established as facilitators for design space exploration, sensitivity analysis, visualization and optimization. Kriging is a popular surrogate modeling technique used for the Design and Analysis of Computer Experiments (DACE). Hence, the past decade Kriging has been the subject of extensive research and many extensions have been proposed, e.g., co-Kriging, stochastic Kriging, blind Kriging, etc. However, few Kriging implementations are publicly available and tailored towards scientists and engineers. Furthermore, no Kriging toolbox exists that unifies several Kriging flavors. This paper addresses this need by presenting an efficient object-oriented Kriging implementation and several Kriging extensions, providing a flexible and easily extendable framework to test and implement new Kriging flavors while reusing as much code as possible.

計算コストの高いシミュレーションコードからのデータを解析する場合、設計空間の探索、感度解析、可視化、最適化のファシリテーターとして、サロゲートモデリング手法がしっかりと確立されています。クリギングは、コンピュータ実験の設計と分析(DACE)に使用される一般的な代理モデリング手法です。したがって、過去10年間、クリギングは広範な研究の対象であり、コクリギング、確率的クリギング、ブラインドクリギングなど、多くの拡張が提案されてきました。しかし、一般に公開され、科学者やエンジニア向けに調整されたKrigingの実装はほとんどありません。さらに、いくつかのKrigingフレーバーを統合するKrigingツールボックスは存在しません。この論文では、効率的なオブジェクト指向のKriging実装といくつかのKriging拡張機能を紹介し、新しいKrigingフレーバーをテストおよび実装するための柔軟で簡単に拡張可能なフレームワークを提供しながら、できるだけ多くのコードを再利用することで、このニーズに対処します。

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
現実世界の分類問題を解決するために、何百もの分類器が必要ですか?

We evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest-neighbors, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines and other methods), implemented in Weka, R (with and without the caret package), C and Matlab, including all the relevant classifiers available today. We use 121 data sets, which represent the whole UCI data base (excluding the large- scale problems) and other own real problems, in order to achieve significant conclusions about the classifier behavior, not dependent on the data set collection. The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy overcoming 90% in the 84.3% of the data sets. However, the difference is not statistically significant with the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the remaining ones: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package). The random forest is clearly the best family of classifiers (3 out of 5 bests classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).

私たちは、Weka、R（caretパッケージの有無にかかわらず）、C、およびMatlabで実装された、17のファミリー（判別分析、ベイズ、ニューラルネットワーク、サポートベクターマシン、決定木、ルールベースの分類器、ブースティング、バギング、スタッキング、ランダムフォレストおよびその他のアンサンブル、一般化線形モデル、最近傍法、部分最小二乗法および主成分回帰、ロジスティック回帰および多項式回帰、多重適応回帰スプラインおよびその他の方法）から生じる179の分類器を評価します。これには、現在利用可能なすべての関連分類器が含まれます。私たちは、データセットコレクションに依存せずに分類器の動作に関する重要な結論を得るために、UCIデータベース全体（大規模な問題を除く）およびその他の実際の問題を表す121のデータセットを使用します。最も優れていると思われる分類器はランダムフォレスト(RF)バージョンで、その最良のバージョン(Rで実装され、caret経由でアクセス)は、84.3%のデータセットで90%を超え、最大精度の94.1%を達成しています。ただし、2番目に優れている、LibSVMを使用してCで実装されたガウスカーネルを使用したSVMは、最大精度の92.3%を達成しており、その差は統計的に有意ではありません。ランダムフォレスト、ガウスカーネルと多項式カーネルを使用したSVM、ガウスカーネルを使用したエクストリームラーニングマシン、C5.0、avNNet (caretパッケージを使用してRで実装された多層パーセプトロンの委員会)などのモデルは、明らかに残りのモデルよりも優れています。ランダムフォレストは明らかに分類器のベストファミリーであり(ベスト5分類器のうち3つはRF)、続いてSVM (トップ10に4つの分類器)、ニューラルネットワーク、ブースティングアンサンブル(それぞれトップ20に5つと3つ)となっています。

Recursive Teaching Dimension, VC-Dimension and Sample Compression
再帰的ティーチングディメンション、VCディメンション、サンプル圧縮

This paper is concerned with various combinatorial parameters of classes that can be learned from a small set of examples. We show that the recursive teaching dimension, recently introduced by Zilles et al. (2008), is strongly connected to known complexity notions in machine learning, e.g., the self- directed learning complexity and the VC-dimension. To the best of our knowledge these are the first results unveiling such relations between teaching and query learning as well as between teaching and the VC-dimension. It will turn out that for many natural classes the RTD is upper-bounded by the VCD, e.g., classes of VC-dimension 1, intersection-closed classes and finite maximum classes. However, we will also show that there are certain (but rare) classes for which the recursive teaching dimension exceeds the VC-dimension. Moreover, for maximum classes, the combinatorial structure induced by the RTD, called teaching plan, is highly similar to the structure of sample compression schemes. Indeed one can transform any repetition-free teaching plan for a maximum class $\mathcal{C}$ into an unlabeled sample compression scheme for $\mathcal{C}$ and vice versa, while the latter is produced by (i) the corner- peeling algorithm of Rubinstein and Rubinstein (2012) and (ii) the tail matching algorithm of Kuzmin and Warmuth (2007).

この論文では、少数の例から学習できるクラスのさまざまな組み合わせパラメータについて扱っています。Zillesら(2008)によって最近導入された再帰的ティーチング次元は、自己指向学習の複雑さやVC次元など、機械学習における既知の複雑さの概念と密接に関係していることを示します。私たちの知る限り、これらはティーチングとクエリ学習、およびティーチングとVC次元の関係を明らかにした最初の結果です。多くの自然なクラスでは、RTDはVCDによって上限が制限されることがわかります(例: VC次元1のクラス、交差が閉じたクラス、有限最大クラス)。ただし、再帰的ティーチング次元がVC次元を超える特定の(ただしまれな)クラスがあることも示します。さらに、最大クラスでは、ティーチングプランと呼ばれるRTDによって誘導される組み合わせ構造は、サンプル圧縮スキームの構造と非常に似ています。実際、最大クラス$\mathcal{C}$の任意の繰り返しのない教授計画を$\mathcal{C}$のラベルなしサンプル圧縮スキームに変換することができ、その逆も可能です。後者は、(i) RubinsteinとRubinstein (2012)のコーナーピーリングアルゴリズムと、(ii) KuzminとWarmuth (2007)のテールマッチングアルゴリズムによって生成されます。

High-Dimensional Learning of Linear Causal Networks via Inverse Covariance Estimation
逆共分散推定による線形因果ネットワークの高次元学習

We establish a new framework for statistical estimation of directed acyclic graphs (DAGs) when data are generated from a linear, possibly non-Gaussian structural equation model. Our framework consists of two parts: (1) inferring the moralized graph from the support of the inverse covariance matrix; and (2) selecting the best-scoring graph amongst DAGs that are consistent with the moralized graph. We show that when the error variances are known or estimated to close enough precision, the true DAG is the unique minimizer of the score computed using the reweighted squared $\ell_2$-loss. Our population-level results have implications for the identifiability of linear SEMs when the error covariances are specified up to a constant multiple. On the statistical side, we establish rigorous conditions for high-dimensional consistency of our two-part algorithm, defined in terms of a “gap” between the true DAG and the next best candidate. Finally, we demonstrate that dynamic programming may be used to select the optimal DAG in linear time when the treewidth of the moralized graph is bounded.

私たちは、線形、おそらく非ガウス構造方程式モデルからデータが生成される場合、有向非巡回グラフ(DAG)の統計的推定のための新しいフレームワークを確立します。このフレームワークは、(1)逆共分散行列のサポートから道徳的グラフを推論する、(2)道徳的グラフと一致するDAGの中からスコアが最も高いグラフを選択する、という2つの部分から構成されます。誤差分散が既知であるか、十分な精度で推定されている場合、真のDAGは、再加重2乗$\ell_2$損失を使用して計算されたスコアの唯一の最小値であることを示します。誤差共分散が定数倍まで指定されている場合、集団レベルの結果は、線形SEMの識別可能性に影響を及ぼします。統計面では、真のDAGと次善の候補との間の「ギャップ」で定義される、2部構成のアルゴリズムの高次元の一貫性に関する厳密な条件を確立します。最後に、道徳化グラフのツリー幅が制限されている場合、動的プログラミングを使用して線形時間で最適なDAGを選択できることを示します。

Effective String Processing and Matching for Author Disambiguation
著者の曖昧さ回避のための効果的な文字列処理とマッチング

Track 2 of KDD Cup 2013 aims at determining duplicated authors in a data set from Microsoft Academic Search. This type of problems appears in many large-scale applications that compile information from different sources. This paper describes our solution developed at National Taiwan University to win the first prize of the competition. We propose an effective name matching framework and realize two implementations. An important strategy in our approach is to consider Chinese and non-Chinese names separately because of their different naming conventions. Post-processing including merging results of two predictions further boosts the performance. Our approach achieves F1-score 0.99202 on the private leader board, while 0.99195 on the public leader board.

KDD Cup 2013のトラック2は、Microsoft Academic Searchのデータセットで重複する著者を特定することを目的としています。この種の問題は、さまざまなソースから情報をコンパイルする多くの大規模なアプリケーションで発生します。この論文では、コンペティションの1等を獲得するために国立台湾大学で開発されたソリューションについて説明します。効果的なネームマッチングフレームワークを提案し、2つの実装を実現します。私たちのアプローチにおける重要な戦略は、中国名と非中国人の名前は、命名規則が異なるため、別々に考えることです。2つの予測の結果をマージするなどの後処理により、パフォーマンスがさらに向上します。このアプローチでは、F1スコアはプライベートリーダーボードで0.99202、パブリックリーダーボードで0.99195を達成しています。

Bayesian Co-Boosting for Multi-modal Gesture Recognition
マルチモーダルジェスチャ認識のためのベイズ共ブースティング

With the development of data acquisition equipment, more and more modalities become available for gesture recognition. However, there still exist two critical issues for multi-modal gesture recognition: how to select discriminative features for recognition and how to fuse features from different modalities. In this paper, we propose a novel Bayesian Co-Boosting framework for multi-modal gesture recognition. Inspired by boosting learning and co-training method, our proposed framework combines multiple collaboratively trained weak classifiers to construct the final strong classifier for the recognition task. During each iteration round, we randomly sample a number of feature subsets and estimate weak classifier’s parameters for each subset. The optimal weak classifier and its corresponding feature subset are retained for strong classifier construction. Furthermore, we define an upper bound of training error and derive the update rule of instance’s weight, which guarantees the error upper bound to be minimized through iterations. For demonstration, we present an implementation of our framework using hidden Markov models as weak classifiers. We perform extensive experiments using the ChaLearn MMGR and ChAirGest data sets, in which our approach achieves 97.63% and 96.53% accuracy respectively on each publicly available data set.

データ収集機器の発達に伴い、ジェスチャ認識に利用できるモダリティがますます増えています。しかし、マルチモーダルジェスチャ認識には、認識のための識別機能の選択方法と、異なるモダリティからの特徴の融合方法という2つの重要な問題が依然として存在します。この論文では、マルチモーダルジェスチャ認識のための新しいベイジアンコブースティングフレームワークを提案します。ブースティング学習とコトレーニング法にヒントを得て、提案するフレームワークは、複数の共同トレーニングされた弱い分類器を組み合わせて、認識タスク用の最終的な強い分類器を構築します。各反復ラウンド中に、いくつかの特徴サブセットをランダムにサンプリングし、各サブセットの弱い分類器のパラメータを推定します。最適な弱い分類器とそれに対応する特徴サブセットは、強い分類器の構築用に保持されます。さらに、トレーニングエラーの上限を定義し、インスタンスの重みの更新ルールを導出します。これにより、反復を通じてエラーの上限が最小化されることが保証されます。デモンストレーションのために、隠れマルコフモデルを弱い分類器として使用したフレームワークの実装を示します。私たちはChaLearn MMGRとChAirGestのデータセットを使用して広範な実験を行い、公開されている各データセットで私たちのアプローチがそれぞれ97.63%と96.53%の精度を達成しました。

Optimal Data Collection For Informative Rankings Expose Well-Connected Graphs
有益なランキングのための最適なデータ収集により、適切に接続されたグラフが公開されます

Given a graph where vertices represent alternatives and arcs represent pairwise comparison data, the statistical ranking problem is to find a potential function, defined on the vertices, such that the gradient of the potential function agrees with the pairwise comparisons. Our goal in this paper is to develop a method for collecting data for which the least squares estimator for the ranking problem has maximal Fisher information. Our approach, based on experimental design, is to view data collection as a bi-level optimization problem where the inner problem is the ranking problem and the outer problem is to identify data which maximizes the informativeness of the ranking. Under certain assumptions, the data collection problem decouples, reducing to a problem of finding multigraphs with large algebraic connectivity. This reduction of the data collection problem to graph-theoretic questions is one of the primary contributions of this work. As an application, we study the Yahoo! Movie user rating data set and demonstrate that the addition of a small number of well-chosen pairwise comparisons can significantly increase the Fisher informativeness of the ranking. As another application, we study the 2011-12 NCAA football schedule and propose schedules with the same number of games which are significantly more informative. Using spectral clustering methods to identify highly-connected communities within the division, we argue that the NCAA could improve its notoriously poor rankings by simply scheduling more out-of- conference games.

頂点が選択肢を表し、弧がペアワイズ比較データを表すグラフが与えられた場合、統計的ランキング問題は、頂点で定義されるポテンシャル関数を見つけ、その勾配がペアワイズ比較と一致するようにすることです。この論文の目標は、ランキング問題の最小二乗推定量が最大のフィッシャー情報を持つデータ収集方法を開発することです。実験設計に基づく私たちのアプローチは、データ収集を、内部問題がランキング問題であり、外部問題がランキングの有益性を最大化するデータを特定することである2レベル最適化問題と見なすことです。特定の仮定の下では、データ収集問題は分離され、大きな代数的接続性を持つマルチグラフを見つける問題に縮小されます。データ収集問題をグラフ理論の問題に縮小することは、この研究の主な貢献の1つです。アプリケーションとして、Yahoo!映画のユーザー評価データセットを調査し、少数の適切に選択されたペアワイズ比較を追加することで、ランキングのフィッシャー有益性を大幅に向上できることを実証します。別の応用として、2011-12 NCAAフットボールスケジュールを調査し、同じ数の試合ではるかに有益なスケジュールを提案します。スペクトルクラスタリング法を使用して部門内の高度に結びついたコミュニティを特定し、NCAAはカンファレンス外の試合をもっとスケジュールするだけで、悪名高い低いランキングを改善できると主張します。

Multimodal Learning with Deep Boltzmann Machines
Deepボルツマンマシンによるマルチモーダル学習

Data often consists of multiple diverse modalities. For example, images are tagged with textual information and videos are accompanied by audio. Each modality is characterized by having distinct statistical properties. We propose a Deep Boltzmann Machine for learning a generative model of such multimodal data. We show that the model can be used to create fused representations by combining features across modalities. These learned representations are useful for classification and information retrieval. By sampling from the conditional distributions over each data modality, it is possible to create these representations even when some data modalities are missing. We conduct experiments on bi-modal image-text and audio-video data. The fused representation achieves good classification results on the MIR-Flickr data set matching or outperforming other deep models as well as SVM based models that use Multiple Kernel Learning. We further demonstrate that this multimodal model helps classification and retrieval even when only unimodal data is available at test time.

データは、多くの場合、複数の多様なモダリティで構成されています。たとえば、画像にはテキスト情報がタグ付けされ、ビデオには音声が付随しています。各モダリティは、異なる統計的特性を持つことが特徴です。私たちは、このようなマルチモーダルデータの生成モデルを学習するためのディープボルツマンマシンを提案します。このモデルを使用して、モダリティ間で特徴を組み合わせることで、融合表現を作成できることを示します。これらの学習された表現は、分類と情報検索に役立ちます。各データモダリティの条件付き分布からサンプリングすることにより、一部のデータモダリティが欠落している場合でも、これらの表現を作成できます。私たちは、バイモーダルな画像テキストおよび音声ビデオデータで実験を行います。融合表現は、MIR-Flickrデータセットで、他のディープモデルや、マルチカーネル学習を使用するSVMベースのモデルと同等かそれ以上の優れた分類結果を達成しました。さらに、テスト時にユニモーダルデータしか利用できない場合でも、このマルチモーダルモデルが分類と検索に役立つことを実証します。

QUIC: Quadratic Approximation for Sparse Inverse Covariance Estimation
QUIC: スパース逆共分散推定のための 2 次近似

The $\ell_1$-regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to recent state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and present experimental results using synthetic and real-world application data that demonstrate the considerable improvements in performance of our method when compared to previous methods.

$ell_1$-正則化されたガウス最尤推定量(MLE)は、非常に限られたサンプルからスパース逆共分散行列、またはガウスマルコフランダムフィールドの基礎となるグラフ構造を回復する際に強力な統計的保証があることが示されています。私たちは、結果として得られる最適化問題を解くための新しいアルゴリズムである正則化された対数行列式プログラムを提案します。一次勾配情報を主に使用する最近の最先端の方法とは対照的に、私たちのアルゴリズムはニュートンの方法に基づいており、二次近似を採用していますが、スパースガウスMLE問題の構造を活用するいくつかの変更が加えられています。私たちは、私たちの方法が超線形収束であることを示し、合成データと実世界のアプリケーションデータを使用して実験結果を提示し、以前の方法と比較して私たちの方法の性能が大幅に向上していることを示しています。

Confidence Intervals and Hypothesis Testing for High-Dimensional Regression
高次元回帰の信頼区間と仮説検定

Fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures. As a consequence, it is generally impossible to obtain an exact characterization of the probability distribution of the parameter estimates. This in turn implies that it is extremely challenging to quantify the uncertainty associated with a certain parameter estimate. Concretely, no commonly accepted procedure exists for computing classical measures of uncertainty and statistical significance as confidence intervals or $p$-values for these models. We consider here high- dimensional linear regression problem, and propose an efficient algorithm for constructing confidence intervals and $p$-values. The resulting confidence intervals have nearly optimal size. When testing for the null hypothesis that a certain parameter is vanishing, our method has nearly optimal power. Our approach is based on constructing a `de-biased’ version of regularized M-estimators. The new construction improves over recent work in the field in that it does not assume a special structure on the design matrix. We test our method on synthetic data and a high- throughput genomic data set about riboflavin production rate, made publicly available by BÃ¼hlmann et al. (2014).

高次元統計モデルを当てはめるには、非線形パラメータ推定手順の使用がしばしば必要となります。その結果、パラメータ推定値の確率分布の正確な特性を得ることは一般に不可能です。これは、特定のパラメータ推定値に関連する不確実性を定量化することが極めて困難であることを意味します。具体的には、これらのモデルに対する信頼区間または$p$値として、不確実性と統計的有意性の古典的な尺度を計算するための、一般に受け入れられている手順は存在しない。ここでは、高次元線形回帰問題を考慮し、信頼区間と$p$値を構築するための効率的なアルゴリズムを提案します。結果として得られる信頼区間は、ほぼ最適なサイズです。特定のパラメータが消失するという帰無仮説を検定する場合、この方法はほぼ最適な検出力を持つ。このアプローチは、正規化M推定量の「偏りのない」バージョンを構築することに基づいています。この新しい構成は、設計行列に特別な構造を想定しない点で、この分野の最近の研究よりも優れています。私たちは、合成データと、BÃ¼hlmannら(2014)によって公開されたリボフラビン生産率に関するハイスループットゲノムデータセットで私たちの方法をテストしました。

Bayesian Entropy Estimation for Countable Discrete Distributions
可算離散分布に対するベイジアンエントロピー推定

We consider the problem of estimating Shannon’s entropy $H$ from discrete data, in cases where the number of possible symbols is unknown or even countably infinite. The Pitman-Yor process, a generalization of Dirichlet process, provides a tractable prior distribution over the space of countably infinite discrete distributions, and has found major applications in Bayesian non- parametric statistics and machine learning. Here we show that it provides a natural family of priors for Bayesian entropy estimation, due to the fact that moments of the induced posterior distribution over $H$ can be computed analytically. We derive formulas for the posterior mean (Bayes’ least squares estimate) and variance under Dirichlet and Pitman-Yor process priors. Moreover, we show that a fixed Dirichlet or Pitman-Yor process prior implies a narrow prior distribution over $H$, meaning the prior strongly determines the entropy estimate in the under-sampled regime. We derive a family of continuous measures for mixing Pitman-Yor processes to produce an approximately flat prior over $H$. We show that the resulting “Pitman-Yor Mixture” (PYM) entropy estimator is consistent for a large class of distributions. Finally, we explore the theoretical properties of the resulting estimator, and show that it performs well both in simulation and in application to real data.

私たちは、離散データからシャノンのエントロピー$H$を推定する問題について検討します。この場合、可能なシンボルの数が未知であるか、または可算無限である場合に該当します。ディリクレ過程の一般化であるピットマン-ヨル過程は、可算無限離散分布の空間上で扱いやすい事前分布を提供し、ベイズ非パラメトリック統計および機械学習で主要な応用が見つかりました。ここでは、$H$上の誘導事後分布のモーメントを解析的に計算できるため、ベイズエントロピー推定に自然な事前分布のファミリーを提供することを示します。ディリクレ過程およびピットマン-ヨル過程の事前分布の下での事後平均(ベイズの最小二乗推定)と分散の式を導出します。さらに、固定されたディリクレ過程またはピットマン-ヨル過程の事前分布は$H$上の狭い事前分布を意味し、事前分布がサンプル不足の領域でのエントロピー推定を強く決定することを示します。Pitman-Yor過程を混合して$H$上のほぼ平坦な事前分布を生成するための連続測定のファミリーを導出します。結果として得られる「Pitman-Yor混合」(PYM)エントロピー推定量は、多くの分布クラスで一貫していることを示します。最後に、結果として得られる推定量の理論的特性を調査し、シミュレーションと実際のデータへの適用の両方で良好なパフォーマンスを発揮することを示します。

Tensor Decompositions for Learning Latent Variable Models
潜在変数モデルを学習するためのテンソル分解

This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models—including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation—which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin’s perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.

この研究では、ガウス混合モデル、隠れマルコフモデル、潜在ディリクレ配分など、潜在変数モデルの幅広いクラスを対象に、低次の観測モーメント（通常は2次および3次）の特定のテンソル構造を利用する、計算的かつ統計的に効率的なパラメータ推定法を検討します。具体的には、パラメータ推定は、モーメントから導出された対称テンソルの特定の（直交）分解を抽出する問題に帰着します。この分解は、行列の特異値分解の自然な一般化と見なすことができます。テンソル分解は一般に計算が困難ですが、これらの特別に構造化されたテンソルの分解は、べき乗反復や最大化アプローチ（行列の場合と同様）などのさまざまなアプローチによって効率的に取得できます。堅牢なテンソル累乗法の詳細な分析が提供され、行列の特異ベクトルに対するウェディンの摂動定理の類似性が確立されます。これは、いくつかの一般的な潜在変数モデルに対する堅牢で計算しやすい推定アプローチを意味します。

Optimality of Graphlet Screening in High Dimensional Variable Selection
高次元変数選択におけるグラフレットスクリーニングの最適性

Consider a linear model $Y = X \beta + \sigma z$, where $X$ has $n$ rows and $p$ columns and $z \sim N(0, I_n)$. We assume both $p$ and $n$ are large, including the case of $p \gg n$. The unknown signal vector $\beta$ is assumed to be sparse in the sense that only a small fraction of its components is nonzero. The goal is to identify such nonzero coordinates (i.e., variable selection). We are primarily interested in the regime where signals are both rare and weak so that successful variable selection is challenging but is still possible. We assume the Gram matrix $G = X’X$ is sparse in the sense that each row has relatively few large entries (diagonals of $G$ are normalized to $1$). The sparsity of $G$ naturally induces the sparsity of the so-called Graph of Strong Dependence (GOSD). The key insight is that there is an interesting interplay between the signal sparsity and graph sparsity: in a broad context, the signals decompose into many small-size components of GOSD that are disconnected to each other. We propose Graphlet Screening for variable selection. This is a two-step Screen and Clean procedure, where in the first step, we screen subgraphs of GOSD with sequential $\chi^2$-tests, and in the second step, we clean with penalized MLE. The main methodological innovation is to use GOSD to guide both the screening and cleaning processes. For any variable selection procedure $\hat{\beta}$, we measure its performance by the Hamming distance between the sign vectors of $\hat{\beta}$ and $\beta$, and assess the optimality by the minimax Hamming distance. Compared with more stringent criteria such as exact support recovery or oracle property, which demand strong signals, the Hamming distance criterion is more appropriate for weak signals since it naturally allows a small fraction of errors. We show that in a broad class of situations, Graphlet Screening achieves the optimal rate of convergence in terms of the Hamming distance. Unlike Graphlet Screening, well- known procedures such as the $L^0/L^1$-penalization methods do not utilize local graphic structure for variable selection, so they generally do not achieve the optimal rate of convergence, even in very simple settings and even if the tuning parameters are ideally set. The the presented algorithm is implemented as R-CRAN package ScreenClean and in matlab (available at stat.cmu.edu).

線形モデル$Y = X \beta + \sigma z$を考えます。ここで、$X$は$n$行と$p$列を持ち、$z \sim N(0, I_n)$です。$p$と$n$は両方とも大きいと仮定します($p \gg n$の場合も含みます)。未知の信号ベクトル$\beta$は、その成分のごく一部だけが非ゼロであるという意味においてスパースであると仮定します。目標は、そのような非ゼロ座標(つまり、変数選択)を特定することです。主に、信号がまれで弱いため、変数選択を成功させることは困難ですが、それでも可能であるという状況に関心があります。グラム行列$G = X’X$は、各行に大きなエントリが比較的少ないという意味においてスパースであると仮定します($G$の対角線は$1$に正規化されます)。$G$のスパース性は、いわゆる強い依存関係のグラフ(GOSD)のスパース性を自然に誘発します。重要な洞察は、信号のスパース性とグラフのスパース性の間に興味深い相互作用があるということです。広い意味では、信号は互いに切断されたGOSDの多くの小さなコンポーネントに分解されます。変数選択にはグラフレットスクリーニングを提案します。これは2段階のスクリーニングとクリーニングの手順で、最初の手順では、連続した$\chi^2$テストでGOSDのサブグラフをスクリーニングし、2番目の手順ではペナルティ付きMLEでクリーニングします。主な方法論的革新は、GOSDを使用してスクリーニングとクリーニングの両方のプロセスをガイドすることです。任意の変数選択手順$\hat{\beta}$について、$\hat{\beta}$と$\beta$の符号ベクトル間のハミング距離でパフォーマンスを測定し、ミニマックスハミング距離で最適性を評価します。強い信号を要求する正確なサポート回復やオラクルプロパティなどのより厳格な基準と比較すると、ハミング距離基準は自然に小さなエラーを許容するため、弱い信号に適しています。幅広い状況において、Graphlet Screeningはハミング距離の観点から最適な収束率を達成することを示しています。Graphlet Screeningとは異なり、$L^0/L^1$ペナルティ法などのよく知られた手順では、変数選択にローカルグラフィック構造を利用しないため、非常に単純な設定や、チューニングパラメータが理想的に設定されている場合でも、通常は最適な収束率を達成しません。提示されたアルゴリズムは、R-CRANパッケージScreenCleanおよびmatlab (stat.cmu.eduで入手可能)として実装されています。

Efficient Occlusive Components Analysis
効率的なオクルーシブ成分解析

We study unsupervised learning in a probabilistic generative model for occlusion. The model uses two types of latent variables: one indicates which objects are present in the image, and the other how they are ordered in depth. This depth order then determines how the positions and appearances of the objects present, specified in the model parameters, combine to form the image. We show that the object parameters can be learned from an unlabeled set of images in which objects occlude one another. Exact maximum-likelihood learning is intractable. Tractable approximations can be derived, however, by applying a truncated variational approach to Expectation Maximization (EM). In numerical experiments it is shown that these approximations recover the underlying set of object parameters including data noise and sparsity. Experiments on a novel version of the bars test using colored bars, and experiments on more realistic data, show that the algorithm performs well in extracting the generating components. The studied approach demonstrates that the multiple-causes generative approach can be generalized to extract occluding components, which links research on occlusion to the field of sparse coding approaches.

私たちは、オクルージョンの確率的生成モデルにおける教師なし学習を研究します。このモデルは、2種類の潜在変数を使用します。1つは画像内に存在するオブジェクトを示し、もう1つはそれらのオブジェクトが深度内でどのように順序付けられているかを示す。この深度順序によって、モデルパラメータで指定された、存在するオブジェクトの位置と外観が画像を形成するためにどのように組み合わされるかが決定されます。私たちは、オブジェクトパラメータが、オブジェクトが互いにオクルージョンしているラベルなしの画像セットから学習できることを示す。正確な最大尤度学習は扱いにくい。しかし、期待値最大化(EM)に切り捨て変分法を適用することで、扱いやすい近似値を導くことができます。数値実験では、これらの近似値が、データノイズやスパース性を含むオブジェクトパラメータの基礎セットを回復することが示されます。色付きのバーを使用したバーテストの新しいバージョンの実験と、より現実的なデータでの実験では、このアルゴリズムが生成コンポーネントの抽出において優れたパフォーマンスを示すことが示されます。研究されたアプローチは、複数原因生成アプローチを一般化して遮蔽コンポーネントを抽出できることを実証しており、遮蔽に関する研究をスパースコーディングアプローチの分野に結び付けています。

A Truncated EM Approach for Spike-and-Slab Sparse Coding
スパイクとスラブスパース符号化のための切り捨てEMアプローチ

We study inference and learning based on a sparse coding model with `spike-and-slab’ prior. As in standard sparse coding, the model used assumes independent latent sources that linearly combine to generate data points. However, instead of using a standard sparse prior such as a Laplace distribution, we study the application of a more flexible `spike-and-slab’ distribution which models the absence or presence of a source’s contribution independently of its strength if it contributes. We investigate two approaches to optimize the parameters of spike-and-slab sparse coding: a novel truncated EM approach and, for comparison, an approach based on standard factored variational distributions. The truncated approach can be regarded as a variational approach with truncated posteriors as variational distributions. In applications to source separation we find that both approaches improve the state-of-the-art in a number of standard benchmarks, which argues for the use of `spike-and- slab’ priors for the corresponding data domains. Furthermore, we find that the truncated EM approach improves on the standard factored approach in source separation tasks—which hints to biases introduced by assuming posterior independence in the factored variational approach. Likewise, on a standard benchmark for image denoising, we find that the truncated EM approach improves on the factored variational approach. While the performance of the factored approach saturates with increasing numbers of hidden dimensions, the performance of the truncated approach improves the state-of-the-art for higher noise levels.

私たちは、「spike-and-slab」事前分布を用いたスパース符号化モデルに基づく推論と学習について研究します。標準的なスパース符号化と同様に、使用されるモデルは、線形結合してデータポイントを生成する独立した潜在ソースを想定しています。しかし、ラプラス分布などの標準的なスパース事前分布を使用する代わりに、より柔軟な`spike-and-slab’分布の適用について研究します。この分布は、ソースの寄与の有無を、寄与する場合のその強度とは独立にモデル化します。私たちは、spike-and-slabスパース符号化のパラメータを最適化する2つのアプローチ、すなわち、新しい切り捨てEMアプローチと、比較のために、標準的な因数分解変分分布に基づくアプローチについて調査します。切り捨てアプローチは、変分分布として切り捨て事後分布を用いた変分アプローチとみなすことができます。ソース分離への応用では、両方のアプローチが、多数の標準ベンチマークにおける最先端の技術を向上させることがわかった。これは、対応するデータドメインに`spike-and-slab’事前分布を使用することを主張しています。さらに、切り捨てEMアプローチは、ソース分離タスクにおいて標準の因数分解アプローチよりも優れていることがわかりました。これは、因数分解変分アプローチで事後独立性を仮定することによって導入されるバイアスを示唆しています。同様に、画像ノイズ除去の標準ベンチマークでは、切り捨てEMアプローチが因数分解変分アプローチよりも優れていることがわかりました。因数分解アプローチのパフォーマンスは隠れ次元の数が増えるにつれて飽和しますが、切り捨てアプローチのパフォーマンスは、より高いノイズレベルに対して最先端のパフォーマンスを向上させます。

Bayesian Estimation of Causal Direction in Acyclic Structural Equation Models with Individual-specific Confounder Variables and Non-Gaussian Distributions
個別特異的交絡変数と非ガウス分布を用いた非巡回構造方程式モデルにおける因果方向のベイズ推定

Several existing methods have been shown to consistently estimate causal direction assuming linear or some form of nonlinear relationship and no latent confounders. However, the estimation results could be distorted if either assumption is violated. We develop an approach to determining the possible causal direction between two observed variables when latent confounding variables are present. We first propose a new linear non-Gaussian acyclic structural equation model with individual- specific effects that are sometimes the source of confounding. Thus, modeling individual-specific effects as latent variables allows latent confounding to be considered. We then propose an empirical Bayesian approach for estimating possible causal direction using the new model. We demonstrate the effectiveness of our method using artificial and real-world data.

いくつかの既存の方法は、線形または何らかの形の非線形関係を仮定し、潜在的な交絡因子がないと仮定して、因果関係の方向を一貫して推定することが示されています。ただし、いずれかの仮定に違反すると、推定結果が歪む可能性があります。私たちは、潜在交絡変数が存在する場合に、観測された2つの変数間の可能な因果方向を決定するためのアプローチを開発します。まず、交絡の原因となることがある個々の特定の効果を持つ新しい線形非ガウス非巡回構造方程式モデルを提案します。したがって、個々の特定の効果を潜在変数としてモデル化することで、潜在交絡を考慮することができます。次に、新しいモデルを使用して可能な因果方向を推定するための経験的ベイズアプローチを提案します。私たちは、人工データと実世界のデータを使用して、私たちの方法の有効性を実証します。

Efficient and Accurate Methods for Updating Generalized Linear Models with Multiple Feature Additions
複数の特徴を追加して一般化線形モデルを更新するための効率的で正確な方法

In this paper, we propose an approach for learning regression models efficiently in an environment where multiple features and data-points are added incrementally in a multi-step process. At each step, any finite number of features maybe added and hence, the setting is not amenable to low rank updates. We show that our approach is not only efficient and optimal for ordinary least squares, weighted least squares, generalized least squares and ridge regression, but also more generally for generalized linear models and lasso regression that use iterated re-weighted least squares for maximum likelihood estimation. Our approach instantiated to linear settings has close relations to the partitioned matrix inversion mechanism based on Schur’s complement. For arbitrary regression methods, even a relaxation of the approach is no worse than using the model from the previous step or using a model that learns on the additional features and optimizes the residual of the model at the previous step. Such problems are commonplace in complex manufacturing operations consisting of hundreds of steps, where multiple measurements are taken at each step to monitor the quality of the final product. Accurately predicting if the finished product will meet specifications at each or, at least, important intermediate steps can be extremely useful in enhancing productivity. We further validate our claims through experiments on synthetic and real industrial data sets.

この論文では、複数の特徴とデータポイントが複数のステッププロセスで段階的に追加される環境で、回帰モデルを効率的に学習するためのアプローチを提案します。各ステップでは、任意の有限数の特徴が追加される可能性があるため、設定は低ランク更新には適していません。このアプローチは、通常の最小二乗法、加重最小二乗法、一般化最小二乗法、リッジ回帰に対して効率的かつ最適であるだけでなく、より一般的には、最大尤度推定に反復再加重最小二乗法を使用する一般化線形モデルおよびLasso回帰に対しても効率的かつ最適であることを示します。線形設定にインスタンス化されたこのアプローチは、Schurの補集合に基づく分割行列反転メカニズムと密接な関係があります。任意の回帰法の場合、このアプローチを緩和しても、前のステップのモデルを使用するか、追加の特徴を学習して前のステップのモデルの残差を最適化するモデルを使用するよりも悪くはありません。このような問題は、最終製品の品質を監視するために各ステップで複数の測定が行われる、数百のステップで構成される複雑な製造オペレーションでは一般的です。完成品が各段階、または少なくとも重要な中間段階で仕様を満たすかどうかを正確に予測することは、生産性の向上に非常に役立ちます。私たちは合成データセットと実際の産業データセットでの実験を通じて、私たちの主張をさらに検証します。

Boosting Algorithms for Detector Cascade Learning
検出器カスケード学習のためのブースティングアルゴリズム

The problem of learning classifier cascades is considered. A new cascade boosting algorithm, fast cascade boosting (FCBoost), is proposed. FCBoost is shown to have a number of interesting properties, namely that it 1) minimizes a Lagrangian risk that jointly accounts for classification accuracy and speed, 2) generalizes adaboost, 3) can be made cost-sensitive to support the design of high detection rate cascades, and 4) is compatible with many predictor structures suitable for sequential decision making. It is shown that a rich family of such structures can be derived recursively from cascade predictors of two stages, denoted cascade generators. Generators are then proposed for two new cascade families, last-stage and multiplicative cascades, that generalize the two most popular cascade architectures in the literature. The concept of neutral predictors is finally introduced, enabling FCBoost to automatically determine the cascade configuration, i.e., number of stages and number of weak learners per stage, for the learned cascades. Experiments on face and pedestrian detection show that the resulting cascades outperform current state-of-the-art methods in both detection accuracy and speed.

分類器カスケードの学習の問題が検討されています。新しいカスケードブースティングアルゴリズムである高速カスケードブースティング(FCBoost)が提案されています。FCBoostには、1)分類精度と速度の両方を考慮したラグランジュリスクを最小化する、2)アダブーストを一般化する、3)コストに敏感にして高検出率カスケードの設計をサポートする、4)シーケンシャルな意思決定に適した多くの予測子構造と互換性がある、という興味深い特性が数多くあります。このような構造の豊富なファミリは、カスケードジェネレータと呼ばれる2段階のカスケードプレディクタから再帰的に導出できることが示されています。次に、文献で最も人気のある2つのカスケードアーキテクチャを一般化する2つの新しいカスケードファミリである最終段階カスケードと乗法カスケードのジェネレータが提案されています。中立予測子の概念が最終的に導入され、FCBoostは学習したカスケードのカスケード構成、つまりステージ数とステージあたりの弱学習器の数を自動的に決定できるようになりました。顔と歩行者の検出に関する実験では、結果として得られるカスケードが、検出精度と速度の両方において現在の最先端の方法よりも優れていることが示されています。

Contextual Bandits with Similarity Information
類似性情報を持つコンテキストバンディット

In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time- invariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now well-understood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs to assume extra structure in order to make the problem tractable. In particular, recent literature considered information on similarity between arms. We consider similarity information in the setting of contextual bandits, a natural extension of the basic MAB problem where before each round an algorithm is given the context—a hint about the payoffs in this round. Contextual bandits are directly motivated by placing advertisements on web pages, one of the crucial problems in sponsored search. A particularly simple way to represent similarity information in the contextual bandit setting is via a similarity distance between the context- arm pairs which bounds from above the difference between the respective expected payoffs. Prior work on contextual bandits with similarity uses âuniform” partitions of the similarity space, so that each context-arm pair is approximated by the closest pair in the partition. Algorithms based on âuniform” partitions disregard the structure of the payoffs and the context arrivals, which is potentially wasteful. We present algorithms that are based on adaptive partitions, and take advantage of “benign” payoffs and context arrivals without sacrificing the worst-case performance. The central idea is to maintain a finer partition in high-payoff regions of the similarity space and in popular regions of the context space. Our results apply to several other settings, e.g., MAB with constrained temporal change (Slivkins and Upfal, 2008) and sleeping bandits (Kleinberg et al., 2008a).

多腕バンディット(MAB)問題では、オンラインアルゴリズムが一連の選択を行います。各ラウンドで、アルゴリズムは時間不変の選択肢セットから選択し、この選択肢に関連付けられた報酬を受け取ります。戦略セットが小さいケースは今ではよく理解されていますが、最近の多くの研究は、問題を扱いやすくするために追加の構造を想定する必要がある、指数的または無限に大きい戦略セットを持つMAB問題に焦点を当てています。特に、最近の文献では、アーム間の類似性に関する情報が検討されています。コンテキストバンディットの設定における類似性情報について検討します。これは、各ラウンドの前にアルゴリズムにコンテキスト(このラウンドの報酬に関するヒント)が与えられる基本的なMAB問題の自然な拡張です。コンテキストバンディットは、Webページに広告を掲載することが直接の動機であり、これはスポンサー検索の重要な問題の1つです。コンテキストバンディット設定で類似性情報を表す特に簡単な方法は、コンテキストアームペア間の類似性距離を使用することです。これは、それぞれの期待報酬の差を超えて制限されます。類似性を持つコンテキストバンディットに関するこれまでの研究では、類似性空間の「均一」なパーティションを使用して、各コンテキストアームペアをパーティション内の最も近いペアで近似していました。「均一」なパーティションに基づくアルゴリズムは、ペイオフの構造とコンテキスト到着を無視するため、無駄になる可能性があります。ここでは、適応パーティションに基づくアルゴリズムを提示し、最悪のパフォーマンスを犠牲にすることなく「無害な」ペイオフとコンテキスト到着を活用します。中心となる考え方は、類似性空間の高ペイオフ領域とコンテキスト空間の一般的な領域で、より細かいパーティションを維持することです。私たちの結果は、制約された時間的変化を伴うMAB (SlivkinsおよびUpfal、2008年)やスリーピングバンディット(Kleinberg他、2008a年)など、他のいくつかの設定にも適用できます。

One-Shot-Learning Gesture Recognition using HOG-HOF Features
HOG-HOF機能を用いたワンショット学習型ジェスチャー認識

The purpose of this paper is to describe one-shot-learning gesture recognition systems developed on the ChaLearn Gesture Dataset (ChaLearn). We use RGB and depth images and combine appearance (Histograms of Oriented Gradients) and motion descriptors (Histogram of Optical Flow) for parallel temporal segmentation and recognition. The Quadratic-Chi distance family is used to measure differences between histograms to capture cross-bin relationships. We also propose a new algorithm for trimming videos—to remove all the unimportant frames from videos. We present two methods that use a combination of HOG-HOF descriptors together with variants of a Dynamic Time Warping technique. Both methods outperform other published methods and help narrow the gap between human performance and algorithms on this task. The code is publicly available in the MLOSS repository.

本稿の目的は、ChaLearn Gesture Dataset(ChaLearn)上で開発されたワンショット学習型ジェスチャー認識システムについて述べることです。RGB画像と深度画像を使用し、外観(Histograms of Oriented Gradients)とモーション記述子(Histogram of Optical Flow)を組み合わせて、並列の時間セグメンテーションと認識を行います。2次カイ距離ファミリーは、ヒストグラム間の差を測定し、クロスビン関係をキャプチャするために使用されます。また、ビデオをトリミングするための新しいアルゴリズムを提案します—ビデオから重要でないフレームをすべて削除します。ここでは、HOG-HOF記述子とダイナミックタイムワーピング技術のバリエーションを組み合わせて使用する2つの方法を紹介します。どちらの方法も、他の公開された方法よりも優れており、このタスクに関する人間のパフォーマンスとアルゴリズムの間のギャップを狭めるのに役立ちます。このコードは、MLOSSリポジトリで公開されています。

Beyond the Regret Minimization Barrier: Optimal Algorithms for Stochastic Strongly-Convex Optimization
後悔最小化障壁を超えて:確率的強凸最適化のための最適アルゴリズム

We give novel algorithms for stochastic strongly-convex optimization in the gradient oracle model which return a $O(\frac{1}{T})$-approximate solution after $T$ iterations. The first algorithm is deterministic, and achieves this rate via gradient updates and historical averaging. The second algorithm is randomized, and is based on pure gradient steps with a random step size. This rate of convergence is optimal in the gradient oracle model. This improves upon the previously known best rate of $O(\frac{\log(T)}{T})$, which was obtained by applying an online strongly-convex optimization algorithm with regret $O(\log(T))$ to the batch setting. We complement this result by proving that any algorithm has expected regret of $\Omega(\log(T))$ in the online stochastic strongly-convex optimization setting. This shows that any online-to-batch conversion is inherently suboptimal for stochastic strongly- convex optimization. This is the first formal evidence that online convex optimization is strictly more difficult than batch stochastic convex optimization.

私たちは、勾配オラクルモデルにおける確率的強凸最適化のための新しいアルゴリズムを提示し、$T$回の反復後に$O(\frac{1}{T})$近似解を返す。最初のアルゴリズムは決定論的であり、勾配更新と履歴平均化によってこの収束率を達成します。2番目のアルゴリズムはランダム化されており、ランダムなステップサイズによる純粋な勾配ステップに基づく。この収束率は、勾配オラクルモデルで最適です。これは、後悔$O(\log(T))$を伴うオンライン強凸最適化アルゴリズムをバッチ設定に適用することで得られた、これまで知られていた最高率$O(\frac{\log(T)}{T})$よりも優れています。私たちは、オンライン確率的強凸最適化設定において、どのアルゴリズムでも期待後悔が$\Omega(\log(T))$であることを証明することで、この結果を補完します。これは、オンラインからバッチへの変換は、確率的強凸最適化にとって本質的に最適ではないことを示しています。これは、オンライン凸最適化がバッチ確率的凸最適化よりも厳密に困難であることを示す最初の正式な証拠です。

On Multilabel Classification and Ranking with Bandit Feedback
バンディットフィードバックによる多ラベル分類とランク付けについて

We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd- order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show $O(T^{1/2}\log T)$ regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on diverse real- world multilabel data sets, often obtaining comparable performance.

私たちは、部分的な情報設定で動作する新しいマルチラベル/ランキングアルゴリズムを紹介します。このアルゴリズムは2次降下法に基づいており、探索と活用のトレードオフに上限信頼度に依存しています。このアルゴリズムは、共変量が敵対的になる可能性があるが、多ラベル確率が(一般化された)線形モデルによって支配される部分的な敵対的な設定で分析します。$O(T^{1/2}log T)$の後悔境界を示し、既存の結果に対していくつかの方法で改善します。私たちは、さまざまな実世界のマルチラベルデータセットの完全な情報ベースラインと対比することにより、高信頼性スキームの有効性をテストし、多くの場合、同等のパフォーマンスを得ます。

Spectral Learning of Latent-Variable PCFGs: Algorithms and Sample Complexity
潜在変数PCFGのスペクトル学習:アルゴリズムとサンプルの複雑さ

We introduce a spectral learning algorithm for latent-variable PCFGs (Matsuzaki et al., 2005; Petrov et al., 2006). Under a separability (singular value) condition, we prove that the method provides statistically consistent parameter estimates. Our result rests on three theorems: the first gives a tensor form of the inside- outside algorithm for PCFGs; the second shows that the required tensors can be estimated directly from training examples where hidden-variable values are missing; the third gives a PAC-style convergence bound for the estimation method.

私たちは、潜在変数PCFGのスペクトル学習アルゴリズムを紹介する(Matsuzakiら, 2005;Petrovら, 2006)。分離可能性(特異値)条件下では、この方法が統計的に一貫性のあるパラメータ推定値を提供することを証明します。私たちの結果は3つの定理に基づいています:最初の定理は、PCFGの内側-外側アルゴリズムのテンソル形式を与えます。2つ目は、必要なテンソルを、hidden-variable値が欠落しているトレーニング例から直接推定できることを示しています。3つ目は、推定方法のPACスタイルの収束範囲を示します。

Efficient State-Space Inference of Periodic Latent Force Models
周期的潜在力モデルの効率的な状態空間推論

Latent force models (LFM) are principled approaches to incorporating solutions to differential equations within non- parametric inference methods. Unfortunately, the development and application of LFMs can be inhibited by their computational cost, especially when closed-form solutions for the LFM are unavailable, as is the case in many real world problems where these latent forces exhibit periodic behaviour. Given this, we develop a new sparse representation of LFMs which considerably improves their computational efficiency, as well as broadening their applicability, in a principled way, to domains with periodic or near periodic latent forces. Our approach uses a linear basis model to approximate one generative model for each periodic force. We assume that the latent forces are generated from Gaussian process priors and develop a linear basis model which fully expresses these priors. We apply our approach to model the thermal dynamics of domestic buildings and show that it is effective at predicting day-ahead temperatures within the homes. We also apply our approach within queueing theory in which quasi-periodic arrival rates are modelled as latent forces. In both cases, we demonstrate that our approach can be implemented efficiently using state-space methods which encode the linear dynamic systems via LFMs. Further, we show that state estimates obtained using periodic latent force models can reduce the root mean squared error to 17% of that from non-periodic models and 27% of the nearest rival approach which is the resonator model (Sarkka et al., 2012; Hartikainen et al. 2012).

潜在力モデル(LFM)は、微分方程式の解をノンパラメトリック推論法に組み込むための原理的なアプローチです。残念ながら、LFMの開発と応用は、特にLFMの閉形式の解が利用できない場合、計算コストによって妨げられることがあります。これは、これらの潜在力が周期的な動作を示す多くの現実の問題の場合に当てはまります。これを考慮して、私たちはLFMの新しいスパース表現を開発しました。これにより、計算効率が大幅に向上し、周期的またはほぼ周期的な潜在力を持つ領域への適用範囲が原理的に広がります。私たちのアプローチでは、線形基底モデルを使用して、各周期力に対して1つの生成モデルを近似します。私たちは、潜在力がガウス過程の事前分布から生成されると仮定し、これらの事前分布を完全に表現する線形基底モデルを開発します。私たちは、このアプローチを住宅の建物の熱力学のモデル化に適用し、それが住宅内の翌日の気温の予測に効果的であることを示します。また、準周期的到着率が潜在力としてモデル化される待ち行列理論にもこのアプローチを適用します。どちらの場合も、線形動的システムをLFMを介してエンコードする状態空間法を使用して、このアプローチを効率的に実装できることを実証します。さらに、周期的潜在力モデルを使用して得られた状態推定値は、非周期的モデルの二乗平均平方根誤差の17%、最も近い競合アプローチである共振器モデル(Sarkka他、2012年、Hartikainen他、2012年)の27%まで低減できることも示しています。

Cover Tree Bayesian Reinforcement Learning
カバーツリーベイズ強化学習

This paper proposes an online tree-based Bayesian approach for reinforcement learning. For inference, we employ a generalised context tree model. This defines a distribution on multivariate Gaussian piecewise-linear models, which can be updated in closed form. The tree structure itself is constructed using the cover tree method, which remains efficient in high dimensional spaces. We combine the model with Thompson sampling and approximate dynamic programming to obtain effective exploration policies in unknown environments. The flexibility and computational simplicity of the model render it suitable for many reinforcement learning problems in continuous state spaces. We demonstrate this in an experimental comparison with a Gaussian process model, a linear model and simple least squares policy iteration.

この論文では、強化学習のためのオンラインツリーベースのベイジアンアプローチを提案します。推論には、一般化されたコンテキストツリーモデルを使用します。これにより、多変量ガウス区分線形モデル上の分布が定義され、閉じた形式で更新できます。ツリー構造自体は、高次元空間で効率的であるカバーツリー法を使用して構築されます。このモデルをトンプソンサンプリングおよび近似動的計画法と組み合わせて、未知の環境での効果的な探索方策を取得します。モデルの柔軟性と計算の単純さにより、連続状態空間での多くの強化学習問題に適しています。これを、ガウス過程モデル、線形モデル、および単純な最小二乗方策の反復による実験的比較で示します。

A Tensor Approach to Learning Mixed Membership Community Models
混合メンバーシップコミュニティモデルの学習に対するテンソルアプローチ

Community detection is the task of detecting hidden communities from observed interactions. Guaranteed community detection has so far been mostly limited to models with non-overlapping communities such as the stochastic block model. In this paper, we remove this restriction, and provide guaranteed community detection for a family of probabilistic network models with overlapping communities, termed as the mixed membership Dirichlet model, first introduced by Airoldi et al. (2008). This model allows for nodes to have fractional memberships in multiple communities and assumes that the community memberships are drawn from a Dirichlet distribution. Moreover, it contains the stochastic block model as a special case. We propose a unified approach to learning these models via a tensor spectral decomposition method. Our estimator is based on low-order moment tensor of the observed network, consisting of $3$-star counts. Our learning method is fast and is based on simple linear algebraic operations, e.g., singular value decomposition and tensor power iterations. We provide guaranteed recovery of community memberships and model parameters and present a careful finite sample analysis of our learning method. As an important special case, our results match the best known scaling requirements for the (homogeneous) stochastic block model.

コミュニティ検出は、観測された相互作用から隠れたコミュニティを検出するタスクです。保証されたコミュニティ検出は、これまで主に、確率的ブロックモデルなどの重複しないコミュニティを持つモデルに限定されていました。この論文では、この制限を取り除き、Airoldiら(2008)によって初めて導入された、混合メンバーシップディリクレモデルと呼ばれる、重複するコミュニティを持つ確率的ネットワークモデルのファミリーに対して、保証されたコミュニティ検出を提供します。このモデルでは、ノードが複数のコミュニティに部分的なメンバーシップを持つことを許可し、コミュニティメンバーシップがディリクレ分布から抽出されると想定しています。さらに、このモデルには、確率的ブロックモデルが特殊なケースとして含まれています。テンソルスペクトル分解法を介してこれらのモデルを学習する統一されたアプローチを提案します。推定量は、観測されたネットワークの低次モーメントテンソルに基づいており、3個のスターカウントで構成されています。学習方法は高速で、特異値分解やテンソルパワー反復などの単純な線形代数演算に基づいています。コミュニティメンバーシップとモデルパラメータの回復を保証し、学習方法の慎重な有限サンプル分析を示します。重要な特殊なケースとして、私たちの結果は(均質な)確率ブロックモデルの最もよく知られているスケーリング要件と一致します。

Clustering Partially Observed Graphs via Convex Optimization
凸最適化による部分観測グラフのクラスタリング

This paper considers the problem of clustering a partially observed unweighted graph—i.e., one where for some node pairs we know there is an edge between them, for some others we know there is no edge, and for the remaining we do not know whether or not there is an edge. We want to organize the nodes into disjoint clusters so that there is relatively dense (observed) connectivity within clusters, and sparse across clusters. We take a novel yet natural approach to this problem, by focusing on finding the clustering that minimizes the number of âdisagreementsâ—i.e., the sum of the number of (observed) missing edges within clusters, and (observed) present edges across clusters. Our algorithm uses convex optimization; its basis is a reduction of disagreement minimization to the problem of recovering an (unknown) low-rank matrix and an (unknown) sparse matrix from their partially observed sum. We evaluate the performance of our algorithm on the classical Planted Partition/Stochastic Block Model. Our main theorem provides sufficient conditions for the success of our algorithm as a function of the minimum cluster size, edge density and observation probability; in particular, the results characterize the tradeoff between the observation probability and the edge density gap. When there are a constant number of clusters of equal size, our results are optimal up to logarithmic factors.

この論文では、部分的に観測された重み付けされていないグラフ、つまり、いくつかのノードペアの間にエッジがあることがわかっているグラフ、他のいくつかのノードペアにはエッジがないことがわかっているグラフ、残りのノードペアにはエッジがあるかどうかわからないグラフをクラスタリングする問題について検討します。クラスター内では比較的密な(観測された)接続があり、クラスター間では疎になるように、ノードをばらばらのクラスターに編成します。この問題に対して、新しいが自然なアプローチを採用し、クラスター内の(観測された)欠落エッジの数とクラスター間の(観測された)存在するエッジの数の合計である「不一致」の数を最小化するクラスタリングを見つけることに焦点を当てます。アルゴリズムは凸最適化を使用します。その基礎は、不一致の最小化を、部分的に観測された合計から(未知の)低ランクマトリックスと(未知の)疎行列を復元する問題に縮小することです。私たちは、古典的なPlanted Partition/Stochastic Blockモデルでアルゴリズムのパフォーマンスを評価します。我々の主な定理は、最小クラスターサイズ、エッジ密度、および観測確率の関数として、アルゴリズムが成功するための十分な条件を提供します。特に、結果は、観測確率とエッジ密度ギャップ間のトレードオフを特徴付けます。同じサイズのクラスターが一定数ある場合、結果は対数係数まで最適です。

Ramp Loss Linear Programming Support Vector Machine
ランプ損失線形計画法サポートベクトルマシン

The ramp loss is a robust but non-convex loss for classification. Compared with other non-convex losses, a local minimum of the ramp loss can be effectively found. The effectiveness of local search comes from the piecewise linearity of the ramp loss. Motivated by the fact that the $\ell_1$-penalty is piecewise linear as well, the $\ell_1$-penalty is applied for the ramp loss, resulting in a ramp loss linear programming support vector machine (ramp- LPSVM). The proposed ramp-LPSVM is a piecewise linear minimization problem and the related optimization techniques are applicable. Moreover, the $\ell_1$-penalty can enhance the sparsity. In this paper, the corresponding misclassification error and convergence behavior are discussed. Generally, the ramp loss is a truncated hinge loss. Therefore ramp-LPSVM possesses some similar properties as hinge loss SVMs. A local minimization algorithm and a global search strategy are discussed. The good optimization capability of the proposed algorithms makes ramp-LPSVM perform well in numerical experiments: the result of ramp-LPSVM is more robust than that of hinge SVMs and is sparser than that of ramp-SVM, which consists of the $\|\cdot\|_{\mathcal{K}} $-penalty and the ramp loss.

ランプ損失は、分類に対して堅牢ですが非凸損失です。他の非凸損失と比較して、ランプ損失の局所最小値を効果的に見つけることができます。局所探索の有効性は、ランプ損失の区分線形性から生まれます。$\ell_1$ペナルティも区分線形であるという事実に動機付けられて、$\ell_1$ペナルティがランプ損失に適用され、ランプ損失線形計画法サポートベクターマシン（ランプ-LPSVM）が得られます。提案されたランプ-LPSVMは区分線形最小化問題であり、関連する最適化手法が適用可能です。さらに、$\ell_1$ペナルティはスパース性を高めることができます。この論文では、対応する誤分類エラーと収束動作について説明します。一般に、ランプ損失は切り捨てられたヒンジ損失です。したがって、ランプ-LPSVMはヒンジ損失SVMと同様の特性を備えています。局所最小化アルゴリズムとグローバル探索戦略について説明します。提案されたアルゴリズムの優れた最適化機能により、ランプLPSVMは数値実験で優れたパフォーマンスを発揮します。ランプLPSVMの結果はヒンジSVMの結果よりも堅牢であり、$\|\cdot\|_{\mathcal{K}} $ペナルティとランプ損失で構成されるランプSVMの結果よりもスパースです。

Particle Gibbs with Ancestor Sampling
祖先サンプリングによる粒子ギブス

Particle Markov chain Monte Carlo (pmcmc) is a systematic way of combining the two main tools used for Monte Carlo statistical inference: sequential Monte Carlo (smc) and Markov chain Monte Carlo (mcmc). We present a new pmcmc algorithm that we refer to as particle Gibbs with ancestor sampling (pgas). pgas provides the data analyst with an off-the-shelf class of Markov kernels that can be used to simulate, for instance, the typically high-dimensional and highly autocorrelated state trajectory in a state-space model. The ancestor sampling procedure enables fast mixing of the pgas kernel even when using seemingly few particles in the underlying smc sampler. This is important as it can significantly reduce the computational burden that is typically associated with using smc. pgas is conceptually similar to the existing pg with backward simulation (pgbs) procedure. Instead of using separate forward and backward sweeps as in pgbs, however, we achieve the same effect in a single forward sweep. This makes pgas well suited for addressing inference problems not only in state-space models, but also in models with more complex dependencies, such as non-Markovian, Bayesian nonparametric, and general probabilistic graphical models.

粒子マルコフ連鎖モンテカルロ(pmcmc)は、モンテカルロ統計的推論に使用される2つの主なツール、つまりシーケンシャルモンテカルロ(smc)とマルコフ連鎖モンテカルロ(mcmc)を組み合わせた体系的な方法です。ここでは、粒子ギブスと祖先サンプリング(pgas)と呼ばれる新しいpmcmcアルゴリズムを紹介します。pgasは、データアナリストに、たとえば状態空間モデルで一般的に高次元で自己相関の高い状態軌跡をシミュレートするために使用できる既製のマルコフカーネルクラスを提供します。祖先サンプリング手順により、基礎となるsmcサンプラーで少数の粒子しか使用していない場合でも、pgasカーネルを高速に混合できます。これは、smcの使用に通常伴う計算負荷を大幅に軽減できるため重要です。pgasは、概念的には既存のpgと後方シミュレーション(pgbs)手順に似ています。ただし、pgbsのように別個の前方スイープと後方スイープを使用する代わりに、単一の前方スイープで同じ効果を実現します。これにより、pgasは状態空間モデルだけでなく、非マルコフ、ベイジアンノンパラメトリック、一般的な確率的グラフィカルモデルなど、より複雑な依存関係を持つモデルの推論問題に対処するのにも適しています。

Classifier Cascades and Trees for Minimizing Feature Evaluation Cost
特徴評価コストを最小限に抑えるための分類器カスケードとツリー

Machine learning algorithms have successfully entered industry through many real-world applications (e.g., search engines and product recommendations). In these applications, the test-time CPU cost must be budgeted and accounted for. In this paper, we examine two main components of the test-time CPU cost, classifier evaluation cost and feature extraction cost, and show how to balance these costs with the classifier accuracy. Since the computation required for feature extraction dominates the test-time cost of a classifier in these settings, we develop two algorithms to efficiently balance the performance with the test-time cost. Our first contribution describes how to construct and optimize a tree of classifiers, through which test inputs traverse along individual paths. Each path extracts different features and is optimized for a specific sub-partition of the input space. Our second contribution is a natural reduction of the tree of classifiers into a cascade. The cascade is particularly useful for class-imbalanced data sets as the majority of instances can be early-exited out of the cascade when the algorithm is sufficiently confident in its prediction. Because both approaches only compute features for inputs that benefit from them the most, we find our trained classifiers lead to high accuracies at a small fraction of the computational cost.

機械学習アルゴリズムは、多くの実際のアプリケーション(検索エンジンや製品の推奨など)を通じて業界に導入されてきました。これらのアプリケーションでは、テスト時のCPUコストを予算化し、考慮する必要があります。この論文では、テスト時のCPUコストの2つの主な要素である分類器評価コストと特徴抽出コストを検討し、これらのコストと分類器の精度のバランスをとる方法を示します。これらの設定では、特徴抽出に必要な計算が分類器のテスト時間コストの大部分を占めるため、パフォーマンスとテスト時間コストを効率的にバランスさせる2つのアルゴリズムを開発します。最初の貢献では、テスト入力が個別のパスに沿って移動する分類器のツリーを構築して最適化する方法について説明します。各パスは異なる特徴を抽出し、入力空間の特定のサブパーティションに最適化されます。2番目の貢献は、分類器のツリーをカスケードに自然に縮小することです。カスケードは、アルゴリズムが予測に十分な自信を持つ場合、インスタンスの大部分をカスケードから早期に終了できるため、クラス不均衡なデータセットに特に役立ちます。どちらのアプローチも、最もメリットのある入力の特徴のみを計算するため、トレーニング済みの分類器は、わずかな計算コストで高い精度を実現します。

Parallel MCMC with Generalized Elliptical Slice Sampling
一般化楕円スライスサンプリングによる並列 MCMC

Probabilistic models are conceptually powerful tools for finding structure in data, but their practical effectiveness is often limited by our ability to perform inference in them. Exact inference is frequently intractable, so approximate inference is often performed using Markov chain Monte Carlo (MCMC). To achieve the best possible results from MCMC, we want to efficiently simulate many steps of a rapidly mixing Markov chain which leaves the target distribution invariant. Of particular interest in this regard is how to take advantage of multi-core computing to speed up MCMC-based inference, both to improve mixing and to distribute the computational load. In this paper, we present a parallelizable Markov chain Monte Carlo algorithm for efficiently sampling from continuous probability distributions that can take advantage of hundreds of cores. This method shares information between parallel Markov chains to build a scale-location mixture of Gaussians approximation to the density function of the target distribution. We combine this approximation with a recently developed method known as elliptical slice sampling to create a Markov chain with no step- size parameters that can mix rapidly without requiring gradient or curvature computations.

確率モデルは概念的にはデータの構造を見つけるための強力なツールですが、その実用的な有効性は推論を実行する能力によって制限されることがよくあります。正確な推論はしばしば扱いにくいため、マルコフ連鎖モンテカルロ(MCMC)を使用して近似推論が実行されることがよくあります。MCMCから可能な限り最良の結果を得るには、ターゲット分布を不変のままにする、急速に混合するマルコフ連鎖の多くのステップを効率的にシミュレートする必要があります。この点で特に興味深いのは、マルチコアコンピューティングを利用してMCMCベースの推論を高速化し、混合を改善して計算負荷を分散する方法です。この論文では、数百のコアを利用できる連続確率分布から効率的にサンプリングするための並列化可能なマルコフ連鎖モンテカルロアルゴリズムを紹介します。この方法では、並列マルコフ連鎖間で情報を共有して、ターゲット分布の密度関数のスケール位置混合ガウス分布近似を構築します。この近似と、最近開発された楕円スライスサンプリングと呼ばれる方法を組み合わせて、勾配や曲率の計算を必要とせずに迅速に混合できる、ステップサイズパラメーターのないマルコフ連鎖を作成します。

The Student-t Mixture as a Natural Image Patch Prior with Application to Image Compression
画像圧縮への適用に先立つ自然画像パッチとしてのスチューデントのt混合

Recent results have shown that Gaussian mixture models (GMMs) are remarkably good at density modeling of natural image patches, especially given their simplicity. In terms of log likelihood on real-valued data they are comparable with the best performing techniques published, easily outperforming more advanced ones, such as deep belief networks. They can be applied to various image processing tasks, such as image denoising, deblurring and inpainting, where they improve on other generic prior methods, such as sparse coding and field of experts. Based on this we propose the use of another, even richer mixture model based image prior: the Student-t mixture model (STM). We demonstrate that it convincingly surpasses GMMs in terms of log likelihood, achieving performance competitive with the state of the art in image patch modeling. We apply both the GMM and STM to the task of lossy and lossless image compression, and propose efficient coding schemes that can easily be extended to other unsupervised machine learning models. Finally, we show that the suggested techniques outperform JPEG, with results comparable to or better than JPEG 2000.

最近の結果によると、ガウス混合モデル(GMM)は、特にそのシンプルさを考慮すると、自然画像パッチの密度モデリングに非常に優れていることが示されています。実数値データに対する対数尤度に関しては、公開されている最高のパフォーマンスの手法に匹敵し、ディープビリーフネットワークなどのより高度な手法を簡単に上回ります。これらは、画像のノイズ除去、ぼかし除去、インペインティングなどのさまざまな画像処理タスクに適用でき、スパースコーディングやフィールドオブエキスパートなどの他の一般的な事前手法よりも優れています。これに基づいて、さらに豊富な混合モデルベースの画像事前手法であるStudent-t混合モデル(STM)の使用を提案します。これが対数尤度の点でGMMをはるかに上回り、画像パッチモデリングの最先端の技術と競合できるパフォーマンスを実現することを実証します。GMMとSTMの両方を非可逆および可逆画像圧縮のタスクに適用し、他の教師なし機械学習モデルに簡単に拡張できる効率的なコーディングスキームを提案します。最後に、提案された手法はJPEGよりも優れており、JPEG 2000と同等かそれ以上の結果が得られることを示します。

pystruct – Learning Structured Prediction in Python
pystruct – Pythonで構造化された予測を学習する

Structured prediction methods have become a central tool for many machine learning applications. While more and more algorithms are developed, only very few implementations are available. pystruct aims at providing a general purpose implementation of standard structured prediction methods, both for practitioners and as a baseline for researchers. It is written in Python and adapts paradigms and types from the scientific Python community for seamless integration with other projects.

構造化された予測手法は、多くの機械学習アプリケーションの中心的なツールとなっています。ますます多くのアルゴリズムが開発されていますが、利用可能な実装はごくわずかです。PyStructは、実務家と研究者のベースラインの両方に対して、標準的な構造化予測方法の汎用実装を提供することを目的としています。Pythonで書かれており、科学的なPythonコミュニティのパラダイムとタイプを適応させて、他のプロジェクトとシームレスに統合します。

Causal Discovery with Continuous Additive Noise Models
連続加法ノイズモデルによる因果関係の発見

We consider the problem of learning causal directed acyclic graphs from an observational joint distribution. One can use these graphs to predict the outcome of interventional experiments, from which data are often not available. We show that if the observational distribution follows a structural equation model with an additive noise structure, the directed acyclic graph becomes identifiable from the distribution under mild conditions. This constitutes an interesting alternative to traditional methods that assume faithfulness and identify only the Markov equivalence class of the graph, thus leaving some edges undirected. We provide practical algorithms for finitely many samples, RESIT (regression with subsequent independence test) and two methods based on an independence score. We prove that RESIT is correct in the population setting and provide an empirical evaluation.

私たちは、観測的同時分布から因果指向性非巡回グラフを学習する問題を考えます。これらのグラフを使用して、データが入手できないことが多い介入実験の結果を予測することができます。観測分布が加法的ノイズ構造を持つ構造方程式モデルに従う場合、穏やかな条件下では、有向非巡回グラフが分布から識別可能になることを示します。これは、忠実性を仮定し、グラフのマルコフ同値クラスのみを識別し、一部のエッジを無向のままにする従来の方法に代わる興味深い方法です。有限個のサンプルに対する実用的なアルゴリズム、RESIT(回帰とその後の独立性検定)、および独立性スコアに基づく2つの方法を提供します。私たちは、RESITが人口設定で正しいことを証明し、経験的評価を提供します。

Sparse Factor Analysis for Learning and Content Analytics
学習とコンテンツ分析のためのスパース因子分析

We develop a new model and algorithms for machine learning-based learning analytics, which estimate a learner’s knowledge of the concepts underlying a domain, and {\em{content analytics}}, which estimate the relationships among a collection of questions and those concepts. Our model represents the probability that a learner provides the correct response to a question in terms of three factors: their understanding of a set of underlying concepts, the concepts involved in each question, and each question’s intrinsic difficulty. We estimate these factors given the graded responses to a collection of questions. The underlying estimation problem is ill-posed in general, especially when only a subset of the questions are answered. The key observation that enables a well-posed solution is the fact that typical educational domains of interest involve only a small number of key concepts. Leveraging this observation, we develop both a bi-convex maximum-likelihood-based solution and a Bayesian solution to the resulting SPARse Factor Analysis (SPARFA) problem. We also incorporate user-defined tags on questions to facilitate the interpretability of the estimated factors. Experiments with synthetic and real-world data demonstrate the efficacy of our approach. Finally, we make a connection between SPARFA and noisy, binary-valued (1-bit) dictionary learning that is of independent interest.

私たちは、機械学習ベースの学習分析のための新しいモデルとアルゴリズムを開発しました。これは、ドメインの基礎となる概念に関する学習者の知識を推定するものであり、コンテンツ分析は、一連の質問とそれらの概念の関係を推定するものです。我々のモデルは、学習者が質問に対して正しい回答をする確率を、基礎となる概念セットの理解、各質問に含まれる概念、各質問の本質的な難易度という3つの要素で表します。私たちは、一連の質問に対する段階的な回答を前提として、これらの要素を推定します。基礎となる推定問題は、特に質問のサブセットのみが回答されている場合、一般的に不適切です。適切に配置されたソリューションを可能にする重要な観察は、関心のある一般的な教育ドメインには少数の重要な概念しか含まれていないという事実です。この観察を活用して、結果として生じるSPARse因子分析(SPARFA)問題に対する双凸最大尤度ベースのソリューションとベイズソリューションの両方を開発しました。また、推定された因子の解釈を容易にするために、質問にユーザー定義のタグを組み込みます。合成データと実世界のデータを使った実験により、私たちのアプローチの有効性が実証されています。最後に、SPARFAと、独立した興味深いノイズの多いバイナリ値(1ビット)辞書学習との関連性を示します。

Dropout: A Simple Way to Prevent Neural Networks from Overfitting
ドロップアウト:ニューラルネットワークのオーバーフィッティングを防ぐ簡単な方法

Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different thinned networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

多数のパラメータを持つディープニューラルネットは、非常に強力な機械学習システムです。ただし、このようなネットワークでは、過学習が深刻な問題となります。また、大規模なネットワークは使用に時間がかかるため、テスト時に多くの異なる大規模なニューラルネットの予測を組み合わせて過学習に対処するのは困難です。ドロップアウトは、この問題に対処するための手法です。重要なアイデアは、トレーニング中にニューラルネットワークからユニット(およびその接続)をランダムにドロップすることです。これにより、ユニットが過度に共適応するのを防ぎます。トレーニング中、ドロップアウトは、指数関数的に多数の異なる間引きネットワークからサンプルを抽出します。テスト時には、重みが小さい単一の間引きされていないネットワークを使用するだけで、これらすべての間引きネットワークの予測を平均化する効果を簡単に近似できます。これにより、過学習が大幅に軽減され、他の正規化方法よりも大幅に改善されます。ドロップアウトにより、視覚、音声認識、ドキュメント分類、計算生物学の教師あり学習タスクにおけるニューラルネットワークのパフォーマンスが向上し、多くのベンチマークデータセットで最先端の結果が得られることがわかります。

Pattern Alternating Maximization Algorithm for Missing Data in High-Dimensional Problems
高次元問題における欠損データに対するパターン交互最大化アルゴリズム

We propose a novel and efficient algorithm for maximizing the observed log-likelihood of a multivariate normal data matrix with missing values. We show that our procedure, based on iteratively regressing the missing on the observed variables, generalizes the standard EM algorithm by alternating between different complete data spaces and performing the E-Step incrementally. In this non-standard setup we prove numerical convergence to a stationary point of the observed log- likelihood. For high-dimensional data, where the number of variables may greatly exceed sample size, we perform regularization using a Lasso-type penalty. This introduces sparsity in the regression coefficients used for imputation, permits fast computation and warrants competitive performance in terms of estimating the missing entries. We show on simulated and real data that the new method often improves upon other modern imputation techniques such as k-nearest neighbors imputation, nuclear norm minimization or a penalized likelihood approach with an $\ell_1$-penalty on the concentration matrix.

私たちは、欠損値を含む多変量正規データ行列の観測対数尤度を最大化する新しい効率的なアルゴリズムを提案します。私たちは、欠損値を観測変数に反復的に回帰する手順に基づき、異なる完全なデータ空間を交互に切り替え、E-Stepを段階的に実行することで、標準EMアルゴリズムを一般化することを示す。この非標準設定では、観測対数尤度の定常点への数値収束を証明します。変数の数がサンプルサイズを大幅に超える可能性がある高次元データの場合、Lassoタイプのペナルティを使用して正規化を実行します。これにより、補完に使用される回帰係数にスパース性が導入され、高速計算が可能になり、欠損エントリの推定に関して競争力のあるパフォーマンスが保証されます。私たちは、シミュレーションデータと実際のデータで、新しい方法が、k近傍補完、核ノルム最小化、または濃度行列に$\ell_1$ペナルティを課したペナルティ付き尤度アプローチなどの他の最新の補完手法よりも優れていることが多いことを示す。

Expectation Propagation for Neural Networks with Sparsity-Promoting Priors
スパース性促進事前確率を持つニューラルネットワークに対する期待伝播

We propose a novel approach for nonlinear regression using a two-layer neural network (NN) model structure with sparsity- favoring hierarchical priors on the network weights. We present an expectation propagation (EP) approach for approximate integration over the posterior distribution of the weights, the hierarchical scale parameters of the priors, and the residual scale. Using a factorized posterior approximation we derive a computationally efficient algorithm, whose complexity scales similarly to an ensemble of independent sparse linear models. The approach enables flexible definition of weight priors with different sparseness properties such as independent Laplace priors with a common scale parameter or Gaussian automatic relevance determination (ARD) priors with different relevance parameters for all inputs. The approach can be extended beyond standard activation functions and NN model structures to form flexible nonlinear predictors from multiple sparse linear models. The effects of the hierarchical priors and the predictive performance of the algorithm are assessed using both simulated and real-world data. Comparisons are made to two alternative models with ARD priors: a Gaussian process with a NN covariance function and marginal maximum a posteriori estimates of the relevance parameters, and a NN with Markov chain Monte Carlo integration over all the unknown model parameters.

私たちは、ネットワークの重みにスパース性を優先する階層的事前分布を持つ2層ニューラルネットワーク(NN)モデル構造を使用した非線形回帰の新しいアプローチを提案します。重みの事後分布、事前分布の階層的スケールパラメーター、および残差スケールを近似的に積分するための期待値伝播(EP)アプローチを紹介します。因数分解事後近似を使用して、計算効率の高いアルゴリズムを導出します。このアルゴリズムの複雑さは、独立したスパース線形モデルのアンサンブルと同様にスケールします。このアプローチにより、共通のスケールパラメーターを持つ独立したラプラス事前分布や、すべての入力に対して異なる関連性パラメーターを持つガウス自動関連性決定(ARD)事前分布など、異なるスパース性特性を持つ重み事前分布を柔軟に定義できます。このアプローチは、標準的な活性化関数とNNモデル構造を超えて拡張でき、複数のスパース線形モデルから柔軟な非線形予測子を形成できます。階層的事前分布の効果とアルゴリズムの予測性能は、シミュレーションデータと実世界のデータの両方を使用して評価されます。ARD事前分布を使用した2つの代替モデル(NN共分散関数と関連性パラメータの限界最大事後推定値を使用したガウス過程、およびすべての未知のモデルパラメータに対するマルコフ連鎖モンテカルロ積分を使用したNN)と比較されます。

Bayesian Inference with Posterior Regularization and Applications to Infinite Latent SVMs
事後正則化によるベイズ推論と無限潜在SVMへの応用

Existing Bayesian models, especially nonparametric Bayesian methods, rely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations. While priors affect posterior distributions through Bayes’ rule, imposing posterior regularization is arguably more direct and in some cases more natural and general. In this paper, we present regularized Bayesian inference (RegBayes), a novel computational framework that performs posterior inference with a regularization term on the desired post-data posterior distribution under an information theoretical formulation. RegBayes is more flexible than the procedure that elicits expert knowledge via priors, and it covers both directed Bayesian networks and undirected Markov networks. When the regularization is induced from a linear operator on the posterior distributions, such as the expectation operator, we present a general convex-analysis theorem to characterize the solution of RegBayes. Furthermore, we present two concrete examples of RegBayes, infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the large- margin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark data sets, which appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics. Such results contribute to push forward the interface between these two important subfields, which have been largely treated as isolated in the community.

既存のベイズモデル、特にノンパラメトリックベイズ法は、改善された潜在的表現を発見するためにドメイン知識を組み込むために特別に考案された事前分布に依存しています。事前分布はベイズの規則を通じて事後分布に影響しますが、事後正則化を課す方がおそらくより直接的で、場合によってはより自然で一般的です。この論文では、情報理論的定式化の下で、目的の事後データ事後分布に対して正則化項を使用して事後推論を実行する新しい計算フレームワークである、正則化ベイズ推論(RegBayes)を紹介します。RegBayesは、事前分布を介して専門家の知識を引き出す手順よりも柔軟性が高く、有向ベイズネットワークと無向マルコフネットワークの両方をカバーします。期待値演算子などの事後分布に対する線形演算子から正則化が誘導される場合、RegBayesのソリューションを特徴付ける一般的な凸解析定理を示します。さらに、RegBayesの具体的な例を2つ紹介します。無限潜在サポートベクターマシン(iLSVM)とマルチタスク無限潜在サポートベクターマシン(MT-iLSVM)は、それぞれ分類とマルチタスク学習の予測潜在特徴を発見するためのノンパラメトリックベイズモデルと組み合わせた大マージンのアイデアを探求します。効率的な推論方法を紹介し、いくつかのベンチマークデータセットに関する実証研究を報告します。これは、大マージン学習とベイズノンパラメトリックの両方から継承されたメリットを実証しているようです。このような結果は、コミュニティ内でほとんど孤立して扱われてきたこれら2つの重要なサブフィールド間のインターフェイスを前進させることに貢献します。

Hitting and Commute Times in Large Random Neighborhood Graphs
大規模ランダム近傍グラフにおけるヒット時間と通勤時間

In machine learning, a popular tool to analyze the structure of graphs is the hitting time and the commute distance (resistance distance). For two vertices $u$ and $v$, the hitting time $H_{uv}$ is the expected time it takes a random walk to travel from $u$ to $v$. The commute distance is its symmetrized version $C_{uv} = H_{uv} + H_{vu}$. In our paper we study the behavior of hitting times and commute distances when the number $n$ of vertices in the graph tends to infinity. We focus on random geometric graphs ($\epsilon$-graphs, kNN graphs and Gaussian similarity graphs), but our results also extend to graphs with a given expected degree distribution or Erdos-Renyi graphs with planted partitions. We prove that in these graph families, the suitably rescaled hitting time $H_{uv}$ converges to $1/d_v$ and the rescaled commute time to $1/d_u + 1/d_v$ where $d_u$ and $d_v$ denote the degrees of vertices $u$ and $v$. In these cases, hitting and commute times do not provide information about the structure of the graph, and their use is discouraged in many machine learning applications.

機械学習では、グラフの構造を分析するための一般的なツールは、ヒット時間と通勤距離（抵抗距離）です。2つの頂点$u$と$v$の場合、ヒット時間$H_{uv}$は、ランダムウォークで$u$から$v$まで移動するのにかかる期待時間です。通勤距離はその対称バージョン$C_{uv} = H_{uv} + H_{vu}$です。私たちの論文では、グラフの頂点の数$n$が無限大に近づく場合のヒット時間と通勤距離の挙動を調べます。私たちはランダム幾何学グラフ（$\epsilon$グラフ、kNNグラフ、ガウス類似度グラフ）に焦点を当てていますが、結果は、与えられた期待次数分布を持つグラフや、植え付けられたパーティションを持つエルデシュ・レニグラフにも拡張されます。これらのグラフファミリでは、適切に再スケールされたヒット時間$H_{uv}$は$1/d_v$に収束し、再スケールされた通勤時間は$1/d_u + 1/d_v$に収束することを証明します。ここで、$d_u$と$d_v$は頂点$u$と$v$の次数を表します。これらの場合、ヒット時間と通勤時間はグラフの構造に関する情報を提供しないため、多くの機械学習アプリケーションではそれらの使用は推奨されません。

Graph Estimation From Multi-Attribute Data
多属性データからのグラフ推定

Undirected graphical models are important in a number of modern applications that involve exploring or exploiting dependency structures underlying the data. For example, they are often used to explore complex systems where connections between entities are not well understood, such as in functional brain networks or genetic networks. Existing methods for estimating structure of undirected graphical models focus on scenarios where each node represents a scalar random variable, such as a binary neural activation state or a continuous mRNA abundance measurement, even though in many real world problems, nodes can represent multivariate variables with much richer meanings, such as whole images, text documents, or multi-view feature vectors. In this paper, we propose a new principled framework for estimating the structure of undirected graphical models from such multivariate (or multi-attribute) nodal data. The structure of a graph is inferred through estimation of non-zero partial canonical correlation between nodes. Under a Gaussian model, this strategy is equivalent to estimating conditional independencies between random vectors represented by the nodes and it generalizes the classical problem of covariance selection (Dempster, 1972). We relate the problem of estimating non-zero partial canonical correlations to maximizing a penalized Gaussian likelihood objective and develop a method that efficiently maximizes this objective. Extensive simulation studies demonstrate the effectiveness of the method under various conditions. We provide illustrative applications to uncovering gene regulatory networks from gene and protein profiles, and uncovering brain connectivity graph from positron emission tomography data. Finally, we provide sufficient conditions under which the true graphical structure can be recovered correctly.

無向グラフィカルモデルは、データの根底にある依存関係の構造を探索または活用する多くの最新アプリケーションで重要です。たとえば、機能的脳ネットワークや遺伝子ネットワークなど、エンティティ間の接続がよく理解されていない複雑なシステムを探索するためによく使用されます。無向グラフィカルモデルの構造を推定する既存の方法は、バイナリ神経活性化状態や連続的なmRNA存在量測定など、各ノードがスカラーランダム変数を表すシナリオに重点を置いていますが、多くの現実の問題では、ノードは、全体の画像、テキストドキュメント、またはマルチビュー特徴ベクトルなど、はるかに豊富な意味を持つ多変量変数を表すことができます。この論文では、このような多変量(または多属性)ノードデータから無向グラフィカルモデルの構造を推定するための新しい原理フレームワークを提案します。グラフの構造は、ノード間の非ゼロの部分正準相関の推定を通じて推測されます。ガウスモデルでは、この戦略はノードによって表されるランダムベクトル間の条件付き独立性を推定することと同等であり、共分散選択の古典的な問題(Dempster、1972)を一般化します。非ゼロの部分正準相関を推定する問題を、ペナルティ付きガウス尤度目標の最大化に関連付け、この目標を効率的に最大化する手法を開発します。広範なシミュレーション研究により、さまざまな条件下での手法の有効性が実証されています。遺伝子およびタンパク質プロファイルから遺伝子調節ネットワークを明らかにしたり、陽電子放出断層撮影データから脳の接続グラフを明らかにしたりする例示的なアプリケーションを示します。最後に、真のグラフィカル構造を正しく回復できる十分な条件を示します。

Adaptive Minimax Regression Estimation over Sparse l_q-Hulls
スパースl_q-Hullsに対する適応ミニマックス回帰推定

Given a dictionary of $M_n$ predictors, in a random design regression setting with $n$ observations, we construct estimators that target the best performance among all the linear combinations of the predictors under a sparse $\ell_q$-norm ($0 \leq q \leq 1$) constraint on the linear coefficients. Besides identifying the optimal rates of convergence, our universal aggregation strategies by model mixing achieve the optimal rates simultaneously over the full range of $0\leq q \leq 1$ for any $M_n$ and without knowledge of the $\ell_q$-norm of the best linear coefficients to represent the regression function. To allow model misspecification, our upper bound results are obtained in a framework of aggregation of estimates. A striking feature is that no specific relationship among the predictors is needed to achieve the upper rates of convergence (hence permitting basically arbitrary correlations between the predictors). Therefore, whatever the true regression function (assumed to be uniformly bounded), our estimators automatically exploit any sparse representation of the regression function (if any), to the best extent possible within the $\ell_q$-constrained linear combinations for any $0 \leq q \leq 1$. A sparse approximation result in the $\ell_q$-hulls turns out to be crucial to adaptively achieve minimax rate optimal aggregation. It precisely characterizes the number of terms needed to achieve a prescribed accuracy of approximation to the best linear combination in an $\ell_q$-hull for $0 \leq q \leq 1$. It offers the insight that the minimax rate of $\ell_q$-aggregation is basically determined by an effective model size, which is a sparsity index that depends on $q$, $M_n$, $n$, and the $\ell_q$-norm bound in an easily interpretable way based on a classical model selection theory that deals with a large number of models.

$M_n$個の予測子の辞書が与えられ、$n$個の観測値を持つランダム設計回帰設定で、線形係数に対するスパース$\ell_q$ノルム($0 \leq q \leq 1$)制約の下で予測子のすべての線形結合の中で最良のパフォーマンスを目標とする推定子を構築します。最適な収束率を特定することに加えて、モデル混合による普遍的な集約戦略は、回帰関数を表す最良の線形係数の$\ell_q$ノルムを知らなくても、任意の$M_n$に対して$0\leq q \leq 1$の全範囲にわたって同時に最適な率を達成します。モデルの誤指定を許容するために、上限の結果は推定値の集約のフレームワークで得られます。顕著な特徴は、上限収束率を達成するために予測子間に特定の関係が必要ないことです(したがって、予測子間の基本的に任意の相関関係が許可されます)。したがって、真の回帰関数（一様境界であると仮定）が何であれ、推定器は、任意の$0 \leq q \leq 1$に対する$\ell_q$制約線形結合内で可能な限り最大限に、回帰関数のスパース表現（存在する場合）を自動的に利用します。$\ell_q$包のスパース近似結果は、ミニマックスレート最適集約を適応的に達成するために重要であることがわかります。これは、$0 \leq q \leq 1$に対する$\ell_q$包の最良の線形結合への近似の規定精度を達成するために必要な項の数を正確に特徴付けます。これは、多数のモデルを扱う古典的なモデル選択理論に基づいて、簡単に解釈できる方法で、$\ell_q$集約のミニマックス率が基本的に有効なモデルサイズによって決定されるという洞察を提供します。有効なモデルサイズは、$q$、$M_n$、$n$、および$\ell_q$ノルム境界に依存するスパースインデックスです。

Surrogate Regret Bounds for Bipartite Ranking via Strongly Proper Losses
代理後悔の範囲は、強く適切な損失による二者ランキングの

The problem of bipartite ranking, where instances are labeled positive or negative and the goal is to learn a scoring function that minimizes the probability of mis-ranking a pair of positive and negative instances (or equivalently, that maximizes the area under the ROC curve), has been widely studied in recent years. A dominant theoretical and algorithmic framework for the problem has been to reduce bipartite ranking to pairwise classification; in particular, it is well known that the bipartite ranking regret can be formulated as a pairwise classification regret, which in turn can be upper bounded using usual regret bounds for classification problems. Recently, Kotlowski et al. (2011) showed regret bounds for bipartite ranking in terms of the regret associated with balanced versions of the standard (non- pairwise) logistic and exponential losses. In this paper, we show that such (non-pairwise) surrogate regret bounds for bipartite ranking can be obtained in terms of a broad class of proper (composite) losses that we term as strongly proper. Our proof technique is much simpler than that of Kotlowski et al. (2011), and relies on properties of proper (composite) losses as elucidated recently by Reid and Williamson (2010, 2011) and others. Our result yields explicit surrogate bounds (with no hidden balancing terms) in terms of a variety of strongly proper losses, including for example logistic, exponential, squared and squared hinge losses as special cases. An important consequence is that standard algorithms minimizing a (non-pairwise) strongly proper loss, such as logistic regression and boosting algorithms (assuming a universal function class and appropriate regularization), are in fact consistent for bipartite ranking; moreover, our results allow us to quantify the bipartite ranking regret in terms of the corresponding surrogate regret. We also obtain tighter surrogate bounds under certain low-noise conditions via a recent result of Clemencon and Robbiano (2011).

二部ランキングの問題は、インスタンスが正または負とラベル付けされ、目標は正と負のインスタンスのペアを誤ってランク付けする確率を最小化する（または同等に、ROC曲線の下の領域を最大化する）スコアリング関数を学習することであり、近年広く研究されています。この問題の主な理論的およびアルゴリズム的枠組みは、二部ランキングをペアワイズ分類に縮小することでした。特に、二部ランキングの後悔はペアワイズ分類の後悔として定式化できることはよく知られており、これは分類問題に対する通常の後悔境界を使用して上限を設定できます。最近、Kotlowskiら(2011)は、標準的な（ペアワイズではない）ロジスティック損失と指数損失のバランスの取れたバージョンに関連付けられた後悔の観点から、二部ランキングの後悔境界を示しました。この論文では、このような（ペアワイズではない）二部ランキングの代替後悔境界が、強く適切と呼ばれる広範な適切な（複合）損失のクラスの観点から取得できることを示します。我々の証明手法はKotlowskiら(2011)のものよりはるかに単純で、ReidとWilliamson (2010、2011)らによって最近解明された適切な(合成)損失の特性に依存しています。我々の結果は、ロジスティック損失、指数損失、二乗損失、二乗ヒンジ損失などの特殊なケースを含む、さまざまな強く適切な損失に関して、明示的な代理境界(隠れたバランス項なし)をもたらします。重要な結果は、ロジスティック回帰やブースティングアルゴリズム(ユニバーサル関数クラスと適切な正則化を想定)などの(非ペアワイズ)強く適切な損失を最小化する標準アルゴリズムが、実際には二部ランキングで一貫しているということです。さらに、我々の結果により、対応する代理リグレットに関して二部ランキングリグレットを定量化できます。また、ClemenconとRobbiano (2011)の最近の結果を介して、特定の低ノイズ条件下でより厳しい代理境界も得られます。

Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife
ランダムフォレストの信頼区間:ジャックナイフと無限小ジャックナイフ

We study the variability of predictions made by bagged learners and random forests, and show how to estimate standard errors for these methods. Our work builds on variance estimates for bagging proposed by Efron (1992, 2013) that are based on the jackknife and the infinitesimal jackknife (IJ). In practice, bagged predictors are computed using a finite number $B$ of bootstrap replicates, and working with a large $B$ can be computationally expensive. Direct applications of jackknife and IJ estimators to bagging require $B = \Theta (n^{1.5})$ bootstrap replicates to converge, where $n$ is the size of the training set. We propose improved versions that only require $B = \Theta (n)$ replicates. Moreover, we show that the IJ estimator requires 1.7 times less bootstrap replicates than the jackknife to achieve a given accuracy. Finally, we study the sampling distributions of the jackknife and IJ variance estimates themselves. We illustrate our findings with multiple experiments and simulation studies.

私たちは、バッグ学習器とランダムフォレストによる予測の変動性を調べ、これらの方法の標準誤差を推定する方法を示します。我々の研究は、ジャックナイフと無限小ジャックナイフ(IJ)に基づく、Efron (1992、2013)によって提案されたバッグ学習の分散推定に基づいています。実際には、バッグ学習器の予測子は有限数のブートストラップ反復$B$を使用して計算されますが、大きな$B$で作業すると計算コストが高くなる可能性があります。ジャックナイフ推定子とIJ推定子をバッグ学習器に直接適用すると、収束するために$B = \Theta (n^{1.5})$回のブートストラップ反復が必要です($n$はトレーニングセットのサイズ)。私たちは、$B = \Theta (n)$回の反復のみを必要とする改良版を提案します。さらに、IJ推定子では、所定の精度を達成するために、ジャックナイフよりも1.7倍少ないブートストラップ反復が必要であることを示します。最後に、ジャックナイフとIJ分散推定値自体のサンプリング分布を調べます。複数の実験とシミュレーション研究で調査結果を説明します。

The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo
No-U-Turn Sampler:ハミルトニアンモンテカルロのパス長を適応的に設定

Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlated parameters that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However, HMC’s performance is highly sensitive to two user-specified parameters: a step size $\epsilon$ and a desired number of steps $L$. In particular, if $L$ is too small then the algorithm exhibits undesirable random walk behavior, while if $L$ is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps $L$. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS performs at least as efficiently as (and sometimes more efficiently than) a well tuned standard HMC method, without requiring user intervention or costly tuning runs. We also derive a method for adapting the step size parameter $\epsilon$ on the fly based on primal-dual averaging. NUTS can thus be used with no hand-tuning at all, making it suitable for applications such as BUGS-style automatic inference engines that require efficient âturnkeyâ samplers.

ハミルトニアンモンテカルロ(HMC)は、マルコフ連鎖モンテカルロ(MCMC)アルゴリズムであり、一次勾配情報に基づく一連の手順を実行することで、多くのMCMC法を悩ませるランダムウォーク動作と相関パラメータに対する感度を回避します。これらの機能により、ランダムウォークメトロポリスやギブスサンプリングなどのより単純な方法よりもはるかに速く高次元ターゲット分布に収束できます。ただし、HMCのパフォーマンスは、ステップサイズ$\epsilon$と必要なステップ数$L$という2つのユーザー指定パラメータに非常に敏感です。特に、$L$が小さすぎるとアルゴリズムは望ましくないランダムウォーク動作を示し、$L$が大きすぎるとアルゴリズムは計算を無駄にします。ここでは、ステップ数$L$を設定する必要がないHMCの拡張機能であるNo-U-Turn Sampler (NUTS)を紹介します。NUTSは再帰アルゴリズムを使用して、ターゲット分布の広い範囲に及ぶ可能性のある候補ポイントのセットを作成し、折り返して手順をたどり始めると自動的に停止します。経験的に、NUTSは、ユーザーの介入やコストのかかるチューニング実行を必要とせずに、適切にチューニングされた標準HMC方法と少なくとも同等の効率(場合によってはそれ以上の効率)で実行されます。また、プライマルデュアル平均化に基づいて、ステップサイズパラメーター$\epsilon$をオンザフライで適応させる方法も導き出しました。したがって、NUTSは手動チューニングなしで使用できるため、効率的な「ターンキー」サンプラーを必要とするBUGSスタイルの自動推論エンジンなどのアプリケーションに適しています。

High-Dimensional Covariance Decomposition into Sparse Markov and Independence Models
スパースマルコフモデルと独立モデルへの高次元共分散分解

Fitting high-dimensional data involves a delicate tradeoff between faithful representation and the use of sparse models. Too often, sparsity assumptions on the fitted model are too restrictive to provide a faithful representation of the observed data. In this paper, we present a novel framework incorporating sparsity in different domains. We decompose the observed covariance matrix into a sparse Gaussian Markov model (with a sparse precision matrix) and a sparse independence model (with a sparse covariance matrix). Our framework incorporates sparse covariance and sparse precision estimation as special cases and thus introduces a richer class of high-dimensional models. %We posit the observed data as generated from a linear combination of a sparse Gaussian Markov model (with a sparse precision matrix) and a sparse Gaussian independence model (with a sparse covariance matrix). We characterize sufficient conditions for identifiability of the two models, viz., Markov and independence models. We propose an efficient decomposition method based on a modification of the popular $\ell_1$-penalized maximum- likelihood estimator ($\ell_1$-MLE). We establish that our estimator is consistent in both the domains, i.e., it successfully recovers the supports of both Markov and independence models, when the number of samples $n$ scales as $n = \Omega(d^2 \log p)$, where $p$ is the number of variables and $d$ is the maximum node degree in the Markov model. Our experiments validate these results and also demonstrate that our models have better inference accuracy under simple algorithms such as loopy belief propagation.

高次元データのフィッティングには、忠実な表現とスパースモデルの使用との間の微妙なトレードオフが伴います。多くの場合、フィッティングモデルに対するスパース性の仮定は、観測データの忠実な表現を提供するには制限が厳しすぎます。この論文では、さまざまなドメインにスパース性を組み込んだ新しいフレームワークを紹介します。観測された共分散行列を、スパースガウスマルコフモデル(スパース精度行列を使用)とスパース独立モデル(スパース共分散行列を使用)に分解します。このフレームワークは、スパース共分散とスパース精度推定を特殊なケースとして組み込んでおり、より豊富な高次元モデルを導入しています。%観測データは、スパースガウスマルコフモデル(スパース精度行列を使用)とスパースガウス独立モデル(スパース共分散行列を使用)の線形結合から生成されるものと仮定します。2つのモデル、つまりマルコフモデルと独立モデルの識別可能性に関する十分な条件を特徴付けます。私たちは、一般的な$\ell_1$ペナルティ付き最尤推定量($\ell_1$-MLE)の修正に基づく効率的な分解法を提案します。私たちは、我々の推定量が両方のドメインで一貫していること、すなわち、サンプル数$n$が$n = \Omega(d^2 \log p)$に比例する場合にマルコフモデルと独立モデルの両方のサポートを正常に回復できることを確立した。ここで、$p$は変数の数、$d$はマルコフモデルの最大ノード次数です。我々の実験はこれらの結果を検証し、また、我々のモデルがループビリーフプロパゲーションなどの単純なアルゴリズムの下でより優れた推論精度を持つことを実証した。

Iteration Complexity of Feasible Descent Methods for Convex Optimization
凸最適化のための実行可能降下法の反復計算量

In many machine learning problems such as the dual form of SVM, the objective function to be minimized is convex but not strongly convex. This fact causes difficulties in obtaining the complexity of some commonly used optimization algorithms. In this paper, we proved the global linear convergence on a wide range of algorithms when they are applied to some non-strongly convex problems. In particular, we are the first to prove $O(\log(1/\epsilon))$ time complexity of cyclic coordinate descent methods on dual problems of support vector classification and regression.

SVMの双対形式など、多くの機械学習問題では、最小化する目的関数は凸型ですが、強凸型ではありません。この事実は、一般的に使用されるいくつかの最適化アルゴリズムの複雑さを取得するのを困難にします。この論文では、一部の非強凸問題に適用された場合に、さまざまなアルゴリズムでグローバル線形収束を証明しました。特に、私たちは、支持ベクトルの分類と回帰の双対問題において、巡回座標降下法の$O(log(1/epsilon))$時間計算量を最初に証明しました。

Locally Adaptive Factor Processes for Multivariate Time Series
多変量時系列の局所適応因子プロセス

In modeling multivariate time series, it is important to allow time-varying smoothness in the mean and covariance process. In particular, there may be certain time intervals exhibiting rapid changes and others in which changes are slow. If such time- varying smoothness is not accounted for, one can obtain misleading inferences and predictions, with over-smoothing across erratic time intervals and under-smoothing across times exhibiting slow variation. This can lead to mis-calibration of predictive intervals, which can be substantially too narrow or wide depending on the time. We propose a locally adaptive factor process for characterizing multivariate mean-covariance changes in continuous time, allowing locally varying smoothness in both the mean and covariance matrix. This process is constructed utilizing latent dictionary functions evolving in time through nested Gaussian processes and linearly related to the observed data with a sparse mapping. Using a differential equation representation, we bypass usual computational bottlenecks in obtaining MCMC and online algorithms for approximate Bayesian inference. The performance is assessed in simulations and illustrated in a financial application.

多変量時系列をモデル化する場合、平均および共分散プロセスで時間とともに変化する滑らかさを許容することが重要です。特に、急速な変化を示す特定の時間間隔と、変化が緩やかな他の時間間隔が存在する可能性があります。このような時間とともに変化する滑らかさを考慮しないと、不規則な時間間隔にわたって過度に平滑化され、変化が緩やかな時間にわたって平滑化が不十分になり、誤った推論や予測が得られる可能性があります。これにより、予測間隔の誤った較正につながる可能性があり、予測間隔は時間に応じて大幅に狭すぎたり広すぎたりする可能性があります。私たちは、平均行列と共分散行列の両方で局所的に変化する滑らかさを許容し、連続時間における多変量平均共分散の変化を特徴付けるための局所適応因子プロセスを提案します。このプロセスは、ネストされたガウス過程を通じて時間とともに進化し、スパースマッピングで観測データに線形に関連付けられる潜在辞書関数を使用して構築されます。微分方程式表現を使用して、MCMCおよび近似ベイズ推論のオンラインアルゴリズムを取得する際の通常の計算ボトルネックを回避します。パフォーマンスはシミュレーションで評価され、金融アプリケーションで説明されます。

Training Highly Multiclass Classifiers
高度に多クラス分類器のトレーニング

Classification problems with thousands or more classes often have a large range of class-confusabilities, and we show that the more-confusable classes add more noise to the empirical loss that is minimized during training. We propose an online solution that reduces the effect of highly confusable classes in training the classifier parameters, and focuses the training on pairs of classes that are easier to differentiate at any given time in the training. We also show that the adagrad method, recently proposed for automatically decreasing step sizes for convex stochastic gradient descent, can also be profitably applied to the nonconvex joint training of supervised dimensionality reduction and linear classifiers as done in Wsabie. Experiments on ImageNet benchmark data sets and proprietary image recognition problems with 15,000 to 97,000 classes show substantial gains in classification accuracy compared to one-vs- all linear SVMs and Wsabie.

数千以上のクラスを持つ分類問題には、多くの場合、クラスの混乱性の範囲が大きく、混乱しやすいクラスほど、学習中に最小限に抑えられる経験的損失により多くのノイズが追加されることを示します。分類器パラメーターのトレーニングにおける非常に混乱しやすいクラスの影響を軽減し、トレーニングの任意の時点で区別しやすいクラスのペアにトレーニングを集中させるオンラインソリューションを提案します。また、最近提案されたadagrad法は、凸確率的勾配降下法のステップサイズを自動的に縮小するために提案されており、Wsabieで行われているように、教師付き次元削減と線形分類器の非凸ジョイントトレーニングにも有益に適用できることを示しています。ImageNetベンチマークデータセットと15,000〜97,000クラスでの独自の画像認識問題での実験では、1対すべての線形SVMおよびWsabieと比較して、分類精度が大幅に向上していることが示されています。

Manopt, a Matlab Toolbox for Optimization on Manifolds
Manopt、多様体最適化のための Matlab ツールボックス

Optimization on manifolds is a rapidly developing branch of nonlinear optimization. Its focus is on problems where the smooth geometry of the search space can be leveraged to design efficient numerical algorithms. In particular, optimization on manifolds is well-suited to deal with rank and orthogonality constraints. Such structured constraints appear pervasively in machine learning applications, including low-rank matrix completion, sensor network localization, camera network registration, independent component analysis, metric learning, dimensionality reduction and so on. The Manopt toolbox, available at www.manopt.org, is a user-friendly, documented piece of software dedicated to simplify experimenting with state of the art Riemannian optimization algorithms. By dealing internally with most of the differential geometry, the package aims particularly at lowering the entrance barrier.

多様体の最適化は、非線形最適化の急速に発展している分野です。その焦点は、探索空間の滑らかな幾何学を利用して効率的な数値アルゴリズムを設計できる問題にあります。特に、多様体の最適化は、ランク制約と直交制約の処理に適しています。このような構造化された制約は、低ランクの行列補完、センサーネットワークのローカリゼーション、カメラネットワークのレジストレーション、独立コンポーネント分析、メトリック学習、次元削減など、機械学習アプリケーションに広く見られます。www.manopt.orgで入手可能なManoptツールボックスは、最先端のリーマン最適化アルゴリズムの実験を簡素化するための、ユーザーフレンドリーで文書化されたソフトウェアです。このパッケージは、ほとんどの差動ジオメトリを内部で処理することにより、特に入口の障壁を下げることを目的としています。

Adaptive Sampling for Large Scale Boosting
大規模ブースティングのための適応サンプリング

Classical boosting algorithms, such as AdaBoost, build a strong classifier without concern for the computational cost. Some applications, in particular in computer vision, may involve millions of training examples and very large feature spaces. In such contexts, the training time of off-the-shelf boosting algorithms may become prohibitive. Several methods exist to accelerate training, typically either by sampling the features or the examples used to train the weak learners. Even if some of these methods provide a guaranteed speed improvement, they offer no insurance of being more efficient than any other, given the same amount of time. The contributions of this paper are twofold: (1) a strategy to better deal with the increasingly common case where features come from multiple sources (for example, color, shape, texture, etc., in the case of images) and therefore can be partitioned into meaningful subsets; (2) new algorithms which balance at every boosting iteration the number of weak learners and the number of training examples to look at in order to maximize the expected loss reduction. Experiments in image classification and object recognition on four standard computer vision data sets show that the adaptive methods we propose outperform basic sampling and state-of-the-art bandit methods.

AdaBoostなどの従来のブースティングアルゴリズムは、計算コストを気にせずに強力な分類器を構築します。一部のアプリケーション、特にコンピュータービジョンでは、何百万ものトレーニングサンプルと非常に大きな特徴空間が必要になる場合があります。このような状況では、既製のブースティングアルゴリズムのトレーニング時間が法外になる場合があります。トレーニングを高速化する方法はいくつかありますが、通常は、弱学習器のトレーニングに使用される特徴またはサンプルをサンプリングします。これらの方法の一部が速度の向上を保証する場合でも、同じ時間で他の方法よりも効率的であるという保証はありません。この論文の貢献は2つあります。(1)特徴が複数のソース(画像の場合は色、形状、テクスチャなど)から取得され、意味のあるサブセットに分割できるという、ますます一般的になっているケースをより適切に処理する戦略。(2)ブースティングの反復ごとに弱学習器の数とトレーニングサンプルの数のバランスを取り、期待される損失の削減を最大化する新しいアルゴリズム。4つの標準的なコンピュータービジョンデータセットでの画像分類とオブジェクト認識の実験では、私たちが提案する適応型手法が、基本的なサンプリング法や最先端のバンディット法よりも優れていることが示されています。

Towards Ultrahigh Dimensional Feature Selection for Big Data
ビッグデータの超高次元特徴選択に向けて

In this paper, we present a new adaptive feature scaling scheme for ultrahigh-dimensional feature selection on Big Data, and then reformulate it as a convex semi-infinite programming (SIP) problem. To address the SIP, we propose an efficient feature generating paradigm. Different from traditional gradient-based approaches that conduct optimization on all input features, the proposed paradigm iteratively activates a group of features, and solves a sequence of multiple kernel learning (MKL) subproblems. To further speed up the training, we propose to solve the MKL subproblems in their primal forms through a modified accelerated proximal gradient approach. Due to such optimization scheme, some efficient cache techniques are also developed. The feature generating paradigm is guaranteed to converge globally under mild conditions, and can achieve lower feature selection bias. Moreover, the proposed method can tackle two challenging tasks in feature selection: 1) group-based feature selection with complex structures, and 2) nonlinear feature selection with explicit feature mappings. Comprehensive experiments on a wide range of synthetic and real-world data sets of tens of million data points with $O(10^{14})$ features demonstrate the competitive performance of the proposed method over state-of-the-art feature selection methods in terms of generalization performance and training efficiency.

この論文では、ビッグデータにおける超高次元特徴選択のための新しい適応型特徴スケーリング方式を提示し、それを凸半無限計画法(SIP)問題として再定式化します。SIPに対処するために、効率的な特徴生成パラダイムを提案します。すべての入力特徴に対して最適化を実行する従来の勾配ベースのアプローチとは異なり、提案されたパラダイムは、特徴のグループを反復的にアクティブ化し、一連の多重カーネル学習(MKL)サブ問題を解決します。トレーニングをさらに高速化するために、修正された加速近位勾配アプローチを使用して、MKLサブ問題をその基本形式で解決することを提案します。このような最適化方式により、いくつかの効率的なキャッシュ手法も開発されています。特徴生成パラダイムは、穏やかな条件下でグローバルに収束することが保証されており、特徴選択バイアスを低く抑えることができます。さらに、提案された方法は、特徴選択における1)複雑な構造を持つグループベースの特徴選択、および2)明示的な特徴マッピングによる非線形特徴選択という2つの困難なタスクに取り組むことができます。数千万のデータポイントとO(10^{14})の特徴を持つ広範囲の合成データセットと実世界のデータセットに対する包括的な実験により、提案された方法が、一般化性能とトレーニング効率の点で最先端の特徴選択方法よりも競争力のあるパフォーマンスであることが実証されました。

Fully Simplified Multivariate Normal Updates in Non-Conjugate Variational Message Passing
非共役変分メッセージパッシングにおける完全に簡略化された多変量正規更新

Fully simplified expressions for Multivariate Normal updates in non-conjugate variational message passing approximate inference schemes are obtained. The simplicity of these expressions means that the updates can be achieved very efficiently. Since the Multivariate Normal family is the most common for approximating the joint posterior density function of a continuous parameter vector, these fully simplified updates are of great practical benefit.

非共役変分メッセージパッシング近似推論スキームにおける多変量法線更新の完全に簡略化された式が得られます。これらの式が単純であるということは、更新を非常に効率的に実行できることを意味します。多変量法線ファミリーは、連続パラメーターベクトルのジョイント事後密度関数を近似するために最も一般的であるため、これらの完全に簡略化された更新は、実用上非常に役立ちます。

Structured Prediction via Output Space Search
出力空間検索による構造化予測

We consider a framework for structured prediction based on search in the space of complete structured outputs. Given a structured input, an output is produced by running a time- bounded search procedure guided by a learned cost function, and then returning the least cost output uncovered during the search. This framework can be instantiated for a wide range of search spaces and search procedures, and easily incorporates arbitrary structured-prediction loss functions. In this paper, we make two main technical contributions. First, we describe a novel approach to automatically defining an effective search space over structured outputs, which is able to leverage the availability of powerful classification learning algorithms. In particular, we define the limited-discrepancy search space and relate the quality of that space to the quality of learned classifiers. We also define a sparse version of the search space to improve the efficiency of our overall approach. Second, we give a generic cost function learning approach that is applicable to a wide range of search procedures. The key idea is to learn a cost function that attempts to mimic the behavior of conducting searches guided by the true loss function. Our experiments on six benchmark domains show that a small amount of search in limited discrepancy search space is often sufficient for significantly improving on state-of-the-art structured- prediction performance. We also demonstrate significant speed improvements for our approach using sparse search spaces with little or no loss in accuracy.

私たちは、完全な構造化出力の空間における検索に基づく構造化予測のフレームワークを検討します。構造化入力が与えられると、学習されたコスト関数によって導かれる時間制限付き検索手順を実行することによって出力が生成され、その後、検索中に発見された最小コストの出力が返されます。このフレームワークは、さまざまな検索空間と検索手順に対してインスタンス化でき、任意の構造化予測損失関数を簡単に組み込むことができます。この論文では、2つの主要な技術的貢献をします。まず、構造化出力に対して効果的な検索空間を自動的に定義する新しいアプローチについて説明します。このアプローチでは、強力な分類学習アルゴリズムの可用性を活用できます。特に、限定された不一致検索空間を定義し、その空間の品質を学習された分類器の品質に関連付けます。また、全体的なアプローチの効率を向上させるために、検索空間のスパースバージョンも定義します。次に、さまざまな検索手順に適用できる汎用的なコスト関数学習アプローチを示します。重要なアイデアは、真の損失関数によって導かれる検索を実行する動作を模倣しようとするコスト関数を学習することです。6つのベンチマークドメインでの実験では、限られた不一致検索空間での少量の検索で、最先端の構造化予測のパフォーマンスを大幅に向上できることが多いことが示されています。また、スパース検索空間を使用したアプローチでは、精度をほとんど損なうことなく、速度が大幅に向上することも実証されています。

Follow the Leader If You Can, Hedge If You Must
できるならリーダーに従い、必要ならヘッジする

Follow-the-Leader (FTL) is an intuitive sequential prediction strategy that guarantees constant regret in the stochastic setting, but has poor performance for worst-case data. Other hedging strategies have better worst-case guarantees but may perform much worse than FTL if the data are not maximally adversarial. We introduce the FlipFlop algorithm, which is the first method that provably combines the best of both worlds. As a stepping stone for our analysis, we develop AdaHedge, which is a new way of dynamically tuning the learning rate in Hedge without using the doubling trick. AdaHedge refines a method by Cesa-Bianchi, Mansour, and Stoltz (2007), yielding improved worst-case guarantees. By interleaving AdaHedge and FTL, FlipFlop achieves regret within a constant factor of the FTL regret, without sacrificing AdaHedge’s worst-case guarantees. AdaHedge and FlipFlop do not need to know the range of the losses in advance; moreover, unlike earlier methods, both have the intuitive property that the issued weights are invariant under rescaling and translation of the losses. The losses are also allowed to be negative, in which case they may be interpreted as gains.

Follow-the-Leader (FTL)は、確率的設定で一定の後悔を保証する直感的な順次予測戦略ですが、最悪のケースのデータではパフォーマンスが低くなります。他のヘッジ戦略は、最悪のケースの保証は優れていますが、データが最大限に敵対的でない場合は、FTLよりもはるかにパフォーマンスが悪くなる可能性があります。ここでは、両方の長所を組み合わせられることが証明された最初の方法であるFlipFlopアルゴリズムを紹介します。分析の足がかりとして、倍増トリックを使用せずにHedgeの学習率を動的に調整する新しい方法であるAdaHedgeを開発します。AdaHedgeは、Cesa-Bianchi、Mansour、およびStoltz (2007)による方法を改良し、最悪のケースの保証を改善します。AdaHedgeとFTLをインターリーブすることで、FlipFlopはAdaHedgeの最悪のケースの保証を犠牲にすることなく、FTLの後悔の定数倍以内の後悔を実現します。AdaHedgeとFlipFlopでは、損失の範囲を事前に知る必要がありません。さらに、以前の方法とは異なり、発行された重みは損失の再スケーリングと変換に対して不変であるという直感的な特性があります。損失は負の値になることも許可されており、その場合は利益として解釈される可能性があります。

Robust Near-Separable Nonnegative Matrix Factorization Using Linear Optimization
線形最適化を用いたロバストな近接分離可能な非負行列分解

Nonnegative matrix factorization (NMF) has been shown recently to be tractable under the separability assumption, under which all the columns of the input data matrix belong to the convex cone generated by only a few of these columns. Bittorf, Recht, RÃ© and Tropp (`Factoring nonnegative matrices with linear programs’, NIPS 2012) proposed a linear programming (LP) model, referred to as Hottopixx, which is robust under any small perturbation of the input matrix. However, Hottopixx has two important drawbacks: (i) the input matrix has to be normalized, and (ii) the factorization rank has to be known in advance. In this paper, we generalize Hottopixx in order to resolve these two drawbacks, that is, we propose a new LP model which does not require normalization and detects the factorization rank automatically. Moreover, the new LP model is more flexible, significantly more tolerant to noise, and can easily be adapted to handle outliers and other noise models. Finally, we show on several synthetic data sets that it outperforms Hottopixx while competing favorably with two state-of-the-art methods.

非負値行列因数分解(NMF)は、最近、分離可能性仮定の下で扱いやすいことが示されました。分離可能性仮定の下では、入力データ行列のすべての列は、これらの列のうちのいくつかの列によって生成される凸錐に属します。Bittorf、Recht、RÃ©、およびTropp (「線形プログラムによる非負値行列の因数分解」、NIPS 2012)は、入力行列の小さな摂動に対して堅牢な、Hottopixxと呼ばれる線形計画法(LP)モデルを提案しました。ただし、Hottopixxには2つの重要な欠点があります。(i)入力行列を正規化する必要がある、(ii)因数分解ランクを事前に知っておく必要があります。この論文では、これら2つの欠点を解決するためにHottopixxを一般化し、正規化を必要とせず、因数分解ランクを自動的に検出する新しいLPモデルを提案します。さらに、新しいLPモデルはより柔軟で、ノイズに対する耐性が大幅に高く、外れ値やその他のノイズモデルの処理に簡単に適応できます。最後に、いくつかの合成データセットで、この手法がHottopixxよりも優れており、2つの最先端の手法と競合できることを示します。

Bayesian Nonparametric Comorbidity Analysis of Psychiatric Disorders
精神疾患のベイズノンパラメトリック併存疾患分析

The analysis of comorbidity is an open and complex research field in the branch of psychiatry, where clinical experience and several studies suggest that the relation among the psychiatric disorders may have etiological and treatment implications. In this paper, we are interested in applying latent feature modeling to find the latent structure behind the psychiatric disorders that can help to examine and explain the relationships among them. To this end, we use the large amount of information collected in the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) database and propose to model these data using a nonparametric latent model based on the Indian Buffet Process (IBP). Due to the discrete nature of the data, we first need to adapt the observation model for discrete random variables. We propose a generative model in which the observations are drawn from a multinomial-logit distribution given the IBP matrix. The implementation of an efficient Gibbs sampler is accomplished using the Laplace approximation, which allows integrating out the weighting factors of the multinomial- logit likelihood model. We also provide a variational inference algorithm for this model, which provides a complementary (and less expensive in terms of computational complexity) alternative to the Gibbs sampler allowing us to deal with a larger number of data. Finally, we use the model to analyze comorbidity among the psychiatric disorders diagnosed by experts from the NESARC database.

併存疾患の分析は精神医学の分野における未解決かつ複雑な研究分野であり、臨床経験やいくつかの研究から、精神疾患間の関係が病因や治療に影響を与える可能性があることが示唆されています。この論文では、潜在特性モデリングを適用して精神疾患の背後にある潜在構造を見つけ、それらの関係を調べて説明することに関心があります。この目的のために、私たちはアルコールおよび関連疾患に関する全国疫学調査(NESARC)データベースで収集された大量の情報を使用し、インド・バフェット・プロセス(IBP)に基づくノンパラメトリック潜在モデルを使用してこれらのデータをモデル化することを提案します。データの離散的性質のため、まず観測モデルを離散ランダム変数に適合させる必要があります。私たちは、IBPマトリックスが与えられた多項ロジット分布から観測値を抽出する生成モデルを提案します。効率的なギブスサンプラーの実装は、ラプラス近似を使用して行われます。これにより、多項ロジット尤度モデルの重み係数を統合できます。また、このモデルには変分推論アルゴリズムも用意されています。これは、ギブスサンプラーの補完的な代替手段(計算の複雑さの点でより安価)を提供し、大量のデータを処理できるようにします。最後に、このモデルを使用して、NESARCデータベースの専門家によって診断された精神疾患の併存疾患を分析します。

Prediction and Clustering in Signed Networks: A Local to Global Perspective
符号付きネットワークにおける予測とクラスタリング:ローカルからグローバルへの視点

The study of social networks is a burgeoning research area. However, most existing work is on networks that simply encode whether relationships exist or not. In contrast, relationships in signed networks can be positive (âlike”, âtrust”) or negative (âdislike”, âdistrust”). The theory of social balance shows that signed networks tend to conform to some local patterns that, in turn, induce certain global characteristics. In this paper, we exploit both local as well as global aspects of social balance theory for two fundamental problems in the analysis of signed networks: sign prediction and clustering. Local patterns of social balance have been used in the past for sign prediction. We define more general measures of social imbalance (MOIs) based on $\ell$-cycles in the network and give a simple sign prediction rule. Interestingly, by examining measures of social imbalance, we show that the classic Katz measure, which is used widely in unsigned link prediction, also has a balance theoretic interpretation when applied to signed networks. Motivated by the global structure of balanced networks, we propose an effective low rank modeling approach for both sign prediction and clustering. We provide theoretical performance guarantees for our low-rank matrix completion approach via convex relaxations, scale it up to large problem sizes using a matrix factorization based algorithm, and provide extensive experimental validation including comparisons with local approaches. Our experimental results indicate that, by adopting a more global viewpoint of social balance, we get significant performance and computational gains in prediction and clustering tasks on signed networks. Our work therefore highlights the usefulness of the global aspect of balance theory for the analysis of signed networks.

ソーシャルネットワークの研究は急成長中の研究分野です。しかし、既存の研究のほとんどは、関係が存在するかどうかを単純にエンコードするネットワークに関するものです。対照的に、符号付きネットワークの関係は、肯定的(「好き」、「信頼」)または否定的(「嫌い」、「不信」)になります。ソーシャルバランスの理論は、符号付きネットワークがいくつかのローカルパターンに従う傾向があり、それが特定のグローバル特性を誘発することを示しています。この論文では、ソーシャルバランス理論のローカルとグローバルの両方の側面を、符号付きネットワークの分析における2つの基本的な問題(符号予測とクラスタリング)に利用します。ソーシャルバランスのローカルパターンは、過去に符号予測に使用されてきました。ネットワーク内の$\ell$サイクルに基づいて、より一般的なソーシャルアンバランスの尺度(MOI)を定義し、簡単な符号予測ルールを示します。興味深いことに、ソーシャルアンバランスの尺度を調べることで、符号なしリンク予測で広く使用されている古典的なKatz尺度も、符号付きネットワークに適用するとバランス理論の解釈が得られることを示しています。バランスの取れたネットワークのグローバル構造に着目し、符号予測とクラスタリングの両方に効果的な低ランクモデリングアプローチを提案します。凸緩和法による低ランク行列補完アプローチの理論的なパフォーマンス保証を提供し、行列分解ベースのアルゴリズムを使用して大規模な問題サイズにスケールアップし、ローカルアプローチとの比較を含む広範な実験検証を提供します。実験結果によると、社会的バランスのよりグローバルな観点を採用することで、符号付きネットワークでの予測およびクラスタリングタスクで大幅なパフォーマンスと計算上の向上が得られます。したがって、私たちの研究は、符号付きネットワークの分析におけるバランス理論のグローバルな側面の有用性を強調しています。

New Learning Methods for Supervised and Unsupervised Preference Aggregation
教師ありおよび教師なしの選好集約のための新しい学習方法

In this paper we present a general treatment of the preference aggregation problem, in which multiple preferences over objects must be combined into a single consensus ranking. We consider two instances of this problem: unsupervised aggregation where no information about a target ranking is available, and supervised aggregation where ground truth preferences are provided. For each problem class we develop novel learning methods that are applicable to a wide range of preference types. (The code for all models introduced in this paper is available at www.cs.toronto.edu/~mvolkovs.) Specifically, for unsupervised aggregation we introduce the Multinomial Preference model (MPM) which uses a multinomial generative process to model the observed preferences. For the supervised problem we develop a supervised extension for MPM and then propose two fully supervised models. The first model employs SVD factorization to derive effective item features, transforming the aggregation problems into a learning-to-rank one. The second model aims to eliminate the costly SVD factorization and instantiates a probabilistic CRF framework, deriving unary and pairwise potentials directly from the observed preferences. Using a probabilistic framework allows us to directly optimize the expectation of any target metric, such as NDCG or ERR. All the proposed models operate on pairwise preferences and can thus be applied to a wide range of preference types. We empirically validate the models on rank aggregation and collaborative filtering data sets and demonstrate superior empirical accuracy.

この論文では、オブジェクトに対する複数の嗜好を1つのコンセンサスランキングに組み合わせる必要がある嗜好集約問題の一般的な処理方法を示します。この問題の2つの例、つまりターゲットランキングに関する情報が利用できない教師なし集約と、真の嗜好が提供される教師あり集約を検討します。問題の各クラスについて、幅広い嗜好タイプに適用できる新しい学習方法を開発します。(本論文で紹介するすべてのモデルのコードは、www.cs.toronto.edu/~mvolkovsで入手できます。)具体的には、教師なし集約に対して、多項式生成プロセスを使用して観測された嗜好をモデル化する多項式嗜好モデル(MPM)を導入します。教師あり問題に対しては、MPMの教師あり拡張を開発し、次に2つの完全教師ありモデルを提案します。最初のモデルは、SVD因数分解を使用して効果的なアイテム機能を導出し、集約問題をランク付け学習問題に変換します。2番目のモデルは、コストのかかるSVD分解を排除し、確率的CRFフレームワークをインスタンス化して、観測された選好から直接単項およびペアワイズポテンシャルを導出することを目指しています。確率的フレームワークを使用すると、NDCGやERRなどのターゲットメトリックの期待値を直接最適化できます。提案されたモデルはすべてペアワイズ選好に基づいて動作するため、幅広い選好タイプに適用できます。ランク集約および協調フィルタリングデータセットでモデルを実証的に検証し、優れた実験精度を実証しました。

A Reliable Effective Terascale Linear Learning System
信頼性の高い効果的なテラスケール線形学習システム

We present a system and a set of techniques for learning linear predictors with convex losses on terascale data sets, with trillions of features, (The number of features here refers to the number of non-zero entries in the data matrix.) billions of training examples and millions of parameters in an hour using a cluster of 1000 machines. Individually none of the component techniques are new, but the careful synthesis required to obtain an efficient implementation is. The result is, up to our knowledge, the most scalable and efficient linear learning system reported in the literature. (All the empirical evaluation reported in this work was carried out between May-Oct 2011.) We describe and thoroughly evaluate the components of the system, showing the importance of the various design choices.

私たちは、テラスケールのデータセットで凸損失を持つ線形予測子を学習するためのシステムと一連の手法を提示します。これには、数兆の特徴(ここでの特徴の数は、データ行列内のゼロ以外のエントリの数を指します) 1000台のマシンのクラスターを使用して、1時間で数十億のトレーニング例と数百万のパラメーターがあります。個々のコンポーネント手法はどれも新しいものではありませんが、効率的な実装を得るために必要な慎重な合成は新しいものです。その結果、私たちの知る限り、文献で報告されている最もスケーラブルで効率的な線形学習システムが生まれました。(この研究で報告されたすべての実証的評価は、2011年5月から10月の間に実施されました。システムのコンポーネントを説明し、徹底的に評価し、さまざまな設計選択の重要性を示します。

Gibbs Max-margin Topic Models with Data Augmentation
データ拡張を伴うギブス最大マージントピックモデル

Max-margin learning is a powerful approach to building classifiers and structured output predictors. Recent work on max-margin supervised topic models has successfully integrated it with Bayesian topic models to discover discriminative latent semantic structures and make accurate predictions for unseen testing data. However, the resulting learning problems are usually hard to solve because of the non-smoothness of the margin loss. Existing approaches to building max-margin supervised topic models rely on an iterative procedure to solve multiple latent SVM subproblems with additional mean-field assumptions on the desired posterior distributions. This paper presents an alternative approach by defining a new max-margin loss. Namely, we present Gibbs max-margin supervised topic models, a latent variable Gibbs classifier to discover hidden topic representations for various tasks, including classification, regression and multi-task learning. Gibbs max- margin supervised topic models minimize an expected margin loss, which is an upper bound of the existing margin loss derived from an expected prediction rule. By introducing augmented variables and integrating out the Dirichlet variables analytically by conjugacy, we develop simple Gibbs sampling algorithms with no restrictive assumptions and no need to solve SVM subproblems. Furthermore, each step of the âaugment-and-collapse” Gibbs sampling algorithms has an analytical conditional distribution, from which samples can be easily drawn. Experimental results on several medium-sized and large-scale data sets demonstrate significant improvements on time efficiency. The classification performance is also improved over competitors on binary, multi- class and multi-label classification tasks.

最大マージン学習は、分類器と構造化された出力予測子を構築するための強力なアプローチです。最大マージン教師ありトピックモデルに関する最近の研究では、ベイズトピックモデルと統合して、識別的な潜在的意味構造を発見し、未知のテストデータに対して正確な予測を行うことに成功しました。しかし、結果として生じる学習問題は、マージン損失が滑らかでないことから、通常は解決が困難です。最大マージン教師ありトピックモデルを構築するための既存のアプローチは、望ましい事後分布に関する平均場仮定を追加して、複数の潜在的なSVMサブ問題を解決するための反復手順に依存しています。この論文では、新しい最大マージン損失を定義することにより、代替アプローチを提示します。つまり、分類、回帰、マルチタスク学習など、さまざまなタスクの隠れたトピック表現を発見するための潜在変数ギブス分類器であるギブス最大マージン教師ありトピックモデルを提示します。ギブス最大マージン教師ありトピックモデルは、予想されるマージン損失を最小化します。これは、予想される予測ルールから導出された既存のマージン損失の上限です。拡張変数を導入し、共役性によってディリクレ変数を解析的に積分することで、制限的な仮定がなく、SVMサブ問題を解決する必要のない、シンプルなギブスサンプリングアルゴリズムを開発しました。さらに、「拡張と縮小」ギブスサンプリングアルゴリズムの各ステップには解析的な条件付き分布があり、そこからサンプルを簡単に抽出できます。いくつかの中規模および大規模データセットでの実験結果では、時間効率が大幅に向上していることが示されています。バイナリ、マルチクラス、マルチラベルの分類タスクでは、競合他社よりも分類パフォーマンスが向上しています。

Improving Prediction from Dirichlet Process Mixtures via Enrichment
濃縮によるディリクレプロセス混合からの予測の改善

Flexible covariate-dependent density estimation can be achieved by modelling the joint density of the response and covariates as a Dirichlet process mixture. An appealing aspect of this approach is that computations are relatively easy. In this paper, we examine the predictive performance of these models with an increasing number of covariates. Even for a moderate number of covariates, we find that the likelihood for $x$ tends to dominate the posterior of the latent random partition, degrading the predictive performance of the model. To overcome this, we suggest using a different nonparametric prior, namely an enriched Dirichlet process. Our proposal maintains a simple allocation rule, so that computations remain relatively simple. Advantages are shown through both predictive equations and examples, including an application to diagnosis Alzheimer’s disease.

柔軟な共変量依存密度推定は、応答と共変量の同時密度をディリクレ過程混合物としてモデル化することで実現できます。このアプローチの魅力的な側面は、計算が比較的簡単であることです。この論文では、共変量の数が増えているこれらのモデルの予測性能を調べます。中程度の数の共変量であっても、$x$の尤度が潜在ランダム分割の後部を支配する傾向があり、モデルの予測性能を低下させることがわかります。これを克服するには、別のノンパラメトリック事前分布、つまり濃縮ディリクレ過程を使用することをお勧めします。私たちの提案では、計算が比較的単純なままになるように、単純な割り当てルールが維持されます。利点は、アルツハイマー病の診断への応用を含む、予測方程式と例の両方を通じて示されています。

Ellipsoidal Rounding for Nonnegative Matrix Factorization Under Noisy Separability
ノイズ分離性下での非負行列因数分解のための楕円体丸め

We present a numerical algorithm for nonnegative matrix factorization (NMF) problems under noisy separability. An NMF problem under separability can be stated as one of finding all vertices of the convex hull of data points. The research interest of this paper is to find the vectors as close to the vertices as possible in a situation in which noise is added to the data points. Our algorithm is designed to capture the shape of the convex hull of data points by using its enclosing ellipsoid. We show that the algorithm has correctness and robustness properties from theoretical and practical perspectives; correctness here means that if the data points do not contain any noise, the algorithm can find the vertices of their convex hull; robustness means that if the data points contain noise, the algorithm can find the near-vertices. Finally, we apply the algorithm to document clustering, and report the experimental results.

私たちは、ノイズの多い分離性の下での非負行列分解(NMF)問題の数値アルゴリズムを提示します。分離可能性の下でのNMFの問題は、データポイントの凸包のすべての頂点を見つけることの1つとして説明できます。この論文の研究の関心は、データポイントにノイズが追加される状況で、ベクトルを頂点にできるだけ近づけることを見つけることです。私たちのアルゴリズムは、データポイントの凸包の形状を、それを囲む楕円体を使用してキャプチャするように設計されています。アルゴリズムが正しさとロバスト性の特性を持っていることを、理論的および実用的な観点から示します。ここでの正確性とは、データポイントにノイズが含まれていない場合、アルゴリズムはそれらの凸包の頂点を見つけることができることを意味します。ロバスト性とは、データポイントにノイズが含まれている場合、アルゴリズムが近接頂点を見つけることができることを意味します。最後に、このアルゴリズムをドキュメントクラスタリングに適用し、実験結果を報告します。

Conditional Random Field with High-order Dependencies for Sequence Labeling and Segmentation
シーケンスラベリングとセグメンテーションのための高次依存関係を持つ条件付きランダムフィールド

Dependencies among neighboring labels in a sequence are important sources of information for sequence labeling and segmentation. However, only first-order dependencies, which are dependencies between adjacent labels or segments, are commonly exploited in practice because of the high computational complexity of typical inference algorithms when longer distance dependencies are taken into account. In this paper, we give efficient inference algorithms to handle high-order dependencies between labels or segments in conditional random fields, under the assumption that the number of distinct label patterns used in the features is small. This leads to efficient learning algorithms for these conditional random fields. We show experimentally that exploiting high-order dependencies can lead to substantial performance improvements for some problems, and we discuss conditions under which high-order features can be effective.

シーケンス内の隣接するラベル間の依存関係は、シーケンスのラベル付けとセグメンテーションの重要な情報源です。ただし、実際には、隣接するラベルまたはセグメント間の依存関係である一次依存関係のみが一般的に利用されます。これは、より長い距離の依存関係を考慮に入れると、一般的な推論アルゴリズムの計算が複雑になるためです。この論文では、特徴で使用される個別のラベルパターンの数が少ないという仮定の下で、条件付きランダムフィールド内のラベルまたはセグメント間の高次依存関係を処理するための効率的な推論アルゴリズムを提供します。これにより、これらの条件付きランダムフィールドの効率的な学習アルゴリズムが得られます。高次の依存関係を利用すると、一部の問題に対してパフォーマンスが大幅に向上することを実験的に示し、高次の機能が効果的になる条件について説明します。

Natural Evolution Strategies
自然進化戦略

This paper presents Natural Evolution Strategies (NES), a recent family of black-box optimization algorithms that use the natural gradient to update a parameterized search distribution in the direction of higher expected fitness. We introduce a collection of techniques that address issues of convergence, robustness, sample complexity, computational complexity and sensitivity to hyperparameters. This paper explores a number of implementations of the NES family, such as general-purpose multi-variate normal distributions and separable distributions tailored towards search in high dimensional spaces. Experimental results show best published performance on various standard benchmarks, as well as competitive performance on others.

この論文では、自然進化戦略(NES)という、自然勾配を使用してパラメータ化された検索分布を期待される適応度が高い方向に更新するブラックボックス最適化アルゴリズムの最近のファミリーを紹介します。収束性、ロバスト性、サンプルの複雑さ、計算の複雑さ、ハイパーパラメータに対する感度の問題に対処する一連の手法を紹介します。この論文では、汎用の多変量正規分布や高次元空間での検索に合わせた分離可能な分布など、NESファミリの多くの実装について説明します。実験結果は、さまざまな標準ベンチマークで公開された最高のパフォーマンスと、他のベンチマークでの競争力のあるパフォーマンスを示しています。

An Extension of Slow Feature Analysis for Nonlinear Blind Source Separation
非線形ブラインドソース分離のための低速特徴解析の拡張

We present and test an extension of slow feature analysis as a novel approach to nonlinear blind source separation. The algorithm relies on temporal correlations and iteratively reconstructs a set of statistically independent sources from arbitrary nonlinear instantaneous mixtures. Simulations show that it is able to invert a complicated nonlinear mixture of two audio signals with a high reliability. The algorithm is based on a mathematical analysis of slow feature analysis for the case of input data that are generated from statistically independent sources.

私たちは、非線形ブラインドソース分離への新しいアプローチとして、低速特徴解析の拡張を提示し、テストします。このアルゴリズムは、時間的相関に依存し、任意の非線形瞬間的な混合物から統計的に独立したソースのセットを反復的に再構築します。シミュレーションによると、2つのオーディオ信号の複雑な非線形混合物を高い信頼性で反転できることが示されています。このアルゴリズムは、統計的に独立したソースから生成された入力データの場合の低速特徴分析の数学的分析に基づいています。

Active Learning Using Smooth Relative Regret Approximations with Applications
滑らかな相対リグレット近似を用いた能動学習とその応用

The disagreement coefficient of Hanneke has become a central data independent invariant in proving active learning rates. It has been shown in various ways that a concept class with low complexity together with a bound on the disagreement coefficient at an optimal solution allows active learning rates that are superior to passive learning ones. We present a different tool for pool based active learning which follows from the existence of a certain uniform version of low disagreement coefficient, but is not equivalent to it. In fact, we present two fundamental active learning problems of significant interest for which our approach allows nontrivial active learning bounds. However, any general purpose method relying on the disagreement coefficient bounds only, fails to guarantee any useful bounds for these problems. The applications of interest are: Learning to rank from pairwise preferences, and clustering with side information (a.k.a. semi-supervised clustering). The tool we use is based on the learner’s ability to compute an estimator of the difference between the loss of any hypothesis and some fixed âpivotalâ hypothesis to within an absolute error of at most $\epsilon$ times the disagreement measure ($\ell_1$ distance) between the two hypotheses. We prove that such an estimator implies the existence of a learning algorithm which, at each iteration, reduces its in-class excess risk to within a constant factor. Each iteration replaces the current pivotal hypothesis with the minimizer of the estimated loss difference function with respect to the previous pivotal hypothesis. The label complexity essentially becomes that of computing this estimator.

Hannekeの不一致係数は、能動学習率を証明する上で、データに依存しない不変量の中心的な要素となっています。複雑度の低い概念クラスと最適解における不一致係数の境界を組み合わせると、受動学習よりも優れた能動学習率が得られることがさまざまな方法で示されています。ここでは、低い不一致係数の特定の均一バージョンの存在から導かれるが、それと同等ではない、プールベースの能動学習用の別のツールを紹介します。実際、私たちは、私たちのアプローチによって重要な能動学習の境界が得られる、非常に興味深い2つの基本的な能動学習問題を提示します。ただし、不一致係数の境界のみに依存する汎用方法では、これらの問題に対して有用な境界を保証することはできません。興味深いアプリケーションは、ペアワイズプリファレンスからのランク付け学習と、サイド情報によるクラスタリング(別名、半教師ありクラスタリング)です。私たちが使用するツールは、任意の仮説といくつかの固定された「重要な」仮説の損失の差の推定値を、2つの仮説間の不一致の尺度($\ell_1$距離)の最大$\epsilon$倍の絶対誤差内で計算する学習者の能力に基づいています。このような推定値は、各反復でクラス内過剰リスクを定数係数以内に削減する学習アルゴリズムの存在を意味することを証明します。各反復では、現在の重要な仮説を、前の重要な仮説に関する推定損失差関数の最小値に置き換えます。ラベルの複雑さは、基本的にこの推定値を計算する複雑さになります。

Policy Evaluation with Temporal Differences: A Survey and Comparison
時間差分による政策評価:調査と比較

Policy evaluation is an essential step in most reinforcement learning approaches. It yields a value function, the quality assessment of states for a given policy, which can be used in a policy improvement step. Since the late 1980s, this research area has been dominated by temporal-difference (TD) methods due to their data-efficiency. However, core issues such as stability guarantees in the off-policy scenario, improved sample efficiency and probabilistic treatment of the uncertainty in the estimates have only been tackled recently, which has led to a large number of new approaches. This paper aims at making these new developments accessible in a concise overview, with foci on underlying cost functions, the off-policy scenario as well as on regularization in high dimensional feature spaces. By presenting the first extensive, systematic comparative evaluations comparing TD, LSTD, LSPE, FPKF, the residual- gradient algorithm, Bellman residual minimization, GTD, GTD2 and TDC, we shed light on the strengths and weaknesses of the methods. Moreover, we present alternative versions of LSTD and LSPE with drastically improved off-policy performance.

ポリシー評価は、ほとんどの強化学習アプローチにおいて不可欠なステップです。これにより、特定のポリシーの状態の品質評価である価値関数が得られ、ポリシー改善ステップで使用できます。1980年代後半以降、この研究分野は、データ効率の点から、時間差分(TD)法が主流でした。しかし、オフポリシーシナリオでの安定性の保証、サンプル効率の向上、推定値の不確実性の確率的処理などの中核的な問題に取り組むようになったのはごく最近のことであり、多数の新しいアプローチが生まれました。この論文では、基礎となるコスト関数、オフポリシーシナリオ、高次元の特徴空間での正則化に焦点を当て、これらの新しい開発を簡潔な概要でわかりやすく説明することを目的としています。TD、LSTD、LSPE、FPKF、残差勾配アルゴリズム、ベルマン残差最小化、GTD、GTD2、TDCを比較した初めての広範かつ体系的な比較評価を提示することで、これらの方法の長所と短所を明らかにします。さらに、オフポリシーパフォーマンスが大幅に改善されたLSTDとLSPEの代替バージョンを紹介します。

A Novel M-Estimator for Robust PCA
ロバストPCAのための新しいM推定量

We study the basic problem of robust subspace recovery. That is, we assume a data set that some of its points are sampled around a fixed subspace and the rest of them are spread in the whole ambient space, and we aim to recover the fixed underlying subspace. We first estimate ârobust inverse sample covarianceâ by solving a convex minimization procedure; we then recover the subspace by the bottom eigenvectors of this matrix (their number correspond to the number of eigenvalues close to 0). We guarantee exact subspace recovery under some conditions on the underlying data. Furthermore, we propose a fast iterative algorithm, which linearly converges to the matrix minimizing the convex problem. We also quantify the effect of noise and regularization and discuss many other practical and theoretical issues for improving the subspace recovery in various settings. When replacing the sum of terms in the convex energy function (that we minimize) with the sum of squares of terms, we obtain that the new minimizer is a scaled version of the inverse sample covariance (when exists). We thus interpret our minimizer and its subspace (spanned by its bottom eigenvectors) as robust versions of the empirical inverse covariance and the PCA subspace respectively. We compare our method with many other algorithms for robust PCA on synthetic and real data sets and demonstrate state-of-the-art speed and accuracy.

私たちは、ロバストなサブスペース回復の基本問題を研究します。つまり、データセットのいくつかのポイントが固定されたサブスペースの周囲にサンプリングされ、残りのポイントが周囲の空間全体に広がっているという前提で、固定された基底サブスペースを回復することを目指します。まず、凸最小化手順を解くことによって「ロバストな逆サンプル共分散」を推定します。次に、この行列の底部の固有ベクトル(それらの数は、0に近い固有値の数に対応します)によってサブスペースを回復します。基底データに対するいくつかの条件下で、正確なサブスペース回復を保証します。さらに、凸問題を最小化する行列に線形収束する高速反復アルゴリズムを提案します。また、ノイズと正則化の影響を定量化し、さまざまな設定でサブスペース回復を改善するための他の多くの実際的および理論的な問題についても説明します。凸エネルギー関数（最小化する）の項の合計を項の二乗の合計に置き換えると、新しい最小化関数は逆サンプル共分散（存在する場合）のスケールバージョンであることがわかります。したがって、最小化関数とそのサブスペース（その下部の固有ベクトルによって張られる）を、それぞれ経験的逆共分散とPCAサブスペースの堅牢なバージョンとして解釈します。合成データセットと実際のデータセットで堅牢なPCAを行う他の多くのアルゴリズムとこのメソッドを比較し、最先端の速度と精度を実証します。

Clustering Hidden Markov Models with Variational HEM
変分HEMによる隠れマルコフモデルのクラスタリング

The hidden Markov model (HMM) is a widely-used generative model that copes with sequential data, assuming that each observation is conditioned on the state of a hidden Markov chain. In this paper, we derive a novel algorithm to cluster HMMs based on the hierarchical EM (HEM) algorithm. The proposed algorithm i) clusters a given collection of HMMs into groups of HMMs that are similar, in terms of the distributions they represent, and ii) characterizes each group by a âcluster centerâ, that is, a novel HMM that is representative for the group, in a manner that is consistent with the underlying generative model of the HMM. To cope with intractable inference in the E-step, the HEM algorithm is formulated as a variational optimization problem, and efficiently solved for the HMM case by leveraging an appropriate variational approximation. The benefits of the proposed algorithm, which we call variational HEM (VHEM), are demonstrated on several tasks involving time-series data, such as hierarchical clustering of motion capture sequences, and automatic annotation and retrieval of music and of online hand- writing data, showing improvements over current methods. In particular, our variational HEM algorithm effectively leverages large amounts of data when learning annotation models by using an efficient hierarchical estimation procedure, which reduces learning times and memory requirements, while improving model robustness through better regularization.

隠れマルコフモデル(HMM)は、各観測が隠れマルコフ連鎖の状態に基づいていると仮定して、連続データを処理する広く使用されている生成モデルです。この論文では、階層型EM (HEM)アルゴリズムに基づいてHMMをクラスタ化する新しいアルゴリズムを導出します。提案されたアルゴリズムは、i)指定されたHMMのコレクションを、それらが表す分布の観点から類似するHMMのグループにクラスタ化し、ii) HMMの基礎となる生成モデルと一致する方法で、各グループを「クラスタセンター」、つまりグループを代表する新しいHMMで特徴付けます。Eステップでの扱いにくい推論に対処するために、HEMアルゴリズムは変分最適化問題として定式化され、適切な変分近似を利用してHMMの場合に効率的に解決されます。提案アルゴリズム（変分HEM (VHEM)と呼ぶ）の利点は、モーションキャプチャシーケンスの階層的クラスタリング、音楽やオンライン手書きデータの自動注釈と検索など、時系列データを含むいくつかのタスクで実証されており、現在の方法よりも改善されています。特に、変分HEMアルゴリズムは、効率的な階層的推定手順を使用して注釈モデルを学習する際に大量のデータを効果的に活用します。これにより、学習時間とメモリ要件が削減され、より優れた正規化によってモデルの堅牢性が向上します。

Reinforcement Learning for Closed-Loop Propofol Anesthesia: A Study in Human Volunteers
閉ループプロポフォール麻酔の強化学習:ヒトボランティアを対象とした研究

Clinical research has demonstrated the efficacy of closed-loop control of anesthesia using the bispectral index of the electroencephalogram as the controlled variable. These controllers have evolved to yield patient-specific anesthesia, which is associated with improved patient outcomes. Despite progress, the problem of patient-specific anesthesia remains unsolved. A variety of factors confound good control, including variations in human physiology, imperfect measures of drug effect, and delayed, hysteretic response to drug delivery. Reinforcement learning (RL) appears to be uniquely equipped to overcome these challenges; however, the literature offers no precedent for RL in anesthesia. To begin exploring the role RL might play in improving anesthetic care, we investigated the method’s application in the delivery of patient-specific, propofol-induced hypnosis in human volunteers. When compared to performance metrics reported in the anesthesia literature, RL demonstrated patient-specific control marked by improved accuracy and stability. Furthermore, these results suggest that RL may be considered a viable alternative for solving other difficult closed-loop control problems in medicine. More rigorous clinical study, beyond the confines of controlled human volunteer studies, is needed to substantiate these findings.

臨床研究では、脳波のバイスペクトル指数を制御変数として使用した麻酔のクローズドループ制御の有効性が実証されています。これらのコントローラーは、患者固有の麻酔を生み出すように進化しており、患者の転帰の改善につながります。進歩にもかかわらず、患者固有の麻酔の問題は未解決のままです。人間の生理機能の変動、薬物効果の不完全な測定、薬物投与に対する遅延したヒステリシス反応など、さまざまな要因が良好な制御を妨げます。強化学習(RL)は、これらの課題を克服するのに独自に備えられているように見えますが、文献には麻酔におけるRLの前例がありません。麻酔ケアの改善におけるRLの役割の調査を開始するために、私たちは、ヒトボランティアにおける患者固有のプロポフォール誘発催眠の実施におけるこの方法の適用を調査しました。麻酔の文献で報告されているパフォーマンス指標と比較すると、RLは精度と安定性の向上を特徴とする患者固有の制御を示しました。さらに、これらの結果は、RLが医療における他の困難な閉ループ制御問題を解決するための実行可能な代替手段であると考えられることを示唆しています。これらの発見を実証するには、制御された人間のボランティア研究の限界を超えた、より厳密な臨床研究が必要です。

Random Intersection Trees
ランダム交差ツリー

Finding interactions between variables in large and high- dimensional data sets is often a serious computational challenge. Most approaches build up interaction sets incrementally, adding variables in a greedy fashion. The drawback is that potentially informative high-order interactions may be overlooked. Here, we propose an alternative approach for classification problems with binary predictor variables, called Random Intersection Trees. It works by starting with a maximal interaction that includes all variables, and then gradually removing variables if they fail to appear in randomly chosen observations of a class of interest. We show that informative interactions are retained with high probability, and the computational complexity of our procedure is of order $p^\kappa$, where $p$ is the number of predictor variables. The value of $\kappa$ can reach values as low as 1 for very sparse data; in many more general settings, it will still beat the exponent $s$ obtained when using a brute force search constrained to order $s$ interactions. In addition, by using some new ideas based on min-wise hash schemes, we are able to further reduce the computational cost. Interactions found by our algorithm can be used for predictive modelling in various forms, but they are also often of interest in their own right as useful characterisations of what distinguishes a certain class from others.

大規模で高次元のデータセットで変数間の相互作用を見つけることは、多くの場合、深刻な計算上の課題です。ほとんどのアプローチでは、貪欲な方法で変数を追加しながら、相互作用セットを段階的に構築します。欠点は、潜在的に有益な高次の相互作用が見落とされる可能性があることです。ここでは、バイナリ予測変数を使用した分類問題に対する、ランダム交差木と呼ばれる代替アプローチを提案します。これは、すべての変数を含む最大の相互作用から開始し、関心のあるクラスのランダムに選択された観測に現れない変数を徐々に削除することによって機能します。有益な相互作用は高い確率で保持され、手順の計算複雑性は$p^\kappa$のオーダーであることを示します。ここで、$p$は予測変数の数です。$\kappa$の値は、非常にスパースなデータでは1という低い値に達する可能性があります。より一般的な多くの設定では、順序$s$の相互作用に制限されたブルートフォース検索を使用した場合に得られる指数$s$を上回ります。さらに、最小単位のハッシュスキームに基づくいくつかの新しいアイデアを使用することで、計算コストをさらに削減できます。私たちのアルゴリズムによって発見された相互作用は、さまざまな形式で予測モデリングに使用できますが、特定のクラスを他のクラスと区別する有用な特性として、それ自体が興味深いものであることもよくあります。

Adaptivity of Averaged Stochastic Gradient Descent to Local Strong Convexity for Logistic Regression
ロジスティック回帰のための局所強凸性への平均確率的勾配降下の適応性

In this paper, we consider supervised learning problems such as logistic regression and study the stochastic gradient method with averaging, in the usual stochastic approximation setting where observations are used only once. We show that after $N$ iterations, with a constant step-size proportional to $1/R^2 \sqrt{N}$ where $N$ is the number of observations and $R$ is the maximum norm of the observations, the convergence rate is always of order $O(1/\sqrt{N})$, and improves to $O(R^2 / \mu N)$ where $\mu$ is the lowest eigenvalue of the Hessian at the global optimum (when this eigenvalue is greater than $R^2/\sqrt{N}$). Since $\mu$ does not need to be known in advance, this shows that averaged stochastic gradient is adaptive to unknown local strong convexity of the objective function. Our proof relies on the generalized self-concordance properties of the logistic loss and thus extends to all generalized linear models with uniformly bounded features.

この論文では、ロジスティック回帰などの教師あり学習問題を検討し、観測値が一度だけ使用される通常の確率的近似設定において、平均化を用いた確率的勾配法について検討します。$N$回の反復後、$1/R^2 sqrt{N}$ ($N$は観測値の数、$R$は観測値の最大ノルム)に比例する一定のステップサイズで、収束率は常に$O(1/sqrt{N})$の次数であり、$O(R^2 / mu N)$に改善され、$mu$はグローバル最適値(この固有値が$R^2/sqrt{N}$より大きい場合)でのヘッセ分布の最小固有値であることを示します。$mu$は事前に知る必要がないため、これは平均化された確率勾配が目的関数の未知の局所的な強い凸性に適応することを示しています。私たちの証明は、ロジスティック損失の一般化された自己一致特性に依存しているため、一様に境界のある特徴を持つすべての一般化線形モデルに拡張されます。

Link Prediction in Graphs with Autoregressive Features
自己回帰特徴を持つグラフのリンク予測

In the paper, we consider the problem of link prediction in time-evolving graphs. We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices. On the adjacency matrix it takes into account both sparsity and low rank properties and on the VAR it encodes the sparsity. The analysis involves oracle inequalities that illustrate the trade-offs in the choice of smoothing parameters when modeling the joint effect of sparsity and low rank. The estimate is computed efficiently using proximal methods, and evaluated through numerical experiments.

この論文では、時間発展グラフにおけるリンク予測の問題について考察します。ノード次数などの特定のグラフ特徴がベクトル自己回帰(VAR)モデルに従うと仮定し、この情報を使用して予測の精度を向上させることを提案します。この戦略には、隣接行列とVAR行列の空間に対する共同最適化手順が含まれます。隣接行列では、スパース性と低ランクの両方のプロパティが考慮され、VARではスパース性がエンコードされます。この分析には、スパース性と低ランクの同時効果をモデル化する際の平滑化パラメーターの選択におけるトレードオフを示すオラクル不等式が含まれます。推定値は、近位法を使用して効率的に計算され、数値実験を通じて評価されます。

Ground Metric Learning
グランドメトリック学習

Optimal transport distances have been used for more than a decade in machine learning to compare histograms of features. They have one parameter: the ground metric, which can be any metric between the features themselves. As is the case for all parameterized distances, optimal transport distances can only prove useful in practice when this parameter is carefully chosen. To date, the only option available to practitioners to set the ground metric parameter was to rely on a priori knowledge of the features, which limited considerably the scope of application of optimal transport distances. We propose to lift this limitation and consider instead algorithms that can learn the ground metric using only a training set of labeled histograms. We call this approach ground metric learning. We formulate the problem of learning the ground metric as the minimization of the difference of two convex polyhedral functions over a convex set of metric matrices. We follow the presentation of our algorithms with promising experimental results which show that this approach is useful both for retrieval and binary/multiclass classification tasks.

最適輸送距離は、特徴のヒストグラムを比較するために機械学習で10年以上使用されてきました。最適輸送距離には1つのパラメーター、つまりグラウンドメトリックがあり、これは特徴自体の間の任意のメトリックにすることができます。すべてのパラメーター化された距離の場合と同様に、最適輸送距離は、このパラメーターが慎重に選択された場合にのみ実際に役立ちます。現在まで、グラウンドメトリックパラメーターを設定するために実務者が利用できる唯一のオプションは、特徴の事前知識に頼ることでしたが、これにより最適輸送距離の適用範囲が大幅に制限されていました。私たちはこの制限を取り除き、代わりにラベル付きヒストグラムのトレーニングセットのみを使用してグラウンドメトリックを学習できるアルゴリズムを検討することを提案します。このアプローチをグラウンドメトリック学習と呼びます。グラウンドメトリックの学習の問題を、メトリックマトリックスの凸セット上の2つの凸多面体関数の差の最小化として定式化します。アルゴリズムのプレゼンテーションに続いて、このアプローチが検索タスクとバイナリ/マルチクラス分類タスクの両方に役立つことを示す有望な実験結果を示します。

Improving Markov Network Structure Learning Using Decision Trees
決定木を用いたマルコフネットワーク構造学習の改善

Most existing algorithms for learning Markov network structure either are limited to learning interactions among few variables or are very slow, due to the large space of possible structures. In this paper, we propose three new methods for using decision trees to learn Markov network structures. The advantage of using decision trees is that they are very fast to learn and can represent complex interactions among many variables. The first method, DTSL, learns a decision tree to predict each variable and converts each tree into a set of conjunctive features that define the Markov network structure. The second, DT-BLM, builds on DTSL by using it to initialize a search-based Markov network learning algorithm recently proposed by Davis and Domingos (2010). The third, DT+L1, combines the features learned by DTSL with those learned by an L1-regularized logistic regression method (L1) proposed by Ravikumar et al. (2009). In an extensive empirical evaluation on 20 data sets, DTSL is comparable to L1 and significantly faster and more accurate than two other baselines. DT-BLM is slower than DTSL, but obtains slightly higher accuracy. DT+L1 combines the strengths of DTSL and L1 to perform significantly better than either of them with only a modest increase in training time.

マルコフネットワーク構造を学習する既存のアルゴリズムのほとんどは、少数の変数間の相互作用の学習に限定されているか、可能な構造の空間が広いため非常に低速です。この論文では、マルコフネットワーク構造を学習するために決定木を使用する3つの新しい方法を提案します。決定木を使用する利点は、学習が非常に速く、多くの変数間の複雑な相互作用を表現できることです。最初の方法であるDTSLは、各変数を予測するために決定木を学習し、各ツリーをマルコフネットワーク構造を定義する結合特徴のセットに変換します。2番目の方法であるDT-BLMは、DTSLを使用して、DavisとDomingos (2010)によって最近提案された検索ベースのマルコフネットワーク学習アルゴリズムを初期化することで、DTSLを基盤としています。3番目のDT+L1は、DTSLによって学習された特徴と、Ravikumarら(2009)によって提案されたL1正則化ロジスティック回帰法(L1)によって学習された特徴を組み合わせたものです。20のデータセットに対する広範な実証的評価では、DTSLはL1に匹敵し、他の2つのベースラインよりも大幅に高速で正確です。DT-BLMはDTSLよりも低速ですが、わずかに高い精度が得られます。DT+L1はDTSLとL1の長所を組み合わせ、トレーニング時間がわずかに増加するだけで、どちらよりも大幅に優れたパフォーマンスを発揮します。

LIBOL: A Library for Online Learning Algorithms
LIBOL:オンライン学習アルゴリズムのライブラリ

LIBOL is an open-source library for large-scale online learning, which consists of a large family of efficient and scalable state-of-the-art online learning algorithms for large- scale online classification tasks. We have offered easy-to-use command-line tools and examples for users and developers, and also have made comprehensive documents available for both beginners and advanced users. LIBOL is not only a machine learning toolbox, but also a comprehensive experimental platform for conducting online learning research.

LIBOLは、大規模なオンライン学習のためのオープンソースライブラリであり、大規模なオンライン分類タスクのための効率的でスケーラブルな最先端のオンライン学習アルゴリズムの大規模なファミリーで構成されています。ユーザーと開発者向けに使いやすいコマンドラインツールと例を提供し、初心者と上級ユーザーの両方が利用できる包括的なドキュメントも用意しています。LIBOLは、機械学習ツールボックスであるだけでなく、オンライン学習研究を実施するための包括的な実験プラットフォームでもあります。

The FASTCLIME Package for Linear Programming and Large-Scale Precision Matrix Estimation in R
Rでの線形計画法と大規模精度行列推定のためのFASTCLIMEパッケージ

We develop an R package FASTCLIME for solving a family of regularized linear programming (LP) problems. Our package efficiently implements the parametric simplex algorithm, which provides a scalable and sophisticated tool for solving large- scale linear programs. As an illustrative example, one use of our LP solver is to implement an important sparse precision matrix estimation method called CLIME (Constrained $L_1$ Minimization Estimator). Compared with existing packages for this problem such as CLIME and FLARE, our package has three advantages: (1) it efficiently calculates the full piecewise- linear regularization path; (2) it provides an accurate dual certificate as stopping criterion; (3) it is completely coded in C and is highly portable. This package is designed to be useful to statisticians and machine learning researchers for solving a wide range of problems.

私たちは、正則化線形計画法(LP)のファミリーを解くためのRパッケージFASTCLIMEを開発します。当社のパッケージは、パラメトリックシンプレックスアルゴリズムを効率的に実装し、大規模な線形プログラムを解くためのスケーラブルで洗練されたツールを提供します。例として、LPソルバーの1つの用途は、CLIME (Constrained $L_1$ Minimization Estimator )と呼ばれる重要なスパース精度行列推定法を実装することです。CLIMEやFLAREなどのこの問題の既存のパッケージと比較して、私たちのパッケージには3つの利点があります:(1)完全な区分線形正則化パスを効率的に計算します。(2)停止基準として正確な二重証明書を提供します。(3)完全にCでコード化されており、携帯性に優れています。このパッケージは、統計学者や機械学習の研究者がさまざまな問題を解決するのに役立つように設計されています。

Node-Based Learning of Multiple Gaussian Graphical Models
複数のガウスグラフィカルモデルのノードベース学習

We consider the problem of estimating high-dimensional Gaussian graphical models corresponding to a single set of variables under several distinct conditions. This problem is motivated by the task of recovering transcriptional regulatory networks on the basis of gene expression data containing heterogeneous samples, such as different disease states, multiple species, or different developmental stages. We assume that most aspects of the conditional dependence networks are shared, but that there are some structured differences between them. Rather than assuming that similarities and differences between networks are driven by individual edges, we take a node-based approach, which in many cases provides a more intuitive interpretation of the network differences. We consider estimation under two distinct assumptions: (1) differences between the $K$ networks are due to individual nodes that are perturbed across conditions, or (2) similarities among the $K$ networks are due to the presence of common hub nodes that are shared across all $K$ networks. Using a row-column overlap norm penalty function, we formulate two convex optimization problems that correspond to these two assumptions. We solve these problems using an alternating direction method of multipliers algorithm, and we derive a set of necessary and sufficient conditions that allows us to decompose the problem into independent subproblems so that our algorithm can be scaled to high-dimensional settings. Our proposal is illustrated on synthetic data, a webpage data set, and a brain cancer gene expression data set.

私たちは、複数の異なる条件下での単一の変数セットに対応する高次元ガウスグラフィカルモデルを推定する問題について検討します。この問題は、異なる疾患状態、複数の種、または異なる発達段階などの異種サンプルを含む遺伝子発現データに基づいて転写調節ネットワークを復元するというタスクに起因しています。条件依存ネットワークのほとんどの側面は共有されているが、それらの間には構造化された違いがいくつかあると仮定します。ネットワーク間の類似点と相違点は個々のエッジによって決まると仮定するのではなく、多くの場合、ネットワークの違いをより直感的に解釈できるノードベースのアプローチを採用します。私たちは、2つの異なる仮定の下で推定を検討します。(1) Kネットワーク間の違いは、条件間で変動する個々のノードによる、または(2) Kネットワーク間の類似点は、すべてのKネットワークで共有される共通のハブノードの存在による。行と列のオーバーラップノルムペナルティ関数を使用して、これら2つの仮定に対応する2つの凸最適化問題を定式化します。私たちは、交互方向乗数法アルゴリズムを使用してこれらの問題を解決し、問題を独立したサブ問題に分解して、アルゴリズムを高次元設定に拡張できるようにする必要十分条件のセットを導き出します。私たちの提案は、合成データ、Webページデータセット、および脳腫瘍遺伝子発現データセットで説明されています。

Unbiased Generative Semi-Supervised Learning
偏りのない生成的半教師あり学習

Reliable semi-supervised learning, where a small amount of labelled data is complemented by a large body of unlabelled data, has been a long-standing goal of the machine learning community. However, while it seems intuitively obvious that unlabelled data can aid the learning process, in practise its performance has often been disappointing. We investigate this by examining generative maximum likelihood semi-supervised learning and derive novel upper and lower bounds on the degree of bias introduced by the unlabelled data. These bounds improve upon those provided in previous work, and are specifically applicable to the challenging case where the model is unable to exactly fit to the underlying distribution, a situation which is common in practise, but for which fewer guarantees of semi-supervised performance have been found. Inspired by this new framework for analysing bounds, we propose a new, simple reweighing scheme which provides a provably unbiased estimator for arbitrary model/distribution pairs—an unusual property for a semi- supervised algorithm. This reweighing introduces no additional computational complexity and can be applied to very many models. Additionally, we provide specific conditions demonstrating the circumstance under which the unlabelled data will lower the estimator variance, thereby improving convergence.

少量のラベル付きデータを大量のラベルなしデータで補完する信頼性の高い半教師あり学習は、機械学習コミュニティの長年の目標でした。しかし、ラベルなしデータが学習プロセスに役立つことは直感的に明らかなように思われますが、実際にはそのパフォーマンスは期待外れになることがよくありました。私たちは、生成的最大尤度半教師あり学習を調べることでこれを調査し、ラベルなしデータによってもたらされるバイアスの度合いに関する新しい上限と下限を導き出しました。これらの上限と下限は、以前の研究で提供されたものよりも改善されており、モデルが基礎となる分布に正確に適合できないという困難なケースに特に適用できます。これは実際にはよくある状況ですが、半教師あり学習のパフォーマンスを保証するものはあまり見つかっていません。この上限を分析するための新しいフレームワークに着想を得て、私たちは、任意のモデル/分布のペアに対して証明可能な偏りのない推定値を提供する新しい単純な再重み付けスキームを提案します。これは、半教師ありアルゴリズムとしては珍しい特性です。この再重み付けによって計算上の複雑さが増すことはなく、非常に多くのモデルに適用できます。さらに、ラベルなしデータによって推定値の分散が下がり、収束が改善される状況を示す特定の条件も示します。

Early Stopping and Non-parametric Regression: An Optimal Data-dependent Stopping Rule
早期停止とノンパラメトリック回帰:データ依存の最適停止ルール

Early stopping is a form of regularization based on choosing when to stop running an iterative algorithm. Focusing on non- parametric regression in a reproducing kernel Hilbert space, we analyze the early stopping strategy for a form of gradient- descent applied to the least-squares loss function. We propose a data-dependent stopping rule that does not involve hold-out or cross-validation data, and we prove upper bounds on the squared error of the resulting function estimate, measured in either the $L^2(\mathbb{P})$ and $L^2(\mathbb{P}_n)$ norm. These upper bounds lead to minimax-optimal rates for various kernel classes, including Sobolev smoothness classes and other forms of reproducing kernel Hilbert spaces. We show through simulation that our stopping rule compares favorably to two other stopping rules, one based on hold-out data and the other based on Stein’s unbiased risk estimate. We also establish a tight connection between our early stopping strategy and the solution path of a kernel ridge regression estimator.

早期停止は、反復アルゴリズムの実行を停止するタイミングを選択することに基づく正則化の形式です。再生カーネルヒルベルト空間でのノンパラメトリック回帰に焦点を当て、最小二乗損失関数に適用される一種の勾配降下法の早期停止戦略を分析します。ホールドアウトデータやクロス検証データを含まないデータ依存の停止規則を提案し、$L^2(\mathbb{P})$および$L^2(\mathbb{P}_n)$ノルムで測定された結果の関数推定値の二乗誤差の上限を証明します。これらの上限は、ソボレフ平滑性クラスやその他の再生カーネルヒルベルト空間形式を含むさまざまなカーネルクラスのミニマックス最適レートにつながります。シミュレーションにより、この停止規則が、ホールドアウトデータに基づく停止規則とSteinの不偏リスク推定に基づく停止規則の2つの他の停止規則と比較して優れていることを示します。また、早期停止戦略とカーネルリッジ回帰推定器の解パスとの間に密接な関係を確立します。

Off-policy Learning With Eligibility Traces: A Survey
適格性トレースによるオフポリシー学習:調査

In the framework of Markov Decision Processes, we consider linear off-policy learning, that is the problem of learning a linear approximation of the value function of some fixed policy from one trajectory possibly generated by some other policy. We briefly review on-policy learning algorithms of the literature (gradient-based and least-squares- based), adopting a unified algorithmic view. Then, we highlight a systematic approach for adapting them to off-policy learning with eligibility traces. This leads to some known algorithms—off-policy LSTD($\lambda$), LSPE($\lambda$), TD($\lambda$), TDC/GQ($\lambda$)—and suggests new extensions —off-policy FPKF($\lambda$), BRM($\lambda$), gBRM($\lambda$), GTD2($\lambda$). We describe a comprehensive algorithmic derivation of all algorithms in a recursive and memory-efficent form, discuss their known convergence properties and illustrate their relative empirical behavior on Garnet problems. Our experiments suggest that the most standard algorithms on and off-policy LSTD($\lambda$)/LSPE($\lambda$)—and TD($\lambda$) if the feature space dimension is too large for a least-squares approach—perform the best.

マルコフ決定過程の枠組みにおいて、線形オフポリシー学習、つまり、他のポリシーによって生成された可能性のある1つの軌跡から、ある固定ポリシーの価値関数の線形近似を学習する問題を検討します。統一されたアルゴリズムの視点を採用して、文献のオンポリシー学習アルゴリズム(勾配ベースおよび最小二乗ベース)を簡単にレビューします。次に、それらを適格性トレースを備えたオフポリシー学習に適応させる体系的なアプローチを紹介します。これにより、オフポリシーLSTD($\lambda$)、LSPE($\lambda$)、TD($\lambda$)、TDC/GQ($\lambda$)などの既知のアルゴリズムが紹介され、オフポリシーFPKF($\lambda$)、BRM($\lambda$)、gBRM($\lambda$)、GTD2($\lambda$)などの新しい拡張が提案されます。私たちは、再帰的かつメモリ効率の良い形式で、すべてのアルゴリズムの包括的なアルゴリズム導出を説明し、それらの既知の収束特性について議論し、ガーネット問題におけるそれらの相対的な経験的動作を示します。我々の実験は、最も標準的なアルゴリズムであるオンポリシーおよびオフポリシーのLSTD($\lambda$)/LSPE($\lambda$)と、特徴空間次元が最小二乗法には大きすぎる場合はTD($\lambda$)が最高のパフォーマンスを発揮することを示唆しています。

Information Theoretical Estimators Toolbox
情報理論推定ツールボックス

We present ITE (information theoretical estimators) a free and open source, multi-platform, Matlab/Octave toolbox that is capable of estimating many different variants of entropy, mutual information, divergence, association measures, cross quantities, and kernels on distributions. Thanks to its highly modular design, ITE supports additionally (i) the combinations of the estimation techniques, (ii) the easy construction and embedding of novel information theoretical estimators, and (iii) their immediate application in information theoretical optimization problems. ITE also includes a prototype application in a central problem class of signal processing, independent subspace analysis and its extensions.

私たちは、ITE(情報理論推定器)は、エントロピー、相互情報量、発散、関連測定、クロス量、および分布上のカーネルのさまざまなバリアントを推定できる、無料のオープンソースのマルチプラットフォームMatlab / Octaveツールボックスです。ITEは、その高度にモジュール化された設計により、(i)推定技術の組み合わせ、(ii)新しい情報理論推定器の容易な構築と埋め込み、および(iii)情報理論最適化問題への即時適用をさらにサポートします。ITEには、信号処理の中心的な問題クラス、独立した部分空間解析、およびその拡張のプロトタイプアプリケーションも含まれています。

Using Trajectory Data to Improve Bayesian Optimization for Reinforcement Learning
軌跡データを使用した強化学習のベイズ最適化の改善

Recently, Bayesian Optimization (BO) has been used to successfully optimize parametric policies in several challenging Reinforcement Learning (RL) applications. BO is attractive for this problem because it exploits Bayesian prior information about the expected return and exploits this knowledge to select new policies to execute. Effectively, the BO framework for policy search addresses the exploration-exploitation tradeoff. In this work, we show how to more effectively apply BO to RL by exploiting the sequential trajectory information generated by RL agents. Our contributions can be broken into two distinct, but mutually beneficial, parts. The first is a new Gaussian process (GP) kernel for measuring the similarity between policies using trajectory data generated from policy executions. This kernel can be used in order to improve posterior estimates of the expected return thereby improving the quality of exploration. The second contribution, is a new GP mean function which uses learned transition and reward functions to approximate the surface of the objective. We show that the model-based approach we develop can recover from model inaccuracies when good transition and reward models cannot be learned. We give empirical results in a standard set of RL benchmarks showing that both our model-based and model-free approaches can speed up learning compared to competing methods. Further, we show that our contributions can be combined to yield synergistic improvement in some domains.

最近、ベイズ最適化(BO)は、いくつかの難しい強化学習(RL)アプリケーションでパラメトリックポリシーを最適化するために使用されています。BOは、期待されるリターンに関するベイズの事前情報を活用し、この知識を利用して実行する新しいポリシーを選択するため、この問題に適しています。実質的に、ポリシー検索のBOフレームワークは、探索と利用のトレードオフに対処します。この研究では、RLエージェントによって生成された連続的な軌跡情報を利用して、BOをRLに効果的に適用する方法を示します。私たちの貢献は、2つの異なるが相互に有益な部分に分けることができます。1つ目は、ポリシー実行から生成された軌跡データを使用してポリシー間の類似性を測定するための新しいガウス過程(GP)カーネルです。このカーネルを使用すると、期待されるリターンの事後推定値を改善し、探索の品質を向上させることができます。2つ目の貢献は、学習した遷移関数と報酬関数を使用して目的の表面を近似する新しいGP平均関数です。私たちが開発したモデルベースのアプローチは、適切な遷移モデルと報酬モデルを学習できない場合でも、モデルの不正確さから回復できることを示しています。標準的なRLベンチマークセットで実証結果を示し、モデルベースとモデルフリーの両方のアプローチが競合方法と比較して学習を高速化できることを示しています。さらに、私たちの貢献を組み合わせることで、一部の領域で相乗的な改善が得られることを示しています。

Convex vs Non-Convex Estimators for Regression and Sparse Estimation: the Mean Squared Error Properties of ARD and GLasso
回帰推定とスパース推定のための凸推定量と非凸推定量:ARDとGLassoの平均二乗誤差特性

We study a simple linear regression problem for grouped variables; we are interested in methods which jointly perform estimation and variable selection, that is, that automatically set to zero groups of variables in the regression vector. The Group Lasso (GLasso), a well known approach used to tackle this problem which is also a special case of Multiple Kernel Learning (MKL), boils down to solving convex optimization problems. On the other hand, a Bayesian approach commonly known as Sparse Bayesian Learning (SBL), a version of which is the well known Automatic Relevance Determination (ARD), lead to non- convex problems. In this paper we discuss the relation between ARD (and a penalized version which we call PARD) and Glasso, and study their asymptotic properties in terms of the Mean Squared Error in estimating the unknown parameter. The theoretical arguments developed here are independent of the correctness of the prior models and clarify the advantages of PARD over GLasso.

私たちは、グループ化された変数の単純な線形回帰問題を研究します。私たちは、推定と変数選択を共同で実行する方法、つまり、回帰ベクトルの変数グループを自動的にゼロに設定する方法に興味があります。Group Lasso (GLasso)は、この問題に取り組むために使用されるよく知られたアプローチであり、マルチプルカーネル学習(MKL)の特殊なケースでもありますが、要約すると凸最適化問題を解くことです。一方、一般にスパースベイジアン学習(SBL)として知られるベイジアンアプローチは、そのバージョンがよく知られている自動関連性決定(ARD)であり、非凸問題を引き起こします。この論文では、ARD(およびPARDと呼ばれるペナルティ付きバージョン)とGlassoの関係について説明し、未知のパラメータを推定する際の平均二乗誤差の観点からそれらの漸近特性を研究します。ここで展開される理論的議論は、以前のモデルの正しさとは無関係であり、GLassoに対するPARDの利点を明確にしています。

Axioms for Graph Clustering Quality Functions
グラフクラスタリングの品質関数の公理

We investigate properties that intuitively ought to be satisfied by graph clustering quality functions, that is, functions that assign a score to a clustering of a graph. Graph clustering, also known as network community detection, is often performed by optimizing such a function. Two axioms tailored for graph clustering quality functions are introduced, and the four axioms introduced in previous work on distance based clustering are reformulated and generalized for the graph setting. We show that modularity, a standard quality function for graph clustering, does not satisfy all of these six properties. This motivates the derivation of a new family of quality functions, adaptive scale modularity, which does satisfy the proposed axioms. Adaptive scale modularity has two parameters, which give greater flexibility in the kinds of clusterings that can be found. Standard graph clustering quality functions, such as normalized cut and unnormalized cut, are obtained as special cases of adaptive scale modularity. In general, the results of our investigation indicate that the considered axiomatic framework covers existing `good’ quality functions for graph clustering, and can be used to derive an interesting new family of quality functions.

私たちは、グラフクラスタリング品質関数、つまりグラフのクラスタリングにスコアを割り当てる関数が直感的に満たすべき特性を調査します。グラフクラスタリングはネットワークコミュニティ検出とも呼ばれ、多くの場合、このような関数を最適化することによって実行されます。グラフクラスタリング品質関数に合わせた2つの公理が導入され、距離ベースのクラスタリングに関する以前の研究で導入された4つの公理がグラフ設定用に再定式化され一般化されます。グラフクラスタリングの標準品質関数であるモジュール性は、これら6つの特性のすべてを満たしているわけではないことが示されます。これが、提案された公理を満たす新しい品質関数ファミリーである適応スケールモジュール性の導出の動機となります。適応スケールモジュール性には2つのパラメーターがあり、これにより、検出できるクラスタリングの種類に柔軟性がもたらされます。正規化カットや非正規化カットなどの標準的なグラフクラスタリング品質関数は、適応スケールモジュール性の特殊なケースとして得られます。一般的に、私たちの調査の結果は、検討されている公理的フレームワークがグラフクラスタリングの既存の「優れた」品質関数をカバーし、興味深い新しい品質関数のファミリを導出するために使用できることを示しています。

A Junction Tree Framework for Undirected Graphical Model Selection
無向グラフィカルモデル選択のためのジャンクションツリーフレームワーク

An undirected graphical model is a joint probability distribution defined on an undirected graph $G^*$, where the vertices in the graph index a collection of random variables and the edges encode conditional independence relationships among random variables. The undirected graphical model selection (UGMS) problem is to estimate the graph $G^*$ given observations drawn from the undirected graphical model. This paper proposes a framework for decomposing the UGMS problem into multiple subproblems over clusters and subsets of the separators in a junction tree. The junction tree is constructed using a graph that contains a superset of the edges in $G^*$. We highlight three main properties of using junction trees for UGMS. First, different regularization parameters or different UGMS algorithms can be used to learn different parts of the graph. This is possible since the subproblems we identify can be solved independently of each other. Second, under certain conditions, a junction tree based UGMS algorithm can produce consistent results with fewer observations than the usual requirements of existing algorithms. Third, both our theoretical and experimental results show that the junction tree framework does a significantly better job at finding the weakest edges in a graph than existing methods. This property is a consequence of both the first and second properties. Finally, we note that our framework is independent of the choice of the UGMS algorithm and can be used as a wrapper around standard UGMS algorithms for more accurate graph estimation.

無向グラフィカルモデルは、無向グラフ$G^*$上で定義される結合確率分布です。グラフの頂点はランダム変数の集合をインデックスし、エッジはランダム変数間の条件付き独立関係をエンコードします。無向グラフィカルモデル選択(UGMS)問題は、無向グラフィカルモデルから抽出された観測値に基づいてグラフ$G^*$を推定することです。この論文では、UGMS問題をジャンクションツリー内のセパレーターのクラスターとサブセット上の複数のサブ問題に分解するためのフレームワークを提案します。ジャンクションツリーは、$G^*$内のエッジのスーパーセットを含むグラフを使用して構築されます。UGMSにジャンクションツリーを使用する3つの主な特性について説明します。まず、異なる正規化パラメーターまたは異なるUGMSアルゴリズムを使用して、グラフの異なる部分を学習できます。これは、特定したサブ問題を互いに独立して解決できるため可能です。次に、特定の条件下では、ジャンクションツリーベースのUGMSアルゴリズムは、既存のアルゴリズムの通常の要件よりも少ない観測値で一貫した結果を生成できます。3番目に、私たちの理論と実験の両方の結果は、ジャンクションツリーフレームワークがグラフ内の最も弱いエッジを見つけるのに既存の方法よりも大幅に優れていることを示しています。この特性は、1番目と2番目の特性の両方の結果です。最後に、私たちのフレームワークはUGMSアルゴリズムの選択に依存せず、より正確なグラフ推定のために標準のUGMSアルゴリズムのラッパーとして使用できることに注意してください。

EnsembleSVM: A Library for Ensemble Learning Using Support Vector Machines
EnsembleSVM: サポートベクターマシンを用いたアンサンブル学習のためのライブラリ

EnsembleSVM is a free software package containing efficient routines to perform ensemble learning with support vector machine (SVM) base models. It currently offers ensemble methods based on binary SVM models. Our implementation avoids duplicate storage and evaluation of support vectors which are shared between constituent models. Experimental results show that using ensemble approaches can drastically reduce training complexity while maintaining high predictive accuracy. The EnsembleSVM software package is freely available online at esat.kuleuven.be/stadius/ensemblesvm.

EnsembleSVMは、サポートベクターマシン(SVM)ベースモデルを使用してアンサンブル学習を実行するための効率的なルーチンを含むフリーソフトウェアパッケージです。現在、バイナリSVMモデルに基づくアンサンブルメソッドを提供しています。私たちの実装は、構成モデル間で共有されるサポートベクトルの重複した保存と評価を回避します。実験結果によると、アンサンブルアプローチを使用すると、高い予測精度を維持しながら、トレーニングの複雑さを大幅に軽減できることが示されています。EnsembleSVMソフトウェアパッケージは、esat.kuleuven.be/stadius/ensemblesvmでオンラインから無料で入手できます。

Detecting Click Fraud in Online Advertising: A Data Mining Approach
オンライン広告におけるクリック詐欺の検出:データマイニングアプローチ

Click fraud–the deliberate clicking on advertisements with no real interest on the product or service offered–is one of the most daunting problems in online advertising. Building an effective fraud detection method is thus pivotal for online advertising businesses. We organized a Fraud Detection in Mobile Advertising (FDMA) 2012 Competition, opening the opportunity for participants to work on real-world fraud data from BuzzCity Pte. Ltd., a global mobile advertising company based in Singapore. In particular, the task is to identify fraudulent publishers who generate illegitimate clicks, and distinguish them from normal publishers. The competition was held from September 1 to September 30, 2012, attracting 127 teams from more than 15 countries. The mobile advertising data are unique and complex, involving heterogeneous information, noisy patterns with missing values, and highly imbalanced class distribution. The competition results provide a comprehensive study on the usability of data mining-based fraud detection approaches in practical setting. Our principal findings are that features derived from fine-grained time-series analysis are crucial for accurate fraud detection, and that ensemble methods offer promising solutions to highly-imbalanced nonlinear classification tasks with mixed variable types and noisy/missing patterns. The competition data remain available for further studies at palanteer.sis.smu.edu.sg/fdma2012.

クリック詐欺（提供されている製品やサービスに実際には興味がないのに故意に広告をクリックすること）は、オンライン広告における最も困難な問題の1つです。そのため、効果的な詐欺検出方法の構築は、オンライン広告ビジネスにとって極めて重要です。私たちは、モバイル広告における詐欺検出(FDMA) 2012コンペティションを開催し、参加者がシンガポールに拠点を置くグローバルモバイル広告会社BuzzCity Pte. Ltd.の実際の詐欺データに取り組む機会を提供しました。特に、不正なクリックを生成する詐欺的なパブリッシャーを特定し、通常のパブリッシャーと区別することが課題でした。このコンペティションは2012年9月1日から9月30日まで開催され、15か国以上から127チームが参加しました。モバイル広告データは独特かつ複雑で、異種情報、欠損値のあるノイズの多いパターン、および非常に不均衡なクラス分布が含まれています。コンペティションの結果は、実際の設定におけるデータマイニングベースの詐欺検出アプローチの有用性に関する包括的な研究を提供します。私たちの主な発見は、きめ細かい時系列分析から得られる特徴が不正行為の正確な検出に不可欠であること、そしてアンサンブル法が混合変数タイプとノイズ/欠損パターンを伴う非常に不均衡な非線形分類タスクに有望なソリューションを提供するというものです。競技データは、palanteer.sis.smu.edu.sg/fdma2012で引き続き閲覧可能で、さらに研究することができます。

Fast SVM Training Using Approximate Extreme Points
近似極値ポイントを使用した高速SVMトレーニング

Applications of non-linear kernel support vector machines (SVMs) to large data sets is seriously hampered by its excessive training time. We propose a modification, called the approximate extreme points support vector machine (AESVM), that is aimed at overcoming this burden. Our approach relies on conducting the SVM optimization over a carefully selected subset, called the representative set, of the training data set. We present analytical results that indicate the similarity of AESVM and SVM solutions. A linear time algorithm based on convex hulls and extreme points is used to compute the representative set in kernel space. Extensive computational experiments on nine data sets compared AESVM to LIBSVM (Chang and Lin, 2011), CVM (Tsang et al., 2005), BVM (Tsang et al., 2007), LASVM (Bordes et al., 2005), SVMperf (Joachims and Yu, 2009), and the random features method (Rahimi and Recht, 2007). Our AESVM implementation was found to train much faster than the other methods, while its classification accuracy was similar to that of LIBSVM in all cases. In particular, for a seizure detection data set, AESVM training was almost 500 times faster than LIBSVM and LASVM and 20 times faster than CVM and BVM. Additionally, AESVM also gave competitively fast classification times.

非線形カーネルサポートベクターマシン(SVM)を大規模なデータセットに適用する場合、トレーニング時間が長すぎることが大きな障害となります。この負担を克服することを目的とした、近似極値点サポートベクターマシン(AESVM)と呼ばれる修正を提案します。このアプローチでは、トレーニングデータセットの慎重に選択されたサブセット(代表セット)に対してSVM最適化を実行します。AESVMとSVMのソリューションの類似性を示す分析結果を示します。カーネル空間で代表セットを計算するために、凸包と極値点に基づく線形時間アルゴリズムが使用されます。9つのデータセットに対する広範な計算実験により、AESVMはLIBSVM (Chang and Lin、2011)、CVM (Tsangら、2005)、BVM (Tsangら、2007)、LASVM (Bordesら、2005)、SVMperf (Joachims and Yu、2009)、およびランダムフィーチャメソッド(Rahimi and Recht、2007)と比較されました。当社のAESVM実装は、他の方法よりもはるかに高速にトレーニングされ、分類精度はすべてのケースでLIBSVMと同等であることがわかりました。特に、発作検出データセットの場合、AESVMのトレーニングはLIBSVMおよびLASVMの約500倍、CVMおよびBVMの20倍高速でした。さらに、AESVMは競争力のある高速分類時間も提供しました。

Bridging Viterbi and Posterior Decoding: A Generalized Risk Approach to Hidden Path Inference Based on Hidden Markov Models
ビタビと事後復号化の架橋:隠れマルコフモデルに基づく隠れ経路推論への一般化リスクアプローチ

Motivated by the unceasing interest in hidden Markov models (HMMs), this paper re-examines hidden path inference in these models, using primarily a risk-based framework. While the most common maximum a posteriori (MAP), or Viterbi, path estimator and the minimum error, or Posterior Decoder (PD) have long been around, other path estimators, or decoders, have been either only hinted at or applied more recently and in dedicated applications generally unfamiliar to the statistical learning community. Over a decade ago, however, a family of algorithmically defined decoders aiming to hybridize the two standard ones was proposed elsewhere. The present paper gives a careful analysis of this hybridization approach, identifies several problems and issues with it and other previously proposed approaches, and proposes practical resolutions of those. Furthermore, simple modifications of the classical criteria for hidden path recognition are shown to lead to a new class of decoders. Dynamic programming algorithms to compute these decoders in the usual forward-backward manner are presented. A particularly interesting subclass of such estimators can be also viewed as hybrids of the MAP and PD estimators. Similar to previously proposed MAP-PD hybrids, the new class is parameterized by a small number of tunable parameters. Unlike their algorithmic predecessors, the new risk- based decoders are more clearly interpretable, and, most importantly, work “out-of-the box” in practice, which is demonstrated on some real bioinformatics tasks and data. Some further generalizations and applications are discussed in the conclusion.

隠れマルコフモデル(HMM)への絶え間ない関心に触発されて、この論文では、主にリスクベースのフレームワークを使用して、これらのモデルにおける隠れパスの推論を再検討します。最も一般的な最大事後確率(MAP)またはビタビパス推定器と最小誤差(PD)または事後デコーダーは長い間存在していましたが、他のパス推定器またはデコーダーは、最近になって、統計学習コミュニティに一般的には馴染みのない専用アプリケーションでほのめかされるか適用されただけです。しかし、10年以上前に、2つの標準的なデコーダーをハイブリッド化することを目指したアルゴリズムで定義されたデコーダーのファミリーが別の場所で提案されました。この論文では、このハイブリッド化アプローチを慎重に分析し、このアプローチと以前に提案された他のアプローチのいくつかの問題と課題を特定し、それらの実用的な解決策を提案します。さらに、隠れパス認識の古典的な基準を単純に変更すると、新しいクラスのデコーダーにつながることが示されています。これらのデコーダーを通常の順方向-逆方向の方法で計算するための動的プログラミングアルゴリズムが提示されています。このような推定器の特に興味深いサブクラスは、MAP推定器とPD推定器のハイブリッドとして見ることもできます。以前に提案されたMAP-PDハイブリッドと同様に、新しいクラスは少数の調整可能なパラメータによってパラメータ化されます。アルゴリズムの先行バージョンとは異なり、新しいリスクベースのデコーダーはより明確に解釈可能であり、最も重要なのは、実際に「すぐに」機能することです。これは、実際のバイオインフォマティクスのタスクとデータで実証されています。結論では、さらに一般化とアプリケーションについて説明します。

関連記事