Journal of Machine Learning Research Papers: Volume 6の論文一覧

Journal of Machine Learning Research Papers Volume 6に記載されている内容を一覧にまとめ、機械翻訳を交えて日本語化し掲載します。
Expectation Consistent Approximate Inference
期待値: 一貫性のある近似推論

We propose a novel framework for approximations tointractable probabilistic models which is based on afree energy formulation. Theapproximation can be understood as replacing an averageover the original intractable distribution with a tractable one.It requires two tractable probabilitydistributions which are made consistent on a set ofmoments and encode different features of the original intractabledistribution. In this way we are able to useGaussian approximations for models with discrete or boundedvariables which allow us to include non-trivialcorrelations. These are neglected in many other methods.We test the framework on toy benchmark problems forbinary variables on fully connected graphs and 2D gridsand compare with other methods, such as loopy belief propagation.Good performance is already achieved by using single nodes as tractablesubstructures. Significant improvements are obtained when a spanningtree is used instead.

私たちは、自由エネルギーの定式化に基づく難解な確率モデルの近似のための新しいフレームワークを提案します。ザ・近似は、元の難解な分布の平均を扱いやすい分布に置き換えることとして理解できます。これには、一連のモーメントで一貫性を持たせ、元の難解な分布の異なる特徴をエンコードする2つの扱いやすい確率分布が必要です。このようにして、離散変数または有界変数を持つモデルにガウス近似を使用できるため、非自明な相関を含めることができます。これらは他の多くの方法で無視されています。全結合グラフと2Dグリッド上のバイナリ変数のおもちゃのベンチマーク問題でフレームワークをテストし、ループ信念伝播などの他の方法と比較します。良好なパフォーマンスは、単一のノードをトラクタブルなサブストラクチャーとして使用することですでに達成されています。代わりにスパニングツリーを使用すると、大幅な改善が得られます。

On the Nystrom Method for Approximating a Gram Matrix for Improved Kernel-Based Learning
カーネルベース学習の改善のためのグラム行列の近似のためのNystrom法について

A problem for many kernel-based methods is that the amount of computation required to find the solution scales as O(n3), where n is the number of training examples.We develop and analyze an algorithm to compute an easily-interpretable low-rank approximation to an n × n Gram matrix G such that computations of interest may be performed more rapidly.The approximation is of the form ~Gk = CWk+CT, where C is a matrix consisting of a small number c of columns of G and Wk is the best rank-k approximation to W, the matrix formed by the intersection between those c columns of G and the corresponding c rows of G.An important aspect of the algorithm is the probability distribution used to randomly sample the columns;we will use a judiciously-chosen and data-dependent nonuniform probability distribution.Let ||·||2 and ||·||F denote the spectral norm and the Frobenius norm, respectively, of a matrix, and let Gk be the best rank-k approximation to G. We prove that by choosing O(k/ε4) columns||G-CWk+CT||ξ ≤ ||G-Gk||ξ + ε Σi=1n Gii2 ,both in expectation and with high probability, for both ξ = 2, F, and for all k: 0 ≤ k ≤ rank(W).This approximation can be computed using O(n) additional space and time, after making two passes over the data from external storage.The relationships between this algorithm, other related matrix decompositions, and the Nyström method from integral equation theory are discussed.

多くのカーネルベースの方法の問題は、解を見つけるために必要な計算量がO(n3)に比例することです。ここで、nはトレーニング例の数です。n×nグラム行列Gの簡単に解釈できる低ランク近似を計算するアルゴリズムを開発して分析し、関心のある計算をより迅速に実行できるようにします。近似は~Gk = CWk+CTの形式です。ここで、CはGの少数の列cで構成される行列であり、WkはWの最良のランクk近似であり、Gのc列とそれに対応するGのc行の交差によって形成される行列です。アルゴリズムの重要な側面は、列をランダムにサンプリングするために使用される確率分布です。慎重に選択されたデータ依存の非一様確率分布を使用します。||·||2と||·||Fをそれぞれ行列のスペクトルノルムとフロベニウスノルムとし、GkをGの最良のランクk近似とします。O(k/ε4)列を選択することで、ξ= 2, Fおよびすべてのk: 0≤k≤rank(W)に対して、期待値と高い確率の両方で、||G-CWk+CT||ξ ≤||G-Gk||ξ+ε Σi=1n Gii2が得られます。この近似値は、外部ストレージからのデータに対して2回のパスを実行した後、O(n)の追加スペースと時間を使用して計算できます。このアルゴリズムと、その他の関連する行列分解、および積分方程式理論からのNyström法との関係について説明します。

Efficient Margin Maximizing with Boosting
ブースティングによる効率的なマージンの最大化

AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost*ν, that explicitly maximizes the minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.

AdaBoostは、基本仮説の線形結合を生成し、この線形結合の符号を使用して予測します。線形結合は、基本仮説が特徴を形成する特徴空間の超平面として見ることができます。すべての例が現在の超平面の正しい側にある後でも、アルゴリズムの一般化誤差は改善し続けることが確認されています。この改善は、すべての例が正しい側にある後でも、分離超平面に対する例の距離(マージン)が増加するという実験的観察によるものです。ここでは、例の最小マージンを特定の精度まで明示的に最大化する、AdaBoost*ν と呼ばれる新しいバージョンのAdaBoostを紹介します。このアルゴリズムは、達成可能なマージンの現在の推定値を、基本仮説の線形係数の計算に組み込みます。新しいアルゴリズムに必要な反復回数の上限は、達成可能なマージンの明示的な推定値をパラメーターとして持つ必要があるAdaBoostの既知のバージョンに必要な回数と同じです。また、私たちのアルゴリズムでは、マージンの最大化を目指す他のアルゴリズムよりも、必要な反復回数がかなり少なくなることも実験的に示しています。

Kernel Methods for Measuring Independence
独立性を測定するためのカーネル法

We introduce two new functionals, the constrained covarianceand the kernel mutual information, to measure the degree of independenceof random variables. These quantities are both based on the covariancebetween functions of the random variables in reproducing kernel Hilbertspaces (RKHSs). We prove that when the RKHSs are universal, both functionalsare zero if and only if the random variables are pairwise independent.We also show that the kernel mutual information is an upper boundnear independence on the Parzen window estimate of the mutual information.Analogous results apply for two correlation-based dependence functionalsintroduced earlier: we show the kernel canonical correlation and thekernel generalised variance to be independence measures for universalkernels, and prove the latter to be an upper bound on the mutual informationnear independence. The performance of the kernel dependence functionalsin measuring independence is verified in the context of independentcomponent analysis.

私たちは、確率変数の独立性の程度を測定するために、制約付き共分散とカーネル相互情報量の2つの新しい関数を導入します。これらの量は両方とも、再現カーネルヒルベルト空間(RKHS)の確率変数の関数間の共分散に基づいています。RKHSが普遍であるとき、確率変数がペアワイズ独立である場合にのみ、両方の汎関数がゼロであることを証明します。また、カーネルの相互情報が、相互情報のパルゼンウィンドウ推定の上限付近の独立性であることも示します。同様の結果は、前に紹介した2つの相関ベースの依存汎関数に適用されます:カーネルの正準相関とカーネル一般化分散がユニバーサルカーネルの独立性尺度であることを示し、後者が相互情報の上限であることを証明します。カーネル依存性のパフォーマンスは、独立性の測定における関数型であり、独立性コンポーネント分析のコンテキストで検証されます。

Convergence Theorems for Generalized Alternating Minimization Procedures
一般化交互最小化手順の収束定理

The EM algorithm is widely used to develop iterative parameterestimation procedures for statistical models. In cases where theseprocedures strictly follow the EM formulation, the convergenceproperties of the estimation procedures are well understood. Insome instances there are practical reasons to develop proceduresthat do not strictly fall within the EM framework. We study EMvariants in which the E-step is not performed exactly, either toobtain improved rates of convergence, or due to approximationsneeded to compute statistics under a model family over which E-stepscannot be realized. Since these variants are not EM procedures, thestandard (G)EM convergence results do not apply to them. We presentan information geometric framework for describing such algorithmsand analyzing their convergence properties. We apply this frameworkto analyze the convergence properties of incremental EM andvariational EM. For incremental EM, we discuss conditions underthese algorithms converge in likelihood. For variational EM, weshow how the E-step approximation prevents convergence to localmaxima in likelihood.

EMアルゴリズムは、統計モデルの反復パラメータ推定手順の開発に広く使用されています。これらの手順がEM定式化に厳密に従う場合、推定手順の収束特性は十分に理解されています。場合によっては、EMフレームワークに厳密には当てはまらない手順を開発する実用的な理由があります。ここでは、収束率を向上させるため、またはEステップが実現できないモデルファミリの下で統計を計算するために必要な近似値のために、Eステップが正確に実行されないEMバリアントを検討します。これらのバリアントはEM手順ではないため、標準の(G)EM収束結果は適用されません。このようなアルゴリズムを記述し、その収束特性を分析するための情報幾何学的フレームワークを示します。このフレームワークを適用して、増分EMと変分EMの収束特性を分析します。増分EMの場合、これらのアルゴリズムが尤度で収束する条件について説明します。変分EMの場合、Eステップ近似値によって尤度の局所的最大値への収束が妨げられる仕組みを示します。

Asymptotics in Empirical Risk Minimization
経験的リスク最小化における漸近論

In this paper, we study a two-category classification problem.We indicate the categories by labels Y=1 and Y=-1. We observe a covariate, or feature, X ∈ X ⊂ ℜd.Consider a collection {ha} of classifiers indexed by a finite-dimensionalparameter a, and the classifier ha* that minimizes the prediction errorover this class. The parameter a* is estimated by the empirical risk minimizerân over the class, where the empirical risk is calculated on atraining sample of size n. We apply the Kim Pollard Theorem to show that under certain differentiability assumptions, ân converges to a* with raten-1/3, and also present the asymptotic distribution of therenormalized estimator.

この論文では、2つのカテゴリー分類問題について検討します。カテゴリは、ラベルY=1とY=-1で示します。共変量、または特徴X∈X⊂Rdを観測します。有限次元パラメータaによってインデックス付けされた分類子のコレクション{ha}と、このクラスに対する予測誤差を最小化する分類子ha*について考えてみます。パラメータa*は、経験的リスクがサイズnの学習サンプルで計算されるクラス全体の経験的リスク最小化によって推定されます。キム・ポラードの定理を適用して、特定の微分可能性の仮定の下で、ânがraten-1/3でa*に収束することを示し、また、その正規化された推定量の漸近分布を示します。

Change Point Problems in Linear Dynamical Systems
線形力学系における変化点問題

We study the problem of learning two regimes (we have a normal and aprefault regime in mind) based on a train set of non-Markovianobservation sequences. Key to the model is that we assume that once the system switches from the normalto the prefault regime it cannot restore and will eventually result in a fault. We refer to the particular setting assemi-supervised since we assume the only information givento the learner is whether a particular sequence ended with a stop(implying that the sequence was generated by the normal regime) orwith a fault (implying that there was a switch from the normal to thefault regime). In the latter case the particular time point at which aswitch occurred is not known. The underlying model used is a switching linear dynamical system (SLDS). The constraints in the regime transitionprobabilities result in an exact inference procedure that scalesquadratically with the length of a sequence. Maximum aposteriori (MAP) parameter estimates can befound using an expectation maximization (EM) algorithm with this inference algorithm in the E-step.For long sequences this will not be practically feasible and anapproximate inference and an approximate EM procedure is calledfor. We describe a flexible class of approximations corresponding todifferent choices of clusters in a Kikuchi free energy with weakconsistency constraints.

私たちは、非マルコフ観測シーケンスのトレーニングセットに基づいて、2つのレジーム(通常レジームと障害前レジームを念頭に置いています)を学習する問題を研究します。このモデルの鍵となるのは、システムが通常レジームから障害前レジームに切り替わると、元に戻すことはできず、最終的には障害につながると想定していることです。学習者に与えられる唯一の情報は、特定のシーケンスが停止で終了したか(つまり、シーケンスが通常レジームによって生成された)、障害で終了したか(つまり、通常レジームから障害レジームへの切り替えがあった)のみであると想定するため、この特定の設定を半教師ありと呼びます。後者の場合、切り替えが発生した特定の時点は不明です。使用される基礎モデルは、切り替え線形動的システム(SLDS)です。レジーム遷移確率の制約により、シーケンスの長さに2乗してスケールする正確な推論手順が実現します。最大事後確率(MAP)パラメータ推定値は、Eステップでこの推論アルゴリズムを使用した期待値最大化(EM)アルゴリズムを使用して見つけることができます。長いシーケンスの場合、これは実際には実行不可能であり、近似推論と近似EM手順が必要です。弱い一貫性制約を持つKikuchi自由エネルギーにおけるクラスターのさまざまな選択に対応する柔軟な近似クラスについて説明します。

What’s Strange About Recent Events (WSARE): An Algorithm for the Early Detection of Disease Outbreaks
What’s Strange About Recent Events (WSARE): 病気の発生を早期に検出するためのアルゴリズム

Traditional biosurveillance algorithms detect disease outbreaks by looking forpeaks in a univariate time series of health-care data. Current health-care surveillance data, however, are no longer simply univariate data streams.Instead, a wealth of spatial, temporal, demographic and symptomaticinformation is available. We present an early disease outbreak detection algorithm called What’s Strange About Recent Events (WSARE), which uses a multivariate approach to improve its timeliness of detection. WSARE employs a rule-based technique that compares recent health-care data against data froma baseline distribution and finds subgroups of the recent data whose proportions have changed the most from the baseline data. In addition, health-care data also pose difficulties for surveillance algorithms because ofinherent temporal trends such as seasonal effects and day of week variations. WSARE approaches this problem using a Bayesian network to produce a baseline distribution that accounts for these temporal trends. The algorithm itself incorporates a wide range of ideas, including association rules, Bayesian networks, hypothesis testing and permutation tests to produce a detection algorithm that is careful to evaluate the significance of the alarms that it raises.

従来のバイオサーベイランスアルゴリズムは、単変量時系列の医療データからピークを探すことで、疾病の発生を検出します。しかし、現在の医療サーベイランスデータは、もはや単なる単変量データストリームではありません。代わりに、空間、時間、人口統計、および症状に関する豊富な情報が利用可能です。私たちは、多変量アプローチを使用して検出の適時性を向上させるWhatのStrange About Recent Events (WSARE)と呼ばれる早期の疾病発生検出アルゴリズムを紹介します。WSAREは、最近の医療データをベースライン分布のデータと比較し、ベースラインデータから比率が最も変化した最近のデータのサブグループを見つけるルールベースの手法を採用しています。さらに、季節の影響や曜日による変動などの固有の時間的傾向のために、医療データはサーベイランスアルゴリズムにとって困難をもたらします。WSAREは、ベイジアンネットワークを使用してこの問題に取り組み、これらの時間的傾向を考慮したベースライン分布を作成します。アルゴリズム自体には、関連ルール、ベイジアンネットワーク、仮説検定、順列検定など、幅広いアイデアが組み込まれており、発生するアラームの重要性を慎重に評価する検出アルゴリズムが生成されます。

A Unifying View of Sparse Approximate Gaussian Process Regression
スパース近似ガウス過程回帰の統一的な見方

We provide a new unifying view, including all existing proper probabilistic sparse approximations for Gaussian process regression. Our approach relies on expressing the effective prior which the methods are using. This allows new insights to be gained, and highlights the relationship between existing methods. It also allows for a clear theoretically justified ranking of the closeness of the known approximations to the corresponding full GPs. Finally we point directly to designs of new better sparse approximations, combining the best of the existing strategies, within attractive computational constraints.

私たちは、ガウス過程回帰の既存の適切な確率的スパース近似をすべて含む、新しい統一ビューを提供します。私たちのアプローチは、メソッドが使用している有効な事前確率を表現することに依存しています。これにより、新たな知見を得ることができ、既存の手法間の関係性が浮き彫りになります。また、既知の近似が対応する完全なGPにどの程度近いかについて、理論的に正当化される明確なランキングも可能になります。最後に、魅力的な計算制約内で、既存の戦略の最良のものを組み合わせた、新しいより優れたスパース近似の設計を直接示します。

New Horn Revision Algorithms
新しいホーン改訂アルゴリズム

A revision algorithm is a learning algorithm that identifiesthe target concept, starting from an initial concept.Such an algorithm is considered efficient if its complexity(in terms of the measured resource) ispolynomial in the syntactic distance between the initialand the target concept, but only polylogarithmic in the number ofvariables in the universe. We give efficient revision algorithms in the modelof learning with equivalence and membership queries.The algorithms work in a general revision model whereboth deletion and addition revision operatorsare allowed. In this model one of the main open problemsis the efficient revision of Horn formulas.Two revision algorithms arepresented for special cases of this problem: for depth-1 acyclic Hornformulas, and for definite Horn formulas with unique heads.

改訂アルゴリズムは、初期概念から始めて、ターゲット概念を識別する学習アルゴリズムです。このようなアルゴリズムは、その複雑さ(測定されたリソースに関して)が初期概念とターゲット概念の間の構文距離で多項式であり、宇宙の変数の数で多対数のみである場合に効率的であると見なされます。私たちは、同等性とメンバーシップクエリを使用した学習モデルで効率的な改訂アルゴリズムを提供します。このアルゴリズムは、削除と追加の両方のリビジョン演算子が許可されている一般的なリビジョンモデルで動作します。このモデルでは、主要な未解決の問題の1つは、ホーン式の効率的な改訂です。この問題の特殊なケースには、深さ1の非巡回ホーン式と、一意のヘッドを持つ明確なホーン式の2つのリビジョンアルゴリズムが表されています。

Working Set Selection Using Second Order Information for Training Support Vector Machines
学習支援ベクトルマシンの 2 次情報を使用したワーキングセットの選択

Working set selection is an important step indecomposition methods for training support vector machines (SVMs).This paper develops a new technique for working set selection inSMO-type decomposition methods.It uses second order information to achieve fast convergence.Theoretical properties such as linear convergence are established.Experiments demonstrate thatthe proposed method is faster than existing selection methodsusing first order information.

ワーキングセットの選択は、サポートベクターマシン(SVM)のトレーニングのための重要なステップ分解方法です。この論文では、SMOタイプの分解法でワーキングセットを選択するための新しい手法を開発します。2次情報を使用して、高速な収束を実現します。線形収束などの理論的特性が確立されます。実験により、提案手法は、一次情報を用いた既存の選択法よりも高速であることが実証されています。

Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
教師なし推論と教師あり推論の特徴選択: 重みベースアプローチにおけるスパース性の出現

The problem of selecting a subset of relevant features in apotentially overwhelming quantity of data is classic and found in manybranches of science. Examples in computer vision, text processing andmore recently bio-informatics are abundant. In text classificationtasks, for example, it is not uncommon to have 104 to107 features of the size of the vocabulary containing wordfrequency counts, with the expectation that only a small fraction ofthem are relevant. Typical examples include the automatic sorting ofURLs into a web directory and the detection of spam email.In this work we present a definition of “relevancy” based onspectral properties of the Laplacian of the features’ measurementmatrix. The feature selection process is then based on a continuousranking of the features defined by a least-squares optimizationprocess. A remarkable property of the feature relevance function isthat sparse solutions for the ranking values naturally emerge as aresult of a “biased non-negativity” of a key matrix in theprocess. As a result, a simple least-squares optimization processconverges onto a sparse solution, i.e., a selection of a subset offeatures which form a local maximum over the relevance function. Thefeature selection algorithm can be embedded in both unsupervised andsupervised inference problems and empirical evidence show that thefeature selections typically achieve high accuracy even when only asmall fraction of the features are relevant.

膨大な量のデータから関連する特徴のサブセットを選択する問題は古典的であり、科学の多くの分野で見られます。コンピュータービジョン、テキスト処理、最近ではバイオインフォマティクスの例は豊富です。たとえば、テキスト分類タスクでは、単語の頻度カウントを含む語彙のサイズの104～107の特徴があり、そのうちのごく一部だけが関連していると予想されます。典型的な例としては、WebディレクトリへのURLの自動ソートやスパムメールの検出などがあります。この研究では、特徴の測定マトリックスのラプラシアンのスペクトル特性に基づく「関連性」の定義を示します。特徴選択プロセスは、最小二乗最適化プロセスによって定義された特徴の連続的なランキングに基づいています。特徴関連性関数の注目すべき特性は、ランク付け値のスパースソリューションが、プロセスにおけるキーマトリックスの「バイアスされた非負性」の結果として自然に出現することです。その結果、単純な最小二乗最適化プロセスは、スパースソリューション、つまり関連性関数の局所的最大値を形成する特徴のサブセットの選択に収束します。特徴選択アルゴリズムは、教師なしおよび教師ありの両方の推論問題に組み込むことができ、経験的証拠は、特徴のごく一部だけが関連している場合でも、特徴選択が通常高い精度を達成することを示しています。

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
複数のタスクとラベル付けされていないデータから予測構造を学習するためのフレームワーク

One of the most important issues in machine learning is whether onecan improve the performance of a supervised learning algorithm byincluding unlabeled data. Methods that use both labeled and unlabeleddata are generally referred to as semi-supervised learning. Althougha number of such methods are proposed, at the current stage, we stilldon’t have a complete understanding of their effectiveness. Thispaper investigates a closely related problem, which leads to a novelapproach to semi-supervised learning. Specifically we considerlearning predictive structures on hypothesis spaces (that is, whatkind of classifiers have good predictive power) from multiple learningtasks. We present a general framework in which the structurallearning problem can be formulated and analyzed theoretically, andrelate it to learning with unlabeled data. Under this framework,algorithms for structural learning will be proposed, and computationalissues will be investigated. Experiments will be given to demonstratethe effectiveness of the proposed algorithms in the semi-supervisedlearning setting.

機械学習における最も重要な問題の1つは、ラベルなしデータを含めることで教師あり学習アルゴリズムのパフォーマンスを改善できるかどうかです。ラベル付きデータとラベルなしデータの両方を使用する方法は、一般に半教師あり学習と呼ばれます。このような方法は数多く提案されていますが、現段階では、その有効性についてはまだ完全には理解されていません。この論文では、密接に関連する問題を調査しており、半教師あり学習への新しいアプローチにつながっています。具体的には、複数の学習タスクから仮説空間上の予測構造(つまり、どのような分類器が優れた予測力を持つか)を学習することを検討します。私たちは、構造学習問題を理論的に定式化して分析し、ラベルなしデータによる学習に関連付けることができる一般的なフレームワークを提示します。このフレームワークの下で、構造学習のアルゴリズムを提案し、計算上の問題を調査します。提案されたアルゴリズムの半教師あり学習設定での有効性を示す実験を行います。

Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
ガウス過程潜在変数モデルによる確率的非線形主成分分析

Summarising a high dimensional data set with a low dimensional embeddingis a standard approach for exploring its structure. In this paperwe provide an overview of some existing techniques for discoveringsuch embeddings. We then introduce a novel probabilistic interpretationof principal component analysis (PCA) that we term dual probabilisticPCA (DPPCA). The DPPCA model has the additional advantage that thelinear mappings from the embedded space can easily be non-linearisedthrough Gaussian processes. We refer to this model as a Gaussian processlatent variable model (GP-LVM). Through analysis of the GP-LVM objectivefunction, we relate the model to popular spectral techniques suchas kernel PCA and multidimensional scaling. We then review a practicalalgorithm for GP-LVMs in the context of large data sets and developit to also handle discrete valued data and missing attributes. Wedemonstrate the model on a range of real-world and artificially generateddata sets.

高次元データセットと低次元埋め込みを合計することは、その構造を調査するための標準的なアプローチです。このペーパーでは、そのような埋め込みを発見するためのいくつかの既存の手法の概要を提供します。次に、主成分分析(PCA)の新しい確率的解釈を導入し、これを双対確率PCA(DPPCA)と名付けます。DPPCAモデルには、埋め込み空間からの線形マッピングをガウスプロセスを通じて簡単に非線形化できるという追加の利点があります。このモデルをガウス過程潜在変数モデル(GP-LVM)と呼びます。GP-LVMの目的関数の解析を通じて、モデルをカーネルPCAや多次元スケーリングなどの一般的なスペクトル手法に関連付けます。次に、大規模なデータセットのコンテキストでのGP-LVMの実用的なアルゴリズムをレビューし、離散値データと欠落属性も処理するように開発します。実世界のデータセットと人工的に生成されたデータセットでモデルを示します。

Combining Information Extraction Systems Using Voting and Stacked Generalization
投票とスタックジェネリック化を使用した情報抽出システムの組み合わせ

This article investigates the effectiveness of voting and stackedgeneralization -also known as stacking- in the context of informationextraction (IE). A new stacking framework is proposed thataccommodates well-known approaches for IE. The key idea is to performcross-validation on the base-level data set, which consists of textdocuments annotated with relevant information, in order to create ameta-level data set that consists of feature vectors. A classifier isthen trained using the new vectors. Therefore, base-level IE systemsare combined with a common classifier at the meta-level. Variousvoting schemes are presented for comparing against stacking in variousIE domains. Well known IE systems are employed at the base-level,together with a variety of classifiers at the meta-level. Results showthat both voting and stacking work better when relying onprobabilistic estimates by the base-level systems. Voting proved to beeffective in most domains in the experiments. Stacking, on the otherhand, proved to be consistently effective over all domains, doingcomparably or better than voting and always better than the bestbase-level systems. Particular emphasis is also given to explainingthe results obtained by voting and stacking at the meta-level, withrespect to the varying degree of similarity in the output of thebase-level systems.

この記事では、情報抽出(IE)のコンテキストにおける投票とスタック一般化(スタッキングとも呼ばれる)の有効性について調査します。IEのよく知られたアプローチに対応する新しいスタッキングフレームワークが提案されています。重要なアイデアは、関連情報で注釈が付けられたテキストドキュメントで構成されるベースレベルのデータセットに対してクロス検証を実行し、特徴ベクトルで構成されるメタレベルのデータセットを作成することです。次に、分類器が新しいベクトルを使用してトレーニングされます。したがって、ベースレベルのIEシステムは、メタレベルの共通の分類器と組み合わせられます。さまざまなIEドメインでのスタッキングと比較するためのさまざまな投票方式が提示されます。よく知られているIEシステムがベースレベルで使用され、メタレベルのさまざまな分類器が一緒に使用されます。結果は、投票とスタッキングの両方が、ベースレベルのシステムによる確率的推定値に依存する場合により効果的に機能することを示しています。投票は、実験でほとんどのドメインで有効であることが証明されました。一方、スタッキングは、すべてのドメインで一貫して効果的であることが証明されており、投票と同等かそれ以上の成果を上げ、常に最高のベースレベルシステムよりも優れています。また、ベースレベルシステムの出力の類似性の程度の違いに関して、メタレベルでの投票とスタッキングによって得られた結果を説明することにも特に重点が置かれています。

Clustering with Bregman Divergences
ブレグマン発散法によるクラスタリング

A wide variety of distortion functions, such as squared Euclidean distance, Mahalanobis distance, Itakura-Saito distance and relative entropy, have been used for clustering. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroid-based parametric clustering approaches, such as classical kmeans, the Linde-Buzo-Gray (LBG) algorithm and information-theoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the method to a large class of clustering loss functions. This is achieved by first posing the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by rate distortion theory, and then deriving an iterative algorithm that monotonically decreases this loss. In addition, we show that there is a bijection between regular exponential families and a large class of Bregman divergences, that we call regular Bregman divergences. This result enables the development of an alternative interpretation of an efficient EM scheme for learning mixtures of exponential family distributions, and leads to a simple soft clustering algorithm for regular Bregman divergences. Finally, we discuss the connection between rate distortion theory and Bregman clustering and present an information theoretic analysis of Bregman clustering algorithms in terms of a trade-off between compression and loss in Bregman information.

クラスタリングには、二乗ユークリッド距離、マハラノビス距離、板倉-斎藤距離、相対エントロピーなど、さまざまな歪み関数が使用されています。この論文では、ブレグマンダイバージェンスと呼ばれる歪み関数の大規模なクラスに基づくパラメトリックハードおよびソフトクラスタリングアルゴリズムを提案し、分析します。提案されたアルゴリズムは、古典的なkmeans、Linde-Buzo-Gray (LBG)アルゴリズム、およびブレグマンダイバージェンスの特別な選択によって生じる情報理論的クラスタリングなど、セントロイドベースのパラメトリッククラスタリング手法を統合します。アルゴリズムは、古典的なkmeansアルゴリズムのシンプルさとスケーラビリティを維持しながら、その方法を大規模なクラスタリング損失関数に一般化します。これは、まず、レート歪み理論によって動機付けられた量であるブレグマン情報の損失を最小化するという観点からハードクラスタリング問題を提示し、次にこの損失を単調に減少させる反復アルゴリズムを導出することで実現されます。さらに、正則指数族とブレグマンダイバージェンスの大きなクラスとの間に一対一の関係があることを示し、これを正則ブレグマンダイバージェンスと呼びます。この結果により、指数族分布の混合を学習するための効率的なEMスキームの代替解釈の開発が可能になり、正則ブレグマンダイバージェンス用のシンプルなソフトクラスタリングアルゴリズムが実現します。最後に、レート歪み理論とブレグマンクラスタリングの関係について説明し、ブレグマン情報の圧縮と損失のトレードオフの観点から、ブレグマンクラスタリングアルゴリズムの情報理論的分析を示します。

Assessing Approximate Inference for Binary Gaussian Process Classification
バイナリガウス過程分類の近似推論の評価

Gaussian process priors can be used to define flexible, probabilisticclassification models. Unfortunately exact Bayesian inference isanalytically intractable and various approximation techniques havebeen proposed. In this work we review and compare Laplace’s method andExpectation Propagation for approximate Bayesian inference in thebinary Gaussian process classification model. We present acomprehensive comparison of the approximations, their predictiveperformance and marginal likelihood estimates to results obtained byMCMC sampling. We explain theoretically and corroborate empiricallythe advantages of Expectation Propagation compared to Laplace’smethod.

ガウス過程事前確率を使用して、柔軟な確率分類モデルを定義できます。残念ながら、正確なベイズ推論は分析的に扱いにくく、さまざまな近似手法が提案されています。この作業では、バイナリガウス過程分類モデルにおける近似ベイズ推論のためのラプラスの方法と期待伝播をレビューし、比較します。近似、それらの予測性能、および周辺尤度の推定値を、MCMCサンプリングによって得られた結果と包括的に比較します。ラプラスの方法と比較した期待伝播の利点を理論的に説明し、経験的に裏付けます。

Active Coevolutionary Learning of Deterministic Finite Automata
決定論的有限オートマトンの能動的共進化学習

This paper describes an active learning approach to the problem ofgrammatical inference, specifically the inference of deterministicfinite automata (DFAs). We refer to the algorithm as theestimation-exploration algorithm (EEA). This approach differs fromprevious passive and active learning approaches to grammaticalinference in that training data is actively proposed by thealgorithm, rather than passively receiving training data from someexternal teacher. Here we show that this algorithm outperforms oneversion of the most powerful set of algorithms for grammaticalinference, evidence driven state merging (EDSM), onrandomly-generated DFAs. The performance increase is due to thefact that the EDSM algorithm only works well for DFAs withspecific balances (percentage of positive labelings), while theEEA is more consistent over a wider range of balances. Based onthis finding we propose a more general method for generating DFAsto be used in the development of future grammatical inferencealgorithms.

この論文では、文法推論、特に決定論的有限オートマトン(DFA)の推論の問題に対する能動学習アプローチについて説明します。このアルゴリズムを推定探索アルゴリズム(EEA)と呼びます。このアプローチは、外部の教師からトレーニングデータを受動的に受け取るのではなく、トレーニングデータがアルゴリズムによって能動的に提案されるという点で、文法推論に対するこれまでの受動学習および能動学習アプローチとは異なります。ここでは、このアルゴリズムが、ランダムに生成されたDFAに対して、文法推論の最も強力なアルゴリズムセットの1つのバージョンである証拠駆動型状態マージ(EDSM)よりも優れていることを示します。パフォーマンスが向上するのは、EDSMアルゴリズムが特定のバランス(肯定的なラベル付けのパーセンテージ)を持つDFAに対してのみ適切に機能するのに対し、EEAはより広範囲のバランスでより一貫しているためです。この発見に基づいて、将来の文法推論アルゴリズムの開発で使用するための、より一般的なDFA生成方法を提案します。

Managing Diversity in Regression Ensembles
回帰アンサンブルにおけるダイバーシティの管理

Ensembles are a widely used and effective technique in machine learning—theirsuccess is commonly attributed to the degree of disagreement, or ‘diversity’,within the ensemble. For ensembles where the individual estimators outputcrisp class labels, this ‘diversity’ is not well understood and remains an openresearch issue. For ensembles of regression estimators, the diversity can beexactly formulated in terms of the covariance between individual estimatoroutputs, and the optimum level is expressed in terms of abias-variance-covariance trade-off. Despite this, most approaches tolearning ensembles use heuristics to encourage the right degree of diversity.In this work we show how to explicitly control diversity through the errorfunction. The first contribution of this paper is to show that by takingthe combination mechanism for the ensemble into account we can derive an errorfunction for each individual that balances ensemble diversity with individualaccuracy. We show the relationship between this error function and anexisting algorithm called negative correlation learning, which uses aheuristic penalty term added to the mean squared error function. It isdemonstrated that these methods control the bias-variance-covariance trade-offsystematically, and can be utilised with any estimator capable of minimising aquadratic error function, for example MLPs, or RBF networks.As a second contribution, we derive a strict upper bound on the coefficient ofthe penalty term, which holds for any estimator that can be cast in ageneralised linear regression framework, with mild assumptions on the basisfunctions. Finally we present the results of an empirical study, showingsignificant improvements over simple ensemble learning, and finding that thistechnique is competitive with a variety of methods, including boosting,bagging, mixtures of experts, and Gaussian processes, on a number of tasks.

アンサンブルは機械学習で広く使用されている効果的な手法です。その成功は、アンサンブル内の不一致の度合い、つまり「多様性」に起因していると一般に考えられています。個々の推定器が明確なクラスラベルを出力するアンサンブルの場合、この「多様性」は十分に理解されておらず、未解決の研究課題のままです。回帰推定器のアンサンブルの場合、多様性は個々の推定器の出力間の共分散の観点から正確に定式化でき、最適レベルはバイアス、分散、共分散のトレードオフの観点から表現されます。それにもかかわらず、アンサンブルを学習するほとんどのアプローチでは、適切な程度の多様性を促進するためにヒューリスティックを使用しています。この研究では、エラー関数を通じて多様性を明示的に制御する方法を示します。この論文の最初の貢献は、アンサンブルの組み合わせメカニズムを考慮することで、アンサンブルの多様性と個々の精度のバランスをとる各個人のエラー関数を導出できることを示すことです。このエラー関数と、平均二乗誤差関数に追加されたヒューリスティックなペナルティ項を使用する負の相関学習と呼ばれる既存のアルゴリズムとの関係を示します。これらの方法は、バイアス-分散-共分散のトレードオフを体系的に制御し、MLPやRBFネットワークなど、2次誤差関数を最小化できる任意の推定量で利用できることが実証されています。2番目の貢献として、ペナルティ項の係数の厳密な上限を導出します。これは、基底関数に関する軽度の仮定を使用して、一般化線形回帰フレームワークにキャストできる任意の推定量に当てはまります。最後に、実証的研究の結果を提示し、単純なアンサンブル学習に比べて大幅な改善が見られ、この手法が、多くのタスクにおいてブースティング、バギング、専門家の混合、ガウス過程などのさまざまな方法と競合できることがわかりました。

Fast Kernel Classifiers with Online and Active Learning
オンライン学習とアクティブラーニングによる高速カーネル分類器

Very high dimensional learning systems become theoretically possible whentraining examples are abundant. The computing cost then becomes the limitingfactor. Any efficient learning algorithm should at least take a brief look ateach example. But should all examples be given equal attention?This contribution proposes an empirical answer. We first present an online SVM algorithm based on this premise. LASVM yields competitive misclassification rates after a single pass over the training examples, outspeeding state-of-the-art SVM solvers. Then we show how active example selection can yield faster training, higher accuracies, and simpler models, using only a fraction of the training example labels.

非常に高次元の学習システムは、トレーニングの例が豊富にあるときに理論的に可能になります。その後、コンピューティングコストが制限要因になります。効率的な学習アルゴリズムは、少なくとも各例を簡単に見ておく必要があります。しかし、すべての例に等しく注意を払うべきなのでしょうか?この貢献は、経験的な答えを提案します。まず、この前提に基づくオンラインSVMアルゴリズムを紹介します。LASVMは、トレーニング例を1回通過するだけで、最先端のSVMソルバーを凌駕する競争力のある誤分類率を生み出します。次に、アクティブなサンプル選択により、トレーニング例のラベルの一部のみを使用して、トレーニングの高速化、精度の向上、モデルの簡素化を実現する方法を示します。

A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
ディリクレプロセス事前分布による教師付きクラスタリングのためのベイズモデル

We develop a Bayesian framework for tackling the supervised clusteringproblem, the generic problem encountered in tasks such as referencematching, coreference resolution, identity uncertainty and recordlinkage. Our clustering model is based on the Dirichlet processprior, which enables us to define distributions over the countablyinfinite sets that naturally arise in this problem. We addsupervision to our model by positing the existence of a set ofunobserved random variables (we call these “reference types”) thatare generic across all clusters. Inference in our framework, whichrequires integrating over infinitely many parameters, is solved usingMarkov chain Monte Carlo techniques. We present algorithms for bothconjugate and non-conjugate priors. We present a simple—butgeneral—parameterization of our model based on a Gaussianassumption. We evaluate this model on one artificial task and threereal-world tasks, comparing it against both unsupervised andstate-of-the-art supervised algorithms. Our results show that ourmodel is able to outperform other models across a variety of tasks andperformance metrics.

私たちは、参照マッチング、共参照解決、アイデンティティの不確実性、レコードのリンクなどのタスクで遭遇する一般的な問題である、教師ありクラスタリング問題に取り組むためのベイジアンフレームワークを開発しています。私たちのクラスタリングモデルは、この問題で自然に発生する可算無限集合上の分布を定義できるディリクレ過程事前分布に基づいています。私たちは、すべてのクラスターにわたって一般的な、観測されないランダム変数のセット(私たちはこれを「参照タイプ」と呼びます)の存在を仮定することで、モデルに教師を追加します。無限に多くのパラメーターを積分する必要がある私たちのフレームワークでの推論は、マルコフ連鎖モンテカルロ手法を使用して解決されます。私たちは、共役事前分布と非共役事前分布の両方のアルゴリズムを提示します。私たちは、ガウス仮定に基づく、私たちのモデルの単純だが一般的なパラメーター化を提示します。このモデルを1つの人工タスクと3つの実際のタスクで評価し、教師なしアルゴリズムと最新の教師ありアルゴリズムの両方と比較しました。結果は、このモデルがさまざまなタスクとパフォーマンスメトリックにわたって他のモデルよりも優れていることを示しています。

Local Propagation in Conditional Gaussian Bayesian Networks
条件付きガウスベイジアンネットワークにおける局所伝播

This paper describes a scheme for local computation in conditionalGaussian Bayesian networks that combines the approach ofLauritzen and Jensen (2001) with some elements ofShachter and Kenley (1989). Message passing takes place on anelimination tree structure rather than the more compact (and usual)junction tree of cliques. This yields a local computation scheme inwhich all calculations involving the continuous variables areperformed by manipulating univariate regressions, and hence matrixoperations are avoided.

この論文では、Lauritzen and Jensen (2001)のアプローチとShachter and Kenley (1989)のいくつかの要素を組み合わせた、条件付きガウスベイジアンネットワークでの局所計算のスキームについて説明します。メッセージの受け渡しは、よりコンパクトな(そして通常の)派閥のジャンクションツリーではなく、消去ツリー構造で行われます。これにより、連続変数を含むすべての計算が単変量回帰を操作することによって実行されるローカル計算スキームが実現するため、行列演算が回避されます。

Frames, Reproducing Kernels, Regularization and Learning
フレーム、カーネルの再現、正則化と学習

This work deals with a method for building a reproducing kernelHilbert space (RKHS) from a Hilbert space with frame elementshaving special properties. Conditions on existence and a method ofconstruction are given. Then, these RKHS are used within theframework of regularization theory for function approximation.Implications on semiparametric estimation are discussed and amultiscale scheme of regularization is also proposed. Results on toy and real-world approximation problems illustrate theeffectiveness of such methods.

この研究では、特殊な特性を持つフレーム要素を持つヒルベルト空間から、再生カーネルヒルベルト空間(RKHS)を構築する方法を扱っています。存在条件と建設方法が示されています。次に、これらのRKHSは、関数近似のための正則化理論の枠組みで使用されます。セミパラメトリック推定への影響について議論し、正則化のマルチスケールスキームも提案します。玩具問題と実世界の近似問題の結果は、このような方法の有効性を示しています。

Large Margin Methods for Structured and Interdependent Output Variables
構造化出力変数と相互依存出力変数の大きなマージン法

Learning general functional dependencies between arbitrary input andoutput spaces is one of the key challenges in computationalintelligence. While recent progress in machine learning has mainlyfocused on designing flexible and powerful input representations, thispaper addresses the complementary issue of designing classificationalgorithms that can deal with more complex outputs, such as trees,sequences, or sets. More generally, we consider problems involvingmultiple dependent output variables, structured output spaces, andclassification problems with class attributes. In order to accomplishthis, we propose to appropriately generalize the well-known notion ofa separation margin and derive a corresponding maximum-marginformulation. While this leads to a quadratic program with apotentially prohibitive, i.e. exponential, number of constraints, wepresent a cutting plane algorithm that solves the optimization problemin polynomial time for a large class of problems. The proposed methodhas important applications in areas such as computational biology,natural language processing, information retrieval/extraction, andoptical character recognition. Experiments from various domainsinvolving different types of output spaces emphasize the breadth andgenerality of our approach.

任意の入力と出力空間間の一般的な機能的依存関係を学習することは、計算知能における重要な課題の1つです。機械学習の最近の進歩は、柔軟で強力な入力表現の設計に主に焦点を当てていますが、この論文では、ツリー、シーケンス、セットなどのより複雑な出力を処理できる分類アルゴリズムの設計という補完的な問題に取り組んでいます。より一般的には、複数の従属出力変数、構造化された出力空間、クラス属性を持つ分類問題を含む問題を検討します。これを実現するために、よく知られている分離マージンの概念を適切に一般化し、対応する最大マージン定式化を導出することを提案します。これは、潜在的に禁止されている、つまり指数的な数の制約を持つ二次計画につながりますが、大規模なクラスの問題に対して多項式時間で最適化問題を解決する切断面アルゴリズムを示します。提案された方法は、計算生物学、自然言語処理、情報検索/抽出、光学文字認識などの分野で重要な用途があります。さまざまな出力スペースを含むさまざまなドメインからの実験は、私たちのアプローチの幅広さと一般性を強調しています。

A Bayes Optimal Approach for Partitioning the Values of Categorical Attributes
カテゴリ属性の値を分割するためのベイズ最適アプローチ

In supervised machine learning, the partitioning of the values (also called grouping) of a categorical attribute aims at constructing a new synthetic attribute which keeps the information of the initial attribute and reduces the number of its values. In this paper, we propose a new grouping method MODL founded on a Bayesian approach. The method relies on a model space of grouping models and on a prior distribution defined on this model space. This results in an evaluation criterion of grouping, which is minimal for the most probable grouping given the data, i.e. the Bayes optimal grouping. We propose new super-linear optimization heuristics that yields near-optimal groupings. Extensive comparative experiments demonstrate that the MODL grouping method builds high quality groupings in terms of predictive quality, robustness and small number of groups.

教師あり機械学習では、カテゴリ属性の値の分割(グループ化とも呼ばれます)は、初期属性の情報を保持し、その値の数を減らす新しい合成属性を構築することを目的としています。この論文では、ベイズアプローチに基づく新しいグループ化法MODLを提案します。この方法は、グループ化モデルのモデル空間と、このモデル空間で定義された事前分布に依存します。これにより、グループ化の評価基準が得られますが、これは、データから最も可能性の高いグループ化、つまりベイズ最適グループ化に対して最小です。私たちは、最適に近いグループ化を生成する新しい超線形最適化ヒューリスティックを提案します。広範な比較実験により、MODLグルーピング法は、予測品質、ロバスト性、および少数のグループの観点から高品質のグルーピングを構築することが示されています。

Maximum Margin Algorithms with Boolean Kernels
ブールカーネルを使用した最大証拠金アルゴリズム

Recent work has introduced Boolean kernels with which one can learnlinear threshold functions over a feature space containing allconjunctions of length up to k (for any 1 ≤k ≤ n) over the original n Booleanfeatures in the input space. This motivates the question of whethermaximum margin algorithms such as Support Vector Machines can learnDisjunctive Normal Form expressions in the Probably ApproximatelyCorrect (PAC) learning model by using this kernel. We study thisquestion, as well as a variant in which structural risk minimization(SRM) is performed where the class hierarchy is taken over the lengthof conjunctions.We show that maximum margin algorithms using the Boolean kernels donot PAC learn t(n)-term DNF for any t(n)= ω(1), even when used with such a SRM scheme. We alsoconsider PAC learning under the uniform distribution and show that ifthe kernel uses conjunctions of length˜ω(√n) then the maximum margin hypothesiswill fail on the uniform distribution as well. Our results concretelyillustrate that margin based algorithms may overfit when learningsimple target functions with natural kernels.

最近の作業では、入力空間の元のn個のブール特徴に対して、長さがk (1≤k≤nの場合)までのすべての結合を含む特徴空間上の線形しきい値関数を学習できるブールカーネルが導入されました。これにより、サポートベクターマシンなどの最大マージンアルゴリズムが、このカーネルを使用しておそらくほぼ正しい(PAC)学習モデルの選言正規形表現を学習できるかどうかという疑問が生じます。私たちはこの問題と、クラス階層が結合の長さにわたって取られる構造リスク最小化(SRM)が実行される変形を研究します。ブールカーネルを使用する最大マージンアルゴリズムは、そのようなSRMスキームで使用した場合でも、t(n) =ω(1)の任意のt(n)項DNFをPAC学習しないことを示します。また、一様分布でのPAC学習を検討し、カーネルが長さ ˜ω(√n)の結合を使用する場合、最大マージン仮説は一様分布でも失敗することを示します。私たちの結果は、マージンベースのアルゴリズムが、自然なカーネルで単純なターゲット関数を学習するときに過剰適合する可能性があることを具体的に示しています。

Inner Product Spaces for Bayesian Networks
ベイジアンネットワークの内部積空間

Bayesian networks have become one of the major models used forstatistical inference. We study the question whether the decisionscomputed by a Bayesian network can be represented within alow-dimensional inner product space. We focus on two-labelclassification tasks over the Boolean domain. As main results weestablish upper and lower bounds on the dimension of the inner productspace for Bayesian networks with an explicitly given (full or reduced)parameter collection. In particular, these bounds are tight up to afactor of 2. For some nontrivial cases of Bayesian networks weeven determine the exact values of this dimension. We furtherconsider logistic autoregressive Bayesian networks and show that everysufficiently expressive inner product space must have dimension atleast Ω(n2), where n is the number ofnetwork nodes. We also derive the bound2Ω(n) for an artificial variant ofthis network, thereby demonstrating the limits of our approach andraising an interesting open question. As a major technicalcontribution, this work reveals combinatorial and algebraic structureswithin Bayesian networks such that known methods for the derivation oflower bounds on the dimension of inner product spaces can be broughtinto play.

ベイジアンネットワークは、統計的推論に使用される主要なモデルの1つになっています。ベイジアンネットワークによって計算された決定が低次元の内積空間内で表現できるかどうかという問題を研究します。ブール領域での2ラベル分類タスクに焦点を当てます。主な結果として、明示的に指定された(完全または縮小された)パラメータコレクションを持つベイジアンネットワークの内積空間の次元の上限と下限を確立します。特に、これらの境界は2倍まで厳密です。ベイジアンネットワークのいくつかの重要なケースでは、この次元の正確な値も決定します。さらに、ロジスティック自己回帰ベイジアンネットワークを検討し、十分に表現力のある内積空間はすべて、少なくとも Ω(n2)の次元を持つ必要があることを示します。ここで、nはネットワークノードの数です。また、このネットワークの人工的なバリアントの境界2Ω(n)を導出することで、アプローチの限界を示し、興味深い未解決の問題を提起します。主要な技術的貢献として、この作業は、ベイジアンネットワーク内の組み合わせ構造と代数構造を明らかにし、内積空間の次元の下限を導出する既知の方法を活用できるようにします。

Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
フォンミーゼスフィッシャー分布を用いた超球単位でのクラスタリング

Several large scale data mining applications, such as textcategorization and gene expression analysis, involve high-dimensionaldata that is also inherently directional in nature. Often such datais L2 normalized so that it lies on the surface of aunit hypersphere. Popular models such as (mixtures of) multi-variateGaussians are inadequate for characterizing such data. This paperproposes a generative mixture-model approach to clustering directionaldata based on the von Mises-Fisher (vMF) distribution, which arisesnaturally for data distributed on the unit hypersphere. Inparticular, we derive and analyze two variants of the ExpectationMaximization (EM) framework for estimating the mean and concentrationparameters of this mixture. Numerical estimation of the concentrationparameters is non-trivial in high dimensions since it involvesfunctional inversion of ratios of Bessel functions. We also formulatetwo clustering algorithms corresponding to the variants of EM that wederive. Our approach provides a theoretical basis for the use ofcosine similarity that has been widely employed by the informationretrieval community, and obtains the spherical kmeans algorithm(kmeans with cosine similarity) as a special case of both variants.Empirical results on clustering of high-dimensional text andgene-expression data based on a mixture of vMF distributions show thatthe ability to estimate the concentration parameter for each vMFcomponent, which is not present in existing approaches, yieldssuperior results, especially for difficult clustering tasks inhigh-dimensional spaces.

テキスト分類や遺伝子発現解析などの大規模データマイニングアプリケーションには、本質的に方向性のある高次元データが含まれることがあります。多くの場合、このようなデータはL2正規化され、単位超球面の表面上に配置されます。多変量ガウス分布(の混合)などの一般的なモデルは、このようなデータの特性評価には不十分です。この論文では、単位超球面に分布するデータに自然に生じるフォンミーゼスフィッシャー(vMF)分布に基づいて、方向性データをクラスタリングするための生成混合モデルアプローチを提案します。特に、この混合の平均と濃度パラメータを推定するための期待値最大化(EM)フレームワークの2つのバリアントを導出して分析します。濃度パラメータの数値推定は、ベッセル関数の比の機能反転を伴うため、高次元では簡単ではありません。また、導出したEMのバリアントに対応する2つのクラスタリングアルゴリズムも定式化します。私たちのアプローチは、情報検索コミュニティで広く採用されているコサイン類似度の使用に関する理論的基礎を提供し、球面kmeansアルゴリズム(コサイン類似度を持つkmeans)を両方のバリアントの特殊なケースとして取得します。vMF分布の混合に基づく高次元テキストと遺伝子発現データのクラスタリングに関する実験結果は、既存のアプローチには存在しない各vMFコンポーネントの濃度パラメータを推定する機能により、特に高次元空間での困難なクラスタリングタスクで優れた結果が得られることを示しています。

Efficient Computation of Gapped Substring Kernels on Large Alphabets
大きなアルファベット上のギャップ部分文字列カーネルの効率的な計算

We present a sparse dynamic programming algorithm that, given twostrings s and t , a gap penalty λ, and an integerp, computes the value of the gap-weighted length-psubsequences kernel. The algorithm works in time O(p|M| log |t|), where M = {(i,j) | si = tj} is the set of matches ofcharacters in the two sequences. The algorithm is easily adapted tohandle bounded length subsequences and different gap-penalty schemes,including penalizing by the total length of gaps and the number ofgaps as well as incorporating character-specific match/gap penalties. The new algorithm is empirically evaluated against a full dynamicprogramming approach and a trie-based algorithm both on synthetic andnewswire article data. Based on the experiments, the full dynamicprogramming approach is the fastest on short strings, and on longstrings if the alphabet is small. On large alphabets, the new sparsedynamic programming algorithm is the most efficient. On medium-sizedalphabets the trie-based approach is best if the maximum number ofallowed gaps is strongly restricted.

私たちは、2つの文字列sとt、ギャップペナルティ λ、および整数pが与えられた場合に、ギャップ重み付き長さpサブシーケンスカーネルの値を計算するスパース動的プログラミングアルゴリズムを紹介します。このアルゴリズムは、時間O(p|M| log |t|)で動作します。ここで、M = {(i,j) | si = tj}は、2つのシーケンス内の文字の一致セットです。このアルゴリズムは、長さが制限されたサブシーケンスや、ギャップの合計長さやギャップの数によるペナルティ、文字固有の一致/ギャップペナルティの組み込みなど、さまざまなギャップペナルティスキームの処理に簡単に適応できます。新しいアルゴリズムは、合成データとニュース記事データの両方で、完全な動的プログラミングアプローチとトライツリーベースのアルゴリズムに対して実験的に評価されています。実験に基づくと、完全な動的プログラミングアプローチは、短い文字列では最も高速で、アルファベットが小さい場合は長い文字列でも高速です。大きなアルファベットでは、新しいスパース動的プログラミングアルゴリズムが最も効率的です。中サイズのアルファベットでは、許容されるギャップの最大数が厳しく制限されている場合、トライツリーベースのアプローチが最適です。

Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
理論学習のための普遍的アルゴリズム第I 部 : 区分定数関数

This paper is concerned with the construction and analysis of auniversal estimator for the regression problem in supervised learning.Universal means that the estimator does not depend on any a prioriassumptions about the regression function to be estimated. Theuniversal estimator studied in this paper consists of a least-squarefitting procedure using piecewise constant functions on a partitionwhich depends adaptively on the data. The partition is generated by asplitting procedure which differs from those used in CART algorithms.It is proven that this estimator performs at the optimal convergencerate for a wide class of priors on the regression function. Namely,as will be made precise in the text, if the regression function is inany one of a certain class of approximation spaces (or smoothnessspaces of order not exceeding one — a limitation resulting becausethe estimator uses piecewise constants) measured relative to themarginal measure, then the estimator converges to the regressionfunction (in the least squares sense) with an optimal rate ofconvergence in terms of the number of samples. The estimator is alsonumerically feasible and can be implemented on-line.

この論文では、教師あり学習における回帰問題に対する普遍的な推定量の構築と分析に関するものです。普遍的とは、推定量が推定される回帰関数に関する事前仮定に依存しないことを意味します。この論文で検討される普遍的な推定量は、データに適応的に依存するパーティション上の区分定数関数を使用した最小二乗法で構成されます。パーティションは、CARTアルゴリズムで使用されるものとは異なる分割手順によって生成されます。この推定量は、回帰関数の幅広いクラスの事前確率に対して最適な収束率で機能することが証明されています。つまり、本文で詳しく説明するように、回帰関数が、限界測度を基準として測定された近似空間（または1を超えない次数の平滑空間-推定器が区分定数を使用するために生じる制限）の特定のクラスのいずれかにある場合、推定器は、サンプル数に関して最適な収束率で回帰関数（最小二乗の意味で）に収束します。推定器は数値的にも実行可能であり、オンラインで実装できます。

An MDP-Based Recommender System
MDPベースのレコメンダーシステム

Typical recommender systems adopt a static view of the recommendationprocess and treat it as a prediction problem. We argue that it ismore appropriate to view the problem of generating recommendations asa sequential optimization problem and, consequently, that Markovdecision processes (MDPs) provide a more appropriate model forrecommender systems. MDPs introduce two benefits: they take intoaccount the long-term effects of each recommendation andthe expected value of each recommendation. To succeed inpractice, an MDP-based recommender system must employ a strong initialmodel, must be solvable quickly, and should not consume too muchmemory. In this paper, we describe our particular MDP model, itsinitialization using a predictive model, the solution and updatealgorithm, and its actual performance on a commercial site. We alsodescribe the particular predictive model we used which outperformsprevious models. Our system is one of a small number of commerciallydeployed recommender systems. As far as we know, it is the first toreport experimental analysis conducted on areal commercial site. These results validate the commercial valueof recommender systems, and in particular, of our MDP-based approach.

一般的なレコメンデーションシステムは、レコメンデーションプロセスを静的に捉え、予測問題として扱います。レコメンデーション生成の問題を逐次最適化問題として捉える方が適切であり、その結果、マルコフ決定プロセス(MDP)がレコメンデーションシステムのより適切なモデルとなると私たちは考えています。MDPには2つの利点があります。各レコメンデーションの長期的な影響と各レコメンデーションの期待値を考慮に入れることです。実際に成功するには、MDPベースのレコメンデーションシステムは強力な初期モデルを採用し、迅速に解決でき、メモリをあまり消費しない必要があります。この論文では、私たちの特定のMDPモデル、予測モデルを使用したその初期化、ソリューションと更新アルゴリズム、商用サイトでの実際のパフォーマンスについて説明します。また、以前のモデルよりも優れたパフォーマンスを発揮する、私たちが使用した特定の予測モデルについても説明します。私たちのシステムは、商業的に展開されている数少ないレコメンデーションシステムの1つです。我々の知る限り、これは実際の商用サイトで実施された実験分析を報告する最初のものです。これらの結果は、レコメンデーションシステム、特に我々のMDPベースのアプローチの商業的価値を検証します。

Concentration Bounds for Unigram Language Models
ユニグラム言語モデルの濃度限界

We show several high-probability concentration bounds for learningunigram language models. One interesting quantity is theprobability of all words appearing exactly k times in a sampleof size m. A standard estimator for this quantity is theGood-Turing estimator. The existing analysis on its error shows ahigh-probability bound of approximately O(k / m1/2). We improve its dependency on k to O(k1/4 / m1/2 + k / m). We also analyze theempirical frequencies estimator, showing that with highprobability its error is bounded by approximately O( 1 /k + k1/2 / m). We derive a combined estimator,which has an error of approximately O(m-2/5), for any k.A standard measure for the quality of a learning algorithm is itsexpected per-word log-loss. The leave-one-out method can be usedfor estimating the log-loss of the unigram model. We show that itserror has a high-probability bound of approximately O(1 / m1/2), for any underlying distribution.We also bound the log-loss a priori, as a function of variousparameters of the distribution.

私たちは、ユニグラム言語モデルを学習するための高確率集中境界をいくつか示します。1つの興味深い量は、サイズmのサンプルですべての単語が正確にk回出現する確率です。この量の標準的な推定量はグッドチューリング推定量です。その誤差に関する既存の分析では、約O(k / m1/2)の高確率境界が示されています。kへの依存性をO(k1/4 / m1/2 + k / m)に改善します。また、経験的頻度推定量を分析し、高確率ではその誤差が約O( 1 /k + k1/2 / m)に制限されることを示します。任意のkに対して約O(m-2/5)の誤差を持つ複合推定量を導出します。学習アルゴリズムの品質の標準的な尺度は、その期待される単語あたりのログ損失です。leave-one-out法を使用して、ユニグラムモデルのログ損失を推定できます。その誤差は、基礎となる分布に対して約O(1 / m1/2)の高確率境界を持つことを示します。また、分布のさまざまなパラメーターの関数として、ログ損失を事前に制限します。

Separating a Real-Life Nonlinear Image Mixture
実生活の非線形画像混合の分離

When acquiring an image of a paper document, the image printed on theback page sometimes shows through. The mixture of the front- andback-page images thus obtained is markedly nonlinear, and thus constitutes a good real-life test case for nonlinear blind sourceseparation.This paper addresses a difficult version of this problem,corresponding to the use of “onion skin” paper, which results in arelatively strong nonlinearity of the mixture, which becomes close tosingular in the lighter regions of the images. The separation isachieved through the MISEP technique, which is an extension of thewell known INFOMAX method. The separation results are assessed withobjective quality measures. They show an improvement over theresults obtained with linear separation, but have room for furtherimprovement.

紙の文書の画像を取得すると、裏面に印刷された画像が透けて見えることがあります。このようにして得られた表紙と裏面の画像の組み合わせは著しく非線形であり、したがって、非線形ブラインドソースの優れた実際のテストケースを構成します。この論文では、この問題の難しいバージョンを扱っており、「タマネギの皮」紙の使用に対応しており、混合物の非線形性が比較的強くなり、画像の明るい領域で特異性が近くなります。分離は、よく知られているINFOMAX法の拡張であるMISEP技術によって達成されます。分離結果は、客観的な品質測定で評価されます。線形分離で得られた結果よりも改善が見られますが、さらに改善の余地があります。

Semigroup Kernels on Measures
メジャー上のセミグループカーネル

We present a family of positive definite kernels on measures,characterized by the fact that the value of the kernel between twomeasures is a function of their sum. These kernels can be used toderive kernels on structured objects, such as images and texts, byrepresenting these objects as sets of components, such as pixelsor words, or more generally as measures on the space ofcomponents. Several kernels studied in this work make use ofcommon quantities defined on measures such as entropy orgeneralized variance to detect similarities. Given an a priorikernel on the space of components itself, the approach is furtherextended by restating the previous results in a more efficient andflexible framework using the “kernel trick”. Finally, aconstructive approach to such positive definite kernels through anintegral representation theorem is proved, before presentingexperimental results on a benchmark experiment of handwrittendigits classification to illustrate the validity of the approach.

私たちは、2つの測度間のカーネルの値がそれらの合計の関数であるという事実によって特徴付けられる、測度上の正定値カーネルの族を提示します。これらのカーネルは、これらのオブジェクトをピクセルや単語などのコンポーネントのセットとして、またはより一般的にはコンポーネントの空間の尺度として表すことにより、画像やテキストなどの構造化オブジェクトにカーネルを導出するために使用できます。この研究で研究されたいくつかのカーネルは、類似性を検出するために、エントロピーや一般化分散などの尺度で定義された一般的な量を利用しています。コンポーネント自体の空間に先験的なカーネルが与えられると、このアプローチは、「カーネルトリック」を使用して、より効率的で柔軟なフレームワークで前の結果を再返すことによってさらに拡張されます。最後に、積分表現定理によるそのような正定値カーネルへの非建設的なアプローチが証明され、その後、手書きの数字分類のベンチマーク実験に関する実験結果を提示して、アプローチの有効性を示します。

Analysis of Variance of Cross-Validation Estimators of the Generalization Error
一般化誤差の交差検証推定量の分散分析

This paper brings together methods from two different disciplines:statistics and machine learning. We address the problem of estimatingthe variance of cross-validation (CV) estimators of the generalizationerror. In particular, we approach the problem of variance estimationof the CV estimators of generalization error as a problem inapproximating the moments of a statistic. The approximationillustrates the role of training and test sets in the performance ofthe algorithm. It provides a unifying approach to evaluation ofvarious methods used in obtaining training and test sets and it takesinto account the variability due to different training and testsets. For the simple problem of predicting the sample mean and in thecase of smooth loss functions, we show that the variance of the CVestimator of the generalization error is a function of the moments ofthe random variables Y=Card(Sj ∩ Sj’)and Y*=Card(Sjc ∩Sj’c), where Sj, Sj’ aretwo training sets, and Sjc,Sj’c are the corresponding test sets. We provethat the distribution of Y and Y* is hypergeometric and we compare ourestimator with the one proposed by Nadeau and Bengio (2003). We extendthese results in the regression case and the case of absolute errorloss, and indicate how the methods can be extended to theclassification case. We illustrate the results through simulation.

この論文では、統計と機械学習という2つの異なる分野の手法をまとめています。一般化エラーのクロス検証(CV)推定値の分散を推定する問題を取り上げます。特に、一般化エラーのCV推定値の分散推定の問題を、統計のモーメントを近似する問題として扱います。この近似は、アルゴリズムのパフォーマンスにおけるトレーニングセットとテストセットの役割を示しています。これは、トレーニングセットとテストセットを取得するために使用されるさまざまな方法を評価する統一的なアプローチを提供し、異なるトレーニングセットとテストセットによる変動性を考慮に入れます。サンプル平均を予測するという単純な問題と滑らかな損失関数の場合、一般化誤差のCVestimatorの分散は、ランダム変数Y=Card(Sj∩Sj’)およびY*=Card(Sjc∩Sj’c)のモーメントの関数であることを示します。ここで、Sj、Sj’は2つのトレーニングセット、Sjc、Sj’cは対応するテストセットです。YとY*の分布が超幾何分布であることを証明し、推定値をNadeauとBengio (2003)が提案したものと比較します。これらの結果を回帰の場合と絶対誤差損失の場合に拡張し、この方法を分類の場合に拡張する方法を示します。シミュレーションによって結果を示します。

Learning the Kernel Function via Regularization
正則化によるカーネル関数の学習

We study the problem of finding an optimal kernel from a prescribedconvex set of kernels K for learning a real-valued function byregularization. We establish for a wide variety of regularizationfunctionals that this leads to a convex optimization problem and, forsquare loss regularization, we characterize the solution of thisproblem. We show that, although K may be an uncountable set,the optimal kernel is always obtained as a convex combination of atmost m+2 basic kernels, where m is the number of data examples. Inparticular, our results apply to learning the optimal radial kernel orthe optimal dot product kernel.

私たちは、正則化によって実数値関数を学習するために、規定されたカーネルKの凸集合から最適なカーネルを見つける問題を研究します。さまざまな正則化汎関数に対して、これが凸最適化問題につながることを確立し、二乗損失正則化について、この問題の解を特徴付けます。Kは不可算集合であるかもしれませんが、最適なカーネルは常に最大でm+2基本カーネルの凸の組み合わせとして得られることを示します(mはデータ例の数)。特に、私たちの結果は、最適な放射状カーネルまたは最適なドット積カーネルの学習に適用されます。

A Generalization Error for Q-Learning
Q学習の一般化誤差

Planning problems that involve learning a policy from a singletraining set of finite horizon trajectories arise in both socialscience and medical fields. We consider Q-learning with functionapproximation for this setting and derive an upper bound on thegeneralization error. This upper bound is in terms of quantitiesminimized by a Q-learning algorithm, the complexity of theapproximation space and an approximation term due to the mismatchbetween Q-learning and the goal of learning a policy that maximizesthe value function.

有限の地平線の軌跡の単一のトレーニングセットからポリシーを学習することを含む計画問題は、社会科学と医療の両方の分野で発生します。この設定では、関数近似を使用したQ学習を考慮し、一般化誤差の上限を導き出します。この上限は、Q学習アルゴリズムによって最小化された量、近似空間の複雑さ、およびQ学習と価値関数を最大化する方策を学習するという目標との不一致による近似項の観点からです。

Learning the Kernel with Hyperkernels
ハイパーカーネルによるカーネルの学習

This paper addresses the problem of choosing a kernel suitable forestimation with a support vector machine, hence further automatingmachine learning. This goal is achieved by defining a reproducingkernel Hilbert space on the space of kernels itself. Such aformulation leads to a statistical estimation problem similar tothe problem of minimizing a regularized risk functional.We state the equivalent representer theorem for the choice of kernelsand present a semidefinite programming formulation of the resultingoptimization problem. Several recipes for constructing hyperkernelsare provided, as well as the details of common machine learningproblems. Experimental results for classification, regression andnovelty detection on UCI data show the feasibility of our approach.

この論文では、サポートベクターマシンによる推定に適したカーネルを選択する問題、したがって機械学習をさらに自動化する問題に対処します。この目標は、カーネル自体の空間上に再現カーネルヒルベルト空間を定義することによって達成されます。このような定式化は、正則化されたリスク汎関数を最小化する問題と同様の統計的推定問題につながります。カーネルの選択に対する等価な表現定理を述べ、結果の最適化問題の半定値プログラミング定式化を提示します。ハイパーカーネルを構築するためのいくつかのレシピと、一般的な機械学習の問題の詳細が提供されています。UCIデータに対する分類、回帰、新規性検出の実験結果は、私たちのアプローチの実現可能性を示しています。

Gaussian Processes for Ordinal Regression
順序回帰のガウス過程

We present a probabilistic kernel approach to ordinal regressionbased on Gaussian processes. A threshold model that generalizesthe probit function is used as the likelihood function forordinal variables. Two inference techniques, based on the Laplaceapproximation and the expectation propagation algorithmrespectively, are derived for hyperparameter learning and modelselection. We compare these two Gaussian process approaches with aprevious ordinal regression method based on support vectormachines on some benchmark and real-world data sets, includingapplications of ordinal regression to collaborative filtering andgene expression analysis. Experimental results on these data setsverify the usefulness of our approach.

私たちは、ガウス過程に基づく順序回帰への確率的カーネルアプローチを提示します。プロビット関数を一般化するしきい値モデルは、尤度関数の順序変数として使用されます。ラプラス近似アルゴリズムと期待伝播アルゴリズムに基づく2つの推論手法が、ハイパーパラメーター学習とモデル選択のために導出されます。これら2つのガウス過程アプローチを、一部のベンチマークおよび実世界のデータセット上のサポートベクターマシンに基づく以前の順序回帰法と比較します。これには、順序回帰の協調フィルタリングおよび遺伝子発現解析への適用が含まれます。これらのデータセットの実験結果は、私たちのアプローチの有用性を検証しています。

Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
オンライン学習とブレグマン投影法のための行列指数勾配の更新

We address the problem of learning a symmetric positive definitematrix. The central issue is to design parameter updates thatpreserve positive definiteness. Our updates are motivated with thevon Neumann divergence. Rather than treating the most generalcase, we focus on two key applications that exemplify our methods:on-line learning with a simple square loss, and finding a symmetricpositive definite matrix subject to linear constraints. The updatesgeneralize the exponentiated gradient (EG) update and AdaBoost,respectively: the parameter is now a symmetric positive definitematrix of trace one instead of a probability vector (which in thiscontext is a diagonal positive definite matrix with trace one). Thegeneralized updates use matrix logarithms and exponentials to preservepositive definiteness. Most importantly, we show how the derivationand the analyses of the original EG update and AdaBoost generalize tothe non-diagonal case. We apply the resulting matrix exponentiatedgradient (MEG) update and DefiniteBoost to the problem oflearning a kernel matrix from distance measurements.

私たちは、対称正定値行列の学習の問題に取り組みます。中心的な問題は、正定値性を維持するパラメーター更新を設計することです。更新は、フォンノイマンダイバージェンスを動機としています。最も一般的なケースを扱うのではなく、私たちの方法を例示する2つの主要なアプリケーション、つまり単純な平方損失によるオンライン学習と、線形制約に従う対称正定値行列の検索に焦点を当てます。更新は、それぞれ指数勾配(EG)更新とAdaBoostを一般化します。つまり、パラメーターは、確率ベクトル(このコンテキストでは、トレース1の対角正定値行列)ではなく、トレース1の対称正定値行列になります。一般化された更新では、行列対数と指数を使用して正定値性を維持します。最も重要なのは、元のEG更新とAdaBoostの導出と分析が非対角の場合にどのように一般化されるかを示すことです。結果として得られる行列指数勾配(MEG)更新とDefiniteBoostを、距離測定からカーネル行列を学習する問題に適用します。

Algorithmic Stability and Meta-Learning
アルゴリズムの安定性とメタ学習

A mechnism of transfer learning is analysed, where samples drawn fromdifferent learning tasks of an environment are used to improve the learnersperformance on a new task. We give a general method to prove generalisationerror bounds for such meta-algorithms. The method can be applied to the biaslearning model of J. Baxter and to derive novel generalisation bounds formeta-algorithms searching spaces of uniformly stable algorithms. We alsopresent an application to regularized least squares regression.

転移学習のメカニズムが分析され、環境のさまざまな学習タスクから抽出されたサンプルを使用して、新しいタスクでの学習者のパフォーマンスが向上します。このようなメタアルゴリズムの一般化誤差の限界を証明するための一般的な方法を提供します。この方法は、J. Baxterのバイアス学習モデルに適用し、一様に安定したアルゴリズムの空間を検索するメタアルゴリズムの新しい一般化境界を導出することができます。また、正則化最小二乗回帰への応用も紹介します。

Learning a Mahalanobis Metric from Equivalence Constraints
等価制約からのマハラノビス計量の学習

Many learning algorithms use a metric defined over the input space asa principal tool, and their performance critically depends on thequality of this metric. We address the problem of learning metricsusing side-information in the form of equivalence constraints. Unlikelabels, we demonstrate that this type of side-information cansometimes be automatically obtained without the need of humanintervention. We show how such side-information can be used to modifythe representation of the data, leading to improved clustering andclassification.Specifically, we present the Relevant Component Analysis (RCA)algorithm, which is a simple and efficient algorithm for learning aMahalanobis metric. We show that RCA is the solution of aninteresting optimization problem, founded on an information theoreticbasis. If dimensionality reduction is allowed within RCA, we showthat it is optimally accomplished by a version ofFisher’s linear discriminant that uses constraints. Moreover, under certain Gaussianassumptions, RCA can be viewed as a Maximum Likelihood estimation ofthe within class covariance matrix. We conclude with extensiveempirical evaluations of RCA, showing its advantage over alternativemethods.

多くの学習アルゴリズムは、入力空間で定義されたメトリックを主要なツールとして使用し、そのパフォーマンスはこのメトリックの品質に大きく依存します。私たちは、同値制約の形でサイド情報を使用してメトリックを学習する問題に取り組みます。ラベルとは異なり、このタイプのサイド情報は、人間の介入を必要とせずに自動的に取得できる場合があることを示します。このようなサイド情報を使用してデータの表現を変更し、クラスタリングと分類を改善する方法を示します。具体的には、マハラノビスメトリックを学習するためのシンプルで効率的なアルゴリズムである関連成分分析(RCA)アルゴリズムを紹介します。RCAは、情報理論に基づいた興味深い最適化問題のソリューションであることを示します。RCA内で次元削減が可能な場合は、制約を使用するフィッシャーの線形判別のバージョンによって最適に達成されることを示します。さらに、特定のガウス仮定の下では、RCAはクラス内共分散行列の最大尤度推定と見なすことができます。結論として、RCAの広範な経験的評価を行い、代替方法に対するRCAの利点を示します。

Loopy Belief Propagation: Convergence and Effects of Message Errors
Loopy確率伝搬法: メッセージエラーの収束と影響

Belief propagation (BP) is an increasingly popular method of performingapproximate inference on arbitrary graphical models. At times, evenfurther approximations are required, whether due to quantization of themessages or model parameters, from other simplified message or modelrepresentations, or from stochastic approximation methods. Theintroduction of such errors into the BP message computations has thepotential to affect the solution obtained adversely. We analyze theeffect resulting from message approximation under two particular measuresof error, and show bounds on the accumulation of errors in the system.This analysis leads to convergence conditions for traditional BP messagepassing, and both strict bounds and estimates of the resulting error insystems of approximate BP message passing.

信念伝播(BP)は、任意のグラフィカルモデルで近似推論を実行する方法としてますます普及している方法です。場合によっては、メッセージやモデルパラメーターの量子化、他の簡略化されたメッセージやモデル表現、または確率的近似法によるものであっても、さらに近似が必要になります。このようなエラーをBPメッセージの計算に導入すると、得られる解に悪影響を及ぼす可能性があります。私たちは、2つの特定の誤差尺度の下でのメッセージ近似から生じる効果を分析し、システム内の誤差の蓄積の限界を示す。この分析により、従来のBPメッセージパッシングの収束条件と、近似BPメッセージパッシングの結果として生じるエラーの推定値の両方が得られます。

Learning from Examples as an Inverse Problem
逆問題としての例から学ぶ

Many works related learning from examples to regularization techniquesfor inverse problems, emphasizing the strong algorithmic andconceptual analogy of certain learning algorithms with regularizationalgorithms. In particular it is well known that regularizationschemes such as Tikhonov regularization can be effectively used in thecontext of learning and are closely related to algorithms such assupport vector machines. Nevertheless the connection with inverseproblem was considered only for the discrete (finite sample) problemand the probabilistic aspects of learning from examples were not takeninto account. In this paper we provide a natural extension of suchanalysis to the continuous (population) case and study the interplaybetween the discrete and continuous problems. From a theoreticalpoint of view, this allows to draw a clear connection between theconsistency approach in learning theory and the stability convergenceproperty in ill-posed inverse problems. The main mathematical resultof the paper is a new probabilistic bound for the regularizedleast-squares algorithm. By means of standard results on theapproximation term, the consistency of the algorithm easily follows.

多くの研究は、例からの学習を逆問題の正規化手法に関連付け、特定の学習アルゴリズムと正規化アルゴリズムの強力なアルゴリズム的および概念的類似性を強調しています。特に、ティホノフ正則化などの正則化スキームは学習のコンテキストで効果的に使用でき、サポートベクターマシンなどのアルゴリズムと密接に関連していることはよく知られています。ただし、逆問題との関連は離散（有限サンプル）問題に対してのみ考慮され、例からの学習の確率的側面は考慮されていませんでした。この論文では、このような分析を連続（母集団）ケースに自然に拡張し、離散問題と連続問題の相互作用を検討します。理論的な観点から、これにより、学習理論の一貫性アプローチと、不適切逆問題の安定性収束特性との間に明確な関係を描くことができます。この論文の主な数学的結果は、正規化された最小二乗アルゴリズムの新しい確率的境界です。近似項に関する標準的な結果により、アルゴリズムの一貫性は簡単にわかります。

Prioritization Methods for Accelerating MDP Solvers
MDP ソルバーを高速化するための優先順位付け手法

The performance of value and policy iteration can be dramaticallyimproved by eliminating redundant or useless backups, and by backingup states in the right order. We study several methods designed toaccelerate these iterative solvers, including prioritization,partitioning, and variable reordering. We generate a family ofalgorithms by combining several of the methods discussed, and presentextensive empirical evidence demonstrating that performance canimprove by several orders of magnitude for many problems, whilepreserving accuracy and convergence guarantees.

価値とポリシーの反復処理のパフォーマンスは、冗長なバックアップや無駄なバックアップを排除し、状態を正しい順序でバックアップすることで劇的に向上させることができます。これらの反復ソルバーを高速化するために設計されたいくつかの方法(優先順位付け、分割、変数の並べ替えなど)を研究します。私たちは、議論されたいくつかの方法を組み合わせてアルゴリズムのファミリーを生成し、精度と収束の保証を維持しながら、多くの問題に対してパフォーマンスが数桁向上することを示す広範な経験的証拠を提示します。

Multiclass Classification with Multi-Prototype Support Vector Machines
Multi-Prototype サポートベクターマシンによる多クラス分類

Winner-take-all multiclass classifiers are built on the top of aset of prototypes each representing one of the available classes.A pattern is then classified with the label associated to the most’similar’ prototype. Recent proposal of SVM extensions tomulticlass can be considered instances of the same strategy withone prototype per class.The multi-prototype SVM proposed in this paper extends multiclassSVM to multiple prototypes per class. It allows to combine severalvectors in a principled way to obtain large margin decisionfunctions. For this problem, we give a compact constrainedquadratic formulation and we propose a greedy optimizationalgorithm able to find locally optimal solutions for the nonconvex objective function.This algorithm proceeds by reducing the overall problem into aseries of simpler convex problems. For the solution of thesereduced problems an efficient optimization algorithm is proposed.A number of pattern selection strategies are then discussed tospeed-up the optimization process. In addition, given thecombinatorial nature of the overall problem, stochastic searchstrategies are suggested to escape from local minima which are notglobally optimal.Finally, we report experiments on a number of datasets. Theperformance obtained using few simple linear prototypes iscomparable to that obtained by state-of-the-art kernel-basedmethods but with a significant reduction(of one or two orders) in response time.

勝者総取りのマルチクラス分類器は、それぞれが利用可能なクラスの1つを表すプロトタイプのセットの上に構築されます。次に、最も「類似した」プロトタイプに関連付けられたラベルを使用してパターンが分類されます。マルチクラスへのSVM拡張の最近の提案は、クラスごとに1つのプロトタイプを持つ同じ戦略のインスタンスと考えることができます。この論文で提案されているマルチプロトタイプSVMは、マルチクラスSVMをクラスごとに複数のプロトタイプに拡張します。これにより、複数のベクトルを原則的に組み合わせて、大きなマージン決定関数を取得できます。この問題に対して、我々はコンパクトな制約付き二次定式化を与え、非凸目的関数の局所最適解を見つけることができる貪欲最適化アルゴリズムを提案します。このアルゴリズムは、全体の問題を一連のより単純な凸問題に縮小することによって進む。これらの縮小された問題の解決のために、効率的な最適化アルゴリズムが提案されます。次に、最適化プロセスを高速化するためのいくつかのパターン選択戦略について説明します。さらに、全体の問題の組み合わせの性質を考慮して、グローバルに最適ではない局所最小値から抜け出すための確率的探索戦略を提案します。最後に、いくつかのデータセットでの実験を報告します。いくつかの単純な線形プロトタイプを使用して得られたパフォーマンスは、最先端のカーネルベースの方法で得られたものと同等であるが、応答時間が大幅に短縮される（1桁または2桁）。

Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
ハードディスクの故障を予測するための機械学習手法:マルチインスタンスアプリケーション

We compare machine learning methods applied to a difficult real-worldproblem: predicting computer hard-drive failure using attributes monitoredinternally by individual drives. The problem is one of detecting rareevents in a time series of noisy and nonparametrically-distributed data. Wedevelop a new algorithm based on the multiple-instance learning frameworkand the naive Bayesian classifier (mi-NB) which is specifically designedfor the low false-alarm case, and is shown to have promising performance.Other methods compared are support vector machines (SVMs), unsupervisedclustering, and non-parametric statistical tests (rank-sum and reversearrangements). The failure-prediction performance of the SVM, rank-sum andmi-NB algorithm is considerably better than the threshold method currentlyimplemented in drives, while maintaining low false alarm rates. Ourresults suggest that nonparametric statistical tests should be consideredfor learning problems involving detecting rare events in time series data.An appendix details the calculation of rank-sum significance probabilitiesin the case of discrete, tied observations, and we give new recommendationsabout when the exact calculation should be used instead of thecommonly-used normal approximation. These normal approximations may beparticularly inaccurate for rare event problems like hard drive failures.

私たちは、個々のドライブによって内部的に監視される属性を使用してコンピュータのハードドライブの障害を予測するという、現実世界の困難な問題に適用される機械学習方法を比較します。問題は、ノイズが多く非パラメトリックに分布するデータの時系列でまれなイベントを検出することです。私たちは、マルチインスタンス学習フレームワークと、誤報率が低いケースに特化して設計され、有望なパフォーマンスを示すことがわかっている単純ベイズ分類器(mi-NB)に基づく新しいアルゴリズムを開発しました。比較される他の方法は、サポートベクターマシン(SVM)、教師なしクラスタリング、および非パラメトリック統計テスト(ランク合計と逆配置)です。SVM、ランク合計、およびmi-NBアルゴリズムの障害予測パフォーマンスは、低い誤報率を維持しながら、現在ドライブに実装されているしきい値方法よりも大幅に優れています。我々の結果は、時系列データにおける稀なイベントの検出を含む学習問題には、ノンパラメトリック統計検定を考慮する必要があることを示唆しています。付録では、離散的で同点の観測の場合の順位和有意確率の計算について詳しく説明し、一般的に使用される正規近似値の代わりに正確な計算を使用する必要がある場合について新しい推奨事項を示します。これらの正規近似値は、ハードドライブの故障などの稀なイベントの問題では特に不正確になる可能性があります。

Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
直交群上の準測地線ニューラル学習アルゴリズム:チュートリアル

The aim of this contribution is to present a tutorial on learningalgorithms for a single neural layer whose connection matrix belongsto the orthogonal group. The algorithms exploit geodesicsappropriately connected as piece-wise approximate integrals of theexact differential learning equation. The considered learningequations essentially arise from the Riemannian-gradient-basedoptimization theory with deterministic and diffusion-typegradient. The paper aims specifically at reviewing the relevantmathematics (and at presenting it in as much transparent way aspossible in order to make it accessible to readers that do not possessa background in differential geometry), at bringing together modernoptimization methods on manifolds and at comparing the differentalgorithms on a common machine learning problem. As a numericalcase-study, we consider an application to non-negative independentcomponent analysis, although it should be recognized that Riemanniangradient methods give rise to general-purpose algorithms, by no meanslimited to ICA-related applications.

この論文の目的は、接続行列が直交群に属する単一のニューラルレイヤーの学習アルゴリズムに関するチュートリアルを提示することです。アルゴリズムは、正確な微分学習方程式の区分近似積分として適切に接続された測地線を利用します。検討されている学習方程式は、本質的に、決定論的および拡散型勾配を持つリーマン勾配ベースの最適化理論から生じます。この論文の目的は、特に、関連する数学をレビューすること(そして、微分幾何学のバックグラウンドを持たない読者にも理解しやすいように、できるだけわかりやすい方法で提示すること)、多様体に関する最新の最適化手法をまとめること、および共通の機械学習問題に関するさまざまなアルゴリズムを比較することです。数値的なケーススタディとして、非負独立成分分析への応用を検討しますが、リーマン勾配法は、ICA関連のアプリケーションに限定されず、汎用アルゴリズムを生み出すことを認識する必要があります。

Smooth ε-Insensitive Regression by Loss Symmetrization
損失対称化による平滑ε非鈍感回帰

We describe new loss functions for regression problems along with anaccompanying algorithmic framework which utilizes these functions. These lossfunctions are derived by symmetrization of margin-based losses commonly used inboosting algorithms, namely, the logistic loss and the exponential loss. Theresulting symmetric logistic loss can be viewed as a smooth approximation tothe ε-insensitive hinge loss used in support vector regression. Wedescribe and analyze two parametric families of batch learning algorithms forminimizing these symmetric losses. The first family employs an iterativelog-additive update which can be viewed as a regression counterpart torecent boosting algorithms. The second family utilizes an iterativeadditive update step. We also describe and analyze online gradientdescent (GD) and exponentiated gradient (EG) algorithms for the symmetriclogistic loss. A byproduct of our work is a new simple form of regularizationfor boosting-based classification and regression algorithms. Our regressionframework also has implications on classification algorithms, namely, a newadditive update boosting algorithm for classification. We demonstrate themerits of our algorithms in a series of experiments.

私たちは、回帰問題のための新しい損失関数と、これらの関数を利用する付随するアルゴリズムフレームワークについて説明します。これらの損失関数は、ブースティングアルゴリズムで一般的に使用されるマージンベースの損失、つまりロジスティック損失と指数損失の対称化によって導出されます。結果として得られる対称ロジスティック損失は、サポートベクター回帰で使用される ε 非感受性ヒンジ損失の滑らかな近似として見ることができます。これらの対称損失を最小限に抑えるためのバッチ学習アルゴリズムの2つのパラメトリックファミリについて説明し、分析します。最初のファミリは、最近のブースティングアルゴリズムの回帰版として見ることができる反復対数加法更新を使用します。2番目のファミリは、反復加法更新ステップを使用します。また、対称ロジスティック損失のオンライン勾配降下法(GD)および指数勾配法(EG)アルゴリズムについても説明し、分析します。私たちの研究の副産物は、ブースティングベースの分類および回帰アルゴリズム用の新しい単純な形式の正則化です。私たちの回帰フレームワークは分類アルゴリズムにも影響を与えます。つまり、分類のための新しい加法更新ブースティングアルゴリズムです。一連の実験で私たちのアルゴリズムのメリットを実証します。

Estimation of Non-Normalized Statistical Models by Score Matching
スコアマッチングによる非正規化統計モデルの推定

One often wants to estimate statistical models where the probabilitydensity function is known only up to a multiplicative normalizationconstant. Typically, one then has to resort to Markov Chain MonteCarlo methods, or approximations of the normalization constant. Here,we propose that such models can be estimated by minimizing theexpected squared distance between the gradient of the log-densitygiven by the model and the gradient of the log-density of the observeddata. While the estimation of the gradient of log-density functionis, in principle, a very difficult non-parametric problem, we prove asurprising result that gives a simple formula for this objectivefunction. The density function of the observed data does not appear inthis formula, which simplifies to a sample average of a sum of somederivatives of the log-density given by the model. The validity ofthe method is demonstrated on multivariate Gaussian and independentcomponent analysis models, and by estimating an overcomplete filterset for natural image data.

多くの場合、確率密度関数が乗法正規化定数までしか知られていない統計モデルを推定したい場合があります。通常、その場合、マルコフ連鎖モンテカルロ法、または正規化定数の近似値に頼る必要があります。ここでは、モデルによって与えられた対数密度の勾配と観測データの対数密度の勾配との間の期待二乗距離を最小化することで、そのようなモデルを推定できると提案します。対数密度関数の勾配の推定は、原理的には非常に難しいノンパラメトリック問題ですが、この目的関数の簡単な式を与える驚くべき結果を証明します。観測データの密度関数はこの式には現れず、モデルによって与えられた対数密度のいくつかの導関数の合計のサンプル平均に簡略化されます。この方法の有効性は、多変量ガウスおよび独立成分分析モデルで実証され、自然画像データの過剰完備フィルタセットを推定することで実証されます。

Variational Message Passing
変分メッセージパッシング

Bayesian inference is now widely established as one of theprincipal foundations for machine learning. In practice, exactinference is rarely possible, and so a variety of approximationtechniques have been developed, one of the most widely used beinga deterministic framework called variational inference. In thispaper we introduce Variational Message Passing (VMP), a generalpurpose algorithm for applying variational inference to BayesianNetworks. Like belief propagation, VMP proceeds by sendingmessages between nodes in the network and updating posteriorbeliefs using local operations at each node. Each such updateincreases a lower bound on the log evidence (unless already at alocal maximum). In contrast to belief propagation, VMP can beapplied to a very general class of conjugate-exponential modelsbecause it uses a factorised variational approximation.Furthermore, by introducing additional variational parameters, VMPcan be applied to models containing non-conjugate distributions.The VMP framework also allows the lower bound to be evaluated, andthis can be used both for model comparison and for detection ofconvergence. Variational message passing has been implemented inthe form of a general purpose inference engine called VIBES(‘Variational Inference for BayEsian networkS’) which allowsmodels to be specified graphically and then solved variationallywithout recourse to coding.

ベイズ推論は、現在では機械学習の主要基盤の1つとして広く定着しています。実際には、正確な推論はほとんど不可能であるため、さまざまな近似手法が開発されてきました。最も広く使用されているものの1つが、変分推論と呼ばれる決定論的フレームワークです。この論文では、ベイズネットワークに変分推論を適用するための汎用アルゴリズムである変分メッセージパッシング(VMP)を紹介します。ビリーフプロパゲーションと同様に、VMPはネットワーク内のノード間でメッセージを送信し、各ノードでローカル操作を使用して事後ビリーフを更新することで進行します。このような更新ごとに、ログ証拠の下限が増加します(すでにローカル最大値に達していない場合)。ビリーフプロパゲーションとは対照的に、VMPは因数分解変分近似を使用するため、非常に一般的なクラスの共役指数モデルに適用できます。さらに、追加の変分パラメータを導入することで、VMPを非共役分布を含むモデルに適用できます。VMPフレームワークでは下限を評価することも可能で、モデルの比較と収束の検出の両方に使用できます。変分メッセージパッシングは、VIBES (「Variational Inference for BayEsian networkS」)と呼ばれる汎用推論エンジンの形で実装されており、これによりモデルをグラフィカルに指定し、コーディングに頼ることなく変分的に解決できます。

Adaptive Online Prediction by Following the Perturbed Leader
摂動リーダーの追従による適応型オンライン予測

When applying aggregating strategies to Prediction with ExpertAdvice (PEA), the learning rate must be adaptively tuned. Thenatural choice of sqrt(complexity/current loss)renders the analysis of Weighted Majority (WM) derivatives quitecomplicated. In particular, for arbitrary weights there havebeen no results proven so far. The analysis of the alternativeFollow the Perturbed Leader (FPL) algorithm from Kalai andVempala (2003) based on Hannan’s algorithm is easier. Wederive loss bounds for adaptive learning rate and both finiteexpert classes with uniform weights and countable expertclasses with arbitrary weights. For the former setup, our lossbounds match the best known results so far, while for thelatter our results are new.

Prediction with ExpertAdvice(PEA)に集約戦略を適用する場合、学習率を適応的に調整する必要があります。sqrt(複雑性/電流損失)の自然な選択により、加重多数決(WM)導関数の分析は非常に複雑になります。特に、任意のウェイトについては、これまで証明された結果はありません。代替案の分析Kalai andVenpala (2003)のPerturbed Leader(FPL)アルゴリズムに従うと、Hannanのアルゴリズムに基づく方が簡単です。適応学習率と、一様な重みを持つ有限のエキスパートクラスと任意の重みを持つ可算のエキスパートクラスの両方の損失範囲を導出します。前者のセットアップでは、ロスバウンドはこれまでに知られている最高の結果と一致しますが、後者では新しい結果が得られます。

Learning Multiple Tasks with Kernel Methods
カーネルメソッドによる複数のタスクの学習

We study the problem of learning many related tasks simultaneouslyusing kernel methods and regularization. The standard single-taskkernel methods, such as support vector machines and regularizationnetworks, are extended to the case of multi-task learning. Ouranalysis shows that the problem of estimating many task functions withregularization can be cast as a single task learning problem if afamily of multi-task kernel functions we define is used. Thesekernels model relations among the tasks and are derived from a novelform of regularizers. Specific kernels that can be used for multi-tasklearning are provided and experimentally tested on two realdata sets. In agreement with past empirical work on multi-tasklearning, the experiments show that learning multiple related taskssimultaneously using the proposed approach can significantlyoutperform standard single-task learning particularly when there aremany related tasks but few data per task.

私たちは、カーネル法と正則化を使用して、関連する多くのタスクを同時に学習する問題を研究しています。サポートベクターマシンや正則化ネットワークなどの標準的なシングルタスクカーネルの手法は、マルチタスク学習の場合に拡張されます。私たちの分析は、正則化を使用して多くのタスク関数を推定する問題は、定義するマルチタスクカーネル関数のファミリーが使用されている場合、単一のタスク学習問題としてキャストできることを示しています。これらのカーネルは、タスク間の関係をモデル化し、新しい形式の正則化器から派生します。マルチタスク学習に使用できる特定のカーネルが提供され、2つの実際のデータセットで実験的にテストされています。マルチタスク学習に関する過去の実証研究と一致して、実験は、提案されたアプローチを使用して複数の関連タスクを同時に学習すると、特に関連するタスクが多数あるがタスクごとのデータが少ない場合に、標準的な単一タスク学習を大幅に上回ることを示しています。

Active Learning to Recognize Multiple Types of Plankton
複数の種類のプランクトンを認識するアクティブラーニング

This paper presents an active learning method which reduces the labeling effort of domain experts in multi-class classification problems. Active learning is applied in conjunction withsupport vector machines to recognize underwater zooplankton from higher-resolution, new generation SIPPER II images.Most previous work on active learning with support vector machines only deals with two class problems. In this paper, we propose an active learning approach “breaking ties” for multi-class supportvector machines using the one-vs-one approach with a probability approximation. Experimental results indicate that our approach often requires significantly less labeled images to reach a given accuracy than the approach of labeling the least certain test example and random sampling. It can also be applied in batch mode resulting in an accuracy comparable to labeling one image at a time and retraining.

この論文では、マルチクラス分類問題におけるドメイン専門家のラベリングの労力を軽減するアクティブラーニング方法を紹介します。アクティブラーニングは、サポートベクターマシンと組み合わせて適用され、高解像度の新世代のSIPPER II画像から水中動物プランクトンを認識します。サポートベクターマシンを使用したアクティブラーニングに関する以前のほとんどの研究は、2つのクラス問題のみを扱っています。この論文では、確率近似を用いたone-vs-oneアプローチを用いたマルチクラスサポートベクターマシンのアクティブラーニングアプローチ「breaking ties」を提案します。実験結果によると、私たちのアプローチでは、最も確実性の低いテスト例にラベルを付けてランダムサンプリングするアプローチよりも、特定の精度に到達するために必要なラベル付き画像が大幅に少なくなることがよくあります。また、バッチモードで適用することもできるため、一度に1つの画像にラベルを付けて再登録するのと同等の精度が得られます。

Learning Module Networks
学習モジュールネットワーク

Methods for learning Bayesian networks can discover dependencystructure between observed variables. Although these methods areuseful in many applications, they run into computational andstatistical problems in domains that involve a large number ofvariables. In this paper, we consider a solution that is applicablewhen many variables have similar behavior. We introduce a new class ofmodels, module networks, that explicitly partition thevariables into modules, so that the variables in each module share thesame parents in the network and the same conditional probabilitydistribution. We define the semantics of module networks, anddescribe an algorithm that learns the modules’ composition and theirdependency structure from data. Evaluation on real data in the domainsof gene expression and the stock market shows that module networksgeneralize better than Bayesian networks, and that the learned modulenetwork structure reveals regularities that are obscured in learnedBayesian networks.

ベイジアンネットワークを学習する方法は、観測された変数間の依存関係構造を発見できます。これらの方法は多くのアプリケーションで役立ちますが、多数の変数が関与するドメインでは計算および統計の問題に遭遇します。この論文では、多くの変数が同様の動作をしている場合に適用可能な解決策を検討します。新しいクラスのモデル、モジュールネットワークを導入し、変数をモジュールに明示的に分割して、各モジュールの変数がネットワーク内の同じ親と共有し、同じ条件付き確率分布を共有するようにします。モジュールネットワークの意味を定義し、データからモジュールの構成とその依存関係構造を学習するアルゴリズムについて説明します。遺伝子発現のドメインと株式市場の実際のデータを評価すると、モジュールネットワークはベイジアンネットワークよりもよく一般化され、学習されたモジュールネットワーク構造は学習されたベイジアンネットワークでは不明瞭な規則性を明らかにすることが示されています。

Tree-Based Batch Mode Reinforcement Learning
ツリーベースのバッチモード強化学習

Reinforcement learning aims to determine an optimal control policy frominteraction with a system or from observations gathered from a system. Inbatch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples (xt, ut , rt, xt+1) where xt denotes the system state at time t, ut the control action taken, rt the instantaneous reward obtained and xt+1 the successor state of the system, and by determining the control policy from this Q-function. The Q-function approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are providedby the extremely randomized trees.

強化学習は、システムとの相互作用またはシステムから収集された観察から最適な制御ポリシーを決定することを目的としています。バッチモードでは、一連の4つの組(xt、ut、rt、xt+1)に基づくいわゆるQ関数を近似し、このQ関数から制御ポリシーを決定することで実現できます。ここで、xtは時刻tにおけるシステムの状態、utは実行される制御アクション、rtは取得された瞬間的な報酬、xt+1はシステムの後続の状態を表します。Q関数の近似は、一連の(バッチモード)教師あり学習問題の限界から取得できます。このフレームワーク内で、いくつかの古典的なツリーベースの教師あり学習方法(CART、Kdツリー、ツリーバギング)と、新しく提案された2つのアンサンブルアルゴリズム、つまり極端にランダム化されたツリーと完全にランダム化されたツリーの使用について説明します。いくつかの例でそのパフォーマンスを研究した結果、回帰ツリーに基づくアンサンブル法は、4組のセットから最適な制御ポリシーに関する関連情報を抽出するのに優れていることがわかりました。特に、完全にランダム化されたツリーは、シーケンスの収束を確実にしながら優れた結果をもたらしますが、収束制約を緩和することで、極度にランダム化されたツリーによってさらに優れた精度の結果が得られます。

Characterization of a Family of Algorithms for Generalized Discriminant Analysis on Undersampled Problems
アンダーサンプリング問題に対する一般化判別分析のためのアルゴリズムファミリーの特性評価

A generalized discriminant analysis based on a new optimization criterion is presented. The criterion extends the optimization criteria of the classical Linear Discriminant Analysis (LDA) when the scatter matrices are singular. An efficient algorithm for the new optimization problem is presented.

新しい最適化基準に基づく一般化判別分析が表示されます。この基準は、散布行列が特異な場合の従来の線形判別分析(LDA)の最適化基準を拡張します。新しい最適化問題に対する効率的なアルゴリズムが提示されます。

Estimating Functions for Blind Separation When Sources Have Variance Dependencies
ソースに分散依存性がある場合のブラインド分離の推定関数

A blind separation problem where the sources are not independent,but have variance dependencies is discussed. For this scenarioHyvärinen and Hurri (2004) proposed an algorithm which requires no assumption on distributions of sources and no parametric model of dependencies between components. In this paper, we extend the semiparametric approach of Amari and Cardoso (1997) to variance dependencies and study estimating functions for blind separation of such dependent sources. In particular, we show that many ICA algorithms are applicable to the variance-dependent model as wellunder mild conditions, although they should in principle not. Our results indicate that separation can be done based only on normalized sources which are adjusted to have stationary variancesand is not affected by the dependent activity levels. We also study the asymptotic distribution of the quasi maximum likelihood method andthe stability of the natural gradient learning in detail. Simulation results of artificial and realistic examples match well with our theoretical findings.

ソースが独立ではなく、分散依存性を持つブラインド分離問題について説明します。このシナリオでは、HyvärinenとHurri (2004)は、ソースの分布に関する仮定やコンポーネント間の依存性のパラメトリックモデルを必要としないアルゴリズムを提案しました。この論文では、AmariとCardoso (1997)のセミパラメトリックアプローチを分散依存性に拡張し、このような依存ソースのブラインド分離の推定関数について検討します。特に、多くのICAアルゴリズムは、原則的には適用できないはずですが、穏やかな条件下では分散依存モデルにも適用できることを示します。結果は、分離は定常分散になるように調整された正規化ソースのみに基づいて実行でき、依存アクティビティレベルの影響を受けません。また、準最大尤度法の漸近分布と自然勾配学習の安定性についても詳細に検討します。人工例と現実の例のシミュレーション結果は、理論的発見とよく一致しています。

Learning with Decision Lists of Data-Dependent Features
データ依存特徴のディシジョンリストによる学習

We present a learning algorithm for decision lists which allowsfeatures that are constructed from the data and allows a trade-offbetween accuracy and complexity. We provide bounds on thegeneralization error of this learning algorithm in terms of thenumber of errors and the size of the classifier it finds on thetraining data. We also compare its performance on some naturaldata sets with the set covering machine and the support vectormachine. Furthermore, we show that the proposed bounds on thegeneralization error provide effective guides for model selection.

私たちは、データから構築される特徴を可能にし、精度と複雑さの間のトレードオフを可能にする決定リストの学習アルゴリズムを提示します。この学習アルゴリズムの一般化エラーの範囲を、エラーの数とトレーニングデータで検出する分類器のサイズの観点から提供します。また、一部のnaturaldataセットでの性能を、セットのカバーマシンとサポートvectormachineと比較します。さらに、汎化誤差に関する提案された境界がモデル選択の効果的なガイドを提供することを示します。

Generalization Bounds for the Area Under the ROC Curve
ROC曲線下面積の一般化限界

We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem.The AUC is a different term than the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. In this paper, we define the expected accuracy of a ranking function (analogous to the expected error rate of a classificationfunction), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a testsequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence.Our uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients;these play the same role in our result as do the standard VC-dimension related shatter coefficients (also known as the growth function) in uniform convergence results for the classification error rate.A comparison of our result with a recent uniform convergence result derived byFreund et al. (2003) for a quantity closely related to the AUC shows that the bound provided by our result can be considerably tighter.

私たちは、二部ランキング問題の評価基準として提唱されている量であるROC曲線の下の面積(AUC)の一般化特性を研究します。AUCは、分類問題の評価に使用されるエラー率とは異なる用語です。したがって、分類エラー率の既存の一般化境界を使用してAUCについて結論を導くことはできません。この論文では、ランキング関数の期待精度(分類関数の期待エラー率に類似)を定義し、ランキング関数の経験的AUC (有限データシーケンスで観測)の期待精度からの偏差に関する分布に依存しない確率境界を導出します。私たちは、テストシーケンスでの経験的AUCに関してランキング関数の期待精度を制限するのに役立つ大偏差境界と、トレーニングシーケンスでの経験的AUCに関して学習済みランキング関数の期待精度を制限するのに役立つ一様収束境界の両方を導出します。一様収束境界は、二部ランク粉砕係数と呼ぶ新しい組み合わせパラメータのセットで表現されます。これらは、分類エラー率の一様収束結果における標準のVC次元関連粉砕係数(成長関数とも呼ばれる)と同じ役割を果たします。我々の結果と、AUCに密接に関連する量についてFreundら(2003)が導出した最近の一様収束結果を比較すると、我々の結果によって提供される境界がかなり厳密になることがわかります。

Core Vector Machines: Fast SVM Training on Very Large Data Sets
コアベクターマシン:非常に大きなデータセットでの高速SVMトレーニング

Standard SVM training has O(m3) time and O(m2) space complexities, where m is the training set size. It is thus computationally infeasibleon very large data sets. By observing that practical SVM implementations only approximate the optimal solution by an iterative strategy, we scale up kernel methods by exploiting such “approximateness” in this paper. We first show that many kernel methods can be equivalently formulated as minimum enclosing ball (MEB)problems in computational geometry. Then, by adopting an efficient approximate MEB algorithm, we obtain provably approximately optimal solutions with the idea of core sets. Our proposed Core Vector Machine(CVM) algorithm can be used with nonlinear kernels and has a time complexity that is linear in m and a spacecomplexity that is independent of m.Experiments on large toy and real-world data sets demonstrate that the CVM is as accurate as existing SVM implementations, but ismuch faster and can handle much larger data sets than existing scale-up methods. For example, CVM with the Gaussian kernel produces superior results on the KDDCUP-99 intrusion detection data, which has about five million training patterns, in only 1.4 seconds on a 3.2GHz Pentium–4 PC.

標準的なSVMトレーニングには、O(m3)の時間とO(m2)の空間計算量があります(mはトレーニングセットのサイズ)。したがって、非常に大きなデータセットでは計算上実行不可能です。実用的なSVM実装では、反復戦略によってのみ最適解を近似するだけであることに注目して、この論文では、そのような「近似性」を利用してカーネルメソッドをスケールアップします。まず、多くのカーネルメソッドが、計算幾何学における最小囲み球(MEB)問題として同等に定式化できることを示します。次に、効率的な近似MEBアルゴリズムを採用することで、コアセットの概念を使用して、証明可能な近似最適解を取得します。提案するコアベクターマシン(CVM)アルゴリズムは、非線形カーネルで使用でき、時間計算量はmに線形で、空間計算量はmに依存しません。大規模なおもちゃと実際のデータセットでの実験により、CVMは既存のSVM実装と同程度の精度でありながら、既存のスケールアップメソッドよりもはるかに高速で、はるかに大きなデータセットを処理できることが実証されています。たとえば、ガウスカーネルを使用したCVMは、約500万のトレーニングパターンを持つKDDCUP-99侵入検知データに対して、3.2GHz Pentium-4 PCでわずか1.4秒で優れた結果を生成します。

A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
大規模線形SVMの高速解のための修正有限ニュートン法

This paper develops a fast method for solving linear SVMs with L2loss function that is suited for large scale data mining tasks such astext classification. This is done by modifying the finite Newtonmethod of Mangasarian in several ways. Experiments indicate that themethod is much faster than decomposition methods such as SVMlight,SMO and BSVM (e.g., 4-100 fold), especially when the number ofexamples is large. The paper also suggests ways of extending themethod to other loss functions such as the modified Huber’s lossfunction and the L1 loss function, and also for solving ordinal regression.

この論文では、テキスト分類などの大規模なデータマイニングタスクに適した、L2loss関数を使用して線形SVMを高速に解く方法を開発します。これは、Mangasarianの有限ニュートン法をいくつかの方法で修正することによって行われます。実験によると、この方法は、特に例の数が多い場合に、SVMlight、SMO、BSVMなどの分解方法よりもはるかに高速です(例:4〜100倍)。また、この論文では、修正フーバー損失関数やL1損失関数などの他の損失関数にこの手法を拡張する方法や、順序回帰を解く方法も提案しています。

Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes
ランダムクラスからの関数の凸結合に対するスパース性とクラスタリングに基づく一般化の範囲と複雑さ

A unified approach is taken for deriving new generalization data dependent bounds for several classes of algorithmsexplored in the existing literature by different approaches. Thisunified approach is based on an extension of Vapnik’s inequality forVC classes of sets to random classes of sets – that is, classesdepending on the random data, invariant under permutation of the data and possessing the increasing property.Generalization bounds are derived for convex combinations offunctions from random classes with certain properties. Algorithms,such as SVMs (support vector machines), boosting with decision stumps, radial basis function networks, some hierarchiesof kernel machines or convex combinations of indicator functionsover sets with finite VC dimension, generate classifier functionsthat fall into the above category. We also explore the individualcomplexities of the classifiers, such as sparsity of weights andweighted variance over clusters from the convex combinationintroduced by Koltchinskii and Panchenko (2004), and showsparsity-type and cluster-variance-type generalization bounds forrandom classes.

既存の文献でさまざまなアプローチによって検討されているいくつかのアルゴリズムのクラスに対して、新しい一般化データ依存境界を導出するために、統一されたアプローチが採用されています。この統一されたアプローチは、VCクラスの集合に対するVapnikの不等式のランダムな集合クラスへの拡張に基づいています。つまり、ランダムデータに依存し、データの順列に対して不変で、増加する特性を持つクラスです。一般化境界は、特定の特性を持つランダムクラスから関数の凸結合に対して導出されます。SVM (サポートベクターマシン)、決定スタンプによるブースティング、ラジアルベース関数ネットワーク、カーネルマシンのいくつかの階層、または有限VC次元を持つ集合上のインジケータ関数の凸結合などのアルゴリズムは、上記のカテゴリに該当する分類関数を生成します。また、KoltchinskiiとPanchenko (2004)によって導入された凸結合からのクラスター上の重みのスパース性や加重分散などの分類器の個々の複雑さを調査し、ランダムクラスのスパース性型およびクラスター分散型の一般化境界を示します。

Tutorial on Practical Prediction Theory for Classification
分類のための実践的予測理論に関するチュートリアル

We discuss basic prediction theory and its impact on classificationsuccess evaluation, implications for learning algorithm design, anduses in learning algorithm execution. This tutorial is meant to bea comprehensive compilation of results which are both theoreticallyrigorous and quantitatively useful.

私たちは、基本的な予測理論とそれが分類に与える影響、成功評価、学習アルゴリズム設計への影響、および学習アルゴリズムの実行における使用について説明します。このチュートリアルは、理論的に厳密で定量的に有用な結果を包括的にまとめることを目的としています。

Denoising Source Separation
ノイズ除去ソース分離

A new algorithmic framework called denoising source separation (DSS)is introduced. The main benefit of this framework is that it allowsfor the easy development of new source separation algorithms which can beoptimised for specific problems. In this framework, sourceseparation algorithms are constructed around denoisingprocedures. The resulting algorithms can range from almost blind tohighly specialised source separation algorithms. Both simple linearand more complex nonlinear or adaptive denoising schemes areconsidered. Some existing independent component analysis algorithmsare reinterpreted within the DSS framework and new, robust blind sourceseparation algorithms are suggested. The framework is derived as aone-unit equivalent to an EM algorithm for sourceseparation. However, in the DSS framework it is easy to utilisevarious kinds of denoising procedures which need not be based ongenerative models.In the experimental section, various DSS schemes areapplied extensively to artificial data, to realmagnetoencephalograms and to simulated CDMA mobile network signals.Finally, various extensions to the proposed DSS algorithms areconsidered. These include nonlinear observation mappings,hierarchical models and over-complete, nonorthogonal feature spaces.With these extensions, DSS appears to have relevance to manyexisting models of neural information processing.

ノイズ除去ソース分離(DSS)と呼ばれる新しいアルゴリズムフレームワークが導入されました。このフレームワークの主な利点は、特定の問題に最適化できる新しいソース分離アルゴリズムを簡単に開発できることです。このフレームワークでは、ノイズ除去手順を中心にソース分離アルゴリズムが構築されます。結果として得られるアルゴリズムは、ほぼブラインドから高度に特殊化されたソース分離アルゴリズムまで多岐にわたります。単純な線形およびより複雑な非線形または適応型ノイズ除去スキームの両方が検討されます。既存の独立成分分析アルゴリズムの一部はDSSフレームワーク内で再解釈され、新しい堅牢なブラインドソース分離アルゴリズムが提案されます。このフレームワークは、ソース分離用のEMアルゴリズムと1ユニットで同等として導出されます。しかし、DSSフレームワークでは、生成モデルに基づく必要のないさまざまな種類のノイズ除去手順を簡単に利用できます。実験セクションでは、さまざまなDSSスキームが人工データ、実際の脳磁図、シミュレートされたCDMAモバイルネットワーク信号に広範囲に適用されます。最後に、提案されたDSSアルゴリズムのさまざまな拡張が検討されます。これには、非線形観測マッピング、階層モデル、および過剰完全で非直交な特徴空間が含まれます。これらの拡張により、DSSは神経情報処理の多くの既存のモデルに関連性があるように見えます。

A Classification Framework for Anomaly Detection
異常検出のための分類フレームワーク

One way to describe anomalies is by saying that anomalies are not concentrated. This leads to the problem of finding level sets for the data generating density. We interpret this learning problem as a binary classification problem and compare the corresponding classification risk with the standard performance measure for the density level problem. In particular it turns out that the empirical classification risk can serve as an empirical performance measure for the anomaly detection problem. This allows us to compare different anomaly detection algorithms empirically, i.e. with the help of a test set. Furthermore, by the above interpretation we can give a strong justification for the well-known heuristic of artificially sampling “labeled” samples, provided that the sampling plan is well chosen. In particular this enables us to propose a support vector machine (SVM) for anomaly detection for which we can easily establish universal consistency. Finally, we report some experiments which compare our SVM to other commonly used methods including the standard one-class SVM.

異常を説明する方法の1つは、異常が集中していないと言うことです。これは、データ生成密度のレベルセットを見つけるという問題につながります。この学習問題をバイナリ分類問題として解釈し、対応する分類リスクを密度レベル問題の標準パフォーマンスメジャーと比較します。特に、経験的分類リスクは、異常検出問題の経験的パフォーマンスメジャーとして役立つことがわかります。これにより、さまざまな異常検出アルゴリズムを経験的に、つまりテストセットの助けを借りて比較できます。さらに、上記の解釈により、サンプリングプランが適切に選択されている限り、「ラベル付けされた」サンプルを人工的にサンプリングするというよく知られたヒューリスティックを強力に正当化できます。特に、これにより、普遍的な一貫性を簡単に確立できる異常検出用のサポートベクターマシン(SVM)を提案できます。最後に、SVMを、標準的な1クラスSVMを含む他の一般的な方法と比較するいくつかの実験を報告します。

Multiclass Boosting for Weak Classifiers
弱分類器のための多クラスブースティング

AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2.

AdaBoost.M2は、弱い塩基分類子を持つ多クラス問題用に設計されたブースティングアルゴリズムです。このアルゴリズムは、学習誤差の非常に緩やかな境界を最小化するように設計されています。パフォーマンス測定値の境界も最小化する2つの代替ブースティングアルゴリズムを提案します。これらのパフォーマンス測定値は、学習誤差ほど期待誤差に強く結びついていませんが、派生した範囲はAdaBoost.M2の学習誤差の範囲よりも狭くなります。実験では、トレーニングとテストのエラー率を最小限に抑えるという点で、これらの手法はほぼ同じパフォーマンスを発揮します。新しいアルゴリズムには、基本分類子が信頼度評価誤差を最小化する必要があるのに対し、AdaBoost.M2では基本分類子が擬似損失を最小化する必要があるという利点があります。これにより、既存の基本分類器に簡単に適用できます。また、新しいアルゴリズムはAdaBoost.M2よりも速く収束する傾向があります。

Information Bottleneck for Gaussian Variables
ガウス変数の情報ボトルネック

The problem of extracting the relevant aspects of data waspreviously addressed through the information bottleneck (IB)method, through (soft) clustering one variable while preservinginformation about another – relevance – variable. The currentwork extends these ideas to obtain continuous representations thatpreserve relevant information, rather than discrete clusters, forthe special case of multivariate Gaussian variables. While the general continuous IB problem is difficult to solve, we provide an analytic solution for the optimal representation andtradeoff between compression and relevance for the this importantcase. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized regressionmatrix Σx|yΣx-1, which is also the basis obtained in canonical correlation analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector, through a cascade ofstructural phase transitions. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrizedby the compression level. Our analysis also provides a complete analytic expression of the preserved information as a function ofthe compression (the “information-curve”), in terms of theeigenvalue spectrum of the data. As in the discrete case, the information curve is concave and smooth, though it is madeof different analytic segments for each optimal dimension. Finally,we show how the algorithmic theory developed in the IB framework provides an iterative algorithm for obtaining the optimal Gaussianprojections.

データの関連側面を抽出する問題は、以前は情報ボトルネック(IB)方式で対処されていました。これは、1つの変数を(ソフト)クラスタリングしながら、別の変数(関連性)に関する情報を保持するというものです。現在の作業では、これらのアイデアを拡張して、多変量ガウス変数の特殊なケースについて、離散クラスターではなく関連情報を保持する連続表現を取得します。一般的な連続IB問題は解決が困難ですが、この重要なケースの最適な表現と、圧縮と関連性のトレードオフに関する解析ソリューションを提供します。得られた最適表現は、正規化された回帰行列 Σx|yΣx-1の固有ベクトルへのノイズの多い線形投影であり、これは正準相関分析で得られる基底でもあります。ただし、ガウスIBでは、圧縮トレードオフパラメータによって、構造相転移のカスケードを通じて、各固有ベクトルの次元とスケールが一意に決定されます。これにより、異なるランクのソリューションが圧縮レベルによってパラメータ化された連続体上に存在するという新しい解釈が導入されます。また、この分析では、圧縮の関数として保持された情報(「情報曲線」)の完全な分析表現が、データの固有値スペクトルの観点から提供されます。離散的な場合と同様に、情報曲線は凹状で滑らかですが、最適な次元ごとに異なる分析セグメントで構成されています。最後に、IBフレームワークで開発されたアルゴリズム理論が、最適なガウス投影を取得するための反復アルゴリズムを提供する方法を示します。

Diffusion Kernels on Statistical Manifolds
統計多様体上の拡散カーネル

A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold definedby the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental resultsare presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels,which have been the standard for text classification.

統計モデルの幾何学的構造を利用する統計学習用のカーネルファミリが導入されています。カーネルは、統計ファミリーに関連付けられたフィッシャー情報計量によって定義されるリーマン多様体上の熱方程式に基づいており、ユークリッド空間のガウスカーネルを一般化します。重要な特殊なケースとして、多項族の幾何学に基づくカーネルが導出され、離散データに自然に適用されるカーネルベースの学習アルゴリズムにつながります。カーネルの被覆数とRademacher平均の境界は、リーマン多様体上のラプラシアンの固有値の境界を使用して証明されます。実験結果は、多項幾何学の使用が自然で動機付けられているドキュメント分類について提示され、テキスト分類の標準であったガウスカーネルまたは線形カーネルの標準的な使用よりも改善が見られます。

Learning Hidden Variable Networks: The Information Bottleneck Approach
隠れ変数ネットワークの学習:情報ボトルネックアプローチ

A central challenge in learning probabilistic graphical models is dealing with domains that involve hidden variables. The common approach for learning model parameters in such domains is the expectation maximization (EM) algorithm. This algorithm, however, can easily get trapped in sub-optimal local maxima. Learning the model structure is even more challenging. The structural EM algorithm can adapt the structure in the presence of hidden variables, but usually performs poorly without prior knowledge about the cardinality and location of the hidden variables. In this work, we present a general approach for learning Bayesian networks with hidden variables that overcomes these problems. The approach builds on the information bottleneck framework of Tishby et al. (1999). We start by proving formal correspondence between the information bottleneck objective and the standard parametric EM functional. We then use this correspondence to construct a learning algorithm that combines an information-theoretic smoothing term with a continuation procedure. Intuitively, the algorithm bypasses local maxima and achieves superior solutions by following a continuous path from a solution of, an easy and smooth, target function, to a solution of the desired likelihood function. As we show, our algorithmic framework allows learning of the parameters as well as the structure of a network. In addition, it also allows us to introduce new hidden variables during model selection and learn their cardinality. We demonstrate the performance of our procedure on several challenging real-life data sets.

確率的グラフィカルモデルの学習における中心的な課題は、隠れた変数を含むドメインを扱うことです。このようなドメインでモデルパラメータを学習するための一般的なアプローチは、期待値最大化(EM)アルゴリズムです。ただし、このアルゴリズムは、最適ではない局所的最大値に陥りやすいです。モデル構造の学習はさらに困難です。構造EMアルゴリズムは、隠れた変数が存在する場合に構造を適応させることができますが、通常、隠れた変数のカーディナリティと場所に関する事前知識がないとパフォーマンスが低下します。この研究では、これらの問題を克服する、隠れた変数を含むベイジアンネットワークを学習するための一般的なアプローチを紹介します。このアプローチは、Tishbyら(1999)の情報ボトルネックフレームワークに基づいています。まず、情報ボトルネックの目的関数と標準のパラメトリックEM関数との間の形式的な対応関係を証明します。次に、この対応関係を使用して、情報理論的平滑化項と継続手順を組み合わせた学習アルゴリズムを構築します。直感的に、このアルゴリズムは局所的最大値を回避し、簡単で滑らかな目標関数の解から目的の尤度関数の解までの連続パスをたどることで優れた解を実現します。ここで示すように、このアルゴリズムフレームワークでは、ネットワークの構造だけでなくパラメーターの学習も可能です。さらに、モデル選択中に新しい隠れた変数を導入し、それらのカーディナリティを学習することもできます。この手順のパフォーマンスを、いくつかの難しい実際のデータセットで実証します。

Stability of Randomized Learning Algorithms
ランダム化学習アルゴリズムの安定性

We extend existing theory on stability, namely how muchchanges in the training data influence the estimated models, andgeneralization performance of deterministic learning algorithms to thecase of randomized algorithms. We give formal definitions of stabilityfor randomized algorithms and prove non-asymptotic bounds on thedifference between the empirical and expected error as well as theleave-one-out and expected error of such algorithms that depend ontheir random stability. The setup we develop for this purpose can bealso used for generally studying randomized learning algorithms. Wethen use these general results to study the effects of bagging on thestability of a learning method and to prove non-asymptotic bounds onthe predictive performance of bagging which have not been possible toprove with the existing theory of stability for deterministic learningalgorithms.

私たちは、安定性に関する既存の理論、つまり、トレーニングデータの変更が推定モデルにどの程度影響するか、および決定論的学習アルゴリズムの一般化パフォーマンスをランダム化アルゴリズムの場合に拡張します。ランダム化アルゴリズムの安定性の正式な定義を与え、経験誤差と期待誤差の違い、およびランダム安定性に依存するアルゴリズムのleave-one-out誤差と期待誤差の非漸近限界を証明します。この目的のために開発するセットアップは、一般的にランダム化学習アルゴリズムを研究するためにも使用できます。次に、これらの一般的な結果を使用して、学習方法の安定性に対するバギングの影響を研究し、決定論的学習アルゴリズムの既存の安定性理論では証明できなかったバギングの予測パフォーマンスに対する非漸近限界を証明します。

Dimension Reduction in Text Classification with Support Vector Machines
サポートベクターマシンによるテキスト分類の次元削減

Support vector machines (SVMs) have been recognized as one of the mostsuccessful classification methods for many applications including textclassification. Even though the learning ability and computationalcomplexity of training in support vector machines may be independentof the dimension of the feature space, reducing computationalcomplexity is an essential issue to efficiently handle a large numberof terms in practical applications of text classification. In thispaper, we adopt novel dimension reduction methods to reduce thedimension of the document vectors dramatically. We also introducedecision functions for the centroid-based classification algorithm andsupport vector classifiers to handle the classification problem wherea document may belong to multiple classes. Our substantialexperimental results show that with several dimension reductionmethods that are designed particularly for clustered data, higherefficiency for both training and testing can be achieved withoutsacrificing prediction accuracy of text classification even when thedimension of the input space is significantly reduced.

サポートベクターマシン(SVM)は、テキスト分類を含む多くのアプリケーションで最も成功した分類方法の1つとして認識されています。サポートベクターマシンの学習能力とトレーニングの計算の複雑さは、特徴空間の次元とは無関係である可能性がありますが、計算の複雑さを軽減することは、テキスト分類の実際のアプリケーションで多数の用語を効率的に処理するための重要な問題です。この論文では、ドキュメントベクトルの次元を大幅に削減する新しい次元削減方法を採用しています。また、文書が複数のクラスに属する可能性がある分類問題を処理するために、重心ベースの分類アルゴリズムとサポートベクター分類器の決定関数も導入しました。私たちの実質的な実験結果によると、特にクラスター化されたデータ用に設計されたいくつかの次元削減方法を使用すると、入力空間の次元が大幅に削減された場合でも、テキスト分類の予測精度を犠牲にすることなく、トレーニングとテストの両方でより高い効率を達成できます。

Asymptotic Model Selection for Naive Bayesian Networks
単純ベイジアンネットワークのための漸近モデルの選択

We develop a closed form asymptotic formula to compute the marginallikelihood of data given a naive Bayesian network model with twohidden states and binary features. This formula deviates from thestandard BIC score. Our work provides a concrete example thatthe BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood.

私たちは、2つの隠れ状態とバイナリ特徴を持つ素朴なベイジアンネットワークモデルが与えられたデータの周辺尤度を計算するための閉形式の漸近式を開発します。この式は、標準のBICスコアから逸脱しています。私たちの研究は、BICスコアが層化指数ファミリーに属する統計モデルに対して一般的に正しくないという具体的な例を示しています。この主張は、BICスコアが周辺尤度の正しい漸近近似を提供することが証明されている線形指数族と曲線指数族とは対照的です。

関連記事