Journal of Machine Learning Research Papers: Volume 8の論文一覧

Journal of Machine Learning Research Papers Volume 8に記載されている内容を一覧にまとめ、機械翻訳を交えて日本語化し掲載します。
Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
動的重み付けマジョリティ:ドリフト概念のアンサンブル法

We present an ensemble method for concept drift that dynamically createsand removes weighted experts in response to changes in performance.The method, dynamic weighted majority (DWM), uses four mechanismsto cope with concept drift:It trains online learners of the ensemble, it weights those learnersbased on their performance, it removes them, also based on theirperformance, and it adds new experts based on the global performanceof the ensemble.After an extensive evaluation—consisting of five experiments,eight learners,and thirty data sets that varied in type of targetconcept, size, presence of noise, and the like—we concluded thatDWM outperformed other learnersthat only incrementally learn concept descriptions,that maintain and use previously encountered examples, andthat employ an unweighted, fixed-size ensemble of experts.

私たちは、パフォーマンスの変化に応じて重み付けされた専門家を動的に作成および削除する概念ドリフトのアンサンブル手法を提示します。この手法であるダイナミック・ウェイト・マジョリティ(DWM)は、コンセプトのドリフトに対処するために、アンサンブルのオンライン学習者を訓練し、そのパフォーマンスに基づいてそれらの学習者を重み付けし、また彼らのパフォーマンスに基づいて彼らを排除し、アンサンブルのグローバルなパフォーマンスに基づいて新しい専門家を追加します。5つの実験、8人の学習者、およびターゲット概念の種類、サイズ、ノイズの存在などが異なる30のデータセットからなる広範な—評価の結果—DWMは、概念の説明を段階的に学習し、以前に遭遇した例を維持および使用し、重み付けされていない固定サイズの専門家のアンサンブルを採用している他の学習者よりも優れていると結論付けました。

A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning
最適Bayes分割による順位回帰における新しい確率論的アプローチ

In this paper, we consider the supervised learning task which consistsin predicting the normalized rank of a numerical variable. Weintroduce a novel probabilistic approach to estimate the posteriordistribution of the target rank conditionally to the predictors. Weturn this learning task into a model selection problem. For that, wedefine a 2D partitioning family obtained by discretizing numericalvariables and grouping categorical ones and we derive an analyticalcriterion to select the partition with the highest posteriorprobability. We show how these partitions can be used to buildunivariate predictors and multivariate ones under a naive Bayesassumption.We also propose a new evaluation criterion for probabilistic rankestimators. Based on the logarithmic score, we show that suchcriterion presents the advantage to be minored, which is not the caseof the logarithmic score computed for probabilistic value estimator.A first set of experimentations on synthetic data shows the goodproperties of the proposed criterion and of our partitioning approach.A second set of experimentations on real data shows competitiveperformance of the univariate and selective naive Bayes rankestimators projected on the value range compared to methods submittedto a recent challenge on probabilistic metric regression tasks.Our approach is applicable for all regression problems with categorical or numerical predictors. It is particularly interesting for those with a high number of predictors as it automatically detects the variables which contain predictive information. It builds pertinent predictors of the normalized rank of the numerical target from one or several predictors. As the criteria selection is regularized by the presence of a prior and a posterior term, it does not suffer from overfitting.

この論文では、数値変数の正規化ランクを予測する教師あり学習タスクについて検討します。予測子に対して条件付きでターゲットランクの事後分布を推定する新しい確率的アプローチを紹介します。この学習タスクをモデル選択問題に変換します。そのために、数値変数を離散化し、カテゴリ変数をグループ化して得られる2Dパーティションファミリを定義し、事後確率が最も高いパーティションを選択するための分析基準を導出します。これらのパーティションを使用して、ナイーブベイズの仮定の下で単変量予測子と多変量予測子を構築する方法を示します。また、確率的ランク推定子の新しい評価基準も提案します。対数スコアに基づいて、このような基準は軽視すべき利点があることを示していますが、これは確率的値推定量に対して計算された対数スコアには当てはまりません。合成データの最初の一連の実験では、提案された基準と私たちの分割アプローチの優れた特性が示されています。実際のデータでの2番目の一連の実験では、値の範囲に投影された単変量および選択的ナイーブベイズ順位推定量の、確率的メトリック回帰タスクに関する最近のチャレンジに提出された方法との比較による競争力のあるパフォーマンスが示されています。私たちのアプローチは、カテゴリまたは数値予測子を持つすべての回帰問題に適用できます。予測情報を含む変数を自動的に検出するため、予測子の数が多い場合に特に興味深いものです。1つまたは複数の予測子から、数値ターゲットの正規化された順位の適切な予測子を構築します。基準の選択は事前項と事後項の存在によって正規化されるため、過剰適合の影響を受けません。

Stagewise Lasso
ステージワイズラッソ

Many statistical machine learning algorithmsminimize either an empirical loss function asin AdaBoost, or a penalized empirical loss as in Lasso or SVM.A single regularization tuning parameter controls the trade-offbetween fidelity to the data and generalizability, orequivalently between bias and variance.When this tuning parameter changes, a regularization “path” of solutionsto the minimization problem is generated, and the whole path is needed to selecta tuning parameter to optimize the prediction or interpretation performance.Algorithms such as homotopy-Lasso or LARS-Lasso and Forward Stagewise Fitting (FSF) (aka e-Boosting) are of great interest becauseof their resulted sparse models for interpretation in addition to prediction.In this paper, we propose the BLasso algorithm thatties the FSF (e-Boosting) algorithm with the Lasso methodthat minimizes the L1 penalized L2 loss. BLasso is derivedas a coordinate descent method with a fixed stepsize applied to thegeneral Lasso loss function (L1 penalized convex loss). Itconsists of both a forward step and a backward step. The forwardstep is similar to e-Boosting or FSF, but thebackward step is new and revises the FSF (or e-Boosting) path to approximate theLasso path. In the cases of a finite number of base learners and abounded Hessian of the loss function,the BLasso path is shown to converge to the Lasso path whenthe stepsize goes to zero. Forcases with a larger number of base learners than the sample sizeand when the true model is sparse, our simulations indicatethat the BLasso model estimatesare sparser than those from FSF with comparableor slightly better prediction performance, and thatthe the discrete stepsize of BLasso and FSF hasan additional regularization effect in termsof prediction and sparsity.Moreover, we introduce the Generalized BLassoalgorithmto minimize a general convex loss penalized by a general convexfunction. Since the (Generalized) BLasso relies only on differences not derivatives,we conclude that it provides a class ofsimple and easy-to-implement algorithmsfor tracing the regularization or solution paths of penalized minimization problems.

多くの統計的機械学習アルゴリズムは、AdaBoostのような経験的損失関数、またはLassoやSVMのようなペナルティ付き経験的損失のいずれかを最小化します。単一の正則化チューニングパラメータは、データへの忠実度と一般化可能性、または同等にバイアスと分散の間のトレードオフを制御します。このチューニングパラメータが変化すると、最小化問題に対するソリューションの正則化「パス」が生成され、予測または解釈のパフォーマンスを最適化するチューニングパラメータを選択するにはパス全体が必要になります。ホモトピーLassoやLARS-Lasso、Forward Stagewise Fitting (FSF) (別名e-Boosting)などのアルゴリズムは、予測だけでなく解釈にもスパースモデルを生成するため、非常に興味深いものです。この論文では、FSF (e-Boosting)アルゴリズムと、L1ペナルティ付きL2損失を最小化するLassoメソッドを結び付けるBLassoアルゴリズムを提案します。BLassoは、一般的なLasso損失関数(L1ペナルティ付き凸損失)に適用される固定ステップサイズを使用した座標降下法として導出されます。これは、フォワードステップとバックワードステップの両方で構成されます。フォワードステップはe-BoostingまたはFSFに似ていますが、バックワードステップは新しく、FSF (またはe-Boosting)パスを修正してLassoパスを近似します。ベース学習器の数が有限で、損失関数のヘッセ行列が制限されている場合、ステップサイズがゼロになると、BLassoパスはLassoパスに収束することが示されています。サンプルサイズよりもベース学習者の数が多く、真のモデルがスパースである場合、シミュレーションでは、BLassoモデル推定値はFSFの推定値よりもスパースであり、予測性能は同等かわずかに優れていること、BLassoとFSFの離散ステップサイズには予測とスパース性の点で追加の正則化効果があることが示されています。さらに、一般凸関数によってペナルティが課せられる一般凸損失を最小化するために、一般化BLassoアルゴリズムを導入します。(一般化) BLassoは差分のみに依存し、微分には依存しないため、ペナルティが課せられた最小化問題の正則化または解決パスをトレースするためのシンプルで実装しやすいアルゴリズムのクラスを提供すると結論付けています。

Ranking the Best Instances
ベストインスタンスのランキング

We formulate a local form of the bipartite ranking problem where the goal is tofocus on the best instances. We propose a methodology based on theconstruction of real-valued scoring functions. We study empiricalrisk minimization of dedicated statistics which involve empiricalquantiles of the scores. We first state the problem of finding the best instances which can be cast as a classificationproblem with mass constraint. Next, we develop special performancemeasures for the local ranking problem which extend the Area Underan ROC Curve (AUC) criterion and describe the optimalelements of these new criteria. We also highlight the fact thatthe goal of ranking the best instances cannot be achieved in astage-wise manner where first, the best instances would betentatively identified and then a standard AUC criterion could beapplied. Eventually, we state preliminary statistical results forthe local ranking problem.

私たちは、最高のインスタンスに焦点を当てることを目標とする二者ランキング問題のローカル形式を定式化します。実数値スコアリング関数の構築に基づく方法論を提案します。スコアの経験的分位数を含む専用統計の経験的リスクの最小化を研究します。最初に、質量制約のある分類問題としてキャストできる最良のインスタンスを見つける問題を述べます。次に、Area Underan ROC Curve(AUC)基準を拡張し、これらの新しい基準の最適な要素を説明する、ローカルランキング問題に対する特別なパフォーマンスメジャーを開発します。また、最高のインスタンスをランク付けするという目標は、最初に最良のインスタンスをベット的に特定し、次に標準のAUC基準を適用できるような、段階的な方法では達成できないという事実も強調します。最終的には、ローカルランキング問題の予備的な統計結果を示します。

Hierarchical Average Reward Reinforcement Learning
階層平均報酬強化学習

Hierarchical reinforcement learning (HRL) is a general framework forscaling reinforcement learning (RL) to problems with large state andaction spaces by using the task (or action) structure to restrict the space of policies. Prior work in HRL including HAMs, options, MAXQ, and PHAMs has been limited to the discrete-time discounted reward semi-Markov decision process (SMDP) model. The average reward optimality criterion has been recognized to be more appropriate for a wide class of continuing tasks than the discounted framework. Although average reward RL has been studied for decades, prior work has been largely limited to flat policy representations. In this paper, we develop a framework for HRL based on the average reward optimality criterion. We investigate two formulationsof HRL based on the average reward SMDP model, both for discrete-timeand continuous-time. These formulations correspond to two notions ofoptimality that have been previously explored in HRL: hierarchical optimality and recursive optimality. We present algorithms that learn to find hierarchically and recursively optimal average reward policies under discrete-time and continuous-time average reward SMDP models.We use two automated guided vehicle (AGV) scheduling tasks asexperimental testbeds to study the empirical performance of theproposed algorithms. The first problem is a relatively simple AGVscheduling task, in which the hierarchically and recursively optimal policies are different. We compare the proposed algorithms with three other HRL methods, including a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm on this problem. The second problem is a larger AGV scheduling task. We model this problem using both discrete-time and continuous-time models. We use a hierarchical task decomposition in which the hierarchically and recursively optimal policies are the same for this problem. We compare the performance of the proposed algorithms with a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm, as well as a non-hierarchical average reward algorithm. The results show that the proposed hierarchical average reward algorithms converge to the same performance as their discounted reward counterparts.

階層的強化学習(HRL)は、タスク(またはアクション)構造を使用してポリシーの空間を制限することにより、強化学習(RL)を大規模な状態およびアクション空間の問題に拡張するための一般的なフレームワークです。HAM、オプション、MAXQ、PHAMなどのHRLに関するこれまでの研究は、離散時間割引報酬半マルコフ決定プロセス(SMDP)モデルに限定されていました。平均報酬最適基準は、割引フレームワークよりも幅広い継続タスクに適していることが認識されています。平均報酬RLは数十年にわたって研究されてきましたが、これまでの研究は主にフラットなポリシー表現に限定されていました。この論文では、平均報酬最適基準に基づいてHRLのフレームワークを開発します。離散時間と連続時間の両方について、平均報酬SMDPモデルに基づくHRLの2つの定式化を調査します。これらの定式化は、HRLで以前に検討された2つの最適化の概念、つまり階層的最適性と再帰的最適性に対応しています。離散時間および連続時間の平均報酬SMDPモデルのもとで、階層的および再帰的に最適な平均報酬ポリシーを見つけることを学習するアルゴリズムを紹介します。提案されたアルゴリズムの実験的パフォーマンスを調査するために、2つの自動ガイド車両(AGV)スケジューリングタスクを実験テストベッドとして使用します。最初の問題は、階層的および再帰的に最適なポリシーが異なる、比較的単純なAGVスケジューリングタスクです。この問題に対して、提案されたアルゴリズムを、階層的に最適な割引報酬アルゴリズムおよび再帰的に最適な割引報酬アルゴリズムを含む他の3つのHRL方法と比較します。2番目の問題は、より大規模なAGVスケジューリングタスクです。この問題は、離散時間モデルと連続時間モデルの両方を使用してモデル化します。この問題では、階層的および再帰的に最適なポリシーが同じである階層タスク分解を使用します。提案されたアルゴリズムのパフォーマンスを、階層的に最適な割引報酬アルゴリズム、再帰的に最適な割引報酬アルゴリズム、および非階層的平均報酬アルゴリズムと比較します。結果は、提案された階層的平均報酬アルゴリズムが、割引報酬アルゴリズムと同じパフォーマンスに収束することを示しています。

Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
未知のダイナミクスを持つ環境での学習:よりロバストな概念学習者に向けて

In the process of concept learning, target concepts may have portionswith short-term changes, other portions may support long-term changes,and yet others may not change at all. For this reason several localwindows need to be handled. We suggest facing this problem, whichnaturally exists in the field of concept learning, by allocatingwindows which can adapt their size to portions of the targetconcept. We propose an incremental decision tree that is updated withincoming examples. Each leaf of the decision tree holds a time windowand a local performance measure as the main parameter to becontrolled. When the performance of a leaf decreases, the size of itslocal window is reduced. This learning algorithm, called OnlineTree2,automatically adjusts its internal parameters in order to face thecurrent dynamics of the data stream. Results show that it iscomparable to other batch algorithms when facing problems with noconcept change, and it is better than evaluated methods in its abilityto deal with concept drift when dealing with problems in which:concept change occurs at different speeds, noise may be present and,examples may arrive from different areas of the problem domain(virtual drift).

概念学習のプロセスでは、対象概念に短期的な変化のある部分、長期的な変化のある部分、まったく変化しない部分がある場合があります。このため、複数のローカルウィンドウを処理する必要があります。概念学習の分野で自然に存在するこの問題に対処するには、対象概念の各部分にサイズを適応できるウィンドウを割り当てることをお勧めします。提案するのは、入力例内で更新される増分決定木です。決定木の各リーフには、制御する主なパラメータとして、時間ウィンドウとローカルパフォーマンスメジャーが保持されます。リーフのパフォーマンスが低下すると、そのローカルウィンドウのサイズが縮小されます。OnlineTree2と呼ばれるこの学習アルゴリズムは、データストリームの現在のダイナミクスに対応するために、内部パラメータを自動的に調整します。結果は、概念の変化がない問題に直面した場合、このアルゴリズムが他のバッチアルゴリズムと同等であり、概念の変化が異なる速度で発生し、ノイズが存在する可能性があり、例が問題領域の異なる領域から到着する可能性がある(仮想ドリフト)問題に対処する場合、評価された方法よりも概念ドリフトに対処する能力が優れていることを示しています。

VC Theory of Large Margin Multi-Category Classifiers
大マージン多カテゴリ分類器のVC理論

In the context of discriminant analysis, Vapnik’s statisticallearning theory has mainly been developed in three directions:the computation of dichotomies with binary-valued functions,the computation of dichotomies with real-valued functions,and the computation of polytomies with functions taking their valuesin finite sets, typically the set of categories itself. The case of classes of vector-valued functionsused to compute polytomies has seldom been considered independently,which is unsatisfactory, for three main reasons. First,this case encompasses the other ones.Second, it cannot be treated appropriatelythrough a naïve extension of the results devoted to the computationof dichotomies. Third, most of the classification problems met in practiceinvolve multiple categories.In this paper, a VC theory of large margin multi-category classifiersis introduced. Central in this theory are generalized VC dimensions calledthe γ-Ψ-dimensions. First, a uniform convergence boundon the risk of the classifiers of interest is derived.The capacity measure involved in this bound is a covering number.This covering number can be upper bounded in terms of theγ-Ψ-dimensions thanks to generalizations of Sauer’s lemma, as is illustratedin the specific case of the scale-sensitive Natarajan dimension.A bound on this latter dimension is then computed for the class of functionson which multi-class SVMs are based. This makes it possible to applythe structural risk minimization inductive principle to those machines.

判別分析の文脈では、Vapnikの統計学習理論は主に3つの方向に展開されています。2値関数による二分法の計算、実数値関数による二分法の計算、および有限集合(通常はカテゴリ集合自体)に値を持つ関数による多分法の計算です。多分法の計算に使用されるベクトル値関数のクラスの場合が独立して考慮されることはほとんどありませんでしたが、これは主に3つの理由から不十分です。第1に、このケースは他のケースを包含します。第2に、二分法の計算に充てられた結果を単純に拡張しても適切に処理できません。第3に、実際に遭遇する分類問題のほとんどは複数のカテゴリに関係します。この論文では、大きなマージンを持つマルチカテゴリ分類器のVC理論を紹介します。この理論の中心となるのは、γ-Ψ 次元と呼ばれる一般化されたVC次元です。まず、関心のある分類器のリスクに対する一様収束境界が導出されます。この境界に含まれる容量の尺度は被覆数です。この被覆数は、スケールに敏感なナタラジャン次元の特定のケースで示されているように、ザウアーの補題の一般化により、γ-Ψ次元に関して上限を設定できます。次に、この後者の次元の境界が、マルチクラスSVMが基づいている関数のクラスに対して計算されます。これにより、構造リスク最小化の帰納原理をこれらのマシンに適用できるようになります。

Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
セット・カバー・マシンの損失範囲と不均衡データのサンプル圧縮損失範囲の改訂

Marchand and Shawe-Taylor (2002) have proposed a loss bound for the set coveringmachine that has the property to depend on the observed fractionof positive examples and on what the classifier achieves on thepositive training examples. We show that this loss bound isincorrect. We then propose a loss bound, valid for anysample-compression learning algorithm (including the set coveringmachine), that depends on the observed fraction of positiveexamples and on what the classifier achieves on them. We alsocompare numerically the loss bound proposed in this paper with theincorrect bound, the original SCM bound and arecently proposed loss bound of Marchand and Sokolova (2005) (which does notdepend on the observed fraction of positive examples) and showthat the latter loss bounds can be substantially larger than thenew bound in the presence of imbalanced misclassifications.

Marchand and Shawe-Taylor (2002)は、観察された正の例の割合と、分類器が正のトレーニング例で何を達成するかに依存する特性を持つセットカバーマシンの損失限界を提案しました。この損失範囲が正しくないことを示します。次に、任意のサンプル圧縮学習アルゴリズム(set coveringmachineを含む)に有効な損失範囲を提案します。これは、観測されたポジティブサンプルの割合と、分類器がそれらに対して何を達成するかに依存します。また、この論文で提案された損失限界を、Marchand and Sokolova (2005)の誤った限界、元のSCM限界、および最近提案された損失限界(これは、観察された正の例の割合に依存しない)と数値的に比較し、不均衡な誤分類が存在する場合、後者の損失限界が新しい境界よりも大幅に大きくなる可能性があることを示します。

Nonlinear Estimators and Tail Bounds for Dimension Reduction in l1 Using Cauchy Random Projections
コーシーランダム射影を使用したl1の次元縮小のための非線形推定量と裾限界

For dimension reduction in the l1 norm, the method ofCauchy random projections multipliesthe original data matrix A ∈ ℝn×Dwith a random matrix R ∈ ℝD×k (k≪D) whose entries are i.i.d. samplesof the standard Cauchy C(0,1). Because of the impossibility result, one can nothope to recover the pairwise l1 distances in Afrom B=A×R∈ ℝn×k,using linear estimators without incurring largeerrors. However, nonlinear estimators are still useful for certainapplications in data stream computations, informationretrieval, learning, and data mining.We study three types of nonlinear estimators: the sample median estimators, the geometric meanestimators, and the maximum likelihood estimators (MLE). We derive tail boundsfor the geometric mean estimators and establish that k = O(log n / ε2) suffices with the constantsexplicitly given. Asymptotically (as k→∞), both the sample median and the geometric mean estimators are about 80% efficient compared to the MLE. We analyze the moments of the MLEand propose approximating its distribution of by aninverse Gaussian.

l1ノルムの次元削減では、コーシーランダム射影法は、元のデータ行列A∈ ℝn×Dに、標準コーシーC(0,1)のi.i.d.サンプルであるランダム行列R∈ ℝD×k (k≪D)を乗算します。不可能な結果のため、大きな誤差を生じずに線形推定量を使用して、B=A×R∈ ℝn×kからAのペアワイズl1距離を回復することは期待できません。ただし、非線形推定量は、データストリーム計算、情報検索、学習、およびデータマイニングの特定のアプリケーションでは依然として有用です。ここでは、サンプル中央値推定量、幾何平均推定量、および最大尤度推定量(MLE)の3種類の非線形推定量を検討します。幾何平均推定値の末端境界を導出し、明示的に与えられた定数でk = O(log n /ε2)が十分であることを証明します。漸近的に(k→∞ として)、サンプル中央値と幾何平均推定値はどちらもMLEと比較して約80%効率的です。MLEのモーメントを分析し、その分布を逆ガウス分布で近似することを提案します。

On the Representer Theorem and Equivalent Degrees of Freedom of SVR
SVRの表現定理と等価自由度について

Support Vector Regression (SVR) for discrete data is considered.An alternative formulation of the representer theorem is derived.This result is based on the newly introduced notion ofpseudoresidual and the use of subdifferential calculus. Therepresenter theorem is exploited to analyze the sensitivityproperties of ε-insensitive SVR and introduce the notionof approximate degrees of freedom. The degrees of freedom areshown to play a key role in the evaluation of the optimism, thatis the difference between the expected in-sample error and theexpected empirical risk. In this way, it is possible to define aCp-like statistic that can be used for tuning the parametersof SVR. The proposed tuning procedure is tested on a simulatedbenchmark problem and on a real world problem (Boston Housingdata set).

離散データのサポートベクトル回帰(SVR)が考慮されます。表現定理の代替定式化が導出されます。この結果は、新たに導入された疑似残差の概念と準微分計算の使用に基づいています。そこでプレゼンター定理を利用して、ε非感受性SVRの感度特性を分析し、おおよその自由度の概念を導入します。自由度は、楽観主義の評価において重要な役割を果たすことが示されています。これは、予想されるサンプル内誤差と予想される経験的リスクとの差です。このようにして、SVRのパラメータを調整するために使用できるaCpのような統計を定義することができます。提案された調整手順は、シミュレートされたベンチマーク問題と実際の問題(Boston Housingデータセット)でテストされます。

The Need for Open Source Software in Machine Learning
機械学習におけるオープンソースソフトウェアの必要性

Open source tools have recently reached a level of maturity whichmakes them suitable for building large-scale real-world systems. Atthe same time, the field of machine learning has developed a largebody of powerful learning algorithms for diverse applications.However, the true potential of these methods is not used, sinceexisting implementations are not openly shared, resulting in softwarewith low usability, and weak interoperability. We argue that thissituation can be significantly improved by increasing incentives forresearchers to publish their software under an open sourcemodel. Additionally, we outline the problems authors are faced withwhen trying to publish algorithmic implementations of machine learningmethods. We believe that a resource of peer reviewed softwareaccompanied by short articles would be highly valuable to both themachine learning and the general scientific community.

オープンソースツールは最近、大規模な実世界のシステムの構築に適した成熟度に達しています。同時に、機械学習の分野は、多様なアプリケーションのための強力な学習アルゴリズムを大量に開発してきました。しかし、既存の実装がオープンに共有されていないため、これらの手法の真のポテンシャルは活用されず、ソフトウェアの使用可能性が低く、相互運用性が弱いという結果になります。私たちは、この状況は、研究者がオープンソースモデルの下でソフトウェアを公開するインセンティブを増やすことで大幅に改善できると主張しています。さらに、機械学習手法のアルゴリズム実装を公開しようとするときに著者が直面する問題についても概説します。私たちは、短い記事を伴う査読付きソフトウェアのリソースは、機械学習と一般的な科学コミュニティの両方にとって非常に価値があると信じています。

The Locally Weighted Bag of Words Framework for Document Representation
ドキュメント表現のための Locally Weighted Bag of Words フレームワーク

The popular bag of words assumption represents a document as ahistogram of word occurrences. While computationally efficient, such arepresentation is unable to maintain any sequential information. Wepresent an effective sequential document representation that goesbeyond the bag of words representation and its n-gramextensions. This representation uses local smoothing to embeddocuments as smooth curves in the multinomial simplex therebypreserving valuable sequential information. In contrast to bag ofwords or n-grams, the new representation is able to robustlycapture medium and long range sequential trends in the document. Wediscuss the representation and its geometric properties anddemonstrate its applicability for various text processing tasks.

一般的な単語の袋の仮定は、単語の出現のヒストグラムとしてドキュメントを表します。計算効率は高いですが、このような表現ではシーケンシャルな情報を保持できません。私たちは、単語の表現とそのn-gramextensionsの袋を超えて行く効果的なシーケンシャルドキュメント表現を提示します。この表現では、ローカル平滑化を使用して、ドキュメントを多項シンプレックスに滑らかな曲線として埋め込むことで、貴重なシーケンシャル情報を保持します。bag ofwordsやn-gramとは対照的に、新しい表現は、ドキュメント内の中距離および長距離のシーケンシャルトレンドを堅牢に捉えることができます。表現とその幾何学的特性について説明し、さまざまなテキスト処理タスクへの適用性を示します。

The On-Line Shortest Path Problem Under Partial Monitoring
パーシャルモニタリング下におけるオンライン最短経路問題

The on-line shortest path problem is considered under various modelsof partial monitoring. Given a weighted directed acyclic graph whoseedge weights can change in an arbitrary (adversarial) way, a decisionmaker has to choose in each round of a game a path between twodistinguished vertices such that the loss of the chosen path (definedas the sum of the weights of its composing edges) be as small aspossible. In a setting generalizing the multi-armed bandit problem,after choosing a path, the decision maker learns only the weights ofthose edges that belong to the chosen path. For this problem, analgorithm is given whose average cumulative loss in n rounds exceedsthat of the best path, matched off-line to the entire sequence of theedge weights, by a quantity that is proportional to 1/√n anddepends only polynomially on the number of edges of the graph. Thealgorithm can be implemented withcomplexity that is linear in the number ofrounds n (i.e., the average complexity per round is constant)and in the number of edges. An extension to the so-calledlabel efficient setting is also given, in which the decision maker isinformed about the weights of the edges corresponding to the chosenpath at a total of m ≪ n time instances. Another extension isshown where the decision maker competes against a time-varying path, ageneralization of the problem of tracking the best expert. A versionof the multi-armed bandit setting for shortest path is also discussedwhere the decision maker learns only the total weight of the chosen pathbut not the weights of the individual edges on the path. Applications torouting in packet switched networks along with simulation results arealso presented.

オンライン最短経路問題は、部分的監視のさまざまなモデルの下で検討されます。重み付き有向非巡回グラフのエッジの重みが任意(敵対的)に変化する可能性がある場合、意思決定者はゲームの各ラウンドで、選択したパスの損失(構成エッジの重みの合計として定義)が可能な限り小さくなるように、2つの区別された頂点間のパスを選択する必要があります。多腕バンディット問題を一般化する設定では、パスを選択した後、意思決定者は選択したパスに属するエッジの重みのみを学習します。この問題では、nラウンドの平均累積損失が、エッジの重みのシーケンス全体にオフラインで一致した最良パスの損失を、1/√nに比例し、グラフのエッジの数に多項式のみで依存する量だけ超えるアルゴリズムが提供されます。このアルゴリズムは、ラウンド数n (つまり、ラウンドあたりの平均複雑度は一定)とエッジ数に比例する複雑さで実装できます。ラベル効率設定と呼ばれる拡張も提供されており、この拡張では、選択したパスに対応するエッジの重みが合計m≪n回のインスタンスで意思決定者に通知されます。別の拡張では、意思決定者が時間変動パスと競合し、最適なエキスパートを追跡する問題の一般化が示されています。最短パスのマルチアームバンディット設定のバージョンについても説明します。このバージョンでは、意思決定者は選択したパスの合計重みのみを学習し、パス上の個々のエッジの重みは学習しません。パケット交換ネットワークでのルーティングへの応用とシミュレーション結果も示します。

AdaBoost is Consistent
AdaBoostは一貫性があります

The risk, or probability of error, of the classifier produced by theAdaBoost algorithm is investigated. In particular, we consider thestopping strategy to be used in AdaBoost to achieve universalconsistency. We show that provided AdaBoost is stopped aftern1-ε iterations—for sample size n and ε ∈ (0,1)—thesequence of risks of the classifiers it produces approaches theBayes risk.

AdaBoostアルゴリズムによって生成された分類子のリスク(エラーの確率)が調査されます。特に、AdaBoostで普遍的な一貫性を達成するために使用する停止戦略を検討しています。AdaBoostがn1から ε 回の反復後に停止した場合—サンプルサイズnと ε ∈(0,1)の場合—生成される分類器のリスクのシーケンスがベイズリスクに近づくことを示します。

Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
70,000人の編集者の専門知識を活用:テキスト分類のための知識ベースの特徴生成

Most existing methods for text categorization employinduction algorithms that use the words appearing in the trainingdocuments as features. While they perform well in many categorizationtasks, these methods are inherently limited when faced with morecomplicated tasks where external knowledge is essential. Recently,there have been efforts to augment these basic features with externalknowledge, including semi-supervised learning and transferlearning. In this work, we present a new framework for automaticacquisition of world knowledge and methods for incorporating it intothe text categorization process. Our approach enhances machinelearning algorithms with features generated from domain-specific andcommon-sense knowledge. This knowledge is represented by ontologiesthat contain hundreds of thousands of concepts, further enrichedthrough controlled Web crawling. Prior to text categorization, afeature generator analyzes the documents and maps them ontoappropriate ontology concepts that augment the bag of words used insimple supervised learning. Feature generation is accomplishedthrough contextual analysis of document text, thus implicitlyperforming word sense disambiguation. Coupled with the ability togeneralize concepts using the ontology, this approach addresses twosignificant problems in natural language processing—synonymy andpolysemy. Categorizing documents with the aid of knowledge-basedfeatures leverages information that cannot be deduced from thetraining documents alone. We applied our methodology using the OpenDirectory Project, the largest existing Web directory built by over70,000 human editors. Experimental results over a range of data setsconfirm improved performance compared to the bag of words documentrepresentation.

テキスト分類の既存の方法のほとんどは、トレーニングドキュメントに出現する単語を特徴として使用する誘導アルゴリズムを採用しています。これらの方法は多くの分類タスクで優れたパフォーマンスを発揮しますが、外部知識が不可欠なより複雑なタスクに直面した場合、本質的に限界があります。最近、半教師あり学習や転移学習など、外部知識を使用してこれらの基本機能を拡張する取り組みが行われています。この研究では、世界知識の自動取得の新しいフレームワークと、それをテキスト分類プロセスに組み込む方法を紹介します。私たちのアプローチは、ドメイン固有の知識と常識的な知識から生成された特徴を使用して、機械学習アルゴリズムを強化します。この知識は、数十万の概念を含むオントロジーによって表され、制御されたWebクロールによってさらに強化されます。テキスト分類の前に、特徴ジェネレーターがドキュメントを分析し、単純な教師あり学習で使用される単語のバッグを拡張する適切なオントロジー概念にそれらをマッピングします。特徴生成は、文書テキストのコンテキスト分析によって達成され、暗黙的に語義の曖昧さ解消が行われます。オントロジーを使用して概念を一般化する機能と組み合わせることで、このアプローチは、自然言語処理における2つの重要な問題、つまり同義語と多義性に対処します。知識ベースの特徴を利用して文書を分類すると、トレーニングドキュメントだけでは推測できない情報が活用されます。私たちは、70,000人以上の人間の編集者によって構築された最大の既存のWebディレクトリであるOpenDirectoryプロジェクトを使用して、この方法論を適用しました。さまざまなデータセットでの実験結果により、Bag of Words文書表現と比較してパフォーマンスが向上していることが確認されています。

Euclidean Embedding of Co-occurrence Data
共起データのユークリッド埋め込み

Embedding algorithms search for a low dimensional continuousrepresentation of data, but most algorithms only handle objects of asingle type for which pairwise distances are specified. This paperdescribes a method for embedding objects of different types, such asimages and text, into a single common Euclidean space, based on theirco-occurrence statistics. The joint distributions are modeled asexponentials of Euclidean distances in the low-dimensional embeddingspace, which links the problem to convex optimization over positivesemidefinite matrices. The local structure of the embeddingcorresponds to the statistical correlations via random walks in theEuclidean space. We quantify the performance of our method on two textdata sets, and show that it consistently and significantly outperformsstandard methods of statistical correspondence modeling, such asmultidimensional scaling, IsoMap and correspondence analysis.

埋め込みアルゴリズムは、データの低次元連続表現を検索しますが、ほとんどのアルゴリズムは、ペアワイズ距離が指定されている単一のタイプのオブジェクトのみを処理します。この論文では、画像やテキストなどの異なるタイプのオブジェクトを、それらの共起統計に基づいて単一の共通のユークリッド空間に埋め込む方法について説明します。同時分布は、低次元埋め込み空間のユークリッド距離の指数関数としてモデル化され、問題を正の半正定値行列上の凸最適化にリンクします。埋め込みの局所構造は、ユークリッド空間のランダムウォークによる統計的相関に対応します。2つのテキストデータセットでのこの手法のパフォーマンスを定量化し、多次元スケーリング、IsoMap、コレスポンデンス分析などの統計的対応モデリングの標準的な方法を一貫して大幅に上回っていることを示します。

Online Learning of Multiple Tasks with a Shared Loss
共有損失を伴う複数のタスクのオンライン学習

We study the problem of learning multiple tasks in parallel within theonline learning framework. On each online round, the algorithmreceives an instance for each of the parallel tasks and responds bypredicting the label of each instance. We consider the case where thepredictions made on each round all contribute toward a commongoal. The relationship between the various tasks is defined by aglobal loss function, which evaluates the overall quality of themultiple predictions made on each round. Specifically, each individualprediction is associated with its own loss value, and then thesemultiple loss values are combined into a single number using theglobal loss function. We focus on the case where the global lossfunction belongs to the family of absolute norms, and present severalonline learning algorithms for the induced problem. We proveworst-case relative loss bounds for all of our algorithms, anddemonstrate the effectiveness of our approach on a large-scalemulticlass-multilabel text categorization problem.

私たちは、オンライン学習フレームワーク内で複数のタスクを並行して学習する問題を研究します。各オンラインラウンドで、アルゴリズムは各並行タスクのインスタンスを受け取り、各インスタンスのラベルを予測して応答します。各ラウンドで行われた予測がすべて共通の目標に貢献する場合を検討します。さまざまなタスク間の関係は、各ラウンドで行われた複数の予測の全体的な品質を評価するグローバル損失関数によって定義されます。具体的には、各個別の予測は独自の損失値に関連付けられ、これらの複数の損失値はグローバル損失関数を使用して1つの数値に結合されます。グローバル損失関数が絶対基準のファミリーに属する場合に焦点を当て、誘発された問題に対する複数のオンライン学習アルゴリズムを示します。すべてのアルゴリズムの最悪のケースの相対損失境界を証明し、大規模なマルチクラスマルチラベルテキスト分類問題に対するアプローチの有効性を示します。

Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
Proto-value Functions:マルコフ決定過程における表現と制御の学習のためのラプラシアンフレームワーク

This paper introduces a novel spectral framework for solvingMarkov decision processes (MDPs) by jointly learning representationsand optimal policies. The major components of the framework describedin this paper include: (i) A general scheme for constructingrepresentations or basis functions by diagonalizing symmetricdiffusion operators (ii) A specific instantiation of thisapproach where global basis functions called proto-valuefunctions (PVFs) are formed using the eigenvectors of the graphLaplacian on an undirected graph formed from state transitionsinduced by the MDP (iii) A three-phased procedure called representation policy iteration comprising of a sample collectionphase, a representation learning phase that constructs basis functionsfrom samples, and a final parameter estimation phase that determinesan (approximately) optimal policy within the (linear) subspace spannedby the (current) basis functions. (iv) A specific instantiation of theRPI framework using least-squares policy iteration (LSPI) as theparameter estimation method (v) Several strategies for scaling theproposed approach to large discrete and continuous state spaces,including the Nyström extension for out-of-sampleinterpolation of eigenfunctions, and the use of Kronecker sumfactorization to construct compact eigenfunctions in product spacessuch as factored MDPs (vi) Finally, a series of illustrative discreteand continuous control tasks, which both illustrate the concepts andprovide a benchmark for evaluating the proposed approach. Manychallenges remain to be addressed in scaling the proposed framework tolarge MDPs, and several elaboration of the proposed framework arebriefly summarized at the end.

この論文では、表現と最適ポリシーを共同学習することでマルコフ決定プロセス(MDP)を解決するための新しいスペクトルフレームワークを紹介します。この論文で説明するフレームワークの主なコンポーネントは次のとおりです。(i)対称拡散演算子を対角化することで表現または基底関数を構築する一般的なスキーム、(ii)このアプローチの特定のインスタンス化。このインスタンスでは、MDPによって誘発される状態遷移から形成される無向グラフ上のグラフラプラシアンの固有ベクトルを使用して、プロト値関数(PVF)と呼ばれるグローバル基底関数が形成されます。(iii)サンプル収集フェーズ、サンプルから基底関数を構築する表現学習フェーズ、および(現在の)基底関数が張る(線形)サブスペース内で(近似的に)最適なポリシーを決定する最終的なパラメーター推定フェーズで構成される、表現ポリシー反復と呼ばれる3段階の手順。(iv)最小二乗ポリシー反復(LSPI)をパラメータ推定法として使用するRPIフレームワークの具体的なインスタンス化(v)提案されたアプローチを大規模な離散および連続状態空間に拡張するためのいくつかの戦略(固有関数のサンプル外補間のためのNyström拡張、および因数分解されたMDPなどの積空間でコンパクトな固有関数を構築するためのKronecker総因数分解の使用を含む) (vi)最後に、概念を説明し、提案されたアプローチを評価するためのベンチマークを提供する一連の例示的な離散および連続制御タスク。提案されたフレームワークを大規模なMDPに拡張する際には、対処すべき多くの課題が残っており、最後に提案されたフレームワークのいくつかの詳細について簡単にまとめます。

Transfer Learning via Inter-Task Mappings for Temporal Difference Learning
時間差分学習のためのタスク間マッピングによる転移学習

Temporal difference (TD) learning (Sutton and Barto, 1998) has become apopular reinforcement learning technique in recent years. TD methods,relying on function approximators to generalize learning to novelsituations, have had some experimental successes and have been shownto exhibit some desirable properties in theory, but the most basicalgorithms have often been found slow in practice. This empiricalresult has motivated the development of many methods that speed upreinforcement learning by modifying a task for the learner or helpingthe learner better generalize to novel situations. This articlefocuses on generalizing across tasks, thereby speeding uplearning, via a novel form of transfer using handcoded taskrelationships. We compare learning on a complex task with threefunction approximators, a cerebellar model arithmetic computer (CMAC),an artificial neural network (ANN), and a radial basis function (RBF),and empirically demonstrate that directly transferring theaction-value function can lead to a dramatic speedup inlearning with all three. Using transfer via inter-task mapping(TVITM), agents are able to learn one task and then markedly reducethe time it takes to learn a more complex task. Our algorithms arefully implemented and tested in the RoboCup soccer Keepaway domain.This article contains and extends material published in two conferencepapers (Taylor and Stone, 2005; Taylor et al., 2005).

近年、時間差(TD)学習(SuttonおよびBarto、1998)は、人気の高い強化学習手法となっています。関数近似値に依存して学習を新しい状況に一般化するTD手法は、いくつかの実験的成功を収め、理論上は望ましい特性を示すことが示されていますが、最も基本的なアルゴリズムは実際には遅いことがよくわかっています。この実験結果により、学習者のタスクを変更したり、学習者が新しい状況にうまく一般化できるように支援したりすることで、強化学習を高速化する多くの手法が開発されました。この記事では、手動でコーディングされたタスク関係を使用した新しい形式の転送によって、タスク間の一般化に焦点を当て、学習を高速化します。複雑なタスクの学習を、小脳モデル演算コンピュータ(CMAC)、人工ニューラルネットワーク(ANN)、ラジアル基底関数(RBF)の3つの関数近似器と比較し、アクション価値関数を直接転送すると、これら3つすべてで学習が劇的に高速化されることを実証しました。タスク間マッピングによる転送(TVITM)を使用すると、エージェントは1つのタスクを学習してから、より複雑なタスクを学習するのにかかる時間を大幅に短縮できます。私たちのアルゴリズムは、RoboCupサッカーKeepawayドメインで完全に実装およびテストされています。この記事には、2つの会議論文(TaylorおよびStone、2005年、Taylor他、2005年)で発表された資料が含まれており、これを拡張しています。

A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion
一般化されたフィッシャー基準に対する解のファミリーの完全な特性評価

Recently, Ye (2005) suggested yet another optimization criterionfor discriminant analysis and proposed a characterization of thefamily of solutions to this objective. The characterization, however,merely describes a part of the full solution set, that is, it is notcomplete and therefore not at all a characterization. Thiscorrespondence first gives the correct characterization and afterwardscompares it to Ye’s.

最近、Ye(2005)は、判別分析のためのさらに別の最適化基準を提案し、この目的に対する解のファミリーの特性付けを提案しました。ただし、特性評価は、完全なソリューションセットの一部を記述しているだけであり、つまり、完全ではないため、まったく特性評価ではありません。この対応は、最初に正しい特徴付けを与え、その後それをYeのものと比較します。

Refinable Kernels
絞り込み可能なカーネル

Motivated by mathematical learning from training data, we introducethe notion of refinable kernels. Various characterizations ofrefinable kernels are presented. The concept of refinable kernelsleads to the introduction of wavelet-like reproducing kernels.We also investigate a refinable kernel that forms a Riesz basis. Inparticular, we characterize refinable translation invariant kernels,and refinable kernels defined by refinable functions. This studyleads to multiresolution analysis of reproducing kernel Hilbertspaces.

学習データからの数学的学習に動機付けられて、精製可能なカーネルの概念を導入します。精製可能なカーネルのさまざまな特性が示されています。精製可能なカーネルの概念は、ウェーブレットのような再現カーネルの導入につながります。また、Riesz基底を形成する精製可能なカーネルについても調査します。特に、絞り込み可能な翻訳不変カーネルと、絞り込み可能な関数によって定義される絞り込み可能なカーネルを特徴付けます。本研究は、再生カーネルヒルベルト空間の多重解像度解析につながる。

Unlabeled Compression Schemes for Maximum Classes
最大クラスのラベルなし圧縮方式

Maximum concept classes of VC dimension dover n domain points have size n C ≤d, and this is an upper bound on the size of any concept classof VC dimension d over n points.We give a compression scheme for any maximum class that represents each concept by a subset of upto d unlabeled domain pointsand has the property that forany sample of a concept in the class,the representative of exactlyone of the concepts consistent with the sampleis a subset of the domain of the sample.This allows us to compress any sample of a conceptin the class to a subset of up to d unlabeled sample points such thatthis subset represents a concept consistent with theentire original sample. Unlike the previously known compression scheme for maximum classes (Floyd and Warmuth, 1995) which compressesto labeled subsets of the sample of size equal d, our new scheme is tight in the sense that the number of possible unlabeled compression sets of size at most d equals the number of concepts in the class.

n個のドメインポイント上のVC次元の最大概念クラスのサイズはn C≤dであり、これはn個のポイント上のVC次元dの概念クラスのサイズの上限です。私たちは、各概念をdまでのラベルなしドメインポイントのサブセットで表す任意の最大クラスに対して圧縮スキームを与え、そのクラス内の概念の任意のサンプルに対して、サンプルと一致する概念の正確な1つの表現がサンプルのドメインのサブセットであるという特性を持つ。これにより、クラス内の概念の任意のサンプルを、このサブセットが元のサンプル全体と一致する概念を表すように、最大d個のラベル付けされていないサンプルポイントのサブセットに圧縮できます。サイズがdに等しいサンプルのラベル付きサブセットを圧縮する最大クラスの圧縮スキーム(Floyd and Warmuth, 1995)とは異なり、新しいスキームは、最大でdのサイズのラベルなし圧縮セットの数がクラス内の概念の数と等しくなるという意味で厳密です。

Very Fast Online Learning of Highly Non Linear Problems
高度に非線形問題の非常に高速なオンライン学習

The experimental investigation on the efficient learning of highlynon-linear problems by online training, using ordinary feed forwardneural networks and stochastic gradient descent on the errors computedby back-propagation, gives evidence that the most crucial factors forefficient training are the hidden units’ differentiation, theattenuation of the hidden units’ interference and the selectiveattention on the parts of the problems where the approximation errorremains high. In this report, we present global and local selectiveattention techniques and a new hybrid activation function that enablesthe hidden units to acquire individual receptive fields which may beglobal or local depending on the problem’s local complexities. Thepresented techniques enable very efficient training on complexclassification problems with embedded subproblems.

通常のフィードフォワードニューラルネットワークとバックプロパゲーションによって計算された誤差の確率的勾配降下法を用いたオンライン学習による高非線形問題の効率的な学習に関する実験的研究は、効率的な学習のための最も重要な要因が、隠れユニットの微分、隠れユニットの干渉の加熱、および近似誤差が高いままの問題の部分への選択的注意であることを示しています。このレポートでは、グローバルおよびローカルな選択的注意技術と、問題のローカルな複雑さに応じてグローバルまたはローカルな個々の受容フィールドを隠れたユニットが獲得できるようにする新しいハイブリッド活性化機能を紹介します。提示された手法は、埋め込まれたサブ問題を持つ複素分類問題に関する非常に効率的なトレーニングを可能にします。

Truncating the Loop Series Expansion for Belief Propagation
信念伝播のためのループ級数展開の切り捨て

Recently, Chertkov and Chernyak (2006b) derived an exact expression for the partition sum(normalization constant) corresponding to a graphical model, which is anexpansion around the belief propagation (BP) solution. By adding correction terms tothe BP free energy, one for each “generalized loop” in the factor graph, theexact partition sum is obtained.However, the usually enormous number ofgeneralized loops generally prohibits summation over all correctionterms.In this article we introduce truncated loop series BP (TLSBP), a particular wayof truncating the loop series of Chertkov & Chernyak by considering generalized loopsas compositions of simple loops.We analyze the performance of TLSBP in different scenarios, including the Ising modelon square grids and regular random graphs, and on PROMEDAS, a large probabilisticmedical diagnostic system.We show that TLSBP often improves upon the accuracy of the BP solution, at theexpense of increased computation time.We also show that the performance of TLSBP strongly depends onthe degree of interaction between the variables.For weak interactions, truncating the series leads to significant improvements,whereas for strong interactions it can be ineffective,even if a high number of terms is considered.

最近、ChertkovとChernyak (2006b)は、ビリーフプロパゲーション(BP)ソリューションの拡張であるグラフィカルモデルに対応するパーティション合計(正規化定数)の正確な式を導出しました。BP自由エネルギーに補正項(因子グラフ内の各「一般化ループ」につき1つ)を追加することで、正確なパーティションの合計が得られます。ただし、通常、一般化ループの数は膨大であるため、すべての補正項の合計は一般的に不可能です。この記事では、一般化ループを単純ループの合成と見なすことによってChertkovとChernyakのループシリーズを切り捨てる特別な方法である、切り捨てループシリーズBP (TLSBP)を紹介します。正方グリッドと正規ランダムグラフ上のIsingモデル、および大規模な確率的医療診断システムであるPROMEDASなど、さまざまなシナリオでTLSBPのパフォーマンスを分析します。TLSBPは、計算時間の増加を犠牲にして、BPソリューションの精度を向上させることが多いことを示しています。また、TLSBPのパフォーマンスは、変数間の相互作用の度合いに大きく依存することも示しています。弱い相互作用の場合、シリーズを切り捨てると大幅な改善がもたらされますが、強い相互作用の場合、多数の項を考慮したとしても効果がない場合があります。

A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
Bregman共クラスタリングと行列近似への一般化最大エントロピーアプローチ

Co-clustering, or simultaneous clustering of rows and columns of atwo-dimensional data matrix, is rapidly becoming a powerful dataanalysis technique. Co-clustering has enjoyed wide success in variedapplication domains such as text clustering, gene-microarrayanalysis, natural language processing and image, speech and videoanalysis. In this paper, we introduce a partitional co-clusteringformulation that is driven by the search for a good matrixapproximation—every co-clustering is associated with anapproximation of the original data matrix and the quality ofco-clustering is determined by the approximation error. We allowthe approximation error to be measured using a large class of lossfunctions called Bregman divergences that include squared Euclideandistance and KL-divergence as special cases. In addition, we permitmultiple structurally different co-clustering schemes that preservevarious linear statistics of the original data matrix. Toaccomplish the above tasks, we introduce a new minimum Bregmaninformation (MBI) principle that simultaneously generalizes themaximum entropy and standard least squares principles,and leads to a matrix approximation that is optimal among allgeneralized additive models in a certain natural parameter space.Analysis based on this principle yields an elegant meta algorithm,special cases of which include most previously known alternateminimization based clustering algorithms such as kmeans andco-clustering algorithms such as informationtheoretic (Dhillon et al., 2003b) and minimum sum-squared residueco-clustering (Cho et al., 2004). To demonstrate the generality andflexibility of our co-clustering framework, we provide examples andempirical evidence on a variety of problem domains and also describenovel co-clustering applications such as missing value predictionand compression of categorical data matrices.

共クラスタリング、つまり2次元データマトリックスの行と列の同時クラスタリングは、急速に強力なデータ分析手法になりつつあります。共クラスタリングは、テキストクラスタリング、遺伝子マイクロアレイ分析、自然言語処理、画像、音声、ビデオ分析など、さまざまなアプリケーション領域で広く成功を収めています。この論文では、適切なマトリックス近似の探索によって駆動されるパーティション共クラスタリング定式化を紹介します。すべての共クラスタリングは、元のデータマトリックスの近似に関連付けられており、共クラスタリングの品質は近似エラーによって決定されます。近似エラーは、特別なケースとして2乗ユークリッド距離とKLダイバージェンスを含む、ブレグマンダイバージェンスと呼ばれる損失関数の大規模なクラスを使用して測定できます。さらに、元のデータマトリックスのさまざまな線形統計を保持する、構造的に異なる複数の共クラスタリングスキームを許可します。上記のタスクを達成するために、最大エントロピー原理と標準最小二乗原理を同時に一般化し、特定の自然パラメータ空間内のすべての一般化加法モデルの中で最適な行列近似を導く、新しい最小ブレグマン情報量(MBI)原理を導入します。この原理に基づく分析により、洗練されたメタアルゴリズムが生成されます。その特殊なケースには、kmeansなどの既知の代替最小化ベースのクラスタリングアルゴリズムや、情報理論的(Dhillonら、2003b)や最小二乗残差共クラスタリング(Choら、2004)などの共クラスタリングアルゴリズムが含まれます。共クラスタリングフレームワークの一般性と柔軟性を示すために、さまざまな問題領域での例と経験的証拠を示し、欠損値予測やカテゴリデータマトリックスの圧縮などの新しい共クラスタリングアプリケーションについても説明します。

Fast Iterative Kernel Principal Component Analysis
高速反復カーネル主成分分析

We develop gain adaptation methods that improve convergence of thekernel Hebbian algorithm (KHA) for iterative kernel PCA(Kim et al., 2005). KHA has a scalar gain parameter which is eitherheld constant or decreased according to a predetermined annealingschedule, leading to slow convergence. We accelerate it byincorporating the reciprocal of the current estimated eigenvalues aspart of a gain vector. An additional normalization term then allows usto eliminate a tuning parameter in the annealing schedule. Finally wederive and apply stochastic meta-descent (SMD) gain vector adaptation(Schraudolph, 1999, 2002) in reproducing kernel Hilbertspace to further speed up convergence. Experimental results on kernelPCA and spectral clustering of USPS digits, motion capture and imagedenoising, and image super-resolution tasks confirm that our methodsconverge substantially faster than conventional KHA. To demonstratescalability, we perform kernel PCA on the entire MNIST data set.

私たちは、反復カーネルPCAのためのカーネルヘブアルゴリズム(KHA)の収束を改善するゲイン適応法を開発します(Kimら, 2005)。KHAには、所定のアニーリングスケジュールに従って一定に保持されるか減少するスカラーゲインパラメータがあり、収束が遅くなります。現在の推定固有値の逆数をゲインベクトルの一部として組み込むことにより、これを加速します。その後、追加の正規化項により、アニーリングスケジュールのチューニングパラメータを削除できます。最後に、確率的メタディセント(SMD)利得ベクトル適応(Schraudolph, 1999, 2002)をモデル化し、適用してカーネルヒルベルト空間の再現を行い、収束をさらに加速します。kernelPCAとUSPSディジットのスペクトルクラスタリング、モーションキャプチャと画像ノイズ除去、および画像超解像タスクに関する実験結果により、私たちの方法が従来のKHAよりも大幅に速く収束することが確認されています。スケーラビリティを示すために、MNISTデータセット全体に対してカーネルPCAを実行します。

Large Margin Semi-supervised Learning
大マージンの半教師あり学習

In classification, semi-supervised learning occurs when alarge amount of unlabeled data is available with only a smallnumber of labeled data. In such a situation, how to enhancepredictability of classification through unlabeled data is thefocus. In this article, we introduce a novel large marginsemi-supervised learning methodology, using groupinginformation from unlabeled data, together with the concept ofmargins, in a form of regularization controlling the interplaybetween labeled and unlabeled data. Based on this methodology,we develop two specific machines involving support vector machinesand ψ-learning, denoted as SSVM and SPSI, through differenceconvex programming. In addition, we estimate the generalization errorusing both labeled and unlabeled data, for tuning regularizers.Finally, our theoretical and numerical analyses indicate thatthe proposed methodology achieves the desired objective ofdelivering high performance in generalization, particularly againstsome strong performers.

分類では、半教師あり学習は、ラベル付けされていないデータが大量に利用可能で、ラベル付きデータが少ない場合に発生します。このような状況では、ラベル付けされていないデータを通じて分類の予測可能性をどのように強化するかが焦点となります。この記事では、ラベル付けされていないデータからのグループ化情報とマージンの概念を使用して、ラベル付きデータとラベルなしデータの間の相互作用を制御する正則化の形で、新しい大きなマージン半教師あり学習方法論を紹介します。この方法論に基づいて、サポートベクターマシンとψ学習(SSVMおよびSPSIと呼ばれる)を含む2つの特定のマシンを差分凸プログラミングによって開発します。さらに、正規化器の調整について、ラベル付きデータとラベルなしデータの両方を使用して一般化誤差を推定します。最後に、私たちの理論的および数値的分析は、提案された方法論が、特に一部の強力なパフォーマーに対して、一般化で高いパフォーマンスを提供するという望ましい目的を達成していることを示しています。

Behavioral Shaping for Geometric Concepts
幾何学的概念のビヘイビア・シェーピング

In a search problem, an agent uses the membership oracle of a targetconcept to find a positive example of the concept. In a shapedsearch problem the agent is aided by a sequence of increasinglyrestrictive concepts leading to the target concept (analogous tobehavioral shaping). The concepts are given by membership oracles, andthe agent has to find a positive example of the target concept whileminimizing the total number of oracle queries. We show that for theconcept class of intervals on the real line an agent using a boundednumber of queries per oracle exists. In contrast, for theconcept class of unions of two intervals on the real line no agentwith a bounded number of queries per oracle exists. We theninvestigate the (amortized) number of queries per oracle needed forthe shaped search problem over other concept classes. We explore thefollowing methods to obtain efficient agents. For axis-parallelrectangles we use a bootstrapping technique to obtain gradually betterapproximations of the target concept. We show that given rectanglesR ⊆ A ⊆ ℝd one can obtain a rectangle A’ ⊇ Rwith vol(A’) / vol(R) ≤ 2, using only O(d⋅ vol(A) / vol(R)) random samples from A. For ellipsoids ofbounded eccentricity in ℝd we analyze a deterministic ray-shootingprocess which uses a sequence of rays to get close to thecentroid. Finally, we use algorithms for generating random points inconvex bodies (Dyer et al., 1991; Kannan et al., 1997) to give a randomized agent for theconcept class of convex bodies. In the final section, we exploreconnections between our bootstrapping method and activelearning. Specifically, we use the bootstrapping technique foraxis-parallel rectangles to active learn axis-parallel rectanglesunder the uniform distribution in O(d ln(1/ε)) labeledsamples.

検索問題では、エージェントはターゲット概念のメンバーシップオラクルを使用して、その概念の肯定例を見つけます。シェイプド検索問題では、エージェントはターゲット概念につながる、制限が次第に厳しくなる一連の概念によって支援されます(動作のシェイプ化に類似)。概念はメンバーシップオラクルによって与えられ、エージェントはオラクルクエリの総数を最小限に抑えながら、ターゲット概念の肯定例を見つけなければなりません。実数直線上の区間の概念クラスでは、オラクルごとに制限された数のクエリを使用するエージェントが存在することを示します。対照的に、実数直線上の2つの区間の結合の概念クラスでは、オラクルごとに制限された数のクエリを持つエージェントは存在しません。次に、シェイプド検索問題で他の概念クラスに対して必要な、オラクルごとの(償却された)クエリ数を調べます。効率的なエージェントを取得するために、次の方法を検討します。軸に平行な長方形では、ブートストラッピング手法を使用して、ターゲット概念の徐々に優れた近似値を取得します。長方形R⊆A⊆ ℝdが与えられれば、AからのO(d⋅vol(A) / vol(R))個のランダムサンプルのみを使用して、vol(A’) / vol(R)≤2の長方形A’⊇Rを取得できることを示します。ℝd内の離心率が制限された楕円体の場合、一連の光線を使用して重心に近づく決定論的な光線発射プロセスを分析します。最後に、凸体でランダムな点を生成するアルゴリズム(Dyerら、1991年、Kannanら、1997年)を使用して、凸体の概念クラスにランダムエージェントを提供します。最後のセクションでは、ブートストラップ法とアクティブラーニングの関係について説明します。具体的には、軸平行長方形のブートストラップ手法を使用して、O(d ln(1/ε))個のラベル付きサンプルで均一分布の軸平行長方形をアクティブ学習します。

“Ideal Parent” Structure Learning for Continuous Variable Bayesian Networks
連続変数ベイジアンネットワークのための「理想的な親」構造学習

Bayesian networks in general, and continuous variable networks inparticular, have become increasingly popular in recent years, largelydue to advances in methods that facilitate automatic learning fromdata. Yet, despite these advances, the key task of learning thestructure of such models remains a computationally intensiveprocedure, which limits most applications to parameter learning. Thisproblem is even more acute when learning networks in the presence ofmissing values or hidden variables, a scenario that is part of manyreal-life problems. In this work we present a general method forspeeding structure search for continuous variable networks with commonparametric distributions. We efficiently evaluate the approximatemerit of candidate structure modifications and apply time consuming(exact) computations only to the most promising ones, therebyachieving significant improvement in the running time of the searchalgorithm. Our method also naturally and efficiently facilitates theaddition of useful new hidden variables into the network structure, atask that is typically considered both conceptually difficult andcomputationally prohibitive. We demonstrate our method on syntheticand real-life data sets, both for learning structure on fully andpartially observable data, and for introducing new hidden variablesduring structure search.

ベイジアンネットワーク全般、特に連続変数ネットワークは、近年ますます人気が高まっていますが、これは主にデータからの自動学習を容易にする手法の進歩によるものです。しかし、こうした進歩にもかかわらず、このようなモデルの構造を学習するという主要なタスクは依然として計算集約的な手順であり、ほとんどのアプリケーションはパラメータ学習に限定されています。この問題は、多くの現実の問題の一部である、欠損値や隠れた変数が存在するネットワークを学習する場合にさらに深刻になります。この研究では、一般的なパラメトリック分布を持つ連続変数ネットワークの構造検索を高速化する一般的な方法を紹介します。候補構造変更のおおよそのメリットを効率的に評価し、時間のかかる(正確な)計算を最も有望なものにのみ適用することで、検索アルゴリズムの実行時間を大幅に改善します。また、この方法は、概念的に困難で計算上も法外であると考えられるタスクである、ネットワーク構造への有用な新しい隠れた変数の追加を自然かつ効率的に容易にします。私たちは、完全に観測可能なデータと部分的に観測可能なデータの構造を学習するため、また構造検索中に新しい隠れた変数を導入するために、合成データセットと実際のデータセットで私たちの方法を実証します。

Characterizing the Function Space for Bayesian Kernel Models
ベイジアンカーネルモデルの関数空間の特性評価

Kernel methods have been very popular in the machine learningliterature in the last ten years, mainly in the context of Tikhonovregularization algorithms. In this paper we study a coherent Bayesiankernel model based on an integral operator defined as the convolutionof a kernel with a signed measure. Priors on the random signedmeasures correspond to prior distributions on the functions mapped bythe integral operator. We study several classes of signed measures andtheir image mapped by the integral operator. In particular, weidentify a general class of measures whose image is dense in thereproducing kernel Hilbert space (RKHS) induced by the kernel. Aconsequence of this result is a function theoretic foundation forusing non-parametric prior specifications in Bayesian modeling, suchas Gaussian process and Dirichlet process prior distributions. Wediscuss the construction of priors on spaces of signed measures usingGaussian and Lévy processes, with the Dirichlet processes being aspecial case the latter. Computational issues involved with samplingfrom the posterior distribution are outlined for a univariateregression and a high dimensional classification problem.

カーネル法は、過去10年間、機械学習の文献で、主にTikhonov正則化アルゴリズムのコンテキストで非常に人気がありました。この論文では、カーネルと符号付き測度の畳み込みとして定義される積分演算子に基づくコヒーレントベイジアンカーネルモデルについて検討します。ランダムな符号付き測度の事前分布は、積分演算子によってマップされる関数の事前分布に対応します。私たちは、いくつかのクラスの符号付き測度と、積分演算子によってマップされるそれらのイメージについて検討します。特に、カーネルによって誘導される再生成カーネルヒルベルト空間(RKHS)でイメージが密である一般的なクラスの測度を特定します。この結果から、ベイジアンモデリングでガウス過程やディリクレ過程事前分布などの非パラメトリック事前仕様を使用するための関数理論的基礎が得られます。ガウス過程とレヴィ過程を使用して符号付き測度の空間の事前分布を構築する方法について説明します。ディリクレ過程は後者の特別なケースです。単変量回帰と高次元分類問題における事後分布からのサンプリングに伴う計算上の問題について概説します。

Structure and Majority Classes in Decision Tree Learning
決定木学習における構造と多数派クラス

To provide good classification accuracy on unseen examples, a decisiontree, learned by an algorithm such as ID3, must have sufficientstructure and also identify the correct majority class in each of itsleaves. If there are inadequacies in respect of either of these, thetree will have a percentage classification rate below that of themaximum possible for the domain, namely (100 – Bayes error rate). Anerror decomposition is introduced which enables the relativecontributions of deficiencies in structure and in incorrectdetermination of majority class to be isolated and quantified. Asub-decomposition of majority class error permits separation of thesampling error at the leaves from the possible bias introduced by theattribute selection method of the induction algorithm. It is shownthat sampling error can extend to 25% when there are more than twoclasses. Decompositions are obtained from experiments on several datasets. For ID3, the effect of selection bias is shown to vary frombeing statistically non-significant to being quite substantial, withthe latter appearing to be associated with a simple underlying model.

未知の例に対して優れた分類精度を提供するには、ID3などのアルゴリズムで学習した決定木が十分な構造を持ち、各葉で正しい多数派クラスを識別する必要があります。これらのいずれかに関して不備がある場合、ツリーの分類率はドメインで可能な最大値、つまり(100 -ベイズ誤差率)を下回ります。誤差分解が導入され、構造の欠陥と多数派クラスの誤った決定の相対的寄与を分離して定量化できます。多数派クラス誤差のサブ分解により、葉でのサンプリング誤差を、帰納アルゴリズムの属性選択方法によってもたらされる可能性のあるバイアスから分離できます。クラスが2つ以上ある場合、サンプリング誤差は25%まで拡大する可能性があることが示されています。分解は、いくつかのデータセットの実験から得られます。ID3の場合、選択バイアスの影響は、統計的に有意でない程度からかなり大きい程度まで変化することが示されており、後者は単純な基礎モデルに関連しているように見えます。

Polynomial Identification in the Limit of Substitutable Context-free Languages
置換可能な文脈自由言語の限界における多項式同定

This paper formalises the idea of substitutability introduced byZellig Harris in the 1950s and makes it the basis for a learningalgorithm from positive data only for a subclass of context-freelanguages. We show that there is a polynomial characteristic set,and thus prove polynomial identification in the limit of thisclass. We discuss the relationship of this class of languages toother common classes discussed in grammatical inference. It transpires that it is not necessary to identify constituents inorder to learn a context-free language—it is sufficient to identifythesyntactic congruence, and the operations of the syntactic monoid canbe converted into a context-free grammar. We alsodiscuss modifications to the algorithm that produces a reductionsystem rather than a context-free grammar, that will be much morecompact. We discuss the relationship to Angluin’s notion ofreversibility for regular languages. We also demonstrate that animplementation of this algorithm is capable of learning a classicexample of structure dependent syntax in English: this constitutes arefutation of an argument that hasbeen used in support of nativist theories of language.

この論文では、1950年代にゼリグ・ハリスが導入した代替可能性の考え方を形式化し、文脈自由言語のサブクラスのみの正データからの学習アルゴリズムの基礎とします。多項式特性集合が存在することを示し、このクラスの極限で多項式識別が可能であることを証明します。このクラスの言語と、文法推論で議論される他の一般的なクラスとの関係について説明します。文脈自由言語を学習するために構成要素を識別する必要はなく、構文の合同性を識別するだけで十分であり、構文モノイドの演算は文脈自由文法に変換できることがわかります。また、文脈自由文法ではなく、はるかにコンパクトな縮約システムを生成するアルゴリズムの修正についても説明します。正規言語のAngluinの可逆性の概念との関係について説明します。また、このアルゴリズムの実装により、英語の構造依存構文の典型的な例を学習できることも実証しました。これは、言語のネイティブ主義理論を支持するために使用されてきた議論の反証となります。

A Nonparametric Statistical Approach to Clustering via Mode Identification
モード同定によるクラスタリングへのノンパラメトリック統計的アプローチ

A new clustering approach based on mode identification is developed byapplying new optimization techniques to a nonparametric density estimator. Acluster is formed by those sample points that ascend to the same localmaximum (mode) of the density function. The path from a point to itsassociated mode is efficiently solved by an EM-style algorithm, namely, theModal EM (MEM). This method is then extended for hierarchical clustering byrecursively locating modes of kernel density estimators with increasingbandwidths. Without model fitting, the mode-based clustering yields adensity description for every cluster, a major advantage ofmixture-model-based clustering. Moreover, it ensures that every clustercorresponds to a bump of the density. The issue of diagnosing clusteringresults is also investigated. Specifically, a pairwise separability measurefor clusters is defined using the ridgeline between the density bumps of twoclusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm,an extension of MEM. Based upon this new measure, a cluster mergingprocedure is created to enforce strong separation. Experiments on simulatedand real data demonstrate that the mode-based clustering approach tends tocombine the strengths of linkage and mixture-model-based clustering. Inaddition, the approach is robust in high dimensions and when clusters deviatesubstantially from Gaussian distributions. Both of these cases posedifficulty for parametric mixture modeling. A C package on the newalgorithms is developed for public access athttp://www.stat.psu.edu/∼jiali/hmac.

モード識別に基づく新しいクラスタリング手法は、ノンパラメトリック密度推定量に新しい最適化手法を適用することで開発されました。クラスターは、密度関数の同じ局所的最大値(モード)に上昇するサンプルポイントによって形成されます。ポイントからその関連モードへのパスは、EMスタイルのアルゴリズム、つまりモーダルEM (MEM)によって効率的に解決されます。この方法は、帯域幅が増加するカーネル密度推定量のモードを再帰的に特定することで、階層的クラスタリングに拡張されます。モデルフィッティングなしで、モードベースのクラスタリングは、混合モデルベースのクラスタリングの主な利点である、すべてのクラスターの密度記述を生成します。さらに、すべてのクラスターが密度の隆起に対応することが保証されます。クラスタリング結果の診断の問題も調査されます。具体的には、2つのクラスターの密度隆起間の稜線を使用して、クラスターのペアワイズ分離可能性尺度が定義されます。リッジラインは、MEMの拡張であるリッジラインEM (REM)アルゴリズムによって解決されます。この新しい尺度に基づいて、強力な分離を強制するクラスターマージ手順が作成されます。シミュレーションデータと実際のデータでの実験により、モードベースのクラスタリングアプローチは、リンクベースと混合モデルベースのクラスタリングの長所を組み合わせる傾向があることが実証されています。さらに、このアプローチは高次元でも、クラスターがガウス分布から大幅に逸脱している場合でも堅牢です。これらの両方のケースは、パラメトリック混合モデリングに困難をもたらします。新しいアルゴリズムのCパッケージは、http://www.stat.psu.edu/∼jiali/hmacで公開されています。

Compression-Based Averaging of Selective Naive Bayes Classifiers
選択的単純ベイズ分類器の圧縮ベース平均化

The naive Bayes classifier has proved to be very effective on manyreal data applications. Its performance usually benefits from anaccurate estimation of univariate conditional probabilities and fromvariable selection. However, although variable selection is adesirable feature, it is prone to overfitting. In this paper, weintroduce a Bayesian regularization technique to select the mostprobable subset of variables compliant with the naive Bayesassumption. We also study the limits of Bayesian model averaging inthe case of the naive Bayes assumption and introduce a new weightingscheme based on the ability of the models to conditionally compressthe class labels. The weighting scheme on the models reduces to aweighting scheme on the variables, and finally results in a naiveBayes classifier with “soft variable selection”. Extensiveexperiments show that the compression-based averaged classifieroutperforms the Bayesian model averaging scheme.

単純ベイズ分類器は、多くの実数データアプリケーションで非常に効果的であることが証明されています。そのパフォーマンスは通常、単変量条件付き確率の不正確な推定と変数の選択から恩恵を受けます。ただし、変数の選択は望ましい機能ですが、オーバーフィットする傾向があります。この論文では、素朴なベイズ仮定に準拠した変数の最も可能性の高いサブセットを選択するためのベイズ正則化手法を紹介します。また、単純ベイズ仮定の場合のベイズモデル平均化の限界を研究し、モデルがクラスラベルを条件付きで圧縮する能力に基づく新しい重み付けスキームを導入します。モデルの重み付けスキームは、変数の重み付けスキームに縮小され、最終的に「ソフト変数選択」を備えたnaiveBayes分類子になります。広範な実験では、圧縮ベースの平均化分類器がベイズモデルの平均化スキームよりも優れていることが示されています。

Handling Missing Values when Applying Classification Models
分類モデルを適用する際の欠損値の処理

Much work has studied the effect of different treatments of missing values onmodel induction, but little work has analyzed treatments for the common caseof missing values at prediction time. This paper first compares severaldifferent methods—predictive value imputation, the distribution-basedimputation used by C4.5, and using reduced models—for applyingclassification trees to instances with missing values (and also shows evidencethat the results generalize to bagged trees and to logistic regression). Theresults show that for the two most popular treatments, each is preferableunder different conditions. Strikingly the reduced-models approach, seldommentioned or used, consistently outperforms the other two methods, sometimesby a large margin. The lack of attention to reduced modeling may be due inpart to its (perceived) expense in terms of computation or storage. Therefore,we then introduce and evaluate alternative, hybrid approaches that allow usersto balance between more accurate but computationally expensive reducedmodeling and the other, less accurate but less computationally expensivetreatments. The results show that the hybrid methods can scale gracefully tothe amount of investment in computation/storage, and that they outperformimputation even for small investments.

欠損値のさまざまな処理がモデル誘導に与える影響については多くの研究がなされてきたが、予測時に欠損値が発生する一般的なケースの処理を分析した研究はほとんどない。この論文ではまず、欠損値のあるインスタンスに分類ツリーを適用するためのさまざまな方法(予測値補完、C4.5で使用される分布ベースの補完、縮小モデルの使用)を比較する(また、結果がバギングツリーとロジスティック回帰に一般化できるという証拠も示す)。結果は、2つの最も一般的な処理について、それぞれが異なる条件下で好ましいことを示しています。驚くべきことに、ほとんど言及も使用もされていない縮小モデルのアプローチは、他の2つの方法よりも一貫して優れており、時には大幅に上回ることもあります。縮小モデルが注目されていないのは、計算やストレージの面で(認識されている)コストが一部原因である可能性があります。したがって、次に、より正確だが計算コストが高い縮小モデルと、正確ではないが計算コストが低い他の処理の間でバランスをとることができる代替のハイブリッドアプローチを紹介し、評価します。結果は、ハイブリッド方式が計算/ストレージへの投資量に合わせて適切に拡張でき、少額の投資でも補完法よりも優れていることを示しています。

Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
球面-等分散分布:分類における球面分布と正規分布の同等性

Many feature representations, as in genomics, describe directionaldata where all feature vectors share a common norm. In othercases, as in computer vision, a norm or variance normalizationstep, where all feature vectors are normalized to a common length,is generally used. These representations and pre-processing stepmap the original data from ℜp to the surface of ahypersphere Sp-1. Such representations should then bemodeled using spherical distributions. However, the difficultyassociated with such spherical representations has promptedresearchers to model their spherical data using Gaussiandistributions instead—as if the data were represented inℜp rather than Sp-1. This opens the question towhether the classification results calculated with the Gaussianapproximation are the same as those obtained when using theoriginal spherical distributions. In this paper, we show that insome particular cases (which we named spherical-homoscedastic) theanswer to this question is positive. In the more general casehowever, the answer is negative. For this reason, we furtherinvestigate the additional error added by the Gaussian modeling.We conclude that the more the data deviates fromspherical-homoscedastic, the less advisable it is to employ theGaussian approximation. We then show how our derivations can beused to define optimal classifiers for spherical-homoscedasticdistributions. By using a kernel which maps the original spaceinto one where the data adapts to the spherical-homoscedasticmodel, we can derive non-linear classifiers with potentialapplications in a large number of problems. We conclude this paperby demonstrating the uses of spherical-homoscedasticity in theclassification of images of objects, gene expression sequences,and text data.

ゲノミクスなどの多くの特徴表現は、すべての特徴ベクトルが共通のノルムを共有する方向性データを表します。コンピュータビジョンなどの他のケースでは、すべての特徴ベクトルが共通の長さに正規化されるノルムまたは分散正規化ステップが一般的に使用されます。これらの表現と前処理ステップは、元のデータを ℜpから超球Sp-1の表面にマッピングします。このような表現は、球面分布を使用してモデル化する必要があります。ただし、このような球面表現に関連する難しさから、研究者は球面データをガウス分布を使用してモデル化するように促されました。データがSp-1ではなく ℜpで表現されているかのように。これにより、ガウス近似で計算された分類結果が、元の球面分布を使用した場合に得られる結果と同じかどうかという疑問が生じます。この論文では、特定のケース(球面等分散と名付けた)では、この疑問に対する答えが肯定的であることを示します。ただし、より一般的なケースでは、答えは否定的です。このため、ガウスモデリングによって追加される追加エラーをさらに調査します。データが球状等分散から大きく外れれば外れるほど、ガウス近似を使用することは望ましくないという結論に達します。次に、導出を使用して球状等分散分布の最適な分類子を定義する方法を示します。元の空間を、データが球状等分散モデルに適応する空間にマッピングするカーネルを使用することで、多数の問題に潜在的に適用できる非線形分類子を導出できます。この論文の結論として、物体の画像、遺伝子発現シーケンス、テキストデータの分類における球状等分散の使用を示します。

Multi-class Protein Classification Using Adaptive Codes
適応コードを用いたマルチクラスタンパク質分類

Predicting a protein’s structural class from its amino acid sequenceis a fundamental problem in computational biology. Recent machinelearning work in this domain has focused on developing new input spacerepresentations for protein sequences, that is, string kernels, someof which give state-of-the-art performance for the binary predictiontask of discriminating between one class and all the others. However,the underlying protein classification problem is in fact a hugemulti-class problem, with over 1000 protein folds and even morestructural subcategories organized into a hierarchy. To handle thischallenging many-class problem while taking advantage of progress onthe binary problem, we introduce an adaptive code approach in theoutput space of one-vs-the-rest prediction scores. Specifically, weuse a ranking perceptron algorithm to learn a weighting of binaryclassifiers that improves multi-class prediction with respect to afixed set of output codes. We use a cross-validation set-up togenerate output vectors for training, and we define codes that captureinformation about the protein structural hierarchy. Our codeweighting approach significantly improves on the standard one-vs-allmethod for two difficult multi-class protein classification problems:remote homology detection and fold recognition. Our algorithm alsooutperforms a previous code learning approach due to Crammer andSinger, trained here using a perceptron, when the dimension of thecode vectors is high and the number of classes is large. Finally, wecompare against PSI-BLAST, one of the most widely used methods inprotein sequence analysis, and find that our method stronglyoutperforms it on every structure classification problem that weconsider. Supplementary data and source code are available athttp://www.cs.columbia.edu/compbio/adaptive.

アミノ酸配列からタンパク質の構造クラスを予測することは、計算生物学における基本的な問題です。この分野での最近の機械学習作業は、タンパク質配列の新しい入力空間表現、つまり文字列カーネルの開発に重点を置いており、その一部は、1つのクラスと他のすべてのクラスを区別するバイナリ予測タスクに対して最先端のパフォーマンスを提供します。しかし、基礎となるタンパク質分類問題は、実際には1000を超えるタンパク質フォールドと、階層構造に編成されたさらに多くの構造サブカテゴリを含む、巨大なマルチクラス問題です。バイナリ問題の進歩を活用しながら、この困難な多クラス問題に対処するために、1対残り予測スコアの出力空間に適応型コードアプローチを導入します。具体的には、ランキングパーセプトロンアルゴリズムを使用して、固定された出力コードセットに関するマルチクラス予測を改善するバイナリ分類器の重み付けを学習します。トレーニング用の出力ベクトルを生成するためにクロス検証セットアップを使用し、タンパク質構造階層に関する情報を取得するコードを定義します。コード重み付けアプローチは、リモートホモロジー検出とフォールド認識という2つの困難なマルチクラスタンパク質分類問題に対する標準的な1対すべての方法を大幅に改善します。また、コードベクトルの次元が高く、クラスの数が多い場合、このアルゴリズムは、パーセプトロンを使用してトレーニングされた、CrammerとSingerによる以前のコード学習アプローチよりも優れています。最後に、タンパク質配列解析で最も広く使用されている方法の1つであるPSI-BLASTと比較し、検討したすべての構造分類問題において、当社の方法がPSI-BLASTを大幅に上回ることがわかりました。補足データとソースコードは、http://www.cs.columbia.edu/compbio/adaptiveで入手できます。

An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression
大規模l1正則化ロジスティック回帰のための内点法

Logistic regression with l1 regularization has been proposed asa promising method for feature selection in classification problems.In this paper we describe an efficient interior-point method forsolving large-scale l1-regularized logistic regressionproblems. Small problems with up to a thousand or so features andexamples can be solved in seconds on a PC; medium sized problems,with tens of thousands of features and examples, can be solved intens of seconds (assuming some sparsity in the data). A variation onthe basic method, that uses a preconditioned conjugate gradientmethod to compute the search step, can solve very large problems,with a million features and examples (e.g., the 20 Newsgroups dataset), in a few minutes, on a PC. Using warm-starttechniques, a good approximation of the entire regularization pathcan be computed much more efficiently than by solving a family ofproblems independently.

l1正則化によるロジスティック回帰は、分類問題における特徴選択の有望な方法として提案されています。この論文では、大規模なl1正則化ロジスティック回帰問題を解くための効率的な内点法について説明します。最大1000個ほどの機能と例を持つ小さな問題は、PCで数秒で解決できます。数万の特徴と例を持つ中規模の問題は、数秒で解くことができます(データにある程度のスパース性があると仮定します)。基本的な方法のバリエーションは、前処理された共役勾配法を使用して検索ステップを計算するもので、100万の特徴と例(例:20のニュースグループデータセット)を含む非常に大きな問題をPC上で数分で解決できます。ウォームスタート手法を使用すると、正則化パス全体の適切な近似値を、問題のファミリーを個別に解決するよりもはるかに効率的に計算できます。

On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
グラフ半教師あり学習におけるラプラシアン正規化の有効性について

This paper investigates the effect of Laplacian normalization ingraph-based semi-supervised learning. To this end, we considermulti-class transductive learning on graphs with Laplacianregularization. Generalization bounds are derived using geometricproperties of the graph. Specifically, by introducing a definition ofgraph cut from learning theory, we obtain generalization bounds thatdepend on the Laplacian regularizer. We then use this analysis tobetter understand the role of graph Laplacian matrix normalization.Under assumptions that the cut is small, we derive near-optimalnormalization factors by approximately minimizing the generalizationbounds. The analysis reveals the limitations of the standarddegree-based normalization method in that the resulting normalizationfactors can vary significantly within each connected component withthe same class label, which may cause inferior generalizationperformance. Our theory also suggests a remedy that does not sufferfrom this problem. Experiments confirm the superiority of thenormalization scheme motivated by learning theory on artificial andreal-world data sets.

この論文では、グラフベースの半教師あり学習におけるラプラシアン正規化の効果を調査します。この目的のために、ラプラシアン正則化を使用したグラフでのマルチクラストランスダクティブ学習を検討します。一般化境界は、グラフの幾何学的特性を使用して導出されます。具体的には、学習理論からのグラフカットの定義を導入することで、ラプラシアン正則化に依存する一般化境界を取得します。次に、この分析を使用して、グラフラプラシアンマトリックス正規化の役割をより深く理解します。カットが小さいという仮定の下で、一般化境界を近似的に最小化することにより、ほぼ最適な正規化係数を導き出します。この分析により、標準の次数ベースの正規化方法の限界が明らかになりました。結果として得られる正規化係数は、同じクラスラベルを持つ各接続コンポーネント内で大幅に異なる可能性があり、一般化のパフォーマンスが低下する可能性があります。私たちの理論は、この問題に悩まされない解決策も示唆しています。実験により、人工データセットと実際のデータセットでの学習理論に動機付けられた正規化スキームの優位性が確認されています。

PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers
サンプル圧縮分類器の確率的平均と多数決に対するPAC-Bayesリスク限界

We propose a PAC-Bayes theorem for the sample-compression settingwhere each classifier is described by a compression subset of thetraining data and a message string of additional information. Thissetting, which is the appropriate one to describe many learningalgorithms, strictly generalizes the usual data-independentsetting where classifiers are represented only by data-independentmessage strings (or parameters taken from a continuous set). Theproposed PAC-Bayes theorem for the sample-compression settingreduces to the PAC-Bayes theorem of Seeger (2002) and Langford (2005)when the compression subset of each classifier vanishes. Forposteriors having all their weights on a single sample-compressedclassifier, the general risk bound reduces to a bound similar tothe tight sample-compression bound proposed in Laviolette et al. (2005).Finally, we extend our results to the case where eachsample-compressed classifier of a data-dependent ensemble mayabstain of predicting a class label.

私たちは、サンプル圧縮設定のPAC-Bayes定理を提案し、各分類器はトレーニングデータの圧縮サブセットと追加情報のメッセージ文字列によって記述されます。この設定は、多くの学習アルゴリズムを説明するのに適したものであり、分類子がデータに依存しないメッセージ文字列(または連続セットから取得されたパラメーター)によってのみ表される通常のデータに依存しない設定を厳密に一般化します。サンプル圧縮設定の提案されたPAC-Bayes定理は、各分類器の圧縮サブセットが消えると、Seeger(2002)とLangford(2005)のPAC-Bayes定理に縮小されます。すべての重みを1つのサンプル圧縮分類器に保持する事後者の場合、一般的なリスクバウンドは、Lavioletteら(2005)で提案されたタイトなサンプル圧縮バウンドと同様のバウンドに減少します。最後に、データ依存アンサンブルの各サンプル圧縮分類器がクラスラベルの予測を棄権する可能性がある場合に結果を拡張します。

Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions
パリティとDNF表現の属性効率と非適応性学習

We consider the problems of attribute-efficient PAC learning of twowell-studied concept classes: parity functions and DNF expressionsover {0,1}n. We show that attribute-efficient learning of paritieswith respect to the uniform distribution is equivalent to decodinghigh-rate random linear codes from low number of errors, along-standing open problem in coding theory. This is the firstevidence that attribute-efficient learning of a natural PAC learnableconcept class can be computationally hard. An algorithm is said to use membership queries (MQs) non-adaptively if the points at which the algorithm asks MQs do notdepend on the target concept. Using a simple non-adaptive paritylearning algorithm and a modification of Levin’s algorithm forlocating a weakly-correlated parity due to Bshouty et al. (1999), we give thefirst non-adaptive and attribute-efficient algorithm for learning DNFwith respect to the uniform distribution. Our algorithm runs in timeÕ(ns4/ε) and uses Õ(s4· log2n/ε) non-adaptive MQs, where s is the number ofterms in the shortest DNF representation of the target concept. Thealgorithm improves on the best previous algorithm for learning DNF(of Bshouty et al., 1999) and can also be easily modified to toleraterandom persistent classification noise in MQs.

私たちは、よく研究されている2つの概念クラス(パリティ関数と{0,1}n上のDNF式)の属性効率の高いPAC学習の問題を検討します。一様分布に関するパリティの属性効率の高い学習は、符号理論における長年の未解決問題である、高レートのランダム線形コードを少ないエラー数からデコードすることと同等であることを示します。これは、自然なPAC学習可能な概念クラスの属性効率の高い学習が計算的に困難であることを示す最初の証拠です。アルゴリズムがメンバーシップクエリ(MQ)を非適応的に使用するとされるのは、アルゴリズムがMQを尋ねるポイントがターゲット概念に依存しない場合です。単純な非適応パリティ学習アルゴリズムと、Bshoutyら(1999)による弱相関パリティを見つけるためのLevinのアルゴリズムの修正を使用して、一様分布に関するDNFを学習するための最初の非適応型で属性効率の高いアルゴリズムを示します。私たちのアルゴリズムは、時間Õ(ns4/ε)で実行され、Õ(s4·log2n/ε)の非適応型MQを使用します。ここで、sはターゲット概念の最短DNF表現の用語数です。このアルゴリズムは、DNF学習の以前の最良のアルゴリズム(Bshoutyら、1999)を改良したもので、MQのランダムな持続的分類ノイズを許容するように簡単に変更することもできます。

Learning to Classify Ordinal Data: The Data Replication Method
順序データの分類方法の学習: データレプリケーション方法

Classification of ordinal data is one of the most important tasks ofrelation learning. This paper introduces a new machine learningparadigm specifically intended for classification problems where theclasses have a natural order. The technique reduces the problem ofclassifying ordered classes to the standard two-class problem. Theintroduced method is then mapped into support vector machines andneural networks. Generalization bounds of the proposed ordinalclassifier are also provided. An experimental study with artificial andreal data sets, including an application to gene expression analysis,verifies the usefulness of the proposed approach.

順序データの分類は、関係学習の最も重要なタスクの1つです。この論文では、クラスが自然な順序を持つ分類問題に特化した新しい機械学習パラダイムを紹介します。この手法により、順序付けられたクラスを分類する問題を標準の2クラス問題に軽減します。導入された方法は、サポートベクターマシンとニューラルネットワークにマッピングされます。提案された順序分類器の一般化境界も提供されます。遺伝子発現解析への応用を含む、人工および実在データセットを用いた実験的研究により、提案されたアプローチの有用性が検証されています。

Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
クラスタ仮定の下での半教師付き分類における一般化誤差の範囲

We consider semi-supervised classification when part of the available datais unlabeled. These unlabeled data can be useful for the classificationproblem when we make an assumption relating the behavior of the regressionfunction to that of the marginal distribution. Seeger (2000) proposed thewell-known cluster assumption as a reasonable one. We propose amathematical formulation of this assumption and a method based ondensity level sets estimation that takes advantage of it to achieve fast ratesof convergence both in the number of unlabeled examples and the number oflabeled examples.

私たちは、利用可能なデータの一部がラベル付けされていない場合、半教師あり分類を考慮します。これらのラベル付けされていないデータは、回帰関数の動作を周辺分布の動作に関連付ける仮定を立てるときに、分類問題に役立ちます。Seeger(2000)は、よく知られているクラスターの仮定を合理的なものとして提案しました。この仮定の数学的定式化と、ラベル付けされていない例の数とラベル付けされた例の数の両方で高速収束を達成するためにそれを利用する密度レベルセット推定に基づく方法を提案します。

Graph Laplacians and their Convergence on Random Neighborhood Graphs
グラフラプラシアンとランダム近傍グラフでのその収束

Given a sample from a probability measure with support on asubmanifold in Euclidean space one can construct a neighborhood graphwhich can be seen as an approximation of the submanifold. The graphLaplacian of such a graph is used in several machine learning methodslike semi-supervised learning, dimensionality reduction andclustering. In this paper we determine the pointwise limit of threedifferent graph Laplacians used in the literature as the sample sizeincreases and the neighborhood size approaches zero. We show that fora uniform measure on the submanifold all graph Laplacians have thesame limit up to constants. However in the case of a non-uniformmeasure on the submanifold only the so called random walk graphLaplacian converges to the weighted Laplace-Beltrami operator.

ユークリッド空間の亜多様体を支持する確率測度からのサンプルが与えられると、部分多様体の近似と見なすことができる近傍グラフを構築できます。このようなグラフのグラフラプラシアンは、半教師あり学習、次元削減、クラスタリングなど、いくつかの機械学習手法で使用されます。この論文では、サンプルサイズが増加し、近傍サイズがゼロに近づくにつれて、文献で使用されている3つの異なるグラフラプラシアンの点ごとの限界を決定します。部分多様体上の一様な測度に対して、すべてのグラフラプラシアンが定数まで同じ極限を持つことを示します。しかし、部分多様体上の非一様測度の場合、いわゆるランダムウォークグラフラプラシアンのみが加重ラプラス・ベルトラミ演算子に収束します。

From External to Internal Regret
外面的な後悔から内面的な後悔へ

External regret compares the performance of an online algorithm,selecting among N actions, to the performance of the best of thoseactions in hindsight. Internal regret compares the loss of an onlinealgorithm to the loss of a modified online algorithm, whichconsistently replaces one action by another.In this paper we give a simple generic reduction that, given analgorithm for the external regret problem, converts it to an efficientonline algorithm for the internal regret problem. We provide methodsthat work both in the full information model, in which the lossof every action is observed at each time step, and the partialinformation (bandit) model, where at each time step only the loss ofthe selected action is observed.The importance of internal regret in game theory is due to thefact that in a general game, if each player has sublinear internalregret, then the empirical frequencies converge to a correlatedequilibrium.For external regret we also derive a quantitative regret bound for avery general setting of regret, which includes an arbitrary set ofmodification rules (that possibly modify the online algorithm) and anarbitrary set of time selection functions (each giving differentweight to each time step). The regret for a given time selection andmodification rule is the difference between the cost of the onlinealgorithm and the cost of the modified online algorithm, where thecosts are weighted by the time selection function. This can be viewedas a generalization of the previously-studied sleeping expertssetting.

外部後悔は、N個のアクションから選択するオンラインアルゴリズムのパフォーマンスを、後から考えたそれらのアクションの中で最良のもののパフォーマンスと比較します。内部後悔は、オンラインアルゴリズムの損失を、あるアクションを別のアクションに常に置き換える修正されたオンラインアルゴリズムの損失と比較します。この論文では、外部後悔問題用のアルゴリズムを与えられた場合に、それを内部後悔問題用の効率的なオンラインアルゴリズムに変換する、単純な汎用的な削減を示します。私たちは、各アクションの損失が各タイムステップで観察される完全情報モデルと、選択されたアクションの損失のみが各タイムステップで観察される部分情報(バンディット)モデルの両方で機能する方法を提供します。ゲーム理論における内部後悔の重要性は、一般的なゲームでは、各プレーヤーが線形以下の内部後悔を持っている場合、経験的頻度が相関平衡に収束するという事実に起因します。外部後悔については、任意の変更ルールセット(オンラインアルゴリズムを変更する可能性があります)と任意の時間選択関数セット(各時間ステップに異なる重みを与える)を含む、一般的な後悔設定の定量的な後悔境界も導出します。特定の時間選択と変更ルールの後悔は、オンラインアルゴリズムのコストと、時間選択関数によって重み付けされたコストの差です。これは、以前に研究されたSleeping Experts設定の一般化として見ることができます。

Bayesian Quadratic Discriminant Analysis
ベイズ2次判別分析

Quadratic discriminant analysis is a common tool for classification,but estimation of the Gaussian parameters can be ill-posed. Thispaper contains theoretical and algorithmic contributions to Bayesianestimation for quadratic discriminant analysis. A distribution-basedBayesian classifier is derived using information geometry. Using acalculus of variations approach to define a functional Bregmandivergence for distributions, it is shown that the Bayesiandistribution-based classifier that minimizes the expected Bregmandivergence of each class conditional distribution also minimizes theexpected misclassification cost. A series approximation is used torelate regularized discriminant analysis to Bayesian discriminantanalysis. A new Bayesian quadratic discriminant analysis classifieris proposed where the prior is defined using a coarse estimate ofthe covariance based on the training data; this classifier is termedBDA7. Results on benchmark data sets and simulations show that BDA7performance is competitive with, and in some cases significantlybetter than, regularized quadratic discriminant analysis and thecross-validated Bayesian quadratic discriminant analysis classifierQuadratic Bayes.

2次判別分析は分類の一般的なツールですが、ガウスパラメータの推定は不適切である可能性があります。この論文には、2次判別分析のベイズ推定に対する理論的およびアルゴリズム的貢献が含まれています。分布ベースのベイズ分類器は、情報幾何学を使用して導出されます。分布の機能的ブレグマンダイバージェンスを定義するために変分法アプローチを使用すると、各クラスの条件付き分布の期待ブレグマンダイバージェンスを最小化するベイズ分布ベースの分類器は、期待される誤分類コストも最小化することがわかります。正規化判別分析をベイズ判別分析に関連付けるために、級数近似が使用されます。事前確率がトレーニングデータに基づく共分散の粗い推定を使用して定義される新しいベイズ2次判別分析分類器が提案されます。この分類器はBDA7と呼ばれます。ベンチマークデータセットとシミュレーションの結果から、BDA7のパフォーマンスは、正規化された2次判別分析やクロス検証されたベイズ2次判別分析分類器2次ベイズと競合し、場合によっては大幅に優れていることがわかります。

Measuring Differentiability: Unmasking Pseudonymous Authors
微分可能性の測定:仮名著者のマスキング解除

In the authorship verification problem, we are given examples of thewriting of a single author and are asked to determine if given longtexts were or were not written by this author. We present a newlearning-based method for adducing the “depth of difference” betweentwo example sets and offer evidence that this method solves theauthorship verification problem with very high accuracy. Theunderlying idea is to test the rate of degradation of the accuracy oflearned models as the best features are iteratively dropped from thelearning process.

著者確認問題では、一人の著者の執筆例が示され、与えられた長いテキストがこの著者によって書かれたものか、そうでないかを判断するように求められます。2つのサンプルセット間の「差の深さ」を導くための新しい学習ベースの方法を提示し、この方法が著者検証問題を非常に高い精度で解決する証拠を提供します。その根底にある考え方は、学習プロセスから最良の特徴が繰り返し削除されるため、学習モデルの精度が低下する速度をテストすることです。

Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling
一般化正則化による最大エントロピー密度推定と種分布モデリングへの応用

We present a unified and complete account of maximum entropydensity estimation subject to constraints represented by convexpotential functions or, alternatively, by convex regularization. Weprovide fully general performance guarantees and an algorithm with acomplete convergence proof. As special cases, we easily deriveperformance guarantees for many known regularization types,including l1, l2, l22, and l2 + l22 styleregularization. We propose an algorithm solving a large and generalsubclass of generalized maximum entropy problems, including alldiscussed in the paper, and prove its convergence. Our approachgeneralizes and unifies techniques based on information geometry andBregman divergences as well as those based more directly oncompactness. Our work is motivated by a novel application of maximumentropy to species distribution modeling, an important problem inconservation biology and ecology. In a set of experiments onreal-world data, we demonstrate the utility of maximum entropy inthis setting. We explore effects of different feature types, samplesizes, and regularization levels on the performance of maxent, anddiscuss interpretability of the resulting models.

私たちは、凸ポテンシャル関数または凸正則化によって表される制約条件の下で、最大エントロピー密度推定の統一された完全な説明を提示します。私たちは、完全に一般的なパフォーマンス保証と、完全な収束証明を備えたアルゴリズムを提供します。特殊なケースとして、私たちは、l1、l2、l22、およびl2 + l22スタイルの正則化を含む多くの既知の正則化タイプのパフォーマンス保証を容易に導く。私たちは、論文で議論されているすべてのものを含む、一般化最大エントロピー問題の大規模で一般的なサブクラスを解決するアルゴリズムを提案し、その収束を証明します。我々のアプローチは、情報幾何学とブレグマンダイバージェンスに基づく手法と、より直接的にコンパクト性に基づく手法を一般化および統一します。我々の研究は、保全生物学と生態学における重要な問題である種の分布モデリングへの最大エントロピーの新しい応用に動機付けられています。実世界のデータに対する一連の実験において、我々はこの設定における最大エントロピーの有用性を実証します。私たちは、さまざまな特徴タイプ、サンプルサイズ、および正則化レベルがmaxentのパフォーマンスに与える影響を調査し、結果として得られるモデルの解釈可能性について議論します。

Synergistic Face Detection and Pose Estimation with Energy-Based Models
エネルギーベースモデルによる相乗的な顔検出と姿勢推定

We describe a novel method for simultaneously detecting faces andestimating their pose in real time. The method employs a convolutionalnetwork to map images of faces to points on a low-dimensional manifoldparametrized by pose, and images of non-faces to points far away fromthat manifold. Given an image, detecting a face and estimating itspose is viewed as minimizing an energy function with respect to theface/non-face binary variable and the continuous pose parameters. Thesystem is trained to minimize a loss function that drives correctcombinations of labels and pose to be associated with lower energyvalues than incorrect ones.The system is designed to handle very large range of poses withoutretraining. The performance of the system was tested on threestandard data sets—for frontal views, rotated faces, and profiles—is comparable to previous systems that are designed to handle asingle one of these data sets.We show that a system trained simuiltaneously for detection andpose estimation is more accurate on both tasks than similarsystems trained for each task separately.

私たちは、顔を同時に検出し、リアルタイムで姿勢を推定する新しい方法について説明します。この方法では、畳み込みネットワークを使用して、顔の画像を姿勢によってパラメータ化された低次元多様体上の点にマッピングし、顔以外の画像をその多様体から遠く離れた点にマッピングします。画像が与えられた場合、顔を検出してそのポーズを推定することは、顔/非顔のバイナリ変数と連続ポーズパラメータに関するエネルギー関数を最小化するものとみなされます。システムは、ラベルとポーズの正しい組み合わせが間違ったものよりも低いエネルギー値に関連付けられるようにする損失関数を最小化するようにトレーニングされます。システムは、再トレーニングなしで非常に広範囲のポーズを処理できるように設計されています。システムのパフォーマンスは、正面図、回転した顔、および横顔の3つの標準データセットでテストされ、これらのデータセットの1つだけを処理するように設計された以前のシステムと同等です。検出とポーズ推定を同時にトレーニングしたシステムは、各タスクを個別にトレーニングした同様のシステムよりも、両方のタスクでより正確であることを示しています。

Local Discriminant Wavelet Packet Coordinates for Face Recognition
顔認識のためのローカル判別ウェーブレットパケット座標

Face recognition is a challenging problem due to variations in pose,illumination, and expression. Techniques that can provide effectivefeature representation with enhanced discriminability are crucial.Wavelets have played an important role in image processing for itsability to capture localized spatial-frequency information ofimages. In this paper, we propose a novel local discriminantcoordinates method based on wavelet packet for face recognition tocompensate for these variations. Traditional wavelet-based methodsfor face recognition select or operate on the most discriminantsubband, and neglect the scattered characteristic of discriminantfeatures. The proposed method selects the most discriminantcoordinates uniformly from all spatial frequency subbands toovercome the deficiency of traditional wavelet-based methods. Tomeasure the discriminability of coordinates, a new dilationinvariant entropy and a maximum a posterior logistic modelare put forward. Moreover, a new triangle square ratiocriterion is used to improve classification using the Euclideandistance and the cosine criterion. Experimental results show thatthe proposed method is robust for face recognition under variationsin illumination, pose and expression.

顔認識は、ポーズ、照明、表情の変動により難しい問題です。識別性を高めた効果的な特徴表現を提供できる手法が重要です。ウェーブレットは、画像の局所的な空間周波数情報をキャプチャできるため、画像処理で重要な役割を果たしてきました。この論文では、これらの変動を補正するために、顔認識用のウェーブレットパケットに基づく新しいローカル判別座標法を提案します。従来のウェーブレットベースの顔認識法では、最も判別力の高いサブバンドを選択または操作し、判別機能の分散特性を無視します。提案された方法は、従来のウェーブレットベースの方法の欠点を克服するために、すべての空間周波数サブバンドから均一に最も判別力の高い座標を選択します。座標の判別可能性を測定するために、新しい膨張不変エントロピーと最大事後ロジスティックモデルが提案されています。さらに、新しい三角二乗比基準を使用して、ユークリッド距離とコサイン基準を使用した分類を改善します。実験結果から、提案された方法は、照明、姿勢、表情の変化に対して顔認識に堅牢であることがわかります。

Penalized Model-Based Clustering with Application to Variable Selection
変数選択への適用によるペナルティ付きモデルベースクラスタリング

Variable selection in clustering analysis is both challengingand important. In the context of model-based clustering analysis witha common diagonal covariance matrix, which isespecially suitable for “high dimension, low sample size” settings,we propose a penalized likelihood approachwith an L1 penalty function, automatically realizing variableselection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, andpropose a modified BIC as a model selection criterion to choose the number of components and thepenalization parameter.A simulation study and an application to gene functionprediction with gene expression profiles demonstrate theutility of our method.

クラスタリング分析での変数選択は、困難であると同時に重要です。共通の対角共分散行列を使用したモデルベースのクラスタリング分析のコンテキストでは、特に「高次元、低サンプルサイズ」の設定に適していますが、L1ペナルティ関数を使用したペナルティ尤度アプローチを提案し、しきい値処理による変数選択を自動的に実現し、スパースソリューションを提供します。提案したモデルに適合するEMアルゴリズムを導き出し、コンポーネントの数とペナルティパラメータを選択するためのモデル選択基準として修正BICを提案します。シミュレーション研究と遺伝子発現プロファイルによる遺伝子機能予測への応用は、私たちの方法の有用性を示しています。

Loop Corrections for Approximate Inference on Factor Graphs
因子グラフでの近似推論のためのループ補正

We propose a method to improve approximate inference methods bycorrecting for the influence of loops in the graphical model. Themethod is a generalization and alternative implementation of a recentidea from Montanari and Rizzo (2005). It is applicable to arbitraryfactor graphs, provided that the size of the Markov blankets is nottoo large. It consists of two steps: (i) an approximate inferencemethod, for example, belief propagation, is used to approximatecavity distributions for each variable (i.e., probabilitydistributions on the Markov blanket of a variable for a modifiedgraphical model in which the factors involving that variable have beenremoved); (ii) all cavity distributions are improved by amessage-passing algorithm that cancels out approximation errors byimposing certain consistency constraints. This loop correction (LC)method usually gives significantly better results than the original,uncorrected, approximate inference algorithm that is used to estimatethe effect of loops. Indeed, we often observe that the loop-correctederror is approximately the square of the error of the uncorrectedapproximate inference method. In this article, we compare differentvariants of the loop correction method with other approximateinference methods on a variety of graphical models, including “realworld” networks, and conclude that the LC method generally obtainsthe most accurate results.

私たちは、グラフィカルモデル内のループの影響を修正することで近似推論法を改善する方法を提案します。この方法は、MontanariとRizzo (2005)の最近のアイデアを一般化し、代替実装したものです。マルコフブランケットのサイズが大きすぎない場合に、任意の因子グラフに適用できます。この方法は、次の2つのステップで構成されます。(i)近似推論法(たとえば、確信伝播法)を使用して、各変数のキャビティ分布(つまり、その変数に関連する因子が削除された修正されたグラフィカルモデルの変数のマルコフブランケット上の確率分布)を近似します。(ii)すべてのキャビティ分布は、一定の一貫性制約を課すことで近似誤差をキャンセルするメッセージパッシングアルゴリズムによって改善されます。このループ補正(LC)法は、通常、ループの効果を推定するために使用される元の補正されていない近似推論アルゴリズムよりも大幅に優れた結果をもたらします。実際、ループ補正された誤差は、補正されていない近似推論法の誤差のほぼ2乗であることがよく見られます。この記事では、さまざまなグラフィカルモデル(「現実世界」のネットワークを含む)で、ループ補正法のさまざまなバリエーションを他の近似推論法と比較し、LC法が一般に最も正確な結果を得るという結論に達しました。

Bilinear Discriminant Component Analysis
双線形判別成分分析

Factor analysis and discriminant analysis are often used ascomplementary approaches to identify linear components in twodimensional data arrays. For three dimensional arrays, which mayorganize data in dimensions such as space, time, and trials, theopportunity arises to combine these two approaches. A new method,Bilinear Discriminant Component Analysis (BDCA), is derived anddemonstrated in the context of functional brain imaging data for whichit seems ideally suited. The work suggests to identify a subspaceprojection which optimally separates classes while ensuring that eachdimension in this space captures an independent contribution to thediscrimination.

因子分析と判別分析は、2次元データ配列の線形成分を特定するための補完的なアプローチとしてよく使用されます。3次元配列の場合、空間、時間、試行などの次元でデータを整理すると、これら2つのアプローチを組み合わせる機会が生じます。新しい方法であるバイリニア判別成分分析(BDCA)が導き出され、理想的に適していると思われる機能的な脳イメージングデータのコンテキストで実証されています。この研究では、この空間の各次元が識別への独立した貢献を捉えることを確保しながら、クラスを最適に分離する部分空間投影を特定することを提案しています。

Undercomplete Blind Subspace Deconvolution
未完成ブラインド部分空間デコンボリューション

We introduce the blind subspace deconvolution (BSSD) problem,which is the extension of both the blind source deconvolution(BSD) and the independent subspace analysis (ISA) tasks. Weexamine the case of the undercomplete BSSD (uBSSD). Applyingtemporal concatenation we reduce this problem to ISA. Theassociated ‘high dimensional’ ISA problem can be handled by arecent technique called joint f-decorrelation (JFD).Similar decorrelation methods have been used previously for kernelindependent component analysis (kernel-ICA). More precisely, thekernel canonical correlation (KCCA) technique is a member of thisfamily, and, as is shown in this paper, the kernel generalizedvariance (KGV) method can also be seen as a decorrelation methodin the feature space. These kernel based algorithms will beadapted to the ISA task. In the numerical examples, we (i) examinehow efficiently the emerging higher dimensional ISA tasks can betackled, and (ii) explore the working and advantages of thederived kernel-ISA methods.

私たちは、ブラインドサブスペースデコンボリューション(BSSD)問題を紹介します。これはブラインドソースデコンボリューション(BSD)と独立サブスペース分析(ISA)タスクの両方の拡張です。不完全BSSD (uBSSD)のケースを調べます。時間連結を適用して、この問題をISAに縮小します。関連する「高次元」ISA問題は、ジョイントfデコレレーション(JFD)と呼ばれる最近の手法で処理できます。同様のデコレレーション手法は、カーネル独立成分分析(カーネルICA)に以前から使用されています。より正確には、カーネル正準相関(KCCA)手法はこのファミリーのメンバーであり、この論文で示されているように、カーネル一般化分散(KGV)手法も特徴空間のデコレレーション手法と見なすことができます。これらのカーネルベースのアルゴリズムは、ISAタスクに適応されます。数値例では、(i)出現しつつある高次元ISAタスクにいかに効率的に取り組めるかを調べ、(ii)導出されたカーネルISA法の仕組みと利点を探ります。

Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis
Local フィッシャー判別分析によるマルチモーダルラベルデータの次元削減

Reducing the dimensionality of data without losing intrinsicinformation is an important preprocessing step in high-dimensionaldata analysis. Fisher discriminant analysis (FDA) is a traditionaltechnique for supervised dimensionality reduction, but it tends togive undesired results if samples in a class are multimodal.An unsupervised dimensionality reduction method calledlocality-preserving projection (LPP) can work well with multimodaldata due to its locality preserving property. However, since LPP doesnot take the label information into account, it is not necessarilyuseful in supervised learning scenarios. In this paper, we propose anew linear supervised dimensionality reduction method calledlocal Fisher discriminant analysis (LFDA), which effectivelycombines the ideas of FDA and LPP. LFDA has an analytic form of theembedding transformation and the solution can be easily computed justby solving a generalized eigenvalue problem. We demonstrate thepractical usefulness and high scalability of the LFDA method in datavisualization and classification tasks through extensive simulationstudies. We also show that LFDA can be extended to non-lineardimensionality reduction scenarios by applying the kernel trick.

固有の情報を失うことなくデータの次元を減らすことは、高次元データ分析における重要な前処理手順です。フィッシャー判別分析(FDA)は、教師あり次元削減の従来の手法ですが、クラス内のサンプルがマルチモーダルである場合、望ましくない結果をもたらす傾向があります。局所性保存射影(LPP)と呼ばれる教師なし次元削減法は、局所性保存特性のため、マルチモーダルデータでうまく機能します。ただし、LPPはラベル情報を考慮しないため、教師あり学習シナリオでは必ずしも有用ではありません。この論文では、FDAとLPPの考え方を効果的に組み合わせた、局所フィッシャー判別分析(LFDA)と呼ばれる新しい線形教師あり次元削減法を提案します。LFDAは埋め込み変換の解析形式を持ち、一般化固有値問題を解くだけで簡単に解を計算できます。広範なシミュレーション研究を通じて、データ視覚化および分類タスクにおけるLFDA法の実用的な有用性と高いスケーラビリティを実証します。また、カーネルトリックを適用することで、LFDAを非線形次元削減シナリオに拡張できることも示します。

On the Consistency of Multiclass Classification Methods
多クラス分類法の一貫性について

Binary classification is a well studied special case of the classificationproblem. Statistical properties of binary classifiers, such as consistency,have been investigated in a variety of settings. Binary classification methodscan be generalized in many ways to handle multiple classes. It turns out thatone can lose consistency in generalizing a binary classification method to dealwith multiple classes. We study a rich family of multiclass methods and providea necessary and sufficient condition for their consistency. We illustrate ourapproach by applying it to some multiclass methods proposed in the literature.

二項分類は、分類問題のよく研究された特殊なケースです。二項分類器の統計的特性(一貫性など)は、さまざまな設定で調査されてきました。二項分類法は、複数のクラスを処理するためにさまざまな方法で一般化できます。複数のクラスを処理するために二項分類法を一般化すると、一貫性が失われる可能性があることがわかりました。私たちは、多クラスメソッドの豊富なファミリーを研究し、それらの一貫性のために必要かつ十分な条件を提供します。このアプローチを、文献で提案されているいくつかの多クラスメソッドに適用することで説明します。

Covariate Shift Adaptation by Importance Weighted Cross Validation
重要度加重交差検証による共変量シフト適応

A common assumption in supervised learning is that the input points inthe training set follow the same probability distribution asthe input points that will be given in the future test phase.However, this assumption is not satisfied, for example, when theoutside of the training region is extrapolated. The situation wherethe training input points and test input points followdifferent distributions while the conditional distribution ofoutput values given input points is unchanged is called thecovariate shift. Under the covariate shift, standard modelselection techniques such as cross validation do not work as desiredsince its unbiasedness is no longer maintained. In this paper, wepropose a new method called importance weighted crossvalidation (IWCV), for which we prove its unbiasedness even under thecovariate shift. The IWCV procedure is the only one that can beapplied for unbiased classification under covariate shift, whereasalternatives to IWCV exist for regression. The usefulness of ourproposed method is illustrated by simulations, and furthermoredemonstrated in the brain-computer interface, where strongnon-stationarity effects can be seen between training and testsessions.

教師あり学習の一般的な仮定は、トレーニングセット内の入力ポイントが、将来のテストフェーズで与えられる入力ポイントと同じ確率分布に従うというものです。ただし、この仮定は、たとえばトレーニング領域の外側が外挿される場合は満たされません。トレーニング入力ポイントとテスト入力ポイントが異なる分布に従う一方で、入力ポイントが与えられた場合の出力値の条件付き分布は変わらない状況を共変量シフトと呼びます。共変量シフトでは、クロスバリデーションなどの標準的なモデル選択手法は、その偏りが維持されなくなるため、期待どおりに機能しません。この論文では、重要度加重クロスバリデーション(IWCV)と呼ばれる新しい方法を提案し、共変量シフト下でも偏りがないことを証明します。IWCV手順は、共変量シフト下での偏りのない分類に適用できる唯一の手順ですが、回帰にはIWCVの代替手段が存在します。提案された方法の有用性はシミュレーションによって示され、さらに、トレーニングセッションとテストセッション間で強い非定常効果が見られる脳コンピューターインターフェースで実証されています。

Classification in Networked Data: A Toolkit and a Univariate Case Study
ネットワークデータにおける分類:ツールキットと単変量ケーススタディ

This paper is about classifying entities that are interlinked withentities for which the class is known. After surveying prior work, wepresent NetKit, a modular toolkit for classification in networkeddata, and a case-study of its application to networked data used inprior machine learning research. NetKit is based on a node-centricframework in which classifiers comprise a local classifier, arelational classifier, and a collective inference procedure. Variousexisting node-centric relational learning algorithms can beinstantiated with appropriate choices for these components, and newcombinations of components realize new algorithms. The case studyfocuses on univariate network classification, for which the onlyinformation used is the structure of class linkage in the network(i.e., only links and some class labels). To our knowledge, no workpreviously has evaluated systematically the power of class-linkagealone for classification in machine learning benchmark data sets. Theresults demonstrate that very simple network-classification modelsperform quite well—well enough that they should be used regularly asbaseline classifiers for studies of learning with networked data. Thesimplest method (which performs remarkably well) highlights the closecorrespondence between several existing methods introduced fordifferent purposes—that is, Gaussian-field classifiers, Hopfieldnetworks, and relational-neighbor classifiers. The case study alsoshows that there are two sets of techniques that are preferable indifferent situations, namely when few versus many labels are knowninitially. We also demonstrate that link selection plays an importantrole similar to traditional feature selection.

この論文では、クラスが既知のエンティティと相互にリンクされているエンティティを分類することについて述べています。これまでの研究を調査した後、ネットワーク化されたデータの分類のためのモジュール式ツールキットであるNetKitと、以前の機械学習研究で使用されたネットワーク化されたデータへのその適用のケーススタディを紹介します。NetKitは、分類器がローカル分類器、関係分類器、および集合推論手順で構成されるノード中心のフレームワークに基づいています。さまざまな既存のノード中心の関係学習アルゴリズムは、これらのコンポーネントを適切に選択することでインスタンス化でき、コンポーネントの新しい組み合わせによって新しいアルゴリズムが実現されます。ケーススタディでは、ネットワーク内のクラスリンクの構造(つまり、リンクといくつかのクラスラベルのみ)のみの情報を使用する単変量ネットワーク分類に焦点を当てています。私たちの知る限り、機械学習ベンチマークデータセットでの分類におけるクラスリンクのみの威力を体系的に評価した研究はこれまでありません。結果は、非常に単純なネットワーク分類モデルが非常に優れたパフォーマンスを発揮することを示しています。これは、ネットワーク化されたデータを使用した学習の研究のベースライン分類器として定期的に使用すべきほどです。最も単純な方法(非常に優れたパフォーマンスを発揮)は、さまざまな目的で導入されたいくつかの既存の方法(ガウス場分類器、ホップフィールドネットワーク、リレーショナルネイバー分類器)間の密接な対応を強調しています。ケーススタディでは、異なる状況、つまり最初に知られているラベルの数が少ない場合と多い場合で、2つの手法セットが適していることも示しています。また、リンク選択が従来の特徴選択と同様に重要な役割を果たすことも示しています。

Anytime Learning of Decision Trees
決定木のいつでも学習

The majority of existing algorithms for learning decision trees aregreedy—a tree is induced top-down, making locally optimal decisionsat each node. In most cases, however, the constructed tree is notglobally optimal. Even the few non-greedy learners cannot learn goodtrees when the concept is difficult. Furthermore, they require afixed amount of time and are not able to generate a better tree ifadditional time is available. We introduce a framework for anytimeinduction of decision trees that overcomes these problems by tradingcomputation speed for better tree quality. Our proposed family ofalgorithms employs a novel strategy for evaluating candidate splits.A biased sampling of the space of consistent trees rooted at anattribute is used to estimate the size of the minimal tree under thatattribute, and an attribute with the smallest expected tree isselected. We present two types of anytime induction algorithms: acontract algorithm that determines the sample size on the basisof a pre-given allocation of time, and an interruptiblealgorithm that starts with a greedy tree and continuously improvessubtrees by additional sampling. Experimental results indicate that,for several hard concepts, our proposed approach exhibits good anytimebehavior and yields significantly better decision trees when more timeis available.

決定木を学習するための既存のアルゴリズムの大部分は貪欲です。つまり、ツリーはトップダウンで誘導され、各ノードでローカルに最適な決定が行われます。ただし、ほとんどの場合、構築されたツリーはグローバルに最適ではありません。少数の非貪欲学習者でさえ、概念が難しい場合は適切なツリーを学習できません。さらに、それらは一定の時間を必要とし、追加の時間が利用可能であっても、より優れたツリーを生成することはできません。私たちは、計算速度と引き換えにツリーの品質を向上させることでこれらの問題を克服する、いつでも決定木を誘導するためのフレームワークを紹介します。我々が提案するアルゴリズムファミリは、候補の分割を評価するための新しい戦略を採用しています。属性をルートとする一貫性のあるツリーの空間の偏りのあるサンプリングを使用して、その属性の下の最小ツリーのサイズを推定し、最小の期待ツリーを持つ属性を選択します。私たちは、2種類のいつでも誘導アルゴリズムを提示します。1つは、事前に与えられた時間割り当てに基づいてサンプルサイズを決定する契約アルゴリズム、もう1つは、貪欲なツリーから開始して追加のサンプリングによってサブツリーを継続的に改善する割り込み可能なアルゴリズムです。実験結果によると、いくつかの難しい概念に対して、我々が提案したアプローチはいつでも良好な動作を示し、より多くの時間が利用できる場合は大幅に優れた決定ツリーを生成します。

Combining PAC-Bayesian and Generic Chaining Bounds
PACベイズと Generic Chaining Bounds の組合せ

There exist many different generalization error bounds in statisticallearning theory. Each of these bounds contains an improvement over theothers for certain situations or algorithms. Our goal is, first, tounderline the links between these bounds, and second, to combine thedifferent improvements into a single bound. In particular we combinethe PAC-Bayes approach introduced by McAllester (1998), which isinteresting for randomized predictions, with the optimal union boundprovided by the generic chaining technique developed by Fernique andTalagrand (see Talagrand, 1996), in a way that also takes into accountthe variance of the combined functions. We also show how this connectsto Rademacher based bounds.

統計学習理論には、多くの異なる一般化誤差範囲が存在します。これらの各境界には、特定の状況やアルゴリズムに対する他の境界よりも改善が含まれています。私たちの目標は、まず、これらの境界間のリンクを強調すること、次に、さまざまな改善点を1つの境界に結合することです。特に、ランダム化予測にとって興味深いMcAllester (1998)によって導入されたPAC-Bayesアプローチと、FerniqueとTalagrandによって開発された汎用連鎖技術(Talagrand, 1996を参照)によって提供される最適な和集合境界を、組み合わせた関数の分散も考慮する方法で組み合わせます。また、これがRademacherベースの境界にどのように接続されるかも示します。

Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters
ハイパーパラメータのベイズ正則化によるモデル選択中の過剰適合の防止

While the model parameters of a kernel machine are typically given by thesolution of a convex optimisation problem, with a single global optimum, theselection of good values for the regularisation and kernel parameters is muchless straightforward. Fortunately the leave-one-out cross-validationprocedure can be performed or a least approximated very efficiently inclosed form for a wide variety of kernel learning methods, providing aconvenient means for model selection. Leave-one-out cross-validation basedestimates of performance, however, generally exhibit a relatively highvariance and are therefore prone to over-fitting. In this paper, weinvestigate the novel use of Bayesian regularisation at the second level ofinference, adding a regularisation term to the model selection criterioncorresponding to a prior over the hyper-parameter values, where the additionalregularisation parameters are integrated out analytically. Results obtainedon a suite of thirteen real-world and synthetic benchmark data sets clearlydemonstrate the benefit of this approach.

カーネルマシンのモデルパラメータは通常、単一のグローバル最適値を持つ凸最適化問題の解によって与えられますが、正則化およびカーネルパラメータの適切な値の選択はそれほど簡単ではありません。幸いなことに、Leave-One-Outクロス検証手順は、さまざまなカーネル学習方法に対して非常に効率的に閉じた形式で実行または最小近似できるため、モデル選択に便利な手段となります。ただし、Leave-One-Outクロス検証に基づくパフォーマンスの推定値は、一般に比較的高い分散を示すため、過剰適合になりがちです。この論文では、推論の第2レベルでのベイズ正則化の新しい使用法について調査し、ハイパーパラメータ値の事前分布に対応する正則化項をモデル選択基準に追加します。この場合、追加の正則化パラメータは解析的に統合されます。13個の実世界および合成ベンチマークデータセットで得られた結果は、このアプローチの利点を明確に示しています。

Gini Support Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
ジニサポートベクターマシン: 二次エントロピーに基づくロバストな多クラス確率回帰

Many classification tasks require estimation of output classprobabilities for use as confidence scores or for inference integratedwith other models.Probability estimates derived from large margin classifiers such assupport vector machines (SVMs) are often unreliable. We extend SVMlarge margin classification to GiniSVM maximum entropy multi-classprobability regression. GiniSVM combines a quadratic (Gini-Simpson)entropy based agnostic model with a kernel based similarity model. Aform of Huber loss in the GiniSVM primal formulation elucidates aconnection to robust estimation, further corroborated by the impulsivenoise filtering property of the reverse water-filling procedure toarrive at normalized classification margins. The GiniSVM normalizedclassification margins directly provide estimates of class conditionalprobabilities, approximating kernel logistic regression (KLR) but atreduced computational cost. As with other SVMs, GiniSVM produces asparse kernel expansion and is trained by solving a quadratic programunder linear constraints. GiniSVM training is efficientlyimplemented by sequential minimum optimization or by growthtransformation on probability functions.Results on synthetic and benchmark data, including speakerverification and face detection data, show improved classificationperformance and increased tolerance to imprecision over soft-marginSVM and KLR.

多くの分類タスクでは、信頼スコアとして使用したり、他のモデルと統合された推論に使用したりするために、出力クラス確率を推定する必要があります。サポートベクターマシン(SVM)などの大マージン分類器から得られる確率推定は、多くの場合信頼できません。SVM大マージン分類をGiniSVM最大エントロピーマルチクラス確率回帰に拡張します。GiniSVMは、2次(Gini-Simpson)エントロピーベースの不可知論モデルとカーネルベースの類似性モデルを組み合わせたものです。GiniSVMプライマリ定式化のHuber損失の形式は、ロバスト推定との関連性を明らかにし、さらに、逆注水手順のインパルスノイズフィルタリングプロパティによって裏付けられ、正規化された分類マージンに到達します。GiniSVM正規化分類マージンは、クラス条件付き確率の推定値を直接提供し、カーネルロジスティック回帰(KLR)に近似しますが、計算コストは削減されます。他のSVMと同様に、GiniSVMは非スパースカーネル拡張を生成し、線形制約の下で2次計画を解くことによってトレーニングされます。GiniSVMトレーニングは、シーケンシャル最小最適化または確率関数の成長変換によって効率的に実装されます。話者検証および顔検出データを含む合成データおよびベンチマークデータの結果は、ソフトマージンSVMおよびKLRよりも分類パフォーマンスが向上し、不正確さに対する許容度が高まっていることを示しています。

Concave Learners for Rankboost
Rankboostの凹型学習者

Rankboost has been shown to be an effective algorithm for combiningranks. However, its ability to generalize well and not overfit isdirectly related to the choice of weak learner, in the sense thatregularization of the rank function is due to the regularization propertiesof its weak learners. We present a regularization property calledconsistency in preference and confidence that mathematicallytranslates into monotonic concavity, and describe a new weak rankinglearner (MWGR) that generates ranking functions with this property.In experiments combining ranks from multiple face recognition algorithmsand an experiment combining text information retrieval systems, rankfunctions using MWGR proved superior to binary weak learners.

Rankboostは、ランクを組み合わせるための効果的なアルゴリズムであることが示されています。ただし、ランク関数の正則化が弱学習器の正則化特性によるものであるという意味で、その能力がよく一般化し、過適合しない能力は、弱学習器の選択に直接関係しています。数学的に単調な凹面に変換されるconsistency in preference and confidenceという正則化プロパティを提示し、このプロパティを使用してランク付け関数を生成する新しい弱いrankinglearner (MWGR)について説明します。複数の顔認識アルゴリズムのランクを組み合わせた実験や、テキスト情報検索システムを組み合わせた実験では、MWGRを用いたランク関数がバイナリの弱い学習器よりも優れていることが証明されました。

Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
スパース性と条件付き確率の推定: いくつかの漸近結果

One of the nice properties of kernel classifiers such as SVMs is thatthey often produce sparse solutions. However, the decision functionsof these classifiers cannot always be used to estimate the conditionalprobability of the class label. We investigate the relationshipbetween these two properties and show that these are intimatelyrelated: sparseness does not occur when the conditional probabilitiescan be unambiguously estimated. We consider a family of convex lossfunctions and derive sharp asymptotic results for the fraction of datathat becomes support vectors. This enables us to characterize theexact trade-off between sparseness and the ability to estimateconditional probabilities for these loss functions.

SVMなどのカーネル分類子の優れた特性の1つは、多くの場合、スパースソリューションを生成することです。ただし、これらの分類子の決定関数を使用して、クラスラベルの条件付き確率を推定するとは限りません。これら2つの特性間の関係を調査し、これらが密接に関連していることを示します:条件付き確率が明確に推定できる場合、スパース性は発生しません。凸損失関数のファミリーを考慮し、サポートベクトルとなるデータの割合について鋭い漸近結果を導き出します。これにより、スパース性とこれらの損失関数の条件付き確率を推定する能力との間の正確なトレードオフを特徴付けることができます。

Infinitely Imbalanced Logistic Regression
無限に不均衡なロジスティック回帰

In binary classification problems it is common forthe two classes to be imbalanced: one case is veryrare compared to the other. In this paper we considerthe infinitely imbalanced case where one class has a finitesample size and the other class’s sample size grows without bound.For logistic regression, the infinitely imbalanced case oftenhas a useful solution. Under mild conditions,the intercept diverges as expected, butthe rest of the coefficient vector approaches a non trivialand useful limit.That limit can be expressed in terms of exponential tiltingand is the minimum of a convex objective function.The limiting form of logistic regression suggests a computationalshortcut for fraud detection problems.

二項分類問題では、2つのクラスが不均衡であることが一般的です:一方のケースが他方のケースと非常に比較されます。この論文では、一方のクラスが有限のサンプルサイズを持ち、もう一方のクラスのサンプルサイズが無制限に増加する無限に不均衡なケースについて考察します。ロジスティック回帰の場合、無限に不均衡なケースには、多くの場合、有用な解決策があります。穏やかな条件下では、切片は予想どおりに発散しますが、係数ベクトルの残りの部分は重要で有用な制限に近づきます。この限界は指数関数的傾斜で表すことができ、凸目的関数の最小値です。ロジスティック回帰の限定形式は、不正検出問題の計算ショートカットを示唆しています。

The Pyramid Match Kernel: Efficient Learning with Sets of Features
ピラミッドマッチカーネル:一連の機能による効率的な学習

In numerous domains it is useful to represent a single example bythe set of the local features or parts that comprise it. However,this representation poses a challenge to many conventional machinelearning techniques, since sets may vary in cardinality and elementslack a meaningful ordering. Kernel methods can learn complexfunctions, but a kernel over unordered set inputs must somehow solvefor correspondences—generally a computationally expensive taskthat becomes impractical for large set sizes. We present a new fastkernel function called the pyramid match that measurespartial match similarity in time linear in the number of features.The pyramid match maps unordered feature sets to multi-resolutionhistograms and computes a weighted histogram intersection in orderto find implicit correspondences based on the finest resolutionhistogram cell where a matched pair first appears. We show thepyramid match yields a Mercer kernel, and we prove bounds on itserror relative to the optimal partial matching cost. We demonstrateour algorithm on both classification and regression tasks, includingobject recognition, 3-D human pose inference, and time ofpublication estimation for documents, and we show that the proposedmethod is accurate and significantly more efficient than currentapproaches.

多くの分野では、単一の例を、それを構成するローカルな特徴または部分の集合で表現すると便利です。しかし、集合の基数が異なり、要素に意味のある順序がないため、この表現は多くの従来の機械学習技術にとって課題となります。カーネル法は複雑な関数を学習できますが、順序付けされていない集合入力に対するカーネルは、何らかの方法で対応を解く必要があります。これは通常、計算コストの高いタスクであり、集合のサイズが大きい場合は非現実的になります。ここでは、ピラミッドマッチと呼ばれる新しい高速カーネル関数を紹介します。これは、特徴の数に比例した時間で部分一致の類似性を測定します。ピラミッドマッチは、順序付けされていない特徴集合をマルチ解像度ヒストグラムにマッピングし、重み付けされたヒストグラムの交差を計算して、一致したペアが最初に現れる最も細かい解像度のヒストグラムセルに基づいて暗黙の対応を見つけます。ピラミッドマッチによってMercerカーネルが生成されることを示し、最適な部分一致コストに対するその誤差の境界を証明します。私たちは、物体認識、3D人間の姿勢の推論、文書の公開時刻の推定など、分類と回帰の両方のタスクでアルゴリズムを実証し、提案された方法が正確で、現在のアプローチよりも大幅に効率的であることを示します。

Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
動的条件付き確率フィールド: シーケンスデータのラベル付けとセグメント化のための因数分解確率モデル

In sequence modeling, we often wish to represent complex interactionbetween labels, such as when performing multiple, cascaded labelingtasks on the same sequence, or when long-range dependencies exist. Wepresent dynamic conditional random fields (DCRFs), ageneralization of linear-chain conditional random fields (CRFs) inwhich each time slice contains a set of state variables and edges—adistributed state representation as in dynamic Bayesian networks(DBNs)—and parameters are tied across slices. Since exact inferencecan be intractable in such models, we perform approximate inferenceusing several schedules for belief propagation, including tree-basedreparameterization (TRP). On a natural-language chunking task, we showthat a DCRF performs better than a series of linear-chain CRFs,achieving comparable performance using only half the training data.In addition to maximum conditional likelihood, we present twoalternative approaches for training DCRFs: marginal likelihoodtraining, for when we are primarily interested in predicting only asubset of the variables, and cascaded training, for when wehave a distinct data set for each state variable, as in transferlearning. We evaluate marginal training and cascaded training on bothsynthetic data and real-world text data, finding that marginaltraining can improve accuracy when uncertainty exists over the latentvariables, and that for transfer learning, a DCRF trained in acascaded fashion performs better than a linear-chain CRF that predictsthe final task directly.

シーケンスモデリングでは、同じシーケンスに対して複数のカスケードラベリングタスクを実行する場合や、長距離の依存関係が存在する場合など、ラベル間の複雑な相互作用を表現することが必要なことがよくあります。ここでは、線形連鎖条件付きランダムフィールド(CRF)の一般化である動的条件付きランダムフィールド(DCRF)を表現します。このフィールドでは、各タイムスライスに状態変数とエッジのセット(動的ベイジアンネットワーク(DBN)のような分散状態表現)が含まれ、パラメーターはスライス間で結び付けられます。このようなモデルでは正確な推論が困難な場合があるため、ツリーベースの再パラメーター化(TRP)を含む、いくつかのビリーフプロパゲーションスケジュールを使用して近似推論を実行します。自然言語のチャンキングタスクでは、DCRFが一連の線形チェーンCRFよりも優れたパフォーマンスを発揮し、半分のトレーニングデータのみを使用して同等のパフォーマンスを達成することを示しています。最大条件付き尤度に加えて、DCRFをトレーニングするための2つの代替アプローチを紹介します。主に変数のサブセットのみを予測することに興味がある場合の周辺尤度トレーニングと、転移学習のように各状態変数に異なるデータセットがある場合のカスケードトレーニングです。合成データと実際のテキストデータの両方で周辺トレーニングとカスケードトレーニングを評価し、潜在変数に不確実性がある場合に周辺トレーニングによって精度が向上すること、および転移学習では、カスケード方式でトレーニングされたDCRFが最終タスクを直接予測する線形チェーンCRFよりも優れたパフォーマンスを発揮することを発見しました。

Relational Dependency Networks
リレーショナル依存関係ネットワーク

Recent work on graphical models for relational data has demonstratedsignificant improvements in classification and inference when modelsrepresent the dependencies among instances. Despite its use inconventional statistical models, the assumption of instanceindependence is contradicted by most relational data sets. Forexample, in citation data there are dependencies among the topics of apaper’s references, and in genomic data there are dependencies amongthe functions of interacting proteins. In this paper, we presentrelational dependency networks (RDNs), graphical models that arecapable of expressing and reasoning with such dependencies in arelational setting. We discuss RDNs in the context of relational Bayesnetworks and relational Markov networks and outline the relativestrengths of RDNs—namely, the ability to represent cyclicdependencies, simple methods for parameter estimation, and efficientstructure learning techniques. The strengths of RDNs are due to theuse of pseudolikelihood learning techniques, which estimate anefficient approximation of the full joint distribution. We presentlearned RDNs for a number of real-world data sets and evaluate themodels in a prediction context, showing that RDNs identify and exploitcyclic relational dependencies to achieve significant performancegains over conventional conditional models. In addition, we usesynthetic data to explore model performance under various relationaldata characteristics, showing that RDN learning and inferencetechniques are accurate over a wide range of conditions.

リレーショナルデータのグラフィカルモデルに関する最近の研究では、モデルがインスタンス間の依存関係を表す場合、分類と推論が大幅に改善されることが実証されています。インスタンスの独立性の仮定は、従来の統計モデルで使用されていますが、ほとんどのリレーショナルデータセットでは矛盾しています。たとえば、引用データでは、論文の参考文献のトピック間に依存関係があり、ゲノムデータでは、相互作用するタンパク質の機能間に依存関係があります。この論文では、リレーショナル設定でこのような依存関係を表現し、推論できるグラフィカルモデルであるリレーショナル依存関係ネットワーク(RDN)を紹介します。リレーショナルベイズネットワークとリレーショナルマルコフネットワークのコンテキストでRDNについて説明し、RDNの相対的な長所、つまり循環依存関係を表す機能、パラメーター推定の簡単な方法、効率的な構造学習手法について概説します。RDNの長所は、完全な結合分布の効率的な近似値を推定する擬似尤度学習手法を使用しているためです。私たちは、多数の実世界のデータセットに対して学習したRDNを提示し、予測コンテキストでモデルを評価し、RDNが循環的な関係依存関係を識別して活用することで、従来の条件付きモデルよりも大幅なパフォーマンス向上を実現することを示しています。さらに、合成データを使用して、さまざまな関係データ特性におけるモデルのパフォーマンスを調査し、RDNの学習および推論技術が幅広い条件で正確であることを示しています。

Margin Trees for High-dimensional Classification
高次元分類のためのマージンツリー

We propose a method for the classification of more than two classes,from high-dimensional features. Our approach is to build a binarydecision tree in a top-down manner, using the optimal marginclassifier at each split. We implement an exact greedy algorithm forthis task, and compare its performance to less greedy procedures basedon clustering of the matrix of pairwise margins. We compare theperformance of the “margin tree” to the closely related “all-pairs”(one versus one) support vector machine, and nearest centroids on anumber of cancer microarray data sets. We also develop a simplemethod for feature selection. We find that the margin tree hasaccuracy that is competitive with other methods and offers additionalinterpretability in its putative grouping of the classes.

私たちは、高次元の特徴から2つ以上のクラスを分類する方法を提案します。私たちのアプローチは、各分割で最適なマージン分類器を使用して、トップダウン方式でバイナリ決定木を構築することです。このタスクに厳密な貪欲なアルゴリズムを実装し、そのパフォーマンスを、ペアワイズマージンの行列のクラスタリングに基づく貪欲でない手順と比較します。「マージンツリー」の性能を、密接に関連する「オールペア」(1対1)サポートベクターマシン、および多数のがんマイクロアレイデータセット上の最も近い重心と比較します。また、機能選択のための簡単な方法も開発しています。マージンツリーは、他の方法と競争力のある精度を持ち、クラスの推定グループ化に追加の解釈可能性を提供することがわかりました。

Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm
PCアルゴリズムによる高次元有向非巡回グラフの推定

We consider the PC-algorithm (Spirtes et al., 2000) for estimating theskeleton and equivalence class of a very high-dimensional directedacyclic graph (DAG) with corresponding Gaussian distribution. ThePC-algorithm is computationally feasible and often very fast forsparse problems with many nodes (variables), and it has the attractiveproperty to automatically achieve high computational efficiency as afunction of sparseness of the true underlying DAG. We prove uniformconsistency of the algorithm for very high-dimensional, sparse DAGswhere the number of nodes is allowed to quickly grow with sample sizen, as fast as O(na) for any0 < a < ∞. The sparseness assumption is rather minimalrequiring only that the neighborhoods in the DAG are of lower orderthan sample size n. We also demonstrate the PC-algorithm forsimulated data.

私たちは、対応するガウス分布を持つ超高次元の有向巡回グラフ(DAG)のスケルトンと等価クラスを推定するためのPCアルゴリズム(Spirtesら, 2000)を検討します。PCアルゴリズムは計算可能であり、多くの場合、多くのノード(変数)を持つ非常に高速なforsparse問題であり、真の基礎となるDAGのスパースの関数として高い計算効率を自動的に達成する魅力的な特性を持っています。非常に高次元でスパースなDAGのアルゴリズムの均一性を証明し、ノードの数がサンプルサイズnとともに急速に増加し、任意の0<<∞に対してO(na)と同じくらい速く成長することが許されます。スパース性の仮定は、DAG内の近傍がサンプルサイズnよりも低い順序であることのみを必要とするため、かなり最小限です。また、シミュレーションデータのPCアルゴリズムも示します。

Consistent Feature Selection for Pattern Recognition in Polynomial Time
多項式時間におけるパターン認識のための一貫した特徴選択

We analyze two different feature selection problems: finding a minimalfeature set optimal for classification (MINIMAL-OPTIMAL)vs. finding all features relevant to the target variable(ALL-RELEVANT). The latter problem is motivated by recentapplications within bioinformatics, particularly gene expressionanalysis. For both problems, we identify classes of datadistributions for which there exist consistent, polynomial-timealgorithms. We also prove that ALL-RELEVANT is much harderthan MINIMAL-OPTIMAL and propose two consistent,polynomial-time algorithms. We argue that the distribution classesconsidered are reasonable in many practical cases, so that our resultssimplify feature selection in a wide range of machine learning tasks.

私たちは、分類に最適な最小特徴セットを見つける(MINIMAL-OPTIMAL)か、ターゲット変数に関連するすべての特徴を見つける(ALL-RELEVANT)という2つの異なる特徴選択問題を分析します。後者の問題は、バイオインフォマティクス、特に遺伝子発現解析における最近のアプリケーションによって引き起こされています。どちらの問題についても、一貫性のある多項式時間アルゴリズムが存在するデータ分布のクラスを特定します。また、ALL-RELEVANTがMINIMAL-OPTIMALよりもはるかに難しいことを証明し、2つの一貫した多項式時間アルゴリズムを提案します。私たちは、考慮された分布クラスは多くの実用的なケースで合理的であり、その結果は幅広い機械学習タスクでの特徴選択を簡素化すると主張します。

Learning Horn Expressions with LOGAN-H
LOGAN-Hで角の表現を学ぶ

The paper introduces LOGAN-H a system for learning first-orderfunction-free Horn expressions from interpretations. The system isbased on an algorithm that learns by asking questions and that wasproved correct in previous work. The current paper shows how thealgorithm can be implemented in a practical system, and introduces anew algorithm based on it that avoids interaction and learns fromexamples only. The LOGAN-H system implements these algorithms andadds several facilities and optimizations that allow efficientapplications in a wide range of problems. As one of the importantingredients, the system includes several fast procedures forsolving the subsumption problem, an NP-complete problem that needsto be solved many times during the learning process. We describequalitative and quantitative experiments in several domains. Theexperiments demonstrate that the system can deal with variedproblems, large amounts of data, and that it achieves goodclassification accuracy.

この論文では、一階関数フリーのHorn表現を解釈から学習するシステムであるLOGAN-Hについて紹介します。このシステムは、質問をすることで学習するアルゴリズムに基づいており、それは以前の研究で正しいことが証明されています。この論文では、アルゴリズムを実際のシステムに実装する方法を示し、それに基づいて相互作用を回避し、例のみから学習する新しいアルゴリズムを紹介します。LOGAN-Hシステムは、これらのアルゴリズムを実装し、さまざまな問題で効率的に適用できるようにするいくつかの機能と最適化を追加します。重要な要素の1つとして、このシステムには、学習プロセス中に何度も解決する必要があるNP完全な問題である包摂問題を解決するためのいくつかの迅速な手順が含まれています。いくつかのドメインにおける定性的および定量的実験について説明します。実験は、システムがさまざまな問題、大量のデータに対処でき、優れた分類精度を達成していることを示しています。

A Stochastic Algorithm for Feature Selection in Pattern Recognition
パターン認識における特徴選択のための確率的アルゴリズム

We introduce a new model addressing feature selection from a large dictionaryof variables that can be computed from a signal or an image. Features areextracted according to an efficiency criterion, on the basis of specifiedclassification or recognition tasks. This is done by estimating a probabilitydistribution P on the complete dictionary, which distributes its massover the more efficient, or informative, components. We implement a stochasticgradient descent algorithm, using the probability as a state variable andoptimizing a multi-task goodness of fit criterion for classifiers based onvariable randomly chosen according to P. We then generate classifiersfrom the optimal distribution of weights learned on the training set. The method is first tested onseveral pattern recognition problems including face detection, handwrittendigit recognition, spam classification and micro-array analysis. We then compareour approach with other step-wise algorithms like random forests or recursive featureelimination.

私たちは、信号または画像から計算できる変数の大規模な辞書から特徴を選択する新しいモデルを紹介します。特徴は、指定された分類または認識タスクに基づいて、効率基準に従って抽出されます。これは、完全な辞書の確率分布Pを推定することによって行われます。これは、より効率的または有益なコンポーネントに質量を分配します。確率を状態変数として使用し、Pに従ってランダムに選択された変数に基づいて分類器のマルチタスク適合度基準を最適化する、確率的勾配降下アルゴリズムを実装します。次に、トレーニングセットで学習した重みの最適分布から分類器を生成します。この方法は、顔検出、手書き数字認識、スパム分類、マイクロアレイ分析などのいくつかのパターン認識問題で最初にテストされます。次に、ランダムフォレストや再帰的特徴除去などの他の段階的アルゴリズムとこのアプローチを比較します。

Integrating Naive Bayes and FOIL
NaÃ ̄ve Bayes と FOIL の統合

A novel relational learning approach that tightly integrates thenaïve Bayes learning scheme with the inductive logic programmingrule-learner FOIL is presented. In contrast to previouscombinations that have employed naïve Bayes only forpost-processing the rule sets, the presented approach employs thenaïve Bayes criterion to guide its search directly. The proposedtechnique is implemented in the NFOIL and TFOILsystems, which employ standard naïve Bayes and tree augmentednaïve Bayes models respectively. We show that these integratedapproaches to probabilistic model and rule learning outp erformpost-processing approaches. They also yield significantly moreaccurate models than si mple rule learning and are competitive withmore sophisticated ILP systems.

ベイブ学習スキームと帰納論理プログラミングのルール学習器FOILを緊密に統合した新しい関係学習アプローチが提示されます。ルールセットの事後処理にのみナイーブベイズを採用していた以前の組み合わせとは対照的に、提示されたアプローチは、検索を直接ガイドするためにナイーブベイズ基準を採用しています。提案された技術は、NFOILシステムとTFOILシステムに実装されており、それぞれ標準の単純ベイズモデルとツリー拡張ナイーブベイズモデルを採用しています。確率モデルとルール学習へのこれらの統合的なアプローチが、後処理アプローチから優れていることを示します。また、simpleルール学習よりも大幅に正確なモデルを生成し、より洗練されたILPシステムと競争力があります。

Value Regularization and Fenchel Duality
価値正則化とフェンチェル双対性

Regularization is an approach to function learning that balances fitand smoothness. In practice, we search for a function f with afinite representation f = Σi ciφi(·). In most treatments, the ciare the primary objects of study. We consider valueregularization, constructing optimization problems in which thepredicted values at the training points are the primary variables, andtherefore the central objects of study. Although this is a simplechange, it has profound consequences. From convex conjugacy and thetheory of Fenchel duality, we derive separate optimality conditionsfor the regularization and loss portions of the learning problem; thistechnique yields clean and short derivations of standard algorithms.This framework is ideally suited to studying many other phenomena atthe intersection of learning theory and optimization. We obtain avalue-based variant of the representer theorem, which underscores thetransductive nature of regularization in reproducing kernel Hilbertspaces. We unify and extend previous results on learning kernelfunctions, with very simple proofs. We analyze the use ofunregularized bias terms in optimization problems, and low-rankapproximations to kernel matrices, obtaining new results in theseareas. In summary, the combination of value regularization andFenchel duality are valuable tools for studying the optimizationproblems in machine learning.

正則化は、適合性と滑らかさのバランスをとる関数学習のアプローチです。実際には、有限表現f =Σi ciφi(·)を持つ関数fを検索します。ほとんどの処理では、ciが主な調査対象です。値の正則化を検討し、トレーニングポイントでの予測値が主な変数、つまり調査の中心となる最適化問題を構築します。これは単純な変更ですが、大きな影響があります。凸共役性とフェンチェル双対性理論から、学習問題の正則化部分と損失部分の最適条件を別々に導出します。この手法により、標準アルゴリズムの簡潔で短い導出が得られます。このフレームワークは、学習理論と最適化の交差点にある他の多くの現象の研究に最適です。再現カーネルヒルベルト空間における正則化の変換特性を強調する、表現子定理の値ベースの変形を取得します。学習カーネル関数に関する以前の結果を非常に簡単な証明で統合および拡張します。最適化問題における正則化されていないバイアス項の使用とカーネルマトリックスの低ランク近似を分析し、これらの分野で新しい結果を取得します。要約すると、値正則化とフェンチェル双対性の組み合わせは、機械学習における最適化問題を研究するための貴重なツールです。

Boosted Classification Trees and Class Probability/Quantile Estimation
ブースティング分類ツリーとクラス確率/分位点推定

The standard by which binary classifiers are usually judged,misclassification error, assumes equal costs of misclassifying thetwo classes or, equivalently, classifying at the 1/2 quantile of theconditional class probability function P[y=1|x]. Boostedclassification trees are known to perform quite well for suchproblems. In this article we consider the use of standard,off-the-shelf boosting for two more generalproblems: 1) classification with unequal costs or, equivalently,classification at quantiles other than 1/2, and 2) estimation of theconditional class probability function P[y=1|x]. We first examinewhether the latter problem, estimation of P[y=1|x], can be solvedwith LogitBoost, and with AdaBoost when combined with a natural linkfunction. The answer is negative: both approaches are oftenineffective because they overfit P[y=1|x] even though they performwell as classifiers. A major negative point of the present articleis the disconnect between class probability estimation andclassification.Next we consider the practice of over/under-sampling of the twoclasses. We present an algorithm that uses AdaBoost inconjunction with Over/Under-Sampling and Jittering of the data “JOUS-Boost”. This algorithm issimple, yet successful, and it preserves the advantage of relativeprotection against overfitting, but for arbitrarymisclassification costs and, equivalently, arbitrary quantileboundaries. We then use collections of classifiers obtained froma grid of quantiles to form estimators of class probabilities. Theestimates of the class probabilities compare favorably to thoseobtained by a variety of methods across both simulated and realdata sets.

バイナリ分類器が通常判断される基準である誤分類エラーは、2つのクラスを誤分類するコストが等しい、または同等に、条件付きクラス確率関数P[y=1|x]の1/2分位で分類することを前提としています。ブースト分類ツリーは、このような問題に対して非常に優れたパフォーマンスを発揮することが知られています。この記事では、より一般的な2つの問題、1)不等コストでの分類、または同等に1/2以外の分位数での分類、および2)条件付きクラス確率関数P[y=1|x]の推定について、標準の既成のブースティングの使用を検討します。まず、後者の問題であるP[y=1|x]の推定がLogitBoostで、また自然リンク関数と組み合わせたAdaBoostで解決できるかどうかを調べます。答えは否定的です。どちらのアプローチも、分類器としてはうまく機能するにもかかわらず、P[y=1|x]に過剰適合するため、効果がないことがよくあります。この記事の主な否定的な点は、クラス確率の推定と分類の間に断絶があることです。次に、2つのクラスのオーバー/アンダーサンプリングの実践について検討します。データのオーバー/アンダーサンプリングとジッタリングと組み合わせてAdaBoostを使用するアルゴリズム「JOUS-Boost」を紹介します。このアルゴリズムは単純ですが、効果的で、過適合に対する相対的な保護の利点は維持されますが、任意の誤分類コストと、同様に任意の分位境界があります。次に、分位グリッドから取得した分類器のコレクションを使用して、クラス確率の推定値を作成します。クラス確率の推定値は、シミュレートされたデータセットと実際のデータセットの両方でさまざまな方法で取得した推定値と比較して優れています。

Learning Equivariant Functions with Matrix Valued Kernels
行列値カーネルによる等変関数の学習

This paper presents a new class of matrix valued kernels that areideally suited to learn vector valued equivariant functions. Matrixvalued kernels are a natural generalization of the common notion of akernel. We set the theoretical foundations of so called equivariantmatrix valued kernels. We work out several properties of equivariantkernels, we give an interpretation of their behavior and showrelations to scalar kernels. The notion of (ir)reducibility of grouprepresentations is transferred into the framework of matrix valuedkernels. At the end to two exemplary applications are demonstrated.We design a non-linear rotation and translation equivariant filter for2D-images and propose an invariant object detector based on thegeneralized Hough transform.

この論文では、ベクトル値等変関数の学習に最適な新しいクラスの行列値カーネルを紹介します。行列値カーネルは、カーネルの一般的な概念を自然に一般化したものです。いわゆる等変行列値カーネルの理論的基礎を設定します。等変カーネルのいくつかの特性を解明し、それらの振る舞いの解釈とスカラーカーネルへの関係を示します。群表現の(非)還元性の概念は、行列値カーネルの枠組みに移されます。最後に、2つの例示的なアプリケーションが実証されます。2Dイメージのための非線形回転および並進等変フィルタを設計し、一般化ハフ変換に基づく不変オブジェクト検出器を提案します。

Statistical Consistency of Kernel Canonical Correlation Analysis
カーネル正準相関分析の統計的一貫性

While kernel canonical correlation analysis (CCA) has been appliedin many contexts, the convergence of finite sample estimates of theassociated functions to their population counterparts has not yetbeen established. This paper gives a mathematical proof of thestatistical convergence of kernel CCA, providing a theoreticaljustification for the method. The proof uses covariance operatorsdefined on reproducing kernel Hilbert spaces, and analyzes theconvergence of their empirical estimates of finite rank to theirpopulation counterparts, which can have infinite rank. The resultalso gives a sufficient condition for convergence on theregularization coefficient involved in kernel CCA: this shoulddecrease as n-1/3, where n is the number of data.

カーネル正準相関分析(CCA)は多くの状況で適用されていますが、関連する関数の有限サンプル推定値とそれらの母集団の対応物との収束はまだ確立されていません。この論文では、カーネルCCAの統計的収束の数学的証明を提供し、この方法の理論的正当性を提供します。この証明は、カーネルヒルベルト空間の再現で定義された共分散演算子を使用し、有限ランクの経験的推定値と、無限ランクを持つ可能性のある人口対応物との収束を分析します。この結果は、カーネルCCAに関与する正則化係数に収束するための十分な条件も提供します:これはn-1/3 (nはデータ数)として減少するはずです。

Dynamics and Generalization Ability of LVQ Algorithms
LVQアルゴリズムのダイナミクスと一般化能力

Learning vector quantization (LVQ) schemes constitute intuitive,powerful classification heuristics with numerous successfulapplications but, so far, limited theoretical background. We studyLVQ rigorously within a simplifying model situation: two competingprototypes are trained from a sequence of examples drawn from amixture of Gaussians. Concepts from statistical physics and thetheory of on-line learning allow for an exact description of thetraining dynamics in high-dimensional feature space. The analysisyields typical learning curves, convergence properties, and achievablegeneralization abilities. This is also possible for heuristictraining schemes which do not relate to a cost function. We comparethe performance of several algorithms, including Kohonen’s LVQ1 andLVQ+/-, a limiting case of LVQ2.1. The former shows close to optimalperformance, while LVQ+/- displays divergent behavior. We investigatehow early stopping can overcome this difficulty. Furthermore, westudy a crisp version of robust soft LVQ, which was recently derivedfrom a statistical formulation. Surprisingly, it exhibits relativelypoor generalization. Performance improves if a window for theselection of data is introduced; the resulting algorithm correspondsto cost function based LVQ2. The dependence of these results on themodel parameters, for example, prior class probabilities, isinvestigated systematically, simulations confirm our analyticalfindings.

学習ベクトル量子化(LVQ)スキームは、直感的で強力な分類ヒューリスティックを構成し、多数の成功したアプリケーションがありますが、これまでのところ理論的背景は限られています。私たちは、単純化モデル状況内でLVQを厳密に研究します。2つの競合するプロトタイプが、ガウス分布の混合から抽出された一連の例からトレーニングされます。統計物理学とオンライン学習の理論の概念により、高次元の特徴空間でのトレーニングダイナミクスを正確に記述できます。分析により、一般的な学習曲線、収束特性、および達成可能な一般化能力が得られます。これは、コスト関数に関連しないヒューリスティックトレーニングスキームでも可能です。KohonenのLVQ1とLVQ+/- (LVQ2.1の極限ケース)を含むいくつかのアルゴリズムのパフォーマンスを比較します。前者は最適に近いパフォーマンスを示しますが、LVQ+/-は発散的な動作を示します。早期停止によってこの困難を克服する方法を調査します。さらに、最近統計的定式化から導き出された堅牢なソフトLVQの鮮明なバージョンを研究しています。驚いたことに、このバージョンは比較的一般化が貧弱です。データ選択のウィンドウを導入するとパフォーマンスが向上します。結果として得られるアルゴリズムは、コスト関数ベースのLVQ2に対応します。これらの結果のモデルパラメーター(たとえば、事前クラス確率)への依存性は体系的に調査され、シミュレーションによって分析結果が確認されています。

General Polynomial Time Decomposition Algorithms
一般的な多項式時間分解アルゴリズム

We present a general decomposition algorithm that is uniformlyapplicable to every (suitably normalized) instance of Convex Quadratic Optimization and efficiently approaches an optimal solution. The number of iterations required to be within εof optimality grows linearly with 1/ε and quadratically with the number m of variables. The working set selection can beperformed in polynomial time. If we restrict our considerationsto instances of Convex Quadratic Optimization with at most k0 equality constraints for some fixed constant k0 plus some so-called box-constraints (conditions that hold for mostvariants of SVM-optimization), the working set is found in linear time. Our analysis builds on a generalization of the concept of rate certifying pairs that was introduced by Hush and Scovel. In order to extend their results to arbitrary instances of Convex Quadratic Optimization, we introduce the general notion of a rate certifying q-set. We improve on the results by Hush and Scovel (2003) in several ways. First our result holds for Convex QuadraticOptimization whereas the results by Hush and Scovel are specialized toSVM-optimization. Second, we achieve a higher rate of convergenceeven for the special case of SVM-optimization (despite the generalityof our approach). Third, our analysis is technicallysimpler.We prove furthermore that the strategy for working set selection which is based on rate certifying sets coincides with a strategy which is based on a so-called “sparse witness of sub-optimality”. Viewed from this perspective, our main result improves on convergence results by List and Simon (2004) and Simon (2004) by providing convergence rates (and by holding under more general conditions).

私たちは、凸二次最適化のすべての（適切に正規化された）インスタンスに均一に適用可能で、最適解に効率的に近づく一般的な分解アルゴリズムを提示します。最適性の ε 以内になるために必要な反復回数は、1/ε とともに線形に、変数の数mとともに二次的に増加します。ワーキングセットの選択は多項式時間で実行できます。固定定数k0に対して最大k0の等式制約と、いわゆるボックス制約（SVM最適化のほとんどのバリアントに適用される条件）を伴う凸二次最適化のインスタンスに検討を限定すると、ワーキングセットは線形時間で見つかります。我々の分析は、HushとScovelによって導入されたレート証明ペアの概念の一般化に基づいています。彼らの結果を凸二次最適化の任意のインスタンスに拡張するために、レート証明qセットの一般的な概念を導入します。私たちは、HushとScovel (2003)の結果をいくつかの点で改善しました。まず、我々の結果は凸二次最適化に当てはまりますが、HushとScovelの結果はSVM最適化に特化しています。次に、我々はSVM最適化の特殊なケースでも(我々のアプローチの一般性にもかかわらず)より高い収束率を達成しました。第3に、我々の分析は技術的に単純です。さらに、私たちは、レート証明セットに基づくワーキングセット選択の戦略が、いわゆる「サブ最適化のスパース証拠」に基づく戦略と一致することを証明しました。この観点から見ると、我々の主な結果は、収束率を提供すること(およびより一般的な条件下での保持)により、ListとSimon (2004)およびSimon (2004)の収束結果を改善しています。

Comments on the “Core Vector Machines: Fast SVM Training on Very Large Data Sets”
「コアベクターマシン:非常に大きなデータセットでの高速SVMトレーニング」へのコメント

In a recently published paper in JMLR, Tsang et al. (2005) present analgorithm for SVM called Core Vector Machines (CVM) and illustrate itsperformances through comparisons with other SVM solvers. After readingthe CVM paper we were surprised by some of the reported results. Inorder to clarify the matter, we decided to reproduce some of theexperiments. It turns out that to some extent, our results contradictthose reported. Reasons of these different behaviors are giventhrough the analysis of the stopping criterion.

JMLRに最近発表された論文で、Tsangら(2005)は、コアベクターマシン(CVM)と呼ばれるSVMのアルゴリズムを提示し、他のSVMソルバーとの比較を通じてそのパフォーマンスを示しています。CVMの論文を読んだ後、報告された結果のいくつかに驚きました。この点を明確にするために、いくつかの実験を再現することにしました。その結果、私たちの結果は報告されたものとある程度矛盾していることがわかりました。これらの異なる動作の理由は、停止基準の分析を通じて与えられます。

Separating Models of Learning from Correlated and Uncorrelated Data
相関データと非相関データからの学習モデルの分離

We consider a natural framework of learning from correlated data, inwhich successive examples used for learning are generated according toa random walk over the space of possible examples. A recent paper byBshouty et al. (2003) shows that the class of polynomial-size DNFformulas is efficiently learnable in this random walk model; thisresult suggests that the Random Walk model is more powerful thancomparable standard models of learning from independent examples, inwhich similarly efficient DNF learning algorithms are not known. Wegive strong evidence that the Random Walk model is indeed morepowerful than the standard model, by showing that if any cryptographicone-way function exists (a universally held belief in cryptography),then there is a class of functions that can be learned efficiently inthe Random Walk setting but not in the standard setting where allexamples are independent.

私たちは、相関データから学習する自然なフレームワークを考え、学習に使用される連続した例が、可能な例の空間をランダムに歩くことによって生成されます。Bshoutyら(2003)による最近の論文は、多項式サイズのDNFformulaのクラスがこのランダムウォークモデルで効率的に学習できることを示しています。この結果は、ランダムウォークモデルが、同様に効率的なDNF学習アルゴリズムが知られていない独立した例からの学習の同等の標準モデルよりも強力であることを示唆しています。私たちは、もし暗号一方向関数が存在するならば(暗号学において普遍的に信じられている)、ランダムウォーク設定では効率的に学習できる関数のクラスが存在するが、すべての例が独立している標準設定ではそうではないことを示すことによって、ランダムウォークモデルが実際に標準モデルよりも強力であるという強力な証拠を示す。

Learnability of Gaussians with Flexible Variances
柔軟な分散を持つガウス分布の学習可能性

Gaussian kernels with flexible variances provide arich family of Mercer kernels for learning algorithms. We show thatthe union of the unit balls of reproducing kernel Hilbert spacesgenerated by Gaussian kernels with flexible variances is a uniformGlivenko-Cantelli (uGC) class. This result confirms a conjectureconcerning learnability of Gaussian kernels and verifies the uniformconvergence of many learning algorithms involving Gaussians withchanging variances. Rademacher averages and empirical coveringnumbers are used to estimate sample errors of multi-kernelregularization schemes associated with general loss functions. It isthen shown that the regularization error associated with the leastsquare loss and the Gaussian kernels can be greatly improved whenflexible variances are allowed. Finally, for regularization schemesgenerated by Gaussian kernels with flexible variances we presentexplicit learning rates for regression with least square loss andclassification with hinge loss.

柔軟な分散を持つガウスカーネルは、アルゴリズムを学習するための豊富なマーサーカーネルファミリーを提供します。柔軟な分散を持つガウスカーネルによって生成された再現カーネルヒルベルト空間の単位ボールの和集合が一様グリベンコ・カンテッリ(uGC)クラスであることを示します。この結果は、ガウスカーネルの学習可能性に関する推測を裏付けるものであり、ガウスが関与する多くの学習アルゴリズムが分散を変化させて一様に収束することを検証しています。Rademacher平均と経験的被覆数を使用して、一般的な損失関数に関連付けられたマルチカーネル正則化スキームのサンプル誤差を推定します。次に、最小二乗損失とガウスカーネルに関連する正則化誤差は、柔軟な分散が許容される場合に大幅に改善できることが示されています。最後に、柔軟な分散を持つガウスカーネルによって生成された正則化スキームについて、最小二乗損失による回帰とヒンジ損失による分類の明示的な学習率を示します。

Noise Tolerant Variants of the Perceptron Algorithm
パーセプトロンアルゴリズムのノイズ耐性バリアント

A large number of variants of the Perceptron algorithm have beenproposed and partially evaluated in recent work. One type ofalgorithm aims for noise tolerance by replacing the last hypothesis ofthe perceptron with another hypothesis or a vote among hypotheses.Another type simply adds a margin term to the perceptron in order toincrease robustness and accuracy, as done in support vector machines.A third type borrows further from support vector machines andconstrains the update function of the perceptron in ways that mimicsoft-margin techniques. The performance of these algorithms, and thepotential for combining different techniques, has not been studied indepth. This paper provides such an experimental study and revealssome interesting facts about the algorithms. In particular theperceptron with margin is an effective method for tolerating noise andstabilizing the algorithm. This is surprising since the margin initself is not designed or used for noise tolerance, and there are noknown guarantees for such performance. In most cases, similarperformance is obtained by the voted-perceptron which has theadvantage that it does not require parameter selection. Techniquesusing soft margin ideas are run-time intensive and do not giveadditional performance benefits. The results also highlight thedifficulty with automatic parameter selection which is required withsome of these variants.

パーセプトロンアルゴリズムのさまざまなバリエーションが最近の研究で提案され、部分的に評価されています。1つのタイプのアルゴリズムは、パーセプトロンの最後の仮説を別の仮説または仮説間の投票に置き換えることで、ノイズ耐性を目指します。別のタイプは、サポートベクターマシンで行われるように、堅牢性と精度を高めるために、パーセプトロンにマージン項を追加するだけです。3番目のタイプは、サポートベクターマシンからさらに借用し、ソフトマージン手法を模倣する方法でパーセプトロンの更新機能を制限します。これらのアルゴリズムのパフォーマンス、およびさまざまな手法を組み合わせる可能性については、詳細に研究されていません。この論文では、そのような実験的研究を提供し、アルゴリズムに関する興味深い事実を明らかにします。特に、マージン付きパーセプトロンは、ノイズを許容し、アルゴリズムを安定化するための効果的な方法です。マージン自体はノイズ耐性のために設計または使用されていないため、これは驚くべきことであり、そのようなパフォーマンスを保証する既知の方法はありません。ほとんどの場合、同様のパフォーマンスは、パラメータ選択を必要としないという利点を持つ投票パーセプトロンによって得られます。ソフトマージンアイデアを使用する手法は実行時間が集中し、追加のパフォーマンス上の利点は得られません。結果は、これらのバリアントの一部で必要な自動パラメータ選択の難しさも明らかにしています。

A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
分離した球面ガウス分布の混合に対するEMの確率論的分析

We show that, given data from a mixture of k well-separatedspherical Gaussians in ℜd, a simple two-round variant ofEM will, with high probability, learn the parameters of the Gaussiansto near-optimal precision, if the dimension is high (d >> lnk). We relate this to previous theoretical and empirical work on theEM algorithm.

私たちは、Rdのk個の十分に分離された球面ガウス分布の混合物からのデータが与えられた場合、EMの単純な2ラウンドバリアントは、高い確率で、次元が高い(d >> lnk)場合、ガウスシャントのパラメータをほぼ最適の精度で学習することを示します。これを、EMアルゴリズムに関する以前の理論的および実証的研究に関連付けます。

Building Blocks for Variational Bayesian Learning of Latent Variable Models
潜在変数モデルの変分ベイジアン学習のためのビルディングブロック

We introduce standardised building blocks designed to be used withvariational Bayesian learning. The blocks include Gaussian variables,summation, multiplication, nonlinearity, and delay. A large varietyof latent variable models can be constructed from these blocks,including nonlinear and variance models, which are lacking from mostexisting variational systems. The introduced blocks are designed tofit together and to yield efficient update rules. Practicalimplementation of various models is easy thanks to an associatedsoftware package which derives the learning formulas automaticallyonce a specific model structure has been fixed. Variational Bayesianlearning provides a cost function which is used both for updating thevariables of the model and for optimising the model structure. Allthe computations can be carried out locally, resulting in linearcomputational complexity. We present experimental results on severalstructures, including a new hierarchical nonlinear model for variancesand means. The test results demonstrate the good performance andusefulness of the introduced method.

私たちは、変分ベイズ学習で使用するために設計された標準化されたビルディングブロックを紹介します。ブロックには、ガウス変数、合計、乗算、非線形性、および遅延が含まれます。これらのブロックから、非線形モデルや分散モデルなど、既存の変分システムのほとんどに欠けているさまざまな潜在変数モデルを構築できます。導入されたブロックは、互いに適合し、効率的な更新ルールを生成するように設計されています。特定のモデル構造が固定されると、学習式を自動的に導出する関連ソフトウェアパッケージにより、さまざまなモデルの実用的な実装が容易になります。変分ベイズ学習は、モデルの変数の更新とモデル構造の最適化の両方に使用されるコスト関数を提供します。すべての計算はローカルで実行できるため、計算の複雑さは線形になります。分散と平均の新しい階層非線形モデルを含む、いくつかの構造に関する実験結果を示します。テスト結果は、導入された方法の優れたパフォーマンスと有用性を示しています。

Distances between Data Sets Based on Summary Statistics
要約統計量に基づくデータセット間の距離

The concepts of similarity and distance are crucial in data mining. Weconsider the problem of defining the distance between two data sets bycomparing summary statistics computed from the data sets. The initialdefinition of our distance is based on geometrical notions of certainsets of distributions. We show that this distance can be computed incubic time and that it has several intuitive properties. We also showthat this distance is the unique Mahalanobis distance satisfyingcertain assumptions. We also demonstrate that if we are dealing withbinary data sets, then the distance can be represented naturally bycertain parity functions, and that it can be evaluated in lineartime. Our empirical tests with real world data show that the distanceworks well.

類似性と距離の概念は、データマイニングにおいて非常に重要です。データセットから計算された要約統計量を比較することにより、2つのデータセット間の距離を定義する問題を考えます。私たちの距離の初期定義は、分布の特定のセットの幾何学的概念に基づいています。この距離はインキュビック時間で計算でき、いくつかの直感的な特性を持つことを示します。また、この距離が特定の仮定を満たす一意のマハラノビス距離であることも示します。また、バイナリデータセットを扱っている場合、距離は特定のパリティ関数で自然に表現でき、lineartimeで評価できることも示します。実世界のデータを用いた実証的テストでは、距離がうまく機能することが示されています。

Minimax Regret Classifier for Imprecise Class Distributions
不正確なクラス分布に対するミニマックス Regret分類器

The design of a minimum risk classifier based on data usuallystems from the stationarity assumption that the conditions duringtraining and test are the same: the misclassification costsassumed during training must be in agreement with real costs, andthe same statistical process must have generated both training andtest data. Unfortunately, in real world applications, theseassumptions may not hold. This paper deals with the problem oftraining a classifier when prior probabilities cannot be reliablyinduced from training data. Some strategies based on optimizingthe worst possible case (conventional minimax) have been proposedpreviously in the literature, but they may achieve a robustclassification at the expense of a severe performance degradation.In this paper we propose a minimax regret(minimax deviation) approach, that seeks to minimize themaximum deviation from the performance of the optimal riskclassifier. A neural-based minimax regret classifier forgeneral multi-class decision problems is presented.Experimental results show its robustness and the advantages inrelation to other approaches.

データに基づく最小リスク分類器の設計は、通常、トレーニングとテスト中の条件が同じであるという定常性仮定から生じます。つまり、トレーニング中に想定される誤分類コストは実際のコストと一致している必要があり、同じ統計プロセスによってトレーニングデータとテストデータの両方が生成されている必要があります。残念ながら、実際のアプリケーションでは、これらの仮定が当てはまらない場合があります。この論文では、事前確率をトレーニングデータから確実に誘導できない場合の分類器のトレーニングの問題を取り上げます。最悪のケースを最適化することに基づくいくつかの戦略(従来のミニマックス)が以前に文献で提案されていますが、これらはパフォーマンスの大幅な低下を犠牲にして堅牢な分類を実現する可能性があります。この論文では、最適なリスク分類器のパフォーマンスからの最大偏差を最小化することを目指すミニマックスリグレット(ミニマックス偏差)アプローチを提案します。一般的なマルチクラス決定問題用のニューラルベースのミニマックスリグレット分類器を紹介します。実験結果から、その堅牢性と他のアプローチに対する利点がわかります。

A Unified Continuous Optimization Framework for Center-Based Clustering Methods
センターベースクラスタリング手法のための統一連続最適化フレームワーク

Center-based partitioning clustering algorithms rely on minimizingan appropriately formulated objective function, and differentformulations suggest different possible algorithms. In this paper,we start with the standard nonconvex and nonsmooth formulation ofthe partitioning clustering problem. We demonstrate that withinthis elementary formulation, convex analysis tools andoptimization theory provide a unifying language and framework todesign, analyze and extend hard and soft center-based clusteringalgorithms, through a generic algorithm which retains thecomputational simplicity of the popular k-means scheme. We showthat several well known and more recent center-based clusteringalgorithms, which have been derived either heuristically, or/andhave emerged from intuitive analogies in physics, statisticaltechniques and information theoretic perspectives can be recoveredas special cases of the proposed analysis and we streamline theirrelationships.

中心ベースの分割クラスタリングアルゴリズムは、適切に定式化された目的関数の最小化に依存しており、定式化が異なると、可能なアルゴリズムが異なることが示唆されます。この論文では、分割クラスタリング問題の標準的な非凸型および非平滑型の定式化から始めます。この基本的な定式化内で、凸解析ツールと最適化理論は、一般的なk-meansスキームの計算の単純さを保持する汎用アルゴリズムを通じて、ハードおよびソフトの中心ベースのクラスタリングアルゴリズムを設計、分析、拡張するための統一言語とフレームワークを提供することを示します。私たちは、ヒューリスティックに導出された、または物理学、統計技術、情報理論の視点における直感的な類推から出現した、いくつかのよく知られた、より最近の中心ベースのクラスタリングアルゴリズムが、提案された分析の特別なケースとして回復でき、それらの関係を合理化することを示しています。

Multi-Task Learning for Classification with Dirichlet Process Priors
ディリクレプロセス事前確率による分類のためのマルチタスク学習

Consider the problem of learning logistic-regression models formultiple classification tasks, where the training data set for eachtask is not drawn from the same statistical distribution. In such amulti-task learning (MTL) scenario, it is necessary to identifygroups of similar tasks that should be learned jointly. Relying on aDirichlet process (DP) based statistical model to learn the extentof similarity between classification tasks, we developcomputationally efficient algorithms for two different forms of theMTL problem. First, we consider a symmetric multi-tasklearning (SMTL) situation in which classifiers for multiple tasksare learned jointly using a variational Bayesian (VB) algorithm.Second, we consider an asymmetric multi-task learning (AMTL)formulation in which the posterior density function from the SMTLmodel parameters (from previous tasks) is used as a prior for a newtask: this approach has the significant advantage of not requiringstorage and use of all previous data from prior tasks. The AMTLformulation is solved with a simple Markov Chain Monte Carlo (MCMC)construction. Experimental results on two real life MTL problemsindicate that the proposed algorithms: (a) automatically identifysubgroups of related tasks whose training data appear to be drawnfrom similar distributions; and (b) are more accurate than simplerapproaches such as single-task learning, pooling of data across alltasks, and simplified approximations to DP.

各タスクのトレーニングデータセットが同じ統計分布から取得されない、複数の分類タスクのロジスティック回帰モデルを学習する問題について考えてみましょう。このようなマルチタスク学習(MTL)シナリオでは、共同で学習する必要がある類似タスクのグループを識別する必要があります。分類タスク間の類似性の程度を学習するためにディリクレ過程(DP)ベースの統計モデルに依存して、MTL問題の2つの異なる形式に対して計算効率の高いアルゴリズムを開発します。まず、複数のタスクの分類子を変分ベイズ(VB)アルゴリズムを使用して共同で学習する対称マルチタスク学習(SMTL)の状況を検討します。次に、SMTLモデルパラメータ(以前のタスクから)の事後密度関数を新しいタスクの事前として使用する非対称マルチタスク学習(AMTL)の定式化を検討します。このアプローチには、以前のタスクのすべての以前のデータを保存して使用する必要がないという大きな利点があります。AMTL定式化は、単純なマルコフ連鎖モンテカルロ(MCMC)構成で解決されます。2つの実際のMTL問題に関する実験結果から、提案されたアルゴリズムは、(a)トレーニングデータが同様の分布から抽出されたと思われる関連タスクのサブグループを自動的に識別し、(b)単一タスク学習、すべてのタスクにわたるデータのプーリング、DPへの簡略化された近似などの単純なアプローチよりも正確であることがわかります。

Nonlinear Boosting Projections for Ensemble Construction
アンサンブル構築のための非線形ブースティング射影

In this paper we propose a novel approach for ensemble constructionbased on the use of nonlinear projections to achieve both accuracy anddiversity of individual classifiers. The proposed approach combinesthe philosophy of boosting, putting more effort on difficultinstances, with the basis of the random subspace method. Our maincontribution is that instead of using a random subspace, we constructa projection taking into account the instances which have posed mostdifficulties to previous classifiers. In this way, consecutivenonlinear projections are created by a neural network trained usingonly incorrectly classified instances. The feature subspace induced bythe hidden layer of this network is used as the input space to a newclassifier. The method is compared with bagging and boostingtechniques, showing an improved performance on a large set of 44problems from the UCI Machine Learning Repository. An additional studyshowed that the proposed approach is less sensitive to noise in thedata than boosting methods.

この論文では、非線形投影を使用して個々の分類器の精度と多様性の両方を実現する、アンサンブル構築の新しいアプローチを提案します。提案されたアプローチは、ブースティングの哲学、つまり困難なインスタンスにさらに力を入れることと、ランダムサブスペース法の基礎を組み合わせたものです。当社の主な貢献は、ランダムサブスペースを使用する代わりに、以前の分類器に最も困難をもたらしたインスタンスを考慮して投影を構築することです。このようにして、誤って分類されたインスタンスのみを使用してトレーニングされたニューラルネットワークによって、連続した非線形投影が作成されます。このネットワークの隠れ層によって誘導される特徴サブスペースは、新しい分類器への入力スペースとして使用されます。この方法は、バギングおよびブースティング手法と比較され、UCI機械学習リポジトリからの44の問題の大規模なセットでパフォーマンスが向上していることが示されています。追加の研究では、提案されたアプローチはブースティング手法よりもデータ内のノイズの影響を受けにくいことが示されました。

Journal of Machine Learning Research Papers: Volume 8の論文一覧

こちらもおすすめ

Journal of Machine Learning Research Papers: Volume 9の論文一覧

Journal of Machine Learning Research Papers: Volume 20の論文一覧

Journal of Machine Learning Research Papers: Volume 18の論文一覧