Journal of Machine Learning Research Papers: Volume 7の論文一覧

Journal of Machine Learning Research Papers Volume 7に記載されている内容を一覧にまとめ、機械翻訳を交えて日本語化し掲載します。
On Inferring Application Protocol Behaviors in Encrypted Network Traffic
暗号化ネットワークトラフィックにおけるアプリケーションプロトコルの動作の推論について

Several fundamental security mechanisms for restricting access tonetwork resources rely on the ability of a reference monitor toinspect the contents of traffic as it traverses the network. However,with the increasing popularity of cryptographic protocols, thetraditional means of inspecting packet contents to enforce securitypolicies is no longer a viable approach as message contents areconcealed by encryption. In this paper, we investigate the extent towhich common application protocols can be identified using only thefeatures that remain intact after encryption—namely packet size,timing, and direction. We first present what we believe to be thefirst exploratory look at protocol identification in encrypted tunnelswhich carry traffic from many TCP connections simultaneously, usingonly post-encryption observable features. We then explore the problemof protocol identification in individual encrypted TCP connections,using much less data than in other recent approaches. The results ofour evaluation show that our classifiers achieve accuracy greater than90% for several protocols in aggregate traffic, and, for mostprotocols, greater than 80% when making fine-grained classificationson single connections. Moreover, perhaps most surprisingly, we showthat one can even estimate the number of live connections in certainclasses of encrypted tunnels to within, on average, better than 20%.

ネットワークリソースへのアクセスを制限するための基本的なセキュリティメカニズムのいくつかは、ネットワークを通過するトラフィックの内容を検査するリファレンスモニターの機能に依存しています。ただし、暗号化プロトコルの人気が高まるにつれて、メッセージの内容が暗号化によって隠されているため、セキュリティポリシーを適用するためにパケットの内容を検査する従来の方法は、もはや実行可能なアプローチではありません。この論文では、暗号化後もそのまま残る特徴(パケットのサイズ、タイミング、方向)のみを使用して、一般的なアプリケーションプロトコルをどの程度識別できるかを調べます。まず、暗号化後に観察可能な特徴のみを使用して、多数のTCP接続からのトラフィックを同時に伝送する暗号化トンネルでのプロトコル識別について、初めてと思われる探索的な調査を紹介します。次に、最近の他のアプローチよりもはるかに少ないデータを使用して、個々の暗号化TCP接続でのプロトコル識別の問題を検討します。評価の結果、当社の分類器は、集約トラフィックにおけるいくつかのプロトコルで90%を超える精度を達成し、ほとんどのプロトコルでは、単一接続で細分化された分類を行う際に80%を超える精度を達成していることがわかりました。さらに、おそらく最も驚くべきことに、暗号化されたトンネルの特定のクラスでライブ接続の数を平均で20%以内で推定できることも示しています。

Learning to Detect and Classify Malicious Executables in the Wild
悪意のある実行可能ファイルを実際に検出して分類する方法を学ぶ

We describe the use of machine learning and data mining todetect and classify malicious executables as they appear in the wild.We gathered 1,971 benign and 1,651 malicious executables and encodedeach as a training example using n-grams of byte codes as features.Such processing resulted in more than 255 million distinct n-grams.After selecting the most relevant n-grams for prediction,we evaluated a variety of inductive methods, including naive Bayes,decision trees, support vector machines, and boosting.Ultimately, boosted decision trees outperformed other methodswith an area under the ROC curve of 0.996.Results suggest that our methodology will scale to larger collectionsof executables.We also evaluated how well the methods classified executables basedon the function of their payload, such as opening a backdoorand mass-mailing.Areas under the ROC curve for detecting payload functionwere in the neighborhood of 0.9, which were smaller than those forthe detection task.However, we attribute this drop in performance to fewer trainingexamples and to the challenge of obtaining properly labeled examples,rather than to a failing of the methodology or to some inherent difficultyof the classification task.Finally, we applied detectors to 291 malicious executablesdiscovered after we gathered our original collection,and boosted decision trees achieved a true-positive rate of 0.98 fora desired false-positive rate of 0.05.This result is particularly important, for it suggests that ourmethodology could be used as the basis for an operational systemfor detecting previously undiscovered malicious executables.

私たちは、悪意のある実行ファイルが世に出現したときに、機械学習とデータマイニングを使用してそれを検出し分類する方法について説明します。1,971個の良性実行ファイルと1,651個の悪意のある実行ファイルを収集し、バイトコードのnグラムを特徴として使用してそれぞれをトレーニング例としてエンコードしました。このような処理の結果、2億5,500万個を超える異なるnグラムが生成されました。予測に最も関連性の高いnグラムを選択した後、ナイーブベイズ、決定木、サポートベクターマシン、ブースティングなどのさまざまな帰納的手法を評価しました。最終的に、ブーストされた決定木は、ROC曲線の下の領域が0.996で、他の手法よりも優れていました。結果は、私たちの方法論が実行ファイルのより大きなコレクションに拡張できることを示唆しています。また、バックドアの開設や大量メール送信など、ペイロードの機能に基づいて実行ファイルを分類する方法も評価しました。ペイロード機能の検出のROC曲線の下の領域は0.9付近で、ペイロード機能の検出の領域よりも小さかったです。タスク。ただし、このパフォーマンスの低下は、方法論の失敗や分類タスク固有の難しさではなく、トレーニング例の減少と適切にラベル付けされた例の取得の難しさによるものと考えています。最後に、元のコレクションを収集した後に検出された291個の悪意のある実行ファイルに検出器を適用し、ブーストされた決定木は、望ましい偽陽性率0.05に対して真陽性率0.98を達成しました。この結果は特に重要で、これは、私たちの方法論が、これまで発見されていなかった悪意のある実行ファイルを検出するための運用システムの基礎として使用できる可能性があることを示唆しています。

Spam Filtering Based On The Analysis Of Text Information Embedded Into Images
画像に埋め込まれたテキスト情報の分析に基づくスパムフィルタリング

In recent years anti-spam filters have become necessary tools forInternet service providers to face up to the continuously growing spamphenomenon. Current server-side anti-spam filters are made up ofseveral modules aimed at detecting different features of spam e-mails.In particular, text categorisation techniques have been investigatedby researchers for the design of modules for the analysis of thesemantic content of e-mails, due to their potentially highergeneralisation capability with respect to manually derivedclassification rules used in current server-side filters. However,very recently spammers introduced a new trick consisting of embeddingthe spam message into attached images, which can make all currenttechniques based on the analysis of digital text in the subject andbody fields of e-mails ineffective.In this paper we propose anapproach to anti-spam filtering which exploits the text informationembedded into images sent as attachments. Our approach is based onthe application of state-of-the-art text categorisation techniques tothe analysis of text extracted by OCR tools from images attached toe-mails. The effectiveness of the proposed approach is experimentallyevaluated on two large corpora of spam e-mails.

近年、スパム対策フィルターは、インターネットサービスプロバイダーが継続的に増加するスパム現象に対処するために不可欠なツールとなっています。現在のサーバー側のスパム対策フィルターは、スパムメールのさまざまな特徴を検出することを目的とした複数のモジュールで構成されています。特に、テキスト分類技術は、現在のサーバー側フィルターで使用されている手動で導き出された分類ルールよりも一般化能力が高い可能性があるため、メールのセマンティックコンテンツを分析するためのモジュールの設計のために研究者によって研究されてきました。しかし、ごく最近、スパマーは、スパムメッセージを添付画像に埋め込むという新しいトリックを導入しました。これにより、メールの件名と本文フィールドのデジタルテキストの分析に基づく現在のすべての技術が無効になる可能性があります。この論文では、添付ファイルとして送信された画像に埋め込まれたテキスト情報を利用するスパム対策フィルタリングのアプローチを提案します。私たちのアプローチは、電子メールに添付された画像からOCRツールによって抽出されたテキストの分析に最先端のテキスト分類技術を適用することに基づいています。提案されたアプローチの有効性は、スパム電子メールの2つの大規模なコーパスで実験的に評価されています。

Spam Filtering Using Statistical Data Compression Models
統計データ圧縮モデルを使用したスパムフィルタリング

Spam filtering poses a special problem in text categorization, ofwhich the defining characteristic is that filters face an activeadversary, which constantly attempts to evade filtering. Since spamevolves continuously and most practical applications are based ononline user feedback, the task calls for fast, incremental and robustlearning algorithms. In this paper, we investigate a novel approach tospam filtering based on adaptive statistical data compressionmodels. The nature of these models allows them to be employed asprobabilistic text classifiers based on character-level or binarysequences. By modeling messages as sequences, tokenization and othererror-prone preprocessing steps are omitted altogether, resulting in amethod that is very robust. The models are also fast to construct andincrementally updateable. We evaluate the filtering performance of twodifferent compression algorithms; dynamic Markov compression andprediction by partial matching. The results of our empiricalevaluation indicate that compression models outperform currentlyestablished spam filters, as well as a number of methods proposed inprevious studies.

スパムフィルタリングは、テキスト分類において特別な問題を引き起こします。その特徴は、フィルタが常にフィルタリングを回避しようとするアクティブな敵に直面することです。スパムは継続的に進化し、実用的なアプリケーションのほとんどはオンラインユーザーフィードバックに基づいているため、タスクには高速で増分的かつ堅牢な学習アルゴリズムが必要です。この論文では、適応型統計データ圧縮モデルに基づくスパムフィルタリングの新しいアプローチを調査します。これらのモデルの性質により、文字レベルまたはバイナリシーケンスに基づく確率的テキスト分類子として使用できます。メッセージをシーケンスとしてモデル化することにより、トークン化やその他のエラーが発生しやすい前処理手順が完全に省略され、非常に堅牢な方法になります。また、モデルは構築が速く、増分的に更新できます。動的マルコフ圧縮と部分一致による予測という2つの異なる圧縮アルゴリズムのフィルタリングパフォーマンスを評価します。我々の実証的評価の結果は、圧縮モデルが、現在確立されているスパムフィルターや、以前の研究で提案されたいくつかの方法よりも優れていることを示しています。

Machine Learning for Computer Security
コンピュータセキュリティのための機械学習

The prevalent use of computers and internet has enhanced the qualityof life for many people, but it has also attracted undesired attemptsto undermine these systems. This special topic contains severalresearch studies on how machine learning algorithms can help improvethe security of computer systems.

コンピュータやインターネットの普及は、多くの人々の生活の質を向上させましたが、これらのシステムを弱体化させようとする望ましくない試みも引き寄せています。この特別なトピックには、機械学習アルゴリズムがコンピューターシステムのセキュリティを向上させるのにどのように役立つかについてのいくつかの研究が含まれています。

Universal Kernels
ユニバーサルカーネル

In this paper we investigate conditions on the features of acontinuous kernel so that it may approximate an arbitrary continuoustarget function uniformly on any compact subset of the input space.A number of concrete examples are given of kernels with thisuniversal approximating property.

この論文では、入力空間の任意のコンパクトなサブセット上で任意の連続ターゲット関数を一様に近似できるように、非連続カーネルの特徴に関する条件を調査します。この普遍的な近似特性を持つカーネルの具体例をいくつか示しています。

A Robust Procedure For Gaussian Graphical Model Search From Microarray Data With p Larger Than n
pがnより大きいマイクロアレイデータからのガウスグラフィカルモデル探索のためのロバストな手順

Learning of large-scale networks of interactions from microarraydata is an important and challenging problem in bioinformatics. Awidely used approach is to assume that the available data constitutea random sample from a multivariate distribution belonging to aGaussian graphical model. As a consequence, the prime objects ofinference are full-order partial correlations which arepartial correlations between two variables given the remaining ones.In the context of microarray data the number of variables exceed thesample size and this precludes the application of traditionalstructure learning procedures because a sampling version offull-order partial correlations does not exist. In this paper weconsider limited-order partial correlations, these arepartial correlations computed on marginal distributions ofmanageable size, and provide a set of rules that allow one to assessthe usefulness of these quantities to derive the independencestructure of the underlying Gaussian graphical model. Furthermore,we introduce a novel structure learning procedure based on aquantity, obtained from limited-order partial correlations, that wecall the non-rejection rate. The applicability and usefulness ofthe procedure are demonstrated by both simulated and real data.

マイクロアレイデータからの大規模な相互作用ネットワークの学習は、バイオインフォマティクスにおける重要かつ困難な問題です。広く使用されているアプローチは、利用可能なデータがガウスグラフィカルモデルに属する多変量分布からのランダムサンプルを構成すると想定することです。その結果、推論の主な対象は、残りの変数を前提とした2つの変数間の部分相関である完全次数偏相関です。マイクロアレイデータのコンテキストでは、変数の数がサンプルサイズを超えているため、完全次数偏相関のサンプリングバージョンが存在しないため、従来の構造学習手順を適用できません。この論文では、管理可能なサイズの周辺分布で計算される部分相関である限定次数偏相関について検討し、基礎となるガウスグラフィカルモデルの独立構造を導出するためのこれらの量の有用性を評価できる一連のルールを示します。さらに、限定次数偏相関から取得される非拒否率と呼ばれる量に基づく新しい構造学習手順を紹介します。この手順の適用性と有用性は、シミュレーションデータと実際のデータの両方で実証されています。

On Representing and Generating Kernels by Fuzzy Equivalence Relations
ファジィ等価関係によるカーネルの表現と生成について

Kernels are two-placed functions that can be interpreted as innerproducts in some Hilbert space. It is this property which makeskernels predestinated to carry linear models of learning,optimization or classification strategies over to non-linear variants.Following this idea, various kernel-based methods like support vectormachines or kernel principal component analysis have been conceivedwhich prove to be successful for machine learning, data mining andcomputer vision applications. When applying a kernel-based method acentral question is the choice and the design of the kernel function.This paper provides a novel view on kernels based on fuzzy-logicalconcepts which allows to incorporate prior knowledge in the designprocess. It is demonstrated that kernels mapping to the unit intervalwith constant one in its diagonal can be represented by a commonlyused fuzzy-logical formula for representing fuzzy rule bases. Thismeans that a great class of kernels can be represented byfuzzy-logical concepts. Apart from this result, which only guaranteesthe existence of such a representation, constructive examples arepresented and the relation to unlabeled learning is pointed out.

カーネルは、あるヒルベルト空間の内積として解釈できる2位関数です。この特性により、カーネルは学習、最適化、または分類戦略の線形モデルを非線形バリアントに持ち込むように運命づけられています。このアイデアに従って、サポートベクターマシンやカーネル主成分分析などのさまざまなカーネルベースの方法が考案され、機械学習、データマイニング、コンピュータービジョンアプリケーションで成功していることが証明されています。カーネルベースの方法を適用する場合、中心的な問題はカーネル関数の選択と設計です。この論文では、設計プロセスに事前の知識を組み込むことができるファジー論理概念に基づくカーネルに関する新しい見解を示します。対角線に定数1を持つ単位間隔にマッピングするカーネルは、ファジールールベースを表すために一般的に使用されるファジー論理式で表すことができることが実証されています。これは、多くのカーネルをファジー論理概念で表すことができることを意味します。このような表現の存在を保証するだけのこの結果とは別に、建設的な例が示され、ラベルなし学習との関係が指摘されています。

Linear State-Space Models for Blind Source Separation
ブラインドソース分離のための線形状態空間モデル

We apply a type of generative modelling to the problem of blind sourceseparation in which prior knowledge about the latent source signals,such as time-varying auto-correlation and quasi-periodicity, areincorporated into a linear state-space model. In simulations, we showthat in terms of signal-to-error ratio, the sources are inferred moreaccurately as a result of the inclusion of strong prior knowledge. Weexplore different schemes of maximum-likelihood optimization for thepurpose of learning the model parameters. The Expectation Maximizationalgorithm, which is often considered the standard optimization methodin this context, results in slow convergence when the noise varianceis small. In such scenarios, quasi-Newton optimization yieldssubstantial improvements in a range of signal to noise ratios. Weanalyze the performance of the methods on convolutive mixtures ofspeech signals.

私たちは、時間とともに変化する自己相関や準周期性などの潜在的なソース信号に関する事前知識を線形状態空間モデルに組み込むブラインドソースセパレーションの問題に、ある種の生成モデリングを適用します。シミュレーションでは、信号対誤差比の観点から、強力な事前知識を含めることで、ソースがより正確に推論されることを示します。モデルパラメータを学習する目的で、最尤最適化のさまざまなスキームを探ります。期待値最大化アルゴリズムは、このコンテキストでは標準的な最適化手法と見なされることがよくありますが、ノイズ分散が小さい場合、収束が遅くなります。このようなシナリオでは、準ニュートン最適化により、信号対雑音比の範囲が大幅に改善されます。音声信号の畳み込み混合物に対する方法の性能を分析します。

Stability Properties of Empirical Risk Minimization over Donsker Classes
Donskerクラスに対する経験的リスク最小化の安定性特性

We study some stability properties of algorithms which minimize(or almost-minimize) empirical error over Donsker classes offunctions. We show that, as the number n of samples grows, theL2-diameter of the set of almost-minimizers of empirical errorwith tolerance ξ(n)=o(n-1/2) converges to zero inprobability. Hence, even in the case of multiple minimizers ofexpected error, as n increases it becomes less and less likely thatadding a sample (or a number of samples) to the training set willresult in a large jump to a new hypothesis. Moreover, under someassumptions on the entropy of the class, along with an assumptionof Komlos-Major-Tusnady type, we derive a power rate of decay forthe diameter of almost-minimizers. This rate, through anapplication of a uniform ratio limit inequality, is shown togovern the closeness of the expected errors of thealmost-minimizers. In fact, under the above assumptions, theexpected errors of almost-minimizers become closer with a rate strictlyfaster than n-1/2.

私たちは、Donskerクラスの関数に対する経験的誤差を最小化する(またはほぼ最小化する)アルゴリズムのいくつかの安定性特性を研究します。サンプルの数nが増加するにつれて、許容誤差ξ(n)=o(n-1/2)を持つ経験的誤差のほぼ最小化器のセットのL2直径がゼロの確率に収束することを示します。したがって、複数の最小化器の場合でも、nが増加すると、トレーニングセットにサンプル(または多数のサンプル)を追加すると、新しい仮説に大きくジャンプする可能性が低くなります。さらに、クラスのエントロピーに関するいくつかの仮定の下で、コムロス-メジャー-タスナディタイプの仮定とともに、ほぼ最小化器の直径に対する減衰の電力率を導出します。この率は、一様な比率の限界不等式の適用を通じて、ほぼ最小化される人の予想誤差の近さを支配することが示されています。実際、上記の仮定の下では、ほぼ最小化の予想される誤差は、n-1/2よりも厳密に速い速度で近づきます。

On Model Selection Consistency of Lasso
Lassoのモデル選択の一貫性について

Sparsity or parsimony of statistical models is crucial for theirproper interpretations, as in sciences and social sciences. Modelselection is a commonly used method to find such models, but usuallyinvolves a computationally heavy combinatorial search.Lasso (Tibshirani, 1996) is now being used as a computationallyfeasible alternative to model selection. Therefore it is importantto study Lasso for model selection purposes.In this paper, we prove that a single condition, which we call theIrrepresentable Condition, is almost necessary and sufficient forLasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficientconditions that are verifiable in practice are given to relate toprevious works and help applications of Lasso for feature selectionand sparse representation.This Irrepresentable Condition, which depends mainly on thecovariance of the predictor variables, states that Lasso selects thetrue model consistently if and (almost) only if the predictors thatare not in the true model are “irrepresentable” (in a sense to beclarified) by predictors that are in the true model. Furthermore,simulations are carried out to provide insights and understanding ofthis result.

統計モデルのスパース性または簡素性は、科学や社会科学などにおいて、その適切な解釈に不可欠です。モデル選択は、そのようなモデルを見つけるために一般的に使用される方法ですが、通常は計算量の多い組み合わせ検索を伴います。Lasso (Tibshirani、1996)は、現在、モデル選択の計算可能な代替手段として使用されています。したがって、モデル選択の目的でLassoを研究することが重要です。この論文では、従来の固定p設定とサンプルサイズnが大きくなるにつれてpが大きくなった設定の両方で、Lassoが真のモデルを選択するために、表現不可能条件と呼ばれる単一の条件がほぼ必要かつ十分であることを証明します。これらの結果に基づいて、実際に検証可能な十分な条件が与えられ、以前の研究に関連し、Lassoを特徴選択とスパース表現に適用するのに役立ちます。主に予測変数の共分散に依存するこの表現不可能な条件は、真のモデルに存在しない予測変数が真のモデルに存在する予測変数によって「表現不可能」(明確にする意味で)である場合に限り、Lassoが真のモデルを一貫して選択することを示します。さらに、この結果の洞察と理解を提供するためにシミュレーションが実行されます。

Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems
スイッチング線形力学系における平滑化推論の期待値補正

We introduce a method for approximate smoothed inference in a classof switching linear dynamical systems, based on a novel form ofGaussian Sum smoother. This class includes the switching Kalman’Filter’ and the more general case of switch transitions dependenton the continuous latent state. The method improves on the standardKim smoothing approach by dispensing with one of the keyapproximations, thus making fuller use of the available futureinformation.Whilst the central assumption required is projection to a mixture ofGaussians, we show that an additional conditional independenceassumption results in a simpler but accurate alternative. Our methodconsists of a single Forward and Backward Pass and is reminiscent ofthe standard smoothing ‘correction’ recursions in the simpler lineardynamical system. The method is numerically stable and comparesfavourably against alternative approximations, both in cases where asingle mixture component provides a good posterior approximation,and where a multimodal approximation is required.

私たちは、新しい形式のガウス和スムーザーに基づく、スイッチング線形動的システムのクラスにおける近似スムース推論の方法を紹介します。このクラスには、スイッチングカルマン「フィルター」と、連続潜在状態に依存するスイッチ遷移のより一般的なケースが含まれます。この方法は、主要な近似の1つを省くことで標準のKimスムージングアプローチを改善し、利用可能な将来の情報をより十分に活用します。必要な中心的な仮定はガウスの混合への投影ですが、追加の条件付き独立仮定により、より単純で正確な代替案が得られることを示します。我々の方法は、単一のフォワードパスとバックワードパスで構成され、より単純な線形動的システムにおける標準的なスムージング「補正」再帰を彷彿とさせます。この方法は数値的に安定しており、単一の混合コンポーネントが良好な事後近似を提供する場合と、マルチモーダル近似が必要な場合の両方で、代替近似と比較して優れています。

Estimation of Gradients and Coordinate Covariation in Classification
分類における勾配と座標共分散の推定

We introduce an algorithm that simultaneously estimates aclassification function as well as its gradient inthe supervised learning framework. The motivation for thealgorithm is to find salient variables and estimatehow they covary. An efficient implementation with respectto both memory and time is given. The utility of thealgorithm is illustrated on simulated data as well as a geneexpression data set. An error analysis is given forthe convergence of the estimate of the classification functionand its gradient to the true classification function andtrue gradient.

私たちは、教師あり学習フレームワークで、分類関数とその勾配を同時に推定するアルゴリズムを紹介します。アルゴリズムの動機は、顕著な変数を見つけ、それらがどのように変化するかを推定することです。メモリと時間の両方に関して効率的な実装が提供されます。アルゴリズムの有用性は、シミュレーションデータと遺伝子発現データセットに示されています。エラー分析は、分類関数とその勾配の推定値の真の分類関数と真の勾配への収束のために行われます。

Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
モデルベースの近似による正しい分類の確率の損失の限界

In many pattern recognition/classification problem the true classconditional model and class probabilities are approximated for reasonsof reducing complexity and/or of statistical estimation. Theapproximated classifier is expected to have worse performance, heremeasured by the probability of correct classification. We present ananalysis valid in general, and easily computable formulas forestimating the degradation in probability of correct classificationwhen compared to the optimal classifier. An example of anapproximation is the Naïve Bayes classifier. We show that theperformance of the Naïve Bayes depends on the degree of functionaldependence between the features and labels. We provide a sufficientcondition for zero loss of performance, too.

多くのパターン認識/分類問題では、複雑さを軽減するため、および/または統計的推定のために、真のクラス条件付きモデルとクラス確率が近似されます。近似された分類器は、正しい分類の確率によって測定されるパフォーマンスが低下することが予想されます。私たちは、一般的に有効で、最適な分類器と比較した場合の正しい分類の確率の低下を予見する容易に計算可能な公式を提示します。近似の例としては、単純ベイズ分類器があります。私たちは、単純ベイズの性能が特徴とラベルの間の機能依存性の程度に依存することを示しています。また、性能の損失をゼロにするための十分条件も提供します。

Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
凸損失に基づく多クラス経験的リスク最小化法の一貫性

The consistency of classification algorithm plays a central rolein statistical learning theory. A consistent algorithm guaranteesus that taking more samples essentially suffices to roughlyreconstruct the unknown distribution. We consider the consistencyof ERM scheme over classes of combinations of very simple rules(base classifiers) in multiclass classification. Our approach is,under some mild conditions, to establish a quantitativerelationship between classification errors and convex risks. Incomparison with the related previous work, the feature of ourresult is that the conditions are mainly expressed in terms of thedifferences between some values of the convex function.

分類アルゴリズムの一貫性は、統計的学習理論において中心的な役割を果たします。一貫したアルゴリズムにより、未知の分布を大まかに再構築するには、基本的により多くのサンプルを採取するだけで十分です。多クラス分類において、非常に単純なルール(基本分類子)の組み合わせのクラスに対するERMスキームの一貫性を考慮します。私たちのアプローチは、いくつかの穏やかな条件下で、分類誤差と凸リスクとの間の定量的関係を確立することです。関連する以前の研究と比較して、私たちの結果の特徴は、条件が主に凸関数のいくつかの値の違い。

Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
多様体正則化:ラベル付きおよびラベルなしの例から学習するための幾何学的フレームワーク

We propose a family of learning algorithms based on a new form ofregularization that allows us to exploit the geometry of the marginaldistribution. We focus on a semi-supervised framework thatincorporates labeled and unlabeled data in a general-purpose learner.Some transductive graph learning algorithms and standard methodsincluding support vector machines and regularized least squares can beobtained as special cases. We use properties of reproducing kernelHilbert spaces to prove new Representer theorems that providetheoretical basis for the algorithms. As a result (in contrast topurely graph-based approaches) we obtain a natural out-of-sampleextension to novel examples and so are able to handle bothtransductive and truly semi-supervised settings. We presentexperimental evidence suggesting that our semi-supervised algorithmsare able to use unlabeled data effectively. Finally we have a briefdiscussion of unsupervised and fully supervised learning within ourgeneral framework.

私たちは、周辺分布の幾何学を利用できる新しい形式の正則化に基づく学習アルゴリズムのファミリーを提案します。私たちは、汎用学習者にラベル付きデータとラベルなしデータを組み込む半教師ありフレームワークに焦点を当てています。サポートベクターマシンや正則化最小二乗法など、いくつかの変換グラフ学習アルゴリズムと標準的な方法は、特別なケースとして取得できます。kernelHilbert空間の再現特性を使用して、アルゴリズムの理論的基礎を提供する新しいRepresenter定理を証明します。その結果、(純粋にグラフベースのアプローチとは対照的に)、新しい例への自然なサンプル外拡張が得られるため、変換設定と真に半教師あり設定の両方を処理することができます。私たちは、私たちの半教師ありアルゴリズムがラベルなしデータを効果的に使用できることを示唆する実験的証拠を提示します。最後に、一般的なフレームワーク内での教師なし学習と完全に教師あり学習について簡単に説明します。

Learning Parts-Based Representations of Data
データのパーツベースの表現の学習

Many perceptual models and theories hinge on treating objects as acollection of constituent parts. When applying these approaches todata, a fundamental problem arises: how can we determine what are theparts?We attack this problem using learning, proposing a form of generative latent factor model, in which eachdata dimension is allowed to select a different factor or part as itsexplanation. This approach permits a range of variations thatposit different models for the appearance of a part. Here we provide the details for two such models: adiscrete and a continuous one. Further, we show that this latent factor model can be extendedhierarchically to account for correlations between the appearances ofdifferent parts. This permits modelling of data consisting ofmultiple categories, and learning these categories simultaneouslywith the parts when they are unobserved. Experiments demonstrate theability to learn parts-based representations, and categories, offacial images and user-preference data.

多くの知覚モデルと理論は、オブジェクトを構成要素の集合として扱うことにかかっています。これらの手法をデータに適用すると、基本的な問題が発生します。つまり、どの部分が何かをどのように判断すればよいのでしょうか。学習を使用してこの問題に取り組み、各データ次元で説明として異なる因子または部分を選択できる生成潜在因子モデルの形式を提案します。この手法では、部分の外観に異なるモデルを仮定するさまざまなバリエーションが可能です。ここでは、離散モデルと連続モデルの2つのモデルの詳細を示します。さらに、この潜在因子モデルを階層的に拡張して、異なる部分の外観間の相関関係を説明できることを示します。これにより、複数のカテゴリで構成されるデータをモデル化し、観察されていない部分と同時にこれらのカテゴリを学習できます。実験では、パーツベースの表現、カテゴリ、顔画像、ユーザーの好みのデータを学習できることが実証されています。

Point-Based Value Iteration for Continuous POMDPs
連続 POMDP のポイントベースの値の反復

We propose a novel approach to optimize Partially Observable MarkovDecisions Processes (POMDPs) defined on continuous spaces. To date,most algorithms for model-based POMDPs are restricted to discretestates, actions, and observations, but many real-world problems suchas, for instance, robot navigation, are naturally defined oncontinuous spaces. In this work, we demonstrate that the valuefunction for continuous POMDPs is convex in the beliefs overcontinuous state spaces, and piecewise-linear convex for theparticular case of discrete observations and actions but stillcontinuous states. We also demonstrate that continuous Bellman backupsare contracting and isotonic ensuring the monotonic convergence ofvalue-iteration algorithms. Relying on those properties, we extend thealgorithm, originally developed for discrete POMDPs, to work incontinuous state spaces by representing the observation, transition,and reward models using Gaussian mixtures, and the beliefs usingGaussian mixtures or particle sets. With these representations, theintegrals that appear in the Bellman backup can be computed in closedform and, therefore, the algorithm is computationallyfeasible. Finally, we further extend to deal with continuous actionand observation sets by designing effective sampling approaches.

私たちは、連続空間で定義された部分観測マルコフ決定プロセス(POMDP)を最適化するための新しいアプローチを提案します。現在、モデルベースのPOMDPのアルゴリズムのほとんどは離散状態、アクション、および観測に制限されていますが、ロボットナビゲーションなどの多くの現実の問題は、連続空間で自然に定義されます。この研究では、連続POMDPの価値関数が連続状態空間上の信念で凸であり、離散観測とアクションがあるが連続状態である特定のケースでは区分線形凸であることを示します。また、連続ベルマンバックアップは収縮および等調であり、価値反復アルゴリズムの単調収束を保証することも示します。これらの特性を利用して、もともと離散POMDP用に開発されたアルゴリズムを拡張し、観測、遷移、報酬モデルをガウス混合を使用して表し、信念をガウス混合または粒子セットを使用して表すことで、連続状態空間で機能するようにします。これらの表現により、ベルマンバックアップに現れる積分は閉じた形式で計算できるため、アルゴリズムは計算上実行可能です。最後に、効果的なサンプリング手法を設計することで、連続的なアクションセットと観測セットを扱うためにさらに拡張します。

Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
カーネル行列の固有値の正確な誤差範囲

The eigenvalues of the kernel matrix play an important role in anumber of kernel methods, in particular, in kernel principal componentanalysis. It is well known that the eigenvalues of the kernel matrixconverge as the number of samples tends to infinity. We deriveprobabilistic finite sample size bounds on the approximation error ofindividual eigenvalues which have the important property that thebounds scale with the eigenvalue under consideration, reflecting theactual behavior of the approximation errors as predicted by asymptoticresults and observed in numerical simulations. Such scaling boundshave so far only been known for tail sums of eigenvalues.Asymptotically, the bounds presented here have a slower thanstochastic rate, but the number of sample points necessary to makethis disadvantage noticeable is often unrealistically large.Therefore, under practical conditions, and for all but the largest feweigenvalues, the bounds presented here form a significant improvementover existing non-scaling bounds.

カーネル行列の固有値は、多くのカーネル法、特にカーネル主成分分析において重要な役割を果たします。カーネル行列の固有値は、サンプル数が無限大に近づくにつれて収束することがよく知られています。私たちは、個々の固有値の近似誤差に対する確率的有限サンプルサイズ境界を導出します。この境界は、境界が対象とする固有値に比例するという重要な特性を持ち、漸近的結果によって予測され、数値シミュレーションで観察される近似誤差の実際の動作を反映します。このようなスケーリング境界は、これまでは固有値の末尾和についてのみ知られています。漸近的には、ここで提示される境界は確率的速度よりも遅くなりますが、この欠点を顕著にするために必要なサンプルポイントの数は、非現実的に多いことがよくあります。したがって、実用的な条件下では、最も大きな少数の固有値を除き、ここで提示される境界は、既存の非スケーリング境界よりも大幅に改善されています。

Causal Graph Based Decomposition of Factored MDPs
因数分解されたMDPの因果グラフに基づく分解

We present Variable Influence Structure Analysis, or VISA, analgorithm that performs hierarchical decomposition of factoredMarkov decision processes.VISA uses a dynamic Bayesian network model of actions, andconstructs a causal graph that captures relationships betweenstate variables.In tasks with sparse causal graphs VISA exploits structure byintroducing activities that cause the values of state variablesto change.The result is a hierarchy of activities that together represent asolution to the original task.VISA performs state abstraction for each activity byignoring irrelevant state variables and lower-level activities.In addition, we describe an algorithm for constructing compactmodels of the activities introduced.State abstraction and compact activity models enable VISAto apply efficient algorithms to solve the stand-alone subtaskassociated with each activity.Experimental results show that the decomposition introduced byVISA can significantly accelerate construction of an optimal, ornear-optimal, policy.

私たちは、因子分解されたマルコフ決定過程の階層的分解を実行する変数影響構造分析(VISA)アルゴリズムを紹介します。VISAは、アクションの動的ベイジアンネットワークモデルを使用し、状態変数間の関係を捉える因果グラフを構築します。因果グラフがまばらなタスクでは、VISAは状態変数の値を変更するアクティビティを導入することで構造を活用します。その結果、元のタスクに対するソリューションを表すアクティビティの階層が作成されます。VISAは、無関係な状態変数と下位レベルのアクティビティを無視することで、各アクティビティの状態抽象化を実行します。さらに、導入されたアクティビティのコンパクトモデルを構築するアルゴリズムについても説明します。状態抽象化とコンパクトアクティビティモデルにより、VISAは効率的なアルゴリズムを適用して、各アクティビティに関連付けられたスタンドアロンサブタスクを解決できます。実験結果から、VISAによって導入された分解によって、最適または最適に近いポリシーの構築が大幅に加速されることがわかっています。

An Efficient Implementation of an Active Set Method for SVMs
SVM のアクティブセット方式の効率的な実装

We propose an active set algorithm to solve the convex quadratic programming (QP) problem which is the core of the support vector machine (SVM) training. The underlying method is not new and is based on the extensive practice of the Simplex method and its variantsfor convex quadratic problems. However, its applicationto large-scale SVM problems is new. Until recently the traditional active set methods were considered impractical forlarge SVM problems. By adapting the methods to the special structure of SVM problems we were able to produce an efficient implementation. We conduct an extensive study of the behavior of our method and its variations on SVM problems. We present computational results comparing our method withJoachims’ SVMlight (see Joachims, 1999). The results show that our method has overallbetter performance on many SVM problems. It seems to have a particularly strong advantage on more difficult problems. In addition this algorithm has better theoretical properties and it naturally extends to the incremental mode. Since the proposed method solves the standard SVM formulation, as does SVMlight, the generalization properties of these two approaches are identical and we do not discuss them in the paper.

私たちは、サポートベクターマシン(SVM)トレーニングの中核である凸二次計画(QP)問題を解決するアクティブセットアルゴリズムを提案します。基礎となる方法は新しいものではなく、凸二次問題に対するシンプレックス法とその変形の広範な実践に基づいています。ただし、大規模なSVM問題への適用は新しいものです。最近まで、従来のアクティブセット法は大規模なSVM問題には実用的ではないと考えられていました。この方法をSVM問題の特殊な構造に適応させることで、効率的な実装を実現できました。SVM問題におけるこの方法とその変形の動作について、広範な調査を実施します。この方法をJoachimsのSVMlight (Joachims、1999を参照)と比較した計算結果を示します。結果から、この方法は多くのSVM問題で全体的に優れたパフォーマンスを発揮することがわかります。特に難しい問題では大きな利点があるようです。さらに、このアルゴリズムは理論的な特性が優れており、増分モードに自然に拡張されます。提案された方法は、SVMlightと同様に標準のSVM定式化を解決するため、これら2つのアプローチの一般化特性は同一であり、この論文ではそれらについては説明しません。

Learning a Hidden Hypergraph
隠れたハイパーグラフの学習

We consider the problem of learning a hypergraph using edge-detecting queries.In this model, the learner may query whether a set of vertices induces an edge of the hidden hypergraph or not.We show that an r-uniform hypergraph with m edges and n vertices is learnable with O(24rm · poly(r,logn)) queries with high probability.The queries can be made in O(min(2r (log m+r)2, (log m+r)3)) rounds.We also give an algorithm that learns an almost uniform hypergraph of dimension r using O(2O((1+Δ/2)r) · m1+Δ/2 · poly(log n)) queries with high probability,where Δ is the difference between the maximum and the minimum edge sizes. This upper bound matches our lower bound of Ω((m/(1+Δ/2))1+Δ/2) for this class of hypergraphs in terms of dependence on m.The queries can also be made in O((1+Δ) · min(2r (log m+r)2, (log m+r)3)) rounds.

私たちは、エッジ検出クエリを使用してハイパーグラフを学習する問題を考えます。このモデルでは、学習器は、頂点のセットが隠れたハイパーグラフのエッジを誘導するかどうかを問い合わせることができます。m個のエッジとn個の頂点を持つr-一様ハイパーグラフが、O(24rm·poly(r,logn))クエリで高い確率で学習できることを示します。クエリはO(min(2r (log m+r)2, (log m+r)3))ラウンドで行うことができます。また、O(2O((1+Δ/2)r)·m1+Δ/2·poly(log n))クエリを使用して、次元rのほぼ一様なハイパーグラフを学習するアルゴリズムも提供します。ここで、Δ は最大エッジサイズと最小エッジサイズの差です。この上限は、mへの依存性の観点から、このクラスのハイパーグラフの下限 Ω((m/(1+Δ/2))1+Δ/2)と一致します。クエリは、O((1+Δ)·min(2r (log m+r)2, (log m+r)3))ラウンドでも行うことができます。

Noisy-OR Component Analysis and its Application to Link Analysis
ノイズOR成分分析とそのリンク解析への応用

We develop a new component analysis framework, the Noisy-OrComponent Analyzer (NOCA), that targets high-dimensional binarydata. NOCA is a probabilistic latent variable model that assumes theexpression of observed high-dimensional binary data is driven by asmall number of hidden binary sources combined via noisy-or units.The component analysis procedure is equivalent to learning of NOCAparameters. Since the classical EM formulation of the NOCA learningproblem is intractable, we develop its variational approximation. Wetest the NOCA framework on two problems: (1) a syntheticimage-decomposition problem and (2) a co-citation data analysisproblem for thousands of CiteSeer documents. We demonstrate goodperformance of the new model on both problems. In addition, wecontrast the model to two mixture-based latent-factor models: theprobabilistic latent semantic analysis (PLSA) and latent Dirichletallocation (LDA). Differing assumptions underlying these models causethem to discover different types of structure in co-citation data,thus illustrating the benefit of NOCA in building our understanding ofhigh-dimensional data sets.

私たちは、高次元バイナリデータを対象とする新しいコンポーネント分析フレームワーク、Noisy-OrComponent Analyzer (NOCA)を開発しました。NOCAは、観測された高次元バイナリデータの表現が、Noisy-Orユニットを介して結合された少数の隠れたバイナリソースによって駆動されると仮定する確率的潜在変数モデルです。コンポーネント分析手順は、NOCAパラメータの学習に相当します。NOCA学習問題の古典的なEM定式化は扱いにくいため、変分近似を開発します。NOCAフレームワークを2つの問題でテストします。(1)合成画像分解問題と(2)数千のCiteSeerドキュメントの共引用データ分析問題です。両方の問題で新しいモデルが優れたパフォーマンスを発揮することを示します。さらに、このモデルを2つの混合ベースの潜在因子モデル、確率的潜在意味分析(PLSA)と潜在ディリクレ配置(LDA)と比較します。これらのモデルの根底にある異なる仮定により、共引用データに異なるタイプの構造が発見され、高次元データセットの理解を深める上でのNOCAの利点が明らかになりました。

A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
相互情報量と条件付き独立性検定に基づくベイジアンネットワーク学習のためのスコアリング関数

We propose a new scoring function for learning Bayesian networks fromdata using score+search algorithms. This is based on the concept ofmutual information and exploits some well-known properties of thismeasure in a novel way. Essentially, a statistical independence testbased on the chi-square distribution, associated with the mutualinformation measure, together with a property of additivedecomposition of this measure, are combined in order to measure thedegree of interaction between each variable and its parent variablesin the network. The result is a non-Bayesian scoring function calledMIT (mutual information tests) which belongs to the family of scoresbased on information theory. The MIT score also represents apenalization of the Kullback-Leibler divergence between the jointprobability distributions associated with a candidate network and withthe available data set. Detailed results of a complete experimentalevaluation of the proposed scoring function and its comparison withthe well-known K2, BDeu and BIC/MDL scores are also presented.

私たちは、スコア+検索アルゴリズムを使用してデータからベイジアンネットワークを学習するための新しいスコアリング関数を提案します。これは相互情報の概念に基づいており、この尺度のよく知られた特性のいくつかを新しい方法で利用します。基本的に、相互情報尺度に関連付けられたカイ2乗分布に基づく統計的独立性テストと、この尺度の加法分解の特性を組み合わせて、ネットワーク内の各変数とその親変数間の相互作用の度合いを測定します。結果は、情報理論に基づくスコアのファミリーに属するMIT (相互情報量テスト)と呼ばれる非ベイジアンスコアリング関数です。MITスコアは、候補ネットワークと利用可能なデータセットに関連付けられた結合確率分布間のKullback-Leiblerダイバージェンスのペナルティ化も表します。提案されたスコアリング関数の完全な実験評価の詳細な結果と、よく知られているK2、BDeu、BIC/MDLスコアとの比較も提示されます。

Adaptive Prototype Learning Algorithms: Theoretical and Experimental Studies
適応型プロトタイプ学習アルゴリズム:理論的および実験的研究

In this paper, we propose a number of adaptive prototype learning(APL) algorithms. They employ the same algorithmic scheme to determinethe number and location of prototypes, but differ in the use ofsamples or the weighted averages of samples as prototypes, and also inthe assumption of distance measures. To understand these algorithmsfrom a theoretical viewpoint, we address their convergence properties,as well as their consistency under certain conditions. We also presenta soft version of APL, in which a non-zero training error is allowedin order to enhance the generalization power of the resultantclassifier. Applying the proposed algorithms to twelve UCI benchmarkdata sets, we demonstrate that they outperform many instance-basedlearning algorithms, the k-nearest neighbor rule, and support vectormachines in terms of average test accuracy.

この論文では、いくつかの適応型プロトタイプ学習(APL)アルゴリズムを提案します。プロトタイプの数と位置を決定するために同じアルゴリズムスキームを採用していますが、プロトタイプとしてのサンプルの使用やサンプルの加重平均、および距離測定の仮定も異なります。これらのアルゴリズムを理論的な観点から理解するために、それらの収束特性と特定の条件下での一貫性について説明します。また、APLのソフトバージョンも提示し、結果として得られる分類器の一般化力を高めるために、ゼロ以外のトレーニングエラーが許可されます。提案されたアルゴリズムを12のUCIベンチマークデータセットに適用すると、平均テスト精度の点で、多くのインスタンスベース学習アルゴリズム、k最近傍ルール、およびサポートベクターマシンよりも優れていることが実証されます。

A Hierarchy of Support Vector Machines for Pattern Detection
パターン検出のためのサポートベクターマシンの階層化

We introduce a computational design for pattern detectionbased on a tree-structured network of support vector machines (SVMs).An SVM is associated with each cell in a recursive partitioning of thespace of patterns (hypotheses) into increasingly finer subsets. Thehierarchy is traversed coarse-to-fine and each chain of positiveresponses from the root to a leaf constitutes a detection. Ourobjective is to design and build a network which balances overallerror and computation.Initially, SVMs are constructed for each cell with noconstraints. This “free network” is then perturbed, cell by cell,into another network, which is “graded” in two ways: first, thenumber of support vectors of each SVM is reduced (by clustering) inorder to adjust to a pre-determined, increasing function of celldepth; second, the decision boundaries are shifted to preserve allpositive responses from the original set of training data. The limitson the numbers of clusters (virtual support vectors) result fromminimizing the mean computational cost of collecting all detectionssubject to a bound on the expected number of false positives.When applied to detecting faces in cluttered scenes, thepatterns correspond to poses and the free network is already fasterand more accurate than applying a single pose-specific SVM many times.The graded network promotes very rapid processing of backgroundregions while maintaining the discriminatory power of the freenetwork.

私たちは、サポートベクターマシン(SVM)のツリー構造ネットワークに基づくパターン検出の計算設計を紹介します。SVMは、パターン(仮説)の空間を徐々に細かいサブセットに再帰的に分割する各セルに関連付けられます。階層は粗いものから細かいものへと横断され、ルートからリーフまでの肯定的な応答の各チェーンが検出を構成します。私たちの目的は、全体的なエラーと計算のバランスをとるネットワークを設計および構築することです。最初に、制約なしで各セルに対してSVMが構築されます。次に、この「フリーネットワーク」はセルごとに別のネットワークに乱され、2つの方法で「段階的に」分類されます。まず、各SVMのサポートベクターの数が(クラスタリングによって)削減され、セルの深さの所定の増加関数に調整されます。2番目に、決定境界は、元のトレーニングデータセットからのすべての肯定的な応答を保持するようにシフトされます。クラスター(仮想サポートベクトル)の数の制限は、すべての検出を収集する平均計算コストを最小化することで生じますが、これは、予想される誤検出数の上限に従います。雑然としたシーンでの顔の検出に適用すると、パターンはポーズに対応し、フリーネットワークは、ポーズ固有の単一のSVMを何度も適用するよりも高速で正確です。段階的ネットワークは、フリーネットワークの識別力を維持しながら、背景領域の非常に迅速な処理を促進します。

Distance Patterns in Structural Similarity
構造類似性における距離パターン

Similarity of edge labeled graphs is considered in the sense of minimum squared distance between corresponding values. Vertex correspondences are established by isomorphisms if both graphs are of equal size and by subisomorphisms if one graph has fewer vertices than the other. Best fit isomorphisms and subisomorphisms amount to solutions of quadratic assignment problems and are computed exactly as well as approximatelyby minimum cost flow, linear assignment relaxations and related graph algorithms.

エッジラベル付きグラフの類似性は、対応する値間の最小二乗距離という意味で考慮されます。頂点の対応は、両方のグラフのサイズが等しい場合は同型によって、一方のグラフの頂点が他方のグラフよりも少ない場合は準同型によって確立されます。最適同型と部分同型は、二次代入問題の解に相当し、最小コストフロー、線形代入緩和、および関連するグラフアルゴリズムによって、正確にも近似的にも計算されます。

Walk-Sums and Belief Propagation in Gaussian Graphical Models
ガウスグラフィカルモデルにおけるウォークサムと信念伝播

We present a new framework based on walks in a graph for analysis andinference in Gaussian graphical models. The key idea is to decomposethe correlation between each pair of variables as a sum over all walksbetween those variables in the graph. The weight of each walk is givenby a product of edgewise partial correlation coefficients. Thisrepresentation holds for a large class of Gaussian graphical modelswhich we call walk-summable. We give a precise characterization ofthis class of models, and relate it to other classes includingdiagonally dominant, attractive, non-frustrated, andpairwise-normalizable. We provide a walk-sum interpretation ofGaussian belief propagation in trees and of the approximate method ofloopy belief propagation in graphs with cycles. The walk-sumperspective leads to a better understanding of Gaussian beliefpropagation and to stronger results for its convergence in loopygraphs.

私たちは、ガウスグラフィカルモデルでの分析と推論のためのグラフ内のウォークに基づく新しいフレームワークを提示します。重要な考え方は、変数の各ペア間の相関を、グラフ内のそれらの変数間のすべてのウォークの合計として分解することです。各歩行の重みは、エッジワイズ偏相関係数の積によって与えられます。この表現は、ウォーク合計可能と呼ばれるガウスグラフィカルモデルの大きなクラスに当てはまります。このクラスのモデルの正確な特性評価を行い、それを斜め優勢、魅力的、非イライラ、ペアワイズ正規化可能などの他のクラスに関連付けます。木のガウス信念伝播と、サイクルを持つグラフでのループ信念伝播の近似方法のウォークサム解釈を提供します。ウォーク・サムパースペクティブは、ガウスの信念伝播の理解を深め、ループグラフでの収束により強力な結果をもたらします。

A Linear Non-Gaussian Acyclic Model for Causal Discovery
因果関係発見のための線形非ガウス非巡回モデル

In recent years, several methods have been proposed for the discoveryof causal structure from non-experimental data. Such methods makevarious assumptions on the data generating process to facilitate itsidentification from purely observational data. Continuing this line ofresearch, we show how to discover the complete causal structure ofcontinuous-valued data, under the assumptions that (a) the datagenerating process is linear, (b) there are no unobserved confounders,and (c) disturbance variables have non-Gaussian distributions ofnon-zero variances. The solution relies on the use of the statisticalmethod known as independent component analysis, and does not requireany pre-specified time-ordering of the variables. We provide acomplete Matlab package for performing this LiNGAM analysis (short forLinear Non-Gaussian Acyclic Model), and demonstrate the effectivenessof the method using artificially generated data and real-world data.

近年、非実験データから因果構造を発見するためのいくつかの方法が提案されています。このような方法は、純粋な観測データからの識別を容易にするために、データ生成プロセスにさまざまな仮定を立てます。この一連の研究を続けて、(a)データ生成プロセスが線形であること、(b)観測されていない交絡因子がないこと、および(c)擾乱変数が非ゼロ分散の非ガウス分布を持つという仮定の下で、連続値データの完全な因果構造を発見する方法を示します。このソリューションは、独立成分分析と呼ばれる統計手法の使用に依存しており、変数の事前指定時間順序は必要ありません。このLiNGAM解析(Linear Non-Gaussian Acyclic Modelの略)を実行するための完全なMatlabパッケージを提供し、人工的に生成されたデータと実世界のデータを使用して、この手法の有効性を実証します。

Learning Spectral Clustering, With Application To Speech Separation
スペクトルクラスタリングの学習と音声分離への応用

Spectral clustering refers to a class of techniques which rely onthe eigenstructure of a similarity matrix to partition points intodisjoint clusters, with points in the same cluster having highsimilarity and points in different clusters having low similarity.In this paper, we derive new cost functions for spectralclustering based on measures of error between a given partitionand a solution of the spectral relaxation of a minimum normalizedcut problem. Minimizing these cost functions with respect to thepartition leads to new spectral clustering algorithms. Minimizingwith respect to the similarity matrix leads to algorithms forlearning the similarity matrix from fully labelled data sets. Weapply our learning algorithm to the blind one-microphone speechseparation problem, casting the problem as one of segmentationof the spectrogram.

スペクトルクラスタリングとは、類似度行列の固有構造に依存して点をバラバラなクラスターに分割し、同じクラスター内のポイントは高い類似性を持ち、異なるクラスター内のポイントは低い類似性を持つ手法のクラスを指します。この論文では、特定の分割と最小正規化カット問題のスペクトル緩和の解との間の誤差の尺度に基づいて、スペクトルクラスタリングの新しいコスト関数を導き出します。パーティションに関してこれらのコスト関数を最小化すると、新しいスペクトルクラスタリングアルゴリズムが得られます。類似性行列に関して最小化すると、完全にラベル付けされたデータセットから類似性行列を学習するためのアルゴリズムが得られます。学習アルゴリズムをブラインド1マイク音声分離問題に適用し、問題をスペクトログラムのセグメンテーションの1つとしてキャストします。

A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events
まれな事象を条件とするマルコフ連鎖のエルゴード制御のためのシミュレーションベースアルゴリズム

We study the problem of long-run average cost control of Markov chainsconditioned on a rare event. In a related recent work, a simulationbased algorithm for estimating performance measures associated with aMarkov chain conditioned on a rare event has been developed. We extendideas from this work and develop an adaptive algorithm for obtaining,online, optimal control policies conditioned on a rare event. Ouralgorithm uses three timescales or step-size schedules. On the slowesttimescale, a gradient search algorithm for policy updates that isbased on one-simulation simultaneous perturbation stochasticapproximation (SPSA) type estimates is used. Deterministicperturbation sequences obtained from appropriate normalized Hadamardmatrices are used here. The fast timescale recursions compute theconditional transition probabilities of an associated chain byobtaining solutions to the multiplicative Poisson equation (for agiven policy estimate). Further, the risk parameter associated withthe value function for a given policy estimate is updated on atimescale that lies in between the two scales above. We briefly sketchthe convergence analysis of our algorithm and present a numericalapplication in the setting of routing multiple flows in communicationnetworks.

私たちは、稀なイベントを条件とするマルコフ連鎖の長期平均コスト制御の問題を研究します。関連する最近の研究では、稀なイベントを条件とするマルコフ連鎖に関連するパフォーマンス指標を推定するシミュレーションベースのアルゴリズムが開発されています。我々はこの研究のアイデアを拡張し、稀なイベントを条件とするオンラインの最適制御ポリシーを取得するための適応アルゴリズムを開発します。我々のアルゴリズムは、3つのタイムスケールまたはステップサイズスケジュールを使用します。最も遅いタイムスケールでは、1シミュレーション同時摂動確率近似(SPSA)タイプの推定に基づくポリシー更新の勾配検索アルゴリズムを使用します。ここでは、適切な正規化アダマール行列から取得された決定論的摂動シーケンスを使用します。高速タイムスケールの再帰は、乗法ポアソン方程式(特定のポリシー推定について)の解を取得することにより、関連する連鎖の条件付き遷移確率を計算します。さらに、特定のポリシー推定に対する価値関数に関連するリスクパラメーターは、上記の2つのスケールの中間にあるタイムスケールで更新されます。アルゴリズムの収束分析を簡単に概説し、通信ネットワークで複数のフローをルーティングする設定での数値アプリケーションを紹介します。

Incremental Support Vector Learning: Analysis, Implementation and Applications
インクリメンタルサポートベクター学習:分析、実装、応用

Incremental Support Vector Machines (SVM) are instrumental inpractical applications of online learning. This work focuses on thedesign and analysis of efficient incremental SVM learning, with theaim of providing a fast, numerically stable and robustimplementation. A detailed analysis of convergence and ofalgorithmic complexity of incremental SVM learning is carried out.Based on this analysis, a new design of storage and numericaloperations is proposed, which speeds up the training of anincremental SVM by a factor of 5 to 20. The performance of the newalgorithm is demonstrated in two scenarios: learning with limitedresources and active learning. Various applications of thealgorithm, such as in drug discovery, online monitoring ofindustrial devices and and surveillance of network traffic, can beforeseen.

インクリメンタルサポートベクターマシン(SVM)は、オンライン学習の非実用的なアプリケーションです。この作業は、効率的なインクリメンタルSVM学習の設計と分析に焦点を当てており、高速で数値的に安定している堅牢な実装を提供することを目的としています。インクリメンタルSVM学習の収束とアルゴリズムの複雑さの詳細な分析が行われます。この分析に基づいて、ストレージと数値演算の新しい設計が提案され、インクリメンタルSVMのトレーニングが5倍から20倍に高速化されます。新しいアルゴリズムのパフォーマンスは、限られたリソースでの学習とアクティブラーニングの2つのシナリオで実証されています。創薬、産業デバイスのオンライン監視、ネットワークトラフィックの監視など、アルゴリズムのさまざまな応用がこれまでに見られたことがあります。

Linear Programming Relaxations and Belief Propagation — An Empirical Study
線形計画法の緩和と信念伝播 — 実証的研究

The problem of finding the most probable (MAP) configuration ingraphical models comes up in a wide range of applications. In ageneral graphical model this problem is NP hard, but variousapproximate algorithms have been developed. Linear programming (LP)relaxations are a standard method in computer science forapproximating combinatorial problems and have been used for findingthe most probable assignment in small graphical models. However,applying this powerful method to real-world problems is extremelychallenging due to the large numbers of variables and constraints inthe linear program. Tree-Reweighted Belief Propagation is a promisingrecent algorithm for solving LP relaxations, but little is known aboutits running time on large problems.In this paper we compare tree-reweighted belief propagation (TRBP) and powerfulgeneral-purpose LP solvers (CPLEX) on relaxations of real-world graphicalmodels from the fields of computer vision and computational biology. We findthat TRBP almost always finds the solution significantly faster than all thesolvers in CPLEX and more importantly, TRBP can be applied to large scaleproblems for which the solvers in CPLEX cannot be applied. Using TRBP we canfind the MAP configurations in a matter of minutes for a large range of realworld problems.

グラフィカルモデルで最も確率の高い(MAP)構成を見つける問題は、さまざまなアプリケーションで発生します。一般的なグラフィカルモデルでは、この問題はNP困難ですが、さまざまな近似アルゴリズムが開発されています。線形計画法(LP)緩和は、組み合わせ問題を近似するためのコンピューターサイエンスの標準的な方法であり、小規模なグラフィカルモデルで最も確率の高い割り当てを見つけるために使用されてきました。ただし、線形計画法には多数の変数と制約があるため、この強力な方法を実際の問題に適用することは非常に困難です。ツリー再重み付け信念伝播法は、LP緩和法を解決するための有望な最新のアルゴリズムですが、大規模な問題での実行時間についてはほとんどわかっていません。この論文では、コンピュータービジョンと計算生物学の分野からの実際のグラフィカルモデルの緩和法について、ツリー再重み付け信念伝播法(TRBP)と強力な汎用LPソルバー(CPLEX)を比較します。TRBPは、ほとんどの場合、CPLEXのすべてのソルバーよりも大幅に高速にソリューションを見つけ、さらに重要なことに、TRBPはCPLEXのソルバーを適用できない大規模な問題にも適用できます。TRBPを使用すると、さまざまな現実の問題に対して、わずか数分でMAP構成を見つけることができます。

Streamwise Feature Selection
Streamwise機能の選択

In streamwise feature selection, new features are sequentiallyconsidered for addition to a predictive model. When the space ofpotential features is large, streamwise feature selection offersmany advantages over traditional feature selection methods, whichassume that all features are known in advance. Features can begenerated dynamically, focusing the search for new features onpromising subspaces, and overfitting can be controlled bydynamically adjusting the threshold for adding features to themodel. In contrast to traditional forward feature selectionalgorithms such as stepwise regression in which at each step allpossible features are evaluated and the best one is selected,streamwise feature selection only evaluates each feature once whenit is generated. We describe information-investing andα-investing, two adaptive complexity penalty methods forstreamwise feature selection which dynamically adjust the thresholdon the error reduction required for adding a new feature. These twomethods give false discovery rate style guarantees againstoverfitting. They differ from standard penalty methods such as AIC,BIC and RIC, which always drastically over- or under-fit in thelimit of infinite numbers of non-predictive features. Empiricalresults show that streamwise regression is competitive with (onsmall data sets) and superior to (on large data sets) much morecompute-intensive feature selection methods such as stepwiseregression, and allows feature selection on problems with millionsof potential features.

ストリームワイズ機能選択では、新しい機能が予測モデルへの追加対象として順番に検討されます。潜在的な特徴の空間が大きい場合、ストリームワイズ特徴選択は、すべての特徴が事前にわかっていると想定する従来の特徴選択方法に比べて多くの利点があります。特徴は動的に生成でき、有望なサブスペースでの新しい特徴の検索に焦点を合わせることができ、モデルに特徴を追加するためのしきい値を動的に調整することで、オーバーフィッティングを制御できます。各ステップですべての可能な特徴を評価して最適なものを選択するステップワイズ回帰などの従来の順方向特徴選択アルゴリズムとは対照的に、ストリームワイズ特徴選択は、各特徴が生成された時点で1回だけ評価します。ここでは、新しい特徴を追加するために必要なエラー削減のしきい値を動的に調整する、ストリームワイズ特徴選択の2つの適応型複雑性ペナルティ方法である情報投資と α 投資について説明します。これら2つの方法は、オーバーフィッティングに対する偽発見率スタイルの保証を提供します。これらは、予測不可能な特徴の数が無限であるという制限で常に大幅にオーバーフィッティングまたはアンダーフィッティングするAIC、BIC、RICなどの標準的なペナルティ方法とは異なります。実験結果によると、ストリームワイズ回帰は、ステップワイズ回帰などの計算集約型の特徴選択方法と競合し(小規模なデータセットの場合)、それよりも優れており(大規模なデータセットの場合)、数百万の潜在的な特徴を持つ問題で特徴選択が可能になります。

Estimating the “Wrong” Graphical Model: Benefits in the Computation-Limited Setting
「間違った」グラフィカルモデルの推定: 計算制限のある設定の利点

Consider the problem of joint parameter estimation and prediction in aMarkov random field: that is, the model parameters are estimated on thebasis of an initial set of data, and then the fitted model is used toperform prediction (e.g., smoothing, denoising, interpolation) on anew noisy observation. Working under the restriction of limitedcomputation, we analyze a joint method in which the same convexvariational relaxation is used to construct an M-estimator forfitting parameters, and to perform approximate marginalization for theprediction step. The key result of this paper is that in thecomputation-limited setting, using an inconsistent parameter estimator(i.e., an estimator that returns the “wrong” model even in theinfinite data limit) is provably beneficial, since the resultingerrors can partially compensate for errors made by using anapproximate prediction technique. En route to this result, we analyzethe asymptotic properties of M-estimators based on convex variationalrelaxations, and establish a Lipschitz stability property that holdsfor a broad class of convex variational methods. This stabilityresult provides additional incentive, apart from the obvious benefitof unique global optima, for using message-passing methods based onconvex variational relaxations. We show that jointestimation/prediction based on the reweighted sum-product algorithmsubstantially outperforms a commonly used heuristic based on ordinarysum-product.

マルコフランダムフィールドでのパラメーター推定と予測の結合の問題を考えてみましょう。つまり、モデルパラメーターは初期データセットに基づいて推定され、次に、適合モデルを使用して、新しいノイズのある観測に対して予測(平滑化、ノイズ除去、補間など)を実行します。計算が制限されているという制約の下で、同じ凸変分緩和法を使用してパラメーターを適合するためのM推定量を構築し、予測ステップの近似マージナライゼーションを実行する結合法を分析します。この論文の主な結果は、計算が制限された設定では、一貫性のないパラメータ推定量(つまり、無限のデータ制限でも「間違った」モデルを返す推定量)を使用すると、結果として生じる誤差が近似予測手法を使用することで生じた誤差を部分的に補償できるため、有益であることが証明されていることです。この結果に至る過程で、凸変分緩和に基づくM推定量の漸近特性を分析し、幅広いクラスの凸変分法に当てはまるLipschitz安定性特性を確立しました。この安定性の結果は、一意のグローバル最適値の明らかな利点とは別に、凸変分緩和に基づくメッセージパッシングメソッドを使用する追加のインセンティブを提供します。再重み付けされた積和アルゴリズムに基づく共同推定/予測は、通常の積和に基づく一般的なヒューリスティックよりも大幅に優れていることを示しています。

Collaborative Multiagent Reinforcement Learning by Payoff Propagation
ペイオフ伝播による協調的マルチエージェント強化学習

In this article we describe a set of scalable techniques forlearning the behavior of a group of agents in a collaborativemultiagent setting. As a basis we use the framework of coordinationgraphs of Guestrin, Koller, and Parr (2002a) which exploits the dependencies betweenagents to decompose the global payoff function into a sum of localterms. First, we deal with the single-state case and describe apayoff propagation algorithm that computes the individual actionsthat approximately maximize the global payoff function. The methodcan be viewed as the decision-making analogue of belief propagationin Bayesian networks. Second, we focus on learning the behavior ofthe agents in sequential decision-making tasks. We introducedifferent model-free reinforcement-learning techniques, unitedlycalled Sparse Cooperative Q-learning, which approximate the globalaction-value function based on the topology of a coordination graph,and perform updates using the contribution of the individual agentsto the maximal global action value. The combined use of anedge-based decomposition of the action-value function and the payoffpropagation algorithm for efficient action selection, result in anapproach that scales only linearly in the problem size. We provideexperimental evidence that our method outperforms related multiagentreinforcement-learning methods based on temporal differences.

この記事では、協調的マルチエージェント設定におけるエージェントのグループの動作を学習するためのスケーラブルな手法のセットについて説明します。基礎として、エージェント間の依存関係を利用してグローバルペイオフ関数をローカル項の合計に分解する、Guestrin、Koller、およびParr (2002a)の調整グラフのフレームワークを使用します。まず、単一状態の場合を扱い、グローバルペイオフ関数をほぼ最大化する個々のアクションを計算するペイオフ伝播アルゴリズムについて説明します。この方法は、ベイジアンネットワークにおけるビリーフ伝播の意思決定アナログと見なすことができます。次に、連続的な意思決定タスクにおけるエージェントの動作の学習に焦点を当てます。調整グラフのトポロジに基づいてグローバルアクション値関数を近似し、個々のエージェントの最大グローバルアクション値への貢献を使用して更新を実行する、スパース協調Q学習と呼ばれるさまざまなモデルフリー強化学習手法を導入しました。アクション価値関数のエッジベースの分解と、効率的なアクション選択のためのペイオフ伝播アルゴリズムを組み合わせて使用すると、問題のサイズに対して線形にしかスケールしないアプローチが実現します。時間差に基づく関連するマルチエージェント強化学習法よりも、この方法が優れているという実験的証拠を提供します。

Learning Factor Graphs in Polynomial Time and Sample Complexity
多項式時間とサンプル複雑度における学習因子グラフ

We study the computational and sample complexity of parameter andstructure learning in graphical models. Our main result shows thatthe class of factor graphs with bounded degree can be learned inpolynomial time and from a polynomial number of training examples,assuming that the data is generated by a network in this class. Thisresult covers both parameter estimation for a known network structureand structure learning. It implies as a corollary that we can learnfactor graphs for both Bayesian networks and Markov networks ofbounded degree, in polynomial time and sample complexity. Importantly,unlike standard maximum likelihood estimation algorithms, our methoddoes not require inference in the underlying network, and so appliesto networks where inference is intractable. We also show that theerror of our learned model degrades gracefully when the generatingdistribution is not a member of the target class of networks. Inaddition to our main result, we show that the sample complexity ofparameter learning in graphical models has an O(1) dependenceon the number of variables in the model when using the KL-divergencenormalized by the number of variables as the performance criterion.

私たちは、グラフィカルモデルでのパラメータと構造の学習の計算とサンプルの複雑度を調査します。主な結果は、データがこのクラスのネットワークによって生成されると仮定すると、制限された次数を持つ因子グラフのクラスは、多項式時間で多項式数のトレーニング例から学習できることを示しています。この結果は、既知のネットワーク構造のパラメータ推定と構造学習の両方をカバーしています。当然の帰結として、ベイジアンネットワークとマルコフネットワークの両方について、多項式時間とサンプル複雑度で、有界次数の因子グラフを学習できることになります。重要な点は、標準的な最尤推定アルゴリズムとは異なり、この方法は基礎となるネットワークでの推論を必要としないため、推論が困難なネットワークにも適用できることです。また、生成分布がネットワークのターゲットクラスのメンバーでない場合、学習したモデルの誤差が徐々に減少することも示しています。主な結果に加えて、パフォーマンス基準として変数の数で正規化されたKLダイバージェンスを使用する場合、グラフィカルモデルでのパラメータ学習のサンプル複雑度は、モデル内の変数の数にO(1)依存することを示します。

Considering Cost Asymmetry in Learning Classifiers
学習分類器におけるコスト非対称性の考慮

Receiver Operating Characteristic (ROC) curves are a standard way todisplay the performance of a set of binary classifiers for allfeasible ratios of the costs associated with false positives andfalse negatives. For linear classifiers, the set of classifiers istypically obtained by training once, holding constant the estimatedslope and then varying the intercept to obtain a parameterized setof classifiers whose performances can be plotted in the ROC plane.We consider the alternative of varying the asymmetry of the costfunction used for training. We show that the ROC curve obtained byvarying both the intercept and the asymmetry, and hence the slope,always outperforms the ROC curve obtained by varying only theintercept. In addition, we present a path-following algorithm forthe support vector machine (SVM) that can compute efficiently theentire ROC curve, and that has the same computational complexity astraining a single classifier. Finally, we provide a theoreticalanalysis of the relationship between the asymmetric cost modelassumed when training a classifier and the cost model assumed inapplying the classifier. In particular, we show that the mismatchbetween the step function used for testing and its convex upperbounds, usually used for training, leads to a provable andquantifiable difference around extreme asymmetries.

受信者動作特性(ROC)曲線は、偽陽性と偽陰性に関連するコストのすべての実現可能な比率について、一連のバイナリ分類器のパフォーマンスを表示する標準的な方法です。線形分類器の場合、分類器のセットは通常、1回のトレーニングで取得され、推定傾斜を一定に保ち、次に切片を変更して、ROC平面にパフォーマンスをプロットできるパラメーター化された分類器のセットを取得します。トレーニングに使用されるコスト関数の非対称性を変更する代替案を検討します。切片と非対称性の両方、つまり傾斜を変更して取得したROC曲線は、切片のみを変更して取得したROC曲線よりも常に優れていることを示します。さらに、ROC曲線全体を効率的に計算でき、単一の分類器をトレーニングする場合と同じ計算複雑度を持つ、サポートベクターマシン（SVM）のパス追跡アルゴリズムを紹介します。最後に、分類器をトレーニングするときに想定される非対称コストモデルと、分類器を適用する際に想定されるコストモデルの関係の理論的分析を提供します。特に、テストに使用されるステップ関数と、通常トレーニングに使用されるその凸上限との間の不一致により、極端な非対称性の周りに証明可能で定量化可能な違いが生じることを示します。

Large Scale Transductive SVMs
大規模変換SVM

We show how the concave-convex procedure can be appliedto transductive SVMs, which traditionally require solvinga combinatorial search problem. Thisprovides for the first time a highly scalable algorithm in the nonlinearcase.Detailed experiments verify the utility of our approach. Softwareis available at http://www.kyb.tuebingen.mpg.de/bs/people/fabee/transduction.html.

私たちは、凹凸法を、従来は組み合わせ探索問題を解く必要としていた形質転換SVMにどのように適用できるかを示します。これにより、非線形の場合に高度にスケーラブルなアルゴリズムが初めて提供されます。詳細な実験により、私たちのアプローチの有用性が検証されています。ソフトウェアはhttp://www.kyb.tuebingen.mpg.de/bs/people/fabee/transduction.htmlで入手できます。

Active Learning with Feedback on Features and Instances
機能とインスタンスに関するフィードバックによるアクティブラーニング

We extend the traditional active learning framework to includefeedback on features in addition to labeling instances, and we executea careful study of the effects of feature selection and human feedbackon features in the setting of text categorization. Our experiments ona variety of categorization tasks indicate that there is significantpotential in improving classifier performance by feature re-weighting,beyond that achieved via membership queries alone (traditional activelearning) if we have access to an oracle that can point to theimportant (most predictive) features. Our experiments on humansubjects indicate that human feedback on feature relevance canidentify a sufficient proportion of the most relevant features (over50% in our experiments). We find that on average, labeling a featuretakes much less time than labeling a document. We devise an algorithmthat interleaves labeling features and documents which significantlyaccelerates standard active learning in our simulation experiments.Feature feedback can complement traditional active learning inapplications such as news filtering, e-mail classification, andpersonalization, where the human teacher can have significantknowledge on the relevance of features.

私たちは、従来の能動学習フレームワークを拡張して、インスタンスのラベル付けに加えて機能に関するフィードバックを含め、テキスト分類の設定で機能選択と機能に対する人間のフィードバックの効果を慎重に調査します。さまざまな分類タスクでの実験では、重要な(最も予測的な)機能を指し示すことができるオラクルにアクセスできる場合、メンバーシップクエリのみ(従来の能動学習)で達成されるものを超えて、機能の再重み付けによって分類器のパフォーマンスを大幅に向上できる可能性があることが示されています。人間を対象とした実験では、機能の関連性に関する人間のフィードバックによって、最も関連性の高い機能の十分な割合(実験では50%以上)を特定できることが示されています。平均すると、機能のラベル付けには、ドキュメントのラベル付けよりもはるかに短い時間しかかからないことがわかりました。私たちは、シミュレーション実験で標準的な能動学習を大幅に加速する、ラベル付け機能とドキュメントをインターリーブするアルゴリズムを考案しました。機能フィードバックは、ニュースフィルタリング、電子メール分類、パーソナライゼーションなどのアプリケーションで従来の能動学習を補完できます。これらのアプリケーションでは、人間の教師が機能の関連性について重要な知識を持つことができます。

Structured Prediction, Dual Extragradient and Bregman Projections
構造化予測、双対エクストラグラディエント、ブレグマン投影法

We present a simple and scalable algorithm for maximum-marginestimation of structured output models, including an importantclass of Markov networks and combinatorial models. We formulatethe estimation problem as a convex-concave saddle-point problemthat allows us to use simple projection methods based on thedual extragradient algorithm (Nesterov, 2003).The projection step can be solved usingdynamic programming or combinatorial algorithms for min-costconvex flow, depending on the structure of the problem. We showthat this approach provides a memory-efficient alternative toformulations based on reductions to a quadratic program (QP). Weanalyze the convergence of the method and present experiments ontwo very different structured prediction tasks: 3D imagesegmentation and word alignment, illustrating the favorablescaling properties of our algorithm.

私たちは、重要なクラスのマルコフネットワークと組み合わせモデルを含む、構造化された出力モデルの最大マージン推定のためのシンプルでスケーラブルなアルゴリズムを紹介します。推定問題を凸凹鞍点問題として定式化し、双対超勾配アルゴリズム(Nesterov、2003)に基づく単純な投影方法を使用できるようにします。射影ステップは、問題の構造に応じて、動的計画法または最小コスト凸流の組み合わせアルゴリズムを使用して解くことができます。このアプローチは、二次計画法(QP)への還元に基づく定式化に代わるメモリ効率の高い代替手段を提供することを示します。この手法の収束を分析し、3D画像セグメンテーションと単語の整列という2つの非常に異なる構造化された予測タスクに関する実験を行い、アルゴリズムの好ましいスケーリング特性を示しています。

Kernel-Based Learning of Hierarchical Multilabel Classification Models
階層型多ラベル分類モデルのカーネルベース学習

We present a kernel-based algorithm for hierarchical textclassification where the documents are allowed to belong to morethan one category at a time. The classification model is a variantof the Maximum Margin Markov Network framework, where theclassification hierarchy is represented as a Markov tree equippedwith an exponential family defined on the edges.We present an efficient optimizationalgorithm based on incremental conditional gradient ascent insingle-example subspaces spanned by the marginal dual variables.The optimization is facilitated with a dynamic programming basedalgorithm that computes best update directions in the feasible set.Experiments show that the algorithm can feasibly optimize trainingsets of thousands of examples and classification hierarchiesconsisting of hundreds of nodes. Training of the full hierarchicalmodel is as efficient as training independent SVM-light classifiersfor each node. The algorithm’s predictive accuracy was found to becompetitive with other recently introduced hierarchical multi-categoryor multilabel classification learning algorithms.

私たちは、階層的テキスト分類のためのカーネルベースのアルゴリズムを紹介します。このアルゴリズムでは、ドキュメントは一度に複数のカテゴリに属することができます。分類モデルは、最大マージンマルコフネットワークフレームワークの変形であり、分類階層は、エッジに定義された指数族を備えたマルコフツリーとして表されます。マージナルデュアル変数によって張られる単一例のサブスペースでの増分条件付き勾配上昇に基づく効率的な最適化アルゴリズムを紹介します。最適化は、実行可能なセット内で最適な更新方向を計算する動的プログラミングベースのアルゴリズムによって促進されます。実験では、このアルゴリズムが数千の例のトレーニングセットと数百のノードで構成される分類階層を効果的に最適化できることが示されています。完全な階層モデルのトレーニングは、各ノードの独立したSVMライト分類器のトレーニングと同じくらい効率的です。このアルゴリズムの予測精度は、最近導入された他の階層的マルチカテゴリまたはマルチラベル分類学習アルゴリズムと競合することがわかりました。

Efficient Learning of Label Ranking by Soft Projections onto Polyhedra
多面体へのソフトプロジェクションによるラベルランキングの効率的な学習

We discuss the problem of learning to rank labels from a real valuedfeedback associated with each label. We cast the feedback as apreferences graph where the nodes of the graph are the labels andedges express preferences over labels. We tackle the learning problemby defining a loss function for comparing a predicted graph with afeedback graph. This loss is materialized by decomposing the feedbackgraph into bipartite sub-graphs. We then adopt the maximum-marginframework which leads to a quadratic optimization problem with linearconstraints. While the size of the problem grows quadratically withthe number of the nodes in the feedback graph, we derive a problem ofa significantly smaller size and prove that it attains the sameminimum. We then describe an efficient algorithm, called SOPOPO, forsolving the reduced problem by employing a soft projection onto thepolyhedron defined by a reduced set of constraints. We also describeand analyze a wrapper procedure for batch learning when multiplegraphs are provided for training. We conclude with a set ofexperiments which show significant improvements in run time over astate of the art interior-point algorithm.

私たちは、各ラベルに関連付けられた実数値のフィードバックからラベルをランク付けする方法を学習する問題について議論します。フィードバックを好みのグラフとしてキャストします。グラフのノードはラベルで、エッジはラベルに対する好みを表します。予測されたグラフをフィードバックグラフと比較するための損失関数を定義することで、学習の問題に取り組みます。この損失は、フィードバックグラフを二部サブグラフに分解することで実現されます。次に、線形制約のある2次最適化問題につながる最大マージンフレームワークを採用します。問題のサイズはフィードバックグラフのノードの数に応じて2次的に大きくなりますが、大幅に小さいサイズの問題を導出し、同じ最小値を達成することを証明します。次に、縮小された制約セットによって定義された多面体へのソフト投影を使用して縮小された問題を解決する、SOPOPOと呼ばれる効率的なアルゴリズムについて説明します。また、トレーニング用に複数のグラフが提供される場合のバッチ学習のラッパー手順についても説明し、分析します。最後に、最先端の内点法アルゴリズムに比べて実行時間が大幅に改善されたことを示す一連の実験を紹介します。

Large Scale Multiple Kernel Learning
大規模なマルチカーネル学習

While classical kernel-based learning algorithms are based on a singlekernel, in practice it is often desirable to use multiple kernels.Lanckriet et al. (2004) considered conic combinations of kernelmatrices for classification, leading to a convex quadraticallyconstrained quadratic program. We show that it can be rewritten as asemi-infinite linear program that can be efficiently solved byrecycling the standard SVM implementations. Moreover, we generalizethe formulation and our method to a larger class of problems,including regression and one-class classification. Experimentalresults show that the proposed algorithm works for hundred thousands of examples orhundreds of kernels to be combined, and helps for automatic modelselection, improving the interpretability of the learning result. In a second part we discuss general speed up mechanism forSVMs, especially when used with sparse feature maps as appearfor string kernels, allowing us to train a string kernel SVM on a 10million real-world splice data set from computational biology. Weintegrated multiple kernel learning in our machine learning toolboxSHOGUN for which the source code is publicly available at http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun.

従来のカーネルベースの学習アルゴリズムは単一のカーネルに基づいていますが、実際には複数のカーネルを使用することが望ましい場合がよくあります。Lanckrietら(2004)は、分類のためにカーネル行列の円錐結合を検討し、凸二次制約二次計画をもたらしました。私たちは、それが半無限線形計画として書き直され、標準SVM実装を再利用することで効率的に解決できることを示しています。さらに、私たちは定式化と方法を、回帰や1クラス分類を含む、より大きなクラスの問題に一般化します。実験結果によると、提案されたアルゴリズムは、数十万の例または数百のカーネルを組み合わせても機能し、自動モデル選択に役立ち、学習結果の解釈可能性が向上します。第2部では、SVMの一般的な高速化メカニズムについて説明します。特に、文字列カーネルの場合のようにスパースな特徴マップと共に使用した場合、計算生物学からの1,000万の実際のスプライスデータセットで文字列カーネルSVMをトレーニングできます。私たちは、ソースコードをhttp://www.fml.tuebingen.mpg.de/raetsch/projects/shogunで公開している機械学習ツールボックスSHOGUNに複数のカーネル学習を統合しました。

Exact 1-Norm Support Vector Machines Via Unconstrained Convex Differentiable Minimization
制約なし凸微分可能最小化による厳密1ノルムサポートベクトルマシン

Support vector machines utilizing the 1-norm, typicallyset up as linear programs (Mangasarian, 2000; Bradley and Mangasarian, 1998), are formulated hereas a completely unconstrained minimization of a convex differentiable piecewise-quadratic objective function in the dual space. The objective function,which has a Lipschitz continuous gradient and contains only oneadditional finite parameter, can be minimized by a generalizedNewton method and leads to an exact solution of the support vectormachine problem. The approach here is based on a formulationof a very general linear program as an unconstrained minimizationproblem and its application to support vector machine classificationproblems. The present approach which generalizes both(Mangasarian, 2004) and (Fung and Mangasarian, 2004) is also applied to nonlinearapproximation where a minimal number of nonlinear kernel functionsare utilized to approximate a function from a given numberof function values.

1ノルムを利用するサポートベクターマシンは、通常、線形プログラムとして設定されます(Mangasarian、2000;Bradley and Mangasarian, 1998)は、双対空間における凸微分可能な区分二次目的関数の完全に制約のない最小化としてここで定式化されます。目的関数は、リプシッツ連続勾配を持ち、追加の有限パラメーターを1つだけ含みますが、一般化ニュートン法によって最小化でき、サポートベクトルマシン問題の厳密な解につながります。ここでのアプローチは、制約のない最小化問題としての非常に一般的な線形プログラムの定式化と、ベクトルマシンの分類問題をサポートするためのその適用に基づいています。(Mangasarian, 2004)と(Fung and Mangasarian, 2004)の両方を一般化する現在のアプローチは、最小数の非線形カーネル関数を使用して、与えられた数の関数値から関数を近似する非線形近似にも適用されます。

Building Support Vector Machines with Reduced Classifier Complexity
分類器の複雑さを軽減したサポートベクターマシンの構築

Support vector machines (SVMs), though accurate, are not preferred inapplications requiring great classification speed, due to the numberof support vectors being large. To overcome this problem we devise aprimal method with the following properties: (1) it decouples the ideaof basis functions from the concept of support vectors; (2) itgreedily finds a set of kernel basis functions of a specified maximumsize (dmax) to approximate the SVM primal cost function well; (3)it is efficient and roughly scales as O(ndmax2) where n is thenumber of training examples; and, (4) the number of basis functions itrequires to achieve an accuracy close to the SVM accuracy is usuallyfar less than the number of SVM support vectors.

サポートベクターマシン(SVM)は正確ですが、サポートベクターの数が多いため、優れた分類速度を必要とするアプリケーションには適していません。この問題を解決するために、次の特性を持つ原初の方法を考案します:(1)基底関数の概念をサポートベクトルの概念から切り離します。(2)指定された最大サイズ(dmax)のカーネル基底関数のセットを貪欲に見つけて、SVMの主コスト関数を適切に近似します。(3)効率的で、O(ndmax2)として大まかにスケーリングされます(nはトレーニング例の数)。(4)SVM精度に近い精度を達成するために必要な基底関数の数は、通常、SVMサポートベクトルの数よりもはるかに少なくなります。

Parallel Software for Training Large Scale Support Vector Machines on Multiprocessor Systems
マルチプロセッサシステム上での大規模サポートベクターマシンの学習のための並列ソフトウェア

Parallel software for solving the quadratic program arising in trainingsupport vector machines for classification problems is introduced.The software implements an iterative decomposition technique and exploitsboth the storage and the computing resourcesavailable on multiprocessor systems, by distributingthe heaviest computational tasks of each decomposition iteration.Based on a wide range of recent theoretical advances,relevant decomposition issues, such as the quadraticsubproblem solution, the gradient updating, the working set selection,are systematically described andtheir careful combination to get aneffective parallel tool is discussed.A comparison with state-of-the-art packages on benchmark problemsdemonstrates the good accuracy and the remarkable time saving achievedby the proposed software. Furthermore, challenging experiments on real-world data sets with millions training samples highlight how the software makeslarge scale standard nonlinear support vector machineseffectively tractable on common multiprocessor systems.This feature is not shown by any of the available codes.

分類問題のサポートベクターマシンのトレーニングで生じる二次計画を解くための並列ソフトウェアを紹介します。このソフトウェアは反復分解手法を実装し、各分解反復の最も重い計算タスクを分散することで、マルチプロセッサシステムで利用可能なストレージとコンピューティングリソースの両方を活用します。最近の幅広い理論的進歩に基づいて、二次サブ問題の解決、勾配の更新、ワーキングセットの選択などの関連する分解の問題が体系的に説明され、効果的な並列ツールを得るためのそれらの慎重な組み合わせについて説明します。ベンチマーク問題に関する最先端のパッケージとの比較により、提案されたソフトウェアによって達成される優れた精度と顕著な時間節約が実証されます。さらに、数百万のトレーニングサンプルを含む実際のデータセットでの困難な実験により、このソフトウェアが大規模な標準非線形サポートベクターマシンを一般的なマルチプロセッサシステムで効果的に処理できるようにする方法が明らかになりました。この機能は、利用可能などのコードにも示されていません。

Maximum-Gain Working Set Selection for SVMs
SVM の最大ゲインワーキングセットの選択

Support vector machines are trained by solving constrainedquadraticoptimization problems.This is usually done with an iterative decomposition algorithmoperating on a small working set of variables in every iteration.The training time strongly depends on the selection of thesevariables. We propose the maximum-gain working set selectionalgorithm for large scale quadratic programming. It is based on theidea to greedily maximize the progress in each single iteration. Thealgorithm takes second order information from cached kernel matrixentries into account. We prove the convergence to an optimalsolution of a variant termed hybrid maximum-gain working setselection. This method is empirically compared to the prominentmost violating pair selection and the latest algorithm using secondorder information. For large training sets our new selection schemeis significantly faster.

サポートベクターマシンは、constrainedquadraticoptimization問題を解くことによって学習されます。これは通常、反復ごとに変数の小さなワーキングセットを操作する反復分解アルゴリズムを使用して行われます。学習時間は、これらの変数の選択に大きく依存します。大規模二次計画法のための最大ゲインワーキングセット選択アルゴリズムを提案します。これは、各反復の進行状況を貪欲に最大化するという考えに基づいています。このアルゴリズムは、キャッシュされたカーネル行列エントリからの二次情報を考慮に入れます。ハイブリッド最大ゲインワーキングセット選択と呼ばれるバリアントの最適解への収束を証明します。この方法は、経験的に、最も違反しているペアの選択と、二次情報を使用した最新のアルゴリズムと比較されます。大規模なトレーニングセットの場合、新しい選択スキームは大幅に高速です。

Fast SDP Relaxations of Graph Cut Clustering, Transduction, and Other Combinatorial Problems
グラフカットクラスタリング,形質導入,およびその他の組み合わせ問題に対するSDPの迅速な緩和

The rise of convex programming has changed the face of many researchfields in recent years, machine learning being one of the ones thatbenefitted the most. A very recent developement, the relaxation ofcombinatorial problems to semi-definite programs (SDP), has gainedconsiderable attention over the last decade (Helmberg, 2000; De Bie and Cristianini, 2004a).Although SDP problems can be solved in polynomial time, for manyrelaxations the exponent in the polynomial complexity bounds is toohigh for scaling to large problem sizes. This has hampered theiruptake as a powerful new tool in machine learning.In this paper, we present a new and fast SDP relaxation of thenormalized graph cut problem, and investigate its usefulness inunsupervised and semi-supervised learning. In particular, thisprovides a convex algorithm for transduction, as well as approachesto clustering. We further propose a whole cascade of fastrelaxations that all hold the middle between older spectralrelaxations and the new SDP relaxation, allowing one to trade offcomputational cost versus relaxation accuracy. Finally, we discusshow the methodology developed in this paper can be applied to othercombinatorial problems in machine learning, and we treat the max-cutproblem as an example.

近年、凸計画法の台頭により多くの研究分野が様相を変えており、機械学習はその恩恵を最も受けた分野の1つです。ごく最近の開発である、組み合わせ問題の半正定値計画法(SDP)への緩和は、過去10年間でかなりの注目を集めています(Helmberg、2000年、De BieおよびCristianini、2004a)。SDP問題は多項式時間で解決できますが、多くの緩和では、多項式の複雑さの境界の指数が大きすぎて、大規模な問題にスケーリングできません。このため、機械学習の強力な新しいツールとして採用されることが妨げられてきました。この論文では、正規化グラフカット問題の新しい高速SDP緩和法を紹介し、教師なし学習および半教師あり学習におけるその有用性について調査します。特に、これはトランスダクション用の凸アルゴリズムとクラスタリングへのアプローチを提供します。さらに、古いスペクトル緩和と新しいSDP緩和の中間に位置する一連の高速緩和を提案し、計算コストと緩和精度のトレードオフを可能にします。最後に、この論文で開発された方法論を機械学習の他の組み合わせ問題にどのように適用できるかを説明し、最大カット問題を例として扱います。

Learning Sparse Representations by Non-Negative Matrix Factorization and Sequential Cone Programming
非負行列因数分解と逐次円錐計画法によるスパース表現の学習

We exploit the biconvex nature of the Euclidean non-negative matrixfactorization (NMF) optimization problem to derive optimizationschemes based on sequential quadratic and second order coneprogramming. We show that for ordinary NMF, our approach performsas well as existing state-of-the-art algorithms, while forsparsity-constrained NMF, as recently proposed by P. O. Hoyer inJMLR 5 (2004), it outperforms previous methods. In addition,we show how to extend NMF learning within the same optimizationframework in order to make use of class membership information insupervised learning problems.

私たちは、ユークリッド非負行列分解(NMF)最適化問題の両凸性を利用して、逐次2次および2次コーンプログラミングに基づく最適化スキームを導出します。通常のNMFでは、私たちのアプローチは既存の最先端のアルゴリズムと同様に機能しますが、Forsparsity制約のあるNMFは、最近JMLR 5(2004)でP.O.Hoyerによって提案されたように、以前の方法よりも優れていることを示しています。さらに、クラスメンバーシップ情報の教師あり学習問題を活用するために、同じ最適化フレームワーク内でNMF学習を拡張する方法を示します。

Bayesian Network Learning with Parameter Constraints
パラメーター制約を使用したベイジアンネットワーク学習

The task of learning models for many real-world problems requiresincorporating domain knowledge into learning algorithms, to enableaccurate learning from a realistic volume of training data. This paperconsiders a variety of types of domain knowledge for constrainingparameter estimates when learning Bayesian networks. In particular, weconsider domain knowledge that constrains the values or relationshipsamong subsets of parameters in a Bayesian network with knownstructure.We incorporate a wide variety of parameter constraints into learningprocedures for Bayesian networks, by formulating this task as aconstrained optimization problem. The assumptions made in modulenetworks, dynamic Bayes nets and context specific independence modelscan be viewed as particular cases of such parameter constraints. Wepresent closed form solutions or fast iterative algorithms forestimating parameters subject to several specific classes of parameterconstraints, including equalities and inequalities among parameters,constraints on individual parameters, and constraints on sums andratios of parameters, for discrete and continuous variables. Ourmethods cover learning from both frequentist and Bayesian points ofview, from both complete and incomplete data.We present formal guarantees for our estimators, as well as methodsfor automatically learning useful parameter constraints from data. Tovalidate our approach, we apply it to the domain of fMRI brain imageanalysis. Here we demonstrate the ability of our system to first learnuseful relationships among parameters, and then to use them toconstrain the training of the Bayesian network, resulting in improvedcross-validated accuracy of the learned model. Experiments onsynthetic data are also presented.

多くの現実の問題のモデルを学習するタスクでは、現実的な量のトレーニングデータから正確な学習を可能にするために、学習アルゴリズムにドメイン知識を組み込む必要があります。この論文では、ベイジアンネットワークを学習する際にパラメータ推定を制約するためのさまざまな種類のドメイン知識について検討します。特に、既知の構造を持つベイジアンネットワーク内のパラメータのサブセット間の値または関係を制約するドメイン知識について検討します。私たちは、このタスクを制約付き最適化問題として定式化することにより、ベイジアンネットワークの学習手順にさまざまなパラメータ制約を組み込む。モジュールネットワーク、動的ベイズネット、およびコンテキスト固有の独立モデルで行われた仮定は、そのようなパラメータ制約の特定のケースとして見ることができます。離散変数と連続変数について、パラメータ間の等式と不等式、個々のパラメータの制約、パラメータの合計と比率の制約など、いくつかの特定のパラメータ制約のクラスに従うパラメータを推定する閉形式のソリューションまたは高速反復アルゴリズムを示します。私たちの方法は、頻度論とベイズ主義の両方の観点から、完全データと不完全データの両方から学習します。推定量に対する正式な保証と、データから有用なパラメータ制約を自動的に学習する方法を示します。私たちのアプローチを検証するために、それをfMRI脳画像分析の領域に適用します。ここでは、最初にパラメータ間の有用な関係を学習し、次にそれを使用してベイズネットワークのトレーニングを制約し、学習したモデルのクロス検証精度を向上させるシステムの能力を示します。合成データでの実験も示します。

Linear Programs for Hypotheses Selection in Probabilistic Inference Models
確率的推論モデルにおける仮説選択のための線形プログラム

We consider an optimization problem in probabilistic inference: Givenn hypotheses Hj, m possible observations Ok, theirconditional probabilities pkj, and a particular Ok, select apossibly small subset of hypotheses excluding the true target onlywith some error probability ε. After specifying theoptimization goal we show that this problem can be solved through alinear program in mn variables that indicate the probabilities todiscard a hypothesis given an observation. Moreover, we can computeoptimal strategies where only O(m+n) of these variables getfractional values. The manageable size of the linear programs and themostly deterministic shape of optimal strategies makes the methodpracticable. We interpret the dual variables as worst-casedistributions of hypotheses, and we point out some counterintuitivenonmonotonic behaviour of the variables as a function of the errorbound ε. One of the open problems is the existence of apurely combinatorial algorithm that is faster than generic linearprogramming.

私たちは、確率推論における最適化問題を検討します。n個の仮説Hj、m個の可能な観測Ok、それらの条件付き確率pkj、および特定のOkが与えられた場合、真のターゲットを除外した仮説の可能な限り小さなサブセットを、あるエラー確率 ε でのみ選択します。最適化目標を指定した後、この問題は、観測が与えられた場合に仮説を破棄する確率を示すmn個の変数の線形計画によって解決できることを示します。さらに、これらの変数のO(m+n)のみが小数値を取得する最適な戦略を計算できます。線形計画の管理可能なサイズと、最適な戦略のほぼ決定論的な形状により、この方法は実用的になります。私たちは、双対変数を仮説の最悪分布として解釈し、誤差限界 ε の関数として変数の直感に反する非単調な動作を指摘します。未解決の問題の1つは、一般的な線形計画法よりも高速な純粋に組み合わせ的なアルゴリズムの存在です。

Ensemble Pruning Via Semi-definite Programming
半定値計画法によるアンサンブルプルーニング

An ensemble is a group of learning models that jointly solve aproblem. However, the ensembles generated by existing techniques aresometimes unnecessarily large, which can lead to extra memory usage,computational costs, and occasional decreases in effectiveness. Thepurpose of ensemble pruning is to search for a good subset of ensemblemembers that performs as well as, or better than, the originalensemble. This subset selection problem is a combinatorialoptimization problem and thus finding the exact optimal solution iscomputationally prohibitive. Various heuristic methods have beendeveloped to obtain an approximate solution. However, most of theexisting heuristics use simple greedy search as the optimizationmethod, which lacks either theoretical or empirical qualityguarantees. In this paper, the ensemble subset selection problem isformulated as a quadratic integer programming problem. By applyingsemi-definite programming (SDP) as a solution technique, we are ableto get better approximate solutions. Computational experiments showthat this SDP-based pruning algorithm outperforms other heuristics inthe literature. Its application in a classifier-sharing study alsodemonstrates the effectiveness of the method.

アンサンブルとは、共同で問題を解決する学習モデルのグループです。しかし、既存の手法で生成されるアンサンブルは、不必要に大きくなる場合があり、余分なメモリ使用量、計算コスト、および有効性の低下につながることがあります。アンサンブルプルーニングの目的は、元のアンサンブルと同等かそれ以上のパフォーマンスを発揮するアンサンブルメンバーの適切なサブセットを探すことです。このサブセット選択問題は組み合わせ最適化問題であるため、正確な最適解を見つけるのは計算上不可能です。近似解を得るために、さまざまなヒューリスティック手法が開発されてきました。しかし、既存のヒューリスティックのほとんどは、最適化手法として単純な貪欲探索を使用しており、理論的または経験的な品質保証がありません。この論文では、アンサンブルサブセット選択問題を2次整数計画問題として定式化します。解決手法として半正定値計画(SDP)を適用することで、より優れた近似解を得ることができます。計算実験により、このSDPベースのプルーニングアルゴリズムは、文献にある他のヒューリスティックよりも優れていることが示されています。分類子共有研究への適用も、この方法の有効性を実証しています。

Second Order Cone Programming Approaches for Handling Missing and Uncertain Data
欠損データおよび不確実なデータを処理するための 2 次コーンプログラミングアプローチ

We propose a novel second order cone programming formulation fordesigning robust classifiers which can handle uncertainty inobservations. Similar formulations are also derived for designingregression functions which are robust to uncertainties in theregression setting. The proposed formulations are independent of theunderlying distribution, requiring only the existence of secondorder moments. These formulations are then specialized to the caseof missing values in observations for both classification andregression problems. Experiments show that the proposedformulations outperform imputation.

私たちは、不確実性の観測を処理できるロバストな分類器を設計するための新しい2次コーンプログラミング定式化を提案します。同様の定式化は、回帰設定の不確実性に対してロバストな回帰関数を設計するためにも導き出されます。提案された定式化は、基底分布から独立しており、2次モーメントの存在のみを必要とします。これらの定式化は、分類問題と回帰問題の両方の観測値の欠損値の場合に特化されます。実験は、提案された製剤が代入よりも優れていることを示しています。

The Interplay of Optimization and Machine Learning Research
最適化と機械学習研究の相互作用

The fields of machine learning and mathematicalprogramming are increasingly intertwined. Optimization problemslie at the heart of most machine learning approaches. The SpecialTopic on Machine Learning and Large Scale Optimization examinesthis interplay. Machine learning researchers have embraced theadvances in mathematical programming allowing new types of modelsto be pursued. The special topic includes models using quadratic,linear, second-order cone, semi-definite, and semi-infiniteprograms. We observe that the qualities of good optimizationalgorithms from the machine learning and optimization perspectivescan be quite different. Mathematical programming puts a premium onaccuracy, speed, and robustness. Since generalization is thebottom line in machine learning and training is normally doneoff-line, accuracy and small speed improvements are of littleconcern in machine learning. Machine learning prefers simpleralgorithms that work in reasonable computational time forspecific classes of problems. Reducing machine learning problemsto well-explored mathematical programming classes with robustgeneral purpose optimization codes allows machine learningresearchers to rapidly develop new techniques. In turn, machinelearning presents new challenges to mathematical programming. Thespecial issue include papers from two primary themes: novelmachine learning models and novel optimization approaches forexisting models. Many papers blend both themes, making smallchanges in the underlying core mathematical program that enablethe develop of effective new algorithms.

機械学習と数学プログラミングの分野はますます絡み合っています。最適化の問題は、ほとんどの機械学習アプローチの中心にあります。機械学習と大規模最適化に関する特別トピックでは、この相互作用について検討します。機械学習の研究者は、新しいタイプのモデルを追求できる数理計画法の進歩を受け入れてきました。特別トピックには、2次、線形、2次円錐、半正定値、半無限プログラムを使用するモデルが含まれます。機械学習と最適化の観点から見た優れた最適化アルゴリズムの品質は、かなり異なる可能性があることがわかります。数理計画法では、精度、速度、堅牢性が重視されます。機械学習では一般化が基本であり、トレーニングは通常オフラインで行われるため、機械学習では精度と速度のわずかな改善はあまり重要ではありません。機械学習では、特定のクラスの問題に対して妥当な計算時間で機能する単純なアルゴリズムが好まれます。機械学習の問題を、堅牢な汎用最適化コードを使用して十分に調査された数理計画法クラスに縮小することで、機械学習の研究者は新しい手法を迅速に開発できます。その結果、機械学習は数理計画法に新たな課題をもたらします。この特集号には、新しい機械学習モデルと既存のモデルに対する新しい最適化アプローチという2つの主なテーマの論文が掲載されています。多くの論文では両方のテーマが融合されており、基礎となるコア数学プログラムに小さな変更を加えることで、効果的な新しいアルゴリズムの開発を可能にしています。

Nonparametric Quantile Estimation
ノンパラメトリック分位点推定

In regression, the desired estimate of y|x is not always given by a conditional mean, although this is most common. Sometimes one wants to obtain a good estimate that satisfies the property that a proportion, τ, of y|x, will be below the estimate. For τ = 0.5 this is an estimate of the median. What might be called median regression, is subsumed under the term quantile regression. We present a nonparametric version of a quantile estimator, which can be obtained by solving a simple quadratic programming problem and provide uniform convergence statements and bounds on the quantile property of our estimator. Experimental results show the feasibility of the approach and competitiveness of our method with existing ones. We discuss several types of extensions including an approach to solve the quantile crossing problems, as well as a method to incorporate prior qualitative knowledge such as monotonicity constraints.

回帰では、y|xの望ましい推定値は、条件付き平均によって常に与えられるわけではありませんが、これは最も一般的です。時には、y|xの比率 τ が推定値を下回るという特性を満たす適切な推定値を取得したい場合があります。τ= 0.5の場合、これは中央値の推定値です。中央値回帰と呼ばれるものは、分位点回帰という用語に包含されます。私たちは、単純な二次計画問題を解くことによって取得でき、推定量の分位特性に関する一様な収束ステートメントと境界を提供する、分位点推定量のノンパラメトリックバージョンを提示します。実験結果は、私たちの方法のアプローチの実現可能性と既存の方法との競争力を示しています。分位点交差問題を解決するためのアプローチや、単調性制約などの事前の定性的知識を組み込む方法など、いくつかのタイプの拡張について説明します。

Worst-Case Analysis of Selective Sampling for Linear Classification
線形分類のための選択的サンプリングのワーストケース分析

A selective sampling algorithm is a learning algorithm forclassification that, based on the past observed data, decides whetherto ask the label of each new instance to be classified. In thispaper, we introduce a general technique for turning linear-thresholdclassification algorithms from the general additive family intorandomized selective sampling algorithms. For the most popularalgorithms in this family we derive mistake bounds that hold forindividual sequences of examples. These bounds show that oursemi-supervised algorithms can achieve, on average, the same accuracyas that of their fully supervised counterparts, but using fewerlabels. Our theoretical results are corroborated by a number ofexperiments on real-world textual data. The outcome of theseexperiments is essentially predicted by our theoretical results: Ourselective sampling algorithms tend to perform as well as thealgorithms receiving the true label after each classification, whileobserving in practice substantially fewer labels.

選択的サンプリングアルゴリズムは、過去の観測データに基づいて、分類する新しいインスタンスのラベルを要求するかどうかを決定する分類学習アルゴリズムです。この論文では、一般的な加法ファミリーの線形しきい値分類アルゴリズムをランダム選択的サンプリングアルゴリズムに変換する一般的な手法を紹介します。このファミリーで最も人気のあるアルゴリズムについて、個々の例のシーケンスに適用されるエラー境界を導き出します。これらの境界は、半教師ありアルゴリズムが、ラベルの数が少なくても、平均して完全教師ありアルゴリズムと同じ精度を達成できることを示しています。理論上の結果は、実際のテキストデータに対する多数の実験によって裏付けられています。これらの実験の結果は、基本的に理論上の結果によって予測されます。選択的サンプリングアルゴリズムは、各分類後に真のラベルを受け取るアルゴリズムと同等のパフォーマンスを発揮する傾向がありますが、実際には大幅に少ないラベルしか観測されません。

Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis
ヌル空間の計算・理論解析と直交線形判別解析

Dimensionality reduction is an important pre-processing step in manyapplications. Linear discriminant analysis (LDA) is a classicalstatistical approach for supervised dimensionality reduction. It aimsto maximize the ratio of the between-class distance to thewithin-class distance, thus maximizing the class discrimination. Ithas been used widely in many applications. However, the classical LDAformulation requires the nonsingularity of the scatter matricesinvolved. For undersampled problems, where the data dimensionality ismuch larger than the sample size, all scatter matrices are singularand classical LDA fails. Many extensions, including null space LDA(NLDA) and orthogonal LDA (OLDA), have been proposed in the past toovercome this problem. NLDA aims to maximize the between-classdistance in the null space of the within-class scatter matrix, whileOLDA computes a set of orthogonal discriminant vectors via thesimultaneous diagonalization of the scatter matrices. They have beenapplied successfully in various applications.In this paper, we present a computational and theoretical analysis ofNLDA and OLDA. Our main result shows that under a mild conditionwhich holds in many applications involving high-dimensional data, NLDAis equivalent to OLDA. We have performed extensive experiments onvarious types of data and results are consistent with our theoreticalanalysis. We further apply the regularization to OLDA. The algorithmis called regularized OLDA (or ROLDA for short). An efficientalgorithm is presented to estimate the regularization value in ROLDA.A comparative study on classification shows that ROLDA is verycompetitive with OLDA. This confirms the effectiveness of theregularization in ROLDA.

次元削減は、多くのアプリケーションで重要な前処理ステップです。線形判別分析(LDA)は、教師あり次元削減のための古典的な統計的アプローチです。クラス間距離とクラス内距離の比率を最大化し、クラス判別を最大化することを目的としています。多くのアプリケーションで広く使用されています。ただし、古典的なLDA定式化では、関係する散布行列が非特異性である必要があります。データの次元がサンプルサイズよりもはるかに大きい、サンプル数が少ない問題の場合、すべての散布行列が特異となり、古典的なLDAは失敗します。この問題を克服するために、ヌル空間LDA (NLDA)や直交LDA (OLDA)など、多くの拡張が過去に提案されました。NLDAは、クラス内散布行列のヌル空間でクラス間距離を最大化することを目指し、OLDAは、散布行列の同時対角化によって一連の直交判別ベクトルを計算します。これらはさまざまなアプリケーションでうまく適用されています。この論文では、NLDAとOLDAの計算および理論分析を紹介します。主な結果は、高次元データを含む多くのアプリケーションに当てはまる穏やかな条件下では、NLDAはOLDAと同等であることを示しています。さまざまな種類のデータで広範な実験を実行し、結果は理論分析と一致しています。さらに、OLDAに正則化を適用します。アルゴリズムは、正則化OLDA (または略してROLDA)と呼ばれます。ROLDAの正則化値を推定するための効率的なアルゴリズムが提示されています。分類の比較研究により、ROLDAはOLDAと非常に競争力があることがわかりました。これにより、ROLDAの正則化の有効性が裏付けられます。

A Very Fast Learning Method for Neural Networks Based on Sensitivity Analysis
感度解析に基づくニューラルネットワークの超高速学習法

This paper introduces a learning method for two-layer feedforwardneural networks based on sensitivity analysis, which uses a lineartraining algorithm for each of the two layers. First, random valuesare assigned to the outputs of the first layer; later, these initialvalues are updated based on sensitivity formulas, which use theweights in each of the layers; the process is repeated untilconvergence. Since these weights are learnt solving a linear systemof equations, there is an important saving in computational time.The method also gives the local sensitivities of the least squareerrors with respect to input and output data, with no extracomputational cost, because the necessary information becomesavailable without extra calculations. This method, called theSensitivity-Based Linear Learning Method, can also be used toprovide an initial set of weights, which significantly improves thebehavior of other learning algorithms. The theoretical basis for themethod is given and its performance is illustrated by itsapplication to several examples in which it is compared with severallearning algorithms and well known data sets. The results have showna learning speed generally faster than other existing methods. Inaddition, it can be used as an initialization tool for other wellknown methods with significant improvements.

この論文では、2つの層それぞれに線形トレーニングアルゴリズムを使用する、感度分析に基づく2層フィードフォワードニューラルネットワークの学習方法を紹介します。まず、最初の層の出力にランダムな値が割り当てられます。その後、これらの初期値は、各層の重みを使用する感度式に基づいて更新されます。このプロセスは、収束するまで繰り返されます。これらの重みは、線形方程式を解くことで学習されるため、計算時間が大幅に節約されます。この方法では、必要な情報が追加の計算なしで利用可能になるため、余分な計算コストなしで、入力データと出力データに関する最小二乗誤差のローカル感度も得られます。感度ベースの線形学習法と呼ばれるこの方法は、重みの初期セットを提供するためにも使用できます。これにより、他の学習アルゴリズムの動作が大幅に改善されます。この方法の理論的根拠が示され、いくつかの学習アルゴリズムやよく知られているデータセットと比較するいくつかの例に適用することで、そのパフォーマンスが示されます。結果から、他の既存の方法よりも一般的に高速な学習速度が示されました。さらに、この方法は、大幅に改善された他のよく知られている方法の初期化ツールとして使用することもできます。

New Algorithms for Efficient High-Dimensional Nonparametric Classification
効率的な高次元ノンパラメトリック分類のための新しいアルゴリズム

This paper is about non-approximate acceleration of high-dimensionalnonparametric operations such as k nearest neighbor classifiers. Weattempt to exploit the fact that even if we want exact answers tononparametric queries, we usually do not need to explicitly find thedata points close to the query, but merely need to answer questionsabout the properties of that set of data points. This offers a smallamount of computational leeway, and we investigate how much thatleeway can be exploited. This is applicable to many algorithms innonparametric statistics, memory-based learning and kernel-basedlearning. But for clarity, this paper concentrates on pure k-NNclassification. We introduce new ball-tree algorithms that onreal-world data sets give accelerations from 2-fold to 100-foldcompared against highly optimized traditional ball-tree-basedk-NN. These results include data sets with up to 106 dimensions and 105 records, and demonstrate non-trivial speed-ups while giving exact answers.

この論文では、k最近傍分類器などの高次元ノンパラメトリック演算の非近似的加速に関するものです。ノンパラメトリッククエリに対する正確な答えが必要な場合でも、通常はクエリに近いデータポイントを明示的に見つける必要はなく、データポイントのセットのプロパティに関する質問に答えるだけでよいという事実を利用しようとします。これにより、計算の余裕がわずかになり、その余裕をどれだけ利用できるかを調査します。これは、インノンパラメトリック統計、メモリベースの学習、カーネルベースの学習など、多くのアルゴリズムに適用できます。しかし、明確にするために、この論文では純粋なk-NNclassificationに焦点を当てています。私たちは、現実世界のデータセットが高度に最適化された従来のボールツリーベースのk-NNと比較して2倍から100倍の加速度を与える新しいボールツリーアルゴリズムを紹介します。これらの結果には、最大106の次元と105のレコードを持つデータセットが含まれており、正確な答えを導き出しながら、重要な高速化を示しています。

Step Size Adaptation in Reproducing Kernel Hilbert Space
カーネルヒルベルト空間の再現におけるステップサイズ適応

This paper presents an online support vector machine (SVM) that usesthe stochastic meta-descent (SMD) algorithm to adapt its step sizeautomatically. We formulate the online learning problem as astochastic gradient descent in reproducing kernel Hilbert space(RKHS) and translate SMD to the nonparametric setting, where itsgradient trace parameter is no longer a coefficient vector but anelement of the RKHS. We derive efficient updates that allow us toperform the step size adaptation in linear time. We apply theonline SVM framework to a variety of loss functions, and inparticular show how to handle structured output spaces and achieveefficient online multiclass classification. Experiments show thatour algorithm outperforms more primitive methods for setting thegradient step size.

この論文では、確率的メタディセント(SMD)アルゴリズムを使用してステップサイズを自動的に適応させるオンラインサポートベクターマシン(SVM)を紹介します。オンライン学習の問題は、カーネルヒルベルト空間(RKHS)の再現におけるアストキャスタスティック勾配降下法として定式化し、SMDをノンパラメトリック設定に変換します。そこでは、その勾配トレースパラメータはもはや係数ベクトルではなく、RKHSの要素です。効率的な更新を導き出し、ステップサイズの適応を線形時間で実行できます。オンラインSVMフレームワークをさまざまな損失関数に適用し、特に構造化された出力空間の処理方法と効率的なオンライン多クラス分類の達成方法を示します。実験によると、このアルゴリズムは、勾配ステップサイズを設定するためのより原始的な方法よりも優れています。

Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
多腕バンディットの行動排除・停止条件と強化学習問題

We incorporate statistical confidence intervals in both themulti-armed bandit and the reinforcement learning problems. In thebandit problem we show that given n arms, it suffices to pull thearms a total of O((n/ε2)log(1/δ)) times tofind an ε-optimal arm with probability of at least 1-δ.This bound matches the lower bound of Mannor and Tsitsiklis (2004)up to constants. We also devise action eliminationprocedures in reinforcement learning algorithms. We describe aframework that is based on learning the confidence interval aroundthe value function or the Q-function and eliminating actions thatare not optimal (with high probability). We provide a model-basedand a model-free variants of the elimination method. We furtherderive stopping conditions guaranteeing that the learned policy isapproximately optimal with high probability. Simulations demonstratea considerable speedup and added robustness over ε-greedyQ-learning.

私たちは、統計的な信頼区間をマルチアームバンディット問題と強化学習問題の両方に組み込みます。バンディット問題では、n個の腕が与えられた場合、少なくとも1-δ の確率で ε-最適な腕を見つけるには、腕を合計O((n/ε2)log(1/δ))回引っ張るだけで十分であることを示します。この境界は、定数までのMannor and Tsitsiklis (2004)の下限と一致します。また、強化学習アルゴリズムにおける行動消去手順の工夫も行います。価値関数またはQ関数の周りの信頼区間を学習し、最適でない(高い確率で)アクションを排除することに基づくフレームワークについて説明します。モデルベースとモデルフリーの消去法のバリエーションを提供します。さらに、学習した方策がほぼ最適で高い確率であることを保証する停止条件を導出します。シミュレーションでは、ε-greedyQ-learningよりも大幅な高速化と堅牢性の向上が実証されています。

A Graphical Representation of Equivalence Classes of AMP Chain Graphs
AMP チェーングラフの等価クラスのグラフ表現

This paper deals with chain graph models under alternative AMPinterpretation. A new representative of an AMP Markov equivalenceclass, called the largest deflagged graph, is proposed.The representative is based on revealed internal structure of theAMP Markov equivalence class. More specifically, the AMP Markovequivalence class decomposes into finer strong equivalenceclasses and there exists a distinguished strong equivalence classamong those forming the AMP Markov equivalence class. The largestdeflagged graph is the largest chain graph in that distinguishedstrong equivalence class. A composed graphical procedure to getthe largest deflagged graph on the basis of any AMP Markov equivalentchain graph is presented. In general, the largest deflagged graphdiffers from the AMP essential graph, which is anotherrepresentative of the AMP Markov equivalence class.

この論文では、代替のAMPinterpretationの下でチェーングラフモデルを扱います。AMPマルコフ等価クラスの新しい代表、つまり最大のフラグ解除グラフが提案されています。代表は、AMPマルコフ等価クラスの明らかにされた内部構造に基づいています。より具体的には、AMPマルコフ同値クラスは、より細かい強い同等性クラスに分解され、AMPマルコフ同値クラスを形成するものの中には、区別された強い同等性クラスが存在します。largestdeflaggedグラフは、そのdistinguishedstrong同等性クラスで最大のチェーングラフです。任意のAMPマルコフ等価チェーングラフに基づいてフラグ解除された最大のグラフを取得するための構成されたグラフィカルプロシージャが表示されます。一般に、フラグが立てられた最大のグラフは、AMPマルコフ等価クラスの別の代表であるAMPエッセンシャルグラフとは異なります。

One-Class Novelty Detection for Seizure Analysis from Intracranial EEG
頭蓋内脳波からの発作解析のためのOne-Class Novelty検出

This paper describes an application of one-class support vectormachine (SVM) novelty detection for detecting seizures in humans. Ourtechnique maps intracranial electroencephalogram (EEG) time seriesinto corresponding novelty sequences by classifying short-time,energy-based statistics computed from one-second windows of data. Wetrain a classifier on epochs of interictal (normal) EEG. During ictal(seizure) epochs of EEG, seizure activity induces distributionalchanges in feature space that increase the empirical outlierfraction. A hypothesis test determines when the parameter changediffers significantly from its nominal value, signaling a seizuredetection event. Outputs are gated in a .one-shot. manner usingpersistence to reduce the false alarm rate of the system. The detectorwas validated using leave-one-out cross-validation (LOO-CV) on asample of 41 interictal and 29 ictal epochs, and achieved 97.1%sensitivity, a mean detection latency of -7.58 seconds, and anasymptotic false positive rate (FPR) of 1.56 false positives per hour(Fp/hr). These results are better than those obtained from a noveltydetection technique based on Mahalanobis distance outlier detection,and comparable to the performance of a supervised learning techniqueused in experimental implantable devices (Echauz et al., 2001). Thenovelty detection paradigm overcomes three significant limitations ofcompeting methods: the need to collect seizure data, precisely markseizure onset and offset times, and perform patient-specific parametertuning for detector training.

この論文では、1クラスサポートベクターマシン(SVM)による新規性検出を人間の発作検出に適用する方法について説明します。この技術では、1秒間のデータウィンドウから計算された短時間のエネルギーベースの統計を分類することにより、頭蓋内脳波(EEG)時系列を対応する新規性シーケンスにマッピングします。分類器を発作間欠期(正常) EEGのエポックでトレーニングします。発作期(発作) EEGエポックでは、発作活動によって特徴空間の分布が変化し、経験的外れ値の割合が増加します。仮説検定により、パラメーターの変化が公称値と大幅に異なる場合(発作検出イベントの信号)を判断します。出力は、システムの誤報率を減らすために持続性を使用して「ワンショット」方式でゲートされます。この検出器は、41の発作間欠期と29の発作期のサンプルで、Leave-one-outクロス検証(LOO-CV)を使用して検証され、97.1%の感度、平均検出待ち時間-7.58秒、漸近的偽陽性率(FPR) 1.56偽陽性/時間(Fp/hr)を達成しました。これらの結果は、マハラノビス距離外れ値検出に基づく新規性検出手法から得られた結果よりも優れており、実験的な埋め込み型デバイスで使用される教師あり学習手法のパフォーマンスに匹敵します(Echauzら、2001年)。新規性検出パラダイムは、競合方法の3つの重要な制限、つまり発作データを収集する必要性、発作の開始時間と終了時間を正確にマークする必要があること、検出器のトレーニングのために患者固有のパラメータ調整を実行する必要があることを克服します。

Sparse Boosting
スパースブースティング

We propose Sparse Boosting (the SparseL2Boost algorithm), a variant on boosting with the squared error loss. SparseL2Boost yields sparsersolutions than the previously proposed L2Boosting by minimizing some penalized L2-loss functions, the FPE model selection criteria, through small-step gradient descent. Although boosting may give already relatively sparse solutions, for example corresponding to thesoft-thresholding estimator in orthogonal linear models, there is sometimesa desire for more sparseness to increase prediction accuracy and abilityfor better variable selection: such goals can be achieved withSparseL2Boost. We prove an equivalence of SparseL2Boost to Breiman’s nonnegative garroteestimator for orthogonal linear models and demonstrate the genericnature of SparseL2Boost for nonparametric interaction modeling. For an automatic selection of the tuning parameterin SparseL2Boost we propose to employ the gMDL model selection criterion which can also be used for early stopping of L2Boosting. Consequently, we can select between SparseL2Boost and L2Boosting by comparing their gMDL scores.

私たちは、二乗誤差損失によるブースティングのバリエーションであるスパースブースティング(SparseL2Boostアルゴリズム)を提案します。SparseL2Boostは、小ステップ勾配降下法によって、ペナルティ付きL2損失関数(FPEモデル選択基準)を最小化することで、以前に提案されたL2ブースティングよりもスパースなソリューションを生成します。ブースティングによって、たとえば直交線形モデルのソフトしきい値推定量に対応する、すでに比較的スパースなソリューションが得られる場合もありますが、予測精度と変数選択能力を向上させるために、さらにスパース性を高めることが望まれる場合があります。このような目標は、SparseL2Boostで達成できます。私たちは、SparseL2Boostが直交線形モデルに対するBreimanの非負ガロット推定量と同等であることを証明し、ノンパラメトリック相互作用モデリングに対するSparseL2Boostの汎用性を示します。SparseL2Boostのチューニングパラメータの自動選択には、L2Boostingの早期停止にも使用できるgMDLモデル選択基準を使用することを提案します。その結果、gMDLスコアを比較することで、SparseL2BoostとL2Boostingを選択できます。

Quantile Regression Forests
分位点回帰フォレスト

Random forests were introduced as a machine learning tool in Breiman (2001) and havesince proven to be very popular and powerful for high-dimensional regression and classification. For regression, random forests give an accurate approximation of theconditional mean of a response variable. It is shown here that random forests provide information aboutthe full conditional distribution of the response variable, not onlyabout the conditional mean. Conditional quantiles can be inferred withquantile regression forests, a generalisation of random forests.Quantile regression forests give a non-parametric and accurateway of estimating conditional quantiles for high-dimensional predictorvariables. The algorithm is shown to be consistent. Numerical examples suggest thatthe algorithm is competitive in terms of predictive power.

ランダムフォレストは、Breiman(2001)で機械学習ツールとして導入され、それ以来、高次元の回帰と分類に非常に人気があり、強力であることが証明されています。回帰の場合、ランダムフォレストは応答変数の条件付き平均の正確な近似値を提供します。ここでは、ランダムフォレストが、条件付き平均に関する情報だけでなく、応答変数の完全な条件付き分布に関する情報を提供することを示しています。条件付き分位数は、ランダムフォレストの一般化である分位点回帰フォレストで推論できます。分位点回帰フォレストは、高次元予測変数の条件付き分位数を推定するノンパラメトリックで正確な方法を提供します。アルゴリズムは一貫していることが示されています。数値的な例は、アルゴリズムが予測力の点で競争力があることを示唆しています。

Lower Bounds and Aggregation in Density Estimation
密度推定における下限と集計

In this paper we prove the optimality of an aggregationprocedure. We prove lower bounds for aggregation of modelselection type of M density estimators for the Kullback-Leiblerdivergence (KL), the Hellinger’s distance and the L1-distance.The lower bound, with respect to the KL distance, can be achievedby the on-line type estimate suggested, among others, byYang (2000a). Combining these results, we state that logM/n is an optimal rate of aggregation in the sense ofTsybakov (2003), where n is the sample size.

この論文では、集約手順の最適性を証明します。私たちは、Kullback-Leiblerdivergence(KL)、Hellingerのdistance、およびL1-distance.The lower-boundsのM密度推定器のM密度推定器の集約の下限を証明します。KL距離に関する下限は、とりわけYang(2000a)によって提案されたオンラインタイプ推定によって達成できます。これらの結果を組み合わせると、logM/nはTsybakov (2003)の意味での最適な凝集率であると述べます(nはサンプルサイズです)。

Segmental Hidden Markov Models with Random Effects for Waveform Modeling
波形モデリングのためのランダム効果を持つセグメント隠れマルコフモデル

This paper proposes a general probabilistic framework forshape-based modeling and classification of waveform data. Asegmental hidden Markov model (HMM) is used to characterizewaveform shape and shape variation is captured by adding randomeffects to the segmental model. The resulting probabilisticframework provides a basis for learning of waveform models fromdata as well as parsing and recognition of new waveforms.Expectation-maximization (EM) algorithms are derived andinvestigated for fitting such models to data. In particular, the”expectation conditional maximization either” (ECME) algorithm isshown to provide significantly faster convergence than a standardEM procedure. Experimental results on two real-world data setsdemonstrate that the proposed approach leads to improved accuracyin classification and segmentation when compared to alternativessuch as Euclidean distance matching, dynamic time warping, andsegmental HMMs without random effects.

この論文では、波形データの形状ベースのモデリングと分類のための一般的な確率的フレームワークを提案します。非セグメント隠れマルコフモデル(HMM)は、波形の形状を特徴付けるために使用され、形状の変化はセグメントモデルにランダム効果を追加することでキャプチャされます。結果として得られる確率的フレームワークは、データから波形モデルを学習し、新しい波形を解析して認識するための基礎を提供します。期待値最大化(EM)アルゴリズムを導き出し、そのようなモデルをデータに適合させるために調査します。特に、「期待条件付き最大化のいずれか」(ECME)アルゴリズムは、標準のEM手順よりも大幅に高速な収束を提供することが示されています。2つの実世界のデータセットでの実験結果は、提案されたアプローチが、ユークリッド距離マッチング、動的時間ワーピング、ランダム効果のないセグメントHMMなどの代替手段と比較して、分類とセグメンテーションの精度の向上につながることを示しています。

Rearrangement Clustering: Pitfalls, Remedies, and Applications
再配置クラスタリング: 落とし穴、解決策、およびアプリケーション

Given a matrix of values in which the rows correspond to objects andthe columns correspond to features of the objects, rearrangementclustering is the problem of rearranging the rows of the matrix suchthat the sum of the similarities between adjacent rows is maximized.Referred to by various names and reinvented several times, this clustering technique has beenextensively used in many fields over the last three decades. In this paper, wepoint out two critical pitfalls that have been previously overlooked.The first pitfall is deleterious when rearrangement clustering is applied toobjects that form natural clusters. The second concerns asimilarity metric that is commonly used. We present an algorithm thatovercomes these pitfalls. This algorithm is based on a variation ofthe TravelingSalesman Problem. It offers an extra benefit as itautomatically determines cluster boundaries. Using this algorithm, weoptimally solve fourbenchmark problems and a 2,467-gene expression data clusteringproblem. As expected, our new algorithm identifies better clusters than those found by previousapproaches in all five cases. Overall, our results demonstrate the benefitsof rectifying the pitfalls and exemplify the usefulness of thisclustering technique. Our code is available at ourwebsites.

行がオブジェクトに対応し、列がオブジェクトの特徴に対応する値のマトリックスが与えられた場合、再配置クラスタリングは、隣接する行間の類似性の合計が最大になるようにマトリックスの行を再配置する問題です。さまざまな名前で呼ばれ、何度も再発明されたこのクラスタリング手法は、過去30年間に多くの分野で広く使用されてきました。この論文では、これまで見過ごされてきた2つの重大な落とし穴を指摘します。最初の落とし穴は、自然なクラスターを形成するオブジェクトに再配置クラスタリングを適用すると有害になります。2つ目は、一般的に使用される類似性メトリックに関するものです。これらの落とし穴を克服するアルゴリズムを紹介します。このアルゴリズムは、巡回セールスマン問題のバリエーションに基づいています。クラスター境界を自動的に決定するという追加の利点があります。このアルゴリズムを使用して、4つのベンチマーク問題と2,467個の遺伝子発現データクラスタリング問題を最適に解決します。予想どおり、新しいアルゴリズムは、5つのケースすべてにおいて、以前のアプローチで見つかったものよりも優れたクラスターを識別します。全体として、私たちの結果は、落とし穴を修正することの利点を示し、このクラスタリング手法の有用性を例示しています。コードは、当社のWebサイトで入手できます。

Evolutionary Function Approximation for Reinforcement Learning
強化学習のための進化関数近似

Temporal difference methods are theoretically grounded and empiricallyeffective methods for addressing reinforcement learning problems.In most real-world reinforcement learning tasks, TD methods requirea function approximator to represent the value function. However,using function approximators requires manually making crucialrepresentational decisions. This paper investigatesevolutionary function approximation, a novel approach toautomatically selecting function approximator representations thatenable efficient individual learning. This method evolvesindividuals that are better able to learn. We present afully implemented instantiation of evolutionary functionapproximation which combines NEAT, a neuroevolutionary optimizationtechnique, with Q-learning, a popular TD method. The resultingNEAT+Q algorithm automatically discovers effective representationsfor neural network function approximators. This paper also presentson-line evolutionary computation, which improves the on-lineperformance of evolutionary computation by borrowing selectionmechanisms used in TD methods to choose individual actions and usingthem in evolutionary computation to select policies for evaluation.We evaluate these contributions with extended empirical studies intwo domains: 1) the mountain car task, a standard reinforcementlearning benchmark on which neural network function approximatorshave previously performed poorly and 2) server job scheduling, alarge probabilistic domain drawn from the field of autonomiccomputing. The results demonstrate that evolutionary functionapproximation can significantly improve the performance of TDmethods and on-line evolutionary computation can significantlyimprove evolutionary methods. This paper also presents additionaltests that offer insight into what factors can make neural networkfunction approximation difficult in practice.

時間差分法は、強化学習の問題に対処するための理論的根拠があり、経験的に有効な方法です。ほとんどの実際の強化学習タスクでは、TD法では、価値関数を表す関数近似器が必要です。ただし、関数近似器を使用するには、重要な表現上の決定を手動で行う必要があります。この論文では、効率的な個体学習を可能にする関数近似器の表現を自動的に選択する新しいアプローチである進化関数近似について調査します。この方法は、より学習能力の高い個体を進化させます。神経進化最適化技術であるNEATと、一般的なTD法であるQ学習を組み合わせた、進化関数近似の完全に実装されたインスタンスを紹介します。結果として得られるNEAT+Qアルゴリズムは、ニューラルネットワーク関数近似器の効果的な表現を自動的に検出します。この論文では、オンライン進化計算についても紹介します。これは、TD法で個々のアクションを選択するために使用される選択メカニズムを借用し、それを進化計算で使用して評価用のポリシーを選択することにより、進化計算のオンラインパフォーマンスを向上させます。これらの貢献を、2つのドメインでの拡張された実証研究によって評価します。1)マウンテンカータスク(ニューラルネットワーク関数近似器のパフォーマンスがこれまで低かった標準的な強化学習ベンチマーク)、2)サーバージョブスケジューリング(オートノミックコンピューティングの分野から引き出された大規模な確率ドメイン)。結果は、進化関数近似によってTD法のパフォーマンスが大幅に向上し、オンライン進化計算によって進化法が大幅に向上することを示しています。この論文では、ニューラルネットワーク関数近似を実際に困難にする要因についての洞察を提供する追加のテストも紹介します。

Infinite-Ï Limits For Tikhonov Regularization
Tikhonov 正則化の無限限

We consider the problem of Tikhonov regularization with a generalconvex loss function: this formalism includessupport vector machines and regularized least squares. For a family ofkernels that includes the Gaussian, parameterized by a “bandwidth”parameter σ, we characterize the limiting solution as σ→ ∞. In particular, we show that if weset the regularization parameter λ = ~λ σ-2p, the regularization term of the Tikhonovproblem tends to an indicator function on polynomials of degree⌊p⌋ (with residual regularization in the case where p∈ Z). The proof rests on two key ideas: epi-convergence, anotion of functional convergence under which limits of minimizersconverge to minimizers of limits, and a value-based formulationof learning, where we work with regularization on the function outputvalues (y) as opposed to the function expansion coefficients in theRKHS. Our result generalizes and unifies previous results in thisarea.

私たちは、一般凸損失関数を持つTikhonov正則化の問題を考えます:この形式には、サポートベクトルマシンと正則化された最小二乗法が含まれます。「帯域幅」パラメータσによってパラメータ化されたガウス分布を含むカーネルのファミリの場合、制限解をσ→ ∞として特徴付けます。特に、正則化パラメータ λ= ~λ σ-2pを設定すると、チホノフ問題の正則化項はdegree⌊p⌋ の多項式上の指標関数になる傾向があることを示します(p∈Zの場合に残余正則化あり)。その証明は、エピ収束、最小化器の限界が極限器に収束する関数収束のアノション、およびRKHSの関数拡張係数とは対照的に関数出力値(y)の正則化に取り組む価値ベースの学習の定式化という2つの重要なアイデアに基づいています。この結果は、この領域の以前の結果を一般化し、統一したものです。

Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
One-Class SVM および関連アルゴリズムの一貫性と収束率

We determine the asymptotic behaviour of the function computed bysupport vector machines (SVM) and related algorithms that minimize aregularized empirical convex loss function in the reproducing kernelHilbert space of the Gaussian RBF kernel, in the situation where thenumber of examples tends to infinity, the bandwidth of the Gaussiankernel tends to 0, and the regularization parameter is heldfixed. Non-asymptotic convergence bounds to this limit in the L2sense are provided, together with upper bounds on the classificationerror that is shown to converge to the Bayes risk, therefore provingthe Bayes-consistency of a variety of methods although theregularization term does not vanish. These results are particularlyrelevant to the one-class SVM, for which the regularization can notvanish by construction, and which is shown for the first time to be aconsistent density level set estimator.

私たちは、サポートベクターマシン(SVM)と関連アルゴリズムによって計算された関数の漸近的な振る舞いを決定し、Gaussian RBFカーネルの再生カーネルヒルベルト空間で不規則化された経験的凸損失関数を最小化する、例の数が無限大になる傾向があり、Gaussiankernelの帯域幅が0になる傾向があり、正則化パラメータが固定されている状況で決定します。L2senseのこの制限に対する非漸近収束境界は、ベイズリスクに収束することが示されている分類誤差の上限と共に提供され、したがって、正則化項が消えることはありませんが、さまざまな方法のベイズ整合性を証明します。これらの結果は、正則化が構造によって消えることがなく、無矛盾な密度レベル集合推定量であることが初めて示された1クラスSVMに特に関連しています。

Learning Image Components for Object Recognition
物体認識のための画像コンポーネントの学習

In order to perform object recognition it is necessary to learn representationsof the underlying components of images. Such components correspond to objects,object-parts, or features. Non-negative matrix factorisation is a generativemodel that has been specifically proposed for finding such meaningfulrepresentations of image data, through the use of non-negativity constraints onthe factors. This article reports on an empirical investigation of theperformance of non-negative matrix factorisation algorithms. It is found thatsuch algorithms need to impose additional constraints on the sparseness of thefactors in order to successfully deal with occlusion. However, these constraintscan themselves result in these algorithms failing to identify image componentsunder certain conditions. In contrast, a recognition model (a competitivelearning neural network algorithm) reliably and accurately learnsrepresentations of elementary image features without such constraints.

物体認識を行うためには、画像の根底にある構成要素の表現を学ぶ必要があります。このようなコンポーネントは、オブジェクト、オブジェクトパーツ、またはフィーチャーに対応します。非負の行列因数分解は、因子に対する非否定性の制約を使用して、画像データのそのような意味のある表現を見つけるために特別に提案された生成モデルです。この記事では、非負の行列分解アルゴリズムのパフォーマンスに関する実証的調査について報告します。このようなアルゴリズムは、オクルージョンをうまく処理するために、因子のスパース性に追加の制約を課す必要があることがわかりました。ただし、これらの制約スキャン自体により、これらのアルゴリズムは特定の条件下で画像コンポーネントを識別できません。これに対し、認識モデル(競合学習ニューラルネットワークアルゴリズム)は、このような制約なしに、基本的な画像特徴の表現を確実かつ正確に学習します。

Policy Gradient in Continuous Time
連続時間でのポリシー勾配

Policy search is a method for approximately solving an optimalcontrol problem by performing a parametric optimization search ina given class of parameterized policies. In order to process a localoptimization technique, such as a gradient method, we wish to evaluatethe sensitivity of the performance measure with respect to the policyparameters, the so-called policy gradient. This paper is concernedwith the estimation of the policy gradient for continuous-time, deterministicstate dynamics, in a reinforcement learning framework, thatis, when the decision maker does not have a model of the statedynamics.We show that usual likelihood ratio methods used in discrete-time,fail to proceed the gradient because they are subject to varianceexplosion when the discretization time-step decreases to 0. Wedescribe an alternative approach based on the approximation of thepathwise derivative, which leads to a policy gradient estimate thatconverges almost surely to the true gradient when the time-step tendsto 0. The underlying idea starts with the derivation of an explicitrepresentation of the policy gradient using pathwise derivation. Thisderivation makes use of the knowledge of the state dynamics. Then,in order to estimate the gradient from the observable data only, weuse a stochastic policy to discretize the continuous deterministicsystem into a stochastic discrete process, which enables to replacethe unknown coefficients by quantities that solely depend on knowndata. We prove the almost sure convergence of this estimate to thetrue policy gradient when the discretization time-step goes to zero. The method is illustrated on two target problems, in discreteand continuous control spaces.

ポリシー検索は、パラメーター化されたポリシーの特定のクラスでパラメトリック最適化検索を実行することにより、最適制御問題を近似的に解決する方法です。勾配法などのローカル最適化手法を処理するために、ポリシーパラメーターに関するパフォーマンス測定の感度、いわゆるポリシー勾配を評価します。この論文では、強化学習フレームワーク、つまり意思決定者が状態ダイナミクスのモデルを持たない場合の、連続時間、決定論的状態ダイナミクスに対するポリシー勾配の推定に関するものです。離散時間で使用される通常の尤度比法は、離散化タイムステップが0に減少すると分散爆発の影響を受けるため、勾配を処理できないことを示します。パスワイズ導関数の近似に基づく代替アプローチについて説明します。これにより、タイムステップが0に近づくときに、ポリシー勾配推定がほぼ確実に真の勾配に収束します。基本的な考え方は、パスワイズ導関数を使用してポリシー勾配の明示的な表現を導出することから始まります。この導出では、状態ダイナミクスの知識を使用します。次に、観測可能なデータのみから勾配を推定するために、確率的ポリシーを使用して連続決定論的システムを確率的離散プロセスに離散化します。これにより、未知の係数を、既知のデータのみに依存する量に置き換えることができます。離散化時間ステップがゼロになると、この推定値が真のポリシー勾配にほぼ確実に収束することを証明します。この方法は、離散制御空間と連続制御空間の2つの対象問題で説明されます。

QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
サポートベクターマシンの精度と実行時間が保証されたQPアルゴリズム

We describe polynomial–time algorithmsthat produce approximate solutions with guaranteedaccuracy for a class of QP problems that are used in thedesign of support vector machine classifiers.These algorithms employ a two–stage process where thefirst stage produces an approximatesolution to a dual QP problem and the second stage mapsthis approximate dual solution to an approximate primal solution.For the second stage we describe an O(n log n)algorithm that maps an approximate dual solution with accuracy(2(2Km)1/2+8(λ)1/2)-2 λ εp2to an approximate primal solution withaccuracy εpwhere n is the number of data samples,Kn is the maximum kernel value over the data andλ > 0 is the SVM regularization parameter.For the first stage we present new resultsfor decomposition algorithms anddescribe new decomposition algorithms with guaranteedaccuracy and run time.In particular, for τ-rate certifying decomposition algorithmswe establish the optimality of τ = 1/(n-1).In additionwe extend the recent τ = 1/(n-1) algorithm of Simon(2004) to form two new composite algorithmsthat also achieve the τ = 1/(n-1) iteration boundof List and Simon (2005), but yield faster run times in practice.We also exploit the τ-rate certifying property of thesealgorithms to produce new stopping rules that are computationallyefficient and that guarantee a specified accuracy for theapproximate dual solution.Furthermore,for the dual QP problem corresponding to the standard classificationproblem we describe operational conditions for which the Simon and compositealgorithms possess an upper bound of O(n) on the number of iterations.For this same problem we also describe general conditions for whicha matching lower bound existsfor any decomposition algorithm that uses working sets of size 2.For the Simon and composite algorithms we also establish an O(n2)bound on the overall run time for the first stage.Combining the first and second stages givesan overall run time of O(n2(ck + 1))where ck is an upper bound on the computation to performa kernel evaluation. Pseudocode is presentedfor a complete algorithm that inputs an accuracy εpand produces an approximate solution that satisfiesthis accuracy in low order polynomial time.Experiments are included to illustrate the new stopping rules andto compare the Simon and composite decomposition algorithms.

私たちは、サポートベクターマシン分類器の設計で使用されるQP問題のクラスに対して、精度が保証された近似解を生成する多項式時間アルゴリズムについて説明します。これらのアルゴリズムは2段階のプロセスを採用しており、最初の段階ではデュアルQP問題の近似解を生成し、2番目の段階ではこの近似デュアル解を近似プライマリ解にマッピングします。2番目の段階では、精度(2(2Km)1/2+8(λ)1/2)-2λ εp2の近似デュアル解を精度 εpの近似プライマリ解にマッピングするO(n log n)アルゴリズムについて説明します。ここで、nはデータサンプルの数、Knはデータ全体の最大カーネル値、λ> 0はSVM正則化パラメーターです。最初の段階では、分解アルゴリズムの新しい結果を示し、精度と実行時間が保証された新しい分解アルゴリズムについて説明します。特に、τ レート証明分解アルゴリズムの場合、τ=の最適性を確立します。1/(n-1)です。さらに、Simon(2004)の最近の τ= 1/(n-1)アルゴリズムを拡張して、ListとSimon (2005)の τ= 1/(n-1)反復境界も達成するが、実際には実行時間がより短い2つの新しい複合アルゴリズムを作成します。また、これらのアルゴリズムの τ レート証明特性を利用して、計算効率が高く、近似デュアルソリューションの指定された精度を保証する新しい停止ルールを作成します。さらに、標準分類問題に対応するデュアルQP問題に対して、Simonおよび複合アルゴリズムが反復回数の上限O(n)を持つ操作条件について説明します。この同じ問題に対して、サイズ2の作業セットを使用する分解アルゴリズムに一致する下限が存在する一般的な条件も説明します。Simonおよび複合アルゴリズムに対して、第1段階の全体的な実行時間にO(n2)の境界も確立します。第1段階と第2段階を組み合わせると、全体的な実行時間は次のようになります。O(n2(ck + 1))、ここでckはカーネル評価を実行するための計算の上限です。精度 εpを入力し、低次多項式時間でこの精度を満たす近似解を生成する完全なアルゴリズムの擬似コードが提示されています。新しい停止規則を説明し、Simonと複合分解アルゴリズムを比較するための実験が含まれています。

Some Theory for Generalized Boosting Algorithms
一般化ブースティングアルゴリズムの理論

We give a review of various aspects of boosting, clarifying theissues through a few simple results, and relate our work and that ofothers to the minimax paradigm of statistics. We consider thepopulation version of the boosting algorithm and prove itsconvergence to the Bayes classifier as a corollary of a generalresult about Gauss-Southwell optimization in Hilbert space. We theninvestigate the algorithmic convergence of the sample version, andgive bounds to the time until perfect separation of the sample. Weconclude by some results on the statistical optimality of the L2boosting.

私たちは、ブースティングのさまざまな側面のレビューを行い、いくつかの簡単な結果を通じて問題を明確にし、私たちと他の人の仕事を統計のミニマックスパラダイムに関連付けます。ブースティングアルゴリズムの母集団バージョンを検討し、ヒルベルト空間でのガウス・サウスウェル最適化に関する一般的な結果の帰結として、ベイズ分類器への収束を証明します。次に、サンプルバージョンのアルゴリズム収束を調査し、サンプルが完全に分離するまでの時間に限界を与えます。L2boostingの統計的最適性に関するいくつかの結果によって結論を述べる。

Learning Minimum Volume Sets
最小ボリュームセットの学習

Given a probability measure P and a reference measureμ, one is often interested in the minimum μ-measure setwith P-measure at least α. Minimum volume sets of thistype summarize the regions of greatest probability mass of P,and are useful for detecting anomalies and constructing confidenceregions. This paper addresses the problem of estimating minimumvolume sets based on independent samples distributed according toP. Other than these samples, no other information is availableregarding P, but the reference measure μ is assumed to beknown. We introduce rules for estimating minimum volume sets thatparallel the empirical risk minimization and structural riskminimization principles in classification. As in classification, weshow that the performances of our estimators are controlled by therate of uniform convergence of empirical to true probabilities overthe class from which the estimator is drawn. Thus we obtain finitesample size performance bounds in terms of VC dimension and relatedquantities. We also demonstrate strong universal consistency, anoracle inequality, and rates of convergence. The proposed estimatorsare illustrated with histogram and decision tree set estimationrules.

確率測度Pと参照測度 μ が与えられた場合、P測度が少なくとも α である最小 μ 測度セットに関心を持つことがよくあります。このタイプの最小ボリュームセットは、Pの最大確率質量の領域を要約したもので、異常を検出し、信頼領域を構築するのに役立ちます。この論文では、Pに従って分布する独立サンプルに基づいて最小ボリュームセットを推定する問題について説明します。これらのサンプル以外には、Pに関する情報はありませんが、参照測度 μ は既知であると想定されます。分類における経験的リスク最小化と構造的リスク最小化の原則に匹敵する、最小ボリュームセットを推定するための規則を紹介します。分類の場合と同様に、推定量のパフォーマンスは、推定量が抽出されるクラスでの経験的確率から真の確率への一様収束率によって制御されることを示します。したがって、VC次元と関連量に関して、有限サンプルサイズのパフォーマンス境界が得られます。また、強い普遍的一貫性、オラクル不等式、および収束率も示します。提案された推定量は、ヒストグラムと決定木セットの推定ルールで示されます。

Pattern Recognition for Conditionally Independent Data
条件付き独立データのパターン認識

In this work we consider the task of relaxing the i.i.d. assumptionin pattern recognition (or classification), aiming to makeexisting learning algorithms applicable to a wider range of tasks.Pattern recognition is guessing a discrete label ofsome object based on a set of given examples (pairs ofobjects and labels). We consider the case of deterministically defined labels. Traditionally, thistask is studied under the assumption that examplesare independent and identically distributed. However,it turns out that many results of pattern recognition theory carry over a weaker assumption. Namely, under the assumptionof conditional independence and identical distribution of objects,while the only assumption on the distribution of labels is that therate of occurrence of each label should be above some positive threshold.We find a broad class of learning algorithms for which estimations ofthe probability of the classification error achieved under the classical i.i.d. assumption canbe generalized to the similar estimates for case of conditionally i.i.d. examples.

この研究では、パターン認識(または分類)におけるi.i.d.仮定を緩和するタスクを検討し、既存の学習アルゴリズムをより広範囲のタスクに適用できるようにすることを目標としています。パターン認識とは、一連の与えられた例(オブジェクトとラベルのペア)に基づいて、あるオブジェクトの離散ラベルを推測することです。ここでは、決定論的に定義されたラベルの場合を検討します。従来、このタスクは、例が独立しており、同一に分布しているという仮定の下で研究されています。しかし、パターン認識理論の多くの結果では、より弱い仮定が引き継がれていることが判明しています。つまり、条件付き独立性とオブジェクトの同一分布の仮定の下では、ラベルの分布に関する唯一の仮定は、各ラベルの発生率が何らかの正のしきい値を超える必要があるということです。私たちは、古典的なi.i.d.仮定の下で達成された分類エラーの確率の推定が、条件付きi.i.d.例の場合の同様の推定に一般化できる、幅広いクラスの学習アルゴリズムを見つけました。

Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
変分ベイジアン近似におけるGauss混合の確率的複雑さ

Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. The properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning.

ベイズ学習は広く使用されており、多くのデータモデリング問題で効果的であることが証明されています。ただし、それに伴う計算には膨大なコストがかかり、通常は正確に実行できません。ベイズ学習の近似として提案された変分ベイズアプローチは、多くのアプリケーションで計算の扱いやすさと優れた一般化パフォーマンスを提供してきました。変分ベイズ学習自体の特性と機能はまだ明らかにされていません。変分ベイズアプローチがどの程度優れた近似を達成できるかはまだわかっていません。この論文では、ガウス混合モデルの変分ベイズ学習について説明し、変分確率複雑度の上限と下限を導出します。最小変分自由エネルギーとベイズ証拠の下限に対応する変分確率的複雑性は、モデル選択問題に対処する上で重要になるだけでなく、真のベイズ学習の近似としての変分ベイズアプローチの精度を議論することも可能になります。

A Direct Method for Building Sparse Kernel Learning Algorithms
スパースカーネル学習アルゴリズムを構築するための直接的な方法

Many kernel learning algorithms, including support vector machines,result in a kernel machine, such as a kernel classifier, whose keycomponent is a weight vector in a feature space implicitly introducedby a positive definite kernel function. This weight vector is usuallyobtained by solving a convex optimization problem. Based on this factwe present a direct method to build sparse kernel learning algorithmsby adding one more constraint to the original convex optimizationproblem, such that the sparseness of the resulting kernel machine isexplicitly controlled while at the same time performance is kept ashigh as possible. A gradient based approach is provided to solve thismodified optimization problem. Applying this method to the supportvectom machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminatingsubspace that can be spanned by a small number of vectors, and in thissubspace, the different classes of data are linearly wellseparated. Experimental results over several classification benchmarksdemonstrate the effectiveness of our approach.

サポートベクターマシンを含む多くのカーネル学習アルゴリズムは、カーネル分類器などのカーネルマシンを生成します。そのキーコンポーネントは、正定値カーネル関数によって暗黙的に導入される特徴空間の重みベクトルです。この重みベクトルは通常、凸最適化問題を解くことによって得られます。この事実に基づいて、元の凸最適化問題にもう1つの制約を追加することで、スパースカーネル学習アルゴリズムを構築する直接的な方法を提示します。これにより、結果として得られるカーネルマシンのスパース性が明示的に制御され、同時にパフォーマンスが可能な限り高く維持されます。この修正された最適化問題を解決するために、勾配ベースのアプローチが提供されます。この方法をサポートベクトルマシンに適用すると、スパースな大マージン分類器を構築するための具体的なアルゴリズムが得られます。これらの分類器は、本質的に、少数のベクトルで網羅できる識別サブスペースを見つけます。このサブスペースでは、異なるクラスのデータが線形に分離されます。いくつかの分類ベンチマークでの実験結果により、このアプローチの有効性が実証されています。

Toward Attribute Efficient Learning of Decision Lists and Parities
意思決定リストとパリティの属性効率学習に向けて

We consider two well-studied problems regarding attributeefficient learning: learningdecision lists and learning parity functions.First, we give an algorithm for learning decisionlists of length k over n variables using 2Õ(k1/3) log n examples and time nÕ(k1/3). This is the firstalgorithm for learning decision lists that has both subexponentialsample complexity and subexponential running time in the relevantparameters. Our approach is based ona new construction of low degree, low weight polynomial thresholdfunctions for decision lists. For a wide range of parameters ourconstruction matches a lower bound due to Beigel for decision lists and gives an essentially optimal tradeoff betweenpolynomial threshold function degree and weight. Second, we give analgorithm for learning an unknown parity function on k out of nvariables using O(n1-1/k) examples in poly(n) time. Fork=o(log n) this yields the first polynomial time algorithmfor learning parity on a superconstant number of variables withsublinear sample complexity. We also give a simple algorithmfor learning an unknown length-k parity using O(k log n)examples in nk/2 time, which improves on the naive nk timebound of exhaustive search.

私たちは、属性効率学習に関するよく研究されている2つの問題、決定リストの学習とパリティ関数の学習について検討します。まず、2Õ(k1/3) log nの例と時間nÕ(k1/3)を使用して、長さkの決定リストをn個の変数で学習するアルゴリズムを示します。これは、関連するパラメータで指数以下のサンプル複雑性と実行時間の両方を持つ、決定リストを学習する最初のアルゴリズムです。私たちのアプローチは、決定リストの低次数、低重みの多項式しきい値関数の新しい構成に基づいています。さまざまなパラメータについて、私たちの構成は、決定リストのBeigelによる下限と一致し、多項式しきい値関数の次数と重みの間の本質的に最適なトレードオフを提供します。次に、O(n1-1/k)の例を使用して、poly(n)時間でn個の変数のうちk個の未知のパリティ関数を学習するアルゴリズムを示します。Fork=o(log n)は、線形以下のサンプル複雑度を持つ定数以上の変数のパリティを学習する最初の多項式時間アルゴリズムになります。また、O(k log n)の例を使用してnk/2時間で長さkの未知のパリティを学習する簡単なアルゴリズムも示します。これは、網羅的検索の単純なnk時間制限を改善します。

Online Passive-Aggressive Algorithms
オンラインパッシブアグレッシブアルゴリズム

We present a family of margin based online learning algorithms for variousprediction tasks. In particular we derive and analyze algorithms for binary andmulticlass categorization, regression, uniclass prediction and sequenceprediction. The update steps of our different algorithms are all based on analyticalsolutions to simple constrained optimization problems. This unified viewallows us to prove worst-case loss bounds for the different algorithms and forthe various decision problems based on a single lemma. Our bounds on thecumulative loss of the algorithms are relative to the smallest loss that can beattained by any fixed hypothesis, and as such are applicable to both realizableand unrealizable settings. We demonstrate some of the merits of the proposedalgorithms in a series of experiments with synthetic and real data sets.

私たちは、さまざまな予測タスクのためのマージンベースのオンライン学習アルゴリズムのファミリーを紹介します。特に、バイナリおよびマルチクラス分類、回帰、ユニクラス予測、およびシーケンス予測のアルゴリズムを導出および分析します。さまざまなアルゴリズムの更新ステップはすべて、単純な制約付き最適化問題に対する解析的解に基づいています。この統一されたビューにより、1つの補題に基づいて、さまざまなアルゴリズムとさまざまな決定問題に対する最悪の場合の損失範囲を証明できます。アルゴリズムの累積損失の限界は、固定された仮説によって打ち負かすことができる最小の損失に相対的であり、そのため、実現可能な設定と実現不可能な設定の両方に適用できます。提案されたアルゴリズムの利点のいくつかを、合成データセットと実際のデータセットを使用した一連の実験で実証します。

Learning Coordinate Covariances via Gradients
勾配による座標共分散の学習

We introduce an algorithm that learns gradients from samples inthe supervised learning framework. An error analysis is given forthe convergence of the gradient estimated by the algorithm to thetrue gradient. The utility of the algorithm for the problem ofvariable selection as well as determining variable covariance isillustrated on simulated data as well as two gene expressiondata sets. For square loss we provide a very efficientimplementation with respect to both memory and time.

私たちは、サンプルから勾配を学習するアルゴリズムを教師あり学習フレームワークで紹介します。アルゴリズムによって推定された勾配と真の勾配への収束について、エラー分析が行われます。変数選択の問題に対するアルゴリズムの有用性、および変数の共分散の決定は、シミュレートされたデータと2つの遺伝子発現データセットで示されています。二乗損失の場合、メモリと時間の両方に関して非常に効率的な実装を提供します。

Learning Recursive Control Programs from Problem Solving
問題解決からの再帰的制御プログラムの学習

In this paper, we propose a new representation for physical control– teleoreactive logic programs — along with an interpreter thatuses them to achieve goals. In addition, we present a new learningmethod that acquires recursive forms of these structures from tracesof successful problem solving. We report experiments in three differentdomains that demonstrate the generality of this approach. In closing,we review related work on learning complex skills and discuss directionsfor future research on this topic.

この論文では、物理的制御の新しい表現であるテレオレアティブ論理プログラムと、それらを使用して目標を達成するインタプリタを提案します。さらに、問題解決の成功の痕跡からこれらの構造の再帰的な形式を獲得する新しい学習方法を提示します。このアプローチの一般性を実証する3つの異なるドメインでの実験を報告します。最後に、複雑なスキルの学習に関する関連研究をレビューし、このトピックに関する将来の研究の方向性について話し合います。

Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
非線形判別解析のためのカーネルパラメーターと正則化係数の最適化

In this paper we consider a novel Bayesian interpretation of Fisher’sdiscriminant analysis. We relate Rayleigh’s coefficient to a noisemodel that minimises a cost based on the most probable class centresand that abandons the ‘regression to the labels’ assumption used byother algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with theincorporation of a prior we can apply Bayes’ rule to infer theposterior distribution of the direction ofdiscrimination. Nonetheless, we argue that an additional constrainingdistribution has to be included if sensible results are to beobtained. Going further, with the use of a Gaussian process prior weshow the equivalence of our model to a regularised kernel Fisher’sdiscriminant. A key advantage of our approach is the facility todetermine kernel parameters and the regularisation coefficient throughthe optimisation of the marginal log-likelihood of the data. Anadded bonus of the new formulation is that it enables us to link theregularisation coefficient with the generalisation error.

この論文では、フィッシャーの判別分析の新しいベイズ解釈について検討します。レイリー係数を、最も可能性の高いクラスセンターに基づいてコストを最小化し、他のアルゴリズムで使用される「ラベルへの回帰」の仮定を放棄するノイズモデルに関連付けます。ノイズモデルの最適化により、フィッシャーの判別と同等の判別方向が得られ、事前分布を組み込むことで、ベイズの規則を適用して判別方向の事後分布を推測できます。ただし、妥当な結果を得るには、追加の制約分布を含める必要があると主張します。さらに、ガウス過程事前分布を使用して、モデルが正規化されたカーネルフィッシャーの判別と同等であることを示します。このアプローチの主な利点は、データの周辺対数尤度の最適化を通じてカーネルパラメーターと正規化係数を決定できることです。新しい定式化のもう1つの利点は、正規化係数を一般化エラーにリンクできることです。

Inductive Synthesis of Functional Programs: An Explanation Based Generalization Approach
機能プログラムの帰納的合成:説明に基づく一般化アプローチ

We describe an approach to the inductive synthesis of recursiveequations from input/output-examples which is based on the classicaltwo-step approach to induction of functional Lisp programs ofSummers (1977). In a first step, I/O-examples are rewritten totraces which explain the outputs given the respective inputs based ona datatype theory. These traces can be integrated into one conditionalexpression which represents a non-recursive program. In a secondstep, this initial program term is generalized into recursiveequations by searching for syntactical regularities in the term. Ourapproach extends the classical work in several aspects. The mostimportant extensions are that we are able to induce a set ofrecursive equations in one synthesizing step, the equations maycontain more than one recursive call, and additionally neededparameters are automatically introduced.

私たちは、Summers (1977)の関数型Lispプログラムの帰納法に対する古典的な二段階アプローチに基づく、入力/出力例からの再帰方程式の帰納的合成へのアプローチについて述べる。最初のステップでは、I/Oサンプルをトレースに書き換えて、データ型理論に基づいてそれぞれの入力が与えられた出力を説明します。これらのトレースは、非再帰的プログラムを表す1つのconditionalexpressionに統合できます。2番目のステップでは、この初期プログラム項は、項の構文規則性を検索することにより、再帰方程式に一般化されます。私たちのアプローチは、いくつかの側面で古典的な作品を拡張します。最も重要な拡張は、1つの合成ステップで一連の再帰方程式を誘導できること、方程式に複数の再帰呼び出しが含まれる場合があり、さらに必要なパラメータが自動的に導入されることです。

Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation
マルコフ連鎖における幾何学的分散縮小:価値関数への応用と勾配推定

We study a variance reduction technique for Monte Carlo estimationof functionals in Markov chains. The method is based on designingsequential control variates using successive approximationsof the function of interest V. Regular Monte Carlo estimates havea variance of O(1/N), where N is the number of sample trajectoriesof the Markov chain. Here, we obtain a geometric variance reductionO(ρN) (with ρ<1) up to a threshold that depends onthe approximation error V-AV, where A is an approximationoperator linear in the values. Thus, if V belongs to the rightapproximation space (i.e. AV=V), the variance decreases geometricallyto zero.An immediate application is value function estimation in Markov chains,which may be used for policy evaluation in a policy iteration algorithmfor solving Markov Decision Processes. Another important domain, for which variance reduction is highly needed,is gradient estimation, that is computing the sensitivity ∂αVof the performance measure V with respect to some parameter αof the transition probabilities. For example, in policy parametricoptimization, computing an estimate of the policy gradient is requiredto perform a gradient optimization method.We show that, using two approximations for the value functionand the gradient, a geometric variance reduction is also achieved,up to a threshold that depends on the approximation errors of bothof those representations.

私たちは、マルコフ連鎖における関数のモンテカルロ推定の分散削減手法について検討します。この方法は、対象関数Vの逐次近似を使用して逐次制御変量を設計することに基づいています。通常のモンテカルロ推定の分散はO(1/N)です。ここで、Nはマルコフ連鎖のサンプル軌跡の数です。ここでは、近似誤差V-AVに依存するしきい値までの幾何分散削減O(ρN) (ρ<1)を取得します。ここで、Aは値に対して線形な近似演算子です。したがって、Vが右近似空間に属する場合(つまり、AV = V)、分散は幾何的にゼロに減少します。直接的な応用は、マルコフ連鎖における価値関数推定であり、これは、マルコフ決定過程を解決するためのポリシー反復アルゴリズムにおけるポリシー評価に使用できます。分散の削減が強く必要とされるもう1つの重要な領域は、勾配推定です。これは、遷移確率のパラメーター α に対するパフォーマンス指標Vの感度 ∂αVを計算することです。たとえば、ポリシーパラメトリック最適化では、勾配最適化法を実行するために、ポリシー勾配の推定値を計算することが必要です。価値関数と勾配の2つの近似を使用して、両方の表現の近似誤差に依存するしきい値まで、幾何的な分散の削減も達成できることを示します。

Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
逐次予測とアルファベット分解による可逆圧縮の優れた保証

We present worst case bounds for the learningrate of a known prediction method that is based on hierarchicalapplications of binary context tree weighting (CTW) predictors. Aheuristic application of this approach that relies on Huffman’s alphabetdecomposition is known to achieve state-of-the-art performancein prediction and lossless compression benchmarks. We show that ournew bound for this heuristic is tighter than the best knownperformance guarantees for prediction and lossless compressionalgorithms in various settings. This resultsubstantiates the efficiency of this hierarchical method and provides a compellingexplanation for its practical success.In addition, we present the results of a few experiments thatexamine other possibilities for improving the multi-alphabetprediction performance of CTW-based algorithms.

私たちは、バイナリコンテキストツリーウェイト(CTW)予測子の階層的アプリケーションに基づく既知の予測方法の学習率のワーストケースバウンドを示します。ハフマンのアルファベット分解に依存するこのアプローチのアヒューリスティックアプリケーションは、予測および可逆圧縮ベンチマークで最先端のパフォーマンスを達成することが知られています。このヒューリスティックの新しい境界は、さまざまな設定での予測アルゴリズムと可逆圧縮アルゴリズムの既知のパフォーマンス保証よりも厳しいことを示します。この結果は、この階層的方法の効率性を実証し、その実際的な成功に対する説得力のある説明を提供します。さらに、CTWベースのアルゴリズムのマルチアルファベット予測パフォーマンスを向上させるための他の可能性を検討するいくつかの実験の結果を示します。

Using Machine Learning to Guide Architecture Simulation
機械学習を使用してアーキテクチャシミュレーションをガイドする

An essential step in designing a new computer architecture is thecareful examination of different design options. It is critical thatcomputer architects have efficient means by which they may estimatethe impact of various design options on the overall machine. Thistask is complicated by the fact that different programs, and evendifferent parts of the same program, may have distinct behaviorsthat interact with the hardware in different ways. Researchers usevery detailed simulators to estimate processor performance, whichmodels every cycle of an executing program. Unfortunately, simulatingevery cycle of a real program can take weeks or months.To address this problem we have created a tool called SimPoint thatuses data clustering algorithms from machine learning to automaticallyfind repetitive patterns in a program’s execution. By simulating onerepresentative of each repetitive behavior pattern, simulation timecan be reduced to minutes instead of weeks for standard benchmarkprograms, with very little cost in terms of accuracy. We describe thisimportant problem, the data representation and preprocessing methodsused by SimPoint, the clustering algorithm at the core of SimPoint,and we evaluate different options for tuning SimPoint.

新しいコンピュータアーキテクチャを設計する上で不可欠なステップは、さまざまな設計オプションを慎重に検討することです。コンピュータアーキテクトが、さまざまな設計オプションがマシン全体に与える影響を予測する効率的な手段を持っていることは非常に重要です。このタスクは、異なるプログラム、さらには同じプログラムの異なる部分でさえ、ハードウェアと異なる方法で相互作用する異なる動作を持つ可能性があるという事実によって複雑になります。研究者は、実行中のプログラムのすべてのサイクルをモデル化する非常に詳細なシミュレータを使用してプロセッサのパフォーマンスを予測します。残念ながら、実際のプログラムのすべてのサイクルをシミュレートするには、数週間または数か月かかることがあります。この問題に対処するために、機械学習のデータクラスタリングアルゴリズムを使用してプログラムの実行における反復パターンを自動的に検出するSimPointというツールを作成しました。反復動作パターンごとに1つの代表的なパターンをシミュレートすることで、標準的なベンチマークプログラムのシミュレーション時間を数週間ではなく数分に短縮でき、精度の面でのコストもほとんどかかりません。この重要な問題、SimPointで使用されるデータ表現と前処理方法、SimPointの中核となるクラスタリングアルゴリズムについて説明し、SimPointを調整するためのさまざまなオプションを評価します。

Kernels on Prolog Proof Trees: Statistical Learning in the ILP Setting
Prolog Proof Treeのカーネル:ILP設定における統計的学習

We develop kernels for measuring the similarity between relationalinstances using background knowledge expressed in first-order logic.The method allows us to bridge the gap between traditional inductivelogic programming (ILP) representations and statistical approachesto supervised learning. Logic programs are first used to generateproofs of given visitor programs that use predicates declared in theavailable background knowledge. A kernel is then defined over pairsof proof trees. The method can be used for supervised learning tasksand is suitable for classification as well as regression. We reportpositive empirical results on Bongard-like and M-of-N problemsthat are difficult or impossible to solve with traditional ILPtechniques, as well as on real bioinformatics and chemoinformaticsdata sets.

私たちは、一階論理で表現される背景知識を使用して、関係インスタンス間の類似性を測定するためのカーネルを開発します。この方法により、従来の帰納論理プログラミング(ILP)表現と教師あり学習への統計的アプローチとの間のギャップを埋めることができます。ロジックプログラムは、まず、利用可能な背景知識で宣言された述語を使用する特定のビジタープログラムの証明を生成するために使用されます。その後、カーネルはプルーフツリーのペアに対して定義されます。この方法は、教師あり学習タスクに使用でき、分類や回帰にも適しています。従来のILP技術では解決が困難または不可能なボンガード様問題やM-of-N問題、および実際のバイオインフォマティクスおよびケモインフォマティクスデータセットに関する肯定的な経験的結果を報告します。

Some Discriminant-Based PAC Algorithms
いくつかの判別ベースのPACアルゴリズム

A classical approach in multi-class pattern classification is thefollowing. Estimate the probability distributions that generated theobservations for each label class, and then label new instances byapplying the Bayes classifier to the estimated distributions. Thatapproach provides more useful information than just a class label; italso provides estimates of the conditional distribution of classlabels, in situations where there is class overlap.We would like to know whether it is harder to build accurateclassifiers via this approach, than by techniques that may processall data with distinct labels together. In this paper we makethat question precise by considering it in the context of PAClearnability. We propose two restrictions on the PAC learningframework that are intended to correspond with the above approach,and consider their relationship with standard PAC learning.Our main restriction of interest leads to some interesting algorithmsthat show that the restriction is not stronger (more restrictive)than various other well-known restrictions on PAC learning.An alternative slightly milder restriction turns out to be almostequivalent to unrestricted PAC learning.

マルチクラスパターン分類の古典的なアプローチは次のとおりです。各ラベルクラスの観測を生成した確率分布を推定し、推定された分布にベイズ分類器を適用して新しいインスタンスにラベルを付けます。このアプローチは、クラスラベルだけよりも有用な情報を提供します。また、クラスが重複している状況では、クラスラベルの条件付き分布の推定値も提供します。このアプローチで正確な分類器を構築することが、異なるラベルを持つすべてのデータを一緒に処理する手法よりも難しいかどうかを知りたいと思います。この論文では、PAC学習可能性のコンテキストで検討することで、その質問を明確にします。上記のアプローチに対応することを目的としたPAC学習フレームワークに対する2つの制限を提案し、標準PAC学習との関係を検討します。主な関心の制限により、この制限がPAC学習に対するその他のさまざまなよく知られている制限よりも強力ではない(より制限的ではない)ことを示す興味深いアルゴリズムがいくつか得られます。もう1つのわずかに緩い制限は、制限のないPAC学習とほぼ同等であることがわかりました。

In Search of Non-Gaussian Components of a High-Dimensional Distribution
高次元分布の非ガウス成分の探索

Finding non-Gaussian components of high-dimensional data is animportant preprocessing step for efficient information processing.This article proposes a new linear method to identify the”non-Gaussian subspace” within a very general semi-parametricframework. Our proposed method, called NGCA (non-Gaussian componentanalysis), is based on a linear operator which, to any arbitrarynonlinear (smooth) function, associates a vector belonging to thelow dimensional non-Gaussian target subspace, up to an estimationerror. By applying this operator to a family of different nonlinearfunctions, one obtains a family of different vectors lying in avicinity of the target space. As a final step, the target spaceitself is estimated by applying PCA to this family of vectors. Weshow that this procedure is consistent in the sense that theestimaton error tends to zero at a parametric rate, uniformly overthe family, Numerical examples demonstrate the usefulness of ourmethod.

高次元データの非ガウス成分を見つけることは、効率的な情報処理のための重要な前処理ステップです。この記事では、非常に一般的なセミパラメトリックフレームワーク内の「非ガウス部分空間」を識別するための新しい線形方法を提案します。NGCA(non-Gaussian componentanalysis)と呼ばれる私たちの提案手法は、任意の非非線形(平滑)関数に対して、低次元の非ガウスターゲット部分空間に属するベクトルを推定誤差まで関連付ける線形演算子に基づいています。この演算子を異なる非線形関数の族に適用することにより、ターゲット空間の結合近傍にある異なるベクトルの族が得られます。最後のステップとして、このベクトルファミリーにPCAを適用することにより、ターゲット空間自体が推定されます。この手順は、推定誤差がパラメトリックレートでゼロになる傾向があり、家族全体で均一であるという意味で一貫していることを示しています。

Learning the Structure of Linear Latent Variable Models
線形潜在変数モデルの構造の学習

We describe anytime search procedures that (1) find disjoint subsetsof recorded variables for which the members of each subset ared-separated by a single common unrecorded cause, if such exists; (2)return information about the causal relations among the latent factorsso identified. We prove the procedure is point-wise consistentassuming (a) the causal relations can be represented by a directedacyclic graph (DAG) satisfying the Markov Assumption and theFaithfulness Assumption; (b) unrecorded variables are not caused byrecorded variables; and (c) dependencies are linear. We compare the procedure withstandard approaches over a variety of simulated structures and sample sizes, andillustrate its practical value with brief studies of social sciencedata sets. Finally, we consider generalizations for non-linearsystems.

私たちは、(1)記録された変数の非結合サブセットを見つけるために、各サブセットのメンバーが単一の共通の未記録の原因によって分離されている場合、それを見つけるための検索手順をいつでも説明します。(2)特定された潜在因子間の因果関係に関する情報を返します。(a)因果関係は、マルコフ仮定と忠実性仮定を満たす有向巡回グラフ(DAG)で表すことができると仮定して、手順が点ごとに一貫していることを証明します。(b)記録されていない変数は、記録された変数によって引き起こされたものではありません。(c)依存関係は線形です。さまざまなシミュレートされた構造とサンプルサイズで手順を標準的なアプローチと比較し、社会科学データセットの簡単な研究でその実用的な価値を示します。最後に、非線形システムの一般化について考えます。

MinReg: A Scalable Algorithm for Learning Parsimonious Regulatory Networks in Yeast and Mammals
MinReg:酵母および哺乳動物における倹約的調節ネットワークを学習するためのスケーラブルなアルゴリズム

In recent years, there has been a growing interest in applying Bayesiannetworks and their extensions to reconstruct regulatory networks fromgene expression data. Since the gene expression domain involves a largenumber of variables and a limited number of samples, it poses bothcomputational and statistical challenges to Bayesian network learningalgorithms. Here we define a constrained family of Bayesian networkstructures suitable for this domain and devise an efficient search algorithmthat utilizes these structural constraints to find high scoring networksfrom data. Interestingly, under reasonable assumptions on the underlyingprobability distribution, we can provide performance guarantees on ouralgorithm. Evaluation on real data from yeast and mouse, demonstrates thatour method cannot only reconstruct a high quality model of the yeastregulatory network, but is also the first method to scale to the complexityof mammalian networks and successfully reconstructs a reasonable model overthousands of variables.

近年、ベイジアンネットワークとその拡張を適用して遺伝子発現データから制御ネットワークを再構築することへの関心が高まっています。遺伝子発現ドメインには多数の変数と限られた数のサンプルが含まれるため、ベイジアンネットワーク学習アルゴリズムには計算と統計の両方の課題があります。ここでは、このドメインに適したベイジアンネットワーク構造の制約ファミリーを定義し、これらの構造制約を利用してデータから高スコアのネットワークを見つける効率的な検索アルゴリズムを考案します。興味深いことに、基礎となる確率分布に関する合理的な仮定の下で、アルゴリズムのパフォーマンスを保証できます。酵母とマウスの実際のデータで評価すると、私たちの方法は酵母制御ネットワークの高品質モデルを再構築できるだけでなく、哺乳類ネットワークの複雑さに合わせて拡張できる最初の方法であり、数千の変数に対して合理的なモデルを正常に再構築できることが実証されています。

Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error
一般化誤差の条件付き期待値に基づく近似線形回帰におけるアクティブラーニング

The goal of active learning is to determine the locations of traininginput points so that the generalization error is minimized. Wediscuss the problem of active learning in linear regression scenarios.Traditional active learning methods using least-squares learning oftenassume that the model used for learning is correctly specified. Inmany practical situations, however, this assumption may not befulfilled. Recently, active learning methods using”importance”-weighted least-squares learning have been proposed, whichare shown to be robust against misspecification of models. In thispaper, we propose a new active learning method also using the weightedleast-squares learning, which we call ALICE (Active Learningusing the Importance-weighted least-squares learning based onConditional Expectation of the generalization error). An importantdifference from existing methods is that we predict theconditional expectation of the generalization error giventraining input points, while existing methods predict the fullexpectation of the generalization error. Due to this difference, thetraining input design can be fine-tuned depending on the realizationof training input points. Theoretically, we prove that the proposedactive learning criterion is a more accurate predictor of thesingle-trial generalization error than the existing criterion.Numerical studies with toy and benchmark data sets show that theproposed method compares favorably to existing methods.

能動学習の目標は、一般化エラーが最小になるようにトレーニング入力ポイントの位置を決定することです。線形回帰シナリオでの能動学習の問題について説明します。最小二乗学習を使用する従来の能動学習方法では、学習に使用されるモデルが正しく指定されていると想定されることがよくあります。しかし、多くの実際の状況では、この仮定は満たされない可能性があります。最近、「重要度」で重み付けされた最小二乗学習を使用する能動学習法が提案され、モデルの誤った指定に対して堅牢であることが示されています。この論文では、重み付けされた最小二乗学習を使用する新しい能動学習法を提案します。これをALICE (一般化誤差の条件付き期待値に基づく重要度で重み付けされた最小二乗学習を使用する能動学習)と呼びます。既存の方法との重要な違いは、既存の方法が一般化誤差の完全な期待値を予測するのに対し、私たちはトレーニング入力ポイントを与えられた一般化誤差の条件付き期待値を予測することです。この違いにより、トレーニング入力ポイントの実現に応じてトレーニング入力設計を微調整できます。理論的には、提案された能動学習基準は、既存の基準よりも単一試行の一般化エラーのより正確な予測子であることを証明します。おもちゃのデータセットとベンチマークデータセットを使用した数値研究では、提案された方法が既存の方法と比較して優れていることが示されています。

Bounds for Linear Multi-Task Learning
線形マルチタスク学習の限界

We give dimension-free and data-dependent bounds for linear multi-tasklearning where a common linear operator is chosen to preprocess data for avector of task specific linear-thresholding classifiers. The complexitypenalty of multi-task learning is bounded by a simple expression involvingthe margins of the task-specific classifiers, the Hilbert-Schmidt norm ofthe selected preprocessor and the Hilbert-Schmidt norm of the covarianceoperator for the total mixture of all task distributions, or, alternatively,the Frobenius norm of the total Gramian matrix for the data-dependentversion. The results can be compared to state-of-the-art results on linearsingle-task learning.

私たちは、線形マルチタスク学習には、タスク固有の線形しきい値分類器のavectorのデータを前処理するために共通の線形演算子が選択される、次元フリーでデータ依存の境界を提供します。マルチタスク学習の複雑さのペナルティは、タスク固有の分類器のマージン、選択したプリプロセッサのヒルベルト・シュミットノルム、およびすべてのタスク分布の全混合のための共分散演算子のヒルベルト・シュミットノルム、またはデータ依存バージョンの全グラミアン行列のフロベニウスノルムを含む単純な式によって制限されます。この結果は、linear-single-task learningの最先端の結果と比較できます。

Generalized Bradley-Terry Models and Multi-Class Probability Estimates
一般化 Bradley-Terry モデルと多クラス確率推定

The Bradley-Terry model for obtaining individual skill from pairedcomparisons has been popular in many areas. In machine learning, thismodel is related to multi-class probability estimates by coupling allpairwise classification results. Error correcting output codes (ECOC)are a general framework to decompose a multi-class problem to severalbinary problems. To obtain probability estimates under this framework,this paper introduces a generalized Bradley-Terry model in whichpaired individual comparisons are extended to paired team comparisons.We propose a simple algorithm with convergence proofs to solve themodel and obtain individual skill. Experiments on synthetic and re aldata demonstrate that the algorithm is useful for obtainingmulti-class probability estimates. Moreover, we discuss fourextensions of the proposed model: 1) weighted individual skill, 2)home-field advantage, 3) ties, and 4) comparisons with more than twoteams.

ペア比較から個々のスキルを取得するためのBradley-Terryモデルは、多くの分野で人気があります。機械学習では、このモデルは、allpairwise分類結果を結合することにより、マルチクラス確率推定に関連しています。エラー訂正出力コード(ECOC)は、マルチクラス問題を複数のバイナリ問題に分解するための一般的なフレームワークです。このフレームワークの下で確率推定値を得るために、この論文では、対応した個々の比較を対応のあるチーム比較に拡張する一般化されたBradley-Terryモデルを紹介します。モデルを解き、個々のスキルを習得するための収束証明を備えた単純なアルゴリズムを提案します。合成および再データに関する実験は、このアルゴリズムが多クラス確率推定値を取得するのに役立つことを示しています。さらに、提案されたモデルの4つの拡張、すなわち、1)加重された個々のスキル、2)ホームフィールドのアドバンテージ、3)引き分け、4)2つ以上のチームとの比較について議論します。

On the Complexity of Learning Lexicographic Strategies
辞書編集戦略の学習の複雑さについて

Fast and frugal heuristics are well studied models of bounded rationality. Psychological research has proposed the take-the-best heuristic as a successful strategy in decision making with limited resources. Take-the-best searches for a sufficiently good ordering of cues (or features) in a task where objects are to be compared lexicographically. We investigate the computational complexity of finding optimal cue permutations for lexicographic strategies and prove that the problem is NP-complete. It follows that no efficient (that is, polynomial-time) algorithm computes optimal solutions, unless P=NP. We further analyze the complexity of approximating optimal cue permutations for lexicographic strategies. We show that there is no efficient algorithm that approximates the optimum to within any constant factor, unless P=NP. The results have implications for the complexity of learning lexicographic strategies from examples. They show that learning them in polynomial time within the model of agnostic probably approximately correct (PAC) learning is impossible, unless RP=NP. We further consider greedy approaches for building lexicographic strategies and determine upper and lower bounds for the performance ratio of simple algorithms. Moreover, we present a greedy algorithm that performs provably better than take-the-best. Tight bounds on the sample complexity for learning lexicographic strategies are also given in this article.

高速で質素なヒューリスティックは、限定合理性のよく研究されたモデルです。心理学的研究では、限られたリソースでの意思決定における効果的な戦略として、最善を取るヒューリスティックが提案されています。最善を取るヒューリスティックは、オブジェクトを辞書式に比較するタスクで、十分に適切なキュー(または機能)の順序を検索します。私たちは、辞書式戦略の最適なキュー順列を見つける計算の複雑さを調査し、問題がNP完全であることを証明します。したがって、P = NPでない限り、効率的な(つまり、多項式時間の)アルゴリズムは最適なソリューションを計算できません。さらに、辞書式戦略の最適なキュー順列を近似する複雑さを分析します。P=NPでない限り、定数係数内で最適値を近似する効率的なアルゴリズムは存在しないことを示します。結果は、例から辞書式戦略を学習する複雑さに影響します。RP=NPでない限り、不可知論的おそらく近似正しい(PAC)学習モデル内で多項式時間で辞書式戦略を学習することは不可能であることを示します。さらに、辞書式戦略を構築するための貪欲なアプローチを検討し、単純なアルゴリズムのパフォーマンス比の上限と下限を決定します。さらに、最良を取るよりも明らかに優れたパフォーマンスを発揮する貪欲なアルゴリズムを紹介します。この記事では、辞書式戦略を学習するためのサンプルの複雑さの厳密な境界も示します。

Incremental Algorithms for Hierarchical Classification
階層分類のためのインクリメンタルアルゴリズム

We study the problem of classifying data in a given taxonomywhen classifications associated with multiple and/or partial pathsare allowed.We introduce a new algorithm that incrementally learns alinear-threshold classifier for each node of the taxonomy.A hierarchical classification is obtained by evaluatingthe trained node classifiers in a top-down fashion.To evaluate classifiers in our multipath framework,we define a new hierarchical loss function, the H-loss,capturing the intuition that whenever a classificationmistake is made on a node of the taxonomy, then no loss shouldbe charged for any additional mistake occurring in the subtreeof that node.Making no assumptions on the mechanism generating the data instances,and assuming a linear noise model for the labels,we bound the H-loss of our on-line algorithm in terms of the H-lossof a reference classifier knowing the true parameters of the label-generating process.We show that, in expectation, the excess cumulative H-loss grows at mostlogarithmically in the length of the data sequence.Furthermore, our analysis reveals the precise dependence of the rateof convergence on the eigenstructure of the data each node observes.Our theoretical results are complemented by a number of experiments on texualcorpora. In these experiments we show that, after only one epoch of training,our algorithm performs much better than Perceptron-based hierarchical classifiers, and reasonably close to a hierarchical support vector machine.

私たちは、複数のパスや部分的なパスに関連付けられた分類が許される場合、特定の分類法でデータを分類する問題を研究します。分類法の各ノードに対して線形しきい値分類器を段階的に学習する新しいアルゴリズムを紹介します。階層分類は、トレーニングされたノード分類器をトップダウン方式で評価することによって得られます。マルチパスフレームワークで分類器を評価するために、分類法のノードで分類ミスが発生した場合は、そのノードのサブツリーで発生した追加のミスに対して損失は発生しないという直感を捉えた新しい階層損失関数H損失を定義します。データインスタンスを生成するメカニズムについて仮定せず、ラベルの線形ノイズモデルを仮定して、ラベル生成プロセスの真のパラメータがわかっている参照分類器のH損失に関して、オンラインアルゴリズムのH損失を制限しました。予想どおり、過剰累積H損失は、データの長さに対して最大で対数的に増加することを示します。シーケンス。さらに、私たちの分析は、収束率が各ノードが観測するデータの固有構造に正確に依存することを明らかにしました。私たちの理論的結果は、テキストコーパスに関するいくつかの実験によって補完されています。これらの実験では、たった1エポックのトレーニングの後、私たちのアルゴリズムはパーセプトロンベースの階層分類器よりもはるかに優れており、階層サポートベクターマシンにかなり近いことを示しています。

Statistical Comparisons of Classifiers over Multiple Data Sets
複数のデータセットに対する分類器の統計的比較

While methods for comparing two learning algorithms on a singledata set have been scrutinized for quite some time already, theissue of statistical tests for comparisons of more algorithms onmultiple data sets, which is even more essential to typical machinelearning studies, has been all but ignored. This article reviewsthe current practice and then theoretically and empiricallyexamines several suitable tests. Based on that, we recommend a setof simple, yet safe and robust non-parametric tests forstatistical comparisons of classifiers: the Wilcoxon signed rankstest for comparison of two classifiers and the Friedman test withthe corresponding post-hoc tests for comparison of more classifiersover multiple data sets. Results of the latter can also be neatlypresented with the newly introduced CD (critical difference)diagrams.

1つのデータセットで2つの学習アルゴリズムを比較する方法は、すでにかなり前から精査されてきましたが、一般的な機械学習研究にとってさらに重要な、複数のデータセットでより多くのアルゴリズムを比較するための統計的検定の問題は、ほとんど無視されてきました。この記事では、現在の実践をレビューし、理論的および経験的にいくつかの適切なテストを検討します。それに基づいて、分類子の統計的比較には、単純でありながら安全で堅牢なノンパラメトリック検定のセットをお勧めします: 2つの分類子を比較するためのWilcoxon符号付き順位検定と、複数のデータセットでより多くの分類子を比較するための対応するポストホック検定を持つフリードマン検定。後者の結果は、新しく導入されたCD(クリティカルディファレンス)ダイアグラムでもきちんと提示できます。

関連記事