Journal of Machine Learning Research Papers: Volume 10の論文一覧

Journal of Machine Learning Research Papers Volume 10に記載されている内容を一覧にまとめ、機械翻訳を交えて日本語化し掲載します。
A Survey of Accuracy Evaluation Metrics of Recommendation Tasks
レコメンデーションタスクの精度評価指標に関する調査

Recommender systems are now popular both commerciallyand in the research community, where many algorithms have beensuggested for providing recommendations. These algorithms typicallyperform differently in various domains and tasks. Therefore, it isimportant from the research perspective, as well as from a practicalview, to be able to decide on an algorithm that matches the domainand the task of interest. The standard way to make such decisions isby comparing a number of algorithms offline using some evaluationmetric. Indeed, many evaluation metrics have been suggested forcomparing recommendation algorithms. The decision on the properevaluation metric is often critical, as each metric may favor adifferent algorithm. In this paper we review the properconstruction of offline experiments for deciding on the mostappropriate algorithm. We discuss three important tasks ofrecommender systems, and classify a set of appropriate well knownevaluation metrics for each task. We demonstrate how using animproper evaluation metric can lead to the selection of an improperalgorithm for the task of interest. We also discuss other importantconsiderations when designing offline experiments.

推薦システムは、現在、商業的にも研究コミュニティでも人気があり、推薦を提供するためのアルゴリズムが数多く提案されています。これらのアルゴリズムは、通常、さまざまなドメインやタスクで異なるパフォーマンスを示します。したがって、研究の観点からも、実用的にも、ドメインや対象のタスクに一致するアルゴリズムを決定できることが非常に重要です。このような決定を行う標準的な方法は、何らかの評価基準を使用して、オフラインで複数のアルゴリズムを比較することです。実際、推薦アルゴリズムを比較するための評価基準は数多く提案されています。適切な評価基準の決定は、各基準が異なるアルゴリズムを優先する可能性があるため、しばしば重要です。この論文では、最も適切なアルゴリズムを決定するためのオフライン実験の適切な構築について検討します。推薦システムの3つの重要なタスクについて説明し、各タスクに適したよく知られた評価基準のセットを分類します。不適切な評価基準を使用すると、対象のタスクに不適切なアルゴリズムが選択される可能性があります。また、オフライン実験を設計する際のその他の重要な考慮事項についても説明します。

Efficient Online and Batch Learning Using Forward Backward Splitting
前方後方分割を使用した効率的なオンラインおよびバッチ学習

We describe, analyze, and experiment with a framework for empiricalloss minimization with regularization. Our algorithmic frameworkalternates between two phases. On each iteration we first perform anunconstrained gradient descent step. We then cast and solve aninstantaneous optimization problem that trades off minimization of aregularization term while keeping close proximity to the result ofthe first phase. This view yields a simple yet effective algorithmthat can be used for batch penalized risk minimization and onlinelearning. Furthermore, the two phase approach enables sparsesolutions when used in conjunction with regularization functionsthat promote sparsity, such as l1. We derive concrete and verysimple algorithms for minimization of loss functions with l1,l2, l22, and l∞ regularization. We also showhow to construct efficient algorithms for mixed-norm l1/lqregularization. We further extend the algorithms and give efficientimplementations for very high-dimensional data with sparsity. Wedemonstrate the potential of the proposed framework in a series ofexperiments with synthetic and natural data sets.

私たちは、正則化を伴う経験的損失最小化のフレームワークについて説明、分析、実験します。我々のアルゴリズムフレームワークは、2つのフェーズを交互に実行します。各反復で、まず制約のない勾配降下ステップを実行します。次に、最初のフェーズの結果に近いまま正則化項の最小化をトレードオフする瞬間的な最適化問題を投げかけて解きます。このビューにより、バッチペナルティ付きリスク最小化とオンライン学習に使用できる、シンプルでありながら効果的なアルゴリズムが得られます。さらに、2フェーズアプローチは、l1などのスパース性を促進する正則化関数と組み合わせて使用すると、スパースソリューションを可能にします。私たちは、l1、l2、l22、およびl∞ 正則化を伴う損失関数の最小化のための具体的で非常にシンプルなアルゴリズムを導出します。また、混合ノルムl1/lq正則化のための効率的なアルゴリズムを構築する方法も示します。我々はさらにアルゴリズムを拡張し、スパース性を伴う非常に高次元のデータに対する効率的な実装を提供します。合成データセットと自然データセットを使用した一連の実験で、提案されたフレームワークの可能性を実証します。

Online Learning with Samples Drawn from Non-identical Distributions
同一でない分布から抽出されたサンプルによるオンライン学習

Learning algorithms are based on samples which are often drawnindependently from an identical distribution (i.i.d.). In thispaper we consider a different setting with samples drawn accordingto a non-identical sequence of probability distributions. Eachtime a sample is drawn from a different distribution. In thissetting we investigate a fully online learning algorithmassociated with a general convex loss function and a reproducingkernel Hilbert space (RKHS). Error analysis is conducted under theassumption that the sequence of marginal distributions convergespolynomially in the dual of a Hölder space. For regression withleast square or insensitive loss, learning rates are given in boththe RKHS norm and the L2 norm. For classification with hingeloss and support vector machine q-norm loss, rates areexplicitly stated with respect to the excess misclassificationerror.

学習アルゴリズムは、多くの場合、同一の分布(i.i.d.)から独立して抽出されるサンプルに基づいています。この論文では、確率分布の同一でないシーケンスに従って抽出されたサンプルを使用した別の設定について考察します。毎回、サンプルは異なる分布から抽出されます。この設定では、一般的な凸損失関数と再現カーネルヒルベルト空間(RKHS)に関連付けられた完全オンライン学習アルゴリズムを調査します。誤差解析は、周辺分布のシーケンスがヘルダー空間の双対に多項式に収束するという仮定の下で行われます。最小二乗損失または非鈍感損失を伴う回帰の場合、学習率はRKHSノルムとL2ノルムの両方で与えられます。hingelossとサポートベクターマシンのq-norm損失による分類では、過剰な誤分類エラーに関してレートが明示的に示されます。

Adaptive False Discovery Rate Control under Independence and Dependence
独立性と依存性の下での適応的誤発見率制御

In the context of multiple hypothesis testing, the proportion π0of true null hypotheses in the pool of hypotheses to test often plays a crucial role, although it is generallyunknown a priori. A testing procedure using an implicit or explicit estimateof this quantity in order to improve its efficency is called adaptive.In this paper, we focus onthe issue of false discovery rate (FDR) control and we present newadaptive multiple testing procedures with control of the FDR.In a first part, assuming independence of the p-values, we present two new procedures andgive a unified review of other existing adaptive procedures that have provablycontrolled FDR. We reportextensive simulation results comparing these proceduresand testing their robustness when the independenceassumption is violated. The new proposed procedures appearcompetitive with existing ones. The overall best, though, is reported tobe Storey’s estimator, albeit for a specific parameter setting that does notappear to have been considered before. In a second part, we propose adaptive versions of step-up procedures that have provably controlled FDR under positive dependence and unspecified dependence of the p-values, respectively. In the latter case, while simulations only show an improvement over non-adaptive procedures in limited situations, these are to our knowledge among the first theoretically founded adaptive multiple testing procedures that control the FDR when the p-valuesare not independent.

多重仮説検定の文脈では、検定する仮説プール内の真の帰無仮説の割合π0は、一般的には事前には不明であるが、しばしば重要な役割を果たします。効率を改善するためにこの量の暗黙的または明示的な推定値を使用する検定手順は、適応型と呼ばれます。この論文では、偽発見率（FDR）制御の問題に焦点を当て、FDRを制御する新しい適応型多重検定手順を紹介します。最初の部分では、p値の独立性を仮定して、2つの新しい手順を示し、証明可能にFDRを制御した他の既存の適応型手順の統一的なレビューを示します。これらの手順を比較し、独立性の仮定に違反した場合の堅牢性をテストする広範なシミュレーション結果を報告します。提案された新しい手順は、既存の手順と競合するようです。ただし、全体的に最も優れているのは、これまで考慮されていなかった特定のパラメーター設定ではあるものの、Storeyの推定量であると報告されています。2番目の部分では、それぞれp値の正の依存性と不特定の依存性の下でFDRを制御できることが証明されたステップアップ手順の適応バージョンを提案します。後者の場合、シミュレーションでは限られた状況でのみ非適応手順よりも改善が見られますが、これは、p値が独立していない場合にFDRを制御する、理論的に確立された最初の適応型多重検定手順の1つです。

Cautious Collective Classification
慎重な集団分類

Many collective classification (CC) algorithms have been shown toincrease accuracy when instances are interrelated. However, CC algorithmsmust be carefully applied because their use of estimated labelscan in some cases decrease accuracy. In this article, we show thatmanaging this label uncertainty through cautious algorithmicbehavior is essential to achieving maximal, robust performance.First, we describe cautious inference and explain how four well-knownfamilies of CC algorithms can be parameterized to use varying degrees of such caution. Second, we introduce cautious learning and show how it can be used to improve the performance ofalmost any CC algorithm, with or without cautious inference. We thenevaluate cautious inference and learning for the four collective inference families, with three local classifiers and a range ofboth synthetic and real-world data. We findthat cautious learning and cautious inference typically outperform lesscautious approaches. In addition, we identify the data characteristics that predict more substantial performancedifferences. Our results reveal that the degree of caution used usually has a larger impact on performance than the choice of the underlying inference algorithm. Together, these results identify the most appropriate CCalgorithms to use for particular task characteristics and explain multipleconflicting findings from prior CC research.

多くの集合分類(CC)アルゴリズムは、インスタンスが相互に関連しているときに精度が向上することが示されています。ただし、CCアルゴリズムは、推定ラベルの使用によって精度が低下する場合があるため、慎重に適用する必要があります。この記事では、慎重なアルゴリズム動作を通じてこのラベルの不確実性を管理することが、最大かつ堅牢なパフォーマンスを実現するために不可欠であることを示します。まず、慎重な推論について説明し、よく知られている4つのCCアルゴリズムファミリをさまざまなレベルの慎重さで使用するようにパラメーター化する方法を説明します。次に、慎重な学習を紹介し、慎重な推論の有無にかかわらず、ほとんどすべてのCCアルゴリズムのパフォーマンスを向上させるために慎重な学習を使用する方法を示します。次に、3つのローカル分類子と合成データと実際のデータの両方を使用して、4つの集合推論ファミリの慎重な推論と学習を評価します。慎重な学習と慎重な推論は、通常、それほど慎重でないアプローチよりも優れていることがわかりました。さらに、より大幅なパフォーマンスの違いを予測するデータ特性を特定します。私たちの研究結果から、基礎となる推論アルゴリズムの選択よりも、使用される注意の度合いの方がパフォーマンスに大きな影響を与えることが明らかになりました。これらの結果を合わせると、特定のタスク特性に使用する最も適切なCCアルゴリズムが特定され、以前のCC研究からの複数の矛盾する結果が説明されます。

Reproducing Kernel Banach Spaces for Machine Learning
機械学習のためのカーネルバナッハ空間の再現

We introduce the notion of reproducing kernel Banach spaces (RKBS)and study special semi-inner-product RKBS by making use ofsemi-inner-products and the duality mapping. Properties of an RKBSand its reproducing kernel are investigated. As applications, wedevelop in the framework of RKBS standard learning schemes includingminimal norm interpolation, regularization network, support vectormachines, and kernel principal component analysis. In particular,existence, uniqueness and representer theorems are established.

私たちは、カーネルバナッハ空間(RKBS)の再現の概念を導入し、セミインナー積と双対写像を利用して特殊なセミインナー積RKBSについて研究します。RKBSとその再生カーネルのプロパティを調査します。アプリケーションとして、最小ノルム補間、正則化ネットワーク、サポートベクターマシン、カーネル主成分分析など、RKBS標準学習スキームのフレームワークで開発しています。特に、存在定理、一意性定理、表現定理が確立されます。

Learning Halfspaces with Malicious Noise
悪意あるノイズによるハーフスペースの学習

We give new algorithms for learning halfspaces in the challenging malicious noise model, where an adversary may corrupt both the labelsand the underlying distribution of examples. Our algorithms cantolerate malicious noise rates exponentially larger than previous workin terms of the dependence on the dimension n, and succeed for thefairly broad class of all isotropic log-concave distributions.We give poly(n, 1/ε)-time algorithms for solving the followingproblems to accuracy ε:

私たちは、敵対者がラベルと例の基本的な分布の両方を破壊する可能性のある、挑戦的な悪意のあるノイズモデルで半空間を学習するための新しいアルゴリズムを提供します。私たちのアルゴリズムは、次元nへの依存性の点で以前の研究よりも指数関数的に大きい悪意のあるノイズレートに耐えることができ、すべての等方性対数凹分布のかなり広範なクラスで成功します。次の問題を精度εに解くためのポリ(n、1/ε)時間アルゴリズムを与えます。

Structure Spaces
ストラクチャースペース

Finite structures such as point patterns, strings, trees, and graphs occuras “natural” representations of structured data in different applicationareas of machine learning. We develop the theory of structure spacesand derive geometrical and analytical concepts such as the angle betweenstructures and the derivative of functions on structures. In particular, weshow that the gradient of a differentiable structural function is awell-defined structure pointing in the direction of steepestascent. Exploiting the properties of structure spaces, it will turn out thata number of problems in structural pattern recognition such as centralclustering or learning in structured output spaces can be formulated asoptimization problems with cost functions that are locally Lipschitz. Hence,methods from nonsmooth analysis are applicable to optimize those costfunctions.

ポイントパターン、文字列、ツリー、グラフなどの有限構造は、機械学習のさまざまなアプリケーション領域で構造化データの「自然な」表現として発生します。構造空間の理論を発展させ、構造間の角度や構造上の関数の導関数などの幾何学的および解析的概念を導き出します。特に、微分可能な構造関数の勾配は、急勾配の方向を指す明確に定義された構造であることを示します。構造空間の特性を利用すると、中央クラスタリングや構造化出力空間での学習など、構造パターン認識における多くの問題を、局所的なリプシッツであるコスト関数の最適化問題として定式化できることがわかります。したがって、非平滑解析の方法は、これらのコスト関数を最適化するために適用できます。

Bounded Kernel-Based Online Learning
制限されたカーネルベースのオンライン学習

A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded.We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as it is demonstrated for ALMA2. Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy.

カーネルベースのパーセプトロンアルゴリズムなどのカーネルベースのオンラインアルゴリズムの一般的な問題は、オンライン仮説を格納するために必要なメモリの量であり、アルゴリズムが進むにつれてメモリの量が際限なく増加する可能性があります。さらに、このようなアルゴリズムの計算負荷は、仮説を格納するために使用されるメモリの量に比例して増加します。これらの問題を解決するために、これまでのほとんどの研究では、メモリを制限された状態に保つために、インスタンスの一部を破棄することに焦点を当ててきました。この論文では、インスタンスを破棄せずに、以前のオンライン仮説が及ぶ空間にインスタンスを投影する新しいアルゴリズムを紹介します。このアルゴリズムをProjectronと呼びます。Projectronソリューションのメモリサイズはトレーニング前に予測できませんが、ソリューションが制限されることが保証されていることを証明します。提案されたアルゴリズムの相対的な誤りの境界を導出し、そこからパーセプトロンよりも優れたわずかに異なるアルゴリズムを導きます。この2番目のアルゴリズムをProjectron++と呼びます。このアルゴリズムは、マルチクラスと構造化された出力設定を処理するように拡張できることを示しており、その結果、私たちの知る限り、複雑な分類タスクを学習できる最初のオンライン境界アルゴリズムが実現しました。仮説表現を境界付ける方法は、ALMA2で実証されているように、あらゆる保守的なオンラインアルゴリズムやその他のオンラインアルゴリズムに適用できます。さまざまなデータセットでの実験結果から、メモリと精度の両方の点で、さまざまな境界付きオンラインアルゴリズムと比較して、私たちの手法が経験的に優れていることがわかります。

DL-Learner: Learning Concepts in Description Logics
DL-Learner: 記述論理における概念の学習

In this paper, we introduce DL-Learner, a framework for learning in description logics and OWL. OWL is the official W3C standard ontology language for the Semantic Web. Concepts in this language can be learned for constructing and maintaining OWL ontologies or for solving problems similar to those in Inductive Logic Programming. DL-Learner includes several learning algorithms, support for different OWL formats, reasoner interfaces, and learning problems.It is a cross-platform framework implemented in Java. The framework allows easy programmatic access and provides a command line interface, a graphical interface as well as a WSDL-based web service.

この論文では、記述論理とOWLの学習フレームワークであるDL-Learnerについて紹介します。OWLは、セマンティックWebの公式なW3C標準オントロジー言語です。この言語の概念は、OWLオントロジーの構築と保守、または帰納論理プログラミングと同様の問題を解決するために学習することができます。DL-Learnerには、いくつかの学習アルゴリズム、さまざまなOWL形式のサポート、推論インターフェイス、および学習問題が含まれています。これは、Javaで実装されたクロスプラットフォームフレームワークです。このフレームワークは、プログラムによるアクセスを容易にし、コマンドラインインターフェース、グラフィカルインターフェース、およびWSDLベースのWebサービスを提供します。

Hash Kernels for Structured Data
構造化データのハッシュカーネル

We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs.

私たちは、効率的なカーネルを促進するためにハッシュ化を提案します。これは、サンプリングを使用して以前の作業を一般化し、データストリームとスパース特徴空間のカーネル行列を計算する原則的な方法を示します。さらに、正確なカーネル行列からの偏差境界を与えます。これは、文字列やグラフの推定に応用できます。

Learning When Concepts Abound
概念がたくさんあるときに学ぶ

Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days.As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)).

大規模なテキスト分類や単語予測などの多くの学習タスクは、インスタンスや特徴に加えてクラスの数が多い場合、つまり数千以上の場合、効率的なトレーニングと分類から恩恵を受けることができます。この課題に対処するために、スパースクラスインデックスの学習を調査します。インデックスは、特徴からクラスへのマッピングです。インデックス学習法を、1対1の分類や、パーセプトロンとサポートベクターマシンを使用したトップダウン分類などの他の手法と比較します。インデックス学習は、トレーニング時と分類時の両方で、空間と時間の効率に非常に有利であることがわかりました。さらに、このアプローチでは、同様の精度、場合によってはより優れた精度が得られます。数十万のインスタンスと数千のクラスがある問題では、インデックスは数分で学習されますが、他の方法では数時間または数日かかることがあります。説明したように、学習更新の設計により、各特徴をインデックス内のクラスの小さなサブセットに接続するように簡単に制約できます。この制約は、スケーラビリティにとって重要です。l個のアクティブな（正の値の）特徴を持つインスタンスを考えると、各特徴は平均してインデックス内のd個のクラスに接続し（実験では10個程度）、更新と分類にはO(dl log(dl))かかります。

Maximum Entropy Discrimination Markov Networks
最大エントロピー弁別マルコフネットワーク

The standard maximum margin approach for structured prediction lacksa straightforward probabilistic interpretation of the learningscheme and the prediction rule. Therefore its unique advantages suchas dual sparseness and kernel tricks cannot be easily conjoined withthe merits of a probabilistic model such as Bayesian regularization,model averaging, and ability to model hidden variables. In thispaper, we present a new general framework called maximumentropy discrimination Markov networks (MaxEnDNet, or simply,MEDN), which integrates these two approaches and combines andextends their merits. Major innovations of this approach include: 1)It extends the conventional max-entropy discrimination learning ofclassification rules to a new structural max-entropydiscrimination paradigm of learning a distribution of Markovnetworks. 2) It generalizes the extant Markov networkstructured-prediction rule based on a point estimator of modelcoefficients to an averaging model akin to a Bayesian predictor thatintegrates over a learned posterior distribution of modelcoefficients. 3) It admits flexible entropic regularization of themodel during learning. By plugging in different prior distributionsof the model coefficients, it subsumes the well-known maximum marginMarkov networks (M3N) as a special case, and leads to a modelsimilar to an L1-regularized M3N that is simultaneously primaland dual sparse, or other new types of Markov networks. 4) Itapplies a modular learning algorithm that combines existingvariational inference techniques and convex-optimization basedM3N solvers as subroutines.Essentially, MEDN can be understood as a jointly maximumlikelihood and maximum margin estimate of Markov network. Itrepresents the first successful attempt to combine maximum entropylearning (a dual form of maximum likelihood learning) with maximummargin learning of Markov network for structured input/outputproblems; and the basic principle can be generalized to learningarbitrary graphical models, such as the generative Bayesian networksor models with structured hidden variables. We discuss a number oftheoretical properties of this approach, and show that empirically itoutperforms a wide array of competing methods for structuredinput/output learning on both synthetic and real OCR and web dataextraction data sets.

構造化予測の標準的な最大マージンアプローチには、学習スキームと予測ルールの直接的な確率的解釈が欠けています。そのため、デュアルスパースネスやカーネルトリックなどの独自の利点は、ベイズ正則化、モデル平均化、隠れた変数のモデル化機能などの確率モデルのメリットと簡単に結び付けることができません。この論文では、最大エントロピー識別マルコフネットワーク(MaxEnDNet、または単にMEDN)と呼ばれる新しい一般的なフレームワークを紹介します。これは、これら2つのアプローチを統合し、それぞれのメリットを組み合わせて拡張したものです。このアプローチの主な革新は次のとおりです。1)分類ルールの従来の最大エントロピー識別学習を、マルコフネットワークの分布を学習する新しい構造的最大エントロピー識別パラダイムに拡張します。2)モデル係数の点推定に基づく既存のマルコフネットワーク構造化予測ルールを、学習されたモデル係数の事後分布を積分するベイズ予測子に似た平均化モデルに一般化します。3)学習中にモデルの柔軟なエントロピー正則化を許可します。モデル係数のさまざまな事前分布を差し込むことで、よく知られている最大マージンマルコフネットワーク(M3N)を特別なケースとして包含し、同時にプライマとデュアルスパースであるL1正則化M3Nに似たモデル、または他の新しいタイプのマルコフネットワークにつながります。4)既存の変分推論手法と凸最適化ベースのM3Nソルバーをサブルーチンとして組み合わせたモジュール学習アルゴリズムを適用します。基本的に、MEDNはマルコフネットワークの最大尤度と最大マージン推定を組み合わせたものとして理解できます。これは、構造化された入出力問題に対して、最大エントロピー学習(最大尤度学習の二重形式)とマルコフネットワークの最大マージン学習を組み合わせた最初の成功した試みです。基本原理は、生成ベイジアンネットワークや構造化された隠れ変数を持つモデルなどの任意のグラフィカルモデルの学習に一般化できます。このアプローチのいくつかの理論的特性について説明し、合成および実際のOCRとWebデータ抽出データセットの両方で、構造化された入出力学習のさまざまな競合方法よりも経験的に優れていることを示します。

When Is There a Representer Theorem? Vector Versus Matrix Regularizers
表現定理はいつあるのか?ベクトルと行列の正則化器

We consider a general class of regularization methods whichlearn a vector of parameters on the basis of linear measurements. Itis well known that if the regularizer is a nondecreasing function ofthe L2 norm, then the learned vector is a linear combination ofthe input data. This result, known as the representer theorem, lies atthe basis of kernel-based methods in machine learning. In this paper,we prove the necessity of the above condition, in the case of differentiable regularizers. We further extend our analysis to regularization methods which learn a matrix, aproblem which is motivated by the application to multi-tasklearning. In this context, we study a more general representertheorem, which holds for a larger class of regularizers. We provide anecessary and sufficient condition characterizing this class of matrixregularizers and we highlight some concrete examples ofpractical importance. Our analysis uses basic principles from matrixtheory, especially the useful notion of matrix nondecreasing functions.

私たちは、線形測定に基づいてパラメータのベクトルを学習する一般的なクラスの正則化法について考察します。正則化子がL2ノルムの非減少関数である場合、学習されたベクトルは入力データの線形結合であることがよく知られています。この結果は代表子定理として知られ、機械学習におけるカーネルベースの方法の基礎となっています。この論文では、微分可能な正則化子の場合に上記の条件が必要であることを証明します。さらに、マルチタスク学習への応用を目的とした問題である、行列を学習する正則化法に分析を拡張します。この文脈で、より大規模なクラスの正則化子に当てはまる、より一般的な代表子定理を検討します。このクラスの行列正則化子を特徴付ける必要十分条件を提供し、実用上重要な具体的な例をいくつか示します。分析では、行列理論の基本原理、特に行列非減少関数の有用な概念を使用します。

Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression
カーネル分位点回帰の交差検証解のためのバイレベルパス追従

We show how to follow the path of cross validatedsolutions to families of regularized optimization problems, definedby a combination of a parameterized loss function and aregularization term. A primary example is kernel quantileregression, where the parameter of the loss function is the quantilebeing estimated. Even though the bi-level optimization problem weencounter for every quantile is non-convex, the manner in which theoptimal cross-validated solution evolves with the parameter of theloss function allows tracking of this solution. We prove thisproperty, construct the resulting algorithm, and demonstrate it onreal and artificial data. This algorithm allows us toefficiently solve the whole family of bi-level problems. We show howit can be extended to cover other modeling problems, like supportvector regression, and alternative in-sample model selectionapproaches.

私たちは、パラメータ化された損失関数と正則化項の組み合わせによって定義される正則化最適化問題のファミリへの交差検証解のパスをたどる方法を示します。主な例はカーネル分位点回帰で、損失関数のパラメータは推定される分位数です。すべての分位数で遭遇する二水準最適化問題は非凸ですが、損失関数のパラメーターと共に最適な交差検証解が進化する方法により、この解の追跡が可能になります。この特性を証明し、結果のアルゴリズムを構築し、それを実在データと人工データで実証します。このアルゴリズムにより、バイレベル問題のファミリー全体を効率的に解決できます。サポートベクトル回帰や代替のサンプル内モデル選択アプローチなど、他のモデリング問題をカバーするために拡張する方法を示します。

Prediction With Expert Advice For The Brier Game
ブライアゲームのための専門家のアドバイスによる予測

We show that the Brier game of prediction is mixable and find the optimal learning rate and substitution function for it. The resulting prediction algorithm is applied to predict results of football and tennis matches, with well-known bookmakers playing the role of experts. The theoretical performance guarantee is not excessively loose on the football data set and is rather tight on the tennis data set.

私たちは、予測のブライアゲームが混合可能であることを示し、最適な学習率とそれに対する置換関数を見つけます。結果として得られる予測アルゴリズムは、サッカーとテニスの試合の結果を予測するために適用され、有名なブックメーカーが専門家の役割を果たします。理論上のパフォーマンス保証は、サッカーのデータセットでは過度に緩くなく、テニスのデータセットではかなり厳しいものです。

Reinforcement Learning in Finite MDPs: PAC Analysis
有限MDPにおける強化学習:PAC分析

We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These “PAC-MDP” algorithms include the well-known E3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state-of-the-art by presenting bounds for the problem in a unified theoretical framework. A more refined analysis for upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX.

私たちは、多項式のサンプル数を持つ有限マルコフ決定過程(MDP)における最適に近い振る舞いを学習する問題を研究しています。これらの「PAC-MDP」アルゴリズムには、よく知られているE3およびR-MAXアルゴリズム、および最新の遅延Q学習アルゴリズムが含まれます。現在の最先端技術を、統一された理論的枠組みで問題の限界を提示することにより要約します。上限と下限のより詳細な分析が示され、モデルフリーの遅延Q学習とモデルベースのR-MAXの違いについての洞察が得られます。

Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
製品分布を利用して、相関免疫機能の関連変数を特定する

A Boolean function f is correlation immune if each inputvariable is independent of the output, under the uniform distributionon inputs. For example, the parity function is correlation immune.We consider the problem of identifying relevant variables of acorrelation immune function, in the presence of irrelevant variables.We address this problem in two different contexts. First, we analyzeSkewing, a heuristic method that was developed to improve theability of greedy decision tree algorithms to identify relevantvariables of correlation immune Boolean functions, given examplesdrawn from the uniform distribution (Page and Ray, 2003). We presenttheoretical results revealing both the capabilities and limitations ofskewing. Second, we explore the problem of identifying relevantvariables in the Product Distribution Choice (PDC) learningmodel, a model in which the learner can choose product distributionsand obtain examples from them. We prove a lemma establishing aproperty of Boolean functions that may be of independent interest.Using this lemma, we give two new algorithms for finding relevantvariables of correlation immune functions in the PDC model.

ブール関数fは、入力が一様分布している状態で、各入力変数が出力から独立している場合、相関がありません。たとえば、パリティ関数は相関がありません。無関係な変数が存在する場合に、相関がない関数の関連変数を識別する問題について考えます。この問題を2つの異なるコンテキストで取り上げます。まず、一様分布(Page and Ray、2003)から抽出された例に基づいて、相関がないブール関数の関連変数を識別する貪欲決定木アルゴリズムの能力を向上させるために開発されたヒューリスティック手法であるSkewingを分析します。Skewingの能力と限界の両方を明らかにする理論的結果を示します。次に、学習者が製品分布を選択してそこから例を取得できるモデルである製品分布選択(PDC)学習モデルで関連変数を識別する問題を検討します。私たちは、独立した関心事となる可能性のあるブール関数の特性を確立する補題を証明します。この補題を使用して、PDCモデルにおける相関免疫関数の関連変数を見つけるための2つの新しいアルゴリズムを提供します。

Estimating Labels from Label Proportions
ラベル比率からのラベルの推定

Consider the following problem: given sets of unlabeled observations,each set with known label proportions, predict the labels of anotherset of observations, possibly with known label proportions. Thisproblem occurs in areas like e-commerce, politics, spam filtering andimproper content detection. We present consistent estimators which canreconstruct the correct labels with high probability in a uniformconvergence sense. Experiments show that our method works well inpractice.

次の問題を考えてみましょう:ラベルのない観測値の集合が与えられた場合、各セットが既知のラベル比率を持つ各セットは、おそらく既知のラベル比率を持つ別の観測セットのラベルを予測します。この問題は、電子商取引、政治、スパムフィルタリング、不適切なコンテンツ検出などの分野で発生します。私たちは、一様収束の意味で高い確率で正しいラベルを再構築できる一貫性のある推定量を提示します。実験によると、私たちの方法は実践でもうまく機能することがわかっています。

Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
相関誤差を持つ再帰的線形モデルにおける最尤推定値の計算

In recursive linear models, the multivariate normal joint distributionof all variables exhibits a dependence structure induced by arecursive (or acyclic) system of linear structural equations. Theselinear models have a long tradition and appear in seemingly unrelatedregressions, structural equation modelling, and approaches to causalinference. They are also related to Gaussian graphical models via aclassical representation known as a path diagram. Despite the models’long history, a number of problems remain open. In this paper, weaddress the problem of computing maximum likelihood estimates in thesubclass of ‘bow-free’ recursive linear models. The term ‘bow-free’refers to the condition that the errors for variables i and j beuncorrelated if variable i occurs in the structural equation forvariable j. We introduce a new algorithm, termed Residual IterativeConditional Fitting (RICF), that can be implemented using only leastsquares computations. In contrast to existing algorithms, RICF hasclear convergence properties and yields exact maximum likelihoodestimates after the first iteration whenever the MLE is available inclosed form.

再帰線形モデルでは、すべての変数の多変量正規分布は、再帰的(または非巡回的)な線形構造方程式システムによって誘導される依存構造を示します。これらの線形モデルは長い伝統があり、一見無関係な回帰、構造方程式モデリング、および因果推論へのアプローチに現れます。また、パス図として知られる古典的な表現を介して、ガウスグラフィカルモデルとも関連しています。このモデルは長い歴史がありますが、多くの問題が未解決のままです。この論文では、「ボウフリー」再帰線形モデルのサブクラスで最大尤度推定値を計算する問題に取り組みます。「ボウフリー」という用語は、変数iが変数jの構造方程式に出現する場合、変数iとjの誤差が相関しないという条件を指します。最小二乗計算のみを使用して実装できる、残差反復条件付きフィッティング(RICF)と呼ばれる新しいアルゴリズムを紹介します。既存のアルゴリズムとは対照的に、RICFは明確な収束特性を持ち、MLEが閉じた形式で利用できる場合は常に最初の反復後に正確な最大尤度推定値を生成します。

The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
非超常現象:高次元無向グラフのセミパラメトリック推定

Recent methods for estimating sparse undirected graphs forreal-valued data in high dimensional problems rely heavily on the assumption of normality.We show how to use a semiparametric Gaussian copula—or”nonparanormal”—for high dimensional inference. Just as additivemodels extend linear models by replacing linear functions with a setof one-dimensional smooth functions, the nonparanormal extends thenormal by transforming the variables by smooth functions. We derive amethod for estimating the nonparanormal, study the method’stheoretical properties, and show that it works well in many examples.

高次元問題における実数値データのスパース無向グラフを推定する最近の方法は、正規性の仮定に大きく依存しています。高次元推論のためにセミパラメトリックガウスコピュラ—または「非超常的」—を使用する方法を示します。加法モデルが線形関数を1次元の平滑関数のセットに置き換えることで線形モデルを拡張するのと同様に、非超常線は変数を平滑関数で変換することで正規線を拡張します。非超常現象を推定するための方法を導き出し、その方法の理論的特性を研究し、それが多くの例でうまく機能することを示します。

Learning Nondeterministic Classifiers
非決定論的分類器の学習

Nondeterministic classifiers are defined as those allowed to predictmore than one class for some entries from an input space. Given thatthe true class should be included in predictions and the number ofclasses predicted should be as small as possible, these kind ofclassifiers can be considered as Information Retrieval (IR)procedures. In this paper, we propose a family of IR loss functions tomeasure the performance of nondeterministic learners. Afterdiscussing such measures, we derive an algorithm for learning optimalnondeterministic hypotheses. Given an entry from the input space, thealgorithm requires the posterior probabilities to compute the subsetof classes with the lowest expected loss. From a general point ofview, nondeterministic classifiers provide an improvement in theproportion of predictions that include the true class compared totheir deterministic counterparts; the price to be paid for thisincrease is usually a tiny proportion of predictions with more thanone class. The paper includes an extensive experimental study usingthree deterministic learners to estimate posterior probabilities: amulticlass Support Vector Machine (SVM), a Logistic Regression, and aNaïve Bayes. The data sets considered comprise both UCImulti-class learning tasks and microarray expressions of differentkinds of cancer. We successfully compare nondeterministic classifierswith other alternative approaches. Additionally, we shall see how thequality of posterior probabilities (measured by the Brier score)determines the goodness of nondeterministic predictions.

非決定性分類器は、入力空間からのいくつかのエントリに対して複数のクラスを予測できる分類器として定義されます。予測には真のクラスが含まれる必要があり、予測されるクラスの数は可能な限り少なくする必要があることを考慮すると、この種の分類器は情報検索(IR)手順と見なすことができます。この論文では、非決定性学習器のパフォーマンスを測定するためのIR損失関数のファミリを提案します。このような測定について説明した後、最適な非決定性仮説を学習するためのアルゴリズムを導き出します。入力空間からのエントリが与えられた場合、アルゴリズムは事後確率を使用して、期待損失が最も低いクラスのサブセットを計算します。一般的な観点から、非決定性分類器は、決定性分類器と比較して、真のクラスを含む予測の割合が向上します。この増加の代償として、通常、複数のクラスを含む予測の割合はごくわずかです。この論文には、事後確率を推定するために、マルチクラスサポートベクターマシン(SVM)、ロジスティック回帰、およびナイーブベイズの3つの決定論的学習器を使用した広範な実験研究が含まれています。検討対象のデータセットには、UCIマルチクラス学習タスクとさまざまな種類の癌のマイクロアレイ表現の両方が含まれています。非決定論的分類器を他の代替アプローチと比較することに成功しました。さらに、事後確率の品質(Brierスコアで測定)が非決定論的予測の良し悪しを決定する仕組みについても説明します。

The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
PノルムPush: リストの上位に集中する単純な凸型ランキングアルゴリズム

We are interested in supervised ranking algorithms that performespecially well near the top of the ranked list, and are only requiredto perform sufficiently well on the rest of the list. In this work,we provide a general form of convex objective that gives high-scoringexamples more importance. This “push” near the top of the list can bechosen arbitrarily large or small, based on the preference of theuser. We choose lp-norms to provide a specific type of push; ifthe user sets p larger, the objective concentrates harder on the topof the list. We derive a generalization bound based on the p-normobjective, working around the natural asymmetry of the problem. Wethen derive a boosting-style algorithm for the problem of ranking witha push at the top. The usefulness of the algorithm is illustratedthrough experiments on repository data. We prove that the minimizer ofthe algorithm’s objective is unique in a specific sense. Furthermore,we illustrate how our objective is related to quality measurements forinformation retrieval.

私たちが関心を持っているのは、ランク付けされたリストの上位付近で特に優れたパフォーマンスを発揮し、リストの残りの部分では十分なパフォーマンスを発揮するだけでよい、教師ありランキングアルゴリズムです。この研究では、高得点の例に高い重要性を与える凸目標の一般的な形式を提供します。リストの上位付近でのこの「プッシュ」は、ユーザーの好みに基づいて、任意の大きさまたは小ささに選択できます。特定の種類のプッシュを提供するために、lpノルムを選択します。ユーザーがpを大きく設定すると、目標はリストの上位に集中します。私たちは、問題の自然な非対称性を回避しながら、pノルム目標に基づく一般化境界を導き出します。次に、上位でのプッシュによるランキングの問題に対するブースティングスタイルのアルゴリズムを導き出します。このアルゴリズムの有用性は、リポジトリデータでの実験を通じて示されます。私たちは、アルゴリズムの目標の最小化が特定の意味で一意であることを証明します。さらに、我々の目標が情報検索の品質測定とどのように関連しているかを示します。

Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
マージンベースのランキングとAdaBoostとRankBoostの同等性

We study boosting algorithms for learning to rank. We give a generalmargin-based bound for ranking based on covering numbers for thehypothesis space. Our bound suggests that algorithms that maximize theranking margin will generalize well. We then describe a new algorithm,smooth margin ranking, that precisely converges to a maximumranking-margin solution. The algorithm is a modification of RankBoost,analogous to “approximate coordinate ascent boosting.” Finally, weprove that AdaBoost and RankBoost are equally good for the problems ofbipartite ranking and classification in terms of their asymptoticbehavior on the training set. Under natural conditions, AdaBoostachieves an area under the ROC curve that is equally as good asRankBoost’s; furthermore, RankBoost, when given a specific intercept,achieves a misclassification error that is as good as AdaBoost’s. Thismay help to explain the empirical observations made by Cortes andMohri, and Caruana and Niculescu-Mizil, about the excellentperformance of AdaBoost as a bipartite ranking algorithm, as measuredby the area under the ROC curve.

私たちは、ランク付けを学習するためのブースティングアルゴリズムについて検討します。仮説空間のカバー数に基づくランク付けの一般的なマージンベースの境界を示します。この境界は、ランク付けマージンを最大化するアルゴリズムが一般化に適していることを示唆しています。次に、最大ランク付けマージンソリューションに正確に収束する新しいアルゴリズム、スムーズマージンランキングについて説明します。このアルゴリズムは、RankBoostの修正版であり、「近似座標上昇ブースティング」に類似しています。最後に、トレーニングセットでの漸近的動作の観点から、AdaBoostとRankBoostが二部ランク付けと分類の問題に同等に適していることを証明します。自然な条件下では、AdaBoostは、RankBoostと同等のROC曲線下面積を達成します。さらに、RankBoostは、特定の切片が与えられた場合、AdaBoostと同等の誤分類エラーを達成します。これは、ROC曲線の下の領域で測定された二部ランキングアルゴリズムとしてのAdaBoostの優れたパフォーマンスに関する、CortesとMohri、およびCaruanaとNiculescu-Mizilによる実証的観察を説明するのに役立つ可能性があります。

Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
大規模なリスクを最小限に抑えるための最適化された切断面アルゴリズム

We have developed an optimized cutting plane algorithm (OCA) for solving large-scalerisk minimization problems. We prove that the number of iterations OCArequires to converge to a ε precise solution is approximately linearin the sample size. We also derive OCAS, an OCA-based linear binary SupportVector Machine (SVM) solver, and OCAM, a linear multi-class SVM solver. In anextensive empirical evaluation we show that OCAS outperforms currentstate-of-the-art SVM solvers like SVMlight, SVMperf and BMRM, achievingspeedup factor more than 1,200 over SVMlight on some data sets and speedupfactor of 29 over SVMperf, while obtaining the same precise support vectorsolution. OCAS, even in the early optimization steps, often shows fasterconvergence than the currently prevailing approximative methods in this domain, SGDand Pegasos. In addition, our proposed linear multi-class SVM solver, OCAM,achieves speedups of factor of up to 10 compared to SVMmulti-class. Finally, weuse OCAS and OCAM in two real-world applications, the problem of human acceptorsplice site detection and malware detection. Effectively parallelizing OCAS, weachieve state-of-the-art results on an acceptor splice site recognition problemonly by being able to learn from all the available 50 million examples in a 12-million-dimensional feature space. Source code, data sets and scripts toreproduce the experiments are available athttp://cmp.felk.cvut.cz/~xfrancv/ocas/html/.

私たちは、大規模なリスク最小化問題を解くための最適化された切断面アルゴリズム(OCA)を開発しました。OCAが ε の正確な解に収束するために必要な反復回数は、サンプルサイズに対してほぼ線形であることを証明しました。また、OCAベースの線形バイナリサポートベクターマシン(SVM)ソルバーであるOCASと、線形マルチクラスSVMソルバーであるOCAMも導出しました。広範な実証的評価では、OCASがSVMlight、SVMperf、BMRMなどの最新のSVMソルバーよりも優れており、一部のデータセットではSVMlightより1,200倍以上高速化され、SVMperfより29倍高速化され、同じ正確なサポートベクター解が得られることが示されました。OCASは、初期の最適化ステップでも、この分野で現在普及している近似法であるSGDやPegasosよりも収束が速いことがよくあります。さらに、私たちが提案する線形マルチクラスSVMソルバーOCAMは、SVMマルチクラスと比較して最大10倍の高速化を実現します。最後に、人間のアクセプタスプライスサイトの検出とマルウェアの検出という2つの実際のアプリケーションでOCASとOCAMを使用します。OCASを効果的に並列化することで、1,200万次元の特徴空間で利用可能な5,000万の例すべてから学習できるようになり、アクセプタスプライスサイトの認識問題で最先端の結果を達成しました。実験を再現するためのソースコード、データセット、スクリプトは、http://cmp.felk.cvut.cz/~xfrancv/ocas/html/で入手できます。

Discriminative Learning Under Covariate Shift
共変量シフト下での判別学習

We address classification problems for which the training instancesare governed by an input distribution that is allowed to differarbitrarily from the test distribution—problems also referred to asclassification under covariate shift. We derive a solution that ispurely discriminative: neither training nor test distribution aremodeled explicitly. The problem of learning under covariate shift canbe written as an integrated optimization problem. Instantiating thegeneral optimization problem leads to a kernel logistic regression andan exponential model classifier for covariate shift. The optimizationproblem is convex under certain conditions; our findings also clarifythe relationship to the known kernel mean matching procedure. Wereport on experiments on problems of spam filtering, textclassification, and landmine detection.

私たちは、トレーニングインスタンスがテスト分布から任意に異なることが許される入力分布によって支配される分類問題—共変量シフトの下での分類とも呼ばれる問題に対処します。私たちは、純粋に差別的な解決策を導き出します:トレーニングもテスト配布も明示的にモデル化されていません。共変量シフト下での学習の問題は、統合最適化問題として書くことができます。一般最適化問題をインスタンス化すると、カーネルロジスティック回帰と共変量シフトの指数モデル分類器が得られます。optimizationproblemは、特定の条件下では凸型です。私たちの調査結果は、既知のカーネル平均マッチング手順との関係も明らかにします。スパムフィルタリング、テキスト分類、地雷検出の問題に関する実験について報告します。

RL-Glue: Language-Independent Software for Reinforcement-Learning Experiments
RL-Glue:強化学習実験のための言語に依存しないソフトウェア

RL-Glue is a standard, language-independent software package forreinforcement-learning experiments. The standardization provided byRL-Glue facilitates code sharing and collaboration. Code sharingreduces the need to re-engineer tasks and experimental apparatus, bothcommon barriers to comparatively evaluating new ideas in the contextof the literature. Our software features a minimalist interface andworks with several languages and computing platforms. RL-Gluecompatibility can be extended to any programming language thatsupports network socket communication. RL-Glue has been used to teachclasses, to run international competitions, and is currently used byseveral other open-source software and hardware projects.

RL-Glueは、強化学習実験のための標準的な言語に依存しないソフトウェアパッケージです。RL-Glueが提供する標準化により、コードの共有とコラボレーションが容易になります。コード共有により、文献の文脈で新しいアイデアを比較するための共通の障壁であるタスクと実験装置を再設計する必要性が減少します。当社のソフトウェアはミニマリストインターフェースを備えており、いくつかの言語とコンピューティングプラットフォームで動作します。RL-Glue互換性は、ネットワークソケット通信をサポートする任意のプログラミング言語に拡張できます。RL-Glueは、クラスの指導や国際コンペティションの運営に使用されており、現在、他のいくつかのオープンソースソフトウェアおよびハードウェアプロジェクトで使用されています。

Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
サポートベクトル回帰と関連する正則化カーネル法の決定論的エラー分析

We introduce a new technique for the analysis of kernel-basedregression problems. The basic tools are sampling inequalities whichapply to all machine learning problems involving penalty terms inducedby kernels related to Sobolev spaces. They lead to explicitdeterministic results concerning the worst case behaviour ofε- and ν-SVRs. Using these, we show how to adjustregularization parameters to get best possible approximation ordersfor regression. The results are illustrated by some numericalexamples.

私たちは、カーネルベースの回帰問題を解析するための新しい手法を紹介します。基本的なツールは、ソボレフ空間に関連するカーネルによって誘発されるペナルティ項を含むすべての機械学習問題に適用されるサンプリング不等式です。これらは、ε-およびν-SVRの最悪の場合の動作に関する明示的な決定論的な結果につながります。これらを使用して、回帰の最適な近似次数を取得するために正則化パラメーターを調整する方法を示します。結果は、いくつかの数値の例で示されています。

An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
複数分類器システムにおけるサブシステム学習のための反相関カーネル

We present a method for training support vector machine (SVM)-basedclassification systems for combination with other classificationsystems designed for the same task. Ideally, a new system should bedesigned such that, when combined with existing systems, the resultingperformance is optimized. We present a simple model for this problemand use the understanding gained from this analysis to propose amethod to achieve better combination performance when training SVMsystems. We include a regularization term in the SVM objectivefunction that aims to reduce the average class-conditional covariancebetween the resulting scores and the scores produced by the existingsystems, introducing a trade-off between such covariance and thesystem’s individual performance. That is, the new system “takes onefor the team”, falling somewhat short of its best possible performancein order to increase the diversity of the ensemble. We report resultson the NIST 2005 and 2006 speaker recognition evaluations (SREs) for avariety of subsystems. We show a gain of 19% on the equal error rate(EER) of a combination of four systems when applying the proposedmethod with respect to the performance obtained when the four systemsare trained independently of each other.

私たちは、サポートベクターマシン(SVM)ベースの分類システムを、同じタスク用に設計された他の分類システムと組み合わせてトレーニングする方法を紹介します。理想的には、新しいシステムは、既存のシステムと組み合わせたときに、結果のパフォーマンスが最適化されるように設計する必要があります。我々はこの問題の簡単なモデルを提示し、この分析から得られた理解を使用して、SVMシステムをトレーニングするときに、より優れた組み合わせパフォーマンスを実現する方法を提案します。私たちは、結果のスコアと既存のシステムによって生成されたスコアの間の平均クラス条件付き共分散を減らすことを目的としたSVM目的関数に正規化項を含め、このような共分散とシステムの個々のパフォーマンスの間にトレードオフを導入します。つまり、新しいシステムは「チームのために1つを取ります」。アンサンブルの多様性を高めるために、可能な限り最高のパフォーマンスにはやや及ばないことになります。私たちは、さまざまなサブシステムのNIST 2005および2006スピーカー認識評価(SRE)の結果を報告します。提案された方法を適用すると、4つのシステムをそれぞれ独立してトレーニングした場合に得られるパフォーマンスと比較して、4つのシステムの組み合わせの等エラー率(EER)が19%向上することがわかります。

Evolutionary Model Type Selection for Global Surrogate Modeling
グローバル・サロゲート・モデリングのための進化モデル・タイプの選択

Due to the scale and computational complexity of currently used simulationcodes, global surrogate (metamodels) models have become indispensable tools forexploring and understanding the design space. Due to their compactformulation they are cheap to evaluate and thus readily facilitatevisualization, design space exploration, rapid prototyping, and sensitivityanalysis. They can also be used as accurate building blocks in designpackages or larger simulation environments. Consequently, there isgreat interest in techniques that facilitate the construction of suchapproximation models while minimizing the computational cost and maximizingmodel accuracy. Many surrogate model types exist (Support Vector Machines,Kriging, Neural Networks, etc.) but no type is optimal in all circumstances.Nor is there any hard theory available that can help make this choice.In this paper we present an automatic approach to the model type selectionproblem. We describe an adaptive global surrogate modeling environmentwith adaptive sampling, driven by speciated evolution. Different modeltypes are evolved cooperatively using a Genetic Algorithm (heterogeneousevolution) and compete to approximate the iteratively selected data.In this way the optimal model type and complexity for a given data setor simulation code can be dynamically determined. Its utility andperformance is demonstrated on a number of problems where it outperformstraditional sequential execution of each model type.

現在使用されているシミュレーションコードの規模と計算の複雑さのため、グローバルサロゲート(メタモデル)モデルは、設計空間の調査と理解に不可欠なツールになっています。コンパクトな定式化により、評価コストが低く、視覚化、設計空間の調査、ラピッドプロトタイピング、感度分析が容易に行えます。また、設計パッケージや大規模なシミュレーション環境で正確な構成要素として使用することもできます。その結果、計算コストを最小限に抑え、モデルの精度を最大化しながら、このような近似モデルの構築を容易にする手法に大きな関心が寄せられています。サロゲートモデルには多くの種類(サポートベクターマシン、クリギング、ニューラルネットワークなど)がありますが、すべての状況で最適な種類はありません。また、この選択に役立つ確固たる理論もありません。この論文では、モデルの種類の選択問題に対する自動アプローチを紹介します。ここでは、種分化進化によって駆動される適応サンプリングを備えた適応型グローバルサロゲートモデリング環境について説明します。異なるモデルタイプは、遺伝的アルゴリズム(異種進化)を使用して協調的に進化し、反復的に選択されたデータを近似するために競合します。このようにして、特定のデータセットまたはシミュレーションコードに最適なモデルタイプと複雑さを動的に決定できます。その有用性とパフォーマンスは、各モデルタイプの従来の順次実行よりも優れたパフォーマンスを発揮する多くの問題で実証されています。

Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
超高次元特徴選択:線形モデルを超えて

Variable selection in high-dimensional space characterizes manycontemporary problems in scientific discovery and decision making.Many frequently-used techniques are based on independence screening;examples include correlation ranking (Fan & Lv, 2008) or feature selectionusing a two-sample t-test in high-dimensional classification(Tibshirani et al., 2003). Within the context of the linear model, Fan & Lv (2008)showed that this simple correlation ranking possesses a sureindependence screening property under certain conditions and that itsrevision, called iteratively sure independent screening (ISIS), isneeded when the features are marginally unrelated but jointly relatedto the response variable. In this paper, we extend ISIS, withoutexplicit definition of residuals, to a general pseudo-likelihoodframework, which includes generalized linear models as a specialcase. Even in the least-squares setting, the new method improves ISISby allowing feature deletion in the iterative process. Our techniqueallows us to select important features in high-dimensionalclassification where the popularly used two-sample t-method fails. Anew technique is introduced to reduce the false selection rate in thefeature screening stage. Several simulated and two real data examplesare presented to illustrate the methodology.

高次元空間での変数選択は、科学的発見や意思決定における多くの現代的問題を特徴づけます。頻繁に使用される多くの手法は、独立性スクリーニングに基づいています。例としては、相関ランキング(Fan & Lv、2008)や、高次元分類における2サンプルt検定を使用した特徴選択(Tibshirani他、2003)などがあります。線形モデルのコンテキストでは、Fan & Lv (2008)は、この単純な相関ランキングが特定の条件下で確実な独立性スクリーニング特性を持ち、特徴が応答変数とわずかに無関係であるが共同で関連している場合、反復確実な独立スクリーニング(ISIS)と呼ばれるその修正が必要であることを示しました。この論文では、残差の明示的な定義なしに、ISISを一般的な疑似尤度フレームワークに拡張します。これには、一般化線形モデルが特別なケースとして含まれています。最小二乗設定でも、新しい方法は反復プロセスでの特徴の削除を可能にすることでISISを改善します。私たちの技術により、一般的に使用されている2サンプルt法では失敗する高次元分類で重要な特徴を選択できます。特徴スクリーニング段階での誤った選択率を減らすために、新しい技術が導入されています。この方法論を説明するために、いくつかのシミュレーション例と2つの実際のデータ例が示されています。

Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
再帰的Lanczos二分法による高次元データのための高速近似kNNグラフ構築

Nearest neighbor graphs are widely used in data mining and machinelearning. A brute-force method to compute the exact kNN graphtakes Θ(dn2) time for n data points in the d dimensionalEuclidean space. We propose two divide and conquer methods forcomputing an approximate kNN graph in Θ(dnt) time for highdimensional data (large d). The exponent t ∈ (1,2) is anincreasing function of an internal parameter α which governs thesize of the common region in the divide step. Experiments show that ahigh quality graph can usually be obtained with small overlaps, that is,for small values of t.A few of the practical details of the algorithms are asfollows. First, the divide step uses an inexpensive Lanczos procedureto perform recursive spectral bisection. After each conquer step, anadditional refinement step is performed to improve the accuracy of thegraph. Finally, a hash table is used to avoid repeating distancecalculations during the divide and conquer process. The combination ofthese techniques is shown to yield quite effective algorithms forbuilding kNN graphs.

最近傍グラフは、データマイニングや機械学習で広く使用されています。正確なkNNグラフを計算するためのブルートフォース法は、d次元ユークリッド空間のnデータポイントに対して Θ(dn2)時間かかります。高次元データ(大きいd)に対して近似kNNグラフを Θ(dnt)時間で計算するための2つの分割統治法を提案します。指数t∈(1,2)は、分割ステップで共通領域のサイズを制御する内部パラメーター α の増加関数です。実験では、通常、tの値が小さい場合、つまりオーバーラップが小さい場合に、高品質のグラフが得られることがわかっています。アルゴリズムの実用的な詳細をいくつか次に示します。まず、分割ステップでは、安価なLanczos手順を使用して、再帰スペクトル二分を実行します。各統治ステップの後、グラフの精度を向上させるために追加の改良ステップが実行されます。最後に、ハッシュテーブルを使用して、分割統治プロセス中に距離計算が繰り返されないようにします。これらの手法を組み合わせると、kNNグラフを構築するための非常に効果的なアルゴリズムが得られることが示されています。

Provably Efficient Learning with Typed Parametric Models
型付きパラメトリックモデルによる証明可能な効率的な学習

To quickly achieve good performance, reinforcement-learning algorithmsfor acting in large continuous-valued domains must use arepresentation that is both sufficiently powerful to capture importantdomain characteristics, and yet simultaneously allows generalization,or sharing, among experiences. Our algorithm balances this tradeoff byusing a stochastic, switching, parametric dynamics representation. Weargue that this model characterizes a number of significant,real-world domains, such as robot navigati on across varyingterrain. We prove that this representational assumption allows ouralgorithm to be probably approximately correct with a samplecomplexity that scales polynomially with all problem-specificquantities including the state-space dimension. We also explicitlyincorporate the error introduced by approximate planning in our samplecomplexity bounds, in contrast to prior Probably Approximately Correct(PAC) Markov Decision Processes (MDP) approaches, which typicallyassume the estimated MDP can be solved exactly. Our experimentalresults on constructing plans for driving to work using real cartrajectory data, as well as a small robot experiment on navigatingvarying terrain, demonstrate that our dynamics representation enablesus to capture real-world dynamics in a sufficient manner to producegood performance.

優れたパフォーマンスを迅速に達成するために、大規模な連続値ドメインで動作する強化学習アルゴリズムは、重要なドメイン特性を捉えるのに十分なほど強力でありながら、同時に経験間の一般化または共有を可能にする表現を使用する必要があります。私たちのアルゴリズムは、確率的、スイッチング、パラメトリックダイナミクス表現を使用して、このトレードオフのバランスを取ります。このモデルは、さまざまな地形でのロボットナビゲーションなど、いくつかの重要な現実世界のドメインを特徴付けると考えられます。この表現の仮定により、状態空間次元を含むすべての問題固有の量と多項式にスケーリングされるサンプル複雑度で、アルゴリズムがおそらくほぼ正確になることを証明します。また、推定されたMDPが正確に解決できることを通常想定する、従来のおそらくほぼ正しい(PAC)マルコフ決定プロセス(MDP)アプローチとは対照的に、近似計画によって導入される誤差をサンプル複雑度の境界に明示的に組み込みます。実際の車の軌跡データを使用して通勤の運転計画を作成する実験結果と、さまざまな地形をナビゲートする小型ロボットの実験は、私たちのダイナミクス表現によって、良好なパフォーマンスを生み出すのに十分な方法で現実世界のダイナミクスを捉えることができることを示しています。

Hybrid MPI/OpenMP Parallel Linear Support Vector Machine Training
ハイブリッドMPI / OpenMPパラレルリニアサポートベクターマシントレーニング

Support vector machines are a powerful machine learning technology,but the training process involves a dense quadratic optimizationproblem and is computationally challenging. A parallel implementationof linear Support Vector Machine training has been developed, using acombination of MPI and OpenMP. Using an interior point method for theoptimization and a reformulation that avoids the dense Hessian matrix,the structure of the augmented system matrix is exploited to partitiondata and computations amongst parallel processors efficiently. The newimplementation has been applied to solve problems from thePASCAL Challenge on Large-scale Learning. We show that our approach is competitive, and is able tosolve problems in the Challenge many times faster than other parallelapproaches. We also demonstrate that the hybrid version performs moreefficiently than the version using pure MPI.

サポートベクターマシンは強力な機械学習テクノロジですが、トレーニングプロセスには密集した2次最適化問題が含まれ、計算が困難です。線形サポートベクターマシントレーニングの並列実装は、MPIとOpenMPの組み合わせを使用して開発されました。最適化のための内部点法と密集したヘッセ行列を回避する再定式化を使用して、拡張システム行列の構造を利用して、並列プロセッサ間でデータと計算を効率的に分割します。新しい実装は、PASCALの大規模学習チャレンジの問題を解決するために適用されました。私たちは、私たちのアプローチが競争力があり、他の並列アプローチよりも何倍も速くチャレンジの問題を解決できることを示しています。また、ハイブリッドバージョンは、純粋なMPIを使用するバージョンよりも効率的に動作することも示します。

Learning Approximate Sequential Patterns for Classification
分類のための近似逐次パターンの学習

In this paper, we present an automated approach to discover patternsthat can distinguish between sequences belonging to different labeledgroups. Our method searches for approximately conserved motifs thatoccur with varying statistical properties in positive and negativetraining examples. We propose a two-step process to discover suchpatterns. Using locality sensitive hashing (LSH), we first estimatethe frequency of all subsequences and their approximate matches withina given Hamming radius in labeled examples. The discriminative abilityof each pattern is then assessed from the estimated frequencies byconcordance and rank sum testing. The use of LSH to identifyapproximate matches for each candidate pattern helps reduce theruntime of our method. Space requirements are reduced by decomposingthe search problem into an iterative method that uses a single LSHtable in memory. We propose two further optimizations to the searchfor discriminative patterns. Clustering with redundancy based on a2-approximate solution of the k-center problem decreases the numberof overlapping approximate groups while providing exhaustive coverageof the search space. Sequential statistical methods allow the searchprocess to use data from only as many training examples as are neededto assess significance. We evaluated our algorithm on data sets fromdifferent applications to discover sequential patterns forclassification. On nucleotide sequences from the Drosophila genomecompared with random background sequences, our method was able todiscover approximate binding sites that were preserved upstream ofgenes. We observed a similar result in experiments on ChIP-on-chipdata. For cardiovascular data from patients admitted with acutecoronary syndromes, our pattern discovery approach identifiedapproximately conserved sequences of morphology variations that werepredictive of future death in a test population. Our data showed thatthe use of LSH, clustering, and sequential statistics improved therunning time of the search algorithm by an order of magnitude withoutany noticeable effect on accuracy. These results suggest that ourmethods may allow for an unsupervised approach to efficiently learninteresting dissimilarities between positive and negative examplesthat may have a functional role.

この論文では、異なるラベル付きグループに属するシーケンスを区別できるパターンを発見するための自動化されたアプローチを紹介します。この方法では、正と負のトレーニング例でさまざまな統計特性を持つ、ほぼ保存されたモチーフを検索します。このようなパターンを発見するための2段階のプロセスを提案します。まず、局所性に敏感なハッシュ(LSH)を使用して、ラベル付き例の特定のハミング半径内のすべてのサブシーケンスとその近似一致の頻度を推定します。次に、一致テストと順位和テストによって推定頻度から各パターンの識別能力を評価します。各候補パターンの近似一致を識別するためにLSHを使用すると、この方法の実行時間が短縮されます。検索問題を、メモリ内の単一のLSHテーブルを使用する反復方法に分解することで、スペース要件が削減されます。識別パターンの検索に対するさらに2つの最適化を提案します。k中心問題の2近似ソリューションに基づく冗長性を備えたクラスタリングにより、重複する近似グループの数が減り、検索スペースが網羅されます。シーケンシャル統計法では、検索プロセスで、重要性を評価するために必要な数のトレーニング例のデータのみを使用できます。分類のためのシーケンシャルパターンを発見するために、さまざまなアプリケーションのデータセットでアルゴリズムを評価しました。ショウジョウバエゲノムのヌクレオチド配列をランダムな背景配列と比較すると、この方法は遺伝子の上流で保存されているおおよその結合部位を発見できました。ChIP-on-chipデータの実験でも同様の結果が見られました。急性冠症候群で入院した患者の心血管データでは、パターン発見アプローチにより、テスト集団で将来の死亡を予測する形態学的変異のおおよその保存された配列が特定されました。データによると、LSH、クラスタリング、およびシーケンシャル統計を使用すると、精度に顕著な影響を与えることなく、検索アルゴリズムの実行時間が1桁改善されました。これらの結果は、私たちの方法により、機能的な役割を持つ可能性のある正の例と負の例の間の興味深い相違点を効率的に学習するための教師なしアプローチが可能になる可能性があることを示唆しています。

Learning Acyclic Probabilistic Circuits Using Test Paths
テストパスを使用した非巡回確率回路の学習

We define a model of learning probabilistic acyclic circuits usingvalue injection queries, in which fixed values are assigned to anarbitrary subset of the wires and the value on the single output wireis observed. We adapt the approach of using test paths from theCircuit Builder algorithm (Angluin et al., 2009) to show that there isa polynomial time algorithm that uses value injection queries to learnacyclic Boolean probabilistic circuits of constant fan-in and logdepth. We establish upper and lower bounds on the attenuation factorfor general and transitively reduced Boolean probabilistic circuits oftest paths versus general experiments. We give computational evidencethat a polynomial time learning algorithm using general valueinjection experiments may not do much better than one using testpaths. For probabilistic circuits with alphabets of size three orgreater, we show that the test path lemmas(Angluin et al., 2009, 2008b) fail utterly. To overcome thisobstacle, we introduce function injection queries, in which the valueson a wire may be mapped to other values rather than just to themselvesor constants, and prove a generalized test path lemma for this case.

私たちは、値注入クエリを使用して確率的非巡回回路を学習するモデルを定義します。このモデルでは、固定値が任意の配線のサブセットに割り当てられ、単一の出力配線の値が観察されます。Circuit Builderアルゴリズム(Angluin他、2009)のテストパスを使用するアプローチを採用して、値注入クエリを使用してファンインが一定で深さがlogである非巡回ブール確率回路を学習する多項式時間アルゴリズムがあることを示します。テストパスと一般的な実験の一般的なブール確率回路および推移的に縮小されたブール確率回路の減衰係数の上限と下限を確立します。一般的な値注入実験を使用する多項式時間学習アルゴリズムは、テストパスを使用するアルゴリズムよりもそれほど優れているわけではないという計算上の証拠を示します。アルファベットのサイズが3以上の確率回路の場合、テストパスの補題(Angluin他、2009、2008b)は完全に失敗することを示します。この障害を克服するために、関数注入クエリを導入します。関数注入クエリでは、ワイヤ上の値が、それ自体または定数だけでなく他の値にマッピングされる可能性があり、この場合の一般化されたテストパスの補題を証明します。

CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
CarpeDiem:ビタビアルゴリズムの最適化と教師あり逐次学習への応用

The growth of information available to learning systems and theincreasing complexity of learning tasks determine the need fordevising algorithms that scale well with respect to all learningparameters. In the context of supervised sequential learning, theViterbi algorithm plays a fundamental role, by allowing the evaluationof the best (most probable) sequence of labels with a time complexitylinear in the number of time events, and quadratic in the number oflabels.In this paper we propose CarpeDiem, a novel algorithm allowing theevaluation of the best possible sequence of labels with asub-quadratic time complexity.We provide theoretical grounding together with solid empirical resultssupporting two chief facts. CarpeDiem always finds the optimal solutionrequiring, in most cases, only a small fraction of the time taken bythe Viterbi algorithm; meantime, CarpeDiem is never asymptoticallyworse than the Viterbi algorithm, thus confirming it as a soundreplacement.

学習システムで利用可能な情報の増加と学習タスクの複雑さの増大により、すべての学習パラメータに対して適切にスケーリングできるアルゴリズムを考案する必要性が決定されます。教師あり逐次学習のコンテキストでは、ビタビアルゴリズムは、時間イベントの数が線形で、ラベルの数が2次である、ラベルの最良の(最も可能性の高い)シーケンスの評価を可能にすることにより、基本的な役割を果たします。この論文では、亜二次時間の複雑さを持つラベルの最良のシーケンスの評価を可能にする新しいアルゴリズムであるCarpeDiemを提案します。私たちは、2つの主要な事実を裏付ける確かな経験的結果とともに、理論的根拠を提供します。CarpeDiemは常に最適な解を見つけますほとんどの場合、Viterbiアルゴリズムにかかる時間のほんの一部しか必要としません。一方、CarpeDiemはViterbiアルゴリズムよりも漸近的に劣ることはなく、したがって、それをサウンドの代替として確認しています。

Nonlinear Models Using Dirichlet Process Mixtures
ディリクレ過程混合を用いた非線形モデル

We introduce a new nonlinear model for classification, in which wemodel the joint distribution of response variable, y, andcovariates, x, non-parametrically using Dirichlet processmixtures. We keep the relationship between y and x linear withineach component of the mixture. The overall relationship becomesnonlinear if the mixture contains more than one component, withdifferent regression coefficients. We use simulated data to compare theperformance of this new approach to alternative methods such asmultinomial logit (MNL) models, decision trees, and support vectormachines. We also evaluate our approach on two classificationproblems: identifying the folding class of protein sequences anddetecting Parkinson’s disease. Our model can sometimes improvepredictive accuracy. Moreover, by grouping observations intosub-populations (i.e., mixture components), our model can sometimes provideinsight into hidden structure in the data.

私たちは、ディリクレ過程混合物を使用して、応答変数yと共変量xの同時分布をノンパラメトリックにモデル化する、分類のための新しい非線形モデルを導入します。yとxの関係は、混合物の各成分内で線形に保ちます。全体的な関係は、混合物に回帰係数が異なる複数の成分が含まれている場合、非線形になります。シミュレーションデータを使用して、この新しいアプローチのパフォーマンスを、多項ロジット(MNL)モデル、デシジョンツリー、サポートベクターマシンなどの代替手法と比較します。また、タンパク質配列のフォールディングクラスの特定とパーキンソン病の検出という2つの分類問題に対するアプローチも評価します。私たちのモデルは、予測精度を向上させる場合があります。さらに、観測値をサブポピュレーション(つまり、混合成分)にグループ化することにより、私たちのモデルはデータの隠れた構造についての洞察を提供できる場合があります。

Distributed Algorithms for Topic Models
トピックモデルの分散アルゴリズム

We describe distributed algorithms for two widely-used topic models,namely the Latent Dirichlet Allocation (LDA) model, and theHierarchical Dirichet Process (HDP) model. In our distributedalgorithms the data is partitioned across separate processors andinference is done in a parallel, distributed fashion. We propose twodistributed algorithms for LDA. The first algorithm is astraightforward mapping of LDA to a distributed processor setting. Inthis algorithmprocessors concurrently perform Gibbs sampling over local datafollowed by a global update of topic counts. The algorithm issimple to implement and can be viewed as an approximation toGibbs-sampled LDA. The second version is a model that uses ahierarchical Bayesian extension of LDA to directly account fordistributed data. This model has a theoretical guarantee ofconvergence but is more complex to implement than the first algorithm.Our distributed algorithm for HDP takes the straightforward mappingapproach, and merges newly-created topics either by matching orby topic-id. Using five real-world text corpora we show thatdistributed learning works well in practice. For both LDA and HDP, weshow that the converged test-data log probability for distributed learning isindistinguishable from that obtained with single-processor learning.Our extensive experimental results include learning topic models fortwo multi-million document collections using a 1024-processor parallelcomputer.

私たちは、広く使用されている2つのトピックモデル、つまり潜在的ディリクレ配分法(LDA)モデルと階層的ディリクレ過程(HDP)モデルの分散アルゴリズムについて説明します。分散アルゴリズムでは、データは別々のプロセッサに分割され、推論は並列分散方式で行われます。LDAには2つの分散アルゴリズムを提案します。最初のアルゴリズムは、LDAを分散プロセッサ設定に単純にマッピングしたものです。このアルゴリズムでは、プロセッサはローカルデータに対してギブスサンプリングを同時に実行し、その後トピック数をグローバルに更新します。このアルゴリズムは実装が簡単で、ギブスサンプリングLDAの近似値と見なすことができます。2番目のバージョンは、LDAの階層的ベイジアン拡張を使用して分散データを直接考慮するモデルです。このモデルは理論上は収束が保証されていますが、最初のアルゴリズムよりも実装が複雑です。HDPの分散アルゴリズムは、単純なマッピングアプローチを採用し、新しく作成されたトピックをマッチングまたはトピックIDによってマージします。5つの実際のテキストコーパスを使用して、分散学習が実際にうまく機能することを示します。LDAとHDPの両方について、分散学習の収束テストデータログ確率は、単一プロセッサ学習で得られるものと区別がつかないことが示されています。私たちの広範な実験結果には、1024プロセッサの並列コンピュータを使用して2つの数百万のドキュメントコレクションのトピックモデルを学習することが含まれています。

Settable Systems: An Extension of Pearl’s Causal Model with Optimization, Equilibrium, and Learning
設定可能なシステム:最適化、均衡、学習によるPearlの因果モデルの拡張

Judea Pearl’s Causal Model is a rich framework that provides deep insightinto the nature of causal relations. As yet, however, the Pearl Causal Model(PCM) has had a lesser impact on economics or econometrics than on otherdisciplines. This may be due in part to the fact that the PCM is not as wellsuited to analyzing structures that exhibit features of central interest toeconomists and econometricians: optimization, equilibrium, and learning. Weoffer the settable systems framework as an extension of the PCM that permitscausal discourse in systems embodying optimization, equilibrium, andlearning. Because these are common features of physical, natural, or socialsystems, our framework may prove generally useful for machine learning.Important features distinguishing the settable system framework from the PCMare its countable dimensionality and the use of partitioning andpartition-specific response functions to accommodate the behavior ofoptimizing and interacting agents and to eliminate the requirement of aunique fixed point for the system. Refinements of the PCM include thesettable systems treatment of attributes, the causal role of exogenousvariables, and the dual role of variables as causes and responses. A seriesof closely related machine learning examples and examples from game theoryand machine learning with feedback demonstrates some limitations of the PCMand motivates the distinguishing features of settable systems.

Judea Pearlの因果モデルは、因果関係の性質について深い洞察を提供する優れたフレームワークです。しかし、これまでのところ、Pearl因果モデル(PCM)は、経済学や計量経済学に他の分野ほど影響を与えていません。これは、PCMが、経済学者や計量経済学者の中心的な関心である最適化、均衡、学習の特徴を示す構造の分析にはあまり適していないことが一因である可能性があります。私たちは、最適化、均衡、学習を体現するシステムで因果論的議論を可能にするPCMの拡張として、設定可能なシステムフレームワークを提供します。これらは、物理的、自然的、または社会的システムに共通する特徴であるため、私たちのフレームワークは機械学習に一般的に役立つ可能性があります。設定可能なシステムフレームワークとPCMを区別する重要な特徴は、可算次元と、最適化エージェントと相互作用エージェントの動作に対応し、システムに対して一意の固定点の必要性を排除するためのパーティション化とパーティション固有の応答関数の使用です。PCMの改良点には、設定可能なシステムによる属性の扱い、外生変数の因果的役割、および原因と応答としての変数の二重の役割が含まれます。密接に関連する一連の機械学習の例と、ゲーム理論およびフィードバック付き機械学習の例は、PCMのいくつかの制限を示し、設定可能なシステムの特徴的な機能を説明します。

Dlib-ml: A Machine Learning Toolkit
Dlib-ml: 機械学習ツールキット

There are many excellent toolkits which provide support for developingmachine learning software in Python, R, Matlab, and similarenvironments. Dlib-ml is an open source library, targeted at bothengineers and research scientists, which aims to provide a similarlyrich environment for developing machine learning software in the C++language. Towards this end, dlib-ml contains an extensible linearalgebra toolkit with built in BLAS support. It also housesimplementations of algorithms for performing inference in Bayesiannetworks and kernel-based methods for classification, regression,clustering, anomaly detection, and feature ranking. To enable easyuse of these tools, the entire library has been developed withcontract programming, which provides complete and precisedocumentation as well as powerful debugging tools.

Python、R、Matlab、および同様の環境での機械学習ソフトウェアの開発をサポートする優れたツールキットが多数あります。Dlib-mlは、エンジニアと研究者の両方を対象としたオープンソースライブラリであり、C++言語で機械学習ソフトウェアを開発するための同様に豊富な環境を提供することを目的としています。この目的のために、dlib-mlには、BLASサポートが組み込まれた拡張可能な線形代数ツールキットが含まれています。また、ベイジアンネットワークで推論を実行するためのアルゴリズムの実装と、分類、回帰、クラスタリング、異常検出、および特徴のランク付けのためのカーネルベースの方法も収容されています。これらのツールを簡単に使用できるようにするために、ライブラリ全体が契約プログラミングで開発されており、完全で正確なドキュメントと強力なデバッグツールを提供します。

SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
SGD-QN:慎重な準ニュートン確率的勾配降下法

The SGD-QN algorithm is a stochastic gradient descent algorithm thatmakes careful use of second-order information and splits the parameterupdate into independently scheduled components. Thanks to this design,SGD-QN iterates nearly as fast as a first-order stochastic gradientdescent but requires less iterations to achieve the same accuracy.This algorithm won the “Wild Track” of the first PASCAL Large ScaleLearning Challenge (Sonnenburg et al., 2008).

SGD-QNアルゴリズムは、2次情報を慎重に使用し、パラメーター更新を独立してスケジュールされたコンポーネントに分割する確率的勾配降下アルゴリズムです。この設計のおかげで、SGD-QNは1次確率的勾配降下法とほぼ同じ速さで反復処理を行いますが、同じ精度を達成するために必要な反復回数は少なくなります。このアルゴリズムは、第1回PASCAL Large ScaleLearning Challengeの「Wild Track」で優勝しました(Sonnenburgら, 2008)。

Learning Permutations with Exponential Weights
指数関数的重みによる順列の学習

We give an algorithm for the on-line learning of permutations.The algorithm maintains its uncertainty about thetarget permutation as a doubly stochastic weight matrix, and makes predictions usingan efficient method for decomposing the weight matrix into a convex combinationof permutations.The weight matrix is updated by multiplying the currentmatrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity.Even though the result of this proceduredoes not have a closed form, a new analysis approachallows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm.This regret bound is significantly better than that of eitherKalai and Vempala’s more efficient Follow the Perturbed Leader algorithm orthe computationally expensive method of explicitly representing each permutation asan expert.

私たちは、順列のオンライン学習のためのアルゴリズムを提供します。このアルゴリズムは、ターゲットの順列に関する不確実性を二重確率的な重み行列として保持し、重み行列を順列の凸の組み合わせに分解するための効率的な方法を使用して予測を行います。重み行列は、currentmatrixエントリに指数係数を乗算することによって更新され、二重確率性を復元するための反復手順が必要です。この手順の結果は閉じた形式を持っていませんが、新しい分析アプローチにより、アルゴリズムの後悔に限られる最適な(小さな定数係数まで)ことを証明できます。この後悔の範囲は、KalaiとVempalaのより効率的なFollow the Perturbed Leaderアルゴリズムや、各順列を専門家として明示的に表現する計算コストの高い方法よりも大幅に優れています。

Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification
ノンパラメトリック経験的ベイズ推定の高次元分類への応用

We consider the problem of classification using high dimensionalfeatures’ space. In a paper by Bickel and Levina (2004), it isrecommended to use naive-Bayes classifiers, that is, to treat thefeatures as if they are statistically independent.Consider now a sparse setup, where only a few of the featuresare informative for classification. Fan and Fan (2008),suggested a variable selection and classification method, called FAIR.The FAIR method improves the design of naive-Bayes classifiers insparse setups. The improvement is due toreducing the noise in estimating the features’ means. This reduction is sincethat only the means of a few selected variables should be estimated.We also consider the design of naive Bayes classifiers. We show that a good alternative tovariable selection is estimation of the meansthrough a certain non parametric empirical Bayes procedure. In sparsesetups the empirical Bayes implicitly performs an efficient variableselection. It also adapts very well to non sparse setups, and has the advantageof making use of the information from many “weakly informative” variables, whichvariable selection type of classification procedures give up on using.We compare our method with FAIR and other classification methods insimulation for sparse and non sparse setups, andin real data examples involving classification of normal versus malignant tissues based on microarray data.

私たちは、高次元の特徴空間を使用した分類の問題を検討します。BickelとLevina (2004)の論文では、ナイーブベイズ分類器を使用すること、つまり、特徴を統計的に独立しているかのように扱うことが推奨されています。次に、分類に有益な特徴が少数しかないスパース設定を考えてみましょう。FanとFan (2008)は、FAIRと呼ばれる変数選択および分類法を提案しました。FAIR法は、スパース設定でのナイーブベイズ分類器の設計を改善します。この改善は、特徴の平均を推定する際のノイズが減少するためです。この減少は、少数の選択された変数の平均のみを推定すればよいためです。ナイーブベイズ分類器の設計についても検討します。変数選択の優れた代替手段は、特定のノンパラメトリック経験的ベイズ手順による平均の推定であることを示します。スパース設定では、経験的ベイズが暗黙的に効率的な変数選択を実行します。また、この方法は非スパース設定にも非常によく適応し、変数選択タイプの分類手順では使用できない多くの「弱い情報」変数からの情報を利用できるという利点があります。スパース設定と非スパース設定のシミュレーション、およびマイクロアレイデータに基づく正常組織と悪性組織の分類を含む実際のデータ例で、この方法をFAIRおよびその他の分類方法と比較します。

Transfer Learning for Reinforcement Learning Domains: A Survey
強化学習領域のための転移学習:調査

The reinforcement learning paradigm is a popular way to addressproblems that have only limited environmental feedback, rather thancorrectly labeled examples, as is common in other machine learningcontexts. While significant progress has been made to improve learningin a single task, the idea of transfer learning has onlyrecently been applied to reinforcement learning tasks. The core ideaof transfer is that experience gained in learning to perform one taskcan help improve learning performance in a related, but different,task. In this article we present a framework that classifies transferlearning methods in terms of their capabilities and goals, and thenuse it to survey the existing literature, as well as to suggest futuredirections for transfer learning work.

強化学習パラダイムは、他の機械学習のコンテキストで一般的な例を正しくラベル付けするのではなく、環境フィードバックが限られている問題に対処するための一般的な方法です。単一のタスクで学習を改善するための大きな進歩が見られましたが、転移学習の考え方が強化学習タスクに適用されるようになったのはごく最近のことです。転送の核となる考え方は、1つのタスクを実行することを学ぶことで得られた経験が、関連しているが異なるタスクの学習パフォーマンスを向上させるのに役立つということです。この記事では、転移学習の方法をその能力と目標の観点から分類するフレームワークを提示し、それを使用して既存の文献を調査し、転移学習作業の将来の方向性を提案します。

Marginal Likelihood Integrals for Mixtures of Independence Models
独立性モデルの混合の周辺尤法積分

Inference in Bayesian statistics involves theevaluation of marginal likelihood integrals.We present algebraic algorithms for computingsuch integrals exactly for discrete data of small sample size.Our methods apply toboth uniform priors and Dirichlet priors.The underlying statistical models aremixtures of independent distributions, or,in geometric language, secant varieties of Segre-Veronese varieties.

ベイズ統計学における推論には、周辺尤積分の評価が含まれます。私たちは、小さなサンプルサイズの離散データに対して正確にそのような積分を計算するための代数的アルゴリズムを提示します。私たちの方法は、一様事前確率とディリクレ事前確率の両方に適用されます。基礎となる統計モデルは、独立した分布の混合物、または幾何学的な言語では、セグレ・ヴェロネーゼ品種の割線多様体です。

Learning Linear Ranking Functions for Beam Search with Application to Planning
ビーム探索のための線形ランク付け関数の学習と計画への応用

Beam search is commonly used to help maintain tractabilityin large search spaces at the expense of completeness and optimality. Here westudy supervised learning of linear ranking functions for controllingbeam search. The goal is to learn ranking functions that allow for beam search toperform nearly as well as unconstrained search, and hence gain computationalefficiency without seriously sacrificing optimality. In this paper, we developtheoretical aspects of this learning problem and investigate the application ofthis framework to learning in the context of automated planning. We first studythe computational complexity of the learning problem, showing that even forexponentially large search spaces the general consistency problem is in NP. Wealso identify tractable and intractable subclasses of the learning problem,giving insight into the problem structure. Next, we analyze the convergenceof recently proposed and modified online learning algorithms, where we introduceseveral notions of problem margin that imply convergence for the various algorithms.Finally, we present empirical results in automated planning, where rankingfunctions are learned to guide beam search in a number of benchmark planningdomains. The results show that our approach is often able to outperform an existingstate-of-the-art planning heuristic as well as a recent approach to learning suchheuristics.

ビーム探索は、完全性と最適性を犠牲にして、大規模な探索空間での扱いやすさを維持するためによく使用されます。ここでは、ビーム探索を制御するための線形ランキング関数の教師あり学習について検討します。目標は、ビーム探索が制約のない探索とほぼ同等のパフォーマンスを発揮できるようにランキング関数を学習し、最適性を大幅に犠牲にすることなく計算効率を高めることです。この論文では、この学習問題の理論的側面を展開し、自動計画のコンテキストでの学習へのこのフレームワークの適用について調査します。まず、学習問題の計算の複雑さを検討し、指数関数的に大きな探索空間であっても、一般的な一貫性の問題はNPであることを示します。また、学習問題の扱いやすいサブクラスと扱いにくいサブクラスを特定し、問題の構造に関する洞察を提供します。次に、最近提案および修正されたオンライン学習アルゴリズムの収束を分析し、さまざまなアルゴリズムの収束を意味する問題マージンの概念をいくつか紹介します。最後に、自動計画の実証結果を示します。この計画では、ランキング関数を学習して、いくつかのベンチマーク計画ドメインでビーム検索をガイドします。結果は、私たちのアプローチが、既存の最先端の計画ヒューリスティックや、そのようなヒューリスティックを学習する最近のアプローチよりも優れていることが多いことを示しています。

Bayesian Network Structure Learning by Recursive Autonomy Identification
再帰的自律同定によるベイジアンネットワーク構造学習

We propose the recursive autonomy identification (RAI) algorithm forconstraint-based (CB) Bayesian network structure learning. The RAIalgorithm learns the structure by sequential application ofconditional independence (CI) tests, edge direction and structuredecomposition into autonomous sub-structures. The sequence ofoperations is performed recursively for each autonomous sub-structurewhile simultaneously increasing the order of the CI test. While otherCB algorithms d-separate structures and then direct the resultedundirected graph, the RAI algorithm combines the two processes fromthe outset and along the procedure. By this means and due to structuredecomposition, learning a structure using RAI requires a smallernumber of CI tests of high orders. This reduces the complexity andrun-time of the algorithm and increases the accuracy by diminishingthe curse-of-dimensionality. When the RAI algorithm learned structuresfrom databases representing synthetic problems, known networks andnatural problems, it demonstrated superiority with respect tocomputational complexity, run-time, structural correctness andclassification accuracy over the PC, Three Phase Dependency Analysis,Optimal Reinsertion, greedy search, Greedy Equivalence Search, SparseCandidate, and Max-Min Hill-Climbing algorithms.

私たちは、制約ベース(CB)ベイジアンネットワーク構造学習のための再帰的自律性識別(RAI)アルゴリズムを提案します。RAIアルゴリズムは、条件付き独立性(CI)テスト、エッジ方向、自律サブ構造への構造分解を順次適用することで構造を学習します。操作シーケンスは、CIテストの次数を増やしながら、自律サブ構造ごとに再帰的に実行されます。他のCBアルゴリズムは構造をd分離してから、結果として得られる無向グラフを方向付けますが、RAIアルゴリズムは、最初から手順に沿って2つのプロセスを組み合わせます。この方法と構造分解により、RAIを使用して構造を学習するには、高次のCIテストの数が少なくて済みます。これにより、アルゴリズムの複雑さと実行時間が軽減され、次元の呪いが軽減されて精度が向上します。RAIアルゴリズムは、合成問題、既知のネットワーク、自然問題を表すデータベースから構造を学習し、計算の複雑さ、実行時間、構造の正確性、分類の精度に関して、PC、3フェーズ依存性分析、最適再挿入、貪欲検索、貪欲同値検索、SparseCandidate、およびMax-Min Hill-Climbingアルゴリズムよりも優れていることが実証されました。

Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
ベイジアンネットワークにおけるベイジアンスコアリング基準の強極限定理

In the machine learning community, the Bayesian scoring criterion iswidely used for model selection problems. One of the fundamentaltheoretical properties justifying the usage of the Bayesian scoringcriterion is its consistency. In this paper we refine this propertyfor the case of binomial Bayesian network models. As a by-product ofour derivations we establish strong consistency and obtain the law ofiterated logarithm for the Bayesian scoring criterion.

機械学習コミュニティでは、ベイズスコアリング基準はモデル選択問題に広く使用されています。ベイズスコアリング基準の使用を正当化する基本的な理論的特性の1つは、その一貫性です。この論文では、二項ベイジアンネットワークモデルの場合にこのプロパティを改良します。導出の副産物として、強い一貫性を確立し、ベイズスコアリング基準の法則の反復対数を取得します。

Robustness and Regularization of Support Vector Machines
サポートベクターマシンのロバスト性と正則化

We consider regularized support vector machines (SVMs) and show thatthey are precisely equivalent to a new robust optimizationformulation. We show that this equivalence of robust optimizationand regularization has implications for both algorithms, andanalysis. In terms of algorithms, the equivalence suggests moregeneral SVM-like algorithms for classification that explicitly buildin protection to noise, and at the same time control overfitting. Onthe analysis front, the equivalence of robustness and regularizationprovides a robust optimization interpretation for the success ofregularized SVMs. We use this new robustness interpretation of SVMsto give a new proof of consistency of (kernelized) SVMs, thusestablishing robustness as the reason regularized SVMsgeneralize well.

私たちは、正則化サポートベクターマシン(SVM)を検討し、それらが新しいロバスト最適化の定式化と正確に同等であることを示します。このロバスト最適化と正則化の等価性が、アルゴリズムと解析の両方に影響を与えることを示します。アルゴリズムに関しては、同等性は、ノイズに対する保護を明示的に組み込み、同時に過適合を制御する、より一般的なSVMのような分類アルゴリズムを示唆しています。解析の面では、ロバスト性と正則化の等価性により、正則化されたSVMの成功に対するロバストな最適化解釈が可能になります。SVMのこの新しいロバスト性の解釈を使用して、(カーネル化された)SVMの一貫性の新しい証明を提供し、したがって、正規化されたSVMが適切に一般化する理由としてロバスト性を確立します。

Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
エントロピー推論とJames-Stein推定量:非線形遺伝子関連ネットワークへの応用

We present a procedure for effective estimation of entropy and mutualinformation from small-sample data, and apply it to the problem ofinferring high-dimensional gene association networks. Specifically, wedevelop a James-Stein-type shrinkage estimator, resulting in a procedurethat is highly efficient statistically as well as computationally.Despite its simplicity, we show that it outperforms eight other entropyestimation procedures across a diverse range of sampling scenarios anddata-generating models, even in cases of severe undersampling. Weillustrate the approach by analyzing E. coli gene expression data andcomputing an entropy-based gene-association network from gene expressiondata. A computer program is available that implements the proposedshrinkage estimator.

私たちは、少量のサンプルデータからエントロピーと相互情報を効果的に推定する手順を提示し、それを高次元の遺伝子関連ネットワークの推論の問題に適用します。具体的には、James-Stein型収縮推定器を開発し、統計的にも計算的にも非常に効率的な手順を実現します。その単純さにもかかわらず、深刻なアンダーサンプリングの場合でも、さまざまなサンプリングシナリオとデータ生成モデルにわたって、他の8つのエントロピ推定手順よりも優れていることを示しています。大腸菌の遺伝子発現データを解析し、遺伝子発現データからエントロピーベースの遺伝子関連ネットワークを計算することにより、このアプローチを説明します。提案された収縮推定器を実装するコンピュータープログラムが利用可能です。

Classification with Gaussians and Convex Loss
ガウス分布と凸損失による分類

This paper considers binary classification algorithms generatedfrom Tikhonov regularization schemes associated with generalconvex loss functions and varying Gaussian kernels. Our main goalis to provide fast convergence rates for the excessmisclassification error. Allowing varying Gaussian kernels in thealgorithms improves learning rates measured by regularizationerror and sample error. Special structures of Gaussian kernelsenable us to construct, by a nice approximation scheme with aFourier analysis technique, uniformly bounded regularizingfunctions achieving polynomial decays of the regularization errorunder a Sobolev smoothness condition. The sample error isestimated by using a projection operator and a tight bound for thecovering numbers of reproducing kernel Hilbert spaces generated byGaussian kernels. The convexity of the general loss function playsa very important role in our analysis.

この論文では、一般凸損失関数とさまざまなガウスカーネルに関連付けられたTikhonov正則化スキームから生成されたバイナリ分類アルゴリズムについて考察します。主な目標は、過剰誤分類エラーの収束率を高速化することです。アルゴリズムでさまざまなガウスカーネルを許可すると、regularizationerrorとsample errorによって測定される学習率が向上します。ガウスカーネルの特殊な構造により、フーリエ解析手法を用いた優れた近似スキームにより、Sobolev平滑性条件下で正則化誤差の多項式減衰を達成する一様な有界正則化関数を構築することができます。サンプル誤差は、射影演算子と、ガウスカーネルによって生成された再生カーネルヒルベルト空間の被覆数に対する厳密な境界を使用して推定されます。一般的な損失関数の凸性は、私たちの分析において非常に重要な役割を果たします。

A Least-squares Approach to Direct Importance Estimation
直接重要度推定への最小二乗法

We address the problem of estimating the ratio of two probability density functions,which is often referred to as the importance.The importance values can be used for various succeeding tasks such ascovariate shift adaptation or outlier detection.In this paper, we propose a new importance estimation methodthat has a closed-form solution;the leave-one-out cross-validation scorecan also be computed analytically.Therefore, the proposed method is computationally highly efficient and simple to implement.We also elucidate theoretical properties of the proposed method such as the convergencerate and approximation error bounds. Numerical experiments show that the proposed methodis comparable to the best existing method in accuracy,while it is computationally more efficientthan competing approaches.

私たちは、重要度と呼ばれることが多い2つの確率密度関数の比を推定する問題に取り組みます。重要度の値は、共変量シフト適応や外れ値検出など、さまざまな後続のタスクに使用できます。この論文では、閉形式のソリューションを持つ新しい重要度推定方法を提案します。leave-one-out交差検証スコアは、解析的に計算することもできます。したがって、提案手法は計算効率が高く、実装が容易です。また、提案手法の収束誤差や近似誤差の限界などの理論的性質についても解明します。数値実験により、提案された方法は、既存の最良の方法に匹敵する精度を持ち、競合するアプローチよりも計算効率が高いことが示されています。

Model Monitor (M2): Evaluating, Comparing, and Monitoring Models
モデルモニター (M2): モデルの評価、比較、監視

This paper presents Model Monitor (M2), a Java toolkit for robustlyevaluating machine learning algorithms in the presence of changingdata distributions. M2 provides a simple and intuitive frameworkin which users can evaluate classifiers under hypothesized shifts indistribution and therefore determine the best model (or models) fortheir data under a number of potential scenarios. Additionally, M2is fully integrated with the WEKA machine learning environment, sothat a variety of commodity classifiers can be used if desired.

この論文では、データ分布が変化する場合に機械学習アルゴリズムをロバストに評価するためのJavaツールキットであるModel Monitor(M2)について説明します。M2は、ユーザーが仮説シフト不分布の下で分類子を評価し、さまざまな潜在的なシナリオの下でデータに最適なモデル(またはモデル)を決定できる、シンプルで直感的なフレームワークを提供します。さらに、M2はWEKA機械学習環境と完全に統合されているため、必要に応じてさまざまな商品分類器を使用できます。

A Parameter-Free Classification Method for Large Scale Learning
大規模学習のためのパラメータフリー分類法

With the rapid growth of computer storage capacities, available dataand demand for scoring models both follow an increasing trend, sharperthan that of the processing power. However, the main limitation to awide spread of data mining solutions is the non-increasingavailability of skilled data analysts, which play a key role in datapreparation and model selection.In this paper, we present a parameter-free scalable classificationmethod, which is a step towards fully automatic data mining. Themethod is based on Bayes optimal univariate conditional densityestimators, naive Bayes classification enhanced with a Bayesianvariable selection scheme, and averaging of models using a logarithmicsmoothing of the posterior distribution. We focus on the complexityof the algorithms and show how they can cope with data sets that arefar larger than the available central memory. We finally reportresults on the Large Scale Learning challenge, where our methodobtains state of the art performance within practicable computationtime.

コンピュータのストレージ容量が急速に増加し、利用可能なデータとスコアリングモデルに対する需要はどちらも、処理能力の増加よりも急激な増加傾向にあります。しかし、データマイニングソリューションの普及を阻む主な制約は、データの準備とモデルの選択で重要な役割を果たす熟練したデータアナリストの数が伸びないことです。この論文では、完全自動データマイニングへの第一歩となる、パラメータ不要のスケーラブルな分類方法を紹介します。この方法は、ベイズ最適単変量条件付き密度推定量、ベイズ変数選択スキームで強化された単純ベイズ分類、および事後分布の対数平滑化を使用したモデルの平均化に基づいています。アルゴリズムの複雑さに焦点を当て、利用可能な中央メモリよりもはるかに大きいデータセットに対処できる方法を示します。最後に、大規模学習チャレンジの結果を報告します。このチャレンジでは、実用的な計算時間内で最先端のパフォーマンスを実現しています。

Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination
アンサンブル、人工変数、および冗長性除去による特徴選択

Predictive models benefit from a compact, non-redundant subset offeatures that improves interpretability and generalization.Modern data sets are wide, dirty, mixed with both numerical andcategorical predictors, and may contain interactive effectsthat require complex models. This is a challenge for filters,wrappers, and embedded feature selection methods. Wedescribe details of an algorithm using tree-based ensembles togenerate a compact subset of non-redundant features.Parallel and serial ensembles of treesare combined into a mixed method that can uncover maskingand detect features of secondary effect.Simulated and actualexamples illustrate the effectiveness of the approach.

予測モデルは、コンパクトで冗長性のない特徴のサブセットの恩恵を受け、解釈可能性と一般化を向上させます。最新のデータセットは、幅が広く、ダーティで、数値予測子とカテゴリ予測子の両方が混在しており、複雑なモデルを必要とする対話型効果が含まれている可能性があります。これは、フィルター、ラッパー、および組み込み機能の選択方法の課題です。ツリーベースのアンサンブルを使用して、非冗長機能のコンパクトなサブセットを生成するアルゴリズムの詳細について説明します。樹木の並列アンサンブルとシリアルアンサンブルは、マスキングを明らかにし、二次効果の特徴を検出できる混合法に組み合わされます。シミュレーション例と実際の例は、アプローチの有効性を示しています。

Robust Process Discovery with Artificial Negative Events
人工的なネガティブイベントによるロバストなプロセスディスカバリー

Process discovery is the automated construction of structuredprocess models from information system event logs. Such event logsoften contain positive examples only. Without negative examples,it is a challenge to strike the right balance between recall andspecificity, and to deal with problems such as expressiveness,noise, incomplete event logs, or the inclusion of prior knowledge.In this paper, we present a configurable technique that deals withthese challenges by representing process discovery as amulti-relational classification problem on event logs supplementedwith Artificially Generated Negative Events (AGNEs). This problemformulation allows using learning algorithms and evaluationtechniques that are well-know in the machine learning community.Moreover, it allows users to have a declarative control over theinductive bias and language bias.

プロセス検出とは、情報システムのイベントログから構造化されたプロセスモデルを自動的に構築することです。このようなイベントログには、多くの場合、肯定的な例のみが含まれています。ネガティブな例がなければ、想起と特異性の間の適切なバランスを取り、表現力、ノイズ、不完全なイベントログ、事前知識の包含などの問題に対処することは困難です。この論文では、プロセスディスカバリーをイベントログの多重関係分類問題として表し、人工的に生成されたネガティブイベント(AGNE)で補完することにより、これらの課題に対処する構成可能な手法を紹介します。この問題の定式化により、機械学習コミュニティでよく知られている学習アルゴリズムと評価手法を使用できます。さらに、ユーザーは帰納的バイアスと言語バイアスを宣言的に制御できます。

Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
近似推論における摂動補正: 混合モデリング応用

Bayesian inference is intractable for many interesting models, makingdeterministic algorithms for approximate inference highly desirable.Unlike stochastic methods, which are exact in the limit,the accuracy of these approaches cannot be reasonably judged.In this paper we show how low order perturbation corrections toan expectation-consistent (EC) approximation can provide the necessary toolsto ameliorate inference accuracy, and to givean indication of the quality of approximation without having to resortto Monte Carlo methods.Further comparisons are given with variational Bayes andparallel tempering (PT) combined withthermodynamic integration on a Gaussian mixturemodel. To obtain practical results we further generalize PT to temper fromarbitrary distributions rather than a prior in Bayesian inference.

ベイズ推論は多くの興味深いモデルにとって扱いにくいため、近似推論の決定論的アルゴリズムが非常に望ましいものとなっています。極限で正確な確率的手法とは異なり、これらのアプローチの精度は合理的に判断することはできません。この論文では、期待値一貫性(EC)近似に対する低次摂動補正が、推論精度を改善し、モンテカルロ法に頼ることなく近似の品質を示すために必要なツールを提供する方法を示します。さらに、変分ベイズと並列テンパリング(PT)とガウス混合モデルでの熱力学積分を組み合わせた比較が行われます。実用的な結果を得るために、PTをさらに一般化して、ベイズ推論の事前分布ではなく、任意の分布からテンパリングします。

Incorporating Functional Knowledge in Neural Networks
ニューラルネットワークへの機能的知識の組み込み

Incorporating prior knowledge of a particular task into thearchitecture of a learning algorithm can greatly improvegeneralization performance. We study here a case where we know thatthe function to be learned is non-decreasing in its two arguments andconvex in one of them. For this purpose we propose a class offunctions similar to multi-layer neural networks but (1) that hasthose properties, (2) is a universal approximator ofLipschitz functions with these and otherproperties. We apply this new class of functions to the task ofmodelling the price of call options. Experiments show improvements onregressing the price of call options using the new types of functionclasses that incorporate the a priori constraints.

特定のタスクに関する事前知識を学習アルゴリズムのアーキテクチャに組み込むと、一般化のパフォーマンスが大幅に向上します。ここでは、学習する関数が2つの引数で非減少であり、そのうちの1つで凸であることがわかっているケースを学習します。この目的のために、多層ニューラルネットワークに似た関数のクラスを提案しますが、(1)それらの特性を持ち、(2)これらの特性と他の特性を持つリプシッツ関数の普遍的な近似器です。この新しいクラスの関数を、コールオプションの価格をモデル化するタスクに適用します。実験では、先験的な制約を組み込んだ新しいタイプの関数クラスを使用して、コールオプションの価格を逆行させる改善が示されています。

The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
潜在変数の隠れた生命:混合グラフモデルによるベイズ学習

Directed acyclic graphs (DAGs) have been widely used as a representation ofconditional independence in machine learning and statistics. Moreover,hidden or latent variables are often an important component of graphicalmodels. However, DAG models suffer from an important limitation: the familyof DAGs is not closed under marginalization of hidden variables. This meansthat in general we cannot use a DAG to represent the independencies over asubset of variables in a larger DAG. Directed mixed graphs (DMGs) are arepresentation that includes DAGs as a special case, and overcomes thislimitation. This paper introduces algorithms for performing Bayesianinference in Gaussian and probit DMG models. An important requirement forinference is the specification of the distribution over parameters of themodels. We introduce a new distribution for covariance matrices of GaussianDMGs. We discuss and illustrate how several Bayesian machine learning taskscan benefit from the principle presented here: the power to modeldependencies that are generated from hidden variables, but withoutnecessarily modeling such variables explicitly.

有向非巡回グラフ(DAG)は、機械学習や統計における条件付き独立性の表現として広く使用されています。さらに、隠れた変数や潜在変数は、多くの場合、グラフィカルモデルの重要なコンポーネントです。ただし、DAGモデルには重要な制限があります。DAGファミリーは、隠れた変数の周辺化の下で閉じられていません。つまり、一般に、DAGを使用して、より大きなDAG内の変数のサブセットの独立性を表現することはできません。有向混合グラフ(DMG)は、DAGを特別なケースとして含む表現であり、この制限を克服します。この論文では、ガウスおよびプロビットDMGモデルでベイズ推論を実行するアルゴリズムを紹介します。推論の重要な要件は、モデルのパラメーターの分布の指定です。ガウスDMGの共分散行列の新しい分布を紹介します。ここで紹介した原則、つまり、隠れた変数から生成される依存関係をモデル化する能力（ただし、そのような変数を必ずしも明示的にモデル化する必要はありません）が、いくつかのベイズ機械学習タスクにどのように役立つかについて説明し、説明します。

Multi-task Reinforcement Learning in Partially Observable Stochastic Environments
部分的に観測可能な確率的環境におけるマルチタスク強化学習

We consider the problem of multi-task reinforcement learning (MTRL)in multiple partially observable stochastic environments. Weintroduce the regionalized policy representation (RPR) tocharacterize the agent’s behavior in each environment. The RPR is aparametric model of the conditional distribution over currentactions given the history of past actions and observations; theagent’s choice of actions is directly based on this conditionaldistribution, without an intervening model to characterize theenvironment itself. We propose off-policy batch algorithms to learnthe parameters of the RPRs, using episodic data collected whenfollowing a behavior policy, and show their linkage to policyiteration. We employ the Dirichlet process as a nonparametric priorover the RPRs across multiple environments. The intrinsic clusteringproperty of the Dirichlet process imposes sharing of episodes amongsimilar environments, which effectively reduces the number ofepisodes required for learning a good policy in each environment,when data sharing is appropriate. The number of distinct RPRs andthe associated clusters (the sharing patterns) are automaticallydiscovered by exploiting the episodic data as well as thenonparametric nature of the Dirichlet process. We demonstrate theeffectiveness of the proposed RPR as well as the RPR-based MTRLframework on various problems, including grid-world navigation andmulti-aspect target classification. The experimental results showthat the RPR is a competitive reinforcement learning algorithm inpartially observable domains, and the MTRL consistently achievesbetter performance than single task reinforcement learning.

私たちは、複数の部分的に観測可能な確率的環境におけるマルチタスク強化学習(MTRL)の問題を考察します。各環境におけるエージェントの行動を特徴付けるために、地域化ポリシー表現(RPR)を導入します。RPRは、過去の行動と観測の履歴を前提とした現在の行動に対する条件付き分布のパラメトリックモデルです。エージェントの行動の選択は、環境自体を特徴付ける介入モデルを必要とせず、この条件付き分布に直接基づいて行われます。私たちは、行動ポリシーに従う際に収集されるエピソードデータを使用してRPRのパラメータを学習するオフポリシーバッチアルゴリズムを提案し、ポリシー反復との関連性を示す。私たちは、複数の環境にわたるRPRに対するノンパラメトリックな事前分布としてディリクレ過程を採用します。ディリクレ過程の本質的なクラスタリング特性は、類似した環境間でのエピソードの共有を強制し、データ共有が適切な場合、各環境で適切なポリシーを学習するために必要なエピソードの数を効果的に削減します。異なるRPRの数と関連するクラスター(共有パターン)は、エピソードデータとディリクレ過程の非パラメトリックな性質を利用して自動的に検出されます。提案されたRPRとRPRベースのMTRLフレームワークの有効性を、グリッドワールドナビゲーションや多面的なターゲット分類などのさまざまな問題で実証します。実験結果から、RPRは部分的に観測可能なドメインで競争力のある強化学習アルゴリズムであり、MTRLは単一タスク強化学習よりも一貫して優れたパフォーマンスを達成することがわかります。

Universal Kernel-Based Learning with Applications to Regular Languages
通常の言語への応用によるユニバーサルカーネルベースの学習

We propose a novel framework for supervised learning of discreteconcepts. Since the 1970’s, the standard computational primitive hasbeen to find the most consistent hypothesis in a given complexityclass. In contrast, in this paper we propose a new basic operation:for each pair of input instances, count how many concepts of boundedcomplexity contain both of them.Our approach maps instances to a Hilbert space, whose metric isinduced by a universal kernel coinciding with our computationalprimitive, and identifies concepts with half-spaces. We prove thatall concepts are linearly separable under this mapping. Hence, givena labeled sample and an oracle for evaluating the universal kernel, wecan efficiently compute a linear classifier (via SVM, for example) anduse margin bounds to control its generalization error. Even thoughexact evaluation of the universal kernel may be infeasible, in variousnatural situations it is efficiently approximable.Though our approach is general, our main application is to regularlanguages. Our approach presents a substantial departure from currentlearning paradigms and in particular yields a novel method forlearning this fundamental concept class. Unlike existing techniques,we make no structural assumptions on the corresponding unknownautomata, the string distribution or the completeness of the trainingset. Instead, given a labeled sample our algorithm outputs aclassifier with guaranteed distribution-free generalization bounds; toour knowledge, the proposed framework is the only one capable ofachieving the latter. Along the way, we touch upon severalfundamental questions in complexity, automata, and machine learning.

私たちは、離散概念の教師あり学習のための新しいフレームワークを提案します。1970年代以来、標準的な計算プリミティブは、与えられた複雑性クラスで最も一貫性のある仮説を見つけることであった。対照的に、この論文では、新しい基本操作を提案します。入力インスタンスの各ペアについて、その両方を含む制限された複雑性の概念の数を数える。我々のアプローチは、インスタンスをヒルベルト空間にマッピングします。その距離は、我々の計算プリミティブと一致するユニバーサルカーネルによって誘導され、概念を半空間で識別します。私たちは、このマッピングの下ですべての概念が線形に分離可能であることを証明します。したがって、ラベル付きサンプルとユニバーサルカーネルを評価するためのオラクルがあれば、線形分類器を効率的に計算し(たとえば、SVM経由)、マージン境界を使用してその一般化エラーを制御することができます。ユニバーサルカーネルの正確な評価は実行不可能である可能性があるが、さまざまな自然な状況では効率的に近似可能です。我々のアプローチは一般的であるが、主な用途は正規言語です。私たちのアプローチは、現在の学習パラダイムから大きく逸脱しており、特にこの基本的な概念クラスを学習するための新しい方法を生み出します。既存の手法とは異なり、対応する未知のオートマトン、文字列分布、またはトレーニングセットの完全性について構造上の仮定は行いません。代わりに、ラベル付きサンプルが与えられると、私たちのアルゴリズムは、分布フリーの一般化境界が保証された分類器を出力します。私たちの知る限り、提案されたフレームワークは後者を実現できる唯一のものです。その過程で、複雑性、オートマトン、機械学習におけるいくつかの基本的な問題に触れます。

An Algorithm for Reading Dependencies from the Minimal Undirected Independence Map of a Graphoid that Satisfies Weak Transitivity
弱い推移性を満たすGraphoidの最小無向独立写像から依存関係を読み取るためのアルゴリズム

We present a sound and complete graphical criterion for readingdependencies from the minimal undirected independence map G of agraphoid M that satisfies weak transitivity. Here, complete meansthat it is able to read all the dependencies in M that can bederived by applying the graphoid properties and weak transitivity tothe dependencies used in the construction of G and theindependencies obtained from G by vertex separation. We argue thatassuming weak transitivity is not too restrictive. As an intermediatestep in the derivation of the graphical criterion, we prove that forany undirected graph G there exists a strictly positive discreteprobability distribution with the prescribed sample spaces that isfaithful to G. We also report an algorithm that implements thegraphical criterion and whose running time is considered to be at mostO(n2(e+n)) for n nodes and e edges. Finally, we illustrate howthe graphical criterion can be used within bioinformatics to identifybiologically meaningful gene dependencies.

私たちは、弱推移性を満たすグラフィカルMの最小無向独立マップGから依存関係を読み取るための健全で完全なグラフィカル基準を提示します。ここで、完全とは、グラフィカル特性と弱推移性を、Gの構築に使用された依存関係と、頂点分離によってGから取得された独立性に適用することによって導出できるMのすべての依存関係を読み取ることができることを意味します。私たちは、弱推移性を仮定することはあまり制限的ではないと主張します。グラフィカル基準の導出の中間ステップとして、任意の無向グラフGに対して、Gに忠実な、規定されたサンプル空間を持つ厳密に正の離散確率分布が存在することを証明します。また、グラフィカル基準を実装し、n個のノードとe個のエッジに対して実行時間が最大でO(n2(e+n))であると考えられるアルゴリズムも報告します。最後に、バイオインフォマティクス内でグラフィカル基準を使用して、生物学的に意味のある遺伝子依存関係を識別する方法を示す。

Fourier Theoretic Probabilistic Inference over Permutations
順列に対するフーリエ理論的確率論的推論

Permutations are ubiquitous in many real-world problems, such asvoting, ranking, and data association. Representing uncertaintyover permutations is challenging, since there are n!possibilities, and typical compact and factorized probabilitydistribution representations, such as graphical models, cannotcapture the mutual exclusivity constraints associated withpermutations. In this paper, we use the “low-frequency” terms of aFourier decomposition to represent distributions over permutationscompactly. We present Kronecker conditioning, a novel approach for maintaining and updating thesedistributions directly in the Fourier domain, allowing for polynomial time bandlimited approximations. Low orderFourier-based approximations, however, may lead to functions that donot correspond to valid distributions. To address this problem, wepresent a quadratic program defined directly in theFourier domain for projecting the approximation onto a relaxation ofthe polytope of legal marginal distributions. We demonstrate the effectiveness of our approach on a real camera-based multi-person tracking scenario.

順列は、投票、ランキング、データ関連付けなど、現実世界の多くの問題に遍在しています。順列の不確実性を表現するのは困難です。なぜなら、n!の可能性があり、グラフィカルモデルなどの一般的なコンパクトで因数分解された確率分布表現では、順列に関連する相互排他性制約を捉えることができないからです。この論文では、フーリエ分解の「低周波」項を使用して、順列の分布をコンパクトに表現します。クロネッカー条件付けは、これらの分布をフーリエ領域で直接維持および更新し、多項式時間の帯域制限近似を可能にする新しいアプローチです。ただし、低次のフーリエベースの近似では、有効な分布に対応しない関数になる可能性があります。この問題に対処するために、近似を合法的な周辺分布の多面体の緩和に投影するための、フーリエ領域で直接定義された2次計画法を示します。実際のカメラベースの複数人物追跡シナリオで、私たちのアプローチの有効性を実証します。

On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
一般経験的リスクの無界性、依存性、高次元性による一様偏差について

The statistical learning theory of risk minimization depends heavily onprobability bounds for uniform deviations of the empirical risks.Classical probability bounds using Hoeffding’s inequality cannotaccommodate more general situations with unbounded loss anddependent data. The current paper introduces an inequality thatextends Hoeffding’s inequality to handle these more general situations.We will apply this inequality to provide probability bounds foruniform deviations in a very general framework, whichcan involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with l1-loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors.

リスク最小化の統計的学習理論は、経験的リスクの均一な偏差の確率限界に大きく依存します。Hoeffdingの不等式を使用した古典的な確率限界は、無制限の損失と従属データを持つより一般的な状況に対応できません。現在の論文では、これらのより一般的な状況を処理するためにHoeffdingの不等式を拡張する不等式を紹介します。この不等式を適用して、離散的な決定ルール、無制限の損失、およびマーチンゲールや強い混合よりも一般的な依存関係構造を含む可能性のある非常に一般的なフレームワークで、一様偏差の確率限界を提供します。高次元予測子を持つ2つの例、l1-損失を伴う自己回帰(AR)と、遅延応答と外生的予測子の両方を使用する符号分類のための変数選択によるARXモデルについて考えます。

Nonextensive Information Theoretic Kernels on Measures
測定に関する非広汎情報理論カーネル

Positive definite kernels on probability measureshave been recently applied to classification problems involvingtext, images, and other types of structured data.Some of these kernels are related to classic informationtheoretic quantities, such as (Shannon’s) mutual informationand the Jensen-Shannon (JS) divergence. Meanwhile, there havebeen recent advances in nonextensive generalizations of Shannon’sinformation theory. This paper bridges thesetwo trends by introducing nonextensive information theoretickernels on probability measures, based on new JS-type divergences.These new divergences result from extending thethe two building blocks of the classical JS divergence:convexity and Shannon’s entropy. The notion ofconvexity is extended to the wider concept of q-convexity,for which we prove a Jensen q-inequality. Based onthis inequality, we introduce Jensen-Tsallis (JT) q-differences, anonextensive generalization of the JS divergence, and definea k-th order JT q-difference between stochastic processes.We then define a new family of nonextensive mutual informationkernels, which allow weights to be assigned to their arguments,and which includes the Boolean, JS, and linear kernelsas particular cases. Nonextensive string kernels are also definedthat generalize the p-spectrum kernel. We illustrate the performanceof these kernels on text categorization tasks, in which documentsare modeled both as bags of words and as sequences of characters.

確率測度の正定値カーネルは、最近、テキスト、画像、その他の構造化データを含む分類問題に適用されています。これらのカーネルの一部は、(シャノンの)相互情報量やジェンセン-シャノン(JS)ダイバージェンスなどの古典的な情報理論的量に関連しています。一方、シャノンの情報理論の非表示一般化が最近進歩しています。この論文では、新しいJS型ダイバージェンスに基づく確率測度の非表示情報理論的カーネルを導入することで、これら2つのトレンドを橋渡しします。これらの新しいダイバージェンスは、古典的なJSダイバージェンスの2つの構成要素である凸性とシャノンのエントロピーを拡張することによって生じます。凸性の概念は、より広い概念であるq凸性に拡張され、これに対してジェンセンのq不等式が証明されます。この不等式に基づいて、Jensen-Tsallis (JT) q差、JSダイバージェンスの非表示一般化を導入し、確率過程間のk次JT q差を定義します。次に、重みを引数に割り当てることができる非表示相互情報量カーネルの新しいファミリを定義します。これには、ブール、JS、および線形カーネルが特定のケースとして含まれます。pスペクトルカーネルを一般化する非表示文字列カーネルも定義されます。ドキュメントが単語のバッグと文字のシーケンスの両方としてモデル化されるテキスト分類タスクでのこれらのカーネルのパフォーマンスを示します。

Java-ML: A Machine Learning Library
Java-ML: 機械学習ライブラリ

Java-ML is a collection of machine learning and data miningalgorithms, which aims to be a readily usable and easily extensibleAPI for both software developers and research scientists. Theinterfaces for each type of algorithm are kept simple and algorithmsstrictly follow their respective interface. Comparing differentclassifiers or clustering algorithms is therefore straightforward, andimplementing new algorithms is also easy. The implementations of thealgorithms are clearly written, properly documented and can thus beused as a reference. The library is written in Java and is availablefrom http://java-ml.sourceforge.net/ under the GNU GPL license.

Java-MLは、機械学習とデータマイニングアルゴリズムのコレクションであり、ソフトウェア開発者と研究者の両方にとってすぐに使用でき、簡単に拡張できるAPIになることを目指しています。各タイプのアルゴリズムのインターフェースはシンプルに保たれ、アルゴリズムはそれぞれのインターフェースに厳密に従います。したがって、異なる分類子やクラスタリングアルゴリズムの比較は簡単で、新しいアルゴリズムの実装も簡単です。アルゴリズムの実装は明確に記述され、適切に文書化されているため、参照として使用できます。ライブラリはJavaで書かれており、GNU GPLライセンスの下でhttp://java-ml.sourceforge.net/から入手できます。

Polynomial-Delay Enumeration of Monotonic Graph Classes
単調グラフクラスの多項式遅延列挙

Algorithms that list graphs such that no two listed graphs areisomorphic, are important building blocks of systems for mining andlearning in graphs. Algorithms are already known that solve thisproblem efficiently for many classes of graphs of restricted topology,such as trees. In this article we introduce the concept of a denseaugmentation schema, and introduce an algorithm that can be used toenumerate any class of graphs with polynomial delay, as long as theclass of graphs can be described using a monotonic predicate operatingon a dense augmentation schema. In practice this means that this isthe first enumeration algorithm that can be applied theoreticallyefficiently in any frequent subgraph mining algorithm, and that thisalgorithm generalizes to situations beyond the standard frequentsubgraph mining setting.

リストされた2つのグラフが同型ではないようなグラフをリストするアルゴリズムは、グラフのマイニングと学習のためのシステムの重要な構成要素です。この問題を効率的に解決するアルゴリズムは、ツリーなどの制限されたトポロジのグラフの多くのクラスについてすでに知られています。この記事では、高密度拡張スキーマの概念を紹介し、グラフのクラスが高密度拡張スキーマで動作する単調な述語を使用して記述できる限り、多項式遅延を持つ任意のグラフのクラスを列挙するために使用できるアルゴリズムを紹介します。実際には、これは、任意の頻繁なサブグラフマイニングアルゴリズムで理論的に効率的に適用できる最初の列挙アルゴリズムであり、このアルゴリズムが標準のfrequentsubgraphマイニング設定を超える状況に一般化されることを意味します。

Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
擬似尤度を用いたスパースバイナリ対マルコフネットワークの推定

We consider the problems of estimating the parameters as well as thestructure of binary-valued Markov networks. For maximizing thepenalized log-likelihood, we implement an approximate procedure basedon the pseudo-likelihood of Besag (1975) and generalize it to afast exact algorithm. The exact algorithm starts with thepseudo-likelihood solution and then adjusts the pseudo-likelihoodcriterion so that each additional iterations moves it closer to theexact solution. Our results show that this procedure is faster thanthe competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), whenimplemented using the coordinate descent procedure ofFriedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and onlyslightly less accurate.

私たちは、パラメータの推定の問題と、バイナリ値マルコフネットワークの構造について考えます。ペナルティされた対数尤度を最大化するために、Besag (1975)の擬似尤度に基づく近似手順を実装し、それを高速厳密アルゴリズムに一般化します。厳密アルゴリズムは、擬似尤度解から開始し、次に擬似尤度基準を調整して、追加の反復ごとに厳密解に近づくようにします。私たちの結果は、この手順がLee、Ganapathi、およびKoller(2006a)によって提案された競合する正確な方法よりも速いことを示しています。しかし、Wainwrightら(2006)のアプローチと同様に、Friedman, Hastie, and Tibshirani (2008b)の座標降下法を使用して実装した場合、近似擬尤度は正確な方法よりもはるかに速く、精度がわずかに低いこともわかりました。

Stable and Efficient Gaussian Process Calculations
安定した効率的なガウスプロセス計算

The use of Gaussian processes can be an effective approach toprediction in a supervised learning environment. For large datasets, the standard Gaussian process approach requires solving verylarge systems of linear equations and approximations are requiredfor the calculations to be practical. We will focus on thesubset of regressors approximation technique. We will demonstratethat there can be numerical instabilities in a well knownimplementation of the technique. We discuss alternateimplementations that have better numerical stability propertiesand can lead to better predictions. Our results will beillustrated by looking at an application involving prediction ofgalaxy redshift from broadband spectrum data.

ガウス過程の使用は、教師あり学習環境での予測に対する効果的なアプローチになる可能性があります。大規模なデータセットの場合、標準的なガウスプロセスアプローチでは、非常に大規模な線形方程式系を解く必要があり、計算を実用的にするには近似が必要です。回帰子近似手法のサブセットに焦点を当てます。私たちは、この手法のよく知られた実装には数値的不安定性が存在する可能性があることを示します。より優れた数値安定性特性を持ち、より優れた予測につながる可能性のある代替実装について説明します。私たちの結果は、広帯域スペクトルデータからの銀河の赤方偏移の予測を含むアプリケーションを見ることによって説明されます。

Consistency and Localizability
一貫性とローカライズ可能性

We show that all consistent learning methods—that is, thatasymptotically achieve the lowest possible expected loss for anydistribution on (X,Y)—are necessarily localizable, by which wemean that they do not significantly change their response at aparticular point when we show them only the part of the training setthat is close to that point. This is true in particular for methodsthat appear to be defined in a non-local manner, such as supportvector machines in classification and least-squares estimators inregression. Aside from showing that consistency implies a specificform of localizability, we also show that consistency is logicallyequivalent to the combination of two properties: (1) a form oflocalizability, and (2) that the method’s global mean (over the entireX distribution) correctly estimates the true mean. Consistency cantherefore be seen as comprised of two aspects, one local and oneglobal.

私たちは、すべての一貫した学習方法—つまり、(X,Y)上の任意の分布に対して可能な限り低い期待損失を達成する—は必然的にローカライズ可能であり、これにより、特定のポイントに近いトレーニングセットの部分のみを表示すると、特定のポイントで応答が大きく変化しないことを示します。これは特に、分類のサポートベクトルマシンや最小二乗推定器の回帰など、非局所的な方法で定義されているように見える方法に当てはまります。一貫性が特定の形のローカライズ可能性を意味することを示すだけでなく、一貫性が2つのプロパティの組み合わせと論理的に同等であることも示します。(1)ローカライズ可能性の形式であること、および(2)メソッドのグローバル平均(X分布全体)が真の平均を正しく推定すること。したがって、一貫性は、ローカルとグローバルの2つの側面で構成されると見なすことができます。

A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
協調フィルタリングへの新しいアプローチ:スペクトル正則化による演算子推定

We present a general approach for collaborative filtering (CF) usingspectral regularization to learn linear operators mapping a set of”users” to a set of possibly desired “objects”. In particular,several recent low-rank type matrix-completion methods for CF areshown to be special cases of our proposed framework. Unlike existingregularization-based CF, our approach can be used to incorporateadditional information such as attributes of the users/objects—afeature currently lacking in existing regularization-based CFapproaches—using popular and well-known kernel methods. Weprovide novel representer theorems that we use to develop newestimation methods. We then provide learning algorithms based onlow-rank decompositions and test them on a standard CF data set. Theexperiments indicate the advantages of generalizing the existingregularization-based CF methods to incorporate related informationabout users and objects. Finally, we show that certain multi-tasklearning methods can be also seen as special cases of our proposedapproach.

私たちは、スペクトル正規化を使用して「ユーザー」のセットをおそらく望ましい「オブジェクト」のセットにマッピングする線形演算子を学習する協調フィルタリング(CF)の一般的なアプローチを提示します。特に、CFの最近の低ランク型行列補完法のいくつかは、我々が提案するフレームワークの特別なケースであることが示されています。既存の正規化ベースのCFとは異なり、我々のアプローチは、一般的なよく知られたカーネル法を使用して、ユーザー/オブジェクトの属性などの追加情報(現在のところ、既存の正規化ベースのCFアプローチに欠けている機能)を組み込むために使用できます。私たちは、新しい推定法の開発に使用する新しい表現定理を提供します。次に、低ランク分解に基づく学習アルゴリズムを提供し、標準のCFデータセットでテストします。実験は、既存の正規化ベースのCF方法を一般化して、ユーザーとオブジェクトに関する関連情報を組み込む利点を示しています。最後に、特定のマルチタスク学習法も、我々が提案するアプローチの特別なケースとして見ることができることを示します。

Sparse Online Learning via Truncated Gradient
Truncated Gradientによるスパースオンライン学習

We propose a general method called truncated gradientto induce sparsity in the weights ofonline-learning algorithms with convex loss functions. Thismethod has several essential properties:

私たちは、凸損失関数を持つオンライン学習アルゴリズムの重みにスパース性を誘発するために、切頭勾配と呼ばれる一般的な方法を提案します。この方法には、いくつかの重要なプロパティがあります。

Similarity-based Classification: Concepts and Algorithms
類似性に基づく分類: 概念とアルゴリズム

This paper reviews and extends the field of similarity-based classification,presenting new analyses, algorithms, data sets, and a comprehensive set ofexperimental results for a rich collection of classification problems.Specifically, the generalizability of using similarities as features isanalyzed, design goals and methods for weighting nearest-neighbors forsimilarity-based learning are proposed, and different methods for consistentlyconverting similarities into kernels are compared. Experiments on eight realdata sets compare eight approaches and their variants to similarity-basedlearning.

この論文では、類似性に基づく分類の分野をレビューおよび拡張し、新しい分析、アルゴリズム、データセット、および豊富な分類問題のコレクションに対する包括的な実験結果を提示します。具体的には、類似性を特徴として用いることの一般化可能性を分析し、類似性に基づく学習のために最近傍を重み付けする設計目標と方法を提案し、類似性を一貫してカーネルに変換するためのさまざまな方法を比較します。8つのリアルデータセットでの実験では、8つのアプローチとそのバリアントを類似性ベースの学習と比較します。

Nieme: Large-Scale Energy-Based Models
Nieme:大規模エネルギーベースモデル

In this paper we introduce NIEME, a machine learning library forlarge-scale classification, regression and ranking. NIEME, relies onthe framework of energy-based models (LeCun et al., 2006) whichunifies several learning algorithms ranging from simple perceptrons torecent models such as the pegasos support vector machine orl1-regularized maximum entropy models. This framework also unifiesbatch and stochastic learning which are both seen as energyminimization problems. NIEME, can hence be used in a wide range ofsituations, but is particularly interesting for large-scale learningtasks where both the examples and the features are processedincrementally. Being able to deal with new incoming features at anytime within the learning process is another original feature of the NIEME, toolbox. NIEME, is released under the GPL license. It isefficiently implemented in C++, it works on Linux, Mac OS X andWindows and provides interfaces for C++, Java and Python.

この論文では、大規模分類・回帰・ランキングのための機械学習ライブラリであるNIEMEについて紹介します。NIEMEは、単純なパーセプトロンからpegasosサポートベクターマシンorl1正則化最大エントロピーモデルなどの最近のモデルに至るまで、いくつかの学習アルゴリズムを統合するエネルギーベースのモデル(LeCunら、2006)のフレームワークに依存しています。このフレームワークは、エネルギー最小化問題と見なされるバッチ学習と確率的学習も統合します。したがって、NIEMEは幅広い状況で使用できますが、例と機能の両方が段階的に処理される大規模な学習タスクに特に興味深いものです。学習プロセスの中でいつでも新しい新機能を処理できることは、NIEMEのもう一つのオリジナル機能であるツールボックスです。NIEMEはGPLライセンスの下でリリースされています。C++で効率的に実装され、Linux、Mac OS X、Windowsで動作し、C++、Java、Pythonのインターフェイスを提供します。

On Efficient Large Margin Semisupervised Learning: Method and Theory
効率的大マージン半教師あり学習について:方法と理論

In classification, semisupervised learning usually involves a largeamount of unlabeled data with only a small number of labeleddata. This imposes a great challenge in that it is difficult toachieve good classification performance through labeled data alone. Toleverage unlabeled data for enhancing classification, this articleintroduces a large margin semisupervised learning method within theframework of regularization, based on an efficient margin loss forunlabeled data, which seeks efficient extraction of the informationfrom unlabeled data for estimating the Bayes decision boundary forclassification. For implementation, an iterative scheme is derivedthrough conditional expectations. Finally, theoretical and numericalanalyses are conducted, in addition to an application to gene functionprediction. They suggest that the proposed method enables to recoverthe performance of its supervised counterpart based on complete datain rates of convergence, when possible.

分類では、半教師あり学習には通常、ラベル付けされていない大量のデータが含まれ、ラベル付きデータはごくわずかです。これは、ラベル付けされたデータだけでは優れた分類性能を達成することが難しいという大きな課題を課しています。分類を強化するためにラベルなしデータを活用するために、この記事では、ラベルなしデータの効率的なマージン損失に基づいて、正規化のフレームワーク内で大きなマージン半教師あり学習方法を紹介します。これは、分類のベイズ決定境界を推定するためのラベルなしデータからの情報の効率的な抽出を求めます。実装のために、反復スキームは条件付きの期待値によって導出されます。最後に、理論解析と数値解析を行い、遺伝子機能予測への応用を行います。彼らは、提案された方法により、可能な場合は、収束率の完全なデータに基づいて、教師ありの対応物の性能を回復できるという考えを示しています。

Properties of Monotonic Effects on Directed Acyclic Graphs
有向非巡回グラフに対する単調効果の特性

Various relationships are shown hold between monotonic effects and weakmonotonic effects and the monotonicity of certain conditional expectations.Counterexamples are provided to show that the results do not hold under lessrestrictive conditions. Monotonic effects are furthermore used to relatesigned edges on a causal directed acyclic graph to qualitative effectmodification. The theory is applied to an example concerning the directeffect of smoking on cardiovascular disease controlling forhypercholesterolemia. Monotonicity assumptions are used to construct a testfor whether there is a variable that confounds the relationship between themediator, hypercholesterolemia, and the outcome, cardiovascular disease.

単調効果と弱い単調効果と、特定の条件付き期待の単調性との間には、さまざまな関係が存在することが示されています。制限の少ない条件下では結果が保持されないことを示すために、反例が提供されています。単調効果はさらに、因果関係指向非巡回グラフ上の符号付きエッジを定性的効果修正に関連付けるために使用されます。この理論は、高コレステロール血症を制御する心血管疾患に対する喫煙の直接的な影響に関する例に適用されます。単調性の仮定は、メディエーターである高コレステロール血症と結果である心血管疾患との関係を混乱させる変数があるかどうかの検定を構築するために使用されます。

Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions
最近傍クラスタリング:任意の目的関数を持つ一貫性のあるクラスタリングのためのベースライン法

Clustering is often formulated as a discrete optimization problem. Theobjective is to find, among all partitions of the data set, the bestone according to some quality measure. However, in the statisticalsetting where we assume that the finite data set has been sampled fromsome underlying space, the goal is not to find the best partition ofthe given sample, but to approximate the true partition of theunderlying space. We argue that the discrete optimization approachusually does not achieve this goal, and instead can lead toinconsistency. We construct examples which provably have thisbehavior. As in the case of supervised learning, the cure is torestrict the size of the function classes under consideration. Forappropriate “small” function classes we can prove very generalconsistency theorems for clustering optimization schemes. As oneparticular algorithm for clustering with a restricted function spacewe introduce “nearest neighbor clustering”. Similar to the k-nearestneighbor classifier in supervised learning, this algorithm can be seenas a general baseline algorithm to minimize arbitrary clusteringobjective functions. We prove that it is statistically consistent forall commonly used clustering objective functions.

クラスタリングは、離散最適化問題として定式化されることが多い。その目的は、データセットのすべてのパーティションの中から、何らかの品質基準に従って最良のものを見つけることです。しかし、有限データセットが何らかの基礎空間からサンプリングされていると仮定する統計設定では、目標は与えられたサンプルの最良のパーティションを見つけることではなく、基礎空間の真のパーティションを近似することです。私たちは、離散最適化アプローチでは通常この目標を達成できず、矛盾が生じる可能性があると主張します。私たちは、この動作が証明できる例を構築します。教師あり学習の場合と同様に、解決策は、検討中の関数クラスのサイズを制限することです。適切な「小さな」関数クラスについては、クラスタリング最適化スキームの非常に一般的な一貫性定理を証明できます。制限された関数空間でのクラスタリングの特定のアルゴリズムの1つとして、「最近傍クラスタリング」を導入します。このアルゴリズムは、教師あり学習のk最近傍分類器と同様に、任意のクラスタリング目的関数を最小化する一般的なベースラインアルゴリズムとして考えることができます。一般的に使用されるすべてのクラスタリング目的関数に対して統計的に一貫していることを証明します。

Scalable Collaborative Filtering Approaches for Large Recommender Systems
大規模なレコメンダーシステムのためのスケーラブルな協調フィルタリングアプローチ

The collaborative filtering (CF) using known user ratings of items hasproved to be effective for predicting user preferences in itemselection. This thriving subfield of machine learning became popularin the late 1990s with the spread of online services that userecommender systems, such as Amazon, Yahoo! Music, and Netflix. CFapproaches are usually designed to work on very large datasets. Therefore the scalability of the methods is crucial. In thiswork, we propose various scalable solutions that are validated againstthe Netflix Prize data set, currently the largest publicly availablecollection. First, we propose various matrix factorization (MF) basedtechniques. Second, a neighbor correction method for MF is outlined,which alloys the global perspective of MF and the localized propertyof neighbor based approaches efficiently. In the experimentationsection, we first report on some implementation issues, and we suggeston how parameter optimization can be performed efficiently for MFs. Wethen show that the proposed scalable approaches compare favorably withexisting ones in terms of prediction accuracy and/or required trainingtime. Finally, we report on some experiments performed on MovieLensand Jester data sets.

アイテムの既知のユーザー評価を使用した協調フィルタリング(CF)は、アイテム選択におけるユーザーの嗜好を予測するのに効果的であることが証明されています。機械学習のこの活気あるサブフィールドは、Amazon、Yahoo! Music、Netflixなどのレコメンデーションシステムを使用するオンラインサービスの普及により、1990年代後半に人気を博しました。CFアプローチは通常、非常に大規模なデータセットで機能するように設計されています。したがって、メソッドのスケーラビリティが重要です。この研究では、現在公開されている最大のコレクションであるNetflix Prizeデータセットに対して検証されたさまざまなスケーラブルなソリューションを提案します。まず、さまざまな行列分解(MF)ベースの手法を提案します。次に、MFのグローバルな視点と近傍ベースのアプローチのローカライズされた特性を効率的に組み合わせた、MFの近傍補正法について概説します。実験セクションでは、まず実装上の問題について報告し、MFのパラメーター最適化を効率的に実行する方法を提案します。次に、提案されたスケーラブルなアプローチが、予測精度や必要なトレーニング時間の点で既存のアプローチに匹敵することを示します。最後に、MovieLensおよびJesterデータセットで実行されたいくつかの実験について報告します。

NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM
NEUROSVM: カーネルの選択がSVMの性能に及ぼす影響を低減するアーキテクチャ

In this paper we propose a new multilayer classifier architecture. Theproposed hybrid architecture has two cascaded modules: featureextraction module and classification module. In the feature extractionmodule we use the multilayered perceptron (MLP) neural networks,although other tools such as radial basis function (RBF) networks canbe used. In the classification module we use support vector machines(SVMs)—here also other tool such as MLP or RBF can be used. Thefeature extraction module has several sub-modules each of which isexpected to extract features capturing the discriminatingcharacteristics of different areas of the input space. Theclassification module classifies the data based on the extractedfeatures. The resultant architecture with MLP in feature extractionmodule and SVM in classification module is called NEUROSVM. TheNEUROSVM is tested on twelve benchmark data sets and the performanceof the NEUROSVM is found to be better than both MLP and SVM. We alsocompare the performance of proposed architecture with that of twoensemble methods: majority voting and averaging. Here also theNEUROSVM is found to perform better than these two ensemblemethods. Further we explore the use of MLP and RBF in theclassification module of the proposed architecture. The mostattractive feature of NEUROSVM is that it practically eliminates thesevere dependency of SVM on the choice of kernel. This has beenverified with respect to both linear and non-linear kernels. We havealso demonstrated that for the feature extraction module, the fulltraining of MLPs is not needed.

この論文では、新しい多層分類器アーキテクチャを提案します。提案されたハイブリッドアーキテクチャには、特徴抽出モジュールと分類モジュールという2つのカスケードモジュールがあります。特徴抽出モジュールでは、多層パーセプトロン(MLP)ニューラルネットワークを使用しますが、ラジアルベース関数(RBF)ネットワークなどの他のツールも使用できます。分類モジュールでは、サポートベクターマシン(SVM)を使用しますが、MLPやRBFなどの他のツールも使用できます。特徴抽出モジュールには、入力空間のさまざまな領域の識別特性を捉える特徴を抽出することが期待される複数のサブモジュールがあります。分類モジュールは、抽出された特徴に基づいてデータを分類します。特徴抽出モジュールにMLP、分類モジュールにSVMを使用した結果のアーキテクチャは、NEUROSVMと呼ばれます。NEUROSVMは12個のベンチマークデータセットでテストされ、NEUROSVMのパフォーマンスはMLPとSVMの両方よりも優れていることがわかりました。また、提案されたアーキテクチャのパフォーマンスを、多数決と平均化という2つのアンサンブル手法と比較します。ここでも、NEUROSVMはこれら2つのアンサンブル手法よりもパフォーマンスが優れていることがわかります。さらに、提案されたアーキテクチャの分類モジュールでのMLPとRBFの使用を検討します。NEUROSVMの最も魅力的な機能は、カーネルの選択に対するSVMの重大な依存性を実質的に排除することです。これは、線形カーネルと非線形カーネルの両方に関して検証されています。また、特徴抽出モジュールでは、MLPの完全なトレーニングは不要であることも実証しました。

Online Learning with Sample Path Constraints
サンプルパス制約を使用したオンライン学習

We study online learning where a decision maker interacts with Naturewith the objective of maximizing her long-term average reward subjectto some sample path average constraints. We define thereward-in-hindsight as the highest reward the decision maker couldhave achieved, while satisfying the constraints, had she knownNature’s choices in advance. We show that in general thereward-in-hindsight is not attainable. The convex hull of thereward-in-hindsight function is, however, attainable. For theimportant case of a single constraint, the convex hull turns out to bethe highest attainable function. Using a calibrated forecasting rule,we provide an explicit strategy that attains this convex hull. We alsomeasure the performance of heuristic methods based on non-calibratedforecasters in experiments involving a CPU power management problem.

私たちは、意思決定者が自然と対話するオンライン学習を研究し、いくつかのサンプルパスの平均制約を条件として、彼女の長期的な平均報酬を最大化することを目的としています。私たちは、後知恵の報酬を、意思決定者が自然の選択を事前に知っていた場合に、制約を満たしながら達成できた最高の報酬と定義します。私たちは、一般的に、後知恵での報酬は達成できないことを示しています。しかし、後知恵の報酬関数の凸包は達成可能です。単一の制約の重要なケースでは、凸包が達成可能な最高の関数であることがわかります。較正された予測ルールを使用して、この凸包を達成するための明示的な戦略を提供します。また、CPU電源管理問題を含む実験で、未校正の予測者に基づくヒューリスティック手法のパフォーマンスも測定します。

On the Consistency of Feature Selection using Greedy Least Squares Regression
貪欲最小二乗回帰を用いた特徴選択の一貫性について

This paper studies the feature selection problem usinga greedy least squares regression algorithm.We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target),the greedy algorithm can select features consistently when thesample size approaches infinity. The condition is identicalto a corresponding condition for Lasso.Moreover, under a sparse eigenvalue condition,the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constanttimes the noise level.In comparison, Lasso may require the coefficients to be larger thanO(√s) times the noise level in the worst case, where s is the number of nonzero coefficients.

この論文では、貪欲な最小二乗回帰アルゴリズムを使用して特徴選択問題を研究します。私たちは、計画行列上の特定の表現不可能な条件下で(ただし、スパースターゲットとは無関係に)、サンプルサイズが無限大に近づくと、貪欲なアルゴリズムが一貫して特徴を選択できることを示します。さらに、スパース固有値条件下では、各非ゼロ係数がノイズレベルの定数倍よりも大きい限り、貪欲なアルゴリズムは特徴を確実に識別できます。これに対し、Lassoでは、係数がO(√s)にノイズレベルの倍を掛けた値よりも大きくなる必要がある場合があります(sはゼロでない係数の数です)。

Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
ベイジアン調査技術によるリカレントニューラルネットワークの同定

We introduce novel online Bayesian methods for the identification ofa family of noisy recurrent neural networks (RNNs). We presentBayesian active learning techniques for stimulus selection givenpast experiences. In particular, we consider the unknown parametersas stochastic variables and use A-optimality and D-optimalityprinciples to choose optimal stimuli. We derive myopic costfunctions in order to maximize the information gain concerningnetwork parameters at each time step. We also derive the A-optimaland D-optimal estimations of the additive noise that perturbs thedynamical system of the RNN. Here we investigate myopic as well asnon-myopic estimations, and study the problem of simultaneousestimation of both the system parameters and the noise. Employingconjugate priors our derivations remain approximation-free and giverise to simple update rules for the online learning of theparameters. The efficiency of our method is demonstrated for anumber of selected cases, including the task of controlledindependent component analysis.

私たちは、ノイズの多いリカレントニューラルネットワーク(RNN)ファミリーを識別するための新しいオンラインベイズ法を紹介します。過去の経験に基づいて刺激を選択するためのベイズ能動学習手法を紹介します。特に、未知のパラメータを確率変数と見なし、A最適性およびD最適性の原則を使用して最適な刺激を選択します。各時間ステップでネットワークパラメータに関する情報ゲインを最大化するために、近視眼的なコスト関数を導出します。また、RNNの動的システムを乱す加法性ノイズのA最適およびD最適推定値を導出します。ここでは、近視眼的推定値と非近視眼的推定値を調査し、システムパラメータとノイズの両方を同時に推定する問題を研究します。共役事前分布を使用することで、導出は近似なしで実行され、パラメータのオンライン学習のための簡単な更新規則が生まれます。制御された独立成分分析のタスクを含む、いくつかの選択されたケースで、この方法の効率性が実証されています。

Controlling the False Discovery Rate of the Association/Causality Structure Learned with the PC Algorithm
PCアルゴリズムで学習した関連/因果構造の誤発見率の制御

In real world applications, graphical statistical models are not onlya tool for operations such as classification or prediction, butusually the network structures of the models themselves are also ofgreat interest (e.g., in modeling brain connectivity). The falsediscovery rate (FDR), the expected ratio of falsely claimedconnections to all those claimed, is often a reasonable error-ratecriterion in these applications. However, current learning algorithmsfor graphical models have not been adequately adapted to the concernsof the FDR. The traditional practice of controlling the type I errorrate and the type II error rate under a conventional level does notnecessarily keep the FDR low, especially in the case of sparsenetworks. In this paper, we propose embedding an FDR-controlprocedure into the PC algorithm to curb the FDR of the skeleton of thelearned graph. We prove that the proposed method can control the FDRunder a user-specified level at the limit of large sample sizes. Inthe cases of moderate sample size (about several hundred), empiricalexperiments show that the method is still able to control the FDRunder the user-specified level, and a heuristic modification of themethod is able to control the FDR more accurately around theuser-specified level. The proposed method is applicable to any modelsfor which statistical tests of conditional independence are available,such as discrete models and Gaussian models.

現実世界のアプリケーションでは、グラフィカル統計モデルは分類や予測などの操作のツールであるだけでなく、通常、モデル自体のネットワーク構造も大きな関心事です(たとえば、脳の接続のモデル化)。これらのアプリケーションでは、誤って主張された接続と主張されたすべての接続の期待比率である偽発見率(FDR)が、多くの場合、妥当なエラー率基準となります。ただし、グラフィカルモデルの現在の学習アルゴリズムは、FDRの問題に適切に適応されていません。従来のレベル以下にタイプIエラー率とタイプIIエラー率を制御する従来の方法では、特にスパースネットワークの場合、必ずしもFDRを低く抑えることはできません。この論文では、学習したグラフのスケルトンのFDRを抑制するために、PCアルゴリズムにFDR制御手順を組み込むことを提案します。提案された方法は、大きなサンプルサイズの限界で、ユーザーが指定したレベル以下にFDRを制御できることを証明します。中程度のサンプルサイズ（数百程度）の場合、実験により、この方法は依然としてユーザー指定レベル以下のFDRを制御できることが示されており、この方法のヒューリスティックな修正により、ユーザー指定レベル付近でFDRをより正確に制御できます。提案された方法は、離散モデルやガウスモデルなど、条件付き独立性の統計的検定が利用可能なすべてのモデルに適用できます。

Generalization Bounds for Ranking Algorithms via Algorithmic Stability
アルゴリズムの安定性によるアルゴリズムのランク付けの一般化限界

The problem of ranking, in which the goal is to learn a real-valuedranking function that induces a ranking or ordering over an instancespace, has recently gained much attention in machine learning. Westudy generalization properties of ranking algorithms using the notionof algorithmic stability; in particular, we derive generalizationbounds for ranking algorithms that have good stability properties. Weshow that kernel-based ranking algorithms that perform regularizationin a reproducing kernel Hilbert space have such stability properties,and therefore our bounds can be applied to these algorithms; this isin contrast with generalization bounds based on uniform convergence,which in many cases cannot be applied to these algorithms. Our resultsgeneralize earlier results that were derived in the special setting ofbipartite ranking (Agarwal and Niyogi, 2005) to a more general setting of theranking problem that arises frequently in applications.

インスタンス空間に対するランク付けまたは順序付けを誘発する実数値のランキング関数を学習することを目標とするランキングの問題は、最近機械学習で大きな注目を集めています。アルゴリズムの安定性の概念を使用したランキングアルゴリズムの一般化特性を研究します。特に、安定性が良好なランク付けアルゴリズムの一般化限界を導出します。私たちは、再現カーネルヒルベルト空間で正則化を実行するカーネルベースのランキングアルゴリズムがそのような安定性特性を持っていることを示し、したがって、これらのアルゴリズムに限界を適用できます。これは、一様収束に基づく汎化境界とは対照的であり、多くの場合、これらのアルゴリズムには適用できません。私たちの結果は、二者ランキング(Agarwal and Niyogi、2005)の特別な設定で導き出された以前の結果を、アプリケーションで頻繁に発生するランキング問題のより一般的な設定に一般化します。

Particle Swarm Model Selection
粒子群モデルの選択

This paper proposes the application of particle swarm optimization(PSO) to the problem of full model selection, FMS, forclassification tasks. FMS is defined as follows: given a poolof preprocessing methods, feature selection and learning algorithms,to select the combination of these that obtains the lowestclassification error for a given data set; the task also includes theselection of hyperparameters for the considered methods. This problemgenerates a vast search space to be explored, well suited forstochastic optimization techniques. FMS can be applied to anyclassification domain as it does not require domain knowledge.Different model types and a variety of algorithms can be consideredunder this formulation. Furthermore, competitive yet simple modelscan be obtained with FMS. We adopt PSO for the searchbecause of its proven performance in different problems and because ofits simplicity, since neither expensive computations nor complicatedoperations are needed. Interestingly, the way the search is guidedallows PSO to avoid overfitting to some extend. Experimentalresults on benchmark data sets give evidence that the proposedapproach is very effective, despite its simplicity. Furthermore,results obtained in the framework of a model selection challenge showthe competitiveness of the models selected with PSO, comparedto models selected with other techniques that focus on a singlealgorithm and that use domain knowledge.

この論文では、分類タスクの完全モデル選択(FMS)の問題に粒子群最適化(PSO)を適用することを提案します。FMSは次のように定義されます。前処理方法、特徴選択、学習アルゴリズムのプールが与えられ、特定のデータセットに対して最小の分類エラーが得られるこれらの組み合わせを選択します。このタスクには、検討対象の方法のハイパーパラメータの選択も含まれます。この問題は、探索すべき広大な検索空間を生成し、確率的最適化手法に適しています。FMSはドメイン知識を必要としないため、あらゆる分類ドメインに適用できます。この定式化では、さまざまなモデルタイプとさまざまなアルゴリズムを検討できます。さらに、FMSを使用すると、競争力がありながらシンプルなモデルを取得できます。さまざまな問題で実績のあるパフォーマンスと、高価な計算や複雑な操作が不要なシンプルさから、検索にPSOを採用します。興味深いことに、検索のガイド方法により、PSOはある程度オーバーフィッティングを回避できます。ベンチマークデータセットでの実験結果から、提案されたアプローチはシンプルであるにもかかわらず、非常に効果的であることがわかります。さらに、モデル選択チャレンジのフレームワークで得られた結果は、単一のアルゴリズムに重点を置き、ドメイン知識を使用する他の手法で選択されたモデルと比較して、PSOで選択されたモデルの競争力を示しています。

Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining
教師あり記述ルールディスカバリー:コントラストセット、エマージングパターン、サブグループマイニングの統一調査

This paper gives a survey of contrast set mining (CSM), emergingpattern mining (EPM), and subgroup discovery (SD) in a unifyingframework named supervised descriptive rule discovery. While allthese research areas aim at discovering patterns in the form of rulesinduced from labeled data, they use different terminology and taskdefinitions, claim to have different goals, claim to use differentrule learning heuristics, and use different means for selectingsubsets of induced patterns. This paper contributes a novelunderstanding of these subareas of data mining by presenting a unifiedterminology, by explaining the apparent differences between thelearning tasks as variants of a unique supervised descriptive rulediscovery task and by exploring the apparent differences between theapproaches. It also shows that various rule learning heuristics usedin CSM, EPM and SD algorithms all aim at optimizing a trade offbetween rule coverage and precision. The commonalities (anddifferences) between the approaches are showcased on a selection ofbest known variants of CSM, EPM and SD algorithms. The paperalso provides a critical survey of existing supervised descriptiverule discovery visualization methods.

この論文では、教師あり記述的ルール発見という統一フレームワークにおけるコントラストセットマイニング(CSM)、エマージングパターンマイニング(EPM)、およびサブグループ発見(SD)について概説します。これらの研究分野はすべて、ラベル付きデータから誘導されたルールの形でパターンを発見することを目指していますが、異なる用語とタスク定義を使用し、異なる目標を主張し、異なるルール学習ヒューリスティックを使用していると主張し、誘導されたパターンのサブセットを選択するための異なる手段を使用しています。この論文では、統一された用語を提示し、学習タスク間の明らかな違いを独自の教師あり記述的ルール発見タスクのバリエーションとして説明し、アプローチ間の明らかな違いを検討することで、データマイニングのこれらのサブ領域の新しい理解に貢献します。また、CSM、EPM、およびSDアルゴリズムで使用されるさまざまなルール学習ヒューリスティックはすべて、ルールカバレッジと精度のトレードオフを最適化することを目指していることを示しています。これらのアプローチの共通点(および相違点)は、CSM、EPM、およびSDアルゴリズムの最もよく知られているバリエーションの選択によって紹介されています。また、この論文では、既存の教師あり記述ルール発見視覚化手法の批判的な調査も提供しています。

Low-Rank Kernel Learning with Bregman Matrix Divergences
Bregman 行列発散による低ランクカーネル学習

In this paper, we study low-rank matrix nearness problems, with afocus on learning low-rank positive semidefinite (kernel) matrices formachine learning applications. We propose efficient algorithms thatscale linearly in the number of data points and quadratically in therank of the input matrix. Existing algorithms for learning kernelmatrices often scale poorly, with running times that are cubic in thenumber of data points. We employ Bregman matrix divergences as themeasures of nearness—these divergences are natural for learninglow-rank kernels since they preserve rank as well as positivesemidefiniteness. Special cases of our framework yield fasteralgorithms for various existing learning problems, and experimentalresults demonstrate that our algorithms can effectively learn bothlow-rank and full-rank kernel matrices.

この論文では、機械学習アプリケーションのための低ランク正の半正定値(カーネル)行列の学習に焦点を当てて、低ランク行列の近傍問題を研究します。データポイントの数で線形にスケーリングし、入力行列のランクで二次関数的にスケーリングする効率的なアルゴリズムを提案します。カーネルマトリクスを学習するための既存のアルゴリズムは、多くの場合、スケーラビリティが低く、データポイントの数で実行時間が3乗になります。近さの尺度としてブレグマン行列発散を採用します—これらの発散は、正の半定値性だけでなくランクも保持するため、低ランクカーネルの学習には自然です。私たちのフレームワークの特別なケースでは、さまざまな既存の学習問題に対してより高速なアルゴリズムが生成され、実験結果は、アルゴリズムが低ランクとフルランクの両方のカーネル行列を効果的に学習できることを示しています。

Improving the Reliability of Causal Discovery from Small Data Sets Using Argumentation
論証を用いた小さなデータセットからの因果発見の信頼性の向上

We address the problem of improving the reliability ofindependence-based causal discovery algorithms that results from theexecution of statistical independence tests on small data sets, whichtypically have low reliability. We model the problem as a knowledgebase containing a set of independence facts that are related throughPearl’s well-known axioms. Statistical tests on finite data sets mayresult in errors in these tests and inconsistencies in the knowledgebase. We resolve these inconsistencies through the use of an instanceof the class of defeasible logics called argumentation, augmented witha preference function, that is used to reason about and possiblycorrect errors in these tests. This results in a more robustconditional independence test, called an argumentativeindependence test. Our experimental evaluation shows clear positiveimprovements in the accuracy of argumentative over purely statisticaltests. We also demonstrate significant improvements on the accuracyof causal structure discovery from the outcomes of independence testsboth on sampled data from randomly generated causal models and onreal-world data sets.

私たちは、信頼性が低いことが多い小規模なデータセットに対して統計的独立性テストを実行することで得られる独立性に基づく因果発見アルゴリズムの信頼性を向上させる問題に取り組んでいます。この問題を、パールの有名な公理によって関連付けられた独立性に関する一連の事実を含む知識ベースとしてモデル化します。有限のデータセットに対する統計的テストでは、これらのテストにエラーが生じ、知識ベースに矛盾が生じる可能性があります。これらの矛盾を解決するために、これらのテストのエラーを推論し、場合によっては修正するために使用される選好関数で拡張された、議論と呼ばれる無効化可能なロジックのクラスのインスタンスを使用します。これにより、議論的独立性テストと呼ばれる、より堅牢な条件付き独立性テストが実現します。実験的評価では、純粋に統計的なテストよりも議論的テストの精度が明らかに向上することが示されています。また、ランダムに生成された因果モデルからサンプリングされたデータと実際のデータセットの両方における独立性テストの結果から、因果構造の発見の精度が大幅に向上したことも実証しました。

Analysis of Perceptron-Based Active Learning
パーセプトロンに基づくアクティブラーニングの解析

We start by showing that in an active learning setting, the Perceptron algorithmneeds Ω(1/ε2) labels to learn linearseparators within generalization error ε. We then presenta simple active learning algorithm for this problem, whichcombines a modification of the Perceptron update with an adaptivefiltering rule for deciding whichpoints to query. For data distributed uniformly over the unitsphere, we show that our algorithm reaches generalization errorε after asking for just Õ(d log 1/ε) labels. This exponential improvement over theusual sample complexity of supervised learning had previously beendemonstrated only for the computationally more complexquery-by-committee algorithm.

私たちは、まず、アクティブラーニングの設定では、パーセプトロンアルゴリズムが一般化誤差ε内の線形セパレータを学習するためにΩ(1/ε2)ラベルが必要であることを示します。次に、この問題に対する単純なアクティブラーニングアルゴリズムを提示し、パーセプトロンの更新の変更と、クエリを実行するポイントを決定するための適応フィルタリングルールを組み合わせています。ユニットスフィア上に一様に分布したデータの場合、アルゴリズムは「(d log 1/ε)ラベルのみを要求した後で汎化誤差ε」に達することを示します。教師あり学習の通常のサンプルの複雑さに対するこの指数関数的な改善は、以前は、計算がより複雑な委員会によるクエリアルゴリズムでのみ実証されていました。

Data-driven Calibration of Penalties for Least-Squares Regression
データ駆動型のペナルティのキャリブレーション (最小二乗回帰)

Penalization procedures often suffer from their dependence onmultiplying factors, whose optimal values are either unknown or hardto estimate from data. We propose a completely data-driven calibrationalgorithm for these parameters in the least-squares regressionframework, without assuming a particular shape for the penalty. Ouralgorithm relies on the concept of minimal penalty, recentlyintroduced by Birgé and Massart (2007) in the context of penalizedleast squares for Gaussian homoscedastic regression. On the positiveside, the minimal penalty can be evaluated from the data themselves,leading to a data-driven estimation of an optimal penalty which can beused in practice; on the negative side, their approach heavily relieson the homoscedastic Gaussian nature of their stochastic framework.The purpose of this paper is twofold: stating a more generalheuristics for designing a data-driven penalty (the slopeheuristics) and proving that it works for penalized least-squaresregression with a random design, even for heteroscedastic non-Gaussiandata. For technical reasons, some exact mathematical results will beproved only for regressogram bin-width selection. This is at least afirst step towards further results, since the approach and the methodthat we use are indeed general.

ペナルティ化手順は、多くの場合、乗数に依存するという問題を抱えています。乗数の最適値は未知であるか、データから推定するのが困難です。私たちは、ペナルティに特定の形状を想定せずに、最小二乗回帰フレームワークでこれらのパラメータの完全にデータ駆動型の較正アルゴリズムを提案します。私たちのアルゴリズムは、ガウス等分散回帰のペナルティ付き最小二乗のコンテキストで最近Birgé とMassart (2007)によって導入された最小ペナルティの概念に依存しています。プラス面としては、最小ペナルティはデータ自体から評価できるため、実際に使用できる最適なペナルティをデータ駆動型で推定できます。マイナス面としては、彼らのアプローチは、確率論的フレームワークの等分散ガウス性に大きく依存しています。この論文の目的は2つあります。データ駆動型ペナルティを設計するためのより一般的なヒューリスティック(勾配ヒューリスティック)を述べることと、それが不均一分散の非ガウスデータに対しても、ランダム設計によるペナルティ付き最小二乗回帰に機能することを証明することです。技術的な理由により、正確な数学的結果は、回帰グラムのビン幅選択に対してのみ証明されます。私たちが使用するアプローチと方法は確かに一般的なので、これは少なくともさらなる結果への第一歩です。

Distance Metric Learning for Large Margin Nearest Neighbor Classification
大マージン最近傍分類のための距離メトリック学習

The accuracy of k-nearest neighbor (kNN) classification dependssignificantly on the metric used to compute distances betweendifferent examples. In this paper, we show how to learn a Mahalanobisdistance metric for kNN classification from labeled examples. TheMahalanobis metric can equivalently be viewed as a global lineartransformation of the input space that precedes kNN classificationusing Euclidean distances. In our approach, the metric is trainedwith the goal that the k-nearest neighbors always belong to the sameclass while examples from different classes are separated by a largemargin. As in support vector machines (SVMs), the margin criterionleads to a convex optimization based on the hinge loss. Unlikelearning in SVMs, however, our approach requires no modification orextension for problems in multiway (as opposed to binary)classification. In our framework, the Mahalanobis distance metric isobtained as the solution to a semidefinite program. On several datasets of varying size and difficulty, we find that metrics trained inthis way lead to significant improvements in kNN classification.Sometimes these results can be further improved by clustering thetraining examples and learning an individual metric within eachcluster. We show how to learn and combine these local metrics in aglobally integrated manner.

k近傍法(kNN)分類の精度は、異なる例間の距離を計算するために使用されるメトリックに大きく依存します。この論文では、ラベル付きの例からkNN分類用のマハラノビス距離メトリックを学習する方法を示します。マハラノビスメトリックは、ユークリッド距離を使用したkNN分類に先立つ入力空間のグローバル線形変換と同等に考えることができます。このアプローチでは、k近傍法が常に同じクラスに属し、異なるクラスの例が大きなマージンで分離されることを目標にメトリックをトレーニングします。サポートベクターマシン(SVM)と同様に、マージン基準はヒンジ損失に基づく凸最適化につながります。ただし、SVMでの学習とは異なり、このアプローチでは、多元分類(バイナリ分類ではない)の問題に対する変更や拡張は必要ありません。このフレームワークでは、マハラノビス距離メトリックは半正定値計画の解として得られます。さまざまなサイズと難易度の複数のデータセットで、この方法でトレーニングされたメトリックにより、kNN分類が大幅に改善されることがわかりました。トレーニング例をクラスタ化し、各クラスタ内で個別のメトリックを学習することで、これらの結果がさらに改善される場合もあります。これらのローカルメトリックをグローバルに統合された方法で学習および組み合わせる方法を示します。

Using Local Dependencies within Batches to Improve Large Margin Classifiers
バッチ内のローカル依存関係を使用して大きなマージン分類器を改善する

Most classification methods assume that the samples are drawnindependently and identically from an unknown data generatingdistribution, yet this assumption is violated in several real lifeproblems. In order to relax this assumption, we consider the casewhere batches or groups of samples may have internal correlations,whereas the samples from different batches may be considered to beuncorrelated. This paper introduces three algorithms for classifyingall the samples in a batch jointly: one based on a probabilisticformulation, and two based on mathematical programming analysis.Experiments on three real-life computer aided diagnosis (CAD)problems demonstrate that the proposed algorithms are significantlymore accurate than a naive support vector machine which ignores thecorrelations among the samples.

ほとんどの分類方法では、サンプルが未知のデータ生成分布から独立して同一に抽出されると仮定していますが、この仮定はいくつかの実際の問題で違反しています。この仮定を緩和するために、バッチまたはサンプルのグループが内部相関を持つ可能性があるのに対し、異なるバッチのサンプルは無相関であると考えられる場合を考えます。この論文では、バッチ内のすべてのサンプルを共同で分類するための3つのアルゴリズム(1つは確率的定式化に基づくアルゴリズム、もう1つは数理計画分析に基づくアルゴリズム)を紹介します。3つの実際のコンピュータ支援診断(CAD)問題に関する実験は、提案されたアルゴリズムが、サンプル間の相関を無視する単純なサポートベクターマシンよりも大幅に正確であることを示しています。

On The Power of Membership Queries in Agnostic Learning
不可知論的学習におけるメンバーシップクエリの力について

We study the properties of the agnostic learning framework ofHaussler (1992) and Kearns, Schapire, and Sellie (1994). In particular, weaddress the question: is there any situation in which membershipqueries are useful in agnostic learning?Our results show that the answer is negative fordistribution-independent agnostic learning and positive for agnosticlearning with respect to a specific marginal distribution. Namely, wegive a simple proof that any concept class learnable agnostically by adistribution-independent algorithm with access to membership queriesis also learnable agnostically without membership queries. Thisresolves an open problem posed by Kearns et al. (1994). For agnosticlearning with respect to the uniform distribution over {0,1}n we showa concept class that is learnable with membership queries butcomputationally hard to learn from random examples alone (assumingthat one-way functions exist).

私たちは、Haussler(1992)とKearns、Schapire、およびSellie(1994)の不可知論的学習フレームワークの特性を研究します。特に、私たちは次の質問に取り組みます:メンバーシップクエリが不可知論者の学習に役立つ状況はありますか?私たちの結果は、答えが特定の周辺分布に関して、分布に依存しない不可知論的学習に対して否定的であり、不可知論的学習に対して正であることを示しています。つまり、メンバーシップクエリにアクセスできる分布に依存しないアルゴリズムによって不可知論的に学習可能な任意の概念クラスが、メンバーシップクエリなしでも不可知論的に学習可能であるという単純な証明を提供します。これにより、Kearnsら(1994)によって提起された未解決の問題が解決されます。不可知論的学習の場合、{0,1}n上の一様分布に関して、メンバーシップクエリでは学習可能であるが、ランダムな例だけでは計算的に学習が難しい概念クラスを示します(一方向関数が存在すると仮定します)。

Python Environment for Bayesian Learning: Inferring the Structure of Bayesian Networks from Knowledge and Data
ベイジアン学習のためのPython環境:知識とデータからベイジアンネットワークの構造を推論する

In this paper, we introduce PEBL, a Python library andapplication for learning Bayesian network structure from data and prior knowledge thatprovides features unmatched by alternative software packages: the ability to useinterventional data, flexible specification of structural priors, modeling withhidden variables and exploitation of parallel processing.PEBL is released under the MIT open-source license, can be installed from the Python Package Index and is available at http://pebl-project.googlecode.com.

この論文では、データと事前知識からベイジアンネットワーク構造を学習するためのPythonライブラリおよびアプリケーションであるPEBLについて紹介します。これにより、インターベンショナルデータの使用能力、構造事前確率の柔軟な指定、隠れた変数によるモデリング、並列処理の活用など、他のソフトウェアパッケージにはない機能が提供されます。PEBLはMITオープンソースライセンスの下でリリースされており、Python Package Indexからインストールでき、http://pebl-project.googlecode.comで入手できます。

Subgroup Analysis via Recursive Partitioning
再帰的分割によるサブグループ分析

Subgroup analysis is an integral part of comparative analysiswhere assessing the treatment effect on a response is of centralinterest. Its goal is to determine the heterogeneity of thetreatment effect across subpopulations. In this paper, we adaptthe idea of recursive partitioning and introduce an interactiontree (IT) procedure to conduct subgroup analysis. The IT procedureautomatically facilitates a number of objectively definedsubgroups, in some of which the treatment effect is foundprominent while in others the treatment has a negligible or evennegative effect. The standard CART (Breiman et al., 1984)methodology is inherited to construct the tree structure. Also, inorder to extract factors that contribute to the heterogeneity ofthe treatment effect, variable importance measure is madeavailable via random forests of the interaction trees. Bothsimulated experiments and analysis of census wage data arepresented for illustration.

サブグループ分析は、応答に対する治療効果の評価が中心的な関心事である比較分析の不可欠な部分です。その目標は、亜集団間の治療効果の不均一性を判断することです。この論文では、再帰的分割の考え方を適応させ、サブグループ分析を行うためのinteractiontree(IT)手順を紹介します。IT手順は、客観的に定義された多数のサブグループを自動的に促進し、そのうちのいくつかでは治療効果が顕著に見られますが、他のものでは治療は無視できる、または否定的な効果さえあります。標準的なCART(Breimanら, 1984)方法論を継承して、ツリー構造を構築します。また、治療効果の不均一性に寄与する因子を抽出するために、相互作用木のランダムフォレストを介して可変重要度測定が利用可能になります。模擬実験と国勢調査の賃金データの分析の両方を説明のために表しています。

Refinement of Reproducing Kernels
再生カーネルの改良

We continue our recent study on constructing a refinementkernel for a given kernel so that the reproducing kernel Hilbertspace associated with the refinement kernel contains that with theoriginal kernel as a subspace. To motivate this study, we firstdevelop a refinement kernel method for learning, which gives anefficient algorithm for updating a learning predictor. Severalcharacterizations of refinement kernels are then presented. It isshown that a nontrivial refinement kernel for a given kernel alwaysexists if the input space has an infinite cardinal number.Refinement kernels for translation invariant kernels andHilbert-Schmidt kernels are investigated. Various concrete examplesare provided.

私たちは、特定のカーネルの精製カーネルを構築する最近の研究を続けており、精製カーネルに関連付けられた再生カーネルヒルベルト空間が、元のカーネルを部分空間としてそれを含むようにします。この研究を動機付けるために、まず、学習予測子を更新するための非効率的なアルゴリズムを提供する学習用の改良カーネル法を開発します。次に、リファインメントカーネルのいくつかの特性を示します。これは、入力空間が無限の基数を持つ場合、特定のカーネルに対する非自明な改良カーネルが常に存在することを示しています。翻訳不変カーネルとヒルベルト・シュミットカーネルのリファインメントカーネルを調査します。様々な具体例が提供されています。

An Analysis of Convex Relaxations for MAP Estimation of Discrete MRFs
離散MRFのMAP推定のための凸緩和の解析

The problem of obtaining the maximum a posteriori estimate of ageneral discrete Markov random field (i.e., a Markov random fielddefined using a discrete set of labels) is known to be NP-hard. However, due to its central importance in many applications,several approximation algorithms have been proposed in theliterature. In this paper, we present an analysis of three suchalgorithms based on convex relaxations: (i) LP-S: the linearprogramming (LP) relaxation proposed by Schlesinger (1976)for a special case and independently in Chekuri et al. (2001), Koster et al. (1998), and Wainwright et al. (2005) for the general case;(ii) QP-RL: the quadratic programming (QP) relaxation ofRavikumar and Lafferty (2006); and (iii) SOCP-MS: the second ordercone programming (SOCP) relaxation first proposed byMuramatsu and Suzuki (2003) for two label problems and later extended byKumar et al. (2006) for a general label set.We show that the SOCP-MS and the QP-RL relaxations areequivalent. Furthermore, we prove that despite the flexibility in theform of the constraints/objective function offered by QP andSOCP, the LP-S relaxation strictly dominates (i.e.,provides a better approximation than) QP-RL and SOCP-MS.We generalize these results by defining a large class of SOCP(and equivalent QP) relaxations which is dominated by the LP-S relaxation. Based on these results we propose some novel SOCP relaxations which define constraints using random variables thatform cycles or cliques in the graphical model representation of therandom field. Using some examples we show that the new SOCPrelaxations strictly dominate the previous approaches.

一般的な離散マルコフ確率場（離散的なラベルの集合を使用して定義されたマルコフ確率場）の最大事後推定値を求める問題は、NP困難であることが知られています。しかし、多くのアプリケーションで中心的な重要性を持つことから、文献ではいくつかの近似アルゴリズムが提案されています。この論文では、凸緩和に基づく3つのそのようなアルゴリズムの分析を示します。(i) LP-S:特殊なケースについてはSchlesinger (1976)が提案し、一般的なケースについてはChekuriら(2001)、Kosterら(1998)、およびWainwrightら(2005)がそれぞれ独立に提案した線形計画法（LP）緩和法。(ii) QP-RL: RavikumarとLafferty (2006)の二次計画法（QP）緩和法。(iii) SOCP-MS: 2次円錐計画法(SOCP)緩和法。MuramatsuとSuzuki (2003)によって2つのラベル問題に対して最初に提案され、その後Kumarら(2006)によって一般ラベルセットに対して拡張されました。SOCP-MS緩和法とQP-RL緩和法は同等であることを示します。さらに、QPとSOCPによって提供される制約/目的関数の形式の柔軟性にもかかわらず、LP-S緩和法がQP-RLとSOCP-MSを厳密に支配する(つまり、より優れた近似値を提供する)ことを証明します。これらの結果を一般化するために、LP-S緩和法が支配的なSOCP (および同等のQP)緩和法の大規模なクラスを定義します。これらの結果に基づいて、ランダムフィールドのグラフィカルモデル表現でサイクルまたはクリークを形成するランダム変数を使用して制約を定義するいくつかの新しいSOCP緩和法を提案します。いくつかの例を使用して、新しいSOCP緩和法が以前のアプローチを厳密に支配することを示します。

Markov Properties for Linear Causal Models with Correlated Errors
相関誤差を持つ線形因果モデルのマルコフ特性

A linear causal model with correlated errors, represented by a DAGwith bi-directed edges, can be tested by the set of conditionalindependence relations implied by the model. A global Markov propertyspecifies, by the d-separation criterion, the set of all conditionalindependence relations holding in any model associated with a graph. Alocal Markov property specifies a much smaller set of conditionalindependence relations which will imply all other conditionalindependence relations which hold under the global Markovproperty. For DAGs with bi-directed edges associated with arbitraryprobability distributions, a local Markov property is given inRichardson (2003) which may invoke an exponential number ofconditional independencies. In this paper, we show that for a class oflinear structural equation models with correlated errors, there is alocal Markov property which will invoke only a linear number ofconditional independence relations. For general linear models, weprovide a local Markov property that often invokes far fewerconditional independencies than that in Richardson (2003). Theresults have applications in testing linear structural equation modelswith correlated errors.

双方向エッジを持つDAGで表される相関誤差を持つ線形因果モデルは、モデルによって暗示される条件付き独立関係のセットによってテストできます。グローバルマルコフ特性は、d分離基準によって、グラフに関連付けられた任意のモデルで保持されるすべての条件付き独立関係のセットを指定します。ローカルマルコフ特性は、グローバルマルコフ特性の下で保持される他のすべての条件付き独立関係を暗示する、はるかに小さい条件付き独立関係のセットを指定します。任意の確率分布に関連付けられた双方向エッジを持つDAGの場合、Richardson (2003)でローカルマルコフ特性が与えられ、指数関数的な数の条件付き独立性を呼び出す可能性があります。この論文では、相関誤差を持つ線形構造方程式モデルのクラスに対して、線形数の条件付き独立関係のみを呼び出すローカルマルコフ特性があることを示します。一般線形モデルの場合、Richardson (2003)のものよりもはるかに少ない条件付き独立性を呼び出すローカルマルコフ特性を提供します。結果は、相関誤差のある線形構造方程式モデルのテストに応用できます。

Exploring Strategies for Training Deep Neural Networks
ディープニューラルネットワークの学習戦略の探索

Deep multi-layer neural networks have many levels of non-linearitiesallowing them to compactly represent highly non-linear andhighly-varying functions. However, until recently it was not clear howto train such deep networks, since gradient-based optimizationstarting from random initialization often appears to get stuck in poorsolutions. Hinton et al. recently proposed a greedy layer-wiseunsupervised learning procedure relying on the training algorithm ofrestricted Boltzmann machines (RBM) to initialize the parameters of adeep belief network (DBN), a generative model with many layers ofhidden causal variables. This was followed by the proposal of anothergreedy layer-wise procedure, relying on the usage of autoassociatornetworks. In the context of the above optimization problem, we studythese algorithms empirically to better understand their success. Ourexperiments confirm the hypothesis that the greedy layer-wiseunsupervised training strategy helps the optimization by initializingweights in a region near a good local minimum, but also implicitlyacts as a sort of regularization that brings better generalization andencourages internal distributed representations that are high-levelabstractions of the input. We also present a series of experimentsaimed at evaluating the link between the performance of deep neuralnetworks and practical aspects of their topology, for example,demonstrating cases where the addition of more depth helps. Finally,we empirically explore simple variants of these training algorithms,such as the use of different RBM input unit distributions, a simpleway of combining gradient estimators to improve performance, as wellas on-line versions of those algorithms.

ディープマルチレイヤーニューラルネットワークには多くのレベルの非線形性があり、高度に非線形で変化の激しい関数をコンパクトに表現できます。ただし、ランダム初期化から開始する勾配ベースの最適化では、多くの場合、不十分なソリューションで行き詰まるため、最近までこのようなディープネットワークをトレーニングする方法は明らかではありませんでした。Hintonらは最近、制限付きボルツマンマシン(RBM)のトレーニングアルゴリズムを利用して、多くの層の隠れた因果変数を持つ生成モデルであるディープビリーフネットワーク(DBN)のパラメーターを初期化する、貪欲なレイヤー単位の教師なし学習手順を提案しました。これに続いて、自動連想ネットワークの使用を利用した別の貪欲なレイヤー単位の手順が提案されました。上記の最適化問題のコンテキストで、これらのアルゴリズムを経験的に研究し、その成功をより深く理解します。我々の実験は、貪欲な層単位の教師なしトレーニング戦略が、良好な局所的最小値に近い領域で重みを初期化することで最適化に役立つという仮説を裏付けていますが、同時に暗黙的に一種の正則化として機能し、より優れた一般化をもたらし、入力の高レベル抽象化である内部分散表現を促進します。また、深層ニューラルネットワークのパフォーマンスとトポロジの実際的な側面との関連を評価することを目的とした一連の実験も紹介します。たとえば、深度を増やすことで役立つケースを示します。最後に、異なるRBM入力ユニット分布の使用、パフォーマンスを向上させるための勾配推定器の単純な組み合わせ、およびそれらのアルゴリズムのオンラインバージョンなど、これらのトレーニングアルゴリズムの単純なバリエーションを経験的に調査します。

関連記事