Journal of Statistical Software: Articles in Volume 106

This page lists the articles published in Journal of Statistical Software, Volume 106, giving the title and abstract of each.

Articles

Elastic Net Regularization Paths for All Generalized Linear Models

The lasso and elastic net are popular regularized regression models for supervised learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient algorithm for computing the elastic net regularization path for ordinary least squares regression, logistic regression and multinomial logistic regression, while Simon, Friedman, Hastie, and Tibshirani (2011) extended this work to Cox models for right-censored data. We further extend the reach of the elastic net-regularized regression to all generalized linear model families, Cox models with (start, stop] data and strata, and a simplified version of the relaxed lasso. We also discuss convenient utility functions for measuring the performance of these fitted models.
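As a rough illustration of what computing one point on an elastic net path involves, here is a minimal sketch of coordinate descent with soft-thresholding for the least-squares case. It assumes the objective (1/(2n))·||y - Xb||^2 + lam·(alpha·||b||_1 + (1 - alpha)/2·||b||^2), no intercept, and no all-zero columns; the names `soft_threshold` and `elastic_net_cd` are hypothetical and this is not the authors' implementation.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, lam, alpha, n_sweeps=200):
    """Cyclic coordinate descent for the Gaussian elastic net (illustrative).

    X is a list of rows, y a list of responses. Assumes no intercept and
    that no column of X is identically zero.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, with beta starting at zero
    for _ in range(n_sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            # Correlation of column j with the partial residual, i.e. the
            # residual with column j's current contribution added back.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            denom = sum(c * c for c in col) / n + lam * (1.0 - alpha)
            new_bj = soft_threshold(rho, lam * alpha) / denom
            delta = new_bj - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new_bj
    return beta


# Toy data with orthogonal columns: a strong signal and a weak one.
X_demo = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y_demo = [2.0, -2.0, 0.1, -0.1]
lasso_fit = elastic_net_cd(X_demo, y_demo, lam=0.1, alpha=1.0)
ols_fit = elastic_net_cd(X_demo, y_demo, lam=0.0, alpha=1.0)
```

With `alpha=1.0` the penalty is the pure lasso, so the weak coefficient is set exactly to zero while the strong one is only shrunk. Calling the function over a decreasing sequence of `lam` values traces out a path.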

netmeta: An R Package for Network Meta-Analysis Using Frequentist Methods

Network meta-analysis compares different interventions for the same condition, by combining direct and indirect evidence derived from all eligible studies. Network meta-analysis has been increasingly used by applied scientists and it is a major research topic for methodologists. This article describes the R package netmeta, which adopts frequentist methods to fit network meta-analysis models. We provide a roadmap to perform network meta-analysis, along with an overview of the main functions of the package. We present three worked examples considering different types of outcomes and different data formats to facilitate researchers aiming to conduct network meta-analysis with netmeta.
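The phrase "combining direct and indirect evidence" can be illustrated with the simplest three-treatment case. The sketch below assumes effects on an additive scale (for example log odds ratios), independent estimates, and inverse-variance pooling. The function names are hypothetical, and this is only the intuition behind the approach, not how netmeta fits a full network.

```python
def indirect_estimate(effect_ab, var_ab, effect_bc, var_bc):
    """Indirect estimate of C versus A via the common comparator B.

    Effects add along the path A -> B -> C, and so do the variances,
    assuming the two direct estimates are independent.
    """
    return effect_ab + effect_bc, var_ab + var_bc


def inverse_variance_pool(estimates):
    """Pool independent (effect, variance) pairs by inverse-variance weights."""
    weights = [1.0 / var for _, var in estimates]
    pooled = sum(w * eff for w, (eff, _) in zip(weights, estimates)) / sum(weights)
    return pooled, 1.0 / sum(weights)


# Hypothetical numbers: B vs A, C vs B, and a direct C vs A comparison.
indirect_ac = indirect_estimate(0.5, 0.04, 0.3, 0.05)
direct_ac = (1.0, 0.09)
combined_ac = inverse_variance_pool([direct_ac, indirect_ac])
```

The combined estimate lies between the direct and indirect ones and has a smaller variance than either, which is the gain that motivates using the whole network.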

MLGL: An R Package Implementing Correlated Variable Selection by Hierarchical Clustering and Group-Lasso

The R package MLGL, standing for multi-layer group-Lasso, implements a new procedure of variable selection in the context of redundancy between explanatory variables, which holds true with high-dimensional data. A sparsity assumption is made that postulates that only a few variables are relevant for predicting the response variable. In this context, the performance of classical Lasso-based approaches strongly deteriorates as the redundancy increases. The proposed approach combines variable aggregation and selection in order to improve interpretability and performance. First, a hierarchical clustering procedure provides at each level a partition of the variables into groups. Then, the set of groups of variables from the different levels of the hierarchy is given as input to group-Lasso, with weights adapted to the structure of the hierarchy. At this step, group-Lasso outputs sets of candidate groups of variables for each value of the regularization parameter. The versatility offered by package MLGL to choose groups at different levels of the hierarchy a priori induces a high computational complexity. MLGL, however, exploits the structure of the hierarchy and the weights used in group-Lasso to greatly reduce the final time cost. The final choice of the regularization parameter, and therefore the final choice of groups, is made by a multiple hierarchical testing procedure.

drda: An R Package for Dose-Response Data Analysis Using Logistic Functions

Analysis of dose-response data is an important step in many scientific disciplines, including but not limited to pharmacology, toxicology, and epidemiology. The R package drda is designed to facilitate the analysis of dose-response data by implementing efficient and accurate functions with a familiar interface. With drda it is possible to fit models by the method of least squares, perform goodness-of-fit tests, and conduct model selection. Compared to other similar packages, drda provides in general more accurate estimates in the least-squares sense. This result is achieved by a smart choice of the starting point in the optimization algorithm and by implementing a trust-region Newton method with analytical gradients and Hessian matrices. In this article, drda is presented through the description of its methodological components and examples of its user-friendly functions. Performance is evaluated using both synthetic data and a real, large-scale drug sensitivity screening dataset.
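For readers unfamiliar with logistic dose-response curves, the sketch below shows a four-parameter logistic function of log-dose and the least-squares criterion that a fitting routine would minimize. The parameterization (`lower`, `upper`, `growth`, `midpoint`) and the function names are assumptions for illustration; drda's own parameterization and optimizer may differ.

```python
import math


def logistic4(log_dose, lower, upper, growth, midpoint):
    """Four-parameter logistic curve evaluated at a log-dose.

    The response moves from `lower` to `upper`; `midpoint` is the log-dose
    at which it is halfway, and `growth` controls the steepness.
    """
    return lower + (upper - lower) / (1.0 + math.exp(-growth * (log_dose - midpoint)))


def residual_sum_of_squares(log_doses, responses, params):
    """Least-squares objective for a candidate parameter tuple."""
    return sum((y - logistic4(x, *params)) ** 2 for x, y in zip(log_doses, responses))


# Noise-free toy data generated from known parameters.
true_params = (0.0, 1.0, 3.0, 2.0)
demo_doses = [0.0, 1.0, 2.0, 3.0, 4.0]
demo_responses = [logistic4(x, *true_params) for x in demo_doses]

rss_true = residual_sum_of_squares(demo_doses, demo_responses, true_params)
rss_shifted = residual_sum_of_squares(demo_doses, demo_responses, (0.0, 1.0, 3.0, 2.5))
```

A fit "more accurate in the least-squares sense" is one that reaches a smaller value of this objective; here the true parameters give zero and a shifted midpoint gives a strictly larger value.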

RecordTest: An R Package to Analyze Non-Stationarity in the Extremes Based on Record-Breaking Events

The study of non-stationary behavior in the extremes is important to analyze data in environmental sciences, climate, finance, or sports. As an alternative to the classical extreme value theory, this analysis can be based on the study of record-breaking events. The R package RecordTest provides a useful framework for non-parametric analysis of non-stationary behavior in the extremes, based on the analysis of records. The underlying idea of all the non-parametric tools implemented in the package is to use the distribution of the record occurrence under series of independent and identically distributed continuous random variables, to analyze if the observed records are compatible with that behavior. Two families of tests are implemented. The first only requires the record times of the series, while the second includes more powerful tests that join the information from different types of records: upper and lower records in the forward and backward series. The package also offers functions that cover all the steps in this type of analysis such as data preparation, identification of the records, exploratory analysis, and complementary graphical tools. The applicability of the package is illustrated with the analysis of the effect of global warming on the extremes of the daily maximum temperature series in Zaragoza, Spain.
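The reference behavior mentioned above has a simple closed form: in a series of independent and identically distributed continuous variables, the observation at time t is an upper record with probability 1/t, so the expected number of records among n observations is 1 + 1/2 + ... + 1/n. The following sketch, with hypothetical function names, identifies upper records and computes that expectation; it is not the package's interface.

```python
def upper_record_indicators(series):
    """Return 1 where an observation exceeds all previous ones, else 0.

    The first observation is always counted as a record.
    """
    indicators = []
    current_max = None
    for value in series:
        is_record = current_max is None or value > current_max
        indicators.append(1 if is_record else 0)
        if is_record:
            current_max = value
    return indicators


def expected_record_count(n):
    """Expected number of upper records among n i.i.d. continuous values."""
    return sum(1.0 / t for t in range(1, n + 1))


demo_series = [3, 1, 4, 1, 5, 9, 2, 6]
demo_records = upper_record_indicators(demo_series)
observed_count = sum(demo_records)
expected_count = expected_record_count(len(demo_series))
```

A series with an upward trend tends to produce more upper records than this expectation, which is the kind of departure the tests are designed to detect.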

gfpop: An R Package for Univariate Graph-Constrained Change-Point Detection

In a world with data that change rapidly and abruptly, it is important to detect those changes accurately. In this paper we describe an R package implementing a generalized version of an algorithm recently proposed by Hocking, Rigaill, Fearnhead, and Bourque (2020) for penalized maximum likelihood inference of constrained multiple change-point models. This algorithm can be used to pinpoint the precise locations of abrupt changes in large data sequences. There are many application domains for such models, such as medicine, neuroscience or genomics. Often, practitioners have prior knowledge about the changes they are looking for. For example, in genomic data, biologists sometimes expect peaks: up changes followed by down changes. Taking advantage of such prior information can substantially improve the accuracy with which we can detect and estimate changes. Hocking et al. (2020) described a graph framework to encode many examples of such prior information and a generic algorithm to infer the optimal model parameters, but implemented the algorithm for just a single scenario. We present the gfpop package that implements the algorithm in a generic manner in R/C++. gfpop works for a user-defined graph that can encode prior assumptions about the types of changes that are possible and implements several loss functions (Gauss, Poisson, binomial, biweight, and Huber). We then illustrate the use of gfpop on isotonic simulations and several applications in biology. For a number of graphs the algorithm runs in a matter of seconds or minutes for 10^5 data points.
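To show what penalized change-point inference optimizes, here is a minimal sketch of the unconstrained special case: a quadratic-time dynamic program (often called optimal partitioning) with a Gaussian squared-error loss and a fixed penalty per change. This is a deliberately simpler technique than the graph-constrained algorithm the abstract describes, and the function names are hypothetical.

```python
import math


def segment_cost(prefix, prefix_sq, start, end):
    """Sum of squared deviations from the mean for data[start:end]."""
    length = end - start
    total = prefix[end] - prefix[start]
    total_sq = prefix_sq[end] - prefix_sq[start]
    return total_sq - total * total / length


def penalized_changepoints(data, penalty):
    """Indices at which a new segment starts, minimizing loss + penalty per change."""
    n = len(data)
    prefix, prefix_sq = [0.0], [0.0]
    for x in data:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    # best[t] is the minimal penalized cost of data[:t]. Starting at -penalty
    # means the first segment is not charged as a change.
    best = [-penalty] + [math.inf] * n
    last_start = [0] * (n + 1)
    for t in range(1, n + 1):
        for s in range(t):
            candidate = best[s] + segment_cost(prefix, prefix_sq, s, t) + penalty
            if candidate < best[t]:
                best[t] = candidate
                last_start[t] = s

    # Walk back through the stored segment starts to recover the changes.
    changes = []
    t = n
    while t > 0:
        s = last_start[t]
        if s > 0:
            changes.append(s)
        t = s
    return sorted(changes)


step_data = [0.0] * 5 + [10.0] * 5
step_changes = penalized_changepoints(step_data, penalty=1.0)
flat_changes = penalized_changepoints([1.0] * 4, penalty=1.0)
```

The double loop makes this sketch quadratic in the series length, so it is only practical for short series; handling sequences of the size quoted above requires pruning techniques that are not shown here.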

Broken Stick Model for Irregular Longitudinal Data

Many longitudinal studies collect data that have irregular observation times, often requiring the application of linear mixed models with time-varying outcomes. This paper presents an alternative that splits the quantitative analysis into two steps. The first step converts irregularly observed data into a set of repeated measures through the broken stick model. The second step estimates the parameters of scientific interest from the repeated measurements at the subject level. The broken stick model approximates each subject’s trajectory by a series of connected straight lines. The breakpoints, specified by the user, divide the time axis into consecutive intervals common to all subjects. Specification of the model requires just three variables: time, measurement and subject. The model is a special case of the linear mixed model, with time as a linear B-spline and subject as the grouping factor. The main assumptions are: Subjects are exchangeable, trajectories between consecutive breakpoints are straight, random effects follow a multivariate normal distribution, and unobserved data are missing at random. The R package brokenstick v2.5.0 offers tools to calculate, predict, impute and visualize broken stick estimates. The package supports two optimization methods, including options to constrain the variance-covariance matrix of the random effects. We demonstrate six applications of the model: Detection of critical periods, estimation of the time-to-time correlations, profile analysis, curve interpolation, multiple imputation and personalized prediction of future outcomes by curve matching.
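The statement that time enters as a linear B-spline can be made concrete: at any time point, at most two basis functions are non-zero, and their values are the linear interpolation weights between the neighboring breakpoints. The sketch below uses hypothetical function names and assumes that times outside the breakpoint range are clamped to the nearest boundary; it does not reproduce the package's code.

```python
def linear_bspline_row(t, knots):
    """Values of the linear B-spline (hat function) basis at time t.

    `knots` must be sorted and include both boundary breakpoints.
    Times outside the range are clamped to the boundary (an assumption).
    """
    row = [0.0] * len(knots)
    if t <= knots[0]:
        row[0] = 1.0
        return row
    if t >= knots[-1]:
        row[-1] = 1.0
        return row
    for j in range(len(knots) - 1):
        left, right = knots[j], knots[j + 1]
        if left <= t <= right:
            weight = (t - left) / (right - left)
            row[j] = 1.0 - weight
            row[j + 1] = weight
            break
    return row


def broken_stick_value(t, knots, values_at_knots):
    """Trajectory value at t, given one subject's values at the breakpoints."""
    basis = linear_bspline_row(t, knots)
    return sum(b * v for b, v in zip(basis, values_at_knots))


demo_knots = [0.0, 1.0, 2.0]
demo_row = linear_bspline_row(0.25, demo_knots)
demo_value = broken_stick_value(1.5, demo_knots, [10.0, 20.0, 40.0])
```

Stacking one such row per observation gives the design matrix for the time effect, and the per-subject coefficients are that subject's values at the breakpoints, which is how irregular observations become a set of repeated measures.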

Probabilistic Estimation and Projection of the Annual Total Fertility Rate Accounting for Past Uncertainty: A Major Update of the bayesTFR R Package

The bayesTFR package for R provides a set of functions to produce probabilistic projections of the total fertility rates for all countries, and is widely used, including as part of the basis for the United Nations official population projections for all countries. Liu and Raftery (2020) extended the theoretical model by adding a layer that accounts for the past total fertility rate estimation uncertainty. A major update of bayesTFR implements the new extension. Moreover, a new feature of producing annual total fertility rate estimation and projections extends the existing functionality of estimating and projecting for five-year time periods. An additional autoregressive component has been developed in order to account for the larger autocorrelation in the annual version of the model. This article summarizes the updated model, describes the basic steps to generate probabilistic estimation and projections under different settings, compares performance, and provides instructions on how to summarize, visualize and diagnose the model results.

intRinsic: An R Package for Model-Based Estimation of the Intrinsic Dimension of a Dataset

This article illustrates intRinsic, an R package that implements novel state-of-the-art likelihood-based estimators of the intrinsic dimension of a dataset, an essential quantity for most dimensionality reduction techniques. In order to make these novel estimators easily accessible, the package contains a small number of high-level functions that rely on a broader set of efficient, low-level routines. Generally speaking, intRinsic encompasses models that fall into two categories: homogeneous and heterogeneous intrinsic dimension estimators. The first category contains the two nearest neighbors estimator, a method derived from the distributional properties of the ratios of the distances between each data point and its first two closest neighbors. The functions dedicated to this method carry out inference under both the frequentist and Bayesian frameworks. In the second category, we find the heterogeneous intrinsic dimension algorithm, a Bayesian mixture model for which an efficient Gibbs sampler is implemented. After presenting the theoretical background, we demonstrate the performance of the models on simulated datasets. This way, we can facilitate the exposition by immediately assessing the validity of the results. Then, we employ the package to study the intrinsic dimension of the Alon dataset, obtained from a famous microarray experiment. Finally, we show how the estimation of homogeneous and heterogeneous intrinsic dimensions allows us to gain valuable insights into the topological structure of a dataset.
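The two nearest neighbors idea can be sketched in a few lines. If the data are locally uniform on a d-dimensional manifold, the ratio mu of each point's second to first nearest-neighbor distance follows a Pareto distribution with scale 1 and shape d, so the maximum likelihood estimate is n divided by the sum of log(mu). The brute-force search, the function names, and the assumption of distinct points are choices made for illustration, not the package's routines.

```python
import math
import random


def two_nn_ratios(points):
    """Ratio of second to first nearest-neighbor distance for every point.

    Brute force, O(n^2). Assumes all points are distinct, so the first
    nearest-neighbor distance is never zero.
    """
    ratios = []
    for i, p in enumerate(points):
        d1 = d2 = math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < d1:
                d1, d2 = d, d1
            elif d < d2:
                d2 = d
        ratios.append(d2 / d1)
    return ratios


def two_nn_mle(points):
    """Maximum likelihood estimate of the intrinsic dimension."""
    ratios = two_nn_ratios(points)
    return len(ratios) / sum(math.log(mu) for mu in ratios)


# Points on a straight line embedded in three dimensions: the ambient
# dimension is 3, but the intrinsic dimension is 1.
rng = random.Random(0)
line_points = [(u, 2.0 * u, -u) for u in (rng.random() for _ in range(300))]
line_dimension = two_nn_mle(line_points)
```

The estimate for the embedded line should come out close to 1 rather than 3, which is the distinction between ambient and intrinsic dimension that the estimators target.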

Application of Equal Local Levels to Improve Q-Q Plot Testing Bands with R Package qqconf

Quantile-quantile (Q-Q) plots are often difficult to interpret because it is unclear how large the deviation from the theoretical distribution must be to indicate a lack of fit. Most Q-Q plots could benefit from the addition of meaningful global testing bands, but the use of such bands unfortunately remains rare because of the drawbacks of current approaches and packages. These drawbacks include incorrect global type-I error rate, lack of power to detect deviations in the tails of the distribution, relatively slow computation for large data sets, and limited applicability. To solve these problems, we apply the equal local levels global testing method, which we have implemented in the R package qqconf, a versatile tool to create Q-Q plots and probability-probability (P-P) plots in a wide variety of settings, with simultaneous testing bands rapidly created using recently-developed algorithms. qqconf can easily be used to add global testing bands to Q-Q plots made by other packages. In addition to being quick to compute, these bands have a variety of desirable properties, including accurate global levels, equal sensitivity to deviations in all parts of the null distribution (including the tails), and applicability to a range of null distributions. We illustrate the use of qqconf in several applications: assessing normality of residuals from regression, assessing accuracy of p values, and use of Q-Q plots in genome-wide association studies.

disaggregation: An R Package for Bayesian Spatial Disaggregation Modeling

Disaggregation modeling, or downscaling, has become an important discipline in epidemiology. Surveillance data, aggregated over large regions, is becoming more common, leading to an increasing demand for modeling frameworks that can deal with this data to understand spatial patterns. Disaggregation regression models use response data aggregated over large heterogeneous regions to make predictions at fine-scale over the region by using fine-scale covariates to inform the heterogeneity. This paper presents the R package disaggregation, which provides functionality to streamline the process of running a disaggregation model for fine-scale predictions.

bootUR: An R Package for Bootstrap Unit Root Tests

Unit root tests form an essential part of any time series analysis. We provide practitioners with a single, unified framework for comprehensive and reliable unit root testing in the R package bootUR. The package’s backbone is the popular augmented Dickey-Fuller test paired with a union of rejections principle, which can be performed directly on single time series or multiple (including panel) time series. Accurate inference is ensured through the use of bootstrap methods. The package addresses the needs of both novice users, by providing user-friendly and easy-to-implement functions with sensible default options, as well as expert users, by giving full user-control to adjust the tests to one’s desired settings. Our parallelized C++ implementation ensures that all unit root tests are scalable to datasets containing many time series.
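The general shape of a bootstrap unit root test can be sketched as follows. The code computes a Dickey-Fuller t-statistic without intercept or lagged differences (so it is not "augmented"), then builds its null distribution by resampling centered first differences and cumulating them into random walks. This plain i.i.d. resampling scheme, the omission of the union of rejections step, and the function names are simplifying assumptions, not the package's procedure.

```python
import random


def df_statistic(y):
    """t-statistic for rho in diff(y)[t] = rho * y[t-1] + e[t]."""
    lagged = y[:-1]
    diffs = [b - a for a, b in zip(y[:-1], y[1:])]
    sxx = sum(v * v for v in lagged)
    rho = sum(v * d for v, d in zip(lagged, diffs)) / sxx
    resid = [d - rho * v for v, d in zip(lagged, diffs)]
    sigma2 = sum(e * e for e in resid) / (len(diffs) - 1)
    return rho / (sigma2 / sxx) ** 0.5


def bootstrap_p_value(y, n_boot=499, seed=0):
    """Left-tailed bootstrap p-value for the unit root null hypothesis."""
    rng = random.Random(seed)
    observed = df_statistic(y)
    diffs = [b - a for a, b in zip(y[:-1], y[1:])]
    mean_diff = sum(diffs) / len(diffs)
    centered = [d - mean_diff for d in diffs]
    as_extreme = 0
    for _ in range(n_boot):
        # Impose the null: cumulate resampled differences into a random walk.
        level = y[0]
        series = [level]
        for _ in range(len(diffs)):
            level += rng.choice(centered)
            series.append(level)
        if df_statistic(series) <= observed:
            as_extreme += 1
    return (as_extreme + 1) / (n_boot + 1)


# White noise is clearly stationary, so the unit root null should be rejected.
noise_rng = random.Random(1)
white_noise = [noise_rng.gauss(0.0, 1.0) for _ in range(200)]
noise_statistic = df_statistic(white_noise)
noise_p_value = bootstrap_p_value(white_noise)
```

A small p-value is evidence against a unit root. Serially correlated data would call for lagged differences in the regression and a resampling scheme that preserves dependence, neither of which is shown here.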
