Journal of Statistical Software: List of Articles in Volume 108

This page lists the contents of Journal of Statistical Software Volume 108. Each title and abstract is followed by a machine-assisted Japanese translation.

Articles

The R Package tipsae: Tools for Mapping Proportions and Indicators on the Unit Interval
Rパッケージtipsae：単位区間上の比率と指標をマッピングするためのツール

The tipsae package implements a set of small area estimation tools for mapping proportions and indicators defined on the unit interval. It provides for small area models defined at area level, including the classical beta regression, zero- and/or one-inflated beta and flexible beta ones, possibly accounting for spatial and/or temporal dependency structures. The models, developed within a Bayesian framework, are estimated through Stan language, allowing fast estimation and customized parallel computing. The additional features of the tipsae package, such as diagnostics, visualization and exporting functions as well as variance smoothing and benchmarking functions, improve the user experience through the entire process of estimation, validation and outcome presentation. A shiny application with a user-friendly interface further eases the implementation of Bayesian models for small area analysis.
tipsaeパッケージは、単位区間上で定義された比率と指標をマッピングするための一連の小地域推定ツールを実装します。古典的なベータ回帰、ゼロ過剰および/または1過剰ベータモデル、フレキシブルベータモデルなど、地域レベルで定義された小地域モデルを提供し、必要に応じて空間的および/または時間的な依存構造を考慮できます。ベイズの枠組みで開発されたこれらのモデルはStan言語を介して推定されるため、高速な推定とカスタマイズされた並列計算が可能です。診断、視覚化、エクスポートの各機能や、分散平滑化、ベンチマーキングの機能といったtipsaeパッケージの追加機能は、推定、検証、結果提示の全プロセスを通じてユーザーエクスペリエンスを向上させます。ユーザーフレンドリーなインターフェースを備えたShinyアプリケーションにより、小地域分析のためのベイズモデルの実装がさらに容易になります。

The R Package markets: Estimation Methods for Markets in Equilibrium and Disequilibrium
Rパッケージmarkets：均衡状態と不均衡状態の市場の推定手法

Market models constitute a significant cornerstone of empirical applications in business, industrial organization, and policymaking macroeconomics. The econometric literature proposes various estimation methods for markets in equilibrium, which entail a market-clearing structural condition, and disequilibrium, which are described based on a structural short-side rule. Nonetheless, maximum likelihood estimations of such models are computationally demanding, and software providing simple, out-of-the-box methods for estimating them is scarce. Therefore, applications rely on project-specific implementations for estimating these models, which hinders research reproducibility and result comparability. This article presents the R package markets, which provides a common interface with generic functionality simplifying the estimation of models for markets in equilibrium and disequilibrium. The package specializes in estimating demanded, supplied, and aggregated market quantities and absolute, normalized, and relative market shortages. Its functionality is exemplified via an empirical application using a classic dataset of United States credit for housing starts. Moreover, the article details the scope and design of the implementation and provides statistical measurements of the computational performance of its estimation functionality gathered via large-scale benchmarking simulations. The markets package is free software distributed under the Expat license as part of the R software ecosystem. It comprises a set of estimation and analysis tools that are not directly available from either alternative R packages or other statistical software projects.
市場モデルは、ビジネス、産業組織、および政策立案のマクロ経済学における実証的応用の重要な基礎を構成しています。計量経済学の文献では、市場清算の構造的条件を伴う均衡市場と、構造的ショートサイドルールに基づいて記述される不均衡市場について、さまざまな推定方法が提案されています。それにもかかわらず、このようなモデルの最尤推定は計算負荷が高く、それらを推定するためのシンプルですぐに使える方法を提供するソフトウェアは不足しています。そのため、応用研究はこれらのモデルの推定をプロジェクト固有の実装に依存しており、研究の再現性や結果の比較可能性が妨げられています。この記事では、均衡市場と不均衡市場のモデルの推定を簡略化する汎用機能を備えた共通のインターフェースを提供するRパッケージmarketsを紹介します。このパッケージは、需要量、供給量、および集計された市場数量と、絶対的、正規化、および相対的な市場不足の推定に特化しています。その機能は、米国の住宅着工向け信用に関する古典的なデータセットを用いた実証的応用によって例示されています。さらに、この記事では、実装の範囲と設計を詳しく説明し、大規模なベンチマークシミュレーションを通じて収集された推定機能の計算パフォーマンスの統計的測定結果を示します。marketsパッケージは、Rソフトウェアエコシステムの一部としてExpatライセンスの下で配布されるフリーソフトウェアです。これは、代替のRパッケージや他の統計ソフトウェアプロジェクトからは直接利用できない一連の推定および分析ツールで構成されています。

DoubleML: An Object-Oriented Implementation of Double Machine Learning in R
DoubleML：Rにおける二重機械学習のオブジェクト指向実装

The R package DoubleML implements the double/debiased machine learning framework of Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018). It provides functionalities to estimate parameters in causal models based on machine learning methods. The double machine learning framework consists of three key ingredients: Neyman orthogonality, high-quality machine learning estimation and sample splitting. Estimation of nuisance components can be performed by various state-of-the-art machine learning methods that are available in the mlr3 ecosystem. DoubleML makes it possible to perform inference in a variety of causal models, including partially linear and interactive regression models and their extensions to instrumental variable estimation. The object-oriented implementation of DoubleML enables a high flexibility for the model specification and makes it easily extendable. This paper serves as an introduction to the double machine learning framework and the R package DoubleML. In reproducible code examples with simulated and real data sets, we demonstrate how DoubleML users can perform valid inference based on machine learning methods.
RパッケージDoubleMLは、Chernozhukov、Chetverikov、Demirer、Duflo、Hansen、Newey、Robins (2018)の二重/バイアス除去機械学習フレームワークを実装しています。機械学習の手法に基づいて因果モデルのパラメータを推定する機能を提供します。二重機械学習フレームワークは、ネイマン直交性、高品質の機械学習推定、サンプル分割の3つの主要な要素で構成されています。局外成分の推定は、mlr3エコシステムで利用可能なさまざまな最先端の機械学習手法によって実行できます。DoubleMLを使用すると、部分線形回帰モデルや交互作用型回帰モデル(interactive regression model)、およびそれらの操作変数推定への拡張など、さまざまな因果モデルで推論を実行できます。DoubleMLのオブジェクト指向実装により、モデル指定の柔軟性が高く、拡張も容易です。本論文は、二重機械学習フレームワークとRパッケージDoubleMLの入門として役立ちます。シミュレーションデータセットと実データセットを用いた再現可能なコード例で、DoubleMLユーザーが機械学習手法に基づいて妥当な推論を実行する方法を示します。
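The three ingredients named in the abstract (an orthogonal score, a learner for the nuisance parts, and sample splitting) can be sketched for the partially linear model y = theta*d + g(x) + e. The snippet below is only an illustration in Python with NumPy: it does not use the DoubleML or mlr3 API, the function names `ols_predict` and `dml_plr` are hypothetical, and ordinary least squares stands in for a machine learning method.

```python
import numpy as np


def ols_predict(x_train, y_train, x_test):
    """Stand-in learner: least squares with an intercept, predicting on x_test."""
    a_train = np.column_stack([np.ones(len(x_train)), x_train])
    a_test = np.column_stack([np.ones(len(x_test)), x_test])
    coef, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
    return a_test @ coef


def dml_plr(y, d, x, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of theta in y = theta*d + g(x) + e."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res = np.empty(n)
    d_res = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Sample splitting: nuisance fits never see the fold they predict.
        y_res[test_idx] = y[test_idx] - ols_predict(x[train], y[train], x[test_idx])
        d_res[test_idx] = d[test_idx] - ols_predict(x[train], d[train], x[test_idx])
    # Solve the empirical moment condition of the orthogonal score
    # psi = (y_res - theta * d_res) * d_res for theta.
    theta = (d_res @ y_res) / (d_res @ d_res)
    psi = (y_res - theta * d_res) * d_res
    se = np.sqrt(np.mean(psi ** 2) / np.mean(d_res ** 2) ** 2 / n)
    return theta, se


# Simulated data with a known treatment coefficient of 0.7 (assumed for the demo).
rng = np.random.default_rng(42)
n = 2000
x = rng.normal(size=(n, 3))
d = x @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)
y = 0.7 * d + x @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

theta_hat, se_hat = dml_plr(y, d, x)
print(f"theta_hat = {theta_hat:.3f}, se = {se_hat:.3f}")
```

Replacing `ols_predict` with any function that has the same train/predict signature is the only change needed to try a different learner in this sketch.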

gcimpute: A Package for Missing Data Imputation
gcimpute：欠落データ補完のためのパッケージ

This article introduces the Python package gcimpute for missing data imputation. Package gcimpute can impute missing data with many different variable types, including continuous, binary, ordinal, count, and truncated values, by modeling data as samples from a Gaussian copula model. This semiparametric model learns the marginal distribution of each variable to match the empirical distribution, yet describes the interactions between variables with a joint Gaussian that enables fast inference, imputation with confidence intervals, and multiple imputation. The package also provides specialized extensions to handle large datasets (with complexity linear in the number of observations) and streaming datasets (with online imputation). This article describes the underlying methodology and demonstrates how to use the software package.
この記事では、欠落データ補完のためのPythonパッケージgcimputeを紹介します。パッケージgcimputeは、データをガウスコピュラモデルからのサンプルとしてモデル化することで、連続値、バイナリ値、順序値、カウント値、切断値など、さまざまな変数タイプの欠損データを補完できます。このセミパラメトリックモデルは、経験的分布に一致するように各変数の周辺分布を学習する一方で、変数間の相互作用は結合ガウス分布で記述するため、高速な推論、信頼区間付きの補完、および多重代入が可能になります。このパッケージには、大規模なデータセット(計算量が観測値の数に対して線形)とストリーミングデータセット(オンライン補完による)を処理するための特殊な拡張機能も用意されています。この記事では、基礎となる方法論について説明し、ソフトウェアパッケージの使用方法を示します。
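The idea of matching each marginal to its empirical distribution while describing dependence with a joint Gaussian can be illustrated for two continuous variables. The sketch below is a deliberate simplification and does not use the gcimpute API: the helper names are hypothetical, the latent correlation is estimated from complete pairs only rather than by the package's fitting procedure, and the first variable is assumed to be fully observed.

```python
import numpy as np
from statistics import NormalDist

_norm = NormalDist()


def to_normal_scores(col):
    """Map observed values to latent normal scores through the empirical CDF."""
    obs = ~np.isnan(col)
    n_obs = int(obs.sum())
    ranks = np.argsort(np.argsort(col[obs])) + 1  # ranks 1..n_obs, no ties assumed
    z = np.full(col.shape, np.nan)
    z[obs] = [_norm.inv_cdf(float(r) / (n_obs + 1)) for r in ranks]
    return z


def impute_pair(x1, x2):
    """Impute missing entries of x2 from a fully observed x1."""
    z1, z2 = to_normal_scores(x1), to_normal_scores(x2)
    seen = ~np.isnan(z2)
    rho = np.corrcoef(z1[seen], z2[seen])[0, 1]
    # Gaussian conditional mean of the latent z2 given z1.
    z2_hat = rho * z1[~seen]
    # Map back to the data scale through the empirical quantiles of x2.
    u = np.array([_norm.cdf(float(v)) for v in z2_hat])
    filled = x2.copy()
    filled[~seen] = np.quantile(x2[seen], u)
    return filled, rho


# Demo data (assumed): a skewed x2 whose latent correlation with x1 is 0.8.
rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2_full = np.exp(0.8 * x1 + 0.6 * rng.normal(size=n))
missing = rng.random(n) < 0.3
x2_obs = x2_full.copy()
x2_obs[missing] = np.nan

x2_filled, rho_hat = impute_pair(x1, x2_obs)
print(f"estimated latent correlation: {rho_hat:.2f}")
```

Because imputed values are drawn from the empirical quantiles of the observed data, they stay within the observed range and respect the skewed marginal, which a plain linear regression imputation would not guarantee.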

melt: Multiple Empirical Likelihood Tests in R
melt：Rでの多重経験的尤度検定

Empirical likelihood enables a nonparametric, likelihood-driven style of inference without relying on assumptions frequently made in parametric models. Empirical likelihood-based tests are asymptotically pivotal and thus avoid explicit studentization. This paper presents the R package melt that provides a unified framework for data analysis with empirical likelihood methods. A collection of functions are available to perform multiple empirical likelihood tests for linear and generalized linear models in R. The package melt offers an easy-to-use interface and flexibility in specifying hypotheses and calibration methods, extending the framework to simultaneous inferences. Hypothesis testing uses a projected gradient algorithm to solve constrained empirical likelihood optimization problems. The core computational routines are implemented in C++, with OpenMP for parallel computation.
経験的尤度は、パラメトリックモデルで頻繁に置かれる仮定に頼ることなく、ノンパラメトリックで尤度に基づく推論を可能にします。経験的尤度に基づく検定は漸近的にピボタルであるため、明示的なスチューデント化を必要としません。この論文では、経験的尤度法によるデータ分析の統一フレームワークを提供するRパッケージmeltを紹介します。Rの線形モデルと一般化線形モデルに対して多重経験的尤度検定を実行するための関数群が用意されています。パッケージmeltは、使いやすいインターフェースと、仮説およびキャリブレーション方法を指定する柔軟性を提供し、フレームワークを同時推論に拡張します。仮説検定では、射影勾配アルゴリズムを使用して、制約付き経験的尤度最適化問題を解きます。主要な計算ルーチンはC++で実装され、並列計算にはOpenMPが使用されています。

PUMP: Estimating Power, Minimum Detectable Effect Size, and Sample Size When Adjusting for Multiple Outcomes in Multi-Level Experiments
PUMP：マルチレベル実験で複数の結果を調整する際の検出力および検出可能な最小効果サイズ、サンプルサイズの推定

For randomized controlled trials (RCTs) with a single intervention’s impact being measured on multiple outcomes, researchers often apply a multiple testing procedure (such as Bonferroni or Benjamini-Hochberg) to adjust p values. Such an adjustment reduces the likelihood of spurious findings, but also changes the statistical power, sometimes substantially. A reduction in power means a reduction in the probability of detecting effects when they do exist. This consideration is frequently ignored in typical power analyses, as existing tools do not easily accommodate the use of multiple testing procedures. We introduce the PUMP (Power Under Multiplicity Project) R package as a tool for analysts to estimate statistical power, minimum detectable effect size, and sample size requirements for multi-level RCTs with multiple outcomes. PUMP uses a simulation-based approach to flexibly estimate power for a wide variety of experimental designs, number of outcomes, multiple testing procedures, and other user choices. By assuming linear mixed effects models, we can draw directly from the joint distribution of test statistics across outcomes and thus estimate power via simulation. One of PUMP’s main innovations is accommodating multiple outcomes, which are accounted for in two ways. First, power estimates from PUMP properly account for the adjustment in p values from applying a multiple testing procedure. Second, when considering multiple outcomes rather than a single outcome, different definitions of statistical power emerge. PUMP allows researchers to consider a variety of definitions of power in order to choose the most appropriate types of power for the goals of their study. The package supports a variety of commonly used frequentist multi-level RCT designs and linear mixed effects models. 
In addition to the main functionality of estimating power, minimum detectable effect size, and sample size requirements, the package allows the user to easily explore sensitivity of these quantities to changes in underlying assumptions.
単一の介入が複数のアウトカムに及ぼす影響を測定するランダム化比較試験(RCT)では、研究者はしばしば多重検定手法(BonferroniやBenjamini-Hochbergなど)を適用してp値を調整します。このような調整は、偽の所見の可能性を減らす一方で、統計的検出力を、時には大幅に変化させます。検出力の低下は、効果が実際に存在する場合にそれを検出する確率が減少することを意味します。既存のツールは多重検定手法の使用に容易に対応できないため、この点は一般的な検出力分析では無視されることがよくあります。本論文では、複数のアウトカムを持つマルチレベルRCTの統計的検出力、検出可能な最小効果サイズ、およびサンプルサイズ要件をアナリストが推定するためのツールとして、PUMP(Power Under Multiplicity Project)Rパッケージを紹介します。PUMPは、シミュレーションベースのアプローチを使用して、さまざまな実験計画、アウトカムの数、多重検定手法、およびその他のユーザーの選択に対して検出力を柔軟に推定します。線形混合効果モデルを仮定することで、アウトカム全体にわたる検定統計量の同時分布から直接サンプリングし、シミュレーションを通じて検出力を推定できます。PUMPの主なイノベーションの1つは複数のアウトカムへの対応であり、これは2つの点で考慮されます。第一に、PUMPによる検出力の推定は、多重検定手法の適用によるp値の調整を適切に考慮します。第二に、単一のアウトカムではなく複数のアウトカムを考慮すると、統計的検出力の異なる定義が現れます。PUMPを使用すると、研究者は検出力のさまざまな定義を検討し、研究の目的に最も適した種類の検出力を選択できます。このパッケージは、一般的に使用されるさまざまな頻度論的マルチレベルRCTデザインと線形混合効果モデルをサポートしています。検出力、検出可能な最小効果サイズ、サンプルサイズ要件の推定という主要な機能に加えて、このパッケージでは、基礎となる仮定の変化に対するこれらの量の感度を簡単に調べることができます。
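The simulation-based approach described above (draw test statistics from their joint distribution, adjust the p values, count rejections) can be sketched as follows. This is not the PUMP API. The function names, the equicorrelated normal test statistics, and the three power summaries shown (average per-outcome power, power to detect at least one effect, power to detect all effects) are assumptions chosen for illustration.

```python
import math
import numpy as np


def two_sided_p(z):
    """Two-sided normal p values for an array of z statistics."""
    return np.vectorize(math.erfc)(np.abs(z) / math.sqrt(2.0))


def no_adjustment(p):
    return p


def bonferroni(p):
    """Bonferroni-adjusted p values, one row per simulated trial."""
    return np.minimum(p * p.shape[1], 1.0)


def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted p values, one row per simulated trial."""
    m = p.shape[1]
    order = np.argsort(p, axis=1)
    scaled = np.take_along_axis(p, order, axis=1) * m / np.arange(1, m + 1)
    # Enforce monotonicity starting from the largest p value.
    adj_sorted = np.minimum.accumulate(scaled[:, ::-1], axis=1)[:, ::-1]
    adj = np.empty_like(p)
    np.put_along_axis(adj, order, np.minimum(adj_sorted, 1.0), axis=1)
    return adj


def estimate_power(ncp, rho, adjust, alpha=0.05, n_sim=20000, seed=1):
    """Estimate power from simulated, equicorrelated z statistics."""
    rng = np.random.default_rng(seed)
    m = len(ncp)
    cov = np.full((m, m), rho) + (1.0 - rho) * np.eye(m)
    z = rng.multivariate_normal(np.asarray(ncp, dtype=float), cov, size=n_sim)
    reject = adjust(two_sided_p(z)) < alpha
    return {
        "individual": float(reject.mean(axis=0).mean()),
        "min1": float(reject.any(axis=1).mean()),
        "complete": float(reject.all(axis=1).mean()),
    }


# Three outcomes, each with roughly 80% unadjusted power (effect / SE = 2.8).
ncp = [2.8, 2.8, 2.8]
results = {
    name: estimate_power(ncp, rho=0.5, adjust=fn)
    for name, fn in [
        ("none", no_adjustment),
        ("bonferroni", bonferroni),
        ("bh", benjamini_hochberg),
    ]
}
for name, res in results.items():
    print(name, {k: round(v, 3) for k, v in res.items()})
```

Using the same seed for every procedure means all three are evaluated on identical simulated statistics, so differences in the printed power values come from the adjustment alone.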

Holistic Generalized Linear Models
全体論的一般化線形モデル

Holistic linear regression extends the classical best subset selection problem by adding additional constraints designed to improve the model quality. These constraints include sparsity-inducing constraints, sign-coherence constraints and linear constraints. The R package holiglm provides functionality to model and fit holistic generalized linear models. By making use of state-of-the-art mixed-integer conic solvers, the package can reliably solve generalized linear models for Gaussian, binomial and Poisson responses with a multitude of holistic constraints. The high-level interface simplifies the constraint specification and can be used as a drop-in replacement for the stats::glm() function.
全体論的線形回帰は、モデルの品質を向上させるように設計された制約を追加することで、従来の最適サブセット選択問題を拡張します。これらの制約には、スパース性を誘発する制約、符号コヒーレンス制約、線形制約が含まれます。Rパッケージholiglmは、全体論的な一般化線形モデルをモデル化し、適合させる機能を提供します。最先端の混合整数円錐ソルバーを利用することにより、このパッケージは、多数の全体的な制約を持つガウス応答、二項応答、およびポアソン応答の一般化線形モデルを確実に解くことができます。高水準インターフェースは制約の指定を簡素化し、stats::glm()関数のドロップイン置換として使用できます。

salmon: A Symbolic Linear Regression Package for Python
salmon：Pythonのシンボリック線形回帰パッケージ

One of the most attractive features of R is its linear modeling capabilities. We describe a Python package, salmon, that brings the best of R’s linear modeling functionality to Python in a Pythonic way – by providing composable objects for specifying and fitting linear models. This object-oriented design also enables other features that enhance ease-of-use, such as automatic visualizations and intelligent model building.
Rの最も魅力的な機能の1つは、その線形モデリング機能です。ここでは、線形モデルを指定して適合させるための合成可能なオブジェクトを提供することで、Rの線形モデリング機能の長所をPythonらしい形でPythonにもたらすPythonパッケージsalmonについて説明します。このオブジェクト指向設計により、自動視覚化やインテリジェントなモデル構築など、使いやすさを向上させる他の機能も可能になります。

Modeling Nonstationary Financial Volatility with the R Package tvgarch
Rパッケージtvgarchによる非定常金融ボラティリティのモデリング

Certain events can cause the structure of volatility of financial returns to change, making it nonstationary. Models of time-varying conditional variance such as generalized autoregressive conditional heteroscedasticity (GARCH) models usually assume stationarity. However, this assumption can be inappropriate and volatility predictions can fail in the presence of structural changes in the unconditional variance. To overcome this problem, in the time-varying (TV-)GARCH model, the GARCH parameters are allowed to vary smoothly over time by assuming not only the conditional but also the unconditional variance to be time-varying. In this paper, we show how useful the R package tvgarch (Campos-Martins and Sucarrat 2023) can be for modeling nonstationary volatility in financial empirical applications. The functions for simulating, testing and estimating TV-GARCH-X models, where additional covariates can be included, are implemented in both univariate and multivariate settings.
特定のイベントによって金融リターンのボラティリティの構造が変化し、非定常になることがあります。一般化自己回帰条件付き不均一分散(GARCH)モデルなどの時変条件付き分散のモデルは、通常、定常性を仮定しています。ただし、この仮定は不適切である可能性があり、無条件分散に構造変化が存在すると、ボラティリティの予測が失敗する可能性があります。この問題を解決するために、時変(TV-)GARCHモデルでは、条件付き分散だけでなく無条件分散も時変であると仮定することで、GARCHパラメーターが時間とともに滑らかに変化することが許容されます。この論文では、Rパッケージtvgarch (Campos-Martins and Sucarrat 2023)が金融の実証的応用における非定常ボラティリティのモデル化にどれほど役立つかを示します。追加の共変量を含めることができるTV-GARCH-Xモデルのシミュレーション、検定、および推定のための関数が、単変量設定と多変量設定の両方で実装されています。
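A smoothly changing unconditional variance can be illustrated with a short simulation. The sketch below assumes a common multiplicative form in which the variance is the product of a deterministic logistic component g_t and a GARCH(1,1) component h_t. The parameter names and values are hypothetical, and neither the tvgarch API nor necessarily its exact parameterization is reproduced here.

```python
import numpy as np


def simulate_tv_garch(n, omega=0.1, alpha=0.1, beta=0.8,
                      delta0=1.0, delta1=3.0, gamma=10.0, c=0.5, seed=0):
    """Simulate y_t = sqrt(g_t * h_t) * z_t with a logistic g_t and GARCH(1,1) h_t."""
    rng = np.random.default_rng(seed)
    s = np.arange(1, n + 1) / n  # rescaled time in (0, 1]
    # Long-run component: moves smoothly from about delta0 to delta0 + delta1.
    g = delta0 + delta1 / (1.0 + np.exp(-gamma * (s - c)))
    z = rng.standard_normal(n)
    h = np.empty(n)
    y = np.empty(n)
    h[0] = omega / (1.0 - alpha - beta)  # stationary level of the GARCH part
    y[0] = np.sqrt(g[0] * h[0]) * z[0]
    for t in range(1, n):
        # The GARCH recursion is driven by returns rescaled by the long-run part.
        phi_sq = y[t - 1] ** 2 / g[t - 1]
        h[t] = omega + alpha * phi_sq + beta * h[t - 1]
        y[t] = np.sqrt(g[t] * h[t]) * z[t]
    return y, g, h


n_obs = 4000
y, g, h = simulate_tv_garch(n_obs)
var_first = float(np.var(y[: n_obs // 2]))
var_second = float(np.var(y[n_obs // 2:]))
print(f"variance, first half: {var_first:.2f}; second half: {var_second:.2f}")
```

A stationary GARCH model fitted to such a series would have to absorb the level shift into its persistence parameters, which is the kind of failure the abstract refers to.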

Modeling Big, Heterogeneous, Non-Gaussian Spatial and Spatio-Temporal Data Using FRK
FRKを使用した大規模な異種非ガウス空間および時空間データのモデリング

Non-Gaussian spatial and spatio-temporal data are becoming increasingly prevalent, and their analysis is needed in a variety of disciplines. FRK is an R package for spatial and spatio-temporal modeling and prediction with very large data sets that, to date, has only supported linear process models and Gaussian data models. In this paper, we describe a major upgrade to FRK that allows for non-Gaussian data to be analyzed in a generalized linear mixed model framework. These vastly more general spatial and spatio-temporal models are fitted using the Laplace approximation via the software TMB. The existing functionality of FRK is retained with this advance into non-Gaussian models; in particular, it allows for automatic basis-function construction, it can handle both point-referenced and areal data simultaneously, and it can predict process values at any spatial support from these data. This new version of FRK also allows for the use of a large number of basis functions when modeling the spatial process, and thus it is often able to achieve more accurate predictions than previous versions of the package in a Gaussian setting. We demonstrate innovative features in this new version of FRK, highlight its ease of use, and compare it to alternative packages using both simulated and real data sets.
非ガウス空間データおよび時空間データはますます普及しており、その分析はさまざまな分野で必要とされています。FRKは、非常に大規模なデータセットを使用した空間的および時空間的なモデリングと予測のためのRパッケージであり、これまでは線形プロセスモデルとガウスデータモデルのみがサポートされていました。この論文では、非ガウスデータを一般化線形混合モデルの枠組みで分析できるようにするFRKの大幅なアップグレードについて説明します。これらのはるかに一般的な空間モデルと時空間モデルは、ソフトウェアTMBを介したラプラス近似を使用して適合されます。FRKの既存の機能は、非ガウスモデルへのこの拡張後も保持されます。特に、基底関数の自動構築が可能で、点参照データと面データの両方を同時に処理でき、これらのデータから任意の空間サポートでのプロセス値を予測できます。この新しいバージョンのFRKでは、空間プロセスをモデル化するときに多数の基底関数を使用できるため、ガウス設定でも以前のバージョンのパッケージより正確な予測を実現できることがよくあります。この新しいバージョンのFRKの革新的な機能を示し、その使いやすさを強調し、シミュレーションデータセットと実データセットの両方を使用して代替パッケージと比較します。

Code Snippets

CRTFASTGEEPWR: A SAS Macro for Power of Generalized Estimating Equations Analysis of Multi-Period Cluster Randomized Trials with Application to Stepped Wedge Designs
CRTFASTGEEPWR：ステップウェッジデザインへの応用を伴う複数期間クラスターランダム化試験の一般化推定方程式分析の検出力のためのSASマクロ

Multi-period cluster randomized trials (CRTs) are increasingly used for the evaluation of interventions delivered at the group level. While generalized estimating equations (GEE) are commonly used to provide population-averaged inference in CRTs, there is a gap of general methods and statistical software tools for power calculation based on multi-parameter, within-cluster correlation structures suitable for multi-period CRTs that can accommodate both complete and incomplete designs. A computationally fast, nonsimulation procedure for determining statistical power is described for the GEE analysis of complete and incomplete multi-period cluster randomized trials. The procedure is implemented via a SAS macro, CRTFASTGEEPWR, which is applicable to binary, count and continuous responses and several correlation structures in multi-period CRTs. The SAS macro is illustrated in the power calculation of two complete and two incomplete stepped wedge cluster randomized trial scenarios under different specifications of marginal mean model and within-cluster correlation structure. The proposed GEE power method is quite general as demonstrated in the SAS macro with numerous input options. The power procedure and macro can also be used in the planning of parallel and crossover CRTs in addition to cross-sectional and closed cohort stepped wedge trials.
複数期間クラスターランダム化試験(CRT)は、グループレベルで実施される介入の評価にますます使用されるようになってきています。一般化推定方程式(GEE)は、CRTで母集団平均推論を提供するために一般的に使用されますが、完全デザインと不完全デザインの両方に対応でき、複数期間CRTに適した複数パラメータのクラスター内相関構造に基づく検出力計算については、一般的な方法と統計ソフトウェアツールが不足しています。本稿では、完全および不完全な複数期間クラスターランダム化試験のGEE分析について、統計的検出力を求めるための、計算が高速でシミュレーションによらない手順を説明します。この手順は、SASマクロCRTFASTGEEPWRとして実装されており、複数期間CRTにおけるバイナリ応答、カウント応答、連続応答、およびいくつかの相関構造に適用できます。このSASマクロの使用例として、周辺平均モデルとクラスター内相関構造の異なる指定の下で、2つの完全なステップウェッジクラスターランダム化試験シナリオと2つの不完全なシナリオの検出力計算を示します。提案するGEE検出力計算法は、多数の入力オプションを備えたSASマクロで示されているように、非常に汎用的です。検出力の手順とマクロは、横断的および閉鎖コホートのステップウェッジ試験に加えて、並行群CRTおよびクロスオーバーCRTの計画にも使用できます。
