Title: | Regression, Inference, and General Data Analysis Tools in R |
---|---|
Description: | A set of tools to streamline data analysis. Learning both R and introductory statistics at the same time can be challenging, and so we created 'rigr' to facilitate common data analysis tasks and enable learners to focus on statistical concepts. We provide easy-to-use interfaces for descriptive statistics, one- and two-sample inference, and regression analyses. 'rigr' output includes key information while omitting unnecessary details that can be confusing to beginners. Heteroscedasticity-robust ("sandwich") standard errors are returned by default, and multiple partial F-tests and tests for contrasts are easy to specify. A single regression function can fit both linear and generalized linear models, allowing students to more easily make connections between different classes of models. |
Authors: | Amy D Willis [aut, cre] |
Maintainer: | Amy D Willis <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.7 |
Built: | 2025-02-09 05:31:26 UTC |
Source: | https://github.com/statdivlab/rigr |
Developed by Scott S. Emerson, Andrew J. Spieker, Brian D. Williamson, and Travis Y. Hee Wai at the University of Washington Department of Biostatistics. Currently maintained by Prof. Amy Willis at the University of Washington Department of Biostatistics. Previously maintained by Charles Wolock and Taylor Okonek, also at the University of Washington Department of Biostatistics. Aims to facilitate regression, descriptive statistics, and one- and two-sample inference by implementing more intuitive layout and functionality for existing R functions.
Package: | rigr |
Type: | Package |
Version: | 1.0.0 |
Date: | 2021-09-10 |
License: | MIT |
A set of tools designed to facilitate easy adoption of R for students in introductory classes with little programming experience. Compiles output from existing routines together in an intuitive format, and adds functionality to existing functions. For instance, the regression function can perform linear models and generalized linear models. The user can also specify multiple-partial F-tests to print out with the model coefficients, and robust standard errors are provided automatically. We also provide functions for descriptive statistics and one- and two-sample inference with improved, legible output.
Scott S. Emerson, Andrew J. Spieker, Brian D. Williamson, Amy D. Willis, Charles Wolock, and Taylor Okonek
Maintainer: Amy Willis <[email protected]>
Compute analysis of variance (or deviance) tables for two fitted, nested uRegress objects. The model with more parameters is referred to as the full model (or the larger model), and the model with fewer parameters is referred to as the null model (or the smaller model).
## S3 method for class 'uRegress'
anova(object, full_object, test = "LRT", robustSE = TRUE, useFdstn = TRUE, ...)
object |
an object of class uRegress (the null model) |
full_object |
an object of class uRegress (the full model) |
test |
a character string specifying the test statistic to be used. Can be one of "Wald" or "LRT" |
robustSE |
a logical value indicating whether or not to use robust standard errors in calculation. Defaults to TRUE |
useFdstn |
a logical indicator that the F distribution should be used for test statistics instead of the chi squared distribution. Defaults to TRUE |
... |
argument to be passed in |
A list of class anova.uRegress with the following components:
printMat |
A formatted table with inferential results (i.e., test statistics and p-values) for comparing two nested models. |
null model |
The null model in the comparison. |
full model |
The full model in the comparison. |
# Loading required libraries
library(rigr)
library(sandwich)
# Reading in a dataset
data(mri)
# Linear regression of LDL on age and stroke (with robust SE by default)
testReg_null <- regress("mean", ldl ~ age + stroke, data = mri)
# Linear regression of LDL on age, stroke, and race (with robust SE by default)
testReg_full <- regress("mean", ldl ~ age + stroke + race, data = mri)
# Comparing the two models using the Wald test with robust SE
anova(testReg_null, testReg_full, test = "Wald")
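The idea of comparing a null model against a full model can be illustrated without rigr. The following is a minimal base-R sketch, assuming ordinary lm fits on the built-in mtcars data and the classical (non-robust) F-test; it is not the computation that anova.uRegress performs, and the variable names are illustrative only.

```r
# Base-R sketch: classical F-test for two nested linear models (no robust SE).
fit_null <- lm(mpg ~ wt, data = mtcars)              # smaller (null) model
fit_full <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # larger (full) model
cmp <- anova(fit_null, fit_full)

# The same F statistic computed by hand from the residual sums of squares.
rss0 <- sum(resid(fit_null)^2)
rss1 <- sum(resid(fit_full)^2)
df_num <- df.residual(fit_null) - df.residual(fit_full)  # number of added parameters
f_manual <- ((rss0 - rss1) / df_num) / (rss1 / df.residual(fit_full))
p_manual <- pf(f_manual, df_num, df.residual(fit_full), lower.tail = FALSE)
```

The hand computation makes explicit that the test asks whether the extra parameters of the full model reduce the residual sum of squares by more than chance would explain.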
Extracts Cook's distances from uRegress objects by relying on functionality from the stats package.
## S3 method for class 'uRegress'
cooks.distance(model, ...)
model |
an object of class uRegress |
... |
other arguments to pass to |
a vector of Cook's distances
Produces a table of relevant descriptive statistics for an arbitrary number of variables of class integer, numeric, Surv, Date, or factor. Descriptive statistics can be obtained within strata, and the user can specify that only a subset of the data be used. Descriptive statistics include the count of observations, the count of cases with missing values, the mean, standard deviation, geometric mean, minimum, and maximum. The user can specify arbitrary quantiles to be estimated, as well as the estimation of proportions of observations within specified ranges.
descrip(
  ...,
  strata = NULL,
  subset = NULL,
  probs = c(0.25, 0.5, 0.75),
  geomInclude = FALSE,
  replaceZeroes = FALSE,
  restriction = Inf,
  above = NULL,
  below = NULL,
  labove = NULL,
  rbelow = NULL,
  lbetween = NULL,
  rbetween = NULL,
  interval = NULL,
  linterval = NULL,
  rinterval = NULL,
  lrinterval = NULL
)
... |
an arbitrary number of variables for which descriptive statistics
are desired. The arguments can be vectors, matrices, or lists. Individual
columns of a matrix or elements of a list may be of class |
strata |
a vector, matrix, or list of stratification variables. Descriptive
statistics will be computed within strata defined by each unique combination
of the stratification variables, as well as in the combined sample.
If |
subset |
a vector indicating a subset to be used for all descriptive statistics.
If |
probs |
a vector of probabilities between 0 and 1 indicating quantile estimates to be included in the descriptive statistics. Default is to compute 25th, 50th (median) and 75th percentiles. |
geomInclude |
if not |
replaceZeroes |
if not |
restriction |
a value used for computing restricted means, standard deviations,
and geometric means with censored time-to-event data. The default value of
|
above |
a vector of values used to dichotomize variables. The descriptive
statistics will include an estimate for each variable of the proportion of
measurements with values greater than each element of |
below |
a vector of values used to dichotomize variables. The descriptive
statistics will include an estimate for each variable of the proportion of
measurements with values less than each element of |
labove |
a vector of values used to dichotomize variables. The descriptive
statistics will include an estimate for each variable of the proportion of
measurements with values greater than or equal to each element of |
rbelow |
a vector of values used to dichotomize variables. The descriptive
statistics will include an estimate for each variable of the proportion of
measurements with values less than or equal to each element of |
lbetween |
a vector of values with |
rbetween |
a vector of values with |
interval |
a two-column matrix of values in which each row is used to define intervals of interest to categorize variables. The descriptive statistics will include an estimate for each variable of the proportion of measurements with values between two elements in a row, with neither endpoint included in each interval. |
linterval |
a two-column matrix of values in which each row is used to define intervals of interest to categorize variables. The descriptive statistics will include an estimate for each variable of the proportion of measurements with values between two elements in a row, with the left-hand endpoint included in each interval. |
rinterval |
a two-column matrix of values in which each row is used to define intervals of interest to categorize variables. The descriptive statistics will include an estimate for each variable of the proportion of measurements with values between two elements in a row, with the right-hand endpoint included in each interval. |
lrinterval |
a two-column matrix of values in which each row is used to define intervals of interest to categorize variables. The descriptive statistics will include an estimate for each variable of the proportion of measurements with values between two elements in a row, with both endpoints included in each interval. |
This function depends on the survival R package. You should execute library(survival) if that package has not already been loaded (install it first if it is not installed).
Quantiles are computed for uncensored data using the default method in quantile(). For variables of class factor, descriptive statistics will be computed using the integer coding for factors. For variables of class Surv, estimated proportions and quantiles will be computed from Kaplan-Meier estimates, as will be restricted means, restricted standard deviations, and restricted geometric means. For variables of class Date, estimated proportions will be labeled using the Julian date since January 1, 1970.
An object of class uDescriptives is returned. Descriptive statistics for each variable in the entire subsetted sample, as well as within each stratum if any is defined, are contained in a matrix with rows corresponding to variables and strata and columns corresponding to the descriptive statistics. Descriptive statistics include
N: the number of observations.
Msng: the number of observations with missing values.
Mean: the mean of the nonmissing observations (this is potentially a restricted mean for right-censored time-to-event data).
Std Dev: the standard deviation of the nonmissing observations (this is potentially a restricted standard deviation for right-censored time-to-event data).
Geom Mn: the geometric mean of the nonmissing observations (this is potentially a restricted geometric mean for right-censored time-to-event data). Nonpositive values in the variable will generate NA, unless replaceZeroes was specified.
Min: the minimum value of the nonmissing observations (this is potentially restricted for right-censored time-to-event data).
Quantiles: columns corresponding to the quantiles specified by probs (these are potentially restricted for right-censored time-to-event data).
Max: the maximum value of the nonmissing observations (this is potentially restricted for right-censored time-to-event data).
Proportions: columns corresponding to the proportions as specified by above, below, labove, rbelow, lbetween, rbetween, interval, linterval, rinterval, and lrinterval.
restriction: the threshold for restricted means, standard deviations, and geometric means.
firstEvent: the time of the first event for censored time-to-event variables.
lastEvent: the time of the last event for censored time-to-event variables.
isDate: an indicator that the variable is a Date object.
# Load the package and read in the data
library(rigr)
data(mri)
# Create the table
descrip(mri)
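To make the listed statistics concrete, here is a minimal base-R sketch that computes several of them for one uncensored numeric vector. It is an illustration under stated assumptions, not rigr's implementation: the vector x is made up, N is assumed here to count all observations including missing ones, and the proportion labels are hypothetical.

```r
# Base-R sketch of a few descrip-style statistics for an uncensored numeric vector.
x <- c(2, 4, 4, 8, NA, 16)
obs <- x[!is.na(x)]                      # nonmissing observations

stats <- c(
  N = length(x),                         # assumed: counts missing values too
  Msng = sum(is.na(x)),
  Mean = mean(obs),
  `Std Dev` = sd(obs),
  `Geom Mn` = exp(mean(log(obs))),       # requires strictly positive values
  Min = min(obs),
  quantile(obs, probs = c(0.25, 0.5, 0.75)),
  Max = max(obs),
  `Pr(>4)` = mean(obs > 4),              # analogue of above = 4
  `Pr(>=4)` = mean(obs >= 4)             # analogue of labove = 4
)
print(round(stats, 3))
```

The two proportion entries differ only in whether the endpoint 4 is included, which is the distinction between above and labove described in the arguments.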
Extracts dfbeta from uRegress objects by relying on functionality from the stats package. Note that dfbeta and dfbetas are not the same (dfbetas are the dfbeta values divided by a scaling factor that reflects both the leverage of the observation in question and the residual model error).
## S3 method for class 'uRegress'
dfbeta(model, ...)
model |
an object of class uRegress |
... |
other arguments to pass to |
a matrix of dfbeta values, with a row for each observation and a column for each model coefficient
Extracts dfbetas from uRegress objects by relying on functionality from the stats package. Note that dfbeta and dfbetas are not the same (dfbetas are the dfbeta values divided by a scaling factor that reflects both the leverage of the observation in question and the residual model error).
## S3 method for class 'uRegress'
dfbetas(model, ...)
model |
an object of class uRegress |
... |
other arguments to pass to |
a matrix of dfbetas values, with a row for each observation and a column for each model coefficient
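The relationship between dfbeta and dfbetas can be checked directly for an ordinary lm fit. The sketch below assumes the built-in cars data and the stats implementation for lm objects, where each dfbeta is divided by the leave-one-out residual standard deviation times the square root of the corresponding diagonal element of (X'X)^{-1}; whether uRegress objects behave identically in every case is not asserted here.

```r
# Base-R sketch: dfbetas is dfbeta rescaled per observation and per coefficient.
fit <- lm(dist ~ speed, data = cars)
infl <- lm.influence(fit)                        # leave-one-out quantities
xtx_inv <- summary(fit)$cov.unscaled             # (X'X)^{-1}

# One scale per (observation, coefficient) pair:
# leave-one-out sigma times sqrt of the diagonal of (X'X)^{-1}.
scale <- outer(infl$sigma, sqrt(diag(xtx_inv)))
manual_dfbetas <- dfbeta(fit) / scale

head(cbind(manual = manual_dfbetas[, "speed"], stats = dfbetas(fit)[, "speed"]))
```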
Create Dummy Variables
dummy(
  x,
  subset = rep(TRUE, length(x)),
  reference = sort(unique(x[!is.na(x)])),
  includeAll = FALSE
)
x |
|
subset |
|
reference |
the reference value for the dummy variables to compare to. |
includeAll |
logical value indicating whether all of the dummy variables should be returned (including the reference). |
A matrix containing the dummy variables.
library(rigr)
data(mri)
# Create a dummy variable for chd
dummy(mri$chd)
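For readers new to indicator coding, the following base-R sketch builds dummy variables by hand. It assumes the first sorted level is used as the reference and uses a made-up vector x and hypothetical column names; it illustrates the concept and does not reproduce the output format of rigr's dummy.

```r
# Base-R sketch: one 0/1 column per non-reference level.
x <- c("b", "a", "c", "a", NA)
lev <- sort(unique(x[!is.na(x)]))   # "a" "b" "c"; "a" is treated as the reference here

# Each column indicates membership in one non-reference level; NA stays NA.
dummies <- sapply(lev[-1], function(l) as.integer(x == l))
colnames(dummies) <- paste0("x.", lev[-1])
dummies
```

Rows belonging to the reference level are zero in every column, which is why the reference column can be omitted without losing information.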
Data from a study of 654 children on the relationship between smoking status and lung function (measured by FEV). Each row corresponds to a single clinic visit and contains information on age, height, sex, FEV, and smoking status. More information, including a coding key, is available at http://www.emersonstatistics.com/datasets/fev.doc.
fev
A data frame with 654 rows and 7 variables:
case number (the numbers 1 to 654)
subject identification number (unique for each different child)
subject age at time of measurement (years)
measured forced exhalation volume (liters per second)
subject height at time of measurement (inches)
subject sex
smoking habits ("yes" or "no")
http://www.emersonstatistics.com/datasets/fev.txt
Extracts hat-values (leverages) from uRegress objects by relying on functionality from the stats package.
## S3 method for class 'uRegress'
hatvalues(model, ...)
model |
an object of class uRegress |
... |
other arguments to pass to |
a vector of hat-values (leverages)
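Hat-values and Cook's distances are closely linked, which a short base-R sketch can show. It assumes an ordinary lm fit on the built-in cars data and the standard formula for Cook's distance in linear models; it is an illustration only and makes no claim about how uRegress objects are handled internally.

```r
# Base-R sketch: Cook's distance from residuals and leverages.
fit <- lm(dist ~ speed, data = cars)
h <- hatvalues(fit)                  # leverages: diagonal of the hat matrix
e <- resid(fit)                      # ordinary residuals
p <- length(coef(fit))               # number of estimated coefficients
s2 <- summary(fit)$sigma^2           # residual variance estimate

# D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
cook_manual <- (e^2 / (p * s2)) * h / (1 - h)^2
head(cbind(manual = cook_manual, stats = cooks.distance(fit)))
```

The leverages sum to the number of coefficients, so an observation with a hat-value well above p/n is unusual in its covariate values.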
Produces point estimates, interval estimates, and p-values for linear combinations of regression coefficients using a uRegress object.
lincom(
  reg,
  comb,
  null.hypoth = 0,
  conf.level = 0.95,
  robustSE = TRUE,
  joint.test = FALSE,
  useFdstn = FALSE,
  eform = reg$fnctl != "mean"
)
reg |
an object of class uRegress |
comb |
a vector or matrix containing the values of the constants which create the linear combination of the form c_0*beta_0 + c_1*beta_1 + ..., with one constant per model coefficient in the order the coefficients appear in the model (so the first constant multiplies the intercept when one is present).
Zeroes must be given if coefficients aren't going to be included. For testing multiple combinations, this must be a matrix with number of columns equal to the number of coefficients in the model. |
null.hypoth |
the null hypothesis to compare the linear combination of coefficients against. This is a scalar if one combination is given, and a vector or matrix otherwise. The default value is 0 |
conf.level |
a number between 0 and 1, indicating the desired confidence level for intervals. |
robustSE |
a logical value indicating whether or not to use robust standard errors in calculation. Defaults to TRUE |
joint.test |
a logical value indicating whether or not to use a joint Chi-square test
for all the null hypotheses. If joint.test is |
useFdstn |
a logical indicator that the F distribution should be used for test statistics instead of the chi squared distribution. Defaults to FALSE |
eform |
a logical value indicating whether or not to exponentiate the estimated coefficient. By default this is performed based on the type of regression used. |
A list of class lincom (joint.test is FALSE) or lincom.joint (joint.test is TRUE). For the lincom class, comb entries in the list are labeled comb1, comb2, etc., one for each linear combination used. Each is a list with the following components:
printMat |
A formatted table with inferential results for the linear combination of coefficients. These include the point estimate, standard error, confidence interval, and t-test for the linear combination. |
nms |
The name of the linear combination, for printing. |
null.hypoth |
The null hypothesis for the linear combination. |
# Loading required libraries
library(rigr)
library(sandwich)
# Reading in a dataset
data(mri)
# Linear regression of LDL on age and stroke (with robust SE by default)
testReg <- regress("mean", ldl ~ age + stroke, data = mri)
# Testing the combination .5*age - stroke (the first 0 excludes the intercept)
testC <- c(0, 0.5, -1)
lincom(testReg, testC)
# Test multiple combinations separately:
# whether .5*age - stroke = 0 and whether Intercept + 60*age = 125
testC <- matrix(c(0, 0.5, -1, 1, 60, 0), byrow = TRUE, nrow = 2)
lincom(testReg, testC, null.hypoth = c(0, 125))
# Test joint null hypothesis:
# H0: .5*age - stroke = 0 AND Intercept + 60*age = 125
lincom(testReg, testC, null.hypoth = c(0, 125), joint.test = TRUE)
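The arithmetic behind a single linear combination can be sketched in base R. The example below assumes an ordinary lm fit on the built-in mtcars data and uses the model-based covariance from vcov rather than the sandwich estimator that lincom uses by default, so its standard error would generally differ from rigr's robust output; all object names are illustrative.

```r
# Base-R sketch: estimate, SE, t-test, and CI for one linear combination.
fit <- lm(mpg ~ wt + hp, data = mtcars)
comb <- c(0, 0.5, -1)                      # 0*Intercept + 0.5*wt - 1*hp

est <- sum(comb * coef(fit))               # point estimate of the combination
se <- sqrt(drop(t(comb) %*% vcov(fit) %*% comb))  # model-based (non-robust) SE
tstat <- (est - 0) / se                    # null hypothesis value is 0
pval <- 2 * pt(abs(tstat), df.residual(fit), lower.tail = FALSE)
ci <- est + c(-1, 1) * qt(0.975, df.residual(fit)) * se

c(estimate = est, se = se, lower = ci[1], upper = ci[2], p = pval)
```

Swapping vcov(fit) for a heteroscedasticity-robust covariance matrix would give the robust analogue of the same calculation.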
Data from an observational study of the incidence of cardiovascular disease (especially heart attacks and congestive heart failure) and cerebrovascular disease (especially strokes) in the U.S. elderly. More information, including a coding key, is available at http://www.emersonstatistics.com/datasets/mri.doc.
mri
A data frame with 735 rows and 30 variables:
Participant identification number.
The date on which the participant underwent MRI scan in MMDDYY format.
Participant age at time of MRI, in years.
The sex of the participant. Only 'Male' and 'Female' are represented.
Participant's race. One of the following: 'White', 'Black', 'Asian', or 'Subject did not identify as White, Black or Asian'. It is unclear if study participants self-identified their race, or if it was guessed by the study organisers.
Participant's weight at time of MRI (pounds).
Participant's height at time of MRI (centimeters).
Participant smoking history in pack years (1 pack year = smoking 1 pack of cigarettes per day for 1 year). A participant who has never smoked has 0 pack years.
Number of years since quitting smoking. A current smoker will have a nonzero packyrs and a 0 for yrsquit. A never smoker will have a zero for both variables.
Average alcohol intake for the participant for the two weeks prior to MRI (drinks per week, where one drink is 1 oz. whiskey, 4 oz. wine, or 12 oz. beer).
Physical activity of the participant for the week prior to MRI (1,000 kcal).
Indicator of whether the participant had been diagnosed with congestive heart failure prior to MRI (0=no, 1=yes).
Indicator of whether the participant had been diagnosed with coronary heart disease prior to MRI (0=no, 1=diagnosis of angina, 2=diagnosis of myocardial infarction).
Indicator of whether the participant had been diagnosed with a cerebrovascular event prior to MRI (0=no, 1=diagnosis of a transient ischemic attack, 2=diagnosis of stroke).
Indicator of whether the participant had been diagnosed with diabetes prior to MRI (0=no, 1=yes).
an indicator of the participant's view of their own health (1=excellent, 2=very good, 3=good, 4=fair, 5=poor)
a laboratory measure of low density lipoprotein (a kind of cholesterol) in the participant's blood at the time of MRI (mg/dL).
a laboratory measure of albumin, a kind of protein, in the participant's blood at the time of MRI (g/L).
a laboratory measure of creatinine, a waste product, in the participant's blood at the time of MRI (mg/dL).
a laboratory measure of the number of platelets circulating in the participant's blood at the time of MRI (1000 per cubic mm).
a measurement of the participant's systolic blood pressure in their arm at the time of MRI (mm Hg).
the ratio of systolic blood pressure measured in the participant's ankle at time of MRI to the systolic blood pressure in the participant's arm.
a measure of the forced expiratory volume in the participant at the time of MRI (L/sec).
a measure of cognitive function (Digit Symbol Substitution Test) for the participant at the time of MRI. Maximum score possible is 100.
a measure of loss of neurons estimated by the degree of ventricular enlargement relative to the predicted ventricular size; with 0 indicating no atrophy and 100 indicating the most severe degree of atrophy.
a measure of white matter changes detected on MRI. 0 means no changes, 9 means marked changes.
a count of the number of distinct regions identified on MRI scan which were suggestive of infarcts.
a measure of the total volume of infarct-like lesions found on MRI scan (cubic cm).
the total time (in days) that the participant was observed on study between the date of MRI and death or September 16, 1997, whichever came first.
an indicator that the participant was observed to die while on study. If 1, the number of days recorded in obstime is the number of days from that participant's MRI to their death. If 0, the number of days in obstime is the number of days between that participant's MRI and September 16, 1997.
http://www.emersonstatistics.com/datasets/mri.txt
Creates polynomial variables, to be used in regression. Will create polynomials of degree less than or equal to the degree specified, and will mean center variables by default.
polynomial(x, degree = 2, center = mean(x, na.rm = TRUE))
x |
variable used to create the polynomials. |
degree |
the maximum degree
polynomial to be returned. Polynomials of degree <= |
center |
the value to center the polynomials at. |
A matrix containing the polynomial terms.
# Loading the package and reading in a dataset
library(rigr)
data(mri)
# Create a polynomial on ldl
polynomial(mri$ldl, degree = 3)
# Use a polynomial in regress
regress("mean", atrophy ~ polynomial(age, degree = 2), data = mri)
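A centered polynomial basis is straightforward to build by hand, which can help when interpreting the coefficients. This base-R sketch assumes a made-up vector x, mean centering, and hypothetical column names; it shows the idea and is not a copy of rigr's polynomial function.

```r
# Base-R sketch: centered polynomial terms up to a given degree.
x <- c(1, 2, 3, 4, NA)
degree <- 3
center <- mean(x, na.rm = TRUE)            # 2.5

# Column d holds (x - center)^d; missing values propagate as NA.
poly_mat <- sapply(seq_len(degree), function(d) (x - center)^d)
colnames(poly_mat) <- paste0("x^", seq_len(degree))
poly_mat
```

Centering before raising to a power reduces the correlation between the linear and quadratic terms, which tends to stabilize the fitted coefficients.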
Produces prediction intervals for objects of class uRegress.
## S3 method for class 'uRegress'
predict(object, interval = "prediction", level = 0.95, ...)
object |
an object of class uRegress |
interval |
Type of interval calculation |
level |
Tolerance/confidence level |
... |
other arguments to pass to the appropriate predict function for
the class of |
Returns a matrix with the fitted value and prediction interval for the entered X.
# Loading required libraries
library(rigr)
library(survival)
library(sandwich)
# Reading in a dataset
data(mri)
# Linear regression of LDL on age (with robust SE by default)
testReg <- regress("mean", ldl ~ age, data = mri)
# 95% prediction intervals for the fitted values
predict(testReg)
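The difference between a prediction interval and a confidence interval at a specific covariate value can be seen with base R alone. The sketch below assumes an ordinary lm fit on the built-in cars data and uses predict.lm with a hypothetical new observation; it does not assert how predict.uRegress handles a newdata argument.

```r
# Base-R sketch: prediction vs. confidence interval at one covariate value.
fit <- lm(dist ~ speed, data = cars)
new_obs <- data.frame(speed = 15)          # hypothetical new observation

# Interval for a single future response at speed = 15.
pi95 <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)
# Interval for the mean response at speed = 15.
ci95 <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)

rbind(prediction = pi95[1, ], confidence = ci95[1, ])
```

Both intervals share the same fitted value, but the prediction interval is wider because it also accounts for the variability of an individual observation around the mean.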
Performs a one- or two-sample test of proportions using data. This test can be approximate or exact.
proptest(
  var1,
  var2 = NULL,
  by = NULL,
  exact = FALSE,
  null.hypoth = ifelse(is.null(var2) && is.null(by), 0.5, 0),
  alternative = "two.sided",
  conf.level = 0.95,
  correct = FALSE,
  more.digits = 0
)
var1 |
a (non-empty) vector of binary numeric (0-1), binary factor, or logical data values |
var2 |
an optional (non-empty) vector of binary numeric (0-1), binary factor, or logical data values |
by |
a variable of equal length to
that of |
exact |
If true, performs a test of equality of proportions using exact binomial probabilities. |
null.hypoth |
a number specifying the null hypothesis for the proportion (or difference in proportions if performing a two-sample test). Defaults to 0.5 for a one-sample test and 0 for a two-sample test. |
alternative |
a string: one of
|
conf.level |
confidence level of the test. Defaults to 0.95. |
correct |
a logical indicating whether to perform a continuity correction |
more.digits |
a numeric value specifying whether or not to display more or fewer digits in the output. Non-integers are automatically rounded down. |
Missing values must be given by "NA"s to be recognized as missing values. Numeric data must be given in 0-1 form. This function also accepts binary factor variables, treating the higher level as 1 and the lower level as 0, or logical variables.
A list of class proptest. The print method lays out the information in an easy-to-read format.
tab |
A formatted table of descriptive and inferential results (total number of observations, number of missing observations, sample proportion, standard error of the proportion estimate), along with a confidence interval for the underlying proportion. |
zstat |
the value of the test statistic, if using an approximate test. |
pval |
the p-value for the test |
var1 |
The user-supplied first data vector. |
var2 |
The user-supplied second data vector. |
by |
The user-supplied stratification variable. |
par |
A vector of information about the type of test (null hypothesis, alternative hypothesis, etc.) |
# Load the package and read in data set
library(rigr)
data(psa)
# Define new binary variable as indicator
# of whether or not bss was worst possible
bssworst <- psa$bss
bssworst[psa$bss == 1] <- 0
bssworst[psa$bss == 2] <- 0
bssworst[psa$bss == 3] <- 1
# Perform test comparing proportion in remission
# between bss strata
proptest(factor(psa$inrem), by = bssworst)
Performs a one- or two-sample test of proportions using counts of successes and trials, rather than binary data. This test can be approximate or exact.
proptesti(
  x1,
  n1,
  x2 = NULL,
  n2 = NULL,
  exact = FALSE,
  null.hypoth = ifelse(is.null(x2) && is.null(n2), 0.5, 0),
  conf.level = 0.95,
  alternative = "two.sided",
  correct = FALSE,
  more.digits = 0
)
x1 |
Number of successes in first sample |
n1 |
Number of trials in first sample |
x2 |
Number of successes in second sample |
n2 |
Number of trials in second sample |
exact |
If true, performs a test of equality of proportions with exact binomial based confidence intervals. |
null.hypoth |
a number specifying the null hypothesis for the proportion (or difference in proportions if performing a two-sample test). Defaults to 0.5 for one-sample and 0 for two-sample. |
conf.level |
confidence level of the test. Defaults to 0.95 |
alternative |
a string: one of
|
correct |
a logical indicating whether to perform a continuity correction |
more.digits |
a numeric value specifying whether or not to display more or fewer digits in the output. Non-integers are automatically rounded down. |
If either x2 or n2 is specified, then both must be specified, and a two-sample test is run.
A list of class proptesti. The print method lays out the information in an easy-to-read format.
tab |
A formatted table of descriptive and inferential results (total number of observations, sample proportion, standard error of the proportion estimate), along with a confidence interval for the underlying proportion. |
zstat |
the value of the test statistic, if using an approximate test. |
pval |
the p-value for the test |
par |
A vector of information about the type of test (null hypothesis, alternative hypothesis, etc.) |
# Two-sample test
library(rigr)
proptesti(10, 100, 15, 200, alternative = "less")
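The approximate two-sample test from counts can be worked through by hand. The sketch below assumes a pooled-variance z statistic and compares it with stats::prop.test without continuity correction; whether proptesti uses a pooled or unpooled standard error is not stated in this text, so its printed values may differ from this illustration.

```r
# Base-R sketch: approximate two-sample test of proportions from counts.
x1 <- 10; n1 <- 100                        # successes and trials, sample 1
x2 <- 15; n2 <- 200                        # successes and trials, sample 2
p1 <- x1 / n1
p2 <- x2 / n2
p_pool <- (x1 + x2) / (n1 + n2)            # pooled proportion under H0: p1 = p2

z <- (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
p_less <- pnorm(z)                         # alternative "less": p1 < p2

# Reference computation from the stats package (no continuity correction).
ref <- prop.test(c(x1, x2), c(n1, n2), alternative = "less", correct = FALSE)
c(z = z, p_manual = p_less, p_ref = ref$p.value)
```

Here the first sample proportion is larger than the second, so the one-sided p-value for the alternative "less" is above one half.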
Data from a study of 50 men having hormonally treated prostate cancer. Includes information on PSA levels, tumor characteristics, remission status, age, and disease state. More information, including a coding key, is available at http://www.emersonstatistics.com/datasets/PSA.doc.
psa
A data frame with 50 rows and 9 variables:
patient identifier
lowest PSA value attained post therapy (ng/ml)
PSA value prior to therapy (ng/ml)
performance status (0= worst, 100= best)
bone scan score (1= least disease, 3= most)
tumor grade (1= least aggressive, 3= most)
patient's age (years)
time observed in remission (months)
Indicator whether patient still in remission at last follow-up (yes or no)
http://www.emersonstatistics.com/datasets/psa.txt
Produces point estimates, interval estimates, and p-values for an arbitrary functional (mean, geometric mean, proportion, odds, hazard) of a variable of class integer or numeric when regressed on an arbitrary number of covariates. Multiple partial F-tests can be specified using the U function.
regress(
  fnctl,
  formula,
  data,
  intercept = TRUE,
  weights = rep(1, nrow(data.frame(data))),
  subset = rep(TRUE, nrow(data.frame(data))),
  robustSE = TRUE,
  conf.level = 0.95,
  exponentiate = fnctl != "mean",
  replaceZeroes,
  useFdstn = TRUE,
  suppress = FALSE,
  na.action,
  method = "qr",
  qr = TRUE,
  singular.ok = TRUE,
  contrasts = NULL,
  init = NULL,
  ties = "efron",
  offset,
  control = list(...),
  ...
)
fnctl |
a character string indicating
the functional (summary measure of the distribution) for which inference is
desired. Choices include |
formula |
an object of class |
data |
a data frame, matrix, or other data structure with matching
names to those entered in |
intercept |
a logical value indicating whether an intercept exists or not. Default value is TRUE |
weights |
vector indicating optional weights for weighted regression. |
subset |
vector indicating a subset to be used for all inference. |
robustSE |
a logical indicator that standard errors (and confidence intervals) are to be computed using the Huber-White sandwich estimator. The default is TRUE. |
conf.level |
a numeric scalar indicating the level of confidence to be used in computing confidence intervals. The default is 0.95. |
exponentiate |
a logical indicator that the regression parameters should be exponentiated. This is by default true for all functionals except the mean. |
replaceZeroes |
if not
|
useFdstn |
a logical indicator that the F distribution should be used for test statistics instead of the chi squared distribution even in logistic regression models. When using the F distribution, the degrees of freedom are taken to be the sample size minus the number of parameters, as it would be in a linear regression model. |
suppress |
if |
na.action, qr, singular.ok, offset, contrasts, control |
optional arguments that are passed to the functionality of |
method |
the method to be used in fitting the model. The default value for
|
init |
a numeric vector of initial values for the regression parameters for the hazard regression. Default initial value is zero for all variables. |
ties |
a character string describing method for breaking ties in hazard regression.
Only |
... |
additional arguments to be passed to the |
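The robustSE argument in the table above requests Huber-White sandwich standard errors. As a rough illustration of what such an estimator computes, the following base R sketch builds the simplest (HC0) version by hand for a linear model. The simulated data, the variable names, and the choice of HC0 are assumptions made for illustration only; regress may apply a different finite-sample correction, so its numbers need not match this sketch exactly.

```r
# Simulated data whose error variance grows with x (heteroscedastic)
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + 0.3 * x)

fit <- lm(y ~ x)
X <- model.matrix(fit)  # n x p design matrix
e <- residuals(fit)     # ordinary residuals

# Sandwich form: (X'X)^-1  [X' diag(e^2) X]  (X'X)^-1
bread <- solve(crossprod(X))
meat <- crossprod(X * e)  # row i of X scaled by e_i, then cross-multiplied
vcov_hc0 <- bread %*% meat %*% bread

robust_se <- sqrt(diag(vcov_hc0))
model_se <- sqrt(diag(vcov(fit)))  # classical SE, assumes constant variance

cbind(estimate = coef(fit), model_se, robust_se)
```

The point estimates are identical under both approaches; only the standard errors, and therefore the confidence intervals and p values, change.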
Regression models include linear regression (for the “mean” functional), logistic regression with logit link (for the “odds” functional), Poisson regression with log link (for the “rate” functional), linear regression of a log-transformed outcome (for the “geometric mean” functional), and Cox proportional hazards regression (for the hazard functional).
Currently, for the hazard functional, only 'coxph' syntax is supported; in other words, using the 'dummy', 'polynomial', and U functions will result in an error when 'fnctl = "hazard"'.
Note that the only possible link function in 'regress' with 'fnctl = "odds"' is the logit link. Similarly, the only possible link function in 'regress' with 'fnctl = "rate"' is the log link.
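To make the mapping between functionals and model classes concrete, here is a hedged base R sketch of the fits that correspond to four of the functionals described above. It uses lm and glm directly on simulated data rather than regress; the data frame and its column names are hypothetical. The hazard functional is omitted because it requires the survival package.

```r
set.seed(2)
n <- 300
x <- rnorm(n)
dat <- data.frame(
  x = x,
  y = 3 + 0.5 * x + rnorm(n),                     # continuous outcome
  pos = exp(1 + 0.2 * x + rnorm(n, sd = 0.3)),    # strictly positive outcome
  event = rbinom(n, 1, plogis(-0.5 + 0.8 * x)),   # binary outcome
  count = rpois(n, exp(0.3 + 0.4 * x))            # count outcome
)

# "mean": linear regression
fit_mean <- lm(y ~ x, data = dat)
# "geometric mean": linear regression of the log-transformed outcome
fit_geom <- lm(log(pos) ~ x, data = dat)
# "odds": logistic regression with logit link
fit_odds <- glm(event ~ x, family = binomial(link = "logit"), data = dat)
# "rate": Poisson regression with log link
fit_rate <- glm(count ~ x, family = poisson(link = "log"), data = dat)

# Exponentiated slopes are ratios per one-unit difference in x
exp(coef(fit_geom)["x"])  # ratio of geometric means
exp(coef(fit_odds)["x"])  # odds ratio
exp(coef(fit_rate)["x"])  # rate ratio
```

This also shows why exponentiate defaults to TRUE for every functional except the mean: the three non-mean models here are fitted on a log scale, so exponentiated coefficients are the interpretable ratios.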
Objects created using the U function can also be passed in. If the U call involves a partial formula of the form ~ var1 + var2, then regress will return a multiple-partial F-test involving var1 and var2. If an F-statistic will already be calculated regardless of the U specification, then any naming convention specified via name ~ var1 will be ignored. The multiple partial tests must be the last terms specified in the model (i.e. no other predictors can follow them).
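A multiple-partial F-test asks whether a group of coefficients is jointly zero. The sketch below shows the classical (constant-variance) version of that test in base R by comparing nested lm fits, and then reproduces the statistic by hand. The simulated variables x1, x2, and x3 are hypothetical, and because regress uses robust standard errors by default, its reported statistic may differ from this classical one.

```r
set.seed(3)
n <- 150
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 0.5 * dat$x1 + 0.3 * dat$x2 + rnorm(n)

# Full model, and reduced model without the terms being tested (x2, x3)
full <- lm(y ~ x1 + x2 + x3, data = dat)
reduced <- lm(y ~ x1, data = dat)

# Classical F-test of H0: coefficients of x2 and x3 are both zero
tab <- anova(reduced, full)
tab

# The same statistic from residual sums of squares
rss0 <- sum(residuals(reduced)^2)
rss1 <- sum(residuals(full)^2)
q <- df.residual(reduced) - df.residual(full)  # number of tested coefficients
f_stat <- ((rss0 - rss1) / q) / (rss1 / df.residual(full))
p_val <- pf(f_stat, q, df.residual(full), lower.tail = FALSE)
c(F = f_stat, p = p_val)
```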
An object of class uRegress is returned. Parameter estimates, confidence intervals, and p values are contained in a matrix $augCoefficients.
Functions for fitting linear models (lm), and generalized linear models (glm). Also see the function to specify multiple-partial F-tests, U.
# Loading dataset
data(mri)

# Linear regression of atrophy on age
regress("mean", atrophy ~ age, data = mri)

# Linear regression of atrophy on sex and height and their interaction,
# with a multiple-partial F-test on the height-sex interaction
regress("mean", atrophy ~ height + sex + U(hs = ~height:sex), data = mri)

# Logistic regression of sex on atrophy
mri$sex_bin <- ifelse(mri$sex == "Female", 1, 0)
regress("odds", sex_bin ~ atrophy, data = mri)

# Cox regression of survival on age
library(survival)
regress("hazard", Surv(obstime, death) ~ age, data = mri)
uRegress objects
Extracts residuals (unstandardized, standardized, studentized, or jackknife) from uRegress objects.
## S3 method for class 'uRegress'
residuals(object, type = "", ...)
object |
an object of class |
type |
denotes the type of residuals to return. Default value is
|
... |
other arguments |
Relies on functionality from the stats package to return residuals from the uRegress object. "studentized" residuals are computed as internally studentized residuals, while "jackknife" computes the externally studentized residuals.
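The distinction between internally and externally studentized residuals can be illustrated with a plain lm fit. The sketch below computes both by hand from the hat values and compares them with rstandard and rstudent from the stats package. It uses simulated data and does not call the uRegress methods, so it is an illustration of the definitions rather than of the package internals.

```r
set.seed(4)
n <- 40
x <- rnorm(n)
y <- 2 + x + rnorm(n)
fit <- lm(y ~ x)

e <- residuals(fit)       # raw residuals
h <- hatvalues(fit)       # leverages
p <- length(coef(fit))    # number of fitted coefficients
s <- summary(fit)$sigma   # residual SD from the full fit

# Internally studentized: every residual is scaled using the full-fit SD
internal <- e / (s * sqrt(1 - h))

# Externally studentized: residual i is scaled using the SD from the fit
# with observation i left out (closed-form leave-one-out update)
s_loo <- sqrt((s^2 * (n - p) - e^2 / (1 - h)) / (n - p - 1))
external <- e / (s_loo * sqrt(1 - h))

# Both comparisons should agree up to floating-point error
all.equal(internal, rstandard(fit), check.attributes = FALSE)
all.equal(external, rstudent(fit), check.attributes = FALSE)
```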
Returns the type of residuals requested.
# Reading in a dataset
data(mri)

# Create a uRegress object, regressing age on ldl
ldlReg <- regress("mean", age ~ ldl, data = mri)

# Get the studentized residuals
residuals(ldlReg, "studentized")

# Get the jackknifed residuals
residuals(ldlReg, "jackknife")
uRegress objects
Extracts standardized residuals from uRegress objects by relying on functionality from the stats package.
## S3 method for class 'uRegress'
rstandard(model, ...)
model |
an object of class |
... |
other arguments to pass to |
a vector of standardized residuals
uRegress objects
Extracts Studentized residuals from uRegress objects by relying on functionality from the stats package.
## S3 method for class 'uRegress'
rstudent(model, ...)
model |
an object of class |
... |
other arguments to pass to |
a vector of Studentized residuals
Data from a study of 1,597 faculty members at a single US university. Includes information on monthly salary each year from 1976 through 1995, as well as sex, highest degree attained, year of highest degree, field, year hired, rank, and administrative duties. More information, including a coding key, is available at http://www.emersonstatistics.com/datasets/salary.doc.
salary
A data frame with 19792 rows and 11 variables:
case number
identification number for the faculty member
M (male) or F (female)
highest degree attained: PhD, Prof (professional degree, e.g., medicine or law), or Other (Master's or Bachelor's degree)
year highest degree attained
Arts (Arts and Humanities), Prof (professional school, e.g., Business, Law, Engineering or Public Affairs), or Other
year in which the faculty member was hired (2 digits)
year (2 digits)
rank of the faculty member in this year: Assist (Assistant), Assoc (Associate), or Full (Full)
indicator of whether the faculty member had administrative duties (e.g., department chair) in this year: 1 (yes), or 0 (no)
monthly salary of the faculty member in this year in dollars
http://www.emersonstatistics.com/datasets/salary.txt
Performs a one- or two-sample t-test using data. In the two-sample case, the user can specify whether or not observations are matched, and whether or not equal variances should be presumed.
ttest(
  var1,
  var2 = NA,
  by = NA,
  geom = FALSE,
  null.hypoth = 0,
  alternative = "two.sided",
  var.eq = FALSE,
  conf.level = 0.95,
  matched = FALSE,
  more.digits = 0
)
var1 |
a (non-empty) numeric vector of data values. |
var2 |
an optional (non-empty) numeric vector of data. |
by |
a variable of equal length to
that of |
geom |
a logical indicating whether the geometric mean should be calculated and displayed. |
null.hypoth |
a number specifying the null hypothesis for the mean (or difference in means if performing a two-sample test). Defaults to zero. |
alternative |
a string: one of
|
var.eq |
a logical value, either
|
conf.level |
confidence level of the test. Defaults to 0.95. |
matched |
a logical value, either
|
more.digits |
a numeric value specifying whether or not to display more or fewer digits in the output. Non-integers are automatically rounded down. |
Missing values must be given by NA to be recognized as missing values.
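The matched and var.eq options correspond to standard variants of the t-test. As a hedged illustration using base R's t.test on simulated data (the vectors before and after are hypothetical, and this does not call ttest itself), a matched test is equivalent to a one-sample test on the within-pair differences, and the unequal-variance (Welch) test generally has fewer degrees of freedom than the pooled test.

```r
set.seed(5)
before <- rnorm(25, mean = 100, sd = 15)
after <- before - 5 + rnorm(25, sd = 8)

# Matched test: same as a one-sample test on the within-pair differences
paired_fit <- t.test(before, after, paired = TRUE)
diff_fit <- t.test(before - after, mu = 0)

# Unmatched tests: Welch (unequal variances) versus pooled (equal variances)
welch_fit <- t.test(before, after, var.equal = FALSE)
pooled_fit <- t.test(before, after, var.equal = TRUE)

c(paired = unname(paired_fit$statistic),
  differences = unname(diff_fit$statistic))
c(welch_df = unname(welch_fit$parameter),
  pooled_df = unname(pooled_fit$parameter))
```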
a list of class ttest. The print method lays out the information in an easy-to-read format.
tab |
A formatted table of descriptive and inferential statistics (total number of observations, number of missing observations, mean, standard error of the mean estimate, standard deviation), along with a confidence interval for the mean. |
df |
Degrees of freedom for the t-test. |
p |
P-value for the t-test. |
tstat |
Test statistic for the t-test. |
var1 |
The user-supplied first data vector. |
var2 |
The user-supplied second data vector. |
by |
The user-supplied stratification variable. |
par |
A vector of information about the type of test (null hypothesis, alternative hypothesis, etc.) |
geo |
A formatted table of descriptive and inferential statistics for the geometric mean. |
call |
The call made to the |
# Read in data set
data(psa)
attach(psa)

# Perform t-test
ttest(pretxpsa, null.hypoth = 100, alternative = "greater", more.digits = 1)

# Define new binary variable as indicator
# of whether or not bss was worst possible
bssworst <- bss
bssworst[bss == 1] <- 0
bssworst[bss == 2] <- 0
bssworst[bss == 3] <- 1

# Perform t-test allowing for unequal
# variances between strata
ttest(pretxpsa, by = bssworst)

# Perform matched t-test
ttest(pretxpsa, nadirpsa, matched = TRUE, conf.level = 99/100, more.digits = 1)
Performs a one- or two-sample t-test given summary statistics. In the two-sample case, the user can specify whether or not equal variances should be presumed.
ttesti(
  obs,
  mean,
  sd,
  obs2 = NA,
  mean2 = NA,
  sd2 = NA,
  null.hypoth = 0,
  conf.level = 0.95,
  alternative = "two.sided",
  var.eq = FALSE,
  more.digits = 0
)
obs |
number of observations for the first sample. |
mean |
the sample mean of the first sample. |
sd |
the sample standard deviation of the first sample. |
obs2 |
number of observations for the second sample (this is optional). |
mean2 |
if |
sd2 |
if |
null.hypoth |
a number specifying the null hypothesis for the mean (or difference in means if performing a two-sample test). Defaults to zero. |
conf.level |
confidence level of the test. Defaults to 0.95. |
alternative |
a string: one of
|
var.eq |
a logical value, either
|
more.digits |
a numeric value specifying whether or not to display more or fewer digits in the output. Non-integers are automatically rounded down. |
If obs2, mean2, or sd2 is specified, then all three must be specified and a two-sample t-test is run.
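For readers who want to see how a two-sample test can be computed from summary statistics alone, the following sketch implements the unequal-variance (Welch) calculation in base R. The helper welch_from_summary is hypothetical and is not part of rigr; it assumes a two-sided alternative and takes the difference as first sample minus second sample, which may not match the sign convention ttesti uses.

```r
# Hypothetical helper: Welch two-sample t-test from summary statistics
welch_from_summary <- function(n1, m1, s1, n2, m2, s2, null = 0) {
  v1 <- s1^2 / n1                       # squared SE of the first mean
  v2 <- s2^2 / n2                       # squared SE of the second mean
  se <- sqrt(v1 + v2)                   # SE of the difference in means
  tstat <- (m1 - m2 - null) / se
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p <- 2 * pt(abs(tstat), df, lower.tail = FALSE)  # two-sided p value
  list(tstat = tstat, df = df, p = p)
}

# Same summary statistics as the two-sample example in this section
welch_from_summary(10, -1.6, 1.5, 30, -0.7, 2.1)
```

Because only the sample sizes, means, and standard deviations enter the calculation, the result matches what a Welch test on the raw data would give.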
a list of class ttesti. The print method lays out the information in an easy-to-read format.
tab |
A formatted table of descriptive and inferential statistics (number of observations, mean, standard error of the mean estimate, standard deviation), along with a confidence interval for the mean. |
df |
Degrees of freedom for the t-test. |
p |
P-value for the t-test. |
tstat |
Test statistic for the t-test. |
par |
A vector of information about the type of test (null hypothesis, alternative hypothesis, etc.) |
twosamp |
A logical value indicating whether a two-sample test was performed. |
call |
The call made to the |
# t-test given sample descriptives
ttesti(24, 175, 35, null.hypoth = 230)

# two-sample test
ttesti(10, -1.6, 1.5, 30, -.7, 2.1)
Creates a partial formula of the form ~var1 + var2. The partial formula can be named by adding a name and an equals sign before the tilde.
U(...)
... |
partial formula of the form |
A partial formula (potentially named) for use in regress.
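As background, a partial formula is an ordinary one-sided R formula. The base R sketch below shows what such an object looks like and how its terms could be appended to a model formula. It illustrates the concept only and makes no claim about how U is implemented; the variable names are borrowed from the examples in this document and no data are needed to run it.

```r
# A one-sided ("partial") formula has nothing to the left of the tilde
pf <- ~ male + age
class(pf)
length(pf)     # two components: the tilde and the right-hand side
all.vars(pf)   # the variable names it refers to

# Appending the same terms to an existing model formula
base_formula <- atrophy ~ height
full_formula <- update(base_formula, ~ . + male + age)
full_formula
```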
# Reading in a dataset
data(mri)

# Create a named partial formula
U(ma = ~male + age)

# Create an unnamed partial formula
U(~male + age)
Performs Wilcoxon signed rank test or Mann-Whitney-Wilcoxon rank sum test depending on data and logicals entered. Relies heavily on the function wilcox.test. Adds formatting and variances.
wilcoxon(
  var1,
  var2 = NULL,
  alternative = "two.sided",
  null.hypoth = 0,
  paired = FALSE,
  exact = FALSE,
  correct = FALSE,
  conf.int = FALSE,
  conf.level = 0.95
)
var1 |
numeric vector of data values. Non-finite (missing or infinite) values will be omitted. |
var2 |
optional numeric vector of data values. Non-finite (missing or infinite) values will be omitted. |
alternative |
specifies the
alternative hypothesis for the test; acceptable values are
|
null.hypoth |
the value of the null hypothesis. |
paired |
logical indicating whether
the data are paired or not. Default is |
exact |
logical value indicating whether or not an exact test should be computed. |
correct |
logical indicating whether or not a continuity correction should be used and displayed. |
conf.int |
logical indicating whether or not to calculate and display a confidence interval |
conf.level |
confidence level for the interval. Defaults to 0.95. |
In the one-sample case, the returned confidence interval (when conf.int = TRUE) is a confidence interval for the pseudo-median of the underlying distribution. In the two-sample case, the function returns a confidence interval for the median of the difference between samples from the two distributions. See wilcox.test for more information.
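The two location parameters described above have simple sample estimates. The sketch below computes them by hand in base R and compares them with the estimates reported by wilcox.test. It uses simulated data without ties and requests the exact test explicitly; note that wilcoxon defaults to exact = FALSE, in which case the reported estimate comes from a normal approximation and may differ slightly from these values.

```r
set.seed(7)
x <- rnorm(15, mean = 1)
y <- rnorm(12, mean = 0)

# One-sample: the pseudo-median is estimated by the median of the
# Walsh averages (x_i + x_j) / 2 over all pairs with i <= j
walsh <- outer(x, x, "+") / 2
pseudo_median <- median(walsh[upper.tri(walsh, diag = TRUE)])

# Two-sample: the shift is estimated by the median of all pairwise
# differences x_i - y_j
shift <- median(outer(x, y, "-"))

one <- wilcox.test(x, conf.int = TRUE, exact = TRUE)
two <- wilcox.test(x, y, conf.int = TRUE, exact = TRUE)

c(by_hand = pseudo_median, wilcox.test = unname(one$estimate))
c(by_hand = shift, wilcox.test = unname(two$estimate))
```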
A list of class wilcoxon is returned. The print method lays out the information in an easy-to-read format.
statistic |
the value of the test statistic with a name describing it. |
parameter |
the parameter(s) for the exact distribution of the test statistic. |
p.value |
the p-value for the test (calculated for the test statistic). |
null.value |
the
parameter |
alternative |
character string describing the alternative hypothesis. |
method |
the type of test applied. |
data.name |
a character string giving the names of the data. |
conf.int |
a confidence interval for the location parameter (only
present if the argument |
estimate |
an estimate
of the location parameter (only present if the argument
|
table |
a formatted table of rank sum and number of observation values, for printing. |
vars |
a formatted table of variances, for printing. |
hyps |
a formatted table of the hypotheses, for printing. |
inf |
a formatted table of inference values, for printing. |
#- Create the data -#
cf <- c(1153, 1132, 1165, 1460, 1162, 1493, 1358, 1453, 1185, 1824, 1793, 1930, 2075)
healthy <- c(996, 1080, 1182, 1452, 1634, 1619, 1140, 1123, 1113, 1463, 1632, 1614, 1836)

#- Perform the test -#
wilcoxon(cf, healthy, paired = TRUE)

#- Perform the test -#
wilcoxon(cf, healthy, conf.int = TRUE)