| Title: | Process Improvement using Data |
|---|---|
| Description: | A collection of scripts and data files for the statistics text: "Process Improvement using Data" <https://learnche.org/pid/> and the online course "Experimentation for Improvement" found on Coursera. The package contains code for designed experiments, data sets and other convenience functions used in the book. |
| Authors: | Kevin Dunn [aut, cre] |
| Maintainer: | Kevin Dunn <[email protected]> |
| License: | BSD_2_clause + file LICENSE |
| Version: | 0.65 |
| Built: | 2026-05-11 06:42:09 UTC |
| Source: | https://github.com/kgdunn/r-package-pid-process-improvement-using-data |
This package contains functions and datasets that complement the book Process Improvement using Data, https://learnche.org/pid/.
The functions and datasets are also used in the massive open online course (MOOC) called "Experimentation for Improvement", hosted on Coursera, https://yint.org/experiments.
The functions are most useful for design of experiments (DOE), and additional functions and datasets will be added as the book and the MOOC evolve over time.
The package is still subject to development, in terms of datasets and functionality, at least until version numbers exceed 1.0. Version numbers below 1.0 indicate that functionality may be added, removed or changed over time.
Please contact me, if you have suggestions.
Other packages that you can use immediately for experimental design are: FrF2 for fractional factorial experiments with 2-level factors and the base package for Designed Experiments, called DoE.base.
Kevin Dunn
Maintainer: Kevin Dunn <[email protected]>
Box G. E. P, Hunter, W. C. and Hunter, J. S. (2005) Statistics for Experimenters, 2nd edition. New York: Wiley.
Related packages: DoE.base, BsMD, FrF2
# 2-factor example T <- c(-1, +1, -1, +1) # centered and scaled temperature S <- c(-1, -1, +1, +1) # centered and scaled speed variable y <- c(69, 60, 64, 53) # conversion, is our response variable, y doe.model <- lm(y ~ T + S + T * S) # create a model with main effects, and interaction paretoPlot(doe.model) # 3-factor example data(pollutant) mod.full <- lm(y ~ C*T*S, data=pollutant) paretoPlot(mod.full)# 2-factor example T <- c(-1, +1, -1, +1) # centered and scaled temperature S <- c(-1, -1, +1, +1) # centered and scaled speed variable y <- c(69, 60, 64, 53) # conversion, is our response variable, y doe.model <- lm(y ~ T + S + T * S) # create a model with main effects, and interaction paretoPlot(doe.model) # 3-factor example data(pollutant) mod.full <- lm(y ~ C*T*S, data=pollutant) paretoPlot(mod.full)
The data are from boiling water in a pot under various conditions. The response variable,
y, is the time taken, in minutes to reach 90 degrees Celsius. Accurately measuring
the time to actual boiling is hard, hence the 90 degrees Celsius point is used instead.
Three factors are varied in a full factorial manner (the first 8 observations). The data are in standard order, however the actual experiments were run in random order. The last 3 rows are runs close to, or interior to the factorial.
Factors varied were:
A = Amount of water: low level was 500 mL, and high level was 600 mL
B = Lid off (low level) or lid on (high level)
C = Size of pot used: low level was 2 L, and high level was 3 L.
Coded values for A, B, and C, should be used in the linear
regression model analysis, with -1 representing the low value and +1 the
high value.
data(boilingpot)data(boilingpot)
A data.frame containing 11 observations of 4 variables (A, B, C, with y as a response variable.)
A MOOC on Design of Experiments: “Experimentation for Improvement”, https://learnche.org
Creates a two-dimensional contour plot of a factorial design model (linear least squares model). The two specified factors are plotted on the horizontal and vertical axes. If the model has more than two factors, then the remaining factors are set at their zero level. It is assumed the model is in coded units.
contourPlot(lsmodel, xlab=attr(lsmodel$terms,'term.labels')[1], ylab=attr(lsmodel$terms,'term.labels')[2], main="Contour plot", N=25, xlim=c(-3.2, 3.2), ylim=c(-3.2, 3.2), colour.function=terrain.colors)contourPlot(lsmodel, xlab=attr(lsmodel$terms,'term.labels')[1], ylab=attr(lsmodel$terms,'term.labels')[2], main="Contour plot", N=25, xlim=c(-3.2, 3.2), ylim=c(-3.2, 3.2), colour.function=terrain.colors)
lsmodel |
a linear model object (least squares model) created by the |
xlab |
the label and variable used for horizontal axis in the bar plot, and must also match a variable in the |
ylab |
the label and variable used for vertical axis in the bar plot, and must also match a variable in the |
main |
label for the plot. |
N |
the resolution of the plot; higher values have greater resolution. |
xlim |
the limits of the horizontal axis (in coded units); the defaults indicate the model's effect outside the standard factorial range. |
ylim |
the limits of the vertical axis (in coded units); the defaults indicate the model's effect outside the standard factorial range. |
colour.function |
the function used to colour-code the contour plot; see |
Typical usage is to create a generic linear model with the lm(...) command, and supply that as the input to this function.
For example, a general design of experiments with 4 factors: A, B, C, and D can be built usinglsmodel <- lm(y ~ A*B*C*D), and then various 2-D contour plots visualized withcontourPlot(lsmodel, "A", "B"), contourPlot(lsmodel, "B", "C"), or whichever combination of factors
are desired. The variables not being plotted are set at their zero (0) value.
Future versions of this function will allow specifying the inactive variable levels. In the interim, please see the rsm
package (the rsm::contour(...) function) if this functionality is required.
Returns a ggplot2 object, which, by default, is shown before being returned by this function.
The ggplot2 object may be further manipulated, if desired.
Kevin Dunn, <[email protected]>
Please see Chapter 5 of the following book: Kevin Dunn, 2010 to 2026, Process Improvement using Data, https://learnche.org/pid/
# 2-factor example # ---------------- T <- c(-1, +1, -1, +1) # centered and scaled temperature S <- c(-1, -1, +1, +1) # centered and scaled speed variable y <- c(69, 60, 64, 53) # conversion, is our response variable, y doe.model <- lm(y ~ T + S + T * S) # model with main effects, and interaction contourPlot(doe.model) # 3-factor example # ---------------- data(pollutant) mod.full <- lm(y ~ C*T*S, data=pollutant) contourPlot(mod.full, N=50) # high resolution plot contourPlot(mod.full, xlab='C', ylab='S', main="Strong C:S interaction", colour.function=rainbow) # Central composite design P <- c(-1, +1, -1, +1, 0, -1.41, 0, +1.41) T <- c(-1, -1, +1, +1, +1.41, 0, -1.41, 0) y <- c(715, 713, 733, 725, 738, 717, 721, 710) mod.CCD <- lm(y ~ P*T + I(P^2) + I(T^2)) contourPlot(mod.CCD, 'P', 'T', xlim=c(-2.2, 2.2), ylim=c(-3,2))# 2-factor example # ---------------- T <- c(-1, +1, -1, +1) # centered and scaled temperature S <- c(-1, -1, +1, +1) # centered and scaled speed variable y <- c(69, 60, 64, 53) # conversion, is our response variable, y doe.model <- lm(y ~ T + S + T * S) # model with main effects, and interaction contourPlot(doe.model) # 3-factor example # ---------------- data(pollutant) mod.full <- lm(y ~ C*T*S, data=pollutant) contourPlot(mod.full, N=50) # high resolution plot contourPlot(mod.full, xlab='C', ylab='S', main="Strong C:S interaction", colour.function=rainbow) # Central composite design P <- c(-1, +1, -1, +1, 0, -1.41, 0, +1.41) T <- c(-1, -1, +1, +1, +1.41, 0, -1.41, 0) y <- c(715, 713, 733, 725, 738, 717, 721, 710) mod.CCD <- lm(y ~ P*T + I(P^2) + I(T^2)) contourPlot(mod.CCD, 'P', 'T', xlim=c(-2.2, 2.2), ylim=c(-3,2))
These are actual data, taken 1 minute apart in time, of the flow rate leaving the top of a continuous distillation column (data are from a 31 day period in time).
data(distillateflow)data(distillateflow)
A data.frame containing 44,640 observations of 1 variable
Kevin Dunn, https://openmv.net
A full factorial experiment with four factors run by a golf enthusiast. The objective of the experiments was for the golfer to maximize her driving distance at a specific tee off location on her local golf course. The golfer considered the following factors:
T = Tee height (cm)
H = Holes the golf ball had been played for prior to the experimental tee shot
C = Club type
D = Time of day (on the 24 hour clock)
The data are in standard order, however the actual experiments were run in random order.
Coded values for A, B, C, and D should be used in the linear
regression model analysis, with -1 representing the low value and +1 the
high value.
data(golf)data(golf)
A data.frame containing 16 observations of 4 variables (A, B, C, D, with y as a response variable.)
A MOOC on Design of Experiments: “Experimentation for Improvement”, https://learnche.org
The hourly profit made when selling the product at price P and the product is displayed at height H [cm] on the shelf.
grocery(P=3.46, H=150)grocery(P=3.46, H=150)
P |
the selling price of the product. The default amount, if unspecified, is €3.46. |
H |
the display height of the product, as measured from the ground up. The default value, if unspecified, is 150cm. |
This function simulates the hourly profit in a grocery store of selling a particular product.
Two factors can be adjusted by the user to determine the optimum:
P: The selling price of the product (must be a positive value).
H: The height that the product is displayed at (must be a positive value).
Can you determine the best combination of conditions, using a systematic method (i.e. not by trial and error)?
Returns the hourly profit made. Random noise is added for some realism.
Kevin Dunn, <[email protected]>
Please see Chapter 5 of the following book: Kevin Dunn, 2010 to 2026, Process Improvement using Data, https://learnche.org/pid/
# Selling at the default settings of price (€3.46) # and shelf height (150cm): grocery() # Let's try selling for a low price, €2.75, # and put the product 120 cm off the ground grocery(P=2.75, H=120) # What happens if the product is too high # for the average person? grocery(P=1.25, H=200) # Can you find the optimum combination of settings to # maximize the profit, but using only a few experiments?# Selling at the default settings of price (€3.46) # and shelf height (150cm): grocery() # Let's try selling for a low price, €2.75, # and put the product 120 cm off the ground grocery(P=2.75, H=120) # What happens if the product is too high # for the average person? grocery(P=1.25, H=200) # Can you find the optimum combination of settings to # maximize the profit, but using only a few experiments?
The hourly profit made when selling the product at price P and the product is produced
at a throughput rate T [parts per hour].
manufacture(P=0.75, T=325)manufacture(P=0.75, T=325)
P |
the selling price of the product. The default amount, if unspecified, is $0.75. |
T |
the production rate (throughput), measured in parts per hour. The default value, if unspecified, is 325 parts per hour. |
This function simulates the hourly profit in a manufacturing facility.
Two factors can be adjusted by the user to determine the optimum:
P: The selling price of the product (must be a positive value).
T: The production rate, also called throughput (must be a positive value).
Can you determine the best combination of conditions, using a systematic method (i.e. not by trial and error)? More defects are created when production rates are too high.
Returns the hourly profit made. Random noise is added for some realism.
Kevin Dunn, <[email protected]>
Please see Chapter 5 of the following book: Kevin Dunn, 2010 to 2026, Process Improvement using Data, https://learnche.org/pid/
# Producing at the default settings of price ($0.75) # and throughput of 325 parts per hour: manufacture() # Let's try selling for a higher price, $1.05, # and a slower throughput of 298 parts per hour: manufacture(P=1.05, T=298) # What happens if the product is sold too cheaply # at high production rates? manufacture(P=0.52, T=417) # Can you find the optimum combination of settings to # maximize the profit, but using only a few experiments?# Producing at the default settings of price ($0.75) # and throughput of 325 parts per hour: manufacture() # Let's try selling for a higher price, $1.05, # and a slower throughput of 298 parts per hour: manufacture(P=1.05, T=298) # What happens if the product is sold too cheaply # at high production rates? manufacture(P=0.52, T=417) # Can you find the optimum combination of settings to # maximize the profit, but using only a few experiments?
Four materials: A, B, C and D are added in a blend to achieve a desired heat capacity, the response variable, y.
The amounts were varied in a factorial manner for the 4 materials.
The data are scaled and coded for confidentiality. All that may be disclosed is that variable C is either added ("Yes") or not added not added ("No").
data(oildoe)data(oildoe)
A data.frame containing 19 observations of 5 variables (A, B, C, D, and the response, y)
Kevin Dunn, data from a confidential industrial source.
Creates a plot that shows the model coefficients from the least squares model as bars. The absolute value of the coefficients are taken. The coefficient's sign is represented by the bar colour: grey for negative and black for positive coefficients.
paretoPlot(lsmodel, xlab="Effect name", ylab="Magnitude of effect", main="Pareto plot", legendtitle="Sign of coefficients", negative=c("Negative", "grey"), positive=c("Positive", "black"))paretoPlot(lsmodel, xlab="Effect name", ylab="Magnitude of effect", main="Pareto plot", legendtitle="Sign of coefficients", negative=c("Negative", "grey"), positive=c("Positive", "black"))
lsmodel |
a linear model object (least squares model) created by the |
xlab |
label for horizontal axis in the bar plot. |
ylab |
label for vertical axis in the bar plot. |
main |
label for the plot. |
legendtitle |
an alternative to the default legend title. |
negative |
a two-element vector, both entries must be strings, the first of which provides the name for negative bars in the legend, and the second tells which colour to plot negative bars with. The default is: c("Negative", "grey"). |
positive |
a two-element vector, both entries must be strings, the first of which provides the name for positive bars in the legend, and the second tells which colour to plot those bars with. |
For example, a general design of experiments with 4 factors: A, B, C, and D can be built using lsmodel <- lm(y ~ A*B*C*D), and then the 2^4=16 coefficients visualized with this function using paretoPlot(lsmodel). Since the largest magnitude coefficients are mostly of interest, the coefficient bars are shown sorted from largest to smallest absolute magnitude. The sign information is retained though with the bar's colour: grey for negative and black for positive coefficients.
The coefficients are the exact coefficients from the linear model. When the linear model is in coded units (i.e. -1 for the low level, and +1 for the high level), then the coefficients represent the change in the experiment's response variable (y), for a 2-unit (not 1-unit) change in the input variable. This interpretation of course assumes the factors are independent, which is the case in a full factorial.
Please see the reference for more details.
Returns a ggplot2 object, which, by default, is shown before being returned by this function. The ggplot2 object may be further manipulated, if desired.
Kevin Dunn, <[email protected]>
A Pareto plot is similar in spirit to the Lenth plot, other than for the fact that absolute coefficients are shown. Please see Chapter 5 of the following book: Kevin Dunn, 2010 to 2026, Process Improvement using Data, https://learnche.org/pid/
# 2-factor example T <- c(-1, +1, -1, +1) # centered and scaled temperature S <- c(-1, -1, +1, +1) # centered and scaled speed variable y <- c(69, 60, 64, 53) # conversion, is our response variable, y doe.model <- lm(y ~ T + S + T * S) # create a model with main effects, and interaction paretoPlot(doe.model) # 3-factor example data(pollutant) mod.full <- lm(y ~ C*T*S, data=pollutant) paretoPlot(mod.full)# 2-factor example T <- c(-1, +1, -1, +1) # centered and scaled temperature S <- c(-1, -1, +1, +1) # centered and scaled speed variable y <- c(69, 60, 64, 53) # conversion, is our response variable, y doe.model <- lm(y ~ T + S + T * S) # create a model with main effects, and interaction paretoPlot(doe.model) # 3-factor example data(pollutant) mod.full <- lm(y ~ C*T*S, data=pollutant) paretoPlot(mod.full)
The data are from the first 8 rows of the pollutant water treatment example in the book by Box, Hunter and Hunter, 2nd edition, Chapter 5, Question 19.
The 3 factors (C, T, and S) are in coded units where: C: -1 is chemical brand A; +1 is chemical brand B T: -1 is 72F for treatment temperature; +1 is 100F for the temperature S: -1 is No stirring; +1 is with fast stirring
The outcome variable is: y: the pollutant amount in the discharge [lb/day].
The aim is to find treatment conditions that MINIMIZE the amount of pollutant discharged each day, where the limit is 10 lb/day.
data(pollutant)data(pollutant)
A data.frame containing 8 observations of 4 variables (C, S, T and y)
Box, Hunter and Hunter, 2nd edition, Chapter 5, Question 19, page 232.
Box, G. E. P. and Hunter, J. S. and Hunter, W. G., 2005, Statistics for Experimenters, Second edition. Wiley.
A fixed number of popcorn kernels are cooked are heated at the same temperature. This simulation returns the number of
kernels that are edible when the pot is left on the stove for a given number of T seconds.
popcorn(T=120)popcorn(T=120)
T |
the number of seconds that the pot is left on the stove. The default amount, if not provided, is 120 seconds. |
This function simulates the number of edible popcorn kernels that are made when cooking a fixed number of kernels at the same heat setting
for a given amount of T seconds.
Time durations less than 77 seconds are not supported. A vector (list) of time values is not permitted, since the goal is to perform sequential
experimentation to determine the optimum time, with the fewest number of function calls.
Returns the number of edible popcorn kernels. Random noise is added for some realism.
Kevin Dunn, <[email protected]>
Please see Chapter 5 of the following book: Kevin Dunn, 2010 to 2026, Process Improvement using Data, https://learnche.org/pid/
# Cooking for a very short duration is not supported. # For example, popcorn(T=50) will fail # Cooking from 77 seconds onwards is supported popcorn(T=120) # What happens if we leave the pot on the stove for too long? popcorn(T=500) # Can you find the optimum time to cook on the stove # using the fewest number of function calls?# Cooking for a very short duration is not supported. # For example, popcorn(T=50) will fail # Cooking from 77 seconds onwards is supported popcorn(T=120) # What happens if we leave the pot on the stove for too long? popcorn(T=500) # Can you find the optimum time to cook on the stove # using the fewest number of function calls?
The data are from a solar panel simulation case study.
The original source that Box, Hunter and Hunter used is doi:10.1016/0038-092X(67)90051-5
A theoretical model for a commercial system was made. A 2^4 factorial design was used (center point is not included in this dataset).
The factors are dimensionless groups (https://en.wikipedia.org/wiki/Dimensionless_quantity)
related to
A: total daily insolation,
B: the tank capacity,
C: the water flow through the absorber,
D: solar intermittency coming in.
The responses variables are
y1: collection efficiency, and
y2: the energy delivery efficiency.
The coded values for A, B, C, and D are given, with -1 representing
the low value and +1 the high value.
data(solar)data(solar)
A data.frame containing 16 observations of 6 variables (A, B, C, D, with y1 and y2 as responses.)
Box, Hunter and Hunter, 2nd edition, Chapter 5, page 230.
Box, G. E. P. and Hunter, J. S. and Hunter, W. G., 2005, Statistics for Experimenters, Second edition. Wiley."
Fractional factorial designs are a trade-off of information learned vs. the amount of work and cost (experiments) invested. This function shows which factors are confounded (confused; aliased) with each other when running a fractional factorial. The output shows the design's resolution and how to generate the fractional factorial.
tradeoff(runs=8, factors=7, display=TRUE)tradeoff(runs=8, factors=7, display=TRUE)
runs |
the number of (unreplicated) experimental runs in the fractional factorial design. Must be a power of 2 that is 8 or higher (e.g. 8, 16, 32, 64, ...) |
factors |
the number of factors being investigated in the fractional factorial design. |
display |
will by default print the results to the screen and also returns a list containing the same information; setting this to |
Fractional factorial designs require a sacrifice in the clarity of the information learned, at the benefit of performing fewer
experimental runs; thereby decreasing cost and time to run the full set of factorial experiments. See tradeOffTable for a
visual display of this concept.
For example, when running 8 experiments with 7 factors, the design resolution (clarity of estimated effects) is going to be low. This function's output indicates how the factors are aliased (confounded) with each other, so you can evaluate the use of the design before actually performing it.
The function provides 3 main sources of information: the resolution, the generators and the aliasing structure. Each one of these is described in the reference.
Aliasing is only reported to the level of the main effects and two-factor interaction (2fi, fi2). Higher level interactions are of-course
present in many fractional factorial designs, and may be calculated from the defining relationship. Future versions of this function will
return the defining relationship to assist with this.
The function currently returns a list with 3 entries: the resolution, the generator(s) and the aliasing structure.
Kevin Dunn, <[email protected]>
Please see Chapter 5 of the following book:
Kevin Dunn, 2010 to 2026, Process Improvement using Data, https://learnche.org/pid/
Please see this paper to gain an understanding of how these trade-off tables are constructed:
Arthur Fries and William G. Hunter, (1980) Minimum Aberration Designs, Technometrics, 22(4), pp. 601-608, https://www.jstor.org/stable/1268198
tradeOffTable for a visual representation of this information.
# Running 8 experiments? What are the trade-offs with 4, 5, 6, or 7 factors? ## Not run: tradeoff(runs=8, factors=4) ## Not run: tradeoff(runs=8, factors=5) ## Not run: tradeoff(runs=8, factors=6) ## Not run: tradeoff(runs=8, factors=7) # Running 16 experiments? What are the trade-offs ? ## Not run: tradeoff(runs=16, factors=5) ## Not run: tradeoff(runs=16, factors=6) ## Not run: tradeoff(runs=16, factors=7) ## Not run: tradeoff(runs=16, factors=8) ## Not run: tradeoff(runs=16, factors=9)# Running 8 experiments? What are the trade-offs with 4, 5, 6, or 7 factors? ## Not run: tradeoff(runs=8, factors=4) ## Not run: tradeoff(runs=8, factors=5) ## Not run: tradeoff(runs=8, factors=6) ## Not run: tradeoff(runs=8, factors=7) # Running 16 experiments? What are the trade-offs ? ## Not run: tradeoff(runs=16, factors=5) ## Not run: tradeoff(runs=16, factors=6) ## Not run: tradeoff(runs=16, factors=7) ## Not run: tradeoff(runs=16, factors=8) ## Not run: tradeoff(runs=16, factors=9)
Creates a new plot that shows a trade-off table for fractional factorial designs.
tradeOffTable()tradeOffTable()
Displays the following trade-off table:
The rows in the table are the number of experiments done in the fractional factorial ().
The columns are the number of factors under investigation in the design ().
The cell at a particular row/column intersection gives several pieces of information:
The top-left entry of the form: . For example, corresponds to half-fractions,
and corresponds to quarter-fractions.
The subscript in the top-left entry, written in Roman numerals gives the design resolution. For example,
corresponds to a resolution 4 design, implying 2-factor interactions are at most confounded with other 2-factor interactions.
The bold entries in the bottom-right tell how to generate the remaining factors in the design.
A "full" entry indicates a full factorial; while "twice" indicates a twice replicated full factorial.
Blank entries are impossible fractional factorial combinations.
A detailed explanation of the table is provided in the book reference.
Certain blocks are not unique. For example, a resolution IV design (with 32 runs and 8 factors) is shown as having +/-F = ABC, +/-G=ABD and +/-H=ACDE. But another option is +/-H=BCDE, which you might see in other software, or tables in textbooks.
Kevin Dunn, <[email protected]>
Chapter 5 of the following book: Kevin Dunn, 2010 to 2026, Process Improvement using Data, https://learnche.org/pid/
Please see this paper to gain an understanding of how these trade-off tables are constructed:
Arthur Fries and William G. Hunter, (1980) Minimum Aberration Designs, Technometrics, 22(4), pp. 601-608, https://www.jstor.org/stable/1268198
tradeoff which can be used to extend the table out to more factors or more experiments.
tradeOffTable()tradeOffTable()