====================================================================== Microsoft(R) Product Support Services Application Note (Text File) XE0124: REGRESSION ANALYSIS AND BEST FIT LINES ====================================================================== Revision Date: 8/93 No Disk Included The information in this Application Note applies to Microsoft Excel versions 3.0 and 4.0 for the Macintosh(R), and Microsoft Excel versions 3.0, 4.0 and 4.0a for Windows. ---------------------------------------------------------------------- | INFORMATION PROVIDED IN THIS DOCUMENT AND ANY SOFTWARE THAT MAY | | ACCOMPANY THIS DOCUMENT (collectively referred to as an Application | | Note) IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER | | EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED | | WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR | | PURPOSE. The user assumes the entire risk as to the accuracy and | | the use of this Application Note. This Application Note may be | | copied and distributed subject to the following conditions: 1) All | | text must be copied without modification and all pages must be | | included; 2) If software is included, all files on the disk(s) | | must be copied without modification (the MS-DOS(R) utility | | diskcopy is appropriate for this purpose); 3) All components of | | this Application Note must be distributed together; and 4) This | | Application Note may not be distributed for profit. | | | | Copyright (C) 1992-1993 Microsoft Corporation. All Rights Reserved.| | Microsoft and MS-DOS are registered trademarks and Windows is a | | trademark of Microsoft Corporation. | | Macintosh is a registered trademark of Apple Computer, Inc. | | --------------------------------------------------------------------| OVERVIEW ======== This Application Note discusses how to use Microsoft Excel functions to perform simple, multiple, and polynomial regression analysis. It contains examples of how to use LINEST(), LOGEST(), TREND(), and GROWTH() to describe a best fit line or curve and to make predictions about your data. It also outlines some of the new statistical functions and tools available with Microsoft Excel version 4.0. GENERAL INFORMATION =================== Regression is a statistical method used to predict values based on relationships in existing data. By analyzing how a single dependent variable (y) is affected by the values of one or more independent variables (x), you can predict what y will be given x. You can use this information to fit a line or a curve to your existing data and to forecast future values. The LINEST(), TREND(), LOGEST(), and GROWTH() functions are the primary functions you will use to perform regression analysis in Microsoft Excel versions 3.0 and 4.0. While this Application Note focuses primarily on the functions that can be used in Microsoft Excel versions 3.0 and 4.0, Microsoft Excel version 4.0 offers several new functions and tools that you can use to perform regression analysis and to create best fit lines. When one of these new functions can be used to perform a task described in this Application Note, the function will be noted in the appropriate section. Following are some of these new functions: To do this Use this function ------------------------------------------------------------ Return the correlation coefficient CORREL() for two arrays of cells Return a single predicted y value FORECAST() based on a linear regression of known x and y ranges Return the y intercept of the INTERCEPT() linear regression line Calculate R2, the coefficient of RSQ() determination Return the slope of the linear SLOPE() regression line Return the standard error of the STEYX() regression In addition, the Analysis ToolPak add-in macro provides a special set of analysis tools, including tools to accomplish the following: To do this Use this analysis tool ----------------------------------------------------------------- Predict a value based on the forecast Exponential Smoothing for the prior period, adjusted for the error in that prior forecast Project values in the forecast period Moving Average based on the average value of the variable over a specific number of preceding periods Perform linear regression analysis and Regression return statistics and plots as specified CHOOSING THE BEST FUNCTION =========================== Whether you are performing simple regression (one x variable), multiple regression (two or more x variables), or polynomial regression (one x variable raised to different powers), you will get the most accurate results if the function that you choose to regress your data is based on the patterns in your existing data. When Your Data Is Linear ------------------------ If the rate of change in your data is even to such an extent that when you plot it in a chart the pattern in your data points resembles a line, use the linear regression functions, LINEST() and TREND(). Both functions use the "least squares" method to calculate a straight line that best fits your data. LINEST() returns information about the line, such as its slope and y intercept, and TREND() returns predicted values along the line. In Microsoft Excel version 4.0, the Regression tool (from the Options menu, choose Analysis Tools) performs linear regression, returns regression statistics, calculates best fit lines, and creates best fit line charts. For additional information, see pages 41-45 in "User's Guide 2." When Your Data Is Exponential ----------------------------- If the rate of change in your data is exponential, such that when you plot it in a chart, the pattern resembles a curve that rises or falls at an increasingly higher rate, use the logarithmic regression functions LOGEST() and GROWTH(). LOGEST() calculates an exponential curve that best fits your data and, like LINEST(), returns information about the curve. Like TREND(), GROWTH() returns predicted values along the curve. When Your Data Is Curvilinear ----------------------------- To most accurately predict values when the pattern in your data is neither linear nor exponential, use polynomial regression in conjunction with the TREND() function to calculate a best fit curve. For example, use this method if, when you plot your data in a chart, it resembles a curve for which the rate of change is not dramatic or if your data fluctuates in such a way that no linear or curved pattern can be identified. SIMPLE REGRESSION ANALYSIS ========================== Your regression analysis is "simple" if you have only one independent x variable for each dependent y variable. For example, assume you are analyzing the sales figures for the first six months of operation for WidgeMaker, Inc., a company that specializes in the design and manufacture of Widgets. NOTE: The following examples primarily use the LINEST() and TREND() functions. Wherever these two functions are discussed, LOGEST() and GROWTH() can be substituted if your data is exponentially curved and a curve fit would be more accurate than a straight line. In the following sample data, the values in the Month column are the independent x variables and the values in the Sales column are the y variables. Based on this data, you can describe, calculate, and plot a best fit line, and then predict future sales figures. Because the data is linear, you will use the LINEST() and TREND() functions to perform the regression analysis. The Regression tool in Microsoft Excel version 4.0 performs each of these tasks automatically. For additional information on using this tool and specifying the options you want, see pages 41-45 of "User's Guide 2." Because this tool performs linear regression, if your data resembles an exponential curve, use LOGEST() and GROWTH(). Following are the sales figures for WidgeMaker and the corresponding months in both table and chart form: | A | B -------------------- 1 | Month | Sales -------------------- 2 1 $4,200 3 2 $6,100 4 3 $7,300 5 4 $7,300 6 5 $8,700 7 6 $10,500 NOTE: The sales figures are formatted using the built-in currency format with no decimal places. To change your number format, choose Number from the Format menu. To create the chart as a separate document: 1. Select cells A1:B7. 2. From the File menu, choose New. Select Chart and choose OK. 3. In the New Chart dialog box, select X-Values For XY-Chart. Choose OK. 4 From the Gallery menu, choose XY (Scatter). Select chart type number 2 and choose OK. 5. Select the x-axis and choose Scale from the Format menu. Change the Minimum value to 1 and choose OK. 6. From the Chart menu, choose Attach Text. Select Chart Title and choose OK. In the Formula bar, type: WidgeMaker, Inc. 7. With the title selected, choose Patterns from the Format menu. Under Border, choose Automatic and select the Shadow check box. Choose OK. 8. From the Chart menu, choose Attach text. Select Value (Y) Axis and choose OK. In the Formula bar, type Sales. 9. From the Chart menu, choose Attach Text. Select Category (X) Axis and choose OK. In the Formula bar, type Month and press ENTER (press RETURN if you are using Microsoft Excel for the Macintosh). NOTE: In Microsoft Excel version 4.0, an easier way to create charts is to use the ChartWizard. Select cells A1:B7 and choose the ChartWizard button. Follow the steps, selecting the formatting options you want. Once the chart is finished, to add additional formatting, double-click the chart to open it in its own chart window. DESCRIBING A BEST FIT LINE ========================== The equation of a straight line is y=x+, where is the slope and is the y intercept. LINEST() returns the m and b values that describe the line derived from your existing data. Microsoft Excel version 4.0 provides individual functions for calculating the slope and the y intercept when your data is linear: SLOPE() and INTERCEPT(). For additional information, see pages 244-245 and 405-406 of the "Function Reference" version 4.0. NOTE: If your data is exponentially curved, use LOGEST() to return the and values that describe the curve. The equation used by LOGEST() is y=*^x. Finding the Slope and the Y Intercept ------------------------------------- To calculate the values of the slope and y intercept , do the following: 1. Select cells E2:F2. 2. Type the following formula: =LINEST(B2:B7,A2:A7) 3. Enter the formula as an array by pressing CTRL+SHIFT+ENTER in Microsoft Excel for Windows or COMMAND+RETURN in Microsoft Excel for the Macintosh. The first argument in the LINEST() function is the array containing the known y values (which in this example are the Sales numbers). The second argument is the array containing the known x values (in this case, the Month numbers). NOTE: LINEST() also takes other arguments but these aren't necessary to this example. The result 1122.857, in E2, is the slope and the result 3420, in cell F2, is where the line crosses the y-axis. | E | F -------------------------- 1 | Slope | Y intercept -------------------------- 2 1122.857 3420 CALCULATING A BEST FIT LINE =========================== If your data is linear, use TREND() or LINEST() to calculate your best fit line. In Microsoft Excel version 4.0, you can also use the FORECAST() function. If your data is an exponential curve, use LOGEST() or GROWTH(). Using TREND() ------------- The TREND() function is the easiest and most efficient function for calculating the points along a best fit line. To simultaneously calculate all the values on the best fit line, do the following: 1. Select cells C2:C7 and type the following formula: =TREND(B2:B7,A2:A7) 2. Enter the formula as an array by pressing CTRL+SHIFT+ENTER in Microsoft Excel for Windows or COMMAND+RETURN in Microsoft Excel for the Macintosh. The first argument in the TREND() function is the array containing the known y values and the second argument is the array containing the known x values. NOTE: TREND() also takes other arguments but these aren't necessary to this example. Using LINEST() -------------- You can also use the slope and the y intercept values returned by LINEST() to find data points on the best fit line by substituting the y values or the x values into y=mx+b, the equation for a line. By plugging each month number into this formula, you can calculate all the data points for your best fit line. With the m value (1122.857) in cell E2 and the b value (3420) in cell F2, do the following to generate the points on your best fit line: 1. Select cell D2 and enter the following formula (it is not necessary to enter this formula as an array): =($E$2*A2)+$F$2 2. Select cells D2:D7. 3. From the Edit menu, choose Fill Down. The values returned are the y values for your best fit line. The following table shows the results of the values returned when you use TREND() and LINEST(): | A | B | C | D | E | F ---------------------------------------------------------------------- 1 | Month | Sales | Predicted Y | Predicted Y | Slope | Y Intercept | | | TREND() | LINEST() | | ---------------------------------------------------------------------- 2 1 $4,200 $4,543 $4,543 1122.857 3420 3 2 $6,100 $5,666 $5,666 4 3 $7,300 $6,789 $6,789 5 4 $7,300 $7,911 $7,911 6 5 $8,700 $9,031 $9,031 7 6 $10,500 $10,157 $10,157 NOTE: The returned values for TREND() and LINEST() are identical. PLOTTING THE BEST FIT LINE =========================== Once you've calculated the values on your best fit line, you can add that line to your existing chart: 1. Select cells C1:C7. From the Edit menu, choose Copy. 2. Activate the chart document by selecting it from the Window menu. 3. From the Edit menu, choose Paste. 4. If you want different data point markers on your best fit line, select the line and choose Pattern from the Format menu. Under Marker, select Custom and choose the data point marker and color you want. Choose OK. NOTE: In Microsoft Excel 4.0, you can use the ChartWizard to easily update your chart with the new trend data. Activate your chart by clicking it once if it is a chart object or by selecting the filename from the Window menu if it is a separate chart document. Select the ChartWizard button and in the Step 1 Of 2 dialog box, change the reference from filename!$A$1:$B$7 to filename!$A$1:$C$7. Choose Next and in the Step 2 Of 2 dialog box, choose OK. The resulting chart will have a straight line (best fit line) running through your original data. NOTE: When you display the points of a best fit line against the original data in a chart, you will, in most cases, get the best results by using an xy (scatter) chart. If you use a line chart, the x values will be treated as labels rather than as values, and crooked lines may result. PREDICTING FUTURE VALUES ======================== In addition to returning values along the line fitted to your existing data, you can use TREND() and LINEST() to predict future values. You can also use the FORECAST() function in Microsoft Excel 4.0. Using the WidgeMaker, Inc., example, suppose you want to calculate sales figures for months 7, 8, and 9. The following examples show how to accomplish this using the TREND() and LINEST() functions, respectively. Using TREND() ------------- To predict values for months 7, 8, and 9, first enter the month numbers for which you want predicted sales figures and then use the TREND() function to calculate the values: 1. In cells A8:A10, type 7, 8, and 9 respectively. 2. Select cells B8:B10. 3. Type the following formula: =TREND(B2:B7,A2:A7,A8:A10) 4. Enter the formula as an array by pressing CTRL+SHIFT+ENTER in Microsoft Excel for Windows or COMMAND+RETURN in Microsoft Excel for the Macintosh. The third argument in the TREND() function is the array containing the new x values for which you want to derive predicted y values. The resulting values in cells B8:B10 are the predicted sales for the next three months: | A | B --------------------- 1 | Month | Sales --------------------- 2 1 $4,200 3 2 $6,100 4 3 $7,300 5 4 $7,300 6 5 $8,700 7 6 $10,500 8 7 $11,280 9 8 $12,403 10 9 $13,526 NOTE: In Microsoft Excel 4.0, you can use the AutoFill feature to predict future values. Using the above data, if you wanted to predict sales for months 7, 8, and 9, you would select cells B2:B7, select the AutoFill handle in the lower-right corner of the selected area and drag down three additional cells. CAUTION: In addition to returning predicted values for months 7, 8, and 9, the data in cells B2:B7 will be overwritten with the values that represent the best fit line. If you do not want your original data to be overwritten, copy it to a separate area on your worksheet and then use AutoFill. Using LINEST() -------------- To obtain the new y values, you can also substitute the slope and y intercept values that you derived with the LINEST() function and the new x values (7, 8, and 9) into the formula, y=x+. See the "Calculating a Best Fit Line" section for step-by-step instructions on how to do this. MULTIPLE REGRESSION ANALYSIS ============================ When you have two or more independent x variables for each y variable, the regression analysis is multiple. For example, you could predict a child's weight given his or her age and height. Assume you've collected the following data | A | B | C -------------------------- 1 | Age | Height | Weight -------------------------- 2 3 32 35 3 5 40 40 4 6 39 43 5 10 50 70 where the values under Weight (C2:C5) represent the y variables and the values under Age and Height (A2:B5) are the x variables. PREDICTING Y VALUES =================== You can use either the TREND() or the LINEST() function to analyze the relationship of the age and height to weight, and you can make predictions based on the results of this analysis. In Microsoft Excel 4.0, the Regression tool can also be used to predict y values in a multiple regression model. The FORECAST() function only works for simple regression. For additional information, see pages 41-45 in "User's Guide 2." Using TREND() ------------- To use TREND() to predict the weight of a 9-year-old, 45-inch child, do the following: 1. In cells A6:B6, type 9 and 45 respectively. 2. Select cell C6 and type the following formula: =TREND(C2:C5,A2:B5,A6:B6) 3. Enter the formula as an array by pressing CTRL+SHIFT+ENTER in Microsoft Excel for Windows or COMMAND+RETURN in Microsoft Excel for the Macintosh. The predicted weight is 63.42. Using LINEST() -------------- To predict a y value with LINEST(), you must first calculate the slopes for each x variable and find the y intercept. Because a slope is returned for each x variable, when you use the LINEST() function, you must first select a range of cells that consist of a single row and a single column plus an additional column for each x variable you have. In this example, because you have two x variables, you will need to select a range of three cells, three columns wide by one row. To calculate the slopes and the y intercepts, do the following: 1. Select cells A7:C7 and type the following formula: =LINEST(C2:C5,A2:B5) 2. Enter the formula as an array by pressing CTRL+SHIFT+ENTER in Microsoft Excel for Windows or COMMAND+RETURN in Microsoft Excel for Macintosh. The following values will be returned | A | B | C --------------------- 7 | -0.32 5.98 24 where -.32 is the slope for the second x variable (height), 5.98 is the slope for the first x variable (age), and 24 is the y intercept. NOTE: The slopes are in reverse order: the first slope value corresponds to the second x variable and the second slope value corresponds to the first x variable. The LINEST() function always returns the slopes in backwards order when more than one x variable is involved. You can use the slope values and the y intercept value to make predictions based on your data. Using the formula, y=(1*x1)+(2*x2)+(n*xn)+, you can predict the weight of a 45-inch 9-year old: =(9*5.98)+(45*-0.32)+24 The result of the formula, 63.42, is the predicted weight. Similar to LINEST(), the Regression tool in Microsoft Excel 4.0 returns the slope values and the y intercept value. As described previously, you can plug these values into the formula, y=x+, to predict y. NOTE: Because the x variables are independent, there may not be a good graphical representation for a multiple regression model. Each x value can be plotted with its corresponding y value but the individual lines may be completely unrelated and therefore may not be useful. POLYNOMIAL REGRESSION ANALYSIS ============================== When your data is neither exponentially curved nor consistently linear, use this method of regression. When you plot a best fit curve calculated with polynomial regression, the curve will rise and fall with the data. CALCULATING A POLYNOMIAL CURVE ============================== To calculate a polynomial curve, the y variable is regressed against the independent x variable raised to different powers. To illustrate this process, take the following example (assume WidgeMaker, Inc.'s sales for the first six months of its operation are as follows): | A | B ------------------ 1 | Month | Sales ------------------ 2 1 $4,200 3 2 $1,600 4 3 $5,120 5 4 $4,500 6 5 $5,400 7 6 $1,460 Fitting a straight line to this data would not accurately predict the sales for any given month. In this case, you'll get the best results by setting up the following polynomial regression model: | A | B | C | D | E | F --------------------------------------------- 10 | X | X^2 | X^3 | X^4 | Sales | Trend --------------------------------------------- 11 1 1 1 1 $4,200 $4,089 12 2 4 8 16 $1,600 $2,154 13 3 9 27 81 $5,120 $4,011 14 4 16 64 256 $4,500 $5,609 15 5 25 125 625 $5,400 $4,846 16 6 36 216 1296 $1,460 $1,571 The values in cells A11:A16 are the month numbers copied from A2:A7. The values in cells B11:D16 are the original x variables raised to the second, third, and fourth powers respectively. To obtain these values, do the following: 1. Select cell B11 and enter the formula: =A11^2 2. Select cell C11 and enter the formula: =A11^3 3. Select cell D11 and enter the formula: =A11^4 4. Select cells B11:D16. 5. From the Edit menu, choose Fill Down. The values in E11:E16 are the sales figures copied from B2:B7. To derive the trend values, do the following: 1. Select cells F11:F16 and type the formula: =TREND(E11:E16,A11:D16) 2. Enter the formula as an array by pressing CTRL+SHIFT+ENTER in Microsoft Excel for Windows or COMMAND+RETURN in Microsoft Excel for the Macintosh. PLOTTING A POLYNOMIAL CURVE =========================== To add the TREND() results to this chart, select cells F10:F16 and follow the steps under "Plotting the Best Fit Line" on page 6 of this Application Note. You will create a chart that resembles the following: USING REGRESSION STATISTICS =========================== The LINEST() and LOGEST() functions can return additional regression statistics that can be helpful in using and evaluating your regression model. If you have linear data, you can use the Regression Tool in Microsoft Excel version 4.0 to automatically return all the regression statistics. If your data resembles an exponential curve, use LOGEST() to return accurate regression statistics. Using LINEST() or LOGEST() to Return Regression Statistics ---------------------------------------------------------- To return the additional statistics using LINEST() or LOGEST(), you must select a range that includes five rows and a single column plus an additional column for each x variable you have. In addition, the stats argument, which is the fourth argument in both of these functions, must be set to TRUE. To return the additional regression statistics using the following data | A | B | C ------------------------- 1 | Age | Height | Weight ------------------------- 2 3 32 35 3 5 40 40 4 6 39 43 5 10 50 70 do the following: 1. Select cells D1:F5. NOTE: This range consists of five rows and a single column plus two additional columns, one for each x variable. 2. Type the following formula: =LINEST(C2:C5,A2:B5,,TRUE) 3. Enter the formula as an array by pressing CTRL+SHIFT+ENTER in Microsoft Excel for Windows or COMMAND+RETURN in Microsoft Excel for the Macintosh. The resulting data should resemble the following: | D | E | F ------------------------------------ 1 -0.32 5.98 24 2 2.243569 5.647619 57.68449 3 .950813 6.024948 #N/A 4 9.665289 1 #N/A 5 701.7 36.3 #N/A In the first row of the statistics, you have the slope for the height, the slope for the age, and the y intercept. The second row contains the standard error of the slopes and of the y intercept. R2 and the standard error for the y estimate are in the third row. The statistic and degrees of freedom are in the fourth row, while the regression sum of squares and the residual sum of squares are in the fifth row. Using R2 to Determine the Accuracy of the Regression Model ---------------------------------------------------------- A particularly useful statistic returned is the coefficient of determination (R2). In Microsoft Excel 4.0, you can also use the RSQ() function to find R2. This R2 indicator ranges in value from 0 to 1 and tells you how closely the estimated y values correlate to your actual y values. The closer R2 is to 1, the more perfect the correlation-- this correlation indicates that the regression equation is very useful in accurately predicting a y value. On the other hand, the closer R2 is to 0, the less helpful it will be in predicting a y value. In the previous example, the value for R2 returned by LINEST() is .95, a near perfect correlation. This indicates that, based on the collected data, the LINEST() model can be used to make extremely accurate predictions of a child's weight given a specific age and height. If you want to predict how accurately a child's age and height will predict their weight, given the collected data, R2 will indicate the accuracy of the predicted weight of a child with a different age and height not included in the collected data.