Sep 9, 2012

Kaplan-Meier Estimator in R

Function Name: kmeier (* , *)
Author: Haoying Wang @ 2012
Package: R x64 2.13.0

Background: The Kaplan-Meier estimator is a classic and popular method for estimating the empirical survival function from lifetime data. The Kaplan-Meier estimate is a right-continuous step function that jumps only at the death/failure/event times. The Kaplan-Meier estimate can also be computed with PROC LIFETEST in SAS. Kaplan and Meier (1958) showed that the method gives the maximum likelihood estimate.

Function Input (two arguments):
1st: Vector of right-censored observed times for n individuals
2nd: Vector of failure time indicators (0 = censored individual; 1 = uncensored individual)

Function Output (two components):
1st: Vector of sorted observed event times, in increasing order;
2nd: Vector of Kaplan-Meier estimates, corresponding to output 1.

Reference: Kaplan, E. L., Meier, P., Nonparametric estimation from incomplete observations. JASA, 53:457–481, 1958.

Code Link: https://sites.google.com/site/halkingwang/programming/r/kaplan-meierestimatorinr
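
The linked page contains the author's actual code; purely to illustrate the product-limit calculation described above, here is a minimal sketch in R of how such a function could look (the variable names and toy data are mine, not the author's):

    # Minimal sketch of a Kaplan-Meier (product-limit) estimator.
    kmeier <- function(time, status) {
      # time:   vector of right-censored observed times
      # status: failure indicators (0 = censored, 1 = event)
      ord    <- order(time)
      time   <- time[ord]
      status <- status[ord]
      etimes <- sort(unique(time[status == 1]))   # distinct event times, increasing
      surv   <- numeric(length(etimes))
      s      <- 1
      for (i in seq_along(etimes)) {
        t_i <- etimes[i]
        d_i <- sum(time == t_i & status == 1)     # deaths at t_i
        n_i <- sum(time >= t_i)                   # number at risk just before t_i
        s   <- s * (1 - d_i / n_i)                # product-limit update
        surv[i] <- s
      }
      list(event.times = etimes, km.estimate = surv)
    }

    # Toy example:
    # kmeier(c(6, 7, 10, 15, 19, 25), c(1, 0, 1, 1, 0, 1))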

Jan 29, 2011

Measurement Error in Agricultural Economics

Agricultural economists are criticized by Steven Payson in his book "Economics, Science and Technology" on the grounds that 'price per pound' is a variable with a strong historical precedence in agricultural economics. My first reaction is that Steven is probably right about this assertion. I have not read many published papers in agricultural economics and applied economics where this sort of measurement unit is used to express variables numerically. But my experience tells me that, in general economic theoretical research, economists do not have to specify the unit of a variable after standardizing everything (or simply by assuming the unit is 1 at the beginning, or by default). In this case, the objective of economic modelling and the associated mathematical manipulation is to tell a story of economic logic or reasoning (not necessarily an easy-to-understand answer). As long as the direction of the specification (the assumed economic relationship) is correct, there is no need to worry about the numbers. The potential issue of measurement error has been assumed away in structural research: a skeleton without flesh never needs to worry about obesity; the probability is no larger than 0.

Agricultural economics, one of the biggest applied areas of economics, is dominated by empirical application of economic analysis. Often, however, the data employed in this research are not collected by the researchers themselves (including agricultural economists in different research institutions and graduate students). The people who collect the first-hand data are usually not the people who are supposed to do the economic analysis (and they may not be interested in doing so). They are staff in statistics bureaus, USDA extension offices, marketing research companies and so on. They follow tradition and associate every column of data with a unit such as 'per pound', 'per gallon' or 'per acre'. This way, they also find it easy to communicate with the people who actually work on farms daily. Economists rarely bother to do field data collection, the input to research activity. So there is an inconsistency: academic researchers try very hard to explore and understand the laws and mechanisms behind the economic activities of non-academics, while the data collectors, as they go about their job, have the communication need mentioned above to consider. The numbers and measures they use have to be easy to understand for different people, almost all of whom are not economists.

So when people outside of academia criticize the research output of economics, they do not always realize the difference between applied economics and theoretical economics. What usually happens is that, when there is a critique, theoretical economists tend to defend economics as a whole. Why? Because they are more famous than applied economists, and they have won Nobel prizes, so people know them fairly well (at least better than they know applied economists). When people have questions about their economic life, they come to theoretical economists for answers. But they barely realize that a relatively precise answer can and should only be provided (more or less, good or bad) by applied economists. Another important fact is that theoretical economists are good at telling stories and making arguments.

This inconsistency and the silence of applied economists make for a big distinction between economics and natural science. In physics, for example, people know Bill Gates, Henry Ford and Thomas Edison very well. Why? Because they are applied physicists (or engineers). However, if you ask people whether they know any theoretical physicists, I am afraid they can only name Einstein, and the reason they know Einstein is not because they want to understand, or have already understood, the Theory of Relativity. It is because Einstein spent almost a third of his life making himself (or the Theory of Relativity) famous.

Therefore, what an applied economist should do is obvious. It is not to keep adjusting or reducing the standard errors of your estimators, since you never get a large enough sample size, and you never know how much measurement error is involved in your data. What you should do is start answering questions from people who do not understand economics and people who are not interested in understanding economics. Do not let theoretical economists take your position and press you into a corner; it is too bad that this is the case now, but you can change your silent status by speaking out. And do not always ask, or even make up, a question by yourself and then answer it by yourself. That is what theoretical economists do; do not take their position, because you would have a more suitable and more pleasant position if you did something different and something more important.

Reference:
[1] "Economics, Science and Technology", by Steven Payson, pp.43, Edward Elgar, Cheltenham, UK.

Jan 8, 2011

Why did the chicken cross the road?

Because he maximized his utility. This is the answer from Charles Wheelan in his Naked Economics.

Because he is forward-looking. I guess this is my answer. If one has the chance to grow up on a traditional farm (either American style or non-American style), he or she may have a good opportunity to observe the behavior of a chicken. Can a chicken step backward like other animals or a human being? I guess not. So when a chicken is searching for food (normally, this is his only utility-maximizing activity), he always searches forward, and by chance he may cross the road despite the danger. This logic does not seem to make too much sense, but the point is that a chicken crosses the road by instinct. And mostly, the instinct of a chicken involves searching for food and surviving danger, both of which are characterized by his forward-looking behavior.

It can be argued that forward-looking behavior is common in the natural world. But we should bear in mind the fact that a chicken never rolls forward, while a badger does this often. It turns out that rolling forward is a pretty efficient behavior in the world a badger lives in. Now let's ask the question again: why did the badger cross the road? One possible answer is that he is constrained. When we build a road, we keep the slope from one side to the other minimized to prevent roll-overs. Hence there is no slope for him to roll forward along; there is a physical constraint. Of course, this does not necessarily conflict with the possibility that his utility is maximized by doing this.

So for different animals, the answer to this question tends to be different and varies with their instincts and the nature of their existence. How about human beings? We are the most organized animal on earth. We have a very complicated society, and hence we call ourselves advanced animals. When we are not self-sufficient (almost surely everyone in the U.S. is not self-sufficient, at least), we play a role in the market economy, where we exchange everything possible for different levels of needs. And every one of us in the market can be called a businessman or entrepreneur; the only difference comes from the scale of each one's business.

Again, the same question: why did the businessman/entrepreneur (potentially every one of us) cross the road? Charles Wheelan answered the question this way: because he could make more money on the other side. I agree with him, since the utility of every one of us (a market player) is highly correlated with money in different patterns or formats. Therefore, if an individual is not out of his or her mind (for whatever reason, subjective or objective) at a specific moment, then he or she crosses the road for a purpose. And the purpose is somehow connected with making money or maximizing utility.

Another question of mine is this: in terms of economics study and research, can we really study the economic behavior of any individual beyond this highly organized human society? Put another way, is "economy" a specialized word, applicable only to human society? I tend to say yes; otherwise it is hard to differentiate instinct from utility-maximizing behavior. And this may also be the reason why I get confused when I come across some parts of experimental economics, a new field of economics.

Reference:
1. Charles Wheelan, Naked Economics, 2002, New York, pp.8-11.

Jan 2, 2011

W-9 form and U.S. tax system

Recently, I ran into a new thing, the W-9 form. As a Ph.D. student, I only need to file a W-4 form. However, following my first guess, I have to say that the W-9 is a very confusing form in the U.S. tax system as soon as a non-citizen is involved, and this is why I want to talk about it. Hopefully I can get it clear, though I have to doubt my capability of doing that, because even Albert Einstein said: "The hardest thing in the world to understand is the income tax". So what can I do about it? Well, at least I can try; at least this is the 21st century.

So, what does the W-9 form mean? It is an IRS form, also known as the "Request for Taxpayer Identification Number and Certification", which is used by an individual defined as a "U.S. person" (see notes below) or a resident alien to verify his or her taxpayer identification number (TIN). An entity that is required to file an information return with the IRS must obtain your correct TIN to report, for example, income paid to you, real estate transactions, mortgage interest you paid, etc. For example, companies that issue dividends use the W-9 form to verify a shareholder's TIN. A client or company will use the information collected on the W-9, or substitute W-9, to produce a Form 1099, which details the earnings that the independent contractor received from that client in that tax year. The Form 1099 is sent to both the independent contractor and the IRS; some states may also require a separate mailing of the 1099.

In brief, if you are requested by a company or institution (for example, a research institute like the NBER) to file a W-9 form, it basically means you receive payment from them, whatever the reason for the payment. And when this payment is issued to you, they do not withhold tax in the usual way (usually your main employer withholds tax on your paycheck for the IRS via the payroll system on a regular schedule, e.g. monthly or biweekly). Put another way, these payments have not gone through tax withholding, and you have to report this income to the IRS when you do your tax return. I can think of two possible reasons for a company or institution doing this. First, the payments they issue to you do not follow a consistent and continuous pattern, so it is hard to manage them via a payroll system. Second, the amounts of these kinds of payments vary a lot but are generally very small, like 5-50 dollars, so it is not efficient to invest human resources in managing tax on them. Of course, there could also be some regulatory reason for this, and I am not patient enough to figure it out.

Now we can see the difference between the W-9 and the W-4. Unlike the Form W-4, which employees use to authorize automatic tax withholding from their paychecks (via a payroll system, done by their employers), the W-9 does not trigger withholding of taxes or social security payments. Individuals or entities are solely liable for the full amount of taxes assessed on earned income. In some cases (for example, small businesses like individual contractors or designers), you are typically required to submit quarterly estimated tax payments to the IRS and possibly your state tax board as well. But in most cases, I guess, you can report it yearly when you do your tax return.

So, why are we doing business in such an inefficient way? Why don't the companies and institutions who pay us simply report the payment to the IRS? Fortunately or unfortunately, we have lots of laws, and one of them has something to do with this: even though employees are legally required to provide certain personal information to their employers, an employee's privacy is protected by law. An employer that discloses an employee's personal information in any unauthorized way may be subject to civil and criminal prosecution.

So far we know some basics about the W-9 form; then, for a student like me, what should be known about it and how should one proceed with it? A usual question, and I have a simple semi-official answer for you. According to the payroll service of the University of Southern California:


Form W-9 is a tax document which declares you to be a "resident alien for tax purposes." Filing a W-9 has two immediate consequences: 
1. It negates all tax treaty protections and provisions.
2. It gives you more options in what you want to claim on your W-4 (any marital status, number of allowances, or "exempt").


Generally, you should file a W-9 only after you have been in the U.S. for at least the periods of time noted for the following visa types, and then only if you feel it is appropriate:


F-1, J-1 Student     After 5 years in the U.S.
J-1 Scholar         After 2 years in the U.S.
H-1                         After 183 days in the U.S.
The University cannot give advice on whether or not you should file a W-9; however, be aware that you are responsible for all potential consequences if you decide to claim residence in the U.S.

I am still interested in getting another semi-official answer from my employer, Penn State University. We will see how it goes. Back to the beginning: anyway, please notice that this is just the tip of the iceberg. How complicated can the U.S. tax system be? I am not sure there is anyone who can answer this question. But I can give you some rough numbers. In a 2004 press release the IRS stated that it had around 116,675 full-time employees. The IRS has a budget of $11.1 billion to hire additional workers, be they part-time or full-time employees. According to an IRS report titled "Workforce of Tomorrow Task Force: Final Report, August 2009", the IRS has 88,203 full-time employees. This number is derived from a chart on page 7 of the report that shows the age breakdown of full-time IRS employees; adding the numbers up comes to the 88,203 figure. At the end of the report, one of the summary pages mentions the IRS's 100,000-employee workforce. Obviously this number is an estimate of all full- and part-time employees.

So roughly 100,000 people work for the IRS every business day, every year; how can you live in the United States without paying taxes? That is something almost surely impossible! Do you know how many people work for Google? 23,331 (2010, worldwide, not U.S. only). Do you know how many people work for Microsoft? 89,000 (2010, worldwide, not U.S. only). It seems the IRS is the biggest money-seeking organization in the U.S.

I have also realized that recently lots of people have proposed reforming the U.S. tax system. Let's learn about it a little bit, so we can watch our pockets. How about starting with this one:

Taxing Ourselves, 4th Edition: A Citizen's Guide to the Debate over Taxes, Joel Slemrod & Jon Bakija.

---------------------
Notes: U.S. Person


A "U.S. person" is a citizen of United States, a lawful permanent resident alien of the US, (a "Green Card" holder), a refugee or someone here as a protected political asylee or under amnesty. US persons also include organizations and entities, such as universities, incorporated in the US. The general rule is that only US persons are eligible to receive controlled items, software or information without first obtaining an export license from the appropriate agency unless a license exception or exclusion is available.

A "foreign person" is anyone who is not a US person. A foreign person also means any foreign corporation, business association, partnership or any other entity or group that is not incorporated to do business in the US. Foreign persons may include international organizations, foreign governments and any agency or subdivision of foreign governments such as consulates.



Reference:
1. http://www.wisegeek.com/what-is-a-form-w-9.htm
2. http://www.investopedia.com/terms/w/w9form.asp
3. http://ais-ss.usc.edu/empldoc/faq/faq4.html#7
4. http://en.wikipedia.org/wiki/Microsoft
5. http://en.wikipedia.org/wiki/Google
6. http://rph.stanford.edu/10-2.html

Apr 28, 2010

Two Rivers, Culture and Life

Here is an excerpt from Michael Steinhardt's book 'No Bull: My Life In and Out of Markets', which is quite inspiring for understanding American and immigrant culture.

"In looking back on my career and my life, I can see that my values, and the goals I continue to strive for, represent the confluence of two great rivers: The age-old river of Judaism, the people and the tradition, and the river of secularized (societies are no longer under the control or influence of religion) American. From the Eastern European Jewish river flows a region, and, more importantly, a culture, while from the other river flows twentieth and twenty-first century American life with its openness, social mobility, and material prosperity. I believe my generation of Jews, in particular, is the product of these same two rivers, and the contents of both are strong within us. But, over time, the American river has grown stronger, becoming dominant in our lives, while the Eastern European river has been subsumed (to include something in a particular group and not consider it separately). For the first 50-plus year of my life, I too traveled, almost exclusively, along the secular river of American culture. Now I work, almost exclusively, on strengthening the flow of the river of my heritage."

       --'No Bull: My Life In and Out of Markets', Chapter 17, p. 263.

Apr 19, 2010

On Property

The word property is not easy to define precisely.
According to Merriam-Webster, 'property' can be interpreted as:

(a) A quality or trait belonging and especially peculiar to an individual or thing;
(b) An effect that an object has on another object or on the senses;
(c) An attribute common to all members of a class.

More simply, according to Google Dictionary:
(a) A thing or things that are owned by somebody; a possession or possessions;
(b) A quality or characteristic that something has.

Mathematically, however, we shall not hesitate to use it in the usual (informal) fashion.
If P denotes a property that is meaningful for a collection of elements, then we agree to write {x : P(x)} for the set of all elements x for which the property P holds. We usually read this as "the set of all x such that P(x)". It is often worthwhile to specify which elements we are testing for the property P. Hence we shall often write:

{x \in S : P(x)} for the subset of S for which the property P holds.

Mar 20, 2010

Some Words about Nonparametrics

If m is believed to be smooth, then the observations at Xi near x should contain information about the value of m at x. Thus it should be possible to use something like a local average of the data near x to construct an estimator of m(x).   --R. Eubank (1988. p.7)


Parametric models are fully determined up to a parameter (vector). The fitted models can easily be interpreted and estimated accurately if the underlying assumptions are correct. If, however, they are violated then parametric estimates may be inconsistent and give a misleading picture of the regression relationship.
Nonparametric models avoid restrictive assumptions of the functional form of the regression function m. However, they may be difficult to interpret and yield inaccurate estimates if the number of regressors is large. This has been appropriately  termed The Curse of Dimensionality. Semiparametric models combine components of parametric and nonparametric models, keeping the easy interpretability of the former and retaining some  of the flexibility of the latter.


Note: Nonparametric regression estimators are very flexible, but their statistical precision decreases greatly if we include several explanatory variables in the model; the latter caveat has been appropriately termed the curse of dimensionality. Consequently, researchers have tried to develop models and estimators which offer more flexibility than standard parametric regression but overcome the curse of dimensionality by employing some form of dimension reduction. Such methods usually combine features of parametric and nonparametric techniques. As a consequence, they are usually referred to as semiparametric methods. Further advantages of semiparametric methods are the possible inclusion of categorical variables (which can often only be included in a parametric way), an easy (economic) interpretation of the results, and the possibility of a partial specification of a model.          --Wolfgang Hardle (2004)
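
Eubank's "local average" idea can be sketched in a few lines of R; the following is a generic Nadaraya-Watson style smoother with a Gaussian kernel, my own illustration rather than code from any of the quoted authors, and the bandwidth value is arbitrary:

    # Local-average (Nadaraya-Watson) estimate of m(x): a weighted mean of the
    # observations near x, with weights supplied by a Gaussian kernel.
    nw_smooth <- function(x, y, xgrid, h) {
      sapply(xgrid, function(x0) {
        w <- dnorm((x - x0) / h)     # kernel weights: large for Xi near x0
        sum(w * y) / sum(w)          # local weighted average
      })
    }

    # Toy example: noisy sine curve
    set.seed(1)
    x <- sort(runif(200, 0, 2 * pi))
    y <- sin(x) + rnorm(200, sd = 0.3)
    xg <- seq(0, 2 * pi, length.out = 100)
    mhat <- nw_smooth(x, y, xg, h = 0.3)   # h controls how "local" the average is
    # plot(x, y); lines(xg, mhat, col = "red")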

Mar 19, 2010

Integrated, Unit Roots and Box-Jenkins Approach

The Box-Jenkins approach is only valid if the variable being modeled is stationary. Although there are many different ways in which data can be nonstationary, Box and Jenkins assumed that the nature of economic time series data is such that any nonstationarity can be removed by differencing. This explains why the Box-Jenkins approach deals mainly with differenced data.
A key ingredient of their methodology, an ingredient adopted by econometricians (without any justification based on economic theory), is their assumption that the nonstationarity is such that differencing will create stationarity. This concept is what is meant by the term Integrated: a variable is said to be integrated of order d, written I(d), if it must be differenced d times to be made stationary. Thus a stationary variable is integrated of order zero, written I(0); a variable which must be differenced once to become stationary is said to be I(1), integrated of order one; and so on. Economic variables are seldom integrated of order greater than two, and if nonstationary are usually I(1). Peter Kennedy gives an illustrative example of an I(1) random walk.
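
Kennedy's illustration appears as a figure in his text; as a stand-in, here is a minimal R simulation (my own) of an I(1) random walk whose first difference is I(0):

    # An I(1) random walk: y_t = y_{t-1} + e_t. The level wanders (nonstationary),
    # but the first difference is just white noise (stationary).
    set.seed(123)
    e  <- rnorm(500)
    y  <- cumsum(e)      # I(1): random walk
    dy <- diff(y)        # I(0): differencing once removes the unit root
    # par(mfrow = c(2, 1)); plot.ts(y); plot.ts(dy)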

Mar 18, 2010

Robust Estimation and Outliers

Estimators designed to be the "best" estimator for a particular estimating problem owe their attractive properties to the fact that their derivation has exploited special features of the process generating the data, features that are assumed known by the econometrician. Knowledge that the classical linear regression model assumptions hold, for example, allows derivation of the OLS estimator as one possessing several desirable properties. Unfortunately, because these best estimators have been designed to exploit these assumptions, violations of the assumptions affect them much more than they do other, sub-optimal estimators. Because researchers are not in a position of knowing with certainty that the assumptions used to justify their choice of estimator are met, it is tempting to protect oneself against violations of these assumptions by using an estimator whose properties, while not quite "best", are not sensitive to violations of those assumptions. Such estimators are referred to as Robust Estimators.
In the presence of fat-tailed error distributions, although the OLS estimator is still BLUE, it is markedly inferior to some nonlinear unbiased estimators. These nonlinear estimators, namely robust estimators, are preferred to the OLS estimator whenever there is reason to believe that the error distribution is fat-tailed.

So the implication here is that we should treat outliers more carefully than simply kicking them out of the sample for the sake of a better goodness-of-fit when running OLS. Often influential observations (outliers) are the most valuable observations in a data set; outliers may reflect some unusual fact that could lead to an improvement in the model's specification.
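
A small simulation makes the point concrete; this sketch uses MASS::rlm (a Huber M-estimator) as one example of a robust estimator, which is my choice of illustration rather than anything prescribed in the text, and it assumes the MASS package is installed:

    # OLS vs. a robust M-estimator when errors are fat-tailed (t with 2 df).
    library(MASS)                      # for rlm()
    set.seed(42)
    x <- rnorm(100)
    y <- 1 + 2 * x + rt(100, df = 2)   # fat-tailed errors produce occasional outliers

    ols    <- lm(y ~ x)                # BLUE, but sensitive to the outliers
    robust <- rlm(y ~ x)               # Huber M-estimation downweights large residuals

    coef(ols)
    coef(robust)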

Mar 17, 2010

Cointegration

Recall that the levels variables in the ECM entered the estimating equation in a special way: they entered combined into a single entity that captured the extent to which the system is out of equilibrium. It could be that even though these levels variables are individually I(1) (a variable which must be differenced once to become stationary), this special combination of them is I(0). If this is the case, their entry into the estimating equation will not create spurious (false, although seeming to be genuine) results.

This possibility does not seem unreasonable. A nonstationary variable will tend to wander extensively (that is what makes it nonstationary), but some pairs of nonstationary variables can be expected to wander in such a way that they do not drift too far apart, thanks to disequilibrium forces that tend to keep them together. Some examples are short- and long-term interest rates, prices and wages, household income and expenditures, imports and exports, spot and futures prices of a commodity, and exchange rates determined in different markets. Such variables are said to be Cointegrated: although individually they are I(1), a particular linear combination of them is I(0). The cointegrating combination is interpreted as an equilibrium relationship, since it can be shown that variables in the error-correction term in an ECM must be cointegrated, and vice versa, that cointegrated variables must have an ECM representation. This is why economists have shown such interest in the concept of cointegration - it provides a formal framework for testing for and estimating long-run (equilibrium) relationships among economic variables.

One important implication of all this is that differencing is not the only means of eliminating unit roots. Consequently, if the data are found to have unit roots, before differencing (and thereby losing all the long-run information in the data) a researcher should test for cointegration; if a cointegrating relationship can be found, this should be exploited by undertaking estimation in an ECM framework.
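
A minimal Engle-Granger style check can be sketched in R with simulated data (my own generic illustration, assuming the tseries package is available; a proper Engle-Granger test would use its own critical values rather than the standard Dickey-Fuller ones):

    # Two I(1) series that share a common stochastic trend are cointegrated:
    # each wanders, but a linear combination of them is stationary.
    library(tseries)                       # for adf.test()
    set.seed(7)
    z <- cumsum(rnorm(300))                # common I(1) trend
    x <- z + rnorm(300, sd = 0.5)          # I(1)
    y <- 2 * z + rnorm(300, sd = 0.5)      # I(1), cointegrated with x

    coint <- lm(y ~ x)                     # step 1: estimate the long-run relation
    adf.test(residuals(coint))             # step 2: stationary residuals suggest cointegration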

Error-Correction Model, Differenced and Levels Variable

Error-correction model, or ECM for short, is a very popular benchmark model in time series econometrics. An error-correction model is a dynamic model in which "the movement of the variables in any period is related to the previous period's gap from long-run equilibrium." As a simple example of this, consider the relationship
yt = β1 + β2 xt + β3 yt-1 + β4 xt-1 + εt
where y and x are measured in logarithms, with economic theory suggesting that in the long run y and x will grow at the same rate, so that in equilibrium (y-x) will be a constant, save for the error (which imposes β2 + β3 + β4 = 1). This relationship can be manipulated to produce
Δyt = β1 + β2 Δxt + (β3 - 1)(y - x)t-1 + εt
This is the ECM representation of the original specification; the last term is the error-correction term, interpreted as reflecting disequilibrium responses. The terminology can be explained as follows: if in error y grows too quickly, the last term becomes bigger, and since its coefficient is negative (β3 < 1 for stationarity), Δyt is reduced, correcting this error. In actual applications, more explanatory variables will appear, with many more lags. Notice that this ECM equation turns out to be in terms of Differenced Variables, with the error-correction component measured in terms of Levels Variables.

Mar 16, 2010

Dummy Variable, Fixed and Random Effects

Dummy variables are sometimes used in the context of panel, or longitudinal, data - observations on a cross-section of individuals or firms, say, over time. In this context it is often assumed that the intercept varies across the N cross-sectional units and/or across the T time periods. In the general case (N-1)+(T-1) dummies can be used for this, with computational short-cuts available to avoid having to run a regression with all these extra variables. This way of analyzing panel data is called the Fixed Effects Model. The dummy variable coefficients reflect ignorance - they are inserted merely for the purpose of measuring shifts in the regression line arising from unknown variables. Some researchers feel that this type of ignorance should be treated in a fashion similar to the general ignorance represented by the error term, and have accordingly proposed the Random Effects, Variance Components, or Error Components model.
Which of the fixed effects and random effects models is better? This depends on the context of the data and on what the results are to be used for. If the data exhaust the population (say, observations on all firms producing automobiles), then the fixed effects approach, which produces results conditional on the units in the data set, is reasonable. If the data are a drawing of observations from a large population (say, a thousand individuals in a city many times that size), and we wish to draw inferences regarding other members of that population, the fixed effects model is no longer reasonable; in this context, use of the random effects model has the advantage that it saves a lot of degrees of freedom.

The random effects model has a major drawback, however: it assumes that the random error associated with each cross-section unit is uncorrelated with the other regressors, something that is not likely to be the case. Suppose, for example, that wages are being regressed on schooling for a large set of individuals, and that a missing variable, ability, is thought to affect the intercept; since schooling and ability are likely to be correlated, modeling this as a random effect will create correlation between the error and the regressor schooling (whereas modeling it as a fixed effect will not). The result is bias in the coefficient estimates from the random effects model. This may explain why the slope estimates from the fixed and random effects models are often so different.
A Hausman test for correlation between the error and the regressors can be used to check whether the random effects model is appropriate. Under the null hypothesis of no correlation between the error and the regressors, the random effects model is applicable and its estimated GLS estimator is consistent and efficient. The fixed effects model is consistent under both the null and the alternative.
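
In R this comparison is usually done with the plm package; here is a minimal simulated sketch (my own illustration, assuming plm is installed), in which the unit effect is deliberately correlated with the regressor so the two estimators diverge:

    library(plm)                                # fixed/random effects and the Hausman test
    set.seed(99)
    n_id <- 100; n_t <- 5
    id    <- rep(1:n_id, each = n_t)
    year  <- rep(1:n_t, times = n_id)
    alpha <- rep(rnorm(n_id), each = n_t)       # unit-specific effect
    x <- 0.5 * alpha + rnorm(n_id * n_t)        # regressor correlated with the unit effect
    y <- 1 + 2 * x + alpha + rnorm(n_id * n_t)
    pdat <- pdata.frame(data.frame(id, year, y, x), index = c("id", "year"))

    fe <- plm(y ~ x, data = pdat, model = "within")   # fixed effects
    re <- plm(y ~ x, data = pdat, model = "random")   # random effects
    phtest(fe, re)    # Hausman test: a small p-value favors fixed effects here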

Qualitative vs Quantitative Variables

These two terms sound pretty similar, and I was confused by them for a little while; here is a clarification.

Variables can be quantitative or qualitative. (Qualitative variables are sometimes called "categorical variables.") Quantitative variables are measured on an ordinal, interval, or ratio scale; qualitative variables are measured on a nominal scale. If five-year-old subjects were asked to name their favorite color, the variable would be qualitative. If the time it took them to respond were measured, the variable would be quantitative. In brief, a qualitative variable is not measurable with numerical instruments but with labels that do not imply ranking or scale (e.g. gender, color, taste), whereas a quantitative variable is measurable with numerical instruments and can be ordered in a quantifiable ranking (e.g. the height of a person).

Mar 15, 2010

The Bayesian Approach

The essence of the debate between the Frequentists (a statistical approach for assessing the likelihood that a hypothesis is correct by assessing the strength of the data that support the hypothesis and the number of hypotheses that are tested) and the Bayesians rests on the acceptability of the subjectivist notion of probability. Once one is willing to view probability in this way, the advantages of the Bayesian approach are compelling. But most practitioners, even though they have no strong aversion to the subjectivist notion of probability, do not choose to adopt the Bayesian approach. The reasons are practical in nature:
1. Formalizing prior beliefs into a prior distribution is not an easy task;
2. The mechanics of finding the posterior distribution are formidable (inspiring fear and/or respect, because they are impressive or powerful, or because they seem very difficult);
3. Convincing others of the validity of Bayesian results is difficult because they view those results as being "contaminated" by personal beliefs.


Following the subjective notion of probability, it is easy to imagine that before looking at the data the researcher could have a "prior" density function for β, reflecting the odds that he or she would give, before looking at the data, if asked to take bets on the true value of β. This prior distribution, when combined with the data via Bayes' Theorem, produces the posterior distribution referred to above. This posterior density function is in essence a Weighted Average of the prior density and the likelihood (or "conditional density", conditional on the data).
Generally, the Bayesian Approach consists of three steps:
1. A prior distribution is formalized, reflecting the researcher's beliefs about the parameters in question before looking at the data.
2. This prior is combined with the data, via Bayes' theorem, to produce the posterior distribution, the main output of a Bayesian analysis.
3. This posterior is combined with a loss or utility function to allow a decision to be made on the basis of minimizing expected loss or maximizing expected utility; this third step is optional.
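
As a deliberately simple illustration of the posterior being a weighted average of prior information and the data, here is a conjugate normal-mean example in R (the prior values and data are arbitrary choices of mine):

    # Normal data with known variance, normal prior on the mean:
    # the posterior mean is a precision-weighted average of prior mean and sample mean.
    set.seed(3)
    sigma2 <- 4                          # known data variance
    y <- rnorm(25, mean = 7, sd = sqrt(sigma2))

    mu0 <- 0; tau2 <- 10                 # prior: beta ~ N(mu0, tau2)
    n <- length(y); ybar <- mean(y)

    post_prec <- 1 / tau2 + n / sigma2                       # precisions add
    post_mean <- (mu0 / tau2 + n * ybar / sigma2) / post_prec
    post_var  <- 1 / post_prec

    c(prior.mean = mu0, sample.mean = ybar, posterior.mean = post_mean)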

The Bayesian approach claims several advantages over the classical approach:
1. The Bayesian approach is concerned with how information in data modifies a researcher's beliefs about parameter values and allows computation of probabilities associated with alternative hypotheses or models; this corresponds directly to the approach to these problems taken by most researchers.
2. Extraneous information is routinely incorporated in a consistent fashion in the Bayesian method through the formulation of the prior; in the classical approach such information is more likely to be ignored, and when incorporated is usually done so in ad hoc (arranged or happening when necessary and not planned in advance) ways.
3. The Bayesian approach can tailor the estimate to the purpose of the study, through selection of the loss function; in general, its compatibility with decision analysis is a decided advantage.
4. There is no need to justify the estimating procedure in terms of the awkward concept of the performance of the estimator in hypothetical (based on situations or ideas which are possible and imagined rather than real and true) repeated samples; the Bayesian approach is justified solely on the basis of the prior and the sample data.

Mar 10, 2010

Condition Index and Multicollinearity

In the case of multicollinearity, a less common, but more satisfactory, way of detecting it is through the condition index, or condition number, of the data: the square root of the ratio of the largest to the smallest characteristic root of X'X. A high condition index reflects the presence of collinearity.
When there is no collinearity at all, the eigenvalues, condition indices and condition number will all equal one. As collinearity increases, eigenvalues will be both greater and smaller than 1, and the condition indices and the condition number will increase. An informal rule of thumb is that if the condition number exceeds 15, multicollinearity is a concern; if it is greater than 30, multicollinearity is a very serious concern. (But again, these are just informal rules of thumb.) In SPSS, you get these values by adding the COLLIN parameter to the REGRESSION command; in Stata you can use the collin command. In SAS, you can use the COLLIN option in the MODEL statement of PROC REG.
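
The same quantity is easy to compute directly from a design matrix in R; a minimal sketch (scaling the columns to unit length first is one common convention, and the function name is my own):

    # Condition index of a design matrix X: sqrt(largest eigenvalue / smallest
    # eigenvalue) of X'X, here after scaling columns to unit length.
    condition_index <- function(X) {
      Xs <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))   # unit-length columns
      ev <- eigen(crossprod(Xs), symmetric = TRUE, only.values = TRUE)$values
      sqrt(max(ev) / min(ev))
    }

    # Example with two nearly collinear regressors:
    set.seed(5)
    x1 <- rnorm(100)
    x2 <- x1 + rnorm(100, sd = 0.01)     # almost a copy of x1
    X  <- cbind(intercept = 1, x1, x2)
    condition_index(X)                   # large value signals severe collinearity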

Here are two more rules of thumb in dealing with multicollinearity:
Don't worry about multicollinearity if the R^2 from the regression exceeds the R^2 of any independent variable regressed on the other independent variables.
Don't worry about multicollinearity if the t statistics are all greater than 2.

Mar 9, 2010

Consistency and Convergence

A consistent sequence of estimators is a sequence of estimators that converges in probability to the quantity being estimated as the index (usually the sample size) grows without bound. In other words, increasing the sample size increases the probability of the estimator being close to the population parameter. Mathematically, a sequence of estimators \{t_n; n \ge 0\} is a consistent estimator for the parameter θ if and only if, for all ε > 0, no matter how small, we have

\lim_{n\to\infty} \Pr\left\{ \left| t_n - \theta \right| < \epsilon \right\} = 1.

The consistency defined above may be called Weak Consistency. The sequence is Strongly Consistent if it Converges Almost Surely to the true value. To say that the sequence Xn converges almost surely (or almost everywhere, or with probability 1, or strongly) towards X means that

\Pr\left( \lim_{n\to\infty} X_n = X \right) = 1.

This means that the values of Xn approach the value of X, in the sense (see almost surely) that events for which Xn does not converge to X have probability 0. Using the probability space (\Omega, \mathcal{F}, P) and the concept of the random variable as a function from Ω to R, this is equivalent to the statement

\Pr\left( \omega \in \Omega : \lim_{n\to\infty} X_n(\omega) = X(\omega) \right) = 1.
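
A quick simulation gives a feel for weak consistency; this sketch (my own generic example) estimates P(|t_n - θ| < ε) for the sample mean at several sample sizes:

    # The probability that the sample mean is within eps of theta approaches 1 as n grows.
    set.seed(11)
    theta <- 2; eps <- 0.1
    prob_close <- function(n, reps = 2000) {
      xbar <- replicate(reps, mean(rexp(n, rate = 1 / theta)))  # E[X] = theta
      mean(abs(xbar - theta) < eps)                             # estimated probability
    }
    sapply(c(10, 100, 1000, 10000), prob_close)   # increases toward 1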

Mar 8, 2010

Indirect Least Squares Method (ILS)

Suppose we wish to estimate a structural equation containing, say, three endogenous variables. The first step of the ILS technique is to estimate the reduced-form equations for these three endogenous variables. If the structural equation in question is just identified, there will be only one way of calculating the desired estimates of the structural equation parameters from the reduced-form parameter estimates. The structural parameters are expressed in terms of the reduced-form parameters, and the OLS estimates of the reduced-form parameters are plugged into these expressions to produce estimates of the structural parameters. Because these expressions are nonlinear, however, unbiased estimates of the reduced-form parameters produce only consistent estimates of the structural parameters, not unbiased estimates.

If an equation is over-identified, the extra identifying restrictions provide additional ways of calculating the structural parameters from the reduced-form parameters, all of which are supposed to lead to the same values of the structural parameters. But because the estimates of the reduced-form parameters do not embody these extra restrictions, these different ways of calculating the structural parameters create different estimates of these parameters. (This is because unrestricted estimates, rather than actual values of the parameters, are being used for these calculations.) Because there is no way of determining which of these different estimates is the most appropriate, ILS is not used for over-identified equations. The other simultaneous equation estimating techniques have been designed to estimate structural parameters in the over-identified case; many of these can be shown to be equivalent to ILS in the context of a just-identified equation, and to be weighted averages of the different estimates produced by ILS in the context of over-identified equations.

Here is the basic procedure to implement ILS:
1. Rearrange the structural-form equations into reduced form;
2. Estimate the reduced-form parameters by OLS;
3. Solve for the structural-form parameters in terms of the reduced-form parameters, and substitute in the estimates of the reduced-form parameters to get estimates of the structural ones.
Note: If the structural equation is exactly identified, there will be a unique way to calculate the parameters. Estimates of the reduced-form parameters are unbiased, but estimates of the structural parameters will not be; both are consistent.
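
As a concrete illustration, here is a minimal simulated example of ILS for a just-identified equation in R (a generic supply-demand setup of my own, not taken from the text):

    # Just-identified supply curve: q = b0 + b1*p + v, with demand shifted by the
    # exogenous variable y (income), which is excluded from supply. ILS recovers
    # b1 from the two reduced-form regressions of p and q on y.
    set.seed(21)
    n <- 2000
    y <- rnorm(n)                         # exogenous demand shifter
    u <- rnorm(n); v <- rnorm(n)          # structural disturbances
    # Structural form:  demand: q = 10 - 1.0*p + 2*y + u   supply: q = 1 + 0.5*p + v
    # Solve for the endogenous p and q:
    p <- (10 - 1 + 2 * y + u - v) / (0.5 + 1.0)
    q <- 1 + 0.5 * p + v

    pi_p <- coef(lm(p ~ y))               # reduced form for p
    pi_q <- coef(lm(q ~ y))               # reduced form for q

    b1_ils <- pi_q["y"] / pi_p["y"]       # supply slope: ratio of reduced-form slopes
    b0_ils <- pi_q["(Intercept)"] - b1_ils * pi_p["(Intercept)"]
    c(b0 = unname(b0_ils), b1 = unname(b1_ils))   # close to (1, 0.5)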

Mar 7, 2010

Order & Rank Conditions of Identification

The identification problem is a mathematical (as opposed to statistical) problem associated with simultaneous equation systems. It is concerned with the question of the possibility or impossibility of obtaining meaningful estimates of the structural parameters. The identification problem can be solved if economic theory and extraneous information can be used to place restrictions on the set of simultaneous equations. These restrictions can take a variety of forms (such as use of extraneous estimates of parameters, knowledge of exact relationships among parameters, knowledge of the relative variances of disturbances, knowledge of zero correlation between disturbances in different equations, etc.), but the restrictions usually employed, called Zero Restrictions, take the form of specifying that certain structural parameters are zero, i.e., that certain endogenous variables and certain exogenous variables do not appear in certain equations. Mathematical investigation has shown that in the case of zero restrictions on structural parameters, each equation can be checked for identification by using a rule called the Rank Condition. It turns out, however, that this rule is quite awkward to employ, and as a result a simpler rule, called the Order Condition, is used in its stead. This rule only requires counting included and excluded variables in each equation.
Here is a brief illustration of order and rank conditions of identification in simultaneous equation system:





M = number of endogenous variables in the model
K = number of exogenous variables in the model
m = number of endogenous variables in a given equation
k = number of exogenous variables in a given equation
The rank condition is defined by the rank of a matrix A of dimension (M-1)×(M-1), where M is the number of endogenous variables in the model. This matrix is formed from the coefficients of the variables (both endogenous and exogenous) excluded from the particular equation under consideration but included in the other equations of the model.
The rank condition tells us whether the equation under consideration is identified or not, whereas the order condition tells us whether it is exactly identified or overidentified.
1. If K-k > m-1 and the rank ρ(A) is M-1, then the equation is overidentified.
2. If K-k = m-1 and the rank ρ(A) is M-1, then the equation is exactly identified.
3. If K-k >= m-1 and the rank ρ(A) is less than M-1, then the equation is underidentified.
4. If K-k < m-1, the structural equation is unidentified; the rank ρ(A) will be less than M-1 in this case.
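
The counting rule itself is trivial to mechanize; here is a small helper sketch in R (the function name and message wording are my own, and it can only check the order condition, not the rank condition):

    # Order condition for one structural equation:
    #   K = exogenous variables in the model, k = exogenous in the equation,
    #   m = endogenous variables in the equation.
    order_condition <- function(K, k, m) {
      excluded <- K - k
      needed   <- m - 1
      if (excluded > needed)  "overidentified (if the rank condition also holds)"
      else if (excluded == needed) "exactly identified (if the rank condition also holds)"
      else "underidentified"
    }

    order_condition(K = 3, k = 1, m = 2)   # K - k = 2 > m - 1 = 1: overidentified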

From these rules we can tell that the order condition is only a necessary condition, not a sufficient one, so that, technically speaking, the rank condition must also be checked. Many econometricians do not bother doing this, however, gambling that the rank condition will be satisfied (as it usually is) if the order condition is satisfied. This practice is not recommended.

Mar 4, 2010

Direct PC SAS Output to a File

When running SAS programs interactively through the display manager, the output from any procedure is written to the Output window and notes, warnings and errors are written to the Log Window. Contents of these windows are temporary. They can be saved to a file using the File Save pulldown menus from the Output Window and from the Log Window. But if you want to make sure that the output of these windows is saved to a file every time, you can use Proc Printto to automatically route output to a file.

For example, the following program uses Proc Printto to route output directly to a file named auto.lst. What would have gone to the Output Window is redirected to the file c:\auto.lst. The statements below tell SAS to send output that would go to the Output Window to the file c:\auto.lst and to create a new file if the file already exists. If the NEW option were omitted, SAS would append to the file if it existed.

    PROC PRINTTO PRINT='c:\auto.lst' NEW;
    RUN;

Note: (1) Sometimes a SAS program can crash (terminate unexpectedly) before it executes all of the statements properly, and then you will lose all of the results you have already got. (In this kind of situation you have to end SAS through the Windows Task Manager, because generally the SAS program will stop responding.) By using Proc Printto, you can save all of the temporary results you have already got before the program is unexpectedly terminated.

(2) Generally you need to put the Proc Printto statement at the very beginning of the SAS code. Of course, you can also release the print output file by using another simple statement at the very end of the SAS code:
    PROC PRINTTO;
    RUN;

(3) For the log, you can use a similar statement:

    PROC PRINTTO LOG='c:\auto.log' NEW;
    RUN;

Mar 3, 2010

Observations and Thoughts on Haiti and Chile

Here are some observations from a blogger:
"The recent earthquakes in Haiti and Chile present an interesting contrast between the deleterious effects of a major earthquake in one of the richest countries in the western hemisphere and in the poorest.  It may surprise you that Chile is (by relative standards) quite an advanced and relatively wealthy country as many Americans, I think, have a tendency to view all of Latin America as a poor region.  According to the CIA, the per-capita GDP in Chile in 2009 was $14,700 while Haiti was $1,300 - so while Chile is far from US or Western European standards of living, it is a much wealthier country than Haiti.  In both cases the earthquake (and subsequent tsunami in Chile) were devastating disasters, but the scope of the tragedy in Haiti was, it appears, much, much worse."

These observations prompt two serious thoughts for me:
1. Beyond physical demand, people's level of immaterial demand can also be determined by income or wealth; and most of the time, safety is not among the basic levels of human needs.
2. The opportunity cost for the poor is less than for the rich when they face the same danger and potential loss. Who can stand more risk and lack of safety, the poor or the rich? This is a two-way argument.

So the practical question is: can we validate these observations with some statistical or econometric methods?

A good Illustration of Weighted Regression by Peter Kennedy

Measurement Error

In parametrics, the assumption of fixed regressors is made mainly for mathematical convenience; if the regressors can be considered to be fixed in repeated samples, the desirable properties of the OLS estimator can be derived quite straightforwardly. The essence of this assumption is that, if the regressors are nonstochastic, they are distributed independently of the disturbances. If this assumption is weakened to allow the explanatory variables to be stochastic but distributed independently of the error term, all the desirable properties of the OLS estimator are maintained; their algebraic derivation is more complicated, however, and their interpretation in some instances must be changed (for example, in this circumstance βOLS is not, strictly speaking, a linear estimator).

If the regressors are only contemporaneously uncorrelated with the disturbance vector, the OLS estimator is biased but retains its desirable asymptotic properties, at the expense of the small-sample properties of βOLS. If the regressors are contemporaneously correlated with the error term, the OLS estimator is even asymptotically biased.

When there exists contemporaneous correlation between the disturbance and a regressor, alternative estimators with desirable small-sample properties cannot in general be found; as a consequence, the search for alternative estimators is conducted on the basis of their asymptotic properties. The most common estimator used in this context is the instrumental variable (IV) estimator.
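
A short simulation of the classical errors-in-variables case, with an IV fix, may help; this is a generic illustration of my own, with 2SLS done by hand via two lm() calls so that no extra package is needed:

    # True model: y = 1 + 2*xstar + e, but we only observe x = xstar + noise, so the
    # observed regressor is contemporaneously correlated with the composite error.
    set.seed(8)
    n     <- 5000
    xstar <- rnorm(n)
    x     <- xstar + rnorm(n, sd = 0.7)    # mismeasured regressor
    z     <- xstar + rnorm(n, sd = 0.7)    # second noisy measure, used as instrument
    y     <- 1 + 2 * xstar + rnorm(n)

    coef(lm(y ~ x))              # OLS: slope biased toward zero (attenuation)

    xhat <- fitted(lm(x ~ z))    # first stage of 2SLS
    coef(lm(y ~ xhat))           # IV/2SLS by hand: slope close to 2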

Mar 1, 2010

The Nature of Agricultural Economics - My Thought

The strength of agricultural economics is not that it can compete with general economics research. People may feel that general economics research is more decent, and this is true in the sense that it produces pretty neat and nice work with the help of mathematical notation. Mathematics is important, this should be admitted; it is the logical language of this world. So while general economics research has devoted itself to the effort of expressing the world, agricultural economics research should be dedicated to staying closer to the real world, to giving more care to people and the entire world's basic needs.

There is a movie which has been on screens for a while, Food, Inc. (2008), an American documentary film directed by Emmy Award-winning filmmaker Robert Kenner. The film examines large-scale agricultural food production in the United States, concluding that the meat and vegetables produced by this type of economic enterprise have many hidden costs and are unhealthy and environmentally harmful. The documentary generated extensive controversy in that it was heavily criticized by large American corporations engaged in industrial food production. This is just an example, so the question is: after all the popular cost-benefit analysis and its derivative forms and combinations, who really cares about a problem like the one above? Most of the time, optimization, maximization, equilibrium and so forth are too perfect to be practical in applications; at other times, human activity and interaction are so diverse that it is not enough, or even not necessary, to follow a cost-benefit logic, especially when you have a hard time identifying who are the beneficiaries and who are the victims.

Here is a word I want to share with everyone: we can only and will only win the world by love and responsibility, not by proving; because essentially everything can be proved while nothing cannot be proved eventually.
                                              -- Haoying Wang, 2010

The Nature of Agricultural Economics (1)

The nature, foundation, structure and future of agricultural economics have been of concern for a long time. Even though it has been noticed that, since the 1990s, agricultural economics and its education have been experiencing a downturn, the field still holds the frontier of applied econometrics and environmental economics, which are areas heading into the future. Looking back twenty years, that is where many of these concerns and thoughts accumulated.

Agricultural Economics is Applied. Agricultural economics is by its very nature an applied discipline - a discipline that focuses on the application of economic principles taken from general economics to practical, applied problems based on keen observation of the behavior of individuals, groups and institutions within an economic setting. Some agricultural economists argue that, despite its reliance on economic theory, nearly all the research being conducted by agricultural economists is applied - in that the research has as its core basis observable economic phenomena based upon human behavior. Like theoretical physics related to the origins of the universe, much of the most advanced economic research being conducted in what are regarded as the best economics graduate schools has little grounding in observable economic phenomena, and consists of abstract mathematical proofs of economic theories that are seldom verifiable based on data gathered from the real world.
                                                        --David L. Debertin, 1999.


There is decreasing diversity among economics departments with respect to what is taught among the top-ten schools - that because of the inter-hiring only within the small group of schools thought to be in the peer group, there is little diversity in what is taught or in methodological approaches to research considered acceptable. As I look at the agricultural economics top-ten list, however, I see considerably greater diversity in the kinds of graduate education that would be obtained. An agricultural economics Ph.D. from Purdue would be very different from one obtained from UC-Berkeley, and no one would characterize a North Carolina State ag. econ. Ph.D. as being a clone of one produced by UW-Madison! In my view-the diversity of these graduate programs, along with the additional diversity contained in lower-ranked schools--is a source of great strength in agricultural economics, not a weakness.
                                                        --David L. Debertin, 1999.

Feb 28, 2010

Why Generalized Least Square Estimator?

It is known that heteroskedasticity affects the properties of the OLS estimator (it is still unbiased, but less efficient, i.e., it has a larger variance). When you draw a scatter plot of the raw data, higher absolute values of the residuals to the right in the graph indicate that there is a positive relationship between the error variance and the independent variable. With this kind of error pattern, a few additional large positive errors near the right of the graph would tilt (make something move into a position with one side or end higher than the other) the OLS regression line considerably. A few additional large negative errors would tilt it considerably in the opposite direction. In repeated sampling these unusual cases would average out, leaving the OLS estimator unbiased, but the variation of the OLS regression line around its mean will be greater - i.e., the variance of βOLS will be greater. The Generalized Least Squares (GLS) technique pays less attention to the residuals associated with high-variance observations (by assigning them a low weight in the weighted sum of squared residuals it minimizes), since these observations give a less precise indication of where the true regression line lies. This avoids these large tilts, making the variance of βGLS smaller than that of βOLS.
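
A minimal weighted least squares sketch of this idea in R (a generic simulation of my own; the weights assume the error variance is proportional to x^2, which in practice the researcher would have to justify):

    # Heteroskedastic errors with variance growing in x: GLS/WLS downweights the
    # high-variance observations relative to OLS.
    set.seed(14)
    x <- runif(200, 1, 10)
    y <- 1 + 2 * x + rnorm(200, sd = 0.5 * x)   # error sd proportional to x

    ols <- lm(y ~ x)                     # unbiased, but not efficient here
    wls <- lm(y ~ x, weights = 1 / x^2)  # GLS when var(e) is proportional to x^2

    summary(ols)$coefficients
    summary(wls)$coefficients            # typically smaller standard errors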

In the case where the Durbin-Watson test indicates autocorrelated errors, it is typically concluded that estimation via feasible GLS is called for. This is not always appropriate, however; the significant value of the Durbin-Watson statistic could result from an omitted explanatory variable, an incorrect functional form, or a dynamic misspecification. Only if a researcher is satisfied that none of these phenomena are responsible for the significant Durbin-Watson statistic value should estimation via feasible GLS proceed.

Feb 27, 2010

Two Nonparametrics

In the world of econometrics, the term nonparametric basically refers to the flexible functional form of the regression curve. However, there are other notions of "nonparametric statistics" which refer mostly to distribution-free methods. In the econometric context, generally, neither the error distribution nor the functional form of the mean function is prespecified.
Between parametric econometrics and nonparametric econometrics, the question of which approach should be taken in data analysis was a key issue in a bitter fight between Pearson and Fisher in the twenties. Fisher pointed out that the nonparametric approach gave generally poor efficiency, whereas Pearson was more concerned about the specification question. Both viewpoints are interesting in their own right. Pearson pointed out that the price we have to pay for pure parametric fitting is the possibility of gross misspecification, resulting in too high a model bias. On the other hand, Fisher was concerned that too pure a reliance on parameter-free models may result in more variable estimates, especially for a small sample size n.

Orthogonality in Econometrics

In mathematics, two vectors are orthogonal if they are perpendicular, i.e., they form a right angle.

In linear algebra, an orthogonal matrix is a square matrix with real entries whose columns (or rows) are orthogonal unit vectors (i.e., orthonormal). Because the columns are unit vectors in addition to being orthogonal, some people use the term orthonormal to describe such matrices.
Equivalently, a matrix Q is orthogonal if its transpose is equal to its inverse:

Q^T Q = Q Q^T = I,   or equivalently,   Q^T = Q^{-1}.

The concept of orthogonality tends to be very important in econometrics, since almost all of our methods and rules are built on a matrix platform. For example, if a relevant independent variable is omitted, the OLS estimator of the coefficients of the remaining variables is, in general, biased. If the omitted variable is orthogonal to the included variables, the slope coefficient estimators will be unbiased; the intercept estimator will retain its bias unless the mean of the observations on the omitted variable is zero.
In the case of inclusion of an irrelevant variable, unless the irrelevant variable is orthogonal to the other independent variables, the variance-covariance matrix of βOLS becomes larger, and the OLS estimator is not as efficient. Thus in this case the MSE of the estimator is unequivocally raised.
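A minimal R simulation (made-up data-generating processes) of the omitted-variable point above: the bias shows up when the omitted regressor is correlated with the included one, and disappears when it is orthogonal.

set.seed(4)
n <- 5000
x1 <- rnorm(n)
x2_corr <- 0.8 * x1 + rnorm(n)    # correlated with x1
x2_orth <- rnorm(n)               # (approximately) orthogonal to x1
y_corr <- 1 + 2 * x1 + 1.5 * x2_corr + rnorm(n)
y_orth <- 1 + 2 * x1 + 1.5 * x2_orth + rnorm(n)
coef(lm(y_corr ~ x1))["x1"]       # biased away from the true value of 2
coef(lm(y_orth ~ x1))["x1"]       # close to 2 despite the omission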

Feb 26, 2010

Borrow 500 Years of Life from the Heaven

Lyrics:     Junyi Zhang, Xiaobin Fan
Composition: Ke Fu
Translation: Haoying Wang

♣ Along the gentle waviness of rising and subsiding territory


♣ Galloping on the beloved land, beloved plateau and Yangtze South


♣ In the face of ice blade and sword, accompanied by attaching wind and rain


♣ Being cherished of my golden life from heaven


♣ And full of fraternity all along


♣ Being afraid of nothing


♣ And full of lofty sentiments all along


♣ Life is always of half pain and half enjoyment


♣ But with distinct cut between good and evil


♣ All come true in the dream for future


♣ Clanking iron heel, Never stops on the vast beloved land


♣ Standing on the top of surge, and holding


♣ The movement of universe


♣ Praying for the world of mortals


♣ Full of peace and bliss


♣ And another 500 Years from the Heaven for me


♣ Another 500 Years from the Heaven for me

Feb 25, 2010

Specification Problems and Empirical Study

Peter Kennedy wrote: Econometric textbooks are mainly devoted to the exposition of econometrics for estimation and inference in the context of a given model for the data-generating process. The more important problem of specification of this model is not given much attention, for three main reasons: (1) specification is not easy; (2) most econometricians would agree that specification is an innovative/imaginative process that cannot be taught; (3) there is no accepted "best" way of going about finding a correct specification. (Of course, this is why we can always contribute something here; it is simply too hard to find a best and perfect way of specification.)

So the issue becomes how much trust we have in econometrics; different people express it in different ways:
All models are wrong, but some are useful. - George Box
Models are to be used, but not to be believed. - H. Theil


Here is what Edward E. Leamer contributed to the discussion:
When an inference is suspected to depend crucially on a doubtful assumption, two kinds of actions can be taken to alleviate the consequent doubt about the inferences. Both require a list of alternative assumptions. The first approach is statistical estimation, which uses the data to select from the list of alternative assumptions and then makes suitable adjustments to the inferences to allow for doubt about the assumptions. The second approach is a sensitivity analysis that uses the alternative assumptions one at a time, thereby demonstrating either that all the alternatives lead to essentially the same inferences or that minor changes in the assumptions make major changes in the inferences. For example, a doubtful variable can simply be included in the equation (estimation), or two different equations can be estimated, one with and one without the doubtful variable (sensitivity analysis).
Simplification is a third approach. The intent of simplification is to find a simple model that works well for a class of decisions. A specification search can be used for simplification, as well as for estimation and sensitivity analysis. The very prevalent confusion among these three kinds of searches ought to be eliminated, since the rules for a search and the measures of its success will properly depend on its intent.

Again, Peter Kennedy gave the following summary:
♣ Models whose residuals do not test as insignificantly different from white noise (random errors) should be initially viewed as containing a misspecification, not as needing a special estimation procedure.
♣ "Testing down" is more suitable than "Testing up"; one should begin with a general, unrestricted model and then systematically simplify it in light of the sample evidence.
♣ Tests of misspecification are better undertaken by testing simultaneously for several misspecifications rather than testing for these misspecifications one by one.

Likelihood Ratio, Wald, Lagrange Multiplier Tests

The F test is applicable whenever we are testing linear restrictions in the classical normal linear regression model. However, if (1) the restrictions are nonlinear, (2) the model is nonlinear in the parameters, or (3) the errors are distributed non-normally, then we need other, asymptotically equivalent tests.

Suppose the restriction being tested is written as g(β) = 0, satisfied at the value βMLE-R where the function g(β) cuts the horizontal axis (please refer to the graph at the bottom). Then we have three asymptotically equivalent tests available for testing and inference, all of them distributed asymptotically as chi-square with degrees of freedom equal to the number of restrictions being tested.

(1) The Likelihood Ratio Test: if the restriction is true, then ln(LR), the maximized value of ln(L) imposing the restriction, should not be significantly less than ln(Lmax), the unrestricted maximum value of ln(L). The Likelihood Ratio test tests whether [ln(LR) - ln(Lmax)] is significantly different from zero.

(2) Wald Test: if the restriction g(β)=0 is true, then g(βMLE) should not be significantly different from zero. The Wald test tests whether βMLE (the unrestricted estimate of β) violates the restriction by a significant amount.

(3) Lagrange Multiplier Test: the log-likelihood function ln(L) is maximized at point A, where the slope of ln(L) with respect to β is zero. If the restriction is true, then the slope of ln(L) at point B should not be significantly different from zero. The Lagrange Multiplier test tests whether the slope of ln(L), evaluated at the restricted estimate, is significantly different from zero.

Graph for reference: [figure not reproduced here; it plots ln(L) against β, with the unrestricted maximum at point A and the restricted estimate βMLE-R at point B]
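Since the graph cannot be reproduced here, a minimal R sketch (simulated data) of the likelihood-ratio version of the idea, testing the single restriction that the coefficient on x2 is zero:

set.seed(5)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.5 * x1 + rnorm(n)              # the restriction beta2 = 0 is true here
unres <- lm(y ~ x1 + x2)                  # unrestricted model
res <- lm(y ~ x1)                         # restricted model (restriction imposed)
LR <- as.numeric(2 * (logLik(unres) - logLik(res)))
c(statistic = LR, p.value = pchisq(LR, df = 1, lower.tail = FALSE))   # chi-square, 1 df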

Feb 24, 2010

Future and Complexity

--For the Understanding of Environmental Economics and Studies Concerned


I believe that man has the power, the intelligence, and the imagination to extricate himself from the serious predicament that now confronts him. The necessary first step toward wise action in the future is to obtain an understanding of the problems that exist. This in turn necessitates an understanding of the relationships between man, his natural environment, and his technology.
                                                            -Ocho Rios, Jamaica, April 1953.

In principle, the vast knowledge we have accumulated during the last 150 years makes it possible for us to look into the future with considerably more accuracy than could Malthus. But in actual fact we are dealing with an extremely complex problem which cuts across all of our major fields of inquiry and which, because of this, is difficult to unravel (to explain something that is difficult to understand or is mysterious) in all of its interlocking aspects. The complexity of the problem, our confusion, and our prejudices, have combined to form a dense fog that has obscured the most important features of the problem from our view - a fog which is in certain respects even more dense than that which existed in Malthus’ time. As a result, the basic factors that are determining the future are not generally known or appreciated.

In spite of the complexity of the problem which confronts us, its overwhelming importance, both to ourselves and to our descendants, warrants our dissecting it as objectively as possible. In doing so we must put aside our hatreds, desires, and prejudices, and look calmly upon the past and present. If we are successful in lifting ourselves from the morass (an unpleasant and complicated situation that is difficult to escape from) of irrelevant fact and opinion and in divorcing ourselves from our preconceived ideas, we will be able to see mankind both in perspective and in relation to his environment. In turn we will be able to appreciate something of the fundamental physical limitations to man’s future development and of the hazards which will confront him in the years and centuries ahead.

Feb 23, 2010

Rejection From Yale

2/23/2010

Dear Mr. Wang:

Thank you very much for applying to the Graduate School of Arts and Sciences at Yale University. I regret to inform you that we are unable to offer you admission. As you know, the very high number of extraordinary candidates among our 10,400 applicants far exceeds the number of places we have in each program, and we are not able to admit many excellent candidates.

We are using this system of electronic notification to communicate with you five to ten days more rapidly than we could by letter and, therefore, help applicants plan their futures quickly and effectively. We wish you every success in all your endeavors.

Sincerely,

Jon Butler
Dean of the Graduate School

Why Student's T-test? (Part 2)

An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question. 
--The first golden rule of mathematics, sometimes attributed to John Tukey

With many calculations, one can win; with few one cannot. How much less chance of victory has one who makes none at all! 
--Sun Tzu 'Art of War'

The T-test may be used to compare the means of a criterion variable for two independent samples, for two dependent samples (ex., before-after studies, matched-pairs studies), or between a sample mean and a known mean (one-sample T-test). In regression analysis, a T-test can be used to test any single linear constraint. Nonlinear constraints are usually tested with a Wald, LR, or LM test, but sometimes an "asymptotic" T-test is encountered: the nonlinear constraint is written with its right-hand side equal to zero, and the left-hand side is estimated and then divided by the square root of an estimate of its asymptotic variance to produce the asymptotic T statistic.

For example, here is the formula to test mean difference for the case of equal sample sizes, n, in both groups:

Let E be the experimental condition and let C be the control condition. Let m denote the means, s the standard deviations, and n the sample size in each group. Then
t = (m_E - m_C) / sqrt[ (s_E^2 + s_C^2) / n ]
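A minimal R check of that formula on made-up data (E and C are just simulated groups), compared against the built-in pooled-variance test:

set.seed(6)
n <- 25
E <- rnorm(n, mean = 12, sd = 3)     # experimental group
C <- rnorm(n, mean = 10, sd = 3)     # control group
t_manual <- (mean(E) - mean(C)) / sqrt((var(E) + var(C)) / n)
t_manual
t.test(E, C, var.equal = TRUE)$statistic   # matches t_manual when group sizes are equal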

Three Different Types of T-test:

(1) One-sample T-tests test whether the mean of one variable differs from a constant (ex., does the mean grade of 72 for a sample of students differ significantly from the passing grade of 70?). When p<.05 the researcher concludes the group mean is significantly different from the constant.

(2) Independent sample T-tests are used to compare the means of two independently sampled groups (ex., do those working in high noise differ on a performance variable from those working in low noise, where individuals are randomly assigned to the high-noise or low-noise group?). When p<.05 the researcher concludes the two groups are significantly different in their means. This test is often used to compare the means of two groups in the same sample (ex., men vs. women) even though individuals are not (in the case of gender, cannot be) assigned randomly to the two groups (to "men" and to "women"). Random assignment would have controlled for unmeasured variables, so its absence opens up the possibility that other variables either mask or enhance any apparent significant difference in means. That is, the independent sample T-test tests the uncontrolled difference in means between the two groups. If a significant difference is found, it may be due not just to gender; control variables may be at work. The researcher will wish to introduce control variables, as in any multivariate analysis.

(3) Paired sample T-tests compare means where the two groups are correlated, as in before-after, repeated measures, matched-pairs, or case-control studies (ex., mean candidate evaluations before and after hearing a speech by the candidate). The algorithm applied to the data is different from the independent sample t-test, but interpretation of output is otherwise the same.
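The three variants map directly onto t.test() in R; here is a minimal sketch on made-up data (all numbers below are invented for illustration):

set.seed(7)
grades <- rnorm(30, mean = 72, sd = 8)
t.test(grades, mu = 70)                        # (1) one-sample, against the constant 70
high_noise <- rnorm(20, mean = 95, sd = 10)
low_noise <- rnorm(20, mean = 100, sd = 10)
t.test(high_noise, low_noise)                  # (2) independent samples (Welch by default)
before <- rnorm(15, mean = 50, sd = 5)
after <- before + rnorm(15, mean = 2, sd = 3)
t.test(after, before, paired = TRUE)           # (3) paired samples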

Associated Assumptions:

(1) Approximately Normal Distribution of the measure in the two groups is assumed. There are tests for normality. The T-test may be unreliable when the two samples come from distributions with widely different shapes (see Gardner, 1975). Moore (1995) suggests that data for T-tests should be normally distributed for sample sizes less than 15, and approximately normal and without outliers for samples between 15 and 40; the data may be markedly skewed when the sample size is greater than 40.

(2) Roughly Similar Variances: There is a test for homogeneity of variance, also called a test of homoscedasticity. In SPSS homogeneity of variances is tested by "Levene's Test for Equality of Variances", with F value and corresponding significance. There are also other tests for homogeneity of variances. The T-test may be unreliable when the two samples are unequal in size and also have unequal variances (see Gardner, 1975). 

(3) Dependent/Independent Samples. The samples may be independent or dependent (ex., before-after, matched pairs). However, the calculation of T differs accordingly. In the one-sample test, it is assumed that the observations are independent. 
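A minimal R sketch of checking assumption (2) before choosing between the pooled and Welch forms of the test (var.test() is the base-R F test for equality of two variances; Levene's test lives in add-on packages such as car):

set.seed(8)
a <- rnorm(30, sd = 1)
b <- rnorm(30, sd = 2)
var.test(a, b)                  # H0: equal variances
t.test(a, b, var.equal = TRUE)  # pooled T-test, relies on that assumption
t.test(a, b)                    # Welch T-test, does not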

One last note: don't confuse a T-test with the analysis of a contingency table (Fisher's exact or chi-square test). Use a T-test to compare a continuous variable (e.g., blood pressure or weight). Use a contingency table to compare a categorical variable (e.g., pass vs. fail, viable vs. not viable).

Reference:
Gardner, P. L. (1975). Scales and statistics. Review of Educational Research, 45: 43-57. (Discusses assumptions of the T-test.)
Moore, D. S. (1995). The Basic Practice of Statistics. New York: Freeman and Co.

Feb 22, 2010

Why Student's T-test? (Part 1)

Here I am trying to answer two questions for myself:

1. What is the difference between Z-test and T-test?
2. Why do we need Student's T-test?

First, let's be clear on the Z-test vs. the T-test. A common rule of thumb is that the Z-test is used when the sample size is more than 30, while the T-test is used when the sample size is less than 30. Now let's get back to the story:

Sometimes, measuring every single item is just not practical. That is why we developed, and use, statistical methods to solve problems. The most practical approach is to measure just a sample of the population. Some methods test hypotheses by comparison. Two of the better-known statistical hypothesis tests are the T-test and the Z-test. Let's try to break down the two.

Strictly speaking, the Z-test is a test for populations rather than samples. In the real world, though, either test will give you a pretty close answer. Using the T-test is more accurate because the sample standard deviation is specific and tailored to the sample you are studying. When using a T-test of significance, it is assumed that the observations come from a population that follows a Normal distribution. This is often true for data influenced by random fluctuations in environmental conditions or random measurement errors. The T-distribution is essentially a corrected version of the normal distribution for the case in which the population variance is unknown and is therefore estimated by the sample standard deviation.

There are various T-tests, and the two most commonly applied are the one-sample and paired-sample T-tests. One-sample T-tests are used to compare a sample mean with a known population mean. Two-sample T-tests, on the other hand, are used to compare either independent samples or dependent samples.

As mentioned above, the T-test is best applied, at least in theory, when you have a limited sample size (n < 30), as long as the variables are approximately normally distributed and the variation of values in the two groups is not reliably different. It is also useful when you do not know the population standard deviation. If the standard deviation is known, it would be better to use another type of statistical test, the Z-test. The Z-test is also applied to compare sample and population means to see whether there is a significant difference between them. Z-tests always use the normal distribution and are ideally applied when the standard deviation is known. Z-tests are applied when certain conditions are met; otherwise, other statistical tests like T-tests are applied instead. Z-tests are often applied to large samples (n > 30). When the T-test is used in large samples, it becomes very similar to the Z-test. Fluctuations in the sample variance affect the T-test in a way they do not affect the Z-test, which is why the two test results can differ.
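A quick R illustration of that convergence: the T critical value approaches the Z critical value as the degrees of freedom grow.

df <- c(5, 10, 30, 100, 1000)
cbind(df, t_crit = qt(0.975, df), z_crit = qnorm(0.975))   # t_crit shrinks toward 1.96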

Summary:


1. Z-test is a statistical hypothesis test that follows a normal distribution while T-test follows a Student’s T-distribution.
2. A T-test is appropriate when you are handling small samples (n < 30) while a Z-test is appropriate when you are handling moderate to large samples (n > 30).
3. T-test is more adaptable than Z-test since Z-test will often require certain conditions to be reliable. Additionally, T-test has many methods that will suit any need.
4. T-tests are more commonly used than Z-tests.
5. Z-tests are preferred to T-tests when population standard deviations are known.

Feb 21, 2010

Maybe We Just Need a New Word: Gadget

"Samsung has just announced at Barcelona a new cell phone, the Beam, that they expect to have on the market this summer. Its special feature is a built-in pico projector, making it a combination cell phone and (very wimpy) video projector. A cute gadget, although not one that I am likely to have much use for. I do, however, have one suggestion for improving it."

Reading through this news, I happened to become interested in the word gadget. Two similar definitions can easily be found in a dictionary:
(1) an often small mechanical or electronic device with a practical use but often thought of as a novelty;
(2) any object that is interesting for its ingenuity or novelty rather than for its practical use.

So a question comes to me: have we been proposing and digging up gadgets in econometrics and economics? This may be a 'gadget' question, but it is definitely not a 'gadget' issue. Too many people are publishing papers whose author(s) will probably be the only and last careful readers. So why do we spend one or two years, even three, to invent such a "gadget"? For tenure, for promotion, or just for fun (a self-understanding of the subject)? Maybe it is just to satisfy a popular social demand for vanity; maybe it is just an indispensable part of the system. Who knows?



 