Panel data (pooled cross-section and time-series) require special care. Here are some pointers.
Consider a data set composed of observations on each of n cross-sectional units (countries, states, persons or whatever) in each of T periods. Let each observation comprise the values of m variables of interest. The data set then contains mnT values.
The data should be arranged "by observation": each row represents an observation; each column contains the values of a particular variable. The data matrix then has nT rows and m columns. That leaves open the matter of how the rows should be arranged. There are two possibilities.[1]
Rows grouped by unit. Think of the data matrix as composed of n blocks, each having T rows. The first block of T rows contains the observations on cross-sectional unit 1 for each of the periods; the next block contains the observations on unit 2 for all periods; and so on. In effect, the data matrix is a set of time-series data sets, stacked vertically.
Rows grouped by period. Think of the data matrix as composed of T blocks, each having n rows. The first n rows contain the observations for each of the cross-sectional units in period 1; the next block contains the observations for all units in period 2; and so on. The data matrix is a set of cross-sectional data sets, stacked vertically.
You may use whichever arrangement is more convenient. The first is perhaps easier to keep straight. If you use the second then of course you must ensure that the cross-sectional units appear in the same order in each of the period data blocks.
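The two arrangements can be sketched in a few lines of Python. This is only an illustration of the row ordering, not gretl syntax; the helper names row_by_unit and row_by_period are hypothetical, and the sizes n = 20, T = 5 are chosen arbitrarily. Each helper returns the 1-based row number of the observation on unit i in period t.

```python
def row_by_unit(i, t, n, T):
    """1-based row of (unit i, period t) when rows are grouped by unit."""
    return (i - 1) * T + t


def row_by_period(i, t, n, T):
    """1-based row of (unit i, period t) when rows are grouped by period."""
    return (t - 1) * n + i


n, T = 20, 5  # illustrative sizes: 20 units, 5 periods

# The (unit, period) pair held by each row, under each arrangement.
by_unit = [(i, t) for i in range(1, n + 1) for t in range(1, T + 1)]
by_period = [(i, t) for t in range(1, T + 1) for i in range(1, n + 1)]

print(by_unit[:6])    # unit 1 in periods 1..5, then unit 2 in period 1
print(by_period[:3])  # units 1..3 in period 1
```

Either list contains all nT (unit, period) pairs exactly once; only the order differs, which is why the units must appear in the same order within every period block in the second arrangement.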
In either case you can use the frequency field in the observations line of the data header file (see Chapter 5) to make life a little easier.
Grouped by unit: Set the frequency equal to T. Suppose you have observations on 20 units in each of 5 time periods. Then this observations line is appropriate: 5 1.1 20.5 (read: frequency 5, starting with the observation for unit 1, period 1, and ending with the observation for unit 20, period 5). Then, for instance, you can refer to the observation for unit 2 in period 5 as 2.5, and that for unit 13 in period 1 as 13.1.
Grouped by period: Set the frequency equal to n. In this case if you have observations on 20 units in each of 5 periods, the observations line should be: 20 1.01 5.20 (read: frequency 20, starting with the observation for period 1, unit 01, and ending with the observation for period 5, unit 20). One refers to the observation for unit 2, period 5 as 5.02.
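The labelling scheme in the two cases above can be mimicked with a small Python helper. This is a sketch only: obs_label is a hypothetical name, not a gretl function, and the zero-padding rule (pad the part after the dot to the number of digits in the frequency) is an assumption inferred from the examples 1.01 and 5.20.

```python
def obs_label(i, t, n, T, grouped_by="unit"):
    """Label for the observation on unit i in period t.

    grouped_by="unit":   frequency T, label is unit.period
    grouped_by="period": frequency n, label is period.unit
    """
    if grouped_by == "unit":
        major, minor, freq = i, t, T
    elif grouped_by == "period":
        major, minor, freq = t, i, n
    else:
        raise ValueError("grouped_by must be 'unit' or 'period'")
    width = len(str(freq))  # assumed: pad to the digit count of the frequency
    return f"{major}.{minor:0{width}d}"


# 20 units, 5 periods, as in the text
print(obs_label(2, 5, 20, 5, "unit"))     # 2.5
print(obs_label(13, 1, 20, 5, "unit"))    # 13.1
print(obs_label(2, 5, 20, 5, "period"))   # 5.02
```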
If you construct a panel data set in a spreadsheet program and then bring the data into gretl as a CSV import, the program will probably not recognize the special nature of the data at first. You can fix this by using the command setobs (see Chapter 10) or the GUI menu item "Sample, Set frequency, startobs…".
In a panel study you may wish to construct dummy variables of one or both of the following sorts: (a) dummies as unique identifiers for the cross-sectional units, and (b) dummies as unique identifiers of the time periods. The former may be used to allow the intercept of the regression to differ across the units, the latter to allow the intercept to differ across periods.
You can use two special functions to create such dummies. These are found under the "Data, Add variables" menu in the GUI, or under the genr command in script mode or gretlcli.
"periodic dummies" (script command genr dummy). The common use for this command is to create a set of periodic dummy variables up to the data frequency in a time-series study (for instance a set of quarterly dummies for use in seasonal adjustment). But it also works with panel data. Note that the interpretation of the dummies created by this command differs depending on whether the data rows are grouped by unit or by period. If the grouping is by unit (frequency T) the resulting variables are period dummies and there will be T of them. For instance dummy_2 will have value 1 in each data row corresponding to a period 2 observation, 0 otherwise. If the grouping is by period (frequency n) then n unit dummies will be generated: dummy_2 will have value 1 in each data row associated with cross-sectional unit 2, 0 otherwise.
"panel dummies" (script command genr paneldum). This creates all the dummies, unit and period, at a stroke. The default presumption is that the data rows are grouped by unit. The unit dummies are named du_1, du_2 and so on, while the period dummies are named dt_1, dt_2, etc. The u (for unit) and t (for time) in these names will be wrong if the data rows are grouped by period: to get them right in that setting use genr paneldum -o (script mode only).
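To see why the meaning of the periodic dummies depends on the row grouping, the following Python sketch rebuilds both kinds of dummy from the row position alone. It illustrates the logic described above and is not gretl's implementation; the function names periodic_dummies and panel_dummies are hypothetical, and a small panel (n = 3, T = 2) is used so the columns are easy to read.

```python
def periodic_dummies(nobs, freq):
    """Column k is 1 on every row at position k within its block of freq rows."""
    return {
        k: [1 if r % freq == k - 1 else 0 for r in range(nobs)]
        for k in range(1, freq + 1)
    }


def panel_dummies(n, T, grouped_by="unit"):
    """Return (unit dummies, period dummies) as dicts of 0/1 columns."""
    nobs = n * T
    if grouped_by == "unit":      # n blocks of T rows
        unit_of = [r // T + 1 for r in range(nobs)]
        period_of = [r % T + 1 for r in range(nobs)]
    else:                         # T blocks of n rows
        unit_of = [r % n + 1 for r in range(nobs)]
        period_of = [r // n + 1 for r in range(nobs)]
    du = {i: [int(u == i) for u in unit_of] for i in range(1, n + 1)}
    dt = {t: [int(p == t) for p in period_of] for t in range(1, T + 1)}
    return du, dt


n, T = 3, 2
du, dt = panel_dummies(n, T, grouped_by="unit")
# Grouped by unit (frequency T): periodic dummy 2 is the period-2 dummy.
print(periodic_dummies(n * T, T)[2] == dt[2])

du_p, dt_p = panel_dummies(n, T, grouped_by="period")
# Grouped by period (frequency n): periodic dummy 2 is the unit-2 dummy.
print(periodic_dummies(n * T, n)[2] == du_p[2])
```

Both comparisons hold, which is the point made above: the same periodic construction yields period dummies under one grouping and unit dummies under the other.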
If a panel data set has the YEAR of the observation entered as one of the variables you can create a periodic dummy to pick out a particular year, e.g. genr dum = (YEAR=1960). You can also create periodic dummy variables using the modulus operator, %. For instance, to create a dummy with value 1 for the first observation and every thirtieth observation thereafter, 0 otherwise, do
genr index
genr dum = ((index-1)%30) = 0
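The arithmetic behind the modulus construction can be checked with a short Python sketch (an illustration only, not gretl syntax; the sample length of 90 observations is an arbitrary assumption).

```python
nobs = 90                          # assumed sample length, for illustration
index = list(range(1, nobs + 1))   # 1-based observation index

# 1 where (index - 1) is an exact multiple of 30, 0 otherwise
dum = [1 if (i - 1) % 30 == 0 else 0 for i in index]

# Observation numbers at which the dummy equals 1
ones = [i for i, d in zip(index, dum) if d == 1]
print(ones)  # [1, 31, 61]
```

Subtracting 1 before taking the modulus is what places the first 1 at observation 1 rather than at observation 30.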
[1] If you don't intend to make any conceptual or statistical distinction between cross-sectional and temporal variation in the data you can arrange the rows arbitrarily, but this is probably wasteful of information.