For longitudinal studies, the data sets of different survey data collections need to be combined.
The single data sets can easily be appended as variable names and categories have already been harmonized across all data collections.
For the person long format, different matching strategies can be chosen, depending on the desired structure of the combined dataset: ‘long’ (several rows per person, one for each data collection, and one column per variable) or ‘wide’ (one row per person and one column per variable and data collection).
In the following we provide syntax for Stata and SPSS for both cases. Especially for the family wide format, it is strongly recommended to only use and merge the variables that are needed for the analyses in order to limit the size of the final data set.
Matching data files in Stata
1. Person long format, ‘long’ (one row per data collection for each person and one column per variable). The following example for combining the two face-to-face data collections F2F1 and F2F2 can be customized:
cd "path" // navigate into the folder where the data is stored
use ZA6701_person_wid1_v4-0-0.dta // fill in the name of the data set in the version you are using
append using ZA6701_person_wid3_v4-0-0
append using … // optionally append further files of all data collections you want to use for longitudinal analysis
You can also limit the data to the variables you want to analyze by using the following command:
use varlist using ZA6701_person_wid1_v4-0-0.dta // replace ‘varlist’ by the list of variables you want to use for your analyses
2. Person long format, ‘wide’ (one row per person over all data collections and one column for each data collection and variable). Append the data of the data collections you want to analyze using the procedure described in 1):
cd "path"
use ZA6701_person_wid1_v4-0-0.dta
append using ZA6701_person_wid3_v4-0-0
Use the -reshape- command in order to get the person-wide format:
local varselect "varlist" // select list of variables that need to be converted from long to wide form
rename (`varselect') =_ // add suffix
reshape wide *_, i(pid) j(wid) // convert selected variables from long to wide form
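The long-to-wide conversion can also be illustrated outside Stata. The following is a minimal Python sketch of the idea only, not a replacement for -reshape-: the function name reshape_wide, the variables age and height, and all values are hypothetical; only pid and wid are taken from the data described here.

```python
# Hypothetical rows in person long format: one row per person (pid)
# and data collection (wid). The variables and values are made up.
long_rows = [
    {"pid": 1, "wid": 1, "age": 10, "height": 140},
    {"pid": 1, "wid": 3, "age": 12, "height": 152},
    {"pid": 2, "wid": 1, "age": 11, "height": 145},
]


def reshape_wide(rows, i="pid", j="wid"):
    """Return one dict per value of `i`, with columns named <variable>_<j>."""
    wide = {}
    for row in rows:
        # Start a new wide record the first time an identifier is seen.
        record = wide.setdefault(row[i], {i: row[i]})
        for name, value in row.items():
            if name not in (i, j):
                record[f"{name}_{row[j]}"] = value
    return [wide[key] for key in sorted(wide)]


wide_rows = reshape_wide(long_rows)
for record in wide_rows:
    print(record)
```

Unlike Stata, which fills the columns of a data collection a person did not take part in with missing values, this sketch simply leaves those columns out of the record.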
3. Family wide format, ‘wide’ (one row per family over all data collections and separate columns for variables per person and data collection).
For analyses using the family wide format with Stata, use the -merge- command with the family identifier fid.
cd "path" // navigate into the folder where the data is stored
use varlist using ZA6701_family_wide_wid1_v4-0-0.dta // replace ‘varlist’ by the list of variables you want to use for your analyses (the list must include fid)
merge 1:1 fid using ZA6701_family_wide_wid3_v4-0-0.dta, keepusing(varlist) // replace ‘varlist’ by the list of variables you want to use for your analyses
Matching data files in SPSS
1. Person long format, ‘long’ (one row per data collection for each person and one column per variable). The following example for combining the two face-to-face data collections F2F1 and F2F2 can be customized:
add files
/file= 'C:\...\SUF_4-0-0_beta_04052020\ZA6701_en_person_wid1_v4-0-0.sav'
/file= 'C:\...\SUF_4-0-0_beta_04052020\ZA6701_en_person_wid3_v4-0-0.sav'.
save outfile = 'C:\...\SUF_4-0-0_beta_04052020\en_person_wid13_match.sav'.
exe.
2. Person long format and Family wide format, ‘wide’ (one row per person/family and one column for each data collection and variable; one row per family over all data collections and separate columns for variables per person and data collection).
If the combined data needs to be in wide format, it is important that all variables (except for the matching variables) in every dataset have a data collection-specific suffix. In the person format, this suffix has to be created for all variables except pid before matching. In the family format, wave-specific suffixes are already provided (except for the variables wav0100, cgr and zyg0102, which are time-stable and therefore identical in all waves). Variable suffixes can easily be created using the Python plugin. The following code can be customized to do this:
begin program.
variables = 'all' # define the variables that should get a suffix (e.g. 'all', 'x, y, z', 'x to y').
suffix = '_1' # enter the chosen suffix.
import spss, spssaux
oldnames = spssaux.VariableDict().expand(variables)
newnames = [varnam + suffix for varnam in oldnames]
spss.Submit('rename variables(%s=%s).'%('\n'.join(oldnames),'\n'.join(newnames)))
end program.
When each dataset has data collection-specific suffixes, all datasets must be sorted by the matching variable: datasets in person format by pid, datasets in family format by fid (see chapter 3.5). To finally combine two data sets, the following code can be customized:
sort cases by pid.
match files
/file= 'C:\...\SUF_4-0-0_beta_04052020\ZA6701_en_person_wid1_v4-0-0.sav'
/file= 'C:\...\SUF_4-0-0_beta_04052020\ZA6701_en_person_wid2_v4-0-0.sav'
/by pid.
save outfile= 'C:\...\SUF_4-0-0_beta_04052020\en_person_wid12_match.sav'.
exe.
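The two SPSS steps for the wide format (adding a data collection-specific suffix to every variable except the matching variable, then matching the files by that variable) can be sketched in plain Python. This is an illustration under stated assumptions only: the helper names add_suffix and match_by_key, the variable score, the suffix '_2' and all values are hypothetical and not part of the data or of SPSS.

```python
def add_suffix(rows, suffix, key="pid"):
    """Append `suffix` to every variable name except the matching variable."""
    return [
        {(name if name == key else name + suffix): value for name, value in row.items()}
        for row in rows
    ]


def match_by_key(left, right, key="pid"):
    """Combine two datasets into one record per value of the matching variable."""
    merged = {}
    for row in left + right:
        merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


# Hypothetical data from two data collections (made-up variable and values).
collection_1 = [{"pid": 1, "score": 5}, {"pid": 2, "score": 7}]
collection_2 = [{"pid": 2, "score": 8}, {"pid": 3, "score": 6}]

combined = match_by_key(add_suffix(collection_1, "_1"), add_suffix(collection_2, "_2"))
for record in combined:
    print(record)
```

Without the suffixes, score from the second data collection would overwrite score from the first for every person present in both, which is why the suffix has to be added before matching. Persons who took part in only one data collection keep a record in which the other data collection's variables are absent (SPSS would show them as missing).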