Stata Basics: Combine Data (Append and Merge) (2022)

When I first started working with data, which was in a statistics class, we mostly used clean and completed dataset as examples. Later on, I realize it’s not always the case when doing research or data analysis for other purposes; in reality, we often need to put two or more dataset together to be able to begin whatever statistic analysis tasks we would like to perform. In this post, I demonstrate how to combine datasets into one file in two typical ways: append and merge, that are row-wise combining and column-wise combining, respectively.

Append data: -append-

Say you would like to stack one of your data file on top of another, then you can use the -append- command to do so. Usually the data files we would like to append contain the same variables, so let’s create two fictional data files, each of them has 4 variables: id, character name, character family and numbers of episode the character appeared in.

* Set working directory> cd [YOUR PATH] * create dataset 1> clear> input id str8 name str9 family epi id name family epi 1. 1 "Arya" "Stark" 33 2. 2 "Cersei" "Lannister" 36 3. 3 "Ned" "Stark" 11 4. end> save got1, replacefile got1.dta saved> list +-------------------------------+ | id name family epi | |-------------------------------| 1. | 1 Arya Stark 33 | 2. | 2 Cersei Lannister 36 | 3. | 3 Ned Stark 11 | +-------------------------------+* create dataset 2> clear> input id str8 name str9 family epi id name family epi 1. 5 "Robert" "Baratheon" 7 2. 4 "Jon" "Stark" 32 3. 6 "Tyrion" "Lannister" 36 4. end> save got2, replacefile got2.dta saved> list +-------------------------------+ | id name family epi | |-------------------------------| 1. | 5 Robert Baratheon 7 | 2. | 4 Jon Stark 32 | 3. | 6 Tyrion Lannister 36 | +-------------------------------+* combine the two datasets and see the results> use got1, clear> append using got2> list +-------------------------------+ | id name family epi | |-------------------------------| 1. | 1 Arya Stark 33 | 2. | 2 Cersei Lannister 36 | 3. | 3 Ned Stark 11 | 4. | 5 Robert Baratheon 7 | 5. | 4 Jon Stark 32 | |-------------------------------| 6. | 6 Tyrion Lannister 36 | +-------------------------------+

The combined dataset looks right to me, however we are not able to tell which dataset the observations come from. In some cases this may cause some inconvenience in tracing back to the original files or even problems in data analysis – say, in this case, if got1 and got2 contain records from two different seasons, we should mark that in the combined dataset. We can simply do this by generating a variable indicating season before we append them.

> use got1, clear> generate season=1> save got1, replacefile got1.dta saved> use got2, clear> generate season=2> save got2, replacefile got2.dta saved> use got1, clear> append using got2> list +----------------------------------------+ | id name family epi season | |----------------------------------------| 1. | 1 Arya Stark 33 1 | 2. | 2 Cersei Lannister 36 1 | 3. | 3 Ned Stark 11 1 | 4. | 5 Robert Baratheon 7 2 | 5. | 4 Jon Stark 32 2 | |----------------------------------------| 6. | 6 Tyrion Lannister 36 2 | +----------------------------------------+> save got3, replacefile got3.dta saved

Now we have a combined dataset with a variable indicating which original dataset the observations come from – although this dataset is officially fictional, as Robert Baratheon was not seen in season two…

Merge data: -merge-

It is usually pretty straightforward to append data, however it sometimes gets a bit tricky when you need to combine data in a column-wise manner, that is, merge data. Below we use two examples to demonstrate one-to-one merge and one-to-many merge.

One-to-one merge: -merge 1:1-

In the dataset we just appended (got3), we have 5 variables, with the id variable uniquely identifying the 6 observations in the data. Say we have another data file contains the id variable and the same 6 observations, but with a new variable called status – in other words, a new column. In this case, if we want to combine this new data file to got3, we should use one-to-one merge to match the records in the two files.

* First, we create the new data file with id and the new variable status> clear > input id status id status 1. 1 1 2. 2 1 3. 3 0 4. 4 1 5. 6 1 6. 5 0 7. end > list +-------------+ | id status | |-------------| 1. | 1 1 | 2. | 2 1 | 3. | 3 0 | 4. | 4 1 | 5. | 6 1 | |-------------| 6. | 5 0 | +-------------+> save got4, replacefile got4.dta saved* sort observations by id in got3> use got3, clear> sort id> list +----------------------------------------+ | id name family epi season | |----------------------------------------| 1. | 1 Arya Stark 33 1 | 2. | 2 Cersei Lannister 36 1 | 3. | 3 Ned Stark 11 1 | 4. | 4 Jon Stark 32 2 | 5. | 5 Robert Baratheon 7 2 | |----------------------------------------| 6. | 6 Tyrion Lannister 36 2 | +----------------------------------------+> save got3m, replacefile got3m.dta saved* sort observations by id in got4> use got4, clear> sort id> list +-------------+ | id status | |-------------| 1. | 1 1 | 2. | 2 1 | 3. | 3 0 | 4. | 4 1 | 5. | 5 0 | |-------------| 6. | 6 1 | +-------------+> save got4m, replacefile got4m.dta saved* merge the two files, we base this merge on the id variable in both files> use got3m, clear> merge 1:1 id using got4m Result # of obs. ----------------------------------------- not matched 0 matched 6 (_merge==3) -----------------------------------------> list +---------------------------------------------------------------+ | id name family epi season status _merge | |---------------------------------------------------------------| 1. | 1 Arya Stark 33 1 1 matched (3) | 2. | 2 Cersei Lannister 36 1 1 matched (3) | 3. | 3 Ned Stark 11 1 0 matched (3) | 4. | 4 Jon Stark 32 2 1 matched (3) | 5. | 5 Robert Baratheon 7 2 0 matched (3) | |---------------------------------------------------------------| 6. | 6 Tyrion Lannister 36 2 1 matched (3) | +---------------------------------------------------------------+

Note Stata creates a _merge variable in the merged results, which indicates how the merge was done for each observation. The value of _merge is 1 if the observation comes form file1 (master file) only, 2 if the observation comes from file2 (using file) only, 3 if the observation comes from both of the two files – in other words, 3 means the observation is matched. In this example, we can easily inspect every observation to see if they are matched. If you get more records in a dataset, which we normally do, you can summarize this _merge variable to see if you have any mismatched case.

 > tabulate _merge _merge | Freq. Percent Cum.------------------------+----------------------------------- matched (3) | 6 100.00 100.00------------------------+----------------------------------- Total | 6 100.00

Looks like we have every observation matched in this merging example.

One-to-many merge: -merge 1:m-

Here I show an example of another kind of merge called one-to-many merge. Let’s illustrate when would we need to perform one-to-many merge by combining two sample datasets: one with information of dads, another with records of their kids.

First we create the dads file with family id, family name, dads name and their status, sort the observations by family id.

> clear > input familyid str9 family str8 dname dstatus familyid family dname dstatus 1. 3 "Stark" "Ned" 0 2. 1 "Baratheon" "Robert" 0 3. 2 "Lannister" "Tywin" 1 4. end> list +-----------------------------------------+ | familyid family dname dstatus | |-----------------------------------------| 1. | 3 Stark Ned 0 | 2. | 1 Baratheon Robert 0 | 3. | 2 Lannister Tywin 1 | +-----------------------------------------+> sort familyid> save got5, replacefile got5.dta saved

Then we create the kids file with the same variables, sort by family id as well.

> clear> input familyid str9 family str8 kname kstatus familyid family kname kstatus 1. 2 "Lannister" "Cersei" 1 2. 3 "Stark" "Arya" 1 3. 2 "Lannister" "Tyrion" 1 4. 3 "Stark" "Jon" 1 5. 1 "Baratheon" "Joffrey" 0 6. end> list +------------------------------------------+ | familyid family kname kstatus | |------------------------------------------| 1. | 2 Lannister Cersei 1 | 2. | 3 Stark Arya 1 | 3. | 2 Lannister Tyrion 1 | 4. | 3 Stark Jon 1 | 5. | 1 Baratheon Joffrey 0 | +------------------------------------------+> sort familyid> save got6, replacefile got6.dta saved

Now we have the two files sharing the familyid variable as an identifier, since each dad may have more than one kid, we use one-to-many merge to combine them.

 * use the dads file as master file and kids file as using file> use got5, clear> merge 1:m familyid using got6 Result # of obs. ----------------------------------------- not matched 0 matched 5 (_merge==3) -----------------------------------------> list +---------------------------------------------------------------------------+ | familyid family dname dstatus kname kstatus _merge | |---------------------------------------------------------------------------| 1. | 1 Baratheon Robert 0 Joffrey 0 matched (3) | 2. | 2 Lannister Tywin 1 Cersei 1 matched (3) | 3. | 3 Stark Ned 0 Jon 1 matched (3) | 4. | 2 Lannister Tywin 1 Tyrion 1 matched (3) | 5. | 3 Stark Ned 0 Arya 1 matched (3) | +---------------------------------------------------------------------------+* sort by familyid > sort familyid > list +---------------------------------------------------------------------------+ | familyid family dname dstatus kname kstatus _merge | |---------------------------------------------------------------------------| 1. | 1 Baratheon Robert 0 Joffrey 0 matched (3) | 2. | 2 Lannister Tywin 1 Cersei 1 matched (3) | 3. | 2 Lannister Tywin 1 Tyrion 1 matched (3) | 4. | 3 Stark Ned 0 Arya 1 matched (3) | 5. | 3 Stark Ned 0 Jon 1 matched (3) | +---------------------------------------------------------------------------+

So the steps are really the same for one-to-one and one-to-many merge. Just pick the right one depending on the datasets you’re are going to combine, and what kind of end product you would like to obtain from the merging.

View the entire collection of UVA Library StatLab articles.

Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library

You might also like

Latest Posts

Article information

Author: Jonah Leffler

Last Updated: 08/17/2022

Views: 6608

Rating: 4.4 / 5 (65 voted)

Reviews: 88% of readers found this page helpful

Author information

Name: Jonah Leffler

Birthday: 1997-10-27

Address: 8987 Kieth Ports, Luettgenland, CT 54657-9808

Phone: +2611128251586

Job: Mining Supervisor

Hobby: Worldbuilding, Electronics, Amateur radio, Skiing, Cycling, Jogging, Taxidermy

Introduction: My name is Jonah Leffler, I am a determined, faithful, outstanding, inexpensive, cheerful, determined, smiling person who loves writing and wants to share my knowledge and understanding with you.