Combining DataFrames with Pandas – Data Analysis and Visualization in Python for Ecologists (2022)

Overview

Teaching: 20 min
Exercises: 25 min

Questions

  • Can I work with data from multiple sources?

  • How can I combine data from different data sets?

Objectives

  • Combine data from multiple files into a single DataFrame using merge and concat.

  • Combine two DataFrames using a unique ID found in both DataFrames.

  • Employ to_csv to export a DataFrame in CSV format.

  • Join DataFrames using common fields (join keys).

In many “real world” situations, the data that we want to use come in multiplefiles. We often need to combine these files into a single DataFrame to analyzethe data. The pandas package provides various methods for combiningDataFrames includingmerge and concat.

To work through the examples below, we first need to load the species andsurveys files into pandas DataFrames. In iPython:

import pandas as pdsurveys_df = pd.read_csv("data/surveys.csv", keep_default_na=False, na_values=[""])surveys_df record_id month day year plot species sex hindfoot_length weight0 1 7 16 1977 2 NA M 32 NaN1 2 7 16 1977 3 NA M 33 NaN2 3 7 16 1977 2 DM F 37 NaN3 4 7 16 1977 7 DM M 36 NaN4 5 7 16 1977 3 DM M 35 NaN... ... ... ... ... ... ... ... ... ...35544 35545 12 31 2002 15 AH NaN NaN NaN35545 35546 12 31 2002 15 AH NaN NaN NaN35546 35547 12 31 2002 10 RM F 15 1435547 35548 12 31 2002 7 DO M 36 5135548 35549 12 31 2002 5 NaN NaN NaN NaN[35549 rows x 9 columns]species_df = pd.read_csv("data/species.csv", keep_default_na=False, na_values=[""])species_df species_id genus species taxa0 AB Amphispiza bilineata Bird1 AH Ammospermophilus harrisi Rodent2 AS Ammodramus savannarum Bird3 BA Baiomys taylori Rodent4 CB Campylorhynchus brunneicapillus Bird.. ... ... ... ...49 UP Pipilo sp. Bird50 UR Rodent sp. Rodent51 US Sparrow sp. Bird52 ZL Zonotrichia leucophrys Bird53 ZM Zenaida macroura Bird[54 rows x 4 columns]

Take note that the read_csv method we used can take some additional options whichwe didn’t use previously. Many functions in Python have a set of options thatcan be set by the user if needed. In this case, we have told pandas to assignempty values in our CSV to NaN keep_default_na=False, na_values=[""].More about all of the read_csv options here.

We can use the concat function in pandas to append either columns or rows fromone DataFrame to another. Let’s grab two subsets of our data to see how thisworks.

# Read in first 10 lines of surveys tablesurvey_sub = surveys_df.head(10)# Grab the last 10 rowssurvey_sub_last10 = surveys_df.tail(10)# Reset the index values to the second dataframe appends properlysurvey_sub_last10 = survey_sub_last10.reset_index(drop=True)# drop=True option avoids adding new index column with old index values

When we concatenate DataFrames, we need to specify the axis. axis=0 tellspandas to stack the second DataFrame UNDER the first one. It will automaticallydetect whether the column names are the same and will stack accordingly.axis=1 will stack the columns in the second DataFrame to the RIGHT of thefirst DataFrame. To stack the data vertically, we need to make sure we have thesame columns and associated column format in both datasets. When we stackhorizontally, we want to make sure what we are doing makes sense (i.e. the data arerelated in some way).

# Stack the DataFrames on top of each othervertical_stack = pd.concat([survey_sub, survey_sub_last10], axis=0)# Place the DataFrames side by sidehorizontal_stack = pd.concat([survey_sub, survey_sub_last10], axis=1)

Row Index Values and Concat

Have a look at the vertical_stack dataframe? Notice anything unusual?The row indexes for the two data frames survey_sub and survey_sub_last10have been repeated. We can reindex the new dataframe using the reset_index() method.

Writing Out Data to CSV

We can use the to_csv command to do export a DataFrame in CSV format. Note that the codebelow will by default save the data into the current working directory. We cansave it to a different folder by adding the foldername and a slash to the filevertical_stack.to_csv('foldername/out.csv'). We use the ‘index=False’ so thatpandas doesn’t include the index number for each line.

Check out your working directory to make sure the CSV wrote out properly, andthat you can open it! If you want, try to bring it back into Python to make sureit imports properly.

# For kicks read our output back into Python and make sure all looks goodnew_output = pd.read_csv('data/out.csv', keep_default_na=False, na_values=[""])

Challenge - Combine Data

In the data folder, there are two survey data files: surveys2001.csv andsurveys2002.csv. Read the data into Python and combine the files to make onenew data frame. Create a plot of average plot weight by year grouped by sex.Export your results as a CSV and make sure it reads back into Python properly.

When we concatenated our DataFrames we simply added them to each other -stacking them either vertically or side by side. Another way to combineDataFrames is to use columns in each dataset that contain common values (acommon unique id). Combining DataFrames using a common field is called“joining”. The columns containing the common values are called “join key(s)”.Joining DataFrames in this way is often useful when one DataFrame is a “lookuptable” containing additional data that we want to include in the other.

NOTE: This process of joining tables is similar to what we do with tables in anSQL database.

For example, the species.csv file that we’ve been working with is a lookuptable. This table contains the genus, species and taxa code for 55 species. Thespecies code is unique for each line. These species are identified in our surveydata as well using the unique species code. Rather than adding 3 more columnsfor the genus, species and taxa to each of the 35,549 line Survey data table, wecan maintain the shorter table with the species information. When we want toaccess that information, we can create a query that joins the additional columnsof information to the Survey data.

Storing data in this way has many benefits including:

  1. It ensures consistency in the spelling of species attributes (genus, speciesand taxa) given each species is only entered once. Imagine the possibilitiesfor spelling errors when entering the genus and species thousands of times!
  2. It also makes it easy for us to make changes to the species information oncewithout having to find each instance of it in the larger survey data.
  3. It optimizes the size of our data.

Joining Two DataFrames

To better understand joins, let’s grab the first 10 lines of our data as asubset to work with. We’ll use the .head method to do this. We’ll also readin a subset of the species table.

# Read in first 10 lines of surveys tablesurvey_sub = surveys_df.head(10)# Import a small subset of the species data designed for this part of the lesson.# It is stored in the data folder.species_sub = pd.read_csv('data/speciesSubset.csv', keep_default_na=False, na_values=[""])

In this example, species_sub is the lookup table containing genus, species, andtaxa names that we want to join with the data in survey_sub to produce a newDataFrame that contains all of the columns from both species_df andsurvey_df.

Identifying join keys

To identify appropriate join keys we first need to know which field(s) areshared between the files (DataFrames). We might inspect both DataFrames toidentify these columns. If we are lucky, both DataFrames will have columns withthe same name that also contain the same data. If we are less lucky, we need toidentify a (differently-named) column in each DataFrame that contains the sameinformation.

>>> species_sub.columnsIndex([u'species_id', u'genus', u'species', u'taxa'], dtype='object')>>> survey_sub.columnsIndex([u'record_id', u'month', u'day', u'year', u'plot_id', u'species_id', u'sex', u'hindfoot_length', u'weight'], dtype='object')

In our example, the join key is the column containing the two-letter speciesidentifier, which is called species_id.

Now that we know the fields with the common species ID attributes in eachDataFrame, we are almost ready to join our data. However, since there aredifferent types of joins, wealso need to decide which type of join makes sense for our analysis.

Inner joins

The most common type of join is called an inner join. An inner join combinestwo DataFrames based on a join key and returns a new DataFrame that containsonly those rows that have matching values in both of the originalDataFrames.

Inner joins yield a DataFrame that contains only rows where the value beingjoined exists in BOTH tables. An example of an inner join, adapted from Jeff Atwood’s blogpost about SQL joins is below:

Combining DataFrames with Pandas – Data Analysis and Visualization in Python for Ecologists (1)

The pandas function for performing joins is called merge and an Inner join isthe default option:

merged_inner = pd.merge(left=survey_sub, right=species_sub, left_on='species_id', right_on='species_id')# In this case `species_id` is the only column name in both dataframes, so if we skipped `left_on`# And `right_on` arguments we would still get the same result# What's the size of the output data?merged_inner.shapemerged_inner
 record_id month day year plot_id species_id sex hindfoot_length \0 1 7 16 1977 2 NL M 321 2 7 16 1977 3 NL M 332 3 7 16 1977 2 DM F 373 4 7 16 1977 7 DM M 364 5 7 16 1977 3 DM M 355 8 7 16 1977 1 DM M 376 9 7 16 1977 1 DM F 347 7 7 16 1977 2 PE F NaN weight genus species taxa0 NaN Neotoma albigula Rodent1 NaN Neotoma albigula Rodent2 NaN Dipodomys merriami Rodent3 NaN Dipodomys merriami Rodent4 NaN Dipodomys merriami Rodent5 NaN Dipodomys merriami Rodent6 NaN Dipodomys merriami Rodent7 NaN Peromyscus eremicus Rodent

The result of an inner join of survey_sub and species_sub is a new DataFramethat contains the combined set of columns from survey_sub and species_sub. Itonly contains rows that have two-letter species codes that are the same inboth the survey_sub and species_sub DataFrames. In other words, if a row insurvey_sub has a value of species_id that does not appear in the species_idcolumn of species, it will not be included in the DataFrame returned by aninner join. Similarly, if a row in species_sub has a value of species_idthat does not appear in the species_id column of survey_sub, that row will notbe included in the DataFrame returned by an inner join.

The two DataFrames that we want to join are passed to the merge function usingthe left and right argument. The left_on='species' argument tells mergeto use the species_id column as the join key from survey_sub (the leftDataFrame). Similarly , the right_on='species_id' argument tells merge touse the species_id column as the join key from species_sub (the rightDataFrame). For inner joins, the order of the left and right arguments doesnot matter.

The result merged_inner DataFrame contains all of the columns from survey_sub(record id, month, day, etc.) as well as all the columns from species_sub(species_id, genus, species, and taxa).

Notice that merged_inner has fewer rows than survey_sub. This is anindication that there were rows in surveys_df with value(s) for species_id thatdo not exist as value(s) for species_id in species_df.

Left joins

What if we want to add information from species_sub to survey_sub withoutlosing any of the information from survey_sub? In this case, we use a differenttype of join called a “left outer join”, or a “left join”.

Like an inner join, a left join uses join keys to combine two DataFrames. Unlikean inner join, a left join will return all of the rows from the leftDataFrame, even those rows whose join key(s) do not have values in the rightDataFrame. Rows in the left DataFrame that are missing values for the joinkey(s) in the right DataFrame will simply have null (i.e., NaN or None) valuesfor those columns in the resulting joined DataFrame.

Note: a left join will still discard rows from the right DataFrame that do nothave values for the join key(s) in the left DataFrame.

Combining DataFrames with Pandas – Data Analysis and Visualization in Python for Ecologists (2)

A left join is performed in pandas by calling the same merge function used forinner join, but using the how='left' argument:

merged_left = pd.merge(left=survey_sub, right=species_sub, how='left', left_on='species_id', right_on='species_id')merged_left
 record_id month day year plot_id species_id sex hindfoot_length \0 1 7 16 1977 2 NL M 321 2 7 16 1977 3 NL M 332 3 7 16 1977 2 DM F 373 4 7 16 1977 7 DM M 364 5 7 16 1977 3 DM M 355 6 7 16 1977 1 PF M 146 7 7 16 1977 2 PE F NaN7 8 7 16 1977 1 DM M 378 9 7 16 1977 1 DM F 349 10 7 16 1977 6 PF F 20 weight genus species taxa0 NaN Neotoma albigula Rodent1 NaN Neotoma albigula Rodent2 NaN Dipodomys merriami Rodent3 NaN Dipodomys merriami Rodent4 NaN Dipodomys merriami Rodent5 NaN NaN NaN NaN6 NaN Peromyscus eremicus Rodent7 NaN Dipodomys merriami Rodent8 NaN Dipodomys merriami Rodent9 NaN NaN NaN NaN

The result DataFrame from a left join (merged_left) looks very much like theresult DataFrame from an inner join (merged_inner) in terms of the columns itcontains. However, unlike merged_inner, merged_left contains the samenumber of rows as the original survey_sub DataFrame. When we inspectmerged_left, we find there are rows where the information that should havecome from species_sub (i.e., species_id, genus, and taxa) ismissing (they contain NaN values):

merged_left[ pd.isnull(merged_left.genus) ]
 record_id month day year plot_id species_id sex hindfoot_length \5 6 7 16 1977 1 PF M 149 10 7 16 1977 6 PF F 20 weight genus species taxa5 NaN NaN NaN NaN9 NaN NaN NaN NaN

These rows are the ones where the value of species_id from survey_sub (in thiscase, PF) does not occur in species_sub.

Other join types

The pandas merge function supports two other join types:

  • Right (outer) join: Invoked by passing how='right' as an argument. Similarto a left join, except all rows from the right DataFrame are kept, whilerows from the left DataFrame without matching join key(s) values arediscarded.
  • Full (outer) join: Invoked by passing how='outer' as an argument. This jointype returns the all pairwise combinations of rows from both DataFrames; i.e.,the result DataFrame will NaN where data is missing in one of the dataframes. This join type isvery rarely used.

Challenge - Distributions

Create a new DataFrame by joining the contents of the surveys.csv andspecies.csv tables. Then calculate and plot the distribution of:

  1. taxa by plot
  2. taxa by sex by plot

Challenge - Diversity Index

  1. In the data folder, there is a plots.csv file that contains information about thetype associated with each plot. Use that data to summarize the number ofplots by plot type.
  2. Calculate a diversity index of your choice for control vs rodent exclosureplots. The index should consider both species abundance and number ofspecies. You might choose to use the simple biodiversity index describedherewhich calculates diversity as:

    the number of species in the plot / the total number of individuals in the plot = Biodiversity index.

Key Points

  • Pandas’ merge and concat can be used to combine subsets of a DataFrame, or even data from different files.

  • join function combines DataFrames based on index or column.

  • Joining two DataFrames can be done in multiple ways (left, right, and inner) depending on what data must be in the final DataFrame.

  • to_csv can be used to write out DataFrames in CSV format.

You might also like

Latest Posts

Article information

Author: Tish Haag

Last Updated: 07/12/2022

Views: 6612

Rating: 4.7 / 5 (67 voted)

Reviews: 90% of readers found this page helpful

Author information

Name: Tish Haag

Birthday: 1999-11-18

Address: 30256 Tara Expressway, Kutchburgh, VT 92892-0078

Phone: +4215847628708

Job: Internal Consulting Engineer

Hobby: Roller skating, Roller skating, Kayaking, Flying, Graffiti, Ghost hunting, scrapbook

Introduction: My name is Tish Haag, I am a excited, delightful, curious, beautiful, agreeable, enchanting, fancy person who loves writing and wants to share my knowledge and understanding with you.