The Amazon Data Engineer Guide (2022)

Table of Contents
Behavioral Questions
  1. How would you describe your communication style?
  2. Describe a data engineering project you worked on that was challenging. What was challenging about it?
  3. Tell me about a time you had to describe a complex technical subject to a non-technical stakeholder.
  4. Why Amazon?
  5. You are assigned to work on a new engineering project. How do you get started?

SQL Questions
  1. How do you handle duplicate data in SQL?
  2. Write a query to find the current salary data for each employee.
  3. Write a query that returns all the neighborhoods with zero users.
  4. Write a query to select the top 3 departments with at least 10 employees. Rank them by the percentage of employees making over $100,000.

Python Questions
  1. Given a string, write a function recurring_char to find its first recurring character. Return “None” if there is no recurring character.
  2. What types of data types are available in Python?
  3. Given a list of timestamps in sequential order, return a list of lists grouped by week using the first timestamp as the starting point.
  4. You have an array of integers of length n spanning 0 to n with one number missing. Write a function missing_number that returns the missing number in the array.

Database Design Questions
  1. How would you create a schema to represent client click data on the web?
  2. Say you have a table with a billion rows. How would you add a column inserting data from the source, without impacting user experience?

Algorithm Questions
  1. Given a list of integers, find the index at which the sum of the left half of the list is equal to the right half. If there is no index where this condition is satisfied, return -1.
  2. What are the assumptions of linear regression?
  3. How would you approach multicollinearity in multiple linear regression?

In data engineering interviews at Amazon, the most frequently tested subjects include SQL (asked in 99% of interviews), Python, behavioral questions and database design/data modeling. Use these practice data engineer interview questions to prepare for the Amazon interview.

Behavioral Questions

Remember to incorporate Amazon Leadership Principles into your answers to behavioral questions. Behavioral questions are asked in the recruiter screen, the Bar Raiser Interview and HR Interviews.

1. How would you describe your communication style?

Culture fit is assessed in Amazon behavioral interviews. This question helps the interviewer understand how you communicate, gather information and collaborate with a team. For a question like this, you might draw on any of three Leadership Principles: Disagree and Commit, Earn Trust, or Learn and Be Curious.

2. Describe a data engineering project you worked on that was challenging. What was challenging about it?

Using a framework such as STAR (Situation, Task, Action, Result) can help you answer a question like this. Describe the situation and the task you were faced with, explain the actions you took, and finish with the results you achieved.

For example, you might say that:

“In my previous job, I was asked to build an ETL pipeline for streaming data that would gather customer transaction data to be used by the sales team. I started by gathering stakeholder input to learn about the specific needs of the sales team, and then researched options. A challenge arose during the testing phase, as there was significant pipeline lag. I had to review all of the code and optimize it. This turned out to be a great learning experience, as I identified common code inefficiencies that I had been introducing and can now avoid.”

3. Tell me about a time you had to describe a complex technical subject to a non-technical stakeholder.

A question like this assesses your communication and collaboration skills. You might say:

“I was asked to design a marketing analytics database. However, the marketing department didn’t have extensive analytics or database knowledge. I created a short presentation, helping the team visualize the database schema and held a Q&A session after. The presentation made the schema easy to understand and helped the team better query the data.”

4. Why Amazon?

Expect a variation of this question. Some options for answering include:

  • Describing your excitement for the ecommerce space.
  • Aligning your passion with their company culture.
  • Mentioning referrals who have told you good things about working at Amazon.

5. You are assigned to work on a new engineering project. How do you get started?

Start with the initial stages like gathering stakeholder input, understanding data requirements and creating process or logical data models.

SQL Questions

SQL questions for data engineers typically include basic definitions, scenario-based questions and SQL query writing tests.

1. How do you handle duplicate data in SQL?

Start with clarifying questions. Specifically, you should ask:

  • What type of data are we working with?
  • What types of values are most likely duplicated?

This should give you enough information to answer the question confidently. For example, you might suggest using SELECT DISTINCT or GROUP BY to de-duplicate existing data, and a UNIQUE constraint to keep new duplicates from being inserted.
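As a minimal sketch of two of these options, the Python snippet below runs SELECT DISTINCT and GROUP BY with HAVING against an in-memory SQLite database. The contacts table, its columns and the helper names are hypothetical, chosen only for illustration.

```python
import sqlite3


def load_contacts(rows):
    """Create a hypothetical in-memory contacts table and load rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE contacts (name TEXT, email TEXT)")
    conn.executemany("INSERT INTO contacts (name, email) VALUES (?, ?)", rows)
    return conn


def distinct_contacts(rows):
    """De-duplicate with DISTINCT: identical (name, email) rows collapse to one."""
    conn = load_contacts(rows)
    result = conn.execute(
        "SELECT DISTINCT name, email FROM contacts ORDER BY name, email"
    ).fetchall()
    conn.close()
    return result


def duplicated_contacts(rows):
    """Report which rows are duplicated, and how often, with GROUP BY and HAVING."""
    conn = load_contacts(rows)
    result = conn.execute(
        """
        SELECT name, email, COUNT(*) AS copies
        FROM contacts
        GROUP BY name, email
        HAVING COUNT(*) > 1
        ORDER BY name, email
        """
    ).fetchall()
    conn.close()
    return result
```

The GROUP BY variant is useful when you first want to inspect which values are duplicated before deciding how to remove them.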

2. Write a query to find the current salary data for each employee.

For this question, we have a table representing a company payroll schema. Due to an ETL error, each year's compensation adjustment inserted a new row into the employees table instead of updating the existing salary. The head of HR still needs the current salary of each employee.

Hint. The first step is to remove the duplicates and retain the current salary for each employee.

Given that there are no duplicate first and last name combinations, we can remove duplicates from the employees table by running a GROUP BY on two fields, the first and last name. Each group then corresponds to exactly one employee, and from it we can pick out that employee's most recent row.
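One way to sketch this GROUP BY approach is shown below, using Python's sqlite3 module. The column names are hypothetical, and the sketch assumes that id auto-increments, so the row with the highest id for a given name is the most recent insert. If the real table has a date column, that would be a safer way to pick the current row.

```python
import sqlite3


def current_salaries(rows):
    """Return (first_name, last_name, salary) for each employee's latest row.

    rows: (id, first_name, last_name, salary) tuples. Assumes a higher id
    means a more recent insert (an assumption, not stated in the question).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT e.first_name, e.last_name, e.salary
        FROM employees AS e
        JOIN (
            -- One group per employee; keep the id of the latest insert.
            SELECT first_name, last_name, MAX(id) AS max_id
            FROM employees
            GROUP BY first_name, last_name
        ) AS latest
          ON e.id = latest.max_id
        ORDER BY e.first_name, e.last_name
        """
    ).fetchall()
    conn.close()
    return result
```

Joining back on the maximum id avoids relying on MAX(salary), which would give the wrong answer for an employee whose salary was ever reduced.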

3. Write a query that returns all the neighborhoods with zero users.

Hint. Whenever the question asks about finding values with zero of something (users, employees, posts, etc.), immediately think of the concept of LEFT JOIN. An inner join returns only the rows that have a match in both tables, while a left join keeps every row from the left table and fills the right table's columns with NULL where there is no match.

Our task is to find all the neighborhoods without users. To do this, we do a left join from the neighborhoods table to the users table and keep only the rows where the user columns are NULL.
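The left join described above can be sketched as follows with Python's sqlite3 module. The table layouts and the neighborhood_id foreign key are assumptions made for the example, since the question does not show the schema.

```python
import sqlite3


def empty_neighborhoods(neighborhoods, users):
    """Return the names of neighborhoods that have no users.

    neighborhoods: (id, name) tuples.
    users: (id, name, neighborhood_id) tuples (hypothetical layout).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE neighborhoods (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, neighborhood_id INTEGER)"
    )
    conn.executemany("INSERT INTO neighborhoods VALUES (?, ?)", neighborhoods)
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", users)
    result = conn.execute(
        """
        SELECT n.name
        FROM neighborhoods AS n
        LEFT JOIN users AS u
          ON u.neighborhood_id = n.id
        -- Unmatched neighborhoods have NULL in every users column.
        WHERE u.id IS NULL
        ORDER BY n.name
        """
    ).fetchall()
    conn.close()
    return [name for (name,) in result]
```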

4. Write a query to select the top 3 departments with at least 10 employees. Rank them by the percentage of employees making over $100,000.

The first step is to determine what the question is asking. We can break it down into separate conditions:

  • Top 3 departments.
  • Percent of employees making over $100,000 in salary.
  • Department must have at least 10 employees.

What comes next to fully solve the above question?
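One possible way to combine the three conditions is sketched below with Python's sqlite3 module. The departments and employees layouts are hypothetical. HAVING enforces the minimum headcount, the average of a 0/1 flag gives the share of employees over $100,000, and ORDER BY with LIMIT picks the top 3.

```python
import sqlite3


def top_departments(departments, employees):
    """Return (department_name, share_over_100k) for the top 3 departments.

    departments: (id, name) tuples.
    employees: (department_id, salary) tuples (hypothetical layout).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, department_id INTEGER, salary INTEGER)"
    )
    conn.executemany("INSERT INTO departments VALUES (?, ?)", departments)
    conn.executemany(
        "INSERT INTO employees (department_id, salary) VALUES (?, ?)", employees
    )
    result = conn.execute(
        """
        SELECT d.name,
               AVG(CASE WHEN e.salary > 100000 THEN 1.0 ELSE 0.0 END) AS share_over_100k
        FROM employees AS e
        JOIN departments AS d
          ON d.id = e.department_id
        GROUP BY d.id, d.name
        HAVING COUNT(*) >= 10          -- at least 10 employees
        ORDER BY share_over_100k DESC, d.name
        LIMIT 3                        -- top 3 departments
        """
    ).fetchall()
    conn.close()
    return result
```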

Python Questions

Data engineer Python questions can range from definitions of data structures to writing Python code.

1. Given a string, write a function recurring_char to find its first recurring character. Return “None” if there is no recurring character.

We know we have to store a unique set of characters of the input string and loop through the string to check which ones occur twice.

Because we only need the first character that appears a second time, we can go through the string in a single loop, save each character we have seen in a set, and check whether the current character is already in that set. If it is, we return the character.

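def recurring_char(input_string):
    seen = set()  # characters encountered so far
    for char in input_string:
        if char in seen:
            return char  # first character seen a second time
        seen.add(char)
    return None  # no character repeats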

2. What types of data types are available in Python?

Python includes a variety of built-in data types, including:

  • Lists
  • Tuples
  • Dictionaries
  • Sets

There are also user-defined data types in Python. Examples include queues, trees and linked lists.
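As a brief illustration, the snippet below creates one value of each built-in type listed above and defines a very small user-defined linked list. The Node class and the to_list helper are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass
from typing import Optional

numbers = [3, 1, 3]          # list: ordered, mutable, allows duplicates
point = (2, 5)               # tuple: ordered, immutable
ages = {"ann": 31}           # dictionary: maps keys to values
unique_numbers = set(numbers)  # set: unordered, no duplicates


@dataclass
class Node:
    """One element of a singly linked list (a user-defined type)."""
    value: int
    next: Optional["Node"] = None


def to_list(head):
    """Walk the linked list from head and collect its values in order."""
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next
    return values
```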

3. Given a list of timestamps in sequential order, return a list of lists grouped by week using the first timestamp as the starting point.

This is a scripting question that asks you to process unstructured data. Specifically, we are being asked to do a few different tasks:

  1. Loop through all of the datetimes.
  2. Set a beginning timestamp as our reference point.
  3. Check if the next time in the array is more than 7 days ahead.
     a. If it is more than 7 days, set the new timestamp as the reference point.
     b. If it is not more than 7 days, continue to loop through and append the last value.
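The steps above can be sketched as follows. This is a minimal version that assumes the timestamps are strings in YYYY-MM-DD format, which the question does not specify. It follows the steps literally, so a new group starts only when a timestamp is more than 7 days after the current reference point. Whether a timestamp exactly 7 days later should start a new week is a boundary case worth clarifying with the interviewer.

```python
from datetime import datetime, timedelta


def group_by_week(timestamps):
    """Group sorted 'YYYY-MM-DD' strings into lists, one list per week."""
    if not timestamps:
        return []
    parsed = [datetime.strptime(ts, "%Y-%m-%d") for ts in timestamps]
    reference = parsed[0]          # step 2: the first timestamp is the reference
    groups = [[timestamps[0]]]
    for raw, current in zip(timestamps[1:], parsed[1:]):  # step 1: loop
        if current - reference > timedelta(days=7):
            # step 3a: more than 7 days ahead, so start a new group
            reference = current
            groups.append([raw])
        else:
            # step 3b: still within the current week, so append
            groups[-1].append(raw)
    return groups
```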

4. You have an array of integers of length n spanning 0 to n with one number missing. Write a function missing_number that returns the missing number in the array.

Hint. There are two ways we can solve this problem. One is through logical iteration and the other is through a mathematical formulation. We can look at both methods, as both run in O(n) time.

The first method would be through general iteration through the array. We can pass in the array and create a set which will hold each value in the input array. Then we create a loop that will span the range from 0 to n, and look to see if each number is in the set we just created. If it is not, we return the missing number.
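Both methods can be sketched as below. The first function follows the set-based iteration just described. The second is one possible mathematical formulation, which subtracts the actual sum from the expected sum of 0 through n. The name missing_number_by_sum is a hypothetical addition to keep the two versions apart.

```python
def missing_number(nums):
    """Iterative method: check each value from 0 to n against a set."""
    seen = set(nums)
    for candidate in range(len(nums) + 1):  # 0 through n inclusive
        if candidate not in seen:
            return candidate
    return None  # not reached when exactly one number is missing


def missing_number_by_sum(nums):
    """Mathematical method: expected sum of 0..n minus the actual sum."""
    n = len(nums)
    return n * (n + 1) // 2 - sum(nums)
```

The sum-based version uses constant extra memory, while the set-based version is easier to adapt if more than one number could be missing.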

Database Design Questions

Database design questions asked in Amazon interviews typically provide you with a case and ask you to create a schema for that case.

1. How would you create a schema to represent client click data on the web?

Hint. First, we want to clarify what click data means. You could safely assume that it represents button clicks, scrolls, closing pop-ups, etc. One solution: You could assign each action a label that describes the action.

For example, we can say that the product is Dropbox and that we want to track each folder click in the UI of an individual person’s Dropbox folder. We can label a click on a folder with the action name folder_click. When the user clicks on the side panel to log in or log out and we need to specify the action, we can call it login_click or logout_click.
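A minimal sketch of such a labeled-action schema is shown below, using Python's sqlite3 module. The click_events table and every column other than the action label are assumptions for illustration. A real design would depend on what the interviewer says needs to be tracked.

```python
import sqlite3

# Hypothetical schema: one row per user action, labeled by action_name.
SCHEMA = """
CREATE TABLE click_events (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    action_name TEXT NOT NULL,   -- e.g. folder_click, login_click, logout_click
    created_at TEXT NOT NULL     -- ISO 8601 timestamp of the click
)
"""


def count_actions(events):
    """Load (user_id, action_name, created_at) rows and count rows per action."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO click_events (user_id, action_name, created_at) VALUES (?, ?, ?)",
        events,
    )
    counts = dict(
        conn.execute(
            "SELECT action_name, COUNT(*) FROM click_events GROUP BY action_name"
        ).fetchall()
    )
    conn.close()
    return counts
```

Storing the action as a label keeps the table narrow, and new kinds of clicks can be tracked without changing the schema.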

2. Say you have a table with a billion rows. How would you add a column inserting data from the source, without impacting user experience?

This question is vague, so you would probably want to ask clarifying questions before answering. A full mock interview solution to this question is available on YouTube.

Algorithm Questions

Algorithm questions are asked in Amazon data engineer interviews. However, the focus is primarily on basic algorithmic knowledge, data structures, and easy Python coding tests.

1. Given a list of integers, find the index at which the sum of the left half of the list is equal to the right half. If there is no index where this condition is satisfied, return -1.

Hint. Start by thinking about what number you are trying to find. It is the sum of the entire list divided by 2. How could you create a function to add up all the values?
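Following the hint, one possible sketch keeps a running sum of the left side and stops when it reaches half of the total. It assumes that the element at the returned index counts as part of the left half, which the question leaves open. The function name equivalent_index is hypothetical.

```python
def equivalent_index(nums):
    """Return the index where the left sum equals the right sum, else -1.

    The left side is nums[0..index] inclusive and the right side is
    everything after index (an assumption about how the halves are split).
    """
    total = sum(nums)
    left_sum = 0
    for index, value in enumerate(nums):
        left_sum += value
        # left == right is the same as left being half of the total.
        if 2 * left_sum == total:
            return index
    return -1
```

Comparing 2 * left_sum with the total avoids dividing by 2, so odd totals need no special handling.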

2. What are the assumptions of linear regression?

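Hint. Questions like this are typical of the conceptual questions you might face in this part of the interview. You might start by noting that linear regression assumes a linear relationship between the features and the response variable, which is the value you want to predict. Other standard assumptions include independent errors, constant error variance (homoscedasticity), normally distributed residuals, and no strong multicollinearity among the features.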

3. How would you approach multicollinearity in multiple linear regression?

Multiple linear regression uses more than one independent variable to predict the dependent variable. One assumption we can make with this technique is that the independent variables are also independent from one another, or that the values do not affect one another.
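One common first step, sketched here as an illustration and not as the only approach, is to check whether any pair of independent variables is strongly correlated. The helper names and the 0.9 threshold are arbitrary choices for this example, and the sketch assumes every feature has some variation, since a constant column would cause a division by zero.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(features, threshold=0.9):
    """Return pairs of feature names whose absolute correlation >= threshold.

    features: dict mapping a feature name to its list of values.
    """
    names = list(features)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson(features[first], features[second])) >= threshold:
                flagged.append((first, second))
    return flagged
```

Once a strongly correlated pair is flagged, typical remedies include dropping one of the two variables or combining them into a single feature.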

Article information

Author: Dr. Pierre Goyette

Last Updated: 09/21/2022
