In this lab you will complete the following tasks:
- Create a Glue DataBrew project to explore a dataset
- Connect a sample dataset from S3
- Explore the dataset in Glue DataBrew
- Generate a rich data profile for the dataset
- Create a recipe and job to clean and normalize data in the dataset
Creating a project
Navigate to the AWS Glue DataBrew service
On the DataBrew console, select Projects
Click Create project
In the Project details section, enter `covid-states-daily` as the project name
In the Select a dataset section, select New dataset and enter `covid-states-daily-stats`
In the Connect to a new dataset section, select Amazon S3 under “Data lake/data store”
Enter the S3 path `s3://aws-dataengineering-day.workshop.aws/states_daily.csv.gz`
In the Sampling section, leave the default configuration values
In the Permissions section, select the role `databrew-lab-DataBrewLabRole-xxxxx` from the drop-down list
Click Create project
Glue DataBrew will create the project; this may take a few minutes.
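For reference, the same setup can be scripted with the AWS SDK. A minimal boto3 sketch, assuming the lab's placeholder names (the `xxxxx` suffix and the account ID in the role ARN are placeholders for your own values); note that, unlike the console, the API requires a recipe name at project creation:

```python
import boto3

databrew = boto3.client("databrew")

# Register the sample file on S3 as a DataBrew dataset.
databrew.create_dataset(
    Name="covid-states-daily-stats",
    Format="CSV",
    Input={
        "S3InputDefinition": {
            "Bucket": "aws-dataengineering-day.workshop.aws",
            "Key": "states_daily.csv.gz",
        }
    },
)

# The console creates the attached recipe automatically; the API
# expects one to exist, so register an empty recipe first.
databrew.create_recipe(Name="covid-states-daily-recipe", Steps=[])

databrew.create_project(
    Name="covid-states-daily",
    DatasetName="covid-states-daily-stats",
    RecipeName="covid-states-daily-recipe",
    RoleArn="arn:aws:iam::111122223333:role/databrew-lab-DataBrewLabRole-xxxxx",  # placeholder
)
```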
Exploring the dataset
When the project has been created, you will be presented with the Grid view. This is the default view, where a sample of the data is shown in tabular format.
The Grid view shows:
- Columns in the dataset
- Data type of each column
- Summary of the range of values that have been found
- Statistical distribution for numerical columns
Click on the Schema tab
The Schema view shows the schema that has been inferred from the dataset. In schema view, you can see statistics about the data values in each column.
In the Schema view, you can:
- Select the checkbox next to a column to view the summary of statistics for the column values
- Show/Hide columns
- Rename columns
- Change the data type of columns
- Rearrange the column order by dragging and dropping the columns
Click on the Profile tab
In the Profile view, you can run a data profile job to examine and collect statistical summaries about the data. A data profile is an assessment of the dataset in terms of its structure, content, relationships, and derivation.
Click on Run data profile
In the Job details and Job run sample panels, leave the default values
In the Job output settings section, select the S3 bucket named `databrew-lab-databrewoutputs3bucket-xxxxx` and enter a folder name for the job output
Create and run the profile job; it runs in the background, so you can continue with the next section while it completes
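The console handles this for you, but a profile job can also be created and started programmatically. A hedged boto3 sketch under the same placeholder names (the job name and output folder below are illustrative choices, not values from the lab):

```python
import boto3

databrew = boto3.client("databrew")

# Create a profile job that writes its statistics to the lab output bucket.
databrew.create_profile_job(
    Name="covid-states-daily-stats-profile",  # illustrative job name
    DatasetName="covid-states-daily-stats",
    OutputLocation={
        "Bucket": "databrew-lab-databrewoutputs3bucket-xxxxx",
        "Key": "data-profile-output/",  # illustrative folder name
    },
    RoleArn="arn:aws:iam::111122223333:role/databrew-lab-DataBrewLabRole-xxxxx",  # placeholder
)

# Start a run; the profile appears on the Profile tab once it finishes.
run = databrew.start_job_run(Name="covid-states-daily-stats-profile")
print(run["RunId"])
```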
Building a recipe
Navigate back to the `covid-states-daily` project grid view.
DataBrew has inferred the data type of the `date` column as integer. We will convert the data type of the `date` column to string.
Click on the `#` icon next to the `date` column name and select string
Note that the transformation is added to the recipe at the right.
We will duplicate the `date` column before splitting it into `year`, `month`, and `day` columns, as the original column will be deleted by the split transformation.
Select the `...` at the top of the `date` column
From the pop-up menu, scroll to the bottom and select Duplicate
Leave the default settings in the Duplicate column dialog
Click Apply
A copy of the `date` column is created with the name `date_copy`. Note that the duplicate column transformation is added as a step to the recipe at the right.
Let’s split the `date_copy` column into `year`, `month`, and `day` columns.
Select the `...` at the top of the `date_copy` column
Select Split column / At positions from beginning
In the Split column dialog, enter `4` for Position from the beginning to split out the year. Leave all other default settings.
In the Split column dialog, scroll down and click Preview changes to see how the column is split. Note that the `date_copy` column is marked for deletion.
Click Apply
Next, split the `date_copy_2` column into month and day, this time entering `2` for Position from the beginning (the month occupies the first two characters).
The result should look like the screenshot below.
Let’s rename the new columns to `year`, `month`, and `day`.
Click on the `date_copy_1` column and select Rename from the menu. Enter `year` as the new column name, and click Apply
Rename the other two new columns, `date_copy_2_1` and `date_copy_2_2`, to `month` and `day` respectively.
The result should look like the following.
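To make the effect of the duplicate, split, and rename steps concrete, here is a rough pandas equivalent of the same logic on sample `YYYYMMDD` values (an illustration only, not part of the lab):

```python
import pandas as pd

# Sample of the dataset with the date column already cast to string.
df = pd.DataFrame({"date": ["20210307", "20210306"]})

# Duplicate, then split at fixed positions: 4 characters for the
# year, then 2 for the month, leaving the day.
df["date_copy"] = df["date"]
df["year"] = df["date_copy"].str[:4]
df["month"] = df["date_copy"].str[4:6]
df["day"] = df["date_copy"].str[6:]
df = df.drop(columns=["date_copy"])

print(df)
#        date  year month day
# 0  20210307  2021    03  07
# 1  20210306  2021    03  06
```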
The `probableCases` column has some missing values. We will set these missing values to 0.
To navigate to the `probableCases` column, click on the columns drop-down list at the top, enter `probableCases` in the search field and click View
Click on the `probableCases` column and select Remove or fill missing values / Fill with custom value
Enter `0` as the Custom value and click Apply
Map the values of the `dataQualityGrade` column to numerical values.
To navigate to the `dataQualityGrade` column, click on the columns drop-down list at the top, enter `dataQualityGrade` in the search field and click View
Click on the `dataQualityGrade` column and select Categorical mapping
In the Categorically map column dialog:
- Select the option Map all values
- Enable Map values to numeric values
- Map the current `dataQualityGrade` values to new values as follows:

| dataQualityGrade | New value |
|---|---|
| N/A | 0 |
| A+ | 1 |
| A | 2 |
| B | 3 |
| C | 4 |
| D | 5 |

Leave all other settings as default. Click Apply
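As a plain-Python illustration, the categorical mapping (together with the 0 value for missing grades and the integer cast performed in the next step) behaves like this:

```python
import pandas as pd

grades = pd.Series(["A+", "B", None, "A", "D"], name="dataQualityGrade")

# Categorical mapping to numeric values; missing grades map to 0.
grade_map = {"A+": 1, "A": 2, "B": 3, "C": 4, "D": 5}
mapped = grades.map(grade_map).fillna(0)

# DataBrew produces the mapped column as type double; cast to integer.
mapped = mapped.astype(int)
print(mapped.tolist())  # [1, 3, 0, 2, 5]
```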
After this transform, the new column `dataQualityGrade_mapped` is of type double; convert this column to integer.
You are now ready to publish the recipe so that it can be used in DataBrew jobs. The final recipe looks like the following.
Click on the Publish button at the top of the recipe.
Optionally enter a version description, and click Publish
The recipe is published as Version 1.0. DataBrew applies a version number when a recipe is published.
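Recipes can also be maintained and published through the API. A hedged boto3 sketch showing the shape of a recipe step; only the first step of this lab's recipe is spelled out, and the `CHANGE_DATA_TYPE` operation name follows the DataBrew recipe action reference:

```python
import boto3

databrew = boto3.client("databrew")

# A recipe is an ordered list of steps, each with an Action
# (an operation name plus its parameters). Update the working
# recipe with the first step: casting the date column to string.
databrew.update_recipe(
    Name="covid-states-daily-recipe",
    Steps=[
        {
            "Action": {
                "Operation": "CHANGE_DATA_TYPE",
                "Parameters": {
                    "sourceColumn": "date",
                    "columnDataType": "string",
                },
            }
        },
        # ...the duplicate, split, rename, fill-missing, and
        # categorical-mapping steps would follow here...
    ],
)

# Publishing assigns a version number (1.0 on the first publish).
databrew.publish_recipe(
    Name="covid-states-daily-recipe",
    Description="Clean and normalize covid-states-daily-stats",
)
```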
Creating a DataBrew job
Click on Jobs from the menu on the left hand side of the DataBrew console.
On the Recipe jobs tab, click on Create job
Enter `covid-states-daily-prep` for the job name
Select Create a recipe job
Select the `covid-states-daily` dataset
Select the `covid-states-daily-recipe` recipe
In the Job output settings section, enter the S3 location `s3://databrew-lab-databrewoutputs3bucket-xxxxx/job-outputs/`
Expand the Additional configuration - optional panel
Under Custom partition by column values, add the `year`, `month`, and `day` columns. This will partition the data in the output folder by year, month, and day.
In the Permissions section, select the role `databrew-lab-DataBrewLabRole-xxxxx`
Click Create and run job
The DataBrew job is created and the job status is `Running`
Wait until the job has completed successfully (approx. 4 minutes)
Click on the link to the job output, and verify that the output files are partitioned in the S3 bucket
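The equivalent recipe job can be created and monitored with boto3 as well. A sketch under the same placeholder names; `PartitionColumns` corresponds to the Custom partition by column values setting:

```python
import time

import boto3

databrew = boto3.client("databrew")

databrew.create_recipe_job(
    Name="covid-states-daily-prep",
    DatasetName="covid-states-daily-stats",  # the dataset registered at the start of the lab
    RecipeReference={
        "Name": "covid-states-daily-recipe",
        "RecipeVersion": "1.0",
    },
    Outputs=[
        {
            "Location": {
                "Bucket": "databrew-lab-databrewoutputs3bucket-xxxxx",
                "Key": "job-outputs/",
            },
            # Partition the output by the columns created in the recipe,
            # e.g. job-outputs/year=2021/month=03/day=07/...
            "PartitionColumns": ["year", "month", "day"],
        }
    ],
    RoleArn="arn:aws:iam::111122223333:role/databrew-lab-DataBrewLabRole-xxxxx",  # placeholder
)

# Start a run and poll until it finishes (typically a few minutes).
run_id = databrew.start_job_run(Name="covid-states-daily-prep")["RunId"]
while True:
    state = databrew.describe_job_run(
        Name="covid-states-daily-prep", RunId=run_id
    )["State"]
    if state not in ("STARTING", "RUNNING"):
        break
    time.sleep(30)
print(state)  # SUCCEEDED on a clean run
```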
Viewing data lineage
In DataBrew, navigate back to the `covid-states-daily` project
Click on Lineage at the top right
This view shows the origin of the data and the transformation steps that the data has been through.
Congratulations, you have completed the DataBrew lab. If you haven’t already done so, you can return to the Profile tab to examine the data profile.