In this lab you will complete the following tasks:
- Create a Glue DataBrew project to explore a dataset
- Connect a sample dataset from S3
- Explore the dataset in Glue DataBrew
- Generate a rich data profile for the dataset
- Create a recipe and job to clean and normalize data in the dataset
Creating a project
Navigate to the AWS Glue DataBrew service
On the DataBrew console, select Projects
Click Create project
In the Project details section, enter `covid-states-daily` as the project name
In the Select a dataset section, select New dataset and enter `covid-states-daily-stats`
In the Connect to a new dataset section, select Amazon S3 under “Data lake/data store”
Enter the S3 path `s3://aws-dataengineering-day.workshop.aws/states_daily.csv.gz`
In the Sampling section, leave the default configuration values
In the Permissions section, select the role `databrew-lab-DataBrewLabRole-xxxxx` from the drop-down list
Click Create project
Glue DataBrew will create the project; this may take a few minutes.
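For reference, the same setup can be scripted with the AWS SDK. A minimal boto3 sketch, assuming the lab's placeholder names (the `xxxxx` suffix and the account ID in the role ARN are placeholders for your own values); note that, unlike the console, the API requires a recipe name at project creation:

```python
import boto3

databrew = boto3.client("databrew")

# Register the sample file on S3 as a DataBrew dataset.
databrew.create_dataset(
    Name="covid-states-daily-stats",
    Format="CSV",
    Input={
        "S3InputDefinition": {
            "Bucket": "aws-dataengineering-day.workshop.aws",
            "Key": "states_daily.csv.gz",
        }
    },
)

# The console creates the attached recipe automatically; the API
# expects one to exist, so register an empty recipe first.
databrew.create_recipe(Name="covid-states-daily-recipe", Steps=[])

databrew.create_project(
    Name="covid-states-daily",
    DatasetName="covid-states-daily-stats",
    RecipeName="covid-states-daily-recipe",
    RoleArn="arn:aws:iam::111122223333:role/databrew-lab-DataBrewLabRole-xxxxx",  # placeholder
)
```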
Exploring the dataset
When the project has been created, you will be presented with the Grid view. This is the default view, where a sample of the data is shown in tabular format.
The Grid view shows:
- Columns in the dataset
- Data type of each column
- Summary of the range of values that have been found
- Statistical distribution for numerical columns
Click on the Schema tab
The Schema view shows the schema that has been inferred from the dataset. In schema view, you can see statistics about the data values in each column.
In the Schema view, you can:
- Select the checkbox next to a column to view the summary of statistics for the column values
- Show/Hide columns
- Rename columns
- Change the data type of columns
- Rearrange the column order by dragging and dropping the columns
Click on the Profile tab
In the Profile view, you can run a data profile job to examine and collect statistical summaries about the data. A data profile is an assessment of the dataset in terms of its structure, content, relationships, and derivation.
Click on Run data profile
In the Job details and Job run sample panels, leave the default values
In the Job output settings section, select the S3 bucket named `databrew-lab-databrewoutputs3bucket-xxxxx` and enter a folder name for the job output
Create and run the profile job; it runs in the background, so you can continue with the next section while it completes
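The console handles this for you, but a profile job can also be created and started programmatically. A hedged boto3 sketch under the same placeholder names (the job name and output folder below are illustrative choices, not values from the lab):

```python
import boto3

databrew = boto3.client("databrew")

# Create a profile job that writes its statistics to the lab output bucket.
databrew.create_profile_job(
    Name="covid-states-daily-stats-profile",  # illustrative job name
    DatasetName="covid-states-daily-stats",
    OutputLocation={
        "Bucket": "databrew-lab-databrewoutputs3bucket-xxxxx",
        "Key": "data-profile-output/",  # illustrative folder name
    },
    RoleArn="arn:aws:iam::111122223333:role/databrew-lab-DataBrewLabRole-xxxxx",  # placeholder
)

# Start a run; the profile appears on the Profile tab once it finishes.
run = databrew.start_job_run(Name="covid-states-daily-stats-profile")
print(run["RunId"])
```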
Building a recipe
Navigate back to the `covid-states-daily` project grid view.
DataBrew has inferred the data type of the `date` column as integer. We will convert the data type of the `date` column to string.
Click on the `#` icon next to the `date` column name and select string
Note that the transformation is added to the recipe at the right.
We will duplicate the `date` column before splitting it into `year`, `month`, and `day` columns, as the original column will be deleted by the split transformation.
Select the `...` at the top of the `date` column
From the pop-up menu, scroll to the bottom and select Duplicate
Leave the default settings in the Duplicate column dialog
Click Apply
A copy of the `date` column is created with the name `date_copy`. Note that the duplicate column transformation is added as a step to the recipe at the right.
Let’s split the `date_copy` column into `year`, `month`, and `day` columns.
Select the `...` at the top of the `date_copy` column
Select Split column / At positions from beginning
In the Split column dialog, enter `4` for Position from the beginning to split out the year. Leave all other default settings.
In the Split column dialog, scroll down and click Preview changes to see how the column is split. Note that the `date_copy` column is marked for deletion.
Click Apply
Next, split the `date_copy_2` column into month and day, this time entering `2` for Position from the beginning (the month occupies the first two characters).
The result should look like the screenshot below.
Let’s rename the new columns to `year`, `month`, and `day`.
Click on the `date_copy_1` column and select Rename from the menu. Enter `year` as the new column name, and click Apply
Rename the other two new columns, `date_copy_2_1` and `date_copy_2_2`, to `month` and `day` respectively.
The result should look like the following.
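To make the effect of the duplicate, split, and rename steps concrete, here is a rough pandas equivalent of the same logic on sample `YYYYMMDD` values (an illustration only, not part of the lab):

```python
import pandas as pd

# Sample of the dataset with the date column already cast to string.
df = pd.DataFrame({"date": ["20210307", "20210306"]})

# Duplicate, then split at fixed positions: 4 characters for the
# year, then 2 for the month, leaving the day.
df["date_copy"] = df["date"]
df["year"] = df["date_copy"].str[:4]
df["month"] = df["date_copy"].str[4:6]
df["day"] = df["date_copy"].str[6:]
df = df.drop(columns=["date_copy"])

print(df)
#        date  year month day
# 0  20210307  2021    03  07
# 1  20210306  2021    03  06
```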
The `probableCases` column has some missing values. We will set these missing values to 0.
To navigate to the `probableCases` column, click on the columns drop-down list at the top, enter `probableCases` in the search field and click View
Click on the `probableCases` column and select Remove or fill missing values / Fill with custom value
Enter `0` as the Custom value and click Apply
Map the values of the `dataQualityGrade` column to numerical values.
To navigate to the `dataQualityGrade` column, click on the columns drop-down list at the top, enter `dataQualityGrade` in the search field and click View
Click on the `dataQualityGrade` column and select Categorical mapping
In the Categorically map column dialog:
- Select the option Map all values
- Enable Map values to numeric values
- Map the current `dataQualityGrade` values to new values as follows:

| dataQualityGrade | New value |
|---|---|
| N/A | 0 |
| A+ | 1 |
| A | 2 |
| B | 3 |
| C | 4 |
| D | 5 |

Leave all other settings as default. Click Apply
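As a plain-Python illustration, the categorical mapping (together with the 0 value for missing grades and the integer cast performed in the next step) behaves like this:

```python
import pandas as pd

grades = pd.Series(["A+", "B", None, "A", "D"], name="dataQualityGrade")

# Categorical mapping to numeric values; missing grades map to 0.
grade_map = {"A+": 1, "A": 2, "B": 3, "C": 4, "D": 5}
mapped = grades.map(grade_map).fillna(0)

# DataBrew produces the mapped column as type double; cast to integer.
mapped = mapped.astype(int)
print(mapped.tolist())  # [1, 3, 0, 2, 5]
```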
After this transform, the new column `dataQualityGrade_mapped` is of type double; convert this column to integer.
You are now ready to publish the recipe so that it can be used in DataBrew jobs. The final recipe looks like the following.
Click on the Publish button at the top of the recipe.
Optionally enter a version description, and click Publish
The recipe is published as Version 1.0. DataBrew applies a version number when a recipe is published.
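Recipes can also be maintained and published through the API. A hedged boto3 sketch showing the shape of a recipe step; only the first step of this lab's recipe is spelled out, and the `CHANGE_DATA_TYPE` operation name follows the DataBrew recipe action reference:

```python
import boto3

databrew = boto3.client("databrew")

# A recipe is an ordered list of steps, each with an Action
# (an operation name plus its parameters). Update the working
# recipe with the first step: casting the date column to string.
databrew.update_recipe(
    Name="covid-states-daily-recipe",
    Steps=[
        {
            "Action": {
                "Operation": "CHANGE_DATA_TYPE",
                "Parameters": {
                    "sourceColumn": "date",
                    "columnDataType": "string",
                },
            }
        },
        # ...the duplicate, split, rename, fill-missing, and
        # categorical-mapping steps would follow here...
    ],
)

# Publishing assigns a version number (1.0 on the first publish).
databrew.publish_recipe(
    Name="covid-states-daily-recipe",
    Description="Clean and normalize covid-states-daily-stats",
)
```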
Creating a DataBrew job
Click on Jobs from the menu on the left hand side of the DataBrew console.
On the Recipe jobs tab, click on Create job
Enter `covid-states-daily-prep` for the job name
Select Create a recipe job
Select the `covid-states-daily` dataset
Select the `covid-states-daily-recipe` recipe
In the Job output settings section, enter the S3 location `s3://databrew-lab-databrewoutputs3bucket-xxxxx/job-outputs/`
Expand the Additional configuration - optional panel
Under Custom partition by column values, add the `year`, `month`, and `day` columns. This will partition the data in the output folder by year, month, and day.
In the Permissions section, select the role `databrew-lab-DataBrewLabRole-xxxxx`
Click Create and run job
The DataBrew job is created and the job status is `Running`
Wait until the job has completed successfully (approx. 4 minutes)
Click on the link to the job output, and verify that the output files are partitioned in the S3 bucket
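The equivalent recipe job can be created and monitored with boto3 as well. A sketch under the same placeholder names; `PartitionColumns` corresponds to the Custom partition by column values setting:

```python
import time

import boto3

databrew = boto3.client("databrew")

databrew.create_recipe_job(
    Name="covid-states-daily-prep",
    DatasetName="covid-states-daily-stats",  # the dataset registered at the start of the lab
    RecipeReference={
        "Name": "covid-states-daily-recipe",
        "RecipeVersion": "1.0",
    },
    Outputs=[
        {
            "Location": {
                "Bucket": "databrew-lab-databrewoutputs3bucket-xxxxx",
                "Key": "job-outputs/",
            },
            # Partition the output by the columns created in the recipe,
            # e.g. job-outputs/year=2021/month=03/day=07/...
            "PartitionColumns": ["year", "month", "day"],
        }
    ],
    RoleArn="arn:aws:iam::111122223333:role/databrew-lab-DataBrewLabRole-xxxxx",  # placeholder
)

# Start a run and poll until it finishes (typically a few minutes).
run_id = databrew.start_job_run(Name="covid-states-daily-prep")["RunId"]
while True:
    state = databrew.describe_job_run(
        Name="covid-states-daily-prep", RunId=run_id
    )["State"]
    if state not in ("STARTING", "RUNNING"):
        break
    time.sleep(30)
print(state)  # SUCCEEDED on a clean run
```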
Viewing data lineage
In DataBrew, navigate back to the `covid-states-daily` project
Click on Lineage at the top right
This view shows the origin of the data and the transformation steps that the data has been through.
Congratulations, you have completed the DataBrew lab. If you haven’t already done so, you can return to the Profile tab to examine the data profile.