Assignment 02: Spark application to extract the Message ID, Date, From and To fields
Scenario
Assignment Description
Using the sample data consisting of a series of emails, write a Spark application to extract the Message ID, Date, From and To fields from each message’s header, and output those fields along with the email contents into a CSV file.
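As a rough illustration of the header extraction described above, here is a minimal sketch using Python's standard email parser. The helper name extract_fields and the sample message are hypothetical, and treating the message body as the "Message" column is an assumption; the assignment does not prescribe a language or a parsing method, and in a Spark job a function like this would be applied to each file's text.

```python
from email.parser import Parser

# Hypothetical helper name; the assignment does not prescribe one.
def extract_fields(raw_message):
    """Return (Message-ID, Date, From, To, body) for one raw email string."""
    msg = Parser().parsestr(raw_message)
    # Missing headers become empty strings rather than None.
    header_values = [msg.get(name, "") for name in ("Message-ID", "Date", "From", "To")]
    # Assumption: the "Message" column is the body after the header block.
    return (*header_values, msg.get_payload())

# Made-up sample message, shaped like a typical email header plus body.
sample = (
    "Message-ID: <123.example@thyme>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Test\n"
    "\n"
    "Hello, Bob.\nSee you soon.\n"
)
fields = extract_fields(sample)
```

The header values are returned as-is, which matches the requirement not to clean the data any further.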
Input
Sample: https://bigdata220w18.blob.core.windows.net/blobs/enron_2015_sample.tgz
Full Set: https://bigdata220w18.blob.core.windows.net/blobs/enron_mail_20150507.tar.gz
The data is a gzipped tar archive. Once expanded, the emails are in the following directory structure:
maildir/user/outlook-folder/message
where each message is a numbered text file (1., 2., 3., etc.).
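To make the layout concrete, the following sketch walks a maildir tree and collects every message file. The function name list_message_files is hypothetical, and the tiny temporary tree stands in for the real dataset so the example runs on its own (on a POSIX system, since the file names end in a dot).

```python
import os
import tempfile

# Hypothetical helper name, used only for this illustration.
def list_message_files(maildir_root):
    """Return the sorted paths of all files found under maildir_root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(maildir_root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)

# Build a throwaway maildir/user/outlook-folder/message tree for the demo.
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "maildir", "some-user", "inbox")
    os.makedirs(folder)
    for name in ("1.", "2."):
        with open(os.path.join(folder, name), "w") as handle:
            handle.write("Message-ID: <x>\n\nbody\n")
    found = list_message_files(os.path.join(tmp, "maildir"))
    relative = [os.path.relpath(path, tmp) for path in found]
```

Walking the whole tree rather than assuming a fixed depth is a deliberate choice here, in case some folders contain subfolders.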
Output
The output file should have the following format:
Message-ID,Date,From,To,Message
Note that the emails will most likely contain commas, so you will need to quote the fields in the CSV file (the double quote, ", is the standard character for this).
All data should be included “as-is”; you don’t need to do any further cleaning, e.g. parsing the date into a timestamp.
Also, because of the newlines in the emails, the last column will be very ugly; that is expected.
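The quoting requirement can be sketched with Python's csv module, which wraps every field in double quotes when asked to. The row below is made-up sample data, and writing to an in-memory buffer instead of a file is only to keep the example self-contained; a Spark job would likely use its own CSV writer with an equivalent quoting option.

```python
import csv
import io

# Made-up row in the required column order; the body contains commas and newlines.
rows = [
    ("<1@example>", "Mon, 14 May 2001 16:39:00 -0700 (PDT)", "alice@example.com",
     "bob@example.com", "Hi Bob,\nlunch at noon, ok?\n"),
]

buffer = io.StringIO()
# QUOTE_ALL wraps every field in double quotes, so embedded commas are safe.
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(["Message-ID", "Date", "From", "To", "Message"])
writer.writerows(rows)
csv_text = buffer.getvalue()

# Reading the text back shows that commas and newlines inside fields survive.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

The raw text still looks ugly because the message spans several lines, but a CSV reader recovers the five columns intact.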
Step 1: Move files to the Docker container
Unzip the files with WinZip
Run the following in the command prompt. The upload took about 30 minutes.
pscp -i c:\BigDataTechnolgies\ServerKey\tjpauley_azure.ppk C:\BigData220A\Assignment02\Data\enron_mail_20150507.tar.gz tjpauley@20.187.2.17:
Step 2: Move to data directory & unzip
Copy the archive to the data directory
sudo cp enron_mail_20150507.tar.gz /data
Then change to that directory
cd /data
Extract the tar.gz file with this command
sudo tar -xvf enron_mail_20150507.tar.gz
Step 3: Test DataFrame for first email