Assignment 02: Spark application to extract the Message ID, Date, From and To fields

Scenario

Your big data consulting company has been hired by a small law firm to help them make sense of a document dump they have received for a big trial.
The firm believes that the outcome of their trial depends on finding certain information in the emails from the opposition’s clients.
They have secured an initial dump of employees' emails at the company in question, but in order to keep receiving data they need to prove that there is value in the sample. For their document analysts to do that in a timely manner, they will need some metadata extracted from each email so they can process it using their document review tools.
If they are able to find what they need by the deadline, your company will get an ongoing contract to build a pipeline to process incoming document dumps (YAY!).

Assignment Description

Using the sample data consisting of a series of emails, write a Spark application to extract the Message ID, Date, From and To fields from each message’s header, and output those fields along with the email contents into a CSV file.
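The header extraction described above can be sketched in plain Scala, independent of Spark. This is only a minimal sketch, not the assignment's reference solution: the object and method names (EmailHeaders, splitMessage, parseHeaders, extract) are hypothetical, and it assumes each message has a header block separated from the body by the first blank line, with long headers folded onto continuation lines that start with whitespace.

```scala
object EmailHeaders {
  // The four header names the assignment asks for.
  val wanted: Seq[String] = Seq("Message-ID", "Date", "From", "To")

  /** Split a raw message into (header block, body) at the first blank line. */
  def splitMessage(raw: String): (String, String) = {
    val normalized = raw.replace("\r\n", "\n")
    val idx = normalized.indexOf("\n\n")
    if (idx < 0) (normalized, "")
    else (normalized.substring(0, idx), normalized.substring(idx + 2))
  }

  /** Parse header lines into a name -> value map. */
  def parseHeaders(headerBlock: String): Map[String, String] = {
    // Continuation lines start with a space or tab and belong to the previous header.
    val unfolded = headerBlock.replaceAll("\n[ \t]+", " ")
    unfolded.split("\n").toSeq.collect {
      case line if line.indexOf(':') > 0 =>
        val colon = line.indexOf(':')
        line.substring(0, colon).trim -> line.substring(colon + 1).trim
    }.toMap
  }

  /** Return Message-ID, Date, From, To (empty string if absent), then the body as-is. */
  def extract(raw: String): Seq[String] = {
    val (headers, body) = splitMessage(raw)
    val parsed = parseHeaders(headers)
    wanted.map(name => parsed.getOrElse(name, "")) :+ body
  }
}
```

In a Spark job, a function like extract could be mapped over the contents of each file, for example the values of the (path, content) pairs that SparkContext.wholeTextFiles returns.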

Input

Sample: https://bigdata220w18.blob.core.windows.net/blobs/enron_2015_sample.tgz
Full Set: https://bigdata220w18.blob.core.windows.net/blobs/enron_mail_20150507.tar.gz
The data is a gzipped tar archive. Once expanded, the emails are in a directory structure like the following:

maildir/user/outlook-folder/message
where each message is a numbered text file (1., 2., 3., etc.).
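To make the layout concrete, here is a small sketch that splits a file path into its user, folder and message parts. The MailPath object and Location case class are hypothetical names used only for illustration, and the handling of folders that contain subfolders is an assumption about the data rather than something the assignment states.

```scala
object MailPath {
  /** Hypothetical record for the path components that follow maildir/. */
  final case class Location(user: String, folder: String, message: String)

  /** Parse .../maildir/user/folder/message; returns None if the path does not fit. */
  def parse(path: String): Option[Location] = {
    val parts = path.replace('\\', '/').split('/').filter(_.nonEmpty)
    val start = parts.lastIndexOf("maildir")
    // After maildir we need at least a user, one folder and the message file.
    if (start < 0 || parts.length - start < 4) None
    else {
      val rest = parts.drop(start + 1)
      // Anything between the user and the file name is treated as the folder path.
      Some(Location(rest.head, rest.slice(1, rest.length - 1).mkString("/"), rest.last))
    }
  }
}
```

For example, MailPath.parse("/data/maildir/allen-p/inbox/1.") yields the user allen-p, the folder inbox and the message 1. (the user name here is just a sample value).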

Output

The output file should have the following format:
Message-ID,Date,From,To,Message
Note that the email will most likely contain commas, so you will need to quote the fields in the CSV file (the double quote, ", is the standard character for this).
All data should be included "as-is"; you don't need to do any further cleaning, e.g. parsing the date into a timestamp.
Also, because of the newlines in the email, the last column will be very ugly; that is expected.
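One way to produce such a line by hand is sketched below. It follows the common CSV convention (as in RFC 4180) of wrapping every field in double quotes and doubling any quote that appears inside a field; the CsvLine object and its method names are hypothetical, and Spark's own CSV writer may escape embedded quotes differently depending on its options.

```scala
object CsvLine {
  // Column names in the order the assignment specifies.
  val header: String = "Message-ID,Date,From,To,Message"

  /** Wrap a field in double quotes, doubling any quotes already inside it. */
  def quote(field: String): String =
    "\"" + field.replace("\"", "\"\"") + "\""

  /** Join the extracted fields into one CSV record; newlines inside fields are kept as-is. */
  def toRecord(fields: Seq[String]): String =
    fields.map(quote).mkString(",")
}
```

Because every field is quoted, commas and newlines in the message body stay inside the last column instead of starting a new field or record, provided the tool reading the file supports multi-line quoted fields.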

Step 1: Move files to the Docker container

Unzip files with WinZip
Run the following in a command prompt. The upload took about 30 minutes.
pscp -i c:\BigDataTechnolgies\ServerKey\tjpauley_azure.ppk C:\BigData220A\Assignment02\Data\enron_mail_20150507.tar.gz tjpauley@20.187.2.17:

Step 2: Move to data directory & unzip

Copy the archive to the data directory
sudo cp enron_mail_20150507.tar.gz /data
Then change to that directory
cd /data
Extract the tar.gz file with this command
sudo tar -xvf enron_mail_20150507.tar.gz

Step 3: Test DataFrame for first email

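// .write is only defined on a DataFrame, so this assumes emailRdd is a DataFrame despite its name.
// Spark creates a directory named email.csv that holds a single part file.
emailRdd.coalesce(1).write.format("csv").option("header", "true").save("file:///tmp/email.csv")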
