Posts

Assignment 06 Kafka Demo

Quick demo of Kafka in action.
Step 1: Start the VM, start the sandbox, and log into the sandbox.
Step 2: Run this script: spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
Step 3:
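A minimal sketch of consuming a Kafka topic from that spark-shell session, assuming a broker at localhost:9092 and a topic named "demo" (both are assumptions, not part of the demo itself):

    // run inside the spark-shell started above; broker address and topic name are assumed
    val kafkaDF = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  // assumed local broker
      .option("subscribe", "demo")                          // assumed topic name
      .load()

    // Kafka keys and values arrive as bytes, so cast them to strings for display
    val messages = kafkaDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // print incoming messages to the console as they arrive
    messages.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()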

Assignment 03 Using Sqoop

SQOOP - reads data out of a SQL database and loads it into HDFS. Very reliable; the original tool for this job.

Suggested reading:
- https://www.techrepublic.com/article/why-streaming-data-is-the-future-of-big-data-and-apache-kafka-is-leading-the-charge/
- https://www.infoworld.com/article/3212204/big-data/all-your-streaming-data-are-belong-to-kafka.html
- Streaming Data: https://www.manning.com/books/streaming-data
- Confluent blog: https://www.confluent.io/blog/

Confluent is the "commercial" backing for Kafka, started by the original developers. Basically, what Databricks is to Spark, Confluent is to Kafka.

Scenario: “80% of your sales come from 20% of your customers” - the Pareto Principle. Customer segmentation has been a marketing tactic in use f
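For concreteness, a typical Sqoop import of one table into HDFS looks roughly like the command below; the JDBC URL, credentials, table name, and target directory are placeholders, not values from the course:

    # placeholder JDBC URL, credentials, table name, and HDFS target directory
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username dbuser -P \
      --table customers \
      --target-dir /user/hadoop/customers \
      --num-mappers 1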

Assignment 02: Spark application to extract the Message ID, Date, From and To fields

Assignment 02: Scenario

Your big data consulting company has been hired by a small law firm to help them make sense of a document dump they have received for a big trial. The firm believes that the outcome of their trial depends on finding certain information in the emails from the opposition’s clients. They have secured an initial dump of employees’ emails at the company in question, but in order to get continuing data they need to prove that there is value in the sample. In order for their document analysts to do that in a timely manner, they will need some metadata extracted from each email so they can process it using their document review tools. If they are able to find what they need by the deadline, your company will get an ongoing contract to build a pipeline to process incoming document dumps (YAY!).

Assignment Description

Using the sample data consisting of a series of emails, write a Spark application to extract the Message ID, Date, From and To fields
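A rough sketch of one way to approach this in Scala is below; the input and output paths and the header-matching regex are assumptions for illustration, not the required solution:

    import org.apache.spark.sql.SparkSession

    object EmailHeaders {
      // return the value of the first "Name: value" header line found in an email, or "" if missing
      def header(name: String, text: String): String = {
        val pattern = ("(?m)^" + name + ": (.*)$").r
        pattern.findFirstMatchIn(text).map(_.group(1).trim).getOrElse("")
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("EmailHeaders").getOrCreate()

        // wholeTextFiles keeps each email file intact as a (path, contents) pair
        val emails = spark.sparkContext.wholeTextFiles("hdfs:///data/emails/*")   // assumed input path

        // pull out the four requested fields from each email
        val fields = emails.map { case (_, text) =>
          (header("Message-ID", text), header("Date", text), header("From", text), header("To", text))
        }

        fields.saveAsTextFile("hdfs:///data/email-metadata")   // assumed output path
        spark.stop()
      }
    }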

Assignment 01 - Installing Azure CLI 2.0 and resizing VM

Week 1 Homework: Installing Azure CLI 2.0 and resizing a VM

Now that we have more experience working with Azure VMs, I’d like people to become familiar with the command line interface (CLI) to Azure. Every possible operation is available through the CLI, in contrast to the web portal, where many things are difficult and sometimes not even possible to do.

Install the CLI for your environment
You will generally want to have the CLI available on your local machine/laptop, since you are really only interacting with the VMs at this point.

Windows/Mac users: follow the instructions at https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest

Ubuntu: there is an apt repo available, but I was a little uncomfortable with it. The CLI is basically a wrapper around some Python scripts, so the easiest approach is to just use pip:

pip install azure-cli (use the --user option if not using an env manager or sudo)

Login to the CLI (Link to your NetId account
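For reference, the basic CLI flow from login through resizing a VM looks roughly like this; the resource group, VM name, and target size are placeholders, not values from the assignment:

    az login                                             # opens a browser window to authenticate
    az vm list --output table                            # confirm the CLI can see your VMs
    az vm deallocate --resource-group myRG --name myVM   # deallocate first if the new size is not offered on the current hardware
    az vm resize --resource-group myRG --name myVM --size Standard_B2s
    az vm start --resource-group myRG --name myVM        # start it back up at the new size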