COSC 6339 Big Data Analytics
Hadoop MapReduce Infrastructure: Pig, Hive, and Mahout
Edgar Gabriel Fall 2018
Pig
• Pig is a platform for analyzing large data sets
  – Abstraction on top of Hadoop
  – Provides a high-level programming language designed for data processing
  – Converted into MapReduce and executed on Hadoop clusters
Why use Pig?
• MapReduce requires programmers
  – Must think in terms of map and reduce functions
  – More than likely will require Java programming
• Pig provides a high-level language that can be used by analysts and scientists
  – Does not require know-how in parallel programming
• Pig's features
  – Join datasets
  – Sort datasets
  – Filter
  – Data types
  – Group by
  – User-defined functions
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Pig Components
• Pig Latin
  – Command-based language
  – Designed specifically for data transformation and flow expression
• Execution Environment
  – The environment in which Pig Latin commands are executed
  – Supports local and Hadoop execution modes
• Pig compiler converts Pig Latin to MapReduce
  – Optimizations are applied automatically, in contrast to manual MapReduce code where they are left to the user
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Running Pig
• Script
  – Execute commands in a file
  – $ pig scriptFile.pig
• Grunt
  – Interactive shell for executing Pig commands
  – Started when a script file is NOT provided
  – Can execute scripts from Grunt via run or exec commands
• Embedded
  – Execute Pig commands using the PigServer class
  – Can have programmatic access to Grunt via the PigRunner class
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Pig Latin concepts
• Building blocks
  – Field – piece of data
  – Tuple – ordered set of fields, represented with "(" and ")"
    (10.4, 5, word, 4, field1)
  – Bag – collection of tuples, represented with "{" and "}"
    { (10.4, 5, word, 4, field1), (this, 1, blah) }
• Some similarities to relational databases
  – Bag is a table in the database
  – Tuple is a row in a table
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Simple Pig Latin example
Load Grunt in the default map-reduce mode:
$ pig
Grunt supports file system commands:
grunt> cat /input/pig/a.txt
a 1
d 4
c 9
k 6
Load contents of the text file into a bag called records:
grunt> records = LOAD '/input/a.txt' as (letter:chararray, count:int);
Display records on screen:
grunt> dump records;
...
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-07-14 17:36:22,040 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
...
(a,1)
(d,4)
(c,9)
(k,6)
grunt>
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Simple Pig Latin example
• No action is taken until DUMP or STORE commands are encountered
  – Pig will parse, validate and analyze statements but not execute them
• STORE – saves results (typically to a file)
• DUMP – displays the results to the screen
  – It doesn't make sense to print large arrays to the screen
  – For information and debugging purposes you can print a small sub-set to the screen:
grunt> records = LOAD '/input/excite-small.log' AS (userId:chararray, timestamp:long, query:chararray);
grunt> toPrint = LIMIT records 5;
grunt> DUMP toPrint;
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
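To round out the slide above, a minimal sketch of STORE using the toPrint relation just defined; the output directory name is made up for illustration. Like DUMP, STORE triggers execution of the pending statements:

grunt> STORE toPrint INTO '/output/excite-sample' USING PigStorage(',');   -- hypothetical output directory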
Simple Pig Latin example
LOAD 'data' [USING function] [AS schema];
• data – name of the directory or file
  – Must be in single quotes
• USING – specifies the load function to use
  – By default uses PigStorage, which parses each line into fields using a delimiter
  – Default delimiter is tab ('\t')
  – The delimiter can be customized using regular expressions
• AS – assigns a schema to incoming data
  – Assigns names and types to fields (alias:type)
  – (name:chararray, age:int, gpa:float)
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
records = LOAD '/input/excite-small.log' USING PigStorage()
          AS (userId:chararray, timestamp:long, query:chararray);

Simple data types:
Type        Description                                  Example
int         Signed 32-bit integer                        10
long        Signed 64-bit integer                        10L or 10l
float       32-bit floating point                        10.5F or 10.5f
double      64-bit floating point                        10.5 or 10.5e2 or 10.5E2
chararray   Character array (string) in Unicode UTF-8    hello world
bytearray   Byte array (blob)

Complex data types:
Type        Description                                  Example
tuple       An ordered set of fields                     (T: tuple (f1:int, f2:int))
bag         A collection of tuples                       (B: bag {T: tuple(t1:int, t2:int)})
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Pig Latin Diagnostic Tools
• Display the structure of a bag
  – grunt> DESCRIBE <alias>;   (see the example below)
• Display the execution plan
  – Produces various reports, e.g. logical plan, MapReduce plan
  – grunt> EXPLAIN <alias>;
• Illustrate how the Pig engine transforms the data
  – grunt> ILLUSTRATE <alias>;
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
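As a small illustration (exact output formatting may vary with the Pig version), running DESCRIBE on the records relation loaded earlier reports its schema:

grunt> DESCRIBE records;
records: {letter: chararray, count: int}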
Joining Two Data Sets
• Join steps
  – Load records into a bag from input #1
  – Load records into a bag from input #2
  – Join the 2 data-sets (bags) by the provided join key
• Default join is the inner join
  – Rows are joined where the keys match
  – Rows that do not have matches are not included in the result
[Figure: Venn diagram of an inner join between Set 1 and Set 2]
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Simple join example
1. Load records into a bag from input #1:
posts = load '/input/user-posts.txt' using PigStorage(',')
        as (user:chararray, post:chararray, date:long);
2. Load records into a bag from input #2:
likes = load '/input/user-likes.txt' using PigStorage(',')
        as (user:chararray, likes:int, date:long);
3. Join the two data-sets. When a key is equal in both data-sets, the rows are joined into a new single row; in this case when the user name is equal:
userInfo = join posts by user, likes by user;
dump userInfo;
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
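As a hedged illustration of what the join produces (Pig disambiguates field names from the two bags with a :: prefix; exact formatting may vary by version), DESCRIBE on the joined relation shows a flat schema combining both inputs:

grunt> DESCRIBE userInfo;
userInfo: {posts::user: chararray, posts::post: chararray, posts::date: long, likes::user: chararray, likes::likes: int, likes::date: long}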
Outer Join
• Records which do not join with the 'other' record-set are still included in the result
• Left Outer
  – Records from the first data-set are included whether they have a match or not; fields from the unmatched (second) bag are set to null
• Right Outer
  – The opposite of Left Outer Join: records from the second data-set are included no matter what; fields from the unmatched (first) bag are set to null
• Full Outer
  – Records from both sides are included; for unmatched records the fields from the 'other' bag are set to null
(a short Pig Latin sketch follows below)
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
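A minimal sketch of the corresponding Pig Latin, reusing the posts and likes relations from the simple join example (the OUTER keyword is optional in Pig):

leftJoin  = join posts by user LEFT OUTER,  likes by user;
rightJoin = join posts by user RIGHT OUTER, likes by user;
fullJoin  = join posts by user FULL OUTER,  likes by user;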
Pig Use cases
• Loading large amounts of data
  – Pig is built on top of Hadoop -> scales with the number of servers
  – Alternative to manual bulk loading, e.g. in HBase
• Using different data sources, e.g.
  – collect web server logs,
  – use external programs to fetch geo-location data for the users' IP addresses,
  – join the new set of geo-located web traffic to click maps stored
• Support for data sampling (see the SAMPLE sketch below)
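For the sampling point, a minimal sketch using Pig's SAMPLE operator; the 1% fraction is just an illustrative value:

sampled = SAMPLE records 0.01;   -- keep roughly 1% of the tuples in records
dump sampled;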
Hive
• Data warehousing solution built on top of Hadoop
• Provides an SQL-like query language named HiveQL
  – Minimal learning curve for people with SQL expertise
  – Data analysts are the target audience
• Early Hive development work started at Facebook in 2007
• Translates HiveQL statements into a set of MapReduce jobs which are then executed on a Hadoop cluster
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Hive
• Ability to bring structure to various data formats
• Simple interface for ad hoc querying, analyzing and summarizing large amounts of data
• Access to files on various data stores such as HDFS and HBase
• Hive does NOT provide low-latency or real-time queries
  – Even querying small amounts of data may take minutes
• Designed for scalability and ease-of-use rather than low-latency responses
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Hive
• To support features like schemas and data partitioning, Hive keeps its metadata in a relational database
  – Packaged with Derby, a lightweight embedded SQL DB
• The default Derby-based metastore is good for evaluation and testing
  – The schema is not shared between users, as each user has their own instance of embedded Derby
  – Stored in the metastore_db directory, which resides in the directory that Hive was started from
• Can easily switch to another SQL installation such as MySQL
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Hive Interface Options • Command Line Interface (CLI) • Hive Web Interface – https://cwiki.apache.org/confluence/display/Hive/HiveWebInterface
• Concepts re-used from relational databases
  – Database: set of tables, used for name conflict resolution
  – Table: set of rows that have the same schema (same columns)
  – Row: a single record; a set of columns
  – Column: provides value and type for a single value
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Hive creating a table

hive> CREATE TABLE posts (user STRING, post STRING, time BIGINT)   -- creates a table with 3 columns
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;                                           -- how the underlying file should be parsed
OK
Time taken: 10.606 seconds
hive> show tables;
OK
posts
Time taken: 0.221 seconds
hive> describe posts;                                               -- display schema for posts table
OK
user    string
post    string
time    bigint
Time taken: 0.212 seconds
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
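The queries in the next example assume that data has already been loaded into the posts table; a minimal sketch of one way to do that (the file path is made up for illustration, and the file would be comma-separated to match the table definition):

hive> LOAD DATA LOCAL INPATH '/tmp/user-posts.txt' OVERWRITE INTO TABLE posts;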
Hive Query Data

hive> select * from posts where user="user2";
...
OK
user2    Cool Deal    1343182133839
Time taken: 12.184 seconds

hive> select * from posts where time<=1343182133839 limit 2;
...
OK
user1    Funny Story    1343182026191
user2    Cool Deal      1343182133839
Time taken: 12.003 seconds
hive>
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Partitions
• To increase performance Hive has the capability to partition data
  – The values of the partitioned column divide a table into segments
  – Entire partitions can be ignored at query time
  – Similar to relational databases' indexes but not as granular
• Partitions have to be properly created by users
  – When inserting data, a partition must be specified
• At query time, whenever appropriate, Hive will automatically filter out partitions Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
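A hedged sketch of the partitioning described above, using a made-up variant of the posts table partitioned by country (table name, partition column and values are illustrative only):

hive> CREATE TABLE posts_by_country (user STRING, post STRING, time BIGINT)
    > PARTITIONED BY (country STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/tmp/user-posts-us.txt'
    > INTO TABLE posts_by_country PARTITION (country='US');
hive> select * from posts_by_country where country='US';   -- only the 'US' partition is scanned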
Joins
• Hive supports outer joins
  – left, right and full joins
• Can join multiple tables
• Default join is the inner join
  – Rows are joined where the keys match
  – Rows that do not have matches are not included in the result
(a HiveQL example follows below)
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
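For illustration, an inner and a left outer join between the posts table and a hypothetical likes table (mirroring the earlier Pig join example; the likes table and its columns are assumptions, not defined in these slides):

hive> SELECT p.user, p.post, l.likes
    > FROM posts p JOIN likes l ON (p.user = l.user);
hive> SELECT p.user, p.post, l.likes
    > FROM posts p LEFT OUTER JOIN likes l ON (p.user = l.user);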
Pig vs. Hive
• Hive
  – Uses an SQL-like query language called HiveQL
  – Gives non-programmers the ability to query and analyze data in Hadoop
• Pig
  – Uses a workflow-driven scripting language
  – Don't need to be an expert Java programmer, but need a few coding skills
  – Can be used to convert unstructured data into a meaningful form
Mahout • Scalable machine learning library – Built with MapReduce and Hadoop in mind – Written in Java
• Focusing on three application scenarios – Recommendation Systems – Clustering – Classifiers
• Multiple ways for utilizing Mahout – Java Interfaces – Command line interfaces
• Newest Mahout releases target Spark, not MapReduce anymore!
Classification
• Currently supported algorithms
  – Naïve Bayesian Classifier
  – Hidden Markov Models
  – Logistic Regression
  – Random Forest
Clustering
• Currently supported algorithms include k-means (used in the example at the end of this section)
• Multiple tools available to support clustering
  – clusterdump: utility to output the results of a clustering to a text file
  – cluster visualization
Mahout input arguments
• Input data has to be sequence files and sequence vectors
  – Sequence file: generic Hadoop concept for binary files containing
    • a list of key/value pairs
    • the classes used for the key and the value pair
  – Sequence vector: binary file containing a list of key/(array of values) pairs
• For using Mahout algorithms, the key has to be text and the value has to be of type VectorWritable (which is a Mahout class, not a Hadoop class)
Sequence Files
• Creating a sequence file using the command line:
$ mahout seqdirectory -i /lastfm/input/ -o /lastfm/seqfiles
• Looking at the output of a sequence file:
$ mahout seqdumper -i /lastfm/seqfiles/control-data.seq | more
Input Path: file:/lastfm/seqfiles/control-data.seq
Key class: class org.apache.hadoop.io.Text
Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: {0:28.7812,1:34.4632,2:31.3381}
Key: 1: Value: {0:24.8923,1:25.741,2:27.5532}
…
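The slides do not show the vectorization step; for text input, one common way (an assumption here, not part of the original slides) to turn seqdirectory output into the required VectorWritable vectors is Mahout's seq2sparse job, sketched with illustrative paths:

$ mahout seq2sparse -i /lastfm/seqfiles -o /lastfm/vectors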
Using Mahout clustering
• The SequenceFile containing the input vectors
• The SequenceFile containing the initial cluster centers
• The similarity measure to be used
• The convergenceThreshold
• The number of iterations to be done
• The Vector implementation used in the input files
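These inputs map onto the options of Mahout's k-means driver; a hedged command-line sketch follows (the paths, distance measure and values are illustrative, and option names may differ slightly between Mahout versions):

$ mahout kmeans \
    -i /lastfm/vectors \
    -c /lastfm/initial-centers \
    -o /lastfm/clusters \
    -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
    -cd 0.5 \
    -x 10 \
    -cl

Here -i is the SequenceFile with the input vectors, -c the SequenceFile with the initial cluster centers, -o the output directory, -dm the similarity (distance) measure, -cd the convergence threshold, -x the maximum number of iterations, and -cl asks Mahout to also assign the input points to the final clusters.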