Integrating with Apache Hadoop HPE Vertica Analytic Database
Software Version: 7.2.x
Document Release Date: 10/10/2017
Legal Notices Warranty The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HPE shall not be liable for technical or editorial errors or omissions contained herein. The information contained herein is subject to change without notice.
Restricted Rights Legend Confidential computer software. Valid license from HPE required for possession, use or copying. Consistent with FAR 12.211 and 12.212, Commercial Computer Software, Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard commercial license.
Configuring Hadoop for Co-Located Clusters    12
    webHDFS    12
    YARN    12
    Hadoop Balancer    13
    Replication Factor    13
    Disk Space for Non-HDFS Use    13
Separate Clusters    13
Choosing Which Hadoop Interface to Use    16
    Creating an HDFS Storage Location    16
    Reading ORC and Parquet Files    16
    Using the HCatalog Connector    16
    Using the HDFS Connector    17
    Using the MapReduce Connector    17
Using Kerberos with Hadoop    18
    How Vertica uses Kerberos With Hadoop    18
        User Authentication    18
        Vertica Authentication    19
        See Also    20
    Configuring Kerberos    21
        Prerequisite: Setting Up Users and the Keytab File    21
        HCatalog Connector    21
        HDFS Connector    21
        HDFS Storage Location    22
        Token Expiration    22
        See Also    22
Reading Native Hadoop File Formats    23
    Requirements    23
    Creating External Tables    23
    Loading Data    24
    Supported Data Types    24
    Kerberos Authentication    25
    Examples    25
    See Also    25
    Query Performance    25
        Considerations When Writing Files    26
        Predicate Pushdown    26
        Data Locality    26
    Configuring hdfs:/// Access    27
    Troubleshooting Reads from Native File Formats
        webHDFS Error When Using hdfs URIs
        Reads from Parquet Files Report Unexpected Data-Type Mismatches
        Time Zones in Timestamp Values Are Not Correct
        Some Date and Timestamp Values Are Wrong by Several Days
        Error 7087: Wrong Number of Columns
Installing the Java Runtime on Your Vertica Cluster    35
    Installing a Java Runtime    35
    Setting the JavaBinaryForUDx Configuration Parameter    36
Configuring Vertica for HCatalog    37
    Copy Hadoop Libraries and Configuration Files    37
    Install the HCatalog Connector    40
    Upgrading to a New Version of Vertica    40
    Additional Options for Native File Formats    41
Using the HCatalog Connector with HA NameNode    41
Defining a Schema Using the HCatalog Connector    42
Querying Hive Tables Using HCatalog Connector    43
Viewing Hive Schema and Table Metadata    44
Synchronizing an HCatalog Schema or Table With a Local Schema or Table    48
Examples    49
Data Type Conversions from Hive to Vertica    50
    Data-Width Handling Differences Between Hive and Vertica    51
Using Non-Standard SerDes    52
    Determining Which SerDe You Need    52
    Installing the SerDe on the Vertica Cluster    53
Troubleshooting HCatalog Connector Problems    54
    Connection Errors    54
    UDx Failure When Querying Data: Error 3399    55
    SerDe Errors    56
    Differing Results Between Hive and Vertica Queries    57
    Preventing Excessive Query Delays    57
Using the HDFS Connector    59
    HDFS Connector Requirements    59
        Uninstall Prior Versions of the HDFS Connector    59
        webHDFS Requirements    60
        Kerberos Authentication Requirements    60
    Testing Your Hadoop webHDFS Configuration    60
    Loading Data Using the HDFS Connector    63
        The HDFS File URL    64
        Copying Files in Parallel    64
        Viewing Rejected Rows and Exceptions    66
    Creating an External Table with an HDFS Source    66
        Load Errors in External Tables    67
    HDFS Connector Troubleshooting Tips    68
        User Unable to Connect to Kerberos-Authenticated Hadoop Cluster    68
        Resolving Error 5118    69
        Transfer Rate Errors    70
        Error Loading Many Files    71
Using HDFS Storage Locations    72
    Storage Location for HDFS Requirements    72
        HDFS Space Requirements    73
        Additional Requirements for Backing Up Data Stored on HDFS    73
    How the HDFS Storage Location Stores Data    74
        What You Can Store on HDFS    74
        What HDFS Storage Locations Cannot Do    75
    Creating an HDFS Storage Location    75
        Creating a Storage Location Using Vertica for SQL on Apache Hadoop    76
        Adding HDFS Storage Locations to New Nodes    77
    Creating a Storage Policy for HDFS Storage Locations    77
        Storing an Entire Table in an HDFS Storage Location    78
    Storing Table Partitions in HDFS    78
        Moving Partitions to a Table Stored on HDFS    80
    Backing Up Vertica Storage Locations for HDFS
        Configuring Vertica to Restore HDFS Storage Locations
        Configuration Overview
        Installing a Java Runtime
        Finding Your Hadoop Distribution's Package Repository
        Configuring Vertica Nodes to Access the Hadoop Distribution's Package Repository
        Installing the Required Hadoop Packages
        Setting Configuration Parameters
        Setting Kerberos Parameters
        Confirming that distcp Runs
        Troubleshooting
        Configuring Hadoop and Vertica to Enable Backup of HDFS Storage
        Granting Superuser Status on Hortonworks 2.1
        Granting Superuser Status on Cloudera 5.1
        Manually Enabling Snapshotting for a Directory
        Additional Requirements for Kerberos
        Testing the Database Account's Ability to Make HDFS Directories Snapshottable
        Performing Backups Containing HDFS Storage Locations
    Removing HDFS Storage Locations    93
        Removing Existing Data from an HDFS Storage Location    94
        Moving Data to Another Storage Location    94
        Clearing Storage Policies    95
        Changing the Usage of HDFS Storage Locations    97
        Dropping an HDFS Storage Location    98
        Removing Storage Location Files from HDFS    99
        Removing Backup Snapshots    99
        Removing the Storage Location Directories    100
    Troubleshooting HDFS Storage Locations    100
        HDFS Storage Disk Consumption    101
        Kerberos Authentication When Creating a Storage Location    102
        Backup or Restore Fails When Using Kerberos    103
Using the MapReduce Connector    105
    MapReduce Connector Features    105
    Prerequisites    105
        Hadoop and Vertica Cluster Scaling    106
    Installing the Connector    106
    Accessing Vertica Data From Hadoop    108
        Selecting VerticaInputFormat    108
        Setting the Query to Retrieve Data From Vertica    109
        Using a Simple Query to Extract Data From Vertica    109
        Using a Parameterized Query and Parameter Lists    110
        Using a Discrete List of Values    110
        Using a Collection Object    110
        Scaling Parameter Lists for the Hadoop Cluster    111
        Using a Query to Retrieve Parameter Values for a Parameterized Query    112
        Writing a Map Class That Processes Vertica Data    112
        Working with the VerticaRecord Class    112
    Writing Data to Vertica From Hadoop    114
        Configuring Hadoop to Output to Vertica    114
        Defining the Output Table    114
        Writing the Reduce Class    115
        Storing Data in the VerticaRecord    116
    Passing Parameters to the Vertica Connector for Hadoop Map Reduce At Run Time    119
        Specifying the Location of the Connector .jar File    119
        Specifying the Database Connection Parameters    119
        Parameters for a Separate Output Database    120
    Example Vertica Connector for Hadoop Map Reduce Application    121
    Compiling and Running the Example Application    125
        Compiling the Example (optional)    126
        Running the Example Application    127
        Verifying the Results    128
    Using Hadoop Streaming with the Vertica Connector for Hadoop Map Reduce    129
        Reading Data From Vertica in a Streaming Hadoop Job    129
        Writing Data to Vertica in a Streaming Hadoop Job    132
        Loading a Text File From HDFS into Vertica    133
    Accessing Vertica From Pig    135
        Registering the Vertica .jar Files    135
        Reading Data From Vertica    135
        Writing Data to Vertica    136
Integrating Vertica with the MapR Distribution of Hadoop    138
Send Documentation Feedback    139
Introduction to Hadoop Integration

Apache™ Hadoop™, like Vertica, uses a cluster of nodes for distributed processing. The primary component of interest is HDFS, the Hadoop Distributed File System. You can use HDFS from Vertica in several ways:

- You can import HDFS data into locally-stored ROS files.
- You can access HDFS data in place, using external tables.
- You can use HDFS as a storage location for ROS files.
Hadoop includes two other components of interest:

- Hive, a data warehouse that provides the ability to query data stored in Hadoop.
- HCatalog, a component that makes Hive metadata available to applications, such as Vertica, outside of Hadoop.
A Hadoop cluster can use Kerberos authentication to protect data stored in HDFS. Vertica integrates with Kerberos to access HDFS data if needed. See Using Kerberos with Hadoop.
Hadoop Distributions

Vertica can be used with Hadoop distributions from Hortonworks, Cloudera, and MapR. See Vertica Integrations for Hadoop for the specific versions that are supported.
Integration Options

Vertica supports two cluster architectures. Which you use affects the decisions you make about integration.

- You can co-locate Vertica on some or all of your Hadoop nodes. Vertica can then take advantage of local data. This option is supported only for Vertica for SQL on Apache Hadoop.
- You can build a Vertica cluster that is separate from your Hadoop cluster. In this configuration, Vertica can fully use each of its nodes; it does not share resources with Hadoop. This option is not supported for Vertica for SQL on Apache Hadoop.

These layout options are described in Cluster Layout. Both layouts support several interfaces for using Hadoop:
- An HDFS Storage Location uses HDFS to hold Vertica data (ROS files).
- The HCatalog Connector lets Vertica query data that is stored in a Hive database the same way you query data stored natively in a Vertica schema.
- Vertica can directly query data in native Hadoop file formats (ORC and Parquet). This option is faster than using the HCatalog Connector for this type of data. See Reading Native Hadoop File Formats.
- The HDFS Connector lets Vertica import HDFS data. It also lets Vertica read HDFS data as an external table without using Hive.
- The MapReduce Connector lets you create Hadoop MapReduce jobs that retrieve data from Vertica. These jobs can also insert data into Vertica.
File Paths

Hadoop file paths are generally expressed using the webhdfs scheme, such as 'webhdfs://somehost:port/opt/data/filename'. These paths are URIs, so if you need to escape a special character in a path, use URI escaping. For example:

webhdfs://somehost:port/opt/data/my%20file
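For example, such a path can be used anywhere Vertica accepts an HDFS file path, as in the following COPY of an ORC file (a minimal sketch; the host name, port, table name, and file name are placeholders, and 50070 is assumed to be your webHDFS port):

-- hypothetical host, port, and file; see Reading Native Hadoop File Formats for details
=> COPY t FROM 'webhdfs://somehost:50070/opt/data/my%20file' ON ANY NODE ORC;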
Cluster Layout

Vertica and Hadoop each use a cluster of nodes for distributed processing. These clusters can be co-located, meaning you run both products on the same machines, or separate. Co-Located Clusters are for use with Vertica for SQL on Apache Hadoop licenses. Separate Clusters are for use with Premium Edition and Community Edition licenses.

With either architecture, if you are using the hdfs scheme to read ORC or Parquet files, you must do some additional configuration. See Configuring hdfs:/// Access.
Co-Located Clusters

With co-located clusters, Vertica is installed on some or all of your Hadoop nodes. The Vertica nodes use a private network in addition to the public network used by all Hadoop nodes, as the following figure shows:
You might choose to place Vertica on all of your Hadoop nodes or only on some of them. If you are using HDFS Storage Locations, you should use at least three Vertica nodes, the minimum number for K-Safety. Using more Vertica nodes can improve performance because the HDFS data needed by a query is more likely to be local. Normally, both Hadoop and Vertica use the entire node. Because this configuration uses shared nodes, you must address potential resource contention in your configuration on those nodes. See Configuring Hadoop for Co-Located Clusters for more information. No changes are needed on Hadoop-only nodes. You can place Hadoop and Vertica clusters within a single rack, or you can span across many racks and nodes. Spreading node types across racks can improve efficiency.
Hardware Recommendations

Hadoop clusters frequently do not have identical provisioning requirements or hardware configurations. However, Vertica nodes should be equivalent in size and capability, per the best-practice standards recommended in General Hardware and OS Requirements and Recommendations in Installing Vertica. Because Hadoop cluster specifications do not always meet these standards, Hewlett Packard Enterprise recommends the following specifications for Vertica nodes in your Hadoop cluster.

Processor: For best performance, run:
- Two-socket servers with 8–14 core CPUs, clocked at or above 2.6 GHz for clusters over 10 TB
- Single-socket servers with 8–12 cores clocked at or above 2.6 GHz for clusters under 10 TB

Memory: Distribute the memory appropriately across all memory channels in the server:
- Minimum: 8 GB of memory per physical CPU core in the server
- High-performance applications: 12–16 GB of memory per physical core
- Type: at least DDR3-1600, preferably DDR3-1866

Storage: Read/write:
- Minimum: 40 MB/s per physical core of the CPU
- For best performance: 60–80 MB/s per physical core

Storage post RAID: Each node should have 1–9 TB. For a production setting, RAID 10 is recommended. In some cases, RAID 50 is acceptable. Because of the heavy compression and encoding that Vertica does, SSDs are not required. In most cases, a RAID of more, less-expensive HDDs performs just as well as a RAID of fewer SSDs. If you intend to use RAID 50 for your data partition, you should keep a spare node in every rack, allowing for manual failover of a Vertica node in the case of a drive failure. A Vertica node recovery is faster than a RAID 50 rebuild. Also, be sure to never put more than 10 TB compressed on any node, to keep node recovery times at an acceptable rate.

Network: 10 GB networking in almost every case. With the introduction of 10 GB over cat6a (Ethernet), the cost difference is minimal.
Configuring Hadoop for Co-Located Clusters

If you are co-locating Vertica on any HDFS nodes, there are some additional configuration requirements.
webHDFS

Hadoop has two services that can provide web access to HDFS:

- webHDFS
- httpFS

For Vertica, you must use the webHDFS service.
YARN

The YARN service is available in newer releases of Hadoop. It performs resource management for Hadoop clusters. When co-locating Vertica on YARN-managed Hadoop nodes, you must make some changes in YARN. HPE recommends reserving at least 16 GB of memory for Vertica on shared nodes. Reserving more will improve performance. How you do this depends on your Hadoop distribution:

- If you are using Hortonworks, create a "Vertica" node label and assign it to the nodes that are running Vertica.
- If you are using Cloudera, enable and configure static service pools.

Consult the documentation for your Hadoop distribution for details. Alternatively, you can disable YARN on the shared nodes.
Hadoop Balancer

The Hadoop Balancer can redistribute data blocks across HDFS. For many Hadoop services, this feature is useful. However, for Vertica this can reduce performance under some conditions. If you are using HDFS storage locations, the Hadoop load balancer can move data away from the Vertica nodes that are operating on it, degrading performance. This behavior can also occur when reading ORC or Parquet files if Vertica is not running on all Hadoop nodes. (If you are using separate Vertica and Hadoop clusters, all Hadoop access is over the network, and the performance cost is less noticeable.)

To prevent the undesired movement of data blocks across the HDFS cluster, consider excluding Vertica nodes from rebalancing. See the Hadoop documentation to learn how to do this.
Replication Factor

By default, HDFS stores three copies of each data block. Vertica is generally set up to store two copies of each data item through K-Safety. Thus, lowering the replication factor to 2 can save space and still provide data protection. To lower the number of copies HDFS stores, set HadoopFSReplication, as explained in Troubleshooting HDFS Storage Locations.
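For example, assuming a database named mydb, the parameter can be set with a statement along these lines (a minimal sketch; confirm the exact procedure and value in Troubleshooting HDFS Storage Locations before changing replication):

-- mydb is a hypothetical database name; 2 matches the replication factor discussed above
=> ALTER DATABASE mydb SET HadoopFSReplication = 2;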
Disk Space for Non-HDFS Use

You also need to reserve some disk space for non-HDFS use. To reserve disk space using Ambari, set dfs.datanode.du.reserved to a value in the hdfs-site.xml configuration file. Setting this parameter preserves space for non-HDFS files that Vertica requires.
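The following hdfs-site.xml fragment is a minimal sketch of such a reservation; the 10 GB value shown here is only an illustration, not a sizing recommendation:

<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- bytes reserved per volume for non-HDFS use; 10737418240 = 10 GB (example value only) -->
  <value>10737418240</value>
</property>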
Separate Clusters

In the Premium Edition product, your Vertica and Hadoop clusters must be set up on separate nodes, ideally connected by a high-bandwidth network connection. This is different from the configuration for Vertica for SQL on Apache Hadoop, in which Vertica nodes are co-located on Hadoop nodes. The following figure illustrates the configuration for separate clusters:
The network is a key performance component of any well-configured cluster. When Vertica stores data to HDFS, it writes and reads data across the network. The layout shown in the figure calls for two networks, and there are benefits to adding a third:

- Database Private Network: Vertica uses a private network for command and control and moving data between nodes in support of its database functions. In some networks, the command and control and passing of data are split across two networks.
- Database/Hadoop Shared Network: Each Vertica node must be able to connect to each Hadoop data node and the NameNode. Hadoop best practices generally require a dedicated network for the Hadoop cluster. This is not a technical requirement, but a dedicated network improves Hadoop performance. Vertica and Hadoop should share the dedicated Hadoop network.
- Optional Client Network: Outside clients may access the clustered networks through a client network. This is not an absolute requirement, but the use of a third network that supports client connections to either Vertica or Hadoop can improve performance. If the configuration does not support a client network, then client connections should use the shared network.
Choosing Which Hadoop Interface to Use

Vertica provides several ways to interact with data stored in Hadoop. This section explains how to choose among them. Decisions about Cluster Layout can affect the decisions you make about Hadoop interfaces.
Creating an HDFS Storage Location

Using a storage location to store data in the Vertica native file format (ROS) delivers the best query performance among the available Hadoop options. (Storing ROS files on the local disk rather than in Hadoop is faster still.) If you already have data in Hadoop, however, doing this means you are importing that data into Vertica.

For co-located clusters, which do not use local file storage, you might still choose to use an HDFS storage location for better performance. You can use the HDFS Connector to load data that is already in HDFS into Vertica. For separate clusters, which use local file storage, consider using an HDFS storage location for lower-priority data.

See Using HDFS Storage Locations and Using the HDFS Connector.
Reading ORC and Parquet Files

If your data is stored in the Optimized Row Columnar (ORC) or Parquet format, Vertica can query that data directly from HDFS. This option is faster than using the HCatalog Connector, but you cannot pull schema definitions from Hive directly into the database. Vertica reads the data in place; no extra copies are made. See Reading Native Hadoop File Formats.
Using the HCatalog Connector

The HCatalog Connector uses Hadoop services (Hive and HCatalog) to query data stored in HDFS. Like the ORC Reader, it reads data in place rather than making copies. Using this interface you can read all file formats supported by Hadoop, including Parquet and ORC, and Vertica can use Hive's schema definitions. However, performance can be poor in some cases. The HCatalog Connector is also sensitive to changes in the Hadoop libraries on which it depends; upgrading your Hadoop cluster might affect your HCatalog connections. See Using the HCatalog Connector.
Using the HDFS Connector

The HDFS Connector can be used to create and query external tables, reading the data in place rather than making copies. The HDFS Connector can be used with any data format for which a parser is available. It does not use Hive data; you have to define the table yourself. Its performance can be poor because, like the HCatalog Connector, it cannot take advantage of the benefits of columnar file formats. See Using the HDFS Connector.
Using the MapReduce Connector

The other interfaces described in this section allow you to read Hadoop data from Vertica or create Vertica data in Hadoop. The MapReduce Connector, in contrast, allows you to integrate with Hadoop's MapReduce jobs. Use this connector to send Vertica data to MapReduce or to have MapReduce jobs create data in Vertica. See Using the MapReduce Connector.
Using Kerberos with Hadoop

If your Hadoop cluster uses Kerberos authentication to restrict access to HDFS, you must configure Vertica to make authenticated connections. The details of this configuration vary, based on which methods you are using to access HDFS data:

- How Vertica uses Kerberos With Hadoop
- Configuring Kerberos
How Vertica uses Kerberos With Hadoop

Vertica authenticates with Hadoop in two ways that require different configurations:

- User Authentication—On behalf of the user, by passing along the user's existing Kerberos credentials, as occurs with the HDFS Connector and the HCatalog Connector.
- Vertica Authentication—On behalf of system processes (such as the Tuple Mover), by using a special Kerberos credential stored in a keytab file.
User Authentication

To use Vertica with Kerberos and Hadoop, the client user first authenticates with the Kerberos server (Key Distribution Center, or KDC) being used by the Hadoop cluster. A user might run kinit or sign in to Active Directory, for example.

A user who authenticates to a Kerberos server receives a Kerberos ticket. At the beginning of a client session, Vertica automatically retrieves this ticket. The database then uses this ticket to get a Hadoop token, which Hadoop uses to grant access. Vertica uses this token to access HDFS, such as when executing a query on behalf of the user. When the token expires, the database automatically renews it, also renewing the Kerberos ticket if necessary.

The user must have been granted permission to access the relevant files in HDFS. This permission is checked the first time Vertica reads HDFS data.

The following figure shows how the user, Vertica, Hadoop, and Kerberos interact in user authentication:
When using the HDFS Connector or the HCatalog Connector, or when reading an ORC or Parquet file stored in HDFS, Vertica uses the client identity as the preceding figure shows.
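For example, a user might obtain a Kerberos ticket and then connect as usual (a minimal sketch; the principal, realm, host, and user names are hypothetical, and the exact client authentication setup depends on your environment):

$ kinit exampleuser@EXAMPLE.COM        # obtain a Kerberos ticket (hypothetical principal)
$ vsql -h vertica01.example.com -U exampleuser   # connect; Vertica uses the session's credentials for HDFS access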
Vertica Authentication

Automatic processes, such as the Tuple Mover, do not log in the way users do. Instead, Vertica uses a special identity (principal) stored in a keytab file on every database node. (This approach is also used for Vertica clusters that use Kerberos but do not use Hadoop.) After you configure the keytab file, Vertica uses the principal residing there to automatically obtain and maintain a Kerberos ticket, much as in the client scenario. In this case, the client does not interact with Kerberos.

The following figure shows the interactions required for Vertica authentication:
Each Vertica node uses its own principal; it is common to incorporate the name of the node into the principal name. You can either create one keytab per node, containing only that node's principal, or you can create a single keytab containing all the principals and distribute the file to all nodes. Either way, the node uses its principal to get a Kerberos ticket and then uses that ticket to get a Hadoop token. For simplicity, the preceding figure shows the full set of interactions for only one database node.

When creating HDFS storage locations, Vertica uses the principal in the keytab file, not the principal of the user issuing the CREATE LOCATION statement.
See Also

For specific configuration instructions, see Configuring Kerberos.
Configuring Kerberos

Vertica can connect with Hadoop in several ways, and how you manage Kerberos authentication varies by connection type. This documentation assumes that you are using Kerberos for both your HDFS and Vertica clusters.
Prerequisite: Setting Up Users and the Keytab File

If you have not already configured Kerberos authentication for Vertica, follow the instructions in Configure for Kerberos Authentication. In particular:

- Create one Kerberos principal per node.
- Place the keytab file(s) in the same location on each database node and set its location in KerberosKeytabFile (see Specify the Location of the Keytab File).
- Set KerberosServiceName to the name of the principal (see Inform About the Kerberos Principal).
HCatalog Connector

You use the HCatalog Connector to query data in Hive. Queries are executed on behalf of Vertica users. If the current user has a Kerberos key, then Vertica passes it to the HCatalog connector automatically. Verify that all users who need access to Hive have been granted access to HDFS.

In addition, in your Hadoop configuration files (core-site.xml in most distributions), make sure that you enable all Hadoop components to impersonate the Vertica user. The easiest way to do this is to set the proxyuser property using wildcards for all users on all hosts and in all groups. Consult your Hadoop documentation for instructions. Make sure you do this before running hcatUtil (see Configuring Vertica for HCatalog).
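For illustration, a core-site.xml fragment along these lines allows a Hive service account to impersonate users from any host and any group (a minimal sketch; the service user name "hive" is an assumption, and your distribution may manage these properties through Ambari or Cloudera Manager instead of direct edits):

<!-- assumption: your Hive/WebHCat services run as the "hive" user -->
<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>*</value>
</property>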
HDFS Connector

The HDFS Connector loads data from HDFS into Vertica on behalf of the user, using a User Defined Source. If the user performing the data load has a Kerberos key, then the UDS uses it to access HDFS. Verify that all users who use this connector have been granted access to HDFS.
HDFS Storage Location

You can create a database storage location in HDFS. An HDFS storage location provides improved performance compared to other HDFS interfaces (such as the HCatalog Connector). After you create Kerberos principals for each node, give all of them read and write permissions to the HDFS directory you will use as a storage location. If you plan to back up HDFS storage locations, take the following additional steps:

- Grant Hadoop superuser privileges to the new principals.
- Configure backups, including setting the HadoopConfigDir configuration parameter, following the instructions in Configuring Hadoop and Vertica to Enable Backup of HDFS Storage.
- Configure user impersonation to be able to restore from backups, following the instructions in "Setting Kerberos Parameters" in Configuring Vertica to Restore HDFS Storage Locations.
Because the keytab file supplies the principal used to create the location, you must have it in place before creating the storage location. After you deploy keytab files to all database nodes, use the CREATE LOCATION statement to create the storage location as usual.
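For example, once the keytab files are in place, the statement might look like the following (a minimal sketch; the NameNode host, port, and path are placeholders, and the exact options are described in Creating an HDFS Storage Location):

-- hypothetical NameNode host, webHDFS port, and directory
=> CREATE LOCATION 'webhdfs://hadoopNameNode:50070/user/dbadmin/verticadata'
   ALL NODES USAGE 'data';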
Token Expiration

Vertica attempts to automatically refresh Hadoop tokens before they expire, but you can also set a minimum refresh frequency if you prefer. The HadoopFSTokenRefreshFrequency configuration parameter specifies the frequency in seconds:

=> ALTER DATABASE exampledb SET HadoopFSTokenRefreshFrequency = '86400';
If the current age of the token is greater than the value specified in this parameter, Vertica refreshes the token before accessing data stored in HDFS.
See Also

- How Vertica uses Kerberos With Hadoop
- Troubleshooting Kerberos Authentication
Reading Native Hadoop File Formats

When you create external tables or copy data into tables, you can access data in certain native Hadoop formats directly. Currently, Vertica supports the ORC (Optimized Row Columnar) and Parquet formats. Because this approach allows you to define your tables yourself instead of fetching the metadata through WebHCat, these readers can provide slightly better performance than the HCatalog Connector. If you are already using the HCatalog Connector for other reasons, however, you might find it more convenient to use it to read data in these formats also. See Using the HCatalog Connector.

You can use the hdfs scheme to access ORC and Parquet files stored in HDFS, as explained later in this section. To use this scheme you must perform some additional configuration; see Configuring hdfs:/// Access.
Requirements

The ORC or Parquet files must not use complex data types. All simple data types supported in Hive version 0.11 or later are supported.

Files compressed by Hive or Impala require Zlib (GZIP) or Snappy compression. Vertica does not support LZO compression for these formats.
Creating External Tables

In the CREATE EXTERNAL TABLE AS COPY statement, specify a format of ORC or PARQUET as follows:

=> CREATE EXTERNAL TABLE tableName (columns) AS COPY FROM path ORC;
=> CREATE EXTERNAL TABLE tableName (columns) AS COPY FROM path PARQUET;

- If the file resides on the local file system of the node where you issue the command—use a local file path for path. Escape special characters in file paths with backslashes.
- If the file resides elsewhere in HDFS—use the hdfs:/// prefix (three slashes), and then specify the file path. Escape special characters in HDFS paths using URI encoding, for example %20 for space.
Vertica automatically converts from the hdfs scheme to the webhdfs scheme if necessary. You can also directly use a webhdfs:// prefix and specify the host name, port, and file path. Using the hdfs scheme potentially provides better performance when reading files not protected by Kerberos.

When defining an external table, you must define all of the columns in the file. Unlike with some other data sources, you cannot select only the columns of interest. If you omit columns, the ORC or Parquet reader aborts with an error.

Files stored in HDFS are governed by HDFS privileges. For files stored on the local disk, however, Vertica requires that users be granted access. All users who have administrative privileges have access. For other users, you must create a storage location and grant access to it. See CREATE EXTERNAL TABLE AS COPY. HDFS privileges are still enforced, so it is safe to create a location for webhdfs://host:port. Only users who have access to both the Vertica user storage location and the HDFS directory can read from the table.
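For example, to let a non-administrator read ORC files from a local directory, you might create a user-accessible storage location and grant read access to it (a minimal sketch; the path and user name are hypothetical, and the directory must exist on each node):

-- hypothetical path and user
=> CREATE LOCATION '/data/orc_files' ALL NODES USAGE 'USER';
=> GRANT READ ON LOCATION '/data/orc_files' TO exampleuser;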
Loading Data

In the COPY statement, specify a format of ORC or PARQUET:

=> COPY tableName FROM path ORC;
=> COPY tableName FROM path PARQUET;

For files that are not local, specify ON ANY NODE to improve performance:

=> COPY t FROM 'hdfs:///opt/data/orcfile' ON ANY NODE ORC;

As with external tables, path may be a local or hdfs:/// path. Be aware that if you load from multiple files in the same COPY statement, and any of them is aborted, the entire load aborts. This behavior differs from that for delimited files, where the COPY statement loads what it can and ignores the rest.
Supported Data Types

The Vertica ORC and Parquet file readers can natively read columns of all data types supported in Hive version 0.11 and later except for complex types. If complex types such as maps are encountered, the COPY or CREATE EXTERNAL TABLE AS COPY statement aborts with an error message. The readers do not attempt to read only some columns; either the entire file is read or the operation fails. For a complete list of supported types, see HIVE Data Types.
Kerberos Authentication

If the file to be read is located on an HDFS cluster that uses Kerberos authentication, Vertica uses the current user's principal to authenticate. It does not use the database's principal.
Examples

The following example shows how you can read from all ORC files in a local directory. This example uses all supported data types.

=> CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT,
   a5 FLOAT, a6 DOUBLE PRECISION, a7 BOOLEAN, a8 DATE, a9 TIMESTAMP,
   a10 VARCHAR(20), a11 VARCHAR(20), a12 CHAR(20), a13 BINARY(20),
   a14 DECIMAL(10,5))
   AS COPY FROM '/data/orc_test_*.orc' ORC;
The following example shows the error that is produced if the file you specify is not recognized as an ORC file:

=> CREATE EXTERNAL TABLE t (a1 TINYINT, a2 SMALLINT, a3 INT, a4 BIGINT, a5 FLOAT)
   AS COPY FROM '/data/not_an_orc_file.orc' ORC;
ERROR 0: Failed to read orc source [/data/not_an_orc_file.orc]: Not an ORC file
See Also

- Query Performance
- Troubleshooting Reads from Native File Formats
Query Performance

When working with external tables in native formats, Vertica tries to improve performance in two ways:

- Pushing query execution closer to the data so less has to be read and transmitted
- Using data locality in planning the query
Considerations When Writing Files

The decisions you make when writing ORC and Parquet files can affect performance when using them. To get the best performance from Vertica, follow these guidelines when writing your files:

- Use the latest available Hive version. (You can still read your files with earlier versions.)
- Use a large stripe size. 256 MB or greater is preferred.
- Partition the data at the table level.
- Sort the columns based on frequency of access, with most-frequently accessed columns appearing first.
- Use Snappy or Zlib/GZIP compression.
Predicate Pushdown

Predicate pushdown moves parts of the query execution closer to the data, reducing the amount of data that must be read from disk or across the network. ORC files have three levels of indexing: file statistics, stripe statistics, and row group indexes. Predicates are applied only to the first two levels. Parquet files can have statistics in the ColumnMetaData and DataPageHeader. Predicates are applied only to the ColumnMetaData.

Predicate pushdown is automatically applied for files written with Hive version 0.14 and later. Files written with earlier versions of Hive might not contain the required statistics. When executing a query against a file that lacks these statistics, Vertica logs an EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED event in the QUERY_EVENTS system table. If you are seeing performance problems with your queries, check this table for these events.
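For example, you can look for these events with a query such as the following (a minimal sketch against the standard QUERY_EVENTS system table):

-- list any queries whose predicates could not be pushed down
=> SELECT * FROM QUERY_EVENTS
   WHERE event_type = 'EXTERNAL_PREDICATE_PUSHDOWN_NOT_SUPPORTED';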
Data Locality

In a cluster where Vertica nodes are co-located on HDFS nodes, the query can use data locality to improve performance. For Vertica to do so, both of the following conditions must exist:

- The data is on an HDFS node where a database node is also present.
- The query is not restricted to specific nodes using ON NODE.

When both these conditions exist, the query planner uses the co-located database node to read that data locally, instead of making a network call. You can see how much data is being read locally by inspecting the query plan. The label for LoadStep(s) in the plan contains a statement of the form: "X% of ORC data matched with co-located Vertica nodes". To increase the volume of local reads, consider adding more database nodes. HDFS data, by its nature, can't be moved to specific nodes, but if you run more database nodes you increase the likelihood that a database node is local to one of the copies of the data.
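For example, you can view the plan, including the LoadStep labels described above, by running EXPLAIN on a query against the external table (a minimal sketch; the table name t follows the earlier examples):

=> EXPLAIN SELECT COUNT(*) FROM t;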
Configuring hdfs:/// Access

When reading ORC or Parquet files from HDFS, you can use the hdfs scheme instead of the webhdfs scheme. Using the hdfs scheme can improve performance by bypassing the webHDFS service. To support the hdfs scheme, your Vertica nodes need access to certain Hadoop configuration files.

If Vertica is co-located on HDFS nodes, then those files are already present. Verify that the HadoopConfDir environment variable is correctly set. Its path should include a directory containing the core-site.xml and hdfs-site.xml files.

If Vertica is running on a separate cluster, you must copy the required files to those nodes and set the HadoopConfDir environment variable. A simple way to do so is to configure your Vertica nodes as Hadoop edge nodes. Edge nodes are used to run client applications; from Hadoop's perspective, Vertica is a client application. You can use Ambari or Cloudera Manager to configure edge nodes. For more information, see the documentation for your Hadoop vendor.

Using the hdfs scheme does not remove the need for access to the webHDFS service. The hdfs scheme is not available for all files. If hdfs is not available, then Vertica automatically uses webhdfs instead.

If you update the configuration files after starting Vertica, use the following statement to refresh them:

=> SELECT CLEAR_CACHES();
Troubleshooting Reads from Native File Formats

You might encounter the following issues when reading ORC or Parquet files.
webHDFS Error When Using hdfs URIs

When creating an external table or loading data and using the hdfs scheme, you might see errors from webHDFS failures. Such errors indicate that Vertica was not able to use the hdfs scheme and fell back to webhdfs, but that the webHDFS configuration is incorrect.

Verify that the HDFS configuration files in HadoopConfDir have the correct webHDFS configuration for your Hadoop cluster. See Configuring hdfs:/// Access for information about use of these files. See your Hadoop documentation for information about webHDFS configuration.
Reads from Parquet Files Report Unexpected Data-Type Mismatches

If a Parquet file contains a column of type STRING but the column in Vertica is of a different type, such as INT, you might see an unclear error message. In this case Vertica reports the column in the Parquet file as BYTE_ARRAY, as shown in the following example:

ERROR 0: Datatype mismatch: column 2 in the parquet_cpp source [/tmp/nation.0.parquet] has type BYTE_ARRAY, expected int
This behavior is specific to Parquet files; with an ORC file the type is correctly reported as STRING. The problem occurs because Parquet does not natively support the STRING type and uses BYTE_ARRAY for strings instead. Because the Parquet file reports its type as BYTE_ARRAY, Vertica has no way to determine if the type is actually a BYTE_ARRAY or a STRING.
Time Zones in Timestamp Values Are Not Correct

Reading time stamps from an ORC or Parquet file in Vertica might result in different values, based on the local time zone. This issue occurs because the ORC and Parquet formats do not support the SQL TIMESTAMP data type. If you define the column in your table with the TIMESTAMP data type, Vertica interprets time stamps read from ORC or Parquet files as values in the local time zone. This same behavior occurs in Hive. When this situation occurs, Vertica produces a warning at query time, such as the following:

WARNING 0: SQL TIMESTAMPTZ is more appropriate for ORC TIMESTAMP because values are stored in UTC
When creating the table in Vertica, you can avoid this issue by using the TIMESTAMPTZ data type instead of TIMESTAMP.
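For example, a definition along these lines avoids the warning (a minimal sketch; the table name, column names, and path are hypothetical):

-- hypothetical table, columns, and ORC file path
=> CREATE EXTERNAL TABLE events (id INT, event_time TIMESTAMPTZ)
   AS COPY FROM 'hdfs:///data/events/*.orc' ORC;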
Some Date and Timestamp Values Are Wrong by Several Days

When Hive writes ORC or Parquet files, it converts dates before 1583 from the Gregorian calendar to the Julian calendar. Vertica does not perform this conversion. If your file contains dates before this time, values in Hive and the corresponding values in Vertica can differ by up to ten days. This difference applies to both DATE and TIMESTAMP values.
Error 7087: Wrong Number of Columns

When loading data, you might see an error stating that you have the wrong number of columns:

=> CREATE TABLE nation (nationkey bigint, name varchar(500), regionkey bigint, comment varchar(500));
CREATE TABLE
=> COPY nation from :orc_dir ORC;
ERROR 7087: Attempt to load 4 columns from an orc source [/tmp/orc_glob/test.orc] that has 9 columns
When you load data from Hadoop native file formats, your table must consume all of the data in the file, or this error results. To avoid this problem, add the missing columns to your table definition.
Using the HCatalog Connector

The Vertica HCatalog Connector lets you access data stored in Apache's Hive data warehouse software the same way you access it within a native Vertica table. If your files are in the Optimized Row Columnar (ORC) or Parquet format and do not use complex types, the HCatalog Connector creates an external table and uses the ORC or Parquet reader instead of using the Java SerDe. See Reading Native Hadoop File Formats for more information about these readers.

The HCatalog Connector performs predicate pushdown to improve query performance. Instead of reading all data across the network to evaluate a query, the HCatalog Connector moves the evaluation of predicates closer to the data. Predicate pushdown applies to Hive partition pruning, ORC stripe pruning, and Parquet row-group pruning. The HCatalog Connector supports predicate pushdown for the following predicates: >, >=, =, <>, <=, <.
Hive, HCatalog, and WebHCat Overview

There are several Hadoop components that you need to understand in order to use the HCatalog connector:

- Apache's Hive lets you query data stored in a Hadoop Distributed File System (HDFS) the same way you query data stored in a relational database. Behind the scenes, Hive uses a set of serializer and deserializer (SerDe) classes to extract data from files stored on the HDFS and break it into columns and rows. Each SerDe handles data files in a specific format. For example, one SerDe extracts data from comma-separated data files while another interprets data stored in JSON format.
- Apache HCatalog is a component of the Hadoop ecosystem that makes Hive's metadata available to other Hadoop components (such as Pig).
- WebHCat (formerly known as Templeton) makes HCatalog and Hive data available via a REST web API. Through it, you can make an HTTP request to retrieve data stored in Hive, as well as information about the Hive schema.
Vertica's HCatalog Connector lets you transparently access data that is available through WebHCat. You use the connector to define a schema in Vertica that corresponds to a Hive database or schema. When you query data within this schema, the HCatalog Connector transparently extracts and formats the data from Hadoop into tabular data. The data within this HCatalog schema appears as if it is native to Vertica. You can even perform operations such as joins between Vertica-native tables and HCatalog tables. For more details, see How the HCatalog Connector Works.
HCatalog Connection Features

The HCatalog Connector lets you query data stored in Hive using the Vertica native SQL syntax. Some of its main features are:

- The HCatalog Connector always reflects the current state of data stored in Hive.
- The HCatalog Connector uses the parallel nature of both Vertica and Hadoop to process Hive data. The result is that querying data through the HCatalog Connector is often faster than querying the data directly through Hive.
- Since Vertica performs the extraction and parsing of data, the HCatalog Connector does not significantly increase the load on your Hadoop cluster.
- The data you query through the HCatalog Connector can be used as if it were native Vertica data. For example, you can execute a query that joins data from a table in an HCatalog schema with a native table.
HCatalog Connection Considerations

There are a few things to keep in mind when using the HCatalog Connector:

- Hive's data is stored in flat files in a distributed filesystem, requiring it to be read and deserialized each time it is queried. This deserialization causes Hive's performance to be much slower than Vertica's. The HCatalog Connector has to perform the same process as Hive to read the data. Therefore, querying data stored in Hive using the HCatalog Connector is much slower than querying a native Vertica table. If you need to perform extensive analysis on data stored in Hive, you should consider loading it into Vertica through the HCatalog Connector or the WebHDFS connector. Vertica optimization often makes querying data through the HCatalog Connector faster than directly querying it through Hive.
- Hive supports complex data types such as lists, maps, and structs that Vertica does not support. Columns containing these data types are converted to a JSON representation of the data type and stored as a VARCHAR. See Data Type Conversions from Hive to Vertica.

Note: The HCatalog Connector is read only. It cannot insert data into Hive.
How the HCatalog Connector Works

When planning a query that accesses data from a Hive table, the Vertica HCatalog Connector on the initiator node contacts the WebHCat server in your Hadoop cluster to determine if the table exists. If it does, the connector retrieves the table's metadata from the metastore database so the query planning can continue. When the query executes, all nodes in the Vertica cluster directly retrieve the data necessary for completing the query from HDFS. They then use the Hive SerDe classes to extract the data so the query can execute.
This approach takes advantage of the parallel nature of both Vertica and Hadoop. In addition, by performing the retrieval and extraction of data directly, the HCatalog Connector reduces the impact of the query on the Hadoop cluster.
HCatalog Connector Requirements

Before you can use the HCatalog Connector, both your Vertica and Hadoop installations must meet the following requirements.
Vertica Requirements

All of the nodes in your cluster must have a Java Virtual Machine (JVM) installed. See Installing the Java Runtime on Your Vertica Cluster.

You must also add certain libraries distributed with Hadoop and Hive to your Vertica installation directory. See Configuring Vertica for HCatalog.
Hadoop Requirements

Your Hadoop cluster must meet several requirements to operate correctly with the Vertica Connector for HCatalog:

- It must have Hive and HCatalog installed and running. See Apache's HCatalog page for more information.
- It must have WebHCat (formerly known as Templeton) installed and running. See Apache's WebHCat page for details.
- The WebHCat server and all of the HDFS nodes that store HCatalog data must be directly accessible from all of the hosts in your Vertica database. Verify that any firewall separating the Hadoop cluster and the Vertica cluster will pass WebHCat, metastore database, and HDFS traffic.
- The data that you want to query must be in an internal or external Hive table.
- If a table you want to query uses a non-standard SerDe, you must install the SerDe's classes on your Vertica cluster before you can query the data. See Using Non-Standard SerDes.
Testing Connectivity

To test the connection between your database cluster and WebHCat, log into a node in your Vertica cluster. Then, run the following command to execute an HCatalog query:

$ curl http://webHCatServer:port/templeton/v1/status?user.name=hcatUsername

Where:

- webHCatServer is the IP address or hostname of the WebHCat server
- port is the port number assigned to the WebHCat service (usually 50111)
- hcatUsername is a valid username authorized to use HCatalog

Usually, you want to append ;echo to the command to add a linefeed after the curl command's output. Otherwise, the command prompt is automatically appended to the command's output, making it harder to read. For example:

$ curl http://hcathost:50111/templeton/v1/status?user.name=hive; echo
If there are no errors, this command returns a status message in JSON format, similar to the following:

{"status":"ok","version":"v1"}
This result indicates that WebHCat is running and that the Vertica host can connect to it and retrieve a result. If you do not receive this result, troubleshoot your Hadoop installation and the connectivity between your Hadoop and Vertica clusters. For details, see Troubleshooting HCatalog Connector Problems.

You can also run some queries to verify that WebHCat is correctly configured to work with Hive. The following example demonstrates listing the databases defined in Hive and the tables defined within a database:

$ curl http://hcathost:50111/templeton/v1/ddl/database?user.name=hive; echo
{"databases":["default","production"]}

$ curl http://hcathost:50111/templeton/v1/ddl/database/default/table?user.name=hive; echo
{"tables":["messages","weblogs","tweets","transactions"],"database":"default"}
See Apache's WebHCat reference for details about querying Hive using WebHCat.
Installing the Java Runtime on Your Vertica Cluster

The HCatalog Connector requires a 64-bit Java Virtual Machine (JVM). The JVM must support Java 6 or later, and must be the same version as the one installed on your Hadoop nodes.

Note: If your Vertica cluster is configured to execute User Defined Extensions (UDxs) written in Java, it already has a correctly-configured JVM installed. See Developing User Defined Functions in Java in Extending Vertica for more information.

Installing Java on your Vertica cluster is a two-step process:

1. Install a Java runtime on all of the hosts in your cluster.
2. Set the JavaBinaryForUDx configuration parameter to tell Vertica the location of the Java executable.
Installing a Java Runtime

For Java-based features, Vertica requires a 64-bit Java 6 (Java version 1.6) or later Java runtime. Vertica supports runtimes from either Oracle or OpenJDK. You can choose to install either the Java Runtime Environment (JRE) or Java Development Kit (JDK), since the JDK also includes the JRE.

Many Linux distributions include a package for the OpenJDK runtime. See your Linux distribution's documentation for information about installing and configuring OpenJDK.

To install the Oracle Java runtime, see the Java Standard Edition (SE) Download Page. You usually run the installation package as root in order to install it. See the download page for instructions.

Once you have installed a JVM on each host, ensure that the java command is in the search path and calls the correct JVM by running the command:

$ java -version
This command should print something similar to:

java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)
Note: Any previously installed Java VM on your hosts may interfere with a newly installed Java runtime. See your Linux distribution's documentation for instructions on configuring which JVM is the default. Unless absolutely required, you should uninstall any incompatible version of Java before installing the Java 6 or Java 7 runtime.
Setting the JavaBinaryForUDx Configuration Parameter

The JavaBinaryForUDx configuration parameter tells Vertica where to look for the JRE to execute Java UDxs. After you have installed the JRE on all of the nodes in your cluster, set this parameter to the absolute path of the Java executable. You can use the symbolic link that some Java installers create (for example /usr/bin/java). If the Java executable is in your shell search path, you can get the path of the Java executable by running the following command from the Linux command line shell:

$ which java
/usr/bin/java
If the java command is not in the shell search path, use the path to the Java executable in the directory where you installed the JRE. Suppose you installed the JRE in /usr/java/default (which is where the installation package supplied by Oracle installs the Java 1.6 JRE). In this case the Java executable is /usr/java/default/bin/java.
You set the configuration parameter by executing the following statement as a database superuser:

=> ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java';
See ALTER DATABASE for more information on setting configuration parameters.

To view the current setting of the configuration parameter, query the CONFIGURATION_PARAMETERS system table:

=> \x
Expanded display is on.
=> SELECT * FROM CONFIGURATION_PARAMETERS WHERE parameter_name = 'JavaBinaryForUDx';
-[ RECORD 1 ]-----------------+----------------------------------------------------------
node_name                     | ALL
parameter_name                | JavaBinaryForUDx
current_value                 | /usr/bin/java
default_value                 |
change_under_support_guidance | f
change_requires_restart       | f
description                   | Path to the java binary for executing UDx written in Java
Once you have set the configuration parameter, Vertica can find the Java executable on each node in your cluster.

Note: Since the location of the Java executable is set by a single configuration parameter for the entire cluster, you must ensure that the Java executable is installed in the same path on all of the hosts in the cluster.
Configuring Vertica for HCatalog

Before you can use the HCatalog Connector, you must add certain Hadoop and Hive libraries to your Vertica installation. You must also copy the Hadoop configuration files that specify various connection properties. Vertica uses the values in those configuration files to make its own connections to Hadoop. You need only make these changes on one node in your cluster. After you do this, you can install the HCatalog Connector.
Copy Hadoop Libraries and Configuration Files

Vertica provides a tool, hcatUtil, to collect the required files from Hadoop. This tool copies selected libraries and XML configuration files from your Hadoop cluster to your Vertica cluster. This tool might also need access to additional libraries:
- If you plan to use Hive to query files that use Snappy compression, you need access to the Snappy native libraries, libhadoop*.so and libsnappy*.so.
- If you plan to use Hive to query files that use LZO compression, you need access to the hadoop-lzo-*.jar and libgplcompression.so* libraries. In core-site.xml, you must also edit the io.compression.codecs property to include com.hadoop.compression.lzo.LzopCodec.
- If you plan to use a JSON SerDe with a Hive table, you need access to its library. This is the same library that you used to configure Hive; for example:
  hive> add jar /home/release/json-serde-1.3-jar-with-dependencies.jar;
  hive> create external table nationjson (id int,name string,rank int,text string)
        ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
        LOCATION '/user/release/vt/nationjson';
- If you are using any other libraries that are not standard across all supported Hadoop versions, you need access to those libraries.
If any of these cases applies to you, do one of the following (a sketch follows this list):
- Include the path(s) in the path you specify as the value of --hcatLibPath, or
- Copy the file(s) to a directory already on that path.
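For instance, a hedged sketch of the first option, appending a hypothetical LZO library directory (/opt/hadoop-lzo/lib) to the semicolon-separated --hcatLibPath value; adjust every path to match your environment:
hcatUtil --copyJars --hadoopHiveHome="/hadoop/lib;/hive/lib;/hcatalog/dist/share" --hadoopHiveConfPath="/hadoop;/hive;/webhcat" --hcatLibPath="/tmp/hadoop-files;/opt/hadoop-lzo/lib"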
If Vertica is not co-located on a Hadoop node, you should do the following:
1. Copy /opt/vertica/packages/hcat/tools/hcatUtil to a Hadoop node and run it there, specifying a temporary output directory. Your Hadoop, HIVE, and HCatalog lib paths might be different; in particular, in newer versions of Hadoop the HCatalog directory is usually a subdirectory under the HIVE directory. Use the values from your environment in the following command:
   hcatUtil --copyJars --hadoopHiveHome="/hadoop/lib;/hive/lib;/hcatalog/dist/share" --hadoopHiveConfPath="/hadoop;/hive;/webhcat" --hcatLibPath=/tmp/hadoop-files
2. Verify that all necessary files were copied:
   hcatUtil --verifyJars --hcatLibPath=/tmp/hadoop-files
3. Copy that output directory (/tmp/hadoop-files, in this example) to /opt/vertica/packages/hcat/lib on the Vertica node you will connect to when installing the HCatalog Connector. If you are updating a Vertica cluster to use a new Hadoop cluster (or a new version of Hadoop), first remove all JAR files in /opt/vertica/packages/hcat/lib except vertica-hcatalogudl.jar.
4. Verify that all necessary files were copied:
   hcatUtil --verifyJars --hcatLibPath=/opt/vertica/packages/hcat
If Vertica is co-located on some or all Hadoop nodes, you can do this in one step on a shared node. Your Hadoop, HIVE, and HCatalog lib paths might be different; use the values from your environment in the following command:
hcatUtil --copyJars --hadoopHiveHome="/hadoop/lib;/hive/lib;/hcatalog/dist/share" --hadoopHiveConfPath="/hadoop;/hive;/webhcat" --hcatLibPath=/opt/vertica/packages/hcat/lib
The hcatUtil script has the following arguments:

-c, --copyJars
    Copy the required JARs from hadoopHiveHome to hcatLibPath.

-v, --verifyJars
    Verify that the required JARs are present in hcatLibPath.

--hadoopHiveHome="value1;value2;..."
    Paths to the Hadoop, Hive, and HCatalog home directories. You must include the HADOOP_HOME and HIVE_HOME paths. Separate multiple paths with a semicolon (;). Enclose paths in double quotes. In newer versions of Hadoop, look for the HCatalog directory under the HIVE directory (for example, /hive/hcatalog/share).

--hcatLibPath="value1;value2;..."
    Output path for the lib/ folder of the HCatalog dependency JARs. Usually this is /opt/vertica/packages/hcat. You may use any folder, but make sure to copy all JARs to the hcat/lib folder before installing the HCatalog Connector. If you have previously run hcatUtil with a different version of Hadoop, remove the old JAR files first (all except vertica-hcatalogudl.jar).

--hadoopHiveConfPath="value"
    Paths of the Hadoop, HIVE, and other components' configuration files (such as core-site.xml, hive-site.xml, and webhcat-site.xml). Separate multiple paths with a semicolon (;). Enclose paths in double quotes. These files contain values that would otherwise have to be specified to CREATE HCATALOG SCHEMA. If you are using Cloudera, or if your HDFS cluster uses Kerberos authentication, this parameter is required. Otherwise, this parameter is optional.

Once you have copied the files and verified them, install the HCatalog Connector.
Install the HCatalog Connector

On the same node where you copied the files from hcatUtil, install the HCatalog Connector by running the install.sql script. This script resides in the ddl/ folder under your HCatalog Connector installation path. This script creates the library and the VHCatSource and VHCatParser functions.
Note: The data that was copied using hcatUtil is now stored in the database. If you change any of those values in Hadoop, you need to rerun hcatUtil and install.sql. The following statement returns the names of the libraries and configuration files currently being used:
=> SELECT dependencies FROM user_libraries WHERE lib_name='VHCatalogLib';
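For example, a minimal sketch of running the installation script from vsql on that node (the same command appears later in the HA NameNode procedure); adjust the path if your installation differs:
=> \i /opt/vertica/packages/hcat/ddl/install.sql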
Now you can create HCatalog schema parameters, which point to your existing Hadoop/Hive/WebHCat services, as described in Defining a Schema Using the HCatalog Connector.
Upgrading to a New Version of Vertica

After upgrading to a new version of Vertica, perform the following steps:
1. Uninstall the HCatalog Connector using the uninstall.sql script. This script resides in the ddl/ folder under your HCatalog Connector installation path.
2. Delete the contents of the hcatLibPath directory.
3. Rerun hcatUtil.
4. Reinstall the HCatalog Connector using the install.sql script.
For more information about upgrading Vertica, see Upgrading Vertica to a New Version.
Additional Options for Native File Formats

When reading Hadoop native file formats (ORC or Parquet), the HCatalog Connector attempts to use the built-in readers. When doing so, it uses the webhdfs scheme by default. You do not need to make any additional changes to support this.
You can instruct the HCatalog Connector to use the hdfs scheme instead by using ALTER DATABASE to set HCatalogConnectorUseLibHDFSPP to true. If you change this setting, you must also perform the configuration described in Configuring hdfs:/// Access.
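A minimal sketch of that setting, assuming a database named mydb; this document does not show the exact literal the parameter accepts (1 versus true), so check ALTER DATABASE in your release before relying on it:
=> ALTER DATABASE mydb SET HCatalogConnectorUseLibHDFSPP = 1;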
Using the HCatalog Connector with HA NameNode

Newer distributions of Hadoop support the High Availability NameNode (HA NN) for HDFS access. Some additional configuration is required to use this feature with the HCatalog Connector. If you do not perform this configuration, attempts to retrieve data through the connector will produce an error.
To use HA NN with Vertica, first copy /etc/hadoop/conf from the HDFS cluster to every node in your Vertica cluster. You can put this directory anywhere, but it must be in the same location on every node. (In the example below it is in /opt/hcat/hadoop_conf.) Then uninstall the HCat library, configure the UDx to use that configuration directory, and reinstall the library:
=> \i /opt/vertica/packages/hcat/ddl/uninstall.sql
DROP LIBRARY
=> ALTER DATABASE mydb SET JavaClassPathSuffixForUDx = '/opt/hcat/hadoop_conf';
WARNING 2693: Configuration parameter JavaClassPathSuffixForUDx has been deprecated; setting it has no effect
=> \i /opt/vertica/packages/hcat/ddl/install.sql
CREATE LIBRARY
CREATE SOURCE FUNCTION
GRANT PRIVILEGE
CREATE PARSER FUNCTION
GRANT PRIVILEGE
Despite the warning message, this step is necessary. After taking these steps, HCatalog queries will now work.
Defining a Schema Using the HCatalog Connector

After you set up the HCatalog Connector, you can use it to define a schema in your Vertica database to access the tables in a Hive database. You define the schema using the CREATE HCATALOG SCHEMA statement. When creating the schema, you must supply at least two pieces of information:
- the name of the schema to define in Vertica
- the host name or IP address of Hive's metastore database (the database server that contains metadata about Hive's data, such as the schema and table definitions)
Other parameters are optional. If you do not supply a value, Vertica uses default values.
After you define the schema, you can query the data in the Hive data warehouse in the same way you query a native Vertica table. The following example demonstrates creating an HCatalog schema and then querying several system tables to examine the contents of the new schema. See Viewing Hive Schema and Table Metadata for more information about these tables.
=> CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost' HCATALOG_SCHEMA='default'
-> HCATALOG_USER='hcatuser';
CREATE SCHEMA
=> -- Show list of all HCatalog schemas
=> \x
Expanded display is on.
=> SELECT * FROM v_catalog.hcatalog_schemata;
-[ RECORD 1 ]--------+------------------------------
schema_id            | 45035996273748980
schema_name          | hcat
schema_owner_id      | 45035996273704962
schema_owner         | dbadmin
create_time          | 2013-11-04 15:09:03.504094-05
hostname             | hcathost
port                 | 9933
webservice_hostname  | hcathost
webservice_port      | 50111
hcatalog_schema_name | default
hcatalog_user_name   | hcatuser
metastore_db_name    | hivemetastoredb
=> -- List the tables in all HCatalog schemas
=> SELECT * FROM v_catalog.hcatalog_table_list;
-[ RECORD 1 ]------+------------------
table_schema_id    | 45035996273748980
table_schema       | hcat
hcatalog_schema    | default
table_name         | messages
Querying Hive Tables Using HCatalog Connector

Once you have defined the HCatalog schema, you can query data from the Hive database by using the schema name in your query.
=> SELECT * from hcat.messages limit 10;
 messageid |   userid   |        time         |              message
-----------+------------+---------------------+-------------------------------------
         1 | nPfQ1ayhi  | 2013-10-29 00:10:43 | hymenaeos cursus lorem Suspendis
         2 | N7svORIoZ  | 2013-10-29 00:21:27 | Fusce ad sem vehicula morbi
         3 | 4VvzN3d    | 2013-10-29 00:32:11 | porta Vivamus condimentum
         4 | heojkmTmc  | 2013-10-29 00:42:55 | lectus quis imperdiet
         5 | coROws3OF  | 2013-10-29 00:53:39 | sit eleifend tempus a aliquam mauri
         6 | oDRP1i     | 2013-10-29 01:04:23 | risus facilisis sollicitudin sceler
         7 | AU7a9Kp    | 2013-10-29 01:15:07 | turpis vehicula tortor
         8 | ZJWg185DkZ | 2013-10-29 01:25:51 | sapien adipiscing eget Aliquam tor
         9 | E7ipAsYC3  | 2013-10-29 01:36:35 | varius Cum iaculis metus
        10 | kStCv      | 2013-10-29 01:47:19 | aliquam libero nascetur Cum mal
(10 rows)
Since the tables you access through the HCatalog Connector act like Vertica tables, you can perform operations that use both Hive data and native Vertica data, such as a join:
=> SELECT u.FirstName, u.LastName, d.time, d.Message from UserData u
-> JOIN hcat.messages d ON u.UserID = d.UserID LIMIT 10;
 FirstName | LastName |        time         |               Message
-----------+----------+---------------------+-------------------------------------
 Whitney   | Kerr     | 2013-10-29 00:10:43 | hymenaeos cursus lorem Suspendis
 Troy      | Oneal    | 2013-10-29 00:32:11 | porta Vivamus condimentum
 Renee     | Coleman  | 2013-10-29 00:42:55 | lectus quis imperdiet
 Fay       | Moss     | 2013-10-29 00:53:39 | sit eleifend tempus a aliquam mauri
 Dominique | Cabrera  | 2013-10-29 01:15:07 | turpis vehicula tortor
 Mohammad  | Eaton    | 2013-10-29 00:21:27 | Fusce ad sem vehicula morbi
 Cade      | Barr     | 2013-10-29 01:25:51 | sapien adipiscing eget Aliquam tor
 Oprah     | Mcmillan | 2013-10-29 01:36:35 | varius Cum iaculis metus
 Astra     | Sherman  | 2013-10-29 01:58:03 | dignissim odio Pellentesque primis
 Chelsea   | Malone   | 2013-10-29 02:08:47 | pede tempor dignissim Sed luctus
(10 rows)
Viewing Hive Schema and Table Metadata

When using Hive, you access metadata about schemas and tables by executing statements written in HiveQL (Hive's version of SQL) such as SHOW TABLES. When using the HCatalog Connector, you can get metadata about the tables in the Hive database through several Vertica system tables.
There are four system tables that contain metadata about the tables accessible through the HCatalog Connector:
- HCATALOG_SCHEMATA lists all of the schemas that have been defined using the HCatalog Connector. See HCATALOG_SCHEMATA in the SQL Reference Manual for detailed information.
- HCATALOG_TABLE_LIST contains an overview of all of the tables available from all schemas defined using the HCatalog Connector. This table only shows the tables which the user querying the table can access. The information in this table is retrieved using a single call to WebHCat for each schema defined using the HCatalog Connector, which means there is a little overhead when querying this table. See HCATALOG_TABLE_LIST in the SQL Reference Manual for detailed information.
- HCATALOG_TABLES contains more in-depth information than HCATALOG_TABLE_LIST. However, querying this table results in Vertica making a REST web service call to WebHCat for each table available through the HCatalog Connector. If there are many tables in the HCatalog schemas, this query could take a while to complete. See HCATALOG_TABLES in the SQL Reference Manual for more information.
- HCATALOG_COLUMNS lists metadata about all of the columns in all of the tables available through the HCatalog Connector. Similarly to HCATALOG_TABLES, querying this table results in one call to WebHCat per table, and therefore can take a while to complete. See HCATALOG_COLUMNS in the SQL Reference Manual for more information.
The following example demonstrates querying the system tables containing metadata for the tables available through the HCatalog Connector.
=> CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost'
Synchronizing an HCatalog Schema or Table With a Local Schema or Table

Querying data from an HCatalog schema can be slow due to Hive and WebHCat performance issues. This slow performance can be especially annoying when you want to examine the structure of the tables in the Hive database. Getting this information from Hive requires you to query the HCatalog schema's metadata using the HCatalog Connector.
To avoid this performance problem, you can use the SYNC_WITH_HCATALOG_SCHEMA function to create a snapshot of the HCatalog schema's metadata within a Vertica schema. You supply this function with the name of a pre-existing Vertica schema, typically the one created through CREATE HCATALOG SCHEMA, and a Hive schema available through the HCatalog Connector. The function creates a set of external tables within the Vertica schema that you can then use to examine the structure of the tables in the Hive database. Because the metadata in the Vertica schema is local, query planning is much faster. You can also use standard Vertica statements and system-table queries to examine the structure of Hive tables in the HCatalog schema.
Caution: The SYNC_WITH_HCATALOG_SCHEMA function overwrites tables in the Vertica schema whose names match a table in the HCatalog schema. Do not use the Vertica schema to store other data.
When SYNC_WITH_HCATALOG_SCHEMA creates tables in Vertica, it matches Hive's STRING and BINARY types to Vertica's VARCHAR(65000) and VARBINARY(65000) types. You might want to change these lengths, using ALTER TABLE SET DATA TYPE, in two cases (a sketch follows this list):
- If the value in Hive is larger than 65000 bytes, increase the size and use LONG VARCHAR or LONG VARBINARY to avoid data truncation. If a Hive string uses multi-byte encodings, you must increase the size in Vertica to avoid data truncation. This step is needed because Hive counts string length in characters while Vertica counts it in bytes.
- If the value in Hive is much smaller than 65000 bytes, reduce the size to conserve memory in Vertica.
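For example, a sketch of widening one synchronized string column, assuming the hcat.messages table shown in the example below and an arbitrary new length; adjust the table, column, and size to your own data:
=> ALTER TABLE hcat.messages ALTER COLUMN message SET DATA TYPE LONG VARCHAR(1000000);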
The Vertica schema is just a snapshot of the HCatalog schema's metadata. Vertica does not synchronize later changes to the HCatalog schema with the local schema after you call SYNC_WITH_HCATALOG_SCHEMA. You can call the function again to resynchronize the local schema to the HCatalog schema. If you altered column data types, you will need to repeat those changes because the function creates new external tables.
By default, SYNC_WITH_HCATALOG_SCHEMA does not drop tables that appear in the local schema but do not appear in the HCatalog schema. Thus, after the function call the local schema does not reflect tables that have been dropped in the Hive database since the previous call. You can change this behavior by supplying the optional third Boolean argument, which tells the function to drop any table in the local schema that does not correspond to a table in the HCatalog schema.
Instead of synchronizing the entire schema, you can synchronize individual tables by using SYNC_WITH_HCATALOG_SCHEMA_TABLE. If the table already exists in Vertica, the function overwrites it. If the table is not found in the HCatalog schema, this function returns an error. In all other respects this function behaves in the same way as SYNC_WITH_HCATALOG_SCHEMA.
Examples

The following example demonstrates calling SYNC_WITH_HCATALOG_SCHEMA to synchronize the HCatalog schema in Vertica with the metadata in Hive. Because it synchronizes the HCatalog schema directly, instead of synchronizing another schema with the HCatalog schema, both arguments are the same.
=> CREATE HCATALOG SCHEMA hcat WITH hostname='hcathost' HCATALOG_SCHEMA='default'
-> HCATALOG_USER='hcatuser';
CREATE SCHEMA
=> SELECT sync_with_hcatalog_schema('hcat', 'hcat');
       sync_with_hcatalog_schema
----------------------------------------
Schema hcat synchronized with hcat
tables in hcat = 56
tables altered in hcat = 0
tables created in hcat = 56
stale tables in hcat = 0
table changes erred in hcat = 0
(1 row)
=> -- Use vsql's \d command to describe a table in the synced schema
=> \d hcat.messages
List of Fields by Tables
 Schema |  Table   | Column  |      Type      | Size  | Default | Not Null | Primary Key | Foreign Key
--------+----------+---------+----------------+-------+---------+----------+-------------+-------------
 hcat   | messages | id      | int            |     8 |         | f        | f           |
 hcat   | messages | userid  | varchar(65000) | 65000 |         | f        | f           |
 hcat   | messages | "time"  | varchar(65000) | 65000 |         | f        | f           |
 hcat   | messages | message | varchar(65000) | 65000 |         | f        | f           |
(4 rows)
This example shows synchronizing with a schema created using CREATE HCATALOG SCHEMA. Synchronizing with a schema created using CREATE SCHEMA is also supported.
You can query tables in the local schema that you synchronized with an HCatalog schema. However, querying tables in a synchronized schema isn't much faster than directly querying the HCatalog schema, because SYNC_WITH_HCATALOG_SCHEMA only duplicates the HCatalog schema's metadata. The data in the table is still retrieved using the HCatalog Connector.
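To synchronize just one table instead of the whole schema, a hedged sketch using SYNC_WITH_HCATALOG_SCHEMA_TABLE; the argument order shown (local schema, HCatalog schema, table name) is an assumption, so confirm it against the SQL Reference Manual:
=> SELECT sync_with_hcatalog_schema_table('hcat', 'hcat', 'messages');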
Data Type Conversions from Hive to Vertica

The data types recognized by Hive differ from the data types recognized by Vertica. The following table lists how the HCatalog Connector converts Hive data types into data types compatible with Vertica.

Hive Data Type                  | Vertica Data Type
--------------------------------+------------------------------------------------------------------
TINYINT (1-byte)                | TINYINT (8-bytes)
SMALLINT (2-bytes)              | SMALLINT (8-bytes)
INT (4-bytes)                   | INT (8-bytes)
BIGINT (8-bytes)                | BIGINT (8-bytes)
BOOLEAN                         | BOOLEAN
FLOAT (4-bytes)                 | FLOAT (8-bytes)
DECIMAL (precision, scale)      | DECIMAL (precision, scale)
DOUBLE (8-bytes)                | DOUBLE PRECISION (8-bytes)
CHAR (length in characters)     | CHAR (length in bytes)
VARCHAR (length in characters)  | VARCHAR (length in bytes) if length <= 65000; LONG VARCHAR (length in bytes) if length > 65000
STRING (2 GB max)               | VARCHAR (65000)
BINARY (2 GB max)               | VARBINARY (65000)
DATE                            | DATE
TIMESTAMP                       | TIMESTAMP
LIST/ARRAY                      | VARCHAR (65000) containing a JSON-format representation of the list
MAP                             | VARCHAR (65000) containing a JSON-format representation of the map
STRUCT                          | VARCHAR (65000) containing a JSON-format representation of the struct
Data-Width Handling Differences Between Hive and Vertica

The HCatalog Connector relies on Hive SerDe classes to extract data from files on HDFS. Therefore, the data read from these files is subject to Hive's data width restrictions. For example, suppose the SerDe parses a value for an INT column into a value that is greater than 2^32-1 (the maximum value for a 32-bit integer). In this case, the value is rejected even if it would fit into Vertica's 64-bit INTEGER column, because it cannot fit into Hive's 32-bit INT.
Hive measures CHAR and VARCHAR length in characters, while Vertica measures them in bytes. Therefore, if multi-byte encodings are being used (like Unicode), text might be truncated in Vertica.
Once the value has been parsed and converted to a Vertica data type, it is treated as native data. This treatment can result in some confusion when comparing the results of an identical query run in Hive and in Vertica. For example, if your query adds two INT values that result in a value larger than 2^32-1, the value overflows its 32-bit INT data type, causing Hive to return an error. When running the same query with the same data in Vertica using the HCatalog Connector, the value will probably still fit within Vertica's 64-bit INT type. Thus the addition is successful and returns a value.
Using Non-Standard SerDes

Hive stores its data in unstructured flat files located in the Hadoop Distributed File System (HDFS). When you execute a Hive query, it uses a set of serializer and deserializer (SerDe) classes to extract data from these flat files and organize it into a relational database table. For Hive to be able to extract data from a file, it must have a SerDe that can parse the data the file contains. When you create a table in Hive, you can select the SerDe to be used for the table's data.
Hive has a set of standard SerDes that handle data in several formats such as delimited data and data extracted using regular expressions. You can also use third-party or custom-defined SerDes that allow Hive to process data stored in other file formats. For example, some commonly-used third-party SerDes handle data stored in JSON format.
The HCatalog Connector directly fetches file segments from HDFS and uses Hive's SerDe classes to extract data from them. The Connector includes all of Hive's standard SerDe classes, so it can process data stored in any file that Hive natively supports. If you want to query data from a Hive table that uses a custom SerDe, you must first install the SerDe classes on the Vertica cluster.
Determining Which SerDe You Need

If you have access to the Hive command line, you can determine which SerDe a table uses by using Hive's SHOW CREATE TABLE statement. This statement shows the HiveQL statement needed to recreate the table. For example:
hive> SHOW CREATE TABLE msgjson;
OK
CREATE EXTERNAL TABLE msgjson(
  messageid int COMMENT 'from deserializer',
  userid string COMMENT 'from deserializer',
  time string COMMENT 'from deserializer',
  message string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hivehost.example.com:8020/user/exampleuser/msgjson'
TBLPROPERTIES (
  'transient_lastDdlTime'='1384194521')
Time taken: 0.167 seconds
In the example, ROW FORMAT SERDE indicates that a special SerDe is used to parse the data files. The next row shows that the class for the SerDe is named org.apache.hadoop.hive.contrib.serde2.JsonSerde. You must provide the HCatalog Connector with a copy of this SerDe class so that it can read the data from this table.
You can also find out which SerDe class you need by querying the table that uses the custom SerDe. The query will fail with an error message that contains the class name of the SerDe needed to parse the data in the table. In the following example, the missing SerDe class is com.cloudera.hive.serde.JSONSerDe.
=> SELECT * FROM hcat.jsontable;
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource], error code: 0
com.vertica.sdk.UdfException: Error message is [
org.apache.hcatalog.common.HCatException : 2004 : HCatOutputFormat not initialized, setOutput has to be called. Cause : java.io.IOException:
java.lang.RuntimeException:
MetaException(message:org.apache.hadoop.hive.serde2.SerDeException
SerDe com.cloudera.hive.serde.JSONSerDe does not exist) ] HINT If error
message is not descriptive or local, may be we cannot read metadata from hive
metastore service thrift://hcathost:9083 or HDFS namenode (check
UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information)
at com.vertica.hcatalogudl.HCatalogSplitsNoOpSourceFactory
.plan(HCatalogSplitsNoOpSourceFactory.java:98)
at com.vertica.udxfence.UDxExecContext.planUDSource(UDxExecContext.java:898)
. . .
Installing the SerDe on the Vertica Cluster

You usually have two options for getting the SerDe class file that the HCatalog Connector needs:
- Find the installation files for the SerDe, then copy those over to your Vertica cluster. For example, there are several third-party JSON SerDes available from sites like Google Code and GitHub. You may find one that matches the file installed on your Hive cluster. If so, download the package and copy it to your Vertica cluster.
- Directly copy the JAR files from a Hive server onto your Vertica cluster. The location of the SerDe JAR files depends on your Hive installation. On some systems, they may be located in /usr/lib/hive/lib.
Wherever you get the files, copy them into the /opt/vertica/packages/hcat/lib directory on every node in your Vertica cluster.
Important: If you add a new host to your Vertica cluster, remember to copy every custom SerDe JAR file to it.
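A minimal shell sketch of that copy step, run from wherever the JAR resides; the node names (v_node01 through v_node03) are hypothetical placeholders and the JSON SerDe file name is the one used earlier in this document, so substitute your own hosts and JAR:
# Push the custom SerDe JAR to the HCatalog lib directory on every Vertica node
for host in v_node01 v_node02 v_node03; do
    scp json-serde-1.3-jar-with-dependencies.jar "$host":/opt/vertica/packages/hcat/lib/
done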
Troubleshooting HCatalog Connector Problems

You may encounter the following issues when using the HCatalog Connector.
Connection Errors

When you use CREATE HCATALOG SCHEMA to create a new schema, the HCatalog Connector does not immediately attempt to connect to the WebHCat or metastore servers. Instead, when you execute a query using the schema or HCatalog-related system tables, the connector attempts to connect to and retrieve data from your Hadoop cluster.
The types of errors you get depend on which parameters are incorrect. Suppose you have incorrect parameters for the metastore database, but correct parameters for WebHCat. In this case, HCatalog-related system table queries succeed, while queries on the HCatalog schema fail. The following example demonstrates creating an HCatalog schema with the correct default WebHCat information. However, the port number for the metastore database is incorrect.
=> CREATE HCATALOG SCHEMA hcat2 WITH hostname='hcathost'
-> HCATALOG_SCHEMA='default' HCATALOG_USER='hive' PORT=1234;
CREATE SCHEMA
=> SELECT * FROM HCATALOG_TABLE_LIST;
-[ RECORD 1 ]------+---------------------
table_schema_id    | 45035996273864536
table_schema       | hcat2
hcatalog_schema    | default
table_name         | test
hcatalog_user_name | hive
=> SELECT * FROM hcat2.test;
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource], error code: 0
com.vertica.sdk.UdfException: Error message is [
org.apache.hcatalog.common.HCatException : 2004 : HCatOutputFormat not initialized, setOutput has to be called. Cause : java.io.IOException:
MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure:
org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:277)
. . .
To resolve these issues, you must drop the schema and recreate it with the correct parameters. If you still have issues, determine whether there are connectivity issues between your Vertica cluster and your Hadoop cluster. Such issues can include a firewall that prevents one or more Vertica hosts from contacting the WebHCat, metastore, or HDFS hosts. You may also see this error if you are using HA NameNode, particularly with larger tables that HDFS splits into multiple blocks. See Using the HCatalog Connector with HA NameNode for more information about correcting this problem.
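For example, a hedged sketch of dropping and recreating the schema from the previous example with a corrected metastore port; 9083 is the port that appears in error messages elsewhere in this document, and the exact DROP syntax and correct port for your cluster may differ:
=> DROP SCHEMA hcat2 CASCADE;
DROP SCHEMA
=> CREATE HCATALOG SCHEMA hcat2 WITH hostname='hcathost'
-> HCATALOG_SCHEMA='default' HCATALOG_USER='hive' PORT=9083;
CREATE SCHEMA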
UDx Failure When Querying Data: Error 3399

You might see an error message when querying data (as opposed to metadata like schema information). This might be accompanied by a ClassNotFoundException in the log. This can happen for the following reasons:
- You are not using the same version of Java on your Hadoop and Vertica nodes. In this case you need to change one of them to match the other.
- You have not used hcatUtil to copy all Hadoop and Hive libraries to Vertica, or you ran hcatUtil and then changed your version of Hadoop or Hive.
- You upgraded Vertica to a new version and did not rerun hcatUtil and reinstall the HCatalog Connector.
- The version of Hadoop you are using relies on a third-party library that you must copy manually.
- You are reading files with LZO compression and have not copied the libraries or set the io.compression.codecs property in core-site.xml.
If you did not copy the libraries or configure LZO compression, follow the instructions in Configuring Vertica for HCatalog.
If the Hive JARs that you copied from Hadoop are out of date, you might see an error message like the following:
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource], error code: 0
Error message is [ Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected ]
HINT hive metastore service is thrift://localhost:13433 (check UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information)
This usually signals a problem with the hive-hcatalog-core JAR. Make sure you have an up-to-date copy of this file. Remember that if you rerun hcatUtil, you also need to recreate the HCatalog schema.
You might also see a different form of this error:
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource], error code: 0
Error message is [ javax/servlet/Filter ]
This error can be reported even if hcatUtil reports that your libraries are up to date. The javax.servlet.Filter class is in a library that some versions of Hadoop use but that is not usually part of the Hadoop installation directly. If you see an error mentioning this class, locate servlet-api-*.jar on a Hadoop node and copy it to the hcat/lib directory on all database nodes. If you cannot locate it on a Hadoop node, locate and download it from the Internet. (This case is rare.) The library version must be 2.3 or higher. Once you have copied the jar to the hcat/lib directory, reinstall the HCatalog connector as explained in Configuring Vertica for HCatalog.
SerDe Errors

Errors can occur if you attempt to query a Hive table that uses a non-standard SerDe. If you have not installed the SerDe JAR files on your Vertica cluster, you receive an error similar to the one in the following example:
=> SELECT * FROM hcat.jsontable;
ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error in User Defined Object [VHCatSource], error code: 0
com.vertica.sdk.UdfException: Error message is [
org.apache.hcatalog.common.HCatException : 2004 : HCatOutputFormat not initialized, setOutput has to be called. Cause : java.io.IOException:
java.lang.RuntimeException:
MetaException(message:org.apache.hadoop.hive.serde2.SerDeException
SerDe com.cloudera.hive.serde.JSONSerDe does not exist) ] HINT If error
message is not descriptive or local, may be we cannot read metadata from hive
metastore service thrift://hcathost:9083 or HDFS namenode (check
UDxLogs/UDxFencedProcessesJava.log in the catalog directory for more information)
at com.vertica.hcatalogudl.HCatalogSplitsNoOpSourceFactory
.plan(HCatalogSplitsNoOpSourceFactory.java:98)
at com.vertica.udxfence.UDxExecContext.planUDSource(UDxExecContext.java:898)
. . .
In the error message, you can see that the root cause is a missing SerDe class (com.cloudera.hive.serde.JSONSerDe in this example). To resolve this issue, install the SerDe class on your Vertica cluster. See Using Non-Standard SerDes for more information. This error may occur intermittently if just one or a few hosts in your cluster do not have the SerDe class.
Differing Results Between Hive and Vertica Queries

Sometimes, running the same query on Hive and on Vertica through the HCatalog Connector can return different results. This discrepancy is often caused by the differences between the data types supported by Hive and Vertica. See Data Type Conversions from Hive to Vertica for more information about supported data types.
If Hive string values are being truncated in Vertica, this might be caused by multi-byte character encodings in Hive. Hive reports string length in characters, while Vertica records it in bytes. For a two-byte encoding such as Unicode, you need to double the column size in Vertica to avoid truncation.
Discrepancies can also occur if the Hive table uses partition columns of types other than string.
Preventing Excessive Query Delays

Network issues or high system loads on the WebHCat server can cause long delays while querying a Hive database using the HCatalog Connector. While Vertica cannot resolve these issues, you can set parameters that limit how long Vertica waits before canceling a query on an HCatalog schema. You can set these parameters globally using Vertica configuration parameters. You can also set them for specific HCatalog schemas in the CREATE HCATALOG SCHEMA statement. These specific settings override the settings in the configuration parameters.
The HCatConnectionTimeout configuration parameter and the CREATE HCATALOG SCHEMA statement's HCATALOG_CONNECTION_TIMEOUT parameter control how many seconds the HCatalog Connector waits for a connection to the WebHCat server. A value of 0 (the default setting for the configuration parameter) means to wait indefinitely. If the WebHCat server does not respond by the time this timeout elapses, the HCatalog Connector breaks the connection and cancels the query. If you find that some queries on an HCatalog schema pause excessively, try setting this parameter to a timeout value, so the query does not hang indefinitely.
The HCatSlowTransferTime configuration parameter and the CREATE HCATALOG SCHEMA statement's HCATALOG_SLOW_TRANSFER_TIME parameter specify how long the HCatalog Connector waits for data after making a successful connection to the WebHCat server. After the specified time has elapsed, the HCatalog Connector determines whether the data transfer rate from the WebHCat server is at least the value set in the HCatSlowTransferLimit configuration parameter (or by the CREATE HCATALOG SCHEMA statement's HCATALOG_SLOW_TRANSFER_LIMIT parameter). If it is not, the HCatalog Connector terminates the connection and cancels the query.
You can set these parameters to cancel queries that run very slowly but do eventually complete. However, query delays are usually caused by a slow connection rather than a problem establishing the connection. Therefore, try adjusting the slow transfer rate settings first. If you find the cause of the issue is connections that never complete, you can alternatively adjust the Linux TCP socket timeouts to a suitable value instead of relying solely on the HCatConnectionTimeout parameter.
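For illustration, a hedged sketch of both approaches, using a database named mydb, a hypothetical schema name hcat3, and arbitrary example values (a 30-second connection timeout and a 60-second slow-transfer window); confirm the exact parameter names and units against your release before relying on them:
=> ALTER DATABASE mydb SET HCatConnectionTimeout = 30;
=> ALTER DATABASE mydb SET HCatSlowTransferTime = 60;
=> CREATE HCATALOG SCHEMA hcat3 WITH hostname='hcathost' HCATALOG_SCHEMA='default'
-> HCATALOG_CONNECTION_TIMEOUT=30 HCATALOG_SLOW_TRANSFER_TIME=60;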
Using the HDFS Connector

The Hadoop Distributed File System (HDFS) is the location where Hadoop usually stores its input and output files. It stores files across the Hadoop cluster redundantly, to keep the files available even if some nodes are down. HDFS also makes Hadoop more efficient by spreading file access tasks across the cluster to help limit I/O bottlenecks.
The HDFS Connector lets you load files from HDFS into Vertica using the COPY statement. You can also create external tables that access data stored on HDFS as if it were a native Vertica table. The connector is useful if your Hadoop job does not directly store its data in Vertica using the MapReduce Connector (see Using the MapReduce Connector) or if you want to use User-Defined Extensions (UDxs) to load data stored in HDFS.
Note: The files you load from HDFS using the HDFS Connector usually have a delimited format. Column values are separated by a character, such as a comma or a pipe character (|). This format is the same type used in other files you load with the COPY statement. Hadoop MapReduce jobs often output tab-delimited files.
Like the MapReduce Connector, the HDFS Connector takes advantage of the distributed nature of both Vertica and Hadoop. Individual nodes in the Vertica cluster connect directly to nodes in the Hadoop cluster when you load multiple files from HDFS. Hadoop splits large files into file segments that it stores on different nodes. The connector directly retrieves these file segments from the node storing them, rather than relying on the Hadoop cluster to reassemble the file.
The connector is read-only; it cannot write data to HDFS. The HDFS Connector can connect to a Hadoop cluster through unauthenticated and Kerberos-authenticated connections.
HDFS Connector Requirements

Uninstall Prior Versions of the HDFS Connector

The HDFS Connector is now installed with Vertica; you no longer need to download and install it separately. If you have previously downloaded and installed this connector, uninstall it before you upgrade to this release of Vertica to get the newest version.
webHDFS Requirements

The HDFS Connector connects to the Hadoop file system using webHDFS, a built-in component of HDFS that provides access to HDFS files for applications outside of Hadoop. This component must be enabled on your Hadoop cluster. See your Hadoop distribution's documentation for instructions on configuring and enabling webHDFS.
Note: HTTPfs (also known as HOOP) is another method of accessing files stored in HDFS. It relies on a separate server process that receives requests for files and retrieves them from HDFS. Since it uses a REST API that is compatible with webHDFS, it could theoretically work with the connector. However, the connector has not been tested with HTTPfs, and HPE does not support using the HDFS Connector with HTTPfs. In addition, since all of the files retrieved from HDFS must pass through the HTTPfs server, it is less efficient than webHDFS, which lets Vertica nodes directly connect to the Hadoop nodes storing the file blocks.
Kerberos Authentication Requirements

The HDFS Connector can connect to HDFS using Kerberos authentication. To use Kerberos, you must meet these additional requirements:
- Your Vertica installation must be Kerberos-enabled.
- Your Hadoop cluster must be configured to use Kerberos authentication.
- Your connector must be able to connect to the Kerberos-enabled Hadoop cluster.
- The Kerberos server must be running version 5.
- The Kerberos server must be accessible from every node in your Vertica cluster.
- You must have Kerberos principals (users) that map to Hadoop users. You use these principals to authenticate your Vertica users with the Hadoop cluster.
Testing Your Hadoop webHDFS Configuration

To ensure that your Hadoop installation's webHDFS system is configured and running, follow these steps:
1. Log into your Hadoop cluster and locate a small text file on the Hadoop filesystem. If you do not have a suitable file, you can create a file named test.txt in the /tmp directory using the following command:
   echo -e "A|1|2|3\nB|4|5|6" | hadoop fs -put - /tmp/test.txt
2. Log into a host in your Vertica database using the database administrator account.
3. If you are using Kerberos authentication, authenticate with the Kerberos server using the keytab file for a user who is authorized to access the file. For example, to authenticate as a user named [email protected], use the command:
   $ kinit [email protected] -k -t /path/exampleuser.keytab
   Where path is the path to the keytab file you copied over to the node. You do not receive any message if you authenticate successfully. You can verify that you are authenticated by using the klist command:
   $ klist
   Ticket cache: FILE:/tmp/krb5cc_500
   Default principal: [email protected]

   Valid starting     Expires            Service principal
   07/24/13 14:30:19  07/25/13 14:30:19  krbtgt/[email protected]
           renew until 07/24/13 14:30:19
4. Test retrieving the file:
   - If you are not using Kerberos authentication, run the following command from the Linux command line:
     curl -i -L "http://hadoopNameNode:50070/webhdfs/v1/tmp/test.txt?op=OPEN&user.name=hadoopUserName"
     Replace hadoopNameNode with the hostname or IP address of the name node in your Hadoop cluster, /tmp/test.txt with the path to the file in the Hadoop filesystem you located in step 1, and hadoopUserName with the user name of a Hadoop user that has read access to the file.
     If successful, the command produces output similar to the following:
     HTTP/1.1 200 OK
     Server: Apache-Coyote/1.1
     Set-Cookie: hadoop.auth="u=hadoopUser&p=password&t=simple&e=1344383263490&s=n8YB/CHFg56qNmRQRTqO0IdRMvE="; Version=1; Path=/
     Content-Type: application/octet-stream
   - If you are using Kerberos authentication, run the following command from the Linux command line:
     curl --negotiate -i -L -u:anyUser http://hadoopNameNode:50070/webhdfs/v1/tmp/test.txt?op=OPEN
     Replace hadoopNameNode with the hostname or IP address of the name node in your Hadoop cluster, and /tmp/test.txt with the path to the file in the Hadoop filesystem you located in step 1.
     If successful, the command produces output similar to the following:
     HTTP/1.1 401 Unauthorized
     Content-Type: text/html; charset=utf-8
     WWW-Authenticate: Negotiate
     Content-Length: 0
     Server: Jetty(6.1.26)

     HTTP/1.1 307 TEMPORARY_REDIRECT
     Content-Type: application/octet-stream
     Expires: Thu, 01-Jan-1970 00:00:00 GMT
     Set-Cookie: hadoop.auth="[email protected]&t=kerberos&e=1375144834763&s=iY52iRvjuuoZ5iYG8G5g12O2Vwo=";Path=/
     Location: http://hadoopnamenode.mycompany.com:1006/webhdfs/v1/user/release/docexample/test.txt?op=OPEN&delegation=JAAHcmVsZWFzZQdyZWxlYXNlAIoBQCrfpdGKAUBO7CnRju3TbBSlID_osB658jfGfRpEt8-u9WHymRJXRUJIREZTIGRlbGVnYXRpb24SMTAuMjAuMTAwLjkxOjUwMDcw&offset=0
     Content-Length: 0
     Server: Jetty(6.1.26)

     HTTP/1.1 200 OK
     Content-Type: application/octet-stream
     Content-Length: 16
     Server: Jetty(6.1.26)
     A|1|2|3
     B|4|5|6
If the curl command fails, you must review the error messages and resolve any issues before using the Vertica Connector for HDFS with your Hadoop cluster. Some debugging steps include:
- Verify the HDFS service's port number.
- Verify that the Hadoop user you specified exists and has read access to the file you are attempting to retrieve.
Loading Data Using the HDFS Connector

You can use the HDFS User-Defined Source (UDS) in a COPY statement to load data from HDFS files. The syntax for using the HDFS UDS in a COPY statement is:
COPY tableName SOURCE Hdfs(url='WebHDFSFileURL', [username='username'], [low_speed_limit=speed]);

tableName
    The name of the table to receive the copied data.

WebHDFSFileURL
    A string containing one or more URLs that identify the file or files to be read. See The HDFS File URL below for details. Use commas to separate multiple URLs. If a URL contains certain special characters, you must escape them:
    - Replace any commas in the URLs with the escape sequence %2c. For example, if you are loading a file named doe,john.txt, change the file's name in the URL to doe%2cjohn.txt.
    - Replace any single quotes with the escape sequence '''. For example, if you are loading a file named john's_notes.txt, change the file's name in the URL to john'''s_notes.txt.
    An example of the escaped form appears after this table.

username
    The username of a Hadoop user that has permissions to access the files you want to copy. If you are using Kerberos, omit this argument.

speed
    The minimum data transmission rate, expressed in bytes per second, that the connector allows. The connector breaks any connection between the Hadoop and Vertica clusters that transmits data slower than this rate for more than 1 minute. After the connector breaks a connection for being too slow, it attempts to connect to another node in the Hadoop cluster. This new connection can supply the data that the broken connection was retrieving. The connector terminates the COPY statement and returns an error message if:
    - It cannot find another Hadoop node to supply the data.
    - The previous transfer attempts from all other Hadoop nodes that have the file also closed because they were too slow.
    Default value: 1048576 (a 1 MB per second transmission rate)
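For instance, a sketch of a COPY statement that loads a file named doe,john.txt from /tmp with the comma escaped as described above; the host name hadoop and the user hadoopUser are the same placeholders used in the examples that follow:
=> COPY testTable SOURCE Hdfs(url='http://hadoop:50070/webhdfs/v1/tmp/doe%2cjohn.txt',
   username='hadoopUser');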
The HDFS File URL

The url parameter in the Hdfs function call is a string containing one or more comma-separated HTTP URLs. These URLs identify the files in HDFS that you want to load. The format for each URL in this string is:
http://NameNode:port/webhdfs/v1/HDFSFilePath

NameNode
    The host name or IP address of the Hadoop cluster's name node.

port
    The port number on which the WebHDFS service is running. This number is usually 50070 or 14000, but may be different in your Hadoop installation.

webhdfs/v1/
    The protocol being used to retrieve the file. This portion of the URL is always the same. It tells Hadoop to use version 1 of the WebHDFS API.

HDFSFilePath
    The path from the root of the HDFS filesystem to the file or files you want to load. This path can contain standard Linux wildcards.
    Important: Any wildcards you use to specify multiple input files must resolve to files only. They must not include any directories. For example, if you specify the path /user/HadoopUser/output/*, and the output directory contains a subdirectory, the connector returns an error message.
The following example shows how to use the Vertica Connector for HDFS to load a single file named /tmp/test.txt. The Hadoop cluster's name node is named hadoop.
=> COPY testTable SOURCE Hdfs(url='http://hadoop:50070/webhdfs/v1/tmp/test.txt',
   username='hadoopUser');
 Rows Loaded
-------------
           2
(1 row)
Copying Files in Parallel

The basic COPY statement in the previous example copies a single file. It runs on just a single host in the Vertica cluster because the connector cannot break up the workload among nodes. Any data load that does not take advantage of all nodes in the Vertica cluster is inefficient.
To make loading data from HDFS more efficient, spread the data across multiple files on HDFS. This approach is often natural for data you want to load from HDFS, since Hadoop MapReduce jobs usually store their output in multiple files. You specify multiple files to be loaded in your Hdfs function call by:
- Using wildcards in the URL
- Supplying multiple comma-separated URLs in the url parameter of the Hdfs user-defined source function call
- Supplying multiple comma-separated URLs that contain wildcards
Loading multiple files through the Vertica Connector for HDFS results in an efficient load. The Vertica hosts connect directly to individual nodes in the Hadoop cluster to retrieve files. If Hadoop has broken files into multiple chunks, the Vertica hosts directly connect to the nodes storing each chunk.
The following example shows how to load all of the files whose filenames start with "part-" located in the /user/hadoopUser/output directory on HDFS. If there are at least as many files in this directory as there are nodes in the Vertica cluster, all nodes in the cluster load data from HDFS.
=> COPY Customers SOURCE
-> Hdfs(url='http://hadoop:50070/webhdfs/v1/user/hadoopUser/output/part-*',
   username='hadoopUser');
 Rows Loaded
-------------
       40008
(1 row)
To load data from multiple directories on HDFS at once, use multiple comma-separated URLs in the URL string:
=> COPY Customers SOURCE
-> Hdfs(url='http://hadoop:50070/webhdfs/v1/user/HadoopUser/output/part-*,
   http://hadoop:50070/webhdfs/v1/user/AnotherUser/part-*',
   username='hadoopUser');
 Rows Loaded
-------------
       80016
(1 row)
Note: Vertica statements must be less than 65,000 characters long. If you supply too many long URLs in a single statement, you could go over this limit. Normally, you would only approach this limit if you are automatically generating the COPY statement using a program or script.
Viewing Rejected Rows and Exceptions

COPY statements that use the Vertica Connector for HDFS use the same method for recording rejections and exceptions as other COPY statements. Rejected rows and exceptions are saved to log files. These log files are stored by default in the CopyErrorLogs subdirectory in the database's catalog directory.
Due to the distributed nature of the Vertica Connector for HDFS, you cannot use the ON option to force all exception and rejected-row information to be written to log files on a single Vertica host. Instead, you need to collect the log files from across the hosts to review all of the exceptions and rejections generated by the COPY statement.
For more about handling rejected rows, see Capturing Load Rejections and Exceptions.
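As a rough sketch of that collection step, assuming hypothetical node names and catalog path (substitute the values from your own cluster), you could gather the log files onto one host like this:
# Pull each node's CopyErrorLogs contents into a per-host local directory for review
for host in v_node01 v_node02 v_node03; do
    mkdir -p ./copy_errors/"$host"
    scp "$host:/home/dbadmin/mydb/*_catalog/CopyErrorLogs/*" ./copy_errors/"$host"/
done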
Creating an External Table with an HDFS Source

You can use the HDFS Connector as a source for an external table that lets you directly perform queries on the contents of files on the Hadoop Distributed File System (HDFS). See Using External Tables in the Administrator's Guide for more information on external tables. If your HDFS data is in ORC or Parquet format, using the special readers for those formats might provide better performance. See Reading Native Hadoop File Formats.
Using an external table to access data stored on an HDFS cluster is useful when you need to extract data from files that are periodically updated, or that have additional files added on HDFS. It saves you from having to drop previously loaded data and then reload the data using a COPY statement. The external table always accesses the current version of the files on HDFS.
Note: An external table performs a bulk load each time it is queried. Its performance is significantly slower than querying an internal Vertica table. You should only use external tables for infrequently run queries (such as daily reports). If you need to frequently query the content of the HDFS files, you should either use COPY to load the entire content of the files into Vertica, or save the results of a query run on an external table to an internal table which you then use for repeated queries.
To create an external table that reads data from HDFS, use the HDFS User-Defined Source (UDS) in a CREATE EXTERNAL TABLE AS COPY statement. The COPY portion of this statement has the same format as the COPY statement used to load data from HDFS. See Loading Data Using the HDFS Connector for more information.
The following simple example shows how to create an external table that extracts data from every file in the /user/hadoopUser/example/output directory using the HDFS Connector.
=> CREATE EXTERNAL TABLE hadoopExample (A VARCHAR(10), B INTEGER, C INTEGER, D INTEGER)
-> AS COPY SOURCE Hdfs(url=
-> 'http://hadoop01:50070/webhdfs/v1/user/hadoopUser/example/output/*',
-> username='hadoopUser');
CREATE TABLE
=> SELECT * FROM hadoopExample;
   A   | B | C | D
-------+---+---+---
 test1 | 1 | 2 | 3
 test1 | 3 | 4 | 5
(2 rows)
Later, after another Hadoop job adds contents to the output directory, querying the table produces different results:
=> SELECT * FROM hadoopExample;
   A   | B  | C  | D
-------+----+----+----
 test3 | 10 | 11 | 12
 test3 | 13 | 14 | 15
 test2 |  6 |  7 |  8
 test2 |  9 |  0 | 10
 test1 |  1 |  2 |  3
 test1 |  3 |  4 |  5
(6 rows)
Load Errors in External Tables

Normally, querying an external table on HDFS does not produce any errors if rows are rejected by the underlying COPY statement (for example, rows containing columns whose contents are incompatible with the data types in the table). Rejected rows are handled the same way they are in a standard COPY statement: they are written to a rejected data file and are noted in the exceptions file. For more information on how COPY handles rejected rows and exceptions, see Capturing Load Rejections and Exceptions in the Administrator's Guide.
Rejection and exception files are created on all of the nodes that load data from HDFS. You cannot specify a single node to receive all of the rejected row and exception information. These files are created on each Vertica node as it processes files loaded through the Vertica Connector for HDFS.
Note: Since the connector is read-only, there is no way to store rejection and exception information on HDFS.
Fatal errors during the transfer of data (for example, specifying files that do not exist on HDFS) do not occur until you query the external table. The following example shows what happens if you recreate the table based on a file that does not exist on HDFS.
=> DROP TABLE hadoopExample;
DROP TABLE
=> CREATE EXTERNAL TABLE hadoopExample (A INTEGER, B INTEGER, C INTEGER, D INTEGER)
-> AS COPY SOURCE HDFS(url='http://hadoop01:50070/webhdfs/v1/tmp/nofile.txt',
-> username='hadoopUser');
CREATE TABLE
=> SELECT * FROM hadoopExample;
ERROR 0: Error calling plan() in User Function HdfsFactory at [src/Hdfs.cpp:222], error code: 0, message: No files match [http://hadoop01:50070/webhdfs/v1/tmp/nofile.txt]
Note that it is not until you actually query the table that the connector attempts to read the file. Only then does it return an error.
HDFS Connector Troubleshooting Tips

The following sections explain some of the common issues you may encounter when using the HDFS Connector.
User Unable to Connect to KerberosAuthenticated Hadoop Cluster A user may suddenly be unable to connect to Hadoop through the connector in a Kerberos-enabled environment. This issue can be caused by someone exporting a new keytab file for the user, which invalidates existing keytab files. You can determine if invalid keytab files is the problem by comparing the key version number associated with the user's principal key in Kerberos with the key version number stored in the keytab file on the Vertica cluster. To find the key version number for a user in Kerberos: 1. From the Linux command line, start the kadmin utility (kadmin.local if you are logged into the Kerberos Key Distribution Center). Run the getprinc command for the user: $ sudo kadmin [sudo] password for dbadmin: Authenticating as principal root/[email protected] with password. Password for root/[email protected]: kadmin: getprinc [email protected] Principal: [email protected] Expiration date: [never]
Last password change: Fri Jul 26 09:40:44 EDT 2013 Password expiration date: [none] Maximum ticket life: 1 day 00:00:00 Maximum renewable life: 0 days 00:00:00 Last modified: Fri Jul 26 09:40:44 EDT 2013 (root/[email protected]) Last successful authentication: [never] Last failed authentication: [never] Failed password attempts: 0 Number of keys: 2 Key: vno 3, des3-cbc-sha1, no salt Key: vno 3, des-cbc-crc, no salt MKey: vno 0 Attributes: Policy: [none]
In the preceding example, there are two keys stored for the user, both of which are at version number (vno) 3. 2. To get the version numbers of the keys stored in the keytab file, use the klist command: $ sudo klist -ek exampleuser.keytab Keytab name: FILE:exampleuser.keytab KVNO Principal ---- ---------------------------------------------------------------------2 [email protected] (des3-cbc-sha1) 2 [email protected] (des-cbc-crc) 3 [email protected] (des3-cbc-sha1) 3 [email protected] (des-cbc-crc)
The first column in the output lists the key version number. In the preceding example, the keytab includes both key versions 2 and 3, so the keytab file can be used to authenticate the user with Kerberos.
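If the keytab's key version numbers do not match the principal's current version in Kerberos, the keytab must be replaced with one containing the current keys. One way to do this, sketched below using the principal and file name from the example above, is kadmin's ktadd command; note that ktadd generates new randomized keys and increments the key version number, so this approach assumes the principal authenticates only through the keytab:
$ sudo kadmin
kadmin: ktadd -k exampleuser.keytab [email protected]
$ sudo klist -ek exampleuser.keytab
After copying the refreshed keytab to the Vertica cluster, compare the KVNO values reported by klist with the output of getprinc to confirm that they now match.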
Resolving Error 5118 When using the connector, you might receive an error message similar to the following: ERROR 5118: UDL specified no execution nodes; at least one execution node must be specified
To correct this error, verify that all of the nodes in your Vertica cluster have the correct version of the HDFS Connector package installed. This error can occur if one or more of the nodes do not have the supporting libraries installed. These libraries may be missing because one of the nodes was skipped when initially installing the connector package. Another possibility is that one or more nodes have been added since the connector was installed.
Transfer Rate Errors The HDFS Connector monitors how quickly Hadoop sends data to Vertica. If the data transfer speed on a connection between a node in your Hadoop cluster and a node in your Vertica cluster falls below a lower limit (by default, 1 MB per second), the connector breaks off the data transfer. It then connects to another node in the Hadoop cluster that contains the data it was retrieving. If it cannot find another node in the Hadoop cluster to supply the data (or has already tried all of the nodes in the Hadoop cluster), the connector terminates the COPY statement and returns an error. => COPY messages SOURCE Hdfs(url='http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt', username='exampleuser'); ERROR 3399: Failure in UDx RPC call InvokeProcessUDL(): Error calling processUDL() in User Defined Object [Hdfs] at [src/Hdfs.cpp:275], error code: 0, message: [Transferring rate during last 60 seconds is 172655 byte/s, below threshold 1048576 byte/s, give up. The last error message: Operation too slow. Less than 1048576 bytes/sec transferred the last 1 seconds. The URL: http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt?op=OPEN&offset=154901544&length=113533912. The redirected URL: http://hadoop.example.com:50075/webhdfs/v1/tmp/data.txt?op=OPEN& namenoderpcaddress=hadoop.example.com:8020&length=113533912&offset=154901544.]
If you encounter this error, troubleshoot the connection between your Vertica and Hadoop clusters. If there are no problems with the network, determine if either your Hadoop cluster or Vertica cluster is overloaded. If the nodes in either cluster are too busy, they may not be able to maintain the minimum data transfer rate. If you cannot resolve the issue causing the slow transfer rate, you can lower the minimum acceptable speed. To do so, set the low_speed_limit parameter for the Hdfs source. The following example shows how to set low_speed_limit to 524288 to accept transfer rates as low as 512 KB per second (half the default lower limit). => COPY messages SOURCE Hdfs(url='http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt', username='exampleuser', low_speed_limit=524288); Rows Loaded ------------9891287 (1 row)
Lowering the low_speed_limit parameter can cause COPY statements that load data from HDFS to take a long time to complete. Conversely, if the network between your Hadoop cluster and your Vertica cluster is fast, you can raise low_speed_limit so that COPY statements generate an error when they run more slowly than they should, given the speed of the network.
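For example, the following hypothetical COPY statement raises low_speed_limit to 2097152 (2 MB per second); the URL and username are the same illustrative values used in the earlier examples:
=> COPY messages SOURCE Hdfs(url='http://hadoop.example.com:50070/webhdfs/v1/tmp/data.txt', username='exampleuser', low_speed_limit=2097152);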
Error Loading Many Files When using the HDFS Connector to load many data files in a single statement, you might receive an error message similar to the following: ERROR 3399: Failure in UDx RPC call InvokePlanUDL(): Error calling planUDL() in User Defined Object [Hdfs] at [src/Glob.cpp:531], error code: 0, message: Error occurs in Glob::stat: Last error message before give up: Failed to connect to 10.20.41.212: Cannot assign requested address.
This error can happen when concurrent load requests overwhelm the Hadoop NameNode. It is generally safe to load hundreds of files at a time, but loading thousands in a single statement might produce this error. To avoid it, load files in smaller batches.
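For example, rather than loading an entire directory tree through a single glob, you might issue several COPY statements that each match a smaller set of files. The paths below are hypothetical; the point is only that each statement touches far fewer files:
=> COPY messages SOURCE Hdfs(url='http://hadoop.example.com:50070/webhdfs/v1/tmp/logs/2014-07/*', username='exampleuser');
=> COPY messages SOURCE Hdfs(url='http://hadoop.example.com:50070/webhdfs/v1/tmp/logs/2014-08/*', username='exampleuser');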
Using HDFS Storage Locations The Vertica Storage Location for HDFS lets Vertica store its data in a Hadoop Distributed File System (HDFS) similarly to how it stores data on a native Linux filesystem. It lets you create a storage tier for lower-priority data to free space on your Vertica cluster for higher-priority data. For example, suppose you store website clickstream data in your Vertica database. You may find that most queries only examine the last six months of this data. However, there are a few low-priority queries that still examine data older than six months. In this case, you could choose to move the older data to an HDFS storage location so that it is still available for the infrequent queries. The queries on the older data are slower because they now access data stored on HDFS rather than native disks. However, you free space on your Vertica cluster's storage for higher-priority, frequently-queried data.
Storage Location for HDFS Requirements To store Vertica's data on HDFS, verify that:
- Your Hadoop cluster has WebHDFS enabled.
- All of the nodes in your Vertica cluster can connect to all of the nodes in your Hadoop cluster. Any firewall between the two clusters must allow connections on the ports used by HDFS. See Testing Your Hadoop webHDFS Configuration for a procedure to test the connectivity between your Vertica and Hadoop clusters.
- You have a Hadoop user whose username matches the name of the Vertica database administrator (usually named dbadmin). This Hadoop user must have read and write access to the HDFS directory where you want Vertica to store its data.
- Your HDFS has enough storage available for Vertica data. See HDFS Space Requirements below for details.
- The data you store in an HDFS-backed storage location does not expand your database's size beyond any data allowance in your Vertica license. Vertica counts data stored in an HDFS-backed storage location as part of any data allowance set by your license. See Managing Licenses in the Administrator's Guide for more information.
- If you are using an HDFS storage location with Kerberos, you must have Kerberos running and the principals defined before creating the storage location. See Create the Principals and Keytabs for instructions on defining the principals.
HDFS Space Requirements If your Vertica database is K-safe, HDFS-based storage locations contain two copies of the data you store in them. One copy is the primary projection, and the other is the buddy projection. If you have enabled HDFS's data redundancy feature, Hadoop stores both projections multiple times. This duplication may seem excessive. However, it is similar to how a RAID level 1 or higher redundantly stores copies of both Vertica's primary and buddy projections. The redundant copies also help the performance of HDFS by enabling multiple nodes to process a request for a file. Verify that your HDFS installation has sufficient space available for redundant storage of both the primary and buddy projections of your K-safe data. You can adjust the number of duplicates stored by HDFS by setting the HadoopFSReplication configuration parameter. See Troubleshooting HDFS Storage Locations for details.
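For example, the following sketch adjusts the HDFS replication that Vertica requests, using the same ALTER DATABASE syntax shown later in this document; the database name mydb and the value 2 are illustrative assumptions, not recommendations:
=> ALTER DATABASE mydb SET HadoopFSReplication = 2;
=> SELECT get_config_parameter('HadoopFSReplication');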
Additional Requirements for Backing Up Data Stored on HDFS In Premium Edition, to back up your data stored in HDFS storage locations, your Hadoop cluster must:
- Have HDFS 2.0 or later installed. The vbr backup utility uses the snapshot feature introduced in HDFS 2.0.
- Have snapshotting enabled for the directories to be used for backups. The easiest way to do this is to give the database administrator's account superuser privileges in Hadoop, so that snapshotting can be set automatically. Alternatively, use Hadoop to enable snapshotting for each directory before using it for backups.
In addition, your Vertica database must:
- Have enough Hadoop components and libraries installed in order to run the Hadoop distcp command as the Vertica database-administrator user (usually dbadmin).
- Have the JavaBinaryForUDx and HadoopHome configuration parameters set correctly.
Caution: After you have created an HDFS storage location, full database backups will fail with the error message:
ERROR 5127: Unable to create snapshot No such file /usr/bin/hadoop: check the HadoopHome configuration parameter
This error is caused by the backup script not being able to back up the HDFS storage locations. You must configure Vertica and Hadoop to enable the backup script to back up these locations. After you configure Vertica and Hadoop, you can once again perform full database backups. See Backing Up HDFS Storage Locations for details on configuring your Vertica and Hadoop clusters to enable HDFS storage location backup.
How the HDFS Storage Location Stores Data The Vertica Storage Location for HDFS stores data on the Hadoop HDFS similarly to the way Vertica stores data in the Linux file system. See Managing Storage Locations in the Administrator's Guide for more information about storage locations. When you create a storage location on HDFS, Vertica stores the ROS containers holding its data on HDFS. You can choose which data uses the HDFS storage location: from the data for just a single table to all of the database's data. When Vertica reads data from or writes data to an HDFS storage location, the node storing or retrieving the data contacts the Hadoop cluster directly to transfer the data. If a single ROS container file is split among several Hadoop nodes, the Vertica node connects to each of them. The Vertica node retrieves the pieces and reassembles the file. By having each node fetch its own data directly from the source, data transfers are parallel, increasing their efficiency. Having the Vertica nodes directly retrieve the file splits also reduces the impact on the Hadoop cluster.
What You Can Store on HDFS Use HDFS storage locations to store only data. You cannot store catalog information in an HDFS storage location. Caution: While it is possible to use an HDFS storage location for temporary data storage, you must never do so. Using HDFS for temporary storage causes severe performance issues. The only time you change an HDFS storage location's usage to temporary is when you are in the process of removing it.
What HDFS Storage Locations Cannot Do Because Vertica uses the storage locations to store ROS containers in a proprietary format, MapReduce and other Hadoop components cannot access your Vertica data stored in HDFS. Never allow another program that has access to HDFS to write to the ROS files. Any outside modification of these files can lead to data corruption and loss. Use the Vertica Connector for Hadoop MapReduce if you need your MapReduce job to access Vertica data. Other applications must use the Vertica client libraries to access Vertica data. The storage location stores and reads only ROS containers. It cannot read data stored in native formats in HDFS. If you want Vertica to read data from HDFS, use the Vertica Connector for HDFS. If the data you want to access is available in a Hive database, you can use the Vertica Connector for HCatalog.
Creating an HDFS Storage Location Before creating an HDFS storage location, you must first create a Hadoop user who can access the data:
- If your HDFS cluster is unsecured, create a Hadoop user whose username matches the user name of the Vertica database administrator account. For example, suppose your database administrator account has the default username dbadmin. You must create a Hadoop user account named dbadmin and give it full read and write access to the HDFS directory where you want Vertica to store files.
- If your HDFS cluster uses Kerberos authentication, create a Kerberos principal for Vertica and give it read and write access to the HDFS directory that will be used for the storage location. See Configuring Kerberos.
Consult the documentation for your Hadoop distribution to learn how to create a user and grant the user read and write permissions for a directory in HDFS. Use the CREATE LOCATION statement to create an HDFS storage location. To do so, you must:
- Supply the WebHDFS URI for the HDFS directory where you want Vertica to store the location's data as the path argument. This URI is the same as a standard HDFS URL, except it uses the webhdfs:// protocol and its path does not start with /webhdfs/v1/.
- Include the ALL NODES SHARED keywords, as all HDFS storage locations are shared storage. This is required even if you have only one HDFS node in your cluster.
The following example demonstrates creating an HDFS storage location that:
- Is located on the Hadoop cluster whose name node's host name is hadoop.
- Stores its files in the /user/dbadmin directory.
- Is labeled coldstorage.
The example also demonstrates querying the STORAGE_LOCATIONS system table to verify that the storage location was created. => CREATE LOCATION 'webhdfs://hadoop:50070/user/dbadmin' ALL NODES SHARED USAGE 'data' LABEL 'coldstorage'; CREATE LOCATION => SELECT node_name,location_path,location_label FROM STORAGE_LOCATIONS; node_name | location_path | location_label ------------------+------------------------------------------------------+---------------v_vmart_node0001 | /home/dbadmin/VMart/v_vmart_node0001_data | v_vmart_node0001 | webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 | coldstorage v_vmart_node0002 | /home/dbadmin/VMart/v_vmart_node0002_data | v_vmart_node0002 | webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 | coldstorage v_vmart_node0003 | /home/dbadmin/VMart/v_vmart_node0003_data | v_vmart_node0003 | webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 | coldstorage (6 rows)
Each node in the cluster has created its own directory under the dbadmin directory in HDFS. These individual directories prevent the nodes from interfering with each other's files in the shared location.
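You can also confirm the per-node directories from the Hadoop side by listing the storage location's path with the HDFS command-line client. This is a sketch only; the path matches the example above, and the listing should show one subdirectory per Vertica node:
$ hdfs dfs -ls /user/dbadmin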
Creating a Storage Location Using Vertica for SQL on Apache Hadoop If you are using the Premium Edition product, then you typically use HDFS storage locations for lower-priority data as shown in the previous example. If you are using the Vertica for SQL on Apache Hadoop product, however, all of your data must be stored in HDFS. To create an HDFS storage location that complies with the Vertica for SQL on Apache Hadoop license, first create the location on all nodes and then set its storage policy to HDFS. To create the location in HDFS on all nodes: => CREATE LOCATION 'webhdfs://hadoop:50070/user/dbadmin' ALL NODES SHARED USAGE 'data' LABEL 'HDFS';
Next, set the storage policy for your database to use this location:
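A minimal sketch of such a statement, assuming a hypothetical database named vmart and using the SET_OBJECT_STORAGE_POLICY function described later in this document, might look like this:
=> SELECT SET_OBJECT_STORAGE_POLICY('vmart', 'HDFS');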
This causes all data to be written to the HDFS storage location instead of the local disk. For more information, see "Best Practices for SQL on Hadoop" in Managing Storage Locations.
Adding HDFS Storage Locations to New Nodes Any nodes you add to your cluster do not have access to existing HDFS storage locations. You must manually create the storage location for the new node using the CREATE LOCATION statement. Do not use the ALL NODES keyword in this statement. Instead, use the NODE keyword with the name of the new node to tell Vertica that just that node needs to add the shared location. Caution: You must manually create the storage location. Otherwise, the new node uses the default storage policy (usually, storage on the local Linux filesystem) to store data that the other nodes store in HDFS. As a result, the node can run out of disk space. The following example shows how to add the storage location from the preceding example to a new node named v_vmart_node0004: => CREATE LOCATION 'webhdfs://hadoop:50070/user/dbadmin' NODE 'v_vmart_node0004' SHARED USAGE 'data' LABEL 'coldstorage';
Any standby nodes that are active in your cluster when you create an HDFS-based storage location automatically create their own instances of the location. When a standby node takes over for a down node, it uses its own instance of the location to store data for objects using the HDFS-based storage policy. Treat standby nodes added after you create the storage location like any other new node: you must manually define the HDFS storage location for them.
Creating a Storage Policy for HDFS Storage Locations After you create an HDFS storage location, you assign database objects to the location by setting storage policies. Based on these storage policies, database objects such as partition ranges, individual tables, whole schemas, or even the entire database store their data in the HDFS storage location. Use the SET_OBJECT_STORAGE_POLICY function to assign objects to an HDFS storage location. In the function call,
supply the label you assigned to the HDFS storage location as the location label argument. You assigned this label using the CREATE LOCATION statement's LABEL keyword. The following topics provide examples of storing data on HDFS.
Storing an Entire Table in an HDFS Storage Location The following example demonstrates using SET_OBJECT_STORAGE_POLICY to store a table in an HDFS storage location. The example statement sets the policy for an existing table, named messages, to store its data in an HDFS storage location, named coldstorage. => SELECT SET_OBJECT_STORAGE_POLICY('messages', 'coldstorage');
This table's data is moved to the HDFS storage location with the next merge-out. Alternatively, you can have Vertica move the data immediately by using the enforce_storage_move parameter. You can query the STORAGE_CONTAINERS system table and examine the location_label column to verify that Vertica has moved the data: => SELECT node_name, projection_name, location_label, total_row_count FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'messages%'; node_name | projection_name | location_label | total_row_count ------------------+-----------------+----------------+----------------v_vmart_node0001 | messages_b0 | coldstorage | 366057 v_vmart_node0001 | messages_b1 | coldstorage | 366511 v_vmart_node0002 | messages_b0 | coldstorage | 367432 v_vmart_node0002 | messages_b1 | coldstorage | 366057 v_vmart_node0003 | messages_b0 | coldstorage | 366511 v_vmart_node0003 | messages_b1 | coldstorage | 367432 (6 rows)
See Creating Storage Policies in the Administrator's Guide for more information about assigning storage policies to objects.
Storing Table Partitions in HDFS If the data you want to store in an HDFS-based storage location is in a partitioned table, you can choose to store some of the partitions in HDFS. This capability lets you periodically move old data that is queried less frequently off of more costly, higher-speed storage (such as a solid-state drive). You can instead use slower and less expensive HDFS storage. The older data is still accessible in queries, just at a slower speed. In this scenario, the faster storage is often referred to as "hot storage," and the slower storage is referred to as "cold storage."
For example, suppose you have a table named messages containing social media messages that is partitioned by the year and month of the message's timestamp. You can list the partitions in the table by querying the PARTITIONS system table. => SELECT partition_key, projection_name, node_name, location_label FROM partitions ORDER BY partition_key; partition_key | projection_name | node_name | location_label --------------+-----------------+------------------+---------------201309 | messages_b1 | v_vmart_node0001 | 201309 | messages_b0 | v_vmart_node0003 | 201309 | messages_b1 | v_vmart_node0002 | 201309 | messages_b1 | v_vmart_node0003 | 201309 | messages_b0 | v_vmart_node0001 | 201309 | messages_b0 | v_vmart_node0002 | 201310 | messages_b0 | v_vmart_node0002 | 201310 | messages_b1 | v_vmart_node0003 | 201310 | messages_b0 | v_vmart_node0001 | . . . 201405 | messages_b0 | v_vmart_node0002 | 201405 | messages_b1 | v_vmart_node0003 | 201405 | messages_b1 | v_vmart_node0001 | 201405 | messages_b0 | v_vmart_node0001 | (54 rows)
Next, suppose you find that most queries on this table access only the latest month or two of data. You may decide to move the older data to cold storage in an HDFS-based storage location. After you move the data, it is still available for queries, but with lower query performance. To move partitions to the HDFS storage location, supply the lowest and highest partition key values to be moved in the SET_OBJECT_STORAGE_POLICY function call. The following example shows how to move data between two dates to an HDFS-based storage location. In this example:
- Partition key value 201309 represents September 2013.
- Partition key value 201403 represents March 2014.
- The name, coldstorage, is the label of the HDFS-based storage location.
=> SELECT SET_OBJECT_STORAGE_POLICY('messages','coldstorage', '201309', '201403' USING PARAMETERS ENFORCE_STORAGE_MOVE = 'true');
After the statement finishes, the range of partitions now appears in the HDFS storage location labeled coldstorage. This location name now displays in the PARTITIONS system table's location_label column. => SELECT partition_key, projection_name, node_name, location_label FROM partitions ORDER BY partition_key; partition_key | projection_name | node_name | location_label --------------+-----------------+------------------+---------------201309 | messages_b0 | v_vmart_node0003 | coldstorage
After your initial data move, you can move additional data to the HDFS storage location periodically. You move individual partitions or a range of partitions from the "hot" storage to the "cold" storage location using the same method: => SELECT SET_OBJECT_STORAGE_POLICY('messages', 'coldstorage', '201404', '201404' USING PARAMETERS ENFORCE_STORAGE_MOVE = 'true'); SET_OBJECT_STORAGE_POLICY ---------------------------Object storage policy set. (1 row) => SELECT projection_name, node_name, location_label FROM PARTITIONS WHERE PARTITION_KEY = '201404'; projection_name | node_name | location_label -----------------+------------------+---------------messages_b0 | v_vmart_node0002 | coldstorage messages_b0 | v_vmart_node0003 | coldstorage messages_b1 | v_vmart_node0003 | coldstorage messages_b0 | v_vmart_node0001 | coldstorage messages_b1 | v_vmart_node0002 | coldstorage messages_b1 | v_vmart_node0001 | coldstorage (6 rows)
Moving Partitions to a Table Stored on HDFS Another method of moving partitions from hot storage to cold storage is to move the partition's data to a separate table that is stored on HDFS. This method breaks the data into two tables, one containing hot data and the other containing cold data. Use this method if you want to prevent queries from inadvertently accessing data stored in the slower HDFS storage location. To query the older data, you must explicitly query the cold table. To move partitions:
1. Create a new table whose schema matches that of the existing partitioned table.
2. Set the storage policy of the new table to use the HDFS-based storage location.
3. Use the MOVE_PARTITIONS_TO_TABLE function to move a range of partitions from the hot table to the cold table.
The following example demonstrates these steps. You first create a table named cold_messages. You then assign it the HDFS-based storage location named coldstorage, and, finally, move a range of partitions. => CREATE TABLE cold_messages LIKE messages INCLUDING PROJECTIONS; => SELECT SET_OBJECT_STORAGE_POLICY('cold_messages', 'coldstorage'); => SELECT MOVE_PARTITIONS_TO_TABLE('messages','201309','201403','cold_messages');
Note: The partitions moved using this method do not immediately migrate to the storage location on HDFS. Instead, the Tuple Mover eventually moves them to the storage location.
Backing Up Vertica Storage Locations for HDFS Note: The backup and restore features are available only in the Premium Edition product, not in Vertica for SQL on Apache Hadoop. HPE recommends that you regularly back up the data in your Vertica database. This recommendation includes data stored in your HDFS storage locations. The Vertica backup script (vbr) can back up HDFS storage locations. However, you must perform several configuration steps before it can back up these locations. Caution: After you have created an HDFS storage location, full database backups will fail with the error message: ERROR 5127: Unable to create snapshot No such file /usr/bin/hadoop: check the HadoopHome configuration parameter
This error is caused by the backup script not being able to back up the HDFS storage locations. You must configure Vertica and Hadoop to enable the backup script to back up these locations. After you configure Vertica and Hadoop, you can once again perform full database backups. There are several considerations for backing up HDFS storage locations in your database:
- The HDFS storage location backup feature relies on the snapshotting feature introduced in HDFS 2.0. You cannot back up an HDFS storage location stored on an earlier version of HDFS.
- HDFS storage locations do not support object-level backups. You must perform a full database backup in order to back up the data in your HDFS storage locations.
- Data in an HDFS storage location is backed up to HDFS. This backup guards against accidental deletion or corruption of data. It does not prevent data loss in the case of a catastrophic failure of the entire Hadoop cluster. To prevent data loss, you must have a backup and disaster recovery plan for your Hadoop cluster.
- Data stored on the Linux native filesystem is still backed up to the location you specify in the backup configuration file. It and the data in HDFS storage locations are handled separately by the vbr backup script.
- You must configure your Vertica cluster in order to restore database backups containing an HDFS storage location. See Configuring Vertica to Back Up HDFS Storage Locations for the configuration steps you must take.
- The HDFS directory for the storage location must have snapshotting enabled. You can either directly configure this yourself or enable the database administrator's Hadoop account to do it for you automatically. See Configuring Hadoop to Enable Backup of HDFS Storage for more information.
The topics in this section explain the configuration steps you must take to enable the backup of HDFS storage locations.
Configuring Vertica to Restore HDFS Storage Locations Your Vertica cluster must be able to run the Hadoop distcp command to restore a backup of an HDFS storage location. The easiest way to enable your cluster to run this command is to install several Hadoop packages on each node. These packages must be from the same distribution and version of Hadoop that is running on your Hadoop cluster. The steps you need to take depend on:
- The distribution and version of Hadoop running on the Hadoop cluster containing your HDFS storage location.
- The distribution of Linux running on your Vertica cluster.
Note: Installing the Hadoop packages necessary to run distcp does not turn your Vertica database into a Hadoop cluster. This process installs just enough of the Hadoop support files on your cluster to run the distcp command. There is no additional overhead placed on the Vertica cluster, aside from a small amount of additional disk space consumed by the Hadoop support files.
Configuration Overview The steps for configuring your Vertica cluster to restore backups of HDFS storage locations are:
1. If necessary, install and configure a Java runtime on the hosts in the Vertica cluster.
2. Find the location of your Hadoop distribution's package repository.
3. Add the Hadoop distribution's package repository to the Linux package manager on all hosts in your cluster.
4. Install the necessary Hadoop packages on your Vertica hosts.
5. Set two configuration parameters in your Vertica database related to Java and Hadoop.
6. If your HDFS storage location uses Kerberos, set additional configuration parameters to allow Vertica user credentials to be proxied.
7. Confirm that the Hadoop distcp command runs on your Vertica hosts.
The following sections describe these steps in greater detail.
Installing a Java Runtime Your Vertica cluster must have a Java Virtual Machine (JVM) installed to run the Hadoop distcp command. It already has a JVM installed if you have configured it to:
- Execute User-Defined Extensions developed in Java. See Developing User Defined Extensions for more information.
- Access Hadoop data using the HCatalog Connector. See Using the HCatalog Connector for more information.
If your Vertica database already has a JVM installed, verify that your Hadoop distribution supports it. See your Hadoop distribution's documentation to determine which JVMs it supports. If the JVM installed on your Vertica cluster is not supported by your Hadoop distribution, you must uninstall it. Then install a JVM that is supported by both Vertica and your Hadoop distribution. See Vertica SDKs in Supported Platforms for a list of the JVMs compatible with Vertica. If your Vertica cluster does not have a JVM (or its existing JVM is incompatible with your Hadoop distribution), follow the instructions in Installing the Java Runtime on Your Vertica Cluster.
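Before installing anything, you can check from the Bash shell whether a JVM is already present on a Vertica host and which version it is; both commands are standard, and the output depends on your system:
$ which java
$ java -version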
Finding Your Hadoop Distribution's Package Repository Many Hadoop distributions have their own installation system, such as Cloudera Manager or Hortonworks Ambari. However, they also support manual installation using native Linux packages such as RPM and .deb files. These package files are maintained in a repository. You can configure your Vertica hosts to access this repository to download and install Hadoop packages. Consult your Hadoop distribution's documentation to find the location of its Linux package repository. This information is often located in the portion of the documentation covering manual installation techniques. For example:
- The Hortonworks Version 2.1 topic on Configuring the Remote Repositories.
- The "Steps to Install CDH 5 Manually" section of the Cloudera Version 5.1.0 topic Installing CDH 5.
Each Hadoop distribution maintains separate repositories for each of the major Linux package management systems. Find the specific repository for the Linux distribution running on your Vertica cluster. Be sure that the package repository you select matches the version of the Hadoop distribution installed on your Hadoop cluster.
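One way to confirm which Hadoop version your Hadoop cluster is running, before picking a repository, is to run the hadoop version command on one of the Hadoop nodes (the output varies by distribution):
$ hadoop version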
Configuring Vertica Nodes to Access the Hadoop Distribution’s Package Repository Configure the nodes in your Vertica cluster so they can access your Hadoop distribution's package repository. Your Hadoop distribution's documentation should explain how to add the repositories to your Linux platform. If the documentation does not explain how to add the repository to your packaging system, refer to your Linux distribution's documentation. The steps you need to take depend on the package management system your Linux platform uses. Usually, the process involves:
- Downloading a configuration file.
- Adding the configuration file to the package management system's configuration directory.
- For Debian-based Linux distributions, adding the Hadoop repository encryption key to the root account keyring.
- Updating the package management system's index to have it discover new packages.
The following example demonstrates adding the Hortonworks 2.1 package repository to an Ubuntu 12.04 host. These steps in this example are explained in the Hortonworks documentation. $ wget http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/hdp.list \ -O /etc/apt/sources.list.d/hdp.list --2014-08-20 11:06:00-- http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/hdp.list Connecting to 16.113.84.10:8080... connected. Proxy request sent, awaiting response... 200 OK Length: 161 [binary/octet-stream] Saving to: `/etc/apt/sources.list.d/hdp.list' 100%[======================================>] 161
--.-K/s
in 0s
2014-08-20 11:06:00 (8.00 MB/s) - `/etc/apt/sources.list.d/hdp.list' saved [161/161] $ gpg --keyserver pgp.mit.edu --recv-keys B9733A7A07513CAD gpg: requesting key 07513CAD from hkp server pgp.mit.edu gpg: /root/.gnupg/trustdb.gpg: trustdb created gpg: key 07513CAD: public key "Jenkins (HDP Builds) " imported gpg: Total number processed: 1 gpg: imported: 1 (RSA: 1) $ gpg -a --export 07513CAD | apt-key add OK $ apt-get update Hit http://us.archive.ubuntu.com precise Release.gpg Hit http://extras.ubuntu.com precise Release.gpg Get:1 http://security.ubuntu.com precise-security Release.gpg [198 B] Hit http://us.archive.ubuntu.com precise-updates Release.gpg Get:2 http://public-repo-1.hortonworks.com HDP-UTILS Release.gpg [836 B] Get:3 http://public-repo-1.hortonworks.com HDP Release.gpg [836 B] Hit http://us.archive.ubuntu.com precise-backports Release.gpg Hit http://extras.ubuntu.com precise Release Get:4 http://security.ubuntu.com precise-security Release [50.7 kB] Get:5 http://public-repo-1.hortonworks.com HDP-UTILS Release [6,550 B] Hit http://us.archive.ubuntu.com precise Release Hit http://extras.ubuntu.com precise/main Sources Get:6 http://public-repo-1.hortonworks.com HDP Release [6,502 B] Hit http://us.archive.ubuntu.com precise-updates Release Get:7 http://public-repo-1.hortonworks.com HDP-UTILS/main amd64 Packages [1,955 B] Get:8 http://security.ubuntu.com precise-security/main Sources [108 kB] Get:9 http://public-repo-1.hortonworks.com HDP-UTILS/main i386 Packages [762 B]
. . . Reading package lists... Done
You must add the Hadoop repository to all hosts in your Vertica cluster.
Installing the Required Hadoop Packages After configuring the repository, you are ready to install the Hadoop packages. The packages you need to install are:
- hadoop
- hadoop-hdfs
- hadoop-client
The names of the packages are usually the same across all Hadoop and Linux distributions. These packages often have additional dependencies. Always accept any additional packages that the Linux package manager asks to install. To install these packages, use the package manager command for your Linux distribution. The package manager command you need to use depends on your Linux distribution:
- On Red Hat and CentOS, the package manager command is yum.
- On Debian and Ubuntu, the package manager command is apt-get.
- On SUSE, the package manager command is zypper.
Consult your Linux distribution's documentation for instructions on installing packages. The following example demonstrates installing the required Hadoop packages from the Hortonworks 2.1 distribution on an Ubuntu 12.04 system. # apt-get install hadoop hadoop-hdfs hadoop-client Reading package lists... Done Building dependency tree Reading state information... Done The following extra packages will be installed: bigtop-jsvc hadoop-mapreduce hadoop-yarn zookeeper The following NEW packages will be installed: bigtop-jsvc hadoop hadoop-client hadoop-hdfs hadoop-mapreduce hadoop-yarn zookeeper 0 upgraded, 7 newly installed, 0 to remove and 90 not upgraded. Need to get 86.6 MB of archives. After this operation, 99.8 MB of additional disk space will be used. Do you want to continue [Y/n]? Y Get:1 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main bigtop-jsvc amd64 1.0.10-1 [28.5 kB] Get:2 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main
zookeeper all 3.4.5.2.1.3.0-563 [6,820 kB] Get:3 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main hadoop all 2.4.0.2.1.3.0-563 [21.5 MB] Get:4 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main hadoop-hdfs all 2.4.0.2.1.3.0-563 [16.0 MB] Get:5 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main hadoop-yarn all 2.4.0.2.1.3.0-563 [15.1 MB] Get:6 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main hadoop-mapreduce all 2.4.0.2.1.3.0-563 [27.2 MB] Get:7 http://public-repo-1.hortonworks.com/HDP/ubuntu12/2.1.3.0/ HDP/main hadoop-client all 2.4.0.2.1.3.0-563 [3,650 B] Fetched 86.6 MB in 1min 2s (1,396 kB/s) Selecting previously unselected package bigtop-jsvc. (Reading database ... 197894 files and directories currently installed.) Unpacking bigtop-jsvc (from .../bigtop-jsvc_1.0.10-1_amd64.deb) ... Selecting previously unselected package zookeeper. Unpacking zookeeper (from .../zookeeper_3.4.5.2.1.3.0-563_all.deb) ... Selecting previously unselected package hadoop. Unpacking hadoop (from .../hadoop_2.4.0.2.1.3.0-563_all.deb) ... Selecting previously unselected package hadoop-hdfs. Unpacking hadoop-hdfs (from .../hadoop-hdfs_2.4.0.2.1.3.0-563_all.deb) ... Selecting previously unselected package hadoop-yarn. Unpacking hadoop-yarn (from .../hadoop-yarn_2.4.0.2.1.3.0-563_all.deb) ... Selecting previously unselected package hadoop-mapreduce. Unpacking hadoop-mapreduce (from .../hadoop-mapreduce_2.4.0.2.1.3.0-563_all.deb) ... Selecting previously unselected package hadoop-client. Unpacking hadoop-client (from .../hadoop-client_2.4.0.2.1.3.0-563_all.deb) ... Processing triggers for man-db ... Setting up bigtop-jsvc (1.0.10-1) ... Setting up zookeeper (3.4.5.2.1.3.0-563) ... update-alternatives: using /etc/zookeeper/conf.dist to provide /etc/zookeeper/conf (zookeeper-conf) in auto mode. Setting up hadoop (2.4.0.2.1.3.0-563) ... update-alternatives: using /etc/hadoop/conf.empty to provide /etc/hadoop/conf (hadoop-conf) in auto mode. Setting up hadoop-hdfs (2.4.0.2.1.3.0-563) ... Setting up hadoop-yarn (2.4.0.2.1.3.0-563) ... Setting up hadoop-mapreduce (2.4.0.2.1.3.0-563) ... Setting up hadoop-client (2.4.0.2.1.3.0-563) ... Processing triggers for libc-bin ... ldconfig deferred processing now taking place
Setting Configuration Parameters You must set two configuration parameters to enable Vertica to restore HDFS data:
- JavaBinaryForUDx is the path to the Java executable. You may have already set this value to use Java UDxs or the HCatalog Connector. You can find the path for the default Java executable from the Bash command shell using the command: which java
- HadoopHome is the path where Hadoop is installed on the Vertica hosts. This is the directory that contains bin/hadoop (the bin directory containing the Hadoop
executable file). The default value for this parameter is /usr. The default value is correct if your Hadoop executable is located at /usr/bin/hadoop. The following example demonstrates setting and then reviewing the values of these parameters. => ALTER DATABASE mydb SET JavaBinaryForUDx = '/usr/bin/java'; => SELECT get_config_parameter('JavaBinaryForUDx'); get_config_parameter ---------------------/usr/bin/java (1 row) => ALTER DATABASE mydb SET HadoopHome = '/usr'; => SELECT get_config_parameter('HadoopHome'); get_config_parameter ---------------------/usr (1 row)
There are additional parameters you may optionally set (an example of setting them follows this list):
- HadoopFSReadRetryTimeout and HadoopFSWriteRetryTimeout specify how long to wait before failing. The default value for each is 180 seconds, the Hadoop default. If you are confident that your file system will fail more quickly, you can potentially improve performance by lowering these values.
- HadoopFSReplication is the number of replicas HDFS makes. By default the Hadoop client chooses this; Vertica uses the same value for all nodes. We recommend against changing this unless directed to.
- HadoopFSBlockSizeBytes is the block size to write to HDFS; larger files are divided into blocks of this size. The default is 64MB.
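The following statements are a sketch of adjusting these optional parameters with the same ALTER DATABASE syntax used above; the database name mydb and the specific values are illustrative assumptions rather than recommendations:
=> ALTER DATABASE mydb SET HadoopFSReadRetryTimeout = 90;
=> ALTER DATABASE mydb SET HadoopFSWriteRetryTimeout = 90;
=> ALTER DATABASE mydb SET HadoopFSBlockSizeBytes = 134217728;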
Setting Kerberos Parameters If your Vertica nodes are co-located on HDFS nodes and you are using Kerberos, you must change some Hadoop configuration parameters. These changes are needed in order for restoring from backups to work. In yarn-site.xml on every Vertica node, set the following parameters: Parameter
No changes are needed on HDFS nodes that are not also Vertica nodes.
Confirming that distcp Runs Once the packages are installed on all hosts in your cluster, your database should be able to run the Hadoop distcp command. To test it:
1. Log into any host in your cluster as the database administrator.
2. At the Bash shell, enter the command:
$ hadoop distcp
3. The command should print a message similar to the following: usage: distcp OPTIONS [source_path...] OPTIONS -async Should distcp execution be blocking -atomic Commit all changes or none -bandwidth Specify bandwidth per map in MB -delete Delete from target, files missing in source -f List of files that need to be copied -filelimit (Deprecated!) Limit number of files copied to <= n -i Ignore failures during copy -log Folder on DFS where distcp execution logs are saved -m Max number of concurrent maps to use for copy -mapredSslConf Configuration for ssl config file, to use with hftps:// -overwrite Choose to overwrite target files unconditionally, even if they exist. -p preserve status (rbugpc)(replication, block-size, user, group, permission, checksum-type) -sizelimit (Deprecated!) Limit number of files copied to <= n bytes -skipcrccheck Whether to skip CRC checks between source and target paths. -strategy Copy strategy to use. Default is dividing work based on file sizes -tmp Intermediate work path to be used for atomic commit
-update
Update target, copying only missing files or directories
4. Repeat these steps on the other hosts in your database to ensure all of the hosts can run distcp.
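If you prefer not to log into each host manually, a small shell loop over SSH can run the same check everywhere. The host names below are hypothetical placeholders for your Vertica hosts; the loop only confirms that each host can find and start the hadoop command:
$ for host in vertica01 vertica02 vertica03; do
>     echo "== $host =="
>     ssh "$host" 'hadoop distcp 2>&1 | head -n 1'
> done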
Troubleshooting If you cannot run the distcp command, try the following steps:
- If Bash cannot find the hadoop command, you may need to manually add Hadoop's bin directory to the system search path. An alternative is to create a symbolic link in an existing directory in the search path (such as /usr/bin) to the hadoop binary (see the example after this list).
- Ensure the version of Java installed on your Vertica cluster is compatible with your Hadoop distribution.
- Review the Linux package installation tool's logs for errors. In some cases, packages may not be fully installed, or may not have been downloaded due to network issues.
- Ensure that the database administrator account has permission to execute the hadoop command. You may need to add the account to a specific group in order to allow it to run the necessary commands.
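For example, if the hadoop binary is installed outside the search path, you might create the symbolic link mentioned in the first item above. The source path /opt/hadoop/bin/hadoop is a hypothetical install location; substitute the path used by your distribution:
$ sudo ln -s /opt/hadoop/bin/hadoop /usr/bin/hadoop
$ which hadoop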
Configuring Hadoop and Vertica to Enable Backup of HDFS Storage The Vertica backup script uses HDFS's snapshotting feature to create a backup of HDFS storage locations. A directory must allow snapshotting before HDFS can take a snapshot. Only a Hadoop superuser can enable snapshotting on a directory. Vertica can enable snapshotting automatically if the database administrator is also a Hadoop superuser. If HDFS is unsecured, the following instructions apply to the database administrator account, usually dbadmin. If HDFS uses Kerberos security, the following instructions apply to the principal stored in the Vertica keytab file, usually vertica. The instructions below use the term "database account" to refer to this user. We recommend that you make the database administrator or principal a Hadoop superuser. If you are not able to do so, you must enable snapshotting on the directory before configuring it for use by Vertica. The steps you need to take to make the Vertica database administrator account a superuser depend on the distribution of Hadoop you are using. Consult your Hadoop
distribution's documentation for details. Instructions for two distributions are provided here.
Granting Superuser Status on Hortonworks 2.1 To make the database account a Hadoop superuser:
1. Log into your Hadoop cluster's Hortonworks Hue web user interface. If your Hortonworks cluster uses Ambari or you do not have a web-based user interface, see the Hortonworks documentation for information on granting privileges to users.
2. Click the User Admin icon.
3. In the Hue Users page, click the database account's username.
4. Click the Step 3: Advanced tab.
5. Select Superuser status.
Granting Superuser Status on Cloudera 5.1 Cloudera Hadoop treats Linux users that are members of the group named supergroup as superusers. Cloudera Manager does not automatically create this group. Cloudera also does not create a Linux user for each Hadoop user. To create a Linux account for the database account and make it a member of supergroup:
1. Log into your Hadoop cluster's NameNode as root.
2. Use the groupadd command to add a group named supergroup.
3. Cloudera does not automatically create a Linux user that corresponds to the database administrator's Hadoop account. If the Linux system does not have a user for your database account, you must create it. Use the adduser command to create this user.
4. Use the usermod command to add the database account to supergroup.
5. Verify that the database account is now a member of supergroup using the groups command.
6. Repeat steps 1 through 5 for any other NameNodes in your Hadoop cluster.
The following example demonstrates following these steps to grant the database administrator superuser status.
# adduser dbadmin
# groupadd supergroup
# usermod -a -G supergroup dbadmin
# groups dbadmin
dbadmin : dbadmin supergroup
Consult the documentation for the Linux distribution installed on your Hadoop cluster for more information on managing users and groups.
Manually Enabling Snapshotting for a Directory If you cannot grant superuser status to the database account, you can instead enable snapshotting of each directory manually. Use the following command: hdfs dfsadmin -allowSnapshot path
Issue this command for each node's directory under the storage location, and remember to do this each time you add a new node to your cluster. Nested snapshottable directories are not allowed, so you cannot enable snapshotting for a parent directory to automatically enable it for child directories. You must enable it for each individual directory.
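For example, for the storage location created earlier in this document under /user/dbadmin, you would enable snapshotting on each node's directory individually:
$ hdfs dfsadmin -allowSnapshot /user/dbadmin/v_vmart_node0001
$ hdfs dfsadmin -allowSnapshot /user/dbadmin/v_vmart_node0002
$ hdfs dfsadmin -allowSnapshot /user/dbadmin/v_vmart_node0003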
Additional Requirements for Kerberos If HDFS uses Kerberos, then in addition to granting the keytab principal access, you must set a Vertica configuration parameter. In Vertica, set the HadoopConfDir parameter to the location of the directory containing the core-site.xml, hdfs-site.xml, and yarn-site.xml configuration files: => ALTER DATABASE exampledb SET HadoopConfDir = '/hadoop';
All three configuration files must be present in this directory. If your Vertica nodes are not co-located on HDFS nodes, then you must copy these files from an HDFS node to each Vertica node. Use the same path on every database node, because HadoopConfDir is a global value.
Testing the Database Account's Ability to Make HDFS Directories Snapshottable After making the database account a Hadoop superuser, you should verify that the account can set directories snapshottable: 1. Log into the Hadoop cluster as the database account (dbadmin by default). 2. Determine a location in HDFS where the database administrator can create a
directory. The /tmp directory is usually available. Create a test HDFS directory using the command: hdfs dfs -mkdir /path/testdir
3. Make the test directory snapshottable using the command: hdfs dfsadmin -allowSnapshot /path/testdir
The following example demonstrates creating an HDFS directory and making it snapshottable: $ hdfs dfs -mkdir /tmp/snaptest $ hdfs dfsadmin -allowSnapshot /tmp/snaptest Allowing snaphot on /tmp/snaptest succeeded
Performing Backups Containing HDFS Storage Locations After you configure Hadoop and Vertica, HDFS storage locations are automatically backed up when you perform a full database backup. If you already have a backup configuration file for a full database backup, you do not need to make any changes to it. You just run the vbr backup script as usual to perform the full database backup. See Creating Full and Incremental Backups in the Administrator's Guide for instructions on running the vbr backup script. If you do not have a backup configuration file for a full database backup, you must create one to back up the data in your HDFS storage locations. See Creating vbr Configuration Files in the Administrator's Guide for more information.
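As a sketch, the invocation is the same as for any other full backup; the configuration file name here is a hypothetical placeholder:
$ vbr --task backup --config-file fullbackup.ini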
Removing HDFS Storage Locations The steps to remove an HDFS storage location are similar to those for standard storage locations:
1. Remove any existing data from the HDFS storage location.
2. Change the location's usage to TEMP.
3. Retire the location on each host that has the storage location defined by using RETIRE_LOCATION. You can use the enforce_storage_move parameter to make the change immediately, or wait for the Tuple Mover to perform its next moveout.
4. Drop the location on each host that has the storage location defined by using DROP_LOCATION.
5. Optionally, remove the snapshots and files from the HDFS directory for the storage location.
The following sections explain each of these steps in detail. Important: If you have backed up the data in the HDFS storage location you are removing, you must perform a full database backup after you remove the location. If you do not, and you restore the database to a backup made before you removed the location, the location's data is restored.
Removing Existing Data from an HDFS Storage Location You cannot drop a storage location that contains data or is used by any storage policy. You have several options to remove data and storage policies:
- Drop all of the objects (tables or schemas) that store data in the location. This is the simplest option. However, you can only use this method if you no longer need the data stored in the HDFS storage location.
- Change the storage policies of objects stored on HDFS to another storage location. When you alter the storage policy, you force all of the data in the HDFS location to move to the new location. This option requires that you have an alternate storage location available.
- Clear the storage policies of all objects that store data on the storage location. You then move the location's data through a process of retiring it.
The following sections explain the last two options in greater detail.
Moving Data to Another Storage Location You can move data off of an HDFS storage location by altering the storage policies of the objects that use the location. Use the SET_OBJECT_STORAGE_POLICY function to change each object's storage location. If you set this function's third argument to true, it moves the data off of the storage location before returning. The following example demonstrates moving the table named test from the hdfs2 storage location to another location named ssd. => SELECT node_name, projection_name, location_label, total_row_count
Once you have moved all of the data in the storage location, you are ready to proceed to the next step of removing the storage location.
Clearing Storage Policies Another option to move data off of a storage location is to clear the storage policy of each object storing data in the location. You clear an object's storage policy using the CLEAR_OBJECT_STORAGE_POLICY function. Once you clear the storage policy, the Tuple Mover eventually migrates the object's data from the storage location to the database's default storage location. The TM moves the data when it performs a move
storage operation. This operation runs infrequently at low priority. Therefore, it may be some time before the data migrates out of the storage location. You can speed up the data migration process by:
1. Calling the RETIRE_LOCATION function to retire the storage location on each host that defines it.
2. Calling the MOVE_RETIRED_LOCATION_DATA function to move the location's data to the database's default storage location.
3. Calling the RESTORE_LOCATION function to restore the location on each host that defines it. You must perform this step because you cannot drop retired storage locations.
The following example demonstrates clearing the object storage policy of a table stored on HDFS, then performing the steps to move the data off of the location. => SELECT * FROM storage_policies; schema_name | object_name | policy_details | location_label -------------+-------------+----------------+---------------public | test | Table | hdfs2 (1 row) => SELECT clear_object_storage_policy('test'); clear_object_storage_policy -------------------------------Object storage policy cleared. (1 row) => SELECT retire_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001', 'v_vmart_node0001'); retire_location --------------------------------------------------------------webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 retired. (1 row) => SELECT retire_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002', 'v_vmart_node0002'); retire_location --------------------------------------------------------------webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 retired. (1 row) => SELECT retire_location('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003', 'v_vmart_node0003'); retire_location --------------------------------------------------------------webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 retired. (1 row) => SELECT node_name, projection_name, location_label, total_row_count FROM V_MONITOR.STORAGE_CONTAINERS WHERE projection_name ILIKE 'test%'; node_name | projection_name | location_label | total_row_count ------------------+-----------------+----------------+-----------------
Changing the Usage of HDFS Storage Locations

You cannot drop a storage location that allows the storage of data files (ROS containers).
Before you can drop an HDFS storage location, you must change its usage from DATA to TEMP using the ALTER_LOCATION_USE function. Make this change on every host in the cluster that defines the storage location.

Important: HPE recommends that you do not use HDFS storage locations for temporary file storage. Only set HDFS storage locations to allow temporary file storage as part of the removal process.

The following example demonstrates using the ALTER_LOCATION_USE function to change the HDFS storage location to temporary file storage. The example calls the function three times: once for each node in the cluster that defines the location.

=> SELECT ALTER_LOCATION_USE('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001', 'v_vmart_node0001','TEMP');
                          ALTER_LOCATION_USE
---------------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 usage changed.
(1 row)

=> SELECT ALTER_LOCATION_USE('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002', 'v_vmart_node0002','TEMP');
                          ALTER_LOCATION_USE
---------------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 usage changed.
(1 row)

=> SELECT ALTER_LOCATION_USE('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003', 'v_vmart_node0003','TEMP');
                          ALTER_LOCATION_USE
---------------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003 usage changed.
(1 row)
Dropping an HDFS Storage Location

After removing all data and changing the data usage of an HDFS storage location, you can drop it. Use the DROP_LOCATION function to drop the storage location from each host that defines it. The following example demonstrates dropping an HDFS storage location from a three-node Vertica database.

=> SELECT DROP_LOCATION('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001', 'v_vmart_node0001');
                        DROP_LOCATION
---------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0001 dropped.
(1 row)

=> SELECT DROP_LOCATION('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002', 'v_vmart_node0002');
                        DROP_LOCATION
---------------------------------------------------------------
 webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0002 dropped.
(1 row)
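The call for the third node follows the same pattern; this is a sketch with the path and node name mirroring the calls above:

=> SELECT DROP_LOCATION('webhdfs://hadoop:50070/user/dbadmin/v_vmart_node0003', 'v_vmart_node0003');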
Removing Storage Location Files from HDFS

Dropping an HDFS storage location does not automatically clean the HDFS directory that stored the location's files. Any snapshots of the data files created when backing up the location are also not deleted. These files consume disk space on HDFS and also prevent the directory from being reused as an HDFS storage location. Vertica refuses to create a storage location in a directory that contains existing files or subdirectories. You must log into the Hadoop cluster to delete the files from HDFS. An alternative is to use some other HDFS file management tool.
Removing Backup Snapshots

HDFS returns an error if you attempt to remove a directory that has snapshots:

$ hdfs dfs -rm -r -f -skipTrash /user/dbadmin/v_vmart_node0001
rm: The directory /user/dbadmin/v_vmart_node0001 cannot be deleted since
/user/dbadmin/v_vmart_node0001 is snapshottable and already has snapshots
The Vertica backup script creates snapshots of HDFS storage locations as part of the backup process. See Backing Up HDFS Storage Locations for more information. If you made backups of your HDFS storage location, you must delete the snapshots before removing the directories.

HDFS stores snapshots in a subdirectory named .snapshot. You list the snapshots in the directory using the standard HDFS ls command. The following example demonstrates listing the snapshots defined for node0001.

$ hdfs dfs -ls /user/dbadmin/v_vmart_node0001/.snapshot
Found 1 items
drwxrwx--- dbadmin supergroup 0 2014-09-02 10:13 /user/dbadmin/v_vmart_node0001/.snapshot/s20140902-101358.629
To remove a snapshot, use the command:

hdfs dfs -deleteSnapshot directory snapshotname
The following example demonstrates the command to delete the snapshot shown in the previous example:
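For the snapshot listed above, the command looks like the following (the directory and snapshot name are taken from the earlier listing):

$ hdfs dfs -deleteSnapshot /user/dbadmin/v_vmart_node0001 s20140902-101358.629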
You must delete each snapshot from the directory for each host in the cluster. Once you have deleted the snapshots, you can delete the directories in the storage location.

Important: Each snapshot's name is based on a timestamp down to the millisecond. Nodes independently create their own snapshots. They do not synchronize snapshot creation, so their snapshot names differ. You must list each node's snapshot directory to learn the names of the snapshots it contains.

See Apache's HDFS Snapshot documentation for more information about managing and removing snapshots.
Removing the Storage Location Directories

You can remove the directories that held the storage location's data by either of the following methods:

- Use an HDFS file manager to delete the directories. See your Hadoop distribution's documentation to determine if it provides a file manager.
- Log into the Hadoop NameNode using the database administrator's account and use HDFS's rm -r command to delete the directories. See Apache's File System Shell Guide for more information.
The following example uses the HDFS rm -r command from the Linux command line to delete the directories left behind in the HDFS storage location directory /user/dbadmin. It uses the -skipTrash flag to force the immediate deletion of the files.

$ hdfs dfs -ls /user/dbadmin
Found 3 items
drwxrwx--- dbadmin supergroup
drwxrwx--- dbadmin supergroup
drwxrwx--- dbadmin supergroup
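A sketch of the deletion commands themselves, following the pattern shown earlier in this section; the directory names assume the three-node layout used throughout these examples:

$ hdfs dfs -rm -r -f -skipTrash /user/dbadmin/v_vmart_node0001
$ hdfs dfs -rm -r -f -skipTrash /user/dbadmin/v_vmart_node0002
$ hdfs dfs -rm -r -f -skipTrash /user/dbadmin/v_vmart_node0003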
Troubleshooting HDFS Storage Locations

This topic explains some common issues with HDFS storage locations.
HDFS Storage Disk Consumption

By default, HDFS makes three copies of each file it stores. This replication helps prevent data loss due to disk or system failure. It also helps increase performance by allowing several nodes to handle a request for a file.

A Vertica database with a K-Safety value of 1 or greater also stores its data redundantly using buddy projections. When a K-Safe Vertica database stores data in an HDFS storage location, its data redundancy is compounded by HDFS's redundancy: HDFS stores three copies of the primary projection's data, plus three copies of the buddy projection, for a total of six copies of the data.

If you want to reduce the amount of disk storage used by HDFS locations, you can alter the number of copies of data that HDFS stores. The Vertica configuration parameter named HadoopFSReplication controls the number of copies of data HDFS stores.

You can determine the current HDFS disk usage by logging into the Hadoop NameNode and issuing the command:

hdfs dfsadmin -report
This command prints the usage for the entire HDFS storage, followed by details for each node in the Hadoop cluster. The following example shows the beginning of the output from this command, including the total disk space and current DFS usage:

$ hdfs dfsadmin -report
Configured Capacity: 51495516981 (47.96 GB)
Present Capacity: 32087212032 (29.88 GB)
DFS Remaining: 31565144064 (29.40 GB)
DFS Used: 522067968 (497.88 MB)
DFS Used%: 1.63%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
. . .
After loading a simple million-row table into an HDFS storage location, the report shows greater disk usage:

Configured Capacity: 51495516981 (47.96 GB)
Present Capacity: 32085299338 (29.88 GB)
DFS Remaining: 31373565952 (29.22 GB)
DFS Used: 711733386 (678.76 MB)
DFS Used%: 2.22%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
. . .
The following Vertica example demonstrates:
1. Dropping the table in Vertica.
2. Setting the HadoopFSReplication configuration option to 1. This tells HDFS to store a single copy of an HDFS storage location's data.
3. Recreating the table and reloading its data.

=> DROP TABLE messages;
DROP TABLE
=> ALTER DATABASE mydb SET HadoopFSReplication = 1;
=> CREATE TABLE messages (id INTEGER, text VARCHAR);
CREATE TABLE
=> SELECT SET_OBJECT_STORAGE_POLICY('messages', 'hdfs');
  SET_OBJECT_STORAGE_POLICY
----------------------------
 Object storage policy set.
(1 row)
=> COPY messages FROM '/home/dbadmin/messages.txt' DIRECT;
 Rows Loaded
-------------
     1000000
Running the HDFS report on Hadoop now shows less disk space use:

$ hdfs dfsadmin -report
Configured Capacity: 51495516981 (47.96 GB)
Present Capacity: 32086278190 (29.88 GB)
DFS Remaining: 31500988416 (29.34 GB)
DFS Used: 585289774 (558.18 MB)
DFS Used%: 1.82%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
. . .
Caution: Reducing the number of copies of data stored by HDFS increases the risk of data loss. It can also negatively impact the performance of HDFS by reducing the number of nodes that can provide access to a file, which in turn can slow Vertica queries that involve data stored in an HDFS storage location.
Kerberos Authentication When Creating a Storage Location

If HDFS uses Kerberos authentication, then the CREATE LOCATION statement authenticates using the Vertica keytab principal, not the principal of the user performing the action.
If the creation fails with an authentication error, verify that you have followed the steps described in Configuring Kerberos to configure this principal.

When creating an HDFS storage location on a Hadoop cluster using Kerberos, CREATE LOCATION reports the principal being used, as in the following example:

=> CREATE LOCATION 'webhdfs://hadoop.example.com:50070/user/dbadmin' ALL NODES SHARED
   USAGE 'data' LABEL 'coldstorage';
NOTICE 0: Performing HDFS operations using kerberos principal [vertica/hadoop.example.com]
CREATE LOCATION
Backup or Restore Fails When Using Kerberos

When backing up an HDFS storage location that uses Kerberos, you might see an error such as:

createSnapshot: Failed on local exception: java.io.IOException:
java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hdfs/[email protected];
When restoring an HDFS storage location that uses Kerberos, you might see an error such as:

Error msg: Initialization thread logged exception: Distcp failure!
Either of these failures means that Vertica could not find the required configuration files in the HadoopConfDir directory. Usually this happens because you have set the parameter but have not copied the files from an HDFS node to your Vertica node. See "Additional Requirements for Kerberos" in Configuring Hadoop and Vertica to Enable Backup of HDFS Storage.
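As an illustration of the usual fix, the sketch below copies the Hadoop client configuration files onto a Vertica node. The hostname and paths are placeholders, not values from this guide; copy the files into whatever directory your HadoopConfDir parameter points to.

# Run on each Vertica node. Hostname and paths are examples only.
$ scp hadoop-node.example.com:/etc/hadoop/conf/core-site.xml /etc/hadoop/conf/
$ scp hadoop-node.example.com:/etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/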
Using the MapReduce Connector

The Vertica Connector for Hadoop MapReduce lets you create Hadoop MapReduce jobs that can read data from and write data to Vertica. You commonly use it when:

- You need to incorporate data from Vertica into your MapReduce job. For example, suppose you are using Hadoop's MapReduce to process web server logs. You may want to access sentiment analysis data stored in Vertica using Pulse to try to correlate a website visitor with social media activity.
- You are using Hadoop MapReduce to refine data on which you want to perform analytics. You can have your MapReduce job directly insert data into Vertica, where you can analyze it in real time using all of Vertica's features.
MapReduce Connector Features

The MapReduce Connector:

- gives Hadoop access to data stored in Vertica.
- lets Hadoop store its results in Vertica. The Connector can create a table for the Hadoop data if it does not already exist.
- lets applications written in Apache Pig access and store data in Vertica.
- works with Hadoop streaming.
The Connector runs on each node in the Hadoop cluster, so the Hadoop nodes and Vertica nodes communicate with each other directly. Direct connections allow data to be transferred in parallel, dramatically increasing processing speed. The Connector is written in Java, and is compatible with all platforms supported by Hadoop.

Note: To prevent Hadoop from potentially inserting multiple copies of data into Vertica, the Vertica Connector for Hadoop Map Reduce disables Hadoop's speculative execution feature.
Prerequisites

Before you can use the Vertica Connector for Hadoop MapReduce, you must install and configure Hadoop and be familiar with developing Hadoop applications.
For details on installing and using Hadoop, please see the Apache Hadoop Web site. See Vertica 7.2.x Supported Platforms for a list of the versions of Hadoop and Pig that the connector supports.
Hadoop and Vertica Cluster Scaling

When using the connector for MapReduce, nodes in the Hadoop cluster connect directly to Vertica nodes when retrieving or storing data. These direct connections allow the two clusters to transfer large volumes of data in parallel. If the Hadoop cluster is larger than the Vertica cluster, this parallel data transfer can negatively impact the performance of the Vertica database.

To avoid performance impacts on your Vertica database, ensure that your Hadoop cluster cannot overwhelm your Vertica cluster. The exact sizing of each cluster depends on how fast your Hadoop cluster generates data requests and the load placed on the Vertica database by queries from other sources. A good rule of thumb to follow is for your Hadoop cluster to be no larger than your Vertica cluster.
Installing the Connector

Follow these steps to install the MapReduce Connector.

If you have not already done so, download the Vertica Connector for Hadoop Map Reduce installation package from the myVertica portal. Be sure to download the package that is compatible with your version of Hadoop. You can find your Hadoop version by issuing the following command on a Hadoop node:

# hadoop version
You will also need a copy of the Vertica JDBC driver, which you can also download from the myVertica portal.

You need to perform the following steps on each node in your Hadoop cluster:

1. Copy the Vertica Connector for Hadoop Map Reduce .zip archive you downloaded to a temporary location on the Hadoop node.
2. Copy the Vertica JDBC driver .jar file to the same location on your node. If you haven't already, you can download this driver from the myVertica portal.
3. Unzip the connector .zip archive into a temporary directory. On Linux, you usually use the command unzip.
4. Locate the Hadoop home directory (the directory where Hadoop is installed).
   The location of this directory depends on how you installed Hadoop (manual install versus a package supplied by your Linux distribution or Cloudera). If you do not know the location of this directory, you can try the following steps:

   - See if the HADOOP_HOME environment variable is set by issuing the command echo $HADOOP_HOME on the command line.
   - See if Hadoop is in your path by typing hadoop classpath on the command line. If it is, this command lists the paths of all the jar files used by Hadoop, which should tell you the location of the Hadoop home directory.
   - If you installed using a .deb or .rpm package, you can look in /usr/lib/hadoop, as this is often the location where these packages install Hadoop.
5. Copy the file hadoop-vertica.jar from the directory where you unzipped the connector archive to the lib subdirectory in the Hadoop home directory.
6. Copy the Vertica JDBC driver file (vertica-jdbc-x.x.x.jar) to the lib subdirectory in the Hadoop home directory ($HADOOP_HOME/lib).
7. Edit the $HADOOP_HOME/conf/hadoop-env.sh file, and find the lines:

   # Extra Java CLASSPATH elements.  Optional.
   # export HADOOP_CLASSPATH=

   Uncomment the export line by removing the hash character (#) and add the absolute path of the JDBC driver file you copied in the previous step. For example:

   export HADOOP_CLASSPATH=$HADOOP_HOME/lib/vertica-jdbc-x.x.x.jar

   This environment variable ensures that Hadoop can find the Vertica JDBC driver.
8. Also in the $HADOOP_HOME/conf/hadoop-env.sh file, ensure that the JAVA_HOME environment variable is set to your Java installation.
9. If you want your application written in Pig to be able to access Vertica, you need to:
   a. Locate the Pig home directory. Often, this directory is in the same parent directory as the Hadoop home directory.
   b. Copy the file named pig-vertica.jar from the directory where you unpacked the connector .zip file to the lib subdirectory in the Pig home directory.
   c. Copy the Vertica JDBC driver file (vertica-jdbc-x.x.x.jar) to the lib subdirectory in the Pig home directory.
Accessing Vertica Data From Hadoop

You need to follow three steps to have Hadoop fetch data from Vertica:

- Set the Hadoop job's input format to be VerticaInputFormat.
- Give the VerticaInputFormat class a query to be used to extract data from Vertica.
- Create a Mapper class that accepts VerticaRecord objects as input.
The following sections explain each of these steps in greater detail.
Selecting VerticaInputFormat

The first step to reading Vertica data from within a Hadoop job is to set its input format. You usually set the input format within the run() method in your Hadoop application's class. To set up the input format, pass the job.setInputFormatClass method the VerticaInputFormat.class, as follows:

public int run(String[] args) throws Exception {
    // Set up the configuration and job objects
    Configuration conf = getConf();
    Job job = new Job(conf);

    // (later in the code)

    // Set the input format to retrieve data from Vertica.
    job.setInputFormatClass(VerticaInputFormat.class);
Setting the input to the VerticaInputFormat class means that the map method will get VerticaRecord objects as its input.
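For illustration, a minimal sketch of such a Mapper class follows. Only the VerticaRecord input value type comes from this section; the class name, the LongWritable input key type, the output key/value types, and the import for VerticaRecord (whose package name may differ by connector version) are assumptions, not part of the guide's own example.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumed import: VerticaRecord ships in the connector jar, and its
// package name may vary with the connector version.
import com.vertica.hadoop.VerticaRecord;

// Illustrative Mapper showing the signature the connector supplies input to.
// Each call to map() receives one row of the query results as a VerticaRecord.
public class ExampleMapper
        extends Mapper<LongWritable, VerticaRecord, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, VerticaRecord value, Context context)
            throws IOException, InterruptedException {
        // Emit a placeholder key/count pair; a real job would read columns
        // from the VerticaRecord and emit something meaningful.
        context.write(new Text("row"), new IntWritable(1));
    }
}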
Setting the Query to Retrieve Data From Vertica

A Hadoop job that reads data from your Vertica database has to execute a query that selects its input data. You pass this query to your Hadoop application using the setInput method of the VerticaInputFormat class. The Vertica Connector for Hadoop Map Reduce sends this query to the Hadoop nodes, which then individually connect to Vertica nodes to run the query and get their input data.

A primary consideration for this query is how it segments the data being retrieved from Vertica. Since each node in the Hadoop cluster needs data to process, the query result needs to be segmented between the nodes.

There are three formats you can use for the query you want your Hadoop job to use when retrieving input data. Each format determines how the query's results are split across the Hadoop cluster. These formats are:

- A simple, self-contained query.
- A parameterized query along with explicit parameters.
- A parameterized query along with a second query that retrieves the parameter values for the first query from Vertica.
The following sections explain each of these methods in detail.
Using a Simple Query to Extract Data From Vertica

The simplest format for the query that Hadoop uses to extract data from Vertica is a self-contained, hard-coded query. You pass this query in a String to the setInput method of the VerticaInputFormat class. You usually make this call in the run method of your Hadoop job class. For example, the following code retrieves the entire contents of the table named allTypes.

// Sets the query to use to get the data from the Vertica database.
// Simple query with no parameters
VerticaInputFormat.setInput(job, "SELECT * FROM allTypes ORDER BY key;");
The query you supply must have an ORDER BY clause, since the Vertica Connector for Hadoop Map Reduce uses it to figure out how to segment the query results between the Hadoop nodes. When it gets a simple query, the connector calculates limits and offsets to be sent to each node in the Hadoop cluster, so they each retrieve a portion of the query results to process.
Having Hadoop use a simple query to retrieve data from Vertica is the least efficient method, since the connector needs to perform extra processing to determine how the data should be segmented across the Hadoop nodes.
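Conceptually, the segmentation described above means each Hadoop node ends up processing its own slice of the ordered result set. The following is only an illustration of that limit/offset idea, not the literal SQL the connector generates; the LIMIT and OFFSET values are placeholders:

-- Node 0 might effectively process a slice such as:
SELECT * FROM allTypes ORDER BY key LIMIT 25000 OFFSET 0;
-- Node 1 the next slice:
SELECT * FROM allTypes ORDER BY key LIMIT 25000 OFFSET 25000;
-- ...and so on for the remaining Hadoop nodes.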
Using a Parameterized Query and Parameter Lists

You can have Hadoop retrieve data from Vertica using a parameterized query, to which you supply a set of parameters. The parameters in the query are represented by a question mark (?). You pass the query and the parameters to the setInput method of the VerticaInputFormat class. You have two options for passing the parameters: using a discrete list, or using a Collection object.
Using a Discrete List of Values

To pass a discrete list of parameters for the query, you include them in the setInput method call in a comma-separated list of string values, as shown in the next example:

// Simple query with supplied parameters
VerticaInputFormat.setInput(job, "SELECT * FROM allTypes WHERE key = ?",
        "1001", "1002", "1003");
The Vertica Connector for Hadoop Map Reduce tries to evenly distribute the query and parameters among the nodes in the Hadoop cluster. If the number of parameters is not a multiple of the number of nodes in the cluster, some nodes will get more parameters to process than others. Once the connector divides up the parameters among the Hadoop nodes, each node connects to a host in the Vertica database and executes the query, substituting in the parameter values it received.

This format is useful when you have a discrete set of parameters that will not change over time. However, it is inflexible because any change to the parameter list requires you to recompile your Hadoop job. An added limitation is that the query can contain just a single parameter, because the setInput method only accepts a single parameter list. The more flexible way to use parameterized queries is to use a collection to contain the parameters.
Using a Collection Object

The more flexible method of supplying the parameters for the query is to store them in a Collection object, then include the object in the setInput method call. This method allows you to build the list of parameters at run time, rather than having them hard-coded. You can also use multiple parameters in the query, because you pass a collection of ArrayList objects to the setInput method. Each ArrayList object supplies one set of parameter values, and can contain values for each parameter in the query.
The following example demonstrates using a collection to pass the parameter values for a query containing two parameters. The collection object passed to setInput is an instance of the HashSet class. This object contains four ArrayList objects added within the for loop. This example just adds dummy values (the loop counter and the string "FOUR"). In your own application, you usually calculate parameter values in some manner before adding them to the collection.

Note: If your parameter values are stored in Vertica, you should specify the parameters using a query instead of a collection. See Using a Query to Retrieve Parameters for a Parameterized Query for details.

// Collection to hold all of the sets of parameters for the query.
Collection<List<Object>> params = new HashSet<List<Object>>() {
};
// Each set of parameters lives in an ArrayList. Each entry
// in the list supplies a value for a single parameter in
// the query. Here, ArrayList objects are created in a loop
// that adds the loop counter and a static string as the
// parameters. The ArrayList is then added to the collection.
for (int i = 0; i < 4; i++) {
    ArrayList