Introduction
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.
In this article, we will use C# to query Hive step by step.
Prerequisite
Before starting the specific content, we need to have a Hive server that can be connected.
Next, I will use version 2.1.1-cdh6.2.0 as a demonstration.
Using beeline
to test the connection and make a simple query.
beeline -u "jdbc:hive2://yourip:10000" -n yourname -p yourpwd
use ods;
select id, bucket from ods_xxxx_oss where dt = '2022-01-21' limit 10;
The following screenshot is the result of performing the above operations.
How can we query the data from Hive with C# to get the above result?
We can use Open Database Connectivity(ODBC) to complete the above query.
For more information about ODBC, take a look at the following posts.
In next section, we will write some code to do it.
Code
Create a console application and install the following packages.
<ItemGroup>
<PackageReference Include="Dapper" Version="2.0.123" />
<PackageReference Include="System.Data.Odbc" Version="6.0.0" />
</ItemGroup>
The most important package is System.Data.Odbc, Dapper is used to simplify SQL operations.
The following code demonstrates how to connect to hive and query data from hive:
public static class HiveHelper {
// dsn
// private static string _dsn = "DSN=hivedsn;Schema=ods;UID=xxx;PWD=xxx;";
// dsn-less
private static string _dsn = "Driver=Hive;Host=xxx.xxx.xxx.xxx;Port=10000;Schema=ods;UID=xxx;PWD=xxx;";
public static List < QueryResult > QueryRawData() {
var sql = $ " select id, bucket from ods_xxx_oss where dt = '2022-01-21' limit 10; ";
try {
using
var conn = new OdbcConnection(_dsn);
conn.Open();
var res = conn.Query < QueryResult > (sql, commandTimeout: 600000000);
return res.ToList();
} catch (Exception ex) {
throw;
}
}
}
public class QueryResult {
[Column("ods_xxx_oss.id")]
public string ? Id {
get;
set;
}
[Column("ods_xxx_oss.bucket")]
public string ? Bucket {
get;
set;
}
}
Here are some things to note,
- There are two ways for connection string, DSN or DSN-less
- Create database connection with OdbcConnection
- Column name mapping should contain the table name
Add CustomPropertyTypeMap and print the query result.
SqlMapper.SetTypeMap(typeof(QueryResult), new CustomPropertyTypeMap(
typeof(QueryResult), (type, columnName) => type.GetProperties().FirstOrDefault(prop =>
prop.GetCustomAttributes(false).OfType<ColumnAttribute>().Any(attr => attr.Name == columnName))));
var data = HiveHelper.QueryRawData();
foreach (var item in data)
{
Console.WriteLine(item.Id + "\t\t" + item.Bucket);
}
As you can see, the code level is almost the same as the SQL code you usually write, the difference is in deployment.
Deployment
We will use docker to deploy this sample.
What we need are as follows:
- unixodbc
- hive odbc driver
So, we will install these two components and their dependencies.
This is a simple and non optimized dockerfile for development, but it is not recommended for production.
FROM mcr.microsoft.com/dotnet/runtime:6.0-bullseye-slim AS base
# install the dependencies
RUN sed -i s@/deb.debian.org/@/mirrors.aliyun.com/@g /etc/apt/sources.list\
&& apt-get update \
&& apt-get install -y --no-install-recommends \
sasl2-bin \
libsasl2-modules \
unixodbc \
alien \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# install cloudera hive odbc
COPY ./ClouderaHiveODBC-2.6.11.1011-1.x86_64.rpm /tmp/ClouderaHiveODBC-2.6.11.1011-1.x86_64.rpm
# convert rpm to deb
RUN alien /tmp/ClouderaHiveODBC-2.6.11.1011-1.x86_64.rpm
RUN dpkg -i ./clouderahiveodbc_2.6.11.1011-2_amd64.deb && rm -rf /tmp/ClouderaHiveODBC-2.6.11.1011-1.x86_64.rpm ./clouderahiveodbc_2.6.11.1011-2_amd64.deb
# odbc configuration
COPY ./odbc.ini /etc/odbc.ini
COPY ./odbcinst.ini /etc/odbcinst.ini
FROM mcr.microsoft.com/dotnet/sdk:6.0-bullseye-slim AS build
WORKDIR /src
COPY . .
RUN dotnet restore "HiveDemo.sln"
WORKDIR /src/HiveDemo
RUN dotnet build "HiveDemo.csproj" -c Release -o /app/build
FROM build AS publish
RUN dotnet publish "HiveDemo.csproj" -c Release -o /app/publish
FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "HiveDemo.dll"]
There are two important ODBC configuration files, odbcinst.ini
and odbc.ini
.
Add hive ODBC driver information to odbcinst.ini
[Hive]
Description = ODBC for Hive Cloudera
Driver = /opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
HiveServerType=2
Hive is the driver name.
HiveServerType specify the version of hive server.
Add DSN information to odbc.ini
[hivedsn]
Driver = Hive
Description = Cloudera Hive ODBC Driver DSN
Host = xxx.xxx.xxx.xxx
Port = 10000
Driver = Hive
specify the driver is the above hive driver.
You also can set more configuration in odbc.ini
, such as UID and password, etc.
Here is the directory structure of our example
Build and run this example.
docker build -t hivedemo:v1 .
docker run --rm hivedemo:v1
We get the same result as the previous query through beeline
.
Here is the source code you can find on my Github page.
Summary
This article introduces the combination of C# and ODBC to query data from Hive, and similarly, we can also insert, update and delete operations, as long as it complies with Hive SQL.
I hope this will help you!
Reference