Hive Feature Store Python Example

Data Exploration and Reporting with PyHive

In this example we show how to explore data in Hive and build reports.

The example uses a dataset of real estate sales in the Sacramento area, which you can download from here. Create a dataset with a name of your choosing, for example RawData and upload the CSV file. Make sure the dataset is empty, you will need to delete the auto-generated README.md

Warning

This notebook will download three files, including a private key, that provide write access to the Hive database. Do not check these files into Git (or any other source code control system). If you have auto-push on shutdown enabled, it will add these files to git.

First we need to setup the IPython magic and the connection information by running the following cell:

%reload_ext sql
from hops import hive
hive.setup_hive_connection()

Load Data into an External table

Let’s first create a Hive external and let’s point the table to the just created dataset. In the query below you should replace [Projectname] with your project name.

The %%sql magic allows you to write HiveQL and execute the query against the HiveServer.

%%sql
CREATE EXTERNAL TABLE sacramento_properties_ext(
street string,
city string,
zip int,
state string,
beds int,
baths int,
sq__ft float,
sales_type string,
sale_date string,
price float,
latitude float,
longitude float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/Projects/[Projectname]/RawData'

To make sure the table was created successfully, we can list all the tables in the project’s database and run a simple query against the external table.

%%sql
show tables
Done.
tab_name
sacramento_properties_ext
%%sql
select * from sacramento_properties_ext limit 10
Done.
street city zip state beds baths sq__ft sales_type sale_date price latitude longitude
3526 HIGH ST SACRAMENTO 95838 CA 2 1 836.0 Residential Wed May 21 00:00:00 EDT 2008 59222.0 38.631912 -121.434875
51 OMAHA CT SACRAMENTO 95823 CA 3 1 1167.0 Residential Wed May 21 00:00:00 EDT 2008 68212.0 38.4789 -121.43103
2796 BRANCH ST SACRAMENTO 95815 CA 2 1 796.0 Residential Wed May 21 00:00:00 EDT 2008 68880.0 38.618305 -121.44384
2805 JANETTE WAY SACRAMENTO 95815 CA 2 1 852.0 Residential Wed May 21 00:00:00 EDT 2008 69307.0 38.616837 -121.43915
6001 MCMAHON DR SACRAMENTO 95824 CA 2 1 797.0 Residential Wed May 21 00:00:00 EDT 2008 81900.0 38.51947 -121.43577
5828 PEPPERMILL CT SACRAMENTO 95841 CA 3 1 1122.0 Condo Wed May 21 00:00:00 EDT 2008 89921.0 38.662594 -121.32781
6048 OGDEN NASH WAY SACRAMENTO 95842 CA 3 2 1104.0 Residential Wed May 21 00:00:00 EDT 2008 90895.0 38.68166 -121.35171
2561 19TH AVE SACRAMENTO 95820 CA 3 1 1177.0 Residential Wed May 21 00:00:00 EDT 2008 91002.0 38.53509 -121.48137
11150 TRINITY RIVER DR Unit 114 RANCHO CORDOVA 95670 CA 2 2 941.0 Condo Wed May 21 00:00:00 EDT 2008 94905.0 38.62119 -121.27055
7325 10TH ST RIO LINDA 95673 CA 3 2 1146.0 Residential Wed May 21 00:00:00 EDT 2008 98937.0 38.70091 -121.44298

Load data into a Managed Table

In this section we are going to load the data in a managed table. The table is going to be partitioned by ZIP and stored in ORC (a columnar storage format).

%%sql
CREATE TABLE sacramento_properties(
street string,
city string,
state string,
beds int,
baths int,
sq__ft float,
sales_type string,
sale_date string,
price float,
latitude float,
longitude float)
PARTITIONED by (zip int)
STORED AS ORC
%%sql
show tables
Done.
tab_name
sacramento_properties
sacramento_properties_ext

By default we don’t allow dynamic inserts. This means that we should specify, for each row, which partition it belongs to. This is meant to avoid having a wrong insert query create a huge number of directories.

In this case, however, the dataset is small and we know what we are doing. In the next cell we configure the hive session to allow nonstrict dymanic insert.

%%sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Now we load the data into the managed table. Partition column(s) should be listed last in the SELECT statement.

%%sql
INSERT OVERWRITE TABLE sacramento_properties PARTITION (zip)
SELECT street, city, state, beds, baths, sq__ft, sales_type, sale_date, price, latitude, longitude, zip 
FROM sacramento_properties_ext
%%sql
select * 
FROM sacramento_properties 
WHERE zip=95608
LIMIT 10
Done.
street city state beds baths sq__ft sales_type sale_date price latitude longitude zip
5332 SANDSTONE ST CARMICHAEL CA 3 1 1152.0 Residential Wed May 21 00:00:00 EDT 2008 181872.0 38.662106 -121.31394 95608
5907 ELLERSLEE DR CARMICHAEL CA 3 1 936.0 Residential Wed May 21 00:00:00 EDT 2008 200000.0 38.664467 -121.32683 95608
4010 ALEX LN CARMICHAEL CA 2 2 1326.0 Condo Wed May 21 00:00:00 EDT 2008 250134.0 38.637028 -121.312965 95608
5925 MALEVILLE AVE CARMICHAEL CA 4 2 1120.0 Residential Tue May 20 00:00:00 EDT 2008 189000.0 38.666565 -121.325714 95608
2109 HAMLET PL CARMICHAEL CA 2 2 1598.0 Residential Tue May 20 00:00:00 EDT 2008 484000.0 38.602753 -121.32932 95608
5709 RIVER OAK WAY CARMICHAEL CA 4 2 2222.0 Residential Tue May 20 00:00:00 EDT 2008 582000.0 38.602463 -121.33098 95608
7032 FAIR OAKS BLVD CARMICHAEL CA 3 2 1245.0 Condo Mon May 19 00:00:00 EDT 2008 139500.0 38.628563 -121.3283 95608
7110 STELLA LN Unit 15 CARMICHAEL CA 2 2 1000.0 Condo Mon May 19 00:00:00 EDT 2008 182000.0 38.637398 -121.30006 95608
5847 DEL CAMPO LN CARMICHAEL CA 3 1 1713.0 Residential Mon May 19 00:00:00 EDT 2008 266000.0 38.671993 -121.32434 95608
4622 MEYER WAY CARMICHAEL CA 4 2 1559.0 Residential Mon May 19 00:00:00 EDT 2008 285000.0 38.64913 -121.31067 95608
%%sql
select sales_type, avg(price) as avg_price
FROM sacramento_properties 
WHERE zip=95608
GROUP BY sales_type
LIMIT 10
Done.
sales_type avg_price
Condo 190544.66666666666
Residential 314238.8823529412

Visualize the data

The query result can also be stored in a Python dictionary. This enables us to choose from a plethora of Python libraries available online, to visualize data and build BI reports

In the example below we use the folium library to mark where the Condos in our dataset are located in Sacramento. For the next section to work, if you haven’t already, please install the folium package in your project environment and restart the kernel. (Also, please re-run the first cell)

(Internet access is required to visualize the map)

condos = %sql select * from sacramento_properties where `sales_type` = 'Condo'
Done.
import folium
m = folium.Map(
    location=[38.5815700, -121.4944000],
    zoom_start=10
)

for condo in condos:
    folium.Marker([float(condo[9]), float(condo[10])], popup=str(condo[8])).add_to(m)

m