Script: Generate table size/row count on every table
To get started, you will need to install PyODBC and configure your ODBC settings to be able to connect to your Vertica cluster; I won’t cover that part in this post. Once you have confirmed you can access your Vertica cluster, put the following code into a new file:
This script simply takes every table from the catalog and, for each of them, runs a count(*) and a system AUDIT, which output the row count and the table size respectively.
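A minimal version of such a script might look like the following sketch, assuming an ODBC DSN named VerticaDSN and the tables_infos table created below (adapt both to your own setup):

```python
def build_queries(schema, table):
    """Build the row-count and size-audit queries for one table."""
    count_sql = 'SELECT COUNT(*) FROM "%s"."%s";' % (schema, table)
    audit_sql = "SELECT AUDIT('%s.%s');" % (schema, table)
    return count_sql, audit_sql

def collect(conn):
    """For every table in the catalog, record its row count and audited
    size into tables_infos."""
    cur = conn.cursor()
    cur.execute("SELECT table_schema, table_name FROM v_catalog.tables;")
    for schema, table in cur.fetchall():
        count_sql, audit_sql = build_queries(schema, table)
        rows_count = conn.cursor().execute(count_sql).fetchone()[0]
        size_bytes = conn.cursor().execute(audit_sql).fetchone()[0]
        conn.cursor().execute(
            "INSERT INTO tables_infos (dt, table_name, rows_count, size_bytes)"
            " VALUES (CURRENT_DATE, ?, ?, ?);",
            (schema + "." + table, rows_count, size_bytes),
        )
    conn.commit()

# Usage (requires pyodbc and a reachable cluster):
#   import pyodbc
#   collect(pyodbc.connect("DSN=VerticaDSN"))
```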
Before you run the script, make sure to create the following table that will contain the tables’ data:
CREATE TABLE tables_infos (
    dt DATE,
    table_name VARCHAR,
    rows_count INT,
    size_bytes INT,
    record_insertion_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
I personally run the script every morning using a simple cron job, but depending on how much data you have, it might not be the best idea :)
After a few days you should have accumulated enough data to run some analysis, so here are two queries to get you started:
- Identifying the biggest tables:
SELECT table_name, size_bytes/1024/1024 as size_mb FROM tables_infos WHERE dt = date(now()) ORDER BY 2 DESC;
- Growth over time:
SELECT a1.table_name, a1.dt, a2.dt,
       (a2.size_bytes - a1.size_bytes)/1024/1024 AS diff_size_mb,
       (a2.rows_count - a1.rows_count) AS diff_count
FROM tables_infos a1
JOIN tables_infos a2 ON (a1.table_name INTERPOLATE PREVIOUS VALUE a2.table_name)
WHERE a1.dt = a2.dt - 1
ORDER BY a1.table_name, a1.dt, a2.dt;
Don’t hesitate to play around and submit your favorite queries!
Even though Vertica likes to call itself the “Real Time Analytics Platform”, you will soon realize that where Vertica excels is a totally different world.
Vertica is amazing at working with bulk amounts of data, loading tens of thousands of rows a second and querying your biggest tables without flinching. Vertica is not the best tool for processing your data live, simply because it’s not a conventional relational database, and it’s not trying to be one. I would recommend what I call “near real time”: loading data every few minutes, which guarantees that Vertica loads at its best.
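The “near real time” pattern can be as simple as buffering incoming rows and flushing them in bulk on a timer. Here is a minimal sketch; the flush function is a placeholder (in practice it would run a bulk COPY or multi-row INSERT into Vertica), and the interval is an arbitrary example:

```python
import time

class MicroBatchLoader:
    """Buffer incoming rows and flush them in bulk every `interval`
    seconds, instead of inserting each row as it arrives."""

    def __init__(self, flush_fn, interval=300):
        self.flush_fn = flush_fn   # called with the list of buffered rows
        self.interval = interval   # seconds between automatic flushes
        self.buffer = []
        self.last_flush = time.time()

    def add(self, row):
        self.buffer.append(row)
        if time.time() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
        self.last_flush = time.time()

# Demo with an in-memory "load": rows accumulate, then go out as one batch.
batches = []
loader = MicroBatchLoader(batches.append, interval=3600)
loader.add(("AAPL", 135.0))
loader.add(("GOOG", 396.0))
loader.flush()
```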
In the linked blog post, our friends at Funzio try their luck at loading data into Vertica through Flume, a log collector that can also perform ETL tasks on your data. It’s a very interesting post, I learned a lot by reading it, and here are the two points that I would also like to emphasize:
- Primary keys are not enforced in Vertica. This is a choice they made, and there’s no easy way around it other than performing the uniqueness check in your ETL. Primary keys are used for something totally different that I may detail in a later post.
- Log collectors can be tricky: I’ve heard more than one weird story about data not being received on the other end or, as in this case, being sent more than once. Be careful, especially with projects that are in beta or incubating.
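Since Vertica will happily load duplicate primary keys, a small deduplication step in the ETL covers both points at once. Here is a minimal sketch; the position of the key in each row is an assumption about your row layout:

```python
def dedupe(rows, key_index=0):
    """Keep only the first occurrence of each key. Vertica will not reject
    duplicate primary keys, so this kind of check belongs in the ETL layer."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows

# A log collector that re-sent an event produces a duplicate key (id=1):
events = [(1, "signup"), (2, "login"), (1, "signup")]
clean = dedupe(events)
```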
First: Learn R; Then: Use it with Vertica
I have been trying to learn R for a little while, but somehow I always failed at it.
Getting to know and being experienced with R and other such statistical tools is more the job of an analyst or data scientist. As a data engineer myself, however, I figured it would be a great addition to my skills and would help me communicate with my colleagues. Plus, learning a new language and playing around with its cool features is always a lot of fun.
So here we are with probably the best online course I have ever taken: “Try R” is a good alternative to the classic books and other tutorials, as it follows a great path and is entirely interactive. Everything is guided with precision, and you get to type and execute each small exercise in a fully functional R shell.
Once you’re done with this course and are ready to practice some more, connect R to Vertica using either UDFs or by connecting RStudio to Vertica.
How to install Vertica on Amazon Linux AMI?
Vertica can only be installed on a handful of Linux distributions. As a result, if you ever tried to install it on Ubuntu or Amazon Linux, for example, you probably faced an error when the compatibility test is executed:
ERROR: Your operating system is not supported.
The trick is easy: just enter this command in your terminal, and the next installation will pass the compatibility test:
echo "CentOS release 5.5 (Final)" > /etc/redhat-release
Tested with Amazon Linux version 2012.09 and Vertica Database Server version 6.1.0. Compatible with previous versions as well.
Why did we choose Vertica as our data warehouse platform?
As you probably already know, Vertica is a column-based store, but that’s not its only strength. Let me go through each one individually:
Column-oriented database: In a traditional row-oriented RDBMS, the entire row has to be read to process a query, even if you only need one column. Vertica stores the values of each column separately, so it only reads the columns being requested. As you can see in the following diagram, only three columns are read by Vertica, whereas the entire row would have to be read in a row store. Imagine a structure with hundreds of columns per table, and this advantage is at its greatest!
Example query:
SELECT avg(price) FROM ticketstore WHERE symbol = 'AAPL' AND date = '5/06/09';