Script: Generate table size/row count on every table

To get started, you will need to install PyODBC, and configure your ODBC settings to be able to connect to your Vertica cluster. I won’t cover that part in that post. As soon as you confirmed you can access your Vertica cluster, get the following code into a new file:

This script will simply take all your tables from the catalog, and for each of them run a count(*) and a system audit which will respectively output the row count and the table size.

Before you run the script, make sure to create the following table that will contain the tables’ data:

CREATE TABLE tables_infos (dt DATE, table_name VARCHAR, rows_count INT, size_bytes INT, record_insertion_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP);

I personally run the script every morning using a simple cron, but depending on how much data you have it might not be the best idea :)

After a few days you should have accumulated enough data to run some analysis, so here are two queries to get you started:

- Identifying the biggest tables:

SELECT table_name, size_bytes/1024/1024 as size_mb FROM tables_infos WHERE dt = date(now()) ORDER BY 2 DESC;

- Growth over time:

SELECT a1.table_name, a1.dt, a2.dt, (a2.size_bytes - a1.size_bytes)/1024/1024 as diff_size_mb, (a2.rows_count - a1.rows_count) as diff_count FROM tables_infos a1 JOIN tables_infos a2 ON (a1.table_name INTERPOLATE PREVIOUS VALUE a2.table_name) WHERE a1.dt = a2.dt - 1 ORDER BY a1.table_name, a1.dt, a2.dt;

Don’t hesitate to play around and submit your favorite queries!

First: Learn R; Then: Use it with Vertica

I have been trying to get to learn R for a little while, but I somehow always failed at it.

Getting to know and being experienced with R and other such statistic tools is more the job of an analyst or data scientist. As an data engineer myself, however, I figured it would be a great addition to my skills, and would help me communicate with my colleagues. Plus, learning a new language and playing around with its cool features is always a lot of fun.

So here we are with probably the best online course I have attended: “Try R” is a good alternative to the classic books and other tutorials, as it’s following a great path and is entirely dynamic. Everything is guided with precision, and you get to type and execute each small exercise into a fully functional R shell.

Once you’re done with this course and are ready to practice some more, connect R to Vertica using either UDFs or by connecting RStudio to Vertica.

How to: Install vsql on Mac OS X

  1. Download client sources for OS X: vertica-client-6.1.0-0.mac.tar.gz
  2. Untar the downloaded file, either by double-clicking the file, or using the `tar` command
  3. Move the opt/ directory and all its sub-directories where you want it to be, or leave it in Downloads/ if you wish
  4. Add the bin/ directory to your PATH environment variable. To do so, add a new PATH entry in your .bash_profile by copying this command in your Terminal: 

    echo "export PATH=`mdfind -name vsql | head -1`:${PATH}'" >> ~/.bashrc

  5. Add an alias for easy access; enter the following command in your Terminal:

    echo "alias vsqldba='vsql -h node01.mydomain.com -U myuser -w mypassword'" >> ~/.bashrc

  6. Open a new Terminal tab or window, allowing the startup files to be reloaded, then try your newly created alias `vsqldba`.

All set!

How to install Vertica on Amazon Linux AMI?

Vertica can only be installed on a handful of Linux distributions. As a result, if you ever tried to install it on Ubuntu or Amazon Linux for example, you probably faced an error while a compatibility test is executed, as follows:

ERROR: Your operating system is not supported.

The trick is easy, you just have enter this command in your terminal and the next installation will overpass the compatibility test:

echo "CentOS release 5.5 (Final)" > /etc/redhat-release

Tested with Amazon Linux version 2012.09 and Vertica Database Server version 6.1.0. Compatible with previous versions as well.

Why did we choose Vertica as our data warehouse platform ?

As you probably already know, Vertica is a column-based storage, but that’s not its only strength. Let me go through each individually :

  • Column-oriented database : In a traditional row-oriented RDBMS, in order to process a query, the entire row has to be read even if you only need one column. Vertica stores the values of each column in a way that it only needs to read the columns being requested. As you can see in the following schema, only three columns are read by Vertica, whereas it would require the entire row to be read in a row store. Imagine a structure with hundreds of columns in each table, and this system is at its utmost!

Example query :

SELECT avg(price) FROM ticketstore WHERE symbol = ‘AAPL’ AND date = ‘5/06/09’;

image

Read More