Difference between revisions of "Talk:Data Warehouse"

From HiveTool
Jump to: navigation, search
(Start for summary tables)
Line 50: Line 50:
 
== Start for summary tables ==
 
== Start for summary tables ==
  
Copied HIVE_DATA to S_HIVE_DATA
+
Copied HIVE_DATA to S_HIVE_DATA ( S from Summary)
 +
 
 +
----
  
 
'''Date fields'''
 
'''Date fields'''
Line 87: Line 89:
  
 
  If you want to base the "timestamp" instead of "hive_observation_time_local"
 
  If you want to base the "timestamp" instead of "hive_observation_time_local"
  ''# cat FillDateFields.sql  
+
  ''# cat FillDateFields.sql''
  use centesb7_hivetool_converted;
+
  ''use centesb7_hivetool_converted;''
 
   
 
   
  update S_HIVE_DATA set  
+
  ''update S_HIVE_DATA set''
  hour_of_day=DATE_FORMAT(timestamp,'%H'),
+
  ''hour_of_day=DATE_FORMAT(timestamp,'%H'),''
  day_of_month=DAYOFMONTH(timestamp),
+
  ''day_of_month=DAYOFMONTH(timestamp),''
  day_of_year=DAYOFYEAR(timestamp),
+
  ''day_of_year=DAYOFYEAR(timestamp),''
  month=MONTH(timestamp),
+
  ''month=MONTH(timestamp),''
  month_name=lower(MONTHNAME(timestamp)),
+
  ''month_name=lower(MONTHNAME(timestamp)),''
  quarter=QUARTER(timestamp),
+
  ''quarter=QUARTER(timestamp),''
  year=YEAR(timestamp),
+
  ''year=YEAR(timestamp),''
  month_year=lower(DATE_FORMAT(timestamp,'%M-%Y'))
+
  ''month_year=lower(DATE_FORMAT(timestamp,'%M-%Y'))''
  where timestamp != 0;
+
  ''where timestamp != 0;''
''
+
 
 
Some easy example queries:
 
Some easy example queries:
 
  - select year,month_name,hive_id,avg(hive_temp_c) from S_HIVE_DATA where timestamp != 0 group by year,month_name,hive_id;
 
  - select year,month_name,hive_id,avg(hive_temp_c) from S_HIVE_DATA where timestamp != 0 group by year,month_name,hive_id;
 
  - select hive_id,year,month_name,avg(hive_temp_c),avg(ambient_temp_c) from S_HIVE_DATA where timestamp != 0 and hive_id=10 group by hive_id,year,month_name order by year,month;
 
  - select hive_id,year,month_name,avg(hive_temp_c),avg(ambient_temp_c) from S_HIVE_DATA where timestamp != 0 and hive_id=10 group by hive_id,year,month_name order by year,month;
 
  - select hive_id,year,month_name,avg(hive_temp_c),avg(ambient_temp_c),max(hive_weight_kgs) from S_HIVE_DATA where timestamp != 0 and hive_id=10 group by hive_id,year,month_name order by year,month;
 
  - select hive_id,year,month_name,avg(hive_temp_c),avg(ambient_temp_c),max(hive_weight_kgs) from S_HIVE_DATA where timestamp != 0 and hive_id=10 group by hive_id,year,month_name order by year,month;
 +
 +
----
 +
 +
'''UTC Observation time'''
 +
For this to work I've copied the HIVE_PARAMETERS table to E_HIVE_PARAMETERS ( E from Enriched). Added a 'timezone' field which was filled in manually based on the location of the hive. Eventually 'timezone' will be present by default.
 +
 +
The Query below will fill the "hive_observation_time_utc" based on "hive_observation_time_local":
 +
 +
update S_HIVE_DATA a set hive_observation_time_utc=(select CONVERT_TZ(hive_observation_time_local,timezone,'UTC') from E_HIVE_PARAMETERS b where a.hive_id = b.hive_id);
 +
 +
----
  
 
'''rownums per hive'''
 
'''rownums per hive'''
Line 128: Line 141:
 
  update S_HIVE_DATA a inner join S_HIVE_DATA b ON a.hive_id = b.hive_id and b.hive_row_id = a.hive_row_id -1 set a.hive_weight_kgs_delta = a.hive_weight_kgs - b.hive_weight_kgs;
 
  update S_HIVE_DATA a inner join S_HIVE_DATA b ON a.hive_id = b.hive_id and b.hive_row_id = a.hive_row_id -1 set a.hive_weight_kgs_delta = a.hive_weight_kgs - b.hive_weight_kgs;
  
 +
----
  
 
TODO:
 
TODO:

Revision as of 12:37, 3 January 2016

Conversions

I've written 2 queries to convert the temperatures. They will only process rows in which ??_temp_c is not NULL and ??_temp_f is still NULL. to make sure we don't do the same thing twice. I does seem that the ambient temp values are sometimes a bit of the scale ("999.9"), Do I need to include some kind of lower and upper limit to skip those? Or do we do some kind of basic sanity checking at some other time.

I've directly updated the HIVE_DATA table, I hope that's fine?

mysql> update HIVE_DATA set hive_temp_f = (hive_temp_c * 1.8) + 32 where 
hive_temp_c is not NULL and hive_temp_f is NULL;
Query OK, 0 rows affected, 65535 warnings (11.07 sec)
Rows matched: 4270276  Changed: 0  Warnings: 3279801
mysql> update HIVE_DATA set ambient_temp_f = (ambient_temp_c * 1.8) + 32 
where ambient_temp_c is not NULL and ambient_temp_f is  NULL;
Query OK, 3682033 rows affected, 65535 warnings (16.86 sec)
Rows matched: 3682033  Changed: 3682033  Warnings: 2445369

No sure where the warnings come from though.

for easy execution:

#  mysql --verbose -p <  
CelciusToFarenheit.sql
Enter password:

update HIVE_DATA set hive_temp_f = (hive_temp_c * 1.8) + 32 where 
hive_temp_c is not NULL and hive_temp_f is NULL


update HIVE_DATA set ambient_temp_f = (ambient_temp_c * 1.8) + 32 where 
ambient_temp_c is not NULL and ambient_temp_f is  NULL

  1. cat CelciusToFarenheit.sql
use _hivetool_converted;

update HIVE_DATA set hive_temp_f = (hive_temp_c * 1.8) + 32 where 
hive_temp_c is not NULL and hive_temp_f is NULL;
update HIVE_DATA set ambient_temp_f = (ambient_temp_c * 1.8) + 32 where 
ambient_temp_c is not NULL and ambient_temp_f is  NULL;


Start for summary tables

Copied HIVE_DATA to S_HIVE_DATA ( S from Summary)


Date fields To do some easy grouping per date unit

Added fields some fields that will be useful for aggregation: day_of_year,month,month_name,quarter,year,month_year

Maybe season is a usefull one as well?

hour_of_day    int(2)
day_of_month   int(2)
day_of_year 	int(3)
month 	        int(2)
month_name 	varchar(10)
quarter        int(1)
year           int(4)
month_year     varchar(15)

to fill them:

# mysql --verbose -p < FillDateFields_local.sql
# cat FillDateFields_local.sql
use centesb7_hivetool_converted;
update S_HIVE_DATA set 
hour_of_day=DATE_FORMAT(hive_observation_time_local,'%H'),
day_of_month=DAYOFMONTH(hive_observation_time_local),
day_of_year=DAYOFYEAR(hive_observation_time_local),
month=MONTH(hive_observation_time_local),
month_name=lower(MONTHNAME(hive_observation_time_local)),
quarter=QUARTER(hive_observation_time_local),
year=YEAR(hive_observation_time_local),
month_year=lower(DATE_FORMAT(hive_observation_time_local,'%M-%Y'))
where hive_observation_time_local != 0;
If you want to base the "timestamp" instead of "hive_observation_time_local"
# cat FillDateFields.sql
use centesb7_hivetool_converted;

update S_HIVE_DATA set
hour_of_day=DATE_FORMAT(timestamp,'%H'),
day_of_month=DAYOFMONTH(timestamp),
day_of_year=DAYOFYEAR(timestamp),
month=MONTH(timestamp),
month_name=lower(MONTHNAME(timestamp)),
quarter=QUARTER(timestamp),
year=YEAR(timestamp),
month_year=lower(DATE_FORMAT(timestamp,'%M-%Y'))
where timestamp != 0;

Some easy example queries:

- select year,month_name,hive_id,avg(hive_temp_c) from S_HIVE_DATA where timestamp != 0 group by year,month_name,hive_id;
- select hive_id,year,month_name,avg(hive_temp_c),avg(ambient_temp_c) from S_HIVE_DATA where timestamp != 0 and hive_id=10 group by hive_id,year,month_name order by year,month;
- select hive_id,year,month_name,avg(hive_temp_c),avg(ambient_temp_c),max(hive_weight_kgs) from S_HIVE_DATA where timestamp != 0 and hive_id=10 group by hive_id,year,month_name order by year,month;

UTC Observation time For this to work I've copied the HIVE_PARAMETERS table to E_HIVE_PARAMETERS ( E from Enriched). Added a 'timezone' field which was filled in manually based on the location of the hive. Eventually 'timezone' will be present by default.

The Query below will fill the "hive_observation_time_utc" based on "hive_observation_time_local":

update S_HIVE_DATA a set hive_observation_time_utc=(select CONVERT_TZ(hive_observation_time_local,timezone,'UTC') from E_HIVE_PARAMETERS b where a.hive_id = b.hive_id);

rownums per hive We will need those to easily calculate delta's

Added field "hive_row_id" bigint(20) Put an index on it!

The query below for fills in the rowid for hive_id 10 set @rownum:=0; update S_HIVE_DATA set hive_row_id = ( select @rownum := @rownum + 1) where hive_id=10 order by row_id asc;

To do it for all hives:

# cat WriteFillHiveRownums.sql 
use centesb7_hivetool_converted;

select CONCAT('set @rownum:=0; update S_HIVE_DATA set hive_row_id = ( select @rownum := @rownum + 1) where hive_id=',hive_id,' order by row_id asc;') from HIVE_PARAMETERS;

# echo "use centesb7_hivetool_converted;" > FillHiveRownums.sql
# mysql -N -p < WriteFillHiveRownums.sql >> FillHiveRownums.sql
# mysql -p < FillHiveRownums.sql

query to fill in delta's for kgs

update S_HIVE_DATA a inner join S_HIVE_DATA b ON a.hive_id = b.hive_id and b.hive_row_id = a.hive_row_id -1 set a.hive_weight_kgs_delta = a.hive_weight_kgs - b.hive_weight_kgs;

TODO: - Add hive param fields to make easier selection on location, name, ... - fill in delta's on weight.

NASA Weight Filter

I call this the NASA weight Filter because Dr. Wayne Esaias suggested it to remove manipulation changes. This is the perl code that is used to graph the hives.

     if ( $last_weight ) {
        $delta_weight = $weight - $last_weight;
        $delta_time = ($time - $last_time)/3600;

        if ($delta_time) { $dwdt = $delta_weight/$delta_time; }

        # Begin NASA manipulation change filter

        if ( ($weight_filter eq "NASA")
          && (abs $dwdt > $max_dwdt_lbs_per_hour)             # if the change in weight exceeds the threshold
          && ($quality != 6) )                                # and this record is not flagged as a swarm (Quality 6)
           {                                                  # then don't count the change as daily change,
           $manipulation_change +=  $delta_weight;            # count it as manipulation change
           }
        else
           {
           $daily_change += $delta_weight;                    # otherwise, count it as part of the daily change
           }
        }
     else
        {                                                     #first time through
           $daily_change = $weight;
        }