Data Warehouse
Since HiveTool is an open source/open notebook project, the entire primary record is publicly available online as it is recorded. Two databases are used: Input and Analysis.

Each hive sends in data every five minutes, 288 times a day, inserting over 100,000 rows a year into the Input Database. One thousand hives would generate 100 million rows per year. In addition to the measured data, there are external factors that need to be systematically and consistently documented. This metadata includes hive genetics, manipulations, mite treatments, data conversion and calibration formulas, etc.
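
The arithmetic behind these figures, as a quick illustrative check (not part of HiveTool itself):

<syntaxhighlight lang="python">
# One reading every 5 minutes, per hive.
readings_per_day = 24 * 60 // 5                    # 288 readings per day
rows_per_hive_per_year = readings_per_day * 365    # 105,120 rows per hive per year
rows_for_1000_hives = rows_per_hive_per_year * 1000

print(readings_per_day, rows_per_hive_per_year, rows_for_1000_hives)
# 288 105120 105120000
</syntaxhighlight>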

[[File:Database_servers-2-1.jpg|thumb|640px|Operational and Research Databases]]


As the data is moved from the Input Database to the Analysis Database, it should be processed as follows (see the sketch after this list):

*Measurements converted to other measurement systems (lb <=> kg, Fahrenheit <=> Celsius).
*Manipulation changes filtered out.
*Weather data checked for freshness and fetched if it is not recent.
*Readings compared with limits, with alerts sent if the data exceeds them.
*Transactional data processed.
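
A minimal sketch of these transfer steps, assuming hypothetical field names, alert limits, and a simple weather cache (HiveTool's actual schema may differ; transactional data processing is omitted here):

<syntaxhighlight lang="python">
from datetime import timedelta

# Assumed alert limits and field names; illustrative only.
LB_PER_KG = 2.20462
LIMITS = {"weight_lb": (0.0, 400.0), "temp_f": (-40.0, 130.0)}

def f_to_c(deg_f):
    return (deg_f - 32.0) * 5.0 / 9.0

def transfer_reading(reading, manipulation_windows, weather_cache):
    """Apply the steps above to one reading; return None if filtered out."""
    ts = reading["timestamp"]

    # Filter out readings taken during recorded hive manipulations.
    if any(start <= ts <= end for start, end in manipulation_windows):
        return None

    # Convert to the other measurement system (lb <=> kg, Fahrenheit <=> Celsius).
    reading["weight_kg"] = reading["weight_lb"] / LB_PER_KG
    reading["temp_c"] = f_to_c(reading["temp_f"])

    # Check whether cached weather data is recent; if not, it must be fetched.
    fetched_at = weather_cache.get("fetched_at")
    if fetched_at is None or ts - fetched_at > timedelta(hours=1):
        weather_cache["needs_fetch"] = True   # a real pipeline would fetch here

    # Compare with limits and send an alert if the data exceeds them.
    for field, (lo, hi) in LIMITS.items():
        if not lo <= reading[field] <= hi:
            print(f"ALERT: {field}={reading[field]} outside {lo}..{hi}")

    return reading
</syntaxhighlight>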


Additional Data Warehouse goals:

*Structure the data so that it makes sense to the researcher.
*Structure the data to optimize query performance.
*Make research and decision-support queries easier to write.
*Maintain data and conversion history.
*Improve data quality by flagging and fixing bad data and assigning quality codes and descriptions (a sketch follows this list).
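
One way the quality-code goal could look, with assumed codes and thresholds (not HiveTool's actual scheme):

<syntaxhighlight lang="python">
# Assumed quality codes and spike threshold; illustrative only.
QUALITY_CODES = {
    "OK": "within expected range",
    "SPIKE": "implausible jump from the previous reading",
    "MISSING": "no reading received for the interval",
}

def quality_code(prev_weight_lb, weight_lb, max_jump_lb=5.0):
    """Assign a quality code to a weight reading instead of silently fixing it."""
    if weight_lb is None:
        return "MISSING"
    if prev_weight_lb is not None and abs(weight_lb - prev_weight_lb) > max_jump_lb:
        return "SPIKE"   # flagged for review; the raw value is kept
    return "OK"
</syntaxhighlight>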

The data needs to be:

*Partitioned into yearly or seasonal periods.
*Summarized (daily weight changes; see the sketch after this list).
*Cataloged and tied into the metadata (foundation type, hive orientation, mite treatment, etc.).
*Tracked and controlled with version control software.
*Released for use by researchers for data mining, online analytical processing, research and decision support.
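
A sketch of the summarization step, assuming rows of (hive_id, timestamp, weight_lb) (the real table layout may differ):

<syntaxhighlight lang="python">
from datetime import datetime

def daily_weight_changes(rows):
    """Return {(hive_id, date): last_weight - first_weight} for each hive-day."""
    first, last = {}, {}
    for hive_id, ts, weight in sorted(rows, key=lambda r: r[1]):
        key = (hive_id, ts.date())
        first.setdefault(key, weight)   # keep the first reading of the day
        last[key] = weight              # overwrite until the last reading remains
    return {key: last[key] - first[key] for key in first}

rows = [
    (1, datetime(2017, 6, 11, 0, 0), 152.0),
    (1, datetime(2017, 6, 11, 23, 55), 155.5),
]
print(daily_weight_changes(rows))   # {(1, datetime.date(2017, 6, 11)): 3.5}
</syntaxhighlight>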

[[File:Database_servers-2-2.jpg|thumb|640px|Data Warehouse]]