Course Outline

Each session lasts 2 hours

Day-1: Session-1: Business Overview: Why Big Data Business Intelligence in Government

  • Case Studies from NIH, DoE
  • Big Data adoption rates in Government Agencies and how they are aligning future operations around Big Data Predictive Analytics
  • Broad-scale application areas in DoD, NSA, IRS, USDA, etc.
  • Interfacing Big Data with Legacy data
  • Basic understanding of enabling technologies in predictive analytics
  • Data Integration & Dashboard visualization
  • Fraud management
  • Business rule/fraud detection generation
  • Threat detection and profiling
  • Cost-benefit analysis for Big Data implementation

Day-1: Session-2: Introduction to Big Data-1

  • Main characteristics of Big Data: volume, variety, velocity, and veracity. MPP architecture for volume.
  • Data Warehouses – static schema, slowly evolving dataset
  • MPP Databases like Greenplum, Exadata, Teradata, Netezza, Vertica, etc.
  • Hadoop Based Solutions – no restrictions on dataset structure.
  • Typical pattern: HDFS, MapReduce (crunch), retrieve from HDFS
  • Batch - suited for analytical/non-interactive tasks
  • Velocity: CEP streaming data
  • Typical choices – CEP products (e.g., InfoSphere Streams, Apama, MarkLogic, etc.)
  • Less production-ready – Storm/S4
  • NoSQL Databases – (columnar and key-value): Best suited as an analytical adjunct to a data warehouse/database

Day-1: Session-3: Introduction to Big Data-2

NoSQL solutions

  • KV Store - Keyspace, Flare, SchemaFree, RAMCloud, Oracle NoSQL Database (OnDB)
  • KV Store - Dynamo, Voldemort, Dynomite, SubRecord, MongoDB, DovetailDB
  • KV Store (Hierarchical) - GT.M, Caché
  • KV Store (Ordered) - TokyoTyrant, Lightcloud, NMDB, Luxio, MemcacheDB, Actord
  • KV Cache - Memcached, Repcached, Coherence, Infinispan, ExtremeScale, JBossCache, Velocity, Terracotta
  • Tuple Store - Gigaspaces, Coord, Apache River
  • Object Database - ZopeDB, db4o, Shoal
  • Document Store - CouchDB, Cloudant, Couchbase, MongoDB, Jackrabbit, XML-Databases, ThruDB, CloudKit, Presserve, Riak-Basho, Scalaris
  • Wide Columnar Store - BigTable, HBase, Apache Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI

Varieties of Data: Introduction to Data Cleaning issues in Big Data

  • RDBMS – static structure/schema, does not promote an agile, exploratory environment.
  • NoSQL – semi-structured; has enough structure to store data without an exact schema beforehand.
  • Data cleaning issues

Day-1: Session-4: Introduction to Big Data-3: Hadoop

  • When to select Hadoop?
  • STRUCTURED - Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not ideal for active exploration)
  • SEMI-STRUCTURED data – challenging to handle with traditional solutions (DW/DB)
  • Warehousing data = HUGE effort and remains static even after implementation
  • For variety & volume of data, crunched on commodity hardware – HADOOP
  • Commodity H/W needed to create a Hadoop Cluster

Introduction to MapReduce/HDFS

  • MapReduce – distribute computing over multiple servers
  • HDFS – make data available locally for the computing process (with redundancy)
  • Data – can be unstructured/schema-less (unlike RDBMS)
  • Developer responsibility to make sense of data
  • Programming MapReduce = working with Java (pros/cons), manually loading data into HDFS
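
As a rough illustration of the MapReduce pattern listed above, the following single-process Python sketch mimics the map, shuffle, and reduce phases for a word count. It is not Hadoop code and does not touch HDFS; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample records are assumptions made purely for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over in-memory records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input: each string stands in for one line of a file.
records = ["big data in government", "big data tools"]
counts = word_count(records)
```

In a real cluster the map and reduce functions run on many servers in parallel, and the framework performs the shuffle over the network; the developer still writes only the map and reduce logic.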

Day-2: Session-1: Big Data Ecosystem - Building Big Data ETL: Universe of Big Data Tools - which one to use and when?

  • Hadoop vs. Other NoSQL solutions
  • For interactive, random access to data
  • HBase (column-oriented database) on top of Hadoop
  • Random access to data but with imposed restrictions (max 1 PB)
  • Not ideal for ad-hoc analytics; good for logging, counting, time-series
  • Sqoop - Import from databases to Hive or HDFS (JDBC/ODBC access)
  • Flume – Stream data (e.g., log data) into HDFS

Day-2: Session-2: Big Data Management System

  • Moving parts, compute nodes start/fail: ZooKeeper - For configuration/coordination/naming services
  • Complex pipeline/workflow: Oozie – manage workflow, dependencies, daisy chain
  • Deploy, configure, cluster management, upgrade, etc. (sys admin): Ambari
  • In Cloud: Whirr

Day-2: Session-3: Predictive Analytics in Business Intelligence-1: Fundamental Techniques & Machine Learning Based BI

  • Introduction to Machine learning
  • Learning classification techniques
  • Bayesian prediction: preparing a training file
  • Support Vector Machine
  • KNN p-Tree Algebra & vertical mining
  • Neural Network
  • Big Data large variable problem - Random forest (RF)
  • Big Data Automation problem – Multi-model ensemble RF
  • Automation through Soft10-M
  • Text analytics tool: Treeminer
  • Agile learning
  • Agent based learning
  • Distributed learning
  • Introduction to open source tools for predictive analytics: R, RapidMiner, Mahout

Day-2: Session-4: Predictive Analytics Ecosystem-2: Common Predictive Analytics Problems in Government

  • Insight analytics
  • Visualization analytics
  • Structured predictive analytics
  • Unstructured predictive analytics
  • Threat/fraudster/vendor profiling
  • Recommendation engine
  • Pattern detection
  • Rule/scenario discovery – failure, fraud, optimization
  • Root cause discovery
  • Sentiment analysis
  • CRM analytics
  • Network analytics
  • Text analytics
  • Technology assisted review
  • Fraud analytics
  • Real-time analytics

Day-3: Session-1: Real-Time and Scalable Analytics over Hadoop

  • Why common analytic algorithms fail in Hadoop/HDFS
  • Apache Hama - bulk synchronous parallel (BSP) distributed computing
  • Apache Spark - cluster computing for real-time analytics
  • CMU GraphLab - graph-based asynchronous approach to distributed computing
  • KNN p-Algebra based approach from Treeminer for reduced hardware cost of operation

Day-3: Session-2: Tools for eDiscovery and Forensics

  • eDiscovery over Big Data vs. Legacy data – a comparison of cost and performance
  • Predictive coding and technology assisted review (TAR)
  • Live demo of a TAR product (vMiner) to understand how TAR works for faster discovery
  • Faster indexing through HDFS – velocity of data
  • NLP or Natural Language processing – various techniques and open source products
  • eDiscovery in foreign languages: technology for foreign language processing

Day-3: Session-3: Big Data BI for Cyber Security – Understanding the whole 360-degree view of speedy data collection to threat identification

  • Understanding basics of security analytics: attack surface, security misconfiguration, host defenses
  • Network infrastructure / large data pipe / response ETL for real-time analytics
  • Prescriptive vs predictive – Fixed rule based vs auto-discovery of threat rules from Metadata

Day-3: Session-4: Big Data in USDA: Application in Agriculture

  • Introduction to IoT (Internet of Things) for agriculture: sensor-based Big Data and control
  • Introduction to Satellite imaging and its application in agriculture
  • Integrating sensor and image data for soil fertility, cultivation recommendation, and forecasting
  • Agriculture insurance and Big Data
  • Crop Loss forecasting

Day-4: Session-1: Fraud Prevention BI from Big Data in Government: Fraud Analytics

  • Basic classification of fraud analytics: rule-based vs. predictive analytics
  • Supervised vs. unsupervised machine learning for fraud pattern detection
  • Vendor fraud/overcharging for projects
  • Medicare and Medicaid fraud: fraud detection techniques for claim processing
  • Travel reimbursement fraud
  • IRS refund fraud
  • Case studies and live demo will be provided where data is available.
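
To make the rule-based versus data-driven distinction above more concrete, here is a minimal Python sketch that flags suspicious claim amounts two ways: with a fixed, hand-set limit, and with a simple z-score outlier check standing in for unsupervised detection. The function names, the threshold values, and the claim amounts are all hypothetical; real fraud models use far richer features than a single amount.

```python
from statistics import mean, pstdev


def rule_based_flags(amounts, limit):
    """Fixed rule: flag any claim above a hand-set limit."""
    return [amount > limit for amount in amounts]


def outlier_flags(amounts, z_threshold):
    """Data-driven check: flag claims far from the mean, measured in
    population standard deviations (a z-score)."""
    mu = mean(amounts)
    sigma = pstdev(amounts)
    if sigma == 0:
        # All amounts identical: nothing stands out.
        return [False] * len(amounts)
    return [abs(amount - mu) / sigma > z_threshold for amount in amounts]


# Hypothetical claim amounts; the last one is deliberately unusual.
claims = [100, 110, 95, 105, 1000]
rule_hits = rule_based_flags(claims, limit=500)
stat_hits = outlier_flags(claims, z_threshold=1.5)
```

The fixed rule needs an analyst to choose the limit in advance, while the outlier check derives its notion of "unusual" from the data itself, which is the trade-off this session contrasts.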

Day-4: Session-2: Social Media Analytics: Intelligence Gathering and Analysis

  • Big Data ETL API for extracting social media data
  • Text, image, metadata, and video
  • Sentiment analysis from social media feed
  • Contextual and non-contextual filtering of social media feed
  • Social Media Dashboard to integrate diverse social media
  • Automated profiling of social media profiles
  • Live demo of each analytic will be provided through the Treeminer Tool.

Day-4: Session-3: Big Data Analytics in Image Processing and Video Feeds

  • Image storage techniques in Big Data: storage solutions for data exceeding petabytes
  • LTFS and LTO
  • GPFS-LTFS (Layered storage solution for Big image data)
  • Fundamentals of image analytics
  • Object recognition
  • Image segmentation
  • Motion tracking
  • 3-D image reconstruction

Day-4: Session-4: Big Data Applications in NIH

  • Emerging areas of Bioinformatics
  • Metagenomics and Big Data mining issues
  • Big Data predictive analytics for Pharmacogenomics, Metabolomics, and Proteomics
  • Big Data in downstream Genomics process
  • Application of Big data predictive analytics in Public health

Big Data Dashboard for quick access to and display of diverse data

  • Integration of existing application platform with Big Data Dashboard
  • Big Data management
  • Case Study of Big Data Dashboard: Tableau and Pentaho
  • Using Big Data apps to push location-based services in Government
  • Tracking system and management

Day-5: Session-1: How to Justify Big Data BI Implementation within an Organization

  • Defining ROI for Big Data implementation
  • Case studies of saving analyst time in data collection and preparation – productivity gains
  • Case studies of revenue gain from savings on database licensing costs
  • Revenue gain from location based services
  • Savings from fraud prevention
  • An integrated spreadsheet approach to calculating approximate expense vs. revenue gain/savings from a Big Data implementation
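
The spreadsheet-style comparison of expense against gains and savings can be sketched in a few lines of Python. This is only an illustrative calculation using the common definition ROI = (total gain - total cost) / total cost; the function name, the cost and gain categories, and every figure below are hypothetical placeholders, not data from any case study.

```python
def simple_roi(costs, gains):
    """Summarize itemized costs and gains the way a spreadsheet would.

    ROI is computed as (total gain - total cost) / total cost.
    """
    total_cost = sum(costs.values())
    total_gain = sum(gains.values())
    net_benefit = total_gain - total_cost
    return {
        "total_cost": total_cost,
        "total_gain": total_gain,
        "net_benefit": net_benefit,
        "roi": net_benefit / total_cost,
    }


# Hypothetical annual figures in dollars, for illustration only.
costs = {"hardware": 200_000, "software": 50_000, "staff": 250_000}
gains = {
    "analyst_time_saved": 300_000,
    "database_licence_savings": 150_000,
    "fraud_prevented": 300_000,
}
summary = simple_roi(costs, gains)
```

With these placeholder numbers the net benefit is 250,000 on a cost of 500,000, an ROI of 0.5 (50%).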

Day-5: Session-2: Step-by-Step Procedure for Replacing a Legacy Data System with a Big Data System

  • Understanding practical Big Data Migration Roadmap
  • What important information is needed before architecting a Big Data implementation
  • What are the different ways of calculating volume, velocity, variety, and veracity of data
  • How to estimate data growth
  • Case studies
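
One simple way to approach the data-growth estimate mentioned above is to assume a constant compound growth rate. The Python sketch below projects storage volume under that assumption; the function name, the starting volume, and the growth rate are hypothetical, and real growth is rarely this regular.

```python
def projected_volume(initial_tb, monthly_growth_rate, months):
    """Project data volume assuming constant compound monthly growth.

    volume = initial * (1 + rate) ** months
    """
    return initial_tb * (1 + monthly_growth_rate) ** months


# Hypothetical inputs: 100 TB today, growing 10% per month.
after_two_months = projected_volume(100, 0.10, 2)
after_one_year = projected_volume(100, 0.10, 12)
```

Under these assumed inputs the volume is about 121 TB after two months and more than triples within a year, which shows why even a rough growth rate matters when sizing a cluster.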

Day-5: Session-4: Review of Big Data Vendors and Their Products; Q&A Session

  • Accenture
  • Aptean (formerly CDC Software)
  • Cisco Systems
  • Cloudera
  • Dell
  • EMC
  • GoodData Corporation
  • Guavus
  • Hitachi Data Systems
  • Hortonworks
  • HP
  • IBM
  • Informatica
  • Intel
  • Jaspersoft
  • Microsoft
  • MongoDB (Formerly 10Gen)
  • Mu Sigma
  • NetApp
  • Opera Solutions
  • Oracle
  • Pentaho
  • Platfora
  • QlikTech
  • Quantum
  • Rackspace
  • Revolution Analytics
  • Salesforce
  • SAP
  • SAS Institute
  • Sisense
  • Software AG/Terracotta
  • Soft10 Automation
  • Splunk
  • Sqrrl
  • Supermicro
  • Tableau Software
  • Teradata
  • Think Big Analytics
  • Tidemark Systems
  • Treeminer
  • VMware (Part of EMC)

Requirements

  • Basic knowledge of business operations and data systems in Government within their domain
  • Basic understanding of SQL/Oracle or relational databases
  • Basic understanding of Statistics (at the spreadsheet level)

Duration: 35 hours
