CHAPTER 2
Vermeulen-KrennwallnerHillman-Clark
Let’s begin by constructing a customer. I have created a fictional company for which
you will perform the practical data science as your progress through this blog. You can
execute your examples in either a Windows or Linux environment. You only have to
download the desired example set.
Any source code or other supplementary material referenced in this book is
available to readers on GitHub, via this book’s product page, located at
www.apress.com/9781484230534.
Windows
I suggest that you create a directory called c:\VKHCG to process all the examples in this
book. Next, from GitHub, download and unzip the DS_VKHCG_Windows.zip file into this
directory.
Linux
I also suggest that you create a directory called ./VKHCG, to process all the examples in
this book. Then, from GitHub, download and untar the DS_VKHCG_Linux.tar.gz file into
this directory.
Warning
If you change this directory to a new location, you will be required
to change everything in the sample scripts to this new location, to get maximum
benefit from the samples.
These files are used to create the sample company’s script and data directory, which
I will use to guide you through the processes and examples in the rest of the blog.
It’s Now Time to Meet Your Customer
Vermeulen-Krennwallner-Hillman-Clark Group (VKHCG) is a hypothetical medium-size
international company. It consists of four subcompanies: Vermeulen PLC, Krennwallner
AG, Hillman Ltd, and Clark Ltd.
Vermeulen PLC
Vermeulen PLC is a data processing company that processes all the data within the
group companies, as part of their responsibility to the group. The company handles all
the information technology aspects of the business.
This is the company for which you have just been hired to be the data scientist. Best
of luck with your future.
The company supplies
• Data science
• Networks, servers, and communication systems
• Internal and external web sites
• Data analysis business activities
• Decision science
• Process automation
• Management reporting
For the purposes of this blog, I will explain what other technologies you need
to investigate at every section of the framework, but the examples will concentrate
only on specific concepts under discussion, as the overall data science field is more
comprehensive than the few selected examples.
By way of examples, I will assist you in building a basic Data Science Technology
Stack and then advise you further with additional discussions on how to get the stack to
work at scale.
The examples will show you how to process the following business data:
• Customers
• Products
• Location
• Business processes
• A number of handy data science algorithms
I will explain how to
• Create a network routing diagram using geospatial analysis
• Build a directed acyclic graph (DAG) for the schedule of jobs, using
graph theory
If you want to have a more detailed view of the company’s data, take a browse at these
data sets in the company’s sample directory (./VKHCG/01-Vermeulen/00-RawData).
Later in this chapter, I will give you a more detailed walk-through of each data set
Krennwallner AG
Krennwallner AG is an advertising and media company that prepares advertising and
media content for the customers of the group.
It supplies
• Advertising on billboards
• Advertising and content management for online delivery
• Event management for key customers
Via a number of technologies, it records who watches what media streams. The
specific requirement we will elaborate is how to identify the groups of customers who
will have to see explicit media content. I will explain how to
• Pick content for specific billboards
• Understand online web site visitors’ data per country
• Plan an event for top-10 customers at Neuschwanstein Castle
If you want to have a more in-depth view of the company’s data, have a glance at the
sample data sets in the company’s sample directory (./VKHCG/02-Krennwallner/
00-RawData)
Hillman Ltd
The Hillman company is a supply chain and logistics company. It provisions a worldwide
supply chain solution to the businesses, including
• Third-party warehousing
• International shipping
• Door-to-door logistics
The principal requirement that I will expand on through examples is how you design
the distribution of a customer’s products purchased online.
Through the examples, I will follow the product from factory to warehouse and
warehouse to customer’s door.
I will explain how to
• Plan the locations of the warehouses within the United Kingdom
• Plan shipping rules for best-fit international logistics
• Choose what the best packing option is for shipping containers for a
given set of products
• Create an optimal delivery route for a set of customers in Scotland
If you want to have a more detailed view of the company’s data, browse the data sets
in the company’s sample directory (./VKHCG/ 03-Hillman/00-RawData).
Clark Ltd
The Clark company is a venture capitalist and accounting company that processes the
following financial responsibilities of the group:
• Financial insights
• Venture capital management
• Investments planning
• Forex (foreign exchange) trading
I will use financial aspects of the group companies to explain how you apply practical
data science and data engineering to common problems for the hypothetical financial data.
I will explain to you how to prepare
• A simple forex trading planner
• Accounting ratios
• Profitability
• Gross profit for sales
• Gross profit after tax for sales
• Return on capital employed (ROCE)
• Asset turnover
• Inventory turnover
• Accounts receivable days
• Accounts payable days
Processing Ecosystem
Five years ago, VKHCG consolidated its processing capability by transferring the
concentrated processing requirements to Vermeulen PLC to perform data science
as a group service. This resulted in the other group companies sustaining 20% of the
group business activities; however, 90% of the data processing of the combined group’s
business activities was reassigned to the core team. Vermeulen has since consolidated
Spark, Python, Mesos, Akka, Cassandra, Kafka, elastic search, and MQTT (MQ Telemetry
Transport) processing into a group service provider and processing entity.
I will use R or Python for the data processing in the examples. I will also discuss the
complementary technologies and advise you on what to consider and request for your
own environment.
Note: The complementary technologies are used regularly in the data science
environment. Although I cover them briefly, that does not make them any less
significant.
VKHCG uses the R processing engine to perform data processing in 80% of the
company business activities, and the other 20% is done by Python. Therefore, we will
prepare an R and a Python environment to perform the examples. I will quickly advise
you on how to obtain these additional environments, if you require them for your own
specific business requirements.
I will cover briefly the technologies that we are not using in the examples but that are
known to be beneficial.
Scala
Scala is popular in the data science community, as it supports massive parallel
processing in an at-scale manner. You can install the language from the following core
site: www.scala-lang.org/download/. Cheat sheets and references are available to guide
you to resources to help you master this programming language.
Note: Many users are using Scala as their strategical development
language.
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing that is at
present the fastest-growing processing engine for large-scale data science projects.
You can install the engine from the following core site: http://spark.apache.org/.
For large-scale projects, I use the Spark environment within DataStax Enterprise
(www.datastax.com), Hortonworks (https://hortonworks.com/), Cloudera
(www.cloudera.com/), and MapR (https://mapr.com/).
Note : Spark is now the most sought-after common processing engine for atscale data processing, with support increasing by the day. I recommend that you
master this engine, if you want to advance your career in data science at-scale.
Apache Mesos
Apache Mesos abstracts CPU, memory, storage, and additional computation resources
away from machines (physical or virtual), enabling fault-tolerant and elastic distributed
systems to effortlessly build and run processing solutions effectively. It is industry
proven to scale to 10,000s of nodes. This empowers the data scientist to run massive
parallel analysis and processing in an efficient manner. The processing environment
is available from the following core site: http://mesos.apache.org/. I want to give
Mesosphere Enterprise DC/OS an honorable mention, as I use it for many projects.
See https://mesosphere.com, for more details.
Note: Mesos is a cost-effective processing approach supporting growing dynamic
processing requirements in an at-scale processing environment.
Akka
Akka supports building powerful concurrent and distributed applications to perform
massive parallel processing, while sharing the common processing platform at-scale.
You can install the engine from the following core site: http://akka.io/. I use Akka
processing within the Mesosphere Enterprise DC/OS environment.
Apache Cassandra
Apache Cassandra database offers support with scalability and high availability, without
compromising performance. It has linear scalability and a reputable fault-tolerance, as it
is widely used by numerous big companies. You can install the engine from the following
core site: http://cassandra.apache.org/.
I use Cassandra processing within the Mesosphere Enterprise DC/OS environment
and DataStax Enterprise for my Cassandra installations.
Note: I recommend that you consider Cassandra as an at-scale database, as it
supports the data science environment with stable data processing capability.
Kafka
Kafka is used for building real-time data pipelines and streaming apps. It is horizontally
scalable, fault-tolerant, and impressively fast. You can install the engine from the
following core site: http://kafka.apache.org/. I use Kafka processing within the
Mesosphere Enterprise DC/OS environment, to handle the ingress of data into my data
science environments
Note: I advise that you look at Kafka as a data transport, as it supports the data
science environment with robust data collection facility.
Message Queue Telemetry Transport
Message Queue Telemetry Transport (MQTT) is a machine-to-machine (M2M) and
Internet of things connectivity protocol. It is an especially lightweight publish/subscribe
messaging transport. It enables connections to locations where a small code
footprint is essential, and lack of network bandwidth is a barrier to communication.
See http://mqtt.org/ for details.
Note: This protocol is common in sensor environments, as it provisions the
smaller code footprint and lower bandwidths that sensors demand.
Now that I have covered the items you should know about but are not going to use in
the examples, let’s look at what you will use.
Example Ecosystem
The examples require the following environment. The two setups required within
VKHCG’s environment are Python and R.
Python
Python is a high-level programming language created by Guido van Rossum and first
released in 1991. Its reputation is growing, as today, various training institutes are
covering the language as part of their data science prospectus.
I suggest you install Anaconda, to enhance your Python development. It is an open
source distribution of Python that simplifies package management and deployment of
features (see www.continuum.io/downloads).
Ubuntu
Ubuntu server installation to perform my data science
(see www.ubuntu.com/), as follows:
sudo apt-get install python3 python3-pip python3-setuptools
CentOS/RHEL
If you want to use
CentOS/RHEL,
I suggest you employ the following install process:
sudo yum install python3 python3-pip python3-setuptools
Windows
If you want to use Windows, I suggest you employ the following install process.
Download the software from www.python.org/downloads/windows/.
Is Python3 Ready?
Once installation is completed, you must test your environment as follows:
Python3 --version
On success, you should see a response like this
Python 3.4.3+
Congratulations, Python is now ready
Python Libraries
One of the most important features of Python is its libraries, which are extensively
available and make it stress-free to include verified data science processes into your
environment.
To investigate extra packages, I suggest you review the PyPI—Python Package Index
(https://pypi.python.org/).
You have to set up a limited set of Python libraries to enable you to complete the
examples.
Warning Please ensure that you have verified all the packages you use.
Remember: Open source is just that—open. Be vigilant!
Pandas
This provides a high-performance set of data structures and data-analysis tools for use in
your data science.
Ubuntu
Install this by using
sudo apt-get install python-pandas
Centos/RHEL
Install this by using
yum install python-pandas
PIP
Install this by using
pip install pandas
More information on Pandas development is available at http://pandas.pydata.
org/. I suggest following the cheat sheet (https://github.com/pandas-dev/pandas/
blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf), to guide you through the
basics of using Pandas.
I will explain, via examples, how to use these Pandas tools.
Note: I suggest that you master this package, as it will support many of your data
loading and storing processes, enabling overall data science processing
Matplotlib
Matplotlib is a Python 2D and 3D plotting library that can produce various plots,
histograms, power spectra, bar charts, error charts, scatterplots, and limitless advance
visualizations of your data science results.
Ubuntu
Install this by using
sudo apt-get install python-matplotlib
CentOS/RHEL
Install this by using
Sudo yum install python-matplotlib
PIP
Install this by using:
pip install matplotlib
Explore http://matplotlib.org/ for more details on the visualizations that you can
accomplish with exercises included in these packages.
Note: I recommend that you spend time mastering your visualization skills. Without
these skills, it is nearly impossible to communicate your data science results.
NumPy
NumPy is the fundamental package for scientific computing, based on a general
homogeneous multidimensional array structure with processing tools. Explore
www.numpy.org/ for further details.
I will use some of the tools in the examples but suggest you practice with the general
tools, to assist you with your future in data science.
SymPy
SymPy is a Python library for symbolic mathematics. It assists you in simplifying
complex algebra formulas before including them in your code. Explore www.sympy.org
for details on this package’s capabilities.
Scikit-Learn
Scikit-Learn is an efficient set of tools for data mining and data analysis packages. It
provides support for data science classification, regression, clustering, dimensionality
reduction, and preprocessing for feature extraction and normalization. This tool
supports both supervised learning and unsupervised learning processes.
I will use many of the processes from this package in the examples. Explore
http://scikit-learn.org for more details on this wide-ranging package.
Congratulations. You are now ready to execute the Python examples. Now, I will
guide you through the second setup for the R environment.
R
R is the core processing engine for statistical computing and graphics. Download the
software from www.r-project.org/ and follow the installation guidance for the specific
R installation you require.
Ubuntu
Install this by using
sudo apt-get install r-base
CentOS/RHEL
Install this by using
sudo yum install R
Windows
From https://cran.r-project.org/bin/windows/base/, install the software that
matches your environment.
Development Environment
VKHCG uses the RStudio development environment for its data science and engineering
within the group.
R Studio
RStudio produces a stress-free R ecosystem containing a code editor, debugging, and a
visualization toolset. Download the relevant software from www.rstudio.com/ and follow
the installation guidance for the specific installation you require
Ubuntu
Install this by using
wget https://download1.rstudio.org/rstudio-1.0.143-amd64.deb
sudo dpkg -i *.deb
rm *.deb
CentOS/RHEL
Install this by using
wget https://download1.rstudio.org/rstudio-1.0.143-x86_64.rpm
sudo yum install --nogpgcheck rstudio-1.0.143-x86_64.rpm
Windows
Install https://download1.rstudio.org/RStudio-1.0.143.exe.
R Packages
I suggest the following additional R packages to enhance the default R environment
Data.Table Package
Data.Table enables you to work with data files more effectively. I suggest that you
practice using Data.Table processing, to enable you to process data quickly in the R
environment and empower you to handle data sets that are up to 100GB in size.
The documentation is available at https://cran.r-project.org/web/packages/
data.table/data.table.pdf. See https://CRAN.R-project.org/package=data.table
for up-to-date information on the package.
To install the package, I suggest that you open your RStudio IDE and use the
following command:
install.packages ("data.table")
ReadR Package
The ReadR package enables the quick loading of text data into the R environment.
The documentation is available at https://cran.r-project.org/web/packages/
readr/readr.pdf. See https://CRAN.R-project.org/package=readr for up-to-date
information on the package.
To install the package, I advise you to open your RStudio IDE and use the following
command:
install.packages("readr")
I suggest that you practice by importing and exporting different formats of files, to
understand the workings of this package and master the process. I also suggest that you
investigate the following functions in depth in the ReadR package:
• Spec_delim(): Supports getting the specifications of the file without
reading it into memory
• read_delim(): Supports reading of delimited files into the R
environment
• write_delim(): Exports data from an R environment to a file on disk
JSONLite Package
This package enables you to process JSON files easily, as it is an optimized JSON parser
and generator specifically for statistical data.
The documentation is at https://cran.r-project.org/web/packages/jsonlite/
jsonlite.pdf. See https://CRAN.R-project.org/package=jsonlite for up-to-date
information on the package. To install the package, I suggest that you open your RStudio
IDE and use the following command:
install.packages ("jsonlite")
I also suggest that you investigate the following functions in the package:
• fromJSON(): This enables you to import directly into the R
environment from a JSON data source.
• prettify(): This improves the human readability by formatting the
JSON, so that a human can read it easier.
• minify(): Removes all the JSON indentation/whitespace to make the
JSON machine readable and optimized
• toJSON(): Converts R data into JSON formatted data
• read_json(): Reads JSON from a disk file
• write_json(): Writes JSON to a disk fil
Ggplot2 Package
Visualization of data is a significant skill for the data scientist. This package supports
you with an environment in which to build a complex graphic format for your data. It
is so successful at the task of creating detailed graphics that it is called “The Grammar
of Graphics.” The documentation is located at https://cran.r-project.org/web/
packages/ ggplot2/ ggplot2.pdf. See https://CRAN.R-project.org/package=
ggplot2 for up-to-date information on the package.
To install the package, I suggest that you to open your RStudio IDE and use the
following command:
install.packages("ggplot2")
I recommend that you master this package to empower you to transform your data
into a graphic you can use to demonstrate to your business the value of the results.
The packages we now have installed will support the examples.
Amalgamation of R with Spark
I want to discuss an additional package because I see its mastery as a major skill you
will require to work with current and future data science. This package is interfacing
the R environment with the distributed Spark environment and supplies an interface
to Spark’s built-in machine-learning algorithms. A number of my customers are using
Spark as the standard interface to their data environments.
Understanding this collaboration empowers you to support the processing of at-scale
environments, without major alterations in the R processing code. The documentation
is at https://cran.r-project.org/web/packages/sparklyr/ sparklyr.pdf. See
https://CRAN.R-project.org/package=sparklyr for up-to-date information on the
package.
To install the package, I suggest that you open your RStudio IDE and use the
following command:
install.packages("sparklyr")
sparklyr is a direct R interface for Apache Spark to provide a complete dplyr back
end. Once the filtering and aggregate of Spark data sets is completed downstream in the
at-scale environment, the package imports the data into the R environment for analysis
and visualization.
Sample Data
This book uses data for several examples. In the following section, I will explain how to
use the VKHCG environment you installed to create the data sets that I will use in these
examples.
Note: The processing of this sample data is spread out over the book. I am only giving
you a quick introduction to the data. I will discuss each of the data sets in more detail
once we start processing the data in later chapters. At this point, simply take note of the
data locations and general formats.
This is the minimum data you will need to complete the examples.
Note: Please select a home directory for your examples:
If on Windows, I suggest C:/VKHCG.
If on Linux, I suggest $home/VKHCG.
######################################################
rm(list=ls()) #will remove ALL objects
######################################################
MY_INSTALL_DIR = "<selected home directory>"
######################################################
if (file.exists (MY_INSTALL_DIR)==0) dir.create(MY_INSTALL_DIR)
subdirname = paste0(MY_INSTALL_DIR, "/Vermeulen")
if (file.exists(subdirname)==0) dir.create(subdirname)
######################################################
setwd(MY_INSTALL_DIR)
######################################################
if (length(sessionInfo()$otherPkgs) > 0)
lapply(paste('package:',names(sessionInfo()$otherPkgs),sep=""),
detach,character.only=TRUE,unload=TRUE)
######################################################
install.packages("readr")
######################################################
install.packages("data.table")
######################################################
Note: I am discussing only the descriptions of the data sources. It is not required
that you load the data into R now. There will be sufficient time while processing the
examples to load and process the data.
IP Addresses Data Sets
The network in VKHCG uses IP version 4 network addresses. The IPv4 protocol uses a
structured addressing format of the following structure:
IP Address = w.x.y.z
The four sections can each hold the values 0 to 255. There are 2 to the power 32 IP
addresses in the IPv4 protocol, so in universal terms, over 4 billion addresses are possible.
The following are the agreed formulas when dealing with IP4 addresses. Given an
IP Address = w.x.y.z, the IP Number = 16777216*w + 65536*x + 256*y + z. Given an IP
Number, then:
• w = int (IP Number / 16777216) % 256
• x = int (IP Number / 65536) % 256
• y = int (IP Number / 256) % 256
• z = int (IP Number) % 256
That generates IP Address = w.x.y.z.
Addresses are classified as being of Class A, B, C, or D.
Class 1st Octet Decimal Range (w)
A 1–126*
B 128–191
C 192–223
D 224–239
E 240–254
Customer Data Sets
VKHCG groups its customers onto billboards that it pays for on a per-billboard pricing
model.
In VKHCG\ 02-Krennwallner\00-RawData (this data set holds the location of all the
customer billboards):
Type of File: comma-separated values (CSV)
Data file: DE_Billboard_Locations.csv
Amount of Records: 8,873
*Class A addresses 127.0.0.0 to 127.255.255.255 are reserved for loopback and
diagnostic functions.
These addresses can be used by any company network within their internal network.
I have generated a series of IP addresses using the Class C address (192.168.0.1–
192.168.0.255), i.e., 255 addresses that you will require for the examples.
The following data is for the examples:
In VKHCG\01-Vermeulen\00-RawData:
A Class C address block for internal network usage:
Data file: IP_DATA_C_VKHCG.csv
Type of file: Comma-separated values (CSV)
Amount of record: 255
Columns in data:
Let’s investigate the next data set.
In VKHCG\01-Vermeulen\00-RawData (this data set holds guidelines for which IP
number is allocated to which location within the company’s customer network):
Type of file: Comma-separated values (CSV)
Data file: IP_DATA_ALL.csv
Amount of records: 1,247,502
Let’s investigate the next data set.
In VKHCG\01-Vermeulen\00-RawData (this data set holds which IP Number is
assigned to which location within the company’s own outside network):
Type of file: Comma-separated values (CSV)
Data file: IP_DATA_CORE.csv
Amount of records: 3,562
Logistics Data Sets
VKHCG has several warehouses and shops. I have grouped the locations of these
buildings in three data sets.
Post Codes
In VKHCG\03-Hillman\00-RawData (data set one holds a complete United Kingdom
post code list):
Type of File: comma-separated values (CSV)
Data file: GB_Postcode_Full.csv
Amount of Records: 1,714,591
Warehouse Data Set
In VKHCG\03-Hillman\00-RawData (data set two holds complete United Kingdom
warehouse locations):
Type of file: comma-separated values (CSV)
Data file: GB_Postcode_Warehouse.csv
Amount of records: 3,005
Shop Data Set
In VKHCG\03-Hillman\00-RawData (data set three holds complete United Kingdom
shop locations):
Type of file: Comma-separated values (CSV)
Data file: GB_Postcodes_Shops.csv
Amount of records: 1,048,575
Columns in data:
Exchange Rate Data Set
In VKHCG\04-Clark\00-RawData (data set one holds exchange rates against the euro for
a period 4,697 days):
Type of File: Comma-separated values (CSV) Pivot Table
Data file: Euro_ExchangeRates.csv
Amount of records: 4,697
Columns in data:
Profit-and-Loss Statement Data Set
In VKHCG\04-Clark\00-RawData (data set two holds profit-and-loss statement results):
Type of file: Comma-separated values (CSV)
Data file: Profit_And_Loss.csv
Amount of records: 2,442
Columns in data:
Summary
I have now introduced you to the company, to enable you to complete the examples in
the later chapters. Next, I will cover the layered framework, to introduce you to the basic
framework for Practical Data Science.
Subscribe For getting chapter 3 notified
email subscription.