Wednesday, September 21, 2022

Wolves vs Man City: A Comparison of the Two Teams

Wolves vs Man City

The Premier League returns this weekend with top teams Wolves and Manchester City squaring off. Wolves have been struggling in recent weeks, but they will be looking to turn their season around against a Manchester City team that is looking for a strong start to the season. City have struggled in recent years, but new manager Pep Guardiola is hoping to turn things around this season. The two teams have some impressive players, but who will be the star of the show?

Wolves vs Man City: A Comparison of the Two Teams

The Premier League season is just around the corner and with it comes the fierce rivalry between Manchester City and Wolves. These two teams have met in the past few seasons and the results have been quite telling.

Manchester City are the reigning champions, but they have had some recent struggles. They are still a very strong team, but they have had some dip in form. They have still managed to win a lot of games, but they have had some close calls.

Wolves have had a very different season. They started off very poorly and were close to being relegated. However, they have since turned things around and are now in the middle of the table.

Wolves vs Man City: The Tactical Differences

There are a few tactical differences that could be important in this game. Wolves have been playing catch-up football in recent games, while Man City are in the midst of a good run of form. Wolves may need to be patient and take their time with the ball, while Man City may need to be more aggressive in their attacks.

Wolves also have a strong defensive record, while Man City’s defence has been letting them down recently. This could be a key factor in the game, as Wolves may need to take their time to create chances, while Man City may need to push forward quickly.

The final key difference could be in the Wolves vs Man City

Wolves vs Man City: The Physical Differences

One of the more noticeable physical differences between Wolves and Man City is their height. Wolves have an average height of 178 cm, while Man City have an average height of 183 cm. This difference is likely due to the different playing styles of the two clubs. Wolves are a possession-based team that relies on short passing in order to move the ball around the pitch, while Man City are a more physical team that relies on long balls and crosses to score goals. This difference in playing style is likely to result in different levels of physicality between the two teams.

Another physical difference between Wolves and Man City is their weight. Wolves have an average weight of 67 kg

Wolves vs Man City: The Results

The Manchester City WolfPack faced off against the Liverpool Echo WolfPack in a relegation battle. In the end, Liverpool Echo emerged victorious by a score of 3-1.

The WolfPack started the game strong, with a goal from Kyle McFadden in the first minute. Liverpool Echo responded with a goal from Demi Mitchell in the second minute. However, the WolfPack regained the lead in the fourth minute when Kyle McFadden scored again. Liverpool Echo regained the lead in the eighth minute when Demi Mitchell scored again. However, the WolfPack regained the lead in the 12th minute when Kyle McFadden scored a third time. Liverpool Echo regained the lead

Saturday, August 3, 2019

Data Science chapter 2

CHAPTER 2

Vermeulen-Krennwallner-Hillman-Clark

Let’s begin by constructing a customer. I have created a fictional company for which you will perform the practical data science as you progress through this blog. You can execute your examples in either a Windows or Linux environment. You only have to download the desired example set. Any source code or other supplementary material referenced in this book is available to readers on GitHub, via this book’s product page, located at www.apress.com/9781484230534.

Windows

I suggest that you create a directory called c:\VKHCG to process all the examples in this book. Next, from GitHub, download and unzip the DS_VKHCG_Windows.zip file into this directory.

Linux

I also suggest that you create a directory called ./VKHCG, to process all the examples in this book. Then, from GitHub, download and untar the DS_VKHCG_Linux.tar.gz file into this directory.

Warning
If you change this directory to a new location, you will be required to change everything in the sample scripts to this new location, to get maximum benefit from the samples.

These files are used to create the sample company’s script and data directory, which I will use to guide you through the processes and examples in the rest of the blog.
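Once the archive is unpacked, a quick check that the expected folders exist can save confusion later. The following is a minimal Python sketch, not part of the sample scripts: the four raw-data directory names are the ones this chapter refers to, but that the archive creates exactly this layout is an assumption, and the helper name missing_raw_data_dirs is hypothetical.

```python
from pathlib import Path

# Raw-data directories referred to in this chapter, relative to the VKHCG base
# directory. Assumption: the downloaded archive creates exactly this layout.
RAW_DATA_DIRS = [
    "01-Vermeulen/00-RawData",
    "02-Krennwallner/00-RawData",
    "03-Hillman/00-RawData",
    "04-Clark/00-RawData",
]


def missing_raw_data_dirs(base_dir):
    """Return the expected raw-data directories that do not exist under base_dir."""
    base = Path(base_dir)
    return [d for d in RAW_DATA_DIRS if not (base / d).is_dir()]


if __name__ == "__main__":
    # "./VKHCG" is the Linux suggestion from the text; use "C:/VKHCG" on Windows.
    missing = missing_raw_data_dirs("./VKHCG")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All raw-data directories found.")
```

If you moved the base directory, pass the new location to the function instead of ./VKHCG.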

It’s Now Time to Meet Your Customer

Vermeulen-Krennwallner-Hillman-Clark Group (VKHCG) is a hypothetical medium-size international company. It consists of four subcompanies: Vermeulen PLC, Krennwallner AG, Hillman Ltd, and Clark Ltd.

Vermeulen PLC

Vermeulen PLC is a data processing company that processes all the data within the group companies, as part of their responsibility to the group. The company handles all the information technology aspects of the business. This is the company for which you have just been hired to be the data scientist. Best of luck with your future. The company supplies
• Data science
• Networks, servers, and communication systems
• Internal and external web sites
• Data analysis business activities
• Decision science
• Process automation
• Management reporting

For the purposes of this blog, I will explain what other technologies you need to investigate at every section of the framework, but the examples will concentrate only on specific concepts under discussion, as the overall data science field is more comprehensive than the few selected examples. By way of examples, I will assist you in building a basic Data Science Technology Stack and then advise you further with additional discussions on how to get the stack to work at scale.

The examples will show you how to process the following business data:
• Customers
• Products
• Location
• Business processes
• A number of handy data science algorithms

I will explain how to
• Create a network routing diagram using geospatial analysis
• Build a directed acyclic graph (DAG) for the schedule of jobs, using graph theory

If you want to have a more detailed view of the company’s data, take a browse at these data sets in the company’s sample directory (./VKHCG/01-Vermeulen/00-RawData). Later in this chapter, I will give you a more detailed walk-through of each data set.
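To give a feel for what a DAG of scheduled jobs looks like, here is a minimal Python sketch using the standard library's graphlib module (Python 3.9 or later). The job names and their dependencies are hypothetical and are not taken from the VKHCG sample data; the later examples may well use a different graph library.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical jobs: each key maps to the set of jobs that must finish first.
job_dependencies = {
    "retrieve": set(),
    "assess": {"retrieve"},
    "process": {"assess"},
    "transform": {"process"},
    "report": {"transform", "assess"},
}


def schedule(dependencies):
    """Return one valid execution order for a DAG of jobs."""
    return list(TopologicalSorter(dependencies).static_order())


order = schedule(job_dependencies)
print(" -> ".join(order))
```

If the dependencies contain a cycle, the graph is no longer a DAG and static_order() raises graphlib.CycleError, which is a useful safety check on a job schedule.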

Krennwallner AG

Krennwallner AG is an advertising and media company that prepares advertising and media content for the customers of the group. It supplies
• Advertising on billboards
• Advertising and content management for online delivery
• Event management for key customers

Via a number of technologies, it records who watches what media streams. The specific requirement we will elaborate is how to identify the groups of customers who will have to see explicit media content. I will explain how to
• Pick content for specific billboards
• Understand online web site visitors’ data per country
• Plan an event for top-10 customers at Neuschwanstein Castle

If you want to have a more in-depth view of the company’s data, have a glance at the sample data sets in the company’s sample directory (./VKHCG/02-Krennwallner/00-RawData).

Hillman Ltd

The Hillman company is a supply chain and logistics company. It provisions a worldwide supply chain solution to the businesses, including
• Third-party warehousing
• International shipping
• Door-to-door logistics

The principal requirement that I will expand on through examples is how you design the distribution of a customer’s products purchased online. Through the examples, I will follow the product from factory to warehouse and warehouse to customer’s door. I will explain how to
• Plan the locations of the warehouses within the United Kingdom
• Plan shipping rules for best-fit international logistics
• Choose what the best packing option is for shipping containers for a given set of products
• Create an optimal delivery route for a set of customers in Scotland

If you want to have a more detailed view of the company’s data, browse the data sets in the company’s sample directory (./VKHCG/03-Hillman/00-RawData).

Clark Ltd

The Clark company is a venture capitalist and accounting company that processes the following financial responsibilities of the group:
• Financial insights
• Venture capital management
• Investments planning
• Forex (foreign exchange) trading

I will use financial aspects of the group companies to explain how you apply practical data science and data engineering to common problems for the hypothetical financial data. I will explain to you how to prepare
• A simple forex trading planner
• Accounting ratios
• Profitability
• Gross profit for sales
• Gross profit after tax for sales
• Return on capital employed (ROCE)
• Asset turnover
• Inventory turnover
• Accounts receivable days
• Accounts payable days
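As a preview of the arithmetic behind some of these ratios, here is a minimal Python sketch. It uses common textbook definitions, which are an assumption on my part and may differ in detail from the definitions applied to the Clark data later; all function names and figures are hypothetical.

```python
def gross_profit(revenue, cost_of_sales):
    """Gross profit: revenue less the cost of the goods sold."""
    return revenue - cost_of_sales


def roce(operating_profit, capital_employed):
    """Return on capital employed, as a fraction (0.25 means 25%)."""
    return operating_profit / capital_employed


def asset_turnover(revenue, total_assets):
    """Revenue generated per unit of assets."""
    return revenue / total_assets


def inventory_turnover(cost_of_sales, average_inventory):
    """How many times the inventory is sold and replaced in the period."""
    return cost_of_sales / average_inventory


def receivable_days(accounts_receivable, revenue, days=365):
    """Average number of days customers take to pay."""
    return accounts_receivable * days / revenue


def payable_days(accounts_payable, cost_of_sales, days=365):
    """Average number of days the company takes to pay suppliers."""
    return accounts_payable * days / cost_of_sales


# Hypothetical figures, for illustration only.
print("Gross profit:", gross_profit(1000, 600))
print("ROCE:", roce(200, 800))
print("Asset turnover:", asset_turnover(1000, 2000))
print("Inventory turnover:", inventory_turnover(600, 150))
print("Receivable days:", receivable_days(100, 1000))
print("Payable days:", payable_days(60, 600))
```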

Processing Ecosystem

Five years ago, VKHCG consolidated its processing capability by transferring the concentrated processing requirements to Vermeulen PLC to perform data science as a group service. This resulted in the other group companies sustaining 20% of the group business activities; however, 90% of the data processing of the combined group’s business activities was reassigned to the core team. Vermeulen has since consolidated Spark, Python, Mesos, Akka, Cassandra, Kafka, Elasticsearch, and MQTT (MQ Telemetry Transport) processing into a group service provider and processing entity. I will use R or Python for the data processing in the examples. I will also discuss the complementary technologies and advise you on what to consider and request for your own environment.

Note: The complementary technologies are used regularly in the data science environment. Although I cover them briefly, that does not make them any less significant.

VKHCG uses the R processing engine to perform data processing in 80% of the company business activities, and the other 20% is done by Python. Therefore, we will prepare an R and a Python environment to perform the examples. I will quickly advise you on how to obtain these additional environments, if you require them for your own specific business requirements. I will cover briefly the technologies that we are not using in the examples but that are known to be beneficial.

Scala

Scala is popular in the data science community, as it supports massive parallel processing in an at-scale manner. You can install the language from the following core site: www.scala-lang.org/download/. Cheat sheets and references are available to guide you to resources to help you master this programming language.

Note: Many users are using Scala as their strategic development language.

Apache Spark

Apache Spark is a fast and general engine for large-scale data processing that is at present the fastest-growing processing engine for large-scale data science projects. You can install the engine from the following core site: http://spark.apache.org/. For large-scale projects, I use the Spark environment within DataStax Enterprise (www.datastax.com), Hortonworks (https://hortonworks.com/), Cloudera (www.cloudera.com/), and MapR (https://mapr.com/).

Note: Spark is now the most sought-after common processing engine for at-scale data processing, with support increasing by the day. I recommend that you master this engine, if you want to advance your career in data science at-scale.

Apache Mesos

Apache Mesos abstracts CPU, memory, storage, and additional computation resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to effortlessly build and run processing solutions effectively. It is industry proven to scale to 10,000s of nodes. This empowers the data scientist to run massive parallel analysis and processing in an efficient manner. The processing environment is available from the following core site: http://mesos.apache.org/. I want to give Mesosphere Enterprise DC/OS an honorable mention, as I use it for many projects. See https://mesosphere.com, for more details.

Note: Mesos is a cost-effective processing approach supporting growing dynamic processing requirements in an at-scale processing environment.

Akka

Akka supports building powerful concurrent and distributed applications to perform massive parallel processing, while sharing the common processing platform at-scale. You can install the engine from the following core site: http://akka.io/. I use Akka processing within the Mesosphere Enterprise DC/OS environment.

Apache Cassandra

Apache Cassandra database offers support with scalability and high availability, without compromising performance. It has linear scalability and a reputable fault-tolerance, as it is widely used by numerous big companies. You can install the engine from the following core site: http://cassandra.apache.org/. I use Cassandra processing within the Mesosphere Enterprise DC/OS environment and DataStax Enterprise for my Cassandra installations.

Note: I recommend that you consider Cassandra as an at-scale database, as it supports the data science environment with stable data processing capability.

Kafka

Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, and impressively fast. You can install the engine from the following core site: http://kafka.apache.org/. I use Kafka processing within the Mesosphere Enterprise DC/OS environment, to handle the ingress of data into my data science environments.

Note: I advise that you look at Kafka as a data transport, as it supports the data science environment with robust data collection facility.

Message Queue Telemetry Transport

Message Queue Telemetry Transport (MQTT) is a machine-to-machine (M2M) and Internet of things connectivity protocol. It is an especially lightweight publish/subscribe messaging transport. It enables connections to locations where a small code footprint is essential, and lack of network bandwidth is a barrier to communication. See http://mqtt.org/ for details.

Note: This protocol is common in sensor environments, as it provisions the smaller code footprint and lower bandwidths that sensors demand.

Now that I have covered the items you should know about but are not going to use in the examples, let’s look at what you will use.

Example Ecosystem

The examples require the following environment. The two setups required within VKHCG’s environment are Python and R.

Python

Python is a high-level programming language created by Guido van Rossum and first released in 1991. Its reputation is growing, as today, various training institutes are covering the language as part of their data science prospectus. I suggest you install Anaconda, to enhance your Python development. It is an open source distribution of Python that simplifies package management and deployment of features (see www.continuum.io/downloads).

Ubuntu

I use an Ubuntu server installation to perform my data science (see www.ubuntu.com/). Install Python as follows:

sudo apt-get install python3 python3-pip python3-setuptools

CentOS/RHEL

If you want to use CentOS/RHEL, I suggest you employ the following install process:

sudo yum install python3 python3-pip python3-setuptools

Windows

If you want to use Windows, I suggest you employ the following install process. Download the software from www.python.org/downloads/windows/.

Is Python3 Ready?

Once installation is completed, you must test your environment as follows:

python3 --version

On success, you should see a response like this:

Python 3.4.3+

Congratulations, Python is now ready.
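The same check can be made from inside the interpreter. The sketch below is a minimal illustration; the minimum version of 3.4 is an assumption taken from the sample response shown above, and the helper name is_supported is hypothetical.

```python
import sys


def is_supported(version_info, minimum=(3, 4)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Report the running interpreter and whether it meets the assumed minimum.
print("Python", sys.version.split()[0])
if is_supported(sys.version_info):
    print("Python is ready for the examples.")
else:
    print("Please install a newer Python 3 release.")
```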

Python Libraries

One of the most important features of Python is its libraries, which are extensively available and make it stress-free to include verified data science processes into your environment. To investigate extra packages, I suggest you review the PyPI—Python Package Index (https://pypi.python.org/). You have to set up a limited set of Python libraries to enable you to complete the examples.

Warning Please ensure that you have verified all the packages you use. Remember: Open source is just that—open. Be vigilant!
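After working through the installations that follow, you may want to confirm which of the libraries are actually available. This is a minimal sketch using only the standard library; the list holds the import names of the libraries covered in this section (scikit-learn is imported as sklearn), and the helper name missing_modules is hypothetical.

```python
from importlib.util import find_spec

# Import names of the libraries covered in this section.
# Note that scikit-learn is imported under the name "sklearn".
REQUIRED = ["pandas", "matplotlib", "numpy", "sympy", "sklearn"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Still to install:", ", ".join(missing))
else:
    print("All libraries for the examples are available.")
```

The check only looks for the modules without importing them, so it says nothing about whether a package comes from a source you trust; the warning above still applies.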

Pandas

This provides a high-performance set of data structures and data-analysis tools for use in your data science.

Ubuntu

Install this by using sudo apt-get install python-pandas

CentOS/RHEL

Install this by using yum install python-pandas

PIP

Install this by using pip install pandas

More information on Pandas development is available at http://pandas.pydata.org/. I suggest following the cheat sheet (https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf), to guide you through the basics of using Pandas. I will explain, via examples, how to use these Pandas tools.

Note: I suggest that you master this package, as it will support many of your data loading and storing processes, enabling overall data science processing

Matplotlib

Matplotlib is a Python 2D and 3D plotting library that can produce various plots, histograms, power spectra, bar charts, error charts, scatterplots, and many other advanced visualizations of your data science results.

Ubuntu

Install this by using sudo apt-get install python-matplotlib

CentOS/RHEL

Install this by using sudo yum install python-matplotlib

PIP

Install this by using pip install matplotlib

Explore http://matplotlib.org/ for more details on the visualizations that you can accomplish with exercises included in these packages.

Note: I recommend that you spend time mastering your visualization skills. Without these skills, it is nearly impossible to communicate your data science results.

NumPy

NumPy is the fundamental package for scientific computing, based on a general homogeneous multidimensional array structure with processing tools. Explore www.numpy.org/ for further details. I will use some of the tools in the examples but suggest you practice with the general tools, to assist you with your future in data science.

SymPy

SymPy is a Python library for symbolic mathematics. It assists you in simplifying complex algebra formulas before including them in your code. Explore www.sympy.org for details on this package’s capabilities.

Scikit-Learn

Scikit-Learn is an efficient set of tools for data mining and data analysis. It provides support for data science classification, regression, clustering, dimensionality reduction, and preprocessing for feature extraction and normalization. This tool supports both supervised learning and unsupervised learning processes. I will use many of the processes from this package in the examples. Explore http://scikit-learn.org for more details on this wide-ranging package.

Congratulations. You are now ready to execute the Python examples. Now, I will guide you through the second setup for the R environment.

R

R is the core processing engine for statistical computing and graphics. Download the software from www.r-project.org/ and follow the installation guidance for the specific R installation you require.

Ubuntu

Install this by using sudo apt-get install r-base

CentOS/RHEL

Install this by using sudo yum install R

Windows

From https://cran.r-project.org/bin/windows/base/, install the software that matches your environment.

Development Environment

VKHCG uses the RStudio development environment for its data science and engineering within the group.

RStudio

RStudio produces a stress-free R ecosystem containing a code editor, debugging, and a visualization toolset. Download the relevant software from www.rstudio.com/ and follow the installation guidance for the specific installation you require

Ubuntu

Install this by using:

wget https://download1.rstudio.org/rstudio-1.0.143-amd64.deb
sudo dpkg -i *.deb
rm *.deb

CentOS/RHEL

Install this by using:

wget https://download1.rstudio.org/rstudio-1.0.143-x86_64.rpm
sudo yum install --nogpgcheck rstudio-1.0.143-x86_64.rpm

Windows

Install https://download1.rstudio.org/RStudio-1.0.143.exe.

R Packages

I suggest the following additional R packages to enhance the default R environment

Data.Table Package

Data.Table enables you to work with data files more effectively. I suggest that you practice using Data.Table processing, to enable you to process data quickly in the R environment and empower you to handle data sets that are up to 100GB in size. The documentation is available at https://cran.r-project.org/web/packages/data.table/data.table.pdf. See https://CRAN.R-project.org/package=data.table for up-to-date information on the package. To install the package, I suggest that you open your RStudio IDE and use the following command:

install.packages("data.table")

ReadR Package

The ReadR package enables the quick loading of text data into the R environment. The documentation is available at https://cran.r-project.org/web/packages/readr/readr.pdf. See https://CRAN.R-project.org/package=readr for up-to-date information on the package. To install the package, I advise you to open your RStudio IDE and use the following command:

install.packages("readr")

I suggest that you practice by importing and exporting different formats of files, to understand the workings of this package and master the process. I also suggest that you investigate the following functions in depth in the ReadR package:
• spec_delim(): Supports getting the specifications of the file without reading it into memory
• read_delim(): Supports reading of delimited files into the R environment
• write_delim(): Exports data from an R environment to a file on disk

JSONLite Package

This package enables you to process JSON files easily, as it is an optimized JSON parser and generator specifically for statistical data. The documentation is at https://cran.r-project.org/web/packages/jsonlite/jsonlite.pdf. See https://CRAN.R-project.org/package=jsonlite for up-to-date information on the package. To install the package, I suggest that you open your RStudio IDE and use the following command:

install.packages("jsonlite")

I also suggest that you investigate the following functions in the package:
• fromJSON(): This enables you to import directly into the R environment from a JSON data source.
• prettify(): This improves the human readability by formatting the JSON, so that a human can read it more easily.
• minify(): Removes all the JSON indentation/whitespace to make the JSON machine readable and optimized
• toJSON(): Converts R data into JSON formatted data
• read_json(): Reads JSON from a disk file
• write_json(): Writes JSON to a disk file

Ggplot2 Package

Visualization of data is a significant skill for the data scientist. This package supports you with an environment in which to build a complex graphic format for your data. It is so successful at the task of creating detailed graphics that it is called “The Grammar of Graphics.” The documentation is located at https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf. See https://CRAN.R-project.org/package=ggplot2 for up-to-date information on the package. To install the package, I suggest that you open your RStudio IDE and use the following command:

install.packages("ggplot2")

I recommend that you master this package to empower you to transform your data into a graphic you can use to demonstrate to your business the value of the results. The packages we now have installed will support the examples.

Amalgamation of R with Spark

I want to discuss an additional package because I see its mastery as a major skill you will require to work with current and future data science. This package interfaces the R environment with the distributed Spark environment and supplies an interface to Spark’s built-in machine-learning algorithms. A number of my customers are using Spark as the standard interface to their data environments. Understanding this collaboration empowers you to support the processing of at-scale environments, without major alterations in the R processing code. The documentation is at https://cran.r-project.org/web/packages/sparklyr/sparklyr.pdf. See https://CRAN.R-project.org/package=sparklyr for up-to-date information on the package. To install the package, I suggest that you open your RStudio IDE and use the following command:

install.packages("sparklyr")

sparklyr is a direct R interface for Apache Spark to provide a complete dplyr back end. Once the filtering and aggregation of Spark data sets is completed downstream in the at-scale environment, the package imports the data into the R environment for analysis and visualization.

Sample Data

This book uses data for several examples. In the following section, I will explain how to use the VKHCG environment you installed to create the data sets that I will use in these examples.

Note: The processing of this sample data is spread out over the book. I am only giving you a quick introduction to the data. I will discuss each of the data sets in more detail once we start processing the data in later chapters. At this point, simply take note of the data locations and general formats. This is the minimum data you will need to complete the examples.

Note: Please select a home directory for your examples:

If on Windows, I suggest C:/VKHCG.
If on Linux, I suggest $HOME/VKHCG.

######################################################
rm(list=ls()) # removes ALL objects from the workspace
######################################################
MY_INSTALL_DIR = "<selected home directory>"
######################################################
# Create the home directory and its Vermeulen subdirectory if they are missing
if (!file.exists(MY_INSTALL_DIR)) dir.create(MY_INSTALL_DIR)
subdirname = paste0(MY_INSTALL_DIR, "/Vermeulen")
if (!file.exists(subdirname)) dir.create(subdirname)
######################################################
setwd(MY_INSTALL_DIR)
######################################################
# Detach any additional packages that are currently attached
if (length(sessionInfo()$otherPkgs) > 0)
  lapply(paste('package:',names(sessionInfo()$otherPkgs),sep=""),
         detach,character.only=TRUE,unload=TRUE)
######################################################
install.packages("readr")
######################################################
install.packages("data.table")
######################################################

Note: I am discussing only the descriptions of the data sources. It is not required that you load the data into R now. There will be sufficient time while processing the examples to load and process the data.

IP Addresses Data Sets

The network in VKHCG uses IP version 4 network addresses. The IPv4 protocol uses a structured addressing format of the following structure:

IP Address = w.x.y.z

The four sections can each hold the values 0 to 255. There are 2 to the power 32 IP addresses in the IPv4 protocol, so in universal terms, over 4 billion addresses are possible. The following are the agreed formulas when dealing with IPv4 addresses. Given an IP Address = w.x.y.z, the IP Number = 16777216*w + 65536*x + 256*y + z. Given an IP Number, then:
• w = int (IP Number / 16777216) % 256
• x = int (IP Number / 65536) % 256
• y = int (IP Number / 256) % 256
• z = int (IP Number) % 256

That generates IP Address = w.x.y.z. Addresses are classified as being of Class A, B, C, D, or E.
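Before looking at the classes, the conversion formulas above can be sketched in Python as follows. This is a minimal illustration only; the function names ip_to_number and number_to_ip are hypothetical and do not come from the sample scripts.

```python
def ip_to_number(address):
    """Convert a dotted IPv4 address w.x.y.z into its IP Number."""
    parts = [int(part) for part in address.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError("Not a valid IPv4 address: " + address)
    w, x, y, z = parts
    return 16777216 * w + 65536 * x + 256 * y + z


def number_to_ip(number):
    """Convert an IP Number back into the dotted form w.x.y.z."""
    w = (number // 16777216) % 256
    x = (number // 65536) % 256
    y = (number // 256) % 256
    z = number % 256
    return "{}.{}.{}.{}".format(w, x, y, z)


number = ip_to_number("192.168.0.1")
print(number)                # 3232235521
print(number_to_ip(number))  # 192.168.0.1
```

Integer division (//) plays the role of int(...) in the formulas, so the round trip returns the address you started with.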

Class    1st Octet Decimal Range (w)
A        1–126*
B        128–191
C        192–223
D        224–239
E        240–254

*Class A addresses 127.0.0.0 to 127.255.255.255 are reserved for loopback and diagnostic functions.

These addresses can be used by any company network within their internal network. I have generated a series of IP addresses using the Class C address (192.168.0.1–192.168.0.255), i.e., 255 addresses that you will require for the examples. The following data is for the examples: In VKHCG\01-Vermeulen\00-RawData: A Class C address block for internal network usage: Data file: IP_DATA_C_VKHCG.csv Type of file: Comma-separated values (CSV) Amount of records: 255 Columns in data:

Let’s investigate the next data set. In VKHCG\01-Vermeulen\00-RawData (this data set holds guidelines for which IP number is allocated to which location within the company’s customer network): Type of file: Comma-separated values (CSV) Data file: IP_DATA_ALL.csv Amount of records: 1,247,502

Let’s investigate the next data set. In VKHCG\01-Vermeulen\00-RawData (this data set holds which IP Number is assigned to which location within the company’s own outside network): Type of file: Comma-separated values (CSV) Data file: IP_DATA_CORE.csv Amount of records: 3,562

Customer Data Sets

VKHCG groups its customers onto billboards that it pays for on a per-billboard pricing model. In VKHCG\02-Krennwallner\00-RawData (this data set holds the location of all the customer billboards): Type of file: Comma-separated values (CSV) Data file: DE_Billboard_Locations.csv Amount of records: 8,873

Logistics Data Sets

VKHCG has several warehouses and shops. I have grouped the locations of these buildings in three data sets.

Post Codes

In VKHCG\03-Hillman\00-RawData (data set one holds a complete United Kingdom post code list): Type of File: comma-separated values (CSV) Data file: GB_Postcode_Full.csv Amount of Records: 1,714,591

Warehouse Data Set

In VKHCG\03-Hillman\00-RawData (data set two holds complete United Kingdom warehouse locations): Type of file: comma-separated values (CSV) Data file: GB_Postcode_Warehouse.csv Amount of records: 3,005

Shop Data Set

In VKHCG\03-Hillman\00-RawData (data set three holds complete United Kingdom shop locations): Type of file: Comma-separated values (CSV) Data file: GB_Postcodes_Shops.csv Amount of records: 1,048,575 Columns in data:

Exchange Rate Data Set

In VKHCG\04-Clark\00-RawData (data set one holds exchange rates against the euro for a period of 4,697 days): Type of file: Comma-separated values (CSV) Pivot Table Data file: Euro_ExchangeRates.csv Amount of records: 4,697 Columns in data:

Profit-and-Loss Statement Data Set

In VKHCG\04-Clark\00-RawData (data set two holds profit-and-loss statement results): Type of file: Comma-separated values (CSV) Data file: Profit_And_Loss.csv Amount of records: 2,442 Columns in data:

Summary

I have now introduced you to the company, to enable you to complete the examples in the later chapters. Next, I will cover the layered framework, to introduce you to the basic framework for Practical Data Science.


Tuesday, July 30, 2019

Data science Chapter 1

Learn Data Science

A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets...

About the Author:
Zain is a consulting manager for decision science, data science, data engineering, machine learning, robotics, artificial intelligence, computational analytics and business intelligence.

Acknowledgments:
To Denise: I am fortunate enough to have created a way of life I love . . . But you have given me the courage and determination to live it! Thanks for the time and patience to complete the book and numerous other mad projects. To Laurence: Thank you for all the knowledge shared on accounting and finance. To Chris: thank you. Your wisdom and insight made this great! Best of luck with your future. To the staff at Apress: your skills transformed an idea into a book. Well done!

Introduction:
People are talking about data lakes daily now. I consult on a regular basis with organizations on how to develop their data lake and data science strategy to serve their evolving and ever-changing business strategies. This requirement for agile and cost-effective information management is high on the priority list of senior managers worldwide. It is a fact that many of the unknown insights are captured and stored in a massive pool of unprocessed data in the enterprise. These data lakes have major implications for the future of the business world. It is projected that combined data scientists worldwide will have to handle 40 zettabytes of data by 2020, an increase of 300 times since 2005. There are numerous data sources that still must be converted into actionable business knowledge. This achievement will safeguard the future of the business that can achieve it. The world’s data producers are generating two-and-a-half quintillion bytes of new data every day. The addition of internet of things will cause this volume to be substantially higher. Data scientists and engineers are falling behind on an immense responsibility. By reading this introduction, you are already an innovative person who wants to understand this advanced data structure that one and all now desire to tame. To tame your data lake, you will require practical data science. I propose to teach you how to tame this beast. I am familiar with the skills it takes to achieve this goal. I will guide you with the sole purpose of you learning and expanding while mastering the practical guidance in this blog. You will understand what is in your business’s data lake and how to apply data science to it. Think of the process as comparable to a natural lake. It is vital to establish a sequence of proficient techniques with the lake, to obtain pure water in your glass. 
Do not stress, as by the end of this blog, you will have shared in more than 9 years of working experience with data and extracting actionable business knowledge. I will share with you the experience I gained in working with data on an international scale.

Data Science:
In 1960, Peter Naur started using the term data science as a substitute for computer science. He stated that to work with data, you require more than just computer science. I agree with his declaration. Data science is an interdisciplinary science that incorporates practices and methods with actionable knowledge and insights from data in heterogeneous schemas (structured, semi-structured, or unstructured). It amalgamates the scientific fields of data exploration with thought-provoking research fields such as data engineering, information science, computer science, statistics, artificial intelligence, machine learning, data mining, and predictive analytics. For my part, as I enthusiastically research the future use of data science, by translating multiple data lakes, I have gained several valuable insights. I will explain these with end-to-end examples and share my insights on data lakes. This book explains vital elements from these sciences that you will use to process your data lake into actionable knowledge. I will guide you through a series of recognized science procedures for data lakes. These core skills are a key set of assets to perfect as you begin your encounters with data science.

Data Analytics:

Data analytics is the science of fact-finding analysis of raw data, with the goal of drawing conclusions from the data lake. Data analytics is driven by certified algorithms to statistically define associations between data that produce insights.

The best answer is to point to a certified and recognized algorithm that you have used. Associate the algorithm to your business terminology to achieve success with your projects.

Machine Learning:

The business world is buzzing with activities and ideas about machine learning and its application to numerous business environments. Machine learning is the capability of systems to learn without explicit software development. It evolved from the study of pattern recognition and computational learning theory. The impact is that, with the appropriate processing and skills, you can augment your own data capabilities. Training enables a processing environment to complete several magnitudes of discoveries in the time it takes to have a cup of coffee.

Note: Work smarter, not harder! Offload your data science to machines. They are faster and more consistent in processing your data lakes.

This skill is an essential part of achieving major gains in shortening the data-to-knowledge cycle. This blog will cover the essential practical ground rules in later chapters.

Data Mining:

Data mining is processing data to isolate patterns and establish relationships between data entities within the data lake. For data mining to be successful, there is a small number of critical data-mining theories that you must know about data patterns. In later chapters, I will expand on how you can mine your data for insights. This will help you to discover new actionable knowledge.

Statistics:

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. Statistics deals with all aspects of data, including the planning of data collection, in terms of the design of surveys and experiments. Data science and statistics are closely related. I will show you how to run through a series of statistical models covering data collection, population, and samples to enhance your data science deliveries.

Algorithms:

An algorithm is a self-contained step-by-step set of processes to achieve a specific outcome. Algorithms execute calculations, data processing, or automated reasoning tasks with repeatable outcomes. Algorithms are the backbone of the data science process. You should assemble a series of methods and procedures that will ease the complexity and processing of your specific data lake. I will discuss numerous algorithms and good practices for performing practical data science throughout the blog.

Data Visualization:

Data visualization is your key communication channel with the business. It consists of the creation and study of the visual representation of business insights. Data science’s principal deliverable is visualization. You will have to take your highly technical results and transform them into a format that you can show to non-specialists.

Storytelling:

Data storytelling is the process of translating data analyses into layperson’s terms, in order to influence a business decision or action. You can have the finest data science, but without the business story to translate your findings into business-relevant actions, you will not succeed. I will provide details and practical insights into what to check for to ensure that you have the proper story and actions.

What Next?

I will demonstrate, using the core knowledge of the underlying science, how you can make a competent start to handle the transformation process of your data lake into actionable knowledge. The sole requirement is to understand the data science of your own data lake. Start rapidly to discover what data science reveals about your business. You are the master of your own data lake. You will have to build familiarity with the data lake and what is flowing into the structure. My advice is to apply the data science on smaller-scale activities, for insights from the data lake.

CHAPTER 1:

Data Science Technology Stack

The Data Science Technology Stack covers the data processing requirements in the Rapid Information Factory ecosystem. Throughout the book, I will discuss the stack as the guiding pattern. In this chapter, I will help you to recognize the basics of data science tools and their influence on modern data lake development. You will discover the techniques for transforming a data vault into a data warehouse bus matrix. I will explain the use of Spark, Mesos, Akka, Cassandra, and Kafka, to tame your data science requirements. I will guide you in the use of Elasticsearch and MQTT (MQ Telemetry Transport), to enhance your data science solutions. I will help you to recognize the influence of R as a creative visualization solution. I will also introduce the impact and influence on the data science ecosystem of such programming languages as R, Python, and Scala.

Rapid Information Factory Ecosystem:

The Rapid Information Factory ecosystem is a convention of techniques I use for my individual processing developments. The processing route of the blog will be formulated on this basis, but you are not bound to use it exclusively. The tools I discuss in this chapter are available to you without constraint. The tools can be used in any configuration or permutation that is suitable to your specific ecosystem. I recommend that you begin to formulate an ecosystem of your own or simply adopt mine. As a prerequisite, you must become accustomed to a set of tools you know well and can deploy proficiently.

Note: Remember, your data lake will have its own properties and features, so adapt your tools to those particular characteristics.

Data Science Storage Tools:

This data science ecosystem has a series of tools that you use to build your solutions. This environment is undergoing a rapid advancement in capabilities, and new developments are occurring every day. I will explain the tools I use in my daily work to perform practical data science. Next, I will discuss the following basic data methodologies.

Schema-on-Write and Schema-on-Read:

There are two basic methodologies that are supported by the data processing tools. Following is a brief outline of each methodology and its advantages and drawbacks.

Schema-on-Write Ecosystems:

A traditional relational database management system (RDBMS) requires a schema before you can load the data. To retrieve data from structured data schemas, you may have been running standard SQL queries for a number of years. Benefits include the following:
• In traditional data ecosystems, tools assume schemas and can only work once the schema is described, so there is only one view on the data.
• The approach is extremely valuable in articulating relationships between data points, so there are already relationships configured.
• It is an efficient way to store “dense” data.
• All the data is in the same data store.

On the other hand, schema-on-write isn’t the answer to every data science problem. Among the downsides of this approach are that
• Its schemas are typically purpose-built, which makes them hard to change and maintain.
• It generally loses the raw/atomic data as a source for future analysis.
• It requires considerable modeling/implementation effort before being able to work with the data.
• If a specific type of data can’t be stored in the schema, you can’t effectively process it from the schema.

Schema-on-Read Ecosystems:

This alternative data storage methodology does not require a schema before you can load the data. Fundamentally, you store the data with minimum structure. The essential schema is applied during the query phase. Benefits include the following:
• It provides flexibility to store unstructured, semi-structured, and disorganized data.
• It allows for unlimited flexibility when querying data from the structure.
• Leaf-level data is kept intact and untransformed for reference and use for the future.
• The methodology encourages experimentation and exploration.
• It increases the speed of generating fresh actionable knowledge.
• It reduces the cycle time between data generation and availability of actionable knowledge.

Schema-on-read methodology is expanded on in Chapter 6. I recommend a hybrid between schema-on-read and schema-on-write ecosystems for effective data science and engineering. I will discuss in detail why this specific ecosystem is the optimal solution when I cover the functional layer’s purpose in data science processing.
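
The contrast between the two methodologies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, and the `read_with_schema` helper are hypothetical, and an in-memory SQLite table stands in for a schema-on-write store.

```python
import json
import sqlite3

# Hypothetical raw records; the field names are illustrative only.
raw_records = [
    '{"customer": "A-001", "amount": 120.5}',
    '{"customer": "B-002", "amount": 80.0, "channel": "web"}',
]

# Schema-on-write: the table structure must exist before any data is loaded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)")
for line in raw_records:
    record = json.loads(line)
    # Fields outside the schema (here "channel") are lost at load time.
    conn.execute(
        "INSERT INTO sales (customer, amount) VALUES (?, ?)",
        (record["customer"], record["amount"]),
    )
written_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]


# Schema-on-read: keep the raw lines untouched and apply a schema per query.
def read_with_schema(lines, fields):
    """Project each raw record onto the fields the current query needs."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}


channels = [row["channel"] for row in read_with_schema(raw_records, ["customer", "channel"])]
print(written_total, channels)
```

The schema-on-write table answers the question it was modeled for, while the schema-on-read path can still recover the "channel" field, because the raw record was never discarded.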

Data Lake:

A data lake is a storage repository for a massive amount of raw data. It stores data in native format, in anticipation of future requirements. You will acquire insights from this blog on why this is extremely important for practical data science and engineering solutions. While a schema-on-write data warehouse stores data in predefined databases, tables, and records structures, a data lake uses a less restricted schema-on-read-based architecture to store data. Each data element in the data lake is assigned a distinctive identifier and tagged with a set of comprehensive metadata tags. A data lake is typically deployed using distributed data object storage, to enable the schema-on-read structure. This means that business analytics and data mining tools access the data without a complex schema. Using a schema-on-read methodology enables you to load your data as is and start to get value from it instantaneously. I will discuss and provide more details on the reasons for using a schema-on-read storage methodology in Chapters 6–11. For deployment onto the cloud, it is a cost-effective solution to use Amazon’s Simple Storage Service (Amazon S3) to store the base data for the data lake. I will demonstrate the feasibility of using cloud technologies to provision your data science work. It is, however, not necessary to access the cloud to follow the examples in this book, as they can easily be processed using a laptop.
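
As a rough sketch of the idea that every data element receives a distinctive identifier and a set of metadata tags, the following Python fragment uses a plain dictionary as a stand-in for distributed object storage. The `ingest` and `find_by_tag` functions and the tag names are assumptions made for illustration, not part of any particular data lake product.

```python
import hashlib
import uuid
from datetime import datetime, timezone

# A dictionary stands in for a distributed object store (an assumption).
lake = {}


def ingest(payload: bytes, source: str, content_type: str) -> str:
    """Store the payload in native format under a unique identifier."""
    object_id = str(uuid.uuid4())
    lake[object_id] = {
        "payload": payload,  # kept as is, with no schema applied
        "metadata": {
            "source": source,
            "content_type": content_type,
            "size_bytes": len(payload),
            "sha256": hashlib.sha256(payload).hexdigest(),
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        },
    }
    return object_id


def find_by_tag(key: str, value: str) -> list:
    """Locate objects through their metadata rather than through a schema."""
    return [oid for oid, obj in lake.items() if obj["metadata"].get(key) == value]


csv_id = ingest(b"id,name\n1,Ann\n", source="crm-export", content_type="text/csv")
json_id = ingest(b'{"sensor": 7, "temp": 21.5}', source="iot-gateway", content_type="application/json")
print(find_by_tag("source", "crm-export") == [csv_id])
```

Two payloads with completely different shapes sit side by side, and only the metadata is needed to find them again.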

Data Vault:

Data vault modeling, designed by Dan Linstedt, is a database modeling method that is intentionally structured to be in control of long-term historical storage of data from multiple operational systems. The data vaulting processes transform the schema-on-read data lake into a schema-on-write data vault. The data vault is designed into the schema-on-read query request and then executed against the data lake. I have also seen the results stored in a schema-on-write format, to persist the results for future queries. The techniques for both methods are discussed in Chapter 9. At this point, I expect you to understand only the rudimentary structures required to formulate a data vault. The structure is built from three basic data structures: hubs, links, and satellites. Let’s examine the specific data structures, to clarify why they are compulsory.

Hubs:

Hubs contain a list of unique business keys with low propensity to change. They contain a surrogate key for each hub item and metadata classification of the origin of the business key. The hub is the core backbone of your data vault, and in Chapter 9, I will discuss in more detail how and why you use this structure.

Links:

Associations or transactions between business keys are modeled using link tables. These tables are essentially many-to-many join tables, with specific additional metadata. The link is a singular relationship between hubs to ensure the business relationships are accurately recorded to complete the data model for the real-life business. In Chapter 9, I will explain how and why you would require specific relationships.

Satellites:

Hubs and links form the structure of the model but store no chronological characteristics or descriptive characteristics of the data. These characteristics are stored in appropriated tables identified as satellites. Satellites are the structures that store comprehensive levels of the information on business characteristics and are normally the largest volume of the complete data vault data structure. In Chapter 9, I will explain how and why these structures work so well to model real-life business characteristics. The appropriate combination of hubs, links, and satellites helps the data scientist to construct and store prerequisite business relationships. This is a highly in-demand skill for a data modeler. The transformation to this schema-on-write data structure is discussed in detail in Chapter 9, to point out why a particular structure supports the processing methodology. I will explain in that chapter why you require particular hubs, links, and satellites.
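
To make the three structures concrete, here is a minimal sketch using an in-memory SQLite database from Python. The table names, column names, and sample rows are hypothetical and greatly simplified; a production data vault would carry more metadata (for example, hash keys and load end dates).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hubs: unique business keys plus a surrogate key and record source.
CREATE TABLE hub_customer (
    customer_key  INTEGER PRIMARY KEY,
    business_key  TEXT NOT NULL UNIQUE,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_key   INTEGER PRIMARY KEY,
    business_key  TEXT NOT NULL UNIQUE,
    record_source TEXT NOT NULL
);
-- Link: a many-to-many association between hubs.
CREATE TABLE link_purchase (
    purchase_key  INTEGER PRIMARY KEY,
    customer_key  INTEGER NOT NULL REFERENCES hub_customer(customer_key),
    product_key   INTEGER NOT NULL REFERENCES hub_product(product_key),
    record_source TEXT NOT NULL
);
-- Satellite: descriptive and chronological characteristics of a hub.
CREATE TABLE sat_customer (
    customer_key  INTEGER NOT NULL REFERENCES hub_customer(customer_key),
    load_date     TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    PRIMARY KEY (customer_key, load_date)
);
""")

conn.execute("INSERT INTO hub_customer VALUES (1, 'CUST-001', 'crm')")
conn.execute("INSERT INTO hub_product VALUES (1, 'PROD-042', 'erp')")
conn.execute("INSERT INTO link_purchase VALUES (1, 1, 1, 'webshop')")
# A change of city adds a satellite row; the hub row never changes.
conn.executemany(
    "INSERT INTO sat_customer VALUES (?, ?, ?, ?)",
    [(1, "2022-01-01", "Ann", "Leeds"), (1, "2022-06-01", "Ann", "York")],
)

current_city = conn.execute("""
    SELECT s.city
    FROM hub_customer h
    JOIN sat_customer s ON s.customer_key = h.customer_key
    WHERE h.business_key = 'CUST-001'
    ORDER BY s.load_date DESC
    LIMIT 1
""").fetchone()[0]
history_rows = conn.execute("SELECT COUNT(*) FROM sat_customer").fetchone()[0]
print(current_city, history_rows)
```

The hub holds the stable business key, the link records the relationship, and the satellite accumulates history, which is why satellites usually hold the largest volume of data.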

Data Warehouse Bus Matrix:

The Enterprise Bus Matrix is a data warehouse planning tool and model created by Ralph Kimball and used by numerous people worldwide over the last 40+ years. The bus matrix and architecture builds upon the concept of conformed dimensions that are interlinked by facts. The data warehouse is a major component of the solution required to transform data into actionable knowledge. This schema-on-write methodology supports business intelligence against the actionable knowledge. In Chapter 10, I provide more details on this data tool and give guidance on its use.
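
A bus matrix can be pictured as a grid of business processes against the dimensions they use. The following Python sketch, with invented process and dimension names, shows one way to hold such a grid and to list the dimensions shared by more than one process, which are the candidates for conformed dimensions.

```python
# Hypothetical business processes (rows) and the dimensions they use (columns).
bus_matrix = {
    "sales": {"date", "customer", "product", "store"},
    "inventory": {"date", "product", "store"},
    "billing": {"date", "customer", "invoice"},
}


def conformed_dimensions(matrix):
    """Return the dimensions shared by at least two business processes."""
    counts = {}
    for dimensions in matrix.values():
        for dimension in dimensions:
            counts[dimension] = counts.get(dimension, 0) + 1
    return sorted(d for d, n in counts.items() if n >= 2)


def render(matrix):
    """Lay the matrix out as text: one row per process, one column per dimension."""
    dims = sorted(set().union(*matrix.values()))
    lines = ["process".ljust(12) + " ".join(d.ljust(9) for d in dims)]
    for process, used in matrix.items():
        cells = " ".join(("X" if d in used else "-").ljust(9) for d in dims)
        lines.append(process.ljust(12) + cells)
    return "\n".join(lines)


shared = conformed_dimensions(bus_matrix)
print(render(bus_matrix))
print(shared)
```

In this toy grid, "invoice" is used by a single process only, so it is not reported as shared.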

Data Science Processing Tools:

Now that I have introduced data storage, the next step involves processing tools to transform your data lakes into data vaults and then into data warehouses. These tools are the workhorses of the data science and engineering ecosystem. Following are the recommended foundations for the data tools I use.

Apache Spark:

Spark is an open source cluster computing framework. Originally developed at the AMP Lab of the University of California, Berkeley, the Spark code base was donated to the Apache Software Foundation, which now maintains it as an open source project. This tool is evolving at an incredible rate. IBM is committing more than 3,500 developers and researchers to work on Spark-related projects and formed a dedicated Spark technology center in San Francisco to pursue Spark-based innovations. SAP, Tableau, and Talend now support Spark as part of their core software stack. Cloudera, Hortonworks, and MapR distributions support Spark as a native interface. Spark offers an interface for programming distributed clusters with implicit data parallelism and fault-tolerance. Spark is a technology that is becoming a de-facto standard for numerous enterprise-scale processing applications. I discovered the following modules using this tool as part of my technology toolkit.

Spark Core:

Spark Core is the foundation of the overall development. It provides distributed task dispatching, scheduling, and basic I/O functionalities. This enables you to offload the comprehensive and complex running environment to the Spark Core. This safeguards that the tasks you submit are accomplished as anticipated. The distributed nature of the Spark ecosystem enables you to use the same processing request on a small Spark cluster, then on a cluster of thousands of nodes, without any code changes. In Chapter 10, I will discuss how you accomplish this.

Spark SQL:

Spark SQL is a component on top of the Spark Core that presents a data abstraction called DataFrames. Spark SQL provides a domain-specific language (DSL) to manipulate DataFrames. This feature of Spark eases the transition from your traditional SQL environments into the Spark environment. I have recognized its advantage when you want to enable legacy applications to offload the data from their traditional relational-only data storage to the data lake ecosystem.

Spark Streaming:

Spark Streaming leverages Spark Core’s fast scheduling capability to perform streaming analytics. Spark Streaming has built-in support to consume from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets. The process of streaming is the primary technique for importing data from the data source to the data lake. Streaming is becoming the leading technique to load from multiple data sources. I have found that there are connectors available for many data sources. There is a major drive to build even more improvements on connectors, and this will improve the ecosystem even further in the future.

MLlib Machine Learning Library:

Spark MLlib is a distributed machine learning framework used on top of the Spark Core by means of the distributed memory-based Spark architecture.

In Spark 2.0, a new library, spark.ml, was introduced to replace the RDD-based data processing with a DataFrame-based model. It is planned that by the introduction of Spark 3.0, only DataFrame-based models will exist. Common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines, including

• Dimensionality reduction techniques, such as singular value decomposition (SVD) and principal component analysis (PCA)

• Summary statistics, correlations, stratified sampling, hypothesis testing, and random data generation

• Collaborative filtering techniques, including alternating least squares (ALS)

• Classification and regression: support vector machines, logistic regression, linear regression, decision trees, and naive Bayes classification

• Cluster analysis methods, including k-means and latent Dirichlet allocation (LDA)

• Optimization algorithms, such as stochastic gradient descent and limited-memory BFGS (L-BFGS)

• Feature extraction and transformation functions

GraphX:

GraphX is a powerful graph-processing application programming interface (API) for the Apache Spark analytics engine that can draw insights from large data sets. GraphX provides outstanding speed and capacity for running massively parallel and machine-learning algorithms. The introduction of the graph-processing capability enables the processing of relationships between data entries with ease. In Chapters 9 and 10, I will discuss the use of a graph database to support the interactions of the processing through the layers.

Mesos:

Apache Mesos is an open source cluster manager that was developed at the University of California, Berkeley. It delivers efficient resource isolation and sharing across distributed applications. The software enables resource sharing in a fine-grained manner, improving cluster utilization. The Enterprise version of Mesos is Mesosphere Enterprise DC/OS. This runs containers elastically, and data services support Kafka, Cassandra, Spark, and Akka. In microservices architecture, I aim to construct a service that spawns granularity, processing units and lightweight protocols through the layers. In Chapter 6, I will discuss the use of fine-grained microservices know-how to support data processing through the framework.

Akka:

The toolkit and runtime methods shorten development of large-scale data-centric applications for processing. Akka is an actor-based message-driven runtime for running concurrency, elasticity, and resilience processes. The use of high-level abstractions such as actors, streams, and futures facilitates the data science and engineering granularity processing units. The use of actors enables the data scientist to spawn a series of concurrent processes by using a simple processing model that employs a messaging technique and specific predefined actions/behaviors for each actor. This way, the actor can be controlled and limited to perform the intended tasks only. In Chapters 7–11, I will discuss the use of different fine-grained granularity processes to support data processing throughout the framework.
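
The actor idea described above, a unit with a mailbox and one predefined behavior, can be approximated in plain Python. This sketch is not the Akka API: the `Actor` class, its `tell` and `stop` methods, and the `accumulate` behavior are hypothetical names, and a thread with a queue stands in for the Akka runtime.

```python
import queue
import threading

_STOP = object()  # sentinel that tells an actor to shut down


class Actor:
    """A minimal actor: private mailbox, one behavior, one worker thread."""

    def __init__(self, behavior):
        self._mailbox = queue.Queue()
        self._behavior = behavior
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Send a message without waiting for it to be processed."""
        self._mailbox.put(message)

    def stop(self):
        """Process everything already queued, then stop the actor."""
        self._mailbox.put(_STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is _STOP:
                break
            # Messages are handled one at a time, in arrival order.
            self._behavior(message)


totals = {"count": 0, "sum": 0}


def accumulate(message):
    # Only the actor's own thread touches this state, so no lock is needed.
    totals["count"] += 1
    totals["sum"] += message


counter = Actor(accumulate)
for value in [3, 4, 5]:
    counter.tell(value)
counter.stop()
print(totals)
```

Because each actor handles its messages sequentially and owns its state, many such actors can run concurrently without sharing locks, which is the property the paragraph above relies on.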

Cassandra:

Apache Cassandra is a large-scale distributed database supporting multi–data center replication for availability, durability, and performance. I use DataStax Enterprise (DSE) mainly to accelerate my own ability to deliver realtime value at epic scale, by providing a comprehensive and operationally simple data management layer with a unique always-on architecture built in Apache Cassandra. The standard Apache Cassandra open source version works just as well, minus some extra

Kafka:

This is a high-scale messaging backbone that enables communication between data processing entities. The Apache Kafka streaming platform, consisting of Kafka Core, Kafka Streams, and Kafka Connect, is the foundation of the Confluent Platform. The Confluent Platform is the main commercial supporter for Kafka (see www.confluent.io/). Most of the Kafka projects I am involved with now use this platform. Kafka components empower the capture, transfer, processing, and storage of data streams in a distributed, fault-tolerant manner throughout an organization in real time.
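
One way to picture how a messaging backbone both transfers and stores a data stream is an append-only log that consumers read at their own pace. The Python sketch below is a single-process simplification and not the Kafka client API; `TopicLog` and `Consumer` are hypothetical classes, and distribution, partitioning, and fault tolerance are deliberately left out.

```python
class TopicLog:
    """An append-only list of records, addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset, without removing them."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Tracks its own position in the log, independent of other consumers."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self._log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = TopicLog()
for event in ["order-created", "order-paid", "order-shipped"]:
    log.append(event)

billing = Consumer(log)
audit = Consumer(log)
first_batch = billing.poll(max_records=2)
audit_batch = audit.poll()
second_batch = billing.poll()
print(first_batch, second_batch, audit_batch)
```

Reading does not delete anything, so a slow consumer and a fast consumer can work through the same stream independently.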

Kafka Core:

At the core of the Confluent Platform is Apache Kafka. Confluent extends that core to make configuring, deploying, and managing Kafka less complex.

Kafka Streams:

Kafka Streams is an open source solution that you can integrate into your application to build and execute powerful stream-processing functions.

Kafka Connect:

This ensures Confluent-tested and secure connectors for numerous standard data systems. Connectors make it quick and stress-free to start setting up consistent data pipelines. These connectors are completely integrated with the platform, via the schema registry. Kafka Connect enables the data processing capabilities that accomplish the movement of data into the core of the data solution from the edge of the business ecosystem. In Chapters 7–11, I will discuss the use of this messaging pipeline to stream data through the configuration.

Elasticsearch:

Elasticsearch is a distributed, open source search and analytics engine designed for horizontal scalability, reliability, and stress-free management. It combines the speed of search with the power of analytics, via a sophisticated, developer-friendly query language covering structured, unstructured, and time-series data. In Chapter 11, I will discuss the use of Elasticsearch to categorize data within the framework.

R

R is a programming language and software environment for statistical computing and graphics. The R language is widely used by data scientists, statisticians, data miners, and data engineers for developing statistical software and performing data analysis. The capabilities of R are extended through user-created packages using specialized statistical techniques and graphical procedures. A core set of packages is contained within the core installation of R, with additional packages accessible from the Comprehensive R Archive Network (CRAN). Knowledge of the following packages is a must:

• sqldf (data frames using SQL): This package can read a file into R while filtering data with an SQL statement. Only the filtered part is processed by R, so files larger than those R can natively import can be used as data sources.

• forecast (forecasting of time series): This package provides forecasting functions for time series and linear models.

• dplyr (data aggregation): Tools for splitting, applying, and combining data within R

• stringr (string manipulation): Simple, consistent wrappers for common string operations

• RODBC, RSQLite, and RCassandra database connection packages: These are used to connect to databases, manipulate data outside R, and enable interaction with the source system.

• lubridate (time and date manipulation): Makes dealing with dates easier within R

• ggplot2 (data visualization): Creates elegant data visualizations, using the grammar of graphics. This is a super-visualization capability.

• reshape2 (data restructuring): Flexibly restructures and aggregates data, using just two functions: melt and dcast (or acast).

• randomForest (random forest predictive models): Leo Breiman and Adele Cutler’s random forests for classification and regression

• gbm (generalized boosted regression models): Yoav Freund and Robert Schapire’s AdaBoost algorithm and Jerome Friedman’s gradient boosting machine

I will discuss each of these packages as I guide you through the blog. In Chapter 6, I will discuss the use of R to process the sample data within the sample framework. I will provide examples that demonstrate the basic ideas and engineering behind the framework and the tools. Please note that there are many other packages in CRAN, which is growing on a daily basis. Investigating the different packages to improve your capabilities in the R environment is time well spent.

Scala

Scala is a general-purpose programming language. Scala supports functional programming and a strong static type system. Many high-performance data science frameworks are constructed using Scala, because of its amazing concurrency capabilities. Parallelizing masses of processing is a key requirement for large data sets from a data lake. Scala is emerging as the de-facto programming language used by data-processing tools. I provide guidance on how to use it, in the course of this book. Scala is also the native language for Spark, and it is useful to master this language.

Python

Python is a high-level, general-purpose programming language created by Guido van Rossum and released in 1991. It is important to note that it is an interpreted language: Python has a design philosophy that emphasizes code readability. Python uses a dynamic type system and automatic memory management and supports multiple programming paradigms (object-oriented, imperative, functional programming, and procedural). Thanks to its worldwide success, it has a large and comprehensive standard library. The Python Package Index (PyPI) (https://pypi.python.org/pypi) supplies thousands of third-party modules ready for use for your data science projects. I provide guidance on how to use it, in the course of this blog. I suggest that you also install Anaconda. It is an open source distribution of Python that simplifies package management and deployment of features (see www.continuum.io/downloads).

MQTT (MQ Telemetry Transport)

MQTT stands for MQ Telemetry Transport. It is an extremely simple and lightweight publish-and-subscribe messaging protocol. It was intended for constrained devices and low-bandwidth, high-latency, or unreliable networks. This protocol is perfect for machine-to-machine (M2M) or Internet-of-things-connected devices. MQTT-enabled devices include handheld scanners, advertising boards, footfall counters, and other machines. In Chapter 7, I will discuss how and where you can use MQTT technology and how to make use of the essential benefits it generates. The apt use of this protocol is critical in the present and future data science environments. In Chapter 11, I will discuss the use of MQTT for data collection and distribution back to the business.
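
The publish-and-subscribe pattern can be sketched without any network at all. The Python fragment below is an in-process illustration, not an MQTT client: the `Broker` class and the topic names are hypothetical. The `topic_matches` function follows the MQTT convention that `+` matches exactly one topic level and `#` matches all remaining levels, and it leaves out special cases such as topics beginning with `$`.

```python
def topic_matches(topic_filter, topic):
    """Check a topic against a filter using MQTT-style + and # wildcards."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True  # multi-level wildcard: matches everything from here on
        if i >= len(topic_levels):
            return False  # filter is longer than the topic
        if level != "+" and level != topic_levels[i]:
            return False  # literal level differs
    return len(filter_levels) == len(topic_levels)


class Broker:
    """Routes each published message to every subscriber whose filter matches."""

    def __init__(self):
        self._subscriptions = []

    def subscribe(self, topic_filter, callback):
        self._subscriptions.append((topic_filter, callback))

    def publish(self, topic, payload):
        for topic_filter, callback in self._subscriptions:
            if topic_matches(topic_filter, topic):
                callback(topic, payload)


received = []
broker = Broker()
# Hypothetical topic layout: store/<store id>/<device type>
broker.subscribe("store/+/footfall", lambda topic, payload: received.append((topic, payload)))
broker.publish("store/12/footfall", 42)
broker.publish("store/12/scanner", "item-9981")
print(received)
```

The footfall counter never needs to know who is listening; it only publishes to a topic, which is what keeps the device side so lightweight.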

What’s Next?

As things change daily in the current ecosphere of ever-evolving and -increasing collections of tools and technological improvements to support data scientists, feel free to investigate technologies not included in the preceding lists. I have acquainted you with my toolbox. This Data Science Technology Stack has served me well, and I will show you how to use it to achieve success.

Note
My hard-earned advice is to practice with your tools. Make them your own! Spend time with them, cultivate your expertise.