Assignment – Preprocessing Data for scikit-learn
Very often, we’re tasked with taking data in one form and transforming it for easier downstream analysis. In this
assignment, you’ll use what you’ve learned in the course to prepare data for predictive analysis in Project 4.
Mushrooms Dataset. A famous—if slightly moldy—dataset about mushrooms can be found in the UCI repository
here: https://archive.ics.uci.edu/ml/datasets/Mushroom. The fact that this is such a well-known dataset in the
data science community has made it a good dataset to use for comparative benchmarking. For example, if someone
were working to build a better decision tree algorithm (or other predictive classifier) to analyze categorical data,
this dataset could be useful. In Project 4, we’ll use scikit-learn to answer the question, “Which other attribute
or attributes are the best predictors of whether a particular mushroom is poisonous or edible?”
Your assignment is to:
• First study the dataset and the associated description of the data (i.e., the “data dictionary”). You may need to
look around a bit, but it’s there!
• Create a pandas DataFrame with a subset of the columns in the dataset. You should include the column
that indicates edible or poisonous, the column that includes odor, and at least one other column of your
choosing.
• Add meaningful names for each column.
• Replace the codes used in the data with numeric values. For example, in the first “target” column, “e” might
become 0 and “p” might become 1. This is because your downstream processing in Project 4 using
scikit-learn requires that values be stored as numerics.
• Perform exploratory data analysis: show the distribution of data for each of the columns you selected, and
show scatterplots for edible/poisonous vs. odor as well as the other column that you selected.
• Include some text describing your preliminary conclusions about whether either of the other columns
could be helpful in predicting if a specific mushroom is edible or poisonous.
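The subsetting, renaming, and recoding steps can be sketched roughly as follows. This is only a minimal sketch: it assumes the data file is comma-separated with no header row, with the edible/poisonous label in column 0 and odor in column 5 (check both against the data dictionary). The sample rows, the column names, and the choice of column 3 as the third column are illustrative assumptions, not part of the assignment.

```python
import io
import pandas as pd

# A few made-up rows in a header-less, comma-separated layout
# (illustrative sample only; these are not real records).
SAMPLE = """p,x,s,n,t,p
e,x,s,y,t,a
e,b,s,w,t,l
p,x,y,w,t,p
e,x,s,g,f,n
"""

# Assumed positions: 0 = edible/poisonous, 3 = cap color, 5 = odor.
raw = pd.read_csv(io.StringIO(SAMPLE), header=None, usecols=[0, 3, 5])

# Meaningful (hypothetical) column names.
raw.columns = ["is_poisonous", "cap_color", "odor"]

encoded = raw.copy()

# Explicit mapping for the target column: "e" becomes 0 and "p" becomes 1.
encoded["is_poisonous"] = raw["is_poisonous"].map({"e": 0, "p": 1})

# For the other columns, let pandas assign one integer per distinct code.
# Codes follow the alphabetical order of the letters and carry no ranking.
for col in ["cap_color", "odor"]:
    encoded[col] = raw[col].astype("category").cat.codes

print(encoded)
```

In the notebook, `io.StringIO(SAMPLE)` would be replaced by the path or URL of the actual data file. An explicit dictionary for every column, built from the data dictionary, is an equally reasonable choice and makes the numeric codes easier to interpret later.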
Your deliverable is a Jupyter Notebook that performs these transformation and exploratory data analysis tasks.
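For the exploratory part, one possible starting point is to tabulate the data before plotting it. The sketch below uses a tiny made-up frame with hypothetical column names and already-encoded values. Because both variables are categorical, a scatterplot will stack many points on the same coordinates, so a cross-tabulation can be a helpful companion to the required plots.

```python
import pandas as pd

# Tiny made-up frame, already encoded (0 = edible, 1 = poisonous).
df = pd.DataFrame({
    "is_poisonous": [1, 0, 0, 1, 0, 1],
    "odor":         [3, 0, 1, 3, 2, 3],
})

# Distribution of each selected column.
odor_counts = df["odor"].value_counts().sort_index()
class_counts = df["is_poisonous"].value_counts().sort_index()

# Rows are odor codes, columns are the class label, cells are row counts.
odor_vs_class = pd.crosstab(df["odor"], df["is_poisonous"])

# Share of poisonous mushrooms for each odor code.
poison_rate = df.groupby("odor")["is_poisonous"].mean()

print(odor_vs_class)
print(poison_rate)
```

An odor code whose poison rate is close to 0 or 1 separates the classes well; a rate near the overall average suggests that value tells you little.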
If you are working in a group, you also have the option of replacing the mushroom dataset in the assignment with a
different data set that your group members might find more interesting.
You should post the Jupyter Notebook (.ipynb) file in your GitHub repository, and provide the appropriate URL to
your GitHub repository in your assignment link. You should also have the original data file accessible through your
code—for example, read directly from the UCI repository or stored in a GitHub repository.
IS 362 Assignment – Preprocessing Data for scikit-learn
Chapter 10 – System Architecture
Copyright ©2017 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Provide a checklist of issues to consider
when selecting a system architecture
Trace the evolution of system architecture
from mainframes to current designs
Explain client/server architecture, including
tiers, cost-benefit issues, and performance
Compare in-house ecommerce
development with packaged solutions and
service providers
Discuss the impact of cloud computing and
Web 2.0
Define network topology, including
hierarchical, bus, ring, star, and mesh
models
Describe wireless networking, including
wireless standards, topologies, and trends
Describe the system design specification
Issues that influence the architecture choice
◦ Corporate organization and culture
◦ Enterprise resource planning (ERP)
◦ Initial and total cost of ownership (TCO)
◦ Scalability
◦ Web integration
◦ Legacy system interface requirements
◦ Processing options
◦ Security issues
◦ Corporate portals
Corporate Organization and Culture
◦ A successful system performs well in a company’s
organization and culture
Enterprise resource planning (ERP)
◦ Objective – To establish a company-wide strategy
for using IT that includes a specific architecture,
standards for data, processing, network, and user
interface design
FIGURE 10-1 Oracle offers ERP
solutions as a cloud-based service.
Source: Oracle
◦ Companies are extending internal ERP systems to
their suppliers and customers, using supply chain
management (SCM)
Initial Cost and TCO
◦ TCO includes tangible purchases, fees, and
contracts called hard costs
◦ TCO analysis answers questions about the validity,
effectiveness, and new trends in systems planning
The answers may affect the initial cost and TCO for a
proposed system
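As a rough illustration of how initial cost and TCO differ, the sketch below adds recurring hard costs over an assumed service life to the purchase price. The function name, the cost categories, and every figure are made up for the example; a real TCO analysis would include many more items.

```python
def total_cost_of_ownership(initial_cost, annual_costs, years):
    """Initial cost plus recurring annual hard costs over the system's life.

    annual_costs maps a (hypothetical) cost category to its yearly amount.
    """
    if years < 0:
        raise ValueError("years must not be negative")
    return initial_cost + years * sum(annual_costs.values())


# Made-up figures: two proposals with different cost profiles.
proposal_a = total_cost_of_ownership(
    50_000, {"support": 6_000, "licenses": 4_000}, years=5)
proposal_b = total_cost_of_ownership(
    20_000, {"support": 9_000, "licenses": 9_000}, years=5)

print(proposal_a, proposal_b)
```

With these made-up numbers, the proposal with the lower initial cost ends up with the higher five-year TCO, which is why the two figures are considered together.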
Scalability (Extensibility)
◦ A system’s ability to expand, change, or downsize
easily to meet the changing needs of a business
enterprise
Web Integration
◦ A web-centric architecture enables a company to
integrate new applications into its ecommerce
strategy
Legacy Systems
◦ A new system might have to interface with legacy
systems
Involves analysis of data formats and compatibility
Processing Options
◦ Systems can process data online or in batches
Security Issues
◦ Analysts must consider security issues and how the
company will address them
Corporate Portals
◦ Provide access for customers, employees, suppliers,
and the public
◦ A well-designed portal can:
Integrate with various other systems
Provide a consistent look and feel across
organizational divisions
Functions of a business information system
◦ Manage applications that perform the processing
logic
◦ Handle data storage and access
◦ Provide an interface that allows users to interact
with the system
While planning system design:
◦ Determine where the functions will be carried out
◦ Identify the advantages and disadvantages of each
design approach
Mainframe Architecture
◦ Server: A computer that supplies data, processing
services or other support to one or more computers
called clients
◦ Earliest servers - Mainframe computers
All data input and output occurred at a central location
◦ Advances in technology
enabled installation of
terminals at remote locations
FIGURE 10-3 In a centralized design, the remote
user’s keystrokes are transmitted to the mainframe,
which responds by sending screen output back to the
user’s screen.
Impact of the Personal Computer
◦ Individuals could work in stand-alone mode
The workstation performed all the functions of a server
◦ Less dependence on IT assistance resulted in increased
productivity in certain tasks
Absence of a central storage location raised concerns
about data security, integrity, and consistency
Network Evolution
◦ Local area network (LAN): Allows sharing of data
and hardware resources
◦ Wide area network (WAN): Spans long distances and
can connect LANs that are continents apart
FIGURE 10-4 A LAN allows sharing of data and
hardware, such as printers and scanners.
FIGURE 10-5 A WAN can connect many LANs
and link users who are continents apart.
Client/Server Architecture
◦ Includes systems that divide processing between
one or more networked clients and a central server
Client handles the entire user interface
Server stores data and provides data access and
database management functions
FIGURE 10-6 In a client/server
design, data is stored and usually
processed on the server.
FIGURE 10-7 Comparison of the characteristics of client/server and
mainframe systems.
The Client’s Role
◦ Client/server relationship must specify how the
processing will be divided between the client and the
server
◦ Fat client (thick client) design: Locates all or most of
the application processing logic at the client
◦ Thin client design: Locates all or most of the
processing logic at the server
Provides better performance as the program code
resides on the server
Client/Server Tiers
◦ Two-tier design
User interface resides on the client
Data resides on the server
Application logic can run either on the server or on the
client, or be divided between the client and the server
◦ Three-tier (n-tier) design
User interface runs on the client
Data is stored on the server
Has a middle layer between the client and server
Processes the client requests and translates them into
data access commands
Considered an application server
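The three-tier split can be pictured with a small sketch. Everything here runs in one process and all class names, function names, and data are hypothetical; in a real system the tiers would normally sit on separate machines and talk over a network.

```python
# Data tier: stands in for a database server (here just an in-memory dict).
class DataServer:
    def __init__(self):
        self._rows = {101: {"name": "Widget", "stock": 7}}

    def select(self, item_id):
        # Data access command: return a copy of the stored row, or None.
        row = self._rows.get(item_id)
        return dict(row) if row else None


# Middle tier: the application server receives client requests and
# translates them into data access commands.
class ApplicationServer:
    def __init__(self, data_server):
        self._data = data_server

    def handle(self, request):
        if request.get("action") != "check_stock":
            return {"ok": False, "error": "unknown action"}
        row = self._data.select(request["item_id"])
        if row is None:
            return {"ok": False, "error": "no such item"}
        return {"ok": True,
                "message": f"{row['name']}: {row['stock']} in stock"}


# Client tier: user interface only. It never touches the data directly.
def client_view(app_server, item_id):
    reply = app_server.handle({"action": "check_stock", "item_id": item_id})
    return reply["message"] if reply["ok"] else f"Error: {reply['error']}"


app = ApplicationServer(DataServer())
print(client_view(app, 101))
print(client_view(app, 999))
```

Moving the body of `handle` into `client_view` would turn this into a two-tier design with the application logic on the client.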
FIGURE 10-8 Characteristics of two-tier versus three-tier client/server design.
FIGURE 10-9 The location of the data, the application logic, and the user
interface depend on the type of architecture.
Middleware
◦ Enables communication between the tiers
◦ Referred to as glueware
Used to connect two or more software components in a
federated system architecture
◦ Integrates legacy systems and Web-based and/or
cloud applications
◦ Represents the slash in the term client/server
Cost-Benefit Issues
◦ Client/server systems offer the best combination of
features to meet information system requirements
Enable firms to scale the system according to the
environment
Enable transfer of applications from expensive
mainframes to less-expensive client platforms
Reduce workload and improve response times
Performance Issues
◦ Knee of the curve
Response time to requests increases significantly as
the system nears its capacity
◦ Client should contact the server only when
necessary in a client/server system
◦ Distributed database management system (DDBMS)
helps improve client/server performance
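One way to see the knee of the curve numerically is with a simple single-server queueing formula, in which average response time is service time divided by (1 - utilization). That model is an assumption brought in for illustration only (the slides do not name one), and the service time below is made up.

```python
def response_time(service_time, utilization):
    """Average response time under a simple single-server queue model.

    utilization is the fraction of capacity in use, from 0 up to
    (but not including) 1.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


SERVICE_TIME = 0.1  # seconds per request when the server is idle (made up)

curve = {u: response_time(SERVICE_TIME, u)
         for u in (0.5, 0.8, 0.9, 0.95, 0.99)}

for load, seconds in curve.items():
    print(f"{load:.0%} of capacity -> {seconds:.2f} s")
```

Under this model, going from 50% to 80% load adds far less delay than going from 90% to 99%, which is the sharp bend the term describes.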
In an Internet-based architecture, the entire
user interface is provided by the web server
in the form of HTML documents
◦ Shifting the responsibility for the interface from the
client to the server simplifies data transmission and
results in lower hardware cost and complexity
Cloud Computing
◦ The concept envisions a cloud of remote computers
providing a total online software and data
environment that is hosted by
third parties
◦ Eliminates compatibility issues and
provides scaling on demand
FIGURE 10-10 Cloud computing
Web 2.0
◦ Second generation of the web
Enables people to collaborate, interact, and share
information more dynamically
◦ Considered a step towards the semantic web
◦ Wiki: Web-based repository of information
Run by social collaboration
◦ Users collaborate and add new layers of information
to the Internet operating system
In-House Solutions
◦ Benefits
A unique website, with a look and feel consistent with
the company’s other marketing efforts
Complete control over the organization of the site
A scalable structure to handle increases in sales and
product offerings in the future
More flexibility to modify and manage the site
The opportunity to integrate the firm’s web-based
business systems with its other information systems
FIGURE 10-11 Guidelines
for companies developing
ecommerce strategies.
Packaged Solutions
◦ Viable alternative for medium- to large-sized firms
◦ Less complex than an in-house effort
Service Providers
◦ Application service provider (ASP) - Provides
applications or access to applications by charging a
fee
Many ASPs offer full-scale Internet business services
for companies that decide to outsource functions
Online Processing
◦ Online systems handle transactions when and
where they occur
Output is provided directly to users
◦ Avoids delays and allows a constant dialog between
the user and the system
◦ Can be used with file-oriented systems
FIGURE 10-13 When a customer
requests a balance, the ATM
system verifies the account number,
submits the query, retrieves the
current balance, and displays the
balance on the ATM screen.
Batch Processing: Still With Us After All These
Years
◦ Data is managed in groups
◦ Advantages
Tasks can be planned and run on a predetermined
schedule without user involvement
Programs that require major network resources can
run when costs and impact on other traffic will be
lowest
Well-suited to address security, audit, and privacy
concerns
Real-World Examples
◦ Point of Sale (POS) terminals
FIGURE 10-14 Many retailers use a combination of online and batch processing. When a salesperson
enters the sale on the POS terminal, the online system retrieves data from the item file, updates the
quantity in stock, and produces a sales transaction record. At the end of the day, a batch processing
program produces a daily sales report and updates the accounting system.
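The combination of online and batch processing in the POS example might look roughly like this in code. The item identifiers, prices, and function names are made up, and the accounting-system update mentioned in the caption is left out.

```python
from collections import defaultdict

# Item file: price and quantity in stock for each (hypothetical) item.
item_file = {
    "A100": {"price": 2.50, "stock": 10},
    "B200": {"price": 10.00, "stock": 4},
}
transactions = []  # sales transaction records written by the online step


def record_sale(item_id, qty):
    """Online step: runs immediately, once per sale at the POS terminal."""
    item = item_file[item_id]            # retrieve data from the item file
    if qty > item["stock"]:
        raise ValueError("insufficient stock")
    item["stock"] -= qty                 # update the quantity in stock
    txn = {"item_id": item_id, "qty": qty, "amount": qty * item["price"]}
    transactions.append(txn)             # produce a sales transaction record
    return txn


def daily_sales_report(records):
    """Batch step: runs once at end of day over all stored records."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["item_id"]] += rec["amount"]
    return dict(totals)


# Three sales during the day (online), then one report at closing (batch).
record_sale("A100", 2)
record_sale("B200", 1)
record_sale("A100", 3)
report = daily_sales_report(transactions)
print(report)
```

The stock level is correct after every sale, while the report is only as fresh as the last batch run; that trade-off is the practical difference between the two processing options.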
The Open Systems Interconnection (OSI)
Model
◦ Describes how data moves from an application on
one computer to an application on another
networked computer
◦ Provides physical design standards that assure
seamless network connectivity, regardless of the
specific hardware environment
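The idea of data moving down through the layers on one computer and back up on another can be shown with a toy model. The seven layer names are the standard OSI ones; the bracketed text headers and the function names are purely illustrative, since real protocols use binary headers and not every layer adds one.

```python
# The seven OSI layers, from layer 7 (top) down to layer 1 (bottom).
OSI_LAYERS = ["application", "presentation", "session", "transport",
              "network", "data link", "physical"]


def send(message):
    """Sender: pass the message down the stack, wrapping it at each layer."""
    frame = message
    for layer in OSI_LAYERS:
        frame = f"[{layer}]{frame}"
    return frame


def receive(frame):
    """Receiver: pass the frame up the stack, unwrapping one layer at a time."""
    for layer in reversed(OSI_LAYERS):
        header = f"[{layer}]"
        if not frame.startswith(header):
            raise ValueError(f"missing {layer} header")
        frame = frame[len(header):]
    return frame


wire = send("hello")
print(wire)
print(receive(wire))
```

Each layer on the receiving side only has to understand the wrapper added by its counterpart on the sending side, which is what lets different hardware interoperate.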
Network Topology
◦ Topology - Physical or logical view of the network
Physical topology: Actual network cabling and
connections
Logical topology: Describes the way the components
interact
◦ Hierarchical network
Departmental servers control lower levels of processing
and network devices
◦ Bus network
A single communication path connects the central server,
departmental servers, workstations, and peripheral
devices
◦ Ring network
Resembles a circle where the data flows in only one
direction from one device to the next
◦ Star network
Has a central networking device called a switch which
manages the network and acts as a communications
conduit for all network traffic
◦ Mesh network
Each node connects to every
other node
FIGURE 10-15 Although these computers form a physical circle,
the physical layout has no bearing on the network topology,
which might be a bus, ring, star, or other logical design.
FIGURE 10-16 A
hierarchical network with a
single server that controls
the network.
FIGURE 10-17 A bus
network with all devices
connected to a single
communication path.
FIGURE 10-18 A ring network with a set of computers
that send and receive data flowing in one direction.
FIGURE 10-19 A typical star network with a
switch, a departmental server, and connected
computers and workstations.
FIGURE 10-20 A mesh network is used in
situations where a high degree of redundancy is
needed, such as military applications. The
redundant design provides alternate data paths,
but is expensive to install and maintain.
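The cost of a mesh becomes clearer when you count links. The sketch below is a simplification that counts point-to-point connections only: it assumes a full mesh, a star in which every device has one link to the central switch, and a ring in which each device links to the next. The function name is hypothetical.

```python
def link_count(topology, devices):
    """Number of links needed to connect the given number of devices."""
    if devices < 3:
        raise ValueError("use at least 3 devices")
    if topology == "ring":
        return devices                        # each device to the next one
    if topology == "star":
        return devices                        # each device to the switch
    if topology == "mesh":
        return devices * (devices - 1) // 2   # every pair of devices
    raise ValueError(f"unknown topology: {topology}")


for n in (5, 10, 50):
    print(n, link_count("star", n), link_count("mesh", n))
```

Star and ring grow in step with the number of devices, while a full mesh grows with roughly the square of it, so doubling the devices about quadruples the links to install and maintain.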
Network Devices
◦ LANs or WANs can be interconnected using routers
Router: Connects network segments, determines the
most efficient data path, and guides the flow of data
Proxy server: Provides Internet connectivity for internal
LAN users
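To illustrate what "determines the most efficient data path" can mean, the sketch below finds the lowest-cost route through a small made-up network using Dijkstra's shortest-path algorithm. This is a teaching example, not a description of how any particular router works; the node names, link costs, and function name are all assumptions.

```python
import heapq


def cheapest_path(links, source, target):
    """Return (total_cost, path) for the lowest-cost route, or None."""
    # Build an undirected adjacency list from the link table.
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    best = {source: 0}       # lowest known cost to reach each node
    previous = {}            # node we arrived from on that best route
    queue = [(0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue         # stale queue entry; a cheaper route was found
        for neighbor, cost in graph.get(node, []):
            new_dist = dist + cost
            if new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                previous[neighbor] = node
                heapq.heappush(queue, (new_dist, neighbor))

    if target not in best:
        return None
    # Walk backwards from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]


# Made-up network segments with a cost for each link.
LINKS = {("A", "B"): 4, ("A", "C"): 1, ("C", "B"): 2,
         ("B", "D"): 5, ("C", "D"): 8}

print(cheapest_path(LINKS, "A", "D"))
```

Here the direct-looking routes through B or C alone each cost 9, while the three-hop route A, C, B, D costs 8, so the path with the fewest hops is not always the most efficient one.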
Modeling Tools
◦ Microsoft Visio – Used to represent the physical
structure and network components of a system
◦ Creately.com – Offers network diagram drawing
capabilities that are completely web-based
FIGURE 10-21 Routers can be used to create gateways
between different network topologies and large, dissimilar
networks such as the Internet.
FIGURE 10-22 Creately is a web-based
software application for creating network
diagrams.
Source: © 2008-2015 Cinergix Pty. Ltd.
Wireless Network Standards
◦ IEEE 802.11 is a family of standards developed by
the Institute of Electrical and Electronics Engineers
(IEEE) for wireless LANs
◦ Current wireless networks are based on variations
of the original 802.11 standard
802.11g and 802.11n offered increased bandwidth
and were widely accepted by the IT industry
Current standards, such as 802.11ac, use multiple
input/multiple output (MIMO) technology to boost
performance
Wireless Network Topologies
◦ Network topologies available for IEEE 802.11 WLANs
The Basic Service Set (BSS) or the infrastructure mode
Contains a central wireless device called an access point
or wireless access point (WAP) to serve all wireless clients
Extended Service Set (ESS)
Comprises two or more Basic Service Set networks
Wireless access can be expanded over a larger area
Wireless Trends
◦ Wi-Fi Alliance
A non-profit international association that certifies
interoperability of wireless network products based on
IEEE 802.11 specifications
Products that meet the requirements are certified as Wi-Fi
(wireless fidelity) compatible
Disadvantage - Wireless transmissions are less secure
◦ Bluetooth is used for short-distance wireless
communication
◦ IEEE is working on the 802.16 standards, known as WiMAX
WiMAX: Broadband wireless communications
protocols for MANs (metropolitan area networks)
System architecture marks the end of the
systems design phase of the SDLC
Final activities in the systems design phase
◦ Preparing a system design specification
◦ Obtaining user approval
◦ Delivering a presentation to management
System Design Specification
◦ Document that presents the complete design for a
new information system
Contains detailed costs, staffing, and scheduling for
completing the next SDLC phase
◦ Used as a baseline to measure the operational
system
◦ Sections in a system design specification
Management summary
System components
System environment
Implementation requirements
Time and cost estimates
Additional material
User Approval
◦ Users must review and approve the interface
design, report and menu designs, data entry
screens, source documents, and other areas of the
system that affect them
Ensures that approvals are obtained as and when
required
Keeps the users involved with the system’s
development
Provides feedback that can be used to guide efforts
◦ System design specification should be reviewed by
other IT department members as well
Presentations
◦ Provide an opportunity to explain the system,
answer questions, consider comments, and secure
final approval
The first presentation is to the systems analysts,
programmers, and technical support staff members
who will be involved in future project phases or
operational support for the system
Next presentation is to the department managers and
users from departments affected by the system
Final presentation is delivered to management
◦ Management will reach a decision based on the
presentation
An information system combines hardware,
software, data, procedures, and people into a
system architecture
Before selecting an architecture, the analyst
must consider enterprise resource planning,
initial cost and TCO, scalability, Web
integration, legacy interface requirements,
processing options, security issues, and
corporate portals
ERP establishes an enterprise-wide strategy
for IT resources and specific standards for
data, processing, network, and user interface
design
A system architecture requires servers and
clients
◦ Client/server architecture divides processing
between one or more clients and a central server
A thick client design places all or most of the
application processing logic at the client
A thin client design places all or most of the
processing logic at the server
Client/server designs can be two- or three-tier
The Internet has had an enormous impact on
system architecture
The most prevalent processing method today
is online processing
Networks allow the sharing of hardware,
software, and data resources in order to
reduce expenses and provide more capability
to users
The way a network is configured is called the
network topology
The system design specification presents the
complete systems design for an information
system and is the basis for the presentations
that complete the systems design phase