Asked: May 3rd, 2020

Question Description

I can provide the graphics and code; I need someone to analyze the data for me.

There are two datasets that you need to analyze. Here are the links where you can find the data:

Seed data:

Automobile data:

Topics for project:

1) How to compare multiple labels with respect to one single feature? Each label is attached to a 1-dim dataset of feature measurements. Datasets: Seed.

2) How to see intrinsic differences among multiple labels with respect to multiple features? Each label is attached to a K-dim dataset of feature measurements. Datasets: Seed.

3) How to deal with categorical features? Dataset: Automobile.

4) How to measure associative relations between a categorical response variable and multiple covariate features? Datasets: Seed and Automobile.
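For topic 1, a common starting point is to overlay each label's empirical CDF of the feature on one plot. A minimal sketch of that idea (the numbers below are made-up stand-ins for one Seed feature split by the wheat label, not the real data):

```python
import numpy as np

def ecdf(values):
    """Return sorted feature values and their empirical CDF heights."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Toy stand-in: one feature's measurements grouped by the three wheat labels.
groups = {1: [14.9, 15.3, 14.1], 2: [18.2, 17.9, 19.0], 3: [11.5, 12.0, 11.8]}
curves = {label: ecdf(vals) for label, vals in groups.items()}
# Each curve ends at height 1.0; drawing all three on one axis
# (e.g. with matplotlib's step()) makes the labels directly comparable.
```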

Here is the code; you can run it in JupyterLab to see the graphics.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.distributions.empirical_distribution import ECDF
import seaborn as sns
from matplotlib.pyplot import cm
from pyitlib import discrete_random_variable as drv
from sklearn import feature_selection  # needed for mutual_info_classif below
from plotnine import *
from statistics import *
from collections import Counter
import warnings

seed_df = pd.read_csv('seeds_dataset.txt', delim_whitespace=True, header=None)
seed_df.columns = ["area", "perimeter", "compactness", "length of kernel", "width of kernel", "asymmetry coefficient", "length of kernel groove", "wheat"]




fig, axs = plt.subplots(7, 3, figsize=(15, 10))
fig.subplots_adjust(top=2.5, bottom=0.5, wspace=0.3)
axs = axs.ravel()
for i in range(1, len(seed_df.columns)):
    ecdf = ECDF(seed_df.iloc[:, i-1])
    x = np.linspace(min(seed_df.iloc[:, i-1]), max(seed_df.iloc[:, i-1]))
    y = ecdf(x)
    axs[3*i-3].step(x, y)
    axs[3*i-3].set_title(f"Empirical CDF for {seed_df.columns[i-1]}")
    seed_df.pivot_table(values=seed_df.columns[i-1], index=seed_df.index, columns=['wheat']).plot.hist(bins=50, stacked=True, ax=axs[3*i-2])
    axs[3*i-2].set_title(f"Gapped Histogram for {seed_df.columns[i-1]}")
    axs[3*i-1].plot(seed_df.iloc[:, i-1])  # the plot call was missing in the original; draw the raw feature values
    axs[3*i-1].set_title(f"Line plot for {seed_df.columns[i-1]}")

sns.pairplot(seed_df, hue = "wheat")

def entropy(Y):
    unique, count = np.unique(Y, return_counts=True, axis=0)
    prob = count / len(Y)
    en = np.sum((-1) * prob * np.log2(prob))
    return en

# Joint entropy
# H(Y;X)
def jEntropy(Y, X):
    YX = np.c_[Y, X]
    return entropy(YX)

# Conditional entropy
# H(Y|X) = H(Y;X) - H(X)
def cEntropy(Y, X):
    return jEntropy(Y, X) - entropy(X)

# Mutual information
# I(Y;X) = H(Y) - H(Y|X)
def mutual_info(Y, X):
    return entropy(Y) - cEntropy(Y, X)
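Two quick sanity checks for these functions (redefined here so the snippet is self-contained): the mutual information of a variable with itself equals its entropy, and the mutual information with a constant is zero.

```python
import numpy as np

def entropy(Y):
    _, count = np.unique(Y, return_counts=True, axis=0)
    prob = count / len(Y)
    return np.sum(-prob * np.log2(prob))

def jEntropy(Y, X):
    return entropy(np.c_[Y, X])          # H(Y;X)

def cEntropy(Y, X):
    return jEntropy(Y, X) - entropy(X)   # H(Y|X)

def mutual_info(Y, X):
    return entropy(Y) - cEntropy(Y, X)   # I(Y;X)

Y = np.array([0, 0, 1, 1])
mi_self = mutual_info(Y, Y)                   # equals H(Y) = 1 bit
mi_const = mutual_info(Y, np.zeros_like(Y))   # a constant carries no information
```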

feature_selection.mutual_info_classif(seed_df.iloc[:,0:7], seed_df.iloc[:,-1], discrete_features='auto')

def mutualy_table(df):
    n = len(df.columns) - 1
    ce_df = pd.DataFrame(np.zeros((n, n)))
    for i in range(0, n):
        for j in range(0, n):
            ce_df.iloc[i, j] = mutual_info(df.iloc[:, i], df.iloc[:, j])  # use the df argument, not the global seed_df
    ce_df.columns = df.columns[:-1]
    ce_df.index = df.columns[:-1]
    return ce_df

print("Mutual Information table is:")
print(mutualy_table(seed_df))


auto_df = pd.read_csv('', header=None)  # the path to the Automobile data file was left blank in the original post
auto_df.columns = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels",
                   "engine-location", "wheel-base", "length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders",
                   "engine-size", "fuel-system", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]

def count_na(df):
    for col in df.columns:
        l = df[df[col] == "?"].shape[0]  # missing values are recorded as "?"
        if l != 0:
            print(f"{col} has {l} missing values")
# which matches what the dataset documentation says
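The Automobile file marks missing values with the string "?" rather than NaN, which is why count_na compares against "?". A tiny self-contained illustration of the same counting idea (the toy frame below is invented):

```python
import pandas as pd

toy = pd.DataFrame({"bore": ["3.47", "?", "2.68"], "make": ["audi", "bmw", "?"]})
missing = {col: int((toy[col] == "?").sum()) for col in toy.columns}
# one "?" placeholder in each column of the toy frame
```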


def replace_na(df):
    df_new = df.copy()
    # numeric columns: replace "?" with the mean of the observed values
    for col in ["normalized-losses", "bore", "stroke", "horsepower", "peak-rpm", "price"]:
        i = df[df[col] == "?"].index
        j = list(set(range(0, df[col].shape[0])) - set(i))
        df_new.loc[i, col] = mean(pd.to_numeric(df.loc[j, col]))
    # categorical columns: replace "?" with the mode of the observed values
    for col in ["num-of-doors"]:
        i = df[df[col] == "?"].index
        j = list(set(range(0, df[col].shape[0])) - set(i))
        df_new.loc[i, col] = Counter(df.loc[j, col]).most_common(1)[0][0]  # most_common gives the mode; max() would return the largest key
    return df_new  # the original forgot this return, so replace_na produced None

auto_df_new = replace_na(auto_df)
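One pitfall worth noting for the categorical imputation: `max(Counter(values))` returns the alphabetically largest key, not the most frequent one; `Counter.most_common(1)` gives the actual mode. A small illustration with invented door counts:

```python
from collections import Counter

doors = ["four", "two", "?", "four", "four", "?"]
observed = [d for d in doors if d != "?"]
mode = Counter(observed).most_common(1)[0][0]      # most frequent value
filled = [mode if d == "?" else d for d in doors]  # impute "?" with the mode
```

Here `max(Counter(observed))` would return "two" (it sorts keys alphabetically), while the mode is "four".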


def change_dtype(df):
    l1 = [3, 4, 5, 6, 7, 8, 9, 15, 16, 18]  # 1-based positions of the categorical columns
    for i in l1:
        df.iloc[:, i-1] = pd.Categorical(df.iloc[:, i-1])
    l2 = list(set(range(1, 27)) - set(l1) - set([1]))  # everything else (except column 1) becomes numeric
    for j in l2:
        df.iloc[:, j-1] = pd.to_numeric(df.iloc[:, j-1])
    return df


auto_df_new = change_dtype(auto_df_new)  # the original defined change_dtype but never called it
auto_df_new_dummy = pd.get_dummies(auto_df_new)
sns.pairplot(auto_df, hue="symboling")
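For topic 3, pd.get_dummies is doing the heavy lifting: each categorical column is expanded into one 0/1 indicator column per category, while numeric columns pass through unchanged. A minimal sketch on an invented two-column frame:

```python
import pandas as pd

toy = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas"],
                    "price": [13495, 16500, 13950]})
encoded = pd.get_dummies(toy)
# "fuel-type" becomes the indicator columns "fuel-type_diesel" and
# "fuel-type_gas"; "price" is left as-is.
```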

Unformatted Attachment Preview

Coarse- and fine-scale geometric information content of Multiclass Classification and implied Data-driven Intelligence. Fushing Hsieh and Xiaodong Wang, University of California, Davis CA 95616, USA

Abstract. Under any Multiclass Classification (MCC) setting defined by a collection of labeled point-clouds specified by a feature-set, we extract only stochastic partial orderings from any triplet of point-clouds without directly measuring the three cloud-to-cloud distances. We show that such a collective of partial orderings affords a label embedding tree geometry on the Label-space. This tree in turn gives rise to a predictive graph, or a network with precisely weighted linkages. These two multiscale geometries, taken as the coarse-scale information content of MCC, indeed shed light on explainable knowledge of why and how labeling comes about, and facilitate error-free prediction with potentially multiple candidate labels supported by data. To reveal within-label heterogeneity, we label naturally found clusters within each point-cloud, and likewise derive multiscale geometry as the fine-scale information content of MCC. Such a multiscale collective of knowledge is data-driven intelligence.

Keywords: Label embedding tree · Partial ordering · PITCHf/x.

1 Introduction

Nowadays Machine Learning (M.L.) based Artificial Intelligence (A.I.) research is by and large charged with endowing machines with various human semantic categorizing capabilities [13]. Given that human experts hardly make semantic categorizing mistakes, should machines also help to explain how and why to humans? We demonstrate that possible answers are computational and visible under any Multiclass Classification (MCC) setting.
The keys are: first, compute the pertinent information content without artificial structure; secondly, graphically display such information content via multiscale geometries, such as a tree, a network or both, to concisely bring the pattern-based knowledge or intelligence contained in data to human attention. Multiclass Classification is one major topic [16, 2, 3, 9, 6] of associating visual images or text articles with semantic concepts [12, 7, 17]. Its two popular techniques, flat and hierarchical, are prone to make mistakes [1, 10, 5], since a machine is primarily forced to assign a single candidate label toward a prediction: no less, no more. Such forceful decision-making to a great extent ignores the available amount of information supported by data. With such kind of M.L. at the heart of A.I., it is beyond reasonable doubt that A.I. is bound to generate fundamental social and academic issues in the foreseeable future if its error-prone propensity is not well harnessed in time. If completely error-free A.I. is not possible at the current state of technology, then at least it should tell us its decision-making trajectory leading up to every right or wrong decision. This is in the same sense as the recommended fourth rule of robotics, "a robot, or any intelligent machine, must be able to explain itself to humans", to be added to Asimov's famous three, since we need to see why, how and where errors occur in hope of knowing their causes, and even figuring out how to fix them. Such a quality prerequisite on A.I. and M.L. is also coherent with concurrent requirements put forth by many governments around the world: a transparent explanation of each A.I. based decision is required. Now is a critical time to think about how to coherently build and display data's authentic information content so that it can afford the making of explainable, error-free decisions.
Such information content with pertinent graphic display can then be turned into Data-driven Intelligence (D.I.). In this paper, we specifically demonstrate Data-driven Intelligence for Multiclass Classification. This choice of M.L. topic is in part due to the fact that classification is humans' primary way of acquiring intelligence, and in part due to its fundamental importance in science and industry. On the road to Data-driven Intelligence, we begin by asking the following three simple questions. First, the naive one: where is the relevant information in data? Secondly, what metric geometry is suitable to represent such information content? Finally, how do we make perfect, or at least nearly perfect, empirical inference or predictive decision-making? We address these three non-hypothetical questions thoroughly based on model-free unsupervised M.L. Here we explicitly show that the nature of the information content under Multiclass Classification is multiscale heterogeneity. Such information heterogeneity can be rather intertwined and opaque when its three data-scales, the numbers of labels, features and instances, are all big. The paper is organized as follows. In section 2 we describe the background and related work of MCC. In section 3 we develop a new label-embedding tree constructed via partial ordering and a classification schedule. In section 4 we illustrate a tree-descent procedure with early stopping and represent the error flow. In section 5 we explore the heterogeneity embedded within labels.

2 Multiclass Classification

A generic Multiclass Classification (MCC) setting has three data scales: the number of labels L, the number of features K and the total number of subjects N. Each label specifies a data-cloud. A data-cloud is an ensemble of subjects. Each subject is identified by a vector of K feature measurements. The complexity of data and its information content under any MCC setting is critically subject to L, K and N.
The goal of Multiclass Classification is to seek the principles or intelligence that can explain label-to-feature linkages. Such linkages are intrinsically heterogeneous, being blurred by varying degrees of mixing among diverse groups within the space of labeled data-clouds. Since such data-mixing patterns are likely rather convoluted and intertwined, the overall complexity of the information content must be multiscale in nature. Specifically, its global scale refers to which label's point-cloud is close to which, but far away from which. Though such an idea of closeness is clearly and fundamentally relative, it is very difficult to define or evaluate precisely. That is, this essential relativity can't be directly measured with only two point-clouds present; it can be reflected only in settings involving three or more point-clouds. From this perspective, all existing distance measures commonly miss the data-clouds' essential sense of relative closeness, locally and globally. For instance, the Gromov-Wasserstein distance via Optimal Transport has recently been proposed as a direct evaluation of the distance between two point-clouds [14]. But it suffers from the known difficulty of handling high dimensionality (large K). So this distance measure likely misses the proper sense of relative closeness among point-clouds, especially when K is big. In this paper, we propose a simple computing approach to capture the relative closeness among all involved point-clouds without directly evaluating pairwise cloud-to-cloud distances. The key idea is as follows: through randomly sampling a triplet of singletons from any triplet of point-clouds, we extract three partial orderings among the three pairs of cloud-to-cloud closeness.
By taking one partial ordering as one win-and-loss in a tournament involving (L choose 2) teams, we can build a dominance matrix that leads to a natural label embedding tree as a manifestation of heterogeneity on the global scale. Such a triplet-based, brick-by-brick construction for piecing together a label embedding tree seems intuitive and natural. Indeed, such a model-free approach is brand new to the M.L. literature [3, 4]. The existing hierarchical methods build a somewhat symbolic label embedding tree by employing a bifurcating scheme that nearly completely ignores the notion of heterogeneity [3, 15, 11]. After building a label embedding tree on the space of L labels, we further derive a predictive graph, which is a weighted network with precisely evaluated linkages. This graph offers the detailed closeness from the predictive perspective as another key aspect of the geometric information content of MCC. To further discover the fine-scale information content of MCC, we look into the heterogeneity embraced by each label. Clustering analysis is applied to each label's point-cloud to bring out a natural clustering composition, and each cluster is then labeled with a sublabel. By doing so across all labels, we obtain a space of sublabels of much larger size than L. Likewise we compute a sublabel embedding tree and its corresponding predictive graph. These two geometries then constitute and represent the fine-scale information content of MCC. A real database, Major League Baseball (MLB) PITCHf/x, is analyzed for the purpose of application. Since 2008 the PITCHf/x database of MLB has been recording every single pitch delivered by MLB pitchers in all games at its 30 stadiums. A record of a pitch is a measurement vector of 21 features. A healthy MLB pitcher typically throws around 3000 pitches, which are algorithmically categorized into one of the pitch-types: Fastball, Slider, Change-up, Curveball and other types.
We collect data from 14 (= L) MLB pitchers, who each threw around 1000 Fastballs or more during the 2017 season. As one pitcher is taken as a label, his seasonal Fastball collection is a point-cloud. It is noted that each pitcher tunes his Fastball slightly and distinctively when facing different batters under different circumstances of the game. That is, multiscale heterogeneity is inherently embedded into each point-cloud. A potential feature set is selected based on a permutation-based feature-importance measure. The importance score is defined as the reduction in the performance of a Random Forest after permuting the feature's values. All real-data illustrations for the entire computational development throughout this paper are done with respect to a feature set consisting of 3 features: the horizontal and vertical coordinates, and the horizontal speed, of a pitch at the release point. Results on two larger feature-sets are also reported.

Fig. 1. Illustrating example for the Algorithm of the label embedding tree. (A) the 3D scatter plot of data; (B) the 11 labeled data-clouds defined by a tree; (C) the embedding tree.

3 A label embedding tree built by partial ordering

We develop a computing paradigm based on unsupervised machine learning to nonparametrically construct the label- and sublabel-embedding trees in this paper. This paradigm is designed to be scalable in the three factors: L, K and N. With a label-triplet, say (La, Lb, Lc), in the brick-by-brick construction, partial ordinal relations refer to, for example, D(La, Lb) < D(La, Lc), where D(., .) is the unspecified "distance" between two label clouds. It is emphasized that Algorithm 1 is devised to extract such relations without explicitly computing the three pairwise distances D(., .). The relations found among three point-clouds are stochastic in nature.
Given a triplet of labels (La, Lb, Lc), we randomly sample three singleton vectors in R^K, say x_La, x_Lb and x_Lc, one from each of the three labels separately. A piece of partial-ordering information within the triplet is shed by inequalities among the Euclidean distances d(., .) between the three singletons x_La, x_Lb and x_Lc. That is, the inequality d(x_La, x_Lb) < d(x_La, x_Lc) provides a small piece of information about labels La and Lb being closer than La to Lc and Lb to Lc. By iteratively sampling vector-triplets at random a large number of times, say T, the probability of this relative closeness between La and Lb can be estimated as

P̂(D(La, Lb) < D(La, Lc)) = (1/T) Σ_{t=1}^{T} 1{ d(x_La^(t), x_Lb^(t)) < d(x_La^(t), x_Lc^(t)) }.

Via the Law of Large Numbers ...
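The triplet-sampling estimator described above can be sketched in a few lines of NumPy. This is a toy reconstruction under the stated definitions (the function name, sample size T and the synthetic Gaussian clouds are all made up here), not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def closeness_probability(cloud_a, cloud_b, cloud_c, T=5000):
    """Estimate P(D(La,Lb) < D(La,Lc)) by sampling T singleton triplets."""
    ia = rng.integers(len(cloud_a), size=T)
    ib = rng.integers(len(cloud_b), size=T)
    ic = rng.integers(len(cloud_c), size=T)
    d_ab = np.linalg.norm(cloud_a[ia] - cloud_b[ib], axis=1)
    d_ac = np.linalg.norm(cloud_a[ia] - cloud_c[ic], axis=1)
    return float(np.mean(d_ab < d_ac))  # fraction of triplets with d(a,b) < d(a,c)

# Three synthetic 2-D point-clouds: A and B overlap, C sits far away.
A = rng.normal(0.0, 1.0, size=(200, 2))
B = rng.normal(0.5, 1.0, size=(200, 2))
C = rng.normal(8.0, 1.0, size=(200, 2))
p = closeness_probability(A, B, C)  # should be near 1, since C is distant
```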
