This is another homework assignment from my Machine Learning class.

The data comes from the UCI repository. It is a data set about whether consumers want to buy certain types of cars. All of the data is discrete, so it is appropriate to use a decision tree to classify it.

I use the sklearn package to build the tree and matplotlib to plot the results. After training, the tree is stored in graphviz format. We can use graphviz's client to save the graph as a png or another picture format.
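
As a rough sketch of that last step, the conversion can also be scripted instead of done through the client. This assumes the Graphviz `dot` command-line tool is installed and on the PATH. The helper name `build_dot_command` and the png file name are made up for illustration.

```python
import os
import shutil
import subprocess


def build_dot_command(dot_path, image_path, image_format="png"):
    """Build the Graphviz command line that renders a .dot file to an image."""
    # `dot -T<format> input.dot -o output` is the usual Graphviz invocation
    return ["dot", "-T" + image_format, dot_path, "-o", image_path]


command = build_dot_command("hw2_option2_original.dot", "hw2_option2_original.png")
print(" ".join(command))

# only render when Graphviz is available and the .dot file has been written
if shutil.which("dot") and os.path.exists("hw2_option2_original.dot"):
    subprocess.check_call(command)
```

Passing `image_format="svg"` or `"pdf"` should produce the other picture formats in the same way.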

It looks like sklearn is the most famous and powerful machine learning package in Python and, unlike some other packages, it is still maintained and gaining new features. Even some very famous internet services use it (honestly, I only know of Spotify and Evernote). Its documentation is pretty understandable to me, and the examples are ample.

My code trains decision trees of different depths on the same data set and compares the training error and the test error. For details, see the documentation of decision trees in sklearn.
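
To make the comparison concrete, here is a minimal sketch of the error measure, assuming that "error" means the fraction of samples whose predicted label differs from the true label. The function name `error_rate` and the toy labels are made up for illustration and are not part of the homework code.

```python
def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# toy labels: 1 = acceptable car, 0 = unacceptable car
y_true = [0, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0]
print(error_rate(y_true, y_pred))  # one of the five predictions is wrong
```

The training error is this quantity measured on the data the tree was fit on, and the test error is the same quantity measured on data the tree has not seen.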

Here is the code (Python 3.3):

from sklearn import tree
import numpy as np
from matplotlib import pyplot

# read the lines of the file into a list
with open("car.data.txt", mode='r') as f:
    data = f.read().splitlines()

# split each line into its comma-separated fields
data_table = []
for line in data:
    data_table.append(line.split(','))
data_table = np.asarray(data_table)  # convert to a NumPy array of strings

# use plain Python lists so the strings can be replaced by integers below
X = [row[0:6].tolist() for row in data_table]  # the first 6 columns are the features
Y = [row[-1] for row in data_table]  # the last column is the classification

# the decision tree classifier only accepts numeric features,
# so map every categorical value to an integer
convert_dict = {'vhigh': 3, 'high': 2, 'med': 1, 'low': 0,
                '2': 2, '3': 3, '4': 4, '5more': 5,
                'more': 5,
                'small': 0, 'big': 2}
X = [[convert_dict[value] for value in row] for row in X]

# binary target: 'unacc' becomes 0, every other class becomes 1
Y = [0 if label == 'unacc' else 1 for label in Y]

# use the first half of the data to train the tree and the second half to test it
X_train = X[:len(X) // 2]
X_test = X[len(X) // 2:]
Y_train = Y[:len(Y) // 2]
Y_test = Y[len(Y) // 2:]

# use 'entropy' as the split criterion
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(X_train, Y_train)

# save the tree in graphviz (.dot) format;
# the graphviz client can display it and save it as a png
with open("hw2_option2_original.dot", 'w') as result_file:
    tree.export_graphviz(clf, out_file=result_file)

# count how many predictions made by the tree match the real labels
def compute_fit_number(X, Y, clf):
    Y_pred = clf.predict(X)
    fit_num = 0
    for y_t, y_p in zip(Y, Y_pred):
        if y_t == y_p:
            fit_num += 1
    return fit_num

# show how the testing error and training error vary as the depth increases
def test_depth(max_depth, X_train, X_test, Y_train, Y_test):
    testing_errors = []
    training_errors = []
    depths = []
    for depth in range(2, max_depth + 1):
        depths.append(depth)
        clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=depth)
        clf.fit(X_train, Y_train)
        # error = 1 - fraction of correct predictions
        testing_errors.append(1 - compute_fit_number(X_test, Y_test, clf) / len(Y_test))
        training_errors.append(1 - compute_fit_number(X_train, Y_train, clf) / len(Y_train))
    pyplot.plot(depths, testing_errors, '-r', label='testing error')
    pyplot.plot(depths, training_errors, '-g', label='training error')
    pyplot.legend()
    pyplot.xlabel('depth')
    pyplot.ylabel('error')
    pyplot.title('error variation with the increase of depth')
    pyplot.show()

# test_depth(20, X_train, X_test, Y_train, Y_test)

# testing error of the unrestricted tree trained above
Y_predict = clf.predict(X_test)
fit_number = 0
for y_t, y_p in zip(Y_test, Y_predict):
    if y_t == y_p:
        fit_number += 1
print(clf)
print('testing error is: ', 1 - fit_number / len(Y_test))
print("attribute number", len(X_test[0]))

# training error: compare the predictions on the training set with Y_train
Y_predict = clf.predict(X_train)
fit_number = 0
for y_t, y_p in zip(Y_train, Y_predict):
    if y_t == y_p:
        fit_number += 1
print('training error is: ', 1 - fit_number / len(Y_train))
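
The classifier above is created with criterion='entropy'. As a rough illustration of what that criterion measures, the sketch below computes the Shannon entropy of a set of labels and the information gain of one candidate split. It is a simplified stand-in, not sklearn's actual implementation, and the function names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(labels, left, right):
    """Entropy reduction obtained by splitting `labels` into `left` and `right`."""
    total = len(labels)
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(labels) - weighted


labels = [0, 0, 1, 1]
print(entropy(labels))                           # a 50/50 mix of two classes
print(information_gain(labels, [0, 0], [1, 1]))  # a split that separates the classes
```

A tree grown with this criterion prefers, at each node, the split with the largest information gain.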

Here is the graph showing the errors of trees with different depths.

[Figure: errors vs. depth of the tree]