Thursday, November 6, 2014

The Easiest Way to Upgrade GCC compiler in Windows Eclipse CDT

Because Windows has no Linux-like package management tool, when we want to upgrade a development tool we have to find a Windows installer (or compile it ourselves) and set PATH up correctly.
Recently, I wanted to use <regex> in C++ with G++ in Eclipse. This is a new C++11 feature that was only recently fully implemented in GCC: although GCC 4.8 ships the <regex> header, the functionality behind it is not actually implemented until GCC 4.9.
I tried the Windows installer from the official MinGW site, but it doesn't seem to update PATH for me. Eventually, I found a website with a Windows installer that sets PATH correctly and is easy to set up in Eclipse.

Step 1

Go to this address, scroll down the page, and download gcc-4.9.4-32.exe or gcc-4.9.4-64.exe depending on whether your Windows is 32-bit or 64-bit.

Step 2

Run the downloaded file and wait for the extraction to finish. Then click Accept and set the install location to C:\MinGW.
Then click Install to install GCC 4.9.1. You can check that the installation worked by opening a Command Prompt and typing
gcc -v
It should tell you that your GCC version is 4.9.1.

Step 3

If you haven't downloaded Eclipse CDT yet, download it here.
Open Eclipse and go to Window -> Preferences -> C++ -> New C++ Project Wizard -> Preferred Toolchains tab. For every project type under Executable, set the toolchain to MinGW GCC. Click OK to finish.

Now our Eclipse C++ projects can use <regex>!

Sunday, November 2, 2014

Using Python and sklearn Package Build Decision Tree

This is another homework assignment from my Machine Learning class.
The data comes from the UCI repository. It's a data set about whether consumers want to buy certain types of cars. All the attributes are discrete, so it's appropriate to use a decision tree to classify them.
I use the sklearn package to build the tree and matplotlib to plot a statistics graph of the results. After training, the tree is stored in graphviz format; we can then use a graphviz client to save the graph as a png or another picture format.
sklearn seems to be the most famous and powerful machine learning package in Python, and unlike some other packages it is still actively maintained and gaining new functions. Even some very famous internet services use it (honestly, I only know of Spotify and Evernote). Its documentation is quite understandable to me, and the examples are ample.
My code trains decision trees of different depths on the same data set and compares the training error and test error. Here is the documentation of decision trees in sklearn.

Here is the code (Python 3.3):

from sklearn import tree
import numpy as np
from matplotlib import pyplot

# read lines in file to data
with open("", mode='r') as f:  # file name left blank in the original post
    data = f.readlines()
data_table = []
for line in data:
    data_table.append(line.strip().split(','))  # strip the newline, then split on commas
data_table = np.asarray(data_table)  # convert the data to an ndarray

X = [row[0:6] for row in data_table]  # the first 6 columns are features
Y = [row[-1] for row in data_table]  # the last column is the classification

# DecisionTreeClassifier can only handle features represented by numbers
convert_dict = {'vhigh': 3, 'high': 2, 'med': 1, 'low': 0,
                '2': 2, '3': 3, '4': 4, '5more': 5,
                'more': 5,  # 'more' (persons column) added so every value in the data has a mapping
                'small': 0, 'big': 2}
for row_num in range(len(X)):
    for col_num in range(len(X[row_num])):
        X[row_num][col_num] = convert_dict[X[row_num][col_num]]
# except 'unacc', every other label counts as acceptable (make the problem binary)
for i in range(len(Y)):
    if Y[i] != 'unacc':
        Y[i] = 1
        Y[i] = 0
# use the first half of the data to train the tree and the second half to test it
X_train = X[:len(X) // 2]
X_test = X[(len(X) // 2):]

Y_train = Y[:(len(Y) // 2)]
Y_test = Y[(len(Y) // 2):]

# using 'entropy' as the classification criterion
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf =, Y_train)

# save the tree in graphviz format;
# we can then use a graphviz client to view the graph and save it as a png
with open("", 'w') as result_file:  # file name left blank in the original post
    tree.export_graphviz(clf, out_file=result_file)
# count how many predictions of the tree match the real results
def compute_fit_number(X, Y, clf):
    Y_pred = clf.predict(X)
    fit_num = 0
    for y_t, y_p in zip(Y, Y_pred):
        if y_t == y_p:
            fit_num += 1
    return fit_num

# show how the testing error and training error vary as the depth increases
def test_depth(max_depth, X_train, X_test, Y_train, Y_test):
    testing_errors = []
    training_errors = []
    depths = []
    for depth in range(2, max_depth + 1):
        clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=depth), Y_train)
        depths.append(depth)
        # error rate = 1 - accuracy
        testing_errors.append(1 - compute_fit_number(X_test, Y_test, clf) / len(Y_test))
        training_errors.append(1 - compute_fit_number(X_train, Y_train, clf) / len(Y_train))
    pyplot.plot(depths, testing_errors, '-r', label="testing error")
    pyplot.plot(depths, training_errors, '-g', label='training error')
    pyplot.title('error variation with the increase of depth')
    pyplot.legend()

# test_depth(20, X_train, X_test, Y_train, Y_test)
fit_number = compute_fit_number(X_test, Y_test, clf)
print('testing error is: ', 1 - fit_number / len(Y_test))
print("attribute number", len(X_test[0]))
fit_number = compute_fit_number(X_train, Y_train, clf)
print('training error is: ', 1 - fit_number / len(Y_train))

Here is a graph showing the errors of trees trained with different depths.

errors vs depth of the tree