I’m working on a test project for a luxury SUV client to help analyze a large data set. Here is a small sample of what I started with.

I’m using some common libraries in Python to process the data set.

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# the file name must be passed as a string
dataset = pd.read_csv('social_media_date.csv')
```

After that I slice, split and scale the data.

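```python
# select columns 2 and 3 as the features and column 4 as the target
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# split training and test set
# (train_test_split lives in sklearn.model_selection; the older
# sklearn.cross_validation module has been removed from scikit-learn)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
# the test set is only transformed, using the mean and standard
# deviation already fitted on the training set
X_test = sc_X.transform(X_test)
```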
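To make the "transform, not fit" comment concrete, here is a minimal NumPy sketch of the arithmetic `StandardScaler` performs. The age and salary values are made up for illustration and are not taken from the client's data set.

```python
import numpy as np

# Hypothetical rows of [age, estimated salary]; not from the real data set.
train = np.array([[25.0, 30000.0],
                  [35.0, 60000.0],
                  [45.0, 90000.0]])
test = np.array([[30.0, 45000.0]])

# "fit": learn the per-column mean and (population) standard deviation
# from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)

# "transform": apply the same statistics to both sets, so the test set
# never influences the scaling.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # close to 0 for each column
print(test_scaled)
```

After scaling, age and salary are on comparable scales, so the much larger salary values no longer dominate the model.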

Next, I fit the data to a logistic regression model.

```python
# fitting logistic regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# predicting the test set results
y_pred = classifier.predict(X_test)

# making the confusion matrix: counts of correct and incorrect predictions
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
```
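As a rough illustration of what a binary logistic regression does when it predicts, the sketch below computes a linear score, squashes it with the sigmoid function, and thresholds the probability at 0.5. The weights, intercept, and sample rows are hypothetical values chosen for the example, not the coefficients learned by `classifier`.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for [scaled age, scaled salary] and intercept.
weights = np.array([2.0, 1.0])
intercept = -0.5

# Hypothetical scaled observations.
X_scaled = np.array([[-1.0, -0.5],
                     [0.0, 0.0],
                     [1.5, 0.8]])

scores = X_scaled @ weights + intercept        # linear part of the model
probabilities = sigmoid(scores)                # estimated probability of class 1
predictions = (probabilities > 0.5).astype(int)  # class 1 if probability > 0.5

print(predictions)
```

Because the score is linear in the features, the boundary between the two predicted classes is a straight line in the age/salary plane.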

When I print the confusion matrix with `print(cm)`, here is what I see:

[[65 3] = 65 correct predictions, 3 incorrect predictions

[ 8 24]] = 8 incorrect predictions, 24 correct predictions

That is 89 correct predictions out of 100 test observations, so logistic regression is doing a nice job here!
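The matrix can also be turned into summary metrics. The sketch below re-enters the four counts shown above and assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and that class 1 means "purchased", which the post does not state explicitly.

```python
import numpy as np

# Counts copied from the printed confusion matrix.
cm = np.array([[65, 3],
               [8, 24]])

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of predicted purchases, how many were real
recall = tp / (tp + fn)           # of real purchases, how many were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Under those assumptions the model finds three out of four actual purchasers, which is a useful complement to the overall accuracy figure.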

The final step is to visualise the results.

```python
# visualise the training set results
from matplotlib.colors import ListedColormap

X_set, y_set = X_train, y_train

# build a fine grid covering the range of both scaled features
X1, X2 = np.meshgrid(
    np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
    np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))

# shade every grid point by the class the classifier predicts for it
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('gray', 'white')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

# plot the observations, one colour per class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j, s=10)

plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
```
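The `meshgrid`, `ravel`, and `reshape` steps in the plotting code are easier to follow on a tiny grid. In this sketch, `toy_classifier` is a hypothetical stand-in for `classifier.predict`, so the example runs without a trained model.

```python
import numpy as np

# A tiny 3 x 2 grid instead of the fine 0.01-step grid used for the plot.
x1 = np.arange(0.0, 3.0, 1.0)   # feature 1 values: 0, 1, 2
x2 = np.arange(0.0, 2.0, 1.0)   # feature 2 values: 0, 1
X1, X2 = np.meshgrid(x1, x2)    # both arrays have shape (2, 3)

# Flatten the grid into one row per point: shape (6, 2).
grid_points = np.array([X1.ravel(), X2.ravel()]).T

def toy_classifier(points):
    """Hypothetical stand-in for classifier.predict: a linear boundary."""
    return (points[:, 0] + points[:, 1] > 1.5).astype(int)

# Predict a class for every grid point, then fold back into the grid shape
# so contourf can colour each cell.
labels = toy_classifier(grid_points).reshape(X1.shape)

print(grid_points.shape)
print(labels)
```

The reshaped `labels` array lines up cell for cell with `X1` and `X2`, which is what `plt.contourf` needs to draw the shaded decision regions.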

Here is a scatter plot of the scaled training set and test set. Together with the confusion matrix, it shows that logistic regression predicts a purchase from a potential customer's age and estimated salary quite accurately: 89 of the 100 test observations are classified correctly. In just 72 lines of code, Python has helped bring great insight into a large data set.

I learned this from a comprehensive course on machine learning at Udemy.