What is machine learning?
Machine learning finds patterns in existing data and uses them to make decisions. In other words, it learns the relationships in the data and applies them to new inputs.
Why do we use machine learning?
ML makes it easier to make data-driven decisions with little or no human intervention. A model can also evolve over time: retraining it on new data can continuously improve the accuracy of its decisions.
ML Use-cases
- Identifying tags and categories from textual data
- Weather prediction
- Recommendations
- Fraud detection
ML Pipeline
Step 1: Prepare Training Data
If we have textual data and want to predict an outcome from it, we first need to convert the text into a machine-readable, numerical format.
As a first step, we can cleanse the data (remove nulls, duplicates, and punctuation) and apply stemming (convert riding, rides ==> ride) or lemmatization (which also maps irregular forms such as rode ==> ride).
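The cleansing step can be sketched in plain Python. This is a minimal illustration, not a full preprocessing pipeline: the function name clean_corpus and the sample sentences are made up for the example, and stemming is left out because it would normally come from a library such as NLTK.

```python
import string

def clean_corpus(docs):
    """Drop nulls, blanks, and duplicates; lowercase; strip punctuation."""
    strip_punct = str.maketrans("", "", string.punctuation)
    seen = set()
    cleaned = []
    for doc in docs:
        # Skip null or whitespace-only entries
        if doc is None or not doc.strip():
            continue
        # Lowercase, remove punctuation, and collapse repeated whitespace
        text = " ".join(doc.lower().translate(strip_punct).split())
        # Keep only the first occurrence of each cleaned document
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

# Made-up input, for illustration only
raw = ["Riding a bike!", None, "riding a bike", "   ", "She rode home."]
print(clean_corpus(raw))
```

Duplicates are detected after normalization, so "Riding a bike!" and "riding a bike" count as the same document.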
Next, let’s convert the textual data to a numerical format. One way to do this is scikit-learn's CountVectorizer, which identifies the unique feature names (words) in the input and counts how often each one occurs in every document.
Step 2: Extract Features
Input
‘This is the first document.’
‘This document is the second document.’
‘And this is the third one.’
‘Is this the first document?’
Feature names
[‘and’, ‘document’, ‘first’, ‘is’, ‘one’, ‘second’, ‘the’, ‘third’, ‘this’]
CountVectorizer (document-term matrix)
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
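To see where these numbers come from, here is a small pure-Python sketch that rebuilds the feature names and the document-term matrix for the four input sentences. It is not scikit-learn's implementation. It only mimics the default behaviour of CountVectorizer (lowercasing, and keeping tokens of two or more word characters), and the helper name tokenize is an assumption made for this example.

```python
import re

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: the sorted unique tokens across the whole corpus
feature_names = sorted({tok for doc in corpus for tok in tokenize(doc)})
index = {name: i for i, name in enumerate(feature_names)}

# One row per document, one column per feature, cells hold word counts
matrix = []
for doc in corpus:
    row = [0] * len(feature_names)
    for tok in tokenize(doc):
        row[index[tok]] += 1
    matrix.append(row)

print(feature_names)
for row in matrix:
    print(row)
```

With scikit-learn installed, CountVectorizer().fit_transform(corpus).toarray() should give the same counts, stored as a sparse matrix until toarray() is called.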
Step 3: Build ML Model
In supervised machine learning, we select an algorithm and build the model from samples (the input data) and a target (the expected result for each sample).
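As a rough sketch of that step with scikit-learn: the sentences and labels below are made up, and logistic regression is just one possible algorithm, chosen here for illustration. The pipeline turns text into word counts and then fits the classifier on samples and target.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Samples (input data): made-up sentences, for illustration only
samples = [
    "heavy rain and wind expected tonight",
    "sunny skies with light wind",
    "the team won the final match",
    "striker scores twice in the match",
]
# Target (expected result): one label per sample
target = ["weather", "weather", "sports", "sports"]

# Algorithm: logistic regression on word counts (an arbitrary choice here)
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(samples, target)

# The fitted model can now label text it has not seen before
predictions = model.predict(["rain and wind tomorrow", "the team lost the match"])
print(predictions)
```

Swapping in a different algorithm only changes the last step of the pipeline. The fit(samples, target) call stays the same.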
Step 4: Make Predictions
- Predict on live traffic, e.g. whether a credit card transaction is fraudulent
- Recommendations
- Suggestions
- Autocomplete
Machine Learning Model
Linear regression is a method for studying the relationship between a dependent variable (Y) and a given set of independent variables (X). The relationship is established by fitting a best-fit line.
Y = mX + b
Where b is the intercept and m is the slope of the line. Linear regression finds the optimal values for the intercept and the slope (in two dimensions). The Y and X values stay fixed, since they are the observed data and cannot be changed; the values we can control are the intercept and slope. Different values of the intercept and slope give different straight lines. Conceptually, the algorithm considers the candidate lines through the data points and returns the one with the least error, typically measured as the sum of squared differences between the predicted and actual Y values.
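For a single input variable, the least-error line does not have to be found by trial and error: the standard least-squares formulas give the slope and intercept directly. The sketch below assumes squared error as the error measure. The function names and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (m, b) minimizing the sum of squared errors for y = m*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    m = sxy / sxx
    b = y_mean - m * x_mean
    return m, b

def squared_error(xs, ys, m, b):
    """Sum of squared differences between predicted and actual y values."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Made-up data points, roughly following y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

m, b = fit_line(xs, ys)
print(f"slope={m:.2f}, intercept={b:.2f}")
print("error of fitted line:", squared_error(xs, ys, m, b))
print("error of another line:", squared_error(xs, ys, m + 0.1, b))
```

Any other choice of slope or intercept gives a larger squared error on the same points, which is what "best line" means here.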
Support Vector Machine (SVM) is a supervised learning method for classification. It chooses the separating hyperplane that maximizes the margin, i.e. the distance to the nearest points of each class.
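A minimal sketch with scikit-learn's SVC and a linear kernel, using six made-up 2-D points. The points closest to the separating hyperplane are the support vectors, and they alone determine where the boundary lies.

```python
from sklearn.svm import SVC

# Made-up 2-D points: class 0 in the lower left, class 1 in the upper right
X = [[1, 1], [1, 2], [2, 1], [4, 4], [4, 5], [5, 4]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel fits a straight-line (hyperplane) boundary
clf = SVC(kernel="linear")
clf.fit(X, y)

# The training points that define the margin
print(clf.support_vectors_)

# Classify two new points, one on each side of the boundary
svm_predictions = clf.predict([[0, 0], [5, 5]])
print(svm_predictions)
```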
The Multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). Each naive Bayes classifier can be considered a way of fitting a probability model.
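Here is a small sketch of Multinomial Naive Bayes on word counts, using scikit-learn. The messages and the spam/ham labels are made up for illustration. The predict_proba call shows the probability model side of the classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages and labels, for illustration only
texts = [
    "win free money now",
    "free prize claim now",
    "meeting at noon tomorrow",
    "lunch with the team tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# Discrete features: word counts per message
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

nb = MultinomialNB()
nb.fit(X, labels)

# New text must be transformed with the same fitted vectorizer
new_counts = vectorizer.transform(["free prize"])
nb_prediction = nb.predict(new_counts)
print(nb_prediction)
print(dict(zip(nb.classes_, nb.predict_proba(new_counts)[0])))
```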
Scaling
- ML prediction API in memory (Python)
- ML prediction API with batch processing
- ML prediction API as a streaming pipeline
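The three serving styles differ mainly in how records reach the model. The sketch below is only an outline under simple assumptions: toy_model is a hypothetical stand-in for a trained model, and a plain Python iterable stands in for a real batch store or message stream.

```python
from itertools import islice

def predict(model, record):
    """In memory: score one record with a model already loaded in the process."""
    return model(record)

def predict_batch(model, records, batch_size=2):
    """Batch: score a stored dataset in fixed-size chunks."""
    it = iter(records)
    results = []
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            break
        results.extend(model(r) for r in chunk)
    return results

def predict_stream(model, events):
    """Streaming: yield a prediction for each event as it arrives."""
    for event in events:
        yield event, model(event)

# Hypothetical stand-in model: flags transaction amounts above a fixed limit
def toy_model(amount):
    return "fraud" if amount > 1000 else "ok"

print(predict(toy_model, 50))
print(predict_batch(toy_model, [50, 5000, 20]))
for amount, label in predict_stream(toy_model, iter([2000, 10])):
    print(amount, label)
```

In a real system the batch path would typically read from storage on a schedule and the streaming path from a message queue, but the call into the model looks the same in all three cases.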
Deployment Options
- AWS SageMaker or GCP cloud API
- Python Flask or Django app
- TensorFlow deployed on a Kubernetes (k8s) cluster on GCP or AWS, or on your own Kubernetes cluster if you have one
- RedisAI module
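For the Flask option, a prediction endpoint could look roughly like this. It is a sketch, not a production service: the /predict route, the "amount" field, and toy_model are all assumptions made for the example, and a real app would load a trained model at startup instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-in for a trained model loaded at startup
def toy_model(amount):
    return "fraud" if amount > 1000 else "ok"

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # Expect a JSON body such as {"amount": 5000}
    payload = request.get_json(silent=True) or {}
    amount = payload.get("amount")
    if not isinstance(amount, (int, float)):
        return jsonify(error="expected a JSON body with a numeric 'amount' field"), 400
    return jsonify(prediction=toy_model(amount))
```

The endpoint can be tried without starting a server by using Flask's test client, for example app.test_client().post("/predict", json={"amount": 5000}).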
Resources:
Python Flask service with prediction API (coming soon)