Machine learning projects usually start the same way.
You open a notebook.
You load a dataset.
You train a model.
Then you train another one.
And another one.
Before you know it, you’re manually training 27 models like a caffeinated intern trying to find the best result.
At some point I realized something:
If the machine learning workflow is repetitive… the machine should probably do it.
So I built a mini AutoML platform using Python and FastAPI that can:
Essentially turning raw data into a live machine learning service.
Let’s walk through how the system works.
A production-ready AutoML system usually consists of multiple stages working together.
The pipeline I designed looks like this:
Dataset Upload
↓
Data Profiling
↓
Feature Engineering
↓
Automated Model Training
↓
Model Evaluation
↓
Best Model Selection
↓
Model Registry
↓
Prediction API
The AutoML system follows a modular pipeline:
Dataset → Feature Analysis → Model Training → Model Evaluation → Model Registry → Prediction API
Each component is designed to be independent so the platform can scale as datasets grow. The goal is simple:
Take a dataset and automatically produce a deployable machine learning model.
The core stack powering this platform:
The first step is allowing users to upload a dataset.
FastAPI makes it easy to build endpoints for dataset ingestion.
```python from fastapi import FastAPI, UploadFile import pandas as pd
app = FastAPI()
@app.post(“/upload”) async def upload_dataset(file: UploadFile):
df = pd.read_csv(file.file)
return {
"rows": df.shape[0],
"columns": df.shape[1],
"features": list(df.columns)
} Once uploaded, the dataset gets stored and passed into the training pipeline.
This allows the system to inspect the dataset before running any models.
Because in the real world, datasets are rarely clean.
Feature Analysis
Before training any models, the platform analyzes the dataset.
This step identifies:
numeric features
categorical features
missing values
statistical summaries
def analyze_features(df):
summary = {
"missing_values": df.isnull().sum().to_dict(),
"data_types": df.dtypes.astype(str).to_dict(),
"statistics": df.describe().to_dict()
}
return summary
This stage helps determine how preprocessing should work before models begin training.
Think of it as a data sanity check.
Automated Model Selection
Instead of manually testing algorithms, the system automatically trains several common machine learning models.
Some example models include:
Logistic Regression
Random Forest
Support Vector Machines
from sklearn.linear_model import LogisticRegression from sklearn.ensemble import RandomForestClassifier from sklearn.svm import SVC
models = { “logistic_regression”: LogisticRegression(), “random_forest”: RandomForestClassifier(), “svm”: SVC(probability=True) }
Each model gets trained on the same dataset so their performance can be compared fairly.
Because guessing which model will perform best ahead of time is mostly optimism.
Model Training Pipeline
The training pipeline loops through each model and records performance metrics.
from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score
def train_models(X, y):
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2
)
results = {}
for name, model in models.items():
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)
results[name] = {
"model": model,
"accuracy": score
}
return results
At the end of training, we now have performance results for each model.
Model Evaluation
Once the models finish training, the system selects the best performer.
def select_best_model(results):
best_model = max(
results.items(),
key=lambda x: x[1]["accuracy"]
)
return best_model
Evaluation metrics may include:
accuracy
precision
recall
ROC-AUC
The system chooses the model that generalizes best to unseen data.
Model Versioning
After selecting the best model, it gets stored in a model registry.
This allows models to be reused and deployed later.
import joblib from pathlib import Path
MODEL_DIR = Path(“models”)
def save_model(model, name):
path = MODEL_DIR / f"{name}.joblib"
joblib.dump(model, path)
return str(path)
Model versioning is extremely useful because it allows:
tracking model performance
managing updates
rolling back models if needed
Async Training with Background Workers
Training machine learning models can take time.
Instead of blocking the API request, training runs in a background task.
Conceptually the workflow becomes:
User uploads dataset ↓ API triggers training job ↓ Worker trains models in background ↓ Results stored in model registry
This keeps the API responsive while models train.
Deploying the Prediction API
Once the best model is stored, it becomes accessible through a prediction endpoint.
import joblib
model = joblib.load(“models/best_model.joblib”)
@app.post(“/predict”) def predict(features: list):
prediction = model.predict([features])
probability = model.predict_proba([features]).max()
return {
"prediction": int(prediction[0]),
"confidence": float(probability)
}
Now the AutoML system produces something extremely useful:
A live machine learning API.
This means the model can be integrated into:
applications
dashboards
automation systems
data pipelines
Final Thoughts
Machine learning becomes far more powerful once it moves beyond notebooks.
By building an AutoML pipeline, we can transform a dataset into a production-ready machine learning service automatically.
Instead of manually training models over and over again, the system can:
analyze datasets
train multiple models
evaluate performance
store the best model
deploy a prediction API
Which saves time and makes machine learning workflows scalable.
And most importantly…
It means I no longer have to train 27 models manually like a caffeinated intern.
The machine can handle that now.