Fit_It! by Gene

Interactive Probability Distribution Fitting Tool

Powered by

SciPy
FastAPI
Plotly

Important Notice

Fit_It! is a free hobby project for educational and entertainment purposes only. It is not intended for critical decision-making. The service runs on free-tier cloud hosting with limited resources. Please use considerately - limit to 1-2 analyses per session to keep this service available for everyone.

Quick Stats

  • Distributions: 108
  • Fitting Methods: 4
  • Max Data Points: 500 (3K for subscribers)

Introduction

Fit_It! by Gene is a free educational tool that brings the world of probability distributions to your browser. Powered by SciPy, FastAPI, and Plotly, it allows you to explore how different statistical distributions fit your data through an intuitive interface.

📊

Data Exploration

Upload your dataset and discover its probability distribution

🔍

Statistical Analysis

Compare 4 fitting methods and evaluate goodness-of-fit

📈

Visual Insights

Interactive visualizations help understand distribution fits

Note: Fit_It! is a hobby project developed during spare time. It's provided as a free service for educational purposes only. Please use it considerately to help keep it available for everyone.

Why Distribution Fitting Matters

Understanding probability distributions helps in numerous domains:

Modeling Real-world Phenomena

  • Customer wait times (exponential distribution)
  • Equipment failure (Weibull distribution)
  • Biological measurements (normal distribution)

Decision Making under Uncertainty

  • Risk assessment in finance
  • Quality control in manufacturing
  • Resource planning in operations
💡

"All models are wrong, but some are useful"

- George Box

Service Limitations

Resource Constraints

  • Hosted on free-tier cloud services with limited CPU and memory
  • No persistent storage - uploaded data is deleted after session
  • Limited bandwidth for handling multiple simultaneous requests

Please Note: To keep this service free and available to all, we request that you limit your usage to 1-2 analyses per session and avoid automated scripts.

Technical Boundaries

  • Maximum 500 data points per analysis
  • Maximum 7 distributions per analysis
  • Limited to continuous distributions
  • Header detection works best with single-row headers

Getting Started

Data Requirements

  • Format: CSV files only
  • Size: 20 to 500 data points
  • Columns:
    • Single column: Values only
    • Two columns: First column ignored, second used for fitting
  • Headers:
    • Auto-detection for single-row headers
    • Best practice: Use single-row headers or no headers

Recommended Structure

Value
12.5
14.3
11.7
...

Alternative Structure

Index, Value
1, 12.5
2, 14.3
3, 11.7
...
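
The column and size rules above can be checked locally before uploading. The sketch below is only an assumption about how such a check might look, not Fit_It!'s actual parser: `load_values` is a hypothetical helper, and treating a non-numeric first row as the header is a simplification of the auto-detection described above.

```python
import csv
import io

MIN_POINTS, MAX_POINTS = 20, 500  # size limits stated in this documentation


def load_values(csv_text):
    """Return the fitting column as floats (hypothetical helper).

    One column: values only. Two columns: the second is used.
    A first row that does not parse as a number is treated as a header.
    """
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    values = []
    for i, row in enumerate(rows):
        cell = row[-1].strip()  # last column: the only one, or the second of two
        try:
            values.append(float(cell))
        except ValueError:
            if i == 0:
                continue  # single-row header
            raise
    if not MIN_POINTS <= len(values) <= MAX_POINTS:
        raise ValueError(f"need {MIN_POINTS}-{MAX_POINTS} points, got {len(values)}")
    return values


# Two-column sample in the "Alternative Structure" layout
sample = "Index, Value\n" + "\n".join(f"{i}, {10 + i * 0.5}" for i in range(1, 26))
print(len(load_values(sample)))
```

Files with fewer than 20 or more than 500 numeric rows raise a `ValueError` here, which mirrors the documented size limits.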

Workflow

1. Upload CSV
2. Select Distributions
3. Choose Fitting Method
4. Review Results

Core Concepts

Domain Validation (Suggestions Toggle)

Many distributions have mathematical constraints on the data they can describe (data domain rules). Fit_It! tries to check compatibility automatically, but the check is not robust, so take its suggestions with a pinch of salt:

Example domain rule: { "domain_check": "data > 0", "distributions": ["alpha", "gamma"], "conditional": false }

Implementation

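import numpy as np

# Example rule from above; the full rule list is not shown in this document
DOMAIN_RULES = [{"domain_check": "data > 0", "distributions": ["alpha", "gamma"], "conditional": False}]

def validate_data_domains(data):
    incompatible = set()
    for rule in DOMAIN_RULES:
        context = {'np': np, 'data': np.asarray(data)}
        try:
            # The rule strings are trusted and hard-coded; never eval user input
            mask = eval(rule["domain_check"], context)
            if not np.all(mask):
                # Flag incompatible distributions
                incompatible.update(rule["distributions"])
        except Exception as e:
            # Handle evaluation error: skip this rule and keep checking the rest
            print(f"Skipping rule {rule['domain_check']!r}: {e}")
    return sorted(incompatible)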

Adaptive Binning System used for Histogram and Chi-square

Freedman-Diaconis Rule

Optimal for non-normal data with potential outliers:

\[ \text{Bin Width} = 2 \times \frac{\text{IQR}}{n^{1/3}} \] Where: \[ \text{IQR} = Q_3 - Q_1 \]

Rice Rule

Optimal for larger datasets:

\[ \text{Number of Bins} = 2 \times n^{1/3} \]

Implementation

import numpy as np

def calculate_bins(data):
    n = len(data)
    if n < 500:
        # Freedman-Diaconis rule
        q75, q25 = np.percentile(data, [75, 25])
        iqr = q75 - q25
        bin_width = 2 * iqr / (n ** (1/3))
        if bin_width == 0:
            return 10  # zero IQR: fall back to the minimum bin count
        bins = int(np.ceil((np.max(data) - np.min(data)) / bin_width))
    else:
        # Rice rule
        bins = int(np.ceil(2 * (n ** (1/3))))
    return max(bins, 10)
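
Both rules are also available as built-in estimators in NumPy, which gives a quick way to cross-check a bin count. The sketch below uses `np.histogram_bin_edges` with `bins="fd"` and `bins="rice"` on a synthetic sample; it is a comparison aid, and nothing here implies that Fit_It! calls NumPy's estimators itself.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)  # synthetic, right-skewed sample

fd_edges = np.histogram_bin_edges(data, bins="fd")      # Freedman-Diaconis
rice_edges = np.histogram_bin_edges(data, bins="rice")  # Rice rule

# Edges are bin boundaries, so there is one more edge than bins
fd_bins = len(fd_edges) - 1
rice_bins = len(rice_edges) - 1
print(fd_bins, rice_bins)
```

For n = 500 the Rice rule gives 2 × 500^(1/3) ≈ 15.9, which rounds up to 16 bins.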

Data Normalization

While Fit_It! works with raw data values, normalizing your data before analysis can significantly improve results for many distributions. Normalization transforms data to a common scale without distorting differences in ranges.

Why Normalize?

  • Improves numerical stability for fitting algorithms
  • Prevents features with large scales from dominating
  • Enables meaningful comparison of parameters
  • Helps distributions with scale parameters (e.g., normal, exponential)
  • Limits the influence of outliers on the scale when a robust method (median, IQR) is used

Common Techniques

Z-score:
\[ z = \frac{x - \mu}{\sigma} \]

Best for normally distributed data

Min-Max:
\[ x' = \frac{x - \min(x)}{\max(x) - \min(x)} \]

Scales to [0,1] range

Robust Scaling:
\[ x' = \frac{x - \text{median}(x)}{\text{IQR}(x)} \]

Resistant to outliers

IQR-Based Normalization

IQR (Interquartile Range) normalization is particularly useful for skewed data:

\[ \text{IQR} = Q_3 - Q_1 \] \[ x' = \frac{x - Q_1}{\text{IQR}} \]

Where Q₁ is the 25th percentile and Q₃ is the 75th percentile. This method maps Q₁ to 0 and Q₃ to 1, so the middle 50% of the data falls in [0,1]; values outside the quartiles fall below 0 or above 1.

Robust Scaler

Advanced robust normalization using median and IQR:

\[ x' = \frac{x - \text{median}(x)}{\text{IQR}} \]
  • Centers data around the median
  • Scales using IQR instead of standard deviation
  • Resistant to up to 25% outliers in either tail
  • Ideal for heavy-tailed distributions

Recommendation & Implementation

For best results with Fit_It!, we recommend normalizing your data before uploading. While not currently automated in this version, normalization can be easily done in spreadsheet software or Python:

# Python normalization examples
import numpy as np
from sklearn.preprocessing import RobustScaler

# Robust scaling
data = np.array([...]).reshape(-1, 1)
scaler = RobustScaler(quantile_range=(25, 75))
robust_normalized = scaler.fit_transform(data)

# IQR-based normalization
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
iqr_normalized = (data - Q1) / IQR

Important: Remember to keep track of your scaling parameters (median, IQR, min/max) if you need to transform results back to original scale!
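
As a minimal sketch of what keeping track of the scaling parameters means in practice: for a linear rescaling x' = (x - m) / s, a fitted location maps back as loc = loc' · s + m and a fitted scale as scale = scale' · s, while shape parameters are unchanged. The values below are illustrative only, and the normal fit stands in for whichever distribution is being fitted.

```python
import numpy as np
from scipy import stats

data = np.array([12.5, 14.3, 11.7, 13.1, 12.9, 15.2, 11.1, 13.8])  # illustrative values

# Robust scaling; keep the parameters next to the scaled data
median = np.median(data)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
scaled = (data - median) / iqr

# Fit on the scaled data, then map location and scale back
loc_scaled, scale_scaled = stats.norm.fit(scaled)
loc_original = loc_scaled * iqr + median
scale_original = scale_scaled * iqr

# Undo the scaling of the data itself
restored = scaled * iqr + median
print(loc_original, scale_original)
```

The back-transformed location and scale agree with a normal fit on the raw values, which is a convenient sanity check after any rescaling.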

Data Transformation

Beyond normalization, data transformation can make your data better conform to distributional assumptions. Transformations modify the shape of your distribution, often addressing skewness and making patterns more visible.

Why Transform Data?

  • Reduce right/left skewness in distributions
  • Stabilize variance across the data range
  • Improve linear relationships between variables
  • Make data more closely approximate normal distribution
  • Enhance performance of parametric statistical methods

Key Considerations

  • Always check distribution before and after transformation
  • Some transformations require strictly positive values
  • Interpretation of results changes after transformation
  • Reverse transformation may be needed for final results
  • Choose transformation based on data characteristics

Common Transformation Techniques

Logarithmic Transformation
\[ x' = \log_b(x) \quad \text{or} \quad x' = \log_b(x + c) \]

Best for right-skewed data. Compresses large values and expands small values. Requires x > 0. Add constant c if data contains zeros.

Use Cases:
  • Income distributions
  • Population sizes
  • Response times
Square Root Transformation
\[ x' = \sqrt{x} \quad \text{or} \quad x' = \sqrt{x + c} \]

Moderate effect on right-skewed data. Less aggressive than log transform. Works for zero values with constant adjustment.

Use Cases:
  • Count data (e.g., customer arrivals)
  • Area measurements
  • Biological growth metrics
Box-Cox Transformation
\[ x'(\lambda) = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases} \]

Power transformation that finds optimal λ to make data most normal-like. Requires strictly positive values.

Use Cases:
  • Heteroscedastic data
  • Skewed continuous variables
  • When optimal transformation is unknown
Yeo-Johnson Transformation
\[ x'(\lambda) = \begin{cases} \frac{(x+1)^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, x \geq 0 \\ \log(x+1) & \text{if } \lambda = 0, x \geq 0 \\ -\frac{(-x+1)^{2-\lambda} - 1}{2-\lambda} & \text{if } \lambda \neq 2, x < 0 \\ -\log(-x+1) & \text{if } \lambda = 2, x < 0 \end{cases} \]

Extension of Box-Cox that works for both positive and negative values. More flexible for real-world datasets.

Use Cases:
  • Datasets with negative values
  • When Box-Cox can't be applied
  • Financial data with negative returns
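
The four Yeo-Johnson branches can be checked numerically against `scipy.stats.yeojohnson`, which applies the transform for a given λ. The helper `yeo_johnson` below is a hypothetical name written for this comparison; each branch is spelled out in the code so the check does not depend on any other part of this page.

```python
import numpy as np
from scipy import stats


def yeo_johnson(x, lam):
    """Piecewise Yeo-Johnson transform for a fixed lambda (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    if lam != 0:
        out[pos] = ((x[pos] + 1) ** lam - 1) / lam
    else:
        out[pos] = np.log1p(x[pos])
    if lam != 2:
        out[~pos] = -((-x[~pos] + 1) ** (2 - lam) - 1) / (2 - lam)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out


data = np.array([1.2, 5.7, 0.8, -2.3, 10.4, -0.5, 7.1])
manual = yeo_johnson(data, 0.5)
reference = stats.yeojohnson(data, lmbda=0.5)
print(np.allclose(manual, reference))
```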

Implementation in Python

import numpy as np
import scipy.stats as stats
from sklearn.preprocessing import PowerTransformer

# Sample data with positive and negative values
data = np.array([1.2, 5.7, 0.8, -2.3, 10.4, -0.5, 7.1])
positive_data = data[data > 0]

# Logarithmic transformation (non-negative data; log(1+x) also handles zeros)
log_transformed = np.log1p(positive_data)

# Signed square root transformation (keeps the sign, so negative values are allowed)
sqrt_transformed = np.sqrt(np.abs(data)) * np.sign(data)

# Box-Cox transformation (strictly positive; shift the data first if it contains zeros)
boxcox_transformed, lambda_val = stats.boxcox(positive_data)

# Yeo-Johnson transformation (handles all values)
yj_transformer = PowerTransformer(method='yeo-johnson', standardize=False)
yj_transformed = yj_transformer.fit_transform(data.reshape(-1, 1))

# Inverse transformation example
inverse_data = yj_transformer.inverse_transform(yj_transformed)

Important: Always test transformations visually with Q-Q plots or distribution comparisons. Remember that parameters (like λ in Box-Cox) must be saved to reverse transformations later.
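
A numerical companion to the visual Q-Q check, offered as a sketch on synthetic data: `scipy.stats.probplot` returns the Q-Q points together with the correlation coefficient r of the fitted line, so r closer to 1 suggests the sample is closer to normal. The stored λ is then used with `scipy.special.inv_boxcox` to reverse the transformation.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(42)
data = rng.lognormal(mean=1.0, sigma=0.8, size=300)  # synthetic right-skewed sample

# probplot returns ((theoretical, ordered), (slope, intercept, r))
_, (_, _, r_raw) = stats.probplot(data, dist="norm")

transformed, lam = stats.boxcox(data)  # lam must be stored to invert later
_, (_, _, r_bc) = stats.probplot(transformed, dist="norm")

# Reverse the transformation with the saved lambda
restored = inv_boxcox(transformed, lam)
print(f"r raw={r_raw:.3f}, r Box-Cox={r_bc:.3f}, lambda={lam:.3f}")
```

The r value is a summary only; the plot itself still shows where the departure from normality occurs (centre or tails).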

Transformation Selection Guide

Data Characteristic              Recommended Transformation
Right-skewed, positive values    Logarithmic, Square root, Box-Cox
Left-skewed, positive values     Power (x^k, k>1), Square (x^2)
Positive and negative values     Yeo-Johnson, Signed power transformations
Count data, Poisson-like         Square root, Anscombe (√(x + 3/8))
Proportions, percentages         Logit, Arcsine square root

Fitting Strategies

Maximum Likelihood Estimation (MLE)

\[ \mathcal{L}(\theta | \mathbf{x}) = \prod_{i=1}^{n} f(x_i | \theta) \]

Pros

  • Consistent for large samples
  • Efficient estimators
  • Asymptotically normal

Cons

  • Requires PDF specification
  • Sensitive to starting values
  • Can be biased for small samples
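
A minimal sketch of maximum likelihood fitting with SciPy, assuming a gamma model and a synthetic sample. Fixing the location at 0 with `floc=0` is a choice made for this example, not necessarily what Fit_It! does.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.gamma(shape=2.0, scale=3.0, size=400)  # synthetic sample

# MLE with the location fixed at 0 (a common choice for positive data)
shape, loc, scale = stats.gamma.fit(data, floc=0)

# Log-likelihood at the fitted parameters: sum of log f(x_i | theta)
log_lik = np.sum(stats.gamma.logpdf(data, shape, loc=loc, scale=scale))
print(shape, scale, log_lik)
```

Working with the sum of log-densities rather than the product of densities avoids numerical underflow for larger samples.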

Minimum Chi-Squared

\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \]

Pros

  • Provides goodness-of-fit measure
  • Additive for combining results
  • Easy to compute

Cons

  • Requires binning data
  • Sensitive to bin size
  • Not recommended for small samples
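
The sketch below illustrates the idea of minimum chi-squared fitting for a normal model: bin the data, compute expected counts from the model CDF, and minimise the statistic over the parameters. The bin count, optimizer, and starting values are assumptions made for this example and may differ from Fit_It!'s own settings.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=1.5, size=400)  # synthetic sample

observed, edges = np.histogram(data, bins=12)
n = data.size


def chi_squared(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    # Expected counts: n times the probability mass the model puts in each bin
    expected = n * np.diff(stats.norm.cdf(edges, loc=mu, scale=sigma))
    expected = np.clip(expected, 1e-9, None)  # avoid division by zero
    return np.sum((observed - expected) ** 2 / expected)


start = [data.mean(), data.std()]
result = optimize.minimize(chi_squared, start, method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print(mu_hat, sigma_hat, result.fun)
```

Changing `bins=12` changes both the estimates and the final statistic, which is the bin-size sensitivity listed under Cons.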

Method of Moments (MM)

\[ \mu_k' = E[X^k] = \frac{1}{n}\sum_{i=1}^{n} x_i^k \]

Pros

  • Simple and explicit solutions
  • Intuitive approach
  • Fast computation

Cons

  • Less efficient than MLE
  • Can produce biased estimates
  • Limited to distributions with moments
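
As a worked example of matching moments, assume a gamma distribution with location 0, whose mean is kθ and variance is kθ². Equating these to the sample mean and variance gives closed-form estimates; the data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.gamma(shape=2.0, scale=3.0, size=400)  # synthetic sample

# For a gamma distribution with loc = 0: mean = k * theta, variance = k * theta^2
mean = data.mean()
var = data.var()

shape_mm = mean**2 / var   # k
scale_mm = var / mean      # theta
print(shape_mm, scale_mm)
```

No optimizer is involved, which is why the method is fast, but the estimates inherit any instability in the sample variance.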

Robust Min-SSE

\[ \text{SSE}_{\text{weighted}} = \sum_{i=1}^{n} \left[ \sqrt{\text{KDE}(x_i)} \cdot (\text{PDF}(x_i) - \text{KDE}(x_i))^2 \right] \]

Pros

  • Focuses on high-density regions
  • No binning artifacts
  • Excellent for multi-modal data

Cons

  • Computationally intensive
  • Bandwidth selection affects results
  • Slower than other methods
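
One way to read the weighted SSE formula is sketched below for a normal model. This is an interpretation of the formula only: the KDE here is `scipy.stats.gaussian_kde` with its default bandwidth, and the optimizer and starting values are assumptions, so Fit_It!'s actual implementation may differ.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(4)
data = rng.normal(loc=10.0, scale=2.0, size=300)  # synthetic sample

kde = stats.gaussian_kde(data)  # default bandwidth (Scott's rule)
kde_at_data = kde(data)         # KDE evaluated at each observation


def weighted_sse(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    pdf = stats.norm.pdf(data, loc=mu, scale=sigma)
    # sqrt(KDE) weight emphasises residuals in high-density regions
    return np.sum(np.sqrt(kde_at_data) * (pdf - kde_at_data) ** 2)


start = [np.median(data), data.std()]
result = optimize.minimize(weighted_sse, start, method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print(mu_hat, sigma_hat, result.fun)
```

Because the target is the KDE rather than the raw data, the fitted scale tends to track the smoothed density, which is the bandwidth dependence listed under Cons.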

Output Interpretation

Goodness-of-Fit Ranking

Fit_It! ranks distributions using the Akaike Information Criterion (AIC). The distribution with the lowest AIC score is considered the best fit. This approach balances model fit with complexity, penalizing distributions with more parameters to avoid overfitting.

Why AIC?

  • Balances goodness-of-fit and model complexity
  • Penalizes unnecessary parameters
  • Based on information theory principles
  • Valid for both nested and non-nested models
  • Asymptotically efficient for large samples

Interpretation Guidelines

  • ΔAIC < 2 (difference from the lowest AIC among the candidates): Substantial support for the model
  • ΔAIC 4-7: Considerably less support
  • ΔAIC > 10: Essentially no support
  • Always compare with visual fit assessment

Goodness-of-Fit Metrics

Akaike Information Criterion (AIC)

\[ \text{AIC} = 2k - 2\ln(\hat{L}) \]

Where k is the number of parameters and \(\hat{L}\) is the maximized value of the likelihood function. AIC estimates the relative information lost by a given model - the lower the AIC, the better the model balances fit and complexity.

Bayesian Information Criterion (BIC)

\[ \text{BIC} = k\ln(n) - 2\ln(\hat{L}) \]

Similar to AIC but with a stronger penalty for additional parameters. BIC introduces a sample size (n) dependent penalty term. Lower values indicate better fit, with preference for simpler models especially with larger datasets.
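
The two criteria can be computed directly from a fitted log-likelihood. The sketch below fits three SciPy distributions by MLE to a synthetic sample and reports AIC, BIC, and each candidate's distance from the lowest AIC; counting every fitted value (including location and scale) as a parameter is an assumption of this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.gamma(shape=2.0, scale=3.0, size=400)  # synthetic sample
n = data.size

candidates = {"norm": stats.norm, "gamma": stats.gamma, "expon": stats.expon}
scores = {}
for name, dist in candidates.items():
    params = dist.fit(data)                       # MLE fit
    log_lik = np.sum(dist.logpdf(data, *params))  # ln(L-hat)
    k = len(params)                               # number of fitted parameters
    scores[name] = {
        "aic": 2 * k - 2 * log_lik,
        "bic": k * np.log(n) - 2 * log_lik,
    }

best_aic = min(s["aic"] for s in scores.values())
for name, s in sorted(scores.items(), key=lambda kv: kv[1]["aic"]):
    print(f"{name:6s} AIC={s['aic']:.1f} BIC={s['bic']:.1f} dAIC={s['aic'] - best_aic:.1f}")
```

With n = 400, ln(n) ≈ 6, so BIC charges about three times as much per parameter as AIC does.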

Sum of Squared Errors (SSE)

\[ \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

Measures the discrepancy between observed values (yᵢ) and values predicted by the model (ŷᵢ). Lower SSE indicates better fit, but this metric doesn't account for model complexity and can favor overparameterized models.

Kolmogorov-Smirnov Statistic (KS)

\[ D_n = \sup_x |F_n(x) - F(x)| \]

Measures the maximum distance between the empirical distribution function (Fₙ) and the theoretical cumulative distribution function (F). Lower values indicate better fit. KS statistic is particularly sensitive to differences in the center of the distribution.

Anderson-Darling Statistic (AD)

\[ A^2 = -n - \sum_{i=1}^{n} \frac{2i-1}{n} [\ln(F(x_i)) + \ln(1-F(x_{n+1-i}))] \]

A modification of KS that gives more weight to the tails of the distribution. This makes it more sensitive to outliers and extreme values. Lower values indicate better fit.

Cramér–von Mises Criterion (CvM)

\[ \omega^2 = \int_{-\infty}^{\infty} [F_n(x) - F(x)]^2 \,dF(x) \]

Measures the integrated squared difference between the empirical and theoretical CDFs. Like AD, it's sensitive to tail behavior but generally less so than AD. Lower values indicate better fit.
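
The three CDF-based statistics can be computed for any fitted SciPy distribution. In the sketch below, KS and CvM come from `scipy.stats.kstest` and `scipy.stats.cramervonmises`, while AD is evaluated straight from its definition; the sample is synthetic and the normal model is just an example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
data = rng.normal(loc=0.0, scale=1.0, size=200)  # synthetic sample
fitted = stats.norm(*stats.norm.fit(data))        # frozen fitted distribution

ks = stats.kstest(data, fitted.cdf).statistic
cvm = stats.cramervonmises(data, fitted.cdf).statistic

# Anderson-Darling computed directly from its definition
x = np.sort(data)
n = x.size
i = np.arange(1, n + 1)
# logsf(x[::-1]) is ln(1 - F(x_{n+1-i})) for i = 1..n
ad = -n - np.sum((2 * i - 1) / n * (fitted.logcdf(x) + fitted.logsf(x[::-1])))
print(ks, cvm, ad)
```

Only the statistics are used here. The p-values these functions report assume the parameters were not estimated from the same data, so they are not reliable after fitting.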

Interpreting Multiple Metrics

While Fit_It! uses AIC as the primary ranking metric, we recommend considering multiple goodness-of-fit measures:

  • AIC/BIC: Best for model selection - balance fit and complexity
  • SSE: Good for comparing fits using the same method
  • KS/AD/CvM: Best for assessing distributional similarity

Always combine statistical metrics with visual inspection of the probability plots - a good statistical fit should also look reasonable when plotted against your data.

Visualization Features

Consistent Color Mapping

Fit_It! uses a consistent color-coding system to help you track distributions across all visualizations:

  • Line plot traces: Each distribution has a unique color
  • Chart legend: Matches the line plot colors for easy identification
  • PDF formulas: Displayed in the same color as their plot
  • SciPy documentation links: Color-coded to match their distribution
  • Fun facts: Highlighted with the distribution's color
Implementation Example
// JavaScript color mapping implementation
const colorMap = {
  "norm": "#6366f1",   // Indigo
  "gamma": "#10b981",  // Emerald
  "beta": "#f59e0b",   // Amber
  "expon": "#ef4444"   // Red
};

// trace: the Plotly trace object for this distribution
function applyColorCoding(distribution, trace) {
  const color = colorMap[distribution];
  // Apply to plot trace
  Plotly.newPlot('graph', [{...trace, line: {color}}]);
  // Apply to documentation link
  document.getElementById(`doc-${distribution}`).style.color = color;
  // Apply to PDF formula display
  document.getElementById(`pdf-${distribution}`).style.borderColor = color;
}
📊

Interactive distribution plot with color-coded elements

Fun Fact

The normal distribution is also called the "Gaussian bell curve" after Carl Friedrich Gauss, who introduced it in 1809 to analyze astronomical data. It appears in nature more often than you'd expect - from human height distributions to measurement errors!

Interactive Elements

👁️
Toggle Visibility

Show/hide distributions with checkboxes

🔍
Zoom & Pan

Explore details with interactive controls

💾
Export Options

Download PNG or SVG for publications

SciPy Documentation Integration

Each distribution includes a direct link to its official SciPy documentation. These links are color-coded to match the distribution's plot color for quick reference. The documentation provides:

  • Mathematical definition of the probability density function (PDF)
  • Statistical properties (mean, variance, skewness)
  • Implementation details in SciPy
  • Parameter definitions and constraints
Example

Normal Distribution Documentation:

scipy.stats.norm

Visualization Comparison

Feature                   Fit_It!                            Traditional Tools
Color consistency         All elements synchronized          Often inconsistent
Distribution limit        Compare 7 simultaneously           Typically 1-2 distributions
Documentation access      Direct links with color coding     Manual search required
Contextual information    Fun facts and educational notes    Pure statistical output

RESTful API Architecture

Service Limitations Notice

Fit_It! API is a free educational service hosted on free-tier cloud infrastructure. Please be considerate of resource constraints:

  • Limited to 3 requests per day per IP
  • Maximum 500 data points per request
  • No persistent storage - data is deleted after processing
  • Prioritize human-driven requests over automated scripts

Fit_It! by Gene is built on a modern RESTful API architecture that handles all computational tasks. This design enables scalability, separation of concerns, and efficient resource management.

Architecture Benefits

  • Scalability: Horizontal scaling to handle multiple users
  • Separation of Concerns: Frontend and backend independent
  • Efficiency: Caching and stateless operations
  • Interoperability: JSON-based communication

API Endpoints

POST /upload

Upload CSV data for analysis

POST /analyze

Perform distribution fitting

GET /plot

Retrieve visualization data

GET /distributions

List available distributions

API Usage Examples

cURL Example

# Upload data
curl -X POST https://api.fitit-tool.com/upload \
  -F "file=@data.csv"

# Analyze data
curl -X POST https://api.fitit-tool.com/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "a1b2c3d4",
    "selected_dists": ["norm", "gamma", "expon"],
    "fit_method": "mle"
  }'

# Retrieve results (URL quoted so the shell does not interpret the "?")
curl -X GET "https://api.fitit-tool.com/plot?session_id=a1b2c3d4"

Python Example

import requests

# Step 1: Upload data (the with-block closes the file after the upload)
upload_url = "https://api.fitit-tool.com/upload"
with open('data.csv', 'rb') as f:
    upload_response = requests.post(upload_url, files={'file': f})
session_id = upload_response.json()['session_id']

# Step 2: Analyze data
analyze_url = "https://api.fitit-tool.com/analyze"
payload = {
    "session_id": session_id,
    "selected_dists": ["norm", "beta", "weibull_min"],
    "fit_method": "robust_min_sse"
}
analysis_response = requests.post(analyze_url, json=payload)

# Step 3: Retrieve visualization data
plot_url = f"https://api.fitit-tool.com/plot?session_id={session_id}"
plot_data = requests.get(plot_url).json()

# Process results (field names follow the Response Structure section)
best_fit = plot_data['results']['best_fit']
print(f"Best fit: {best_fit['name']}")
print(f"AIC: {best_fit['aic']}")

Postman Collection (Unimplemented)

A Postman collection for quickly testing the API endpoints is planned but not yet available.

Response Structure

{
  "status": "success",
  "session_id": "a1b2c3d4e5",
  "results": {
    "best_fit": {
      "name": "gamma",
      "params": [1.85, 0.0, 0.75],
      "aic": 1245.67,
      "bic": 1258.92
    },
    "distributions": [
      {
        "name": "gamma",
        "params": [1.85, 0.0, 0.75],
        "aic": 1245.67,
        "bic": 1258.92,
        "ks_stat": 0.042,
        "sse": 0.0032
      },
      {
        "name": "norm",
        "params": [5.2, 1.8],
        "aic": 1298.45,
        "bic": 1308.21,
        "ks_stat": 0.087,
        "sse": 0.0121
      }
    ]
  },
  "plot_data": {
    "histogram": {
      "x": [1.2, 2.4, 3.1, ...],
      "y": [0.05, 0.12, 0.18, ...]
    },
    "pdfs": [
      {
        "name": "gamma",
        "x": [0.5, 0.6, 0.7, ...],
        "y": [0.02, 0.04, 0.07, ...],
        "color": "#6366f1"
      },
      {
        "name": "norm",
        "x": [0.5, 0.6, 0.7, ...],
        "y": [0.01, 0.03, 0.05, ...],
        "color": "#10b981"
      }
    ]
  },
  "metadata": {
    "fit_method": "mle",
    "data_points": 1024,
    "processing_time": 1.24
  }
}
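
A short sketch of consuming this structure on the client side: it parses a trimmed copy of the response above (same field names and values, unused fields omitted) and ranks the candidates by their distance from the lowest AIC. The variable names are illustrative.

```python
import json

# Trimmed copy of the documented response; only the fields used below
response_text = """
{
  "status": "success",
  "results": {
    "best_fit": {"name": "gamma", "aic": 1245.67},
    "distributions": [
      {"name": "gamma", "aic": 1245.67, "ks_stat": 0.042},
      {"name": "norm", "aic": 1298.45, "ks_stat": 0.087}
    ]
  }
}
"""

response = json.loads(response_text)
fits = response["results"]["distributions"]
best_aic = min(f["aic"] for f in fits)

# Rank by AIC and report each candidate's distance from the best score
ranking = [
    (f["name"], round(f["aic"] - best_aic, 2))
    for f in sorted(fits, key=lambda f: f["aic"])
]
print(ranking)
```

Here the normal fit sits 52.78 AIC units above the gamma fit, far beyond the ΔAIC > 10 threshold for "essentially no support".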

Free Service Reminder

Fit_It! API is provided as a free educational resource running on free-tier cloud hosting. To ensure fair access for all users:

  • Limit requests to 1-2 analyses per session
  • Space out requests by at least 10 minutes
  • Do not use for commercial applications
  • Avoid automated scripts or scraping

We appreciate your considerate usage to help keep this service available to the educational community.

Future Plans

Planned Features

  • Normalization and Transformation pre-fit
  • Discrete & Time series distribution analysis
  • Copula modeling for multivariate distributions (different project)
  • Quadrature rules derivation (to be integrated with Fit_It!)

Community Guidelines

  • No commercial use
  • Share findings, not resources
  • Space out requests (5+ minutes between analyses)
  • Report issues via LinkedIn, but please do not expect a quick response

"Probability is the very guide of life" - Cicero

Last Updated: June 2025 | Version 1.0.0 | By Gene Boo