Fit_It! by Gene - Interactive Documentation

Introduction

Fit_It! by Gene is a free educational tool that brings the world of probability distributions to your browser. Powered by SciPy, FastAPI, and Plotly, it allows you to explore how different statistical distributions fit your data through an intuitive interface.

📊

Data Exploration

Upload your dataset and discover its probability distribution

🔍

Statistical Analysis

Compare 4 fitting methods and evaluate goodness-of-fit

📈

Visual Insights

Interactive visualizations help understand distribution fits

Note: Fit_It! is a hobby project developed during spare time. It's provided as a free service for educational purposes only. Please use it considerately to help keep it available for everyone.

Why Distribution Fitting Matters

Understanding probability distributions helps in numerous domains:

Modeling Real-world Phenomena

Customer wait times (exponential distribution)
Equipment failure (Weibull distribution)
Biological measurements (normal distribution)

Decision Making under Uncertainty

Risk assessment in finance
Quality control in manufacturing
Resource planning in operations

💡

"All models are wrong, but some are useful"

- George Box

Service Limitations

Resource Constraints

Hosted on free-tier cloud services with limited CPU and memory
No persistent storage - uploaded data is deleted after session
Limited bandwidth for handling multiple simultaneous requests

Please Note: To keep this service free and available to all, we request that you limit your usage to 1-2 analyses per session and avoid automated scripts.

Technical Boundaries

Maximum 500 data points per analysis
Maximum 7 distributions per analysis
Limited to continuous distributions
Header detection works best with single-row headers

Getting Started

Data Requirements

Format: CSV files only
Size: 20 to 500 data points
Columns:
- Single column: Values only
- Two columns: First column ignored, second used for fitting
Headers:
- Auto-detection for single-row headers
- Best practice: Use single-row headers or no headers

Recommended Structure

Value
12.5
14.3
11.7
...

Alternative Structure

Index, Value
1, 12.5
2, 14.3
3, 11.7
...
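The data requirements above can be checked locally before uploading. The following is a minimal sketch, not the service's actual parser: the function name `load_values` and the header heuristic (first row is a header when its last cell is not numeric) are assumptions made for illustration.

```python
import csv
import io

MIN_POINTS, MAX_POINTS = 20, 500  # limits stated in this documentation


def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def load_values(csv_text):
    """Return the fitting column as floats, skipping one optional header row."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        raise ValueError("empty file")
    # Assumed heuristic: the first row is a header when its last cell is not numeric.
    if not _is_number(rows[0][-1].strip()):
        rows = rows[1:]
    # One column: use it. Two columns: ignore the first, use the second (last) one.
    values = [float(row[-1]) for row in rows]
    if not MIN_POINTS <= len(values) <= MAX_POINTS:
        raise ValueError(f"expected {MIN_POINTS}-{MAX_POINTS} points, got {len(values)}")
    return values


sample = "Index, Value\n" + "\n".join(f"{i}, {10 + i * 0.5}" for i in range(1, 26))
values = load_values(sample)
print(len(values), values[:3])
```

Running a check like this first avoids spending one of the limited analyses on a file the service would reject.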

Workflow

1

Upload CSV

2

Select Distributions

3

Choose Fitting Method

4

Review Results

Core Concepts

Domain Validation (Suggestions Toggle)

Many distributions have mathematical constraints on the data they can model (data domain rules). Fit_It! tries to check compatibility automatically, but the check is not robust, so take its suggestions with a pinch of salt:

Example domain rule: { "domain_check": "data > 0", "distributions": ["alpha", "gamma"], "conditional": false }

Implementation

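import numpy as np

def validate_data_domains(data):
    # DOMAIN_RULES is a list of rule dicts shaped like the example above
    incompatible = set()
    for rule in DOMAIN_RULES:
        context = {'np': np, 'data': data}
        try:
            # eval is tolerable here only because the rules are trusted server-side config
            mask = eval(rule["domain_check"], context)
            if not np.all(mask):
                # Flag incompatible distributions
                incompatible.update(rule["distributions"])
        except Exception:
            # Handle evaluation error: skip a rule that cannot be evaluated
            continue
    return incompatible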
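The same idea can be expressed without `eval` by storing each rule as a callable. This is an alternative sketch, not the approach Fit_It! uses: the `check` key and the second rule (a [0, 1] range for `beta`) are hypothetical, and only the `data > 0` rule for `alpha` and `gamma` comes from the example above.

```python
# Hypothetical rule table: only the first rule comes from the documentation.
DOMAIN_RULES = [
    {"check": lambda xs: all(x > 0 for x in xs), "distributions": ["alpha", "gamma"]},
    {"check": lambda xs: all(0 <= x <= 1 for x in xs), "distributions": ["beta"]},
]


def incompatible_distributions(data):
    """Return the sorted names of distributions whose domain rule the data violates."""
    flagged = set()
    for rule in DOMAIN_RULES:
        if not rule["check"](data):
            flagged.update(rule["distributions"])
    return sorted(flagged)


print(incompatible_distributions([0.2, 0.5, 0.9]))   # all rules satisfied
print(incompatible_distributions([-1.0, 0.5, 2.0]))  # negative value present
```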

Adaptive Binning System used for Histogram and Chi-square

Freedman-Diaconis Rule

Optimal for non-normal data with potential outliers:

\[ \text{Bin Width} = 2 \times \frac{\text{IQR}}{n^{1/3}} \] Where: \[ \text{IQR} = Q_3 - Q_1 \]

Rice Rule

Optimal for larger datasets:

\[ \text{Number of Bins} = 2 \times n^{1/3} \]

Implementation

import numpy as np

def calculate_bins(data):
    n = len(data)
    if n < 500:
        # Freedman-Diaconis rule
        q75, q25 = np.percentile(data, [75, 25])
        iqr = q75 - q25
        if iqr == 0:
            return 10  # degenerate IQR: fall back to the minimum bin count
        bin_width = 2 * iqr / (n ** (1/3))
        bins = int(np.ceil((np.max(data) - np.min(data)) / bin_width))
    else:
        # Rice rule
        bins = int(np.ceil(2 * (n ** (1/3))))
    return max(bins, 10)
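As a worked example of the two rules, the sketch below computes both bin counts for the integers 1 to 100 using only the standard library. The helper names are illustrative, and `statistics.quantiles` with `method="inclusive"` is used here as a stand-in for `np.percentile`.

```python
import math
import statistics


def freedman_diaconis_bins(data):
    """Bin count from the Freedman-Diaconis bin width 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    bin_width = 2 * (q3 - q1) / len(data) ** (1 / 3)
    return math.ceil((max(data) - min(data)) / bin_width)


def rice_bins(data):
    """Bin count from the Rice rule 2 * n^(1/3)."""
    return math.ceil(2 * len(data) ** (1 / 3))


data = list(range(1, 101))  # 1, 2, ..., 100
fd = freedman_diaconis_bins(data)
rice = rice_bins(data)
print(fd, rice, max(fd, 10))  # prints: 5 10 10
```

Here the IQR is 49.5, so the Freedman-Diaconis width is about 21.3 and the range of 99 needs 5 bins; the floor of 10 bins in `calculate_bins` then raises that to 10.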

Data Normalization

While Fit_It! works with raw data values, normalizing your data before analysis can significantly improve results for many distributions. Normalization transforms data to a common scale without distorting differences in ranges.

Why Normalize?

Improves numerical stability for fitting algorithms
Prevents features with large scales from dominating
Enables meaningful comparison of parameters
Helps distributions with scale parameters (e.g., normal, exponential)
Limits the influence of outliers when robust variants (median and IQR) are used

Common Techniques

Z-score:

\[ z = \frac{x - \mu}{\sigma} \]

Best for normally distributed data

Min-Max:

\[ x' = \frac{x - \min(x)}{\max(x) - \min(x)} \]

Scales to [0,1] range

Robust Scaling:

\[ x' = \frac{x - \text{median}(x)}{\text{IQR}(x)} \]

Resistant to outliers

IQR-Based Normalization

IQR (Interquartile Range) normalization is particularly useful for skewed data:

\[ \text{IQR} = Q_3 - Q_1 \] \[ x' = \frac{x - Q_1}{\text{IQR}} \]

Where Q₁ is the 25th percentile and Q₃ is the 75th percentile. This method maps Q₁ to 0 and Q₃ to 1, so the middle 50% of the data falls in [0,1]; values below Q₁ become negative and values above Q₃ exceed 1.

Robust Scaler

Advanced robust normalization using median and IQR:

\[ x' = \frac{x - \text{median}(x)}{\text{IQR}} \]

Centers data around the median
Scales using IQR instead of standard deviation
Resistant to up to 25% outliers in either tail
Ideal for heavy-tailed distributions

Recommendation & Implementation

For best results with Fit_It!, we recommend normalizing your data before uploading. While not currently automated in this version, normalization can be easily done in spreadsheet software or Python:

# Python normalization examples
import numpy as np
from sklearn.preprocessing import RobustScaler

# Robust scaling
data = np.array([...]).reshape(-1, 1)  # replace [...] with your values
scaler = RobustScaler(quantile_range=(25, 75))
robust_normalized = scaler.fit_transform(data)

# IQR-based normalization
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
iqr_normalized = (data - Q1) / IQR

Important: Remember to keep track of your scaling parameters (median, IQR, min/max) if you need to transform results back to original scale!
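To illustrate why the scaling parameters matter, here is a small standard-library sketch of a robust-scaling round trip. The function names are hypothetical, and the quartiles use `statistics.quantiles` with `method="inclusive"`.

```python
import statistics


def robust_fit(data):
    """Return the (median, IQR) pair needed to scale and later unscale."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return median, q3 - q1


def robust_scale(data, median, iqr):
    return [(x - median) / iqr for x in data]


def robust_unscale(scaled, median, iqr):
    return [z * iqr + median for z in scaled]


data = [12.5, 14.3, 11.7, 13.1, 40.0]  # last value is an outlier
median, iqr = robust_fit(data)
scaled = robust_scale(data, median, iqr)
restored = robust_unscale(scaled, median, iqr)
print(median, iqr)
```

For a location-scale family fitted on the scaled data, a fitted location maps back as `loc * iqr + median` and a fitted scale as `scale * iqr`.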

Data Transformation

Beyond normalization, data transformation can make your data better conform to distributional assumptions. Transformations modify the shape of your distribution, often addressing skewness and making patterns more visible.

Why Transform Data?

Reduce right/left skewness in distributions
Stabilize variance across the data range
Improve linear relationships between variables
Make data more closely approximate normal distribution
Enhance performance of parametric statistical methods

Key Considerations

Always check distribution before and after transformation
Some transformations require strictly positive values
Interpretation of results changes after transformation
Reverse transformation may be needed for final results
Choose transformation based on data characteristics

Common Transformation Techniques

Logarithmic Transformation

\[ x' = \log_b(x) \quad \text{or} \quad x' = \log_b(x + c) \]

Best for right-skewed data. Compresses large values and expands small values. Requires x > 0. Add constant c if data contains zeros.

Use Cases:

Income distributions
Population sizes
Response times

Square Root Transformation

\[ x' = \sqrt{x} \quad \text{or} \quad x' = \sqrt{x + c} \]

Moderate effect on right-skewed data. Less aggressive than log transform. Works for zero values with constant adjustment.

Use Cases:

Count data (e.g., customer arrivals)
Area measurements
Biological growth metrics

Box-Cox Transformation

\[ x'(\lambda) = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases} \]

Power transformation that finds optimal λ to make data most normal-like. Requires strictly positive values.

Use Cases:

Heteroscedastic data
Skewed continuous variables
When optimal transformation is unknown

Yeo-Johnson Transformation

\[ x'(\lambda) = \begin{cases} \frac{(x+1)^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, x \geq 0 \\ \log(x+1) & \text{if } \lambda = 0, x \geq 0 \\ -\frac{(-x+1)^{2-\lambda} - 1}{2-\lambda} & \text{if } \lambda \neq 2, x < 0 \\ -\log(-x+1) & \text{if } \lambda = 2, x < 0 \end{cases} \]

Extension of Box-Cox that works for both positive and negative values. More flexible for real-world datasets.

Use Cases:

Datasets with negative values
When Box-Cox can't be applied
Financial data with negative returns
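The piecewise definition can be written out directly. The sketch below implements the standard Yeo-Johnson transform for a single value and a given λ; it does not estimate λ, and the function name is illustrative.

```python
import math


def yeo_johnson(x, lam):
    """Apply the Yeo-Johnson transform to a single value for a given lambda."""
    if x >= 0:
        if lam == 0:
            return math.log(x + 1)
        return ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log(-x + 1)
    return -((-x + 1) ** (2 - lam) - 1) / (2 - lam)


for lam in (0, 0.5, 1, 2):
    print(lam, [round(yeo_johnson(x, lam), 4) for x in (-2.3, -0.5, 0.0, 1.2, 10.4)])
```

With λ = 1 both branches reduce to the identity, which is a quick sanity check on the signs.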

Implementation in Python

import numpy as np
import scipy.stats as stats
from sklearn.preprocessing import PowerTransformer

# Sample data with positive and negative values
data = np.array([1.2, 5.7, 0.8, -2.3, 10.4, -0.5, 7.1])

# Logarithmic transformation (non-negative data only)
log_transformed = np.log1p(data[data >= 0])  # log(1+x) to handle zeros

# Signed square root transformation (preserves sign, so negatives are allowed)
sqrt_transformed = np.sqrt(np.abs(data)) * np.sign(data)

# Box-Cox transformation (strictly positive values only)
positive_data = data[data > 0]  # the filter already excludes zeros and negatives
boxcox_transformed, lambda_val = stats.boxcox(positive_data)

# Yeo-Johnson transformation (handles all values)
yj_transformer = PowerTransformer(method='yeo-johnson', standardize=False)
yj_transformed = yj_transformer.fit_transform(data.reshape(-1, 1))

# Inverse transformation example
inverse_data = yj_transformer.inverse_transform(yj_transformed)

Important: Always test transformations visually with Q-Q plots or distribution comparisons. Remember that parameters (like λ in Box-Cox) must be saved to reverse transformations later.

Transformation Selection Guide

Data Characteristic	Recommended Transformation
Right-skewed, positive values	Logarithmic, Square root, Box-Cox
Left-skewed, positive values	Power (x^k, k>1), Square (x^2)
Positive and negative values	Yeo-Johnson, Signed power transformations
Count data, Poisson-like	Square root, Anscombe (√(x + 3/8))
Proportions, percentages	Logit, Arcsine square root

Fitting Strategies

Maximum Likelihood Estimation (MLE)

\[ \mathcal{L}(\theta | \mathbf{x}) = \prod_{i=1}^{n} f(x_i | \theta) \]

Pros

Consistent for large samples
Efficient estimators
Asymptotically normal

Cons

Requires PDF specification
Sensitive to starting values
Can be biased for small samples
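For a concrete case, the exponential distribution has a closed-form maximum likelihood estimate: the rate equals one over the sample mean. The sketch below works with the log of the likelihood product, which turns it into a sum; the function names and the three wait times are made up for illustration.

```python
import math


def exponential_log_likelihood(data, rate):
    """Log of the likelihood product for the PDF f(x) = rate * exp(-rate * x)."""
    return sum(math.log(rate) - rate * x for x in data)


def exponential_mle(data):
    """Closed-form maximizer of the log-likelihood: 1 / sample mean."""
    return len(data) / sum(data)


wait_times = [1.0, 2.0, 3.0]
rate_hat = exponential_mle(wait_times)
best = exponential_log_likelihood(wait_times, rate_hat)
print(rate_hat, round(best, 4))  # prints: 0.5 -5.0794
```

Most distributions have no closed form, which is where numerical optimization and its sensitivity to starting values come in.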

Minimum Chi-Squared

\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \]

Pros

Provides goodness-of-fit measure
Additive for combining results
Easy to compute

Cons

Requires binning data
Sensitive to bin size
Not recommended for small samples
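A minimal sketch of minimum chi-squared fitting, assuming an exponential model, fixed bin edges, and a coarse grid search in place of a proper optimizer. None of these choices are taken from the Fit_It! implementation; the sample is generated deterministically from exponential quantiles with a rate of 2.0.

```python
import math


def expected_counts(rate, edges, n):
    """Expected bin counts under the exponential CDF F(x) = 1 - exp(-rate * x)."""
    cdf = [1.0 - math.exp(-rate * e) for e in edges]
    return [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]


def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over the bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Deterministic sample: exponential quantiles for an assumed true rate of 2.0.
n = 200
data = [-math.log(1 - (i - 0.5) / n) / 2.0 for i in range(1, n + 1)]

edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, math.inf]  # last bin is open-ended
observed = [sum(lo <= x < hi for x in data) for lo, hi in zip(edges, edges[1:])]

# Coarse grid search over candidate rates from 0.5 to 4.0.
candidates = [0.5 + 0.05 * k for k in range(71)]
best_rate = min(candidates, key=lambda r: chi_squared(observed, expected_counts(r, edges, n)))
print(observed, round(best_rate, 2))
```

Changing the bin edges changes the observed counts and can shift the estimate, which is the bin-size sensitivity listed above.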

Method of Moments (MM)

\[ \mu_k' = E[X^k] = \frac{1}{n}\sum_{i=1}^{n} x_i^k \]

Pros

Simple and explicit solutions
Intuitive approach
Fast computation

Cons

Less efficient than MLE
Can produce biased estimates
Limited to distributions with moments
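As an example of the explicit solutions this method gives, a gamma distribution with shape k and scale θ has mean kθ and variance kθ², so matching the first two sample moments yields k = mean²/variance and θ = variance/mean. The sketch below uses the 1/n variance to stay consistent with the sample-moment formula; the function name and data are illustrative.

```python
import statistics


def gamma_method_of_moments(data):
    """Match the first two sample moments to a gamma(shape k, scale theta)."""
    mean = statistics.fmean(data)
    variance = statistics.pvariance(data)  # 1/n version, matching the sample moments
    shape = mean ** 2 / variance   # from mean = k * theta and variance = k * theta^2
    scale = variance / mean
    return shape, scale


shape, scale = gamma_method_of_moments([2, 4, 4, 4, 5, 5, 7, 9])
print(shape, scale)
```

For this sample the mean is 5 and the variance is 4, giving a shape of 6.25 and a scale of 0.8.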

Robust Min-SSE

\[ \text{SSE}_{\text{weighted}} = \sum_{i=1}^{n} \left[ \sqrt{\text{KDE}(x_i)} \cdot (\text{PDF}(x_i) - \text{KDE}(x_i))^2 \right] \]

Pros

Focuses on high-density regions
No binning artifacts
Excellent for multi-modal data

Cons

Computationally intensive
Bandwidth selection affects results
Slower than other methods
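The weighted SSE objective can be sketched with a hand-rolled Gaussian kernel density estimate. The bandwidth rule below (the normal-reference rule of thumb) is an assumption, as are the helper names; the source does not say how Fit_It! chooses its bandwidth.

```python
import math
from statistics import NormalDist, pstdev


def gaussian_kde(data, bandwidth):
    """Return a function estimating the density as an average of Gaussian kernels."""
    kernels = [NormalDist(x, bandwidth) for x in data]
    return lambda t: sum(k.pdf(t) for k in kernels) / len(kernels)


def weighted_sse(data, pdf, kde):
    """Sum of sqrt(KDE) * (PDF - KDE)^2 over the data points."""
    return sum(math.sqrt(kde(x)) * (pdf(x) - kde(x)) ** 2 for x in data)


# Deterministic sample: quantiles of a normal with mean 5 and standard deviation 2.
n = 100
truth = NormalDist(5.0, 2.0)
data = [truth.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Assumed bandwidth choice: the normal-reference rule of thumb.
bandwidth = 1.06 * pstdev(data) * n ** (-1 / 5)
kde = gaussian_kde(data, bandwidth)

good = weighted_sse(data, NormalDist(5.0, 2.0).pdf, kde)
bad = weighted_sse(data, NormalDist(8.0, 2.0).pdf, kde)
print(good < bad)
```

A fitting routine would minimize this quantity over the candidate distribution's parameters. Each evaluation costs on the order of n² kernel evaluations, which is why the method is slower than the others.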

Output Interpretation

Goodness-of-Fit Ranking

Fit_It! ranks distributions using the Akaike Information Criterion (AIC). The distribution with the lowest AIC score is considered the best fit. This approach balances model fit with complexity, penalizing distributions with more parameters to avoid overfitting.

Why AIC?

Balances goodness-of-fit and model complexity
Penalizes unnecessary parameters
Based on information theory principles
Valid for both nested and non-nested models
Asymptotically efficient for large samples

Interpretation Guidelines

ΔAIC < 2: Substantial support for the model (ΔAIC is a model's AIC minus the lowest AIC among the candidates)
ΔAIC 4-7: Considerably less support
ΔAIC > 10: Essentially no support
Always compare with visual fit assessment
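The guidelines above can be applied mechanically once each model's AIC is known. In this sketch the function names and the "between bands" label are illustrative, and the two AIC values are the ones shown in the sample API response.

```python
def delta_aic(aic_by_name):
    """Return each model's AIC difference from the best (lowest) AIC."""
    best = min(aic_by_name.values())
    return {name: aic - best for name, aic in aic_by_name.items()}


def support_label(delta):
    """Map a delta to the rough bands listed above; gaps between bands stay unlabeled."""
    if delta < 2:
        return "substantial"
    if 4 <= delta <= 7:
        return "considerably less"
    if delta > 10:
        return "essentially none"
    return "between bands"


scores = {"gamma": 1245.67, "norm": 1298.45}  # AIC values from the sample API response
for name, delta in delta_aic(scores).items():
    print(name, round(delta, 2), support_label(delta))
```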

Goodness-of-Fit Metrics

Akaike Information Criterion (AIC)

\[ \text{AIC} = 2k - 2\ln(\hat{L}) \]

Where k is the number of parameters and \(\hat{L}\) is the maximized value of the likelihood function. AIC estimates the relative information lost by a given model - the lower the AIC, the better the model balances fit and complexity.

Bayesian Information Criterion (BIC)

\[ \text{BIC} = k\ln(n) - 2\ln(\hat{L}) \]

Similar to AIC but with a stronger penalty for additional parameters. BIC introduces a sample size (n) dependent penalty term. Lower values indicate better fit, with preference for simpler models especially with larger datasets.
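Both criteria follow directly from the maximized log-likelihood. The sketch below uses a hypothetical fit (2 parameters, 100 observations, log-likelihood of -120) to show the arithmetic.

```python
import math


def aic(log_likelihood, k):
    """AIC = 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """BIC = k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical fit: 2 parameters, 100 observations, maximized log-likelihood of -120.
print(aic(-120.0, 2), round(bic(-120.0, 2, 100), 4))  # prints: 244.0 249.2103
```

The BIC penalty per parameter, ln(n), exceeds the AIC penalty of 2 whenever n is 8 or more, so it applies to every dataset within the 20 to 500 point limits.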

Sum of Squared Errors (SSE)

\[ \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

Measures the discrepancy between observed values (yᵢ) and values predicted by the model (ŷᵢ). Lower SSE indicates better fit, but this metric doesn't account for model complexity and can favor overparameterized models.

Kolmogorov-Smirnov Statistic (KS)

\[ D_n = \sup_x |F_n(x) - F(x)| \]

Measures the maximum distance between the empirical distribution function (Fₙ) and the theoretical cumulative distribution function (F). Lower values indicate better fit. KS statistic is particularly sensitive to differences in the center of the distribution.
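Because the empirical CDF is a step function, the supremum is reached at a data point, just before or just after the step. A minimal standard-library sketch follows (the function name is illustrative).

```python
from statistics import NormalDist


def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF steps and a theoretical CDF."""
    xs = sorted(data)
    n = len(xs)
    gaps = []
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x, so check both sides.
        gaps.append(max(i / n - f, f - (i - 1) / n))
    return max(gaps)


d = ks_statistic([-1.0, 0.0, 1.0], NormalDist(0.0, 1.0).cdf)
print(round(d, 4))  # prints: 0.1747
```

For these three points against a standard normal, the largest gap is 1/3 minus Φ(-1), which is about 0.1747.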

Anderson-Darling Statistic (AD)

\[ A^2 = -n - \sum_{i=1}^{n} \frac{2i-1}{n} [\ln(F(x_i)) + \ln(1-F(x_{n+1-i}))] \]

A modification of KS that gives more weight to the tails of the distribution. This makes it more sensitive to outliers and extreme values. Lower values indicate better fit.

Cramér–von Mises Criterion (CvM)

\[ \omega^2 = \int_{-\infty}^{\infty} [F_n(x) - F(x)]^2 \,dF(x) \]

Measures the integrated squared difference between the empirical and theoretical CDFs. Like AD, it's sensitive to tail behavior but generally less so than AD. Lower values indicate better fit.
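Both the Anderson-Darling and Cramér–von Mises statistics can be computed from the sorted sample. The sketch below follows the A² formula given earlier and uses the standard computational form of n·ω², namely 1/(12n) + Σ[(2i−1)/(2n) − F(x_i)]²; the function names and test data are illustrative.

```python
import math
from statistics import NormalDist


def anderson_darling(data, cdf):
    """A^2 computed from the sorted sample."""
    xs = sorted(data)
    n = len(xs)
    total = sum(
        (2 * i - 1) * (math.log(cdf(xs[i - 1])) + math.log(1 - cdf(xs[n - i])))
        for i in range(1, n + 1)
    )
    return -n - total / n


def cramer_von_mises(data, cdf):
    """Computational form of n * omega^2 for a sorted sample."""
    xs = sorted(data)
    n = len(xs)
    return 1 / (12 * n) + sum(
        ((2 * i - 1) / (2 * n) - cdf(x)) ** 2 for i, x in enumerate(xs, start=1)
    )


# Deterministic sample: standard normal quantiles.
n = 50
data = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
matching, shifted = NormalDist(0.0, 1.0).cdf, NormalDist(1.0, 1.0).cdf

print(anderson_darling(data, matching) < anderson_darling(data, shifted))
print(cramer_von_mises(data, matching) < cramer_von_mises(data, shifted))
```

Because the sample sits exactly on the matching distribution's quantiles, the Cramér–von Mises sum vanishes and the statistic reduces to 1/(12n).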

Interpreting Multiple Metrics

While Fit_It! uses AIC as the primary ranking metric, we recommend considering multiple goodness-of-fit measures:

AIC/BIC: Best for model selection - balance fit and complexity
SSE: Good for comparing fits using the same method
KS/AD/CvM: Best for assessing distributional similarity

Always combine statistical metrics with visual inspection of the probability plots - a good statistical fit should also look reasonable when plotted against your data.

Visualization Features

Consistent Color Mapping

Fit_It! uses a consistent color-coding system to help you track distributions across all visualizations:

Line plot traces: Each distribution has a unique color
Chart legend: Matches the line plot colors for easy identification
PDF formulas: Displayed in the same color as their plot
SciPy documentation links: Color-coded to match their distribution
Fun facts: Highlighted with the distribution's color

Implementation Example

// JavaScript color mapping implementation
const colorMap = {
  "norm": "#6366f1",   // Indigo
  "gamma": "#10b981",  // Emerald
  "beta": "#f59e0b",   // Amber
  "expon": "#ef4444"   // Red
};

// `trace` is the Plotly trace object for this distribution
function applyColorCoding(distribution, trace) {
  const color = colorMap[distribution];
  // Apply to plot trace
  Plotly.newPlot('graph', [{...trace, line: {...trace.line, color}}]);
  // Apply to documentation link
  document.getElementById(`doc-${distribution}`).style.color = color;
  // Apply to PDF formula display
  document.getElementById(`pdf-${distribution}`).style.borderColor = color;
}

📊

Interactive distribution plot with color-coded elements

Fun Fact

The normal distribution is also called the Gaussian distribution or "bell curve". It is named after Carl Friedrich Gauss, who used it in 1809 to analyze astronomical data. It appears in nature more often than you'd expect, from human height distributions to measurement errors!

Interactive Elements

👁️

Toggle Visibility

Show/hide distributions with checkboxes

🔍

Zoom & Pan

Explore details with interactive controls

💾

Export Options

Download PNG or SVG for publications

SciPy Documentation Integration

Each distribution includes a direct link to its official SciPy documentation. These links are color-coded to match the distribution's plot color for quick reference. The documentation provides:

Mathematical definition of the probability density function (PDF)
Statistical properties (mean, variance, skewness)
Implementation details in SciPy
Parameter definitions and constraints

Example

Normal Distribution Documentation:

scipy.stats.norm

Visualization Comparison

Feature	Fit_It!	Traditional Tools
Color consistency	All elements synchronized	Often inconsistent
Distribution limit	Compare 7 simultaneously	Typically 1-2 distributions
Documentation access	Direct links with color coding	Manual search required
Contextual information	Fun facts and educational notes	Pure statistical output

RESTful API Architecture

Service Limitations Notice

Fit_It! API is a free educational service hosted on free-tier cloud infrastructure. Please be considerate of resource constraints:

Limited to 3 requests per day per IP
Maximum 500 data points per request
No persistent storage - data is deleted after processing
Prioritize human-driven requests over automated scripts

Fit_It! by Gene is built on a modern RESTful API architecture that handles all computational tasks. This design enables scalability, separation of concerns, and efficient resource management.

Architecture Benefits

Scalability: Horizontal scaling to handle multiple users
Separation of Concerns: Frontend and backend independent
Efficiency: Caching and stateless operations
Interoperability: JSON-based communication

API Endpoints

POST /upload

Upload CSV data for analysis

POST /analyze

Perform distribution fitting

GET /plot

Retrieve visualization data

GET /distributions

List available distributions

API Usage Examples

cURL Example

# Upload data
curl -X POST https://api.fitit-tool.com/upload \
  -F "file=@data.csv"

# Analyze data
curl -X POST https://api.fitit-tool.com/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "a1b2c3d4",
    "selected_dists": ["norm", "gamma", "expon"],
    "fit_method": "mle"
  }'

# Retrieve results (URL quoted so the shell does not interpret "?")
curl -X GET "https://api.fitit-tool.com/plot?session_id=a1b2c3d4"

Python Example

import requests

# Step 1: Upload data (the context manager closes the file after the request)
upload_url = "https://api.fitit-tool.com/upload"
with open('data.csv', 'rb') as f:
    upload_response = requests.post(upload_url, files={'file': f})
upload_response.raise_for_status()
session_id = upload_response.json()['session_id']

# Step 2: Analyze data
analyze_url = "https://api.fitit-tool.com/analyze"
payload = {
    "session_id": session_id,
    "selected_dists": ["norm", "beta", "weibull_min"],
    "fit_method": "robust_min_sse"
}
analysis_response = requests.post(analyze_url, json=payload)
analysis_response.raise_for_status()

# Step 3: Retrieve visualization data
plot_url = "https://api.fitit-tool.com/plot"
plot_data = requests.get(plot_url, params={"session_id": session_id}).json()

# Process results (assumes the layout shown under Response Structure,
# where best_fit sits inside "results")
best_fit = plot_data['results']['best_fit']
print(f"Best fit: {best_fit['name']}")
print(f"AIC: {best_fit['aic']}")
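Since requests are limited, it can help to validate the `/analyze` payload locally before sending it. The helper below is a hypothetical sketch: the function name is invented, and the only rule it enforces is the documented maximum of 7 distributions per analysis.

```python
MAX_DISTRIBUTIONS = 7  # limit stated under Technical Boundaries


def build_analyze_payload(session_id, selected_dists, fit_method="mle"):
    """Build the JSON body for POST /analyze, rejecting oversized requests locally."""
    unique = list(dict.fromkeys(selected_dists))  # drop duplicates, keep order
    if not unique:
        raise ValueError("select at least one distribution")
    if len(unique) > MAX_DISTRIBUTIONS:
        raise ValueError(f"at most {MAX_DISTRIBUTIONS} distributions per analysis")
    return {"session_id": session_id, "selected_dists": unique, "fit_method": fit_method}


payload = build_analyze_payload("a1b2c3d4", ["norm", "gamma", "norm", "expon"])
print(payload)
```

The returned dictionary can be passed as the `json=` argument of `requests.post`.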

Postman Collection (Unimplemented)

A Postman collection for quickly testing the API endpoints is planned but not yet available.

Response Structure

{
  "status": "success",
  "session_id": "a1b2c3d4e5",
  "results": {
    "best_fit": {
      "name": "gamma",
      "params": [1.85, 0.0, 0.75],
      "aic": 1245.67,
      "bic": 1258.92
    },
    "distributions": [
      {
        "name": "gamma",
        "params": [1.85, 0.0, 0.75],
        "aic": 1245.67,
        "bic": 1258.92,
        "ks_stat": 0.042,
        "sse": 0.0032
      },
      {
        "name": "norm",
        "params": [5.2, 1.8],
        "aic": 1298.45,
        "bic": 1308.21,
        "ks_stat": 0.087,
        "sse": 0.0121
      }
    ]
  },
  "plot_data": {
    "histogram": {
      "x": [1.2, 2.4, 3.1, ...],
      "y": [0.05, 0.12, 0.18, ...]
    },
    "pdfs": [
      {
        "name": "gamma",
        "x": [0.5, 0.6, 0.7, ...],
        "y": [0.02, 0.04, 0.07, ...],
        "color": "#10b981"
      },
      {
        "name": "norm",
        "x": [0.5, 0.6, 0.7, ...],
        "y": [0.01, 0.03, 0.05, ...],
        "color": "#6366f1"
      }
    ]
  },
  "metadata": {
    "fit_method": "mle",
    "data_points": 500,
    "processing_time": 1.24
  }
}
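A response shaped like this can be ranked client-side. The sketch below parses a trimmed copy of the sample response (only the fields it needs) and sorts the distributions by AIC, assuming the field names shown above.

```python
import json

# Trimmed copy of the sample response, keeping only the fields used here.
response_text = """
{
  "status": "success",
  "results": {
    "distributions": [
      {"name": "norm", "aic": 1298.45, "ks_stat": 0.087},
      {"name": "gamma", "aic": 1245.67, "ks_stat": 0.042}
    ]
  }
}
"""

response = json.loads(response_text)
if response["status"] != "success":
    raise RuntimeError("analysis failed")

# Lowest AIC first, matching the ranking rule described under Output Interpretation.
ranked = sorted(response["results"]["distributions"], key=lambda d: d["aic"])
for rank, dist in enumerate(ranked, start=1):
    print(rank, dist["name"], dist["aic"], dist["ks_stat"])
```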

Free Service Reminder

Fit_It! API is provided as a free educational resource running on free-tier cloud hosting. To ensure fair access for all users:

Limit requests to 1-2 analyses per session
Space out requests by at least 10 minutes
Do not use for commercial applications
Avoid automated scripts or scraping

We appreciate your considerate usage to help keep this service available to the educational community.

Future Plans

Planned Features

→ Normalization and Transformation pre-fit
→ Discrete & Time series distribution analysis
→ Copula modeling for multivariate distributions (different project)
→ Quadrature rules derivation (to be integrated with Fit_It!)

Community Guidelines

✓ No commercial use
✓ Share findings, not resources
✓ Space out requests (5+ minutes between analyses)
✓ Report issues via LinkedIn, but please do not expect a quick response

"Probability is the very guide of life" - Cicero

Last Updated: June 2025 | Version 1.0.0 | By Gene Boo
