MC3-18 for UCF-101 Action Recognition

Model Summary

This model is an MC3-18 (Mixed 3D Convolutions) network fine-tuned on the UCF-101 dataset for human action recognition. The architecture uses 3D convolutions in its early layers and 2D convolutions in the later ones, so it captures short-range motion while keeping the parameter count small.

Architecture: MC3-18 (3D CNN with mixed convolutions)
Pretraining: Kinetics-400
Parameter Count: ~11.7M
Input Format: 16-frame clips, 112×112 spatial resolution
Number of Classes: 101
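
The card does not say how the 16 frames are chosen from a video. One common option, sketched here under that assumption, is to take evenly spaced indices across the whole video. `sample_frame_indices` is a hypothetical helper and not part of this model's code.

```python
def sample_frame_indices(num_frames: int, clip_len: int = 16) -> list[int]:
    """Pick `clip_len` evenly spaced frame indices from a video of `num_frames` frames."""
    if num_frames <= 0:
        raise ValueError("video has no frames")
    if clip_len == 1:
        return [0]
    # Spread indices from the first frame to the last; short videos repeat frames.
    step = (num_frames - 1) / (clip_len - 1)
    return [round(i * step) for i in range(clip_len)]


indices = sample_frame_indices(num_frames=160)
print(len(indices), indices[0], indices[-1])
```

Videos shorter than 16 frames still yield 16 indices because some frames are repeated; whether that matches the original training pipeline is not stated in the card.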

Intended Use

Primary use case: Action classification in short, trimmed videos similar in distribution to UCF-101.
Users: Researchers, practitioners, and engineers working on video-understanding pipelines.
Tasks:

Action recognition
Clip-level human activity tagging
Baseline modeling for low-compute video applications

Not suitable for long-horizon temporal reasoning or untrimmed video detection without adaptation.
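
One simple adaptation for longer videos, offered here as an assumption rather than something the card describes, is to slide a 16-frame window across the video and classify each window separately. `clip_windows` and its `stride` parameter are hypothetical names.

```python
def clip_windows(num_frames: int, clip_len: int = 16, stride: int = 8) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, covering a long video."""
    if clip_len <= 0 or stride <= 0:
        raise ValueError("clip_len and stride must be positive")
    if num_frames < clip_len:
        return []  # too short for a single full clip
    starts = list(range(0, num_frames - clip_len + 1, stride))
    # Add a final window flush with the end so trailing frames are not dropped.
    if starts[-1] != num_frames - clip_len:
        starts.append(num_frames - clip_len)
    return [(s, s + clip_len) for s in starts]


windows = clip_windows(num_frames=50)
print(windows)
```

A smaller stride gives denser coverage at the cost of more forward passes.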

Performance

Quantitative Results (UCF-101 Split 1, Test Set)

Metric	Value
Accuracy	87.05%
F1 Score	0.857
Precision	0.868
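
The card does not state how F1 and precision are averaged across the 101 classes. The sketch below assumes macro averaging (an unweighted mean over classes) and shows how such figures could be computed from true and predicted labels. `macro_metrics` is an illustrative helper, and the labels are toy values.

```python
from collections import Counter


def macro_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy plus macro-averaged precision and F1 over the classes in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but wrongly
            fn[t] += 1  # class t was missed
    classes = sorted(set(y_true))
    precisions, f1s = [], []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        precisions.append(prec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / len(classes),
        "f1": sum(f1s) / len(classes),
    }


scores = macro_metrics(y_true=[0, 0, 1, 1, 2, 2], y_pred=[0, 1, 1, 1, 2, 0])
print(scores)
```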

Comparison to Published Baseline

Original MC3-18 (Kinetics-400 → UCF-101): 85.0%
This model: 87.05% (+2.05 percentage points)

How to Use

Inference Example (PyTorch)

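import torch
from huggingface_hub import hf_hub_download
from torchvision.transforms import Compose, Resize, CenterCrop, Normalize, ToTensor

# Download the checkpoint from the Hugging Face Hub
model_path = hf_hub_download(repo_id="dronefreak/mc3-18-ucf101", filename="mc318-ufc101-split-1.pth")
# Assumes the file stores the whole pickled model rather than a state_dict
model = torch.load(model_path, map_location="cpu", weights_only=False)
model.eval()  # switch dropout/BatchNorm layers to inference behaviour

# Per-frame preprocessing (normalization statistics of torchvision's Kinetics-400 video models)
transform = Compose([
    Resize((128, 171)),
    CenterCrop(112),
    ToTensor(),
    Normalize(mean=[0.43216, 0.394666, 0.37645],
              std=[0.22803, 0.22145, 0.216989])
])

# Prepare video: `frames` is assumed to be a list of 16 PIL images from one clip
video_tensor = torch.stack([transform(f) for f in frames], dim=1)  # C×T×H×W
video_tensor = video_tensor.unsqueeze(0)  # add batch dimension: 1×C×T×H×W

# Inference
with torch.no_grad():
    output = model(video_tensor)
    prediction = output.argmax(dim=1)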
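
When a video yields several 16-frame clips, a common way to obtain one video-level label is to average the per-clip softmax scores. The following is a plain-Python sketch of that idea, not a step the card describes. The logits are made-up numbers and `average_clip_scores` is a hypothetical helper.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one clip's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_clip_scores(clip_logits: list[list[float]]) -> list[float]:
    """Average per-clip class probabilities into one video-level distribution."""
    probs = [softmax(logits) for logits in clip_logits]
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]


# Made-up logits for three clips and three classes (the real model outputs 101).
clip_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [1.8, 0.4, 0.2]]
video_probs = average_clip_scores(clip_logits)
video_label = max(range(len(video_probs)), key=video_probs.__getitem__)
print(video_label, [round(p, 3) for p in video_probs])
```

Averaging probabilities rather than taking a majority vote lets confident clips count for more than uncertain ones.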

Training

Dataset: UCF-101 Split 1 (9,537 train / 3,783 test videos)
Epochs: 200
Batch Size: 32
Optimizer: SGD (lr=0.001, momentum=0.9, weight_decay=1e-4)
Augmentation: ColorJitter, RandomHorizontalFlip, RandomCrop
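
To make the optimizer settings concrete, here is a sketch of SGD with momentum and weight decay applied to a single scalar weight. It follows the update rule documented for `torch.optim.SGD` with default dampening: weight decay is added to the gradient, then accumulated in a momentum buffer. The weight and gradient values are made up for illustration, and `sgd_step` is a hypothetical helper.

```python
def sgd_step(weight, grad, velocity, lr=0.001, momentum=0.9, weight_decay=1e-4):
    """One SGD update with momentum and L2 weight decay for a scalar weight."""
    # Weight decay acts as an extra gradient term pulling the weight toward zero.
    grad = grad + weight_decay * weight
    # The first step initialises the momentum buffer with the gradient itself.
    velocity = grad if velocity is None else momentum * velocity + grad
    return weight - lr * velocity, velocity


weight, velocity = 0.5, None
for grad in (0.2, 0.2):  # two steps with the same made-up gradient
    weight, velocity = sgd_step(weight, grad, velocity)
print(weight, velocity)
```

With momentum 0.9 the second step moves the weight nearly twice as far as the first, even though the raw gradient is unchanged.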

Limitations

Trained only on UCF-101 (limited to 101 action classes)
Requires 16-frame clips as input (not suitable for real-time single-frame inference)
Performs best on action types similar to those in UCF-101

Citation

@misc{mc3_18_ucf101,
  author = {Saumya Saksena},
  title = {MC3-18 for UCF-101 Action Recognition},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/dronefreak/mc3-18-ucf101}}
}

License

Apache-2.0

Paper

A Closer Look at Spatiotemporal Convolutions for Action Recognition (arXiv:1711.11248, published Nov 30, 2017)

Evaluation results

Top-1 Accuracy on UCF-101 (test set, self-reported): 87.050
F1 Score on UCF-101 (test set, self-reported): 85.690