Paper: A Closer Look at Spatiotemporal Convolutions for Action Recognition (arXiv:1711.11248)
This model is an MC3-18 (Mixed 3D Convolutions) network fine-tuned on the UCF-101 dataset for human action recognition. The architecture uses 3D convolutions in its early layers and 2D (per-frame) convolutions in the later ones, which captures motion cues while keeping the parameter count below that of a fully 3D network of the same depth.
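To illustrate why mixing 2D and 3D convolutions reduces the parameter count, the sketch below compares the weights in one 3×3×3 convolution with those in a 1×3×3 (per-frame) convolution of the same channel width. The helper `conv_weights` and the 128-channel width are illustrative assumptions, not values read from this checkpoint, and bias terms are ignored.

```python
def conv_weights(c_in: int, c_out: int, kernel: tuple) -> int:
    """Number of weights in a bias-free convolution with a (T, H, W) kernel."""
    k_t, k_h, k_w = kernel
    return c_in * c_out * k_t * k_h * k_w


# Illustrative channel width (an assumption, not taken from the checkpoint).
channels = 128

full_3d = conv_weights(channels, channels, (3, 3, 3))    # spatiotemporal kernel
per_frame = conv_weights(channels, channels, (1, 3, 3))  # 2D kernel applied frame by frame

print(f"3x3x3: {full_3d} weights, 1x3x3: {per_frame} weights")
```

At equal channel width the per-frame kernel needs one third of the weights, so replacing the 3D convolutions in the later, wider layers accounts for most of the saving.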
Primary use case: Action classification in short, trimmed videos similar in distribution to UCF-101.
Users: Researchers, practitioners, and engineers working on video-understanding pipelines.
Tasks:
Not suitable for long-horizon temporal reasoning or untrimmed video detection without adaptation.
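One possible adaptation for longer videos, sketched here under stated assumptions, is to split the video into overlapping 16-frame windows, score each window separately, and average the scores. The function names, the stride of 8, and the averaging rule are hypothetical choices for illustration and are not part of this model's release. This yields a single video-level label and does not localize actions in time.

```python
def clip_starts(num_frames: int, clip_len: int = 16, stride: int = 8) -> list:
    """Start indices of overlapping clips that cover a video of num_frames frames."""
    if num_frames < clip_len:
        return [0]  # the caller is expected to pad or repeat frames for short videos
    last = num_frames - clip_len
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)  # make sure the tail of the video is covered
    return starts


def average_scores(per_clip_scores: list) -> list:
    """Average per-class scores across clips into one video-level score vector."""
    n = len(per_clip_scores)
    return [sum(column) / n for column in zip(*per_clip_scores)]


# A 45-frame video is covered by five windows; the last one is shifted to reach the end.
print(clip_starts(45))
```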
| Metric | Value |
|---|---|
| Accuracy | 87.05% |
| F1 Score | 0.857 |
| Precision | 0.868 |
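The table does not state how precision and F1 were averaged across the 101 classes. The following sketch shows how such figures could be computed from predicted and true labels, assuming macro averaging (an unweighted mean over classes). The function name and the toy labels are illustrative only and do not reproduce the reported results.

```python
def macro_metrics(y_true: list, y_pred: list) -> dict:
    """Accuracy plus macro-averaged precision and F1 for integer class labels."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, f1_scores = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
        precisions.append(precision)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / len(classes),
        "f1": sum(f1_scores) / len(classes),
    }


# Toy labels for illustration only; these are not UCF-101 results.
toy = macro_metrics([0, 0, 1, 1], [0, 1, 1, 1])
print(toy)
```

If the reported numbers were computed with weighted averaging instead, each per-class value would be weighted by its class frequency rather than averaged uniformly.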
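import torch
from huggingface_hub import hf_hub_download
from torchvision.transforms import Compose, Resize, CenterCrop, Normalize, ToTensor

# Download the checkpoint from the Hugging Face Hub
model_path = hf_hub_download(repo_id="dronefreak/mc3-18-ucf101", filename="mc318-ufc101-split-1.pth")
# Assumes the file stores a full pickled model rather than a bare state_dict;
# recent PyTorch releases may also require weights_only=False in that case.
model = torch.load(model_path, map_location="cpu")
model.eval()  # inference mode: fixed batch-norm statistics, no dropout

# Per-frame preprocessing; the model expects 16 frames stacked as C×T×H×W
transform = Compose([
    Resize((128, 171)),
    CenterCrop(112),
    ToTensor(),
    Normalize(mean=[0.43216, 0.394666, 0.37645],
              std=[0.22803, 0.22145, 0.216989]),
])

# `frames` is assumed to be a list of 16 RGB PIL images taken from the clip
video_tensor = torch.stack([transform(f) for f in frames], dim=1).unsqueeze(0)  # 1×C×T×H×W

# Inference
with torch.no_grad():
    output = model(video_tensor)
    prediction = output.argmax(dim=1)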
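The inference snippet expects a 16-frame clip but does not say how those frames are chosen. A minimal sketch, assuming uniform sampling across the whole video, is shown below. The function name `sample_frame_indices` is hypothetical, and the sampling scheme used during fine-tuning may differ.

```python
def sample_frame_indices(num_frames: int, clip_len: int = 16) -> list:
    """Pick clip_len evenly spaced frame indices from a video of num_frames frames.

    Videos shorter than clip_len repeat frames so the clip length stays fixed.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return [(i * num_frames) // clip_len for i in range(clip_len)]


# A 32-frame video yields every second frame.
print(sample_frame_indices(32))
```

The returned indices can be used to select frames from a decoded video before the per-frame transform is applied.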
@misc{mc3_18_ucf101,
author = {Saumya Saksena},
title = {MC3-18 for UCF-101 Action Recognition},
year = {2024},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/dronefreak/mc3-18-ucf101}}
}
Apache-2.0