Video Anomaly Detection with TimeSformer
Model Description
This is an EnhancedTimeSformer model trained for video anomaly detection and deepfake detection with a one-class learning approach. It was trained exclusively on real videos from WebVid-10M and learns to reconstruct normal video frames. Anomalies, including deepfakes, are flagged when the reconstruction error exceeds a threshold.
Key Features
- ✅ Self-supervised learning - No labeled deepfake data required for training
- ✅ Better generalization - More robust to novel deepfake methods than supervised approaches
- ✅ Optical flow integration - Captures temporal dynamics
- ✅ Transformer-based - Spatial-temporal attention mechanisms
- ✅ 100% accuracy on ultra-extreme synthetic deepfakes
Model Architecture
- Base: TimeSformer (Vision Transformer for Video)
- Enhancements:
  - Factorized 3D convolutions for efficient spatiotemporal processing
  - Optical flow estimation and encoding
  - 3D patch embeddings
  - 12-layer transformer with 12 attention heads
  - Dual decoder heads (frame reconstruction + flow prediction)
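The card does not say how the 3D convolutions are factorized. One common reading is a (2+1)D split, where a spatial (1, k, k) convolution is followed by a temporal (k, 1, 1) convolution. The sketch below only counts weights to show why such a split is cheaper; the channel widths and kernel size are illustrative assumptions, not the model's actual configuration.

```python
def conv3d_params(c_in: int, c_out: int, kt: int, kh: int, kw: int) -> int:
    """Weight count of a 3D convolution with a (kt, kh, kw) kernel, bias ignored."""
    return c_in * c_out * kt * kh * kw


def factorized_params(c_in: int, c_mid: int, c_out: int, k: int) -> int:
    """Spatial (1, k, k) convolution followed by a temporal (k, 1, 1) convolution."""
    spatial = conv3d_params(c_in, c_mid, 1, k, k)
    temporal = conv3d_params(c_mid, c_out, k, 1, 1)
    return spatial + temporal


# Assumed sizes for illustration only: 64 channels throughout, kernel size 3
full = conv3d_params(64, 64, 3, 3, 3)
factorized = factorized_params(64, 64, 64, 3)
print(f"full 3D: {full}, factorized: {factorized}, ratio: {full / factorized:.2f}")
```

With these assumed sizes the factorized pair needs fewer than half the weights of a full 3x3x3 kernel.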
Training Details
- Dataset: WebVid-10M (real videos only)
- Training objective: Self-supervised frame reconstruction
- Epochs: 15
- Final validation loss: 0.1821
- Input: 16 frames at 224x224 resolution
- Approach: One-class classification via reconstruction error
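The decision rule implied by the usage snippet in this card can be written compactly. Here x is one input clip with C channels, T = 16 frames and H x W = 224 x 224 pixels, x-hat is the model's frame prediction, m is the zero-based index of the middle frame, and tau is the detection threshold (0.3137 in this card). The notation is introduced here for illustration and assumes one clip per batch.

```latex
s(x) = \frac{1}{C H W} \sum_{c=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W}
       \left( \hat{x}_{c,i,j} - x_{c,m,i,j} \right)^{2},
\qquad m = \lfloor T / 2 \rfloor,
\qquad \text{predict fake if } s(x) > \tau
```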
Performance
On Ultra-Extreme Synthetic Deepfakes:
- Accuracy: 100%
- Precision: 100%
- Recall: 100%
- F1-Score: 100%
- False Positive Rate: 0%
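For reference, the metrics above follow from confusion-matrix counts with "fake" as the positive class. The card does not report the size of the test set, so the counts in this minimal sketch are hypothetical.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary metrics with 'fake' as the positive class."""
    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of dividing by zero when a class is empty
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Hypothetical counts: every clip classified correctly
perfect = detection_metrics(tp=50, fp=0, tn=50, fn=0)
# Hypothetical counts: 10 missed fakes and 10 false alarms
imperfect = detection_metrics(tp=40, fp=10, tn=40, fn=10)
print(perfect)
print(imperfect)
```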
Detection Metrics:
- Optimal Threshold: 0.3137
- Real Video MSE: 0.1445 ± 0.0846
- Fake Video MSE: 0.5559 ± 0.0949
- Separation Ratio: 3.85x
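The reported numbers can be cross-checked with simple arithmetic. The separation ratio is the fake mean divided by the real mean, and the threshold happens to equal the real mean plus two standard deviations (0.1445 + 2 x 0.0846 = 0.3137). Whether the threshold was actually chosen that way is not stated in the card; the sketch below only reproduces the arithmetic.

```python
REAL_MEAN, REAL_STD = 0.1445, 0.0846  # real-video MSE reported in the card
FAKE_MEAN, FAKE_STD = 0.5559, 0.0949  # fake-video MSE reported in the card
THRESHOLD = 0.3137                    # threshold reported in the card

separation_ratio = FAKE_MEAN / REAL_MEAN
real_z = (THRESHOLD - REAL_MEAN) / REAL_STD  # std devs above the real mean
fake_z = (FAKE_MEAN - THRESHOLD) / FAKE_STD  # std devs below the fake mean

print(f"separation ratio: {separation_ratio:.2f}x")
print(f"threshold sits {real_z:.2f} std above the real mean")
print(f"threshold sits {fake_z:.2f} std below the fake mean")
```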
Important Notes:
- ⚠️ Model tested on ultra-extreme synthetic fakes (with obvious artifacts)
- ⚠️ Real deepfakes are more subtle - expect lower accuracy (estimated 70-85%)
- ✅ Better cross-dataset generalization than supervised methods
- ✅ No memorization of specific deepfake method signatures
Usage
import torch
import torch.nn.functional as F
from model import create_model

# Use the GPU when available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and checkpoint
model = create_model()
checkpoint = torch.load("pytorch_model.ckpt", map_location=device)

# Extract state dict (some checkpoints nest it under 'state_dict')
if 'state_dict' in checkpoint:
    state_dict = checkpoint['state_dict']
else:
    state_dict = checkpoint

# Clean state dict: strip only the leading wrapper prefix from each key
new_state_dict = {}
for k, v in state_dict.items():
    if k.startswith('model.model.'):
        new_state_dict[k[len('model.model.'):]] = v
    elif k.startswith('model.'):
        new_state_dict[k[len('model.'):]] = v
    else:
        new_state_dict[k] = v

# strict=False skips mismatched keys silently, so report them
result = model.load_state_dict(new_state_dict, strict=False)
print(f"Missing keys: {result.missing_keys}")
print(f"Unexpected keys: {result.unexpected_keys}")
model = model.to(device).eval()

# Prepare video (B, C, T, H, W) with values in [-1, 1]
video_tensor = preprocess_video(video_path)  # Your preprocessing
video_tensor = video_tensor.to(device)

# Get prediction
with torch.no_grad():
    frame_pred, flow_pred = model(video_tensor)

# Calculate reconstruction error against the middle frame
# (mse_loss averages over the whole batch, so use batch size 1 per video)
mid_frame = video_tensor.shape[2] // 2
target = video_tensor[:, :, mid_frame]
mse_error = F.mse_loss(frame_pred, target).item()

# Detect deepfake
THRESHOLD = 0.3137
is_fake = mse_error > THRESHOLD
print(f"MSE: {mse_error:.4f}")
print(f"Prediction: {'FAKE' if is_fake else 'REAL'}")
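The usage snippet leaves `preprocess_video` to the reader. A minimal sketch of the tensor handling is shown below, assuming the frames are already decoded into a uint8 tensor of shape (T, H, W, C). The helper name `frames_to_clip`, the uniform frame sampling, and the plain resize that ignores aspect ratio are all assumptions; the card only specifies 16 frames at 224x224 with values in [-1, 1]. Decoding the video file itself is not covered here.

```python
import torch
import torch.nn.functional as F


def frames_to_clip(frames: torch.Tensor, num_frames: int = 16, size: int = 224) -> torch.Tensor:
    """Turn decoded uint8 frames (T, H, W, C) into a (1, C, num_frames, size, size) clip in [-1, 1]."""
    total = frames.shape[0]
    # Sample indices evenly across the video (frames repeat if the video is short)
    idx = torch.linspace(0, total - 1, steps=num_frames).round().long()
    clip = frames[idx].permute(0, 3, 1, 2).float()  # (T, C, H, W)
    # Resize every frame to size x size; aspect ratio is not preserved (assumption)
    clip = F.interpolate(clip, size=(size, size), mode="bilinear", align_corners=False)
    clip = clip / 127.5 - 1.0  # map [0, 255] to [-1, 1]
    return clip.permute(1, 0, 2, 3).unsqueeze(0)  # (1, C, T, H, W)


# Stand-in for decoded video frames: 40 random RGB frames at 240x320
dummy = torch.randint(0, 256, (40, 240, 320, 3), dtype=torch.uint8)
clip = frames_to_clip(dummy)
print(clip.shape, clip.min().item(), clip.max().item())
```

If the training pipeline used a different sampling or normalization scheme, the reconstruction errors and the threshold above would not transfer, so this should be checked against the repository's own preprocessing.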
Limitations
- Tested primarily on extreme manipulations - Real deepfakes are more subtle
- Reconstruction-based detection - May struggle with high-quality deepfakes that maintain temporal consistency
- Threshold sensitivity - Optimal threshold may vary across different video sources
- One-class approach - Lower peak accuracy than supervised methods, but better generalization
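Because the optimal threshold may shift between video sources, one option is to recalibrate it on a handful of known-real clips from the new source. The sketch below picks the threshold from the empirical distribution of real-video scores for a chosen false positive rate. The function name and the example scores are hypothetical, and this is not necessarily how the card's threshold was derived.

```python
import math


def calibrate_threshold(real_scores: list, target_fpr: float = 0.05) -> float:
    """Return a threshold that at most target_fpr of the given real scores exceed."""
    if not real_scores:
        raise ValueError("need at least one real-video score")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ordered = sorted(real_scores)
    # Number of real clips allowed to land above the threshold
    allowed = min(math.floor(target_fpr * len(ordered)), len(ordered) - 1)
    # The threshold is the largest score that must still be classified as real
    return ordered[len(ordered) - 1 - allowed]


# Hypothetical MSE scores from known-real clips of a new video source
scores = [0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.28, 0.31, 0.40]
threshold = calibrate_threshold(scores, target_fpr=0.10)
flagged = sum(s > threshold for s in scores)
print(f"threshold: {threshold}, real clips flagged: {flagged} of {len(scores)}")
```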
Recommended Use Cases
- ✅ Initial screening of videos for obvious manipulations
- ✅ Ensemble component with other detection methods
- ✅ Research on generalization in deepfake detection
- ✅ Detection of out-of-distribution videos
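To use the model as an ensemble component, its raw MSE first has to be put on the same scale as the other detectors' outputs. One simple possibility, sketched below, is a logistic mapping centered on the threshold followed by a weighted average. The `scale` parameter, the detector names, and the weights are hypothetical and would need tuning on validation data.

```python
import math


def mse_to_score(mse: float, threshold: float = 0.3137, scale: float = 0.05) -> float:
    """Map reconstruction MSE to (0, 1), giving 0.5 exactly at the threshold.

    `scale` is a hypothetical softness parameter, not a value from the card.
    """
    return 1.0 / (1.0 + math.exp(-(mse - threshold) / scale))


def ensemble_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-detector fake scores in [0, 1]."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


# Hypothetical example: this model plus one supervised classifier
scores = {
    "reconstruction": mse_to_score(0.5559),  # mean fake MSE from the card
    "supervised": 0.80,                      # made-up output of another detector
}
weights = {"reconstruction": 1.0, "supervised": 1.0}
combined = ensemble_score(scores, weights)
print(f"combined fake score: {combined:.3f}")
```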
Not Recommended For
- ❌ Sole detector for critical applications
- ❌ Detection of subtle, professional-grade deepfakes without additional methods
- ❌ Real-time video verification (model is compute-intensive)
Citation
If you use this model, please cite:
@misc{timesformer-deepfake-detector,
author = {ash12321},
title = {Video Anomaly Detection with TimeSformer},
year = {2024},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/ash12321/deepfake-detector-timesformer}}
}
License
MIT License - See repository for details
Contact
For questions or issues, please open an issue on the Hugging Face repository.
Note: This model represents a research approach to deepfake detection through one-class learning. For production deployments, consider using an ensemble of multiple detection methods including supervised classifiers, biological signal detectors, and temporal consistency checkers.