
Video Anomaly Detection with TimeSformer

Model Description

This is an EnhancedTimeSformer model trained for video anomaly detection and deepfake detection using a one-class learning approach. The model was trained exclusively on real videos from WebVid-10M and learns to reconstruct normal video frames. Anomalies (including deepfakes) are detected by measuring reconstruction error.

Key Features

  • ✅ Self-supervised learning - No labeled deepfake data required for training
  • ✅ Better generalization - More robust to novel deepfake methods than supervised approaches
  • ✅ Optical flow integration - Captures temporal dynamics
  • ✅ Transformer-based - Spatial-temporal attention mechanisms
  • ✅ 100% accuracy on ultra-extreme synthetic deepfakes

Model Architecture

  • Base: TimeSformer (Vision Transformer for Video)
  • Enhancements (an illustrative sketch follows this list):
    • Factorized 3D convolutions for efficient spatiotemporal processing
    • Optical flow estimation and encoding
    • 3D patch embeddings
    • 12-layer transformer with 12 attention heads
    • Dual decoder heads (frame reconstruction + flow prediction)
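
The exact layer layout of the released checkpoint is not documented here, so the skeleton below is only an illustrative PyTorch sketch of the components listed above (3D patch embedding, a 12-layer / 12-head encoder, dual frame and flow heads). It uses joint space-time attention, omits the factorized 3D convolutions and the optical-flow encoder, and its module names will not match the checkpoint's state dict.

import torch
import torch.nn as nn

class TimeSformerSketch(nn.Module):
    """Illustrative skeleton only; shapes and names are assumptions, not the real model."""
    def __init__(self, frames=16, size=224, patch=16, dim=768, depth=12, heads=12):
        super().__init__()
        # 3D patch embedding: 2-frame x 16x16 tubelets -> tokens
        self.patch_embed = nn.Conv3d(3, dim, kernel_size=(2, patch, patch),
                                     stride=(2, patch, patch))
        num_tokens = (frames // 2) * (size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))
        # 12-layer encoder with 12 attention heads (joint space-time attention here;
        # the actual model uses divided spatial/temporal attention)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Dual decoder heads: middle-frame reconstruction and optical-flow prediction
        self.frame_head = nn.Linear(dim, 3 * patch * patch)
        self.flow_head = nn.Linear(dim, 2 * patch * patch)

    def forward(self, video):                         # (B, 3, T, H, W), values in [-1, 1]
        x = self.patch_embed(video)                   # (B, dim, T/2, H/16, W/16)
        B, D, t, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)              # (B, N, dim)
        x = self.encoder(x + self.pos_embed)
        x = x.reshape(B, t, h, w, D)[:, t // 2]       # features of the middle tubelet
        frame = self._unpatch(self.frame_head(x), 3)  # (B, 3, H, W)
        flow = self._unpatch(self.flow_head(x), 2)    # (B, 2, H, W)
        return frame, flow

    @staticmethod
    def _unpatch(x, c):                               # (B, h, w, c*p*p) -> (B, c, h*p, w*p)
        B, h, w, _ = x.shape
        p = int((x.shape[-1] // c) ** 0.5)
        x = x.view(B, h, w, c, p, p).permute(0, 3, 1, 4, 2, 5)
        return x.reshape(B, c, h * p, w * p)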

Training Details

  • Dataset: WebVid-10M (real videos only)
  • Training objective: Self-supervised frame reconstruction (a training-step sketch follows this list)
  • Epochs: 15
  • Final validation loss: 0.1821
  • Input: 16 frames at 224x224 resolution
  • Approach: One-class classification via reconstruction error
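
Since the training code is not included in this repository, the following is only a minimal sketch of what one training step under this objective might look like, using the same create_model helper and (frame_pred, flow_pred) interface as the Usage section below. The learning rate and real_loader are assumptions, and the flow-prediction loss term used in the actual training is not documented, so only the reconstruction term is shown.

import torch
import torch.nn.functional as F
from model import create_model  # same helper as in the Usage section

model = create_model().cuda().train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr is an assumption

# real_loader is assumed to yield clips of *real* videos only (e.g. WebVid-10M),
# shaped (B, 3, 16, 224, 224) with values scaled to [-1, 1]; no labels are needed.
for clips in real_loader:
    clips = clips.cuda()
    frame_pred, flow_pred = model(clips)
    mid = clips.shape[2] // 2
    loss = F.mse_loss(frame_pred, clips[:, :, mid])  # reconstruct the middle frame
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()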

Performance

On Ultra-Extreme Synthetic Deepfakes:

  • Accuracy: 100%
  • Precision: 100%
  • Recall: 100%
  • F1-Score: 100%
  • False Positive Rate: 0%

Detection Metrics:

  • Optimal Threshold: 0.3137
  • Real Video MSE: 0.1445 ± 0.0846
  • Fake Video MSE: 0.5559 ± 0.0949
  • Separation Ratio: 3.85x (a quick check of these figures follows this list)
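
As a quick check of the numbers above: the separation ratio is the ratio of the mean MSEs, and the reported optimal threshold happens to sit exactly two standard deviations above the real-video mean (whether it was chosen that way is not stated).

real_mean, real_std = 0.1445, 0.0846
fake_mean = 0.5559

print(round(fake_mean / real_mean, 2))     # 3.85   -> matches the separation ratio
print(round(real_mean + 2 * real_std, 4))  # 0.3137 -> coincides with the optimal threshold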

Important Notes:

  • ⚠️ Model tested on ultra-extreme synthetic fakes (with obvious artifacts)
  • ⚠️ Real deepfakes are more subtle - expect lower accuracy (estimated 70-85%)
  • ✅ Better cross-dataset generalization than supervised methods
  • ✅ No memorization of specific deepfake method signatures

Usage

import torch
import torch.nn.functional as F
from model import create_model

# Load model
model = create_model()
checkpoint = torch.load("pytorch_model.ckpt", map_location='cuda')

# Extract state dict
if 'state_dict' in checkpoint:
    state_dict = checkpoint['state_dict']
else:
    state_dict = checkpoint

# Clean state dict (remove prefixes)
new_state_dict = {}
for k, v in state_dict.items():
    if k.startswith('model.model.'):
        new_key = k.replace('model.model.', '')
        new_state_dict[new_key] = v
    elif k.startswith('model.'):
        new_key = k.replace('model.', '')
        new_state_dict[new_key] = v
    else:
        new_state_dict[k] = v

model.load_state_dict(new_state_dict, strict=False)
model.eval()
model = model.cuda()

# Prepare video (B, C, T, H, W) with values in [-1, 1]
video_tensor = preprocess_video(video_path)  # Your preprocessing (an example implementation follows this block)
video_tensor = video_tensor.cuda()

# Get prediction
with torch.no_grad():
    frame_pred, flow_pred = model(video_tensor)
    
    # Calculate reconstruction error
    mid_frame = video_tensor.shape[2] // 2
    target = video_tensor[:, :, mid_frame]
    mse_error = F.mse_loss(frame_pred, target).item()
    
    # Detect deepfake
    THRESHOLD = 0.3137
    is_fake = mse_error > THRESHOLD
    
    print(f"MSE: {mse_error:.4f}")
    print(f"Prediction: {'FAKE' if is_fake else 'REAL'}")

Limitations

  1. Tested primarily on extreme manipulations - Real deepfakes are more subtle
  2. Reconstruction-based detection - May struggle with high-quality deepfakes that maintain temporal consistency
  3. Threshold sensitivity - Optimal threshold may vary across different video sources; a calibration sketch follows this list
  4. One-class approach - Lower peak accuracy than supervised methods, but better generalization
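
Because the optimal threshold can shift across video sources, it is worth recalibrating on a small labeled sample from your own domain before relying on the reported 0.3137. The sketch below assumes model is loaded as in the Usage section and that real_clips / fake_clips are lists of preprocessed clip tensors; both names are placeholders.

import numpy as np
import torch
import torch.nn.functional as F

def reconstruction_error(model, clip):
    """Same scoring as the Usage section: MSE on the middle frame."""
    with torch.no_grad():
        frame_pred, _ = model(clip.cuda())
        mid = clip.shape[2] // 2
        return F.mse_loss(frame_pred, clip[:, :, mid].cuda()).item()

real_scores = np.array([reconstruction_error(model, c) for c in real_clips])
fake_scores = np.array([reconstruction_error(model, c) for c in fake_clips])

# Pick the threshold that maximizes balanced accuracy on the calibration set.
candidates = np.sort(np.concatenate([real_scores, fake_scores]))
balanced = [((real_scores <= t).mean() + (fake_scores > t).mean()) / 2 for t in candidates]
print(f"Calibrated threshold: {candidates[int(np.argmax(balanced))]:.4f}")

# With no labeled fakes at all, a one-class fallback is a high percentile of the
# real-video scores, e.g. np.percentile(real_scores, 95).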

Recommended Use Cases

  • ✅ Initial screening of videos for obvious manipulations
  • ✅ Ensemble component with other detection methods
  • ✅ Research on generalization in deepfake detection
  • ✅ Detection of out-of-distribution videos

Not Recommended For

  • ❌ Sole detector for critical applications
  • ❌ Detection of subtle, professional-grade deepfakes without additional methods
  • ❌ Real-time video verification (model is compute-intensive)

Citation

If you use this model, please cite:

@misc{timesformer-deepfake-detector,
  author = {ash12321},
  title = {Video Anomaly Detection with TimeSformer},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ash12321/deepfake-detector-timesformer}}
}

License

MIT License - See repository for details

Contact

For questions or issues, please open an issue on the Hugging Face repository.


Note: This model represents a research approach to deepfake detection through one-class learning. For production deployments, consider using an ensemble of multiple detection methods including supervised classifiers, biological signal detectors, and temporal consistency checkers.
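
For reference, such a late-fusion ensemble can be as simple as a weighted average of per-detector scores. In the sketch below, the component scores, weights, and the min-max squashing of this model's MSE onto [0, 1] are all illustrative placeholders rather than tuned values.

def fuse(reconstruction_score, supervised_score, temporal_score,
         weights=(0.3, 0.5, 0.2), decision_threshold=0.5):
    """Weighted average of per-detector scores in [0, 1]; higher means more likely fake."""
    scores = (reconstruction_score, supervised_score, temporal_score)
    fused = sum(w * s for w, s in zip(weights, scores))
    return fused, fused > decision_threshold

# Map this model's MSE onto [0, 1] using the reported real/fake mean MSEs before fusing.
mse = 0.42  # example reconstruction error
recon_score = min(max((mse - 0.1445) / (0.5559 - 0.1445), 0.0), 1.0)
fused, is_fake = fuse(recon_score, supervised_score=0.8, temporal_score=0.6)
print(f"Ensemble score: {fused:.2f} -> {'FAKE' if is_fake else 'REAL'}")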
