
Video Anomaly Detection with TimeSformer

Model Description

This is an EnhancedTimeSformer model trained for video anomaly detection and deepfake detection using a one-class learning approach. The model was trained exclusively on real videos from WebVid-10M and learns to reconstruct normal video frames. Anomalies (including deepfakes) are detected by measuring reconstruction error.

Key Features

  • ✅ Self-supervised learning - No labeled deepfake data required for training
  • ✅ Better generalization - More robust to novel deepfake methods than supervised approaches
  • ✅ Optical flow integration - Captures temporal dynamics
  • ✅ Transformer-based - Spatial-temporal attention mechanisms
  • ✅ 100% accuracy on ultra-extreme synthetic deepfakes

Model Architecture

  • Base: TimeSformer (Vision Transformer for Video)
  • Enhancements (an illustrative sketch follows this list):
    • Factorized 3D convolutions for efficient spatiotemporal processing
    • Optical flow estimation and encoding
    • 3D patch embeddings
    • 12-layer transformer with 12 attention heads
    • Dual decoder heads (frame reconstruction + flow prediction)
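
The exact layer layout of the released checkpoint is not documented here, so the skeleton below is only an illustrative PyTorch sketch of the components listed above (3D patch embedding, a 12-layer / 12-head encoder, dual frame and flow heads). It uses joint space-time attention, omits the factorized 3D convolutions and the optical-flow encoder, and its module names will not match the checkpoint's state dict.

import torch
import torch.nn as nn

class TimeSformerSketch(nn.Module):
    """Illustrative skeleton only; shapes and names are assumptions, not the real model."""
    def __init__(self, frames=16, size=224, patch=16, dim=768, depth=12, heads=12):
        super().__init__()
        # 3D patch embedding: 2-frame x 16x16 tubelets -> tokens
        self.patch_embed = nn.Conv3d(3, dim, kernel_size=(2, patch, patch),
                                     stride=(2, patch, patch))
        num_tokens = (frames // 2) * (size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))
        # 12-layer encoder with 12 attention heads (joint space-time attention here;
        # the actual model uses divided spatial/temporal attention)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Dual decoder heads: middle-frame reconstruction and optical-flow prediction
        self.frame_head = nn.Linear(dim, 3 * patch * patch)
        self.flow_head = nn.Linear(dim, 2 * patch * patch)

    def forward(self, video):                         # (B, 3, T, H, W), values in [-1, 1]
        x = self.patch_embed(video)                   # (B, dim, T/2, H/16, W/16)
        B, D, t, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)              # (B, N, dim)
        x = self.encoder(x + self.pos_embed)
        x = x.reshape(B, t, h, w, D)[:, t // 2]       # features of the middle tubelet
        frame = self._unpatch(self.frame_head(x), 3)  # (B, 3, H, W)
        flow = self._unpatch(self.flow_head(x), 2)    # (B, 2, H, W)
        return frame, flow

    @staticmethod
    def _unpatch(x, c):                               # (B, h, w, c*p*p) -> (B, c, h*p, w*p)
        B, h, w, _ = x.shape
        p = int((x.shape[-1] // c) ** 0.5)
        x = x.view(B, h, w, c, p, p).permute(0, 3, 1, 4, 2, 5)
        return x.reshape(B, c, h * p, w * p)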

Training Details

  • Dataset: WebVid-10M (real videos only)
  • Training objective: Self-supervised frame reconstruction (a training-step sketch follows this list)
  • Epochs: 15
  • Final validation loss: 0.1821
  • Input: 16 frames at 224x224 resolution
  • Approach: One-class classification via reconstruction error
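
Since the training code is not included in this repository, the following is only a minimal sketch of what one training step under this objective might look like, using the same create_model helper and (frame_pred, flow_pred) interface as the Usage section below. The learning rate and real_loader are assumptions, and the flow-prediction loss term used in the actual training is not documented, so only the reconstruction term is shown.

import torch
import torch.nn.functional as F
from model import create_model  # same helper as in the Usage section

model = create_model().cuda().train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr is an assumption

# real_loader is assumed to yield clips of *real* videos only (e.g. WebVid-10M),
# shaped (B, 3, 16, 224, 224) with values scaled to [-1, 1]; no labels are needed.
for clips in real_loader:
    clips = clips.cuda()
    frame_pred, flow_pred = model(clips)
    mid = clips.shape[2] // 2
    loss = F.mse_loss(frame_pred, clips[:, :, mid])  # reconstruct the middle frame
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()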

Performance

On Ultra-Extreme Synthetic Deepfakes:

  • Accuracy: 100%
  • Precision: 100%
  • Recall: 100%
  • F1-Score: 100%
  • False Positive Rate: 0%

Detection Metrics:

  • Optimal Threshold: 0.3137
  • Real Video MSE: 0.1445 ± 0.0846
  • Fake Video MSE: 0.5559 ± 0.0949
  • Separation Ratio: 3.85x (a quick check of these figures follows this list)
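
As a quick check of the numbers above: the separation ratio is the ratio of the mean MSEs, and the reported optimal threshold happens to sit exactly two standard deviations above the real-video mean (whether it was chosen that way is not stated).

real_mean, real_std = 0.1445, 0.0846
fake_mean = 0.5559

print(round(fake_mean / real_mean, 2))     # 3.85   -> matches the separation ratio
print(round(real_mean + 2 * real_std, 4))  # 0.3137 -> coincides with the optimal threshold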

Important Notes:

  • ⚠️ Model tested on ultra-extreme synthetic fakes (with obvious artifacts)
  • ⚠️ Real deepfakes are more subtle - expect lower accuracy (estimated 70-85%)
  • ✅ Better cross-dataset generalization than supervised methods
  • ✅ No memorization of specific deepfake method signatures

Usage

import torch
import torch.nn.functional as F
from model import create_model

# Load model
model = create_model()
checkpoint = torch.load("pytorch_model.ckpt", map_location='cuda')

# Extract state dict
if 'state_dict' in checkpoint:
    state_dict = checkpoint['state_dict']
else:
    state_dict = checkpoint

# Clean state dict (remove prefixes)
new_state_dict = {}
for k, v in state_dict.items():
    if k.startswith('model.model.'):
        new_key = k.replace('model.model.', '')
        new_state_dict[new_key] = v
    elif k.startswith('model.'):
        new_key = k.replace('model.', '')
        new_state_dict[new_key] = v
    else:
        new_state_dict[k] = v

model.load_state_dict(new_state_dict, strict=False)
model.eval()
model = model.cuda()

# Prepare video (B, C, T, H, W) with values in [-1, 1]
video_tensor = preprocess_video(video_path)  # Your preprocessing (an example implementation follows this block)
video_tensor = video_tensor.cuda()

# Get prediction
with torch.no_grad():
    frame_pred, flow_pred = model(video_tensor)
    
    # Calculate reconstruction error
    mid_frame = video_tensor.shape[2] // 2
    target = video_tensor[:, :, mid_frame]
    mse_error = F.mse_loss(frame_pred, target).item()
    
    # Detect deepfake
    THRESHOLD = 0.3137
    is_fake = mse_error > THRESHOLD
    
    print(f"MSE: {mse_error:.4f}")
    print(f"Prediction: {'FAKE' if is_fake else 'REAL'}")

Limitations

  1. Tested primarily on extreme manipulations - Real deepfakes are more subtle
  2. Reconstruction-based detection - May struggle with high-quality deepfakes that maintain temporal consistency
  3. Threshold sensitivity - Optimal threshold may vary across different video sources; a calibration sketch follows this list
  4. One-class approach - Lower peak accuracy than supervised methods, but better generalization
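
Because the optimal threshold can shift across video sources, it is worth recalibrating on a small labeled sample from your own domain before relying on the reported 0.3137. The sketch below assumes model is loaded as in the Usage section and that real_clips / fake_clips are lists of preprocessed clip tensors; both names are placeholders.

import numpy as np
import torch
import torch.nn.functional as F

def reconstruction_error(model, clip):
    """Same scoring as the Usage section: MSE on the middle frame."""
    with torch.no_grad():
        frame_pred, _ = model(clip.cuda())
        mid = clip.shape[2] // 2
        return F.mse_loss(frame_pred, clip[:, :, mid].cuda()).item()

real_scores = np.array([reconstruction_error(model, c) for c in real_clips])
fake_scores = np.array([reconstruction_error(model, c) for c in fake_clips])

# Pick the threshold that maximizes balanced accuracy on the calibration set.
candidates = np.sort(np.concatenate([real_scores, fake_scores]))
balanced = [((real_scores <= t).mean() + (fake_scores > t).mean()) / 2 for t in candidates]
print(f"Calibrated threshold: {candidates[int(np.argmax(balanced))]:.4f}")

# With no labeled fakes at all, a one-class fallback is a high percentile of the
# real-video scores, e.g. np.percentile(real_scores, 95).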

Recommended Use Cases

  • ✅ Initial screening of videos for obvious manipulations
  • ✅ Ensemble component with other detection methods
  • ✅ Research on generalization in deepfake detection
  • ✅ Detection of out-of-distribution videos

Not Recommended For

  • ❌ Sole detector for critical applications
  • ❌ Detection of subtle, professional-grade deepfakes without additional methods
  • ❌ Real-time video verification (model is compute-intensive)

Citation

If you use this model, please cite:

@misc{timesformer-deepfake-detector,
  author = {ash12321},
  title = {Video Anomaly Detection with TimeSformer},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ash12321/deepfake-detector-timesformer}}
}

License

MIT License - See repository for details

Contact

For questions or issues, please open an issue on the Hugging Face repository.


Note: This model represents a research approach to deepfake detection through one-class learning. For production deployments, consider using an ensemble of multiple detection methods including supervised classifiers, biological signal detectors, and temporal consistency checkers.
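
For reference, such a late-fusion ensemble can be as simple as a weighted average of per-detector scores. In the sketch below, the component scores, weights, and the min-max squashing of this model's MSE onto [0, 1] are all illustrative placeholders rather than tuned values.

def fuse(reconstruction_score, supervised_score, temporal_score,
         weights=(0.3, 0.5, 0.2), decision_threshold=0.5):
    """Weighted average of per-detector scores in [0, 1]; higher means more likely fake."""
    scores = (reconstruction_score, supervised_score, temporal_score)
    fused = sum(w * s for w, s in zip(weights, scores))
    return fused, fused > decision_threshold

# Map this model's MSE onto [0, 1] using the reported real/fake mean MSEs before fusing.
mse = 0.42  # example reconstruction error
recon_score = min(max((mse - 0.1445) / (0.5559 - 0.1445), 0.0), 1.0)
fused, is_fake = fuse(recon_score, supervised_score=0.8, temporal_score=0.6)
print(f"Ensemble score: {fused:.2f} -> {'FAKE' if is_fake else 'REAL'}")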
