Grown men screaming like little girls

I always hated playing scary video games, which is why I have avoided them completely over the past years. Jump scares in particular are nothing to be made fun of. Until now! With the rise of Let’s Plays, I (and many others) can finally enjoy horror games without the hassle of actually playing them. The concept is simple: players record their screen, their voices and sometimes themselves while playing, then upload the videos to Youtube. That’s all.

Over time, I’ve grown fond of a particular channel called GameTube [German language only]. A speciality of this channel is an extra camera filming the players while they play games from the horror genre. Here, hilarity ensues. Apparently there is nothing more amusing on the internet than two grown men screaming like little girls. One problem though: the jump scares are still in the videos, just less immersive, so you can “prepare” yourself for them. Which got me thinking: is there a way to get a list of all jump scares in a video before actually watching it? Most of the time, I just scan the comments for people posting timestamps of jump scares, but that never yields the complete set. Here, another property of any Let’s Play helps me out: the players’ recorded voices also react to the game. If something scary pops out from behind the bushes, you’ll hear it. Most likely screams penetrating your eardrums and killing your speakers. I will show exactly how I used those screams.

At 20 kHz nobody can hear you scream

The general idea is simple: extract the audio from the video, sample a waveform from it and find the screams in it. So far, so easy. One thing that helps is the fact that screams usually induce clipping in the audio. Clipping can be heard as a distorted sound (cracking, scraping). So we are looking for very high and sharp peaks in the audio signal. First of all, we set up our imports.

import struct
import subprocess

import numpy as np

Now we define some recurring parameters. Note: This experiment was conducted with the episodes of Resident Evil 7. The episodes have to be downloaded beforehand and should be named resident-evil-7-biohazard-{num}.webm, where {num} is the episode number.

episode = 2 # which episode should be inspected
samples = 100 # how many samples (points) per second there should be
scale_minutes = 60*samples # number of samples per minute

For getting an audio waveform, we call ffmpeg with the aresample filter set to our sample rate. ffmpeg then writes a binary file named wave_ep{num}.bin containing an array of little-endian, two-byte integers, where {num} is the episode number.

subprocess.call("ffmpeg -i resident-evil-7-biohazard-{}.webm \
                -ac 1 -filter:a aresample={} -map 0:a -c:a pcm_s16le -f data -y wave_ep{}.bin"
                .format(episode, samples, episode), shell=True)
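Since the file contains one two-byte sample per 1/100 of a second, its size directly encodes the video’s duration. A quick sanity check might look like this (a sketch; the demo file and its contents stand in for the actual ffmpeg output):

```python
import os
import struct

samples = 100  # samples per second, as above

# for illustration, write a fake 3-second file of silence
# (in practice this would be the ffmpeg output, e.g. wave_ep2.bin)
path = "wave_demo.bin"
with open(path, "wb") as f:
    f.write(struct.pack("<{}h".format(3 * samples), *([0] * (3 * samples))))

# each sample is 2 bytes, so size / (2 * samples) is the duration in seconds
duration = os.path.getsize(path) / (2 * samples)
print(duration)  # 3.0
```

If the computed duration does not match the video length, the sample rate or sample format is off.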

Next, we have to read the binary data and store it in an array in Python.

f = open("wave_ep{}.bin".format(episode), "rb")
wave = []

chunk = f.read(2)

while chunk:

    # <h denotes a two-byte integer in little endian
    # we only want the absolute value of the signal (audio signals are symmetric around zero)
    wave.append(abs(struct.unpack('<h', chunk)[0]))

    chunk = f.read(2)

f.close()
        
wave = np.array(wave)
# normalise the data to avoid problems with changing sound quality of different videos
norm_wave = wave/np.max(wave)
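As an aside, the byte-by-byte loop can be replaced by a single vectorised read: np.fromfile understands the same little-endian 16-bit format. A sketch, with a small demo file standing in for the real ffmpeg output:

```python
import struct

import numpy as np

# write a small demo file so the snippet is self-contained
# (in practice this would be the ffmpeg output, e.g. wave_ep2.bin)
data = [100, -200, 300, -400]
with open("wave_demo.bin", "wb") as f:
    f.write(struct.pack("<4h", *data))

# dtype '<i2' matches the '<h' format used above: little-endian, two bytes
wave = np.abs(np.fromfile("wave_demo.bin", dtype="<i2"))
norm_wave = wave / np.max(wave)

print(wave)       # [100 200 300 400]
print(norm_wave)  # [0.25 0.5  0.75 1.  ]
```

This produces the same wave array as the loop, just much faster for long episodes.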

Let’s inspect the sound wave (not normalised) of the second episode of Resident Evil 7. The x-axis shows the time into the video and the y-axis the intensity of the audio.

[Figure: audio waveform of episode 2]

Let’s play a game called “Find the screams”. You guessed right: The high peaks in the image show exactly where Michi and Fritz went through high stages of fear. Let’s call these Layers of Fear (pun intended).

In the normalised data, normal conversation usually stays below 30% of the total range of the audio signal. We can use this information to filter out the unusual Layers of Fear.

# do not forget to divide by samples to get the timestamp in seconds
layers_of_fear = np.where(norm_wave > 0.3)[0]/samples
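On a small synthetic signal, the filter behaves like this (illustrative values only, not taken from the actual episode):

```python
import numpy as np

samples = 100  # points per second, as above

# synthetic normalised signal: quiet chatter with two loud spikes
norm_wave = np.full(1000, 0.1)
norm_wave[250] = 0.8  # scream at 2.5 s
norm_wave[730] = 0.5  # scream at 7.3 s

layers_of_fear = np.where(norm_wave > 0.3)[0] / samples
print(layers_of_fear)  # [2.5 7.3]
```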

This gives us many candidates for a jump scare, but far too many. We have to compare neighbouring timestamps in layers_of_fear: if the difference between two of them is large, they are independent and therefore belong to two different jump scares. The last timestamp before such a gap is used as the end marker of the previous jump scare.

jump_scares = dict()
i = 0

for x, y in zip(layers_of_fear[:-1], layers_of_fear[1:]):
    if np.abs(y - x) > 10 or i == 0: # ten second difference
        new_high = {}
        # subtract 5 seconds, because we want to have a timestamp
        # shortly before the jump scare
        new_high["start"] = np.floor(y - 5)
        if i > 0:
            # close the previous jump scare one second after its last peak
            jump_scares[i-1]["end"] = np.ceil(x + 1)
        jump_scares[i] = new_high

        i += 1

jump_scares for episode two now looks as follows. All timestamps are in seconds. You might notice that this list does not only include jump scares, but also loud noises in tense situations. Also, the last entry has no end marker, since a jump scare is only closed once the next one begins.

{0: {'end': 462.0, 'start': 454.0},
 1: {'end': 485.0, 'start': 476.0},
 2: {'end': 526.0, 'start': 519.0},
 3: {'end': 587.0, 'start': 572.0},
 4: {'end': 625.0, 'start': 617.0},
 5: {'end': 687.0, 'start': 680.0},
 6: {'end': 823.0, 'start': 816.0},
 7: {'end': 899.0, 'start': 883.0},
 8: {'end': 920.0, 'start': 905.0},
 9: {'start': 923.0}}
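To turn the start values into the m:ss timestamps used in the table below, a small helper is enough (to_timestamp is a hypothetical name, not part of the original script):

```python
def to_timestamp(seconds):
    """Convert a value in seconds into a m:ss string."""
    minutes, secs = divmod(int(seconds), 60)
    return "{}:{:02d}".format(minutes, secs)

# the first jump scare of episode two starts at 454 s
print(to_timestamp(454.0))  # 7:34
```

454 seconds is indeed 7:34, the first entry in the episode two row.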

I took the liberty of creating a direct link to the first jump scare. Enjoy! I’ve also compiled the first ten episodes.
Caution: a timestamp does not need to be a jump scare. There are a lot of other loud noises in this Let’s Play.

episode loud noises (direct link)
#1 11:04, 11:24, 21:03, 22:04, 24:22, 25:17
#2 7:34, 7:56, 8:39, 9:32, 10:17, 11:20, 13:36, 14:43, 15:05, 15:23
#3 3:51, 5:10, 7:06, 7:40, 8:06, 11:40, 15:50, 18:20, 18:32
#4 13:03, 13:17, 13:31, 13:52, 14:23, 14:38, 16:09, 25:30
#5 8:05, 11:16, 12:15, 12:34, 17:45, 21:54
#6 2:33, 3:03, 5:49, 7:56, 8:16, 8:29, 9:51, 11:55, 13:26, 14:15, 15:26, 15:42, 17:59, 18:13, 20:36, 22:36
#7 1:19, 3:19, 4:25, 6:29, 6:44, 8:35, 9:31, 20:29, 20:42, 20:58
#8 2:26, 2:55, 3:47, 3:57, 5:04, 5:37, 8:50, 14:35, 14:56, 15:51, 16:16, 17:18, 17:58, 18:11, 18:36, 18:51, 19:32, 20:07, 20:49
#9 1:56, 2:32, 5:56, 10:04, 14:09, 17:17, 18:51, 25:30, 26:19
#10 2:28, 5:49, 8:02, 10:21, 12:39, 13:27, 14:06, 14:40, 14:51, 15:48, 16:11, 16:22, 18:13, 19:52, 20:19, 21:11

Summary

I hope you enjoyed the rough ride. I have shown:

  • Audio signals can be used to classify/predict certain situations in videos,
  • How to build a simple algorithm to find scary things in a Youtube Let’s Play,
  • Touching your microphone or laughing in a horror Let’s Play can produce scary noises.

Of course the algorithm is far from perfect, but it performs well enough to find all major jump scares. And if Fritz stopped touching the microphone or headset, thereby inducing artificial clipping, I could get rid of some of the false positives. Loud laughter, PSN notifications and so on also trigger false alarms. The general rule is: the poorer the audio quality of a video, the more false jump scares will be detected.

Back to Wilhelm for everything scary.

Other sources and libraries include numpy, jupyter and ffmpeg.