TT subripper
by Filiep Geeraert


Technical background


So how does it work ?
First let us have a look at how it normally works from within Vista Mediacenter.
This is the graph you get with Graphedit if you perform a Remote connect to Vista, when it is playing a PAL DVR-MS file.

You can easily deduce from this that a DVR-MS file contains 3 streams : one for audio, one for video and, hang on, what's that third stream for ?
Right, the CCSI decoder. I do not know the full name, but I believe CC stands for Closed Captions...
If only we could get this same graph when playing from Windows Mediaplayer.

However, this is what happens when you simply choose "Render Media" in Graphedit :

You also get an error that some of the streams in the movie are not supported.
So outside of Mediacenter, applications do not know how to handle the stream.
Browsing through the list, you can no longer find the CCSI decoder, although it probably corresponds to the MS TV CC decoder.
But even if you connect that one, and then press play, the subtitles remain hidden.

So, what can we do about this ?
We can wait for MS to add support for that, or we can of course rip the subtitles ourselves.

So, how do we do it ?
I am totally unfamiliar with DirectX or programming from within Mediacenter, so I had to find another means of getting the job done that avoids that kind of programming.

So globally speaking, this is what needs to be done :

1° Open the DVR-MS file in Graphedit and dump the contents to a (very large) file.
2° Decipher the file : it looks like Teletext data, but seems to be slightly scrambled.
3° Among all those pages of Teletext, find the subtitles and find the time they are associated with.
 

Step 1 is easy :

We choose Render media File in Graphedit, then open the DVR-MS file we want to get the subtitles from.
We remove all the arrows and boxes we do not need (audio, video, etc. can be deleted).
Then we insert the Dump DirectShow filter. This will ask for a filename, for which we choose DUMP.FIL.
Then we press the green Play button and wait for the file to finish building.
It's done when the Play button turns green again.
We now have the scrambled Teletext pages (all of them, I guess).

Step 2 : now we need to decipher the data.
To my surprise some parts of it were directly readable, e.g. CEEFÁX appeared many times on the BBC pages, so I started out with a find-and-replace that swapped the Á for an A.
I could then recognise some more words, and kept on doing find-and-replace operations.
In the end, I managed to decipher pretty much all the important text, numbers and punctuation for the basic characters.

I integrated this into the utility (this is step 1 of my utility).

This is the translation table I came up with. For most characters the scrambled value is the ASCII code + 128 (e.g. 193 for A, which is 65), but there are several exceptions.

Scrambled Unscrambled Character
35 156 £
94 35 #
161 33 !
162 34 "
164 36 $
167 39 '
168 40 (
171 43 +
173 45 -
174 46 .
176 48 0
179 51 3
181 53 5
182 54 6
185 57 9
186 58 :
188 60 <
191 63 ?
193 65 A
194 66 B
196 68 D
199 71 G
200 72 H
203 75 K
205 77 M
206 78 N
208 80 P
211 83 S
213 85 U
214 86 V
217 89 Y
218 90 Z
220 171 ½
223 43 +
227 99 c
229 101 e
230 102 f
233 105 i
234 106 j
236 108 l
237 109 m
239 111 o
241 113 q
242 114 r
244 116 t
247 119 w
248 120 x
251 172 ¼
254 246 ÷
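
The table can be applied with a small lookup routine. Below is a minimal sketch in Python, not the utility's actual code. It assumes that every byte of 128 or above that is not one of the exceptions is simply the ASCII code + 128, and that bytes below 128 are already plain ASCII; the table only confirms this for the rows it lists. The pattern would be consistent with a parity bit stored in the top bit of each character, but that too is only an assumption here. The names EXCEPTIONS and unscramble are hypothetical.

```python
# Rows of the table that do NOT follow the "ASCII code + 128" pattern,
# mapped straight to the character they stand for.
EXCEPTIONS = {
    35: "£",
    94: "#",
    220: "½",
    223: "+",
    251: "¼",
    254: "÷",
}


def unscramble(data: bytes) -> str:
    """Translate scrambled dump bytes into readable text."""
    out = []
    for value in data:
        if value in EXCEPTIONS:
            out.append(EXCEPTIONS[value])
        elif value >= 128:
            # Assumption: all other high bytes are ASCII + 128.
            out.append(chr(value - 128))
        else:
            # Assumption: low bytes are already plain ASCII.
            out.append(chr(value))
    return "".join(out)


if __name__ == "__main__":
    # 193 is the byte that shows up as "Á" in CEEFÁX.
    print(unscramble(bytes([67, 69, 69, 70, 193, 88])))
```

Only the A needs translating in this example, which matches the way CEEFÁX was almost readable in the raw dump.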

In version 0.8 I have added support for codepages (French, German, Spanish/Portuguese and Italian).
Detection of the right codepage is not automatic though : the user has to choose the right codepage himself.
Once the translation of the dumpfile is over, you have the dump of x minutes of (mostly) legible Teletext data.
Somewhere in there you should be able to find recognisable Teletext data.
After analysing a couple of programs from 2 different channels (BBC and Canvas), it seemed that a subtitle will always be signaled by the presence of these 4 characters : §§PP, followed by a couple of others which will vary.
However, from the code I could not deduce how far from the §§PP sign they would appear.
In version 0.8 though, I found several other codes which signal the presence of subtitles, so I added them to the detection routine as well.
Also, I discovered that sometimes §§PP may be in the middle of a graphic and have nothing to do with subtitles, so I fine-tuned this a bit more.
I think that these codes might indicate transparency, which is needed for the subtitles to work properly.
After one of these signs there would be lots of spaces and binary zero characters, followed by the actual subtitle(s).
Each time, the beginning of a line would then be indicated by ♂♂, and the end of a line by èè.
Each subtitle page could contain up to 4 lines of subtitles, and some contained none at all !
These blank subtitles are just broadcast to remove the previous subtitle when a new one is not available yet, I guess.
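
The layout described above can be sketched in Python as follows. This is only an illustration under simplifying assumptions : it looks for the §§PP marker alone (not the other codes added in version 0.8), it treats everything up to the next marker as one subtitle page, and it does not try to skip markers that sit inside graphics. The names MARKER, LINE_RE and split_subtitle_pages are hypothetical.

```python
import re

MARKER = "§§PP"  # subtitle marker seen in the translated dump
# One subtitle line sits between the start sign and the end sign.
LINE_RE = re.compile(r"♂♂(.*?)èè", re.DOTALL)


def split_subtitle_pages(text):
    """Return one list of subtitle lines per marker found in the text.

    An empty list stands for a blank subtitle page.
    """
    pages = []
    # Everything before the first marker is not a subtitle page.
    for chunk in text.split(MARKER)[1:]:
        lines = [raw.strip(" \x00") for raw in LINE_RE.findall(chunk)]
        pages.append(lines)
    return pages


if __name__ == "__main__":
    sample = "junk§§PPab  \x00\x00♂♂Hello there èè ♂♂Second lineèè§§PPxx  "
    for page in split_subtitle_pages(sample):
        print(page)
```

The second page in the sample has no line signs at all, so it comes back as an empty list, which is how a blanking subtitle would show up.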
However, in version 0.8 I have changed this a bit, because it turns out that the TT decoders seem to have some sort of timeout too.
If no new subtitle is broadcast in a while, and no blanking subtitle is broadcast either, the subtitle will be blanked anyway.
I mimicked this behaviour by setting the maximum time to 8 seconds.
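
As a rough sketch of that rule in Python : a subtitle ends when the next page (subtitle or blank) arrives, or after 8 seconds, whichever comes first. The function name end_times and the idea of working on a plain list of start times in seconds are assumptions made for the example, not a description of the utility's internals.

```python
MAX_DURATION = 8  # seconds, the maximum display time mentioned above


def end_times(starts):
    """Give each start time an end time, capped at MAX_DURATION seconds.

    `starts` is a sorted list of start times in seconds, one per page.
    """
    ends = []
    for index, start in enumerate(starts):
        limit = start + MAX_DURATION
        if index + 1 < len(starts):
            # The next page replaces this one, unless the timeout hits first.
            ends.append(min(starts[index + 1], limit))
        else:
            # Last page : nothing follows, so only the timeout applies.
            ends.append(limit)
    return ends


if __name__ == "__main__":
    print(end_times([0, 3, 20]))
```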
Now the next question was : how will I find out when this subtitle was broadcast ?
We can know this approximately, as Teletext pages contain the time in hours, minutes and seconds.
So in my utility, after converting the characters to their ASCII counterparts, the next step was to parse through the whole file, keeping the first time as a relative time, and recalculating each subtitle from that moment in time.
I also had to keep in mind that a show could start at 23:00 and last until 02:00.
If I simply subtracted 23 hours from 2, I would get negative timings, so I compensated for this by adding 24 hours whenever the result would have been negative.
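
The calculation can be written down in a few lines. This Python sketch assumes times are available as (hours, minutes, seconds) tuples; the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 3600


def to_seconds(hours, minutes, seconds):
    """Convert a clock time to seconds since midnight."""
    return hours * 3600 + minutes * 60 + seconds


def relative_seconds(first, current):
    """Seconds elapsed between the first clock time seen and the current one.

    Both arguments are (hours, minutes, seconds) tuples.
    """
    delta = to_seconds(*current) - to_seconds(*first)
    if delta < 0:
        # The recording ran past midnight : add 24 hours.
        delta += SECONDS_PER_DAY
    return delta


if __name__ == "__main__":
    # A show that starts at 23:00 and is still running at 02:00.
    print(relative_seconds((23, 0, 0), (2, 0, 0)))
```

For a show starting at 23:00, a page stamped 02:00 then comes out 3 hours (10800 seconds) into the recording instead of at a negative offset.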
During the development of version 0.8 I also found that sometimes a timecode using hours, minutes and seconds is present in the Teletext pages themselves (about stock markets, e.g.), so I had to filter those out as well.
In the second step we will get an SRT file with the subtitles, but also with lots of blank subtitle lines.
In a final, quick phase the SRT file is processed and tidied up, again removing those blank lines (which were necessary during the build of the SRT file to get correct time intervals).
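
A minimal Python sketch of such a tidy-up pass, assuming each entry is a (start, end, lines) tuple with times in whole seconds. The names srt_time and build_srt are hypothetical, and the real utility may well work differently. Blank entries are dropped and the remaining ones are renumbered from 1, in the usual SRT layout of index, time range, text and an empty line.

```python
def srt_time(seconds):
    """Format whole seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},000"


def build_srt(entries):
    """Drop blank entries, renumber the rest and return the SRT text."""
    kept = [e for e in entries if any(line.strip() for line in e[2])]
    blocks = []
    for index, (start, end, lines) in enumerate(kept, start=1):
        header = f"{index}\n{srt_time(start)} --> {srt_time(end)}\n"
        blocks.append(header + "\n".join(lines) + "\n")
    # Blocks are separated by one empty line.
    return "\n".join(blocks)


if __name__ == "__main__":
    entries = [(0, 3, ["Hi"]), (3, 5, []), (5, 8, ["Bye", "now"])]
    print(build_srt(entries))
```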
Steps that were added later on are colour detection, and creating an HTML file with the subtitles and their colours.

There you have it : your subtitles in an SRT file, with the original foreground colours.
It sounds easier than it is, believe me ;)

 

  
