TT subripper
by Filiep Geeraert
Technical background
So how does it work ?
First let us have a look at how it normally works from within Vista Mediacenter.
This is the graph you get with Graphedit if you perform a Remote connect to
Vista, when it is playing a PAL DVR-MS file.

You can easily deduce from this that a DVR-MS file contains
3 streams: one for audio, one for video and, hang on, what's that third stream
for ?
Right, CCSI decoder, I do not know the full name, but I believe CC stands for
Closed Captions...
If only we could get this same graph when playing from Windows Mediaplayer.
However, this is what happens when you simply choose "Render Media" in Graphedit :

You also get an error that some of the streams in the movie
are not supported.
So outside of Mediacenter, applications do not know how to handle the stream.
Browsing through the list, you can no longer find the CCSI decoder, however this
probably corresponds to the MS TV CC decoder.
But even if you connect that one, and then press play, the subtitles remain
hidden.
So, what can we do about this ?
We can wait for MS to add support for that or we can rip the subtitles ourselves
of course.
So, how do we do it ?
I am totally unfamiliar with DirectX or programming from within Mediacenter, so
in order to avoid programming in such a way I had to find another means of
getting the job done.
So globally speaking, this is what needs to be done :
1° Open the DVR-MS file in Graphedit and dump the contents to a (very large)
file.
2° Decipher the file, as it looks like Teletext data but seems to be slightly
scrambled.
3° Among all those pages of Teletext, find the subtitles and find the time they
are associated with.
Step 1 is easy :

We choose Render media File in Graphedit, then open the
DVR-MS file we want to get the subtitles from.
We remove all the arrows and boxes we do not need (audio, video, etc. can be
deleted).
Then we insert the Dump DirectShow filter. This will ask for a filename, for which
we choose DUMP.FIL.
Then we press the green Play button and wait for the file to finish building.
It's done when the Play button turns green again.
We now have the scrambled Teletext pages (all of them, I guess).
Step 2 : now we need to decipher the data.
To my surprise, some parts of it were directly readable, e.g. CEEFÁX was there
lots of times on the pages of the BBC, so I started out by doing a
find-and-replace by which I swapped the Á for the A.
I could then recognise some more words, and kept on doing find and replace
operations.
In the end, I managed to decipher pretty much all the important text, numbers and
punctuation for the basic characters.
I integrated this into the utility (this is step 1 of my utility).
This is the translation table I came up with (most characters seem to be the ASCII code + 128, but there are several exceptions).
| Scrambled | Unscrambled | Character |
| 35 | 156 | £ |
| 94 | 35 | # |
| 161 | 33 | ! |
| 162 | 34 | " |
| 164 | 36 | $ |
| 167 | 39 | ' |
| 168 | 40 | ( |
| 171 | 43 | + |
| 173 | 45 | - |
| 174 | 46 | . |
| 176 | 48 | 0 |
| 179 | 51 | 3 |
| 181 | 53 | 5 |
| 182 | 54 | 6 |
| 185 | 57 | 9 |
| 186 | 58 | : |
| 188 | 60 | < |
| 191 | 63 | ? |
| 193 | 65 | A |
| 194 | 66 | B |
| 196 | 68 | D |
| 199 | 71 | G |
| 200 | 72 | H |
| 203 | 75 | K |
| 205 | 77 | M |
| 206 | 78 | N |
| 208 | 80 | P |
| 211 | 83 | S |
| 213 | 85 | U |
| 214 | 86 | V |
| 217 | 89 | Y |
| 218 | 90 | Z |
| 220 | 171 | ½ |
| 223 | 43 | + |
| 227 | 99 | c |
| 229 | 101 | e |
| 230 | 102 | f |
| 233 | 105 | i |
| 234 | 106 | j |
| 236 | 108 | l |
| 237 | 109 | m |
| 239 | 111 | o |
| 241 | 113 | q |
| 242 | 114 | r |
| 244 | 116 | t |
| 247 | 119 | w |
| 248 | 120 | x |
| 251 | 172 | ¼ |
| 254 | 246 | ÷ |
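As an illustration, the table above can be applied with a simple byte lookup. This is a minimal sketch in Python, not the utility's actual code: the names `SCRAMBLED_TO_PLAIN` and `unscramble` are made up for this example, only a few rows of the table are copied in, and bytes that are not listed are assumed to pass through unchanged.

```python
# Partial copy of the translation table above: scrambled byte -> unscrambled byte.
# Only some rows are reproduced; the remaining rows can be added the same way.
SCRAMBLED_TO_PLAIN = {
    35: 156,   # pound sign (one of the exceptions)
    94: 35,    # '#' (one of the exceptions)
    161: 33,   # '!'
    174: 46,   # '.'
    176: 48,   # '0'
    193: 65,   # 'A'
    200: 72,   # 'H'
    211: 83,   # 'S'
    220: 171,  # one half (one of the exceptions)
    229: 101,  # 'e'
    236: 108,  # 'l'
    239: 111,  # 'o'
}

# 256-entry table for bytes.translate(); unlisted bytes map to themselves
# (an assumption made for this sketch).
_TABLE = bytes(SCRAMBLED_TO_PLAIN.get(b, b) for b in range(256))


def unscramble(data: bytes) -> bytes:
    """Replace every scrambled byte by its unscrambled counterpart."""
    return data.translate(_TABLE)


# The CEEFAX example from the text: only the A (byte 193) needs translating.
print(unscramble(b"CEEF\xc1X"))
```

The regular rows (scrambled = unscrambled + 128) only cover characters whose 7-bit code contains an even number of 1 bits, which is what an odd-parity bit stored in the top bit would produce. That is an inference from the table, not something established above.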
In version 0.8 I have added support for codepages (French,
German, Spanish/Portuguese and Italian).
Detection of the right codepage is not automatic though: the user has to select
the right codepage manually.
Once the translation of the dumpfile is over, you have the
dump of x minutes of (mostly) legible Teletext data.
Somewhere in there you should be able to find recognisable Teletext data.
After analysing a couple of programs from 2 different channels (BBC and Canvas),
it seemed that a subtitle is always signalled by the presence of these 4
characters : §§PP, followed by a couple of others which vary.
However, from the code I could not deduce how far from the §§PP sign they would
appear.
In version 0.8 though, I found several other codes which signal the presence of
subtitles, so I added them to the detection routine as well.
Also, I discovered that sometimes §§PP may be in the middle of a graphic and
have nothing to do with subtitles, so I fine-tuned this a bit more.
I think that these codes might indicate transparency, which is needed for the subtitles
to work properly.
After one of these signs there would be lots of spaces and binary zero characters,
followed by the actual subtitle(s).
Each time, the beginning of a line would then be indicated by ♂♂, and the end of
a line by èè.
Each subtitle page could contain up to 4 lines of subtitles, some of them even
zero !
These blank subtitles are just broadcast to remove the previous subtitle when a
new one is not available yet, I guess.
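The line markers described above lend themselves to a small extraction routine. The following is a rough sketch rather than the utility's real parser: it assumes the dump has already been translated and decoded into a Python string in which the markers appear exactly as ♂♂ and èè, and the function name `extract_subtitle_lines` is hypothetical.

```python
import re

LINE_START = "♂♂"  # start-of-line marker as shown in the text above
LINE_END = "èè"    # end-of-line marker as shown in the text above

# Non-greedy match of everything between a start marker and the next end marker.
_LINE_RE = re.compile(
    re.escape(LINE_START) + r"(.*?)" + re.escape(LINE_END), re.DOTALL
)


def extract_subtitle_lines(page: str) -> list[str]:
    """Return the subtitle lines found on one page; an empty list means a blank subtitle."""
    lines = []
    for raw in _LINE_RE.findall(page):
        # Binary zeroes and surrounding spaces are padding, not subtitle text.
        text = raw.replace("\x00", " ").strip()
        if text:
            lines.append(text)
    return lines
```

A page that contains a subtitle code but no marked lines yields an empty list, which corresponds to the blanking subtitles mentioned above.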
However, in version 0.8 I have changed this a bit, because it turns out that the
TT decoders seem to have some sort of timeout too.
If no new subtitle is broadcast in a while, and no blanking subtitle is
broadcast either, the subtitle will be blanked anyway.
I mimicked this behaviour by setting the maximum time to 8 seconds.
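One way to mimic that timeout is sketched below. It is only an illustration under assumptions: subtitles are taken to be `(start_seconds, text)` pairs sorted by time, an empty text stands for a blanking subtitle, and the names `MAX_DISPLAY_SECONDS` and `assign_end_times` are invented for this example.

```python
MAX_DISPLAY_SECONDS = 8  # the maximum display time mentioned above


def assign_end_times(events):
    """events: list of (start_seconds, text), sorted by start time.

    An empty text is a blanking subtitle. Returns (start, end, text) for
    every non-blank subtitle, where end is the start of the next event or
    start + MAX_DISPLAY_SECONDS, whichever comes first.
    """
    timed = []
    for i, (start, text) in enumerate(events):
        if not text:
            continue  # blanking subtitles only serve to end the previous one
        limit = start + MAX_DISPLAY_SECONDS
        if i + 1 < len(events):
            end = min(events[i + 1][0], limit)
        else:
            end = limit
        timed.append((start, end, text))
    return timed
```

For example, a subtitle at 0 seconds followed by a blanking subtitle at 3 seconds ends at 3, while a subtitle at 10 seconds whose successor only arrives at 30 seconds is cut off at 18.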
Now the next question was : how will I find out when this subtitle was broadcast ?
We can know this approximately, as Teletext pages contain the time in hours,
minutes and seconds.
So in my utility, after converting the characters to their ASCII counterparts,
the next step was to parse through the whole file, keeping the first time as a
relative time, and recalculating each subtitle from that moment in time.
I also had to keep in mind that a show could start at 23:00 and last until
02:00.
If I simply subtracted 23 hours from 2 I would get negative timings, so I
compensated for this by adding 24 hours whenever the result would have been negative.
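The midnight correction can be written down in a few lines. This is a sketch of the arithmetic only, assuming times are available as `(hours, minutes, seconds)` tuples; the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert a clock time to seconds since midnight."""
    return hours * 3600 + minutes * 60 + seconds


def relative_seconds(first, current) -> int:
    """Seconds elapsed between the first time seen in the dump and a later one.

    Both arguments are (hours, minutes, seconds) tuples. If the clock passed
    midnight the plain difference is negative, so 24 hours are added.
    """
    diff = to_seconds(*current) - to_seconds(*first)
    if diff < 0:
        diff += SECONDS_PER_DAY
    return diff
```

With a recording that starts at 23:00:00, a page stamped 02:00:00 comes out as 10800 seconds (3 hours) into the recording instead of a negative offset.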
During the development of version 0.8 I also found that sometimes a timecode
using hours, minutes and seconds is present inside pages (e.g. pages about stock
markets), so I had to filter those out as well.
In the second step we will get an SRT file with the subtitles, but also with
lots of blank subtitle lines.
In a final, quick phase the SRT file is processed and tidied up, again removing
those blank lines (which were necessary during the build of the SRT file to get
correct time intervals).
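The tidy-up phase described above could look roughly like this. It is a sketch under assumptions, not the utility's code: entries are taken to be `(start_seconds, end_seconds, text)` tuples with whole-second times, and `srt_timestamp` and `build_srt` are names made up for this example. The output follows the usual SRT layout of a sequence number, a `start --> end` line and the text, separated by blank lines.

```python
def srt_timestamp(seconds: int) -> str:
    """Format whole seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},000"


def build_srt(entries) -> str:
    """entries: iterable of (start_seconds, end_seconds, text).

    Blank entries are dropped and the remaining ones are numbered from 1.
    """
    blocks = []
    index = 1
    for start, end, text in entries:
        if not text.strip():
            continue  # blank subtitle: only needed while computing the timings
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
        index += 1
    return "\n".join(blocks)
```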
Steps that were added later on are colour detection, and creating an HTML file
with the subtitles and their colours.
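Writing the coloured HTML file might be done along these lines. This is purely illustrative: how the colours are detected is not covered here, so the sketch assumes each subtitle line already comes with a CSS colour name, and the function name `subtitles_to_html` and the black background are choices made for this example.

```python
import html


def subtitles_to_html(lines) -> str:
    """lines: iterable of (text, colour) pairs, colour being a CSS colour name."""
    rows = []
    for text, colour in lines:
        # Escape the text so characters such as < and & display literally.
        rows.append(
            f'<p><span style="color:{colour}">{html.escape(text)}</span></p>'
        )
    body = "\n".join(rows)
    return f'<html><body style="background:black">\n{body}\n</body></html>'
```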
There you have it : your subtitles in an SRT file, with the
original foreground colours.
It sounds easier than it is, believe me ;)