Friday, 29 June 2012

Using Babylon-based dictionaries on your Kindle

UPDATE! A Follow-Up Post on this Project
Since this post got wide attention, I've decided to follow up on this project.
See my new Babylon-based dictionaries on Kindle - Round 2 post.
Now the project is shared as open-source and pre-built dictionaries are organized and shared.

Lost in translation
The problem
It all started when I tried to purchase an Italian-English dictionary for my 2nd generation Kindle, running Kindle software v2.5.3.
One dictionary was offered for sale (as an ebook) on Amazon's website. The problem was that the dictionary would not actually be available for the device for another whole year.

Good translations
Babylon, on the other hand, offers high-quality dictionaries covering pretty much every language. Babylon Translator is paid software for Windows, but its dictionary files (.BGL) are offered as free downloads.

In a perfect universe
If only I had a way to import Babylon's free dictionary content into my Kindle and use it as the built-in dictionary, it would be perfect..

The solution presented here was tested on my Kindle 2. I'm pretty sure it should work on newer versions of Kindle as well.
The same Babylon dictionary, used on my PC (Left) and on my Kindle (Right)
Article Level:
Reasonably moderate
Cracking the Unicode codepage code
Spoilt Kindle 2
There are a few things to know about multilingual support and Kindle (if you wish to view non-Latin international texts):
Kindle 2 does not natively support non-Latin Unicode characters. This means that if you try to view an ebook which contains non-Latin text (e.g. Cyrillic), you will see blank squares instead of letters.
This is a huge miss on Amazon's side for 2 reasons:
  1. Unicode characters are already supported on all platforms: computers, tablets, phones, websites, etc. All modern devices can natively display any character set. All except the Kindle 2, that is.
  2. Kindle is not a laptop, nor a tablet, nor a smartphone.
    Its one and only purpose is to be an electronic book reader. The only thing it should do well is display texts. Why not have it natively support any text in any language? Especially since the resources for that are so common and so obvious already.. It isn't 1994 anymore...
There is a workaround (a hack) which enables Kindle 2 to display all Unicode characters. It's described in detail in this great blog post, which includes links to all the necessary files to make it work, elaborate instructions and links to alternative fonts which may be installed for improved readability as well.
I am not sure how right-to-left books are displayed (e.g. ebooks in Hebrew), in terms of text alignment and order of characters, because I have not tested such books yet. For left-to-right languages (e.g. Bulgarian) everything seems to be OK.

And there's more..
Three more points to take into account:
  • Kindle models of generation 3 and above do support Unicode natively.
    This means that they properly display ebooks in any non-Latin language.
  • Even after hacking my Kindle 2 to display non-Latin characters, I didn't manage to use the integrated dictionary to look up words in non-Latin languages.
    For example, if I'm reading a Bulgarian book and I wish to use a Bulgarian to English translator as the default integrated dictionary (i.e. point the cursor at a word to look it up), the solution described in this post doesn't seem to work (the lookup functionality does not look up).
    It seems that the integrated dictionary lookup functionality supports Latin characters only. Perhaps newer generations of Kindle don't suffer from this problem.
    I'd love to be enlightened by anyone who has succeeded in achieving this with a Kindle of any generation.
  • Setting a new default dictionary worked nicely on my Kindle device itself.
    However, I found it difficult to use my custom dictionary on my computer running Kindle for PC or on my phone running the Kindle for Android app.
My Kindle 2
Ingredients
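A quick download list of the tools you will need:
  • A Babylon dictionary file (.BGL) from Babylon's free dictionaries page (Step 1)
  • My BabylonToHtml tool (Step 2)
  • Mobipocket Creator (Step 3)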
Step 1: Get the dictionary file
In order to create your custom Babylon dictionary file for Kindle you will need a Babylon dictionary file.
Go to Babylon's free dictionaries page, choose one (or more) and download it. All done, right? Not quite.

The dictionary file you've downloaded from Babylon's site is actually an .EXE installer, which contains the dictionary file archived in it.
There are some suggestions that it may be possible to extract the .BGL file from the installer with 7-Zip, but I did not manage to do so. The easiest way to get the dictionary file out is to run the installer, which will install Babylon (at least in trial mode).
Once Babylon is installed the .BGL file resides in %LOCALAPPDATA%\Babylon (Windows Vista/7). You may repeat the process for as many dictionaries as you require. Copy out the precious .BGL file(s) and keep or uninstall Babylon as you wish.
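
If you have installed several dictionaries, hunting for the .BGL files by hand gets tedious. Here is a minimal Python sketch that lists them. It assumes only the %LOCALAPPDATA%\Babylon location mentioned above; the function name find_bgl_files is made up for this illustration and is not part of any Babylon tool.

```python
import os
from pathlib import Path


def find_bgl_files(root):
    """Return every .BGL file below root (any letter case), sorted by path."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".bgl"
    )


if __name__ == "__main__":
    # %LOCALAPPDATA%\Babylon is the folder named in the post (Windows Vista/7).
    local_app_data = os.environ.get("LOCALAPPDATA")
    if local_app_data:
        for path in find_bgl_files(Path(local_app_data) / "Babylon"):
            print(path)
```

The search is recursive because I can't promise that every Babylon version keeps the files directly in that folder.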

Step 2: Use my magic tool: BabylonToHtml
The next step is to convert the binary .BGL dictionary to a textual HTML file (of a very specific structure, of course) which will be used as the source of the eBook.
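
To give a feel for what "a very specific structure" can mean, here is a rough Python sketch that writes dictionary entries as HTML. The exact markup BabylonToHtml emits is not shown in this post, so the idx:entry and idx:orth tags below are an assumption based on the markup commonly used for Mobipocket dictionaries, and the function names are invented for the example.

```python
from html import escape


def render_entry(headword, definition):
    """Render one dictionary entry; idx:orth marks the word that gets looked up."""
    return (
        "<idx:entry>\n"
        f"  <idx:orth>{escape(headword)}</idx:orth>\n"
        f"  <p>{escape(definition)}</p>\n"
        "</idx:entry>\n"
    )


def render_dictionary(entries):
    """Wrap (headword, definition) pairs in a UTF-8 HTML page."""
    body = "".join(render_entry(word, text) for word, text in entries)
    return (
        "<html>\n"
        '<head><meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8"></head>\n'
        "<body>\n" + body + "</body>\n</html>\n"
    )


if __name__ == "__main__":
    print(render_dictionary([("cat", "котка"), ("dog", "куче")]))
```

Escaping the text matters here: a definition containing "<" or "&" would otherwise break the page that the eBook builder has to parse.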

About my magic tool
The binary structure of .BGL files has already been cracked (not by me). This knowledge is commonly out in the open and shared across various open-source projects. I have combined a few of those resources into one easy-to-use command-line utility.
  • One source was dictconv, a dictionary conversion tool for Linux which comes with its full C++ source. I used parts of this code (ported by me to C#) in order to analyse the meta-data of the dictionary file (text encoding, author etc).
  • Another resource is an open-source project named ThaiLanguageTools. It's written in C#, but its code looks suspiciously similar to the code of dictconv mentioned above (similar variable names, comments etc), which suggests it's a port as well.
  • The content of Babylon's .BGL files is compressed in GZip format. In order to decompress the data, I have incorporated the free open-source SharpZipLib into the project as well (as source code, so in the end only one executable is needed to run my app, with no additional DLLs).
To all the above I added my very own simple HTML generator. It structures the entries from the dictionary file in a markup compatible with the next step (converting it into an eBook).
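
The GZip part mentioned above can be tried out without any C#. The following Python sketch is not the code of my tool: it assumes the compressed stream can be found by scanning for the standard gzip signature bytes, whereas a real converter may read the stream's position from the file header instead (that detail is not covered in this post). The function name is hypothetical.

```python
import zlib

# gzip signature (0x1f 0x8b) followed by the method byte for "deflate" (0x08)
GZIP_MAGIC = b"\x1f\x8b\x08"


def extract_gzip_payload(raw):
    """Decompress the first gzip stream found in raw, skipping any header before it."""
    start = raw.find(GZIP_MAGIC)
    if start < 0:
        raise ValueError("no gzip stream found")
    # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return decompressor.decompress(raw[start:])


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as bgl:
        data = extract_gzip_payload(bgl.read())
    print(f"decompressed {len(data)} bytes")
```

The decompressed bytes are still Babylon's own binary record format, so this only gets you past the first layer; parsing the entries is what the ported dictconv code is for.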

Get the tool (with or without the source code)
If you wish to browse through the sources (and improve them!), you can download the full Visual Studio solution from this link.
If you just want the executable itself, you can get it from this link.

Use it
You'll need to run my BabylonToHtml tool in a command prompt window.
If you run it without any additional parameters, you'll receive some basic help:
A handy message for the perplexed user..
Command line parameters:
  • In most cases all you have to provide is the name (and potentially the path) of your .BGL file. 
  • The output .HTML is encoded in UTF-8 (Unicode).
    However, the entries read from the .BGL dictionary are encoded with specific character sets (and sometimes with more than one).
    For example: in a Chinese - Bulgarian dictionary the source language entries are encoded with Chinese characters and the target language entries are encoded in Cyrillic.
    BabylonToHtml will try, by default, to get the right encoding (this info is available in the meta-data of the .BGL file in most cases), but it may make mistakes.
    These encodings can be enforced:
    It is possible to set the codepage of the source language by specifying the -se command line argument.
    It is possible to set the codepage of the target language by specifying the -te command line argument.
So something like the following should be sufficient in most cases:
BabylonToHtml.exe English_Bulgarian.BGL
If your .BGL file does not reside in the same folder as the .EXE, a full path should be specified (wrapped in double quotes if it contains spaces).
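
On the encoding point: here is a small Python illustration of why the right codepage matters and what a wrong guess looks like. It only demonstrates the idea. The names cp1251 and cp1252 are Python codec names, not necessarily the values that -se and -te expect (run the tool without parameters to see its own help).

```python
def to_utf8(raw_bytes, codepage):
    """Decode bytes stored in a legacy codepage and re-encode them as UTF-8."""
    return raw_bytes.decode(codepage).encode("utf-8")


# The Bulgarian/Russian word "привет" as stored in Windows-1251 (Cyrillic).
cyrillic_bytes = b"\xef\xf0\xe8\xe2\xe5\xf2"

right = cyrillic_bytes.decode("cp1251")  # the intended Cyrillic text
wrong = cyrillic_bytes.decode("cp1252")  # same bytes read as Western European

if __name__ == "__main__":
    print(right)
    print(wrong)  # accented Latin gibberish instead of Cyrillic
    print(to_utf8(cyrillic_bytes, "cp1251"))
```

Nothing fails when the wrong codepage is used; you simply get readable-looking garbage, which is why a wrong automatic guess can go unnoticed until the dictionary is on the device.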

The encoding (and other information about your dictionary) is parsed and the progress of the process is presented...
Running...
Once the process is done, a new HTML file resides next to the original .BGL file.
The new file's name matches that of the original .BGL file (just with an .HTML extension):
All done. A new HTML file is generated. Magic!!

Step 3: Convert the dictionary to a Kindle compatible eBook
For this you will need to download, install and run the free Mobipocket Creator. The process itself is fairly simple. Here is the illustrated version:

On the main window, under "Import From Existing File" click the "HTML document" link.
Import from: HTML (duh!)
On the next screen:
Click "Browse..." on the "Choose a file" field and select the HTML file generated by BabylonToHtml.
In the "Encoding" drop-down select "International (UTF8)".
Click the "Import" button..
Import the HTML file
Click "Book settings" on the left-hand-side list and set the fields:
Set the "Encoding" drop-down to "International (UTF8)".
Check the "This eBook is a dictionary" box.
Set the Input language and the output language of your dictionary appropriately.
Click the "Update" button..
Dictionary settings..
Click "Metadata" on the left-hand-side list and set the mandatory fields: 
Give your eBook a title and set the author, language and main subject.
Now scroll all the way down...
Metadata(1/2): Fill a title, author, language and main subject
At the bottom of the "Metadata" screen, fill the "Suggested Retail Price" field (it cannot be left empty, "0" is also fine).
Click the "Update" button..
Metadata(2/2): Set the retail price :-)
On the top bar click the "Build" icon...
Build(1/4): Click Build
In the "Build Publication" screen click the "Build" button...
Build(2/4): Click Build
Wait for the build.
Depending on the size of your dictionary (and the size of the generated HTML file) this may take some time.
Build(3/4): Wait...
Once the process is finished, select the "Open folder containing eBook" radio button and click "OK" to get your dictionary eBook.
Build(4/4): All is done!
Your dictionary-eBook is a file with .prc extension:
Your eBook is produced with a .prc extension

Step 4: Transfer the dictionary to your Kindle and start using it
Transfer
Plug the Kindle into the computer (duh!). Transfer the new eBook to the usual Documents folder, alongside your other books, and unplug.

Note: In some newer versions of Kindle, the dictionaries have been moved from the Documents folder to the Documents/Dictionaries subfolder. If the dictionary is not recognized by your Kindle device, move it there.
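
If you rebuild your dictionary often, the copy step can be scripted. This is a minimal Python sketch under a few assumptions: the Kindle is mounted as a drive, the folders are named documents and dictionaries exactly as spelled in the code, and the function name install_dictionary is invented for this example. It prefers the Dictionaries subfolder when the device has one.

```python
import shutil
from pathlib import Path


def install_dictionary(prc_file, kindle_root):
    """Copy a .prc dictionary onto a mounted Kindle and return the new file's path."""
    documents = Path(kindle_root) / "documents"
    if not documents.is_dir():
        raise FileNotFoundError(f"no documents folder under {kindle_root}")
    # Newer devices keep dictionaries in a subfolder; use it only if it exists.
    dictionaries = documents / "dictionaries"
    target_dir = dictionaries if dictionaries.is_dir() else documents
    return Path(shutil.copy2(prc_file, target_dir))


# Example call ("E:/" is a hypothetical drive letter for the plugged-in Kindle):
# install_dictionary("English_Bulgarian.prc", "E:/")
```

Remember to eject the device properly before unplugging, as with any manual copy.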

Set as default
Click the "Home" button, then click "Menu" and go to "Settings" and Enter:
Home screen > Menu > Settings

In the Settings screen click "Menu" again and go to "Change Primary Dictionary":
Settings screen > Menu > Change Primary Dictionary

Your newly created dictionary should appear next to the default Oxford one.
Select it and Enter:
Choose your custom dictionary

Then click Home to leave the Settings page.
Your dictionary is now the default translator whenever you select a word in a book:
Babylon dictionary on Kindle!
You may also manually look up words in your custom dictionary as you do with the default English one.

Bonus tip: Take screenshots from the Kindle
To take a screenshot from the Kindle device: 
Press the Shift key + ALT key + G simultaneously. The screen will flicker.

Plug the Kindle into the computer; your screenshot files are in the Documents folder, named screen_shot*.gif.
Note: This process sometimes needs to be repeated. You may not find your screenshots every time. Not sure why.
Kindle screenshots!

45 comments:

  1. Very useful, thanks for sharing! Could you give a link to the English_Bulgarian.prc file?

    Replies
    1. Thanks Krasin! Nice to see you here :-)
      You can get the English-Bulgarian dictionary .prc file at this link

  2. Hi Alon!

    Thank you for your post!

    Any idea how to port it to the android app?
    I'm trying to put an Hebrew dict to the android app.

    Thanks!

    Replies
    1. Just found how to do it.
      http://translate.google.com/translate?hl=iw&sl=auto&tl=en&u=http%3A%2F%2Firising.me%2F2011%2F09%2F10518%2F

      I found the PRC files (mine is: sdcard/android/data/com.amazon.kindle/files/)
      And changed the name of the English-Hebrew.prc to one of the existing dictionaries
      Now, when I choose the UK dict it's working but can't find the words (tried few words like: "time" "father" but it can't find the definition, when I tried to search for "years" it found but everything is messed.)
      BTW, when I'm trying to open the Hebrew dic I can see the english words, but the hebrew characters displayed as squares (could be the reason for the lack of definitions when searching for a word)

      I hope you can help me.
      Thanks!

  3. I was trying to convert NEW_Babylon_German_English_dictionary.BGL
    to html with the help of your program. The result was somewhat garbled.
    e.g. The definition of "Abdichtung" came out this way:

    proofing, sealing, act of closing off against entry or leakagebdichtung (die)

    The last words should be "leakage Abdichtung (die)"

    but it came out "leakagebdichtung". The character "A" of "Abdichtung" was swallowed and the word was appended to "leakage"

    This is the same for all the definitions.

    Can you suggest what can I do?

    Thank you for your reply

  4. Hello Alon,
    I kinda have difficulties going through the second step - I keep getting an error. If it's not too much to ask, could you please create a Bulgarian-English dictionary (prc or html) and post it here?
    If you have some free time to do that, it would be a huge favour. Thanks in advance

    Replies
    1. Problem solved by wrapping the name with double quotes, as suggested by the article author. This just in case anybody experiences the same problem.

    2. Thanks for tipping, targmanebi, and sorry I wasn't around to respond on time. I've been kinda away from the blog for a while.

    3. Well I managed to get a final "prc" file by following your instructions, but it still doesn't work. I figured it doesn't support Cyrillic. It displays the Cyrillic letters, but doesn't look up any of them. Do you think this can be solved?

    4. Well, which Kindle are you using? In my experience, at least on my Kindle 2, lookup does not work on non-Latin words. I've also mentioned it in the post:

      "Even after hacking my Kindle 2 to display non-Latin characters, I didn't manage to use the integrated dictionary to look up words in non-Latin languages.".

      It may very well be that this does not work on other Kindle models too (since it's an American product, languages other than English do not really exist in the world, except perhaps French and Spanish, somewhere beyond the very distant borders to the north and to the south, you know..).

  5. :D That's true. Mine is Kindle keyboard and probably that is the reason. Anyway thanks a lot for your time. i really appreciate it. good luck with you blog and other things

    Replies
    1. The pleasure is mine. Thank you for the feedback! :-)

  6. Hi and thank you very much! I managed a nice Spanish english dic!
    However I'm trying to make an english english and an english - french one, but the html comes out weirdly like this :
    "cos$531761$
    . Due to the fact that -os"
    All the defined words are like this with weird dollar symbols etc., am I doing it wrong?
    Thanks a lot!

    Btw, do you know how to add cp932 for japanese please? If it's not complicated, because I don't know anything about programming! :)

    Replies
    1. I'm glad to hear that the Spanish-English dictionary worked for you!
      I had a chance to do some research, and as far as I've seen, those $... sequences represent special or accented characters. I think those are embedded in the dictionary for the sake of pronunciation (and Babylon knows how to interpret them, of course).

      I'll try to find some time in order to provide a better solution for those special characters and perhaps to support Japanese encoding (if the program doesn't detect it automatically, and not via the command line options).

    2. Thank you very much, I'll be waiting! :)

    3. Due to lack of time for maintaining this project, I've gone off the attempts to improve it for now, and made it open-source. Please see the follow-up post with posted dictionaries as well.

  7. Hi,
    I have Kindle Keyboard (i.e. 3rd generation). The firmware version is 3.4.
    I did all the instructions as you wrote on English-Hebrew Babylon dictionary. There are few issues:
    1. Hebrew appears left to right. But I do see Hebrew letters (and not Gibberish)
    2. There are strange stuff like $531761$ (like other people reported)
    3. The dictionary is not popping-up in the Kindle when the cursor is hovering a word. Sometime it does with "a delay" and it displays previous hovered word.

    I was able to upload free Hebrew eBooks (which are displayed perfectly) according to the instruction in http://kneidlach.info/

    Replies
    1. Hi,

      I have done some research and had a bit of testing with Hebrew.

      1. I did manage to reverse the Hebrew characters, but (at least on my Kindle 2) the words are still displayed in reversed order (because no matter what I did, the text is aligned to the left).

      2. The strange stuff probably represents special characters for pronunciation. I did manage to remove most of them, but not all.

      3. As for popping up- I haven't managed to pop up the dictionary properly (especially with any Hebrew one), and like you say- after some time, sometimes it pops, other times not. Only Amazon knows how that works..

      I may publish another post about this subject with additional features, or add an update to the current post, as soon as I have a little time..

    2. TNX.
      I modified BabylonToHtml in a trivial way to reverse all texts and also to remove "$$". It is now working in my Kindle. It displays Hebrew correctly (right to left) and it is popped up when standing on a word in the Kindle. Of course there are issues with special characters (in ) and the indexing (aliasing) of words.
      Bottom line is that English-Hebrew is doable. More polishing is needed.

    3. This is good news! So it seems Kindle 3 supports Hebrew (and right-to-left options) much better than Kindle 2. Nice work! I guess many people would be interested in downloading your end-product. Feel free to share it.

    4. I have created English-Hebrew dictionary. It is not perfect. But it does do the work! How to upload it?

      Hmm, you can share it on some service like Google Drive or Dropbox, or send it to me (alon@alonintheworld.com) and I'll share it here in the post and in a follow-up post as soon as I have time to write it (and I'll test your dictionary on the Kindle 2).

  8. great job but it doesn't decrypt images.i really need a tool that extract images too but babylontohtml didn't do so.

    in this page a reverse engineer has explained the process and he mentioned named resources too but i am not good at c++ and i can't convert it. if you had sometime please take a look at it:
    http://www.woodmann.com/forum/showthread.php?7028-BGL-(babylon-glossary)-to-GLS-(babylon-glossary-source)&p=44981&viewfull=1#post44981

    Replies
    1. Whoa! I didn't know Babylon's files contained images! Which BGL file did you have which contained images? Can you share it? I definitely want to check this feature out.
      Since this is, so far, the most commented post on this blog I think a followup post will come as soon as I have some time for it.

    2. here it is:
      http://www.uploadbaz.com/ktbmdv4jkbds

  9. HI the main problem with this system is that it mainly works only from ENGLISH to other languages...
    if you try italian russian for instance it doesnt lookup plural and feminine and conjiugated verbs...so as a practical results only very few works are translated...
    i would need a source dictionary with conjugated verbs...

    Replies
    1. Indeed. See in the post: "Even after hacking my Kindle 2 to display non-Latin characters, I didn't manage to use the integrated dictionary to look up words in non-Latin languages."

  10. i meant few WORDS are translated..

  11. Pleas exactly tell how many software is need?.i cant run your software on pc.

  12. Very interesting idea and process. One not so small problem. My anti-virus software identifies every Babylon download as malware. Have you found any way to get the dictionaries without the intrusive add-ons that are really hard to get rid of?

  13. I just processed the Babylon English-Hebrew.bgl. The resulting book contains a lot of those $012345$ strings. I thought about removing all of them, but a brief examination shows that some of them are needed, or are at least not displayed as $012345$ strings.

    Do you have any further thoughts on how to remove only the ones that mess up the book?

    Replies
    1. Due to lack of time for maintaining this project, I've gone off the attempts to improve it for now, and made it open-source. Please see the follow-up post with posted dictionaries as well.

  14. Thank you so much for this information!! I was desperate for getting a Russian-English dictionary, and finally made one thanks to your explanation.

    However, even though downloading the glossary for the Russian-English dictionary from Babylon, after installation there was no Russian-English among the .BGL files. Instead, I searched for the file on google and found one. (that is, babylon_russian_english.bgl). For what it's worth, I'll mail you the dictionary, should you want to add it to your site.

    Replies
    1. Cheers, Simon :)
      Your dictionary is now shared.

    2. As told before, when I downloaded the russian-english glossary and installed Babylon from that .exe file, I only had German-English and English-German .bgl-files to go with in the file indicated in your post.
      But then I opened Babylon, clicked on menu, then settings, and there the dictionaries (my two .bgl-files) were listed. Now below there's a button "Further dictionaries" or something like that ("Weitere Wörterbücher" with my German version). By clicking on it I was redirected to the Babylon free glossary pages. So again I gave it a try and downloaded the .exe-file in clicking on the babylon_russian_english glossary.
      And now not the program, but, indeed, the dictionary itself was installed and showed afterwards among the .bgl-files already present.

    3. ... it's very helpful, though not perfect; there are issues with identifying verbs. Still a far cry from going without any Russian dictionary :)

    4. Thank you again, Simon, for sharing your content, and for this helpful tip!

  15. Hi Alon,
    I just bought kindle paperwhite and was looking for english-hebrew dictionary.
    After long search I found yours. I must admit that you did a terrific job. this dictionary is very helpful. however, there is still a small problem:
    when using Babylon English-Hebrew Dictionary.sdr the words appear backwards.
    I tried using Babylon English-Hebrew Dictionary - MG Reversed Words.prc and the word appear OK. but the order of the words is wrong.
    for example, the word everywhere is: מקום בכל
    instead of
    בכל מקום

    is there something you could do about it?
    if there is anything I could help with. I'd be glad!



  16. What about inflection in Babylon dictionaries, like "make, makes, made".

    Replies
    1. Hi Paweł and thanks for your feedback.

      This depends on the content of the original source BGL file and how it's organized. Basically if you search for "made" or "makes" you should be directed to the same definition of "make" (or "to make"). As far as I know, my tool should handle that as well.
      I have plans for improving the tool and how it processes its output (also in terms of usability of the tool itself), but due to lack of time, this will have to wait a bit, I guess :(

  17. Reply doesn't work, strange...

    My Babylon dictionaries, for example Interlingua-Polonese, have inflections prepared clasically:

    parlar|parla|parlate|parlara|parlava|parlante
    mówić

    So, I must try ;-)
    Thank you, Alon.

    BTW - a few years ago a Chinese wrote his Lingoes and promised to prepare a tool for creating dictionaries and converting the Babylon dictionaries (GLS files) but it has never appeared.
    http://lingoes.net/

  18. I have just tried the Russian-English dictionary (on a sample of "Idiot") and it recognizes Russian inflections.
    So, I have to try to convert my own dictionaries.

  19. Hi, Thank you for the great tool but one thing got my attention. I tried to convert Babylon's Spanish-English dictionary; everything works OK except the part-of-speech sections, like noun, adj etc. I used the pyglossary converter and I saw they are there, but they are somehow lost during conversion. Please can you fix that part or at least show us how we can do it.
