My previous article, Reverse Engineering using Chrome, was all about debugging a website and building an algorithm that stitches its image tiles together to get the image in a higher resolution. In that article, I mentioned that some time ago I published a simple gem called txt2speech that makes requests to the Google Translate API and gets the audio output back. However, Google has since introduced an algorithm to protect its calls from unauthorized clients, and the logic became considerably more complicated. Let’s see why.
Today, I want to write about debugging Google Translate. We will find out what has changed and how they make their requests now, and at the same time I will show you how things are done in the Microsoft Translator API, or “Bing Translator” if you will.
We have a very simple goal here: we want to write a program that sends a request with the text we want to turn into an audio file.
The story behind that: I used to use a feature in Mac OS X that let me save a piece of text as an audio file and later share it with my smartphone. When I lived in Kyiv, I mostly got around by public transport. Because the city is quite large, it took forever to get from one point to another. On my way, I loved listening to articles the same way I listen to podcasts… But one day my MacBook unfortunately broke. I didn’t have enough time to look for the best alternative, and honestly I didn’t have enough money for it either. I had to replace it quickly, so I ended up with a very cheap laptop with Ubuntu preinstalled. Thus, I missed this feature.
First of all, let’s start by analyzing an old Google Translate API request. What did it look like back then?
We had, and still have, the ie, q, tl, total, idx, and textlen parameters… Thinking logically, we can guess what is what just from the names and values. idx is probably an index and textlen is the text length, right? And since the q key holds the “Hello” string, it must be the text we want to convert… Well, that is certainly true, but thoughtful Google requires us to fill in all the other parameters accordingly.
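To make the parameter list concrete, here is a minimal Python sketch that assembles such a query string. Only the parameter names and the “Hello” text come from the request described above; the values chosen for ie, tl, total, and idx, and the helper name tts_query, are my assumptions for illustration.

```python
from urllib.parse import urlencode


def tts_query(text, lang="en"):
    """Build the query string for the old-style request (hypothetical helper)."""
    params = {
        "ie": "UTF-8",         # assumed: input encoding
        "q": text,             # the text to convert to speech
        "tl": lang,            # assumed: target language code
        "total": 1,            # assumed: number of chunks the text is split into
        "idx": 0,              # index of this chunk
        "textlen": len(text),  # length of the text
    }
    return urlencode(params)


print(tts_query("Hello"))
```

The endpoint itself is deliberately left out, since the only thing this sketch relies on is the set of parameter names.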
Unfortunately, once Google found out that many unauthorized clients were eager to use their undocumented API, they became determined to protect the requests.
Besides the user agent, which is usually worthless, they now ask you to provide a tk value, which gets generated every time you click the Listen button.
If you click the button over and over like crazy, you will see that the tk value stays the same until you change the text. But the algorithm also uses a unique token that you receive once the page is loaded. The tk value is then computed from both when you click the Listen button.
What troubles us the most, though, are the algorithms that try to confuse us. The first one prepares an array with each character converted to its ordinal value. So for the input “Hi”, we get the data set [72, 105].
Then, have you heard of the Caesar cipher? Now imagine that algorithm, but slightly more complicated.
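That first step is easy to reproduce. A minimal Python sketch, assuming plain text where each character maps to a single code (the helper name to_ordinals is mine):

```python
def to_ordinals(text):
    """Convert each character of the text to its ordinal value."""
    return [ord(ch) for ch in text]


print(to_ordinals("Hi"))  # [72, 105]
```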
Yes, we first get the data set with integer values as I mentioned before, and then we iterate over it, applying the next algorithm, which in the minified source is called bf (the name may have changed by now). Take the first letter of, let’s say, the [“H”, “e”, “l”, “l”, “o”] collection, which is 72, and add the first part of the TKK variable (410337.908842009), that is 410337. We run bf(410337 + 72, ‘+-a^+6’) and get 427207861, perform the other calculations, and then repeat the process, adding the ordinal value of the next character to the running result. Finally, we join the values and return tk=933078.556599.
Fine, huh? Well, their strategy is just to mislead you, me, and any other engineer. You can dive deeper into these calculations yourself, but I found it hardly worth the effort…
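The single bf call quoted above can be reproduced in Python. Treat this as a reconstruction, not Google’s code: the article only gives the name bf, the operation string, and the result. Reading the string as triples of (operator, shift direction, shift amount) with 32-bit integer arithmetic is my assumption, and it happens to produce the same number. The rest of the tk computation is not covered here.

```python
def to_int32(x):
    """Wrap an integer to a signed 32-bit value, as JavaScript bitwise ops do."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def bf(a, ops):
    """Apply the ops string to a, three characters at a time (assumed semantics)."""
    for i in range(0, len(ops) - 2, 3):
        op, direction, amount = ops[i], ops[i + 1], ops[i + 2]
        # 'a' means 10, 'b' means 11, and so on; digits are used as they are
        shift = ord(amount) - 87 if amount >= "a" else int(amount)
        if direction == "+":
            d = (a & 0xFFFFFFFF) >> shift  # unsigned right shift
        else:
            d = to_int32(a << shift)       # left shift, wrapped to 32 bits
        a = to_int32(a + d) if op == "+" else to_int32(a ^ d)
    return a


TKK = "410337.908842009"          # value quoted in the article
seed = int(TKK.split(".")[0])     # 410337
print(bf(seed + ord("H"), "+-a^+6"))  # 427207861
```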
Wait, didn’t I say we were going to write a program? Yes, we will. Luckily, Google Translate is not the only service that converts text to speech. There is also a competitor called Bing Translator that does exactly the same but also lets you select the particular voice you want to use. High five 👋.
Our next step is to debug it and figure out what the difference is and how they protect their API requests. Thanks to Microsoft, it won’t take us too long.
Anyway, back to work. How do they protect their calls? Very simply, actually, in comparison with Google. They send the token to you once the landing page is loaded.
All you have to do is send these cookies back when you make the request.
So what does that mean? We need to perform two requests here. The first one opens the landing page and receives the token, and the second one sends it back when we request the audio output.
In fact, you will see many other values in the cookies you get. In the example I wrote, the only ones that matter are mtstkn and MUID. By deleting them one by one in Chrome, I found that the others are unnecessary. Of course, it might still be a good idea to keep them so that Microsoft cannot tell whether it’s a real browser or our automated program.
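The two-request flow can be sketched with Python’s standard library. This is only an illustration under assumptions: the cookie names mtstkn and MUID come from the experiment above, while the function names are mine and both URLs are left as arguments, because the exact endpoints are not given here.

```python
import urllib.request
from http.cookiejar import CookieJar

REQUIRED_COOKIES = ("mtstkn", "MUID")  # the two cookies found to matter


def cookie_header(cookies, keep=REQUIRED_COOKIES):
    """Build a Cookie header value from a name-to-value dict, keeping only `keep`."""
    return "; ".join(f"{name}={cookies[name]}" for name in keep if name in cookies)


def fetch_audio(landing_url, audio_url):
    """Open the landing page to collect cookies, then request the audio with them."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.open(landing_url).close()  # request 1: the server sets its cookies
    cookies = {cookie.name: cookie.value for cookie in jar}
    request = urllib.request.Request(
        audio_url, headers={"Cookie": cookie_header(cookies)}
    )
    with urllib.request.urlopen(request) as response:  # request 2: send them back
        return response.read()
```

To look more like a real browser, the second request could be made through the same opener instead, which sends every cookie in the jar rather than just the two required ones.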
Indeed! After all of these steps, we finally get the desired audio file. You can check the final source code here.
If you enjoyed this article and want me to stay motivated, I would highly appreciate it if you shared it and supported it with comments down below.
And as always, thanks for reading, cheers! 🍸
UPDATE: Google Translate lets you use their API without a token. All you have to do is remove tk and set the client parameter to “tw-ob”.
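In code, that change is a small transformation of the request parameters. A minimal Python sketch, where the helper name without_token and the original client value "t" are placeholders of mine:

```python
def without_token(params):
    """Return a copy of the parameters with tk removed and client set to tw-ob."""
    updated = {key: value for key, value in params.items() if key != "tk"}
    updated["client"] = "tw-ob"
    return updated


old_params = {"q": "Hello", "tl": "en", "client": "t", "tk": "933078.556599"}
print(without_token(old_params))
```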