Woohoo... my English-Romanian dictionary database has 59742 entries!
Well, after getting the database up to somewhere around 19000 words from various sources, it looked like the sources were drying up. I definitely don't have time to type up fifty thousand words and definitions, and even that isn't quite enough for it to be truly helpful as a translation tool, so I continued searching. Well, search and ye shall find. I stumbled upon Babylon on the web and, lo and behold, there are several Romanian glossaries out there, and they're free too. Now the challenge was how to get the entries and definitions out of the glossary, since it is in a binary format and there is no way I have the time or resources to even try to decode it or figure out how to extract the data. Well, Babylon does come with something like a one-week free trial, so I downloaded and installed it, then installed the glossary I wanted to access and poked around. After monkeying around with different ideas on how to scrape the data out, I found the quick and dirty solution:
Say Hello to good ole Winbatch!
Yes, that rather old, clunky-looking scripting language with the big owl icon. The one that "real programmers" make fun of. Well, I could either spend a week trying to figure out all the API calls and the code needed for a screen scraping program in VB, VB.NET, etc... or I could spend a few minutes and have a script that is useful. Well, it took me a little more than a few minutes, since I had forgotten most of what I knew about Winbatch in spite of having used it at my previous job. Anyway, within half an hour I had a working program that could grab data from Babylon's screen and simulate a mouse click on the "next word" button. With a little more refinement I made it parse out the results (awesome one-liner parsing command) and create a delimited file with entries and definitions. The whole program is exactly 19 lines of code.
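For anyone curious about the parse-and-write half of that loop, here is a rough sketch in Python rather than Winbatch. It is not my actual script. It assumes each screen grab arrives as plain text with the headword on the first non-empty line and the definition on the lines after it, which is only a guess at a convenient layout, and the names `parse_capture` and `harvest` are made up for illustration. The screen grabbing and the "next word" click are left out entirely.

```python
import csv
import io


def parse_capture(text):
    """Split one captured screen into (entry, definition).

    Assumption: the first non-empty line is the headword and any
    remaining non-empty lines make up the definition.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None  # nothing usable was captured
    return lines[0], "; ".join(lines[1:])


def harvest(captures, out):
    """Write one tab-delimited row per usable capture; return the row count."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    for text in captures:
        parsed = parse_capture(text)
        if parsed is None:
            continue  # skip empty grabs instead of stopping the run
        writer.writerow(parsed)
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical captures standing in for what a screen grab might return.
    sample = ["apple\n  mar\n", "", "book\ncarte\nvolum"]
    buf = io.StringIO()
    print(harvest(sample, buf), "rows written")
    print(buf.getvalue(), end="")
```

Skipping empty grabs is the kind of small check that might have saved a restart or two; a real run would also need to notice when the "next word" button stops producing new entries.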
I ran the program yesterday and overnight. I had to restart it several times since I didn't bother to spend much time adding error handling. I woke up in the morning to find that it had harvested 52355 words from the glossary, all nicely formatted in a delimited file. SWEET!
Now I know this may not be considered "ethical" by some, but at this point I'm not looking to market the program commercially, and creating a decent dictionary was a very, very good learning experience in both VB.NET and Winbatch. Never throw out old tools. You never know when you will need them.
