Babak Mahmoudi’s Blog

Localization for Persian Language…

Archive for the ‘Persian Language’ Category

.Net Profiling for Persian Localization, Cons and Pros

There’s no doubt that, to localize .Net applications for Persian properly, the localizer will sooner or later have to consider tampering with .Net assemblies. The reason is the poor support for the Persian calendar in .Net and in applications built on it. Persian calendar support is a must in most localization projects, and in practice there is no way to provide it other than tampering with assemblies.

For instance, users in Iran cannot live without the Persian calendar on their SharePoint sites, and you have to modify code in SharePoint’s most important assembly (SharePoint.dll) to support it. Part of the reason is that Windows in general does not allow third-party calendar systems to be added to the operating system.

The usual approach to providing the Persian calendar in SharePoint is to substitute it for one of SharePoint’s standard calendars, such as Hijri. To do this, one has to somehow replace the methods of an internal class, namely HijriCalendarImplementation. For instance, this class has a static method, JulianDayToDate, that converts a Julian day to a SharePoint SimpleDate structure. Obviously this must change if the Persian calendar is to take the place of Hijri. Experts here in Iran have used readily available tools such as Reflector to disassemble the IL code of SharePoint.dll, replace the relevant code, and rebuild the assembly. They then replace the original assemblies with these modified versions, and in this way they have accomplished the mission.
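To give a feel for what a replacement for JulianDayToDate has to compute, here is a minimal sketch in Python (not SharePoint or .Net code). It uses a widely circulated arithmetic approximation of the Persian calendar based on a 2820-year cycle; I am not claiming that SharePoint or any particular product uses this rule. The function names, and the plain (year, month, day) tuple standing in for SimpleDate, are assumptions made for illustration.

```python
# Arithmetic (2820-year cycle) approximation of the Persian calendar.
# Julian day numbers (JDN) are whole numbers, one per civil day.
# All names here are illustrative; none come from SharePoint or .Net.

PERSIAN_EPOCH_JDN = 1948321  # JDN of 1 Farvardin of year 1


def persian_to_jdn(year, month, day):
    """Return the Julian day number of a Persian date."""
    epbase = year - (474 if year >= 0 else 473)
    epyear = 474 + epbase % 2820
    # Days before this month: the first six months have 31 days, later ones 30.
    month_days = (month - 1) * 31 if month <= 7 else (month - 1) * 30 + 6
    return (day + month_days
            + (epyear * 682 - 110) // 2816      # leap days so far in the cycle
            + (epyear - 1) * 365
            + (epbase // 2820) * 1029983        # whole 2820-year cycles
            + PERSIAN_EPOCH_JDN - 1)


def jdn_to_persian(jdn):
    """Return (year, month, day) in the Persian calendar for a JDN."""
    depoch = jdn - persian_to_jdn(475, 1, 1)
    cycle, cyear = divmod(depoch, 1029983)
    if cyear == 1029982:
        ycycle = 2820
    else:
        aux1, aux2 = divmod(cyear, 366)
        ycycle = (2134 * aux1 + 2816 * aux2 + 2815) // 1028522 + aux1 + 1
    year = ycycle + 2820 * cycle + 474
    if year <= 0:
        year -= 1  # this scheme has no year 0
    yday = jdn - persian_to_jdn(year, 1, 1) + 1
    month = (yday + 30) // 31 if yday <= 186 else (yday + 23) // 30
    day = jdn - persian_to_jdn(year, month, 1) + 1
    return year, month, day


# JDN 2459295 is 21 March 2021, which is 1 Farvardin 1400.
print(jdn_to_persian(2459295))  # (1400, 1, 1)
```

A real substitute would also need the reverse conversion and the month and leap-year queries that the Hijri implementation exposes; the sketch covers only the day-number arithmetic.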

When I first got this mission at Gostareh Negar, I didn’t actually know much about .Net programming. I was a C++ programmer, already experienced in native code tracing and DLL overriding. From my experience with native code, I knew that sooner or later rebuilding binary DLLs would show its disadvantages. So I put my effort into coming up with a solution that does not require replacing the original libraries on persistent storage (hard disk). This led me to the .Net Profiling API.

The .Net Profiling API was originally devised for profiling tasks, i.e. performance measurement. With it, one can instrument assemblies with calls that measure code metrics such as speed. For instance, a profiler can insert calls at method entry and exit points so that the total execution time of a method can be recorded. In effect, the Profiling API provides a way to inject code at run time, while the CLR executes an application.
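The entry/exit idea can be illustrated with a rough analogy in Python. This is only a sketch of the concept: the real Profiling API is a native interface that works on IL, and none of the names below (timed, stats) belong to it.

```python
import time
from functools import wraps


def timed(stats):
    """Wrap a function so every call adds to its entry in `stats`.

    `stats` maps a function name to (call_count, total_seconds).
    """
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()          # "method entry" hook
            try:
                return func(*args, **kwargs)
            finally:                             # "method exit" hook
                elapsed = time.perf_counter() - start
                calls, total = stats.get(func.__name__, (0, 0.0))
                stats[func.__name__] = (calls + 1, total + elapsed)
        return wrapper
    return decorate
```

Decorating a function with @timed(stats) leaves its own code untouched and only surrounds it with measurement calls, which is the kind of instrumentation a traditional profiler injects.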

The CLR (Common Language Runtime) includes a CPU-independent instruction format (Intermediate Language, or IL) and a JIT compiler that turns IL into code executable by the target CPU. As it executes an assembly, it Just-In-Time compiles the IL of each method into native machine code the first time the method is called. During this process the CLR can be asked to call a registered profiler and let it perform profiling tasks and instrumentation, including replacing the IL code. This opens a way to change IL code at run time, and an elegant way to carry out our localization mission.

While traditional profilers focus on instrumenting methods with measurement and logging calls, I focused on redirecting methods. Eventually I came up with a Redirector, which redirects method calls to another assembly by replacing a method’s body with a call to the injected method. Now I was able to inject my Persian calendar code into SharePoint.dll as it is loaded, without touching the original assembly on disk.
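The redirection idea, including the ability to switch it off again, can be sketched in Python as follows. This is an analogy only, not the Redirector itself: the real one rewrites IL at JIT time, and the class and function names below are hypothetical stand-ins.

```python
class Redirector:
    """Swap a class attribute for a replacement at run time.

    The source of the original class is never modified, and the
    original attribute is kept so the redirection can be undone.
    """

    def __init__(self):
        self._originals = {}

    def redirect(self, owner, name, replacement):
        key = (owner, name)
        if key not in self._originals:
            # Save the raw attribute (e.g. the staticmethod object).
            self._originals[key] = owner.__dict__[name]
        setattr(owner, name, staticmethod(replacement))

    def restore_all(self):
        for (owner, name), original in self._originals.items():
            setattr(owner, name, original)
        self._originals.clear()


class HijriCalendarStandIn:
    """Hypothetical stand-in for the class being redirected."""

    @staticmethod
    def julian_day_to_date(jdn):
        return ("hijri", jdn)


def persian_julian_day_to_date(jdn):
    """Hypothetical injected replacement."""
    return ("persian", jdn)


redirector = Redirector()
redirector.redirect(HijriCalendarStandIn, "julian_day_to_date",
                    persian_julian_day_to_date)
print(HijriCalendarStandIn.julian_day_to_date(5))  # ('persian', 5)
redirector.restore_all()
print(HijriCalendarStandIn.julian_day_to_date(5))  # ('hijri', 5)
```

Callers keep calling the original name; only the body they reach changes, and restoring it is a single call.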

This method of code redirection based on profiling has many advantages including:

  • It’s switchable: Many users fear that messing with binary code may have side effects and cause errors. Since profilers can easily be switched off in the server configuration, in case of suspicious behavior one can simply switch the redirection off and check whether the problem lies in the injected code.
  • Does not require a rebuild for new versions: When the original vendor releases a new version of the assembly, there is a good chance that the changes do not touch the redirected code. In the case of SharePoint, for instance, the Hijri calendar code didn’t change across the service packs or in SharePoint 2010. Therefore the redirector may keep working with a newer version of the assembly, while those who rebuilt the assembly have to rebuild it again. In fact, Gostareh Negar clients installed SharePoint service packs without asking for an update.
  • Does not interfere with code signing: Since the original assemblies are normally signed, rebuilding them requires re-signing, which is normally a headache. Redirection occurs in the JIT compilation phase and does not run into signing issues.

There are also disadvantages:

  • Speed: .Net code runs more slowly while being profiled, because the CLR has to send profiling notifications in addition to its normal tasks. This slowdown is mostly in the load phase; once the program is completely JIT compiled, the effect vanishes. For web applications this happens whenever the w3wp process restarts.

Conclusion

Method redirection based on .Net Profiling can be a reasonably good plan for Persian localization, at least for web applications.

Written by Babak Mahmoudi

July 27, 2011 at 7:06 pm

Posted in Persian Language

Morphological Rules Pertaining to Persian Spell Checking

Preface
“Analyzing Persian text the way stemming algorithms do is essential for efficient spell checking, because it provides the level of consistency needed and it can work with a concise lexicon.
In this article the morphological rules pertaining to such algorithms are studied.”
Introduction
In Persian, words are extensively combined with various prefixes and suffixes to make new words. In this sense, if we define words digitally as strings of characters surrounded by spaces, the number of Persian words is enormously larger than in Latin-script languages such as English. For example, the word كتاب (ketab = book) generates the following derivatives:

ketab_ha books
ketab_am my book, I am a book 
ketab_at your book 
ketab_ash his/her book
ketab_eman our book 
ketab_eshan their book 
ketab_i a book, you are a book, related to books
ketab_im we are books 
ketab_id you are books 
ketab_and they are books 
ketab_itar more related to books 
ketab_itarin most related to books 
ketab_hayam my books 
ketab_hayat your books 
ketab_hayash his/her books
ketab_hayeman our books
ketab_hayetan your books
ketab_hayeshan their books
ketab_haei some books
   
* The suffixes are presented just as they are spelled in Persian.

As this example shows, 19 different words can be made from the simple root “ketab”.

The term Morphological Rules then refers to the rules in Persian that specify how new words can be made. It should be noted that by making words we do not mean coining totally new words, as is usually meant in Persian literature. Nobody actually speaks of ‘ketab_ha’ as a new word made from ‘ketab’; it counts as one here only because of our digital definition of a word: “a string of letters separated by space”.
Thus, here we confine ourselves to those simple and certain rules that are thought to be useful in the process of digital proofing.

The term certain is important because we are not going to consider patterns that are rarely used. We consider only rules that can be applied in almost all cases. Nevertheless, the rules apply to words according to their grammatical nature. For example, you cannot pluralize a pronoun, and only verbs can be conjugated.

Thus, it should be assumed that the Morphological Rules studied here are supported by a Lexicon in which Morphemes are stored with flags that designate their grammatical nature as it pertains to the stated rules. The terms Flag and Morpheme in this article refer to such a Lexicon…
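A minimal sketch of how such a flagged Lexicon could drive a spell checker is given below, in Python. The flag names, the tiny lexicon, and the handful of suffix rules are assumptions chosen for illustration, written in the same transliteration as the ketab example; they are not the rule set of the full document.

```python
# Hypothetical flags describing the grammatical nature of a morpheme.
NOUN = "noun"
PRONOUN = "pronoun"

# A concise lexicon: only stems are stored, each with its flags.
LEXICON = {
    "ketab": {NOUN},    # book
    "man": {PRONOUN},   # I
}

# Each suffix rule lists the flags a stem must carry for the rule to apply.
SUFFIX_RULES = {
    "ha": {NOUN},       # plural
    "am": {NOUN},       # my ...
    "ash": {NOUN},      # his/her ...
    "hayam": {NOUN},    # my ...s
}


def is_valid(word):
    """Accept a word if it is a stem, or a stem plus a permitted suffix."""
    if word in LEXICON:
        return True
    for suffix, allowed_flags in SUFFIX_RULES.items():
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            # The rule applies only if the stem has one of the allowed flags.
            if LEXICON.get(stem, set()) & allowed_flags:
                return True
    return False


print(is_valid("ketabha"))     # True: noun + plural suffix
print(is_valid("manha"))       # False: a pronoun cannot be pluralized
print(is_valid("ketabhayam"))  # True: noun + "hayam"
```

The point of the design is that the lexicon stays small: one entry for ketab covers all of its derivatives, and the flags keep the rules from accepting combinations such as a pluralized pronoun.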

Find the remainder at the following link.

Download Complete Document

Written by Babak Mahmoudi

December 8, 2008 at 2:38 pm

Posted in Persian Language