The Arabic Internet Corpus is one of several large collections of text (corpora) built from publicly available data on the web and stored in an electronic database, collected for Translation Studies research at the University of Leeds in the UK.
The Arabic Internet Corpus contains about 176 million words, initially stored as raw text with no further processing. Majdi Sawalha, a PhD student at Leeds, wanted to annotate each word in the corpus with its lemma, the dictionary form of the word (headword), and its root, the three- or four-letter origin of the word.
Arabic differs from English and other European languages in that hundreds of Arabic words can be derived from the same root, and a lemma can appear in text in many different forms because of attached prefixes and suffixes. These are morphemes attached to the beginning or end of a word to form a new word; English examples include -ness, pre-, -s, and -ed. Adding the lemma and extracting the root is therefore necessary for search applications, so that inflected forms of a word can be grouped together.
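The idea of grouping inflected forms under a lemma can be sketched in a few lines of Python. This toy example uses hypothetical English word forms, since the article's examples are English; the real SALMA Tagger performs this analysis for Arabic morphology.

```python
from collections import defaultdict

# Hypothetical (word form -> lemma) pairs, as a lemmatizer might emit.
analyses = [
    ("walked", "walk"), ("walks", "walk"), ("walking", "walk"),
    ("darkness", "dark"), ("darker", "dark"),
]

# Build an index from each lemma to all of its surface forms.
index = defaultdict(list)
for form, lemma in analyses:
    index[lemma].append(form)

# A search for the lemma "walk" now retrieves every inflected form.
print(index["walk"])  # ['walked', 'walks', 'walking']
```

A search engine built on such an index can match a query against any inflected form of the same word, which is exactly what the lemma annotation enables.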
Sawalha used the SALMA Tagger (Sawalha Atwell Leeds Morphological Analyses Tagger) to annotate the words of the Arabic Internet Corpus at two levels: the lemma and the root. The SALMA Tagger is relatively slow; in initial tests it processed seven words per second. The slowness comes from the way Arabic words are written: the tagger must spell-check each word's letters, handle short vowels and diacritics (marks added above or below letters to indicate correct pronunciation), and consult the large dictionaries supplied to the analyser.
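The article's figures make the runtime estimate easy to verify: 176 million words at roughly seven words per second works out to just under a year on one processor.

```python
# Sanity check on the figures in the article: 176 million words at
# about 7 words per second on a single processor.
words = 176_000_000
words_per_second = 7

seconds = words / words_per_second
days = seconds / (60 * 60 * 24)
print(round(days))  # about 291 days, consistent with the ~300-day estimate
```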
Lemmatizing the full Arabic Internet Corpus was estimated to take 300 days on an ordinary uni-processor machine. To reduce the processing time, Sawalha used the UK's National Grid Service (NGS), gaining a massive reduction in execution time. He divided the Arabic Internet Corpus into half-million-word files and wrote a program that generates scripts to run the lemmatizer on each file in parallel. The output files were combined into one lemmatized Arabic Internet Corpus comprising 176 million word-tokens, 2,412,983 word-types, 322,464 lemma-types and 87,068 root-types.
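The divide-and-conquer step can be sketched as follows. This is a minimal illustration, not Sawalha's program: the file names, the line-based chunking strategy, and the `salma_tagger` command are all hypothetical, and the NGS used its own batch scheduler to dispatch the generated scripts.

```python
def split_corpus(corpus_path, out_prefix, chunk_words=500_000):
    """Write successive chunk files of roughly chunk_words words each."""
    chunk_paths, buffer, count = [], [], 0

    def flush():
        path = f"{out_prefix}_{len(chunk_paths):04d}.txt"
        with open(path, "w", encoding="utf-8") as chunk:
            chunk.writelines(buffer)
        chunk_paths.append(path)

    with open(corpus_path, encoding="utf-8") as corpus:
        for line in corpus:
            buffer.append(line)
            count += len(line.split())
            if count >= chunk_words:
                flush()
                buffer, count = [], 0
    if buffer:  # write any remaining partial chunk
        flush()
    return chunk_paths

def write_job_scripts(chunk_paths):
    """Generate one shell script per chunk file, to be run in parallel."""
    for path in chunk_paths:
        with open(path.replace(".txt", ".sh"), "w") as job:
            job.write("#!/bin/sh\n")
            # 'salma_tagger' is a placeholder for the real analyser invocation.
            job.write(f"salma_tagger {path} > {path}.lemmatized\n")
```

Each generated script is an independent job, so the grid can run as many of them concurrently as CPUs allow.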
To evaluate the accuracy of the lemmatizer, 10 random samples of 100 words each were selected from the lemmatized corpus. For each sample, Sawalha computed the accuracy of the root and lemma analyses and found that accuracy was consistent across samples: on average, root accuracy was about 81.20% and lemma accuracy was 80.80%.
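The evaluation amounts to averaging per-sample accuracies. The per-sample counts below are hypothetical, chosen only so the averages match the figures the article reports; the real counts were not published.

```python
# Each tuple is (correct roots, correct lemmas) out of 100 words
# in one sample. These counts are illustrative, not Sawalha's data.
samples = [
    (82, 81), (80, 80), (81, 82), (83, 80), (79, 81),
    (82, 80), (81, 81), (80, 79), (82, 82), (82, 82),
]

# With 100-word samples, each correct count is directly a percentage.
root_avg = sum(r for r, _ in samples) / len(samples)
lemma_avg = sum(l for _, l in samples) / len(samples)
print(f"root accuracy {root_avg:.2f}%, lemma accuracy {lemma_avg:.2f}%")
```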
Sawalha said, “Roughly, an estimated execution time for lemmatizing the full Arabic Internet Corpus was 300 days using an ordinary uni-processor machine. By using the computational power of the NGS a massive reduction in execution time was gained – instead it only took five days.” This could have been reduced further had enough CPUs been available to process all files strictly in parallel; the NGS provides virtual parallel processing on a reduced set of CPUs.
“[This work] made the processed Arabic Internet Corpus available to other translation studies and Arabic and Middle Eastern study researchers at the University of Leeds and other world-wide institutions,” Sawalha said.
Sawalha's supervisor, Eric Atwell, said, “I hope we convinced, at least some, that Arabic is interesting and challenging! I must think about how to make more use of HPC resources in future Arabic-computing research proposals.”
This is an edited version of an article that was first published on the NGS website.