The study of word order has a long history. It was Greenberg who initiated serious investigation of word order.
It cannot be denied that Tamil has also borrowed words from Sanskrit. As in English, adjectives in Tamil do not change form.
In the first person plural, Tamil makes a distinction between inclusive pronouns that include the listener and exclusive pronouns that do not. Tamil does not distinguish between adjectives and adverbs - both fall under the category uriccol.
Verb auxiliaries are used to indicate attitude, a grammatical category which shows the state of mind of the speaker, and his attitude about the event spoken of in the verb. Common attitudes include pejorative opinion, antipathy, relief felt at the conclusion of an unpleasant event or period, and unhappiness at or apprehension about the eventual result of a past or continuing event.
Sentence structure
Except in poetry, the subject precedes the object, and the verb concludes the sentence. In a standard sentence, therefore, the order is usually subject-object-verb (SOV), but object-subject-verb is also common.
Tamil is a null-subject language. Not all Tamil sentences have subjects, verbs and objects. The elements that are present, however, must follow the SOV order. Tamil does not have an equivalent for the word is; the word is included in the translations only to convey the meaning. The verb to have in the meaning "to possess" is not translated directly, either. To say "I have a horse" in Tamil, a construction equivalent to "There is a horse to me" or "There exists a horse to me", is used.
Tamil lacks relative pronouns, but their meaning is conveyed by relative participle constructions, built using agglutination. For example, the English sentence "Call the boy who learned the lesson" is said in Tamil like "That-lesson-learned-boy call".
With the success of newer stochastic techniques in speech recognition, the IBM team at Yorktown Heights began to look again at their application to machine translation. The distinctive feature of Candide is that statistical methods are used as virtually the sole means of analysis and generation; no linguistic rules are applied.
The IBM research is based on the vast corpus of French and English texts contained in the reports of Canadian parliamentary debates. The essence of the method is first to align phrases, word groups and individual words of the parallel texts, and then to calculate the probabilities that any one word in a sentence of one language corresponds to a word or words in the translated sentence with which it is aligned in the other language. Most researchers, particularly those involved in rule-based approaches, were surprised that the results were so acceptable. The researchers have naturally sought to improve these results: the IBM group proposes to introduce more sophisticated statistical methods, but also intends to make use of some minimal linguistic information.
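The word-correspondence probabilities described above can be estimated iteratively with expectation-maximisation. The following is a minimal sketch in the style of IBM Model 1, not the Candide implementation itself; the three-sentence toy corpus and the function name `train_model1` are assumptions introduced here purely for illustration.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """Estimate t[e][f]: the probability that source word e is translated as f."""
    src_vocab = {e for src, _ in pairs for e in src}
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    # Start from a uniform distribution over target words.
    t = {e: {f: 1.0 / len(tgt_vocab) for f in tgt_vocab} for e in src_vocab}
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (e, f) links
        total = defaultdict(float)  # expected number of links leaving e
        for src, tgt in pairs:
            for f in tgt:
                # Share one unit of probability for f among the source words.
                norm = sum(t[e][f] for e in src)
                for e in src:
                    frac = t[e][f] / norm
                    count[(e, f)] += frac
                    total[e] += frac
        # Re-estimate the table from the expected counts.
        for e in src_vocab:
            for f in tgt_vocab:
                t[e][f] = count[(e, f)] / total[e]
    return t

# Toy sentence-aligned corpus (hypothetical data, English -> French).
corpus = [
    ("the house".split(), "la maison".split()),
    ("the book".split(), "le livre".split()),
    ("a book".split(), "un livre".split()),
]
t = train_model1(corpus)
best = max(t["book"], key=t["book"].get)
print(best)
```

On this toy corpus the most probable translation of "book" comes out as "livre", because the two words co-occur in two sentence pairs while the competing words each appear with "book" only once. Real systems train on millions of sentence pairs and add models of word order and fertility on top of this lexical table.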
The second major corpus-based approach, which likewise benefits from improved rapid access to large databanks of text corpora, is known as the example-based or memory-based approach. Although it was first proposed by Makoto Nagao, experiments began only later, initially in some Japanese groups and during the DLT project.
The underlying hypothesis is that translation often involves finding or recalling analogous examples. For calculating matches, some MT groups use semantic methods.
Other groups use statistical information about lexical frequencies in the target language. The main advantage of the approach is that, since the texts have been extracted from databanks of actual translations produced by professional translators, there is some assurance that the results will be accurate and idiomatic. Although the main recent innovation has been the growth of corpus-based approaches, rule-based research continues in both transfer and interlingua systems.
For example, a number of researchers involved in Eurotra have continued to work on the theoretical approach developed there. One consequence of developments in example-based methods has been that much greater attention is now paid to generating good-quality texts in target languages than in previous periods of machine translation activity, when it was commonly assumed that the most difficult problems concerned analysis, disambiguation and the identification of the antecedents of pronouns.
In part, the impetus for this research has come from the need to provide natural language output from databases.
Some machine translation teams have researched multilingual generation. The use of machine translation has accelerated. The increase has been most marked in commercial agencies, government services and multinational companies, where translations are produced on a large scale, primarily of technical documentation.
This is the major market for the mainframe systems, all of which have installations where translations are produced in large volumes. Indeed, it has been estimated that such services translate millions of words a year. Literary works can also be fed to an MT system and translated.
Such MT systems can break language barriers by making rich sources of literature available to people across the world.
MT also helps overcome technological barriers. Where only a small section of society can understand content presented in digital form, the result is a digital divide, and MT can help to bridge it.
Machine translation faces a number of issues, some of which are as follows. Languages can be classified by the typical order of subject (S), verb (V) and object (O) in a sentence.
Some languages have SOV word order, and the target language may have a different word order from the source. In such cases, word-for-word translation is difficult. Selecting the right word for the context is also important, and unresolved references can lead to incorrect translation. This was the type of MT envisaged by the pioneers. It came in with the need to translate military technological documents.
The translation output can be considered only a rough draft to be brushed up, so that the professional translator is freed from that boring and time-consuming task. This type of machine translation system is usually incorporated into translation workstations and PC-based translation tools. Mainly three approaches to machine translation are used.
These are discussed below. Rule-based approaches require linguistic knowledge in order to write the rules, which play a vital role at the different levels of translation. The benefit of the rule-based machine translation method is that it can examine a sentence closely at its syntactic and semantic levels. Its complications are that it presupposes vast linguistic knowledge and needs a very large number of rules to cover all the features of a language.
The three approaches that require linguistic knowledge are as follows:
1. Direct MT
2. Interlingua MT
3. Transfer MT
Direct MT is the most basic form of MT. It translates the individual words in a sentence from one language to another using a two-way dictionary, and it makes use of very simple grammar rules. These systems are based on the principle that an MT system should do as little work as possible.
Direct MT systems take a monolithic approach towards development. A direct MT system has the following characteristics. It starts with morphological analysis, which removes morphological inflections from the source-language words to obtain their root forms.
A bilingual dictionary is then looked up to get the target-language words corresponding to the source-language words. The last step in a direct MT system is syntactic rearrangement, in which the word order is changed to that which best matches the word order of the target language.
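The three steps of direct MT (morphological analysis, dictionary lookup, syntactic rearrangement) can be sketched in a few lines. This is only a toy illustration under stated assumptions: the lexicon, the suffix list and the `tgt_` tokens are hypothetical stand-ins for real target-language words, and the single reordering rule (move the verb to the end, SVO to SOV) is far cruder than anything a working system would use.

```python
# Toy resources; every entry is a hypothetical stand-in, not real lexicon data.
SUFFIXES = ("ing", "ed", "s")          # crude list of inflectional endings
IRREGULAR = {"bought": "buy"}          # irregular forms mapped to their roots
LEXICON = {"he": "tgt_he", "buy": "tgt_buy", "book": "tgt_book", "a": "tgt_a"}
VERBS = {"buy"}                        # roots treated as verbs when reordering

def root(word):
    """Step 1: morphological analysis, reduce a word to its root form."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[: -len(suffix)] in LEXICON:
            return word[: -len(suffix)]
    return word

def translate_direct(sentence):
    roots = [root(w) for w in sentence.split()]
    # Step 2: bilingual dictionary lookup; unknown words pass through unchanged.
    looked_up = [(r, LEXICON.get(r, r)) for r in roots]
    # Step 3: syntactic rearrangement, move verbs to the end (SVO -> SOV).
    non_verbs = [tgt for r, tgt in looked_up if r not in VERBS]
    verbs = [tgt for r, tgt in looked_up if r in VERBS]
    return " ".join(non_verbs + verbs)

print(translate_direct("He bought a book"))
```

The sketch discards tense and agreement along with the inflection, which illustrates why direct MT output tends to need heavy post-editing.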
Figure 2. Direct Machine Translation
Direct machine translation works well with languages which have the same default sentence structure. It does not consider structure and relationships between words. Interlingua machine translation converts words into a universal language that is created so that the MT system can translate it into more than one language.
Whenever a sentence matches one of the rules, or examples, it is translated directly using a dictionary. Translation goes from the source language through morphological and syntactic analysis to produce a sort of interlingua on the base forms of the source language; from this it translates to the base forms of the target language, and from there a better translation is made as the final step.
The steps which are performed are shown in Figure 2. The analysis phase produces the source-language structure. The transfer phase transfers the source-language representation to a target-level representation. The generation phase generates target-language text from the target-level structure. The only resource required by data-driven approaches, by contrast, is data: either dictionaries for the dictionary-based approach, or bilingual and monolingual corpora for the empirical or corpus-based approaches.
In the dictionary-based approach, translation is done at the word level. This kind of approach can be used to translate the phrases in a sentence and is found to be least useful in translating a full sentence.
This approach is very useful in accelerating human translation, by providing meaningful word translations and limiting the work of humans to correcting the syntax and grammar of the sentence. The corpus-based approaches, in contrast, require a bilingual corpus of the language pair and a monolingual corpus of the target language to train the system to translate a sentence. This approach has attracted a great deal of interest worldwide. The example-based approach follows the way humans solve problems: normally they split the problem into sub-problems, solve each sub-problem by recalling how they solved similar problems in the past, and integrate the partial solutions to solve the problem as a whole.
This approach needs a huge bilingual corpus of the language pair between which translation is to be performed. Assume that we are using a corpus that contains two English-Tamil sentence pairs whose English sides are "He bought a book" and "He has a car". The parts of the sentence to be translated are matched against these two sentences in the corpus.
Therefore, the corresponding Tamil parts of the matched segments of the sentences in the corpus are taken and combined appropriately. Sometimes post-processing may be required in order to handle numbers or gender if exact words are not available in the corpus. The statistical approach differs from the other approaches to machine translation in many respects. It can be applied wherever large amounts of machine-readable natural language text are available.
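The matching step of the example-based approach described above can be sketched with Python's standard `difflib` module. This is an illustrative sketch only: the target-side strings are opaque placeholders rather than real Tamil, the query sentence "He bought a car" is an assumption chosen for the demonstration, and the recombination of target fragments is deliberately left out.

```python
from difflib import SequenceMatcher

# Example base: source sentences paired with stored translations.
# The target strings are opaque placeholders, not actual Tamil text.
EXAMPLES = {
    "He bought a book": "<translation of 'He bought a book'>",
    "He has a car": "<translation of 'He has a car'>",
}

def matching_fragments(query, example):
    """Return the word sequences that the query shares with one stored example."""
    q, e = query.lower().split(), example.lower().split()
    blocks = SequenceMatcher(None, q, e).get_matching_blocks()
    return [" ".join(q[b.a : b.a + b.size]) for b in blocks if b.size > 0]

def retrieve(query):
    """Rank the stored examples by word-level similarity to the query."""
    q = query.lower().split()
    scored = [
        (SequenceMatcher(None, q, ex.lower().split()).ratio(), ex)
        for ex in EXAMPLES
    ]
    return sorted(scored, reverse=True)

query = "He bought a car"
for score, example in retrieve(query):
    print(round(score, 2), example, matching_fragments(query, example))
```

For this query the first example contributes the fragment "he bought a" and the second contributes "a car"; an example-based system would then look up the target-side counterparts of those fragments and combine them.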
This approach makes use of translation and language models generated by analysing and determining the parameters for these models from the bilingual corpora and monolingual corpus of the target language, respectively.
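How a translation model and a language model work together can be shown with a small scoring sketch. It assumes the usual formulation in which the system prefers the candidate that maximises log P(source | target) + log P(target); the bigram model with add-one smoothing, the toy monolingual corpus and the translation-model scores are all hypothetical values chosen for illustration, with English standing in for the target language.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Bigram language model with add-one smoothing, from a monolingual corpus."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    vocab = len(set(contexts) | {"</s>"})

    def logprob(sentence):
        words = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab))
            for a, b in zip(words, words[1:])
        )

    return logprob

def best_translation(candidates, lm_logprob):
    """Pick the candidate maximising log P(source | target) + log P(target)."""
    return max(candidates, key=lambda c: c[1] + lm_logprob(c[0]))

# Toy target-language corpus and candidate list (hypothetical values).
lm = train_bigram_lm(["he bought a book", "he has a car", "she bought a car"])
candidates = [  # (candidate target sentence, log P(source | target))
    ("he bought a car", math.log(0.4)),
    ("he a car bought", math.log(0.4)),
    ("bought he car a", math.log(0.4)),
]
print(best_translation(candidates, lm)[0])
```

All three candidates receive the same translation-model score here, so the choice is made entirely by the language model, which prefers the word order it has seen in the monolingual corpus.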
To obtain better translations from this approach, a corpus of more than two million words is needed when designing the system for a particular domain, and more than this when designing a general system for translating a particular language pair. Moreover, statistical machine translation requires an extensive hardware configuration to create translation models in order to reach average performance levels.
Commercial translation systems such as Asia Online and Systran provide systems that were implemented using this approach. Hybrid machine translation approaches differ in a number of aspects. In one, the rule-based machine translation system produces translations of a given text from the source language into the target language.
The output of this rule based system will be post-processed by a statistical system to provide better translations. However, a machine translation system is solely responsible for the complete translation process from input of the source text to output of the target text without human assistance, using special programs, comprehensive dictionaries, and collections of linguistic rules.
Machine translation occupies the top range of positions on the scale of computer translation ambition. Machine-aided translation systems fall into two subgroups. Machine-aided human translation refers to a system wherein the human is responsible for producing the translation sentence by sentence, but may interact with the system in certain prescribed situations, for example requesting assistance in searching through a local dictionary or thesaurus, accessing a remote terminology data bank, retrieving examples of the use of a word or phrase, or performing word processing functions like formatting.
Indeed the data bank may not be accessible to the translator on-line at all, but may be limited to the production of printed subject-area glossaries.
A terminology data bank offers access to technical terminology, but usually not to common words. The chief advantage of a terminology data bank is not that it is automated (even with on-line access, words can be found just as quickly in a printed dictionary), but that it is up to date. It is also possible for a terminology data bank to contain more entries because it can draw on a larger group of active contributors, its users.
The time needed to design a statistical machine translation system is much less than for rule-based systems. The advantages of statistical machine translation over rule-based machine translation are stated below. A rule-based machine translation system requires a great deal of knowledge apart from the corpus that only linguistic experts can generate, for example shallow classification, syntax and semantics of all the words of the source language, in addition to the transfer rules between source and target languages.
Generalizing the rules is a tedious task, and hence multiple rules have to be defined for each case, particularly for languages which have different sentence structure patterns.
On the other hand, rule-based machine translation systems involve more improvement and customization costs until they reach the anticipated quality threshold. A rule-based system obtained from the market is available in its updated form from the moment it is downloaded, but organising a rule-based system is generally a time-consuming process involving more human resources. Rule-based systems also have to be redesigned or retrained by adding new rules and words to the dictionary, among many other things, which takes more time and requires more knowledge from linguists.
A rule-based system may lack the syntactic information about a word that it needs to analyse the source language, or may not know the word at all, which prevents it from finding a suitable rule. Systems governed by linguistic rules can be considered a distinct case of the statistical approach. However, if the rules are generalized to a large extent, they will not be able to handle exceptions.
Various versions of rule-based systems, by contrast, generate more similar translations. The situation has since changed. Corporate use of machine translation with human assistance has continued to expand, particularly in the area of localisation, and the use of translation aids has increased, particularly with the arrival of translation memories.
But the main change has been the ever expanding use of unrevised machine translation output, such as online translation services provided by Babel Fish, Google, etc. The following states the various applications of machine translation briefly. For most of that history — at least 40 years — it was assumed that there were only two ways of using machine translation systems.
The first was to use machine translation to produce publishable translations, generally with human editing assistance. The second was to offer the rough unedited machine translation versions to readers able to extract some idea of the content. In neither case were translators directly involved; machine translation was not seen as a computer aid for translators.
The first machine translation systems operated on the traditional large-scale mainframe computers in large companies and government organizations. There was opposition from translators, particularly those with the task of post-editing, but the advantages of fast and consistent output have made large-scale machine translation cost-effective.
In order to improve the quality of the raw machine translation output many large companies included methods of controlling the input language by restricting vocabulary and syntactic structures — by such means, the problems of disambiguation and alternative interpretations of structure could be minimised and the quality of the output could be improved.
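Controlling the input language, as described above, amounts to checking each source sentence against a set of authoring rules before it reaches the MT system. The sketch below is a hypothetical illustration: the approved vocabulary and the eight-word limit are invented for the example, and sentence length is used only as a crude stand-in for restrictions on syntactic structure.

```python
import re

# Hypothetical controlled-language rules: an approved vocabulary and a length cap.
APPROVED = {"remove", "the", "cover", "before", "you", "clean", "filter", "replace"}
MAX_WORDS = 8

def check_sentence(sentence):
    """Return a list of rule violations for one input sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    problems = [f"unapproved word: {w}" for w in words if w not in APPROVED]
    if len(words) > MAX_WORDS:
        problems.append(f"too long: {len(words)} words (limit {MAX_WORDS})")
    return problems

print(check_sentence("Remove the cover before you clean the filter."))
print(check_sentence("Detach the cover prior to cleaning."))
```

The first sentence passes with no violations, while the second is flagged for four words outside the approved vocabulary, so an author would be asked to rephrase it before translation.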
For most of machine translation history, translators have been wary of the impact of computers on their work. Many saw machine translation as a threat to their jobs, little knowing its inherent limitations. In later decades the situation changed: translators were offered an increasing range of computer aids.
First came text-related glossaries and concordances, word processing on increasingly affordable microcomputers, then terminological resources on computer databases, access to Internet resources, and finally translation memories.
The idea of storing and retrieving already existing translations arose some time earlier, but did not come to fruition until large electronic textual databases became available and bilingual text alignment became feasible.
All translators are now aware of their value as cost-effective aids, and they are increasingly asking for systems which go further than simple phrase and word matching; in other words, for more machine-translation-like facilities.
With this growing interest, researchers are devoting more effort to the real computer-based needs of translators. The TransSearch and TransType systems are just two examples. Mainframe and PC translation systems have since been joined by a range of other types. First should be mentioned the obvious further miniaturisation of software. Many products, such as the Ectaco range of special devices, are in effect computerized versions of the familiar phrase-book or pocket dictionary, and they are marketed primarily to the tourist and business traveller.
The dictionary sizes are often quite small, and where they include phrases, they are obviously limited. However, they are sold in large numbers and for a very wide range of language pairs. Users may be able to ask their way to the bus station, for example, but they may not be able to understand the answer. Recently, since early in this decade, many of these hand-held devices have included voice output of phrases, an obvious attraction for those unfamiliar with pronunciation in the target language.
An increasing number of phrase-book systems offer voice output. This facility is also increasingly available for PC-based translation software (it seems that Globalink was the earliest), and it seems quite likely that it will be an additional feature for online machine translation sometime in the future.
The research in speech translation is beset with numerous problems, not just variability of voice input but also the nature of spoken language. By contrast with written language, spoken language is colloquial, elliptical, context-dependent, interpersonal, and primarily in the form of dialogues. Speech translation therefore represents a radical departure from traditional machine translation. Complexities of speech translation can, however, be reduced by restricting communication to relatively narrow domains — a favourite for many researchers has been business communication, booking of hotel rooms, negotiating dates of meetings, etc.
From these long-term projects no commercial systems have appeared yet. There are, however, other areas of speech translation which do have working but not yet commercial systems. These are communication in patient-doctor and other health consultations, communication by soldiers in military operations, and communication in the tourism domain.
Multilingual access to information in documentary sources (articles, conferences, monographs, etc.) is a further application. Information extraction or text mining has had similar close historical links to machine translation, strengthened likewise by the growing statistical orientation of machine translation. Many commercial and government-funded international and national organisations have to scrutinize foreign-language documents for information relevant to their activities, from commercial and economic to surveillance, intelligence, and espionage.
Searching can focus on single texts or multilingual collections of texts, or range over selected databases. These activities have also, until recently, been performed by human analysts. Now at least drafts can be obtained by statistical means; methods for summarisation have long been researched. The development of working systems that combine machine translation and summarisation is apparently still something for the future. In question answering, the aim is to retrieve answers in text form from databases in response to natural-language questions.
Like summarization, this is a difficult task, but the possibility of multilingual question-answering has been attracting more attention in recent years.
Chapter 3 Creation of Parallel Corpus
The corpus creation for Indian languages will also be discussed elaborately. McEnery and Wilson talk in detail about corpus linguistics.
However, that does not mean that the term "corpus linguistics" was used in texts and studies from this era. Corpora were used to study language acquisition, spelling conventions and language pedagogy. The present-day interpretation of corpus is different from the earlier one. In the present era, corpora in electronic form are used for various purposes including NLP, and the computer comes in handy for manipulating them.
But before the advent of the computer, non-electronic corpora in handwritten form were widely in use. Such non-electronic corpora were used for the following tasks (Dash): dictionary making, dialect study, lexical study, writing grammars, speech study, language pedagogy, language acquisition and other fields of linguistics.
Indeed, individual texts are often used for many kinds of literary and linguistic analysis - the stylistic analysis of a poem, or a conversation analysis of a TV talk show. However, the notion of a corpus as the basis for a form of empirical linguistics is different from the examination of single texts in several fundamental ways.
Corpus linguistics is a method of carrying out linguistic analyses using huge corpora or collections of data. As it can be used for the investigation of many kinds of linguistic questions and as it has been shown to have the potential to yield highly interesting, fundamental, and often surprising new insights about language, it has become one of the most widespread methods of linguistic investigation in recent years.
In principle, corpus linguistics is an approach that aims to investigate linguistic phenomena through large collections of machine-readable texts.
This approach is used within a number of research areas. In principle, any collection of more than one text can be called a corpus, corpus being Latin for "body"; hence a corpus is any body of text. But the term "corpus", when used in the context of modern linguistics, tends most frequently to have more specific connotations than this simple definition:
1. Sampling and Representativeness
2. Finite Size
3. Machine Readable Form
4. A Standard Reference
In such cases we have two options for data collection. We could analyse every single utterance in that variety; however, this option is impracticable except in a few cases, for example with a dead language which only has a few texts. Usually, analysing every utterance would be an unending and impossible task.
Alternatively, we could construct a smaller sample of that variety. This is a more realistic option. One of Chomsky's criticisms of the corpus approach was that language is infinite and therefore any corpus would be skewed. In other words, some utterances would be excluded because they are rare, others which are much more common might be excluded by chance, and extremely rare utterances might be included several times. Although modern computer technology allows us to collect much larger corpora than those that Chomsky was thinking about, his criticisms must still be taken seriously.
This does not mean that we should abandon corpus linguistics, but that we should instead try to establish ways in which a much less biased and more representative corpus may be constructed. We are therefore interested in creating a corpus which is maximally representative of the variety under examination, that is, which provides us with as accurate a picture as possible of the tendencies of that variety, as well as their proportions.
This "collection of texts", as Sinclair's team prefers to call it, is an open-ended entity: texts are constantly being added to it, so it gets bigger and bigger. Monitor corpora are of interest to lexicographers, who can trawl a stream of new texts looking for the occurrence of new words, or for changing meanings of old words.
With the exception of monitor corpora, a corpus more often consists of a finite number of words. Usually this figure is determined at the beginning of a corpus-building project; the Brown Corpus, for example, was built to a fixed total of running words. Unlike the monitor corpus, when a corpus reaches its grand total of words, collection stops and the corpus is not increased in size.
An exception is the London-Lund corpus, which was later increased to cover a wider variety of genres. Nowadays the term corpus is almost synonymous with the term machine-readable corpus. This was not always the case, as in the past the word "corpus" was only used in reference to printed text. Interest in the computer for the corpus linguist comes from the ability of the computer to carry out various processes which, when required of humans, ensured that they could only be described as pseudo-techniques.
The type of analysis that Kading waited years for can now be achieved in a few moments on a desktop computer. Today few corpora are available in book form - one which does exist in this way is "A Corpus of English Conversation" Svartvik and Quirk which represents the "original" London-Lund corpus.
Corpus data (not excluding context-free frequency lists) is occasionally available in other forms of media. Machine-readable corpora possess clear advantages over written or spoken formats. One advantage of a widely available corpus is that it provides a yardstick by which successive studies can be measured. So long as the methodology is made clear, new results on related topics can be directly compared with already published results without the need for re-computation.
A standard corpus also means that a continuous base of data is being used. This implies that any variation between studies is less likely to be attributed to differences in the data and more likely to be attributed to the adequacy of the assumptions and methodology contained in the study.
Wellington Corpus of Spoken New Zealand English contains all formal and informal discussions, debates, previously made talks, impromptu analysis, casual and normal talks, dialogues, monologues, various types of conversations, on line dictations, instant public addressing, etc. London-Lund Corpus of Spoken English, a technical extension of speech corpus, contains texts of spoken language.
British National Corpus comprises general texts belonging to different disciplines, genres, subject fields, and registers. CHILDES database is designed from text sampled in general corpus for specific variety of language, dialect and subject with emphasis on certain properties of the topic under investigation.
Zurich Corpus of English Newspapers is one of the categories of special corpus, which are made up of small samples containing finite collection of texts chosen with great care and studied in detail.
Romantic poets, Augustan prose writers, Victorian novelists, etc. However, for some unknown reasons, corpus made from dramas and plays is usually kept separate from that of prose and poetry.
Bank of English is a growing, non-finite collection of texts with scope for constant augmentation of data reflecting changes in language.
MIT Bangla-Hindi Corpus is formed when corpora of two related or non-related languages are put into one frame. Crater Corpus contains good representative collections from more than two languages.
Texts in one language and their translations into another are aligned. Sometimes reciprocal parallel corpora are designed, in which corpora containing authentic texts as well as translations in each of the languages are involved.
It aims to be large enough to represent all relevant varieties of language and characteristic vocabulary, so that it can be used as a basis for writing grammars, dictionaries, thesauruses and other reference materials.
It is composed on the basis of relevant parameters agreed upon by linguistic community. It includes spoken and written, formal and informal language representing various social and situational registers. It is used as 'benchmark' for lexicons, for performance of generic tools, and language technology applications. With growing influence of internal criteria, reference corpus is used to measure deviance of special corpus. This kind of multilingual corpus contains texts in different languages where texts are not same in content, genre or register.
These are used for comparison of different languages. It follows same composition pattern but there is no agreement on the nature of similarity, because there are few examples of comparable corpora. They are indispensable source for comparison in different languages as well as generation of bilingual and multilingual lexicons and dictionaries. Therefore, users are left to fill in blank spots for themselves.
Their place is in situations where size and corpus access do not pose a problem. The opportunistic corpus is a virtual corpus in the sense that selection of an actual corpus from opportunistic corpus is up to the needs of a particular project.
A monitor corpus is generally considered an opportunistic corpus. The issues of corpus development and processing may vary depending on the type of corpus and the purpose of use. Issues related to speech corpus development differ from issues related to text corpus development.
Developing a speech corpus involves issues like purpose of use, selection of informants, choice of settings, manner of data sampling, manner of data collection, size of corpus, problem of transcription, type of data encoding, management of data files, editing of input data, processing of texts, analysis of texts, etc.
Developing a written text corpus involves issues like size of corpus, representativeness, question of nativity, determination of target users, selection of time-span, selection of documents, collection of text documents (books, newspapers, magazines, etc.). Size is an important issue in corpus generation. It is concerned with the total number of words (tokens) and different words (types) to be taken into a corpus. It also involves deciding how many categories we would like to keep in the corpus, how many samples of texts we put in each category, and how many words we keep in each sample.
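The distinction between tokens and types can be illustrated with a minimal sketch. It assumes whitespace-separated words and ASCII punctuation; the function name `count_tokens_and_types` and the sample sentence are illustrative and do not come from the source. Splitting on whitespace is used rather than a `\w+` pattern because Tamil vowel signs are combining marks that such a pattern would split on.

```python
import string

def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for a text, ignoring case."""
    # Split on whitespace and strip surrounding ASCII punctuation from each word.
    words = (w.strip(string.punctuation).lower() for w in text.split())
    tokens = [w for w in words if w]
    # Tokens are all running words; types are the distinct word forms.
    return len(tokens), len(set(tokens))

sample = "The cat saw the dog, and the dog saw the cat."
tokens, types = count_tokens_and_types(sample)
print(tokens, types)  # prints: 11 5
```

A real corpus would need a tokenizer suited to the script and to the project's definition of a word; this sketch only shows the two counts that the size decision rests on.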
In the early era of corpus generation, when computer technology for procuring language data was not much advanced, it was considered that a corpus containing 1 million words or so was large enough to represent the language. Later, computer technology went through a vast change, with unprecedented growth in its storage, processing, and accessing abilities, which was instrumental in changing the concept regarding size.
Now it is believed that the bigger the corpus, the more faithfully it represents the language. With advanced computer technology we can generate corpora of very large size containing hundreds of millions of words. A simple comparison of the BNC, a multi-million-word corpus with a much more diversified structure and representative frame, with Brown, LOB, and SEU will show how much smaller in content and less diversified in structure those corpora are.
This easily settles empirically the issue of size and representativeness in a corpus. The general argument is that if it is a monitor corpus, then texts produced by native users should get priority over texts of non-native users, because otherwise we get a lot of 'mention' rather than 'use' of words and phrases in the corpus. If one of the main reasons for building a corpus is to enable us to analyse naturally occurring language, in order to see what does occur and what does not, then letting in lots of made-up example sentences and phrases will make it less fit for the proposed purpose.
One way of avoiding this, and many other potential problems, which are found in specialised corpus, is to apply a criterion for inclusion of texts in corpus that they should not be too technical in nature.
In the case of a special corpus, texts produced by non-native users are considered, since the aim of a special corpus is to highlight peculiarities typical of non-native users. Here the question of representativeness of the corpus is not related to the language as a whole, but to the language used by a particular class of people who have learnt and used it as their second language. The idea is to have a corpus that includes data from which we can gather information about how a language is commonly used in various mainstreams of linguistic interaction.
When we try to produce texts and references that will provide guidance on word use, spelling, syntactic constructions, meanings, etc., texts written and spoken by native users will in principle be more directive, appropriate, and representative for enhancing the language learner's ability to understand and use the language. Perhaps this goes right along with the desire of non-native users who, while learning a second language, aim to achieve the proficiency of a native language user.
The question of nativity becomes more complicated and case-sensitive when we find that the same language is used by two different speech communities separated by geographical or political distance (e.g. British English and Indian English). In these cases we like to recognise or generate lexical items or syntactic constructions that are common in, or typical of, a native speaker, especially those which differ from another variety (lexical items typical of British English vs. Indian English). In a context where Indian people are exposed to lots of linguistic material that shows marks of being non-Indian English (Indians are exposed to lots of British English text), people who want to describe, recognise, understand, and generate Indian English will definitely ask for texts produced by native speakers of Indian English, which will highlight the linguistic traits typical of Indian English and thus defy the all-pervading influence of British English over Indian English.
Anybody can use it for any purpose. For a specialised corpus: since each investigator or researcher has specific requirements, the corpus has to be designed accordingly. A person working on developing tools for MT will require a parallel corpus rather than a general corpus. Similarly, a person working on comparative studies between two or more languages will require a comparable corpus rather than a monitor corpus.
Target users: speech corpus (text to speech, speech recognition, synthesis, processing, speech repairing, etc.); general, monitor, specialised, reference, opportunistic corpus, etc.; learner, monitor, and general corpus.
So determination of a particular time span is required to capture the features of a language within this time span. A corpus attempts to cover a particular period of time with a clear time indicator. Materials published within a fixed span of years are included in the MIT corpus, with the assumption that the data will sufficiently represent the condition of the present-day language and will provide information about the changes taking place within the period.
Most corpora incline towards written texts in standard writing. The aim of a general corpus is to identify the central (common) as well as the typical (special) features of a language. Therefore, we do not need to furnish the corpus with all the best pieces of contemporary writing. A measured and proportional representation will suffice.
To be realistic we should include works of the mass of ordinary writers along with works of established and well-known writers. Thus, a corpus is a collection of materials taken from different branches of human knowledge. Here writings of highly reputed authors as well as little-known writers are included with equal emphasis. All catalogues and lists of publications of different publishers need to be consulted for the collection of documents (books, newspapers, magazines, etc.).
Diversity is a safeguard for a corpus against any kind of skewed representativeness. Each category has some sub-categories. Sorting can be in random, regular, or selective order. There are various ways of data sampling to ensure maximum representativeness of a corpus. We must clearly define the kind of language we wish to study before we define sampling procedures for it. A random sampling technique saves a corpus from being skewed and unrepresentative.
This standard technique is widely used in many areas of natural and social sciences. Another way is to use complete bibliographical index. Another approach is to define a sampling frame. Designers of Brown Corpus adopted this.
They used all books and periodicals published in a particular year. A written corpus may be made up of genres such as newspaper report, romantic fiction, legal statutes, scientific writing, social sciences, technical reports, and so on. In this process newspapers, journals, magazines, books etc.
Data from the web: This includes texts from web pages, web sites, and home pages.
Data from e-mail: Electronic typewriting, e-mails, etc.
Printed texts are converted into machine-readable form by an optical character recognition (OCR) system. Using this method, printed materials are quickly entered into the corpus.
Manual data input: It is done through the computer keyboard. This is the best means of data collection from hand-written materials, transcriptions of spoken language, and old manuscripts.
The process of data input is based on the method of sampling. We can take two pages after every ten pages from a book. This makes a corpus best representative of the data stored in physical texts. For instance, if a book has many chapters, each chapter containing different subjects written by different writers, then samples collected in this process from all chapters will be properly represented.
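The page-sampling rule described above (two pages after every ten) can be sketched as follows. This is one possible reading of the rule, in which ten pages are passed over and the next two are taken; the function name `sample_pages` and its parameters `skip` and `take` are hypothetical.

```python
def sample_pages(total_pages, skip=10, take=2):
    """Return 1-based page numbers: pass over `skip` pages, then take `take` pages, repeatedly."""
    selected = []
    page = 1
    while page <= total_pages:
        page += skip  # pass over the block of pages that is not sampled
        # Take the next `take` pages, without running past the end of the book.
        for p in range(page, min(page + take, total_pages + 1)):
            selected.append(p)
        page += take
    return selected

print(sample_pages(40))  # prints: [11, 12, 23, 24, 35, 36]
```

Because the selection is spread evenly through the book, every chapter of a multi-author volume has a chance of contributing pages, which is the point made in the paragraph above.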
The header file contains all physical information about the texts, such as name of book, name of author(s), year of publication, edition number, name of publisher, number of pages taken for input, etc. It is also advantageous to keep detailed records of the materials so that documents can be identified on grounds other than those which are selected as formatives of the corpus, such as whether the text is a piece of fiction or non-fiction, book, journal or newspaper, formal or informal, etc.
At time of input, physical line of texts is maintained on screen. After a paragraph is entered, one blank line is added, and then a new paragraph is started. Texts are collected in a random sampling manner and a unique mark is put at the beginning of a new sample of text. Files are developed with TC installed in PC. This allows display of various Indian scripts on computer screen. Codes for various keys used in Indian characters are standardised by the Bureau of Indian Standards.
With this installed in a PC, we can use almost the entire range of text-oriented application packages. We can also input and retrieve data in Indian languages. The software also provides a choice of two operational display modes on the monitor.
Corpus management involves various related tasks such as holding, processing, screening, and retrieving information from the corpus, which require utmost care and sincerity.
Once a corpus is developed and stored in computer, we need schemes for regular maintenance and augmentation. There are always some errors to be corrected, modifications to be made, and improvements to be implemented. Adaptation to new hardware and software technology and change in requirement of users are also taken care of. In addition to this, there has been constant attention to the retrieval task as well as processing and analytic tools.
At present, computer technology is not developed enough to execute all these tasks with full satisfaction. But we hope that within a few years software technology will improve to fulfil all our needs.
Method of Corpus Sanitation
After the input of data, the process of editing starts. Generally, four types of error occur in data entry. To remove spelling errors, we need thorough checking of the corpus against the physical data source, and manual correction.
Care has to be taken to ensure that the spelling of words in the corpus matches the spelling of words used in the source texts. It has to be checked whether words are changed, repeated or omitted, whether punctuation marks are properly used, whether lines are properly maintained, and whether separate paragraphs are made for each text.
Besides error correction, we have to verify omission of foreign words, quotations, dialectal forms after generation of corpus. Nativised foreign words are entered into corpus. Others are omitted. Dialectal variations are properly entered. Punctuation marks and transliterated words are faithfully reproduced. Usually, books on natural and social sciences contain more foreign words, phrases and sentences than books of stories or fiction.
All kinds of processing work become easier if the corpus is properly edited.
Copyright laws are complicated. There is very little which is obviously right or wrong, legal or illegal. Moreover, copyright problems differ from country to country. If one uses the material only for personal use, then there is no problem. This is fine not only for a single individual but also for a group working together on some area of research and investigation. So long as it is not directly used for commercial purposes, there is no problem.
Using the materials we can generate new tools and systems to commercialise. In that case also the copyright is not violated. The reformed generation of output provides safeguards against possible attacks from copyright holders. But in the case of direct commercial work, we must have prior permission from the legal copyright holders.
People devise systems and techniques for accessing language data and extracting relevant information from a corpus.
These processing tools are useful for linguistic research and language technology development. There are various corpus processing techniques. Much corpus processing software is available for English, French, German and similar languages. For Indian languages there are only a few tools. We need to design corpus-processing tools for our own languages, keeping the nature of Indian languages in mind.
The following is the list of text processing scheme: Mathematical linguistics, computational linguistics, corpus linguistics, applied linguistics, forensic linguistics, stylometrics, etc. Corpus can be subject to both quantitative and qualitative analysis.
Simple descriptive statistical approach enables us to summarise the most important properties of observed data.
An inferential statistical approach uses information from the descriptive statistical approach to answer questions or to formulate hypotheses. An evaluative statistical approach enables us to test whether a hypothesis is supported by evidence in the data, and how a mathematical model or theoretical distribution of data relates to reality (Oakes). To perform comparisons we apply multivariate statistical techniques.
Here items are classified according to a particular scheme, and an arithmetical count is made of the number of items within texts which belong to each class in the scheme. Information available from simple frequency counts is rendered either in alphabetical or in numerical order. Both lists can again be arranged in ascending or descending order according to our requirement. Anyone who is studying a text will want to know how often each different item occurs in it. A frequency list of words is a set of clues to texts.
By examining the list we get an idea about the structure of the text and can plan an investigation accordingly. An alphabetically sorted list is used for simple general reference. A frequency list in alphabetical order plays a secondary role because it is used only when there is a need to check the frequency of a particular item.
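The two renderings of a frequency count described above, alphabetical and numerical, can be sketched in a few lines. The function name `frequency_lists` and the sample tokens are illustrative assumptions; the numerical list here is in descending order of frequency, with ties broken alphabetically.

```python
from collections import Counter

def frequency_lists(tokens):
    """Return the word counts sorted alphabetically and by descending frequency."""
    counts = Counter(tokens)
    alphabetical = sorted(counts.items())
    # Highest frequency first; words with equal counts are listed alphabetically.
    by_frequency = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return alphabetical, by_frequency

tokens = "the cat saw the dog and the dog saw the cat".split()
alphabetical, by_frequency = frequency_lists(tokens)
print(alphabetical)
print(by_frequency)
```

The alphabetical list serves the lookup role mentioned above (checking one item), while the frequency-ordered list is the one that gives a quick overview of a text.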
A concordance is a collection of occurrences of words, each in its own textual environment. Each word is indexed with reference to the place of each occurrence in texts. It is indispensable because it gives access to many important language patterns in texts.
It provides information not accessible via intuition. Some concordance software is available.
It is most frequently used for lexicographical work. We use it to search for single as well as multiword strings, words, phrases, idioms, etc. It is also used to study lexical, semantic, and syntactic patterns, text patterns, genres, literary texts, etc. (Barlow). It is an excellent tool for investigating words and morphemes which are polysemous and have multiple functions in language.
Collocation analysis helps to determine which pairs of words have a substantial collocational relation between them. It compares the probability of two words occurring together as an event with the probability that their co-occurrence is simply the result of chance. For each pair of words a score is given: the higher the score, the greater the collocationality. It enables us to extract multiword units from a corpus for use in lexicography and technical translation. It helps to group similar words together to identify sense variations.
It helps to discriminate differences in usage between words which are similar in meaning. For instance, strong collocates with motherly, showings, believer, currents, supporter, odour, etc. (Biber et al.).
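The comparison of observed co-occurrence with chance co-occurrence can be made concrete with a small sketch. The text does not name a specific measure; pointwise mutual information (PMI) is used here as one common score of this kind, and restricting pairs to adjacent words is a simplifying assumption. The function name `pmi_scores` and the toy sentence are hypothetical.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Score each adjacent word pair by pointwise mutual information (log base 2)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_pairs = n - 1
    scores = {}
    for (w1, w2), joint in bigrams.items():
        p_joint = joint / n_pairs                            # observed probability of the pair
        p_chance = (unigrams[w1] / n) * (unigrams[w2] / n)   # expected if the words were independent
        scores[(w1, w2)] = math.log2(p_joint / p_chance)
    return scores

tokens = "strong tea and strong coffee and weak tea".split()
scores = pmi_scores(tokens)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

On a corpus of realistic size, PMI is known to favour rare pairs, so a minimum frequency threshold is usually applied before ranking; that refinement is left out of this sketch.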
Key word in context (KWIC) helps to look up each occurrence of particular words, similar to a concordance. The word under investigation appears at the centre of each line, with extra space on either side. The length of context is specified for different purposes. It shows an environment of two, three or four words on either side of the word at the centre. This pattern may vary according to one's need. When analysing words, phrases, and clauses, it is agreed that additional context is needed for better understanding.
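A KWIC display of the kind just described can be sketched as follows. The names `kwic` and `format_kwic`, the `window` parameter (number of context words on each side) and the `width` used for alignment are illustrative assumptions, and matching is by exact, case-insensitive word form.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, key word, right context) for each occurrence of `keyword`."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

def format_kwic(lines, width=30):
    """Right-align the left context so that the key word lines up in one column."""
    return [f"{left.rjust(width)}  {kw}  {right}" for left, kw, right in lines]

tokens = "the cat saw the dog and the dog saw the cat".split()
for line in format_kwic(kwic(tokens, "dog", window=2)):
    print(line)
```

Changing `window` corresponds to the choice of two, three or four words of context mentioned above.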
After accessing a corpus by KWIC we can formulate various objectives in linguistic description and devise procedures for pursuing these objectives. KWIC helps us to understand the importance of context, the role of associative words, the actual behaviour of words in contexts, the actual environment of occurrence, and whether any contextual restriction is present.
LWG provides information for dealing with the functional behaviour of constituents at the time of parsing, both at phrase and sentence level.
Using LWG we find that most non-finite verbs are followed by finite verbs, while nouns are mostly followed by suffixes and post-positions in Tamil.
It helps to analyse so-called verb groups and noun groups from their local information. It provides clues for understanding their roles in phrases, clauses, and sentences. Information from LWG helps to resolve lexical ambiguity, which arises from local association of various lexical items.
Our experience with Tamil suggests that finer shades of meaning are mostly conveyed by internal relation between constituents along with their distributions in contexts.
For many compound nouns and verbs, the meaning denoted by a particular association of words cannot be obtained from the meanings of the individual words.
The main objective is to identify a word in a piece of text, isolate it from its contextual environment of use, analyse its morphophonemic structure, obtain its original meaning, and define the syntactic role it plays in the text.
People working on their native language can have better results, since intuitive knowledge helps in finding the right root or suffix part from inflected words, which may be beyond the grasp of non-native users. All detached words are multiword strings, which need to be treated in a more efficient way for processing and annotation.
For processing double the best method is to use a delayed processing technique, where the processing result of one constituent is withheld until the result of processing the subsequent constituent is obtained.
Part-of-speech (POS) tagging
A part-of-speech tagging scheme tags a word with its part of speech in a sentence.
It is done in three stages. In the pre-editing stage, the corpus is converted to a suitable format for assigning a part-of-speech tag to each word or word combination. Because of orthographic similarity, one word may have several possible POS tags.
After initial assignment of possible POS, words are manually corrected to disambiguate words in texts.
An example of POS tagging is given below.
Untagged sentence: A move to stop Mr. Gaitskell from nominating any more labour life peers is to be made at a meeting of labour MPs tomorrow.
In this section the parallel corpus will be studied in detail, focusing on the creation of parallel corpora for machine translation. In addition to monolingual corpora, parallel corpora have been a key focus of corpus linguistics, largely because corpora of this type are important resources for translation.
Parallel corpora are valuable resources for natural language processing and, in particular, for translation.
They can be used not only by translators, but also analyzed and processed by computers to learn and extract information about the languages. In order to be useful, these resources must be available in reasonable quantities, because most application methods are based on statistics. The quality of the results depends a lot on the size of the corpora, which means robust tools are needed to build and process them. A parallel corpus contains texts in two languages.
We can distinguish two main types of parallel corpus:
Comparable corpus: An example would be a corpus of articles about football from English and Danish newspapers, or legal contracts in Spanish and Greek.
Translation corpus: Many researchers have built translation corpora in the past decade, though unfortunately most of them are not easily available.
For a useful survey of parallel corpora around the world, look at Michael Barlow's parallel corpora web page (Barlow n.d.). To use a translation corpus you need a special piece of software called a parallel concordancer. With this software you can ask the computer to find all the examples of a word or phrase in L1, along with all the corresponding translated sentences in L2.
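The core lookup performed by a parallel concordancer can be sketched in a few lines, assuming the corpus is already aligned at sentence level. This is not how ParaConc or Multiconcord are implemented; the function name `parallel_concordance` and the toy English-French pairs are made up for illustration.

```python
def parallel_concordance(aligned_pairs, query):
    """Return (L1 sentence, L2 sentence) pairs whose L1 side contains `query`."""
    query = query.lower()
    # Simple case-insensitive substring match, so "horse" also finds "horses".
    return [(l1, l2) for l1, l2 in aligned_pairs if query in l1.lower()]

# Toy sentence-aligned corpus: (L1 = English, L2 = French).
corpus = [
    ("The house is big.", "La maison est grande."),
    ("I see a horse.", "Je vois un cheval."),
    ("The horse is white.", "Le cheval est blanc."),
]

for l1, l2 in parallel_concordance(corpus, "horse"):
    print(l1, "|", l2)
```

The hard part in practice is producing the sentence alignment in the first place; once it exists, the search itself is this simple.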
Two widely used parallel concordancers are ParaConc and Multiconcord. Parallel corpora can be bilingual or multilingual. Some are unidirectional. We can classify translations according to the dependency between the original text and its translation (one such case covers institutional documents of the European Union and other multilingual institutions), or classify them with respect to the translation objective. As this type of parallel corpus is normally composed of institutional documents with laws and other important information, translation is done accurately, so that no ambiguities are inserted in the text and the texts maintain symmetrical coherence. Considering the automatic translation objective, stylistic and semantic translation types can cause problems.
The stylistic approach makes the translator look for some similar sound, sentence construction, rhythm, or rhyme. This means that the translator will change some of the semantics of the text in favor of its style (cf. Johansson). Parallel corpora can be used for many tasks. They can also be used in the process of learning a second language. In fact, when new knowledge areas appear, new terms will not be present in dictionaries. This means that the query must be translated into all languages used in the database documents.
Parallel corpora are used to compare linguistic features and their frequencies in two languages subject to a contrastive analysis. They are also used to investigate similarities and differences between the source and the target language, making systematic, text-based contrastive studies at different levels of analysis possible.
In this way, parallel corpora can provide new insights into the languages compared concerning language-specific, typological and cultural differences and similarities, and allow for quantitative methods of analysis.
Closely related to the use of parallel corpora in contrastive linguistics is their application in translation studies. Parallel corpora may help translators to find translational equivalents between the source and the target language. They provide information on the frequency of words, specific uses of lexical items as well as collocational and syntactic patterns.
This procedure may help translators to develop systematic translation strategies for words or phrases which have no direct equivalent in the target language.
On this basis, sets of possible translations can be identified and the translator can choose a translation strategy according to the specific register, topic and genre. In recent times, parallel corpora have been increasingly used to develop resources for automatic translation systems.
Teachers are increasingly using parallel corpora in the classroom. Parallel corpora may also be helpful in the planning of teaching units and the identification of specific, potentially problematic, patterns of use and are thus useful tools for syllabus design. False friends are words or expressions of the target language that are similar in form to their counterpart in the source language but convey a different meaning. Even if words of the two languages have a similar meaning, they might belong to different registers or contexts, so that complete translational equivalence between source and target text is rare.
Parallel corpora are used more and more to design corpus-based bilingual dictionaries. The same will be enlarged to the extent of 25 million words in each language.
Also, the existing corpora are raw corpora and will be cleaned for use. Apart from the 22 major Indian languages there are hundreds of minor and tribal languages that deserve attention from researchers for their analysis and interpretation. Creation of corpora in these languages will help in comparing and contrasting the structure and functioning of Indian languages.
So corpora of minor languages will be collected to the tune of around 3 to 5 million words in each language, depending upon the availability of text for the purpose. Apart from these basic text corpora, attempts are made to create domain-specific corpora in the following areas:
1. Newspaper corpora
2. Child language corpus
Morphological analyzers and morphological generators. POS tagged corpora are developed in a bootstrapping manner. As a first step, manual tagging will be done on some amount of text. A POS tagger which uses learning techniques will be used to learn from the tagged data.
After the training, the tool will automatically tag another set of the raw corpus. Automatically tagged corpus will then be manually validated which will be used as additional training data for enhancing the performance of the tool. This process will be repeated till the accuracy of the tool reaches a satisfactory level.
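The bootstrapping cycle described above (train, tag automatically, validate manually, retrain) can be sketched as a loop. The learning component here is a deliberately simple most-frequent-tag model standing in for whatever tagger a project would actually use, and the manual validation step is simulated by a lookup table; the names `train`, `tag`, `bootstrap`, `simulated_validator` and the tag labels are all hypothetical.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Learn the most frequent tag for each word from sentences of (word, tag) pairs."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag_label in sentence:
            counts[word][tag_label] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, words, default="NOUN"):
    """Tag each word with its learnt tag, falling back to `default` for unseen words."""
    return [(w, model.get(w, default)) for w in words]

def bootstrap(seed, raw_batches, validate, target=0.95):
    """Grow the training data batch by batch until tagging accuracy reaches `target`."""
    training = list(seed)
    model = train(training)
    accuracy = 0.0
    for batch in raw_batches:
        auto = [tag(model, words) for words in batch]
        gold = [validate(sentence) for sentence in auto]  # stands in for manual validation
        total = sum(len(s) for s in gold)
        correct = sum(a == g for sa, sg in zip(auto, gold) for a, g in zip(sa, sg))
        accuracy = correct / total
        training.extend(gold)    # validated output becomes additional training data
        model = train(training)
        if accuracy >= target:
            break
    return model, accuracy

# Simulated human validator: corrects every tag from a small hand-made table.
GOLD = {"the": "DET", "dog": "NOUN", "barks": "VERB", "cat": "NOUN", "sleeps": "VERB"}

def simulated_validator(sentence):
    return [(w, GOLD[w]) for w, _ in sentence]

seed = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")]]
batches = [[["the", "cat", "sleeps"]], [["the", "cat", "barks"]]]
model, accuracy = bootstrap(seed, batches, simulated_validator)
print(accuracy)
```

In this toy run the first batch is tagged with two of three tags correct, the corrections are added to the training data, and the second batch is then tagged without error, which ends the loop.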
With this approach, the initial man-hours needed for a given amount of text will be higher. Thereafter, the tagging process will speed up. Here also the initial training set will be a complete manual effort. Thereafter, it will be a man-machine effort. Chunked corpora are a useful resource for various applications.
One of the reasons for this shortcoming is understood to be the lack of appropriate and adequate lexical resources and tools. For example, apart from POS tagging, it is also necessary to tag the text with semantic tags to disambiguate homographic and polysemous words. This will be done manually in the training corpus, which will be used for the testing corpus.
The Georgetown Automatic Translation (GAT) System, developed by Georgetown University, used a direct approach for translating Russian texts, mainly from physics and organic chemistry, into English.
It is pertinent that the majority of the population in India is fluent in regional languages such as Hindi, Punjabi, etc. In contrast, various versions of rule-based systems generate more similar translations.
It requires considerable human assistance in analyzing the input. Therefore, the corresponding Tamil part of the matched segments of the sentences in the corpus are taken and combined appropriately.
For example, it is common to see very long sentences in English, using abstract concepts as the subjects of sentences, and stringing several clauses together.