Linguistic Aspects
Reading a text
A story read "correctly" is a story that the reader has understood. The first step in interpreting the text is therefore to extract meaning from the alphanumeric characters of the raw text.
Relevant information must be found to make the reading expressive. For a children's story, the relevant information is:
- The structure of the story (title, introductory scene, trigger event, chorus, epilogue)
- Lexical elements (named entities, noun phrases, parts of speech)
- Identification of the various speakers, including the narrator
- The various tones that the narrator must be able to interpret
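As an illustration only, the kinds of information listed above could be held in a small record structure such as the one sketched below. Every class, field, and label name here (`Utterance`, `Segment`, `AnnotatedStory`, `tone`, and so on) is a hypothetical choice made for this sketch; it is not the annotation format used by the GV-LEx project.

```python
from dataclasses import dataclass, field
from typing import List

# Structural parts named in the list above; the label strings are assumptions.
STORY_PARTS = ("title", "introductory scene", "trigger event", "chorus", "epilogue")


@dataclass
class Utterance:
    speaker: str            # "narrator" or a character name (hypothetical convention)
    text: str
    tone: str = "neutral"   # hypothetical tone label for expressive reading


@dataclass
class Segment:
    part: str                                   # expected to be one of STORY_PARTS
    utterances: List[Utterance] = field(default_factory=list)


@dataclass
class AnnotatedStory:
    segments: List[Segment] = field(default_factory=list)

    def speakers(self) -> List[str]:
        """Return the sorted set of speakers that appear in the story."""
        return sorted({u.speaker for seg in self.segments for u in seg.utterances})
```

A story would then be a sequence of segments, each tagged with its structural part and holding the utterances attributed to the narrator or to a character.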
Corpus of stories
A learning phase is necessary to recognize this information automatically. A corpus of stories was therefore gathered and then annotated manually.
The stories were chosen according to several criteria matching the planned GV-LEx project demonstration:
- Length of about 500 to 1000 words
- Speaking turns alternating between the various characters of the story
- At least three speakers: one narrator and two characters
- A lexicon that is enjoyable and easily understandable, especially for young audiences
- Copyright-free texts, as the GV-LEx project will make the annotated corpus available to the community
Starting from 850 stories taken from the French site http://contes.biz, 86 texts were selected, along with 3 stories from Rosemarie Vassalo. The texts were analyzed and annotated manually. For reference, the table below gives some statistics about the corpus.
| | Raw Corpus | Pre-processed Corpus |
| Total number of words | 65964 | 80746 |
| Number of different words | 12489 | 15740 |
| Average number of words per story | 742 | 907 |
| Maximum number of words | 1028 | 1318 |
| Minimum number of words | 439 | 533 |
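Statistics of this kind can be computed with a few lines of code. The sketch below is a minimal illustration, not the procedure actually used for the table: it assumes that a "word" is any run of word characters, that counting is case-insensitive, and that the average is rounded to the nearest integer. The function names `tokenize` and `corpus_statistics` are hypothetical.

```python
import re
from statistics import mean


def tokenize(text):
    """Split a text into lowercase word tokens (assumed definition of a 'word')."""
    return re.findall(r"\w+", text.lower())


def corpus_statistics(stories):
    """Compute the quantities shown in the table for a list of story texts."""
    tokens_per_story = [tokenize(story) for story in stories]
    counts = [len(tokens) for tokens in tokens_per_story]
    vocabulary = {word for tokens in tokens_per_story for word in tokens}
    return {
        "total_words": sum(counts),
        "different_words": len(vocabulary),
        "average_words_per_story": round(mean(counts)),
        "max_words": max(counts),
        "min_words": min(counts),
    }
```

Running `corpus_statistics` once on the raw texts and once on the pre-processed texts would yield the two columns of such a table; the exact figures depend on how words are delimited.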
The computer-based annotation function is to be developed from this annotated corpus.
