*************************************************************
* Czech recordings for Sound2Sense                          *
* annotated by Pavel Vondricka                              *
* at the Max-Planck Institute in Nijmegen,                  *
* 8.6.2009 - 7.9.2009                                       *
*************************************************************

Overview of the files:
======================

### Ces_6.wav - Ces_10.wav:
- recordings provided by Helena Spilkova

### Ces_6.TextGrid - Ces_10.TextGrid
- orthographical transcriptions of the recordings in Praat TextGrid format
- for details see guidelines in doc/guidelines_cz.pdf (or ODT)
- there are a few "unintelligible" chunks which I think someone else will
  decode (maybe immediately!), but I just was not able to, even after
  listening to them many times, sorry! (just look for [xxx])
- I tried to mark all ambiguous/uncertain/dubious interpretations with the
  \x tag
- I am sure that other speakers will not always agree with some of my
  interpretations, but I mostly have a good reason for the ones I chose,
  based on listening both to the wider context and to the individual sounds
  audible in the recording
- in many cases I am quite sure myself, in a few cases I just took the
  choice which seemed more probable to me (just look for the \x tags, they
  are quite frequent)
- I am not very happy with the handling of the few dialectal variants in
  recordings 9 and 10, but I did not get any help or advice from the Czech
  side - search for "dialect" in the notes (4th tier)
- I am not very happy with the handling of the extremely variable morpheme
  -hle(nc)-/-dle(nc)- in compound pronouns; again, I did not get any help
  or advice on this, so everything was just reduced to the basic -hle-
  form - see the guidelines

### Ces_XX.tier1.TextGrid, Ces_XX.tier2.TextGrid
- automatic phonetic alignment of all utterances of the first and second
  speaker from the corresponding recording as provided by the
  "align_soundfile.pl" script (which uses other tools for the job) - see
  "tools/align_soundfile.pl" for details
- the alignment is not
  very good - I did not have much time to evaluate it in detail, but here
  are a few typical problems I have observed:
  - the aligner seems to be rather confused by the large choice of
    pronunciation variants and reduced forms and tends to choose the wrong
    ones (especially leaving out vowels or the gliding sounds v/j/h even
    though they are clearly pronounced)
  - it has serious problems with filled pauses and any other unarticulated
    sounds (people's hesitation), especially creaky voice - they almost
    always destroy the whole alignment or at least a part of it
  - it often has problems matching plosives (no surprise) - the models for
    plosives seem to accept parts of vowels too (even those for voiceless
    plosives)
  - it has problems matching "j", especially in combinations with vowels
    (no wonder)
  - some models seem to accept almost everything, even whole words or
    segments of sentences (e.g. "l", "m" or "n"?)
  - matching unvoiced fricatives seems to work quite well
- the script replaces some tags with other ones that the aligner can
  identify and has at least some models for; the configuration used here
  was:
  - [breath]s are replaced with [sil]
  - [hm], [voc] and [per] are replaced with [fil] (maybe [sil] would be
    better for the last two?)
  - [ee] is replaced with "@" and [ehe] with "@h@" (I do not know if this
    is any better than [fil] but it does not seem to be worse...)
  - the real pronunciation from the [pron=...]
    tag is used to replace the word when provided
  - all slash-tags are stripped from the end of the words (including the
    cut-off "\-")
  - the dash joining words is replaced with a space (the parts then behave
    as separate words)

### doc/guidelines.pdf (and .odt)
- guidelines for the annotation in PDF and ODT format
- listing all problems and their common solutions

### doc/README.txt
- this file ;-)

### tools/align_phrase_general.pl.diff
- a patch for the last version of the aligner / perl script
  "align_phrase_general.pl" as I received it from Petr Pollak on the 1st of
  August 2009 (ask the author for the full script)
- the changes made by me include:
  - added ability to work with TextGrids not starting from time 0
  - changed the "sox" parameters to match version 14.2.0
  - added parameter "-nonrm" to force the script not to remove its
    temporary partial TextGrid files but to append new data to them at each
    run (useful for batch processing by "align_soundfile.pl")
  - added ability to accept (any) .wav soundfiles as input
  - [sp] is added at the beginning and end of the phrase instead of [sil],
    because [sil] requires a minimal length and some of my chunks do not
    have any silence at the beginning or end at all
  - ability to parse the new format of the "sampa_substitutions" rule file
    with named rules and comments
  - ability to use backreferences in rules and to apply rules on top of
    other rules
  - fixed parsing of the .rec file and creation of the new TextGrid; the
    script created strange TextGrids that did not work well for me in Praat
  - fixed handling of chunks shorter than the internal minimal length of
    the aligner software (they should not occur, but they do)

### tools/align_soundfile.pl
- perl script for batch processing of the annotations to get phonetic
  alignment
- call with the name of the soundfile (in WAV format) as the first
  parameter; a Praat TextGrid with the same name will be expected and
  parsed by the script
- call with the number of the tier as the second parameter
- e.g.
  calling "align_soundfile.pl sound.wav 1" will parse the first tier in
  "sound.TextGrid" for any chunks containing speech (i.e. anything other
  than bracketed text or whitespace) and send them to the aligner together
  with the corresponding sound signal extracted from "sound.wav"
- the script expects the "align_phrase_general.pl" script provided by Petr
  Pollak with modifications made by me (see
  "tools/align_phrase_general.pl.diff" for details)
- after letting the aligner parse all the chunks, the script gathers the
  data from the partial TextGrid files generated by the
  align_phrase_general.pl script and creates a new TextGrid for the whole
  soundfile with 5 tiers:
  1) RefPhones - alignment of phones as produced by the aligner
  2) AutoPhones - as produced by the aligner; at the moment exactly the
     same data as RefPhones
  3) Words - alignment of the single words to the signal as produced by
     the aligner
  4) Phrases - as they were really sent to the aligner (with possible
     substitutions)
  5) Transcription - the original full transcription as extracted from the
     selected tier of the original TextGrid
- the resulting new TextGrid will be called "sound.tierN.TextGrid" (for the
  soundfile "sound.wav" and the tier number N given as parameters)
- at the beginning of the script, the paths to the aligner and to a
  temporary directory are set - see the script for comments

### tools/dictionary.utf8.txt
- a dictionary of pronunciation variants of all the words occurring in the
  recordings
- automatically generated by the series of commands:
  "cat *.TextGrid |./make_wordlist.sh |./prondict.pl >dictionary.utf8.txt"
  (see "tools/make_wordlist.sh" and "tools/prondict.pl" for details)
- the dictionary (and its sources) does not include the few dialectal
  anomalies (appearing clearly in only ca. 4 cases in recording no.
  10 - search the "notes" tier for "dialect" - which are the following two
  phenomena: 1) shortening of vowels in the dialect of the Ostrava area,
  2) monophthongization "ou > o:" in the instrumental singular endings of
  feminine nouns in the Moravian dialects)
- this dictionary follows the same format as the "general.dict" dictionary
  provided with the aligner, but with the following additions:
  - at the end of each line there is an additional explanation (separated
    by tabs and a #-mark) of how the variant was generated:
    - "Dic" means the form comes originally from one of the dictionaries
      (general.dict provided with the aligner or my own additions from
      "pavel.utf8.dict")
    - "Gen" means the form was generated from the orthographical form by
      the tool "transc" provided with the aligner
    - after this follows a list of space-separated names of the rules that
      were applied to the original form, in their order of application
      (see "tools/prondict.pl" and "sampa_substitutions" for details)
  - unlike the original "general.dict" this file is in UTF-8 encoding

### tools/extract_text.pl
- perl script for extracting plain text from TextGrids
- TextGrids expected as input (STDIN), text comes as output (STDOUT)
- extracts only the first and second tier, skips the "noise" and "notes"
  tiers
- see the script for comments

### tools/make_wordlist.sh
- shell script to make a wordlist from the TextGrids
- it first calls extract_text.pl to extract the plain text, then removes
  all tags from the text and produces a sorted list of all the unique word
  forms using the standard unix tools "sort" and "uniq"
- TextGrids expected as input (see "tools/extract_text.pl"), wordlist comes
  as output (STDOUT), one word on each line, sorted alphabetically
- see the script for comments

### tools/pavel.utf8.dict
- additional dictionary of pronunciation variants for words (or their
  variants) not included in the "general.dict" dictionary I received
  together with the aligner from Petr Pollak
- the format of this dictionary is the same as the
  format of the "general.dict" dictionary included with (and used by) the
  aligner: each line gives one pronunciation variant, starting with the
  orthographical form, followed by a space and then by the phonetic
  transcription of the pronunciation variant in Czech SAMPA symbols
  separated by spaces
- used for the creation of "dictionary.utf8.txt" (see "tools/prondict.pl"
  for details) and later in the automatic alignment
- unlike the original "general.dict" this file is in UTF-8 encoding
- maybe some of the variants given here should rather be products of some
  general rule of reduction/deletion?

### tools/prondict.pl
- perl script generating pronunciation variants for the wordlist given as
  input (STDIN)
- uses pronunciation forms from the "general.dict" dictionary and my
  additional "pavel.utf8.dict" dictionary as a base (marked as "Dic" in the
  output - see "tools/dictionary.utf8.txt")
- if the given word does not exist in any of these dictionaries, the
  "transc" tool is called to generate a default form (such forms are marked
  as "Gen" in the output)
- all the rules from "sampa_substitutions" are then applied (see
  "tools/sampa_substitutions" for details) to every pronunciation variant
  found in the dictionary (or generated by the "transc" tool), in the order
  in which they are defined; any modified form is then added to the list of
  variants, and all the following rules (and the very same rule again) can
  be applied to it as a next step
- all the generated variants are marked with the names of the rules
  (separated by spaces) that were used to generate them, so that it can be
  checked how (by application of which rules) the form was produced (see
  "tools/dictionary.utf8.txt" and "sampa_substitutions" for details)
- this script is based on the corresponding part of the
  "align_phrase_general.pl" aligner script written by Petr Pollak, but it
  has been extended and modified by me
- at the beginning of the script, the paths to the other files and tools
  are set (transc; general.dict; pavel.utf8.dict;
  sampa_substitutions) - see the script for comments

### tools/sampa_substitutions
- definition of named rules for segment deletion within the phonetic
  transcription (used by the "tools/prondict.pl" script and by the
  automatic aligner (modified: see "tools/align_phrase_general.pl.diff"))
- lines containing comments start with the #-mark
- lines containing rules consist of three strings separated by tabs:
  1) the name of the rule (used mainly for debugging purposes, to identify
     the rules applied, e.g. in the comments in the resulting
     "dictionary.utf8.txt")
  2) the regular expression describing the string to be replaced
  3) the replacement string
- the rules are to be applied to a phonetic transcription consisting of
  space-separated Czech SAMPA symbols (as found in the dictionary of the
  aligner or generated by the "transc" tool)
- !!! WARNING !!! - the rules defined here are not based on any research
  but just on my own impressions from the annotation process, and therefore
  they are not reliable; some may be doubtful or just wrong, and some of
  them may be overgenerating (i.e. applied in contexts where they should
  not be applied, or generating pronunciation variants that do not exist -
  but I checked the resulting "dictionary.utf8.txt" and within our
  vocabulary the number of forms that I would consider really impossible
  seems to be negligible)
- the order of most of the rules is not important - the comments state
  where it is important for a rule to be preceded by another one!
- a few of the rules reflect just trivial orthoepic rules that the authors
  somehow forgot to include in the "transc" tool
- some rules apply (or "should apply") only to particular words, but they
  are listed in the rules rather than in the dictionary just because the
  word can have a more or less unlimited number of grammatical forms and
  derivations or can form compounds - in those cases I preferred to write a
  "general" rule for the word rather than finding just the forms that exist
  in our recordings and listing them explicitly in the local dictionary
  (i.e. "pavel.utf8.dict")
- I think some of the rules could be a part of some wider and more general
  tendency, but I do not dare to make even greater generalizations -
  typically this is the case of the frequent l-deletions in many positions
- actually, I am often quite unsure whether a rule really is general or
  whether it only applies to a few words I came across - please read the
  embedded comments in the file for further details and concrete examples,
  and possibly delete the rule and add only the specific word forms that
  really exist to the dictionary...
- the speakers in our recordings speak quite properly and pronounce
  relatively clearly, so there is very little evidence even for the most
  common reductions/deletions - therefore I had to use my own knowledge and
  experience as a native speaker, rather than having good evidence for all
  the rules and variants I have written - they often go far beyond the
  evidence from this particular data
- sometimes I was not really sure whether the examples I noticed were
  really usual Czech reductions/deletions or just random mispronunciations
- I admit the possibility that some of the rules may only reflect heavy
  reductions rather than complete deletions (and as Jan Volin said: as long
  as we can hear anything there, we assume it IS there, even if it does not
  much resemble the "correct" sound)
- in short: the rules need a thorough revision by an experienced
  phonetician and a comparison with the real material!
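
Illustration only: the rule file format and the way rules are applied on top
of each other can be sketched in a few lines of Python. This is NOT the code
of "tools/prondict.pl" (which is a perl script) and may differ from it in
detail. The two rules, their names and the example transcription are made up
for this sketch, and replacing only one match per step is an assumption.

```python
import re

# Hypothetical rule file contents: name <TAB> regex <TAB> replacement.
# Both rules are invented for illustration; they are NOT copied from the
# real "sampa_substitutions" file.
RULE_FILE = "\n".join([
    "# comment lines start with the #-mark",
    "j-del\t^j (e s)\t\\1",
    "t-del\t(s) t (l)\t\\1 \\2",
])


def parse_rules(text):
    """Return a list of (name, compiled regex, replacement) tuples."""
    rules = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip empty lines and comments
        name, pattern, replacement = line.split("\t")
        rules.append((name, re.compile(pattern), replacement))
    return rules


def expand(base_form, rules):
    """Map every reachable variant to the rule names that produced it."""
    variants = {base_form: []}
    queue = [(base_form, 0)]
    while queue:
        form, first_rule = queue.pop(0)
        # Only the same rule and the rules after it are tried on a derived form.
        for index in range(first_rule, len(rules)):
            name, pattern, replacement = rules[index]
            # Assumption: a single match is replaced per step.
            new_form = pattern.sub(replacement, form, count=1)
            if new_form != form and new_form not in variants:
                variants[new_form] = variants[form] + [name]
                queue.append((new_form, index))
    return variants


rules = parse_rules(RULE_FILE)
variants = expand("j e s t l i", rules)
for form, applied in variants.items():
    # Layout loosely modelled on "dictionary.utf8.txt"; "Gen" is a placeholder.
    print("jestli " + form + "\t# Gen " + " ".join(applied))
```

With these two made-up rules, the base form "j e s t l i" gets three further
variants, each labelled with the names of the rules that produced it, in the
order of their application.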