<OAI-PMH schemaLocation=http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd> <responseDate>2018-01-15T15:42:56Z</responseDate> <request identifier=oai:HAL:hal-00327607v1 verb=GetRecord metadataPrefix=oai_dc>http://api.archives-ouvertes.fr/oai/hal/</request> <GetRecord> <record> <header> <identifier>oai:HAL:hal-00327607v1</identifier> <datestamp>2017-12-21</datestamp> <setSpec>type:COMM</setSpec> <setSpec>subject:info</setSpec> <setSpec>collection:UNIV-AG</setSpec> <setSpec>collection:BNRMI</setSpec> <setSpec>collection:CEREGMIA</setSpec> </header> <metadata><dc> <publisher>HAL CCSD</publisher> <title lang=fr>Analyse spectrale des textes: détection automatique des frontières de langue et de discours</title> <creator>Vaillant, Pascal</creator> <creator>Nock, Richard</creator> <creator>Henry, Claudia</creator> <contributor>Groupe de Recherche en Informatique et Mathématiques Appliquées Antilles-Guyane (GRIMAAG) ; Université des Antilles et de la Guyane (UAG)</contributor> <contributor>Centre de Recherche en Economie, Gestion, Modélisation et Informatique Appliquée (CEREGMIA) ; Université des Antilles et de la Guyane (UAG)</contributor> <contributor>ACI Jeunes Chercheurs JC 9009, 2003-2006 (Ministère de l'Enseignement Supérieur et de la Recherche, Fonds National pour la Science) : « Nouveaux paradigmes de classification : aspects théoriques et application à l'acquisition de connaissances » (Richard Nock, Pascal Vaillant)</contributor> <description>In French. 10 pages, 5 figures, LaTeX 2e using EPSF and custom package taln2006.sty (designed by Pierre Zweigenbaum, ATALA). 
Proceedings of the 13th annual French-speaking conference on Natural Language Processing: 'Traitement Automatique des Langues Naturelles' (TALN 2006), Louvain (Leuven), Belgium, 10-13 April 2006</description> <description>National audience</description> <source>Verbum ex machina : Actes de la 13ème conférence annuelle sur le Traitement Automatique des Langues Naturelles</source> <source>13ème conférence annuelle sur le Traitement Automatique des Langues Naturelles (TALN 2006)</source> <coverage>Louvain (Leuven), Belgium</coverage> <contributor>Piet Mertens, Cédrick Fairon, Anne Dister et Patrick Watrin</contributor> <publisher>Presses Universitaires de Louvain (distribution I6DOC/CIACO)</publisher> <identifier>hal-00327607</identifier> <identifier>https://hal.archives-ouvertes.fr/hal-00327607</identifier> <source>https://hal.archives-ouvertes.fr/hal-00327607</source> <source>Piet Mertens, Cédrick Fairon, Anne Dister et Patrick Watrin. 13ème conférence annuelle sur le Traitement Automatique des Langues Naturelles (TALN 2006), Apr 2006, Louvain (Leuven), Belgique. Presses Universitaires de Louvain (distribution I6DOC/CIACO), ISBN 2-87463-023-3, p.
619-629, 2006, Cahiers du CENTAL</source> <identifier>ARXIV : 0810.1212</identifier> <relation>info:eu-repo/semantics/altIdentifier/arxiv/0810.1212</relation> <language>fr</language> <subject lang=en>soft spectral clustering</subject> <subject lang=en>clustering</subject> <subject lang=en>text segmentation</subject> <subject lang=en>language identification</subject> <subject lang=en>multilingual corpora</subject> <subject>ACM H.3.3; I.2.7</subject> <subject>[INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL]</subject> <subject>[INFO.INFO-IR] Computer Science [cs]/Information Retrieval [cs.IR]</subject> <type>info:eu-repo/semantics/conferenceObject</type> <type>Conference papers</type> <description lang=en>We propose a theoretical framework within which information about the vocabulary of a given corpus can be inferred from statistical information gathered on that corpus. Inferences can be made about the categories of the words in the vocabulary, and about their syntactic properties within particular languages. From the same statistical data, it is possible to build matrices of syntagmatic similarity (bigram transition matrices) or paradigmatic similarity (the probability that any pair of words shares common contexts). When clustered with respect to their syntagmatic similarity, words tend to group into sublanguage vocabularies; when clustered with respect to their paradigmatic similarity, they tend to group into syntactic or semantic classes. Experiments have explored the first of these two possibilities. Their results are interpreted within the framework of a Markov chain model of the corpus's generative process(es): we show that the results of a spectral analysis of the transition matrix can be interpreted as probability distributions of words within clusters. This method yields a soft clustering of the vocabulary into sublanguages which contribute to the generation of heterogeneous corpora.
As an application, we show how multilingual texts can be visually segmented into linguistically homogeneous segments. The method is particularly useful for related languages that happen to be mixed within a corpus.</description> <date>2006-04</date> </dc> </metadata> </record> </GetRecord> </OAI-PMH>