| 15-07-06586 «Research and development of linguistic and statistical methods and algorithms for automatic generation of associative hierarchical portrait domain-based ontologies». Project Manager Rykov V.V. |
The project addresses the fundamental scientific problem of semantic modeling. Within this framework, a technique is developed for the automated identification of hierarchical, synonymous and associative links in Internet texts and for the construction of linguistic-statistical portraits of various subject areas, in particular that of autonomous unmanned underwater vehicles (AUV). The study rests on the hypothesis that more general terms have more associative links, and on the use of associative links to define the meaning of a term, whose full meaning is revealed through its contexts. This makes it possible to automate the differentiation of meanings and the extraction of knowledge from texts. The solution takes an integrated approach combining statistical methods, corpus linguistics and distributional semantics. It is implemented in a technology that involves developing linguistic-statistical mechanisms for forming an associative portrait of the subject domain (APPO): a dictionary of significant domain terms whose elements are connected by associative and hierarchical relationships. The APPO is created automatically from a statistical analysis of large volumes of Internet texts. The hierarchical relationships in the APPO form a polyhierarchy and a classifier, which facilitate search and navigation in the AUV subject area. The technique applies to a wide class of tasks in both cognitive semantics and information retrieval, since in most contextual-search cases the APPO can replace or supplement a thesaurus or ontology of the subject area, whose manual construction is very laborious.
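The idea of an associative portrait can be illustrated with a minimal sketch. It is not the project's algorithm: it assumes that two terms are associated when they co-occur in at least `min_cooccurrence` sentences, and it applies the stated hypothesis (more general terms have more associative links) by proposing, for each term, the associated terms with strictly more links as candidate broader terms. The function names, the threshold and the whitespace tokenization are all illustrative assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_portrait(sentences, terms, min_cooccurrence=2):
    """Return {term: set of associated terms} from sentence-level co-occurrence.

    Assumption: an associative link exists when two terms appear together
    in at least `min_cooccurrence` sentences.
    """
    pair_counts = Counter()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        present = sorted(t for t in terms if t in tokens)
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1

    links = defaultdict(set)
    for (a, b), count in pair_counts.items():
        if count >= min_cooccurrence:
            links[a].add(b)
            links[b].add(a)
    return links


def candidate_parents(links):
    """For each term, list associated terms that have strictly more links.

    Under the hypothesis that more general terms have more associative
    links, these are candidate broader terms. A term may get several
    candidates, which is what makes the result a polyhierarchy.
    """
    degree = {term: len(neighbours) for term, neighbours in links.items()}
    return {
        term: sorted(n for n in neighbours if degree[n] > degree[term])
        for term, neighbours in links.items()
    }
```

On a toy corpus in which "vehicle" co-occurs repeatedly with both "sonar" and "thruster", `candidate_parents` proposes "vehicle" as the broader term for the other two. A real system would need lemmatization, multi-word terms and a statistical association measure rather than raw counts.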
In addition, the project involves the following tasks: monitoring new objects, facts and ideas in the AUV subject area; automatic classification of new objects by the APPO classifier, in particular by the type of AUV, its characteristics, the company, its management, employees, competitors, partners, etc.; tracking how often an object is mentioned in different periods of time, the sentiment of messages and the source of information; establishing the boundaries of the subject area; the development of intelligent Internet technologies; automated formation of interactive domain-oriented encyclopedias; and visualization of interactive web search results (visual domain maps). The methodology was partially tested in the KEYWEN encyclopedia of key concepts, developed by the project authors, which performs directed extraction of encyclopedic information from the Internet. The project also relies on the original DECL tool environment created and developed by the project applicants, which has been widely used in building logical-analytical systems (DIES, Crime, Summary, Antiterror) and semantics-oriented knowledge extraction systems (Semantix, etc.).
Project participants: Rykov V.V. (Project Manager), Charnine M.M., Khakimova A.Kh., Meshcherin S.A., Ognev A.P., Orlova N.A., Tsyganov V.V., Khlamov M.A., Rodina I.V., Demidov A.V.
| 15-07-06586 «Research and development of semantic methods for constructing the "Contextual Science Citation Index"». Project Manager Charnine M.M. |
The project addresses the fundamental scientific problem of semantic modeling. It develops a methodology for assessing the quality of scientific articles based on a probabilistic model of an article's impact on the references and ideas in subsequent articles, and on a model that represents an idea as a set of phrases. The need to supplement standard scientometric and bibliometric indicators with a computational semantic analysis of the evaluated publications is now almost universally recognized. Given the urgency of the problem of evaluating scientific output, the relevance of the proposed study is not in doubt. Many existing methods for assessing the impact and quality of scientific articles rely on the Scientific Citation Index (ICI), which is calculated from the number of direct bibliographic references to an article and therefore does not work for new articles with zero citations. The proposed methodology uses a new indicator of article quality, the Contextual Scientific Citation Index (ICSC), which is calculated automatically from implicit contextual links to the article and is related to the statistical probability that direct bibliographic references will appear. The ICSC has predictive properties and high sensitivity, which makes it possible to divide new articles into groups and rank them by quality. Implicit links in an article are references to other people's ideas and their authors. They are identified using linguistic methods and the method of relevant phrases, which finds phrases of similar meaning in other articles and documents on the Internet. Similarity of meaning is determined by means of grammatical transformations, translation programs and synonym replacement, as well as with the help of associative links and the authors' method of constructing an associative portrait of the subject area.
A probabilistic model of the dependence of the number of direct citations on the number of implicit links and their parameters is built on a linguistic processor that detects implicit links. The processor is configured by machine learning so that the correlation between the ICI and ICSC indices is maximal. The study rests on the hypothesis that articles with new ideas that attract many implicit links have an increased likelihood of direct citation, and that including implicit links from open documents on the Internet increases the correlation between the ICSC and ICI indices. The solution takes an integrated approach combining statistical methods, corpus linguistics and distributional semantics, and is implemented in a technology that involves developing linguistic-statistical mechanisms for forming the ICSC. The technique applies to a wide class of problems in both cognitive semantics and information retrieval, for example the search for ideas, the assessment of the quality of scientific articles and the compilation of site ratings. Additionally, the project involves the following tasks: monitoring new ideas and assessing their prospects from the frequency of references in different periods of time; analysis of the continuity of scientific ideas; creation of the architecture of ideas in the subject area; the development of intelligent Internet technologies; automated formation of interactive domain-oriented encyclopedias. The methodology was partially validated by the project manager in the KEYWEN Encyclopedia of Key Concepts, which implements directed extraction of encyclopedic information from the Internet. The project is based on the DECL tool environment created and developed by the applicants, which is used in building logical-analytical systems (DIES, Crime, Executive Summary, Antiterror) and semantics-oriented knowledge extraction systems (Semantix, etc.).
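The tuning step described above, configuring the link detector so that the correlation between the two indices is maximal, can be sketched as follows. This is an illustrative stand-in, not the project's machine learning method: it assumes each article comes with a list of similarity scores for candidate implicit links, counts a candidate as a link when its score reaches a threshold, and picks the threshold from a small grid by Pearson correlation with known direct-citation counts. All names and the grid-search approach are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def implicit_link_count(scores, threshold):
    """Number of candidate links whose similarity score reaches the threshold."""
    return sum(1 for s in scores if s >= threshold)


def tune_threshold(candidate_scores, citations, grid):
    """Pick the threshold whose link counts correlate best with citations.

    candidate_scores: one list of similarity scores per article (hypothetical input)
    citations: known direct-citation count per article
    grid: thresholds to try; ties keep the earliest threshold in the grid
    Returns (best_threshold, best_correlation).
    """
    best = None
    for threshold in grid:
        counts = [implicit_link_count(s, threshold) for s in candidate_scores]
        r = pearson(counts, citations)
        if best is None or r > best[1]:
            best = (threshold, r)
    return best
```

Once a threshold is fixed on articles with known citation counts, the same implicit-link count can be computed for a new article with zero citations, which is the case where the ICI alone gives no signal.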
Project participants: Charnine M.M. (Project Manager), Galina I.V., Demidov A.O., Zolotarev O.V., Kuznetsov K.I., Matskevich A.G., Protasov V.I., Rodina I.V., Sokolov E.G., Khakimova A.Kh.
| 16-29-09527 «Research and development of topic modeling methods for monitoring, forecasting and visualization of terrorist activity in the information field of the Internet using a virtual environment». Project Manager Charnine M.M. |
The project addresses the fundamental scientific problem of semantic modeling, forecasting and visualization of the formation of social groups on the network, the detection of extremist communities and the analysis of their topological structure, which comprises websites, blogs and social network accounts (hereinafter referred to as Sites). In the course of the project, a methodology will be developed for constructing a dynamically updated base of lexical resources from textual documents published on the web, using the methods of corpus linguistics and distributional semantics. This base is used as a source for identifying extremist Sites and detecting semantic links (implicit links) between them. For the first time in world practice, an Index of Ideological Influence of the Site (IIW) will be built from the identified network links, based on a probabilistic model of the influence (impact) of the ideas / phrases of a given Site on the ideas and phrases of other Sites on similar subjects. The IIW has predictive properties and high sensitivity, which makes it possible to divide new Sites into groups and rank them by degree of extremism and influence. Given the urgency of the problem of growing extremism and the explosive development of the network, the relevance of the proposed study is beyond doubt. The study is built around the hypothesis that growth in a group's ideological influence on the Internet contributes to growth in the group's membership. Ideological influence is measured by the number of the group's new ideas that have become widespread, and the growth in membership correlates positively with the total number of the group's Sites. The proposed approach is based on the idea that an idea can be adequately expressed by a set of phrases or terms that are similar in meaning (a close analogue of this way of expressing ideas / themes is Latent Dirichlet Allocation, LDA).
Based on this approach, and on observations of how ideas transform over time, it becomes possible to identify the contribution of each Site to the identified topics (ideas), which in turn makes it possible to detect hidden links between Sites / authors. Implicit links between Sites (references to similar ideas and their authors) can be identified using linguistic and statistical methods by searching for similar phrases and topics on other Sites. The semantic similarity of phrases is determined using grammatical transformations, translation programs and the replacement of synonyms with terms obtained by topic modeling (for example, the LDA and PLSA methods), as well as using associative links identified by the authors' method of constructing an associative portrait of the subject area (APPO). The APPO technique makes it possible to build the large text corpora needed for analysis and to dynamically update the dictionaries of terrorist vocabulary. A probabilistic model of the dependence of the number of future implicit links on the number of existing links and their parameters is built on a linguistic processor, developed in the course of the research, that detects implicit links. The linguistic processor is configured by machine learning so that the correlation between the IIW and the future growth of the extremist group is maximal, and so that trends in mentions of similar ideas on the Internet have better predictive properties. The solution to this research problem takes an integrated approach that combines the methods of topic modeling, corpus linguistics, distributional semantics and visual analysis. The approach is implemented in a technology that requires developing linguistic-statistical mechanisms for forming the IIW and for visual analysis of its topological structure.
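The detection of implicit links through similar phrases can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the project's linguistic processor: the only transformation applied is synonym replacement from a hand-made dictionary, phrase similarity is the Jaccard overlap of the normalized word sets, and two Sites are linked when their best-matching phrase pair reaches a threshold. The dictionary, the threshold and the function names are hypothetical, and the example data is deliberately neutral.

```python
def normalize(phrase, synonyms):
    """Lowercase a phrase and map each word to its canonical synonym, if any."""
    return frozenset(synonyms.get(word, word) for word in phrase.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two word sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def implicit_links(site_phrases, synonyms, threshold=0.6):
    """Return (site_a, site_b, similarity) for Site pairs sharing a similar phrase.

    site_phrases: {site name: list of phrases extracted from that Site}
    synonyms: hypothetical dictionary mapping a word to its canonical form
    """
    normalized = {
        site: [normalize(p, synonyms) for p in phrases]
        for site, phrases in site_phrases.items()
    }
    sites = sorted(normalized)
    links = []
    for i, a in enumerate(sites):
        for b in sites[i + 1:]:
            # Best similarity over all phrase pairs of the two Sites.
            best = max(
                (jaccard(p, q) for p in normalized[a] for q in normalized[b]),
                default=0.0,
            )
            if best >= threshold:
                links.append((a, b, best))
    return links
```

For example, with `synonyms = {"fast": "rapid", "craft": "vehicle"}`, the phrases "rapid underwater vehicle" and "fast underwater craft" normalize to the same word set, so the two Sites that contain them are linked. The text also names grammatical transformations, translation and topic models as sources of similarity; a fuller version would add those as further normalization steps.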
The technique applies to a wide class of problems in both cognitive semantics and information retrieval. Such tasks include, for example, monitoring new ideas, evaluating their influence and evolution over time, and analyzing the continuity of ideas.
Project participants: Charnine M.M. (Project Manager), Galina I.V., Gurov A.S., Zolotarev O.V., Kuznetsov K.I., Maravin A.A., Matskevich A.G., Protasov V.I., Rodina I.V., Khakimova A.Kh., Tsyganov V.V.