Natural Language Processing Group

Gathering and revealing information

Who are we?

We are a working group dedicated to natural language processing. We focus on semantic processing of electronic text data produced in natural languages with the goal of uncovering knowledge hidden in the data. In particular, we use machine learning methods. We develop web and mobile applications that use text mining algorithms to solve problems for everyday users and customers. Our solutions are also used by commercial companies.

Outputs

**Tools and applications for natural language processing**

Expert search

Useful

In our work we use various tools for preparing and then analysing text documents. These include lemmatization, stemming, stop word removal, sentiment identification, and more. You will also find our chatbot here, which recommends mobile phones.
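As a rough illustration of two of the preparation steps named above, stop word removal and stemming, here is a minimal sketch in plain Python. It is not the group's tooling: the stop word list, the suffix rules and the function names are all assumptions made up for this example, and a real stemmer (for instance a Porter-style one) uses far more careful rules.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "is", "the", "of", "to", "in"}

# Suffixes stripped by the naive stemmer, checked in this order (hypothetical rules).
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The phones are charging and the battery lasted"))
```

The stems produced this way are not always dictionary words ("charging" becomes "charg"), which is typical of suffix stripping and is one reason lemmatization is often used alongside it.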

Natural Language Processing

It is estimated that over 80% of data today is stored as text (newspaper articles, emails, blogs, Facebook posts, etc.) with little or no structure. The need for text data analysis is growing, and the field is becoming commercially very attractive. The goal of analytical tasks is to uncover previously unknown knowledge contained in this data using non-trivial methods. The results find applications in marketing, computer security, information services, literature search, human resource management, counter-terrorism, and other fields.

Working with text data is generally very difficult. The data is usually unstructured and has a completely different character than numerical data (complex grammar, different meanings of words, subjectivity, irony, etc.). Moreover, procedures that work satisfactorily for one domain may not work for another domain. We seek to deploy methods from the domain of Data Mining. This mature and well-developed discipline also focuses on finding hidden knowledge in data, but works with highly structured numerical data. It is therefore advantageous to prepare textual data in such a way that Data Mining methods are applicable to it. This requires the deployment of techniques from areas such as natural language processing, statistics, machine learning, linguistics and others.
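One common way to prepare text so that methods built for numerical data can be applied is the bag-of-words representation, where each document becomes a vector of term counts. The sketch below is a minimal plain-Python illustration under that assumption; the sample documents and function names are invented for the example and do not describe the group's own pipeline.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct token to a column index, in sorted order."""
    vocab = sorted({token for doc in documents for token in doc.split()})
    return {token: index for index, token in enumerate(vocab)}


def vectorize(document, vocabulary):
    """Return a dense term-count vector for one document."""
    counts = Counter(document.split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector


# Invented sample documents, already lowercased and whitespace-separated.
docs = ["good phone good battery", "bad battery", "good camera"]
vocabulary = build_vocabulary(docs)
matrix = [vectorize(doc, vocabulary) for doc in docs]

# Share of zero cells in the document-term matrix.
zero_cells = sum(row.count(0) for row in matrix)
sparsity = zero_cells / (len(matrix) * len(vocabulary))

for row in matrix:
    print(row)
print(f"sparsity: {sparsity:.2f}")
```

Even with three short documents more than half of the cells are zero, and the share grows quickly as the vocabulary gets larger.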

For text data analysis, the research mainly applies supervised machine learning (classification), unsupervised machine learning (clustering, feature selection, association mining), semi-supervised learning, and combinations of these. Research goals include categorizing text documents, retrieving documents based on similarity, discovering the semantics of groups of documents, finding attributes that convey meaning, sentiment analysis, and more. The specific characteristics and limitations of these tasks are taken into account, such as the small number of suitable examples, the huge volume and dimensionality of the data, the sparsity of the vectors representing the data, unbalanced classes, and multilingual content. These peculiarities are typical of datasets generated by users themselves on social networks, microblogs or discussion forums (as opposed to scientific papers or newspaper articles). Future research will also focus on analysing the relationships between textual data (news, economic summaries, social media posts and likes) and various economic phenomena such as stock price movements.
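Retrieving documents based on similarity can be sketched with cosine similarity over sparse count vectors. The example below is only an illustration in plain Python: the choice of cosine similarity, the sample documents and all function names are assumptions, not a description of the methods used in the research. Storing only non-zero counts in a mapping is one simple way to cope with the sparsity mentioned above.

```python
import math
from collections import Counter


def to_sparse_vector(text):
    """Represent a document as a sparse {token: count} mapping."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    query_vector = to_sparse_vector(query)
    scored = [
        (cosine_similarity(query_vector, to_sparse_vector(doc)), doc)
        for doc in documents
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Invented sample collection.
documents = [
    "stock prices fell sharply",
    "the phone has a great camera",
    "camera and battery review",
]
for score, doc in rank_by_similarity("phone camera", documents):
    print(f"{score:.3f}  {doc}")
```

In practice the raw counts would usually be replaced by weighted values (for example TF-IDF) and the texts would be preprocessed first, but the ranking step stays the same.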

In the course of the research, tools for data preprocessing are developed and deployed, including stemming, part-of-speech identification, spell checking, and more. A dedicated application with an optional graphical user interface is under continuous development; it transforms the data into a format suitable for machine learning algorithms and various software tools. The research also makes use of professional commercial and open-source software (C5, Cluto, Weka, IBM SPSS Modeler and others).
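To give an idea of what transforming data into a tool-specific format can look like, the sketch below writes a small document-term matrix as dense ARFF text, the file format read by Weka. It is a hypothetical helper written for this page, not the application described above; the function name, the relation name and the sample rows are assumptions, and attribute names are assumed to be simple tokens that need no quoting.

```python
def to_arff(relation, attribute_names, rows, class_labels):
    """Render dense numeric rows plus a nominal class column as ARFF text.

    rows is a list of (feature_values, label) pairs.
    """
    lines = [f"@relation {relation}", ""]
    for name in attribute_names:
        lines.append(f"@attribute {name} numeric")
    lines.append("@attribute class {" + ",".join(class_labels) + "}")
    lines += ["", "@data"]
    for values, label in rows:
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines) + "\n"


# Invented example: two term-count columns and a sentiment label.
arff_text = to_arff(
    relation="reviews",
    attribute_names=["battery", "good"],
    rows=[([1, 2], "pos"), ([1, 0], "neg")],
    class_labels=["pos", "neg"],
)
print(arff_text)
```

For large, sparse collections a sparse row notation is usually preferable to the dense one shown here, since most term counts are zero.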