The NEC TM Data consortium’s objective is to organise unexploited national bilingual assets that can be used as open data and general data for machine learning, in order to lower translation costs at a national level and across Member States. It will gather translation memories from previous national contract awards in Member States and help those states centralise these language assets in the fast-performing NEC TM database, following industry best practices.
New translation contracts will benefit from fuzzy matching analysis and translation companies will be able to work online and connect to each national NEC TM version. Translation data will be categorised and classified per domain in NEC TM, and a connection provided to eTranslation and ELRC.
The consortium will deliver a pan-European data-sharing awareness program, engaging national public administrations by providing a solid legal framework for data sharing.
It will also give public administrations a framework that they can adopt as policy for sharing the data generated by public translation contracts, supported by real adoption scenarios at a national level. NEC TM Data promotes better data-sharing practices using open standards in translation contracts between public administrations and translation service providers.
In sum, the NEC TM Data consortium advocates for the facilitation of a single digital market. It will act as a meeting point for European data gathering efforts and the collection of national digital big data. By building a data bridge between public administrations and translation vendors, NEC TM Data will promote the free flow of data between public administrations and translation professionals.
EU national public administrations are huge buyers of translation services, purchasing many millions of euros of translations annually. However, since translation contracts often do not require translation service providers to return Translation Memories to the contracting body, public administrations do not receive bilingual assets along with their completed translations. Consequently, public administrations in Europe are not leveraging valuable bilingual assets.
The NEC TM Data consortium will aim to inform public administrations at the national level about translation technologies available for language resources, as well as to lobby for TM gathering to be enshrined in national translation contracts. Given that the majority of translated data generated in public contracts is currently not returned to public administrations and therefore its potential is not being maximised, NEC TM Data will ensure that public administrations make better use of the translated data generated in public contracts.
To support these central aims, NEC TM Data will also provide the centralised infrastructure for efficient data sharing, TM matching, TM retrieval, and domain categorisation of resources generated in 10 Member States/EEA countries (including the 3 participating ones), with an emphasis on countries with low language resources. This will enable the development of NEC TM, which will be open source software developed from Pangeanic’s translation memory database ActivaTM.
In order to provide maximum awareness, NEC TM Data will engage national language programs and data collection efforts in Spain and Latvia, and national authorities in Croatia.
After running an initial study on the number of translation contracts in the EU, the data compiled will be shared with the European Commission and national authorities so that these bodies are made aware of the cost of keeping their translation data locked away, as well as of the data being generated that they could be capitalising on.
In order to accomplish this, NEC TM Data proposes the following central activities:
- Contact all companies that were awarded a translation contract between 2015 and 2018 on behalf of national authorities and the European Commission to obtain, in TMX format, public administration data that is currently not being put to use.
- Deploy a central pan-European data-sharing platform for uploading and sharing TMs (TMX files) directly between public administrations and translation service providers, as well as sharing TMs between translation professionals working on translations for the public sector.
- Provide API connectors in NEC TM to popular translation tools used by national administrations (commercial tools such as Trados Studio, MemoQ and Memsource, or free tools like OmegaT and MateCat) so that system users can connect directly to the service, thereby supporting their work and boosting their productivity when working on translation contracts for public administrations.
- Establish NEC TM as a national central repository of public administration data with full classification and categorisation features.
- Install NEC TM as an online central translation memory at a national level to which translation companies can connect for all national translation contracts.
- Add anonymisation features for data shared between Public Administrations.
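The activities above rely on TMX as the exchange format. As a rough illustration of what such a file contains, the following Python sketch reads source/target segment pairs from a small TMX document using only the standard library. The sample data and the function name `read_tmx_pairs` are made up for this example; this is not NEC TM's own import code, which may work differently.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hand-written TMX sample (illustrative data, not from a real contract).
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Public procurement notice</seg></tuv>
      <tuv xml:lang="es"><seg>Anuncio de contratación pública</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Deadline for submission</seg></tuv>
      <tuv xml:lang="es"><seg>Plazo de presentación</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return a list of (source, target) segment pairs for one language pair."""
    root = ET.fromstring(tmx_text.encode("utf-8"))
    pairs = []
    for tu in root.iter("tu"):  # one translation unit per <tu>
        segs = {}
        for tuv in tu.findall("tuv"):  # one language variant per <tuv>
            lang = tuv.get(XML_LANG, "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


pairs = read_tmx_pairs(SAMPLE_TMX, "en", "es")
for source, target in pairs:
    print(source, "->", target)
```

Each `tu` element is one translation unit and each `tuv` holds one language variant, so a repository can index the pairs by language and, as the list above suggests, by domain.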
Moreover, this activity aims to provide an exploitation plan based on software supply agreements for a customised Docker system containing NEC TM Data.
To aid this, the most suitable business models will be defined to exploit the outcomes of the project. For that purpose, the size of the market (in terms of potential beneficiaries), their needs, and the feasibility of reusing central repositories and live translation capabilities will be analysed.
The business model and the approach towards the potential clients identified, from both the public and private sectors, will result in the commercialisation plan. Implementing the commercialisation plan will ensure the long-term sustainability of the software, since it will guarantee its use in the long run.
The NEC TM Data proposal includes the provision of a central TM-sharing repository, called the NEC TM Data platform. The platform will be based on Pangeanic’s commercial tool ActivaTM and follows a similar concept and the same industry practices as other commercial tools and private organisations such as Memsource, TAUS, etc. Pangeanic will release this commercial software under the GPL (the open source General Public License) and customise it for free use by public administrations.
Member State institutions will be actively involved in the implementation and deployment of the NEC TM Data platform (these are listed in the Consortium Members listing). Additionally, the ELRC initiative will be invited to be a contributor of relevant training data. Vendors of translation services will be contacted so that they can provide the TMs to populate the NEC TM versions at a national level. This data can then be shared with ELRC as each national body sees fit, whilst benefitting from a connection to eTranslation. This will help to improve the quality of the Automated Translation platform and foster broader usage and acceptance of automated translation services across Member States.
The NEC TM platform will connect to the ELRC repository (ELRC-SHARE) to exchange data for fuzzy matching or full TM import-export. It can also be deployed at a national level so that each EU Member State can centralise all its translation memories. This will allow the bodies involved to access and interface with the centralised data repository. Universities, research centres, and industry will also benefit from this.
NEC TM Data will contain a number of API connectors to popular translation tools used by national administrations, so that system users can connect directly to the service. This will support their work and boost productivity when working on translation contracts for public administrations.
Furthermore, NEC TM will have a secure (ESens4, Domibus) API connection to eTranslation with a register and access rights so that eTranslation can be used whenever there is no match from the TM. This will ensure that the CEF Automated Translation platform will be integrated into several public services using multilingual best practices and architectures.
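The fallback described above (use the TM when it has a good enough match, otherwise ask machine translation) can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function names, the 0.75 threshold, the use of the standard library's `difflib` for scoring, and the `fake_mt` stand-in, which replaces a real call to eTranslation. The actual NEC TM connector and its secured API are not shown.

```python
from difflib import SequenceMatcher


def translate_segment(source, tm, machine_translate, threshold=0.75):
    """Return (translation, origin), where origin is "tm" or "mt".

    tm is a dict mapping source segments to stored translations.
    machine_translate is any callable taking the source text; in a real
    deployment it would wrap the remote MT service (hypothetical here).
    """
    best_score, best_target = 0.0, None
    for tm_source, tm_target in tm.items():
        score = SequenceMatcher(None, source.lower(), tm_source.lower()).ratio()
        if score > best_score:
            best_score, best_target = score, tm_target
    if best_target is not None and best_score >= threshold:
        return best_target, "tm"
    # No usable TM match: fall back to machine translation.
    return machine_translate(source), "mt"


def fake_mt(text):
    # Stand-in for a machine translation call, used only for this demo.
    return "[MT] " + text


tm = {"Deadline for submission": "Plazo de presentación"}
print(translate_segment("Deadline for submission", tm, fake_mt))
print(translate_segment("Completely unrelated sentence about fish", tm, fake_mt))
```

Passing the MT call in as a parameter keeps the lookup logic independent of how the remote service is reached, which is convenient when the connection requires registration and access rights.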
Fuzzy matching handles matches that are less than 100% exact: it compares each new text segment with the segments stored in the previously built translation memory database and finds those that are similar, even if not identical.
The translation memory (TM) tool searches the previously built database and detects matching sentences and phrases. It then suggests these to the translator, who can choose whether to use the proposed translation. In this way, fuzzy matching can lead to quicker translations and increased productivity.
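As a minimal sketch of the idea, the following Python snippet scores a new segment against a small in-memory TM and returns the stored translations above a minimum similarity, best first. It uses `difflib.SequenceMatcher` from the standard library as the similarity measure; that choice, the 70% cut-off, the function name and the sample data are assumptions for illustration, and NEC TM's own matching algorithm may differ.

```python
from difflib import SequenceMatcher


def fuzzy_matches(segment, tm, min_score=70):
    """Return (score, tm_source, tm_target) tuples, best match first.

    tm is a list of (source, target) pairs; score is a percentage (0-100).
    """
    results = []
    for tm_source, tm_target in tm:
        ratio = SequenceMatcher(None, segment.lower(), tm_source.lower()).ratio()
        score = round(ratio * 100)
        if score >= min_score:
            results.append((score, tm_source, tm_target))
    results.sort(key=lambda item: item[0], reverse=True)
    return results


# Illustrative TM entries (made-up data).
tm = [
    ("The contract was awarded in 2017.", "El contrato se adjudicó en 2017."),
    ("Payment is due within 30 days.", "El pago vence en 30 días."),
]

# The new segment differs from the first entry by a single character,
# so it comes back as a high (but not 100%) fuzzy match.
for score, source, target in fuzzy_matches("The contract was awarded in 2018.", tm):
    print(f"{score}% {source} -> {target}")
```

A translator would then review the suggestion and correct only the part that differs, here the year.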
The source code can be downloaded from GitHub: https://github.com/Pangeamt/nectm/
The docker with NEC TM can be downloaded from Docker Hub: https://hub.docker.com/r/nectm/activatm
Software documentation about the management of ActivaTM/NEC TM:
- Technical description: This document describes the architecture and system design of NEC TM based on ActivaTM, a cloud-based Translation Memory tool. The target audience is developers, system administrators and advanced users of the tool.
- Admin UI description: This admin UI document describes the web and API interface of NEC TM based on ActivaTM, a cloud-based Translation Memory tool. The target audience is managers and advanced users of the tool.
- UI description: This user UI document describes the web and API interface of NEC TM based on ActivaTM, a cloud-based Translation Memory tool. The target audience is users of the tool.
NEC TM is open source, released under the Apache 2 license.
Enable a Single Digital Market
Promote the flow of translation data (specifically Translation Memories) from translation companies to public administrations
Encourage pan-European Data Sharing
Increase the volumes of parallel data available to the European Commission
Organise Big Data
Organise bilingual Big Data that is currently lost in translation companies’ internal translation memories and processes
Maximise Translation Profits
Enable public administrations to fully leverage TMs
Support the work of translators working on public sector texts