Online integration into PubPsych

Everything is implemented with respect to Solr 6.6.5.

QueryFieldRewriter class

This is the main class of the query translation module; the actual translation takes place here. Because it is the most interesting class, it is described first. Details needed to fully understand how the other classes are used are given in their respective sections below.

Inner classes

public class QueryNodeComparator implements Comparator<QueryNode>

QueryNode objects are compared using their nodeId: A QueryNode A is bigger than a QueryNode B if and only if A’s nodeId is bigger than B’s nodeId.
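
As an illustration, the ordering can be sketched as follows. SimpleNode and SimpleNodeComparator are hypothetical stand-ins introduced only for this example; the real QueryNode carries many more fields.

```java
import java.util.Comparator;

// Hypothetical stand-in for QueryNode; only the ID matters for the ordering.
final class SimpleNode {
    private final int nodeId;

    SimpleNode(int nodeId) {
        this.nodeId = nodeId;
    }

    int getId() {
        return nodeId;
    }
}

// Orders nodes by ascending node ID, as described for QueryNodeComparator.
final class SimpleNodeComparator implements Comparator<SimpleNode> {
    @Override
    public int compare(SimpleNode a, SimpleNode b) {
        // Integer.compare avoids the overflow risk of "a.getId() - b.getId()".
        return Integer.compare(a.getId(), b.getId());
    }
}
```

Sorting a list with such a comparator places the node that was created first (lowest ID) at the front.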

Fields

Note: In this section, only the fields used for the real (translation) functionality will be explained. There are other fields only used for gathering statistics for the evaluation of the system. The statistics will be explained in section 4.4.

private static final Logger LOG =
LoggerFactory.getLogger(QueryFieldRewriter.class);

The logger for this class.

private final Map<String, Map<String, String>> meshDict = new HashMap<>();

The high-quality (domain) MeSH dictionary is stored in meshDict. The first key is the string (token or phrase) being looked up, the second key is a language code (“en”, “de”, “fr”, or “es”; these codes are used in the dictionaries), and the value is the respective translation string. Every inner map contains exactly three entries (one for each language except the source language).

private final Map<String, Map<String, String>> mixedDict = new HashMap<>();

The low-quality (out-of-domain) dictionary is stored in mixedDict. The structure is the same as in meshDict.

private final Map<String, Map<String, List<String>>> laFieldNames =
new TreeMap<>();

Some fields have language-specific versions in the PubPsych core schema, some do not (e.g. TI has language-specific versions like TI_D, whereas text does not have language-specific versions). Only those fields that have language-specific versions are stored in this map.
The first key is a field name, the second key is a language code (“E”, “D”, “F”, “S”; these codes are used in the PubPsych database schema) and the value is a list of language-specific field names. It is a list, because the exact number of language-specific field names varies (there is not only TI_D, but also TI_D_from_E, but there is only one German version of the field SH).
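
The nested structure can be sketched as below. FieldNameTable and its methods add and lookup are hypothetical names used only for this example; only the field names TI_D and TI_D_from_E are taken from the text above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical wrapper around a map shaped like laFieldNames.
final class FieldNameTable {
    // field name -> schema language code ("E", "D", "F", "S") -> language-specific field names
    private final Map<String, Map<String, List<String>>> laFieldNames = new TreeMap<>();

    void add(String field, String laCode, String laFieldName) {
        laFieldNames.computeIfAbsent(field, k -> new TreeMap<>())
                    .computeIfAbsent(laCode, k -> new ArrayList<>())
                    .add(laFieldName);
    }

    // Returns an empty list if the field has no version for that language
    // (or no language-specific versions at all, like "text").
    List<String> lookup(String field, String laCode) {
        return laFieldNames.getOrDefault(field, Collections.emptyMap())
                           .getOrDefault(laCode, Collections.emptyList());
    }
}
```

After add("TI", "D", "TI_D") and add("TI", "D", "TI_D_from_E"), lookup("TI", "D") yields both names, while lookup("text", "D") yields an empty list.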

private int nodeIdCounter = 1;

Every QueryNode is assigned a unique node ID. This ID is used as a sorting criterion.

Methods

public QueryFieldRewriter(String pathToMeshDict, String pathToMixedDict,
            String pathToLaFieldNames) throws FileNotFoundException, IOException

In the constructor, the dictionaries and the mapping between fields and language-specific field names (see above) are read in.

private void readInFieldNames(String pathToFieldNames) throws FileNotFoundException, IOException

Reads in the mapping between fields and language-specific field names. The expected format of the file containing the information on the mapping is:

private void readInDictionary(String pathToDictionary, boolean mesh) throws FileNotFoundException, IOException

Reads in the MeSH dictionary or the low-quality dictionary. The Boolean value mesh indicates in which map the dictionary is stored (meshDict or mixedDict).

private void splitEntry(String entry, Map<String, String> translationMap, boolean mesh)

A method used in readInDictionary to split an entry like “en::translation” and add it to the translationMap with the language code as its key and the translation as its value.
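
A minimal sketch of such a split, assuming the separator is the first occurrence of “::” in the entry. The class name EntrySplitter is hypothetical, the mesh parameter of the real method is omitted, and silently skipping malformed entries is an assumption of this example.

```java
import java.util.Map;

// Hypothetical helper illustrating the "code::translation" entry format.
final class EntrySplitter {
    // Splits an entry such as "de::Angst" at the first "::" and stores
    // language code -> translation in translationMap.
    static void splitEntry(String entry, Map<String, String> translationMap) {
        int sep = entry.indexOf("::");
        // Assumption: entries without a code or without a translation are skipped.
        if (sep <= 0 || sep + 2 >= entry.length()) {
            return;
        }
        translationMap.put(entry.substring(0, sep), entry.substring(sep + 2));
    }
}
```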

private Translation matchFieldNamesAndTranslations(Map<String, String> translationMap, String field)

Takes a map with language codes as keys and translations of a whole phrase as values. Returns a Translation that maps language-specific field names to the corresponding translations. The method does not check whether a translation equals the original string, because more fields can be searched this way. If a combination of field and language has no specific field name (like SH and French), this is marked by a tilde (e.g. SH~F), so that language-specific field names that exist in the schema can be told apart from those that are needed only in this code to distinguish different languages.

private Boolean gatherTransInfo(String fieldName, String laFieldCode,
            String laTransCode, Map<String, String> fieldsAndTranslations,
            Map<String, Boolean> fieldNamesLASpecific,
            Map<String, String> translationMap)

Returns true if a language-specific name for fieldName and the language encoded by laFieldCode exists. If yes, each language-specific field name is used as a key to create a new entry of fieldsAndTranslations, each with translationMap’s value for the key laTransCode as its value. If no, false is returned and nothing else happens.
This method is used in matchFieldNamesAndTranslations.

private void addDummyField(Map<String, String> translationMap,
            String field, Map<String, String> fieldsAndTranslations,
            Map<String, Boolean> fieldNamesLASpecific, String laTransCode,
            String laFieldCode)

Builds a dummy language-specific field name following the pattern field + “~” + laFieldCode and uses this as a key to create a new entry of fieldsAndTranslations with translationMap’s value for the key laTransCode as its value.
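
The interplay of matchFieldNamesAndTranslations, gatherTransInfo and addDummyField can be sketched roughly as follows. FieldMatcher is a hypothetical class for this example; the pairing of dictionary codes with schema codes (en/E, de/D, fr/F, es/S) is an assumption based on the code lists given above, and the real methods are split up differently.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: pairs translations with language-specific field names
// and falls back to a dummy name "field~code" where the schema has none.
final class FieldMatcher {
    // Assumed pairing of dictionary codes with schema codes.
    static final Map<String, String> SCHEMA_CODE = new HashMap<>();
    static {
        SCHEMA_CODE.put("en", "E");
        SCHEMA_CODE.put("de", "D");
        SCHEMA_CODE.put("fr", "F");
        SCHEMA_CODE.put("es", "S");
    }

    final Map<String, String> fieldsAndTranslations = new LinkedHashMap<>();
    final Map<String, Boolean> fieldNamesLASpecific = new LinkedHashMap<>();

    void match(String field, Map<String, String> translationMap,
               Map<String, Map<String, List<String>>> laFieldNames) {
        for (Map.Entry<String, String> e : translationMap.entrySet()) {
            String laFieldCode = SCHEMA_CODE.get(e.getKey());
            if (laFieldCode == null) {
                continue; // unknown language code: ignored in this sketch
            }
            List<String> names = laFieldNames
                    .getOrDefault(field, Collections.emptyMap())
                    .getOrDefault(laFieldCode, Collections.emptyList());
            if (names.isEmpty()) {
                // No schema field for this language: dummy key, flagged false.
                String dummy = field + "~" + laFieldCode;
                fieldsAndTranslations.put(dummy, e.getValue());
                fieldNamesLASpecific.put(dummy, false);
            } else {
                // Every language-specific field gets the same translation.
                for (String name : names) {
                    fieldsAndTranslations.put(name, e.getValue());
                    fieldNamesLASpecific.put(name, true);
                }
            }
        }
    }
}
```

Because each dummy key contains the language code, translations for two languages without their own field never overwrite each other, which would happen if the plain field name were used as the key.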

private String removePluralEnding(String plural, String language)

Following simple rules that depend on the language indicated by language (codes: en, de, fr, and es), the method tries to remove plural endings. The implementation follows tradQueries.py in DBTranslator/scripts/utils.

private Translation translateWholeString(String original, String field)

Tries to translate original as a whole. The method first looks it up in the MeSH dictionary. If this is not successful, it tries to find the string in the low-quality dictionary. Returns null if neither dictionary contains the string original.
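
The lookup order can be sketched as below. TwoTierLookup and lookupWhole are hypothetical names; the sketch returns the raw inner map instead of a Translation object and does not assume any normalization of the input string.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two-tier dictionary lookup.
final class TwoTierLookup {
    final Map<String, Map<String, String>> meshDict = new HashMap<>();
    final Map<String, Map<String, String>> mixedDict = new HashMap<>();

    // Returns language code -> translation, preferring the high-quality
    // dictionary; returns null if neither dictionary contains the string.
    Map<String, String> lookupWhole(String original) {
        Map<String, String> hit = meshDict.get(original);
        if (hit == null) {
            hit = mixedDict.get(original);
        }
        return hit;
    }
}
```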

private boolean checkSingular(String possibleSingular, String laCode,
            String token, boolean countInStats, boolean isEntireCopy,
            boolean mesh, Map<String, Map<String, String>> translationByToken)

Looks up a possible singular form, derived with the method removePluralEnding, in both dictionaries. If one of the dictionaries contains the possible singular form, an additional check verifies that the found entry has the same source language as the one assumed when deriving the singular form. This avoids mistakes where a possible singular form of one language looks like a word in another language (e.g. according to our simple rules, a possible singular form of German Weber would be web, which would induce strange translations due to the English word web).
The method returns true if an entry with the correct source language could be found.
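
The source-language check can be sketched as follows. The sketch assumes that the source language of an entry can be inferred as the one language code without a translation in the inner map (every inner map holds entries for all languages except the source language); the actual implementation may determine it differently. SingularCheck and its methods are hypothetical names, and only one dictionary is consulted here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the source-language check for singular candidates.
final class SingularCheck {
    static final List<String> CODES = Arrays.asList("en", "de", "fr", "es");

    // Assumption: the source language is the code that has no translation.
    static String sourceLanguage(Map<String, String> entry) {
        for (String code : CODES) {
            if (!entry.containsKey(code)) {
                return code;
            }
        }
        return null; // all four codes present: source language unknown
    }

    // Accepts the candidate only if the dictionary entry was written in the
    // language whose plural rules produced the candidate.
    static boolean acceptSingular(String possibleSingular, String assumedLa,
                                  Map<String, Map<String, String>> dict) {
        Map<String, String> entry = dict.get(possibleSingular);
        return entry != null && assumedLa.equals(sourceLanguage(entry));
    }
}
```

With an English entry for web in the dictionary, the candidate web derived from German Weber is rejected, because the entry's source language is “en” and not “de”.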

private Translation translate(String original, String field)

First tries to translate original as a whole (using translateWholeString). If there is no translation, the string is split into tokens. Each token is looked up separately. If a token cannot be found in any of the dictionaries, removePluralEnding is used to derive a possible singular form. If no singular form with correct source language (see checkSingular) is found in any of the dictionaries, the respective token simply gets copied.

public QueryNode buildQueryNode(Query query, BooleanClause.Occur occLvl)

Builds a tree displaying the structure of query and its subqueries if they exist. If a new class is added to the subclasses of Query that the ZpidQParser uses, one needs to implement a mapping into the QueryNode structure here. Otherwise, queries of this new class just get copied and will never be translated. For details on the QueryNode class: see below.

public QueryNode translateQueryNode(QueryNode old)

Returns old after all structural changes with respect to the translations have been made. First, it is checked whether old has to be translated at all. If not, nothing is changed. If yes, the queryType of old determines what happens. In the following, I use expressions like “BooleanQueries” instead of writing “QueryNodes representing a BooleanQuery” each time, although the latter would be technically correct. It should be clear from the context whether I refer to the actual Solr BooleanQuery class or my QueryNode class.

private void translateGroup(QueryNode representative, List<QueryNode> group,
            Map<String, String> translationMap, List<QueryNode> disjunctionMaxQueries, List<Integer> idsOfDMQs)

This method is used to translate disjuncts of different DisjunctionMaxQueries that have been matched together (cf. translateQueryNode). The translation of the concatenated tokens is split into single tokens for which new QueryNodes are created and distributed as disjuncts among the DisjunctionMaxQueries to which the original nodes belonged. All original nodes are then marked as “does not have to be translated” (thus avoiding double translations).

private Map<QueryNode, List<QueryNode>> matchDMQsWithSameField(List<QueryNode> disjunctionMaxQueries)

This method is used to find disjuncts of different DisjunctionMaxQueries that query the same field (and have the same boost factor if applicable). In the returned map, the key of each entry is a representative QueryNode and the value is a list of all the QueryNodes that query the same field (and have the same boost factor if applicable) as the representative QueryNode.
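
The grouping idea can be sketched as below. Disjunct and DisjunctGrouper are hypothetical names; the sketch works on an already flattened list of disjuncts rather than on DisjunctionMaxQuery nodes, picks the first node of each group as the representative, and includes the representative in its own group. All three choices are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a disjunct of a DisjunctionMaxQuery.
final class Disjunct {
    final int id;
    final String field;
    final float boost;

    Disjunct(int id, String field, float boost) {
        this.id = id;
        this.field = field;
        this.boost = boost;
    }
}

final class DisjunctGrouper {
    // Groups disjuncts that share field and boost; the first disjunct seen
    // for a (field, boost) pair becomes the representative (map key).
    static Map<Disjunct, List<Disjunct>> group(List<Disjunct> disjuncts) {
        Map<String, Disjunct> representatives = new LinkedHashMap<>();
        Map<Disjunct, List<Disjunct>> groups = new LinkedHashMap<>();
        for (Disjunct d : disjuncts) {
            String key = d.field + "^" + d.boost;
            Disjunct rep = representatives.get(key);
            if (rep == null) {
                representatives.put(key, d);
                List<Disjunct> members = new ArrayList<>();
                members.add(d);
                groups.put(d, members);
            } else {
                groups.get(rep).add(d);
            }
        }
        return groups;
    }
}
```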

private List<QueryNode> buildNewChildren(Translation trans,
            QueryNode.QueryType typeOfChildren, boolean moreThanOneOriginalNode,
            float boost, QueryNode.QueryType typeOfWrappedQuery)

Returns a list of QueryNodes representing the translations. If typeOfChildren is TERM and moreThanOneOriginalNode is true (so the translation is based on the concatenation of at least two TermQueries or PhraseQueries), then the translation string of each target language is split into tokens. If there is more than one token, then a new QueryNode with occurrence level MUST is created for every token. These new nodes are then made children of a new intermediate node with occurrence level SHOULD which is added to the returned list. If there is only one token or the translation is based on just one TermQuery or PhraseQuery, a new node created with buildNewChild is added to the returned list. In case of BoostQueries, a warning is logged if moreThanOneOriginalNode is true and a new node of type BOOST is built whose wrapped query is created with buildNewChild.
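
The token splitting for the TERM case can be sketched as follows. OccurLevel, TransNode and TokenNodeBuilder are hypothetical stand-ins (OccurLevel for Lucene's BooleanClause.Occur); the occurrence level SHOULD for the single-node case is an assumption, since the text above does not state it.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Lucene's BooleanClause.Occur, reduced to what the sketch needs.
enum OccurLevel { MUST, SHOULD }

// Hypothetical, stripped-down replacement for QueryNode.
final class TransNode {
    final String field;
    final String text;
    final OccurLevel occ;
    final List<TransNode> children = new ArrayList<>();

    TransNode(String field, String text, OccurLevel occ) {
        this.field = field;
        this.text = text;
        this.occ = occ;
    }
}

final class TokenNodeBuilder {
    // Builds the node for one target language. A multi-token translation of
    // concatenated original nodes becomes a SHOULD group of MUST tokens.
    static TransNode build(String laField, String translation,
                           boolean moreThanOneOriginalNode) {
        String[] tokens = translation.trim().split("\\s+");
        if (!moreThanOneOriginalNode || tokens.length == 1) {
            // Assumption: a single node is attached with level SHOULD.
            return new TransNode(laField, translation, OccurLevel.SHOULD);
        }
        TransNode group = new TransNode(null, null, OccurLevel.SHOULD);
        for (String token : tokens) {
            group.children.add(new TransNode(laField, token, OccurLevel.MUST));
        }
        return group;
    }
}
```

The intermediate SHOULD node keeps the translation optional as a whole, while the MUST children ensure that all tokens of one translation have to match together.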

private QueryNode buildNewChild(QueryNode.QueryType typeOfChild, String fieldName, String text, BooleanClause.Occur occLvl)

Returns a new QueryNode of type typeOfChild with field set to fieldName, text set to text and the occurrence level set to occLvl.

private boolean hasToBeTranslated(String fieldName)

Returns true if one of the following conditions holds:

public QueryNode simplify(QueryNode toSimplify)

Simplifies QueryNodes with redundant structure: If the child c of a BooleanQuery is a BooleanQuery and has occurrence level MUST, then all its children (grandchildren of toSimplify) that also have occurrence level MUST, are removed from c’s list of children and added to toSimplify’s list of children. If all children of c have occurrence level MUST, c itself is removed from toSimplify’s list of children.
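
A minimal sketch of this flattening step, using hypothetical stand-in classes (Level for the occurrence level, BoolNode for QueryNode). It handles one level of nesting only and inserts the hoisted grandchildren at the former position of c; whether the real method recurses and which order it produces are not stated above.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the occurrence level of a clause.
enum Level { MUST, SHOULD, MUST_NOT }

// Hypothetical, stripped-down replacement for QueryNode.
final class BoolNode {
    final String label;
    final boolean isBoolean;
    final Level occ;
    final List<BoolNode> children = new ArrayList<>();

    BoolNode(String label, boolean isBoolean, Level occ) {
        this.label = label;
        this.isBoolean = isBoolean;
        this.occ = occ;
    }
}

final class Simplifier {
    // Hoists MUST grandchildren out of MUST Boolean children, because
    // "A AND (B AND C)" is equivalent to "A AND B AND C".
    static void simplify(BoolNode toSimplify) {
        List<BoolNode> result = new ArrayList<>();
        for (BoolNode c : toSimplify.children) {
            if (!(c.isBoolean && c.occ == Level.MUST)) {
                result.add(c);
                continue;
            }
            List<BoolNode> remaining = new ArrayList<>();
            for (BoolNode g : c.children) {
                if (g.occ == Level.MUST) {
                    result.add(g);      // becomes a direct child of toSimplify
                } else {
                    remaining.add(g);   // stays below c
                }
            }
            c.children.clear();
            c.children.addAll(remaining);
            if (!remaining.isEmpty()) {
                result.add(c);          // c survives only if it kept children
            }
        }
        toSimplify.children.clear();
        toSimplify.children.addAll(result);
    }
}
```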

public Query manipulateQuery(Query query)

Maps query to a QueryNode object, simplifies this object with simplify and then runs translateQueryNode on it. The result of translateQueryNode is mapped back to a Query object which is then returned.

public List<QueryNode> translateWrappedQueryOfBoostQuery(QueryNode wrappedQuery, QueryNode.QueryType wqType, QueryNode bq)

If wqType (type of wrappedQuery) is BOOLEAN or DISJUNCTIONMAX, the wrapped query of bq is replaced with translateQueryNode(wrappedQuery) and an empty list is returned. Otherwise, new nodes are created with buildNewChildren.

QueryNode class

The purpose of the QueryNode class is to provide an interface for manipulating queries more easily than with Solr’s Query class. The main advantage is that one only has to deal with one class for all query objects (and not with the many subclasses of Query, which all provide different methods and fields). This way, one does not have to build a new Query object for each change; instead, one changes the respective QueryNode object and maps it back to a Query object once all changes due to translation have been performed.

Inner classes

public enum QueryType {
   BOOLEAN, TERM, DISJUNCTIONMAX, PHRASE, BOOST, DUMMY
}

These are the subclasses of Query that the ZpidQParser uses (version as of November 2018).

Fields

private final int nodeId;

Every QueryNode object is assigned a unique ID in the constructor. This is needed e.g. for sorting purposes.

private final List<QueryNode> children;

This list contains QueryNode objects that are somehow subordinate to the given QueryNode. For instance, they can be subqueries of a BooleanQuery or disjuncts of a DisjunctionMaxQuery. TermQuery and PhraseQuery objects cannot have children.

private final QueryType queryType;

The type of the Query that the given QueryNode object represents.

private String fieldName;

This field is only set for TermQuery and PhraseQuery objects and contains the name of the corresponding field in the PubPsych core schema (e.g. TI).

private String text;

This field is only set for TermQuery and PhraseQuery objects and contains the actual text associated with that Query object.

private BooleanClause.Occur occLvl;

The occurrence level of the given QueryNode object is set if and only if the QueryNode object is the child of another QueryNode object with type BOOLEAN.

private Boolean toTranslate;

This field is set to true if the QueryNode object (or the associated representative, cf. method translateQueryNode of the QueryFieldRewriter class) has not been translated yet and one of the following conditions holds for its fieldName:

private float tieBreakerMultiplier;

This field is set only for QueryNode objects representing a DisjunctionMaxQuery. It is a tie breaker value for multiple matches. For more details, see https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/DisjunctionMaxQuery.html.

private float boost;

This field is set only for QueryNode objects representing a BoostQuery.
The higher the boost, the more important the wrapped query will be to the computed score. See https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/BoostQuery.html.

private Query originalQuery;

The original Query object is stored in case it is an instance of an unhandled subclass of Query. When mapping QueryNode objects with type DUMMY back to Query objects, originalQuery will be returned.

private int slop;

This field is set only for QueryNode objects representing a PhraseQuery. It is the maximum edit distance for which terms in the PhraseQuery will still be matched to strings in the documents (see https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/PhraseQuery.html). In the current implementation, the slop of the translated PhraseQuery objects is simply copied from the original Query.

private int parentId;

The node ID of the parent node if the latter one exists. It is needed for matching disjuncts querying the same field but belonging to different DisjunctionMaxQuery objects.

private static final Logger LOG = LoggerFactory.getLogger(QueryNode.class);

The logger object.

Methods

public QueryNode(QueryType type, int id)

Every QueryNode object needs a type and a unique ID. These two cannot be changed afterwards.

public int getId()

Returns the unique node ID.

public QueryType getType()

Returns the QueryType.

public void addChild(QueryNode child)

Adds child to children if this is an allowed operation for the given queryType.

public void addChildren(List<QueryNode> newChildren)

Adds all elements of newChildren to children if this is an allowed operation for the given queryType.

public List<QueryNode> getChildren()

Returns the children list.

public QueryNode getChildAtIndex(int index)

Returns the element of children that is at position index.

public void removeChildAtIndex(int index)

Removes the element of children that is at position index.

public void replaceChildAtIndex(int index, QueryNode newChild)

Replaces the element of children that is at position index with newChild.

public void setFieldName(String name)

Sets fieldName to name if the given QueryNode is of type TERM or PHRASE.

public String getFieldName()

Returns fieldName.

public void setText(String text)

Sets this.text to text if the given QueryNode is of type TERM or PHRASE.

public String getText()

Returns text.

public void setOccurrenceLevel(BooleanClause.Occur lvl)

Sets occLvl to lvl.

public BooleanClause.Occur getOccurrenceLevel()

Returns occLvl.

public void markAsToTranslate()

Sets toTranslate to true.

public void unmarkToTranslate()

Sets toTranslate to false.

public boolean hasToBeTranslated()

Returns toTranslate.

public void setTieBreakerMultiplier(float tbm)

Sets tieBreakerMultiplier to tbm.

public float getTieBreakerMultiplier()

Returns tieBreakerMultiplier.

public void setOriginalQuery(Query query)

Sets originalQuery to query.

public Query getOriginalQuery()

Returns originalQuery.

public void setBoost(float boost)

Sets this.boost to boost.

public float getBoost()

Returns boost.

public void setParentId(int id)

Sets parentId to id.

public void setSlop(int sl)

Sets slop to sl.

public int getSlop()

Returns slop.

public Query buildQuery()

Maps the given QueryNode object back to an object of one of the subclasses of Solr’s Query class. Given the QueryType, the mapping works as follows:

Translation class

The Translation class is used to store information about the translation of a specific field and string.

Fields

private static final Logger LOG = LoggerFactory.getLogger(QueryNode.class);

The logger object.

private String originalFieldName;

The original field name.

private Map<String, String> fieldsAndTranslations;

The keys of this map are the language-specific versions of originalFieldName, the values are the translations.

private Map<String, Boolean> fieldNamesLanguageSpecific;

The keys of this map are the language-specific versions of originalFieldName. The values indicate whether these versions also exist in the PubPsych core schema (true) or whether they are just dummy values to maintain the distinction between the different languages (false). We need these dummy values because using originalFieldName as the key (which will in this case be used later in the query itself) would overwrite existing translations in the map.

Methods

public Translation(Map<String, String> translations,
Map<String, Boolean> fieldNamesLASpecific, String fieldName)

Initializes the class fields with the given values.

public boolean isLanguageSpecificFieldName(String fieldName)

Returns true if fieldName is a language-specific field name in the PubPsych core schema, false otherwise. Logs a warning if fieldName is not a key in fieldNamesLanguageSpecific.

public Map<String, Boolean> getFieldNamesLASpecificMap()

Returns fieldNamesLanguageSpecific.

public Map<String, String> getTranslations()

Returns fieldsAndTranslations.

public String getOriginalFieldName()

Returns originalFieldName.

PreTranslationInfo class

Objects of the PreTranslationInfo class store information that is gathered before the actual translation happens. It is used for the translation of BooleanQuery objects. The texts of all children of type TermQuery or PhraseQuery are concatenated and then stored in a PreTranslationInfo object to be able to look up the whole text related to the BooleanQuery in the dictionaries.

Fields

private StringBuilder sb;

The StringBuilder is used to concatenate the text strings of all subqueries of the BooleanQuery for which the PreTranslationInfo is gathered.

private Map<Integer, QueryNode> originalNodes;

The keys are the indices in the children list (list of subqueries) of the parent node (BooleanQuery), the values are the respective subqueries.

Methods

public PreTranslationInfo()

Initializes sb and originalNodes with an empty StringBuilder and an empty map, respectively.

public void addString(String s)

Appends s to sb.

public String buildString()

Builds the string that has been gathered by sb.

public void addOriginalNode(int index, QueryNode node)

Adds node with key index to the originalNodes map.

public Set<Integer> getIntegers()

Returns all keys of originalNodes.

public Collection<QueryNode> getNodes()

Returns all values of originalNodes.

public Map<Integer, QueryNode> getOriginalNodesMap()

Returns originalNodes.
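
How these pieces fit together can be sketched as below. PreInfoSketch is a hypothetical, generic re-creation of the class for illustration (the node type is a type parameter so that the example does not depend on QueryNode); the helper gather and the single space it inserts between child texts are assumptions of this example.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical re-creation of PreTranslationInfo with a generic node type.
final class PreInfoSketch<N> {
    private final StringBuilder sb = new StringBuilder();
    private final Map<Integer, N> originalNodes = new HashMap<>();

    void addString(String s) { sb.append(s); }

    String buildString() { return sb.toString(); }

    void addOriginalNode(int index, N node) { originalNodes.put(index, node); }

    Set<Integer> getIntegers() { return originalNodes.keySet(); }

    Collection<N> getNodes() { return originalNodes.values(); }

    // Assumed usage: concatenate the texts of all children, separated by a
    // space, and remember each child under its index in the parent's list.
    static PreInfoSketch<String> gather(List<String> childTexts) {
        PreInfoSketch<String> info = new PreInfoSketch<>();
        for (int i = 0; i < childTexts.size(); i++) {
            if (i > 0) {
                info.addString(" ");
            }
            info.addString(childTexts.get(i));
            info.addOriginalNode(i, childTexts.get(i));
        }
        return info;
    }
}
```

The string returned by buildString can then be looked up as a whole in the dictionaries, and the stored indices tell which children of the BooleanQuery the resulting translation replaces.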