SHARE data extraction, preprocessing and treatment methods¶

preprocessing.share_covid_data_extraction.get_language_country_iso_codes(language_country)[source]¶

Returns the ISO codes for language and country based on the values retrieved from the input file. Applies only to the target languages of MCSQ.

Parameters:	language_country (param1) – language and country information retrieved from input file.
Returns:	language_country (string). Variable representing the language and country metadata in ISO codes.

preprocessing.share_covid_data_extraction.preprocess_answer_segment(row, df_questionnaire, survey_item_prefix, splitter)[source]¶

Extracts and processes the answer segments from the input file.

Parameters:	row (param1) – dataframe row currently being analyzed. df_questionnaire (param2) – pandas dataframe to store questionnaire data. survey_item_prefix (param3) – prefix of survey_item_ID. splitter (param4) – NLTK object for sentence segmentation, instantiated according to the language.
Returns:	updated df_questionnaire with new valid answer segments.

preprocessing.share_covid_data_extraction.preprocess_instruction_segment(row, df_questionnaire, survey_item_prefix, splitter)[source]¶

Extracts and processes the instruction segments from the input file.

Parameters:	row (param1) – dataframe row currently being analyzed. df_questionnaire (param2) – pandas dataframe to store questionnaire data. survey_item_prefix (param3) – prefix of survey_item_ID. splitter (param4) – NLTK object for sentence segmentation, instantiated according to the language.
Returns:	updated df_questionnaire with new valid instruction segments.

preprocessing.share_covid_data_extraction.preprocess_question_segment(row, df_questionnaire, survey_item_prefix, splitter)[source]¶

Extracts and processes the question segments from the input file.

Parameters:	row (param1) – dataframe row currently being analyzed. df_questionnaire (param2) – pandas dataframe to store questionnaire data. survey_item_prefix (param3) – prefix of survey_item_ID. splitter (param4) – NLTK object for sentence segmentation, instantiated according to the language.
Returns:	updated df_questionnaire with new valid question segments.

preprocessing.share_covid_data_extraction.replace_abbreviations_and_fills(sentence)[source]¶

Replaces abbreviations and fill text in the text of the input file.

Parameters:	sentence (param1) – text segment from the input file.
Returns:	sentence (string). Text segment without abbreviations and fill text.

preprocessing.share_covid_data_extraction.retrieve_module_from_item_name(item_name)[source]¶

Returns the module of the question based on the item_name variable. This information comes from http://www.share-project.org/special-data-sets/share-covid-19-questionnaire.html

Parameters:	item_name (param1) – item_name information retrieved from input file.
Returns:	module (string). Module of the question.

preprocessing.share_covid_data_extraction.set_initial_structures(language_country)[source]¶

Sets the initial structures that are necessary for the extraction of each questionnaire.

Parameters:	language_country (param1) – language and country of the subdataframe being analyzed.
Returns:	df_questionnaire to store questionnaire data (pandas dataframe), survey_item_prefix, which is the prefix of survey_item_ID (string), and sentence splitter to segment request/introduction/instruction segments when necessary (NLTK object).

Python3 script to extract data from XML SHARE input files.

Author: Danielly Sorato

Author contact: danielly.sorato@gmail.com

preprocessing.share_xml_data_extraction.build_questionnaire_structure(df_questions, df_answers, df_procedures, df_questionnaire, survey_item_prefix, share_modules, special_answer_categories, study)[source]¶

Builds the final questionnaire from df_questions, df_answers and df_procedures. Calls the fill_extraction() and fill_unrolling() methods to replace the dynamic fills in the texts with the appropriate string definitions found in df_procedures.

Parameters:

df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.
df_answers (param2) – the dataframe that holds the answer segments extracted from the XML nodes in previous steps.
df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
df_questionnaire (param4) – a dataframe to hold the final questionnaire.
survey_item_prefix (param5) – the prefix of survey item IDs, either embedded in the filename or hard-coded in the case of ENG_SOURCE data extraction.
share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.
special_answer_categories (param7) – a language-specific instantiated object that contains string definitions of special answer categories (Don’t know, Refuse, etc.).
study (param8) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.

Returns:

The final SHARE questionnaire, stored in df_questionnaire (pandas dataframe).

preprocessing.share_xml_data_extraction.clean_answer_text(text, country_language)[source]¶

Substitutes HTML markup in the answer text segments with fixed values.

Parameters:	text (param1) – the answer text segment. country_language (param2) – country_language metadata, embedded in the file name.
Returns:	the answer text (string) where the markups were replaced (if present in original string).
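
The kind of markup cleaning described above can be sketched as follows. This is a minimal illustration, assuming the markup consists of ordinary HTML-like tags such as <b> or <br />. The function name strip_html_markup and the choice to replace every tag with a space are assumptions made for the example, not the module's actual substitution rules (which also depend on country_language).

```python
import re

# Assumed markup shape: any HTML-like tag, e.g. <b>, </b>, <br />.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html_markup(text: str) -> str:
    """Replace every HTML-like tag with a space, then collapse whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(without_tags.split())


print(strip_html_markup("<b>Very good</b><br />"))
```

Replacing tags with a space rather than an empty string avoids gluing together words that were separated only by a line-break tag.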

preprocessing.share_xml_data_extraction.clean_text_share(text, country_language, w7flag)[source]¶

Substitutes HTML markup and certain fills in the text segments with fixed values.

Parameters:	text (param1) – the answer text segment. country_language (param2) – country_language metadata, embedded in the file name. w7flag (param3) – a boolean flag that indicates whether the segment comes from an input XML file in SHARE w7.
Returns:	the text (string) where the markups and fills were replaced (if present in original string).

preprocessing.share_xml_data_extraction.eliminate_showcardID_and_adjust_item_type(text, item_name)[source]¶

Substitutes the SHOWCARD_ID strings with a card number (the card IDs are not available in the input XML files).

Parameters:	text (param1) – the text segment being analyzed (either request or instruction). item_name (param2) – item_name metadata, extracted directly from the input XML file. If ‘intro’ is in the item_name, the segment receives the introduction item_type.
Returns:	text (string) and item_type (string). The SHOWCARD_ID strings are removed from the text segment.

preprocessing.share_xml_data_extraction.extract_answers(subnode, df_answers, name, country_language, output_source_questionnaire_flag)[source]¶

Extracts answer text from the XML nodes of SHARE w8 files.

Parameters:

subnode (param1) – child node being analyzed in outer loop.
df_answers (param2) – pandas dataframe containing answers extracted from XML file
name (param3) – name of the answer structure inside XML file
country_language (param4) – country_language metadata, embedded in file name.
output_source_questionnaire_flag (param5) – indicates whether the data to be extracted is in the source (1) or the target language (any other value).

Returns:

df_answers (pandas dataframe) filled with retrieved answer segments extracted from answer_element nodes.

preprocessing.share_xml_data_extraction.extract_categories(subnode, df_answers, name, country_language, output_source_questionnaire_flag)[source]¶

Extracts the categories (i.e., answers) from SHARE W07 XML files.

Parameters:	subnode (param1) – subnode of categories node. df_answers (param2) – a dataframe to store answer text and its attributes. country_language (param3) – country and language metadata, contained in the filename. output_source_questionnaire_flag (param4) – indicates whether the data to be extracted is in the source (1) or the target language (any other value).
Returns:	df_answers (pandas dataframe) filled with retrieved answer segments extracted from category_element nodes.

preprocessing.share_xml_data_extraction.extract_qenums(subnode, df_answers, name, country_language, output_source_questionnaire_flag)[source]¶

Extracts the qenums (i.e., answers) from SHARE W07 XML files.

Parameters:	subnode (param1) – subnode of categories node. df_answers (param2) – a dataframe to store answer text and its attributes. country_language (param3) – country and language metadata, contained in the filename. output_source_questionnaire_flag (param4) – indicates whether the data to be extracted is in the source (1) or the target language (any other value).
Returns:	df_answers (pandas dataframe) filled with retrieved answer segments extracted from qenum_element nodes.

preprocessing.share_xml_data_extraction.extract_questions_and_procedures_w7(subnode, df_questions, df_procedures, parent_map, name, tmt_id, splitter, country_language, output_source_questionnaire_flag)[source]¶

Extracts the questions and procedures text segments from SHARE wave 7 XML files.

Parameters:

df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.
df_procedures (param2) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
parent_map (param3) – a dictionary that maps each child to its parent in the XML tree.
name (param4) – name node attribute inside XML file
tmt_id (param5) – tmt_id node attribute inside XML file
splitter (param6) – Sentence segmenter object from NLTK
country_language (param7) – country and language metadata, contained in the filename
output_source_questionnaire_flag (param8) – indicates whether the data to be extracted is in the source (1) or the target language (any other value).

Returns:

df_questions and df_procedures (pandas dataframes), filled with the text segments extracted from the question and procedure nodes.
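
A parent_map like the one described in the parameter list is needed because ElementTree elements do not keep a reference to their parent. The sketch below shows the usual standard-library idiom for building one, assuming the input is parsed with xml.etree.ElementTree. The XML fragment and its tag names are made up for illustration and do not reflect the real SHARE file layout.

```python
import xml.etree.ElementTree as ET

# Made-up XML fragment; real SHARE files use their own tag names.
xml_text = """
<questionnaire>
  <question name="DN001_Intro">
    <text>Hypothetical question text</text>
  </question>
</questionnaire>
"""

root = ET.fromstring(xml_text)

# Map every child element to its parent element.
parent_map = {child: parent for parent in root.iter() for child in parent}

# Look up the parent of a nested node to read metadata stored one level up.
text_node = root.find("./question/text")
print(parent_map[text_node].attrib["name"])
```

The root element has no parent, so it never appears as a key in the map.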

preprocessing.share_xml_data_extraction.extract_questions_and_procedures_w8(subnode, df_questions, df_procedures, parent_map, name, splitter, country_language, output_source_questionnaire_flag)[source]¶

Extracts the questions and procedures text segments from SHARE wave 8 XML files.

Parameters:

df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.
df_procedures (param2) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
parent_map (param3) – a dictionary that maps each child to its parent in the XML tree.
name (param4) – name node attribute inside XML file
splitter (param5) – Sentence segmenter object from NLTK
country_language (param6) – country and language metadata, contained in the filename
output_source_questionnaire_flag (param7) – indicates whether the data to be extracted is in the source (1) or the target language (any other value).

Returns:

df_questions and df_procedures (pandas dataframes), filled with the text segments extracted from the question and procedure nodes.

preprocessing.share_xml_data_extraction.fill_extraction(text)[source]¶

Retrieves all dynamic fills (if there are any) from a given SHARE text segment, so that these fills can later be replaced by their natural language text definitions.

Parameters:	text (param1) – the text segment.
Returns:	either a list of fills (list of strings), or null if there are no matching fills in the text segment.
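
A minimal sketch of this kind of fill retrieval is shown below. The fill syntax is an assumption: the example treats a caret followed by an identifier (for instance ^FL_NAME) as a fill, and the names FILL_PATTERN and find_fills are hypothetical. The actual pattern used by fill_extraction() may differ.

```python
import re
from typing import List, Optional

# Assumed fill syntax: a caret followed by an identifier, e.g. ^FL_NAME.
FILL_PATTERN = re.compile(r"\^[A-Za-z_][A-Za-z0-9_]*")


def find_fills(text: str) -> Optional[List[str]]:
    """Return all fill placeholders found in text, or None if there are none."""
    fills = FILL_PATTERN.findall(text)
    return fills or None


print(find_fills("Does ^FL_HE_SHE live with ^FL_NAME?"))
```

Returning None instead of an empty list mirrors the "list or null" contract described above, so callers can test the result directly.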

preprocessing.share_xml_data_extraction.fill_substitution_in_answer(text, fills, df_procedures)[source]¶

Substitutes the fills in the answer text segments. The fill is substituted only if it was found in the procedure nodes (this can be checked by filtering the df_procedures dataframe by the fill present in the answer segment).

Parameters:	text (param1) – the answer text segment. fills (param2) – the list of fills that are present in the text segment. Effectively, for answers the fill list has just one element. df_procedures (param3) – a dataframe that stores the contents of the procedures nodes, where the fill definitions are.
Returns:	module (string) the module name.

preprocessing.share_xml_data_extraction.fill_unrolling(text, fills, df_procedures, df_questionnaire, survey_item_id, item_name, share_modules, study, item_type)[source]¶

Replaces all dynamic fills found in a given text segment by their string definitions in the df_procedures dataframe.

Parameters:

text (param1) – the text segment that contains at least one dynamic fill.
fills (param2) – the list of dynamic fills found in the text segment passed as parameter.
df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
df_questionnaire (param4) – a dataframe to hold the final questionnaire.
item_name (param5) – the item name metadata, extracted in previous steps.
share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.
study (param7) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.
item_type (param8) – the item type metadata, extracted in previous steps.

Returns:

The updated df_questionnaire (pandas dataframe), with the dynamic fill(s) in the text segment replaced by their definitions.
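
One plausible reading of "unrolling" is that a segment is expanded into one version per available fill definition. The sketch below illustrates that idea under stated assumptions: a plain dictionary stands in for df_procedures, the caret-prefixed fill syntax is hypothetical, and unroll_fills is an illustrative name, not the module's function.

```python
from itertools import product
from typing import Dict, List


def unroll_fills(text: str, fills: List[str], definitions: Dict[str, List[str]]) -> List[str]:
    """Return one version of text per combination of fill definitions."""
    # Fills without a known definition are left untouched in the text.
    known = [fill for fill in fills if fill in definitions]
    variants = []
    for combination in product(*(definitions[fill] for fill in known)):
        unrolled = text
        for fill, definition in zip(known, combination):
            # Plain string replacement; assumes no fill name is a prefix of another.
            unrolled = unrolled.replace(fill, definition)
        variants.append(unrolled)
    return variants


# Hypothetical stand-in for the fill definitions stored in df_procedures.
example_definitions = {"^FL_HIS_HER": ["his", "her"]}
print(unroll_fills("Is ^FL_HIS_HER partner employed?", ["^FL_HIS_HER"], example_definitions))
```

When none of the fills has a definition, the function returns the original text as its only variant.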

preprocessing.share_xml_data_extraction.filter_items_to_build_questionnaire_structure_w7(df_questions, df_answers, df_procedures, df_questionnaire, survey_item_prefix, share_modules, special_answer_categories, study)[source]¶

Filters the question and answer dataframes by the tmt_ids. Only segments with the same tmt_id are considered as alignment candidates. Calls the build_questionnaire_structure() function to build the final questionnaire from df_questions, df_answers and df_procedures.

Parameters:

df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.
df_answers (param2) – the dataframe that holds the answer segments extracted from the XML nodes in previous steps.
df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
df_questionnaire (param4) – a dataframe to hold the final questionnaire.
survey_item_prefix (param5) – the prefix of survey item IDs, either embedded in the filename or hard-coded in the case of ENG_SOURCE data extraction.
share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.
special_answer_categories (param7) – a language-specific instantiated object that contains string definitions of special answer categories (Don’t know, Refuse, etc.).
study (param8) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.

Returns:

The final SHARE wave 7 questionnaire, stored in df_questionnaire (pandas dataframe), after passing through the build_questionnaire_structure() function.
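
The filtering step described above, where only segments sharing a tmt_id are alignment candidates, can be sketched with plain Python structures. Lists of dictionaries stand in for the pandas dataframes here, and all IDs and texts are invented for the example.

```python
from collections import defaultdict

# Invented rows standing in for df_questions and df_answers.
questions = [
    {"tmt_id": "101", "text": "Hypothetical question A"},
    {"tmt_id": "102", "text": "Hypothetical question B"},
]
answers = [
    {"tmt_id": "101", "text": "Yes"},
    {"tmt_id": "101", "text": "No"},
    {"tmt_id": "103", "text": "Unmatched answer"},
]

# Index the answers by tmt_id so each question can find its candidates.
answers_by_id = defaultdict(list)
for answer in answers:
    answers_by_id[answer["tmt_id"]].append(answer["text"])

# Pair each question only with the answers that share its tmt_id.
candidates = {
    question["text"]: answers_by_id.get(question["tmt_id"], [])
    for question in questions
}
print(candidates)
```

Answers whose tmt_id matches no question, such as the third row above, never become candidates.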

preprocessing.share_xml_data_extraction.filter_items_to_build_questionnaire_structure_w8(df_questions, df_answers, df_procedures, df_questionnaire, survey_item_prefix, share_modules, special_answer_categories, study)[source]¶

Filters the question and answer dataframes by the item name. Only segments with the same item name are considered as alignment candidates. Calls the build_questionnaire_structure() function to build the final questionnaire from df_questions, df_answers and df_procedures.

Parameters:

df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.
df_answers (param2) – the dataframe that holds the answer segments extracted from the XML nodes in previous steps.
df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
df_questionnaire (param4) – a dataframe to hold the final questionnaire.
survey_item_prefix (param5) – the prefix of survey item IDs, either embedded in the filename or hard-coded in the case of ENG_SOURCE data extraction.
share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.
special_answer_categories (param7) – a language-specific instantiated object that contains string definitions of special answer categories (Don’t know, Refuse, etc.).
study (param8) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.

Returns:

The final SHARE wave 8 questionnaire, stored in df_questionnaire (pandas dataframe), after passing through the build_questionnaire_structure() function.

preprocessing.share_xml_data_extraction.get_module_metadata(item_name, share_modules)[source]¶

Gets the module to which a given survey item pertains, based on the survey item name.

Parameters:	item_name (param1) – item_name metadata, extracted directly from the input XML file. share_modules (param2) – a dictionary of module names (taken from the SHARE website), encapsulated in the SHAREModules object.
Returns:	module (string) the module name.
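
A lookup of this kind might work as in the sketch below. It assumes that an item name such as DN001_Intro begins with a short alphabetic module code and that the dictionary is keyed by that code. Both the key format and the two dictionary entries are illustrative assumptions, and lookup_module is a hypothetical name. The real mapping is defined in SHAREModules.

```python
import re
from typing import Dict, Optional

# Illustrative entries only; the real dictionary lives in SHAREModules.
example_modules = {"DN": "Demographics", "PH": "Physical Health"}


def lookup_module(item_name: str, modules: Dict[str, str]) -> Optional[str]:
    """Return the module whose code matches the leading letters of item_name."""
    match = re.match(r"[A-Za-z]+", item_name)
    if match is None:
        return None
    return modules.get(match.group(0).upper())


print(lookup_module("DN001_Intro", example_modules))
```

Item names without a leading code, or with a code missing from the dictionary, yield None in this sketch.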

preprocessing.share_xml_data_extraction.main(filename)[source]¶

Flag that indicates if the data to be extracted is from the source or the target questionnaire.

preprocessing.share_xml_data_extraction.replace_fill_in_answer(text)[source]¶

Substitutes certain fills in the answer text segments with fixed values.

Parameters:	text (param1) – the answer text segment.
Returns:	the answer text (string) where the fills were replaced (if present in original string).

preprocessing.share_xml_data_extraction.replace_untranslated_instructions(country_language, text)[source]¶

Replaces certain dynamic fills that are not defined in the input file by language-dependent fixed values.

Parameters:	country_language (param1) – country and language metadata, contained in the filename. text (param2) – the text segment.
Returns:	The text segment (string) without certain dynamic fills (if there were any).

preprocessing.share_xml_data_extraction.set_initial_structures(filename, output_source_questionnaire_flag)[source]¶

Sets the initial structures that are necessary for the extraction of each questionnaire.

Parameters:	filename (param1) – name of the input file.
Returns:	df_questionnaire to store questionnaire data (pandas dataframe), survey_item_prefix, which is the prefix of survey_item_ID (string), study/country_language, which are metadata parameters embedded in the file name (string and string) and sentence splitter to segment request/introduction/instruction segments when necessary (NLTK object).

preprocessing.share_xml_data_extraction.split_answer_text_item_value_from_categories(text)[source]¶

Splits the answer text and its item value in the category node.

Parameters:	text (param1) – text from category node, containing item value and answer text segment
Returns:	item_value (string) and answer text segment (string)
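
The split can be illustrated with a small regular expression. The layout of the category text is an assumption here: the sketch expects a leading integer, an optional dot, and then the answer text (for example "1. Yes"). The names CATEGORY_PATTERN and split_category are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed layout: a leading integer, an optional dot, then the answer text.
CATEGORY_PATTERN = re.compile(r"^\s*(\d+)\.?\s+(.*\S)\s*$")


def split_category(text: str) -> Tuple[Optional[str], str]:
    """Split category text into (item_value, answer_text)."""
    match = CATEGORY_PATTERN.match(text)
    if match is None:
        # No leading number: keep the whole text as the answer segment.
        return None, text.strip()
    return match.group(1), match.group(2)


print(split_category("1. Yes"))
```

Text that does not start with a number is returned unchanged, with None as its item value.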

class preprocessing.sharemodules.SHAREModules[source]¶

SHARE modules, information taken from the SHARE website.
