SHARE data extraction, preprocessing and treatment methods
-
preprocessing.share_covid_data_extraction.get_language_country_iso_codes(language_country)
Returns the ISO codes for language and country based on the values retrieved from the input file. Only for target languages of MCSQ.
Parameters: language_country (param1) – language and country information retrieved from the input file.
Returns: language_country (string). Variable representing the language and country metadata in ISO codes.
-
preprocessing.share_covid_data_extraction.preprocess_answer_segment(row, df_questionnaire, survey_item_prefix, splitter)
Extracts and processes the answer segments from the input file.
Parameters: - row (param1) – dataframe row being currently analyzed.
- df_questionnaire (param2) – pandas dataframe to store questionnaire data.
- survey_item_prefix (param3) – prefix of survey_item_ID.
- splitter (param4) – NLTK object for sentence segmentation, instantiated in accordance with the language.
Returns: updated df_questionnaire with new valid answer segments.
-
preprocessing.share_covid_data_extraction.preprocess_instruction_segment(row, df_questionnaire, survey_item_prefix, splitter)
Extracts and processes the instruction segments from the input file.
Parameters: - row (param1) – dataframe row being currently analyzed.
- df_questionnaire (param2) – pandas dataframe to store questionnaire data.
- survey_item_prefix (param3) – prefix of survey_item_ID.
- splitter (param4) – NLTK object for sentence segmentation, instantiated in accordance with the language.
Returns: updated df_questionnaire with new valid instruction segments.
-
preprocessing.share_covid_data_extraction.preprocess_question_segment(row, df_questionnaire, survey_item_prefix, splitter)
Extracts and processes the question segments from the input file.
Parameters: - row (param1) – dataframe row being currently analyzed.
- df_questionnaire (param2) – pandas dataframe to store questionnaire data.
- survey_item_prefix (param3) – prefix of survey_item_ID.
- splitter (param4) – NLTK object for sentence segmentation, instantiated in accordance with the language.
Returns: updated df_questionnaire with new valid question segments.
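The three preprocess_*_segment functions above share one pattern: split a text into sentences with the splitter and store one row per sentence. The following is a minimal sketch of that pattern, not the module's actual code. It assumes the splitter exposes a tokenize() method, as NLTK's Punkt tokenizers do; the RegexSplitter class, the add_segments helper, the column names and the ID numbering are all hypothetical, and a plain list of dictionaries stands in for df_questionnaire so the sketch runs without pandas or NLTK data.

```python
import re


class RegexSplitter:
    """Hypothetical stand-in for an NLTK sentence tokenizer (same tokenize() method)."""

    def tokenize(self, text):
        # Split after '.', '!' or '?' followed by whitespace; drop empty pieces.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def add_segments(text, rows, survey_item_prefix, item_type, splitter):
    """Append one row per sentence of `text` to `rows` and return `rows`."""
    if not isinstance(text, str) or not text.strip():
        return rows  # nothing valid to add (e.g. empty cell in the input file)
    for sentence in splitter.tokenize(text):
        rows.append({
            # Hypothetical ID scheme: prefix plus a running row number.
            "survey_item_ID": f"{survey_item_prefix}{len(rows) + 1}",
            "item_type": item_type,
            "text": sentence,
        })
    return rows


rows = add_segments(
    "Please read carefully. How old are you?",
    [], "SHA_COVID_", "request", RegexSplitter(),
)
```

Passing the splitter in as an argument, as the documented signatures do, keeps the language-specific choice of tokenizer outside the per-row logic.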
-
preprocessing.share_covid_data_extraction.replace_abbreviations_and_fills(sentence)
Replaces abbreviations and fills in the text from the input file.
Parameters: sentence (param1) – text segment from the input file.
Returns: sentence (string). Text segment without abbreviations and fills.
-
preprocessing.share_covid_data_extraction.retrieve_module_from_item_name(item_name)
Returns the module of the question based on the item_name variable. This information comes from http://www.share-project.org/special-data-sets/share-covid-19-questionnaire.html
Parameters: item_name (param1) – item_name information retrieved from the input file.
Returns: module (string). Module of the question.
-
preprocessing.share_covid_data_extraction.set_initial_structures(language_country)
Sets the initial structures that are necessary for the extraction of each questionnaire.
Parameters: language_country (param1) – language and country of the subdataframe being analyzed.
Returns: df_questionnaire to store questionnaire data (pandas dataframe), survey_item_prefix, which is the prefix of survey_item_ID (string), and a sentence splitter to segment request/introduction/instruction segments when necessary (NLTK object).
Python3 script to extract data from XML SHARE input files.
Author: Danielly Sorato
Author contact: danielly.sorato@gmail.com
-
preprocessing.share_xml_data_extraction.build_questionnaire_structure(df_questions, df_answers, df_procedures, df_questionnaire, survey_item_prefix, share_modules, special_answer_categories, study)
Builds the final questionnaire from df_questions, df_answers and df_procedures. Calls the fill_extraction() and fill_unrolling() methods to replace the dynamic fills in the texts with the appropriate string definitions found in df_procedures.
Parameters: - df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.
- df_answers (param2) – the dataframe that holds the answer segments extracted from the XML nodes in previous steps.
- df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
- df_questionnaire (param4) – a dataframe to hold the final questionnaire.
- survey_item_prefix (param5) – the prefix of survey item IDs, either embedded in the filename or hard-coded in the case of ENG_SOURCE data extraction.
- share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.
- special_answer_categories (param7) – a language-specific instantiated object that contains string definitions of special answer categories (Don’t know, Refuse, etc.).
- study (param8) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.
Returns: The final SHARE questionnaire, stored in df_questionnaire (pandas dataframe).
-
preprocessing.share_xml_data_extraction.clean_answer_text(text, country_language)
Substitutes HTML markup in the answer text segments with fixed values.
Parameters: - text (param1) – the answer text segment.
- country_language (param2) – country_language metadata, embedded in file name.
Returns: the answer text (string) where the markups were replaced (if present in original string).
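Substituting markup with fixed values can be sketched as a lookup table applied to the text. This is only an illustration under stated assumptions: the clean_markup helper and the entries of MARKUP_REPLACEMENTS are hypothetical, and the actual function may handle different markup and language-specific cases through country_language.

```python
import re

# Hypothetical replacement table: markup found in the text -> fixed value.
MARKUP_REPLACEMENTS = {
    "<br>": " ",
    "<br />": " ",
    "<b>": "",
    "</b>": "",
    "&nbsp;": " ",
}


def clean_markup(text):
    """Replace each known markup string, then normalize whitespace."""
    for markup, value in MARKUP_REPLACEMENTS.items():
        text = text.replace(markup, value)
    # Collapse runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()
```

A text without any of the listed markup passes through unchanged apart from whitespace normalization, which matches the "if present in original string" wording of the return description.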
-
preprocessing.share_xml_data_extraction.clean_text_share(text, country_language, w7flag)
Substitutes HTML markup and certain fills in the text segments with fixed values.
Parameters: - text (param1) – the text segment.
- country_language (param2) – country_language metadata, embedded in file name.
- w7flag (param3) – a boolean flag that indicates if the segment comes from an input XML file in SHARE w7.
Returns: the text (string) where the markups and fills were replaced (if present in original string).
-
preprocessing.share_xml_data_extraction.eliminate_showcardID_and_adjust_item_type(text, item_name)
Substitutes the SHOWCARD_ID strings with a card number (the card IDs are not available in the input XML files).
Parameters: - text (param1) – the text segment being analyzed (either request or instruction).
- item_name (param2) – item_name metadata, extracted directly from the input XML file. If ‘intro’ is in the item_name, the segment receives the introduction item_type.
Returns: text (string) and item_type (string). The SHOWCARD_ID strings are removed from the text segment.
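A minimal sketch of the two steps described above, assuming the placeholder appears literally as SHOWCARD_ID in the text. The replace_showcard_id helper, the fixed card number "1" and the default_item_type argument are hypothetical; the source only states that an item_name containing ‘intro’ yields the introduction item type.

```python
def replace_showcard_id(text, item_name, default_item_type="request", card_number="1"):
    """Return (text, item_type) with the SHOWCARD_ID placeholder substituted."""
    # The real card IDs are not in the XML input, so a fixed number is used.
    text = text.replace("SHOWCARD_ID", card_number)
    # Items whose name contains 'intro' are treated as introductions.
    item_type = "introduction" if "intro" in item_name else default_item_type
    return text, item_type
```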
-
preprocessing.share_xml_data_extraction.extract_answers(subnode, df_answers, name, country_language, output_source_questionnaire_flag)
Extracts answer texts from XML nodes of SHARE w8 files.
Parameters: - subnode (param1) – child node being analyzed in outer loop.
- df_answers (param2) – pandas dataframe containing answers extracted from XML file
- name (param3) – name of the answer structure inside XML file
- country_language (param4) – country_language metadata, embedded in file name.
- output_source_questionnaire_flag (param5) – indicates if the data to be extracted is in the source (1) or the target language (any other value).
Returns: df_answers (pandas dataframe) filled with retrieved answer segments extracted from answer_element nodes.
-
preprocessing.share_xml_data_extraction.extract_categories(subnode, df_answers, name, country_language, output_source_questionnaire_flag)
Extracts the categories (i.e., answers) from SHARE W07 XML files.
Parameters: - subnode (param1) – subnode of categories node.
- df_answers (param2) – a dataframe to store answer text and its attributes
- country_language (param3) – country and language metadata, contained in the filename
- output_source_questionnaire_flag (param4) – indicates if the data to be extracted is in the source (1) or the target language (any other value).
Returns: df_answers (pandas dataframe) filled with retrieved answer segments extracted from category_element nodes.
-
preprocessing.share_xml_data_extraction.extract_qenums(subnode, df_answers, name, country_language, output_source_questionnaire_flag)
Extracts the qenums (i.e., answers) from SHARE W07 XML files.
Parameters: - subnode (param1) – subnode of categories node.
- df_answers (param2) – a dataframe to store answer text and its attributes
- country_language (param3) – country and language metadata, contained in the filename
- output_source_questionnaire_flag (param4) – indicates if the data to be extracted is in the source (1) or the target language (any other value).
Returns: df_answers (pandas dataframe) filled with retrieved answer segments extracted from qenum_element nodes.
-
preprocessing.share_xml_data_extraction.extract_questions_and_procedures_w7(subnode, df_questions, df_procedures, parent_map, name, tmt_id, splitter, country_language, output_source_questionnaire_flag)
Extracts the questions and procedures text segments from SHARE wave 7 XML files.
Parameters: - df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.
- df_procedures (param2) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
- parent_map (param3) – a dictionary that maps each child to its parent in the XML tree.
- name (param4) – name node attribute inside XML file
- tmt_id (param5) – tmt_id node attribute inside XML file
- splitter (param6) – Sentence segmenter object from NLTK
- country_language (param7) – country and language metadata, contained in the filename
- output_source_questionnaire_flag (param8) – indicates if the data to be extracted is in the source (1) or the target language (any other value).
Returns: df_questions and df_procedures (pandas dataframes), filled with the text segments extracted from the questions and procedures nodes.
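The parent_map parameter exists because xml.etree.ElementTree elements do not keep a reference to their parent. A common way to build such a child-to-parent dictionary is shown below; the XML snippet and its node names are made up for illustration and do not reproduce the SHARE file layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature input; real SHARE XML files are structured differently.
xml_text = (
    "<questionnaire>"
    "<question name='DN001'><text>How old are you?</text></question>"
    "</questionnaire>"
)
root = ET.fromstring(xml_text)

# Map every child element to its parent by walking all elements once.
parent_map = {child: parent for parent in root.iter() for child in parent}

# Starting from a text node, climb to the enclosing question node.
text_node = root.find(".//text")
question_node = parent_map[text_node]
```

With this map, attributes that live on an ancestor node (such as a name) can be read while iterating over the text nodes themselves.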
-
preprocessing.share_xml_data_extraction.extract_questions_and_procedures_w8(subnode, df_questions, df_procedures, parent_map, name, splitter, country_language, output_source_questionnaire_flag)
Extracts the questions and procedures text segments from SHARE wave 8 XML files.
Parameters: - df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.
- df_procedures (param2) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
- parent_map (param3) – a dictionary that maps each child to its parent in the XML tree.
- name (param4) – name node attribute inside XML file
- splitter (param5) – Sentence segmenter object from NLTK
- country_language (param6) – country and language metadata, contained in the filename
- output_source_questionnaire_flag (param7) – indicates if the data to be extracted is in the source (1) or the target language (any other value).
Returns: df_questions and df_procedures (pandas dataframes), filled with the text segments extracted from the questions and procedures nodes.
-
preprocessing.share_xml_data_extraction.fill_extraction(text)
Retrieves all dynamic fills (if there are any) from a given SHARE text segment, so that these fills can later be replaced by their natural language text definitions.
Parameters: text (param1) – the text segment.
Returns: either a list of fills (list of strings), or null if there are no matching fills in the text segment.
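Fill retrieval of this kind is typically a regular-expression search. The sketch below assumes, purely for illustration, that fills are written as {FL_NAME} inside the text; the actual fill syntax in the SHARE XML files is not stated in this documentation, and the find_fills helper and FILL_PATTERN are hypothetical.

```python
import re

# Hypothetical fill syntax: an identifier starting with FL, wrapped in braces.
FILL_PATTERN = re.compile(r"\{(FL\w+)\}")


def find_fills(text):
    """Return the list of fill names found in `text`, or None if there are none."""
    fills = FILL_PATTERN.findall(text)
    return fills or None
```

Returning None rather than an empty list mirrors the documented behaviour of returning null when no fill matches, so callers can test the result directly.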
-
preprocessing.share_xml_data_extraction.fill_substitution_in_answer(text, fills, df_procedures)
Substitutes the fills in the answer text segments. The fill is substituted only if it was found in the procedure nodes (this can be checked by filtering the df_procedures dataframe by the fill present in the answer segment).
Parameters: - text (param1) – the answer text segment.
- fills (param2) – the list of fills that are present in the text segment. Effectively, for answers the fill list has just one element.
- df_procedures (param3) – a dataframe that stores the contents of the procedures nodes, where the fill definitions are.
Returns: the answer text segment (string), with the fill substituted if its definition was found in df_procedures.
-
preprocessing.share_xml_data_extraction.fill_unrolling(text, fills, df_procedures, df_questionnaire, survey_item_id, item_name, share_modules, study, item_type)
Replaces all dynamic fills found in a given text segment by their string definitions in the df_procedures dataframe.
Parameters: - text (param1) – the text segment that contains at least one dynamic fill.
- fills (param2) – the list of dynamic fills found in the text segment passed as parameter.
- df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
- df_questionnaire (param4) – a dataframe to hold the final questionnaire.
- item_name (param5) – the item name metadata, extracted in previous steps.
- share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.
- study (param7) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.
- item_type (param8) – the item type metadata, extracted in previous steps.
Returns: the updated df_questionnaire (pandas dataframe), in which the dynamic fills of the text segment have been replaced.
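One plausible reading of "unrolling" is that a fill with several definitions yields one text variant per definition. The sketch below illustrates that reading only; it is an assumption, not the documented behaviour. The unroll_fills helper, the {FL_NAME} fill syntax and the plain dictionary standing in for df_procedures are all hypothetical.

```python
from itertools import product


def unroll_fills(text, fills, definitions):
    """Return one text variant per combination of fill definitions."""
    # A fill without a known definition is left untouched in the text.
    options = [definitions.get(fill, ["{" + fill + "}"]) for fill in fills]
    variants = []
    for combo in product(*options):
        variant = text
        for fill, value in zip(fills, combo):
            variant = variant.replace("{" + fill + "}", value)
        variants.append(variant)
    return variants


# Hypothetical definitions, standing in for rows of df_procedures.
definitions = {"FL_HE_SHE": ["he", "she"]}
variants = unroll_fills("Did {FL_HE_SHE} work?", ["FL_HE_SHE"], definitions)
```

Each variant could then be stored as its own row in the questionnaire, which would explain why the function receives df_questionnaire and the item metadata.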
-
preprocessing.share_xml_data_extraction.filter_items_to_build_questionnaire_structure_w7(df_questions, df_answers, df_procedures, df_questionnaire, survey_item_prefix, share_modules, special_answer_categories, study)
Filters the question and answer dataframes by the tmt_ids. Only segments with the same tmt_id are considered as alignment candidates. Calls the build_questionnaire_structure() function to build the final questionnaire from df_questions, df_answers and df_procedures.
Parameters: - df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.
- df_answers (param2) – the dataframe that holds the answer segments extracted from the XML nodes in previous steps.
- df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
- df_questionnaire (param4) – a dataframe to hold the final questionnaire.
- survey_item_prefix (param5) – the prefix of survey item IDs, either embedded in the filename or hard-coded in the case of ENG_SOURCE data extraction.
- share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.
- special_answer_categories (param7) – a language-specific instantiated object that contains string definitions of special answer categories (Don’t know, Refuse, etc.).
- study (param8) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.
Returns: The final SHARE wave 7 questionnaire, stored in df_questionnaire (pandas dataframe), after passing through the build_questionnaire_structure() function.
-
preprocessing.share_xml_data_extraction.filter_items_to_build_questionnaire_structure_w8(df_questions, df_answers, df_procedures, df_questionnaire, survey_item_prefix, share_modules, special_answer_categories, study)
Filters the question and answer dataframes by the item name. Only segments with the same item name are considered as alignment candidates. Calls the build_questionnaire_structure() function to build the final questionnaire from df_questions, df_answers and df_procedures.
Parameters: - df_questions (param1) – the dataframe that holds the question segments extracted from the XML nodes in previous steps.
- df_answers (param2) – the dataframe that holds the answer segments extracted from the XML nodes in previous steps.
- df_procedures (param3) – the dataframe that holds the procedures for fills substitution, extracted from the XML nodes in previous steps.
- df_questionnaire (param4) – a dataframe to hold the final questionnaire.
- survey_item_prefix (param5) – the prefix of survey item IDs, either embedded in the filename or hard-coded in the case of ENG_SOURCE data extraction.
- share_modules (param6) – a dictionary (round dependent) with the full name of all SHARE modules.
- special_answer_categories (param7) – a language-specific instantiated object that contains string definitions of special answer categories (Don’t know, Refuse, etc.).
- study (param8) – the study metadata, embedded in the input filename or hard-coded in the case of ENG_SOURCE data extraction.
Returns: The final SHARE wave 8 questionnaire, stored in df_questionnaire (pandas dataframe), after passing through the build_questionnaire_structure() function.
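Both filter functions restrict alignment to segments that share a key: tmt_id for wave 7 and item name for wave 8. The grouping idea can be sketched as follows, with lists of dictionaries standing in for the dataframes; the helper names and the row layout are hypothetical.

```python
from collections import defaultdict


def group_by_key(rows, key):
    """Group rows (dictionaries) by the value stored under `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def alignment_candidates(questions, answers, key):
    """Pair question and answer rows that share the same key value."""
    question_groups = group_by_key(questions, key)
    answer_groups = group_by_key(answers, key)
    return {
        value: (question_rows, answer_groups.get(value, []))
        for value, question_rows in question_groups.items()
    }


questions = [{"item_name": "DN001", "text": "How old are you?"}]
answers = [
    {"item_name": "DN001", "text": "Age in years"},
    {"item_name": "PH003", "text": "Excellent"},
]
candidates = alignment_candidates(questions, answers, "item_name")
```

Switching the key argument from "item_name" to "tmt_id" gives the wave 7 variant of the same filtering step.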
-
preprocessing.share_xml_data_extraction.get_module_metadata(item_name, share_modules)
Gets the module to which a given survey item pertains, based on the survey item name.
Parameters: - item_name (param1) – item_name metadata, extracted directly from the input XML file.
- share_modules (param2) – a dictionary of module names (taken from SHARE website), encapsulated in the SHAREModules object.
Returns: module (string) the module name.
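A lookup of this kind can be sketched as a dictionary access keyed by a module code taken from the item name. The sketch assumes that the leading letters of the item name form the module code; that convention, the get_module helper, the example dictionary entries and the "No module" fallback are all assumptions made for illustration.

```python
import re

# Illustrative entries only; the real dictionary comes from the SHAREModules object.
example_modules = {"DN": "Demographics", "PH": "Physical Health"}


def get_module(item_name, share_modules):
    """Return the module name for `item_name`, or a fallback if it is unknown."""
    # Assumption: the module code is the run of letters at the start of the name.
    match = re.match(r"[A-Za-z]+", item_name)
    code = match.group(0).upper() if match else ""
    return share_modules.get(code, "No module")
```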
-
preprocessing.share_xml_data_extraction.main(filename)
Flag that indicates if the data to be extracted is from the source or the target questionnaire.
-
preprocessing.share_xml_data_extraction.replace_fill_in_answer(text)
Substitutes certain fills in the answer text segments with fixed values.
Parameters: text (param1) – the answer text segment.
Returns: the answer text (string) where the fills were replaced (if present in original string).
-
preprocessing.share_xml_data_extraction.replace_untranslated_instructions(country_language, text)
Replaces certain dynamic fills that are not defined in the input file by language-dependent fixed values.
Parameters: - country_language (param1) – country and language metadata, contained in the filename.
- text (param2) – the text segment.
Returns: The text segment (string) without certain dynamic fills (if there were any).
-
preprocessing.share_xml_data_extraction.set_initial_structures(filename, output_source_questionnaire_flag)
Sets the initial structures that are necessary for the extraction of each questionnaire.
Parameters: filename (param1) – name of the input file.
Returns: df_questionnaire to store questionnaire data (pandas dataframe), survey_item_prefix, which is the prefix of survey_item_ID (string), study and country_language, which are metadata parameters embedded in the file name (string and string), and a sentence splitter to segment request/introduction/instruction segments when necessary (NLTK object).
-
preprocessing.share_xml_data_extraction.split_answer_text_item_value_from_categories(text)
Splits the answer text and its item value in the category node.
Parameters: text (param1) – text from category node, containing item value and answer text segment.
Returns: item_value (string) and answer text segment (string).
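Splitting a category string into its value and its text can be sketched with a single regular expression. The format assumed here, a leading integer optionally followed by "." or ")" and then the answer text (for example "1. Yes"), is a guess for illustration; the split_item_value helper is hypothetical and the real category nodes may be laid out differently.

```python
import re


def split_item_value(text):
    """Return (item_value, answer_text); item_value is None if no number leads."""
    # Assumed layout: integer, optional '.' or ')', whitespace, answer text.
    match = re.match(r"(-?\d+)[.)]?\s+(.+)", text.strip(), flags=re.DOTALL)
    if match is None:
        return None, text.strip()
    return match.group(1), match.group(2)
```

The item value is kept as a string, matching the documented return types.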