Semantic graph

@class:Semgraph: The main class of the semantic graph module. It builds a semantic graph from the provided data and parameters, and is most convenient to use on data that has already been processed by the geocoder.

The Semgraph class has the following methods:

@method:clean_from_dublicates: Removes duplicate rows from a DataFrame based on the specified columns.

@method:clean_from_digits: Removes digits from the text in the specified column of the input DataFrame.

@method:clean_from_toponyms: Cleans the text in the specified text column by removing any words that match the toponyms in the name and toponym columns.

@method:aggregate_data: Creates a new DataFrame by aggregating the data based on the provided text and toponyms columns.
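To make the cleaning steps listed above concrete, here is a minimal sketch of the same ideas on plain Python dictionaries. It is not the library's implementation: the helper names (`drop_duplicates`, `strip_digits`, `strip_toponyms`), the sample rows, and the word-matching rule are all assumptions made for illustration, and the real methods operate on DataFrames.

```python
import re


def drop_duplicates(rows, keys):
    """Keep the first row for each distinct combination of values in `keys`."""
    seen = set()
    unique = []
    for row in rows:
        marker = tuple(row[k] for k in keys)
        if marker not in seen:
            seen.add(marker)
            unique.append(row)
    return unique


def strip_digits(text):
    """Remove digit characters and collapse the leftover whitespace."""
    return " ".join(re.sub(r"\d+", "", text).split())


def strip_toponyms(text, toponyms):
    """Drop words that match any word of the given toponyms (case-insensitive)."""
    banned = {word.lower() for toponym in toponyms for word in toponym.split()}
    kept = [w for w in text.split() if w.lower().strip(".,!?") not in banned]
    return " ".join(kept)


# Hypothetical input rows; the column names are illustrative only.
rows = [
    {"id": 1, "text": "Fix the road on Nevsky prospect 12", "toponym": "Nevsky prospect"},
    {"id": 2, "text": "Fix the road on Nevsky prospect 12", "toponym": "Nevsky prospect"},
    {"id": 3, "text": "No lights near Sadovaya street", "toponym": "Sadovaya street"},
]

cleaned = []
for row in drop_duplicates(rows, keys=["text", "toponym"]):
    text = strip_digits(row["text"])
    text = strip_toponyms(text, [row["toponym"]])
    cleaned.append({**row, "text": text})

for row in cleaned:
    print(row["id"], row["text"])
```

The order matters in this sketch: duplicates are dropped first so that the later text transformations run only once per distinct message.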

class soika.src.semantic_graph.semantic_graph_builder.Semgraph(bert_name: str = 'DeepPavlov/rubert-base-cased', language: str = 'russian', device: str = 'cpu')

This is the main class of the semantic graph module. It builds a semantic graph from the provided data and parameters, and is most convenient to use on data that has already been processed by the geocoder.

Parameters:
  • bert_name (str) – The name of the BERT model to use. Defaults to 'DeepPavlov/rubert-base-cased'.

  • language (str) – The language of the BERT model. Defaults to 'russian'.

  • device (str) – The device to use for inference. Defaults to 'cpu'.

build_graph(data: DataFrame | GeoDataFrame, id_column: str, text_column: str, text_type_column: str, toponym_column: str, toponym_name_column: str, toponym_type_column: str, post_id_column: str, parents_stack_column: str, directed: bool = True, location_column: str | None = None, geometry_column: str | None = None, key_score_filter: float = 0.6, semantic_score_filter: float = 0.75, top_n: int = 1) → Graph

Build a graph based on the provided data.

Parameters:
  • data (pd.DataFrame or gpd.GeoDataFrame) – The input data to build the graph from.

  • id_column (str) – The column containing unique identifiers.

  • text_column (str) – The column containing text information.

  • text_type_column (str) – The column indicating the type of text.

  • toponym_column (str) – The column containing toponym information.

  • toponym_name_column (str) – The column containing toponym names.

  • toponym_type_column (str) – The column containing toponym types.

  • post_id_column (str) – The column containing post identifiers.

  • parents_stack_column (str) – The column containing parent-child relationships.

  • directed (bool) – Flag indicating if the graph is directed. Defaults to True.

  • location_column (str or None) – The column containing location information. Defaults to None.

  • geometry_column (str or None) – The column containing geometry information. Defaults to None.

  • key_score_filter (float) – The threshold for key score filtering. Defaults to 0.6.

  • semantic_score_filter (float) – The threshold for semantic score filtering. Defaults to 0.75.

  • top_n (int) – The number of top keywords to extract. Defaults to 1.

Returns:

The constructed graph.

Return type:

nx.classes.graph.Graph
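A call to build_graph() might be wired up as in the sketch below. The import path and parameter names are taken from the signatures on this page; every column name on the right-hand side of `COLUMN_ARGS` is hypothetical and must be replaced with the columns of your own DataFrame. The wrapper function `build_semantic_graph` is also an illustrative helper, not part of the library.

```python
# Mapping from build_graph() parameter names to DataFrame column names.
# The column names (values) are hypothetical placeholders.
COLUMN_ARGS = {
    "id_column": "id",
    "text_column": "text",
    "text_type_column": "type",
    "toponym_column": "toponym",
    "toponym_name_column": "toponym_name",
    "toponym_type_column": "toponym_type",
    "post_id_column": "post_id",
    "parents_stack_column": "parents_stack",
}


def build_semantic_graph(data, device="cpu"):
    """Build a semantic graph from `data` using the documented defaults."""
    # Imported lazily so this module can be loaded without SOIKA installed.
    from soika.src.semantic_graph.semantic_graph_builder import Semgraph

    builder = Semgraph(device=device)
    return builder.build_graph(
        data,
        **COLUMN_ARGS,
        directed=True,
        key_score_filter=0.6,        # documented default
        semantic_score_filter=0.75,  # documented default
        top_n=1,                     # documented default
    )
```

Keeping the column mapping in one dictionary means the same mapping can be passed to update_graph(), which takes the same column parameters.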

static convert_df_to_edge_df(data: DataFrame | GeoDataFrame, toponym_column: str, word_info_column: str = 'words_score') → DataFrame | GeoDataFrame

update_graph(G: Graph, data: DataFrame | GeoDataFrame, id_column: str, text_column: str, text_type_column: str, toponym_column: str, toponym_name_column: str, toponym_type_column: str, post_id_column: str, parents_stack_column: str, directed: bool = True, counts_attribute: str | None = None, location_column: str | None = None, geometry_column: str | None = None, key_score_filter: float = 0.6, semantic_score_filter: float = 0.75, top_n: int = 1) → Graph

Update the input graph based on the provided data, returning the updated graph.

Parameters:
  • G (nx.classes.graph.Graph) – The input graph to be updated.

  • data (pd.DataFrame or gpd.GeoDataFrame) – The input data to update the graph.

  • id_column (str) – The column containing unique identifiers.

  • text_column (str) – The column containing text information.

  • text_type_column (str) – The column indicating the type of text.

  • toponym_column (str) – The column containing toponym information.

  • toponym_name_column (str) – The column containing toponym names.

  • toponym_type_column (str) – The column containing toponym types.

  • post_id_column (str) – The column containing post identifiers.

  • parents_stack_column (str) – The column containing parent-child relationships.

  • directed (bool) – Flag indicating if the graph is directed. Defaults to True.

  • counts_attribute (str or None) – The attribute to be used for counting. Defaults to None.

  • location_column (str or None) – The column containing location information. Defaults to None.

  • geometry_column (str or None) – The column containing geometry information. Defaults to None.

  • key_score_filter (float) – The threshold for key score filtering. Defaults to 0.6.

  • semantic_score_filter (float) – The threshold for semantic score filtering. Defaults to 0.75.

  • top_n (int) – The number of top keywords to extract. Defaults to 1.

Returns:

The updated graph.

Return type:

nx.classes.graph.Graph

The main method, Semgraph.build_graph(), first cleans the input messages of duplicates, digits, identified place names and references. It then extracts a given number of keywords (top_n) from each message using a KeyBERT model and, using PyTorch, measures the semantic proximity between keywords as the cosine distance between their embeddings. The final result of the module is a graph whose nodes are toponyms (obtained by the geolocation module) and keywords.
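The keyword-to-keyword step can be illustrated with a small, dependency-free sketch. It computes cosine similarity (one minus the cosine distance) between toy vectors and keeps only the pairs that reach a threshold. The three-dimensional vectors, the function names, and the assumption that semantic_score_filter is applied as a lower bound on similarity are all illustrative; the library itself works with BERT embeddings and PyTorch.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keyword_edges(embeddings, threshold=0.75):
    """Return (keyword, keyword, score) for every pair scoring at least `threshold`."""
    edges = []
    for a, b in combinations(sorted(embeddings), 2):
        score = cosine_similarity(embeddings[a], embeddings[b])
        if score >= threshold:
            edges.append((a, b, round(score, 3)))
    return edges


# Toy "embeddings"; real BERT embeddings have hundreds of dimensions.
embeddings = {
    "road": [1.0, 0.0, 0.0],
    "asphalt": [0.9, 0.1, 0.0],
    "lighting": [0.0, 1.0, 0.0],
}

edges = keyword_edges(embeddings, threshold=0.75)
print(edges)
```

In this toy data only "asphalt" and "road" point in nearly the same direction, so they form the single keyword-to-keyword edge; "lighting" is orthogonal to "road" and stays unconnected.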