Semantic graph

@class:Semgraph: The main class of the semantic graph module. It builds a semantic graph from the provided data and parameters, and is most convenient to use on data already processed by the geocoder.

The Semgraph class has the following methods:

@method:clean_from_dublicates: Removes duplicate rows from a DataFrame based on the specified columns.

@method:clean_from_digits: Removes digits from the text in the specified column of the input DataFrame.

@method:clean_from_toponyms: Cleans the text in the specified text column by removing any words that match the toponyms in the name and toponym columns.

@method:aggregate_data: Creates a new DataFrame by aggregating the data based on the provided text and toponyms columns.
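The cleaning steps listed above can be sketched in plain Python. This is an illustrative stand-in, not the library's implementation: the helper names (`drop_duplicates`, `strip_digits`, `strip_toponyms`), the row layout, and the sample messages are all hypothetical, and a list of dicts is used in place of a DataFrame so the example runs without extra dependencies.

```python
import re

def drop_duplicates(rows, columns):
    """Keep the first row for each distinct combination of values in `columns`."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def strip_digits(text):
    """Remove every digit, then collapse the whitespace left behind."""
    without_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", without_digits).strip()

def strip_toponyms(text, toponyms):
    """Drop words matching any word of the given toponyms (case-insensitive)."""
    banned = {word.lower() for toponym in toponyms for word in toponym.split()}
    kept = [w for w in text.split() if w.strip(".,!?;:").lower() not in banned]
    return " ".join(kept)

# Hypothetical input: messages with a toponym already attached by a geocoder.
rows = [
    {"id": 1, "text": "Pothole at Nevsky Prospekt 25", "toponym": "Nevsky Prospekt"},
    {"id": 2, "text": "Pothole at Nevsky Prospekt 25", "toponym": "Nevsky Prospekt"},
    {"id": 3, "text": "Broken lamp near Sadovaya 12", "toponym": "Sadovaya"},
]

cleaned = []
for row in drop_duplicates(rows, ["text", "toponym"]):
    text = strip_digits(row["text"])
    text = strip_toponyms(text, [row["toponym"]])
    cleaned.append({**row, "text": text})
```

Removing toponyms before keyword extraction keeps place names from being picked up as keywords, which would otherwise duplicate the toponym nodes of the graph.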

class soika.src.semantic_graph.semantic_graph_builder.Semgraph(bert_name: str = 'DeepPavlov/rubert-base-cased', language: str = 'russian', device: str = 'cpu')[source]

This is the main class of the semantic graph module. It builds a semantic graph from the provided data and parameters, and is most convenient to use on data already processed by the geocoder.

Parameters:

bert_name (str) – The name of the BERT model to use. Defaults to 'DeepPavlov/rubert-base-cased'.
language (str) – The language of the BERT model. Defaults to 'russian'.
device (str) – The device to use for inference. Defaults to 'cpu'.

build_graph(data: DataFrame | GeoDataFrame, id_column: str, text_column: str, text_type_column: str, toponym_column: str, toponym_name_column: str, toponym_type_column: str, post_id_column: str, parents_stack_column: str, directed: bool = True, location_column: str | None = None, geometry_column: str | None = None, key_score_filter: float = 0.6, semantic_score_filter: float = 0.75, top_n: int = 1) → Graph[source]

Build a graph based on the provided data.

Parameters:

data (pd.DataFrame or gpd.GeoDataFrame) – The input data to build the graph from.
id_column (str) – The column containing unique identifiers.
text_column (str) – The column containing text information.
text_type_column (str) – The column indicating the type of text.
toponym_column (str) – The column containing toponym information.
toponym_name_column (str) – The column containing toponym names.
toponym_type_column (str) – The column containing toponym types.
post_id_column (str) – The column containing post identifiers.
parents_stack_column (str) – The column containing parent-child relationships.
directed (bool) – Flag indicating if the graph is directed. Defaults to True.
location_column (str or None) – The column containing location information. Defaults to None.
geometry_column (str or None) – The column containing geometry information. Defaults to None.
key_score_filter (float) – The threshold for key score filtering. Defaults to 0.6.
semantic_score_filter (float) – The threshold for semantic score filtering. Defaults to 0.75.
top_n (int) – The number of top keywords to extract. Defaults to 1.

Returns:

The constructed graph.

Return type:

nx.classes.graph.Graph

static convert_df_to_edge_df(data: DataFrame | GeoDataFrame, toponym_column: str, word_info_column: str = 'words_score') → DataFrame | GeoDataFrame[source]

update_graph(G: Graph, data: DataFrame | GeoDataFrame, id_column: str, text_column: str, text_type_column: str, toponym_column: str, toponym_name_column: str, toponym_type_column: str, post_id_column: str, parents_stack_column: str, directed: bool = True, counts_attribute: str | None = None, location_column: str | None = None, geometry_column: str | None = None, key_score_filter: float = 0.6, semantic_score_filter: float = 0.75, top_n: int = 1) → Graph[source]

Update the input graph based on the provided data, returning the updated graph.

Parameters:

G (nx.classes.graph.Graph) – The input graph to be updated.
data (pd.DataFrame or gpd.GeoDataFrame) – The input data to update the graph.
id_column (str) – The column containing unique identifiers.
text_column (str) – The column containing text information.
text_type_column (str) – The column indicating the type of text.
toponym_column (str) – The column containing toponym information.
toponym_name_column (str) – The column containing toponym names.
toponym_type_column (str) – The column containing toponym types.
post_id_column (str) – The column containing post identifiers.
parents_stack_column (str) – The column containing parent-child relationships.
directed (bool) – Flag indicating if the graph is directed. Defaults to True.
counts_attribute (str or None) – The attribute to be used for counting. Defaults to None.
location_column (str or None) – The column containing location information. Defaults to None.
geometry_column (str or None) – The column containing geometry information. Defaults to None.
key_score_filter (float) – The threshold for key score filtering. Defaults to 0.6.
semantic_score_filter (float) – The threshold for semantic score filtering. Defaults to 0.75.
top_n (int) – The number of top keywords to extract. Defaults to 1.

Returns:

The updated graph.

Return type:

nx.classes.graph.Graph
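One possible reading of updating a graph with a counting attribute is sketched below. It is only an assumption about the general idea, not a description of how update_graph() works internally: the function `merge_edges`, the dict-based graph representation, and the attribute name "counts" are hypothetical and chosen so the example needs no third-party packages.

```python
def merge_edges(graph, new_edges, counts_attribute="counts"):
    """Add edges to `graph` in place.

    `graph` maps (source, target) pairs to attribute dicts. An edge that is
    already present has its counter incremented instead of being duplicated.
    """
    for source, target in new_edges:
        attrs = graph.setdefault((source, target), {})
        attrs[counts_attribute] = attrs.get(counts_attribute, 0) + 1
    return graph

# Hypothetical existing graph with a single toponym-keyword edge.
graph = {("Sadovaya", "lamp"): {"counts": 1}}

# New data repeats one known edge and introduces one new edge.
updated = merge_edges(graph, [("Sadovaya", "lamp"), ("Nevsky Prospekt", "pothole")])
```

Because the edge keys are ordered pairs, this sketch corresponds to a directed graph; an undirected variant would normalize each pair (for example by sorting it) before the lookup.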

The main method, Semgraph.build_graph(), cleans the input set of messages of duplicates, digits, identified place names and references. For each message, a given number of keywords is extracted with a KeyBERT model; the semantic proximity between keywords is then computed with PyTorch as the cosine distance between their embeddings. The module's final result is a graph whose nodes are toponyms (obtained by the geolocation module) and keywords.
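The pipeline described above can be illustrated with a small, dependency-free sketch. Everything in it is an assumption made for illustration: the keyword scores and three-dimensional embeddings are invented toy values (the library obtains them from KeyBERT and a BERT model), the two thresholds are assumed to act as minimum values, and a plain dict stands in for the NetworkX graph. Cosine similarity is computed here; the corresponding cosine distance is 1 minus this value.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical keyword-extraction output: toponym -> [(keyword, key score), ...]
keywords_by_toponym = {
    "Nevsky Prospekt": [("pothole", 0.82), ("traffic", 0.70)],
    "Sadovaya": [("lamp", 0.77), ("noise", 0.30)],
}

# Hypothetical toy embeddings; a real model returns hundreds of dimensions.
embeddings = {
    "pothole": [1.0, 0.0, 0.0],
    "traffic": [0.8, 0.6, 0.0],
    "lamp": [0.0, 1.0, 0.0],
}

KEY_SCORE_FILTER = 0.6        # assumed: minimum keyword score to keep a keyword
SEMANTIC_SCORE_FILTER = 0.75  # assumed: minimum similarity for a keyword-keyword edge

edges = {}  # (node, node) -> attribute dict

# Toponym-keyword edges for keywords that pass the key score threshold.
for toponym, scored_keywords in keywords_by_toponym.items():
    for keyword, score in scored_keywords:
        if score >= KEY_SCORE_FILTER:
            edges[(toponym, keyword)] = {"kind": "toponym-keyword", "score": score}

# Keyword-keyword edges for sufficiently similar pairs of kept keywords.
kept_keywords = sorted({keyword for _, keyword in edges})
for i, first in enumerate(kept_keywords):
    for second in kept_keywords[i + 1:]:
        similarity = cosine_similarity(embeddings[first], embeddings[second])
        if similarity >= SEMANTIC_SCORE_FILTER:
            edges[(first, second)] = {"kind": "keyword-keyword", "score": similarity}

# Nodes are every toponym and keyword that appears in at least one edge.
nodes = {node for edge in edges for node in edge}
```

In this toy data, "noise" falls below the key score threshold and never enters the graph, while "pothole" and "traffic" are similar enough to be linked directly.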