mwir

The My Web Intelligence Package for R

https://github.com/mywebintelligence/mwir

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: researchgate.net, academia.edu, scholar.google
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.3%) to scientific vocabulary
Last synced: 6 months ago · JSON representation

Repository

The My Web Intelligence Package for R

Basic Info
  • Host: GitHub
  • Owner: MyWebIntelligence
  • License: other
  • Language: R
  • Default Branch: main
  • Size: 11.9 MB
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Open Issues: 3
  • Releases: 0
Created over 1 year ago · Last pushed 9 months ago
Metadata Files
Readme License Codemeta

README.Rmd

---
output: md_document
---

# mwiR 0.8.0 (Beta) The R Package of My Web Intelligence Project





## Purpose of My Web Intelligence

**Context and Objectives**

My Web Intelligence (MWI) is a project designed to meet the growing need for tools and methodologies in the field of digital methods in social sciences and information and communication sciences (ICS). The main objective is to map the digital ecosystem to identify key actors, assess their influence, and analyze their discourses and interactions. This project addresses the increasing centrality of digital information and interactions in various fields, including health, politics, culture, and beyond.

### About the Author

**Amar LAKEL**

Amar Lakel is a researcher in information and communication sciences, specializing in digital methods applied to social studies. He is currently a member of the MICA laboratory (Mediation, Information, Communication, Arts) at the University of Bordeaux Montaigne. His work focuses on the analysis of online discourse, mapping digital ecosystems, and the impact of digital technologies on social and cultural practices.

#### Online Profiles

-   **MICA Labo**: [MICA Labo Profile](https://mica.u-bordeaux-montaigne.fr/amar-lakel/)
-   **Google Scholar**: [Google Scholar Profile](https://scholar.google.com/citations?user=hqquhfwAAAAJ)
-   **ORCID**: [ORCID Profile](https://orcid.org/0000-0002-1234-5678)
-   **ResearchGate**: [ResearchGate Profile](https://www.researchgate.net/profile/Amar_Lakel)
-   **Academia**: [Academia Profile](https://univ-bordeaux.academia.edu/AmarLakel)
-   **Twitter**: [Twitter MyWebIntel Profile](https://twitter.com/mywebintel)
-   **LinkedIn**: [LinkedIn Profile](https://www.linkedin.com/in/amar-lakel-123456789/)

## Methodology

**Research Protocol**

The research protocol of MWI relies on a combination of quantitative and qualitative methods:

1.  **Data Extraction and Archiving**: Using crawl technologies to collect data from the web.
2.  **Data Qualification and Annotation**: Applying algorithms to analyze, classify, and annotate the data.
3.  **Data Visualization**: Developing dashboards and relational maps to interpret the results.

**Methodological Challenges**

The MWI project utilizes techniques from the sociology of controversies, social network analysis, and text mining methods to:

-   Analyze the strategic positions of speakers in a heterogeneous and complex digital corpus.
-   Identify and understand the dynamics of online discourses.
-   Map the relationships between different actors and their respective influences.

## Case Studies

**Diverse Cases**

1.  **Health Information**

-   **Asthma and Diabetes in Children**: Studies of online discourses related to these diseases to identify influential actors, understand their positions, and evaluate their impact on patients' perceptions and behaviors. [Source](https://journals.openedition.org/rfsic/8376)

2.  **Online Political Controversy**

-   **Juan Branco Project**: Analysis of discourses and influence surrounding the public figure Juan Branco, exploring the dynamics of positioning and controversy. [Source](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4584133)

3.  **Research Sociology**

-   **Digital Humanities**: Studies on the impact of digital technologies on humanities and social sciences, including how researchers use the web to disseminate and discuss their work. [Source](https://hal.science/hal-02485370)

**Results and Impact**

The results of these studies show that online discourses play a crucial role in shaping opinions and behaviors in various fields. They also highlight the importance for researchers and professionals to actively engage in these discussions to promote reliable and scientifically validated information.

## Repositories and Documentation

**NAKALA Repositories** The data and results of the MWI project are deposited on the NAKALA platform, providing open access for other researchers and practitioners. Here are some important repositories:

1.  [The collection](https://nakala.fr/collection/10.34847/nkl.b4aarv3j): Contains a detailed description of the project, methodology, and results.
2.  [Positions and Influences on the Web: The Case of Health Information](https://nakala.fr/prise-de-position): Detailed analysis of discourses on childhood asthma.
3.  [French Digital Humanities communities](https://nakala.fr/10.34847/nkl.f43by03n): A study case on French digital humanities development on the web.

# Development of the R Package

The R package developed within the framework of My Web Intelligence is designed to: - Facilitate the replication of analyses conducted in the project. - Enable the extension of developed methods and tools for other research. - Provide researchers and professionals with a powerful tool to understand and manage the dynamics of online information.

**Main Features**

-   **Project Management**: Tools to initiate and manage web exploration projects.
-   **Data Extraction**: Functions to crawl the web and extract data corpora.
-   **Analysis and Annotation**: Algorithms to analyze and annotate extracted data.
-   **Visualization**: Dashboards and maps to visualize relationships between actors and discourses.

## Conclusion

My Web Intelligence is an integrative project aimed at transforming how we understand and analyze digital information across various fields in social sciences and ICS. By combining innovative methodologies and advanced technological tools, MWI offers new perspectives on digital dynamics and proposes solutions to better understand online interactions and discourses. The R package developed from this project is an essential tool for researchers and practitioners, enabling them to fully exploit web data for in-depth and relevant analyses.

# Using mwiR : a study case

## Installation

You can install the development version of mwiR from [GitHub](https://github.com/) with:

```{r}

# install.packages("devtools")
devtools::install_github("MyWebIntelligence/mwiR")

```

## Project ('land') Setup

This is a basic example which shows you how to solve a common problem:

```{r example}
library(mwiR)
## basic example code
```

## Step 1: Creating the Research Project

In this step-by-step guide, we will walk through the initial setup and execution of a research project using the My Web Intelligence (MWI) method. This method allows researchers to analyze the impact of various factors, such as AI on work, by collecting and organizing web data. Here is a breakdown of the R script provided:

### 1. Load the Required Packages

```{r}
initmwi()
```

The `initmwi()` function initializes the My Web Intelligence environment by loading all necessary packages and setting up the environment for further operations. This function ensures that all dependencies and configurations are correctly initialized.

### 2. Set Up the Database

```{r}
db_setup()
```

The `db_setup()` function sets up the database needed for storing and managing the data collected during the research project. It initializes the necessary database schema and ensures that the database is ready for data insertion and retrieval.

-   `db_name`: A string specifying the name of the SQLite database file. Default is `"mwi.db"`.

### 3. Create a Research Project (Land)

```{r}
create_land(name = "AIWork", desc = "Impact of AI on work", lang="en")
```

The `create_land()` function creates a new research project, referred to as a "land" in MWI terminology. This land will serve as the container for all data and analyses related to the project.

-   `name`: A string specifying the name of the land.
-   `desc`: A string providing a description of the land.
-   `lang`: A string specifying the language of the land. Default is `"en"`.
-   `db_name`: A string specifying the name of the SQLite database file. Default is `"mwi.db"`.

### 4. Add Search Terms

```{r}
addterm("AIWork", "AI, artificial intelligence, work, employment, job, profession, labor market")
```

The `addterm()` function adds search terms to the project. These terms will be used to crawl and collect relevant web data.

-   `land_name`: A string specifying the name of the land.
-   `terms`: A comma-separated string of terms to add.

### 5. Verify the Project Creation

```{r}
listlands("AIWork")
```

The `listlands()` function lists all lands or projects that have been created. By specifying the project name "AIWork", it verifies that the project has been successfully created.

-   `land_name`: A string specifying the name of the land to list. If `NULL`, all lands are listed. Default is `NULL`.
-   `db_name`: A string specifying the name of the SQLite database file. Default is `"mwi.db"`.

### 6. Add URLs Manually or Using a File

```{r}
addurl("AIWork", urls = "https://www.fr.adp.com/rhinfo/articles/2022/11/la-disparition-de-certains-metiers-est-elle-a-craindre.aspx")
```

The `addurl()` function adds URLs to the project. These URLs point to web pages that contain relevant information for the research.

-   `land_name`: A string specifying the name of the land.
-   `urls`: A comma-separated string of URLs to add. Default is `NULL`.
-   `path`: A string specifying the path to a file containing URLs. Default is `NULL`.
-   `db_name`: A string specifying the name of the SQLite database file. Default is `"mwi.db"`.

Alternatively, URLs can be added using a text file:

```{r}
# If using a text file

addurl("AIWork", path = "_ai_or_artificial_intelligence___work_or_employment_or_job_or_profession_or_labor_market01.txt")
```

-   `path`: The path to a text file containing the URLs to be added.

### 7. List the Projects or a Specific Project

```{r}
listlands("AIWork")
```

This function is used again to list the projects or a specific project, ensuring that the URLs have been added correctly to "AIWork".

### 8. Optionally Delete a Project

```{r}
deleteland(land_name = "AIWork")

```

The `deleteland()` function deletes a specified project. This can be useful for cleaning up after the research is completed or if a project needs to be restarted.

-   `land_name`: A string specifying the name of the land to delete.
-   `maxrel`: An integer specifying the maximum relevance for expressions to delete. Default is `NULL`.
-   `db_name`: A string specifying the name of the SQLite database file. Default is `"mwi.db"`.

This script demonstrates the basic setup and execution of a research project using My Web Intelligence, including project creation, term addition, URL management, and project verification.

## Step 2: Crawling

In this section, we will walk through the process of crawling URLs and extracting content for analysis using the My Web Intelligence (MWI) method. The following R code snippets demonstrate how to perform these tasks.

### Crawl URLs for a Specific Land

```{r}
crawlurls("IATravail", limit = 10)
```

The `crawlurls()` function crawls URLs for a specific land, updates the database, and calculates relevance scores.

-   `land_name`: A character string representing the name of the land.
-   `urlmax`: An integer specifying the maximum number of URLs to be processed (default is 50).
-   `limit`: An optional integer specifying the limit on the number of URLs to crawl.
-   `http_status`: An optional character string specifying the HTTP status to filter URLs.
-   `db_name`: A string specifying the name of the SQLite database file. Default is `"mwi.db"`.

**Example:**

This example demonstrates crawling up to 10 URLs for the land named "IATravail".

```{r}
crawlurls("IATravail", limit = 10)
```

### Crawl Domains

```{r}
crawlDomain(1000)
```

The `crawlDomain()` function crawls domains and updates the Domain table with the fetched data.

-   `nburl`: An integer specifying the number of URLs to be crawled (default is 100).
-   `db_name`: A string specifying the name of the SQLite database file. Default is `"mwi.db"`.

**Example:**

This example demonstrates crawling 1000 URLs and updating the Domain table.

```{r}
crawlDomain(1000)
```

## Step 3: Export Files and Corpora

In this section, we will walk through the process of exporting data and corpora from a research project using the My Web Intelligence (MWI) method. The following R code snippets demonstrate how to perform these tasks.

### Export Land Data

```{r}
#type = ['pagecsv', 'pagegexf', 'fullpagecsv', 'nodecsv', 'nodegexf', 'mediacsv', 'corpus']

# Exemple d'utilisation "le projet", "type d'export", "relevance", "file"

export_land("giletsjaunes", "pagegexf", 3)
```

The `export_land()` function manages the exportation of land data based on the specified export type.

-   `land_name`: A character string specifying the name of the land.
-   `export_type`: A character string specifying the type of export. Options include `"pagecsv"`, `"fullpagecsv"`, `"nodecsv"`, `"mediacsv"`, `"pagegexf"`, `"nodegexf"`, or `"corpus"`.
-   `minimum_relevance`: A numeric value specifying the minimum relevance score for inclusion in the export. Default is `1`.
-   `labase`: A character string specifying the name of the database file. Default is `"mwi.db"`.

**Example:**

This example demonstrates exporting data for the project "giletsjaunes" with a minimum relevance score of 3 into a GEXF file.

```{r}

export_land("giletsjaunes", "pagegexf", 3)
```

Owner

  • Name: My Web Intelligence
  • Login: MyWebIntelligence
  • Kind: organization
  • Email: amar.lakel@u-bordeaux-montaigne.fr
  • Location: Bordeaux

CodeMeta (codemeta.json)

{
  "@context": [
    "https://doi.org/10.5063/schema/codemeta-2.0",
    "http://schema.org"
  ],
  "@type": "SoftwareSourceCode",
  "name": "mwiR",
  "description": "My Web Intelligence (MWI) is an R package designed to address the growing need for tools and methodologies in the field of digital methods within social sciences and information and communication sciences. The main objective is to map the digital ecosystem to identify key actors, assess their influence, and analyze their discourse and interactions.",
  "identifier": "https://doi.org/10.5281/zenodo.1234567",
  "applicationCategory": "Research Tool",
  "keywords": [
    "digital methods",
    "web intelligence",
    "social sciences",
    "data mining",
    "network analysis",
    "text analysis",
    "internet sociology",
    "R package"
  ],
  "license": "https://spdx.org/licenses/MIT",
  "programmingLanguage": [
    {
      "@type": "ComputerLanguage",
      "name": "R",
      "url": "https://www.r-project.org"
    },
    {
      "@type": "ComputerLanguage",
      "name": "Python",
      "url": "https://www.python.org"
    }
  ],
  "runtimePlatform": "R version 4.1.0 or later",
  "operatingSystem": [
    "Linux",
    "Windows",
    "macOS"
  ],
  "softwareRequirements": [
    "R (>= 4.0.0)",
    "Python (>= 3.6)",
    "trafilatura"
  ],
  "codeRepository": "https://github.com/MyWebIntelligence/mwiR",
  "issueTracker": "https://github.com/MyWebIntelligence/mwiR/issues",
  "downloadUrl": "https://github.com/MyWebIntelligence/mwiR/archive/refs/tags/v0.8.0-beta.zip",
  "releaseNotes": "Initial beta release with basic functionality for web crawling, data analysis, and visualization",
  "dateCreated": "2023-05-19",
  "datePublished": "2024",
  "version": "0.9.0",
  "developmentStatus": "beta",
  "funder": {
    "@type": "Organization",
    "name": "Universit Bordeaux Montaigne"
  },
  "funding": [
    {
      "@type": "Grant",
      "name": "Projet My Web Intelligence",
      "funder": {
        "@type": "Organization",
        "name": "MICA Labo, Universit de Bordeaux Montaigne"
      }
    }
  ],
  "isPartOf": "https://github.com/MyWebIntelligence",
  "referencePublication": [
    {
      "@type": "ScholarlyArticle",
      "name": "My Web Intelligence: Challenges of Heterogeneous Digital Corpora",
      "alternativeName": "My web intelligence : un outil pour l'analyse du web et des rseaux",
      "author": [
        {
          "@type": "Person",
          "givenName": "Amar",
          "familyName": "LAKEL",
          "affiliation": {
            "@type": "Organization",
            "name": "MICA - Mdiation, Information, Communication, Art",
            "parentOrganization": {
              "@type": "Organization",
              "name": "Universit Bordeaux Montaigne"
            }
          }
        }
      ],
      "abstract": "My Web Intelligence is a program directed by Amar LAKEL within the E3D team of the MICA Laboratory (MICA) at the University of Bordeaux Montaigne. The program aims to develop a tool for extracting (crawling), archiving, qualifying, and visualizing the Web in the service of digital methods. The objective is to provide, to all experts and researchers who wish to develop studies in digital intelligence and digital humanities, with a device based on the analysis of online speech[1].",
      "identifier": "hal-03233584",
      "url": "https://hal.science/hal-03233584",
      "datePublished": "2021",
      "publisher": {
        "@type": "Organization",
        "name": "HAL Science"
      },
      "inLanguage": [
        "eng",
        "fra"
      ]
    }
  ],
  "author": [
    {
      "@type": "Person",
      "givenName": "Amar",
      "familyName": "Lakel",
      "email": "amar.lakel@u-bordeaux-montaigne.fr",
      "affiliation": {
        "@type": "Organization",
        "name": "MICA Labo, Universit de Bordeaux Montaigne"
      },
      "@id": "https://orcid.org/0000-0002-1825-0097"
    }
  ],
  "contributor": [
    {
      "@type": "Person",
      "givenName": "Amar",
      "familyName": "Lakel",
      "email": "amar.lakel@u-bordeaux-montaigne.fr",
      "affiliation": {
        "@type": "Organization",
        "name": "University of Bordeaux Montaigne"
      }
    }
  ],
  "review": {
    "@type": "Review",
    "reviewAspect": "Accuracy",
    "reviewBody": "The package provides accurate results for web data analysis and visualization."
  }
}

GitHub Events

Total
  • Issues event: 3
  • Member event: 1
  • Push event: 10
Last Year
  • Issues event: 3
  • Member event: 1
  • Push event: 10

Dependencies

DESCRIPTION cran
  • DBI * imports
  • RSQLite * imports
  • SnowballC * imports
  • cld3 * imports
  • httr * imports
  • jsonlite * imports
  • lubridate * imports
  • mockery * imports
  • pdftools * imports
  • readr * imports
  • reticulate * imports
  • stringr * imports
  • tesseract * imports
  • tools * imports
  • urltools * imports
  • zip * imports
  • testthat >= 3.0.0 suggests