graph TD
  A[Input text] --> B[Tokenization]
  B --> C[Model processing]
  C --> D[Token generation]
  D --> E[Output text]
Introduction to Big Data in Social Sciences
Université de Montréal
| Dataset | Sampling prop. | Epochs | Disk size |
|---|---|---|---|
| CommonCrawl | 67.0% | 1.10 | 3.3 TB |
| C4 | 15.0% | 1.06 | 783 GB |
| Github | 4.5% | 0.64 | 328 GB |
| Wikipedia | 4.5% | 2.45 | 83 GB |
| Books | 4.5% | 2.23 | 85 GB |
| ArXiv | 2.5% | 1.06 | 92 GB |
| StackExchange | 2.0% | 1.03 | 78 GB |
Human vs. machine comparison
Tip
Do we have databases where human annotations are available?
Chapel Hill Expert Survey
Global Party Survey
| Category | LLaMA | GPT3 | OPT |
|---|---|---|---|
| Gender | 70.6 | 62.6 | 65.7 |
| Religion | 79.0 | 73.3 | 68.6 |
| Race/Color | 57.0 | 64.7 | 68.6 |
| Sexual orientation | 81.0 | 76.2 | 78.6 |
| Age | 70.1 | 64.4 | 67.8 |
| Nationality | 64.2 | 61.6 | 62.9 |
| Disability | 66.7 | 76.7 | 76.7 |
| Physical appearance | 77.8 | 74.6 | 76.2 |
| Socioeconomic status | 71.5 | 73.8 | 76.2 |
| Average | 66.6 | 67.2 | 69.5 |
| Provider | Strengths | Cost | Limitations |
|---|---|---|---|
| Claude (Anthropic) | Best overall quality | Expensive | High price |
| GPT (OpenAI) | Well balanced across the board | Moderate | No longer the top performer |
| Gemini (Google) | Excellent free model | Free | Limited request rate on the free tier |
| Deepseek | Extremely capable | Very affordable | Slow and often congested |
| Groq | Fast, with open-source models | Free tier | Limited to 30 RPM and 6,000 TPM |
| Model | Context length | Input (per 1K tokens) | Output (per 1K tokens) | Total cost (100K input + 100K output tokens) |
|---|---|---|---|---|
| Claude 3.7 Sonnet | 200K | $0.003 | $0.015 | $1.80 |
| GPT-4o | 128K | $0.005 | $0.015 | $2.00 |
| o3-mini | 128K | $0.001 | $0.004 | $0.50 |
| Gemini 2.0 Flash | 1M | Free | Free | Free |
| Gemini 1.5 Pro | 1M/2M | $0.0012 | $0.005 | $0.62 |
| DeepSeek-R1 | 64K | $0.00055 | $0.00219 | $0.27 |
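The totals in the last column can be reproduced from the per-token rates; a quick check in R, assuming the total covers 100K input plus 100K output tokens (which matches the listed figures). `total_cost()` is a throwaway helper defined here, not a library function:

```r
# Rates are per 1K tokens, so 100K tokens = 100 x the per-1K rate
total_cost <- function(input_rate, output_rate, k_blocks = 100) {
  k_blocks * input_rate + k_blocks * output_rate
}

total_cost(0.003, 0.015)      # Claude 3.7 Sonnet: 1.80
total_cost(0.005, 0.015)      # GPT-4o: 2.00
total_cost(0.00055, 0.00219)  # DeepSeek-R1: 0.274, i.e. ~$0.27
```

Since output tokens are billed several times more than input tokens, long generations dominate the bill even when prompts are short.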
Put your API keys in a .Renviron file:
Restart R and check that your keys are loaded correctly:
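For reference, a .Renviron file contains one key per line; the variable names below are the defaults that ellmer looks up (the values are placeholders, replace them with your own keys):

```
GROQ_API_KEY=your_groq_key_here
OPENAI_API_KEY=your_openai_key_here
```

After restarting R, `Sys.getenv("GROQ_API_KEY")` should return your key rather than an empty string.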
library(ellmer)

countries <- c("North Korea", "Tuvalu", "Guinea-Bissau")

groq <- ellmer::chat_groq(
  system_prompt = "Your role is to answer users' questions",
  model = "llama-3.3-70b-versatile",
  echo = "none"
)

for (i in seq_along(countries)) {
  response <- groq$chat(paste("What is the capital of", countries[i], "?"))
  print(response)
  Sys.sleep(2)  # pause between requests to respect rate limits
}
library(dplyr)
library(ellmer)
df <- data.frame(
  restaurant = "La ligne rouge",
  text = c(
    "Super good kebab! The portions are generous, the prices are really reasonable, and the quality is there. Tasty meat, fresh bread, and everything is well seasoned. An excellent address for a meal that is good without breaking the bank. I recommend!",
    "Nothing exceptional, just edible. I had good feedback about the food and I was very, very disappointed. Not to mention cash only which for me is unacceptable. Too many good restaurants in the neighborhood, I won't go back there",
    "Food is good and price is ok. The only issu is the attitude of the staff. The lady at he cash register and the guy that takes the orders seriously lack client service skills. Both are very rude. Hygiene is another issue, there are flies all over the place. In addition to all this, they only take cash."
  ),
  stringsAsFactors = FALSE
) %>%
  dplyr::mutate(id = dplyr::row_number())
glimpse(df)
system_prompt <- "Your role is to analyze the sentiment of restaurant reviews and classify them according to specific categories"
[1] "Super good kebab! The portions are generous, the prices are really reasonable, and the quality is there. Tasty meat, fresh bread, and everything is well seasoned. An excellent address for a meal that is good without breaking the bank. I recommend!"
[1] "Nothing exceptional, just edible. I had good feedback about the food and I was very, very disappointed. Not to mention cash only which for me is unacceptable. Too many good restaurants in the neighborhood, I won't go back there"
[1] "Food is good and price is ok. The only issu is the attitude of the staff. The lady at he cash register and the guy that takes the orders seriously lack client service skills. Both are very rude. Hygiene is another issue, there are flies all over the place. In addition to all this, they only take cash."
for (i in seq_len(nrow(df))) {
  print("hi, here is a new iteration! i is currently")
  print(i)
  print("Thanks. This is the end of this iteration.")
}
[1] "hi, here is a new iteration! i is currently"
[1] 1
[1] "Thanks. This is the end of this iteration."
[1] "hi, here is a new iteration! i is currently"
[1] 2
[1] "Thanks. This is the end of this iteration."
[1] "hi, here is a new iteration! i is currently"
[1] 3
[1] "Thanks. This is the end of this iteration."
[1] "Food is good and price is ok. The only issu is the attitude of the staff. The lady at he cash register and the guy that takes the orders seriously lack client service skills. Both are very rude. Hygiene is another issue, there are flies all over the place. In addition to all this, they only take cash."
df$sentiment <- NA_real_  # initialize the column as numeric

for (i in seq_len(nrow(df))) {
  prompt <- paste0(
    "Analyze the sentiment of this restaurant review on a scale from -1 to 1, where:\n",
    "- -1 represents very negative sentiment\n",
    "- 0 represents neutral sentiment\n",
    "- 1 represents very positive sentiment\n\n",
    "Reply ONLY with a decimal number between -1 and 1, with no explanatory text, comments, or justification.\n\n",
    "Review: ", df$text[i]
  )
  response <- groq$chat(prompt)
  df$sentiment[i] <- as.numeric(trimws(response))  # convert the text reply to a number
  Sys.sleep(2)  # pause between requests to respect rate limits
}
paste0() and paste(): concatenate elements into a single character string (paste0() with no separator, paste() with a space by default).
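A minimal illustration of the difference between the two functions:

```r
# paste0() joins with no separator
paste0("Review: ", "Great food")  # "Review: Great food"
paste0("a", "b", "c")             # "abc"

# paste() inserts a space by default; sep= changes the separator
paste("What is the capital of", "Tuvalu", "?")  # "What is the capital of Tuvalu ?"
paste("a", "b", sep = "-")                      # "a-b"

# collapse= flattens a vector into one string
paste(c("food quality", "price"), collapse = ", ")  # "food quality, price"
```

This is why the prompts above are built with paste0(): the fixed instruction text already contains its own spacing and newlines.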
Super good kebab! The portions are generous, the prices are really reasonable, and the quality is there. Tasty meat, fresh bread, and everything is well seasoned. An excellent address for a meal that is good without breaking the bank. I recommend!
Nothing exceptional, just edible. I had good feedback about the food and I was very, very disappointed. Not to mention cash only which for me is unacceptable. Too many good restaurants in the neighborhood, I won’t go back there
Food is good and price is ok. The only issu is the attitude of the staff. The lady at he cash register and the guy that takes the orders seriously lack client service skills. Both are very rude. Hygiene is another issue, there are flies all over the place. In addition to all this, they only take cash.
| review_id | students | lsd | llm |
|---|---|---|---|
| 1 | See table | 1 | 0.9 |
| 2 | See table | -0.2 | -0.8 |
| 3 | See table | 0.0 | -0.7 |
prompt <- paste0(
  "Analyze this restaurant review (which may be in either English or French) and extract the following information in JSON format:\n\n",
  "1. LANGUAGE: Identify whether the review is in English or French\n",
  "2. TOPICS: List only the most relevant topics mentioned from these categories: food quality, service, ambiance, cleanliness, price, portion size, wait time, menu variety, accessibility, parking, other\n",
  "3. SENTIMENT: Rate the overall sentiment from -1 (very negative) to 1 (very positive)\n",
  "4. RECOMMENDATIONS: Extract specific suggestions for improvement\n",
  "5. STRENGTHS: Identify what the restaurant is doing well\n",
  "6. WEAKNESSES: Identify specific areas where the restaurant is underperforming\n\n",
  "IMPORTANT: Regardless of the review's language, ALWAYS provide your analysis in English.\n\n",
  "Response must be ONLY valid JSON with no additional text. Use this exact format:\n",
  "{\n",
  "  \"language\": \"english OR french\",\n",
  "  \"topics\": [\"example_topic1\", \"example_topic2\"],\n",
  "  \"sentiment\": 0.5,\n",
  "  \"recommendations\": [\"Example improvement suggestion 1\", \"Example suggestion 2\"],\n",
  "  \"strengths\": [\"Example strength 1\", \"Example strength 2\"],\n",
  "  \"weaknesses\": [\"Example weakness 1\", \"Example weakness 2\"]\n",
  "}\n\n",
  "If a category has no relevant information, use an empty array [].\n",
  "For sentiment, use only one decimal place of precision.\n\n",
  "Review: ", donnees$review_text[i]  # append the review text to analyze
)
response_parsed <- tryCatch({
  jsonlite::fromJSON(response)
}, error = function(e) {
  # On a parsing error, emit a warning and return NULL
  warning("JSON parsing error for review ", i, ": ", e$message)
  NULL
})
tryCatch() lets you decide what happens when the program fails, and keep running despite the errors.
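The pattern can be seen in a minimal, self-contained sketch, using stop() to simulate a parsing failure instead of a real API response:

```r
# The error handler returns NULL instead of halting the whole loop
result <- tryCatch({
  stop("simulated JSON parsing failure")
}, error = function(e) {
  NULL
})

is.null(result)  # TRUE: execution continues despite the error
```

In the loop above, this is what lets one malformed LLM reply produce an NA row rather than abort the remaining iterations.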
if (!is.null(response_parsed)) {
  # Store the scalar values
  donnees$language[i] <- response_parsed$language
  donnees$sentiment[i] <- response_parsed$sentiment

  # Collapse the lists into character strings for storage
  # (handling the cases where the information is missing)
  donnees$topics[i] <- if (length(response_parsed$topics) > 0) {
    paste(response_parsed$topics, collapse = ", ")
  } else {
    NA
  }
  donnees$recommendations[i] <- if (length(response_parsed$recommendations) > 0) {
    paste(response_parsed$recommendations, collapse = ", ")
  } else {
    NA
  }
  donnees$strengths[i] <- if (length(response_parsed$strengths) > 0) {
    paste(response_parsed$strengths, collapse = ", ")
  } else {
    NA
  }
  donnees$weaknesses[i] <- if (length(response_parsed$weaknesses) > 0) {
    paste(response_parsed$weaknesses, collapse = ", ")
  } else {
    NA
  }
}
Prompting in English yields improved performance across many cross‐lingual tasks. For example, Jiang et al. (2019) report increased accuracy in knowledge extraction when using English prompts versus diversified manual prompts. Lin et al. (2021) observe that, for natural language understanding and machine translation across 30 languages, English prompts outperform multilingual configurations and even surpass baselines such as GPT‐3. Similarly, Muennighoff et al. (2023), Nie et al. (2024), Tu et al. (2022), and Zhang et al. (2023) report notable gains measured by accuracy, F1 scores, and BLEU when models are prompted in English.
# Role and Context
You are an expert social data analyst with advanced training in qualitative research.
# Task and Objective
Analyze the following interview excerpt and identify the main sociological themes present.
# Required Output Format
Present your findings as a list of 3-5 themes, each with a brief explanation and a relevant supporting quote.
# Data or Content to Analyze
INTERVIEW:
[interview text]
# Additional Instructions
Focus on sociological aspects rather than psychological ones. Identify patterns related to social structures, group dynamics, and cultural factors.
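A template like this can be assembled in R with paste0() before being sent to a model; the interview text below is a hypothetical placeholder, not real data:

```r
# Hypothetical placeholder standing in for a real interview excerpt
interview_text <- "Well, in my neighborhood, people used to help each other more..."

prompt <- paste0(
  "# Role and Context\n",
  "You are an expert social data analyst with advanced training in qualitative research.\n\n",
  "# Task and Objective\n",
  "Analyze the following interview excerpt and identify the main sociological themes present.\n\n",
  "# Required Output Format\n",
  "Present your findings as a list of 3-5 themes, each with a brief explanation and a relevant supporting quote.\n\n",
  "# Data or Content to Analyze\n",
  "INTERVIEW:\n", interview_text, "\n\n",
  "# Additional Instructions\n",
  "Focus on sociological aspects rather than psychological ones."
)

cat(prompt)  # inspect the assembled prompt before sending it
```

Keeping the template in one place like this makes it easy to swap in each interview inside a loop, as with the restaurant reviews above.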
# Use devtools to install the clellm package from GitHub
devtools::install_github("clessn/clellm")

# Use the install_ollama() function to install Ollama (Linux only)
clellm::install_ollama()

# Use the ollama_install_model() function to install models
clellm::ollama_install_model("model_name")

# Use the ollama_prompt() function to prompt any model you want
prompt <- clellm::ollama_prompt("prompt", model = "model_name")
[1] "In this survey question, respondents had to name their most important issue. Please read the answer and determine to which of the following 12 categories it belongs: 'Law and Crime', 'Culture and Nationalism', 'Public Lands and Agriculture', 'Governments and Governance', 'Immigration', 'Rights, Liberties, Minorities, and Discrimination', 'Health and Social Services', 'Economy and Employment', 'Education', 'Environment and Energy', 'International Affairs and Defense', 'Technology'. Use your judgement and only output a single issue category. The answer you need to categorize is: pension reform."
Accuracy (overall share of correct classifications)
Precision (share of predicted positives that are truly positive)
Recall (share of actual positives that are retrieved)
F1 Score (harmonic mean of precision and recall)
| Issue Category | Llama3 | Phi3 | Mistral | GPT-4 | Dict |
|---|---|---|---|---|---|
| Culture and Nationalism | NA | NA | 1 | NA | NA |
| Economy and Employment | 0.9 | 0.87 | NA | 0.94 | 0.21 |
| Education | 0.67 | 0.67 | 1 | 0.67 | NA |
| Environment and Energy | 0.88 | 0.8 | 0.8 | 0.84 | 0.08 |
| Governments and Governance | 0.41 | 0.47 | 0.56 | 0.65 | 0.03 |
| Health and Social Services | 0.94 | 0.83 | 0.91 | 0.96 | 0.34 |
| Immigration | 1 | 1 | 1 | 1 | NA |
| Law and Crime | 1 | 1 | 1 | 1 | NA |
| Rights, Liberties, Minorities, and Discrimination | 0.86 | 0.86 | 0.71 | 0.57 | 0.29 |
| Weighted Mean | 0.81 | 0.77 | 0.5 | 0.86 | 0.19 |
graph TD
  A[100 texts] --> B[LLM predicts: 60 positive]
  A --> C[LLM predicts: 40 negative]
  B --> D[50 true positives]
  B --> E[10 false positives]
  C --> F[35 true negatives]
  C --> G[5 false negatives]
  style A fill:#f9f9f9,stroke:#333,stroke-width:1px
  style B fill:#dae8fc,stroke:#6c8ebf,stroke-width:1px
  style C fill:#d5e8d4,stroke:#82b366,stroke-width:1px
  style D fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px
  style E fill:#f8cecc,stroke:#b85450,stroke-width:1px
  style F fill:#d5e8d4,stroke:#82b366,stroke-width:2px
  style G fill:#f8cecc,stroke:#b85450,stroke-width:1px
Computing the metrics:
Accuracy: (50 + 35) / 100 = 85%
Precision: 50 / 60 = 83.3%
Recall: 50 / (50 + 5) = 90.9%
F1 Score: 2 × (83.3% × 90.9%) / (83.3% + 90.9%) = 87%
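These calculations can be checked directly in R, using the confusion-matrix counts from the diagram above:

```r
# Counts from the example: 50 TP, 10 FP, 35 TN, 5 FN
tp <- 50; fp <- 10; tn <- 35; fn <- 5

accuracy  <- (tp + tn) / (tp + fp + tn + fn)                # 0.85
precision <- tp / (tp + fp)                                 # 0.833...
recall    <- tp / (tp + fn)                                 # 0.909...
f1        <- 2 * precision * recall / (precision + recall)  # ~0.87

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 3)
```

Note that precision divides by the LLM's predictions (60) while recall divides by the true positives in the data (55); the F1 score balances the two.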
Correlation coefficient (r): the closer r is to 1, the better the prediction
Linear regression: p < 0.001 (statistically significant)
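A sketch of this validation for continuous scores, with hypothetical human and LLM sentiment ratings standing in for real annotations:

```r
# Hypothetical paired scores: human annotation vs. LLM prediction
human <- c(1.0, -0.2, 0.0, 0.8, -0.9, 0.5, -0.6, 0.3)
llm   <- c(0.9, -0.8, -0.7, 0.7, -1.0, 0.4, -0.5, 0.2)

# Correlation coefficient: the closer to 1, the better the predictions
r <- cor(human, llm)

# Linear regression: inspect the slope and its p-value
model <- lm(llm ~ human)
summary(model)
```

With real data you would use many more than eight reviews; the p-value reported by summary() is only meaningful with an adequate sample size.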
How an LLM "understands" text
Explained simply