Genshin Impact Gacha Record Dataset (Genshin Impact gacha data)

Overview

Genshin Impact gacha (wish) records are being collected on an ongoing basis.

You can export your gacha history as JSON with a gacha record export tool and send the JSON file to [email protected]; I will strip personal information and then commit the file here. Either of the two export tools below will do.

One gacha record export tool, from sunfkny (with a demo video showing how to use it)

Another gacha record export tool, an Electron version, from lvlvl

The dataset currently contains 195,917 gacha records.

Data Usage Notes

You are free to use this project's data as an individual for research on gacha mechanics, and you are free to modify and redistribute my analysis code (though the code is messy enough that rewriting it from scratch would probably be easier).

However, please do not republish or merge this gacha dataset into other platforms; otherwise, anyone who later combines gacha data from multiple sources may run into serious duplicate-record problems. Please direct anyone who wants the gacha data to download it from GitHub, or clearly credit this project as the data source.

When drawing any conclusions from this dataset, ask yourself whether your methodology is rigorous and whether the conclusions are credible. Do not publish gacha models that are obviously wrong, or incorrect models that could cause harmful effects; if such effects occur, neither the dataset maintainer nor the players who contributed data bear any responsibility.

After some time spent on research, I have essentially worked out all of Genshin Impact's gacha mechanics:

Complete summary of Genshin Impact's gacha mechanics

Some tools for analyzing the gacha mechanics

Data Format

Subfolders in the dataset_02 folder are numbered sequentially starting from 0001.

Each folder contains the gacha records of a single account.

  • gacha100.csv — records from the Beginners' Wish (novice) banner
  • gacha200.csv — records from the Standard Wish (permanent) banner
  • gacha301.csv — records from the Character Event Wish banner
  • gacha302.csv — records from the Weapon Event Wish banner

The records in each CSV file have the following format:

Pull time            | Name           | Category           | Rarity
YYYY-MM-DD HH:MM:SS  | full item name | Character / Weapon | 3 / 4 / 5
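
As an illustration only, the snippet below shows one way such a file could be loaded with pandas. The English column names, and the assumption that the first row is the header shown above, are mine and not part of the dataset specification.

```python
# Minimal loading sketch (assumes a header row as in the table above;
# adjust header=/names= if the files in dataset_02 differ).
import pandas as pd

COLUMNS = ["time", "name", "category", "rarity"]  # illustrative English names

def load_banner(path: str) -> pd.DataFrame:
    # Assumes rows are already in pull order; re-sort if your export differs.
    df = pd.read_csv(path, header=0, names=COLUMNS)
    df["time"] = pd.to_datetime(df["time"], format="%Y-%m-%d %H:%M:%S")
    df["rarity"] = df["rarity"].astype(int)
    return df.reset_index(drop=True)

# e.g. records = load_banner("dataset_02/0001/gacha301.csv")
```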

Recommended Data Processing

When estimating the consolidated probability, use an unbiased estimator

That is, use the total number of items of the rarity under study, divided by the number of pulls up to (and including) the last time an item of that rarity was obtained, as the estimator.

Do not use total item count divided by total pull count: for a gacha with a pity system like Genshin Impact's, that is not an unbiased estimator of the officially published consolidated probability and will bias the estimate low.

For example, if every account in the dataset pulled only 10 times on the standard banner, then with enough data the measured 5-star frequency would come out around 0.6%, not 1.6%. When counting 5-stars, take the number of pulls up to the last 5-star obtained as the total pull count; the same rule applies to 4-stars.
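
A minimal sketch of this estimator, assuming a pull-ordered DataFrame with a `rarity` column as in the loading example above (the function names are illustrative). When pooling accounts, sum the hit counts and the truncated pull counts separately before dividing, rather than averaging per-account rates.

```python
import pandas as pd

def truncated_counts(df: pd.DataFrame, rarity: int = 5) -> tuple[int, int]:
    """Return (number of items of `rarity`, pulls up to and including the
    last such item) for one account; (0, 0) if none were obtained."""
    hits = df.index[df["rarity"] == rarity]
    if len(hits) == 0:
        return 0, 0
    return len(hits), int(hits[-1]) + 1  # index is the 0-based pull order

def pooled_rate(accounts: list[pd.DataFrame], rarity: int = 5) -> float:
    """Unbiased pooled estimate: total hits / total truncated pulls."""
    pairs = [truncated_counts(df, rarity) for df in accounts]
    total_hits = sum(h for h, _ in pairs)
    total_pulls = sum(p for _, p in pairs)
    return total_hits / total_pulls if total_pulls else float("nan")
```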

For each account, drop the first few 5-stars/4-stars it obtained

When the data were collected, contributors were asked to state whether they had rerolled for a starting account with an early 5-star, etc.; the intention was to remove bias introduced by player behavior.

It later turned out that many contributors did not provide this label. Moreover, even without rerolling, players who happen to pull a 5-star early on are more likely to stay and keep playing, which also introduces bias.

For players who have already played for a while and collected a certain number of 5-stars, whether they pull another 5-star has much less influence on whether they continue playing.

Therefore, dropping the first N 5-stars pulled by each account, with N chosen as appropriate, yields data with lower bias.

The same idea can also be applied to the 4-star statistics.
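
One possible implementation of this truncation, under the same assumed DataFrame layout (the helper name and N are placeholders): discard everything up to and including the N-th item of the given rarity, so the remaining record starts with a fresh pity counter.

```python
import pandas as pd

def drop_first_n(df: pd.DataFrame, rarity: int = 5, n: int = 1) -> pd.DataFrame:
    """Remove all pulls up to and including the n-th item of `rarity`."""
    hits = df.index[df["rarity"] == rarity]
    if len(hits) < n:
        return df.iloc[0:0]  # too few hits: this account contributes nothing
    return df.iloc[int(hits[n - 1]) + 1:].reset_index(drop=True)
```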

When studying the 4-star probability in detail, discard data with too few total pulls

When the total pull count is small, cases such as going nine pulls without a 4-star and then getting a 5-star on the tenth pull almost never show up, which makes the observed 4-star rate come out too high.

Using data from accounts with more pulls allows a more precise study of the 4-star probability.
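
For example, one might simply filter out small accounts before running the 4-star analysis; the threshold below is an arbitrary placeholder, not a recommendation from the dataset.

```python
MIN_PULLS = 500  # placeholder threshold; choose according to your analysis

large_accounts = [df for df in accounts if len(df) >= MIN_PULLS]
```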

Handle the weapon banner with caution

The weapon banner has comparatively little data, so be cautious with any conclusions drawn from it. If someone draws hasty conclusions that cause serious consequences, the responsibility lies with the person who drew them.

Analysis Tools

DataAnalysis.py analyzes the CSV gacha files. The code is still being rewritten and is very awkward to use, so treat it as a reference only. When run, it prints reference statistics and plots distribution charts; the theoretical curves in the charts come from a probability-ramp model I built from the actual data and from inference over some game files.

DistributionMatrix.py analyzes the pull probabilities and distributions of a candidate model with the 4-star and 5-star pity coupled; it is the heavy-duty tool for computing a gacha model's consolidated probability and expectation.
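
For intuition about the kind of quantity such a tool computes, here is an independent minimal sketch (not the project's DistributionMatrix.py, and ignoring the 4-star/5-star coupling): for an assumed 5-star pity model it evaluates the distribution of pulls needed for a 5-star, its expectation, and the implied consolidated probability. The base rate, soft-pity start, ramp step, and hard pity are illustrative parameters to be replaced by whatever model you are testing.

```python
def five_star_distribution(base=0.006, soft_pity=74, step=0.06, hard_pity=90):
    """P(first 5-star exactly on pull k), k = 1..hard_pity, for a simple ramp
    model: flat `base` rate before `soft_pity`, then +`step` per pull,
    guaranteed at `hard_pity`. All parameter values are assumptions."""
    dist, survive = [], 1.0  # survive = P(no 5-star in the first k-1 pulls)
    for k in range(1, hard_pity + 1):
        p = 1.0 if k == hard_pity else min(1.0, base + max(0, k - soft_pity + 1) * step)
        dist.append(survive * p)
        survive *= 1.0 - p
    return dist

dist = five_star_distribution()
expectation = sum(k * p for k, p in enumerate(dist, start=1))
consolidated = 1.0 / expectation  # long-run 5-stars per pull (renewal rate)
print(f"E[pulls per 5-star] = {expectation:.2f}, consolidated rate = {consolidated:.3%}")
```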
