Download the Stack Overflow Archives
- https://archive.org/details/stackexchange
- download the following archives (only Posts, Users and Tags are used in the steps below):
- stackoverflow.com-Badges.7z
- stackoverflow.com-Comments.7z
- stackoverflow.com-PostHistory.7z
- stackoverflow.com-PostLinks.7z
- stackoverflow.com-Posts.7z
- stackoverflow.com-Tags.7z
- stackoverflow.com-Users.7z
- stackoverflow.com-Votes.7z
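The file list above can also be turned into direct download links. This is a minimal sketch, assuming archive.org's usual /download/<item>/<file> URL layout applies to the stackexchange item (an assumption, not something stated on the page above). The helper name archive_urls is made up for illustration.

```python
# Build direct download URLs for the Stack Overflow archives listed above.
# Assumption: archive.org serves an item's files under /download/<item>/<file>.
BASE_URL = "https://archive.org/download/stackexchange"

TABLES = ["Badges", "Comments", "PostHistory", "PostLinks",
          "Posts", "Tags", "Users", "Votes"]


def archive_urls(tables=TABLES, base_url=BASE_URL):
    """Return one download URL per table name (hypothetical helper)."""
    return [f"{base_url}/stackoverflow.com-{name}.7z" for name in tables]


if __name__ == "__main__":
    for url in archive_urls():
        print(url)
```

Passing a shorter list, for example archive_urls(["Posts", "Users", "Tags"]), limits the output to the archives that are actually extracted later.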
Install p7zip
brew install p7zip
Extract the Posts, Users and Tags Archives
- 7za -y -oextracted x *Users.7z
- 7za -y -oextracted x *Tags.7z
- 7za -y -oextracted x *Posts.7z
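The three extraction commands can also be driven from a script. This is a sketch only: it spells out the full archive names instead of the *Users.7z style globs, because no shell is involved to expand them, and it assumes 7za is on the PATH. The function names are hypothetical.

```python
import subprocess


def extract_commands(tables=("Users", "Tags", "Posts"), out_dir="extracted"):
    """Return one 7za argument list per archive, mirroring the commands above."""
    return [
        ["7za", "-y", f"-o{out_dir}", "x", f"stackoverflow.com-{name}.7z"]
        for name in tables
    ]


def extract_all(dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for cmd in extract_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if 7za exits with an error
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    extract_all(dry_run=True)
```

The dry run only prints the commands, which is a cheap way to confirm the names before starting the long extraction of Posts.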
Review the Extracted XML files
- Users.xml – 3.53 GB
- Tags.xml – 5 MB
- Posts.xml – 74 GB
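Files of this size are too large to open in an editor, so a streaming parser is one way to review them. This is a minimal sketch, assuming the usual Stack Exchange dump layout of a single root element containing many <row .../> elements. The function name count_rows and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def count_rows(source, sample_size=1):
    """Stream an XML dump; return (row count, attributes of the first rows)."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # first event is the start of the root element
    count = 0
    samples = []
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            count += 1
            if len(samples) < sample_size:
                samples.append(dict(elem.attrib))
            root.clear()  # release rows already seen so memory use stays flat
    return count, samples


if __name__ == "__main__":
    # Tiny made-up sample; a real call would pass a path such as extracted/Tags.xml.
    sample = io.BytesIO(b'<tags><row Id="1" TagName="java" Count="10" /></tags>')
    print(count_rows(sample))
```

Starting with Tags.xml gives a quick result. The same call works on Posts.xml but will take a long time to finish.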
Clone stackoverflow-neo4j
git clone https://github.com/mdamien/stackoverflow-neo4j
Install Python
- brew install python3
- sudo pip3 install xmltodict
Extract the XML files as CSV
python3 to_csv.py extracted
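This generates the following files in the csvs directory:
- posts.csv
- users.csv
- tags.csv
- posts_rel.csv
- tags_posts_rel.csv
- users_posts_rel.csv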
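Before importing, it can help to confirm that the conversion produced everything the import step needs. The sketch below checks for the six CSV files named in the neo4j-admin import command later in this guide. The helper name missing_csvs and the default directory csvs are assumptions for illustration.

```python
import os

# File names taken from the neo4j-admin import command used later on.
EXPECTED_CSVS = [
    "posts.csv", "users.csv", "tags.csv",
    "posts_rel.csv", "tags_posts_rel.csv", "users_posts_rel.csv",
]


def missing_csvs(data_dir, expected=EXPECTED_CSVS):
    """Return the expected CSV file names that are not present in data_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_dir, name))]


if __name__ == "__main__":
    missing = missing_csvs("csvs")
    print("missing:", missing if missing else "none")
```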

Import the Data into Neo4j
- Open Neo4j Desktop
- Create a StackOverflow Project
- Create a Local Graph
- click Manage
- click Terminal
- set a variable DATA to the path of the generated CSV files
- export DATA=/Development/Neo4j/StackOverflow/stackoverflow-neo4j/csvs
- import the CSV files into a new stackoverflow graph database
./bin/neo4j-admin import \
--mode=csv \
--database=stackoverflow.db \
--id-type string \
--ignore-missing-nodes=true \
--nodes:Post $DATA/posts.csv \
--nodes:User $DATA/users.csv \
--nodes:Tag $DATA/tags.csv \
--relationships:PARENT_OF $DATA/posts_rel.csv \
--relationships:HAS_TAG $DATA/tags_posts_rel.csv \
--relationships:POSTED $DATA/users_posts_rel.csv
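The import command maps each node label and relationship type to one CSV file. The sketch below rebuilds the same argument list from two small tables, which makes that mapping easier to see and to change. The flags are copied from the command above; the function name import_command is hypothetical.

```python
def import_command(data_dir, database="stackoverflow.db"):
    """Return the neo4j-admin import arguments shown above as a list."""
    nodes = {"Post": "posts.csv", "User": "users.csv", "Tag": "tags.csv"}
    relationships = {
        "PARENT_OF": "posts_rel.csv",
        "HAS_TAG": "tags_posts_rel.csv",
        "POSTED": "users_posts_rel.csv",
    }
    cmd = [
        "./bin/neo4j-admin", "import",
        "--mode=csv",
        f"--database={database}",
        "--id-type", "string",
        "--ignore-missing-nodes=true",
    ]
    for label, filename in nodes.items():
        cmd += [f"--nodes:{label}", f"{data_dir}/{filename}"]
    for rel_type, filename in relationships.items():
        cmd += [f"--relationships:{rel_type}", f"{data_dir}/{filename}"]
    return cmd


if __name__ == "__main__":
    # Prints the command only; it does not run the import.
    print(" \\\n".join(import_command("$DATA")))
```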

Edit Settings
- uncomment #dbms.active_database=graph.db
- change the value graph.db to stackoverflow.db, so the line reads dbms.active_database=stackoverflow.db

- click Apply
- Restart the database
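The settings change amounts to a one-line text edit. The sketch below shows that transformation on the configuration text itself. It does not touch Neo4j Desktop, and the function name set_active_database is made up for illustration.

```python
import re

# Matches the setting whether or not it is commented out with a leading '#'.
_ACTIVE_DB = re.compile(
    r"^[ \t]*#?[ \t]*dbms\.active_database[ \t]*=.*$", re.MULTILINE
)


def set_active_database(conf_text, name="stackoverflow.db"):
    """Uncomment dbms.active_database if needed and point it at `name`."""
    new_line = f"dbms.active_database={name}"
    if _ACTIVE_DB.search(conf_text):
        # Replace only the first occurrence; a function avoids escape handling.
        return _ACTIVE_DB.sub(lambda _m: new_line, conf_text, count=1)
    # Setting not present at all: append it on its own line.
    sep = "" if (not conf_text or conf_text.endswith("\n")) else "\n"
    return conf_text + sep + new_line + "\n"


if __name__ == "__main__":
    print(set_active_database("#dbms.active_database=graph.db\n"), end="")
```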


Open the Browser
- click Open Browser
- count all nodes
match (n)
return count(n);
- count the nodes per label
match (n) return head(labels(n)) as label, count(*);

Create Indexes and Constraints
create index on :Post(title);
create index on :Post(createdAt);
create index on :Post(score);
create index on :Post(views);
create index on :Post(favorites);
create index on :Post(answers);
create index on :User(name);
create index on :User(createdAt);
create index on :User(reputation);
create index on :User(age);
create index on :Tag(count);
create constraint on (t:Tag) assert t.tagId is unique;
create constraint on (u:User) assert u.userId is unique;
create constraint on (p:Post) assert p.postId is unique;
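The statements above follow two fixed patterns, so they can be generated from a table of labels and properties. This is a sketch only, using the same create index on / create constraint on syntax as above and listing each property once. The names schema_statements, INDEXED_PROPERTIES and UNIQUE_KEYS are made up for illustration.

```python
INDEXED_PROPERTIES = {
    "Post": ["title", "createdAt", "score", "views", "favorites", "answers"],
    "User": ["name", "createdAt", "reputation", "age"],
    "Tag": ["count"],
}

UNIQUE_KEYS = {"Tag": "tagId", "User": "userId", "Post": "postId"}


def schema_statements(indexes=INDEXED_PROPERTIES, unique_keys=UNIQUE_KEYS):
    """Return the index and constraint statements as a list of strings."""
    statements = []
    for label, props in indexes.items():
        # dict.fromkeys drops repeated property names but keeps their order
        for prop in dict.fromkeys(props):
            statements.append(f"create index on :{label}({prop});")
    for label, key in unique_keys.items():
        var = label[0].lower()  # t, u, p as in the statements above
        statements.append(
            f"create constraint on ({var}:{label}) assert {var}.{key} is unique;"
        )
    return statements


if __name__ == "__main__":
    print("\n".join(schema_statements()))
```

The printed statements can be pasted into the Neo4j Browser one at a time.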
Check Indexes
:schema

Java Questions with Answers
match (tag:Tag{tagId:"java"})<-[r1:HAS_TAG]-(question)-[r2:PARENT_OF]-(answer)
return tag,r1,question,r2,answer
limit 20

match (tag:Tag{tagId:"java"})<-[r1:HAS_TAG]-(question:Post)-[r2:PARENT_OF]-(answer)
return question.title, question.body, answer.score, answer.body
limit 20
